Official Implementation of Web-based Visual Corpus Builder (WEBVICOB)
WEBVICOB 🕸, Web-based Visual Corpus Builder, is a dataset generator that can readily construct a large-scale visual corpus (i.e., images with text annotations) from a raw Wikipedia HTML dump. The constructed visual corpora can be utilized in building Visual Document Understanding (VDU) backbones. Our academic paper, which describes our engine in detail and provides full experimental results and analyses, can be found here:
On Web-based Visual Corpus Construction for Visual Document Understanding.
Donghyun Kim, Teakgyu Hong, Moonbin Yim, Yoonsik Kim and Geewook Kim. In ICDAR 2023.
2023-05-03 Our paper was accepted at ICDAR 2023. A new version of the paper has been published on arXiv.
2023-02-11 Added the HTML section chunker and fixed a memory-leak issue.
2022-11-08 Paper published on arXiv.
2022-11-04 First commit. We released the codebase.
python >= 3.8
We use Google Fonts to vary the fonts of the rendered text. (font/google)
Initialize the submodule if you want to use them.
$ git submodule update --init --recursive
$ bash install_dependencies.sh
You can download various versions of ChromeDriver from here. Note that the ChromeDriver version must match the version of Chrome installed on your system.
$ google-chrome --version
Google Chrome 106.0.5249.103
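The version check above can also be scripted. The following is a minimal sketch, not part of WEBVICOB: the helper names `major_version` and `versions_match` are hypothetical, and comparing only the major version is an assumption (a stricter comparison may be needed for some releases). It relies on both binaries printing a four-part version number when called with `--version`.

```python
import re
import subprocess


def major_version(version_output: str) -> int:
    """Extract the major version from text such as 'Google Chrome 106.0.5249.103'."""
    match = re.search(r"(\d+)\.\d+\.\d+\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in: {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_cmd: str, driver_path: str) -> bool:
    """Run both binaries with --version and compare their major versions.

    Assumption: a matching major version is treated as compatible.
    """
    chrome_out = subprocess.run(
        [chrome_cmd, "--version"], capture_output=True, text=True, check=True
    ).stdout
    driver_out = subprocess.run(
        [driver_path, "--version"], capture_output=True, text=True, check=True
    ).stdout
    return major_version(chrome_out) == major_version(driver_out)
```

For example, `versions_match("google-chrome", "/path/to/your/chrome/driver")` would return `True` only when both report the same major version.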
$ pip install -U six wheel setuptools
$ pip install -r requirements.txt
Run the following script first.
To visualize the outputs, use the "debug" option.
$ PYTHONPATH=$PWD python webvicob/wikipedia/wikipedia.py \
--chrome_path=/path/to/your/chrome/driver \
--workspace=./resources/workspace_example \
--target_lang=en \
--num_train=10 \
--num_val=1 \
--num_test=1 \
--debug=True
option | default | desc |
---|---|---|
workspace (str) | ./ | Dir to load json files and save lmdb. |
chrome_path (str) | resources/chromedriver | Path to your Chrome driver. |
target_lang (str) | ja | Language code of the target Wikipedia dump (e.g., en, ja). |
num_train (int) | -1 | Number of train samples. |
num_val (int) | 0 | Number of val samples. |
num_test (int) | 0 | Number of test samples. |
debug (bool) | False | Debug option. |
num_process (int) | -1 | Number of processes. -1 ==> os.cpu_count() value is used. |
shrink_heuristic (bool) | True | Use heuristic shrinking of character boxes. |
remove_background (bool) | True | Remove background img of html. |
unroll_contents (bool) | False | Unroll html contents. |
change_para_font (bool) | True | Change paragraph fonts with google-fonts. |
sleep_time (int) | 1 | Sleep time for every render. |
capture_widths (tuple[int]) | (800, 1200, 1600) | Randomly select capture width. This is different from final_width. This option determines the width of the browser when rendering. final_width is an option to resize the finally rendered image and annotations. |
capture_height_limit (int) | 16384 | Skip the rendering process if rendered page's height is larger than the limit value. |
final_width (int) | None | Final save img width size. (Useful when you do not have a lot of storage) |
chunk_idx (int) | None | Chunk index of json_list. Useful when you have multiple computers. |
total_chunk (int) | None | Total number of chunks of json_list. |
html_section_chunker (bool) | True | Chunk HTML by section. This option is very useful when an HTML page has a lot of content. The experiments in the paper did not use the chunk option. |
font_dir_path (str) | font_dir_path | Font directory path |
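To illustrate how chunk_idx and total_chunk could divide the work across several computers, here is a minimal sketch. It is not the actual WEBVICOB implementation: the function name `select_chunk`, the zero-based index, and the contiguous, near-equal split are all assumptions made for illustration.

```python
from typing import List, Optional, Sequence


def select_chunk(
    items: Sequence[str], chunk_idx: Optional[int], total_chunk: Optional[int]
) -> List[str]:
    """Return the contiguous slice of items assigned to chunk_idx (zero-based, assumed)."""
    if chunk_idx is None or total_chunk is None:
        return list(items)  # no chunking requested: process everything
    if not 0 <= chunk_idx < total_chunk:
        raise ValueError("chunk_idx must satisfy 0 <= chunk_idx < total_chunk")
    # Spread the remainder over the first chunks so sizes differ by at most one.
    size, remainder = divmod(len(items), total_chunk)
    start = chunk_idx * size + min(chunk_idx, remainder)
    end = start + size + (1 if chunk_idx < remainder else 0)
    return list(items[start:end])
```

Under these assumptions, each machine would call `select_chunk(json_list, chunk_idx, total_chunk)` with the same total_chunk and a different chunk_idx, so that every file is processed exactly once.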
We provide sample ndjson files in resources/workspace_example.
Each sample ndjson file contains 100 samples.
If you want to use the whole crawled data, download the ndjson files ([lang]wiki-NS0-[version]-ENTERPRISE-HTML.json.tar.gz) from https://dumps.wikimedia.org/other/enterprise_html/runs and untar them into [your workspace path]/raw.
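An ndjson file stores one JSON object per line. To inspect the sample files before running the builder, a small reader along the following lines may help. This is a sketch only: `iter_ndjson` is a hypothetical helper, and no assumption is made here about which fields each record contains.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_ndjson(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines, e.g. a trailing newline at end of file
            yield json.loads(line)
```

A file object can be passed directly, for example `with open(path, encoding="utf-8") as f: records = list(iter_ndjson(f))`, where `path` points to one of the ndjson files under the raw directory.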
character | word | line | paragraph | image |
---|---|---|---|---|
If you find this work useful to you, please cite:
@InProceedings{kim2023web,
title = {On Web-based Visual Corpus Construction for Visual Document Understanding},
author = {Kim, Donghyun and Hong, Teakgyu and Yim, Moonbin and Kim, Yoonsik and Kim, Geewook},
booktitle = {Document Analysis and Recognition - ICDAR 2023},
year = {2023},
}
Please use pre-commit, which runs Black and isort.
$ pip install pre-commit
$ pre-commit install
- Open a new issue.
- Match the code style (black, isort).
- Execute the following commands in the webvicob directory.
black .
isort --profile black .
- Write test code.
- Branch ([date]_[whatever]).
- Delete branch after Squash&Merge.
Required approvals: 1
WEBVICOB is licensed under Apache-2.0, except resources/workspace_example/raw, which is adopted from https://dumps.wikimedia.org/other/enterprise_html/ under CC BY-SA 3.0. See LICENSE for the full license text.
WEBVICOB
Copyright 2022-present NAVER Corp.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.