This project builds the GUI grounding pre-training dataset for SeeClick. It uses the Common Crawl dataset as the source of URLs, crawls web pages with Selenium, and extracts web element grounding data for the continual pre-training of SeeClick.
This is the English introduction to the project.
- preprocess_cdx.py: Extract URLs from the Common Crawl dataset and remove duplicates.
- crawel.py: Crawling logic; loads web pages with Selenium and extracts grounding data (see the sketch after this list).
- main.py: Entry point of the web crawler; crawls data in parallel using a divide-and-conquer strategy.
- utils.py: Utility code.
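As a rough illustration of the crawling step, the sketch below loads a page with headless Chrome via Selenium and records the text and bounding box of visible clickable elements. It is not the project's actual implementation: the function name `extract_grounding_data`, the CSS selector, and the record format are assumptions.

```python
# Illustrative sketch only; the selectors and output format are assumptions.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def extract_grounding_data(url):
    options = Options()
    options.add_argument("--headless")              # run Chrome without a window
    options.add_argument("--window-size=1280,720")  # fixed viewport for stable coordinates
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        records = []
        # Interactive elements with visible text make natural grounding targets.
        for elem in driver.find_elements(By.CSS_SELECTOR, "a, button, input"):
            text = elem.text.strip()
            rect = elem.rect  # {'x', 'y', 'width', 'height'} in CSS pixels
            if text and rect["width"] > 0 and rect["height"] > 0:
                records.append({"url": url, "text": text, "bbox": rect})
        return records
    finally:
        driver.quit()
```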
- Prepare the Common Crawl dataset in advance and unzip it to a specific directory.
- Install the Chrome browser and the matching version of ChromeDriver (a quick sanity check is sketched below).
- Install Python dependencies.
pip install -r requirements.txt
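The snippet below is an optional sanity check (not part of the project) that Selenium can drive the installed Chrome/ChromeDriver pair; it opens a page headlessly and prints its title.

```python
# Optional check that Chrome, ChromeDriver, and Selenium work together.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)  # expected output: "Example Domain"
driver.quit()
```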
- Run preprocess_cdx.py to extract URLs and remove duplicates (a rough sketch of this step follows the command).
python preprocess_cdx.py --cdx_file_path /path/to/cdx --unique_cdx_file_path /path/to/unique_cdx
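Conceptually, this step reads the CDX records, pulls out the page URL from each, and keeps only the first occurrence of every URL. The sketch below assumes CDXJ-style records ("<SURT key> <timestamp> <JSON>"); the actual parsing in preprocess_cdx.py may differ.

```python
# Rough sketch of URL extraction and de-duplication from a CDXJ file (assumed layout).
import json


def deduplicate_cdx(cdx_file_path, unique_cdx_file_path):
    seen = set()
    with open(cdx_file_path, "r", encoding="utf-8") as fin, \
         open(unique_cdx_file_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # A CDXJ record is "<SURT key> <timestamp> <json>"; the URL is in the JSON part.
            try:
                url = json.loads(line.split(" ", 2)[2])["url"]
            except (IndexError, KeyError, json.JSONDecodeError):
                continue  # skip malformed records
            if url not in seen:
                seen.add(url)
                fout.write(url + "\n")
```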
- Run main.py to crawl the data (a sketch of the parallel scheme follows the command).
python main.py --cdx_file_path /path/to/unique_cdx --out_root /path/to/output --num_workers 20
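The divide-and-conquer parallelism can be pictured as follows: the de-duplicated URL list is split into `num_workers` chunks and each chunk is crawled in its own process. The sketch below is only an approximation of main.py; the function names and the way results are written under `out_root` are assumptions.

```python
# Illustrative divide-and-conquer parallel crawl (not the actual main.py).
from multiprocessing import Pool


def crawl_chunk(args):
    worker_id, urls, out_root = args
    # Each worker would run the Selenium crawling logic (e.g. from crawel.py)
    # on its own slice of URLs and write the results under out_root.
    print(f"worker {worker_id}: {len(urls)} URLs -> {out_root}")


def run(urls, out_root, num_workers=20):
    chunks = [urls[i::num_workers] for i in range(num_workers)]  # round-robin split
    with Pool(processes=num_workers) as pool:
        pool.map(crawl_chunk, [(i, chunk, out_root) for i, chunk in enumerate(chunks)])
```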