This project builds the GUI grounding pre-training dataset for SeeClick. It uses the Common Crawl dataset as the source of URLs, crawls web pages with Selenium, and extracts web element grounding data for the continual pre-training of SeeClick.
This is the English introduction to the project.
- preprocess_cdx.py: Extract URLs from the Common Crawl dataset and remove duplicates.
- crawel.py: Crawling logic; loads web pages with Selenium and extracts grounding data (see the sketch after this list).
- main.py: Entry point of the web crawler; crawls data in parallel using a divide-and-conquer strategy.
- utils.py: Utility code.
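As a rough illustration of the crawling step, the sketch below loads a page with headless Chrome via Selenium and records the text and bounding box of visible clickable elements. It is not the project's actual implementation: the function name `extract_grounding_data`, the CSS selector, and the record format are assumptions.

```python
# Illustrative sketch only; the selectors and output format are assumptions.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def extract_grounding_data(url):
    options = Options()
    options.add_argument("--headless")              # run Chrome without a window
    options.add_argument("--window-size=1280,720")  # fixed viewport for stable coordinates
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        records = []
        # Interactive elements with visible text make natural grounding targets.
        for elem in driver.find_elements(By.CSS_SELECTOR, "a, button, input"):
            text = elem.text.strip()
            rect = elem.rect  # {'x', 'y', 'width', 'height'} in CSS pixels
            if text and rect["width"] > 0 and rect["height"] > 0:
                records.append({"url": url, "text": text, "bbox": rect})
        return records
    finally:
        driver.quit()
```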
- Prepare the Common Crawl dataset in advance and unzip it to a specific directory.
- Install the Chrome browser and the matching version of ChromeDriver (a quick sanity check is sketched below).
- Install Python dependencies.
pip install -r requirements.txt
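The snippet below is an optional sanity check (not part of the project) that Selenium can drive the installed Chrome/ChromeDriver pair; it opens a page headlessly and prints its title.

```python
# Optional check that Chrome, ChromeDriver, and Selenium work together.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)  # expected output: "Example Domain"
driver.quit()
```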
- Run preprocess_cdx.py to extract URLs and remove duplicates (a rough sketch of this step follows the command).
python preprocess_cdx.py --cdx_file_path /path/to/cdx --unique_cdx_file_path /path/to/unique_cdx
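Conceptually, this step reads the CDX records, pulls out the page URL from each, and keeps only the first occurrence of every URL. The sketch below assumes CDXJ-style records ("<SURT key> <timestamp> <JSON>"); the actual parsing in preprocess_cdx.py may differ.

```python
# Rough sketch of URL extraction and de-duplication from a CDXJ file (assumed layout).
import json


def deduplicate_cdx(cdx_file_path, unique_cdx_file_path):
    seen = set()
    with open(cdx_file_path, "r", encoding="utf-8") as fin, \
         open(unique_cdx_file_path, "w", encoding="utf-8") as fout:
        for line in fin:
            # A CDXJ record is "<SURT key> <timestamp> <json>"; the URL is in the JSON part.
            try:
                url = json.loads(line.split(" ", 2)[2])["url"]
            except (IndexError, KeyError, json.JSONDecodeError):
                continue  # skip malformed records
            if url not in seen:
                seen.add(url)
                fout.write(url + "\n")
```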
- Run main.py to crawl the data (a sketch of the parallel scheme follows the command).
python main.py --cdx_file_path /path/to/unique_cdx --out_root /path/to/output --num_workers 20
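The divide-and-conquer parallelism can be pictured as follows: the de-duplicated URL list is split into `num_workers` chunks and each chunk is crawled in its own process. The sketch below is only an approximation of main.py; the function names and the way results are written under `out_root` are assumptions.

```python
# Illustrative divide-and-conquer parallel crawl (not the actual main.py).
from multiprocessing import Pool


def crawl_chunk(args):
    worker_id, urls, out_root = args
    # Each worker would run the Selenium crawling logic (e.g. from crawel.py)
    # on its own slice of URLs and write the results under out_root.
    print(f"worker {worker_id}: {len(urls)} URLs -> {out_root}")


def run(urls, out_root, num_workers=20):
    chunks = [urls[i::num_workers] for i in range(num_workers)]  # round-robin split
    with Pool(processes=num_workers) as pool:
        pool.map(crawl_chunk, [(i, chunk, out_root) for i, chunk in enumerate(chunks)])
```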