chuyg1005/seeclick-crawler

GUI Grounding Pre-training Data for SeeClick

This project builds the GUI grounding pre-training dataset for SeeClick. It takes URLs from the Common Crawl dataset, crawls the corresponding web pages with Selenium, and extracts web element grounding data used for the continual pre-training of SeeClick.

This is the English introduction of the project.

Chinese README

Project Structure

  • preprocess_cdx.py: Extracts URLs from the Common Crawl dataset and removes duplicates.
  • crawel.py: Crawling logic. Loads web pages with Selenium and extracts grounding data from them.
  • main.py: Entry point of the crawler. Crawls in parallel using a divide-and-conquer strategy.
  • utils.py: Utility code.
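
To make "grounding data" concrete, here is a minimal sketch of how one element's on-screen rectangle could be turned into a training record. It is not taken from crawel.py: the function name `rect_to_grounding_record`, the output keys `instruction` and `bbox`, and the filtering rules are all assumptions made for illustration. The only outside convention it relies on is the `x`, `y`, `width`, `height` layout of Selenium's `WebElement.rect`.

```python
from typing import Dict, Optional


def rect_to_grounding_record(
    text: str,
    rect: Dict[str, float],
    viewport_width: int,
    viewport_height: int,
) -> Optional[dict]:
    """Turn an element's pixel rectangle into a normalized grounding record.

    `rect` uses the keys of Selenium's WebElement.rect: x, y, width, height.
    Returns None for elements with no text, no area, or parts outside the viewport.
    The record layout is hypothetical, not the format used by this project.
    """
    text = text.strip()
    if not text or rect["width"] <= 0 or rect["height"] <= 0:
        return None

    left, top = rect["x"], rect["y"]
    right, bottom = left + rect["width"], top + rect["height"]
    if left < 0 or top < 0 or right > viewport_width or bottom > viewport_height:
        return None

    # Normalize to [0, 1] so the record does not depend on screenshot resolution.
    bbox = [
        round(left / viewport_width, 4),
        round(top / viewport_height, 4),
        round(right / viewport_width, 4),
        round(bottom / viewport_height, 4),
    ]
    return {"instruction": text, "bbox": bbox}
```

With a live Selenium session, `element.text` and `element.rect` could be passed in directly, together with the browser window's viewport size.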

How to Use

  1. Prepare the Common Crawl dataset in advance and unzip it into a directory.
  2. Install the Chrome browser and the matching version of ChromeDriver.
  3. Install the Python dependencies:
     pip install -r requirements.txt
  4. Run preprocess_cdx.py to extract URLs and remove duplicates:
     python preprocess_cdx.py --cdx_file_path /path/to/cdx --unique_cdx_file_path /path/to/unique_cdx
  5. Run main.py to crawl the data:
     python main.py --cdx_file_path /path/to/unique_cdx --out_root /path/to/output --num_workers 20
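
The preprocessing and crawling steps above can be pictured with the rough sketch below. It is an illustration, not the code in preprocess_cdx.py or main.py. It assumes each CDX line has the form `<urlkey> <timestamp> <json>` with a `url` field in the JSON part, which is the usual Common Crawl index layout but may not match your files. The helper names `unique_urls` and `split_for_workers` are hypothetical.

```python
import json
from typing import Iterable, List


def unique_urls(cdx_lines: Iterable[str]) -> List[str]:
    """Collect distinct URLs from CDX index lines, keeping first-seen order."""
    seen = set()
    urls = []
    for line in cdx_lines:
        # Assumed layout: "<urlkey> <timestamp> <json>"; split off the JSON part.
        parts = line.strip().split(" ", 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        url = record.get("url") if isinstance(record, dict) else None
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def split_for_workers(urls: List[str], num_workers: int) -> List[List[str]]:
    """Partition URLs into at most num_workers near-equal contiguous chunks."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    base, extra = divmod(len(urls), num_workers)
    chunks, start = [], 0
    for i in range(num_workers):
        # The first `extra` workers take one additional URL each.
        size = base + (1 if i < extra else 0)
        if size:
            chunks.append(urls[start:start + size])
        start += size
    return chunks
```

Each chunk could then be handed to its own worker, so that one browser instance processes one slice of the URL list. This is one plausible reading of the divide-and-conquer strategy controlled by `--num_workers`.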
