Crawling curated list of sites: Data Sourcing Candidate seeds spreadsheet #299

@yjernite

Description

We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.

This issue tracks potential crawling seeds identified by BigScience Data Sourcing participants, primarily in Spanish and SEA English (and three Chinese sites).

The steps to follow are:

  1. filter the Common Crawl (or another archive) for all WARC records with one of the given domain names
  • filter all dumps from the last two years
  2. obtain overall metrics and metrics per domain name
  • page counts, content languages, content types, etc.
  3. upload all of the relevant WARC records for each domain name to a HF dataset in the BigScience Catalogue Data Organization
  • minimal filtering of WARC records to include human-readable pages AND pages that reference links to objects we want to download (e.g. PDFs)
  • extract the HTML tags corresponding to all URLs in the WARC entries
  • optional: post-process the above list to identify outgoing links, and extract their domain names and content types
  • optional: run text extraction
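Steps 1 and 2 could be sketched roughly as below. This is a minimal illustration, not the actual pipeline: the seed set, the record dicts (standing in for parsed WARC response records), and the field names are all hypothetical placeholders.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical seed list standing in for the candidate-seeds spreadsheet.
SEED_DOMAINS = {"example.org", "example.com"}

def seed_domain(url):
    """Return the matching seed domain for a URL, or None.

    Matches the host exactly or as a subdomain
    (e.g. blog.example.org matches the seed example.org).
    """
    host = urlsplit(url).netloc.lower().split(":")[0]
    for seed in SEED_DOMAINS:
        if host == seed or host.endswith("." + seed):
            return seed
    return None

def per_domain_metrics(records):
    """Aggregate page counts, languages, and content types per seed domain.

    `records` is an iterable of dicts standing in for WARC response
    records, with keys: url, content_language, content_type.
    """
    metrics = defaultdict(lambda: {"pages": 0,
                                   "languages": Counter(),
                                   "content_types": Counter()})
    for rec in records:
        seed = seed_domain(rec["url"])
        if seed is None:
            continue  # not one of the pre-identified domains
        m = metrics[seed]
        m["pages"] += 1
        m["languages"][rec.get("content_language", "unknown")] += 1
        m["content_types"][rec.get("content_type", "unknown")] += 1
    return dict(metrics)
```

In practice the domain lookup would presumably run against the Common Crawl URL index rather than scanning full WARC files, but the per-domain aggregation would look similar.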

In particular, the list of domain names mentioned in outgoing links may be used to obtain a "depth 1 pseudo-crawl" by running the same process again.
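Collecting the outgoing-link domains for the depth 1 pseudo-crawl could look something like the stdlib-only sketch below; a real pipeline would more likely use a proper HTML toolkit, and the tag/attribute selection here is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collect href/src targets from the tags of an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

def outgoing_domains(html, page_url):
    """Return the set of external domains linked from a page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    own = urlsplit(page_url).netloc.lower()
    return {urlsplit(u).netloc.lower()
            for u in parser.links
            if urlsplit(u).netloc and urlsplit(u).netloc.lower() != own}
```

Running this over the filtered WARC records and unioning the resulting sets would yield the candidate domain list for the second pass.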

cc @sebastian-nagel
