Crawling curated list of sites: Data Sourcing Candidate seeds spreadsheet #299

@yjernite

Description

We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.

This issue tracks potential crawling seeds identified by BigScience Data Sourcing participants, primarily in Spanish and SEA English (and three Chinese sites).

The steps to follow are:

  1. filter the Common Crawl (or another archive) for all WARC records with one of the given domain names
  • filter all dumps from the last two years
  2. obtain overall metrics and metrics per domain name
  • page counts, content languages, content types, etc.
  3. upload all of the relevant WARC records for each domain name to a HF dataset in the BigScience Catalogue Data Organization
  • minimal filtering of WARC records to include human-readable pages AND pages that reference links to objects we want to download (e.g. PDFs)
  • extract the HTML tags corresponding to all URLs in the WARC entries
  • optional: post-process the above list to identify outgoing links, and extract their domain names and content types
  • optional: run text extraction
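Steps 1 and 2 could be sketched roughly as below. This is a minimal illustration, not the actual pipeline: the seed set, the record dicts (standing in for parsed WARC response records), and the field names are all hypothetical placeholders.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical seed list standing in for the candidate-seeds spreadsheet.
SEED_DOMAINS = {"example.org", "example.com"}

def seed_domain(url):
    """Return the matching seed domain for a URL, or None.

    Matches the host exactly or as a subdomain
    (e.g. blog.example.org matches the seed example.org).
    """
    host = urlsplit(url).netloc.lower().split(":")[0]
    for seed in SEED_DOMAINS:
        if host == seed or host.endswith("." + seed):
            return seed
    return None

def per_domain_metrics(records):
    """Aggregate page counts, languages, and content types per seed domain.

    `records` is an iterable of dicts standing in for WARC response
    records, with keys: url, content_language, content_type.
    """
    metrics = defaultdict(lambda: {"pages": 0,
                                   "languages": Counter(),
                                   "content_types": Counter()})
    for rec in records:
        seed = seed_domain(rec["url"])
        if seed is None:
            continue  # not one of the pre-identified domains
        m = metrics[seed]
        m["pages"] += 1
        m["languages"][rec.get("content_language", "unknown")] += 1
        m["content_types"][rec.get("content_type", "unknown")] += 1
    return dict(metrics)
```

In practice the domain lookup would presumably run against the Common Crawl URL index rather than scanning full WARC files, but the per-domain aggregation would look similar.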

In particular, the list of domain names mentioned in outgoing links may be used to obtain a "depth 1 pseudo-crawl" by running the same process again.
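Collecting the outgoing-link domains for the depth 1 pseudo-crawl could look something like the stdlib-only sketch below; a real pipeline would more likely use a proper HTML toolkit, and the tag/attribute selection here is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collect href/src targets from the tags of an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))

def outgoing_domains(html, page_url):
    """Return the set of external domains linked from a page."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    own = urlsplit(page_url).netloc.lower()
    return {urlsplit(u).netloc.lower()
            for u in parser.links
            if urlsplit(u).netloc and urlsplit(u).netloc.lower() != own}
```

Running this over the filtered WARC records and unioning the resulting sets would yield the candidate domain list for the second pass.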

cc @sebastian-nagel
