Common Crawl crawler: adapt to new data access scheme, fixes #223 #226

Merged


sebastian-nagel
Contributor

In order to support both authenticated access via the S3 API and anonymous access via HTTP (see #223):

  • First, try to instantiate an S3 client using the default configuration (see boto3: configure credentials). If the S3 client is successfully instantiated, it is used both to list and to download WARC files.
  • If using the S3 client fails, WARC files are downloaded using https://data.commoncrawl.org/ as base URL. The available WARC file listings (warc.paths.gz) are used to get the list of all WARC files for the configured time span.
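The fallback described above could be sketched roughly as follows. This is only an illustration, not the code merged in this PR: the function names, the `crawl-data/CC-NEWS/<year>/<month>/` path layout, and the probe request used to check S3 access are assumptions.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"
BUCKET = "commoncrawl"
# Assumed layout of the news crawl; adjust if the real prefix differs.
PREFIX = "crawl-data/CC-NEWS/"


def make_s3_client():
    """Return a working boto3 S3 client, or None if S3 access is unavailable."""
    try:
        import boto3  # imported lazily so the HTTPS path works without boto3

        client = boto3.client("s3")  # uses the default credential chain
        # Cheap probe request: fails if there are no usable credentials.
        client.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX, MaxKeys=1)
        return client
    except Exception:  # missing package, missing credentials, access denied
        return None


def parse_listing(gz_bytes):
    """Decompress the content of a warc.paths.gz file into a list of paths."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def list_warcs_http(year, month):
    """List WARC paths for one month from the published path listing."""
    url = f"{BASE_URL}{PREFIX}{year:04d}/{month:02d}/warc.paths.gz"
    with urllib.request.urlopen(url) as response:
        return parse_listing(response.read())


def list_warcs_s3(client, year, month):
    """List WARC paths for one month by listing the bucket directly."""
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(
        Bucket=BUCKET, Prefix=f"{PREFIX}{year:04d}/{month:02d}/"
    )
    return [
        obj["Key"]
        for page in pages
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".warc.gz")
    ]


def list_warcs(year, month, client=None):
    """Use S3 when a client is available, otherwise fall back to HTTPS."""
    if client is not None:
        return list_warcs_s3(client, year, month)
    return list_warcs_http(year, month)


def warc_url(path):
    """Build the HTTPS download URL for a WARC path from a listing."""
    return BASE_URL + path.lstrip("/")
```

A caller would typically run `client = make_s3_client()` once and pass the result to `list_warcs(...)`, so the choice between S3 and HTTPS is made a single time per crawl.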

This PR also:

  • fixes a bug in the selection of months for the configured time span: the WARC file listing for the last month in the time span wasn't always fetched.
  • adds a parameter dry_run to let the crawler list the WARC files to be processed without actually processing them.
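As a rough illustration of these two points, the sketch below enumerates every month touched by a time span, including the month of the end date, and shows one way a `dry_run` flag can skip processing. The helper names `months_in_span` and `run` are hypothetical and not taken from the crawler's code.

```python
from datetime import datetime


def months_in_span(start, end):
    """Yield (year, month) for each month from start to end, both inclusive."""
    year, month = start.year, start.month
    # Comparing tuples keeps the end month in the result even when the
    # end date falls on an earlier day of the month than the start date.
    while (year, month) <= (end.year, end.month):
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def run(warc_paths, process, dry_run=False):
    """Process each WARC path, or only report it when dry_run is set."""
    for path in warc_paths:
        if dry_run:
            print(f"would process: {path}")
        else:
            process(path)


# Example: a span that crosses a year boundary and ends early in February.
span = list(months_in_span(datetime(2021, 11, 15), datetime(2022, 2, 1)))
```

Here `span` covers November 2021 through February 2022, so the listing for the month of the end date is always included.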

…#223

- try to connect to s3://commoncrawl/ using the default configuration
- if accessing s3://commoncrawl/ fails
  - fetch data using https://data.commoncrawl.org/ as base URL
  - use provided monthly WARC path listings
and verifying whether WARC files are properly selected
without actually extracting news articles from WARCs
timestamps in file names without a timestamp (i.e. `warc.paths.gz`)
- must always fetch WARC file listing for the month of the end date
from documentation; add dry_run param to example
@fhamborg fhamborg merged commit f59a95d into fhamborg:master May 30, 2022

3 participants