Common Crawl crawler: adapt to new data access scheme, fixes #223 #226

Merged

Commits on Apr 22, 2022

  1. Common Crawl crawler: adapt to new data access scheme, fixes fhamborg…

    …#223
    
    - try to connect to s3://commoncrawl/ using the default configuration
    - if accessing s3://commoncrawl/ fails
      - fetch data using https://data.commoncrawl.org/ as base URL
      - use provided monthly WARC path listings
    sebastian-nagel committed Apr 22, 2022
    77829ea
  2. Common Crawl crawler: add option dry_run for debugging

    and verifying whether WARC files are properly selected
    without actually extracting news articles from WARCs
    sebastian-nagel committed Apr 22, 2022
    3e90448
  3. Common Crawl crawler: catch exceptions when trying to parse

    timestamps in file names without a timestamp (i.e. `warc.paths.gz`)
    sebastian-nagel committed Apr 22, 2022
    17f4257
  4. Common Crawl crawler: fix selection of WARC files by time range:

    - must always fetch WARC file listing for the month of the end date
    sebastian-nagel committed Apr 22, 2022
    c8efe98
  5. Common Crawl crawler: remove awscli as dependency and related notes

    from documentation; add dry_run param to example
    sebastian-nagel committed Apr 22, 2022
    14d0a20
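
The HTTPS fallback and the time-range fix from the commits above (77829ea, c8efe98) can be illustrated with a minimal sketch. It is not the code from this pull request: the function names `months_in_range` and `listing_urls` are hypothetical, and the listing path layout `crawl-data/CC-NEWS/<year>/<month>/warc.paths.gz` is an assumption about how the monthly WARC path listings are published. Only the base URL `https://data.commoncrawl.org/` is taken from the commit message.

```python
from __future__ import annotations

from datetime import date

# Base URL named in the commit message.
BASE_URL = "https://data.commoncrawl.org/"
# Assumed layout of the monthly WARC path listings (not confirmed by the PR text).
LISTING_TEMPLATE = "crawl-data/CC-NEWS/{year:04d}/{month:02d}/warc.paths.gz"


def months_in_range(start: date, end: date) -> list[tuple[int, int]]:
    """Return every (year, month) pair from start to end.

    The month of the end date is always included, even if the end date
    is the first day of that month.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return months


def listing_urls(start: date, end: date) -> list[str]:
    """Build one listing URL per month covered by the time range."""
    return [
        BASE_URL + LISTING_TEMPLATE.format(year=y, month=m)
        for y, m in months_in_range(start, end)
    ]
```

Comparing `(year, month)` tuples rather than full dates is what keeps the end month in the result; a loop that stepped by whole months from the start date and compared against the end date could skip it.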

Commits on May 7, 2022

  1. Common Crawl crawler: adapt to new data access scheme

    - bug fix: typo in variable name
    sebastian-nagel committed May 7, 2022
    436c09c
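
The timestamp handling (17f4257) and the `dry_run` option (3e90448) from the April 22 commits can be sketched together as follows. This is an illustration under stated assumptions, not the merged implementation: the names `parse_warc_timestamp` and `process_warcs`, the `extract` callback, and the file-name pattern (a 14-digit timestamp between hyphens, as in `CC-NEWS-20220422101500-00001.warc.gz`) are all hypothetical.

```python
import re
from datetime import datetime
from typing import Callable, Iterable, List, Optional

# Assumed file-name pattern: a 14-digit timestamp enclosed by hyphens.
TIMESTAMP_RE = re.compile(r"-(\d{14})-")


def parse_warc_timestamp(path: str) -> Optional[datetime]:
    """Return the timestamp embedded in a WARC file name, or None.

    Names without a timestamp (such as `warc.paths.gz`) and names with
    an invalid one yield None instead of raising an exception.
    """
    name = path.rsplit("/", 1)[-1]
    match = TIMESTAMP_RE.search(name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    except ValueError:
        return None


def process_warcs(
    paths: Iterable[str],
    start: datetime,
    end: datetime,
    extract: Callable[[str], None],
    dry_run: bool = False,
) -> List[str]:
    """Select WARC paths inside [start, end] and extract them unless dry_run."""
    selected = []
    for path in paths:
        timestamp = parse_warc_timestamp(path)
        if timestamp is not None and start <= timestamp <= end:
            selected.append(path)
    if not dry_run:
        for path in selected:
            extract(path)
    return selected
```

With `dry_run=True` the function only returns the selection, so the chosen WARC files can be checked without running any extraction.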