An open clone of the GPT-2 WebText dataset by OpenAI. Still WIP.
Latest commit 02875cc Feb 28, 2019
Type Name Latest commit message Commit time
.gitignore Initial Commit Feb 18, 2019
Pipfile
Pipfile.lock Update pyyaml Feb 28, 2019
README.md
download.py bug fixes Feb 28, 2019
download_old.py
filter.py Don't load all urls in memory at once Feb 28, 2019
get_urls.py Initial Commit Feb 18, 2019
scrapers.py bug fixes Feb 28, 2019
utils.py


OpenWebText

This project is an open clone of the GPT-2 WebText dataset as outlined in the OpenAI paper. It is still very much a work in progress.

Huge thanks to jcpeterson for letting me use his download code. His version of OpenWebText is super well written, so please check it out!

Dependencies

Pipenv, Python 3

To install the Python dependencies:

pipenv install

Newspaper Dependencies:

On Ubuntu:

sudo apt-get install libxml2-dev libxslt-dev

On OS X:

brew install libxml2 libxslt

Usage

  1. Get the list of URLs from Reddit:
pipenv run python get_urls.py
  2. Download the data from those URLs:
pipenv run python download.py

The resulting files are written to data/ and named {domain}-{sha256 hash of url}.txt.
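As a rough illustration of that naming scheme, the sketch below builds a file name from a URL. It is not taken from download.py: the helper name output_filename is hypothetical, and it assumes that the domain is the URL's network location and that the hash is the hex SHA-256 digest of the UTF-8 encoded URL. The actual code may derive either part differently.

```python
import hashlib
from urllib.parse import urlparse


def output_filename(url: str) -> str:
    """Build a name of the form {domain}-{sha256 hash of url}.txt.

    Hypothetical helper: the domain extraction and the string encoding
    used here are assumptions, not the project's confirmed behavior.
    """
    # Assumption: "domain" means the network location part of the URL.
    domain = urlparse(url).netloc
    # Assumption: the hash is the hex digest of the UTF-8 encoded URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{domain}-{digest}.txt"


if __name__ == "__main__":
    print(output_filename("https://example.com/some/article"))
```

Because the hash depends only on the URL, the same URL always maps to the same file name, which makes it easy to check whether a page has already been downloaded.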

Enjoy!
