Beam Datasets

Intro

Some datasets are too big to be processed on a single machine, for example: wikipedia, wiki40b, etc. Instead, we allow to process them using Apache Beam.

Beam processing pipelines can be executed on many execution engines like Dataflow, Spark, Flink, etc. More infos about the different runners here.

Already processed datasets are provided

At Hugging Face we have already run the Beam pipelines for datasets like wikipedia and wiki40b to provide already processed datasets. Therefore users can simply do run load_dataset('wikipedia', '20200501.en') and the already processed dataset will be downloaded.

How to run a Beam dataset processing pipeline

If you want to run the Beam pipeline of a dataset anyway, here are the different steps to run on Dataflow:

First define which dataset and config you want to process:

DATASET_NAME=your_dataset_name  # ex: wikipedia
CONFIG_NAME=your_config_name    # ex: 20200501.en

Then, define your GCP infos:

PROJECT=your_project
BUCKET=your_bucket
REGION=your_region

Specify the python requirements:

echo "datasets" > /tmp/beam_requirements.txt
echo "apache_beam" >> /tmp/beam_requirements.txt

Finally run your pipeline:

python -mdatasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--cache_dir gs://$BUCKET/cache/datasets \
--beam_pipeline_options=\
"runner=DataflowRunner,project=$PROJECT,job_name=$DATASET_NAME-gen,"\
"staging_location=gs://$BUCKET/binaries,temp_location=gs://$BUCKET/temp,"\
"region=$REGION,requirements_file=/tmp/beam_requirements.txt"

Note

you can also use the flags num_workers or machine_type to fit your needs.

Note that it also works if you change the runner to Spark, Flink, etc. instead of Dataflow or if you change the output location to S3 or HDFS instead of GCS.

How to create your own Beam dataset

It is highly recommended to be familiar with Beam pipelines first. Then, you can start looking at the wikipedia.py script for an example.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

beam_dataset.rst

beam_dataset.rst

Beam Datasets

Intro

Already processed datasets are provided

How to run a Beam dataset processing pipeline

How to create your own Beam dataset

Files

beam_dataset.rst

Latest commit

History

beam_dataset.rst

File metadata and controls

Beam Datasets

Intro

Already processed datasets are provided

How to run a Beam dataset processing pipeline

How to create your own Beam dataset