## Install Python packages
First, install the required Python packages.

In [None]:
!pip3 install datasets boto3


## Load dataset

Now we load the `midas/semeval2017` Hugging Face dataset and inspect its `test` split.

In [None]:
# load the entire dataset using the "raw" configuration
from datasets import load_dataset
dataset = load_dataset("midas/semeval2017", "raw")
# select the test split
test_dataset = dataset["test"]
test_dataset
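
The ingestion step reads three fields from each row: `id`, `document` (a sequence of tokens), and `extractive_keyphrases`. As a minimal sketch, assuming a row shaped like the hypothetical one below (the values are made up for illustration and are not taken from the dataset), the tokens can be joined back into plain text like this. The helper name `row_to_text` is also an assumption, not part of the original notebook.

```python
# Hypothetical row for illustration only; real rows come from test_dataset.
sample_row = {
    "id": "S0000000000000000",
    "document": ["Keyphrase", "extraction", "is", "fun", "."],
    "extractive_keyphrases": ["keyphrase extraction"],
}


def row_to_text(row):
    """Join the tokens in row['document'] into a single space-separated string."""
    return " ".join(row["document"])


sample_text = row_to_text(sample_row)
print(sample_text)
```

Joining on a single space keeps punctuation as separate tokens, so the text is not identical to the original article formatting.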

## Ingest dataset into S3 bucket

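Set `universe_bucket` below to the name of the S3 bucket that will receive the dataset. For each row in the `test` split, the document text and its keyphrases are uploaded as two separate objects.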

In [None]:
import json
import boto3
from uuid import uuid4

# name of the target S3 bucket
universe_bucket = "ajayvohra-phrase-piece-pdx-1"
assert universe_bucket, "universe bucket is required"
documents_prefix = "midas/semeval2017/documents"
keyphrases_prefix = "midas/semeval2017/keyphrases"

s3_client = boto3.client("s3")
for row in test_dataset:
    # join the document tokens into plain text
    text = " ".join(row["document"])
    doc_id = row["id"]  # avoid shadowing the built-in id()

    json_obj = {
        "document_id": doc_id,
        "keyphrases": row["extractive_keyphrases"],
    }
    # use the same random file name for the document and its keyphrases
    file_name = str(uuid4())
    key = f"{documents_prefix}/id={doc_id}/{file_name}.txt"
    s3_client.put_object(Bucket=universe_bucket, Key=key, Body=text)
    key = f"{keyphrases_prefix}/id={doc_id}/{file_name}.json"
    s3_client.put_object(Bucket=universe_bucket, Key=key, Body=json.dumps(json_obj))
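
To check that the upload worked, one option is to count the objects under each prefix. The following is a minimal sketch, not part of the original notebook: the helper name `count_objects` is hypothetical, and it assumes the caller's credentials allow listing the bucket. It uses the boto3 `list_objects_v2` paginator so that prefixes holding more than 1,000 keys are counted in full.

```python
def count_objects(s3_client, bucket, prefix):
    """Count the objects stored under a prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when no keys match the prefix.
        total += len(page.get("Contents", []))
    return total


# Usage in the notebook, assuming the variables defined in the cell above:
# print(count_objects(s3_client, universe_bucket, documents_prefix))
# print(count_objects(s3_client, universe_bucket, keyphrases_prefix))
```

If the ingestion ran exactly once against an empty prefix, both counts should equal `len(test_dataset)`. Re-running the ingestion adds new objects, because each run generates fresh UUID file names.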

## Conclusion

The `test` split of the dataset has been ingested into the S3 bucket, with documents and keyphrases stored under separate prefixes.
