# Label a dataset with Amazon SageMaker Ground Truth

Based on the tutorial https://aws.amazon.com/tutorials/machine-learning-tutorial-label-training-data/

In [None]:
# Install the Hugging Face datasets library
! pip install datasets

In [None]:
from datasets import load_dataset

In [None]:
# Download the Rotten Tomatoes dataset
rotten_tomatoes_dataset = load_dataset("rotten_tomatoes")

# print the first movie review and label
print(rotten_tomatoes_dataset["train"][0])

In [None]:
# Select 20 random movie reviews (fixed seed so the sample is reproducible)
text_list = list(rotten_tomatoes_dataset['train'].shuffle(seed=42).select(range(20))['text'])

# create directory
! mkdir -p text_sample_20

# Save each review to its own .txt file (0.txt, 1.txt, ...)
for k, tt in enumerate(text_list):
    with open(f'text_sample_20/{k}.txt', 'w', encoding='utf-8') as f:
        f.write(tt)

In [None]:
text_list

In [None]:
import sagemaker
import boto3

# Upload the reviews to the session's default S3 bucket
sess = sagemaker.Session()
bucket = sess.default_bucket()  # returns the default bucket, creating it if it does not exist
s3 = boto3.client('s3')
for k in range(len(text_list)):
    s3.upload_file(f'text_sample_20/{k}.txt', bucket, f"rotten_tomatoes_demo/text_sample_20/{k}.txt")

In [None]:
# list files
sess.list_s3_files(bucket, "rotten_tomatoes_demo/text_sample_20")
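
To confirm that every review reached S3, the listing can be compared against the keys the upload loop was expected to create. The following is a minimal sketch: `missing_uploads` is a hypothetical helper (not part of the SageMaker SDK), and the listing is simulated here so the example runs without AWS credentials.

```python
def missing_uploads(listed_keys, prefix, count):
    """Return the expected object keys that are absent from listed_keys."""
    expected = {f"{prefix}/{k}.txt" for k in range(count)}
    return sorted(expected - set(listed_keys))


# Simulated listing with one file missing (illustration only).
prefix = "rotten_tomatoes_demo/text_sample_20"
listed = [f"{prefix}/{k}.txt" for k in range(19)]
missing = missing_uploads(listed, prefix, 20)
print(missing)
```

In the notebook, the result of `sess.list_s3_files(bucket, "rotten_tomatoes_demo/text_sample_20")` can be passed as `listed_keys`; an empty list means all 20 files are present.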

In [None]:
# Take note of the S3 location of the uploaded reviews; the labeling job needs it as its input location
print(f"s3://{bucket}/rotten_tomatoes_demo/text_sample_20/")

Follow the steps in the tutorial https://aws.amazon.com/tutorials/machine-learning-tutorial-label-training-data/ to label the movie reviews.

(Note: the tutorial uses images as its input data, but in this lab the input is text from the rotten_tomatoes dataset. You can skip steps 1, 2, and 5 and work from this Jupyter notebook instead. Adjust the remaining steps where necessary so that they use the Rotten Tomatoes review texts rather than the Caltech 101 images.)
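
Ground Truth reads its input from a manifest in JSON Lines format, where each line is a JSON object that points to one data object through a `source-ref` key. If you prefer to write the manifest yourself rather than have it generated during job setup, the sketch below shows one way to build it for the uploaded reviews. The helper name `build_manifest_lines` and the bucket name `example-bucket` are assumptions made for illustration; substitute the bucket printed earlier in this notebook.

```python
import json


def build_manifest_lines(bucket, prefix, count):
    """Return one JSON string per review file, each with a "source-ref" key."""
    lines = []
    for k in range(count):
        uri = f"s3://{bucket}/{prefix}/{k}.txt"
        lines.append(json.dumps({"source-ref": uri}))
    return lines


# "example-bucket" is a placeholder, not a real bucket.
manifest_lines = build_manifest_lines(
    "example-bucket", "rotten_tomatoes_demo/text_sample_20", 20
)
manifest_text = "\n".join(manifest_lines) + "\n"
print(manifest_lines[0])
```

Writing `manifest_text` to a file and uploading it to the same bucket would give the labeling job an explicit input manifest; whether you need this depends on how you set up the job in the console.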
