# Dataset Preparation

Here we use the `fetch_20newsgroups` loader from `sklearn.datasets` to build a dataset for our model. The loader exposes predefined train and test subsets; this section loads the `train` subset, and `train_test_split` from `sklearn.model_selection` can be used if a further split is needed.

These data will then be uploaded to S3 for use in a BERT Topic Model. 

## Step 1: Define `S3`
Here we will define the `S3` bucket and prefix where we will upload our data.

In [6]:
import logging

import boto3
from botocore.exceptions import ClientError


def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region.

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """
    # Create bucket
    try:
        if region is None or region == 'us-east-1':
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [7]:
def create_folder_in_bucket(bucket_name: str, folder_name: str) -> None:
    """Creates a folder in the specified S3 bucket.

    Args:
        bucket_name (str): The name of the S3 bucket.
        folder_name (str): The name of the folder to be created.

    Returns:
        None
    """
    s3_client = boto3.client('s3')
    if not folder_name.endswith('/'):
        folder_name += "/"
    s3_client.put_object(Bucket=bucket_name, Key=folder_name)

In [8]:
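create_bucket(bucket_name='with-context-sagemaker', region='us-east-1')
# S3 keys are literal strings, so the folder key has no leading slash; this
# matches the object key used for the upload in Step 3.
create_folder_in_bucket(bucket_name='with-context-sagemaker', folder_name='datasets/bert-topic/')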
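
S3 has no real directories: a "folder" is only a key prefix, and keys are compared as literal strings, so `/datasets/bert-topic/` and `datasets/bert-topic/` are different keys. The helper below is a minimal sketch for keeping prefixes consistent. `normalize_prefix` is a hypothetical name, not part of boto3.

```python
def normalize_prefix(prefix: str) -> str:
    """Return an S3 folder-style prefix with no leading slash and one trailing slash."""
    # Drop empty segments, which removes leading, trailing, and repeated slashes
    parts = [part for part in prefix.split("/") if part]
    if not parts:
        # An empty prefix refers to the bucket root
        return ""
    return "/".join(parts) + "/"


print(normalize_prefix("/datasets/bert-topic/"))  # datasets/bert-topic/
print(normalize_prefix("fits/bert-topic"))        # fits/bert-topic/
```

Passing every folder name through a helper like this before calling `create_folder_in_bucket` keeps the folder markers aligned with the object keys used for uploads.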

## Step 2: Import the dataset
This step covers the exploratory analysis (EDA) and data engineering needed to get the data into a format that our model can use in SageMaker.


In [9]:
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

data = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

documents = data['data']
documents = documents[:100]

In [10]:
# We're limiting to 100 documents for ease of training.
len(documents)

100

### Step 2A: Prepare the data
Not performed here, but this is where you would typically prepare the data when training in SageMaker for the first time. The preparation can be automated on later runs.
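
As an illustration only, preparing free-text documents might mean normalizing whitespace and dropping documents that are empty, which can happen once headers, footers, and quotes are removed. The sketch below assumes that kind of cleanup. `prepare_documents` and `min_chars` are hypothetical names, and the threshold is arbitrary.

```python
import re
from typing import List


def prepare_documents(documents: List[str], min_chars: int = 1) -> List[str]:
    """Collapse runs of whitespace and drop documents shorter than min_chars."""
    prepared = []
    for document in documents:
        # Replace any run of whitespace (including newlines) with one space
        cleaned = re.sub(r"\s+", " ", document).strip()
        if len(cleaned) >= min_chars:
            prepared.append(cleaned)
    return prepared


sample = ["  Hello\n\nworld  ", "", "   \n"]
print(prepare_documents(sample))  # ['Hello world']
```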

## Step 3: Write Dataset
We will use the `tempfile` library to write the data to a temporary file. This file will then be uploaded to S3.

In [12]:
import tempfile

from botocore.exceptions import NoCredentialsError


def upload_to_s3(file_name, bucket, object_name=None):
    """Upload a file to an S3 bucket.

    :param file_name: File to upload
    :param bucket: Bucket to upload to
    :param object_name: S3 object name. If not specified then file_name is used
    :return: True if file was uploaded, else False
    """
    # If S3 object_name was not specified, use file_name
    if object_name is None:
        object_name = file_name

    # Upload the file
    s3_client = boto3.client('s3')
    try:
        s3_client.upload_file(file_name, bucket, object_name)
    except FileNotFoundError:
        print("The file was not found")
        return False
    except NoCredentialsError:
        print("Credentials not available")
        return False

    return True

# Create a temporary file and write one document per line
with tempfile.NamedTemporaryFile(
    mode='w', delete=False, suffix='.txt', encoding='utf-8'
) as temp_file:
    for document in documents:
        # Escape embedded newlines so each document stays on a single line
        temp_file.write(
            document.replace('\n', '\\n') + '\n'
        )

    # Get the path of the temporary file
    temp_file_path = temp_file.name

# Upload after the `with` block has closed the file, so that all buffered
# writes are flushed to disk before the file is read for upload
upload_to_s3(
    file_name=temp_file_path,
    bucket="with-context-sagemaker",
    object_name="datasets/bert-topic/training_file.txt"
)
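
Each document is stored on one line, with embedded newlines written as the two characters `\n`, so any consumer of the file has to reverse that escaping. The sketch below shows one way to do this on a local copy. `read_documents` is a hypothetical helper, and the round trip is only exact when the original text contains no literal backslash followed by `n`.

```python
import os
import tempfile
from typing import List


def read_documents(path: str) -> List[str]:
    """Read one document per line and restore the escaped newlines."""
    with open(path, encoding="utf-8") as handle:
        # Drop the line terminator, then turn the two characters \n back into a newline
        return [line.rstrip("\n").replace("\\n", "\n") for line in handle]


# Round-trip check on a small local file
docs = ["first line\nsecond line", "single line"]
with tempfile.NamedTemporaryFile(
    mode="w", delete=False, suffix=".txt", encoding="utf-8"
) as tmp:
    for doc in docs:
        tmp.write(doc.replace("\n", "\\n") + "\n")
    path = tmp.name

restored = read_documents(path)
os.remove(path)  # clean up, since delete=False leaves the file on disk
print(restored == docs)  # True
```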

## Step 4: Create the Output S3 Directories
Here we create the output directories in S3 where the outputs will be uploaded after SageMaker Estimator training.

In [13]:
create_folder_in_bucket(
    bucket_name='with-context-sagemaker',
    folder_name ='fits/bert-topic/'
)
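
Downstream tools, including the SageMaker Python SDK, generally refer to S3 locations as URIs of the form `s3://bucket/key`. As a small sketch, the helper below builds those URIs for the training file and the output folder used in this notebook. `s3_uri` is a hypothetical helper, not an SDK function.

```python
def s3_uri(bucket: str, key: str) -> str:
    """Build an s3:// URI from a bucket name and an object key or prefix."""
    # Keys are not expected to start with a slash; strip any leading slashes
    return f"s3://{bucket}/{key.lstrip('/')}"


BUCKET = "with-context-sagemaker"
print(s3_uri(BUCKET, "datasets/bert-topic/training_file.txt"))
# s3://with-context-sagemaker/datasets/bert-topic/training_file.txt
print(s3_uri(BUCKET, "fits/bert-topic/"))
# s3://with-context-sagemaker/fits/bert-topic/
```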

# Back Pocket