# Dataset Preparation

Here we use the `fetch_20newsgroups` loader from `sklearn.datasets` to build a dataset for our model, and the `train_test_split` function from `sklearn.model_selection` to split that dataset into training and testing sets.
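As a rough sketch of what the split looks like, the cell below runs `train_test_split` on a handful of made-up strings. The toy documents, the labels, the 25% test size, and the use of `stratify` are all assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption): in the notebook these would be the newsgroup
# texts and their integer category labels.
toy_documents = [
    "rockets and orbits",
    "shuttle launch window",
    "mars probe telemetry",
    "satellite ground track",
    "ray tracing shadows",
    "polygon mesh shading",
    "texture mapping tricks",
    "image file formats",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Hold out 25% of the examples; stratify keeps the class balance the same
# in both splits, and random_state makes the split reproducible.
train_docs, test_docs, train_labels, test_labels = train_test_split(
    toy_documents,
    toy_labels,
    test_size=0.25,
    random_state=42,
    stratify=toy_labels,
)

print(len(train_docs), len(test_docs))
```

Fixing `random_state` is a design choice that makes reruns of the notebook produce the same training and testing sets.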

The data are then uploaded to S3 for use in a BERT topic model.

## Step 1: Define the `S3` bucket and prefix
Here we define the `S3` bucket and the prefix (folder) to which we will upload our data.

In [4]:
import logging

import boto3
from botocore.exceptions import ClientError


def create_bucket(bucket_name, region=None):
    """Create an S3 bucket in a specified region.

    If a region is not specified, the bucket is created in the S3 default
    region (us-east-1).

    :param bucket_name: Bucket to create
    :param region: String region to create bucket in, e.g., 'us-west-2'
    :return: True if bucket created, else False
    """
    # Create bucket
    try:
        if region is None:
            s3_client = boto3.client('s3')
            s3_client.create_bucket(Bucket=bucket_name)
        else:
            s3_client = boto3.client('s3', region_name=region)
            location = {'LocationConstraint': region}
            s3_client.create_bucket(Bucket=bucket_name,
                                    CreateBucketConfiguration=location)
    except ClientError as e:
        logging.error(e)
        return False
    return True

In [5]:
def create_folder_in_bucket(bucket_name: str, folder_name: str) -> None:
    """Creates a folder in the specified S3 bucket.

    Args:
        bucket_name (str): The name of the S3 bucket.
        folder_name (str): The name of the folder to be created.

    Returns:
        None
    """
    s3_client = boto3.client('s3')
    if not folder_name.endswith('/'):
        folder_name += "/"
    s3_client.put_object(Bucket=bucket_name, Key=folder_name)

In [None]:
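create_bucket(bucket_name='with-context-sagemaker-examples', region='us-east-2')
# The folder name has no leading '/': S3 keys are not filesystem paths, and a
# leading slash would add an extra, empty-named folder level to the key.
create_folder_in_bucket(bucket_name='with-context-sagemaker-examples', folder_name='datasets/bert-topic/')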
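
S3 has no real directories. A "folder" is simply an object whose key ends in `/`, so the exact string passed as the folder name matters. The helper below is a minimal sketch of how a prefix could be cleaned up before it is used as a key; `normalize_s3_prefix` is a hypothetical name introduced here and is not part of boto3.

```python
def normalize_s3_prefix(prefix: str) -> str:
    """Return the prefix with no leading slash and exactly one trailing slash.

    Hypothetical helper for illustration; an empty or slash-only prefix
    maps to the empty string (the bucket root).
    """
    cleaned = prefix.strip("/")
    return f"{cleaned}/" if cleaned else ""


print(normalize_s3_prefix("/datasets/bert-topic/"))
```

The result could be passed as `folder_name` to `create_folder_in_bucket`, which keeps keys consistent regardless of how the prefix was typed.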

## Step 2: Import the dataset
In this step we load the dataset and carry out the exploratory data analysis (EDA) and data engineering needed to put the data into a format that our model can use in SageMaker.


In [6]:
from sklearn.datasets import fetch_20newsgroups

categories = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]

data = fetch_20newsgroups(
    subset="train",
    categories=categories,
    shuffle=True,
    random_state=42,
    remove=("headers", "footers", "quotes"),
)

In [8]:
documents = data.data

'\n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.'
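
The introduction mentions uploading the data to S3. One possible way to do that, sketched below, is to serialize the documents as JSON Lines (one JSON object per line) and send the bytes with the S3 client's `put_object` call. The function names, the `"text"` field, and the `train.jsonl` key are assumptions made for this sketch; the notebook does not specify a file format.

```python
import json
from typing import Iterable


def documents_to_jsonl(docs: Iterable[str]) -> bytes:
    """Encode each document as one JSON object per line (JSON Lines)."""
    # json.dumps escapes embedded newlines, so every document stays on one line.
    payload = "".join(json.dumps({"text": doc}) + "\n" for doc in docs)
    return payload.encode("utf-8")


def upload_documents(s3_client, bucket_name: str, key: str, docs: Iterable[str]) -> int:
    """Upload the documents as a single JSON Lines object; return its size in bytes.

    `s3_client` is expected to behave like boto3.client('s3').
    """
    body = documents_to_jsonl(docs)
    s3_client.put_object(Bucket=bucket_name, Key=key, Body=body)
    return len(body)


# Example call (assumes AWS credentials are configured and the bucket exists):
# import boto3
# upload_documents(
#     boto3.client("s3"),
#     "with-context-sagemaker-examples",
#     "datasets/bert-topic/train.jsonl",  # hypothetical key name
#     documents,
# )
```

Passing the client in as an argument keeps the serialization logic testable without network access.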