# Environment Setup Progress

This notebook records the initial setup steps already completed for the project:

1. Downloaded the required datasets from Kaggle using the command line.
2. Created an S3 bucket and uploaded the downloaded data.
3. Created a Glue database and table to catalog the raw data.


In [None]:
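# Download the dataset from Kaggle into ./data
# Requires Kaggle API credentials (~/.kaggle/kaggle.json)
# Replace <owner/dataset-name>, angle brackets included, with the dataset slug;
# left as-is, the shell treats < and > as redirections.
# The dataset arrives as a .zip archive unless --unzip is also passed.
!kaggle datasets download -d <owner/dataset-name> -p ./data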
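
The download step depends on the credentials file mentioned in the comment above. A minimal sketch of a pre-flight check follows, assuming the standard `kaggle.json` layout with `username` and `key` fields. The helper name `kaggle_credentials_ok` is hypothetical and not part of the Kaggle CLI.

```python
import json
from pathlib import Path


def kaggle_credentials_ok(path=None):
    """Return True if the credentials file exists and has non-empty
    'username' and 'key' fields. Hypothetical helper, not a Kaggle API."""
    # Default to the location the Kaggle CLI reads from.
    path = Path(path) if path is not None else Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return False
    try:
        creds = json.loads(path.read_text())
    except (OSError, ValueError):
        # Unreadable file or invalid JSON.
        return False
    return (
        isinstance(creds, dict)
        and bool(creds.get("username"))
        and bool(creds.get("key"))
    )
```

Calling `kaggle_credentials_ok()` before the download cell gives an early, readable failure instead of an error from the CLI itself.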


In [None]:
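# Create the S3 bucket and upload the downloaded files
# S3 bucket names are globally unique, so this name may need to be changed.
# Note: the Glue table defined later in this notebook points at the snp_news/ prefix,
# so the files must end up under s3://my-text-data-bucket/snp_news/.
!aws s3 mb s3://my-text-data-bucket
!aws s3 cp ./data s3://my-text-data-bucket/ --recursive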


In [None]:
import boto3

glue = boto3.client('glue')

# Create a Glue database if it doesn't already exist
try:
    glue.create_database(DatabaseInput={'Name': 'text_data_db'})
except glue.exceptions.AlreadyExistsException:
    pass

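# Example table creation (schema simplified: only two columns and no
# input/output format or SerDe, which a query engine such as Athena
# would likely need before it can read the data)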
try:
    glue.create_table(
        DatabaseName='text_data_db',
        TableInput={
            'Name': 'raw_news',
            'StorageDescriptor': {
                'Columns': [
                    {'Name': 'date', 'Type': 'string'},
                    {'Name': 'title', 'Type': 'string'},
                ],
                'Location': 's3://my-text-data-bucket/snp_news/'
            }
        }
    )
except glue.exceptions.AlreadyExistsException:
    pass
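
The table definition above is simplified. As a sketch of what a fuller `TableInput` might look like, the helper below builds one for CSV files. It assumes the raw files are comma-separated with a single header row, which the notebook does not state. The function name `build_csv_table_input` is hypothetical, and the format and SerDe class names are the standard Hive ones commonly used for CSV tables in the Glue catalog.

```python
def build_csv_table_input(name, location, columns):
    """Build a Glue TableInput dict for an external CSV table.

    Assumption (not from the notebook): files are comma-separated
    with one header row. `columns` is a list of (name, type) pairs.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {
            "classification": "csv",
            "skip.header.line.count": "1",  # assumes one header row
        },
        "StorageDescriptor": {
            "Columns": [{"Name": c, "Type": t} for c, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
                "Parameters": {"separatorChar": ",", "quoteChar": '"'},
            },
        },
    }


table_input = build_csv_table_input(
    "raw_news",
    "s3://my-text-data-bucket/snp_news/",
    [("date", "string"), ("title", "string")],
)
# Usage, given the client from the cell above:
# glue.create_table(DatabaseName="text_data_db", TableInput=table_input)
```

Keeping the dictionary construction separate from the `create_table` call makes the definition easy to inspect or adjust before anything is written to the catalog.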
