# Data Prep Lab

## Problem Statement

* Data source: a set of random user data from [randomuser.me](https://randomuser.me/)
* From this data file:
    * What percentage of users are male vs. female?
    * Which ages are most common among users?
    * How many users are in their 20s, 30s, 40s, and so on?
    * Convert the data to CSV and store it in S3.
    * Transform the gender feature to a binary value: male 1, female 0.
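
The questions above can be sketched with the standard library alone. This is a minimal illustration, not the lab's solution: the sample records and the flat `gender` / `age` field names are assumptions, since the actual layout of `userdata.txt` is not shown here.

```python
import csv
import io
from collections import Counter

# Hypothetical sample records; the real layout of userdata.txt may differ.
users = [
    {"first": "Ana", "gender": "female", "age": 34},
    {"first": "Ben", "gender": "male", "age": 27},
    {"first": "Cy", "gender": "male", "age": 41},
    {"first": "Di", "gender": "female", "age": 38},
]

def gender_percentages(records):
    """Share of each gender value, as a percentage of all records."""
    counts = Counter(r["gender"] for r in records)
    total = sum(counts.values())
    return {gender: 100.0 * n / total for gender, n in counts.items()}

def decade_counts(records):
    """Number of users per decade of age (20 means 20-29, 30 means 30-39, ...)."""
    return Counter((r["age"] // 10) * 10 for r in records)

def encode_gender(records):
    """Copy the records with gender mapped to 1 for 'male' and 0 otherwise."""
    return [{**r, "gender": 1 if r["gender"] == "male" else 0} for r in records]

def to_csv(records):
    """Serialize a non-empty list of flat dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

print(gender_percentages(users))
print(dict(decade_counts(users)))
print(to_csv(encode_gender(users)))
```

The CSV text returned by `to_csv` could then be written to a file and uploaded in the same way as the raw file in the next section.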


## Upload Data to S3

In [None]:
import boto3

In [None]:
s3client = boto3.client('s3')

In [None]:
import os
account_no = os.environ['ACCOUNT_NO']

In [None]:
bucket_name = '3034034dataprep'
filename = './userdata.txt'
key='raw/userdata.txt'
db_name='userdata'
crawler_name='userdatacrawler'
crawler_role=f'arn:aws:iam::{account_no}:role/service-role/AWSGlueServiceRole-UserDataCrawler'

In [None]:
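# Create an S3 bucket
# Note: omitting CreateBucketConfiguration only works in us-east-1;
# other regions need a LocationConstraint.
response = s3client.create_bucket(
    Bucket=bucket_name
)
print(response)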

In [None]:
# Copy userdata.txt to the bucket
# upload_file returns None and raises on failure, so there is no response to print
s3client.upload_file(
    Filename=filename,
    Bucket=bucket_name,
    Key=key
)

## Data Catalog

Set up a Glue crawler to crawl the bucket and populate the Data Catalog.

In [None]:
# The Glue client doesn't pick up the region from AWS_REGION
# (boto3 reads AWS_DEFAULT_REGION), so set it explicitly.
from botocore.config import Config

my_config = Config(
    region_name='us-east-1'
)

glueClient = boto3.client('glue', config=my_config)

In [None]:
# Create database
response = glueClient.create_database(
    DatabaseInput= {
        'Name':db_name,
        'Description':'user data from randomuser.me'
    }
)
print(response)

In [None]:
# Shortcut: the role was created by setting up a crawler manually in the console.
# Everything except the role was then deleted, and the role is reused here.
# TODO: create the role and policy in this notebook
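
One way the TODO above might be approached is sketched below. It is an assumption-laden illustration rather than a reproduction of what the console generates: the helper names (`glue_trust_policy`, `s3_read_policy`, `create_crawler_role`) are hypothetical, and the inline S3 policy is a guess at the minimum read access a crawler needs on the `raw` prefix. The IAM calls themselves (`create_role`, `attach_role_policy`, `put_role_policy`) and the `AWSGlueServiceRole` managed policy are standard.

```python
import json

# AWS-managed policy that grants the baseline Glue permissions.
GLUE_SERVICE_POLICY_ARN = 'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole'

def glue_trust_policy():
    """Trust policy that lets the Glue service assume the role."""
    return {
        'Version': '2012-10-17',
        'Statement': [{
            'Effect': 'Allow',
            'Principal': {'Service': 'glue.amazonaws.com'},
            'Action': 'sts:AssumeRole',
        }],
    }

def s3_read_policy(bucket, prefix):
    """Assumed minimal read access to the crawled prefix (hypothetical scope)."""
    return {
        'Version': '2012-10-17',
        'Statement': [
            {
                'Effect': 'Allow',
                'Action': ['s3:GetObject'],
                'Resource': f'arn:aws:s3:::{bucket}/{prefix}/*',
            },
            {
                'Effect': 'Allow',
                'Action': ['s3:ListBucket'],
                'Resource': f'arn:aws:s3:::{bucket}',
            },
        ],
    }

def create_crawler_role(iam_client, role_name, bucket, prefix='raw'):
    """Create the role, attach the Glue managed policy, add S3 read access."""
    role = iam_client.create_role(
        Path='/service-role/',
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=GLUE_SERVICE_POLICY_ARN,
    )
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=f'{role_name}-s3-read',
        PolicyDocument=json.dumps(s3_read_policy(bucket, prefix)),
    )
    return role['Role']['Arn']
```

With a real client this would be called as `create_crawler_role(boto3.client('iam'), 'AWSGlueServiceRole-UserDataCrawler', bucket_name)`, and the returned ARN could replace the hand-built `crawler_role` string. A newly created role can take a few seconds to become usable by Glue.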

In [None]:
f's3://{bucket_name}/raw'

In [None]:
response = glueClient.create_crawler(
    Name=crawler_name,
    Role=crawler_role,
    DatabaseName=db_name,
    Targets={
        'S3Targets': [
            {
                'Path': f's3://{bucket_name}/raw',
            },
        ]
    }
)
print(response)
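
Creating a crawler does not run it; the catalog tables only appear after a crawl completes. The sketch below shows one way to start the crawler and poll until it returns to the `READY` state, using Glue's `start_crawler` and `get_crawler` calls. The helper name `run_crawler_and_wait` and its polling defaults are assumptions for illustration.

```python
import time

def run_crawler_and_wait(glue_client, name, poll_seconds=15, timeout_seconds=900):
    """Start a Glue crawler and block until it is READY again.

    Returns the crawler's LastCrawl summary (may be empty), or raises
    TimeoutError if the crawl does not finish within timeout_seconds.
    """
    glue_client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=name)['Crawler']
        # State moves RUNNING -> STOPPING -> READY once the crawl is done.
        if crawler['State'] == 'READY':
            return crawler.get('LastCrawl', {})
        time.sleep(poll_seconds)
    raise TimeoutError(f'Crawler {name} did not finish in {timeout_seconds}s')
```

In this notebook it would be called as `run_crawler_and_wait(glueClient, crawler_name)`; checking the returned `Status` field for `SUCCEEDED` before querying the catalog is a reasonable next step.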

## Cleanup

In [None]:
# Delete crawler
glueClient.delete_crawler(
    Name=crawler_name
)

In [None]:
# Delete database
glueClient.delete_database(
    Name=db_name
)

In [None]:
# Delete the uploaded object
response = s3client.delete_object(
    Bucket=bucket_name,
    Key=key
)
print(response)
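
The `delete_object` call above removes only the one key uploaded earlier. If the bucket holds anything else, such as a converted CSV, deleting the bucket fails because it is not empty. A minimal sketch for emptying it first follows; the helper name `empty_bucket` is hypothetical, and it assumes the bucket is not versioned (versioned buckets also need their object versions removed).

```python
def empty_bucket(s3_client, bucket):
    """Delete every object in an unversioned bucket; return how many were deleted."""
    paginator = s3_client.get_paginator('list_objects_v2')
    deleted = 0
    for page in paginator.paginate(Bucket=bucket):
        # Pages with no objects have no 'Contents' key.
        keys = [{'Key': obj['Key']} for obj in page.get('Contents', [])]
        if keys:
            # Each page holds at most 1000 keys, the limit for delete_objects.
            s3_client.delete_objects(Bucket=bucket, Delete={'Objects': keys})
            deleted += len(keys)
    return deleted
```

Here it would be called as `empty_bucket(s3client, bucket_name)` before the bucket itself is deleted.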

In [None]:
# Delete the bucket (it must be empty)
response = s3client.delete_bucket(
    Bucket=bucket_name
)
print(response)