## 1. Upload data to S3

First, create an S3 bucket for this experiment, then copy the data from the public location below into it. To make the crawlers' job easier, use two separate prefixes (folders): one for the billing data and one for the reseller data.
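
If you prefer to create the bucket from code rather than the console, the following is a minimal sketch using boto3. The helper names `create_bucket_kwargs` and `create_bucket`, the region, and the bucket name in the usage comment are illustrative assumptions, not part of the original lab.

```python
def create_bucket_kwargs(bucket_name, region):
    """Build the keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and must not be sent as a constraint;
    # every other region needs an explicit LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(s3_client, bucket_name, region):
    """Create the bucket with an existing boto3 S3 client."""
    return s3_client.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Usage (requires AWS credentials; the names below are placeholders):
# import boto3
# s3 = boto3.client("s3", region_name="eu-west-1")
# create_bucket(s3, "my-sales-forecast-bucket", "eu-west-1")
```

Bucket names are globally unique, so the call fails if the name is already taken by another account.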

### Download the data

In [None]:
# your bucket name
your_bucket = 'XXXXXXXXX'

In [None]:
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/billing_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/reseller_sm.csv
!wget https://ml-lab-mggaska.s3.amazonaws.com/sales-forecast/awswrangler-0.0b2-py3.6.egg

In [None]:
import boto3

# Reuse one S3 resource for all uploads. S3 keys always use '/', so write them
# directly instead of using os.path.join, which produces '\' on Windows.
bucket = boto3.resource('s3').Bucket(your_bucket)
bucket.upload_file('billing_sm.csv', 'billing/billing_sm.csv')
bucket.upload_file('reseller_sm.csv', 'reseller/reseller_sm.csv')
bucket.upload_file('awswrangler-0.0b2-py3.6.egg', 'python/awswrangler-0.0b2-py3.6.egg')
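
Before moving on, it can be worth confirming that the three objects landed under the expected prefixes. The sketch below is one way to do that; `list_keys` and `missing_uploads` are hypothetical helper names, and the code assumes each prefix holds fewer than 1,000 objects, so no pagination is needed.

```python
def list_keys(s3_client, bucket_name, prefix):
    """Return the object keys stored under a prefix (first page only)."""
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix=prefix)
    # 'Contents' is absent from the response when nothing matches the prefix.
    return [obj["Key"] for obj in response.get("Contents", [])]


def missing_uploads(s3_client, bucket_name, expected_keys):
    """Return the expected keys that are not present in the bucket."""
    return [
        key for key in expected_keys
        if key not in list_keys(s3_client, bucket_name, key)
    ]


# Usage (requires AWS credentials and the your_bucket variable defined earlier):
# import boto3
# expected = [
#     "billing/billing_sm.csv",
#     "reseller/reseller_sm.csv",
#     "python/awswrangler-0.0b2-py3.6.egg",
# ]
# print(missing_uploads(boto3.client("s3"), your_bucket, expected))
```

An empty list means every file is in place.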


## 2. Create Crawler

To use these CSV files in a Glue ETL job, first create Glue crawlers that point to the location of each file. A crawler infers the data type of each column and registers a table in the Glue Data Catalog. The safest approach is to create one crawler per table, each pointing to its own S3 location.

1. Create an IAM role named GlueCrawlerRole with the AWSGlueServiceRole policy attached.
2. Under Services, go to AWS Glue.
3. Under Crawlers, choose Add crawler and create two crawlers, one for each S3 location (billing and reseller):

    3.1 Name: Billing, Data Store, Specific Path in my Account, navigate to your bucket and its billing folder, use the existing IAM role GlueCrawlerRole, add database implementationdb, Next, Finish

    3.2 After the crawler is created, select Run it now.

    3.3 Name: Reseller, Data Store, Specific Path in my Account, navigate to your bucket and its reseller folder, use the existing IAM role GlueCrawlerRole, select database implementationdb, Next, Finish

    3.4 After the crawler is created, select Run it now.
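
The console steps above can also be scripted. The following is a rough sketch using the boto3 Glue client, not the lab's official method. It assumes the role from step 1 (GlueCrawlerRole) can read your bucket and that the implementationdb database already exists (it can be created with `glue.create_database(DatabaseInput={"Name": "implementationdb"})`). The helper names `crawler_spec` and `create_and_run_crawler` are hypothetical.

```python
def crawler_spec(name, bucket_name, prefix,
                 role="GlueCrawlerRole", database="implementationdb"):
    """Build the create_crawler arguments for one S3 prefix."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        # One target per crawler, so each crawler produces a single table.
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket_name}/{prefix}/"}]},
    }


def create_and_run_crawler(glue_client, spec):
    """Create the crawler, then start it immediately (the 'Run it now' step)."""
    glue_client.create_crawler(**spec)
    glue_client.start_crawler(Name=spec["Name"])


# Usage (requires AWS credentials and the your_bucket variable defined earlier):
# import boto3
# glue = boto3.client("glue")
# for name, prefix in [("Billing", "billing"), ("Reseller", "reseller")]:
#     create_and_run_crawler(glue, crawler_spec(name, your_bucket, prefix))
```

`start_crawler` returns as soon as the run is scheduled, so the tables only appear once each crawler has finished.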

After both crawlers have run, you should see that one table has been added for each. You can use Athena to inspect the tables and double-check that the data was loaded correctly.
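
As a complement to browsing the tables in Athena, you can read the inferred column types straight from the Glue Data Catalog. This is a minimal sketch under the assumption that the database holds only a handful of tables, so the first page of `get_tables` results is enough. `table_schemas` is a hypothetical helper name.

```python
def table_schemas(glue_client, database):
    """Map each table name to a {column name: inferred type} dictionary."""
    response = glue_client.get_tables(DatabaseName=database)
    schemas = {}
    for table in response["TableList"]:
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        schemas[table["Name"]] = {col["Name"]: col.get("Type") for col in columns}
    return schemas


# Usage (requires AWS credentials):
# import boto3
# for name, columns in table_schemas(boto3.client("glue"), "implementationdb").items():
#     print(name, columns)
```

If a column comes back with an unexpected type (for example a date inferred as a string), that is worth knowing before building the ETL job.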

## 3. Create Sample View in Athena

When dealing with medium or large datasets, it is good practice to work on a sample. This lets you run faster experiments for feature engineering, data exploration, and so on.

Under Services > Athena, create the following view in the query editor. It keeps only the billing rows that belong to a random sample of roughly 10% of the resellers.

    CREATE VIEW resellers_sample AS
    SELECT * FROM billing
    WHERE id_reseller IN (
        SELECT DISTINCT id_reseller FROM reseller TABLESAMPLE BERNOULLI(10)
    )
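
To make the sampling logic concrete, here is a small pure-Python sketch of what the view does conceptually: each reseller is kept independently with a 10% probability, and then every billing row for the kept resellers is retained. The function names and the row layout (dictionaries with an `id_reseller` key) are assumptions made for illustration. Athena runs this logic itself, so the code is not needed for the lab.

```python
import random


def bernoulli_sample(values, percent, rng):
    """Keep each value independently with probability percent / 100."""
    return [value for value in values if rng.random() * 100 < percent]


def sample_billing(billing_rows, reseller_ids, percent, seed=None):
    """Keep all billing rows that belong to a Bernoulli sample of resellers."""
    rng = random.Random(seed)
    sampled_ids = set(bernoulli_sample(reseller_ids, percent, rng))
    return [row for row in billing_rows if row["id_reseller"] in sampled_ids]


# Usage with made-up data: 100 resellers, 5 billing rows each.
resellers = list(range(100))
billing = [{"id_reseller": r, "invoice": i} for r in resellers for i in range(5)]
sample = sample_billing(billing, resellers, percent=10, seed=0)
print(len({row["id_reseller"] for row in sample}), "resellers in the sample")
```

Because each reseller is drawn independently, the sample size varies around 10% rather than matching it exactly. A sampled reseller always keeps its full billing history, which matters for time-series features.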
