# Download and split raw data into training and test data    
Christoph Windheuser, ThoughtWorks, June 19, 2020    
     
This notebook must be run in SageMaker Studio. It reads the raw data as a CSV file from a public S3 bucket, splits it into a training set and a test set, and saves both to your own S3 bucket.

## Import the necessary libraries    
pandas is a Python data science library for working with DataFrames.    
boto3 is the Amazon Web Services SDK for Python. It enables Python developers to create, configure, and manage AWS services, such as EC2 and S3.    
S3Uploader and S3Downloader are classes from the SageMaker Python SDK for uploading data to S3 and downloading data from it.

In [16]:
import pandas as pd
import boto3
from sagemaker.s3 import S3Uploader, S3Downloader


## Do some preparations

In [20]:
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page


## Read raw data file

In [27]:
filename     = 'store47-2016.csv'
s3_data_path = 'https://christoph-windheuser-public.s3.amazonaws.com/'
data         = pd.read_csv(s3_data_path + filename)


In [28]:
data.head(10)

Unnamed: 0,id,date,item_nbr,unit_sales,family,class,perishable,transactions,year,month,day,dayofweek,days_til_end_of_data,dayoff
0,88219279,2016-08-16,103520,10.0,GROCERY I,1028,0,3570,2016,8,16,1,364,False
1,88219280,2016-08-16,103665,4.0,BREAD/BAKERY,2712,1,3570,2016,8,16,1,364,False
2,88219281,2016-08-16,105574,9.0,GROCERY I,1045,0,3570,2016,8,16,1,364,False
3,88219282,2016-08-16,105575,45.0,GROCERY I,1045,0,3570,2016,8,16,1,364,False
4,88219283,2016-08-16,105577,8.0,GROCERY I,1045,0,3570,2016,8,16,1,364,False
5,88219284,2016-08-16,105693,2.0,GROCERY I,1034,0,3570,2016,8,16,1,364,False
6,88219285,2016-08-16,105737,6.0,GROCERY I,1044,0,3570,2016,8,16,1,364,False
7,88219286,2016-08-16,105857,14.0,GROCERY I,1092,0,3570,2016,8,16,1,364,False
8,88219287,2016-08-16,106716,13.0,GROCERY I,1032,0,3570,2016,8,16,1,364,False
9,88219288,2016-08-16,108079,2.0,GROCERY I,1030,0,3570,2016,8,16,1,364,False
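The preview shows an `Unnamed: 0` column, which typically appears when a CSV file was written together with its index, and `pd.read_csv` leaves `date` as plain strings unless told otherwise. As a minimal sketch, the block below reads two rows copied from the preview with `index_col=0` and `parse_dates`. The empty first header field in `sample_csv` is an assumption about how the raw file stores that column, and the names `sample_csv` and `sample` are only illustrative.

```python
import io

import pandas as pd

# Two rows copied from the preview above. The first header field is left
# empty here, which is an assumption about the layout of the raw file.
sample_csv = io.StringIO(
    ",id,date,item_nbr,unit_sales,family,class,perishable,transactions,"
    "year,month,day,dayofweek,days_til_end_of_data,dayoff\n"
    "0,88219279,2016-08-16,103520,10.0,GROCERY I,1028,0,3570,2016,8,16,1,364,False\n"
    "1,88219280,2016-08-16,103665,4.0,BREAD/BAKERY,2712,1,3570,2016,8,16,1,364,False\n"
)

# index_col=0 uses the first column as the row index instead of a data column;
# parse_dates converts the 'date' column from strings to datetime values.
sample = pd.read_csv(sample_csv, index_col=0, parse_dates=["date"])

print(sample.dtypes["date"])
print(list(sample.columns)[:3])
```

Whether to parse the dates is a design choice: the rest of this notebook compares `date` against string literals, which also works with unparsed ISO-formatted dates.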


## Split into train and test data set

In [29]:
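# Split at 2017-08-02: the last 14 days of the data set become the test set.
# The 'date' column holds ISO-formatted strings (YYYY-MM-DD), which compare correctly as plain strings.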
data_train = data[data['date'] < '2017-08-02']
data_test  = data[data['date'] >= '2017-08-02']
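The cutoff date above is hard-coded. As a hedged sketch, it could instead be derived from the last date in the data, so that the test window stays at 14 days if the file changes. `split_last_days` is a hypothetical helper, and `demo` is synthetic data: its date range assumes that `days_til_end_of_data` (364 on 2016-08-16 in the preview) counts the days up to the last date in the file.

```python
import pandas as pd


def split_last_days(df, date_col, n_test_days):
    """Hold out the final n_test_days calendar days of df as the test set."""
    dates = pd.to_datetime(df[date_col])
    # First day of the test window: last date minus (n_test_days - 1) days
    cutoff = dates.max() - pd.Timedelta(days=n_test_days - 1)
    return df[dates < cutoff], df[dates >= cutoff], cutoff


# Synthetic stand-in with one row per day from 2016-08-16 to 2017-08-15
# (2016-08-16 plus 364 days); the real file has many rows per day.
demo = pd.DataFrame(
    {"date": pd.date_range("2016-08-16", "2017-08-15").strftime("%Y-%m-%d")}
)
demo["unit_sales"] = 1.0

demo_train, demo_test, cutoff = split_last_days(demo, "date", 14)
print(cutoff.date(), len(demo_train), len(demo_test))
```

Under that assumption the derived cutoff is 2017-08-02, the same date used in the cell above.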


## Save train and test data as CSV files

First, save files locally on the SageMaker instance:

In [30]:
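import os

train_filename = 'store47-2016-train.csv'
test_filename  = 'store47-2016-test.csv'
data_path      = 'data/'

# Create the local output directory if it does not exist yet; to_csv does not create it
os.makedirs(data_path, exist_ok=True)

data_train.to_csv(data_path + train_filename, index=False)
data_test.to_csv(data_path + test_filename, index=False)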


Then, save to your S3 bucket:

In [31]:
sess = boto3.Session()


In [32]:
account_id = sess.client('sts', region_name=sess.region_name).get_caller_identity()["Account"]
bucket = 'sagemaker-studio-{}-{}'.format(sess.region_name, account_id)
prefix = 'demandforecast-rf'

try:
    if sess.region_name == "us-east-1":
        sess.client('s3').create_bucket(Bucket=bucket)
    else:
        sess.client('s3').create_bucket(Bucket=bucket, 
                                        CreateBucketConfiguration={'LocationConstraint': sess.region_name})
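except Exception as e:
    # Creation usually fails because the bucket already exists; print the error so other causes stay visible
    print("Could not create bucket {} ({}). Assuming it already exists. Uploading the data files...".format(bucket, e))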

# Print the S3 URIs of the uploaded files, so they can be reviewed or used elsewhere
# data_path already ends with '/', so it is concatenated directly with the file name
s3url = S3Uploader.upload(data_path + train_filename, 's3://{}/{}/{}'.format(bucket, prefix, 'train'))
print(s3url)
s3url = S3Uploader.upload(data_path + test_filename, 's3://{}/{}/{}'.format(bucket, prefix, 'test'))
print(s3url)


'upload_data' method will be deprecated in favor of 'S3Uploader' class (https://sagemaker.readthedocs.io/en/stable/s3.html#sagemaker.s3.S3Uploader) in SageMaker Python SDK v2.


s3://sagemaker-studio-us-east-1-261586618408/demandforecast-rf/train/store47-2016-train.csv
s3://sagemaker-studio-us-east-1-261586618408/demandforecast-rf/test/store47-2016-test.csv
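
Joining local paths and S3 destinations by hand is easy to get subtly wrong when one part already ends in a slash. The sketch below builds them with `posixpath.join` from the standard library and reproduces the training URI printed above. `studio_bucket_name` and `s3_uri` are hypothetical helpers, not part of boto3 or the SageMaker SDK, and the region and account ID are simply the values visible in the printed output.

```python
import posixpath


def studio_bucket_name(region, account_id):
    # Same naming pattern as used for the bucket in the cell above
    return "sagemaker-studio-{}-{}".format(region, account_id)


def s3_uri(bucket, *parts):
    # posixpath.join puts a single '/' between parts and does not double it
    # when a part already ends in '/'. Parts must not start with '/', because
    # join would then discard everything before that part.
    return "s3://" + posixpath.join(bucket, *parts)


# Region and account ID as they appear in the printed output above
bucket_name = studio_bucket_name("us-east-1", "261586618408")
train_uri = s3_uri(bucket_name, "demandforecast-rf", "train", "store47-2016-train.csv")

# The same join also handles the local path, with or without a trailing slash
local_train = posixpath.join("data/", "store47-2016-train.csv")

print(train_uri)
print(local_train)
```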
