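- Install the dependencies for this Jupyter notebook: the `kaggle` command-line tool (to download the dataset) and `jq` (to parse JSON output from the AWS CLI)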

In [7]:
!pip install kaggle
!sudo apt-get update
!sudo apt-get install -y jq

Get:1 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Hit:2 http://archive.ubuntu.com/ubuntu bionic InRelease                        
Get:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease [88.7 kB]
Get:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease [74.6 kB]
Fetched 252 kB in 1s (257 kB/s)   
Reading package lists... Done
Reading package lists... Done
Building dependency tree       
Reading state information... Done
jq is already the newest version (1.5+dfsg-2).
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
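
Before moving on, it can help to confirm that the command-line tools used by the later cells are actually on the `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical and not part of the original notebook.

```python
import shutil

def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# The later cells call the `kaggle`, `jq`, and `aws` command-line tools.
print(missing_tools(["kaggle", "jq", "aws"]))
```

An empty list means all three tools were found; any name that is printed still needs to be installed.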


- Upload your own `kaggle.json` using the SageMaker Studio file browser on the left, then...
- Move `kaggle.json` to `/root/.kaggle/` on the SageMaker Studio notebook instance, and restrict its permissions so that other users cannot read it
- Download and unzip the santander-customer-transaction-prediction dataset to the notebook instance (https://www.kaggle.com/c/santander-customer-transaction-prediction/data)

In [8]:
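# First run only: uncomment the next three lines to install the Kaggle API token
#!mkdir -p /root/.kaggle
#!mv kaggle.json /root/.kaggle/
#!chmod 600 /root/.kaggle/kaggle.json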
!kaggle competitions download -c santander-customer-transaction-prediction
!unzip ./santander-customer-transaction-prediction.zip

Downloading santander-customer-transaction-prediction.zip to /root
 98%|███████████████████████████████████████▎| 246M/250M [00:03<00:00, 73.3MB/s]
100%|████████████████████████████████████████| 250M/250M [00:04<00:00, 59.5MB/s]
Archive:  ./santander-customer-transaction-prediction.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               
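
The credential setup described above (move `kaggle.json` into `/root/.kaggle/` and restrict its permissions) can also be done from Python rather than the shell. This is a sketch only: the function name `install_kaggle_credentials` is hypothetical, and the default paths are the ones mentioned in the steps above.

```python
import os
import shutil
from pathlib import Path

def install_kaggle_credentials(src="kaggle.json", dest_dir="/root/.kaggle"):
    """Move the Kaggle API token into dest_dir and restrict it to the owner."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.move(str(src), str(dest))
    os.chmod(dest, 0o600)  # owner read/write only, same as `chmod 600`
    return dest
```

Calling `install_kaggle_credentials()` once, right after uploading the file, should be enough; the token stays in place for later runs on the same instance.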


- Take a quick look at `train.csv`

In [9]:
import pandas as pd
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


For CSV training data, Amazon SageMaker's built-in algorithms (XGBoost, for example) expect a file with no header record and with the target variable in the first column. So let's:
- read `train.csv` into a pandas dataframe (done in the previous cell)
- drop the `ID_code` column, which leaves `target` as the first column
- drop the header row
- overwrite `train.csv` with this new format (the original version is still in the .zip if we need it)

In [10]:
df.drop('ID_code', axis=1, inplace=True)
df.to_csv('train.csv', header=False, index=False)
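
The cell above relies on `target` already being the first column once `ID_code` is dropped. A slightly more defensive variant, sketched below, moves the target to the front explicitly before writing. The helper name `to_sagemaker_csv`, the output file name `sample_train.csv`, and the two-row sample frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def to_sagemaker_csv(df, path, target="target", drop=("ID_code",)):
    """Write df as a header-less CSV with the target in the first column."""
    features = [c for c in df.columns if c != target and c not in drop]
    out = df[[target] + features]
    out.to_csv(path, header=False, index=False)
    return out

# Tiny illustrative frame with the same column layout as train.csv.
sample = pd.DataFrame({
    "ID_code": ["train_0", "train_1"],
    "target": [0, 1],
    "var_0": [8.9255, 11.5006],
})
to_sagemaker_csv(sample, "sample_train.csv")
```

Writing to a separate file, as here, also avoids overwriting the original `train.csv` while experimenting.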

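- Double-check that the new `train.csv` file has the expected format. The file no longer has a header, so a plain `pd.read_csv` uses the first data row as the column labels; seeing data values in the header row below confirms that the header is gone (pass `header=None` to get integer column labels instead)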

In [11]:
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,-4.92,...,4.4354,3.9642,3.1364,1.6909999999999998,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
0,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,3.1468,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
1,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,-4.9193,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
2,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,-5.8609,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
3,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,6.2654,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104
4,0,11.4763,-2.3182,12.608,8.6264,10.9621,3.5609,4.5322,15.2255,3.5855,...,-6.3068,6.6025,5.2912,0.4403,14.9452,1.0314,-3.6241,9.767,12.5809,-4.7602
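
The same check can be made programmatically by reading the file with `header=None`, so that no data row is consumed as column labels. The sketch below is one possible way to do it; `check_sagemaker_csv` is a hypothetical helper, and the "binary first column" test assumes the target only takes the values 0 and 1, as in the rows shown above.

```python
import pandas as pd

def check_sagemaker_csv(path, n_rows=1000):
    """Read the first rows without a header and sanity-check the layout."""
    head = pd.read_csv(path, header=None, nrows=n_rows)
    # With header=None the columns are labelled 0, 1, 2, ...
    first_col_is_binary = set(head[0].unique()) <= {0, 1}
    # A leftover header row would turn every column into strings.
    all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in head.dtypes)
    return {
        "n_columns": head.shape[1],
        "first_col_is_binary": first_col_is_binary,
        "all_numeric": all_numeric,
    }
```

For this dataset, `check_sagemaker_csv('train.csv')` would be expected to report 201 columns (the target plus `var_0` through `var_199`), with both flags true.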


- Create a temporary S3 bucket for this project
- Upload `train.csv` to S3

In [12]:
%%bash
# The account ID makes the bucket name globally unique
AWS_ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account')
BUCKET="smstudio-santander-$AWS_ACCOUNT_ID"
# Assumes AWS_REGION is set in the environment; LocationConstraint must be omitted for us-east-1
aws s3api create-bucket --bucket "$BUCKET" --region "$AWS_REGION" --create-bucket-configuration LocationConstraint="$AWS_REGION"
aws s3 cp ./train.csv "s3://$BUCKET/train.csv"

{
    "Location": "http://smstudio-santander-735164016588.s3.amazonaws.com/"
}
upload: ./train.csv to s3://smstudio-santander-735164016588/train.csv
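
For readers who prefer to stay in Python, the bucket creation and upload above could be sketched with `boto3` instead of the AWS CLI. This is an assumption-laden sketch rather than the notebook's own method: the helper names are hypothetical, `boto3` and valid AWS credentials are assumed to be available, and the region must be passed in by the caller. The one detail worth noting is that S3 rejects a `LocationConstraint` of `us-east-1`, so the configuration is omitted for that region.

```python
def bucket_name(account_id, prefix="smstudio-santander"):
    """Build the bucket name; the account ID keeps it globally unique."""
    return f"{prefix}-{account_id}"

def create_bucket_kwargs(bucket, region):
    """Arguments for S3 create_bucket; us-east-1 takes no LocationConstraint."""
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs

def upload_training_data(region, local_path="train.csv", key="train.csv"):
    """Create the project bucket and upload the training file to it."""
    import boto3  # imported lazily so the helpers above work without it

    account_id = boto3.client("sts").get_caller_identity()["Account"]
    bucket = bucket_name(account_id)
    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

Calling `upload_training_data(region)` with the notebook's region would then return the S3 URI of the uploaded file.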
