- Install dependencies for this Jupyter notebook

In [None]:
!pip install kaggle
!sudo apt-get update
!sudo apt-get install -y jq

- Upload your own `kaggle.json` using the SageMaker Studio File Browser to the left, then...
- Move `kaggle.json` to `/root/.kaggle/` on the SageMaker Studio notebook instance, and change its permissions so that other users cannot read it
- Download and unzip the santander-customer-transaction-prediction dataset to the notebook instance (https://www.kaggle.com/c/santander-customer-transaction-prediction/data)

In [None]:
#%bash
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c santander-customer-transaction-prediction
!unzip ./santander-customer-transaction-prediction.zip
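
The credential steps above can also be done from Python instead of shell commands. The following is only a minimal sketch: `install_kaggle_credentials` is a hypothetical helper name, and the default destination of `~/.kaggle` is an assumption that matches `/root/.kaggle` only when the notebook runs as root.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_credentials(src, dest_dir=None):
    """Move kaggle.json into dest_dir and make it readable only by its owner.

    Hypothetical helper; dest_dir defaults to ~/.kaggle (an assumption).
    """
    src = Path(src)
    dest_dir = Path(dest_dir) if dest_dir is not None else Path.home() / ".kaggle"
    # Create the target directory first; a plain move fails if it is missing.
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "kaggle.json"
    shutil.move(str(src), str(dest))
    # Owner read/write only, equivalent to `chmod 600`.
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)
    return dest
```

Calling `install_kaggle_credentials("kaggle.json")` from the directory where the file was uploaded should leave it in place for the `kaggle` command.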

- Take a quick look at `train.csv`

In [None]:
import pandas as pd
df = pd.read_csv('train.csv')
df.head()

Amazon SageMaker requires a CSV file with the target variable in the first column. So let's:
- read `train.csv` into a pandas dataframe
- drop the `ID_code` column, which leaves the target variable as the first column
- overwrite `train.csv` with this new format (the original version is still in the .zip if we need it)

In [None]:
df.drop('ID_code', axis=1, inplace=True)
df.to_csv('train.csv', index=False)

- Double-check that the new `train.csv` file now has the expected format

In [None]:
df = pd.read_csv('train.csv')
df.head()
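
Besides eyeballing `df.head()`, the format can be checked programmatically. The sketch below reads only the header row, so it stays cheap even on a large file. `first_column` is a hypothetical helper, and the label column name `target` is an assumption based on the Kaggle dataset description, not something this notebook states.

```python
import csv


def first_column(csv_path):
    """Return the name of the first column by reading only the header row."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return header[0]
```

For example, `assert first_column("train.csv") == "target"` would fail loudly if `ID_code` were still the first column.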

- Create a temporary S3 bucket for this project
- Upload `train.csv` to S3

In [None]:
%%bash
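AWS_ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account') # account ID makes the bucket name globally unique
# Use AWS_REGION if set, else AWS_DEFAULT_REGION, else the region from the AWS CLI configuration
AWS_REGION=${AWS_REGION:-${AWS_DEFAULT_REGION:-$(aws configure get region)}}
BUCKET="smstudio-santander-$AWS_ACCOUNT_ID"
# Note: in us-east-1, omit --create-bucket-configuration (S3 rejects an explicit LocationConstraint there)
aws s3api create-bucket --bucket "$BUCKET" --region "$AWS_REGION" --create-bucket-configuration LocationConstraint="$AWS_REGION"
aws s3 cp ./train.csv "s3://$BUCKET/train.csv"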
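
The same bucket creation and upload can be sketched with `boto3` instead of the AWS CLI. This is an illustration under assumptions rather than part of the original notebook: `create_bucket_kwargs` and `upload_training_data` are hypothetical helper names, and the sketch assumes the notebook's role may create buckets and that the session has a region configured. It also handles `us-east-1`, where S3 does not accept an explicit `LocationConstraint`.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for the S3 create_bucket call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and rejects an explicit LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def upload_training_data(path="train.csv", prefix="smstudio-santander"):
    """Create the project bucket and upload the training file (hypothetical helper)."""
    import boto3  # imported here so the helper above works without boto3 installed

    session = boto3.session.Session()
    region = session.region_name  # assumption: a region is configured for the session
    # The account ID makes the bucket name globally unique.
    account_id = session.client("sts").get_caller_identity()["Account"]
    bucket = f"{prefix}-{account_id}"
    s3 = session.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket, region))
    s3.upload_file(path, bucket, "train.csv")
    return f"s3://{bucket}/train.csv"
```

Calling `upload_training_data()` in a notebook cell would return the S3 URI of the uploaded file. Re-running it after the bucket already exists may raise an error from `create_bucket`, which this sketch does not handle.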