- Install dependencies for this Jupyter notebook

In [None]:
!pip install kaggle
!sudo apt-get update
!sudo apt-get install -y jq

- Log in to Kaggle and grab your API credentials (`kaggle.json`) from your Account Settings
- Upload your own `kaggle.json` using the SageMaker Studio File Browser to the left, then...
- Move `kaggle.json` to `/root/.kaggle/` on the SageMaker Studio notebook instance, and restrict its permissions so that other users cannot read it
- Download and unzip the santander-customer-transaction-prediction dataset from Kaggle to the notebook instance

In [6]:
!mkdir -p /root/.kaggle
!mv kaggle.json /root/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!kaggle competitions download -c santander-customer-transaction-prediction
!unzip ./santander-customer-transaction-prediction.zip

Downloading santander-customer-transaction-prediction.zip to /root
 99%|███████████████████████████████████████▌| 247M/250M [00:03<00:00, 66.1MB/s]
100%|████████████████████████████████████████| 250M/250M [00:04<00:00, 64.4MB/s]
Archive:  ./santander-customer-transaction-prediction.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


- Take a quick look at `train.csv`

In [7]:
import pandas as pd
df = pd.read_csv('train.csv')
df.head()

Unnamed: 0,ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
0,train_0,0,8.9255,-6.7863,11.9081,5.093,11.4607,-9.2834,5.1187,18.6266,...,4.4354,3.9642,3.1364,1.691,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
1,train_1,0,11.5006,-4.1473,13.8588,5.389,12.3622,7.0433,5.6208,16.5338,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.356,1.9518
2,train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
3,train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.925,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
4,train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104


Notice that the `ID_code` column is neither a feature nor the target variable, so let's:
- drop the `ID_code` column
- overwrite `train.csv` with the new format (the original file is still in the .zip if we need it)

In [8]:
df = df.drop(columns='ID_code')
df.to_csv('train.csv', index=False)

- Double-check that the new `train.csv` file has the expected format

In [None]:
df = pd.read_csv('train.csv')
df.head()
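The visual check above can also be made explicit. The following is a minimal sketch, assuming only the column names shown earlier (`ID_code`, `target`); the helper name `check_training_header` is hypothetical and not part of the original notebook. It reads just the header row with the standard `csv` module, so it stays fast even on a large file.

```python
import csv


def check_training_header(path, target="target", dropped="ID_code"):
    """Return the header row of `path` after verifying its layout.

    Raises ValueError if the dropped column is still present or the
    target column is missing.
    """
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    if dropped in header:
        raise ValueError(f"{dropped!r} is still present in {path}")
    if target not in header:
        raise ValueError(f"{target!r} column is missing from {path}")
    return header
```

Calling `check_training_header('train.csv')` after the overwrite should return a header that starts with `target` followed by the `var_*` columns.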

- Create a temporary S3 bucket for this project
- Upload `train.csv` to S3

In [11]:
%%bash
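AWS_ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account') # account ID makes the bucket name globally unique
AWS_REGION=${AWS_REGION:-$(aws configure get region)} # fall back to the CLI's configured region if the variable is unset
BUCKET="smstudio-santander-$AWS_ACCOUNT_ID"
# note: drop --create-bucket-configuration when the region is us-east-1
aws s3api create-bucket --bucket "$BUCKET" --region "$AWS_REGION" --create-bucket-configuration LocationConstraint="$AWS_REGION"
aws s3 cp ./train.csv "s3://$BUCKET/train.csv"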

{
    "Location": "http://smstudio-santander-792047125340.s3.amazonaws.com/"
}
upload: ./train.csv to s3://smstudio-santander-792047125340/train.csv
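The bucket-creation step above always passes a `LocationConstraint`, which S3 rejects when the region is `us-east-1`. Below is a hedged sketch of the same step in Python with `boto3`, which this notebook uses later. The helper names `create_bucket_kwargs` and `create_bucket` are hypothetical, and the code assumes your credentials are allowed to create buckets.

```python
def create_bucket_kwargs(bucket, region):
    """Build keyword arguments for S3 create_bucket.

    us-east-1 is the default location and must not be passed as a
    LocationConstraint; every other region needs it.
    """
    kwargs = {"Bucket": bucket}
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket, region):
    """Create the bucket in the given region (performs a real AWS call)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    return s3.create_bucket(**create_bucket_kwargs(bucket, region))
```

For example, `create_bucket('smstudio-santander-<account-id>', 'us-east-2')` would mirror the shell cell, with `<account-id>` standing in for your own account ID.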


## screencap of SMS AutoML steps

In [None]:
import boto3

client = boto3.client('sagemaker')


In [None]:
response = client.list_training_jobs_for_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='tuning-job-1-3ec69a12130e443a86',
    StatusEquals='Completed',
    SortBy='FinalObjectiveMetricValue',
    SortOrder='Descending',
    MaxResults=1
)

In [None]:
response['TrainingJobSummaries'][0]['TrainingJobName']
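Rather than copying the printed job name by hand into later cells, the lookup can be wrapped in a small helper. This is a sketch only: `best_training_job_name` is a hypothetical name, and it assumes the response was requested with `SortBy='FinalObjectiveMetricValue'` and `SortOrder='Descending'` as above, which puts the best job first only when the tuning objective is being maximized.

```python
def best_training_job_name(response):
    """Return the name of the first training job in a tuning-job listing.

    `response` is the dict returned by
    list_training_jobs_for_hyper_parameter_tuning_job.
    """
    summaries = response.get("TrainingJobSummaries", [])
    if not summaries:
        raise LookupError("no completed training jobs were returned")
    return summaries[0]["TrainingJobName"]
```

The result could then be assigned to `model_name` in place of the hard-coded string.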

In [None]:
# Reuse the best training job's name as the model name
model_name = 'tuning-job-1-3ec69a12130e443a86-177-fb23b88c'
info = client.describe_training_job(TrainingJobName=model_name)
model_data = info['ModelArtifacts']['S3ModelArtifacts']  # S3 URI of the trained model artifact
print(model_data)

primary_container = {
    'Image': '257758044811.dkr.ecr.us-east-2.amazonaws.com/sagemaker-xgboost:0.90-1-cpu-py3',
    'ModelDataUrl': model_data
}

create_model_response = client.create_model(
    ModelName=model_name,
    ExecutionRoleArn='arn:aws:iam::273210948404:role/service-role/AmazonSageMaker-ExecutionRole-20200328T115688',
    PrimaryContainer=primary_container)

print(create_model_response['ModelArn'])

In [None]:
!pip install 'sagemaker[local]' --upgrade

In [None]:
from sagemaker.transformer import Transformer
transformer = Transformer(model_name='tuning-job-1-3ec69a12130e443a86-177-fb23b88c',
                          instance_count=1,
                          instance_type='ml.m4.xlarge',
                          assemble_with='Line',
                          max_payload=1
                         )

In [None]:
transformer.transform('s3://smstudio-santander-273210948404/test.csv', content_type='text/csv', split_type='Line')

In [None]:
%%bash
AWS_ACCOUNT_ID=$(aws sts get-caller-identity | jq -r '.Account') #account ID ensures s3 bucket name is globally unique
BUCKET="smstudio-santander-$AWS_ACCOUNT_ID"
#aws s3api create-bucket --bucket $BUCKET --region $AWS_REGION --create-bucket-configuration LocationConstraint=$AWS_REGION
aws s3 cp ./test.csv s3://$BUCKET/test.csv
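One caveat about the upload above: as downloaded, `test.csv` still has a header row and the `ID_code` column. To my understanding, the built-in XGBoost container expects CSV inference input with neither a header nor non-feature columns, so treat that as an assumption to verify. The sketch below strips both with the standard `csv` module; the function name `write_inference_csv` and the output file name are hypothetical.

```python
import csv


def write_inference_csv(src, dst, id_column="ID_code"):
    """Copy `src` to `dst` without the header row and without `id_column`.

    Returns the removed IDs in file order so predictions can be
    matched back to them later.
    """
    ids = []
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout, lineterminator="\n")
        header = next(reader)           # consume and discard the header row
        idx = header.index(id_column)   # position of the ID column
        for row in reader:
            ids.append(row.pop(idx))    # remove the ID, keep only the features
            writer.writerow(row)
    return ids
```

A possible usage would be `ids = write_inference_csv('test.csv', 'test_features.csv')`, followed by uploading `test_features.csv` and pointing the transform job at that object instead.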

In [None]:
import csv

with open('test.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    line_count = 0
    for row in csv_reader:
        if line_count == 0:
            print(f'Column names are {", ".join(row)}')
            line_count += 1
        else:
            print(f'\t{row[0]}: first feature values are {", ".join(row[1:4])}')
            line_count += 1
        if line_count >= 10:
            break
    print(f'Processed {line_count} lines.')

In [None]:
row