# Using an Amazon SageMaker built-in algorithm to predict fraud

# Part 1: Preprocessing data
If you are interested in learning about data preprocessing, start here. Otherwise, you can skip to Part 2, where we load the data from npy files.

In [None]:
#imports
import boto3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import io
import sagemaker.amazon.common as smac
import os
import sagemaker
from sagemaker.predictor import csv_serializer, json_deserializer
import re

### Add your user number to the name variable

In [None]:
name = "sagemaker-user-" # For example: sagemaker-user-1
if name.endswith('-'):
    raise Exception('you must add your user number. For example: sagemaker-user-1')

In [None]:
# run me
# get the data from s3
region = boto3.session.Session().region_name
bucket = 'sagemaker-eu-west-1-483308273948'
original_key = 'visa-kaggle/original.csv'
protocol = "s3://"
datafile = 'data/original.csv'
prefix = name + '/dataset'
 
# get sagemaker IAM role
role = sagemaker.get_execution_role()

# start sagemaker session
sagemaker_session = sagemaker.Session()

Before starting, spend a minute getting familiar with the following tools:  
1. [Pandas](https://bit.ly/2y1DVeI) - A data manipulation and analysis library.
2. [NumPy](https://en.wikipedia.org/wiki/NumPy) - A Python library for multi-dimensional arrays and matrices.

# Part 1: Data ingestion, exploration and preparation
We're going to do the following main steps:
1. Downloading the file locally to the notebook instance
2. Loading the file into [Pandas](https://bit.ly/2y1DVeI) for inspection
3. Converting the data to [NumPy](https://en.wikipedia.org/wiki/NumPy)
4. Shuffling the data - think about why this is required
5. Breaking up the dataset into data and label - the label is what we're trying to predict. Think about why this is required
6. Splitting the data into training and test datasets - [Read why here](https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets)

### 1. Downloading the file to a local folder

In [None]:
!mkdir -p data
with open(datafile, 'wb') as f:
    boto3.client('s3').download_fileobj(bucket, original_key, f)

### 2. Loading data into pandas for inspection
Read the local CSV and understand its format.
Use pandas to load the file:

```python
df = pd.read_csv(datafile)
```

and read the first 5 lines using:

```python
df.head(5)
```

In [None]:
# loading data into pandas for inspection


The table above has 31 columns.  
The 'Class' column tells whether the transaction was fraud or not. This is what we'd like to predict.  
Each transaction has 30 fields that the algorithm will learn from: Time, Amount, and the V1-V28 columns, whose names and values have been anonymized (the algorithm won't care). 
  
Using Pandas, we can get more info about the data.
```python 
print(df['Class'].value_counts())
``` 
This shows us that there are 284,807 records in this dataset, but only 492 of them are fraud.
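
To put those counts in perspective, the short sketch below computes the fraud rate and the accuracy that a trivial "always not fraud" model would reach. It uses only the two figures quoted above; the variable names are illustrative and not part of the notebook.

```python
# Counts quoted above for this dataset.
total_records = 284807
fraud_records = 492

# Share of transactions that are fraud.
fraud_rate = fraud_records / total_records

# Accuracy of a baseline that labels every transaction as non-fraud.
baseline_accuracy = 1 - fraud_rate

print("fraud rate: {:.4%}".format(fraud_rate))
print("always-non-fraud accuracy: {:.4%}".format(baseline_accuracy))
```

Because the classes are this imbalanced, a high accuracy figure on its own says little about how many frauds the model actually catches.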

Let's visualize this with:
```python 
df['Class'].value_counts().plot(kind='pie')
```

In [None]:
# Use the command above to examine the data


### 3. Converting pandas to numpy
We'll convert the data to numpy for data shuffling, data manipulation and extracting the labels from the dataset.

Run 
```python
raw_data = df.values
``` 
to convert the df to numpy.  
```python 
raw_data.shape
``` 
will print the shape of the matrix as (rows, columns).

In [None]:
# Converting Data Into Numpy


### 4+5. Shuffling the data and splitting it into data and labels

By shuffling the data, we make sure that the order of the records will not affect the learning process.  

Using 
```python
np.random.seed(123)
``` 
we can configure the NumPy random generator to use a constant seed so that results are reproducible.

Shuffle the data:
```python
np.random.shuffle(raw_data)
```

Then split the data into the 30 explanatory columns and the column we'll predict, 'Class':

```python
label = raw_data[:, -1]
data = raw_data[:, :-1]
```

Let's make sure that both have the same number of records:
```python
print("label_shape = {}; data_shape= {}".format(label.shape, data.shape))
```
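
As a quick sanity check on the shuffle-then-split idea, the sketch below runs the same calls on a small made-up array. The array `toy` and its contents are assumptions for illustration only; the point is that `np.random.shuffle` reorders whole rows, so each label stays attached to its own features.

```python
import numpy as np

# Toy stand-in for raw_data: two feature columns plus a label column.
# The label is the parity of the first feature, so pairing is easy to verify.
toy = np.array([[i, 10 * i, i % 2] for i in range(6)], dtype='float64')

np.random.seed(123)       # constant seed, so the shuffle order is repeatable
np.random.shuffle(toy)    # shuffles rows in place; columns within a row stay together

toy_label = toy[:, -1]    # last column
toy_data = toy[:, :-1]    # every column except the last

# Each label should still match the parity of its row's first feature.
rows_intact = np.array_equal(toy_label, toy_data[:, 0] % 2)
print(toy_data.shape, toy_label.shape, rows_intact)
```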

In [None]:
# Shuffling the data and splitting between data and label


### 6. Splitting data into training and test datasets
In this example we'll use 60% of the data for training and 40% for testing.
Training data is used to train the model. Once the model is trained, test data is used to evaluate its accuracy. Test data is never used during training.

We can get the training dataset size using:
```python
train_size = int(data.shape[0]*0.6)
```

We'll split both the data and the labels into training and test sets:
```python
train_data  = data[:train_size, :]
test_data = data[train_size:, :]

train_label = label[:train_size]
test_label = label[train_size:]
```

We'll verify the shapes:
```python
print("training data shape= {}; training label shape = {} \ntest data shape= {}; test label shape = {}".format(train_data.shape, train_label.shape,test_data.shape,test_label.shape))
```
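
The slicing above can be tried out on toy data first. The sketch below wraps it in a hypothetical helper, `split_train_test` (not part of the notebook or of any library), and also works out the sizes that a 60/40 split gives for the 284,807 records mentioned earlier.

```python
import numpy as np

def split_train_test(data, label, train_fraction=0.6):
    """Hypothetical helper: slice already-shuffled arrays into train and test parts."""
    train_size = int(data.shape[0] * train_fraction)
    train = (data[:train_size, :], label[:train_size])
    test = (data[train_size:, :], label[train_size:])
    return train, test

# Made-up data: 10 records with 2 features each.
toy_data = np.arange(20, dtype='float32').reshape(10, 2)
toy_label = np.arange(10, dtype='float32') % 2

toy_train, toy_test = split_train_test(toy_data, toy_label)
print(toy_train[0].shape, toy_test[0].shape)

# The same arithmetic applied to the full dataset size.
full_train_size = int(284807 * 0.6)
full_test_size = 284807 - full_train_size
print(full_train_size, full_test_size)
```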

In [None]:
#Splitting data into test and training and breaking dataset into data and label


Create the train and test sets:
```python
train_set = (train_data, train_label)
test_set = (test_data, REPLACE_ME)
```
Replace REPLACE_ME with the relevant variable.

In [None]:
# create train and test sets


# Part 2: Training
In this part we load the data from pre-processed files and train the model.


## Data Conversion
[Amazon Built-in Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/algos.html) support CSV and recordIO-wrapped protobuf data formats. In this example we'll work with [recordio](https://github.com/eclesh/recordio).  

```python
vectors = np.array([t.tolist() for t in train_set[0]]).astype('float32')
labels = np.array([t.tolist() for t in train_set[1]]).astype('float32')

buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, vectors, labels)
buf.seek(0)
```

In [None]:
# Data Conversion


## Upload training data
Now that we've created our recordIO-wrapped protobuf, we'll need to upload it to S3, so that Amazon SageMaker training can use it.

Upload to S3:
```python
key = 'recordio-pb-data'
boto3.resource('s3').Bucket(sagemaker_session.default_bucket()).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(sagemaker_session.default_bucket(), prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
```
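
To see how the object key and the `s3://` URI in the snippet above line up, here is a small sketch that builds both without calling AWS. The helper `build_s3_locations` and the bucket name `example-bucket` are made up for illustration. It uses `posixpath.join` so the separator is always `/`, whereas `os.path.join` follows the local operating system's convention.

```python
import posixpath

def build_s3_locations(bucket_name, prefix, channel, key):
    """Hypothetical helper: return (object key, s3:// URI) for an uploaded file."""
    object_key = posixpath.join(prefix, channel, key)
    s3_uri = 's3://{}/{}'.format(bucket_name, object_key)
    return object_key, s3_uri

# Placeholder values; in the notebook the bucket comes from sagemaker_session.default_bucket().
object_key, s3_uri = build_s3_locations('example-bucket',
                                        'sagemaker-user-1/dataset',
                                        'train',
                                        'recordio-pb-data')
print(object_key)
print(s3_uri)
```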

In [None]:
# Upload training data


Let's also set up an S3 output location for the model artifact that training will produce.
```python
output_location = 's3://{}/{}/output'.format(sagemaker_session.default_bucket(), prefix)
print('Training artifacts will be uploaded to: {}'.format(output_location))
```

In [None]:
# Output location


# Training the model
At this point we use the linear learner from the Amazon built-in algorithms. The Docker image containing the algorithm is available in multiple regions. We take the following steps:
1. Define the containers
2. Create an Estimator object and pass it the hyperparameters as well as the model output location
3. Run Estimator.fit to begin training the model

SageMaker uses one of these prebuilt containers for the linear-learner built-in algorithm:
```python
containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
              'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',
              'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',
              'eu-west-1': '438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest'}
```

In [None]:
# Create the containers dictionary


Use the Estimator class to create a new SageMaker training job:
```python
sess = sagemaker.Session()
linear = sagemaker.estimator.Estimator(containers[region],
                                       role, # IAM role, so the training job can read the data and upload the model
                                       train_instance_count=1, #number of instances for training - leave at 1
                                       train_instance_type='ml.m5.large', # type of training instance
                                       output_path=output_location, #S3 location for the trained model
                                       sagemaker_session=sess,
                                       base_job_name='linear-learner-' + name)
```
Set the [hyperparameters](https://bit.ly/2D1qaLE). You'll need to use black magic and call up some data scientist friends to come up with the optimal values. [This might help](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSpVTGTOxh8D5xu0_ZYcvL4dHvGYVCP3piIcu6_j58xz_1kBnaLEA).
Alternatively, we'll discuss Automatic Model Tuning later today.
```python
l1 = REPLACE_ME # Choose a value between 0.1 and 0.9
mini_batch_size = REPLACE_ME # Choose a value between 100 and 5,000
use_bias = REPLACE_ME # Choose True or False
learning_rate = REPLACE_ME # Choose a value between 0.001 and 0.9

linear.set_hyperparameters(feature_dim=30, # dataset has 30 columns (features)
                           predictor_type='binary_classifier',
                           l1=l1,
                           mini_batch_size=mini_batch_size,
                           use_bias=use_bias,
                           learning_rate=learning_rate)
```
Pass the S3 location of the training data to the training job and start training:
```python
linear.fit({'train': s3_train_data})
```

This will start a SageMaker job that will launch one or more instances to train the model. You can monitor the job in the [SageMaker AWS console](https://console.aws.amazon.com/sagemaker/home?region=us-east-1#/jobs). This should take about 2 minutes.

In [None]:
# Use the Estimator class to create a new SageMaker training job


## Training - Record how well your model did
Scroll to the end of the training output. To see how well training went, search for the last occurrence of the *train binary_classification_accuracy* value.  
Share the results with your teammates [here](https://docs.google.com/spreadsheets/d/1BNrvQHq1wAWjyBlN1t0v6KXmnVSzPIp73tmh_WMlHSc/edit#gid=0).

# Hosting the model
We use SageMaker to host the live model by calling deploy on the estimator we defined previously. This action creates a dockerized environment using ECS and permits autoscaling. 

```python
linear_predictor = linear.deploy(initial_instance_count=1, #Initial number of instances. 
                                                           #Autoscaling can increase the number of instances.
                                 instance_type='ml.t2.medium',# instance type
                                 name='linear-learner-' + name)
```

In [None]:
# Host a SageMaker endpoint for prediction


# Prediction
deploy returns a live endpoint (linear_predictor). Predictors in SageMaker accept CSV and JSON. In this case we send requests as CSV and deserialize the JSON responses.

Configure the prediction endpoint:
```python
linear_predictor.content_type = 'text/csv'
linear_predictor.serializer = csv_serializer
linear_predictor.deserializer = json_deserializer
```
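
To make the request and response formats more concrete, the sketch below mimics them locally with NumPy and the standard library. It is not the SDK's own serializer code: the helpers `rows_to_csv` and `labels_from_response` are hypothetical, and the response string is made up, shaped only after the `predictions` / `predicted_label` keys that the confusion-matrix code in this notebook reads.

```python
import io
import json
import numpy as np

def rows_to_csv(rows):
    """Illustrative only: render a 2-D array as CSV text, one record per line."""
    buf = io.StringIO()
    np.savetxt(buf, rows, delimiter=',', fmt='%g')
    return buf.getvalue()

def labels_from_response(body):
    """Illustrative only: pull the predicted labels out of a JSON response body."""
    return [p['predicted_label'] for p in json.loads(body)['predictions']]

# Two made-up records with three features each (the real data has 30).
sample_rows = np.array([[0.5, -1.25, 3.0], [2.0, 0.0, 1.5]])
csv_payload = rows_to_csv(sample_rows)
print(csv_payload)

# Made-up response body, not real endpoint output.
fake_response = '{"predictions": [{"predicted_label": 0.0}, {"predicted_label": 1.0}]}'
predicted_labels = labels_from_response(fake_response)
print(predicted_labels)
```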

Predict a single item:
```python
print(train_set[0][48:49])
print("The data's actual label: " + str(train_set[1][48:49][0]))
linear_predictor.predict(train_set[0][48:49])
```

Create a [confusion matrix](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) from the test data:
```python
non_zero = np.count_nonzero(test_set[1])
zero = len(test_set[1]) - non_zero
print("test set includes: {} non-zero items and {} items with value zero".format(non_zero, zero))

predictions = []
for array in np.array_split(test_set[0], 100):
    result = linear_predictor.predict(array)
    predictions += [r['predicted_label'] for r in result['predictions']]

predictions = np.array(predictions)

import pandas as pd

pd.crosstab(test_set[1], predictions, rownames=['actuals'], colnames=['predictions'])
```

In [None]:
# Create a confusion matrix

Try to explain what each of the four numbers in the confusion matrix represents.
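
One way to check your explanation is to count the four cells by hand on a tiny example. The sketch below does that with NumPy on made-up labels, assuming 1 means fraud and 0 means not fraud, and then derives precision and recall from the counts.

```python
import numpy as np

# Made-up labels for illustration: 1 = fraud, 0 = not fraud.
actuals = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
predicted = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])

# The four cells of the confusion matrix.
true_negatives = int(np.sum((actuals == 0) & (predicted == 0)))   # legitimate, predicted legitimate
false_positives = int(np.sum((actuals == 0) & (predicted == 1)))  # legitimate, flagged as fraud
false_negatives = int(np.sum((actuals == 1) & (predicted == 0)))  # fraud that was missed
true_positives = int(np.sum((actuals == 1) & (predicted == 1)))   # fraud that was caught

# Precision: of the transactions flagged as fraud, how many really were fraud.
precision = true_positives / (true_positives + false_positives)
# Recall: of the real frauds, how many were flagged.
recall = true_positives / (true_positives + false_negatives)

print(true_negatives, false_positives, false_negatives, true_positives)
print("precision = {:.3f}, recall = {:.3f}".format(precision, recall))
```

On a dataset with very few frauds, precision and recall are usually more telling than overall accuracy, since the true-negative cell dominates the total.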

# Delete the endpoint
If you're done with this notebook, please run the delete_endpoint line in the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on.

In [None]:
import sagemaker
sagemaker.Session().delete_endpoint(linear_predictor.endpoint)