# Customer Revenue Prediction

## PyTorch LSTM Model
*Machine Learning Nanodegree Program | Capstone Project*

---

In this notebook I create a PyTorch LSTM model and compare it with the baseline model I created earlier.

### Overview:
- Reading the data
- Preparing the tensors for the PyTorch model
- Initializing the LSTM model
- Training the model with the train dataset
- Validating the model using the val dataset
- Predicting the revenue for each customer in the test dataset
- Visualizing the results
- Comparing the results with the baseline model
- Saving the results to a CSV file

First, import the relevant libraries into the notebook.

In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sagemaker
import boto3

from os import path
from sklearn.metrics import mean_squared_error

%matplotlib inline

In [16]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()

prefix = 'sagemaker/capstone-project'

Couldn't call 'get_role' to get Role ARN from role name arn:aws:iam::201308845573:root to get Role path.


ValueError: The current AWS identity is not a role: arn:aws:iam::201308845573:root, therefore it cannot be used as a SageMaker execution role

Set the paths for the training, validation and test files, and for storing the LSTM predictions.

In [17]:
data_dir = '../datasets'

if not path.exists(data_dir):
    raise Exception('{} directory not found.'.format(data_dir))

train_file = '{}/{}'.format(data_dir, 'train.zip')
print('\nTrain file: {}'.format(train_file))

val_file = '{}/{}'.format(data_dir, 'val.zip')
print('\nValidation file: {}'.format(val_file))

pred_val_file = '{}/{}'.format(data_dir, 'lstm_pred_val.zip')
print('\nValidation Prediction file: {}'.format(pred_val_file))

test_file = '{}/{}'.format(data_dir, 'test.zip')
print('\nTest file: {}'.format(test_file))

pred_test_file = '{}/{}'.format(data_dir, 'lstm_pred_test.zip')
print('\nTest Prediction file: {}'.format(pred_test_file))

imp_features_file = '{}/{}'.format(data_dir, 'lstm_importances-01.png')
print('\nImportant Features file: {}'.format(imp_features_file))

input_data = sagemaker_session.upload_data(path=train_file, bucket=bucket, key_prefix=prefix)
print('\nInput data S3 path: {}'.format(input_data))


Train file: ../datasets/train.zip

Validation file: ../datasets/val.zip

Validation Prediction file: ../datasets/lstm_pred_val.zip

Test file: ../datasets/test.zip

Test Prediction file: ../datasets/lstm_pred_test.zip

Important Features file: ../datasets/lstm_importances-01.png


NameError: name 'bucket' is not defined

In [None]:
empty_check = []

for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('Test passed!')

Define a helper function that loads a dataset from a zipped CSV file.

In [5]:
def load_data(zip_path):
    df = pd.read_csv(
        zip_path,
        dtype={'fullVisitorId': 'str'},
        compression='zip'
    )
    
    rows, columns = df.shape

    print('Loaded {} rows with {} columns from {}.'.format(
        rows, columns, zip_path
    ))
    
    return df
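
The `dtype={'fullVisitorId': 'str'}` argument keeps the visitor IDs as text. A small sketch of why that matters, using plain Python: the first ID is a hypothetical value with leading zeros, and the second is taken from the `train_df.head()` output in this notebook.

```python
# Hypothetical ID with leading zeros: parsing it as an integer drops them.
visitor_id = '0027294437909732085'
as_int = int(visitor_id)
print(str(as_int) == visitor_id)

# A 19-digit ID does not survive a round trip through a 64-bit float,
# which is what happens if the column is ever coerced to float.
big_id = '4763447161404445595'
as_float = float(big_id)
print(int(as_float) == int(big_id))
```

Both comparisons are false, so an ID read as a number can no longer be matched reliably against the original identifier.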

Load the train, validation and test datasets.

In [7]:
%%time

train_df = load_data(train_file)
val_df = load_data(val_file)
test_df = load_data(test_file)

print()

Loaded 765707 rows with 26 columns from ../datasets/train.zip.
Loaded 137946 rows with 26 columns from ../datasets/val.zip.
Loaded 804684 rows with 25 columns from ../datasets/test.zip.

CPU times: user 8.98 s, sys: 681 ms, total: 9.66 s
Wall time: 10.9 s


In [10]:
train_df.head()

Unnamed: 0,totals.transactionRevenue,fullVisitorId,channelGrouping,device.browser,device.deviceCategory,device.isMobile,device.operatingSystem,geoNetwork.city,geoNetwork.continent,geoNetwork.country,...,trafficSource.keyword,trafficSource.medium,trafficSource.source,trafficSource.referralPath,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,visitNumber,visitStartTime
0,0.0,1131660440785968503,0.571429,0.284483,0.0,0.0,0.869565,0.395178,0.4,0.92511,...,0.113337,0.8,0.416834,1.0,1.0,0.0,1.0,0.0,0.0,0.088405
1,0.0,377306020877927890,0.571429,0.353448,0.0,0.0,0.26087,0.579665,1.0,0.048458,...,0.113337,0.8,0.416834,1.0,1.0,0.0,1.0,0.0,0.0,0.089979
2,0.0,3895546263509774583,0.571429,0.284483,0.0,0.0,0.869565,0.496855,0.6,0.814978,...,0.113337,0.8,0.416834,1.0,1.0,0.0,1.0,0.0,0.0,0.089512
3,0.0,4763447161404445595,0.571429,0.672414,0.0,0.0,0.217391,0.579665,0.4,0.409692,...,0.298089,0.8,0.416834,1.0,1.0,0.0,1.0,0.0,0.0,0.090012
4,0.0,27294437909732085,0.571429,0.284483,0.5,1.0,0.0,0.579665,0.6,0.955947,...,0.113337,0.8,0.416834,1.0,1.0,0.0,0.0,0.0,0.002538,0.088159


For the LSTM model, the labels must be separated from the features. The _**fullVisitorId**_ only identifies the customer and is not used for training. So I drop _**fullVisitorId**_ and _**totals.transactionRevenue**_ from the training and validation datasets and store them separately, so that I can evaluate the results at a later stage. The revenue is also log-transformed with `np.log1p`. The test dataset has no revenue column, so only _**fullVisitorId**_ needs to be dropped from it.

In [14]:
train_id = train_df['fullVisitorId'].values
val_id = val_df['fullVisitorId'].values
test_id = test_df['fullVisitorId'].values

train_y = train_df['totals.transactionRevenue'].values
train_log_y = np.log1p(train_y)

val_y = val_df['totals.transactionRevenue'].values
val_log_y = np.log1p(val_y)

train_X = train_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
val_X = val_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
test_X = test_df.drop(['fullVisitorId'], axis=1)

array([[0.57142857, 0.28448276, 0.        , ..., 0.        , 0.        ,
        0.08840489],
       [0.57142857, 0.35344828, 0.        , ..., 0.        , 0.        ,
        0.08997852],
       [0.57142857, 0.28448276, 0.        , ..., 0.        , 0.        ,
        0.08951173],
       ...,
       [1.        , 0.22413793, 0.5       , ..., 0.04273504, 0.        ,
        0.42664789],
       [1.        , 0.28448276, 0.        , ..., 0.04487179, 0.        ,
        0.42874861],
       [1.        , 0.28448276, 0.5       , ..., 0.06410256, 0.        ,
        0.42816706]])
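
The cell above stores the labels on a log scale with `np.log1p`. A minimal sketch of what that transform does, how to invert it, and how an RMSE on the log scale could be computed when comparing against the baseline. The revenue and prediction values are hypothetical, and the standard `math` module is used here in place of NumPy to keep the sketch dependency-free; `rmse` is an illustrative helper, equivalent to `np.sqrt(mean_squared_error(y_true, y_pred))`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical revenues; most visits in this kind of data generate none.
revenue = [0.0, 25.0, 1000.0]

# log1p(x) = log(1 + x) maps 0 to 0 and compresses large values.
log_revenue = [math.log1p(v) for v in revenue]

# expm1 is the inverse, used to turn predictions back into revenue.
restored = [math.expm1(v) for v in log_revenue]

# Hypothetical predictions on the log scale.
log_pred = [0.0, 3.0, 7.0]
print(rmse(log_revenue, log_pred))
```

Scoring on the log scale keeps a few very large purchases from dominating the error.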

In [12]:
header = pd.MultiIndex.from_product(
    [['Raw','Transformed'], ['Rows','Columns']],
    names=['Type','Dataset']
)

shape_df = pd.DataFrame(
    [train_df.shape + train_X.shape, val_df.shape + val_X.shape, test_df.shape + test_X.shape], 
    index=['Train', 'Validation', 'Test'], 
    columns=header
)

shape_df.style.set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center')]}
])

Type,Raw,Raw,Transformed,Transformed
Dataset,Rows,Columns,Rows,Columns
Train,765707,26,765707,24
Validation,137946,26,137946,24
Test,804684,25,804684,24
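
Before training, the 24 feature columns have to be shaped into the three-dimensional input an LSTM expects: (batch, sequence length, features). A minimal sketch of that reshaping, assuming each visit is treated as a sequence of length 1. That assumption, the helper names `to_lstm_input` and `shape_of`, and the sample rows are illustrative only and are not taken from the training script.

```python
def to_lstm_input(rows):
    """Wrap each feature row as a sequence of length 1: (batch, 1, features)."""
    return [[list(row)] for row in rows]


def shape_of(nested):
    """Return the (batch, seq_len, features) shape of a nested list."""
    return (len(nested), len(nested[0]), len(nested[0][0]))


# Two hypothetical rows with three features each.
rows = [
    [0.571429, 0.284483, 0.0],
    [0.571429, 0.353448, 0.0],
]

batch = to_lstm_input(rows)
print(shape_of(batch))
```

Plain lists keep the sketch dependency-free. In the notebook the same reshape could be written as `train_X.values.reshape(-1, 1, 24)`, followed by `torch.from_numpy` to obtain a tensor.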


In [None]:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='lstm_train.py',
    source_dir='models/pytorch',
    role=role,
    sagemaker_session=sagemaker_session,
    framework_version='1.2',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    hyperparameters={
        'input_features': 3,
        'epochs': 100
    }
)

In [None]:
estimator.fit({'train': input_data})

In [None]:
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    framework_version='1.2',
    entry_point='predict.py',
    source_dir='models/pytorch'
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

In [None]:
# read in test data, assuming it is stored locally
# `path` is os.path (imported above); the `os` module itself is not imported
test_data = pd.read_csv(path.join(data_dir, "test.csv"), header=None, names=None)

# labels are in the first column
test_y = test_data.iloc[:,0]
test_x = test_data.iloc[:,1:]

In [None]:
def delete_endpoint(predictor):
    try:
        predictor.delete_endpoint()
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        print('Already deleted: {}'.format(predictor.endpoint))
        

delete_endpoint(predictor)