# Customer Revenue Prediction

## PyTorch LSTM Model
*Machine Learning Nanodegree Program | Capstone Project*

---

In this notebook I create a PyTorch LSTM model and compare it with the baseline model I created earlier.

### Overview:
- Reading the data
- Preparing the tensors for the PyTorch Model
- Initializing the LSTM model
- Training the model with the train dataset
- Validating the model using the validation dataset
- Predicting the revenue for customers in the test dataset
- Visualizing the results
- Comparing the results with the baseline model
- Saving the results to a CSV file

First, import the relevant libraries into the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sagemaker
import boto3

from os import path
from sklearn.metrics import mean_squared_error

%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.10f' % x)

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()

prefix = 'sagemaker/capstone-project'

print(bucket)

sagemaker-us-east-1-201308845573


Set the paths for the training, validation, and test files, and for storing the LSTM predictions.

In [3]:
data_dir = '../datasets'

if not path.exists(data_dir):
    raise Exception('{} directory not found.'.format(data_dir))

train_file = '{}/{}'.format(data_dir, 'train.zip')
print('\nTrain file: {}'.format(train_file))

val_file = '{}/{}'.format(data_dir, 'val.zip')
print('\nValidation file: {}'.format(val_file))

pred_val_file = '{}/{}'.format(data_dir, 'lstm_pred_val.zip')
print('\nValidation Prediction file: {}'.format(pred_val_file))

test_file = '{}/{}'.format(data_dir, 'test.zip')
print('\nTest file: {}'.format(test_file))

pred_test_file = '{}/{}'.format(data_dir, 'lstm_pred_test.zip')
print('\nTest Prediction file: {}'.format(pred_test_file))

imp_features_file = '{}/{}'.format(data_dir, 'lstm_importances-01.png')
print('\nImportant Features file: {}'.format(imp_features_file))

input_s3_train_file = sagemaker_session.upload_data(path=train_file, bucket=bucket, key_prefix=prefix)
print('\nInput data S3 Train file: {}'.format(input_s3_train_file))

input_s3_dir = 's3://{}/{}'.format(bucket, prefix)
print('\nInput data S3 directory: {}'.format(input_s3_dir))



Train file: ../datasets/train.zip

Validation file: ../datasets/val.zip

Validation Prediction file: ../datasets/lstm_pred_val.zip

Test file: ../datasets/test.zip

Test Prediction file: ../datasets/lstm_pred_test.zip

Important Features file: ../datasets/lstm_importances-01.png

Input data S3 Train file: s3://sagemaker-us-east-1-201308845573/sagemaker/capstone-project/train.zip

Input data S3 directory: s3://sagemaker-us-east-1-201308845573/sagemaker/capstone-project


In [4]:
empty_check = []

for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('\nTest passed!')

sagemaker/capstone-project/train.zip

Test passed!


Helper function to load a dataset from a zipped CSV file.

In [5]:
def load_data(zip_path):
    """Read a zipped CSV into a DataFrame and report its shape."""
    df = pd.read_csv(
        zip_path,
        dtype={'fullVisitorId': 'str'},  # keep visitor IDs as strings
        compression='zip'
    )

    rows, columns = df.shape

    print('Loaded {} rows with {} columns from {}.'.format(
        rows, columns, zip_path
    ))

    return df
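
The `dtype={'fullVisitorId': 'str'}` argument is worth a note: visitor IDs are long digit strings, and parsing them as numbers can drop leading zeros or lose precision. A minimal illustration in plain Python, using made-up ID values that are not taken from the dataset:

```python
# Hypothetical visitor IDs, for illustration only.
padded_id = "0000436683523507380"   # leading zeros
long_id = "1234567890123456789"     # 19 digits

# Integer parsing silently drops the leading zeros.
as_int = str(int(padded_id))

# A 64-bit float cannot represent every 19-digit integer exactly.
as_float = int(float(long_id))

print(as_int == padded_id)        # False: the zeros are gone
print(as_float == int(long_id))   # False: the low digits were rounded
```

Keeping the column as a string avoids both problems and makes the later `groupby('fullVisitorId')` operate on the exact IDs.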

Load the train, validation and test datasets.

In [6]:
%%time

train_df = load_data(train_file)
val_df = load_data(val_file)
test_df = load_data(test_file)

print()

Loaded 765707 rows with 26 columns from ../datasets/train.zip.
Loaded 137946 rows with 26 columns from ../datasets/val.zip.
Loaded 804684 rows with 25 columns from ../datasets/test.zip.

CPU times: user 11.7 s, sys: 446 ms, total: 12.1 s
Wall time: 11.9 s


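##### Preparing the tensors (local check)

Quick local check of the tensor preparation. This is a scratch cell and does not affect the SageMaker training job. The target is log-transformed with `log1p`, and the features are reshaped to `(rows, 1, features)`, i.e. a sequence length of 1 per row.

import torch

# Target: log1p of the transaction revenue, as a 1-D float tensor
train_y = np.log1p(train_df['totals.transactionRevenue'].values)
train_y = torch.from_numpy(train_y).float().squeeze()

# Features: every column except the target and the visitor ID
train_X = train_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1).values
train_X = torch.from_numpy(train_X).float()

# Add a sequence dimension of length 1: (rows, features) -> (rows, 1, features)
train_X = train_X.reshape(train_X.shape[0], 1, train_X.shape[1])

train_df.head()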
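
The target is transformed with `log1p` here, and predictions are mapped back with `expm1` later in the notebook. A small sketch of that round trip using only the standard library (the revenue values are invented):

```python
import math

# Invented raw revenue values, for illustration only.
raw_revenue = [0.0, 15_000_000.0, 250_000_000.0]

# log1p(x) = log(1 + x) keeps zero at zero and compresses large values.
log_revenue = [math.log1p(r) for r in raw_revenue]

# expm1 is the inverse, so values predicted in log space can be converted back.
recovered = [math.expm1(v) for v in log_revenue]

for raw, logged, back in zip(raw_revenue, log_revenue, recovered):
    print(f"{raw:>14.1f} -> {logged:8.4f} -> {back:>14.1f}")
```

Because zero revenue maps to exactly zero in log space, clipping negative predictions to zero before applying `expm1` yields a predicted revenue of zero rather than a negative amount.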

In [7]:
from sagemaker.pytorch import PyTorch

output_path = 's3://{}/{}'.format(bucket, prefix)

estimator = PyTorch(
    entry_point='lstm_train.py',
    source_dir='../models/pytorch/',
    role=role,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    framework_version='1.2',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    hyperparameters={
        'input_dim': 24,
        'epochs': 50,
        'batch-size': 1024
    }
)
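
In SageMaker script mode, the `hyperparameters` dictionary is passed to the entry point as command-line arguments. The sketch below shows how a training script could read them with `argparse`. The function name and default values are hypothetical, and the real `lstm_train.py` may define its arguments differently.

```python
import argparse


def parse_hyperparameters(argv):
    # Hypothetical parser; the actual lstm_train.py is not shown in this notebook.
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dim", type=int, default=24)
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--batch-size", type=int, default=512)
    return parser.parse_args(argv)


# Roughly what the training container would pass for the estimator above.
args = parse_hyperparameters(
    ["--input_dim", "24", "--epochs", "50", "--batch-size", "1024"]
)

# argparse exposes "--batch-size" as the attribute batch_size.
print(args.input_dim, args.epochs, args.batch_size)
```

Note that the keys mix underscores (`input_dim`) and hyphens (`batch-size`); whichever spelling is used in the notebook has to match the argument names declared in the script.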

In [8]:
estimator.fit({'train': input_s3_dir})

2020-01-27 19:57:39 Starting - Starting the training job...
2020-01-27 19:57:41 Starting - Launching requested ML instances......
2020-01-27 19:58:47 Starting - Preparing the instances for training......
2020-01-27 19:59:51 Downloading - Downloading input data...
2020-01-27 20:00:24 Training - Downloading the training image...
2020-01-27 20:00:53 Training - Training image download completed. Training in progress.
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-01-27 20:00:54,106 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-01-27 20:00:54,109 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-27 20:00:54,123 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-01-27 20:00:54,124 sagemaker_pytorch_container.training INFO     Invoking user training script

Epoch: 31... RSMELoss: 1.8789861983...
Epoch: 32... RSMELoss: 1.8783266422...
Epoch: 33... RSMELoss: 1.8778205877...
Epoch: 34... RSMELoss: 1.8779503449...
Epoch: 35... RSMELoss: 1.8807111416...
Epoch: 36... RSMELoss: 1.8780939732...
Epoch: 37... RSMELoss: 1.8799312819...
Epoch: 38... RSMELoss: 1.8788687614...
Epoch: 39... RSMELoss: 1.8795332761...
Epoch: 40... RSMELoss: 1.8786097400...
Epoch: 41... RSMELoss: 1.8800388330...
Epoch: 42... RSMELoss: 1.8784713654...
Epoch: 43... RSMELoss: 1.8801017018...
Epoch: 44... RSMELoss: 1.8805722302...
Epoch: 45... RSMELoss: 1.8782014070...
Epoch: 46... RSMELoss: 1.8785339845...
Epoch: 47... RSMELoss: 1.8783615955...
Epoch: 48... RSMELoss: 1.8793040953...
Epoch: 49... RSMELoss: 1.8785098575...
Epoch: 50... RSMELoss: 1.8791619895...
2020-01-27 20:19:33,110 sagemaker-c

In [9]:
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    framework_version='1.2',
    entry_point='lstm_predict.py',
    source_dir='../models/pytorch'
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------------!

In [10]:
def get_batches(dataset, size=1024):
    for i in range(0, len(dataset), size):  
        yield dataset[i:(i + size)] 
        
def predict_batch(predictor, dataset):
    pred_arr = np.array([])
    
    for next_batch in get_batches(dataset):
        temp_pred = predictor.predict(next_batch)
        
        pred_arr = np.append(pred_arr, temp_pred)
    
    return pred_arr
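
The helpers above send the data to the endpoint in slices of 1024 rows, presumably to keep each request payload small. The sketch below reproduces the same batching logic against a stand-in predictor so it can be run without a deployed endpoint. `EchoPredictor` and `predict_in_batches` are hypothetical names, and the sketch collects results in a list instead of calling `np.append` per batch.

```python
def get_batches(dataset, size=1024):
    # Yield consecutive slices of at most `size` rows.
    for i in range(0, len(dataset), size):
        yield dataset[i:i + size]


class EchoPredictor:
    """Hypothetical stand-in for the endpoint: returns one value per row."""

    def __init__(self):
        self.calls = 0

    def predict(self, batch):
        self.calls += 1
        return [float(row) for row in batch]


def predict_in_batches(predictor, dataset, size=1024):
    # Collect per-batch results in one list, preserving row order.
    predictions = []
    for batch in get_batches(dataset, size):
        predictions.extend(predictor.predict(batch))
    return predictions


stub = EchoPredictor()
rows = list(range(2500))
preds = predict_in_batches(stub, rows)

print(stub.calls)   # 3 requests: 1024 + 1024 + 452 rows
print(len(preds))   # 2500
```

The last batch is simply shorter than the others, so no padding is needed, and the output order matches the input order, which the later pairing with `fullVisitorId` relies on.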

In [11]:
val_id = val_df['fullVisitorId'].values
val_y = val_df['totals.transactionRevenue'].values

val_X = val_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1).values
val_X = val_X.reshape(val_X.shape[0], 1, val_X.shape[1])

pred_val = predict_batch(predictor, val_X)

pred_val[pred_val < 0] = 0

pred_val_data = {
    'fullVisitorId': val_id,
    'transactionRevenue': val_y,
    'predictedRevenue': np.expm1(pred_val)
}

pred_val_df = pd.DataFrame(pred_val_data)

# Sum actual and predicted revenue per visitor (a list of columns needs double brackets)
pred_val_df = pred_val_df.groupby('fullVisitorId')[['transactionRevenue', 'predictedRevenue']].sum().reset_index()

pred_val_df.head()

|   | fullVisitorId | transactionRevenue | predictedRevenue |
|---|---|---|---|
| 0 | 62267706107999 | 0.0 | 0.0 |
| 1 | 85059828173212 | 0.0 | 0.0 |
| 2 | 26722803385797 | 0.0 | 0.0 |
| 3 | 436683523507380 | 0.0 | 0.0 |
| 4 | 450371054833295 | 0.0 | 0.0 |
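
The validation cell above does three things to the session-level predictions: it clips negative values to zero, converts them back from log space with `expm1`, and sums them per visitor. A minimal sketch of the same steps in plain Python, with invented visitor IDs and prediction values:

```python
import math
from collections import defaultdict

# Invented session-level rows: (visitor id, prediction in log1p space).
session_preds = [
    ("A", -0.3),   # negative prediction, clipped to 0
    ("A", 2.0),
    ("B", 0.0),
    ("B", 1.0),
]

revenue_per_visitor = defaultdict(float)
for visitor_id, log_pred in session_preds:
    clipped = max(log_pred, 0.0)                             # revenue cannot be negative
    revenue_per_visitor[visitor_id] += math.expm1(clipped)   # back to raw scale, then sum

for visitor_id, total in sorted(revenue_per_visitor.items()):
    print(visitor_id, round(total, 4))
```

The order matters: the sum is taken on the raw revenue scale, and the logarithm is applied again only when the per-visitor totals are scored.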


In [12]:
# Root mean squared error (RMSE) on the log1p of the per-visitor revenue
rmse_val = np.sqrt(
    mean_squared_error(
        np.log1p(pred_val_df['transactionRevenue'].values),
        np.log1p(pred_val_df['predictedRevenue'].values)
    )
)

print('\nRMSE for validation data set: {:.10f}\n'.format(rmse_val))


RMSE for validation data set: 2.1530207390
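
The score above is the root mean squared error between `log1p` of the actual and `log1p` of the predicted per-visitor revenue. A worked example of that calculation using only the standard library; the function name `rmse_log1p` and the input values are invented for illustration:

```python
import math


def rmse_log1p(actual, predicted):
    """Root mean squared error between log1p(actual) and log1p(predicted)."""
    if len(actual) != len(predicted):
        raise ValueError("inputs must have the same length")
    squared = [
        (math.log1p(a) - math.log1p(p)) ** 2
        for a, p in zip(actual, predicted)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Invented per-visitor totals on the raw revenue scale.
actual = [0.0, 0.0, math.expm1(3.0)]
predicted = [0.0, math.expm1(1.0), math.expm1(1.0)]

# Differences in log space are 0, -1 and 2, so the result is sqrt(5 / 3).
print(rmse_log1p(actual, predicted))
```

Working in log space means an error on a visitor with zero revenue and an error on a high-spending visitor are compared on a similar scale.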



In [None]:
test_id = test_df['fullVisitorId'].values

# Convert to a NumPy array before reshaping to (rows, 1, features)
test_X = test_df.drop(['fullVisitorId'], axis=1).values
test_X = test_X.reshape(test_X.shape[0], 1, test_X.shape[1])

# Predict on the test features
pred_test = predict_batch(predictor, test_X)

# Clip negative predictions, which are still in log1p space
pred_test[pred_test < 0] = 0

pred_test_data = {
    'fullVisitorId': test_id,
    'predictedRevenue': np.expm1(pred_test)
}

pred_test_df = pd.DataFrame(pred_test_data)

# Sum the predicted revenue per visitor
pred_test_df = pred_test_df.groupby('fullVisitorId')['predictedRevenue'].sum().reset_index()

pred_test_df.head()

In [None]:
pred_val_df.to_csv(pred_val_file, index=False, compression='zip')

pred_test_df.to_csv(pred_test_file, index=False, compression='zip')

In [13]:
def delete_endpoint(predictor):
    # Delete the deployed endpoint so it stops incurring charges
    try:
        predictor.delete_endpoint()
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        print('Already deleted: {}'.format(predictor.endpoint))
        

delete_endpoint(predictor)

Deleted pytorch-inference-2020-01-27-20-21-10-783


In [14]:
# Remove every object in the bucket (training data, model artifacts, source bundles)
bucket_to_delete = boto3.resource('s3').Bucket(bucket)

bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '8914272EA172C0FD',
   'HostId': 'kyXsRCG9ag4OJSDaa9V9zZD5TgTCNPWm0HOfDN2QSCaSVxtV0ir6ZHsqROf4AK1EG1zqeLKzf5Y=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'kyXsRCG9ag4OJSDaa9V9zZD5TgTCNPWm0HOfDN2QSCaSVxtV0ir6ZHsqROf4AK1EG1zqeLKzf5Y=',
    'x-amz-request-id': '8914272EA172C0FD',
    'date': 'Mon, 27 Jan 2020 20:31:30 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'sagemaker/capstone-project/pytorch-training-2020-01-27-19-57-38-851/output/model.tar.gz'},
   {'Key': 'sagemaker/capstone-project/pytorch-training-2020-01-27-19-57-38-851/debug-output/training_job_end.ts'},
   {'Key': 'pytorch-training-2020-01-27-19-57-38-851/source/sourcedir.tar.gz'},
   {'Key': 'pytorch-inference-2020-01-27-20-21-10-343/model.tar.gz'},
   {'Key': 'sagemaker/capstone-project/train.zip'}]}]