# Customer Revenue Prediction

## PyTorch LSTM Model
*Machine Learning Nanodegree Program | Capstone Project*

---

In this notebook I create a PyTorch LSTM model and compare it with the baseline model I created earlier.

### Overview:
- Reading the data
- Preparing the tensors for the PyTorch model
- Initializing the LSTM model
- Training the model with the train dataset
- Validating the model using the val dataset
- Predicting the revenue for customers in the test dataset
- Visualizing the results
- Comparing the results with the baseline model
- Saving the results to a CSV file

First, import the relevant libraries into the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sagemaker
import boto3

from os import path
from sklearn.metrics import mean_squared_error

%matplotlib inline

pd.set_option('display.float_format', lambda x: '%.10f' % x)

In [2]:
# session and role
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

bucket = sagemaker_session.default_bucket()

prefix = 'sagemaker/capstone-project'

print(bucket)

sagemaker-us-east-1-201308845573


Set the paths for the training, validation, and test files, and for storing the LSTM prediction results.

In [3]:
data_dir = '../datasets'

if not path.exists(data_dir):
    raise Exception('{} directory not found.'.format(data_dir))

train_file = '{}/{}'.format(data_dir, 'train.zip')
print('\nTrain file: {}'.format(train_file))

val_file = '{}/{}'.format(data_dir, 'val.zip')
print('\nValidation file: {}'.format(val_file))

pred_val_file = '{}/{}'.format(data_dir, 'lstm_pred_val.zip')
print('\nValidation Prediction file: {}'.format(pred_val_file))

test_file = '{}/{}'.format(data_dir, 'test.zip')
print('\nTest file: {}'.format(test_file))

pred_test_file = '{}/{}'.format(data_dir, 'lstm_pred_test.zip')
print('\nTest Prediction file: {}'.format(pred_test_file))

imp_features_file = '{}/{}'.format(data_dir, 'lstm_importances-01.png')
print('\nImportant Features file: {}'.format(imp_features_file))

input_s3_train_file = sagemaker_session.upload_data(path=train_file, bucket=bucket, key_prefix=prefix)
print('\nInput data S3 Train file: {}'.format(input_s3_train_file))

input_s3_dir = 's3://{}/{}'.format(bucket, prefix)
print('\nInput data S3 directory: {}'.format(input_s3_dir))



Train file: ../datasets/train.zip

Validation file: ../datasets/val.zip

Validation Prediction file: ../datasets/lstm_pred_val.zip

Test file: ../datasets/test.zip

Test Prediction file: ../datasets/lstm_pred_test.zip

Important Features file: ../datasets/lstm_importances-01.png

Input data S3 Train file: s3://sagemaker-us-east-1-201308845573/sagemaker/capstone-project/train.zip

Input data S3 directory: s3://sagemaker-us-east-1-201308845573/sagemaker/capstone-project


In [4]:
empty_check = []

for obj in boto3.resource('s3').Bucket(bucket).objects.all():
    empty_check.append(obj.key)
    print(obj.key)

assert len(empty_check) != 0, 'S3 bucket is empty.'
print('\nTest passed!')

sagemaker/capstone-project/train.zip

Test passed!


Define a helper function that loads a dataset from a zipped CSV file.

In [5]:
def load_data(zip_path):
    df = pd.read_csv(
        zip_path,
        dtype={'fullVisitorId': 'str'},
        compression='zip'
    )
    
    rows, columns = df.shape

    print('Loaded {} rows with {} columns from {}.'.format(
        rows, columns, zip_path
    ))
    
    return df

Load the train, validation and test datasets.

In [6]:
%%time

train_df = load_data(train_file)
val_df = load_data(val_file)
test_df = load_data(test_file)

print()

Loaded 765707 rows with 26 columns from ../datasets/train.zip.
Loaded 137946 rows with 26 columns from ../datasets/val.zip.
Loaded 804684 rows with 25 columns from ../datasets/test.zip.

CPU times: user 11.5 s, sys: 409 ms, total: 11.9 s
Wall time: 11.6 s


In [9]:
import torch

# Target: revenue per session
train_y = train_df['totals.transactionRevenue'].values

# Features: every column except the target and the visitor id
train_X = train_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1).values

train_y = torch.from_numpy(train_y).float().squeeze()
train_X = torch.from_numpy(train_X).float()

# Add a sequence axis of length 1: (rows, features) -> (rows, 1, features)
train_X = train_X.reshape(train_X.shape[0], 1, train_X.shape[1])

train_X

tensor([[[0.5714, 0.2845, 0.0000,  ..., 0.0000, 0.0000, 0.0884]],

        [[0.5714, 0.3534, 0.0000,  ..., 0.0000, 0.0000, 0.0900]],

        [[0.5714, 0.2845, 0.0000,  ..., 0.0000, 0.0000, 0.0895]],

        ...,

        [[1.0000, 0.2241, 0.5000,  ..., 0.0427, 0.0000, 0.4266]],

        [[1.0000, 0.2845, 0.0000,  ..., 0.0449, 0.0000, 0.4287]],

        [[1.0000, 0.2845, 0.5000,  ..., 0.0641, 0.0000, 0.4282]]])
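
The reshape in the cell above turns each row into a one-step sequence, which is the three-dimensional input an LSTM layer expects. The sketch below shows the same shape change on a toy NumPy array; the array values are made up, and it assumes the model in `lstm_train.py` (not shown in this notebook) reads its input as `(batch, sequence, features)`.

```python
import numpy as np

# Toy stand-in for the feature matrix: 3 rows (sessions), 4 features each.
features = np.arange(12, dtype=np.float32).reshape(3, 4)

# Insert a sequence axis of length 1: (rows, features) -> (rows, 1, features).
sequences = features.reshape(features.shape[0], 1, features.shape[1])

print(features.shape)   # (3, 4)
print(sequences.shape)  # (3, 1, 4)
```

Only the shape changes; the values in each row are untouched.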

In [10]:
from sagemaker.pytorch import PyTorch

output_path = 's3://{}/{}'.format(bucket, prefix)

estimator = PyTorch(
    entry_point='lstm_train.py',
    source_dir='../models/pytorch/',
    role=role,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
    framework_version='1.2',
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    hyperparameters={
        'input_dim': 24,
        'epochs': 10,
        'batch-size': 1024
    }
)
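
SageMaker passes the `hyperparameters` dictionary to the entry point as command-line arguments and exposes the model and data locations through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN`. The sketch below shows how a script like `lstm_train.py` might read them. The actual script is not shown in this notebook, so the function name, the `--model-dir` and `--data-dir` argument names, and the fallback paths are assumptions for illustration.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hyperparameters the way a SageMaker entry point typically does."""
    parser = argparse.ArgumentParser()

    # Hyperparameters, named as in the estimator's `hyperparameters` dict.
    parser.add_argument('--input_dim', type=int, default=24)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=1024)

    # Locations provided by the training container (fallbacks are hypothetical).
    parser.add_argument('--model-dir', type=str,
                        default=os.environ.get('SM_MODEL_DIR', '/tmp/model'))
    parser.add_argument('--data-dir', type=str,
                        default=os.environ.get('SM_CHANNEL_TRAIN', '/tmp/data'))

    # Ignore any extra arguments the container may add.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_args(['--input_dim', '24', '--epochs', '10', '--batch-size', '1024'])
print(args.input_dim, args.epochs, args.batch_size)
```

Note that `argparse` exposes `--batch-size` as `args.batch_size`, so the hyphenated key in the estimator still maps to a valid Python attribute.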

In [11]:
estimator.fit({'train': input_s3_dir})

2020-01-27 01:05:25 Starting - Starting the training job...
2020-01-27 01:05:26 Starting - Launching requested ML instances......
2020-01-27 01:06:29 Starting - Preparing the instances for training......
2020-01-27 01:07:36 Downloading - Downloading input data...
2020-01-27 01:08:02 Training - Downloading the training image..
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2020-01-27 01:08:38,445 sagemaker-containers INFO     Imported framework sagemaker_pytorch_container.training
2020-01-27 01:08:38,448 sagemaker-containers INFO     No GPUs detected (normal if no gpus installed)
2020-01-27 01:08:38,461 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2020-01-27 01:08:38,462 sagemaker_pytorch_container.training INFO     Invoking user training script.
2020-01-27 01:08:38,746 sagemaker-containers INFO     Module lstm_train does n

Epoch: 1, BCELoss: 0.017519260105298133
Epoch: 1, BCELoss: 0.017520505211365542
Epoch: 1, BCELoss: 0.01752137093009324
Epoch: 1, BCELoss: 0.017522090730852792
Epoch: 1, BCELoss: 0.0175225703987476
Epoch: 1, BCELoss: 0.017523438685771605
Epoch: 1, BCELoss: 0.017523844267687085
Epoch: 1, BCELoss: 0.01752454666809128
Epoch: 1, BCELoss: 0.017525178829746826
Epoch: 1, BCELoss: 0.01752571679518202
Epoch: 1, BCELoss: 0.017526155268589163
Epoch: 1, BCELoss: 0.01752663350335377
Epoch: 1, BCELoss: 0.017527340512046026
Epoch: 1, BCELoss: 0.017528158244160177
Epoch: 1, BCELoss: 0.017528584265018774
Epoch: 1, BCELoss: 0.017529385497362132
Epoch: 1, BCELoss: 0.01752979792486428
Epoch: 1, BCELoss: 0.017530567801487755
Epoch: 1, BCELoss: 0.017530989229933104
Epoch: 1, BCELoss: 0.01753142935801752
Epoch: 1, BCELoss: 0.01

Epoch: 2, BCELoss: 3.2389868256408114e-05
Epoch: 2, BCELoss: 3.2741037409230916e-05
Epoch: 2, BCELoss: 3.3058497536788015e-05
Epoch: 2, BCELoss: 3.515435479217792e-05
Epoch: 2, BCELoss: 3.545888050435669e-05
Epoch: 2, BCELoss: 3.5615938051151144e-05
Epoch: 2, BCELoss: 3.717864923999971e-05
Epoch: 2, BCELoss: 3.7470061497661325e-05
Epoch: 2, BCELoss: 3.786208048342406e-05
Epoch: 2, BCELoss: 3.9102276719071315e-05
Epoch: 2, BCELoss: 3.93953788665249e-05
Epoch: 2, BCELoss: 3.967032423138651e-05
Epoch: 2, BCELoss: 4.002228222252305e-05
Epoch: 2, BCELoss: 4.082933408689105e-05
Epoch: 2, BCELoss: 4.1144761265200106e-05
Epoch: 2, BCELoss: 4.135198709419567e-05
Epoch: 2, BCELoss: 4.168856075386215e-05
Epoch: 2, BCELoss: 4.195260496386439e-05
Epoch: 2, BCELoss: 4.246807601737085e-05
Epoch: 2, BCELoss: 4.30322386741353e-05


2020-01-27 01:09:25 Uploading - Uploading generated training model
2020-01-27 01:09:25 Completed - Training job completed
Epoch: 2, BCELoss: 0.0004971540359385112
Epoch: 2, BCELoss: 0.0004972415974290613
Epoch: 2, BCELoss: 0.0004982260489683578
Epoch: 2, BCELoss: 0.0004990233814762547
Epoch: 2, BCELoss: 0.0004996579211637677
Epoch: 2, BCELoss: 0.0005002276638574804
Epoch: 2, BCELoss: 0.0005009982750803138
Epoch: 2, BCELoss: 0.0005015710537937949
Epoch: 2, BCELoss: 0.0005021875752920903
Epoch: 2, BCELoss: 0.0005023317806865726
Epoch: 2, BCELoss: 0.0005026952265976958
Epoch: 2, BCELoss: 0.0005030879710781413
Epoch: 2, BCELoss: 0.0005042724992344445
Epoch: 2, BCELoss: 0.0005047003443740488
Epoch: 2, BCELoss: 0.0005052157870629933
Epoch: 2, BCELoss: 0.0005056017952128627
Epoch: 2, BCELoss: 0.0005058125924865076
Epoch: 2, BCELoss: 0.0

Training seconds: 109
Billable seconds: 109


In [44]:
from sagemaker.pytorch import PyTorchModel

model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    framework_version='1.2',
    entry_point='lstm_predict.py',
    source_dir='../models/pytorch'
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')

-------------------!

In [45]:
val_id = val_df['fullVisitorId'].values
val_y = val_df['totals.transactionRevenue'].values

val_X = val_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1).values

val_X = val_X.reshape(val_X.shape[0], 1, val_X.shape[1])

In [74]:
def batch(dataset, size=1024):
    for i in range(0, len(dataset), size):  
        yield dataset[i:(i + size)] 
        
def predict(predictor, dataset):
    pred_arr = np.array([])
    
    for next_batch in batch(dataset):
        temp_pred = predictor.predict(next_batch)
        pred_arr = np.append(pred_arr, temp_pred)
    
    return pred_arr
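
The `batch` generator above slices the dataset into fixed-size chunks, presumably so that each request to the endpoint stays small. The sketch below runs the same slicing logic on a toy list; `EchoPredictor` and `predict_all` are hypothetical stand-ins for the deployed endpoint and the `predict` helper, and the batch size of 2 is chosen only to make the chunks visible.

```python
def batch(dataset, size=1024):
    """Yield consecutive slices of `dataset` with at most `size` rows each."""
    for i in range(0, len(dataset), size):
        yield dataset[i:i + size]


class EchoPredictor:
    """Hypothetical stand-in for the endpoint: returns one value per row."""

    def predict(self, rows):
        return [float(sum(row)) for row in rows]


def predict_all(predictor, dataset, size=1024):
    """Predict chunk by chunk and flatten the results into one list."""
    chunks = [predictor.predict(chunk) for chunk in batch(dataset, size)]
    return [value for chunk in chunks for value in chunk]


rows = [[i, i] for i in range(5)]
sizes = [len(chunk) for chunk in batch(rows, size=2)]
predictions = predict_all(EchoPredictor(), rows, size=2)

print(sizes)        # [2, 2, 1]
print(predictions)  # [0.0, 2.0, 4.0, 6.0, 8.0]
```

The last chunk is shorter when the dataset length is not a multiple of the batch size, and the order of predictions matches the order of the input rows.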

In [75]:
pred_val = predict(predictor, val_X)

pred_val[pred_val < 0] = 0

pred_val_data = {
    'fullVisitorId': val_id,
    'transactionRevenue': val_y,
    'predictedRevenue': pred_val
}

pred_val_df = pd.DataFrame(pred_val_data)

# Select multiple columns with a list; tuple-style selection is not supported in recent pandas
pred_val_df = pred_val_df.groupby('fullVisitorId')[['transactionRevenue', 'predictedRevenue']].sum().reset_index()

pred_val_df.head()

137946 137946


| | fullVisitorId | transactionRevenue | predictedRevenue |
|---|---|---|---|
| 0 | 62267706107999 | 0.0 | 0.0 |
| 1 | 85059828173212 | 0.0 | 0.0 |
| 2 | 26722803385797 | 0.0 | 0.0 |
| 3 | 436683523507380 | 0.0 | 0.0 |
| 4 | 450371054833295 | 0.0 | 0.0 |
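
The validation cell clips negative predictions to zero and then sums revenue per visitor, because a visitor can appear in several sessions. A minimal dependency-free sketch of those two steps follows; the visitor ids and values are made up for illustration, and the notebook itself does this with NumPy masking and a pandas `groupby`.

```python
from collections import defaultdict

# Hypothetical session-level predictions: (visitor id, predicted revenue).
sessions = [('a', 0.5), ('b', -0.2), ('a', 1.5), ('c', 0.0), ('b', 0.7)]

totals = defaultdict(float)
for visitor_id, predicted in sessions:
    # Revenue cannot be negative, so clip before accumulating.
    totals[visitor_id] += max(predicted, 0.0)

print(dict(totals))
```

Clipping before summing matters: visitor `b` ends up with 0.7 rather than 0.5, because the negative session is treated as zero revenue instead of cancelling part of the positive one.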


In [76]:
rmse_val = np.sqrt(
    mean_squared_error(
        pred_val_df['transactionRevenue'].values, 
        pred_val_df['predictedRevenue'].values
    )
)

print('\nRMSE for validation data set: {:.6f}\n'.format(rmse_val))


RMSE for validation data set: 0.003056
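
The root mean squared error is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it by hand on made-up numbers, as a plain-Python counterpart to `np.sqrt(mean_squared_error(...))`; the helper name `rmse` and the sample values are illustrative only.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


actual = [0.0, 0.0, 2.0, 4.0]
predicted = [0.0, 1.0, 2.0, 2.0]

# Squared errors are 0, 1, 0 and 4, so the mean is 1.25.
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.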



In [None]:
test_id = test_df['fullVisitorId'].values
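test_X = test_df.drop(['fullVisitorId'], axis=1).values

# Add the sequence axis, as done for the validation data
test_X = test_X.reshape(test_X.shape[0], 1, test_X.shape[1])

# Predict in batches, as done for the validation data
pred_test = predict(predictor, test_X)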

pred_test[pred_test < 0] = 0

pred_test_data = {
    'fullVisitorId': test_id,
    'predictedRevenue': pred_test
}

pred_test_df = pd.DataFrame(pred_test_data)

pred_test_df = pred_test_df.groupby('fullVisitorId')['predictedRevenue'].sum().reset_index()

pred_test_df.head()

In [None]:
pred_val_df.to_csv(pred_val_file, index=False, compression='zip')

pred_test_df.to_csv(pred_test_file, index=False, compression='zip')

In [77]:
def delete_endpoint(predictor):
    try:
        predictor.delete_endpoint()
        print('Deleted {}'.format(predictor.endpoint))
    except Exception:
        print('Already deleted: {}'.format(predictor.endpoint))
        

delete_endpoint(predictor)

Deleted pytorch-inference-2020-01-27-02-49-26-481


In [78]:
bucket_to_delete = boto3.resource('s3').Bucket(bucket)

bucket_to_delete.objects.all().delete()

[{'ResponseMetadata': {'RequestId': '89AD401C7742A43D',
   'HostId': 'lEtYq2NlUG/6/Ux+UsgwzCC7ZqPoVeaeqSbyxvqRDNg8qM8/TD1UUlZ1hXgSNCPUUB2Akc3bXwg=',
   'HTTPStatusCode': 200,
   'HTTPHeaders': {'x-amz-id-2': 'lEtYq2NlUG/6/Ux+UsgwzCC7ZqPoVeaeqSbyxvqRDNg8qM8/TD1UUlZ1hXgSNCPUUB2Akc3bXwg=',
    'x-amz-request-id': '89AD401C7742A43D',
    'date': 'Mon, 27 Jan 2020 03:11:10 GMT',
    'connection': 'close',
    'content-type': 'application/xml',
    'transfer-encoding': 'chunked',
    'server': 'AmazonS3'},
   'RetryAttempts': 0},
  'Deleted': [{'Key': 'pytorch-inference-2020-01-27-01-11-32-134/model.tar.gz'},
   {'Key': 'pytorch-inference-2020-01-27-02-38-41-180/model.tar.gz'},
   {'Key': 'sagemaker/capstone-project/train.zip'},
   {'Key': 'pytorch-inference-2020-01-27-01-28-47-706/model.tar.gz'},
   {'Key': 'pytorch-inference-2020-01-27-02-22-30-874/model.tar.gz'},
   {'Key': 'pytorch-inference-2020-01-27-01-43-04-438/model.tar.gz'},
   {'Key': 'sagemaker/capstone-project/pytorch-training-2