# Customer Revenue Prediction

## Baseline LightGBM Model
*Machine Learning Nanodegree Program | Capstone Project*

---

In this notebook I create a baseline LightGBM model that serves as a benchmark for evaluating the PyTorch model I will build as part of the project.

### Overview:
- Reading the data
- Initializing the LightGBM model
- Training the model on the train dataset
- Validating the model on the val dataset
- Predicting the revenue for customers in the test dataset
- Visualizing the results
- Saving the baseline results to a CSV file

### Load and prepare Data

First, import the relevant libraries into the notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from lightgbm import LGBMModel
from os import path
from sklearn.metrics import mean_squared_error

%matplotlib inline
pd.set_option('display.float_format', lambda x: '%.10f' % x)

Set the paths for the training, validation, and test files, and for storing the baseline results.

In [2]:
data_dir = '../datasets'

if not path.exists(data_dir):
    raise Exception('{} directory not found.'.format(data_dir))

train_file = '{}/{}'.format(data_dir, 'train.zip')
print('\nTrain file: {}'.format(train_file))

val_file = '{}/{}'.format(data_dir, 'val.zip')
print('\nValidation file: {}'.format(val_file))

pred_val_file = '{}/{}'.format(data_dir, 'lgbm_pred_val.zip')
print('\nValidation Prediction file: {}'.format(pred_val_file))

test_file = '{}/{}'.format(data_dir, 'test.zip')
print('\nTest file: {}'.format(test_file))

pred_test_file = '{}/{}'.format(data_dir, 'lgbm_pred_test.zip')
print('\nTest Prediction file: {}'.format(pred_test_file))

imp_features_file = '{}/{}'.format(data_dir, 'lgbm_importances-01.png')
print('\nImportant Features file: {}'.format(imp_features_file))


Train file: ../datasets/train.zip

Validation file: ../datasets/val.zip

Validation Prediction file: ../datasets/lgbm_pred_val.zip

Test file: ../datasets/test.zip

Test Prediction file: ../datasets/lgbm_pred_test.zip

Important Features file: ../datasets/lgbm_importances-01.png


Define a helper function that loads a dataset from a zipped CSV file. The _**fullVisitorId**_ column is read as a string.

In [3]:
def load_data(zip_path):
    df = pd.read_csv(
        zip_path,
        dtype={'fullVisitorId': 'str'},
        compression='zip'
    )
    
    [rows, columns] = df.shape

    print('Loaded {} rows with {} columns from {}.'.format(
        rows, columns, zip_path
    ))
    
    return df

Load the train, validation and test datasets.

In [4]:
%%time

train_df = load_data(train_file)
val_df = load_data(val_file)
test_df = load_data(test_file)

print()

Loaded 765707 rows with 26 columns from ../datasets/train.zip.
Loaded 137946 rows with 26 columns from ../datasets/val.zip.
Loaded 804684 rows with 25 columns from ../datasets/test.zip.

CPU times: user 9.17 s, sys: 644 ms, total: 9.81 s
Wall time: 10.8 s


In [26]:
train_df.head()

Unnamed: 0,totals.transactionRevenue,fullVisitorId,channelGrouping,device.browser,device.deviceCategory,device.isMobile,device.operatingSystem,geoNetwork.city,geoNetwork.continent,geoNetwork.country,...,trafficSource.keyword,trafficSource.medium,trafficSource.source,trafficSource.referralPath,totals.bounces,totals.hits,totals.newVisits,totals.pageviews,visitNumber,visitStartTime
752,37860000.0,6194193421514403509,0.2857142857,0.2844827586,0.0,0.0,0.2173913043,0.0387840671,0.2,0.9603524229,...,0.6171396772,0.0,0.0,1.0,0.0,0.0200400802,1.0,0.0213675214,0.0,0.0888219012
753,306670000.0,5327166854580374902,0.5714285714,0.2844827586,0.0,0.0,0.2608695652,0.606918239,0.2,0.9603524229,...,0.1133370432,0.8,0.4168336673,1.0,0.0,0.0200400802,0.0,0.0192307692,0.0050761421,0.0888640865
799,68030000.0,8885051388942907862,0.8571428571,0.2844827586,0.0,0.0,0.2173913043,0.606918239,0.2,0.9603524229,...,0.6171396772,1.0,0.7174348697,0.0,0.0,0.0240480962,0.0,0.0213675214,0.0152284264,0.0883102699
802,26250000.0,185467632009737931,0.8571428571,0.2844827586,0.0,0.0,0.8695652174,0.5649895178,0.2,0.9603524229,...,0.6171396772,1.0,0.7174348697,0.0,0.0,0.0240480962,0.0,0.0235042735,0.0126903553,0.0889112683
859,574150000.0,3244885836845029978,0.8571428571,0.2844827586,0.0,0.0,0.2608695652,0.5796645702,0.2,0.9603524229,...,0.6171396772,1.0,0.7174348697,0.0,0.0,0.0320641283,0.0,0.0277777778,0.0076142132,0.0882288086


For the LightGBM model, the labels must be separated from the features. The _**fullVisitorId**_ only identifies the customer and is not used to train the model. So I drop _**fullVisitorId**_ and _**totals.transactionRevenue**_ from the training and validation datasets and store them separately so that I can evaluate the results at a later stage. The revenue labels are also log-transformed with `np.log1p`. From the test dataset, only _**fullVisitorId**_ needs to be dropped.

In [6]:
train_id = train_df['fullVisitorId'].values
val_id = val_df['fullVisitorId'].values
test_id = test_df['fullVisitorId'].values

train_y = train_df['totals.transactionRevenue'].values
train_log_y = np.log1p(train_y)

val_y = val_df['totals.transactionRevenue'].values
val_log_y = np.log1p(val_y)

train_X = train_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
val_X = val_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
test_X = test_df.drop(['fullVisitorId'], axis=1)
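
The `np.log1p` transform applied to the labels compresses large revenue values, and `np.expm1` is its inverse, used later in the notebook to map predictions back to the revenue scale. The following is a minimal sketch of that round trip using only the standard library `math` module. The sample values are a zero plus two revenues from the `train_df.head()` output above, and the variable names are illustrative.

```python
import math

# Sample revenues: a zero plus two values from train_df.head() above.
revenues = [0.0, 37860000.0, 306670000.0]

# log1p(x) = log(1 + x), so a revenue of 0 maps to exactly 0.
log_revenues = [math.log1p(r) for r in revenues]

# expm1(y) = exp(y) - 1 is the inverse, recovering the original scale.
recovered = [math.expm1(y) for y in log_revenues]

for r, y, back in zip(revenues, log_revenues, recovered):
    print('{:>14.1f} -> {:.4f} -> {:.1f}'.format(r, y, back))
```

Because `log1p` is defined at zero, visits without any revenue need no special handling.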

In [7]:
header = pd.MultiIndex.from_product(
    [['Raw','Transformed'], ['Rows','Columns']],
    names=['Type','Dataset']
)

shape_df = pd.DataFrame(
    [train_df.shape + train_X.shape, val_df.shape + val_X.shape, test_df.shape + test_X.shape], 
    index=['Train', 'Validation', 'Test'], 
    columns=header
)

shape_df.style.set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center')]}
])

Type,Raw,Raw,Transformed,Transformed
Dataset,Rows,Columns,Rows,Columns
Train,765707,26,765707,24
Validation,137946,26,137946,24
Test,804684,25,804684,24


### LightGBM Model

I will be using the [LGBMModel](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMModel.html#lightgbm.LGBMModel). The function below creates the model, trains it on the train set while monitoring the val set for early stopping, and predicts the transaction revenue for the val and test datasets.

In [19]:
def lgbm_model(train_X, train_y, val_X, val_y, test_X):
    model = LGBMModel(
        objective='regression',
        metric='rmse',
        n_estimators=1000,
        learning_rate=0.01,
        min_child_samples=100,
        bagging_fraction=0.7,
        feature_fraction=0.5,
        bagging_freq=5,
        bagging_seed=2020
    )
    
    model = model.fit(
        train_X, 
        train_y, 
        eval_set=(val_X, val_y),
        early_stopping_rounds=100,
        verbose=100
    )
        
    pred_test_y = model.predict(
        test_X, 
        num_iteration=model.best_iteration_
    )
    pred_val_y = model.predict(
        val_X, 
        num_iteration=model.best_iteration_
    )
    
    return pred_test_y, pred_val_y, model

In [20]:
pred_test, pred_val, model = lgbm_model(train_X, train_log_y, val_X, val_log_y, test_X)

Training until validation scores don't improve for 100 rounds
[100]	valid_0's rmse: 1.8608
[200]	valid_0's rmse: 1.76686
[300]	valid_0's rmse: 1.734
[400]	valid_0's rmse: 1.7174
[500]	valid_0's rmse: 1.70967
[600]	valid_0's rmse: 1.70388
[700]	valid_0's rmse: 1.70017
[800]	valid_0's rmse: 1.69745
[900]	valid_0's rmse: 1.69625
[1000]	valid_0's rmse: 1.69537
Did not meet early stopping. Best iteration is:
[1000]	valid_0's rmse: 1.69537
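
The log above shows the early stopping rule in action: training would halt once the validation score failed to improve for 100 consecutive rounds, which never happened within the 1000 estimators. The following is a simplified sketch of such a patience rule, not LightGBM's actual implementation. The function name `best_iteration_with_patience` and the score lists are hypothetical.

```python
def best_iteration_with_patience(val_scores, patience):
    """Return (best_iteration, stopped_early) for a list of validation scores.

    Iterations are 1-based and lower scores are better, as with RMSE.
    """
    best_score = float('inf')
    best_iter = 0

    for i, score in enumerate(val_scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_iter, True

    return best_iter, False


# Scores that keep improving, as in the log above: no early stop.
print(best_iteration_with_patience([1.86, 1.77, 1.73, 1.72], patience=2))

# Scores that plateau after the second round: stops early.
print(best_iteration_with_patience([1.86, 1.70, 1.71, 1.72, 1.73], patience=2))
```

When the rule never triggers, the best iteration is simply the last improving one, which matches the "Did not meet early stopping" message.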


### Prepare validation results

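To evaluate the model, I calculate the Root Mean Squared Error (RMSE) between the actual _**totals.transactionRevenue**_ and the _**predictedRevenue**_ for the validation set. Negative predictions are clipped to zero, both values are summed per visitor, and the error is computed on their `log1p` values.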

In [21]:
pred_val[pred_val < 0] = 0

pred_val_data = {
    'fullVisitorId': val_id,
    'transactionRevenue': val_y,
    'predictedRevenue': np.expm1(pred_val)
}

pred_val_df = pd.DataFrame(pred_val_data)

# Select the columns with a list; tuple-style selection fails on recent pandas versions
pred_val_df = pred_val_df.groupby('fullVisitorId')[['transactionRevenue', 'predictedRevenue']].sum().reset_index()

pred_val_df.head()

Unnamed: 0,fullVisitorId,transactionRevenue,predictedRevenue
0,62267706107999,0.0,0.0
1,85059828173212,0.0,0.0
2,26722803385797,0.0,0.0
3,436683523507380,0.0,1.0636517332
4,450371054833295,0.0,0.0


In [22]:
rmse_val = np.sqrt(
    mean_squared_error(
        np.log1p(pred_val_df['transactionRevenue'].values),
        np.log1p(pred_val_df['predictedRevenue'].values)
    )
)

print('\nRMSE for validation data set: {:.10f}'.format(rmse_val))


RMSE for validation data set: 1.7183142410
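
The evaluation steps above (clip negative log-scale predictions to zero, convert back with `expm1`, sum revenue per visitor, then compute the error on the `log1p` of the sums) can be sketched without pandas as follows. This is an illustrative sketch only: the function name `visitor_level_rmse`, the visitor ids, and the numbers are made up and are not taken from the dataset.

```python
import math
from collections import defaultdict


def visitor_level_rmse(visitor_ids, actual_revenue, pred_log_revenue):
    """RMSE between log1p of per-visitor actual and predicted revenue."""
    actual_sum = defaultdict(float)
    pred_sum = defaultdict(float)

    for vid, actual, pred_log in zip(visitor_ids, actual_revenue, pred_log_revenue):
        # Negative log-scale predictions are clipped to zero revenue.
        pred_log = max(pred_log, 0.0)
        actual_sum[vid] += actual
        # Convert the prediction back to the revenue scale before summing.
        pred_sum[vid] += math.expm1(pred_log)

    squared_errors = [
        (math.log1p(actual_sum[vid]) - math.log1p(pred_sum[vid])) ** 2
        for vid in actual_sum
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical visits: visitor 'a' has two visits, visitor 'b' has one.
ids = ['a', 'a', 'b']
actual = [0.0, 100.0, 0.0]
pred_log = [-0.3, math.log1p(100.0), 0.0]

print('RMSE: {:.10f}'.format(visitor_level_rmse(ids, actual, pred_log)))
```

Because revenue is summed per visitor before the logarithm is taken, this score is not directly comparable to the per-row `rmse` values printed during training.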


### Prepare test results

This RMSE serves as the baseline. The PyTorch model created later will be compared against it to determine which model is better.

Now let's create a baseline output file with the predictions for the test dataset.

In [23]:
pred_test[pred_test < 0] = 0

pred_test_data = {
    'fullVisitorId': test_id,
    'predictedRevenue': np.expm1(pred_test)
}

pred_test_df = pd.DataFrame(pred_test_data)

pred_test_df = pred_test_df.groupby('fullVisitorId')['predictedRevenue'].sum().reset_index()

In [24]:
pred_test_df.head()

Unnamed: 0,fullVisitorId,predictedRevenue
0,259678714014,2.4062848919
1,49363351866189,0.0
2,53049821714864,0.0
3,59488412965267,0.0
4,85840370633780,0.0417426113


Write the val and test predictions to files so that they can be compared with the future model.

In [None]:
pred_val_df.to_csv(pred_val_file, index=False, compression='zip')
pred_test_df.to_csv(pred_test_file, index=False, compression='zip')

### Important Features 

Visualize the feature importances of the model.

In [None]:
feature_imp = pd.DataFrame({
    'Values':model.feature_importances_,
    'Features':train_X.columns
})

feature_imp = feature_imp.sort_values(by="Values", ascending=False)

plt.figure(figsize=(14, 14))

sns_bar = sns.barplot(
    x="Values", 
    y="Features", 
    data=feature_imp,
    palette="hls",
)

for patch in sns_bar.patches:
    x = patch.get_x() + patch.get_width() + 5
    y = patch.get_y() + (patch.get_height() / 2)
    
    sns_bar.annotate(int(patch.get_width()), (x, y))

plt.title(
    "\nLightGBM - Feature Importance\n(Feature Importances vs Features)\n", 
    fontsize=12
)
plt.xlabel('') 
plt.ylabel('') 

plt.savefig(
    fname=imp_features_file,
    bbox_inches = "tight"
)

plt.show()


From the above plot, we see that **_totals.hits_**, **_visitStartTime_** and **_totals.pageviews_** are the most important features.