# Customer Revenue Prediction

## Baseline LightGBM Model
*Machine Learning Nanodegree Program | Capstone Project*

---

In this notebook I create a baseline LightGBM model. Its results serve as the reference point for evaluating the PyTorch model built later in the project.

### Overview:
- Reading the data
- Initializing the LightGBM model
- Training the model with the train dataset
- Validating the model using the val dataset
- Predicting the revenue for each customer in the test dataset
- Visualizing the results
- Saving the baseline results to a CSV file

First, import the relevant libraries into the notebook.

In [38]:
import pandas as pd
import numpy as np
import lightgbm as lgb

from os import path
from sklearn import metrics

In [39]:
files_dir = '../data/files'

if not path.exists(files_dir):
    raise FileNotFoundError('{} directory not found.'.format(files_dir))

train_file = '{}/{}'.format(files_dir, 'train.zip')
print('\nTrain file: {}'.format(train_file))

val_file = '{}/{}'.format(files_dir, 'val.zip')
print('\nVal file: {}'.format(val_file))

test_file = '{}/{}'.format(files_dir, 'test.zip')
print('\nTest file: {}'.format(test_file))


Train file: ../data/files/train.zip

Val file: ../data/files/val.zip

Test file: ../data/files/test.zip


In [40]:
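def load_data(zip_path):
    """Read a zipped CSV into a DataFrame and report its shape."""
    df = pd.read_csv(
        zip_path,
        # Keep visitor IDs as strings so they are not parsed as numbers.
        dtype={'fullVisitorId': 'str'},
        compression='zip'
    )

    rows, columns = df.shape

    print('\nLoaded {} rows with {} columns from {}.\n'.format(
        rows, columns, zip_path
    ))

    return df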
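
The `dtype={'fullVisitorId': 'str'}` argument is worth a quick illustration. Assuming the visitor IDs are digit strings that may carry leading zeros, reading them as numbers would silently change them. The sketch below uses a tiny made-up zipped CSV; the file name `sample.zip` and the ID values are hypothetical and exist only for this example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data: IDs are digit strings, one with leading zeros.
sample = pd.DataFrame({
    'fullVisitorId': ['0012345', '9876543'],
    'totals.transactionRevenue': [0.0, 1500000.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'sample.zip')
    # Write a zip archive containing a single CSV file.
    sample.to_csv(
        zip_path,
        index=False,
        compression={'method': 'zip', 'archive_name': 'sample.csv'},
    )

    # Without an explicit dtype, pandas infers an integer column.
    as_number = pd.read_csv(zip_path, compression='zip')
    # With dtype 'str', the IDs come back exactly as written.
    as_text = pd.read_csv(
        zip_path, dtype={'fullVisitorId': 'str'}, compression='zip'
    )

ids_as_number = as_number['fullVisitorId'].tolist()
ids_as_text = as_text['fullVisitorId'].tolist()
print(ids_as_number)  # leading zeros are lost
print(ids_as_text)    # IDs preserved as strings
```

Keeping the ID as a string also means it can be used safely as a grouping or join key later on.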

In [41]:
%%time

train_df = load_data(train_file)
val_df = load_data(val_file)
test_df = load_data(test_file)


Loaded 765707 rows with 26 columns from ../data/files/train.zip.


Loaded 137946 rows with 26 columns from ../data/files/val.zip.


Loaded 804684 rows with 25 columns from ../data/files/test.zip.

CPU times: user 8.43 s, sys: 618 ms, total: 9.05 s
Wall time: 9.46 s


In [42]:
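# Keep the visitor IDs aside; they identify rows but are not model features.
train_id = train_df["fullVisitorId"].values
val_id = val_df["fullVisitorId"].values
test_id = test_df["fullVisitorId"].values

# Log-transform the revenue target with log(1 + x), which is defined at zero revenue.
train_y = np.log1p(train_df["totals.transactionRevenue"].values)
val_y = np.log1p(val_df["totals.transactionRevenue"].values)

# Drop the target and the ID from the features.
# The test set has no revenue column, so only the ID is dropped there.
train_X = train_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
val_X = val_df.drop(['totals.transactionRevenue', 'fullVisitorId'], axis=1)
test_X = test_df.drop(['fullVisitorId'], axis=1)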
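
The targets are transformed with `np.log1p`, which computes log(1 + x) and therefore maps zero revenue to zero. Predictions made on that scale can be mapped back with `np.expm1`. The sketch below uses made-up revenue values and also shows one way to score predictions on the log scale. Using RMSE here is an assumption for illustration, not something this notebook has stated, and `rmse` is a hypothetical helper.

```python
import numpy as np

# Made-up revenue values for illustration; most sessions have no revenue.
revenue = np.array([0.0, 0.0, 1.0e6, 2.5e7])

log_revenue = np.log1p(revenue)      # log(1 + x), defined at x = 0
recovered = np.expm1(log_revenue)    # inverse transform: exp(y) - 1


def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# A hypothetical model that always predicts zero log-revenue.
zero_pred = np.zeros_like(log_revenue)

print(log_revenue[:2])                    # zero revenue stays at zero
print(np.allclose(recovered, revenue))    # round trip recovers the input
print(rmse(log_revenue, zero_pred))
```

Working on the log scale compresses the few very large purchases, so they do not dominate the error.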

In [47]:
print('\nShape of the train dataset: {}'.format(train_X.shape))
print('\nShape of the val dataset: {}'.format(val_X.shape))
print('\nShape of the test dataset: {}\n'.format(test_X.shape))


Shape of the train dataset: (765707, 24)

Shape of the val dataset: (137946, 24)

Shape of the test dataset: (804684, 24)
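
All three feature frames report 24 columns, but matching counts alone do not guarantee matching names or order. A small check along the following lines could be run before training. `same_columns` is a hypothetical helper, and the toy frames stand in for `train_X`, `val_X` and `test_X`.

```python
import pandas as pd


def same_columns(reference, *others):
    """Return True if every frame has the reference's columns in the same order."""
    expected = list(reference.columns)
    return all(list(frame.columns) == expected for frame in others)


# Toy frames with hypothetical column names.
toy_train = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
toy_val = pd.DataFrame({'a': [5], 'b': [6]})
toy_test = pd.DataFrame({'b': [7], 'a': [8]})  # same names, different order

print(same_columns(toy_train, toy_val))            # True
print(same_columns(toy_train, toy_val, toy_test))  # False
```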



In [None]:
def lgbm_model()