# Google Revenue Prediction

## Kaggle Challenge Description

## What are we predicting?
We are predicting the **natural log** of the sum of all transactions **per user**, plus one. Here $n$ is the number of transactions made by the user:

$$ y_{user} = \sum_{i=1}^{n} \text{transaction}_{user,i} $$
$$ \text{target}_{user} = \ln(y_{user} + 1) $$
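
As a small worked example of the two formulas above, the sketch below computes the target on a hypothetical toy table. The data and the column name `transactionRevenue` are made up for illustration and are not taken from the Kaggle files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three visits by user "a", one by "b", and a non-buyer "c".
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "a", "b", "c"],
    "transactionRevenue": [10.0, 0.0, 5.0, 20.0, 0.0],
})

# y_user: sum of all transactions per user.
y_user = visits.groupby("fullVisitorId")["transactionRevenue"].sum()

# target_user = ln(y_user + 1); np.log1p(x) computes ln(1 + x).
target_user = np.log1p(y_user)
```

The `+ 1` inside the logarithm maps users with zero revenue to a target of 0 instead of the undefined `ln(0)`.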

## Data Cleansing 

Set `csv_file` to either `"test_v2.csv"` or `"train_v2.csv"` to choose which CSV file to load. Using `"test_v2.csv"` will be easier for your system to manage.

In [1]:
%%time
import pandas as pd
import sklearn
import json
import os
import matplotlib
import matplotlib.pyplot as plt
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0
import numpy as np


CPU times: user 646 ms, sys: 402 ms, total: 1.05 s
Wall time: 1.61 s


In [None]:
%%time
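data = 'data'              # directory (relative to the working directory) holding the csv files
csv_file = "train_v2.csv"  # or "test_v2.csv"
# Columns whose cells hold JSON-encoded objects
JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
print("Loading csv file")
df = pd.read_csv(os.path.join(os.getcwd(), data, csv_file),
                 converters={column: json.loads for column in JSON_COLUMNS},  # parse JSON while loading
                 dtype={'fullVisitorId': 'str'},  # keep visitor IDs as strings, not numbers
                 nrows=None)  # set to an integer to load only the first rows

# Flatten each JSON column into "<column>.<subcolumn>" columns
for column in JSON_COLUMNS:
    # .tolist() passes a plain list of dicts, which json_normalize accepts across pandas versions
    column_as_df = json_normalize(df[column].tolist())
    column_as_df.columns = ["%s.%s" % (column, subcolumn) for subcolumn in column_as_df.columns]
    df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)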
    
print("Loading done.")

Loading csv file
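
To make the flattening step in the loading cell easier to follow, here is a minimal sketch on a hypothetical two-row frame. The values are invented, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import json
import pandas as pd

# Hypothetical two-row frame with one JSON-encoded column, mimicking 'device'.
raw = pd.DataFrame({
    "fullVisitorId": ["1", "2"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse the JSON strings into dicts (read_csv's `converters` does this at load time).
parsed = raw["device"].apply(json.loads)

# Expand each dict into its own columns and prefix them with the parent column name.
flat = pd.json_normalize(parsed.tolist())
flat.columns = ["device.%s" % sub for sub in flat.columns]

# Swap the JSON column for the flattened columns, aligning on the row index.
result = raw.drop("device", axis=1).merge(flat, left_index=True, right_index=True)
```

The index-based merge only lines up because both frames share the default 0..n-1 row index.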


## Columns and Types
First, we inspect the data to learn which columns the csv file contains and what type each one has.

In [None]:
df.dtypes

### Analyzing which Columns are constant

In [None]:
const_cols = [c for c in df.columns if df[c].nunique(dropna=False)==1 ]
const_cols

We should also remove the constant columns: a column with a single value carries no information and has no bearing on the model's weights.

In [None]:
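cols_to_drop = const_cols + ['sessionId']

# errors='ignore' skips labels that are missing from the loaded file
# (for example, 'trafficSource.campaignCode' may not exist in the test csv)
df = df.drop(cols_to_drop + ["trafficSource.campaignCode"], axis=1, errors='ignore')
#test_df = test_df.drop(cols_to_drop, axis=1)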
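
The constant-column check can be illustrated on a hypothetical toy frame. With `dropna=False`, `NaN` counts as a value of its own, so an all-`NaN` column is detected as constant while a column mixing `NaN` with one real value is not. The column names below are made up, and `errors="ignore"` is a choice made in this sketch to tolerate labels that are not present.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "visitId": [1, 2, 3],
    "socialEngagementType": ["Not Socially Engaged"] * 3,  # same value everywhere
    "all_missing": [np.nan, np.nan, np.nan],               # NaN everywhere
    "partly_missing": [np.nan, "x", np.nan],               # NaN plus one real value
})

# A column is constant when it has exactly one distinct value, NaN included.
const_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Drop the constant columns; the extra label does not exist and is skipped.
toy_clean = toy.drop(const_cols + ["not_a_column"], axis=1, errors="ignore")
```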

# Target Variable Exploration

We need to sum the transaction revenue at the user level, then draw a scatter plot of the natural log of those per-user totals, sorted in ascending order.


In [None]:
df["totals.transactionRevenue"] = df["totals.transactionRevenue"].astype('float')
gdf = df.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()

plt.figure(figsize=(8,6))
plt.scatter(range(gdf.shape[0]), np.sort(np.log1p(gdf["totals.transactionRevenue"].values)))
plt.xlabel('Customer (sorted by revenue)', fontsize=12)
plt.ylabel('log1p(TransactionRevenue)', fontsize=12)
plt.show()

The scatter plot above is consistent with the 80/20 marketing principle, which states that 80% of the profits come from 20% of the customers.

This suggests that the dataset splits into two distinct groups: customers with TransactionRevenue and customers without TransactionRevenue.
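
One way to separate the two groups is a boolean mask on the per-user totals. The sketch below uses a hypothetical frame shaped like `gdf`; the numbers are invented and say nothing about the real proportions.

```python
import pandas as pd

# Hypothetical per-user totals, shaped like `gdf`.
gdf_toy = pd.DataFrame({
    "fullVisitorId": ["a", "b", "c", "d", "e"],
    "totals.transactionRevenue": [0.0, 0.0, 25.0, 0.0, 0.0],
})

# True for customers with any revenue, False otherwise.
has_revenue = gdf_toy["totals.transactionRevenue"] > 0

buyers = gdf_toy[has_revenue]       # customers with TransactionRevenue
non_buyers = gdf_toy[~has_revenue]  # customers without TransactionRevenue

# The mean of a boolean mask is the share of True values.
buyer_share = has_revenue.mean()
```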


# Developing the model
To find the best model for predicting the log revenue of customers, we can compare the training results of popular models such as LightGBM, AdaBoost and XGBoost. AdaBoost is available in scikit-learn; LightGBM and XGBoost are separate libraries (`lightgbm` and `xgboost`) that provide scikit-learn-compatible estimators.



In [None]:
## TODO: Create a template possible to fit with all the models.
## TODO: Divide the training set based on the presence of TransactionRevenue.
## TODO: Write metrics visualization code.
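
A possible starting point for the template TODO is sketched below. It is not the notebook's final implementation. The helper name `evaluate_model`, the use of RMSE as the metric, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def evaluate_model(model, X, y, test_size=0.2, random_state=0):
    """Fit `model` on a training split and return the RMSE on the held-out split."""
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    model.fit(X_train, y_train)
    predictions = model.predict(X_valid)
    return float(np.sqrt(mean_squared_error(y_valid, predictions)))


# Synthetic stand-in data (not the Kaggle data): 200 rows, 3 features,
# with a non-negative target playing the role of the log revenue.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.log1p(np.abs(X[:, 0]) * 10)

rmse = evaluate_model(AdaBoostRegressor(n_estimators=50, random_state=0), X, y)
```

Because the helper only relies on `fit` and `predict`, any estimator with a scikit-learn-style interface should be usable in place of `AdaBoostRegressor`, which would allow the models to be compared on the same split.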

## LightGBM

In [None]:
## TODO: Build LightGBM training function.

## ADABOOST 

In [None]:
## TODO: Build ADABOOST training function.

## XGBOOST

In [None]:
## TODO: Build XGBOOST training function.