# Google Revenue Prediction

## Kaggle Challenge Description

## What are we predicting?
We are predicting the **natural log** of the total transaction revenue **per user**, with 1 added before taking the log so that users with no purchases get a target of 0.

$$ y_{user} = \sum_{i=1}^n transaction_{user_{i}} $$
$$ target_{user} = \ln(y_{user} + 1) $$
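
As a quick check of the formula, the sketch below computes the target for a made-up visit log. The column names loosely mirror the dataset, but the visitor IDs and revenue values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy visit log (invented values): one row per visit, revenue is 0 when nothing was bought.
visits = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "transactionRevenue": [10.0, 5.0, 0.0, 20.0],
})

# y_user: sum of all transactions per user.
y_user = visits.groupby("fullVisitorId")["transactionRevenue"].sum()

# target_user = ln(y_user + 1); np.log1p computes exactly this.
target_user = np.log1p(y_user)
print(target_user)
```

User `b` never bought anything, so the target is ln(0 + 1) = 0 rather than an undefined ln(0).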

## Data Cleansing 

Pass either "test_v2.csv" or "train_v2.csv" as `csv_file` to `load_csv` to choose which CSV to load, and use `nrows` to limit how many rows are read. Working with "test_v2.csv" is easier on your system.

In [None]:
%%time
import pandas as pd
import sklearn
from sklearn import model_selection, preprocessing, metrics
import json
import os
import matplotlib
import matplotlib.pyplot as plt
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import numpy as np
import lightgbm as lgb


In [None]:
%%time
data = 'data'
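def load_csv(data, csv_file, nrows):
    """Read `csv_file` from the `data` directory and flatten its JSON columns."""
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    print("Loading csv file")
    df = pd.read_csv(os.path.join(os.getcwd(), data, csv_file),
                     # Parse the JSON columns into dicts while reading
                     converters={column: json.loads for column in JSON_COLUMNS},
                     # Keep visitor IDs as strings so leading zeros and long IDs survive
                     dtype={'fullVisitorId': 'str'},
                     nrows=nrows)

    for column in JSON_COLUMNS:
        # Expand each dict into one column per key, prefixed with the parent column name
        column_as_df = json_normalize(df[column])
        column_as_df.columns = ["%s.%s" % (column, subcolumn) for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    return df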

# Load both the training csv and testing csv
train_df = load_csv(data, "train_v2.csv",2000)
test_df = load_csv(data, "test_v2.csv", 2000)

print("Loading done.")
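
To see what `load_csv` does to each JSON column, here is a minimal sketch on a two-row toy frame. The `device` values are invented, and `pd.json_normalize` assumes pandas 1.0 or newer.

```python
import json
import pandas as pd

# Toy frame with one JSON-encoded column (values invented for illustration).
raw = pd.DataFrame({
    "fullVisitorId": ["1", "2"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Safari", "isMobile": true}'],
})

# Parse the JSON strings into dicts, as the `converters` argument does in read_csv.
parsed = raw["device"].apply(json.loads)

# Expand each dict into its own columns and prefix them with the parent column name.
flat = pd.json_normalize(parsed.tolist())
flat.columns = ["device." + sub for sub in flat.columns]

# Swap the JSON column for its flattened version, aligning rows on the index.
result = raw.drop("device", axis=1).merge(flat, left_index=True, right_index=True)
print(result.columns.tolist())
```

The index-based merge only lines up correctly because both frames share the default row index, which is also the case inside `load_csv`.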


## Columns and Types
First, we inspect the data to learn which columns the csv file contains and what type each one has.

In [None]:
train_df.dtypes

### Analyzing which Columns are constant

In [None]:
const_cols = [c for c in train_df.columns if train_df[c].nunique(dropna=False)==1 ]
const_cols
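
The `dropna=False` argument matters here. The sketch below, on an invented toy frame, shows how the result changes when missing values are ignored instead.

```python
import numpy as np
import pandas as pd

# Invented toy data with three kinds of columns.
toy = pd.DataFrame({
    "always_same": ["x", "x", "x"],            # one distinct value: constant
    "all_missing": [np.nan, np.nan, np.nan],   # only NaN: also constant
    "value_or_nan": ["x", np.nan, "x"],        # present vs. missing still carries information
})

# NaN counted as a value (what the notebook does).
const = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# NaN ignored: flags "value_or_nan" by mistake and misses "all_missing".
const_ignoring_nan = [c for c in toy.columns if toy[c].nunique(dropna=True) == 1]

print(const)
print(const_ignoring_nan)
```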

We should also remove the constant columns: a feature that never varies carries no information the model can learn from.

In [None]:
cols_to_drop = const_cols
#+ ["trafficSource.campaignCode"]
train_df = train_df.drop(cols_to_drop, axis=1)
# errors='ignore': a column that is constant in train may be absent from test
test_df = test_df.drop(cols_to_drop, axis=1, errors='ignore')

# Target Variable Exploration

We need to sum the transaction revenue at the user level, then draw a scatter plot of the natural log of those sums, sorted in ascending order.


In [None]:
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')
gdf = train_df.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()

plt.figure(figsize=(8,6))
plt.scatter(range(gdf.shape[0]), np.sort(np.log1p(gdf["totals.transactionRevenue"].values)))
plt.xlabel('Customer', fontsize=12)
plt.ylabel('log1p(TransactionRevenue)', fontsize=12)
plt.show()

The scatter plot above is consistent with the 80/20 marketing principle, which states that 80% of the profits come from 20% of the customers: a small share of customers accounts for most of the revenue.

This means the dataset splits into two distinct groups: customers with TransactionRevenue and customers without.
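
The plot only suggests the 80/20 split, so it can help to put a number on it. The sketch below is one way to do that; `top_share` is a hypothetical helper and the revenue values are invented.

```python
import numpy as np

def top_share(revenue_per_user, top_fraction=0.2):
    """Share of total revenue contributed by the top `top_fraction` of users."""
    revenue = np.sort(np.asarray(revenue_per_user, dtype=float))[::-1]  # largest first
    total = revenue.sum()
    if total == 0:
        return 0.0
    n_top = max(1, int(np.ceil(top_fraction * len(revenue))))
    return revenue[:n_top].sum() / total

# Invented per-user revenue: most users spend nothing.
toy_revenue = [0, 0, 0, 0, 0, 0, 0, 5, 15, 80]

share = top_share(toy_revenue)                  # revenue share of the top 20% of users
paying = np.mean(np.asarray(toy_revenue) > 0)   # fraction of users with any revenue
print(share, paying)
```

On the real data, the same helper could be called as `top_share(gdf["totals.transactionRevenue"].values)`.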


# Developing the model
To find the best model for predicting the log revenue of customers, we can compare popular models such as LightGBM, AdaBoost and XGBoost. AdaBoost ships with scikit-learn; LightGBM and XGBoost are separate libraries that provide scikit-learn-compatible interfaces.

First, we need to build the training and test sets.

In [None]:
## TODO: Create a template possible to fit with all the models.
## TODO: Write metrics visualization code.
cat_cols = ["channelGrouping", "device.browser", 
            "device.deviceCategory", "device.operatingSystem", 
            "geoNetwork.city", "geoNetwork.continent", 
            "geoNetwork.metro",
            "geoNetwork.networkDomain", "geoNetwork.region", 
            "geoNetwork.subContinent", "trafficSource.adContent", 
            "trafficSource.adwordsClickInfo.adNetworkType", 
            "trafficSource.adwordsClickInfo.gclId", 
            "trafficSource.adwordsClickInfo.page", 
            "trafficSource.adwordsClickInfo.slot", "trafficSource.campaign",
            "trafficSource.keyword", "trafficSource.medium", 
            "trafficSource.referralPath", "trafficSource.source",
            'trafficSource.adwordsClickInfo.isVideoAd', 'trafficSource.isTrueDirect']
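for col in cat_cols:
    print(col)
    lbl = preprocessing.LabelEncoder()
    # Fit on train and test together so both share one label-to-integer mapping
    lbl.fit(list(train_df[col].values.astype('str')) + list(test_df[col].values.astype('str')))
    train_df[col] = lbl.transform(list(train_df[col].values.astype('str')))
    test_df[col] = lbl.transform(list(test_df[col].values.astype('str')))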

num_cols = ["totals.hits", "totals.pageviews", "visitNumber", "visitStartTime", 'totals.bounces',  'totals.newVisits']    
for col in num_cols:
    train_df[col] = train_df[col].astype(float)
    test_df[col] = test_df[col].astype(float)

train_set,test_set = model_selection.train_test_split(train_df, test_size=0.2)
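
The encoder above is fitted on the train and test values together. The toy sketch below, with invented browser names, shows why: an encoder fitted on train alone cannot transform a label it has never seen.

```python
from sklearn.preprocessing import LabelEncoder

train_values = ["Chrome", "Safari", "Chrome"]
test_values = ["Firefox", "Chrome"]   # "Firefox" never appears in train

# Fitting on train only: transforming the unseen test label raises ValueError.
train_only = LabelEncoder().fit(train_values)
try:
    train_only.transform(test_values)
    unseen_ok = True
except ValueError:
    unseen_ok = False

# Fitting on the union gives both sets one consistent integer mapping.
# Classes are sorted, so Chrome=0, Firefox=1, Safari=2.
shared = LabelEncoder().fit(train_values + test_values)
train_codes = shared.transform(train_values)
test_codes = shared.transform(test_values)
print(unseen_ok, train_codes, test_codes)
```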

## LightGBM

In [None]:
## TODO: Build LightGBM training function.
def train_with_gbm():
    raise NotImplementedError("LightGBM training is not implemented yet")

## ADABOOST 

In [None]:
## TODO: Build ADABOOST training function.
def train_with_adaboost():
    raise NotImplementedError("AdaBoost training is not implemented yet")
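
Until the function above is written, here is a minimal sketch of what an AdaBoost training step could look like, using scikit-learn's `AdaBoostRegressor`. The name `train_with_adaboost_sketch`, the hyperparameters, and the synthetic data are assumptions for illustration, not this notebook's final design.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def train_with_adaboost_sketch(X, y, random_state=0):
    """Fit AdaBoostRegressor on a hold-out split; return (model, validation RMSE)."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state)
    model = AdaBoostRegressor(n_estimators=50, random_state=random_state)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
    return model, rmse

# Synthetic stand-in data (not the Kaggle data): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.log1p(np.abs(X[:, 0]) * 10)   # a log-scaled target, like the notebook's

model, rmse = train_with_adaboost_sketch(X, y)
print(rmse)
```

In this notebook, `X` would presumably be built from `cat_cols + num_cols` and `y` from the log of the revenue column, with missing values filled first; both choices are assumptions about how the notebook continues.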

## XGBOOST

In [None]:
## TODO: Build XGBOOST training function.
def train_with_xgboost():
    raise NotImplementedError("XGBoost training is not implemented yet")

## Evaluation
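
A minimal sketch of an evaluation step, assuming root mean squared error (RMSE) on the log target as the metric, which is how the Kaggle competition scores submissions. The helper name `rmse` and the sample values are invented for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Invented example: actual per-user revenue and predictions already on the log scale.
actual = np.log1p([0.0, 0.0, 100.0])
predicted = np.array([0.0, 0.5, 4.0])

score = rmse(actual, predicted)
print(score)
```

Because the target is defined per user, predictions made per visit would need to be summed per `fullVisitorId` and log-transformed before being scored this way.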