# Workshop 6: Crucial Factors Determining Google Analytics Revenue



### Contents of this Kernel

1. Problem Statement  
2. Dataset Preparation  
3. Simple Look into Visitor Profiles
4. Boosting: CatBoost, LightGBM and XGBoost
 

## 1. Problem Statement 

In this exercise, based on the Kaggle competition https://www.kaggle.com/c/ga-customer-revenue-prediction , the aim is to analyze a customer dataset from the Google Merchandise Store (also known as GStore, where Google swag is sold) to predict revenue per customer. The exercise is an exploration of the GA data and of the visitor personas it reveals. Such analysis might lead to better use of marketing budgets for companies that choose to apply data analysis on top of their GA data.

This exercise introduces a tool that helps extract the important factors determining your business objectives. It assumes you have completed the exercise in Workshop 4, which shows how to format the data for analysis.

### Questions:

#### Q1: Seek out insights of the personas of the visitors to form crucial insights
#### Q2: Compare different kinds of boosting algorithms

As the first step, let's load the required libraries.


### Remember to download test.csv (6 Gbit) from the link below into the same directory as this Jupyter notebook.

Because GitHub does not allow uploading files larger than 2 Gbit, the data file is placed on my Google Drive: https://drive.google.com/file/d/1euXsx5hfq0N5mowMyo3ecqEGcwzHalpT/view?usp=sharing

In [None]:
import numpy as np 
import pandas as pd 
import json
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
import seaborn as sns 
import matplotlib.pyplot as plt 
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
from plotly import tools
init_notebook_mode(connected=True)

If plotly is not installed on your PC, install it first (for example with `pip install plotly`) so that `init_notebook_mode`, `iplot`, `plotly.graph_objs` and `tools` can be imported.

## 2. Understanding the Dataset

The data is shared in csv format. The csv file contains some fields that hold JSON objects. The dataset fields are described at https://www.kaggle.com/c/ga-customer-revenue-prediction/data 



### 2.1 Dataset Preparation

Let's read the dataset in csv format and unwrap the JSON fields. For reference, see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.json_normalize.html 
and
https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields

In [None]:
# read in data

df = pd.read_csv("test.csv", nrows=10000)
df.shape

In [None]:
# See the variables 
df.head()


In [None]:
# See the variables 
df.dtypes


#### Note below how servers record semi-structured data from their interactions with clients in JSON format


Only "device", "geoNetwork", "totals" and "trafficSource" are JSON objects, whereas "customDimensions" and "hits" are a different kind of nested dictionary object. The data structure of the "customDimensions" and "hits" series is more complicated, so they are not included in the analysis in this exercise.

"hits" records the footprint of every page visited. Details of the Google Analytics data variables can be found at https://support.google.com/analytics/answer/3437719?hl=en. Google also provides a BigQuery SQL API through which one can access GA data for further A/B test design; see https://towardsdatascience.com/how-to-query-and-calculate-google-analytics-data-in-bigquery-cab8fc4f396 (we will come back to this in lectures 8-9).

#### Use pandas.read_csv(converters) 
Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
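
To make the `converters` idea concrete before the full loader, here is a minimal sketch on a made-up two-row CSV (the CSV text and the `toy` name are assumptions for illustration, not part of the real dataset). It parses one JSON column with `json.loads` and flattens it with `pd.json_normalize`:

```python
import io
import json

import pandas as pd

# Hypothetical two-row CSV; the "device" column holds a JSON string.
csv_text = (
    'fullVisitorId,device\n'
    '0001,"{""browser"": ""Chrome"", ""isMobile"": false}"\n'
    '0002,"{""browser"": ""Safari"", ""isMobile"": true}"\n'
)

toy = pd.read_csv(
    io.StringIO(csv_text),
    converters={"device": json.loads},  # parse the JSON text into dicts
    dtype={"fullVisitorId": "str"},     # keep leading zeros in the ID
)

# Flatten the dicts into columns and prefix them with the source column name.
flat = pd.json_normalize(toy["device"].tolist())
flat.columns = [f"device.{c}" for c in flat.columns]
toy = toy.drop(columns="device").join(flat)
print(toy)
```

Reading `fullVisitorId` as a string matters because the IDs are long digit strings; parsing them as numbers would drop leading zeros and could lose precision.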

In [None]:
def load_df(csv_path='test.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        # "normalize" convert semi-structured JSON data into a flat table.
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        # using "." rather than "_" for subcolumn
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {csv_path}. Shape: {df.shape}")
    return df


#### Loading the whole file may take hours, depending on your machine's computing power; it is suggested to limit the number of rows (for example to 100000; the call below loads 300000)

In [None]:
train = load_df("test.csv", 300000)

### 2.2 Dataset Snapshot

Let's view a snapshot of the test dataset. 

In [None]:
print ("There are " + str(train.shape[0]) + " rows and " + str(train.shape[1]) + " raw columns in this dataset")

print ("Snapshot: ", train.head())

### 2.3 Missing Values Percentage

From the snapshot we can observe that there are many missing values in the dataset. Let's plot the missing values percentage for columns having missing values. 

> The following graph shows only the columns that have missing values; all other columns are complete. 

In [None]:
miss_per = {}
for k, v in dict(train.isna().sum(axis=0)).items():
    if v == 0:
        continue
    miss_per[k] = 100 * float(v) / len(train)
    
import operator 
sorted_x = sorted(miss_per.items(), key=operator.itemgetter(1), reverse=True)
print ("There are " + str(len(miss_per)) + " columns with missing values")

kys = [_[0] for _ in sorted_x][::-1]
vls = [_[1] for _ in sorted_x][::-1]
trace1 = go.Bar(y = kys, orientation="h" , x = vls, marker=dict(color="#d6a5ff"))
layout = go.Layout(title="Missing Values Percentage", 
                   xaxis=dict(title="Missing Percentage"), 
                   height=400, margin=dict(l=300, r=300))
figure = go.Figure(data = [trace1], layout = layout)
iplot(figure)

> - We can observe that some columns in the dataset have a very large percentage of missing values. 
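
The percentages plotted above can also be computed in a more compact way. The following is a minimal sketch on a made-up frame (`toy` and its columns are assumptions for illustration); it relies on the fact that the mean of a boolean mask is the fraction of `True` values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in two of three columns.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# isna() gives a boolean mask; its column-wise mean is the missing fraction.
miss_pct = toy.isna().mean().mul(100)

# Keep only columns with missing values, largest share first.
miss_pct = miss_pct[miss_pct > 0].sort_values(ascending=False)
print(miss_pct)
```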


## 3. Visitor Profile 

Let's create the visitor profile by aggregating the rows for each customer. 

### 3.1 Visitor Profile Snapshot

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

agg_dict = {}
for col in ["totals.bounces", "totals.hits", "totals.newVisits", "totals.pageviews", "totals.transactionRevenue"]:
    train[col] = train[col].astype('float')
    agg_dict[col] = "sum"
tmp = train.groupby("fullVisitorId").agg(agg_dict).reset_index()
tmp.head()
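
To illustrate what the aggregation above does with missing revenue, here is a small sketch on made-up sessions (the `toy` frame and its values are assumptions). A grouped `sum` skips `NaN`, so a visitor with no recorded revenue ends up with a total of 0 rather than a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical sessions: visitor A has two sessions, visitor B has one.
toy = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B"],
    "totals.hits": ["3", "5", "1"],
    "totals.transactionRevenue": [np.nan, "1000000", np.nan],
})

cols = ["totals.hits", "totals.transactionRevenue"]
toy[cols] = toy[cols].astype(float)  # the raw fields arrive as strings

# One row per visitor; NaN revenue is skipped by sum().
profile = toy.groupby("fullVisitorId")[cols].sum().reset_index()
print(profile)
```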

### 3.2 Total Transactions Revenue

In [None]:
non_zero = tmp[tmp["totals.transactionRevenue"] > 0]["totals.transactionRevenue"]
print ("There are " + str(len(non_zero)) + " visitors in the train dataset having non zero total transaction revenue")

plt.figure(figsize=(12,6))
sns.distplot(non_zero)
plt.title("Distribution of Non Zero Total Transactions");
plt.xlabel("Total Transactions");

Let's take the natural log of the transaction revenue, using `np.log1p`, which computes log(1 + x).

In [None]:
plt.figure(figsize=(12,6))
sns.distplot(np.log1p(non_zero))
plt.title("Log Distribution of Non Zero Total Transactions");
plt.xlabel("Log - Total Transactions");
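
As a small aside on the log transform used above, the sketch below (with made-up revenue values) shows why `np.log1p` is convenient here: it maps zero revenue to zero, keeps the ordering of the values, and can be inverted with `np.expm1`:

```python
import numpy as np

# Hypothetical raw revenue values, including a zero.
revenue = np.array([0.0, 1e6, 2.5e7, 1e9])

log_rev = np.log1p(revenue)    # log(1 + x), defined at x = 0
restored = np.expm1(log_rev)   # exp(y) - 1, the inverse transform

print(log_rev)
print(restored)
```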

### 3.3 Visitor Profile Attributes

In [None]:
def getbin_hits(x):
    if x < 5:
        return "1-5"
    elif x < 10:
        return "5-10"
    elif x < 30:
        return "10-30"
    elif x < 50:
        return "30-50"
    elif x < 100:
        return "50-100"
    else:
        return "100+"

tmp["total.hits_bin"] = tmp["totals.hits"].apply(getbin_hits)
tmp["totals.bounces_bin"] = tmp["totals.bounces"].apply(lambda x : str(x) if x <= 5 else "5+")
tmp["totals.pageviews_bin"] = tmp["totals.pageviews"].apply(lambda x : str(x) if x <= 50 else "50+")

t1 = tmp["total.hits_bin"].value_counts()
t2 = tmp["totals.bounces_bin"].value_counts()
t3 = tmp["totals.newVisits"].value_counts()
t4 = tmp["totals.pageviews_bin"].value_counts()

fig = tools.make_subplots(rows=2, cols=2, subplot_titles=["Total Hits per User", "Total Bounces per User", 
                                                         "Total NewVisits per User", "Total PageViews per User"], print_grid=False)

tr1 = go.Bar(x = t1.index[:20], y = t1.values[:20])
tr2 = go.Bar(x = t2.index[:20], y = t2.values[:20])
tr3 = go.Bar(x = t3.index[:20], y = t3.values[:20])
tr4 = go.Bar(x = t4.index, y = t4.values)

fig.append_trace(tr1, 1, 1)
fig.append_trace(tr2, 1, 2)
fig.append_trace(tr3, 2, 1)
fig.append_trace(tr4, 2, 2)

fig['layout'].update(height=700, showlegend=False)
iplot(fig)
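
The hand-written `getbin_hits` function above could alternatively be expressed with `pd.cut`. This is a sketch, not the notebook's method: the bin edges and labels are copied from `getbin_hits`, while the `hits` series is made up for illustration. `right=False` makes each bin include its left edge and exclude its right edge, matching the `x < 5`, `x < 10`, ... comparisons:

```python
import numpy as np
import pandas as pd

# Same edges and labels as getbin_hits.
bins = [0, 5, 10, 30, 50, 100, np.inf]
labels = ["1-5", "5-10", "10-30", "30-50", "50-100", "100+"]

# Hypothetical per-visitor hit counts.
hits = pd.Series([1, 4, 5, 29, 30, 99, 100, 2500])

# Left-closed intervals: [0, 5), [5, 10), ..., [100, inf)
binned = pd.cut(hits, bins=bins, labels=labels, right=False)
print(binned.value_counts().sort_index())
```

Because the result is an ordered categorical, `value_counts().sort_index()` lists the bins in their natural order instead of by frequency.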

## 4. Boosting: CatBoost, LightGBM and XGBoost

In [None]:
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return round(np.sqrt(mean_squared_error(y_true, y_pred)), 5)

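def load_preprocessed_dfs(drop_full_visitor_id=True):
    """
    Loads files `TRAIN`, `TEST` and `Y` generated by preprocess() into variables.
    Note: `TRAIN`, `TEST` and `Y` are file paths that must be defined before
    calling this function; they are not set anywhere in this notebook.
    """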
    X_train = pd.read_csv(TRAIN, converters={'fullVisitorId': str})
    X_test = pd.read_csv(TEST, converters={'fullVisitorId': str})
    y_train = pd.read_csv(Y, names=['LogRevenue']).T.squeeze()
    
    # This is the only `object` column, we drop it for train and evaluation
    if drop_full_visitor_id: 
        X_train = X_train.drop(['fullVisitorId'], axis=1)
        X_test = X_test.drop(['fullVisitorId'], axis=1)
    return X_train, y_train, X_test


In [None]:
X, y, X_test = load_preprocessed_dfs()
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=1)

print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
print(f"Test (submit) shape: {X_test.shape}")
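
To make the `rmse` helper above concrete, here is a NumPy-only sketch of the same calculation on made-up values (`rmse_np` and the toy arrays are assumptions for illustration; the notebook itself uses `mean_squared_error` from scikit-learn):

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error, rounded to 5 decimals like the helper above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return round(float(np.sqrt(np.mean((y_true - y_pred) ** 2))), 5)

# Hypothetical log-revenue targets and predictions: one of three is off by 1.
y_true_toy = [0.0, 0.0, 2.0]
y_pred_toy = [0.0, 1.0, 2.0]

# Squared errors are 0, 1, 0, so the MSE is 1/3 and the RMSE is sqrt(1/3).
print(rmse_np(y_true_toy, y_pred_toy))
```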


### LightGBM

In [None]:
def run_lgb(X_train, y_train, X_val, y_val, X_test):
    
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 40,
        "learning_rate" : 0.005,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        "bagging_freq" : 6,  # LightGBM's name for this parameter; "bagging_frequency" is not recognized
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }
    
    lgb_train_data = lgb.Dataset(X_train, label=y_train)
    lgb_val_data = lgb.Dataset(X_val, label=y_val)

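    # LightGBM >= 4.0 no longer accepts early_stopping_rounds / verbose_eval
    # here; the equivalent callbacks (available since 3.3) are used instead.
    model = lgb.train(params, lgb_train_data, 
                      num_boost_round=5000,
                      valid_sets=[lgb_train_data, lgb_val_data],
                      callbacks=[lgb.early_stopping(stopping_rounds=100),
                                 lgb.log_evaluation(period=500)])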

    y_pred_train = model.predict(X_train, num_iteration=model.best_iteration)
    y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)
    y_pred_submit = model.predict(X_test, num_iteration=model.best_iteration)

    print(f"LGBM: RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [None]:
%%time
# Train LGBM and generate predictions
lgb_preds, lgb_model = run_lgb(X_train, y_train, X_val, y_val, X_test)

In [None]:
print("LightGBM features importance...")
gain = lgb_model.feature_importance('gain')
featureimp = pd.DataFrame({'feature': lgb_model.feature_name(), 
                   'split': lgb_model.feature_importance('split'), 
                   'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
print(featureimp[:10])

### XGBoost

In [None]:
def run_xgb(X_train, y_train, X_val, y_val, X_test):
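    # 'reg:squarederror' is the current name of the deprecated 'reg:linear';
    # 'verbosity': 0 replaces the deprecated 'silent': True.
    params = {'objective': 'reg:squarederror',
              'eval_metric': 'rmse',
              'eta': 0.001,
              'max_depth': 10,
              'subsample': 0.6,
              'colsample_bytree': 0.6,
              'alpha':0.001,
              'random_state': 42,
              'verbosity': 0}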

    xgb_train_data = xgb.DMatrix(X_train, y_train)
    xgb_val_data = xgb.DMatrix(X_val, y_val)
    xgb_submit_data = xgb.DMatrix(X_test)

    model = xgb.train(params, xgb_train_data, 
                      num_boost_round=2000, 
                      evals= [(xgb_train_data, 'train'), (xgb_val_data, 'valid')],
                      early_stopping_rounds=100, 
                      verbose_eval=500
                     )

    # iteration_range (XGBoost >= 1.4) replaces the older ntree_limit argument
    best_range = (0, model.best_iteration + 1)
    y_pred_train = model.predict(xgb_train_data, iteration_range=best_range)
    y_pred_val = model.predict(xgb_val_data, iteration_range=best_range)
    y_pred_submit = model.predict(xgb_submit_data, iteration_range=best_range)

    print(f"XGB : RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [None]:
%%time
xgb_preds, xgb_model = run_xgb(X_train, y_train, X_val, y_val, X_test)

### CatBoost

In [None]:
def run_catboost(X_train, y_train, X_val, y_val, X_test):
    model = CatBoostRegressor(iterations=1000,
                             learning_rate=0.05,
                             depth=10,
                             eval_metric='RMSE',
                             random_seed = 42,
                             bagging_temperature = 0.2,
                             od_type='Iter',
                             metric_period = 50,
                             od_wait=20)
    model.fit(X_train, y_train,
              eval_set=(X_val, y_val),
              use_best_model=True,
              verbose=True)
    
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)
    y_pred_submit = model.predict(X_test)

    print(f"CatB: RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [None]:
%%time
# Train Catboost and generate predictions
cat_preds, cat_model = run_catboost(X_train, y_train, X_val, y_val,  X_test)

## Ensemble

In [None]:
# Note: this is currently being reconstructed!
ensemble_preds_70_30_00 = 0.7 * lgb_preds + 0.3 * cat_preds + 0.0 * xgb_preds 
ensemble_preds_70_25_05 = 0.7 * lgb_preds + 0.25 * cat_preds + 0.05 * xgb_preds 
def submit(predictions, filename='submit.csv'):
    """
    Takes a (804684,) 1d-array of predictions and generates a submission file named filename
    """
    _, _, X_submit = load_preprocessed_dfs(drop_full_visitor_id=False)
    submission = X_submit[['fullVisitorId']].copy()
    
    submission.loc[:, 'PredictedLogRevenue'] = predictions
    grouped_test = submission[['fullVisitorId', 'PredictedLogRevenue']].groupby('fullVisitorId').sum().reset_index()
    grouped_test.to_csv(filename,index=False)

submit(lgb_preds, "submit-lgb.csv")
# Note: I disabled XGB to make the notebook run faster
submit(xgb_preds, "submit-xgb.csv")
submit(cat_preds, "submit-cat.csv")
submit(ensemble_preds_70_30_00, "submit-ensemble-70_30_00.csv")
submit(ensemble_preds_70_25_05, "submit-ensemble-70_25_05.csv")

ensemble_preds_70_30_00_pos = np.where(ensemble_preds_70_30_00 < 0, 0, ensemble_preds_70_30_00)
submit(ensemble_preds_70_30_00_pos, "submit-ensemble-70_30_00-positive.csv")

ensemble_preds_70_25_05_pos = np.where(ensemble_preds_70_25_05 < 0, 0, ensemble_preds_70_25_05)
submit(ensemble_preds_70_25_05_pos, "submit-ensemble-70_25_05-positive.csv")
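
The blending and clipping steps above can be illustrated on a few made-up predictions. This is a sketch only: the three toy arrays stand in for `lgb_preds`, `cat_preds` and `xgb_preds`, and the weights are the 70/25/05 split used above. Negative predictions are clipped to zero because revenue cannot be negative:

```python
import numpy as np

# Hypothetical per-row predictions from the three models.
lgb_toy = np.array([0.0, 2.0, -0.5])
cat_toy = np.array([0.2, 1.0, -0.1])
xgb_toy = np.array([0.1, 3.0, 0.3])

# Weights in the order LightGBM, CatBoost, XGBoost; they sum to 1.
weights = np.array([0.70, 0.25, 0.05])

# Weighted average per row: weights (3,) @ stacked predictions (3, n) -> (n,)
blend = weights @ np.vstack([lgb_toy, cat_toy, xgb_toy])

# Same effect as np.where(blend < 0, 0, blend)
blend_pos = np.clip(blend, 0, None)
print(blend, blend_pos)
```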