# HOMEWORK 2

* Work on Google Analytics Customer Revenue Prediction (https://www.kaggle.com/c/ga-customer-revenue-prediction/overview)
* https://docs.google.com/document/d/1cz4u465B4Oi86gLZSoHeEVIHH5BzSGnodHDaIUWCh0w/edit

#### FROM THE SPEC ABOVE (Copied here for easy reference)

1. Take a look at the training data. There may be anomalies in the data that you may need to factor in before you start on the other tasks. Make a note of the anomalies that you notice. Clean the data first to handle these issues. Explain what you did to clean the data (in bulleted form). (10 points)

2. Generate a heatmap and two other plots (with a subset of variables) visualizing interesting positive and negative correlations. Explain the reason for your choice for these variables and any interesting results associated with them. (15 points)

3. Cluster the data based on geographic information available with a subset of variables that you find relevant.  Include a visualization plot.  Describe your inferences from the clustering and discuss their significance. (15 points)

4. Define a buying score or probability function for each user, which predicts the likelihood of a user buying a product from the GStore. Rank the ten users most likely to buy a product from the store. Does it seem to you that it produces good results? Report why or why not. (15 points)

5. Identify at least one external data set which you can integrate into your transaction prediction analysis to make it better. Discuss/analyze the extent to which this data helps with the prediction task. (10 points).

6. Finally, build the best prediction model you can to solve the Kaggle task.  Use any data, ideas, and approach that you like. Submit the results of your best models on Kaggle.  Report the rank, score, number of entries, for your highest rank. Include a snapshot of your best score on the leaderboard as confirmation. (20 points)

7. Do a permutation test to determine whether your model really benefits from each input variable you use. In particular, one at a time, for each relevant input variable, permute the values of this variable and see how this impacts the accuracy of the results. Run enough permutations per variable to establish a p-value of how good your predictions of log of sum of transactions per user are. You can use whatever metric you wish to score your model (like mean absolute error). (15 points)
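
The permutation test in task 7 can be sketched as follows. This is only an illustration under stated assumptions: the helper `permutation_p_value`, the toy linear model, and the synthetic data are hypothetical and are not part of the assignment or of the Kaggle data. The p-value is estimated as the share of permutations in which the shuffled variable scores at least as well as the untouched data, with a +1 correction so it is never exactly zero.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def permutation_p_value(predict, X, y, column, n_permutations=200, seed=0):
    """Shuffle one column repeatedly and compare the MAE with the baseline MAE."""
    rng = np.random.default_rng(seed)
    baseline = mean_absolute_error(y, predict(X))
    at_least_as_good = 0
    for _ in range(n_permutations):
        X_perm = X.copy()
        X_perm[:, column] = rng.permutation(X_perm[:, column])
        if mean_absolute_error(y, predict(X_perm)) <= baseline:
            at_least_as_good += 1
    # +1 in numerator and denominator counts the unpermuted data as one arrangement.
    return baseline, (at_least_as_good + 1) / (n_permutations + 1)


# Synthetic stand-in data: column 0 drives the target, column 1 is pure noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + data_rng.normal(scale=0.1, size=500)

# Toy model (assumption): ordinary least squares with an intercept.
design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)


def predict(features):
    return np.column_stack([features, np.ones(len(features))]) @ coef


baseline_mae, p_informative = permutation_p_value(predict, X, y, column=0)
_, p_noise = permutation_p_value(predict, X, y, column=1)
print(f"baseline MAE={baseline_mae:.3f}  p(col 0)={p_informative:.4f}  p(col 1)={p_noise:.4f}")
```

With a real model the scoring would more likely be done on held-out sessions rather than on the training rows used here.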



## Data Fields


* **fullVisitorId** - A unique identifier for each user of the Google Merchandise Store.
* **channelGrouping** - The channel via which the user came to the Store.
* **date** - The date on which the user visited the Store.
* **device** - The specifications for the device used to access the Store.
* **geoNetwork** - This section contains information about the geography of the user.
* **socialEngagementType** - Engagement type, either "Socially Engaged" or "Not Socially Engaged".
* **totals** - This section contains aggregate values across the session.
* **trafficSource** - This section contains information about the Traffic Source from which the session originated.
* **visitId** - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
* **visitNumber** - The session number for this user. If this is the first session, then this is set to 1.
* **visitStartTime** - The timestamp (expressed as POSIX time).
* **hits** - This row and nested fields are populated for any and all types of hits. Provides a record of all page visits.
* **customDimensions** - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.
* **totals** - This set of columns mostly includes high-level aggregate data.
    * **totals.transactionRevenue** - The target variable we want to predict.
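
Two of the field notes above translate directly into code: a fully unique session key needs both `fullVisitorId` and `visitId`, and the prediction target is defined per user from `totals.transactionRevenue`. The sketch below uses a tiny made-up frame; the `sessionId` column name and the use of `log1p` (so that users with zero revenue map to 0) are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sessions; IDs are strings so leading zeros survive.
sessions = pd.DataFrame({
    "fullVisitorId": ["001", "001", "002"],
    "visitId": [1000, 2000, 1000],
    "totals.transactionRevenue": [1_000_000.0, np.nan, np.nan],
})

# visitId alone repeats across users, so combine it with fullVisitorId.
sessions["sessionId"] = sessions["fullVisitorId"] + "_" + sessions["visitId"].astype(str)

# Missing revenue is treated as "no purchase" (assumption), then summed per user.
revenue = sessions["totals.transactionRevenue"].fillna(0)
user_target = np.log1p(revenue.groupby(sessions["fullVisitorId"]).sum())
print(user_target)
```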


In [1]:
import json
from os import path

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated
from tqdm import tqdm

In [3]:
# transform df
# https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields

JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
CSV_PATH = './homework2_data/train_v2.csv'
NEW_CSV_PATH = './homework2_data/train_v2_norm.csv'
CHUNK_SIZE = 100_000
count = 0

if not path.exists(NEW_CSV_PATH):
    csv_chunks = pd.read_csv(
        CSV_PATH,
        chunksize=CHUNK_SIZE,
        converters={column: json.loads for column in JSON_COLUMNS},
        dtype={'fullVisitorId': 'str'},  # keep IDs as strings; numeric parsing would drop leading zeros
    )

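    for chunk_number, df in enumerate(tqdm(csv_chunks)):
        # Later chunks keep their file position as index, while json_normalize
        # returns a fresh 0..n-1 index; reset so the index-based merge lines up.
        df = df.reset_index(drop=True)
        for column in JSON_COLUMNS:
            column_as_df = json_normalize(df[column])
            column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
            df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)

        # Write the header once; repeating it per chunk puts header rows into the data.
        # Note: this assumes every chunk flattens to the same columns in the same order.
        df.to_csv(NEW_CSV_PATH, mode='a', index=False, header=(chunk_number == 0))
        count += len(df)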
    
    print(f'Created a new csv in {NEW_CSV_PATH}')
else:
    print('New csv already exists.')

print(f'Processed {count} records.')

New csv already exists.
Processed 0 records.


In [5]:
csv_chunks = pd.read_csv(
    CSV_PATH,
    chunksize=CHUNK_SIZE,
)

for df in tqdm(csv_chunks):
    df.to_csv('./temp.csv', mode='a', index=False)

13it [11:15, 57.21s/it]

KeyboardInterrupt: 

# 1. Detect and handle anomalies

In [9]:
df = pd.read_csv(NEW_CSV_PATH, dtype={'fullVisitorId': 'str'})



In [15]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 100000.0 | 49999.5 | 28867.66 | 0.0 | 24999.75 | 49999.5 | 74999.25 | 99999.0 |
| totals.bounces | 49854.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| totals.hits | 100000.0 | 4.36707 | 8.596425 | 1.0 | 1.0 | 2.0 | 4.0 | 331.0 |
| totals.newVisits | 76365.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| totals.pageviews | 99989.0 | 3.647661 | 6.147223 | 1.0 | 1.0 | 1.0 | 4.0 | 230.0 |
| totals.sessionQualityDim | 46128.0 | 3.567681 | 10.78262 | 1.0 | 1.0 | 1.0 | 1.0 | 96.0 |
| totals.timeOnSite | 50026.0 | 241.9511 | 456.1073 | 1.0 | 29.0 | 76.0 | 231.0 | 9837.0 |
| totals.totalTransactionRevenue | 997.0 | 136845700.0 | 302446600.0 | 3400000.0 | 27990000.0 | 53570000.0 | 114650000.0 | 5501000000.0 |
| totals.transactionRevenue | 997.0 | 119069400.0 | 258881800.0 | 1200000.0 | 22380000.0 | 46380000.0 | 108470000.0 | 5498000000.0 |
| totals.transactions | 1002.0 | 1.055888 | 0.3531933 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 |
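
In the summary above, `totals.bounces` and `totals.newVisits` have a mean of 1.0, a standard deviation of 0.0, and fewer non-null values than there are rows. One plausible reading (an assumption, not something the data confirms by itself) is that these are flags recorded only when true, so a missing value means 0. A minimal sketch of that cleaning step on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the flattened training data.
toy = pd.DataFrame({
    "totals.bounces": [1.0, np.nan, 1.0, np.nan],
    "totals.newVisits": [1.0, 1.0, np.nan, 1.0],
    "totals.hits": [1.0, 4.0, 2.0, 7.0],
})

# Treat a column as a flag if it has gaps and a single distinct non-null value.
flag_columns = [
    name for name in toy.columns
    if toy[name].isna().any() and toy[name].nunique(dropna=True) == 1
]

# Assumption: missing means "did not happen", so fill with 0.
toy[flag_columns] = toy[flag_columns].fillna(0).astype(int)
print(flag_columns)
print(toy)
```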


In [14]:
df.describe(include='all').transpose()

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 100000 | | | | 49999.5 | 28867.7 | 0.0 | 24999.8 | 49999.5 | 74999.2 | 99999.0 |
| channelGrouping | 100017 | 9.0 | Organic Search | 43254.0 | | | | | | | |
| customDimensions | 100017 | 7.0 | [{'index': '4', 'value': 'North America'}] | 43785.0 | | | | | | | |
| date | 100017 | 40.0 | 2.01611e+07 | 4055.0 | | | | | | | |
| fullVisitorId | 100017 | 90098.0 | 1957458976293878100 | 20.0 | | | | | | | |
| hits | 100017 | 93603.0 | [{'hitNumber': '1', 'time': '0', 'hour': '2', ... | 28.0 | | | | | | | |
| socialEngagementType | 100017 | 2.0 | Not Socially Engaged | 100000.0 | | | | | | | |
| visitId | 100017 | 98116.0 | visitId | 17.0 | | | | | | | |
| visitNumber | 100017 | 240.0 | 1 | 75085.0 | | | | | | | |
| visitStartTime | 100017 | 98105.0 | visitStartTime | 17.0 | | | | | | | |
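
The summary above reports the literal string `visitId` as the most frequent value of the `visitId` column (17 occurrences), and `visitStartTime` shows the same pattern. Together with the row count of 100017 against 100000 numeric rows, this suggests that header lines ended up inside the data. A hedged sketch of how such rows could be detected and dropped, using a small in-memory CSV as a stand-in for the real file:

```python
import io

import pandas as pd

# Stand-in CSV with one stray header line in the middle.
raw_csv = "visitId,visitNumber\n100,1\n101,2\nvisitId,visitNumber\n102,1\n"
toy = pd.read_csv(io.StringIO(raw_csv), dtype=str)

# A row is a repeated header if a cell equals its own column name.
is_header_row = toy["visitId"] == "visitId"
print(f"Stray header rows: {int(is_header_row.sum())}")

# Drop the stray rows, then restore numeric dtypes.
cleaned = toy.loc[~is_header_row].copy()
cleaned["visitId"] = cleaned["visitId"].astype("int64")
cleaned["visitNumber"] = cleaned["visitNumber"].astype("int64")
print(cleaned.dtypes)
```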
