# Google Analytics Customer Revenue Prediction

### Predict how much GStore customers will spend

Marketing teams are challenged to make appropriate investments in promotional strategies. In this competition, we’re challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to **predict revenue per customer**.

Hopefully, the outcome will be more actionable operational changes and better use of marketing budgets for companies that choose to apply data analysis to Google Analytics (GA) data.

# Step 1 - Understanding the Business Objective


## Frame the problem as an ML project:


The machine learning model should predict the natural log of the total amount, in dollars, that each unique GStore user spends across their visits to the web store (transaction revenue per visitor). This is a typical supervised regression problem.

* Supervised: the labels are included in the training data, and the goal is to train a model that learns to predict the labels from the features.

* Regression: the label is a numeric variable, so the task is to predict a quantity.


### What are we predicting?


We are predicting the natural log of the sum of all transactions per user, plus one. For every user in the test set, the target is:


> $y_{user} = \sum_{i=1}^{n} transaction_{user_i}$

> $target_{user} = \ln({y_{user}+1})$
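The two formulas above can be sketched in pandas as follows. This is a minimal illustration, not the competition's reference code: the toy data and the column name `revenue` are assumptions made for the example, since the real revenue value sits inside the nested totals field.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per visit.
# `revenue` is an assumed column name used only for this sketch.
visits = pd.DataFrame({
    "fullVisitorId": ["A", "A", "B", "C"],
    "revenue": [10.0, 5.0, 0.0, 100.0],
})

# y_user: sum of all transactions for each user
y_user = visits.groupby("fullVisitorId")["revenue"].sum()

# target_user = ln(y_user + 1); log1p computes ln(1 + x) directly
target_user = np.log1p(y_user)
```

Adding one before taking the log keeps the target defined for users who spend nothing: their target is exactly 0.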




### Evaluation Metric


Submissions are scored on the Root Mean Squared Error (RMSE), defined as:


$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$


where $\hat{y}_i$ is the natural log of the predicted revenue for customer $i$ and $y_i$ is the natural log of the actual summed revenue plus one.
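As a rough sketch, the metric can be computed with NumPy as shown below. The function name `rmse` and the sample values are illustrative assumptions. Both inputs are expected to already be on the log scale.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical example: actual revenue per user, converted to ln(revenue + 1)
actual_log = np.log1p([0.0, 15.0, 100.0])

# An all-zero prediction, as in the sample submission format
predicted_log = np.zeros(3)

score = rmse(actual_log, predicted_log)
```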


**Output of the Analysis:**


* CSV submission file: for each fullVisitorId in the test set, you must predict the natural log of their total revenue in PredictedLogRevenue. The submission file should contain a header and have the following format:

| fullVisitorId | PredictedLogRevenue |
|:------------------|:-:|
|0000000259678714014| 0 | 
|0000049363351866189| 0 |
| etc.              | 0 |
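A file in this format could be written with pandas roughly as follows. The sketch writes to an in-memory buffer rather than a real file, and the variable names are assumptions. The point to watch is that fullVisitorId should be handled as a string, since parsing it as a number would drop the leading zeros.

```python
import io
import pandas as pd

# Hypothetical predictions, using the two IDs shown in the table above
predictions = pd.DataFrame({
    "fullVisitorId": ["0000000259678714014", "0000049363351866189"],
    "PredictedLogRevenue": [0.0, 0.0],
})

# Write with a header row and without the DataFrame index
buffer = io.StringIO()
predictions.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the IDs back as strings preserves the leading zeros
round_trip = pd.read_csv(io.StringIO(csv_text), dtype={"fullVisitorId": str})
```

To produce an actual file, pass a path such as `"submission.csv"` (a hypothetical name) to `to_csv` instead of the buffer.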

***

# Step 2 - Data Collection & Data Preparation

The data is provided by [Google](<https://www.kaggle.com/c/google-analytics-customer-revenue-prediction/data>). There are three CSV files:

* **train.csv** - the training set

* **test.csv** - the test set 

* **sampleSubmission.csv** - a sample submission file in the correct format. It contains all fullVisitorIds in test.csv.

### Data Fields


Each row in the dataset is one visit to the store. Because we are _predicting the log of the total revenue per user_, not all rows in test.csv will correspond to a row in the submission, but all unique fullVisitorIds will correspond to a row in the submission.


* fullVisitorId - A unique identifier for each user of the Google Merchandise Store.

* channelGrouping - The channel via which the user came to the Store.

* date - The date on which the user visited the Store.

* device - The specifications for the device used to access the Store.

* geoNetwork - This section contains information about the geography of the user.

* sessionId - A unique identifier for this visit to the store.

* socialEngagementType - Engagement type, either "Socially Engaged" or "Not Socially Engaged".

* totals - This section contains aggregate values across the session.

* trafficSource - This section contains information about the Traffic Source from which the session originated.

* visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie_. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.

	> `__utmb` cookie | Default expiration time: 30 minutes from set/update | Description: Used to determine new sessions/visits. The cookie is created when the JavaScript library executes and no existing `__utmb` cookie exists. The cookie is updated every time data is sent to Google Analytics.

* visitNumber - The session number for this user. If this is the first session, then this is set to 1.

* visitStartTime - The timestamp (expressed as POSIX time).


Both __train.csv__ and __test.csv__ contain the columns listed under Data Fields. 
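Two of the field notes above lend themselves to a short sketch: combining fullVisitorId and visitId into a fully unique session key, and converting the POSIX visitStartTime into a timestamp. The toy rows and the new column names `sessionKey` and `visitStart` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical rows: two different users who share the same visitId
visits = pd.DataFrame({
    "fullVisitorId": ["0000000259678714014", "0000049363351866189"],
    "visitId": [1500000000, 1500000000],
    "visitStartTime": [1500000000, 1500000060],
})

# visitId alone is only unique per user, so combine it with fullVisitorId
visits["sessionKey"] = visits["fullVisitorId"] + "_" + visits["visitId"].astype(str)

# POSIX time is seconds since 1970-01-01 UTC
visits["visitStart"] = pd.to_datetime(visits["visitStartTime"], unit="s", utc=True)
```

In this example, visitId is duplicated across the two rows while the combined key is not.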


#### Removed Data Fields


Some fields were censored to remove target leakage. **Data Leakage** occurs when our predictors include data that will not be available at the time we make predictions. The major censored fields are listed below.


* hits - This row and its nested fields are populated for any and all types of hits. It provides a record of all page visits.

* customDimensions - This section contains any user-level or session-level custom dimensions that are set for a session. This is a repeated field and has an entry for each dimension that is set.

* totals - Multiple sub-columns were removed from the totals field.