# Analysis report

## EDA & Preprocessing

*Code and outputs documented in exploration.ipynb*

* Dataset to be analyzed was bigquery-public-data.google_analytics_sample. It consists of 366 tables, each of which contains analytics data for one day for googlemerchandisestore.com between August 2016 and July 2017.
* Chose March 2017 as the month from which to design the training set. Since a one-month period was specified for this purpose, decided to choose a month that would avoid anomalies associated with holiday shopping.
* Used a wildcard table to query all of the March 2017 tables together. Quantified number of sessions (top-level rows), number of hits (our layer of interest and location of the target) and number of times the target appeared. This showed a class imbalance of about 96% target-negative rows.
* Viewed a small sample of the data in a DataFrame to get a sense of field datatypes and the locations of repeated records.

* Wrote an exploratory query to unnest all records in the month's worth of data and save it to a DataFrame to facilitate reviewing and selecting features
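
The wildcard query described above might look roughly like the following sketch. It only builds the SQL string, so it runs without credentials. The `ga_sessions_*` pattern is the public dataset's daily table naming. The helper name `month_counts_query`, the session key built from `fullVisitorId` and `visitId`, and the use of action type `'3'` for "add to cart" are assumptions, not the notebook's actual code.

```python
import calendar

# Public dataset named in the report; one table per day, suffixed YYYYMMDD.
TABLE = "bigquery-public-data.google_analytics_sample.ga_sessions_*"

# Assumption: '3' is the action_type code for "add to cart" in the
# Universal Analytics export schema.
ADD_TO_CART = "3"


def month_counts_query(year: int, month: int) -> str:
    """Build a query counting sessions, hits and target hits for one month."""
    last_day = calendar.monthrange(year, month)[1]
    start = f"{year:04d}{month:02d}01"
    end = f"{year:04d}{month:02d}{last_day:02d}"
    # The comma join with UNNEST yields one row per hit, so only sessions
    # with at least one hit are counted.
    return f"""
SELECT
  COUNT(DISTINCT CONCAT(fullVisitorId, '-', CAST(visitId AS STRING))) AS sessions,
  COUNT(*) AS hits,
  COUNTIF(h.eCommerceAction.action_type = '{ADD_TO_CART}') AS add_to_cart_hits
FROM `{TABLE}`, UNNEST(hits) AS h
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
""".strip()


query = month_counts_query(2017, 3)
print(query)
```

Dividing `add_to_cart_hits` by `hits` in the result gives the positive-class rate behind the roughly 96% imbalance figure.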

## Feature Selection & Engineering

*Code and outputs documented in exploration.ipynb and ingest.ipynb*

* Fully unnesting the data produced a total of 302 columns. First step was to drop all columns that were completely empty, after which 144 columns remained.

* Found the target by searching the Universal Analytics BigQuery export schema for mentions of fields related to an 'add to cart' action. It seems to have been doubly recorded, as a string in hits.eventInfo.eventAction and as a (string-type) numeral in hits.eCommerceAction.action_type. I used the latter to engineer the target column, which I specified to contain a 1 in rows where the action type corresponded to "add to cart", and a 0 in all other rows.
* Even though an ecommerce action type of 0 corresponds to "unknown" and such rows accounted for at least 70% of the data, I left them in as I still considered such rows to be part of the group (hits) from which we were trying to predict the target's occurrence.
* Reviewed all fields that were partially null. I dropped another set of fields that were either almost completely null or such high-cardinality categoricals that I couldn't see a straightforward way to encode them.
* Targeted and dropped the fields that were explicitly obfuscated. 
* Attempted to manually review the unique values and their relative frequencies in the remaining fields. This revealed a handful of zero-variance or redundant fields I was able to drop.
* Identified a handful of categorical fields that I suspected would yield information on the target but were too high-cardinality to use directly. Decided to transform these features by keeping the most commonly appearing values in place and transforming the rest to 'other'. By this point the DataFrame was down to 46 columns.
* After investigating some of the GCP model training options, re-added the product sku field in the hope this would be a one-feature way to insert a lot of product information into the model. Planned to use label encoding on this 800-value field if a better solution couldn't be found.
* Engineered the sku feature so that a value would only be inserted in rows where exactly one product was associated with the hit, to prevent spurious duplication of hits.
* Finally, I dropped rows that indicated bounces in an effort to reduce the class imbalance slightly; I am not sure whether this was advisable.
* I wish I had decided earlier in the process that I would as a rule exclude any column from a repeated record **other** than hits, in order to avoid erroneously duplicating individual hits. The exception was sku, which came from the product record, with the aforementioned engineering to prevent row duplication.
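
Three of the steps above (the binary target, the 'other' bucketing, and the single-product sku rule) can be sketched in plain Python as follows. This is an illustration under assumptions, not the notebook's code: the function names, the sample rows, the `keep` threshold, and the `'3'` add-to-cart code are all hypothetical, and the notebooks presumably do the equivalent with DataFrame or SQL operations.

```python
from collections import Counter

ADD_TO_CART = "3"  # assumed action_type code for "add to cart"


def add_to_cart_label(action_type):
    """Return 1 for an add-to-cart hit and 0 for every other hit."""
    return 1 if action_type == ADD_TO_CART else 0


def collapse_rare(values, keep=10, other="other"):
    """Keep the `keep` most frequent values and map the rest to `other`."""
    top = {value for value, _ in Counter(values).most_common(keep)}
    return [value if value in top else other for value in values]


def single_product_sku(product_skus):
    """Return the SKU only when exactly one product is attached to the hit."""
    return product_skus[0] if len(product_skus) == 1 else None


# Hypothetical hits, one dict per hit, with products kept as a list so that
# a hit is never duplicated by unnesting its products.
hits = [
    {"action_type": "0", "browser": "Chrome", "skus": []},
    {"action_type": "3", "browser": "Chrome", "skus": ["SKU_A"]},
    {"action_type": "2", "browser": "Safari", "skus": ["SKU_A", "SKU_B"]},
    {"action_type": "3", "browser": "Opera", "skus": ["SKU_B"]},
]

browsers = collapse_rare([hit["browser"] for hit in hits], keep=1)
rows = [
    {
        "label": add_to_cart_label(hit["action_type"]),
        "browser": browser,
        "sku": single_product_sku(hit["skus"]),
    }
    for hit, browser in zip(hits, browsers)
]
print(rows)
```

Leaving sku empty for multi-product hits loses some information, but it keeps one row per hit, which is the property the last bullet is concerned with.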

## Model Choice

*Model generation queries are documented in ingest.ipynb*

* Decided to try two versions each of two model types: XGBoost and Logistic Regression

* These models are well suited to binary classification tasks and are readily available for use in GCP
* For XGBoost I specified label encoding for the category encoding method, although upon reviewing the model's metadata afterward it appears that one-hot was actually used on all categorical features. Would like to investigate further why this happened and whether it impacted performance.
* Tried two versions of an XGBoost model: one with automatic class weights applied and one without. Given the highly imbalanced label classes it was interesting to compare the presence or absence of this adjustment.
* The logistic regression model also used automatic weights, and there were also two versions here: one with the same number of trials as the XGBoost attempts, and one with double the trials and half the parallel trials.
* I omitted the sku field from the logistic regression models because they do not allow specifying label encoding.
* I went with the Vizier hyperparameter tuning algorithm for all versions as I thought it would be much more time-efficient than adjusting hyperparameters manually.
* I left the default hparam tuning objective as ROC AUC; it's a suitable metric for binary classification and is also useful in cases of label class imbalance.
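
Assuming the models were created with BigQuery ML `CREATE MODEL` statements, the four configurations described above could be generated along these lines. The dataset path, the `training_hits` table, the `label` column name, and the trial counts (20/4 and 40/2) are placeholders chosen only to mirror the "double the trials, half the parallel trials" relationship; the real values are in ingest.ipynb. Hyperparameter ranges are left out of this sketch.

```python
DATASET = "my_project.my_dataset"  # hypothetical project and dataset


def create_model_sql(name, model_type, *, class_weights, num_trials,
                     max_parallel_trials, encoding=None, exclude=()):
    """Render a BigQuery ML CREATE MODEL statement with Vizier tuning."""
    options = [
        f"model_type = '{model_type}'",
        "input_label_cols = ['label']",  # hypothetical label column name
        f"auto_class_weights = {'TRUE' if class_weights else 'FALSE'}",
        f"num_trials = {num_trials}",
        f"max_parallel_trials = {max_parallel_trials}",
        "hparam_tuning_algorithm = 'VIZIER_DEFAULT'",
        "hparam_tuning_objectives = ['ROC_AUC']",
    ]
    if encoding:
        options.append(f"category_encoding_method = '{encoding}'")
    # Drop columns a model type cannot use, e.g. sku for logistic regression.
    columns = f"* EXCEPT ({', '.join(exclude)})" if exclude else "*"
    joined = ",\n  ".join(options)
    return (
        f"CREATE OR REPLACE MODEL `{DATASET}.{name}`\n"
        f"OPTIONS (\n  {joined}\n)\n"
        f"AS SELECT {columns} FROM `{DATASET}.training_hits`"
    )


# Placeholder trial counts; only their ratios follow the report.
MODELS = {
    "model_1": dict(model_type="BOOSTED_TREE_CLASSIFIER", class_weights=False,
                    num_trials=20, max_parallel_trials=4,
                    encoding="LABEL_ENCODING"),
    "model_2": dict(model_type="BOOSTED_TREE_CLASSIFIER", class_weights=True,
                    num_trials=20, max_parallel_trials=4,
                    encoding="LABEL_ENCODING"),
    "model_3": dict(model_type="LOGISTIC_REG", class_weights=True,
                    num_trials=20, max_parallel_trials=4, exclude=("sku",)),
    "model_4": dict(model_type="LOGISTIC_REG", class_weights=True,
                    num_trials=40, max_parallel_trials=2, exclude=("sku",)),
}

statements = {name: create_model_sql(name, **spec)
              for name, spec in MODELS.items()}
print(statements["model_1"])
```

Keeping the four specifications in one dictionary makes the differences between versions (class weights, trial budget, sku exclusion) easy to compare at a glance.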

## Model Tuning & Evaluation Metrics

*Full metrics are documented in eval.ipynb*

* Due to an accidental overwrite, model 4 was trained before model 3 and their optimal trial results are now identical. I suspect this is due to the transfer learning feature of the Vizier algorithm.
* Model 1 was the only model trained without class weights. Its precision score (50%) was much higher than its recall score (15%).
* Conversely, models 2 and 3 strongly favored recall in their performance.
* The best balance between precision and recall came from model 4, which was distinguished by an increased number of trials.
* Model 4 also used a lower number of max parallel trials, which may have enabled the hyperparameter search algorithm to improve it further.
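
One way to put a number on the "balance" between precision and recall is the F1 score, the harmonic mean of the two. The report does not say which measure was used, so F1 here is an assumption; the only figures taken from above are model 1's precision (50%) and recall (15%).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Model 1: precision 0.50, recall 0.15 (from the bullets above).
model_1_f1 = f1_score(0.50, 0.15)
print(round(model_1_f1, 3))  # 0.231
```

Because the harmonic mean is pulled toward the smaller value, a model that strongly favors either precision or recall scores low, which is why a more balanced model can come out ahead on this measure.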

* The HP search space for both XGBoost models was:  
Learn rate: [1, 10]  
L1 regularization: [0, 10]  
L2 regularization: [0, 10]  
Max tree depth: [1, 10]  
Subsample: [1e-14, 1]

* The HP search space for both logistic regression models was:  
L1 regularization: [0, 10]  
L2 regularization: [0, 10]  
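
If the models were built in BigQuery ML, search spaces like these are typically passed as `HPARAM_RANGE` options in the `CREATE MODEL` statement. The sketch below renders the ranges listed above into that form; the mapping from the labels above to the option names `learn_rate`, `l1_reg`, `l2_reg`, `max_tree_depth` and `subsample` is my assumption, and the helper name `hparam_options` is hypothetical.

```python
# Ranges copied from the lists above as (low, high) pairs.
XGBOOST_SPACE = {
    "learn_rate": (1, 10),
    "l1_reg": (0, 10),
    "l2_reg": (0, 10),
    "max_tree_depth": (1, 10),
    "subsample": (1e-14, 1),
}
LOGISTIC_SPACE = {
    "l1_reg": (0, 10),
    "l2_reg": (0, 10),
}


def hparam_options(space):
    """Render each (low, high) pair as a BigQuery ML HPARAM_RANGE option."""
    return [f"{name} = HPARAM_RANGE({low}, {high})"
            for name, (low, high) in space.items()]


for line in hparam_options(XGBOOST_SPACE):
    print(line)
```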