<a href="https://colab.research.google.com/github/emilyhoughkovacs/splice/blob/main/CatBoostClassifier_model_no_feature_encoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Get data
The next three sections are the boilerplate that BigQuery generates when you export query results to a Python notebook. Jump to "Prepare data to be ready to use in model" for my analysis.

In [None]:
# @title Setup
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

project = 'emily-hk' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
auth.authenticate_user()  # authenticate first so the client is created with valid user credentials
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()

## Reference SQL syntax from the original job
Use the `jobs.query`
[method](https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query) to
return the SQL syntax from the job. This can be copied from the output cell
below to edit the query now or in the future. Alternatively, you can use
[this link](https://console.cloud.google.com/bigquery?j=emily-hk:US:bquxjob_1c159ffa_18dd7de1034)
back to BigQuery to edit the query within the BigQuery user interface.

In [None]:
# Running this code will display the query used to generate your previous job

job = client.get_job('bquxjob_1c159ffa_18dd7de1034') # Job ID inserted based on the query results selected to explore
print(job.query)

with browsers as (SELECT 
  device.browser,
  count(*)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
group by 1
order by 2 desc
limit 3),
oses as (SELECT 
  device.operatingSystem,
  count(*)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
group by 1
order by 2 desc
limit 5),
countries as (SELECT 
  geoNetwork.country,
  count(*)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`
group by 1
order by 2 desc
LIMIT 10),
lp as (SELECT 
  if(h.page.pagePath LIKE '/google+redesign/apparel%', '/google+redesign/apparel', h.page.pagePath) as landing_page,
  count(*)
FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`, unnest(hits) h
where isEntrance is true
group by 1
order by 2 desc
LIMIT 5
),
visits_landing_page as (SELECT
  CONCAT(fullVisitorId, "-", visitId, "-", date) as unique_session_id,
  if(h.page.pagePath LIKE '/google+redesign/apparel%', '/google+redesign/apparel', h.page.pagePath) as landing_page
  FROM `bigquery-pub

## Result set loaded from BigQuery job as a DataFrame
Query results are read back using the ID of the job that ran in BigQuery, so the
query does not need to be re-run to explore them. The `to_dataframe`
[method](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob.to_dataframe)
downloads the results to a pandas DataFrame by using the BigQuery Storage API.

To edit the query syntax, use the BigQuery SQL editor.

In [None]:
# Running this code will read results from your previous job

job = client.get_job('bquxjob_1c159ffa_18dd7de1034') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results



Unnamed: 0,unique_session_id,hits,pageviews,medium,browser,operatingSystem,country,landing_page,is_transaction
0,1012363918031449847-1486510446-20170207,20,16,(none),Chrome,Windows,United States,Other,1
1,158519894595822758-1478202366-20161103,18,15,(none),Chrome,Windows,Brazil,/home,0
2,4127569729086528635-1475610057-20161004,20,10,(none),Chrome,Windows,Other,/home,0
3,5291604673691887906-1480657743-20161201,74,56,(none),Chrome,Other,United States,/home,1
4,2647212466031308445-1482324895-20161221,45,32,(none),Chrome,Macintosh,Other,/home,0
...,...,...,...,...,...,...,...,...,...
903548,4109944507997654994-1471328759-20160815,16,12,organic,Chrome,Macintosh,Other,/home,0
903549,5716994513150474293-1481684687-20161213,16,14,(none),Chrome,Windows,United States,/google+redesign/shop+by+brand/youtube,1
903550,2180515369167502520-1498663407-20170628,16,14,(none),Chrome,Windows,United States,Other,1
903551,8118214310316526596-1494572992-20170512,16,14,organic,Chrome,Windows,Thailand,/home,0


# Prepare data to be ready to use in model

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
df = results.drop('unique_session_id', axis=1)

In [None]:
df.shape

(903553, 8)

In [None]:
y = df['is_transaction']
X = df.drop(['is_transaction'], axis=1)

In [None]:
X.dtypes

hits                Int64
pageviews           Int64
medium             object
browser            object
operatingSystem    object
country            object
landing_page       object
dtype: object

# Set up and train the CatBoost model
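A major benefit of `CatBoostClassifier` is that it handles categorical features natively: we only need to tell it which columns are categorical (via `cat_features`), with no manual one-hot or label encoding beforehand, so we can train it out of the box.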
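
To give a feel for how a categorical column can be consumed without manual one-hot or label encoding, here is a simplified sketch of an ordered target statistic, the general idea CatBoost's documentation describes for turning categories into numbers. The function name `ordered_target_stats`, the `prior` value, and the toy data are assumptions made for illustration; the library's real implementation is more involved (for example, it uses several permutations and feature combinations).

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Encode each row's category using only rows that precede it in a random order.

    Hypothetical helper for illustration; not part of the catboost API.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    seen_count = {}      # rows seen so far, per category
    seen_positive = {}   # positive labels seen so far, per category
    encoded = [0.0] * len(categories)

    for i in order:
        cat = categories[i]
        n = seen_count.get(cat, 0)
        pos = seen_positive.get(cat, 0)
        # Smoothed share of positives among earlier rows with the same category
        encoded[i] = (pos + prior) / (n + 1)
        seen_count[cat] = n + 1
        seen_positive[cat] = pos + targets[i]

    return encoded


# Toy data (assumed), shaped like the 'country' column and the 'is_transaction' label
countries = ["United States", "Brazil", "United States", "Other", "United States"]
is_transaction = [1, 0, 0, 0, 1]

encoded = ordered_target_stats(countries, is_transaction)
for country, value in zip(countries, encoded):
    print(f"{country:<14} {value:.3f}")
```

Each row is encoded from rows that come before it in the shuffled order, so a row's own label never leaks into its encoding.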

In [None]:
!pip install catboost



In [None]:
from catboost import CatBoostClassifier
from catboost import Pool

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

In [None]:
clf = CatBoostClassifier()

In [None]:
cat_features = ['medium', 'browser', 'operatingSystem', 'country', 'landing_page']

In [None]:
train_pool = Pool(data=X_train, label=y_train, cat_features=cat_features)

In [None]:
clf.fit(train_pool)

Learning rate set to 0.171295
0:	learn: 0.3570492	total: 1.16s	remaining: 19m 23s
1:	learn: 0.1699387	total: 2.17s	remaining: 18m 5s
2:	learn: 0.0764705	total: 3.2s	remaining: 17m 42s
3:	learn: 0.0490787	total: 4.15s	remaining: 17m 13s
4:	learn: 0.0404287	total: 4.98s	remaining: 16m 31s
5:	learn: 0.0364834	total: 6.29s	remaining: 17m 21s
6:	learn: 0.0343236	total: 7.95s	remaining: 18m 47s
7:	learn: 0.0330578	total: 8.73s	remaining: 18m 2s
8:	learn: 0.0324723	total: 9.55s	remaining: 17m 31s
9:	learn: 0.0320596	total: 10.5s	remaining: 17m 16s
10:	learn: 0.0316417	total: 11.3s	remaining: 16m 56s
11:	learn: 0.0311523	total: 12.3s	remaining: 16m 48s
12:	learn: 0.0309307	total: 13.2s	remaining: 16m 42s
13:	learn: 0.0305568	total: 14.1s	remaining: 16m 30s
14:	learn: 0.0304296	total: 14.9s	remaining: 16m 17s
15:	learn: 0.0301868	total: 15.8s	remaining: 16m 10s
16:	learn: 0.0300695	total: 16.6s	remaining: 16m
17:	learn: 0.0298767	total: 17.5s	remaining: 15m 52s
18:	learn: 0.0298158	total: 18.5s

<catboost.core.CatBoostClassifier at 0x7a1feb07d7b0>

# Get feature importance

In [None]:
feature_importance = clf.get_feature_importance(train_pool)

In [None]:
importance_df = pd.DataFrame({
    'Feature': train_pool.get_feature_names(),
    'Importance': feature_importance
})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

In [None]:
importance_df

Unnamed: 0,Feature,Importance
1,pageviews,39.170139
5,country,25.353987
6,landing_page,18.136527
0,hits,6.389519
2,medium,5.174843
4,operatingSystem,3.474569
3,browser,2.300416


The table above shows that the three most important features in our model are pageviews, country, and landing page, which together account for about 83% of the total importance. Let's drill deeper into country and landing page.
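
The importance values in the table sum to roughly 100, so each one can be read as a percentage share. As a quick check, the sketch below copies the values from the table by hand and computes the cumulative share; the variable names are illustrative.

```python
# Importance values copied from the table above
importances = {
    "pageviews": 39.170139,
    "country": 25.353987,
    "landing_page": 18.136527,
    "hits": 6.389519,
    "medium": 5.174843,
    "operatingSystem": 3.474569,
    "browser": 2.300416,
}

total = sum(importances.values())

# Walk the features from most to least important and accumulate their share
running = 0.0
cumulative_share = {}
for feature, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    running += value
    cumulative_share[feature] = running / total

for feature, share in cumulative_share.items():
    print(f"{feature:<16} {share:6.1%}")
```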

## Country

In [None]:
# Get predicted probabilities for each class
pred_probs = clf.predict_proba(train_pool)

# Create a DataFrame to store predicted probabilities for each class
pred_probs_df = pd.DataFrame(pred_probs, columns=clf.classes_)

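# Add the relevant features and the target labels to the DataFrame.
# X_train still carries the shuffled row labels from the original DataFrame, while
# pred_probs_df has a fresh 0..n-1 index, so reset the index to align rows by position.
# Assigning X_train's columns directly would align on labels, pairing features with the
# wrong rows and leaving NaN wherever a label belongs to the test split.
pred_probs_df[['pageviews', 'country', 'landing_page']] = X_train[['pageviews', 'country', 'landing_page']].reset_index(drop=True)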
pred_probs_df['target'] = y_train.values

In [None]:
pred_probs_df



Unnamed: 0,0,1,pageviews,country,landing_page,target
0,0.999986,1.404360e-05,16,United States,Other,0
1,0.999957,4.252363e-05,15,Brazil,/home,0
2,0.999999,1.151563e-06,,,,0
3,0.999996,3.879431e-06,56,United States,/home,0
4,1.000000,3.510761e-07,32,Other,/home,0
...,...,...,...,...,...,...
722837,0.999881,1.191285e-04,4,United States,/home,0
722838,0.999737,2.629987e-04,4,Other,/home,0
722839,0.999954,4.626251e-05,4,United States,/home,0
722840,0.999998,1.917359e-06,3,United States,/home,0
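
pandas aligns on index labels, not on row position, when columns from one DataFrame are assigned to another. That matters here because `X_train` keeps its original row labels after `train_test_split`, while a DataFrame built from `predict_proba` output gets a fresh 0..n-1 index. A row of missing feature values, like row 2 in the output above, is the typical symptom. The toy frames below are hypothetical stand-ins that reproduce the effect on a small scale.

```python
import pandas as pd

# Hypothetical stand-in for X_train: index labels are original row numbers,
# shuffled by the train/test split.
features = pd.DataFrame({"country": ["Brazil", "Japan", "Canada"]}, index=[7, 0, 2])

# Hypothetical stand-in for pred_probs_df: built from an array, so it has a fresh 0..2 index.
probs = pd.DataFrame({"p1": [0.10, 0.20, 0.30]})

# Label-based alignment: row 0 receives the feature labelled 0, and row 1 finds no match.
misaligned = probs.copy()
misaligned["country"] = features["country"]

# Positional alignment: drop the old labels before assigning.
aligned = probs.copy()
aligned["country"] = features["country"].reset_index(drop=True)

print(misaligned)
print(aligned)
```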


In [None]:
# Group by 'country' and calculate the mean target value for each category
country_target_means = pred_probs_df.groupby('country')['target'].mean()

# Sort by mean target value to identify countries more likely to hit the target
country_target_means_sorted = country_target_means.sort_values(ascending=False)

# Display the sorted mean target values
print(country_target_means_sorted)

country
Japan             0.013876
Germany           0.013799
Brazil            0.013466
Vietnam           0.012999
India             0.012802
United States     0.012735
Other             0.012729
Turkey            0.012547
Thailand          0.012423
United Kingdom    0.012318
Canada            0.011669
Name: target, dtype: Float64


## Landing page

In [None]:
# Group by 'landing_page' and calculate the mean target value for each category
lp_target_means = pred_probs_df.groupby('landing_page')['target'].mean()

# Sort by mean target value to identify landing pages more likely to hit the target
lp_target_means_sorted = lp_target_means.sort_values(ascending=False)

# Display the sorted mean target values
print(lp_target_means_sorted)

landing_page
/home                                     0.012965
Other                                     0.012742
/google+redesign/apparel                  0.012657
/basket.html                              0.012412
/google+redesign/shop+by+brand/youtube    0.012009
/signin.html                              0.009024
Name: target, dtype: Float64


## Prediction analysis

In [None]:
predictions = clf.predict(X_test)

In [None]:
X_test_with_predictions = X_test.copy()

In [None]:
X_test_with_predictions['predictions'] = predictions

In [None]:
conversion_segments = X_test_with_predictions.groupby('country')['predictions'].mean()

In [None]:
conversion_segments.sort_values(ascending=False)

country
United States     0.012685
Canada            0.000578
Other             0.000084
Brazil            0.000000
Germany           0.000000
India             0.000000
Japan             0.000000
Thailand          0.000000
Turkey            0.000000
United Kingdom    0.000000
Vietnam           0.000000
Name: predictions, dtype: float64

In [None]:
conversion_lp_segments = X_test_with_predictions.groupby('landing_page')['predictions'].mean()

In [None]:
conversion_lp_segments.sort_values(ascending=False)

landing_page
/basket.html                              0.098626
/google+redesign/apparel                  0.008704
Other                                     0.008385
/signin.html                              0.004860
/home                                     0.003163
/google+redesign/shop+by+brand/youtube    0.001298
Name: predictions, dtype: float64