# Introduction

## Import packages

In [58]:
import numpy as np
import math
import pandas as pd

## Google BigQuery Datasets
Note: If you don't want to query Google BigQuery, the datasets are also available in the repository; you can load them directly into the notebook without querying anything.
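
As a minimal sketch of loading a saved dataset instead of querying, assuming the interactions were exported to CSV with the DataFrame index in the first column. The sample contents and the `load_interactions` helper are hypothetical, and the actual file names in the repository may differ:

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file from the repository; with a real file,
# pass its path to load_interactions() instead of a StringIO buffer.
csv_text = """,productsku,fullvisitorid,tot_interest,product_name
0,GGOEGAAX0330,3963931733144286855,1.5,YouTube Men's Long & Lean Tee Charcoal
1,GGOEGETB023799,750846065342433129,1.5,Google Power Bank
"""

def load_interactions(source):
    # Read the IDs as strings so long visitor IDs are not parsed as numbers.
    return pd.read_csv(
        source,
        index_col=0,
        dtype={"productsku": str, "fullvisitorid": str},
    )

df_local = load_interactions(io.StringIO(csv_text))
print(df_local.dtypes)
```

Keeping `fullvisitorid` as a string mirrors how the rest of the notebook treats it: as an opaque identifier rather than a quantity.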

Paste the path to your [service account](https://cloud.google.com/iam/docs/creating-managing-service-account-keys) key file into the call that builds the _credentials_ variable. You can find your _project_id_ in the Google Cloud Platform console (if you want to know more about projects, follow this [link](https://cloud.google.com/storage/docs/projects)).

In [9]:
from google.cloud import bigquery
from google.oauth2 import service_account


credentials = service_account.Credentials.from_service_account_file(
    r'PATH\TO\SERVICE_ACCOUNT_KEY.json')
project_id = 'YOUR-PROJECT-ID'
client = bigquery.Client(credentials=credentials, project=project_id)

In [10]:
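# NOTE: bigquery.Client() with no arguments ignores the explicit credentials
# set above and falls back to Application Default Credentials.
# Uncomment only if that is what you want:
# client = bigquery.Client()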

We start by querying the Google Analytics dataset to obtain a value for each user-item interaction. The value depends on the action a user has taken with respect to a particular item. These values are then summed to obtain the total "interest" each user _u_ has in item _i_.

The criterion for assigning a value to each user-item interaction is the following:
- If the user has clicked on the item from the product list (action_type = 1), the value assigned to it is 0.5, ... <br>

and so on, as you can see in the query below. Note that the assigned values are arbitrary and not the product of deterministic calculations. <br> 
These are the action_type descriptions taken from the [Google BigQuery Export schema](https://support.google.com/analytics/answer/3437719?hl=en):

- 2: product detail was viewed by the user,
- 3: product was added to cart,
- 4: product has been removed from cart (negative value assigned to it),
- 5: product has been checked out,
- 6: the item purchase is complete,
- 7: a refund was requested for the item (_very_ negative value assigned to it).
<br>
Finally, action_type "0" indicates that the action taken by the user with respect to the item is unknown. Since I was not sure how to treat an unknown action, I decided to exclude it from the value calculation.
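
The scoring rule described above can be sketched in pandas as follows. The weights are the ones used in the notebook's query; the toy `events` table and the user and product labels are hypothetical, chosen only to illustrate the mapping and the per-pair sum:

```python
import pandas as pd

# Weights per action_type, copied from the CASE expression in the query.
ACTION_WEIGHTS = {
    "1": 0.5,   # click from product list
    "2": 1.0,   # product detail view
    "3": 2.5,   # add to cart
    "4": -2.5,  # remove from cart
    "5": 3.5,   # checkout
    "6": 6.0,   # purchase completed
    "7": -8.0,  # refund
}

# Hypothetical raw events: one row per user-item action.
events = pd.DataFrame({
    "fullvisitorid": ["u1", "u1", "u1", "u2", "u2"],
    "productsku":    ["A",  "A",  "B",  "A",  "A"],
    "action_type":   ["1",  "2",  "0",  "3",  "4"],
})

# Drop unknown actions ("0"), score the rest, then sum per user-item pair.
known = events[events["action_type"] != "0"].copy()
known["event_strength"] = known["action_type"].map(ACTION_WEIGHTS)
tot_interest = (
    known.groupby(["fullvisitorid", "productsku"], as_index=False)["event_strength"]
    .sum()
    .rename(columns={"event_strength": "tot_interest"})
)
print(tot_interest)
```

In this toy data, user `u1` clicks and then views product `A`, for a total of 1.5, while `u2` adds `A` to the cart and removes it again, for a total of 0. The pair (`u1`, `B`) disappears because its only action is unknown.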

In [11]:
query = """
        with event as
        (SELECT hp.productsku productsku, fullvisitorid,v2ProductName product_name,
        CASE
        WHEN hits.eCommerceAction.action_type = '1' THEN 0.5
        WHEN hits.eCommerceAction.action_type = '2' THEN 1
        WHEN hits.eCommerceAction.action_type = '3' THEN 2.5
        WHEN hits.eCommerceAction.action_type = '4' THEN -2.5
        WHEN hits.eCommerceAction.action_type = '5' THEN 3.5
        WHEN hits.eCommerceAction.action_type = '6' THEN 6
        WHEN hits.eCommerceAction.action_type = '7' THEN -8
        END AS eventStrength_exp
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, unnest(hits) hits, unnest(hits.product) as hp
        WHERE hits.eCommerceAction.action_type != '0'
        )
        SELECT distinct productsku,fullvisitorid, sum(eventStrength_exp) as tot_interest, product_name
        FROM event
        GROUP BY 1,2,4
        """
df = client.query(query).result().to_dataframe()

Let's take a look at our dataset:

In [12]:
df.head()

| | productsku | fullvisitorid | tot_interest | product_name |
|---|---|---|---|---|
| 0 | GGOEGAAX0330 | 3963931733144286855 | 1.5 | YouTube Men's Long & Lean Tee Charcoal |
| 1 | GGOEGETB023799 | 750846065342433129 | 1.5 | Google Power Bank |
| 2 | GGOEADWQ015699 | 138058039294367332 | 2.0 | Android Rise 14 oz Mug |
| 3 | GGOEGBRB013899 | 551028300396393478 | 1.5 | Google Laptop Backpack |
| 4 | GGOEYDHJ056099 | 2865117450599304911 | 2.5 | 22 oz YouTube Bottle Infuser |


We also create another dataset that associates each _productsku_ with the product's corresponding name:

In [94]:
query = """
        SELECT distinct productsku, v2ProductName product_name
        FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170801`, unnest(hits) hits, unnest(hits.product) as hp
        WHERE hits.eCommerceAction.action_type != '0';
        """
item_names = client.query(query).result().to_dataframe()

In [14]:
item_names.head()

| | productsku | product_name |
|---|---|---|
| 0 | GGOEGAAX0330 | YouTube Men's Long & Lean Tee Charcoal |
| 1 | GGOEGETB023799 | Google Power Bank |
| 2 | GGOEADWQ015699 | Android Rise 14 oz Mug |
| 3 | GGOEGBRB013899 | Google Laptop Backpack |
| 4 | GGOEYDHJ056099 | 22 oz YouTube Bottle Infuser |


In [15]:
# convert user and product IDs to categoricals so that each one maps to a unique integer code
df['fullvisitorid'] = df['fullvisitorid'].astype("category")
df['productsku'] = df['productsku'].astype("category")

The cell above converts the IDs to categoricals. This is needed to create the lists of unique customers and products and to store each user's interaction with each item:
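
To illustrate what the categorical conversion provides, here is a small sketch on a hypothetical three-row series of SKUs. The variable names `index_to_sku` and `sku_to_index` are illustrative and play the same role as the notebook's `products` and `productskus` dictionaries:

```python
import pandas as pd

# Hypothetical SKU column with one repeated value.
skus = pd.Series(
    ["GGOEGETB023799", "GGOEGAAX0330", "GGOEGETB023799"]
).astype("category")

# Categories are the sorted unique values; codes are their integer positions.
index_to_sku = dict(enumerate(skus.cat.categories))
sku_to_index = {sku: idx for idx, sku in index_to_sku.items()}
codes = skus.cat.codes.tolist()

print(index_to_sku)  # integer index -> SKU
print(codes)         # one integer code per row
```

The integer codes are what later identify rows and columns of the interaction matrix, so the two dictionaries let us translate between matrix positions and SKUs in both directions.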

In [40]:
customers = list(np.sort(df.fullvisitorid.unique()))
products = list(df.productsku.unique())
interaction = list(df.tot_interest)

In [91]:
products = dict(enumerate(df['productsku'].cat.categories))
productskus = {r: i for i, r in products.items()}

## Data analysis using the interactions function

In [90]:
customers_df = pd.DataFrame(customers)

For our recommendation example, we need to take a random customer from the dataset and check which items he/she has interacted with. This gives us a rough way to judge whether the model's predictions are plausible.
Let's define a function __interactions__ that returns the products the user interacted with and the corresponding interest for each product:

In [92]:
def interactions(customer_id):
    return df[['product_name','tot_interest']][df['fullvisitorid'] == customers_df.loc[customer_id][0]]

Now, let's take a random customer, e.g. 50, and use the function defined above:

In [79]:
interactions(50)

| | product_name | tot_interest |
|---|---|---|
| 524 | Google Men's Zip Hoodie | 3.5 |
| 525 | Google Men's Watershed Full Zip Hoodie Grey | 6.0 |
| 526 | YouTube Men's Fleece Hoodie Black | 4.0 |


From the above result, it looks like the person is especially interested in Hoodies!

Create a sparse matrix of all the users/items, where every user-item pair without a recorded interaction is filled with zero. The model is then trained on this sparse matrix.

In [17]:
from scipy.sparse import csr_matrix

# create a sparse matrix of all the users/items
# pivot ratings into item features
df_items_users  = df.pivot(
    index='productsku',
    columns='fullvisitorid',
    values='tot_interest'
).fillna(0)
df_items_users  = csr_matrix(df_items_users.values)

In [19]:
df_items_users

<319x427 sparse matrix of type '<class 'numpy.float64'>'
	with 1145 stored elements in Compressed Sparse Row format>

As we can see, the matrix has been stored in a sparse row format for memory optimization purposes
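
As a rough illustration of what the compressed sparse row (CSR) format stores, the sketch below builds a tiny hypothetical items-by-users matrix and inspects its three internal arrays. Only the non-zero values are kept, together with their column indices and a pointer array marking where each row starts:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 4 items x 3 users interest matrix (0 = no interaction).
dense = np.array([
    [0.0, 1.5, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.5, 6.0],
])
sparse = csr_matrix(dense)

print(sparse.data)     # non-zero values, row by row
print(sparse.indices)  # column index of each stored value
print(sparse.indptr)   # row i occupies data[indptr[i]:indptr[i + 1]]

# Fraction of cells that are actually stored.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(f"density: {density:.2%}")
```

For the notebook's own matrix, 1145 stored elements out of 319 x 427 = 136,213 cells corresponds to a density of about 0.84%, which is why the sparse representation pays off.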

# Model creation
Firstly, let's define a couple of functions that we are going to need later to define and train the model using our sparse matrix

In [82]:
from implicit.als import AlternatingLeastSquares
from implicit.bpr import BayesianPersonalizedRanking

def model_creation(model_type):
    if model_type == 'ALS': 
        model = AlternatingLeastSquares(factors=100,
                                    regularization=1e-3,
                                    dtype=np.float64)
        return model
    
    elif model_type == 'BPR':
        model = BayesianPersonalizedRanking(factors=50,
                                    regularization=1e-4,
                                    dtype=np.float64)
        return model

    raise ValueError(f"Unknown model_type {model_type!r}: expected 'ALS' or 'BPR'")

def model_fitting(model, confidence, sparse_matrix):
    model.fit(confidence * sparse_matrix)

## Bayesian Personalized Ranking 
##### (If you want a more extensive explanation, you can find it in [this](https://arxiv.org/pdf/1205.2618.pdf) paper)
The BPR method is, as defined by its authors, an optimization criterion that approaches the item recommendation problem from a Bayesian point of view. Its advantage is that it optimizes the model parameters directly for ranking accuracy: it compares pairs of items (i.e. the user-specific ranking order of two items) instead of taking each item separately and regressing it onto a single number. <br>
More formally, let's first define the set containing the pairs of items as: <br>
\begin{equation*}
D_s := \{(u,i,j) | i \in I_u^+ \land j \in I \setminus I_u^+ \}
\end{equation*}
where _u_ is a specific user, ( _i_, _j_ ) is a pair of items as described above, $I$ is the set of all items and $I_u^+$ is the set of items that user _u_ has interacted with.
<br> The model consists of an optimization criterion derived from analyzing the problem using the likelihood function $p(i >_u j|\theta)$.
By maximizing the posterior probability $p(\theta | >_u) \propto p(>_u |\theta) \cdot p(\theta)$, and after introducing the ranking properties of _totality_, _antisymmetry_ and _transitivity_, we obtain the optimization criterion for personalized ranking, BPR-OPT, which we can define as: <br>
\begin{equation*}
\text{BPR-OPT} := \sum_{(u,i,j)\in D_s} \ln \sigma(\hat{x}_{uij}) - \lambda_\theta \| \theta \|^2
\end{equation*}
where $\sigma$ is the logistic sigmoid, $\hat{x}_{uij}$ is the predicted score of the pair of items (_i_, _j_) for user _u_, $\theta$ represents the parameter vector of an arbitrary model class (the method can be applied to a number of model classes), and $\lambda_\theta$ are the model regularization parameters.
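
To make the criterion concrete, here is a small NumPy sketch that builds $D_s$ and evaluates BPR-OPT. It rests on assumptions that are not part of the notebook: the feedback matrix is binary, and the pair score is decomposed as $\hat{x}_{uij} = \hat{x}_{ui} - \hat{x}_{uj}$ with $\hat{x}_{ui}$ given by a matrix-factorization inner product (one possible model class). The function names are illustrative, and this is not how the `implicit` library computes it internally:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_triples(interactions):
    """D_s: every (u, i, j) with i in I_u^+ and j in I minus I_u^+."""
    triples = []
    for u, row in enumerate(interactions):
        positives = np.flatnonzero(row > 0)
        negatives = np.flatnonzero(row <= 0)
        triples.extend((u, int(i), int(j)) for i in positives for j in negatives)
    return triples

def bpr_opt(user_factors, item_factors, triples, reg):
    """Sum of ln sigma(x_ui - x_uj) over D_s, minus the L2 penalty."""
    scores = user_factors @ item_factors.T
    log_likelihood = sum(
        np.log(sigmoid(scores[u, i] - scores[u, j])) for u, i, j in triples
    )
    penalty = reg * (np.sum(user_factors ** 2) + np.sum(item_factors ** 2))
    return log_likelihood - penalty

# Hypothetical binary feedback: 2 users x 3 items.
interactions = np.array([[1, 0, 1],
                         [0, 1, 0]])
triples = build_triples(interactions)

rng = np.random.default_rng(0)
user_factors = rng.normal(scale=0.1, size=(2, 4))
item_factors = rng.normal(scale=0.1, size=(3, 4))
print(triples)
print(bpr_opt(user_factors, item_factors, triples, reg=1e-4))
```

With all factors at zero every pair score is 0, so each triple contributes $\ln \sigma(0) = \ln 0.5$. Training then tries to push observed items above unobserved ones to raise the criterion.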

In [36]:
# model creation
modelBPR = model_creation('BPR')

# model fitting
confidence = 40
sparse_matrix = df_items_users
model_fitting(modelBPR, confidence, sparse_matrix)





## Alternating Least Squares 
##### (If you want a more extensive explanation of the ALS model and of collaborative filtering models in general, you can find it in [this](http://yifanhu.net/PUB/cf.pdf) paper)
Another approach for implicit feedback datasets is ALS, which is similar to the matrix factorization techniques popular with explicit feedback data, but adapted for implicit information.
Let's first define __$p_{ui}$__ to be the binarized preference of user _u_ for item _i_. In other words, if user _u_ has consumed item _i_, then $p_{ui}$ takes the value 1. Conversely, if the user has not consumed item _i_, its value is zero.
We also need to introduce another variable that accounts for the uncertainty inherent in implicit datasets: the __confidence__ level, $c_{ui}$. The variable is motivated by the fact that we cannot be sure that a user interacted with an item because of a preference for it; similarly, if a user has not interacted with a certain item, we cannot assume that the user will not be interested in it in the future. To account for that, the paper mentioned above defines the confidence level as: <br>
$c_{ui} = 1 + \alpha r_{ui}$,<br> 
where the variable $r_{ui}$ indicates the arbitrary interaction value of user _u_ with respect to item _i_. The rate at which confidence grows is controlled by $\alpha$.<br> 
A possible value of $\alpha$ that works well with the dataset is 40, but finding the optimal value requires a case-by-case analysis.
An assumption of the model is that preferences can be expressed using user-factor and item-factor vectors: taking their inner product, we can express the preference of user _u_ for item _i_ as $p_{ui} = x_u^Ty_i$, where $x_u^T$ is the transposed user-factor vector and $y_i$ is the item-factor vector.
Factors are then computed by minimizing the following cost function: 
\begin{equation*}
\phi = \min_{x_*,y_*} \sum_{u,i} c_{ui}(p_{ui} - x_u^Ty_i)^2 + \lambda (\sum_{u} \| x_u \|^2 + \sum_{i} \| y_i \|^2) ,
\end{equation*}
where the term $\lambda (\sum_{u} \| x_u \|^2 + \sum_{i} \| y_i \|^2)$ regularizes the model (that is, it discourages the factor vectors from taking overly large values). In this notebook, the ALS model is created with a regularization parameter of 1e-3 (see `model_creation` above). <br>
The problem now is that minimizing this function with the usual methods, such as Stochastic Gradient Descent, is impractical, because the sum runs over every user-item pair rather than only the observed ones. However, when either the user-factors or the item-factors are fixed, the cost function becomes quadratic and its global minimum can be easily computed. This leads to the alternating-least-squares optimization process implemented by the `AlternatingLeastSquares` class used above.
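
The alternating procedure can be sketched with plain NumPy. This is a dense, unoptimized illustration under stated assumptions: the interaction values $r_{ui}$ are non-negative (so every confidence is positive), the toy matrix `R` is hypothetical, and the helpers `als_cost` and `solve_side` are illustrative names rather than anything the `implicit` library exposes. With one side fixed, each factor vector solves the linear system $(Y^T C^u Y + \lambda I)\,x_u = Y^T C^u p(u)$:

```python
import numpy as np

def als_cost(R, X, Y, alpha, lam):
    """Cost function from the text, with p_ui = 1 if r_ui > 0 and c_ui = 1 + alpha * r_ui."""
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    return np.sum(C * (P - X @ Y.T) ** 2) + lam * (np.sum(X ** 2) + np.sum(Y ** 2))

def solve_side(R, fixed, alpha, lam):
    """Closed-form update of one side's factors while the other side is fixed."""
    P = (R > 0).astype(float)
    C = 1.0 + alpha * R
    n_factors = fixed.shape[1]
    updated = np.zeros((R.shape[0], n_factors))
    for row in range(R.shape[0]):
        Cu = np.diag(C[row])
        A = fixed.T @ Cu @ fixed + lam * np.eye(n_factors)
        b = fixed.T @ Cu @ P[row]
        updated[row] = np.linalg.solve(A, b)
    return updated

# Hypothetical users x items interest values (non-negative by assumption).
R = np.array([[3.5, 0.0, 6.0],
              [0.0, 1.5, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 0.0, 4.0]])
alpha, lam = 40.0, 1e-3

rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(4, 2))  # user factors
Y = rng.normal(scale=0.1, size=(3, 2))  # item factors

costs = [als_cost(R, X, Y, alpha, lam)]
for _ in range(5):
    X = solve_side(R, Y, alpha, lam)    # fix items, solve for users
    Y = solve_side(R.T, X, alpha, lam)  # fix users, solve for items
    costs.append(als_cost(R, X, Y, alpha, lam))
print(costs)
```

Because each half-step exactly minimizes a convex quadratic, the cost never increases from one sweep to the next. Note that the notebook's data also contains negative interest values (removals and refunds), which this sketch deliberately leaves out.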

In [93]:
modelALS = model_creation('ALS')

confidence = 10
sparse_matrix = df_items_users
model_fitting(modelALS, confidence, sparse_matrix)





# Recommendations

In order to use the __recommend__ function, we first need to create another sparse matrix from the same dataset as before, but now with users as rows and items as columns. This matrix is just the transpose of the items-users matrix defined above.
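
Since the new matrix is just a transpose, it could also be derived directly from the existing sparse matrix instead of pivoting again. A minimal sketch on a hypothetical 3 items x 2 users matrix:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical items-users matrix: 3 items (rows) x 2 users (columns).
items_users = csr_matrix(np.array([
    [0.0, 1.5],
    [2.0, 0.0],
    [0.0, 6.0],
]))

# Transposing a CSR matrix yields CSC; convert back to CSR for row-wise access.
users_items = items_users.T.tocsr()

print(users_items.shape)
print(users_items.toarray())
```

Either route produces the same values; transposing simply avoids building the dense pivot table a second time.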

In [84]:
# create a users-items matrix for recommendations
from scipy.sparse import csr_matrix
# pivot ratings into item features
df_item_features  = df.pivot(
    index='fullvisitorid',
    columns='productsku',
    values='tot_interest'
).fillna(0)
df_users_items  = csr_matrix(df_item_features.values)

In [85]:
df_users_items

<427x319 sparse matrix of type '<class 'numpy.float64'>'
	with 1145 stored elements in Compressed Sparse Row format>

## Example

In the recommendation example below, we use the ALS model to recommend a list of 10 items to our example user 50:

In [88]:
recs = modelALS.recommend(userid = 50, user_items = df_users_items, N = 10)

In [89]:
[(item_names['product_name'][item_names['productsku'] == products[r]].tolist()[0], s) for r, s in recs]

[("Google Men's Watershed Full Zip Hoodie Grey", 0.3775164226056508),
 ("Google Men's Airflow 1/4 Zip Pullover Black", 0.2179932849706377),
 ("Google Women's Short Sleeve Hero Tee Grey", 0.17134788062758352),
 ("Google Men's Performance Full Zip Jacket Black", 0.14189075271228782),
 ("Google Women's Scoop Neck Tee Black", 0.13460256755832284),
 ("Google Men's Quilted Insulated Vest Black", 0.12718054364784773),
 ('Waze Dress Socks', 0.11386148716233549),
 ("Google Men's  Zip Hoodie", 0.11225598742409833),
 ("Google Men's  Zip Hoodie", 0.11143913038752763),
 ('Recycled Paper Journal Set', 0.0974964641327746)]

As we would expect, the recommendations are somewhat similar to what our user was interested in: a hoodie and a pullover are at the top of the list. You can now try changing the parameter values to see whether the recommendations improve.
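
The name "Google Men's  Zip Hoodie" appears twice in the list because the recommended items are distinct SKUs that share a product name. One way to keep them distinguishable is to carry the SKU along with the name. The sketch below uses hypothetical lookup tables and scores shaped like the notebook's `products` dictionary, `item_names` table and `recs` list of (item index, score) pairs; `describe` is an illustrative helper:

```python
# Hypothetical lookup tables and recommendations.
index_to_sku = {0: "SKU-A", 1: "SKU-B", 2: "SKU-C"}
sku_to_name = {"SKU-A": "Zip Hoodie", "SKU-B": "Zip Hoodie", "SKU-C": "Dress Socks"}
recs = [(1, 0.38), (0, 0.22), (2, 0.11)]  # (item index, score) pairs

def describe(recs, index_to_sku, sku_to_name):
    # Keep the SKU next to the name so items sharing a name stay distinguishable.
    return [
        (index_to_sku[idx], sku_to_name[index_to_sku[idx]], score)
        for idx, score in recs
    ]

named = describe(recs, index_to_sku, sku_to_name)
for sku, name, score in named:
    print(f"{sku}  {name}  {score:.2f}")
```

Using dictionaries for the lookup also avoids filtering the whole `item_names` table once per recommended item.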