# Implicit Recommendations based on ALS

Recommender systems rely on different types of input. The most convenient is high-quality explicit feedback, in which users directly state their interest in products. For example, Netflix collects star ratings for movies, and TiVo users indicate their preferences for TV shows by hitting thumbs-up/down buttons. However, explicit feedback is not always available. In that case, recommenders can infer user preferences from the more abundant implicit feedback, which indirectly reflects opinion through observed user behavior. Types of implicit feedback include purchase history, browsing history, search patterns, or even mouse movements.

The vast majority of the literature in the field focuses on processing explicit feedback, probably thanks to the convenience of using this kind of pure information. However, in many practical situations recommender systems need to be centered on implicit feedback. This may reflect the reluctance of users to rate products, or limitations of a system that is unable to collect explicit feedback. In an implicit model, once the user gives approval to collect usage data, no additional explicit feedback (e.g. ratings) is required on the user’s part.

In [2]:
import pandas as pd
import numpy as np
import datetime 
import time
%matplotlib inline
import matplotlib.pyplot as plt 
import seaborn as sns
import random
import sklearn.utils
from sklearn.preprocessing import MinMaxScaler
import scipy
import scipy.sparse as sparse

In [51]:
events_df = pd.read_csv('events.csv')

In [53]:
all_customers = events_df.visitorid.unique()
len(all_customers)

1407580

So, we have a total of 1407580 unique visitors.

In [54]:
customer_purchased = events_df[events_df.transactionid.notnull()].visitorid.unique()
len(customer_purchased)

11719

Out of 1407580 visitors, only 11719 ended up purchasing.

In [55]:
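# Use a set for the membership test; checking against the array directly
# scans it once per visitor and is very slow at this size.
purchased_set = set(customer_purchased)
customer_browsed = [x for x in all_customers if x not in purchased_set]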
len(customer_browsed)

1395861

The remaining 1395861 visitors left without buying.

Because the dataset is both large and highly imbalanced, processing all of it would take a LONG time. So, we are going to limit our dataset to a 50:50 mixture of "purchasing" and "just browsing" visitors.

In [57]:
random_browsed = random.sample(customer_browsed,len(customer_purchased))
sample = list(customer_purchased) + list(random_browsed)

Let's filter our dataset to contain only the rows whose visitor is in our sample.

In [58]:
events_df = events_df[events_df['visitorid'].isin(sample)]
events_df = events_df.sample(frac=1).reset_index(drop=True)

Let's now map <i>visitorid</i> and <i>itemid</i> to integer category codes starting from zero, so they can be used directly as row and column indices.

In [59]:
events_df['visitor_id'] = events_df['visitorid'].astype('category').cat.codes
visitor_lookup = events_df[['visitor_id','visitorid']].drop_duplicates()
events_df['item_id'] = events_df['itemid'].astype('category').cat.codes
item_lookup = events_df[['item_id','itemid']].drop_duplicates()

events_df.drop(['itemid','visitorid'],axis=1,inplace=True)
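
As a small illustration of the mapping step, the sketch below applies the same `astype('category').cat.codes` pattern to a made-up column. The toy ids and the names `toy` and `toy_lookup` are assumptions for illustration only; the point is that codes follow the sorted order of the original ids and that the lookup table lets us map back.

```python
import pandas as pd

# Hypothetical toy visitor ids (not taken from events.csv).
toy = pd.DataFrame({'visitorid': [907, 12, 907, 55]})

# Category codes are assigned in sorted order of the distinct ids: 12 -> 0, 55 -> 1, 907 -> 2.
toy['visitor_id'] = toy['visitorid'].astype('category').cat.codes

# One row per distinct id, so we can translate codes back to original ids.
toy_lookup = toy[['visitor_id', 'visitorid']].drop_duplicates()

print(toy)
print(toy_lookup)
```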

Let's now create a sparse matrix that stores each user's browsing history, that is, how many times the user has interacted with each item.

In [8]:
visitors = list(np.sort(events_df.visitor_id.unique()))
items = list(np.sort(events_df.item_id.unique()))

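# Build the visitor x item count matrix directly in sparse form.
# csr_matrix sums entries that share the same (row, col) pair, so each
# cell ends up holding the number of events for that visitor and item.
# This avoids allocating a dense len(visitors) x len(items) array.
data_sparse = sparse.csr_matrix(
    (np.ones(len(events_df)),
     (events_df.visitor_id.values, events_df.item_id.values)),
    shape=(len(visitors), len(items)),
)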
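
To make the counting step concrete, here is a minimal sketch on a handful of made-up interactions. It relies on `scipy.sparse.csr_matrix` summing entries that share the same (row, column) coordinate when built from triplets; the toy ids and the name `toy_counts` are assumptions for illustration only.

```python
import numpy as np
import scipy.sparse as sparse

# Hypothetical toy events: one (visitor_id, item_id) pair per interaction.
toy_visitor_ids = np.array([0, 0, 1, 2, 2, 2])
toy_item_ids = np.array([1, 1, 0, 2, 2, 2])

# One count per event; duplicate (row, col) pairs are summed on construction.
toy_counts = sparse.csr_matrix(
    (np.ones(len(toy_visitor_ids)), (toy_visitor_ids, toy_item_ids)),
    shape=(3, 3),
)

print(toy_counts.toarray())
```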

## ALS

Alternating Least Squares (ALS) is the model we’ll use to fit our data and find similarities.

#### Matrix factorization for implicit data
The idea is basically to take a large (or potentially huge) matrix and factor it into a smaller representation of the original matrix. Working with fewer dimensions is much more computationally efficient, and it can also give better results, since we can reason about users and items in terms of a small number of hidden features.

There are different ways to factor a matrix, like Singular Value Decomposition (SVD) or Probabilistic Latent Semantic Analysis (PLSA) if we’re dealing with explicit data. With implicit data the difference lies in how we deal with all the missing data in our very sparse matrix. For explicit data we treat missing entries as unknown fields that we should assign some predicted rating to. For implicit data we can’t assume the same, since there is information in these unknown values as well. We don’t know if a missing value means the user disliked something, or if it means they would love it but just don’t know about it. Basically we need some way to learn from the missing data.

#### Back to ALS
ALS is an iterative optimization process in which every iteration tries to bring us closer to a factorized representation of our original data. We have our original matrix R of size (u x i) with our users, items and some type of feedback data. We want to turn it into one matrix U of size (u x f) with users and hidden features, and one matrix V of size (f x i) with hidden features and items. U and V hold weights for how strongly each user/item relates to each feature. We calculate U and V so that their product approximates R as closely as possible: R ≈ U x V.





![alt text](factor.png "Title")

Starting from randomly assigned values in U and V, we can use least squares iteratively to arrive at the weights that yield the best approximation of R.
With the alternating least squares approach we use the same idea but alternate between the two matrices: we hold V fixed and optimize U, then hold U fixed and optimize V. We repeat this in every iteration to get closer to R ≈ U x V.
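
The alternation can be sketched on a tiny dense matrix. This is plain ALS with ordinary least squares, without the preference/confidence weighting used for implicit data, and the toy matrix `R`, the seed, and the helper `reconstruction_error` are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy feedback matrix R: 4 users x 5 items, built to have rank 2.
R = rng.random((4, 2)) @ rng.random((2, 5))

f = 2                                  # number of hidden features
U = rng.random((R.shape[0], f))        # user factors, shape (u x f)
V = rng.random((f, R.shape[1]))        # item factors, shape (f x i)

def reconstruction_error(R, U, V):
    """Frobenius norm of the difference between R and U x V."""
    return np.linalg.norm(R - U @ V)

errors = [reconstruction_error(R, U, V)]
for _ in range(20):
    # Hold V fixed and solve the least-squares problem V.T @ U.T ≈ R.T for U.
    U = np.linalg.lstsq(V.T, R.T, rcond=None)[0].T
    # Hold U fixed and solve the least-squares problem U @ V ≈ R for V.
    V = np.linalg.lstsq(U, R, rcond=None)[0]
    errors.append(reconstruction_error(R, U, V))

print(errors[0], errors[-1])
```

Because each half-step solves its least-squares problem exactly, the reconstruction error should never increase from one iteration to the next.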

The approach we’re going to use with our implicit dataset is the one outlined in [Collaborative Filtering for Implicit Feedback Datasets](http://yifanhu.net/PUB/cf.pdf) by Hu, Koren and Volinsky (and used by Facebook and Spotify).

Their solution is to combine the preference (p) for an item with the confidence (c) we have in that preference. Missing values start out as a negative preference with a low confidence value, and existing values as a positive preference with a high confidence value.

* Preference

![alt text](preference.png "Title")
Basically our preference is a binary representation of our feedback data <i>r</i>. If the feedback is greater than zero we set the preference to 1, otherwise to 0.

* Confidence

![alt text](confidence.png "Title")
Confidence is calculated using the magnitude of <i>r</i> (the feedback data), giving us a larger confidence the more times a user has played, viewed or clicked an item. The rate at which our confidence increases is set through a linear scaling factor <i>α</i>. We also add 1 so we have a minimal confidence even if α x r equals zero.
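
The two definitions above can be sketched on a small dense array. The toy counts in `r` are made up for illustration, and `alpha = 15` simply mirrors the value used later in this notebook.

```python
import numpy as np

# Hypothetical interaction counts r for 2 users x 3 items.
r = np.array([[0, 3, 1],
              [2, 0, 0]], dtype=float)

alpha = 15  # linear scaling factor

# Preference: 1 where any interaction was observed, 0 otherwise.
p = (r > 0).astype(float)

# Confidence: grows linearly with the amount of feedback, with a minimum of 1.
c = 1 + alpha * r

print(p)
print(c)
```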

The goal now is to find a vector for each user (<i>x<sub>u</sub></i>) and each item (<i>y<sub>i</sub></i>) in the feature dimensions, which means we want to minimize the following loss function:
![alt text](loss_function.png "Title")
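
As a rough sketch, the loss from the Hu, Koren and Volinsky paper (confidence-weighted squared error plus L2 regularization on both factor matrices) can be evaluated on dense toy inputs as follows. The function name `implicit_loss` and the toy values are assumptions for illustration; a dense evaluation like this would not scale to the full dataset.

```python
import numpy as np

def implicit_loss(r, X, Y, alpha, lambda_val):
    """Confidence-weighted squared error plus L2 regularization.

    r: (users x items) raw feedback, X: (users x f), Y: (items x f).
    """
    p = (r > 0).astype(float)          # binary preference
    c = 1 + alpha * r                  # confidence
    pred = X @ Y.T                     # predicted preference for every pair
    weighted_error = np.sum(c * (p - pred) ** 2)
    regularization = lambda_val * (np.sum(X ** 2) + np.sum(Y ** 2))
    return weighted_error + regularization

# Hypothetical toy example: with all-zero factors, only the observed
# entries contribute, each with its confidence as the penalty.
toy_r = np.array([[1.0, 0.0],
                  [0.0, 2.0]])
toy_X = np.zeros((2, 3))
toy_Y = np.zeros((2, 3))
toy_loss = implicit_loss(toy_r, toy_X, toy_Y, alpha=2, lambda_val=0.1)
print(toy_loss)
```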

As the paper notes, if we fix either the user factors or the item factors, the loss becomes quadratic in the other set and we can calculate its global minimum. Setting the derivative of the above equation to zero gives us the following equation for minimizing the loss with respect to our users:
![alt text](user.png "Title")
And this one for minimizing it with respect to our items:
![alt text](item.png "Title")

One more step is to realize that the product of Y-transpose, Cu and Y can be broken out as shown below:
![alt text](broken_down.png "Title")
Now Y-transpose-Y and X-transpose-X are independent of u and i, which means we can precompute them and make the calculation much less intensive. With that in mind, our final user and item equations are:
![alt text](final_equations.png "Title")
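
A minimal sketch of the user update with the precomputed Y-transpose-Y term is shown below, assuming dense NumPy arrays and the confidence 1 + α x r defined earlier. The function name `solve_user` and the toy data are hypothetical; the item update is symmetric, with the roles of X and Y swapped.

```python
import numpy as np

def solve_user(Y, YtY, r_u, alpha, lambda_val):
    """Closed-form user vector for one user, given fixed item factors Y."""
    f = Y.shape[1]
    seen = np.nonzero(r_u)[0]          # items this user interacted with
    c_seen = 1 + alpha * r_u[seen]     # confidences of the seen items
    Y_seen = Y[seen]                   # (n_seen x f)

    # Y^T C^u Y = Y^T Y + Y^T (C^u - I) Y; only seen items have c - 1 != 0.
    A = YtY + (Y_seen.T * (c_seen - 1)) @ Y_seen + lambda_val * np.eye(f)
    # Y^T C^u p(u): p is 1 for seen items and 0 otherwise.
    b = Y_seen.T @ c_seen
    return np.linalg.solve(A, b)

# Hypothetical toy data: 6 items, 3 features, one user's raw counts.
rng = np.random.default_rng(1)
Y = rng.random((6, 3))
r_u = np.array([0.0, 2.0, 0.0, 1.0, 0.0, 0.0])

YtY = Y.T @ Y   # computed once, reused for every user
x_u = solve_user(Y, YtY, r_u, alpha=15, lambda_val=0.1)
print(x_u)
```

The per-user work only touches the items that user has interacted with, which is what makes the precomputation worthwhile on a very sparse matrix.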

By iterating between computing the two equations above we arrive at one matrix with user vectors and one with item vectors that we can then use to produce recommendations or find similarities.

#### Similar Items
To calculate the similarity between items we compute the dot product between our item vectors and their transpose. So if we want items similar to, say, item "A", we take the dot product between all item vectors and the transpose of A's item vector. This gives us the similarity score:
![alt text](similar_items.png "Title")
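
A minimal sketch of this lookup, assuming the item vectors are available as a dense NumPy array (for the sparse `item_vecs` returned later in this notebook, `item_vecs.toarray()` would be one way to get there). The function name `similar_items` and the toy vectors are hypothetical.

```python
import numpy as np

def similar_items(item_vecs, item_idx, n_similar=3):
    """Return indices and scores of the items with the largest dot product."""
    scores = item_vecs @ item_vecs[item_idx]        # one score per item
    top = np.argsort(scores)[::-1][:n_similar]      # highest scores first
    return top, scores[top]

# Hypothetical toy item vectors: items 0 and 1 point in a similar direction.
toy_item_vecs = np.array([[1.0, 0.0],
                          [0.9, 0.1],
                          [0.0, 1.0]])

top_idx, top_scores = similar_items(toy_item_vecs, item_idx=0, n_similar=2)
print(top_idx, top_scores)
```

Note that the raw dot product is not normalized, so items with large vectors can score highly against many items; dividing by the vector norms (cosine similarity) is a common alternative.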

#### Making Recommendations
To make recommendations for a given user we take a similar approach. Here we calculate the dot product between our user vector and the transpose of our item vectors. This gives us a recommendation score for our user and each item:
![alt text](recommendation.png "Title")


A brute-force implementation of ALS would take a VERY long time, especially on a dataset as large as ours.

Following the implementation and code by [Ben Frederickson](http://www.benfrederickson.com/fast-implicit-matrix-factorization/), we can use the code below instead of a naive ALS implementation and speed things up quite a bit. Here we’re using the approach outlined in [this paper](https://www.semanticscholar.org/paper/Applications-of-the-conjugate-gradient-method-for-Tak%C3%A1cs-Pil%C3%A1szy/46e905e9134e97c625ea6c8f6fa961b0b4c80fcf), which uses the Conjugate Gradient (CG) method.

In [14]:
def nonzeros(m, row):
    for index in range(m.indptr[row], m.indptr[row+1]):
        yield m.indices[index], m.data[index]

In [15]:
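def implicit_als_cg(Cui, features=20, iterations=20, lambda_val=0.1):
    """Implicit ALS using a few Conjugate Gradient steps per user/item.

    Cui is a sparse (users x items) matrix whose stored values are used
    as confidences. Returns sparse user and item factor matrices.
    """
    user_size, item_size = Cui.shape

    # Small random initial factors.
    X = np.random.rand(user_size, features) * 0.01
    Y = np.random.rand(item_size, features) * 0.01

    # CSR copies of the matrix and its transpose for fast row access.
    Cui, Ciu = Cui.tocsr(), Cui.T.tocsr()

    def least_squares_cg(Cui, X, Y, lambda_val, cg_steps=3):
        # Updates X in place while Y is held fixed.
        users, features = X.shape

        # Precompute Y^T Y + lambda * I once; it is shared by all rows.
        YtY = Y.T.dot(Y) + lambda_val * np.eye(features)

        for u in range(users):
            x = X[u]

            # Residual r = b - A x, built only from the nonzero entries.
            r = -YtY.dot(x)
            for i, confidence in nonzeros(Cui, u):
                r += (confidence - (confidence - 1) * Y[i].dot(x)) * Y[i]

            p = r.copy()
            rsold = r.dot(r)

            for it in range(cg_steps):
                # Ap = (Y^T C^u Y + lambda * I) p without forming the matrix.
                Ap = YtY.dot(p)
                for i, confidence in nonzeros(Cui, u):
                    Ap += (confidence - 1) * Y[i].dot(p) * Y[i]

                # Standard CG update of solution, residual and direction.
                alpha = rsold / p.dot(Ap)
                x += alpha * p
                r -= alpha * Ap
                rsnew = r.dot(r)
                p = r + (rsnew / rsold) * p
                rsold = rsnew

            X[u] = x

    for iteration in range(iterations):
        print('iteration %d of %d' % (iteration + 1, iterations))
        least_squares_cg(Cui, X, Y, lambda_val)   # update user factors
        least_squares_cg(Ciu, Y, X, lambda_val)   # update item factors

    return sparse.csr_matrix(X), sparse.csr_matrix(Y)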

    

In [16]:
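alpha_val = 15
# Scale raw counts by alpha. Note that the stored values are alpha * r
# (without the "+ 1" from the confidence formula); implicit_als_cg uses
# them directly as confidences.
conf_data = (data_sparse * alpha_val).astype('double')
user_vecs, item_vecs = implicit_als_cg(conf_data, iterations=20, features=20)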

iteration 1 of 20
iteration 2 of 20
iteration 3 of 20
iteration 4 of 20
iteration 5 of 20
iteration 6 of 20
iteration 7 of 20
iteration 8 of 20
iteration 9 of 20
iteration 10 of 20
iteration 11 of 20
iteration 12 of 20
iteration 13 of 20
iteration 14 of 20
iteration 15 of 20
iteration 16 of 20
iteration 17 of 20
iteration 18 of 20
iteration 19 of 20
iteration 20 of 20


In [41]:
# Let's say we want recommendations for the visitor with visitorid 1397760
vid = 1397760

# Let's get its visitor_id in the dataframe by consulting visitor_lookup
v_id = visitor_lookup.loc[visitor_lookup.visitorid == vid]['visitor_id'].iloc[0]

# Let's get the items consumed by this user. The indices stay integers so
# they can be matched against the integer item_id column.
consumed_idx = data_sparse[v_id,:].nonzero()[1]
consumed_items = item_lookup.loc[item_lookup.item_id.isin(consumed_idx)]
print(consumed_items)

        item_id  itemid
0         42601  455170
56853     28614  305940
71208     15572  167319
199157    41429  442415
242495     1899   20884


In [48]:
#Let's now create user recommendation
def recommend(vid, data_sparse, user_vecs, item_vecs, item_lookup, num_items=10):
    
    # Note: visitor_lookup is not a parameter; it is read from the notebook's global scope.
    user_id = visitor_lookup.loc[visitor_lookup.visitorid == vid]['visitor_id'].iloc[0]
    # Get all interactions by the user
    user_interactions = data_sparse[user_id,:].toarray()

    # We don't want to recommend items the user has consumed. So let's
    # set them all to 0 and the unknowns to 1.
    user_interactions = user_interactions.reshape(-1) + 1 #Reshape to turn into 1D array
    user_interactions[user_interactions > 1] = 0

    # This is where we calculate the recommendation by taking the 
    # dot-product of the user vectors with the item vectors.
    rec_vector = user_vecs[user_id,:].dot(item_vecs.T).toarray()

    # Let's scale our scores between 0 and 1 to make it all easier to interpret.
    min_max = MinMaxScaler()
    rec_vector_scaled = min_max.fit_transform(rec_vector.reshape(-1,1))[:,0]
    recommend_vector = user_interactions*rec_vector_scaled
   
    # Get all the item indices in order of recommendations (descending) and
    # select only the top "num_items" items. 
    item_idx = np.argsort(recommend_vector)[::-1][:num_items]

    items = []
    scores = []

    # Loop through our recommended item indices and look up the original itemid
    for idx in item_idx:
        items.append(item_lookup.loc[item_lookup.item_id == idx]['itemid'].iloc[0])
        scores.append(recommend_vector[idx])

    # Create a new dataframe with recommended item ids and scores
    recommendations = pd.DataFrame({'items': items, 'score': scores})
    
    return recommendations

In [49]:
recommendations = recommend(1397760,data_sparse, user_vecs, item_vecs, item_lookup)

In [50]:
print(recommendations)

    items     score
0  461686  1.000000
1  218794  0.788879
2  171878  0.724170
3   10572  0.711860
4   32581  0.616565
5  450082  0.579289
6  268883  0.566760
7  420960  0.566450
8  336842  0.527456
9  285154  0.515774
