# Collaborative filtering algorithm 

Collaborative filtering algorithms are a set of methods that predict a user's preferences from the preferences of other users (hence "collaborative"); see [Wikipedia](https://en.wikipedia.org/wiki/Collaborative_filtering) for a reference. Here, an item-to-item method based on a cosine similarity matrix and implicit collaborative filtering are implemented and compared.

In [1]:
import implicit
import pandas as pd
import numpy as np
import csv
from scipy.spatial.distance import cosine
from scipy.sparse import csr_matrix

# from numpy import linalg as la

## Collaborative Filtering using similarity matrix

This section implements the collaborative filtering variant known as item-to-item. The algorithm consists of two steps.
First, a similarity matrix between *items* (item-to-item) is calculated. This step is time-consuming and must be carried out offline. Then, the items most similar to a given item are retrieved from the similarity matrix. This method has the advantage of being simple and accurate, but it depends on the expensive calculation of the similarity matrix.
The example data and code implementation were taken from Salem Marafi's blog [here](http://www.salemmarafi.com/code/collaborative-filtering-with-python/).
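
As a small illustration of the first step, the cosine similarity between two item columns can be computed directly from its definition, cos(a, b) = a·b / (‖a‖ ‖b‖). The listening vectors below are made up for the example and are not taken from the Last.fm data, and `cosine_similarity` is a hypothetical helper name.

```python
import numpy as np
from scipy.spatial.distance import cosine


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors, computed from the definition."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical listening vectors for two artists over five users (1 = listened)
artist_a = [1, 0, 1, 1, 0]
artist_b = [1, 1, 0, 1, 0]

sim = cosine_similarity(artist_a, artist_b)
# scipy returns the cosine *distance*, so the similarity is 1 - distance
sim_scipy = 1 - cosine(artist_a, artist_b)
print(sim, sim_scipy)
```

Both values agree: the two artists share two listeners out of three each, so the similarity is 2/3.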

In [2]:
# reading data from csv
data = pd.read_csv("/home/carlos/scripts/dataScience/data/lastfm-matrix-germany.csv")
data.head().iloc[:,1:10]


Unnamed: 0,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire
0,0,0,0,0,0,0,0,0,0
1,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0


In [3]:
# getting dimensions
data.shape

(1257, 286)

In [4]:
# first 10 column names
names = list(data)
names[:10]

['user',
 'a perfect circle',
 'abba',
 'ac/dc',
 'adam green',
 'aerosmith',
 'afi',
 'air',
 'alanis morissette',
 'alexisonfire']

In [5]:
## calculate matrix of similarity between items
def colFiltering (dataPath):
    # --- Read Data --- #
    data = pd.read_csv(dataPath)
    # --- Start Item Based Recommendations --- #
    # Drop any column named "user"
    data = data.drop(columns='user')
    # Create a placeholder dataframe listing item vs. item
    similarities = pd.DataFrame(index=data.columns,columns=data.columns)
    # fill in with cosine similarities
    # Loop through the columns
    for i in range(0,len(similarities.columns)) :
        # Loop through the columns for each column
        for j in range(0,len(similarities.columns)) :
            # Fill in placeholder with cosine similarities
            similarities.iloc[i,j] = 1-cosine(data.iloc[:,i],data.iloc[:,j])
        
 
    return similarities
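
The double loop calls `cosine` once per pair of columns, which is where most of the time goes. A possible alternative, sketched here on a made-up toy table, is to normalize every column once and take a single matrix product. The function name `item_similarities` and the handling of all-zero columns are assumptions of this sketch, not part of the original code.

```python
import numpy as np
import pandas as pd


def item_similarities(items):
    """Item-to-item cosine similarity for a users x items DataFrame."""
    values = items.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=0)
    # Assumption: an item with no listeners gets similarity 0 to everything
    # (including itself) instead of the NaN a division by zero would give.
    norms[norms == 0] = np.inf
    normalized = values / norms
    sims = normalized.T @ normalized
    return pd.DataFrame(sims, index=items.columns, columns=items.columns)


# Toy users x items table with made-up values
toy = pd.DataFrame({
    "abba": [1, 0, 1, 0],
    "ac/dc": [1, 1, 0, 0],
    "air": [0, 0, 0, 1],
})
toy_sims = item_similarities(toy)
print(toy_sims.round(3))
```

Unlike the loop version, this returns a float-typed matrix rather than an object-typed one.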

In [6]:
similarities = colFiltering("/home/carlos/scripts/dataScience/data/lastfm-matrix-germany.csv")
similarities.head()

Unnamed: 0,a perfect circle,abba,ac/dc,adam green,aerosmith,afi,air,alanis morissette,alexisonfire,alicia keys,...,timbaland,tom waits,tool,tori amos,travis,trivium,u2,underoath,volbeat,yann tiersen
a perfect circle,1.0,0.0,0.0179172,0.0515539,0.0627765,0.0,0.0517549,0.0607177,0,0.0,...,0.0473381,0.0811998,0.394709,0.125553,0.0303588,0.111154,0.0243975,0.06506,0.0521641,0.0
abba,0.0,1.0,0.0522788,0.0250706,0.0610563,0.0,0.0167789,0.0295269,0,0.0,...,0.0,0.0,0.0,0.0610563,0.0295269,0.0,0.0949158,0.0,0.0253673,0.0
ac/dc,0.0179172,0.0522788,1.0,0.113154,0.177153,0.0678942,0.0757299,0.0380762,0,0.0883332,...,0.0445288,0.0678942,0.0582408,0.0393673,0.0,0.0871313,0.122398,0.0203997,0.130849,0.0
adam green,0.0515539,0.0250706,0.113154,1.0,0.0566365,0.0,0.0933859,0.0,0,0.0254164,...,0.0,0.146516,0.0837892,0.0566365,0.0821687,0.0250706,0.0220113,0.0,0.023531,0.0880451
aerosmith,0.0627765,0.0610563,0.177153,0.0566365,1.0,0.0,0.113715,0.100056,0,0.0618984,...,0.0520051,0.0297351,0.0255072,0.0689655,0.0333519,0.0,0.214423,0.0,0.0573068,0.0


### Validation

In [7]:
similarities.loc["guns n roses",:].sort_values(ascending=False)[:5]

guns n roses           1
queen           0.160422
metallica       0.154303
deep purple      0.14825
depeche mode    0.141019
Name: guns n roses, dtype: object

From the validation it can be seen that the predictions seem accurate.
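
The lookup used in the validation can be wrapped in a small helper that also leaves out the queried item, which always ranks first with similarity 1. This is a minimal sketch: `top_similar` is a hypothetical name and the similarity values in the toy matrix are made up.

```python
import pandas as pd


def top_similar(similarities, item, n=5):
    """Return the n items most similar to `item`, excluding the item itself."""
    scores = similarities.loc[item].drop(item).astype(float)
    return scores.sort_values(ascending=False).head(n)


# Toy similarity matrix with made-up values
artists = ["guns n roses", "queen", "metallica", "abba"]
toy = pd.DataFrame(
    [[1.00, 0.16, 0.15, 0.02],
     [0.16, 1.00, 0.10, 0.05],
     [0.15, 0.10, 1.00, 0.01],
     [0.02, 0.05, 0.01, 1.00]],
    index=artists, columns=artists,
)
top = top_similar(toy, "guns n roses", n=2)
print(top)
```

The `astype(float)` call is there because the similarity matrix built above holds its values with `dtype: object`.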

## Collaborative filtering using implicit library

Implicit collaborative filtering is a method developed to make use of implicit information about user preferences, such as clicks or visits, rather than explicit information such as product ratings.
In this section the *implicit* [library](https://github.com/benfred/implicit) is evaluated as an implementation of implicit collaborative filtering.

In [8]:
data = pd.read_csv("/home/carlos/scripts/dataScience/data/lastfm-matrix-germany.csv")
data = data.drop(columns='user')
data = data.T
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1247,1248,1249,1250,1251,1252,1253,1254,1255,1256
a perfect circle,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
abba,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ac/dc,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0
adam green,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,1
aerosmith,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


The following code implements the method. Note that the model can be fitted with its default settings, so no parameters need to be specified.

In [15]:
## transform data to sparse matrix
count = csr_matrix(data,dtype=np.float64)

# initialize a model
model = implicit.als.AlternatingLeastSquares()

# train the model on a sparse matrix of item/user/confidence weights
np.random.seed(23)
model.fit(count)
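
Most entries of the listening matrix are zero, which is why it is converted to a compressed sparse row matrix before fitting. The sketch below shows the same conversion on a made-up items x users table and reports how many entries are actually stored.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy items x users table with made-up values (rows = artists, columns = users)
toy = pd.DataFrame(
    [[0, 1, 0, 0],
     [0, 0, 0, 0],
     [1, 0, 0, 1]],
    index=["abba", "ac/dc", "air"],
)

sparse = csr_matrix(toy.to_numpy(), dtype=np.float64)
# Only the non-zero entries are stored explicitly
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print(sparse.shape, sparse.nnz, density)
```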




### Validation

In [16]:
## check recommendations for guns n roses
rec = model.similar_items(names.index("guns n roses"))
[names[rec[i][0]] for i in range(1,len(rec))]


['black sabbath',
 'mando diao',
 'death cab for cutie',
 'simple plan',
 'arcade fire',
 'joy division',
 'bloc party',
 'dido',
 'eminem']

It can be seen that the predictions are not accurate.
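
One possible contributor to this result is an index mismatch. `names` was built from the full table, whose first entry is `user`, while the model was fitted on the matrix after that column was dropped, so a position in `names` points one item further along than the same position in the fitted matrix. The toy frame below (made-up values, same layout as the Last.fm file) illustrates the shift; `item_names` is a hypothetical replacement built from the transposed matrix itself. Whether correcting this changes the recommendations above has not been verified here.

```python
import pandas as pd

# Toy table with the same layout as the csv: a 'user' column, then one column per artist
raw = pd.DataFrame({
    "user": [1, 2, 3],
    "abba": [0, 1, 0],
    "guns n roses": [1, 0, 1],
    "queen": [1, 1, 0],
})

names = list(raw)                    # includes 'user' at position 0
items = raw.drop(columns="user").T   # items x users, as passed to the model
item_names = list(items.index)       # row order actually seen by the model

shifted = names.index("guns n roses")       # position in the list that includes 'user'
aligned = item_names.index("guns n roses")  # position in the fitted matrix
print(items.index[shifted], items.index[aligned])
```

With the shifted index the lookup lands on the row for `queen` instead of `guns n roses`.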

## Conclusions

In this report, two collaborative filtering algorithms are assessed. First, a similarity-based method is implemented. Importantly, this method was the more accurate of the two, based on manually checked predictions. However, a major disadvantage of this method is its poor scalability. On the other hand, implicit collaborative filtering using the implicit library has the advantage of better computational performance. However, in our hands its predictions were not accurate.