## Machine Learning - Recommender System

The mathematics behind recommender systems relies heavily on linear algebra, so in addition to the libraries I usually use, I will be using a library called 'SciPy'.

### Methods Used

The two most common types of recommender systems are **Content-Based** and **Collaborative Filtering (CF)**. 

* Collaborative filtering produces recommendations based on knowledge of users’ attitudes toward items; that is, it uses the "wisdom of the crowd" to recommend items.
* Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them.

I will implement Model-Based CF by using singular value decomposition (SVD) and Memory-Based CF by computing cosine similarity.

### The Data

We will use the famous MovieLens dataset, which is one of the most common datasets used when implementing and testing recommender engines. It contains 100k movie ratings from 943 users and a selection of 1682 movies.

### Import Libraries

In [1]:
# start by importing the core libraries

import numpy as np
import pandas as pd

In [2]:
# importing data visualization libraries 

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

### Get the data 

I will load the MovieLens ratings from the tab-separated file `u.data`.

In [3]:
columns_names = ['user_id', 'item_id', 'rating', 'timestamp']

In [4]:
# it's a tab-separated file, so I will use the sep argument

df = pd.read_csv('u.data', sep='\t', names=columns_names)

In [5]:
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,0,50,5,881250949
1,0,172,5,881250949
2,0,133,1,881250949
3,196,242,3,881250949
4,186,302,3,891717742


When checking our dataset, we notice that 'item_id' holds only numbers, with no movie titles.
So I will load a lookup table into a variable named 'movie_titles' that maps each 'item_id' to its title.

In [6]:
movie_titles = pd.read_csv('Movie_Id_Titles')

In [7]:
movie_titles

Unnamed: 0,item_id,title
0,1,Toy Story (1995)
1,2,GoldenEye (1995)
2,3,Four Rooms (1995)
3,4,Get Shorty (1995)
4,5,Copycat (1995)
...,...,...
1677,1678,Mat' i syn (1997)
1678,1679,B. Monkey (1998)
1679,1680,Sliding Doors (1998)
1680,1681,You So Crazy (1994)


In [8]:
# now we merge the dataframes:

df = pd.merge(df, movie_titles, on='item_id')

In [9]:
df

# now we notice a new column named 'title' matching the 'item_id' values

Unnamed: 0,user_id,item_id,rating,timestamp,title
0,0,50,5,881250949,Star Wars (1977)
1,290,50,5,880473582,Star Wars (1977)
2,79,50,4,891271545,Star Wars (1977)
3,2,50,5,888552084,Star Wars (1977)
4,8,50,5,879362124,Star Wars (1977)
...,...,...,...,...,...
99998,840,1674,4,891211682,Mamma Roma (1962)
99999,655,1640,3,888474646,"Eighth Day, The (1996)"
100000,655,1637,3,888984255,Girls Town (1996)
100001,655,1630,3,887428735,"Silence of the Palace, The (Saimt el Qusur) (1..."


Now let's take a quick look at the number of unique users and movies.

In [10]:
n_users = df.user_id.nunique()
n_items = df.item_id.nunique()

In [11]:
print ('Num. of Users: '+ str(n_users))
print ('Num of Movies: '+str(n_items))

Num. of Users: 944
Num of Movies: 1682


### Train Test Split

To evaluate the recommender system, I will not use our classic X_train, X_test, y_train, y_test split.
Instead, I can simply segment the data into two sets.

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
# segmenting the data into train data and test data

train_data, test_data = train_test_split(df, test_size=0.25)

**The training set contains 75% of the ratings and the testing set contains 25% of the ratings.**
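To illustrate what a row-level split like this does, here is a minimal NumPy-only sketch. It is not how `train_test_split` is implemented; the row count, the seed, and the names `train_rows` and `test_rows` are assumptions made up for the example. The point is that every rating row ends up in exactly one of the two sets.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed so the sketch is repeatable

n_ratings = 8                     # hypothetical number of rating rows
shuffled = rng.permutation(n_ratings)

n_test = int(round(0.25 * n_ratings))
test_rows = shuffled[:n_test]     # 25% of the rows
train_rows = shuffled[n_test:]    # remaining 75% of the rows

print('train rows:', sorted(train_rows.tolist()))
print('test rows: ', sorted(test_rows.tolist()))
```

Passing a `random_state` value to `train_test_split` plays the same role as the fixed seed here: it makes the split repeatable between runs.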

### 1 - Memory-Based Collaborative Filtering

Memory-Based Collaborative Filtering approaches can be divided into two main sections: **user-item filtering** and **item-item filtering**. 


* *Item-Item Collaborative Filtering:* Recommends items based on similarities between items. If you liked a particular item, it suggests other items that users who liked that item also liked.

* *User-Item Collaborative Filtering:* Recommends items based on similarities between users. It suggests items liked by users who have similar preferences or behaviors to you.

In both cases, I will create a user-item matrix built from the entire dataset.

After building the user-item matrix, we calculate the similarity and create a similarity matrix. Similarity values between items in *'Item-Item Collaborative Filtering'* are measured by observing all the users who have rated both items. 

For *'User-Item Collaborative Filtering'* the similarity values between users are measured by observing all the items that are rated by both users.

**Our first step is to create the user-item matrix. Since we have both training and testing data, we create two matrices.**

In [14]:
# create two user-item matrices, one for training and another for testing

# Training 

train_matrix = np.zeros((n_users, n_items))
for line in train_data.itertuples():
    train_matrix[line[1]-1, line[2]-1] = line[3] 

In [15]:
#Testing

test_matrix = np.zeros((n_users, n_items))
for line in test_data.itertuples():
    test_matrix[line[1]-1, line[2]-1] = line[3]
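The loops above fill one matrix cell per rating. As a minimal sketch of the same idea on made-up data (the arrays `user_ids`, `item_ids`, and `ratings` are hypothetical, not taken from MovieLens), NumPy integer-array indexing can fill the matrix in a single assignment:

```python
import numpy as np

# Hypothetical toy ratings as parallel arrays; IDs start at 1.
user_ids = np.array([1, 1, 2, 3])
item_ids = np.array([1, 3, 2, 3])
ratings = np.array([5, 3, 4, 1])

# One row per user, one column per item; unrated cells stay 0.
toy_matrix = np.zeros((user_ids.max(), item_ids.max()))

# Row index = user_id - 1, column index = item_id - 1.
toy_matrix[user_ids - 1, item_ids - 1] = ratings

print(toy_matrix)
```

With the notebook's data, the three arrays would correspond to `train_data.user_id.values`, `train_data.item_id.values`, and `train_data.rating.values`.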

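I will use the `pairwise_distances` function from sklearn with `metric='cosine'`. Strictly speaking, this returns the cosine *distance* (1 minus the cosine similarity), so the matrices named `user_similarity` and `item_similarity` below hold distances.

**Note: the output ranges from 0 to 1 because the ratings are all non-negative.**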
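To make the relationship between cosine similarity and cosine distance concrete, here is a minimal NumPy-only sketch on a made-up matrix. The helper `cosine_similarity_rows` and the toy values are assumptions for illustration. The `dist` array is the quantity that `metric='cosine'` reports, namely 1 minus the similarity.

```python
import numpy as np

def cosine_similarity_rows(m):
    """Cosine similarity between every pair of rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    unit = m / norms              # assumes no all-zero rows
    return unit @ unit.T

# Hypothetical ratings: users 0 and 1 rate alike, user 2 rates a different item.
toy = np.array([
    [5.0, 4.0, 0.0],
    [4.0, 5.0, 0.0],
    [0.0, 0.0, 3.0],
])

sim = cosine_similarity_rows(toy)   # 1 on the diagonal, 0 for unrelated users
dist = 1.0 - sim                    # 0 on the diagonal, 1 for unrelated users

print(np.round(sim, 3))
print(np.round(dist, 3))
```

Users 0 and 1 have a high similarity and therefore a small distance, while user 2 shares no rated items with them and sits at distance 1.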

In [16]:
# importing pairwise_distances from scikit-learn

from sklearn.metrics.pairwise import pairwise_distances

In [17]:
# pairwise_distances with metric='cosine' returns cosine distance (1 - cosine similarity); only the training matrix is used

user_similarity = pairwise_distances(train_matrix, metric='cosine')
item_similarity = pairwise_distances(train_matrix.T, metric='cosine')

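After creating the user and item similarity matrices, we can make predictions. For user-based CF, each prediction is a similarity-weighted average of the other users' ratings, normalized by the sum of the absolute similarities so that the predictions stay on the same scale as the ratings. Because the weights are applied to mean-centered ratings, the final step adds back the average rating of the user we are predicting for.

**Mean-centering matters because users can have very similar taste but treat the rating system differently.**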
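Written as formulas, the prediction step computed by the `predict` function can be read as follows. The notation is my own assumption, introduced only for this sketch: $x_{k,m}$ is the rating of user $k$ for item $m$, $\bar{x}_k$ is the mean of user $k$'s row in the matrix, $s_u(k,a)$ is the entry of the user matrix for users $k$ and $a$, and $s_i(m,b)$ is the entry of the item matrix for items $m$ and $b$.

User-based prediction:

$$
\hat{x}_{k,m} = \bar{x}_k + \frac{\sum_{a} s_u(k,a)\,\bigl(x_{a,m} - \bar{x}_a\bigr)}{\sum_{a} \lvert s_u(k,a) \rvert}
$$

Item-based prediction:

$$
\hat{x}_{k,m} = \frac{\sum_{b} s_i(m,b)\,x_{k,b}}{\sum_{b} \lvert s_i(m,b) \rvert}
$$

The item-based form needs no mean correction because it only combines ratings given by the same user $k$.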



In [18]:
def predict(ratings, similarity, type='user'):
    if type == 'user':
        mean_user_rating = ratings.mean(axis=1)
        #You use np.newaxis so that mean_user_rating has same format as ratings
        ratings_diff = (ratings - mean_user_rating[:, np.newaxis]) 
        pred = mean_user_rating[:, np.newaxis] + similarity.dot(ratings_diff) / np.array([np.abs(similarity).sum(axis=1)]).T
    elif type == 'item':
        pred = ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])     
    return pred

In [19]:
# apply the predict function to the training matrix

item_prediction = predict(train_matrix, item_similarity, type='item')
user_prediction = predict(train_matrix, user_similarity, type='user')

### Evaluation

I will use *Root Mean Squared Error (RMSE)*, one of the most popular metrics for evaluating the accuracy of predicted ratings.

Since we want to consider predicted ratings that are in the test dataset, we filter out all other elements in the prediction matrix with  
`prediction[ground_truth.nonzero()]`
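As a small illustration of that filtering step, the sketch below uses two made-up 2x3 arrays (the values are hypothetical). A 0 in `ground_truth` stands for "no rating in the test set", so those cells are dropped before the error is computed.

```python
import numpy as np

# Hypothetical example: 0 in ground_truth means the cell has no test rating.
prediction = np.array([[4.0, 2.5, 3.0],
                       [1.0, 5.0, 2.0]])
ground_truth = np.array([[5.0, 0.0, 3.0],
                         [0.0, 4.0, 0.0]])

mask = ground_truth.nonzero()     # (row indices, column indices) of rated cells
kept_pred = prediction[mask]      # predictions for the rated cells only
kept_true = ground_truth[mask]    # the corresponding true ratings

# RMSE over the three rated cells; the other three predictions are ignored.
toy_rmse = np.sqrt(np.mean((kept_pred - kept_true) ** 2))

print(kept_pred, kept_true, toy_rmse)
```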

In [20]:
# importing 

from sklearn.metrics import mean_squared_error
from math import sqrt

In [21]:
# define a function that computes the RMSE over the rated entries only

def rmse(prediction, ground_truth):
    prediction = prediction[ground_truth.nonzero()].flatten() 
    ground_truth = ground_truth[ground_truth.nonzero()].flatten()
    return sqrt(mean_squared_error(prediction, ground_truth))

In [22]:
# results 

print('User-based CF RMSE: ' + str(rmse(user_prediction, test_matrix)))
print('Item-based CF RMSE: ' + str(rmse(item_prediction, test_matrix)))

User-based CF RMSE: 3.1240010250233547
Item-based CF RMSE: 3.453830666451291


**Model-based CF methods are scalable and can deal with a higher sparsity level than memory-based models. On the other hand, they also suffer when new users or items without any ratings enter the system.**

### 2 - Model-based Collaborative Filtering

Model-based Collaborative Filtering is based on matrix factorization (MF), an unsupervised learning method. Matrix factorization is widely used for recommender systems because it copes better with scalability and sparsity than memory-based CF.

Let's calculate the sparsity level of the MovieLens dataset:

In [23]:
sparsity=round(1.0-len(df)/float(n_users*n_items),3)
print('The sparsity level of MovieLens100K is ' +  str(sparsity*100) + '%')

The sparsity level of MovieLens100K is 93.7%


### SVD

A well-known matrix factorization method is **Singular Value Decomposition (SVD)**. Collaborative filtering can be formulated as approximating the ratings matrix `X` with a truncated singular value decomposition.
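To show what keeping only the top `k` singular values does, here is a minimal sketch on a made-up 4x4 matrix. It uses `np.linalg.svd` rather than the `svds` function used in this notebook, purely to keep the example dependency-light; the matrix values and the choice `k = 2` are assumptions.

```python
import numpy as np

# Hypothetical small ratings matrix (rows = users, columns = items).
X = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [0.0, 1.0, 5.0, 4.0],
    [1.0, 0.0, 4.0, 5.0],
])

k = 2  # number of singular values to keep (an arbitrary choice here)

# Full SVD: X = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the k largest singular values and their vectors.
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(X_k, 2))
print('rank of approximation:', np.linalg.matrix_rank(X_k))
print('approximation error:', np.linalg.norm(X - X_k))
```

The rank-`k` matrix `X_k` has non-zero values in cells where `X` had 0, and those filled-in values are what get used as predicted ratings.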

In [24]:
# importing libraries 

import scipy.sparse as sp
from scipy.sparse.linalg import svds


In [25]:
# get the SVD components from the train matrix; k is the number of singular values to keep

u, s, vt = svds(train_matrix, k = 20)
s_diag_matrix=np.diag(s)
X_pred = np.dot(np.dot(u, s_diag_matrix), vt)
print('SVD-based CF RMSE: ' + str(rmse(X_pred, test_matrix)))

SVD-based CF RMSE: 2.722437992371685


#### Review:

* We implemented simple **Collaborative Filtering** methods, both memory-based CF and model-based CF.

* **Memory-based models** are based on similarity between items or users, where we use cosine-similarity.

* **Model-based CF** is based on matrix factorization where we use SVD to factorize the matrix.

* Building recommender systems that perform well in cold-start scenarios (where little data is available on new users and items) remains a challenge. Standard collaborative filtering methods perform poorly in such settings.

### - - - - The end - - - -