# Recommender Systems Project

## 0. Quick Start
To run this notebook you just need to have [pipenv](https://github.com/pypa/pipenv) installed.
Then run these 3 commands:
- first install the dependencies with: `pipenv install`
- launch the virtual env: `pipenv shell`
- finally start jupyter and open the notebook: `jupyter-lab`

In [1]:

%pip install scikit-surprise

Note: you may need to restart the kernel to use updated packages.



In [2]:
from tqdm import tqdm
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel 

from surprise import NormalPredictor, SVD, KNNBasic, NMF
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import cross_validate, KFold

## 1. Introduction
The goal of a recommender system is to push *relevant* items to a given user. Reaching this goal requires understanding and modelling the user's preferences. In this project you will learn how to model these preferences with the [Surprise library](http://surpriselib.com/) to build different recommender systems. The first one is a pure *collaborative filtering* approach, and the second one relies on item attributes in a *content-based* way.

## 2. Loading Data
Here we use the [MovieLens dataset](https://grouplens.org/datasets/movielens/), which contains 25 million user ratings. The data are in the `./raw` folder. We could load the .csv file directly with [a built-in Surprise function](https://github.com/NicolasHug/Surprise/blob/ef3ed6e98304dbf8d033c8eee741294b05b5ba07/surprise/dataset.py#L105), but it is more convenient to load it through a Pandas dataframe, which gives more flexibility later.

In [3]:
RATINGS_DATA_FILE = './raw/ratings.csv'
MOVIES_DATA_FILE = './raw/movies.csv'

In [4]:
# load the raw csv into a data_frame
df_ratings = pd.read_csv(RATINGS_DATA_FILE)

# drop the timestamp column since we don't need it here
df_ratings = df_ratings.drop(columns="timestamp")

# movies dataframe
df_movies = pd.read_csv(MOVIES_DATA_FILE)

In [5]:
# check we have 25M users' ratings
df_ratings.userId.count()

25000095

In [6]:
def get_subset(df, number):
    """Return a random subset of `number` rows from a large dataframe (for debugging)."""
    rids = np.arange(df.shape[0])
    np.random.shuffle(rids)
    df_subset = df.iloc[rids[:number], :].copy()
    return df_subset

# note: despite the `_100k` suffix, only 1,000 ratings are sampled here
df_ratings_100k = get_subset(df_ratings, 1000)
df_movies_100 = get_subset(df_movies, 100)

In [7]:
# Surprise reader: MovieLens ratings range from 0.5 to 5 in half-star steps
reader = Reader(rating_scale=(0.5, 5))

# Finally load all ratings
ratings = Dataset.load_from_df(df_ratings_100k, reader)

In [8]:
df_ratings_100k.head(5)

|  | userId | movieId | rating |
|---|---|---|---|
| 19246771 | 124895 | 2 | 3.5 |
| 16256421 | 105421 | 316 | 4.0 |
| 3154882 | 20802 | 314 | 5.0 |
| 14595260 | 94463 | 225 | 4.0 |
| 16890268 | 109519 | 500 | 3.0 |


## 3. Collaborative Filtering
We can test first any of the [Surprise algorithms](https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html).

In [9]:
# define a cross-validation iterator
kf = KFold(n_splits=3)

algos = [SVD(), NMF(), KNNBasic()]    
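
For intuition, k-fold cross-validation partitions the ratings into k folds and uses each fold exactly once as the test set, training on the rest. The following is a minimal standard-library sketch of that idea. It is not Surprise's implementation, and `k_fold_indices` is a hypothetical helper introduced only for illustration.

```python
import random

def k_fold_indices(n_items, n_splits, seed=0):
    """Yield (train_indices, test_indices) pairs; each index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(n_splits):
        # take every n_splits-th shuffled index, starting at position `fold`
        test = indices[fold::n_splits]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test

# toy example: 10 "ratings" split into 3 folds
folds = list(k_fold_indices(n_items=10, n_splits=3))
for train, test in folds:
    print(len(train), "train /", len(test), "test")
```

Every rating therefore contributes to the reported error once, which is why each fold prints its own RMSE per algorithm.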

In [10]:
def get_rmse(algo, testset):
    """Predict the ratings of `testset` with a fitted algorithm and print the RMSE."""
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)

# evaluate each algorithm with cross-validation: fit on the train folds, score on the held-out fold
for trainset, testset in tqdm(kf.split(ratings)):
    for algo in algos:
        algo.fit(trainset)
        get_rmse(algo, testset)

0it [00:00, ?it/s]

RMSE: 0.9837


1it [00:00,  2.37it/s]

RMSE: 0.9970
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.9870
RMSE: 1.1373


2it [00:00,  2.18it/s]

RMSE: 1.1433
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.1433
RMSE: 1.0262


3it [00:01,  2.04it/s]

RMSE: 1.0294
Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0294
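
The values printed above are root-mean-square errors between true and estimated ratings (lower is better). As a reminder of what the metric measures, here is a minimal pure-Python sketch. The function `rmse` and the toy rating pairs are made up for illustration and are not part of Surprise.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (true_rating, estimated_rating) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# hypothetical (true, estimated) ratings, for illustration only
toy_predictions = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(f"RMSE: {rmse(toy_predictions):.4f}")
```

Because errors are squared before averaging, a few large misses weigh more than many small ones.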





## 4. Content-based Filtering
Here we rely directly on item attributes. First we describe each item with an attribute vector (below, a TF-IDF vector built from the movie title). Then we use these vectors to generate recommendations.

In [11]:
# computing similarities requires too many resources on the whole dataset, so we take the subset with 100 items
df_movies_100 = df_movies_100.reset_index(drop=True)

In [12]:
# we compute a TF-IDF on the titles of the movies
# min_df=1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(df_movies_100['title'])

In [13]:
# we get cosine similarities: this takes a lot of time on the real dataset
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
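
`TfidfVectorizer` L2-normalises each row by default, which is why the plain dot product computed by `linear_kernel` can be read as a cosine similarity. The sketch below illustrates the idea with the standard library only. It assumes a simplified whitespace tokenizer with no n-grams or stop words, so its numbers will not match scikit-learn's exactly; the helper names and the three titles are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalised {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # smoothed idf (one common variant): ln((1 + n) / (1 + df)) + 1
        weights = {
            t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def dot(u, v):
    """Dot product of two sparse vectors; equals cosine similarity for unit vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

titles = ["wild things", "wild house", "night tide"]  # hypothetical titles
vecs = tfidf_vectors(titles)
sims = [[dot(u, v) for v in vecs] for u in vecs]
for row in sims:
    print([round(s, 3) for s in row])
```

Titles that share no term get a similarity of exactly 0, which explains why short texts such as movie titles often produce many zero scores.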

In [14]:
# for each movie, 'results' holds its most similar movies as (score, movieId) pairs, best first
results = {}
for idx, row in df_movies_100.iterrows():
    # indices of the 99 highest scores, in descending order
    similar_indices = cosine_similarities[idx].argsort()[:-100:-1]
    similar_items = [(cosine_similarities[idx][i], int(df_movies_100['movieId'].iloc[i])) for i in similar_indices]
    # drop the first entry, assumed to be the movie itself
    results[idx] = similar_items[1:]
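
The loop above assumes that each movie's own entry comes first in the sorted list and removes it with `similar_items[1:]`. That assumption can fail when several scores tie. A minimal sketch of a more defensive top-N selection, which excludes the query item by index rather than by position, could look like this; `top_n_similar` and the toy matrix are hypothetical and not part of the notebook.

```python
import heapq

def top_n_similar(similarities, query_idx, n):
    """Return up to n (score, index) pairs, best first, never including query_idx."""
    candidates = (
        (score, idx)
        for idx, score in enumerate(similarities[query_idx])
        if idx != query_idx
    )
    return heapq.nlargest(n, candidates)

# toy symmetric similarity matrix, for illustration only
toy_similarities = [
    [1.0, 0.3, 0.0, 0.1],
    [0.3, 1.0, 0.2, 0.0],
    [0.0, 0.2, 1.0, 0.0],
    [0.1, 0.0, 0.0, 1.0],
]
print(top_n_similar(toy_similarities, query_idx=0, n=2))
```

`heapq.nlargest` also avoids sorting the full row when only a few neighbours are needed.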

In [15]:
# transform a 'movieId' into its corresponding movie title
def item(id):  
    return df_movies_100.loc[df_movies_100['movieId'] == id]['title'].tolist()[0].split(' - ')[0] 

In [16]:
# transform a 'movieId' into the index id
def get_idx(id):
    return df_movies_100[df_movies_100['movieId'] == id].index.tolist()[0]

In [17]:
# Finally we put everything together here:
def recommend(item_id, num):
    print("Recommending " + str(num) + " products similar to " + item(item_id) + "...")   
    print("-------")    
    recs = results[get_idx(item_id)][:num]   
    for rec in recs: 
        print("\tRecommended: " + item(rec[1]) + " (score:" + str(rec[0]) + ")")

In [19]:
df_movies_100.head()

|  | movieId | title | genres |
|---|---|---|---|
| 0 | 137236 | Whistle and I'll Come to You (2010) | Horror |
| 1 | 3220 | Night Tide (1961) | Drama |
| 2 | 161578 | Poison Berry in my Brain (2015) | Comedy\|Romance |
| 3 | 203369 | A Ship to India (1947) | Drama |
| 4 | 93272 | Dr. Seuss' The Lorax (2012) | Animation\|Fantasy\|Musical\|IMAX |


Suppose a user wants the 10 movies most similar (from a content-based filtering point of view) to the movie 'Whistle and I'll Come to You (2010)':

In [20]:
recommend(item_id=137236, num=10)

Recommending 10 products similar to Whistle and I'll Come to You (2010)...
-------
	Recommended: 4.3.2.1 (2010) (score:0.2765157737468538)
	Recommended: Wild Things: Foursome (2010) (score:0.07646097313082126)
	Recommended: Our Life (La nostra vita) (2010) (score:0.06664922150009672)
	Recommended: Black House (2007) (score:0.0)
	Recommended: The Bullet Vanishes (2012) (score:0.0)
	Recommended: Alabama's Ghost (1973) (score:0.0)
	Recommended: Nowhere Mind (2018) (score:0.0)
	Recommended: The Wolf House (2018) (score:0.0)
	Recommended: Strange Powers: Stephin Merritt and the Magnetic Fields (2011) (score:0.0)
	Recommended: Freerunner (2011) (score:0.0)
