# MovieLens Collaborative Filtering System

## Problem framing

We want to create a model that recommends new movies to users based on their ratings of movies they have already watched.

It will also recommend movies similar to movie X to users who liked movie X.

## Model's quality control

Because we cannot use online metrics (recommending movies to users and checking whether they watch and like them), we'll approximate this situation by artificially deleting some of a user's ratings and checking whether the system recommends those movies afterwards.
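The hold-out idea can be sketched as follows. This is a minimal illustration, not the notebook's evaluation code: the function name `hold_out_ratings`, the dict input format and the toy ratings are assumptions made for the example.

```python
import random

def hold_out_ratings(user_ratings, n_hidden, seed=None):
    """Split one user's ratings into a visible part and a hidden part.

    user_ratings: dict mapping movieId -> rating (hypothetical input format).
    Returns (visible, hidden), both dicts.
    """
    rng = random.Random(seed)
    # Choose the movies whose ratings are hidden from the recommender
    hidden_ids = set(rng.sample(sorted(user_ratings), n_hidden))
    visible = {m: r for m, r in user_ratings.items() if m not in hidden_ids}
    hidden = {m: r for m, r in user_ratings.items() if m in hidden_ids}
    return visible, hidden

# Toy data, invented for the example
toy_ratings = {1: 4.0, 2: 3.5, 29: 5.0, 32: 2.0, 47: 4.5}
visible, hidden = hold_out_ratings(toy_ratings, n_hidden=2, seed=0)
```

The recommender only sees `visible`; the movies in `hidden` are the ones we hope to find among its recommendations.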

## Data analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import time
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler

plt.rcParams['figure.figsize']=(10,10)

In [2]:
links = pd.read_csv(r'movie_recommendation/data/link.csv')
movies = pd.read_csv(r'movie_recommendation/data/movie.csv')
ratings = pd.read_csv(r'movie_recommendation/data/rating.csv')
tags = pd.read_csv(r'movie_recommendation/data/tag.csv')

print('links: ')
print(links.head())
print('\n')
print('movies: ')
print(movies.head())
print('\n')
print('ratings: ')
print(ratings.head())
print('\n')
print('tags: ')
print(tags.head())
print('\n')

links: 
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


movies: 
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


ratings: 
   userId  movieId  rating            timestamp
0       1        2     3.5  2005-04-02 23:53:47
1       1       29     3.5  2005-04-02 23:31:16
2       1       32     3.5  2005-04-02 23:33:39
3       1     

In [3]:
ratings.describe()

            userId     movieId      rating
count   20000260.0  20000260.0  20000260.0
mean      69045.87    9041.567    3.525529
std       40038.63    19789.48    1.051989
min            1.0         1.0         0.5
25%        34395.0       902.0         3.0
50%        69141.0      2167.0         3.5
75%       103637.0      4770.0         4.0
max       138493.0    131262.0         5.0


**First we'll check whether our data is balanced.**

In [4]:
#Amount of movies users have rated 
ratings['userId'].value_counts().describe()

count    138493.000000
mean        144.413530
std         230.267257
min          20.000000
25%          35.000000
50%          68.000000
75%         155.000000
max        9254.000000
Name: userId, dtype: float64

The median user rated 68 movies and the standard deviation is about 230, so the number of ratings per user varies widely, but overall the data seems to be balanced.

In [5]:
ratings['movieId'].value_counts().describe()

count    26744.000000
mean       747.841123
std       3085.818268
min          1.000000
25%          3.000000
50%         18.000000
75%        205.000000
max      67310.000000
Name: movieId, dtype: float64

In [6]:
ratings['rating'].describe()

count    2.000026e+07
mean     3.525529e+00
std      1.051989e+00
min      5.000000e-01
25%      3.000000e+00
50%      3.500000e+00
75%      4.000000e+00
max      5.000000e+00
Name: rating, dtype: float64

There are many unpopular movies with only a few ratings (the median movie has 18), while the most-rated movie has about 67k ratings.

**Overall the data seems to be balanced; the only problem is rare movies with few ratings.**
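One way to quantify how many movies are "rare" is to count the movies below a minimum number of ratings. The sketch below uses a toy frame with the same `movieId` column as `ratings`; the threshold `MIN_RATINGS` is a hypothetical value chosen for the example, not one taken from the analysis above.

```python
import pandas as pd

# Toy stand-in for the ratings table (values invented for the example)
toy_ratings = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2, 3],
    'rating': [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

MIN_RATINGS = 2  # hypothetical threshold for calling a movie "rare"

# Number of ratings per movie, then the share of movies below the threshold
counts = toy_ratings['movieId'].value_counts()
rare_share = (counts < MIN_RATINGS).mean()
print(f'{rare_share:.0%} of movies have fewer than {MIN_RATINGS} ratings')
```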

## Data Preprocessing

### Decrease memory usage by changing dtype

In [7]:
ratings['rating'] = ratings['rating'].astype('float32')

### Removing duplicates

In [8]:
ratings.duplicated().value_counts()

False    20000263
dtype: int64

In [9]:
movies['title'].duplicated().value_counts()

False    27262
True        16
Name: title, dtype: int64

In [10]:
movies.drop_duplicates(subset='title', keep = 'first', inplace=True)

### Create user-item matrix

To keep computation time and memory usage manageable, we'll use 500k samples for learning.

In [11]:
#Amount of movies users have rated 
print('full:')
print(ratings['userId'].value_counts().describe())
print('mln samples:')
print(ratings.iloc[-500000:, :]['userId'].value_counts().describe())

full:
count    138493.000000
mean        144.413530
std         230.267257
min          20.000000
25%          35.000000
50%          68.000000
75%         155.000000
max        9254.000000
Name: userId, dtype: float64
mln samples:
count    3553.000000
mean      140.726147
std       223.118061
min         2.000000
25%        33.000000
50%        65.000000
75%       150.000000
max      3383.000000
Name: userId, dtype: float64


In [12]:
print(ratings['movieId'].value_counts().describe())
print(ratings.iloc[-500000:, :]['movieId'].value_counts().describe())

count    26744.000000
mean       747.841123
std       3085.818268
min          1.000000
25%          3.000000
50%         18.000000
75%        205.000000
max      67310.000000
Name: movieId, dtype: float64
count    12608.000000
mean        39.657360
std        109.438422
min          1.000000
25%          2.000000
50%          6.000000
75%         25.000000
max       1683.000000
Name: movieId, dtype: float64


According to these statistics the last 500k ratings are reasonably representative, so we'll use them for learning.

In [13]:
sample = ratings.iloc[-500000:, :]

In [14]:
UI_matrix = sample.pivot(index='userId', columns='movieId', values='rating')

### Scaling 

**We'll subtract each user's average from their ratings. This way every user has zero mean, and we can fill missing (NaN) ratings with 0, which is now the user's mean. This brings all users to the same level by removing their individual biases (for different people, the same rating value means a different degree of attractiveness).**

In [15]:
UI_norm = UI_matrix.subtract(UI_matrix.mean(axis=1), axis=0)
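
The effect of the centering step is easiest to see on a tiny matrix. The frame below is toy data invented for the example (two users, three movies); it follows the same `subtract` and `fillna` calls used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are movieIds, NaN = not rated
toy = pd.DataFrame(
    {10: [5.0, 2.0], 20: [3.0, np.nan], 30: [np.nan, 4.0]},
    index=pd.Index([1, 2], name='userId'),
)

# Subtract each user's mean rating (NaN values are skipped by mean())
toy_norm = toy.subtract(toy.mean(axis=1), axis=0)

# After centering, 0 stands for "the user's average", so it is a neutral fill
toy_filled = toy_norm.fillna(0)
print(toy_filled)
```

User 1 has a mean of 4.0 and user 2 a mean of 3.0, so a 5.0 from user 1 and a 4.0 from user 2 both become +1.0 after centering.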

## Collaborative Filtering

### Movies recommendation based on one

Can be used on a movie page in a "You may like" section.

**Item-item similarity**

If you like movie X, the 5 movies most similar to it are recommended, where similarity is the Pearson correlation coefficient between the movies' vectors in the item-user matrix.

There will be two matrices: a normalized one and a non-normalized one. We'll use unscaled ratings for movies with many ratings and scaled ratings for movies with few.
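
As a small illustration of item-item similarity, the sketch below ranks movies by Pearson correlation with a target movie. The matrix is toy data invented for the example, and it uses `Series.corr` row by row rather than the `corrwith` call used later in the notebook.

```python
import pandas as pd

# Toy item-user matrix: rows are movies, columns are users
toy_items = pd.DataFrame(
    [[1.0, -1.0, 0.5, -0.5],    # movie A (target)
     [0.8, -0.9, 0.4, -0.6],    # movie B: rated much like A
     [-1.0, 1.0, -0.5, 0.5]],   # movie C: rated opposite to A
    index=['A', 'B', 'C'],
    columns=[1, 2, 3, 4],
)

target = toy_items.loc['A']

# Pearson correlation of every movie's rating vector with the target's
similarity = toy_items.apply(lambda row: row.corr(target), axis=1)

# Drop the target itself and rank the rest from most to least similar
ranked = similarity.drop('A').sort_values(ascending=False)
print(ranked)
```

Movie B comes first because users rated it almost the same way as A, while C ends up with a correlation of -1.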

In [16]:
item_user_matrix_norm = UI_norm.transpose().copy()

In [17]:
item_user_matrix = UI_matrix.transpose().copy()

In [18]:
item_user_matrix_norm.fillna(0, inplace=True)

In [19]:
item_user_matrix.fillna(0, inplace=True)

**To decrease calculation time we'll take a random 3500 movies, calculate the correlation and keep movies with corr > 0.30, repeating until we have N_MOVIES recommendations. This also adds randomization (different movies can be recommended each time).**

For testing and development the parameter is the title, but in a production version we could add the possibility of passing a vector to the function, to cover movies and users that are not represented in the sample we use for learning.

In [20]:
N_MOVIES = 5

def recommendation_movie_based(title, n_movies):
    start_time = time.time()   
    if len(movies[movies.title == title].movieId.values) == 0:
        print('No movie in base')
        return None
    else:
        movie_id = movies[movies.title == title].movieId.values[0]
        
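    # Pick the matrix by the number of ratings the movie has in the sample:
    # mean-centered ratings for rarely rated movies, raw ratings otherwise.
    if (sample.loc[sample['movieId'] == movie_id, 'rating'].shape[0] < 20):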
        matrix = item_user_matrix_norm
    else:
        matrix = item_user_matrix
        
    movie_vector = matrix.loc[movie_id]
    # To decrease calculation time we take a random 3500 movies, calculate corr
    # and keep movies with 0.30 < corr < 0.9999 until we have n_movies recommendations.
    # After 3.5 seconds we stop filtering and take the top of one last sample.
    corr = set()
    while (len(corr) < n_movies):
        if ((time.time() - start_time) > 3.5):
            temp = matrix.sample(3500).corrwith(movie_vector, method='pearson',
                                         axis=1).sort_values(ascending=False)
            corr.update(temp.head(n_movies).index)
            break
        temp = matrix.sample(3500).corrwith(movie_vector, method='pearson',
                                         axis=1).sort_values(ascending=False)
        temp = temp[temp > 0.30]
        temp = temp[temp < 0.9999]
        corr.update(temp.head(n_movies).index)
    
    print('Original Movie:')
    print(title)
    print(movies[movies['movieId'] == movie_id].genres.values[0])
    print('\n')
    print('Recommended:')
    iteration_check = 1
    for i in corr:
        if iteration_check > n_movies:
            break
        print(movies[movies['movieId'] == i].title.values[0])
        print(movies[movies['movieId'] == i].genres.values[0])
        print('\n')
        iteration_check += 1
    print(time.time() - start_time, ' sec')

### Testing

In [21]:
item_user_matrix_norm.iloc[7773].name

27571

In [22]:
movies[movies['movieId'] == item_user_matrix_norm.iloc[7773].name]

      movieId                            title  genres
9382    27571  Rage in Placid Lake, The (2003)  Comedy


#### Examples:

In [23]:
recommendation_movie_based('Big Lebowski, The (1998)', N_MOVIES)

Original Movie:
Big Lebowski, The (1998)
Comedy|Crime


Recommended:
Reservoir Dogs (1992)
Crime|Mystery|Thriller


Jackie Brown (1997)
Crime|Drama|Thriller


Full Metal Jacket (1987)
Drama|War


O Brother, Where Art Thou? (2000)
Adventure|Comedy|Crime


Goodfellas (1990)
Crime|Drama


1.6319520473480225  sec


In [24]:
recommendation_movie_based('Toy Story (1995)', N_MOVIES)

Original Movie:
Toy Story (1995)
Adventure|Animation|Children|Comedy|Fantasy


Recommended:
E.T. the Extra-Terrestrial (1982)
Children|Drama|Sci-Fi


Toy Story 2 (1999)
Adventure|Animation|Children|Comedy|Fantasy


Independence Day (a.k.a. ID4) (1996)
Action|Adventure|Sci-Fi|Thriller


Willy Wonka & the Chocolate Factory (1971)
Children|Comedy|Fantasy|Musical


Shrek (2001)
Adventure|Animation|Children|Comedy|Fantasy|Romance


1.5951826572418213  sec


In [25]:
recommendation_movie_based('Die Hard (1988)', N_MOVIES)

Original Movie:
Die Hard (1988)
Action|Crime|Thriller


Recommended:
Top Gun (1986)
Action|Romance


Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Action|Adventure


Aliens (1986)
Action|Adventure|Horror|Sci-Fi


Terminator, The (1984)
Action|Sci-Fi|Thriller


Die Hard 2 (1990)
Action|Adventure|Thriller


1.6171982288360596  sec


In [26]:
recommendation_movie_based('Groundhog Day (1993)', N_MOVIES)

Original Movie:
Groundhog Day (1993)
Comedy|Fantasy|Romance


Recommended:
Men in Black (a.k.a. MIB) (1997)
Action|Comedy|Sci-Fi


Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Action|Adventure


Lethal Weapon (1987)
Action|Comedy|Crime|Drama


Jerry Maguire (1996)
Drama|Romance


Truman Show, The (1998)
Comedy|Drama|Sci-Fi


1.6231858730316162  sec


In [27]:
### 584 ratings / 10 in sample
print(movies[movies['movieId'] == 66371].title)
### 562 ratings / 20 in sample
print(movies[movies['movieId'] == 86644].title)
### 278 ratings / 6 in sample
print(movies[movies['movieId'] == 7833].title)
### 2667 ratings / 69 in sample
print(movies[movies['movieId'] == 678].title)
### 35 ratings / 2 in sample
print(movies[movies['movieId'] == 27571].title)

13452    Departures (Okuribito) (2008)
Name: title, dtype: object
17118    Fast Five (Fast and the Furious 5, The) (2011)
Name: title, dtype: object
7508    Shadow of the Thin Man (1941)
Name: title, dtype: object
668    Some Folks Call It a Sling Blade (1993)
Name: title, dtype: object
9382    Rage in Placid Lake, The (2003)
Name: title, dtype: object


In [28]:
recommendation_movie_based('Departures (Okuribito) (2008)', N_MOVIES)

Original Movie:
Departures (Okuribito) (2008)
Drama


Recommended:
Towelhead (a.k.a. Nothing is Private) (2007)
Drama


Mystic Masseur, The (2001)
Drama


Girl Cut in Two, The (Fille coupée en deux, La) (2007)
Drama|Thriller


Circle, The (Dayereh) (2000)
Drama


Esther Kahn (2000)
Drama


1.491302728652954  sec


In [29]:
recommendation_movie_based('Shadow of the Thin Man (1941)', N_MOVIES)

Original Movie:
Shadow of the Thin Man (1941)
Comedy|Crime|Mystery


Recommended:
Bedlam (1946)
Drama|Horror


Goodbye, Mr. Chips (1939)
Drama|Romance


Show Boat (1951)
Drama|Musical|Romance


Another Thin Man (1939)
Comedy|Crime|Drama|Mystery|Romance


Thin Man Goes Home, The (1945)
Comedy|Crime|Mystery


1.935102939605713  sec


In [30]:
recommendation_movie_based('Fast Five (Fast and the Furious 5, The) (2011)', N_MOVIES)

Original Movie:
Fast Five (Fast and the Furious 5, The) (2011)
Action|Crime|Drama|Thriller|IMAX


Recommended:
2 Guns (2013)
Action|Comedy|Crime


Harold & Kumar Escape from Guantanamo Bay (2008)
Adventure|Comedy


Red Lights (2012)
Drama|Mystery|Thriller


Horrible Bosses (2011)
Comedy|Crime


Salt (2010)
Action|Thriller


1.5870957374572754  sec


In [31]:
recommendation_movie_based('Some Folks Call It a Sling Blade (1993)', N_MOVIES)

Original Movie:
Some Folks Call It a Sling Blade (1993)
Drama|Thriller


Recommended:
It Came from Beneath the Sea (1955)
Sci-Fi


Of Love and Shadows (1994)
Drama


Sid and Nancy (1986)
Drama


Sling Blade (1996)
Drama


Hands on a Hard Body (1996)
Comedy|Documentary


6.870634078979492  sec


In [32]:
recommendation_movie_based('Rage in Placid Lake, The (2003)', N_MOVIES)

Original Movie:
Rage in Placid Lake, The (2003)
Comedy


Recommended:
Hotel Chevalier (Part 1 of 'The Darjeeling Limited') (2007)
Drama


Let the Fire Burn (2013)
Documentary


A Most Violent Year (2014)
Action|Crime|Drama|Thriller


Lola Versus (2012)
Comedy|Romance


Family, The (2013)
Action|Comedy|Crime


1.5456323623657227  sec


### Movie recommendation for user

**User-user similarity**

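We take the weighted sum of the rating vectors of the users most similar to user X (by Pearson correlation coefficient; `N_USERS = 40` below, with user X itself dropped from that list), using the correlation coefficients as weights.
The `N_USERS_MOVIES = 10` movies that user X has not rated and that have the highest positive values in the resulting vector are recommended to user X.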
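
The weighting scheme can be illustrated on a toy matrix. The sketch below is a vectorized variant of the same idea, not the function defined in the next cells; the data, the neighbour count of 2 and the variable names are assumptions made for the example.

```python
import pandas as pd

# Toy mean-centered user-item matrix; 0 means "not rated"
toy_users = pd.DataFrame(
    [[1.0, -1.0, 0.0, 0.0],    # user 1 (target): movies 30 and 40 unrated
     [1.0, -1.0, 1.0, -1.0],   # user 2
     [0.5, -0.5, 0.5, 0.0]],   # user 3
    index=[1, 2, 3],
    columns=[10, 20, 30, 40],
)

target_id = 1
target = toy_users.loc[target_id]

# Pearson correlation of every other user with the target user
corr = toy_users.apply(lambda row: row.corr(target), axis=1).drop(target_id)
neighbours = corr.sort_values(ascending=False).head(2)

# Weighted mean of the neighbours' vectors, weights = correlations
weighted = toy_users.loc[neighbours.index].mul(neighbours, axis=0)
scores = weighted.sum(axis=0) / neighbours.sum()

# Recommend only movies the target has not rated and that score above 0
unseen = target == 0
recommended = scores[unseen & (scores > 0)].sort_values(ascending=False).index.tolist()
print(recommended)
```

Both neighbours agree with the target on movies 10 and 20, so their positive opinion of movie 30 carries over, while movie 40 is filtered out by its negative score.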

In [33]:
user_item_matrix = UI_norm

In [34]:
user_item_matrix.fillna(0, inplace=True)

In [35]:
N_USERS = 40
N_USERS_MOVIES = 10

def recommendation_for_user(userId, n_users):
    start_time = time.time()
    if userId not in user_item_matrix.index:
        print('No such user in list')
        return False
    sum_vector = 0
    user_vector = user_item_matrix.loc[userId]
    corr = user_item_matrix.corrwith(user_vector, method='pearson', axis=1).sort_values(ascending=False).head(n_users)
    weights = 0
    for corr_user in corr[1:].index:
        weights += corr[corr.index == corr_user].values[0]
        # vector * corr_value
        sum_vector += user_item_matrix.loc[corr_user].to_numpy() * corr[corr.index == corr_user].values[0]
    # taking weighted mean
    sum_vector = sum_vector / weights
    result = pd.Series(sum_vector, index=user_item_matrix.columns)
    result = result.sort_values(ascending=False)
    i = 0
    j = 0
    while i < N_USERS_MOVIES and j < len(result):
        if (user_vector.loc[result.index[j]] == 0 and result.loc[result.index[j]] > 0):
            print(movies[movies['movieId'] == result.index[j]].title.values[0])
            print(movies[movies['movieId'] == result.index[j]].genres.values[0])
            print('\n')
            i += 1
        j += 1
    print(time.time() - start_time, ' sec')

**Examples:**

We cannot evaluate the model with online metrics, so we'll use a "hit rate" check. We'll exclude 5 random movies with different ratings and check whether one (or more) of the excluded movies is recommended.
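
The check in the cells below is done by eye on printed titles. If a number is preferred, one possible definition is the share of excluded movies that appear among the recommendations. The helper `hit_rate` and the movie IDs below are hypothetical and exist only for this sketch.

```python
def hit_rate(recommended_ids, hidden_ids):
    """Share of hidden (excluded) movies that appear among the recommendations."""
    hidden = set(hidden_ids)
    if not hidden:
        return 0.0
    hits = hidden.intersection(recommended_ids)
    return len(hits) / len(hidden)

# Toy IDs, invented for the example: 2 of the 5 hidden movies are recovered
recommended_ids = [296, 50, 356, 150]
hidden_ids = [150, 356, 1, 2, 3]
rate = hit_rate(recommended_ids, hidden_ids)
print(rate)
```

Averaging this value over many users would give a single score for comparing parameter choices such as `N_USERS`.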

In [36]:
def show_user_ratings(userId):
    result = ratings[ratings['userId'] == userId][['movieId', 'rating']]
    for i in result.index:
        result.loc[i, 'title'] = movies[movies['movieId'] == result.loc[i, 'movieId']].title.values[0]
    return result

In [37]:
def evaluate_user(userId):
    print('Recommendation before excluding:')
    if (recommendation_for_user(userId, N_USERS) == False): return
    print('\n')
    movies_rated = show_user_ratings(userId)
    movies_excluded = random.sample(list(movies_rated['movieId']), 5)
    print('Movies excluded:')
    for movie in movies_excluded:
        print(movies_rated[movies_rated['movieId'] == movie].title.values[0])
        print('Rating: ' + str(movies_rated[movies_rated['movieId'] == movie].rating.values[0]))
        user_item_matrix.loc[userId, movies[movies['movieId'] == movie].movieId.values[0]] = 0
        print('\n')
    print('\n')
    print('Recommendation after excluding:')
    if (recommendation_for_user(userId, N_USERS) == False): return

In [38]:
evaluate_user(137100)

Recommendation before excluding:
Usual Suspects, The (1995)
Crime|Mystery|Thriller


Godfather: Part II, The (1974)
Crime|Drama


Schindler's List (1993)
Drama|War


Band of Brothers (2001)
Action|Drama|War


Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Fugitive, The (1993)
Thriller


Dark Knight, The (2008)
Action|Crime|Drama|IMAX


Inception (2010)
Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX


Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Action|Adventure


Goodfellas (1990)
Crime|Drama


1.7701396942138672  sec


Movies excluded:
Perfect Score, The (2004)
Rating: 2.5


Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)
Rating: 4.0


Apollo 13 (1995)
Rating: 4.0


Back to the Future (1985)
Rating: 4.5


Forrest Gump (1994)
Rating: 4.5




Recommendation after excluding:
Usual Suspects, The (1995)
Crime|Mystery|Thriller


Godfather: Part II, The (1974)
Crime|Drama


Schindler's List (1993)
Drama|War


Band of Brothers (2001)
Action|Dr

**Apollo 13 (rating 4.0) and Forrest Gump (4.5) are now recommended**

In [39]:
evaluate_user(137500)

Recommendation before excluding:
Shawshank Redemption, The (1994)
Crime|Drama


Godfather, The (1972)
Crime|Drama


Godfather: Part II, The (1974)
Crime|Drama


Forrest Gump (1994)
Comedy|Drama|Romance|War


Prestige, The (2006)
Drama|Mystery|Sci-Fi|Thriller


Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Star Wars: Episode VI - Return of the Jedi (1983)
Action|Adventure|Sci-Fi


Gladiator (2000)
Action|Adventure|Drama


Snatch (2000)
Comedy|Crime|Thriller


Good, the Bad and the Ugly, The (Buono, il brutto, il cattivo, Il) (1966)
Action|Adventure|Western


1.6010088920593262  sec


Movies excluded:
Black Hawk Down (2001)
Rating: 4.0


Lives of Others, The (Das leben der Anderen) (2006)
Rating: 4.5


Village, The (2004)
Rating: 3.0


Schindler's List (1993)
Rating: 5.0


Borat: Cultural Learnings of America for Make Benefit Glorious Nation of Kazakhstan (2006)
Rating: 3.0




Recommendation after excluding:
Shawshank Redemption, The (1994)
Crime|Drama


Godfather, The (1972)

**Schindler's List (5.0) is now recommended**

In [41]:
evaluate_user(138400)

Recommendation before excluding:
Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Forrest Gump (1994)
Comedy|Drama|Romance|War


Schindler's List (1993)
Drama|War


Fargo (1996)
Comedy|Crime|Drama|Thriller


Piano, The (1993)
Drama|Romance


Sense and Sensibility (1995)
Drama|Romance


Terminator 2: Judgment Day (1991)
Action|Sci-Fi


Philadelphia (1993)
Drama


What's Eating Gilbert Grape (1993)
Drama


Leaving Las Vegas (1995)
Drama|Romance


1.2437596321105957  sec


Movies excluded:
Babe (1995)
Rating: 5.0


Apollo 13 (1995)
Rating: 5.0


Aladdin (1992)
Rating: 4.0


Seven (a.k.a. Se7en) (1995)
Rating: 5.0


Net, The (1995)
Rating: 3.0




Recommendation after excluding:
Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Seven (a.k.a. Se7en) (1995)
Mystery|Thriller


Schindler's List (1993)
Drama|War


Forrest Gump (1994)
Comedy|Drama|Romance|War


Fugitive, The (1993)
Thriller


Terminator 2: Judgment Day (1991)
Action|Sci-Fi


Piano, The (1993)
Drama|Romance


Toy S

**Apollo 13 (5.0 rating) and Seven (5.0) are recommended now.**

## Conclusion

The main problems of the collaborative filtering approach are lack of variety and the "cold start" problem. Unpopular items with few ratings are usually not recommended, and it is harder to find good recommendations for new users with few ratings. Lack of variety means that the same movies will be recommended every time until the user-item matrix is updated.

In the movie similarity approach we addressed the lack of variety by taking random samples from the item-user matrix.