# MovieLens collaborative filtering recommendation system

## Problem framing

We want to build a model that recommends new movies to a user based on the ratings that user gave to movies they have already watched.

It will also recommend movies similar to a given movie X, for users who liked X.

## Model's quality control

Online metrics (recommending movies to real users and checking whether they watch and like them) are not available to us. Instead, we'll simulate that situation by artificially removing some of a user's ratings and checking whether the system then recommends those movies.

## Data analysis

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import time
import random

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler

plt.rcParams['figure.figsize']=(10,10)

In [2]:
links = pd.read_csv(r'movie_recommendation/data/link.csv')
movies = pd.read_csv(r'movie_recommendation/data/movie.csv')
ratings = pd.read_csv(r'movie_recommendation/data/rating.csv')
tags = pd.read_csv(r'movie_recommendation/data/tag.csv')

print('links: ')
print(links.head())
print('\n')
print('movies: ')
print(movies.head())
print('\n')
print('ratings: ')
print(ratings.head())
print('\n')
print('tags: ')
print(tags.head())
print('\n')

links: 
   movieId  imdbId   tmdbId
0        1  114709    862.0
1        2  113497   8844.0
2        3  113228  15602.0
3        4  114885  31357.0
4        5  113041  11862.0


movies: 
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  


ratings: 
   userId  movieId  rating            timestamp
0       1        2     3.5  2005-04-02 23:53:47
1       1       29     3.5  2005-04-02 23:31:16
2       1       32     3.5  2005-04-02 23:33:39
3       1     

In [3]:
ratings.describe()

           userId     movieId      rating
count  20000260.0  20000260.0  20000260.0
mean     69045.87    9041.567    3.525529
std      40038.63    19789.48    1.051989
min           1.0         1.0         0.5
25%       34395.0       902.0         3.0
50%       69141.0      2167.0         3.5
75%      103637.0      4770.0         4.0
max      138493.0    131262.0         5.0


**First, we'll check whether our data is balanced.**

In [4]:
#Amount of movies users have rated 
ratings['userId'].value_counts().describe()

count    138493.000000
mean        144.413530
std         230.267257
min          20.000000
25%          35.000000
50%          68.000000
75%         155.000000
max        9254.000000
Name: userId, dtype: float64

The median user rated 68 movies and the standard deviation is about 230, so the number of ratings per user is widely dispersed (the maximum is 9,254). Every user has at least 20 ratings, however, so overall the data seems balanced on the user side.

In [5]:
ratings['movieId'].value_counts().describe()

count    26744.000000
mean       747.841123
std       3085.818268
min          1.000000
25%          3.000000
50%         18.000000
75%        205.000000
max      67310.000000
Name: movieId, dtype: float64

There are many unpopular movies with only a few ratings (the median movie has 18, and a quarter of the movies have 3 or fewer), while the most-rated movie has about 67k.

**Overall the data seems balanced; the only problem is rare movies with very few ratings.**
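The long tail of rarely rated movies can be quantified with a small helper. This is a minimal sketch on toy data; the function name `rating_count_summary` and the `min_ratings` cut-off are illustrative assumptions, not values used elsewhere in this notebook.

```python
import pandas as pd

def rating_count_summary(ratings: pd.DataFrame, min_ratings: int = 20) -> dict:
    """Summarise how ratings are spread over movies.

    `min_ratings` is an illustrative cut-off for calling a movie "rare".
    """
    per_movie = ratings['movieId'].value_counts()
    rare = per_movie < min_ratings
    return {
        'n_movies': int(per_movie.size),
        # share of movies that have fewer than `min_ratings` ratings
        'share_rare_movies': float(rare.mean()),
        # share of all ratings that fall on those rare movies
        'share_ratings_on_rare': float(per_movie[rare].sum() / per_movie.sum()),
    }

# Toy data: movie 1 is popular, movies 2 and 3 have a single rating each
toy_ratings = pd.DataFrame({
    'userId':  [1, 2, 3, 4, 1, 2],
    'movieId': [1, 1, 1, 1, 2, 3],
    'rating':  [4.0, 3.5, 5.0, 4.0, 2.0, 3.0],
})
summary = rating_count_summary(toy_ratings, min_ratings=2)
print(summary)
```

Calling the same helper on the full `ratings` frame would show how many movies fall below a chosen threshold and how little of the total rating volume they carry.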

## Data Preprocessing

### Decrease memory usage by changing dtype

In [6]:
ratings['rating'] = ratings['rating'].astype('float32')
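To see what the cast saves, the memory footprint can be compared before and after. The sketch below uses a toy stand-in for the `rating` column (the name `toy_rating` is hypothetical) and also checks that half-star ratings survive the conversion unchanged.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `rating` column: values 0.5 .. 5.0, float64 by default
toy_rating = pd.Series(np.arange(0.5, 5.5, 0.5).repeat(1000))

before = toy_rating.memory_usage(index=False, deep=True)
after = toy_rating.astype('float32').memory_usage(index=False, deep=True)
print(before, after)

# Multiples of 0.5 are exactly representable in float32, so the round trip is lossless
lossless = bool((toy_rating.astype('float32').astype('float64') == toy_rating).all())
print(lossless)
```

The column shrinks to half its size. The `userId` and `movieId` columns could be downcast in a similar way, provided the chosen integer type still covers the largest id.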

### Create user-item matrix

To keep computation time and memory usage manageable, we'll use the last 500k ratings for learning.

In [7]:
#Amount of movies users have rated 
print('full:')
print(ratings['userId'].value_counts().describe())
print('mln samples:')
print(ratings.iloc[-500000:, :]['userId'].value_counts().describe())

full:
count    138493.000000
mean        144.413530
std         230.267257
min          20.000000
25%          35.000000
50%          68.000000
75%         155.000000
max        9254.000000
Name: userId, dtype: float64
mln samples:
count    3553.000000
mean      140.726147
std       223.118061
min         2.000000
25%        33.000000
50%        65.000000
75%       150.000000
max      3383.000000
Name: userId, dtype: float64


In [8]:
print(ratings['movieId'].value_counts().describe())
print(ratings.iloc[-500000:, :]['movieId'].value_counts().describe())

count    26744.000000
mean       747.841123
std       3085.818268
min          1.000000
25%          3.000000
50%         18.000000
75%        205.000000
max      67310.000000
Name: movieId, dtype: float64
count    12608.000000
mean        39.657360
std        109.438422
min          1.000000
25%          2.000000
50%          6.000000
75%         25.000000
max       1683.000000
Name: movieId, dtype: float64


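Judging by these summary statistics, the 500k sample is reasonably representative in terms of ratings per user (median 65 vs. 68), so we'll use it for learning. Movies receive far fewer ratings each in the sample (median 6 vs. 18), which makes the rare-movie problem more pronounced.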

In [9]:
UI_matrix = ratings.iloc[-500000:, :].pivot(index='userId', columns='movieId', values='rating')

In [10]:
UI_matrix

movieId    1    2    3    4    5    6    7    8    9   10  ...  129428  129528  129532  129555  129838  130522  130642  130840  131013  131158
userId
134941   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
134942   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
134943   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
134944   3.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
134945   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
138489   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
138490   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
138491   2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
138492   NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN


### Scaling 

**We'll mean-center each user's ratings (`StandardScaler` with `with_std=False`, so the standard deviation is left unchanged). This way every user has zero mean, and we can fill NaN ratings with 0, which is that user's own mean. This brings all users to the same level by removing their biases (the same rating value means a different degree of liking for different people).**

In [11]:
UI_numpy = UI_matrix.to_numpy()

In [12]:
scaler = StandardScaler(with_std=False)

In [13]:
UI_norm = scaler.fit_transform(UI_numpy.transpose())

In [14]:
UI_norm = pd.DataFrame(UI_norm.transpose(), index=UI_matrix.index, columns=UI_matrix.columns)
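The scaling cells above amount to subtracting each user's mean rating from that user's row. The toy sketch below shows the same idea in plain pandas, under the assumption that missing ratings are ignored when the mean is computed; the names `toy_ui`, `toy_centered` and `toy_filled` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix: rows are users, columns are movies, NaN = not rated
toy_ui = pd.DataFrame(
    [[5.0, 3.0, np.nan],
     [2.0, np.nan, 4.0]],
    index=pd.Index([10, 11], name='userId'),
    columns=pd.Index([1, 2, 3], name='movieId'),
)

user_means = toy_ui.mean(axis=1)               # NaN entries are skipped
toy_centered = toy_ui.sub(user_means, axis=0)  # every user now has mean 0
toy_filled = toy_centered.fillna(0)            # unrated == the user's own average
print(toy_filled)
```

User 10 has a mean of 4, so the ratings 5 and 3 become +1 and -1. User 11 has a mean of 3, so 2 and 4 become -1 and +1. The unrated cells end up at 0.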

## Collaborative Filtering

### Movies recommendation based on one

Can be used on a movie page in a "you may like" section.

**Item-item similarity**

The 5 movies most similar to movie X (by Pearson correlation coefficient between the movies' vectors in the item-user matrix) are recommended to users who liked movie X.
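As a small worked example of item-item similarity, the sketch below ranks movies in a toy item-user matrix by their Pearson correlation with a chosen movie. The matrix values and the helper name `most_similar` are made up for illustration; the query movie is removed by id before ranking.

```python
import pandas as pd

# Toy item-user matrix (rows: movies, columns: users), already centered and zero-filled
toy_items = pd.DataFrame(
    [[ 1.0, -1.0,  0.5, -0.5],   # movie 1
     [ 0.8, -0.9,  0.4, -0.3],   # movie 2: rated much like movie 1
     [-1.0,  1.0, -0.5,  0.5],   # movie 3: the exact opposite of movie 1
     [ 0.0,  0.5,  0.0, -0.5]],  # movie 4: weakly (negatively) related
    index=pd.Index([1, 2, 3, 4], name='movieId'),
    columns=pd.Index([100, 101, 102, 103], name='userId'),
)

def most_similar(item_user: pd.DataFrame, movie_id, k: int = 5) -> pd.Series:
    """Return the k movies whose rows correlate best with the row of `movie_id`."""
    corr = item_user.corrwith(item_user.loc[movie_id], axis=1, method='pearson')
    # Drop the query movie itself by label rather than by its correlation value
    return corr.drop(movie_id).sort_values(ascending=False).head(k)

top = most_similar(toy_items, movie_id=1, k=2)
print(top)
```

Movie 2 comes first with a correlation close to 1, followed by movie 4; movie 3 is perfectly anti-correlated and ranks last.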

In [15]:
item_user_matrix = pd.DataFrame(UI_norm.to_numpy().transpose(), index=UI_norm.columns, columns=UI_norm.index)

In [16]:
item_user_matrix.fillna(0, inplace=True)

**To decrease calculation time, we'll take a random sample of 3500 movies, calculate the correlations, and keep movies with corr > 0.12, repeating until we have 5 recommendations. This also adds randomization (different movies can be recommended each time).**

For testing and development the parameter is the movie title, but a production version could add the possibility of passing a rating vector to the function, to cover movies and users that are not represented in the sample we use for learning.
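A vector-based variant of the sampled search could look roughly like the sketch below. It is not the function used in this notebook: the name `similar_to_vector` and the parameters `exclude`, `max_attempts` and `seed` are hypothetical additions, with `max_attempts` bounding the loop so it cannot run forever when too few movies pass the threshold.

```python
import numpy as np
import pandas as pd

def similar_to_vector(vector, item_user, n_movies=5, sample_size=3500,
                      min_corr=0.12, exclude=(), max_attempts=20, seed=None):
    """Collect up to n_movies ids whose rows correlate with `vector` above min_corr.

    `vector` must be ordered like item_user.columns (one centered rating per user).
    """
    rng = np.random.RandomState(seed)
    target = pd.Series(np.asarray(vector, dtype=float), index=item_user.columns)
    pool = item_user.drop(index=list(exclude), errors='ignore')
    if pool.empty:
        return pd.Series(dtype=float)
    found = {}
    for _ in range(max_attempts):
        # Draw a random subset of candidate movies for this attempt
        candidates = pool.sample(min(sample_size, len(pool)), random_state=rng)
        corr = candidates.corrwith(target, axis=1, method='pearson')
        # NaN correlations (constant rows) fail the comparison and are skipped
        for movie_id, value in corr[corr > min_corr].items():
            found[movie_id] = value
        if len(found) >= n_movies:
            break
    return pd.Series(found, dtype=float).sort_values(ascending=False).head(n_movies)

# Toy item-user matrix (rows: movies, columns: users)
toy_item_user = pd.DataFrame(
    [[ 1.0, -1.0,  0.5, -0.5],
     [ 0.8, -0.9,  0.4, -0.3],
     [-1.0,  1.0, -0.5,  0.5],
     [ 0.0,  0.5,  0.0, -0.5]],
    index=pd.Index([1, 2, 3, 4], name='movieId'),
    columns=pd.Index([100, 101, 102, 103], name='userId'),
)
result = similar_to_vector(toy_item_user.loc[1].to_numpy(), toy_item_user,
                           n_movies=2, sample_size=3, exclude=[1], seed=0)
print(result)
```

On the toy data only movie 2 clears the 0.12 threshold, so the function returns a single id after exhausting its attempts instead of looping indefinitely.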

In [17]:
N_MOVIES = 5

def recommendation_movie_based(title, n_movies):
    start_time = time.time()
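    # Look up the movie id; the movie must also be present in the sampled item-user matrix
    match = movies.loc[movies.title == title, 'movieId']
    if match.empty or match.iloc[0] not in item_user_matrix.index:
        print('No movie in base')
        return None
    movie_id = match.iloc[0]
    movie_vector = item_user_matrix.loc[movie_id]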
    
    # To decrease calculation time we sample 3500 random movies per iteration,
    # compute their correlation with the target movie and keep those with
    # corr > 0.12 until we have collected n_movies recommendations.
    corr = set()
    while (len(corr) < n_movies):
        temp = item_user_matrix.sample(3500).corrwith(movie_vector, method='pearson',
                                         axis=1).sort_values(ascending=False)
        temp = temp[temp > 0.12]
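        temp = temp[temp < 1]
        # `temp < 1` can miss the query movie because of floating-point rounding, so drop it by id as well
        temp = temp.drop(movie_id, errors='ignore')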
        corr.update(temp.head(n_movies+1).index)
        
    print('Original Movie:')
    print(title)
    print(movies[movies['movieId'] == movie_id].genres.values[0])
    print('\n')
    print('Recommended:')
    iteration_check = 1
    for i in corr:
        if iteration_check > n_movies:
            break
        print(movies[movies['movieId'] == i].title.values[0])
        print(movies[movies['movieId'] == i].genres.values[0])
        print('\n')
        iteration_check += 1
    print(time.time() - start_time, ' sec')

### Testing

#### Examples:

In [18]:
ratings['movieId'].value_counts()

296       67310
356       66172
318       63366
593       63299
480       59715
          ...  
123607        1
90823         1
123609        1
123613        1
131136        1
Name: movieId, Length: 26744, dtype: int64

In [19]:
recommendation_movie_based('Big Lebowski, The (1998)', N_MOVIES)

Original Movie:
Big Lebowski, The (1998)
Comedy|Crime


Recommended:
Trainspotting (1996)
Comedy|Crime|Drama


Fight Club (1999)
Action|Crime|Drama|Thriller


Unforgiven (1992)
Drama|Western


Clockwork Orange, A (1971)
Crime|Drama|Sci-Fi|Thriller


Apocalypse Now (1979)
Action|Drama|War


1.4211773872375488  sec


In [20]:
recommendation_movie_based('Toy Story (1995)', N_MOVIES)

Original Movie:
Toy Story (1995)
Adventure|Animation|Children|Comedy|Fantasy


Recommended:
Incredibles, The (2004)
Action|Adventure|Animation|Children|Comedy


Lion King, The (1994)
Adventure|Animation|Children|Drama|Musical|IMAX


Beauty and the Beast (1991)
Animation|Children|Fantasy|Musical|Romance|IMAX


Bug's Life, A (1998)
Adventure|Animation|Children|Comedy


Back to the Future (1985)
Adventure|Comedy|Sci-Fi


3.005195379257202  sec


In [21]:
recommendation_movie_based('Die Hard (1988)', N_MOVIES)

Original Movie:
Die Hard (1988)
Action|Crime|Thriller


Recommended:
Die Hard (1988)
Action|Crime|Thriller


Terminator 2: Judgment Day (1991)
Action|Sci-Fi


Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)
Action|Adventure


Lethal Weapon (1987)
Action|Comedy|Crime|Drama


Untouchables, The (1987)
Action|Crime|Drama


1.4465672969818115  sec


In [22]:
recommendation_movie_based('Groundhog Day (1993)', N_MOVIES)

Original Movie:
Groundhog Day (1993)
Comedy|Fantasy|Romance


Recommended:
Reservoir Dogs (1992)
Crime|Mystery|Thriller


Vertigo (1958)
Drama|Mystery|Romance|Thriller


L.A. Confidential (1997)
Crime|Film-Noir|Mystery|Thriller


Usual Suspects, The (1995)
Crime|Mystery|Thriller


Time Code (2000)
Comedy|Drama


1.4386675357818604  sec


In [23]:
### 584 ratings
print(movies[movies['movieId'] == 66371].title)
### 195 ratings
print(movies[movies['movieId'] == 8695].title)
### 28 ratings
print(movies[movies['movieId'] == 86644].title)

13452    Departures (Okuribito) (2008)
Name: title, dtype: object
8012    Bachelor and the Bobby-Soxer, The (1947)
Name: title, dtype: object
17118    Fast Five (Fast and the Furious 5, The) (2011)
Name: title, dtype: object


In [24]:
recommendation_movie_based('Departures (Okuribito) (2008)', N_MOVIES)

Original Movie:
Departures (Okuribito) (2008)
Drama


Recommended:
Magical Mystery Tour (1967)
Comedy|Musical


Brick Lane (2007)
Drama


Moving McAllister (2007)
Comedy


India Song (1975)
Drama|Fantasy|Romance


Gainsbourg (Vie Héroïque) (2010)
Drama|Musical|Romance


1.4470324516296387  sec


In [25]:
recommendation_movie_based('Shadow of the Thin Man (1941)', N_MOVIES)

Original Movie:
Shadow of the Thin Man (1941)
Comedy|Crime|Mystery


Recommended:
Anne of Green Gables (1985)
Children|Drama


Nasty Girl, The (schreckliche Mädchen, Das) (1990)
Comedy|Drama


Apollo 13: To the Edge and Back (1994)
Documentary


Anne of Green Gables: The Sequel (a.k.a. Anne of Avonlea) (1987)
Children|Drama|Romance


Absence of Malice (1981)
Drama|Romance


1.4396324157714844  sec


In [26]:
recommendation_movie_based('Fast Five (Fast and the Furious 5, The) (2011)', N_MOVIES)

Original Movie:
Fast Five (Fast and the Furious 5, The) (2011)
Action|Crime|Drama|Thriller|IMAX


Recommended:
Pit and the Pendulum, The (1991)
Horror


Rock of Ages (2012)
Comedy|Drama|Musical|IMAX


Pit, The (1981)
Horror


Pieces (Mil gritos tiene la noche) (One Thousand Cries Has the Night) (1982)
Horror|Mystery|Thriller


Arthur (2011)
Comedy


1.5401551723480225  sec


### Movie recommendation for user

**User-user similarity**

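We take the N_USERS = 40 rows most correlated with user X (by Pearson correlation coefficient); the top match is user X itself and is skipped. The remaining users' vectors are summed, each weighted by its correlation with user X.
The 8 movies with the highest values in the resulting vector that user X has not rated yet are recommended to user X.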
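The weighted sum described above can be illustrated on a toy matrix. This is a simplified sketch, not the notebook's function: the name `recommend_scores`, the toy values and the neighbour count are assumptions, and the sum is written as a single matrix product instead of a loop.

```python
import pandas as pd

# Toy user-item matrix, already centered and zero-filled (0 = unrated)
toy_users = pd.DataFrame(
    [[ 1.0, -1.0,  0.0,  0.0],   # user 10 (target): movies 3 and 4 are unrated
     [ 0.9, -1.1,  0.8,  0.0],   # user 11: similar taste, liked movie 3
     [-1.0,  1.0,  0.0,  0.7],   # user 12: opposite taste, liked movie 4
     [ 1.0, -0.8,  0.5, -0.6]],  # user 13: similar taste, disliked movie 4
    index=pd.Index([10, 11, 12, 13], name='userId'),
    columns=pd.Index([1, 2, 3, 4], name='movieId'),
)

def recommend_scores(user_item: pd.DataFrame, user_id, n_neighbours: int = 2) -> pd.Series:
    target = user_item.loc[user_id]
    others = user_item.drop(index=user_id)
    # Pearson correlation of every other user with the target user
    corr = others.corrwith(target, axis=1).sort_values(ascending=False).head(n_neighbours)
    # Weighted sum: each neighbour's vector scaled by its correlation with the target
    weighted = corr.to_numpy() @ others.loc[corr.index].to_numpy()
    scores = pd.Series(weighted, index=user_item.columns)
    # Keep unseen movies (centered rating 0) with a positive aggregate score
    return scores[(target == 0) & (scores > 0)].sort_values(ascending=False)

recs = recommend_scores(toy_users, user_id=10)
print(recs)
```

Users 11 and 13 are selected as neighbours, so movie 3 gets a positive score and movie 4 a negative one. Note that testing for a value of 0 cannot tell an unrated movie from one rated exactly at the user's mean; the notebook's function shares this limitation.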

In [27]:
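# Fill unrated cells with 0 (the user's own mean after centering); the code below relies on this
user_item_matrix = UI_norm.fillna(0)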

In [28]:
N_USERS = 40
N_USERS_MOVIES = 8

def recommendation_for_user(userId, n_users):
    start_time = time.time()
    if userId not in user_item_matrix.index:
        print('No such user in list')
        return False
    sum_vector = 0
    user_vector = user_item_matrix.loc[userId]
    corr = user_item_matrix.corrwith(user_vector, method='pearson', axis=1).sort_values(ascending=False).head(n_users)
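    # The first entry of corr is the user itself, so start from the second one
    for corr_user, corr_value in corr[1:].items():
        # neighbour's rating vector weighted by its correlation with the target user
        sum_vector += user_item_matrix.loc[corr_user].to_numpy() * corr_value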
    result = pd.Series(sum_vector, index=user_item_matrix.columns)
    result = result.sort_values(ascending=False)
    i = 0
    j = 0
    # Stop early if we run out of candidate movies
    while i < N_USERS_MOVIES and j < len(result):
        if user_vector.loc[result.index[j]] == 0 and result.loc[result.index[j]] > 0:
            print(movies[movies['movieId'] == result.index[j]].title.values[0])
            print(movies[movies['movieId'] == result.index[j]].genres.values[0])
            print('\n')
            i += 1
        j += 1
    print(time.time() - start_time, ' sec')

**Examples:**

We cannot evaluate the model with online metrics, so we'll use a "hit-rate" style check: we exclude 10 random movies the user has rated (with different ratings) and check whether one or more of them gets recommended.
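The hold-out check can also be expressed as a number. The sketch below is one possible formulation, not the evaluation code of this notebook: `hold_out_split` and `hit_rate` are hypothetical helpers, and the hit rate is defined here as the share of held-out movies that reappear among the recommendations.

```python
import random

def hold_out_split(rated_movie_ids, n_holdout=10, seed=None):
    """Split a user's rated movie ids into (kept, held_out)."""
    rng = random.Random(seed)
    rated = list(rated_movie_ids)
    held = rng.sample(rated, min(n_holdout, len(rated)))
    held_set = set(held)
    kept = [m for m in rated if m not in held_set]
    return kept, held

def hit_rate(held_out, recommended):
    """Share of held-out movie ids that appear among the recommendations."""
    held_out = set(held_out)
    if not held_out:
        return 0.0
    return len(held_out & set(recommended)) / len(held_out)

# Toy usage: a user who rated movies 1..20, with 10 of them hidden
kept, held = hold_out_split(range(1, 21), n_holdout=10, seed=0)
# Pretend the recommender returned two hidden movies plus one unrelated id
fake_recommendations = held[:2] + [999]
score = hit_rate(held, fake_recommendations)
print(score)
```

A user counts as a "hit" in the sense used here whenever this score is greater than zero; averaging over many users would give a dataset-level figure.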

In [29]:
def show_user_ratings(userId):
    # Attach titles with a single merge instead of a per-row lookup
    result = ratings.loc[ratings['userId'] == userId, ['movieId', 'rating']]
    return result.merge(movies[['movieId', 'title']], on='movieId', how='left')

In [30]:
def evaluate_user(userId):
    print('Recommendation before excluding:')
    if (recommendation_for_user(userId, N_USERS) == False): return
    print('\n')
    movies_rated = show_user_ratings(userId)
    movies_excluded = random.sample(list(movies_rated['movieId']), 10)
    print('Movies excluded:')
    for movie in movies_excluded:
        print(movies_rated[movies_rated['movieId'] == movie].title.values[0])
        print('Rating: ' + str(movies_rated[movies_rated['movieId'] == movie].rating.values[0]))
        # Hide this rating: 0 means "unrated" after mean-centering
        user_item_matrix.loc[userId, movie] = 0
        print('\n')
    print('\n')
    print('Recommendation after excluding:')
    if (recommendation_for_user(userId, N_USERS) == False): return

In [31]:
evaluate_user(137100)

Recommendation before excluding:
Usual Suspects, The (1995)
Crime|Mystery|Thriller


Godfather: Part II, The (1974)
Crime|Drama


Schindler's List (1993)
Drama|War


Band of Brothers (2001)
Action|Drama|War


Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Fugitive, The (1993)
Thriller


Dark Knight, The (2008)
Action|Crime|Drama|IMAX


Inception (2010)
Action|Crime|Drama|Mystery|Sci-Fi|Thriller|IMAX


1.3773250579833984  sec


Movies excluded:
Parent Trap, The (1961)
Rating: 4.0


Mr. & Mrs. Smith (1941)
Rating: 4.0


27 Dresses (2008)
Rating: 4.5


12 Angry Men (1957)
Rating: 4.5


Matilda (1996)
Rating: 4.0


Akeelah and the Bee (2006)
Rating: 3.0


Shawshank Redemption, The (1994)
Rating: 5.0


Never Been Kissed (1999)
Rating: 4.0


Princess Bride, The (1987)
Rating: 4.5


Divine Secrets of the Ya-Ya Sisterhood (2002)
Rating: 4.5




Recommendation after excluding:
Usual Suspects, The (1995)
Crime|Mystery|Thriller


Godfather: Part II, The (1974)
Crime|Drama


Shawshank Re

**Shawshank Redemption (5.0 rating) is now recommended**

In [32]:
evaluate_user(137500)

Recommendation before excluding:
Shawshank Redemption, The (1994)
Crime|Drama


Godfather, The (1972)
Crime|Drama


Godfather: Part II, The (1974)
Crime|Drama


Forrest Gump (1994)
Comedy|Drama|Romance|War


Prestige, The (2006)
Drama|Mystery|Sci-Fi|Thriller


Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Star Wars: Episode VI - Return of the Jedi (1983)
Action|Adventure|Sci-Fi


Gladiator (2000)
Action|Adventure|Drama


1.3781049251556396  sec


Movies excluded:
Saving Private Ryan (1998)
Rating: 4.5


The Interview (2014)
Rating: 2.5


Chronicles of Riddick, The (2004)
Rating: 3.0


Shaun of the Dead (2004)
Rating: 4.0


Good bye, Lenin! (2003)
Rating: 4.0


Bruce Almighty (2003)
Rating: 3.5


Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
Rating: 4.0


Kill Bill: Vol. 1 (2003)
Rating: 4.5


American History X (1998)
Rating: 4.0


Pirates of the Caribbean: The Curse of the Black Pearl (2003)
Rating: 4.0




Recommendation after excluding:
Shaws

**American History X (4.0 rating) is now recommended**

In [33]:
evaluate_user(138400)

Recommendation before excluding:
Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Schindler's List (1993)
Drama|War


Forrest Gump (1994)
Comedy|Drama|Romance|War


Fugitive, The (1993)
Thriller


Fargo (1996)
Comedy|Crime|Drama|Thriller


Get Shorty (1995)
Comedy|Crime|Thriller


Toy Story (1995)
Adventure|Animation|Children|Comedy|Fantasy


Terminator 2: Judgment Day (1991)
Action|Sci-Fi


1.3505089282989502  sec


Movies excluded:
Batman Forever (1995)
Rating: 3.0


Clueless (1995)
Rating: 3.0


Net, The (1995)
Rating: 3.0


Apollo 13 (1995)
Rating: 5.0


Batman (1989)
Rating: 3.0


Ace Ventura: Pet Detective (1994)
Rating: 3.0


Usual Suspects, The (1995)
Rating: 5.0


Dances with Wolves (1990)
Rating: 4.0


Disclosure (1994)
Rating: 3.0


Crimson Tide (1995)
Rating: 4.0




Recommendation after excluding:
Silence of the Lambs, The (1991)
Crime|Horror|Thriller


Usual Suspects, The (1995)
Crime|Mystery|Thriller


Forrest Gump (1994)
Comedy|Drama|Romance|War


Terminator 2: 

**Apollo 13 (5.0 rating) and Usual Suspects (5.0) are recommended now.**

## Conclusion

The main problems of the collaborative filtering approach are lack of variety and the "cold start" problem. Unpopular items with few ratings are usually not recommended, and it is harder to find good recommendations for new users with few ratings. Lack of variety means that the same movies are recommended every time until the user-item matrix is updated.

In the movie-similarity approach we addressed the lack of variety by taking random samples from the item-user matrix.