### Project Introduction

The goal of this project is to build a simple movie recommendation system that I can potentially use for myself.

### Notebook Setup

In [1]:
import pandas as pd
import numpy as np
import re

### Reading the Data Set

In [2]:
movies = pd.read_csv('movies.csv')
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


### Cleaning the Movie Titles Using Regex

In [4]:
def clean_title(title):
    """
    Remove any character that isn't an ASCII letter, digit, or space.
    """
    return re.sub("[^a-zA-Z0-9 ]","",title)

In [5]:
# Applying the function above to create a new cleaned title column

movies['clean_title'] = movies['title'].apply(clean_title)
movies.head(5)

Unnamed: 0,movieId,title,genres,clean_title
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
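
To make the effect of the cleaning step concrete, here is a minimal, self-contained sketch that repeats `clean_title` on a few sample strings. The first two titles appear in the data set; the accented title is a hypothetical input added only to show that the pattern also drops non-ASCII letters.

```python
import re

def clean_title(title):
    """Drop every character that is not an ASCII letter, digit, or space."""
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# The last sample is a hypothetical title with an accented letter (\u00e9).
samples = ["Bug's Life, A (1998)", "Monsters, Inc. (2001)", "Am\u00e9lie (2001)"]
cleaned = [clean_title(t) for t in samples]

for original, result in zip(samples, cleaned):
    print(f"{original!r} -> {result!r}")
```

Punctuation and parentheses disappear, which is what we want for matching, but the accented letter is removed as well. That may matter if searches for non-English titles are important.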


### Creating a TF-IDF Matrix

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1,2))

tfidf = vectorizer.fit_transform(movies['clean_title'])


In [7]:
tfidf

<62423x170073 sparse matrix of type '<class 'numpy.float64'>'
	with 446566 stored elements in Compressed Sparse Row format>
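
The 170,073 columns are the distinct single words and two-word phrases found in the cleaned titles. As a rough illustration, the sketch below fits the same kind of vectorizer on three toy titles (not the full data set) and lists the features it learns.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy titles, already cleaned; the real notebook uses movies['clean_title'].
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
tfidf = vectorizer.fit_transform(titles)

features = sorted(vectorizer.vocabulary_)
print(features)
print(tfidf.shape)
```

With the default settings, terms are lowercased and tokens shorter than two characters are ignored, so the "2" in "Toy Story 2" does not become a feature. Sequel numbers therefore do not help distinguish titles in this setup.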

### Creating a Search Function

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

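def search(title):
    # Clean the query the same way the titles were cleaned
    title = clean_title(title)
    query_vec = vectorizer.transform([title])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    # argpartition picks the 5 highest-similarity rows but does not sort them,
    # so the order of the returned rows is not guaranteed to follow the score
    indices = np.argpartition(similarity, -5)[-5:]
    results = movies.iloc[indices][::-1]
    return results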

In [9]:
search('any')

Unnamed: 0,movieId,title,genres,clean_title
30671,136788,Any Day (2015),Drama|Romance|Thriller,Any Day 2015
2892,2984,On Any Sunday (1971),Documentary,On Any Sunday 1971
19838,103030,Any Day Now (2012),Drama,Any Day Now 2012
20269,104947,At Any Price (2012),Drama|Thriller,At Any Price 2012
3080,3173,Any Given Sunday (1999),Drama,Any Given Sunday 1999
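
One caveat: `np.argpartition` guarantees that the last five indices point to the five largest similarities, but it does not order them among themselves. If the best match should reliably come first, one possible approach is to sort just those few indices afterwards. The helper name `top_k_sorted` and the sample scores below are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def top_k_sorted(similarity, k=5):
    """Return the indices of the k largest scores, best match first (k >= 1)."""
    k = min(k, len(similarity))
    top = np.argpartition(similarity, -k)[-k:]      # top k, in arbitrary order
    return top[np.argsort(similarity[top])[::-1]]   # order those k by score

# Made-up similarity scores for seven hypothetical titles.
scores = np.array([0.10, 0.90, 0.30, 0.70, 0.20, 0.80, 0.05])
best = top_k_sorted(scores, k=3)
print(best)
```

Sorting only the selected entries keeps the cost low even when the similarity vector has tens of thousands of elements.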


### Building an Interactive Search Box in Jupyter

In [10]:
import ipywidgets as widgets
from IPython.display import display


movie_input = widgets.Text(
                value='Toy Story',
                description = 'Movie Title',
                disabled=False
            )

movie_list = widgets.Output()

def on_type(data):
    with movie_list:
        movie_list.clear_output()
        title = data['new']
        if len(title) > 5:
            display(search(title))
            
movie_input.observe(on_type, names='value')

display(movie_input, movie_list)

Text(value='Toy Story', description='Movie Title')

Output()

### Reading in Movie Ratings Data

In [11]:
ratings = pd.read_csv('ratings.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


To build the recommendation engine, we look for users who liked the same movie we liked and then find the other movies they rated highly, since those are likely to be good recommendations for us.

We then apply a threshold: a movie is kept only if more than 10% of these similar users also liked it.
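
Before running this on the full ratings file, the idea can be sketched on a tiny, made-up ratings table. All values below are hypothetical, and the threshold is raised to 0.5 only because the toy table has so few users.

```python
import pandas as pd

# Hypothetical ratings: users 1-3 liked movie 10; user 4 never rated it.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 40, 20, 30],
    "rating":  [5.0, 4.5, 3.0, 4.5, 5.0, 5.0, 4.5, 5.0, 5.0],
})
liked = toy_ratings[toy_ratings["rating"] > 4]

# Users who liked the target movie (movieId 10).
similar_users = liked.loc[liked["movieId"] == 10, "userId"].unique()

# Share of those users who liked each movie.
share = (liked[liked["userId"].isin(similar_users)]["movieId"]
         .value_counts() / len(similar_users))

threshold = 0.5  # illustrative; the notebook uses 0.1
kept = share[share > threshold]
print(kept)
```

Movie 20 survives because two of the three similar users liked it, while movie 40 (one of three) falls below the threshold. Movie 30 never appears, since no similar user rated it above 4.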

### Finding Users Who Liked the Same Movie

Users who also liked the movie we liked, meaning they rated it above 4:

In [12]:
movieId = 1

similar_users = ratings[(ratings['movieId']==movieId) & 
                        (ratings['rating']>4)]['userId'].unique()

similar_users

array([    36,     75,     86, ..., 162527, 162530, 162533])

Finding other movies that these users gave a high rating to:

In [13]:
similar_user_recs = ratings[(ratings['userId'].isin(similar_users))&
                           (ratings['rating']>4)]['movieId']
similar_user_recs

5101            1
5105           34
5111          110
5114          150
5127          260
            ...  
24998854    60069
24998861    67997
24998876    78499
24998884    81591
24998888    88129
Name: movieId, Length: 1358326, dtype: int64

Finally, keeping only the movies that more than 10% of the users similar to us liked:

In [14]:
similar_user_recs = similar_user_recs.value_counts()/len(similar_users)
similar_user_recs = similar_user_recs[similar_user_recs > 0.1]

In [15]:
similar_user_recs

1        1.000000
318      0.445607
260      0.403770
356      0.370215
296      0.367295
           ...   
953      0.103053
551      0.101195
1222     0.100876
745      0.100345
48780    0.100186
Name: movieId, Length: 113, dtype: float64

### Determining How Many Overall Users Like These Movies

Finding movies that many similar users liked is not enough. We also want movies that are specifically related to the movie we liked, rather than movies that are simply popular with everyone.

For example, a very popular movie such as "Harry Potter" might be liked by most users, not just the users similar to us.

So the next step here is to find the number of overall users that also liked each of these recommended movies.

In [16]:
similar_user_recs.index

Int64Index([    1,   318,   260,   356,   296,  2571,  1196,  1198,   593,
              527,
            ...
             8368,  4896,  1259, 59315,   778,   953,   551,  1222,   745,
            48780],
           dtype='int64', length=113)

In [17]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [18]:
all_users = ratings[(ratings['movieId'].isin(similar_user_recs.index)) &
       (ratings['rating']>4)]

In [19]:
# all_users holds every rating above 4 for the recommended movies.
# Convert the per-movie counts into a share of the users who liked at least one of them.
all_user_recs = all_users['movieId'].value_counts() / len(all_users['userId'].unique())

In [20]:
all_user_recs

318      0.342220
296      0.284674
2571     0.244033
356      0.235266
593      0.225909
           ...   
551      0.040918
50872    0.039111
745      0.037031
78499    0.035131
2355     0.025091
Name: movieId, Length: 113, dtype: float64
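
Note what the denominator is here: `len(all_users['userId'].unique())` counts the users who liked at least one of the recommended movies, not every user in `ratings`. The toy sketch below, with made-up values, contrasts that choice with dividing by the total number of users. Which one is preferable is a design decision; the second option is shown only as an alternative.

```python
import pandas as pd

# Hypothetical ratings from five users.
toy = pd.DataFrame({
    "userId":  [1, 2, 3, 4, 5],
    "movieId": [10, 10, 20, 30, 30],
    "rating":  [5.0, 4.5, 5.0, 5.0, 2.0],
})
candidates = [10, 20]  # stand-in for similar_user_recs.index

liked = toy[toy["movieId"].isin(candidates) & (toy["rating"] > 4)]
counts = liked["movieId"].value_counts()

per_liker = counts / liked["userId"].nunique()    # notebook's denominator (3 users)
per_everyone = counts / toy["userId"].nunique()   # alternative: all 5 users
print(per_liker)
print(per_everyone)
```

Both versions rank the movies identically, because every count is divided by the same constant. Only the absolute size of the resulting score changes.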

### Creating a Recommendation Score

Let us start by comparing the percentages of similar users and all users who like each of the movies in our recommended list.

In [21]:
rec_percentages = pd.concat([similar_user_recs, all_user_recs],axis=1)
rec_percentages.columns = ['similar','all']
rec_percentages

Unnamed: 0,similar,all
1,1.000000,0.124728
32,0.160711,0.100293
34,0.130555,0.052229
47,0.225909,0.144469
50,0.275604,0.200513
...,...,...
59315,0.104593,0.054269
60069,0.170640,0.076307
68954,0.159172,0.064944
78499,0.152960,0.035131


Now, to create a recommendation score, we divide the 'similar' column by the 'all' column to get a ratio. A high ratio means the movie is liked far more often by users similar to us than by users in general, so the bigger the number, the better the recommendation.
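
As a quick worked example, the sketch below applies the ratio to two movies using percentages taken from the outputs in this notebook: movie 3114 (0.280648 among similar users, 0.053706 among all users) and movie 318 (0.445607 and 0.342220).

```python
import pandas as pd

# Percentages copied from the notebook outputs for movieId 3114 and 318.
rec = pd.DataFrame(
    {"similar": [0.280648, 0.445607], "all": [0.053706, 0.342220]},
    index=pd.Index([3114, 318], name="movieId"),
)

# Ratio of the share among similar users to the share among all users.
rec["score"] = rec["similar"] / rec["all"]
print(rec.sort_values("score", ascending=False))
```

Movie 318 is liked by more of the similar users, but it is also liked by about a third of everyone, so its ratio is only about 1.3. Movie 3114 scores about 5.2 because its popularity is concentrated among users similar to us.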

In [22]:
rec_percentages['score'] = rec_percentages['similar']/rec_percentages['all']
rec_percentages.sort_values('score', ascending=False, inplace=True)
rec_percentages.head()

Unnamed: 0,similar,all,score
1,1.0,0.124728,8.017414
3114,0.280648,0.053706,5.225654
2355,0.110539,0.025091,4.405452
78499,0.15296,0.035131,4.354038
4886,0.235147,0.070811,3.320783


In [23]:
rec_percentages.head(10).merge(movies, left_index=True, right_on='movieId')

Unnamed: 0,similar,all,score,movieId,title,genres,clean_title
0,1.0,0.124728,8.017414,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.280648,0.053706,5.225654,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
2264,0.110539,0.025091,4.405452,2355,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy,Bugs Life A 1998
14813,0.15296,0.035131,4.354038,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
4780,0.235147,0.070811,3.320783,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
580,0.216618,0.067513,3.208539,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
6258,0.228139,0.072268,3.156862,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
587,0.1794,0.059977,2.99115,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991
8246,0.203504,0.068453,2.972889,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
359,0.253411,0.085764,2.954762,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994


### Building a Recommendation Function

In [24]:
def find_similar_movies(movie_id):
    """Return the 10 highest-scoring recommendations for the given movie ID."""
    # Users who rated the given movie above 4
    similar_users = ratings[(ratings['movieId'] == movie_id) &
                            (ratings['rating'] > 4)]['userId'].unique()

    # Other movies these users rated above 4, kept if more than 10% liked them
    similar_user_recs = ratings[(ratings['userId'].isin(similar_users)) &
                                (ratings['rating'] > 4)]['movieId']
    similar_user_recs = similar_user_recs.value_counts() / len(similar_users)
    similar_user_recs = similar_user_recs[similar_user_recs > 0.1]

    # How often the same movies are liked among all users who liked any of them
    all_users = ratings[(ratings['movieId'].isin(similar_user_recs.index)) &
                        (ratings['rating'] > 4)]
    all_user_recs = all_users['movieId'].value_counts() / len(all_users['userId'].unique())

    # Score = share among similar users / share among all users
    rec_percentages = pd.concat([similar_user_recs, all_user_recs], axis=1)
    rec_percentages.columns = ['similar', 'all']
    rec_percentages['score'] = rec_percentages['similar'] / rec_percentages['all']
    rec_percentages.sort_values('score', ascending=False, inplace=True)

    final_rec = rec_percentages.head(10).merge(movies, left_index=True, right_on='movieId')
    final_rec = final_rec[['score', 'title', 'genres']]
    return final_rec

In [25]:
find_similar_movies(1)

Unnamed: 0,score,title,genres
0,8.017414,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3021,5.225654,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy
2264,4.405452,"Bug's Life, A (1998)",Adventure|Animation|Children|Comedy
14813,4.354038,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX
4780,3.320783,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy
580,3.208539,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical
6258,3.156862,Finding Nemo (2003),Adventure|Animation|Children|Comedy
587,2.99115,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX
8246,2.972889,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy
359,2.954762,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX


### Creating an Interactive Recommendation Widget

In [26]:
rec_input = widgets.Text(
                value='Toy Story',
                description = 'Movie Title:',
                disabled=False
            )

recommended_list = widgets.Output()

def on_type_rec(data):
    with recommended_list:
        recommended_list.clear_output()
        title = data['new']  
        if len(title)>5:
            df = search(title)
            movie_id = df.iloc[0]['movieId']
            display(find_similar_movies(movie_id))

rec_input.observe(on_type_rec, names='value')

display(rec_input, recommended_list)

Text(value='Toy Story', description='Movie Title:')

Output()