## Movie Recommendation: IMDeezMovies
Aavash Upadhyaya and Quang Vo

Data set retrieved from https://grouplens.org/datasets/movielens/

General overview
- A user searches for a movie title, and the query is run through the search function
- The ratings data is then used to find users who rated the searched movie highly and to see which other movies those users liked
- The movies that fit best, based on what those similar users liked, are returned as recommendations

### Importing and Formatting Data

In [276]:
import pandas as pd
import numpy as np
import re ##regular expression library
from sklearn.feature_extraction.text import TfidfVectorizer #turns text into feature vectors
from sklearn.metrics.pairwise import cosine_similarity

movieData = pd.read_csv("movies.csv")
ratingData = pd.read_csv("ratings.csv")

In [277]:
movieData

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
62418,209157,We (2018),Drama
62419,209159,Window of the Soul (2001),Documentary
62420,209163,Bad Poems (2018),Comedy|Drama
62421,209169,A Girl Thing (2001),(no genres listed)


In [278]:
def formatTitle(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)
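As a quick check of what `formatTitle` does, the sketch below applies the same regular expression to three titles from the data set. The names `examples` and `cleaned` are introduced here only for illustration.

```python
import re

def formatTitle(title):
    # Drop every character that is not an ASCII letter, a digit, or a space
    return re.sub("[^a-zA-Z0-9 ]", "", title)

examples = [
    "Toy Story (1995)",
    "Monsters, Inc. (2001)",
    "Wallace & Gromit: The Wrong Trousers (1993)",
]
cleaned = [formatTitle(t) for t in examples]

for before, after in zip(examples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Removing `&` leaves two consecutive spaces in the third title, and accented letters are dropped as well, since the pattern keeps only ASCII letters, digits, and spaces.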

In [279]:
movieData["formatTitle"] = movieData["title"].apply(formatTitle) ##Keeps only letters, digits, and spaces in titles

In [280]:
movieData ##should now have another column with the cleaned titles

Unnamed: 0,movieId,title,genres,formatTitle
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jumanji 1995
2,3,Grumpier Old Men (1995),Comedy|Romance,Grumpier Old Men 1995
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Waiting to Exhale 1995
4,5,Father of the Bride Part II (1995),Comedy,Father of the Bride Part II 1995
...,...,...,...,...
62418,209157,We (2018),Drama,We 2018
62419,209159,Window of the Soul (2001),Documentary,Window of the Soul 2001
62420,209163,Bad Poems (2018),Comedy|Drama,Bad Poems 2018
62421,209169,A Girl Thing (2001),(no genres listed),A Girl Thing 2001


In [281]:
vectorizer = TfidfVectorizer(ngram_range=(1, 2)) ##converts movie titles into a TF-IDF matrix weighted by how often
                                                  #each one-word or two-word term appears in the titles
tfidf = vectorizer.fit_transform(movieData["formatTitle"])
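To make the effect of `ngram_range=(1, 2)` concrete, here is a minimal sketch on three cleaned titles. The variables `titles`, `vec`, `matrix`, and `vocab` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three cleaned titles, used here only as a tiny illustrative corpus
titles = ["Toy Story 1995", "Toy Story 2 1999", "Jumanji 1995"]

vec = TfidfVectorizer(ngram_range=(1, 2))  # single words and two-word phrases
matrix = vec.fit_transform(titles)         # one row per title, one column per term

vocab = sorted(vec.vocabulary_)
print(vocab)
print(matrix.shape)
```

With the default tokenizer settings, tokens are lowercased and single-character tokens such as the "2" in "Toy Story 2" are dropped, so a sequel number on its own does not contribute to the match.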

## Computing Title Similarities

We will use this for the user search portion.

In [282]:
title = "Toy"
title = formatTitle(title)
qVec = vectorizer.transform([title])
similarArr = cosine_similarity(qVec, tfidf).flatten()
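indices = np.argpartition(similarArr, -5)[-5:] #indices of the five highest similarity scores, in no guaranteed order
topFive = movieData.iloc[indices][::-1] #the five most similar titles (not sorted by similarity)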

In [283]:
similarArr #Similarity of the query "Toy" to every title in the data set
           #(index 0 is "Toy Story 1995", so it has a nonzero similarity score)

array([0.47886319, 0.        , 0.        , ..., 0.        , 0.        ,
       0.        ])

In [284]:
topFive

Unnamed: 0,movieId,title,genres,formatTitle
14813,78499,Toy Story 3 (2010),Adventure|Animation|Children|Comedy|Fantasy|IMAX,Toy Story 3 2010
3021,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
4823,4929,"Toy, The (1982)",Comedy,Toy The 1982
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
59767,201588,Toy Story 4 (2019),Adventure|Animation|Children|Comedy,Toy Story 4 2019
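`np.argpartition` only guarantees that the last five positions hold the five largest scores, not that they are ordered, so the rows above need not appear in descending order of similarity. If a ranked list is wanted, one option is to sort just the selected indices afterwards. The sketch below uses a made-up `scores` array rather than the real similarity values.

```python
import numpy as np

# Made-up similarity scores, for illustration only
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8])
k = 3

# Indices of the k largest scores, in no particular order
top_unsorted = np.argpartition(scores, -k)[-k:]

# Sort only those k indices by score, highest first
top_sorted = top_unsorted[np.argsort(scores[top_unsorted])[::-1]]

print(top_sorted)          # indices ordered from most to least similar
print(scores[top_sorted])  # the corresponding scores
```

Sorting only the selected entries keeps the cost low even when the score array has tens of thousands of elements.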


## Retrieve Recommendations from User Recs

In [285]:
similarRecs = ratingData[(ratingData["movieId"] == 1) & (ratingData["rating"] >= 4.0)]["userId"].unique()
##finds all users in ratings.csv who rated movieId 1 (Toy Story) 4.0 or higher
similarRecs

array([     3,      5,      8, ..., 162530, 162533, 162534], dtype=int64)

In [286]:
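##all movies that those similar users rated above 4 (note the strict > 4 here, versus >= 4.0 in the previous cell)
similarUserRec = ratingData[(ratingData["userId"].isin(similarRecs)) & (ratingData["rating"] > 4)]["movieId"]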
similarUserRec

255             29
256             32
257             50
261            214
263            293
             ...  
24999248    101962
24999269    109487
24999326    164179
24999329    165549
24999348    177593
Name: movieId, Length: 2321248, dtype: int64

In [287]:
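similarUserRec = similarUserRec.value_counts() / len(similarRecs) ##fraction of similar users who rated each movie above 4
similarUserRec = similarUserRec[similarUserRec > .1] ##keep only movies that more than 10% of similar users liked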
similarUserRec

1       0.499483
318     0.421226
260     0.367817
296     0.353337
356     0.322708
          ...   
1148    0.103609
1527    0.102867
4995    0.102522
778     0.102495
34      0.100162
Name: movieId, Length: 90, dtype: float64
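The two filtering steps and the share calculation can be traced by hand on a tiny table. The ratings below are hypothetical, not taken from MovieLens, and the names `ratings`, `target`, `similar_users`, `liked`, and `share` are introduced only for this sketch.

```python
import pandas as pd

# Hypothetical toy ratings, not taken from MovieLens
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3, 3, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 10, 30, 20, 30],
    "rating":  [5.0, 4.5, 3.0, 4.0, 5.0, 2.0, 5.0, 5.0, 4.5],
})

target = 10

# Users who rated the target movie 4.0 or higher
similar_users = ratings[(ratings["movieId"] == target) & (ratings["rating"] >= 4.0)]["userId"].unique()

# Movies those users rated strictly above 4
liked = ratings[ratings["userId"].isin(similar_users) & (ratings["rating"] > 4)]["movieId"]

# Fraction of similar users who liked each movie
share = liked.value_counts() / len(similar_users)
print(share)
```

Here users 1 and 2 are the similar users, but user 2 gave the target exactly 4.0, so the target's own share is 0.5 rather than 1. The same difference between `>= 4.0` and `> 4` is why movieId 1 scores about 0.5 in the output above.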

In [288]:
userSet = ratingData[(ratingData["movieId"].isin(similarUserRec.index)) & (ratingData["rating"] > 4)]
userSetRec = userSet["movieId"].value_counts() / len(userSet["userId"].unique())
##We want movies liked specifically by people who like the given movie, not movies that most people like anyway
##So this computes the share of users overall (anyone who rated a candidate movie above 4) who like each candidate,
##to compare with the share among similar users
userSetRec

318     0.345497
296     0.287399
2571    0.246370
356     0.237518
593     0.228071
          ...   
3114    0.054220
2716    0.053892
34      0.052729
1073    0.049232
1148    0.047922
Name: movieId, Length: 90, dtype: float64

In [289]:
recommend_quality = pd.concat([similarUserRec, userSetRec], axis=1)
#compares the share of similar users who liked each movie with the share of users overall
#we are looking for movies with a big gap: liked by many similar users but by relatively few users overall
recommend_quality.columns = ["similar","all"]
recommend_quality

Unnamed: 0,similar,all
1,0.499483,0.125923
318,0.421226,0.345497
260,0.367817,0.224334
296,0.353337,0.287399
356,0.322708,0.237518
...,...,...
1148,0.103609,0.047922
1527,0.102867,0.066762
4995,0.102522,0.076403
778,0.102495,0.075473


In [290]:
recommend_quality["quality"] = recommend_quality["similar"]/recommend_quality["all"]
recommend_quality = recommend_quality.sort_values("quality", ascending=False)
##movies with a higher quality score are better recommendations
recommend_quality

Unnamed: 0,similar,all,quality
1,0.499483,0.125923,3.966586
3114,0.170357,0.054220,3.141967
4886,0.166645,0.071489,2.331060
6377,0.166565,0.072960,2.282977
1073,0.111591,0.049232,2.266621
...,...,...,...
58559,0.180461,0.147871,1.220392
318,0.421226,0.345497,1.219189
4973,0.136148,0.113481,1.199744
2959,0.252380,0.218792,1.153517
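The quality score computed above is a plain ratio of the two columns. Written out for a movie $m$, with $\text{similar}(m)$ and $\text{all}(m)$ standing for the values in the `similar` and `all` columns:

$$
\text{quality}(m) = \frac{\text{similar}(m)}{\text{all}(m)}
$$

Two rows of the output serve as a check:

$$
\text{quality}(1) = \frac{0.499483}{0.125923} \approx 3.9666, \qquad
\text{quality}(318) = \frac{0.421226}{0.345497} \approx 1.2192
$$

A score close to 1 means the movie is about as popular with similar users as with users overall, so it says little about this particular taste. A score well above 1 means the movie is disproportionately liked by similar users.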


In [291]:
recommend_quality.head(10).merge(movieData, left_index=True, right_on="movieId")

Unnamed: 0,similar,all,quality,movieId,title,genres,formatTitle
0,0.499483,0.125923,3.966586,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 1995
3021,0.170357,0.05422,3.141967,3114,Toy Story 2 (1999),Adventure|Animation|Children|Comedy|Fantasy,Toy Story 2 1999
4780,0.166645,0.071489,2.33106,4886,"Monsters, Inc. (2001)",Adventure|Animation|Children|Comedy|Fantasy,Monsters Inc 2001
6258,0.166565,0.07296,2.282977,6377,Finding Nemo (2003),Adventure|Animation|Children|Comedy,Finding Nemo 2003
1047,0.111591,0.049232,2.266621,1073,Willy Wonka & the Chocolate Factory (1971),Children|Comedy|Fantasy|Musical,Willy Wonka the Chocolate Factory 1971
8246,0.154207,0.069109,2.231373,8961,"Incredibles, The (2004)",Action|Adventure|Animation|Children|Comedy,Incredibles The 2004
580,0.151449,0.068159,2.221989,588,Aladdin (1992),Adventure|Animation|Children|Comedy|Musical,Aladdin 1992
1120,0.103609,0.047922,2.162033,1148,Wallace & Gromit: The Wrong Trousers (1993),Animation|Children|Comedy|Crime,Wallace Gromit The Wrong Trousers 1993
359,0.18473,0.086585,2.133522,364,"Lion King, The (1994)",Adventure|Animation|Children|Drama|Musical|IMAX,Lion King The 1994
587,0.12806,0.060551,2.1149,595,Beauty and the Beast (1991),Animation|Children|Fantasy|Musical|Romance|IMAX,Beauty and the Beast 1991


## Putting it all into a function:

The steps above are combined into a single function.

In [292]:
def findSimilarMovies(movieId):
    ##users who rated the given movie 4.0 or higher
    similarRecs = ratingData[(ratingData["movieId"] == movieId) & (ratingData["rating"] >= 4.0)]["userId"].unique()
    ##movies that those users rated above 4
    similarUserRec = ratingData[(ratingData["userId"].isin(similarRecs)) & (ratingData["rating"] > 4)]["movieId"]

    ##fraction of similar users who liked each movie; keep movies above 10%
    similarUserRec = similarUserRec.value_counts() / len(similarRecs)
    similarUserRec = similarUserRec[similarUserRec > .1]

    ##share of users overall who liked the same candidate movies
    userSet = ratingData[(ratingData["movieId"].isin(similarUserRec.index)) & (ratingData["rating"] > 4)]
    userSetRec = userSet["movieId"].value_counts() / len(userSet["userId"].unique())

    ##put both shares side by side, aligned on movieId
    recommend_quality = pd.concat([similarUserRec, userSetRec], axis=1)
    recommend_quality.columns = ["similar","all"]

    ##quality = similar share / overall share; higher means more specific to similar users
    recommend_quality["quality"] = recommend_quality["similar"]/recommend_quality["all"]
    recommend_quality = recommend_quality.sort_values("quality", ascending=False)
    return recommend_quality.head(10).merge(movieData, left_index=True, right_on="movieId")[["movieId","quality","genres","formatTitle"]]

In [293]:
findSimilarMovies(2) ##Finding recommendation quality for movieId 2, which is Jumanji

Unnamed: 0,movieId,quality,genres,formatTitle
1,2,18.20058,Adventure|Children|Fantasy,Jumanji 1995
578,586,5.504137,Children|Comedy,Home Alone 1990
495,500,5.460957,Comedy|Drama,Mrs Doubtfire 1993
362,367,5.196343,Action|Comedy|Crime|Fantasy,Mask The 1994
721,736,5.00557,Action|Adventure|Romance|Thriller,Twister 1996
579,587,4.482292,Comedy|Drama|Fantasy|Romance|Thriller,Ghost 1990
312,316,4.244766,Action|Adventure|Sci-Fi,Stargate 1994
372,377,3.858918,Action|Romance|Thriller,Speed 1994
534,539,3.642351,Comedy|Drama|Romance,Sleepless in Seattle 1993
1922,2011,3.61657,Adventure|Comedy|Sci-Fi,Back to the Future Part II 1989
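To check the logic without loading the full data set, the same steps can be run on a small hand-made table. The sketch below is a hypothetical variant, `find_similar_movies`, that takes the ratings table as an argument instead of reading the global `ratingData`. The toy ratings and the parameters `min_share` and `top_n` are assumptions made for this example.

```python
import pandas as pd

def find_similar_movies(ratings, movie_id, min_share=0.1, top_n=10):
    """Hypothetical variant of findSimilarMovies that takes the ratings table as an argument."""
    # Users who rated the given movie 4.0 or higher
    similar_users = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] >= 4.0)]["userId"].unique()
    # Share of those users who rated each movie above 4
    liked = ratings[ratings["userId"].isin(similar_users) & (ratings["rating"] > 4)]["movieId"]
    similar = liked.value_counts() / len(similar_users)
    similar = similar[similar > min_share]

    # Share of users overall who rated the same candidate movies above 4
    everyone = ratings[ratings["movieId"].isin(similar.index) & (ratings["rating"] > 4)]
    overall = everyone["movieId"].value_counts() / everyone["userId"].nunique()

    out = pd.concat([similar, overall], axis=1)
    out.columns = ["similar", "all"]
    out["quality"] = out["similar"] / out["all"]
    return out.sort_values("quality", ascending=False).head(top_n)

# Hypothetical toy ratings: movie 30 is liked by everyone, movie 20 mostly by fans of movie 10
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 4, 5, 5, 6, 6],
    "movieId": [10, 20, 30, 10, 20, 30, 30, 30, 20, 30, 10, 30],
    "rating":  [5.0, 5.0, 5.0, 4.5, 5.0, 5.0, 5.0, 5.0, 5.0, 4.5, 2.0, 5.0],
})

result = find_similar_movies(ratings, 10)
print(result)
```

In this toy table movie 30 is liked by every user, so its quality is 1, while movie 20 is liked by both fans of movie 10 but only half of all users, so it ranks higher. As in the Jumanji output above, the searched movie itself comes out on top, so a caller may want to drop it from the list.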
