**MOVIE RECOMMENDATION ENGINE**

> **Recommending movies** using the **KNN algorithm** on a **dataset from Kaggle**



In [2]:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import CountVectorizer 

Now let's load the files.

>**links.csv** contains the **IMDb and TMDb IDs** of the movies

>**movies.csv** contains **movie_name** and **genre** of the movies

>**ratings.csv** contains the **ratings** given by users for the movies

>**tags.csv** contains the **free-text tags** that users applied to the movies

For the moment we will work with the **movies.csv** and **ratings.csv** files

>As **links.csv** contains only the IMDb and TMDb IDs, we **don't actually need** it for recommendation

>And **tags.csv** contains free-text tags written by users, which might be **too biased**, so for the moment we are **not using it**

Now **construct the dataframes** from the available data


In [3]:
df_links = pd.read_csv('links.csv')
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')
df_tags = pd.read_csv('tags.csv')
print("Done")

Done


In [5]:
df_links.head(5)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [6]:
df_movies.head(5)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
df_ratings.head(5)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [8]:
df_tags.head(5)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


* Now let's build a **combined DataFrame** from the two important dataframes, **movies** and **ratings**.

* We will use the **combined DataFrame** for the **rest of the process** of building the recommendations.

* After combining, we need to **remove the unimportant columns** (i.e. timestamp and, for now, genres).

In [9]:
df_combined = pd.merge(df_ratings, df_movies, on = 'movieId')
df_combined.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [10]:
new_df = df_combined.drop(['timestamp', 'genres'], axis = 1)
new_df.head(5)

Unnamed: 0,userId,movieId,rating,title
0,1,1,4.0,Toy Story (1995)
1,5,1,4.0,Toy Story (1995)
2,7,1,4.5,Toy Story (1995)
3,15,1,2.5,Toy Story (1995)
4,17,1,4.5,Toy Story (1995)


In [11]:
new_df.shape

(100836, 4)

**Explanation of the approach**

*   First we will make a **pivot table** with **users as the rows**, **movies as the columns** and **ratings as the values**, to get a better understanding of **which user has given which movie what rating**. This also gives a clear view of **which user has seen which movies**, and which movies each user has yet to watch.
*   Then we will prepare some lists based on that pivot table, i.e. the **list of movies seen by each user with the column indexes of those movies**, and the **list of movies not yet seen by each user with their column indexes**. These will help us in the later steps.
*   As a rating depends entirely on the individual user, to avoid that bias we will **not use the ratings as the pivot table values**. Instead we will use binary values, where **1 means the user has seen the movie** and **0 means the user hasn't seen the movie**.
*   Now each **column of the pivot table is a vector** (one entry per user: seen or not seen). Based on these vectors we will find the **cosine similarity of the movies** with each other (here we use scikit-learn's `NearestNeighbors` with the cosine metric). That way we can easily find the **K most similar movies (with their similarity values)** for each movie (here we assume K = 10), i.e. its K nearest neighbours.
*   Now comes the recommendation time. **For each user, we take all the movies they have seen and find the movies similar to each of them. We list those out and remove every movie the user has already watched. Then we sort the rest by similarity and finally recommend the movies with the highest similarity values.**
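The steps above can be sketched on a tiny, made-up watch matrix. The users, the movie names `A` to `D` and the `recommend` helper are all hypothetical, and plain NumPy stands in for the scikit-learn calls used later in the notebook, so treat this as an illustration of the idea rather than the notebook's actual pipeline.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are movies (1 = seen, 0 = not seen).
movies = ["A", "B", "C", "D"]
seen = np.array([
    [1, 1, 0, 0],   # user 0
    [1, 1, 1, 0],   # user 1
    [0, 0, 1, 1],   # user 2
])

# Each movie is a vector over users; scale every movie vector to unit length.
item_vectors = seen.T.astype(float)                       # shape: (movies, users)
norms = np.linalg.norm(item_vectors, axis=1, keepdims=True)
unit = item_vectors / norms

# Cosine similarity between every pair of movies.
similarity = unit @ unit.T

def recommend(user, top_n=2):
    """Rank unseen movies by their best similarity to any movie the user has seen."""
    seen_idx = np.flatnonzero(seen[user])
    unseen_idx = np.flatnonzero(seen[user] == 0)
    scores = {movies[j]: similarity[seen_idx, j].max() for j in unseen_idx}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

print(recommend(0))
```

User 0 has seen `A` and `B`, so `C` (which shares a viewer with both) ranks ahead of `D` (which shares none).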


In [12]:
see_whats_happening = new_df.pivot_table(values = 'rating', index = 'userId', columns = 'title')
see_whats_happening.head(5)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


In [15]:
see_whats_happening.shape

(610, 9719)

**So from the pivot table we can see:**
*   We have 610 different users and
*   We have 9719 different movie titles
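These two numbers also show how sparse the table is. The short calculation below uses only the figures reported in this notebook. It treats the 100836 rating rows as an upper bound on the number of filled cells, on the assumption that `pivot_table` may merge several ratings by one user for the same title into a single cell.

```python
# Figures reported earlier in this notebook.
n_ratings = 100836      # rows in new_df
n_users = 610           # rows of the pivot table
n_movies = 9719         # columns of the pivot table

n_cells = n_users * n_movies
density = n_ratings / n_cells   # upper bound on the share of filled cells

print(f"cells in the pivot table: {n_cells}")
print(f"filled cells: at most {density:.2%}")
```

So fewer than 2 in every 100 cells hold a rating, which is why almost every value in the preview above is NaN.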

In [None]:
#column positions, used to pair every title with its index in the pivot table
rows = list(range(len(see_whats_happening.columns)))
movies_by_user = {}
movies_not_rated_by_user = {}
movie_indexes = {}
movie_not_rated_indexes = {}
for i, row in see_whats_happening.iterrows():
  #(title, rating, column index) for every movie
  if_reviewed = list(zip(row.index, row.values, rows))
  #rated list: a rating is present when the value is not NaN
  if_rated = [(x,z) for x,y,z in if_reviewed if pd.notna(y)]
  rated_movies_index = [x[1] for x in if_rated]
  rated_movies = [x[0] for x in if_rated]
  movies_by_user[i] = rated_movies
  movie_indexes[i] = rated_movies_index
  #unrated list: the value is NaN
  if_not_rated = [(x,z) for x,y,z in if_reviewed if pd.isna(y)]
  unrated_movies_index = [x[1] for x in if_not_rated]
  unrated_movies = [x[0] for x in if_not_rated]
  movies_not_rated_by_user[i] = unrated_movies
  movie_not_rated_indexes[i] = unrated_movies_index
print('Done')
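A more compact way to build the same kind of lookups is to use a boolean `notna()` mask instead of zipping every row by hand. The sketch below runs on a hypothetical two-user table called `toy` (a stand-in for `see_whats_happening`), and the `toy_` dictionaries are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for see_whats_happening: users as rows, titles as columns.
toy = pd.DataFrame(
    {"A": [4.0, np.nan], "B": [np.nan, 3.5], "C": [5.0, 2.0]},
    index=pd.Index([1, 2], name="userId"),
)

rated_mask = toy.notna().to_numpy()      # True where the user rated the movie
titles = toy.columns.to_numpy()

toy_movies_by_user = {}
toy_movie_indexes = {}
for user, mask in zip(toy.index, rated_mask):
    toy_movie_indexes[user] = np.flatnonzero(mask).tolist()   # column positions
    toy_movies_by_user[user] = titles[mask].tolist()          # matching titles

print(toy_movies_by_user)
print(toy_movie_indexes)
```

The unrated lists follow the same pattern with `~mask` in place of `mask`.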

In [59]:
print(movies_by_user[1])
print(movie_indexes[1])

print(movies_not_rated_by_user[1])
print(movie_not_rated_indexes[1])

['13th Warrior, The (1999)', '20 Dates (1998)', 'Abyss, The (1989)', 'Adventures of Robin Hood, The (1938)', 'Alice in Wonderland (1951)', 'Alien (1979)', 'All Quiet on the Western Front (1930)', 'American Beauty (1999)', 'American History X (1998)', 'American Tail, An (1986)', 'Apocalypse Now (1979)', 'Austin Powers: International Man of Mystery (1997)', 'Back to the Future (1985)', 'Back to the Future Part III (1990)', 'Bambi (1942)', 'Basic Instinct (1992)', 'Batman (1989)', 'Batman Returns (1992)', 'Bedknobs and Broomsticks (1971)', 'Beetlejuice (1988)', 'Being John Malkovich (1999)', 'Best Men (1997)', 'Big (1988)', 'Big Lebowski, The (1998)', 'Big Trouble in Little China (1986)', 'Billy Madison (1995)', 'Black Cauldron, The (1985)', 'Blazing Saddles (1974)', 'Blown Away (1994)', 'Blues Brothers, The (1980)', 'Bottle Rocket (1996)', 'Braveheart (1995)', 'Canadian Bacon (1995)', "Charlotte's Web (1973)", 'Citizen Kane (1941)', 'Clear and Present Danger (1994)', 'Clerks (1994)', 'Cl

In [68]:
#filling the NaN with 0's
pivot_table = see_whats_happening.fillna(0)
#making the table a binary one (only 0/1): every rating is positive, so its sign is 1
pivot_table = pivot_table.apply(np.sign)
#transposed view (movies as rows); a cell only displays its last expression, so this line shows nothing
pivot_table.T.head(5)
pivot_table.head(5)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...All the Marbles (1981),...And Justice for All (1979),00 Schneider - Jagd auf Nihil Baxter (1994),1-900 (06) (1994),10 (1979),10 Cent Pistol (2015),10 Cloverfield Lane (2016),10 Items or Less (2006),10 Things I Hate About You (1999),10 Years (2011),"10,000 BC (2008)",100 Girls (2000),100 Streets (2016),101 Dalmatians (1996),101 Dalmatians (One Hundred and One Dalmatians) (1961),101 Dalmatians II: Patch's London Adventure (2003),101 Reykjavik (101 Reykjavík) (2000),102 Dalmatians (2000),10th & Wolf (2006),"10th Kingdom, The (2000)","10th Victim, The (La decima vittima) (1965)","11'09""01 - September 11 (2002)",11:14 (2003),"11th Hour, The (2007)",12 Angry Men (1957),12 Angry Men (1997),12 Chairs (1971),12 Chairs (1976),12 Rounds (2009),12 Years a Slave (2013),...,Zathura (2005),Zatoichi and the Chest of Gold (Zatôichi senryô-kubi) (Zatôichi 6) (1964),Zazie dans le métro (1960),Zebraman (2004),"Zed & Two Noughts, A (1985)",Zeitgeist: Addendum (2008),Zeitgeist: Moving Forward (2011),Zeitgeist: The Movie (2007),Zelary (2003),Zelig (1983),Zero Dark Thirty (2012),Zero Effect (1998),"Zero Theorem, The (2013)",Zero de conduite (Zero for Conduct) (Zéro de conduite: Jeunes diables au collège) (1933),Zeus and Roxanne (1997),Zipper (2015),Zodiac (2007),Zombeavers (2014),Zombie (a.k.a. Zombie 2: The Dead Are Among Us) (Zombi 2) (1979),Zombie Strippers! (2008),Zombieland (2009),Zone 39 (1997),"Zone, The (La Zona) (2007)",Zookeeper (2011),Zoolander (2001),Zoolander 2 (2016),Zoom (2006),Zoom (2015),Zootopia (2016),Zulu (1964),Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! 
(1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [77]:
#for us let k be 10, for K-Nearest Neighbours
k = 10
#we will use the cosine metric, which returns distances (1 - cosine similarity)
cosine_sim = NearestNeighbors(n_neighbors = k, algorithm = 'brute', metric = 'cosine')
#now let's fit our data (movies as rows, users as columns)
item_sim = cosine_sim.fit(pivot_table.T.values)
#now the neighbours; each movie is also returned as one of its own neighbours at distance 0
item_distances, item_indices = item_sim.kneighbors(pivot_table.T.values)

In [78]:
print(item_indices[1])
print(item_distances[1])

[7888    1 5773 3913    2  362   11 8735 9596 7362]
[0.         0.         0.         0.29289322 0.29289322 0.29289322
 0.42264973 0.5        0.5        0.5       ]
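The distances printed above take only a few distinct values, which is what binary watch vectors tend to produce. One plausible reading, assuming this movie was seen by a single user (not verified against the dataset), is worked out below with plain NumPy. The vectors and the `cosine_distance` helper are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical watch vectors over four users (1 = seen).
one_viewer    = np.array([1.0, 0.0, 0.0, 0.0])
two_viewers   = np.array([1.0, 1.0, 0.0, 0.0])
three_viewers = np.array([1.0, 1.0, 1.0, 0.0])
four_viewers  = np.array([1.0, 1.0, 1.0, 1.0])

d_same  = cosine_distance(one_viewer, one_viewer)      # identical vectors
d_two   = cosine_distance(one_viewer, two_viewers)     # 1 - 1/sqrt(2)
d_three = cosine_distance(one_viewer, three_viewers)   # 1 - 1/sqrt(3)
d_four  = cosine_distance(one_viewer, four_viewers)    # 1 - 1/2

for d in (d_same, d_two, d_three, d_four):
    print(round(d, 8))
```

Under that assumption, the three zero distances would be movies watched by exactly the same user, which would also explain why index 1 is not listed first: ties at distance 0 can come back in any order.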


In [81]:
#now let's build a dictionary mapping each movie to its nearest neighbours (movie titles)
sim_movies = {}
for i in range(len(pivot_table.T.index)):
  movie_names = pivot_table.T.index[item_indices[i]].tolist()
  sim_movies[pivot_table.T.index[i]] = movie_names

In [87]:
recommendations = {}
for user,seen_movies in movie_indexes.items():
  #set of seen column indexes, for fast membership checks
  seen_set = set(seen_movies)
  #flatten the neighbour indexes and distances of every movie the user has seen
  neighbour_ind = [j for i in item_indices[seen_movies] for j in i]
  neighbour_dis = [j for i in item_distances[seen_movies] for j in i]
  #now combine the indexes and the distances
  neighbour_info = list(zip(neighbour_ind, neighbour_dis))
  #keep only the movies the user has not seen (a repeated index keeps its last distance)
  not_seen_neighbour = {i:d for i,d in neighbour_info if i not in seen_set}
  #now sort based on distances (nearest first)
  nearest_neighbours = sorted(not_seen_neighbour.items(), key = lambda x:x[1])
  #converting the indexes into movie names
  nearest_unseen_movies = [(pivot_table.columns[i], distance) for i, distance in nearest_neighbours]
  #now add that to the final recommendations dictionary
  recommendations[user] = nearest_unseen_movies
  if user == 1:
    print(recommendations[user])

[('Heart Condition (1990)', 0.18350341907227397), ('Deuces Wild (2002)', 0.18350341907227397), ('Who Is Cletis Tout? (2001)', 0.18350341907227397), ('Rare Birds (2001)', 0.18350341907227397), ("Jesus' Son (1999)", 0.18350341907227397), ('Liberty Heights (1999)', 0.18350341907227397), ('Get Real (1998)', 0.18350341907227397), ('Dark Blue World (Tmavomodrý svet) (2001)', 0.18350341907227397), ('About Adam (2000)', 0.18350341907227397), ('Presidio, The (1988)', 0.2857142857142858), ('Hardball (2001)', 0.2928932188134523), ('Gunga Din (1939)', 0.2928932188134523), ('West Beirut (West Beyrouth) (1998)', 0.29289321881345254), ('Golden Bowl, The (2000)', 0.29289321881345254), ("On Her Majesty's Secret Service (1969)", 0.30689671991632794), ('Pretty Woman (1990)', 0.31146962734090355), ('Cliffhanger (1993)', 0.3169138702432417), ('Terminator 2: Judgment Day (1991)', 0.32282902938325886), ('Lord of the Rings: The Fellowship of the Ring, The (2001)', 0.3308173701821038), ('For Your Eyes Only (19
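One detail of the loop above: when an unseen movie is a neighbour of several seen movies, the dict comprehension keeps whichever distance happens to come last, not the smallest one. The sketch below shows the difference on hypothetical `(index, distance)` pairs and offers keeping the minimum as one possible alternative. It is not what the notebook does, and it could change the ranking shown above.

```python
# Hypothetical (movie_index, distance) pairs gathered from several seen movies.
neighbour_info = [(7, 0.40), (3, 0.10), (7, 0.25), (5, 0.30), (3, 0.55)]
seen_movies = {5}   # indexes the user has already watched

# Behaviour of the dict comprehension: the last distance wins for a repeated index.
last_wins = {i: d for i, d in neighbour_info if i not in seen_movies}

# Alternative: keep the smallest distance found for each index.
best = {}
for i, d in neighbour_info:
    if i in seen_movies:
        continue
    if i not in best or d < best[i]:
        best[i] = d

ranked = sorted(best.items(), key=lambda kv: kv[1])
print(last_wins)
print(ranked)
```

Here movie 3 would be ranked with distance 0.55 by the comprehension but with 0.10 when the minimum is kept.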

In [100]:
#now the time to recommend someone :)
def recommend_movie_to(user, no_of_movies):
  #userIds are the keys of the recommendations dictionary, so check membership directly
  if user not in recommendations:
    print("The userId you entered is invalid, please try again!!")
  else:
    print("Let's see which movies you have watched so far >>>\n")
    print("\n".join(movies_by_user[user]))
    print("\n\n")
    print("Ohk!! Now you should definitely watch these movies >>>\n")
    #similarity = 1 - cosine distance
    for item in recommendations[user][:no_of_movies]:
      print("{} with a similarity value {:.4f}".format(item[0], 1-item[1]))

In [105]:
#let's bring in some requests
recommend_movie_to(546, 20)

Let's see which movies you have watched so far >>>

101 Dalmatians (One Hundred and One Dalmatians) (1961)
8 ½ Women (a.k.a. 8 1/2 Women) (a.k.a. Eight and a Half Women) (1999)
Airplane II: The Sequel (1982)
Airplane! (1980)
American Beauty (1999)
American Psycho (2000)
Amityville Horror, The (1979)
Back to the Future (1985)
Battle for the Planet of the Apes (1973)
Beneath the Planet of the Apes (1970)
Big Kahuna, The (2000)
Billy Jack (1971)
Billy Jack Goes to Washington (1977)
Boogie Nights (1997)
Casper (1995)
Conquest of the Planet of the Apes (1972)
Diamonds Are Forever (1971)
Eagle Has Landed, The (1976)
Erin Brockovich (2000)
Escape from the Planet of the Apes (1971)
Fargo (1996)
Femme Nikita, La (Nikita) (1990)
Final Destination (2000)
Fly II, The (1989)
Fly, The (1986)
Friday the 13th Part 3: 3D (1982)
Friday the 13th Part IV: The Final Chapter (1984)
From Dusk Till Dawn (1996)
Ghost Dog: The Way of the Samurai (1999)
Halloween 5: The Revenge of Michael Myers (1989)
Happy Gilmo