# Imports

In [7]:
import os
import pandas as pd
import numpy as np
import math
import ast
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import pairwise_distances

# Content Based Recommender System

## Data Process

- Edit genres and keywords to be one string with relevant genres listed.
- Fill NaN votes with 0
- Define characters as top 10 members of cast and only use director from crew members

In [21]:
Data = pd.read_csv(r"archive\movies_metadata.csv")
Data["genres"] = Data["genres"].apply(lambda x: " ".join([List["name"] for List in ast.literal_eval(x)]))
Data["vote_count"] = Data["vote_count"].fillna(0) #Fill na's for movies with no votes

Links = pd.read_csv(r"archive\links_small.csv")

KeyWords = pd.read_csv(r"archive\keywords.csv")
KeyWords["keywords"] = KeyWords["keywords"].apply(lambda x: " ".join([List["name"] for List in ast.literal_eval(x)]))

Data = pd.read_csv(r"archive\movies_metadata.csv")
Data = Data.drop([19730, 29503, 35587])

Credits = pd.read_csv(r"archive\credits.csv")
Credits["characters"] = Credits["cast"].head().apply(lambda x: " ".join([List["character"] for List in ast.literal_eval(x)[0:10]]))
Credits["director"] = Credits["crew"].apply(lambda x: " ".join([List["name"] for List in ast.literal_eval(x)[0:10] if List["job"] == "Director"]))

Data["tmdbId"] = Data["id"].astype("int")
Data["id"] = Data["id"].astype("int")
Data = pd.merge(Data, Links, on='tmdbId')
Data = pd.merge(Data, KeyWords, on="id")
Data = pd.merge(Data, Credits, on ="id")

Ratings = pd.read_csv(r"archive\ratings_small.csv")

  interactivity=interactivity, compiler=compiler, result=result)


# Collaborative Filtering

## Content Based Filtering

Next steps are to build a user based recommender system, try not to use the kaggle kernel so much with this one, I want to be
think about the way Andrew NG has explained it along with other online documents.

- Pick one user and run the algo for, want to think about a user in N dimensions where N is the number of movies they've rated

User-based: For a user U, with a set of similar users determined based on rating vectors consisting of given item ratings
, the rating for an item I, which hasn’t been rated, is found by picking out N users from the similarity list who have 
rated the item I and calculating the rating based on these N ratings.

Item-based: For an item I, with a set of similar items determined based on rating vectors consisting of received user ratings
, the rating by a user U, who hasn’t rated it, is found by picking out N items from the similarity list that have been rated 
by U and calculating the rating based on these N ratings.

I will first do a simplified explanation which I used to learn the inner workings of this technique, I will then define a vectorised function that will execute the actual algorithm

We first create a User/Movie/Rating matrix with users as columns, movies as rows and a user movie combination is that users rating of the movie.

We then choose a user that we want to recommend a movie to, in this case we will choose userId 4. When you have your user, you will then calculate a similarity score for each other user, based on the movies they have both rated.

The similarity can be either a Euclidian distance, which is a measure of distance between vectors or a cosine similarity, which is a measure of the angular separation between two vectors.

In [6]:
#Matrix of all user movie ratings
UserMovies = pd.pivot_table(Ratings, columns="userId", values="rating", index="movieId")

#Similarity between user 4 and 2
User4Idx = 4
User2Idx = 2

#Get movies rated by both
Temp = UserMovies[(UserMovies[4].isna() == False) & (UserMovies[2].isna() == False)].T # Movies rated by both users

# Get ratings of movies rated by user 4 and user 2
User4 = np.array(Temp.loc[4].dropna())
User2 = np.array(Temp.loc[2].dropna())

# Calculate taste similarity
Similarity = np.sqrt(((User4 - User2)**2).sum()) # User similarity by euclidian distance
Similarity

5.291502622129181

Once this has been done, you can choose the n most similar users or the top x% of similar users to choose movies to recommend to the user in question. For simplicity of explanation, we will choose random users defined below as the "most similar".

For each of these similar users, we calculate the mean value of their ratings, this gives an impression of what each user thinks the "average movie" is, and allows us to determine how above/below average the user thinks each movie is, through a technique called mean normalisation.

Once this has been done, we will limit the selection criteria to be only movies that have been rated by at least half of our most similar users, this prevents us from suggesting top rated movies that have only been rated by one similar user.

We then calculate the average mean normalised rating for each movie and suggest the top ten most "above average" movies to user 4.

In [1038]:
# Pick random similar users
Users = [User2Idx, 10, 11, 21, 33]
Temp = UserMovies
UserRatings = {}

#Get average movies rating for each of these users (all movies)
for User in Users:
    UserRatings[User] = UserMovies[User].mean() 

    
#Get all movies not rated by user 4 but rated by at least half of the others 
Temp = Temp[Temp[User4Idx].isna() == True][Users].dropna(thresh=np.floor(len(Users) / 2)) 

#Adjust each users ratings by that users average rating, what to distinguish what "like" actually means for each user
for User in Idxs:
    Temp[User] = Temp[User] - UserRatings[User]

#Find averages for each movie
RecommendedMovies = Temp.mean(axis=1).sort_values(ascending=False).head(5).index.values

#Sort movies by how above average they are
Temp = Data[Data["movieId"].isin(RecommendedMovies)]
Temp["sorter"] = dict(zip(RecommendedMovies, range(len(RecommendedMovies))))

#Recommend
print("Recommended for you")
Temp.sort_values("sorter")[["title", "vote_count", "vote_average", "release_date"]]

Recommended for you


Unnamed: 0,title,vote_count,vote_average,release_date
494,The Nightmare Before Christmas,2135.0,7.6,1993-10-09
472,Schindler's List,4436.0,8.3,1993-11-29
2364,Hairspray,102.0,6.6,1988-02-16
48,The Usual Suspects,3334.0,8.1,1995-07-19
755,Citizen Kane,1244.0,8.0,1941-04-30


## Vectorised approach

In [8]:
#Matrix of all user movie ratings
UserMovies = pd.pivot_table(Ratings, columns="userId", values="rating", index="movieId")

In [10]:
UserMovies.head()

userId,1,2,3,4,5,6,7,8,9,10,...,662,663,664,665,666,667,668,669,670,671
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,3.0,,4.0,,...,,4.0,3.5,,,,,,4.0,5.0
2,,,,,,,,,,,...,5.0,,,3.0,,,,,,
3,,,,,4.0,,,,,,...,,,,3.0,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,3.0,,,,,,


In [11]:
Ratings.head()    

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [22]:
UserIdx = 10
Similarities = {}
UserRatings = {}

Temp = UserMovies.copy()

for User in Temp.columns.values:
    Sim = UserMovies[(UserMovies[UserIdx].isna() == False) & (UserMovies[User].isna() == False)][[UserIdx, User]].T
    if Sim.shape[1] > 3:
        Similarities[User] = pairwise_distances(Sim, metric="cosine")[0][1]
    else:
        Similarities[User] = np.nan
Similarities = pd.DataFrame.from_dict(Similarities, orient='index', columns=["Similarities"]).dropna().sort_values("Similarities", ascending=False)

Users = Similarities.head(5).index.values

#Get all movies not rated by user 4 but rated by at least half of the others 
Temp = Temp[Temp[UserIdx].isna() == True][Users].dropna(thresh=np.floor(len(Users) / 2)) 

#Get average movies rating for each of these users (all movies)
for User in Users:
    UserRatings[User] = UserMovies[User].mean() 

#Adjust each users ratings by that users average rating, what to distinguish what "like" actually means for each user
for User in Users:
    Temp[User] = Temp[User] - UserRatings[User]
    
#Find averages for each movie
RecommendedMovies = Temp.mean(axis=1).sort_values(ascending=False).head(5).index.values

#Sort movies by how above average they are
Temp = Data[Data["movieId"].isin(RecommendedMovies)]
Temp["sorter"] = dict(zip(RecommendedMovies, range(len(RecommendedMovies))))

#Recommend
print("Recommended for you")
Temp.sort_values("sorter")[["title", "vote_count", "vote_average", "release_date"]]

Recommended for you


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,title,vote_count,vote_average,release_date
2186,Eyes Wide Shut,1266.0,7.1,1999-07-14
266,Pulp Fiction,8670.0,8.3,1994-09-10
7139,Slumdog Millionaire,2918.0,7.6,2008-05-12
979,A Clockwork Orange,3432.0,8.0,1971-12-18
3044,Blow-Up,227.0,7.3,1966-12-18


In [36]:
def YouMightAlsoLike(Ratings, User):
    
    UserMovies = pd.pivot_table(Ratings, columns="userId", values="rating", index="movieId")
    
    #Empty dicts to append Similarities and Ratings to
    Similarities = {}
    UserRatings = {}
    Temp = UserMovies.copy()
    Movies = Data.copy()
    
    #For each other user, if both have rated at least 3 of the same, append similarity
    for User in Temp.columns.values:
        Sim = UserMovies[(UserMovies[User].isna() == False) & (UserMovies[User].isna() == False)][[User, User]].T
        if Sim.shape[1] > 3:
            Similarities[User] = pairwise_distances(Sim, metric="cosine")[0][1]
        else:
            Similarities[User] = np.nan
            
    Similarities = pd.DataFrame.from_dict(Similarities, orient='index', columns=["Similarities"]).dropna().sort_values("Similarities", ascending=False)
    
    #Get top 5 similar users
    Users = Similarities.head(20).index.values

    #Get all movies not rated by user but rated by at least half of the others 
    Temp = Temp[Temp[User].isna() == True][Users].dropna(thresh=np.floor(len(Users) / 2)) 

    #Get average movies rating for each of these users (all movies)
    for User in Users:
        UserRatings[User] = UserMovies[User].mean() 

    #Adjust each users ratings by that users average rating, what to distinguish what "like" actually means for each user
    for User in Users:
        Temp[User] = Temp[User] - UserRatings[User]

    #Find averages for each movie
    RecommendedMovies = Temp.mean(axis=1).sort_values(ascending=False).head(5).index.values

    #Sort movies by how above average they are
    Temp = Movies[Movies["movieId"].isin(RecommendedMovies)]
    Temp["sorter"] = dict(zip(RecommendedMovies, range(len(RecommendedMovies))))

    #Recommend
    print("Recommended for you")
    
    return Temp.sort_values("sorter")[["title", "vote_count", "vote_average", "release_date"]]

In [37]:
YouMightAlsoLike(Ratings, 2)

Recommended for you


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,title,vote_count,vote_average,release_date
266,Pulp Fiction,8670.0,8.3,1994-09-10
525,The Silence of the Lambs,4549.0,8.1,1991-02-01
321,Forrest Gump,8147.0,8.2,1994-07-06
472,Schindler's List,4436.0,8.3,1993-11-29


We will now define two more functions that enable the user to rate movies as User 1000, we will then use those ratings to recommend movies to the user as per the above algorithm.

In [38]:
def Rate(User, Movie, Rating, Ratings):
    Index = Data[Data["title"] == Movie]["movieId"].values[0]
    Ratings.loc[-1] = [User, Index, Rating, np.nan]
    Ratings.index = Ratings.index + 1  # shifting index
    Ratings = Ratings.sort_index()  # sorting by index
    return Ratings

def Clear(Ratings):
    Ratings = Ratings[Ratings["userId"] != 1000]
    return Ratings

In [39]:
Data[Data["title"].str.contains("Godfather")]["title"]

699               The Godfather
994      The Godfather: Part II
1602    The Godfather: Part III
5523           Tokyo Godfathers
Name: title, dtype: object

In [40]:
Ratings = Rate(1000, "The Godfather", 4, Ratings)
Ratings = Rate(1000, "The Dark Knight", 5, Ratings)
Ratings = Rate(1000, "American Psycho", 5, Ratings)
Ratings = Rate(1000, "Inception", 5, Ratings)

YouMightAlsoLike(Ratings, 1000)

Recommended for you


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,title,vote_count,vote_average,release_date
266,Pulp Fiction,8670.0,8.3,1994-09-10
525,The Silence of the Lambs,4549.0,8.1,1991-02-01
321,Forrest Gump,8147.0,8.2,1994-07-06
472,Schindler's List,4436.0,8.3,1993-11-29


In [34]:
Ratings = Clear(Ratings)