# RECOMMENDATION ENGINE:

### Notebook Contents:

- Content-Based Filtering (and Popularity Based Recommendation) using TMDB 5000 Dataset
- Collaborative Filtering using MovieLens Dataset
- Evaluation

In [2]:
import pandas as pd
import numpy as np

# 1. Content-Based Filtering

### - Importing the Datasets.

In [3]:
movies=pd.read_csv('dataset/tmdb_5000_movies.csv')
credits=pd.read_csv('dataset/tmdb_5000_credits.csv')
links=pd.read_csv('dataset/links.csv')

### - Preprocessing the Imported Data

Merging tmdb_5000_movies.csv and tmdb_5000_credits.csv on the title column, then merging the result with links.csv on the id column.

In [4]:
movies=movies.merge(credits, on='title')
movies=movies.merge(links, on='id')

We keep the relevant columns and remove the rest.

In [5]:
movies=movies[['id','title','overview','genres','keywords','cast','crew','popularity','vote_average','movieId']]

Converting genres and keywords fields to list format to make processing easier.

In [6]:
import ast

In [7]:
def convert1(item):
    arr=[]
    for i in ast.literal_eval(item):
        arr.append(i['name'])
    return arr

movies['genres']=movies['genres'].apply(convert1)
movies['keywords']=movies['keywords'].apply(convert1)
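
To make the conversion concrete, here is a minimal sketch of what convert1 does to a single cell. The sample string is hypothetical, written in the same list-of-dicts shape as the genres column, and `names_from_field` is an illustrative name that is not part of the notebook.

```python
import ast

# Hypothetical sample in the list-of-dicts shape of the 'genres' column.
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

def names_from_field(item):
    # literal_eval turns the string into a real list of dicts;
    # we keep only each dict's 'name' value.
    return [entry['name'] for entry in ast.literal_eval(item)]

genre_names = names_from_field(raw)
print(genre_names)
```

The list comprehension is equivalent to the explicit loop in convert1; it is only a shorter way to write the same thing.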

Converting the cast field to list format and keeping only the names of the first 5 cast members listed, to make processing easier.

In [8]:
def convert2(item):
    arr=[]
    count=0
    for i in ast.literal_eval(item):
        if count!=5:
            arr.append(i['name'])
            count+=1
        else:
            break
    return arr

movies['cast']=movies['cast'].apply(convert2)

Extracting the director's name from the crew field and discarding the rest of the crew information.

In [9]:
def convert3(obj):
    L=[]
    for i in ast.literal_eval(obj):
        if i['job']=='Director':
            L.append(i['name'])
            break
    return L

movies['crew']=movies['crew'].apply(convert3)
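
The two helpers above can be sketched on toy data as follows. The cast and crew strings are made up for illustration, and `top_cast` and `first_director` are hypothetical names, not functions from the notebook.

```python
import ast

# Hypothetical cast and crew strings in the list-of-dicts shape.
raw_cast = str([{'name': 'Actor %d' % n, 'order': n} for n in range(8)])
raw_crew = str([
    {'job': 'Producer', 'name': 'Some Producer'},
    {'job': 'Director', 'name': 'Some Director'},
    {'job': 'Director', 'name': 'Second Director'},
])

def top_cast(item, limit=5):
    # Slicing keeps the first `limit` entries, like the counter in convert2.
    return [entry['name'] for entry in ast.literal_eval(item)[:limit]]

def first_director(item):
    # Returns a one-element list (or an empty list), like convert3.
    for entry in ast.literal_eval(item):
        if entry['job'] == 'Director':
            return [entry['name']]
    return []

cast_names = top_cast(raw_cast)
director = first_director(raw_crew)
print(cast_names, director)
```

Both helpers return lists, so their results can later be concatenated with the other list-valued fields.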

Removing whitespace inside each entry of the keywords, cast and crew fields, so that a multi-word name is treated as a single token.

In [10]:
movies['keywords']=movies['keywords'].apply(lambda x: [i.replace(" ","") for i in x])
movies['cast']=movies['cast'].apply(lambda x: [i.replace(" ","") for i in x])
movies['crew']=movies['crew'].apply(lambda x: [i.replace(" ","") for i in x])
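
A small sketch of why the spaces are removed: once the fields are joined into one string and split into words, two different people who share a first name would otherwise share a token. The names below are invented for illustration, and `collapse_names` is a hypothetical helper.

```python
# Two invented cast members who share a first name.
cast = ['Sam Smith', 'Sam Jones']

def collapse_names(names):
    # Same operation as the lambda above: drop the spaces inside each entry.
    return [name.replace(' ', '') for name in names]

# Tokens produced when the names are joined and split on whitespace.
with_spaces = ' '.join(cast).lower().split()
without_spaces = ' '.join(collapse_names(cast)).lower().split()

print(with_spaces)
print(without_spaces)
```

With spaces kept, the token `sam` appears twice and would count as overlap between unrelated people; with spaces removed, each person maps to one distinct token.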

Splitting the string in overview field to a list of strings.

In [11]:
movies['overview']=movies['overview'].apply(lambda x: str(x).split())

Combining the overview, genres, keywords, cast and crew fields into a single tags field.

In [12]:
movies['tags']=movies['overview']+movies['genres']+movies['keywords']+movies['cast']+movies['crew']
new_df=movies[['id','title','tags','popularity','vote_average','movieId']].copy()  # explicit copy avoids pandas' SettingWithCopyWarning

Joining the list of strings in the tags field into a single string that can be used for model training.

In [13]:
new_df['tags']=new_df['tags'].apply(lambda x: " ".join(x))
new_df['tags']=new_df['tags'].apply(lambda x: x.lower())


### - Model Building

Text Vectorisation - Before vectorising, we use the Natural Language Toolkit (NLTK) to reduce every word to its stem, so that inflected forms of the same word (for example "loved" and "loving") are treated by the model as one token.

In [14]:
import nltk
from nltk.stem.porter import PorterStemmer

ps=PorterStemmer()

def stem(text):
    y=[]
    for i in text.split():
        y.append(ps.stem(i))
    return " ".join(y)

new_df['tags']=new_df['tags'].apply(stem)


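We use the CountVectorizer class from sklearn to turn the tags into count vectors, keeping the 5000 most frequent terms and dropping English stop words. We then build a similarity matrix between movies, using cosine_similarity as the similarity measure.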

In [15]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cv=CountVectorizer(max_features=5000, stop_words='english')
vectors=cv.fit_transform(new_df['tags']).toarray()

movie_similarity=cosine_similarity(vectors)
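
The idea behind this step can be sketched in plain Python. This is a simplified illustration, not what sklearn does internally: it skips stop-word removal and the `max_features` cap, and the three tag strings are invented. `count_vector` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter

def count_vector(text, vocabulary):
    # One count per vocabulary word, in vocabulary order.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented, already-stemmed tag strings for three movies.
tags = [
    'space alien battl action',
    'space alien rescu action',
    'romanc pari love',
]
vocabulary = sorted({word for doc in tags for word in doc.split()})
vectors = [count_vector(doc, vocabulary) for doc in tags]
similarity = [[cosine(u, v) for v in vectors] for u in vectors]

print(similarity[0])
```

The first two movies share three of their four words, so their similarity is high, while the third shares none and scores 0. Each movie has similarity 1 with itself.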

Creating a function recommend which takes a movie title as input and prints the titles of the 5 movies whose vectors are closest to it. This will be our blueprint for all recommend functions in the future.

In [16]:
def recommend(movie):
    movie_index=(new_df[new_df['title']==movie].index[0])
    distances=movie_similarity[movie_index]
    movies_list=sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:6]

    for i in movies_list:
        print(new_df.iloc[i[0]].title)
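
The ranking step inside recommend can be looked at in isolation. The sketch below uses an invented similarity row and a hypothetical helper `top_similar`. Unlike the `[1:6]` slice above, which assumes the movie itself always ranks first, this variant excludes the movie by its index.

```python
def top_similar(similarity_row, own_index, k=5):
    # Pair each score with its position, then sort by score, highest first.
    ranked = sorted(enumerate(similarity_row), key=lambda pair: pair[1], reverse=True)
    # Drop the movie itself and keep the k best remaining positions.
    return [index for index, _ in ranked if index != own_index][:k]

# Invented similarity scores of movie 0 against seven movies (itself included).
row = [1.0, 0.2, 0.9, 0.0, 0.5, 0.7, 0.3]
neighbours = top_similar(row, own_index=0, k=3)
print(neighbours)
```

Excluding by index is a design choice for the sketch: it still behaves as expected if another movie happens to have an identical tag vector and ties with the query movie at similarity 1.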

Using pickle to create pickle files and storing our created Similarity Matrix. Our application will be able to derive closest vectors during recommendation from these pickle files.

In [17]:
import pickle

pickle.dump(new_df.to_dict(),open('movies.pkl','wb'))
pickle.dump(movie_similarity,open('recommend_1.pkl','wb'))
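
For reference, a minimal round trip shows how an application could read such a file back. The file name and the small matrix are placeholders for this sketch. It uses `with` blocks so that the file handles are closed as soon as the dump or load finishes.

```python
import os
import pickle
import tempfile

# Small stand-in for the real similarity matrix.
similarity = [[1.0, 0.75], [0.75, 1.0]]

# Placeholder path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'similarity_demo.pkl')

with open(path, 'wb') as handle:
    pickle.dump(similarity, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == similarity)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.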

# 2. Collaborative Filtering

Importing the user ratings dataset.

In [18]:
ratings = pd.read_csv('dataset/ratings.csv')

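Splitting the dataset into training data (70%) and testing data (30%), and building the user-item rating matrix from the training data. Movies that a user has not rated are filled with 0.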

In [19]:
from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(ratings, test_size = 0.30, random_state = 42)
data = x_train.pivot(index = 'userId', columns = 'movieId', values = 'rating').fillna(0)
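
The pivot step can be seen on a tiny example. The ratings below are invented; only the column names follow the ratings table used above.

```python
import pandas as pd

# Invented ratings in long format: one row per (user, movie) pair.
ratings_long = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'movieId': [10, 20, 10, 30],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; missing pairs become 0.
user_item = ratings_long.pivot(index='userId', columns='movieId', values='rating').fillna(0)
print(user_item)
```

Note that filling with 0 makes "not rated" look like a numeric rating of 0 to the similarity computation; that is the convention this notebook uses.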

Making a copy of the train and test datasets.

These dummy datasets record whether a user has rated a movie. For prediction (dummy_train), movies not rated by the user are marked as 1 and rated movies as 0. For evaluation (dummy_test), movies rated by the user are marked as 1 and unrated movies as 0.

In [20]:
dummy_train = x_train.copy()
dummy_test = x_test.copy()

dummy_train['rating'] = dummy_train['rating'].apply(lambda x: 0 if x > 0 else 1)
dummy_test['rating'] = dummy_test['rating'].apply(lambda x: 1 if x > 0 else 0)

In [21]:
dummy_train = dummy_train.pivot(index = 'userId', columns = 'movieId', values = 'rating').fillna(1)
dummy_test = dummy_test.pivot(index ='userId', columns = 'movieId', values = 'rating').fillna(0)

Building the User Similarity Matrix, using cosine similarity as the similarity measure between users.

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

user_similarity = cosine_similarity(data)
user_similarity[np.isnan(user_similarity)] = 0

We do not want to recommend a movie that the user has already watched. Multiplying the predicted ratings by the dummy training matrix sets the score of every movie the user has already rated to 0.

In [23]:
user_predicted_ratings = np.dot(user_similarity, data)
user_final_ratings = np.multiply(user_predicted_ratings, dummy_train)
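
The whole prediction step can be traced on a toy matrix. This is a sketch with invented ratings for three users and three movies; cosine similarity is computed directly with NumPy instead of sklearn, and the variable names are chosen for the example.

```python
import numpy as np

# Invented user-item matrix: rows are users, columns are movies, 0 = not rated.
ratings = np.array([
    [5.0, 0.0, 3.0],
    [4.0, 2.0, 0.0],
    [0.0, 1.0, 4.0],
])

# Cosine similarity between users: normalise each row, then take dot products.
norms = np.linalg.norm(ratings, axis=1, keepdims=True)
normalised = ratings / norms
user_sim = normalised @ normalised.T

# Similarity-weighted sum of every user's ratings for each movie.
predicted = user_sim @ ratings

# Mask: 1 where the user has NOT rated the movie, 0 where they have.
unrated_mask = (ratings == 0).astype(float)
final_scores = predicted * unrated_mask

print(final_scores)
```

The resulting scores are weighted sums rather than ratings on the original scale, so they are useful for ranking unseen movies per user but are not directly comparable to a star rating.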

Using pickle to store the final user ratings matrix. Our application will be able to derive recommendations for a user from this pickle file.

In [24]:
pickle.dump(user_final_ratings,open('recommend_2.pkl','wb'))

# 3. Evaluation

For evaluation, we only consider the movies that each user has actually rated in the test set.

In [25]:
test_user_features = x_test.pivot(index = 'userId', columns = 'movieId', values = 'rating').fillna(0)
test_user_similarity = cosine_similarity(test_user_features)
test_user_similarity[np.isnan(test_user_similarity)] = 0

[[1.         0.         0.07126637 ... 0.0749648  0.         0.02105064]
 [0.         1.         0.         ... 0.02631254 0.         0.04691426]
 [0.07126637 0.         1.         ... 0.         0.         0.        ]
 ...
 [0.0749648  0.02631254 0.         ... 1.         0.06079015 0.12466251]
 [0.         0.         0.         ... 0.06079015 1.         0.02233952]
 [0.02105064 0.04691426 0.         ... 0.12466251 0.02233952 1.        ]]
- - - - - - - - - - 
(610, 610)


In [26]:
user_predicted_ratings_test = np.dot(test_user_similarity, test_user_features)
test_user_final_rating = np.multiply(user_predicted_ratings_test, dummy_test)

array([[ 8.01521825,  3.22701218,  1.71422693, ...,  0.04154912,
         0.        ,  0.        ],
       [ 1.64920152,  0.91304857,  0.02113666, ...,  0.        ,
         0.        ,  0.        ],
       [ 1.07587801,  0.07241296,  0.1867716 , ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [17.86102484, 10.1363879 ,  4.48304633, ...,  0.0274908 ,
         0.        ,  0.        ],
       [ 3.10351661,  2.6934212 ,  1.20357903, ...,  0.        ,
         0.        ,  0.        ],
       [12.36110509,  5.79632466,  1.96280959, ...,  0.        ,
         0.20526264,  0.23947308]])

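Normalize the final ratings to the range 0.5 to 5, matching the MovieLens rating scale. Only positive predictions are kept; all other entries become NaN.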

In [29]:
from sklearn.preprocessing import MinMaxScaler

X = test_user_final_rating.copy() 
X = X[X > 0]
scaler = MinMaxScaler(feature_range = (0.5, 5))
scaler.fit(X)
pred = scaler.transform(X)

total_non_nan = np.count_nonzero(~np.isnan(pred))

[[       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 ...
 [       nan 2.28631493        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]
 [       nan        nan        nan ...        nan        nan        nan]]
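
The scaling itself follows the usual min-max formula, applied to each column separately, which is also how MinMaxScaler treats its input. Below is a hand-written sketch on invented scores; `min_max_columns` is a hypothetical helper, and it assumes every column has at least two distinct non-NaN values.

```python
import numpy as np

def min_max_columns(values, low=0.5, high=5.0):
    # Per-column minimum and maximum, ignoring NaN entries.
    col_min = np.nanmin(values, axis=0)
    col_max = np.nanmax(values, axis=0)
    # Map each column linearly so its minimum becomes `low` and its maximum `high`.
    return (values - col_min) / (col_max - col_min) * (high - low) + low

# Invented scores; NaN marks an entry that is excluded from evaluation.
scores = np.array([
    [1.0, np.nan],
    [3.0, 10.0],
    [5.0, 20.0],
])
scaled = min_max_columns(scores)
print(scaled)
```

Because each column is scaled on its own, the same raw score can map to different values in different movie columns, and NaN entries stay NaN.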


In [32]:
test = x_test.pivot(index = 'userId', columns = 'movieId', values = 'rating')

Finding the Root Mean Square Error (RMSE)

In [33]:
diff_sqr_matrix = (test - pred)**2
sum_of_squares_err = diff_sqr_matrix.sum().sum() 

rmse = np.sqrt(sum_of_squares_err/total_non_nan)
print(rmse)

1.5635654266606624


Finding the Mean Absolute Error

In [34]:
mae = np.abs(pred - test).sum().sum()/total_non_nan
print(mae)

1.2116342196650216
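
Both metrics can be checked by hand on a tiny example. The values below are invented; NaN marks user-movie pairs without a test rating, which are left out of both the sum and the count.

```python
import numpy as np

# Invented actual and predicted ratings; NaN = no rating for that pair.
actual = np.array([[4.0, np.nan], [np.nan, 2.0]])
predicted = np.array([[3.0, np.nan], [np.nan, 4.0]])

errors = predicted - actual
n_rated = np.count_nonzero(~np.isnan(errors))

# RMSE: square root of the mean squared error over rated pairs.
rmse_demo = np.sqrt(np.nansum(errors ** 2) / n_rated)
# MAE: mean absolute error over rated pairs.
mae_demo = np.nansum(np.abs(errors)) / n_rated

print(rmse_demo, mae_demo)
```

The two errors here are 1 and 2, so the MAE is 1.5 and the RMSE is the square root of 2.5. RMSE is never smaller than MAE because squaring gives larger errors more weight.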


We can create a deviation of the MAE to make our model more accurate.