# Movie Recommendation and Rating - Team ZF3

© Explore Data Science Academy 2022

---

###### Team Members

1. Ubasinachi Eleonu
2. Bongani Mkhize
3. Abubakar Abdulkadir
4. Michael Mamah
5. Joseph Okonkwo
6. 

---

## Project Overview

<img src="https://th.bing.com/th/id/R.f32f6c0a36b1166033122544cf0dd8a1?rik=QmYumf41lwVQgA&pid=ImgRaw&r=0" style='margin-top:30x; margin-bottom:30px'/>
It is almost impossible for a person to attempt to consume all the products and choices available. It is even most likely that a person will not have the time, patience or resources to even view the myraids of choices in terms of products and services available at his disposal. Hence, it becomes almost imperative for producers of goods and services to help narrow down the choices of products presented to their users in an attempt to reduce overwhelming them and help them reach thier relevant products and services without waste of time and as a result, helping them have a better user experience, while also exposing them to more products and services they might have never discovered otherwise. This help comes in the form of  <b> recommendation </b>

Simple as the above sounds, it is not as easy to implement because the traditional approach would have been to deploy product recommender agents (like customer service representatives) who will handle recommendation requests from customers. But these agents will be unable to learn about every of thier customers and what products and services they might want and find useful. So how does one recommend products and services to people he does not know?

The response is using Recommender Systems. Recommender systems are machine learning systems that help users discover products and services based on the relationship between the users and the products.Recommender systems are like salesmen who have learnt to recognize customers and the products they might like based on their history and preferences. Recommender systems are so common place now that every time you shop online, a  recommendation system is guiding you towards the most likely product you might purchase.

There are several use cases of the recommender system. But this project will focus on movie recommendation.

---

## 1.0 Project Objective

To build a recommendation system capable of recommending movies to users and predicting ratings a user might give a movie they have never seen bebfore. <br ><br>

## 2.0 Packages

### 2.1. Installing Packages

For this project, two major libraries were leveraged on - sklearn and surprise. Sklearn is the most mopular of the two.

In [1]:
!pip install scikit-surprise



- <a href="http://surpriselib.com/"> Surprise</a> is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. It does not support implicit ratings or content-based information. Surprise was used in this project to make collaborative prediction. <br>

### 2.2 Importing Packages 

In [25]:
# data loading and preprocessing 
import numpy as np 
import pandas as pd 
import pickle as pkl
from collections import Counter
from surprise import Reader
from surprise import Dataset
import math

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# feature extration and similarity metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

#modeling and validation
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

<br />

## 3.0 Loading Datasets

    
The dataset used for this project is the MovieLens dataset maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB. The dataset can be found <a href="https://www.kaggle.com/competitions/edsa-movie-recommendation-2022/data"> here</a>. Pandas library will be used to access and Manipulate the datasets.

In [80]:
# read movie dataset
df_movies = pd.read_csv('data/movies.csv')

In [81]:
# read the ratings dataset
df_rating = pd.read_csv('data/train.csv')

In [82]:
# read the movie additional information
df_meta = pd.read_csv('data/imdb_data.csv')

<br><br>
## 4.0 Exploratory Data Analysis


Exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. Primarily, EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.This approach for data analysis uses many tools(mainly graphical to maximize insight into a data set, extract important variables, detect outliers and anomalies, amongst other details that is missed when looking at DataFrame. This step is very important especially when we model the data in order to apply Machine Learning techniques.

<br><br>
## 5.0 Content Based Filtering


This section of the project aims at making recommenadations and rating using the content-based aproach. This approach uses the similarity between items to make recommendations. It is based off the assumption that if a user likes a particular item, the user will like items similar to that items. Hence, if a user rates a particular movie very high, there is aa high chance the user will rank other similar movies high. 

### 5.1 Feature Engineering and Selection


This project considers building a recommender off the movie genre, the director and the plot keyword feature. 

#### 5.1.1 Selecting the Required Features

The movie genre is available in the movies dataset, the director and plot keywords features are in the imdb_data dataset. Hence, there is a need to merge both datasets and extract the required features. 

In [86]:
# Extract movieId, title_cast, director and plot_keywords from df_meta
df_meta = df_meta[['movieId', 'director', 'plot_keywords']]


# merge meta dataset to movies dataset to produce our train dataset
df_train = df_movies.merge(df_meta, on='movieId', how='left')
df_train.head()

Unnamed: 0,movieId,title,genres,director,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,John Lasseter,toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,Jonathan Hensleigh,board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,Mark Steven Johnson,boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Terry McMillan,black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,Albert Hackett,fatherhood|doberman|dog|mansion


<br>

#### 5.1.2 Cleaning the Selected features

The genres and plot keywords feature contains genres and keywords seperated by the '|' character. There is a need to replace the seperator character with a space. On the director feature, there is a need to remove the space between the director name and surname; this is so that the model will not percieve any similarity between Albert Johnson and Albert Robert. They are totally different persons. And lastly, merging the features together and changing them to all lowercasing.

In [87]:
# handle missing data
df_train.fillna(' ', inplace=True)

# replacing "|" and "(no genres listed)" with ' ' in genre
df_train['genres'] = df_train['genres'].apply(lambda x: x.replace("|" , ' ')
                                       .replace("(no genres listed)", ' '))

# replacing " " with ' ' in director
df_train['director'] = df_train['director'].apply(lambda x: ((x+'|'))
                                            .replace(" ", '')
                                            .replace("|", " "))

# replace "|" with ' ' in plot_keywords
df_train['plot_keywords'] = df_train['plot_keywords'].apply(lambda x: x.replace("|", " "))

# Merge the genres, plot_keywords and director names as our major predictors
df_train_string = df_train['genres'] + " " + df_train['director'] + " " + df_train['plot_keywords']

# change to lower case
df_train_string.apply(str.lower)

0        adventure animation children comedy fantasy jo...
1        adventure children fantasy jonathanhensleigh  ...
2        comedy romance markstevenjohnson  boat lake ne...
3        comedy drama romance terrymcmillan  black amer...
4        comedy alberthackett  fatherhood doberman dog ...
                               ...                        
62418                                            drama    
62419                                      documentary    
62420                                     comedy drama    
62421                                                     
62422                           action adventure drama    
Length: 62423, dtype: object

#### 5.1.3 Vectorization

To create a model, there is a need to have a set of feature(s) with numerical values since most models only accept numerical values for feature sets. For this project, our feature is a string of words. Hence there is a need to create vectors of digits from these words. The process is called Vectorization.

For this project we define a vectorizer with the following tuning
- analyser = 'word'
- ngram_range = (1, 1)
- max_df = 0.3
- min_df = 20
- stop_words = 'english'

In [117]:
vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 1), min_df=10, max_df=0.5, stop_words='english')
features = vectorizer.fit_transform(df_train_string)

features.shape

(62423, 1581)

<br>

### 5.2 Recommending

This section contains functions for making movie recommendation for a user using the vectorised features from the previous section. The section contains four functions 

#### 5.2.1 Retrieving Top N Movies Rated by a User

To perform content based filtering, there is a need to retrieve all movies rated by the user under focus sorted by the rating the user ascribe to them in descending order. 

In [118]:
#function to collect all movies rated by a particular user

def all_user_rated_movies(userId, n):
    rated_movies = df_rating[df_rating['userId'] == userId]
    rated_sorted =  rated_movies.sort_values(by='rating', ascending=False)
    return rated_sorted['movieId'].iloc[:n]

#### 5.2.1 Retrieving All Unseen Movies by a User

Similarly, there is also a need to retrieve all movies previously unseen by a user from which we can make recommendation from to avoid recommending movies which has previously been seen by a user.

In [119]:
def all_unseen_movies(userId):
    unseen_movies = df_rating[df_rating['userId'] != userId]['movieId']
    return df_movies[df_movies['movieId'].isin(unseen_movies)].index

#### 5.2.2 Recommending Top N Unseen Movies by user

Using the cosine similarity, the top N similar movies to each top rated movies by the user is recommended for the user.

In [123]:
# Recommend the Top N movies for each top rated movie by a user

def recommend(movie_df, userId, n=10):
    top_rated_movies_id = all_user_rated_movies(userId, int(n/2))
    unseen_movies = all_unseen_movies(userId)
    similarity_list = []
    
    for movieId in list(top_rated_movies_id):
        movie_index = movie_df[movie_df['movieId'] == movieId].index[0]
        sim_matrix = cosine_similarity(features[movie_index], features[unseen_movies])[0]
        
        for i in range(2):
            similarity_list.append(np.argmax(sim_matrix))
            sim_matrix[np.argmax(sim_matrix)] = 0
        
    return df_train.iloc[similarity_list]

Using the recommend function to recommend 15 movies for user with userId 100

In [131]:
recommend(df_movies, 600, 15)

Unnamed: 0,movieId,title,genres,director,plot_keywords
703,718,"Visitors, The (Visiteurs, Les) (1993)",Comedy Fantasy Sci-Fi,ChristianClavier,time travel year 1123 12th century 20th century
8150,8865,Sky Captain and the World of Tomorrow (2004),Action Adventure Sci-Fi,KerryConran,steampunk reporter dieselpunk mechanical monster
262,265,Like Water for Chocolate (Como agua para choco...,Drama Fantasy Romance,LauraEsquivel,mexico food marriage love
21113,109153,Ray Harryhausen: Special Effects Titan (2011),Documentary,GillesPenso,film producer giant gorilla animator visual ef...
452,457,"Fugitive, The (1993)",Thriller,JebStuart,one armed man on the run u.s. marshal surgeon
13388,69159,Jimmy and Judy (2006),Crime Drama Thriller,JonathanSchroder,watching tv character names as title forenames...
103,105,"Bridges of Madison County, The (1995)",Drama Romance,MerylStreep,bridge love farm photographer
9982,33669,"Sisterhood of the Traveling Pants, The (2005)",Adventure Comedy Drama,AnnBrashares,pantyhose friendship sisterhood female friendship
324,329,Star Trek: Generations (1994),Adventure Drama Sci-Fi,GeneRoddenberry,female empath half human half alien empath hum...
17040,89616,My Little Business (Ma petite entreprise) (1999),Comedy Drama,PierreJolivet,insurance fraud


In [133]:
recommend(df_movies, 25, 15)

Unnamed: 0,movieId,title,genres,director,plot_keywords
585,593,"Silence of the Lambs, The (1991)",Crime Horror Thriller,ThomasHarris,serial killer psycho thriller bad guy wins stu...
4284,4389,Lost and Delirious (2001),Drama,SusanSwan,lesbian girls' boarding school teenage sexuali...
149,151,Rob Roy (1995),Action Drama Romance War,AlanSharp,scotland highlands 18th century flintlock pistol
9871,33051,Skin Game (1971),Comedy Romance Western,,
530,535,Short Cuts (1993),Drama,RaymondCarver,waitress destruction of property female full r...
15865,83603,Fern flowers (Fleur de fougère) (1949),Animation,WladyslawStarewicz,
51,52,Mighty Aphrodite (1995),Comedy Drama Romance,WoodyAllen,new york city talking about sex husband wife r...
2482,2573,Tango (1998),Drama Musical,CarlosSaura,female nudity topless female nudity female ful...
292,296,Pulp Fiction (1994),Comedy Crime Drama Thriller,QuentinTarantino,nonlinear timeline overdose drug overdose bondage
24262,121077,Meet Boston Blackie (1941),Crime Drama,,


In [134]:
recommend(df_movies, 600, 15)

Unnamed: 0,movieId,title,genres,director,plot_keywords
703,718,"Visitors, The (Visiteurs, Les) (1993)",Comedy Fantasy Sci-Fi,ChristianClavier,time travel year 1123 12th century 20th century
8150,8865,Sky Captain and the World of Tomorrow (2004),Action Adventure Sci-Fi,KerryConran,steampunk reporter dieselpunk mechanical monster
262,265,Like Water for Chocolate (Como agua para choco...,Drama Fantasy Romance,LauraEsquivel,mexico food marriage love
21113,109153,Ray Harryhausen: Special Effects Titan (2011),Documentary,GillesPenso,film producer giant gorilla animator visual ef...
452,457,"Fugitive, The (1993)",Thriller,JebStuart,one armed man on the run u.s. marshal surgeon
13388,69159,Jimmy and Judy (2006),Crime Drama Thriller,JonathanSchroder,watching tv character names as title forenames...
103,105,"Bridges of Madison County, The (1995)",Drama Romance,MerylStreep,bridge love farm photographer
9982,33669,"Sisterhood of the Traveling Pants, The (2005)",Adventure Comedy Drama,AnnBrashares,pantyhose friendship sisterhood female friendship
324,329,Star Trek: Generations (1994),Adventure Drama Sci-Fi,GeneRoddenberry,female empath half human half alien empath hum...
17040,89616,My Little Business (Ma petite entreprise) (1999),Comedy Drama,PierreJolivet,insurance fraud


<br><br>

## 6.0 Collaborative Filtering


This section of the project aims at making recommenadations and rating using the collaborative aproach. This approach uses the similarity between users to make recommendations. It is based off the assumption that if a user likes a particular item, other users sharing similar trait with the user will most likely like the item. Hence, if a user rates a particular movie very high, there is a high chance another user who enjoys similar rating patern with the use will rank the movie high. 

For the collaborative filtering, we use the surprise package for handling data loading, data manipulation, modelling and testing.

<br>

### 6.1Reducing the Dataset Size

To reduce the dimension of our training dataset, we filter out movies with low ratings ans users who have only rated few movies.