# Movie Recommendation and Rating - Team ZF3

© Explore Data Science Academy 2022

---

###### Team Members

1. Ubasinachi Eleonu
2. Bongani Mkhize
3. Abubakar Abdulkadir
4. Michael Mamah
5. Joseph Okonkwo

---

## Project Overview

<img src="resources/images/bg.jfif" style='margin-top:30px; margin-bottom:30px'/>
It is almost impossible for a person to consume every product and choice available. Most people do not have the time, patience or resources even to view the myriad products and services at their disposal. It is therefore almost imperative for producers of goods and services to narrow down the choices they present, so that users are not overwhelmed and can reach relevant products and services quickly. This gives users a better experience and exposes them to products and services they might never have discovered otherwise. This help comes in the form of <b>recommendation</b>.

Simple as this sounds, it is not easy to implement. The traditional approach would be to deploy product recommender agents (such as customer service representatives) to handle recommendation requests from customers. But these agents cannot learn about every one of their customers and which products and services each might want or find useful. So how does one recommend products and services to people one does not know?

The answer is recommender systems. Recommender systems are machine learning systems that help users discover products and services based on the relationship between the users and the products. They are like salespeople who have learnt to recognize customers and the products they might like based on their history and preferences. Recommender systems are now so commonplace that almost every time you shop online, a recommendation system is guiding you towards the products you are most likely to purchase.

Recommender systems have many use cases; this project focuses on movie recommendation.

---

## 1.0 Project Objective

To build a recommendation system capable of recommending movies to users and predicting the rating a user might give a movie they have never seen before. <br ><br>

## 2.0 Packages

### 2.1. Installing Packages

This project relies on two major libraries: sklearn and surprise. Sklearn is the more popular of the two.

In [None]:
!pip install scikit-surprise

- <a href="http://surpriselib.com/"> Surprise</a> is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. It does not support implicit ratings or content-based information. Surprise was used in this project to make collaborative filtering predictions. <br>

### 2.2 Importing Packages 

In [1]:
# data loading and preprocessing 
import numpy as np 
import pandas as pd 
import pickle as pkl
from collections import Counter
from surprise import Reader
from surprise import Dataset
import math

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# feature extraction and similarity metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

#modeling and validation
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

<br />

## 3.0 Loading Datasets

    
The dataset used for this project is the MovieLens dataset maintained by the GroupLens research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from IMDB. The dataset can be found <a href="https://www.kaggle.com/competitions/edsa-movie-recommendation-2022/data"> here</a>. The pandas library is used to load and manipulate the datasets.

In [2]:
# read movie dataset
df_movies = pd.read_csv('data/movies.csv')

In [4]:
# read the ratings dataset
df_train = pd.read_csv('data/train.csv')

In [6]:
# read the movie additional information
df_meta = pd.read_csv('data/imdb_data.csv')

<br><br>
## 4.0 Exploratory Data Analysis


Exploratory data analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. Primarily, EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. It uses many tools (mainly graphical) to maximize insight into a dataset, extract important variables, and detect outliers, anomalies and other details that are missed when simply looking at a DataFrame. This step is especially important before modeling the data with machine learning techniques.
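As a minimal sketch of the kind of summary this step produces, the snippet below computes a few basic statistics on a tiny ratings table. The five rows are illustrative values in the same shape as the ratings data (`userId`, `movieId`, `rating`); the variable names are assumptions for this example and not part of the project code.

```python
import pandas as pd

# Illustrative sample in the shape of the ratings data (not the full dataset)
ratings = pd.DataFrame({
    "userId": [5163, 106343, 146790, 106362, 9041],
    "movieId": [57669, 5, 5459, 32296, 366],
    "rating": [4.0, 4.5, 5.0, 2.0, 3.0],
})

# Summary statistics of the rating column (count, mean, quartiles, ...)
rating_summary = ratings["rating"].describe()

# How often each rating value occurs, ordered by rating value
rating_counts = ratings["rating"].value_counts().sort_index()

# Number of ratings per user (activity) and per movie (popularity)
ratings_per_user = ratings.groupby("userId")["rating"].count()
ratings_per_movie = ratings.groupby("movieId")["rating"].count()

print(rating_summary)
print(rating_counts)
```

On the real data, the same counts can be passed to a plotting call (for example a bar chart of `rating_counts`) to see how skewed the ratings are.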

<br><br>
## 5.0 Content Based Recommendation

## 5.0 Feature Engineering and Selection


In this section, the recommendations from the exploratory data analysis phase are implemented. The datasets were merged and cleaned, and features were selected for similarity assessment.

### 5.1 Text Cleaning

The dataset contains punctuation, links, emojis and Twitter-specific characters such as the @ and # symbols. Words also appear in different cases, which models might treat as different words. Hence, the cleaning that follows involves:
- Changing the case of the words
- Removing punctuation
- Removing links
- Removing emojis
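The four steps listed above can be sketched as a single helper. This is a hypothetical `clean_text` function written for illustration, not the function used in the project; the emoji ranges in particular are a rough assumption and do not cover every emoji.

```python
import re
import string

# Links: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
# Rough emoji coverage; these Unicode ranges are an assumption, not exhaustive
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\U00002600-\U000027BF]+")
# Translation table that deletes all ASCII punctuation (including @ and #)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase the text and strip links, emojis and punctuation."""
    text = text.lower()
    # Remove links before punctuation, otherwise the URL pattern no longer matches
    text = URL_PATTERN.sub(" ", text)
    text = EMOJI_PATTERN.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    # Collapse the whitespace left behind by the removals
    return " ".join(text.split())


print(clean_text("Toy Story! Visit https://example.com 🎬 #Pixar"))
# toy story visit pixar
```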

In [8]:
# Extract movieId, title_cast, director and plot_keywords from df_meta

df_meta = df_meta[['movieId', 'title_cast', 'director', 'plot_keywords']]

In [9]:
df_train.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [10]:
# merging datasets to form our initial dataset

df_merged = df_movies.merge(df_meta, on='movieId', how='left')

In [11]:
df_merged.shape

(62423, 6)

In [12]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62423 entries, 0 to 62422
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   movieId        62423 non-null  int64 
 1   title          62423 non-null  object
 2   genres         62423 non-null  object
 3   title_cast     15201 non-null  object
 4   director       15347 non-null  object
 5   plot_keywords  14384 non-null  object
dtypes: int64(1), object(5)
memory usage: 3.3+ MB
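The non-null counts above show that most movies have no IMDB metadata after the left merge. A small sketch of how a left merge keeps every movie and how `indicator=True` exposes the unmatched rows is given below; the two tiny frames and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of df_movies and df_meta
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["A (1995)", "B (1995)", "C (2001)"],
})
meta = pd.DataFrame({
    "movieId": [1, 2],
    "director": ["Jane Doe", "John Roe"],
})

# how='left' keeps every row of `movies`; unmatched rows get NaN metadata.
# indicator=True adds a `_merge` column recording where each row came from.
merged = movies.merge(meta, on="movieId", how="left", indicator=True)

# Count the movies that found no metadata row
missing_meta = (merged["_merge"] == "left_only").sum()

print(merged.shape)
print(missing_meta)
```

The `_merge` column can be dropped once the check is done, so the frame matches the six-column layout used in the rest of the section.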


In [13]:
# handle missing data: fill the text columns with a blank string
# (assigning back avoids chained in-place fillna, which newer pandas versions warn about)
text_cols = ['title_cast', 'director', 'plot_keywords']
df_merged[text_cols] = df_merged[text_cols].fillna(' ')

In [14]:
df_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62423 entries, 0 to 62422
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   movieId        62423 non-null  int64 
 1   title          62423 non-null  object
 2   genres         62423 non-null  object
 3   title_cast     62423 non-null  object
 4   director       62423 non-null  object
 5   plot_keywords  62423 non-null  object
dtypes: int64(1), object(5)
memory usage: 3.3+ MB


In [15]:
df_merged

Unnamed: 0,movieId,title,genres,title_cast,director,plot_keywords
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,toy|rivalry|cowboy|cgi animation
1,2,Jumanji (1995),Adventure|Children|Fantasy,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,board game|adventurer|fight|game
2,3,Grumpier Old Men (1995),Comedy|Romance,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,boat|lake|neighbor|rivalry
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,black american|husband wife relationship|betra...
4,5,Father of the Bride Part II (1995),Comedy,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,fatherhood|doberman|dog|mansion
...,...,...,...,...,...,...
62418,209157,We (2018),Drama,,,
62419,209159,Window of the Soul (2001),Documentary,,,
62420,209163,Bad Poems (2018),Comedy|Drama,,,
62421,209169,A Girl Thing (2001),(no genres listed),,,


In [16]:
# cleaning the data in genres
df_cleaned = df_merged.copy()
df_cleaned['genres'] = df_merged['genres'].apply(lambda x: x.replace("|" , ' ').replace("(no genres listed)", ' '))

In [17]:
# join each cast member's name into one token, then separate members with spaces
df_cleaned['title_cast'] = df_merged['title_cast'].apply(lambda x: x.replace(" " , '').replace("|", ' '))


In [18]:
# join the director's name into one token and repeat it three times
df_cleaned['director'] = df_merged['director'].apply(lambda x: ((x+'|') * 3).replace(" ", '').replace("|", " "))

In [19]:
df_cleaned['director']

0                  JohnLasseter JohnLasseter JohnLasseter 
1        JonathanHensleigh JonathanHensleigh JonathanHe...
2        MarkStevenJohnson MarkStevenJohnson MarkSteven...
3               TerryMcMillan TerryMcMillan TerryMcMillan 
4               AlbertHackett AlbertHackett AlbertHackett 
                               ...                        
62418                                                     
62419                                                     
62420                                                     
62421                                                     
62422                                                     
Name: director, Length: 62423, dtype: object
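Repeating the director's name three times appears intended to give the director more weight in the text features, since term-frequency based vectorizers count each occurrence. The sketch below illustrates the effect with `CountVectorizer` on a single made-up document; the surrounding keywords are only an example.

```python
from sklearn.feature_extraction.text import CountVectorizer

director = "John Lasseter"

# Same transformation as in the notebook: squash the name into one token, repeat it 3 times
weighted = ((director + "|") * 3).replace(" ", "").replace("|", " ")

# A made-up document combining a few keywords with the weighted director token
doc = "toy rivalry cowboy " + weighted

vec = CountVectorizer()
counts = vec.fit_transform([doc])
vocab = vec.vocabulary_

# The director token is counted three times, each keyword once
director_count = counts[0, vocab["johnlasseter"]]
keyword_count = counts[0, vocab["toy"]]

print(director_count, keyword_count)
# 3 1
```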

In [20]:
# separate the plot keywords with spaces instead of pipes
df_cleaned['plot_keywords'] = df_merged['plot_keywords'].apply(lambda x: x.replace("|", " "))

In [21]:
df_cleaned['plot_keywords']

0                         toy rivalry cowboy cgi animation
1                         board game adventurer fight game
2                               boat lake neighbor rivalry
3        black american husband wife relationship betra...
4                          fatherhood doberman dog mansion
                               ...                        
62418                                                     
62419                                                     
62420                                                     
62421                                                     
62422                                                     
Name: plot_keywords, Length: 62423, dtype: object

In [22]:
df_cleaned.head()

Unnamed: 0,movieId,title,genres,title_cast,director,plot_keywords
0,1,Toy Story (1995),Adventure Animation Children Comedy Fantasy,TomHanks TimAllen DonRickles JimVarney Wallace...,JohnLasseter JohnLasseter JohnLasseter,toy rivalry cowboy cgi animation
1,2,Jumanji (1995),Adventure Children Fantasy,RobinWilliams JonathanHyde KirstenDunst Bradle...,JonathanHensleigh JonathanHensleigh JonathanHe...,board game adventurer fight game
2,3,Grumpier Old Men (1995),Comedy Romance,WalterMatthau JackLemmon SophiaLoren Ann-Margr...,MarkStevenJohnson MarkStevenJohnson MarkSteven...,boat lake neighbor rivalry
3,4,Waiting to Exhale (1995),Comedy Drama Romance,WhitneyHouston AngelaBassett LorettaDevine Lel...,TerryMcMillan TerryMcMillan TerryMcMillan,black american husband wife relationship betra...
4,5,Father of the Bride Part II (1995),Comedy,SteveMartin DianeKeaton MartinShort KimberlyWi...,AlbertHackett AlbertHackett AlbertHackett,fatherhood doberman dog mansion


In [23]:
# combine all text features into a single string per movie
df_data_string = df_cleaned['title'] + " " + df_cleaned['genres'] + " " + df_cleaned['title_cast'] + " " + df_cleaned['director'] + " " + df_cleaned['plot_keywords']

In [24]:
df_data_string.head()

0    Toy Story (1995) Adventure Animation Children ...
1    Jumanji (1995) Adventure Children Fantasy Robi...
2    Grumpier Old Men (1995) Comedy Romance WalterM...
3    Waiting to Exhale (1995) Comedy Drama Romance ...
4    Father of the Bride Part II (1995) Comedy Stev...
dtype: object

In [None]:
# vectorization

vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, max_df=0.5, stop_words='english')
features = vectorizer.fit_transform(df_data_string)

In [None]:
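# astype returns a copy, so assign it back (float32, as SciPy sparse support for float16 is limited)
features = features.astype(np.float32)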

In [None]:
# creating the similarity matrix: first 1000 movies against all movies
cosine_sim = cosine_similarity(features[0:1000], features)

In [None]:
cosine_sim.shape
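A back-of-the-envelope estimate suggests why only the first 1000 rows are compared against the full feature matrix. The sketch assumes the similarity matrix is dense and stored as 8-byte floats (float64); the actual dtype depends on the input features.

```python
n_movies = 62423          # number of rows in df_merged
bytes_per_value = 8       # assumption: dense float64 output

# Full movie-by-movie similarity matrix
full_bytes = n_movies * n_movies * bytes_per_value

# Only the first 1000 movies against all movies, as in the cell above
slice_bytes = 1000 * n_movies * bytes_per_value

print(f"full matrix:  {full_bytes / 1e9:.1f} GB")   # full matrix:  31.2 GB
print(f"1000-row cut: {slice_bytes / 1e9:.2f} GB")  # 1000-row cut: 0.50 GB
```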

In [45]:
df_cleaned.movieId

0             1
1             2
2             3
3             4
4             5
          ...  
62418    209157
62419    209159
62420    209163
62421    209169
62422    209171
Name: movieId, Length: 62423, dtype: int64
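One way the similarity scores could be turned into recommendations is sketched below. The `recommend` helper, the four-movie table and its `text` column are hypothetical and only stand in for `df_cleaned` and `df_data_string`; similarities are computed one row at a time so the full matrix is never materialised.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up miniature catalogue; 'text' plays the role of df_data_string
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Jumanji (1995)",
              "Grumpier Old Men (1995)", "Toy Story 2 (1999)"],
    "text": [
        "adventure animation children toy cowboy rivalry",
        "adventure children fantasy board game",
        "comedy romance boat lake neighbor",
        "adventure animation children toy cowboy rescue",
    ],
})

features = TfidfVectorizer().fit_transform(movies["text"])

# Map each title to its row position in the feature matrix
title_to_row = pd.Series(movies.index, index=movies["title"])


def recommend(title, top_n=2):
    """Return the top_n titles most similar to `title` by cosine similarity."""
    row = title_to_row[title]
    # Similarity of this one movie against every movie (1 x n_movies)
    scores = cosine_similarity(features[row], features).ravel()
    scores[row] = -1.0  # exclude the query movie itself
    best = np.argsort(scores)[::-1][:top_n]
    return movies["title"].iloc[best].tolist()


print(recommend("Toy Story (1995)"))
# ['Toy Story 2 (1999)', 'Jumanji (1995)']
```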