# TeamNM3 Movie Recommendation System

© Explore Data Science Academy

---
### Honour Code

We, **Murtala Umar**,
   **Harmony Odumuku**,
   **Njoku Okechukwu**,
   **Akinbo Akin Taylor**, and
   **Prince Charles Amankwa Afriyie**, confirm, by submitting this document, that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Overview:

The rapid growth of data collection has led to a new era of information. Data is being used to create more efficient systems, and this is where recommendation systems come into play. Recommendation systems are a type of information filtering system: they improve the quality of recommendations and search results by surfacing the items that are most relevant to each user.

In today's technology-driven world, recommender systems are socially and economically critical for helping individuals make optimised choices about the content they engage with on a daily basis. One application where this is especially true is movie recommendation, where intelligent algorithms can help viewers find great titles from tens of thousands of options.


<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>



---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---


In [1]:
# Importing the Comet library
import os
from comet_ml import Experiment

# Creating an experiment on Comet. The API key is read from the COMET_API_KEY
# environment variable instead of being hard-coded in the notebook.
experiment = Experiment(
    api_key=os.environ["COMET_API_KEY"],
    project_name="move-recommendation-system-edsa-2022",
    workspace="murtalaua",
)

COMET INFO: Couldn't find a Git repository in 'C:\\Users\\alami\\Documents\\Explore\\Athena\\Unsupervised Learning\\Predict\\edsa-movie-recommendation-2022' nor in any parent directory. You can override where Comet is looking for a Git Patch by setting the configuration `COMET_GIT_DIRECTORY`
COMET INFO: Experiment is live on comet.ml https://www.comet.com/murtalaua/move-recommendation-system-edsa-2022/479142c433f246828b699d8063bf84d9



In [2]:
# Libraries for importing and loading data
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency.
import matplotlib.pyplot as plt
import seaborn as sns
import os
from textwrap import wrap

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer
from surprise import SVD
from surprise import Reader, Dataset
from surprise.model_selection import GridSearchCV, cross_validate


# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient sorting of large lists
from time import time

# Setting global constants to ensure notebook results are reproducible

RANDOM_STATE = 42


import warnings
warnings.filterwarnings('ignore')

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

| ⚡ Description: Loading the data ⚡                                                                                                                                          |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Here we loaded the movie data available to us for the purposes of the project.|

---

In [3]:
# load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
genome_scores = pd.read_csv('genome_scores.csv')
genome_tags = pd.read_csv('genome_tags.csv')
imdb_data = pd.read_csv('imdb_data.csv')
links = pd.read_csv('links.csv')
movies = pd.read_csv('movies.csv')
tags = pd.read_csv('tags.csv')

In [4]:
# Preview train dataset
print('The Shape of the data is: ', train.shape)
train.head()

The Shape of the data is:  (10000038, 4)


Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


In [5]:
# Preview test dataset
print('The Shape of the data is: ', test.shape)
test.head()

The Shape of the data is:  (5000019, 2)


Unnamed: 0,userId,movieId
0,1,2011
1,1,4144
2,1,5767
3,1,6711
4,1,7318


In [6]:
# Preview genome_scores dataset
print('The Shape of the data is: ', genome_scores.shape)
genome_scores.head()

The Shape of the data is:  (15584448, 3)


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.02875
1,1,2,0.02375
2,1,3,0.0625
3,1,4,0.07575
4,1,5,0.14075


In [7]:
# Preview genome_tags dataset
print('The Shape of the data is: ', genome_tags.shape)
genome_tags.head()

The Shape of the data is:  (1128, 2)


Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [8]:
# Preview imdb_data dataset
print('The Shape of the data is: ', imdb_data.shape)
imdb_data.head()

The Shape of the data is:  (27278, 6)


Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [9]:
# Preview links dataset
print('The Shape of the data is: ', links.shape)
links.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [10]:
# Preview movies dataset
print('The Shape of the data is: ', movies.shape)
movies.head()

The Shape of the data is:  (62423, 3)


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡                                                                                                                                           |
|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| To understand our data, we explore it in depth here.|

---


- look at data statistics
- plot relevant feature interactions
- evaluate correlation
- have a look at feature distributions
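
The checklist above could start with something like the following sketch. It uses a tiny hand-made `ratings` frame (hypothetical values, with the same columns as `train`) so that it runs without the CSV files; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for `train` (hypothetical values, same columns as the real file)
ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 2.5],
    "timestamp": [1518349992, 1206238739, 1076215539, 1423042565,
                  833375837, 1518349992, 1206238739],
})

# Data statistics: summary of the target variable
rating_stats = ratings["rating"].describe()

# Feature distribution: how many times each score was given
rating_counts = ratings["rating"].value_counts().sort_index()

# Activity per user and popularity per movie
ratings_per_user = ratings.groupby("userId")["rating"].count()
movie_summary = ratings.groupby("movieId")["rating"].agg(["count", "mean"])

# Correlation between how often a movie is rated and its mean rating
popularity_corr = movie_summary["count"].corr(movie_summary["mean"])

print(rating_stats)
print(rating_counts)
print(movie_summary)
```

On the real data, replacing `ratings` with `train` should give the same summaries; `rating_counts.plot(kind="bar")` is one simple way to visualise the distribution.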


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---



- remove missing values/ features
- create new features
- engineer existing features
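
As a rough illustration of these three steps, the sketch below works on a few rows modelled on the `movies` and `imdb_data` previews. The sample frames, the new column names (`year`, `genre_list`, `budget_usd`, `n_genres`) and the missing values in the third row are assumptions made for the example, not choices taken from the original analysis.

```python
import pandas as pd

# Small samples modelled on the previews of `movies` and `imdb_data`;
# the missing values in the third row are introduced on purpose
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy", "Comedy|Romance"],
})
imdb_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "director": ["John Lasseter", "Jonathan Hensleigh", None],
    "budget": ["$30,000,000", "$65,000,000", None],
})

# Engineer existing features: release year from the title, genres as a list
movies_sample["year"] = (
    movies_sample["title"].str.extract(r"\((\d{4})\)\s*$", expand=False).astype(float)
)
movies_sample["genre_list"] = movies_sample["genres"].str.split("|")

# Clean the budget string into a number; missing values stay NaN
imdb_sample["budget_usd"] = pd.to_numeric(
    imdb_sample["budget"].str.replace(r"[^\d.]", "", regex=True), errors="coerce"
)

# Handle missing values: flag unknown directors instead of dropping the rows
imdb_sample["director"] = imdb_sample["director"].fillna("unknown")

# Create new features by combining the two tables
features = movies_sample.merge(imdb_sample, on="movieId", how="left")
features["n_genres"] = features["genre_list"].str.len()
print(features[["movieId", "year", "n_genres", "director", "budget_usd"]])
```

Whether to fill, flag or drop missing values depends on the model; with only 7,906 non-null budgets out of 27,278 rows in `imdb_data`, dropping rows on that column would discard most of the table.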



<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more models that are able to accurately predict the rating a user will give to a movie they have not yet rated. |

---

- split data
- create targets and features dataset
- create one or more ML models
- evaluate one or more ML models
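
The steps above can be sketched end to end on synthetic data. The block below is a from-scratch NumPy illustration of the matrix factorisation idea behind the `SVD` class imported from Surprise; it is not the library's code, the toy ratings are randomly generated, and the hyperparameters (`k`, `lr`, `reg`, `n_epochs`) are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy ratings stored as (user index, movie index, rating)
n_users, n_movies = 30, 20
true_p = rng.normal(size=(n_users, 2))
true_q = rng.normal(size=(n_movies, 2))
users = rng.integers(0, n_users, size=400)
items = rng.integers(0, n_movies, size=400)
ratings = np.clip(3.0 + (true_p[users] * true_q[items]).sum(axis=1), 0.5, 5.0)

# Split data: hold out 20% of the rows for evaluation
order = rng.permutation(len(ratings))
cut = int(0.8 * len(ratings))
train_idx, test_idx = order[:cut], order[cut:]

def rmse(pred, actual):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Baseline model: always predict the global training mean
mu = ratings[train_idx].mean()
baseline_rmse = rmse(np.full(len(test_idx), mu), ratings[test_idx])

# Matrix factorisation trained with SGD: r_hat = mu + b_u + b_i + p_u . q_i
k, lr, reg, n_epochs = 2, 0.01, 0.02, 100
b_u, b_i = np.zeros(n_users), np.zeros(n_movies)
P = rng.normal(scale=0.1, size=(n_users, k))
Q = rng.normal(scale=0.1, size=(n_movies, k))

for _ in range(n_epochs):
    for idx in train_idx:
        u, i, r = users[idx], items[idx], ratings[idx]
        err = r - (mu + b_u[u] + b_i[i] + P[u] @ Q[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        # Both factor updates use the values from before this step
        P[u], Q[i] = (P[u] + lr * (err * Q[i] - reg * P[u]),
                      Q[i] + lr * (err * P[u] - reg * Q[i]))

def predict(u, i):
    """Predicted rating(s), clipped to the valid rating range."""
    return np.clip(mu + b_u[u] + b_i[i] + (P[u] * Q[i]).sum(axis=-1), 0.5, 5.0)

# Evaluate on the training rows and on the holdout rows
train_rmse = rmse(predict(users[train_idx], items[train_idx]), ratings[train_idx])
mf_rmse = rmse(predict(users[test_idx], items[test_idx]), ratings[test_idx])
print(f"baseline: {baseline_rmse:.3f}  factorisation: {mf_rmse:.3f}")
```

With Surprise itself, the rough equivalent would be `data = Dataset.load_from_df(train[['userId', 'movieId', 'rating']], Reader(rating_scale=(0.5, 5)))` followed by `cross_validate(SVD(), data, measures=['RMSE'], cv=5)`, assuming ratings run from 0.5 to 5.0.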

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

- Compare model performance
- Choose best model and motivate why it is the best choice
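
One possible way to make this comparison concrete is to score every candidate on the same holdout ratings with RMSE and keep the lowest. The holdout values, the model names and the predictions below are made up for illustration only.

```python
import numpy as np

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical holdout ratings and predictions from two candidate models
holdout = [4.0, 3.0, 5.0, 2.0]
candidate_predictions = {
    "global_mean": [3.5, 3.5, 3.5, 3.5],
    "model_b":     [4.0, 3.5, 4.5, 2.5],
}

# Score every model on the same holdout set and pick the lowest RMSE
rmse_by_model = {
    name: rmse(pred, holdout) for name, pred in candidate_predictions.items()
}
best_model = min(rmse_by_model, key=rmse_by_model.get)
print(rmse_by_model, best_model)
```

Using one shared holdout set keeps the comparison fair; RMSE is only one reasonable metric, and training time or memory use may also matter when motivating the final choice.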


<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

# discuss chosen methods logic
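
If the best performing model turns out to be the matrix factorisation `SVD` from Surprise imported in Section 1 (an assumption, since no model has been selected at this point in the notebook), its logic can be summarised in one prediction rule and one training objective:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
\qquad
\min_{b,\,p,\,q} \sum_{(u,i) \in \text{train}}
  \left( r_{ui} - \hat{r}_{ui} \right)^2
  + \lambda \left( b_u^2 + b_i^2 + \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)
```

In plain terms, the predicted rating starts from the average rating $\mu$, shifts it up or down for how generous the user is ($b_u$) and how well liked the movie is ($b_i$), and then adds a match score between the user's taste vector $p_u$ and the movie's trait vector $q_i$. Training adjusts these numbers to shrink the error on known ratings, while the $\lambda$ term keeps them small so the model does not memorise the training data.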


---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---


In [1]:

# Libraries for data loading and data manipulation
import numpy as np
import pandas as pd

# Libraries for data visualisation
import matplotlib.pyplot as plt
import seaborn as sns


%matplotlib inline

### Setting global constants to ensure notebook results are reproducible
# PARAMETER_CONSTANT = 



<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---

| ⚡ Description: Loading the data ⚡                                                                                                                                          |
|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Here we loaded the movie data available to us for the purposes of the project.|

---

In [24]:
### Use this if you have files locally in datasets subdirectory

movies = pd.read_csv("datasets/movies.csv")
imdb_data = pd.read_csv("datasets/imdb_data.csv")
genome_scores = pd.read_csv('datasets/genome_scores.csv')
links = pd.read_csv('datasets/links.csv')
tags = pd.read_csv('datasets/tags.csv')
genome_tags = pd.read_csv('datasets/genome_tags.csv')

train = pd.read_csv("datasets/train.csv")
test = pd.read_csv("datasets/test.csv")

### Use this if you do not have files locally (in datasets subdirectory)

# base_url = "https://storage.googleapis.com/kagglesdsdata/competitions/36029/3495179/genome_tags.csv?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1657753808&Signature=ZksGirewrm6%2FMk3KGSotOFO3KFCpTSYEduedm4W93dFFU%2BYWC7l2Q9hbAslYyMU97uGnLfRjONQwLOuWPZps1MiXUcMQfYdPg3sdGLO3%2BeMDWZLhOx2HToMMWGXkyyVuZGDDa78MKwNXmyNqxiQvRJxni0MIlkRskAVeLpQhrkP%2B7zK77bzE54OtnAi1VfIIl1WXKsMG76aVuDBvVGh%2BlyJspAMK%2BVtX3ojgPzro9kVGrxgOIxxodyCvG5xfGFVzqAJN8HsX8eBTNFM8fqpbfhcYlOpT3xbZgkpSmMnMOWZbRQIaVdKjUBjccmaNKQVMADNJViEbhyuCCGq1rEBHaw%3D%3D&response-content-disposition=attachment%3B+filename%3D"
# movies = pd.read_csv(base_url+"/movies.csv")
# imdb_data = pd.read_csv(base_url+"/imdb_data.csv")
# genome_scores = pd.read_csv(base_url+"/genome_scores.csv")
# links = pd.read_csv(base_url+"/links.csv")
# tags = pd.read_csv(base_url+"/tags.csv")
# genome_tags = pd.read_csv(base_url+"/genome_tags.csv")

# train = pd.read_csv(base_url+"/train.csv")
# test = pd.read_csv(base_url+"/test.csv")


In [3]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [4]:
movies.shape

(62423, 3)

In [5]:
movies.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


In [6]:
imdb_data.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [7]:
imdb_data.info();

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   movieId        27278 non-null  int64  
 1   title_cast     17210 non-null  object 
 2   director       17404 non-null  object 
 3   runtime        15189 non-null  float64
 4   budget         7906 non-null   object 
 5   plot_keywords  16200 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 1.2+ MB


In [8]:
imdb_data.shape

(27278, 6)

In [9]:
genome_scores.head()

Unnamed: 0,tagId,tag
0,1,7
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [10]:
genome_scores.shape

(1128, 2)

In [11]:
links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [12]:
links.shape

(62423, 3)

In [24]:
tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,3,260,classic,1439472355
1,3,260,sci-fi,1439472256
2,4,1732,dark comedy,1573943598
3,4,1732,great dialogue,1573943604
4,4,7569,so bad it's good,1573943455


In [14]:
tags.shape

(1093360, 4)

In [41]:
# Movies tagged by user 4 (used here as a proxy for the movies the user has watched)
watched_movies = tags.loc[tags['userId'] == 4, 'movieId']
watched_movies

2       1732
3       1732
4       7569
5      44665
6     115569
7     115713
8     115713
9     115713
10    148426
11    164909
12    164909
13    168250
14    168250
Name: movieId, dtype: int64

In [68]:
# Rows from all users who tagged any of the movies watched by user 4.
# `isin` tests membership element-wise; comparing with `==` raises a ValueError
# here because the two Series have different lengths and labels.
tags.loc[tags['movieId'].isin(watched_movies)]

In [15]:
genome_tags.head()

Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


In [16]:
genome_tags.shape

(1128, 2)

In [32]:
# existing files: 

"""
1. movies.csv
2. imdb_data.csv
3. genome_scores.csv
4. links.csv
5. tags.csv
6. genome_tags.csv
"""

# dataset = pd.merge(movies,imdb_data, on="movieId")
# merged_df = pd.concat([movies, imdb_data, genome_scores, links, tags, genome_tags])
# merged_df

# data = movies.merge(imdb_data, on='movieId')
# data = data.merge(links, on='movieId')
# data = data.merge(genome_scores, on="movieId")

# data

data = tags.merge(movies, on='movieId')
data = data.merge(links, on='movieId')


In [33]:
data.head()

Unnamed: 0,userId,movieId,tag,timestamp,title,genres,imdbId,tmdbId
0,14,110,epic,1443148538,Braveheart (1995),Action|Drama|War,112573,197.0
1,14,110,Medieval,1443148532,Braveheart (1995),Action|Drama|War,112573,197.0
2,815,110,overrated,1150006110,Braveheart (1995),Action|Drama|War,112573,197.0
3,2577,110,Oscar (Best Picture),1378324225,Braveheart (1995),Action|Drama|War,112573,197.0
4,3086,110,epic,1463675332,Braveheart (1995),Action|Drama|War,112573,197.0


In [19]:
# Load the genome scores under a shorter name, then preview them
scores = pd.read_csv('datasets/genome_scores.csv')
scores.head()

In [None]:
scores.shape

In [None]:
# check the dataset info
scores.info()

In [None]:
train

In [None]:
test

In [None]:
# check the train and test dataset info
train.info()
test.info()

In [None]:
test = pd.read_csv("datasets/test.csv")
test

In [None]:
data.columns

In [None]:
train = pd.read_csv("datasets/train.csv")
train

