# Unsupervised Learning Team 5 Solution (Gabe)

© Explore Data Science Academy

---
### Problem Statement

In today’s technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with daily. This is especially true of movie recommendations, where intelligent algorithms can help viewers find great titles among tens of thousands of options.

<img src="https://i.pinimg.com/originals/d9/58/5e/d9585efc140b5d3689b3341aa5c35df1.jpg" alt="movie-recommendation" style="width: 800px;"/>







Our team has been challenged with constructing a movie recommendation algorithm, based on content-based or collaborative filtering, that can accurately predict how a user will rate a movie they have not yet viewed, given their historical preferences.

Providing an accurate and robust solution to this challenge has immense economic potential, with users of the system being exposed to content they would like to view or purchase - generating revenue and platform affinity.

<a id="cont"></a>

## Table of Contents

<a href=#one>(i) Comet Experiment</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data and Data Descriptions</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## (i) Comet Experiment
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Comet Experiment ⚡ |
| :--------------------------- |
| In this section we conduct our Comet Experiment. |

---

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section we import, and briefly discuss, the libraries that will be used throughout our analysis and modelling. |

---

In [1]:
# data analysis libraries
import pandas as pd
import numpy as np

# Kaggle requirements
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# visualisation libraries
from matplotlib import pyplot as plt
import seaborn as sns
from numpy.random import RandomState
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)


# word cloud
from wordcloud import WordCloud, STOPWORDS

# render plots inline and apply the default seaborn theme
%matplotlib inline
sns.set()


'''
# ML Models
!pip install scikit-surprise
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# ML Pre processing
from surprise.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hyperparameter tuning
from surprise.model_selection import GridSearchCV
'''

# High performance hyperparameter tuning
#from tune_sklearn import TuneSearchCV

# Remove warnings 
import warnings
warnings.filterwarnings("ignore")

<a id="two"></a>
## 2. Loading the Data and Data Descriptions
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data and data descriptions ⚡ |
| :--------------------------- |
| In this section we load the data from the CSV files into DataFrames. We also describe the various CSV files. |

---

In [3]:
# read in all csv files 
train = pd.read_csv('../input/edsa-movie-recommendation-wilderness/train.csv')
test = pd.read_csv('../input/edsa-movie-recommendation-wilderness/test.csv')
genome_scores = pd.read_csv('../input/edsa-movie-recommendation-wilderness/genome_scores.csv')
genome_tags = pd.read_csv('../input/edsa-movie-recommendation-wilderness/genome_tags.csv')
imdb_data = pd.read_csv('../input/edsa-movie-recommendation-wilderness/imdb_data.csv')
links = pd.read_csv('../input/edsa-movie-recommendation-wilderness/links.csv')
movies = pd.read_csv('../input/edsa-movie-recommendation-wilderness/movies.csv')
tags = pd.read_csv('../input/edsa-movie-recommendation-wilderness/tags.csv')

#### Dataset Descriptions
The supplied dataset comprises the following:

1. genome_scores.csv - A score mapping the strength between movies and tag-related properties
2. train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data
3. test.csv - The test split of the dataset. Contains user and movie IDs with no rating data
4. tags.csv - User-assigned tags for the movies within the dataset
5. links.csv - File providing a mapping between a movie ID, IMDB IDs and TMDB IDs
6. movies.csv - File providing the title, genres and movie ID of each movie, which can be used to merge with the other related datasets
7. imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file
8. genome_tags.csv - User-assigned tags for genome-related scores

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section we perform an in-depth analysis of all the variables in the various DataFrames. |

---


Let's first take a look at the shape of all the datasets in order to have a general overview.

In [4]:
# List of the imported DataFrames
dfs = [train, test, genome_scores, genome_tags, imdb_data, links, movies, tags]
# Names of the imported datasets, in the same order
df_names = ['train', 'test', 'genome_scores', 'genome_tags',
            'imdb_data', 'links', 'movies', 'tags']
# Map each dataset name to its [rows, columns] shape
dfs_dict = {name: list(data.shape) for name, data in zip(df_names, dfs)}
# Build the overview table once, then sort by row count
df_prop = pd.DataFrame(dfs_dict, index=['rows', 'columns']).transpose()
df_properties = df_prop.sort_values(by='rows', ascending=False)

df_properties  # view the final output

#### Viewing of data and missing values 

Let's take a quick look at each of the datasets provided, along with some basic information about them.

In [31]:
train.head()

In [6]:
print('INFO OF DATASET')
train.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
train.isnull().sum()

In [30]:
test.head()

In [8]:
print('INFO OF DATASET')
test.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
test.isnull().sum()

In [29]:
genome_scores.head()

In [10]:
print('INFO OF DATASET')
genome_scores.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
genome_scores.isnull().sum()

In [32]:
genome_tags.head()

In [12]:
print('INFO OF DATASET')
genome_tags.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
genome_tags.isnull().sum()

In [33]:
imdb_data.head()

In [14]:
print('INFO OF DATASET')
imdb_data.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
imdb_data.isnull().sum()

We can clearly see how many values are missing in each column. Let's look at a more visual representation of the missing data below.

In [22]:
import missingno as msno

# plot bar chart of the missing values
msno.bar(imdb_data)

The bar chart above gives a clear visual representation of the extent of missing data in each column. Note that more than half of the data in the 'Budget' column is missing. If need be, we'll address these issues at a later stage. For now, let's take a look at the distribution of the missing data below.

In [24]:
# plot a matrix of the missing data 
msno.matrix(imdb_data)

We see that the missing data is quite evenly distributed across the various columns. 

In [15]:
links.head(5)

In [16]:
print('INFO OF DATASET')
links.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
links.isnull().sum()

As before, let's take a more visual look at the missing data.

In [25]:
# plot bar chart of the missing values
msno.bar(links)

Only a small fraction of the data is missing in the 'tmdbId' column.

In [26]:
# plot a matrix of the missing data 
msno.matrix(links)

The matrix above shows that only a single small section of the 'tmdbId' column has missing data.

In [17]:
movies.head(5)

In [18]:
print('INFO OF DATASET')
movies.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
movies.isnull().sum()

In [19]:
tags.head(5)

In [20]:
print('INFO OF DATASET')
tags.info()
print('    ----------------')
print('MISSING VALUES OF DATASET')
tags.isnull().sum()

Only 16 rows are missing a value in the 'tag' column. For consistency's sake, let's give a visual representation of this below.

In [27]:
# plot bar chart of the missing values
msno.bar(tags)

As expected, the bar chart above shows that the missing data is practically negligible.

In [28]:
# plot a matrix of the missing data 
msno.matrix(tags)

The missing data is not even noticeable if we represent it via a matrix. Let's provide a summary of our findings for all DataFrames below.

Upon investigation of missing values, we have found the following: 

* The links dataset has 107 missing values in the tmdbId column. This accounts for only 0.17% of that column's values.
* The tags dataset has 16 missing values in the tag column. This accounts for only 0.00015% of that column's values.
* The imdb_data dataset has missing values across a range of columns. If need be, we'll address this at a later stage.
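
Counts and percentages like those in the summary can be recomputed with a small helper. The sketch below is only illustrative: `missing_summary` is a hypothetical helper name, and `toy_links` is a made-up stand-in for `links`, not the real data.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = 100 * counts / len(df)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually have gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy stand-in for `links` (hypothetical values, not the real data)
toy_links = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "imdbId": [114709, 113497, 113228, 114885],
    "tmdbId": [862.0, None, 15602.0, 31357.0],
})
toy_summary = missing_summary(toy_links)
print(toy_summary)
```

On the notebook's data, calling `missing_summary(links)` or `missing_summary(tags)` would be one way to check the figures quoted for each dataset.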

In [None]:
# have a look at feature distributions

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section we clean the dataset, and possibly create new features - as identified in the EDA phase. |

---
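
As a starting point for this section, the sketch below shows two possible cleaning steps on toy data: dropping rows with no tag text, and splitting a genre string into one row per genre. The column names (`userId`, `movieId`, `tag`, `title`, `genres`) and the `|` separator are assumptions about the files, and the toy values are made up.

```python
import pandas as pd

# Toy stand-ins; column names and the "|" separator are assumptions
toy_tags = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "tag": ["funny", None, "classic"],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20],
    "title": ["Movie A (1995)", "Movie B (2001)"],
    "genres": ["Comedy|Romance", "Drama"],
})

# Drop the few rows that have no tag text
clean_tags = toy_tags.dropna(subset=["tag"]).reset_index(drop=True)

# Turn the genre string into one row per (movie, genre) pair
movie_genres = (
    toy_movies.assign(genre=toy_movies["genres"].str.split("|"))
    .explode("genre")
    .drop(columns="genres")
    .reset_index(drop=True)
)
print(clean_tags)
print(movie_genres)
```

Whether dropping rows is appropriate depends on how much data is missing; for a column with many gaps, imputing or excluding the column may be a better fit.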

In [None]:
# remove missing values/ features

In [None]:
# create new features

In [None]:
# engineer existing features

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section we create one or more models. |

---
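
Before fitting library models, a simple reference point can be useful. The sketch below is not the team's chosen model: it is a plain bias baseline that predicts the global mean rating plus a per-user and a per-movie offset. The column names (`userId`, `movieId`, `rating`), the 0.5 to 5.0 rating scale, and the helper names `fit_baseline` and `predict` are all assumptions made for illustration.

```python
import pandas as pd

# Toy ratings; the column names userId, movieId and rating are assumptions
toy_train = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 2.0, 3.0, 4.0],
})


def fit_baseline(ratings: pd.DataFrame) -> dict:
    """Estimate a global mean plus per-user and per-movie offsets."""
    mu = ratings["rating"].mean()
    # How far each user's ratings sit from the global mean, on average
    user_bias = (ratings["rating"] - mu).groupby(ratings["userId"]).mean()
    # What is left over per movie once the user offset is removed
    residual = ratings["rating"] - mu - ratings["userId"].map(user_bias)
    movie_bias = residual.groupby(ratings["movieId"]).mean()
    return {"mu": mu, "user": user_bias, "movie": movie_bias}


def predict(model: dict, user_id: int, movie_id: int) -> float:
    """Predict a rating; unseen users or movies fall back to a zero offset."""
    estimate = (
        model["mu"]
        + model["user"].get(user_id, 0.0)
        + model["movie"].get(movie_id, 0.0)
    )
    # Clip to the 0.5 to 5.0 scale assumed here
    return float(min(5.0, max(0.5, estimate)))


model = fit_baseline(toy_train)
print(predict(model, user_id=3, movie_id=10))
```

Any collaborative or content-based model built in this section should at least beat a baseline of this kind on held-out ratings.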

In [None]:
# split data

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section we compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
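
One common way to compare rating predictors on a holdout set is root mean squared error (RMSE), where a lower value is better. The sketch below uses made-up holdout ratings and predictions from two hypothetical models, `model_a` and `model_b`; none of these numbers come from the notebook's data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical holdout ratings and predictions from two made-up models
holdout = [4.0, 3.0, 5.0, 2.0]
predictions = {
    "model_a": [3.5, 3.0, 4.5, 2.5],
    "model_b": [4.0, 2.0, 5.0, 3.0],
}

# Score every model on the same holdout set and pick the lowest error
scores = {name: rmse(holdout, preds) for name, preds in predictions.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Scoring every model on the same holdout ratings keeps the comparison fair, since differences in RMSE then reflect the models rather than the data split.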

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section we discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss the chosen method's logic