<a href="https://colab.research.google.com/github/endiesworld/2110ACDS_T7_C_Predict/blob/main/2110ACDS_T7_starter_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EDSA Movie Recommendation 2022

© Explore Data Science Academy

---
### Honour Code

{**2110ACDS_T6**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.


  

<h2><center> EDSA Movie Recommendation 2022</center></h2>
<figure>
<center><img src="./assets/movies.png" width="800" height="500"/></center>
</figure>

*Introduction*
<p align = "justify">A recommender system seeks to predict or filter preferences according to a user's choices. Recommender systems are used in a variety of areas; in this project we use one to recommend movies to movie lovers.


*About the problem*
<p align = "justify">PUT PROBLEM STATEMENT HERE.

*Objective*
<p align = "justify"> We aim to provide an accurate and robust solution to this problem by giving users of this product personalised recommendations, and by generating platform affinity for the streaming services that best facilitate their audience's viewing.

*Process*
<p align = "justify"> In order to achieve this objective the team will follow the process below:-

1. analyse the supplied data, identify potential errors in the data and clean the existing data set;

2. determine if additional features can be added to enrich the data set;

3. build a model that is capable of predicting how a user will rate a movie;

4. evaluate the accuracy of the best machine learning model;

5. accurately predict how a user will rate a movie they have not yet viewed, based on their historical preferences; and

6. explain the inner working of the model to a non-technical audience.

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [4]:

# Import comet_ml at the top of your file
import os
from comet_ml import Experiment

# Create an experiment; the API key is read from the COMET_API_KEY
# environment variable so that the secret is not stored in the notebook
experiment = Experiment(
    api_key=os.environ.get("COMET_API_KEY"),
    project_name="movie-recommendation",
    workspace="emmanuelokoro",
    log_code=True
)

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/emmanuelokoro/movie-recommendation/5c35e27c7446402ea92a85342ea3517e



In [2]:
# Libraries for loading, manipulating, and visualising data
import numpy as np
import pandas as pd
import scipy as sp # <-- The sister of NumPy, used in our code for numerical efficiency
import matplotlib.pyplot as plt
import seaborn as sns

# Entity featurization and similarity computation
from sklearn.metrics.pairwise import cosine_similarity 
from sklearn.feature_extraction.text import TfidfVectorizer

# Libraries used during sorting procedures.
import operator # <-- Convenient item retrieval during iteration
import heapq # <-- Efficient retrieval of the largest or smallest items in large lists

# Suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# Setting global constants to ensure notebook results are reproducible

RANDOM_STATE = 42



<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

### 2.1 Brief description of the data



In [5]:
# load the data


In [6]:
# Preview train dataset


#### Dataset summary


<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


### 3.1 Exploratory Data Analysis
>*What is Exploratory data analysis?*

>Exploratory data analysis (EDA) is the process of analysing and investigating data sets and summarizing their main characteristics, often employing both non-graphical and graphical methods. 

>*Why is conducting EDA important?*

>It helps determine how best to manipulate the data to get the required answers, and it exposes trends, patterns, and relationships that are not readily apparent, i.e. it provides insight into the dataset.

>*How is EDA conducted?*

>EDA can be conducted in the following ways:
- **Univariate**:- \
    i. **non-graphical**:- This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Because only a single variable is involved, it does not deal with causes or relationships.\
    ii. **graphical**:- Non-graphical methods do not provide a full picture of the data, so graphical methods are also required. These involve visual exploratory analysis of the data.
- **Multivariate**:-  \
    i. **non-graphical**:- Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics. \
    ii. **graphical**:- Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.
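
As a rough illustration of the non-graphical techniques listed above, the sketch below computes a univariate frequency table and a multivariate cross-tabulation with pandas. The `ratings` DataFrame and its `genre` and `rating` columns are hypothetical and are not taken from the project data.

```python
import pandas as pd

# Hypothetical ratings sample; the column names are assumptions
ratings = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3, 3],
    "genre": ["Drama", "Comedy", "Drama", "Drama", "Comedy", "Comedy"],
    "rating": [4.0, 3.0, 5.0, 4.0, 2.0, 3.0],
})

# Univariate, non-graphical: how often each rating value occurs
rating_counts = ratings["rating"].value_counts().sort_index()

# Multivariate, non-graphical: cross-tabulation of genre against rating
genre_by_rating = pd.crosstab(ratings["genre"], ratings["rating"])

# Multivariate statistic: mean rating per genre
mean_by_genre = ratings.groupby("genre")["rating"].mean()

print(rating_counts)
print(genre_by_rating)
print(mean_by_genre)
```

Calling `genre_by_rating.plot(kind="bar")` on the cross-tabulation would produce the grouped bar chart described under multivariate graphical EDA.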

### 3.2 Univariate Non-Graphical Analysis
>For this analysis, we will run the following checks on the dataset:  \
    >>i.  Check for the presence of *null* values  \
    >>ii. Descriptive statistics: *mean, std, minimum, quartiles, maximum, and kurtosis*  \
    >>iii. Dataset data types
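
The three checks above map onto standard pandas calls. The sketch below runs them on a small hypothetical DataFrame; the column names and values are assumptions for illustration, and in the notebook the same calls would be applied to the loaded training data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing rating
df = pd.DataFrame({
    "userId": [1, 1, 2, 3],
    "movieId": [10, 20, 10, 30],
    "rating": [4.0, np.nan, 5.0, 3.0],
})

# i. Number of null values per column
null_counts = df.isnull().sum()

# ii. Descriptive statistics: count, mean, std, min, quartiles, max
summary = df.describe()

# iii. Data type of each column
column_types = df.dtypes

print(null_counts)
print(summary)
print(column_types)
```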

In [7]:
# Check data types for all columns


#### Summarize the above.  

In [8]:
# look at data statistics


#### summarize the above.

 **Descriptive Statistics**

>Descriptive statistics summarize the data by computing values such as the mean, median, mode, and standard deviation. They describe the dataset in a simpler way through:

*   Measures of central tendency 
>*  Mean:- The average value 
>*  Median:- The midpoint value 
>*  Mode:- The most common value

*   Measures of spread  
>* Percentiles:- A percentile is the value below which a given percentage of the observations fall.
>* Standard deviation:- A number that describes how spread out the values are.
*  Measures of symmetry 
>* Skewness:- A measure of the asymmetry of the probability distribution of a real-valued random variable about its mean.
>>* If skewness is less than -1 or greater than 1, the distribution is highly skewed.
>>* If skewness is between -1 and -0.5 or between 0.5 and 1, the distribution is moderately skewed.
>>* If skewness is between -0.5 and 0.5, the distribution is approximately symmetric. 
*  Measures of peakedness 
>* Kurtosis:- A measure of how heavy or light the tails of a probability distribution are relative to a normal distribution, often loosely described as peakedness. A normal distribution has a kurtosis of 3 and is called mesokurtic. Kurtosis >3 is called leptokurtic and indicates heavier tails, typically with a sharper peak; kurtosis <3 is called platykurtic and indicates lighter tails, typically with a broader, flatter peak (lepto=thin; platy=broad).
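
The measures above can be computed with pandas, as in the sketch below. The `ratings` values are made up for illustration. One point to note: `Series.kurt()` returns *excess* kurtosis, for which a normal distribution scores 0, so 3 is added to compare against the thresholds quoted above.

```python
import pandas as pd

# Hypothetical rating values with one unusually low rating
ratings = pd.Series([0.5, 3.0, 3.5, 4.0, 4.0, 4.0, 4.5, 5.0], name="rating")

# Central tendency
mean = ratings.mean()
median = ratings.median()
mode = ratings.mode().iloc[0]

# Spread
quartiles = ratings.quantile([0.25, 0.5, 0.75])
std = ratings.std()

# Symmetry: negative values indicate a longer left tail
skewness = ratings.skew()

# Tail weight: pandas reports excess kurtosis (normal distribution = 0)
excess_kurtosis = ratings.kurt()
kurtosis = excess_kurtosis + 3

print(mean, median, mode, std)
print(quartiles)
print(skewness, kurtosis)
```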








In [9]:
# look at data statistics


### 3.3 Univariate graphical inspection of data


### 3.5 Key Insights from EDA 


<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more models that are able to accurately predict how a user will rate a movie. |

---
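
One possible starting point, given the `TfidfVectorizer`, `cosine_similarity`, and `heapq` imports earlier in the notebook, is a content-based rating predictor. The sketch below is only an illustration under stated assumptions, not the team's final model: the movie IDs, genre strings, user ratings, and the `predict_rating` helper are all hypothetical.

```python
import heapq

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical movie metadata: one genre string per movie
movie_ids = [10, 20, 30, 40]
genres = ["action adventure", "action thriller", "romance drama", "romance comedy"]

# Hypothetical ratings given by a single user: movieId -> rating
user_ratings = {10: 5.0, 30: 2.0, 40: 1.0}

# Featurise the genre text and compute movie-to-movie cosine similarity
tfidf = TfidfVectorizer().fit_transform(genres)
similarity = cosine_similarity(tfidf)


def predict_rating(target_id, k=2):
    """Similarity-weighted average of the user's ratings for the k most similar movies."""
    target_idx = movie_ids.index(target_id)
    # (similarity to target, user's rating) for every movie the user has rated
    candidates = [
        (similarity[target_idx, movie_ids.index(mid)], rating)
        for mid, rating in user_ratings.items()
        if mid != target_id
    ]
    top_k = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        # No similar rated movie: fall back to the user's mean rating
        return float(np.mean(list(user_ratings.values())))
    return float(sum(sim * rating for sim, rating in top_k) / total_sim)


predicted = predict_rating(20)
print(predicted)
```

Here movie 20 shares the term "action" only with movie 10, so the prediction is driven entirely by the user's rating of movie 10.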

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance
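
A minimal sketch of such a comparison follows, assuming root mean squared error (RMSE) as the metric; the notebook does not specify one, so this is an assumption. The holdout ratings, model names, and predictions are made-up placeholders.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical holdout ratings and predictions from two hypothetical models
holdout_ratings = [4.0, 3.0, 5.0, 2.0]
predictions = {
    "global_mean": [3.5, 3.5, 3.5, 3.5],
    "content_based": [4.0, 3.5, 4.5, 2.5],
}

# Score each model on the holdout set; lower RMSE is better
scores = {name: rmse(holdout_ratings, preds) for name, preds in predictions.items()}
best_model = min(scores, key=scores.get)

print(scores)
print(best_model)
```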

In [None]:
# Choose best model and motivate why it is the best choice

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic

Run the next cell to make sure the experiment has ended. It notifies Comet.

In [None]:
experiment.end()