<a href="https://colab.research.google.com/github/drikus-d/unsupervised-predict-streamlit-template/blob/main/Macs_Unsupervised_Predict.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**<font color='red'> Movie Recommender Systems: Unsupervised Learning - EDSA</font>**

© Explore Data Science Academy

Team 5
##**<font color='green'> Movie Recommender System</font>**

<p align="justify" > 

This notebook is laid out as follows:

1. Introduction and Problem Statement
2. Importing libraries and loading data
3. Exploratory Data Analysis
4. Data pre-processing
5. Build and evaluate models
6. Conclusion
7. Submission


##**<font color='cyan'>Problem Statement:</font>**
Develop an unsupervised machine learning model that accurately predicts how a user would rate a movie they have not yet seen, based on their past ratings and browsing history, using content-based and/or collaborative filtering.

##**<font color='purple'>Task: Movie Recommender System</font>**
<p align="justify" > To construct a recommendation algorithm based on content or collaborative filtering, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.

<div align="center" style="width: 800px; font-size: 100%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/drikus-d/unsupervised-predict-streamlit-template/master/DataSets/reccomnd%20pic.jpg" alt="Movie recommender illustration"/>

</div>


# Importing libraries

In this section we import all the relevant packages that will be used for analysis and modelling.

In [None]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Libraries for data preparation and model building
# Import the scaling module
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample
# Import train/test split module
from sklearn.model_selection import train_test_split

# Modelling
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import mean_squared_error, mean_absolute_error

# NLP Libraries
import nltk
import string
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer,SnowballStemmer
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.pipeline import Pipeline

# Packages for modeling
from surprise import Reader, Dataset, KNNWithMeans, KNNBasic, SVD, SVDpp, NMF, SlopeOne, CoClustering
from surprise.model_selection import cross_validate, GridSearchCV

# Loading Datasets

Load the data that will be used to build our recommender model.

In [None]:
# Loading in the datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

movies = pd.read_csv('https://raw.githubusercontent.com/drikus-d/unsupervised-predict-streamlit-template/master/DataSets/movies.csv')
imdb_data = pd.read_csv('https://raw.githubusercontent.com/drikus-d/unsupervised-predict-streamlit-template/master/DataSets/imdb_data.csv')
links_df = pd.read_csv('https://raw.githubusercontent.com/drikus-d/unsupervised-predict-streamlit-template/master/DataSets/links.csv')
genome_tags = pd.read_csv('https://raw.githubusercontent.com/drikus-d/unsupervised-predict-streamlit-template/master/DataSets/genome_tags.csv')

genome_score = pd.read_csv('genome_scores.csv')
tags = pd.read_csv('tags.csv')

# Data Overview

<p align="justify" > This dataset consists of several million 5-star ratings obtained from users of the online MovieLens movie recommendation service. The MovieLens dataset has long been used by industry and academic researchers to improve the performance of recommender systems based on explicit ratings.

**Data Description:**

<p align="justify" > 
genome_scores.csv - a score mapping the strength between movies and tag-related properties. 

genome_tags.csv - User-assigned tags for genome-related scores.

imdb_data.csv - Additional movie metadata scraped from IMDB using the links.csv file.

links.csv - File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs.

sample_submission.csv - Sample of the submission format for the hackathon.

tags.csv - User-assigned tags for the movies within the dataset.

test.csv - The test split of the dataset. Contains user and movie IDs with no rating data.

train.csv - The training split of the dataset. Contains user and movie IDs with associated rating data.
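
To illustrate how these files relate to one another, here is a minimal sketch that joins ratings to movie metadata on a shared movie ID. The two toy frames stand in for `train.csv` and `movies.csv`; the column names `userId`, `rating` and `title` are assumptions about the file layout, and all values are made up.

```python
import pandas as pd

# Toy stand-ins for train.csv and movies.csv (values are made up)
toy_train = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({
    "movieId": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["Comedy", "Drama|Romance", "Action"],
})

# A left join keeps every rating, even if a movie has no metadata row
rated = toy_train.merge(toy_movies, on="movieId", how="left")

# Average rating and number of ratings per title
summary = (
    rated.groupby("title")["rating"]
    .agg(["mean", "count"])
    .sort_values("count", ascending=False)
)
print(summary)
```

The same pattern should apply to the other files that carry a movie ID, although their exact column names would need to be checked first.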

<a id="three"></a>
# Exploratory Data Analysis


<p align="justify" > This section provides an in-depth EDA, which allows us to gain deeper insight into the dimensions and features of our data.

We will take a closer look at our data to check for obvious and hidden relationships that exist across our different datasets.

In [None]:
#Checking the shape of the train and test set
train.shape,test.shape

In [None]:
# Percentage of missing values per column in the training set
round((train.isnull().sum() / train.shape[0]) * 100, 2).astype(str) + ' %'



Missing Data:

In [None]:
train.info()

In [None]:
movies.info()

In [None]:
# Create dataframe containing only the movieId and genres
movies_genres = pd.DataFrame(movies[['movieId', 'genres']],
                             columns=['movieId', 'genres'])

# Split genres separated by "|" and create a list containing the genres allocated to each movie
movies_genres.genres = movies_genres.genres.apply(lambda x: x.split('|'))

# Create expanded dataframe where each movie-genre combination is in a separate row
movies_genres = pd.DataFrame([(tup.movieId, d) for tup in movies_genres.itertuples() for d in tup.genres],
                             columns=['movieId', 'genres'])

movies_genres.head()

### Data cleaning



<p align="justify" >  Data cleaning is the process of detecting and correcting corrupt or inaccurate records from the dataset and identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data.
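
As a rough illustration of what such cleaning might look like for a ratings table, the sketch below drops incomplete records, ratings outside the expected range, and repeated user-movie pairs. The helper `clean_ratings`, the column names `userId`, `movieId` and `rating`, and the 0.5 to 5.0 range are assumptions made for this example; the toy values are made up.

```python
import pandas as pd

def clean_ratings(ratings, min_rating=0.5, max_rating=5.0):
    """Drop incomplete, out-of-range and duplicate rating records."""
    # Remove rows that are missing any of the key fields
    cleaned = ratings.dropna(subset=["userId", "movieId", "rating"])
    # Keep only ratings inside the assumed valid range (inclusive)
    cleaned = cleaned[cleaned["rating"].between(min_rating, max_rating)]
    # If a user rated the same movie more than once, keep the last record
    cleaned = cleaned.drop_duplicates(subset=["userId", "movieId"], keep="last")
    return cleaned.reset_index(drop=True)

# Toy data with one missing rating, one impossible rating and one duplicate
toy = pd.DataFrame({
    "userId": [1, 1, 2, 2, 3],
    "movieId": [10, 10, 10, 20, 20],
    "rating": [4.0, 4.5, None, 7.0, 3.0],
})
cleaned_toy = clean_ratings(toy)
print(cleaned_toy)
```

Whether to keep the first or last duplicate, or to average them, is a judgment call that depends on how the ratings were collected.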

### Evaluation Metric

The evaluation metric used to gauge model performance is the Root Mean Square Error (RMSE). RMSE is commonly used in regression analysis and forecasting, and measures the standard deviation of the residuals between predicted and observed values. For our task of generating user movie ratings via recommendation algorithms, the formula is given by:

$$RMSE = \sqrt{\frac{1}{|R|} \sum_{(u,i) \in R} \left(r_{ui} - \hat{r}_{ui}\right)^2}$$

where $R$ is the set of recommendations generated for users and movies, $|R|$ is their total number, and $r_{ui}$ and $\hat{r}_{ui}$ are the true and predicted ratings for user $u$ watching movie $i$, respectively.
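
A minimal sketch of this metric in NumPy is shown here for concreteness. The function name `rmse` and the toy ratings are assumptions for illustration. `mean_squared_error` from scikit-learn, imported earlier, can serve the same purpose by taking the square root of its result.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error between true and predicted ratings."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy ratings (made up for illustration only)
true_ratings = [4.0, 3.0, 5.0, 2.0]
predicted_ratings = [3.5, 3.0, 4.0, 2.5]
example_rmse = rmse(true_ratings, predicted_ratings)
print(round(example_rmse, 4))
```

Because the residuals are squared before averaging, a few large misses are penalised more heavily than many small ones.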

# Collaborative Filtering

This is an approach that builds a model from a user's past behaviour (items previously purchased or selected and/or numerical ratings given to those items) as well as similar decisions made by other users. This model is then used to predict items (or ratings for items) that the user may have an interest in.
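
The idea can be sketched with a small user-based example: predict a missing rating as a similarity-weighted average of the ratings other users gave that movie. This is only one simple variant, not necessarily the model used in this notebook. The helper names, the column names and the toy ratings are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy ratings (made up); user 1 has not rated movie 40
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4],
    "movieId": [10, 20, 30, 10, 20, 40, 10, 30, 20, 30, 40],
    "rating":  [5.0, 4.0, 1.0, 5.0, 5.0, 2.0, 4.0, 1.0, 1.0, 5.0, 4.0],
})

# Rows are users, columns are movies, NaN marks an unrated movie
matrix = ratings.pivot(index="userId", columns="movieId", values="rating")

def cosine_on_common(a, b):
    """Cosine similarity over the movies both users have rated."""
    common = a.notna() & b.notna()
    if common.sum() == 0:
        return 0.0
    x, y = a[common].to_numpy(), b[common].to_numpy()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def predict_rating(matrix, user_id, movie_id):
    """Similarity-weighted average of other users' ratings for movie_id."""
    target = matrix.loc[user_id]
    num, den = 0.0, 0.0
    for other_id, other in matrix.iterrows():
        # Skip the target user and anyone who has not rated this movie
        if other_id == user_id or np.isnan(other[movie_id]):
            continue
        sim = cosine_on_common(target, other)
        num += sim * other[movie_id]
        den += abs(sim)
    # Fall back to the user's own mean when no neighbour rated the movie
    return float(num / den) if den > 0 else float(target.mean())

predicted = predict_rating(matrix, user_id=1, movie_id=40)
print(round(predicted, 2))
```

User 1's tastes resemble user 2's more than user 4's, so the prediction lands closer to user 2's rating of 2.0 than to user 4's rating of 4.0. Cosine similarity on raw ratings is a crude choice; mean-centring each user's ratings first is a common refinement. Neighbourhood models such as `KNNBasic` and `KNNWithMeans` from the Surprise package imported earlier handle this kind of computation at scale.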

# Content-based Filtering

This is an approach that utilizes a series of discrete characteristics of an item in order to recommend additional items with similar properties.
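
As a minimal sketch of this idea, the example below treats each movie's genres as its discrete characteristics, encodes them as binary vectors, and recommends the movies whose vectors have the highest cosine similarity. The toy titles, the `most_similar` helper and the choice of genres as the only feature are assumptions for illustration; a fuller model might add tags or metadata through `TfidfVectorizer`.

```python
import numpy as np
import pandas as pd

# Toy movie metadata (made up), using the "|"-separated genre format
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "genres": ["Adventure|Children|Fantasy", "Adventure|Fantasy",
               "Comedy|Romance", "Comedy"],
})

# One binary column per genre, one row per movie
genre_features = toy_movies["genres"].str.get_dummies(sep="|").to_numpy(dtype=float)

# Cosine similarity between every pair of movies
norms = np.linalg.norm(genre_features, axis=1, keepdims=True)
unit = genre_features / norms
similarity = unit @ unit.T

def most_similar(title, top_n=1):
    """Return the titles of the top_n movies most similar to the given title."""
    idx = toy_movies.index[toy_movies["title"] == title][0]
    # Sort by descending similarity and drop the movie itself
    order = [i for i in np.argsort(-similarity[idx]) if i != idx][:top_n]
    return toy_movies.loc[order, "title"].tolist()

recommended = most_similar("Movie A")
print(recommended)
```

This approach needs no ratings from other users, which helps with new movies, but it can only recommend items that resemble what a user already liked.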

### Submission

In [None]:
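# NOTE: the column names below are still blank placeholders and `pred` must hold the model's predictions before this cell will run
sub = pd.DataFrame({"":test[''], '': pred})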
sub = sub.set_index('')
sub.to_csv('submission.csv',index=True)
sub.head()

## To Comet:

##**<font color='green'>Conclusion: </font>**

<p align="justify" > 

##**<font color='green'>Recommendations: </font>**

<p align="justify" > 