<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Exercise.png"  style="display: block; margin-left: auto; margin-right: auto;";/>
</div>

# Exercise: Recommender systems
© ExploreAI Academy

In this exercise, we will build a content-based recommendation system using a dataset of Netflix titles. We will preprocess the text data, convert it into numerical features with TF-IDF, and compute item similarities to generate recommendations. This hands-on activity will help us understand and implement key techniques in content-based filtering.

## Learning objectives

By the end of this exercise, you should be able to:
* Understand content-based recommendation systems.
* Clean and preprocess text data.
* Convert text data into numerical features using TF-IDF.
* Compute item similarities using cosine similarity.
* Build and evaluate a content-based recommendation model.

## Introduction

In this notebook, we will build a `content-based recommendation system` using the `Netflix` dataset. The primary goal of this task is to recommend similar titles to users based on the attributes of the media they have already interacted with. This will enhance the user experience by providing personalised content recommendations, thereby increasing user engagement and satisfaction. By predicting which titles a user might enjoy based on their previous interactions, content-based recommendation systems help platforms like `Netflix` keep users engaged and encourage them to explore a broader range of content.

The dataset is derived from Netflix's collection of movies and TV shows. This dataset includes various attributes for each title, such as:

* show_id: Unique identifier for each title.
* type: The type of media (e.g., Movie, TV Show).
* title: The name of the media.
* director: Directors involved in the media.
* cast: Main actors involved in the media.
* country: Countries where the media was produced.
* date_added: The date when the media was added to Netflix.
* release_year: The year the media was released.
* rating: The rating given to the media.
* duration: Duration of the media (e.g., 90 min, 1 Season).
* listed_in: Categories or genres the media belongs to.
* description: Brief summary or synopsis of the media.

The data was collected to provide a comprehensive overview of the available media on `Netflix`. It allows for detailed analysis and exploration of the media's attributes, which is essential for building a recommendation system.

Let's dive in!

Import the necessary libraries and read the data.

In [None]:
# Import necessary libraries
import numpy as np 
import pandas as pd 

# For text handling and regular expressions
import re
from sklearn.feature_extraction.text import TfidfVectorizer # For converting text to numerical data

# For computing cosine similarity
from sklearn.metrics.pairwise import linear_kernel


In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/unsupervised_sprint/netflix_titles.csv', index_col=0)
data.head()

## Exercises

In this exercise, we focus on the columns `cast`, `title`, `description`, and `listed_in`. These text features describe what each title is about, who stars in it, and which genres it belongs to, which is the information a content-based filtering approach needs to measure similarity between media items.

### Exercise 1: Data cleaning and preprocessing

Before proceeding with our recommender system, we need to clean and process our data first to get the most accurate results. 

We need to do the following:

* Remove rows with missing or NaN values.
<br>
<br>
* Remove punctuation and extra spaces in the text data. This helps to standardise and clean the text, ensuring consistency in the dataset and facilitating accurate analysis and modelling by eliminating unnecessary noise and variations in the text.

**Hint**: 
> * For all the text columns, remove all characters that are not alphanumeric or whitespace.
> * For the 'cast' column, first remove all spaces and then replace commas with spaces. This ensures that the cast members' names are treated as single entities separated by spaces.
<br>
<br>
* Combine the columns `listed_in`, `cast`, `title`, and `description` into a single feature for the recommendation system. This creates a richer and more complete representation of each item, enhancing the effectiveness of the recommendation system by allowing it to consider all aspects of the content simultaneously.<br> 
**Hint**: Remember to drop the individual columns once they are combined, except `title`, which is still needed to identify each item.
<br>
<br>   
* Drop the remaining columns so that only `type`, `title`, and `combined` are left. `type` and `title` provide context and identification, while `combined` is the main feature used to calculate similarities.
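
As a rough illustration of the cleaning and combining steps above, the sketch below applies them to a single made-up record rather than to the full DataFrame. The helper names `strip_punctuation` and `squash_cast`, and all sample values, are hypothetical and only serve to show the effect of each regular expression.

```python
import re

def strip_punctuation(text):
    """Remove every character that is not alphanumeric or whitespace."""
    return re.sub(r'[^\w\s]', '', text)

def squash_cast(cast):
    """Join each name into one token, then separate the names with spaces."""
    return re.sub(',', ' ', re.sub(' ', '', cast))

# Made-up record used purely for illustration
sample = {
    'listed_in': 'Dramas, International Movies',
    'cast': 'Jane Doe, John Smith',
    'title': 'A Title: Part 2',
    'description': 'A short, invented synopsis!',
}

# Combine the cleaned fields into one text feature
combined = ' '.join([
    strip_punctuation(sample['listed_in']),
    squash_cast(sample['cast']),
    strip_punctuation(sample['title']),
    strip_punctuation(sample['description']),
])
print(combined)
```

Note how `Jane Doe` becomes the single token `JaneDoe`, so the vectoriser later treats each cast member as one word instead of two unrelated ones.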


In [None]:
#Your code here

### Exercise 2: Feature extraction
Next, we want to convert the combined text feature into numerical features using TF-IDF.
This enables the application of mathematical and statistical techniques for measuring similarities between different media items. In its raw form, text data cannot be directly used for similarity calculations or machine learning algorithms. By transforming the text into numerical representations, we can leverage these techniques to analyse and compare the content effectively.

* Utilise TF-IDF to convert the `combined` column into numerical vectors, which represent the importance of words in the document. Initialise your TF-IDF Vectoriser without specifying any parameters, which means it will default to single-word tokens.
* Compute the cosine similarity between these vectors to measure how similar the titles are.
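
To get a feel for these two steps before applying them to the dataset, here is a minimal sketch on three made-up documents. The variable names (`toy_docs`, `toy_vectorizer`, `toy_matrix`, `toy_sim`) are illustrative and chosen so they do not clash with anything in the exercise.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Three made-up documents: the first two share words, the third shares none
toy_docs = [
    'royal family drama',
    'royal palace drama',
    'space robot comedy',
]

# Default settings: single-word tokens, one column per unique word
toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix, shape (3, 7)

# Pairwise cosine similarity between all documents, shape (3, 3)
toy_sim = cosine_similarity(toy_matrix)
print(toy_sim.round(2))
```

Each document has a similarity of 1 with itself, the first two documents get a score between 0 and 1 because they share some words, and the third scores 0 against the others because it shares none.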


In [None]:
#Your code here

### Exercise 3: Building the recommendation function

Now, we can generate recommendations based on cosine similarity.

Define a function that, given a title, finds similar titles by looking up their cosine similarity scores and returns the top 10 recommendations based on these scores.
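
The ranking step can be sketched on its own, assuming we already have one row of similarity scores for a query item. The scores, titles, and the helper `top_k` below are hypothetical; the exercise itself asks for the top 10 rather than the top 2.

```python
import numpy as np

# Hypothetical similarity scores between a query item (position 0) and all items
toy_titles = ['Query', 'A', 'B', 'C', 'D']
toy_scores = np.array([1.0, 0.2, 0.9, 0.0, 0.5])

def top_k(scores, query_idx, k):
    """Return the positions of the k highest scores, excluding the query itself."""
    order = np.argsort(scores)[::-1]   # positions sorted by descending score
    order = order[order != query_idx]  # the query always matches itself, so drop it
    return order[:k]

best = top_k(toy_scores, query_idx=0, k=2)
print([toy_titles[i] for i in best])  # ['B', 'D']
```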

In [None]:
#Your code here

### Exercise 4: Test the recommender

Say you are trying to decide what to watch next, and you particularly enjoyed the series `The Crown`. Run the recommender for this title and see what recommendations you get.

Would you want to watch any of these titles?


In [None]:
#Your code here

## Solutions

### Exercise 1

In [None]:
# Drop rows with NaN values in the specified columns to ensure we have complete data for these important attributes
data.dropna(subset=['cast', 'title', 'description', 'listed_in'], inplace=True, axis=0)

# Reset the index after dropping rows to maintain a clean DataFrame
data = data.reset_index(drop=True)

# Clean the text columns to ensure consistency in the text data
data['listed_in'] = [re.sub(r'[^\w\s]', '', t) for t in data['listed_in']] # Remove punctuation from 'listed_in' column
data['cast'] = [re.sub(',', ' ', re.sub(' ', '', t)) for t in data['cast']] # Remove spaces so each name is one token, then replace commas with spaces
data['description'] = [re.sub(r'[^\w\s]', '', t) for t in data['description']] # Remove punctuation from 'description'
data['title'] = [re.sub(r'[^\w\s]', '', t) for t in data['title']] # Remove punctuation from 'title'

# Combine the cleaned text data into a single 'combined' feature
# This helps in creating a unified representation of each media item for the recommendation system
data["combined"] = data['listed_in'] + ' ' + data['cast'] + ' ' + data['title'] + ' ' + data['description']

# Drop the individual columns as they are now represented in the 'combined' feature
# Also drop other columns that are not used in the recommendation model
data.drop(['director', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'cast', 'description'], axis=1, inplace=True)

# Display the first few rows of the cleaned data to verify the changes
data.head()


**Note!!**

It's important to acknowledge that dropping NaN values can potentially lead to information loss, and in situations where missing data follows a specific trend or pattern, dropping NaNs might not be ideal as it could bias the analysis. In such cases, other strategies like imputation or handling missing values through specialised techniques might be more suitable. Ultimately, the choice between dropping NaN values and handling them through imputation or other methods depends on the data, the specific context of the analysis and the goals of the study.
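
As one possible alternative to dropping rows, missing text fields can be filled with an empty string so that a title still contributes its remaining fields. The sketch below compares the two approaches on a made-up frame; the column values and the names `toy`, `dropped`, and `filled` are illustrative only, and whether this is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing cast entry
toy = pd.DataFrame({
    'title': ['First Title', 'Second Title'],
    'cast': ['Jane Doe', np.nan],
    'description': ['An invented synopsis', 'Another invented synopsis'],
})

text_cols = ['title', 'cast', 'description']

# Option 1: drop incomplete rows (the second title is lost entirely)
dropped = toy.dropna(subset=text_cols)

# Option 2: fill missing text with an empty string (both titles are kept)
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna('')

print(len(dropped), len(filled))  # 1 2
```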

### Exercise 2

In [None]:
# Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the combined text data to numerical vectors
matrix = vectorizer.fit_transform(data["combined"])

# Compute cosine similarities between the vectors.
# TfidfVectorizer L2-normalises each row by default, so
# linear_kernel(matrix, matrix) would return the same values.
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarities = cosine_similarity(matrix, matrix)

# Extract the title column for later use in recommendations
movie_title = data['title']

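# Create a series to map each title to its index in the dataset
indices = pd.Series(data.index, index=data['title'])

# Keep only the first entry for any repeated title so that a lookup returns a single index
indices = indices[~indices.index.duplicated(keep='first')]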


**Note!!**

Adjusting parameters like the n-grams in the vectoriser could capture more complex relationships, such as those between cast members. This could potentially lead to recommendations that reflect relationships like sequels or movies with similar casts. Exploring different n-gram ranges can be valuable for enhancing the recommendation system’s performance and capturing nuanced similarities within the text data.
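
To see what a different n-gram range changes, the sketch below fits two vectorisers on two made-up cast strings and compares their vocabularies. The strings and variable names are hypothetical; `ngram_range` is a standard `TfidfVectorizer` parameter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents where the same two cast members appear in a different order
cast_docs = ['JaneDoe JohnSmith drama', 'JohnSmith JaneDoe comedy']

# Single words only (the default)
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(cast_docs)

# Single words plus pairs of adjacent words
uni_bi = TfidfVectorizer(ngram_range=(1, 2)).fit(cast_docs)

print(sorted(unigrams.vocabulary_))
print(sorted(uni_bi.vocabulary_))
```

The second vocabulary also contains adjacent pairs such as `janedoe johnsmith` (tokens are lowercased by default). This lets the model reward titles where the same names appear next to each other, at the cost of a larger and sparser matrix.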

### Exercise 3

In [None]:
# Define the recommendation function
def content_recommender(title):
    # Get the index of the given title
    idx = indices[title]
    
    # Get the pairwise similarity scores of all titles with the given title
    sim_scores = list(enumerate(cosine_similarities[idx]))
    
    # Sort the titles based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Keep the top 10 matches, skipping position 0, which is the given title itself
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    
    # Return the titles of the most similar items
    return movie_title.iloc[movie_indices]
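

The function above raises a `KeyError` if the requested title is not in the index, for example because of a typo or because the row was dropped during cleaning. One way to guard against this is to check for the title first. The sketch below uses a toy stand-in for the title-to-index mapping; `toy_indices` and `lookup_position` are hypothetical names, not part of the solution.

```python
import pandas as pd

# Toy stand-in for the notebook's title-to-index mapping
toy_indices = pd.Series([0, 1, 2], index=['First Title', 'Second Title', 'Third Title'])

def lookup_position(title, index_map):
    """Return the row position for a title, or None if the title is unknown."""
    if title not in index_map.index:
        return None
    return int(index_map[title])

print(lookup_position('Second Title', toy_indices))   # 1
print(lookup_position('Missing Title', toy_indices))  # None
```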


### Exercise 4

In [None]:
# Test the recommendation function
content_recommender('The Crown')


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>