
In [1]:
!pip install nltk -q

# Building a Simple Recommendation System

### Content-Based Filtering

In this exercise, you will build a simple movie recommendation system using content-based filtering. Fill in the blanks to complete the code. The idea is to recommend movies that are similar to a given movie based on their features like genre, tags, and summary.

*   **Imports Libraries:** This cell imports the libraries the recommendation system needs: `pandas` for data manipulation, `string` and `re` for text cleaning, and `sklearn` for vectorization and similarity calculations. The two `nltk` imports (`stopwords`, `word_tokenize`) are not used in the cells that follow.

**Hints:**
- Import `CountVectorizer` from `sklearn.feature_extraction.text`
- Import `cosine_similarity` from `sklearn.metrics.pairwise`

In [None]:
import pandas as pd
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

*   **Loads Dataset:** This cell loads the 'NetflixDataset.csv' into a pandas DataFrame named `netflix_data`. The 'Title' column is set as the index for easy access.

**Hints:**
- Use `pd.read_csv()` to load CSV files
- Use `index_col` parameter to set the index column

In [None]:
netflix_data = pd._______('NetflixDataset.csv', encoding='latin-1', index_col='Title')
netflix_data.head()

| Title | Genre | Tags | Languages | Country Availability | Runtime | Director | Writer | Actors | View Rating | IMDb Score | ... | Awards Nominated For | Boxoffice | Release Date | Netflix Release Date | Production House | Netflix Link | Summary | Series or Movie | IMDb Votes | Image |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lets Fight Ghost | Crime, Drama, Fantasy, Horror, Romance | Comedy Programmes,Romantic TV Comedies,Horror ... | Swedish, Spanish | Thailand | < 30 minutes | Tomas Alfredson | John Ajvide Lindqvist | Lina Leandersson, Kåre Hedebrant, Per Ragnar, ... | R | 7.9 | ... | 57.0 | $21,22,065 | 12-Dec-08 | 04-03-2021 | Canal+, Sandrew Metronome | https://www.netflix.com/watch/81415947 | A med student with a supernatural gift tries t... | Series | 205926.0 | https://occ-0-4708-64.1.nflxso.net/dnm/api/v6/... |
| HOW TO BUILD A GIRL | Comedy | Dramas,Comedies,Films Based on Books,British | English | Canada | 1-2 hour | Coky Giedroyc | Caitlin Moran | Cleo, Paddy Considine, Beanie Feldstein, Dónal... | R | 5.8 | ... |  | $70,632 | 08-May-20 | 04-03-2021 | Film 4, Monumental Pictures, Lionsgate | https://www.netflix.com/watch/81041267 | When nerdy Johanna moves to London, things get... | Movie | 2838.0 | https://occ-0-1081-999.1.nflxso.net/dnm/api/v6... |
| The Con-Heartist | Comedy, Romance | Romantic Comedies,Comedies,Romantic Films,Thai... | Thai | Thailand | > 2 hrs | Mez Tharatorn | Pattaranad Bhiboonsawade, Mez Tharatorn, Thods... | Kathaleeya McIntosh, Nadech Kugimiya, Pimchano... |  | 7.4 | ... |  |  | 03-Dec-20 | 03-03-2021 |  | https://www.netflix.com/watch/81306155 | After her ex-boyfriend cons her out of a large... | Movie | 131.0 | https://occ-0-2188-64.1.nflxso.net/dnm/api/v6/... |
| Gleboka woda | Drama | TV Dramas,Polish TV Shows,Social Issue TV Dramas | Polish | Poland | < 30 minutes |  |  | Katarzyna Maciag, Piotr Nowak, Marcin Dorocins... |  | 7.5 | ... | 4.0 |  | 14-Jun-11 | 03-03-2021 |  | https://www.netflix.com/watch/81307527 | A group of social welfare workers led by their... | Series | 47.0 | https://occ-0-2508-2706.1.nflxso.net/dnm/api/v... |
| Only a Mother | Drama | Social Issue Dramas,Dramas,Movies Based on Boo... | Swedish | Lithuania,Poland,France,Italy,Spain,Greece,Bel... | 1-2 hour | Alf Sjöberg | Ivar Lo-Johansson | Hugo Björne, Eva Dahlbeck, Ulf Palme, Ragnar F... |  | 6.7 | ... | 1.0 |  | 31-Oct-49 | 03-03-2021 |  | https://www.netflix.com/watch/81382068 | An unhappily married farm worker struggling to... | Movie | 88.0 | https://occ-0-2851-41.1.nflxso.net/dnm/api/v6/... |


*   **Cleans Data:** This cell removes duplicate rows and converts the 'Genre', 'Tags', and 'Summary' columns to string type for text processing. Because 'Title' is the index, `drop_duplicates()` does not compare titles: rows that share a title but differ in any column are kept.

**Hints:**
- Use `.astype('str')` to convert columns to string type
- Use `.drop_duplicates()` to drop duplicate rows

In [None]:
# Remove duplicate rows (the 'Title' index is not compared)
netflix_data = netflix_data.drop_duplicates()

# Convert columns to string type
netflix_data['Genre'] = netflix_data['Genre'].______('str')  # Fill: method to change data type
netflix_data['Tags'] = netflix_data['Tags'].astype('str')
netflix_data["Summary"] = netflix_data["Summary"].astype('str')

print(f"Dataset shape: {netflix_data.shape}")
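
The difference between dropping duplicate rows and dropping duplicate titles is easy to miss, so here is a minimal sketch on a made-up toy frame (the titles and genres are hypothetical, not taken from the dataset). It contrasts `drop_duplicates()` with `Index.duplicated()`, which is one way to filter on the index instead; the notebook itself does not use the second approach.

```python
import pandas as pd

# Hypothetical toy data: title "A" appears three times, two rows fully identical.
toy = pd.DataFrame(
    {"Genre": ["Drama", "Drama", "Comedy", "Horror"]},
    index=pd.Index(["A", "A", "A", "B"], name="Title"),
)

# drop_duplicates() ignores the index: it only removes rows whose column values repeat.
by_row = toy.drop_duplicates()

# Index.duplicated() flags repeated titles; ~ keeps the first occurrence of each title.
by_title = toy[~toy.index.duplicated(keep="first")]

print(by_row.index.tolist())    # ['A', 'A', 'B']
print(by_title.index.tolist())  # ['A', 'B']
```

Which behaviour is wanted depends on whether repeated titles in the dataset are true duplicates or different works that happen to share a name.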

*   **Defines Preprocessing Function:** This cell defines a `preprocess_text` function that lowercases the text, deletes punctuation, and collapses runs of whitespace into a single space. This puts every title's text into a consistent form before similarity comparison.

**Hints:**
- Use `.lower()` to convert text to lowercase
- Use `.translate()` with `str.maketrans('', '', string.punctuation)` to remove punctuation
- Use `re.sub()` to replace multiple spaces with single space

In [None]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.______()  # Fill: method to convert to lowercase
    # Remove punctuation
    text = text._______(str.maketrans('', '', string.______))  # Fill: string method that applies the translation table; attribute containing all punctuation characters
    # Remove extra spaces using regex
    text = re.sub(r"\s+", " ", text)
    return text

# Test the function
sample_text = "Action, Comedy, Drama!"
print(f"Original: {sample_text}")
print(f"Preprocessed: {_________(sample_text)}")

Original: Action, Comedy, Drama!
Preprocessed: action comedy drama
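
The sample string has a space after every comma, which hides a side effect: when punctuation is deleted outright, values separated only by a comma are fused into one token. The sketch below shows this on a tag string in the dataset's format and compares it with a hypothetical variant, `space_punctuation`, that replaces each punctuation character with a space. The variant is an illustration under that assumption, not part of the exercise.

```python
import re
import string

def strip_punctuation(text):
    # Same approach as the cell above: punctuation characters are deleted outright.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def space_punctuation(text):
    # Hypothetical variant: each punctuation character becomes a space instead,
    # so tokens separated only by a comma stay separate.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.lower().translate(table)).strip()

tags = "Dramas,Comedies,Films Based on Books"
print(strip_punctuation(tags))  # dramascomediesfilms based on books
print(space_punctuation(tags))  # dramas comedies films based on books
```

With the first function, "dramascomediesfilms" becomes a single vocabulary word that only matches titles with exactly the same tag sequence, so the choice affects which titles end up looking similar.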


*   **Creates Combined Features:** This cell creates a new column called 'combined_features' by combining the 'Genre', 'Tags', and 'Summary' columns. This combined text will be used to find similar movies based on their content.

**Hints:**
- Use `+` operator to concatenate strings
- Add spaces between features using `' '`
- Use `.apply()` to apply the preprocessing function to the column

In [None]:
# Combine Genre, Tags, and Summary into a single feature
netflix_data['combined_features'] = netflix_data['Genre'] + ' ' + netflix_data[______] + ' ' + netflix_data[______]  # Fill: two column names to add

# Apply preprocessing to combined features
netflix_data['______'] = netflix_data['combined_features'].______(preprocess_text)  # Fill: method to apply function to each row

# Display sample
______[['Genre', 'Tags', 'combined_features']]._______()

| Title | Genre | Tags | combined_features |
|---|---|---|---|
| Lets Fight Ghost | Crime, Drama, Fantasy, Horror, Romance | Comedy Programmes,Romantic TV Comedies,Horror ... | crime drama fantasy horror romance comedy prog... |
| HOW TO BUILD A GIRL | Comedy | Dramas,Comedies,Films Based on Books,British | comedy dramascomediesfilms based on booksbriti... |
| The Con-Heartist | Comedy, Romance | Romantic Comedies,Comedies,Romantic Films,Thai... | comedy romance romantic comediescomediesromant... |
| Gleboka woda | Drama | TV Dramas,Polish TV Shows,Social Issue TV Dramas | drama tv dramaspolish tv showssocial issue tv ... |
| Only a Mother | Drama | Social Issue Dramas,Dramas,Movies Based on Boo... | drama social issue dramasdramasmovies based on... |


*   **Creates Count Matrix:** This cell uses `CountVectorizer` to convert the text into a matrix of word counts. Each row represents a movie, and each column represents a unique word. This is similar to the vocabulary and encoding concepts we learned earlier.

**Hints:**
- Create a CountVectorizer object with `stop_words='english'` to remove common words
- Use `.fit_transform()` to convert text to count matrix

In [None]:
# Create CountVectorizer to convert text to word count matrix
count_vectorizer = _____________(______='english')  # Fill: parameter to remove common words

# Fit and transform the combined features
count_matrix = _________.______(_______['combined_features'])  # Fill: method to fit and transform data

print(f"Count Matrix Shape: {count_matrix.shape}")
print(f"Number of movies: {count_matrix.shape[0]}")
print(f"Number of unique words (vocabulary size): {count_matrix.shape[1]}")

Count Matrix Shape: (9144, 26320)
Number of movies: 9144
Number of unique words (vocabulary size): 26320
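
To make the row-per-title, column-per-word layout concrete, here is a small sketch on three made-up documents (the strings are hypothetical). It assumes scikit-learn's built-in English stop-word list, which drops "the" and "of", and reads the learned vocabulary from the fitted `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the combined_features column.
docs = ["crime drama horror", "comedy drama", "the comedy of the drama"]

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)             # ['comedy', 'crime', 'drama', 'horror']
print(matrix.toarray())  # rows: [0 1 1 1], [1 0 1 0], [1 0 1 0]
```

The second and third documents get identical rows because the stop words are removed and word order is not recorded.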


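*   **Calculates Cosine Similarity:** This cell computes the cosine similarity between every pair of movies. Cosine similarity measures the angle between two movies' word-count vectors. Because counts are never negative, the score lies between 0 and 1: 1 means the two vectors point in the same direction (the same words in the same proportions), and 0 means the movies share no vocabulary words.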

**Hints:**
- Use the `cosine_similarity()` function imported earlier
- Pass the count_matrix twice to compare all movies with each other

In [None]:
# Calculate cosine similarity between all movies
cosine_sim = ______(________, ______)  # Fill: similarity function, then the count matrix twice

print(f"Cosine Similarity Matrix Shape: {cosine_sim.shape}")
print(f"Similarity between first two movies: {cosine_sim[0][1]:.4f}")

Cosine Similarity Matrix Shape: (9144, 9144)
Similarity between first two movies: 0.0436
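
For two count vectors $a$ and $b$, the score is $\cos(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$. As a quick check of that formula, the sketch below computes it by hand with NumPy for two made-up vectors (hypothetical values, not rows from the dataset) and compares the result with `cosine_similarity`.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical word-count rows over a four-word vocabulary.
a = np.array([[0, 1, 1, 1]])
b = np.array([[1, 0, 1, 0]])

# Dot product divided by the product of the two vector lengths.
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# cosine_similarity returns a matrix; here it is 1x1.
library = cosine_similarity(a, b)[0, 0]

print(f"{manual:.4f} {library:.4f}")  # 0.4082 0.4082
```

The two vectors share one word out of a combined five occurrences, giving $1 / (\sqrt{3}\sqrt{2}) \approx 0.408$.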


*   **Creates Title Index Mapping:** This cell creates a mapping from movie titles to their index positions. This helps us quickly find a movie's position in our similarity matrix when making recommendations.

**Hints:**
- Use `.index.tolist()` to get all titles as a list
- Use dictionary comprehension to create the mapping: `{title: idx for idx, title in enumerate(titles)}`

In [None]:
# Create a mapping of movie titles to their index
titles = netflix_data.index.______()  # Fill: method to convert index to list
title_to_idx = {title: idx for idx, _____ in ______(titles)}  # Fill: function to get index and value pairs

print(f"Total movies indexed: {len(_______)}")
print(f"Sample titles: {titles[:5]}")

Total movies indexed: 9144
Sample titles: ['Lets Fight Ghost', 'HOW TO BUILD A GIRL', 'The Con-Heartist', 'Gleboka woda', 'Only a Mother']
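
One property of this mapping is worth knowing: dictionary keys are unique, so if the same title occurs more than once, later positions overwrite earlier ones. A minimal sketch with a made-up title list (hypothetical values):

```python
# Hypothetical titles with "A" repeated at positions 0 and 2.
titles = ["A", "B", "A"]

# Same comprehension as in the hint: later duplicates replace earlier entries.
title_to_idx = {title: idx for idx, title in enumerate(titles)}

print(title_to_idx)                    # {'A': 2, 'B': 1}
print(len(titles), len(title_to_idx))  # 3 2
```

Comparing `len(titles)` with `len(title_to_idx)` is therefore a cheap way to see whether any titles in the index repeat.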


*   **Defines Recommendation Function:** This cell defines the `get_recommendations` function that takes a movie title and returns the top N similar movies. It finds the movie's index, gets similarity scores with all other movies, sorts them, and returns the most similar ones.

**Hints:**
- Use `title_to_idx[title]` to get the index of the movie
- Use `sorted()` with a lambda function as key to sort by similarity score
- Use slicing `[1:num_recommendations + 1]` to skip the first entry (normally the movie itself, which has similarity 1 with itself) and keep the top N

In [None]:
def get_recommendations(title, num_recommendations=10):
    # Check if title exists
    if title not in title_to_idx:
        print(f"Movie '{title}' not found in dataset.")
        return None

    # Get the index of the movie
    idx = ______[title]  # Fill: dictionary that maps titles to indices

    # Get similarity scores for all movies with this movie
    sim_scores = list(enumerate(cosine_sim[______]))  # Fill: variable containing movie index

    # Sort movies by similarity score (highest first)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=______)  # Fill: True or False for descending order

    # Get top N similar movies (excluding the movie itself)
    sim_scores = sim_scores[1:_____________]

    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return recommended movies
    recommendations = netflix_data.iloc[_______][['Genre', 'Tags']]
    return recommendations
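
Skipping position 0 works when the queried title sorts first, but a tie at similarity 1.0 (for example, a second row with identical features) can put another row ahead of it. The sketch below uses a made-up 3x3 similarity matrix to show that case and a hypothetical helper, `top_similar`, that removes the query index explicitly. It is an alternative under those assumptions, not the notebook's method.

```python
import numpy as np

def top_similar(sim_matrix, idx, n):
    """Return indices of the n rows most similar to row idx, excluding idx itself."""
    scores = sim_matrix[idx]
    # Stable sort on negated scores: descending order, ties kept in row order.
    order = np.argsort(-scores, kind="stable")
    # Drop the query explicitly instead of assuming it sorted into position 0.
    return [int(i) for i in order if i != idx][:n]

# Hypothetical similarity matrix: rows 0 and 1 have identical features.
sim = np.array([
    [1.0, 1.0, 0.2],
    [1.0, 1.0, 0.2],
    [0.2, 0.2, 1.0],
])

# Slice-based approach from the exercise, applied to row 1.
ranked = sorted(enumerate(sim[1]), key=lambda x: x[1], reverse=True)
slice_based = [i for i, _ in ranked[1:3]]

print(slice_based)             # [1, 2]  (row 1 recommends itself)
print(top_similar(sim, 1, 2))  # [0, 2]
```

When every title's features are distinct the two approaches return the same list, so the slice in the exercise is fine as a first version.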

*   **Tests Recommendation System:** This cell takes a movie name as input from the user and displays the top 10 recommended movies based on genre, tags, and summary similarity.

In [None]:
# Take movie name from the user as input
movie_name = input("Enter a movie name to get recommendations: ")
recommendations = get_recommendations(________, num_recommendations=10)
if recommendations is not None:
    print(f"\n=====Recommended movies for '{movie_name}'======")
    for movie in recommendations.index:
        print(_______)