# Task 1: Movie Recommender System

**Task:** Build a Movie Recommendation System using Content-Based Filtering  
**Dataset:** MovieLens  
**Approach:** TF-IDF + Cosine Similarity on genres and user tags

The recommender system suggests the top 5 most similar movies for a selected input movie, based on content metadata.


In [2]:
# STEP 1: Import essential libraries
import pandas as pd
import numpy as np

In [3]:
# For visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# For content-based filtering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [5]:
# To ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [6]:
# STEP 2: Load datasets
from pathlib import Path

# Folder that holds the MovieLens CSV files
DATA_DIR = Path(r"C:\Users\akanksh_02\AICTE_AIML_Intern\1_Movie Recommender System\movie_lens_data")

movies = pd.read_csv(DATA_DIR / "movie.csv")                  # movieId, title, genres
ratings = pd.read_csv(DATA_DIR / "rating.csv")                # userId, movieId, rating, timestamp
tags = pd.read_csv(DATA_DIR / "tag.csv")                      # userId, movieId, tag, timestamp
genome_scores = pd.read_csv(DATA_DIR / "genome_scores.csv")   # movieId, tagId, relevance
genome_tags = pd.read_csv(DATA_DIR / "genome_tags.csv")       # tagId, tag
links = pd.read_csv(DATA_DIR / "link.csv")                    # movieId, imdbId, tmdbId

In [7]:
# STEP 3: Quick look at the data
print("Movies:\n", movies.head(), "\n")

Movies:
    movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy   



In [8]:
print("Ratings:\n", ratings.head(), "\n")

Ratings:
    userId  movieId  rating            timestamp
0       1        2     3.5  2005-04-02 23:53:47
1       1       29     3.5  2005-04-02 23:31:16
2       1       32     3.5  2005-04-02 23:33:39
3       1       47     3.5  2005-04-02 23:32:07
4       1       50     3.5  2005-04-02 23:29:40 



In [9]:
print("Tags:\n", tags.head(), "\n")

Tags:
    userId  movieId            tag            timestamp
0      18     4141    Mark Waters  2009-04-24 18:19:40
1      65      208      dark hero  2013-05-10 01:41:18
2      65      353      dark hero  2013-05-10 01:41:19
3      65      521  noir thriller  2013-05-10 01:39:43
4      65      592      dark hero  2013-05-10 01:41:18 



### Pre-Processing

In [11]:
# STEP 1: Fill missing genres with empty strings (precaution)
movies['genres'] = movies['genres'].fillna('')

In [12]:
# STEP 2: Preprocess genres
# The genres are pipe-separated like "Action|Adventure|Fantasy"
# We'll replace '|' with space so TF-IDF treats them as separate words
movies['genres_cleaned'] = movies['genres'].str.replace('|', ' ', regex=False)

### Feature Extraction

In [14]:
# STEP 3: TF-IDF Vectorization of genres
# TF-IDF helps convert text to numeric vectors based on term importance
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(movies['genres_cleaned'])
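
As a rough illustration of what "term importance" means here, the sketch below computes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the default weighting in scikit-learn's `TfidfVectorizer`. The helper name `smoothed_idf` and the document counts are made up for the example and are not taken from the dataset.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 1000                             # hypothetical collection size
common_idf = smoothed_idf(n_docs, 500)    # term found in half of the documents
rare_idf = smoothed_idf(n_docs, 10)       # term found in only 10 documents

# A rarer term receives a larger weight than a common one
print(f"common: {common_idf:.3f}, rare: {rare_idf:.3f}")
```

In the genre setting this means that a genre shared by few movies separates them more strongly than a genre shared by thousands.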

In [15]:
# Shape: (number of movies, number of unique genre tokens)
# Hyphenated genres such as "Sci-Fi" are split into two tokens by the default tokenizer
print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (27278, 23)
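
The column count can be larger than the number of genre labels because of how the text is tokenized. The sketch below applies the default `token_pattern` of scikit-learn's text vectorizers, `(?u)\b\w\w+\b`, using plain `re`. The `tokenize` helper is only an illustration and is not part of the notebook.

```python
import re

# Default token pattern of scikit-learn's text vectorizers:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it with the default token pattern."""
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Action Sci-Fi Thriller"))  # ['action', 'sci', 'fi', 'thriller']
print(tokenize("Crime Film-Noir"))         # ['crime', 'film', 'noir']
print(tokenize("(no genres listed)"))      # ['no', 'genres', 'listed']
```

Hyphenated genres such as `Sci-Fi` and `Film-Noir` each contribute two tokens, and `stop_words='english'` then removes common words such as `no`.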


In [16]:
# STEP 4: Compute cosine similarity between movies
# This gives a similarity score between every pair of movies
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
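
The full pairwise matrix is dense, so its size grows with the square of the number of movies. A back-of-the-envelope estimate, assuming 8-byte `float64` values (the usual scikit-learn default, but not checked in this notebook):

```python
n_movies = 27278        # rows in the TF-IDF matrix reported above
bytes_per_value = 8     # assumption: float64 similarity scores

total_bytes = n_movies * n_movies * bytes_per_value
print(f"{total_bytes / 1024**3:.2f} GiB")  # 5.54 GiB
```

If memory is tight, one alternative is to compute a single row on demand, for example `cosine_similarity(tfidf_matrix[idx], tfidf_matrix)`, instead of storing all pairs.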

In [17]:
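# STEP 5: Build a reverse map of movie titles to index
# This helps us retrieve the index of any movie by its title
# Series.drop_duplicates() compares values, not labels, so repeated titles are
# removed on the index instead; each lookup then returns a single integer
indices = pd.Series(movies.index, index=movies['title'])
indices = indices[~indices.index.duplicated(keep='first')]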

In [18]:
# Function to get movie recommendations
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the movie that matches the title
    idx = indices.get(title)

    if idx is None:
        return ["Movie not found. Please check spelling."]

    # Get pairwise similarity scores for all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort movies based on similarity scores (high to low)
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

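    # Keep the 5 most similar movies, excluding the input movie itself
    # (filter by index, since tied scores mean the input is not always first)
    sim_scores = [s for s in sim_scores if s[0] != idx][:5]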

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 5 similar movie titles
    return movies['title'].iloc[movie_indices].tolist()
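
Genre-only profiles produce many exact ties, because movies with identical genre lists score 1.0 against each other. Dropping the first sorted entry therefore removes the input movie only when it happens to sort first. The sketch below shows the difference on a toy similarity row; `top_k_similar` is a hypothetical helper and the scores are invented.

```python
import heapq

def top_k_similar(scores, query_idx, k=5):
    """Return the indices of the k highest scores, skipping the query itself."""
    candidates = (i for i in range(len(scores)) if i != query_idx)
    return heapq.nlargest(k, candidates, key=lambda i: scores[i])

# Toy similarity row: items 0 and 3 have identical profiles (score 1.0)
scores = [1.0, 0.2, 0.9, 1.0, 0.5]
query_idx = 3

# Slicing off the first sorted entry assumes the query sorts first
ranked = sorted(enumerate(scores), key=lambda x: x[1], reverse=True)
naive = [i for i, _ in ranked[1:4]]

print(naive)                                   # [3, 2, 4] -> the query recommends itself
print(top_k_similar(scores, query_idx, k=3))   # [0, 2, 4]
```

Using `heapq.nlargest` also avoids sorting the whole row when only a handful of results are needed.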


In [19]:
# Example: Get top 5 movies similar to 'Toy Story (1995)'
recommended_movies = get_recommendations('Toy Story (1995)')
print("Top 5 recommended movies:\n", recommended_movies)

Top 5 recommended movies:
 ['Antz (1998)', 'Toy Story 2 (1999)', 'Adventures of Rocky and Bullwinkle, The (2000)', "Emperor's New Groove, The (2000)", 'Monsters, Inc. (2001)']




## Enhancing Recommendations Using User Tags

To improve the quality of movie recommendations, we now include tags from the **tag genome** (`genome_scores` and `genome_tags`), which is derived from user-contributed content.

Each tag has a **relevance score** for a movie, indicating how strongly the tag applies. By selecting the **top 15 most relevant tags per movie**, we can build a richer profile for each film.

### Updated Profile Components:
- **Genres** (from original movie metadata)
- **Top Tags** (based on tag relevance scores)

The combined profile is vectorized with TF-IDF again, and a new **cosine similarity matrix** is computed from it, so that similarity reflects themes and style as well as genre.




In [21]:
# Step A2: Merge genome_scores with genome_tags to get readable tag names
genome_data = pd.merge(genome_scores, genome_tags, on='tagId')


In [22]:
# Step A3: For each movie, get the top N tags with highest relevance score
top_n = 15  # You can adjust this number
top_tags_per_movie = genome_data.sort_values(['movieId', 'relevance'], ascending=[True, False])\
                                .groupby('movieId').head(top_n)
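
To make the sort-then-`head` pattern concrete, here is a plain-Python equivalent on a few made-up rows. The tuples, tag names, and relevance values are invented for illustration and do not come from `genome_scores`.

```python
from collections import defaultdict

# (movieId, tag, relevance) rows; all values are invented
rows = [
    (1, "toys", 0.99), (1, "war", 0.02), (1, "animation", 0.98),
    (2, "board game", 0.95), (2, "jungle", 0.90), (2, "romance", 0.10),
]
top_n = 2

# Collect (relevance, tag) pairs per movie
by_movie = defaultdict(list)
for movie_id, tag, relevance in rows:
    by_movie[movie_id].append((relevance, tag))

# Keep the top_n tags with the highest relevance for each movie
top_tags = {
    movie_id: [tag for _, tag in sorted(pairs, reverse=True)[:top_n]]
    for movie_id, pairs in by_movie.items()
}
print(top_tags)  # {1: ['toys', 'animation'], 2: ['board game', 'jungle']}
```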


In [23]:
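# Step A4: Group top tags into a single space-separated string per movie
# (the lambda argument is named tag_group so it does not shadow the tags DataFrame)
movie_tags = top_tags_per_movie.groupby('movieId')['tag'].apply(lambda tag_group: ' '.join(tag_group)).reset_index()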
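
One side effect of joining tags with spaces is that multi-word tags are split into separate tokens, so `dark hero` contributes `dark` and `hero` independently. If tags should stay intact, one option (an alternative, not what the notebook does) is to replace the spaces inside each tag with underscores before joining. The `join_tags` helper below is hypothetical, and the regular expression mirrors scikit-learn's default token pattern.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token pattern

def join_tags(tags):
    """Join tags with spaces, keeping each multi-word tag as one token."""
    return " ".join(tag.strip().replace(" ", "_") for tag in tags)

tags_for_movie = ["dark hero", "noir thriller"]

plain = " ".join(tags_for_movie)
joined = join_tags(tags_for_movie)

print(TOKEN_PATTERN.findall(plain))   # ['dark', 'hero', 'noir', 'thriller']
print(TOKEN_PATTERN.findall(joined))  # ['dark_hero', 'noir_thriller']
```

Whether this helps depends on the data: intact tags are more specific, but two movies then match only when they share the exact tag.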

In [24]:
# Step A5: Merge tags with original movies dataframe
movies_with_tags = pd.merge(movies, movie_tags, on='movieId', how='left')

In [25]:
# Fill missing tag values with empty string
movies_with_tags['tag'] = movies_with_tags['tag'].fillna('')

In [26]:
# Step A6: Combine genres and tags into a single 'profile' column
# The raw 'genres' column can be used as-is: the vectorizer's default tokenizer treats '|' as a separator
movies_with_tags['genres'] = movies_with_tags['genres'].fillna('')
movies_with_tags['profile'] = movies_with_tags['genres'] + ' ' + movies_with_tags['tag']

In [27]:
# Step A7: Vectorize the new profile using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

In [28]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix_tags = tfidf_vectorizer.fit_transform(movies_with_tags['profile'])

In [29]:
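# Step A8: Compute cosine similarity
# TfidfVectorizer L2-normalizes each row by default, so the dot product
# returned by linear_kernel equals the cosine similarity
from sklearn.metrics.pairwise import linear_kernel
cosine_sim_tags = linear_kernel(tfidf_matrix_tags, tfidf_matrix_tags)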
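
`linear_kernel` computes plain dot products, which coincide with cosine similarity only when the rows have unit length (the default `norm='l2'` in `TfidfVectorizer` provides that). A small pure-Python check of this identity, using two arbitrary vectors:

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity computed from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

normalized_dot = dot(l2_normalize(a), l2_normalize(b))
print(math.isclose(normalized_dot, cosine(a, b)))  # True
```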

In [30]:
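# Step A9: Create reverse index for titles (needed for recommendations)
# Repeated titles are removed on the index so each lookup returns a single integer
indices_tags = pd.Series(movies_with_tags.index, index=movies_with_tags['title'])
indices_tags = indices_tags[~indices_tags.index.duplicated(keep='first')]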

In [31]:

# Step A10: Final updated recommendation function using genres + tags
def recommend_movies_with_tags(title, cosine_sim=cosine_sim_tags):
    # Get index of the movie that matches the title
    idx = indices_tags.get(title)

    if idx is None:
        print(f"Movie titled '{title}' not found. Please try a valid title.")
        return []

    # Get pairwise similarity scores
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the top 5 most similar movies, excluding the input movie itself
    # (filter by index, since tied scores mean the input is not always first)
    sim_scores = [s for s in sim_scores if s[0] != idx][:5]

    # Get movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return top 5 recommended movie titles
    return movies_with_tags['title'].iloc[movie_indices].tolist()
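
Both recommendation functions require an exact title match, including the year. A possible extension, sketched here with the standard library's `difflib` and a hypothetical `suggest_titles` helper, is to offer near matches when the lookup fails. The short title list stands in for `movies['title']`.

```python
import difflib

def suggest_titles(query, titles, n=3, cutoff=0.6):
    """Return up to n known titles that closely resemble the query."""
    return difflib.get_close_matches(query, titles, n=n, cutoff=cutoff)

# Stand-in for the real title column
known_titles = ["Toy Story (1995)", "Toy Story 2 (1999)", "Jumanji (1995)"]

suggestions = suggest_titles("Toy Storey (1995)", known_titles)
print(suggestions[0])  # Toy Story (1995)
```

The `cutoff` value is a judgment call: a lower value returns more candidates, at the cost of less relevant suggestions.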


In [32]:
print("Top 5 movies similar to 'Toy Story (1995)':")
recommend_movies_with_tags('Toy Story (1995)')


Top 5 movies similar to 'Toy Story (1995)':


['Monsters, Inc. (2001)',
 "Bug's Life, A (1998)",
 'Jimmy Neutron: Boy Genius (2001)',
 'Toy Story 2 (1999)',
 'Toy Story 3 (2010)']
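
To see how much the tag-enhanced profile changes the result, the two lists printed in this notebook for 'Toy Story (1995)' can be compared directly. The snippet only restates those outputs; the Jaccard overlap is an illustrative measure and not an evaluation of recommendation quality.

```python
# Output of get_recommendations('Toy Story (1995)')
genre_only = [
    "Antz (1998)", "Toy Story 2 (1999)",
    "Adventures of Rocky and Bullwinkle, The (2000)",
    "Emperor's New Groove, The (2000)", "Monsters, Inc. (2001)",
]
# Output of recommend_movies_with_tags('Toy Story (1995)')
genre_and_tags = [
    "Monsters, Inc. (2001)", "Bug's Life, A (1998)",
    "Jimmy Neutron: Boy Genius (2001)", "Toy Story 2 (1999)",
    "Toy Story 3 (2010)",
]

shared = set(genre_only) & set(genre_and_tags)
jaccard = len(shared) / len(set(genre_only) | set(genre_and_tags))

print(sorted(shared))  # ['Monsters, Inc. (2001)', 'Toy Story 2 (1999)']
print(jaccard)         # 0.25
```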