This is a content-based movie recommender system built with TF-IDF and cosine similarity. It relies on movie content (genres and overviews) rather than user interaction data.

Source: https://www.kaggle.com/datasets/karrrimba/movie-metadatacsv
Columns used: title, genres, overview, vote_average, vote_count
Rows: ~45,460 movies


I built a content-based movie recommendation system that recommends similar movies. It has the following features:
    -Content-based filtering
    -Genre and overview text analysis
    -IMDb-style weighted ratings
    -Explainable recommendations

Libraries used:
    -pandas
    -NumPy
    -scikit-learn

In [38]:
#import tools and libraries
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
import seaborn as sns
#import plotly.express as px

In [39]:
# Configure display options
pd.set_option('display.max_columns', None)
sns.set(style="darkgrid")

In [40]:
# Load the movies data file
movies = pd.read_csv("movies_metadata.csv", low_memory=False)

First, I check that the dataset contains all the columns I need.

In [41]:
movies.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

I don't need all of these columns; I only need title, genres, overview, vote_average, and vote_count.

In [42]:
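# Keep only the needed columns; .copy() makes df independent of movies,
# so later assignments do not trigger SettingWithCopyWarning
df = movies[['title', 'genres', 'overview', 'vote_average', 'vote_count']].copy()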

Let's check for missing values

In [43]:
df.isnull().sum().sort_values(ascending=False).head(15)

overview        954
title             6
vote_average      6
vote_count        6
genres            0
dtype: int64

Next, I handle the missing values. I fill missing genres and overview entries with empty strings, missing vote_average values with the column mean, and missing vote_count values with 0.
Then I drop the rows that have no title.

In [44]:
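# Fill missing text fields with empty strings
df.loc[:, 'genres'] = df['genres'].fillna('')
df.loc[:, 'overview'] = df['overview'].fillna('')
# Fill missing ratings with the column mean and missing vote counts with 0
mean_vote = df['vote_average'].mean()
df.loc[:, 'vote_average'] = df['vote_average'].fillna(mean_vote)
df.loc[:, 'vote_count'] = df['vote_count'].fillna(0)
# Drop rows without a title, then reset the index so that row labels
# match the row positions of the TF-IDF matrix built later
df = df.dropna(subset=['title']).reset_index(drop=True)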

Now I confirm that the cleaned dataset has no missing values.

In [45]:
df.isnull().sum().sort_values(ascending=False).head(15)

title           0
genres          0
overview        0
vote_average    0
vote_count      0
dtype: int64

Next, I parsed the genres column, which stores each movie's genres as a stringified list of dictionaries, into plain space-separated text.

In [46]:
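import ast

def clean_genres(genres):
    """Convert a stringified list of genre dicts into space-separated genre names."""
    if isinstance(genres, str):
        try:
            genres_list = ast.literal_eval(genres)
            return ' '.join(g['name'] for g in genres_list)
        except (ValueError, SyntaxError, TypeError, KeyError):
            # Malformed or unexpected value: treat the movie as having no genres
            return ''
    return ''

df['genres'] = df['genres'].apply(clean_genres)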
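
To make the parsing step concrete, here is a minimal standalone sketch. The `raw_genres` string is a hypothetical sample shaped like the values in the genres column, and `genre_names` is an illustrative helper name, not part of the notebook.

```python
import ast

# Hypothetical sample, shaped like one raw value from the genres column
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def genre_names(raw):
    """Return genre names joined by spaces, or '' if the value cannot be parsed."""
    if not isinstance(raw, str):
        return ''
    try:
        parsed = ast.literal_eval(raw)  # safely evaluates Python literals only
    except (ValueError, SyntaxError):
        return ''
    if not isinstance(parsed, list):
        return ''
    # Keep only well-formed entries that carry a 'name' key
    return ' '.join(item['name'] for item in parsed
                    if isinstance(item, dict) and 'name' in item)

parsed_names = genre_names(raw_genres)
truncated = genre_names("[{'id': 16")   # cut-off string, cannot be parsed
missing = genre_names(float('nan'))     # non-string input such as NaN

print(repr(parsed_names), repr(truncated), repr(missing))
```

`ast.literal_eval` is preferable to `eval` here because it only accepts literals and will not execute arbitrary code found in the CSV.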



To get an IMDb-style rating, I calculated a weighted rating from the vote average and the vote count.

In [49]:
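C = df['vote_average'].mean()        # mean rating across all movies
m = df['vote_count'].quantile(0.70)  # vote-count threshold: 70th percentile (top 30% of movies)
# Movies with few votes are pulled toward the global mean C
df['weighted_rating'] = (df['vote_count']/(df['vote_count'] + m) * df['vote_average']) + (m / (df['vote_count'] + m) * C)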
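
The formula in the cell above is WR = v/(v+m) * R + m/(v+m) * C, where v is the movie's vote count, R its vote average, m the vote-count threshold, and C the mean vote across all movies. The sketch below shows how it behaves at the extremes. The values of `m` and `C` are hypothetical and chosen only for illustration; they are not the values computed from the dataset.

```python
def weighted_rating(v, R, m, C):
    """IMDb-style weighted rating: blend a movie's own average R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical threshold and global mean, for illustration only
m, C = 50, 5.6

no_votes = weighted_rating(v=0, R=9.0, m=m, C=C)       # falls back to the global mean
few_votes = weighted_rating(v=10, R=9.0, m=m, C=C)     # pulled strongly toward C
many_votes = weighted_rating(v=5000, R=9.0, m=m, C=C)  # stays close to its own average

print(no_votes, round(few_votes, 3), round(many_votes, 3))
```

A movie rated 9.0 by only 10 people ends up much closer to the global mean than one rated 9.0 by 5,000 people, which is the intended effect.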

To represent each movie as a single text document, I combined the genres and overview columns into one feature.

In [51]:
df['combined_features'] = df['genres'] + ' ' + df['overview']

To ensure the rating does not overpower similarity, I normalized the vote average to the 0 to 1 range with min-max scaling and stored it in norm_rating.

In [52]:
df['norm_rating'] = (df['vote_average'] - df['vote_average'].min()) / (df['vote_average'].max() - df['vote_average'].min())
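
As a quick illustration of min-max scaling, here is a plain-Python sketch with a hypothetical `min_max` helper and made-up ratings. It assumes the column minimum is 0 and the maximum is 10, which is consistent with the notebook output, where a vote_average of 7.7 maps to a norm_rating of 0.77.

```python
def min_max(values):
    """Scale values linearly so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid division by zero
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]

# Made-up ratings on a 0 to 10 scale
ratings = [0.0, 5.7, 7.7, 10.0]
scaled = [round(x, 2) for x in min_max(ratings)]

print(scaled)
```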

In [53]:
df.head()

   title                        genres                    overview                                           vote_average  vote_count  combined_features                                  weighted_rating  norm_rating
0  Toy Story                    Animation Comedy Family   Led by Woody, Andy's toys live happily in his ...           7.7      5415.0  Animation Comedy Family Led by Woody, Andy's t...         7.690433         0.77
1  Jumanji                      Adventure Fantasy Family  When siblings Judy and Peter discover an encha...           6.9      2413.0  Adventure Fantasy Family When siblings Judy an...         6.886856         0.69
2  Grumpier Old Men             Romance Comedy            A family wedding reignites the ancient feud be...           6.5        92.0  Romance Comedy A family wedding reignites the ...         6.311583         0.65
3  Waiting to Exhale            Comedy Drama Romance      Cheated on, mistreated and stepped on, the wom...           6.1        34.0  Comedy Drama Romance Cheated on, mistreated an...         5.895851         0.61
4  Father of the Bride Part II  Comedy                    Just when George Banks has recovered from his ...           5.7       173.0  Comedy Just when George Banks has recovered fr...         5.689673         0.57


TF-IDF stands for Term Frequency-Inverse Document Frequency. It converts text data into numerical vectors.
This approach gives more weight to words that are distinctive for a movie and reduces the impact of terms that are common across many movies.

In [55]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english')

tfidf_matrix = tfidf.fit_transform(df['combined_features'])
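
To show what the vectorizer computes, here is a simplified plain-Python sketch on a made-up three-document corpus. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which is scikit-learn's documented default. The whitespace tokenizer and the tiny `STOP_WORDS` set are simplifying assumptions and do not match scikit-learn's tokenizer or its English stop list.

```python
import math
from collections import Counter

STOP_WORDS = {"a", "the", "and", "in", "of"}  # tiny hypothetical stop list

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words (simplified)."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of strings."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears
    doc_freq = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

# Made-up corpus standing in for combined_features
docs = [
    "Animation Comedy toys come to life",
    "Comedy wedding in a small town",
    "Animation toys and a lost robot",
]
vocab, vectors = tfidf_vectors(docs)

i_life, i_comedy = vocab.index("life"), vocab.index("comedy")
print(vectors[0][i_life] > vectors[0][i_comedy])
```

In the first document, "life" appears in only one document and so receives a higher weight than "comedy", which appears in two.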


Cosine similarity is used to measure how similar two movies are based on their TF-IDF vectors.
To keep the approach scalable, similarity scores are computed on demand for the queried movie rather than precomputed for every pair of movies.
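
The code uses `linear_kernel`, which computes a plain dot product. That gives the same result as cosine similarity here because `TfidfVectorizer` L2-normalizes each row by default. The sketch below checks the equivalence in plain Python; the vectors `a` and `b` and the helper names are made up for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    norm_u, norm_v = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)

def l2_normalize(u):
    """Scale u to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u] if norm else list(u)

# Made-up vectors standing in for two TF-IDF rows
a = [1.0, 2.0, 0.0]
b = [2.0, 1.0, 1.0]

cos_ab = cosine_similarity(a, b)
dot_of_normalized = dot(l2_normalize(a), l2_normalize(b))

print(round(cos_ab, 4), round(dot_of_normalized, 4))
```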

In [56]:
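import sklearn.metrics.pairwise as pw

# Map each title to its row label; for duplicated titles keep the first occurrence
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]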

def recommend_movies(title, num_recommendations=5):
    if title not in indices:
        return "Movie not found."

    idx = indices[title]

    sim_scores = pw.linear_kernel(
        tfidf_matrix[idx:idx+1],
        tfidf_matrix
    ).flatten()

    scores = pd.DataFrame({
        'index': df.index,
        'similarity': sim_scores,
        'weighted_rating': df['weighted_rating']
    })

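    # Combine similarity (0 to 1) and weighted rating (0 to 10 scale), weighted 70/30
    scores['final_score'] = (scores['similarity'] * 0.7) + \
                            (scores['weighted_rating'] * 0.3)

    scores = scores.sort_values('final_score', ascending=False).reset_index(drop=True)
    # Skip the top row; it is the query movie only when that movie has the highest final_score
    top_scores = scores.iloc[1:num_recommendations+1]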

    # Build explainable output
    result = df.loc[top_scores['index'], ['title']].copy()
    result['similarity_score'] = top_scores['similarity'].values
    result['weighted_rating'] = top_scores['weighted_rating'].values
    result['final_score'] = top_scores['final_score'].values

    return result.round(3)


The recommend_movies function:
    -Accepts a movie title as input
    -Computes similarity on demand
    -Combines similarity scores with weighted ratings
    -Returns an explainable recommendation table

In [57]:
recommend_movies("Toy Story").reset_index(drop=True)

   title                        similarity_score  weighted_rating  final_score
0  Dilwale Dulhania Le Jayenge             0.005            8.973        2.696
1  Toy Story 3                             0.539             7.59        2.654
2  The Shawshank Redemption                0.063            8.491        2.592
3  The Godfather                           0.008            8.488        2.552
4  Your Name.                              0.028            8.432        2.549


In [58]:
recommend_movies("Jumanji").reset_index(drop=True)

   title                        similarity_score  weighted_rating  final_score
0  Dilwale Dulhania Le Jayenge             0.004            8.973        2.695
1  The Godfather                           0.018            8.488        2.559
2  The Shawshank Redemption                0.016            8.491        2.559
3  Spirited Away                           0.104            8.283        2.558
4  Your Name.                              0.011            8.432        2.537


In [59]:
recommend_movies("Waiting to Exhale").reset_index(drop=True)

   title                     similarity_score  weighted_rating  final_score
0  The Shawshank Redemption             0.002            8.491        2.549
1  The Godfather                        0.002            8.488        2.548
2  Your Name.                           0.007            8.432        2.535
3  Planet Earth                           0.0            8.404        2.521
4  Fight Club                           0.013            8.293        2.497


In [60]:
recommend_movies("Grumpier Old Men").reset_index(drop=True)

   title                     similarity_score  weighted_rating  final_score
0  Grumpier Old Men                       1.0            6.312        2.593
1  The Godfather                        0.031            8.488        2.568
2  The Shawshank Redemption               0.0            8.491        2.547
3  Your Name.                           0.005            8.432        2.533
4  Planet Earth                           0.0            8.404        2.521
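
In the outputs above, weighted_rating reaches almost 9 while similarity never exceeds 1, so the rating term dominates final_score and highly rated but unrelated movies rise to the top. The norm_rating column computed earlier is not used by recommend_movies. In addition, skipping the first sorted row removes the query movie only when it ranks first, which is why Grumpier Old Men appears in its own recommendations. One possible adjustment is sketched below in plain Python: scale the rating to the 0 to 1 range before blending, and exclude the query movie by title. The `rank_candidates` helper is hypothetical. The candidate rows reuse the values from the Toy Story output, the Toy Story row itself assumes a self-similarity of 1.0, and the scaling is done over these six rows only, whereas the notebook would scale over the whole column.

```python
def rank_candidates(candidates, query_title, top_n=5, sim_weight=0.7):
    """Rank (title, similarity, weighted_rating) tuples by a blended score.

    The rating is min-max scaled over the given candidates so that it shares
    the 0 to 1 range of the similarity score before the two are combined.
    """
    ratings = [rating for _, _, rating in candidates]
    lo, hi = min(ratings), max(ratings)
    span = (hi - lo) or 1.0  # avoid division by zero when all ratings are equal
    scored = []
    for title, sim, rating in candidates:
        if title == query_title:
            continue  # exclude the query movie explicitly, not by position
        norm_rating = (rating - lo) / span
        final = sim_weight * sim + (1 - sim_weight) * norm_rating
        scored.append((title, round(final, 3)))
    scored.sort(key=lambda row: row[1], reverse=True)
    return scored[:top_n]

# Values taken from the Toy Story output; the first row is an assumed self-match
candidates = [
    ("Toy Story", 1.0, 7.690),
    ("Dilwale Dulhania Le Jayenge", 0.005, 8.973),
    ("Toy Story 3", 0.539, 7.59),
    ("The Shawshank Redemption", 0.063, 8.491),
    ("The Godfather", 0.008, 8.488),
    ("Your Name.", 0.028, 8.432),
]

ranked = rank_candidates(candidates, "Toy Story")
for title, score in ranked:
    print(title, score)
```

With the rating on the same scale as the similarity, Toy Story 3 moves ahead of the unrelated but highly rated titles in this small example. The 0.7 weight is kept from the original function and would likely need tuning on the full dataset.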
