Emily Picinich

11/16/25

Movie Recommender

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [6]:
# Load data into DataFrames
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

In [8]:
# Preview data
movies.head()

   movieId                               title                                       genres
0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
2        3             Grumpier Old Men (1995)                               Comedy|Romance
3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
4        5  Father of the Bride Part II (1995)                                       Comedy


In [10]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 87585 entries, 0 to 87584
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  87585 non-null  int64 
 1   title    87585 non-null  object
 2   genres   87585 non-null  object
dtypes: int64(1), object(2)
memory usage: 2.0+ MB


In [12]:
# Preview data
ratings.head()

   userId  movieId  rating  timestamp
0       1       17     4.0  944249077
1       1       25     1.0  944250228
2       1       29     2.0  943230976
3       1       30     5.0  944249077
4       1       32     5.0  943228858


In [14]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000204 entries, 0 to 32000203
Data columns (total 4 columns):
 #   Column     Dtype  
---  ------     -----  
 0   userId     int64  
 1   movieId    int64  
 2   rating     float64
 3   timestamp  int64  
dtypes: float64(1), int64(3)
memory usage: 976.6 MB


In [16]:
# Check for missing data
ratings.isna().sum()

userId       0
movieId      0
rating       0
timestamp    0
dtype: int64

In [18]:
# Average rating per movie
avg_ratings = ratings.groupby('movieId')['rating'].mean()
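
As a small illustration of this step, the sketch below uses made-up ratings to show what the grouped mean looks like and what happens when it is mapped back onto movie IDs. The `toy_ratings` and `toy_movies` frames are hypothetical and are not taken from the dataset; the point is only that a movie with no ratings ends up with a missing (NaN) average.

```python
import pandas as pd

# Hypothetical toy data, not taken from ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.0],
})
toy_movies = pd.DataFrame({"movieId": [10, 20, 30]})

# Same groupby/mean pattern: one average rating per movieId
toy_avg = toy_ratings.groupby("movieId")["rating"].mean()

# Mapping the averages back leaves NaN for movie 30, which has no ratings
toy_movies["avg_rating"] = toy_movies["movieId"].map(toy_avg)
print(toy_movies)
```

Any score computed from a NaN average is also NaN, so unrated movies may need separate handling depending on how the scores are used.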

In [22]:
# Convert genres to vectors
vectorizer = CountVectorizer(
    tokenizer=lambda x: x.split('|'),
    token_pattern=None
)

genre_matrix = vectorizer.fit_transform(movies['genres'])

In [24]:
# Genre similarity between all movies
similarity = cosine_similarity(genre_matrix)
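
With 87,585 movies, the full pairwise matrix has 87,585 × 87,585 entries (roughly 7.7 billion), which is on the order of tens of gigabytes if stored as 64-bit floats. One possible alternative, sketched below on a hypothetical 0/1 matrix (`toy` is illustrative), is to compute only the row for the query movie when it is needed. `cosine_similarity` accepts a second argument, so a single row can be compared against all rows.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical 0/1 genre rows: columns = adventure, children, comedy, fantasy
toy = np.array([
    [1, 1, 0, 1],   # movie 0
    [1, 1, 0, 1],   # movie 1: same genres as movie 0
    [0, 0, 1, 0],   # movie 2: no shared genres
    [1, 0, 1, 0],   # movie 3: shares one genre with movie 0
])

# Full pairwise matrix: shape (n_movies, n_movies)
full = cosine_similarity(toy)

# One query row against all movies: shape (1, n_movies), flattened to 1-D
row = cosine_similarity(toy[0:1], toy).ravel()

print(full.shape, row.shape)
print(np.round(row, 3))
```

The single-row result matches the corresponding row of the full matrix, so a recommender written this way would see the same scores while holding only one row in memory at a time.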

In [26]:
# Define recommender system
def recommend(movie_title, n=10):
    # Try to find exact title
    match = movies[movies['title'].str.lower() == movie_title.lower()]
    if match.empty:
        print("Movie not found.")
        return

    idx = match.index[0]

    # Get similarity scores for this movie (copy so the shared matrix is not modified)
    scores = similarity[idx].copy()

    # Exclude the movie itself
    scores[idx] = -1

    # Get top N similar movies
    top_idx = np.argsort(scores)[-n:][::-1]

    recs = movies.iloc[top_idx].copy()
    recs['similarity'] = scores[top_idx]
    recs['avg_rating'] = recs['movieId'].map(avg_ratings)

    # Boost final score by rating (0–5 scaled to 0–1)
    recs['final_score'] = 0.7 * recs['similarity'] + 0.3 * (recs['avg_rating'] / 5)

    recs = recs.sort_values('final_score', ascending=False)

    # Print results
    print(f"\nBecause you liked: {movie_title}\nRecommended:\n")
    for _, r in recs.iterrows():
        print(f"{r['title']}  |  Genres: {r['genres']}  |  Rating: {r['avg_rating']:.2f}")
    print()

In [28]:
# Example
recommend("Toy Story (1995)")


Because you liked: Toy Story (1995)
Recommended:

Legends of Valhalla: Thor (2011)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 4.50
The Magic Crystal (2011)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 3.50
Boxtrolls, The (2014)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 3.33
Christmas in Tattertown (1988)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 3.25
Turbo (2013)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 2.95
Shrek the Third (2007)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 2.89
Aladdin (1992)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 2.84
Olaf's Frozen Adventure (2017)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 2.71
Bunyan and Babe (2017)  |  Genres: Adventure|Animation|Children|Comedy|Fantasy  |  Rating: 2.50
Puss in Book: Trapped in an Epic Tale (2017)  |  Genres: Adventure|Animation

This project demonstrates how a hybrid movie recommender system can combine content-based features with user rating data. Movie genres are converted to count vectors, and cosine similarity between those vectors identifies films with overlapping genres. The ten most similar titles are then re-ranked by a final score that weights genre similarity at 0.7 and the average user rating (rescaled from 0 to 5 down to 0 to 1) at 0.3, so higher-rated movies rank more favorably while genre relevance is preserved. Because ratings enter only as a per-movie average, the recommendations depend on the chosen title rather than on an individual user's rating history. The resulting hybrid approach balances genre similarity against average rating.
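
The weighting can be checked by hand. The sketch below wraps the formula from `recommend` in a helper; the name `final_score` and the parameters `w_sim` and `w_rating` are hypothetical, introduced only for illustration. It assumes a genre similarity of 1.0, which holds when two movies have identical genre lists, and uses the rounded ratings 4.50 and 2.50 shown in the example output.

```python
def final_score(similarity, avg_rating, w_sim=0.7, w_rating=0.3):
    """Blend genre similarity (0-1) with an average rating on a 0-5 scale."""
    return w_sim * similarity + w_rating * (avg_rating / 5)

# Identical genres (similarity 1.0), average ratings of 4.50 and 2.50
print(round(final_score(1.0, 4.50), 3))  # 0.97
print(round(final_score(1.0, 2.50), 3))  # 0.85
```

When every candidate has the same similarity, as in the Toy Story example, the ordering is decided entirely by the average rating term.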