Recommend the top 10 movies to a user (called the "input user") using user-based collaborative filtering with Pearson correlation.

The first technique we're going to look at is collaborative filtering, in its user-based (or user-user) form. As that name suggests, it uses other users to recommend items to the input user: it finds users whose preferences and opinions are similar to the input user's, then recommends items those users liked. There are several ways to find similar users (some of them based on machine learning); the one used here is the Pearson correlation coefficient.
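
To make the similarity measure concrete, here is a minimal sketch of the Pearson correlation on toy data. The helper name `pearson` and the three rating lists are made up for illustration and are not part of the notebook; the zero-variance fallback to 0.0 mirrors the division-by-zero guard used later in this notebook.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length rating lists.

    Returns 0.0 when either list has zero variance (assumed fallback).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Hypothetical ratings for the same three movies.
input_user = [5.0, 3.5, 2.0]
similar_user = [4.5, 3.0, 1.5]    # same taste, but rates half a star lower
opposite_user = [1.0, 3.0, 5.0]   # reversed preferences

sim_similar = pearson(input_user, similar_user)
sim_opposite = pearson(input_user, opposite_user)
print(round(sim_similar, 3), round(sim_opposite, 3))
```

Because the coefficient subtracts each user's mean, a user who rates consistently lower than the input user can still count as highly similar.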

In [10]:
# Step 1: Upload the ZIP file (e.g. movielens-20m-dataset.zip)
from google.colab import files
uploaded = files.upload()

# Step 2: Unzip the uploaded file
import zipfile
import os

with zipfile.ZipFile("movielens-20m-dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("movielens")

# Step 3: List all files in the extracted folder
# (loop variables are named so they do not shadow the google.colab `files` module)
for root, dirs, filenames in os.walk("movielens"):
    for filename in filenames:
        print(os.path.join(root, filename))


Saving movielens-20m-dataset.zip to movielens-20m-dataset.zip
movielens/movie.csv
movielens/link.csv
movielens/genome_tags.csv
movielens/tag.csv
movielens/genome_scores.csv
movielens/rating.csv


FileNotFoundError: [Errno 2] No such file or directory: 'movielens/ratings.csv'

In [21]:
import pandas as pd
import os

# Check current working directory
print("📁 Current Directory:", os.getcwd())

# List all files in 'movielens' folder
print("📄 Files in 'movielens':")
print(os.listdir("movielens"))

# Set base path
base_path = "movielens"

# Read each file using correct names
rating = pd.read_csv(os.path.join(base_path, "rating.csv"))
tag = pd.read_csv(os.path.join(base_path, "tag.csv"))
genome_scores = pd.read_csv(os.path.join(base_path, "genome_scores.csv"))
genome_tags = pd.read_csv(os.path.join(base_path, "genome_tags.csv"))
link = pd.read_csv(os.path.join(base_path, "link.csv"))
movie = pd.read_csv(os.path.join(base_path, "movie.csv"))

# Confirm by printing the heads
print("✅ Files loaded successfully.\n")

print("🎬 Movies:")
display(movie.head())

print("⭐ Ratings:")
display(rating.head())

print("🏷️ Tags:")
display(tag.head())

print("🧬 Genome Tags:")
display(genome_tags.head())

print("📊 Genome Scores:")
display(genome_scores.head())

print("🔗 Links:")
display(link.head())


📁 Current Directory: /content
📄 Files in 'movielens':
['movie.csv', 'link.csv', 'genome_tags.csv', 'tag.csv', 'genome_scores.csv', 'rating.csv']
✅ Files loaded successfully.

🎬 Movies:


Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


⭐ Ratings:


Unnamed: 0,userId,movieId,rating,timestamp
0,1,2,3.5,2005-04-02 23:53:47
1,1,29,3.5,2005-04-02 23:31:16
2,1,32,3.5,2005-04-02 23:33:39
3,1,47,3.5,2005-04-02 23:32:07
4,1,50,3.5,2005-04-02 23:29:40


🏷️ Tags:


Unnamed: 0,userId,movieId,tag,timestamp
0,18,4141,Mark Waters,2009-04-24 18:19:40
1,65,208,dark hero,2013-05-10 01:41:18
2,65,353,dark hero,2013-05-10 01:41:19
3,65,521,noir thriller,2013-05-10 01:39:43
4,65,592,dark hero,2013-05-10 01:41:18


🧬 Genome Tags:


Unnamed: 0,tagId,tag
0,1,007
1,2,007 (series)
2,3,18th century
3,4,1920s
4,5,1930s


📊 Genome Scores:


Unnamed: 0,movieId,tagId,relevance
0,1,1,0.025
1,1,2,0.025
2,1,3,0.05775
3,1,4,0.09675
4,1,5,0.14675


🔗 Links:


Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


So each movie has a unique ID, a title that includes its release year (and may contain Unicode characters), and several genres packed into a single field. Let's move the year out of the title column into a column of its own using the handy extract function that Pandas provides.
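
As a small illustration of that extraction, the sketch below runs the same idea on three toy titles. The DataFrame `titles` is made up for this example, and "Untitled Film" is a hypothetical entry included only to show what happens when a title has no year.

```python
import pandas as pd

titles = pd.DataFrame({"title": [
    "Toy Story (1995)",
    "City of Lost Children, The (Cité des enfants perdus, La) (1995)",
    "Untitled Film",  # hypothetical title with no year
]})

# Capture only the four digits inside the parentheses.
titles["year"] = titles["title"].str.extract(r"\((\d{4})\)", expand=False)

# Remove the "(YYYY)" part and trim the leftover whitespace.
titles["title"] = (
    titles["title"].str.replace(r"\(\d{4}\)", "", regex=True).str.strip()
)

print(titles)
```

Two things are worth noticing: a parenthesised alternate title is left in place because it is not four digits, and a title without a year gets a missing value in the `year` column rather than raising an error.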

In [22]:
# -----------------------------------
# Clean title and extract year from 'movie' DataFrame
# -----------------------------------

# Extract year in parentheses
movie['year'] = movie['title'].str.extract(r'(\(\d{4}\))', expand=False)

# Remove parentheses
movie['year'] = movie['year'].str.extract(r'(\d{4})', expand=False)

# Remove year from title
movie['title'] = movie['title'].str.replace(r'(\(\d{4}\))', '', regex=True)

# Strip whitespace
movie['title'] = movie['title'].apply(lambda x: x.strip())


# We won't need the genres column for this particular recommendation system,
# so drop it if it exists
if 'genres' in movie.columns:
    movie = movie.drop('genres', axis=1)

# -----------------------------------
# Show cleaned movie data
# -----------------------------------
print("🎬 Cleaned Movies Data:")
display(movie.head())


🎬 Cleaned Movies Data:


Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


Every row in the ratings dataframe holds one user ID, one movie ID, the rating that user gave the movie, and a timestamp showing when they rated it. We won't be needing the timestamp column, so let's drop it to save on memory.
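
A rough way to see the saving is to compare memory usage before and after the drop. The sketch below uses a three-row toy frame (`ratings_toy`, made up for this example), so the absolute numbers say nothing about the full dataset; it only shows how the comparison can be done.

```python
import pandas as pd

# Toy stand-in for the ratings table; values are illustrative only.
ratings_toy = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [2, 29, 2],
    "rating": [3.5, 3.5, 4.0],
    "timestamp": [
        "2005-04-02 23:53:47",
        "2005-04-02 23:31:16",
        "2005-04-03 10:00:00",
    ],
})

# deep=True also counts the Python string objects in the timestamp column.
before = ratings_toy.memory_usage(deep=True).sum()
slim = ratings_toy.drop(columns="timestamp")
after = slim.memory_usage(deep=True).sum()

print(f"before: {before} bytes, after: {after} bytes")
```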

In [27]:
if 'timestamp' in rating.columns:
    rating = rating.drop(columns='timestamp')
rating.head()

Unnamed: 0,userId,movieId,rating
0,1,2,3.5
1,1,29,3.5
2,1,32,3.5
3,1,47,3.5
4,1,50,3.5


Create a new user's ratings. This simulates an input user who has rated five movies.

In [28]:
userInput = [
    {'title':'Breakfast Club, The', 'rating':5},
    {'title':'Toy Story', 'rating':3.5},
    {'title':'Jumanji', 'rating':2},
    {'title':"Pulp Fiction", 'rating':5},
    {'title':'Akira', 'rating':4.5}
]
inputMovies = pd.DataFrame(userInput)

Match these movies with the movie dataframe to get their movieId. The merge attaches a movieId to each title the input user rated; the IDs are needed to find similar users.
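
Matching on title only works when the input titles are spelled exactly as in the movie table. The sketch below, on made-up toy frames (`movie_toy`, `user_toy`), shows one possible way to spot titles that did not match, using the `indicator` option of `DataFrame.merge`. It is an optional check and not something the notebook itself does.

```python
import pandas as pd

movie_toy = pd.DataFrame({
    "movieId": [1, 2, 296],
    "title": ["Toy Story", "Jumanji", "Pulp Fiction"],
})
user_toy = pd.DataFrame({
    "title": ["Toy Story", "Pulp Fiction", "Not In Catalog"],  # last one is hypothetical
    "rating": [3.5, 5.0, 4.0],
})

# A left merge keeps every input title; the _merge column says whether a match was found.
merged = user_toy.merge(movie_toy, on="title", how="left", indicator=True)

unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
matched = merged.loc[merged["_merge"] == "both", ["movieId", "title", "rating"]]

print("Unmatched titles:", unmatched)
print(matched)
```

With the default inner merge, an unmatched title is dropped silently, so a check like this can help catch typos in the input list.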

In [32]:
inputId = movie[movie['title'].isin(inputMovies['title'].tolist())]
inputMovies = pd.merge(inputId, inputMovies)
inputMovies = inputMovies.drop(columns='year')
display(inputMovies.head())

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


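Find all users who have rated these movies. The cell below selects the ratings for the input user's movies and groups them by user, so each group holds the ratings of one user who has rated at least one of the same movies as the input user.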
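
The filtering and grouping step can be sketched on toy data as follows. The frame `ratings_toy` and the list `input_ids` are made up for illustration; the sketch counts, for each user, how many of the input user's movies they rated, and orders users by that overlap.

```python
import pandas as pd

ratings_toy = pd.DataFrame({
    "userId":  [10, 10, 10, 20, 30, 30],
    "movieId": [1, 2, 296, 1, 2, 296],
    "rating":  [4.0, 3.0, 5.0, 2.0, 3.5, 4.5],
})
input_ids = [1, 2, 296]  # movies rated by the (hypothetical) input user

# Keep only ratings for the input user's movies.
subset = ratings_toy[ratings_toy["movieId"].isin(input_ids)]

# Number of shared movies per user, largest overlap first.
overlap = subset.groupby("userId").size().sort_values(ascending=False)
print(overlap)
```

Users with a larger overlap give the correlation more data points to work with, which is presumably why the notebook sorts the groups by size.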



In [43]:
# Get subset of users who rated the same movies
userSubset = rating[rating['movieId'].isin(inputMovies['movieId'].tolist())]

# Group by userId (pass the column name itself, not a list)
userSubsetGroup = userSubset.groupby('userId')

# Display a few entries
display(userSubsetGroup.head())  # Optional for debugging

# Sort users by number of common movies rated with input user
userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)


Unnamed: 0,userId,movieId,rating
0,1,2,3.5
11,1,296,4.0
236,3,1,4.0
451,5,2,3.0
517,6,1,5.0
...,...,...,...
19999786,138491,1,2.0
19999838,138492,1968,5.0
19999890,138493,1,3.5
19999891,138493,2,4.0


In [47]:
from math import sqrt
import pandas as pd

# Step 1: Compute Pearson Correlation between the input user and each other user
pearsonCorrelationDict = {}

for name, group in userSubsetGroup:
    group = group.sort_values(by='movieId')
    inputMoviesSorted = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())].sort_values(by='movieId')

    # Reset indices to align ratings
    group = group.reset_index(drop=True)
    inputMoviesSorted = inputMoviesSorted.reset_index(drop=True)

    # Extract ratings
    ratingsInput = inputMoviesSorted['rating']
    ratingsGroup = group['rating']

    # Calculate Pearson correlation components
    Sxx = sum((ratingsInput - ratingsInput.mean()) ** 2)
    Syy = sum((ratingsGroup - ratingsGroup.mean()) ** 2)
    Sxy = sum((ratingsInput - ratingsInput.mean()) * (ratingsGroup - ratingsGroup.mean()))

    # Avoid division by zero
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy / sqrt(Sxx * Syy)
    else:
        pearsonCorrelationDict[name] = 0

# Step 2: Create DataFrame from Pearson scores
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index', columns=['similarityIndex'])
pearsonDF['userId'] = pearsonDF.index.astype(int)

# Step 3: Select top 50 users with highest similarity
topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False).head(50)

# Step 4: Merge with original rating data
topUsersRating = topUsers.merge(rating, on='userId', how='inner')

# Step 5: Calculate weighted ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex'] * topUsersRating['rating']

# Step 6: Calculate recommendation score per movie
recommendationDF = topUsersRating.groupby('movieId').agg({
    'weightedRating': 'sum',
    'similarityIndex': 'sum'
})
recommendationDF['score'] = recommendationDF['weightedRating'] / recommendationDF['similarityIndex']

# Step 7: Merge with movie titles
recommendationDF = recommendationDF.merge(movie, on='movieId', how='left')

# Step 8: Display top 10 movie recommendations
recommendationDF = recommendationDF.sort_values(by='score', ascending=False)
recommendationDF = recommendationDF.reset_index(drop=True)
print("Top 10 Movie Recommendations")
print(recommendationDF[['title', 'score']].head(10))


Top 10 Movie Recommendations
                                   title  score
0  Sword of Doom, The (Dai-bosatsu tôge)    5.0
1                             Persuasion    5.0
2                    Melinda and Melinda    5.0
3                              Cube Zero    5.0
4                                Vincent    5.0
5                      Dial M for Murder    5.0
6                       Jean de Florette    5.0
7                     Christmas Carol, A    5.0
8                    Thin Blue Line, The    5.0
9                           Withnail & I    5.0
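
Steps 5 and 6 in the cell above compute a similarity-weighted average rating per movie. The sketch below repeats that calculation on a made-up table (`neighbours`, with hypothetical user IDs, movie IDs and similarity values) and also counts how many neighbours rated each movie, which the notebook does not do.

```python
import pandas as pd

# Each row: one neighbour's rating of one movie, with that neighbour's similarity.
neighbours = pd.DataFrame({
    "userId":          [10, 10, 20, 20, 30],
    "similarityIndex": [1.0, 1.0, 0.5, 0.5, 0.8],
    "movieId":         [100, 200, 100, 300, 100],
    "rating":          [4.0, 5.0, 2.0, 3.0, 5.0],
})
neighbours["weightedRating"] = neighbours["similarityIndex"] * neighbours["rating"]

scores = neighbours.groupby("movieId").agg(
    weightedSum=("weightedRating", "sum"),
    similaritySum=("similarityIndex", "sum"),
    raters=("userId", "nunique"),
)
# Weighted average: sum(similarity * rating) / sum(similarity).
scores["score"] = scores["weightedSum"] / scores["similaritySum"]
print(scores.sort_values("score", ascending=False))
```

When only one neighbour has rated a movie, the similarity cancels out and the score is simply that neighbour's rating. That may be one reason a top-10 list can fill up with ties at 5.0; a column such as `raters` makes it possible to check.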


In [50]:
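# Save the recommendation scores to a file (note: despite the file name,
# recommendationDF holds per-movie scores, not a similarity matrix),
# along with the movieId-to-title mapping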
recommendationDF.to_csv("movie_similarity_matrix.csv", index=True)
movie[['movieId', 'title']].to_csv("movie_title_mapping.csv", index=False)
