
Content-based filtering is a technique that recommendation systems use to personalize suggestions from a user's preferences and past behavior. It analyzes the features of the items and builds a user profile from the items the user has interacted with.

The content-based filtering approach relies on the assumption that if a user likes a particular item, they are likely to enjoy other items that share similar attributes or features. For example, if a user enjoys watching action movies, the system will recommend other action movies based on the attributes of the movies they have previously watched, such as the genre, actors, and directors.

The process of content-based filtering involves extracting relevant features from the items, such as keywords, genre, actors, and directors. The system then compares the features of the items with the user's preferences and generates a list of recommendations based on the similarity between the user profile and item profiles.
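
The profile-matching step described above can be sketched as follows. This is a minimal illustration, not the method used later in this notebook: the item profiles, the ratings, and the helper names `build_user_profile` and `rank_items` are all assumptions made up for the example.

```python
import numpy as np

# Hypothetical item profiles: rows are items, columns are genre flags
# (Action, Comedy, Drama). The values are made up for illustration.
item_profiles = np.array([
    [1.0, 0.0, 0.0],   # item 0: Action
    [1.0, 1.0, 0.0],   # item 1: Action + Comedy
    [0.0, 0.0, 1.0],   # item 2: Drama
    [0.0, 1.0, 0.0],   # item 3: Comedy
])

# Hypothetical ratings given by one user: item index -> rating
user_ratings = {0: 5.0, 2: 1.0}

def build_user_profile(item_profiles, user_ratings):
    """Rating-weighted average of the profiles of the items the user rated."""
    idx = np.array(list(user_ratings.keys()))
    weights = np.array(list(user_ratings.values()))
    return weights @ item_profiles[idx] / weights.sum()

def rank_items(item_profiles, user_profile, exclude=()):
    """Return unseen item indices, most similar to the user profile first."""
    norms = np.linalg.norm(item_profiles, axis=1) * np.linalg.norm(user_profile)
    scores = item_profiles @ user_profile / norms   # cosine similarity per item
    order = np.argsort(-scores)
    return [int(i) for i in order if int(i) not in exclude]

profile = build_user_profile(item_profiles, user_ratings)
recommendations = rank_items(item_profiles, profile, exclude=set(user_ratings))
print(recommendations)
```

Because the user rated the Action item highly, the Action + Comedy item ranks ahead of the pure Comedy item among the unseen ones.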

One of the advantages of content-based filtering is that it is able to recommend items based on the specific interests of the user, even if the items are not popular among the majority of users. It is also able to provide recommendations in the absence of data on other users' preferences.

However, content-based filtering has limitations. Because it only matches items against what the user has already liked, it rarely surfaces items outside the user's past behavior or preferences. It also depends on accurate and relevant item attributes to build item profiles, which can be hard to obtain for certain types of items.

In [None]:
import pandas as pd
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import MultiLabelBinarizer
from scipy.sparse import hstack, csr_matrix
import math
import numpy as np

# Load the data

In [None]:
!wget http://files.grouplens.org/datasets/movielens/ml-1m.zip

--2023-04-09 22:15:27--  http://files.grouplens.org/datasets/movielens/ml-1m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5917549 (5.6M) [application/zip]
Saving to: ‘ml-1m.zip.1’


2023-04-09 22:15:28 (32.0 MB/s) - ‘ml-1m.zip.1’ saved [5917549/5917549]



In [None]:
!unzip ml-1m.zip

Archive:  ml-1m.zip
replace ml-1m/movies.dat? [y]es, [n]o, [A]ll, [N]one, [r]ename: Y
  inflating: ml-1m/movies.dat        
replace ml-1m/ratings.dat? [y]es, [n]o, [A]ll, [N]one, [r]ename: A
  inflating: ml-1m/ratings.dat       
  inflating: ml-1m/README            
  inflating: ml-1m/users.dat         


In [None]:
movies_file = "/content/ml-1m/movies.dat"
ratings_file = "/content/ml-1m/ratings.dat"
movies_cols = ["movie_id", "title", "genres"]
ratings_cols = ["user_id", "movie_id", "rating", "timestamp"]

# Read the files into DataFrames
movies = pd.read_csv(movies_file, sep="::", header=None, names=movies_cols, encoding='ISO-8859-1', engine='python')
ratings = pd.read_csv(ratings_file, sep="::", header=None, names=ratings_cols, encoding='ISO-8859-1', engine='python')

In [None]:
movies.head()

Unnamed: 0,movie_id,title,genres
0,1,Toy Story (1995),Animation|Children's|Comedy
1,2,Jumanji (1995),Adventure|Children's|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama
4,5,Father of the Bride Part II (1995),Comedy


In [None]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


# Preprocess the data

In [None]:
# Extract the release year from the movie titles
movies['year'] = movies['title'].str.extract(r'\((\d{4})\)')
movies['year'] = pd.to_numeric(movies['year'], errors='coerce')

# Remove any movies without a valid year of release
movies.dropna(subset=['year'], inplace=True)

In [None]:
movies.head()

Unnamed: 0,movie_id,title,genres,year
0,1,Toy Story (1995),Animation|Children's|Comedy,1995
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [None]:
movies.isnull().sum()

movie_id    0
title       0
genres      0
year        0
dtype: int64

In [None]:
# Merge the movies and ratings DataFrames
data = pd.merge(movies, ratings, on="movie_id")

In [None]:
data.drop(['title', 'timestamp'], axis=1, inplace=True)
data.drop_duplicates(inplace=True)
data.fillna(0, inplace=True)

In [None]:
data.head()

Unnamed: 0,movie_id,genres,year,user_id,rating
0,1,Animation|Children's|Comedy,1995,1,5
1,1,Animation|Children's|Comedy,1995,6,4
2,1,Animation|Children's|Comedy,1995,8,4
3,1,Animation|Children's|Comedy,1995,9,5
4,1,Animation|Children's|Comedy,1995,10,5


In [None]:
cols = ['movie_id', 'user_id', 'rating', 'genres', 'year']
data = data[cols]

KeyError: ignored

In [None]:
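# Encode the genre strings as TF-IDF vectors. binary=True only binarizes the term
# counts; IDF weighting and L2 normalization still apply, so this is not a plain one-hot encoding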
vectorizer = TfidfVectorizer(stop_words='english', binary=True)
movie_features = vectorizer.fit_transform(movies['genres'])
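
For comparison with the TF-IDF features, a plain one-hot (multi-hot) encoding of the genres could be sketched with `MultiLabelBinarizer`, which is already imported at the top of the notebook. The sample strings follow the MovieLens `A|B|C` format, and the variable names are illustrative assumptions. The last lines also show how the vectorizer's default analyzer tokenizes a genre string.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# A few genre strings in the MovieLens "A|B|C" format
sample_genres = ["Animation|Children's|Comedy", "Comedy|Romance", "Action|Sci-Fi"]

# One column per genre, 1 if the movie has that genre and 0 otherwise
mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform([g.split('|') for g in sample_genres])
print(mlb.classes_)
print(one_hot)

# The default text analyzer lowercases and splits on punctuation,
# so a hyphenated genre becomes two separate tokens
analyzer = TfidfVectorizer(stop_words='english', binary=True).build_analyzer()
tokens = analyzer("Action|Sci-Fi")
print(tokens)
```

With the binarizer each genre stays a single feature, whereas the text analyzer treats "Sci-Fi" as the two tokens "sci" and "fi".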

In [None]:
print(movie_features)

In [None]:
# Implement KNN to find the K-nearest neighbors of each movie
from sklearn.neighbors import NearestNeighbors

model = NearestNeighbors(n_neighbors=10, metric='cosine')
model.fit(movie_features)

In [None]:
# Define a function to recommend movies based on a given movie
def recommend_movies(movie_title):
    # Get the positional index of the given movie (raises IndexError if the title is absent)
    idx = np.flatnonzero((movies['title'] == movie_title).to_numpy())[0]
    print(movies.iloc[idx])
    # Find the nearest neighbors by cosine distance
    distances, indices = model.kneighbors(movie_features[idx])
    # Drop the query movie itself; with tied distances it is not always returned first
    recommended_movies = [movies.iloc[i] for i in indices.flatten() if i != idx]
    # Print the recommended movies
    print("Movies similar to", movie_title, ":\n")
    for movie in recommended_movies:
        print(movie)


In [None]:
# Test the recommender by recommending movies similar to "Toy Story (1995)"
recommend_movies("Toy Story (1995)")

movie_id                              1
title                  Toy Story (1995)
genres      Animation|Children's|Comedy
year                               1995
Name: 0, dtype: object
  (0, 4)	0.34435071658181016
  (0, 3)	0.5917143302654576
  (0, 2)	0.7289010463348881
[[3045 2285 3685 3682 3542 2073 2072 1050    0 2286]]
Movies similar to Toy Story (1995) :

movie_id                           2354
title         Rugrats Movie, The (1998)
genres      Animation|Children's|Comedy
year                               1998
Name: 2285, dtype: object
movie_id                                              3754
title       Adventures of Rocky and Bullwinkle, The (2000)
genres                         Animation|Children's|Comedy
year                                                  2000
Name: 3685, dtype: object
movie_id                           3751
title                Chicken Run (2000)
genres      Animation|Children's|Comedy
year                               2000
Name: 3682, dtype: object
movie_

As you can see, all the recommended movies share the same genres as "Toy Story (1995)".
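
One way to put a number on that observation is the Jaccard overlap between the genre sets of the query and of each recommendation. The sketch below is an assumption-laden illustration: `genre_jaccard` is a hypothetical helper, the first two genre strings are taken from the output above, and the third is made up for contrast.

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard overlap between two 'A|B|C' genre strings (1.0 means identical sets)."""
    a, b = set(genres_a.split('|')), set(genres_b.split('|'))
    return len(a & b) / len(a | b)

query = "Animation|Children's|Comedy"
recommended = [
    "Animation|Children's|Comedy",   # Rugrats Movie, The (1998)
    "Animation|Children's|Comedy",   # Chicken Run (2000)
    "Comedy|Romance",                # made-up contrast case, not from the output
]
overlaps = [genre_jaccard(query, g) for g in recommended]
print(overlaps)
```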

# Evaluation

# Feature = release year

In [None]:
years = movies['year'].values.reshape(-1, 1)

In [None]:
years

array([[1995],
       [1995],
       [1995],
       ...,
       [2000],
       [2000],
       [2000]])

In [None]:
# def computeYearSimilarity(self, movie1, movie2, years):
#     diff = abs(years[movie1] - years[movie2])
#     sim = math.exp(-diff / 10.0)
#     return sim

In [None]:
model = NearestNeighbors(n_neighbors=10, metric='euclidean')
model.fit(years)

In [None]:
# Test the recommender by recommending movies similar to "Toy Story (1995)"

movie_title = "Toy Story (1995)"
# Get the positional index of the given movie
idx = np.flatnonzero((movies['title'] == movie_title).to_numpy())[0]
print(movies.iloc[idx])
# Find the nearest neighbors by Euclidean distance between release years
distances, indices = model.kneighbors(years[idx].reshape(1, -1))
# Drop the query movie itself; many movies share a year, so it is not always returned first
recommended_movies = [movies.iloc[i] for i in indices.flatten() if i != idx]
# Print the recommended movies
print("Movies similar to", movie_title, ":\n")
for movie in recommended_movies:
    print(movie)

movie_id                              1
title                  Toy Story (1995)
genres      Animation|Children's|Comedy
year                               1995
Name: 0, dtype: object
Movies similar to Toy Story (1995) :

movie_id                           10
title                GoldenEye (1995)
genres      Action|Adventure|Thriller
year                             1995
Name: 9, dtype: object
movie_id                        6
title                 Heat (1995)
genres      Action|Crime|Thriller
year                         1995
Name: 5, dtype: object
movie_id                      9
title       Sudden Death (1995)
genres                   Action
year                       1995
Name: 8, dtype: object
movie_id                          3
title       Grumpier Old Men (1995)
genres               Comedy|Romance
year                           1995
Name: 2, dtype: object
movie_id                                     5
title       Father of the Bride Part II (1995)
genres                           

# Evaluation

# Multiple features at once

In [None]:
movies.head()

Unnamed: 0,movie_id,title,genres,year
0,1,Toy Story (1995),Animation|Children's|Comedy,1995
1,2,Jumanji (1995),Adventure|Children's|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995
3,4,Waiting to Exhale (1995),Comedy|Drama,1995
4,5,Father of the Bride Part II (1995),Comedy,1995


In [None]:
# Split the "genres" column into multiple columns using get_dummies()
genres_df = movies['genres'].str.get_dummies('|')
genres_df

Unnamed: 0,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0
3,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3878,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3879,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3880,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3881,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0


In [None]:
movie_features = movies.drop(['genres', 'title'], axis=1, inplace=False)

In [None]:
movie_features.shape

(3883, 2)

In [None]:
genres_df.shape

(3883, 18)

In [None]:
movie_features = pd.merge(movie_features, genres_df, left_index=True, right_index=True)

In [None]:
movie_features.tail()

Unnamed: 0,movie_id,year,Action,Adventure,Animation,Children's,Comedy,Crime,Documentary,Drama,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
3878,3948,2000,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
3879,3949,2000,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3880,3950,2000,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3881,3951,2000,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
3882,3952,2000,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0


In [None]:
movie_features.isnull().sum()

movie_id       0
year           0
Action         0
Adventure      0
Animation      0
Children's     0
Comedy         0
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
dtype: int64

In [None]:
movie_features.iloc[1][1:]

year           1995
Action            0
Adventure         1
Animation         0
Children's        1
Comedy            0
Crime             0
Documentary       0
Drama             0
Fantasy           1
Film-Noir         0
Horror            0
Musical           0
Mystery           0
Romance           0
Sci-Fi            0
Thriller          0
War               0
Western           0
Name: 1, dtype: int64

In [None]:
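# Cosine similarity between the genre flags of two feature rows
def compute_genre_similarity(x1, x2):
  # Columns 0 and 1 hold movie_id and year; the genre flags start at column 2
  genres1 = np.asarray(x1[2:], dtype=float)
  genres2 = np.asarray(x2[2:], dtype=float)
  sumxx = np.dot(genres1, genres1)
  sumyy = np.dot(genres2, genres2)
  sumxy = np.dot(genres1, genres2)
  # Assumes each movie has at least one genre; otherwise the denominator is zero
  return sumxy / math.sqrt(sumxx * sumyy)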

In [None]:
# exponential similarity score
def compute_year_similarity(x1, x2):
  diff = abs(x1[1] - x2[1])
  sim = math.exp(-diff / 10.0)
  return sim
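
To get a feel for how quickly this score decays, the sketch below evaluates the same formula for a few year gaps. The standalone function `year_similarity` and its `scale` parameter are illustrative assumptions; the notebook hard-codes the divisor 10.0.

```python
import math

def year_similarity(year_a, year_b, scale=10.0):
    """exp(-|gap| / scale): 1.0 for the same year, smaller as the gap grows."""
    return math.exp(-abs(year_a - year_b) / scale)

for other in (1995, 2000, 2005, 1965):
    print(other, round(year_similarity(1995, other), 3))
```

A gap of 10 years multiplies the score by about 0.37 (that is, e^-1), so movies released decades apart contribute very little.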

In [None]:
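def compute_distance(x1, x2):
  genre_similarity = compute_genre_similarity(x1, x2)
  year_similarity = compute_year_similarity(x1, x2)

  # NearestNeighbors treats smaller values as closer, so turn the combined
  # similarity (in [0, 1]) into a distance rather than returning it directly
  return 1.0 - year_similarity * genre_similarity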


In [None]:
movie_features.columns

Index(['movie_id', 'year', 'Action', 'Adventure', 'Animation', 'Children's',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')

In [None]:
type(movie_features)

pandas.core.frame.DataFrame

In [None]:
nn = NearestNeighbors(n_neighbors=10, metric=compute_distance)
nn.fit(movie_features)
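
A Python callable metric is evaluated pair by pair, which can be slow for a few thousand movies. As a hedged alternative, the same genre-times-year score can be computed for all pairs at once with NumPy. The function name `combined_distance_matrix`, the `scale` parameter, and the toy data are assumptions for illustration only.

```python
import numpy as np

def combined_distance_matrix(genre_flags, years, scale=10.0):
    """Pairwise distance: 1 - (cosine genre similarity * exponential year similarity)."""
    genre_flags = np.asarray(genre_flags, dtype=float)
    years = np.asarray(years, dtype=float)
    # Normalize each row so that a dot product equals the cosine similarity
    # (assumes every movie has at least one genre, so no norm is zero)
    unit = genre_flags / np.linalg.norm(genre_flags, axis=1, keepdims=True)
    genre_sim = unit @ unit.T
    year_sim = np.exp(-np.abs(years[:, None] - years[None, :]) / scale)
    # Clip tiny negative values caused by floating-point rounding
    return np.clip(1.0 - genre_sim * year_sim, 0.0, None)

# Toy data: three movies, three genre flags each
genre_flags = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
years = [1995, 1998, 1995]

dist = combined_distance_matrix(genre_flags, years)
nearest = np.argsort(dist[0])[:3]   # neighbors of movie 0, closest first
print(dist.round(3))
print(nearest)
```

A matrix like this could in principle be passed to `NearestNeighbors(metric='precomputed')`, at the cost of holding all pairwise distances in memory.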

In [None]:
# Test the recommender by recommending movies similar to "Toy Story (1995)"

movie_title = "Toy Story (1995)"
# Get the positional index of the given movie
idx = np.flatnonzero((movies['title'] == movie_title).to_numpy())[0]
print(movies.iloc[idx])
# Find the nearest neighbors under the custom genre-and-year metric
distances, indices = nn.kneighbors(movie_features.iloc[idx:idx+1])
# Pair each neighbor with its distance and drop the query movie itself
neighbors = [(i, d) for i, d in zip(indices.flatten(), distances.flatten()) if i != idx]
# Print the recommended movies
print("Movies similar to", movie_title, ":\n")
for i, distance in neighbors:
  print(movies.iloc[i])
  print('distance:', distance)

movie_id                              1
title                  Toy Story (1995)
genres      Animation|Children's|Comedy
year                               1995
Name: 0, dtype: object
Movies similar to Toy Story (1995) :

movie_id                            2414
title       Young Sherlock Holmes (1985)
genres          Action|Adventure|Mystery
year                                1985
Name: 2345, dtype: object
distance: 0.0
movie_id                               2421
title       Karate Kid, Part II, The (1986)
genres               Action|Adventure|Drama
year                                   1986
Name: 2352, dtype: object
distance: 0.0
movie_id                      2420
title       Karate Kid, The (1984)
genres                       Drama
year                          1984
Name: 2351, dtype: object
distance: 0.0
movie_id                   2425
title       General, The (1998)
genres                    Crime
year                       1998
Name: 2356, dtype: object
distance: 0.0
movie_id   

In [None]:
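# Build the k-nearest-neighbors graph. The default mode='connectivity' stores 1.0
# for every neighbor pair, so the entries are not similarity scores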
similarity_matrix = nn.kneighbors_graph(movie_features)
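
The difference between the two `mode` values of `kneighbors_graph` is easier to see on a tiny example. The three one-dimensional points and the variable names below are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Three points on a line; each point's 2 nearest neighbors include itself
points = np.array([[0.0], [1.0], [3.0]])
knn = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(points)

# 'connectivity' marks neighbor pairs with 1.0; 'distance' stores the metric value
connectivity = knn.kneighbors_graph(points, mode='connectivity').toarray()
distance = knn.kneighbors_graph(points, mode='distance').toarray()
print(connectivity)
print(distance)
```

In `distance` mode a point's entry for itself is 0.0, which a dense view cannot tell apart from "not a neighbor", so the sparse structure is the reliable record of which pairs are connected.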

In [None]:
print(similarity_matrix)

  (0, 2350)	1.0
  (0, 2345)	1.0
  (0, 2352)	1.0
  (0, 2351)	1.0
  (0, 2356)	1.0
  (0, 2346)	1.0
  (0, 2343)	1.0
  (0, 2359)	1.0
  (0, 2342)	1.0
  (0, 2353)	1.0
  (1, 2551)	1.0
  (1, 2550)	1.0
  (1, 2556)	1.0
  (1, 2555)	1.0
  (1, 2554)	1.0
  (1, 2552)	1.0
  (1, 2558)	1.0
  (1, 2544)	1.0
  (1, 2547)	1.0
  (1, 2560)	1.0
  (2, 2320)	1.0
  (2, 2308)	1.0
  (2, 2324)	1.0
  (2, 2322)	1.0
  (2, 2328)	1.0
  :	:
  (3880, 2284)	1.0
  (3880, 2280)	1.0
  (3880, 2294)	1.0
  (3880, 2278)	1.0
  (3880, 2290)	1.0
  (3881, 2285)	1.0
  (3881, 2281)	1.0
  (3881, 2287)	1.0
  (3881, 2286)	1.0
  (3881, 2292)	1.0
  (3881, 2284)	1.0
  (3881, 2280)	1.0
  (3881, 2294)	1.0
  (3881, 2278)	1.0
  (3881, 2290)	1.0
  (3882, 2211)	1.0
  (3882, 2212)	1.0
  (3882, 2190)	1.0
  (3882, 2191)	1.0
  (3882, 2192)	1.0
  (3882, 2196)	1.0
  (3882, 2197)	1.0
  (3882, 2201)	1.0
  (3882, 2206)	1.0
  (3882, 2188)	1.0


# Evaluation