**Project Objective**

_1. Create a popularity-based recommender system at a genre level._<br/>
_2. The user inputs a genre (g), a minimum ratings threshold (t), and a number of recommendations (N)._<br/>
_3. The system recommends the top N movies within that genre, ordered by average rating in descending order, where each movie has at least (t) reviews._

# 1. Creating a popularity-based recommender system at a genre level
**Step 1: Import Libraries and Load Dataset**

In [2]:
import pandas as pd
import numpy as np

# Load datasets
movies_df = pd.read_csv('/Users/mac/Desktop/movies.csv')
ratings_df = pd.read_csv('/Users/mac/Desktop/ratings.csv')

In [3]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


In [5]:
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523
...,...,...,...,...
105334,668,142488,4.0,1451535844
105335,668,142507,3.5,1451535889
105336,668,143385,4.0,1446388585
105337,668,144976,2.5,1448656898


**Step 2: Data Cleaning Process**

In [8]:
# Drop duplicate rows from the movies data

movies_df = movies_df.drop_duplicates()
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


In [9]:
# Drop duplicate rows from the ratings data

ratings_df = ratings_df.drop_duplicates()
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523
...,...,...,...,...
105334,668,142488,4.0,1451535844
105335,668,142507,3.5,1451535889
105336,668,143385,4.0,1446388585
105337,668,144976,2.5,1448656898
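
`drop_duplicates()` returns an unchanged frame when nothing is duplicated, so the displayed output alone does not show whether any rows were removed. A minimal sketch of counting duplicates first, using a small made-up frame in place of `ratings_df` (the sample rows are illustrative and not taken from the dataset):

```python
import pandas as pd

# Hypothetical sample standing in for ratings_df; the last row repeats the first.
sample = pd.DataFrame({
    "userId": [1, 1, 2, 1],
    "movieId": [16, 24, 16, 16],
    "rating": [4.0, 1.5, 3.0, 4.0],
})

# Count fully duplicated rows, then remove them.
n_duplicates = int(sample.duplicated().sum())
deduplicated = sample.drop_duplicates()

print(f"{n_duplicates} duplicate row(s) removed, {len(deduplicated)} remain")
```

Comparing the row count before and after the call gives the same information on the real data.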


In [10]:
# Check for missing values using the `.isnull()` method
missing_values = movies_df.isnull()

print("Missing values:")
print("\033[1mThere are no missing values in the dataframe:\033[0m")
print(missing_values)

Missing values:
There are no missing values in the dataframe:
       movieId  title  genres
0        False  False   False
1        False  False   False
2        False  False   False
3        False  False   False
4        False  False   False
...        ...    ...     ...
10324    False  False   False
10325    False  False   False
10326    False  False   False
10327    False  False   False
10328    False  False   False

[10329 rows x 3 columns]


In [11]:
# Check for missing values using the `.isnull()` method
missing_values = ratings_df.isnull()

print("Missing values:")
print("\033[1mThere are no missing values in the dataframe:\033[0m")
print(missing_values)

Missing values:
There are no missing values in the dataframe:
        userId  movieId  rating  timestamp
0        False    False   False      False
1        False    False   False      False
2        False    False   False      False
3        False    False   False      False
4        False    False   False      False
...        ...      ...     ...        ...
105334   False    False   False      False
105335   False    False   False      False
105336   False    False   False      False
105337   False    False   False      False
105338   False    False   False      False

[105339 rows x 4 columns]
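
The truncated boolean frames above only show the first and last rows, and the "no missing values" message is printed whether or not it holds. One possible alternative is to count missing values per column. The sketch below runs on a made-up frame (the rows and the two deliberate gaps are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two deliberate gaps; not real dataset rows.
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", None],
    "genres": ["Comedy", np.nan, "Drama"],
})

# Number of missing cells in each column, and overall.
missing_per_column = sample.isnull().sum()
total_missing = int(missing_per_column.sum())

if total_missing == 0:
    print("No missing values found.")
else:
    print(f"{total_missing} missing value(s) found:")
    print(missing_per_column[missing_per_column > 0])
```

Applied to `movies_df` and `ratings_df`, a total of zero would confirm the claim for every row rather than only the rows that are displayed.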


In [12]:
#Verify that each column has the correct data type
print(movies_df.dtypes)

movieId     int64
title      object
genres     object
dtype: object


In [13]:
#Verify that each column has the correct data type
print(ratings_df.dtypes)

userId         int64
movieId        int64
rating       float64
timestamp      int64
dtype: object


**Step 3: Check the structure of the datasets**

In [14]:
print(movies_df.head())
print(ratings_df.head())

   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  
   userId  movieId  rating   timestamp
0       1       16     4.0  1217897793
1       1       24     1.5  1217895807
2       1       32     4.0  1217896246
3       1       47     4.0  1217896556
4       1       50     4.0  1217896523


**Step 4: Merge Data**

In [15]:
merged_data = pd.merge(ratings_df, movies_df, on='movieId')

In [16]:
print(merged_data.head())

   userId  movieId  rating   timestamp          title       genres
0       1       16     4.0  1217897793  Casino (1995)  Crime|Drama
1       9       16     4.0   842686699  Casino (1995)  Crime|Drama
2      12       16     1.5  1144396284  Casino (1995)  Crime|Drama
3      24       16     4.0   963468757  Casino (1995)  Crime|Drama
4      29       16     3.0   836820223  Casino (1995)  Crime|Drama
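
`pd.merge` performs an inner join by default, so any rating whose `movieId` is absent from the movies table is dropped without notice. A hedged sketch on toy frames (the rows, including the unmatched `movieId` 999, are made up) shows how the `validate` and `indicator` options can surface such cases:

```python
import pandas as pd

# Hypothetical samples standing in for ratings_df and movies_df.
ratings_sample = pd.DataFrame({
    "userId": [1, 9, 12],
    "movieId": [16, 16, 999],
    "rating": [4.0, 4.0, 1.5],
})
movies_sample = pd.DataFrame({
    "movieId": [16],
    "title": ["Casino (1995)"],
    "genres": ["Crime|Drama"],
})

# A left join keeps every rating; validate raises if movieId is not unique
# in the movies table; indicator adds a '_merge' column recording the match.
checked = pd.merge(
    ratings_sample, movies_sample,
    on="movieId", how="left", validate="many_to_one", indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
merged_sample = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(f"{len(unmatched)} rating(s) had no matching movie")
print(merged_sample)
```

If `unmatched` is empty on the real data, the inner join used above loses nothing.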


**Step 5: Filter Data by Genre**

In [17]:
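def filter_by_genre(genre, min_reviews, num_recommendations):
    # Keep ratings for movies whose genre string contains the requested genre (case-insensitive)
    genre_data = merged_data[merged_data['genres'].str.contains(genre, case=False)]
    # Per title: mean rating and number of ratings (the count is stored in the 'movieId' column)
    movie_ratings = genre_data.groupby('title').agg({'rating': 'mean', 'movieId': 'count'}).reset_index()
    # Drop titles with fewer than min_reviews ratings
    filtered_data = movie_ratings[movie_ratings['movieId'] >= min_reviews]
    # Highest average rating first, then keep the top N
    sorted_data = filtered_data.sort_values(by='rating', ascending=False).head(num_recommendations)
    return sorted_data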

# Example usage
genre = 'Comedy'
min_reviews = 100
num_recommendations = 5
top_movies = filter_by_genre(genre, min_reviews, num_recommendations)
print(top_movies)


                                       title    rating  movieId
2093  Monty Python and the Holy Grail (1975)  4.301948      154
995                             Fargo (1996)  4.271144      201
2498              Princess Bride, The (1987)  4.163743      171
2523                     Pulp Fiction (1994)  4.160000      325
1069                     Forrest Gump (1994)  4.138264      311
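
In the output above, the `movieId` column holds the number of ratings rather than an ID, and the genre match is a substring search that treats its argument as a regular expression. A possible variant, sketched under assumptions, uses named aggregation and matches whole genre names. The function name `top_n_by_genre`, the column names `avg_rating` and `num_reviews`, and the sample rows are all hypothetical:

```python
import pandas as pd

# Made-up ratings already joined with genres; stands in for merged_data.
sample = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "title": ["A", "A", "A", "B", "B", "C"],
    "genres": ["Comedy|Drama"] * 3 + ["Comedy"] * 2 + ["Drama"],
    "rating": [4.0, 5.0, 3.0, 5.0, 5.0, 4.5],
})

def top_n_by_genre(data, genre, min_reviews, n):
    """Top n movies in a genre (g) with at least min_reviews (t) ratings."""
    # Split the pipe-separated genre string and match whole genre names only.
    wanted = genre.lower()
    in_genre = data["genres"].str.lower().str.split("|").apply(lambda gs: wanted in gs)
    # Group by movieId as well as title so distinct movies sharing a title stay separate.
    stats = (
        data[in_genre]
        .groupby(["movieId", "title"], as_index=False)
        .agg(avg_rating=("rating", "mean"), num_reviews=("rating", "count"))
    )
    popular = stats[stats["num_reviews"] >= min_reviews]
    return popular.sort_values("avg_rating", ascending=False).head(n)

result = top_n_by_genre(sample, "comedy", min_reviews=2, n=5)
print(result)
```

Here the three arguments correspond to g, t, and N from the project objective.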


# 2. Content-Based Recommender System

**_Creating a content-based recommender system that recommends the top N movies based on similar movie genres._**

**Step 1: Import Libraries and Load Dataset**

In [20]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
# Load dataset
movies_df = pd.read_csv('/Users/mac/Desktop/movies.csv')

In [37]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


In [42]:
movies_df["title"].str.strip( )

0                           Toy Story (1995)
1                             Jumanji (1995)
2                    Grumpier Old Men (1995)
3                   Waiting to Exhale (1995)
4         Father of the Bride Part II (1995)
                        ...                 
10324          Cosmic Scrat-tastrophe (2015)
10325             Le Grand Restaurant (1966)
10326         A Very Murray Christmas (2015)
10327                   The Big Short (2015)
10328    Marco Polo: One Hundred Eyes (2015)
Name: title, Length: 10329, dtype: object

In [43]:
# Strip unwanted "|" and "/" characters from both ends of each title
movies_df["title"] = movies_df["title"].str.strip("|/" )
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
10324,146684,Cosmic Scrat-tastrophe (2015),Animation|Children|Comedy
10325,146878,Le Grand Restaurant (1966),Comedy
10326,148238,A Very Murray Christmas (2015),Comedy
10327,148626,The Big Short (2015),Drama


**Step 2: Vectorize based on similar movie genres.**

In [44]:
count_vectorizer = CountVectorizer(tokenizer=lambda x: x.split('|'))
genre_matrix = count_vectorizer.fit_transform(movies_df['genres'])
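
To see what the genre matrix contains, the first three movies can be encoded by hand. The sketch below swaps in pandas' `Series.str.get_dummies` instead of `CountVectorizer`. Assuming no genre is repeated within a movie, it should yield the same 0/1 entries, although `CountVectorizer` lowercases the genre names by default:

```python
import pandas as pd

# Genre strings for the first three movies shown above, indexed by title.
genres = pd.Series(
    [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
    index=["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
)

# One column per genre, 1 where the movie has that genre and 0 otherwise.
genre_indicators = genres.str.get_dummies(sep="|")
print(genre_indicators)
```

Each row is the vector that the cosine similarity step compares.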

**Step 3: Calculate Cosine Similarity**

In [45]:
cosine_sim = cosine_similarity(genre_matrix, genre_matrix)

In [46]:
print(cosine_sim)

[[1.         0.77459667 0.31622777 ... 0.4472136  0.         0.        ]
 [0.77459667 1.         0.         ... 0.         0.         0.        ]
 [0.31622777 0.         1.         ... 0.70710678 0.         0.        ]
 ...
 [0.4472136  0.         0.70710678 ... 1.         0.         0.        ]
 [0.         0.         0.         ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.         0.         1.        ]]
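
The first entries of this matrix can be checked by hand. For binary genre vectors, cosine similarity reduces to the number of shared genres divided by the square root of the product of the two genre counts. A small sketch with NumPy, using the genre lists of the first three movies (the helper name `cosine` is introduced here for illustration):

```python
import numpy as np

# Columns: Adventure, Animation, Children, Comedy, Fantasy, Romance
toy_story = np.array([1, 1, 1, 1, 1, 0])
jumanji = np.array([1, 0, 1, 0, 1, 0])
grumpier_old_men = np.array([0, 0, 0, 1, 0, 1])

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_toy_jumanji = cosine(toy_story, jumanji)              # 3 / sqrt(5 * 3)
sim_toy_grumpier = cosine(toy_story, grumpier_old_men)    # 1 / sqrt(5 * 2)
sim_jumanji_grumpier = cosine(jumanji, grumpier_old_men)  # no shared genres

print(round(sim_toy_jumanji, 8), round(sim_toy_grumpier, 8), sim_jumanji_grumpier)
```

These values agree with the first three columns of the first two rows printed above: Toy Story and Jumanji share three genres, Toy Story and Grumpier Old Men share only Comedy, and Jumanji and Grumpier Old Men share none.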


**Step 4: Recommend Similar Movies**

In [51]:
def recommend_similar_movies(movie_title, num_recommendations):
    # Check if the movie title exists in the DataFrame (exact match, including the year)
    if movie_title not in movies_df['title'].values:
        print(f"Movie title '{movie_title}' not found in the database.")
        return []

    # Proceed if the movie title is found
    idx = movies_df[movies_df['title'] == movie_title].index[0]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Drop the query movie itself, then keep the N most similar movies
    similarity_scores = [s for s in similarity_scores if s[0] != idx][:num_recommendations]

    # Get the movie indices
    movie_indices = [i[0] for i in similarity_scores]

    # Return the top N most similar movies
    return movies_df['title'].iloc[movie_indices]

In [52]:
movie_title = 'Grumpier Old Men'
num_recommendations = 5
similar_movies = recommend_similar_movies(movie_title, num_recommendations)
print(similar_movies)

Movie title 'Grumpier Old Men' not found in the database.
[]
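
The lookup fails because titles in the dataset carry the release year, as in `Grumpier Old Men (1995)`, and the function requires an exact match. One way to make the lookup more forgiving is a case-insensitive substring search. This is a sketch only, and the helper `find_title_matches` is hypothetical:

```python
import pandas as pd

# The first three titles from the movies table shown earlier.
titles = pd.Series([
    "Toy Story (1995)",
    "Jumanji (1995)",
    "Grumpier Old Men (1995)",
])

def find_title_matches(query, titles):
    """Return every title containing the query text, ignoring case."""
    # regex=False treats the query as plain text, so parentheses are safe.
    mask = titles.str.contains(query, case=False, regex=False)
    return titles[mask].tolist()

matches = find_title_matches("grumpier old men", titles)
print(matches)
```

A match returned this way could then be passed to `recommend_similar_movies` as the full title. A query that matches several titles would need a rule for choosing among them.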


# 3. Collaborative-Based Recommender System
**_Create a collaborative-based recommender system that recommends the top N movies based on K similar users for a target user._**

**Step 1: Import Libraries and Load Datasets**

In [53]:
import pandas as pd
from sklearn.neighbors import NearestNeighbors

In [54]:
# Load datasets
ratings = pd.read_csv('/Users/mac/Desktop/ratings.csv')

In [55]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


**Step 2: Create User-Item Matrix**

In [56]:
user_movie_matrix = ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
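
The pivot turns the long ratings table into one row per user and one column per movie. A small sketch on made-up ratings (the sample rows are assumptions) shows the resulting shape and how unrated movies become 0:

```python
import pandas as pd

# Hypothetical ratings: three users, three movies, five ratings in total.
sample = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "movieId": [16, 24, 16, 24, 32],
    "rating": [4.0, 1.5, 3.0, 5.0, 2.0],
})

# Rows are users, columns are movies; missing ratings are filled with 0.
matrix = sample.pivot(index="userId", columns="movieId", values="rating").fillna(0)
print(matrix)
```

Filling with 0 makes an unrated movie indistinguishable from a very low rating, which affects both the similarity computation and the averages taken later. Note also that `pivot` raises an error if a user has rated the same movie more than once.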

**Step 3: Train Nearest Neighbors Model**

In [57]:
knn = NearestNeighbors(metric='cosine', algorithm='brute')
knn.fit(user_movie_matrix)

NearestNeighbors(algorithm='brute', metric='cosine')

**Step 4: Recommend Movies Based on Similar Users**

In [58]:
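def recommend_movies_for_user(user_id, num_recommendations, num_similar_users):
    # Find the nearest users; request one extra because the target user is returned as its own neighbor
    distances, indices = knn.kneighbors(user_movie_matrix.loc[user_id].values.reshape(1, -1), n_neighbors=num_similar_users+1)
    # Skip the first hit, which is assumed to be the target user
    similar_users = indices.flatten()[1:]
    # Average each movie's rating across the similar users (unrated movies count as 0)
    similar_users_ratings = user_movie_matrix.iloc[similar_users].mean(axis=0)
    user_ratings = user_movie_matrix.loc[user_id]
    # Keep only movies the target user has not rated, ranked by the neighbors' average rating
    recommendations = similar_users_ratings[user_ratings == 0].sort_values(ascending=False).head(num_recommendations)
    return recommendations.index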
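
The function relies on `kneighbors` returning the target user as the first neighbor, at distance 0. A minimal sketch on a made-up three-user matrix (the ratings are assumptions) illustrates that behavior:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical user-item matrix: rows are users, columns are movies.
ratings_matrix = np.array([
    [4.0, 0.0, 5.0],   # user at position 0
    [4.0, 0.0, 4.0],   # similar tastes to position 0
    [0.0, 5.0, 0.0],   # no overlap with position 0
])

knn_demo = NearestNeighbors(metric="cosine", algorithm="brute")
knn_demo.fit(ratings_matrix)

# Query with the first user's row and ask for the two closest rows.
distances, indices = knn_demo.kneighbors(ratings_matrix[0:1], n_neighbors=2)
print(indices.flatten(), distances.flatten())
```

The returned positions are row positions in the fitted matrix, not `userId` values, which is why the function uses `.iloc` for the neighbors and `.loc` for the target user.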

In [59]:
# Example usage
user_id = 1
num_recommendations = 5
num_similar_users = 100
recommended_movies = recommend_movies_for_user(user_id, num_recommendations, num_similar_users)
print(recommended_movies)

Int64Index([1291, 1, 1036, 1200, 4226], dtype='int64', name='movieId')
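
The result is an index of `movieId` values, which is hard to read without the titles. A possible follow-up step looks the IDs up in the movies table. The sketch below uses a three-row stand-in for `movies_df` and made-up recommended IDs:

```python
import pandas as pd

# Stand-in for movies_df, using the first three rows shown earlier.
movies_sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
})

# Hypothetical stand-in for the index returned by recommend_movies_for_user.
recommended_ids = pd.Index([3, 1], name="movieId")

# Look the IDs up by movieId, preserving the recommendation order.
recommended_titles = (
    movies_sample.set_index("movieId").loc[recommended_ids, "title"].tolist()
)
print(recommended_titles)
```

`.loc` raises a `KeyError` if a recommended ID is missing from the movies table, so this assumes every rated movie also appears there.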
