# Mini Project: Recommendation Engines

Recommendation engines are algorithms designed to provide personalized suggestions or recommendations to users. These systems analyze user behavior, preferences, and interactions with items (products, movies, music, articles, etc.) to predict and offer items that users are likely to be interested in. Recommendation engines play a crucial role in enhancing user experience, driving engagement, and increasing conversion rates in various applications, including e-commerce, entertainment, content platforms, and more.

There are generally two approaches taken in collaborative filtering and content-based recommendation engines:

**1. Collaborative Filtering:**
Collaborative Filtering is a popular approach to building recommendation systems that leverages the collective behavior of users to make personalized recommendations. It is based on the idea that users who have agreed in the past will likely agree in the future. There are two main types of collaborative filtering:

- **User-based Collaborative Filtering:** This method finds users similar to the target user based on their past interactions (e.g., ratings or purchases). It then recommends items that similar users have liked but the target user has not interacted with yet.

- **Item-based Collaborative Filtering:** In this approach, the system identifies similar items based on user interactions. It recommends items that are similar to the ones the target user has already liked or interacted with.

Collaborative filtering does not require any explicit information about items but relies on the similarity between users or items. It is effective in capturing complex patterns and can provide serendipitous recommendations. However, it suffers from the cold-start problem (i.e., difficulty in recommending to new users or items with no interactions) and scalability challenges in large datasets.

**2. Content-Based Recommendation:**
Content-based recommendation is an alternative approach to building recommendation systems that focuses on the attributes or features of items and users. It leverages the characteristics of items to make recommendations. The key steps involved in content-based recommendation are:

- **Feature Extraction:** For each item, relevant features are extracted. For movies, these features could be genre, director, actors, and plot summary.

- **User Profile:** A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.

- **Similarity Calculation:** The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity or Euclidean distance.

- **Recommendation:** Items that are most similar to the user profile are recommended to the user.

Content-based recommendation systems are less affected by the cold-start problem as they can still recommend items based on their features. They are also more interpretable as they rely on item attributes. However, they may miss out on providing serendipitous recommendations and can be limited by the quality of feature extraction and user profiles.

**Choosing Between Collaborative Filtering and Content-Based:**
Both collaborative filtering and content-based approaches have their strengths and weaknesses. The choice between them depends on the specific requirements of the recommendation system, the type of data available, and the user base. Hybrid approaches that combine collaborative filtering and content-based techniques are also common, aiming to leverage the strengths of both methods and mitigate their weaknesses.

In this mini-project, you'll be building both content based and collaborative filtering engines for the [MovieLens 25M dataset](https://grouplens.org/datasets/movielens/25m/). The MovieLens 25M dataset is one of the most widely used and popular datasets for building and evaluating recommendation systems. It is provided by the GroupLens Research project, which collects and studies datasets related to movie ratings and recommendations. The MovieLens 25M dataset contains movie ratings and other related information contributed by users of the MovieLens website.

**Dataset Details:**
- **Size:** The dataset contains approximately 25 million movie ratings.
- **Users:** It includes ratings from over 162,000 users.
- **Movies:** The dataset consists of ratings for more than 62,000 movies.
- **Ratings:** The ratings are provided on a scale of 1 to 5, where 1 is the lowest rating and 5 is the highest.
- **Timestamps:** Each rating is associated with a timestamp, indicating when the rating was given.

**Data Files:**
The dataset is usually split into three CSV files:

1. **movies.csv:** Contains information about movies, including the movie ID, title, genres, and release year.
   - Columns: movieId, title, genres

2. **ratings.csv:** Contains movie ratings provided by users, including the user ID, movie ID, rating, and timestamp.
   - Columns: userId, movieId, rating, timestamp

3. **tags.csv:** Contains user-generated tags for movies, including the user ID, movie ID, tag, and timestamp.
   - Columns: userId, movieId, tag, timestamp

First, import all the libraries you'll need.

In [1]:
import zipfile
import numpy as np
import pandas as pd
from urllib.request import urlretrieve
from sklearn.metrics.pairwise import cosine_similarity

Next, download the relevant components of the MoveLens dataset. Note, these instructions are roughly based on the colab [here](https://colab.research.google.com/github/google/eng-edu/blob/main/ml/recommendation-systems/recommendation-systems.ipynb?utm_source=ss-recommendation-systems&utm_campaign=colab-external&utm_medium=referral&utm_content=recommendation-systems#scrollTo=O3bcgduFo4s6).

In [2]:
print("Downloading movielens data...")

urlretrieve('http://files.grouplens.org/datasets/movielens/ml-100k.zip', 'movielens.zip')
zip_ref = zipfile.ZipFile('movielens.zip', 'r')
zip_ref.extractall()
print("Done. Dataset contains:")
print(zip_ref.read('ml-100k/u.info'))

ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(
    'ml-100k/u.data', sep='\t', names=ratings_cols, encoding='latin-1')

# The movies file contains a binary feature for each genre.
genre_cols = [
    "genre_unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = [
    'movie_id', 'title', 'release_date', "video_release_date", "imdb_url"
] + genre_cols
movies = pd.read_csv(
    'ml-100k/u.item', sep='|', names=movies_cols, encoding='latin-1')

Downloading movielens data...
Done. Dataset contains:
b'943 users\n1682 items\n100000 ratings\n'


Before doing any kind of machine learning, it's always good to familiarize yourself with the datasets you'lll be working with.

Here are your tasks:

1. Spend some time familiarizing yourself with both the `movies` and `ratings` dataframes. How many unique user ids are present? How many unique movies are there?
  * `ANSWERS:`
    * 943 unique user ID's
    * 1,682 unique movie_id's
    * But only 1,664 unique movie titles
2. Create a new dataframe that merges the `movies` and `ratings` tables on 'movie_id'. Only keep the 'user_id', 'title', 'rating' fields in this new dataframe.

## EDA

In [None]:
# Spend some time familiarizing yourself with both the movies and ratings
# dataframes. How many unique user ids are present? (943) How many unique movies
# are there? (1,682 unique movie_ids, but only 1,664 unique movie titles)

**Ratings EDA**

The `ratings` DataFrame has 4 columns and 100,000 records. There are no missing values in this dataset.

Here is an explanation of the columns in the `ratings` dataset that will be included in the combined dataset:
* *user_id* - 943 unique user_id numberes, ranging from 1 to 943
* *movie_id* - 1,682 unique movie_id numbers, ranging from 1 to 1,682
* *rating* - 5 unique rating numbers, ranging from 1 to 5

In [3]:
ratings.head()

Unnamed: 0,user_id,movie_id,rating,unix_timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [4]:
ratings.shape

(100000, 4)

In [5]:
ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   user_id         100000 non-null  int64
 1   movie_id        100000 non-null  int64
 2   rating          100000 non-null  int64
 3   unix_timestamp  100000 non-null  int64
dtypes: int64(4)
memory usage: 3.1 MB


In [6]:
ratings['rating'].unique()

array([3, 1, 2, 4, 5])

In [7]:
ratings['user_id'].nunique()

943

In [8]:
ratings['user_id'].sort_values()

Unnamed: 0,user_id
66567,1
62820,1
10207,1
9971,1
22496,1
...,...
96823,943
70902,943
84518,943
72321,943


In [9]:
ratings['movie_id'].nunique()

1682

In [10]:
ratings['movie_id'].sort_values()

Unnamed: 0,movie_id
25741,1
93639,1
55726,1
49529,1
89079,1
...,...
75323,1678
67302,1679
80394,1680
92329,1681


In [11]:
# check if there are duplicate entries between user_id and movie_id
duplicates = ratings[ratings.duplicated(subset=['user_id', 'movie_id'], keep=False)]
duplicates.sum()

Unnamed: 0,0
user_id,0
movie_id,0
rating,0
unix_timestamp,0


**Movies EDA**


The `movies` DataFrame has 24 columns and 1,682 records.

Here is an explanation of the columns in the `movies` dataset:
* *movie_id* - 1,682 unique movie_id numbers, values range from 1 to 1,682
* *title* - 1,664 unique strings for movie titles
  * **duplicated movie titles**
    * there are 18 movie titles that have two unique movie_id's assigned to them; these duplicates may cause problems with the calculations and should be removed before building the recommender systems
* *release_date* - date movie was released, in string format of Day(2 digits)-Month(short word)-Year(4 digits)
  * has 1 missing value
* *video_release_date* - all values are missing for this column
* *imdb_url* - string for url to the movie
  * has 3 missing values
* 19 separate genre-based columns - value of 0 or 1 to indicate if the movie meets this genre category


The remaining columns in the `movie` dataset provide data on each movie's release dates, imdb_url, and 19 different categories of movie genres.

In [12]:
movies.head()

Unnamed: 0,movie_id,title,release_date,video_release_date,imdb_url,genre_unknown,Action,Adventure,Animation,Children,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Toy%20Story%2...,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
1,2,GoldenEye (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?GoldenEye%20(...,0,1,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,3,Four Rooms (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Four%20Rooms%...,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,4,Get Shorty (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Get%20Shorty%...,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Copycat (1995),01-Jan-1995,,http://us.imdb.com/M/title-exact?Copycat%20(1995),0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [13]:
movies.shape

(1682, 24)

In [14]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1682 entries, 0 to 1681
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            1682 non-null   int64  
 1   title               1682 non-null   object 
 2   release_date        1681 non-null   object 
 3   video_release_date  0 non-null      float64
 4   imdb_url            1679 non-null   object 
 5   genre_unknown       1682 non-null   int64  
 6   Action              1682 non-null   int64  
 7   Adventure           1682 non-null   int64  
 8   Animation           1682 non-null   int64  
 9   Children            1682 non-null   int64  
 10  Comedy              1682 non-null   int64  
 11  Crime               1682 non-null   int64  
 12  Documentary         1682 non-null   int64  
 13  Drama               1682 non-null   int64  
 14  Fantasy             1682 non-null   int64  
 15  Film-Noir           1682 non-null   int64  
 16  Horror

In [15]:
movies.columns

Index(['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url',
       'genre_unknown', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')

In [16]:
genres = movies.columns[5:]
print(f"Number of genres: {len(genres)} \nGenres: {genres}")


Number of genres: 19 
Genres: Index(['genre_unknown', 'Action', 'Adventure', 'Animation', 'Children',
       'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir',
       'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War',
       'Western'],
      dtype='object')


In [17]:
movies['movie_id'].nunique()

1682

In [18]:
movies['title'].nunique()

1664

There is a discrepancy between the number of unique movie_id's and unique titles. The below code will explore this discrepancy further.

*Find number of duplicated titles*

In [19]:
# create variable to hold duplicated titles
duplicated_titles = list(movies[movies['title'].duplicated()]['title'])

# print number of duplicated values
print(f'Number of duplicated titles \n{len(duplicated_titles)}')

Number of duplicated titles 
18


## Data Cleaning

Since there are rows in the `movies` dataframe which have duplicate movie titles, these duplicates are likely to cause problems if used in the recommender calculations. Let's clean up the movies dataframe and then determine if this will cause issues when combining the `movies` and `ratings` dataframes later in the notebook.

**See how the titles are duplicated in the dataset.**

In [20]:
# create df with specific columns to compare (movie id, title, release date, and all genre columns)
# and only rows that have duplicated information
compare = ['movie_id', 'title', 'release_date'] + list(genres)
dup_titles = movies[movies['title'].duplicated(keep=False)].sort_values(by='title')[compare]

# print size of the new df
print(f'Size of new df: {dup_titles.shape}')

# view duplicated titles to see how many movie_id's are assigned per title
for title in duplicated_titles:
  title_rows = dup_titles[dup_titles['title'] == title]
  print(title_rows[['movie_id', 'title']])

Size of new df: (36, 22)
     movie_id               title
245       246  Chasing Amy (1997)
267       268  Chasing Amy (1997)
     movie_id               title
302       303  Ulee's Gold (1997)
296       297  Ulee's Gold (1997)
     movie_id                      title
347       348  Desperate Measures (1998)
328       329  Desperate Measures (1998)
     movie_id                 title
499       500  Fly Away Home (1996)
303       304  Fly Away Home (1996)
     movie_id                  title
669       670  Body Snatchers (1993)
572       573  Body Snatchers (1993)
     movie_id                      title
679       680  Kull the Conqueror (1997)
265       266  Kull the Conqueror (1997)
     movie_id                  title
304       305  Ice Storm, The (1997)
864       865  Ice Storm, The (1997)
     movie_id               title
875       876  Money Talks (1997)
880       881  Money Talks (1997)
      movie_id                  title
1002      1003  That Darn Cat! (1997)
877        878  T

The dataframe created with just the duplicated entries contains 36 rows, twice as many rows as the number of duplicated movie titles. By printing the dataframe, it is clear that each duplicated movie title has 2 unique movie_id's assigned to it.

The below code will create a dataframe that removes the movie_id and release_date from the current dup_titles dataframe. This way, the dataframe will only contain the rows with the duplicated titles, and column information that will be used later in this notebook.

By running .duplicated() on this new dataframe, we can see if the column values match between the duplicated movie titles. If so, we can save the secondary movie_id's per title, then remove the entire row for these secondary movie_id's.

In [21]:
# create new df with just title and genre columns
# create updated compare list that contains title and genre columns
compare = ['title'] + list(genres)

## create df with title, release date, and all genre columns
duplicates = dup_titles[compare]

In [22]:
# check for duplicates in 'movies' dataframe
# no entries returned means there are no rows in the full 'movies' dataframe
# with matching values across all columns
movies.duplicated().sum()

0

In [23]:
# check for duplicates in df with movie_id, release_date, title, and genre columns
# no entries returned means there are no rows
# with matching values across the listed columns
dup_titles.duplicated().sum()

0

In [24]:
# check for duplicates in df with just title and genre columns
# no entries returned means there are no rows
# with matching values across the listed columns
duplicates.duplicated().sum()

18

The title and genre columns match across the duplicated titles, so one of the duplicated entries for each unique title can be deleted.

However, the movie_id's of matching pairs needs to be saved so the `ratings` df can also be updated.

**Create new movies dataframe with only unique movie titles.**

In [25]:
# save movie_id pairs for each duplicate movie title
dup_ids = []
for title in duplicated_titles:
  my_list = []
  title_rows = dup_titles[dup_titles['title'] == title]
  my_list.append(title_rows['movie_id'].values)
  dup_ids.append(my_list)

# change nested array structure to list of tuples
dup_ids = [(id[0], id[1]) for sublist in dup_ids for id in sublist]
dup_ids

[(246, 268),
 (303, 297),
 (348, 329),
 (500, 304),
 (670, 573),
 (680, 266),
 (305, 865),
 (876, 881),
 (1003, 878),
 (1257, 1256),
 (1606, 309),
 (1607, 1395),
 (1617, 1175),
 (1625, 1477),
 (1650, 1645),
 (1234, 1654),
 (1658, 711),
 (1680, 1429)]

In [26]:
# create copy of movies df
movies_no_dupes = movies.copy()
print(movies_no_dupes.shape)

# remove the second movie_id entry of duplicate titles from new movies list
for id1, id2 in dup_ids:
  drop_index = movies_no_dupes[movies_no_dupes['movie_id'] == id2].index
  movies_no_dupes.drop(drop_index, inplace=True)

print(movies_no_dupes.shape)

(1682, 24)
(1664, 24)


In [27]:
movies_no_dupes[movies_no_dupes['title'].duplicated()]['title'].sum()

0

**Create new ratings dataframe that replaces movie_id's of duplicate titles.**

The movie_id values that were removed in the new movies df will also no longer appear in the new ratings df. However, instead of removing the values entirely, they will be replaced by the movie_id that remained in the movies df for each duplicated title.

Example: Chasing Amy had these two movie_id's assigned to it in the movies df: 246 and 268. The new movies df no longer contains the row with the movie_id equal to 268. In the new ratings dataframe, any rows with the movie_id equal to 268 will be have its movie_id replaced with the value 246.

In [28]:
# Replace ratings movie_id's that were removed with still available movie ids
ratings_no_dupes = ratings.copy()
print(ratings_no_dupes.shape)

for id1, id2 in dup_ids:
  ratings_no_dupes.loc[ratings_no_dupes['movie_id'] == id2, 'movie_id'] = id1

print(ratings_no_dupes.shape)

(100000, 4)
(100000, 4)


Verify new dataframe was created correctly. The movie `Chasing Amy` had two movie_id's assigned in the original `movies` df: 246 and 268. The updated movies df removed the row with movie_id equal to 268.

To verify new data frame is made correctly, the number of 246 movie_id entries in the new ratings df should be equal to the number of 246 and 268 movie_id entries in the old ratings dataframe. The movie_id 268 should not occur at all in the new ratings df.

In [29]:
# count of movie_id values 246 and 268 in ratings df
twofoursix = ratings[ratings['movie_id'] == 246].value_counts().sum()
twosixeight = ratings[ratings['movie_id'] == 268].value_counts().sum()
twofoursix + twosixeight

379

In [30]:
# count of movie_id value 246 in new ratings df
ratings_no_dupes[ratings_no_dupes['movie_id'] == 246].value_counts().sum()

379

In [31]:
# count of movie_id value 268 in new ratings df
ratings_no_dupes[ratings_no_dupes['movie_id'] == 268].value_counts().sum()

0

## Merge movies and ratings dataframes

In [None]:
# Merge movies and ratings dataframes

Instructions state to merge the `ratings` and `movies` dataframes on 'movie_id', and only keep the 'user_id', 'title', 'rating' fields in the new dataframe.

Since these dataframes were updated to remove duplicated information, the code below will merge the two updated ratings and movies dataframes together.

In [32]:
# merge 'ratings' and 'movies' on 'movie_id'
merge_movie_id = ratings_no_dupes.merge(movies_no_dupes[['movie_id', 'title']], on='movie_id')

# only keep 'user_id', 'title', and 'rating' columns
movies_ratings_df = merge_movie_id[['user_id', 'title', 'rating']]

movies_ratings_df.head()

Unnamed: 0,user_id,title,rating
0,196,Kolya (1996),3
1,186,L.A. Confidential (1997),3
2,22,Heavyweights (1994),1
3,244,Legends of the Fall (1994),2
4,166,Jackie Brown (1997),1


In [33]:
movies_ratings_df.shape

(100000, 3)

Since the updated ratings dataframe had replaced movie_id's because there were duplicate title's in the movies dataframe, this may mean there will be duplicates across titles in the combined movies_ratings_df.

The below code will see if there are duplicates.

In [34]:
# count number of rows where user_id, title, and rating match
movies_ratings_df.duplicated().sum()

261

In [35]:
# count number of duplicate rows where user_id & title match
mov_rat_trunc = movies_ratings_df[['user_id', 'title']]
mov_rat_trunc.duplicated().sum()

307

When examining the combined movies and ratings dataframes, there are 261 duplicate rows where all column values match, and 370 duplicate rows where just the user_id and title rows match.

This difference indicates that some users rated a movie more than once, and their ratings were always the same between reviews.

The multiple ratings per user/movie should be aggregated and put into a single rating entry so it does not cause issues with the calculations for recommender systems.

In [36]:
# aggregate duplicate movie ratings per user & print shape of new df
mr_no_dupes_df = movies_ratings_df.groupby(['title', 'user_id'], as_index=False)['rating'].mean()
mr_no_dupes_df.shape

(99693, 3)

In [37]:
# verify no duplicates between all three column values
print(mr_no_dupes_df.duplicated().sum())

# verify no duplicates between user_id and title columns
mov_rat_trunc = mr_no_dupes_df[['user_id', 'title']]
print(mov_rat_trunc.duplicated().sum())

0
0


# Content Based Filtering

As mentioned in the introduction, content-Based Filtering is a recommendation engine approach that focuses on the attributes or features of items (products, movies, music, articles, etc.) and leverages these features to make personalized recommendations. The underlying idea is to match the characteristics of items with the preferences of users to suggest items that align with their interests. Content-based filtering is particularly useful when explicit user-item interactions (e.g., ratings or purchases) are sparse or unavailable.

**Key Steps in Content-Based Filtering:**

1. **Feature Extraction:**
   - For each item, relevant features are extracted. These features are typically descriptive attributes that can be represented numerically, such as genre, director, actors, author, publication date, and keywords.
   - In the case of text-based items, natural language processing techniques may be used to extract features like TF-IDF (Term Frequency-Inverse Document Frequency) scores.

2. **User Profile Creation:**
   - A user profile is created based on the items they have interacted with in the past. The user profile contains the weighted importance of features based on their interactions.
   - For example, if a user has watched several action movies, the action genre feature would receive a higher weight in their profile.

3. **Similarity Calculation:**
   - The similarity between items or between items and the user profile is calculated using similarity metrics like cosine similarity, Euclidean distance, or Pearson correlation.
   - Cosine similarity is commonly used as it measures the cosine of the angle between two vectors, which represents their similarity.

4. **Recommendation:**
   - Items that are most similar to the user profile are recommended to the user. These are items whose features have the highest similarity scores with the user profile.
   - The recommended items are presented as a list sorted by their similarity scores.

**Advantages of Content-Based Filtering:**
1. **No Cold-Start Problem:** Content-based filtering can make recommendations even for new users with no historical interactions because it relies on item features rather than user history.

2. **User Independence:** The recommendations are based solely on the features of items and do not require knowledge of other users' preferences or behavior.

3. **Transparency:** Content-based recommendations are interpretable, as they depend on the features of items, making it easier for users to understand why specific items are recommended.

4. **Serendipity:** Content-based filtering can recommend items with characteristics not seen before by the user, leading to serendipitous discoveries.

5. **Diversity in Recommendations:** The method can offer diverse recommendations since it suggests items with different feature combinations.

**Limitations of Content-Based Filtering:**
1. **Limited Discovery:** Content-based filtering may struggle to recommend items outside the scope of users' historical interactions or interests.

2. **Over-Specialization:** Users may receive recommendations that are too similar to their previous choices, leading to a lack of exposure to new item categories.

3. **Dependency on Feature Quality:** The quality and relevance of item features significantly influence the quality of recommendations.

4. **Limited for Cold Items:** Content-based filtering can struggle to recommend new items with limited feature information.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return content-based recommendations for this user. Here are steps you can take:

  A. Get the user's rated movies

  B. Create a TF-IDF matrix using movie genres. Note, this can be extracted from the `movies` dataframe.

  C. Compute the cosine similarity between movie genres. Use the [cosine_similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html) function.

  D. Get the indices of similar movies to those rated by the user based on cosine similarity. Keep only the top 5.

  E. Remove duplicates and movies already rated by the user.

Before creating the function to make content-based recommendations, a dataframe will be constructed containing properly formatted genre data for the TD-IDF vectorizer function.

TD-IDF requires text to vectorize, but the genre information is spread across 19 columns containing the value of 0 or 1. A new genre column will be created that contains all of the genre classes for each specific record.

In [38]:
# import modules
import warnings
warnings.filterwarnings('ignore')
from sklearn.feature_extraction.text import TfidfVectorizer

In [39]:
# add 'genre' column to df -- combines all relevant genres into one column
movies_no_dupes['genres'] = movies.apply(
    lambda x: ' '.join([genre.replace('-','_').lower() for genre in genres if x[genre] == 1]), axis=1
)

In [40]:
sorted(movies_no_dupes['genres'].unique())[:5]

['action',
 'action adventure',
 'action adventure animation children fantasy',
 'action adventure animation horror sci_fi',
 'action adventure children']

In [41]:
# create function to perform Content-Based Filtering using Movie Genres
def content_based_recommendation(user_id, df):
  # Create a TF-IDF matrix using movie genres
  tfidfvec = TfidfVectorizer(stop_words='english', min_df=2)
  tfidf_matrix = tfidfvec.fit_transform(movies_no_dupes['genres'])
  tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidfvec.get_feature_names_out())
  tfidf_df.index = movies_no_dupes['title']

  # Compute the cosine similarity between movie genres
  cos_sim_array = cosine_similarity(tfidf_df)
  cos_sim_df = pd.DataFrame(cos_sim_array, index=tfidf_df.index, columns=tfidf_df.index)

  # Create User Profile -- Get Average of Rated Movies
  user_rated_movies = df[df['user_id'] == user_id]['title']
  user_movies = tfidf_df.reindex(user_rated_movies)
  user_profile = user_movies.mean()

  # Create dataframe with movies not rated by user
  non_rated_movies = tfidf_df.drop(user_rated_movies, axis=0)

  # calculate cosine similarities between user profile and movies not rated by user
  user_prof_similarities = cosine_similarity(user_profile.values.reshape(1, -1), non_rated_movies)
  user_prof_similarities_df = pd.DataFrame(user_prof_similarities.T, index=non_rated_movies.index, columns=['similarity'])

  # Get the indices for the top 5 similar movies based on user profile cosine similarity
  # ensure no duplicate suggestions are in the list
  user_prof_similarities_df = user_prof_similarities_df.sort_values(by='similarity', ascending=False).drop_duplicates()
  top_5 = list(user_prof_similarities_df.head().index)

  return top_5

# Collaborative Based Filtering

The key idea behind collaborative filtering is that users who have agreed in the past will likely agree in the future. Instead of relying on item attributes or user profiles, collaborative filtering identifies patterns of user behavior and item preferences from the interactions present in the data.

**Types of Collaborative Filtering:**
There are two main types of collaborative filtering:

**Collaborative Filtering Process:**
The collaborative filtering process typically involves the following steps:

1. **Data Collection:**
   - Gather data on user-item interactions, such as movie ratings, product purchases, or article clicks.

2. **User-Item Matrix:**
   - Organize the data into a user-item matrix, where rows represent users, columns represent items, and the entries contain the users' interactions (e.g., ratings).

3. **Similarity Calculation:**
   - Calculate the similarity between users or items using similarity metrics such as cosine similarity, Pearson correlation, or Jaccard similarity.
   - For user-based collaborative filtering, user similarities are calculated, and for item-based collaborative filtering, item similarities are calculated.

4. **Neighborhood Selection:**
   - For each user or item, select the most similar users or items as the neighborhood.
   - The size of the neighborhood (the number of similar users or items to consider) is an important parameter to control the system's behavior.

5. **Prediction Generation:**
   - Predict the ratings for items that the target user has not yet interacted with by combining the ratings of neighboring users or items.

6. **Recommendation Generation:**
   - Recommend items with the highest predicted ratings to the target user.

**Advantages of Collaborative Filtering using User-Item Interactions:**
- Collaborative filtering is based solely on user interactions and does not require knowledge of item attributes, making it useful for cases where item data is sparse or unavailable.
- It can provide serendipitous recommendations, suggesting items that users may not have discovered on their own.
- Collaborative filtering can be applied in various domains, including e-commerce, music, movie, and content recommendations.

**Limitations of Collaborative Filtering:**
- The cold-start problem: Collaborative filtering struggles to recommend to new users or items with no or limited interaction history.
- It may suffer from sparsity when data is limited or when users have only interacted with a small subset of items.
- Scalability issues can arise with large datasets and an increasing number of users or items.

Here is your task:

1. Write a function that takes in a user id and the dataframe you created before that contains 'user_id', 'title', and 'rating'. The function should return collaborative filtering recommendations for this user based on a user-item interaction matrix. Here are steps you can take:

  A. Create the user-item matrix using Pandas' [pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html).

  B. Fill missing values with zeros in this matrix.

  C. Calculate user-user similarity matrix using cosine similarity.

  D. Get the array of similarity scores of the target user with all other users from the similarity matrix.

  E. Extract, say the the top 5 most similar users (excluding the target user).

  F. Generate movie recommendations based on the most similar users.

  G. Remove duplicate movies recommendations.

In [42]:
# Collaborative Filtering using User-Item Interactions
def collaborative_filtering_recommendation(user_id, df):
  # Create the user-item matrix
  user_ratings_pivot = df.pivot(index='user_id', columns='title', values='rating')

  # Fill missing values with 0 (indicating no rating)
  avg_ratings = user_ratings_pivot.mean(axis=1)
  user_ratings_pivot = user_ratings_pivot.sub(avg_ratings, axis=0).fillna(0)

  # Calculate user-user similarity matrix using cosine similarity
  similarities = cosine_similarity(user_ratings_pivot)
  similarities_df = pd.DataFrame(similarities, index=user_ratings_pivot.index, columns=user_ratings_pivot.index)

  # Get the similarity scores of the target user with all other users
  target_user_similarities = similarities_df.loc[user_id]

  # Find the top 3 most similar users (excluding the target user)
  sorted_target_sims = target_user_similarities.sort_values(ascending=False)
  nearest_neighbors = sorted_target_sims.index[1:4]

  # Generate movie recommendations based on the most similar users
  recommendations = pd.Series(dtype=float)

  # create scores based on each neighbor's user ratings and similarity score with neighbor
  for neighbor in nearest_neighbors:
    neighbor_ratings = user_ratings_pivot.loc[neighbor]
    similarity_score = target_user_similarities[neighbor]
    weighted_ratings = neighbor_ratings * similarity_score
    recommendations = recommendations.add(weighted_ratings, fill_value=0)

  # Remove duplicates from recommendations
  rated_movies = list(df[df['user_id'] == user_id]['title'])

  # sort recommendations by score & return the top 5 results
  new_recommendations = recommendations.drop(rated_movies, errors='ignore').sort_values(ascending=False)
  top_5 = list(new_recommendations.head().index)

  return top_5

Now, test your recommendations engines! Select a few user ids and generate recommendations using both functions you've written. Are the recommendations similar? Do the recommendations make sense?

# Test Recommenders

In [43]:
# Test the recommendation engines

## Content Based Recommender

To determine how relevant the suggestions are from the content based recommender system, the below code will generate the top five recommendations for two different user_id's. Then it will compare the genres that each user likes, and see if the suggested movies contain these genres and at what rate.

**user_id == 5**

In [44]:
# print recommended movies for user_id 5
id5_content = content_based_recommendation(5, mr_no_dupes_df)
id5_content

['Army of Darkness (1993)',
 'Evil Dead II (1987)',
 'Faster Pussycat! Kill! Kill! (1965)',
 'Cowboy Way, The (1994)',
 'House Arrest (1996)']

In [45]:
# create user_id 5's genre preferences

# Create a TF-IDF matrix using movie genres
tfidfvec = TfidfVectorizer(stop_words='english', min_df=2)
tfidf_matrix = tfidfvec.fit_transform(movies_no_dupes['genres'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidfvec.get_feature_names_out())
tfidf_df.index = movies_no_dupes['title']

# Compute the cosine similarity between movie genres
cos_sim_array = cosine_similarity(tfidf_df)
cos_sim_df = pd.DataFrame(cos_sim_array, index=tfidf_df.index, columns=tfidf_df.index)

In [46]:
# Create function for creating user profiles after all other pre-cals are done
def user_profile_10(user_id):
  user_rated_movies = mr_no_dupes_df[mr_no_dupes_df['user_id'] == user_id]['title']
  user_movies = tfidf_df.reindex(user_rated_movies)
  user_profile = user_movies.mean()
  return user_profile.sort_values(ascending=False).head(10)

In [47]:
# get top 10 preferred genres for user_id 5
user5_profile = user_profile_10(5)
user5_genres = set(user5_profile.index)
user5_profile

Unnamed: 0,0
comedy,0.300029
action,0.176468
horror,0.141371
sci_fi,0.118758
children,0.112358
adventure,0.111292
drama,0.094226
romance,0.064045
thriller,0.061015
animation,0.056325


In [48]:
# print genres of the movies recommended for user_id 5
recommended_genres = ['Comedy', 'Action', 'Horror', 'Sci-Fi', 'Adventure', 'Drama', 'genres']
movies_no_dupes[movies_no_dupes['title'].isin(id5_content)][recommended_genres]

Unnamed: 0,Comedy,Action,Horror,Sci-Fi,Adventure,Drama,genres
73,1,1,0,0,0,1,action comedy drama
183,1,1,1,1,1,0,action adventure comedy horror sci_fi
200,1,1,1,0,1,0,action adventure comedy horror
1048,1,0,0,0,0,0,comedy
1182,1,1,0,0,0,0,action comedy


Comparing user_id 5's genre preferences to the number of recommended movies with these genres, it is clear the movie recommendations are similar to the user's genre preferences.

|genre|score|# Suggestions in Genre|Genre Rank|
|:--:|:--:|:--:|:--:|
|comedy|0.300|5 of 5|1|
|action|0.176|4 of 5|2|
|horror|0.141|2 of 5|3|
|sci_fi|0.119|1 of 5|4|
|adventure|0.111|2 of 5|5|
|drama|0.094|1 of 5|7|


Four of the 5 movies contain the user's top two genre preferences, and all movies fall within the top 7 genres preferred by the user.

**user_id == 197**

In [49]:
# print recommended movies for user_id 197
id197_content = content_based_recommendation(197, mr_no_dupes_df)
id197_content

["Smilla's Sense of Snow (1997)",
 'Spawn (1997)',
 'Twister (1996)',
 'Tokyo Fist (1995)',
 'Mirage (1995)']

In [50]:
# create user_id 197's genre preferences
user197_profile = user_profile_10(197)
user197_genres = set(user197_profile.index)
user197_profile

Unnamed: 0,0
action,0.369636
thriller,0.171526
adventure,0.170156
drama,0.15604
sci_fi,0.128265
romance,0.106102
war,0.102121
crime,0.101828
comedy,0.08547
western,0.036651


In [51]:
# print genres of the movies recommended for user_id 197
recommended_genres = ['Action', 'Thriller', 'Adventure', 'Drama', 'Sci-Fi', 'genres']
movies_no_dupes[movies_no_dupes['title'].isin(id197_content)][recommended_genres]

Unnamed: 0,Action,Thriller,Adventure,Drama,Sci-Fi,genres
117,1,1,1,0,0,action adventure thriller
243,1,1,0,1,0,action drama thriller
357,1,1,1,0,1,action adventure sci_fi thriller
1612,1,0,0,1,0,action drama
1672,1,1,0,0,0,action thriller


Comparing user_id 197's genre preferences to the number of recommended movies with these genres, it is clear the movie recommendations are similar to the user's genre preferences.

|genre|preference|# Suggestions in Genre|Genre Rank|
|:--:|:--:|:--:|:--:|
|action|0.370|5 of 5|1|
|thriller|0.172|4 of 5|2|
|adventure|0.170|2 of 5|3|
|drama|0.156|2 of 5|4|
|sci_fi|0.128|1 of 5|5|


Four of the 5 movies contain the user's top two genre preferences, and all movies fall within the top 5 genres preferred by the user.

## Collaborative Filtering Recommender System

It will be slightly more tricky to determine how well the collaborative filtering recommender system performs. The collaborative recommender will be used against the same two user_id's used with the content based recommender system.

The resulting recommended movies will be reviewed for their related movie genres. In addition, the 5 closest neighbors will be identified and their user profiles will be reviewed. If the user preferences match closesly with the genres preferred by both the user's tested and their neighbors, the recommender system is likely useful.

**user_id == 5**

In [52]:
# recommendations for user_id 5
id5_colab = collaborative_filtering_recommendation(5, mr_no_dupes_df)
id5_colab

['Terminator 2: Judgment Day (1991)',
 'Godfather, The (1972)',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'L.A. Confidential (1997)',
 'Chasing Amy (1997)']

In [53]:
# print genres of the movies recommended for user_id 5
recommended_genres = ['Action', 'Sci-Fi', 'Drama', 'Thriller', 'Romance', 'Crime', 'Mystery', 'Film-Noir', 'War', 'genres']
movies_no_dupes[movies_no_dupes['title'].isin(id5_colab)][recommended_genres]

Unnamed: 0,Action,Sci-Fi,Drama,Thriller,Romance,Crime,Mystery,Film-Noir,War,genres
95,1,1,0,1,0,0,0,0,0,action sci_fi thriller
126,1,0,1,0,0,1,0,0,0,action crime drama
245,0,0,1,0,1,0,0,0,0,drama romance
301,0,0,0,1,0,1,1,1,0,crime film_noir mystery thriller
473,0,1,0,0,0,0,0,0,1,sci_fi war


There are several more genres covered in the five suggested movies, and each of the genres are represented only once or twice among all the movies. This list of movies is more spread out in terms of genre categories.

When comparing the 9 genres covered in the collaborative filtering recommendations versus the user's profile references, there are only 5 genres that match user_id 5's top 10 genre preferences.

Now we'll look at the preferences of the nearest neighbors.

In [54]:
# find nearest neighbors to user_id 5
# Create the user-item matrix
user_ratings_pivot = mr_no_dupes_df.pivot(index='user_id', columns='title', values='rating')

# Fill missing values with 0 (indicating no rating)
avg_ratings = user_ratings_pivot.mean(axis=1)
user_ratings_pivot = user_ratings_pivot.sub(avg_ratings, axis=0).fillna(0)

# Calculate user-user similarity matrix using cosine similarity
similarities = cosine_similarity(user_ratings_pivot)
similarities_df = pd.DataFrame(similarities, index=user_ratings_pivot.index, columns=user_ratings_pivot.index)

# Get the similarity scores of the target user with all other users
target_user_similarities = similarities_df.loc[5]

# Find the top n_neighbors most similar users (excluding the target user)
sorted_target_sims = target_user_similarities.sort_values(ascending=False)
user5_nearest_neighbors = list(sorted_target_sims.index[1:4])

In [55]:
user5_nearest_neighbors

[268, 497, 276]

In [56]:
# create user profiles for each nearest neighbor and print their top ten prefereces
for user in user5_nearest_neighbors:
  print('user_id', user)
  current_user = user_profile_10(user)
  print(current_user, '\n')
  user5_genres.update(current_user.index)

user_id 268
comedy       0.250018
action       0.201911
thriller     0.161098
drama        0.132683
sci_fi       0.100697
adventure    0.100193
romance      0.083379
horror       0.066552
crime        0.061368
war          0.044236
dtype: float64 

user_id 497
comedy       0.258474
action       0.221829
thriller     0.135223
sci_fi       0.124694
adventure    0.111863
romance      0.108685
drama        0.092899
children     0.066786
horror       0.064485
crime        0.053006
dtype: float64 

user_id 276
comedy       0.205377
drama        0.197862
action       0.178378
thriller     0.165714
sci_fi       0.100629
adventure    0.088066
romance      0.081815
horror       0.076009
crime        0.073426
children     0.042756
dtype: float64 



It is easy to see why these users were chosen as close neighbors. All user's top preferred genre is comedy. Three of the users, to include user_id 5, has their second most preferred genre as action, while the remaining user has action as their third preferred genre.

  

In [57]:
# number of unique genres across user_id 5 and neighbor preferences
user5_genres

{'action',
 'adventure',
 'animation',
 'children',
 'comedy',
 'crime',
 'drama',
 'horror',
 'romance',
 'sci_fi',
 'thriller',
 'war'}

The above list of genres shows the top 10 genres preferred by user_id 5, as well as the genre categories for the suggested movies. There are two genres from the suggested movies that are part of user_id 5's top 10 genres: crime and war.

When reviewing each neighbor's user profile, crime appears in all of the neighbors top 10 preferred genres, and war appears in one of the neighbors preferred list. This may explain why the recommended movies have genre combinations of 1 or more preferred user genres and 1 or more preferred neighbor genres.

This combination of user and neighbor preferences helps provide interesting suggestions to the user, as it widens the field of movies to options outside their narrow preferences.

In [58]:
# see if neighbors rated all five recommendations high
ratings = []

user5_nn_ratings = mr_no_dupes_df[
    (mr_no_dupes_df['user_id'].isin(user5_nearest_neighbors)) &
    (mr_no_dupes_df['title'].isin(id5_colab))
]

user5_nn_ratings

Unnamed: 0,title,user_id,rating
16556,Chasing Amy (1997),268,5.0
16560,Chasing Amy (1997),276,4.0
16627,Chasing Amy (1997),497,4.0
26417,Dr. Strangelove or: How I Learned to Stop Worr...,268,5.0
26421,Dr. Strangelove or: How I Learned to Stop Worr...,276,5.0
38473,"Godfather, The (1972)",268,4.0
38477,"Godfather, The (1972)",276,5.0
38584,"Godfather, The (1972)",497,5.0
50766,L.A. Confidential (1997),268,5.0
50769,L.A. Confidential (1997),276,5.0


Each movie suggested to user_id five was rated by at least two of the neighbors, and all of the ratings are either a 4 or a 5. The recommendations for user_id 5 based on nearest neighbors makes sense.

**user_id == 197**

In [60]:
# recommendation for user_id 197 -- 3 neighbors
id197_colab = collaborative_filtering_recommendation(197, mr_no_dupes_df)
id197_colab

['Natural Born Killers (1994)',
 'Star Trek: First Contact (1996)',
 'Die Hard (1988)',
 'Heavy Metal (1981)',
 'Alien: Resurrection (1997)']

In [61]:
# print genres of the movies recommended for user_id 197
recommended_genres = ['Action', 'Sci-Fi', 'Thriller', 'Adventure', 'Horror', 'Animation', 'genres']
movies_no_dupes[movies_no_dupes['title'].isin(id197_colab)][recommended_genres]

Unnamed: 0,Action,Sci-Fi,Thriller,Adventure,Horror,Animation,genres
52,1,0,1,0,0,0,action thriller
100,1,1,0,1,1,1,action adventure animation horror sci_fi
143,1,0,1,0,0,0,action thriller
221,1,1,0,1,0,0,action adventure sci_fi
342,1,1,0,0,1,0,action horror sci_fi


The recommendations for user_id 197 are more similar to their user profile genre preferences as compared to user_id 5's. All of the movies fall into the user's top genre preference of action. Three of the movies include the user's 5th preferred genre, and two movies each include the user's second and third top genres.

When comparing the 6 genres covered in the collaborative filtering recommendations versus the user's profile of top 10 genres, there are 4 matcheing genres.

Now we'll look at the preferences of the nearest neighbors.

In [62]:
# find nearest neighbors to user_id 197
# Get the similarity scores of the target user with all other users
target_user_similarities = similarities_df.loc[197]

# Find the top n_neighbors most similar users (excluding the target user)
sorted_target_sims = target_user_similarities.sort_values(ascending=False)
user197_nearest_neighbors = list(sorted_target_sims.index[1:4])

In [63]:
user197_nearest_neighbors

[8, 600, 826]

In [64]:
# create user profiles for each nearest neighbor and print their top ten prefereces
for user in user197_nearest_neighbors:
  print('user_id', user)
  current_user = user_profile_10(user)
  print(current_user, '\n')
  user197_genres.update(current_user.index)

user_id 8
action       0.365102
sci_fi       0.180012
thriller     0.175141
adventure    0.174270
war          0.131379
comedy       0.122603
drama        0.117631
crime        0.114625
romance      0.054041
western      0.042565
dtype: float64 

user_id 600
action       0.504683
adventure    0.248574
sci_fi       0.177020
thriller     0.138289
crime        0.095664
drama        0.076815
western      0.076666
war          0.075554
romance      0.063056
comedy       0.059813
dtype: float64 

user_id 826
action       0.396047
adventure    0.181704
sci_fi       0.158499
thriller     0.138775
animation    0.115472
crime        0.100035
children     0.092278
romance      0.078164
drama        0.074948
war          0.068645
dtype: float64 



It is easy to see why these users were chosen as close neighbors. All user's top preferred genre is action and the number of movies with high ratings for action is significantly higher compared to the other genres. Of user_id 197's top five genre preferences, all three neighbors have four of them in their own top five genre preferences.

In [65]:
# number of unique genres across user_id 5 and neighbor preferences
user197_genres

{'action',
 'adventure',
 'animation',
 'children',
 'comedy',
 'crime',
 'drama',
 'romance',
 'sci_fi',
 'thriller',
 'war',
 'western'}

The above list of genres shows the top 10 genres preferred by user_id 197, as well as the genre categories for the suggested movies. There are two genres from the suggested movies that are NOT part of user_id 197's top 10 genres: animation and children.

When reviewing each neighbor's user profile, animation and children appears only in the third suggested nearest neighbor. All the remaining neighbors have the exact same top 10 genre preference as user_id 197. This may explain why the list of movie recommendations matches better with the user profile's preferred genres as compared to user_id 5's recommended movies.

Since the nearest neighbors have very similar genre preferences to user_id, the recommended movies will be very similar to movies suggested based solely on the user's own genre preferences.

In [66]:
# see if neighbors rated all five recommendations high
ratings = []

user197_nn_ratings = mr_no_dupes_df[
    (mr_no_dupes_df['user_id'].isin(user197_nearest_neighbors)) &
    (mr_no_dupes_df['title'].isin(id197_colab))
]

user197_nn_ratings

Unnamed: 0,title,user_id,rating
3306,Alien: Resurrection (1997),826,5.0
25002,Die Hard (1988),8,5.0
42600,Heavy Metal (1981),826,5.0
62468,Natural Born Killers (1994),600,4.0
62498,Natural Born Killers (1994),826,5.0
84312,Star Trek: First Contact (1996),8,5.0


Only one movie suggested to user_id 197 was rated by at least two of the neighbors. All other suggested movies were rated by a single neighbor. This may throw off the accuracy of the predictions since there is not a concensus on ratings per movie.

**User_id Recommendations by Recommender System**

Below is a print out of the movie results per user_id by recommender system.

In [67]:
# user_id 5's content based recommendations
id5_content

['Army of Darkness (1993)',
 'Evil Dead II (1987)',
 'Faster Pussycat! Kill! Kill! (1965)',
 'Cowboy Way, The (1994)',
 'House Arrest (1996)']

In [68]:
# user_id 5's content based recommendations
id5_colab

['Terminator 2: Judgment Day (1991)',
 'Godfather, The (1972)',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'L.A. Confidential (1997)',
 'Chasing Amy (1997)']

In [69]:
# user_id 197's content based recommendations
id197_content

["Smilla's Sense of Snow (1997)",
 'Spawn (1997)',
 'Twister (1996)',
 'Tokyo Fist (1995)',
 'Mirage (1995)']

In [70]:
# user_id 197's content based recommendations
id197_colab

['Natural Born Killers (1994)',
 'Star Trek: First Contact (1996)',
 'Die Hard (1988)',
 'Heavy Metal (1981)',
 'Alien: Resurrection (1997)']

# Summary of Results

When reviewing both the recommendations provided by the Content Based Recommender System and by the Collaborative Filtering Recommender System, the results made sense for how the models were predicting movies the user would like.

Of note is that each recommender system provided completely different movie recommendations for each user. No movies overlapped between the movies suggested for user_id 5 by the content based or collaborative filtering recommenders, and the same goes for user_id 197.

**Content Based Recommender System**

The recommendations from this model came from analyzing what genre of movies the user preferred. For the first user, the recommended movies included genres from the user's top six preferences. The recommendations for the second user all contained the genres of the user's top five preferences. Which means all of the genres of the recommended movies were found within each user's top 6 preferred genres. (Note: further testing may reveal genre's further down a user's preferred genres.)

These movie suggestions will likely match well with movies the user will view in the future, and the user will likely enjoy the movies. However, if the user has been moving away from genres they liked in the past, these suggestions will not be as fitting or will seem boring to the user.

**Collaborative Filtering Recommender System**

The recommendations from this this model included both genres in the user's top preferences as well as in the top preferences of users that were rated similar to the user. When reviewing the list of movie suggestions versus the users genre preferences vs the nearest neighbors preferences, the suggested movies's genre ratings were very similar to the results provided by the content-based recommender if the neighbor's preferences strongly aligned with the user's preferences. However, there were still novel genre combinations in these instances.

Although the second user tested with the collaborative filtering recommender system had neighbors with very similar genre preferences, only one movie was rated by at least two of the neighbors. The model would likely return better aligned recommendations if it could be tuned so it only suggests movies that have been rated by at least two neighbors.

Of note, this model was trained by looking at 3 nearest neighbors. By including more neighbors, the results may become more varied across the number of genres seen in the suggested movies. As mentioned earlier, the first couple of neighbors were fairly similar to the users tested, but even among three neighbors, the preferences become varied and introduce genres not listed as highly preferred by the user. Increasing the number of neighbors may help provide more novel recommendations to the user, but it may also provide results too far from the user's preference. This is a number that should be continuously tested to ensure the right value is being used.