# Assignment

In this assignment, we focus on **content-based recommenders**.

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

The data we use in this assignment is the movie metadata that accompanies the user-movie rating data we used in the lecture. Let's take a look:

In [None]:
df_movies = pd.read_csv('./movies.csv')
df_movies.head()

Question 1. Do some exploratory analysis of the data. Perform at least three different analyses. At least one of them should be a visualization of some sort (bar chart, histogram, boxplot, etc.).

In [None]:
#Data Exploration 1
df_movies.info()

In [None]:
#Data Exploration 2
df_movies['genres'].str.split('|').explode().value_counts()

In [None]:
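#Data Exploration 3
# Bar chart of the 10 most frequent genres
genre_counts = df_movies['genres'].str.get_dummies('|').sum().sort_values(ascending=False)
genre_counts.head(10).plot(kind='bar')
plt.title('Top 10 Genres')
plt.ylabel('Number of movies')
plt.tight_layout()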

Question 1.1: Since the `genres` column is not in the right format for us to do comparisons, create a dummy column for each genre. You can use the `str.get_dummies` method to do this.

In [None]:
# Create genre dummy columns
df_movies = df_movies.join(df_movies['genres'].str.get_dummies('|'))
df_movies.head()
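
As a quick illustration of what `str.get_dummies` does with a pipe-delimited column, here is a minimal sketch on made-up data. The frame `toy` and its titles are hypothetical and are not rows from `movies.csv`.

```python
import pandas as pd

# Hypothetical toy data, not taken from movies.csv
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Comedy|Drama", "Drama", "Action|Comedy"],
})

# One 0/1 column per distinct genre; the columns come back sorted by name
dummies = toy["genres"].str.get_dummies("|")

# Attach the indicator columns to the original frame
toy_wide = toy.join(dummies)
print(toy_wide)
```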

Question 1.2: Validate the changes from Question 1.1 by displaying the genre indicators for the movie `Toy Story (1995)`, which serves as the reference movie. Hint: Drop the `movieId` and `title` columns and transpose the data to make it easier to read.

In [None]:
movie_chosen = df_movies[df_movies['title']=='Toy Story (1995)'].drop(['movieId','title','genres'], axis=1).squeeze()
movie_chosen.to_frame().T

Question 1.3: Create a new `DataFrame` that will store the similarity scores. Hint: Copy the `movieId` and `title` columns from `df_movies`.

In [None]:
df_sim = df_movies[['movieId','title']].copy()
df_sim.head()

Question 2.1: Find all the movies similar to the above movie. The easiest way to do this is by using the `pd.DataFrame.corrwith` method. You can pass `movie_chosen` to this method and specify the correct value for `axis`. The default similarity metric used is Pearson's correlation, so add a new column to the `df_sim` data called `sim_pearson` to store the similarity scores. Show the top 5 rows of the resulting data.

In [None]:
genre_features = df_movies.drop(['movieId','title','genres'], axis=1)
df_sim['sim_pearson'] = genre_features.corrwith(movie_chosen, axis=1)
df_sim.head()
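
To see how row-wise `corrwith` behaves, the following sketch compares three made-up genre vectors with a reference row. The frame `features` and its row labels are hypothetical. Note that a row with no variation (for example, all zeros) would have an undefined Pearson correlation and come back as `NaN`.

```python
import pandas as pd

# Hypothetical 0/1 genre indicators, one row per movie
features = pd.DataFrame(
    {
        "Action": [0, 1, 0],
        "Comedy": [1, 0, 1],
        "Drama":  [1, 0, 0],
        "Horror": [0, 1, 0],
    },
    index=["ref", "opposite", "partial"],
)

reference = features.loc["ref"]

# axis=1 correlates each row with the reference Series (Pearson by default)
sim = features.corrwith(reference, axis=1)
print(sim)
```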

Question 2.2: Pearson's correlation may not be the best similarity metric to use with the data we have, so try [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index) instead. To specify another similarity function, we can use the `method` argument of `corrwith`. Unfortunately, Jaccard similarity is not one of the default metrics offered, but `method` also accepts functions (referred to as a "callable" in the documentation).

- Use `corrwith` to find the similarity between `Toy Story (1995)` and other movies, using Jaccard similarity. Add the similarity scores to `df_sim` as a new column called `sim_jaccard` and show the top 5 rows.

In [None]:
from sklearn.metrics import jaccard_score

# Score every movie's 0/1 genre vector against the reference movie, row by row
genre_features = df_movies.drop(['movieId','title','genres'], axis=1)
df_sim['sim_jaccard'] = genre_features.apply(lambda row: jaccard_score(row, movie_chosen), axis=1)
df_sim.head()
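
The question asks for a callable passed to `corrwith`. One possible sketch of that route is shown below on made-up data. The helper `jaccard_binary` and the frame `features` are hypothetical names, and returning 0.0 when both vectors are empty is a convention chosen here, not something the assignment prescribes.

```python
import numpy as np
import pandas as pd

def jaccard_binary(a, b):
    """Jaccard index of two 0/1 vectors: |A and B| / |A or B|."""
    a = np.asarray(a) > 0
    b = np.asarray(b) > 0
    union = np.logical_or(a, b).sum()
    # Assumed convention: two empty genre sets get similarity 0.0
    return float(np.logical_and(a, b).sum() / union) if union else 0.0

# Hypothetical 0/1 genre indicators, one row per movie
features = pd.DataFrame(
    {"Action": [0, 1, 0], "Comedy": [1, 1, 1], "Drama": [1, 0, 0]},
    index=["ref", "other", "subset"],
)
reference = features.loc["ref"]

# corrwith calls the function once per row, passing two 1-D arrays
sim_jaccard = features.corrwith(reference, axis=1, method=jaccard_binary)
print(sim_jaccard)
```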

Question 2.3: Use `corrwith` to find the similarity between `Toy Story (1995)` and other movies, but this time use cosine similarity. Add the similarity scores to `df_sim` as a new column called `sim_cosine` and show the top 5 rows.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

genre_features = df_movies.drop(['movieId','title','genres'], axis=1)
df_sim['sim_cosine'] = cosine_similarity(genre_features, movie_chosen.values.reshape(1,-1)).ravel()
df_sim.head()
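
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up genre vectors. The function `cosine_sim` is a hypothetical helper, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity: (u . v) / (||u|| * ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    # Assumed convention: similarity with an all-zero vector is 0.0
    return float(u @ v / denom) if denom else 0.0

# Hypothetical genre vectors
ref = [0, 1, 1, 0]
shares_one_genre = [0, 1, 0, 0]
no_overlap = [1, 0, 0, 1]

print(cosine_sim(ref, ref))
print(cosine_sim(ref, shares_one_genre))
print(cosine_sim(ref, no_overlap))
```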

We built a simple recommender system above. Now let's make it more interesting by adding information to the table that can help us filter the recommendations. Specifically, we want to filter by a movie's popularity (the number of users who rated it) and its average rating (averaged over users). This information is not part of the movie metadata, so we have to turn to the ratings data. Combining content features with rating data makes this a basic example of a hybrid approach.

The code below will load the data and reshape it from long to wide using `pivot_table`:

In [None]:
df_ratings = pd.read_csv('./ratings.csv')
movie_user_mat = df_ratings.pivot_table(index = 'movieId', columns = 'userId', values = 'rating')
movie_user_mat.head()
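
The long-to-wide reshape can be hard to picture, so here is a minimal sketch on made-up ratings. The frame `ratings_long` is hypothetical and only mirrors the three column names used above.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie) pair
ratings_long = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 1.0],
})

# Wide format: one row per movie, one column per user, NaN where unrated
wide = ratings_long.pivot_table(index="movieId", columns="userId", values="rating")
print(wide)

# Row-wise summaries skip the NaN entries
print(wide.mean(axis=1))   # average rating per movie
print(wide.count(axis=1))  # number of ratings per movie
```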

Question 3.1: From the table above, extract the average rating of each movie and the number of ratings received by each movie. Add those as two new columns to `df_sim`, and call them `ratings_avg` and `ratings_cnt` respectively. Show the top 5 rows. 

In [None]:
# Index by movieId so the assignments below align with movie_user_mat's index
df_sim = df_sim.set_index('movieId')
df_sim['ratings_avg'] = movie_user_mat.mean(axis=1)
df_sim['ratings_cnt'] = movie_user_mat.count(axis=1)
df_sim = df_sim.reset_index()
df_sim.head()
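
The same two statistics can also be computed from the long-format ratings without pivoting. The sketch below is one possible alternative on made-up data. The frames `ratings_long` and `movies` are hypothetical, and filling the count with 0 for unrated movies is a choice made here, not a requirement of the question.

```python
import pandas as pd

# Hypothetical data: movie 30 has no ratings at all
ratings_long = pd.DataFrame({
    "userId":  [1, 1, 2, 3],
    "movieId": [10, 20, 10, 20],
    "rating":  [4.0, 3.0, 5.0, 1.0],
})
movies = pd.DataFrame({"movieId": [10, 20, 30], "title": ["A", "B", "C"]})

# Mean and count per movie in one pass over the long table
stats = ratings_long.groupby("movieId")["rating"].agg(
    ratings_avg="mean", ratings_cnt="count"
)

# Left join on movieId; unrated movies get NaN in both columns
movies_stats = movies.join(stats, on="movieId")

# Assumed convention: an unrated movie has a rating count of 0
movies_stats["ratings_cnt"] = movies_stats["ratings_cnt"].fillna(0).astype(int)
print(movies_stats)
```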

Question 3.2: Now find all the movies that are similar to `Toy Story (1995)` using cosine similarity, just like we did earlier, but this time limit the results to movies in the bottom 40th percentile in terms of popularity (`ratings_cnt`) and the top 40th percentile in terms of average rating (`ratings_avg`). Show the top 5 rows.

In [None]:
less_known = df_sim['ratings_cnt'] <= df_sim['ratings_cnt'].quantile(0.4)
high_rating = df_sim['ratings_avg'] >= df_sim['ratings_avg'].quantile(0.6)
df_sim[less_known & high_rating].sort_values('sim_cosine', ascending = False).head()
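
To make the percentile cutoffs concrete, the following sketch applies the same two filters to ten made-up movies. All values in `toy_sim` are hypothetical, chosen only so the thresholds are easy to check by hand.

```python
import pandas as pd

# Hypothetical similarity table for ten movies
toy_sim = pd.DataFrame({
    "title":       [f"m{i}" for i in range(10)],
    "sim_cosine":  [0.3, 0.8, 0.9, 0.7, 0.6, 0.5, 0.4, 0.2, 0.1, 1.0],
    "ratings_cnt": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "ratings_avg": [4.5, 2.0, 4.8, 3.0, 4.9, 1.0, 2.0, 3.0, 4.0, 5.0],
})

# Bottom 40% by popularity: at or below the 0.4 quantile of the counts
less_known = toy_sim["ratings_cnt"] <= toy_sim["ratings_cnt"].quantile(0.4)

# Top 40% by average rating: at or above the 0.6 quantile of the averages
high_rating = toy_sim["ratings_avg"] >= toy_sim["ratings_avg"].quantile(0.6)

picked = toy_sim[less_known & high_rating].sort_values("sim_cosine", ascending=False)
print(picked)
```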

[Bonus] Question 4: In the exercise above, we obtained similarity scores between `Toy Story (1995)` and all other movies. Now perform all possible pairwise comparisons.

In [None]:
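# Pearson correlation between every pair of movies:
# corr() compares columns, so transpose to turn movies into columns
genre_features = df_movies.drop(['movieId','title','genres'], axis=1)
sim_mat = genre_features.T.corr()
print(sim_mat)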

In [None]:
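from scipy.spatial.distance import pdist, squareform
# pdist returns cosine distances, so 1 - distance gives cosine similarity;
# squareform expands the condensed result into a full movie-by-movie matrix
genre_features = df_movies.drop(['movieId','title','genres'], axis=1)
sim_mat = 1 - squareform(pdist(genre_features, metric='cosine'))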
print(sim_mat)
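
A pairwise cosine matrix can also be written as a single matrix product of length-normalized rows. The sketch below shows this on a made-up 0/1 matrix `X` (hypothetical data). Leaving all-zero rows at similarity 0 is an assumption made here, since their cosine similarity is otherwise undefined.

```python
import numpy as np

# Hypothetical genre matrix: 4 movies x 3 genres, last movie has no genres
X = np.array([
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
], dtype=float)

# Length of each row; replace zero lengths so the division is safe
norms = np.linalg.norm(X, axis=1, keepdims=True)
norms[norms == 0] = 1.0  # all-zero rows stay all-zero after scaling

# Dot products of unit-length rows are cosine similarities
X_unit = X / norms
sim = X_unit @ X_unit.T
print(np.round(sim, 3))
```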

Question 5: Create a new text cell in your notebook and write a 50-100 word summary of your experience in this assignment (or a short description of how you applied this week's learning to the solution). Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

This assignment was a good exercise in content-based filtering. I had some prior experience with recommender systems, but this was a good refresher on the basics. I followed the steps in the notebook, starting with data exploration and then moving on to creating dummy variables for genres. I then calculated similarity scores using Pearson, Jaccard, and cosine similarity. I also learned how to incorporate popularity and average ratings into the recommendations. The biggest obstacle was remembering the exact syntax for some of the pandas and scikit-learn functions, but the documentation was helpful. This exercise is a good starting point for building a real-world recommender system. To make it more robust, I would need to learn how to handle larger datasets, incorporate user feedback, and evaluate the performance of the recommender system.