# Recommendation Systems

We have seen how recommender (or recommendation) systems have played an integral part in the success of Amazon (books, items), Pandora/Spotify (music), Google (news, search), YouTube (videos), and others. For Amazon, these systems reportedly bring in more than 30% of total revenue. For Netflix, a reported 75% of what people watch comes from some sort of recommendation.

> The goal of Recommendation Systems is to find what is likely to be of interest to the user. This enables organizations to offer a high level of personalization and customer-tailored services.

## Three Main Types

- non-personalized
- content-based
- collaborative filtering

### Non-Personalized Recommendations

![screenshot of youtube's homepage](images/youtube-nonpersonalizedrecommendations.png)

YouTube is notorious for putting non-personalized content on its homepage (although it tailors recommendations in other places).

These recommendations are based purely on the popularity of the item!

#### Advantages
- Super easy (computationally and for the user to understand)
- Items are usually popular for a reason
- No cold-start issue

#### Disadvantages
- Not personalized
- New items won’t gain traction
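
To make the "purely popularity" idea concrete, here is a minimal sketch with pandas. The toy data and the `most_popular` helper are made up for illustration (the column names just mirror the ratings data used later on), and breaking ties by mean rating is one assumption among many possible choices.

```python
import pandas as pd

# Hypothetical toy ratings log, not real data
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3, 3, 4],
    "movieId": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 4.0, 3.0],
})

def most_popular(ratings, n=2):
    """Return the n movieIds with the most ratings (ties broken by mean rating)."""
    stats = ratings.groupby("movieId")["rating"].agg(["count", "mean"])
    ranked = stats.sort_values(["count", "mean"], ascending=False)
    return ranked.head(n).index.tolist()

# Every user gets the same list, which is exactly what "non-personalized" means
top_items = most_popular(toy_ratings)
print(top_items)
```

Notice that nothing about the user goes into `most_popular`, which is why there is no cold-start issue and also why new items struggle to surface.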

### Content-Based

![screenshot found online of someone's 'made for you' recommendations from spotify](images/spotify-contentrecommendations.png)

[Image Source](https://www.howtogeek.com/393291/already-a-spotify-fan-here-are-6-new-features-you-might-have-missed/)

Content-based recommendations are based on the properties/attributes of the items, where the items you've rated highly (or, in Spotify's case, listened to recently or often) are then compared against the properties/attributes of other items, and those items are then recommended if they're considered 'similar'.

What items are 'similar'? Depends on your similarity metric:

![similarity metrics comparison](images/similaritymetrics.png)

[Image Source: "What Similarity Metric Should You Use for Your Recommendation System?"](https://medium.com/bag-of-words/what-similarity-metric-should-you-use-for-your-recommendation-system-b45eb7e6ebd0) <- useful reading!

Those are just 3 examples; there are others (Jaccard index, Euclidean distance). The point is that you take some mathematical representation of the items and find which ones are 'nearby' in some sense.
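
As a rough sketch of what 'nearby' can mean, here are three of these measures written out with NumPy: cosine similarity, the Jaccard index, and Euclidean distance. The genre-tag vectors are hypothetical, and the function names are just illustrative helpers rather than any particular library's API.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction, 0 = orthogonal)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_index(a, b):
    """For binary vectors: attributes shared by both / attributes either one has."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    return float((a & b).sum() / (a | b).sum())

def euclidean_distance(a, b):
    """Straight-line distance; smaller means more similar."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Hypothetical binary genre tags: [action, comedy, drama, sci-fi]
movie_a = np.array([1, 0, 0, 1])
movie_b = np.array([1, 0, 1, 1])
movie_c = np.array([0, 1, 1, 0])

for name, other in [("b", movie_b), ("c", movie_c)]:
    print(name,
          round(cosine_similarity(movie_a, other), 3),
          round(jaccard_index(movie_a, other), 3),
          round(euclidean_distance(movie_a, other), 3))
```

All three agree here that `movie_b` is closer to `movie_a` than `movie_c` is, but on real data the metrics can rank items differently, which is why the choice matters.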

#### Advantages:
- Easy and transparent
- No cold-start issue for new items (they can be recommended from their attributes alone)
- Recommend items to users with unique tastes

#### Disadvantages:
- Requires some type of tagging of items
- Overspecialization to certain types of items
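
Putting the content-based idea together: a minimal sketch that builds a user "profile" from the attribute vectors of liked items and ranks everything else by cosine similarity to it. The song names, attribute columns, and the `recommend_similar` helper are all hypothetical, and averaging the liked items is just one simple way to build a profile.

```python
import numpy as np

# Hypothetical catalogue: item -> attribute vector [acoustic, electronic, vocal, upbeat]
items = {
    "song_a": np.array([1.0, 0.0, 1.0, 0.0]),
    "song_b": np.array([1.0, 0.0, 1.0, 1.0]),
    "song_c": np.array([0.0, 1.0, 0.0, 1.0]),
    "song_d": np.array([0.0, 1.0, 1.0, 1.0]),
}

def recommend_similar(liked, items, n=2):
    """Rank items the user has not liked yet by cosine similarity to their profile."""
    # The profile is the average attribute vector of the liked items
    profile = np.mean([items[name] for name in liked], axis=0)
    scores = {}
    for name, vec in items.items():
        if name in liked:
            continue
        scores[name] = float(profile @ vec / (np.linalg.norm(profile) * np.linalg.norm(vec)))
    return sorted(scores, key=scores.get, reverse=True)[:n]

recs = recommend_similar(["song_a"], items)
print(recs)
```

This also shows the overspecialization problem: someone who only likes acoustic vocal tracks will keep getting acoustic vocal tracks.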

### Collaborative Filtering

![collaborative filtering utility matrix example](images/collaborativefiltering.png)

[Image Source](https://www.incubegroup.com/blog/recommender-system-for-private-banking/)

Use both User and Item data! Use past behavior of many users (how they've rated many items) to find similarities either between users or between items (either user-based or item-based) to recommend new things.

We build a Utility/Rating Matrix to capture many users' ratings of many different items - a matrix that, in practice, tends to be quite _sparse_ (see all the blanks in just this tiny example above).
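
Here is a small sketch of building a utility matrix from a long-format ratings table and measuring how sparse it is. The toy data is invented for illustration; the column names simply mirror the ratings file used later on.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, movie, rating)
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Rows become users, columns become movies, unrated pairs become NaN
utility = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Share of user/movie pairs with no rating
sparsity = utility.isna().to_numpy().mean()

print(utility)
print(f"Share of missing ratings: {sparsity:.2f}")
```

Even with 3 users and 3 movies, nearly half the cells are empty. On real data the share of missing cells is usually far higher.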

Then, we use **_MATH_** (namely, matrix factorization) to fill in those blanks, based upon similar users' ratings of similar items.

More specifically, matrix factorization finds factor matrices that reproduce the ratings we do have, decomposing the actual Utility Matrix into component pieces that explain it. These component pieces, matrices themselves, can be thought of as 'latent' or 'inherent' features of the items and users! To arrive at a predicted rating, we take the dot product of a user's factor vector and an item's factor vector.

<img src="images/matrixfactorization.png" alt="matrix factorization image, showing the factor matrices" width=700>

[Image Source](https://medium.com/@connectwithghosh/simple-matrix-factorization-example-on-the-movielens-dataset-using-pyspark-9b7e3f567536)

A bit more on Matrix Factorization, from Google's Recommendations Systems crash course: https://developers.google.com/machine-learning/recommendation/collaborative/matrix
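
To see the factor matrices and dot products in action, here is a bare-bones sketch in NumPy that learns two small factor matrices with stochastic gradient descent. The toy matrix, the number of latent features `k`, the learning rate, and the regularization strength are all illustrative assumptions. This is not how any particular library implements it (Surprise's SVD, for example, also learns bias terms).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy utility matrix; np.nan marks a rating we have not observed
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [np.nan, 1.0, 5.0, 4.0],
])
n_users, n_items = R.shape
k = 2                              # number of latent features (arbitrary choice)
P = rng.random((n_users, k))       # user factors, one row per user
Q = rng.random((n_items, k))       # item factors, one row per item

# Only the observed ratings contribute to training
observed = [(u, i) for u in range(n_users) for i in range(n_items)
            if not np.isnan(R[u, i])]
lr, reg = 0.01, 0.02               # illustrative learning rate and regularization

for epoch in range(5000):
    for u, i in observed:
        err = R[u, i] - P[u] @ Q[i]          # error on one known rating
        p_u = P[u].copy()                    # keep the old user vector for the item update
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * p_u - reg * Q[i])

# Every cell, including the blanks, is now a user-vector . item-vector dot product
R_hat = P @ Q.T
print(np.round(R_hat, 2))
```

The cells that were `nan` in `R` now hold predicted ratings, which are the candidates for recommendation.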

#### Advantages:
- Personalized. You’re special!

#### Disadvantages:
- Can require a lot of computation, especially as these matrices get larger
- Cold start: need to have a lot of ratings to be worthwhile
- Popularity Bias: biased towards items that are popular. May not capture people’s unique tastes.

Matrix factorization methods include Singular Value Decomposition (SVD) and Alternating Least Squares (ALS).

I'll note that there are differences between _explicit_ and _implicit_ ratings.

- **_Explicit_** data is gathered from users when we ask a user to rate an item on some scale
    - Pros: concrete rating system, can assume users actually feel the way they input and thus can extrapolate from those preferences
    - Cons: not all users might input their preferences
- **_Implicit_** data is gathered from users without their direct input - a system logs the actions of a user
    - Pros: Easier to collect automatically, thus have more data from more users without those users needing to go through extra steps
    - Cons: More difficult to work with - how do we know what actions imply preference?
    
[insert comment about y'all filling out surveys here]

[Resource](https://www.cs.carleton.edu/cs_comps/0607/recommend/recommender/collaborativefiltering.html#:~:text=Implicit%20Data%20Collection,system%20has%20to%20collect%20data.&text=Explicit%20data%20gathering%20is%20easy,data%20to%20predict%20future%20ratings.)

## And now, in code!

### Reading in the data and simple EDA

#### Data Source:

https://www.kaggle.com/rounakbanik/the-movies-dataset

In [None]:
# Import libraries, round 1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter

In [None]:
# load in data and check it out
df = pd.read_csv('data/ratings.csv') 
print(df.shape) 
df.head(10) 

In [None]:
# Can also get the data straight from the surprise library we'll be using

# from surprise import Dataset
# data = Dataset.load_builtin('ml-100k')
# df = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.data',
#             sep='\t', header=None)
# (renamed to match the column names the rest of this notebook expects)
# df = df.rename(columns={0: 'userId', 1: 'movieId', 2: 'rating', 3: 'timestamp'})

### Ratings

In [None]:
# check value_counts
ratings = df['rating'].value_counts()
ratings

In [None]:
# sort by rating value (not by count) so the name matches what the dict holds
ratings_sorted = ratings.sort_index().to_dict()

In [None]:
# plot distribution in matplotlib
plt.figure(figsize=(8,6))
plt.bar(ratings_sorted.keys(), ratings_sorted.values(), width=.4)
plt.xticks(np.arange(0, 5.1, step=0.5))
plt.xlabel("Rating")
plt.ylabel("# of Ratings")
plt.title("Distribution of Ratings")
plt.show()

### Users

In [None]:
print("Number of users: ", df.userId.nunique()) 
print("Average Number of Reviews per User: ", df.shape[0]/df.userId.nunique())

In [None]:
ratings_per_user = df['userId'].value_counts()
ratings_per_user = sorted(list(zip(ratings_per_user.index, ratings_per_user)))

plt.figure(figsize=(8,6))
plt.bar([r[0] for r in ratings_per_user], [r[1] for r in ratings_per_user])
plt.xlabel("User ID")
plt.ylabel("# of Reviews")
plt.title("Number of Reviews per User")
plt.show()

### Movies

In [None]:
print("Number of movies: ", df.movieId.nunique())
print("Average Number of Reviews per Movie: ", df.shape[0]/df.movieId.nunique())

In [None]:
# the movie IDs with the most ratings
df['movieId'].value_counts()[:10]

In [None]:
ratings_per_movie = df['movieId'].value_counts()

plt.figure(figsize=(8, 6))
plt.hist(ratings_per_movie, bins=50)
plt.xlabel("# of Reviews")
plt.ylabel("# of Movies")
plt.title("Distribution of the Number of Ratings Per Movie")
plt.show()

## Singular Value Decomposition using Surprise

Written by Yish, thanks <3

One of the easiest libraries to use for recommendation systems is Surprise, which stands for **Simple Python Recommendation System Engine**. Here, we'll code a recommendation system using the Surprise Library's Singular Value Decomposition (SVD) algorithm.

To read more about Surprise's SVD implementation, and its hyperparameters:
https://surprise.readthedocs.io/en/stable/matrix_factorization.html#surprise.prediction_algorithms.matrix_factorization.SVD

In [None]:
# If you need the surprise library
# (the package is published on PyPI as scikit-surprise)
# !pip install scikit-surprise

In [None]:
# Import libraries, round 2
from surprise import Dataset, Reader
from surprise import SVD
from surprise import accuracy
from surprise.model_selection import cross_validate, train_test_split

In [None]:
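# for Surprise, we only need three columns from the dataset
data = df[['userId', 'movieId', 'rating']]
# load_from_df only uses the Reader's rating_scale, which defaults to (1, 5);
# these ratings run from 0.5 to 5, so set it explicitly (adjust if your data differs)
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(data, reader=reader)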

# note - if you loaded this data up straight from surprise earlier
# you won't need to do this - data will already be a surprise dataset

In [None]:
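# train-test-split
# (test_size and random_state are illustrative choices)
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

In [None]:
# instantiate SVD and fit the trainset
svd = SVD(random_state=42)
svd.fit(trainset)

In [None]:
# get our predictions out and score our model
predictions = svd.test(testset)
accuracy.rmse(predictions)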
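
Scoring a model like this comes down to comparing predicted ratings against the held-out true ratings. A common metric is root mean squared error (RMSE), which Surprise's `accuracy` module can report for you. Here it is by hand, with made-up numbers, just to show what the score means.

```python
import numpy as np

def rmse(true_ratings, predicted_ratings):
    """Root mean squared error: the typical size of a prediction miss, in rating points."""
    true_ratings = np.asarray(true_ratings, dtype=float)
    predicted_ratings = np.asarray(predicted_ratings, dtype=float)
    return float(np.sqrt(np.mean((true_ratings - predicted_ratings) ** 2)))

# Hypothetical held-out ratings and a model's guesses for them
true_r = [4.0, 3.0, 5.0, 2.0]
pred_r = [3.5, 3.0, 4.0, 3.0]

score = rmse(true_r, pred_r)
print(score)
```

Lower is better, and because of the squaring, a few big misses hurt more than many small ones.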

### Making Predictions

In [None]:
# taking a look at the first 10 predictions made on our test set
predictions[:10]

In [None]:
print("Number of users: ", df.userId.nunique()) 
print("Number of movies: ", df.movieId.nunique()) 

In [None]:
# let's predict for a specific user/item!
user = 5
item = 141
svd.predict(user, item)

## More Models? More Models!

Surprise has some basic algorithms - like `BaselineOnly`, which predicts a baseline estimate for a given user and item.

https://surprise.readthedocs.io/en/stable/basic_algorithms.html#surprise.prediction_algorithms.baseline_only.BaselineOnly
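
For intuition, a baseline estimate is roughly "the global average, adjusted for how generous this user is and how well-liked this item is." The sketch below uses plain averages on invented data. Treat it as a simplification: Surprise's `BaselineOnly` fits regularized user and item biases (via ALS or SGD) rather than taking raw means, and the `baseline_estimate` helper here is hypothetical.

```python
import pandas as pd

# Hypothetical toy ratings
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 10],
    "rating":  [4.0, 2.0, 5.0, 3.0, 3.0],
})

mu = toy["rating"].mean()                              # global mean rating
b_u = toy.groupby("userId")["rating"].mean() - mu      # how far each user sits from the mean
b_i = toy.groupby("movieId")["rating"].mean() - mu     # how far each movie sits from the mean

def baseline_estimate(user, item):
    """Global mean plus user and item offsets; unseen users/items get an offset of 0."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# User 3 has not rated movie 20 yet
print(baseline_estimate(3, 20))
```

A baseline like this is a useful yardstick: a fancier model should at least beat it.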

In [None]:
from surprise import BaselineOnly

In [None]:
# showcasing the cross_validate function as well 
cross_validate(BaselineOnly(), data, verbose=True)

#### KNN

Plus there are always neighbors!

In [None]:
from surprise import KNNBasic

In [None]:
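# KNNBasic is user-based by default; ask for item-based similarities so that
# neighbors are items (this builds an items x items matrix, so watch memory)
KNN_model = KNNBasic(sim_options={'user_based': False}).fit(trainset)

In [None]:
# find the nearest neighbors to an item
# get_neighbors works with Surprise's inner ids, so convert the raw movieId first
inner_id = trainset.to_inner_iid(item)  # using same item as earlier
neighbors = KNN_model.get_neighbors(inner_id, k=1)
[trainset.to_raw_iid(n) for n in neighbors]  # back to raw movieIds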
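
Under the hood, "nearest neighbors to an item" means comparing item columns of the utility matrix. Here is a simplified sketch with NumPy on a toy matrix. It is an assumption-laden illustration: it uses cosine similarity and treats missing ratings as 0, whereas `KNNBasic` by default uses mean squared difference computed over co-rated entries. The `nearest_items` helper is hypothetical.

```python
import numpy as np

# Toy utility matrix: rows are users, columns are items, 0 means "not rated"
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

def nearest_items(R, item, k=1):
    """Return the column indices of the k items most similar to `item` (cosine)."""
    cols = R.T                                   # one row per item
    norms = np.linalg.norm(cols, axis=1)
    sims = cols @ cols[item] / (norms * norms[item])
    sims[item] = -np.inf                         # never return the item itself
    return np.argsort(-sims)[:k].tolist()

neighbors = nearest_items(R, item=0, k=1)
print(neighbors)
```

Items 0 and 1 are liked by the same users, so they end up as each other's nearest neighbor.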