# 06 Recommendation Systems
__Math 3280 - Data Mining__ : Snow College : Dr. Michael E. Olson

* Leskovec, Chapter 9
-----

In Chapter 3, we looked at similarities between two datasets. In Chapter 6, we examined Frequent Itemsets, or similar subsets between baskets. We can utilize these to recommend items to another user. In Recommendation Systems, we do look for similar subsets between baskets, but we add an extra dimension: ratings.

It will be helpful to think of the ratings as an n-dimensional vector where each component is a rating for a particular element (such as a specific actor).

## The Utility Matrix
In recommendation systems, we are trying to relate the *users* and the *items*. Each user-item pair is associated with a "degree of preference" (or rating) that the user has given that item. This can be written as a matrix.

The goal is to predict the blanks in the utility matrix.
* First, note that if users give a subset of items similar ratings, then another user who gives one item in that subset a similar rating is likely to rate the other items in that subset similarly
* The Utility Matrix will be very sparse. Fortunately, we don't have to fill in the entire matrix. We only need to find enough similarity, either between the content the user rated (content-based recommendation systems) or between the ratings other users gave (collaborative filtering recommendation systems)
   * It is not even necessary to find the items with the highest possible rating - just ones rated high enough that they are likely to be among the user's likes
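
As a rough illustration of how sparse a utility matrix can be, the sketch below builds a small hypothetical matrix (the user labels, item labels, and ratings are made up for illustration) with `NaN` marking the blanks, then measures the fraction of known entries. Note that the cells later in these notes use 0 rather than `NaN` for a blank.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-user x 5-item utility matrix; NaN marks an unknown rating
utility = pd.DataFrame(
    [[4, np.nan, np.nan, 5, np.nan],
     [np.nan, 3, np.nan, np.nan, np.nan],
     [np.nan, np.nan, 2, np.nan, 1]],
    index=['U1', 'U2', 'U3'],
    columns=['I1', 'I2', 'I3', 'I4', 'I5'],
)

known = utility.notna()                 # True where a rating exists
density = known.values.mean()           # fraction of filled cells
blanks_per_user = (~known).sum(axis=1)  # entries that would need a prediction

print(f"Density of known ratings: {density:.2f}")
print(blanks_per_user)
```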

In [1]:
import numpy as np
import pandas as pd

In [2]:
#####  Utility Matrix  #####

## Binary (Like/Dislike) ratings

movies = ['M01','M02','M03','M04','M05','M06','M07','M08','M09','M10','M11','M12','M13','M14','M15']

likes = {
    'User A' : [0,1,1,1,0,1,1,0,1,0,0,1,1,1,1],
    'User B' : [1,1,0,0,0,0,1,1,0,1,0,1,0,0,1],
    'User C' : [0,0,0,1,1,0,0,0,0,1,0,0,0,0,1]
}

user_likes = pd.DataFrame(likes).transpose()
user_likes.columns = movies

user_likes

Unnamed: 0,M01,M02,M03,M04,M05,M06,M07,M08,M09,M10,M11,M12,M13,M14,M15
User A,0,1,1,1,0,1,1,0,1,0,0,1,1,1,1
User B,1,1,0,0,0,0,1,1,0,1,0,1,0,0,1
User C,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1


In [3]:
## Scaled (5-star) ratings

ratings = {
    'User A' : [0,1,2,1,0,4,4,0,3,0,0,5,5,4,5],
    'User B' : [3,4,0,0,0,0,1,5,0,3,0,5,0,0,4],
    'User C' : [0,0,0,2,4,0,0,0,0,4,0,0,0,0,3]
}

user_ratings = pd.DataFrame(ratings).transpose()
user_ratings.columns = movies

user_ratings

Unnamed: 0,M01,M02,M03,M04,M05,M06,M07,M08,M09,M10,M11,M12,M13,M14,M15
User A,0,1,2,1,0,4,4,0,3,0,0,5,5,4,5
User B,3,4,0,0,0,0,1,5,0,3,0,5,0,0,4
User C,0,0,0,2,4,0,0,0,0,4,0,0,0,0,3


### Long Tail
We are going to address an advantage that online stores have over brick-and-mortar stores along with a disadvantage this introduces:
* Brick-and-mortar stores can only display what they have room for on the shelf
* Online stores can have as many items as they want - there is no limit
    * An "infinite number" of items on the shelf
    * How do you browse them all?
    * Rely on a recommendation system to show particular items from a section of the online store that is relevant to the user's search

This is known as the *long tail* phenomenon. Brick-and-mortar stores rely on an item's popularity, and will only sell the most popular items, while online stores will sell everything.

-----
## Content-Based Recommendations
For example, on Amazon, there is a section on each page labelled, "Inspired by your browsing history." This is making recommendations based on what you have "liked".

### Item Profile

In [27]:
## Item Profile : Movie Database
actors_list = ['Julia Roberts','Robin Williams','Clint Eastwood','Ian McKellen','Movie Rating']

actors = {
    'M01' : [1,0,0,1,3],
    'M02' : [0,0,1,0,5],
    'M03' : [1,0,0,0,4],
    'M04' : [0,1,0,0,2],
    'M05' : [0,1,0,0,4],
    'M06' : [0,0,1,0,4],
    'M07' : [1,1,0,0,3],
    'M08' : [1,0,0,1,1],
    'M09' : [0,0,1,1,5],
    'M10' : [1,1,0,0,5],
    'M11' : [1,0,0,0,1],
    'M12' : [0,0,0,1,2],
    'M13' : [0,1,0,1,2],
    'M14' : [0,1,1,0,5],
    'M15' : [0,0,1,1,2]
}

movie_casts = pd.DataFrame(actors).transpose()
movie_casts.columns = actors_list

movie_casts

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen,Movie Rating
M01,1,0,0,1,3
M02,0,0,1,0,5
M03,1,0,0,0,4
M04,0,1,0,0,2
M05,0,1,0,0,4
M06,0,0,1,0,4
M07,1,1,0,0,3
M08,1,0,0,1,1
M09,0,0,1,1,5
M10,1,1,0,0,5


In [28]:
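def cosine_distance(vector_1,vector_2):
    # Note: despite the name, this returns the cosine of the angle between the
    # vectors (cosine similarity); values near 1 mean the vectors are most alike.
    dotproduct = np.dot(vector_1, vector_2)
    norms = np.dot(vector_1, vector_1)*np.dot(vector_2, vector_2)  # product of the squared norms
    return dotproduct/np.sqrt(norms)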

In [29]:
## Example 9.2
alpha = 2
movie_1 = 'M07'
movie_2 = 'M10'

movie_comparison = movie_casts.loc[[movie_1, movie_2]].copy()  # copy so the scaling below leaves movie_casts unchanged
movie_comparison['Movie Rating'] = movie_comparison['Movie Rating']*alpha
movie_comparison

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen,Movie Rating
M07,1,1,0,0,6
M10,1,1,0,0,10


In [30]:
cosine_distance(movie_comparison.loc[movie_1], movie_comparison.loc[movie_2])

0.995863477615015
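
To see how the scaling factor changes the result, the sketch below recomputes the cosine for M07 and M10 with several values of `alpha`. The helper `cosine` and the particular list of `alpha` values are illustrative choices, not part of the text's example; the cast flags and average ratings are copied from the item profile.

```python
import numpy as np

def cosine(v1, v2):
    """Cosine of the angle between two vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

# Cast flags (Roberts, Williams, Eastwood, McKellen) and average rating
m07_cast, m07_rating = np.array([1, 1, 0, 0]), 3
m10_cast, m10_rating = np.array([1, 1, 0, 0]), 5

results = {}
for alpha in [0.0, 0.5, 1.0, 2.0]:
    # Append the scaled rating as a fifth component of each movie vector
    v1 = np.append(m07_cast, alpha * m07_rating)
    v2 = np.append(m10_cast, alpha * m10_rating)
    results[alpha] = cosine(v1, v2)
    print(f"alpha = {alpha}: cosine = {results[alpha]:.4f}")
```

With `alpha = 0` the rating is ignored and the two movies, which share the same cast here, look identical. For very large `alpha` the rating component dominates both vectors, so the cosine approaches 1 again; intermediate values are where the rating difference has a visible effect.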

In [8]:
alpha = 0.5
movie_comparison = movie_casts.copy()
movie_comparison['Movie Rating'] *= alpha

distances = pd.DataFrame(columns=movies)
for movie1 in movies:
    for movie2 in movies:
        distances.loc[movie1,movie2] = cosine_distance(movie_comparison.loc[movie1],movie_comparison.loc[movie2])

distances

Unnamed: 0,M01,M02,M03,M04,M05,M06,M07,M08,M09,M10,M11,M12,M13,M14,M15
M01,1.0,0.675566,0.867722,0.514496,0.650791,0.650791,0.764706,0.889297,0.802181,0.802181,0.759257,0.857493,0.70014,0.6333,0.70014
M02,0.675566,1.0,0.830455,0.656532,0.830455,0.996546,0.675566,0.309492,0.937437,0.808135,0.415227,0.656532,0.536056,0.937437,0.750479
M03,0.867722,0.830455,1.0,0.632456,0.8,0.8,0.867722,0.596285,0.778499,0.934199,0.8,0.632456,0.516398,0.778499,0.516398
M04,0.514496,0.656532,0.632456,1.0,0.948683,0.632456,0.857493,0.235702,0.615457,0.86164,0.316228,0.5,0.816497,0.86164,0.408248
M05,0.650791,0.830455,0.8,0.948683,1.0,0.8,0.867722,0.298142,0.778499,0.934199,0.4,0.632456,0.774597,0.934199,0.516398
M06,0.650791,0.996546,0.8,0.632456,0.8,1.0,0.650791,0.298142,0.934199,0.778499,0.4,0.632456,0.516398,0.934199,0.774597
M07,0.764706,0.675566,0.867722,0.857493,0.867722,0.650791,1.0,0.565916,0.6333,0.971061,0.759257,0.514496,0.70014,0.802181,0.420084
M08,0.889297,0.309492,0.596285,0.235702,0.298142,0.298142,0.565916,1.0,0.522233,0.522233,0.745356,0.707107,0.57735,0.290129,0.57735
M09,0.802181,0.937437,0.778499,0.615457,0.778499,0.934199,0.6333,0.522233,1.0,0.757576,0.389249,0.86164,0.703526,0.878788,0.904534
M10,0.802181,0.808135,0.934199,0.86164,0.934199,0.778499,0.971061,0.522233,0.757576,1.0,0.700649,0.615457,0.703526,0.878788,0.502519


### User Profiles
Having made an item profile (a matrix to provide information about the items), we can now create a profile associating each user with the information from each item. For example, the utility matrix matches each user with a movie (the item) with a user rating. The item profile is a matrix that matches each movie (the item) with information such as actors or average ratings. These matrices can be used to create a __user profile__ which matches each user with the information in the item profile.

In this example, we're going to use movie information (actors and average ratings) to create a matrix giving a score based on the average ratings given to movies with each given actor. 
* Example: If 20\% of the movies that a user likes have Julia Roberts as one of the actors, then the user profile for that user will be 0.2 in the component for Julia Roberts.

How to find this:
* Let $\vec{A}$ be the vector of movies that actor $A$ is in (1 if the actor appears in the movie, 0 otherwise)
* Let $\vec{u}$ be the vector of movie ratings given by user $u$ (for like/dislike data, 1 for a like and 0 otherwise)
* $\vec{A}\cdot\vec{u}$ gives the number of movies with the given actor that the given user liked
* $\vec{u}\cdot\vec{u}$ gives the squared norm of the user vector, which for like/dislike data equals the number of movies liked by that user
* The quotient of these values gives the fraction of the user's liked movies that feature that actor
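
Written as a formula (the symbol $w_{u,A}$ is notation introduced here only for illustration):

$$w_{u,A} = \frac{\vec{A}\cdot\vec{u}}{\vec{u}\cdot\vec{u}}$$

For User A and Julia Roberts with like/dislike data, $\vec{A}\cdot\vec{u} = 2$ and $\vec{u}\cdot\vec{u} = 10$, so $w_{u,A} = 2/10 = 0.2$.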

#### User Profile with like/dislike ratings

In [9]:
## User Profile
## Compare with Example 9.3 : Movies given only likes or dislikes
dotproduct = np.dot(movie_casts['Julia Roberts'], user_likes.loc['User A'])
norm = np.dot(user_likes.loc['User A'], user_likes.loc['User A'])
print(f"Dot Product of User Likes and Julia Roberts' Movies : {dotproduct}")
print(f"Norm of User Likes (number of movies liked)         : {norm}")
print(f"User weight to movies with Julia Roberts            : {dotproduct / norm}\n")


user_profile_likes = pd.DataFrame(columns=actors_list)

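# Note: this loop uses the scaled ratings (user_ratings), so each actor's weight is
# the share of the user's total stars rather than the share of liked movies.
# Substituting user_likes here would reproduce the 0.2 computed above.
for actor_id in actors_list: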
    for user_id in user_ratings.index:
        user_profile_likes.loc[user_id,actor_id] = np.dot(movie_casts[actor_id], user_ratings.loc[user_id]) / user_ratings.loc[user_id].sum()
        
user_profile_likes
#user_profile_likes.drop('Movie Rating', axis=1)

Dot Product of User Likes and Julia Roberts' Movies : 2
Norm of User Likes (number of movies liked)         : 10
User weight to movies with Julia Roberts            : 0.2



Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen
User A,0.176471,0.411765,0.5,0.529412
User B,0.48,0.16,0.32,0.68
User C,0.307692,0.769231,0.230769,0.230769
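
The nested loop can also be expressed as a single matrix product. The following is a minimal sketch on a smaller, made-up data set (the names `likes_small` and `casts_small` and all of their values are hypothetical), assuming binary likes and 0/1 cast flags.

```python
import numpy as np
import pandas as pd

# Hypothetical likes: rows are users, columns are movies (1 = liked)
likes_small = pd.DataFrame(
    [[1, 1, 0, 1],
     [0, 1, 1, 0]],
    index=['U1', 'U2'], columns=['X1', 'X2', 'X3', 'X4'],
)

# Hypothetical item profile: rows are movies, columns are actors (1 = appears)
casts_small = pd.DataFrame(
    [[1, 0],
     [1, 1],
     [0, 1],
     [0, 1]],
    index=['X1', 'X2', 'X3', 'X4'], columns=['Actor P', 'Actor Q'],
)

# (users x movies) @ (movies x actors) counts liked movies per actor
liked_with_actor = likes_small @ casts_small

# Divide each row by the number of movies that user liked
profile = liked_with_actor.div(likes_small.sum(axis=1), axis=0)
print(profile)
```

The matrix product relies on the movie labels matching between the columns of the likes table and the index of the item profile, which is one reason to keep both tables labelled consistently.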


#### User Profile with scaled ratings

In [10]:
user_ratings

Unnamed: 0,M01,M02,M03,M04,M05,M06,M07,M08,M09,M10,M11,M12,M13,M14,M15
User A,0,1,2,1,0,4,4,0,3,0,0,5,5,4,5
User B,3,4,0,0,0,0,1,5,0,3,0,5,0,0,4
User C,0,0,0,2,4,0,0,0,0,4,0,0,0,0,3


There is a problem we have to deal with:
* Users may only rate movies they like (or only movies that they don't like)

So, we think of it this way: with a 1-5 star rating, a rating above the user's average rating is a strong recommendation, while a rating below the user's average rating is a weak recommendation. So, we want to:
* normalize the user's ratings by subtracting the user's average rating, then
* find the average of these normalized ratings over the movies with a given actor

In [15]:
## User Profile
## Compare with Example 9.4 : Movies given scaled ratings
user_ratings.replace(0,np.nan).loc['User A']

M01    NaN
M02    1.0
M03    2.0
M04    1.0
M05    NaN
M06    4.0
M07    4.0
M08    NaN
M09    3.0
M10    NaN
M11    NaN
M12    5.0
M13    5.0
M14    4.0
M15    5.0
Name: User A, dtype: float64

In [16]:
### Start by finding the average rating given by that user
user_avg_rating = user_ratings.replace(0,np.nan).loc['User A'].mean()
user_avg_rating

3.4

In [17]:
### Next find the movies with a given actor and the ratings the user has given it
actor_ratings_from_user = movie_casts['Julia Roberts'] * user_ratings.loc['User A']
actor_ratings_from_user

M01    0
M02    0
M03    2
M04    0
M05    0
M06    0
M07    4
M08    0
M09    0
M10    0
M11    0
M12    0
M13    0
M14    0
M15    0
dtype: int64

In [18]:
### Normalize the ratings by subtracting the average rating
actor_ratings_from_user = actor_ratings_from_user.apply(lambda x: 0 if x==0 else x-user_avg_rating)
actor_ratings_from_user

M01    0.0
M02    0.0
M03   -1.4
M04    0.0
M05    0.0
M06    0.0
M07    0.6
M08    0.0
M09    0.0
M10    0.0
M11    0.0
M12    0.0
M13    0.0
M14    0.0
M15    0.0
dtype: float64

In [19]:
### The average is the score
actor_ratings_from_user.replace(0,np.nan).mean()

-0.3999999999999999

In [22]:
a = 'Julia Roberts'
u = 'User A'

avg_rating = user_ratings.replace(0,np.nan).loc[u].mean()
(movie_casts[a] * user_ratings.loc[u]).apply(lambda x: 0 if x==0 else x-avg_rating).replace(0,np.nan).mean()

-0.3999999999999999

In [31]:
### Apply for all users and actors
user_profile_ratings = pd.DataFrame(columns=actors_list)

for actor_id in actors_list:
    for user_id in user_ratings.index:
        avg_rating = user_ratings.replace(0,np.nan).loc[user_id].mean()  # Average rating given by user
        tmp = movie_casts[actor_id] * user_ratings.loc[user_id]          # Array of ratings given by user involving the given actor 
        user_profile_ratings.loc[user_id, actor_id] = tmp.apply(lambda x: 0 if x==0 else x-avg_rating).replace(0,np.nan).mean() # Subtract avg rating from ratings given, then take the mean
        

user_profile_ratings.drop('Movie Rating', axis=1, inplace=True)
user_profile_ratings

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen
User A,-0.4,0.1,0.0,1.1
User B,-0.571429,-1.571429,0.428571,0.678571
User C,0.75,0.083333,-0.25,-0.25
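
One caveat with the cells above: because 0 stands both for "not rated" and, after centering, for "exactly average", `replace(0, np.nan)` would also discard a rating that happens to equal the user's mean. That does not occur in this data set, but a sketch that masks on the original ratings avoids the issue. The function name `actor_score` is hypothetical; the sample data is copied from the tables in these notes.

```python
import numpy as np
import pandas as pd

def actor_score(cast_flags, ratings):
    """Mean of a user's mean-centered ratings over rated movies with the actor.

    cast_flags: Series of 0/1 per movie; ratings: Series where 0 means unrated.
    """
    rated = ratings != 0                       # movies the user actually rated
    centered = ratings[rated] - ratings[rated].mean()
    with_actor = cast_flags[rated] == 1        # rated movies featuring the actor
    if not with_actor.any():
        return np.nan                          # no information about this actor
    return centered[with_actor].mean()

# User A's ratings and Julia Roberts' appearances for M01..M15
movies = [f"M{i:02d}" for i in range(1, 16)]
user_a = pd.Series([0, 1, 2, 1, 0, 4, 4, 0, 3, 0, 0, 5, 5, 4, 5], index=movies)
roberts = pd.Series([1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0], index=movies)

score = actor_score(roberts, user_a)
print(round(score, 4))
```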


### Recommendations
At this point, we have an item profile (relating movies to information) and a user profile (relating users to the same information). Now, we can make a recommendation by calculating the distance (using your distance measure of choice) between a user's profile and the different movies in the item profile. We are going to use a cosine distance in this example.
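
As a sketch of that step, the code below scores several movies against one user's profile and sorts the results. The helper name `cosine`, the restriction to four movies, and the choice to rank by largest cosine are illustrative assumptions; the cast flags come from the item profile and the profile values are User A's like-based profile as printed in these notes.

```python
import numpy as np
import pandas as pd

def cosine(v1, v2):
    """Cosine of the angle between two vectors (larger means more alike)."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

actors = ['Julia Roberts', 'Robin Williams', 'Clint Eastwood', 'Ian McKellen']

# Cast flags for a few movies, one row per movie after the transpose
casts = pd.DataFrame(
    {'M01': [1, 0, 0, 1], 'M02': [0, 0, 1, 0],
     'M05': [0, 1, 0, 0], 'M10': [1, 1, 0, 0]},
    index=actors,
).T

# User A's profile over the same actors
profile_a = pd.Series([0.176471, 0.411765, 0.5, 0.529412], index=actors)

# Score each movie, then list the best matches first
scores = casts.apply(lambda row: cosine(row, profile_a), axis=1)
ranking = scores.sort_values(ascending=False)
print(ranking)
```

In practice one would presumably drop movies the user has already rated before presenting the top of this ranking as recommendations.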

#### Recommendations using binary (like/dislike) system

In [37]:
movie_casts

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen,Movie Rating
M01,1,0,0,1,3
M02,0,0,1,0,5
M03,1,0,0,0,4
M04,0,1,0,0,2
M05,0,1,0,0,4
M06,0,0,1,0,4
M07,1,1,0,0,3
M08,1,0,0,1,1
M09,0,0,1,1,5
M10,1,1,0,0,5


In [36]:
user_profile_likes

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen,Movie Rating
User A,0.176471,0.411765,0.5,0.529412,3.176471
User B,0.48,0.16,0.32,0.68,2.8
User C,0.307692,0.769231,0.230769,0.230769,3.538462


In [39]:
user = "User A"
movie = "M10"

cosine_distance(movie_casts.drop('Movie Rating', axis=1).loc[movie],
                user_profile_likes.drop('Movie Rating', axis=1).loc[user])

0.48650425541051995

#### Recommendations using scaled (stars) system

In [149]:
user_profile_ratings.head()

Unnamed: 0,Julia Roberts,Robin Williams,Clint Eastwood,Ian McKellen
User A,-1.0,-0.25,1.1,-0.5
User B,0.214286,0.214286,-0.785714,0.214286
User C,-0.25,-0.25,0.75,0.75


In [150]:
cosine_distance(movie_casts.drop('Movie Rating', axis=1).loc['M06'],
                user_profile_ratings.loc['User A'])

0.6925914050226145

Advantages to Content-based approach
* No need for data from other users
* Able to cater to unique tastes
* Can include new and unpopular items in recommendations
* Can include explanations for recommendations (Because you liked 'A', you might like 'B')

Disadvantages to Content-based approach
* Difficult to find the appropriate features
  * Images, music, etc.
* Overspecialization (sticks to the user's profile - doesn't go outside of that)
* Can't take advantage of experience (quality judgments) from other users
* How do you make recommendations to a new user who doesn't have a profile?

-----
## Collaborative-filtering approach
For example, on Amazon, there is a section on each page labelled, "Customers who bought this item also bought."

-----
* Exercise 9.2.1 a-d
* Exercise 9.2.2 a
  * Calculate and interpret the cosine distances between each pair of computers
* Exercise 9.2.3 a-b