
## Analytics Specializations & Applications - Week 6

# Recommender Systems - Example Case Study
----------

Not that long ago, companies operated over much smaller localities. Many shops knew their customers personally and could make recommendations based on what those customers liked (judging by their prior purchases and reactions). This led to a completely different customer experience (CX) from the one we are used to today - the customer received better interactions, while sellers were able to reap the benefits of brand loyalty because they better understood their customers' needs, preferences, and even budgets. We used to have **personalized consumption experiences**.

Recommendation systems have made it possible to bring back this sort of personalization - it is something that all companies are currently interested in (including Boots, as this was in many ways the basis of their talk: the aim of generating personalized, targeted marketing). Recommender systems can be used to personalize everything from website content tailored to each visitor to better product recommendations for customers, and their use has snowballed since the inception of the web. In this session we will get a firm grounding in their underpinnings.

## Scenario:
This week we consider the logic behind a working recommendation system that is able to take in user ratings data about products and return recommendations about new items a customer might like but simply hasn't tried before. We'll use the setting of a music store, which has ratings from its customers for some of their prior purchases.


### Task 6a: Getting some customer ratings into the system
As ever, we need some information to underpin our decision making. Just to break from the norm, let's not load in a .csv for once and instead assume that we have been provided with a Python dictionary full of data. This dictionary is keyed by each person's name, and contains information about how they have rated each product.

<span style="font-weight:bold; color:green;">&rarr; run the following code to load the product ratings up into memory, and check you have some information for 10 customers:</span>

In [1]:
customers = {
     "Andrew": {
         "Tame Impala": 5.0,
         "Broken Bells": 2.0,
         "Adele": 3.0,
         "Radiohead": 5.0,
         "Tenacious D": 4.0,
         "The Strokes": 5.0
     },
     "Gavin": {
        "Tame Impala": 2.0, 
        "Broken Bells": 3.5, 
        "Skrillex": 4.0, 
        "Radiohead": 2.0,
        "Tenacious D": 3.5, 
        "Vampire Weekend": 3.0
     },
     "Jennifer": {
        "Tame Impala": 3.5, 
        "Broken Bells": 2.0, 
        "Adele": 4.5, 
        "Radiohead": 5.0, 
        "Tenacious D": 1.5, 
        "The Strokes": 2.5, 
        "Vampire Weekend": 2.0
     },
     "Chan": {
         "Tame Impala": 5.0, 
         "Broken Bells": 1.0,
         "Skrillex": 1.0, 
         "Adele": 3.0,
         "Radiohead": 5, 
         "Tenacious D": 1.0},
     "Dan": {
         "Tame Impala": 3.0,
         "Broken Bells": 4.0,
         "Skrillex": 4.5, 
         "Radiohead": 3.0,
         "Tenacious D": 4.5, 
         "The Strokes": 4.0,
         "Vampire Weekend": 2.0
     },
     "Valerie": {
         "Broken Bells": 4.0, 
         "Skrillex": 1.0,
         "Adele": 4.0,
         "The Strokes": 4.0,
         "Vampire Weekend": 1.0
     },
     "Gordon": {
         "Broken Bells": 4.5,
         "Skrillex": 4.0, 
         "Adele": 5.0,
         "Radiohead": 5.0,
         "Tenacious D": 4.5,
         "The Strokes": 4.0,
         "Vampire Weekend": 4.0
     },
     "Veronica": {
         "Tame Impala": 3.0, 
         "Adele": 5.0,
         "Radiohead": 4.0, 
         "Tenacious D": 2.5, 
         "The Strokes": 3.0
          },
     "Clara": {
         "Tame Impala": 4.75, 
         "Adele": 4.5,
         "Radiohead": 5.0, 
         "The Strokes": 4.25,
         "Jay-Z": 4
    },
     "Robert": {
         "Tame Impala": 4.0, 
         "Adele": 3.0,
         "Radiohead": 5.0, 
         "The Strokes": 2.0,
         "Jay-Z": 1.0
     }
}

print("Our python dictionary has data for {} people in it".format(len(customers)))

Our python dictionary has data for 10 people in it


<span style="font-weight:bold; color:green;">&rarr; Print out the data for Veronica to check your customer data is loaded in correctly:</span>

In [2]:
print(customers["Veronica"])

{'Tame Impala': 3.0, 'Adele': 5.0, 'Radiohead': 4.0, 'Tenacious D': 2.5, 'The Strokes': 3.0}


Veronica appears to be a big Adele fan.

### Task 6b: Taking a peek at the customer ratings
Now, Python dictionaries are fine, but we've got used to being able to manipulate data in pandas. It's very easy to convert into that format, so let's do that below, and have a look at our customer ratings in a nice neat table.

<span style="font-weight:bold; color:green;">&rarr; run the following code to create a pandas DataFrame representation:</span>

In [3]:
import pandas
data = pandas.DataFrame.from_dict(customers, orient="index")
data

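| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Andrew | 5.0 | 2.0 | 3.0 | 5.0 | 4.0 | 5.0 | NaN | NaN | NaN |
| Chan | 5.0 | 1.0 | 3.0 | 5.0 | 1.0 | NaN | 1.0 | NaN | NaN |
| Clara | 4.75 | NaN | 4.5 | 5.0 | NaN | 4.25 | NaN | NaN | 4.0 |
| Dan | 3.0 | 4.0 | NaN | 3.0 | 4.5 | 4.0 | 4.5 | 2.0 | NaN |
| Gavin | 2.0 | 3.5 | NaN | 2.0 | 3.5 | NaN | 4.0 | 3.0 | NaN |
| Gordon | NaN | 4.5 | 5.0 | 5.0 | 4.5 | 4.0 | 4.0 | 4.0 | NaN |
| Jennifer | 3.5 | 2.0 | 4.5 | 5.0 | 1.5 | 2.5 | NaN | 2.0 | NaN |
| Robert | 4.0 | NaN | 3.0 | 5.0 | NaN | 2.0 | NaN | NaN | 1.0 |
| Valerie | NaN | 4.0 | 4.0 | NaN | NaN | 4.0 | 1.0 | 1.0 | NaN |
| Veronica | 3.0 | NaN | 5.0 | 4.0 | 2.5 | 3.0 | NaN | NaN | NaN |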


There are two things worthy of note here:
1. Pandas used every artist name it saw and generated a column for it - if a customer hadn't rated a particular artist then a NaN flag ("Not a Number") was put in instead.
2. While our data doesn't actually have a field for a customer's name, pandas has used that information for the 'index' of the table. This is a bit like a row number, but in a label format.
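
Both points are easy to check in code. The following is a minimal sketch on a cut-down copy of the ratings (the `sample` table and the two result names are illustrative only, not part of the exercise): `isna()` flags the cells pandas filled with NaN, and `notna()` flags the real ratings.

```python
import pandas

# A cut-down copy of the ratings, for illustration only
sample = pandas.DataFrame.from_dict(
    {
        "Andrew": {"Tame Impala": 5.0, "Adele": 3.0, "Radiohead": 5.0},
        "Gavin": {"Tame Impala": 2.0, "Skrillex": 4.0, "Radiohead": 2.0},
        "Valerie": {"Adele": 4.0, "Skrillex": 1.0},
    },
    orient="index",
)

# isna() marks the cells pandas filled in with NaN (no rating given);
# summing down each column counts the missing ratings per artist
missing_per_artist = sample.isna().sum()

# notna() marks real ratings; summing across each row counts them per customer
ratings_per_customer = sample.notna().sum(axis=1)

print(missing_per_artist)
print(ratings_per_customer)
```

The row labels used by `sum(axis=1)` are exactly the customer names that pandas placed in the index.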

We can list these labels as follows:

In [4]:
data.index

Index(['Andrew', 'Chan', 'Clara', 'Dan', 'Gavin', 'Gordon', 'Jennifer',
       'Robert', 'Valerie', 'Veronica'],
      dtype='object')

Having an index for each row can be very useful - for example, to get at an individual user's ratings we can use **loc[]** with their index label, such as:

In [5]:
data.loc['Andrew'] 

Tame Impala        5.0
Broken Bells       2.0
Adele              3.0
Radiohead          5.0
Tenacious D        4.0
The Strokes        5.0
Skrillex           NaN
Vampire Weekend    NaN
Jay-Z              NaN
Name: Andrew, dtype: float64

<span style="font-weight:bold; color:green;">&rarr; print out the data for "Gavin":</span>

In [6]:
print(data.loc['Gavin'])

Tame Impala        2.0
Broken Bells       3.5
Adele              NaN
Radiohead          2.0
Tenacious D        3.5
The Strokes        NaN
Skrillex           4.0
Vampire Weekend    3.0
Jay-Z              NaN
Name: Gavin, dtype: float64


### Task 6c: Finding which customers are similar to each other
At the heart of any recommendation system is finding similarities between either products (content-based recommendation) or customers (user-based recommendation). Let's focus on customers - what does it mean to say that two customers are similar (and hence that one customer's purchases might be good recommendations for another)?

Well, commonalities between their ratings seem like a good place to start. Pandas again makes this easy to do:

<span style="font-weight:bold; color:green;">&rarr; Consider the code below that finds the difference between two customers:</span>

In [7]:
data.loc['Gavin'] - data.loc['Andrew']

Tame Impala       -3.0
Broken Bells       1.5
Adele              NaN
Radiohead         -3.0
Tenacious D       -0.5
The Strokes        NaN
Skrillex           NaN
Vampire Weekend    NaN
Jay-Z              NaN
dtype: float64

What have we done here? Well, we've actually taken one vector (Gavin's) and deducted another (Andrew's) from it. Hopefully this use of vectors to represent information will remind you of our text analytics week.

What happens if we make these differences absolute (positive) and add them together? Well, we have an old friend - Manhattan distance!

<span style="font-weight:bold; color:green;">&rarr; Adjust the previous difference calculation by wrapping the result in np.abs()</span>

In [5]:
import numpy as np

difference = np.abs(data.loc['Gavin'] - data.loc['Andrew'])
print(difference)

Tame Impala        3.0
Broken Bells       1.5
Adele              NaN
Radiohead          3.0
Tenacious D        0.5
The Strokes        NaN
Skrillex           NaN
Vampire Weekend    NaN
Jay-Z              NaN
dtype: float64


<span style="font-weight:bold; color:green;">&rarr; Find the distance between Andrew and Gavin's preferences by taking the absolute differences calculated above and using np.sum()</span>

In [6]:
distance = np.sum(difference)
print("Andrew and Gavin are {} apart".format(distance))

Andrew and Gavin are 8.0 apart
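
Written out as a formula, where $r_{u,i}$ is customer $u$'s rating of artist $i$ and $I_u$ is the set of artists that $u$ has rated (this notation is introduced here purely for illustration), the calculation above is:

$$
d_{\text{Manhattan}}(u, v) = \sum_{i \in I_u \cap I_v} \lvert r_{u,i} - r_{v,i} \rvert
$$

The sum only runs over artists that both customers have rated, because pandas skips the NaN entries when adding up. For Gavin and Andrew those are Tame Impala, Broken Bells, Radiohead and Tenacious D:

$$
\lvert 2 - 5 \rvert + \lvert 3.5 - 2 \rvert + \lvert 2 - 5 \rvert + \lvert 3.5 - 4 \rvert = 3 + 1.5 + 3 + 0.5 = 8
$$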


Now we don't want to type in all of the above every time we want to compare two customers, so let's create a function that will speed things up for us:

<span style="font-weight:bold; color:green;">&rarr; examine the following code, and make sure you understand what it is doing</span>

In [7]:
def manhattan_distance(user1, user2):
    difference = np.abs(user1 - user2)
    distance = np.sum(difference)
    return distance

manhattan_distance(data.loc['Andrew'], data.loc['Gavin'])


8.0

Of course we know this is only one option - we could have created a Euclidean version too. 

<span style="font-weight:bold; color:green;">&rarr; OPTIONAL CHALLENGE - create a version of the above function that calculates the Euclidean distance between two customers</span>

In [8]:
def euclidian_distance(user1, user2):
    #-- square the differences between each user's rating
    difference = np.square(user1 - user2)
    
    #-- add the squared differences together
    distance = np.sum(difference)
    
    #-- and return the square root of that sum!
    sqrt_result = np.sqrt(distance)
    return sqrt_result
    
#-- test out the result by looking at the euclidian distance 
#-- between customers Andrew and Gavin
euclidian_distance(data.loc['Andrew'], data.loc['Gavin'])

4.527692569068709

### Task 6d: A flaw in our thinking?
There is a subtle issue here: when a customer hasn't rated an artist at all, that artist is simply left out of our calculations. Let's quickly explore what this means. To do this, first find the distance between Valerie and Veronica...

<span style="font-weight:bold; color:green;">&rarr; Use pen and paper to work out the Manhattan distance between Valerie and Veronica</span>


In [12]:
data.loc[['Valerie','Veronica']]

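| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Valerie | NaN | 4.0 | 4.0 | NaN | NaN | 4.0 | 1.0 | 1.0 | NaN |
| Veronica | 3.0 | NaN | 5.0 | 4.0 | 2.5 | 3.0 | NaN | NaN | NaN |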


What result did you get? Notice that you had to ignore any artist that didn't have a rating from both customers. How does this compare to the distance between Andrew and Gavin?

<span style="font-weight:bold; color:green;">&rarr; Use pen and paper to work out the Manhattan distance between Andrew and Gavin</span>

In [13]:
data.loc[['Andrew','Gavin']]

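| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Andrew | 5.0 | 2.0 | 3.0 | 5.0 | 4.0 | 5.0 | NaN | NaN | NaN |
| Gavin | 2.0 | 3.5 | NaN | 2.0 | 3.5 | NaN | 4.0 | 3.0 | NaN |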


In [9]:
manhattan_distance(data.loc['Andrew'], data.loc['Gavin'])

8.0

In [10]:
manhattan_distance(data.loc['Valerie'], data.loc['Veronica'])

2.0

Who are closest - Andrew and Gavin, or Valerie and Veronica?

We have found a potential flaw here, one that occurs when using distance measures with real-world data and all its missing values. 
When we computed the distance between Valerie and Veronica, they had only two artists in common (Adele and The Strokes), whereas Gavin and Andrew have four (Tame Impala, Broken Bells, Radiohead and Tenacious D) - so they are innately going to have a higher Manhattan distance, as we are working it out in more dimensions. This seems to skew our distance measurements.

Manhattan Distance and Euclidean Distance work best when there are no missing values. Dealing with missing values is an active area of research, and in a bit we will talk about how to deal with this problem. For now just be aware of the flaw as we continue our first exploration into building a recommendation platform.
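
To make the skew concrete, here is a small sketch that reports how many artists two customers have both rated alongside their Manhattan distance. The helper name `overlap_and_distance` and the per-artist average are assumptions made for this illustration; dividing by the overlap is just one possible adjustment, not the approach we take later in this session.

```python
import numpy as np
import pandas

# The four customers discussed above, copied from the ratings dictionary
ratings = pandas.DataFrame.from_dict(
    {
        "Andrew": {"Tame Impala": 5.0, "Broken Bells": 2.0, "Adele": 3.0,
                   "Radiohead": 5.0, "Tenacious D": 4.0, "The Strokes": 5.0},
        "Gavin": {"Tame Impala": 2.0, "Broken Bells": 3.5, "Skrillex": 4.0,
                  "Radiohead": 2.0, "Tenacious D": 3.5, "Vampire Weekend": 3.0},
        "Valerie": {"Broken Bells": 4.0, "Skrillex": 1.0, "Adele": 4.0,
                    "The Strokes": 4.0, "Vampire Weekend": 1.0},
        "Veronica": {"Tame Impala": 3.0, "Adele": 5.0, "Radiohead": 4.0,
                     "Tenacious D": 2.5, "The Strokes": 3.0},
    },
    orient="index",
)


def overlap_and_distance(user1, user2):
    """Hypothetical helper: (number of co-rated artists, Manhattan distance)."""
    # True only where both customers have a real rating
    both_rated = user1.notna() & user2.notna()
    overlap = int(both_rated.sum())
    # .sum() skips NaN, so only co-rated artists contribute
    distance = float(np.abs(user1 - user2).sum())
    return overlap, distance


for a, b in [("Andrew", "Gavin"), ("Valerie", "Veronica")]:
    overlap, distance = overlap_and_distance(ratings.loc[a], ratings.loc[b])
    print(a, b, "overlap:", overlap, "distance:", distance,
          "per artist:", distance / overlap)
```

Andrew and Gavin are compared over four artists, Valerie and Veronica over only two, so the raw totals are not measured on the same footing.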

### Task 6e: Finding the distances between all customers
OK, at this point in your studies you should be able (with a bit of time, some frustrating debugging and a following wind) to create a table that lists the distances between each pair of customers. 

But time is short, so instead please examine the following code, which I've created for you to do just this:

In [14]:
user_distances = {}

#-- loop through all of the customers
for i in data.index:
    distances = {}
    
    #-- for the current customer loop through the others
    for j in data.index:        
        
        #-- if they are different people calculate and store their distance 
        if i != j:
            distances[j] = manhattan_distance(data.loc[i],data.loc[j])
        #-- otherwise just add a "not-a-number" in the results table
        else:
            distances[j] = np.nan
            
    #-- add the current customers differences to our final results 
    user_distances[i] = distances

#-- cast the results as a pandas dataframe
user_distances = pandas.DataFrame.from_dict(user_distances, orient="index")
user_distances

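| | Andrew | Chan | Clara | Dan | Gavin | Gordon | Jennifer | Robert | Valerie | Veronica |
|---|---|---|---|---|---|---|---|---|---|---|
| Andrew | NaN | 4.0 | 2.5 | 7.5 | 8.0 | 6.0 | 8.0 | 4.0 | 4.0 | 8.5 |
| Chan | 4.0 | NaN | 1.75 | 14.0 | 14.0 | 12.0 | 4.5 | 1.0 | 4.0 | 6.5 |
| Clara | 2.5 | 1.75 | NaN | 4.0 | 5.75 | 0.75 | 3.0 | 7.5 | 0.75 | 4.5 |
| Dan | 7.5 | 14.0 | 4.0 | NaN | 5.0 | 5.0 | 9.0 | 5.0 | 4.5 | 4.0 |
| Gavin | 8.0 | 14.0 | 5.75 | 5.0 | NaN | 6.0 | 9.0 | 5.0 | 5.5 | 4.0 |
| Gordon | 6.0 | 12.0 | 0.75 | 5.0 | 6.0 | NaN | 9.5 | 4.0 | 7.5 | 4.0 |
| Jennifer | 8.0 | 4.5 | 3.0 | 9.0 | 9.0 | 9.5 | NaN | 2.5 | 5.0 | 3.5 |
| Robert | 4.0 | 1.0 | 7.5 | 5.0 | 5.0 | 4.0 | 2.5 | NaN | 3.0 | 5.0 |
| Valerie | 4.0 | 4.0 | 0.75 | 4.5 | 5.5 | 7.5 | 5.0 | 3.0 | NaN | 2.0 |
| Veronica | 8.5 | 6.5 | 4.5 | 4.0 | 4.0 | 4.0 | 3.5 | 5.0 | 2.0 | NaN |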
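
Since we may want to try other measures later, the nested loop can also be wrapped in a reusable function that takes the measure as an argument. This is only a sketch: `pairwise_table` is a hypothetical helper name, and the three-customer `sample` table is a cut-down copy of the ratings used purely to keep the example short.

```python
import numpy as np
import pandas


def manhattan_distance(user1, user2):
    # same calculation as the function defined earlier in this session
    return np.sum(np.abs(user1 - user2))


def pairwise_table(ratings, measure):
    """Hypothetical helper: apply `measure` to every pair of different rows."""
    # start from a table of NaN so the diagonal (a customer vs. themselves) stays NaN
    table = pandas.DataFrame(np.nan, index=ratings.index, columns=ratings.index)
    for i in ratings.index:
        for j in ratings.index:
            if i != j:
                table.loc[i, j] = measure(ratings.loc[i], ratings.loc[j])
    return table


# Cut-down ratings for three customers, for illustration only
sample = pandas.DataFrame.from_dict(
    {
        "Chan": {"Tame Impala": 5.0, "Adele": 3.0, "Radiohead": 5.0},
        "Robert": {"Tame Impala": 4.0, "Adele": 3.0, "Radiohead": 5.0},
        "Clara": {"Tame Impala": 4.75, "Adele": 4.5, "Radiohead": 5.0},
    },
    orient="index",
)

distances = pairwise_table(sample, manhattan_distance)
print(distances)
```

Passing a different function as `measure` produces the equivalent table for that measure without rewriting the loop.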


Let's concentrate on getting some recommendations for Valerie. To do this we need to pick a customer who is similar. Reading off the table above, you can probably see that Valerie is closest in her tastes to Clara, but we can do this in Python using the "sort_values()" function:



In [15]:
user_distances.loc["Valerie"].sort_values()

Clara       0.75
Veronica    2.00
Robert      3.00
Andrew      4.00
Chan        4.00
Dan         4.50
Jennifer    5.00
Gavin       5.50
Gordon      7.50
Valerie      NaN
Name: Valerie, dtype: float64
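
If all we need is the name of the nearest customer (or the nearest few), pandas has shortcuts that skip the NaN entry for us. A small sketch, with Valerie's distances typed in by hand from the output above so that it runs on its own:

```python
import numpy as np
import pandas

# Valerie's row of the distance table, copied from the output above
valerie_distances = pandas.Series(
    {"Andrew": 4.0, "Chan": 4.0, "Clara": 0.75, "Dan": 4.5, "Gavin": 5.5,
     "Gordon": 7.5, "Jennifer": 5.0, "Robert": 3.0, "Valerie": np.nan,
     "Veronica": 2.0},
    name="Valerie",
)

# idxmin() returns the label of the smallest value, ignoring NaN
nearest = valerie_distances.idxmin()

# nsmallest(3) returns the three smallest values with their labels
nearest_three = valerie_distances.nsmallest(3)

print(nearest)
print(nearest_three)
```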

To make some recommendations to Valerie we need to know what Clara likes. We do this as follows:

In [16]:
#-- sort Valerie's distances, and keep the closest one
closest = user_distances.loc["Valerie"].sort_values()[:1]

#-- extract Clara's scores from our original data
closest_scores = data.loc[closest.index]
closest_scores


| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Clara | 4.75 | NaN | 4.5 | 5.0 | NaN | 4.25 | NaN | NaN | 4.0 |


### Task 6f: Making a recommendation
Given this data, what would you recommend? Well, we aren't going to recommend something that Valerie has already bought - rather, we look at where she has a "NaN" and find which of those artists Clara (someone with similar tastes) rates highly. Here is Valerie's data together with Clara's to help you pick:

In [17]:
data.loc[["Valerie", "Clara"]]

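| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Valerie | NaN | 4.0 | 4.0 | NaN | NaN | 4.0 | 1.0 | 1.0 | NaN |
| Clara | 4.75 | NaN | 4.5 | 5.0 | NaN | 4.25 | NaN | NaN | 4.0 |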


Well, I picked "Radiohead" - Valerie hasn't bought any of their music, and Clara thought they were great (5.0, her highest rating). So either via a direct mail, or a message on a till receipt, or an offer voucher, the business can now leverage this information to try and make a sale!

Note, however, that this is a very limited viewpoint - maybe it was a quirk that Clara liked "Radiohead". Perhaps we should have used a bigger 'neighbourhood', found a few more people who were similar to Valerie, and merged their results?


In [18]:
#-- sort Valerie's distances, and keep the closest three this time!
closest = user_distances.loc["Valerie"].sort_values()[:3]

#-- extract the scores for these 3 people from our original data
closest_scores = data.loc[closest.index]
closest_scores


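| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Clara | 4.75 | NaN | 4.5 | 5.0 | NaN | 4.25 | NaN | NaN | 4.0 |
| Veronica | 3.0 | NaN | 5.0 | 4.0 | 2.5 | 3.0 | NaN | NaN | NaN |
| Robert | 4.0 | NaN | 3.0 | 5.0 | NaN | 2.0 | NaN | NaN | 1.0 |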


This time we want to take the mean of all these similar people's ratings, as this will give a smoothed picture of what Valerie might like:

In [19]:
#-- mean rating per artist (column); .mean() skips the NaN entries
recommendations = closest_scores.mean()
recommendations

Tame Impala        3.916667
Broken Bells            NaN
Adele              4.166667
Radiohead          4.666667
Tenacious D        2.500000
The Strokes        3.083333
Skrillex                NaN
Vampire Weekend         NaN
Jay-Z              2.500000
dtype: float64

Let's put these predictions for what Valerie will like in descending order, and get rid of the bands that none of these customers had rated (and which hence had a NaN):

In [20]:
recommendations = recommendations.sort_values(ascending=False).dropna()
print(recommendations)

Radiohead      4.666667
Adele          4.166667
Tame Impala    3.916667
The Strokes    3.083333
Jay-Z          2.500000
Tenacious D    2.500000
dtype: float64


One final step remains: to see which of these Valerie hasn't bought before. This time let's automate the process to produce our final recommendations:

In [21]:
#-- loop around every band
for r in recommendations.index:
    
    #-- check if Valerie has already bought and rated this band's music
    #-- (a missing rating is NaN, and NaN >= 0 is False)
    if data.loc["Valerie", r] >= 0:
        
        #-- and if so remove it from our recommendations
        recommendations = recommendations.drop(r)

#-- leaving us with our final results, in order of what we'd recommend 
print(recommendations)


Radiohead      4.666667
Tame Impala    3.916667
Jay-Z          2.500000
Tenacious D    2.500000
dtype: float64


We still recommend "Radiohead" but are less confident this time. Maybe we should also see if Valerie would like to give Tame Impala a try (as should you!).
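
The steps of this task can be pulled together into a single function. What follows is a sketch under stated assumptions rather than a definitive implementation: `recommend` is a hypothetical helper name, the ratings table is a cut-down copy containing only Valerie and four other customers, and Valerie's distances are typed in from the table in Task 6e.

```python
import numpy as np
import pandas

# Cut-down copy of the ratings: Valerie, her three nearest neighbours, and Gordon
ratings = pandas.DataFrame.from_dict(
    {
        "Valerie": {"Broken Bells": 4.0, "Skrillex": 1.0, "Adele": 4.0,
                    "The Strokes": 4.0, "Vampire Weekend": 1.0},
        "Clara": {"Tame Impala": 4.75, "Adele": 4.5, "Radiohead": 5.0,
                  "The Strokes": 4.25, "Jay-Z": 4.0},
        "Veronica": {"Tame Impala": 3.0, "Adele": 5.0, "Radiohead": 4.0,
                     "Tenacious D": 2.5, "The Strokes": 3.0},
        "Robert": {"Tame Impala": 4.0, "Adele": 3.0, "Radiohead": 5.0,
                   "The Strokes": 2.0, "Jay-Z": 1.0},
        "Gordon": {"Broken Bells": 4.5, "Skrillex": 4.0, "Adele": 5.0,
                   "Radiohead": 5.0, "Tenacious D": 4.5, "The Strokes": 4.0,
                   "Vampire Weekend": 4.0},
    },
    orient="index",
)

# Valerie's Manhattan distances to these customers, copied from Task 6e
valerie_distances = pandas.Series(
    {"Clara": 0.75, "Veronica": 2.0, "Robert": 3.0, "Gordon": 7.5,
     "Valerie": np.nan}
)


def recommend(ratings, distances, customer, k=3):
    """Hypothetical helper: average the k nearest neighbours' ratings and
    return only the artists `customer` has not rated, best first."""
    # the k customers with the smallest distance (NaN entries are ignored)
    neighbours = distances.nsmallest(k).index
    # mean rating per artist across the neighbourhood, skipping NaN
    predicted = ratings.loc[neighbours].mean()
    # True for artists the customer has no rating for
    unrated = ratings.loc[customer].isna()
    return predicted[unrated].dropna().sort_values(ascending=False)


print(recommend(ratings, valerie_distances, "Valerie"))
```

Changing `k` lets you experiment with how the size of the neighbourhood affects the recommendations.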

### Task 6g: The sparsity problem - Cosine Similarity to the rescue
Remember we noted that our distance measures were very dependent on how many items two people had in common - more so, in fact, than on how they rated those items. Time to fix this with cosine similarity, which you should recall from the Week 4 lecture slides.

The rows in our tables are essentially vectors, and we can measure the angle between two vectors to see how similar they are. Crucially this calculation - the cosine similarity - is **not** dependent on the **number** of items two people have in common, and so is preferable when we have sparse datasets. This tends to be the case in two scenarios - text and consumer purchases - so exactly what we need here. 
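
For reference, the standard definition of cosine similarity for two rating vectors $\mathbf{u}$ and $\mathbf{v}$ (symbols chosen here for illustration) is:

$$
\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} = \frac{\sum_i u_i v_i}{\sqrt{\sum_i u_i^2} \, \sqrt{\sum_i v_i^2}}
$$

Because ratings are never negative, the result lies between 0 and 1, and a value closer to 1 means a smaller angle between the two customers' vectors, i.e. more similar tastes.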


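Find below a replacement function for the one we created for Manhattan distance. Note that, despite its name, it returns a similarity rather than a distance: values closer to 1 mean the two customers are more alike.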

In [22]:
def cosine_distance(user1, user2):
    dot_product = np.sum(user1 * user2)
    norm1 = np.sqrt(np.sum(np.square(user1)))
    norm2 = np.sqrt(np.sum(np.square(user2)))
    return dot_product / (norm1 * norm2)
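
A quick sanity check of this function, using three made-up customers (the names and ratings below are hypothetical and not part of our dataset), shows how to read its output:

```python
import numpy as np
import pandas


def cosine_distance(user1, user2):
    # same function as above, repeated so this sketch runs on its own
    dot_product = np.sum(user1 * user2)
    norm1 = np.sqrt(np.sum(np.square(user1)))
    norm2 = np.sqrt(np.sum(np.square(user2)))
    return dot_product / (norm1 * norm2)


# Hypothetical ratings, for illustration only
fan = pandas.Series({"Adele": 4.0, "Radiohead": 2.0})
keener_fan = fan * 1.25          # same tastes, but every rating is higher
opposite = pandas.Series({"Adele": 2.0, "Radiohead": 4.0})

same_direction = cosine_distance(fan, keener_fan)
different_taste = cosine_distance(fan, opposite)

print(same_direction)    # approximately 1: the vectors point the same way
print(different_taste)   # approximately 0.8: a wider angle, so less similar
```

So the larger the value, the more alike two customers are, which is the opposite way round from Manhattan distance.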


<span style="font-weight:bold; color:green;">&rarr; Recalculate the user distances table, this time using this function for cosine similarity (and referring to the code in task 6e)</span>

In [23]:
#-- generate new user_distances using cosine_distance rather than manhattan distance
new_user_distances = {}

#-- loop through all of the customers
for i in data.index:
    distances = {}
    
    #-- for the current customer loop through the others
    for j in data.index:        
        
        #-- if they are different people calculate and store their distance 
        if i != j:
            distances[j] = cosine_distance(data.loc[i],data.loc[j])
        #-- otherwise just add a "not-a-number" in the results table
        else:
            distances[j] = np.nan
            
    #-- add the current customers differences to our final results 
    new_user_distances[i] = distances

#-- cast the results as a pandas dataframe
new_user_distances = pandas.DataFrame.from_dict(new_user_distances, orient="index")
new_user_distances



| | Andrew | Chan | Clara | Dan | Gavin | Gordon | Jennifer | Robert | Valerie | Veronica |
|---|---|---|---|---|---|---|---|---|---|---|
| Andrew | NaN | 0.80947 | 0.811215 | 0.766622 | 0.530192 | 0.724899 | 0.894823 | 0.846217 | 0.5547 | 0.910446 |
| Chan | 0.80947 | NaN | 0.783267 | 0.561768 | 0.519197 | 0.571946 | 0.878426 | 0.924733 | 0.305329 | 0.825417 |
| Clara | 0.811215 | 0.783267 | NaN | 0.47137 | 0.254781 | 0.543001 | 0.835004 | 0.935153 | 0.490399 | 0.852434 |
| Dan | 0.766622 | 0.561768 | 0.47137 | NaN | 0.891961 | 0.832577 | 0.648736 | 0.485479 | 0.560093 | 0.563517 |
| Gavin | 0.530192 | 0.519197 | 0.254781 | 0.891961 | NaN | 0.7788 | 0.540393 | 0.320079 | 0.391652 | 0.371413 |
| Gordon | 0.724899 | 0.571946 | 0.543001 | 0.832577 | 0.7788 | NaN | 0.802569 | 0.549965 | 0.745044 | 0.717939 |
| Jennifer | 0.894823 | 0.878426 | 0.835004 | 0.648736 | 0.540393 | 0.802569 | NaN | 0.901303 | 0.624716 | 0.924628 |
| Robert | 0.846217 | 0.924733 | 0.935153 | 0.485479 | 0.320079 | 0.549965 | 0.901303 | NaN | 0.381385 | 0.884717 |
| Valerie | 0.5547 | 0.305329 | 0.490399 | 0.560093 | 0.391652 | 0.745044 | 0.624716 | 0.381385 | NaN | 0.560241 |
| Veronica | 0.910446 | 0.825417 | 0.852434 | 0.563517 | 0.371413 | 0.717939 | 0.924628 | 0.884717 | 0.560241 | NaN |


<span style="font-weight:bold; color:green;">&rarr; Recalculate your recommendations for Valerie based on our new table</span>

In [24]:
#-- sort Valerie's similarities from highest to lowest (a higher cosine
#-- similarity means a closer match) and keep the top three
new_closest = new_user_distances.loc["Valerie"].sort_values(ascending=False)[:3]

#-- extract the scores for these 3 people from our original data
new_closest_scores = data.loc[new_closest.index]
new_closest_scores

| | Tame Impala | Broken Bells | Adele | Radiohead | Tenacious D | The Strokes | Skrillex | Vampire Weekend | Jay-Z |
|---|---|---|---|---|---|---|---|---|---|
| Gordon | NaN | 4.5 | 5.0 | 5.0 | 4.5 | 4.0 | 4.0 | 4.0 | NaN |
| Jennifer | 3.5 | 2.0 | 4.5 | 5.0 | 1.5 | 2.5 | NaN | 2.0 | NaN |
| Veronica | 3.0 | NaN | 5.0 | 4.0 | 2.5 | 3.0 | NaN | NaN | NaN |


You should have found that Gordon, Jennifer and Veronica are now the three closest customers to Valerie - only Veronica was in the previous neighbourhood, so this is a dramatic difference from before. Let's see what the consequences are for recommendations:

In [25]:
recommendations = new_closest_scores.mean().sort_values(ascending=False).dropna()
print(recommendations)
    

Adele              4.833333
Radiohead          4.666667
Skrillex           4.000000
Broken Bells       3.250000
Tame Impala        3.250000
The Strokes        3.166667
Vampire Weekend    3.000000
Tenacious D        2.833333
dtype: float64


Adele now tops the list, but Valerie has already rated Adele, so among the artists she hasn't tried Radiohead is still the winner, with Tame Impala next. Well, I guess you can't keep good music down.

## Conclusions

Recommendation systems of this nature underpin an increasing number of targeted marketing systems. There are many variants of "collaborative filtering" of this nature - it is a huge and growing area of business research. 

There are also a couple of other popular similarity measures - one example is the use of **"Pearson correlation"** to compare two people's ratings, as this accounts for different tendencies in rating styles that people might have (some people may always rate highly in comparison to others, so while they appear far apart, their correlation will actually be close). However, this still suffers from favouring people with more items to compare, so in most business domains cosine similarity is preferred.
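
To see the "rating style" point in action, here is a short sketch with two made-up customers (the names and ratings are hypothetical). It uses the pandas `Series.corr()` method, which computes the Pearson correlation by default and only uses the items both people have rated:

```python
import pandas

# Hypothetical ratings, for illustration only
generous = pandas.Series(
    {"Adele": 5.0, "Radiohead": 4.0, "Skrillex": 3.0, "Jay-Z": 4.5}
)

# Same preferences, but every rating is two points lower,
# and this customer has never rated Jay-Z
harsh = (generous - 2.0).drop("Jay-Z")

# Manhattan distance over the three co-rated artists (NaN is skipped)
manhattan = (generous - harsh).abs().sum()

# Pearson correlation over the same three artists
correlation = generous.corr(harsh)

print(manhattan)
print(correlation)
```

The two customers look far apart by Manhattan distance, yet their correlation is essentially 1 because they rank the artists in exactly the same way.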

