```
pip install RISE
jupyter-nbextension install rise --py --sys-prefix
jupyter-nbextension enable rise --py --sys-prefix
```


In [359]:
# Add all necessary imports here
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.reload_library()
plt.style.use("ggplot")

class display2(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return ''.join(self.template.format(a, eval(a)._repr_html_())
                     for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                       for a in self.args)

In [360]:
%%html
<style>

body {
  background: url(img/logo_medium-pad.png)
  no-repeat
  top right;
    padding: 10px;
  /* padding: 10px 10px 0px 0px; */
  padding-right: 10px;
  padding-top: 10px;
}

table.dataframe {
font-size:150%;
}

.column-left{
  float: left;
  width: 45%;
  text-align: left;
  image-align:middle;  
}
.column-left_large{
  float: left;
  width: 52%;
  text-align: left;
  image-align:middle;  
}
.column-right{
  float: right;
  width: 45%;
  text-align: left;
}

</style>





<div class="intro-body">
<p>&nbsp;</p>
<p>&nbsp;</p>
<div class="intro_h1"><h1>Exercise: Building Recommender Systems</h1></div>
<p>&nbsp;</p>
<h3>The Boston Immersion Program</h3>
<p><strong><span class="a">Djork-Arné Clevert</span></strong><span class="b"> [Okko]</span></p>
<p><span class="b">Boston, November 2017</span></p>
<p>&nbsp;</p>
</div>

# A typical business problem

Consider an online fashion retailer that sells hundreds of thousands of different articles.

With the number of customers growing every day, the task at hand is to showcase the best choice of articles to each user...

...according to that user's individual needs and preferences.

# Understand how recommendation works

To understand how a recommendation engine works, let’s assume five products {P1,P2,...,P5} with two major features: “fabric (scratchy|soft)” and “pattern (patterned|patternless)”. The five products have the following properties:

- P1 has a super cool pattern but scratchy fabric
- P2 has a flashy pattern but soft fabric
- P3 has a pronounced pattern but rough fabric
- P4 has a subtle pattern but very soft fabric
- P5 has a poor pattern but super soft fabric

Using these characteristics, we can create an **Item – Feature Matrix**. The value in each cell is the rating of the product for that feature, ranging from zero to one.

In [376]:
import pandas as pd
P = pd.read_excel("data/feature.xlsx", header=0, index_col=0)
P

Unnamed: 0,P1,P2,P3,P4,P5
Pattern,0.9,0.6,0.55,0.4,0.1
Fabric,0.1,0.6,0.45,0.8,0.9


Our test data set also consists of four active users, namely Alon, Andreas, Eren and Okko, with the following preferences:

- Alon: prefers subtle patterns and fabric should be soft.
- Andreas: likes pronounced patterns and fabric is not so important.
- Eren: prefers super soft fabric and faint patterns.
- Okko: only pattern is important, fabric doesn't matter.

Using their interests, we can create a **User – Feature Matrix** as follows:

In [362]:
U = pd.read_excel("data/user.xlsx", header=0, index_col=[0])
U

Unnamed: 0,Pattern,Fabric
Alon,0.3,0.7
Andreas,0.55,0.45
Eren,0.05,0.95
Okko,1.0,0.0


# Content Based Recommendations
We now have two matrices: **Item – Feature (P)** and **User – Feature (U)**.

In [363]:
display2("U","P")

Unnamed: 0,Pattern,Fabric
Alon,0.3,0.7
Andreas,0.55,0.45
Eren,0.05,0.95
Okko,1.0,0.0

Unnamed: 0,P1,P2,P3,P4,P5
Pattern,0.9,0.6,0.55,0.4,0.1
Fabric,0.1,0.6,0.45,0.8,0.9


Considering both matrices, we now calculate a recommendation score for Alon and product P1 as the sum of P1's feature ratings weighted by Alon's preferences:
$$\mathrm{SCORE}_{Alon}^{P1} = \sum_{f} U_{Alon,f}\,P_{f,P1} = [0.30 ~~ 0.70] \cdot [0.9 ~~ 0.1]^T = 0.30 \cdot 0.9 + 0.70 \cdot 0.1 = 0.34$$
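
The same arithmetic can be checked with a short NumPy sketch. The variable names are illustrative only; the numbers are copied from Alon's row of U and the P1 column of P.

```python
import numpy as np

# Alon's preference weights (Pattern, Fabric), copied from the User - Feature matrix
u_alon = np.array([0.30, 0.70])
# P1's feature ratings (Pattern, Fabric), copied from the Item - Feature matrix
p1 = np.array([0.9, 0.1])

# Weighted sum = dot product: 0.30 * 0.9 + 0.70 * 0.1
score_alon_p1 = float(np.dot(u_alon, p1))
print(round(score_alon_p1, 2))  # 0.34
```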



For **content-based systems**, the recommendation is based on a weighted sum of the <br> item's features and the user's preference profile. The features of the items are matched against the <br> preferences of the users to obtain a user – item score.

In [364]:
SCORE = U.dot(P)
SCORE

Unnamed: 0,P1,P2,P3,P4,P5
Alon,0.34,0.6,0.48,0.68,0.66
Andreas,0.54,0.6,0.505,0.58,0.46
Eren,0.14,0.6,0.455,0.78,0.86
Okko,0.9,0.6,0.55,0.4,0.1


The top matched pairs are given as recommendations, as demonstrated below:

In [365]:
SCORE.idxmax(axis=1)

Alon       P4
Andreas    P2
Eren       P5
Okko       P1
dtype: object

- **Alon** prefers subtle patterns and soft fabric --> his reco: **P4** has a subtle pattern but very soft fabric
- **Andreas** likes pronounced patterns and fabric is not so important --> his reco: **P2** has a flashy pattern but soft fabric
- **Eren** prefers super soft fabric over flashy patterns --> his reco: **P5** has a faint pattern but super soft fabric
- For **Okko** only pattern is important, fabric doesn't matter --> his reco: **P1** has a super cool pattern but scratchy fabric
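
A single best match is often too narrow, so one possible extension is to rank the top k products per user. The sketch below rebuilds U and P in memory from the tables shown above (so it runs without the Excel files); `U_demo`, `P_demo` and the helper `top_k` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Values copied from the U and P tables shown earlier in this notebook
U_demo = pd.DataFrame(
    {"Pattern": [0.3, 0.55, 0.05, 1.0], "Fabric": [0.7, 0.45, 0.95, 0.0]},
    index=["Alon", "Andreas", "Eren", "Okko"],
)
P_demo = pd.DataFrame(
    {"P1": [0.9, 0.1], "P2": [0.6, 0.6], "P3": [0.55, 0.45],
     "P4": [0.4, 0.8], "P5": [0.1, 0.9]},
    index=["Pattern", "Fabric"],
)

# User x item score matrix (matrix product of preferences and features)
score_demo = U_demo.dot(P_demo)

def top_k(scores, user, k=2):
    """Return the k highest-scoring products for one user (hypothetical helper)."""
    return scores.loc[user].sort_values(ascending=False).head(k).index.tolist()

for user in score_demo.index:
    print(user, top_k(score_demo, user))
```

The first entry of each list agrees with the `idxmax` result above; the second entry is the runner-up a shop could show next to it.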

**Exercise:**
Alon has changed his preferences: instead of subtle patterns and soft fabric, he is now interested in distinct patterns and very soft fabric.
Adjust the User-Feature (U) matrix accordingly and recommend a product to Alon.<br>
Helpful hints:
```python
U.columns  ## list column names
U.Pattern.Eren ## accessor for column 'Pattern' and row 'Eren'
U.loc['Eren'] ## accessor for all elements in row 'Eren'
U.dot(P) ## weighted products by user preferences 
U.idxmax(axis = 0) ## column (axis=0) or row (axis=1) maximum   
```

# Solution
```
U.loc['Alon', 'Pattern'] = 0.7
U.loc['Alon', 'Fabric'] = 0.6
U.loc['Alon'].dot(P).idxmax()
```

In [374]:
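# .loc writes into U directly; chained attribute assignment may only update a copy
U.loc['Alon', 'Pattern'] = 0.7
U.loc['Alon', 'Fabric'] = 0.6
U.loc['Alon'].dot(P).idxmax()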

u'P2'

# Collaborative Filtering
Content-based recommendation **fails to detect interdependencies** or complex behaviours.

**For example**: Some fashion victims might like cool patterns if and only if they belong to a particular brand.

- Collaborative filtering algorithms take **user behaviour** into account for recommendation. Such systems exploit the behaviour of other users and items in terms of transaction histories, ratings, selection and purchase information.


- Other users' behaviour and preferences over the items are used to recommend items to new users.

This time we **don’t know the features of the items**, but we have user behaviour,<br> i.e., we know how our users bought/rated/visited the existing items.

**User – Behaviour Matrix**

In [366]:
B = pd.read_excel("data/behaviour.xlsx", header=0, index_col=[0], dtype=float)
B_filled = B.fillna(0.0)
display2("B","B_filled")

Unnamed: 0,Alon,Andreas,Eren,Okko
P1,0.5,,,4.0
P2,1.0,,1.0,
P3,,3.0,1.0,3.0
P4,,,4.0,
P5,5.0,2.0,4.0,

Unnamed: 0,Alon,Andreas,Eren,Okko
P1,0.5,0.0,0.0,4.0
P2,1.0,0.0,1.0,0.0
P3,0.0,3.0,1.0,3.0
P4,0.0,0.0,4.0,0.0
P5,5.0,2.0,4.0,0.0
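
Filling missing ratings with zero makes "not rated" indistinguishable from "rated 0", which matters for any statistic computed on the filled matrix. The following sketch rebuilds the behaviour matrix in memory from the table above (`B_demo` and the other names are illustrative) and compares per-product means with and without the zero fill.

```python
import numpy as np
import pandas as pd

# Values copied from the User - Behaviour matrix above (NaN = no rating observed)
B_demo = pd.DataFrame(
    {"Alon": [0.5, 1.0, np.nan, np.nan, 5.0],
     "Andreas": [np.nan, np.nan, 3.0, np.nan, 2.0],
     "Eren": [np.nan, 1.0, 1.0, 4.0, 4.0],
     "Okko": [4.0, np.nan, 3.0, np.nan, np.nan]},
    index=["P1", "P2", "P3", "P4", "P5"],
)

observed = B_demo.notna()                            # True where a rating exists
mean_observed = B_demo.mean(axis=1)                  # NaN values are skipped by default
mean_zero_filled = B_demo.fillna(0.0).mean(axis=1)   # missing ratings count as 0

print(observed.sum(axis=1))  # number of ratings per product
print(pd.DataFrame({"observed_only": mean_observed,
                    "zero_filled": mean_zero_filled}))
```

P4 illustrates the difference: its only rating is a 4.0, yet the zero-filled mean drops to 1.0. Keeping a mask such as `observed` around is one way to remember which entries are real ratings.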


# Item-based collaborative filtering
Item-based collaborative filtering is a model-based algorithm for making recommendations, by exploiting the similarities between different items in the dataset. 
Similarity values are used to predict ratings for user-item pairs not present in the dataset.


<div class="column-left">
<div style="text-align:center"><img src="img/itembased.png" alt="item-based" style="width: 800px;" class="center" /></div>
</div>
<div class="column-center">
</div>
<div class="column-right"><span class="b">
<br>
<br>
The similarity values between items are measured by observing all the users who have rated both items. As shown in the diagram on the left, the similarity between two items depends on the ratings given to them by users who have rated both of them. 
</span>
</div>
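
As a rough illustration of the idea, the sketch below computes an item-item similarity over co-rating users and predicts a missing rating as a similarity-weighted average of the user's own ratings. It is one simple variant under stated assumptions, not the method behind the diagram: cosine is used as the similarity, zero means "not rated", and `R`, `item_similarity` and `predict_rating` are hypothetical names. The ratings are copied from the customer x product matrix used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Customer x product ratings (0 = not rated), copied from this notebook's data
R = pd.DataFrame(
    [[5, 4, 3, 0, 0], [3, 4, 5, 2, 0], [0, 2, 4, 4, 0], [3, 4, 0, 3, 4]],
    index=["Alon", "Andreas", "Eren", "Okko"],
    columns=["P1", "P2", "P3", "P4", "P5"], dtype=float,
)

def item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two items, using only users who rated both."""
    both = (ratings[item_a] != 0) & (ratings[item_b] != 0)
    a, b = ratings.loc[both, item_a], ratings.loc[both, item_b]
    if len(a) == 0:
        return 0.0  # no common raters: treat the items as unrelated
    return float(a.dot(b) / (np.sqrt(a.dot(a)) * np.sqrt(b.dot(b))))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the user's ratings on the other items."""
    rated = [i for i in ratings.columns
             if i != item and ratings.loc[user, i] != 0]
    sims = np.array([item_similarity(ratings, item, i) for i in rated])
    if sims.sum() == 0:
        return 0.0  # nothing to base a prediction on
    user_ratings = ratings.loc[user, rated].to_numpy()
    return float(sims.dot(user_ratings) / sims.sum())

# Alon has not rated P4; estimate it from his ratings of P1, P2 and P3
print(round(predict_rating(R, "Alon", "P4"), 2))
```

With a single common rater, cosine similarity is trivially 1, so practical systems usually require a minimum number of co-ratings before trusting a similarity value.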





# Understanding the concepts of similarity between products 


In the next section we explain the most popular similarity measures (intersection, Jaccard (Tanimoto), cosine and Pearson) and compare them on four different use cases (UCs):

- When two customers rate two products exactly the same (UC1).
- When two customers rate the same products very differently (UC2).
- When two customers rate different products (UC3).
- Positive people vs. negative people (UC4).

# Jaccard similarity

<div class="column-left_large">
<br>
<br>
<img src="img/jaccard.svg" align="center" alt="jaccard" style="width:700px;"/>
</div>
<div class="column-right"><span class="b">
<br>
<ul>
   <li>Similarity score ranges from 0 to 1</li>
   <li>1 means two products were rated by exactly the same users</li>
   <li>0 means two products have no raters in common</li>
</ul>
</span>
</div>
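
A minimal set-based sketch of the Jaccard score: the size of the intersection divided by the size of the union. The function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions made for this example; the two sets list the products with non-zero ratings for Eren and Okko in this notebook's ratings matrix.

```python
def jaccard(set_a, set_b):
    """|A intersect B| / |A union B|; defined here as 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)

# Products with a non-zero rating, taken from the ratings matrix in this notebook
eren = {"P2", "P3", "P4"}
okko = {"P1", "P2", "P4", "P5"}

print(jaccard(eren, okko))  # 2 shared / 5 in total = 0.4
```

Note that only set membership matters here; the rating values themselves are ignored.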

# Cosine similarity



<div class="column-left_large">
<br>
<br>
<img src="img/cosine.svg" align="center" alt="cosine" style="width:700px;"/>
</div>
<div class="column-right"><span class="b">
<br>
<ul>
   <li>Similarity based on the angle between two vectors</li>
   <li>Similarity score ranges from −1 to 1</li>
   <li>1 means two products are rated exactly the same</li>
   <li>-1 means two products are rated exactly opposite</li>
   <li>0 indicates orthogonality (decorrelation)</li>
</ul>
</span>
</div>
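
Because cosine similarity depends only on the angle between the rating vectors, scaling one vector does not change the score. The sketch below demonstrates this with NumPy; the function name `cosine` and the toy vectors are illustrative.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine([1, 5], [1, 5]))    # identical ratings: 1 up to rounding
print(cosine([1, 5], [2, 10]))   # same direction, doubled ratings: still 1 up to rounding
print(cosine([1, 0], [0, 1]))    # orthogonal vectors: 0
print(cosine([1, 2], [-1, -2]))  # opposite direction: -1 up to rounding
```

With non-negative ratings the score cannot fall below 0; negative values appear only after the ratings have been centered or otherwise shifted.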


# Pearson similarity or correlation




<div class="column-left_large">
<br>
<br>
<img src="img/pearson.svg" align="center" alt="pearson" style="width:700px;"/>
</div>
<div class="column-right"><span class="b">
<br>
<ul>
   <li>Similarity based on the deviation from average ratings for two items</li>
   <li>Similarity score ranges from −1 to 1</li>
   <li>1 means two products are rated exactly the same</li>
   <li>-1 means two products are rated exactly opposite</li>
   <li>0 indicates no linear correlation</li>
</ul>
</span>
</div>
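
Pearson correlation can be read as the cosine similarity of the mean-centered rating vectors. The sketch below shows this and cross-checks the result against `np.corrcoef`; the function name `pearson` is illustrative, and the two vectors are Eren's and Okko's rows of this notebook's ratings matrix.

```python
import numpy as np

def pearson(a, b):
    """Cosine similarity of the mean-centered vectors (Pearson correlation)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # deviation from the average rating
    b_c = b - b.mean()
    return float(a_c @ b_c / (np.linalg.norm(a_c) * np.linalg.norm(b_c)))

eren = [0, 2, 4, 4, 0]
okko = [3, 4, 0, 3, 4]

print(round(pearson(eren, okko), 4))                   # -0.6086
print(round(float(np.corrcoef(eren, okko)[0, 1]), 4))  # same value from NumPy
```

Centering removes each rater's overall level, which is what lets Pearson compare generally positive and generally negative raters on the shape of their ratings.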


# Exercise

Complete the following function body, so that it returns the intersection score of two ratings.
```python
def intersection_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of common items
    """
    # remove elements which contain zero ratings

    
    # calculate the intersection 
    
    
    return 

    
```

# Solution



```python
def intersection_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of common items
    """
    # remove zero ratings
    rating1 = rating1.loc[rating1 != 0]
    rating2 = rating2.loc[rating2 != 0]
    return len(rating1.index.intersection(rating2.index))
    
```




In [375]:
import numpy as np

def cosine_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: cosine similarity
    """
    return sum(rating1*rating2)/(pow(sum(pow(rating1,2)),0.5)*pow(sum(pow(rating2,2)),0.5))

def pearson_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: Pearson correlation
    """
    # center on the mean without modifying the caller's data in place
    rating1 = rating1 - np.mean(rating1)
    rating2 = rating2 - np.mean(rating2)

    return sum(rating1*rating2)/(pow(sum(pow(rating1,2)),0.5)*pow(sum(pow(rating2,2)),0.5))

def jaccard_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: Jaccard similarity (size of intersection / size of union)
    """
    rating1 = rating1.loc[rating1 != 0]
    rating2 = rating2.loc[rating2 != 0]
    return len(rating1.index.intersection(rating2.index))/float(len(rating1.index.union(rating2.index)))

def intersection_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of common items
    """
    # remove zero ratings
    rating1 = rating1.loc[rating1 != 0]
    rating2 = rating2.loc[rating2 != 0]
    return len(rating1.index.intersection(rating2.index))

def union(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of items rated in at least one of the two
    """
    # remove zero ratings
    
    rating1 = rating1.loc[rating1 != 0]
    rating2 = rating2.loc[rating2 != 0]
    return len(rating1.index.union(rating2.index))

print(union(mat.loc['Eren'],mat.loc['Okko']))
print(intersection_sim(mat.loc['Eren'],mat.loc['Okko']))
print(jaccard_sim(mat.loc['Eren'],mat.loc['Okko']))
print(cosine_sim(mat.loc['Eren'],mat.loc['Okko']))
print(pearson_sim(mat.loc['Eren'],mat.loc['Okko']))

5
2
0.4
0.471404520791
-0.60858061945


In [368]:
tmp = mat.loc['Eren']

In [369]:
def intersection_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of common items
    """
    # remove zero ratings
    rating1 = rating1.loc[rating1 != 0]
    rating2 = rating2.loc[rating2 != 0]
    return len(rating1.index.intersection(rating2.index))

In [370]:
intersection_sim(mat.loc['Eren'], mat.loc['Alon'])

2

# Exercise
```
def jaccard_sim(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: Jaccard similarity (size of intersection / size of union)
    """
    # remove zero ratings
    rating1 = rating1[rating1 != 0]
    rating2 = rating2[rating2 != 0]
    return len(rating1.index.intersection(rating2.index))/(1.0*len(rating1.index.union(rating2.index)))
    

def union(rating1, rating2):
    """
    :param rating1:
    :param rating2:
    :return: number of items rated in at least one of the two
    """
    # remove zero ratings
    rating1 = rating1[rating1 != 0]
    rating2 = rating2[rating2 != 0]
    return len(rating1.index.union(rating2.index))
```

# Use case 1: Two users rate two products exactly the same 

In [371]:
ratings = pd.DataFrame(columns = ["user", "product", "rating"], 
                       data=[['Silke','P1',1],
                           ['Silke', 'P2', 5],
                           ['Timo','P1',1],
                           ['Timo', 'P2', 5]])
ratings_matrix = ratings.pivot_table(index='user', columns='product', values='rating', fill_value=0)

rating_1 = ratings_matrix.loc['Silke']
rating_2 = ratings_matrix.loc['Timo']

cosine_sim(ratings_matrix.loc['Silke'],ratings_matrix.loc['Timo'])

1.0000000000000002
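
Use case 2 (two users rate the same products very differently) can be sketched in the same way. Timo's reversed ratings below are made up for illustration, and the similarity formulas are written out inline so the cell does not depend on the helper functions defined elsewhere.

```python
import numpy as np
import pandas as pd

# Hypothetical UC2 data: Timo rates the two products the opposite way round
uc2 = pd.DataFrame(columns=["user", "product", "rating"],
                   data=[["Silke", "P1", 1],
                         ["Silke", "P2", 5],
                         ["Timo", "P1", 5],
                         ["Timo", "P2", 1]])
uc2_matrix = uc2.pivot_table(index="user", columns="product",
                             values="rating", fill_value=0)

silke = uc2_matrix.loc["Silke"].to_numpy(dtype=float)
timo = uc2_matrix.loc["Timo"].to_numpy(dtype=float)

# Cosine similarity on the raw ratings
uc2_cosine = float(silke @ timo / (np.linalg.norm(silke) * np.linalg.norm(timo)))

# Pearson correlation: cosine similarity after mean-centering each user
silke_c, timo_c = silke - silke.mean(), timo - timo.mean()
uc2_pearson = float(silke_c @ timo_c
                    / (np.linalg.norm(silke_c) * np.linalg.norm(timo_c)))

print(round(uc2_cosine, 4))   # 10 / 26, about 0.3846
print(round(uc2_pearson, 4))  # -1.0
```

Cosine still reports a positive similarity because all ratings are positive, whereas Pearson exposes the disagreement once each user's average is removed.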

In [372]:
from sklearn.metrics.pairwise import pairwise_distances
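# Jaccard is defined on boolean data: convert "has rated" to True/False explicitly
jac_sim = 1 - pairwise_distances(mat.values.astype(bool), metric = 'jaccard')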
jac_sim

array([[ 1.  ,  0.75,  0.5 ,  0.4 ],
       [ 0.75,  1.  ,  0.75,  0.6 ],
       [ 0.5 ,  0.75,  1.  ,  0.4 ],
       [ 0.4 ,  0.6 ,  0.4 ,  1.  ]])

# Exercise
The data (**'data/behaviour_long.xlsx'**) is stored in a long pandas dataframe. Pivot the data to create a [user x item] matrix and fill missing values with zeros.
Helpful hints:
```python
import pandas as pd ## loads pandas library for data import
pd.read_excel(...) ## imports excel tables
pd.pivot_table(...) ## pivots a table
```


# Solution
```
data = pd.read_excel("data/behaviour_long.xlsx", header=0, index_col=0)
mat = data.pivot_table(index=['customer'],columns='product',values='rating', fill_value=0)
```

In [373]:
data = pd.read_excel("data/behaviour_long.xlsx", header=0, index_col=0)
mat = data.pivot_table(index=['customer'],columns='product',values='rating', fill_value=0)
mat

customer,P1,P2,P3,P4,P5
Alon,5,4,3,0,0
Andreas,3,4,5,2,0
Eren,0,2,4,4,0
Okko,3,4,0,3,4
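
To try the pivot step without the Excel file, the sketch below builds a small long-format frame in memory and pivots it. The rows are a hypothetical stand-in for `data/behaviour_long.xlsx`, reconstructed from the matrix above under the assumption that every zero in the matrix corresponds to a missing row in the long table.

```python
import pandas as pd

# Hypothetical long-format data: one row per (customer, product, rating)
long_demo = pd.DataFrame(
    [("Alon", "P1", 5), ("Alon", "P2", 4), ("Alon", "P3", 3),
     ("Andreas", "P1", 3), ("Andreas", "P2", 4), ("Andreas", "P3", 5),
     ("Andreas", "P4", 2),
     ("Eren", "P2", 2), ("Eren", "P3", 4), ("Eren", "P4", 4),
     ("Okko", "P1", 3), ("Okko", "P2", 4), ("Okko", "P4", 3), ("Okko", "P5", 4)],
    columns=["customer", "product", "rating"],
)

# One row per customer, one column per product; unrated pairs become 0
mat_demo = long_demo.pivot_table(index="customer", columns="product",
                                 values="rating", fill_value=0)
print(mat_demo)
```

`pivot_table` aggregates duplicate (customer, product) pairs with the mean by default, so repeated ratings would be averaged rather than raising an error.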


<img src="img/fashion-mnist-sprite.png" alt="item-based" style="width: 1200px;"/>


# Headline Subslide

<image>
</section>
<section data-background="#F27C3A" data-state="no-title-footer">
  <div class="divider_h1">
    <h1>Divider</h1>
  </div>
</section>
</image>

### Q&A Slide

<image>
</section>
<section data-background="#0093C9" data-state="no-title-footer">
  <div class="divider_h1">
    <h1>Questions???</h1>
  </div>
</section>
</image>