In [2]:
from __future__ import division  # future imports must come first; a no-op on Python 3
%pylab inline
import pandas as pd

import numpy as np
import itertools

import matplotlib.pyplot as plt
import seaborn as sns

import logging
logger = logging.getLogger()

Populating the interactive namespace from numpy and matplotlib


9 Recommendation Systems
=============

Recommendation systems fall into two broad groups:

1. Content-based systems    
   focus on the properties of items.
   
2. Collaborative filtering systems    
   focus on the relationship between users and items.

### 9.1 A Model for Recommendation Systems


#### The Utility Matrix
The utility matrix records the preferences that users give for certain items.

In [11]:
# Example 9.1
M = pd.DataFrame(index=['A', 'B', 'C', 'D'], columns=['HP1', 'HP2', 'HP3', 'TW', 'SW1', 'SW2', 'SW3'])

M.loc['A', ['HP1', 'TW', 'SW1']] = [4, 5, 1]
M.iloc[1, 0:3] = [5, 5, 4]
M.iloc[2, 3:-1] = [2, 4, 5]
M.iloc[3, [1, -1]] = [3, 3]

M

| | HP1 | HP2 | HP3 | TW | SW1 | SW2 | SW3 |
|---|---|---|---|---|---|---|---|
| A | 4.0 | | | 5.0 | 1.0 | | |
| B | 5.0 | 5.0 | 4.0 | | | | |
| C | | | | 2.0 | 4.0 | 5.0 | |
| D | | 3.0 | | | | | 3.0 |


In practice, the matrix would be even **sparser**, with the typical user rating only a tiny fraction of all available items.

The **goal** of a recommendation system is to **predict the blanks** in the utility matrix.     
+ In many applications the goal is slightly weaker:      
  - it is not necessary to predict every blank entry; it is enough to discover some entries in each row that are likely to be high.     
  - it is not necessary to find all items with the highest expected ratings; it is enough to find a large subset of them.
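
A minimal sketch of this weaker goal, assuming predicted ratings are already available from some estimator. The helper name `top_k_blanks` and all numbers below are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

def top_k_blanks(ratings, predicted, k=2):
    """For each user, return the k unrated items with the highest predicted score."""
    # Hide items the user has already rated, so only blanks compete.
    candidates = predicted.where(ratings.isna())
    return {
        user: row.dropna().sort_values(ascending=False).head(k).index.tolist()
        for user, row in candidates.iterrows()
    }

# Hypothetical utility matrix: NaN marks a blank.
ratings = pd.DataFrame(
    [[4, np.nan, np.nan, 5], [np.nan, 3, np.nan, np.nan]],
    index=["A", "D"], columns=["HP1", "HP2", "HP3", "TW"],
)
# Hypothetical predicted ratings for every (user, item) pair.
predicted = pd.DataFrame(
    [[4.2, 3.9, 3.1, 4.8], [2.5, 3.0, 2.9, 4.1]],
    index=ratings.index, columns=ratings.columns,
)

recommendations = top_k_blanks(ratings, predicted, k=2)
print(recommendations)
```

Only the top few blanks per row are ever looked at, which is why a full prediction of every blank is not required.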
  

#### The Long Tail
physical institutions | online institutions
---- | -----
provide only the most popular items | provide the entire range of items


The long tail forces online institutions to recommend items to individual users:

1. It is not possible to present all available items to the user.

2. Neither can we expect users to have heard of each of the items they might like.


#### Applications of Recommendation Systems
1. Product Recommendations

2. Movie Recommendations

3. News Articles


#### Populating the Utility Matrix
How to discover the value users place on items:

1. We can ask users to rate items.    
   cons: most users are unwilling to rate, so the ratings come from a small, biased fraction of people.

2. We can make inferences from users' behavior.    
   e.g., items purchased, viewed, or rated.
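
A small sketch of the second approach, assuming a hypothetical purchase log with one row per (user, item) event. The column names and the data are made up for illustration. A purchase becomes a 1, and everything else stays blank, since the absence of a purchase is no rating at all rather than a low rating.

```python
import pandas as pd

# Hypothetical purchase log: one row per (user, item) event.
log = pd.DataFrame({
    "user": ["A", "A", "B", "C", "C", "C"],
    "item": ["HP1", "TW", "HP1", "TW", "SW1", "TW"],
})

# True where the user bought the item at least once.
bought = pd.crosstab(log["user"], log["item"]) > 0
# 1.0 for a purchase, NaN (blank) otherwise.
utility = bought.astype(float).where(bought)
print(utility)
```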

### 9.2 Content-Based Recommendations


#### 9.2.1 Item Profiles
An item profile is a record representing the important characteristics of an item.


##### Discovering Features
1. for Documents    
   idea: identify the words that characterize the topic of a document.     
   Namely, we expect a set of words to express the subjects or main ideas of the document.       
   1. Eliminate stop words.    
   2. Compute the TF.IDF score for each remaining word in the document.     
   3. Take as the features of a document the $n$ words with the highest TF.IDF scores.     
   
   to measure the similarity of two documents, the distance measures we could use are:    
   1. Jaccard distance     
   2. cosine distance    
      cosine distance of vectors is not affected by components in which both vectors have 0.
   
2. for Images    
   invite users to tag the items with descriptive words or phrases.     
   cons: users are often unwilling to tag $\implies$ there are too few tags, and the ones that exist are biased.
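
The document steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, the toy documents are invented, TF is taken as the word count divided by the count of the most frequent word in the document, and IDF as $\log_2(N/n_i)$. Ties are broken alphabetically so that the result is deterministic.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and", "to", "in"}  # tiny illustrative list

def top_tfidf_words(docs, n=2):
    """Return, for each document, the set of n words with the highest TF.IDF."""
    tokenized = [[w for w in d.lower().split() if w not in STOP_WORDS] for d in docs]
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(w for words in tokenized for w in set(words))
    features = []
    for words in tokenized:
        counts = Counter(words)
        max_count = max(counts.values())
        scores = {
            w: (c / max_count) * math.log2(len(docs) / doc_freq[w])
            for w, c in counts.items()
        }
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        features.append(set(ranked[:n]))
    return features

def jaccard_distance(s, t):
    """1 - |intersection| / |union| of two sets."""
    return 1 - len(s & t) / len(s | t)

docs = [
    "the wizard casts a spell and the wizard defeats the dragon",
    "a wizard rides the dragon and the wizard escapes",
    "the robot repairs the spaceship engine",
    "the robot pilots a spaceship to the moon",
    "a chef cooks pasta in the kitchen",
]
features = top_tfidf_words(docs, n=2)
print(features[0], features[1])
print(jaccard_distance(features[0], features[1]))  # two wizard documents
print(jaccard_distance(features[0], features[2]))  # wizard vs. robot document
```

The two wizard documents share one of their top words, so their Jaccard distance is below 1, while the wizard and robot documents share none.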
   

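##### Generalizing Feature Vectors
1. A discrete feature $\to$ boolean components (one per possible value).

2. A numerical feature $\to$ keep the value as a single component, normalized (scaled) so that it does not dominate the others.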
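
As a rough sketch of such a mixed vector, assume a movie described by its actors (discrete) and its average rating (numerical). The helper `item_vector`, the actor list, and the scale factor `rating_scale` are all hypothetical names and values.

```python
import numpy as np

def item_vector(actors, avg_rating, all_actors, rating_scale=1.0):
    """Boolean components for each actor, plus one scaled numerical component."""
    discrete = [1.0 if a in actors else 0.0 for a in all_actors]
    return np.array(discrete + [rating_scale * avg_rating])

all_actors = ["Julia Roberts", "Tom Hanks", "Meg Ryan"]  # hypothetical vocabulary
movie = item_vector({"Tom Hanks", "Meg Ryan"}, avg_rating=4.0,
                    all_actors=all_actors, rating_scale=0.25)
print(movie)
```

Choosing the scale factor is a modeling decision: a larger value lets the numerical feature weigh more in any distance computed on the vector.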


#### 9.2.5 User Profiles
Create vectors with the same components as the item profiles to describe the user's preferences.  

A user profile can be derived from the utility matrix and the item profiles:

1. Normalize the utility matrix (e.g., to the range $[-1, 1]$ for cosine distance).

2. Each component of a user profile aggregates the utility values of the items that have that feature: utility value * corresponding item vector, averaged over the items.

In [38]:
# example 9.4

users_name = ['U', 'V']
items_name = ['F{}'.format(x) for x in range(4)]
features_name = ['Julia Roberts', 'others']

# utility matrix
M_uti = pd.DataFrame([
                        [3, 4, 5, 0],
                        [6, 2, 3, 5]  
                     ],
                     index=users_name,
                     columns=items_name
                    )

M_uti

| | F0 | F1 | F2 | F3 |
|---|---|---|---|---|
| U | 3 | 4 | 5 | 0 |
| V | 6 | 2 | 3 | 5 |


In [39]:
# item profile: every movie features Julia Roberts; no other feature is set
M_item = pd.DataFrame(0, index=items_name, columns=features_name)
M_item[features_name[0]] = 1

M_item

| | Julia Roberts | others |
|---|---|---|
| F0 | 1 | 0 |
| F1 | 1 | 0 |
| F2 | 1 | 0 |
| F3 | 1 | 0 |


In [41]:
# normalize: subtract each user's mean rating from that user's row
M_uti.apply(lambda x: x - np.mean(x), axis=1)

| | F0 | F1 | F2 | F3 |
|---|---|---|---|---|
| U | 0 | 1 | 2 | -3 |
| V | 2 | -2 | -1 | 1 |


In [45]:
M_user = M_uti.fillna(value=0).dot(M_item) / len(items_name)  # average of the raw (un-normalized) ratings over all items
M_user

| | Julia Roberts | others |
|---|---|---|
| U | 3 | 0 |
| V | 4 | 0 |
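
The cell above averages the raw ratings. A variant that combines both steps (mean-centering, then averaging only over the rated items that carry each feature) might look like the following sketch. The helper `user_profiles` and the data are hypothetical, with `NaN` marking an unrated movie.

```python
import numpy as np
import pandas as pd

def user_profiles(utility, item_profiles):
    """Average mean-centered rating per feature, over rated items having that feature."""
    # Subtract each user's own mean rating; unrated (NaN) entries stay NaN.
    centered = utility.sub(utility.mean(axis=1), axis=0)
    rated = centered.notna().astype(float)
    # Sum of centered ratings for each (user, feature) pair.
    totals = centered.fillna(0).dot(item_profiles)
    # Number of rated items carrying each feature, per user.
    counts = rated.dot(item_profiles)
    return totals / counts.replace(0, np.nan)

# Hypothetical ratings on a 1-to-6 scale.
utility = pd.DataFrame(
    [[3, 4, 5, 1, 2], [2, 3, 5, np.nan, 6]],
    index=["U", "V"], columns=["F0", "F1", "F2", "F3", "F4"], dtype=float,
)
item_profiles = pd.DataFrame(
    {"Julia Roberts": [1, 1, 1, 0, 0], "others": [0, 0, 0, 1, 1]},
    index=utility.columns,
)
profiles = user_profiles(utility, item_profiles)
print(profiles)
```

With centered ratings, a negative component means the user rates items with that feature below their own average, which is what lets cosine distance tell likes from dislikes.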


#### 9.2.6 Recommending Items to Users Based on Content

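1. Estimate the utility directly:    
   $$M_{utility}[user, item] = \text{cosineDistance}(M_{user}, M_{item})$$    
   
   The smaller the cosine distance (that is, the more similar the user profile and the item profile), the higher the probability that the item should be recommended.

2. Use a classification algorithm:     
   predict Recommend or Not with machine learning.    
   This needs one classifier (for example, a decision tree) per user $\to$ it takes too long to construct,     
   so it is used only for relatively small problem sizes.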
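
A minimal sketch of the first approach. Here cosine distance is taken as 1 minus the cosine similarity, which is one common convention (the angle between the vectors is another). The user profile and the item vectors are hypothetical.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical profiles over the features [Julia Roberts, others].
user = [1.0, -1.5]
items = {"F0": [1, 0], "F3": [0, 1], "F5": [1, 1]}

# Smaller distance first: the item points in the direction the user likes.
ranked = sorted(items, key=lambda name: cosine_distance(user, items[name]))
print(ranked)
```

The item that carries only the feature the user likes comes first, and the item that carries only the disliked feature comes last.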

In [49]:
# Exercise 9.2.1

raw_data = [
        [3.06, 2.68, 2.92],
        [500, 320, 640],
        [6, 4, 6]
    ]

M_item = pd.DataFrame(raw_data, index=['Processor Speed', 'Disk Size', 'Main-Memory Size'], columns=['A', 'B', 'C'])

# items: A, B, C; features: Processor Speed, Disk Size, ...
M_item

| | A | B | C |
|---|---|---|---|
| Processor Speed | 3.06 | 2.68 | 2.92 |
| Disk Size | 500.0 | 320.0 | 640.0 |
| Main-Memory Size | 6.0 | 4.0 | 6.0 |
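
One possible way to continue from this table is to compare the computers by the angle between their feature vectors. The sketch below assumes hypothetical scale factors `alpha` (disk size) and `beta` (main-memory size), with processor speed left unscaled. The helper name and the example values 0.01 and 0.5 are illustrative choices, not given in the notes.

```python
import numpy as np
import pandas as pd

# Same numbers as the table above: rows are features, columns are computers.
specs = pd.DataFrame(
    [[3.06, 2.68, 2.92], [500, 320, 640], [6, 4, 6]],
    index=["Processor Speed", "Disk Size", "Main-Memory Size"],
    columns=["A", "B", "C"],
)

def angle_degrees(specs, x, y, alpha=1.0, beta=1.0):
    """Angle between computers x and y after scaling disk by alpha and memory by beta."""
    scale = np.array([1.0, alpha, beta])
    u = specs[x].to_numpy() * scale
    v = specs[y].to_numpy() * scale
    cos = u.dot(v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

for x, y in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(x, y, angle_degrees(specs, x, y, alpha=0.01, beta=0.5))
```

With `alpha = beta = 1` the disk size dominates every vector and all angles come out very small, which is the reason for scaling the components in the first place.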
