# Content Based Recommenders: Adding IDF Term When Making Predictions

**IDF** stands for Inverse Document Frequency. It down-weights terms that occur in many documents: the more documents contain a term, the smaller its IDF.

$$ IDF = \frac{1}{DF} $$

DF is the number of documents in which a particular content attribute occurs. For example, if the word "baseball" occurs in 4 documents, then

$$ IDF_{baseball} = \frac{1}{4} = 0.25 $$

We are going to incorporate inverse document frequency when computing prediction scores. In this lab we are going to do the following:

- Build weighted user profiles.

- Predict each user's liking or disliking of the unrated documents, taking the **IDF** term into account.

- When we are done, we are going to:
    
    1) Compare the results for **doc1** and **doc9** for **user1**. Display the prediction scores for **doc1** and **doc9**.
    
    2) Look at the (**user2**, **doc6**) pair: when we computed the prediction for (user2, doc6) in the **Weighted User Profile** notebook, the score was moderately positive; now it is slightly negative. Why?



## 1. Settings

In [1]:
# Settings 
import numpy as np
import pandas as pd

## 2. Data

The dataset contains a table of documents content attributes: 20 documents across 10 attributes. 

We also have two users' evaluations of five documents each. 

The content attributes should be interpreted as follows:

- 1 - the document is about the listed topic;
- 0 - the document is not about the listed topic.

User evaluations should be read as follows:

- 1 - the user liked the document;
- 0 - the user never saw the document;
- -1 - the user didn't like the document.

### 2.1. Content Attributes Table

In [2]:
# Content Attributes Table
topics = ["baseball", "economics", "politics", "Europe", "Asia", 
          "soccer", "war", "security", "shopping", "family"]
doclist = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8", "doc9", "doc10",
           "doc11", "doc12", "doc13", "doc14", "doc15", "doc16", "doc17", "doc18", "doc19", "doc20"]

# Start from an all-zero integer table so the matrix keeps a numeric dtype
document_attributes = pd.DataFrame(0, index=doclist, columns=topics)

# Adding documents data
document_attributes.loc['doc1'] = [1,0,1,0,1,1,0,0,0,1]
document_attributes.loc['doc2'] = [0,1,1,1,0,0,0,1,0,0]
document_attributes.loc['doc3'] = [0,0,0,1,1,1,0,0,0,0]
document_attributes.loc['doc4'] = [0,0,1,1,0,0,1,1,0,0]
document_attributes.loc['doc5'] = [0,1,0,0,0,0,0,0,1,1]
document_attributes.loc['doc6'] = [1,0,0,1,0,0,0,0,0,0]
document_attributes.loc['doc7'] = [0,0,0,0,0,0,0,1,0,1]
document_attributes.loc['doc8'] = [0,0,1,1,0,0,1,0,0,1]
document_attributes.loc['doc9'] = [0,0,0,0,0,1,0,0,1,0]
document_attributes.loc['doc10'] = [0,1,0,0,1,0,1,0,0,0]
document_attributes.loc['doc11'] = [0,0,1,0,1,0,0,0,1,0]
document_attributes.loc['doc12'] = [1,0,0,0,0,1,1,0,0,0]
document_attributes.loc['doc13'] = [0,0,1,1,1,0,0,1,0,0]
document_attributes.loc['doc14'] = [0,1,1,1,0,0,0,0,1,0]
document_attributes.loc['doc15'] = [0,0,0,1,0,1,1,1,0,0]
document_attributes.loc['doc16'] = [1,0,0,0,0,1,0,0,1,0]
document_attributes.loc['doc17'] = [0,1,1,1,0,0,0,1,0,0]
document_attributes.loc['doc18'] = [0,0,0,1,0,0,0,0,1,0]
document_attributes.loc['doc19'] = [0,1,1,0,1,0,1,0,0,1]
document_attributes.loc['doc20'] = [0,0,1,1,0,0,1,0,1,0]

document_attributes

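| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| doc2 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| doc3 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| doc4 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| doc5 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| doc6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| doc7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| doc8 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
| doc9 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| doc10 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |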


### 2.2. Users' Evaluations Data

In [3]:
# User Evaluations Data
users = ["User1", "User2"]
# 0 means "not rated"; building the table from a scalar keeps an integer dtype
user_evaluations = pd.DataFrame(0, index=doclist, columns=users)

# Adding users evaluations data
user_evaluations.loc['doc1'] = [1,-1]
user_evaluations.loc['doc2'] = [-1,1]
user_evaluations.loc['doc4'] = [0,1]
user_evaluations.loc['doc6'] = [1,0]
user_evaluations.loc['doc12'] = [0,-1]
user_evaluations.loc['doc16'] = [1,0]
user_evaluations.loc['doc17'] = [0,1]
user_evaluations.loc['doc19'] = [-1,0]

user_evaluations.T

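| | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 | doc9 | doc10 | doc11 | doc12 | doc13 | doc14 | doc15 | doc16 | doc17 | doc18 | doc19 | doc20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| User1 | 1 | -1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | -1 | 0 |
| User2 | -1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |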


### 2.3. Weighted Document Attributes Matrix

In [4]:
# Weighted Document Attributes Matrix

# Divide each row by the square root of its number of attributes.
# topics and doclist are already defined above, so they are not repeated here.
attributes = document_attributes.astype(float)
weighted_document_attributes = attributes.div(np.sqrt(attributes.sum(axis=1)), axis=0)

weighted_document_attributes

| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 0.447214 | 0.0 | 0.447214 | 0.0 | 0.447214 | 0.447214 | 0.0 | 0.0 | 0.0 | 0.447214 |
| doc2 | 0.0 | 0.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 |
| doc3 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 |
| doc4 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 |
| doc5 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.57735 |
| doc6 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| doc7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.707107 |
| doc8 | 0.0 | 0.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.5 | 0.0 | 0.0 | 0.5 |
| doc9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 |
| doc10 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 |
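A side effect of this weighting is worth noting: because the attributes are 0/1, the number of attributes in a row equals its squared Euclidean norm, so every weighted row ends up with length 1. The sketch below checks this on three rows copied from the Content Attributes Table; the names `A`, `W` and `row_norms` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

topics = ["baseball", "economics", "politics", "Europe", "Asia",
          "soccer", "war", "security", "shopping", "family"]

# Three rows copied from the Content Attributes Table.
A = pd.DataFrame(
    [[1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
     [0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
     [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]],
    index=["doc1", "doc2", "doc6"], columns=topics, dtype=float)

# Divide every row by the square root of its attribute count.
W = A.div(np.sqrt(A.sum(axis=1)), axis=0)

# Euclidean length of each weighted row.
row_norms = np.sqrt((W ** 2).sum(axis=1))

print(W.round(6))
print(row_norms)
```

Documents that cover many topics therefore contribute less per topic than documents that cover few.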


### 2.4. Computing DF and IDF vectors

$$ DF = A^T \times l $$

$DF$ - vector of attribute document frequencies

$A^T$ - transposed document attributes matrix

$l$ - vector of ones with one entry per document

To compute the **IDF** vector, we take the reciprocal of each non-zero element of **DF**; zero elements are left as 0.

In [5]:
# IDF (inverse document frequency) vector

def idf(df):
    if df != 0:
        return 1/df
    else:
        return 0

l = np.ones(document_attributes.shape[0])
DF = document_attributes.T.dot(l)
IDF = DF.apply(idf) 
# Display IDF vector as data frame
pd.DataFrame(IDF, columns=['IDF']).T


| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| IDF | 0.25 | 0.166667 | 0.1 | 0.090909 | 0.166667 | 0.166667 | 0.142857 | 0.166667 | 0.142857 | 0.2 |
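The same reciprocal-with-zero-guard can also be written without a helper function. The following is a minimal sketch, not the notebook's own code: the document frequencies are the ones implied by the IDF output above (DF = 1 / IDF), and the `weather` attribute is a hypothetical addition used only to exercise the zero case.

```python
import pandas as pd

topics = ["baseball", "economics", "politics", "Europe", "Asia",
          "soccer", "war", "security", "shopping", "family"]

# Document frequencies implied by the IDF output (DF = 1 / IDF).
DF = pd.Series([4, 6, 10, 11, 6, 6, 7, 6, 7, 5], index=topics, dtype=float)

# Hypothetical attribute that no document mentions (not part of the dataset).
DF["weather"] = 0.0

# Mask zeros as NaN, take the reciprocal, then map the masked entries to 0.
IDF = (1.0 / DF.where(DF > 0)).fillna(0.0)

print(IDF.round(6))
```

Masking before dividing avoids a division by zero while keeping the whole computation vectorized.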


## 3. Weighted User Profiles


In order to build **weighted** user profiles we need to multiply the transposed **user_evaluations** matrix (E) by the **weighted_document_attributes** matrix (W).  

$$ U = E^T \times W $$

$U$ - user profiles matrix

$E$ - user evaluations matrix

$W$ - weighted document attributes matrix

In [6]:
# Calculating User Profiles Matrix

user_profiles = user_evaluations.T.dot(weighted_document_attributes)
user_profiles

| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| User1 | 1.731671 | -0.947214 | -0.5 | 0.207107 | 0.0 | 1.024564 | -0.447214 | -0.5 | 0.57735 | 0.0 |
| User2 | -1.024564 | 1.0 | 1.052786 | 1.5 | -0.447214 | -1.024564 | -0.07735 | 1.5 | 0.0 | -0.447214 |
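To see where a single profile entry comes from, the sketch below recomputes User1's baseball score by hand. Only the five documents User1 rated contribute, since every other rating is 0. The array names are illustrative; the ratings and topic counts are read off the tables above.

```python
import numpy as np

# User1's ratings for doc1, doc2, doc6, doc16, doc19 (all other ratings are 0).
ratings = np.array([1, -1, 1, 1, -1])

# Number of topics each of those documents covers, and whether it mentions baseball.
n_topics = np.array([5, 4, 2, 3, 5])
has_baseball = np.array([1, 0, 1, 1, 0])

# Weighted baseball attribute of each document: 1 / sqrt(number of topics), or 0.
baseball_weights = has_baseball / np.sqrt(n_topics)

# Profile entry = sum over documents of rating * weighted attribute.
profile_baseball = float(ratings @ baseball_weights)
print(round(profile_baseball, 6))
```

The result matches the User1 / baseball cell of the profile table (1.731671): three liked baseball documents add up, and the two disliked documents do not mention baseball at all.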


## 4. Making Predictions

*Notation:* **IDFM** is the diagonal matrix whose diagonal holds the elements of the IDF vector.

To make predictions we will use the following formula:

$$ L = U \times {IDFM} \times W^T $$

$L$ - user predicted preferences matrix

$U$ - user profiles matrix

$IDFM$ - matrix containing the values of **inverse document frequency** vector on its diagonal.

$W^T$ - transposed weighted document attributes matrix

In [7]:
# IDFM Matrix
IDFM = pd.DataFrame(np.diag(IDF), index=topics, columns=topics)

# User Preference Prediction Matrix
user_predicted_preferences = user_profiles.dot(IDFM).dot(weighted_document_attributes.T)
user_predicted_preferences

| | doc1 | doc2 | doc3 | doc4 | doc5 | doc6 | doc7 | doc8 | doc9 | doc10 | doc11 | doc12 | doc13 | doc14 | doc15 | doc16 | doc17 | doc18 | doc19 | doc20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| User1 | 0.247612 | -0.136187 | 0.109459 | -0.089197 | -0.043527 | 0.319432 | -0.058926 | -0.04753 | 0.179067 | -0.128031 | 0.018752 | 0.311648 | -0.057253 | -0.053281 | 0.021184 | 0.396153 | -0.136187 | 0.071635 | -0.121533 | -0.006291 |
| User2 | -0.217167 | 0.329154 | -0.062892 | 0.240296 | 0.044585 | -0.084695 | 0.113531 | 0.070575 | -0.120746 | 0.046812 | 0.01775 | -0.252852 | 0.208553 | 0.204154 | 0.102276 | -0.246472 | 0.329154 | 0.096424 | 0.043343 | 0.115296 |
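Multiplying by a diagonal matrix is the same as scaling each topic column of the user profiles by that topic's IDF. The sketch below illustrates the equivalence using the profile and IDF values printed above; `via_matrix` and `via_broadcast` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

topics = ["baseball", "economics", "politics", "Europe", "Asia",
          "soccer", "war", "security", "shopping", "family"]

# IDF values and user profiles as shown in the outputs above.
IDF = pd.Series([1/4, 1/6, 1/10, 1/11, 1/6, 1/6, 1/7, 1/6, 1/7, 1/5], index=topics)
user_profiles = pd.DataFrame(
    [[1.731671, -0.947214, -0.5, 0.207107, 0.0,
      1.024564, -0.447214, -0.5, 0.57735, 0.0],
     [-1.024564, 1.0, 1.052786, 1.5, -0.447214,
      -1.024564, -0.07735, 1.5, 0.0, -0.447214]],
    index=["User1", "User2"], columns=topics)

# Route 1: explicit diagonal matrix, as in the formula.
IDFM = pd.DataFrame(np.diag(IDF.to_numpy()), index=topics, columns=topics)
via_matrix = user_profiles.dot(IDFM)

# Route 2: element-wise scaling; pandas aligns the Series index with the columns.
via_broadcast = user_profiles * IDF

print(np.allclose(via_matrix.to_numpy(), via_broadcast.to_numpy()))
```

The diagonal-matrix form mirrors the formula, while the broadcast form avoids building a mostly empty 10 x 10 matrix; both give the same IDF-scaled profiles.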


## 5. Q & A

**Question 1**: Compare the results for doc1 and doc9 for user1. Display the prediction scores for doc1 and doc9.

**Answer**: User1 is predicted to like **doc1** more than **doc9**: **doc1** scores **0.2476** and **doc9** scores **0.1791**.
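One way to display these scores, and to rank only the documents User1 has not rated yet, is sketched below. The values are copied from the prediction and evaluation tables above; `scores`, `ratings` and `unrated_ranked` are illustrative names, not variables from the notebook.

```python
import pandas as pd

doclist = [f"doc{i}" for i in range(1, 21)]

# User1's predicted scores, copied from the prediction matrix.
scores = pd.Series(
    [0.247612, -0.136187, 0.109459, -0.089197, -0.043527, 0.319432, -0.058926,
     -0.04753, 0.179067, -0.128031, 0.018752, 0.311648, -0.057253, -0.053281,
     0.021184, 0.396153, -0.136187, 0.071635, -0.121533, -0.006291],
    index=doclist)

# User1's ratings, copied from the evaluations table (0 = not rated).
ratings = pd.Series(
    [1, -1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, -1, 0],
    index=doclist)

# Scores for the two documents named in Question 1.
print(scores[["doc1", "doc9"]])

# Keep only unrated documents and sort them from best to worst.
unrated_ranked = scores[ratings == 0].sort_values(ascending=False)
print(unrated_ranked.head(3))
```

Note that User1 has already rated **doc1**, so of the two documents only **doc9** would be a candidate for recommendation.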

**Question 2**: Look at the (user2, doc6) pair: when we computed the prediction for (user2, doc6) in the Weighted User Profile notebook, the score was moderately positive; now it is slightly negative. Why?

**Answer**: **doc6** now scores **-0.0847**. The sign flipped because every topic's contribution is now scaled by that topic's **IDF**, so topics that occur in few documents count more than topics that occur in many. Baseball occurs in only 4 documents (IDF 0.25), whereas Europe occurs in 11 (IDF 0.0909), so user2's dislike of baseball now outweighs the user's liking of Europe. Let's have a closer look at the computations performed on the (user2, doc6) pair.

**doc6** covers two topics: baseball and Europe.


In [8]:
# Content of doc6
document_attributes.loc[['doc6']]


| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


In [9]:
# Weighted doc6 attributes
weighted_document_attributes.loc[['doc6']]

| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc6 | 0.707107 | 0.0 | 0.0 | 0.707107 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


From **user2**'s profile we can see a score of **-1.0246** on baseball and a score of **1.5000** on Europe. Without **IDF**, the prediction for **doc6** was computed as follows:

$$ S_{doc6} = 0.7071 \times (-1.0246) + 0.7071 \times 1.5000 = 0.3362 $$

In [10]:
# user2 preferences profile

user_profiles.loc[['User2']]


| | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| User2 | -1.024564 | 1.0 | 1.052786 | 1.5 | -0.447214 | -1.024564 | -0.07735 | 1.5 | 0.0 | -0.447214 |


With **IDF**, each score contribution is additionally multiplied by the corresponding **IDF** value:

$$ S_{doc6} = 0.7071 \times (-1.0246) \times 0.25 + 0.7071 \times 1.5000 \times 0.09091 = -0.08470 $$
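The two computations can be reproduced topic by topic. This is a small sketch using the values quoted above; the dictionary names are illustrative.

```python
import numpy as np

# doc6 covers two topics, so each weighted attribute is 1 / sqrt(2).
w = 1.0 / np.sqrt(2.0)

# user2's profile scores and the IDF values for the two topics of doc6.
profile = {"baseball": -1.024564, "Europe": 1.5}
idf = {"baseball": 1.0 / 4.0, "Europe": 1.0 / 11.0}

# Per-topic contributions without and with the IDF factor.
plain = {t: w * profile[t] for t in profile}
with_idf = {t: w * profile[t] * idf[t] for t in profile}

score_plain = sum(plain.values())
score_idf = sum(with_idf.values())

print(plain, round(score_plain, 4))
print(with_idf, round(score_idf, 4))
```

Without IDF the Europe term is larger in magnitude than the baseball term, so the sum is positive. With IDF the baseball term is scaled by 0.25 and the Europe term by only about 0.0909, so the negative baseball term dominates.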