# Content Based Recommenders - Building Weighted User Profile

In this exercise we build a **weighted** user profile from document attribute data and then predict whether each user will like or dislike the documents they have not rated.

You may have noticed that documents with many attributes have more influence on a user profile. Recall the dataset from the [**Basic User Profile**](https://github.com/alv2017/DataAnalysis---ContentBasedRecommenders/blob/master/notebooks/Content%20Based%20Recommenders%20-%20User%20Basic%20Profile.ipynb) notebook: **doc1** has 5 attributes, while **doc6** has only 2. It is reasonable to assume that **doc6** says more about liking baseball than **doc1** does.

To incorporate this assumption into the user profile model, we need to normalize the **document attribute matrix** and recalculate the user profiles.

**Document Attribute Matrix Normalization Procedure**

- Count the number of attributes (entries equal to 1) in each row, i.e. for each document.

- Divide each value in the row by the square root of that count.
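
The two steps above can be sketched on the two documents mentioned earlier. This is a minimal illustration, not the notebook's own cell: the helper name `normalize_row` is hypothetical, and the rows are copied from the content attributes table.

```python
import numpy as np

# Two rows copied from the content attributes table (topic order as in `topics`).
doc1 = np.array([1, 0, 1, 0, 1, 1, 0, 0, 0, 1], dtype=float)  # 5 attributes
doc6 = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0], dtype=float)  # 2 attributes


def normalize_row(row):
    """Divide a 0/1 attribute row by the square root of its attribute count."""
    count = row.sum()            # step 1: number of attributes in the row
    return row / np.sqrt(count)  # step 2: scale every value in the row


doc1_weighted = normalize_row(doc1)
doc6_weighted = normalize_row(doc6)

print(doc1_weighted[0])  # baseball weight in doc1: 1 / sqrt(5)
print(doc6_weighted[0])  # baseball weight in doc6: 1 / sqrt(2)
```

After normalization the baseball entry of **doc6** is larger than that of **doc1**, and every row has unit Euclidean length. The sketch assumes each document has at least one attribute; a row of zeros would cause a division by zero.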

In addition, we will answer the following questions:

1) Which document is User1 predicted to like best under the weighted profile? What is the prediction score for this document?

2) Which document is User2 predicted to like best under the weighted profile? What is the prediction score for this document?

3) Which document is the second best for User1? What is the prediction score for this document?

4) Which document is the second best for User2? What is the prediction score for this document?

5) How many documents will User1 dislike?

6) How many documents will User2 dislike?

## 1. Settings

In [1]:
# Settings 
import numpy as np
import pandas as pd

## 2. Data

The dataset contains a table of content attributes: 20 documents across 10 attributes. 

We also have evaluations from two users, each of whom rated five documents.

The content attributes should be interpreted as follows:

- 1: the document is about the listed topic;
- 0: the document is not about the listed topic.

User evaluations should be read as follows:

- 1: the user liked the document;
- 0: the user has not seen the document;
- -1: the user did not like the document.

### 2.1. Content Attributes Table

In [2]:
# Content Attributes Table
topics = ["baseball", "economics", "politics", "Europe", "Asia", 
          "soccer", "war", "security", "shopping", "family"]
doclist = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6", "doc7", "doc8", "doc9", "doc10",
           "doc11", "doc12", "doc13", "doc14", "doc15", "doc16", "doc17", "doc18", "doc19", "doc20"]

document_attributes = pd.DataFrame(columns=topics, index=doclist)

# Adding documents data
document_attributes.loc['doc1'] = [1,0,1,0,1,1,0,0,0,1] 
document_attributes.loc['doc2'] = [0,1,1,1,0,0,0,1,0,0]
document_attributes.loc['doc3'] = [0,0,0,1,1,1,0,0,0,0]
document_attributes.loc['doc4'] = [0,0,1,1,0,0,1,1,0,0]
document_attributes.loc['doc5'] = [0,1,0,0,0,0,0,0,1,1]
document_attributes.loc['doc6'] = [1,0,0,1,0,0,0,0,0,0]
document_attributes.loc['doc7'] = [0,0,0,0,0,0,0,1,0,1]
document_attributes.loc['doc8'] = [0,0,1,1,0,0,1,0,0,1]
document_attributes.loc['doc9'] = [0,0,0,0,0,1,0,0,1,0]
document_attributes.loc['doc10'] = [0,1,0,0,1,0,1,0,0,0]
document_attributes.loc['doc11'] = [0,0,1,0,1,0,0,0,1,0]
document_attributes.loc['doc12'] = [1,0,0,0,0,1,1,0,0,0]
document_attributes.loc['doc13'] = [0,0,1,1,1,0,0,1,0,0]
document_attributes.loc['doc14'] = [0,1,1,1,0,0,0,0,1,0]
document_attributes.loc['doc15'] = [0,0,0,1,0,1,1,1,0,0]
document_attributes.loc['doc16'] = [1,0,0,0,0,1,0,0,1,0]
document_attributes.loc['doc17'] = [0,1,1,1,0,0,0,1,0,0]
document_attributes.loc['doc18'] = [0,0,0,1,0,0,0,0,1,0]
document_attributes.loc['doc19'] = [0,1,1,0,1,0,1,0,0,1]
document_attributes.loc['doc20'] = [0,0,1,1,0,0,1,0,1,0]

document_attributes

document,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc1,1,0,1,0,1,1,0,0,0,1
doc2,0,1,1,1,0,0,0,1,0,0
doc3,0,0,0,1,1,1,0,0,0,0
doc4,0,0,1,1,0,0,1,1,0,0
doc5,0,1,0,0,0,0,0,0,1,1
doc6,1,0,0,1,0,0,0,0,0,0
doc7,0,0,0,0,0,0,0,1,0,1
doc8,0,0,1,1,0,0,1,0,0,1
doc9,0,0,0,0,0,1,0,0,1,0
doc10,0,1,0,0,1,0,1,0,0,0


### 2.2. Weighted Document Attributes Matrix

In [3]:
# Weighted Document Attributes Matrix

# Divide each row by the square root of its attribute count (the row sum),
# so that a document with many attributes contributes less per attribute.
attributes = document_attributes.astype(float)
attribute_counts = attributes.sum(axis=1)
weighted_document_attributes = attributes.div(np.sqrt(attribute_counts), axis=0)

weighted_document_attributes

document,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc1,0.447214,0.0,0.447214,0.0,0.447214,0.447214,0.0,0.0,0.0,0.447214
doc2,0.0,0.5,0.5,0.5,0.0,0.0,0.0,0.5,0.0,0.0
doc3,0.0,0.0,0.0,0.57735,0.57735,0.57735,0.0,0.0,0.0,0.0
doc4,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.5,0.0,0.0
doc5,0.0,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.57735,0.57735
doc6,0.707107,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
doc7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.707107
doc8,0.0,0.0,0.5,0.5,0.0,0.0,0.5,0.0,0.0,0.5
doc9,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107,0.0
doc10,0.0,0.57735,0.0,0.0,0.57735,0.0,0.57735,0.0,0.0,0.0


### 2.3. User Evaluations Data

In [4]:
# User Evaluations Data
users = ["User1", "User2"]
# Start with every document unrated (0) for both users
user_evaluations = pd.DataFrame(0, index=doclist, columns=users)

# Adding users evaluations data
user_evaluations.loc['doc1'] = [1,-1]
user_evaluations.loc['doc2'] = [-1,1]
user_evaluations.loc['doc4'] = [0,1]
user_evaluations.loc['doc6'] = [1,0]
user_evaluations.loc['doc12'] = [0,-1]
user_evaluations.loc['doc16'] = [1,0]
user_evaluations.loc['doc17'] = [0,1]
user_evaluations.loc['doc19'] = [-1,0]

user_evaluations.T

user,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,doc10,doc11,doc12,doc13,doc14,doc15,doc16,doc17,doc18,doc19,doc20
User1,1,-1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,-1,0
User2,-1,1,0,1,0,0,0,0,0,0,0,-1,0,0,0,0,1,0,0,0


## 2. Building Weighted User Profile

To build the user profiles, we multiply the transposed **user_evaluations** matrix ($E^T$) by the **weighted_document_attributes** matrix ($W$).

$$ U = E^T \times W $$

$U$ - user profiles matrix

$E$ - user evaluations matrix

$W$ - weighted document attributes matrix
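
To make the matrix product concrete, here is a minimal sketch that computes a single entry of $U$ by hand: User1's weight for *baseball*. The dictionary `rated` is a hypothetical helper built from the tables above; it lists each document User1 rated with its rating, its baseball flag, and its attribute count.

```python
import math

# (rating, baseball flag, attribute count) for the five documents User1 rated,
# read from the user evaluations and content attributes tables.
rated = {
    "doc1": (1, 1, 5),
    "doc2": (-1, 0, 4),
    "doc6": (1, 1, 2),
    "doc16": (1, 1, 3),
    "doc19": (-1, 0, 5),
}

# One entry of U: sum over documents of rating * normalized attribute value.
baseball_weight = sum(
    rating * flag / math.sqrt(count) for rating, flag, count in rated.values()
)

print(round(baseball_weight, 6))  # 1.731671
```

Unrated documents have an evaluation of 0 and therefore drop out of the sum, which is why only the five rated documents appear here.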

In [5]:
# Calculating User Profiles Matrix

user_profiles = user_evaluations.T.dot(weighted_document_attributes)
user_profiles

user,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
User1,1.731671,-0.947214,-0.5,0.207107,0.0,1.024564,-0.447214,-0.5,0.57735,0.0
User2,-1.024564,1.0,1.052786,1.5,-0.447214,-1.024564,-0.07735,1.5,0.0,-0.447214


## 3. Computing Document Prediction Scores

In this section we predict whether each user will like or dislike each document. To do that, we calculate a prediction score for every user-document pair, again using matrix operations.

The user predicted preferences matrix ($L$) consists of the document prediction scores for each user.

$$ L = U \times W^T $$

$L$ - user predicted preference matrix

$U$ - user profiles matrix

$W$ - weighted document attributes matrix
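
As a worked example of one entry of $L$, the sketch below recomputes User1's score for **doc16** from the profile values shown in the previous section. The profile numbers are copied from that table (rounded to six decimals), and the variable names are chosen for illustration only.

```python
import math

# User1's weighted profile, copied from the user profiles table above.
user1_profile = {
    "baseball": 1.731671, "economics": -0.947214, "politics": -0.5,
    "Europe": 0.207107, "Asia": 0.0, "soccer": 1.024564,
    "war": -0.447214, "security": -0.5, "shopping": 0.57735, "family": 0.0,
}

# doc16 is about three topics, so each one carries weight 1 / sqrt(3).
doc16_topics = ["baseball", "soccer", "shopping"]
weight = 1 / math.sqrt(len(doc16_topics))

# Prediction score: dot product of the profile with the weighted document row.
score = sum(user1_profile[topic] * weight for topic in doc16_topics)

print(round(score, 4))  # 1.9246
```

Topics that doc16 does not cover have weight 0 in its row, so they contribute nothing to the score.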

In [6]:
# User Predicted Preference Matrix
user_predicted_preferences = user_profiles.dot(weighted_document_attributes.T)
user_predicted_preferences

user,doc1,doc2,doc3,doc4,doc5,doc6,doc7,doc8,doc9,doc10,doc11,doc12,doc13,doc14,doc15,doc16,doc17,doc18,doc19,doc20
User1,1.009019,-0.870053,0.711105,-0.620053,-0.213541,1.370923,-0.353553,-0.370053,1.132724,-0.805073,0.044658,1.333114,-0.396447,-0.331378,0.142229,1.924646,-0.870053,0.554695,-0.847214,-0.081378
User2,-0.845577,2.526393,0.016294,1.987718,0.319151,0.336184,0.744432,1.014111,-0.724476,0.274493,0.349628,-1.227723,1.802786,1.776393,0.949043,-1.183064,2.526393,1.06066,0.483442,1.237718


## 4. Q & A

In this section we are going to answer the questions stated at the beginning of the notebook.

**Question 1:** Which document is User1 predicted to like best under the weighted profile? What is the prediction score for this document?

In [7]:
# Which document User1 is going to like best?
user1_max = user_predicted_preferences.loc['User1'].max()
user1_favorite_documents = \
    user_predicted_preferences.loc['User1'][user_predicted_preferences.loc['User1']==user1_max]
user1_favorite_documents

doc16    1.924646
Name: User1, dtype: float64

**Answer:** User1 is predicted to like **doc16** best (score: 1.9246). In the **Basic Profile Model**, doc16 also received the highest score.

In [8]:
# Content of User1 favourite document(s)
document_attributes.loc[user1_favorite_documents.index]

document,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc16,1,0,0,0,0,1,0,0,1,0


**Question 2:** Which document is User2 predicted to like best under the weighted profile? What is the prediction score for this document?

In [9]:
# Which document User2 is going to like best?
user2_max = user_predicted_preferences.loc['User2'].max()
user2_favorite_documents = \
    user_predicted_preferences.loc['User2'][user_predicted_preferences.loc['User2']==user2_max]
user2_favorite_documents

doc2     2.526393
doc17    2.526393
Name: User2, dtype: float64

**Answer:** User2 is predicted to like **doc2** and **doc17** best. Both documents have a score of 2.5264. With the **Basic User Profile** model, **doc2** and **doc17** were also User2's favourites.

In [10]:
# Content of User2 favourite document(s)
document_attributes.loc[user2_favorite_documents.index]

document,baseball,economics,politics,Europe,Asia,soccer,war,security,shopping,family
doc2,0,1,1,1,0,0,0,1,0,0
doc17,0,1,1,1,0,0,0,1,0,0


**Question 3:**  Which document is the second best for User1? What is the prediction score for this document?

In [11]:
# User1 Top 5 Documents
user_predicted_preferences.T.loc[:, ['User1']].sort_values(['User1'], ascending=False).head(5)

document,User1
doc16,1.924646
doc6,1.370923
doc12,1.333114
doc9,1.132724
doc1,1.009019


**Answer:** The second best document for User1 is **doc6** (score: 1.3709).

**Question 4:** Which document is the second best for User2? What is the prediction score for this document?

In [12]:
# User2 Top 5 Documents
user_predicted_preferences.T.loc[:, ['User2']].sort_values(['User2'], ascending=False).head(5)

document,User2
doc17,2.526393
doc2,2.526393
doc4,1.987718
doc13,1.802786
doc14,1.776393


**Answer:** The second best document for User2 is **doc4** (score: 1.9877). Note that **doc2** and **doc17** share first place.
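
Because of the tie, reading the second row of a sorted table does not give the second best document. One way to handle this, sketched here as an alternative rather than the notebook's method, is dense ranking with pandas `Series.rank`. The scores are copied from the output above, and the variable names are illustrative.

```python
import pandas as pd

# User2's five highest prediction scores, copied from the output above.
user2_top = pd.Series(
    {"doc17": 2.526393, "doc2": 2.526393, "doc4": 1.987718,
     "doc13": 1.802786, "doc14": 1.776393},
    name="User2",
)

# Dense ranking gives tied documents the same rank and leaves no gaps,
# so the next distinct score after the tie receives rank 2.
ranks = user2_top.rank(method="dense", ascending=False).astype(int)
second_best = ranks[ranks == 2].index.tolist()

print(second_best)  # ['doc4']
```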

**Question 5:** How many documents will User1 dislike?

In [13]:
user1_disliked_documents = \
    user_predicted_preferences.loc['User1'][user_predicted_preferences.loc['User1'] < 0]
len(user1_disliked_documents)

11

**Answer:** According to our prediction, User1 will dislike **11** documents.

**Question 6:** How many documents will User2 dislike?

In [14]:
user2_disliked_documents = \
    user_predicted_preferences.loc['User2'][user_predicted_preferences.loc['User2'] < 0]
len(user2_disliked_documents)

4

**Answer:** According to our prediction, User2 will dislike **4** documents.
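
The two counts can also be obtained in one step for both users. The sketch below is self-contained: it rebuilds the prediction scores from the output of section 3 (the name `scores` is illustrative) and, like the cells above, treats a strictly negative score as a predicted dislike.

```python
import pandas as pd

doclist = [f"doc{i}" for i in range(1, 21)]

# Prediction scores copied from the user predicted preferences output.
scores = pd.DataFrame(
    [
        [1.009019, -0.870053, 0.711105, -0.620053, -0.213541,
         1.370923, -0.353553, -0.370053, 1.132724, -0.805073,
         0.044658, 1.333114, -0.396447, -0.331378, 0.142229,
         1.924646, -0.870053, 0.554695, -0.847214, -0.081378],
        [-0.845577, 2.526393, 0.016294, 1.987718, 0.319151,
         0.336184, 0.744432, 1.014111, -0.724476, 0.274493,
         0.349628, -1.227723, 1.802786, 1.776393, 0.949043,
         -1.183064, 2.526393, 1.06066, 0.483442, 1.237718],
    ],
    index=["User1", "User2"],
    columns=doclist,
)

# Count the strictly negative scores in each row (one row per user).
disliked_counts = (scores < 0).sum(axis=1)

print(int(disliked_counts["User1"]), int(disliked_counts["User2"]))  # 11 4
```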