# Content-Based Recommendation (CBR)

When we use non-personalised and stereotyped recommendations, we benefit from not needing to know anything about our products or our users' preferences (apart from explicit metrics). However, to provide confident recommendations we need a large number of useful reviews, and we have to account for all the risks of a rating scheme such as 5 stars (see Part I).

CBR works the other way around. With this approach we don't need a large number of users providing reviews, but we do need to keep track of the item descriptions and of a user profile, which can then be matched against specific items. This difference in approach comes with some pros and cons:
* **Pros:**
  * As stated before, we don't need a large number of users providing reviews in order to give confident recommendations.
  * As a consequence, new items can be recommended right away, provided we can extract their characteristics.

* **Cons:**
  * Item descriptions can be tricky. Automatically processing and extracting them leads into the fields of Natural Language Processing or even Computer Vision. Besides, descriptions very often rely on subjective qualifiers. When choosing a restaurant or hotel, we usually look for tags about comfort or taste, and these can mean something different to each person.

One very simple approach to CBR is to use keywords and combine them with a user's past reviews in a given domain. In this notebook we illustrate this approach and then some 'smarter' ones, such as the TF-IDF statistic, which matches a user's taste with documents that contain the relevant keywords while giving less weight to keywords that are common across all the other documents. Let's go!

<img src=http://www.vodkr.com/wp-content/uploads/2014/03/netflix_contentrecommendation_599x318.jpg>

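The TF-IDF idea mentioned above can be sketched on a toy binary document-topic matrix. This is only an illustration: the data and the names `toy_docs`, `doc_freq`, `idf` and `tfidf` are made up for this example, and the plain `log(N / df)` weighting is just one of several TF-IDF variants (others add smoothing or normalisation).

```python
import numpy as np
import pandas as pd

# Toy binary document-topic matrix (hypothetical data, not the notebook's file).
toy_docs = pd.DataFrame(
    {
        "politics": [1, 1, 1, 1],
        "soccer":   [1, 0, 0, 0],
        "Europe":   [0, 1, 1, 0],
    },
    index=["docA", "docB", "docC", "docD"],
)

# Document frequency: in how many documents each topic appears.
doc_freq = toy_docs.sum(axis=0)

# Inverse document frequency, using the plain log(N / df) variant.
idf = np.log(len(toy_docs) / doc_freq)

# With binary term frequencies, TF-IDF is each row scaled by the IDF weights.
tfidf = toy_docs * idf

print(idf.round(3))
print(tfidf.round(3))
```

In this toy example, `politics` appears in every document and ends up with weight 0, while the rare topic `soccer` receives the largest weight, which is the behaviour described above.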

# Small data analysis II

Let's work with a small dataset, as we did in notebook I - link [here](https://d396qusza40orc.cloudfront.net/flex-umntestsite/on-demand_files/Assignment%202.xls).

The main table represents a set of documents, and each column contains a possible keyword/characteristic with which we could classify a document. The topics range from sports to economics, and a cell is marked as 1 when the specific document contains that topic.

Besides the main table, we also load review vectors for 2 users, which show the documents each user marked as 'liked'. These vectors are going to be combined with the document feature vectors in order to create a proper 'taste vector', *i.e.*, which features a user liked and with what weight. To simplify the math, the number of stars was reduced to whether the user liked the document; in the data loaded below the values appear as liked = 1, disliked = -1 and not reviewed = 0.

In [2]:
import pandas as pd
import numpy as np

In [28]:
reviewsDS = pd.read_csv('content_based_filtering.csv')

# first 20 rows: document names (column 0) plus the 10 topic columns
docTopics = reviewsDS.iloc[:20, :11].set_index('Unnamed: 0')
print('Nbr Rows/Documents: ' + str(docTopics.shape[0]) + ' - Nbr Columns/Topics: ' + str(docTopics.shape[1]))

# same documents, keeping only the review columns of User 1 and User 2
userReviews = reviewsDS.iloc[:20, [0, 14, 15]].set_index('Unnamed: 0')

Nbr Rows/Documents: 20 - Nbr Columns/Topics: 10


In [29]:
docTopics.head()

| Unnamed: 0 | baseball | economics | politics | Europe | Asia | soccer | war | security | shopping | family |
|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| doc2 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| doc3 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| doc4 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| doc5 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |


In [27]:
userReviews.head()

| Unnamed: 0 | User 1 | User 2 |
|---|---|---|
| doc1 | 1.0 | -1.0 |
| doc2 | -1.0 | 1.0 |
| doc3 | 0.0 | 0.0 |
| doc4 | 0.0 | 1.0 |
| doc5 | 0.0 | 0.0 |

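The reviews table above holds the values 1, -1 and 0 rather than star ratings. A minimal sketch of how star ratings could be reduced to such an encoding is shown here. The star values, the helper `encode_rating` and its threshold `like_from` are assumptions made for illustration; the notebook does not state how the original conversion was done.

```python
import numpy as np
import pandas as pd

# Hypothetical star ratings (1-5); NaN means the user did not review the document.
stars = pd.Series(
    [5.0, 1.0, np.nan, 4.0, np.nan],
    index=["doc1", "doc2", "doc3", "doc4", "doc5"],
)

def encode_rating(star, like_from=4):
    """Map a star rating to 1 (liked), -1 (disliked) or 0 (not reviewed).

    The threshold `like_from` is an assumption made for this sketch.
    """
    if pd.isna(star):
        return 0.0
    return 1.0 if star >= like_from else -1.0

encoded = stars.apply(encode_rating)
print(encoded)
```

Using 0 for missing reviews keeps unrated documents from contributing to any later sum or dot product, whereas leaving them as NaN would propagate NaN through the arithmetic.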

## User Profiles

Given the documents each user liked, we can build a user profile, *i.e.* identify which features the user is more likely to enjoy.

Each time a user 'liked' a document, we can say they also liked the topics contained in that document. By summing the topic vectors of all the documents the user liked (and subtracting those of the documents they disliked), we get an idea of the user's preferred topics and of how strongly they prefer each one.

In [41]:
# taste vector for one user: dot product of the user's review column
# with each topic column of docTopics (one score per topic)
def getTasteVector(userCol, docTopics):
    return docTopics.apply(lambda docCol : np.dot(userCol, docCol))

userTastes = userReviews.apply(lambda col : getTasteVector(col, docTopics))
userTastes

| | User 1 | User 2 |
|---|---|---|
| baseball | 3.0 | -2.0 |
| economics | -2.0 | 2.0 |
| politics | -1.0 | 2.0 |
| Europe | 0.0 | 3.0 |
| Asia | 0.0 | -1.0 |
| soccer | 2.0 | -2.0 |
| war | -1.0 | 0.0 |
| security | -1.0 | 3.0 |
| shopping | 1.0 | 0.0 |
| family | 0.0 | -1.0 |

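The taste vectors can also be written as a single matrix product: the transposed document-topic matrix (topics x documents) times the review matrix (documents x users). The sketch below rebuilds only the five documents displayed by the two `head()` calls, so its numbers differ from the full 20-document result above. The names `doc_topics`, `user_reviews`, `user_tastes` and `via_apply` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# The five documents and reviews shown by docTopics.head() and userReviews.head().
topics = ["baseball", "economics", "politics", "Europe", "Asia",
          "soccer", "war", "security", "shopping", "family"]
doc_topics = pd.DataFrame(
    [[1, 0, 1, 0, 1, 1, 0, 0, 0, 1],
     [0, 1, 1, 1, 0, 0, 0, 1, 0, 0],
     [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0, 1, 1, 0, 0],
     [0, 1, 0, 0, 0, 0, 0, 0, 1, 1]],
    index=["doc1", "doc2", "doc3", "doc4", "doc5"],
    columns=topics,
    dtype=float,
)
user_reviews = pd.DataFrame(
    {"User 1": [1, -1, 0, 0, 0], "User 2": [-1, 1, 0, 1, 0]},
    index=doc_topics.index,
    dtype=float,
)

# Taste vectors as one matrix product: (topics x docs) @ (docs x users).
user_tastes = doc_topics.T.dot(user_reviews)

# Same computation written with nested apply calls, as in getTasteVector.
via_apply = user_reviews.apply(
    lambda col: doc_topics.apply(lambda doc_col: np.dot(col, doc_col))
)

print(user_tastes)
print(np.allclose(user_tastes.values, via_apply.values))
```

Both forms compute the same sums; the matrix product is shorter and avoids the Python-level loop hidden in the nested `apply` calls, which may matter on larger datasets.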