# Content-Based Recommender System

- In this demo I will create a content-based recommender system
- This revolves around the following key ideas (typically used in Data Science):
    1. Vectors
    2. TF-IDF
    3. Cosine Similarity
   
## Vectors:
- Input: text/words
- Output: text/words represented in a vector space model
- Note: this idea seems pretty 'bland', but turning text into numeric vectors is what lets ML algorithms work with text at all.
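
To make the input/output above concrete, here's a minimal sketch in plain Python that turns two made-up sentences into count vectors. The sentences and the `to_vector` helper are my own illustrations, not part of the demo's dataset or code.

```python
# Two toy "documents" (hypothetical examples, not taken from books.csv)
docs = ["python machine learning", "statistical learning"]

# Vocabulary: every distinct word across the documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in doc."""
    words = doc.split()
    return [words.count(term) for term in vocab]

vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)    # ['learning', 'machine', 'python', 'statistical']
print(vectors)  # [[1, 1, 1, 0], [1, 0, 0, 1]]
```

Each position in a vector corresponds to one vocabulary word, so both sentences end up as points in the same 4-dimensional space.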

## TF-IDF:
- TF-IDF stands for Term Frequency-Inverse Document Frequency.
- It scores how important a word is to a document, relative to the rest of the documents you have.
- I won't go into detail, but here are the general steps of TF-IDF:
    1. Create a dictionary of words (bag of words) present in the whole document space (a document space is basically the data you have...it's a list of documents).
    2. Form your vectors: for each document, go through your bag of words and record each word (in the simplest version, 1=present, 0=absent; term frequency uses the count instead). Each document now has its own vector.
    3. Compute TF-IDF: weight each word's term frequency by its inverse document frequency, so words that show up in many documents count for less (can look up the equation on Wikipedia if you'd like).
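
Here's a rough sketch of the three steps above in plain Python, assuming the textbook weighting tf × log(N / df). The toy titles and the helper names (`idf`, `tfidf_vector`) are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ from these.

```python
import math

# Toy document space (hypothetical titles, not the real dataset)
docs = [
    "python machine learning",
    "statistical learning",
    "statistical distributions",
]

# Step 1: bag of words over the whole document space
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for words in tokenized for word in words})
n_docs = len(tokenized)

def idf(term):
    """Inverse document frequency: log(N / number of docs containing term)."""
    doc_freq = sum(1 for words in tokenized if term in words)
    return math.log(n_docs / doc_freq)

def tfidf_vector(words):
    """Steps 2 and 3: relative term frequency, weighted by idf."""
    return [(words.count(term) / len(words)) * idf(term) for term in vocab]

tfidf = [tfidf_vector(words) for words in tokenized]

for doc, vec in zip(docs, tfidf):
    print(doc, [round(weight, 3) for weight in vec])
```

In the first title, "python" gets a higher weight than "learning" because "learning" also appears in another title.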

## Cosine Similarity:
- Cosine similarity measures how similar two non-zero vectors are using the cosine of the angle between them. The vectors in this case are the TF-IDF vectors described above.
- If the angle between two vectors is $0$, then $\cos(0)=1$, and this means that the sentences are closely related.
- If two vectors are orthogonal, then $\cos(90^\circ)=0$, and this means that the sentences share no terms and are treated as unrelated.
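
As a quick sanity check of the two cases above, here's a minimal sketch of the formula $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ in plain Python. The `cosine_sim` helper and the example vectors are hypothetical; the demo itself uses scikit-learn's `cosine_similarity`.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction (angle 0): the result is 1, up to floating-point rounding
same_direction = cosine_sim([1, 2, 0], [2, 4, 0])

# Orthogonal (angle 90 degrees): the dot product is 0, so the result is 0
orthogonal = cosine_sim([1, 0, 0], [0, 1, 0])

print(same_direction, orthogonal)
```

Note that only the direction matters: `[2, 4, 0]` is twice as long as `[1, 2, 0]` but still counts as maximally similar.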

That's enough explaining...let me show you how this would be done.

# Data
- Here is some synthetic data I made. It's just a list of book titles about machine learning.
- Task: Recommend a book to me based on other book titles
- The dataset is called books.csv
- Key assumption:
    - Each book title is detailed enough to convey what the book is about.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# read in dataset
df = pd.read_csv("data/books.csv")
df

|   | ID | Book Title |
|---|----|------------|
| 0 | 1 | Probabilistic Graphical Models |
| 1 | 2 | Bayesian Data Analysis |
| 2 | 3 | Doing data science |
| 3 | 4 | Pattern Recognition and Machine Learning |
| 4 | 5 | The Elements of Statistical Learning |
| 5 | 6 | An introduction to Statistical Learning |
| 6 | 7 | Python Machine Learning |
| 7 | 8 | Natural Langauage Processing with Python |
| 8 | 9 | Statistical Distributions |
| 9 | 10 | Monte Carlo Statistical Methods |


# Compute word n-grams

- Again, so we don't lose focus on the main subject, I'll explain n-grams only briefly: an n-gram is a run of n consecutive words.
- ngram(1,3) takes into account 1-grams, 2-grams, and 3-grams.
- e.g. Let the sentence be "I like basketball"
    - ngram(1,3) = {'I', 'like', 'basketball', 'I like', 'like basketball', 'I like basketball'}
    - i.e. it's every run of 1 to 3 consecutive words, kept in their original order (so 'I basketball' is not included)
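
Here's a minimal sketch in plain Python that reproduces the example above. The `word_ngrams` helper is hypothetical and only splits on whitespace. `TfidfVectorizer` does its own tokenizing and stop-word removal, so its n-grams for the same sentence can differ.

```python
def word_ngrams(sentence, min_n=1, max_n=3):
    """Return every run of min_n..max_n consecutive words in the sentence."""
    words = sentence.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the sentence
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

print(word_ngrams("I like basketball"))
# ['I', 'like', 'basketball', 'I like', 'like basketball', 'I like basketball']
```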

In [3]:
# create object to convert collection of raw text docs to a matrix of TF-IDF features
# (min_df=1 keeps every term; recent scikit-learn versions reject the integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3),
                     min_df=1, stop_words='english')

In [4]:
# learn vocab and idf, return document-term matrix (one row per book title)
tfidf_matrix = tf.fit_transform(df['Book Title'])

# compute similarities
cos_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [5]:
# save results as a dictionary
results = {}

# go through each row of df
for idx, row in df.iterrows():
    # rank every book by cosine similarity to this one, most similar first
    similar_indices = cos_sim[idx].argsort()[::-1]

    # pair each score with its book ID (recommend() trims the list to the requested size)
    similar_items = [(cos_sim[idx][i], df['ID'][i]) for i in similar_indices]
    # drop the first entry, which is normally the book itself
    results[row['ID']] = similar_items[1:]
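
One caveat about the loop above: the `[1:]` slice assumes the book itself always sorts first. Here's a minimal sketch of a ranking that skips the book by index instead, in plain Python on a hand-made similarity matrix. The matrix values and the `top_similar` helper are hypothetical, not part of the demo.

```python
# Hypothetical 3x3 similarity matrix: sim[i][j] = similarity of item i to item j
sim = [
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.0],
    [0.7, 0.0, 1.0],
]

def top_similar(sim, idx, k):
    """Return the k most similar items to idx as (score, index) pairs."""
    # Exclude the item itself by index, not by its position after sorting
    others = [j for j in range(len(sim[idx])) if j != idx]
    ranked = sorted(others, key=lambda j: sim[idx][j], reverse=True)
    return [(sim[idx][j], j) for j in ranked[:k]]

print(top_similar(sim, 0, 2))  # [(0.7, 2), (0.2, 1)]
```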

In [6]:
# this function returns the title of the book that matches the id
# (as a plain string, not as a dataframe)
def item(book_id):
    return df.loc[df['ID'] == book_id]['Book Title'].tolist()[0]

# this function prints the most similar books
# input: book_id = id of the book, num = number of similar books to print
# output: none (the titles and similarity scores are printed)
def recommend(book_id, num):
    if num == 0:
        print("You asked for 0 books dawg! I can't recommend anything if you don't ask for at least one.")
        return
    elif num == 1:
        print("Here is " + str(num) + " book similar to " + item(book_id) + ":")
    else:
        print("Here are " + str(num) + " books similar to " + item(book_id) + ":")

    print("----------------------------------------------------------")
    recs = results[book_id][:num]
    for rec in recs:
        print(item(rec[1]) + " (score:" + str(rec[0]) + ")")


In [7]:
recommend(7, 5)

Here are 5 books similar to Python Machine Learning :
----------------------------------------------------------
Pattern Recognition and Machine Learning  (score:0.30466902859680617)
Machine Learning :A Probablisitic Perspective (score:0.29088234870148644)
Natural Langauage Processing with Python (score:0.1232123322436553)
An introduction to Statistical Learning  (score:0.10186256812768607)
The Elements of Statistical Learning  (score:0.09932294987882118)
