### **Recommendation System**: `Content Based`

A content-based recommendation system recommends items based on their characteristics or attributes. It uses the **content of the items** to determine what other items a user might be interested in, given **their preferences and past behavior**.

For example, a content-based recommender system for movies might recommend movies to a user based on the genres of movies that the user has previously watched. If a user has watched a lot of action movies, the system might recommend other action movies to the user.

Content-based recommendation systems work by **analyzing the content and attributes of items (such as movies, books, products, etc.) and matching them with a user's preferences or profile**. Here's a step-by-step explanation of how content-based recommendation systems work:

- **Item Profile Creation**:

    Each item in the recommendation system is associated with a set of attributes or content features. These attributes can vary depending on the type of items being recommended. For example, in a movie recommendation system, attributes might include genres, actors, directors, and plot keywords.

- **User Profile Creation**:

    The system also maintains a user profile for each user, which contains information about their preferences and past interactions. This profile is built over time as the user interacts with the system, providing feedback or indicating their likes and dislikes.


- **Calculating Item Scores**:

    To recommend items to a user, the system calculates a score for each item by comparing its features to the user's profile. This is often done using mathematical techniques like cosine similarity or Euclidean distance.

- **Ranking and Filtering**:

    The items are ranked based on their scores, with higher-scoring items being recommended first. The system may also apply filtering to remove items that the user has already interacted with or items that don't meet certain criteria.

- **Recommendation Generation**:

    Finally, the system generates a list of recommended items for the user based on the ranked scores. These recommendations are presented to the user through an interface, such as a website or app.
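The five steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the items, genre flags, and ratings are made up, a rating of 0 is taken to mean "not seen", and cosine similarity is chosen as the scoring function.

```python
import numpy as np

# Item profiles (hypothetical): rows are items, columns are genre flags
# in the order Action, Sci-Fi, Comedy.
item_names = ["A", "B", "C", "D"]
item_profiles = np.array([
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
    [0, 0, 1],
], dtype=float)

# User history (hypothetical): one rating per item, 0 assumed to mean "not seen".
ratings = np.array([5, 0, 4, 0], dtype=float)

# User profile: rating-weighted sum of the profiles of the seen items.
user_profile = ratings @ item_profiles

# Item scores: cosine similarity between each item profile and the user profile.
norms = np.linalg.norm(item_profiles, axis=1) * np.linalg.norm(user_profile)
scores = item_profiles @ user_profile / norms

# Filtering and ranking: drop seen items, then sort by score, highest first.
seen = ratings > 0
candidates = [(name, float(s)) for name, s, was_seen in zip(item_names, scores, seen) if not was_seen]
recommendations = sorted(candidates, key=lambda pair: pair[1], reverse=True)

print(recommendations)
```

Here item B ranks above item D because it shares the Sci-Fi flag with an item the user rated highly, while D shares nothing with the user's history.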

### **1. Similarity Measure**

This approach uses **a single item** that the user has already consumed to recommend the next items.

We measure the similarity between two items based on their content using cosine similarity, Pearson correlation, Spearman correlation, or Jaccard distance.
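As a brief illustration of one of these alternatives, Jaccard similarity on binary genre flags is the number of shared genres divided by the number of genres present in either item, and Jaccard distance is one minus that. The sketch below is an assumption-laden example: the helper `jaccard_similarity` and the two vectors are introduced here for illustration only.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Shared flags divided by flags set in either vector (binary inputs)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here for two all-zero vectors
    return float(np.logical_and(a, b).sum() / union)

# Two illustrative genre vectors (hypothetical items).
x = [1, 1, 1, 1, 0, 1]
y = [1, 1, 0, 0, 0, 0]

similarity = jaccard_similarity(x, y)  # 2 shared genres out of 5 in the union
distance = 1 - similarity
print(similarity, distance)
```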

Cosine similarity measures how similar two non-zero vectors in a multi-dimensional space are, as the cosine of the angle between them.

$$ \text{Cosine}(x, y) = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum\limits_{i=1}^n x_i^2} \sqrt{\sum\limits_{i=1}^n y_i^2}} $$

In general, cosine similarity ranges from -1 to 1. For vectors with non-negative entries, such as the binary genre flags used here, it ranges from 0 to 1, with the following interpretation:
- if the vectors point in the same direction, the cosine similarity is 1, indicating maximum similarity
- if the vectors are orthogonal (perpendicular), the cosine similarity is 0, indicating no similarity
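
As a quick hand check of the formula, take two illustrative binary vectors (chosen for this example only), `x = (1, 1, 0, 0)` and `y = (1, 1, 1, 1)`:

$$ \text{Cosine}(x, y) = \frac{1 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 1^2 + 0^2 + 0^2} \sqrt{1^2 + 1^2 + 1^2 + 1^2}} = \frac{2}{\sqrt{2} \cdot 2} \approx 0.71 $$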

In [68]:
import numpy as np
import pandas as pd

df_movies = pd.DataFrame({
    'movie' : ['Terminator 2', 'Interstellar', 'Ant Man', '3 Idiots', 'End Game'],
    'Action' : [1, 0, 1, 0, 1],
    'Sci-Fi' : [1, 1, 1, 0, 1],
    'Adventure' : [0, 1, 1, 0, 1],
    'Comedy' : [0, 0, 1, 1, 1],
    'Drama' : [0, 1, 0, 1, 1],
    'Romance' : [0, 1, 1, 1, 1]
})

df_movies

Unnamed: 0,movie,Action,Sci-Fi,Adventure,Comedy,Drama,Romance
0,Terminator 2,1,1,0,0,0,0
1,Interstellar,0,1,1,0,1,1
2,Ant Man,1,1,1,1,0,1
3,3 Idiots,0,0,0,1,1,1
4,End Game,1,1,1,1,1,1


In [69]:
# the one movie that user1 has watched
antman = df_movies.loc[df_movies['movie']=='Ant Man','Action':]

# movies that user1 has not watched yet
terminator = df_movies.loc[df_movies['movie']=='Terminator 2','Action':]
interstellar = df_movies.loc[df_movies['movie']=='Interstellar','Action':]
idiots = df_movies.loc[df_movies['movie']=='3 Idiots','Action':]
endgame = df_movies.loc[df_movies['movie']=='End Game','Action':]


In [70]:
def cosine(movie1, movie2):
    # .values[0] turns each one-row DataFrame into a 1-D array
    A = movie1.values[0]
    B = movie2.values[0]

    # np.dot(A, B.T)[0][0] -> if .values[0] is not used
    dot_product = np.dot(A, B)

    norm_A = np.linalg.norm(A)
    norm_B = np.linalg.norm(B)

    return dot_product / (norm_A * norm_B)

In [71]:
print(f'Cosine Similarity between Antman and Interstellar : {cosine(antman,interstellar):.2f}')
print(f'Cosine Similarity between Antman and Terminator : {cosine(antman,terminator):.2f}')
print(f'Cosine Similarity between Antman and 3 Idiots : {cosine(antman,idiots):.2f}')
print(f'Cosine Similarity between Antman and End Game : {cosine(antman,endgame):.2f}')

Cosine Similarity between Antman and Interstellar : 0.67
Cosine Similarity between Antman and Terminator : 0.63
Cosine Similarity between Antman and 3 Idiots : 0.52
Cosine Similarity between Antman and End Game : 0.91


In [72]:
from sklearn.metrics.pairwise import cosine_similarity

print(f'Cosine Similarity between Antman and Interstellar : {cosine_similarity(antman,interstellar)[0][0]:.2f}')
print(f'Cosine Similarity between Antman and Terminator : {cosine_similarity(antman,terminator)[0][0]:.2f}')
print(f'Cosine Similarity between Antman and 3 Idiots : {cosine_similarity(antman,idiots)[0][0]:.2f}')
print(f'Cosine Similarity between Antman and End Game : {cosine_similarity(antman,endgame)[0][0]:.2f}')

Cosine Similarity between Antman and Interstellar : 0.67
Cosine Similarity between Antman and Terminator : 0.63
Cosine Similarity between Antman and 3 Idiots : 0.52
Cosine Similarity between Antman and End Game : 0.91


In [73]:
df_unwatched = df_movies.loc[df_movies['movie'] != 'Ant Man', 'Action':]

cosine_similarity(antman, df_unwatched)

array([[0.63245553, 0.67082039, 0.51639778, 0.91287093]])
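
To turn the similarity array above into an ordered recommendation, the unwatched titles can be sorted by score. This is a minimal sketch that re-enters the four values and titles by hand so it runs on its own; `titles`, `similarity`, and `ranked` are names introduced here.

```python
import numpy as np

# Unwatched titles, in the same row order as df_unwatched.
titles = np.array(['Terminator 2', 'Interstellar', '3 Idiots', 'End Game'])

# Cosine similarity of each title to Ant Man, copied from the output above.
similarity = np.array([0.63245553, 0.67082039, 0.51639778, 0.91287093])

# argsort is ascending, so reverse it to get the most similar title first.
order = np.argsort(similarity)[::-1]
ranked = list(zip(titles[order].tolist(), similarity[order].tolist()))

for title, score in ranked:
    print(f'{title}: {score:.2f}')
```

With these values, End Game comes first, followed by Interstellar, Terminator 2, and 3 Idiots.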

### **2. Content Based Filtering**

This approach uses **several items** that the user has already rated to recommend the next items.


**a. Single User**

Let us see how content-based filtering works for a single user. User 1 has rated a number of movies, and there are other movies this user has not watched yet. We want to know which movie the user would like to watch next.

In [74]:
df_movies = pd.DataFrame({
    'movie' : ['Terminator 2', 'Interstellar', 'Ant Man', '3 Idiots', 'End Game'],
    'score' : [7, 9, 8, 9, 10],
    'Action' : [1, 0, 1, 0, 1],
    'Sci-Fi' : [1, 1, 1, 0, 1],
    'Adventure' : [0, 1, 1, 0, 1],
    'Comedy' : [0, 0, 1, 1, 1],
    'Drama' : [0, 1, 0, 1, 1]
})

df_movies

Unnamed: 0,movie,score,Action,Sci-Fi,Adventure,Comedy,Drama
0,Terminator 2,7,1,1,0,0,0
1,Interstellar,9,0,1,1,0,1
2,Ant Man,8,1,1,1,1,0
3,3 Idiots,9,0,0,0,1,1
4,End Game,10,1,1,1,1,1


First we create an item-feature matrix. This matrix combines the items (rows) with their genres as features (columns).

In [75]:
item_feature_matrix = df_movies.loc[:, 'Action':]
item_feature_matrix

Unnamed: 0,Action,Sci-Fi,Adventure,Comedy,Drama
0,1,1,0,0,0
1,0,1,1,0,1
2,1,1,1,1,0
3,0,0,0,1,1
4,1,1,1,1,1


Then we weight the item-feature matrix by the user's ratings: each movie's row is multiplied by the score the user gave that movie.

In [76]:
rating = df_movies['score']
rating

0     7
1     9
2     8
3     9
4    10
Name: score, dtype: int64

In [77]:
item_feature_matrix_with_rating = item_feature_matrix.mul(rating, axis=0)
item_feature_matrix_with_rating

Unnamed: 0,Action,Sci-Fi,Adventure,Comedy,Drama
0,7,7,0,0,0
1,0,9,9,0,9
2,8,8,8,8,0
3,0,0,0,9,9
4,10,10,10,10,10


We sum the weighted matrix column by column to get user 1's overall preferences. The preference we use in this case is a preference for genre: to obtain it, we add up the ratings the user gave within each genre.

In [78]:
score_per_genre = item_feature_matrix_with_rating.sum()
score_per_genre

Action       25
Sci-Fi       34
Adventure    27
Comedy       27
Drama        28
dtype: int64

After that, we convert the preferences into proportions by dividing each genre total by the sum of all totals. The resulting vector of user preferences over the content features is called the User Feature Vector.

In [79]:
user_feature_vector = score_per_genre/score_per_genre.sum() # each score is divided by the total, 141
user_feature_vector

Action       0.177305
Sci-Fi       0.241135
Adventure    0.191489
Comedy       0.191489
Drama        0.198582
dtype: float64

Here we see that the user likes the 'Sci-Fi' genre the most because it has the largest weight.

The user feature vector is then used to measure the user's interest in items that have not been watched yet.

In [86]:
df_unwatched_movies = pd.DataFrame({
    'movie' : ['Transformers', 'Martian', 'GOTG Vol 2', 'Avenger'],
    'Action' : [1, 0, 1, 1],
    'Sci-Fi' : [1, 1, 1, 1],
    'Adventure' : [0, 1, 1, 1],
    'Comedy' : [0, 0, 1, 1],
    'Drama' : [0, 1, 0, 1]
})

df_unwatched_movies

Unnamed: 0,movie,Action,Sci-Fi,Adventure,Comedy,Drama
0,Transformers,1,1,0,0,0
1,Martian,0,1,1,0,1
2,GOTG Vol 2,1,1,1,1,0
3,Avenger,1,1,1,1,1


We combine the user feature vector with the item-feature matrix of the unwatched movies by multiplying each genre column by the user's weight for that genre.

In [87]:
item_feature_matrix_unwatched = df_unwatched_movies.loc[:,'Action':]
item_feature_matrix_unwatched

Unnamed: 0,Action,Sci-Fi,Adventure,Comedy,Drama
0,1,1,0,0,0
1,0,1,1,0,1
2,1,1,1,1,0
3,1,1,1,1,1


In [88]:
df_recommendation = item_feature_matrix_unwatched.mul(user_feature_vector, axis=1)
df_recommendation

Unnamed: 0,Action,Sci-Fi,Adventure,Comedy,Drama
0,0.177305,0.241135,0.0,0.0,0.0
1,0.0,0.241135,0.191489,0.0,0.198582
2,0.177305,0.241135,0.191489,0.191489,0.0
3,0.177305,0.241135,0.191489,0.191489,0.198582


Then we calculate the total score for each item by adding up the scores from each genre.

In [89]:
scoring = df_recommendation.sum(axis=1)
scoring

0    0.418440
1    0.631206
2    0.801418
3    1.000000
dtype: float64
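
Multiplying each genre column by the user's weight and then summing across the row is the same as taking a matrix-vector product. The sketch below checks this in plain NumPy, re-entering the genre totals and the unwatched item-feature matrix by hand so it runs on its own; the variable names are chosen for this example.

```python
import numpy as np

# Genre order: Action, Sci-Fi, Adventure, Comedy, Drama.
score_per_genre = np.array([25, 34, 27, 27, 28])
user_feature_vector = score_per_genre / score_per_genre.sum()

unwatched_titles = ['Transformers', 'Martian', 'GOTG Vol 2', 'Avenger']
item_feature_matrix_unwatched = np.array([
    [1, 1, 0, 0, 0],  # Transformers
    [0, 1, 1, 0, 1],  # Martian
    [1, 1, 1, 1, 0],  # GOTG Vol 2
    [1, 1, 1, 1, 1],  # Avenger
])

# One dot product per movie replaces the multiply-then-sum steps.
scoring = item_feature_matrix_unwatched @ user_feature_vector

for title, score in zip(unwatched_titles, scoring):
    print(f'{title}: {score:.6f}')
```

The printed scores match the `scoring` series above, which is why larger systems usually express this step as a single matrix product.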

In [90]:
df_unwatched_movies['score'] = scoring
df_unwatched_movies.sort_values('score',ascending=False)[:3]

Unnamed: 0,movie,Action,Sci-Fi,Adventure,Comedy,Drama,score
3,Avenger,1,1,1,1,1,1.0
2,GOTG Vol 2,1,1,1,1,0,0.801418
1,Martian,0,1,1,0,1,0.631206


**Summary**

- The content is Genre
- User has rated 5 movies: `Terminator 2`, `Interstellar`, `Ant Man`, `3 Idiots`, and `End Game`
- We use Content Based Filtering to determine what movies to recommend next
- Based on the highest movie score, the recommended films are `Avenger`, then `GOTG Vol 2`, and `Martian`.

**b. Multiple User**

- In practice, we need to make recommendations for many users, not just one.
- The method still looks only at each user's own history (user1's recommendations do not depend on user2's ratings, and so on).

In [49]:
df_movies = pd.DataFrame({
    'movie' : ['Terminator 2', 'Interstellar', 'Ant Man', '3 Idiots'],
    'score' : [7, 9, 8, 9],
    'Action' : [1, 0, 1, 0],
    'Sci-Fi' : [1, 1, 1, 0],
    'Adventure' : [0, 1, 1, 0],
    'Comedy' : [0, 0, 1, 1],
    'Drama' : [0, 1, 0, 1]
})

df_movies


Unnamed: 0,movie,score,Action,Sci-Fi,Adventure,Comedy,Drama
0,Terminator 2,7,1,1,0,0,0
1,Interstellar,9,0,1,1,0,1
2,Ant Man,8,1,1,1,1,0
3,3 Idiots,9,0,0,0,1,1


In [50]:
df_user = pd.DataFrame({
    'user' : ['user1', 'user2', 'user3', 'user4'],
    'Terminator' : [7, 8, 9, 0],
    'Interstellar' : [9, 0, 0, 7],
    'Ant man' : [8, 6, 0, 0],
    '3 Idiots' : [9, 5, 10, 9]
})

df_user

Unnamed: 0,user,Terminator,Interstellar,Ant man,3 Idiots
0,user1,7,9,8,9
1,user2,8,0,6,5
2,user3,9,0,0,10
3,user4,0,7,0,9


We need two things: the user-item rating matrix and the item-feature matrix.

Basically we do the same thing as before: we calculate the user feature vector, but this time for every user. Stacking the user feature vectors, one row per user, gives the user feature matrix.

Next, we combine the user feature matrix with the item-feature matrix to score the movies that each user has not watched yet.

For user 3 and user 4, the highest-scoring unwatched movie turns out to be Ant Man.
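
The multi-user procedure described above can be sketched in plain NumPy. This is a minimal sketch rather than the notebook's own code: it assumes a rating of 0 in `df_user` means "not watched", re-enters the two tables as arrays, and introduces the names `ratings`, `item_features`, and `best_pick` for illustration.

```python
import numpy as np

movies = ['Terminator 2', 'Interstellar', 'Ant Man', '3 Idiots']
users = ['user1', 'user2', 'user3', 'user4']

# Item-feature matrix; columns: Action, Sci-Fi, Adventure, Comedy, Drama.
item_features = np.array([
    [1, 1, 0, 0, 0],  # Terminator 2
    [0, 1, 1, 0, 1],  # Interstellar
    [1, 1, 1, 1, 0],  # Ant Man
    [0, 0, 0, 1, 1],  # 3 Idiots
])

# User-item rating matrix (rows: users, columns: movies).
# Assumption: 0 means the user has not watched the movie.
ratings = np.array([
    [7, 9, 8, 9],
    [8, 0, 6, 5],
    [9, 0, 0, 10],
    [0, 7, 0, 9],
])

# User feature matrix: rating-weighted genre totals, each row scaled to sum to 1.
genre_totals = ratings @ item_features
user_features = genre_totals / genre_totals.sum(axis=1, keepdims=True)

# Score every movie for every user, then hide the movies already watched.
scores = user_features @ item_features.T
scores = np.where(ratings > 0, np.nan, scores)

best_pick = {}
for user, row in zip(users, scores):
    if np.all(np.isnan(row)):
        print(f'{user}: nothing left to recommend')
        continue
    best = int(np.nanargmax(row))
    best_pick[user] = movies[best]
    print(f'{user}: {movies[best]} ({row[best]:.3f})')
```

Under these assumptions the top pick for both user3 and user4 is Ant Man, user2's only unwatched movie is Interstellar, and user1 has already rated every movie in this small catalogue.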