# Firli Ilhami

## Hybrid Filtering Algorithm
## Objective
Hybrid filtering combines the Collaborative and Content-Based Filtering algorithms. It comes in 2 types:
* Collaborative + Content Based
* Content Based + Collaborative

The difference between them is the order in which the two algorithms are applied.
### Note: I recommend reading the content-based and collaborative filtering algorithms first
In this case (Hybrid Filtering), I modify some of their functions without explaining them in full.
The full explanations are in the other code (Collaborative and Content-Based Filtering Algorithm).

## Import library and dataset

In [4]:
from recommendation_data import dataset
from math import sqrt
import pandas as pd
import numpy as np

In [5]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

## Dataset
In this case, we will use 2 datasets. The first one contains films and their descriptions. Each description consists of 4 words:
1. First word  : Director
2. Second word : Actress
3. Third word  : Genre
4. Last word   : Country
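
Because each description follows this fixed four-word layout, it can be split into named fields. The snippet below is a minimal sketch that assumes every description has exactly four whitespace-separated words; the helper name `split_description` and the field names are illustrative and not part of the notebook.

```python
def split_description(description):
    """Split a four-word description into its named parts."""
    fields = ["director", "actress", "genre", "country"]
    words = description.split()
    if len(words) != len(fields):
        raise ValueError(f"expected 4 words, got {len(words)}: {description!r}")
    return dict(zip(fields, words))


# Example with a description taken from the first dataset
print(split_description("Budi Tom Romance ID"))
```

A description with multi-word names would break this assumption, which is why the helper raises an error instead of guessing.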

In [7]:
df=pd.read_csv('dataset1.csv',delimiter=';')
df.head()

| | Film | Description |
|---|---|---|
| 0 | Ada Apa dengan Cinta 2 | Budi Tom Romance ID |
| 1 | Aladdin | Budi Cruise Romance US |
| 2 | Avengers: End Game | Andi Michael Action US |
| 3 | Bumi Manusia | Charles Tom Drama ID |
| 4 | Captain Marvel | Andi Cruise Action US |


The second dataset contains viewers' ratings of the films. There are 10 films:

1. Ada Apa dengan Cinta 2
2. Aladdin
3. Avengers: End Game
4. Bumi Manusia
5. Captain Marvel
6. Dilan 1991
7. Dua Garis Biru
8. Gundala
9. Spiderman: Far From Home
10. The Lion King

Ratings range from 1 to 5. A score of 0 means the viewer has not watched that film. There are 24 respondents.

Our dataset is not a DataFrame, so I will build one to make the data easier to inspect.

In [8]:
df1=pd.DataFrame(dataset)
df1.head()

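| | ANI | AhokTemanFirli | Damar Teman Firli | Dpv | Febi ganteng gak ada obat | Genjeh | Hania | Indra 1991 SM | Indra Junior | Jawaharal | ... | Putrisqiana | Rima | Romantika | Star | Topik Zulkarnain | bunga | faizah | franadek | jul | luck |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Ada Apa dengan Cinta 2 | 4 | 0 | 5 | 5 | 4 | 5 | 3 | 0 | 4 | 2 | ... | 4 | 5 | 5 | 4 | 0 | 0 | 3 | 4 | 0 | 3 |
| Aladdin | 4 | 0 | 0 | 0 | 5 | 5 | 0 | 0 | 5 | 5 | ... | 0 | 5 | 0 | 5 | 0 | 5 | 0 | 5 | 3 | 0 |
| Avengers: End Game | 0 | 3 | 5 | 5 | 5 | 5 | 0 | 0 | 5 | 5 | ... | 5 | 5 | 0 | 5 | 5 | 5 | 5 | 5 | 3 | 4 |
| Bumi Manusia | 5 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | ... | 4 | 4 | 0 | 0 | 0 | 0 | 5 | 5 | 0 | 0 |
| Captain Marvel | 4 | 4 | 0 | 5 | 4 | 4 | 0 | 0 | 5 | 4 | ... | 3 | 5 | 0 | 5 | 2 | 5 | 0 | 4 | 3 | 2 |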
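
`pd.DataFrame` turns a dictionary of dictionaries into a table whose columns are the outer keys (viewers) and whose index is the inner keys (films), which is the layout shown above. The sketch below uses a made-up two-viewer dictionary; `toy_dataset` and its ratings are hypothetical, and the real `dataset` is only assumed to have the same person -> film -> rating shape.

```python
import pandas as pd

# Hypothetical ratings: outer key = viewer, inner key = film, 0 = not watched
toy_dataset = {
    "viewer_a": {"Aladdin": 4, "Gundala": 0},
    "viewer_b": {"Aladdin": 5, "Gundala": 3},
}

toy_df = pd.DataFrame(toy_dataset)
print(toy_df)

# Films that viewer_a has not watched yet (rating 0)
unwatched = toy_df.index[toy_df["viewer_a"] == 0].tolist()
print(unwatched)
```

Selecting a column such as `toy_df["viewer_a"]` gives one viewer's ratings indexed by film, which is how the functions below separate watched from unwatched films.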


## Hybrid_Filtering Function (Content Based + Collaborative)
In this Hybrid Filtering method, <b>I use the content-based algorithm first</b> and then collaborative filtering.
The content-based algorithm first selects candidate films based on the films the viewer has already watched. The collaborative filtering algorithm then ranks films by other viewers' ratings, but only the films that passed the content-based filter are kept.

### Content_Based Function (Modify)
I change the return value of this function because we only need the list of films recommended by this algorithm.

The function takes 2 parameters:
1. person : name of the person
2. min_content_score : cut-off for the content-based score, from 0 to 1. A film whose score is below min_content_score is not recommended.
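
To make the scoring concrete, the sketch below computes TF-IDF cosine similarities for the five descriptions shown earlier and applies a cut-off. The viewing history in `watched` and the cut-off value are hypothetical, chosen only for illustration; the vectorizer settings mirror the ones used in this notebook except for `min_df`, which is left at its default.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

films = pd.DataFrame({
    "Film": ["Ada Apa dengan Cinta 2", "Aladdin", "Avengers: End Game",
             "Bumi Manusia", "Captain Marvel"],
    "Description": ["Budi Tom Romance ID", "Budi Cruise Romance US",
                    "Andi Michael Action US", "Charles Tom Drama ID",
                    "Andi Cruise Action US"],
})

# TF-IDF rows are L2-normalised by default, so the dot product is the cosine similarity
tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3),
                        stop_words="english").fit_transform(films["Description"])
sim = pd.DataFrame(linear_kernel(tfidf, tfidf),
                   index=films["Film"], columns=films["Film"])

watched = ["Aladdin", "Captain Marvel"]            # hypothetical viewing history
not_watched = [f for f in films["Film"] if f not in watched]

# Score of an unwatched film = mean similarity to the watched films
scores = sim.loc[watched, not_watched].mean().sort_values(ascending=False)

min_content_score = 0.1                            # hypothetical cut-off
recommended = scores[scores >= min_content_score]
print(recommended)
```

Raising `min_content_score` shrinks `recommended`, which is the same lever the hybrid function uses later.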

In [9]:
def content_based(person,min_content_score):

    # Split the films into those the person has watched and those not yet watched
    not_watch=[]
    watch=[]
    for film,rating in df1[person].items():
        if rating==0:                     # a rating of 0 means the film has not been watched
            not_watch.append(film)
        else:
            watch.append(film)

    # Turn every description into a TF-IDF vector over word 1- to 3-grams.
    # TF-IDF vectors are L2-normalised by default, so linear_kernel (a dot product)
    # gives the cosine similarity between every pair of films
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=1,        # keeps every term; newer scikit-learn releases reject an integer 0
                         stop_words='english')
    tfidf_matrix=tf.fit_transform(df['Description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    # Make a DataFrame from cosine_similarities, labelled by film on both axes
    list_film=pd.DataFrame(cosine_similarities,index=df['Film'],columns=df['Film'])

    # The score of an unwatched film is its mean cosine similarity to the watched films
    # Keep only unwatched films whose score is at least min_content_score
    final=pd.DataFrame(list_film.loc[watch,not_watch].mean().sort_values(ascending=False),columns=['Score'])
    final=final[final.Score>=min_content_score]

    # Modify
    # We only need the list of recommended films, which is the index of final
    return final.index

### Person_Correlation Function
This function computes the Pearson correlation between two viewers. It is used inside the user_recommendations function (collaborative filtering).

A correlation close to 1 means the two viewers rate films similarly; a correlation close to -1 means their ratings are opposite.
The function takes 2 parameters:
1. person1 : name of person 1
2. person2 : name of person 2

In [10]:
def person_correlation(person1, person2):

    # Collect the items that both persons have an entry for
    both_rated = {}
    for item in dataset[person1]:
        if item in dataset[person2]:
            both_rated[item] = 1

    number_of_ratings = len(both_rated)

    # Checking for ratings in common
    if number_of_ratings == 0:
        return 0

    # Add up all the preferences of each user
    person1_preferences_sum = sum([dataset[person1][item] for item in both_rated])
    person2_preferences_sum = sum([dataset[person2][item] for item in both_rated])

    # Sum up the squares of preferences of each user
    person1_square_preferences_sum = sum([pow(dataset[person1][item],2) for item in both_rated])
    person2_square_preferences_sum = sum([pow(dataset[person2][item],2) for item in both_rated])

    # Sum up the product value of both preferences for each item
    product_sum_of_both_users = sum([dataset[person1][item] * dataset[person2][item] for item in both_rated])

    # Calculate the pearson score
    numerator_value = product_sum_of_both_users - (person1_preferences_sum*person2_preferences_sum/number_of_ratings)
    denominator_value = sqrt((person1_square_preferences_sum - pow(person1_preferences_sum,2)/number_of_ratings) * (person2_square_preferences_sum -pow(person2_preferences_sum,2)/number_of_ratings))

    if denominator_value == 0:
        return 0
    else:
        r = numerator_value / denominator_value
        return r
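
The score computed here is the Pearson correlation coefficient in its sum-based form. The sketch below applies the same formula to two plain lists instead of `dataset`, so it can be checked on small inputs; `pearson` is an illustrative helper, not one of the notebook's functions, and the ratings are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long rating lists (0 if undefined)."""
    n = len(x)
    if n == 0:
        return 0
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)
    sum_y2 = sum(v * v for v in y)
    sum_xy = sum(a * b for a, b in zip(x, y))

    numerator = sum_xy - sum_x * sum_y / n
    denominator = sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
    return 0 if denominator == 0 else numerator / denominator


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # ratings move together
print(pearson([1, 2, 3, 4], [4, 3, 2, 1]))   # ratings move in opposite directions
print(pearson([3, 3, 3, 3], [1, 2, 3, 4]))   # no variation, so the guard returns 0
```

The guard on a zero denominator matters because a viewer who gives every film the same score has no variance, and the correlation is then undefined.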

### User_Recommendations Function (Modify) (Collaborative Filtering)
This function gains one parameter, min_content_score, because it now calls the content_based function to get the list of films that content-based filtering recommends. The last part of the code, which selects the recommended films, is modified accordingly.

The function takes 2 parameters:
1. person : name of the person
2. min_content_score : cut-off for the content-based score, from 0 to 1. A film whose score is below min_content_score is not recommended.
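
The collaborative step predicts a score for each unseen film as a similarity-weighted average of other viewers' ratings. The sketch below shows that calculation for a single film; the similarities and ratings in `neighbours` are made up for illustration.

```python
# (similarity to the target viewer, that viewer's rating of one film)
neighbours = [(0.9, 5), (0.5, 3), (-0.2, 1)]

total = 0.0
sim_sum = 0.0
for sim, rating in neighbours:
    if sim <= 0:          # viewers with zero or negative correlation are ignored
        continue
    total += sim * rating
    sim_sum += sim

# Dividing by the sum of similarities keeps the prediction on the rating scale
predicted = total / sim_sum
print(round(predicted, 3))
```

The most similar viewer pulls the prediction towards their own rating, and the viewer with a negative correlation does not contribute at all.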

In [11]:
def user_recommendations(person,min_content_score):

    # Gets recommendations for a person by using a weighted average of every other user's ratings
    totals = {}
    simSums = {}
    for other in dataset:
        # don't compare me to myself
        if other == person:
            continue
        sim = person_correlation(person,other)

        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # only score films I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:

                # similarity * score
                totals.setdefault(item,0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item,0)
                simSums[item] += sim

    # Create the normalized list, best score first
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort()
    rankings.reverse()

    # rankings is the final result if we only use collaborative filtering,
    # but we want to filter it with content-based filtering

    # Modify this part
    # Keep only the films that pass the content_based filter
    # content_based is called once, outside the loop, so TF-IDF is not recomputed for every film
    content_films = content_based(person,min_content_score)
    new_rankings = []
    for film in rankings:
        if film[1] in content_films:
            new_rankings.append(film)

    # returns the recommended items
    recommendations_list = [recommend_item for score,recommend_item in new_rankings]
    final=pd.DataFrame(recommendations_list,columns=['Film'],index=list(range(1,len(recommendations_list)+1)))

    return print('Top Recommendation Film Based on Hybrid Filtering :'),print(final)

In [12]:
def hybrid_filtering(person,min_content_score):
    # user_recommendations already applies the content_based filter internally
    return user_recommendations(person,min_content_score)


## Result
### min_content_score = 0

In [13]:
hybrid_filtering('Indra Junior',0)

Top Recommendation Film Based on Hybrid Filtering :
           Film
1       Gundala
2    Dilan 1991
3  Bumi Manusia


(None, None)

### min_content_score = 0.11

In [14]:
hybrid_filtering('Indra Junior',0.11)

Top Recommendation Film Based on Hybrid Filtering :
           Film
1    Dilan 1991
2  Bumi Manusia


(None, None)

### min_content_score = 0.5

In [15]:
hybrid_filtering('Indra Junior',0.5)

Top Recommendation Film Based on Hybrid Filtering :
Empty DataFrame
Columns: [Film]
Index: []


(None, None)

### content_based filtering without modification

In [16]:
def content_based_without_modification(person,min_content_score):

    # Split the films into those the person has watched and those not yet watched
    not_watch=[]
    watch=[]
    for film,rating in df1[person].items():
        if rating==0:                     # a rating of 0 means the film has not been watched
            not_watch.append(film)
        else:
            watch.append(film)

    # Turn every description into a TF-IDF vector over word 1- to 3-grams.
    # TF-IDF vectors are L2-normalised by default, so linear_kernel (a dot product)
    # gives the cosine similarity between every pair of films
    tf = TfidfVectorizer(analyzer='word',
                         ngram_range=(1, 3),
                         min_df=1,        # keeps every term; newer scikit-learn releases reject an integer 0
                         stop_words='english')
    tfidf_matrix=tf.fit_transform(df['Description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

    # Make a DataFrame from cosine_similarities, labelled by film on both axes
    list_film=pd.DataFrame(cosine_similarities,index=df['Film'],columns=df['Film'])

    # The score of an unwatched film is its mean cosine similarity to the watched films
    # Keep only unwatched films whose score is at least min_content_score
    final=pd.DataFrame(list_film.loc[watch,not_watch].mean().sort_values(ascending=False),columns=['Score'])
    final=final[final.Score>=min_content_score]

    return print ('Recommendation Film Based on Content Based Filtering :'),print(final)

In [17]:
content_based_without_modification('Indra Junior',0)

Recommendation Film Based on Content Based Filtering :
                 Score
Film                  
Dilan 1991    0.204236
Bumi Manusia  0.195649
Gundala       0.076412


(None, None)

###  user recommendation without modification (collaborative filtering)

In [22]:
def user_recommendations_without_modification(person):

    # Gets recommendations for a person by using a weighted average of every other user's ratings
    totals = {}
    simSums = {}
    for other in dataset:
        # don't compare me to myself
        if other == person:
            continue
        sim = person_correlation(person,other)

        # ignore scores of zero or lower
        if sim <= 0:
            continue
        for item in dataset[other]:

            # only score films I haven't seen yet
            if item not in dataset[person] or dataset[person][item] == 0:

                # similarity * score
                totals.setdefault(item,0)
                totals[item] += dataset[other][item] * sim
                # sum of similarities
                simSums.setdefault(item,0)
                simSums[item] += sim

    # Create the normalized list, best score first
    rankings = [(total/simSums[item],item) for item,total in totals.items()]
    rankings.sort()
    rankings.reverse()

    # returns the recommended items
    recommendations_list = [recommend_item for score,recommend_item in rankings]
    final=pd.DataFrame(recommendations_list,columns=['Film'],index=list(range(1,len(recommendations_list)+1)))
    return print('Top Recommendation Film Based on User Recommendations (collaborative filtering) : '),print(final)

In [23]:
user_recommendations_without_modification('Indra Junior')

Top Recommendation Film Based on User Recommendations (collaborative filtering) : 
           Film
1       Gundala
2    Dilan 1991
3  Bumi Manusia


(None, None)

## Conclusion
As a reminder: the content-based score ranges from 0 to 1. The closer the score is to 1, the more strongly the film is recommended.

* With min_content_score = 0 (no cut-off), hybrid filtering gives the same result as collaborative filtering, because content-based filtering does not remove any film (every film's content-based score is at least 0).
* With min_content_score = 0.11, hybrid filtering recommends only Dilan 1991 and Bumi Manusia, even though Gundala is the top recommendation under collaborative filtering alone. This is because the content-based function has already filtered out Gundala, whose content-based score (0.076) is below 0.11.
* With min_content_score = 0.5, hybrid filtering recommends no film, because every film's content-based score is below 0.5.
* So, with a cut-off of 0.5, hybrid filtering recommends none of Gundala, Dilan 1991 and Bumi Manusia to Indra Junior.
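
The three cases can be replayed from the numbers reported in this notebook. The sketch below hard-codes the content-based scores and the collaborative ranking for Indra Junior from the outputs above; the helper `hybrid` is illustrative and only mimics the filtering step, not the full functions.

```python
# Content-based scores reported above for 'Indra Junior'
content_scores = {"Dilan 1991": 0.204236, "Bumi Manusia": 0.195649, "Gundala": 0.076412}

# Collaborative filtering ranking reported above, best first
collaborative_order = ["Gundala", "Dilan 1991", "Bumi Manusia"]


def hybrid(min_content_score):
    """Keep the collaborative order, dropping films below the content cut-off."""
    allowed = {film for film, score in content_scores.items() if score >= min_content_score}
    return [film for film in collaborative_order if film in allowed]


for cutoff in (0, 0.11, 0.5):
    print(cutoff, hybrid(cutoff))
```

The ordering always comes from collaborative filtering; the content-based score only decides which films are allowed into the list.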