## Build a Recommender System with Similarity Functions

This notebook uses `CountVectorizer` and `TfidfVectorizer`.

In [1]:
import pandas as pd
import numpy as np

In [2]:
movie_rating_df = pd.read_csv('movie_rating_df.csv')
name_df = pd.read_csv('actor_name.csv')
director_writers = pd.read_csv('directors_writers.csv')

In [3]:
movie_rating_df.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894.0,,1.0,"Documentary,Short",5.6,1608
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892.0,,5.0,"Animation,Short",6.0,197
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892.0,,4.0,"Animation,Comedy,Romance",6.5,1285
3,tt0000004,short,Un bon bock,Un bon bock,0,1892.0,,12.0,"Animation,Short",6.1,121
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893.0,,1.0,"Comedy,Short",6.1,2050


In [4]:
movie_rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 751614 entries, 0 to 751613
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   tconst          751614 non-null  object 
 1   titleType       751614 non-null  object 
 2   primaryTitle    751614 non-null  object 
 3   originalTitle   751614 non-null  object 
 4   isAdult         751614 non-null  int64  
 5   startYear       751614 non-null  float64
 6   endYear         16072 non-null   float64
 7   runtimeMinutes  751614 non-null  float64
 8   genres          486766 non-null  object 
 9   averageRating   751614 non-null  float64
 10  numVotes        751614 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 63.1+ MB


In [5]:
name_df.head()

Unnamed: 0,nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
0,nm1774132,Nathan McLaughlin,1973,\N,"special_effects,make_up_department","tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,\N,\N,actor,tt7718088
2,nm1021485,Brandon Fransvaag,\N,\N,miscellaneous,tt0168790
3,nm6940929,Erwin van der Lely,\N,\N,miscellaneous,tt4232168
4,nm5764974,Svetlana Shypitsyna,\N,\N,actress,tt3014168


In [6]:
name_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   nconst             1000 non-null   object
 1   primaryName        1000 non-null   object
 2   birthYear          1000 non-null   object
 3   deathYear          1000 non-null   object
 4   primaryProfession  891 non-null    object
 5   knownForTitles     1000 non-null   object
dtypes: object(6)
memory usage: 47.0+ KB


In [7]:
director_writers.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,David Kirkland,"John Emerson,Anita Loos"
1,tt0011890,Roy William Neill,"Arthur F. Goodrich,Burns Mantle,Mary Murillo"
2,tt0014341,"Buster Keaton,John G. Blystone","Jean C. Havez,Clyde Bruckman,Joseph A. Mitchell"
3,tt0018054,Cecil B. DeMille,Jeanie Macpherson
4,tt0024151,James Cruze,"Max Miller,Wells Root,Jack Jevne"


In [8]:
director_writers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 986 entries, 0 to 985
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   tconst         986 non-null    object
 1   director_name  986 non-null    object
 2   writer_name    986 non-null    object
dtypes: object(3)
memory usage: 23.2+ KB


### Data Preparation

#### DATAFRAME director_writers

**Convert `director_name` and `writer_name` into lists**

In [9]:
director_writers['director_name'] = director_writers['director_name'].apply(lambda row: row.split(','))
director_writers['writer_name'] = director_writers['writer_name'].apply(lambda row: row.split(','))

director_writers.head()

Unnamed: 0,tconst,director_name,writer_name
0,tt0011414,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


#### DATAFRAME name_df

**Select the columns that will be used: `nconst`, `primaryName`, and `knownForTitles`**

In [10]:
name_df = name_df[['nconst','primaryName','knownForTitles']]

name_df.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"tt0417686,tt1713976,tt1891860,tt0454839"
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168


**Check how many elements** the list in each row of `knownForTitles` can contain

In [11]:
print(name_df['knownForTitles'].apply(lambda x: len(x.split(','))).unique())

[4 1 2 3]


**Convert `knownForTitles` into a list of lists**

In [12]:
name_df['knownForTitles'] = name_df['knownForTitles'].apply(lambda x: x.split(','))

In [13]:
name_df.head()

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,"[tt0417686, tt1713976, tt1891860, tt0454839]"
1,nm10683464,Bridge Andrew,[tt7718088]
2,nm1021485,Brandon Fransvaag,[tt0168790]
3,nm6940929,Erwin van der Lely,[tt4232168]
4,nm5764974,Svetlana Shypitsyna,[tt3014168]


**Expand the lists in the column so that each element gets its own row**

In [14]:
import numpy as np
#prepare a bucket for the dataframes
df_uni = []

for x in ['knownForTitles']:
    #repeat each row's index once per element in its knownForTitles list
    idx = name_df.index.repeat(name_df['knownForTitles'].str.len())

    #flatten the lists from every row into a single column and wrap it in a dataframe
    df1 = pd.DataFrame({
        x: np.concatenate(name_df[x].values)
    })

    #replace the dataframe's index with the idx defined above
    df1.index = idx
    #append each resulting dataframe to the bucket
    df_uni.append(df1)

In [15]:
#combine all dataframes into one
df_concat = pd.concat(df_uni, axis=1)

In [16]:
df_concat

Unnamed: 0,knownForTitles
0,tt0417686
0,tt1713976
0,tt1891860
0,tt0454839
1,tt7718088
...,...
998,tt1464058
999,tt0436869
999,tt0476663
999,tt0109723


In [17]:
#left join with the remaining columns of the original dataframe
unnested_df = df_concat.join(name_df.drop(['knownForTitles'], axis=1), how='left')

In [18]:
#select the columns in the same order as the original dataframe
unnested_df = unnested_df[name_df.columns.tolist()]
unnested_df.head(10)

Unnamed: 0,nconst,primaryName,knownForTitles
0,nm1774132,Nathan McLaughlin,tt0417686
0,nm1774132,Nathan McLaughlin,tt1713976
0,nm1774132,Nathan McLaughlin,tt1891860
0,nm1774132,Nathan McLaughlin,tt0454839
1,nm10683464,Bridge Andrew,tt7718088
2,nm1021485,Brandon Fransvaag,tt0168790
3,nm6940929,Erwin van der Lely,tt4232168
4,nm5764974,Svetlana Shypitsyna,tt3014168
5,nm8621807,Utku Arslan,tt5493404
5,nm8621807,Utku Arslan,tt7661932
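
The same unnesting can probably be done in one step with `DataFrame.explode` (available since pandas 0.25). The following is a minimal sketch on a toy frame, not the code used in this notebook; the names `toy_df` and `toy_unnested` are illustrative only.

```python
import pandas as pd

# Toy data shaped like name_df after knownForTitles has been split into lists
toy_df = pd.DataFrame({
    'nconst': ['nm1', 'nm2'],
    'primaryName': ['Person A', 'Person B'],
    'knownForTitles': [['tt1', 'tt2'], ['tt3']],
})

# explode() emits one row per list element and repeats the original index,
# which is what the index.repeat + np.concatenate loop does by hand
toy_unnested = toy_df.explode('knownForTitles')

print(toy_unnested)
```

The repeated index values (0, 0, 1) match the pattern in the `unnested_df` output, so the later `groupby` step should work unchanged on a frame built this way.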


In [19]:
unnested_drop = unnested_df.drop(['nconst'], axis=1)

#prepare a bucket for the dataframes
df_uni = []

for col in ['primaryName']:
    #aggregate primaryName into a list for each knownForTitles value
    dfi = unnested_drop.groupby(['knownForTitles'])[col].apply(list)
    #append the result to the bucket
    df_uni.append(dfi)
df_grouped = pd.concat(df_uni, axis=1).reset_index()
df_grouped.columns = ['knownForTitles','cast_name']
df_grouped

Unnamed: 0,knownForTitles,cast_name
0,tt0008125,[Charles Harley]
1,tt0009706,[Charles Harley]
2,tt0010304,[Natalie Talmadge]
3,tt0011414,[Natalie Talmadge]
4,tt0011890,[Natalie Talmadge]
...,...,...
1893,tt9610496,[Stefano Baffetti]
1894,tt9714030,[Kevin Kain]
1895,tt9741820,[Caroline Plyler]
1896,tt9759814,[Ethan Francis]


### Merging the dataframes

In [20]:
base_df = pd.merge(df_grouped, movie_rating_df, left_on='knownForTitles', right_on='tconst', how='inner')

In [21]:
base_df = pd.merge(base_df, director_writers, left_on='tconst', right_on='tconst', how='left')

In [22]:
base_df.head()

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
1,tt0011890,[Natalie Talmadge],tt0011890,movie,Yes or No,Yes or No,0,1920.0,,72.0,,6.3,7,[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,tt0014341,[Natalie Talmadge],tt0014341,movie,Our Hospitality,Our Hospitality,0,1923.0,,65.0,"Comedy,Romance,Thriller",7.8,9621,"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,tt0018054,[Reeka Roberts],tt0018054,movie,The King of Kings,The King of Kings,0,1927.0,,155.0,"Biography,Drama,History",7.3,1826,[Cecil B. DeMille],[Jeanie Macpherson]
4,tt0024151,[James Hackett],tt0024151,movie,I Cover the Waterfront,I Cover the Waterfront,0,1933.0,,80.0,"Drama,Romance",6.3,455,[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"
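
The two merges above use different join types. The `inner` join keeps only titles present in both frames, while the `left` join keeps every row of `base_df` and leaves the director and writer columns empty (NaN) where no match exists. The sketch below illustrates the difference on toy frames; `left`, `right`, `inner`, and `left_join` are illustrative names, not objects from this notebook.

```python
import pandas as pd

# Toy frames: 'tt3' has cast information but no director entry
left = pd.DataFrame({'tconst': ['tt1', 'tt2', 'tt3'],
                     'cast_name': [['A'], ['B'], ['C']]})
right = pd.DataFrame({'tconst': ['tt1', 'tt2'],
                      'director_name': [['D1'], ['D2']]})

# inner: only keys found in both frames survive
inner = pd.merge(left, right, on='tconst', how='inner')

# left: every row of the left frame survives; unmatched rows get NaN
left_join = pd.merge(left, right, on='tconst', how='left')

print(len(inner), len(left_join))
print(left_join['director_name'].isnull().sum())
```

This is why NULL handling for `director_name` and `writer_name` is needed after a left join.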


## Data Cleaning

**Drop the `knownForTitles` column**

In [23]:
base_drop = base_df.drop(['knownForTitles'], axis=1)
base_drop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1060 entries, 0 to 1059
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   cast_name       1060 non-null   object 
 1   tconst          1060 non-null   object 
 2   titleType       1060 non-null   object 
 3   primaryTitle    1060 non-null   object 
 4   originalTitle   1060 non-null   object 
 5   isAdult         1060 non-null   int64  
 6   startYear       1060 non-null   float64
 7   endYear         110 non-null    float64
 8   runtimeMinutes  1060 non-null   float64
 9   genres          745 non-null    object 
 10  averageRating   1060 non-null   float64
 11  numVotes        1060 non-null   int64  
 12  director_name   986 non-null    object 
 13  writer_name     986 non-null    object 
dtypes: float64(4), int64(2), object(8)
memory usage: 124.2+ KB


**Replace NULL values with 'Unknown'**

In [24]:
# Replace NULL values in the genres column with 'Unknown'
base_drop['genres'] = base_drop['genres'].fillna('Unknown')

In [25]:
# Replace NULL values in the director_name and writer_name columns with 'Unknown'
base_drop[['director_name','writer_name']] = base_drop[['director_name','writer_name']].fillna('Unknown')

**Because the `genres` column can hold multiple values per row, convert it into a list of lists**

In [26]:
base_drop['genres'] = base_drop['genres'].apply(lambda x: x.split(','))

In [27]:
#Count the NULL values in each column
base_drop.isnull().sum()

cast_name           0
tconst              0
titleType           0
primaryTitle        0
originalTitle       0
isAdult             0
startYear           0
endYear           950
runtimeMinutes      0
genres              0
averageRating       0
numVotes            0
director_name       0
writer_name         0
dtype: int64

### Reformat Tabel

In [28]:
#Drop the columns tconst, isAdult, endYear, originalTitle
base_drop2 = base_drop.drop(['tconst','isAdult','endYear','originalTitle'], axis=1)

In [29]:
#Reorder the columns
base_drop2 = base_drop2[['primaryTitle','titleType','startYear','runtimeMinutes','genres',
                         'averageRating','numVotes','cast_name','director_name','writer_name']]

In [30]:
base_drop2.head()

Unnamed: 0,primaryTitle,titleType,startYear,runtimeMinutes,genres,averageRating,numVotes,cast_name,director_name,writer_name
0,The Love Expert,movie,1920.0,60.0,"[Comedy, Romance]",4.9,136,[Natalie Talmadge],[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,movie,1920.0,72.0,[Unknown],6.3,7,[Natalie Talmadge],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,movie,1923.0,65.0,"[Comedy, Romance, Thriller]",7.8,9621,[Natalie Talmadge],"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,movie,1927.0,155.0,"[Biography, Drama, History]",7.3,1826,[Reeka Roberts],[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,movie,1933.0,80.0,"[Drama, Romance]",6.3,455,[James Hackett],[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


**Rename columns:**
- primaryTitle -> title
- titleType -> type
- startYear -> start
- runtimeMinutes -> duration
- averageRating -> rating
- numVotes -> votes

In [31]:
base_drop2.columns = ['title','type','start','duration',
                      'genres','rating','votes','cast_name','director_name','writer_name']
base_drop2.head()

Unnamed: 0,title,type,start,duration,genres,rating,votes,cast_name,director_name,writer_name
0,The Love Expert,movie,1920.0,60.0,"[Comedy, Romance]",4.9,136,[Natalie Talmadge],[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,movie,1920.0,72.0,[Unknown],6.3,7,[Natalie Talmadge],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,movie,1923.0,65.0,"[Comedy, Romance, Thriller]",7.8,9621,[Natalie Talmadge],"[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,movie,1927.0,155.0,"[Biography, Drama, History]",7.3,1826,[Reeka Roberts],[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,movie,1933.0,80.0,"[Drama, Romance]",6.3,455,[James Hackett],[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


## Creating Content-based Recommender System

#### Select the features used to measure similarity in the recommender system
In this case, the selected features are title, cast_name, genres, director_name, and writer_name.

In [32]:
feature_df = base_drop2[['title','cast_name','genres','director_name','writer_name']]

#Show the top 5 rows
feature_df.head()

Unnamed: 0,title,cast_name,genres,director_name,writer_name
0,The Love Expert,[Natalie Talmadge],"[Comedy, Romance]",[David Kirkland],"[John Emerson, Anita Loos]"
1,Yes or No,[Natalie Talmadge],[Unknown],[Roy William Neill],"[Arthur F. Goodrich, Burns Mantle, Mary Murillo]"
2,Our Hospitality,[Natalie Talmadge],"[Comedy, Romance, Thriller]","[Buster Keaton, John G. Blystone]","[Jean C. Havez, Clyde Bruckman, Joseph A. Mitc..."
3,The King of Kings,[Reeka Roberts],"[Biography, Drama, History]",[Cecil B. DeMille],[Jeanie Macpherson]
4,I Cover the Waterfront,[James Hackett],"[Drama, Romance]",[James Cruze],"[Max Miller, Wells Root, Jack Jevne]"


#### Create a function that strips spaces from every element in every row

In [33]:
def sanitize(x):
    try:
        #if the cell holds a list
        if isinstance(x,list):
            return [i.replace(' ','').lower() for i in x]
        #if the cell holds a string
        else:
            return [x.replace(' ','').lower()]
    except AttributeError:
        #non-string values (e.g. NaN) have no .replace; print them for inspection
        print(x)

In [34]:
#Columns: cast_name, genres, writer_name, director_name
feature_cols = ['cast_name','genres','writer_name','director_name']

#Apply the sanitize function
for col in feature_cols:
    feature_df[col] = feature_df[col].apply(sanitize)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  feature_df[col] = feature_df[col].apply(sanitize)


In [35]:
feature_df.head()

Unnamed: 0,title,cast_name,genres,director_name,writer_name
0,The Love Expert,[natalietalmadge],"[comedy, romance]",[davidkirkland],"[johnemerson, anitaloos]"
1,Yes or No,[natalietalmadge],[unknown],[roywilliamneill],"[arthurf.goodrich, burnsmantle, marymurillo]"
2,Our Hospitality,[natalietalmadge],"[comedy, romance, thriller]","[busterkeaton, johng.blystone]","[jeanc.havez, clydebruckman, josepha.mitchell]"
3,The King of Kings,[reekaroberts],"[biography, drama, history]",[cecilb.demille],[jeaniemacpherson]
4,I Cover the Waterfront,[jameshackett],"[drama, romance]",[jamescruze],"[maxmiller, wellsroot, jackjevne]"
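
A likely reason for removing spaces is that the vectorizer used later splits text into word tokens: without sanitizing, "David Kirkland" and "David Butler" would share the token `david` and look partly similar. The sketch below uses `CountVectorizer.build_analyzer()` to show how the default tokenizer treats a few strings. It is an illustration under default settings, not part of the original pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The analyzer applies the same lowercasing and tokenization that fit_transform uses
analyzer = CountVectorizer().build_analyzer()

print(analyzer('David Kirkland'))   # name with a space
print(analyzer('davidkirkland'))    # sanitized name
print(analyzer('cecilb.demille'))   # sanitized name that still contains a period
```

Note that the default token pattern also splits on punctuation, so sanitized names that contain a period (such as `cecilb.demille`) still end up as two tokens. Removing periods in `sanitize` would be one way to avoid this, if that behavior matters.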


**Create a function that builds the soup column (joining all features of a row into a single string)**

In [36]:
#columns used: cast_name, genres, director_name, writer_name
def soup_feature(x):
    return ' '.join(x['cast_name']) + ' ' + ' '.join(x['genres']) + ' ' + ' '.join(x['director_name']) + ' ' + ' '.join(x['writer_name'])    

In [37]:
#store the soup in a single column
feature_df['soup'] = feature_df.apply(soup_feature, axis=1)

feature_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  feature_df['soup'] = feature_df.apply(soup_feature, axis=1)


Unnamed: 0,title,cast_name,genres,director_name,writer_name,soup
0,The Love Expert,[natalietalmadge],"[comedy, romance]",[davidkirkland],"[johnemerson, anitaloos]",natalietalmadge comedy romance davidkirkland j...
1,Yes or No,[natalietalmadge],[unknown],[roywilliamneill],"[arthurf.goodrich, burnsmantle, marymurillo]",natalietalmadge unknown roywilliamneill arthur...
2,Our Hospitality,[natalietalmadge],"[comedy, romance, thriller]","[busterkeaton, johng.blystone]","[jeanc.havez, clydebruckman, josepha.mitchell]",natalietalmadge comedy romance thriller buster...
3,The King of Kings,[reekaroberts],"[biography, drama, history]",[cecilb.demille],[jeaniemacpherson],reekaroberts biography drama history cecilb.de...
4,I Cover the Waterfront,[jameshackett],"[drama, romance]",[jamescruze],"[maxmiller, wellsroot, jackjevne]",jameshackett drama romance jamescruze maxmille...


## Similarity Function with CountVectorizer (stop_words = english)

**CountVectorizer** is the simplest type of vectorizer.
As an example, imagine there are 3 texts A, B, and C:

- A: The Sun is a star
- B: My Love is like a red, red rose
- C: Mary had a little lamb

We now need to convert these texts into vectors using CountVectorizer. The first step is to build the vocabulary, which is the set of unique words across all texts. For these three texts the vocabulary is: the, sun, is, a, star, my, love, like, red, rose, mary, had, little, lamb. In total, the vocabulary size is 14.

However, we usually exclude (English) stop words such as as, is, a, the, and so on, because they are too common to be informative.

After removing the stop words (here: the, is, a, my, had), the clean vocabulary, sorted alphabetically, is lamb, like, little, love, mary, red, rose, star, sun. CountVectorizer then produces the following vectors:


- A : (0,0,0,0,0,0,0,1,1), made up of star:1, sun:1
- B : (0,1,0,1,0,2,1,0,0), made up of like:1, love:1, red:2, rose:1
- C : (1,0,1,0,1,0,0,0,0), made up of lamb:1, little:1, mary:1
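
The three-text example can be reproduced with a short sketch. It passes an explicit stop-word list (an assumption made here so the result does not depend on scikit-learn's built-in English list), and it relies on the fact that scikit-learn orders the vocabulary columns alphabetically: lamb, like, little, love, mary, red, rose, star, sun. The variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

texts = [
    'The Sun is a star',
    'My Love is like a red, red rose',
    'Mary had a little lamb',
]

# Explicit stop words for this sketch; the single-letter word 'a' is already
# dropped by the default token pattern, which keeps tokens of 2+ characters
vectorizer = CountVectorizer(stop_words=['the', 'is', 'my', 'had'])
matrix = vectorizer.fit_transform(texts)

# vocabulary_ maps each term to its column index; sort terms by that index
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)
print(matrix.toarray())
```

Each row of the printed array is the count vector of one text, with one column per vocabulary term.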

In [38]:
#import CountVectorizer 
from sklearn.feature_extraction.text import CountVectorizer

#define a CountVectorizer and transform the soup column into vectors
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])

print(count)
print(count_matrix.shape)

CountVectorizer(stop_words='english')
(1060, 10026)


**Build the similarity model from the count matrix**

Compute the cosine similarity score for every pair of titles (over all possible pairs). In other words, we build a 1060 x 1060 matrix in which the cell at row i and column j holds the similarity score between titles i and j. The matrix is symmetric and every element on the diagonal is 1, because that is each title's similarity score with itself.

 
Cosine Similarity
In this section, we use the cosine similarity formula to build the model. The cosine score is useful and cheap to compute.

The formula for the cosine similarity between 2 texts is as follows:

<img src="https://render.githubusercontent.com/render/math?math=cosine%28x%2Cy%29%3D%5Cfrac%7Bx.y%5ET%7D%7B%7C%7Cx%7C%7C.%7C%7Cy%7C%7C%7D&mode=inline" width="200" height="300">

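In general the output ranges from -1 to 1. A score close to 1 means the two entities are very similar, while a score close to -1 means they point in opposite directions. Because count and TF-IDF vectors contain no negative values, the scores in this notebook fall between 0 and 1, where 0 means the two titles share no terms.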
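
To make the formula concrete, here is a minimal NumPy sketch that computes the cosine similarity of small hand-written vectors. The helper `cosine` and the vectors `a`, `b`, `c` are illustrative assumptions; the notebook itself uses scikit-learn's `cosine_similarity`.

```python
import numpy as np

def cosine(x, y):
    """Cosine similarity between two 1-D vectors: x.y / (||x|| * ||y||)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 0.0])
c = np.array([0.0, 0.0, 2.0])

print(cosine(a, a))  # a vector compared with itself
print(cosine(a, b))  # vectors that share one non-zero dimension
print(cosine(a, c))  # vectors with no shared non-zero dimension
```

The score depends only on the angle between the vectors, not on their length, which is why `c` having a value of 2 rather than 1 makes no difference.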

In [41]:
#Import cosine_similarity
from sklearn.metrics.pairwise import cosine_similarity

#Compute the cosine similarity of count_matrix with itself
cosine_sim = cosine_similarity(count_matrix, count_matrix)

#print the result
print(cosine_sim)

[[1.         0.15430335 0.35355339 ... 0.         0.         0.13608276]
 [0.15430335 1.         0.10910895 ... 0.         0.         0.        ]
 [0.35355339 0.10910895 1.         ... 0.         0.08703883 0.09622504]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.08703883 ... 0.         1.         0.10050378]
 [0.13608276 0.         0.09622504 ... 0.         0.10050378 1.        ]]


#### Create a function that maps a title to the 10 films with the highest similarity scores

In [42]:
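indices = pd.Series(feature_df.index, index=feature_df['title'])
#keep only the first row per title so that indices[title] always returns a single position
indices = indices[~indices.index.duplicated(keep='first')]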

def content_recommender(title):
    try:
        #get the index of the given film title
        idx = indices[title]

        #turn the row of cosine similarities into a list of (index, score) pairs
        #hint: cosine_sim[idx]
        sim_scores = list(enumerate(cosine_sim[idx]))

        #sort the films from highest to lowest similarity
        #hint: sorted(iter, key, reverse)
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

        #take the 2nd through 11th items
        #skip the first one because it is expected to be the film itself
        sim_scores = sim_scores[1:11]

        #get the indices of the titles that appear in sim_scores
        movie_indices = [i[0] for i in sim_scores]

        #use iloc to look the rows up by position using movie_indices
        return base_df.iloc[movie_indices]
    
    except (RuntimeError, TypeError, NameError, KeyError):
        print("Oops!  That was no valid title. Try again with another title...")
    

In [49]:
content_recommender('Wedding Crashers')

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
344,tt0237123,[Anat Dychtwald],tt0237123,tvSeries,Coupling,Coupling,0,2000.0,2004.0,30.0,"Comedy,Romance",8.5,41571,[Martin Dennis],[Steven Moffat]
1052,tt9124840,[Metin Namlisesli],tt9124840,movie,Organik Ask,Organik Ask,0,2018.0,,100.0,"Comedy,Romance",4.3,89,[Kamil Cetin],[Volkan Girgin]
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
24,tt0043762,[Constance De Mattiazzi],tt0043762,movie,Lullaby of Broadway,Lullaby of Broadway,0,1951.0,,92.0,"Comedy,Musical,Romance",6.8,893,[David Butler],[Earl Baldwin]
398,tt0308670,[Wai Chi Wong],tt0308670,movie,Oi ching bak min bau,Oi ching bak min bau,0,2001.0,,101.0,"Comedy,Romance",6.8,47,[Steven Lo],"[Canny Leung, Chi Shan Leung]"
142,tt0094889,[Harvey J. Alperin],tt0094889,movie,Cocktail,Cocktail,0,1988.0,,104.0,"Comedy,Drama,Romance",5.9,76694,[Roger Donaldson],[Heywood Gould]
325,tt0198284,[Tim Horsely],tt0198284,movie,After Sex,After Sex,0,2000.0,,96.0,"Comedy,Drama,Romance",4.4,753,[Cameron Thor],[Thomas M. Kostigen]
345,tt0237501,[Ngan-Ying Poon],tt0237501,movie,Ninth Happiness,Gau sing bou hei,0,1998.0,,86.0,"Comedy,Musical,Romance",5.9,118,[Clifton Ko],[Raymond To]
410,tt0340109,[Catherine May],tt0340109,movie,Fast Food High,Fast Food High,0,2003.0,,92.0,"Comedy,Drama,Romance",5.2,174,[Nisha Ganatra],"[Tassie Cameron, Jackie May]"
450,tt0409673,[Nicolas Proulx],tt0409673,movie,Les aimants,Les aimants,0,2004.0,,91.0,"Comedy,Romance",6.7,595,[Yves Pelletier],[Yves Pelletier]


## Similarity Function with TfidfVectorizer

In [44]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [45]:
#define a TfidfVectorizer (without stop_words) and transform the soup column into vectors
count_tf = TfidfVectorizer()
count_matrix_tf = count_tf.fit_transform(feature_df['soup'])

count_matrix_tf.shape

(1060, 10028)
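
Unlike raw counts, `TfidfVectorizer` weights each term by its inverse document frequency, so a term that appears in many rows (such as a common genre) contributes less than a rare one (such as a specific director), and by default it scales every row to unit length. The sketch below shows both effects on a toy set of soups; `toy_soups` and the other names are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# 'comedy' appears in every soup; all other terms appear exactly once
toy_soups = [
    'comedy romance davidkirkland',
    'comedy drama jamescruze',
    'comedy thriller busterkeaton',
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_soups)

# Pair each term with its learned IDF weight (terms ordered by column index)
terms = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
idf = dict(zip(terms, tfidf.idf_))
print(idf['comedy'], idf['drama'])

# With the default norm='l2', every row has unit length
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)
```

This down-weighting of frequent terms is one plausible reason the TF-IDF recommendations differ from the CountVectorizer ones for the same title.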

**Build the similarity model from the TF-IDF matrix**

In [46]:
#Compute the cosine similarity of count_matrix_tf with itself
cosine_sim_tf = cosine_similarity(count_matrix_tf, count_matrix_tf)

#print the result
print(cosine_sim_tf)

[[1.         0.16802967 0.18411455 ... 0.         0.         0.02445827]
 [0.16802967 1.         0.11021027 ... 0.         0.         0.        ]
 [0.18411455 0.11021027 1.         ... 0.         0.02388765 0.01604212]
 ...
 [0.         0.         0.         ... 1.         0.         0.        ]
 [0.         0.         0.02388765 ... 0.         1.         0.03610634]
 [0.02445827 0.         0.01604212 ... 0.         0.03610634 1.        ]]


In [47]:
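indices_tf = pd.Series(feature_df.index, index=feature_df['title'])
#keep only the first row per title so that indices_tf[title] always returns a single position
indices_tf = indices_tf[~indices_tf.index.duplicated(keep='first')]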

def content_recommender_tf(title):
    try:
        #get the index of the given film title
        idx = indices_tf[title]

        #turn the row of cosine similarities into a list of (index, score) pairs
        #hint: cosine_sim_tf[idx]
        sim_scores_tf = list(enumerate(cosine_sim_tf[idx]))

        #sort the films from highest to lowest similarity
        #hint: sorted(iter, key, reverse)
        sim_scores_tf = sorted(sim_scores_tf, key=lambda x: x[1], reverse=True)

        #take the 2nd through 11th items
        #skip the first one because it is expected to be the film itself
        sim_scores_tf = sim_scores_tf[1:11]

        #get the indices of the titles that appear in sim_scores_tf
        movie_indices_tf = [i[0] for i in sim_scores_tf]

        #use iloc to look the rows up by position using movie_indices_tf
        return base_df.iloc[movie_indices_tf]
    
    except (RuntimeError, TypeError, NameError, KeyError):
        print("Oops!  That was no valid title. Try again with another title...")
    

In [48]:
content_recommender_tf('Wedding Crashers')

Unnamed: 0,knownForTitles,cast_name,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,director_name,writer_name
468,tt0438315,[Matthew Fuchs],tt0438315,movie,Peaceful Warrior,Peaceful Warrior,0,2006.0,,120.0,"Drama,Romance,Sport",7.3,25778,[Victor Salva],"[Kevin Bernhardt, Dan Millman]"
511,tt0796366,"[Matthew Fuchs, Aida Caefer]",tt0796366,movie,Star Trek,Star Trek,0,2009.0,,127.0,"Action,Adventure,Sci-Fi",7.9,567224,[J.J. Abrams],"[Gene Roddenberry, Roberto Orci, Alex Kurtzman]"
424,tt0371746,[Matthew Fuchs],tt0371746,movie,Iron Man,Iron Man,0,2008.0,,126.0,"Action,Adventure,Sci-Fi",7.9,906367,[Jon Favreau],"[Mark Fergus, Hawk Ostby, Art Marcum, Matt Hol..."
1052,tt9124840,[Metin Namlisesli],tt9124840,movie,Organik Ask,Organik Ask,0,2018.0,,100.0,"Comedy,Romance",4.3,89,[Kamil Cetin],[Volkan Girgin]
344,tt0237123,[Anat Dychtwald],tt0237123,tvSeries,Coupling,Coupling,0,2000.0,2004.0,30.0,"Comedy,Romance",8.5,41571,[Martin Dennis],[Steven Moffat]
24,tt0043762,[Constance De Mattiazzi],tt0043762,movie,Lullaby of Broadway,Lullaby of Broadway,0,1951.0,,92.0,"Comedy,Musical,Romance",6.8,893,[David Butler],[Earl Baldwin]
142,tt0094889,[Harvey J. Alperin],tt0094889,movie,Cocktail,Cocktail,0,1988.0,,104.0,"Comedy,Drama,Romance",5.9,76694,[Roger Donaldson],[Heywood Gould]
0,tt0011414,[Natalie Talmadge],tt0011414,movie,The Love Expert,The Love Expert,0,1920.0,,60.0,"Comedy,Romance",4.9,136,[David Kirkland],"[John Emerson, Anita Loos]"
398,tt0308670,[Wai Chi Wong],tt0308670,movie,Oi ching bak min bau,Oi ching bak min bau,0,2001.0,,101.0,"Comedy,Romance",6.8,47,[Steven Lo],"[Canny Leung, Chi Shan Leung]"
345,tt0237501,[Ngan-Ying Poon],tt0237501,movie,Ninth Happiness,Gau sing bou hei,0,1998.0,,86.0,"Comedy,Musical,Romance",5.9,118,[Clifton Ko],[Raymond To]


In [51]:
content_recommender_tf('Habibie Ainun')

Oops!  That was no valid title. Try again with another title...


That concludes the recommender system. This notebook is a project I completed while taking an online course at DQ Lab Academy.