# Movie Recommendation System: [TMDB Movies](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata)
- Name: Mohammad Valeriant Qumara Tanda
- ID: MC180D5Y0566
- Email: valenttanda@gmail.com

## Import Library
Import the libraries needed for this project:
- `pandas`: data processing.
- `numpy`: numerical operations.
- `matplotlib`: plotting.
- `seaborn`: more polished statistical plots.
- `ast`: parsing string literals (such as the JSON-like lists in `genres`) into Python objects.
- `scikit-learn`: building the recommendation model with TF-IDF and cosine similarity.
- `sentence-transformers`: building the recommendation model with BERT embeddings. (Google Colab is recommended for this library, for faster and more stable computation.)
- `json`: saving data to JSON files.

In [300]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import ast
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

## Load Dataset
The dataset used in this project is a collection of films released from 1916 to 2017, sourced from [Kaggle](https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata).

In [301]:
movie = pd.read_csv("data/tmdb_5000_movies.csv")

Display the movie dataset

In [302]:
movie.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


## EDA
In this stage, exploratory data analysis (EDA) is performed to understand the data. EDA generally aims to understand distributions, relationships between variables, and patterns in the data. For this project, it is limited to inspecting the dataset information, missing values, and duplicated rows, all summarized under **Dataset Information**.

### Dataset information
This section covers information about the dataset used, namely:
- Dataset shape: the number of rows and columns in the dataset
- Dataset info: the data types in the dataset (numeric, categorical, or date), which also indirectly reveals whether any values are missing
- Missing-value check: whether the dataset contains missing values
- Duplicate check: whether the dataset contains duplicated rows
- Dataset statistics: the summary statistics of the dataset

Check the shape of the dataset

In [303]:
print(f"Movie dataframe: {movie.shape}")

Movie dataframe: (4803, 20)


Check the dataset information

In [304]:
movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

From this output, `homepage`, `overview`, `release_date`, and `runtime` have fewer non-null entries than the 4,803 rows, so they contain missing values. Next, the missing values in movie are quantified and the dataset is checked for duplicated rows.

Check missing values in movie

In [305]:
# Count missing values per column and keep only columns that have any
missing_value = movie.isna().sum()
missing_value = missing_value[missing_value > 0].sort_values(ascending=False)
# Express the counts as a percentage of all rows
missing_percentage = (missing_value / len(movie)) * 100
missing_display = pd.DataFrame({
  'Missing Value': missing_value,
  'Missing Percentage': missing_percentage.round(4).astype(str) + '%'
})
missing_display

Unnamed: 0,Missing Value,Missing Percentage
homepage,3091,64.3556%
tagline,844,17.5724%
overview,3,0.0625%
runtime,2,0.0416%
release_date,1,0.0208%


`homepage` has the largest share of missing values, about 64%. Because `homepage` is also not needed for the analysis that follows, the column will be dropped.
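The percentages in the table follow directly from dividing each missing count by the 4,803 rows. A small sketch that reproduces the arithmetic from the counts reported above (the names `missing_counts` and `total_rows` are illustrative, not part of the notebook):

```python
# Missing-value counts and row total as reported in the outputs above.
missing_counts = {
    "homepage": 3091,
    "tagline": 844,
    "overview": 3,
    "runtime": 2,
    "release_date": 1,
}
total_rows = 4803

# Percentage of rows with a missing value, rounded to 4 decimals.
missing_percentage = {
    column: round(count / total_rows * 100, 4)
    for column, count in missing_counts.items()
}

for column, pct in missing_percentage.items():
    print(f"{column}: {pct}%")
```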

Check for duplicated rows

In [306]:
print(f"Duplicated data in movie: {movie.duplicated().sum()}")

Duplicated data in movie: 0


The **movie** dataset contains both `original_title` and `title`. The unique values of each column are checked below.

In [307]:
# Unique values
for col in ['original_title', 'title']:
  print(f"Unique values in movie: column {col}: {movie[col].unique()}\n")

Unique values in movie: column original_title: ['Avatar' "Pirates of the Caribbean: At World's End" 'Spectre' ...
 'Signed, Sealed, Delivered' 'Shanghai Calling' 'My Date with Drew']

Unique values in movie: column title: ['Avatar' "Pirates of the Caribbean: At World's End" 'Spectre' ...
 'Signed, Sealed, Delivered' 'Shanghai Calling' 'My Date with Drew']



The values displayed for the two columns are the same, so `title` is treated as a duplicate of `original_title` and will be dropped. Note that the printed arrays are truncated, so this comparison only covers the first and last entries.

Dataset statistics

In [308]:
movie.describe(include='all')

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
count,4803.0,4803,1712,4803.0,4803,4803,4803,4800,4803.0,4803,4803,4802,4803.0,4801.0,4803,4803,3959,4803,4803.0,4803.0
unique,,1175,1691,,4222,37,4801,4800,,3697,469,3280,,,544,3,3944,4800,,
top,,"[{""id"": 18, ""name"": ""Drama""}]",http://www.missionimpossible.com/,,[],en,Batman,"'Breaking Upwards' explores a young, real-life...",,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2006-01-01,,,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,Based on a true story.,The Host,,
freq,,370,4,,412,4505,2,1,,351,2977,10,,,3171,4795,3,2,,
mean,29045040.0,,,57165.484281,,,,,21.492301,,,,82260640.0,106.875859,,,,,6.092172,690.217989
std,40722390.0,,,88694.614033,,,,,31.81665,,,,162857100.0,22.611935,,,,,1.194612,1234.585891
min,0.0,,,5.0,,,,,0.0,,,,0.0,0.0,,,,,0.0,0.0
25%,790000.0,,,9014.5,,,,,4.66807,,,,0.0,94.0,,,,,5.6,54.0
50%,15000000.0,,,14629.0,,,,,12.921594,,,,19170000.0,103.0,,,,,6.2,235.0
75%,40000000.0,,,58610.5,,,,,28.313505,,,,92917190.0,118.0,,,,,6.8,737.0


## Preprocessing
Clean missing values from the dataset. Little preprocessing is needed at this stage: only missing values are handled, and the object-typed columns are kept as they are because the recommendation models use them later. The preprocessing focuses on the `overview` and `genres` columns.

Drop the columns that are not needed:
- `homepage`: has the most missing values (64%).
- `title`: treated as a duplicate of `original_title`.

In [309]:
movie_cleaned = movie.drop(columns=['homepage', 'title'])

Handle missing values in `overview` by replacing them with empty strings.

In [310]:
movie_cleaned['overview'] = movie_cleaned['overview'].fillna('')

## Preparation
Prepare the dataset before modeling. This stage covers the following steps:

### 1. Copy Cleaned `movie` to `df` for Modeling
Copy the cleaned `movie` dataset into a new DataFrame, `df`, in preparation for modeling.

In [311]:
df = movie_cleaned.copy()
df

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.312950,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",6.1,2124
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4798,220000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",9367,"[{""id"": 5616, ""name"": ""united states\u2013mexi...",es,El Mariachi,El Mariachi just wants to play his guitar and ...,14.269792,"[{""name"": ""Columbia Pictures"", ""id"": 5}]","[{""iso_3166_1"": ""MX"", ""name"": ""Mexico""}, {""iso...",1992-09-04,2040920,81.0,"[{""iso_639_1"": ""es"", ""name"": ""Espa\u00f1ol""}]",Released,"He didn't come looking for trouble, but troubl...",6.6,238
4799,9000,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 10749, ""...",72766,[],en,Newlyweds,A newlywed couple's honeymoon is upended by th...,0.642552,[],[],2011-12-26,0,85.0,[],Released,A newlywed couple's honeymoon is upended by th...,5.9,5
4800,0,"[{""id"": 35, ""name"": ""Comedy""}, {""id"": 18, ""nam...",231617,"[{""id"": 248, ""name"": ""date""}, {""id"": 699, ""nam...",en,"Signed, Sealed, Delivered","""Signed, Sealed, Delivered"" introduces a dedic...",1.444476,"[{""name"": ""Front Street Pictures"", ""id"": 3958}...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2013-10-13,0,120.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,,7.0,6
4801,0,[],126186,[],en,Shanghai Calling,When ambitious New York attorney Sam is sent t...,0.857008,[],"[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-05-03,0,98.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,A New Yorker in Shanghai,5.7,7


Check the dataset shape

In [312]:
df.shape

(4803, 18)

Check the dataset information

In [313]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   id                    4803 non-null   int64  
 3   keywords              4803 non-null   object 
 4   original_language     4803 non-null   object 
 5   original_title        4803 non-null   object 
 6   overview              4803 non-null   object 
 7   popularity            4803 non-null   float64
 8   production_companies  4803 non-null   object 
 9   production_countries  4803 non-null   object 
 10  release_date          4802 non-null   object 
 11  revenue               4803 non-null   int64  
 12  runtime               4801 non-null   float64
 13  spoken_languages      4803 non-null   object 
 14  status                4803 non-null   object 
 15  tagline              

### 2. Drop Unnecessary Columns
Some columns, such as `status` and `production_countries`, are not used by the recommendation system in this project, so both are dropped.

In [314]:
df = df.drop(columns=['status', 'production_countries'])
df.head()

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",A Plan No One Escapes,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",The Legend Ends,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","Lost in our world, found in another.",6.1,2124


### 3. Parsing `genres` Column

Because `genres` is needed by the recommendation system and is still stored as a **JSON** string (a list of objects with `id` and `name`), it must be parsed into a plain string of genre names.

In [315]:
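def clean_genres(genre_str):
  # Parse the JSON-like string into a list of dicts and join the genre names
  try:
    genres = ast.literal_eval(genre_str)
    return ' '.join([g['name'] for g in genres]).lower()
  except (ValueError, SyntaxError, TypeError, KeyError):
    # Malformed or missing values become an empty string
    return ''
df['genres'] = df['genres'].apply(clean_genres)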
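To make the transformation concrete, here is a minimal sketch applied to a single raw value. It uses `json.loads` instead of `ast.literal_eval`, which is an alternative that works when the string is valid JSON; the sample string and the function name `genres_to_text` are illustrative and not part of the notebook.

```python
import json

def genres_to_text(genre_str):
    """Turn a JSON list of {"id", "name"} objects into a lowercase string."""
    try:
        genres = json.loads(genre_str)
        return " ".join(g["name"] for g in genres).lower()
    except (TypeError, ValueError, KeyError):
        # Malformed or missing input falls back to an empty string.
        return ""

# Illustrative sample in the same shape as the raw `genres` values.
sample = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'
parsed = genres_to_text(sample)
print(parsed)
print(parsed.split())
```

A multi-word genre such as Science Fiction ends up as two whitespace-separated tokens, which is how any later step that splits or tokenizes this string will see it.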

Inspect the dataset before modeling

In [316]:
df.head()

Unnamed: 0,budget,genres,id,keywords,original_language,original_title,overview,popularity,production_companies,release_date,revenue,runtime,spoken_languages,tagline,vote_average,vote_count
0,237000000,action adventure fantasy science fiction,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Enter the World of Pandora.,7.2,11800
1,300000000,adventure fantasy action,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","At the end of the world, the adventure begins.",6.9,4500
2,245000000,action adventure crime,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",A Plan No One Escapes,6.3,4466
3,250000000,action crime drama thriller,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",The Legend Ends,7.6,9106
4,260000000,action adventure science fiction,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]","Lost in our world, found in another.",6.1,2124


Save the dataset as a `.csv` file

In [317]:
df.to_csv("data/movies_dataset.csv", index=False)

## Modeling
At this stage, the cleaned dataset is ready for modeling. This recommendation project uses four schemes built from two different models:

### 1. TF-IDF + Cosine Similarity (`overview` column)

- _Term Frequency - Inverse Document Frequency_ (TF-IDF) weights each word in a document by how often it appears there, scaled down for words that appear in many documents.
- _Cosine similarity_ measures the similarity between two vectors as the cosine of the angle between them.
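The two ideas above can be sketched by hand on a toy corpus. This is a simplified illustration, not the library's implementation: it assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and the L2 normalization that `TfidfVectorizer` applies by default, but tokenizes with a plain whitespace split. The three documents are made up.

```python
import math
from collections import Counter

# Made-up documents standing in for film overviews.
docs = [
    "space marine explores alien moon",
    "alien creature hunts space crew",
    "romantic comedy in paris",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_vector(tokens):
    """Return an L2-normalized TF-IDF vector over `vocab`."""
    counts = Counter(tokens)
    weights = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = [tfidf_vector(tokens) for tokens in tokenized]
sim_01 = cosine(vectors[0], vectors[1])  # share "space" and "alien"
sim_02 = cosine(vectors[0], vectors[2])  # share no words
print(round(sim_01, 4), round(sim_02, 4))
```

Documents that share no words get a similarity of exactly zero, which is why TF-IDF similarity scores between short overviews tend to be low.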

Initialize the TF-IDF vectorizer

In [318]:
tfv_overview = TfidfVectorizer()

Fit TF-IDF on the `overview` column

In [319]:
tfidf_overview_matrix = tfv_overview.fit_transform(df['overview'])
tfidf_overview_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 197368 stored elements and shape (4803, 21262)>

Shape of the `overview` TF-IDF matrix

In [320]:
tfidf_overview_matrix.shape

(4803, 21262)

Compute the *cosine similarity* matrix

In [321]:
cos_tfidf_overview = cosine_similarity(tfidf_overview_matrix)
cos_tfidf_overview[0]

array([1.        , 0.03668199, 0.01946126, ..., 0.0327237 , 0.01739194,
       0.01366935], shape=(4803,))

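Build a reverse mapping from `original_title` to row `index`

In [322]:
# Note: drop_duplicates() compares the values (row indices), which are all unique,
# so films that share a title are all kept in this mapping
indices = pd.Series(df.index, index=df['original_title']).drop_duplicates()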
indices

original_title
Avatar                                         0
Pirates of the Caribbean: At World's End       1
Spectre                                        2
The Dark Knight Rises                          3
John Carter                                    4
                                            ... 
El Mariachi                                 4798
Newlyweds                                   4799
Signed, Sealed, Delivered                   4800
Shanghai Calling                            4801
My Date with Drew                           4802
Length: 4803, dtype: int64

Create the recommendation function

In [None]:
def recommend_tfidf_cossim_overview(title, cos_tfidf_overview=cos_tfidf_overview):
  # Get the row index for this original_title
  idx = indices[title]
  if isinstance(idx, pd.Series):
    # Several films share this title; use the first match
    idx = idx.iloc[0]

  # Pair every other film with its similarity score, excluding the query film itself
  scores = [(i, s) for i, s in enumerate(cos_tfidf_overview[idx]) if i != idx]

  # Sort films from most to least similar
  scores = sorted(scores, key=lambda x: x[1], reverse=True)

  # Keep the 5 most similar films
  scores = scores[:5]

  # Movie indices
  movie_indices = [i[0] for i in scores]

  return df['original_title'].iloc[movie_indices]

Test the function

In [324]:
tfidf_overview_test = recommend_tfidf_cossim_overview('Avatar')
print("5 Recommend Films Based on Avatar:")
tfidf_overview_test

5 Recommend Films Based on Avatar:


3604           Apollo 18
529     Tears of the Sun
2130        The American
1341    Obitaemyy Ostrov
634           The Matrix
Name: original_title, dtype: object

Check the *cosine similarity* scores for Avatar, sorted from highest to lowest

In [325]:
sorted(list(enumerate(cos_tfidf_overview[indices['Avatar']])), key=lambda x: x[1], reverse=True)

[(0, np.float64(1.0)),
 (3604, np.float64(0.19507071405516235)),
 (529, np.float64(0.15982247741846406)),
 (2130, np.float64(0.15913539473913496)),
 (1341, np.float64(0.14279306550568086)),
 (634, np.float64(0.1426794068466045)),
 (2628, np.float64(0.1280211742196389)),
 (847, np.float64(0.12715697418970243)),
 (311, np.float64(0.125496856407626)),
 (942, np.float64(0.1247738001890465)),
 (3458, np.float64(0.11645908052001727)),
 (1213, np.float64(0.11577150440927401)),
 (2967, np.float64(0.11114143356816217)),
 (1610, np.float64(0.10612197630772931)),
 (2109, np.float64(0.10467564813354942)),
 (775, np.float64(0.10407276392631461)),
 (1513, np.float64(0.10125396196386263)),
 (2920, np.float64(0.09964157591674233)),
 (570, np.float64(0.09515223114446794)),
 (36, np.float64(0.09475224277403813)),
 (1033, np.float64(0.0935212718618396)),
 (150, np.float64(0.09275906826931352)),
 (2529, np.float64(0.09251179065289368)),
 (83, np.float64(0.09239572786589818)),
 (582, np.float64(0.092267289

### 2. TF-IDF + Cosine Similarity (`genres` column)
This scheme applies the same model to the `genres` column so that its recommendations can be compared with those based on `overview`.

Initialize the TF-IDF vectorizer

In [326]:
tfv_genres = TfidfVectorizer()

Fit TF-IDF on the `genres` column

In [327]:
tfidf_genres_matrix = tfv_genres.fit_transform(df['genres'])
tfidf_genres_matrix

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 12703 stored elements and shape (4803, 22)>

Shape of the `genres` TF-IDF matrix

In [328]:
tfidf_genres_matrix.shape

(4803, 22)

Compute the *cosine similarity* matrix

In [329]:
cos_tfidf_genres = cosine_similarity(tfidf_genres_matrix)
cos_tfidf_genres[0]

array([1.        , 0.74526744, 0.42944732, ..., 0.        , 0.        ,
       0.        ], shape=(4803,))

Create the recommendation function

In [330]:
def recommend_tfidf_cossim_genres(title, cos_tfidf_genres=cos_tfidf_genres):
  # Get the row index for this original_title
  idx = indices[title]
  if isinstance(idx, pd.Series):
    # Several films share this title; use the first match
    idx = idx.iloc[0]

  # Pair every other film with its similarity score, excluding the query film itself
  scores = [(i, s) for i, s in enumerate(cos_tfidf_genres[idx]) if i != idx]

  # Sort films from most to least similar
  scores = sorted(scores, key=lambda x: x[1], reverse=True)

  # Keep the 5 most similar films
  scores = scores[:5]

  # Movie indices
  movie_indices = [i[0] for i in scores]

  return df['original_title'].iloc[movie_indices]

Test the function

In [331]:
tfidf_genres_test = recommend_tfidf_cossim_genres('Avatar')
print("5 Recommend Films Based on Avatar:")
tfidf_genres_test

5 Recommend Films Based on Avatar:


10               Superman Returns
14                   Man of Steel
46     X-Men: Days of Future Past
61              Jupiter Ascending
232                 The Wolverine
Name: original_title, dtype: object

Check the *cosine similarity* scores for Avatar, sorted from highest to lowest

In [332]:
sorted(list(enumerate(cos_tfidf_genres[indices['Avatar']])), key=lambda x: x[1], reverse=True)

[(0, np.float64(1.0)),
 (10, np.float64(1.0)),
 (14, np.float64(1.0)),
 (46, np.float64(1.0)),
 (61, np.float64(1.0)),
 (232, np.float64(1.0)),
 (813, np.float64(1.0)),
 (870, np.float64(1.0)),
 (3494, np.float64(1.0)),
 (238, np.float64(0.958074089745822)),
 (618, np.float64(0.958074089745822)),
 (1191, np.float64(0.958074089745822)),
 (1296, np.float64(0.958074089745822)),
 (1932, np.float64(0.958074089745822)),
 (322, np.float64(0.9457531478388126)),
 (1230, np.float64(0.9457531478388126)),
 (1390, np.float64(0.9457531478388126)),
 (1652, np.float64(0.9457531478388126)),
 (419, np.float64(0.9336770329033672)),
 (420, np.float64(0.9336770329033672)),
 (426, np.float64(0.9336770329033672)),
 (72, np.float64(0.9177700265018054)),
 (728, np.float64(0.9102712779861778)),
 (1611, np.float64(0.9102712779861778)),
 (1802, np.float64(0.9102712779861778)),
 (2592, np.float64(0.9102712779861778)),
 (859, np.float64(0.8711812646335949)),
 (2069, np.float64(0.8711812646335949)),
 (1936, np.float

### 3. BERT + Cosine Similarity (`overview` column)
BERT (Bidirectional Encoder Representations from Transformers) is one of the most widely used models in NLP. It is pretrained with self-supervised learning to produce contextual text representations. In this scheme, *cosine similarity* is used again, this time to compare the BERT *embeddings* of the films.
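Unlike TF-IDF vectors, sentence embeddings are dense, so two texts can score as similar without sharing any words. The sketch below ranks items by cosine similarity over small made-up vectors that stand in for the encoder output; the vectors, the names `film_a` to `film_c`, and the helper `most_similar` are hypothetical and were not produced by any model.

```python
import math

# Hypothetical 3-dimensional "embeddings"; real ones are much longer.
embeddings = {
    "film_a": [0.9, 0.1, 0.0],
    "film_b": [0.8, 0.2, 0.1],
    "film_c": [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, top_n=2):
    """Rank every other item by cosine similarity to `query`."""
    scores = [
        (name, cosine(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]

ranking = most_similar("film_a")
print(ranking)
```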

Initialize the BERT model

In [333]:
model_overview = SentenceTransformer('all-MiniLM-L6-v2') # A fast yet capable BERT-style sentence-embedding model

Encode the `overview` column

In [334]:
embeddings_overview = model_overview.encode(df['overview'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 151/151 [00:06<00:00, 23.99it/s]


Compute the *cosine similarity* matrix

In [335]:
cos_bert_overview = cosine_similarity(embeddings_overview, embeddings_overview)

Create the recommendation function

In [336]:
def recommend_bert_overview(title, top_n):
  # Find the first film whose title matches, ignoring case
  matches = df[df['original_title'].str.lower() == title.lower()].index
  if len(matches) == 0:
    raise ValueError(f"Title not found: {title}")
  idx = matches[0]

  # Similarity scores between this film and every film
  row = cos_bert_overview[idx]

  # Sort indices from most to least similar
  similar_indices = row.argsort()[::-1]

  # Drop the film itself
  similar_indices = similar_indices[similar_indices != idx]

  # Pick top_n
  top_indices = similar_indices[:top_n]

  # Collect (title, score) pairs
  results = []
  for i in top_indices:
    results.append((df.iloc[i]['original_title'], round(row[i], 4)))
  return results

Test the function

In [337]:
bert_overview_test = recommend_bert_overview('Avatar', top_n=5)
print("5 Recommend Films Based on Avatar:")
bert_overview_test

5 Recommend Films Based on Avatar:


[('Alien: Resurrection', np.float32(0.4628)),
 ('The Black Hole', np.float32(0.4385)),
 ('Serenity', np.float32(0.4371)),
 ('Aliens', np.float32(0.4227)),
 ('Supernova', np.float32(0.4112))]

### 4. BERT + Cosine Similarity (`genres` column)
This scheme applies the same BERT model to the `genres` column so that its recommendations can be compared with those based on `overview`.

Initialize the BERT model

In [338]:
model_genres = SentenceTransformer('all-MiniLM-L6-v2')

Encode the `genres` column

In [None]:
embeddings_genres = model_genres.encode(df['genres'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 151/151 [00:01<00:00, 81.77it/s] 


Compute the *cosine similarity* matrix

In [None]:
cos_bert_genres = cosine_similarity(embeddings_genres, embeddings_genres)

Create the recommendation function

In [None]:
def recommend_bert_genres(title, top_n):
  # Find the first film whose title matches, ignoring case
  matches = df[df['original_title'].str.lower() == title.lower()].index
  if len(matches) == 0:
    raise ValueError(f"Title not found: {title}")
  idx = matches[0]

  # Similarity scores between this film and every film
  row = cos_bert_genres[idx]

  # Sort indices from most to least similar
  similar_indices = row.argsort()[::-1]

  # Drop the film itself
  similar_indices = similar_indices[similar_indices != idx]

  # Pick top_n
  top_indices = similar_indices[:top_n]

  # Collect (title, score) pairs
  results = []
  for i in top_indices:
    results.append((df.iloc[i]['original_title'], round(row[i], 4)))
  return results

Test the function

In [342]:
bert_genres_test = recommend_bert_genres('Avatar', top_n=5)
print("5 Recommend Films Based on Avatar:")
bert_genres_test

5 Recommend Films Based on Avatar:


[('X-Men: Days of Future Past', np.float32(1.0)),
 ('Man of Steel', np.float32(1.0)),
 ('Superman', np.float32(1.0)),
 ('Beastmaster 2: Through the Portal of Time', np.float32(1.0)),
 ('Superman II', np.float32(1.0))]

## Evaluation
At this stage, the two models across the four schemes are evaluated to find out which performs better. The evaluation uses **_Mean Reciprocal Rank_** (MRR) and **_Normalized Discounted Cumulative Gain_** (NDCG). MRR measures how early the first relevant film appears in a recommendation list, while NDCG measures how well all the relevant films are ranked toward the top of the list.

Because both metrics require a *ground truth* to score the recommendations, the next step builds one for evaluating both models.
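As a rough illustration of the two metrics, the sketch below computes the reciprocal rank and NDCG@k for a single query. It assumes binary relevance (a film is either in the relevant set or not), which is one common choice and not something the text above specifies; the function names and the film ids are hypothetical.

```python
import math

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(recommended, relevant, k):
    """NDCG@k with binary relevance (1 if the item is in `relevant`, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # Ideal ordering: all relevant items placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Hypothetical film ids: the ground truth marks 11 and 12 as relevant.
relevant = {11, 12}
recommended = [30, 11, 40, 12, 50]

rr = reciprocal_rank(recommended, relevant)
ndcg = ndcg_at_k(recommended, relevant, k=5)
print(rr, round(ndcg, 4))
```

MRR is then the mean of `reciprocal_rank` over all query films, and the reported NDCG is likewise the mean of `ndcg_at_k` over the same queries.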

### 1. Ground Truth
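The *ground truth* is built from film popularity within each genre: for every genre, the six most popular films are selected, and each of them is assigned the other five as its relevant films. The result is shown below: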

In [343]:
# Parse the 'genres' column into a list of genre tokens
df['genres'] = df['genres'].apply(lambda x: x.split() if isinstance(x, str) else [])

# Ground truth dict: film ID -> list of relevant film IDs
ground_truth = {}

# Collect all unique genres
all_genres = set(g for genres_list in df['genres'] for g in genres_list)

for genre in all_genres:
    # Filter all films with the current genre
    df_genre = df[df['genres'].apply(lambda x: genre in x)]
    
    # Keep the 6 most popular films, so each one has 5 relevant peers
    df_genre_top5 = df_genre.sort_values(by='popularity', ascending=False).head(6)
    
    for idx, row in df_genre_top5.iterrows():
        film_id = row['id']
        # Relevant films are the other popular films of the same genre.
        # A film that appears in several genres keeps only the last list assigned.
        relevant_films = df_genre_top5[df_genre_top5['id'] != film_id]['id'].tolist()
        ground_truth[film_id] = relevant_films
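
Because `ground_truth[film_id]` is reassigned on every pass, a film that is among the most popular in several genres keeps only the list from the last genre processed. One possible alternative is to take the union of the relevant films across genres. The sketch below illustrates the idea on assumed toy data with plain dictionaries instead of the notebook's DataFrame; the names `films`, `TOP_K`, and `ground_truth_merged` are hypothetical.

```python
from collections import defaultdict

# Toy catalogue (assumed data): film id -> (popularity, list of genres)
films = {
    1: (90.0, ["Action", "Adventure"]),
    2: (80.0, ["Action"]),
    3: (70.0, ["Adventure"]),
    4: (60.0, ["Action", "Adventure"]),
}
TOP_K = 3  # films kept per genre (the notebook keeps 6)

relevant = defaultdict(set)
genres = sorted({g for _, gs in films.values() for g in gs})
for genre in genres:
    # Films of this genre, most popular first
    in_genre = [fid for fid, (_, gs) in films.items() if genre in gs]
    top = sorted(in_genre, key=lambda fid: films[fid][0], reverse=True)[:TOP_K]
    for fid in top:
        # Merge with the lists from earlier genres instead of overwriting them
        relevant[fid].update(other for other in top if other != fid)

ground_truth_merged = {fid: sorted(ids) for fid, ids in relevant.items()}
```

With this variant, film 1 is relevant to films from both of its genres rather than only the genre that happened to be processed last.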

Check the ground truth

In [344]:
ground_truth

{205321: [10947, 13187, 158150, 231617, 22488],
 10947: [205321, 13187, 158150, 231617, 22488],
 13187: [205321, 10947, 158150, 231617, 22488],
 158150: [205321, 10947, 13187, 231617, 22488],
 231617: [205321, 10947, 13187, 158150, 22488],
 22488: [205321, 10947, 13187, 158150, 231617],
 211672: [157336, 293660, 118340, 76341, 135397],
 177572: [211672, 293660, 98566, 257344, 13],
 109445: [211672, 177572, 93456, 150540, 62177],
 93456: [211672, 177572, 109445, 150540, 62177],
 672: [211672, 177572, 109445, 93456, 158852],
 158852: [27205, 210577, 198663, 17578, 2501],
 157336: [211672, 293660, 118340, 76341, 135397],
 118340: [211672, 157336, 293660, 76341, 135397],
 76341: [211672, 157336, 293660, 118340, 135397],
 135397: [211672, 157336, 293660, 118340, 76341],
 119450: [157336, 118340, 76341, 135397, 131631],
 131631: [157336, 118340, 76341, 135397, 119450],
 150540: [211672, 177572, 109445, 93456, 62177],
 62177: [211672, 177572, 109445, 93456, 150540],
 293660: [211672, 157336, 

Save the ground truth

In [345]:
with open('data/ground_truth.json', 'w') as f:
  json.dump(ground_truth, f)
  
print(f'Ground truth has been saved! Total: {len(ground_truth)} film queries')

Ground truth has been saved! Total: 88 film queries


Get the ground truth film IDs for the query film Avatar (ID: 19995)

In [346]:
avatar_ids = ground_truth[19995]
avatar_ids

[22, 209112, 58, 98566, 285]

Convert the ground truth IDs to film titles

In [347]:
# Convert ground truth IDs to movie titles
ground_truth_titles = {}
for film_id, gt_ids in ground_truth.items():
  # Title of the query film
  main_title = df[df['id'] == int(film_id)]['original_title'].values[0]
    
  # Titles of its relevant films
  gt_titles = df[df['id'].isin(gt_ids)]['original_title'].tolist()
    
  ground_truth_titles[main_title] = gt_titles

print(ground_truth_titles)

{'Sharknado': ['High School Musical', 'How to Fall in Love', "Love's Abiding Joy", 'A Charlie Brown Christmas', 'Signed, Sealed, Delivered'], 'High School Musical': ['How to Fall in Love', 'Sharknado', "Love's Abiding Joy", 'A Charlie Brown Christmas', 'Signed, Sealed, Delivered'], 'A Charlie Brown Christmas': ['High School Musical', 'How to Fall in Love', 'Sharknado', "Love's Abiding Joy", 'Signed, Sealed, Delivered'], 'How to Fall in Love': ['High School Musical', 'Sharknado', "Love's Abiding Joy", 'A Charlie Brown Christmas', 'Signed, Sealed, Delivered'], 'Signed, Sealed, Delivered': ['High School Musical', 'How to Fall in Love', 'Sharknado', "Love's Abiding Joy", 'A Charlie Brown Christmas'], "Love's Abiding Joy": ['High School Musical', 'How to Fall in Love', 'Sharknado', 'A Charlie Brown Christmas', 'Signed, Sealed, Delivered'], 'Minions': ['Jurassic World', 'Guardians of the Galaxy', 'Interstellar', 'Mad Max: Fury Road', 'Deadpool'], 'Big Hero 6': ['Teenage Mutant Ninja Turtles'

Get the ground truth film titles for Avatar

In [348]:
avatar_recs = ground_truth_titles['Avatar']
avatar_recs

["Pirates of the Caribbean: At World's End",
 'Batman v Superman: Dawn of Justice',
 "Pirates of the Caribbean: Dead Man's Chest",
 'Pirates of the Caribbean: The Curse of the Black Pearl',
 'Teenage Mutant Ninja Turtles']

### 2. Model Evaluation
With the ground truth in place, the evaluation continues with Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG), which measure how well each model ranks the relevant items.

Define a function that collects the top-N recommendations for every film from a similarity matrix

In [349]:
def get_top_recommendations(similarity_matrix, ids, top_n):
  recs = {}
  for idx, row in enumerate(similarity_matrix):
    similar_indices = row.flatten().argsort()[::-1]
    similar_indices = [i for i in similar_indices if i != idx][:top_n]
    film_id = ids[idx]
    recommended_ids = [ids[i] for i in similar_indices]
    recs[film_id] = recommended_ids
  return recs

Store the recommendations of all schemes

In [350]:
model_results = {
	'TF-IDF Overview': get_top_recommendations(cos_tfidf_overview, df['id'], top_n=5),
	'TF-IDF Genres': get_top_recommendations(cos_tfidf_genres, df['id'], top_n=5),
	'BERT Overview': get_top_recommendations(cos_bert_overview, df['id'], top_n=5),
	'BERT Genres': get_top_recommendations(cos_bert_genres, df['id'], top_n=5)
}

Build a recommendation dict for each scheme

In [351]:
recs_tfidf_overview = get_top_recommendations(cos_tfidf_overview, df['id'], top_n=5)
recs_tfidf_genres = get_top_recommendations(cos_tfidf_genres, df['id'], top_n=5)
recs_bert_overview = get_top_recommendations(cos_bert_overview, df['id'], top_n=5)
recs_bert_genres = get_top_recommendations(cos_bert_genres, df['id'], top_n=5)

Define the MRR and NDCG metric functions

In [352]:
from sklearn.metrics import ndcg_score

# Reciprocal rank of a single query: 1 / position of the first relevant item
# (averaging it over all queries gives the MRR)
def mean_reciprocal_rank(ground_truth, prediction):
  rr = 0.0
  for i, p in enumerate(prediction, start=1):
    if p in ground_truth:
      rr = 1.0 / i
      break
  return rr

# NDCG of a single query
def ndcg(ground_truth, prediction, k=10):
  # Binary relevance (1 if the item is in the ground truth, 0 otherwise)
  relevance = [1 if p in ground_truth else 0 for p in prediction[:k]]
  # Scores decrease with rank so that ndcg_score keeps the predicted order
  return ndcg_score([relevance], [list(range(len(relevance), 0, -1))])
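
Note that `ndcg_score` normalizes by the ideal ordering of the relevance labels it receives, which here are only the labels of the retrieved items. A variant that normalizes by the best ranking achievable given the size of the ground truth can be sketched in plain Python as follows. The function names `dcg` and `ndcg_at_k` and the toy IDs are assumptions for illustration, not part of the notebook.

```python
import math

def dcg(relevance):
    """Discounted cumulative gain of a list of relevance labels."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevance, start=1))

def ndcg_at_k(relevant_ids, predicted_ids, k=5):
    """NDCG@k with binary relevance, normalized by the best achievable ranking."""
    relevant_ids = set(relevant_ids)
    gains = [1 if p in relevant_ids else 0 for p in predicted_ids[:k]]
    # Ideal list: as many relevant items as possible placed at the top
    ideal = [1] * min(len(relevant_ids), k)
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0

# Toy query (assumed IDs): three relevant films, one of them retrieved at rank 2
score_hit = ndcg_at_k([10, 20, 30], [99, 10, 98, 97, 96], k=5)
# No relevant film retrieved
score_miss = ndcg_at_k([10, 20, 30], [1, 2, 3, 4, 5], k=5)
# Every relevant film retrieved at the top
score_perfect = ndcg_at_k([1, 2], [1, 2, 3], k=5)
```

Under this normalization, a list that finds only one of three relevant films is penalized for the two it missed.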

Loop over the recommendations to compute MRR and NDCG

In [353]:
# NOTE: ground_truth is keyed by integer film IDs, so the str(film_id)
# lookups below fall back to an empty list for every film.

# TF-IDF overview
mrr_list_tfidf_overview, ndcg_list_tfidf_overview = [], []
for film_id, prediction in model_results['TF-IDF Overview'].items():
  gt = ground_truth.get(str(film_id), [])
  mrr_list_tfidf_overview.append(mean_reciprocal_rank(gt, prediction))
  ndcg_list_tfidf_overview.append(ndcg(gt, prediction))
mrr_result_tfidf_overview = sum(mrr_list_tfidf_overview) / len(mrr_list_tfidf_overview)
ndcg_result_tfidf_overview = sum(ndcg_list_tfidf_overview) / len(ndcg_list_tfidf_overview)
  
# TF-IDF genres
mrr_list_tfidf_genres, ndcg_list_tfidf_genres = [], []
for film_id, prediction in model_results['TF-IDF Genres'].items():
  gt = ground_truth.get(str(film_id), [])
  mrr_list_tfidf_genres.append(mean_reciprocal_rank(gt, prediction))
  ndcg_list_tfidf_genres.append(ndcg(gt, prediction))
mrr_result_tfidf_genres = sum(mrr_list_tfidf_genres) / len(mrr_list_tfidf_genres)
ndcg_result_tfidf_genres = sum(ndcg_list_tfidf_genres) / len(ndcg_list_tfidf_genres)
  
# BERT overview
mrr_list_bert_overview, ndcg_list_bert_overview = [], []
for film_id, prediction in model_results['BERT Overview'].items():
  gt = ground_truth.get(str(film_id), [])
  mrr_list_bert_overview.append(mean_reciprocal_rank(gt, prediction))
  ndcg_list_bert_overview.append(ndcg(gt, prediction))
mrr_result_bert_overview = sum(mrr_list_bert_overview) / len(mrr_list_bert_overview)
ndcg_result_bert_overview = sum(ndcg_list_bert_overview) / len(ndcg_list_bert_overview)
  
# BERT genres
mrr_list_bert_genres, ndcg_list_bert_genres = [], []
for film_id, prediction in model_results['BERT Genres'].items():
  gt = ground_truth.get(str(film_id), [])
  mrr_list_bert_genres.append(mean_reciprocal_rank(gt, prediction))
  ndcg_list_bert_genres.append(ndcg(gt, prediction))
mrr_result_bert_genres = sum(mrr_list_bert_genres) / len(mrr_list_bert_genres)
ndcg_result_bert_genres = sum(ndcg_list_bert_genres) / len(ndcg_list_bert_genres)

Display the metric values

In [354]:
metric_result = pd.DataFrame({
	'Model': ['TF-IDF Overview', 'TF-IDF Genres', 'BERT Overview', 'BERT Genres'],
  'MRR': [mrr_result_tfidf_overview, mrr_result_tfidf_genres, mrr_result_bert_overview, mrr_result_bert_genres],
  'NDCG': [ndcg_result_tfidf_overview, ndcg_result_tfidf_genres, ndcg_result_bert_overview, ndcg_result_bert_genres]
})
metric_result

Unnamed: 0,Model,MRR,NDCG
0,TF-IDF Overview,0.0,0.0
1,TF-IDF Genres,0.0,0.0
2,BERT Overview,0.0,0.0
3,BERT Genres,0.0,0.0


MRR and NDCG are 0 for all four schemes, which means that no recommendation was matched against the ground truth. The most likely cause is that the film IDs in the ground truth and the IDs used for the lookup are out of sync: `ground_truth` is keyed by integer film IDs (for example `ground_truth[19995]`), while the evaluation loop looks up `str(film_id)`, so every lookup returns an empty list. A manual evaluation is therefore used to verify the models' performance.
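
The key mismatch can be reproduced on a small example. The sketch below uses assumed toy data and hypothetical names (`reciprocal_rank`, `ground_truth_toy`, `recs_toy`); it normalizes the ground truth keys to integers, which also covers a ground truth reloaded from JSON (where keys become strings), and averages only over films that have a ground truth entry.

```python
import json

def reciprocal_rank(relevant_ids, predicted_ids):
    """1 / rank of the first relevant item, or 0.0 if none is found."""
    relevant_ids = set(relevant_ids)
    for rank, pid in enumerate(predicted_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy data (assumed): in-memory ground truth with integer keys
ground_truth_toy = {1: [2, 3], 2: [1, 3]}
# Recommendations exist for every film, including film 3, which has no ground truth
recs_toy = {1: [9, 2, 8], 2: [1, 9, 8], 3: [7, 8, 9]}

# Looking up a string key in the integer-keyed dict finds nothing
missed = ground_truth_toy.get(str(1), [])

# A JSON round trip turns the integer keys into strings
reloaded = json.loads(json.dumps(ground_truth_toy))

# Normalize keys to int so that lookups by integer film ID succeed
gt_by_id = {int(k): v for k, v in reloaded.items()}

# Average only over films that actually have a ground truth entry
scores = [reciprocal_rank(gt_by_id[fid], recs_toy[fid])
          for fid in gt_by_id if fid in recs_toy]
mrr = sum(scores) / len(scores)
```

Restricting the average to ground truth queries matters as much as the key type: films without an entry would otherwise contribute a score of 0 and pull the mean down.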

### 3. Manual Evaluation
The manual evaluation compares the _ground truth_ with the recommendations produced by the four schemes.

Build a dataframe comparing the _ground truth_ with each scheme's recommendations

In [355]:

def compute_relevance(recommended, avatar_recs):
    # BERT results are (title, score) tuples while TF-IDF results are plain titles,
    # so reduce every item to its title before comparing
    titles = [r[0] if isinstance(r, tuple) else r for r in recommended]
    return len(set(titles).intersection(set(avatar_recs)))

# Count the relevant recommendations of each scheme
relevance_tfidf_overview = compute_relevance(tfidf_overview_test, avatar_recs)
relevance_tfidf_genres = compute_relevance(tfidf_genres_test, avatar_recs)
relevance_bert_overview = compute_relevance(bert_overview_test, avatar_recs)
relevance_bert_genres = compute_relevance(bert_genres_test, avatar_recs)

film_title = "Avatar"

evaluation_df = pd.DataFrame({
    'Film': [film_title],
    'TF-IDF overview': [tfidf_overview_test],
    'TF-IDF genres': [tfidf_genres_test],
    'BERT overview': [bert_overview_test],
    'BERT genres': [bert_genres_test],
    'TF-IDF overview relevance': [relevance_tfidf_overview],
    'TF-IDF genres relevance': [relevance_tfidf_genres],
    'BERT overview relevance': [relevance_bert_overview],
    'BERT genres relevance': [relevance_bert_genres]
})

# Display the dataframe
evaluation_df

Unnamed: 0,Film,TF-IDF overview,TF-IDF genres,BERT overview,BERT genres,TF-IDF overview relevance,TF-IDF genres relevance,BERT overview relevance,BERT genres relevance
0,Avatar,3604 Apollo 18 529 Tears of the ...,10 Superman Returns 14 ...,"[(Alien: Resurrection, 0.4628), (The Black Hol...","[(X-Men: Days of Future Past, 1.0), (Man of St...",0,0,0,0


The manual evaluation against the _ground truth_ gives the same result as MRR and NDCG: none of the recommendations overlap with the ground truth. This should be taken into account when choosing a _ground truth_ that suits the recommendation system being built. To still compare the four schemes directly for a given title (**Avatar** in this project), their recommendations are listed below.

Extract the recommended film titles from each scheme

In [356]:
bert_overview_titles = [title for title, _ in bert_overview_test]
bert_genres_titles = [title for title, _ in bert_genres_test]
tfidf_overview_titles = tfidf_overview_test.tolist()
tfidf_genres_titles = tfidf_genres_test.tolist()

Build a dataframe of each scheme's recommendations

In [357]:
# Map each scheme to its list of recommended titles
recommendation_lists = {
    'TF-IDF Overview': tfidf_overview_titles,
    'TF-IDF Genres': tfidf_genres_titles,
    'BERT Overview': bert_overview_titles,
    'BERT Genres': bert_genres_titles
}

film_input = 'Avatar'

manual_eval = pd.DataFrame([
    {"Model": model, "Movie Title": film_input, "Recommendation": title}
    for model, titles in recommendation_lists.items()
    for title in titles
])
manual_eval


Unnamed: 0,Model,Movie Title,Recommendation
0,TF-IDF Overview,Avatar,Apollo 18
1,TF-IDF Overview,Avatar,Tears of the Sun
2,TF-IDF Overview,Avatar,The American
3,TF-IDF Overview,Avatar,Obitaemyy Ostrov
4,TF-IDF Overview,Avatar,The Matrix
5,TF-IDF Genres,Avatar,Superman Returns
6,TF-IDF Genres,Avatar,Man of Steel
7,TF-IDF Genres,Avatar,X-Men: Days of Future Past
8,TF-IDF Genres,Avatar,Jupiter Ascending
9,TF-IDF Genres,Avatar,The Wolverine


The recommendations for Avatar show that each of the four content-based filtering schemes produces a different pattern:

- **TF-IDF on overview** tends to recommend films whose synopsis is literally similar in wording, which is not always thematically relevant. Apollo 18 and The American, for example, share keywords with Avatar but have very different settings.
- **TF-IDF on genres** does slightly better, returning superhero and science fiction films such as Man of Steel, X-Men: Days of Future Past, and Jupiter Ascending, although none of them is in the ground truth.
- **BERT on overview** gives more semantically relevant results, such as Aliens, Serenity, and Alien: Resurrection. These share the context of outer space, alien species, and human-alien conflict, themes that are very close to Avatar.
- **BERT on genres** also recommends superhero and classic sci-fi films, such as Superman, X-Men: Days of Future Past, and Superman II, but it is limited by the very short genre input. All five of its recommendations have a similarity of 1.0, which suggests that their genre text is identical to Avatar's.

Overall, BERT on overview comes closest to the narrative context and atmosphere of Avatar, although it still does not explicitly match any film in the ground truth.