#Sistem Rekomendasi Film berdasarkan Content-Based Filtering

## Latar Belakang

Sistem rekomendasi telah menjadi komponen penting dalam berbagai platform digital seperti e-commerce, layanan streaming, dan sosial media. Dalam konteks film, pengguna seringkali kesulitan menemukan film yang sesuai dengan preferensi mereka karena banyaknya pilihan yang tersedia. Untuk mengatasi masalah ini, diperlukan sistem yang mampu merekomendasikan film yang relevan.

Proyek ini bertujuan mengembangkan sistem rekomendasi film berbasis konten (Content-Based Filtering) yang dapat memberikan saran film serupa berdasarkan karakteristik tertentu—dalam hal ini, bahasa asli film. Sistem ini penting sebagai pondasi awal pengembangan sistem rekomendasi yang lebih kompleks seperti yang digunakan Netflix, Hulu, dan layanan streaming lainnya.

# Business Understanding
## Problem Statement

Pengguna ingin mendapatkan rekomendasi film yang serupa dengan film yang telah ditonton atau disukai, namun tidak selalu tahu cara menemukannya. Tanpa sistem rekomendasi, pengguna dapat merasa kewalahan oleh banyaknya pilihan.

## Goals

Membangun sistem rekomendasi berbasis konten yang mampu:

- Memberikan rekomendasi film serupa berdasarkan karakteristik tertentu (dalam proyek ini: bahasa film).

- Menyediakan 10 film rekomendasi teratas (Top-N).

- Menyediakan dua pendekatan algoritmik yang berbeda untuk membandingkan hasil.

# Import Library

In [3]:
!pip install -q kaggle

In [4]:
# Import Data Loading
from google.colab import files
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
import seaborn as sns
import zipfile
import warnings
warnings.filterwarnings('ignore')

# Import TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Import Model Development
from sklearn.metrics.pairwise import cosine_similarity

# 1. Data Understanding

## 1.1. Data Loading

Supaya isi dataset lebih mudah dipahami, kita perlu melakukan proses loading data terlebih dahulu. Tidak lupa, import library pandas untuk dapat membaca file datanya.

In [5]:
files.upload() # Upload kaggle json

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"dewipuspita0904","key":"79b8f8e42e00e633808b12391f3fa555"}'}

In [6]:
!mkdir ~/.kaggle # membuat folder .kaggle
!cp kaggle.json ~/.kaggle/ # menyalin file credential ke lokasi yang sesuai
!chmod 600 ~/.kaggle/kaggle.json # mengatur izin file agar bisa digunakan oleh kernel Colab
!kaggle datasets download -d rohan4050/movie-recommendation-data # mengunduh dataset dari Kaggle

Dataset URL: https://www.kaggle.com/datasets/rohan4050/movie-recommendation-data
License(s): unknown
Downloading movie-recommendation-data.zip to /content
  0% 0.00/13.1M [00:00<?, ?B/s]
100% 13.1M/13.1M [00:00<00:00, 679MB/s]


In [7]:
# mengekstrak file ZIP dataset
zip_ref = zipfile.ZipFile('/content/movie-recommendation-data.zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()

In [8]:
# load the dataset
movie = pd.read_csv('/content/movies_metadata.csv')
movie

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45461,False,,0,"[{'id': 18, 'name': 'Drama'}, {'id': 10751, 'n...",http://www.imdb.com/title/tt6209470/,439050,tt6209470,fa,رگ خواب,Rising and falling between a man and woman.,...,,0.0,90.0,"[{'iso_639_1': 'fa', 'name': 'فارسی'}]",Released,Rising and falling between a man and woman,Subdue,False,4.0,1.0
45462,False,,0,"[{'id': 18, 'name': 'Drama'}]",,111109,tt2028550,tl,Siglo ng Pagluluwal,An artist struggles to finish his work while a...,...,2011-11-17,0.0,360.0,"[{'iso_639_1': 'tl', 'name': ''}]",Released,,Century of Birthing,False,9.0,3.0
45463,False,,0,"[{'id': 28, 'name': 'Action'}, {'id': 18, 'nam...",,67758,tt0303758,en,Betrayal,"When one of her hits goes wrong, a professiona...",...,2003-08-01,0.0,90.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,A deadly game of wits.,Betrayal,False,3.8,6.0
45464,False,,0,[],,227506,tt0008536,en,Satana likuyushchiy,"In a small town live two brothers, one a minis...",...,1917-10-21,0.0,87.0,[],Released,,Satan Triumphant,False,0.0,0.0


Output di atas memberikan informasi sebagai berikut:
- Ada 45.466 baris data.
- Ada 24 kolom, yaitu `adult`, `belongs_to_collection`, `budget`, `genres`, `homepage`, `id`, `imdb_id`, `original_language`, `original_title`, `overview`, `popularity`, `poster_path`, `production_companies`, `production_countries`, `release_date`, `revenue`, `runtime`, `spoken_languages`, `status`, `tagline`, `title`, `video`, `vote_average`, `vote_count`.

## 1.2. Deskripsi Variabel
Berdasarkan informasi dari Kaggle, variabel-variabel pada dataset adalah berikut:

|Column |Description |
|---------------|--------------------------|
|adult |Apakah film tersebut untuk dewasa (konten eksplisit). Nilai: True atau False. |
|belongs_to_collection |Informasi tentang koleksi film jika film tersebut bagian dari suatu seri atau franchise. |
|budget |Anggaran biaya produksi film (dalam satuan mata uang dolar AS). |
|genres |Daftar genre film dalam format JSON string (misalnya: Action, Comedy, Drama, dll). |
|homepage |URL resmi halaman web film (jika ada). |
|id |ID unik dari film tersebut. |
|imdb_id |ID film di situs IMDb (Internet Movie Database). |
|original_language |Bahasa asli saat film tersebut diproduksi (kode bahasa ISO seperti 'en', 'fr', 'ja'). |
|original_title |Judul asli dari film sesuai dengan versi produksinya. |
|overview |Ringkasan atau sinopsis pendek mengenai cerita film. |
|popularity |Skor popularitas film berdasarkan sistem internal TMDb (semakin tinggi, semakin populer). |
|poster_path |Path (lokasi relatif) dari gambar poster film pada server TMDb. |
|production_companies |	Daftar perusahaan produksi yang memproduksi film (format JSON). |
|production_countries |Daftar negara tempat film tersebut diproduksi (format JSON). |
|release_date |Tanggal rilis resmi film (format: YYYY-MM-DD). |
|revenue |Pendapatan kotor dari film di seluruh dunia (dalam dolar AS). |
|runtime |Durasi tayang film dalam satuan menit. |
|spoken_languages |Daftar bahasa yang digunakan dalam dialog film (format JSON). |
|status |Status rilis film (contoh: Released, Post Production, Rumored). |
|tagline |Slogan atau kutipan promosi film yang biasanya muncul di poster. |
|title |Judul film versi internasional yang umum digunakan. |
|video |Menunjukkan apakah data video tersedia (True atau False). |
|vote_average |Rata-rata skor rating film berdasarkan penilaian pengguna TMDb (skala 0–10). |
|vote_count |Jumlah total pengguna yang memberikan rating untuk film tersebut. |

Setelah memahami deskripsi variabel pada data, langkah selanjutnya adalah mengecek informasi pada dataset dengan fungsi `info()` berikut.

In [9]:
movie.info() # cari tau keterangan dan tipe dari tiap kolom data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

Dari output terlihat bahwa, data yang diambil memiliki:
- 4 kolom numerik dengan tipe float64, yakni `revenue`, `runtime`, `vote_average`, dan `vote_count`.
- 20 kolom kategorikal bertipe object, yakni `adult`, `belongs_to_collection`, `budget`, `genres`, `homepage`, `id`, `imdb_id`, `original_language`, `original_title`, `overview`, `popularity`, `poster_path`, `production_companies`, `production_countries`, `release_date`, `spoken_languages`, `status`, `tagline`, `title`, `video`.

## 1.3. EDA - Missing Value dan Duplicate

In [10]:
movie.isnull().sum() # cek jumlah missing value

Unnamed: 0,0
adult,0
belongs_to_collection,40972
budget,0
genres,0
homepage,37684
id,0
imdb_id,17
original_language,11
original_title,0
overview,954


Dari hasil, terlihat bahwa terdapat beberapa kolom yang memiliki missing value, seperti `original_language` yang memiliki 11 missing value dan `title` yang memiliki 6 missing value.

In [11]:
movie.duplicated().sum() # cek jumlah data duplikat

np.int64(13)

Dari hasil, terlihat bahwa terdapat 13 baris data yang duplikat.

In [12]:
# melihat ada berapa banyak entri yang unik berdasarkan id
print('Banyak data: ', len(movie.id.unique()))

# melihat ada berapa banyak koleksi unik berdasarkan bahasa
print('Banyak bahasa: ', len(movie.original_language.unique()))
print('Jenis bahasa: ', movie.original_language.unique())

# melihat ada berapa banyak judul unik berdasarkan genres
print('Banyak judul: ', len(movie.title.unique()))

Banyak data:  45436
Banyak bahasa:  93
Jenis bahasa:  ['en' 'fr' 'zh' 'it' 'fa' 'nl' 'de' 'cn' 'ar' 'es' 'ru' 'sv' 'ja' 'ko'
 'sr' 'bn' 'he' 'pt' 'wo' 'ro' 'hu' 'cy' 'vi' 'cs' 'da' 'no' 'nb' 'pl'
 'el' 'sh' 'xx' 'mk' 'bo' 'ca' 'fi' 'th' 'sk' 'bs' 'hi' 'tr' 'is' 'ps'
 'ab' 'eo' 'ka' 'mn' 'bm' 'zu' 'uk' 'af' 'la' 'et' 'ku' 'fy' 'lv' 'ta'
 'sl' 'tl' 'ur' 'rw' 'id' 'bg' 'mr' 'lt' 'kk' 'ms' 'sq' nan '104.0' 'qu'
 'te' 'am' 'jv' 'tg' 'ml' 'hr' 'lo' 'ay' 'kn' 'eu' 'ne' 'pa' 'ky' 'gl'
 '68.0' 'uz' 'sm' 'mt' '82.0' 'hy' 'iu' 'lb' 'si']
Banyak judul:  42278


Dari data, terlihat bahwa data yang kita ambil memiliki jumlah 45.436 data yang berbeda dengan total 93 bahasa dan 42.278 judul yang berbeda.

# 2. Data Preparation
Pada tahap ini, kita akan mengambil kolom-kolom yang akan digunakan saja dan mempersiapkan data agar siap untuk dipakai permodelan.

## 2.1. Pengambilan Sampling 10.000 Data Acak

Karena dataset asli berisi sekitar 45.000 entri yang tergolong cukup besar untuk proses eksplorasi awal dan eksperimen. Maka dilakukan sampling acak sebanyak 10.000 baris data menggunakan sample (n=10.000, random_sate=42) agar proses komputasi berjalan lebih cepat dan efisien serta hasilnya tetap representatif karena diambil secara acak.

Hal ini dilakukan untuk mengurangi beban memori dan waktu komputasi di Google Colab, tetap mempertahankan keberagaman data dengan pengambilan acak, dan pengujian model awal sebelum mengembangkan ke full dataset.

In [13]:
movie_new = movie.sample(n=10000, random_state=42)  # random_state agar hasilnya konsisten

## 2.2. Hanya Ambil Kolom yang Dipakai

Kolom-kolom seperti `budget`, `homepage`, `status`, dan sebagainya tidak digunakan karena fokus kita adalah untuk melakukan rekomendasi berdasarkan bahasa.

Sehingga, kolom yang dipertahankan hanyalah:
- `id` sebagai identifikasi film
- `title` sebagai index dan tampilan
- `original_language` sebagai fitur utama dalam model

Dengan menerapkan ini, kita bisa fokus hanya pada fitur relevan, menyederhanakan preprocessing dan vektorisasi, dan menghindari noise dari data yang tidak terstruktur atau tidak digunakan.

In [14]:
movie_clean = pd.DataFrame({
    'id': movie_new['id'],
    'title': movie_new['title'],
    'language': movie_new['original_language']
})
movie_clean

Unnamed: 0,id,title,language
43526,411405,Small Crimes,en
6383,42492,Up the Sandbox,en
3154,12143,Bad Lieutenant,en
10146,9976,Satan's Little Helper,en
9531,46761,Sitcom,fr
...,...,...,...
43661,387399,We Go On,en
19067,66956,This Special Friendship,fr
41958,205236,El taxi de los conflictos,en
33771,70131,Cruel Winter Blues,ko


## 2.3. Handling Missing Value dan Duplikasi

Setelah kolom data difilter, dilakukan penggecekan dan pembersihan terhadap missing values karena baris dengan `Nan` pada kolom tidak bisa digunakan untuk TF-IDF.

In [15]:
movie_clean = movie_clean.dropna()
movie_clean.isna().sum()

Unnamed: 0,0
id,0
title,0
language,0


Setelah kolom Nan dibersihkan, sekarang dilakukan penanganan duplikat agar satu film tidak muncul lebih dari sekali dan mennggangu hasil kemiripan.

In [16]:
movie_clean = movie_clean.drop_duplicates()
movie_clean.shape

(9997, 3)

Setelah kolom duplikat juga dihapus, data tersisa adalah 9.997 entri dan 3 kolom.

# 2.4. TF_IDF Vectorizer

`original_language` berisi kode bahasa film. Meskipun sederhana, kita perlakukan kolom ini sebagai representasi konten dan vektorisasi dilakukan menggunakan TF-IDF di mana TF (Term Frequency) berarti seberapa sering kata muncul, dan IDF (inverse Document Frequency) berarti seberapa unik kata tersebut dibandingkan dokumen lain.

Pertama, objek TF-IDF Vectorizer diinisiasi terlebih dahulu. Lalu model dilatih pada data `language` dan ditampilkan daftar kata unik yang digunakan sebagai fitur.

In [17]:
# Inisialisasi TfidfVectorizer
tf = TfidfVectorizer()

# Melakukan perhitungan idf pada data language
tf.fit(movie_clean['language'])

# Mapping array dari fitur index integer ke fitur nama
tf.get_feature_names_out()

array(['ab', 'am', 'ar', 'ay', 'bg', 'bn', 'bs', 'ca', 'cn', 'cs', 'da',
       'de', 'el', 'en', 'es', 'et', 'fa', 'fi', 'fr', 'he', 'hi', 'hr',
       'hu', 'id', 'is', 'it', 'iu', 'ja', 'jv', 'ka', 'ko', 'lv', 'mk',
       'ml', 'mn', 'mr', 'ms', 'nb', 'nl', 'no', 'pa', 'pl', 'ps', 'pt',
       'ro', 'ru', 'rw', 'sh', 'sk', 'sl', 'sq', 'sr', 'sv', 'ta', 'te',
       'th', 'tl', 'tr', 'uk', 'ur', 'uz', 'vi', 'xx', 'zh'], dtype=object)

Selanjutnya, kita mengubah teks menjadi matriks vektor supaya dapat melatih dan langsung mentransformasikan data ke bentuk matriks TF-IDP. `shape` digunakan untuk melihat dimensi matriks TF-IDF, di mana terlihat adanya 9.997 film dengan 64 jenis bahasa unik.

In [18]:
# Melakukan fit lalu ditransformasikan ke bentuk matrix
tfidf_matrix = tf.fit_transform(movie_clean['language'])

# Melihat ukuran matrix tfidf
tfidf_matrix.shape

(9997, 64)

Lalu, matriks TF-IDF diubah ke bentuk dense (penuh).

In [19]:
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

Terakhir, DataFrame dari TF-IDF dibuat dan ditampilkan 10 baris film dan 22 kolom bahasa secara acak untuk melihat bagaimana bobot TF-IDF dibentuk per bahasa dan per film.

In [20]:
pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=movie_clean.title
).sample(22, axis=1).sample(10, axis=0)

Unnamed: 0_level_0,id,ml,sq,fa,pt,is,fi,ka,sh,bn,...,ru,hu,lv,es,de,el,bs,sr,it,nb
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Black Day Blue Night,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Allnighter,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Monsieur Batignole,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Macbeth,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Go West,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Six by Sondheim,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Roommate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
The Barber,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Affinity,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Fist of the North Star: Legend of Toki,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# 3. Model Development dengan Content Based Filtering

Model Development adalah tahapan di mana kita menggunakan algoritma machine learning untuk menjawab problem statement dari tahap business understanding.

Pada tahap ini, kita akan membangun sistem rekomendasi berbasis konten untuk menyarankan film yang mirip berdasarkan karakteristik tertentu (dalam hal ini: `original_language`) dan menyajikan rekomendasi Top 10 (Top-N Recommendation).

## 3.1. Membuat Matriks Kemiripan

Pertama, data akan membuat matriks kemiripan dimana fungsi akan menghitung cosine similarity antar vektor TF-IDF dari fitur `original_language`, menghasilkan skor kemiripan antara 0 (tidak mirip) sampai 1 (identik).

In [21]:
cosine_sim = cosine_similarity(tfidf_matrix) # hitung cosine similarity antar vektor TF-IDF
cosine_sim

array([[1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       [1., 1., 1., ..., 1., 0., 1.],
       ...,
       [1., 1., 1., ..., 1., 0., 1.],
       [0., 0., 0., ..., 0., 1., 0.],
       [1., 1., 1., ..., 1., 0., 1.]])

## 3.2. Membuat DataFrame dari Matriks Kemiripan

Matriks `cosine_sim` diubah menjadi DataFrame agar mudah diakses berdasarkan `title`.

In [22]:
# Membuat dataframe dari variabel cosine_sim dengan baris dan kolom berupa nama judul
cosine_sim_df = pd.DataFrame(cosine_sim, index=movie_clean['title'], columns=movie_clean['title'])
print('Shape:', cosine_sim_df.shape)

# Melihat similarity matrix pada setiap judul
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (9997, 9997)


title,Once Upon a Time in Anatolia,Comrades in Arms,A Few Good Men,Jimmy Carr: Making People Laugh,James White
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A Woman's Secret,0.0,1.0,1.0,1.0,1.0
Snowball Effect: The Story of Clerks,0.0,1.0,1.0,1.0,1.0
Deadly Weapons,0.0,1.0,1.0,1.0,1.0
Gambling House,0.0,1.0,1.0,1.0,1.0
Avalon,0.0,1.0,1.0,1.0,1.0
Baffled!,0.0,1.0,1.0,1.0,1.0
Playing Beatie Bow,0.0,1.0,1.0,1.0,1.0
Elite Group,0.0,0.0,0.0,0.0,0.0
Sighs of Spain,0.0,1.0,1.0,1.0,1.0
Wild Is the Wind,0.0,1.0,1.0,1.0,1.0


## 3.3. Membuat Fungsi Rekomendasi

Pada tahap ini, dibuat fungsi `recommend_movies` yang berfungsi untuk mengambil skor kemiripan dari film input ke semua film lain, mengurutkan skor, menghapus dirinya sendiri dari daftar, dan mengembalikan **Top-10 film paling mirip** berdasarkan fitur `original_language`.

In [23]:
def recommend_movies(title, cosine_sim=cosine_sim_df):
    # Ambil skor similarity dari judul yang diminta
    sim_scores = list(cosine_sim_df[title].items())

    # Urutkan berdasarkan skor tertinggi, kecuali dirinya sendiri
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Ambil 10 film teratas selain dirinya sendiri
    sim_scores = [item for item in sim_scores if item[0] != title]
    top_titles = [item[0] for item in sim_scores[:10]]

    # Ambil info dari movie_clean berdasarkan judul
    return movie_clean[movie_clean['title'].isin(top_titles)][['id', 'title', 'language']]


## 3.4. Contoh Output Rekomendasi

Di tahap ini, hasil rekomendasi berdasarkan input judul film akan ditampilkan.

In [24]:
recommend_movies("Elephant")

Unnamed: 0,id,title,language
43526,411405,Small Crimes,en
6383,42492,Up the Sandbox,en
3154,12143,Bad Lieutenant,en
10146,9976,Satan's Little Helper,en
33663,268725,Nightlight,en
3396,184885,The Bells,en
33321,306745,Freeheld,en
19045,49014,Cosmopolis,en
2081,46702,Why Do Fools Fall In Love,en
1414,18420,City of Industry,en


Dari hasil rekomendasi, terlihat bahwa **Top-N Recommendation** berhasil disajikan.

# 4. Evaluasi

Tahapan evaluasi dilakukan untuk menilai performa sistem rekomendasi menggunakan metrik evaluasi **Precision@10** dan hasilnya.

## 4.1. Precision@10

In [25]:
def precision_at_k(recommended_titles, true_language, k=10):
    # Ambil hanya k item
    top_k = recommended_titles[:k]

    # Ambil subset dari movie_clean berdasarkan judul
    relevant = movie_clean[movie_clean['title'].isin(top_k)]

    # Hitung jumlah film yang bahasanya sama dengan input
    relevant_count = (relevant['language'] == true_language).sum()

    # Bagi dengan k, bukan panjang asli dari daftar
    return (relevant_count -1)/ k # dikurangi 1 karena biasanya input film ikut direkomendasikan


Pada metrik ini, relevansi didefinisikan sebagai film yang memiliki bahasa sama dengan film input dengan nilai ideal Precision@10 = 1.0 (semua rekomendasi relevan).

## 4.2. Average Cosine Similarity

In [26]:
def average_cosine_similarity(input_title, k=10):
    sim_scores = cosine_sim_df[input_title].sort_values(ascending=False)
    sim_scores = sim_scores.drop(index=input_title)
    return sim_scores.head(k).mean()

Metrik ini menghitung skor rata-rata cosine similarity dari Top-k rekomendasi terhadap film input. Skor mendekati 1 menunjukkan bahwa kemiripan yang tinggi.

## 4.3. Menjalankan Evaluasi

In [27]:
# Uji presisi
input_title = "Elephant"
recommended_df = recommend_movies(input_title)
recommended_titles = recommended_df['title'].tolist()
true_language = movie_clean[movie_clean['title'] == input_title]['language'].values[0]

print("Precision@10:", precision_at_k(recommended_titles, true_language))
print("Rata-rata Cosine Similarity terhadap rekomendasi:", average_cosine_similarity("Elephant"))

Precision@10: 1.0
Rata-rata Cosine Similarity terhadap rekomendasi: 1.0


Dari hasil yang didapat terlihat bahwa semua film rekomendasi memiliki bahasa yang sama, yang berarti sistem memberikan rekomendasi yang sangat mirip dari sisi `original_language`.

Karena hanya menggunakan satu fitur sederhana (bahasa), kemiripan menjadi maksimal tapi kurang kaya dalam makna konten.