# Netflix Recommendation System

ID : M299X0762

Name : Alfin Muhammad Ilmi

Dataset: **Netflix Movies and TV Shows** *accessed from* https://www.kaggle.com/datasets/shivamb/netflix-shows

## Import Libraries

In [None]:
!pip install opendatasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [None]:
import opendatasets as od
import pandas as pd
import numpy as np
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

## Download Dataset

In [None]:
od.download("https://www.kaggle.com/datasets/shivamb/netflix-shows")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: alfinmuhammadilmi
Your Kaggle Key: ··········
Downloading netflix-shows.zip to ./netflix-shows


100%|██████████| 1.34M/1.34M [00:00<00:00, 119MB/s]







## Univariate Exploratory Data Analysis

In [None]:
df = pd.read_csv("/content/netflix-shows/netflix_titles.csv")
print("Shape:", df.shape)
df.head()

Shape: (8807, 12)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


The output of the code above tells us the following:

* The dataset contains 8807 rows (*records*, i.e. observations).
* It has 12 columns (features): `show_id`, `type`, `title`, `director`, `cast`, `country`, `date_added`, `release_year`, `rating`, `duration`, `listed_in`, `description`.

### Variable Descriptions

According to the information on [Kaggle](https://www.kaggle.com/datasets/shivamb/netflix-shows), the variables in the *Netflix Movies and TV Shows* dataset are as follows:

* `show_id` is the ID of each *Movie* / *TV Show*
* `type` identifies whether the title is a *Movie* or a *TV Show*
* `title` is the title of the *Movie* / *TV Show*
* `director` is the name of the director
* `cast` lists the actors involved in the *Movie* / *TV Show*
* `country` is the country where the *Movie* / *TV Show* was produced
* `date_added` is the date the *Movie* / *TV Show* was added to Netflix
* `release_year` is the actual release year of the *Movie* / *TV Show*
* `rating` is the content rating of the *Movie* / *TV Show* (for example `PG-13` or `TV-MA`)
* `duration` is the total duration (in minutes or number of *seasons*)
* `listed_in` is the genre of the *Movie* / *TV Show*
* `description` is a short summary



## Data Preprocessing

### Selecting the features to use

In this case, we recommend movies based on genre only, so we only need the columns (features) `show_id`, `title`, and `listed_in`.

We also keep only the rows of type `Movie`; rows of type `TV Show` are not needed.

In [None]:
df = df.loc[df["type"] == "Movie"]

df_movies = df[["show_id", "title", "listed_in"]]
df_movies.head()

Unnamed: 0,show_id,title,listed_in
0,s1,Dick Johnson Is Dead,Documentaries
6,s7,My Little Pony: A New Generation,Children & Family Movies
7,s8,Sankofa,"Dramas, Independent Movies, International Movies"
9,s10,The Starling,"Comedies, Dramas"
12,s13,Je Suis Karl,"Dramas, International Movies"


## Data Preparation

### Handling Missing Values

In [None]:
df_movies.isnull().sum()

show_id      0
title        0
listed_in    0
dtype: int64

In [None]:
df_movies.isna().sum()

show_id      0
title        0
listed_in    0
dtype: int64

The output above shows that there are no *missing values* in the selected columns.

### Removing duplicate movie titles

Before removing duplicates, we first check whether any duplicates exist.

In [None]:
pd.DataFrame({"Unique title": df_movies['title'].nunique(),
              "Total Data": len(df_movies['title'])}, index=["Jumlah Film"])

Unnamed: 0,Unique title,Total Data
Jumlah Film,6131,6131


The number of distinct `title` values equals the total number of rows. This shows that every row has a different movie title, so there are no duplicates to remove.

## Model Development with Content-Based Filtering

At this stage, the recommendation system is built using the *Cosine Similarity* and *Euclidean Similarity* methods. Before that, the categorical genre text is converted into numeric data using `TfidfVectorizer`.

### TF-IDF Vectorizer

In [None]:
tf = TfidfVectorizer()

# Compute the idf weights on the genre column
tf.fit(df_movies['listed_in'])

# Map the integer feature indices to feature names
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
tf.get_feature_names_out().tolist()



['action',
 'adventure',
 'anime',
 'children',
 'classic',
 'comedies',
 'comedy',
 'cult',
 'documentaries',
 'dramas',
 'faith',
 'family',
 'fantasy',
 'features',
 'fi',
 'horror',
 'independent',
 'international',
 'lgbtq',
 'movies',
 'music',
 'musicals',
 'romantic',
 'sci',
 'spirituality',
 'sports',
 'stand',
 'thrillers',
 'up']
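
The vocabulary above contains fragments such as `sci`, `fi`, `stand`, and `up` because `TfidfVectorizer` splits the text into word tokens rather than treating each comma-separated genre as one unit. A minimal sketch of that behavior, assuming the vectorizer's default settings (lowercasing and the token pattern `(?u)\b\w\w+\b`); the `tokenize` helper is hypothetical and only mimics the default analyzer:

```python
import re

# Assumption: this mirrors TfidfVectorizer's default token pattern,
# which keeps runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


for genre in ["Stand-Up Comedy", "Sci-Fi & Fantasy", "Children & Family Movies"]:
    print(genre, "->", tokenize(genre))
```

If whole genres should be kept as single features instead, passing a custom `tokenizer` that splits on commas is one possible option.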

In [None]:
# Fit, then transform the genre column into a tf-idf matrix
tfidf_matrix = tf.fit_transform(df_movies['listed_in']) 

tfidf_matrix.shape 

(6131, 29)

In [None]:
# Convert the sparse tf-idf matrix into a dense matrix with todense()
tfidf_matrix.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
pd.DataFrame(
    tfidf_matrix.todense(), 
    columns=tf.get_feature_names_out(),
    index=df_movies.title
).sample(10, axis=1).sample(10, axis=0)



title,sci,family,spirituality,children,features,stand,thrillers,movies,comedies,faith
Norm of the North: Keys to the Kingdom,0.0,0.680177,0.0,0.680177,0.0,0.0,0.0,0.273345,0.0,0.0
Andhaghaaram,0.0,0.0,0.0,0.0,0.0,0.0,0.559144,0.435371,0.0,0.0
Familiye,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.541509,0.0,0.0
DreamWorks Home: For the Holidays,0.0,0.431517,0.0,0.431517,0.0,0.0,0.0,0.173415,0.304451,0.0
Apache Warrior,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mr. Virgin,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.350455,0.615263,0.0
Have a Good Trip: Adventures in Psychedelics,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Never Back Down,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.202607,0.0,0.0
Executive Decision,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Five Elements Ninjas,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.275727,0.0,0.0
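
To make the weights above less opaque, the following pure-Python sketch shows how a TF-IDF row can be computed, assuming scikit-learn's default settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The helper names and the three-title corpus are hypothetical and serve only as an illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed default token pattern: runs of two or more word characters
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def tfidf_rows(docs):
    """Return (vocabulary, rows) with smoothed idf and L2-normalized rows."""
    counts_per_doc = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    vocab = sorted({term for counts in counts_per_doc for term in counts})

    # Document frequency: in how many documents each term occurs
    doc_freq = {t: sum(1 for c in counts_per_doc if t in c) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for counts in counts_per_doc:
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocab, rows


docs = ["Comedies, Dramas", "Dramas, International Movies", "Stand-Up Comedy"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Because of the final normalization step, every non-empty row has length 1, which is why the weights in the sample are fractions rather than 0/1 flags. Rarer terms also receive a larger weight than common ones within the same row.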


### Cosine Similarity

The strength of *cosine similarity* is that it does not depend on vector magnitude. That strength can become a weakness in cases where the frequency of a feature matters. In this case, *Cosine Similarity* is safe to use. Note that the tf-idf matrix is not a *one-hot encoding*: as the sample above shows, its weights are fractional values between 0 and 1. However, `TfidfVectorizer` L2-normalizes each row by default, so every title vector has unit length and magnitude carries no extra information.
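
As an illustration of magnitude independence, here is a minimal sketch of the cosine formula in plain Python. The `cosine` helper and the toy vectors are assumptions for demonstration and are not part of the notebook:

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_a * norm_b)


a = [1.0, 1.0, 0.0]
b = [2.0, 2.0, 0.0]  # same direction as a, twice the magnitude
c = [0.0, 0.0, 3.0]  # shares no non-zero feature with a

print(cosine(a, b))  # close to 1: scaling does not change the score
print(cosine(a, c))  # 0.0: no overlap
```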

The implementation uses the `cosine_similarity()` function from the scikit-learn *library*; its computation time is measured below.

In [None]:
def cos_sim_handler(df_tfidf, series_title):
  # Compute cosine similarity on the tf-idf matrix
  cos_sim = cosine_similarity(df_tfidf)

  # Build a dataframe from cos_sim with movie titles as both rows and columns
  df_cos_sim = pd.DataFrame(cos_sim, index=series_title, columns=series_title)

  # Return the similarity matrix for every pair of movies
  return df_cos_sim

In [None]:
start = time.time()
cos_sim_df = cos_sim_handler(tfidf_matrix, df_movies['title'])
cos_exec_time = time.time() - start
print("Exec Time Cosine Similarity (Seconds) :", cos_exec_time)

Exec Time Cosine Similarity (Seconds) : 1.0046014785766602


In [None]:
# Inspect the similarity matrix for a sample of movies
print('Shape:', cos_sim_df.shape)
cos_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (6131, 6131)


title,Chal Mere Bhai,A Futile and Stupid Gesture,Air Force One,Paying Guests,Lowriders
War Against Women,0.352047,0.0,0.0,0.418899,0.119662
God Calling,0.155752,0.0,0.0,0.185328,0.167643
Lupt,0.327769,0.0,0.0,0.346705,0.147201
The Grandmaster,0.254135,0.0,0.818365,0.302394,0.273538
Catch.er,0.291335,0.0,0.0,0.346658,0.313578
The Roommate,0.0,0.0,0.0,0.0,0.0
American Hangman,0.0,0.0,0.0,0.0,0.0
Le serment des Hitler,0.352047,0.0,0.0,0.418899,0.119662
Eve's Apple,0.352047,0.0,0.0,0.418899,0.119662
What Makes a Psychopath?,0.0,0.0,0.0,0.0,0.0


### Euclidean Distance

The strength of Euclidean distance is that it captures the difference between two vectors that point in the same direction but differ in magnitude. Its weakness is that the features with the highest frequency dominate the other features in the computed distance.

The implementation uses the `euclidean_distances()` function from the scikit-learn *library*; its computation time is measured below.

In [None]:
def euc_sim_handler(df_tfidf, series_title):
  # Compute euclidean distance on the tf-idf matrix
  euc_dist = euclidean_distances(df_tfidf)

  # Convert distance to similarity: d = 0 maps to 1, larger d approaches 0
  euc_sim = 1 / (1 + euc_dist)

  # Build a dataframe from euc_sim with movie titles as both rows and columns
  df_euc_sim = pd.DataFrame(euc_sim, index=series_title, columns=series_title)

  # Return the similarity matrix for every pair of movies
  return df_euc_sim

In [None]:
start = time.time()
euc_sim_df = euc_sim_handler(tfidf_matrix, df_movies["title"])
euc_exec_time = time.time() - start
print("Exec Time Euclidean Similarity (Seconds) :", euc_exec_time)

Exec Time Euclidean Similarity (Seconds) : 2.6405744552612305


In [None]:
print('Shape:', euc_sim_df.shape)
euc_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (6131, 6131)


title,Gerald's Game,Krish Trish and Baltiboy: The Greatest Trick,Secret Obsession,Little Singham: Legend of Dugabakka,An Ordinary Man
Alex Strangelove,0.423335,0.424294,0.414214,0.456715,0.443505
Cappuccino,0.425304,0.426481,0.414214,0.467742,0.450564
The Worthy,0.457271,0.419533,0.477729,0.418998,0.467089
Her,0.419036,0.419532,0.414214,0.418997,0.429057
Bo Burnham: Inside,0.414214,0.414214,0.414214,0.414214,0.414214
Bombay Rose,0.43184,0.433772,0.414214,0.431691,0.441234
Skater Girl,0.4284,0.661924,0.414214,0.603386,0.435785
Manorama Six Feet Under,0.513919,0.424372,0.586907,0.423331,0.659798
Ainu Mosir,0.428476,0.430014,0.414214,0.428357,0.462622
Romina,0.52055,0.436058,0.414214,0.433713,0.414214
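
The value 0.414214 recurs throughout the sample above. One way to see why: `TfidfVectorizer` L2-normalizes each row by default, and for unit vectors the squared Euclidean distance equals `2 - 2 * cos`. Two titles with no shared genre token therefore sit at distance `sqrt(2)`, and `1 / (1 + sqrt(2))` is roughly 0.414214. The sketch below checks this in plain Python; the helper names and toy vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    # For unit vectors the dot product is the cosine similarity
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    # Same transform as in euc_sim_handler: 1 / (1 + d)
    return 1 / (1 + euclidean_distance(a, b))


a = l2_normalize([1.0, 2.0, 0.0])
b = l2_normalize([0.0, 0.0, 3.0])  # no shared feature with a
c = l2_normalize([2.0, 1.0, 0.0])  # overlaps with a

for other in (b, c):
    cos = dot(a, other)
    dist_sq = euclidean_distance(a, other) ** 2
    # The two printed numbers agree up to floating-point error
    print(round(dist_sq, 6), round(2 - 2 * cos, 6))

print(round(euclidean_similarity(a, b), 6))
```

Since the distance is a decreasing function of the cosine on unit vectors, both measures order candidate titles in the same way.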


### Getting Recommendations

This stage tests the top-10 recommended movie titles.

In [None]:
def movie_recommendations(judul_film, similarity_data, items=df_movies, k=10):

    # Use argpartition to indirectly partition the similarity column
    # (converted to a numpy array) so the largest values end up last
    index = similarity_data.loc[:,judul_film].to_numpy().argpartition(
        range(-1, -k, -1))

    # Take the titles with the highest similarity from the partitioned index
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # Drop the query title so that it is not recommended to itself
    closest = closest.drop(judul_film, errors='ignore')

    return pd.DataFrame(closest).merge(items).head(k)
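
The core idea of the function above is a top-k selection that skips the query title. As a simplified illustration only, and not a replacement for the notebook's implementation, the same idea can be sketched with the standard library. The `top_k_similar` helper and the toy scores are hypothetical.

```python
import heapq


def top_k_similar(scores, query, k=10):
    """Return the k (title, score) pairs with the highest score, excluding the query."""
    candidates = ((title, score) for title, score in scores.items() if title != query)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Toy similarity scores of every title against the query title "A"
scores = {"A": 1.0, "B": 0.9, "C": 0.2, "D": 0.5}
print(top_k_similar(scores, query="A", k=2))
```

When many titles share the same score, as happens with identical genre strings, which of the tied titles appear in the top k depends on the selection routine.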

In [None]:
df_movies[df_movies["title"].eq('Bo Burnham: Inside')]

Unnamed: 0,show_id,title,listed_in
826,s827,Bo Burnham: Inside,Stand-Up Comedy


#### Recommendations with Cosine Similarity

In [None]:
movie_recommendations(
    judul_film="Bo Burnham: Inside",
    similarity_data=cos_sim_df
)

Unnamed: 0,title,show_id,listed_in
0,Bo Burnham: what.,s4792,Stand-Up Comedy
1,Joe Mande’s Award-Winning Comedy Special,s5370,Stand-Up Comedy
2,Aditi Mittal: Things They Wouldn't Let Me Say,s5372,Stand-Up Comedy
3,Alan Saldaña: Locked Up,s767,Stand-Up Comedy
4,Tom Papa: You're Doing Great!,s2953,Stand-Up Comedy
5,D.L. Hughley: Clear,s5379,Stand-Up Comedy
6,Zach Galifianakis: Live at the Purple Onion,s4081,Stand-Up Comedy
7,"Oh, Hello On Broadway",s5433,Stand-Up Comedy
8,Chris D'Elia: Man on Fire,s5417,Stand-Up Comedy
9,Tom Segura: Completely Normal,s5380,Stand-Up Comedy


#### Recommendations with Euclidean Distance

In [None]:
movie_recommendations(
    judul_film="Bo Burnham: Inside",
    similarity_data=euc_sim_df
)

Unnamed: 0,title,show_id,listed_in
0,Bo Burnham: what.,s4792,Stand-Up Comedy
1,Joe Mande’s Award-Winning Comedy Special,s5370,Stand-Up Comedy
2,Aditi Mittal: Things They Wouldn't Let Me Say,s5372,Stand-Up Comedy
3,Alan Saldaña: Locked Up,s767,Stand-Up Comedy
4,Tom Papa: You're Doing Great!,s2953,Stand-Up Comedy
5,D.L. Hughley: Clear,s5379,Stand-Up Comedy
6,Zach Galifianakis: Live at the Purple Onion,s4081,Stand-Up Comedy
7,"Oh, Hello On Broadway",s5433,Stand-Up Comedy
8,Chris D'Elia: Man on Fire,s5417,Stand-Up Comedy
9,Tom Segura: Completely Normal,s5380,Stand-Up Comedy


## Evaluation

$$\text{Recommender system precision (P)} = \frac{\text{number of relevant recommendations}}{\text{number of recommended items}} \times 100\%$$

The recommendations above show that the movie `Bo Burnham: Inside` belongs to the `Stand-Up Comedy` genre. For the 10 recommended titles, the *precision* of the *cosine similarity* and *euclidean distance* models is as follows.
 
|Model | Relevant | Not Relevant |Total| Precision |
|---|---|---|---|---|
|*Cosine Similarity*|10|0|10|100%|
|*Euclidean Similarity*|10|0|10|100%|
 
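The table above shows that the *Cosine Similarity* and *Euclidean Distance* models have the same precision on the top-10 recommendations. This is expected: `TfidfVectorizer` L2-normalizes each row by default, and for unit vectors the Euclidean distance is a decreasing function of the cosine similarity, so both measures rank titles in the same order.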
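
The precision values in the table can be reproduced with a few lines of Python. This is a minimal sketch that treats a recommendation as relevant when its genre string equals the query's genre; the `precision_percent` helper is hypothetical and that relevance rule is an assumption.

```python
def precision_percent(recommended_genres, target_genre):
    """Share of recommendations whose genre matches the target, in percent."""
    if not recommended_genres:
        return 0.0
    relevant = sum(1 for genre in recommended_genres if genre == target_genre)
    return 100.0 * relevant / len(recommended_genres)


# All 10 recommendations above are listed in "Stand-Up Comedy"
recommended = ["Stand-Up Comedy"] * 10
print(precision_percent(recommended, "Stand-Up Comedy"))
```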

Besides precision, the computation time of each method should also be considered. The comparison is as follows:

In [None]:
df_exec_time_models = pd.DataFrame(index=['Time (Seconds)'],
    columns=['Cosine Similarity', 'Euclidean Similarity'])

df_exec_time_models['Cosine Similarity'] = [cos_exec_time]
df_exec_time_models['Euclidean Similarity'] = [euc_exec_time]

df_exec_time_models

Unnamed: 0,Cosine Similarity,Euclidean Similarity
Time (Seconds),1.004601,2.640574


Based on the output above, Cosine Similarity (1.004601 seconds) computed faster than Euclidean Similarity (2.640574 seconds) in this run.
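
A single wall-clock measurement can vary from run to run. One possible way to get a steadier comparison is to repeat each measurement and keep the best time. The `best_of` helper below is a hypothetical sketch using `time.perf_counter()`, and the workload is a stand-in for a call such as `cos_sim_handler(...)`.

```python
import time


def best_of(func, repeats=5):
    """Run func several times and return the shortest elapsed time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def workload():
    # Stand-in for the similarity computation being timed
    return sum(i * i for i in range(100_000))


print("Best of 5 (seconds):", best_of(workload))
```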