
# Demonstration 1: A Recommender System Example

The dataset used in this project is the [Netflix Movies and TV Shows](https://www.kaggle.com/shivamb/netflix-shows) dataset.

The goal of this project is to build a recommender system that can recommend movies to customers.

## Importing Libraries

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Loading the Dataset

In [None]:
movie_df = pd.read_csv("/content/netflix_titles.csv")
movie_df.sample(10)


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
4269,s4270,TV Show,Jojo's World,,"Tia Lee, In Deok Hwang, Yen-j, Jason Hsu, Andy...",Taiwan,"December 21, 2018",2017,TV-14,1 Season,"International TV Shows, Romantic TV Shows, TV ...",Diagnosed with a condition that could make her...
4667,s4668,Movie,​​Kuch Bheege Alfaaz,Onir,"Geetanjali Thapa, Zain Khan Durrani, Shray Rai...",India,"September 1, 2018",2018,TV-14,110 min,"Dramas, Independent Movies, International Movies",After accidentally connecting over the Interne...
5591,s5592,TV Show,Legend Quest,,"Benny Emmanuel, Mayté Cordeiro, Andrés Couturi...",Mexico,"February 24, 2017",2017,TV-Y7,1 Season,Kids' TV,"When an evil force threatens his village, a gi..."
1326,s1327,Movie,Squared Love,Filip Zylber,"Adrianna Chlebicka, Mateusz Banasiuk, Agnieszk...",Poland,"February 11, 2021",2021,TV-14,102 min,"Comedies, International Movies, Romantic Movies",A celebrity journalist and renowned womanizer ...
8685,s8686,Movie,VS.,Ed Lilly,"Connor Swindells, Fola Evans-Akingbola, Nichol...",United Kingdom,"June 19, 2019",2018,TV-MA,99 min,Dramas,A young man in foster care finds his voice in ...
6567,s6568,Movie,Darna Mana Hai,Prawal Raman,"Aftab Shivdasani, Antara Mali, Boman Irani, Is...",India,"August 1, 2019",2003,TV-MA,116 min,"Horror Movies, International Movies, Thrillers",Stranded in a jungle when their car breaks dow...
5775,s5776,Movie,Pac's Scary Halloween,,"Erin Mathews, Sam Vincent, Andrea Libman, Ashl...",,"October 1, 2016",2016,TV-Y7,44 min,Movies,When sinister Dr. Pacenstein schemes to swap b...
3586,s3587,Movie,The Little Switzerland,Kepa Sojo,"Jon Plazaola, Maggie Civantos, Ingrid García J...",Spain,"August 16, 2019",2019,TV-MA,86 min,"Comedies, International Movies",The discovery of the tomb of William Tell’s so...
5034,s5035,Movie,Agustín Aristarán: Soy Rada,Mariano Baez,Agustín Aristarán,Argentina,"February 16, 2018",2018,TV-MA,60 min,Stand-Up Comedy,"Argentine comedian Agustín ""Radagast"" Aristará..."
715,s716,TV Show,Elite Short Stories: Nadia Guzmán,,"Mina El Hammani, Miguel Bernardeau, Omar Ayuso",,"June 15, 2021",2021,TV-MA,1 Season,"International TV Shows, Romantic TV Shows, Spa...",Nadia feels conflicted about whether or not to...


The `listed_in`, `director`, `country`, and `cast` columns each hold several values separated by commas (","). These values will later be converted to lists.
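
As a minimal sketch of that conversion, the snippet below splits a comma-separated cell into a list on a small toy frame. The name `toy_df` and its contents are hypothetical and only mimic the format of the real columns. Stripping whitespace around each value is an extra step assumed here for tidiness; it is not something the notebook itself does at this point.

```python
import pandas as pd

# Toy rows that mimic the comma-separated format of the real columns
toy_df = pd.DataFrame({
    "country": ["Norway, Iceland, United States", "India"],
})

# Split on commas and strip the whitespace left around each value
toy_df["country"] = toy_df["country"].apply(
    lambda cell: [value.strip() for value in cell.split(",")]
)

print(toy_df["country"].tolist())
```

Without the `strip()` call, every value after the first would keep a leading space, which matters if the values are compared directly.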

## Data Understanding

### Column information

In [None]:
movie_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


The non-null counts differ from column to column, which indicates that the dataset contains missing values.

In [None]:
movie_df['show_id'].is_unique

True

## Checking for missing values

In [None]:
movie_df.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

The output shows a large number of missing values, especially in `director` (2,634 of 8,807 rows, about 30%), `cast` (825 rows), and `country` (831 rows).
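
Raw counts are easier to compare when expressed as a share of all rows. The sketch below shows one way to do that on a hypothetical toy frame (`toy_df` is an illustration, not the Netflix data); the same expression could be applied to `movie_df`.

```python
import pandas as pd

# Toy frame standing in for movie_df; None marks a missing value
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", None, None, "Y"],
    "country": ["India", "Spain", None, "Japan"],
})

# isna().mean() gives the fraction of missing entries per column
missing_share = toy_df.isna().mean().sort_values(ascending=False)

print(missing_share)
```

Sorting in descending order puts the columns that need the most attention at the top.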

### Summary statistics for the numeric columns

In [None]:
movie_df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


## Data Cleansing

### Handling missing values

All missing values are replaced with the placeholder string 'unknown'.

In [None]:
movie_df.fillna('unknown', inplace=True)


In [None]:
movie_df.isna().sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64

### Converting `listed_in`, `director`, `country` & `cast` to lists

In [None]:
column_list = ['listed_in', 'director', 'country', 'cast']

for column in column_list:
    movie_df[column] = movie_df[column].apply(lambda row: row.split(','))

movie_df.sample(10)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
4482,s4483,Movie,ADAM SANDLER 100% FRESH,[Steve Brill],[Adam Sandler],[United States],"October 23, 2018",2018,TV-MA,74 min,[Stand-Up Comedy],"From ""Heroes"" to ""Ice Cream Ladies"" – Adam San..."
4722,s4723,Movie,Fitoor,[Abhishek Kapoor],"[Aditya Roy Kapoor, Katrina Kaif, Tabu, Rah...",[India],"August 2, 2018",2016,TV-14,124 min,"[Dramas, International Movies, Romantic Movies]",A young artist falls for an aristocratic young...
5965,s5966,Movie,22-Jul,[Paul Greengrass],"[Anders Danielsen Lie, Jon Øigarden, Jonas S...","[Norway, Iceland, United States]","October 10, 2018",2018,R,144 min,"[Dramas, Thrillers]","After devastating terror attacks in Norway, a ..."
4932,s4933,Movie,Greg Davies: You Magnificent Beast,[Peter Orton],[Greg Davies],[United Kingdom],"April 10, 2018",2018,TV-MA,66 min,[Stand-Up Comedy],British comedian Greg Davies revisits terrifyi...
3203,s3204,Movie,Why Me?,[Tudor Giurgiu],"[Emilian Oprea, Mihai Constantin, Andreea Va...","[Romania, Bulgaria, Hungary]","December 1, 2019",2015,TV-MA,126 min,"[Dramas, International Movies, Thrillers]",A young prosecutor is assigned a career-making...
8582,s8583,Movie,Thorne: Scaredy Cat,[Benjamin Ross],"[David Morrissey, Eddie Marsan, Aidan Gillen...",[United Kingdom],"November 2, 2016",2010,NR,125 min,"[Dramas, International Movies]",Heading a new team whose aim is to crack the c...
6658,s6659,TV Show,Earth to Luna!,[unknown],"[Angelina Carballo, Raul-Gomez Pina, Eric An...",[Brazil],"April 10, 2020",2014,TV-Y,1 Season,[Kids' TV],Curious about everything and excited about sci...
4852,s4853,TV Show,Trollhunters,[unknown],"[Kelsey Grammer, Anton Yelchin, Charlie Saxt...","[United States, Mexico]","May 25, 2018",2018,TV-Y7,3 Seasons,"[Kids' TV, TV Action & Adventure, TV Sci-Fi ...","After uncovering a mysterious amulet, an avera..."
4954,s4955,Movie,Om Shanti Om,[Farah Khan],"[Shah Rukh Khan, Deepika Padukone, Shreyas T...",[India],"April 1, 2018",2007,TV-14,169 min,"[Comedies, Dramas, International Movies]",Reincarnated 30 years after being killed in a ...
4946,s4947,TV Show,Star Trek: The Next Generation,[unknown],"[Patrick Stewart, Jonathan Frakes, LeVar Bur...",[United States],"April 2, 2018",1993,TV-PG,7 Seasons,"[TV Action & Adventure, TV Sci-Fi & Fantasy]",Decades after the adventures of the original E...


## Data Preprocessing

### Selecting the columns to use as features

In [None]:
feature_df = movie_df[['title', 'director', 'cast', 'country', 'listed_in']]


In [None]:
feature_df.sample(5)

Unnamed: 0,title,director,cast,country,listed_in
379,Tattoo Redo,[unknown],[unknown],[unknown],[Reality TV]
5931,The Battered Bastards of Baseball,"[Chapman Way, Maclain Way]","[Todd Field, Kurt Russell]",[United States],"[Documentaries, Sports Movies]"
7457,Mini Wolf,[unknown],[unknown],[France],[Kids' TV]
3417,Ghosts of Sugar Land,[Bassam Tariq],[unknown],[United States],[Documentaries]
4822,Us and Them,[Rene Liu],"[Jing Boran, Zhou Dongyu, Zhuangzhuang Tian,...",[China],"[Dramas, International Movies, Romantic Movies]"


## Using CountVectorizer

In [None]:
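def sanitize(x):
    """Lowercase values and remove spaces so each name becomes a single token."""
    # If the cell contains a list, normalize every element
    if isinstance(x, list):
        return [str(i).replace(' ', '').lower() for i in x]
    # If the cell contains a single value, wrap it in a one-element list
    return [str(x).replace(' ', '').lower()]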


In [None]:
feature_column = ['director', 'cast', 'country', 'listed_in']

# Work on an explicit copy so the assignments below do not raise
# pandas' SettingWithCopyWarning (feature_df was sliced from movie_df)
feature_df = feature_df.copy()
for column in feature_column:
    feature_df[column] = feature_df[column].apply(sanitize)


In [None]:
# Combine all features into one space-separated string per title
def soup_feature(x):
    return ' '.join(x['director']) + ' ' + ' '.join(x['cast']) + ' ' + ' '.join(x['country']) + ' ' + ' '.join(x['listed_in'])


In [None]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
feature_df = feature_df.assign(soup=feature_df.apply(soup_feature, axis=1))


In [None]:
feature_df['soup']


0       kirstenjohnson unknown unitedstates documentaries
1       unknown amaqamata khosingema gailmabalane thab...
2       julienleclercq samibouajila tracygotoas samuel...
3            unknown unknown unknown docuseries realitytv
4       unknown mayurmore jitendrakumar ranjanraj alam...
                              ...                        
8802    davidfincher markruffalo jakegyllenhaal robert...
8803    unknown unknown unknown kids'tv koreantvshows ...
8804    rubenfleischer jesseeisenberg woodyharrelson e...
8805    peterhewitt timallen courteneycox chevychase k...
8806    mozezsingh vickykaushal sarah-janedias raaghav...
Name: soup, Length: 8807, dtype: object

In [None]:
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(feature_df['soup'])

print(count)
print(count_matrix.shape)

CountVectorizer(stop_words='english')
(8807, 41989)
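
To make the vectorization step more concrete, here is a small sketch on two hypothetical "soup" strings written in the same style as `feature_df['soup']`. The names `toy_soups`, `vectorizer`, `matrix`, and `vocab` are illustrative only. It highlights two things worth knowing: with its default token pattern, `CountVectorizer` keeps only runs of two or more word characters, so hyphens, apostrophes, and ampersands split a sanitized name into several tokens; and the `unknown` placeholder is counted like any other word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical soups in the same style as feature_df['soup']
toy_soups = [
    "sang-hoyeon gongyoo southkorea action&adventure",
    "unknown unknown unitedstates kids'tv",
]

vectorizer = CountVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(toy_soups)

# vocabulary_ maps each learned token to its column in the matrix
vocab = vectorizer.vocabulary_
print(sorted(vocab))

# Number of times the placeholder token occurs in the second soup
print(matrix[1, vocab["unknown"]])
```

Because the placeholder is a regular token, two titles that both lack a director or cast share the word `unknown`, which nudges their similarity upward. Whether that is desirable depends on the use case.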


## Cosine Similarity

In [None]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

#cosine_sim
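
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below checks that definition against `cosine_similarity` on two hypothetical count vectors (`a` and `b` are made-up values over an imaginary four-word vocabulary).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors over a hypothetical 4-word vocabulary
a = np.array([[1, 1, 0, 2]])
b = np.array([[1, 0, 1, 2]])

# Manual formula: dot product divided by the product of the vector norms
manual = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# scikit-learn computes the same quantity
library = cosine_similarity(a, b)[0, 0]

print(round(manual, 4), round(library, 4))
```

Note that the full matrix computed above has 8807 × 8807 entries, which is roughly 620 MB if stored as 64-bit floats. If memory is tight, one possible alternative is to compute a single row on demand with `cosine_similarity(count_matrix[idx], count_matrix)`.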

## Getting Recommendations

In [None]:
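# Map each title to its row position; keep the first row when a title is duplicated
indices = pd.Series(feature_df.index, index=feature_df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def movie_recommendations(title, n=5):
    idx = indices[title]

    # Pair every row position with its similarity to the requested title
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the requested title itself so it is never recommended back
    sim_scores = [s for s in sim_scores if s[0] != idx]

    # Sort from highest to lowest similarity and keep the top n
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:n]

    movie_indices = [i[0] for i in sim_scores]

    return movie_df.iloc[movie_indices]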



In [None]:
movie_df[movie_df['title'] == "Train to Busan"]


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
8613,s8614,Movie,Train to Busan,[Sang-ho Yeon],"[Gong Yoo, Yu-mi Jung, Dong-seok Ma, Soo-an...",[South Korea],"March 18, 2017",2016,TV-MA,118 min,"[Action & Adventure, Horror Movies, Internat...","As a zombie outbreak sweeps the country, a dad..."


In [None]:
rec = movie_recommendations(title="Train to Busan").reset_index(drop=True)
rec[['title', 'director', 'cast', 'country', 'listed_in']]


Unnamed: 0,title,director,cast,country,listed_in
0,Psychokinesis,[Sang-ho Yeon],"[Ryu Seung-ryong, Shim Eun-kyung, Jung-min P...",[South Korea],"[Action & Adventure, Comedies, International..."
1,Steel Rain,[Yang Woo-seok],"[Woo-sung Jung, Do-won Kwak, Kap-soo Kim, W...",[South Korea],"[Action & Adventure, Dramas, International M..."
2,Master,[Ui-seok Jo],"[Byung-hun Lee, Dong-won Gang, Woo-bin Kim, ...",[South Korea],"[Action & Adventure, International Movies]"
3,Sol Levante,[Akira Saitoh],[unknown],[Japan],"[Action & Adventure, Anime Features, Interna..."
4,Abdo Mota,[unknown],[Mohamed Ramadan],[Egypt],"[Action & Adventure, Dramas, International M..."


## Evaluation

In [None]:
# Function to compute precision: relevant recommendations / total recommendations
def rec_precision(num_relevant_recomendation, num_items_recommended):
    return num_relevant_recomendation/num_items_recommended

In [None]:
precision = rec_precision(3, 5)
print('Precision = ', precision)

Precision =  0.6
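
The notebook does not say how the 3 relevant recommendations were identified. As an illustration only, the sketch below counts relevance under one assumed criterion: a recommendation is relevant when it comes from the same country as the query title. The variable names are hypothetical, and the country values are copied from the "Train to Busan" output above.

```python
# Countries of the five recommendations shown for "Train to Busan"
recommended_countries = ["South Korea", "South Korea", "South Korea", "Japan", "Egypt"]

# ASSUMPTION: a recommendation is relevant when it shares the query's country.
# The source does not state its relevance criterion; this is one possibility.
query_country = "South Korea"

num_relevant = sum(country == query_country for country in recommended_countries)
precision_at_k = num_relevant / len(recommended_countries)

print(num_relevant, precision_at_k)
```

Under this assumption the count happens to match the 3 out of 5 used above, but a different criterion (shared genre or shared director, for example) could give a different score.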
