**Nama: Cindy Deviana Atmakusuma**

Proyek ini menggunakan dataset dari Kaggle untuk memberikan rekomendasi film berdasarkan kemiripan genre dan juga berdasarkan rating yang diberikan oleh pengguna

Link: https://www.kaggle.com/datasets/gargmanas/movierecommenderdataset

# **Import Library yang digunakan**

In [1]:
# Install public API Kaggle
!pip install -q kaggle

In [2]:
# Impor semua library yang diperlukan
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# Impor library TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

# Impor library Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity

# Impor library K-Nearest Neighbor
from sklearn.neighbors import KNeighborsClassifier

# Impor library SVD
from scipy.sparse.linalg import svds

# Impor library MSE
from sklearn.metrics import mean_squared_error

# **Data Understanding**

Merupakan tahapan yang dilakukan untuk memahami informasi pada data dan juga berguna dalam menentukan kualitas data yang didapatkan.

**1. Data Loading**

Tahapan ini memuat dataset yang digunakan agar dapat dipahami.

In [3]:
# Membuat direktori baru bernama kaggle
!rm -rf ~/.kaggle && mkdir ~/.kaggle/

# Menyalin berkas kaggle.json pada direktori aktif saat ini ke direktori kaggle
!mv kaggle.json ~/.kaggle/kaggle.json

# Mengubah permission berkas
!chmod 600 ~/.kaggle/kaggle.json

# Download dataset
!kaggle datasets download -d gargmanas/movierecommenderdataset

# Ekstrak berkas zip
!unzip /content/movierecommenderdataset.zip

Downloading movierecommenderdataset.zip to /content
  0% 0.00/846k [00:00<?, ?B/s]
100% 846k/846k [00:00<00:00, 31.2MB/s]
Archive:  /content/movierecommenderdataset.zip
  inflating: movies.csv              
  inflating: ratings.csv             


In [4]:
movies = pd.read_csv('/content/movies.csv')
ratings = pd.read_csv('/content/ratings.csv')

print('Jumlah data movie: ', len(movies.movieId.unique()))
print('Jumlah data user yang memberikan rating: ', len(ratings.userId.unique()))
print('Jumlah data movie yang telah diberikan peringkat: ', len(ratings.movieId.unique()))

Jumlah data movie:  9742
Jumlah data user yang memberikan rating:  610
Jumlah data movie yang telah diberikan peringkat:  9724


**2. Univariate Exploratory Data Analysis**

Variabel-variabel pada dataset adalah sebagai berikut:

Pada file movies.csv terdapat tiga fitur sebagai berikut:
1. movieId yang merupakan ID unik untuk setiap movie
2. title yang merupakan judul movie
3. genres yang merupakan genre movie

In [5]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


Pada file ratings.csv terdapat empat fitur sebagai berikut:

1. userId yaitu ID pengguna yang memberikan rating
2. movieId yaitu ID film yang diberikan rating
3. rating yaitu rating yang diberikan oleh pengguna
4. timestamp yaitu waktu dimana peringkat telah diberikan

In [6]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [7]:
ratings.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100836.0,100836.0,100836.0,100836.0
mean,326.127564,19435.295718,3.501557,1205946000.0
std,182.618491,35530.987199,1.042529,216261000.0
min,1.0,1.0,0.5,828124600.0
25%,177.0,1199.0,3.0,1019124000.0
50%,325.0,2991.0,3.5,1186087000.0
75%,477.0,8122.0,4.0,1435994000.0
max,610.0,193609.0,5.0,1537799000.0


# **Data Preprocessing**

**1. Menggabungkan Movie**

Pada tahap ini menggabungkan file movies.csv dan ratings.csv dengan fungsi concatenate berdasarkan pada movieId, dan diperoleh seluruh data pada variabel all_movies.

In [8]:
# Menggabungkan seluruh movieId pada kategori movie
all_movies = np.concatenate((
    movies.movieId.unique(),
    ratings.movieId.unique()
))

# Mengurutkan data dan menghapus data yang sama
all_movies = np.sort(np.unique(all_movies))

print('Jumlah seluruh data movie berdasarkan pada movieId: ', len(all_movies))

Jumlah seluruh data movie berdasarkan pada movieId:  9742


In [9]:
# Menghitung jumlah keseluruhan user
all_users = ratings['userId'].unique()

print('Jumlah seluruh user: ', len(all_users))

Jumlah seluruh user:  610


In [10]:
# Menggabungkan file movies dan ratings ke dalam dataframe movies_info serta menggabungkan dataframe ratings dengan movies_info berdasarkan nilai movieId
movies_info = pd.concat([movies, ratings])
df_movie = pd.merge(ratings, movies_info, on='movieId', how='left')
df_movie

Unnamed: 0,userId_x,movieId,rating_x,timestamp_x,title,genres,userId_y,rating_y,timestamp_y
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,,
1,1,1,4.0,964982703,,,1.0,4.0,9.649827e+08
2,1,1,4.0,964982703,,,5.0,4.0,8.474350e+08
3,1,1,4.0,964982703,,,7.0,4.5,1.106636e+09
4,1,1,4.0,964982703,,,15.0,2.5,1.510578e+09
...,...,...,...,...,...,...,...,...,...
6025531,610,168252,5.0,1493846352,,,610.0,5.0,1.493846e+09
6025532,610,170875,3.0,1493846415,The Fate of the Furious (2017),Action|Crime|Drama|Thriller,,,
6025533,610,170875,3.0,1493846415,,,50.0,1.0,1.514498e+09
6025534,610,170875,3.0,1493846415,,,249.0,3.0,1.505165e+09


Dilakukan pula cek missing value karena terdapat banyak missing value seperti yang dapat dilihat diatas

In [11]:
df_movie.isnull().sum()

userId_x             0
movieId              0
rating_x             0
timestamp_x          0
title          5924700
genres         5924700
userId_y        100836
rating_y        100836
timestamp_y     100836
dtype: int64

Menggabungkan rating berdasarkan movieId

In [12]:
df_movie.groupby('movieId').sum()

  df_movie.groupby('movieId').sum()


Unnamed: 0_level_0,userId_x,rating_x,timestamp_x,userId_y,rating_y,timestamp_y
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,14235264,182088.0,52469522383464,14169360.0,181245.0,5.222661e+13
2,4023861,41902.5,13868182748742,3987610.0,41525.0,1.374324e+13
3,781591,8983.5,2770083922458,766844.0,8814.0,2.717818e+12
4,12312,132.0,50320416384,10773.0,115.5,4.403036e+10
5,733950,7525.0,2432027629700,719271.0,7374.5,2.383387e+12
...,...,...,...,...,...,...
193581,368,8.0,3074218164,184.0,4.0,1.537109e+09
193583,368,7.0,3074219090,184.0,3.5,1.537110e+09
193585,368,7.0,3074219610,184.0,3.5,1.537110e+09
193587,368,7.0,3074220042,184.0,3.5,1.537110e+09


**2. Menggabungkan Data dengan Fitur Nama Movie**

Disini didefinisikan pula all_rate_movies dengan variabel ratings

In [13]:
all_rate_movies = ratings
all_rate_movies

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [14]:
# Menggabungkan all_rate_movies dengan dataframe movies berdasarkan movieId
all_movies = pd.merge(all_rate_movies, movies[['movieId','title','genres']], on='movieId', how='left')
all_movies

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100831,610,166534,4.0,1493848402,Split (2017),Drama|Horror|Thriller
100832,610,168248,5.0,1493850091,John Wick: Chapter Two (2017),Action|Crime|Thriller
100833,610,168250,5.0,1494273047,Get Out (2017),Horror
100834,610,168252,5.0,1493846352,Logan (2017),Action|Sci-Fi


# **Data Preparation**

Mengatasi Missing Value

In [15]:
# Mengecek adanya data null
all_movies.isnull().sum()

userId       0
movieId      0
rating       0
timestamp    0
title        0
genres       0
dtype: int64

In [16]:
all_movies.sort_values('movieId', ascending=True)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
81531,517,1,4.0,1487954343,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
30517,213,1,3.5,1316196157,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
81082,514,1,4.0,1533872400,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
30601,214,1,3.0,853937855,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
27256,184,193581,4.0,1537109082,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
27257,184,193583,3.5,1537109545,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
27258,184,193585,3.5,1537109805,Flint (2017),Drama
27259,184,193587,3.5,1537110021,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Mengecek jumlah movie secara keseluruhan

In [17]:
len(all_movies.movieId.unique())

9724

Membuat variabel preparation lalu mengurutkan data tersebut berdasarkan movieId

In [18]:
dataprep = all_movies
dataprep.sort_values('movieId')

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
81531,517,1,4.0,1487954343,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
30517,213,1,3.5,1316196157,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
81082,514,1,4.0,1533872400,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
30601,214,1,3.0,853937855,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
...,...,...,...,...,...,...
27256,184,193581,4.0,1537109082,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
27257,184,193583,3.5,1537109545,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
27258,184,193585,3.5,1537109805,Flint (2017),Drama
27259,184,193587,3.5,1537110021,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Menghapus data duplikat dengan fungsi drop_duplicates() berdasarkan movieId

In [19]:
dataprep = dataprep.drop_duplicates('movieId')
dataprep

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...,...,...,...
100820,610,160341,2.5,1479545749,Bloodmoon (1997),Action|Thriller
100821,610,160527,4.5,1479544998,Sympathy for the Underdog (1971),Action|Crime|Drama
100823,610,160836,3.0,1493844794,Hazard (2005),Action|Drama|Thriller
100827,610,163937,3.5,1493848789,Blair Witch (2016),Horror|Thriller


In [20]:
# Mengonversi data series 'movieId' menjadi dalam bentuk list
movie_id = dataprep['movieId'].tolist()

# Mengonversi data series 'title' menjadi dalam bentuk list
movie_name = dataprep['title'].tolist()

# Mengonversi data series 'genres' menjadi dalam bentuk list
movie_genre = dataprep['genres'].tolist()

print(len(movie_id))
print(len(movie_name))
print(len(movie_genre))

9724
9724
9724


In [21]:
# Membuat dictionary untuk data 'movie_id', 'movie_name', dan 'movie_genre'
movie_new = pd.DataFrame({
    'id': movie_id,
    'movie_name': movie_name,
    'genre': movie_genre
})
movie_new

Unnamed: 0,id,movie_name,genre
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,3,Grumpier Old Men (1995),Comedy|Romance
2,6,Heat (1995),Action|Crime|Thriller
3,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
...,...,...,...
9719,160341,Bloodmoon (1997),Action|Thriller
9720,160527,Sympathy for the Underdog (1971),Action|Crime|Drama
9721,160836,Hazard (2005),Action|Drama|Thriller
9722,163937,Blair Witch (2016),Horror|Thriller


# **Model Development & Evaluation**

Pada tahap ini, model machine learning yang akan dikembangkan yaitu Content-Based Filtering dan juga Collaborative Filtering. Untuk Content-Based Filtering, model dibuat dengan pendekatan Cosine Similarity dan K-Nearest Neighbor. Sedangkan untuk Collaborative Filtering, model dibuat dengan pendekatan Singular Value Decomposition. Content-Based Filtering melibatkan genre film yang disukai pengguna lalu akan diberikan rekomendasi film berdasarkan genre film yang disukai pengguna di masa lalu. Pada Collaborative Filtering melibatkan tingkat rating yang diberikan oleh pengguna pada movie.

**1. Content Based Filtering (Cosine Similarity)**

In [22]:
# Inisialisasi TfidfVectorizer
tf = TfidfVectorizer()

# Melakukan perhitungan idf pada data genre
tf.fit(movie_new['genre'])

# Mapping array dari fitur index integer ke fitur nama
tf.get_feature_names_out()

array(['action', 'adventure', 'animation', 'children', 'comedy', 'crime',
       'documentary', 'drama', 'fantasy', 'fi', 'film', 'genres',
       'horror', 'imax', 'listed', 'musical', 'mystery', 'no', 'noir',
       'romance', 'sci', 'thriller', 'war', 'western'], dtype=object)

In [23]:
# Melakukan fit lalu ditransformasikan ke bentuk matrix
tfidf_matrix = tf.fit_transform(movie_new['genre'])

# Melihat ukuran matrix tfidf
tfidf_matrix.shape

(9724, 24)

In [24]:
# Mengubah vektor tf-idf dalam bentuk matriks dengan fungsi todense()
tfidf_matrix.todense()

matrix([[0.        , 0.41681721, 0.51634045, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.54896727, 0.        , 0.        , ..., 0.54222422, 0.        ,
         0.        ],
        ...,
        [0.64123095, 0.        , 0.        , ..., 0.63335461, 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.62477687, 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [25]:
# Membuat dataframe untuk melihat tf-idf matrix
pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tf.get_feature_names_out(),
    index=movie_new.movie_name
).sample(22, axis=1).sample(10, axis=0)

Unnamed: 0_level_0,fantasy,war,genres,thriller,documentary,film,action,romance,crime,children,...,no,drama,adventure,animation,imax,horror,comedy,western,fi,listed
movie_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"Thing, The (1982)",0.0,0.0,0.0,0.386198,0.0,0.0,0.391001,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.482644,0.0,0.0,0.482195,0.0
Pan (2015),0.593769,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.620425,...,0.0,0.0,0.512358,0.0,0.0,0.0,0.0,0.0,0.0,0.0
13 Assassins (Jûsan-nin no shikaku) (2010),0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
In Love and War (1996),0.0,0.833408,0.0,0.0,0.0,0.0,0.0,0.552659,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Closer (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.841412,0.0,0.0,...,0.0,0.540394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Every Secret Thing (2014),0.0,0.0,0.0,0.449392,0.0,0.0,0.0,0.0,0.527199,0.0,...,0.0,0.307393,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Lola Montès (1955),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Police Academy (1984),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.845832,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.533449,0.0,0.0,0.0
Blow Out (1981),0.0,0.0,0.0,0.567276,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Return to Me (2000),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.841412,0.0,0.0,...,0.0,0.540394,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [26]:
# Menghitung cosine similarity pada matrix tf-idf
cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.        , 0.15262722, 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.15262722, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 1.        , ..., 0.69543501, 0.33876915,
        0.        ],
       ...,
       [0.        , 0.        , 0.69543501, ..., 1.        , 0.39570531,
        0.        ],
       [0.        , 0.        , 0.33876915, ..., 0.39570531, 1.        ,
        0.78080334],
       [0.        , 0.        , 0.        , ..., 0.        , 0.78080334,
        1.        ]])

In [27]:
# Membuat dataframe dari variabel cosine_sim dengan baris dan kolom berupa nama movie
cosine_sim_df = pd.DataFrame(cosine_sim, index=movie_new['movie_name'], columns=movie_new['movie_name'])
print('Shape:', cosine_sim_df.shape)

# Melihat similarity matrix pada setiap movie
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (9724, 9724)


movie_name,Unstrung Heroes (1995),"Vanishing, The (Spoorloos) (1988)","Day of the Beast, The (Día de la Bestia, El) (1995)",Mrs. Henderson Presents (2005),"Believers, The (1987)"
movie_name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
American Heist (2015),0.0,0.0,0.0,0.0,0.0
Junior (1994),0.28377,0.0,0.168594,0.28377,0.0
Last Vegas (2013),0.687253,0.263397,0.220148,0.687253,0.0
D.O.A. (1950),0.13402,0.111461,0.0,0.13402,0.0
High Heels (Tacones lejanos) (1991),1.0,0.38326,0.320331,1.0,0.0
Michael Jackson's Thriller (1983),0.0,0.0,0.0,0.0,0.780803
Eyes of Laura Mars (1978),0.0,0.468219,0.334483,0.0,0.354421
Suture (1993),0.0,0.255585,0.182583,0.0,0.193466
April Morning (1988),0.678847,0.564576,0.0,0.678847,0.0
Pursuit of Happiness (2001),0.41893,0.0,0.248896,0.41893,0.0


Tahap ini yaitu membuat fungsi cosim_movie_recommendation dengan parameter-parameter berikut:
- nama_movie yaitu nama judul movie (index kemiripan dataframe)
- similarity_data yaitu dataframe mengenai similarity yang telah didefinisikan sebelumnya
- items yaitu nama dan fitur yang digunakan untuk mendefinisikan kemiripan, disini adalah 'movie_name' dan 'genre'
- k yaitu banyak rekomendasi yang diinginkan

In [28]:
def cosim_movie_recommendations(nama_movie, similarity_data=cosine_sim_df, items=movie_new[['movie_name','genre']], k=5):
  index = similarity_data.loc[:,nama_movie].to_numpy().argpartition(
      range(-1, -k, -1))

  # Mengambil data dengan similarity terbesar dari index yang ada
  closest = similarity_data.columns[index[-1:-(k+2):-1]]

  # Drop nama_movie agar nama movie yang dicari tidak muncul dalam daftar rekomendasi
  closest = closest.drop(nama_movie, errors='ignore')

  return pd.DataFrame(closest).merge(items).head(k)

Menemukan rekomendasi film yang mirip dengan John Wick: Chapter Two (2017)

In [29]:
movie_new[movie_new.movie_name.eq('John Wick: Chapter Two (2017)')]

Unnamed: 0,id,movie_name,genre
2035,168248,John Wick: Chapter Two (2017),Action|Crime|Thriller


Mendapatkan rekomendasi film yang mirip dengan John Wick: Chapter Two (2017).

In [30]:
cosim_movie_recommendations('John Wick: Chapter Two (2017)')

Unnamed: 0,movie_name,genre
0,Furious 7 (2015),Action|Crime|Thriller
1,Crimson Rivers 2: Angels of the Apocalypse (Ri...,Action|Crime|Thriller
2,Dirty Harry (1971),Action|Crime|Thriller
3,Takers (2010),Action|Crime|Thriller
4,Natural Born Killers (1994),Action|Crime|Thriller


Berdasarkan hasil rekomendasi tersebut, diketahui bahwa John Wick: Chapter Two (2017) termasuk ke dalam genre Action|Crime|Thriller. Dari 5 item yang direkomendasikan, 5 item memiliki genre yang sama dengan yang dicari yaitu Action|Crime|Thriller (similar). Dengan demikian, jika diukur menggunakan metrik Precision maka dapat diketahui bahwa hasil precision sistem sebesar 5/5 atau 100%.

**2. Content Based Filtering (K-Nearest Neighbor)**



In [31]:
# Menggunakan matriks tf-idf sebagai fitur dan nama film sebagai label
X = np.asarray(tfidf_matrix.todense())
y = movie_new['movie_name']

# Inisialisasi KNN dengan jumlah tetangga (k) yang diinginkan
knn = KNeighborsClassifier(n_neighbors=5)

# Melatih model KNN
knn.fit(X, y)

# Fungsi untuk mendapatkan rekomendasi film berdasarkan KNN
def knn_movie_recommendations(nama_movie, k=5):
    # Mencari index dari film yang dicari
    index = np.where(y == nama_movie)[0]

    # Menggunakan model KNN untuk memprediksi film yang paling mirip
    distances, indices = knn.kneighbors(X[index], n_neighbors=k+1)

    # Mengambil nama film yang paling mirip
    closest_movies = y[indices[0][1:]]

    # Mengambil genre dari film yang paling mirip
    closest_movies_genres = movie_new.loc[movie_new['movie_name'].isin(closest_movies), ['movie_name', 'genre']]

    return closest_movies_genres

Mendapatkan rekomendasi film yang mirip dengan John Wick: Chapter Two (2017).

In [32]:
knn_movie_recommendations('John Wick: Chapter Two (2017)')

Unnamed: 0,movie_name,genre
33,Batman (1989),Action|Crime|Thriller
227,Shaft (2000),Action|Crime|Thriller
234,Kill Bill: Vol. 1 (2003),Action|Crime|Thriller
531,Die Hard: With a Vengeance (1995),Action|Crime|Thriller
539,"Net, The (1995)",Action|Crime|Thriller


Berdasarkan hasil rekomendasi tersebut, diketahui bahwa John Wick: Chapter Two (2017) termasuk ke dalam genre Action|Crime|Thriller. Dari 5 item yang direkomendasikan, 5 item memiliki genre yang sama dengan yang dicari yaitu Action|Crime|Thriller (similar). Dengan demikian, jika diukur menggunakan metrik Precision maka dapat diketahui bahwa hasil precision sistem sebesar 5/5 atau 100%.

**3. Collaborative Filtering (SVD)**

In [33]:
# Membaca dataset
df = ratings
df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [34]:
# Mengubah userId dan movieId menjadi list tanpa nilai yang sama
user_ids = df['userId'].unique().tolist()
movie_ids = df['movieId'].unique().tolist()

# Melakukan encoding userId dan movieId
user_to_user_encoded = {x: i for i, x in enumerate(user_ids)}
movie_to_movie_encoded = {x: i for i, x in enumerate(movie_ids)}

# Melakukan proses encoding angka ke userId dan movieId
user_encoded_to_user = {i: x for i, x in enumerate(user_ids)}
movie_encoded_to_movie = {i: x for i, x in enumerate(user_ids)}

print('list userId: ', user_ids)
print('encoded userId : ', user_to_user_encoded)
print('encoded angka ke userId: ', user_encoded_to_user)
print('list movieId: ', movie_ids)
print('encoded movieId : ', movie_to_movie_encoded)
print('encoded angka ke movieId: ', movie_encoded_to_movie)

list userId:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219

In [35]:
# Mendapatkan jumlah user
num_users = len(user_to_user_encoded)
print(num_users)

# Mendapatkan jumlah movie
num_movie = len(movie_encoded_to_movie)
print(num_movie)

610
610


In [36]:
# Mengubah rating menjadi nilai float
df['ratings'] = df['rating'].values.astype(np.float32)

In [37]:
# Membuat matriks rating
num_users = len(user_to_user_encoded)
num_movies = len(movie_to_movie_encoded)

In [38]:
# Mengisi matriks rating dengan 0
rating_matrix = np.zeros((num_users, num_movies))

In [39]:
# Mengisi matriks rating dengan rating yang ada
for index, row in df.iterrows():
    user_encoded = user_to_user_encoded[row['userId']]
    movie_encoded = movie_to_movie_encoded[row['movieId']]
    rating_matrix[user_encoded, movie_encoded] = row['rating']

In [40]:
# Melakukan SVD dan menghitung MSE
def calculate_mse(k):
    U, sigma, Vt = svds(rating_matrix, k=k)
    predicted_ratings = np.dot(np.dot(U, np.diag(sigma)), Vt)
    mse = mean_squared_error(rating_matrix, predicted_ratings)
    return mse

In [41]:
# Mencari nilai k yang memberikan MSE terkecil
best_k = 1
best_mse = float('inf')
for k in range(1, 51):
    mse = calculate_mse(k)
    if mse < best_mse:
        best_k = k
        best_mse = mse

print(f"Best k: {best_k}, MSE: {best_mse}")

Best k: 50, MSE: 0.09369500374812553


In [42]:
# Melakukan SVD dengan nilai k terbaik
U, sigma, Vt = svds(rating_matrix, k=best_k)
predicted_ratings = np.dot(np.dot(U, np.diag(sigma)), Vt)

In [43]:
def recommend_movies(user_id, num_recommendations=10):
    user_encoded = user_to_user_encoded[user_id]
    user_ratings = predicted_ratings[user_encoded]
    sorted_indices = np.argsort(user_ratings)[::-1]

    top_movie_indices = sorted_indices[:num_recommendations]

    recommended_movie_ids = [movie_encoded_to_movie[i] for i in top_movie_indices if i in movie_encoded_to_movie]

    recommended_movie_names = [movies[movies['movieId'] == movie_id]['title'].values[0] if not movies[movies['movieId'] == movie_id]['title'].empty else None for movie_id in recommended_movie_ids]

    recommended_movie_names = [name for name in recommended_movie_names if name is not None]

    recommended_movies_df = pd.DataFrame(recommended_movie_names, columns=['Movie Title'])

    return recommended_movies_df

In [44]:
user_id = 151
recommended_movies_df = recommend_movies(user_id)
print(f"Recommended movies for user {user_id}:")
print(recommended_movies_df)

Recommended movies for user 151:
                                         Movie Title
0                                 Restoration (1995)
1                                   Toy Story (1995)
2                             Dead Presidents (1995)
3                                It Takes Two (1995)
4                              Eye for an Eye (1996)
5                                 Richard III (1995)
6  Léon: The Professional (a.k.a. The Professiona...
7                             Renaissance Man (1994)
