# Recommendation System Project: Movie Recommendation

- Name: Usamah Putra Firdaus
- Email: usamahfirdaa@gmail.com
- Dicoding ID: Usamah Putra Firdaus

In [1]:
!pip install kaggle



# **Import Library**

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# **Data Load**

In [3]:
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"usamahptrf","key":"<redacted>"}'}

In [4]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [5]:
!kaggle datasets download -d nicoletacilibiu/movies-and-ratings-for-recommendation-system
!unzip movies-and-ratings-for-recommendation-system.zip

Dataset URL: https://www.kaggle.com/datasets/nicoletacilibiu/movies-and-ratings-for-recommendation-system
License(s): CC0-1.0
Downloading movies-and-ratings-for-recommendation-system.zip to /content
  0% 0.00/846k [00:00<?, ?B/s]
100% 846k/846k [00:00<00:00, 682MB/s]
Archive:  movies-and-ratings-for-recommendation-system.zip
  inflating: movies.csv              
  inflating: ratings.csv             


In [6]:
movies_df = pd.read_csv('movies.csv')
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
ratings_df = pd.read_csv('ratings.csv')
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


# **Data Understanding**

## **Data Basic Information**

In [8]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9742 entries, 0 to 9741
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  9742 non-null   int64 
 1   title    9742 non-null   object
 2   genres   9742 non-null   object
dtypes: int64(1), object(2)
memory usage: 228.5+ KB


In [9]:
movies_df.shape

(9742, 3)

In [10]:
ratings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [11]:
ratings_df.shape

(100836, 4)

In [12]:
n_ratings = len(ratings_df)
n_movies = len(ratings_df['movieId'].unique())
n_users = len(ratings_df['userId'].unique())

print(f"Number of ratings: {n_ratings}")
print(f"Number of movies: {n_movies}")
print(f"Number of users: {n_users}")
print(f"Average ratings per user: {round(n_ratings/n_users, 2)}")
print(f"Average ratings per movie: {round(n_ratings/n_movies, 2)}")

Number of ratings: 100836
Number of movies: 9724
Number of users: 610
Average ratings per user: 165.3
Average ratings per movie: 10.37


# **Data Preparation**

## **Data Cleaning**

### **Movies**

In [13]:
movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


Extract the year from the movie title into a new column, `year_of_release`.

In [14]:
movies_df['year_of_release'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
movies_df.head()

Unnamed: 0,movieId,title,genres,year_of_release
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,(1995)
1,2,Jumanji (1995),Adventure|Children|Fantasy,(1995)
2,3,Grumpier Old Men (1995),Comedy|Romance,(1995)
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,(1995)
4,5,Father of the Bride Part II (1995),Comedy,(1995)


The code above extracts the year from the title. It looks for a four-digit year enclosed in parentheses `( )`, so that other numbers in the title (for example, the number in "2001: A Space Odyssey") are not captured by mistake.
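The parenthesized-year rule can be checked on a few titles without pandas. This is a minimal sketch using only the standard `re` module; the sample titles and the helper name `split_title` are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical sample titles; only a four-digit group inside parentheses counts as the year.
titles = ["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled Project"]

YEAR_IN_PARENS = re.compile(r"\((\d{4})\)")

def split_title(raw):
    """Return (clean_title, year); year is None when no '(YYYY)' group is present."""
    match = YEAR_IN_PARENS.search(raw)
    year = match.group(1) if match else None
    clean = YEAR_IN_PARENS.sub("", raw).strip()
    return clean, year

parsed = [split_title(t) for t in titles]
for clean, year in parsed:
    print(clean, year)
```

The leading "2001" is left in the title because it is not wrapped in parentheses, and a title without a year yields `None`, which corresponds to a missing `year_of_release` value.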

In [15]:
movies_df['year_of_release'] = movies_df.year_of_release.str.extract(r'(\d\d\d\d)', expand=False)
movies_df.head(3)

Unnamed: 0,movieId,title,genres,year_of_release
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji (1995),Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men (1995),Comedy|Romance,1995


In [16]:
movies_df['title'] = movies_df['title'].str.replace(r'(\(\d\d\d\d\))', '', regex=True)
movies_df.head()

Unnamed: 0,movieId,title,genres,year_of_release
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


### **Ratings**

In [17]:
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


In [18]:
# Dropping the timestamp column
ratings_df.drop('timestamp', axis=1, inplace=True)

# Confirming the drop
ratings_df.head(3)

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0


This project does not need the timestamp, so the column is dropped.

## **Data Merging**

In [19]:
# merge dataframe
films = pd.merge(movies_df, ratings_df, on='movieId', how='left')
films

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,7.0,4.5
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,15.0,2.5
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,17.0,4.5
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017,184.0,4.0
100850,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017,184.0,3.5
100851,193585,Flint,Drama,2017,184.0,3.5
100852,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018,184.0,3.5


### Data Cleaning

In [20]:
# check missing values
films.isna().sum()

Unnamed: 0,0
movieId,0
title,0
genres,0
year_of_release,18
userId,18
rating,18


Only a few rows contain missing values, so they are simply dropped.

In [21]:
# handling missing values
films = films.dropna()
films

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,1.0,4.0
1,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,5.0,4.0
2,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,7.0,4.5
3,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,15.0,2.5
4,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995,17.0,4.5
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017,184.0,4.0
100850,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017,184.0,3.5
100851,193585,Flint,Drama,2017,184.0,3.5
100852,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018,184.0,3.5


In [22]:
# recheck missing values
films.isna().sum()

Unnamed: 0,0
movieId,0
title,0
genres,0
year_of_release,0
userId,0
rating,0


In [23]:
all_genres = set()
films['genres'].str.split('|').apply(all_genres.update)

print('Number of unique genres:', len(all_genres))
print('Unique genres:', all_genres)


Number of unique genres: 20
Unique genres: {'Horror', 'Drama', 'Adventure', '(no genres listed)', 'Children', 'Thriller', 'Fantasy', 'Action', 'Documentary', 'Animation', 'Mystery', 'Comedy', 'Musical', 'Romance', 'Sci-Fi', 'IMAX', 'Crime', 'Western', 'War', 'Film-Noir'}


Among the unique genres there is a placeholder value, `(no genres listed)`, which is not an actual genre. The rows that carry it are inspected next.

In [24]:
# Show the rows whose genre is '(no genres listed)'
no_genre_movies = films[films['genres'] == '(no genres listed)']

# Display the result
print('Number of rows with genre (no genres listed):', len(no_genre_movies))
no_genre_movies


Number of rows with genre (no genres listed): 37


Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
97505,114335,La cravate,(no genres listed),1957,50.0,3.0
98200,122888,Ben-hur,(no genres listed),2016,567.0,0.5
98234,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,21.0,4.0
98235,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,62.0,3.5
98236,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,111.0,3.5
98237,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,212.0,3.5
98238,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,248.0,4.0
98239,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,252.0,3.0
98240,122896,Pirates of the Caribbean: Dead Men Tell No Tales,(no genres listed),2017,586.0,5.0
98626,129250,Superfast!,(no genres listed),2015,448.0,0.5


There are 37 rows (ratings of movies that have no genre listed). They are removed because movies without a genre cannot be used in the modeling stage.

In [25]:
films = films[films['genres'] != '(no genres listed)']

all_genres = set()
films['genres'].str.split('|').apply(all_genres.update)

print('Number of unique genres:', len(all_genres))
print('Unique genres:', all_genres)


Number of unique genres: 19
Unique genres: {'Horror', 'Drama', 'Adventure', 'Children', 'Thriller', 'Fantasy', 'Action', 'Documentary', 'Animation', 'Mystery', 'Comedy', 'Musical', 'Romance', 'Sci-Fi', 'IMAX', 'Crime', 'Western', 'War', 'Film-Noir'}


In [26]:
films.loc[:, 'genres'] = films['genres'].str.lower().str.replace('|', ' ', regex=False)

The values in the `genres` column are lowercased and the `|` separators are replaced with spaces (`' '`) so that they are easier to read.
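The same normalization can be sketched in plain Python to see what the tokens look like afterwards. The sample genre strings and the helper name `normalize_genres` are assumptions made for this illustration.

```python
# Hypothetical raw genre strings in the dataset's pipe-separated format.
raw_genres = ["Adventure|Animation|Children", "Sci-Fi|Thriller", "Film-Noir"]

def normalize_genres(value):
    """Lowercase the string and turn the '|' separators into single spaces."""
    return value.lower().replace("|", " ")

normalized = [normalize_genres(g) for g in raw_genres]

# Splitting on whitespace recovers one token per genre; hyphenated names stay intact.
unique_genres = sorted({token for g in normalized for token in g.split()})

print(normalized)
print(unique_genres)
```

Because the split is on whitespace only, hyphenated genres such as `sci-fi` and `film-noir` remain single tokens at this stage.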

### **Creating Two Separate DataFrames**

Because we model with two different methods, `Content Based Filtering` and `Collaborative Filtering`, each method gets its own copy of the dataset with its own treatment.

In [27]:
content_df = films.copy()
content_df

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,adventure animation children comedy fantasy,1995,1.0,4.0
1,1,Toy Story,adventure animation children comedy fantasy,1995,5.0,4.0
2,1,Toy Story,adventure animation children comedy fantasy,1995,7.0,4.5
3,1,Toy Story,adventure animation children comedy fantasy,1995,15.0,2.5
4,1,Toy Story,adventure animation children comedy fantasy,1995,17.0,4.5
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic,action animation comedy fantasy,2017,184.0,4.0
100850,193583,No Game No Life: Zero,animation comedy fantasy,2017,184.0,3.5
100851,193585,Flint,drama,2017,184.0,3.5
100852,193587,Bungo Stray Dogs: Dead Apple,action animation,2018,184.0,3.5


In [28]:
collab_df = films.copy()
collab_df

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,adventure animation children comedy fantasy,1995,1.0,4.0
1,1,Toy Story,adventure animation children comedy fantasy,1995,5.0,4.0
2,1,Toy Story,adventure animation children comedy fantasy,1995,7.0,4.5
3,1,Toy Story,adventure animation children comedy fantasy,1995,15.0,2.5
4,1,Toy Story,adventure animation children comedy fantasy,1995,17.0,4.5
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic,action animation comedy fantasy,2017,184.0,4.0
100850,193583,No Game No Life: Zero,animation comedy fantasy,2017,184.0,3.5
100851,193585,Flint,drama,2017,184.0,3.5
100852,193587,Bungo Stray Dogs: Dead Apple,action animation,2018,184.0,3.5


### Data Handling for Content Based Filtering

Because this DataFrame is used for Content Based Filtering, rows that repeat the same movie (one row per rating) must be removed first so that each title appears only once.

In [29]:
# duplicated by movieId
content_df.duplicated('movieId').sum()

np.int64(91095)

In [30]:
# duplicated by title
content_df.duplicated('title').sum()

np.int64(91372)

In [31]:
# drop duplicated data by movieId & title
content_df = content_df.drop_duplicates('movieId')
content_df = content_df.drop_duplicates('title')

In [32]:
content_df

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,adventure animation children comedy fantasy,1995,1.0,4.0
215,2,Jumanji,adventure children fantasy,1995,6.0,4.0
325,3,Grumpier Old Men,comedy romance,1995,1.0,4.0
377,4,Waiting to Exhale,comedy drama romance,1995,6.0,3.0
384,5,Father of the Bride Part II,comedy,1995,6.0,5.0
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic,action animation comedy fantasy,2017,184.0,4.0
100850,193583,No Game No Life: Zero,animation comedy fantasy,2017,184.0,3.5
100851,193585,Flint,drama,2017,184.0,3.5
100852,193587,Bungo Stray Dogs: Dead Apple,action animation,2018,184.0,3.5


In [33]:
# Strip surrounding whitespace in the columns
content_df.loc[:, 'title'] = content_df['title'].str.strip()
content_df.loc[:, 'genres'] = content_df['genres'].str.strip()

In [34]:
content_df[content_df.title.eq('Toy Story')]

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
0,1,Toy Story,adventure animation children comedy fantasy,1995,1.0,4.0


### Data Handling for Collaborative Filtering

In [35]:
collab_df.loc[:, 'title'] = collab_df['title'].str.strip()

In [36]:
collab_df = collab_df.pivot_table(index="userId", columns="title", values="rating")
collab_df

title,'71,'Hellboy': The Seeds of Creation,'Round Midnight,'Salem's Lot,'Til There Was You,'Tis the Season for Love,"'burbs, The",'night Mother,(500) Days of Summer,*batteries not included,...,Zulu,[REC],[REC]²,[REC]³ 3 Génesis,anohana: The Flower We Saw That Day - The Movie,eXistenZ,xXx,xXx: State of the Union,¡Three Amigos!,À nous la liberté (Freedom for Us)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,,,,,,,,,,,...,,,,,,,,,4.0,
2.0,,,,,,,,,,,...,,,,,,,,,,
3.0,,,,,,,,,,,...,,,,,,,,,,
4.0,,,,,,,,,,,...,,,,,,,,,,
5.0,,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,,,,,,,,,,,...,,,,,,,,,,
607.0,,,,,,,,,,,...,,,,,,,,,,
608.0,,,,,,,,,,,...,,,,,,4.5,3.5,,,
609.0,,,,,,,,,,,...,,,,,,,,,,


The exploratory analysis showed that each movie receives only about 10 ratings on average, so we assume that many movies were rated by only a handful of users. Movies rated by fewer than 5 users are therefore dropped. The result above also contains many NaN values. NaN means that the user has not rated that movie, so we fill it with 0 to represent "not yet rated".
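The two steps described above (dropping rarely rated movies, then filling the gaps with 0) can be sketched with plain dictionaries. The toy ratings and the lowered threshold `MIN_RATINGS = 2` are assumptions for illustration; the notebook itself uses a threshold of 5.

```python
# Hypothetical ratings: {user_id: {movie_title: rating}}
ratings = {
    1: {"A": 4.0, "B": 3.0},
    2: {"A": 5.0, "C": 2.0},
    3: {"A": 2.5, "B": 4.5},
}
MIN_RATINGS = 2  # illustrative value; the notebook keeps movies with at least 5 ratings

# Count how many users rated each movie.
counts = {}
for user_ratings in ratings.values():
    for movie in user_ratings:
        counts[movie] = counts.get(movie, 0) + 1

# Keep only movies that reach the threshold, in a fixed column order.
kept = sorted(m for m, c in counts.items() if c >= MIN_RATINGS)

# Build the user-by-movie matrix, using 0.0 for "not rated".
matrix = {user: [r.get(m, 0.0) for m in kept] for user, r in ratings.items()}

print(kept)
print(matrix)
```

Filling with 0 is a simplification: in later similarity computations a 0 is treated like any other number, so "not rated" and "rated very low" end up close to each other.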

In [37]:
collab_df = collab_df.dropna(thresh=5, axis=1).fillna(0)
collab_df

title,"'burbs, The",(500) Days of Summer,*batteries not included,10 Cloverfield Lane,10 Things I Hate About You,"10,000 BC",101 Dalmatians,101 Dalmatians (One Hundred and One Dalmatians),102 Dalmatians,12 Angry Men,...,Zodiac,Zombieland,Zoolander,Zootopia,Zulu,[REC],eXistenZ,xXx,xXx: State of the Union,¡Three Amigos!
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,3.0,0.0,0.0,0.0,4.5,3.5,0.0,0.0
609.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# **Modeling**

## **Content Based Filtering**

In [38]:
data = content_df
data.sample(5)

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
44048,2468,Jumpin' Jack Flash,action comedy romance thriller,1986,19.0,2.0
88863,66320,"11th Hour, The",documentary,2007,158.0,4.0
27876,1253,"Day the Earth Stood Still, The",drama sci-fi thriller,1951,28.0,3.5
79375,32017,"Pacifier, The",action comedy,2005,140.0,2.0
78187,26622,Dominick and Eugene,drama,1988,599.0,3.0


### **TF-IDF Vectorization**

In [39]:
# Assumes content_df already exists
data = content_df.copy()

# Continue with TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(token_pattern=r"(?u)\b[\w-]+\b")
tfidf_matrix = tfidf.fit_transform(data['genres'])

print("Fitur TF-IDF:", tfidf.get_feature_names_out())

tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf.get_feature_names_out())
print(tfidf_df.head())


Fitur TF-IDF: ['action' 'adventure' 'animation' 'children' 'comedy' 'crime'
 'documentary' 'drama' 'fantasy' 'film-noir' 'horror' 'imax' 'musical'
 'mystery' 'romance' 'sci-fi' 'thriller' 'war' 'western']
   action  adventure  animation  children    comedy  crime  documentary  \
0     0.0   0.417593   0.513835  0.504805  0.266130    0.0          0.0   
1     0.0   0.512028   0.000000  0.618962  0.000000    0.0          0.0   
2     0.0   0.000000   0.000000  0.000000  0.568601    0.0          0.0   
3     0.0   0.000000   0.000000  0.000000  0.502878    0.0          0.0   
4     0.0   0.000000   0.000000  0.000000  1.000000    0.0          0.0   

      drama   fantasy  film-noir  horror  imax  musical  mystery   romance  \
0  0.000000  0.485734        0.0     0.0   0.0      0.0      0.0  0.000000   
1  0.000000  0.595578        0.0     0.0   0.0      0.0      0.0  0.000000   
2  0.000000  0.000000        0.0     0.0   0.0      0.0      0.0  0.822613   
3  0.466706  0.000000        0.0

In [40]:
tfidf_matrix = tfidf.fit_transform(data['genres'])

tfidf_matrix.shape

(9409, 19)
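To make the weights in the TF-IDF matrix less opaque, the following sketch recomputes them by hand on three toy documents. It assumes scikit-learn's default `TfidfVectorizer` behaviour as I understand it (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row); the function name `tfidf_rows` and the sample documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical genre strings, already lowercased and space-separated.
docs = ["adventure children fantasy", "comedy romance", "comedy"]

def tfidf_rows(documents):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalized rows."""
    tokenized = [d.split() for d in documents]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard against empty documents
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(w, 4) for w in row])
```

A movie with a single genre gets a weight of exactly 1.0 for that genre, and within one movie the rarer genre receives the larger weight, which matches the pattern visible in the notebook output.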

In [41]:
# change tf-idf vector to matrix form
tfidf_matrix.todense()

matrix([[0.        , 0.41759287, 0.51383457, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.51202783, 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.58117288, 0.        , 0.81378012, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

In [42]:
df_tfidf = pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tfidf.get_feature_names_out(),
    index=data.title
)

# Check the number of rows and columns
num_rows, num_cols = df_tfidf.shape

# Take a sample that fits within the available shape
sampled_df = df_tfidf.sample(min(10, num_rows), axis=0).sample(min(20, num_cols), axis=1)


In [43]:
pd.DataFrame(
    tfidf_matrix.todense(),
    columns=tfidf.get_feature_names_out(),
    index=data.title
).sample(19, axis=1).sample(10, axis=0)

Unnamed: 0_level_0,romance,documentary,action,war,horror,thriller,musical,sci-fi,adventure,imax,film-noir,fantasy,crime,children,drama,animation,comedy,western,mystery
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Lamerica,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.860721,0.0,0.0,0.0,0.0,0.0,0.509076,0.0,0.0,0.0,0.0
Leap of Faith,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.680253,0.0,0.732977,0.0,0.0
"Red Violin, The (Violon rouge, Le)",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.426366,0.0,0.0,0.0,0.904551
Limbo,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
Dog Days (Hundstage),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
One Fine Day,0.8417,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.539945,0.0,0.0,0.0,0.0
Guys and Dolls,0.505731,0.0,0.0,0.0,0.0,0.0,0.788694,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.349568,0.0,0.0
Sherlock Holmes and Dr. Watson: Acquaintance,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
Joe Somebody,0.72753,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.466706,0.0,0.502878,0.0,0.0
The Opposite Sex,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### **Cosine Similarity**

In [44]:
# calculate cosine similarity on matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim = cosine_similarity(tfidf_matrix)
cosine_sim

array([[1.        , 0.81556675, 0.15132156, ..., 0.        , 0.41814836,
        0.26612951],
       [0.81556675, 1.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.15132156, 0.        , 1.        , ..., 0.        , 0.        ,
        0.56860118],
       ...,
       [0.        , 0.        , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.41814836, 0.        , 0.        , ..., 0.        , 1.        ,
        0.        ],
       [0.26612951, 0.        , 0.56860118, ..., 0.        , 0.        ,
        1.        ]])
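The entries of this matrix are pairwise cosine similarities between the TF-IDF rows. A minimal pure-Python sketch of the same quantity is shown below; the three toy vectors and film names are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical genre vectors (one dimension per genre).
vectors = {
    "Film A": [1.0, 1.0, 0.0],
    "Film B": [1.0, 0.0, 0.0],
    "Film C": [0.0, 0.0, 1.0],
}

# Full pairwise similarity table, analogous to the square matrix above.
similarity = {
    a: {b: cosine(va, vb) for b, vb in vectors.items()}
    for a, va in vectors.items()
}

for name, row in similarity.items():
    print(name, {other: round(score, 4) for other, score in row.items()})
```

Each film has similarity 1 with itself, films that share no genre score 0, and partial overlap lands in between.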

In [45]:
# create dataframe from the results of cosine similarity
cosine_sim_df = pd.DataFrame(cosine_sim, index=data['title'], columns=data['title'])
print('Shape:', cosine_sim_df.shape)

# show similarity matrix
cosine_sim_df.sample(5, axis=1).sample(10, axis=0)

Shape: (9409, 9409)


title,Chitty Chitty Bang Bang,Plunkett & MaCleane,Imagine That,Paddington 2,"Juror, The"
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Back to School,0.254149,0.0,0.438842,0.304459,0.0
Damage (Fatale),0.0,0.406017,0.407276,0.0,0.562813
Scorpio,0.0,0.562271,0.175843,0.0,0.767137
Double Team,0.0,0.603244,0.0,0.0,0.0
Comme un chef,0.254149,0.0,0.438842,0.304459,0.0
Child's Play,0.0,0.0,0.0,0.0,0.515602
Marauders,0.0,0.603244,0.0,0.0,0.0
Wildcats,0.254149,0.0,0.438842,0.304459,0.0
The Drop,0.0,0.164216,0.164725,0.0,0.71863
Natural Born Killers,0.0,0.33126,0.0,0.0,0.448681


### **Get Recommendation**

In [46]:
# function recommendations
def film_recommendations(title, similarity_data=cosine_sim_df, items=data[['title', 'genres']], k=5):
    # get data index
    index = similarity_data.loc[:,title].to_numpy().argpartition(
        range(-1, -k, -1))

    # retrieve data from an existing index
    closest = similarity_data.columns[index[-1:-(k+2):-1]]

    # drop title you want to search
    closest = closest.drop(title, errors='ignore')

    return pd.DataFrame(closest).merge(items).head(k)
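The core of the function is a top-k lookup in one column of the similarity table. As an illustration of that idea only (not a replacement for the function above), here is a sketch with `heapq.nlargest` on a plain dictionary; the helper name `top_k_similar` and the scores are hypothetical.

```python
import heapq

def top_k_similar(title, similarity_row, k=5):
    """Return the k titles with the highest similarity to `title`, excluding itself."""
    candidates = (
        (score, other) for other, score in similarity_row.items() if other != title
    )
    # nlargest compares (score, title) tuples, so ties on score are broken by title.
    return [other for score, other in heapq.nlargest(k, candidates)]

# Hypothetical similarity scores of every film against the query film "Q".
row = {"Q": 1.0, "A": 0.9, "B": 0.2, "C": 0.5}

print(top_k_similar("Q", row, k=2))
```

When many films share exactly the same score, as happens with identical genre strings, the order among them depends only on the tie-breaking rule, so different implementations can return different but equally similar titles.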

In [77]:
# sample data
data.sample(3)

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
48476,2818,Iron Eagle IV,action war,1995,160.0,1.0
99381,141994,Saving Christmas,children comedy,2014,514.0,0.5
45657,2611,"Winslow Boy, The",drama,1999,474.0,3.0


In [78]:
data[data.title.eq('Saving Christmas')]

Unnamed: 0,movieId,title,genres,year_of_release,userId,rating
99381,141994,Saving Christmas,children comedy,2014,514.0,0.5


In [79]:
film_recommendations('Saving Christmas')

Unnamed: 0,title,genres
0,Christmas with the Kranks,children comedy
1,Ernest Saves Christmas,children comedy
2,House Arrest,children comedy
3,Bad News Bears,children comedy
4,First Kid,children comedy


### Evaluation

In [81]:
def evaluate_recommendations(title, k=5):
    # Get the recommendations
    recommended = film_recommendations(title, k=k)

    # Genres of the query film (space-separated after preprocessing)
    main_genres = set(data.loc[data['title'] == title, 'genres'].iloc[0].split())

    results = []
    for rec_title in recommended['title']:
        rec_genres = set(data.loc[data['title'] == rec_title, 'genres'].iloc[0].split())
        common_genres = main_genres.intersection(rec_genres)
        similarity_score = len(common_genres) / len(main_genres) if main_genres else 0
        results.append({
            'Recommended Title': rec_title,
            'Common Genres': sorted(common_genres),  # sorted for a stable display order
            'Genre Similarity': round(similarity_score, 2)
        })

    return pd.DataFrame(results)

In [82]:
evaluate_recommendations('Saving Christmas', k=5)

Unnamed: 0,Recommended Title,Common Genres,Genre Similarity
0,Christmas with the Kranks,"[children, comedy]",1.0
1,Ernest Saves Christmas,"[children, comedy]",1.0
2,House Arrest,"[children, comedy]",1.0
3,Bad News Bears,"[children, comedy]",1.0
4,First Kid,"[children, comedy]",1.0
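The score in the last column is the share of the query film's genres that also appear in the recommended film. A standalone sketch of that ratio, assuming space-separated genre strings as produced in the preparation step (the helper name `genre_overlap` is hypothetical):

```python
def genre_overlap(query_genres, candidate_genres):
    """Fraction of the query's genres that the candidate shares (0.0 to 1.0)."""
    query = set(query_genres.split())
    candidate = set(candidate_genres.split())
    if not query:
        return 0.0
    return len(query & candidate) / len(query)

print(genre_overlap("children comedy", "children comedy"))
print(genre_overlap("children comedy", "comedy romance"))
print(genre_overlap("children comedy", "drama"))
```

Since the recommendations are themselves derived from genre similarity, a high overlap mostly confirms that the model is internally consistent rather than that users would like the suggested films.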


## **Collaborative Filtering**

In [52]:
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

### **Cosine Similarity**

In [53]:
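def standardize(column):
    # DataFrame.apply passes one column at a time, i.e. one movie's ratings across all users
    return (column - column.mean()) / (column.max() - column.min())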

data_std = collab_df.apply(standardize)
item_similarity = cosine_similarity(data_std.T)
item_similarity = pd.DataFrame(item_similarity, index=collab_df.columns, columns=collab_df.columns)
item_similarity

title,"'burbs, The",(500) Days of Summer,*batteries not included,10 Cloverfield Lane,10 Things I Hate About You,"10,000 BC",101 Dalmatians,101 Dalmatians (One Hundred and One Dalmatians),102 Dalmatians,12 Angry Men,...,Zodiac,Zombieland,Zoolander,Zootopia,Zulu,[REC],eXistenZ,xXx,xXx: State of the Union,¡Three Amigos!
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
"'burbs, The",1.000000,0.063117,0.235908,-0.023768,0.143482,0.011998,0.087931,0.224052,-0.018608,0.034133,...,0.153158,0.101301,0.049897,0.003233,0.012563,-0.017905,0.187953,0.062174,-0.014025,0.353194
(500) Days of Summer,0.063117,1.000000,0.133949,0.142471,0.273989,0.193960,0.148903,0.142141,0.066567,0.160679,...,0.414585,0.355723,0.252226,0.216007,-0.003346,0.126147,0.053614,0.241092,0.139511,0.125905
*batteries not included,0.235908,0.133949,1.000000,0.035596,0.061144,-0.017106,0.073459,0.106100,-0.012561,0.026313,...,0.194530,0.121010,0.071852,-0.024573,0.042406,-0.012086,0.115396,-0.000060,-0.009467,0.234514
10 Cloverfield Lane,-0.023768,0.142471,0.035596,1.000000,-0.005799,0.112396,0.006139,-0.016835,-0.017692,0.031619,...,0.272347,0.241751,0.195054,0.319371,0.019617,0.082246,0.177846,0.096638,0.081429,0.002733
10 Things I Hate About You,0.143482,0.273989,0.061144,-0.005799,1.000000,0.244670,0.223481,0.211473,0.109729,0.013083,...,0.091853,0.158637,0.281934,0.050031,0.009408,0.088391,0.121029,0.130813,0.068745,0.110612
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
[REC],-0.017905,0.126147,-0.012086,0.082246,0.088391,0.074425,0.240364,0.073539,0.206804,0.081389,...,0.086805,0.137618,0.119447,0.135089,0.248011,1.000000,-0.021449,0.291410,0.376455,-0.022876
eXistenZ,0.187953,0.053614,0.115396,0.177846,0.121029,0.088045,0.047804,0.085606,-0.022291,-0.001768,...,0.145741,0.068763,0.097147,0.046885,0.063009,-0.021449,1.000000,0.163022,-0.016800,0.138611
xXx,0.062174,0.241092,-0.000060,0.096638,0.130813,0.203002,0.156932,0.248820,0.093041,0.074160,...,0.209840,0.203285,0.338034,0.200762,0.200504,0.291410,0.163022,1.000000,0.259049,0.065673
xXx: State of the Union,-0.014025,0.139511,-0.009467,0.081429,0.068745,0.150745,0.079071,0.044104,0.218801,0.022178,...,0.203675,0.157213,0.169656,0.039615,-0.007525,0.376455,-0.016800,0.259049,1.000000,-0.017917
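The item-to-item matrix comes from two steps: each movie's rating column is mean-centered and scaled by its range, then columns are compared with cosine similarity. The sketch below repeats both steps on toy data in plain Python; the movie names and ratings are invented, and the zero-range guard is an addition that the notebook's `standardize` does not have.

```python
import math

def standardize(values):
    """Mean-center a rating column and divide by its range (max - min)."""
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    if spread == 0:  # assumption: treat a constant column as all zeros
        return [0.0] * len(values)
    return [(v - mean) / spread for v in values]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical ratings of three movies by the same four users (0 = not rated).
columns = {
    "Movie A": [5.0, 4.0, 0.0, 0.0],
    "Movie B": [4.0, 5.0, 0.0, 0.0],
    "Movie C": [0.0, 0.0, 5.0, 4.0],
}

standardized = {title: standardize(col) for title, col in columns.items()}
item_sim = {
    a: {b: cosine(va, vb) for b, vb in standardized.items()}
    for a, va in standardized.items()
}

for title, row in item_sim.items():
    print(title, {other: round(score, 3) for other, score in row.items()})
```

Movies rated by the same users come out strongly similar, while movies with disjoint audiences get a negative similarity after centering, which is why the matrix above contains values below zero.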


In [54]:
def get_similar_movies(movie_name, user_rating):
    if movie_name not in item_similarity:
        print(f"Not Found: {movie_name}")
        return
    similar_score = item_similarity[movie_name] * (user_rating-2.5)  # manipulate the similar score using given rating by user
    similar_score = pd.DataFrame(similar_score).T
    return np.array(similar_score).reshape(-1)

print(get_similar_movies("Zodiac", 3.5))

[0.15315773 0.41458533 0.19453005 ... 0.2098396  0.20367534 0.18979541]


### **Building a User Profile**

In [55]:
# pick 10 random movie titles without showing userId/rating
sample_titles = collab_df.columns.to_series().sample(10).tolist()

sample_titles

['Muse, The',
 'Rambo: First Blood Part II',
 'Doc Hollywood',
 'Erin Brockovich',
 'Fantasia',
 'Fun with Dick and Jane',
 'Brothers',
 "Things to Do in Denver When You're Dead",
 'Indecent Proposal',
 'Ghost Dog: The Way of the Samurai']

In [83]:
usamah_profile = [
    ("Zodiac", 3.5),
    ("Muse, The", 2.0),
    ("Rambo: First Blood Part II", 2.5),
    ("Doc Hollywood", 4.5),
    ("Erin Brockovich", 5.0),
    ("Fantasia", 3.0),
    ("Fun with Dick and Jane", 4.5),
    ("Brothers", 5.0),
    ("Indecent Proposal", 4.0)
]

### **Getting Recommendations from the User Profile**

In [84]:
similar_movies = pd.DataFrame(columns=collab_df.columns)

for i, (movie, rating) in enumerate(usamah_profile):
    sim_scores = get_similar_movies(movie, rating)
    similar_movies.loc[i] = sim_scores

# User profile (movies already watched)
watched_movies = [movie for movie, rating in usamah_profile]
recommendation_scores = similar_movies.sum().sort_values(ascending=False)
recommendation_scores = recommendation_scores.drop(labels=watched_movies, errors='ignore')

print(recommendation_scores[:10])

title
Beowulf                 3.242131
Three Men and a Baby    3.199312
World Trade Center      3.154436
Serious Man, A          3.133615
New Guy, The            3.061745
Cellular                3.043576
Flags of Our Fathers    3.040235
Freshman, The           3.010570
Rundown, The            3.005932
Proof                   2.975900
dtype: float64
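The ranking above is a sum of similarity columns, each weighted by how far the user's rating lies from the neutral value 2.5. The following sketch reproduces that aggregation on a toy similarity table; the titles, scores, and the function name `recommend` are assumptions made for illustration.

```python
NEUTRAL_RATING = 2.5  # midpoint used in the notebook's get_similar_movies

# Hypothetical item-to-item similarities for the two movies in the profile.
item_similarity = {
    "Seen 1": {"Seen 1": 1.0, "Seen 2": 0.1, "New X": 0.8, "New Y": 0.2},
    "Seen 2": {"Seen 1": 0.1, "Seen 2": 1.0, "New X": 0.3, "New Y": 0.9},
}
profile = [("Seen 1", 5.0), ("Seen 2", 1.0)]

def recommend(profile, item_similarity):
    """Sum rating-weighted similarities and rank the movies not yet watched."""
    totals = {}
    for movie, rating in profile:
        weight = rating - NEUTRAL_RATING  # liked movies push up, disliked ones push down
        for other, sim in item_similarity[movie].items():
            totals[other] = totals.get(other, 0.0) + sim * weight
    watched = {movie for movie, _ in profile}
    ranked = sorted(
        ((score, title) for title, score in totals.items() if title not in watched),
        reverse=True,
    )
    return [(title, score) for score, title in ranked]

recommendations = recommend(profile, item_similarity)
print(recommendations)
```

A movie similar to something the user rated below 2.5 receives a negative contribution, so "New Y" ends up below "New X" even though it is very similar to one of the watched movies.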


### Evaluation

In [86]:
sample_titles = collab_df.columns.to_series().sample(10).tolist()

sample_titles

['Frequency',
 'Meatballs',
 'Unstoppable',
 'Night on Earth',
 'Dragonslayer',
 'Fly, The',
 'Walk the Line',
 'Chasing Liberty',
 "Pee-wee's Big Adventure",
 'Shrek 2']

In [87]:
top_recommendations = recommendation_scores[:10].index.tolist()

# Manual relevance scores for evaluation (1 = not relevant, 5 = highly relevant)
manual_relevance = {
    "Frequency": 5,
    "Meatballs": 4,
    "Unstoppable": 4,
    "Night on Earth": 2,
    "Dragonslayer": 5,
    "Fly, The": 3,
    "Walk the Line": 2,
    "Chasing Liberty": 5,
    "Pee-wee's Big Adventure": 4,
    "Shrek 2": 3,
}


In [90]:
# Keep only the movies that were rated manually
evaluated_scores = recommendation_scores[list(manual_relevance)]

# Build the evaluation DataFrame
evaluation_df = pd.DataFrame({
    'Predicted Score': evaluated_scores,
    'Manual Relevance': pd.Series(manual_relevance)
})

# Normalize the predicted scores
evaluation_df['Normalized Predicted'] = evaluation_df['Predicted Score'] / evaluation_df['Predicted Score'].max()

# Correlation between the system scores and the manual relevance
correlation = evaluation_df[['Normalized Predicted', 'Manual Relevance']].corr().iloc[0, 1]

# Show the evaluation result
print("=== Evaluation Result ===")
print(evaluation_df)
print(f"\nCorrelation between predicted and manual relevance: {correlation:.2f}")


=== Evaluation Result ===
                         Predicted Score  Manual Relevance  \
Frequency                       1.818137                 5   
Meatballs                       1.464195                 4   
Unstoppable                     2.725166                 4   
Night on Earth                  0.708033                 2   
Dragonslayer                    1.250156                 5   
Fly, The                        1.819015                 3   
Walk the Line                   1.817266                 2   
Chasing Liberty                 1.589963                 5   
Pee-wee's Big Adventure         1.374724                 4   
Shrek 2                         1.766686                 3   

                         Normalized Predicted  
Frequency                            0.667166  
Meatballs                            0.537287  
Unstoppable                          1.000000  
Night on Earth                       0.259813  
Dragonslayer                         0.458745  
Fly