The command `!pip install scikit-surprise` installs the Python library scikit-surprise, which is designed for building and analyzing recommender systems.

In [2]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 154.4/154.4 kB 5.4 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp310-cp310-linux_x86_64.whl size=2357295 sha256=21501c804adf23ff24cdf6a6ce381276190273e22e3b86d4a44a15f2fedcc926
  Stored in directory: /root/.cache/pip/wheels/4b/3f/df/6acbf0a40397d9bf3ff97f582cc22fb9ce66adde75bc71fd54
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Succe

The following code imports the libraries used for data manipulation, text vectorization, similarity measurement, and building the recommender system.

In [3]:
# Importing necessary libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy

The following code loads a sample book dataset (goodbooks-10k, sourced from Goodreads) from the given URL and stores it in the variable `data` using pandas.

In [4]:
# Project Overview
# Loading a sample dataset (goodbooks-10k book metadata)
url = 'https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/books.csv'
data = pd.read_csv(url)

The following code prints the total number of rows and columns in the dataset, along with the list of column names.

In [5]:
# Data Understanding
print(f"Total Rows: {data.shape[0]}, Total Columns: {data.shape[1]}")
print(f"Dataset Columns: {data.columns.tolist()}")

Total Rows: 10000, Total Columns: 23
Dataset Columns: ['book_id', 'goodreads_book_id', 'best_book_id', 'work_id', 'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year', 'original_title', 'title', 'language_code', 'average_rating', 'ratings_count', 'work_ratings_count', 'work_text_reviews_count', 'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'image_url', 'small_image_url']


The command `data.head()` displays the first five rows of the dataset, giving a first look at the structure and content of the data.

In [6]:
data.head()

Unnamed: 0,book_id,goodreads_book_id,best_book_id,work_id,books_count,isbn,isbn13,authors,original_publication_year,original_title,...,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,1,2767052,2767052,2792775,272,439023483,9780439000000.0,Suzanne Collins,2008.0,The Hunger Games,...,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,2,3,3,4640799,491,439554934,9780440000000.0,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,...,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,3,41865,41865,3212258,226,316015849,9780316000000.0,Stephenie Meyer,2005.0,Twilight,...,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,4,2657,2657,3275794,487,61120081,9780061000000.0,Harper Lee,1960.0,To Kill a Mockingbird,...,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,5,4671,4671,245494,1356,743273567,9780743000000.0,F. Scott Fitzgerald,1925.0,The Great Gatsby,...,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


The command `print(data.isna().sum())` counts and prints the number of missing values (NaN) in each column, which helps identify data issues that need handling.

In [7]:
print(data.isna().sum())

book_id                         0
goodreads_book_id               0
best_book_id                    0
work_id                         0
books_count                     0
isbn                          700
isbn13                        585
authors                         0
original_publication_year      21
original_title                585
title                           0
language_code                1084
average_rating                  0
ratings_count                   0
work_ratings_count              0
work_text_reviews_count         0
ratings_1                       0
ratings_2                       0
ratings_3                       0
ratings_4                       0
ratings_5                       0
image_url                       0
small_image_url                 0
dtype: int64


The following command replaces missing values in the `original_title` column with the value from the `title` column in the same row, so that `original_title` no longer contains missing values.

In [8]:
data['original_title'] = data['original_title'].fillna(data['title'])
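
As a small illustration of why this fill matters for the later string concatenation, the sketch below uses a toy DataFrame (the author and title values are made up, not taken from the dataset). It shows that concatenating strings with a missing value yields NaN, and that `fillna` with another Series fills each gap from the same row.

```python
import pandas as pd

# Toy frame with hypothetical values, mimicking the two title columns.
toy = pd.DataFrame({
    "authors": ["Author A", "Author B"],
    "original_title": ["First Book", None],
    "title": ["First Book (Series, #1)", "Second Book"],
})

# String concatenation propagates missing values: the second row becomes NaN.
unfilled = toy["authors"] + " " + toy["original_title"]

# fillna with a Series fills each gap from the same row of that Series.
toy["original_title"] = toy["original_title"].fillna(toy["title"])
filled = toy["authors"] + " " + toy["original_title"]

print(unfilled.isna().tolist())
print(filled.tolist())
```

Rows that already have an `original_title` are left unchanged; only the missing entries take the `title` value.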

The following code creates a new column named `tags` that concatenates the `authors`, `original_title`, and `average_rating` (converted to a string) columns, for use in content-based filtering.

In [9]:
# Data Preparation
# Selecting relevant columns for the content-based filtering
data['tags'] = data['authors'] + " " + data['original_title'] + " " + data['average_rating'].astype(str)

Display the first rows of the new `tags` column to check the result of the previous step.

In [10]:
data['tags'].head()

Unnamed: 0,tags
0,Suzanne Collins The Hunger Games 4.34
1,"J.K. Rowling, Mary GrandPré Harry Potter and t..."
2,Stephenie Meyer Twilight 3.57
3,Harper Lee To Kill a Mockingbird 4.25
4,F. Scott Fitzgerald The Great Gatsby 3.89


Confirm that the `tags` column contains no null or missing values.

In [11]:
print(data['tags'].isna().sum())

0


The following code uses `TfidfVectorizer` to turn the `tags` column into a TF-IDF matrix, which represents the text numerically and ignores common English words (stop words).

In [12]:
# Content-Based Filtering
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(data['tags'])
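
To see which tokens actually end up in the vocabulary, here is a minimal sketch on two tag strings written in the same "authors title rating" shape as `data['tags']`. The names `sample_tags`, `vectorizer`, and `matrix` are illustrative only. With scikit-learn's default `token_pattern`, only tokens of two or more alphanumeric characters are kept, so a rating such as `4.34` contributes the token `34`, and initials such as `J.K.` are dropped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two tag strings in the same shape as the notebook's `tags` column.
sample_tags = [
    "Suzanne Collins The Hunger Games 4.34",
    "J.K. Rowling Harry Potter and the Philosopher's Stone 4.44",
]

vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(sample_tags)

# vocabulary_ maps each kept token to its column index in the matrix.
vocab = set(vectorizer.vocabulary_)
print(sorted(vocab))
print(matrix.shape)  # (number of documents, number of distinct tokens)
```

If the rating fragments turn out to add noise rather than signal, one option would be to leave `average_rating` out of `tags`; that is a design choice, not something the notebook does.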

The following code computes the cosine similarity matrix between all items in the TF-IDF matrix, which measures how similar each item is to every other item based on content.

In [13]:
# Cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
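
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below implements that formula directly in NumPy on a tiny dense array; it is an illustration of the definition, not scikit-learn's implementation, and `cosine_similarity_matrix` is a hypothetical helper name.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of a dense 2-D array."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # all-zero rows get similarity 0 instead of a division by zero
    unit = X / norms         # scale every row to length 1
    return unit @ unit.T     # dot products of unit vectors are cosines

vectors = np.array([
    [1.0, 0.0, 1.0],
    [2.0, 0.0, 2.0],  # same direction as row 0
    [0.0, 3.0, 0.0],  # orthogonal to rows 0 and 1
])
sim = cosine_similarity_matrix(vectors)
print(np.round(sim, 2))
```

Note that for the 10,000 books here the full similarity matrix has 10,000 × 10,000 entries, which is roughly 800 MB when stored as dense 64-bit floats.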

The `content_based_recommendations` function recommends books by content similarity: it looks up the index of the given title, reads that book's similarity scores, sorts them in descending order, skips the top entry (normally the queried book itself), and returns the next ten most similar titles.

In [14]:
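# Function to recommend books based on content
def content_based_recommendations(title, cosine_sim=cosine_sim):
    # Rows whose original_title matches the query exactly
    matches = data.index[data['original_title'] == title].tolist()
    if not matches:
        raise ValueError(f"Title not found in dataset: {title!r}")
    idx = matches[0]  # use the first match if the title appears more than once
    # Pair every book index with its similarity to the queried book
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the top entry (normally the queried book itself) and keep the next ten
    sim_scores = sim_scores[1:11]
    book_indices = [i[0] for i in sim_scores]
    return data['original_title'].iloc[book_indices]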

The code below prints content-based recommendations for the title "Harry Potter and the Order of the Phoenix" using the `content_based_recommendations` function.

In [15]:
# Example of content-based recommendation
print("Content-Based Recommendations for 'Harry Potter':")
print(content_based_recommendations('Harry Potter and the Order of the Phoenix'))

Content-Based Recommendations for 'Harry Potter':
6140    Harry Potter and the Order of the Phoenix (Har...
3274    Harry Potter Boxed Set, Books 1-5 (Harry Potte...
23                    Harry Potter and the Goblet of Fire
22                Harry Potter and the Chamber of Secrets
1                Harry Potter and the Philosopher's Stone
26                 Harry Potter and the Half-Blood Prince
2100                     Harry Potter Boxed Set Books 1-4
24                   Harry Potter and the Deathly Hallows
3752         Harry Potter Collection (Harry Potter, #1-6)
17               Harry Potter and the Prisoner of Azkaban
Name: original_title, dtype: object
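
The `[1:11]` slice in the function relies on the queried book being ranked first. A hedged alternative, sketched below on a toy similarity matrix, masks the query index explicitly before ranking, so the result does not depend on how ties are ordered. `top_k_similar` and `toy_sim` are hypothetical names, not part of the notebook.

```python
import numpy as np

def top_k_similar(sim_matrix, idx, k=10):
    """Return indices of the k rows most similar to row `idx`, excluding `idx` itself."""
    scores = np.asarray(sim_matrix[idx], dtype=float).copy()
    scores[idx] = -np.inf                       # never recommend the query item itself
    order = np.argsort(-scores, kind="stable")  # descending; ties keep index order
    return order[:k].tolist()

toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
print(top_k_similar(toy_sim, idx=0, k=2))  # → [2, 3]
```

Other editions of the same work, such as the first row of the output above, would still appear, since they are separate rows in the dataset.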


The following code loads the ratings dataset from the given URL and stores it in the variable `ratings` using pandas.

In [16]:
# Collaborative Filtering
# Loading ratings dataset
url_ratings = 'https://raw.githubusercontent.com/zygmuntz/goodbooks-10k/master/ratings.csv'
ratings = pd.read_csv(url_ratings)

View the first five rows of the `ratings` DataFrame.

In [17]:
ratings.head()

Unnamed: 0,user_id,book_id,rating
0,1,258,5
1,2,4081,4
2,2,260,5
3,2,9296,5
4,2,2318,3


The following code prepares the dataset for collaborative filtering: `Reader` defines the rating scale, and `Dataset.load_from_df` loads the `ratings` DataFrame into the format used by the Surprise library.

In [18]:
# Preparing the dataset for collaborative filtering
reader = Reader(rating_scale=(1, 5))
data_collab = Dataset.load_from_df(ratings[['user_id', 'book_id', 'rating']], reader)

The following code splits the dataset into training and test sets, with 80% of the data used for training and 20% for testing, using the `train_test_split` function from the Surprise library.

In [19]:
# Train-test split
trainset, testset = train_test_split(data_collab, test_size=0.2)
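
The call above does not fix a random seed, so the split, and any metric computed from it, can differ between runs; Surprise's `train_test_split` accepts a `random_state` argument for this purpose. The sketch below shows the idea of a seeded 80/20 split on row positions using NumPy only. It is a conceptual illustration, not Surprise's implementation, and `split_rows` is a hypothetical helper.

```python
import numpy as np

def split_rows(n_rows, test_size=0.2, seed=42):
    """Shuffle row positions with a fixed seed and split them into train/test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_rows(10, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))

# The same seed reproduces the same split.
train_again, test_again = split_rows(10, test_size=0.2, seed=42)
print(np.array_equal(test_idx, test_again))
```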

The following code trains a collaborative filtering model with the SVD (Singular Value Decomposition) algorithm, then runs the model on the test set to produce rating predictions.

In [20]:
# Using SVD for collaborative filtering
model = SVD()
model.fit(trainset)
predictions = model.test(testset)

The command `accuracy.rmse(predictions)` computes and prints the root mean square error (RMSE) of the model's predictions on the test set, which is used to evaluate the collaborative filtering model. RMSE is expressed in the same units as the ratings, and lower values indicate better predictions.

In [21]:
# Evaluation
accuracy.rmse(predictions)

RMSE: 0.8302


0.8301913704850175
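
As a worked example of the metric, RMSE is the square root of the mean of the squared differences between actual and predicted ratings. The sketch below computes it by hand on four made-up rating pairs; the values are illustrative and unrelated to the model output above.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences of numbers."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

true_ratings = [5, 3, 4, 1]          # hypothetical actual ratings
estimates = [4.5, 3.5, 4.0, 2.0]     # hypothetical predicted ratings

# Squared errors: 0.25, 0.25, 0.0, 1.0 -> mean 0.375 -> sqrt ≈ 0.6124
score = rmse(true_ratings, estimates)
print(round(score, 4))
```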

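The `collaborative_filtering_recommendations` function recommends books for a given user: it predicts a rating for every book in `ratings`, sorts the predictions in descending order, keeps the top `n_recommendations`, and returns their titles and authors. Because the final lookup uses `isin`, the rows come back in the order of `data` rather than in ranked order, and books the user has already rated are not excluded.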

In [22]:
# Function to recommend books for a user based on collaborative filtering
def collaborative_filtering_recommendations(user_id, n_recommendations=10):
    book_ids = ratings['book_id'].unique()
    predicted_ratings = []

    for book_id in book_ids:
        predicted_ratings.append((book_id, model.predict(user_id, book_id).est))

    predicted_ratings.sort(key=lambda x: x[1], reverse=True)
    top_books = predicted_ratings[:n_recommendations]
    book_titles = data[data['book_id'].isin([book[0] for book in top_books])]

    return book_titles[['original_title', 'authors']]
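
One possible variant, sketched below under stated assumptions, keeps the ranked order and skips books the user has already rated. The function `rank_unseen_books`, the `predict_fn` callable, and the toy DataFrames are all hypothetical; in the notebook, `predict_fn` could be `lambda u, b: model.predict(u, b).est`.

```python
import pandas as pd

def rank_unseen_books(user_id, ratings, books, predict_fn, n=10):
    """Top-n books the user has not rated, ordered by predicted rating (highest first)."""
    seen = set(ratings.loc[ratings["user_id"] == user_id, "book_id"])
    candidates = [b for b in ratings["book_id"].unique() if b not in seen]
    scored = sorted(
        ((b, predict_fn(user_id, b)) for b in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )[:n]
    top = pd.DataFrame(scored, columns=["book_id", "predicted_rating"])
    # A left merge keeps the rows of `top` in their ranked order.
    return top.merge(
        books[["book_id", "original_title", "authors"]], on="book_id", how="left"
    )

# Toy data with made-up values.
ratings_toy = pd.DataFrame({
    "user_id": [1, 2, 2, 2],
    "book_id": [10, 10, 20, 30],
    "rating": [5, 4, 3, 5],
})
books_toy = pd.DataFrame({
    "book_id": [10, 20, 30],
    "original_title": ["Book A", "Book B", "Book C"],
    "authors": ["Author A", "Author B", "Author C"],
})
fake_scores = {10: 4.9, 20: 3.0, 30: 4.5}  # stand-in for model predictions

result = rank_unseen_books(1, ratings_toy, books_toy, lambda u, b: fake_scores[b], n=2)
print(result)
```

Here user 1 has already rated book 10, so it is left out even though it has the highest stand-in score, and the remaining books are listed from highest to lowest predicted rating.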

The following code prints collaborative filtering recommendations for the user with ID 1, using the `collaborative_filtering_recommendations` function.

In [23]:
# Example of collaborative filtering recommendation
print("Collaborative Filtering Recommendations for User 1:")
print(collaborative_filtering_recommendations(1))

Collaborative Filtering Recommendations for User 1:
                                         original_title  \
779                                   Calvin and Hobbes   
1009  The Essential Calvin and Hobbes: A Calvin and ...   
1787       The Calvin and Hobbes Tenth Anniversary Book   
3471                               The Complete Stories   
6360  There's Treasure Everywhere: A Calvin and Hobb...   
6589                The Authoritative Calvin and Hobbes   
7253  Homicidal Psycho Jungle Cat: A Calvin and Hobb...   
7843  India After Gandhi: The History of the World's...   
8945                                    دیوان‎‎ [Dīvān]   
9565  Attack of the Deranged Mutant Killer Monster S...   

                           authors  
779   Bill Watterson, G.B. Trudeau  
1009                Bill Watterson  
1787                Bill Watterson  
3471             Flannery O'Connor  
6360                Bill Watterson  
6589                Bill Watterson  
7253                Bill Watterson  
78