# 📘 Project Report: Book Recommendation System

---
By: Rahmi Amilia


This project builds a book recommendation system using two approaches: Collaborative Filtering (SVD) and Content-Based Filtering. It uses the Book-Crossing Dataset, which contains book, user, and rating information.

This notebook covers the following stages: library imports, data loading, data understanding, data preparation, modeling, and model evaluation.

### 1. Import Libraries

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import csr_matrix

import warnings
warnings.filterwarnings("ignore")

### 2. Load Dataset

Load the Book-Crossing CSV files into three DataFrames: books, ratings, and users. Only the first 10,000 rows of the books file are read (`nrows=10000`).

In [39]:
books = pd.read_csv('/content/BX_Books.csv', sep=';', encoding='latin-1', nrows=10000)
ratings = pd.read_csv("/content/BX-Book-Ratings.csv", sep=';', encoding='latin-1', on_bad_lines='skip')
users = pd.read_csv("/content/BX-Users.csv", sep=';', encoding='latin-1', on_bad_lines='skip')

### 3. Data Understanding

The loaded DataFrames have the following sizes:
- `books`: 10,000 rows, 8 columns
- `ratings`: 1,149,780 rows, 3 columns
- `users`: 278,858 rows, 3 columns

Key features:
- Books: `isbn`, `title`, `author`, `year`, `publisher`, `image_s/m/l`
- Users: `user_id`, `location`, `age`
- Ratings: `rating` (0-10)

> The `age` column has missing values and outliers, such as ages below 5 or above 100.

The row and column counts of each dataset before cleaning are shown below:


In [40]:
print('books shape:', books.shape)
print('ratings shape:', ratings.shape)
print('users shape:', users.shape)

books shape: (10000, 8)
ratings shape: (1149780, 3)
users shape: (278858, 3)


The `ratings` dataset has 3 main columns:

- **user_id**: Unique ID of the user who gave the rating
- **isbn**: Unique ID of the rated book
- **rating**: The rating the user gave the book (scale 0–10)

In [41]:
ratings.columns = ["user_id", "isbn", "rating"]
ratings.head()

| | user_id | isbn | rating |
|---|---|---|---|
| 0 | 276725 | 034545104X | 0 |
| 1 | 276726 | 0155061224 | 5 |
| 2 | 276727 | 0446520802 | 0 |
| 3 | 276729 | 052165615X | 3 |
| 4 | 276729 | 0521795028 | 6 |


The `books` dataset has 8 columns:

- **isbn**: Unique book ID (International Standard Book Number)
- **title**: Book title
- **author**: Name of the book's author
- **year**: Year of publication
- **publisher**: Name of the publisher
- **image_s**, **image_m**, **image_l**: URLs of the cover image in small, medium, and large sizes


In [42]:
books.columns = ["isbn", "title", "author", "year", "publisher", "image_s", "image_m", "image_l"]
books.head()

| | isbn | title | author | year | publisher | image_s | image_m | image_l |
|---|---|---|---|---|---|---|---|---|
| 0 | 195153448 | Classical Mythology | Mark P. O. Morford | 2002 | Oxford University Press | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... | http://images.amazon.com/images/P/0195153448.0... |
| 1 | 2005018 | Clara Callan | Richard Bruce Wright | 2001 | HarperFlamingo Canada | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... | http://images.amazon.com/images/P/0002005018.0... |
| 2 | 60973129 | Decision in Normandy | Carlo D'Este | 1991 | HarperPerennial | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... | http://images.amazon.com/images/P/0060973129.0... |
| 3 | 374157065 | Flu: The Story of the Great Influenza Pandemic... | Gina Bari Kolata | 1999 | Farrar Straus Giroux | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... | http://images.amazon.com/images/P/0374157065.0... |
| 4 | 393045218 | The Mummies of Urumchi | E. J. W. Barber | 1999 | W. W. Norton & Company | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... | http://images.amazon.com/images/P/0393045218.0... |


The `users` dataset has 3 main columns:

- **user_id**: Unique ID of each user
- **location**: User location, usually in the format "city, state/province, country"
- **age**: User age in years; this column contains many missing values (NaN)


In [43]:
users.columns = ["user_id", "location", "age"]
users.head()

| | user_id | location | age |
|---|---|---|---|
| 0 | 1 | nyc, new york, usa | |
| 1 | 2 | stockton, california, usa | 18.0 |
| 2 | 3 | moscow, yukon territory, russia | |
| 3 | 4 | porto, v.n.gaia, portugal | 17.0 |
| 4 | 5 | farnborough, hants, united kingdom | |


### Structure and Statistics of the Books Dataset

#### a. Data Structure (`books.info()`)

- The `books` dataset has **10,000 rows** and **8 columns**.
- No column contains null values (non-null count = 10,000).
- Data types:
  - 1 numeric column (`year`, int64)
  - 7 object (string) columns, including `isbn`, `title`, `author`, and the image columns

#### b. Descriptive Statistics (`books.describe(include='all')`)

- The **`title`** column has 9,553 unique values, so some titles appear more than once.
- The **`author`** column has 5,754 unique names; the most frequent author is **Stephen King**, who appears **68 times**.
- The **`publisher`** column has 1,701 unique publishers; the most frequent is **Ballantine Books** (300 times).
- The **`year`** column ranges from 0 to 2005 with a mean of about **1958**. The mean is pulled down by the zero-valued entries (the median is 1997). Year 0 is an outlier and is filtered out during data preparation.
- The image columns (`image_s`, `image_m`, `image_l`) are almost entirely unique (9,999 unique values each), which shows that nearly every book has its own image URL.


In [44]:
books.info()
books.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   isbn       10000 non-null  object
 1   title      10000 non-null  object
 2   author     10000 non-null  object
 3   year       10000 non-null  int64 
 4   publisher  10000 non-null  object
 5   image_s    10000 non-null  object
 6   image_m    10000 non-null  object
 7   image_l    10000 non-null  object
dtypes: int64(1), object(7)
memory usage: 625.1+ KB


| | isbn | title | author | year | publisher | image_s | image_m | image_l |
|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000 | 10000 | 10000.0 | 10000 | 10000 | 10000 | 10000 |
| unique | 10000.0 | 9553 | 5754 | | 1701 | 9999 | 9999 | 9999 |
| top | 195153448.0 | The Golden Compass (His Dark Materials, Book 1) | Stephen King | | Ballantine Books | http://images.amazon.com/images/P/002542730X.0... | http://images.amazon.com/images/P/002542730X.0... | http://images.amazon.com/images/P/002542730X.0... |
| freq | 1.0 | 5 | 68 | | 300 | 2 | 2 | 2 |
| mean | | | | 1958.6719 | | | | |
| std | | | | 269.017284 | | | | |
| min | | | | 0.0 | | | | |
| 25% | | | | 1992.0 | | | | |
| 50% | | | | 1997.0 | | | | |
| 75% | | | | 2001.0 | | | | |


In [45]:
users.info()
users.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278858 entries, 0 to 278857
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   user_id   278858 non-null  int64  
 1   location  278858 non-null  object 
 2   age       168096 non-null  float64
dtypes: float64(1), int64(1), object(1)
memory usage: 6.4+ MB


| | user_id | age |
|---|---|---|
| count | 278858.0 | 168096.0 |
| mean | 139429.5 | 34.751434 |
| std | 80499.51502 | 14.428097 |
| min | 1.0 | 0.0 |
| 25% | 69715.25 | 24.0 |
| 50% | 139429.5 | 32.0 |
| 75% | 209143.75 | 44.0 |
| max | 278858.0 | 244.0 |


Columns in the ratings dataset:

* user_id : User ID.

* isbn : Book ID.

* rating : The user's rating of the book. A value of 0 denotes an implicit rating (the user has not given an explicit score).

In [46]:
ratings.info()
ratings['rating'].value_counts().sort_index()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1149780 entries, 0 to 1149779
Data columns (total 3 columns):
 #   Column   Non-Null Count    Dtype 
---  ------   --------------    ----- 
 0   user_id  1149780 non-null  int64 
 1   isbn     1149780 non-null  object
 2   rating   1149780 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 26.3+ MB


| rating | count |
|---|---|
| 0 | 716109 |
| 1 | 1770 |
| 2 | 2759 |
| 3 | 5996 |
| 4 | 8904 |
| 5 | 50974 |
| 6 | 36924 |
| 7 | 76457 |
| 8 | 103736 |
| 9 | 67541 |


### Check Missing Values and Age Statistics

In [47]:
books.columns = ['isbn', 'title', 'author', 'year', 'publisher', 'image_s', 'image_m', 'image_l']
ratings.columns = ['user_id', 'isbn', 'rating']
users.columns = ['user_id', 'location', 'age']
print(users.isnull().sum())
print(users['age'].describe())

user_id          0
location         0
age         110762
dtype: int64
count    168096.000000
mean         34.751434
std          14.428097
min           0.000000
25%          24.000000
50%          32.000000
75%          44.000000
max         244.000000
Name: age, dtype: float64


### 4. Data Preparation

### Data Cleaning and Filtering
- Drop the image columns
- Discard ratings equal to 0
- Fix invalid `age` values
- Combine features for content-based filtering
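
The notebook lists fixing invalid `age` values as a cleaning step but does not show code for it. The following is a minimal sketch of one possible approach, run on a hypothetical `toy_users` frame rather than the real data. The 5–100 valid range comes from the outlier note in the Data Understanding section; filling with the median and the `age_clean` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a missing age, two outliers, and two valid ages.
toy_users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "age": [np.nan, 18.0, 244.0, 3.0, 40.0],
})

# Treat ages outside 5-100 as invalid; NaN also fails this check.
valid = toy_users["age"].between(5, 100)
cleaned = toy_users["age"].where(valid)  # invalid ages become NaN

# Assumption: impute with the median of the remaining valid ages.
median_age = cleaned.median()
toy_users["age_clean"] = cleaned.fillna(median_age)

print(toy_users)
```

Neither model in this notebook uses `age`, so this step affects only the users table.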

### Preparation for Content-Based Filtering
- Combine `title` and `author` into `combined_features`
- Apply TF-IDF

### Preparation for Collaborative Filtering
- Use the Surprise library
- Pass the columns in the order ['user_id', 'isbn', 'rating'] (user, item, rating)
- Use Reader and load_from_df
- Perform a train-test split


4.1 Remove Duplicate ISBNs

In [48]:
books.drop_duplicates(subset='isbn', keep='first', inplace=True)

4.2 Keep Valid Publication Years (1900–2025)

In [49]:
books = books[(books['year'] >= 1900) & (books['year'] <= 2025)].copy()

4.3 Drop the Unused Image Columns

In [50]:
books.drop(columns=['image_s', 'image_m', 'image_l'], inplace=True)

4.4 Remove Ratings Equal to 0

In [51]:
ratings = ratings[ratings['rating'] > 0]

4.5 Fill Missing Values and Combine Features for Content-Based Filtering

In [52]:
books['title'] = books['title'].fillna('')
books['author'] = books['author'].fillna('')
books['combined_features'] = books['title'] + ' ' + books['author']

4.6 TF-IDF Vectorization and Cosine Similarity

Preparation for Collaborative Filtering:

* Pass the columns to Surprise in the order ['user_id', 'isbn', 'rating'].

* Use Reader and Dataset.load_from_df.

In [70]:
from surprise import Dataset, Reader

reader = Reader(rating_scale=(1, 10))  # explicit ratings only, after dropping 0
data = Dataset.load_from_df(ratings[['user_id', 'isbn', 'rating']], reader)

* Train/test split:

In [73]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

Full TF-IDF and Cosine Similarity details for Content-Based Filtering

In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(books['combined_features'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

Data Preparation for Content-Based Filtering
The content-based recommender is built in the following steps:

* Combine the title and author features into a single feature, combined_features.

* Use the TF-IDF Vectorizer to turn the text into a numeric representation, removing English stop words.

* Compute the cosine similarity between books from their TF-IDF representations to measure how similar the books are.
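
The steps above can be illustrated with a small standard-library sketch. It is not the notebook's implementation, which uses scikit-learn's `TfidfVectorizer`. The sketch applies a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization, and simplifies the rest: tokens are split on whitespace and no stop words are removed. The three strings in `docs` are hypothetical stand-ins for `combined_features`.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return L2-normalized TF-IDF vectors for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors

def cosine(a, b):
    """Cosine similarity; the inputs are unit length, so it is a dot product."""
    return sum(x * y for x, y in zip(a, b))

docs = [
    "harry potter chamber secrets rowling",
    "harry potter prisoner azkaban rowling",
    "classical mythology morford",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine(vecs[0], vecs[1])  # shared tokens, so similarity is positive
sim_02 = cosine(vecs[0], vecs[2])  # no shared tokens, so similarity is zero
print(round(sim_01, 3), sim_02)
```

Two entries that share an author name and title words score well above zero, while unrelated entries score exactly zero. The recommender relies on this property.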

### 5. Modeling & Results

5.1 Collaborative Filtering (SVD)

**Collaborative Filtering (SVD):**
  Maps user-item interactions into a latent-factor space to predict ratings for user-book pairs that have not been rated yet.

**Content-Based Filtering (TF-IDF + Cosine Similarity):**
  Finds similar books by comparing the text of `title` and `author` using TF-IDF and Cosine Similarity.



* Use the Surprise library to apply SVD.

* Train the model on explicit ratings.

* Produce Top-N book recommendations for a given user.
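
As a rough illustration of what "latent space" means here, a biased matrix-factorization model of this kind estimates a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. The sketch below uses hypothetical hand-picked numbers, not values learned by the notebook's model. The `predict_rating` helper is illustrative and not part of Surprise.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, scale=(1, 10)):
    """Estimate mu + b_u + b_i + dot(p_u, q_i), clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))
    low, high = scale
    return min(high, max(low, raw))

# Hypothetical parameters for one user and one book.
estimate = predict_rating(
    mu=7.5,          # global mean rating
    b_u=0.3,         # this user rates slightly above average
    b_i=-0.2,        # this book is rated slightly below average
    p_u=[0.1, 0.4],  # user latent factors
    q_i=[0.5, 0.25], # item latent factors
)
print(round(estimate, 2))
```

During training, the biases and factor vectors are fitted to the observed ratings. For a user or book absent from the training set, only the terms that are known contribute to the estimate.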

In [54]:
!pip install scikit-surprise --no-binary scikit-surprise



In [55]:
!pip install numpy==1.23.5



In [56]:
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise.accuracy import rmse

reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(ratings[['user_id', 'isbn', 'rating']], reader)

trainset, testset = train_test_split(data, test_size=0.2, random_state=42)

svd_model = SVD()
svd_model.fit(trainset)

predictions = svd_model.test(testset)
rmse(predictions)

RMSE: 1.6397


1.6397433106514907

5.2 Example SVD Recommendation Output



Define a recommendation function for a given user:

In [57]:
def recommend_books_svd(user_id, books_df, ratings_df, model, n=5):
    # Books the user has already rated are excluded from the candidates.
    user_books = ratings_df[ratings_df['user_id'] == user_id]['isbn'].tolist()
    unread_books = books_df[~books_df['isbn'].isin(user_books)]

    # Predict a rating for every remaining book.
    predictions = []
    for isbn in unread_books['isbn']:
        pred = model.predict(user_id, isbn)
        predictions.append((isbn, pred.est))

    # Keep the n highest predicted ratings.
    top_n = sorted(predictions, key=lambda x: x[1], reverse=True)[:n]
    # isin() returns rows in books_df order, not in order of predicted rating.
    return books_df[books_df['isbn'].isin([isbn for isbn, _ in top_n])][['title', 'author']]

In [58]:
recommend_books_svd(user_id=276729, books_df=books, ratings_df=ratings, model=svd_model)

| | title | author |
|---|---|---|
| 248 | Lies and the Lying Liars Who Tell Them: A Fair... | Al Franken |
| 1571 | The Fellowship of the Ring (The Lord of the Ri... | J.R.R. TOLKIEN |
| 1922 | Snow Falling on Cedars | David Guterson |
| 4206 | The Return of the King (The Lord of the Rings,... | J.R.R. TOLKIEN |
| 9027 | Harry Potter and the Sorcerer's Stone (Book 1) | J. K. Rowling |
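
Because `recommend_books_svd` selects rows with `isin`, the table above is ordered by the books' index rather than by predicted rating. If a ranked list is wanted, one option is to merge the sorted predictions onto the book table, as sketched below. The `toy_books` frame, the `top_n` pairs, and the `predicted_rating` column name are hypothetical.

```python
import pandas as pd

toy_books = pd.DataFrame({
    "isbn": ["A1", "B2", "C3"],
    "title": ["First", "Second", "Third"],
    "author": ["X", "Y", "Z"],
})

# Hypothetical (isbn, predicted rating) pairs, already sorted best-first.
top_n = [("C3", 9.1), ("A1", 8.4)]

# A left merge keeps the row order of the left frame, i.e. the ranking.
ranked = (
    pd.DataFrame(top_n, columns=["isbn", "predicted_rating"])
    .merge(toy_books, on="isbn", how="left")
    [["title", "author", "predicted_rating"]]
)
print(ranked)
```

Keeping the predicted rating in the output also makes it easier to see how close the top candidates are to each other.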


5.3 Content-Based Filtering

* Use the title and author features of each book.

* Convert them into TF-IDF vectors.

* Compute the Cosine Similarity between books.

* Return recommendations for a given book.

In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

books['title'] = books['title'].fillna('')
books['author'] = books['author'].fillna('')

books['combined_features'] = books['title'] + ' ' + books['author']

tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(books['combined_features'])

cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

Example Content-Based Recommendation Output

In [60]:
def get_recommendations(title, books_df, cosine_sim, n=5):
    from difflib import get_close_matches
    import numpy as np
    import pandas as pd
    # Map each title to its index label in books_df.
    # NOTE: cosine_sim is indexed by row position, so this lookup is only
    # reliable when books_df has a contiguous 0..n-1 index and unique titles.
    indices = pd.Series(books_df.index, index=books_df['title']).drop_duplicates()
    if title not in indices:
        # Fall back to the closest matching title, if any.
        closest = get_close_matches(title, books_df['title'], n=1)
        if not closest:
            return "Title not found, and no sufficiently similar title exists."
        else:
            title = closest[0]
            print(f"Showing recommendations for the closest title: '{title}'")
    idx = indices[title]
    sim_vector = cosine_sim[idx]
    # Accept both sparse and dense similarity rows.
    if hasattr(sim_vector, 'toarray'):
        sim_vector = sim_vector.toarray().flatten()
    else:
        sim_vector = np.array(sim_vector).flatten()
    sim_scores = list(enumerate(sim_vector))
    # Skip the first (highest) score and keep a few spare candidates.
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:n+10]
    book_indices = [i[0] for i in sim_scores if i[0] < len(books_df)]
    return books_df.iloc[book_indices[:n]][['title', 'author']]

get_recommendations("Harry Potter and the Chamber of Secrets (Book 2)", books, cosine_sim)


| | title | author |
|---|---|---|
| 3515 | El Vendedor De Noticias (Espasa Juvenil) | Jose Luis Olaizola |
| 2097 | Vieja Sirena, La | Jose Luis Sampedro |
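
The recommendations above do not resemble the query title. One plausible cause is that `get_recommendations` looks up rows of `cosine_sim` by DataFrame index label, while the similarity matrix is ordered by row position. After the year filter removes rows, labels and positions no longer match. The sketch below shows a position-based lookup on hypothetical toy data. `similar_titles`, `toy_books`, and `toy_sim` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with non-contiguous index labels, as happens after filtering rows.
toy_books = pd.DataFrame({"title": ["Book A", "Book B", "Book C"]}, index=[0, 5, 9])

# Hypothetical similarity matrix; rows and columns follow toy_books row order.
toy_sim = np.array([
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.3],
    [0.1, 0.3, 1.0],
])

def similar_titles(title, books_df, sim, n=2):
    """Return the n most similar titles using row positions, not index labels."""
    titles = books_df["title"].reset_index(drop=True)
    matches = titles[titles == title]
    if matches.empty:
        return []
    pos = matches.index[0]  # position of the first matching row
    order = np.argsort(-sim[pos], kind="stable")  # most similar first
    order = [i for i in order if i != pos][:n]    # drop the query itself
    return titles.iloc[order].tolist()

result = similar_titles("Book B", toy_books, toy_sim)
print(result)
```

With label-based lookup, "Book B" would map to label 5, which is out of range for a 3×3 matrix. On a larger matrix the same mismatch silently returns the row of a different book.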


### 6. Evaluation

6.1 Collaborative Filtering (SVD) Evaluation

Metric: RMSE (Root Mean Squared Error)
RMSE measures how far the model's predicted ratings deviate from the actual ratings. A lower RMSE is better.
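
For reference, RMSE is the square root of the mean squared difference between actual and predicted ratings. The sketch below computes it by hand on three hypothetical ratings. The notebook itself relies on Surprise's `accuracy.rmse`, and the name `rmse_manual` is illustrative.

```python
import math

def rmse_manual(actual, predicted):
    """Root mean squared error between two equal-length rating lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical ratings, not taken from the dataset.
actual = [8, 6, 10]
predicted = [7, 6, 8]

score = rmse_manual(actual, predicted)  # sqrt((1 + 0 + 4) / 3)
print(round(score, 4))
```

Because the errors are squared before averaging, a single prediction that is off by 2 points contributes four times as much as one that is off by 1.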

In [62]:
from surprise import accuracy

rmse_score = accuracy.rmse(predictions)

RMSE: 1.6397


Additional Metrics: Precision@K and Recall@K

These metrics measure how relevant the recommendations are to each user. Here K = 5.

In [63]:
from collections import defaultdict

def precision_recall_at_k(predictions, k=5, threshold=7):
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = {}
    recalls = {}

    for uid, user_ratings in user_est_true.items():
        user_ratings.sort(key=lambda x: x[0], reverse=True)
        top_k = user_ratings[:k]

        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
        n_rec_k = sum((est >= threshold) for (est, _) in top_k)
        n_rel_and_rec_k = sum(((true_r >= threshold) and (est >= threshold)) for (est, true_r) in top_k)

        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    avg_precision = sum(prec for prec in precisions.values()) / len(precisions)
    avg_recall = sum(rec for rec in recalls.values()) / len(recalls)

    return avg_precision, avg_recall

In [64]:
precision, recall = precision_recall_at_k(predictions, k=5)
print(f"Precision@5: {precision:.4f}")
print(f"Recall@5: {recall:.4f}")

Precision@5: 0.7063
Recall@5: 0.7024
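
To make the two metrics concrete, the sketch below walks through the same counting logic for a single hypothetical user, with K = 3 and a relevance threshold of 7. The rating pairs are made up for illustration.

```python
# Hypothetical (estimated rating, true rating) pairs for one user.
user_ratings = [(9.0, 8), (8.5, 6), (7.2, 9), (6.8, 7), (5.0, 10)]
k, threshold = 3, 7

# The k items with the highest estimated rating.
top_k = sorted(user_ratings, key=lambda x: x[0], reverse=True)[:k]

# Relevant: true rating at or above the threshold, over all of the user's items.
n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
# Recommended: estimated rating at or above the threshold, within the top k.
n_rec_k = sum(est >= threshold for est, _ in top_k)
# Both relevant and recommended, within the top k.
n_rel_and_rec_k = sum(
    true_r >= threshold and est >= threshold for est, true_r in top_k
)

precision_at_k = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
recall_at_k = n_rel_and_rec_k / n_rel if n_rel else 0
print(round(precision_at_k, 4), recall_at_k)
```

Here two of the three recommended items are relevant, and two of the user's four relevant items are recovered. The notebook's function averages these per-user values over all users in the test set.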


6.2 Content-Based Filtering (CBF)

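The CBF model is evaluated with Precision@5 based on content relevance. In this evaluation, a recommended book counts as relevant when it has the same author as the query book.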

In [65]:
def get_recommendations(title, top_n=5):
    idx = books[books['title'] == title].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n+1]  # skip the most similar entry (the book itself)
    book_indices = [i[0] for i in sim_scores]
    return books['title'].iloc[book_indices].tolist()

In [66]:
input_title = "Harry Potter and the Chamber of Secrets (Book 2)"
recommended_books = get_recommendations(input_title)[:5]

In [74]:
print("Recommendations for:", input_title)
for i, book in enumerate(recommended_books, 1):
    print(f"{i}. {book}")

Recommendations for: Harry Potter and the Chamber of Secrets (Book 2)
1. Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))
2. Harry Potter and the Sorcerer's Stone (Book 1)
3. Harry Potter and the Prisoner of Azkaban (Book 3)
4. Quidditch Through the Ages
5. Fantastic Beasts and Where to Find Them


In [68]:
# Redefines get_recommendations: returns other titles by the same author.
def get_recommendations(title, top_n=5):
    author = books[books['title'] == title]['author'].values
    if len(author) == 0:
        return []
    author = author[0]
    recs = books[(books['author'] == author) & (books['title'] != title)]['title'].head(top_n).tolist()
    return recs

def precision_at_k(recommended, relevant, k=5):
    return len(set(recommended[:k]) & set(relevant)) / k

relevant_books = books[books['author'].str.lower().str.contains("rowling")]['title'].tolist()

input_title = "Harry Potter and the Chamber of Secrets (Book 2)"
recommended_books = get_recommendations(input_title, top_n=5)

print("Recommended books:", recommended_books)
print("Relevant books:", relevant_books)

precision = precision_at_k(recommended_books, relevant_books, k=5)
print(f"Precision@5 for CBF: {precision:.2f}")

Recommended books: ["Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))", "Harry Potter and the Sorcerer's Stone (Book 1)", 'Harry Potter and the Prisoner of Azkaban (Book 3)', 'Quidditch Through the Ages', 'Fantastic Beasts and Where to Find Them']
Relevant books: ["Harry Potter and the Sorcerer's Stone (Harry Potter (Paperback))", "Harry Potter and the Sorcerer's Stone (Book 1)", 'Harry Potter and the Chamber of Secrets (Book 2)', 'Harry Potter and the Prisoner of Azkaban (Book 3)', 'Quidditch Through the Ages', 'Fantastic Beasts and Where to Find Them', 'Harry Potter and the Goblet of Fire (Book 4)', 'Harry Potter and the Chamber of Secrets (Book 2)', 'Harry Potter and the Order of the Phoenix (Book 5)', 'Harry Potter and the Prisoner of Azkaban (Book 3)', 'Harry Potter and the Goblet of Fire (Book 4)', "Harry Potter and the Sorcerer's Stone (Book 1)"]
Precision@5 for CBF: 1.00


- **Collaborative Filtering (SVD)** is effective for active users who already have a rating history.
- **Content-Based Filtering** is useful for new users who have no rating history yet.
- The quantitative evaluation indicates that both approaches return relevant recommendations: Precision@5 is 0.7063 for SVD and 1.00 for the single CBF query tested, where relevance is defined as sharing the query book's author.
