# Book Scraping and Clustering
Case Study for Patika Global Technology

* **Author:** Bora Boyacıoğlu
* **GitHub:** https://github.com/boraboyacioglu-itu
* **E-Mail:** boraboyacioglu@icloud.com

## Main Notebook

This notebook covers the main tasks of the project. The required analyses are carried out, and a main model (alongside an alternative) is selected for the clustering.

### 1. Web Scraping

This part is implemented in ```get_books.py```. Since scraping takes a while to run, I wrote it as a terminal script. However, the same code can also be run from the cells below.

In [1]:
# Import necessary libraries.
import json
from get_books import scrape

In [None]:
# Define the base URL.
base_url = 'http://books.toscrape.com/'

# Scrape the books.
books = scrape(base_url)

In [None]:
# Save the data to a JSON file.
with open('books.json', 'w', encoding='utf-8') as f:
    json.dump(books, f, indent=4)

Alternatively, it is also possible to read the already-extracted book data.

In [1]:
import json

# Read the already extracted books data.
with open('books.json', encoding='utf-8') as f:
    books = json.load(f)

### 2. Analyses

In [20]:
import AnsiLib as al

print("Total number of books:", al.s(str(len(books)), False))
print("First Book:", al.s(books[0]['title'], False))
print("-Last Book:", al.s(books[-1]['title'], False))


Total number of books: 1000
First Book: A Light in the Attic
-Last Book: 1,000 Places to See Before You Die


Additional analyses:

In [19]:
print("Average price of all books:", al.s("£" + str(sum([float(book['price']) for book in books]) / len(books)), False))
print("Average rating of all books:", al.s(str(sum([float(book['rating']) for book in books]) / len(books)) + "/5.000", False))
print("Number of books without any stocks:", al.s(str(len([book for book in books if book['stock'] == '0'])), False))

Average price of all books: £35.07035
Average rating of all books: 2.923/5.000
Number of books without any stocks: 0
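
The same summary figures can be computed a bit more readably with the standard `statistics` module. This is a minimal sketch, assuming each book is a dict whose `price`, `rating`, and `stock` fields hold numeric strings (as the comparison `book['stock'] == '0'` suggests). The `summarize` helper and the `sample_books` list are made up for illustration and are not part of the scraped data.

```python
from statistics import mean

def summarize(books):
    """Return average price, average rating, and the out-of-stock count."""
    return {
        'avg_price': mean(float(book['price']) for book in books),
        'avg_rating': mean(float(book['rating']) for book in books),
        'out_of_stock': sum(1 for book in books if int(book['stock']) == 0),
    }

# Hypothetical records, only to show the expected input shape.
sample_books = [
    {'price': '10.00', 'rating': '3', 'stock': '5'},
    {'price': '20.00', 'rating': '5', 'stock': '0'},
]

summary = summarize(sample_books)
print(f"Average price: £{summary['avg_price']:.2f}")
print(f"Average rating: {summary['avg_rating']:.3f}/5")
print(f"Out of stock: {summary['out_of_stock']}")
```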


### 3. Clustering Model

I tried four types of clustering models in ```demos.ipynb``` and selected Method 2 (Sentence Embeddings & K-Means Clustering) as the best-performing one.

Method 4, which uses HDBSCAN, also performs very well, apart from the fact that it leaves most of the books unclustered.

In [4]:
# Import necessary libraries.
import re
import random

from sentence_transformers import SentenceTransformer

from sklearn.cluster import KMeans
import hdbscan



In [5]:
DIVISION_FACTOR = 10  # Target average number of books per cluster.

RANDOM_SEED = 1984
random.seed(RANDOM_SEED)

In [6]:
# Clean the descriptions: keep letters, digits, whitespace and . , ! ? only, then lowercase.
pattern = r'[^a-zA-Z0-9\s.,!?]'
descriptions = [
    re.sub(pattern, '', book['desc']).lower()
    for book in books
]
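
To make the effect of this cleaning step concrete, the short example below applies the same pattern to a made-up sentence (the sample text and the `clean` helper are assumptions, not taken from the dataset or the original code). Apostrophes and hyphens are dropped rather than replaced by spaces, which is why tokens such as `reallife` and `hes` appear in the cluster samples shown later.

```python
import re

# Same pattern as above: keep letters, digits, whitespace and . , ! ?
pattern = r'[^a-zA-Z0-9\s.,!?]'

def clean(text):
    """Remove characters outside the whitelist, then lowercase."""
    return re.sub(pattern, '', text).lower()

sample = "He's sure it's a real-life #1 hit!"
print(clean(sample))  # hes sure its a reallife 1 hit!
```

Accented letters are removed as well, so non-English titles or names lose characters in this step.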

#### Sentence Embeddings & K-Means Clustering

In [21]:
# Load a pre-trained sentence transformer model.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the descriptions into embeddings.
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.65it/s]


In [22]:
# Fit K-Means clustering.
n_clusters = len(books) // DIVISION_FACTOR
kmeans = KMeans(n_clusters=n_clusters, random_state=RANDOM_SEED)

kmeans.fit(embeddings)
labels = kmeans.labels_

In [23]:
# Group the descriptions by cluster.
clusters = {i: [] for i in range(n_clusters)}
for desc, label in zip(descriptions, labels):
    clusters[label].append(desc)
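
Note that K-Means does not balance cluster sizes, so `DIVISION_FACTOR` only fixes the average number of books per cluster. A quick way to check the actual spread is to count the labels. The sketch below uses a hypothetical helper, `cluster_size_summary`, and a small made-up `labels` list in place of `kmeans.labels_`.

```python
from collections import Counter

def cluster_size_summary(labels):
    """Return (smallest, largest, mean) cluster size for a label sequence."""
    sizes = Counter(int(label) for label in labels)
    counts = list(sizes.values())
    return min(counts), max(counts), sum(counts) / len(counts)

# Hypothetical labels standing in for kmeans.labels_.
labels = [0, 0, 0, 1, 2, 2]
smallest, largest, average = cluster_size_summary(labels)
print(f"min={smallest}, max={largest}, mean={average:.1f}")  # min=1, max=3, mean=2.0
```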

In [24]:
random.choice(clusters)

['drawing on his extensive experience evaluating applicants for his marketing agency, and featuring stories based on reallife situations, sample cover letters, resumes, and straightforward advice, don raskins the dirty little secrets of getting your dream job offers all the necessary tools for navigating the tough job market and securing your dream job.don raskin owns and drawing on his extensive experience evaluating applicants for his marketing agency, and featuring stories based on reallife situations, sample cover letters, resumes, and straightforward advice, don raskins the dirty little secrets of getting your dream job offers all the necessary tools for navigating the tough job market and securing your dream job.don raskin owns and operates mme, an advertising and marketing agency in new york city. during his twentyfive years at the agency he has interviewed hundreds of new college graduates for positions within his agency and has placed a strong emphasis on entrylevel recruitmen

##### Outcomes:

The sampled clusters contain books that are similar to each other, which suggests that embedding the descriptions and running plain K-Means on the embeddings works well.
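
This observation is based on reading sampled clusters. One possible way to put a number on it, offered here as a sketch rather than as part of the original analysis, is the mean pairwise cosine similarity inside each cluster: values near 1 indicate tightly grouped descriptions. The function names and the toy vectors below are assumptions that stand in for the sentence embeddings.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_intra_cluster_similarity(vectors, labels):
    """Average cosine similarity over all pairs that share a cluster label."""
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(label, []).append(vec)
    sims = [
        cosine(u, v)
        for members in groups.values()
        for u, v in combinations(members, 2)
    ]
    # Clusters with a single member contribute no pairs.
    return sum(sims) / len(sims) if sims else float('nan')

# Toy 2-D "embeddings": two tight groups pointing in different directions.
vectors = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (0.0, 3.0)]
good = mean_intra_cluster_similarity(vectors, [0, 0, 1, 1])
bad = mean_intra_cluster_similarity(vectors, [0, 1, 0, 1])
print(good, bad)
```

In this toy case the correct grouping scores 1.0 and the mixed grouping scores 0.0. On real embeddings the gap would be smaller, so the score is most useful for comparing methods against each other.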

In [11]:
# Save the clusters to a JSON file.
with open('clusters.json', 'w', encoding='utf-8') as f:
    json.dump(clusters, f, indent=4)
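
One detail to keep in mind when reading `clusters.json` back: JSON object keys are always strings, so the integer cluster ids come back as `'0'`, `'1'`, and so on. Below is a minimal sketch of restoring them, using an in-memory round trip and a made-up `clusters` dict instead of the real file.

```python
import json

# Hypothetical clusters with integer ids, mirroring the structure saved above.
clusters = {0: ['first description'], 1: ['second description']}

# json.dumps turns integer keys into strings.
restored = json.loads(json.dumps(clusters))
print(list(restored))  # ['0', '1']

# Convert the keys back to integers after loading.
restored = {int(key): value for key, value in restored.items()}
print(list(restored))  # [0, 1]
```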

#### HDBSCAN

In [31]:
# Generate embeddings for the descriptions.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.62it/s]


In [32]:
# Cluster the embeddings using HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=DIVISION_FACTOR//4, metric='euclidean')
cluster_labels = clusterer.fit_predict(embeddings)



In [33]:
# Group the books by clusters.
clusters = {}
for desc, label in zip(descriptions, cluster_labels):
    clusters.setdefault(label, []).append(desc)

In [34]:
print("Total number of clusters:", al.g(str(len(clusters)), False))
print("Ratio of outliers:", '/'.join([
    al.r(str(list(cluster_labels).count(-1)), False),
    str(len(cluster_labels))
]))

Total number of clusters: 60
Ratio of outliers: 783/1000


In [35]:
# Drop the outlier group (label -1) before sampling clusters.
clusters.pop(-1);

In [36]:
random.choice(clusters)

['private investigator cormoran strike returns in a new mystery from robert galbraith, author of the 1 international bestseller the cuckoos calling.when novelist owen quine goes missing, his wife calls in private detective cormoran strike. at first, mrs. quine just thinks her husband has gone off by himself for a few daysas he has done beforeand she wants strike to find private investigator cormoran strike returns in a new mystery from robert galbraith, author of the 1 international bestseller the cuckoos calling.when novelist owen quine goes missing, his wife calls in private detective cormoran strike. at first, mrs. quine just thinks her husband has gone off by himself for a few daysas he has done beforeand she wants strike to find him and bring him home.but as strike investigates, it becomes clear that there is more to quines disappearance than his wife realizes. the novelist has just completed a manuscript featuring poisonous penportraits of almost everyone he knows. if the novel w

In [37]:
random.choice(clusters)

['capital in the twentyfirst century meets the second machine age in this stunning and optimistic tour de force on the promise and peril of the digital economy, from one of the most brilliant social critics of our time. digital technology was supposed to usher in a new age of endless prosperity, but so far it has been used to put industrial capitalism on steroids, makin capital in the twentyfirst century meets the second machine age in this stunning and optimistic tour de force on the promise and peril of the digital economy, from one of the most brilliant social critics of our time. digital technology was supposed to usher in a new age of endless prosperity, but so far it has been used to put industrial capitalism on steroids, making it harder for people and businesses to keep up. social networks surrender their original missions to more immediately profitable data mining, while brokerage houses abandon value investing for algorithms that drain markets and our 401ks alikeall tactics d

In [40]:
clusters[12]

['the zombie apocalypse has never been more surreal! a mentally unhinged manga artist witnesses the beginning of a zombie outbreak in tokyo, and hes certain of only two things hes destined to be the citys hero, and he possesses something very rare in japanan actual firearm! kengo hanazawas awardwinning series comes to dark horse, and this realisticallydrawn internat the zombie apocalypse has never been more surreal! a mentally unhinged manga artist witnesses the beginning of a zombie outbreak in tokyo, and hes certain of only two things hes destined to be the citys hero, and he possesses something very rare in japanan actual firearm! kengo hanazawas awardwinning series comes to dark horse, and this realisticallydrawn international bestseller takes us from initial small battles for survival to a huge, bodyhorror epidemic that threatens all of humanity! these special omnibus volumes will collect two of the original japanese books into each dark horse edition and include all of the color 

##### Outcomes:

Even though the clusters are almost perfect in terms of what they contain, they only form when the minimum cluster size is very low (here `DIVISION_FACTOR//4`, i.e. 2). When I increase the minimum cluster size, the clusters suddenly grow so much that only two non-outlier clusters remain.

In addition, most of the books (more than <font color="red">75\%</font>) are left unclustered, so these results cannot serve as a proper clustering of the whole collection.
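
The share of unclustered books follows directly from the labels, since HDBSCAN marks noise points with `-1`. With the counts reported above, 783 of 1000 books are outliers, i.e. 78.3%. Note also that the reported total of 60 includes the `-1` group, so there are 59 actual clusters. A small sketch, using a hypothetical `outlier_report` helper and a made-up label list in place of `cluster_labels`:

```python
def outlier_report(labels):
    """Return (number of real clusters, fraction of points labelled -1)."""
    labels = [int(label) for label in labels]
    n_clusters = len(set(labels) - {-1})
    noise_fraction = labels.count(-1) / len(labels)
    return n_clusters, noise_fraction

# Hypothetical labels standing in for cluster_labels.
labels = [-1, -1, -1, 0, 0, 1, 1, -1]
n_clusters, noise_fraction = outlier_report(labels)
print(f"{n_clusters} clusters, {noise_fraction:.1%} outliers")  # 2 clusters, 50.0% outliers
```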