# Book Scraping and Clustering
Case Study for Patika Global Technology

* **Author:** Bora Boyacıoğlu
* **GitHub:** https://github.com/boraboyacioglu-itu
* **E-Mail:** boraboyacioglu@icloud.com

## Main Notebook

This notebook contains the main tasks of the project. The required analyses are carried out, and a main clustering model (alongside an alternative) is selected.

### 1. Web Scraping

This part is implemented in ```get_books.py```. Since scraping takes a while to run, I wrote it as a terminal script. However, the same code can also be run from the cells below.

In [1]:
# Import necessary libraries.
import json
from get_books import scrape

In [None]:
# Define the base URL.
base_url = 'http://books.toscrape.com/'

# Scrape the books.
books = scrape(base_url)

In [None]:
# Save the data to a JSON file.
with open('books.json', 'w', encoding='utf-8') as f:
    json.dump(books, f, indent=4)

Alternatively, it is also possible to read the already-extracted book data.

In [1]:
import json

# Read the already extracted books data.
with open('books.json', encoding='utf-8') as f:
    books = json.load(f)

### 2. Analyses

In [2]:
import AnsiLib as al

print("Total number of books:", al.s(str(len(books)), False))
print("First Book:", al.s(books[0]['title'], False))
print(" Last Book:", al.s(books[-1]['title'], False))


Total number of books: 1000
First Book: A Light in the Attic
 Last Book: 1,000 Places to See Before You Die


Additional analyses:

In [3]:
print("Average price of all books:", al.s("£" + str(sum([float(book['price']) for book in books]) / len(books)), False))
print("Average rating of all books:", al.s(str(sum([float(book['rating']) for book in books]) / len(books)) + "/5.000", False))

Average price of all books: £35.07035
Average rating of all books: 2.923/5.000
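
The averages above can also be computed with the standard library's `statistics.mean`, with rounding applied only for display. The sketch below uses a small hypothetical sample in place of `books`; it assumes, as the cell above does, that each record has `price` and `rating` fields that `float()` can parse. The `summarize` helper is illustrative and not part of the original notebook.

```python
from statistics import mean

# Hypothetical sample records; the real data comes from books.json.
sample_books = [
    {"title": "Book A", "price": "10.00", "rating": 3},
    {"title": "Book B", "price": "20.50", "rating": 5},
    {"title": "Book C", "price": "30.25", "rating": 1},
]

def summarize(records):
    """Return (average price, average rating), rounded for display."""
    avg_price = mean(float(r["price"]) for r in records)
    avg_rating = mean(float(r["rating"]) for r in records)
    return round(avg_price, 2), round(avg_rating, 3)

avg_price, avg_rating = summarize(sample_books)
print(f"Average price: £{avg_price:.2f}")
print(f"Average rating: {avg_rating:.3f}/5.000")
```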


### 3. Clustering Model

I tried four types of clustering models in ```demos.ipynb``` and selected Method 2 (Sentence Embeddings & K-Means Clustering) as the best-performing one.

However, apart from leaving most of the books unclustered, Method 4 (HDBSCAN) also performs very well.

In [4]:
# Import necessary libraries.
import re
import random

from sentence_transformers import SentenceTransformer

from sklearn.cluster import KMeans
import hdbscan


In [5]:
DIVISION_FACTOR = 10  # Target average number of books per cluster.

RANDOM_SEED = 1984
random.seed(RANDOM_SEED)

In [6]:
# Clean the descriptions: keep letters, digits, whitespace and . , ! ? only, then lowercase.
pattern = r'[^a-zA-Z0-9\s.,!?]'
descriptions = [
    re.sub(pattern, '', book['desc']).lower()
    for book in books
]
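
To see what the pattern keeps and drops, the sketch below applies the same substitution to a made-up sentence. Hyphens, dashes, and apostrophes are removed without inserting a space, which is why fused words such as `twotime` show up in the cleaned descriptions. The `clean` helper and the sample sentence are illustrative only.

```python
import re

# Same pattern as in the cell above: keep letters, digits, whitespace and . , ! ?
pattern = r'[^a-zA-Z0-9\s.,!?]'

def clean(text):
    """Strip disallowed characters, then lowercase."""
    return re.sub(pattern, '', text).lower()

sample = "Two-time winner: it's a 'must-read' (2nd ed.)!"
cleaned = clean(sample)
print(cleaned)
```

If fused words turn out to hurt the embeddings, replacing hyphens and dashes with a space before applying the pattern would be one option to try.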

#### Sentence Embeddings & K-Means Clustering

In [7]:
# Load a pre-trained sentence transformer model.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the descriptions into embeddings.
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:06<00:00,  4.83it/s]


In [8]:
# Fit K-Means clustering.
n_clusters = len(books) // DIVISION_FACTOR
kmeans = KMeans(n_clusters=n_clusters, random_state=RANDOM_SEED)

kmeans.fit(embeddings)
labels = kmeans.labels_

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
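
The `huggingface/tokenizers` message above is a warning, not an error. As the message itself suggests, it can be avoided by setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs before `sentence_transformers` is imported:

```python
import os

# Set the variable named in the warning before any tokenizer is used.
# "false" disables tokenizer parallelism, which is what the library
# falls back to anyway once the process has been forked.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```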


In [9]:
# Group the descriptions by cluster.
clusters = {i: [] for i in range(n_clusters)}
for desc, label in zip(descriptions, labels):
    clusters[label].append(desc)
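
`DIVISION_FACTOR` only fixes the average cluster size: K-Means with 100 clusters over 1000 books does not force every cluster to hold exactly 10. A quick way to check the spread is to count the labels. The sketch below uses a short hypothetical label list in place of `kmeans.labels_`.

```python
from collections import Counter

# Hypothetical labels standing in for kmeans.labels_.
sample_labels = [0, 0, 0, 1, 2, 2, 0, 1, 2, 2, 2, 3]

sizes = Counter(sample_labels)          # cluster id -> number of members
smallest = min(sizes.values())
largest = max(sizes.values())
average = len(sample_labels) / len(sizes)

print("Cluster sizes:", dict(sorted(sizes.items())))
print(f"min={smallest}, max={largest}, mean={average:.1f}")
```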

In [10]:
random.choice(clusters)

['start preparing children for classroom success with tracing numbers on a train! this educational workbook is filled with pages of giant numbers that make it easy for little hands to learn pencil control, followed by pages of small numbers for repetition and motor skill development. the illustrated practice pages will keep kids engaged while learning their numbers from 1 to start preparing children for classroom success with tracing numbers on a train! this educational workbook is filled with pages of giant numbers that make it easy for little hands to learn pencil control, followed by pages of small numbers for repetition and motor skill development. the illustrated practice pages will keep kids engaged while learning their numbers from 1 to 30. ...more',
 'twotime winner of the pulitzer prize david mccullough tells the dramatic storybehindthestory about the courageous brothers who taught the world how to fly wilbur and orville wright.on a winter day in 1903, in the outer banks of no

##### Outcomes:

The results show that each cluster groups similar books together, so embedding the descriptions and running plain K-Means on the embeddings works well.
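
The judgment above comes from reading sampled clusters. One way to put a number on it, not used in the original notebook, is the mean pairwise cosine similarity inside each cluster. The sketch below is pure Python with toy 2-D vectors; `mean_intra_cluster_similarity` is a hypothetical helper, and in the notebook its inputs would be `embeddings` and `labels`.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_intra_cluster_similarity(vectors, labels):
    """Map each cluster id to the mean cosine similarity over its member pairs.

    Clusters with fewer than two members are skipped.
    """
    groups = {}
    for vec, label in zip(vectors, labels):
        groups.setdefault(int(label), []).append(vec)
    scores = {}
    for label, members in groups.items():
        pairs = list(combinations(members, 2))
        if pairs:
            scores[label] = sum(cosine(u, v) for u, v in pairs) / len(pairs)
    return scores

# Toy data: cluster 0 is tight, cluster 1 is spread out.
toy_vectors = [(1.0, 0.0), (2.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
toy_labels = [0, 0, 1, 1]
scores = mean_intra_cluster_similarity(toy_vectors, toy_labels)
print(scores)
```

Values close to 1 indicate that the descriptions in a cluster point in nearly the same direction in embedding space.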

In [11]:
# Save the clusters to a JSON file.
with open('clusters.json', 'w', encoding='utf-8') as f:
    json.dump(clusters, f, indent=4)
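
One detail to keep in mind when reading `clusters.json` back: JSON object keys are always strings, so the integer cluster ids come back as `"0"`, `"1"`, and so on. A small sketch of the round trip, using a hypothetical two-cluster dictionary and an in-memory string instead of a file:

```python
import json

# Hypothetical miniature of the clusters dictionary.
sample_clusters = {0: ["first description"], 1: ["second description"]}

# Serialize and load back; integer keys are written as strings.
restored = json.loads(json.dumps(sample_clusters, indent=4))
print(list(restored))

# Convert the keys back to integers if numeric ids are needed.
restored_int = {int(key): value for key, value in restored.items()}
```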

#### HDBSCAN

In [12]:
# Generate embeddings for the descriptions.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.64it/s]


In [13]:
# Cluster the embeddings using HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=DIVISION_FACTOR//4, metric='euclidean')
cluster_labels = clusterer.fit_predict(embeddings)



In [14]:
# Group the books by clusters.
clusters = {}
for desc, label in zip(descriptions, cluster_labels):
    clusters.setdefault(label, []).append(desc)

In [15]:
print("Total number of clusters:", al.g(str(len(clusters)), False))
print("Ratio of outliers:", '/'.join([
    al.r(str(list(cluster_labels).count(-1)), False),
    str(len(cluster_labels))
]))

Total number of clusters: 60
Ratio of outliers: 783/1000
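
HDBSCAN marks points it cannot assign to any cluster with the label `-1`, so the cluster count printed above includes that noise group. The sketch below separates the two and computes the outlier share; `sample_labels` is a hypothetical stand-in for `cluster_labels`.

```python
from collections import Counter

# Hypothetical labels; -1 is HDBSCAN's noise label.
sample_labels = [-1, 0, 0, -1, 1, 1, -1, -1, 0, -1]

counts = Counter(sample_labels)
n_outliers = counts.get(-1, 0)
n_real_clusters = len(counts) - (1 if -1 in counts else 0)
outlier_share = n_outliers / len(sample_labels)

print("Clusters excluding noise:", n_real_clusters)
print(f"Outliers: {n_outliers}/{len(sample_labels)} ({outlier_share:.1%})")
```

Applied to the figures printed above, this gives 59 clusters once the noise group is excluded and an outlier share of 783/1000, i.e. 78.3%.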


In [16]:
clusters.pop(-1);

In [17]:
random.choice(clusters)

['when worldrenowned harvard symbologist robert langdon is summoned to a swiss research facility to analyze a mysterious symbolseared into the chest of a murdered physicisthe discovers evidence of the unimaginable the resurgence of an ancient secret brotherhood known as the illuminati ... the most powerful underground organization ever to walk the earth. the illuminati h when worldrenowned harvard symbologist robert langdon is summoned to a swiss research facility to analyze a mysterious symbolseared into the chest of a murdered physicisthe discovers evidence of the unimaginable the resurgence of an ancient secret brotherhood known as the illuminati ... the most powerful underground organization ever to walk the earth. the illuminati has now surfaced to carry out the final phase of its legendary vendetta against its most hated enemythe catholic church. langdons worst fears are confirmed on the eve of the holy conclave, when a messenger of the illuminati announces they have hidden an un

In [18]:
random.choice(clusters)

['nothing living is safe. nothing dead is to be trusted.for years, gansey has been on a quest to find a lost king. one by one, hes drawn others into this quest ronan, who steals from dreams adam, whose life is no longer his own noah, whose life is no longer a lie and blue, who loves ganseyand is certain she is destined to kill him.now the endgame has begun. dreams and nothing living is safe. nothing dead is to be trusted.for years, gansey has been on a quest to find a lost king. one by one, hes drawn others into this quest ronan, who steals from dreams adam, whose life is no longer his own noah, whose life is no longer a lie and blue, who loves ganseyand is certain she is destined to kill him.now the endgame has begun. dreams and nightmares are converging. love and loss are inseparable. and the quest refuses to be pinned to a path. ...more',
 'every year, blue sargent stands next to her clairvoyant mother as the soontobe dead walk past. blue never sees themuntil this year, when a boy e

##### Outcomes:

Even though the clusters are almost perfect in terms of what they contain, the minimum cluster size that produces them is very low (`DIVISION_FACTOR // 4`, i.e. 2). When I increase the minimum cluster size, the clusters suddenly grow so large that only two non-outlier clusters remain.

It should also be noted that most books (more than <font color="red">75%</font>) are not assigned to any cluster here, so these results do not help with clustering the whole catalogue.