# Book Scraping and Clustering
Case Study for Patika Global Technology

* **Author:** Bora Boyacıoğlu
* **GitHub:** https://github.com/boraboyacioglu-itu
* **E-Mail:** boraboyacioglu@icloud.com

## Clustering Demos

This notebook compares several ways of clustering the book descriptions, based either on classical NLP features (TF-IDF, LDA) or on sentence embeddings.

In [1]:
# Import necessary libraries.
import re
import json
import random

import numpy as np
import AnsiLib as al

In [2]:
DIVISION_FACTOR = 10  # Target average number of books per cluster.

RANDOM_SEED = 1984
random.seed(RANDOM_SEED)

In [3]:
# Read the already extracted books data.
with open('books.json', encoding='utf-8') as f:
    books = json.load(f)

# Clean the descriptions: keep only letters, digits, whitespace and
# basic punctuation, then lowercase everything.
pattern = r'[^a-zA-Z0-9\s.,!?]'
descriptions = [
    re.sub(pattern, '', book['desc']).lower()
    for book in books
]
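To illustrate the effect of this cleaning step, the sketch below applies the same pattern to a made-up sentence (the sample text is hypothetical, not taken from `books.json`). It shows why words such as "whos", "onewoman" or "seoras" appear in the cluster samples further down: apostrophes, hyphens and non-ASCII letters are dropped without a replacement character.

```python
import re

# Same character whitelist as in the cell above.
pattern = r'[^a-zA-Z0-9\s.,!?]'

# Hypothetical sample text, used only for illustration.
sample = "The señoras say he's a one-woman guy!"

# Characters outside the whitelist are removed, then the text is lowercased.
cleaned = re.sub(pattern, '', sample).lower()
print(cleaned)
```

If keeping word boundaries matters, replacing the removed characters with a space instead of an empty string would be one alternative worth trying.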

### Method 1: TF-IDF Vectorisation & K-Means Clustering

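This is a straightforward K-Means clustering that targets roughly 10 books per cluster, i.e. `len(books) // DIVISION_FACTOR` clusters (100 for the 1000 books here). First, the descriptions are turned into TF-IDF vectors, then the clustering model is fitted on them.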
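As a rough picture of what the vectorisation does, here is a minimal NumPy sketch of TF-IDF on a hypothetical toy corpus. It uses the smoothed IDF formula and L2 row normalisation that, to my understanding, match the defaults of `TfidfVectorizer`, but it skips tokenisation details and the English stop-word filter used in the notebook, so the numbers will not match scikit-learn exactly.

```python
import numpy as np

# Hypothetical toy corpus standing in for the book descriptions.
docs = [
    "war diary of a young girl",
    "a girl and a boy fall in love",
    "love story of a boy",
]

# Build the vocabulary and the raw term-count matrix.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: j for j, word in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)))
for i, doc in enumerate(docs):
    for word in doc.split():
        counts[i, index[word]] += 1

# Smoothed inverse document frequency: rarer words get a larger weight.
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts and scale every row to unit length.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(tfidf.shape)
print(idf[index["war"]] > idf[index["a"]])
```

A word that occurs in a single description ("war") ends up with a higher weight than one that occurs in all of them ("a"), which is what lets K-Means separate descriptions by their distinctive vocabulary.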

In [4]:
# Import necessary libraries.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [5]:
# Convert descriptions to TF-IDF vectors.
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(descriptions)

In [6]:
# Fit K-Means clustering.
n_clusters = len(books) // DIVISION_FACTOR
kmeans = KMeans(n_clusters=n_clusters, random_state=RANDOM_SEED)

kmeans.fit(X)
labels = kmeans.labels_

In [7]:
# Group the descriptions by cluster.
clusters = {i: [] for i in range(n_clusters)}
for desc, label in zip(descriptions, labels):
    clusters[label].append(desc)

In [8]:
random.choice(clusters)

['one class assignment. one second chance at love. the school player is all in. now he needs to win back the sweet commitment girl whos forever owned his heart. justin carter has a secret. hes not the total player fairfield academy believes him to be. not really. in fact, he used to be a onewoman guy...and his feelings for her never went away. too bad he broke her heart t one class assignment. one second chance at love. the school player is all in. now he needs to win back the sweet commitment girl whos forever owned his heart. justin carter has a secret. hes not the total player fairfield academy believes him to be. not really. in fact, he used to be a onewoman guy...and his feelings for her never went away. too bad he broke her heart three years ago and made sure to ruin any chance shed ever forgive him. peyton williams is a liar. she pretends to be whole, counting down the days until graduation and helping her parents at the family ranch. but the truth is, shes done everything she c

#### Outcomes of Method 1:



### Method 2: Sentence Embeddings & K-Means Clustering

For the second method, instead of relying on plain TF-IDF vectorisation, I introduced a sentence transformer. This pre-trained model captures the semantic meaning of sentences and generally gives better results, as it incorporates contextual information.

In [9]:
# Import necessary libraries.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


In [10]:
# Load a pre-trained sentence transformer model.
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the descriptions into embeddings.
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.61it/s]


In [11]:
# Fit K-Means clustering.
n_clusters = len(books) // DIVISION_FACTOR
kmeans = KMeans(n_clusters=n_clusters, random_state=RANDOM_SEED)

kmeans.fit(embeddings)
labels = kmeans.labels_

In [12]:
# Group the descriptions by cluster.
clusters = {i: [] for i in range(n_clusters)}
for desc, label in zip(descriptions, labels):
    clusters[label].append(desc)

In [13]:
random.choice(clusters)

['discovered in the attic in which she spent the last years of her life, anne franks remarkable diary has since become a world classica powerful reminder of the horrors of war and an eloquent testament to the human spirit.in 1942, with nazis occupying holland, a thirteenyearold jewish girl and her family fled their home in amsterdam and went into hiding. for the next two discovered in the attic in which she spent the last years of her life, anne franks remarkable diary has since become a world classica powerful reminder of the horrors of war and an eloquent testament to the human spirit.in 1942, with nazis occupying holland, a thirteenyearold jewish girl and her family fled their home in amsterdam and went into hiding. for the next two years, until their whereabouts were betrayed to the gestapo, they and another family lived cloistered in the secret annexe of an old office building. cut off from the outside world, they faced hunger, boredom, the constant cruelties of living in confined

#### Outcomes of Method 2:

This method seems to have worked well. The descriptions within a sampled cluster share a topic *(for the first case: WW2)* and the grouping is meaningful. Considering how simple K-Means is to apply, sentence embeddings proved to be a useful approach.
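The impression above comes from reading sampled clusters. One possible way to back it up with a number, which is not part of the original notebook, is to average the pairwise cosine similarity of the embeddings inside each cluster. The helper name `mean_intra_cluster_similarity` and the toy vectors below are assumptions for illustration; in the notebook, `embeddings` and `labels` would take their place.

```python
import numpy as np

def mean_intra_cluster_similarity(vectors, labels):
    """Average pairwise cosine similarity within each cluster."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    # Normalise rows so that a dot product equals the cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = {}
    for label in np.unique(labels):
        members = unit[labels == label]
        if len(members) < 2:
            continue  # A single-member cluster has no pairs to compare.
        sims = members @ members.T
        # Keep only the pairs above the diagonal (i < j).
        upper = sims[np.triu_indices(len(members), k=1)]
        scores[int(label)] = float(upper.mean())
    return scores

# Hypothetical 2-D vectors standing in for the sentence embeddings.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_labels = [0, 0, 1, 1]
scores = mean_intra_cluster_similarity(toy_vectors, toy_labels)
print(scores)
```

Values close to 1 indicate tight clusters. Comparing the scores across the methods in this notebook would give a rough, though not definitive, ranking.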

### Method 3: Topic Modeling using Latent Dirichlet Allocation (LDA)

To capture underlying themes rather than individual word weights, I used LDA topic modeling in the third method.

In [14]:
# Import necessary libraries.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
# Convert descriptions to document-term matrix.
vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(descriptions)

In [16]:
# Fit the LDA model.
n_components = len(books) // DIVISION_FACTOR
lda = LatentDirichletAllocation(n_components=n_components, random_state=RANDOM_SEED)
lda.fit(dtm)

In [17]:
# Get the topic distribution for each description.
topic_distribution = lda.transform(dtm)

# Assign each description to the topic with the highest probability.
labels = topic_distribution.argmax(axis=1)

In [18]:
# Group the descriptions by cluster.
clusters = {i: [] for i in range(n_components)}
for desc, label in zip(descriptions, labels):
    clusters[label].append(desc)

In [19]:
random.choice(clusters)

['everyone knows the legends about the cursed girlisabel, the one the seoras whisper about. they say she has green skin and grass for hair, and she feeds on the poisonous plants that fill her familys caribbean island garden. some say she can grant wishes some say her touch can kill.seventeenyearold lucas lives on the mainland most of the year but spends summers with h everyone knows the legends about the cursed girlisabel, the one the seoras whisper about. they say she has green skin and grass for hair, and she feeds on the poisonous plants that fill her familys caribbean island garden. some say she can grant wishes some say her touch can kill.seventeenyearold lucas lives on the mainland most of the year but spends summers with his hoteldeveloper father in puerto rico. hes grown up hearing stories about the cursed girl, and he wants to believe in isabel and her magic. when letters from isabel begin mysteriously appearing in his room the same day his new girlfriend disappears, lucas tur

#### Outcomes of Method 3:

This time, no meaningful generalisation seems to have been made. The random clusters I checked contain unrelated descriptions. Topic modeling did not help with clustering here: there are no clear patterns and no coherence.
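A possible diagnostic for this outcome, offered as a sketch rather than something the notebook ran, is to look at how confident the `argmax` assignment is. If the largest topic probability of a description is barely above the others, the assigned label is close to arbitrary. The matrix and the 0.5 threshold below are hypothetical; in the notebook the matrix would be the `topic_distribution` returned by `lda.transform(dtm)`.

```python
import numpy as np

# Hypothetical document-topic matrix; each row sums to 1.
topic_distribution = np.array([
    [0.90, 0.05, 0.05],  # Clearly dominated by topic 0.
    [0.34, 0.33, 0.33],  # Almost uniform: the argmax is close to arbitrary.
    [0.10, 0.80, 0.10],  # Clearly dominated by topic 1.
])

# Same assignment rule as in the notebook.
labels = topic_distribution.argmax(axis=1)

# Probability mass of the winning topic for each description.
confidence = topic_distribution.max(axis=1)

# Assumed threshold: flag rows whose top topic holds less than half the mass.
weak = confidence < 0.5
print(labels.tolist())
print(weak.tolist())
```

A large share of weak assignments would be consistent with the incoherent clusters observed here.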

### Method 4: HDBSCAN

For the fourth method, I combined sentence embeddings with a density-based clustering algorithm (HDBSCAN), which determines the number of clusters automatically.

In [20]:
# Import libraries for HDBSCAN.
from sentence_transformers import SentenceTransformer
import hdbscan

In [43]:
# Generate embeddings for the descriptions.
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(descriptions, show_progress_bar=True)

Batches: 100%|██████████| 32/32 [00:04<00:00,  7.98it/s]


In [50]:
# Cluster the embeddings using HDBSCAN.
clusterer = hdbscan.HDBSCAN(min_cluster_size=DIVISION_FACTOR//4, metric='euclidean')
cluster_labels = clusterer.fit_predict(embeddings)



In [51]:
# Group the books by clusters.
clusters = {}
for desc, label in zip(descriptions, cluster_labels):
    clusters.setdefault(label, []).append(desc)

In [52]:
print("Total number of clusters:", al.g(str(len(clusters)), False))
print("Ratio of outliers:", '/'.join([
    al.r(str(list(cluster_labels).count(-1)), False),
    str(len(cluster_labels))
]))

Total number of clusters: 60
Ratio of outliers: 783/1000


In [53]:
clusters.pop(-1);

In [54]:
random.choice(clusters)

['given the opportunity, would you assume someone elses identity and leave your old life behind? a serendipitous crossing of paths between lisa barnes, a downonherluck job seeker, and julie forman, a personal trainer to an olympic hopeful, forever changes the course of both womens lives. one winds up dead and the other finds herself a fugitive, hiding behind one lie aft given the opportunity, would you assume someone elses identity and leave your old life behind? a serendipitous crossing of paths between lisa barnes, a downonherluck job seeker, and julie forman, a personal trainer to an olympic hopeful, forever changes the course of both womens lives. one winds up dead and the other finds herself a fugitive, hiding behind one lie after another as a coldblooded killer methodically hunts her. desperately trying to stay alive, the terrified woman enlists the help of forensic instincts, a rogue investigative team that clandestinely operates in the gray area between legal and illegal. safeg

In [61]:
random.choice(clusters)

['logan matthews is a father, architect, and widower. he lives in brooklyn, new york with his three year old son, liam. his life is as ordinary as any single father with a toddler. lara miller is a single mother raising her nine year old daughter, olivia in new york city, the city that never sleeps. she lives with her roommate, erin and finds that she enjoys life just the wa logan matthews is a father, architect, and widower. he lives in brooklyn, new york with his three year old son, liam. his life is as ordinary as any single father with a toddler. lara miller is a single mother raising her nine year old daughter, olivia in new york city, the city that never sleeps. she lives with her roommate, erin and finds that she enjoys life just the way it is. logan and lara have all but given up on love. however, fate has another plan for these parents because they both share something that will draw them together in a way they never thought possible. their daughter. ...more',
 'maple valley b

#### Outcomes of Method 4:

Judging from the random samples, the clusters group genuinely similar books. In terms of what they contain, the clusters themselves are excellent.

However, each cluster covers a very narrow niche. Raising the minimum cluster size is possible, of course, but then the number of clusters becomes too low to be useful.

Also, the number of outliers shows that this will not be a practical tool. More than <font color="red">75\%</font> of the books (783 of 1000) were left unclassified, meaning the model only works for a minority of the books and therefore should not be preferred.
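The outlier share can be computed directly from the label array, since HDBSCAN marks noise points with the label `-1`. The sketch below uses a hypothetical label array in place of `cluster_labels`. With the notebook's figures, 783 of 1000 gives 78.3% outliers. Note also that the printed total of 60 was taken before `clusters.pop(-1)`, so it includes the noise group and leaves 59 actual clusters.

```python
import numpy as np

# Hypothetical label array; -1 is the noise label.
cluster_labels = np.array([0, 0, -1, 1, -1, -1, 1, -1])

# Count the noise points and express them as a share of all points.
n_outliers = int((cluster_labels == -1).sum())
outlier_ratio = n_outliers / len(cluster_labels)

# Count the real clusters, leaving the noise label out.
n_real_clusters = len(set(cluster_labels.tolist()) - {-1})

print(f"{n_real_clusters} clusters, {outlier_ratio:.1%} outliers")
```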