# Clustering of Books Based on Goodreads Descriptions: Unsupervised ML

# Introduction
The goal of this project was to create an unsupervised machine learning algorithm that could cluster books together with other books of a similar topic. Categorization tasks are frustrating and time-consuming to do by hand, and automating a large part of the work, even if the results are imperfect, would make them significantly easier and faster.
# 1 Load Data
## 1.1 Package Importing
For this project, all of the data manipulation is done with Pandas, and the machine learning component is handled by Scikit-Learn. TfidfVectorizer converts the list of texts into a term frequency / inverse document frequency (TF-IDF) matrix. TruncatedSVD reduces that matrix to 1000 components; these are linear combinations of terms that capture most of the variation in the TF-IDF matrix, not a subset of the original terms. We will use the KMeans method to cluster the data, and we will construct a data pipeline using sklearn's make_pipeline.

In [10]:
#import pandas and necessary sklearn modules
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

## 1.2 Data Importing / Merging
The files for this project are not particularly clean. Books 1-600k do not have descriptions attached (they are in separate files). We will need to import all of the files, match up the column names, merge on the appropriate columns, then concatenate the merged dataframes. We can accomplish much of this relatively quickly using for loops.

In [29]:
#create lists of the incomplete book files, incomplete description files, and complete files
incomp_book_files = ['CSV datasets/book1-100k.csv', 
                     'CSV datasets/book100k-200k.csv', 
                     'CSV datasets/book200k-300k.csv', 
                     'CSV datasets/book300k-400k.csv', 
                     'CSV datasets/book400k-500k.csv', 
                     'CSV datasets/book500k-600k.csv']
incomp_desc_files = ['CSV datasets/book1-100k_descrip.csv', 
                     'CSV datasets/book100k-195k_descrip.csv', 
                     'CSV datasets/book195k-295k_descrip.csv', 
                     'CSV datasets/book295k-400k_descrip.csv', 
                     'CSV datasets/book400k-500k_descrip.csv', 
                     'CSV datasets/book500k-600k_descrip.csv']
comp_book_files = ['CSV datasets/book600k-700k.csv',
                   'CSV datasets/book700k-800k.csv',
                   'CSV datasets/book800k-900k.csv', 
                   'CSV datasets/book900k-1000k.csv', 
                   'CSV datasets/book1000k-1100k.csv', 
                   'CSV datasets/book1100k-1200k.csv', 
                   'CSV datasets/book1200k-1300k.csv', 
                   'CSV datasets/book1300k-1400k.csv']

#create empty dfs for each of the file categories
incomplete_book_df = pd.DataFrame(columns = ['Id', 'Name', 'Language'])
incomplete_description_df = pd.DataFrame(columns = ['Id', 'Description'])
comp_book_df = pd.DataFrame(columns = ['Id', 'Name', 'Description', 'Language'])

#read in csv book files for books 1-600k
#(pd.concat is used because DataFrame.append was removed in pandas 2.0)
for path in incomp_book_files:
    df = pd.read_csv(path, usecols = ['Id', 'Name', 'Language'])
    incomplete_book_df = pd.concat([incomplete_book_df, df])

#read in csv description files for books 1-600k
for path in incomp_desc_files:
    df = pd.read_csv(path, usecols = ['Id', 'Description'])
    incomplete_description_df = pd.concat([incomplete_description_df, df])

#merge incomplete_book_df and incomplete_description_df on 'Id'
book_desc_merge = pd.merge(incomplete_book_df, incomplete_description_df, how = 'inner', on = 'Id')

book_desc_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 316931 entries, 0 to 316930
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           316931 non-null  object
 1   Name         316931 non-null  object
 2   Language     69424 non-null   object
 3   Description  275728 non-null  object
dtypes: object(4)
memory usage: 12.1+ MB
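
An inner merge silently drops any book whose Id has no match in the description files, and vice versa. The following is a minimal sketch on toy data (the frames and values are hypothetical, not taken from the Goodreads files) showing that behaviour, and how an outer merge with `indicator=True` could be used to count the unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the book and description files (hypothetical values).
toy_books = pd.DataFrame({'Id': [1, 2, 3], 'Name': ['A', 'B', 'C']})
toy_descs = pd.DataFrame({'Id': [2, 3, 4],
                          'Description': ['about B', None, 'about D']})

# how='inner' keeps only the Ids that appear in both frames.
toy_inner = pd.merge(toy_books, toy_descs, how='inner', on='Id')

# An outer merge with indicator=True adds a '_merge' column recording
# whether each row came from the left frame, the right frame, or both.
toy_outer = pd.merge(toy_books, toy_descs, how='outer', on='Id', indicator=True)

print(toy_inner)
print(toy_outer['_merge'].value_counts())
```

Running the same outer merge on the real frames would show how many books were lost for lack of a description row.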


In [32]:
#read in complete files for books 600k-1400k
for path in comp_book_files:
    df = pd.read_csv(path, usecols = ['Id', 'Name', 'Language', 'Description'])
    comp_book_df = pd.concat([comp_book_df, df])

comp_book_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 727338 entries, 0 to 38287
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           727338 non-null  object
 1   Name         727338 non-null  object
 2   Description  625896 non-null  object
 3   Language     122764 non-null  object
dtypes: object(4)
memory usage: 27.7+ MB


In [33]:
#vertically concatenate the two dfs
books_complete = pd.concat([book_desc_merge, comp_book_df])
books_complete.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1044269 entries, 0 to 38287
Data columns (total 4 columns):
 #   Column       Non-Null Count    Dtype 
---  ------       --------------    ----- 
 0   Id           1044269 non-null  object
 1   Name         1044269 non-null  object
 2   Language     192188 non-null   object
 3   Description  901624 non-null   object
dtypes: object(4)
memory usage: 39.8+ MB


# 2 Data Preparation
## 2.1 Data Cleaning
There is a language column, although most of its values are null. As Goodreads is primarily an English-language site, I will assume that most of the null values are English. I'll create a list of the additional language codes that I will accept as English, and remove all the rows that have a language code that is not in this list.

In [50]:
#list unique non-null values in 'Language' column, excluding the codes accepted as english
#(dropna() removes the nan entry explicitly instead of relying on its position in the list)
eng_langs = ['eng', 'en-US', 'en-GB', 'enm', 'en-CA', '--']
langs = [lang for lang in books_complete['Language'].dropna().unique()
         if lang not in eng_langs]

#remove all entries in books_complete where Language is in langs
books_eng = books_complete[~books_complete['Language'].isin(langs)]

Unnamed: 0,Id,Name,Language,Description
1731,2701,The Canterbury Tales (original-spelling edition),enm,One of the greatest and most ambitious works i...
1739,2711,The Riverside Chaucer,enm,The most authentic edition of Chaucer's Comple...
19439,32816,The Canterbury Tales: Fifteen Tales and the Ge...,enm,"Each is presented in the original language, wi..."
41998,74041,Morte Arthure (alliterative version from Thorn...,enm,EDITED FROM ROBERT THORNTONS MS. AB. 1440 A.D....
42072,74179,Dream Visions and Other Poems,enm,Contexts connects the poems to their classical...
63712,111994,Troilus and Criseyde,enm,This Norton Critical Edition of Chaucer’s mast...
114457,205486,Everyman and Medieval Miracle Plays (Everyman'...,enm,Miracle plays were a popular form of entertain...
121765,219269,"Le Morte d'Arthur, Vol. 2",enm,"An immortal story of love, adventure, chivalry..."
232664,429581,"Poems of the Pearl Manuscript: Pearl, Cleannes...",enm,This edition has been revised to take account ...
232679,429611,The Works of the Gawain Poet,enm,"Pearl, a dream vision, presents its poignant s..."
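
The same filter can also be written positively, which makes the treatment of nulls explicit. This is a minimal sketch on a toy frame (the rows are hypothetical); note that `enm` is the ISO 639-2 code for Middle English, which is why the rows above are mostly Chaucer and other medieval texts.

```python
import numpy as np
import pandas as pd

# Toy frame with one null, one modern English, one French,
# and one Middle English entry (hypothetical rows).
toy = pd.DataFrame({'Name': ['a', 'b', 'c', 'd'],
                    'Language': [np.nan, 'eng', 'fre', 'enm']})
eng_langs = ['eng', 'en-US', 'en-GB', 'enm', 'en-CA', '--']

# Keep rows whose language is missing or is one of the accepted codes.
keep = toy['Language'].isna() | toy['Language'].isin(eng_langs)
toy_eng = toy[keep]

print(toy_eng)
```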


Many of the book descriptions still contain HTML tags, so I will construct a regex and call the replace function to replace every HTML tag with an empty string. I will also drop any rows with missing values at this point.

In [51]:
#drop language col (reassigning instead of using inplace avoids SettingWithCopyWarning)
books_eng = books_eng.drop('Language', axis = 1)

#use regex to remove html tags, then drop rows with missing values
books_eng = books_eng.replace('<[^>]*>', '', regex = True)
books_eng = books_eng.dropna()
books_eng.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 886056 entries, 0 to 38287
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           886056 non-null  int64 
 1   Name         886056 non-null  object
 2   Description  886056 non-null  object
dtypes: int64(1), object(2)
memory usage: 27.0+ MB
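
To see what the tag-stripping pattern does, here is a minimal sketch on a toy frame (the strings are made up for illustration). The pattern `<[^>]*>` matches a `<`, any run of characters that are not `>`, and the closing `>`; it does not touch HTML entities or characters such as non-breaking spaces.

```python
import pandas as pd

# Toy descriptions: one with tags, one without, one missing (hypothetical).
toy = pd.DataFrame({'Description': ['<p>A <b>bold</b> claim.</p>',
                                    'No tags here',
                                    None]})

# regex=True makes replace() treat the first argument as a regular
# expression and substitute every match inside each string.
toy_clean = toy.replace('<[^>]*>', '', regex=True).dropna()

print(toy_clean['Description'].tolist())
```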


In order for the algorithm to proceed at a reasonable pace (as this will be run on my home computer), I am going to sample just 100,000 titles to categorize.

In [52]:
#take a random sample of 100,000 books from books_eng
books_sample = books_eng.sample(n = 100000, random_state = 42)
books_sample

Unnamed: 0,Id,Name,Description
11045,925803,After Tamerlane: The Global History of Empire ...,\n A Rise and Fall of the Great Powers for th...
161424,294342,"Rethinking Our Classrooms, Volume 1: Teaching ...",The original edition of Rethinking Our Classro...
31625,1079041,Autumn Blue,"An authentic tale of the bonds of family, fait..."
38896,1097910,Inka Bodies and the Body of Christ: Corpus Chr...,In Inka Bodies and the Body of Christ Carolyn ...
31570,860045,Uniforms Of The Civil War: In Color,"""Traces the Blue and the Gray in this gallery ..."
...,...,...,...
292266,550238,Next to Hughes: Behind the Power and Tragic Do...,The man closest to billionaire Howard Hughes d...
106340,189962,Joanne Weir's More Cooking in the Wine Country...,"""Somehow, we all must eat. we can make indiffe..."
207754,382263,History of the Chicago Urban League,The first scholarly study of a local racial ad...
26965,850243,The Lost Land,"Eavan Boland's new book, her first since the C..."


## 2.2 Data Preprocessing
Now I will take the Description and Name columns and convert them to two lists.

In [53]:
#convert descriptions and titles to lists
desc_list = books_sample['Description'].tolist()
title_list = books_sample['Name'].tolist()

desc_list[:6]

['\n  A Rise and Fall of the Great Powers for the post–Cold War era—a brilliantly written, sweeping new history of how empires have ebbed and flowed over the past six centuries. \n\xa0The death of the great Tatar emperor Tamerlane in 1405, writes historian John Darwin, was a turning point in world history. Never again would a single warlord, raiding across the steppes, be able to unite Eurasia under his rule. After Tamerlane, a series of huge, stable empires were founded and consolidated— Chinese, Mughal, Persian, and Ottoman—realms of such grandeur, sophistication, and dynamism that they outclassed the fragmentary, quarrelsome nations of Europe in every respect. The nineteenth century saw these empires fall vulnerable to European conquest, creating an age of anarchy and exploitation, but this had largely ended by the twenty-first century, with new Chinese and Indian super-states and successful independent states in Turkey and Iran. \xa0This elegantly written, magisterial account chall

There are still some small peculiarities in the text (such as newline and \xa0 non-breaking-space characters), but it should be fine, especially once the matrix is reduced. Now we can instantiate the TF-IDF vectorizer and create a sparse matrix from the description texts.

In [54]:
#instantiate TfidfVectorizer as tfidf
tfidf = TfidfVectorizer()

In [55]:
#create a CSR TF-IDF matrix from the description list
csr_mat = tfidf.fit_transform(desc_list)

# 3 Machine Learning Model Creation
## 3.1 Hyperparameters and Pipeline
We will (arbitrarily) create 100 clusters, and reduce the TF-IDF matrix to 1000 SVD components, which are the features on which the clustering will be applied. We will combine these two steps in a pipeline.

In [56]:
#create a TruncatedSVD instance, svd, keeping 1000 components
svd = TruncatedSVD(n_components = 1000)

#create a KMeans instance, kmeans, with 100 clusters
kmeans = KMeans(n_clusters = 100)

#create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
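
The same two-step pipeline can be tried on a toy corpus to see the shapes involved. This is a sketch only: the documents are hypothetical, the sizes are shrunk to fit them, and `random_state` and `n_init` are set here for repeatability (the notebook's own cell leaves them at their defaults).

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Six toy "descriptions" on two topics (hypothetical).
toy_docs = ['yoga poses for beginners', 'morning yoga workouts',
            'yoga and breathing', 'java programming guide',
            'programming with java and sql', 'sql programming basics']
toy_mat = TfidfVectorizer().fit_transform(toy_docs)

# 2 components and 2 clusters are toy-sized choices, not recommendations.
toy_svd = TruncatedSVD(n_components=2, random_state=0)
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_pipeline = make_pipeline(toy_svd, toy_kmeans)

# fit_predict reduces the matrix with SVD, fits KMeans on the result,
# and returns one cluster label per document.
toy_labels = toy_pipeline.fit_predict(toy_mat)

print(toy_svd.components_.shape)  # (n_components, vocabulary size)
print(toy_labels)
```

Each row of `components_` has a weight for every vocabulary term, which is why the SVD output is a set of term combinations and not a shortlist of words.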

## 3.2 Fitting and Predicting
We will send the CSR matrix through the pipeline, which reduces it with SVD and then clusters it, and store the predicted cluster label for each description in labels. We will then combine the labels and titles into a dataframe with the columns "label" and "title", and look at several label values to check the efficacy of the model.

In [57]:
# Fit the pipeline to the TF-IDF matrix of descriptions
pipeline.fit(csr_mat)

# Calculate the cluster labels: labels
labels = pipeline.predict(csr_mat)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'title': title_list})

# Display df sorted by cluster label
print(df.sort_values('label'))


       label                                              title
78320      0     China's Bravest Girl: The Legend of Hua Mu Lan
77072      0                                        It's Summer
91054      0  Stairway to Heaven: The Final Resting Places o...
36011      0  Inspirational Romance Reader (Historical Colle...
36012      0                              New Kids on the Block
...      ...                                                ...
67953     99                 Shakespeare, Race, And Colonialism
8876      99                                 Shakespeare’s Hand
81894     99                              What Was Shakespeare?
9243      99                        Romeo & Juliet, Plainspoken
64224     99  Shakespeare - The Biography: Vol III: A Muse o...

[100000 rows x 2 columns]


In [70]:
df[df['label'] == 9].head(10)

Unnamed: 0,label,title
173,9,The Ultimate Weight Solution Food Guide
655,9,Cucina of Le Marche: A Chef's Treasury of Reci...
1243,9,The New Sugar Busters! Shopper's Guide
1883,9,"Man and Animals: Living, Working, and Changing..."
2147,9,Conscious Eating
2289,9,Great Wine Made Simple: Straight Talk from a M...
2540,9,"Booty Food: A Date by Date, Nibble by Nibble, ..."
2844,9,The Science of Cooking
2932,9,The Simple Art of Marrying Food and Wine
3553,9,The Insanity Defense: The Complete Prose


In [81]:
df[df['label'] == 29].head(10)

Unnamed: 0,label,title
146,29,Creating Client Extranets with Sharepoint 2003
751,29,SQL Built-In Functions and Stored Procedures: ...
1343,29,Pro JavaScript Techniques
1463,29,Professional BizTalk Server 2006
1765,29,Foundations of Aop for J2ee Development
1830,29,Essential Visual Basic 6.0 Fast
1851,29,Microsoft Exchange Server 2003 24seven
2658,29,Designing and Coding Reusable C++
2870,29,Java Distributed Computing
3027,29,Programming with Microsoft Visual Basic.Net: A...


In [85]:
df[df['label'] == 41].head(10)

Unnamed: 0,label,title
244,41,Down for the Count: (Poetry)
369,41,Mermaids in the Basement
433,41,The Drop That Became the Sea: Lyric Poems
473,41,Joyful Noise: Poems for Two Voices
590,41,Japanese Haiku Poems
601,41,Transcircularities: New & Selected Poems
697,41,The Defiant Muse: Hebrew Feminist Poems from A...
713,41,The Best American Poetry 1998
767,41,Etym(bi)ology
977,41,Straits: Poems


In [91]:
df[df['label'] == 66].head(10)

Unnamed: 0,label,title
8518,66,Lawrence University Off the Record
16590,66,Hanover College (College Prowler Guide)
19485,66,Cal Poly Pomona Off the Record
22519,66,Stanford University
27064,66,St. Olaf College Off the Record
46744,66,"Hamilton College: Clinton, New York"
61706,66,James Madison University
83044,66,Denison University
84714,66,Syracuse University
86955,66,"Hamilton College: Clinton, New York"


In [92]:
df[df['label'] == 77].head(10)

Unnamed: 0,label,title
72,77,101 Questions and Answers on the Biblical Tora...
460,77,Where Does God Live?: Questions and Answers fo...
648,77,Success God's Way
798,77,Longing for a Homeland: Discovering the Place ...
924,77,Our God is Awesome: Encountering the Greatness...
1091,77,Tough Questions Jews Ask: A Young Person's Gui...
1092,77,God Is the Gospel: Meditations on God's Love a...
1276,77,Experiencing Revival
1289,77,Heaven's Heroes: Real Life Stories from Histor...
1472,77,The Case for Faith: A Journalist Investigates ...


In [93]:
df[df['label'] == 90].head(10)

Unnamed: 0,label,title
953,90,Tree Yoga: A Workbook: Strengthen Your Persona...
2112,90,The Yoga Sutras
2546,90,Hatha Yoga for Kids: by Kids!
5469,90,365 Yoga
6411,90,The Yoga Face: Eliminate Wrinkles with the Ult...
7115,90,Morning Yoga Workouts
8995,90,Hatha Yoga Manual II
10103,90,Teaching Yoga to Children Through Story (Story...
10959,90,Yoga Journal's Yoga Basics: The Essential Begi...
12241,90,The Wisdom of Yoga: A Seeker's Guide to Extrao...


# 4 Conclusions
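Based on the sampled titles, the KMeans method seems to have been effective at grouping many of the books into logical clusters: the clusters inspected above correspond mostly to food and wine (9), programming (29), poetry (41), college guides (66), religion (77), and yoga (90), although some clusters, such as cluster 0, appear mixed. This could be scaled up to be used as a recommendation algorithm, or to categorize new titles and validate contributor categorization suggestions.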
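
Categorizing a new title could look like the following sketch, which reuses an already fitted vectorizer and pipeline. The corpus and the new description are hypothetical, and the toy sizes stand in for the 1000 components and 100 clusters used above. The key point is that the new text is passed through `transform` and `predict`, not refitted.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Fit on a toy corpus (hypothetical), mirroring the steps in sections 2.2 and 3.
toy_docs = ['yoga poses for beginners', 'morning yoga workouts',
            'yoga and breathing', 'java programming guide',
            'programming with java and sql', 'sql programming basics']
toy_tfidf = TfidfVectorizer()
toy_mat = toy_tfidf.fit_transform(toy_docs)
toy_pipeline = make_pipeline(TruncatedSVD(n_components=2, random_state=0),
                             KMeans(n_clusters=2, n_init=10, random_state=0))
toy_pipeline.fit(toy_mat)

# Assign an unseen description to one of the existing clusters.
# transform() reuses the fitted vocabulary; words it has not seen are ignored.
new_desc = ['a gentle yoga routine for beginners']
new_label = toy_pipeline.predict(toy_tfidf.transform(new_desc))

print(new_label)
```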