# Preprocessing & Exploring the Dataset

This notebook uses the CMU Book Summary dataset from Kaggle: https://www.kaggle.com/datasets/ymaricar/cmu-book-summary-dataset. It contains plot summaries for 16,559 books extracted from Wikipedia, along with metadata such as title, author, publication date, and genres.

#### Exploring the original data

In [9]:
import pandas as pd

# read the file and show first few rows
filename = 'booksummaries.txt'
df = pd.read_csv(filename, sep="\t", 
                 names=['Wikipedia ID', 'Freebase ID', 'Title', 'Author', 'Publication Date', 'Genres', 'Summary'])
df.head()

| | Wikipedia ID | Freebase ID | Title | Author | Publication Date | Genres | Summary |
|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | NaN | NaN | The argument of the Enquiry proceeds by a ser... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | NaN | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... |


In [10]:
# get dataframe information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16559 entries, 0 to 16558
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Wikipedia ID      16559 non-null  int64 
 1   Freebase ID       16559 non-null  object
 2   Title             16559 non-null  object
 3   Author            14177 non-null  object
 4   Publication Date  10949 non-null  object
 5   Genres            12841 non-null  object
 6   Summary           16559 non-null  object
dtypes: int64(1), object(6)
memory usage: 905.7+ KB
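
The non-null counts above tell us how much of each column is missing. As a quick worked check, the sketch below subtracts each count from the 16,559 total rows; the numbers are copied from the `df.info()` output, and the variable names are ours.

```python
# Non-null counts copied from the df.info() output above
total_rows = 16559
non_null = {"Author": 14177, "Publication Date": 10949, "Genres": 12841}

# Missing values per column = total rows - non-null rows
missing = {col: total_rows - count for col, count in non_null.items()}

for col, count in missing.items():
    print(f"{col}: {count} missing ({count / total_rows:.1%})")
```

On the live dataframe, `df.isna().sum()` should report the same counts directly.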


In [11]:
# get count of genre grouping
df['Genres'].value_counts()

{"/m/05hgj": "Novel"}                                                                                                                                                  839
{"/m/06n90": "Science Fiction", "/m/014dfn": "Speculative fiction"}                                                                                                    567
{"/m/06n90": "Science Fiction"}                                                                                                                                        526
{"/m/02xlf": "Fiction"}                                                                                                                                                402
{"/m/02xlf": "Fiction", "/m/05hgj": "Novel"}                                                                                                                           381
                                                                                                                                                 

In [12]:
# inspect a few rows of data with truncated summaries
print("Title: {} \nGenres: {} \nSummary: {}...".format(df.Title[0], df.Genres[0], df.Summary[0][:200]))
print()
print("Title: {} \nGenres: {} \nSummary: {}...".format(df.Title[3], df.Genres[3], df.Summary[3][:200]))

Title: Animal Farm 
Genres: {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire", "/m/0dwly": "Children's literature", "/m/014dfn": "Speculative fiction", "/m/02xlf": "Fiction"} 
Summary:  Old Major, the old boar on the Manor Farm, calls the animals on the farm for a meeting, where he compares the humans to parasites and teaches the animals a revolutionary song, 'Beasts of England'. Wh...

Title: An Enquiry Concerning Human Understanding 
Genres: nan 
Summary:  The argument of the Enquiry proceeds by a series of incremental steps, separated into chapters which logically succeed one another. After expounding his epistemology, Hume explains how to apply his p...


**Insights:** Some rows have no genres at all, and those that do store them as a JSON-style dictionary string keyed by an ID in the same `/m/...` format as the Freebase ID column. Apart from a leading space in the summaries printed above, the summary and title text looks clean, so most of the data cleanup and considerations concern the genres.
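
One way to handle the genres is to parse each dictionary string and keep only the genre names. The sketch below is a minimal version of that idea, assuming the strings are valid JSON (as the printed examples appear to be); `parse_genres` is a helper name we made up, not part of the dataset or pandas.

```python
import json

def parse_genres(raw):
    """Return the list of genre names in a raw Genres cell, or [] if missing."""
    # pandas stores a missing cell as float NaN, which is not a string
    if not isinstance(raw, str):
        return []
    # keys are the /m/... IDs, values are the human-readable genre names
    return list(json.loads(raw).values())

raw = r'{"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt": "Satire"}'
print(parse_genres(raw))           # ['Roman à clef', 'Satire']
print(parse_genres(float("nan")))  # []
```

Applied to the dataframe, this could look like `df['GenreList'] = df['Genres'].apply(parse_genres)`, where `GenreList` is a hypothetical column name.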

#### Preprocessing the Data

We need to clean the data so that we can apply vectorizing operations to the book summaries. Cleaning involves lowercasing the text and removing punctuation, non-alphabetic characters, and stop words.

In [13]:
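# strip punctuation and non-letters, collapse whitespace, and lowercase
import re

def clean(text):
    # drop punctuation outright, so "near-future" becomes "nearfuture"
    punc_less_text = re.sub(r'[^\w\s]', '', text)
    # replace anything outside a-z/A-Z (digits, underscores, accented letters) with a space
    alpha_only_text = re.sub(r'[^a-zA-Z]', ' ', punc_less_text)
    # collapse runs of whitespace and trim both ends
    cleaned_text = ' '.join(alpha_only_text.split())
    return cleaned_text.lower()

# apply to the dataframe column that contains the book summary
df['CleanSummary'] = df['Summary'].apply(clean)
df.head(5)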

| | Wikipedia ID | Freebase ID | Title | Author | Publication Date | Genres | Summary | CleanSummary |
|---|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... | old major the old boar on the manor farm calls... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... | alex a teenager living in nearfuture england l... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... | the text of the plague is divided into five pa... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | NaN | NaN | The argument of the Enquiry proceeds by a ser... | the argument of the enquiry proceeds by a seri... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | NaN | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... | the novel posits that space around the milky w... |
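
Note that "near-future" became "nearfuture" in the second row: the first substitution deletes punctuation instead of replacing it with a space, so hyphenated words are merged. The sketch below reproduces that on a short sample and compares it with a hypothetical variant, `clean_keep_word_breaks`, that is not used anywhere else in this notebook.

```python
import re

def clean(text):
    # same logic as the notebook's clean(): punctuation is deleted first
    no_punct = re.sub(r'[^\w\s]', '', text)
    letters_only = re.sub(r'[^a-zA-Z]', ' ', no_punct)
    return ' '.join(letters_only.split()).lower()

def clean_keep_word_breaks(text):
    # hypothetical variant: every non-letter, punctuation included, becomes a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    return ' '.join(letters_only.split()).lower()

sample = " Alex, a teenager living in near-future England"
print(clean(sample))                   # alex a teenager living in nearfuture england
print(clean_keep_word_breaks(sample))  # alex a teenager living in near future england
```

The variant keeps hyphenated words apart but also splits contractions and possessives ("children's" becomes "children s"), so which behavior is preferable depends on the vectorizer used later.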


In [14]:
# remove stop words
import nltk
from nltk.corpus import stopwords

# download the stop word list; punkt, wordnet, and omw-1.4 are not used in
# this cell (presumably they are for the tokenizing steps that follow)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

# a set gives constant-time membership checks
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not in the English stop word list
    return ' '.join(word for word in text.split() if word not in stop_words)

# apply stop word removal to the dataframe column that contains the cleaned summary
df['CleanSummary'] = df['CleanSummary'].apply(remove_stopwords)
df.head(5)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/alexisechano/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/alexisechano/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/alexisechano/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/alexisechano/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


| | Wikipedia ID | Freebase ID | Title | Author | Publication Date | Genres | Summary | CleanSummary |
|---|---|---|---|---|---|---|---|---|
| 0 | 620 | /m/0hhy | Animal Farm | George Orwell | 1945-08-17 | {"/m/016lj8": "Roman \u00e0 clef", "/m/06nbt":... | Old Major, the old boar on the Manor Farm, ca... | old major old boar manor farm calls animals fa... |
| 1 | 843 | /m/0k36 | A Clockwork Orange | Anthony Burgess | 1962 | {"/m/06n90": "Science Fiction", "/m/0l67h": "N... | Alex, a teenager living in near-future Englan... | alex teenager living nearfuture england leads ... |
| 2 | 986 | /m/0ldx | The Plague | Albert Camus | 1947 | {"/m/02m4t": "Existentialism", "/m/02xlf": "Fi... | The text of The Plague is divided into five p... | text plague divided five parts town oran thous... |
| 3 | 1756 | /m/0sww | An Enquiry Concerning Human Understanding | David Hume | NaN | NaN | The argument of the Enquiry proceeds by a ser... | argument enquiry proceeds series incremental s... |
| 4 | 2080 | /m/0wkt | A Fire Upon the Deep | Vernor Vinge | NaN | {"/m/03lrw": "Hard science fiction", "/m/06n90... | The novel posits that space around the Milky ... | novel posits space around milky way divided co... |


#### Tokenizing data and sentiment analysis

Now that the data is clean, we can tokenize the book summaries and run some sentiment analysis on them to see which sentiments might be associated with particular genres!

In [None]:
# tokenize data
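
The tokenizing cell above is still empty, so the following is only a rough sketch of how tokenizing and per-genre sentiment could fit together, not the method this notebook ends up using. It assumes whitespace tokenization is enough because `CleanSummary` is already lowercase letters and spaces. `TOY_LEXICON`, the helper names, and the two sample rows are all made up for illustration; a real analysis would swap in an established sentiment lexicon or model.

```python
from collections import defaultdict

# Hypothetical toy lexicon, for illustration only
TOY_LEXICON = {"love": 1, "happy": 1, "hope": 1, "death": -1, "war": -1, "fear": -1}

def tokenize(clean_summary):
    """Split an already-cleaned summary (lowercase, letters only) on whitespace."""
    return clean_summary.split()

def sentiment_score(tokens):
    """Mean lexicon score over all tokens; 0.0 for an empty summary."""
    if not tokens:
        return 0.0
    return sum(TOY_LEXICON.get(token, 0) for token in tokens) / len(tokens)

def mean_score_by_genre(rows):
    """rows: iterable of (genre_names, clean_summary) pairs."""
    scores = defaultdict(list)
    for genres, summary in rows:
        score = sentiment_score(tokenize(summary))
        # a book with several genres contributes its score to each of them
        for genre in genres:
            scores[genre].append(score)
    return {genre: sum(vals) / len(vals) for genre, vals in scores.items()}

# Made-up rows standing in for (genre list, CleanSummary) pairs
sample_rows = [
    (["Satire", "Fiction"], "animals fear war farm"),
    (["Fiction"], "love hope town"),
]
print(mean_score_by_genre(sample_rows))
```

Because a book contributes to every genre it is tagged with, broad genres such as "Fiction" would average over many different kinds of books, which is worth keeping in mind when comparing genres.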