# Corpus-driven analysis of Charles Dickens's novel 'Hard Times'

I think this novel is interesting to analyze because of its structure. It is divided into three books with biblical titles ('Sowing', 'Reaping' and 'Garnering'), and some chapter headings also allude to religion. However, the text itself never comments on this. There are also at least 5 different plot lines under one cover. So I wanted to do some quantitative research.

## Opening and preprocessing

I loaded the text from a local file and preprocessed it with spaCy.

In [None]:
import os
import re
import spacy
nlp = spacy.load('en_core_web_lg')

In [2]:
with open('C:\\Users\\yanak\\OneDrive\\Документы\\Dickens_Hard-Times.txt', encoding='utf-8') as txt:
    text = txt.read()
text = re.sub(r'\n+', ' ', text) # collapse line breaks into single spaces

I noticed some emphasized words, marked with underscores like `_word_`, and decided to count them.

In [3]:
bold = len(re.findall('_[a-zA-Z]*_', text))
print(bold)

130


In [4]:
text = re.sub('_', '', text) # cleaning from '_'
text = re.sub('CHAPTER [A-Z]*', '', text) # cleaning from CHAPTERs - I want to split by books
#text

I decided to split the novel into books rather than chapters, because I expect the books to show the differences across the whole text more clearly.

In [5]:
splitted_text = re.split('BOOK THE [A-Z]*', text) # splitting into 3 books
del splitted_text[0]
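To see why the first element is deleted, here is a small illustration on a made-up string (the sample text is hypothetical, not taken from the novel). `re.split` returns whatever precedes the first match as element 0, which in the real file is presumably the title and other front matter.

```python
import re

# Hypothetical miniature of the file layout: front matter, then three books.
sample = ("TITLE PAGE BOOK THE FIRST SOWING aaa "
          "BOOK THE SECOND REAPING bbb BOOK THE THIRD GARNERING ccc")

parts = re.split(r"BOOK THE [A-Z]*", sample)
front_matter, books = parts[0], parts[1:]  # element 0 precedes the first heading

print(len(books))        # number of books found
print(books[0].strip())  # text of the first book
```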

In [6]:
import pandas as pd # I wrote everything in df

In [7]:
hardtimes_df = pd.DataFrame({
    'Text': splitted_text
}
)
hardtimes_df

Unnamed: 0,Text
0,"SOWING THE ONE THING NEEDFUL ‘NOW, what I wa..."
1,REAPING EFFECTS IN THE BANK A SUNNY midsumme...
2,GARNERING ANOTHER THING NEEDFUL LOUISA awoke...


In [8]:
def process_text(text):
    return nlp(text.lower())

In [9]:
hardtimes_df['doc'] = hardtimes_df['Text'].apply(process_text)

In [107]:
def docs_nosw(doc): # doc without stopwords; needed for the frequency word lists and similarity
    return nlp(' '.join([str(token) for token in doc if not token.is_stop]))

In [108]:
hardtimes_df['doc_wosw'] = hardtimes_df['doc'].apply(docs_nosw)

In [94]:
# lemmas without stopwords; they are used below for the frequency word lists
def get_lemmas_wosw(doc):
    return [(token.lemma_) for token in doc if not token.is_punct]
hardtimes_df['Lemmas_not_stop'] = hardtimes_df['doc_wosw'].apply(get_lemmas_wosw)

In [10]:
# the full list of lemmas; I need it to examine the lexical richness of the books and for some metrics in Stylo
def get_lemmas(doc):
    return [(token.lemma_) for token in doc if not token.is_punct]
hardtimes_df['Lemmas'] = hardtimes_df['doc'].apply(get_lemmas)

To work with Stylo, I saved the lists of lemmas into 3 txt files. The results are in the folder.

In [27]:
path = 'C:\\Users\\yanak\\Documents\\'

In [31]:
# I didn't come up with anything smarter :(
with open(path+"third_lemmas.txt", "w+", encoding="utf-8") as fp:
    for i in hardtimes_df['Lemmas'][2]:
        fp.write("%s\n" % i)

In [34]:
fp.close() # redundant: the with-block above has already closed the file
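The cell above writes only the third book, so it has to be re-run for each file. A loop can save all three lists in one go. This is a minimal sketch: the function name `save_lemma_lists` and the file names for the first two books are assumptions (only `third_lemmas.txt` appears above).

```python
import os

def save_lemma_lists(lemma_lists, folder, names):
    """Write each list of lemmas to its own file, one lemma per line."""
    paths = []
    for lemmas, name in zip(lemma_lists, names):
        file_path = os.path.join(folder, name)
        # The with-block closes the file, so no explicit close() is needed.
        with open(file_path, "w", encoding="utf-8") as fp:
            fp.writelines(f"{lemma}\n" for lemma in lemmas)
        paths.append(file_path)
    return paths
```

With the data from this notebook it could be called as `save_lemma_lists(hardtimes_df['Lemmas'], path, ['first_lemmas.txt', 'second_lemmas.txt', 'third_lemmas.txt'])`.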

## Named entities

Let's look at the named entities. spaCy has various built-in labels (ordinals, dates, times...), but I'm interested in GPE and PERSON.

In [69]:
def get_pers(doc):
    pers = []
    for ent in doc.ents:
        if ent.label_ == 'PERSON':
            pers.append(ent.text)
    return pers

In [70]:
hardtimes_df['Persons'] = hardtimes_df['doc'].apply(get_pers)

In [72]:
def get_gpes(doc):
    gpes = []
    for ent in doc.ents: # here we iterate over named entities, not tokens
        if ent.label_ == 'GPE': # if the entity's label is 'GPE'
            gpes.append(ent.text) # add it to the list
    return gpes

In [73]:
hardtimes_df['GPES'] = hardtimes_df['doc'].apply(get_gpes)
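The two functions above differ only in the label they test, so they could be merged into one function that takes the label as a parameter. A minimal sketch: the name `get_entities` is hypothetical, and the toy object below only imitates the `ents`, `text` and `label_` attributes of a spaCy doc so that the example runs without loading a model.

```python
from types import SimpleNamespace

def get_entities(doc, label):
    """Return the text of every entity in `doc` whose label equals `label`."""
    return [ent.text for ent in doc.ents if ent.label_ == label]

# Stand-in for a spaCy Doc, so the sketch runs without a model.
toy_doc = SimpleNamespace(ents=[
    SimpleNamespace(text="bounderby", label_="PERSON"),
    SimpleNamespace(text="london", label_="GPE"),
    SimpleNamespace(text="louisa", label_="PERSON"),
])

print(get_entities(toy_doc, "PERSON"))
print(get_entities(toy_doc, "GPE"))
```

On the real data this would be `hardtimes_df['doc'].apply(get_entities, label='PERSON')`, since `apply` passes extra keyword arguments on to the function.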

In [82]:
from collections import Counter
def freq_table(lemmas_list): # counts abs freq of lemmas/entities in the texts
    counter = Counter(lemmas_list)
    freq_dict = dict(counter)
    return freq_dict

def making_df(freq_dict, col): # turns the frequency dict into a DataFrame for easier viewing
    df = pd.DataFrame({
        col: freq_dict.keys(),
        'Abs_freq': freq_dict.values(),
    }
    )
    return df

def sorter(df):
    df = df.sort_values(by=['Abs_freq'], ascending=False)
    return df

In [79]:
hardtimes_df['Pers_freq'] = hardtimes_df['Persons'].apply(freq_table)

In [80]:
hardtimes_df['GPEs_freq'] = hardtimes_df['GPES'].apply(freq_table)

Below are the most frequent entities labelled 'PERSON'.

In [84]:
persons_book1 = making_df(hardtimes_df['Pers_freq'][0], 'Entity')
persons_book1 = sorter(persons_book1)
persons_book1[:15] # there are more of them, but the vast majority are hapaxes

Unnamed: 0,Entity,Abs_freq
7,gradgrind,143
25,bounderby,115
22,louisa,108
61,sparsit,54
34,tom,50
72,stephen,39
73,rachael,34
8,jupe,29
24,thomas,23
48,thquire,21


As you can see, the most frequent entity is Gradgrind. However, this is not a single person: it could be Tom (Thomas) Gradgrind as well as his father, Mr. Gradgrind. 'Jupe' (Sissy, i.e. Cecilia, Jupe) is only 8th, despite the fact that the first book is devoted to her story. I think this is because spaCy didn't recognize her first name. 'Thquire' is 'Squire': one of the characters lisps and has trouble pronouncing this word.

In [85]:
persons_book2 = making_df(hardtimes_df['Pers_freq'][1], 'Entity')
persons_book2 = sorter(persons_book2)
persons_book2[:15]

Unnamed: 0,Entity,Abs_freq
1,sparsit,193
0,bounderby,142
17,tom,101
32,stephen,65
10,harthouse,65
14,louisa,64
6,james harthouse,37
42,rachael,31
4,gradgrind,15
11,tom gradgrind,9


Sissy disappears in the 2nd book. The same can be seen in AntConc (see the distribution of her name throughout the whole novel).

In [86]:
persons_book3 = making_df(hardtimes_df['Pers_freq'][2], 'Entity')
persons_book3 = sorter(persons_book3)
persons_book3[:15]

Unnamed: 0,Entity,Abs_freq
0,louisa,68
6,bounderby,68
22,rachael,67
2,gradgrind,61
7,sparsit,43
45,sleary,25
62,thquire,22
19,stephen blackpool,21
8,tom,21
26,stephen,18


Coketown is mistakenly recognized as a person rather than a GPE. Based on these three lists of persons, the most frequently mentioned characters are the Gradgrinds, Bounderby, Rachael, Stephen and Louisa. They have 2-3 plot lines of their own, but there are more plot lines involving other characters.

In [88]:
gpes_book1 = making_df(hardtimes_df['GPEs_freq'][0], 'Entity')
gpes_book1 = sorter(gpes_book1)
gpes_book1

Unnamed: 0,Entity,Abs_freq
3,louisa,4
0,england,3
2,london,2
7,thort,2
1,st,1
4,ark,1
5,wapping,1
6,bitterth,1
8,billth,1
9,thtick,1


The story takes place in Coketown, a fictional city in Victorian England. So the GPEs here are connected to England (England, London), Wales and India. Surprisingly, there is China, but I don't remember why. Louisa is mistakenly recognized as a GPE. Some mispronounced words ('thort', 'bitterth', 'billth', 'thtick') are also marked with this label.

In [89]:
gpes_book2 = making_df(hardtimes_df['GPEs_freq'][1], 'Entity')
gpes_book2 = sorter(gpes_book2)
gpes_book2

Unnamed: 0,Entity,Abs_freq
1,london,6
19,yorkshire,2
2,jerusalem,2
18,india,2
6,yo’d,2
12,seabeach,1
17,romulus,1
16,rome,1
15,yorick,1
14,sir!—in,1


In the 2nd book there are more place names: London, Yorkshire, Rome, Ireland, Britain, India, China, Jerusalem. This does not mean that the action moves away from Coketown. I think these GPEs come from dialogues or from the characters' stories. Louisa is a GPE again.

In [90]:
gpes_book3 = making_df(hardtimes_df['GPEs_freq'][2], 'Entity')
gpes_book3 = sorter(gpes_book3)
gpes_book3

Unnamed: 0,Entity,Abs_freq
2,louisa,8
5,liverpool,4
0,lancashire,1
1,st,1
3,dree,1
4,redeemer,1
6,japan,1
7,josephine,1
8,harneth,1
9,animalth,1


The 3rd book has fewer chapters and fewer entities as well. Here the only real locations are Liverpool, Lancashire and Japan. The others are personal names or mispronounced words.

So the main geographical setting of the novel is England and its surroundings. There are also mentions of Eastern countries (India, Japan, China).

## Frequency word lists

I compiled frequency word lists without stopwords, but to be honest, I don't think they carry much information under these conditions. I displayed the 50 most frequent lemmas, and I would argue that they are typical of any English fiction book. Hence they don't carry much meaning and can be treated as stopwords for this type of text, too.
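If the three books differ in length, absolute counts are not directly comparable between them. One common way around this is to normalize to a frequency per 1000 tokens. A minimal sketch on a made-up list of lemmas (the function name `relative_freq` is hypothetical):

```python
from collections import Counter

def relative_freq(lemmas, per=1000):
    """Map each lemma to its frequency per `per` tokens."""
    counts = Counter(lemmas)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lemma: n * per / total for lemma, n in counts.items()}

toy = ["say", "mr", "say", "know"]  # made-up data, not from the novel
freqs = relative_freq(toy)
print(freqs["say"])  # "say" is 2 of the 4 tokens
```

Applied to the notebook, this would be `hardtimes_df['Lemmas_not_stop'].apply(relative_freq)`.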

In [119]:
hardtimes_df['Lemmas_freq'] = hardtimes_df['Lemmas_not_stop'].apply(freq_table)

In [121]:
lemmas_book1 = making_df(hardtimes_df['Lemmas_freq'][0], 'Lemma')
lemmas_book1 = sorter(lemmas_book1)
lemmas_book1[1:50] # the 1st row is whitespace left in place of the removed stopwords

Unnamed: 0,Lemma,Abs_freq
188,say,299
151,mr,262
112,gradgrind,197
954,bounderby,181
191,know,148
199,father,127
882,louisa,118
294,look,118
133,come,111
965,mrs,95


In [122]:
lemmas_book2 = making_df(hardtimes_df['Lemmas_freq'][1], 'Lemma')
lemmas_book2 = sorter(lemmas_book2)
lemmas_book2[1:50]

Unnamed: 0,Lemma,Abs_freq
449,say,276
269,mrs,232
89,bounderby,214
88,mr,214
270,sparsit,196
19,know,174
1216,tom,120
1060,harthouse,110
520,man,104
76,look,100


In [123]:
lemmas_book3 = making_df(hardtimes_df['Lemmas_freq'][2], 'Lemma')
lemmas_book3 = sorter(lemmas_book3)
lemmas_book3[1:50]

Unnamed: 0,Lemma,Abs_freq
68,say,180
242,mr,179
521,bounderby,145
206,know,120
59,come,118
62,go,87
238,man,84
4,louisa,82
243,gradgrind,81
55,sissy,79


The only differences here are the proper names and the adjectives 'old' and 'young', and I'm not really sure that they mean much. We can only conclude something about which characters appear in different parts of the novel, and nothing more.

## Similarity 

Let's see how similar the books are to each other using spaCy's `similarity` method. I started with the docs without stopwords.

In [35]:
print('Similarity between Book 1 and Book 2 = ', hardtimes_df['doc_wosw'][0].similarity(hardtimes_df['doc_wosw'][1]))

Similarity between Book 1 and Book 2 =  0.9991396969291683


In [36]:
print('Similarity between Book 2 and Book 3 = ', hardtimes_df['doc_wosw'][1].similarity(hardtimes_df['doc_wosw'][2]))

Similarity between Book 2 and Book 3 =  0.9979373029412386


In [37]:
print('Similarity between Book 1 and Book 3 = ', hardtimes_df['doc_wosw'][0].similarity(hardtimes_df['doc_wosw'][2]))

Similarity between Book 1 and Book 3 =  0.9987592511843397


All 3 books are almost identical to each other by this measure. I guess this is because they are parts of one novel and share the same author (which means that there are no or only minor lexical, thematic and stylistic differences).
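Another likely factor is the measure itself. As far as I understand spaCy's defaults, `Doc.similarity` is the cosine similarity between the two doc vectors, and a doc vector is the average of its token vectors. Averaging over a very long text tends to wash out local differences. The sketch below shows the effect with made-up 3-dimensional vectors (they are not real embeddings, and the helper names are hypothetical):

```python
import math

def mean_vector(vectors):
    """Component-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up "word vectors": both texts share 50 common words
# and differ in only one word each.
common = [[1.0, 0.2, 0.1]] * 50
text_a = common + [[0.0, 1.0, 0.0]]
text_b = common + [[0.0, 0.0, 1.0]]

print(cosine(text_a[-1], text_b[-1]))                    # the differing words alone
print(cosine(mean_vector(text_a), mean_vector(text_b)))  # whole texts, averaged
```

The two differing words are orthogonal, yet the averaged vectors of the two texts are nearly parallel, so values close to 1 for whole books say little on their own.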

Let's try the docs WITH stopwords.

In [38]:
print('Similarity between Book 1 and Book 2 = ', hardtimes_df['doc'][0].similarity(hardtimes_df['doc'][1]))

Similarity between Book 1 and Book 2 =  0.9995799683823092


In [39]:
print('Similarity between Book 2 and Book 3 = ', hardtimes_df['doc'][1].similarity(hardtimes_df['doc'][2]))

Similarity between Book 2 and Book 3 =  0.9989907808111365


In [40]:
print('Similarity between Book 1 and Book 3 = ', hardtimes_df['doc'][0].similarity(hardtimes_df['doc'][2]))

Similarity between Book 1 and Book 3 =  0.9984452025083468


Well, I can see no huge difference.

## Lexical Diversity (Richness)

The following measures show how diverse the vocabulary of the different parts of the novel is. I guess you know about TTR (type-token ratio) and its problems, so I used measures that are less sensitive to text length. In fact, this code and the whole idea are borrowed from my bachelor thesis, so I can provide reference values to compare the results with.
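To illustrate the problem with TTR, here is a minimal sketch in plain Python. It is a simplified re-implementation for illustration only, not the code of the lexicalrichness module: `msttr` here averages TTR over consecutive full segments and ignores any leftover tokens, which is my assumption about the usual definition. The token streams are made up.

```python
def ttr(tokens):
    """Type-token ratio: distinct tokens divided by all tokens."""
    return len(set(tokens)) / len(tokens)

def msttr(tokens, segment=100):
    """Mean TTR over consecutive full segments of `segment` tokens."""
    chunks = [tokens[i:i + segment]
              for i in range(0, len(tokens) - segment + 1, segment)]
    if not chunks:
        raise ValueError("text is shorter than one segment")
    return sum(ttr(chunk) for chunk in chunks) / len(chunks)

# Made-up token streams: the same 10-word cycle, repeated 2 and 20 times.
short_text = list("abcdefghij") * 2   # 20 tokens
long_text = list("abcdefghij") * 20   # 200 tokens

print(ttr(short_text), ttr(long_text))              # TTR drops as the text grows
print(msttr(short_text, 10), msttr(long_text, 10))  # MSTTR is the same for both
```

Both texts use exactly the same vocabulary in the same way, but plain TTR makes the longer one look ten times poorer, while the segment-based measure does not change.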

In [35]:
import lexicalrichness
from lexicalrichness import LexicalRichness

In [36]:
def lex_rich(lst_lemmas):
    return LexicalRichness(lst_lemmas, preprocessor=None, tokenizer=None)

Sorry, I still struggle with loops; I'll work on it...

In [37]:
lex1 = lex_rich(hardtimes_df['Lemmas'][0])
lex2 = lex_rich(hardtimes_df['Lemmas'][1])
lex3 = lex_rich(hardtimes_df['Lemmas'][2])

NB! The larger the number, the more lexically rich the book. The exception is the Maas metric: it is based on logarithms, and the smaller the number, the more diverse the vocabulary. For a full description of the metrics, see the documentation of the module used here (lexicalrichness).
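For reference, the Maas index is usually written as follows, where $N$ is the number of tokens and $V$ is the number of types (distinct words). This is the form I recall from the lexicalrichness documentation; I have not checked which logarithm base the module uses.

$$
a^2 = \frac{\log N - \log V}{(\log N)^2}
$$

For a fixed $N$, a larger vocabulary $V$ makes the numerator smaller, which is why a lower value means a more diverse text.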

In [42]:
per_msttr = round(lex1.msttr(), 4)
per_mattr = round(lex1.mattr(), 4)
per_maas = round(lex1.Maas, 4)
per_mtld = round(lex1.mtld(), 4)
per_hdd = round(lex1.hdd(), 4)
print('Book1\nMSTTR =',per_msttr,'\nMATTR =',per_mattr,'\nMaas =',per_maas,'\nMTLD =',per_mtld,'\nHDD =',per_hdd)

Book1
MSTTR =  0.666 
MATTR =  0.667 
Maas =  0.0203 
MTLD =  62.4766 
HDD =  0.8473


In [44]:
per_msttr = round(lex2.msttr(), 4)
per_mattr = round(lex2.mattr(), 4)
per_maas = round(lex2.Maas, 4)
per_mtld = round(lex2.mtld(), 4)
per_hdd = round(lex2.hdd(), 4)
print('Book2\nMSTTR =',per_msttr,'\nMATTR =',per_mattr,'\nMaas =',per_maas,'\nMTLD =',per_mtld,'\nHDD =',per_hdd)

Book2
MSTTR = 0.6759 
MATTR = 0.6763 
Maas = 0.0203 
MTLD = 66.515 
HDD = 0.8494


In [47]:
per_msttr = round(lex3.msttr(), 4)
per_mattr = round(lex3.mattr(), 4)
per_maas = round(lex3.Maas, 4)
per_mtld = round(lex3.mtld(), 4)
per_hdd = round(lex3.hdd(), 4)
print('Book3\nMSTTR =',per_msttr,'\nMATTR =',per_mattr,'\nMaas =',per_maas,'\nMTLD =',per_mtld,'\nHDD =',per_hdd)

Book3
MSTTR = 0.6737 
MATTR = 0.672 
Maas = 0.0203 
MTLD = 65.3065 
HDD = 0.8444
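The three nearly identical cells above can be collapsed into one loop. A minimal sketch: `METRICS`, `richness_row` and `richness_table` are hypothetical names, and the only calls made on the objects are the ones already used in this notebook (`msttr()`, `mattr()`, `Maas`, `mtld()`, `hdd()`).

```python
# Each entry maps a column name to a call on a LexicalRichness-like object.
METRICS = {
    "MSTTR": lambda lex: lex.msttr(),
    "MATTR": lambda lex: lex.mattr(),
    "Maas": lambda lex: lex.Maas,
    "MTLD": lambda lex: lex.mtld(),
    "HDD": lambda lex: lex.hdd(),
}

def richness_row(lex, ndigits=4):
    """Compute every metric for one text and round the results."""
    return {name: round(fn(lex), ndigits) for name, fn in METRICS.items()}

def richness_table(lex_objects, labels):
    """Return one dict of metrics per text, keyed by its label."""
    return {label: richness_row(lex)
            for label, lex in zip(labels, lex_objects)}
```

With the objects from this notebook, `pd.DataFrame(richness_table([lex1, lex2, lex3], ['Book1', 'Book2', 'Book3'])).T` would give one row per book, which is easier to compare than three separate printouts.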


Are these numbers big or not? Well, for comparison, Russian short stories from the 20th century give the following results:

In [127]:
rss_lexrich_df = pd.read_csv('C:/Users/yanak/Documents/lex_rich_decades_new.txt', sep='\t')
rss_lexrich_df

Unnamed: 0,Decade,MSTTR,MATTR,Maas,MTLD,HD-D
0,00th,0.775,0.774,0.017,139.516,0.913
1,10th,0.777,0.777,0.017,147.776,0.918
2,20th,0.787,0.788,0.016,158.578,0.929
3,20th,0.788,0.788,0.016,161.277,0.927
4,40th,0.784,0.784,0.017,154.924,0.922
5,50th,0.79,0.79,0.017,166.633,0.922
6,60th,0.787,0.787,0.016,158.059,0.923
7,70th,0.776,0.775,0.016,144.925,0.916
8,80th,0.78,0.779,0.016,147.921,0.917
9,90th-2000,0.795,0.795,0.015,172.487,0.925


In total, 1000 stories by 807 different authors were analyzed. The stories were divided into subcorpora of 100 texts for each decade of the century, with about 80-100 authors per decade. So the vocabulary there is richer than in the books of 'Hard Times'. I guess this is due to differences between the languages and between authorial styles.

## Conclusion

So, there are some differences between the books in characters and locations. However, based on the quantitative research, I can say that the main focus is on the Gradgrind family, Mr. Bounderby, Rachael and Stephen Blackpool. The action takes place in England, though there are minor mentions of Eastern countries. The vocabulary is rather diverse, and its diversity is at a quite similar level in all 3 books.