# Introduction

> Data cleaning helps avoid "garbage in, garbage out": we do not want to feed meaningless data into a model, which will probably just return more meaningless junk.

This time I will skip the scraping step that data scientists normally start with. Scraping would allow the content to be updated over time, but to be fair the content here is pretty static anyway, so I don't really see the point of doing so. In addition, I imagine there would be quite a number of problems if the layout of the site changed.

## Outline for data cleaning
- Input: a simple text file with metadata removed. Headers and page numbers are kept though.
- Common Pre-processing/ cleaning procedures
  - All lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Use NLTK to perform stemming and lemmatisation on words in the DTM, to reduce the number of inflected words
  - Part-of-speech tagging
  - DTM for bi-grams/tri-grams (phrases like "thank you")
- Output
  - Corpus: not much different from the actual input, since each book is just one text file, but with all the data cleaned up.
  - Document Term matrix: a matrix of word counts in the entire corpus.

SpaCy can perform these NLTK techniques as well, with a greater degree of efficiency. The extra features might be overkill for the time being though.

# Importing the data

For the purposes of this project, I will simply import the data from a text file, which will be parsed into a string object.

In [295]:
#  Define list of file names for books to be analysed
fileNames = [
    'books/charlieandthechocolatefactory.txt',
    'books/fantasticmrfox.txt',
    'books/matilda.txt'
]

# bookNames for indexing the dictionary in later steps
bookNames = [
    'chocofact',
    'fox',
    'matilda'
]

# fullNames available for better readability
fullNames = [
    'Charlie and the Chocolate Factory',
    'Fantastic Mr Fox!',
    'Matilda'
]

In [296]:
# Import a text file and replace the line breaks '\n' with spaces
def importText(fileName):
    # The with-statement closes the file once it has been read
    with open(fileName, "r", encoding="utf-8") as file:
        return file.read().replace('\n', ' ')

In [297]:
# Test print the first 3000 characters of the third book, which is 'Matilda'
rawBooks = [importText(bkName) for bkName in fileNames]
print(rawBooks[2][:3000])

The Reader of Books    It’s a funny thing about mothers and fathers. Even  when their own child is the most disgusting little blister  you could ever imagine, they still think that he or she is  wonderful.   Some parents go further. They become so blinded by  adoration they manage to convince themselves their  child has qualities of genius.   Well, there is nothing very wrong with all this. It’s the  way of the world. It is only when the parents begin  telling us about the brilliance of their own revolting off¬  spring, that we start shouting, 'Bring us a basin! We’re  going to be sick!’    3      School teachers suffer a good deal from having to  listen to this sort of twaddle from proud parents, but  they usually get their own back when the time comes  to write the end-of-term reports. If I were a teacher I  would cook up some real scorchers for the children of  doting parents. ‘Your son Maximilian,’ I would write,  ‘is a total wash-out. I hope you have a family business  you can pus

But we want to make sure that the texts are indexed with their short-hand names as well, so that we don't have to access them with a specific number. This makes it way more convenient to access the data in the future, especially if we decide to append a few more of Roald Dahl's texts!

In [298]:
import pandas as pd
pd.set_option('max_colwidth',150)

raw_df = pd.DataFrame({'book_names':bookNames, 'full_names': fullNames, 'text':rawBooks})

# set book names as index
raw_df = raw_df.set_index('book_names')

# sort dataframe and print
raw_df = raw_df.sort_index()
raw_df

book_names,full_names,text
chocofact,Charlie and the Chocolate Factory,1 Here Comes Charlie These two very old people are the father and mother of Mr Bucket. Their names are Grandpa Joe and Grandma Josephine. A...
fox,Fantastic Mr Fox!,Down in the valley there were three farms. The owners of these farms had done well. They were rich men. They were also nasty men. All three of the...
matilda,Matilda,The Reader of Books It’s a funny thing about mothers and fathers. Even when their own child is the most disgusting little blister you could ...


Before we proceed any further, we should also pickle a raw copy of all books, which saves the object in a binary format. This is done for contingency purposes only.

In [394]:
import pickle

with open("raw_books.pkl", "wb") as file:
    pickle.dump(raw_df, file)
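
To make the "contingency" idea concrete, here is a minimal sketch of a pickle round trip. It uses a plain dictionary and a temporary file (both are stand-ins I made up for illustration, not the notebook's `raw_df` or `raw_books.pkl`); for the dataframe above, `pd.read_pickle("raw_books.pkl")` should bring the saved copy back in the same way.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the raw corpus: short names mapped to text snippets
books = {
    "fox": "Down in the valley there were three farms.",
    "matilda": "The Reader of Books",
}

# Write the object to a binary file in a temporary directory...
path = os.path.join(tempfile.mkdtemp(), "raw_books_demo.pkl")
with open(path, "wb") as fh:
    pickle.dump(books, fh)

# ...and read it back later, e.g. after a cleaning step went wrong
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored == books)  # the restored copy has the same content
```

The restored object is an independent copy, so it is safe to experiment on the working data and fall back to the pickled version if needed.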

In the meantime, it is also a good idea to quickly verify the data structure (such as the headers), as well as the text, which is more or less unmodified at this point.

In [300]:
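# Inspect the frame. Note that `raw_df.keys` without parentheses is the bound method itself,
# so the cell displays its repr (the whole frame); `raw_df.index` would list the book names.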
raw_df.keys

<bound method NDFrame.keys of                                    full_names  \
book_names                                      
chocofact   Charlie and the Chocolate Factory   
fox                         Fantastic Mr Fox!   
matilda                               Matilda   

                                                                                                                                                             text  
book_names                                                                                                                                                         
chocofact   1   Here Comes Charlie   These two very old people are the father and mother of Mr Bucket. Their names are  Grandpa Joe and Grandma Josephine.   A...  
fox         Down in the valley there were three farms. The owners of these farms had done well. They were rich men. They were also nasty men. All three of the...  
matilda      The Reader of Books    It’s a funny thing about mothers

In [301]:
# Test print the contents for Fantastic Mr Fox!
raw_df.text.loc['fox'][47000:]

' cider inside her inside.’  Then Badger joined in: 48  \x0c‘Oh poor Mrs Badger, he cried, So hungry she very near died. But she’ll not feel so hollow If only she’ll swallow Some cider inside her inside.’  They were still singing as they rounded the final corner and burst in upon the most wonderful and amazing sight any of them had ever seen. The feast was just beginning. A large dining-room had been hollowed out of the earth, and in the middle of it, seated around a huge table, were no less than twenty-nine animals. They were: Mrs Fox and three Small Foxes. Mrs Badger and three Small Badgers. Mole and Mrs Mole and four Small Moles. Rabbit and Mrs Rabbit and five Small Rabbits. 49  \x0cWeasel and Mrs Weasel and six Small Weasels. The table was covered with chickens and ducks and geese and hams and bacon, and everyone was tucking into the lovely food. ‘My darling!’ cried Mrs Fox, jumping up and hugging Mr Fox. ‘We couldn’t wait! Please forgive us!’ Then she hugged the Smallest Fox of al

# Getting started with Data cleaning!

When data scientists process numerical data, they often remove invalid data (which can be identified automatically or manually), duplicate data, outliers and null data. Text needs its own kind of cleaning, and there are several methods that we can apply iteratively along the way:
  - All lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`, as well as other control characters such as the form feed `\x0c` that follows each page number)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Use NLTK to perform stemming and lemmatisation on words in the DTM, to reduce the number of inflected words
  - Part-of-speech tagging
  - DTM for bi-grams/tri-grams (phrases like "thank you")
  - Fix typos (a bit too advanced for now)

We want to apply these methods iteratively such that we can observe the results after each cleaning stage; this is especially important for text-preprocessing since an overly aggressive approach may result in key information being lost.  

In [302]:
# First, make all text lower case, get rid of punctuation, numbers and other non-sensical text.

# Packages for string manipulation
import re
import string

# Basic function for cleaning texts
def basic_text_clean(text):
    text = text.lower()  # lower case
    text = re.sub(r'\x0c', ' ', text)  # form feeds left over from page breaks
    # ASCII punctuation only: curly quotes such as ’ are not in string.punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)  # numbers, and any word containing a digit
    return text

basic_cleaning = basic_text_clean
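
Before applying the function to whole books, it helps to try it on a single line. The sketch below repeats the same four steps in a self-contained form and runs them on a made-up sample modelled on the Fantastic Mr Fox excerpt (the `sample` string and the whitespace-collapsing step are my additions, just for readability).

```python
import re
import string

def basic_text_clean(text):
    text = text.lower()                                               # lower case
    text = re.sub(r'\x0c', ' ', text)                                 # form feeds
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)                             # tokens containing digits
    return text

# Illustrative sample: a page number, a form feed and some punctuation
sample = "Then Badger joined in: 48  \x0c'Oh poor Mrs Badger!'"
cleaned = basic_text_clean(sample)

# Collapse the runs of spaces left behind by the substitutions (display only)
collapsed = " ".join(cleaned.split())
print(collapsed)

# Curly quotes are not part of string.punctuation, so they survive the cleaning
curly = basic_text_clean("It’s")
print(curly)
```

The second print is worth noting: typographic quotes such as `’` pass through untouched, so they will still show up in the cleaned corpus unless they are handled separately.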

Some say quotes (things within single or double quotation marks) should be removed as well, but in this context I believe dialogues or conversations within the story are pretty important, so we might as well see how things work out first.

In [303]:
# Apply the basic_cleaning "mask" to the texts within the dataframe
data_b_cleaned = pd.DataFrame(raw_df.text.apply(basic_cleaning))
data_b_cleaned

book_names,text
chocofact,here comes charlie these two very old people are the father and mother of mr bucket their names are grandpa joe and grandma josephine a...
fox,down in the valley there were three farms the owners of these farms had done well they were rich men they were also nasty men all three of the...
matilda,the reader of books it’s a funny thing about mothers and fathers even when their own child is the most disgusting little blister you could ...


In [304]:
data_b_cleaned.text.loc['fox'][47000:]

'r joined in      ‘oh poor mrs badger  he cried  so hungry she very near died  but she’ll not feel so hollow if only she’ll swallow some cider inside her inside ’  they were still singing as they rounded the final corner and burst in upon the most wonderful and amazing sight any of them had ever seen  the feast was just beginning  a large dining room had been hollowed out of the earth  and in the middle of it  seated around a huge table  were no less than twenty nine animals  they were  mrs fox and three small foxes  mrs badger and three small badgers  mole and mrs mole and four small moles  rabbit and mrs rabbit and five small rabbits      weasel and mrs weasel and six small weasels  the table was covered with chickens and ducks and geese and hams and bacon  and everyone was tucking into the lovely food  ‘my darling ’ cried mrs fox  jumping up and hugging mr fox  ‘we couldn’t wait  please forgive us ’ then she hugged the smallest fox of all  and mrs badger hugged badger  and everyone 

## Corpus cleaned!

A quick round-up of things we have worked on:
- Indexed this list of texts with keys, which are the book names. At this point, this can be considered a corpus: a map of book names to texts.
- As we can see, not much has actually been done so far, but most importantly the text makes sense and the flow of sentences is not disturbed by weird characters. 
- Since the actual order of words does matter for things like sentiment analysis, performing further cleaning techniques such as stemming and lemmatisation would only worsen the final sentence generation algorithm. 

The corpus is ready to be pickled for further use.

In [305]:
data_b_cleaned.to_pickle("basic_cleaned_corpus.pkl")

# Creating a Document Term Matrix (DTM) with NLTK

## Tokenization and Stop Words

A DTM is a matrix of word counts, with one row per document and one column per term. To build it, the text is first split into words (tokenization). In this scenario, we can also apply further techniques to remove stop words, which are relatively meaningless words such as 'a', 'the' or various prepositions.

The packages that we can utilise for this section are:
- scikit-learn's CountVectorizer
- NLTK
- (maybe) SpaCy
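
To show what a DTM boils down to, here is a rough plain-Python sketch: count the non-stop words per document, then line the counts up against a shared vocabulary. The two tiny documents and the two-word stop list are invented for illustration; this is not how scikit-learn implements it internally, only the same idea in miniature.

```python
from collections import Counter

# Toy corpus (made up): document name -> already cleaned text
docs = {
    "fox": "the fox dug and dug",
    "matilda": "matilda read the book",
}
stop_words = {"the", "and"}  # tiny illustrative stop list

# Tokenize on whitespace, drop stop words, count what is left
counts = {
    name: Counter(word for word in text.split() if word not in stop_words)
    for name, text in docs.items()
}

# The vocabulary (the columns of the DTM) is every term seen in any document
vocab = sorted(set().union(*counts.values()))

# One row per document, one count per vocabulary term (0 if absent)
dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

Most entries are zero even in this toy case, which hints at how sparse the matrix becomes with a real vocabulary.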

In [306]:
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize text, fit the vectorizer (which removes stop words), and transform the text into a DTM
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_b_cleaned.text) 

# Transform the DTM (in this context a vocabulary/dictionary) into pandas dataframe format
# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0 use cv.get_feature_names() instead
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out()) 
data_dtm.index = data_b_cleaned.index
data_dtm

book_names,aback,abandon,abc,abdomen,abide,abilities,ability,able,absolute,absolutely,...,yippeeeeee,yippeeeeeeee,yippeeeeeeeeee,youheard,young,younger,youreally,youth,zing,zip
chocofact,0,0,0,0,1,0,0,13,2,10,...,1,1,1,0,6,1,0,1,1,1
fox,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,1,0,1,0,0,0
matilda,1,1,1,1,0,1,3,14,5,6,...,0,0,0,0,14,1,0,0,0,0


The advantages of this format are immediately apparent. It allows us to filter some meaningless words (stop words) and presents the data in a neat and organised manner. Let's pickle this relatively raw or primitive DTM first.

In [307]:
data_dtm.to_pickle("data_dtm.pkl")

## Bigrams

Dealing with an excessively large dataset is messy for various reasons:
- As we can see, there are some really similar words like "young" and "younger", which are inflections of the same root word.
- Some words have arbitrary spelling, like "yippeeeeeeee....!" It's cute, but who knows why Roald Dahl likes to put `x` many e's behind yippee.
- The frequency of certain words is quite low, which means that they won't really be used for things like topic analysis or sentiment analysis in later stages. Identifying unique words may still be important, which is why I specifically pickled the DTM above into the file `data_dtm.pkl`; alternatively, we could always start again from the corpus.

To recall, these are the things that we can work on to further clean our data:
  - Perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.
  - Part-of-speech tagging
  - Collating bi-grams/tri-grams (phrases like "thank you")
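
As a quick illustration of what collating bi-grams means, the sketch below builds unigrams and bigrams from a token list in plain Python. The helper names `ngrams` and `unigrams_and_bigrams` are hypothetical, and the order of operations (drop stop words first, then pair up neighbours) is my understanding of how `CountVectorizer` behaves, so treat it as an assumption.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(text, stop_words=frozenset()):
    # Stop words are removed before the n-grams are formed, so a bigram
    # can join two words that were not adjacent in the original text
    tokens = [t for t in text.split() if t not in stop_words]
    return ngrams(tokens, 1) + ngrams(tokens, 2)

features = unigrams_and_bigrams("thank you very much", stop_words={"very"})
print(features)
```

This would also explain pairs such as "aback arrival" in a bigram DTM: the two words need not be neighbours in the book, only neighbours once the stop words are gone.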

In [308]:
# Accept bigrams as well
bigrams_cv = CountVectorizer(stop_words='english', ngram_range= (1,2)) 
cleaned_cv = bigrams_cv.fit_transform(data_b_cleaned.text) 
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
cleaned_data_dtm = pd.DataFrame(cleaned_cv.toarray(), columns=bigrams_cv.get_feature_names_out())
cleaned_data_dtm.index = data_dtm.index
cleaned_data_dtm

book_names,aback,aback arrival,abandon,abandon trunchbull,abc,abc quite,abdomen,abdomen daughter,abide,abide ugliness,...,younger children,younger ones,youreally,youreally mean,youth,youth just,zing,zing fantastic,zip,zip guns
chocofact,0,0,0,0,0,0,0,0,1,1,...,0,1,0,0,1,1,1,1,1,1
fox,0,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
matilda,1,1,1,1,1,1,1,1,0,0,...,1,0,0,0,0,0,0,0,0,0


I imagine the DTM is quite sparse since there are 35000 columns! Not only is most of the data useless, it will also increase the load for processing in subsequent steps. Let's only keep words that have appeared more than 5 times.
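
The filtering idea can be sketched without pandas: sum each term's counts across the books and keep the term only if the total clears a threshold. The four terms and their per-book counts below are copied from the unigram DTM shown earlier; the dictionary layout and variable names are just for illustration.

```python
# term -> counts per book (values taken from the DTM output above)
term_counts = {
    "aback": {"chocofact": 0, "fox": 0, "matilda": 1},
    "able":  {"chocofact": 13, "fox": 0, "matilda": 14},
    "young": {"chocofact": 6, "fox": 1, "matilda": 14},
    "zing":  {"chocofact": 1, "fox": 0, "matilda": 0},
}
threshold = 5

# Total frequency of each term across the whole corpus
totals = {term: sum(per_book.values()) for term, per_book in term_counts.items()}

# Keep only terms that appear more than `threshold` times in total
kept = {term: per_book for term, per_book in term_counts.items() if totals[term] > threshold}

print(totals)
print(sorted(kept))
```

One-off words such as "aback" and "zing" drop out, while frequent words keep their per-book counts.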

In [309]:
# Transpose for easier calculations, and add a column with the total frequency of each word across all books.
cleaned_data_dtm = cleaned_data_dtm.transpose()
cleaned_data_dtm['sum'] = cleaned_data_dtm.sum(axis = 1, skipna = True) 
cleaned_data_dtm

book_names,chocofact,fox,matilda,sum
aback,0,0,1,1
aback arrival,0,0,1,1
abandon,0,0,1,1
abandon trunchbull,0,0,1,1
abc,0,0,1,1
...,...,...,...,...
youth just,1,0,0,1
zing,1,0,0,1
zing fantastic,1,0,0,1
zip,1,0,0,1


In [310]:
# Only keep words that have appeared more than 5 times.
filtered_data_dtm = cleaned_data_dtm[cleaned_data_dtm['sum'] > 5]

#Transpose back to the original format and display the DTM!
filtered_data_dtm = filtered_data_dtm.transpose()
filtered_data_dtm

book_names,able,absolute,absolutely,actually,added,afraid,afternoon,afternoons,age,ago,...,year old,years,yelled,yelled mrs,yelling,yellow,yes,yes miss,yesterday,young
chocofact,13,2,10,5,3,3,2,0,0,2,...,2,8,14,7,3,3,29,0,2,6
fox,0,0,1,1,0,0,1,0,0,0,...,0,0,4,0,0,0,11,0,1,1
matilda,14,5,6,9,9,8,15,9,7,7,...,11,23,13,0,4,4,37,15,6,14
sum,27,7,17,15,12,11,18,9,7,9,...,13,31,31,7,7,7,77,15,9,21


This leaves us with a much more manageable 1361 entries. Before we proceed any further, we should again pickle it for further use.

In [311]:
# Pickle the filtered bigram DTM (not the unfiltered data_dtm)
filtered_data_dtm.to_pickle("high_bigram_dtm.pkl")

# Exploring Stemming and Lemmatisation on NLTK

A language in which words are derived from a root word by modifying it is called an **inflected language**. The modifications reflect tense, case, aspect, person, number, gender, mood or position in speech. For example, searching for fish (I use DuckDuckGo) will also return results for fishes and fishing, as fish is the stem of both words.

Next we will utilise NLTK to perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.

1. Stemming reduces inflected words to their stem by stripping affixes, so that a group of related words maps to the same stem even if the stem itself is not a valid word in the language. A stem (or root) is the part of the word to which affixes such as -ed, -ize, -s, de- or mis- are added.
2. Lemmatisation reduces inflected words properly, ensuring that the root word belongs to the language. The root word is called a lemma (plural lemmas or lemmata): the canonical, dictionary or citation form of a set of words. Context is an essential input for lemmatisation, so the part of speech of the word (verb, noun, adjective, etc.) should be passed as a parameter as well.
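
The difference between the two can be sketched with toy versions: a stemmer that blindly strips suffixes, and a lemmatiser that looks words up by part of speech. Neither is the real thing; `toy_stem` is far cruder than the Porter algorithm, and `TOY_LEMMAS` is a two-entry dictionary made up for this example, whereas a real lemmatiser relies on a full lexicon such as WordNet.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative, checked in this order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-lexicon: (word, part of speech) -> lemma
TOY_LEMMAS = {
    ("done", "v"): "do",
    ("troubling", "v"): "trouble",
}

def toy_lemmatize(word, pos="n"):
    """Look the word up for the given part of speech; fall back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

for word in ("fishing", "fishes", "troubling", "done"):
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word, pos="v"))
```

The stemmer happily produces "troubl", which is not a word, while the lemmatiser only returns dictionary forms but needs to be told the part of speech to find them.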

I love stem.

In [312]:
import nltk

# Using the GUI select the punkt model and wordnet corpora for download.
nltk.download()
nltk.download('wordnet')

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\brianwu\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [313]:
# Using the PorterStemmer, which is less aggressive and less likely to yield non-real words (lower chance of over-stemming)
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
porter = PorterStemmer()
wordnet_lemmatizer = WordNetLemmatizer()

print(" Stemmings for troubling")
print(porter.stem("trouble"))
print(porter.stem("troubling"))

print("\n Stemmings for young")
print(porter.stem("young"))
print(porter.stem("younger"))

# if not defined, the default "pos" is "noun"
print("\n Lemmatisations for trouble/ troubling")
print("n/a - ", wordnet_lemmatizer.lemmatize("trouble"))
print("verb - ", wordnet_lemmatizer.lemmatize("troubling", pos = "v"))
print("noun - ", wordnet_lemmatizer.lemmatize("troubling", pos = "n"))
print("adj - ", wordnet_lemmatizer.lemmatize("troubling", pos = "a"))

print("\n Lemmatisations for done")
print("verb - ", wordnet_lemmatizer.lemmatize("done", pos = "v"))
print("noun - ", wordnet_lemmatizer.lemmatize("done", pos = "n"))
print("adj - ", wordnet_lemmatizer.lemmatize("done", pos = "a"))

Stemmings for troubling
troubl
troubl

 Stemmings for young
young
younger

 Lemmatisations for trouble/ troubling
n/a -  trouble
verb -  trouble
noun -  troubling
adj -  troubling

 Lemmatisations for done
verb -  do
noun -  done
adj -  done


Some key observations can be made for both stemming and lemmatisation algorithms on NLTK:
- Stemming still yields some non-real words occasionally, which is something we will have to deal with manually when such a word comes to the top of our list. 
- Lemmatisations for verbs seem to yield the most straightforward results, but we have to identify the parts of speech manually.

This isn't quite what I am looking for, which is probably a signal for me to experiment with SpaCy as well.

# Exploring SpaCy

## Parts of Speech (POS) tagging, dependencies, entities and more

With that in mind, let's experiment with SpaCy. For basic NLP tasks such as tokenization and lemmatisation (but not stemming), SpaCy offers a smaller variety of options, but some say the functionality is more refined.

(To be fair, getting the POS of a word is possible in NLTK as well, as shown in this [article](https://www.tutorialexample.com/improve-nltk-word-lemmatization-with-parts-of-speech-nltk-tutorial/), but it's not as straightforward.)
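
For reference, the usual trick on the NLTK side is to translate the Penn Treebank tag of each word into one of the four part-of-speech letters that the WordNet lemmatiser accepts. A minimal sketch of that mapping is below; the function name `penn_to_wordnet` is my own, and falling back to noun mirrors the lemmatiser's default rather than any official rule.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBD', 'JJR') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

for tag in ("NNP", "VBD", "JJR", "RB", "DT"):
    print(tag, "->", penn_to_wordnet(tag))
```

In practice the tags would come from `nltk.pos_tag`, and the returned letter would be passed as the `pos` argument of `WordNetLemmatizer.lemmatize`, which is the extra plumbing that SpaCy handles for us.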

In [314]:
# Import Spacy packages (install on computer via terminal/ anaconda, and load the language model)
import spacy
sp = spacy.load('en_core_web_sm') 

In [315]:
# Basic Tasks - as usual we start with a (small) experiment.
sentence = sp(u'Grandpa Joe lifted Charlie up so that he could get a better view, and looking in, Charlie saw a long table, and on the table there were rows and rows of small white square-shaped sweets.')


# Print every word in the sentence (now split into tokens), along with the part of speech (verb, noun, adjective etc.) of each token and its dependency label
print(f'{"Word":{12}} {"POS":{10}} {"Word Tag":{10}} {"Depend":{10}} {"Explain"}')
print("--------------------------------------------------------------------------------------------")
for word in sentence:
    # print(word.text,  word.pos_, word.dep_)
    print(f'{word.text:{12}} {word.pos_:{10}} {word.tag_:{10}} {word.dep_:{10}} {spacy.explain(word.tag_)}')
# simple print rewritten for a bit more clarity

Word         POS        Word Tag   Depend     Explain
--------------------------------------------------------------------------------------------
Grandpa      PROPN      NNP        compound   noun, proper singular
Joe          PROPN      NNP        nsubj      noun, proper singular
lifted       VERB       VBD        ccomp      verb, past tense
Charlie      PROPN      NNP        dobj       noun, proper singular
up           ADP        RP         prt        adverb, particle
so           SCONJ      IN         mark       conjunction, subordinating or preposition
that         SCONJ      IN         mark       conjunction, subordinating or preposition
he           PRON       PRP        nsubj      pronoun, personal
could        VERB       MD         aux        verb, modal auxiliary
get          AUX        VB         advcl      verb, base form
a            DET        DT         det        determiner
better       ADJ        JJR        amod       adjective, comparative
view         NOUN       NN 

In [316]:
# Visualising POS tags - neat!
from spacy import displacy
displacy.render(sentence, style='dep', jupyter=True, options={'distance': 85})

Similarly, named entities can be identified as well. This will be useful when we want to count the number of unique entities within the texts.

In [317]:
print(f'{"Word":{12}} {"Entity_Label":{14}} {"Explain"}')
print("-----------------------------------------")
for entity in sentence.ents:
    # spacy.explain expects the string label (label_), not the integer ID (label), which only yields None
    print(f'{entity.text:{12}} {entity.label_:{14}} {spacy.explain(entity.label_)}')

Word         Entity_Label   Explain
-----------------------------------------
Grandpa Joe  PERSON         People, including fictional
Charlie      PERSON         People, including fictional
Charlie      PERSON         People, including fictional
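
Counting unique entities from such a list is straightforward with `collections.Counter`. The sketch below hard-codes the three (text, label) pairs from the output above instead of calling SpaCy, so it runs on its own.

```python
from collections import Counter

# (text, label) pairs as reported for the sample sentence above
entities = [
    ("Grandpa Joe", "PERSON"),
    ("Charlie", "PERSON"),
    ("Charlie", "PERSON"),
]

# Count how often each person is mentioned
person_counts = Counter(text for text, label in entities if label == "PERSON")

unique_people = len(person_counts)
print(unique_people)
print(person_counts.most_common(1))
```

With a real document, the pairs would come from `(ent.text, ent.label_)` for each `ent` in `doc.ents`.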


As we can clearly see, SpaCy has much more advanced tagging for POS (`.pos_`), entities (`.ents`) and noun phrases (`.noun_chunks`), all of which take the word's position within the sentence into account. I'm not gonna go too deep into this, but some of the POS identified are:
- (various forms of) verbs
- (singular or mass) nouns
- (various forms of) pronouns
- adjectives
- auxiliaries
- conjunctions

and more. For a complete list of tags, check out SpaCy's [annotation specifications](https://spacy.io/api/annotation#pos-tagging).

I imagine the tagging algorithms will make things way easier for lemmatisation in SpaCy, since much more context is extracted with its algorithm.

## Lemmatisation on SpaCy

Although stemming is not available in SpaCy, I believe that lemmatisation might actually yield slightly more useful results (which is what we've seen with lemmatisation in NLTK as well), because the root lemma of a word can be returned regardless of its part of speech or tense. That said, correct lemmatisation assumes that the part-of-speech tagging is good, which seems to be the case for SpaCy.

In [318]:
print(f'{"Word":{12}} {"Lemma"} ')
print("-----------------------------------")
for word in sentence:
    print(f'{word.text:{12}} {word.lemma_}')

Word         Lemma 
-----------------------------------
Grandpa      Grandpa
Joe          Joe
lifted       lift
Charlie      Charlie
up           up
so           so
that         that
he           -PRON-
could        could
get          get
a            a
better       well
view         view
,            ,
and          and
looking      look
in           in
,            ,
Charlie      Charlie
saw          see
a            a
long         long
table        table
,            ,
and          and
on           on
the          the
table        table
there        there
were         be
rows         row
and          and
rows         row
of           of
small        small
white        white
square       square
-            -
shaped       shaped
sweets       sweet
.            .


# Creating a new DTM with SpaCy

With this in mind, I think I'm ready to redo my DTM using SpaCy from my original corpus `basic_cleaned_corpus.pkl`. With a few slight modifications to my agenda, the things to work on are:
- Tokenization of sentences
- Removal of stop words
- Identifying entities
- Lemmatisation, which is inherently better due to SpaCy's superior POS tagging.


In [319]:
cleaned_corpus = pd.read_pickle('basic_cleaned_corpus.pkl')
cleaned_corpus

book_names,text
chocofact,here comes charlie these two very old people are the father and mother of mr bucket their names are grandpa joe and grandma josephine a...
fox,down in the valley there were three farms the owners of these farms had done well they were rich men they were also nasty men all three of the...
matilda,the reader of books it’s a funny thing about mothers and fathers even when their own child is the most disgusting little blister you could ...


## Tokenization and Lemmatization with SpaCy

I'm being slightly greedy this time (working on both tokenization and lemmatization simultaneously) because looping through such a large library of texts repeatedly is extremely time consuming. There is probably a better way to do this.

In [320]:
# Load tools for SpaCy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS

# Create document with SpaCy annotations (POS, entities, dependencies etc.) and lemmatise immediately
def createSpaCy(storyName):
    data = sp(cleaned_corpus.text.loc[storyName])
    # Keep the lemma of every token that is neither a stop word nor whitespace
    new_data = [word.lemma_ for word in data if not word.is_stop and not word.is_space]
    # Join once, after the loop: joining inside the loop rebuilds the whole string for every token
    return " ".join(new_data)

spacy_doc = [createSpaCy(storyName) for storyName in bookNames]
spacy_doc[0][:500]

'come charlie old people father mother mr bucket name grandpa joe grandma josephine old people father mother mrs bucket name grandpa george grandma georgina mr bucket mrs bucket mr mrs bucket small boy charlie charlie d’you d’you d’you pleased meet family — grown up count little charlie bucket — live small wooden house edge great town house nearly large people life extremely uncomfortable room place altogether bed bed give old grandparent old tired tired get grandpa joe grandma josephine grandpa '

In [393]:
spacy_df = pd.DataFrame({'book_names':bookNames, 'full_names': fullNames, 'spacified_text': spacy_doc})
spacy_df = spacy_df.set_index('book_names')

# Keep a lemmatised version of the corpus
spacy_df.to_pickle('lemmatised_books.pkl')
spacy_df

book_names,full_names,spacified_text
chocofact,Charlie and the Chocolate Factory,come charlie old people father mother mr bucket name grandpa joe grandma josephine old people father mother mrs bucket name grandpa george grandma...
fox,Fantastic Mr Fox!,valley farm owner farm rich man nasty man nasty mean man meet name farmer boggis farmer bunce farmer bean boggi chicken farmer keep thousand chick...
matilda,Matilda,reader book funny thing mother father child disgusting little blister imagine think wonderful parent blinded adoration manage convince child quali...


Since tokenization and transforming the data into a document term matrix is easier with scikit-learn, I'm gonna process the sentences with `CountVectorizer()`.

In [322]:
# Tokenize text and transform it into a DTM
blank_cv = CountVectorizer()
spacy_cv = blank_cv.fit_transform(spacy_df.spacified_text) 

# Transform the DTM (in this context a vocabulary/dictionary) into pandas dataframe format
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
spacy_dtm = pd.DataFrame(spacy_cv.toarray(), columns=blank_cv.get_feature_names_out())
spacy_dtm.index = spacy_df.index
spacy_dtm

book_names,aback,abandon,abc,abdomen,abide,ability,able,absolute,absolutely,absorb,...,yippeeeeee,yippeeeeeeee,yippeeeeeeeeee,you,youheard,young,youreally,youth,zing,zip
chocofact,0,0,0,0,1,0,13,2,10,0,...,1,1,1,7,0,7,0,1,1,1
fox,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,1,1,1,0,0,0
matilda,1,1,1,1,0,4,14,5,6,3,...,0,0,0,3,0,15,0,0,0,0


In [391]:
# Transpose for easier calculations, and add a column with the total frequency of each word across all books.
filtered_spacy_dtm = spacy_dtm.transpose()
filtered_spacy_dtm['sum'] = spacy_dtm.sum() 

# Only keep words that have appeared more than once (i.e. at least twice), sorted by total frequency.
filtered_spacy_dtm = filtered_spacy_dtm[filtered_spacy_dtm['sum'] > 1]
filtered_spacy_dtm = filtered_spacy_dtm.sort_values(by='sum', ascending=False)

#Transpose back to the original format and display the DTM!
filtered_spacy_dtm = filtered_spacy_dtm.transpose()
filtered_spacy_dtm

book_names,say,mr,miss,matilda,honey,go,wonka,like,come,look,...,pad,murder,devising,lavatory,maggot,serpent,magician,messing,subtle,gasping
chocofact,352,386,12,0,0,159,324,103,137,105,...,0,1,0,0,0,0,2,2,0,0
fox,152,132,2,0,0,39,0,23,32,31,...,0,0,0,0,0,0,0,0,0,0
matilda,589,70,453,430,382,155,0,163,114,122,...,2,1,2,2,2,2,0,0,2,2
sum,1093,588,467,430,382,353,324,289,283,258,...,2,2,2,2,2,2,2,2,2,2


Keeping only words that appear at least twice has two advantages:
- Less computing time in later manipulations.
- A lower chance of non-words (for example `youheard` or `yippeeeeeeee`) ending up in the DTM.
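
The same filter can also be written without transposing back and forth. Here is a minimal sketch on a made-up DTM; the toy counts and the `min_count` name are mine, not from the notebook, and the `sum` row is left out for brevity:

```python
import pandas as pd

# Toy DTM with the same layout as spacy_dtm: rows are books, columns are words.
toy_dtm = pd.DataFrame(
    {"say": [3, 1, 5], "zing": [1, 0, 0], "honey": [0, 0, 4], "youheard": [0, 1, 0]},
    index=["chocofact", "fox", "matilda"],
)

min_count = 2  # hypothetical name; equivalent to the `sum > 1` test in the cell above

# Total count of every word across all books (one value per column).
totals = toy_dtm.sum()

# Keep the columns whose total reaches the threshold, most frequent first.
kept = totals[totals >= min_count].sort_values(ascending=False).index
filtered = toy_dtm[kept]

print(filtered)
```

Only `say` and `honey` survive here, since `zing` and `youheard` each occur once in total.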

Now, let's pickle this for further use.

In [325]:
filtered_spacy_dtm.to_pickle("refined_dtm.pkl")
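
To pick the DTM up again in a later notebook, `pd.read_pickle` reads the file back. A small round-trip sketch, using a toy frame and a temporary directory rather than the real `refined_dtm.pkl`:

```python
import os
import tempfile

import pandas as pd

# Two columns borrowed from the filtered DTM shown above.
toy_dtm = pd.DataFrame(
    {"say": [352, 152, 589], "mr": [386, 132, 70]},
    index=["chocofact", "fox", "matilda"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dtm.pkl")  # throwaway file name for this sketch
    toy_dtm.to_pickle(path)          # write the frame to disk
    restored = pd.read_pickle(path)  # read it back, index and dtypes included

print(restored.equals(toy_dtm))
```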

## Entity Identification with SpaCy

Conversely, we would also like to explore words that occur rarely. One way to do so is to identify the named entities within the texts.

In [326]:
# Function to identify entities
def identifyENT(storyName):
    data = sp(cleaned_corpus.text.loc[storyName])
    new_data = []
    for entity in data.ents:
        new_data.append(entity.text) # things to consider: entity.label_, spacy.explain(entity.label_)
    return new_data

# Run the function and return the list (of list)
entityList = [identifyENT(storyName) for storyName in bookNames]
entityList[0][:50]

['charlie',
 'two',
 'grandpa joe',
 'grandma',
 'two',
 'georgina',
 'charlie',
 'charlie',
 'six',
 'only two',
 'only one',
 'four',
 'grandpa joe',
 'grandma',
 'georgina',
 'little charlie',
 'winter',
 'all night',
 'one half',
 'second',
 'one',
 'two',
 'two',
 'charlie',
 'charlie',
 'charlie',
 'about from morning',
 'one',
 'charlie',
 'charlie',
 'the  great day',
 'charlie',
 'one',
 'the next few days',
 'one',
 'the next day',
 'charlie',
 'more than a month',
 'one',
 'house',
 'charlie',
 'willy',
 'half a mile',
 'charlie',
 'charlie',
 'four',
 'ninety',
 'the day',
 'charlie',
 'one']

Oddly, this returned a bunch of numbers as well (spaCy's English models label cardinal numbers as entities too), so I'm gonna modify the function a bit to skip the bare number words by hand.

In [379]:
# Function to identify entities, skipping bare number words
NUMBER_WORDS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}

def identifyENT_revised(storyName):
    data = sp(cleaned_corpus.text.loc[storyName])
    new_data = []
    for entity in data.ents:
        if entity.text not in NUMBER_WORDS:
            new_data.append(entity.text) # things to consider: entity.label_, spacy.explain(entity.label_)
    return new_data

# Run the function and return the list (of list)
entityList = [identifyENT_revised(storyName) for storyName in bookNames]
entityList[0][:20]

['charlie',
 'grandpa joe',
 'grandma',
 'georgina',
 'charlie',
 'charlie',
 'only two',
 'only one',
 'grandpa joe',
 'grandma',
 'georgina',
 'little charlie',
 'winter',
 'all night',
 'one half',
 'second',
 'charlie',
 'charlie',
 'charlie',
 'about from morning']
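
The hard-coded word list still lets things like 'only two', 'one half' and 'all night' through. An alternative is to filter on the entity label instead of the text. The sketch below assumes `sp` is one of spaCy's English pipelines, whose label set includes `CARDINAL`, `ORDINAL`, `QUANTITY`, `DATE` and `TIME`. To stay self-contained it runs on hand-labelled `(text, label)` pairs rather than on real model output; `keep_named_entities` is a hypothetical helper name.

```python
# Labels that spaCy's English pipelines use for numbers, dates and times.
NUMERIC_OR_TEMPORAL = {"CARDINAL", "ORDINAL", "QUANTITY", "DATE", "TIME", "PERCENT", "MONEY"}


def keep_named_entities(entities, excluded_labels=NUMERIC_OR_TEMPORAL):
    """Return the entity texts whose label is not in excluded_labels.

    `entities` is any iterable of (text, label) pairs, for example
    [(ent.text, ent.label_) for ent in doc.ents] for a spaCy Doc.
    """
    return [text for text, label in entities if label not in excluded_labels]


# Hand-labelled stand-in for doc.ents; these labels are illustrative, not real model output.
sample = [
    ("charlie", "PERSON"),
    ("two", "CARDINAL"),
    ("grandpa joe", "PERSON"),
    ("only two", "CARDINAL"),
    ("all night", "TIME"),
    ("the next day", "DATE"),
    ("india", "GPE"),
]

print(keep_named_entities(sample))
# ['charlie', 'grandpa joe', 'india']
```

How well this works on the real texts depends on how the model labels each span, so the output above only describes the toy input.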

In [328]:
# As usual we are going to do a bunch of manipulation to change it into a DTM form.

entity_df = pd.DataFrame({'book_names':bookNames, 'full_names': fullNames, 'entities': entityList})
entity_df = entity_df.set_index('book_names')
entity_df

book_names,full_names,entities
chocofact,Charlie and the Chocolate Factory,"[charlie, grandpa joe, grandma, georgina, charlie, charlie, only two, only one, grandpa joe, grandma, georgina, little charlie, winter, all night,..."
fox,Fantastic Mr Fox!,"[thousands, thousands, turkey, apple, thousands, gallons, valley, fox, fox, fox, fox, turkey, fox, fox, valley, fox, fox, fox, fifty yards, number..."
matilda,Matilda,"[spring, six years, no more than six days, michael, matilda, matilda, the next county, half¬, michael, the age of one and a half, twelve inch, fiv..."


In [373]:
import collections as coll

entityCounter = []
for storyNames in bookNames:
    entityCounter.append(coll.Counter(entity_df.entities.loc[storyNames]))
entityCounter #Results hidden

w days': 1,
          'the next day': 2,
          'more than a month': 1,
          'house': 4,
          'willy': 12,
          'half a mile': 1,
          'ninety': 1,
          'the day': 4,
          'heard charlie': 1,
          'grandpa': 8,
          'george': 2,
          'all day long': 3,
          'half an hour': 3,
          'one evening': 1,
          'about fifty': 2,
          'little charlie    ': 1,
          'ninety six and a half': 2,
          'more than two hundred': 1,
          'earth': 5,
          'cold for hours and hours': 1,
          'all morning': 1,
          'every ten seconds': 1,
          'indian': 2,
          'prince pondicherry ’': 1,
          'grandpa george  ': 4,
          'india': 1,
          'one hundred': 4,
          'charlie    ‘': 2,
          'grandpa joe    ': 12,
          'charlie    ': 6,
          'tonight': 1,
          'tomorrow': 4,
          'evening': 7,
          'the next evening': 1,
          'thousands': 4,
          'on

Unfortunately, as we can see, this yields a lot of irrelevant results, many of which relate to time (hours, seconds, seasons...). Some entities are also split across several keys because of trailing whitespace and stray quote marks (`'charlie    '` and `'charlie    ‘'`, for instance). I have a feeling that counting entities wouldn't add much to this analysis, so I'm gonna stop this section here. In addition, it would require quite a bit of manipulation to convert this list of lists into a list of counters and then into a DTM; we can't use `CountVectorizer` as-is because an entity can span several words, whereas `CountVectorizer` decides for itself where one token ends and the next begins.
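
For reference, the conversion is not too long if done by hand. Here is a rough sketch using only the standard library, with a small whitespace/quote clean-up so that variants of the same entity share one key. The `normalise` and `entity_dtm` names and the sample lists are made up for illustration.

```python
from collections import Counter


def normalise(entity_text):
    """Collapse runs of whitespace and strip stray quote marks around an entity."""
    return " ".join(entity_text.split()).strip("‘’'\" ")


def entity_dtm(entity_lists, names):
    """Build a small document-term matrix from one list of entities per book.

    Returns (vocab, rows): vocab is the sorted list of distinct entities and
    rows maps each book name to its counts, in the same order as vocab.
    """
    counters = {}
    for name, entities in zip(names, entity_lists):
        cleaned = (normalise(e) for e in entities)
        counters[name] = Counter(e for e in cleaned if e)  # drop empty strings

    vocab = sorted(set().union(*counters.values()))
    rows = {name: [counters[name][term] for term in vocab] for name in names}
    return vocab, rows


# Made-up entity lists in the same shape as entityList.
sample_entities = [
    ["charlie", "charlie    ", "charlie    ‘", "grandpa joe    ", "india"],
    ["fox", "fox", "valley"],
]
vocab, rows = entity_dtm(sample_entities, ["chocofact", "fox"])

print(vocab)  # ['charlie', 'fox', 'grandpa joe', 'india', 'valley']
print(rows)   # {'chocofact': [3, 0, 1, 1, 0], 'fox': [0, 2, 0, 0, 1]}
```

The result can then be handed to pandas, for example with `pd.DataFrame.from_dict(rows, orient="index", columns=vocab)`, to get the same books-by-terms layout as the word DTM.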

It is now time to move on to exploratory data analysis.