# Data Cleaning

> Data cleaning helps avoid "garbage in, garbage out": we do not want to feed meaningless data into a model, which would probably return more meaningless junk.

This time I will skip the scraping step that data scientists normally do. Scraping would allow the content to be updated over time, but to be fair the content is pretty static anyway, so I don't really see the point of doing so. In addition, I imagine there would be quite a number of problems if the layout of the site changed.

## Outline for data cleaning
- Input: a simple text file with metadata removed. Headers and page numbers are kept though.
- Common pre-processing/cleaning procedures
  - All lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Using NLTK, perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.
  - Part-of-speech tagging
  - DTM for bi-grams/tri-grams (phrases like "thank you")
- Output
  - Corpus: not much different from the actual input, just with all the data cleaned up.
  - Document Term matrix: a matrix of word counts in the entire corpus.

SpaCy can perform these techniques as well, with a greater degree of efficiency. The extra features might be overkill for the time being though.

## Importing the data

For the purposes of this project, I will simply import the data from a text file, which will be parsed into a string object.

In [1]:
# List of file names for books to be analysed.
filenames = [
    'books/charlieandthechocolatefactory.txt',
    'books/fantasticmrfox.txt',
    'books/matilda.txt'
]

bookNames = [
    'chocofact',
    'fox',
    'matilda'
]

fullNames = [
    'Charlie and the Chocolate Factory',
    'Fantastic Mr Fox!',
    'Matilda'
]

In [2]:
# Import a text file and replace the line breaks '\n' with spaces
def importText(fileName):
    # The context manager closes the file once it has been read
    with open(fileName, "r", encoding="utf-8") as f:
        data = f.read().replace('\n', ' ')
    return data

In [3]:
# Test print the first 3000 characters of the third book 'Matilda'
rawBooks = [importText(bkName) for bkName in filenames]
print(rawBooks[2][:3000])

The Reader of Books    It’s a funny thing about mothers and fathers. Even  when their own child is the most disgusting little blister  you could ever imagine, they still think that he or she is  wonderful.   Some parents go further. They become so blinded by  adoration they manage to convince themselves their  child has qualities of genius.   Well, there is nothing very wrong with all this. It’s the  way of the world. It is only when the parents begin  telling us about the brilliance of their own revolting off¬  spring, that we start shouting, 'Bring us a basin! We’re  going to be sick!’    3      School teachers suffer a good deal from having to  listen to this sort of twaddle from proud parents, but  they usually get their own back when the time comes  to write the end-of-term reports. If I were a teacher I  would cook up some real scorchers for the children of  doting parents. ‘Your son Maximilian,’ I would write,  ‘is a total wash-out. I hope you have a family business  you can pus

But we want to make sure that the texts are indexed by their names as well, so that we don't have to access them by position. This makes it far more convenient to access the data in the future, especially if we decide to add a few more of Roald Dahl's texts!

In [4]:
import pandas as pd
pd.set_option('max_colwidth',150)

raw_df = pd.DataFrame({'book_names':bookNames, 'full_names': fullNames, 'text':rawBooks})

# set book names as index
raw_df = raw_df.set_index('book_names')

# sort dataframe and print
raw_df = raw_df.sort_index()
raw_df

| book_names | full_names | text |
|---|---|---|
| chocofact | Charlie and the Chocolate Factory | This book is fantastic it is about a very poor boy named Charlie Bucket. He always goes to school with out a jacket because they don’t have money... |
| fox | Fantastic Mr Fox! | Down in the valley there were three farms. The owners of these farms had done well. They were rich men. They were also nasty men. All three of the... |
| matilda | Matilda | The Reader of Books It’s a funny thing about mothers and fathers. Even when their own child is the most disgusting little blister you could ... |
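
As a small illustration of why the name index is convenient, here is a minimal sketch on a toy dataframe with the same column layout. The texts are placeholders rather than the real books, and `toy_df` is a name made up for this example.

```python
import pandas as pd

# Toy stand-in for raw_df: same columns, placeholder texts (not the real books)
toy_df = pd.DataFrame({
    'book_names': ['fox', 'matilda'],
    'full_names': ['Fantastic Mr Fox!', 'Matilda'],
    'text': ['Down in the valley...', 'The Reader of Books...'],
}).set_index('book_names')

# Look up a single cell by row label and column name
matilda_text = toy_df.loc['matilda', 'text']

# Position-based access still works, but depends on the row order
first_title = toy_df.iloc[0]['full_names']

print(matilda_text)
print(first_title)
```

Label-based lookups keep working when more books are added or the rows are re-sorted, which is not true of positional ones.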


Before we proceed any further, we should also pickle a raw copy of all books, which saves the object in a binary format. This is done for contingency purposes only.

In [5]:
import pickle

with open("rawBooks.pkl", "wb") as file:
    pickle.dump(raw_df, file)
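
To check that such a backup can actually be restored, a round trip along the following lines should work. This is only a sketch on a toy dataframe: the file name `toy_raw.pkl` is hypothetical, and it uses the pandas helpers `to_pickle` and `read_pickle` instead of the `pickle` module directly.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for raw_df
toy_df = pd.DataFrame({'text': ['some raw text']}, index=['fox'])

# Write to a temporary folder so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_raw.pkl')  # hypothetical file name
    toy_df.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored copy should match the original exactly
print(restored.equals(toy_df))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.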

In [6]:
# Check the list of book names (the index of the dataframe)
raw_df.index

Index(['chocofact', 'fox', 'matilda'], dtype='object', name='book_names')

In [7]:
# Test print the contents for Fantastic Mr Fox!
raw_df.text.loc['fox'][47000:]

' cider inside her inside.’  Then Badger joined in: 48  \x0c‘Oh poor Mrs Badger, he cried, So hungry she very near died. But she’ll not feel so hollow If only she’ll swallow Some cider inside her inside.’  They were still singing as they rounded the final corner and burst in upon the most wonderful and amazing sight any of them had ever seen. The feast was just beginning. A large dining-room had been hollowed out of the earth, and in the middle of it, seated around a huge table, were no less than twenty-nine animals. They were: Mrs Fox and three Small Foxes. Mrs Badger and three Small Badgers. Mole and Mrs Mole and four Small Moles. Rabbit and Mrs Rabbit and five Small Rabbits. 49  \x0cWeasel and Mrs Weasel and six Small Weasels. The table was covered with chickens and ducks and geese and hams and bacon, and everyone was tucking into the lovely food. ‘My darling!’ cried Mrs Fox, jumping up and hugging Mr Fox. ‘We couldn’t wait! Please forgive us!’ Then she hugged the Smallest Fox of al

## Getting started with Data cleaning!

When data scientists process numerical data, they often remove invalid data (which can be identified automatically or manually), duplicate data, outliers and null data. Text needs a similar treatment. There are several methods that we can apply iteratively along the way to clean our data:
  - All lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`, as well as other escape characters such as the form feed `\x0c` that follows page numbers like `51`)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Using NLTK, perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.
  - Part-of-speech tagging
  - DTM for bi-grams/tri-grams (phrases like "thank you")
  - Fix typos (a bit too advanced...)

We want to apply these methods iteratively such that we can observe the results after each cleaning stage; this is especially important for text-preprocessing since an overly aggressive approach may result in key information being lost.  
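
To make the idea of inspecting each stage concrete, here is a minimal sketch that runs the first few cleaning steps one at a time on a single line taken from the Fantastic Mr Fox output above. The `stages` list and its labels are just one way of organising this, not a fixed recipe.

```python
import re
import string

# A short excerpt in the style of the raw text, including a page number and form feed
sample = "49  \x0cWeasel and Mrs Weasel and six Small Weasels."

# Each stage is a (label, function) pair so the result can be inspected step by step
stages = [
    ("lower case", str.lower),
    ("form feeds", lambda t: re.sub(r'\x0c', ' ', t)),
    ("punctuation", lambda t: re.sub('[%s]' % re.escape(string.punctuation), ' ', t)),
    ("digits", lambda t: re.sub(r'\w*\d\w*', ' ', t)),
]

text = sample
for label, stage in stages:
    text = stage(text)
    print(f"{label:>12}: {text!r}")

# Collapse repeated spaces only for display; the stages above leave them in
collapsed = " ".join(text.split())
print(collapsed)
```

Printing with `!r` shows otherwise invisible characters such as `\x0c`, which makes it easier to spot what each stage did or did not remove.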

In [8]:
# First, make all text lower case, get rid of punctuation, numbers and other non-sensical text.

# Packages for string manipulation
import re
import string

def basic_text_clean(text):
    text = text.lower() # lower case
    text = re.sub(r'\x0c', ' ', text) # form feeds (page breaks)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text) # ASCII punctuation
    text = re.sub(r'\w*\d\w*', ' ', text) # any token containing a digit
    return text

# Alias used with DataFrame.apply below
basic_cleaning = basic_text_clean

Some say quotes (that is, anything within single or double quotation marks) should be removed as well, but in this context I believe the dialogue within the story is pretty important, so I will keep it and see how things work out first.

In [9]:
data_b_cleaned = pd.DataFrame(raw_df.text.apply(basic_cleaning))
data_b_cleaned

| book_names | text |
|---|---|
| chocofact | this book is fantastic it is about a very poor boy named charlie bucket he always goes to school with out a jacket because they don’t have money... |
| fox | down in the valley there were three farms the owners of these farms had done well they were rich men they were also nasty men all three of the... |
| matilda | the reader of books it’s a funny thing about mothers and fathers even when their own child is the most disgusting little blister you could ... |


In [10]:
data_b_cleaned.text.loc['fox'][47000:]

'r joined in      ‘oh poor mrs badger  he cried  so hungry she very near died  but she’ll not feel so hollow if only she’ll swallow some cider inside her inside ’  they were still singing as they rounded the final corner and burst in upon the most wonderful and amazing sight any of them had ever seen  the feast was just beginning  a large dining room had been hollowed out of the earth  and in the middle of it  seated around a huge table  were no less than twenty nine animals  they were  mrs fox and three small foxes  mrs badger and three small badgers  mole and mrs mole and four small moles  rabbit and mrs rabbit and five small rabbits      weasel and mrs weasel and six small weasels  the table was covered with chickens and ducks and geese and hams and bacon  and everyone was tucking into the lovely food  ‘my darling ’ cried mrs fox  jumping up and hugging mr fox  ‘we couldn’t wait  please forgive us ’ then she hugged the smallest fox of all  and mrs badger hugged badger  and everyone 
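
The output above still contains curly quotes such as ‘ and ’, because `string.punctuation` only covers ASCII punctuation. If these should be removed too, one possible extension (not applied in this notebook) is to add them to the character class. The name `extra_punctuation` is hypothetical.

```python
import re
import string

# Curly quotes are not part of string.punctuation, which is ASCII only
print('’' in string.punctuation)  # False

# Hypothetical extension: add the curly quotes seen in the books
extra_punctuation = string.punctuation + '‘’“”'
pattern = '[%s]' % re.escape(extra_punctuation)

sample = "‘we couldn’t wait  please forgive us ’"
cleaned = re.sub(pattern, ' ', sample)
print(cleaned)
```

Replacing the apostrophe with a space splits contractions ("couldn’t" becomes "couldn" and "t"), so whether this helps depends on what the later steps need.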

## Corpus cleaned!

As we can see, not much has actually been done so far, but at this point at least the text makes sense. Since the actual order of words does matter for things like sentiment analysis, performing further cleaning such as stemming and lemmatisation would only hurt the final sentence generation algorithm. So at this point, the corpus is ready to be pickled for further use.

In [11]:
data_b_cleaned.to_pickle("basic_cleaned_corpus.pkl")

## Document Term Matrix (DTM)

### Tokenization and Stop Words

A DTM is a matrix of word counts, with one row per document and one column per word. This means the text is split into words (tokenization) for further analysis. In this scenario, we can apply further techniques to remove relatively meaningless words, such as 'a', 'the' or various prepositions. These are known as stop words.

The packages that we can utilise for this section are:
- scikit-learn's CountVectorizer
- NLTK
- (maybe) SpaCy
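
As a quick illustration of tokenization and stop-word removal, the sketch below asks scikit-learn's `CountVectorizer` for the function it applies to each document and runs it on a made-up sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer returns the function CountVectorizer applies to each document:
# lower-casing, splitting into tokens, then dropping stop words
analyzer = CountVectorizer(stop_words='english').build_analyzer()

tokens = analyzer("The fox and the badger ate a chicken")
print(tokens)
```

With the default settings only tokens of two or more characters are kept, so a word like "a" is dropped even before the stop-word list is consulted.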

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

# Tokenize the text, drop English stop words, and build the DTM
cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(data_b_cleaned.text)
# Convert the sparse DTM into a pandas dataframe, one column per vocabulary term
# (get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names)
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_b_cleaned.index
data_dtm

| book_names | aback | abandon | abc | abdomen | abide | abilities | ability | able | absolute | absolutely | ... | yippeeeeee | yippeeeeeeee | yippeeeeeeeeee | youheard | young | younger | youreally | youth | zing | zip |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chocofact | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 13 | 2 | 10 | ... | 1 | 1 | 1 | 0 | 6 | 1 | 0 | 1 | 1 | 1 |
| fox | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| matilda | 1 | 1 | 1 | 1 | 0 | 1 | 3 | 14 | 5 | 6 | ... | 0 | 0 | 0 | 0 | 14 | 1 | 0 | 0 | 0 | 0 |


The advantages of this format are immediately apparent. It allows us to filter some meaningless words and presents the data in a neat and organised manner. Let's pickle this relatively raw or primitive DTM first.
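
To show what this format makes easy, here is a small sketch that queries a toy DTM. The three columns and their counts are copied from the table above; `toy_dtm` is a name made up for the example.

```python
import pandas as pd

# A few columns copied from the DTM shown above
toy_dtm = pd.DataFrame(
    {'able': [13, 0, 14], 'absolutely': [10, 1, 6], 'young': [6, 1, 14]},
    index=pd.Index(['chocofact', 'fox', 'matilda'], name='book_names'),
)

# Most frequent terms in one book
top_chocofact = toy_dtm.loc['chocofact'].nlargest(2)
print(top_chocofact)

# Total count of each term across all books
totals = toy_dtm.sum(axis=0)
print(totals)
```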

In [13]:
data_dtm.to_pickle("data_dtm.pkl")

### Bigrams

Dealing with such a large vocabulary is messy for various reasons:
- As we can see, there are some really similar words like "young" and "younger", which are inflected forms of the same root word.
- Some words have arbitrary spelling, like "yippeeeeeeee....!"
- The frequency of certain words is quite low, which means that they won't really matter for things like topic analysis. Perhaps they should be removed as well!

To recall, these are the things that we can work on to further clean our data:
  - Perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.
  - Part-of-speech tagging
  - Collating bi-grams/tri-grams (phrases like "thank you")

In [14]:
# Accept bigrams as well as single words
bigrams_cv = CountVectorizer(stop_words='english', ngram_range=(1, 2))
cleaned_cv = bigrams_cv.fit_transform(data_b_cleaned.text)
# get_feature_names_out needs scikit-learn >= 1.0; older versions use get_feature_names
cleaned_data_dtm = pd.DataFrame(cleaned_cv.toarray(), columns=bigrams_cv.get_feature_names_out())
cleaned_data_dtm.index = data_dtm.index
cleaned_data_dtm

| book_names | aback | aback arrival | abandon | abandon trunchbull | abc | abc quite | abdomen | abdomen daughter | abide | abide ugliness | ... | younger children | younger ones | youreally | youreally mean | youth | youth just | zing | zing fantastic | zip | zip guns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chocofact | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| fox | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| matilda | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


Doesn't look very useful now since there are 35000 columns! Let's only keep words that have appeared more than 5 times.

In [15]:
# Transpose for easier calculations, and add a column with the sum of each term's frequency across all books.
cleaned_data_dtm = cleaned_data_dtm.transpose()
cleaned_data_dtm['sum'] = cleaned_data_dtm.sum(axis=1, skipna=True)
cleaned_data_dtm

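| book_names | chocofact | fox | matilda | sum |
|---|---|---|---|---|
| aback | 0 | 0 | 1 | 1 |
| aback arrival | 0 | 0 | 1 | 1 |
| abandon | 0 | 0 | 1 | 1 |
| abandon trunchbull | 0 | 0 | 1 | 1 |
| abc | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... |
| youth just | 1 | 0 | 0 | 1 |
| zing | 1 | 0 | 0 | 1 |
| zing fantastic | 1 | 0 | 0 | 1 |
| zip | 1 | 0 | 0 | 1 |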


In [16]:
# Only keep words that have appeared more than 5 times.
filtered_data_dtm = cleaned_data_dtm[cleaned_data_dtm['sum'] > 5]

#Transpose back to the original format and display the DTM!
filtered_data_dtm = filtered_data_dtm.transpose()
filtered_data_dtm

| book_names | able | absolute | absolutely | actually | added | afraid | afternoon | afternoons | age | ago | ... | year old | years | yelled | yelled mrs | yelling | yellow | yes | yes miss | yesterday | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chocofact | 13 | 2 | 10 | 5 | 3 | 3 | 2 | 0 | 0 | 2 | ... | 2 | 8 | 14 | 7 | 3 | 3 | 29 | 0 | 2 | 6 |
| fox | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 4 | 0 | 0 | 0 | 11 | 0 | 1 | 1 |
| matilda | 14 | 5 | 6 | 9 | 9 | 8 | 15 | 9 | 8 | 7 | ... | 11 | 23 | 13 | 0 | 4 | 4 | 37 | 15 | 6 | 14 |
| sum | 27 | 7 | 17 | 15 | 12 | 11 | 18 | 9 | 8 | 9 | ... | 13 | 31 | 31 | 7 | 7 | 7 | 77 | 15 | 9 | 21 |


This leaves us with a much more manageable 1361 entries. Before we proceed any further, we should again pickle it for later use.

In [17]:
# Pickle the filtered bigram DTM (not the unfiltered data_dtm saved earlier)
filtered_data_dtm.to_pickle("high_bigram_dtm.pkl")
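
The frequency filter used above can also be written without transposing twice, and without leaving a `sum` row in the result. This is only an alternative sketch on made-up counts; `toy_dtm`, `term_totals` and `frequent` are hypothetical names.

```python
import pandas as pd

# Toy DTM: rows are books, columns are terms (counts are made up)
toy_dtm = pd.DataFrame(
    {'fox': [2, 40, 0], 'zing': [1, 0, 0], 'young': [6, 1, 14]},
    index=['chocofact', 'fox', 'matilda'],
)

# Column-wise totals, then a boolean mask over the columns
term_totals = toy_dtm.sum(axis=0)
frequent = toy_dtm.loc[:, term_totals > 5]

print(list(frequent.columns))
```

Keeping the totals in a separate Series means the DTM itself only ever holds one row per book.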

### Stemming

When words in a language change form to express grammatical information, the language is called an **inflected language**. For example, modifications are made to reflect tense, case, aspect, person, number, gender and mood or position in speech. Search engines make use of this: googling fish (I use DuckDuckGo) will also return results for fishes and fishing, as fish is the stem of both words.

Next we will utilise NLTK to perform stemming and lemmatisation for words in the DTM, to reduce the number of inflected words.

> Stemming is the process of reducing inflection in words to their root forms such as mapping a group of words to the same stem even if the stem itself is not a valid word in the Language.

A stem (or root) is the part of the word to which you add affixes such as -ed, -ize, -s, de- and mis-. It is obtained by removing the affixes used with a word.

I love stem.
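
To see stemming in action on the fish example, here is a minimal sketch using NLTK's `PorterStemmer`. Choosing the Porter stemmer is an assumption made for illustration; NLTK ships other stemmers that may behave differently.

```python
from nltk.stem import PorterStemmer

# The Porter stemmer is rule based and needs no extra NLTK data downloads
stemmer = PorterStemmer()

words = ['fish', 'fishes', 'fishing', 'fished']
stems = [stemmer.stem(word) for word in words]

for word, stem in zip(words, stems):
    print(word, '->', stem)
```

All four words should collapse to the same stem, which is what reduces the number of columns in the DTM. As the quote above says, a stem is not guaranteed to be a valid word for every input.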

In [22]:
import nltk

# Using the GUI select the punkt model for download.
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True