# Data Cleaning

> Data cleaning helps avoid "garbage in, garbage out": we do not want to feed meaningless data into a model that will probably return more meaningless junk.

This time I will skip the scraping step that data scientists normally do. Scraping would allow the content to be updated over time, but the content here is pretty static anyway, so I don't really see the point. In addition, I imagine there would be quite a number of problems if the layout of the site changed.

## Outline for data cleaning
- Input: a simple text file with metadata removed. Headers and page numbers are kept though.
- Common pre-processing / cleaning procedures
  - Convert all text to lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Use NLTK to perform stemming and lemmatisation on the words in the DTM, to reduce the number of inflected word forms
  - Part-of-speech tagging
  - DTM for bi-grams / tri-grams (phrases like "thank you")
- Output
  - Corpus: not much different from the actual input since there is only one file here, but with all the data cleaned up.
  - Document-term matrix (DTM): a matrix of word counts, with one row per document and one column per term.
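
To make the target output concrete, here is a minimal sketch of a document-term matrix built with only the standard library. The two toy documents and the `build_dtm` helper are illustrative assumptions, not the books or the code used in this project.

```python
from collections import Counter

def build_dtm(docs):
    """Return (vocabulary, rows): one row of word counts per document."""
    # Count whitespace-separated tokens in each (already cleaned) document.
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    # The vocabulary is the sorted set of all words seen in any document.
    vocab = sorted(set().union(*counts.values()))
    # A Counter returns 0 for words that a document does not contain.
    rows = {name: [c[word] for word in vocab] for name, c in counts.items()}
    return vocab, rows

# Toy documents (hypothetical, for illustration only).
docs = {
    "doc_a": "the fox ran and the fox hid",
    "doc_b": "the badger ran",
}
vocab, rows = build_dtm(docs)
print(vocab)
for name, row in rows.items():
    print(name, row)
```

Each row has the same length as the vocabulary, so documents can be compared column by column even when they share few words.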

spaCy can perform the same techniques as NLTK, with a greater degree of efficiency. The extra features might be overkill for the time being, though.

## Importing the data

For the purposes of this project, I will simply import the data from a text file, which will be parsed into a string object.

In [1]:
# List of file names for books to be analysed.
filenames = [
    'books/charlieandthechocolatefactory.txt',
    'books/fantasticmrfox.txt',
    'books/matilda.txt'
]

bookNames = [
    'chocofact',
    'fox',
    'matilda'
]

In [2]:
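# Import a text file as a single string
def importText(fileName):
    # The context manager closes the file handle once reading is done.
    with open(fileName, "r", encoding="utf-8") as file:
        # Note: line breaks are deleted rather than replaced with a space,
        # so words on either side of a break are joined together.
        data = file.read().replace('\n', '')
    return data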

In [3]:
# Test print the first 5000 characters of the third book 'Matilda'
rawBooks = [importText(bkName) for bkName in filenames]
print(rawBooks[2][:5000])

The Reader of Books It’s a funny thing about mothers and fathers. Even when their own child is the most disgusting little blister you could ever imagine, they still think that he or she is wonderful. Some parents go further. They become so blinded by adoration they manage to convince themselves their child has qualities of genius. Well, there is nothing very wrong with all this. It’s the way of the world. It is only when the parents begin telling us about the brilliance of their own revolting off¬ spring, that we start shouting, 'Bring us a basin! We’re going to be sick!’ 3 School teachers suffer a good deal from having to listen to this sort of twaddle from proud parents, but they usually get their own back when the time comes to write the end-of-term reports. If I were a teacher I would cook up some real scorchers for the children of doting parents. ‘Your son Maximilian,’ I would write, ‘is a total wash-out. I hope you have a family business you can push him into when he leaves schoo

We also want the texts to be indexed by name, so that we don't have to access them by position. This makes it far more convenient to access the data in the future, especially if we decide to add a few more of Roald Dahl's books!

In [4]:
import pandas as pd

raw_df = pd.DataFrame({'book_names':bookNames, 'text':rawBooks})

# set book names as index
raw_df = raw_df.set_index('book_names')

# sort dataframe and print
raw_df = raw_df.sort_index()
raw_df

| book_names | text |
|---|---|
| chocofact | This book is fantastic it is about a very poor... |
| fox | Down in the valley there were three farms. The... |
| matilda | The Reader of Books It’s a funny thing about m... |


Before we proceed any further, we should also pickle a raw copy of all the books, which saves the DataFrame in a binary format. This is purely a backup in case something goes wrong later.

In [5]:
import pickle

with open("rawBooks.pkl", "wb") as file:
    pickle.dump(raw_df, file)
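
A quick way to confirm that a backup like this is usable is to load it back and compare. The sketch below does a round trip on a small stand-in dictionary written to a temporary file; the data and the file name are assumptions, not the project's `rawBooks.pkl`.

```python
import os
import pickle
import tempfile

# Stand-in for the raw data (hypothetical; the notebook pickles a DataFrame).
raw_copy = {"fox": "Down in the valley there were three farms."}

# Write to a temporary directory so nothing in the project is overwritten.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "raw_copy.pkl")
    with open(path, "wb") as file:
        pickle.dump(raw_copy, file)
    # Reload the object and check that it matches what was saved.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == raw_copy)
```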

In [6]:
# Check the list of book names.
# `book_names` is now the index rather than a column, so `raw_df.book_names` raises an AttributeError.
list(raw_df.index)

['chocofact', 'fox', 'matilda']

In [7]:
# Test print the contents for Fantastic Mr Fox!
raw_df.text.loc['fox'][45000:]

'nners,’ Badger said. ‘All rats have badmanners. I’ve never met a polite rat yet.’‘And he drinks too much,’ said Mr Fox, putting the last47\x0cbrick in place. ‘There we are. Now, home to the feast!’They grabbed their jars of cider and off they went. MrFox was in front, the Smallest Fox came next andBadger last. Along the tunnel they flew . . . past theturning that led to Bunce’s Mighty Storehouse . . . pastBoggis’s Chicken House Number One and then up thelong home stretch towards the place where they knewMrs Fox would be waiting.‘Keep it up, my darlings!’ shouted Mr Fox. ‘We’ll soonbe there! Think what’s waiting for us at the other end!And just think what we’re bringing home with us inthese jars! That ought to cheer up poor Mrs Fox.’ MrFox sang a little song as he ran:‘Home again swiftly I glide,Back to my beautiful bride.She’ll not feel so rottenAs soon as she’s gottenSome cider inside her inside.’Then Badger joined in:48\x0c‘Oh poor Mrs Badger, he cried,So hungry she very near died.B
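
The fused words in the output above (`badmanners`, `andBadger`) come from deleting line breaks outright in `importText`. A possible alternative, sketched here on a made-up sample string rather than the actual book file, is to turn every run of whitespace into a single space. `normalise_whitespace` is a hypothetical helper name.

```python
import re

def normalise_whitespace(text):
    """Replace any run of whitespace (line breaks, form feeds, spaces) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample with a line break inside a sentence and a form feed (\x0c).
sample = "All rats have bad\nmanners.\x0cI've never met\na polite rat yet."

fused = sample.replace("\n", "")       # words fuse across the line break
spaced = normalise_whitespace(sample)  # words stay separate
print(repr(fused))
print(repr(spaced))
```

Switching to this approach would shift the character offsets used for slicing elsewhere in the notebook, so the printed excerpts would no longer line up exactly.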

## Getting started with data cleaning!

When data scientists process numerical data, they often remove invalid data (identified either automatically or by manual inspection), duplicate data, outliers and null values. Text needs its own set of cleaning methods, which we can apply iteratively along the way:
  - Convert all text to lower case
  - Remove punctuation, symbols and numerical values
  - Remove common nonsensical text (such as line breaks `\n`, as well as other escape characters such as the form feed `\x0c`, which follows a page number as in `51\x0c`)
  - Tokenize text: split sentences into individual words (in preparation for the DTM)
  - Remove stop words
  - Use NLTK to perform stemming and lemmatisation on the words in the DTM, to reduce the number of inflected word forms
  - Part-of-speech tagging
  - DTM for bi-grams / tri-grams (phrases like "thank you")
  - Fix typos (a bit too advanced for now)

We want to apply these methods one at a time so that we can observe the results after each cleaning stage; this is especially important for text preprocessing, since an overly aggressive approach may result in key information being lost.
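
One way to keep each stage observable is to run the steps as a list of small functions and print the intermediate text. The following is a minimal sketch on a toy sentence; the stage functions and the tiny stop-word set are illustrative assumptions, not the NLTK resources mentioned above.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption; NLTK ships a fuller list).
STOP_WORDS = {"the", "a", "and", "of", "in", "to"}

def lower_case(text):
    return text.lower()

def strip_punctuation(text):
    # Replace each ASCII punctuation character with a space.
    return re.sub("[%s]" % re.escape(string.punctuation), " ", text)

def drop_digit_tokens(text):
    # Remove any token that contains at least one digit.
    return re.sub(r"\w*\d\w*", " ", text)

def remove_stop_words(text):
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

stages = [lower_case, strip_punctuation, drop_digit_tokens, remove_stop_words]

text = "The Smallest Fox came next, and Badger last. 47"
history = [text]  # keep every intermediate result for inspection
for stage in stages:
    text = stage(text)
    history.append(text)
    print(f"{stage.__name__:>18}: {text!r}")

tokens = text.split()
print(tokens)
```

Keeping `history` around makes it easy to spot the stage at which a useful word disappears, which is the risk with an overly aggressive step.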

In [8]:
# First, make all text lower case, then get rid of punctuation, numbers and other nonsensical text.

import re
import string

def basic_text_clean(text):
    text = text.lower()  # lower case
    text = re.sub(r'\x0c', ' ', text)  # form feed characters (page breaks)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # ASCII punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)  # any token containing a digit
    return text

# Alias kept so the cleaning step can be passed straight to `apply`.
basic_cleaning = basic_text_clean

In [9]:
data_b_cleaned = pd.DataFrame(raw_df.text.apply(basic_cleaning))
data_b_cleaned

| book_names | text |
|---|---|
| chocofact | this book is fantastic it is about a very poor... |
| fox | down in the valley there were three farms the... |
| matilda | the reader of books it’s a funny thing about m... |


In [11]:
data_b_cleaned.text.loc['fox'][46000:]

'rounded the final cornerand burst in upon the most wonderful and amazingsight any of them had ever seen  the feast was justbeginning  a large dining room had been hollowed outof the earth  and in the middle of it  seated around ahuge table  were no less than twenty nine animals they were mrs fox and three small foxes mrs badger and three small badgers mole and mrs mole and four small moles rabbit and mrs rabbit and five small rabbits   weasel and mrs weasel and six small weasels the table was covered with chickens and ducks andgeese and hams and bacon  and everyone was tuckinginto the lovely food ‘my darling ’ cried mrs fox  jumping up and hugging mrfox  ‘we couldn’t wait  please forgive us ’ then shehugged the smallest fox of all  and mrs badger huggedbadger  and everyone hugged everyone else  amidshouts of joy  the great jars of cider were placed uponthe table  and mr fox and badger and the smallest foxsat down with the others you must remember no one had eaten a thing forseveral da
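
The curly quotes (`‘`, `’`) survive in the output above because `string.punctuation` only covers ASCII punctuation. A possible extension, shown as a sketch rather than the notebook's actual cleaning step, adds the typographic characters seen in these excerpts to the character class. The `EXTRA_PUNCT` constant and the `strip_all_punctuation` name are assumptions.

```python
import re
import string

# Typographic characters seen in the printed excerpts (assumed set; extend as needed).
EXTRA_PUNCT = "‘’¬"
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation + EXTRA_PUNCT))

def strip_all_punctuation(text):
    """Replace ASCII and selected typographic punctuation with spaces."""
    return PUNCT_PATTERN.sub(" ", text)

sample = "‘my darling ’ cried mrs fox"
cleaned_tokens = strip_all_punctuation(sample).split()
print(cleaned_tokens)
```

Apostrophes inside contractions such as `couldn’t` become spaces too, which matches how the ASCII apostrophe is already treated by `basic_text_clean`.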