# Text Cleaning
Garth Mortensen  
2019.09.19

## Overview

I have performed text analysis using Matlab's excellent toolboxes, but never in Python. This is an effort to transfer that knowledge from Matlab to Python. I'll pull some text from [Project Gutenberg](https://www.gutenberg.org/) and see what I can do with it.

## Import Libraries

In [5]:
import requests  # for reading website
import string  # for punctuation list
import re  # regex for text cleaning
import nltk  # natural language toolkit for tokenizing, stopwords, and stemming
# nltk.download()  # Using the GUI, download all packages, or the ones of your choosing.

## Read Text

Provide a URL to a book and, if the connection worked, read it into a variable.

**Note**: You could use requests or another popular solution, urllib2 (Python 2 only; its Python 3 successor is urllib.request), but there are reported security [problems](https://www.nbu.gov.sk/skcsirt-sa-20170909-pypi/index.html) around the latter. I've opted to keep my distance.

In [7]:
# Using a Project Gutenberg book (Moby Dick)
url = "http://www.gutenberg.org/files/2701/2701-0.txt"

response = requests.get(url)

print("The url status is:", response.status_code)
print("Encoding style is:", response.encoding)

The url status is: 200
Encoding style is: ISO-8859-1


In [8]:
# print a confirmation that the url leads to a successful connection.
try:
    response.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

if response.ok:  # True unless the status is 4xx or 5xx
    print("Connection successful")

# keep the decoded text under a shorter name and drop the response object
text = response.text
del response

Connection successful


## Visual Check

Check that the content makes sense.

In [9]:
print("Total characters in story:", len(text))

Total characters in story: 1276201


Preview the text. Chapter 1 starts at character 29455, so I'll drop everything before that point and preview the start of the chapter.

In [10]:
# remove everything before chapter 1
text = text[29455:]

# display first 250 characters in book.
print(text[:250])

 1. Loomings.

Call me Ishmael. Some years agoânever mind how long preciselyâhaving
little or no money in my purse, and nothing particular to interest me
on shore, I thought I would sail about a little and see the watery part
of the world. I


We can see that some cleaning is required.
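The hard-coded offset of 29455 only holds for this exact file. As a rough sketch, the start could instead be located by searching for a marker string. The `sample` text and the `marker` below are hypothetical stand-ins, not the real Gutenberg file; the assumption is that the heading appears once in a table of contents and again where the chapter starts, so the last occurrence is taken.

```python
# Hypothetical stand-in for the downloaded book: front matter, a contents
# entry, and then the chapter itself.
sample = (
    "Front matter\n"
    "CONTENTS\n"
    "CHAPTER 1. Loomings.\n"
    "\n"
    "CHAPTER 1. Loomings.\n"
    "\n"
    "Call me Ishmael."
)

marker = "CHAPTER 1. Loomings."  # assumed heading text

# rfind returns the index of the last occurrence, or -1 if absent.
start = sample.rfind(marker)
if start == -1:
    raise ValueError("marker not found")

body = sample[start:]
print(body[:40])
```

Raising an error when the marker is missing avoids silently slicing from the wrong position.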

## Text Cleaning with Python

Let's split on whitespace, then strip the following punctuation characters from each word:

In [11]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [12]:
# split by white spaces
words = text.split()

# prepare regex for char filtering
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

# remove punctuation from each word
stripped = [re_punc.sub('', w) for w in words]
print(stripped[:100])

['1', 'Loomings', 'Call', 'me', 'Ishmael', 'Some', 'years', 'agoâ\x80\x94never', 'mind', 'how', 'long', 'preciselyâ\x80\x94having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'November', 'in', 'my', 'soul', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet', 'and', 'especially', 'whenever', 'my']
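An equivalent way to drop punctuation, shown here as a sketch rather than what the cell above does, is `str.translate` with a table built by `str.maketrans`. The `sample` string is made up for illustration.

```python
import string

# Translation table that deletes every character in string.punctuation.
table = str.maketrans("", "", string.punctuation)

sample = "Call me Ishmael. Some years ago, never mind how long!"
stripped_sample = [w.translate(table) for w in sample.split()]
print(stripped_sample)
```

Both approaches only touch the ASCII characters listed in `string.punctuation`, so typographic marks such as em dashes pass through untouched.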


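There are some garbled characters, such as `â\x80\x94`. The cause appears to be an encoding mismatch: requests decoded the response as [ISO-8859](https://www.gutenberg.org/wiki/Gutenberg:File_Formats_FAQ#ISO-8859.2FISO-Latin_.28Character_Sets.29)-1 (the encoding reported above), while the byte sequence behind `â\x80\x94` is the utf-8 encoding of an em dash. For this exercise, any character outside `string.printable` will simply be stripped out.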
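As an aside, the garbling can be reproduced and reversed with the standard codecs. This is a minimal sketch on a made-up string, assuming the underlying bytes are UTF-8 that were decoded as ISO-8859-1.

```python
original = "ago\u2014never"  # the two words joined by a real em dash (U+2014)

# Encode as UTF-8, then decode with the wrong codec to reproduce the garbling.
garbled = original.encode("utf-8").decode("iso-8859-1")
print(repr(garbled))

# Reverse the mistake: back to the raw bytes, then decode as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```

With requests, the corresponding fix would be to set `response.encoding = "utf-8"` before reading `response.text`. That would also change the character count and the chapter offset used earlier, so the cell below takes the simpler route of stripping.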

In [17]:
re_print = re.compile('[^%s]' % re.escape(string.printable))
result = [re_print.sub('', w) for w in words]

print("List length:", len(result))
print(result[:100])

List length: 211388
['1.', 'Loomings.', 'Call', 'me', 'Ishmael.', 'Some', 'years', 'agonever', 'mind', 'how', 'long', 'preciselyhaving', 'little', 'or', 'no', 'money', 'in', 'my', 'purse,', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore,', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world.', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation.', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth;', 'whenever', 'it', 'is', 'a', 'damp,', 'drizzly', 'November', 'in', 'my', 'soul;', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses,', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet;', 'and', 'especially', 'whenever', 'my']


There remain issues, such as the two words around the dash in "precisely-having" being fused into "preciselyhaving". The purpose of this exercise is not to perfect the text-cleaning process, but to explore text analysis. I will move on.
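One way to keep words like these apart, sketched here under the assumption that the text has been decoded correctly so that the dash is a real em dash (U+2014), is to replace the dash with a space before splitting. The `sample` string is illustrative.

```python
import re
import string

sample = "Some years ago\u2014never mind how long precisely\u2014having little"

# Turn em dashes into spaces first so the neighbouring words stay separate.
spaced = sample.replace("\u2014", " ")

re_punc = re.compile("[%s]" % re.escape(string.punctuation))
cleaned = [re_punc.sub("", w) for w in spaced.split()]
print(cleaned)
```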

## Text Cleaning with nltk

### Tokenize
According to [MathWorks](https://www.mathworks.com/help/textanalytics/ref/tokenizeddocument.html),
> A **tokenized** document is a document represented as a collection of words (also known as tokens) which is used for text analysis.

First, let's tokenize this into sentences.

In [18]:
from nltk import sent_tokenize

# split into sentences
sentences = sent_tokenize(text)
print("List length:", len(sentences))
print(sentences[:5])

List length: 8712
[' 1.', 'Loomings.', 'Call me Ishmael.', 'Some years agoâ\x80\x94never mind how long preciselyâ\x80\x94having\r\nlittle or no money in my purse, and nothing particular to interest me\r\non shore, I thought I would sail about a little and see the watery part\r\nof the world.', 'It is a way I have of driving off the spleen and\r\nregulating the circulation.']


The presence of \r\n (carriage return and new line) indicates the artificial word-wrapping was preserved.
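If the wrapping is unwanted, one simple option is to collapse every run of whitespace into a single space. A minimal sketch on one of the sentences above:

```python
sentence = (
    "It is a way I have of driving off the spleen and\r\n"
    "regulating the circulation."
)

# str.split() with no arguments splits on any run of whitespace,
# including \r\n, so joining with a single space undoes the wrapping.
unwrapped = " ".join(sentence.split())
print(unwrapped)
```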

Next, let's tokenize into words.

In [19]:
from nltk.tokenize import word_tokenize

# split into words
tokens = word_tokenize(text)
print("List length:", len(tokens))
print(tokens[:100])

List length: 246386
['1', '.', 'Loomings', '.', 'Call', 'me', 'Ishmael', '.', 'Some', 'years', 'agoâ\x80\x94never', 'mind', 'how', 'long', 'preciselyâ\x80\x94having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', '.', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', ';', 'whenever', 'it', 'is', 'a', 'damp', ',', 'drizzly', 'November', 'in', 'my', 'soul', ';', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', ',', 'and', 'bringing', 'up']


Punctuation marks now appear as separate tokens. Let's remove them by keeping only entries that consist entirely of alphabetic characters.

We'll keep using [list comprehensions](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python) to process the text.

In [29]:
# keep only alphabetic entries
# note: this filters `stripped` (the regex-cleaned list from earlier), not `tokens`
words = [word for word in stripped if word.isalpha()]

print("List length:", len(words))
print(words[:100])

List length: 204265
['Loomings', 'Call', 'me', 'Ishmael', 'Some', 'years', 'mind', 'how', 'long', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'Whenever', 'I', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'November', 'in', 'my', 'soul', 'whenever', 'I', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'I', 'meet', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such']


We have gone  
from: _"Some years agoâ\x80\x94never mind"_  
to: _"Some years mind"_  
which, for this exercise, is an improvement.
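The garbled token disappears because `str.isalpha()` requires every character to be a letter. A small sketch with made-up samples (the third reproduces the garbled token, whose `\x80` and `\x94` are control characters):

```python
samples = ["Loomings", "1", "ago\u00e2\u0080\u0094never", "preciselyhaving", ""]

for s in samples:
    # str.isalpha() is True only if the string is non-empty and every
    # character is a letter; digits and control characters fail the test.
    print(repr(s), s.isalpha())

kept = [s for s in samples if s.isalpha()]
```

Note that the filter drops the whole entry, so both "ago" and "never" are lost along with the stray bytes.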

In [30]:
words = [w.lower() for w in words]
print("List length:", len(words))
print(words[:100])

List length: 204265
['loomings', 'call', 'me', 'ishmael', 'some', 'years', 'mind', 'how', 'long', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', 'i', 'thought', 'i', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', 'it', 'is', 'a', 'way', 'i', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', 'whenever', 'i', 'find', 'myself', 'growing', 'grim', 'about', 'the', 'mouth', 'whenever', 'it', 'is', 'a', 'damp', 'drizzly', 'november', 'in', 'my', 'soul', 'whenever', 'i', 'find', 'myself', 'involuntarily', 'pausing', 'before', 'coffin', 'warehouses', 'and', 'bringing', 'up', 'the', 'rear', 'of', 'every', 'funeral', 'i', 'meet', 'and', 'especially', 'whenever', 'my', 'hypos', 'get', 'such']


### Stopwords

Now let's chop out all the stopwords. **Stopwords** are generally the highest-frequency words in a language. You can more or less find English-language stopwords using this [technique](https://youtu.be/zth-Awh2xWk); you just need to decide whether your stopword list will be the top 100 or the top 1,000 words. In the stopword lists provided by nltk, the number of words varies widely by language.
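A frequency-based list can be sketched with `collections.Counter`. This is an assumption about the general idea (rank words by count and keep the top n), not a reproduction of the linked technique, and the tiny corpus is invented.

```python
from collections import Counter

# Tiny illustrative corpus; a real list would come from a large text.
sample = "the whale and the sea and the ship and the captain of the ship"
tokens_sample = sample.lower().split()

n = 2  # hypothetical cutoff; the text suggests 100 or 1,000 for real data
counts = Counter(tokens_sample)
candidate_stopwords = [word for word, _ in counts.most_common(n)]
print(candidate_stopwords)
```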

I found the supported languages here:
> C:\Users\grm\AppData\Roaming\nltk_data\corpora\stopwords

and used cmd to redirect the directory listing to a .txt file:
> dir > languageList.txt

then copied the results into a list.
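The same listing can be produced from Python without cmd. The sketch below uses `pathlib`; `list_languages` is a hypothetical helper, and it assumes the folder holds one file per language plus possibly a README. The demo runs on a temporary directory rather than the real nltk_data folder. (nltk corpus readers also provide `fileids()`, so `stopwords.fileids()` should return the same names.)

```python
from pathlib import Path
import tempfile

def list_languages(corpus_dir):
    """Return the sorted file names in corpus_dir, skipping any README."""
    return sorted(
        p.name for p in Path(corpus_dir).iterdir()
        if p.is_file() and p.name.upper() != "README"
    )

# Demo on a temporary directory standing in for the stopwords folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("english", "danish", "README"):
        (Path(tmp) / name).write_text("", encoding="utf-8")
    languages_found = list_languages(tmp)

print(languages_found)
```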

In [31]:
from nltk.corpus import stopwords

language = ["arabic", "azerbaijani", "danish", "dutch", "english", "finnish", "french", "german", "greek", "hungarian", "indonesian", "italian", "kazakh", "nepali", "norwegian", "portuguese", "romanian", "russian", "slovene", "spanish", "swedish", "tajik", "turkish"]

print("Total stopwords by language")
for lan in language:
    stop_words = stopwords.words(lan)
    print(lan, ": \t", len(stop_words))

Total stopwords by language
arabic : 	 248
azerbaijani : 	 165
danish : 	 94
dutch : 	 101
english : 	 179
finnish : 	 235
french : 	 157
german : 	 232
greek : 	 265
hungarian : 	 199
indonesian : 	 758
italian : 	 279
kazakh : 	 324
nepali : 	 255
norwegian : 	 176
portuguese : 	 204
romanian : 	 356
russian : 	 151
slovene : 	 1784
spanish : 	 313
swedish : 	 114
tajik : 	 163
turkish : 	 53


Just out of curiosity, let's look at the stopwords in a few interesting languages. I wonder if we can see similarities among the Turkic languages (which I think cover much of Central Asia) and among the non-Indo-European languages of Europe.

In [32]:
languages = ["turkish", "azerbaijani", "kazakh", "finnish", "hungarian"]

for lan in languages:
    stop_words = stopwords.words(lan)
    print("\nTop 10 stopwords in " + lan + ":")
    print(stop_words[:10])


Top 10 stopwords in turkish:
['acaba', 'ama', 'aslında', 'az', 'bazı', 'belki', 'biri', 'birkaç', 'birşey', 'biz']

Top 10 stopwords in azerbaijani:
['a', 'ad', 'altı', 'altmış', 'amma', 'arasında', 'artıq', 'ay', 'az', 'bax']

Top 10 stopwords in kazakh:
['ах', 'ох', 'эх', 'ай', 'эй', 'ой', 'тағы', 'тағыда', 'әрине', 'жоқ']

Top 10 stopwords in finnish:
['olla', 'olen', 'olet', 'on', 'olemme', 'olette', 'ovat', 'ole', 'oli', 'olisi']

Top 10 stopwords in hungarian:
['a', 'ahogy', 'ahol', 'aki', 'akik', 'akkor', 'alatt', 'által', 'általában', 'amely']


Okay, back on track. Let's see English.

In [33]:
stop_words = stopwords.words('english')
print("\ntotal English stopwords:", len(stop_words))
print(stop_words)


total English stopwords: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 

Let's take out the stopwords using a list comprehension.

In [34]:
words = [w for w in words if w not in stop_words]
print("List length:", len(words))
print(words[:100])

List length: 102487
['loomings', 'call', 'ishmael', 'years', 'mind', 'long', 'little', 'money', 'purse', 'nothing', 'particular', 'interest', 'shore', 'thought', 'would', 'sail', 'little', 'see', 'watery', 'part', 'world', 'way', 'driving', 'spleen', 'regulating', 'circulation', 'whenever', 'find', 'growing', 'grim', 'mouth', 'whenever', 'damp', 'drizzly', 'november', 'soul', 'whenever', 'find', 'involuntarily', 'pausing', 'coffin', 'warehouses', 'bringing', 'rear', 'every', 'funeral', 'meet', 'especially', 'whenever', 'hypos', 'get', 'upper', 'hand', 'requires', 'strong', 'moral', 'principle', 'prevent', 'deliberately', 'stepping', 'street', 'methodically', 'knocking', 'hats', 'account', 'high', 'time', 'get', 'sea', 'soon', 'substitute', 'pistol', 'ball', 'philosophical', 'flourish', 'cato', 'throws', 'upon', 'sword', 'quietly', 'take', 'ship', 'nothing', 'surprising', 'knew', 'almost', 'men', 'degree', 'time', 'cherish', 'nearly', 'feelings', 'towards', 'ocean', 'insular', 'city', '

We have gone  
from:  _"Call me Ishmael"_  
to: _"call ishmael"_  
which is an improvement.
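Because `stop_words` is a list, each membership test scans it from the start. Converting it to a set first gives the same result with faster lookups. A sketch with a hypothetical miniature stopword list:

```python
# Hypothetical miniature stopword list standing in for stopwords.words('english').
stop_list = ["i", "me", "my", "a", "the", "and"]
stop_set = set(stop_list)  # set lookups are O(1) on average; list lookups scan

sample_words = ["call", "me", "ishmael", "and", "the", "whale"]
filtered = [w for w in sample_words if w not in stop_set]
print(filtered)
```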

### Stemming

According to [Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html),
>**Stemming** usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. **Lemmatization** usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Porter's stemmer is the best-known stemming algorithm. Martin Porter published it in 1980 while working at Cambridge. There are stemmers for other languages that follow other rules.

The goal is to reduce the total vocabulary space by simplifying words into their stems, chopping off the suffix.

_cat_ <- cat, cats, cat's, cats'
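To make the idea concrete, here is a deliberately naive suffix stripper. It is a toy for illustration only, not Porter's algorithm, and `toy_stem` is a hypothetical name.

```python
def toy_stem(word):
    """Strip a possessive marker or plural 's'. A toy, not Porter's algorithm."""
    word = word.lower()
    for suffix in ("'s", "s'", "s"):
        # Require at least two characters to remain after stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

forms = ["cat", "cats", "cat's", "cats'"]
stems = [toy_stem(w) for w in forms]

print(stems)
print(len(set(forms)), "->", len(set(stems)))
```

Four distinct surface forms collapse into one stem, which is the vocabulary reduction described above. A real stemmer applies many more rules and handles far more suffixes.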

Let's get stemmy.

In [35]:
from nltk.stem.porter import PorterStemmer

# stem the words
porter = PorterStemmer()

stemmed = [porter.stem(word) for word in words]
print("List length:", len(stemmed))
print(stemmed[:100])

List length: 102487
['loom', 'call', 'ishmael', 'year', 'mind', 'long', 'littl', 'money', 'purs', 'noth', 'particular', 'interest', 'shore', 'thought', 'would', 'sail', 'littl', 'see', 'wateri', 'part', 'world', 'way', 'drive', 'spleen', 'regul', 'circul', 'whenev', 'find', 'grow', 'grim', 'mouth', 'whenev', 'damp', 'drizzli', 'novemb', 'soul', 'whenev', 'find', 'involuntarili', 'paus', 'coffin', 'warehous', 'bring', 'rear', 'everi', 'funer', 'meet', 'especi', 'whenev', 'hypo', 'get', 'upper', 'hand', 'requir', 'strong', 'moral', 'principl', 'prevent', 'deliber', 'step', 'street', 'method', 'knock', 'hat', 'account', 'high', 'time', 'get', 'sea', 'soon', 'substitut', 'pistol', 'ball', 'philosoph', 'flourish', 'cato', 'throw', 'upon', 'sword', 'quietli', 'take', 'ship', 'noth', 'surpris', 'knew', 'almost', 'men', 'degre', 'time', 'cherish', 'nearli', 'feel', 'toward', 'ocean', 'insular', 'citi', 'manhatto', 'belt', 'round', 'wharv']


You can see that this has produced truncated and sometimes hard-to-recognize stems, such as 'littl', 'wateri', and 'involuntarili'. Thankfully, text analysis does not require perfectly cleaned text files. We can move on with this working document.