# How to Clean Text for Machine Learning with Python

Source: https://machinelearningmastery.com/clean-text-machine-learning-python/

by Jason Brownlee on October 18, 2017 in Deep Learning for Natural Language Processing

Text preparation methods are specific to the natural language processing task.

In [1]:
!wget http://www.gutenberg.org/cache/epub/5200/pg5200.txt

--2020-06-29 10:32:49--  http://www.gutenberg.org/cache/epub/5200/pg5200.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47
Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 141420 (138K) [text/plain]
Saving to: ‘pg5200.txt’


2020-06-29 10:32:49 (1.01 MB/s) - ‘pg5200.txt’ saved [141420/141420]



In [2]:
!mv pg5200.txt metamorphosis.txt

## Text Cleaning is Task Specific

- It’s plain text so there is no markup to parse (yay!).
- The translation of the original German uses UK English (e.g. “travelling“).
- The lines are artificially wrapped with new lines at about 70 characters (meh).
- There are no obvious typos or spelling mistakes.
- There’s punctuation like commas, apostrophes, quotes, question marks, and more.
- There’s hyphenated descriptions like “armour-like”.
- There’s a lot of use of the em dash (“-“) to continue sentences (maybe replace with commas?).
- There are names (e.g. “Mr. Samsa“).
- There do not appear to be numbers that require handling (e.g. 1999).
- There are section markers (e.g. “II” and “III”), and we have removed the first “I”.
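
Some of these observations translate directly into small preprocessing steps. As a minimal sketch, hard-wrapped lines could be rejoined while keeping blank-line paragraph breaks. The helper name `unwrap_paragraphs` and the second sample paragraph are illustrative assumptions, not part of the original notebook:

```python
import re

def unwrap_paragraphs(text):
    """Join hard-wrapped lines inside each paragraph; keep paragraph breaks."""
    # Paragraphs are assumed to be separated by one or more blank lines.
    paragraphs = re.split(r'\n\s*\n', text.strip())
    # Inside a paragraph, collapse every run of whitespace (including
    # newlines) into a single space.
    return '\n\n'.join(' '.join(p.split()) for p in paragraphs)

# Illustrative sample: first sentence of the book plus a made-up second paragraph.
sample = ("One morning, when Gregor Samsa woke from troubled dreams, he found\n"
          "himself transformed in his bed into a horrible vermin.\n"
          "\n"
          "A second,\nshort paragraph.")
print(unwrap_paragraphs(sample))
```

Whether unwrapping is worth doing depends on the task: it matters for sentence-level work, and matters little if the text is only ever split on whitespace.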

## Manual Tokenization

### Load Data

In [3]:
filename = 'metamorphosis.txt'
with open(filename, 'r') as file:
    text = file.read()
    print(text[:70])

One morning, when Gregor Samsa woke from troubled dreams, he found
him


### Split by Whitespace

In [5]:
# load text
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
# split() with no argument splits on any run of whitespace, including newlines
words = text.split()
print(words[:20])

['One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.']
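
The argument passed to `split` matters when the text contains newlines or repeated spaces. A small sketch on an illustrative string (not taken from the notebook run) contrasts `split(' ')` with `split()`:

```python
sample = "he found\nhimself transformed. "

# Splitting on a single space keeps the newline inside a token and
# yields an empty string for the trailing space.
by_space = sample.split(' ')

# With no argument, split() treats any run of whitespace as one
# separator and never returns empty strings.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```

For wrapped text such as this book, the no-argument form is usually the one that matches the intent of "split by whitespace".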


### Select Words

In [8]:
import re
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
words = re.split(r'\W+', text)
print(words[:20])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin']
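
Splitting on `\W+` has side effects worth knowing about: apostrophes and hyphens also act as separators, and a string that starts or ends with punctuation produces empty strings. The sketch below uses an illustrative sentence and also shows `re.findall(r'\w+', ...)` as one possible way to avoid the empty strings:

```python
import re

# Illustrative sentence containing a contraction and a hyphenated word.
sample = '"What\'s happened to me?" he thought, looking at his armour-like back.'

# Every run of non-word characters is a separator, so "What's" and
# "armour-like" are each broken in two, and the leading quote and
# trailing full stop leave empty strings at both ends.
split_words = re.split(r'\W+', sample)

# Matching the words instead of the gaps gives the same tokens
# without the empty strings.
found_words = re.findall(r'\w+', sample)

print(split_words)
print(found_words)
```

Note that `\w` also matches digits and underscores, so this selects alphanumeric strings rather than purely alphabetic ones.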


### Split by Whitespace and Remove Punctuation

In [9]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


We can use the function maketrans() to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example:

In [10]:
table = str.maketrans('', '', string.punctuation)
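
To see what this table does, it can be applied to a few individual strings. The inputs below are illustrative; the last one is included to show that `string.punctuation` covers ASCII punctuation only, so typographic (curly) quotes are left in place:

```python
import string

# Map every ASCII punctuation character to None, i.e. delete it.
table = str.maketrans('', '', string.punctuation)

comma_removed = "dreams,".translate(table)
hyphen_removed = "armour-like".translate(table)
apostrophe_removed = "What's".translate(table)
curly_quotes_kept = "“vermin.”".translate(table)

print(comma_removed, hyphen_removed, apostrophe_removed, curly_quotes_kept)
```

If the text contains non-ASCII punctuation, those characters would need to be added to the third argument explicitly.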

In [11]:
import string
# load text
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
# split into words with white space
words = text.split()
# remove punctuation from each word
table = str.maketrans('', '', string.punctuation)
stripped = [word.translate(table) for word in words]
print(stripped[:20])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin']


### Normalizing Case

In [12]:
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
words = text.split()
lowercased = [word.lower() for word in words]
print(lowercased[:20])

['one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.']


### Note:

Remember, simple is better. Start simple; if it does not perform as you expect, you can make things more complex later to see whether that performs better.

---

## Tokenization and Cleaning with NLTK

### Install NLTK

In [14]:
# install first if needed, e.g. with `pip install nltk`
import nltk

### Split into Sentences

In [16]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [17]:
# load text
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.


### Split into Words

In [19]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:5])

['One', 'morning', ',', 'when', 'Gregor']


### Filter Out Punctuation

In [20]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
words = [word for word in tokens if word.isalpha()]
print(words[:20])

['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin']
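
Filtering with `isalpha()` discards any token that contains a non-letter, so the order of the cleaning steps matters. The sketch below uses a hand-written token list, assumed for illustration to resemble tokenizer output, and compares filtering directly with stripping punctuation first:

```python
import string

# Hand-written tokens (an assumption for illustration, not real tokenizer output).
tokens = ['his', 'armour-like', 'back', ',', 'What', "'s", 'happened', '?']

# Filtering directly drops 'armour-like' and "'s" along with the punctuation.
alpha_only = [t for t in tokens if t.isalpha()]

# Stripping punctuation first keeps the letters of those tokens;
# tokens that become empty strings are then removed by isalpha().
table = str.maketrans('', '', string.punctuation)
stripped = [t.translate(table) for t in tokens]
stripped_alpha = [t for t in stripped if t.isalpha()]

print(alpha_only)
print(stripped_alpha)
```

Neither result is right in general: the first loses hyphenated words, the second keeps fragments such as a lone `s`.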


### Filter out Stop Words (and Pipeline)

In [22]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [23]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Text preparation pipeline:
- Load raw text
- Split into tokens
- Convert to lowercase
- Remove punctuation from each token
- Filter out remaining tokens that are not alphabetic
- Filter out tokens that are stop words

In [25]:
# imports
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# load text
filename = 'metamorphosis.txt'
with open(filename, 'rt') as file:
    text = file.read()
# tokenization
tokens = word_tokenize(text)
# lowercasing
tokens = [token.lower() for token in tokens]
# remove punctuation
table = str.maketrans('', '', string.punctuation)
tokens = [token.translate(table) for token in tokens]
# filter not alphabetic
tokens = [token for token in tokens if token.isalpha()]
# filter stopwords
# a set makes each membership test fast instead of scanning a list
stop_words = set(stopwords.words('english'))
tokens = [token for token in tokens if token not in stop_words]
print(tokens[:20])

['one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see']
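
The same steps can be wrapped in a reusable function. The sketch below is a dependency-free approximation, not the notebook's pipeline: it assumes a whitespace split in place of `word_tokenize` and a tiny hand-picked stop-word set in place of NLTK's English list. The names `clean_tokens`, `STOP_WORDS`, and `PUNCT_TABLE` are hypothetical:

```python
import string

# Assumption: a tiny illustrative subset of stop words, not NLTK's full list.
STOP_WORDS = {'when', 'from', 'he', 'himself', 'in', 'his', 'into', 'a'}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, keep alphabetic tokens, drop stop words."""
    # A whitespace split stands in for word_tokenize in this sketch.
    tokens = text.split()
    tokens = [t.lower().translate(PUNCT_TABLE) for t in tokens]
    return [t for t in tokens if t.isalpha() and t not in stop_words]

sample = ("One morning, when Gregor Samsa woke from troubled dreams, he found\n"
          "himself transformed in his bed into a horrible vermin.")
cleaned = clean_tokens(sample)
print(cleaned)
```

Passing the stop words as a parameter makes it easy to swap in the NLTK list or a task-specific one.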


### Stem Words

Stemming refers to the process of reducing each word to its root or base.

There are many stemming algorithms; a popular one is the Porter stemming algorithm, available in NLTK as the `PorterStemmer` class.

In [26]:
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
tokens = word_tokenize(text)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(token) for token in tokens]
print(stemmed[:20])

['one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a']
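
To make the idea of stemming concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm, which applies a much larger set of ordered rules; the function name `naive_stem`, the suffix list, and the minimum stem length are all assumptions chosen for illustration:

```python
def naive_stem(word, suffixes=('ing', 'ed', 's'), min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

words = ['morning', 'troubled', 'dreams', 'transformed', 'his', 'bed']
stems = [naive_stem(w) for w in words]
print(stems)
```

As with the Porter output above, the stems need not be dictionary words (`troubl`, `morn`). The two approaches still differ in their rules: this sketch leaves `his` alone because of the length check, while the Porter stemmer output above reduces it to `hi`.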
