<a href="https://colab.research.google.com/github/gcosma/COP509/blob/main/Tutorial1NLPcleanText.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **How to Clean Text for Machine Learning with Python**

**Original Source:** Jason Brownlee, [How to Clean Text for Machine Learning with Python](https://machinelearningmastery.com/clean-text-machine-learning-python/), Available from [here](https://machinelearningmastery.com/clean-text-machine-learning-python/), accessed December 13, 2021.

In this tutorial, you will discover how you can clean and prepare your text ready for modeling. After completing this tutorial, you will know:

- How to get started by developing your own very simple text cleaning tools.
- How to take a step up and use the more sophisticated methods in the NLTK library.
- How to prepare text when using modern text representation methods like word embeddings.

**Tutorial Overview**

This tutorial is divided into 6 parts; they are:

- Metamorphosis by Franz Kafka
- Text Cleaning is Task Specific
- Manual Tokenization
- Tokenization and Cleaning with NLTK
- Additional Text Cleaning Considerations
- Tips for Cleaning Text for Word Embedding

**"What are the reasons for examining text data before beginning the cleaning process in natural language processing tasks?"**

**Understanding the Data:** Before any preprocessing, it's important to understand the nature of the text you're working with. This includes identifying the language, the presence of special characters, and any anomalies such as non-text elements (images, tables, etc.). Understanding these aspects can significantly influence how you approach cleaning and preprocessing the text.

**Identifying Noise:** Not all elements in a text are useful for every analysis or machine learning task. By examining the text, you can identify what constitutes noise (e.g., irrelevant symbols, formatting characters, or specific types of information that are not useful for the task at hand). This helps in designing a cleaning process that removes or retains the right elements.

**Preserving Meaningful Information:** Some elements that may initially appear as noise could hold meaningful information. For example, emojis in social media text can convey sentiment, and punctuation can affect the meaning of sentences. A preliminary review helps in deciding which elements are crucial for maintaining the intended meaning of the text.

**Customizing Cleaning Steps:** Text data can vary widely across sources and applications, necessitating different cleaning approaches. For instance, literary texts might require preserving stylistic elements like capitalization and punctuation, while user-generated content on social media might require specialized handling of slang, abbreviations, and emojis. Pre-analysis ensures that the cleaning process is tailored to the specific characteristics of the text.

**Efficiency and Effectiveness:** By understanding the text's structure and content beforehand, you can choose the most efficient tools and methods for cleaning and preprocessing, avoiding unnecessary steps that don't contribute to your analysis or model's performance. This saves time and computational resources and can lead to more accurate outcomes.

**Data Integrity and Quality:** Proper initial examination helps maintain the integrity and quality of the data. It ensures that the cleaning process does not inadvertently remove or alter information that is essential for analysis, preserving the richness and nuances of the original text.
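A quick way to act on the points above is to profile the text before writing any cleaning code. The following is a minimal sketch, not a definitive tool: the function name `profile_text` and the choice of statistics are assumptions made for illustration.

```python
from collections import Counter
import string

def profile_text(text):
    """Return a few simple statistics that help decide how to clean a text."""
    lines = text.splitlines()
    # Characters outside the ASCII range (curly quotes, BOM, accents, emojis, ...)
    non_ascii = Counter(ch for ch in text if ord(ch) > 127)
    # ASCII punctuation marks that occur in the text
    punctuation = Counter(ch for ch in text if ch in string.punctuation)
    return {
        "n_chars": len(text),
        "n_lines": len(lines),
        "max_line_length": max((len(line) for line in lines), default=0),
        "contains_digits": any(ch.isdigit() for ch in text),
        "non_ascii": non_ascii.most_common(10),
        "punctuation": punctuation.most_common(10),
    }

# Short sample in the style of the text used in this tutorial
sample = "“What’s happened to me?” he thought.\nIt wasn’t a dream."
report = profile_text(sample)
for key, value in report.items():
    print(key, value)
```

A report like this shows at a glance whether the text contains curly quotes, digits, or hard-wrapped lines, which in turn suggests which cleaning steps are worth writing.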

**Metamorphosis by Franz Kafka download and save instructions**

In this tutorial, you will use the text from the book Metamorphosis by Franz Kafka.

The full text for Metamorphosis is available for free from Project Gutenberg.
[Metamorphosis by Franz Kafka on Project Gutenberg](https://www.gutenberg.org/ebooks/5200)

Plain text version: [Metamorphosis by Franz Kafka Plain Text UTF-8 (may need to load the page twice)](https://www.gutenberg.org/cache/epub/5200/pg5200.txt).

1. Download the file (or right-click the plain text link and choose "save as").
2. Place it in your current working directory with the file name “metamorphosis.txt“.
3. Open the file and delete the header and footer information (specifically copyright and license information) and save the file as “metamorphosis_clean.txt“.

The start of the clean file should look like:

One morning, when Gregor ....

The file should end with: And, as if in confirmation of their new dreams and good intentions, a... Poor Gregor…
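If you prefer not to delete the header and footer by hand, the step can be sketched in code. This sketch assumes that the downloaded file marks the body with lines beginning `*** START OF` and `*** END OF`; check your copy first, because the exact marker wording is an assumption here. The function name `strip_gutenberg_boilerplate` is hypothetical.

```python
def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END marker lines, if both exist."""
    lines = raw_text.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # body begins after the marker line
        elif line.startswith("*** END OF"):
            end = i                # body stops before the marker line
            break
    if start is None or end is None:
        # Markers not found: return the input unchanged rather than guess
        return raw_text
    return "\n".join(lines[start:end]).strip()

# Miniature stand-in for a downloaded file
raw = (
    "Header and license text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ***\n"
    "One morning, when Gregor ...\n"
    "... Poor Gregor\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ***\n"
    "Footer and license text\n"
)
body = strip_gutenberg_boilerplate(raw)
print(body)
```

It is still worth opening the result and comparing its first and last lines with the expected ones given above.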

**Text Cleaning Is Task Specific**

Take a moment to look at the text. What do you notice?
Here are some observations:

- It’s plain text so there is no markup to parse (yay!).
- The translation of the original German uses UK English (e.g. “travelling“).
- The lines are artificially wrapped with new lines at about 70 characters (meh).
- There are no obvious typos or spelling mistakes.
- There’s punctuation like commas, apostrophes, quotes, question marks, and more.
- There are hyphenated descriptions like “armour-like”.
- The em dash (“-“) is used a lot to continue sentences (maybe replace with commas?).
- There are names (e.g. “Mr. Samsa“).
- There do not appear to be any numbers that require handling (e.g. 1999).
- There are section markers (e.g. “II” and “III”), and we have removed the first “I”.

We are going to look at general text cleaning steps in this tutorial.

Nevertheless, consider some possible task objectives we may have when working with this text document.

For example:

- If we were interested in developing a Kafkaesque language model, we may want to keep all of the case, quotes, and other punctuation in place.
- If we were interested in classifying documents as “Kafka” and “Not Kafka,” maybe we would want to strip case, punctuation, and even trim words back to their stem.

Use your task as the lens by which to choose how to ready your text data.

**Manual Tokenization**

Text cleaning is a difficult task, but the Metamorphosis text is quite clean already.

We could just write some Python code to clean it up manually, and this is a good exercise for those simple problems that you encounter. Tools like regular expressions and splitting strings can get you a long way.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!ls "/content/drive/My Drive/Colab Notebooks/21COP509/LabDatasets/"
Data_path = "/content/drive/My Drive/Colab Notebooks/21COP509/LabDatasets/"

ArtsRatings_5000_test.txt   ArtsReviews_5000_train.txt	Reduced_ArtsRatings_5000.txt
ArtsRatings_5000_train.txt  glove.6B.100d.txt		Reduced_ArtsReviews_5000.txt
ArtsReviews_5000_test.txt   metamorphosis_clean.txt	review_polarity


# **1. Load Data**

Let’s load the text data so that we can work with it.

The text is small and will load quickly and easily fit into memory. This will not always be the case and you may need to write code to memory map the file. Tools like NLTK (covered in the next section) will make working with large files much easier.

We can load the entire “metamorphosis_clean.txt” into memory as follows:

In [None]:
#Load text
file = open(Data_path + "metamorphosis_clean.txt",'rt')

text = file.read()
file.close()

Running the example loads the whole file into memory ready to work with.
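When a file is too large to load in one go, one simple alternative is to read it line by line. The sketch below counts whitespace-separated tokens this way. The helper name `count_words_streaming` and the temporary demo file are assumptions so that the example runs anywhere; in the notebook you would pass the path to your own file.

```python
import os
import tempfile
from collections import Counter

def count_words_streaming(path, encoding="utf-8"):
    """Count whitespace-separated tokens while reading one line at a time."""
    counts = Counter()
    # The with-block closes the file even if an error occurs
    with open(path, "rt", encoding=encoding) as handle:
        for line in handle:          # only one line is held in memory at a time
            counts.update(line.split())
    return counts

# Tiny stand-in file so that the sketch runs anywhere
with tempfile.NamedTemporaryFile("wt", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("One morning, when Gregor Samsa woke\nfrom troubled dreams, he found\n")
    demo_path = tmp.name

word_counts = count_words_streaming(demo_path)
os.remove(demo_path)
print(sum(word_counts.values()), "tokens")
```

The `with` statement also replaces the explicit `file.close()` call, which is the more common idiom in current Python code.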

# **2. Split by Whitespace**

Clean text often means a list of words or tokens that we can work with in our machine learning models. This means converting the raw text into a list of words and saving it again.

A very simple way to do this would be to split the document by white space, including ” “, new lines, tabs and more. We can do this in Python with the split() function on the loaded string.

In [None]:
# split into words by white space
words = text.split()
print(words[:100])

['\ufeffMetamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'One', 'morning,', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'He', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'His', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '“What’s', 'happened', 'to', 'me?”', 'he', 'thought.', 'It']


Running the example splits the document into a long list of words and prints the first 100 for us to review.

We can see that punctuation inside words is preserved (e.g. “What’s” and “armour-like“), which is nice. We can also see that end-of-sentence punctuation is kept with the last word (e.g. “thought.”), which is not great.
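The first token in the output, `'\ufeffMetamorphosis'`, also carries a byte order mark (BOM, the character U+FEFF) left over from the file encoding. A minimal sketch of one way to drop it, using Python's `utf-8-sig` codec on an in-memory stand-in for the file contents:

```python
import codecs

# Stand-in for the raw bytes of a UTF-8 file that starts with a BOM
raw_bytes = codecs.BOM_UTF8 + "Metamorphosis by Franz Kafka".encode("utf-8")

# Plain UTF-8 decoding keeps the byte order mark as the character U+FEFF
with_bom = raw_bytes.decode("utf-8").split()
# The "utf-8-sig" codec drops a leading BOM if there is one
without_bom = raw_bytes.decode("utf-8-sig").split()

print(with_bom[0] == "\ufeffMetamorphosis")
print(without_bom[0] == "Metamorphosis")
```

When reading from disk, the same effect can be had by passing `encoding="utf-8-sig"` to `open()`.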

# **3. Select Words**

Another approach might be to use the regex module (re) and split the document into words by selecting for strings of alphanumeric characters (a-z, A-Z, 0-9 and ‘_’).

In [None]:
# split based on words only
import re
words = re.split(r'\W+', text)
print(words[:100])

['', 'Metamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armour', 'like', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 's', 'happened', 'to', 'me']


Again, running the example we can see that we get our list of words. This time, “armour-like” is now two words, “armour” and “like” (fine), but contractions like “What’s” are also split into two words, “What” and “s” (not great). The empty string at the start of the list appears because the text begins with a non-word character.
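One possible refinement is to select words with a pattern that allows hyphens and apostrophes inside a word, rather than splitting on every non-word character. This is a sketch of one such pattern, not the only reasonable choice; the name `WORD_PATTERN` is an assumption.

```python
import re

# One or more word characters, optionally followed by groups made of a
# hyphen or apostrophe (straight or curly) plus more word characters.
WORD_PATTERN = re.compile(r"\w+(?:[-'’]\w+)*")

sample = "He lay on his armour-like back. “What’s happened to me?” he thought."
words = WORD_PATTERN.findall(sample)
print(words)
```

With this pattern, “armour-like” and “What’s” each stay in one piece, while sentence punctuation and the surrounding quotes are left out.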



# **4. Split by Whitespace and Remove Punctuation**

Note: This example was written for Python 3.

We may want the words, but without the punctuation like commas and quotes. We also want to keep contractions together.

One way would be to split the document into words by white space (as in “2. Split by Whitespace“), then use string translation to replace all punctuation with nothing (i.e. remove it).

Python provides a constant called string.punctuation that provides a great list of punctuation characters. For example:

In [None]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


Python offers a function called translate() that will map one set of characters to another.

We can use the function maketrans() to create a mapping table. We can create an empty mapping table, but the third argument of this function allows us to list all of the characters to remove during the translation process. For example:

In [None]:
table = str.maketrans('', '', string.punctuation)

We can put all of this together, load the text file, split it into words by white space, then translate each word to remove the punctuation.

In [None]:
# load text
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in words]
print(stripped[:100])

['\ufeffMetamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', '“What’s', 'happened', 'to', 'me”', 'he', 'thought', 'It']


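We can see that this has had the desired effect, mostly.

ASCII punctuation such as commas and full stops has been removed, and “armour-like” has become “armourlike“. However, string.punctuation only contains ASCII characters, so the curly quotes and apostrophes in this file survive: “What’s” is unchanged in the output above. In a copy of the text that uses straight ASCII quotes, contractions like “What's” become “Whats”, as in the following listing: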

In [None]:
['One', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'Whats', 'happened', 'to', 'me', 'he', 'thought', 'It', 'wasnt', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human']
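To remove typographic punctuation as well, the translation table can be extended beyond `string.punctuation`. The sketch below adds a small, non-exhaustive selection of characters; which characters to include is an assumption that depends on your text, and `EXTRA_CHARS` is a hypothetical name.

```python
import string

# Curly double and single quotes, then em dash, en dash, ellipsis, and the
# byte order mark, written as escapes for readability
EXTRA_CHARS = "“”‘’\u2014\u2013\u2026\ufeff"
table = str.maketrans("", "", string.punctuation + EXTRA_CHARS)

words = ["\ufeffMetamorphosis", "“What’s", "happened", "to", "me?”", "armour-like"]
stripped = [w.translate(table) for w in words]
print(stripped)
```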


If you know anything about regex, then you know things can get complex from here.

# **5. Normalizing Case**

It is common to convert all words to one case.

This means that the vocabulary will shrink in size, but some distinctions are lost (e.g. “Apple” the company vs “apple” the fruit is a commonly used example).

We can convert all words to lowercase by calling the lower() function on each word.

For example:

In [None]:
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into words by white space
words = text.split()
# convert to lower case
words = [word.lower() for word in words]
print(words[:100])

['\ufeffmetamorphosis', 'by', 'franz', 'kafka', 'translated', 'by', 'david', 'wyllie', 'one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him,', 'waved', 'about', 'helplessly', 'as', 'he', 'looked.', '“what’s', 'happened', 'to', 'me?”', 'he', 'thought.', 'it']
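The trade-off described above can be made concrete by comparing vocabulary sizes before and after lowercasing. The sentence below is a made-up sample chosen to include the “Apple” versus “apple” case; it is not taken from the book.

```python
words = ("The bedding was hardly able to cover it and the Apple "
         "fell near the apple tree").split()

vocab_original = set(words)
vocab_lower = {w.lower() for w in words}

print(len(vocab_original), "distinct tokens before lowercasing")
print(len(vocab_lower), "distinct tokens after lowercasing")

# Capitalised forms that collapse into a lowercase form
merged = sorted(w for w in vocab_original if w != w.lower())
print(merged)
```

Whether losing the distinction between the capitalised and lowercase forms matters depends on the task.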


**Note**

Cleaning text is really hard, problem specific, and full of tradeoffs.

Remember, simple is better.

Simpler text data, simpler models, smaller vocabularies. You can always make things more complex later to see if it results in better model skill.

Next, we’ll look at some of the tools in the NLTK library that offer more than simple string splitting.

# **Tokenization and Cleaning with NLTK**

The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text.

It provides good tools for loading and cleaning text that we can use to get our data ready for working with machine learning and deep learning algorithms.

# **1. Install NLTK**

You can install NLTK using your favorite package manager, such as pip:

In [None]:
!pip install -U nltk



After installing the NLTK package, install the datasets/models that specific functions need in order to work.

If you’re unsure of which datasets/models you’ll need, you can install the “popular” subset of NLTK data. On the command line, type `python -m nltk.downloader popular`; in the Python interpreter, run `import nltk; nltk.download('popular')`.

For details, see https://www.nltk.org/data.html

In [None]:
import nltk; nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

Or from the command line:

`#python -m nltk.downloader all`

# **2. Split into Sentences**
Some modeling tasks, such as word2vec, prefer input to be in the form of paragraphs or sentences. You could first split your text into sentences, split each sentence into words, then save each sentence to file, one per line.

NLTK provides the sent_tokenize() function to split text into sentences.

The example below loads the “metamorphosis_clean.txt” file into memory, splits it into sentences, and prints the first sentence.

In [None]:
import nltk
nltk.download('popular')
nltk.download('punkt')
nltk.download('punkt_tab')


In [None]:
#from google.colab import drive
#drive.mount('/content/drive')
#!ls "/content/drive/My Drive/Colab Notebooks"
Data_path = "/content/drive/My Drive/Colab Notebooks/21COP509/LabDatasets/"

In [None]:
# load data
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

﻿Metamorphosis

by Franz Kafka

Translated by David Wyllie

One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin.


Running the example, we can see that although the document is split into sentences, each sentence still preserves the new line from the artificial wrap of the lines in the original document.

"One morning, when Gregor Samsa woke from troubled dreams, he found
himself transformed in his bed into a horrible vermin."
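The leftover line breaks can be removed by collapsing all whitespace inside each sentence. In this sketch the `sentences` list is a shortened stand-in for the output of sent_tokenize(), and `io.StringIO` stands in for a real output file; the helper name `unwrap` is an assumption.

```python
import io

def unwrap(sentence):
    """Collapse line breaks and repeated spaces into single spaces."""
    return " ".join(sentence.split())

# Stand-in for the list returned by sent_tokenize()
sentences = [
    "One morning, when Gregor Samsa woke from troubled dreams, he found\n"
    "himself transformed in his bed into a horrible vermin.",
    "He lay on his\narmour-like back.",
]
flat = [unwrap(s) for s in sentences]

# Write one sentence per line; io.StringIO stands in for a real output file
buffer = io.StringIO()
buffer.write("\n".join(flat) + "\n")
print(buffer.getvalue())
```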

# **3. Split into Words**

NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).

It splits tokens based on white space and punctuation. For example, commas and periods are taken as separate tokens. Contractions are split apart (e.g. “What’s” becomes “What” “‘s“). Quotes are kept, and so on.

For example:

In [None]:
# load data
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

['\ufeffMetamorphosis', 'by', 'Franz', 'Kafka', 'Translated', 'by', 'David', 'Wyllie', 'One', 'morning', ',', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', ',', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', '.', 'He', 'lay', 'on', 'his', 'armour-like', 'back', ',', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', ',', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', '.', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', '.', 'His', 'many', 'legs', ',', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'waved', 'about', 'helplessly', 'as']


Running the code, we can see that punctuation marks are now separate tokens that we could then decide to specifically filter out.
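One simple way to filter them out is to keep only the tokens that are entirely alphabetic. In this sketch the `tokens` list is a hand-written stand-in for the output of word_tokenize(), so that the example runs without NLTK.

```python
# Stand-in for tokens returned by word_tokenize()
tokens = ["One", "morning", ",", "when", "Gregor", "Samsa", "woke", "from",
          "troubled", "dreams", ",", "he", "found", "himself", "transformed",
          "in", "his", "bed", "into", "a", "horrible", "vermin", ".",
          "armour-like"]

# str.isalpha() is True only if every character in the token is a letter
words = [t for t in tokens if t.isalpha()]
print(words)
```

Note the side effect: hyphenated tokens such as “armour-like” are dropped too, because the hyphen is not a letter.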

# **5. Filter out Stop Words (and Pipeline)**

Stop words are those words that do not contribute to the deeper meaning of the phrase.

They are the most common words such as: “the“, “a“, and “is“.

For some applications like document classification, it may make sense to remove stop words.

NLTK provides a list of commonly agreed upon stop words for a variety of languages, such as English. They can be loaded as follows:

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

You can see that they are all lower case. Note that, in the list printed above, contractions such as “you're” and “she's” keep their apostrophes.

You could compare your tokens to the stop words and filter them out, but you must ensure that your text is prepared the same way.

**Let’s demonstrate this with a small pipeline of text preparation including:**
1. Load the raw text.
2. Split into tokens.
3. Convert to lowercase.
4. Remove punctuation from each token.
5. Filter out remaining tokens that are not alphabetic.
6. Filter out tokens that are stop words.

In [None]:
# load data
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

['franz', 'kafka', 'translated', 'david', 'wyllie', 'one', 'morning', 'gregor', 'samsa', 'woke', 'troubled', 'dreams', 'found', 'transformed', 'bed', 'horrible', 'vermin', 'lay', 'armourlike', 'back', 'lifted', 'head', 'little', 'could', 'see', 'brown', 'belly', 'slightly', 'domed', 'divided', 'arches', 'stiff', 'sections', 'bedding', 'hardly', 'able', 'cover', 'seemed', 'ready', 'slide', 'moment', 'many', 'legs', 'pitifully', 'thin', 'compared', 'size', 'rest', 'waved', 'helplessly', 'looked', 'happened', 'thought', 'dream', 'room', 'proper', 'human', 'room', 'although', 'little', 'small', 'lay', 'peacefully', 'four', 'familiar', 'walls', 'collection', 'textile', 'samples', 'lay', 'spread', 'travelling', 'hung', 'picture', 'recently', 'cut', 'illustrated', 'magazine', 'housed', 'nice', 'gilded', 'frame', 'showed', 'lady', 'fitted', 'fur', 'hat', 'fur', 'boa', 'sat', 'upright', 'raising', 'heavy', 'fur', 'muff', 'covered', 'whole', 'lower', 'arm', 'towards']


Running this example, we can see that in addition to all of the other transforms, stop words like “a” and “to” have been removed.

I note that we are still left with tokens like “nt“. The rabbit hole is deep; there’s always more we can do.
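The pipeline can also be wrapped in a reusable function. The sketch below is a variation, not the pipeline above verbatim: it checks the stop list before stripping punctuation, so that stop words containing apostrophes can still match. The function name `clean_tokens` and the miniature stop list are assumptions; in the tutorial the stop words come from NLTK.

```python
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tokens(tokens, stop_words):
    """Lowercase, drop stop words, strip ASCII punctuation, keep alphabetic tokens."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        # Compare against the stop list before stripping punctuation, so that
        # entries with apostrophes such as "you're" can still match
        if token in stop_words:
            continue
        token = token.translate(PUNCT_TABLE)
        if token.isalpha() and token not in stop_words:
            cleaned.append(token)
    return cleaned

# Hypothetical miniature stop list; in the tutorial it comes from NLTK
stop_words = {"he", "his", "on", "and", "if", "a", "you're"}
tokens = ["He", "lay", "on", "his", "armour-like", "back", ",", "and", "if",
          "You're", "a", "vermin", "."]
print(clean_tokens(tokens, stop_words))
```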

# **6. Stem Words**

Stemming refers to the process of reducing each word to its root or base.

For example “fishing,” “fished,” “fisher” all reduce to the stem “fish.”

Some applications, like document classification, may benefit from stemming in order to both reduce the vocabulary and to focus on the sense or sentiment of a document rather than deeper meaning.

There are many stemming algorithms, although a popular and long-standing method is the Porter Stemming algorithm. This method is available in NLTK via the PorterStemmer class.

For example:

In [None]:
# load data
file = open(Data_path + "metamorphosis_clean.txt",'rt')
text = file.read()
file.close()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

['\ufeffmetamorphosi', 'by', 'franz', 'kafka', 'translat', 'by', 'david', 'wylli', 'one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'he', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as']


Running the example, you can see that words have been reduced to their stems; for example, “troubled” has become “troubl“. You can also see that the stemming implementation has reduced the tokens to lowercase, likely for internal look-ups in word tables.

There is a nice suite of stemming and lemmatization algorithms to choose from in NLTK, if reducing words to their root is something you need for your project.
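To get a feel for what a stemmer does internally, here is a deliberately naive suffix stripper. It is not the Porter algorithm and is no substitute for the NLTK stemmers; the suffix list and the minimum stem length are assumptions chosen for illustration only.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # checked in this order

def naive_stem(word, min_stem_length=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "Troubled", "dreams", "his", "was"]:
    print(w, "->", naive_stem(w))
```

The minimum stem length is one crude guard against over-stemming short words; the Porter output above, where “his” became “hi” and “was” became “wa”, shows that even established stemmers make trade-offs of this kind.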

# **Additional Text Cleaning Considerations**
We are only getting started.

Because the source text for this tutorial was reasonably clean to begin with, we skipped many concerns of text cleaning that you may need to deal with in your own project.

Here is a short list of additional considerations when cleaning text:

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text to Unicode and converting it to a normalized form (such as NFC), stored in a consistent encoding such as UTF-8.
- Handling of domain specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
…
The list could go on.

Ideally, you would save a new file after each transform so that you can spend time with all of the data in the new form. Things always jump out at you when you take the time to review your data.
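Two of the considerations listed above, Unicode normalization and number handling, can be sketched with the standard library alone. The placeholder `<num>` and the function name `normalize_text` are assumptions made for this example, and the sample sentence is invented.

```python
import re
import unicodedata

def normalize_text(text):
    """Apply NFKC normalization and replace digit runs with a placeholder."""
    # NFKC folds compatibility characters, e.g. the ligature "ﬁ" becomes "fi"
    text = unicodedata.normalize("NFKC", text)
    # "<num>" is a hypothetical placeholder token chosen for this sketch
    return re.sub(r"\d+(?:[.,]\d+)*", "<num>", text)

print(normalize_text("The ﬁnal bill for room 12 was 1,250.50."))
```

Whether to replace, keep, or drop numbers is again task specific; dates and amounts may carry exactly the information your model needs.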

# **Tips for Cleaning Text for Word Embedding**

Recently, the field of natural language processing has been moving away from bag-of-words models and word encoding toward word embeddings.

The benefit of word embeddings is that they encode each word into a dense vector that captures something about its relative meaning within the training text.

This means that variations of words like case, spelling, punctuation, and so on will automatically be learned to be similar in the embedding space. In turn, this can mean that the amount of cleaning required from your text may be less and perhaps quite different to classical text cleaning.

For example, it may no longer make sense to stem words or remove punctuation for contractions.

Tomas Mikolov is one of the developers of word2vec, a popular word embedding method. He suggests only very minimal text cleaning is required when learning a word embedding model.

Below is his response when pressed with the question about how to best prepare text data for word2vec.

*There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with some single token such as.
All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not be true when you lowercase certain words, ie. ‘Bush’ is different than ‘bush’, while ‘Another’ has usually the same sense as ‘another’). The smaller the vocabulary is, the lower is the memory complexity, and the more robustly are the parameters for the words estimated. You also have to pre-process the test data in the same way.
…
In short, you will understand all this much better if you will run experiments.*
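The minimal preparation described in the quote (detach punctuation from words, optionally lowercase, and map numbers to a single token) might be sketched as follows. The token name `<num>`, the function name `prepare_for_embedding`, and the choice to keep apostrophes and hyphens inside words are all assumptions of this sketch, not part of the quoted advice.

```python
import re

NUMBER_TOKEN = "<num>"  # hypothetical placeholder name chosen for this sketch

def prepare_for_embedding(sentence, lowercase=True):
    """Detach punctuation from words and map numbers to a single token."""
    if lowercase:
        sentence = sentence.lower()
    # A token is a number, a word (with inner apostrophes or hyphens), or one
    # non-space, non-word character, i.e. a detached punctuation mark
    tokens = re.findall(r"\d+(?:[.,]\d+)*|\w+(?:[-'’]\w+)*|[^\w\s]", sentence)
    return [NUMBER_TOKEN if t[0].isdigit() else t for t in tokens]

print(prepare_for_embedding("“What’s happened to me?” he thought in 1915."))
```

As the quote stresses, the same function should be applied to both training and test data, and the `lowercase` switch is worth treating as an experiment rather than a default.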

**Specifically, you learned:**

- How to get started by developing your own very simple text cleaning tools.

- How to take a step up and use the more sophisticated methods in the NLTK library.

- How to prepare text when using modern text representation methods like word embeddings.


