# <font color="red">Cleaning Text Data</font>
This notebook draws mainly on the following posts:

[Jason Brownlee's post (How to Clean Text for Machine Learning with Python)](https://machinelearningmastery.com/clean-text-machine-learning-python/)

[Kendall Fortney's post (Pre-Processing in Natural Language Machine Learning)](https://towardsdatascience.com/pre-processing-in-natural-language-machine-learning-898a84b8bd47)

[Maria Dobko's post (Text Data Cleaning and Preprocessing)](https://medium.com/@dobko_m/nlp-text-data-cleaning-and-preprocessing-ea3ffe0406c1)

## Introduction

Teaching a computer to accurately understand words in context has long been an unsolved problem. The same meaning can be expressed in many different ways, and new words are coined every day.

To tackle this problem, several lines of research are under way, each requiring huge text datasets for its purpose.

Cleaning text-data is a typical pre-processing task for data science and machine learning.

It consists of removing the less useful parts of the text, for example by filtering out stopwords and handling capitalization, special characters and other details.

Today we’re going to clean the text of Kafka’s famous book Metamorphosis, as described in [Jason Brownlee’s post](https://machinelearningmastery.com/clean-text-machine-learning-python/).

Download: [Metamorphosis by Franz Kafka Plain Text UTF-8](http://www.gutenberg.org/ebooks/5200?msg=welcome_stranger).

Open the file, delete the header and footer information, and save the file as “metamorphosis_clean.txt”.

We are using Python 3 for the examples.

 

## <font color="navy">NLTK </font>
### <font color="forestGreen">The Natural Language Toolkit, or NLTK for short, is a Python library written for working with and modeling text.</font>
### <font color="forestGreen">It provides a high-level API to flexibly implement a variety of cleaning methods.</font>

## Step 1. Sneak peek into the data

Take a look at the data: explore its main characteristics, such as size and structure, to see how sentences, paragraphs and the text as a whole are built.

* Understand how much of this data is useful for your needs.
* Review the text to see what exactly might help.

In the case of Metamorphosis:

* There are no obvious typos or spelling mistakes.
* There’s punctuation like commas, apostrophes, quotes, question marks, and more.
* Overall, the text is simple.
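
A first look like this can also be done in code. The sketch below is only an illustration using the standard library: the `sample` string is a short made-up stand-in for the book text (an assumption, not the real file contents), and it counts lines, whitespace-separated words and punctuation marks.

```python
import string
from collections import Counter

# Made-up sample standing in for the text read from the book file.
sample = (
    "\"What's happened to me?\" he thought. It wasn't a dream.\n"
    "His room, a proper human room, lay peacefully between its four walls."
)

lines = sample.splitlines()
words = sample.split()
# Count every character that Python classifies as ASCII punctuation.
punctuation_counts = Counter(ch for ch in sample if ch in string.punctuation)

print("characters:", len(sample))
print("lines:", len(lines))
print("whitespace-separated words:", len(words))
print("most common punctuation:", punctuation_counts.most_common(3))
```

Running the same counts on the full text gives a quick feel for how much punctuation the later steps will have to deal with.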

## Step 2. Whitespace/Punctuation/Normalize Case

### 1. Install NLTK

```shell
pip install -U nltk
python -m nltk.downloader all
```
* Usage
```python
import nltk
```

### 2. Tokenization
#### Split into Words

In [None]:
!ls  ../input

In [None]:
# load data
filename = '../input/metamorphosis_clean.txt'
# the context manager closes the file even if reading fails
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

<font color="red">tokens = word_tokenize(text)</font>

<font color="black">It splits the text into words much like Python's built-in split(), but it also separates punctuation and contractions into their own tokens.</font>
We can see  <font color="red">'looked', '.', '``', 'What', "'s"</font> in this result.
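
For comparison, here is roughly what the plain Python alternatives do. This is a small sketch on a made-up sentence (not read from the book file), using only `str.split()` and the `re` module.

```python
import re

# Made-up sample sentence; not read from the book file.
sample = 'He looked. "What\'s happened to me?" he thought.'

# Plain whitespace split: punctuation stays glued to the words.
whitespace_tokens = sample.split()

# Regex split on non-word characters: punctuation is dropped,
# but the contraction "What's" is broken into "What" and "s".
regex_tokens = [t for t in re.split(r"\W+", sample) if t]

print(whitespace_tokens)
print(regex_tokens)
```

Neither variant keeps punctuation as separate tokens, which is one reason to prefer a real tokenizer when punctuation matters later on.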

You may not notice a big difference just by looking at this, but NLTK also makes it easy to split text into sentences. Let’s replace **word_tokenize** with **sent_tokenize**:

In [None]:
from nltk import sent_tokenize
sentences = sent_tokenize(text)

Text cleaning serves a variety of purposes, and the flexibility of NLTK lets you focus on the core task rather than on the implementation itself.
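
To see why a dedicated sentence tokenizer is worth having, consider a hand-rolled version. The sketch below is a deliberately naive regex rule applied to a made-up sample (both are assumptions for illustration); it is not how `sent_tokenize` works internally.

```python
import re

# Made-up sample; the abbreviation "Mr." is there on purpose.
sample = "Gregor woke up late. What had happened? Mr. Samsa knocked on the door."

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r"(?<=[.!?])\s+", sample)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule cuts the text after "Mr.", producing four pieces instead of three sentences. Sentence tokenizers such as the one behind `sent_tokenize` are designed to cope better with cases like abbreviations.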

### 3. Filter Out Punctuation

In [None]:
# load data
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])
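
It helps to know exactly what `isalpha()` keeps and drops. The sketch below uses a hand-written token list in the style of `word_tokenize` output (an assumption, not the actual output for the book).

```python
# Hand-written tokens in the style of word_tokenize output (an assumption,
# not the actual output for the book).
tokens = ["What", "'s", "happened", "?", "armour-like", "``", "room", "1915"]

# str.isalpha() is True only when every character is a letter.
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print("kept:   ", kept)
print("dropped:", dropped)
```

Note that hyphenated words, contraction fragments and numbers are dropped along with the punctuation, which may or may not be what you want for a given task.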

## Step 3. Stopwords/Stemming



### 1. Filter out Stopwords and Pipelines
Many of the words in a given text connect parts of a sentence rather than express subjects, objects or intent. Words like “the” or “and” can be removed by comparing the text to a list of stopwords.

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

In [None]:
# load data
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if w not in stop_words]
print(words[:100])

* Filtering with **<font color="red">set(stopwords.words('english'))</font>** easily turns a token sequence starting with<br>
<font color="red">['dreams', 'he', 'found', 'himself', 'transformed']</font> into one starting with <font color="red">['dreams', 'found', 'transformed', 'bed', 'horrible']</font>.
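
The same steps can be wrapped into a small reusable pipeline function. The sketch below is a minimal illustration that runs without NLTK: `STOP_WORDS` is a tiny hand-picked set (an assumption, far shorter than NLTK's English list), and `clean_tokens` is a hypothetical helper name, not part of any library.

```python
import string

# Tiny hand-picked stop word set for illustration only; NLTK's English
# list is much longer.
STOP_WORDS = {"he", "himself", "in", "his", "into", "a", "from", "the", "and"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, keep alphabetic non-stop-words."""
    lowered = (t.lower() for t in tokens)
    stripped = (t.translate(PUNCT_TABLE) for t in lowered)
    return [t for t in stripped if t.isalpha() and t not in stop_words]


# Hypothetical pre-tokenized input.
tokens = ["Dreams", ",", "he", "found", "himself", "transformed", "in", "his", "bed", "!"]
print(clean_tokens(tokens))
```

Passing `set(stopwords.words('english'))` as `stop_words` would plug NLTK's list into the same function.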

### 2. Stemming
Stemming is a process in which words are reduced to a root by removing inflection, usually by dropping a suffix. There are several stemming algorithms, including Porter and Snowball. But there is a danger of “over-stemming”, where words like “universe” and “university” are reduced to the same root, “univers”.
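
To make the idea of suffix stripping and over-stemming concrete, here is a toy stemmer. It is a deliberately naive sketch: the suffix list and the `naive_stem` helper are made up for illustration and are much cruder than the Porter or Snowball algorithms.

```python
# Made-up suffix list for illustration; real stemmers use far more rules.
SUFFIXES = ("ing", "ity", "ed", "es", "e", "s")


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["morning", "troubled", "universe", "university"]:
    print(w, "->", naive_stem(w))
```

Even this crude rule set maps "universe" and "university" to the same root, which shows how easily unrelated words collapse when suffixes are dropped mechanically.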

In [None]:
# load data
with open(filename, 'rt', encoding='utf-8') as file:
    text = file.read()
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in tokens]
print(stemmed[:100])

After stemming, <font color="red">['morning', 'troubled']</font> are converted to <font color="red">['morn','troubl'] </font>by **<font color="red">PorterStemmer</font>**.

You can also see that the tokens were converted to lowercase.

Finally, we have cleaned text data in which words are reduced to their roots by stemming.

## Step 4. Other tools
### 1. Lemmatization
Lemmatization is another way to remove inflection. By determining the part of speech and using WordNet’s lexical database of English, it can produce better results.

It is more accurate but slower. Stemming may be more useful in queries for databases, whereas lemmatization may work much better when trying to determine text sentiment.
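
The role of the part of speech can be shown with a toy lookup. The sketch below is purely illustrative: `TOY_LEXICON` is a tiny hand-written table and `toy_lemmatize` a hypothetical helper, whereas a real lemmatizer consults WordNet.

```python
# Tiny hand-written lexicon keyed by (word, part of speech). Purely
# illustrative: a real lemmatizer looks words up in WordNet instead.
TOY_LEXICON = {
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
    ("better", "a"): "good",
    ("was", "v"): "be",
}


def toy_lemmatize(word, pos):
    """Return the dictionary form for (word, pos), or the lowercased word."""
    return TOY_LEXICON.get((word.lower(), pos), word.lower())


print(toy_lemmatize("meeting", "v"))  # treated as a verb
print(toy_lemmatize("meeting", "n"))  # treated as a noun
print(toy_lemmatize("better", "a"))   # treated as an adjective
```

The same surface form yields different lemmas depending on its part of speech, which is why lemmatization needs that extra information and a suffix-stripping stemmer does not.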

### 2. Word Embedding/Text Vectors
Word embedding is the modern way of representing words as vectors. The aim of word embeddings is to find a vector for each word such that semantically related words are ‘close together’ in the vector space. Word2Vec and GloVe are among the most common models for converting text to vectors. Often, t-SNE (as well as PCA) is used to reduce the dimensionality enough to display the vectors as a 2- or 3-dimensional graph. Check out this example of t-SNE applied to word embeddings.
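
What ‘close together’ means can be sketched with cosine similarity, a common way to compare word vectors. The vectors below are made up for illustration (an assumption); real embeddings are learned by models such as Word2Vec or GloVe and have many more dimensions.

```python
import math

# Made-up 3-dimensional vectors; real embeddings are learned from data
# and have many more dimensions.
toy_vectors = {
    "room": [0.9, 0.8, 0.1],
    "bedroom": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine_similarity(toy_vectors["room"], toy_vectors["bedroom"])
unrelated = cosine_similarity(toy_vectors["room"], toy_vectors["apple"])
print(f"room vs bedroom: {related:.3f}")
print(f"room vs apple:   {unrelated:.3f}")
```

With these toy numbers, the related pair scores much higher than the unrelated pair, which is the property a trained embedding is meant to capture.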

## Further Reading
Shubham Jain’s post (Ultimate guide to deal with Text Data)

In that article you can learn different feature extraction methods, starting with some basic techniques and leading into more advanced Natural Language Processing techniques, including <code>N-grams, Term Frequency, Inverse Document Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words and Sentiment Analysis</code>.

In addition, if you want to dive deeper, see the video course on NLP (using Python).

 
## Summary
In this notebook, you saw what text cleaning is and how to do it in Python code.

Specifically, you learned it in 4 steps:

* Sneak peek into the data
* Whitespace/Punctuation/Normalize Case
* Stopwords/Stemming
* Other tools

I also hope you now see the advantages of using NLTK over implementing everything manually.

Text can be cleaned in various ways depending on the purpose. The methods I haven’t covered today are worth studying in the Further Reading section.