### Text Cleaning

Data scientists spend most of their time not on modeling but on cleaning and exploring the data. Furthermore, different approaches to text cleaning can lead to very different results during model training.

Cleaning text data is a typical pre-processing task in data science and machine learning. It consists of getting rid of the less useful parts of the text, for example by removing stop words, normalizing capitalization, and handling special characters and other details.

Today we are going to clean the text of Kafka’s famous book Metamorphosis, following the steps described in Jason Brownlee’s post:
https://machinelearningmastery.com/clean-text-machine-learning-python/

#### Take a look at the data: 

- Explore its main characteristics, such as size and structure, to see how the sentences and paragraphs are built.
- Understand how much of this data is useful for your needs.
- Review the text to see what exactly might help.
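
A first look at size and structure can be sketched with plain Python. This is only a minimal sketch: the `sample` string and the `describe_text` helper are hypothetical stand-ins for the book file, and paragraphs are assumed to be separated by blank lines.

```python
# Hypothetical sample text standing in for the book file.
sample = (
    "One morning, when Gregor Samsa woke from troubled dreams, he found "
    "himself transformed in his bed into a horrible vermin.\n"
    "He lay on his armour-like back.\n"
    "\n"
    "\"What's happened to me?\" he thought. It wasn't a dream.\n"
)

def describe_text(text):
    """Return basic size and structure statistics for a text string."""
    lines = text.splitlines()
    # Assumption: paragraphs are separated by a blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    words = text.split()
    return {
        "characters": len(text),
        "lines": len(lines),
        "paragraphs": len(paragraphs),
        "words": len(words),
        "unique_words": len({w.lower() for w in words}),
    }

stats = describe_text(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Running the same helper on the full text gives a quick sense of how much material there is before any cleaning decisions are made.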

In [3]:
#Read the file and load the data
#This loads the whole file into memory ready to work with

filename = r'C:\Users\divyakamat\data\DataSets\nlp\metamorphosis_clean.txt'

with open(filename,'rt') as file:
    text=file.read()

In [4]:
#Next clean text to convert raw text into a list of words - split by whitespace
# We use split method here and the words are stored in a list

words = text.split()
#Convert all words to lower case by calling the lower() function
words = [word.lower() for word in words]
print(words[:70])

['ï»¿one', 'morning,', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams,', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin.', 'he', 'lay', 'on', 'his', 'armour-like', 'back,', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly,', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections.', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment.', 'his', 'many', 'legs,', 'pitifully', 'thin']


- We can see that split() only splits on whitespace, so punctuation stays attached to the words.
- In particular, end-of-sentence punctuation is kept with the last word of the sentence (for example: 'moment.').
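
The `ï»¿` stuck to the first word in the output above looks like a UTF-8 byte order mark (the bytes EF BB BF) that was decoded with a single-byte Windows code page. Assuming the file really is UTF-8 with a BOM, which this notebook does not confirm, the sketch below reproduces the effect on a small temporary file and shows that the `utf-8-sig` codec drops the mark. The file content here is a hypothetical stand-in for the book.

```python
import os
import tempfile

# Write a small hypothetical file that starts with a UTF-8 byte order mark.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"\xef\xbb\xbfOne morning, when Gregor Samsa woke")
    path = tmp.name

# Decoding the BOM bytes with a Windows code page yields stray characters.
with open(path, "rt", encoding="cp1252") as file:
    wrong = file.read()

# 'utf-8-sig' decodes UTF-8 and skips a leading BOM if one is present.
with open(path, "rt", encoding="utf-8-sig") as file:
    right = file.read()

os.remove(path)

print(wrong.split()[0])
print(right.split()[0])
```

Passing an explicit `encoding` to `open()` also makes the result independent of the platform default.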

In [5]:
#Remove punctuation from each word

import string

# create a mapping table
table= str.maketrans('','',string.punctuation)
strip = [ w.translate(table) for w in words]
print(strip[:100])


['ï»¿one', 'morning', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'he', 'lay', 'on', 'his', 'armourlike', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'the', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'his', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'whats', 'happened', 'to', 'me', 'he', 'thought', 'it', 'wasnt', 'a', 'dream', 'his', 'room', 'a', 'proper', 'human']


- Python's string module provides a constant called string.punctuation that lists the ASCII punctuation characters.
- Python strings have a method called translate() that maps one set of characters to another.
- We can use str.maketrans() to create the mapping table. The first two arguments are empty here because we are not replacing anything; the third argument lists all of the characters to remove during the translation process.
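
To make the mapping table less abstract, here is a small sketch on a hand-picked list of words (`sample_words` is a hypothetical example, not taken from the file). It also shows one limitation: `string.punctuation` only covers ASCII, so typographic quotes are left in place.

```python
import string

# The third argument lists characters to delete; the first two are empty
# because no characters are being replaced.
table = str.maketrans("", "", string.punctuation)

# The table maps each punctuation code point to None, which means "delete".
print(table[ord(",")])

sample_words = ["dreams,", "armour-like", "what's", "moment."]
cleaned = [w.translate(table) for w in sample_words]
print(cleaned)

# Curly quotes are not part of string.punctuation, so they survive.
curly = "\u201cdream\u201d"
print(curly.translate(table) == curly)
```

Note that hyphens and apostrophes are removed as well, which is why 'armour-like' becomes 'armourlike' and "what's" becomes 'whats'.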


### Tokenization and Cleaning with NLTK (Natural Language Toolkit)

NLTK pipeline for text processing 
- Load the raw text.
- Split into tokens.
- Convert to lowercase.
- Remove punctuation from each token.
- Filter out remaining tokens that are not alphabetic.
- Filter out tokens that are stop words.
- Stem words (stemming refers to the process of reducing each word to its root or base).
    - For example, “fishing,” “fished,” and “fisher” all reduce to the stem “fish.”

In [6]:
import nltk

In [7]:
nltk.download()

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> q


True

NLTK provides the sent_tokenize() function to split text into sentences. <br>
NLTK provides a function called word_tokenize() for splitting strings into tokens (nominally words).
 - It splits tokens based on white space and punctuation.
 - For example, commas and periods are taken as separate tokens.
 - Contractions are split apart (e.g. “What’s” becomes “What” and “’s”).
 - Quotes are kept as tokens of their own, and so on.
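
For comparison, the idea of treating punctuation as separate tokens can be roughly approximated with the standard library alone. This is a sketch using a regular expression, not the algorithm NLTK uses, and it handles contractions differently (it splits "What's" into three pieces rather than two). The `sample` sentence is a hypothetical example.

```python
import re

sample = "\"What's happened to me?\" he thought. It wasn't a dream."

# \w+ matches runs of letters, digits and underscores; [^\w\s] matches a
# single punctuation character. This is NOT how word_tokenize() works.
rough_tokens = re.findall(r"\w+|[^\w\s]", sample)
print(rough_tokens)

# Keeping only alphabetic tokens mirrors an isalpha() filter.
alpha_only = [t for t in rough_tokens if t.isalpha()]
print(alpha_only)
```

A dedicated tokenizer is usually the better choice for real text; the regular expression is only meant to show what "punctuation becomes its own token" looks like.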

In [8]:
from nltk.tokenize import word_tokenize
#split to words
tokens = word_tokenize(text)

#remove tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])


['ï', 'morning', 'when', 'Gregor', 'Samsa', 'woke', 'from', 'troubled', 'dreams', 'he', 'found', 'himself', 'transformed', 'in', 'his', 'bed', 'into', 'a', 'horrible', 'vermin', 'He', 'lay', 'on', 'his', 'back', 'and', 'if', 'he', 'lifted', 'his', 'head', 'a', 'little', 'he', 'could', 'see', 'his', 'brown', 'belly', 'slightly', 'domed', 'and', 'divided', 'by', 'arches', 'into', 'stiff', 'sections', 'The', 'bedding', 'was', 'hardly', 'able', 'to', 'cover', 'it', 'and', 'seemed', 'ready', 'to', 'slide', 'off', 'any', 'moment', 'His', 'many', 'legs', 'pitifully', 'thin', 'compared', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', 'waved', 'about', 'helplessly', 'as', 'he', 'looked', 'What', 'happened', 'to', 'me', 'he', 'thought', 'It', 'was', 'a', 'dream', 'His', 'room', 'a', 'proper', 'human', 'room']


In [14]:
#Load data 
filename=r"C:\Users\divyakamat\data\DataSets\nlp\metamorphosis_clean.txt"
with open(filename,'rt') as file:
    text = file.read()
    
#split the words into tokens

from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)

#convert to lower case
tokens = [word.lower() for word in tokens]

#Remove punctuations within the tokens

import string
table = str.maketrans('','',string.punctuation)
strip = [w.translate(table) for w in tokens]

#Remove other tokens that are not alphabetic
words = [word for word in strip if word.isalpha()]

#filter out stop words like a , to , this etc
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]

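#Stemming of words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
#Note: this stems the lowercased `tokens` list, not the filtered `words` list,
#which is why the output below still contains punctuation and stop words.
#To stem only the cleaned words, iterate over `words` instead of `tokens`.
stemmed = [porter.stem(word) for word in tokens]

print(stemmed[:100])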

['ï', '»', '¿one', 'morn', ',', 'when', 'gregor', 'samsa', 'woke', 'from', 'troubl', 'dream', ',', 'he', 'found', 'himself', 'transform', 'in', 'hi', 'bed', 'into', 'a', 'horribl', 'vermin', '.', 'he', 'lay', 'on', 'hi', 'armour-lik', 'back', ',', 'and', 'if', 'he', 'lift', 'hi', 'head', 'a', 'littl', 'he', 'could', 'see', 'hi', 'brown', 'belli', ',', 'slightli', 'dome', 'and', 'divid', 'by', 'arch', 'into', 'stiff', 'section', '.', 'the', 'bed', 'wa', 'hardli', 'abl', 'to', 'cover', 'it', 'and', 'seem', 'readi', 'to', 'slide', 'off', 'ani', 'moment', '.', 'hi', 'mani', 'leg', ',', 'piti', 'thin', 'compar', 'with', 'the', 'size', 'of', 'the', 'rest', 'of', 'him', ',', 'wave', 'about', 'helplessli', 'as', 'he', 'look', '.', '``', 'what', "'s"]


Because the source text for this tutorial was reasonably clean to begin with, we skipped many concerns of text cleaning that you may need to deal with in your own project. <br>

Here is a short list of additional considerations when cleaning text:<br>

- Handling large documents and large collections of text documents that do not fit into memory.
- Extracting text from markup like HTML, PDF, or other structured document formats.
- Transliteration of characters from other languages into English.
- Decoding text into Unicode and normalizing it to a consistent form, and storing it in a consistent encoding such as UTF-8.
- Handling of domain-specific words, phrases, and acronyms.
- Handling or removing numbers, such as dates and amounts.
- Locating and correcting common typos and misspellings.
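
Two of these points, large files and Unicode normalization, can be sketched together with the standard library. This is a minimal sketch under stated assumptions: `iter_clean_tokens` is a hypothetical helper, NFKC is one possible choice of normalization form, and `io.StringIO` stands in for a large file opened with `open()`.

```python
import io
import unicodedata

def iter_clean_tokens(lines):
    """Yield lowercase alphabetic tokens one line at a time."""
    for line in lines:
        # NFKC folds compatibility characters (such as the 'fi' ligature)
        # and combining accents into their composed equivalents.
        normalized = unicodedata.normalize("NFKC", line)
        for token in normalized.lower().split():
            if token.isalpha():
                yield token

# io.StringIO stands in for a file object; iterating over either one
# reads a single line at a time instead of the whole document.
fake_file = io.StringIO("The \ufb01rst line\nCafe\u0301 42 tables\n")
result = list(iter_clean_tokens(fake_file))
print(result)
```

Because the function is a generator, only one line is held in memory at a time. Without the normalization step, the word written with a combining accent would fail the `isalpha()` check and be dropped.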

### Lemmatization
Lemmatization is an alternative to stemming for removing inflection.

Lemmatization is a more intensive and therefore slower process, but it is more accurate. Stemming may be more useful in queries for databases, whereas lemmatization may work much better when trying to determine text sentiment.

from nltk.stem.wordnet import WordNetLemmatizer <br>
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer <br>
stem = PorterStemmer()

word = "multiplying" <br>
lem.lemmatize(word, "v") <br>
>> "multiply" <br>
stem.stem(word) <br>
>> "multipli"


### References

https://machinelearningmastery.com/clean-text-machine-learning-python/

https://towardsdatascience.com/pre-processing-in-natural-language-machine-learning-898a84b8bd47

https://medium.com/@dobko_m/nlp-text-data-cleaning-and-preprocessing-ea3ffe0406c1

https://www.theschool.ai/courses/data-lit/lessons/cleaning-text-data/

