### Text Preprocessing

##### After a text is obtained, we start with text preprocessing. Text preprocessing includes:
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks, and other diacritics
* removing extra whitespace
* expanding abbreviations
* removing stop words, sparse terms, and particular words
* text canonicalization
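
Two of the steps listed above, removing diacritics and expanding abbreviations, are not demonstrated in the cells that follow. A minimal sketch using only the standard library is given here. The helper names `strip_accents` and `expand_abbreviations` and the tiny `ABBREVIATIONS` table are assumptions made for illustration; a real project would need a much larger, domain-specific table.

```python
import unicodedata


def strip_accents(text):
    # NFKD decomposition splits an accented letter into a base letter
    # followed by combining marks; dropping the marks keeps the base letter.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"e.g.": "for example", "approx.": "approximately"}


def expand_abbreviations(text, table=ABBREVIATIONS):
    # Plain substring replacement; it does not check word boundaries.
    for short, full in table.items():
        text = text.replace(short, full)
    return text


print(strip_accents("Café déjà vu, naïve résumé"))
print(expand_abbreviations("It takes approx. 5 minutes, e.g. by bus."))
```

Because the replacement is a plain substring match, abbreviations that also occur inside longer words would need a stricter pattern.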

##### Convert text to lowercase

In [1]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


##### Remove numbers

In [2]:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r"\d+", "", input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.


##### Remove punctuation

In [3]:
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # Sample string
# Remove every character that is neither a word character nor whitespace
result = re.sub(r"[^\w\s]", "", input_str)

print(result)

This is an example of string with punctuation
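
An alternative that avoids regular expressions is `str.translate` with the `string.punctuation` constant from the standard library. The sketch below reuses the same sample string. Note that the two approaches are not identical: `string.punctuation` covers ASCII punctuation only and includes the underscore, whereas the pattern `[^\w\s]` keeps underscores and also removes non-ASCII symbols.

```python
import string

input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"

# The third argument of str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)
result = input_str.translate(table)
print(result)
```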


##### Remove whitespace

In [4]:
input_str = "\t a string example\t "
print(input_str)
input_str = input_str.strip()
print(input_str)

	 a string example	 
a string example
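
`strip()` only removes leading and trailing whitespace. Earlier steps such as removing numbers can leave runs of spaces inside the text, so it may be useful to collapse those as well. A minimal sketch, assuming that any run of spaces, tabs, or newlines should become a single space:

```python
import re

input_str = "Box A contains  red and  white balls,\twhile Box B\ncontains  red."

# \s+ matches one or more whitespace characters of any kind.
collapsed = re.sub(r"\s+", " ", input_str).strip()
print(collapsed)
```

The expression `" ".join(input_str.split())` gives the same result without the `re` module.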


##### Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and other elements can be considered tokens. The table below (“Tokenization” sheet) describes several tools for implementing tokenization.

![](./img/Tokenization.png)
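
To make the idea concrete, here is a deliberately naive tokenizer built on a single regular expression. It is only a sketch for illustration (the function name `simple_tokenize` is made up here) and is not how the tools in the table work internally; they handle contractions, abbreviations, and language-specific rules far more carefully.

```python
import re


def simple_tokenize(text):
    # \w+ matches a run of letters, digits, or underscores (a word or number);
    # [^\w\s] matches one character that is neither of those nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("Words, numbers (like 2017), and punctuation are tokens.")
print(tokens)
```

One visible limitation: this pattern splits a contraction such as "don't" into three tokens (`don`, `'`, `t`).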

##### Remove stop words
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words carry little meaning on their own and are usually removed from texts. Stop words can be removed using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
Many other modern libraries, such as scikit-learn, spaCy, and Keras, also support stop word removal.

In [7]:
import nltk 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [9]:
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words("english"))
tokens = word_tokenize(input_str)
result = [i for i in tokens if i not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
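
The entries in NLTK's English stop word list are lowercase, so a capitalized stop word (for example "The" at the start of a sentence) is not removed by a direct membership test. The sketch below shows the difference using a small hand-picked stop list so that it runs without any downloads. The list itself is an assumption for illustration; in practice it could be `set(stopwords.words("english"))`.

```python
# Small hand-picked stop list, for illustration only.
stop_words = {"the", "a", "is", "on", "all"}

tokens = ["The", "cat", "is", "on", "the", "mat", "."]

# Direct comparison keeps "The" because only "the" is in the list.
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing each token before the lookup removes it as well.
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Lowercasing the whole text first, as in the first step of this notebook, has the same effect.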


##### Stemming 
Stemming is the process of reducing words to their stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflectional endings from words [14]) and the Lancaster stemming algorithm (a more aggressive stemmer). Some stemmers are described in the “Stemming” sheet of the table.

In [10]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

input_str = "There are several types of stemming algorithms."
input_str = word_tokenize(input_str)

for word in input_str:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.


##### Lemmatization
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. Unlike stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words. Note that NLTK's WordNet lemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why the verb forms in the example below are returned unchanged.
<br>Lemmatization tools are available in the libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

In [12]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
for word in input_str:
    print(lemmatizer.lemmatize(word))

print("\n")
input_str1 = "There are several types of stemming algorithms."

input_str = word_tokenize(input_str1)
for word in input_str:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse


There
are
several
type
of
stemming
algorithm
.


### Text preprocessing with spaCy

In [24]:
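import spacy  # load spaCy
# Download the spaCy English model with: python -m spacy download en
# (spaCy 3 removed the "en" shortcut; use "en_core_web_sm" there instead)
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
stops = stopwords.words("english")  # relies on the NLTK import from the cells above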



def normalize(comment, lowercase, remove_stopwords):
    if lowercase:
        comment = comment.lower()
    comment = nlp(comment)
    lemmatized = list()
    for word in comment:
        lemma = word.lemma_.strip()
        if lemma:
            if not remove_stopwords or lemma not in stops:
                lemmatized.append(lemma)
    return " ".join(lemmatized)


text = """Machine learning (ML) is the scientific study of algorithms and statistical models that
        computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as training data, in order to make predictions or decisions without being explicitly programmed to perform the task.[1][2]:2 Machine learning algorithms are used in a wide variety of applications, 
        such as email filtering, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.
        Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.[3][4] In its application across business problems, machine learning is also referred to as predictive analytics."""
normalize(text,lowercase=True,remove_stopwords=True)

'machine learn ( ml ) scientific study algorithm statistical model computer system use order perform specific task effectively without use explicit instruction , rely pattern inference instead . see subset artificial intelligence . machine learn algorithm build mathematical model base sample datum , know train datum , order make prediction decision without explicitly programme perform task.[1][2]:2 machine learn algorithm use wide variety application , email filter , computer vision , infeasible develop algorithm specific instruction perform task . machine learn closely relate computational statistic , focus make prediction use computer . study mathematical optimization deliver method , theory application domain field machine learn . datum mine field study within machine learn , focus exploratory datum analysis unsupervised learning.[3][4 ] application across business problem , machine learn also refer predictive analytics .'

##### Useful links

regex tutorial: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
<br>stemming vs lemmatization: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
<br>more: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
<br>Keras text preprocessing: https://keras.io/preprocessing/text/
<br>intro to spaCy: https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/