### Text Preprocessing

After a text is obtained, we start with text preprocessing. Text preprocessing includes:
* converting all letters to lower or upper case
* converting numbers into words or removing numbers
* removing punctuation, accent marks and other diacritics
* removing whitespace
* expanding abbreviations
* removing stop words, sparse terms, and particular words
* text canonicalization
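
Of the steps listed above, removing accent marks and other diacritics has no worked example elsewhere in this notebook. A minimal sketch using only the standard library could look like the following. The helper name `strip_accents` is hypothetical, and the approach (decompose to NFKD, then drop combining marks) is one common option rather than the only one.

```python
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café, naïve, señor"))
# cafe, naive, senor
```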

##### Convert text to lowercase

In [2]:
input_str = "The 5 biggest countries by population in 2017 are China, India, United States, Indonesia, and Brazil."
input_str = input_str.lower()
print(input_str)

the 5 biggest countries by population in 2017 are china, india, united states, indonesia, and brazil.


##### Remove numbers

In [3]:
import re
input_str = "Box A contains 3 red and 5 white balls, while Box B contains 4 red and 2 blue balls."
result = re.sub(r"\d+", "", input_str)
print(result)

Box A contains  red and  white balls, while Box B contains  red and  blue balls.
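
Instead of removing numbers, they can be converted into words. The sketch below handles single digits only and relies on a hypothetical lookup table, `DIGIT_WORDS`. Multi-digit numbers such as 2017 are left untouched and would need a fuller conversion routine.

```python
import re

# Hypothetical lookup table covering single digits only
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}


def digits_to_words(text):
    """Replace each standalone single digit with its English word."""
    # \b on both sides ensures digits inside longer numbers are not touched
    return re.sub(r"\b[0-9]\b", lambda m: DIGIT_WORDS[m.group()], text)


print(digits_to_words("Box A contains 3 red and 5 white balls."))
# Box A contains three red and five white balls.
```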


##### Remove punctuation

In [4]:
import re
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"  # Sample string
# Drop every character that is neither a word character nor whitespace
result = re.sub(r"[^\w\s]", "", input_str)

print(result)

This is an example of string with punctuation


##### Remove whitespace

In [5]:
input_str = "\t a string example\t "
print(input_str)
input_str = input_str.strip()
print(input_str)

	 a string example	 
a string example
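
`strip()` only removes leading and trailing whitespace. To also collapse runs of whitespace inside the string, one option is to split on whitespace and rejoin with single spaces. The helper name `collapse_whitespace` is hypothetical.

```python
def collapse_whitespace(text):
    """Trim the ends and reduce every internal run of whitespace to one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces
    return " ".join(text.split())


print(collapse_whitespace("\t a   string \n example\t "))
# a string example
```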


##### Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens. Words, numbers, punctuation marks, and other elements can be considered tokens. The table below (“Tokenization” sheet) describes several tools for implementing tokenization.

![](./img/Tokenization.png)
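
As a rough illustration of the idea, the sketch below tokenizes with a single regular expression from the standard library. This is a simplification and not how the tools in the table work. In particular, it splits a contraction such as "It's" into three tokens. The function name `simple_tokenize` is hypothetical.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello, world! It's 2017."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2017', '.']
```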

##### Remove stop words
“Stop words” are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words do not carry important meaning and are usually removed from texts. Stop words can be removed using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
Many modern libraries, such as scikit-learn, spaCy, and Keras, also support stop word removal.

In [6]:
import nltk
# The stop word list is a separate resource that must be downloaded once
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [7]:
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
stop_words = set(stopwords.words("english"))
tokens = word_tokenize(input_str)
# Keep only the tokens that are not in the stop word list
result = [token for token in tokens if token not in stop_words]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']


##### Stemming 
Stemming is the process of reducing words to their stem, base, or root form (for example, books → book, looked → look). The two main algorithms are the Porter stemming algorithm (which removes common morphological and inflectional endings from words [14]) and the Lancaster stemming algorithm (a more aggressive stemming algorithm). The “Stemming” sheet of the table describes some stemmers.

In [8]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()

input_str = "There are several types of stemming algorithms."
tokens = word_tokenize(input_str)

# Print the stem of each token
for word in tokens:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.
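
To make the idea of suffix stripping concrete, here is a deliberately naive stemmer written from scratch. It is not the Porter or Lancaster algorithm. The suffix list and the minimum stem length of three characters are arbitrary assumptions chosen for illustration, and `naive_stem` is a hypothetical name.

```python
# Hypothetical suffix list, checked in this order
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["books", "looked", "stemming", "is"]:
    print(word, "->", naive_stem(word))
# books -> book
# looked -> look
# stemming -> stemm
# is -> is
```

Note that "stemming" becomes "stemm" rather than "stem", which hints at why real stemmers add further rules, for example for doubled consonants.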


##### Lemmatization
The aim of lemmatization, like stemming, is to reduce inflectional forms to a common base form. As opposed to stemming, lemmatization does not simply chop off inflections. Instead, it uses lexical knowledge bases to get the correct base forms of words.
<br>Lemmatization tools are provided by the libraries described above: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

In [9]:
import nltk
# The WordNet data is a separate resource that must be downloaded once
nltk.download("wordnet", quiet=True)

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
input_str = "been had done languages cities mice"
tokens = word_tokenize(input_str)
# lemmatize() treats each word as a noun unless a part of speech is passed
for word in tokens:
    print(lemmatizer.lemmatize(word))

been
had
done
language
city
mouse
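
The contrast with stemming can be shown with a toy dictionary lookup. The `LEMMA_LEXICON` table below is a hypothetical stand-in for a real lexical knowledge base such as WordNet and covers only a handful of words. `toy_lemmatize` is likewise an illustrative name.

```python
# Hypothetical miniature lexicon mapping inflected forms to lemmas
LEMMA_LEXICON = {
    "mice": "mouse",
    "cities": "city",
    "languages": "language",
    "been": "be",
    "had": "have",
    "done": "do",
}


def toy_lemmatize(word):
    """Look the word up in the lexicon; fall back to the word itself."""
    return LEMMA_LEXICON.get(word.lower(), word)


print([toy_lemmatize(w) for w in "been had done languages cities mice".split()])
# ['be', 'have', 'do', 'language', 'city', 'mouse']
```

No suffix rule could turn "mice" into "mouse" or "been" into "be", which is why lemmatizers rely on a lexicon rather than on string manipulation alone.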


In [10]:
# Requires spaCy and an English model, for example:
#   pip install spacy
#   python -m spacy download en_core_web_sm
import spacy
from nltk.corpus import stopwords

# Assumes spaCy 3: the tagger stays enabled because the lemmatizer relies on part-of-speech tags
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
stops = set(stopwords.words("english"))


def normalize(comment, lowercase, remove_stopwords):
    """Lemmatize a comment, optionally lowercasing it and dropping stop words."""
    if lowercase:
        comment = comment.lower()
    lemmatized = []
    for word in nlp(comment):
        lemma = word.lemma_.strip()
        if lemma and not (remove_stopwords and lemma in stops):
            lemmatized.append(lemma)
    return " ".join(lemmatized)


# Example use on a pandas DataFrame named Data with a "Text" column:
# Data['Text_After_Clean'] = Data['Text'].apply(normalize, lowercase=True, remove_stopwords=True)
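
The individual steps in this notebook can also be chained into one function without any third-party library. The sketch below uses only the standard library. `SMALL_STOP_WORDS` is a hypothetical, heavily shortened stop list standing in for a full list such as NLTK's, and `preprocess` is an illustrative name.

```python
import re

# Hypothetical, shortened stop list used only for this illustration
SMALL_STOP_WORDS = {"the", "a", "an", "on", "is", "are", "all", "by", "in", "and"}


def preprocess(text):
    """Lowercase, drop digits and punctuation, split on whitespace, remove stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)       # remove numbers
    text = re.sub(r"[^\w\s]", "", text)   # remove punctuation
    tokens = text.split()                 # splitting also normalizes whitespace
    return [t for t in tokens if t not in SMALL_STOP_WORDS]


print(preprocess("The 5 biggest countries by population in 2017 are China, India, and Brazil."))
# ['biggest', 'countries', 'population', 'china', 'india', 'brazil']
```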

Regex tutorial: https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285
<br>Stemming vs. lemmatization: https://blog.bitext.com/what-is-the-difference-between-stemming-and-lemmatization/
<br>More: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908
<br>Keras text preprocessing: https://keras.io/preprocessing/text/
<br>Intro to spaCy: https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/