# Text Preprocessing in Python: Steps, Tools, and Examples

Source: https://medium.com/@datamonsters/text-preprocessing-in-python-steps-tools-and-examples-bf025f872908

by Olga Davydova, Data Monsters

After we have the data, we start with text normalization, which includes:
- Converting all letters to lower or upper case.
- Converting numbers into words or removing numbers.
- Removing punctuation, accent marks, and other diacritics.
- Removing extra whitespace.
- Expanding abbreviations.
- Removing stop words, sparse terms, and particular words.
- Text canonicalization.
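
Several of the steps listed above can be chained into one helper. The following is a minimal sketch using only the standard library. The function name `normalize_text` is hypothetical, and it assumes numbers are removed rather than spelled out.

```python
import re
import string


def normalize_text(text):
    """Lowercase, drop digits and ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # remove runs of digits
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())  # trim and collapse whitespace


print(normalize_text("  Box A contains 3 red balls, and box B contains 4 blue balls!  "))
# box a contains red balls and box b contains blue balls
```

The order matters slightly: collapsing whitespace last cleans up the gaps left by the removed digits and punctuation.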

## Convert text to lowercase

In [1]:
input_str = 'Hey Jude, dont make it bad. Take a sad song, Jude. Sekar tell me to shut up.'
print(input_str.lower())

hey jude, dont make it bad. take a sad song, jude. sekar tell me to shut up.


## Removing numbers

In [2]:
import re
input_str = "Box A contains 3 red balls and box B contains 4 blue balls"
print(re.sub(r'\d+', '', input_str))

Box A contains  red balls and box B contains  blue balls
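
The other option listed earlier is converting numbers into words. A minimal sketch, assuming it is acceptable to spell out each digit separately (so "42" becomes "four two"); the helper name `spell_digits` is hypothetical:

```python
import re

DIGIT_WORDS = ['zero', 'one', 'two', 'three', 'four',
               'five', 'six', 'seven', 'eight', 'nine']


def spell_digits(text):
    """Replace every run of ASCII digits with its digits spelled out one by one."""
    def replace(match):
        return ' '.join(DIGIT_WORDS[int(digit)] for digit in match.group())
    return re.sub(r'[0-9]+', replace, text)


print(spell_digits("Box A contains 3 red balls and box B contains 4 blue balls"))
# Box A contains three red balls and box B contains four blue balls
```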


## Removing punctuation

In [3]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [4]:
input_str = "This &is [an] example? {of} string. with.? punctuation!!!!"

In [5]:
input_str.translate(str.maketrans('', '', string.punctuation))

'This is an example of string with punctuation'
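
`string.punctuation` covers ASCII punctuation only, so accent marks and other diacritics need a separate step. One common approach, sketched here with the standard `unicodedata` module, is to decompose each character and drop the combining marks. The helper name `strip_accents` is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("café naïve résumé"))
# cafe naive resume
```

Letters that have no canonical decomposition, such as "ø", pass through unchanged with this approach.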

## Remove whitespace

In [6]:
input_str = " \t a string example\t "
input_str.strip()

'a string example'
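
`strip()` only trims the ends of the string. To also collapse runs of whitespace inside the text, a regular expression or a split-and-join can be used, as in this short sketch:

```python
import re

input_str = " \t a   string\n example\t "

# Replace every run of whitespace with one space, then trim the ends.
collapsed = re.sub(r'\s+', ' ', input_str).strip()
print(collapsed)
# a string example

# Equivalent without a regular expression: split on whitespace, then re-join.
rejoined = ' '.join(input_str.split())
print(rejoined)
# a string example
```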

## Tokenization

Tokenization is the process of splitting the given text into smaller pieces called tokens.
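
As a rough illustration, a regular expression can split text into word and punctuation tokens. This is a simplified sketch with a hypothetical helper name, `simple_tokenize`, not a replacement for a library tokenizer such as NLTK's `word_tokenize`:

```python
import re


def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hey Jude, don't make it bad."))
# ['Hey', 'Jude', ',', 'don', "'", 't', 'make', 'it', 'bad', '.']
```

The contraction "don't" is split into three pieces here, which is one reason dedicated tokenizers apply language-specific rules.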

## Remove stop words

Stop words are the most common words in a language, such as 'the', 'a', 'on', 'is', and 'all'.

In [7]:
import nltk
nltk.download('stopwords')
input_str = "NLTK is a leading platform for building Python programs to work with human language data."
from nltk.corpus import stopwords
print(set(stopwords.words('english')))


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
{'ours', 'aren', 'theirs', 'during', 'wasn', 'him', 'because', 'whom', 'too', 'off', "needn't", 'such', 'who', 'their', 'those', 'against', "won't", 'll', 'our', "you're", "should've", 'haven', 'most', 'have', "you'd", 'on', 'other', 'until', 'these', 'hers', 'yours', 'yourselves', 'so', "wouldn't", 'only', "that'll", 'be', 'when', 'mustn', 'isn', 'to', 'being', 'few', 'mightn', 'very', 'itself', 'won', 'at', 'had', 're', 'shan', "weren't", 'between', 'before', 'she', "didn't", 'by', 'some', 'can', 'needn', 'which', 'an', 'a', 'into', 'further', 've', 'if', 'out', 'again', 'but', 'm', 'over', 'were', 'o', 'or', 'below', 'this', 'with', 'no', "mightn't", 'then', 'each', "aren't", 'her', 'there', 'ourselves', 'any', 'about', 'how', 'his', "she's", 'what', 'and', "shouldn't", 'myself', 'themselves', 'more', 'you', 'herself', "shan't", 'we', 'the', "isn't", 'am', "you'll", 'from',

In [8]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')
tokens = word_tokenize(input_str)
stop_words = set(stopwords.words('english'))  # build the set once, not once per token
result = [token for token in tokens if token not in stop_words]

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [9]:
result

['NLTK',
 'leading',
 'platform',
 'building',
 'Python',
 'programs',
 'work',
 'human',
 'language',
 'data',
 '.']

In [10]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
result = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']


In [11]:
from spacy.lang.en.stop_words import STOP_WORDS
result = [token for token in tokens if token not in STOP_WORDS]
print(result)

['NLTK', 'leading', 'platform', 'building', 'Python', 'programs', 'work', 'human', 'language', 'data', '.']
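
The stop word lists contain lowercase entries, so a case-sensitive membership test would keep a capitalized "The" at the start of a sentence. A minimal sketch of a case-insensitive filter follows. The small `STOP_WORDS_SAMPLE` set and the helper name are assumptions for illustration, standing in for any of the library lists used above.

```python
# Hypothetical miniature stop list; a real list from NLTK, scikit-learn, or spaCy
# could be passed in instead.
STOP_WORDS_SAMPLE = {'the', 'a', 'on', 'is', 'all'}


def remove_stop_words(tokens, stop_words=STOP_WORDS_SAMPLE):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ['The', 'cat', 'is', 'on', 'the', 'mat']
print(remove_stop_words(tokens))
# ['cat', 'mat']
```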


## Stemming

Stemming is the process of reducing words to their word stem, base, or root form.

In [12]:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
input_str = "There are several types of stemming algorithms."
stemmer = PorterStemmer()
input_str = word_tokenize(input_str)
for word in input_str:
    print(stemmer.stem(word))

there
are
sever
type
of
stem
algorithm
.
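
To see what suffix stripping means in its crudest form, here is a toy stemmer. It is deliberately much simpler than the Porter algorithm used above: the rule list and the `toy_stem` name are assumptions made for this sketch, and its results differ from Porter's.

```python
SUFFIXES = ('ing', 'ed', 'ly', 's')  # hypothetical rule list, checked in order


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word


for word in ['several', 'types', 'stemming', 'algorithms']:
    print(word, '->', toy_stem(word))
# several -> several
# types -> type
# stemming -> stemm
# algorithms -> algorithm
```

The toy version leaves "stemm" and "several" where Porter produces "stem" and "sever", which shows why practical stemmers need additional rules beyond a plain suffix list.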


## Lemmatization

As opposed to stemming, lemmatization does not simply chop off inflections; instead, it uses lexical knowledge bases to get the correct base forms of words.

Lemmatization is provided by the following tools: NLTK (WordNet Lemmatizer), spaCy, TextBlob, Pattern, gensim, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), Illinois Lemmatizer, and DKPro Core.

In [13]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
input_str = "been had done languages cities mice"
input_str = word_tokenize(input_str)
lemmatizer = WordNetLemmatizer()
for word in input_str:
    print(lemmatizer.lemmatize(word))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
been
had
done
language
city
mouse
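
In the output above, "been", "had", and "done" come back unchanged because `lemmatize` treats every word as a noun unless a `pos` argument says otherwise. The toy lookup below illustrates why the part of speech matters. The table and the `toy_lemmatize` helper are hypothetical and cover only the six example words.

```python
# Hypothetical miniature lexicon keyed by (word, part of speech).
LEMMA_TABLE = {
    ('been', 'v'): 'be',
    ('had', 'v'): 'have',
    ('done', 'v'): 'do',
    ('languages', 'n'): 'language',
    ('cities', 'n'): 'city',
    ('mice', 'n'): 'mouse',
}


def toy_lemmatize(word, pos='n'):
    """Look the word up under the given part of speech; fall back to the word itself."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


for word in "been had done languages cities mice".split():
    print(word, toy_lemmatize(word), toy_lemmatize(word, pos='v'))
# been been be
# had had have
# done done do
# languages language languages
# cities city cities
# mice mouse mice
```

With NLTK, the analogous call is `lemmatizer.lemmatize(word, pos='v')`.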


## Part of Speech tagging (POS)

Part-of-speech tagging aims to assign parts of speech to each word of a given text (such as nouns, verbs, adjectives, and others) based on its definition and its context. There are many tools containing POS taggers including NLTK, spaCy, TextBlob, Pattern, Stanford CoreNLP, Memory-Based Shallow Parser (MBSP), Apache OpenNLP, Apache Lucene, General Architecture for Text Engineering (GATE), FreeLing, Illinois Part of Speech Tagger, and DKPro Core.

In [14]:
input_str = "Parts of speech examples: an article, to write, interesting, easily, and, of"

In [15]:
from textblob import TextBlob
result = TextBlob(input_str)
print(result)

Parts of speech examples: an article, to write, interesting, easily, and, of


In [16]:
nltk.download('averaged_perceptron_tagger')
print(result.tags)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'), ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'), ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]
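
The tag names follow the Penn Treebank convention. The sketch below pairs each tag from the output above with a short description. The `TAG_DESCRIPTIONS` dictionary is written by hand for this example and covers only the tags that appear here.

```python
# Hand-written glossary for the Penn Treebank tags seen in the output above.
TAG_DESCRIPTIONS = {
    'CC': 'coordinating conjunction',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'RB': 'adverb',
    'TO': 'to',
    'VB': 'verb, base form',
    'VBG': 'verb, gerund or present participle',
}

tagged = [('Parts', 'NNS'), ('of', 'IN'), ('speech', 'NN'), ('examples', 'NNS'),
          ('an', 'DT'), ('article', 'NN'), ('to', 'TO'), ('write', 'VB'),
          ('interesting', 'VBG'), ('easily', 'RB'), ('and', 'CC'), ('of', 'IN')]

for word, tag in tagged:
    print(f"{word:<12}{tag:<5}{TAG_DESCRIPTIONS.get(tag, 'unknown')}")
```

Note that "interesting" is tagged as a verb form (VBG) here rather than as an adjective (JJ), a reminder that tagger output depends on context and is not always what the reader expects.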


## Chunking (Shallow Parsing)

Chunking is a natural language process that identifies constituent parts of sentences (nouns, verbs, adjectives, etc.) and links them to higher order units that have discrete grammatical meanings (noun groups or phrases, verb groups, etc.) [23]. Chunking tools: NLTK, TreeTagger chunker, Apache OpenNLP, General Architecture for Text Engineering (GATE), FreeLing.

In [17]:
input_str = "A black television and a white stove were bought for the new apartment of John."

In [18]:
result = TextBlob(input_str)
print(result.tags)

[('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'), ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'), ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'), ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]


In [19]:
reg_exp = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(reg_exp)
result = rp.parse(result.tags)
print(result)

(S
  (NP A/DT black/JJ television/NN)
  and/CC
  (NP a/DT white/JJ stove/NN)
  were/VBD
  bought/VBN
  for/IN
  (NP the/DT new/JJ apartment/NN)
  of/IN
  John/NNP)
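
The grammar `NP: {<DT>?<JJ>*<NN>}` reads as "an optional determiner, any number of adjectives, then a singular noun". To make the matching explicit, here is a minimal sketch that applies the same pattern with the standard `re` module to a string of tags. The helper `np_chunks` is hypothetical and is not how the parser must be used; it only illustrates the pattern.

```python
import re


def np_chunks(tagged):
    """Find token runs matching <DT>?<JJ>*<NN> in a list of (word, tag) pairs."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN><CC>...".
    tag_string = ''.join(f'<{tag}>' for _, tag in tagged)
    # Map the character offset where each tag starts to its token index.
    index_of = {}
    offset = 0
    for i, (_, tag) in enumerate(tagged):
        index_of[offset] = i
        offset += len(tag) + 2  # tag plus the two angle brackets
    chunks = []
    for match in re.finditer(r'(?:<DT>)?(?:<JJ>)*<NN>', tag_string):
        start = index_of[match.start()]
        end = index_of.get(match.end(), len(tagged))  # match may end at the last token
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks


tagged = [('A', 'DT'), ('black', 'JJ'), ('television', 'NN'), ('and', 'CC'),
          ('a', 'DT'), ('white', 'JJ'), ('stove', 'NN'), ('were', 'VBD'),
          ('bought', 'VBN'), ('for', 'IN'), ('the', 'DT'), ('new', 'JJ'),
          ('apartment', 'NN'), ('of', 'IN'), ('John', 'NNP')]
print(np_chunks(tagged))
# [['A', 'black', 'television'], ['a', 'white', 'stove'], ['the', 'new', 'apartment']]
```

"John" is left out because its tag is NNP, and the pattern asks for exactly NN.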


## Named Entity Recognition

Named-entity recognition (NER) aims to find named entities in text and classify them into pre-defined categories (names of persons, locations, organizations, times, etc.).

The following named-entity recognition tools are described in the “NER” sheet of the table: NLTK, spaCy, General Architecture for Text Engineering (GATE) - ANNIE, Apache OpenNLP, Stanford CoreNLP, DKPro Core, MITIE, Watson Natural Language Understanding, TextRazor, and FreeLing.

In [20]:
from nltk import word_tokenize, pos_tag, ne_chunk
input_str = "Bill works for Apple so he went to Boston for a conference."
print(word_tokenize(input_str))

['Bill', 'works', 'for', 'Apple', 'so', 'he', 'went', 'to', 'Boston', 'for', 'a', 'conference', '.']


In [21]:
pos_tag(word_tokenize(input_str))

[('Bill', 'NNP'),
 ('works', 'VBZ'),
 ('for', 'IN'),
 ('Apple', 'NNP'),
 ('so', 'IN'),
 ('he', 'PRP'),
 ('went', 'VBD'),
 ('to', 'TO'),
 ('Boston', 'NNP'),
 ('for', 'IN'),
 ('a', 'DT'),
 ('conference', 'NN'),
 ('.', '.')]

In [25]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
print(ne_chunk(pos_tag(word_tokenize(input_str))))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  Apple/NNP
  so/IN
  he/PRP
  went/VBD
  to/TO
  (GPE Boston/NNP)
  for/IN
  a/DT
  conference/NN
  ./.)
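
In the tree above, "Apple" is left without an entity label. One simple complement to a statistical chunker is a dictionary (gazetteer) lookup. The sketch below is a different, much cruder technique than `ne_chunk`; the `GAZETTEER` entries and the helper name are assumptions made for illustration.

```python
# Hypothetical hand-built gazetteer mapping known names to entity types.
GAZETTEER = {'Bill': 'PERSON', 'Apple': 'ORGANIZATION', 'Boston': 'GPE'}


def lookup_entities(tokens, gazetteer=GAZETTEER):
    """Return (token, entity_type) pairs for tokens found in the gazetteer."""
    return [(token, gazetteer[token]) for token in tokens if token in gazetteer]


tokens = ['Bill', 'works', 'for', 'Apple', 'so', 'he', 'went', 'to',
          'Boston', 'for', 'a', 'conference', '.']
print(lookup_entities(tokens))
# [('Bill', 'PERSON'), ('Apple', 'ORGANIZATION'), ('Boston', 'GPE')]
```

A lookup like this cannot tell the company "Apple" from the fruit, which is why context-aware models are normally preferred.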


## Coreference Resolution (Anaphora Resolution)

Pronouns and other referring expressions should be connected to the right individuals. Coreference resolution finds the mentions in a text that refer to the same real-world entity. For example, in the sentence “Andrew said he would buy a car,” the pronoun “he” refers to the same person as “Andrew”. The coreference resolution tools Stanford CoreNLP, spaCy, Open Calais, and Apache OpenNLP are described in the “Coreference resolution” sheet of the table.

## Collocation Extraction

Collocations are word combinations occurring together more often than would be expected by chance. Collocation examples are “break the rules,” “free time,” “draw a conclusion,” “keep in mind,” “get ready,” and so on.
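
One common way to quantify "more often than would be expected by chance" is pointwise mutual information (PMI), which compares the observed probability of a word pair with the product of the individual word probabilities. The following is a minimal sketch over adjacent word pairs; the function name `bigram_pmi` and the tiny example text are assumptions for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent word pair in tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        # PMI > 0 means the pair occurs more often than independence predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "i have free time today and free time tomorrow and i have time".split()
scores = bigram_pmi(tokens)
print(round(scores[('free', 'time')], 2), round(scores[('have', 'time')], 2))
```

In this toy text, "free time" scores higher than "have time". PMI tends to favor rare pairs, so in practice a minimum frequency threshold is usually applied before ranking.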

## Relationship Extraction

Relationship extraction obtains structured information from unstructured sources such as raw text. Strictly speaking, it identifies relations (e.g., acquisition, spouse, employment) among named entities (e.g., people, organizations, locations). For example, from the sentence “Mark and Emily married yesterday,” we can extract the information that Mark is Emily’s husband.
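
A hand-written pattern is the simplest way to illustrate the idea. The sketch below extracts a spouse relation from sentences shaped like the example above. The pattern, the `spouse_of` label, and the helper name are assumptions, and practical systems usually combine named-entity recognition with trained classifiers rather than a single regular expression.

```python
import re

# Matches "<Name> and <Name> married" or "<Name> and <Name> got married".
MARRIAGE_PATTERN = re.compile(r'\b([A-Z][a-z]+) and ([A-Z][a-z]+) (?:got )?married\b')


def extract_spouse_relations(text):
    """Return (person, 'spouse_of', person) triples found by the pattern."""
    return [(first, 'spouse_of', second)
            for first, second in MARRIAGE_PATTERN.findall(text)]


print(extract_spouse_relations("Mark and Emily married yesterday."))
# [('Mark', 'spouse_of', 'Emily')]
```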