# Normalizing text

Normalizing text means converting it to a standard form. There are various ways to normalize text, and which ones you choose will depend on your application. Here are some common normalization techniques:
* lowercasing
* removing punctuation and/or numbers
* stemming
* lemmatizing

The following code cell demonstrates several techniques. 

In [3]:
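# sentence tokenization and word tokenization
# (the tokenizers need NLTK's Punkt data, e.g. nltk.download('punkt');
# newer NLTK releases use 'punkt_tab' instead)
import nltk
from nltk import word_tokenize, sent_tokenize

sentences = """The quick brown fox jumped over the lazy river. She sells
sea shells by the sea shore. Humpy Dumpy sat on the wall."""

sents = sent_tokenize(sentences)    # list of sentence strings
tokens = word_tokenize(sents[0])    # word tokens of the first sentence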

In [4]:
print(sents[0])
print(tokens)

The quick brown fox jumped over the lazy river.
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'river', '.']


In [5]:
# if you just want the words, you can strip everything else with a regex
# replace punctuation symbols, newlines, and digits with spaces

import re

text = re.sub(r'[.?!,:;()\-\n\d]', ' ', sentences.lower())
print(text)

the quick brown fox jumped over the lazy river  she sells sea shells by the sea shore  humpy dumpy sat on the wall 


### stemming

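Stemming removes affixes from words, typically suffixes. It applies rules rather than looking words up in a dictionary, so the result is not always a real word. A well-known stemmer is the Porter stemmer, available in NLTK.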


In [6]:
import nltk

# stem each token of the first sentence with the Porter stemmer
porter = nltk.PorterStemmer()
stemmed = [porter.stem(t) for t in tokens]
stemmed[:25]  # show at most the first 25 stems

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'river', '.']
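
To see why a stem such as 'lazi' is not a real word, it can help to look at rule-based suffix stripping in miniature. The sketch below is a toy, not the Porter algorithm: the rule list and the `toy_stem` name are made up for illustration, and the real stemmer uses many more rules and conditions.

```python
# Toy suffix rules, tried in order: (suffix, replacement).
# Illustrative only; these are NOT the Porter stemmer's rules.
TOY_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("y", "i"), ("s", "")]

def toy_stem(word, min_stem_len=3):
    """Lowercase the word and apply the first suffix rule that matches."""
    word = word.lower()
    for suffix, replacement in TOY_RULES:
        stem_len = len(word) - len(suffix)
        # Only strip when enough of the word is left over.
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word

print([toy_stem(t) for t in ["The", "quick", "jumped", "lazy", "rivers"]])
# ['the', 'quick', 'jump', 'lazi', 'river']
```

Because the rules only look at the spelling of the word, 'lazy' is rewritten to 'lazi' even though no such word exists.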

### lemmatizing

Lemmatizing attempts to reduce a word to its base lexical form (its lemma), as you would find it in a dictionary. NLTK can do that as well, using WordNet.

In [7]:
wnl = nltk.WordNetLemmatizer()
lemmatized = [wnl.lemmatize(t) for t in tokens]
lemmatized

['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'river', '.']

Notice that 'jumped' did not get lemmatized. By default, the WordNet lemmatizer treats every word as a noun, so it works better if it is told which words are verbs.


### pos

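NLTK has a part-of-speech (POS) tagger that takes a list of tokens and returns a list of tuples of the form (token, pos). The tags come from the Penn Treebank tag set, in which all verb tags start with 'VB'.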

In [8]:
tags = nltk.pos_tag(tokens)
tags

[('The', 'DT'),
 ('quick', 'JJ'),
 ('brown', 'NN'),
 ('fox', 'NN'),
 ('jumped', 'VBD'),
 ('over', 'IN'),
 ('the', 'DT'),
 ('lazy', 'JJ'),
 ('river', 'NN'),
 ('.', '.')]

In [9]:
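# use the POS tags to tell the lemmatizer which tokens are verbs
lemmatized = []
for token, tag in tags:
    if tag.startswith('VB'):
        # verb tags (VB, VBD, VBG, ...) are lemmatized as verbs
        lemma = wnl.lemmatize(token, pos='v')
    else:
        # everything else falls back to the default (noun)
        lemma = wnl.lemmatize(token)
    lemmatized.append(lemma)
lemmatized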


['The', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'river', '.']
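
The loop above only handles verbs. One possible extension, sketched here under the assumption that adjectives and adverbs should be passed along too, is a small helper that maps a Penn Treebank tag to the single-letter POS codes that WordNet uses ('n', 'v', 'a', 'r'). The helper name `wordnet_pos` is hypothetical and not part of NLTK.

```python
def wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if treebank_tag.startswith("VB"):
        return "v"  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    if treebank_tag.startswith("JJ"):
        return "a"  # adjectives: JJ, JJR, JJS
    if treebank_tag.startswith("RB"):
        return "r"  # adverbs: RB, RBR, RBS
    return "n"      # everything else: fall back to noun, the lemmatizer's default

# Tags copied from the tagger output shown above.
tags = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
        ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
        ('river', 'NN'), ('.', '.')]

pairs = [(token, wordnet_pos(tag)) for token, tag in tags]
print(pairs)
```

With this in place, the body of the loop could be reduced to `wnl.lemmatize(token, pos=wordnet_pos(tag))`.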

### order matters

Order matters in text processing. For example, if you lowercase the text and remove punctuation first, it will be very hard for NLTK to segment it into sentences, because the sentence tokenizer relies on cues such as punctuation and capitalization. So it's often a good idea to keep the original text in one variable for things like sentence segmentation and make text processing changes to a copy.
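
The effect of ordering can be shown with a small sketch. To keep it free of NLTK data downloads, it uses a naive regex splitter as a stand-in for `sent_tokenize`; the helper names `naive_sentences` and `normalize` are made up for this illustration, and a real sentence tokenizer is more robust than this regex.

```python
import re

original = """The quick brown fox jumped over the lazy river. She sells
sea shells by the sea shore. Humpy Dumpy sat on the wall."""

def naive_sentences(text):
    """Stand-in for a sentence tokenizer: split after ., ? or ! plus whitespace."""
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]

def normalize(text):
    """Lowercase, then replace punctuation, newlines, and digits with spaces."""
    return re.sub(r'[.?!,:;()\-\n\d]', ' ', text.lower())

# Segment the ORIGINAL text first, then normalize each sentence separately.
segment_first = [normalize(s).split() for s in naive_sentences(original)]

# Normalize first, then try to segment: the sentence boundaries are gone.
normalize_first = naive_sentences(normalize(original))

print(len(segment_first))    # 3
print(len(normalize_first))  # 1
```

Segmenting the untouched `original` recovers all three sentences, while the normalized copy comes back as one long "sentence".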