## Pre-Processing

This notebook shows how to build a simple pre-processing pipeline. We will use the Brown corpus, which ships with NLTK and contains roughly one million words of American English text published in 1961, organized by genre.

In [1]:
import nltk

# Load the Brown corpus (run nltk.download("brown") once if it is not installed yet)
corpus = nltk.corpus.brown

# List the IDs of the documents in the corpus
print(corpus.fileids())

['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10', 'ca11', 'ca12', 'ca13', 'ca14', 'ca15', 'ca16', 'ca17', 'ca18', 'ca19', 'ca20', 'ca21', 'ca22', 'ca23', 'ca24', 'ca25', 'ca26', 'ca27', 'ca28', 'ca29', 'ca30', 'ca31', 'ca32', 'ca33', 'ca34', 'ca35', 'ca36', 'ca37', 'ca38', 'ca39', 'ca40', 'ca41', 'ca42', 'ca43', 'ca44', 'cb01', 'cb02', 'cb03', 'cb04', 'cb05', 'cb06', 'cb07', 'cb08', 'cb09', 'cb10', 'cb11', 'cb12', 'cb13', 'cb14', 'cb15', 'cb16', 'cb17', 'cb18', 'cb19', 'cb20', 'cb21', 'cb22', 'cb23', 'cb24', 'cb25', 'cb26', 'cb27', 'cc01', 'cc02', 'cc03', 'cc04', 'cc05', 'cc06', 'cc07', 'cc08', 'cc09', 'cc10', 'cc11', 'cc12', 'cc13', 'cc14', 'cc15', 'cc16', 'cc17', 'cd01', 'cd02', 'cd03', 'cd04', 'cd05', 'cd06', 'cd07', 'cd08', 'cd09', 'cd10', 'cd11', 'cd12', 'cd13', 'cd14', 'cd15', 'cd16', 'cd17', 'ce01', 'ce02', 'ce03', 'ce04', 'ce05', 'ce06', 'ce07', 'ce08', 'ce09', 'ce10', 'ce11', 'ce12', 'ce13', 'ce14', 'ce15', 'ce16', 'ce17', 'ce18', 'ce19', 'ce20',

In [2]:
# Print the first 1,000 characters of the first document
print(corpus.raw("ca01")[:1000])



	The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.


	The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.


	The/at September-October/np term/nn jury/nn had/hvd been/ben charged/vbn by/in Fulton/np-tl Superior/jj-tl Court/nn-tl Judge/nn-tl Durwood/np Pye/np to/to investigate/vb reports/nns of/in possible/jj ``/`` irregularities/nns ''/'' in/in the/at hard-fought/jj primary/nn which/wdt was/bedz won/vbn by/in Mayor-nominate/nn-tl Ivan/np Allen/np Jr./

That doesn't look too great. Every token carries a part-of-speech tag (for example `The/at`), and the tags and extra whitespace will get in the way of any analysis. Luckily, NLTK's corpus reader can return the text already split into sentences and words, so instead of reinventing the wheel, let's use that!

In [3]:
doc1 = corpus.sents("ca01")
print(doc1[:3])

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ['The', 'September-October', 'term', 'jury', 'had', 'been', 'charged', 'by', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'to', 'investigate', 'reports', 'of', 'possible', '``', 'irregularities', "''", 'in', 'the', 'hard-fought', 'primary', 'which', 'was', 'won', 'by', 'Mayor-nominate', 'Ivan', 'Allen', 'Jr.', '.']]
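
For intuition about what the corpus reader saves us from doing by hand, the `word/tag` tokens in the raw text can be split with plain string operations. This is a minimal sketch, assuming whitespace-separated `word/tag` tokens as in the raw output above. The helper name `strip_tags` is hypothetical, and this is not how NLTK implements `sents`.

```python
def strip_tags(raw_line):
    """Return the words from a line of whitespace-separated word/tag tokens."""
    words = []
    for item in raw_line.split():
        # Split on the LAST slash so tokens such as "./." keep their word part.
        word, _, _tag = item.rpartition("/")
        words.append(word)
    return words


sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd ./."
print(strip_tags(sample))
# ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', '.']
```

In practice `corpus.sents` is the better choice, since it also takes care of sentence boundaries.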


`corpus.sents` handles sentence and token segmentation for us, which is the first step in our pipeline. Now let's remove punctuation. Once again we can leverage NLTK's built-in methods.

In [4]:
from nltk import word_tokenize

def remove_punc(doc):
    sentences = []
    for sentence in doc:
        sentence = " ".join(sentence)  # converts from list of words to a single string
        words = word_tokenize(sentence)  # separates punctuation
        
        # keeps only purely alphanumeric tokens (this removes punctuation, but
        # also drops hyphenated words), then makes everything lowercase
        words = [word.lower() for word in words if word.isalnum()]
        sentences.append(words)
    return sentences

print("With Punctuation")
print(doc1[1])
print()

doc1 = remove_punc(doc1)

print("Without Punctuation")
print(doc1[1])

With Punctuation
['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.']

Without Punctuation
['the', 'jury', 'further', 'said', 'in', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted']


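Notice how the quotation marks, periods, and commas have all been removed. Because `isalnum()` rejects any token that contains a non-alphanumeric character, the hyphenated words "term-end" and "over-all" were dropped as well. We also made everything lowercase, the second step in the pre-processing pipeline. Next we need to remove stopwords, such as "the", "that", and "had". There are a lot of them, so NLTK comes with a stopword list.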
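
If hyphenated words should survive the punctuation step, a more lenient filter is one option. This is a minimal sketch, assuming we want to keep tokens whose alphanumeric parts are joined by single internal hyphens or apostrophes. The regular expression and the helper name `keep_token` are illustrative and not part of the pipeline in this notebook. The sketch filters the corpus tokens directly instead of re-tokenizing them.

```python
import re

# One or more alphanumeric runs, optionally joined by a single "-" or "'".
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*")


def keep_token(token):
    """Return True for word-like tokens, including hyphenated ones."""
    return WORD_RE.fullmatch(token) is not None


tokens = ['The', 'jury', 'term-end', ',', '``', "Atlanta's", 'over-all', "''", '.']

strict = [t.lower() for t in tokens if t.isalnum()]     # filter used in remove_punc
lenient = [t.lower() for t in tokens if keep_token(t)]  # keeps hyphens/apostrophes

print(strict)   # ['the', 'jury']
print(lenient)  # ['the', 'jury', 'term-end', "atlanta's", 'over-all']
```

The rest of this notebook continues with the output of the original `remove_punc`.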

In [5]:
from nltk.corpus import stopwords

sw = stopwords.words("english")
sw.append("")  # also treat the empty string as a stopword

def remove_sw(doc):
    sentences = []
    for sentence in doc:
        sentence = [word for word in sentence if word not in sw]
        sentences.append(sentence)
    return sentences

print("With Stopwords")
print(doc1[1])
print()

doc1 = remove_sw(doc1)

print("Without Stopwords")
print(doc1[1])

With Stopwords
['the', 'jury', 'further', 'said', 'in', 'presentments', 'that', 'the', 'city', 'executive', 'committee', 'which', 'had', 'charge', 'of', 'the', 'election', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'city', 'of', 'atlanta', 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted']

Without Stopwords
['jury', 'said', 'presentments', 'city', 'executive', 'committee', 'charge', 'election', 'deserves', 'praise', 'thanks', 'city', 'atlanta', 'manner', 'election', 'conducted']


The text already looks more manageable, while still conveying a similar message. For this example, we won't remove any additional words. The next step is stemming/lemmatization. Since our dataset is not massive and we prioritize readability, let's use lemmatization. To lemmatize accurately, the lemmatizer needs each word's part of speech, so we tag every sentence first. For the purpose of this example, we use the tags only for lemmatization and don't save them.
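
By default `pos_tag` returns Penn Treebank tags such as `VBD` or `NNS`, while the WordNet lemmatizer expects one of its own single-letter codes. The sketch below shows the first-letter lookup in isolation. It assumes the usual values of NLTK's constants (`wordnet.NOUN == "n"`, `wordnet.VERB == "v"`, `wordnet.ADJ == "a"`, `wordnet.ADV == "r"`) and uses plain strings so that it runs without NLTK. The helper name `to_wordnet_pos` is hypothetical.

```python
# First letter of a Penn Treebank tag -> WordNet part-of-speech code.
# The string values stand in for wordnet.NOUN, wordnet.VERB, wordnet.ADJ, wordnet.ADV.
WORDNET_MAP = {"N": "n", "V": "v", "J": "a", "R": "r"}


def to_wordnet_pos(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    return WORDNET_MAP.get(tag[:1], "n")


for tag in ["NN", "NNS", "VBD", "JJ", "RB", "IN"]:
    print(tag, "->", to_wordnet_pos(tag))
# NN -> n
# NNS -> n
# VBD -> v
# JJ -> a
# RB -> r
# IN -> n
```

The next cell applies the same lookup through `wordnet_map.get(pos[0], wordnet.NOUN)`.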

In [6]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk.tag import pos_tag

lemmatizer = WordNetLemmatizer()

# Map the first letter of each POS tag to a WordNet part of speech
# (noun, verb, adjective, adverb). Everything else defaults to noun.
wordnet_map = {"N": wordnet.NOUN, "V": wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

def lemmatize(doc):
    sentences = []
    for sentence in doc:
        tagged = pos_tag(sentence)
        sentence = [
            lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN))
            for word, pos in tagged
        ]
        sentences.append(sentence)
    return sentences

print("Before Lemmatization")
print(doc1[1])
print()

doc1 = lemmatize(doc1)

print("After Lemmatization")
print(doc1[1])

Before Lemmatization
['jury', 'said', 'presentments', 'city', 'executive', 'committee', 'charge', 'election', 'deserves', 'praise', 'thanks', 'city', 'atlanta', 'manner', 'election', 'conducted']

After Lemmatization
['jury', 'say', 'presentment', 'city', 'executive', 'committee', 'charge', 'election', 'deserve', 'praise', 'thanks', 'city', 'atlanta', 'manner', 'election', 'conduct']


The verbs "said", "deserves", and "conducted" have been reduced to their base forms ("say", "deserve", and "conduct"). Plural nouns like "presentments" have also been made singular. Since we are not keeping the POS tags, this completes the pre-processing pipeline. Variations of this approach are a useful starting point for most natural language processing projects. Thanks for reading :)