# Text Preprocessing #

Author: Christin Seifert, licensed under the Creative Commons Attribution 3.0 Unported License https://creativecommons.org/licenses/by/3.0/ 

This is a tutorial for text preprocessing. In this tutorial you will see examples of 
* stemming
* lemmatization and
* stopword-removal

It is assumed that you have some general knowledge on 
* .. no particular knowledge required. You should be able to read texts, though ;-)

Let's work with texts from the web. First, let's have a look at which NLTK resources are readily available, so that we do not have to collect them ourselves (Thanks, NLTK folks!).

In [1]:
import nltk
from nltk.corpus import webtext
# this has to be run once to get the corpus
#nltk.download('webtext')

for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:100], '...')

firefox.txt Cookie Manager: "Don't allow sites that set removed cookies to set future cookies" should stay check ...
grail.txt SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop clop clop] 
SOLDIER #1: Halt!  Who ...
overheard.txt White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh, ...
pirates.txt PIRATES OF THE CARRIBEAN: DEAD MAN'S CHEST, by Ted Elliott & Terry Rossio
[view looking straight dow ...
singles.txt 25 SEXY MALE, seeks attrac older single lady, for discreet encounters.
35YO Security Guard, seeking  ...
wine.txt Lovely delicate, fragrant Rhone wine. Polished leather and strawberries. Perhaps a bit dilute, but g ...


The corpus about singles seems the most interesting, so let's load it. Since we want to learn text preprocessing, we load the raw version (meaning no preprocessing is implicitly done during loading).
After loading, let's inspect what is in there.

In [2]:
singles = nltk.corpus.webtext.raw('singles.txt')
singles

'25 SEXY MALE, seeks attrac older single lady, for discreet encounters.\n35YO Security Guard, seeking lady in uniform for fun times.\n40 yo SINGLE DAD, sincere friendly DTE seeks r/ship with fem age open S/E\n44yo tall seeks working single mum or lady below 45 fship rship. Nat Open\n6.2 35 yr old OUTGOING M seeks fem 28-35 for o/door sports - w/e away\nA professional business male, late 40s, 6 feet tall, slim build, well groomed, great personality, home owner, interests include the arts travel and all things good, Ringwood area, is seeking a genuine female of similar age or older, in same area or surrounds, for a meaningful long term rship. Looking forward to hearing from you all.\nABLE young man seeks, sexy older women. Phone for fun ready to play\nAFFECTIONATE LADY Sought by generous guy, 40s, mutual fulfillment\nARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone 

This is one big string. The reader kept the line endings `\n`, but did not split the text at them. We want one list entry per line, so let's split the raw text at the line endings and then inspect a few random lines.

In [5]:
import random 
slines = singles.splitlines()
# print five randomly chosen lines
for _ in range(5):
    print(random.choice(slines))

CUDDLY FULL FIGURED LADY 50 plus sought by Australian gent, early 60s, financially secure, non-drinker, non smoker, for permanent relationship.
GREEK Male Late 30''s seeks Lady up to 37yrs for long term relationship
MARRIED 32 personal trainer looking for married woman age open for fun
SWM 45 DtoE, honest, S/D, 178cm, 79kg. Clean cut, intelligent, natur ist seeks petite open minded red hd or brunette. Asian, eurasian welcome. Home bch drives. Fship, poss rship.
48YO SWF WLTM a 42-54yo genuine, caring, honest and normal man for fship, poss rship. S/S, S/D, GSOH. Photo pls.


Did you notice that some people write words in capitals while others don't? If not, run the above lines of code again to get some other examples. For most applications `LADY`, `Lady` and `lady` can and should be handled the same, so let's unify them. Note that `s.lower()` works on a single string only, so in this tutorial we apply it to the raw text and then split the lines again.

In [7]:
slines2 = singles.lower().splitlines()
# print five randomly chosen lowercased lines
for _ in range(5):
    print(random.choice(slines2))

about me 53 y.o. lady, 5 ft 5, non smoker, social drinker. i enjoy gardening, music, movies, walking, dining out and quiet nights at home, v8 motor racing. i am an easygoing, honest and caring person, seeking gentleman 50 - 65 for friendship with view to relationship.
a friend, a lover self employed single 47 yo romantic easy going gsoh honest reliable looking to find a friend.
6.2 35 yr old outgoing m seeks fem 28-35 for o/door sports - w/e away
single good looking 45 yo seeks 40+ lady casual f/ship & fun married ok
beautiful, intelligent caring, musical shy, late 20s, size 12, long blonde hair and likes to drink guiness. if this is you then i want you.
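
Because `lower()` acts on one string at a time, the alternative to lowercasing the raw text is to lowercase every line after splitting. The following is only a minimal sketch with a short made-up sample (`raw` is a hypothetical stand-in for the corpus text); for such plain text both routes should give the same list.

```python
# Hypothetical three-line sample standing in for the raw corpus text.
raw = "AFFECTIONATE LADY sought by generous guy\nGreek Male seeks Lady\nabout me 53 y.o. lady"

# Route 1 (used in this tutorial): lowercase the raw string, then split.
lower_then_split = raw.lower().splitlines()

# Route 2: split first, then lowercase every line individually.
split_then_lower = [line.lower() for line in raw.splitlines()]

print(lower_then_split == split_then_lower)
print(split_then_lower)
```

Route 2 is the one to reach for when the lines are already in a list and the raw string is no longer at hand.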


## Tokenization ##

Next we figure out what our individual tokens are. Tokens might be words, but also numbers. That might seem easy: just use the whitespace character for separation. But actually it's not so easy. For instance, should "40-45" be one or two tokens? Here we will use a simple tokenizer, but generally it is worth investing some time in figuring out what individual tokens should be.
We will only do this for a single example line from our corpus.

In [8]:
line = slines2[41]
print(line)
tokens = nltk.word_tokenize(line)
print(tokens)

romantic sexy country guy 58 ns seeks similar ns lady for fun and friendship.
['romantic', 'sexy', 'country', 'guy', '58', 'ns', 'seeks', 'similar', 'ns', 'lady', 'for', 'fun', 'and', 'friendship', '.']
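
To make the "40-45" question concrete, here is a small sketch that compares two simple tokenization strategies using only the standard library. It is not what `nltk.word_tokenize` does internally; the example sentence and the regular expression are assumptions chosen for illustration.

```python
import re

# Hypothetical example line written in the style of the corpus.
text = "lady 40-45 sought for f/ship, poss r/ship."

# Option 1: split on whitespace only; punctuation stays attached to words.
whitespace_tokens = text.split()

# Option 2: runs of letters/digits are tokens, and every other
# non-space character becomes a token of its own.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting "40-45" stays one token but "r/ship." keeps its trailing full stop; with the regular expression the full stop is separated, but "40-45" falls apart into three tokens. Which behaviour is preferable depends on the application.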


## Stopword removal ##
For some applications, words like "not" and "we" carry little information but lead to a bigger vocabulary, so we might decide to remove them from the text. Lists of stopwords are available for many languages. Let's see what happens with our text. First, let's have a look at the English stopwords.

In [9]:
from nltk.corpus import stopwords
# has to be run only once to get the data
#nltk.download('stopwords')
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Now let's remove them from our tokens.

In [10]:
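# build the set once instead of reloading the list for every token
english_stopwords = set(stopwords.words('english'))
filtered_words = [word for word in tokens if word not in english_stopwords]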

print(tokens)
print(filtered_words)

['romantic', 'sexy', 'country', 'guy', '58', 'ns', 'seeks', 'similar', 'ns', 'lady', 'for', 'fun', 'and', 'friendship', '.']
['romantic', 'sexy', 'country', 'guy', '58', 'ns', 'seeks', 'similar', 'ns', 'lady', 'fun', 'friendship', '.']
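
The filtering step can also be wrapped in a small reusable function. The sketch below uses a tiny hand-picked stop list (`stop_set` is an assumption for illustration, not NLTK's English list) so that it runs without any downloads. The second example hints at why stopword removal suits only some applications: dropping "not" can change what a text says.

```python
# Small hand-picked stop list for illustration only; NLTK's English list is much longer.
stop_set = {"for", "and", "not", "a", "i", "am"}

def remove_stopwords(tokens, stop_set):
    """Return the tokens that are not in stop_set, preserving their order."""
    return [tok for tok in tokens if tok not in stop_set]

tokens_a = ["seeks", "similar", "ns", "lady", "for", "fun", "and", "friendship", "."]
tokens_b = ["i", "am", "not", "a", "smoker"]

print(remove_stopwords(tokens_a, stop_set))
print(remove_stopwords(tokens_b, stop_set))
```

Using a `set` keeps each membership test cheap, which matters once the whole corpus is filtered rather than a single line.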


## Stemming and Lemmatization ##
For some applications, words with different inflections should be treated equally. For instance, an occurrence of the word `car` then has the same importance as an occurrence of the word `cars`. Stemmers and lemmatizers address this by conflating words into their stems or lemmas. Let's see how they differ. First, we apply a very well-known stemmer, the Porter stemmer.
You can also try this interactively in [this demo](https://text-processing.com/demo/stem/).

In [11]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
porter_tokens = [stemmer.stem(word) for word in tokens]
print(porter_tokens)

['romant', 'sexi', 'countri', 'guy', '58', 'ns', 'seek', 'similar', 'ns', 'ladi', 'for', 'fun', 'and', 'friendship', '.']


In [12]:
# we only need to do this once
#nltk.download('wordnet')
from nltk.stem.wordnet import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word) for word in tokens]
print(lemmas)

['romantic', 'sexy', 'country', 'guy', '58', 'n', 'seek', 'similar', 'n', 'lady', 'for', 'fun', 'and', 'friendship', '.']


So, while a stemmer bluntly removes suffixes, a lemmatizer attempts to map a word to a canonical form (a form you find in a dictionary and that is actually a word in the language). Neither can handle tokens that are not words (e.g. "42"). Abbreviations are a problem too: in the output above the lemmatizer turned `ns` into `n`.
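
The difference between the two ideas can be sketched with toy code. The suffix rules and the mini dictionary below are hypothetical and deliberately tiny; they are neither the Porter algorithm nor WordNet, only an illustration of "strip suffixes by rule" versus "look the word up".

```python
def toy_stem(word):
    """Strip the first matching suffix; a toy rule set, NOT the Porter algorithm."""
    for suffix in ("ies", "ing", "s"):
        # Only strip if at least three characters remain afterwards.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini dictionary mapping inflected forms to their lemmas.
TOY_LEMMAS = {"cars": "car", "ladies": "lady", "seeks": "seek", "seeking": "seek"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown tokens are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["cars", "ladies", "seeking", "ns", "42"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

For "ladies" the toy stemmer yields "lad", which is not the intended base form, while the dictionary lookup returns "lady". Tokens such as "ns" and "42" pass through both functions unchanged.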


So much for this notebook. That's it. The end. Go!