## Introduction

**Natural language processing** (NLP) is the task of making computers understand and produce human languages. It always starts with a corpus, i.e. a _body of text_.

## What is a corpus?
There are many corpora (*plural of corpus*) available in NLTK. Let's start with an English one called the **Brown corpus**.

When using a new corpus in NLTK for the first time, download it with the `nltk.download()` function, e.g.

In [1]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to /home/avi/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

Once it is downloaded, you can import it like this:

In [2]:
from nltk.corpus import brown

In [3]:
brown.words()

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

In [6]:
len(brown.words())  # No. of words in the corpus

1161192

In [9]:
brown.sents()  # Returns a list of sentences, each a list of word strings

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In [10]:
brown.sents(fileids='ca01')  # You can access a specific file with the 'fileids' argument.

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

The actual `brown` corpus data is **packaged as raw text files**, and you can find their IDs with:

In [11]:
len(brown.fileids())

500

In [12]:
print(brown.fileids()[:100])

In [13]:
print(brown.raw('cb01').strip()[:1000])  # Show only the first 1000 characters

Assembly/nn-hl session/nn-hl brought/vbd-hl much/ap-hl good/nn-hl 
The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ,/, has/hvz performed/vbn in/in an/at atmosphere/nn of/in crisis/nn and/cc struggle/nn from/in the/at day/nn it/pps convened/vbd ./.
It/pps was/bedz faced/vbn immediately/rb with/in a/at showdown/nn on/in the/at schools/nns ,/, an/at issue/nn which/wdt was/bedz met/vbn squarely/rb in/in conjunction/nn with/in the/at governor/nn with/in a/at decision/nn not/* to/to risk/vb abandoning/vbg public/nn education/nn ./.


	There/ex followed/vbd the/at historic/jj appropriations/nns and/cc budget/nn fight/nn ,/, in/in which/wdt the/at General/jj-tl Assembly/nn-tl decided/vbd to/to tackle/vb executive/nn powers/nns ./.
The/at final/jj decision/nn went/vbd to/in the/at executive/nn but/cc a/at way/nn has/hvz been/ben opened/vbn for/in strengthening/vbg budgeting/vbg procedures/nns and/cc to/to provide/vb legislators/nns information/nn they/ppss need/vb ./.




Notice that every token in the raw file has the form `word/tag`: a slash separates the word from its label, and punctuation marks are labelled the same way (e.g. `,/,` and `./.`).
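The `word/tag` layout can be unpacked with plain string operations. The sketch below is a minimal illustration, assuming tokens are separated by whitespace and the tag follows the last slash; `parse_tagged` is a hypothetical helper written for this example, not an NLTK function.

```python
def parse_tagged(raw_text):
    """Split whitespace-separated 'word/tag' tokens into (word, tag) tuples."""
    pairs = []
    for token in raw_text.split():
        # Split on the LAST slash so that a word containing '/' stays intact.
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs


sample = "The/at General/jj-tl Assembly/nn-tl ,/, which/wdt adjourns/vbz today/nr ./."
for word, tag in parse_tagged(sample):
    print(word, tag)
```

Splitting on the last slash rather than the first is a deliberate choice here, so that a token such as `r/ship/nn` would still yield the word `r/ship`.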

From this we can see that our next steps should be **word tokenization** and **sentence tokenization**.

## Tokenization

**Sentence tokenization** is the process of splitting up strings into "sentences"

**Word tokenization** is the process of splitting up sentences into "words"
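To make the two definitions concrete, here is a deliberately naive sketch using only regular expressions. It is not how NLTK tokenizes; `naive_sent_tokenize` and `naive_word_tokenize` are hypothetical names, and the rules assume well-punctuated English text.

```python
import re


def naive_sent_tokenize(text):
    # Split after '.', '?' or '!' whenever whitespace follows.
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]


def naive_word_tokenize(sentence):
    # Either a word (optionally joined by '/', '-' or an apostrophe)
    # or a single punctuation character.
    return re.findall(r"\w+(?:[/'-]\w+)*|[^\w\s]", sentence)


text = "Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH."
for sent in naive_sent_tokenize(text):
    print(naive_word_tokenize(sent))
```

Rules this simple break on abbreviations such as `y.o.`, which is one reason to prefer a trained tokenizer.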

In [14]:
# Let's try webtext
nltk.download("webtext")

[nltk_data] Downloading package webtext to /home/avi/nltk_data...
[nltk_data]   Package webtext is already up-to-date!


True

In [15]:
from nltk.corpus import webtext

webtext.fileids()

['firefox.txt',
 'grail.txt',
 'overheard.txt',
 'pirates.txt',
 'singles.txt',
 'wine.txt']

In [16]:
# Each line is one advertisement.
print(webtext.raw('singles.txt').strip()[:1000])
print("\n")
for i, line in enumerate(webtext.raw('singles.txt').split('\n')):
    if i >= 10:  # Let's take a look at the first 10 ads (indices 0 to 9).
        break
    
    print(str(i) + ':\t' + line)

25 SEXY MALE, seeks attrac older single lady, for discreet encounters.
35YO Security Guard, seeking lady in uniform for fun times.
40 yo SINGLE DAD, sincere friendly DTE seeks r/ship with fem age open S/E
44yo tall seeks working single mum or lady below 45 fship rship. Nat Open
6.2 35 yr old OUTGOING M seeks fem 28-35 for o/door sports - w/e away
A professional business male, late 40s, 6 feet tall, slim build, well groomed, great personality, home owner, interests include the arts travel and all things good, Ringwood area, is seeking a genuine female of similar age or older, in same area or surrounds, for a meaningful long term rship. Looking forward to hearing from you all.
ABLE young man seeks, sexy older women. Phone for fun ready to play
AFFECTIONATE LADY Sought by generous guy, 40s, mutual fulfillment
ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. 

In [17]:
# Let's explore candidate no. 8
single_no8 = webtext.raw('singles.txt').split('\n')[8]
print(single_no8)

ARE YOU ALONE or lost in a r/ship too, with no hope in sight? Maybe we could explore new beginnings together? Im 45 Slim/Med build, GSOH, high needs and looking for someone similar. You WONT be disappointed.


## Sentence Tokenization

In NLTK, `sent_tokenize()` is the default tokenizer function that you can use to split strings into "sentences".

In [18]:
from nltk import sent_tokenize, word_tokenize

sent_tokenize(single_no8)

['ARE YOU ALONE or lost in a r/ship too, with no hope in sight?',
 'Maybe we could explore new beginnings together?',
 'Im 45 Slim/Med build, GSOH, high needs and looking for someone similar.',
 'You WONT be disappointed.']

In [19]:
for sent in sent_tokenize(single_no8):
    print(word_tokenize(sent))

['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['You', 'WONT', 'be', 'disappointed', '.']


## Lowercasing

Tokenizers use capitalization as a cue for where to split, so lowercasing *before* tokenizing would be sub-optimal. Lowercase the tokens after tokenization instead.

In [24]:
for sent in sent_tokenize(single_no8):
    # It's a little inefficient to loop through each word,
    # but sometimes it helps to get better tokens.
    print([word.lower() for word in word_tokenize(sent)])
    # alternatively:
    # print(list(map(str.lower, word_tokenize(sent))))

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?']
['maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?']
['im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.']
['you', 'wont', 'be', 'disappointed', '.']
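To see why the order matters, consider a toy splitter that relies on the capitalization cue. This is only a sketch of the idea, not NLTK's algorithm; `split_on_capital_cue` is a hypothetical helper that splits only where a capital letter follows the sentence-final punctuation.

```python
import re


def split_on_capital_cue(text):
    # Split after '.', '?' or '!' only when the next word starts with a capital letter.
    return re.split(r'(?<=[.?!])\s+(?=[A-Z])', text)


text = "He arrived at 5 p.m. yesterday. She left early."
print(split_on_capital_cue(text))          # the capital 'S' marks the boundary
print(split_on_capital_cue(text.lower()))  # the cue is gone, nothing is split
```

Once the text is lowercased, the boundary after "yesterday." looks exactly like the non-boundary after "p.m.", so the splitter can no longer tell them apart.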


In [28]:
print(word_tokenize(single_no8))  # Treats the whole line as one document.  

['ARE', 'YOU', 'ALONE', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'Maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'Im', '45', 'Slim/Med', 'build', ',', 'GSOH', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'You', 'WONT', 'be', 'disappointed', '.']


## Tangential Note

NLTK's `sent_tokenize()` uses the Punkt sentence tokenizer. Punkt is a statistical model, so it applies the knowledge it has learnt from previous data.

Generally, it **works for most cases on well-formed text**, but if your data is different, e.g. noisy user-generated text, you might have to train a new model.

For example, if we look at candidate no. 9 (shown below), we see that it splits on `y.o.` (it thinks that this is the end of a sentence) but does not split on `&c.` (it thinks that this is an abbreviation, like `Mr.` or `Inc.`).

In [36]:
single_no9 = webtext.raw('singles.txt').split('\n')[9]
print(single_no9)

AMIABLE 43 y.o. gentleman with European background, 170 cm, medium build, employed, never married, no children. Enjoys sports, music, cafes, beach &c. Seeks an honest, attractive lady with a European background, without children, who would like to get married and have chil dren in the future. 29-39 y.o. Prefer non-smoker and living in Adelaide.


In [38]:
# Not splitting on &c but splits on y.o.
sent_tokenize(single_no9)

['AMIABLE 43 y.o.',
 'gentleman with European background, 170 cm, medium build, employed, never married, no children.',
 'Enjoys sports, music, cafes, beach &c. Seeks an honest, attractive lady with a European background, without children, who would like to get married and have chil dren in the future.',
 '29-39 y.o.',
 'Prefer non-smoker and living in Adelaide.']
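The effect of abbreviation knowledge can be illustrated with a small rule-based splitter. This is a sketch under stated assumptions and not Punkt itself: Punkt learns abbreviations from data, whereas the hypothetical `split_sentences` helper below is simply handed a set of abbreviations.

```python
import re


def split_sentences(text, abbreviations):
    """Split after '.', '?' or '!' unless the token ending there is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r'[.?!](?=\s|$)', text):
        # The whitespace-delimited token that ends at this punctuation mark.
        token = text[start:match.end()].split()[-1].lower()
        if token in abbreviations:
            continue  # treat the period as part of the abbreviation
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


ad = ("AMIABLE 43 y.o. gentleman with European background. "
      "Enjoys sports, music, cafes, beach &c. Seeks an honest lady.")
print(split_sentences(ad, abbreviations=set()))
print(split_sentences(ad, abbreviations={'y.o.'}))
```

The trade-off is that a listed abbreviation never ends a sentence, so a line such as `29-39 y.o. Prefer non-smoker` would stay glued together.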

Next, we introduce a new concept: stopwords.

## Stopwords

**Stopwords** are non-content words that serve primarily a grammatical function.

In NLTK, you can access them as follows. 

In [39]:
from nltk.corpus import stopwords

stopwords_en = stopwords.words('english')
print(stopwords_en)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

We remove stopwords to strip low-information words from the text while keeping the words that carry its gist.

In [44]:
# Treat the multiple sentences as one document (no need for sentence tokenization)
# Tokenize and lowercase
single_no8_tokenized_lowered = list(map(str.lower, word_tokenize(single_no8)))
print(single_no8_tokenized_lowered)

['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too', ',', 'with', 'no', 'hope', 'in', 'sight', '?', 'maybe', 'we', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'and', 'looking', 'for', 'someone', 'similar', '.', 'you', 'wont', 'be', 'disappointed', '.']


## Let's try to remove the stopwords using the `english` stopwords list in NLTK

In [49]:
# We already have the stopwords in stopwords_en

# List comprehension.
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en])

['alone', 'lost', 'r/ship', ',', 'hope', 'sight', '?', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', '?', 'im', '45', 'slim/med', 'build', ',', 'gsoh', ',', 'high', 'needs', 'looking', 'someone', 'similar', '.', 'wont', 'disappointed', '.']
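The same filter can be written against a `set` instead of a list, which makes each membership test constant time on average rather than a scan of the whole list. The sketch below uses a hypothetical `mini_stoplist` so that it runs without NLTK data; the real list is much longer.

```python
# Hypothetical mini stoplist for illustration only.
mini_stoplist = ['are', 'you', 'or', 'in', 'a', 'too', 'with', 'no']
stopset = set(mini_stoplist)  # convert once, reuse for every lookup


def remove_stopwords(tokens, stopset):
    # Keep only the tokens that are not in the stoplist.
    return [token for token in tokens if token not in stopset]


tokens = ['are', 'you', 'alone', 'or', 'lost', 'in', 'a', 'r/ship', 'too']
print(remove_stopwords(tokens, stopset))
```

For a single short ad the difference is negligible, but it adds up when filtering a whole corpus.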


In [84]:
# We can remove punctuation as well, using string.punctuation
from string import punctuation
# It's a string, so we convert it to a set before combining it with the stopwords
print('From string.punctuation:', type(punctuation), punctuation)

## Combining the punctuation with the stopwords from NLTK

In [85]:
print(stopwords_en)
stopwords_en_withpunct = set(stopwords_en).union(set(punctuation))
print(stopwords_en_withpunct)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'no

In [89]:
# Removing stopwords with punctuations from single_no8_tokenized_lowered
print([word for word in single_no8_tokenized_lowered if word not in stopwords_en_withpunct])

['alone', 'lost', 'r/ship', 'hope', 'sight', 'maybe', 'could', 'explore', 'new', 'beginnings', 'together', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'needs', 'looking', 'someone', 'similar', 'wont', 'disappointed']
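One caveat: `set(punctuation)` contains single characters only, so multi-character punctuation tokens such as the `` `` `` and `''` seen in the Brown sentences would survive this filter. A possible workaround, sketched here with a hypothetical `is_punct` helper, is to drop every token made up entirely of punctuation characters.

```python
from string import punctuation

PUNCT_CHARS = set(punctuation)


def is_punct(token):
    # True when the token is non-empty and every character is punctuation.
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)


tokens = ['produced', '``', 'no', 'evidence', "''", 'that', '...', '.']
char_filtered = [t for t in tokens if t not in PUNCT_CHARS]
full_filtered = [t for t in tokens if not is_punct(t)]
print(char_filtered)  # multi-character punctuation is still present
print(full_filtered)  # only the words remain
```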


## Using a stronger/longer list of stopwords

The previous output still has dangling modal verbs (e.g. 'could', 'wont').

We can combine the stopwords in NLTK with other stopword lists found online.

Personally, I like to use [stopwords-json](https://github.com/6/stopwords-json) because it has stopwords in 50 languages.

In [90]:
# Stopwords from stopwords-json
stopwords_json = {"en":["a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","an","and","another","any","anybody","anyhow","anyone","anything","anyway","anyways","anywhere","apart","appear","appreciate","appropriate","are","aren't","around","as","aside","ask","asking","associated","at","available","away","awfully","b","be","became","because","become","becomes","becoming","been","before","beforehand","behind","being","believe","below","beside","besides","best","better","between","beyond","both","brief","but","by","c","c'mon","c's","came","can","can't","cannot","cant","cause","causes","certain","certainly","changes","clearly","co","com","come","comes","concerning","consequently","consider","considering","contain","containing","contains","corresponding","could","couldn't","course","currently","d","definitely","described","despite","did","didn't","different","do","does","doesn't","doing","don't","done","down","downwards","during","e","each","edu","eg","eight","either","else","elsewhere","enough","entirely","especially","et","etc","even","ever","every","everybody","everyone","everything","everywhere","ex","exactly","example","except","f","far","few","fifth","first","five","followed","following","follows","for","former","formerly","forth","four","from","further","furthermore","g","get","gets","getting","given","gives","go","goes","going","gone","got","gotten","greetings","h","had","hadn't","happens","hardly","has","hasn't","have","haven't","having","he","he's","hello","help","hence","her","here","here's","hereafter","hereby","herein","hereupon","hers","herself","hi","him","himself","his","hither","hopefully","how","howbeit","however","i","i'd","i'll","i'm","i've","ie","if","ignored","immediate","in","inasmuch","inc","indeed","indicate","indicated","indicates","inner","insofar","instead","into",
"inward","is","isn't","it","it'd","it'll","it's","its","itself","j","just","k","keep","keeps","kept","know","known","knows","l","last","lately","later","latter","latterly","least","less","lest","let","let's","like","liked","likely","little","look","looking","looks","ltd","m","mainly","many","may","maybe","me","mean","meanwhile","merely","might","more","moreover","most","mostly","much","must","my","myself","n","name","namely","nd","near","nearly","necessary","need","needs","neither","never","nevertheless","new","next","nine","no","nobody","non","none","noone","nor","normally","not","nothing","novel","now","nowhere","o","obviously","of","off","often","oh","ok","okay","old","on","once","one","ones","only","onto","or","other","others","otherwise","ought","our","ours","ourselves","out","outside","over","overall","own","p","particular","particularly","per","perhaps","placed","please","plus","possible","presumably","probably","provides","q","que","quite","qv","r","rather","rd","re","really","reasonably","regarding","regardless","regards","relatively","respectively","right","s","said","same","saw","say","saying","says","second","secondly","see","seeing","seem","seemed","seeming","seems","seen","self","selves","sensible","sent","serious","seriously","seven","several","shall","she","should","shouldn't","since","six","so","some","somebody","somehow","someone","something","sometime","sometimes","somewhat","somewhere","soon","sorry","specified","specify","specifying","still","sub","such","sup","sure","t","t's","take","taken","tell","tends","th","than","thank","thanks","thanx","that","that's","thats","the","their","theirs","them","themselves","then","thence","there","there's","thereafter","thereby","therefore","therein","theres","thereupon","these","they","they'd","they'll","they're","they've","think","third","this","thorough","thoroughly","those","though","three","through","throughout","thru","thus","to","together","too","took","toward","towards","tried","tries","truly","try",
"trying","twice","two","u","un","under","unfortunately","unless","unlikely","until","unto","up","upon","us","use","used","useful","uses","using","usually","uucp","v","value","various","very","via","viz","vs","w","want","wants","was","wasn't","way","we","we'd","we'll","we're","we've","welcome","well","went","were","weren't","what","what's","whatever","when","whence","whenever","where","where's","whereafter","whereas","whereby","wherein","whereupon","wherever","whether","which","while","whither","who","who's","whoever","whole","whom","whose","why","will","willing","wish","with","within","without","won't","wonder","would","wouldn't","x","y","yes","yet","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves","z","zero"]}

In [91]:
stopwords_json_en = set(stopwords_json['en'])
stopwords_nltk_en = set(stopwords.words('english'))
stopwords_punct = set(punctuation)

# Combine the stopwords. The result is a lot longer, so I'm not printing it out.

stoplist_combined = set.union(stopwords_json_en, stopwords_nltk_en, stopwords_punct)


# Remove the stopwords from single_no8_tokenized_lowered
print([word for word in single_no8_tokenized_lowered if word not in stoplist_combined])

['lost', 'r/ship', 'hope', 'sight', 'explore', 'beginnings', 'im', '45', 'slim/med', 'build', 'gsoh', 'high', 'similar', 'wont', 'disappointed']
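Tokens such as 'im' and 'wont' survive because the stoplists spell them with an apostrophe ("i'm", "won't"), while this noisy text drops it. One possible workaround, shown as a sketch with the hypothetical helper `add_apostrophe_free_variants` and a tiny stand-in stoplist, is to add apostrophe-free variants to the stoplist.

```python
def add_apostrophe_free_variants(stoplist):
    # For every entry containing an apostrophe, also add the spelling without it.
    return set(stoplist) | {w.replace("'", "") for w in stoplist if "'" in w}


# Tiny stand-in for the combined stoplist, for illustration only.
mini_stoplist = {"won't", "i'm", "could"}
expanded = add_apostrophe_free_variants(mini_stoplist)

tokens = ['im', '45', 'wont', 'be', 'disappointed']
print([t for t in tokens if t not in expanded])
```

This can over-remove: stripping the apostrophe from "we'll" or "i'll" yields "well" and "ill", which are ordinary content words, so it is worth inspecting the added variants before using them.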


## Stemming and Lemmatization

Stemming and lemmatization both map words to a root form, e.g. "walks", "walking", "walked" to "walk".

Both rely on hand-written rules, often expressed as regexes, to find the root word.

**Stemming** : Trying to shorten a word with simple regex rules.

(in): having

(out): hav

**Lemmatization** : Trying to find the root word with linguistic rules (with the use of regexes)

(in): having

(out): have
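The contrast between the two can be sketched in a few lines. This is a toy illustration, not the Porter stemmer or the WordNet lemmatizer: `naive_stem` applies a single suffix-stripping regex, and `naive_lemmatize` first consults a tiny hypothetical lexicon (`IRREGULAR`) before falling back to the same rule.

```python
import re

# Strip '-ing', '-ed' or '-s' when at least three word characters precede it.
SUFFIX = re.compile(r'(?<=\w{3})(?:ing|ed|s)$')

# Hypothetical toy lexicon; a real lemmatizer consults a resource such as WordNet.
IRREGULAR = {'having': 'have'}


def naive_stem(word):
    return SUFFIX.sub('', word.lower())


def naive_lemmatize(word):
    word = word.lower()
    return IRREGULAR.get(word, naive_stem(word))


for word in ['having', 'walks', 'walking', 'walked']:
    print(word, naive_stem(word), naive_lemmatize(word))
```

The stemmer is content with a truncated string such as "hav", whereas the lemmatizer aims for an actual dictionary word.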

There are various stemmers and one lemmatizer in NLTK, the most common being:

* **Porter Stemmer**

* **Wordnet Lemmatizer**


In [101]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

for word in ['Walking', 'JOKING', 'walked']:
    print(porter.stem(word))  # stem and convert to lowercase 

walk
joke
walk


In [98]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

for word in ['Walking', 'walks', 'walked']:
    print(wnl.lemmatize(word))  # only lemmatize

Walking
walk
walked


## The lemmatizer is pretty complicated, it needs parts of speech (POS) tags.

We won't cover what POS tags are today, so I'll just show how to "whip" the lemmatizer into doing what you need.

By default, the `WordNetLemmatizer.lemmatize()` function assumes that the word is a noun if there's no explicit POS tag in the input.



In [106]:
from nltk import pos_tag

# 'pos_tag' takes a tokenized sentence as input and returns a list of (word, tag) tuples

walking_tagged = pos_tag(word_tokenize('He is walking to school'))
print(walking_tagged)

[('He', 'PRP'), ('is', 'VBZ'), ('walking', 'VBG'), ('to', 'TO'), ('school', 'NN')]


In [105]:
def penn2morphy(penntag):
    """
    Converts Penn Treebank tags to Wordnet
    :param penntag: 
    :return: 
    """
    
    morphy_tag = {'NN':'n', 'JJ':'a',
                  'VB':'v', 'RB': 'r'}
    
    try:
        return morphy_tag[penntag[:2]]