# Module 2: Basic Natural Language Processing

## 2.1.- Basic NLP Tasks with NLTK (Natural Language ToolKit)

In [None]:
import nltk
nltk.download("book")
from nltk.book import *

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\aleex\AppData\Roaming\nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\aleex\AppData\Roaming\nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     C:\Users\aleex\AppData\Roaming\nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\aleex\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     C:\Users\aleex\AppData\Roaming\nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     C:\Users\aleex\AppData\R

In [None]:
text1

In [None]:
sents()

### Counting vocabulary of words

In [None]:
text7

In [None]:
sent7

In [None]:
len(sent7)

In [None]:
len(text7)

In [None]:
len(set(text7))  # number of unique tokens (duplicates removed)

In [None]:
list(set(text7))[:10] # first 10 unique words
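
Note that `set()` treats tokens that differ only in case as distinct entries. A minimal sketch on a toy token list (invented for illustration, not taken from `text7`) shows how lowercasing before counting changes the vocabulary size:

```python
# Toy token list, invented for illustration (not taken from text7)
tokens = ["The", "cat", "saw", "the", "other", "cat", "."]

vocab = set(tokens)                        # case-sensitive vocabulary
vocab_lower = {t.lower() for t in tokens}  # case-insensitive vocabulary

print(len(tokens))       # total number of tokens
print(len(vocab))        # unique tokens; "The" and "the" are counted separately
print(len(vocab_lower))  # unique tokens after lowercasing
```

Whether to lowercase first depends on the task: it merges sentence-initial words with their usual form, but it also merges proper nouns with common words.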

### Frequency of words

In [None]:
dist = FreqDist(text7)  # maps each token to the number of times it appears in the text
dist

In [None]:
len(dist.keys())  # dist.keys() holds the unique tokens

In [None]:
vocab1 = dist.keys()
list(vocab1)[:10]

In [None]:
dist['as'] # how many times this word appears

In [None]:
# every word longer than 5 characters that appears more than 100 times
freqwords = [word for word in vocab1 if len(word) > 5 and dist[word] > 100]
freqwords

In [None]:
# same filter, iterating over the unique tokens of the text instead of dist.keys()
freqwords2 = [word for word in set(text7) if len(word) > 5 and dist[word] > 100]
freqwords2

In [None]:
freqwords_all = [word.strip() for word in text7 if dist[word]>150]
list(set(freqwords_all))
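
`FreqDist` behaves like a dictionary of counts, so it can be explored on a small input before running it on a full corpus. The sketch below uses a toy token list (an assumption for illustration, not part of the NLTK book data):

```python
from nltk import FreqDist

# Toy token list, invented for illustration
toy_tokens = "the cat and the dog and the bird".split()
toy_dist = FreqDist(toy_tokens)

print(toy_dist["the"])          # count of a single token
print(toy_dist.most_common(2))  # the two most frequent tokens with their counts
print(toy_dist["fish"])         # an unseen token counts as 0 instead of raising KeyError
```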

### Tasks for Word Manipulation

#### Normalization and stemming - nltk.PorterStemmer().stem('')

*Stemming reduces the different inflected forms of a word to a common root (the stem)*

In [None]:
input1 = "List listed lists listing listings" # Different forms of the same word
words1 = input1.lower().split(' ') # NORMALIZATION
words1

In [None]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]
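
As a self-contained sketch of the same idea, the block below checks that every form collapses to one stem, and then stems one extra word (chosen here only for illustration) to show that a stem is not guaranteed to be a dictionary word:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

list_forms = ["list", "listed", "lists", "listing", "listings"]
stems = [stemmer.stem(w) for w in list_forms]
print(stems)
print(set(stems))  # all forms collapse to a single stem

# Stems are not guaranteed to be dictionary words; inspect the output for this one
print(stemmer.stem("universal"))
```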

#### Lemmatization

*Lemmatization is like stemming, but the resulting forms (lemmas) are all valid words*

In [None]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

In [None]:
[porter.stem(t) for t in udhr[:20]]

In [None]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

#### Tokenization

*Tokenization splits a sentence into words/tokens*

In [None]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

In [None]:
# Punctuation is separated from the word it is attached to, which str.split(' ') cannot do
nltk.word_tokenize(text11)
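
`nltk.word_tokenize` depends on downloaded tokenizer data. As a sketch of the same idea without that dependency, the rule-based `TreebankWordTokenizer` (a swapped-in tokenizer, not the function used above) can be compared side by side with `str.split`:

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "Children shouldn't drink a sugary drink before bed."

naive_tokens = sentence.split(" ")
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(naive_tokens)     # "bed." keeps its period attached
print(treebank_tokens)  # the period and the "n't" contraction become separate tokens
```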

In [None]:
# Besides word tokenization, NLTK also offers sentence tokenization
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

In [None]:
sentences

*The splitting is not only effective but also sensible: it does not break inside "U.S.", even though it contains two periods, nor inside "2.99", because it recognizes that those periods do not end a sentence*

In [None]:
text12.split('.')  # difference between sentence tokenization and simply splitting the text on periods
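
To see why a dedicated sentence tokenizer is needed, the following standard-library sketch applies two naive rules to the same text. The regular expression is an assumption chosen for illustration, not what NLTK does internally; it still breaks after the abbreviation "U.S.", which is exactly the case a trained sentence tokenizer is designed to handle:

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Rule 1: splitting on every period also cuts inside "U.S." and "$2.99"
period_pieces = text.split(".")
print(len(period_pieces))

# Rule 2: split on whitespace that follows '.', '!' or '?'
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
for s in naive_sentences:
    print(repr(s))
```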

## 2.2.- Advanced NLP Tasks with NLTK

### POS Tagging (Part-Of-Speech Tagging) with NLTK

In [None]:
import nltk
nltk.help.upenn_tagset('MD')

**Splitting a sentence into words / tokens**

In [None]:
text11

In [None]:
text13 = nltk.word_tokenize(text11)
text13

**Tagging them through pos_tag**

In [None]:
nltk.pos_tag(text13)  # the part-of-speech tag for each token, as a list of (token, tag) tuples

In [None]:
type(nltk.pos_tag(text13)[0])

In [None]:
tagged13 = nltk.pos_tag(text13)
print([token for token, _ in tagged13], '\n')
for _, tag in tagged13:
    nltk.help.upenn_tagset(tag)  # prints the description of each tag

**Ambiguity in POS Tagging**

In [None]:
aaa = 'Visiting aunts can be a nuisance'
text14 = nltk.word_tokenize(aaa)
print(nltk.pos_tag(text14),'\n')
for _, tag in nltk.pos_tag(text14):
    nltk.help.upenn_tagset(tag)  # prints the description of each tag

In [None]:
nltk.help.upenn_tagset('JJ')

*'Visiting' could also be tagged 'JJ' (adjective), depending on the reading. In general, though, 'Visiting' is less likely to be an adjective, so the POS tagger labels it as a verb*

### Parsing Sentence Structure

In [None]:
text15 = nltk.word_tokenize('Alice loves Bob')
nltk.pos_tag(text15)

In [None]:
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

In [None]:
text16 = nltk.word_tokenize("I saw the man with a telescope")
grammar1 = nltk.data.load('mygrammar.cfg')  # load a grammar (sentence structure rules) from a file
grammar1

In [None]:
parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)
for t in trees:
    print(t) # two different structures parsed - AMBIGUITY
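
The contents of `mygrammar.cfg` are not shown here. As a sketch of how such an ambiguity can arise, the hypothetical grammar below (written for this example, not copied from that file) lets the prepositional phrase attach either to the verb phrase or to the noun phrase:

```python
import nltk

# Hypothetical grammar written for this example; not the contents of mygrammar.cfg
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | NP PP | 'I'
V -> 'saw'
DT -> 'the' | 'a'
N -> 'man' | 'telescope'
P -> 'with'
""")

tokens = "I saw the man with a telescope".split()
toy_parser = nltk.ChartParser(toy_grammar)
toy_trees = list(toy_parser.parse(tokens))

print(len(toy_trees))  # number of distinct parse trees
for tree in toy_trees:
    print(tree)
```

In one tree "with a telescope" modifies the seeing (`VP -> VP PP`); in the other it modifies the man (`NP -> NP PP`).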

**Collection of Parse Trees in NLTK - Data package as training data**

In [None]:
sent7

In [None]:
from nltk.corpus import treebank
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)
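
Parsed sentences are `nltk.Tree` objects, which can be inspected programmatically. The sketch below builds a small tree from a hand-written bracketed string (mirroring the "Alice loves Bob" parse, not a treebank sentence) so that it runs without the corpus:

```python
from nltk import Tree

# Bracketed string written by hand for illustration
toy_tree = Tree.fromstring("(S (NP Alice) (VP (V loves) (NP Bob)))")

print(toy_tree.label())   # category of the root node
print(toy_tree.leaves())  # the words, left to right
print(toy_tree.height())  # number of levels in the tree

# Every noun phrase in the tree
noun_phrases = list(toy_tree.subtrees(lambda t: t.label() == "NP"))
for np_subtree in noun_phrases:
    print(np_subtree)
```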

### POS Tagging and Parsing Complexity

**Uncommon usages of words**

In [None]:
sentt = 'The old man the boat'
text18 = nltk.word_tokenize(sentt)
nltk.pos_tag(text18)

*The tagger labels 'old' as an adjective and 'man' as a noun, whereas in the intended reading 'old' is a noun and 'man' is a verb. The resulting tag sequence makes no grammatical sense, because it contains no verb*

**Well-formed sentences may still be meaningless!**
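
A grammar can recover the intended reading where a tagger working from word statistics does not. In the hypothetical grammar below (written for this example), 'old' may be a noun or an adjective and 'man' may be a noun or a verb; only the reading with 'man' as the verb yields a complete sentence:

```python
import nltk

# Hypothetical grammar written for this example
garden_grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT N | DT ADJ N
VP -> V NP
DT -> 'The' | 'the'
ADJ -> 'old'
N -> 'old' | 'man' | 'boat'
V -> 'man'
""")

garden_tokens = "The old man the boat".split()
garden_trees = list(nltk.ChartParser(garden_grammar).parse(garden_tokens))

print(len(garden_trees))  # number of complete parses
for tree in garden_trees:
    print(tree)
    print(tree.pos())     # (word, category) pairs read off the tree
```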

In [None]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

*The tagging is correct and the sentence is grammatical, but it makes no sense semantically*