# Week 2: NLP hors d'oeuvre

Overview:
- Tokenization
- Text Preprocessing
    - Normalization
    - Stemming
- Part of Speech
    - Tagging
    - Choice: two or more mentions
    - Chunking

Reference: Introduction to Natural Language Processing (NLP) via Udemy

### Tokenization

Tokenization: breaking text (phrases or sentences) into individual tokens, such as words and punctuation marks.

In [1]:
import nltk

In [2]:
sample_text = "Have you seen The Baker and The Beauty on Netflix?.I love love it. You should see it."

In [3]:
word_token = nltk.word_tokenize(sample_text)

In [4]:
word_token

['Have',
 'you',
 'seen',
 'The',
 'Baker',
 'and',
 'The',
 'Beauty',
 'on',
 'Netflix',
 '?',
 '.I',
 'love',
 'love',
 'it',
 '.',
 'You',
 'should',
 'see',
 'it',
 '.']

In [5]:
# Length of the list created
len(word_token)

21
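
To make the idea of word tokenization concrete, here is a minimal regex sketch using only the standard library. It is a rough approximation for illustration, not how `nltk.word_tokenize` works internally, and `simple_word_tokenize` is a made-up helper name.

```python
import re

def simple_word_tokenize(text):
    # \w+ grabs runs of letters/digits/underscores;
    # [^\w\s] grabs each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_word_tokenize("I love love it. You should see it.")
print(tokens)
print(len(tokens))
```

A pattern this simple splits contractions such as "don't" into three tokens, which is one reason to prefer a real tokenizer on real text.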

Tokenization can also split text into sentences, not just words.

In [6]:
sents_token = nltk.sent_tokenize(sample_text)

In [8]:
sents_token

['Have you seen The Baker and The Beauty on Netflix?.I love love it.',
 'You should see it.']

In [9]:
len(sents_token)

2
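
The sentence tokenizer returned two sentences rather than three because the first two run together as `?.I` with no space between them. One possible fix is to clean the spacing before tokenizing. The sketch below uses only the standard library; `fix_sentence_spacing` is a hypothetical helper, and the naive splitter at the end stands in for `nltk.sent_tokenize` just to show the effect.

```python
import re

sample_text = "Have you seen The Baker and The Beauty on Netflix?.I love love it. You should see it."

def fix_sentence_spacing(text):
    # Collapse a run of sentence terminators (e.g. "?.") to its first mark
    # and add a space when a letter follows directly.
    return re.sub(r"([.?!])[.?!]*(?=[A-Za-z])", r"\1 ", text)

cleaned = fix_sentence_spacing(sample_text)
print(cleaned)

# Naive splitter: break on whitespace that follows a terminator.
sentences = re.split(r"(?<=[.?!])\s+", cleaned)
print(len(sentences))
```

This rule is deliberately blunt: it would also rewrite abbreviations such as "U.S.A" and shorten an ellipsis that is followed directly by a letter, so it only suits text where those do not matter.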

### Normalization

Normalization is the process of transforming text into a standard form, for example by putting every word in the same case.

In [10]:
sh_ham = nltk.corpus.gutenberg.words("shakespeare-hamlet.txt")

In [11]:
sh_ham_10 = sh_ham[:10]

In [12]:
sh_ham_10

['[',
 'The',
 'Tragedie',
 'of',
 'Hamlet',
 'by',
 'William',
 'Shakespeare',
 '1599',
 ']']

In [13]:
for word in sh_ham_10:
    if word.isalpha():
        print(word.upper())  # convert to uppercase

THE
TRAGEDIE
OF
HAMLET
BY
WILLIAM
SHAKESPEARE


#### STEMMERS

"This is the process of reducing inflection in words (e.g. troubled, troubles) to their root form (e.g. trouble)." 

Reference: https://bit.ly/3esvj2y


In [14]:
sample_stems = ["workers", "cities", "works", "city", "runs", "running",]

In [15]:
porter = nltk.PorterStemmer()

In [16]:
for word in sample_stems:
    print(porter.stem(word))

worker
citi
work
citi
run
run


In [17]:
lancaster = nltk.LancasterStemmer()

In [18]:
for word in sample_stems:
    print(lancaster.stem(word))

work
city
work
city
run
run
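
To compare the two stemmers more easily, the sketch below puts their results side by side for the same word list. It uses only `nltk.PorterStemmer` and `nltk.LancasterStemmer` from the cells above; the `rows` variable and the column layout are just for illustration.

```python
import nltk

sample_stems = ["workers", "cities", "works", "city", "runs", "running"]

porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()

# One (word, porter stem, lancaster stem) tuple per input word.
rows = [(w, porter.stem(w), lancaster.stem(w)) for w in sample_stems]

print(f"{'word':10} {'porter':10} {'lancaster'}")
for word, p_stem, l_stem in rows:
    print(f"{word:10} {p_stem:10} {l_stem}")
```

Seen together, the results show Lancaster cutting "workers" down to "work" while Porter stops at "worker", and Porter producing the non-word "citi" where Lancaster gives "city".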


In [None]:
nltk.download('wordnet')

In [19]:
word_lemmatize = nltk.WordNetLemmatizer()

In [20]:
for word in sample_stems:
    print(word_lemmatize.lemmatize(word))

worker
city
work
city
run
running


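WordNetLemmatizer gives good results because it returns real dictionary words (city rather than citi), but it takes up a lot more resources than the stemmers because it looks each word up in WordNet. It left running unchanged because `lemmatize` treats every word as a noun unless a `pos` argument is passed; `word_lemmatize.lemmatize("running", pos="v")` returns run.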

## Part Of Speech

In [None]:
nltk.download('averaged_perceptron_tagger')

In [21]:
nltk.pos_tag(word_token)

[('Have', 'VBP'),
 ('you', 'PRP'),
 ('seen', 'VBN'),
 ('The', 'DT'),
 ('Baker', 'NNP'),
 ('and', 'CC'),
 ('The', 'DT'),
 ('Beauty', 'NNP'),
 ('on', 'IN'),
 ('Netflix', 'NNP'),
 ('?', '.'),
 ('.I', 'NNP'),
 ('love', 'IN'),
 ('love', 'NN'),
 ('it', 'PRP'),
 ('.', '.'),
 ('You', 'PRP'),
 ('should', 'MD'),
 ('see', 'VB'),
 ('it', 'PRP'),
 ('.', '.')]

In [22]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to
[nltk_data]     /Users/princessiria/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [23]:
# Look up what each part-of-speech tag abbreviation means
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [25]:
nltk.pos_tag(nltk.word_tokenize("I am princess."))

[('I', 'PRP'), ('am', 'VBP'), ('princess', 'RB'), ('.', '.')]

In [26]:
nltk.pos_tag(nltk.word_tokenize("Data is amazing and cool"))

[('Data', 'NNP'),
 ('is', 'VBZ'),
 ('amazing', 'JJ'),
 ('and', 'CC'),
 ('cool', 'JJ')]

In [27]:
sh_ham = nltk.corpus.gutenberg.words("shakespeare-hamlet.txt")

In [28]:
norm_sh_ham = [word.lower() for word in sh_ham if word.isalpha()]

In [29]:
nltk.download('universal_tagset')

[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/princessiria/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [30]:
sh_ham_tags = nltk.pos_tag(norm_sh_ham, tagset="universal")  # map tags to the coarser universal tagset (NOUN, VERB, ADP, ...)

In [31]:
sh_ham_tags[0:6]

[('the', 'DET'),
 ('tragedie', 'NOUN'),
 ('of', 'ADP'),
 ('hamlet', 'NOUN'),
 ('by', 'ADP'),
 ('william', 'NOUN')]

In [33]:
sh_ham_tags_nouns = [word for word, tag in sh_ham_tags if tag == "NOUN"]

In [34]:
fd_nouns = nltk.FreqDist(sh_ham_tags_nouns)

In [35]:
fd_nouns.most_common(10)

[('i', 324),
 ('ham', 223),
 ('lord', 193),
 ('d', 152),
 ('hamlet', 82),
 ('king', 79),
 ('hor', 62),
 ('tis', 53),
 ('enter', 51),
 ('t', 49)]
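
Several of the top "nouns" above look like noise rather than real nouns: `ham` and `hor` appear to be speaker labels, and `d` and `t` look like leftovers from contractions. A possible cleanup is to drop very short tokens and a hand-picked noise list before counting. The sketch below uses a small hand-written sample in place of `sh_ham_tags`, and both the `noise` set and the length threshold are assumptions to tune, not part of NLTK.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for sh_ham_tags.
tagged = [("the", "DET"), ("tragedie", "NOUN"), ("of", "ADP"),
          ("hamlet", "NOUN"), ("ham", "NOUN"), ("i", "NOUN"),
          ("lord", "NOUN"), ("d", "NOUN"), ("lord", "NOUN"),
          ("hamlet", "NOUN")]

noise = {"ham", "hor"}  # hypothetical list of speaker labels to ignore

# Keep nouns longer than two characters that are not in the noise list.
nouns = [w for w, t in tagged if t == "NOUN" and len(w) > 2 and w not in noise]

counts = Counter(nouns)
print(counts.most_common(3))
```

`nltk.FreqDist` could be swapped in for `Counter` here, since the filtering step is the same either way.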

Exercise
- Alice in Wonderland
    - Find out how the words alice, adventures, wonderland and rabbit are tagged.

In [None]:
alice = nltk.corpus.gutenberg.words("carroll-alice.txt")

In [None]:
alice_norm = [word.lower() for word in alice if word.isalpha()]  # normalize text

In [None]:
alice[:5]

In [None]:
tags_alice = nltk.pos_tag(alice_norm, tagset="universal")

In [None]:
cfd_alice = nltk.ConditionalFreqDist(tags_alice)

In [None]:
cfd_alice

In [None]:
cfd_alice["alice"]

In [None]:
cfd_alice["wonderland"]

In [None]:
cfd_alice["wonder"]
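
For a feel of what the lookups above return, here is a small sketch of `nltk.ConditionalFreqDist` on hand-written `(word, tag)` pairs. The pairs are made up for illustration and are not the real tagger output for the Alice text.

```python
import nltk

# Hypothetical (word, tag) pairs; each word is a condition, each tag a sample.
pairs = [("alice", "NOUN"), ("wonder", "VERB"), ("alice", "NOUN"),
         ("wonder", "NOUN"), ("rabbit", "NOUN"), ("wonder", "VERB")]

cfd = nltk.ConditionalFreqDist(pairs)

# Indexing by a word returns a frequency distribution over its tags.
print(cfd["wonder"].most_common())
print(cfd["alice"]["NOUN"])
print(cfd["wonder"].max())  # the most frequent tag for "wonder"
```

Built from `(word, tag)` pairs like this, the distribution answers "which tags does this word get, and how often?", which is what the exercise asks for.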

#### Choices

In [None]:
stories = nltk.corpus.gutenberg.words("bryant-stories.txt")

In [None]:
tag_story = nltk.pos_tag(stories, tagset="universal")

In [None]:
sub_tags = tag_story[:10]

In [None]:
found = False
for (first_word, first_tag), (second_word, second_tag), (third_word, third_tag) in nltk.trigrams(sub_tags):
    # "or" is a word, not a tag, so test the middle word rather than its tag
    if first_tag == "NOUN" and second_word.lower() == "or" and third_tag == "NOUN":
        print(first_word, second_word, third_word)
        found = True
if not found:
    print("Combination not found! Please try again later :-)")
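
The "noun or noun" pattern can be checked on a small hand-tagged sample without running the tagger. In this sketch the sentence and its tags are made up, `find_choices` is a hypothetical helper, and `zip` over three shifted slices produces the same sliding windows as `nltk.trigrams`. Note that the middle position is matched on the word "or" itself, since its tag would be a conjunction tag rather than the string "or".

```python
# Hand-tagged tokens (universal tagset style), invented for illustration.
tagged = [("Would", "VERB"), ("you", "PRON"), ("like", "VERB"),
          ("tea", "NOUN"), ("or", "CONJ"), ("coffee", "NOUN"), ("?", ".")]

def find_choices(tagged_tokens):
    # Slide a window of three consecutive (word, tag) pairs over the tokens.
    windows = zip(tagged_tokens, tagged_tokens[1:], tagged_tokens[2:])
    return [
        f"{w1} {w2} {w3}"
        for (w1, t1), (w2, _), (w3, t3) in windows
        if t1 == "NOUN" and w2.lower() == "or" and t3 == "NOUN"
    ]

choices = find_choices(tagged)
print(choices)
```

On the real corpus, a slice of only the first ten tagged tokens may well contain no match, so a larger slice is probably needed to see any output.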

### Chunking

Chunking refers to the process of taking individual pieces of information and grouping them into larger units.
In NLP, this means grouping tagged tokens into phrases, for example joining consecutive proper nouns into a single name.

In [36]:
sample_text = "I saw The Baker and The Beauty this weekend. I love the story...also I would love to visit Puerto Rico."

In [37]:
sample_tags = nltk.pos_tag(nltk.word_tokenize(sample_text))

In [38]:
sample_tags

[('I', 'PRP'),
 ('saw', 'VBD'),
 ('The', 'DT'),
 ('Baker', 'NNP'),
 ('and', 'CC'),
 ('The', 'DT'),
 ('Beauty', 'NNP'),
 ('this', 'DT'),
 ('weekend', 'NN'),
 ('.', '.'),
 ('I', 'PRP'),
 ('love', 'VBP'),
 ('the', 'DT'),
 ('story', 'NN'),
 ('...', ':'),
 ('also', 'RB'),
 ('I', 'PRP'),
 ('would', 'MD'),
 ('love', 'VB'),
 ('to', 'TO'),
 ('visit', 'VB'),
 ('Puerto', 'NNP'),
 ('Rico', 'NNP'),
 ('.', '.')]

In [39]:
sequence = '''
                Chunk:
                {<NNPS>+}
                {<NNP>+}
                {<NN>+} '''

In [40]:
chunker_text = nltk.RegexpParser(sequence)

In [41]:
output = chunker_text.parse(sample_tags)

In [42]:
print(output)

(S
  I/PRP
  saw/VBD
  The/DT
  (Chunk Baker/NNP)
  and/CC
  The/DT
  (Chunk Beauty/NNP)
  this/DT
  (Chunk weekend/NN)
  ./.
  I/PRP
  love/VBP
  the/DT
  (Chunk story/NN)
  .../:
  also/RB
  I/PRP
  would/MD
  love/VB
  to/TO
  visit/VB
  (Chunk Puerto/NNP Rico/NNP)
  ./.)


Chunking is useful when an idea is made up of more than one word (e.g. Puerto Rico), and we can always change the sequence of tag patterns to fit the phrases we want to capture.
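
As one example of changing the pattern, the sketch below allows an optional determiner in front of the nouns so that "The Baker" is kept together as one chunk. The tagged tokens are copied by hand from the first sentence of the tagger output above, so no tagger model is needed; the `NP` label and the `phrases` variable are choices made for this illustration.

```python
import nltk

# First sentence of the tagged sample, copied by hand.
tagged = [("I", "PRP"), ("saw", "VBD"), ("The", "DT"), ("Baker", "NNP"),
          ("and", "CC"), ("The", "DT"), ("Beauty", "NNP"),
          ("this", "DT"), ("weekend", "NN"), (".", ".")]

# <DT>? makes the determiner optional; + allows one or more nouns.
grammar = r"""
    NP: {<DT>?<NNP>+}
        {<DT>?<NN>+}
"""

chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

# Collect the words of every NP subtree as a single phrase.
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees()
           if subtree.label() == "NP"]
print(phrases)
```

Whether the determiner belongs inside the chunk depends on the task: for extracting titles it helps, while for counting names it may be better left out.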