# A Quick Tour of NLP

## Tokenization

(with nltk)

In [1]:
from nltk.tokenize import TweetTokenizer
tweet = u"Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))


['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


In [2]:
import spacy
nlp = spacy.load('en')
text = "Mary, don’t slap the green witch"
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', 'n’t', 'slap', 'the', 'green', 'witch']


## Lemmas and Stems
**Lemmas** are root forms of words. Consider the verb fly. It can be inflected into many different words (flew, flies, flown, flying, and so on), and fly is the lemma for all of these seemingly different words. Sometimes it is useful to reduce tokens to their lemmas to keep the dimensionality of the vector representation low. This reduction is called lemmatization, and you can see it in action here:

In [3]:
import spacy

doc = nlp(u"he was running late")
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> -PRON-
was --> be
running --> run
late --> late
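
To make the dimensionality point concrete, here is a minimal sketch that counts distinct vocabulary entries before and after lemmatization. The `LEMMAS` lookup table and the `lemmatize` helper are hypothetical and hand-written for illustration only; a real lemmatizer, such as the spaCy pipeline used above, derives lemmas from a model and rules rather than from a fixed dictionary.

```python
# Hypothetical hand-written lemma table, for illustration only.
LEMMAS = {"flew": "fly", "flies": "fly", "flown": "fly", "flying": "fly",
          "was": "be", "running": "run"}

def lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]

tokens = ["fly", "flew", "flies", "flown", "flying", "was", "running", "late"]
lemmas = lemmatize(tokens)

print(len(set(tokens)))  # 8 distinct surface forms
print(len(set(lemmas)))  # 4 distinct lemmas: fly, be, run, late
```

A bag-of-words vector built over the lemmas would need 4 dimensions here instead of 8.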



**Stemming** is the poor man’s lemmatization. It involves the use of handcrafted rules to strip the endings of words and reduce them to a common form called a stem. Popular stemmers often implemented in open source packages include the Porter and Snowball stemmers.
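
To show what "handcrafted rules to strip endings" means in practice, here is a toy suffix stripper. It is not the Porter or Snowball algorithm; the `SUFFIXES` list, the `toy_stem` name, and the `min_stem` threshold are all assumptions made for this sketch. For real work, NLTK ships both stemmers in its `nltk.stem` package.

```python
# Suffixes to strip, checked in this order (longest first).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

for word in ["walked", "running", "flies", "flew", "was"]:
    print('{} --> {}'.format(word, toy_stem(word)))
```

The results illustrate the trade-off: "walked" becomes "walk", but "running" becomes "runn" and "flies" becomes "fli", which are not real words, and the irregular form "flew" is left untouched. A lemmatizer would map all of the fly forms to "fly".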

## POS Tagging

In [4]:
import spacy
nlp = spacy.load('en')
doc = nlp(u"Mary slapped the green witch.")

for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - ADJ
witch - NOUN
. - PUNCT


In [5]:
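nlp_es = spacy.load('es')

# Caution: this calls `nlp`, the English pipeline loaded in In [4], not `nlp_es`.
# The tags printed below therefore come from the English model and are unreliable
# for Spanish; call nlp_es(...) instead to use the Spanish model.
doc = nlp("Qué onda amiguitoooooo, cómo está todo en ese bello lugar?")

for token in doc:
    print('{} - {}'.format(token, token.pos_))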

Qué - ADJ
onda - NOUN
amiguitoooooo - NOUN
, - PUNCT
cómo - VERB
está - ADJ
todo - NOUN
en - X
ese - ADJ
bello - NOUN
lugar - VERB
? - PUNCT


## Chunking

Often, we need to label a span of text; that is, a contiguous multitoken boundary. For example, consider the sentence, “Mary slapped the green witch.” We might want to identify the noun phrases (NP) and verb phrases (VP) in it, as shown here:

```
[NP Mary] [VP slapped] [NP the green witch].
```

This is called chunking or shallow parsing. Shallow parsing aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on. If you do not have data to train a shallow-parsing model, you can approximate it by writing regular expressions over the part-of-speech tags. Fortunately, for English and the most widely spoken languages, such data and pretrained models exist.

In [6]:

import spacy
nlp = spacy.load('en')
doc = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Mary - NP
the green witch - NP
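
The regular-expression fallback mentioned above can be sketched in plain Python. This is a minimal illustration, not how spaCy computes `noun_chunks`: the `TAG_CODES` mapping, the `NP_PATTERN` rule (optional determiner, any adjectives, one or more nouns), and the `regex_noun_chunks` helper are all assumptions made for this example. The tagged tokens are copied from the POS tagging output earlier in this tour.

```python
import re

# (token, coarse POS tag) pairs copied from the POS tagging output above.
tagged = [("Mary", "PROPN"), ("slapped", "VERB"), ("the", "DET"),
          ("green", "ADJ"), ("witch", "NOUN"), (".", "PUNCT")]

# One character per tag, so regex offsets line up with token indices.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N", "PROPN": "N", "VERB": "V"}

# Assumed noun-phrase rule: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def regex_noun_chunks(tagged_tokens):
    """Return the token spans whose tag sequence matches NP_PATTERN."""
    codes = "".join(TAG_CODES.get(tag, "O") for _, tag in tagged_tokens)
    chunks = []
    for match in NP_PATTERN.finditer(codes):
        span = tagged_tokens[match.start():match.end()]
        chunks.append(" ".join(token for token, _ in span))
    return chunks

print(regex_noun_chunks(tagged))  # ['Mary', 'the green witch']
```

For this sentence the rule recovers the same two noun phrases as the pretrained model, but a single pattern like this will miss many constructions that a trained chunker handles.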


## Dependency Parsing

In [17]:
for t in doc:
    print("{:<10} | {:<6} --> {}".format(str(t), t.dep_, t.head))

Mary       | nsubj  --> slapped
slapped    | ROOT   --> slapped
the        | det    --> witch
green      | amod   --> witch
witch      | dobj   --> slapped
.          | punct  --> slapped
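
One way to read this table is as a tree: each row is an arc from a token to its head, and the token labeled ROOT points to itself. The sketch below rebuilds that tree from the rows printed above. The `build_tree` and `render` helpers are hypothetical names for this example, and keying on token text is a simplification that only works when no word repeats in the sentence; with spaCy objects you would key on `token.i` instead.

```python
from collections import defaultdict

# (token, dependency label, head) triples copied from the output above.
arcs = [("Mary", "nsubj", "slapped"), ("slapped", "ROOT", "slapped"),
        ("the", "det", "witch"), ("green", "amod", "witch"),
        ("witch", "dobj", "slapped"), (".", "punct", "slapped")]

def build_tree(arcs):
    """Group tokens under their heads and return (root, children-by-head)."""
    children = defaultdict(list)
    root = None
    for token, dep, head in arcs:
        if dep == "ROOT":
            root = token
        else:
            children[head].append((dep, token))
    return root, dict(children)

def render(node, children, depth=0, dep="ROOT"):
    """Return the subtree under `node` as a list of indented lines."""
    lines = ["{}{} ({})".format("  " * depth, node, dep)]
    for child_dep, child in children.get(node, []):
        lines.extend(render(child, children, depth + 1, child_dep))
    return lines

root, children = build_tree(arcs)
print("\n".join(render(root, children)))
```

The printed tree puts "slapped" at the top, with "Mary", "witch", and "." directly beneath it and "the" and "green" nested under "witch".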


In [20]:

import spacy
nlp = spacy.load('es')
doc = nlp(u"No teníamos nada para perder")
for t in doc:
    print("{:<10} | {:<6} | {:<6} <- {}".format(str(t), t.pos_, t.dep_, t.head))

No         | ADV    | advmod <- teníamos
teníamos   | VERB   | ROOT   <- teníamos
nada       | PRON   | obj    <- teníamos
para       | ADP    | mark   <- perder
perder     | VERB   | acl    <- nada


This one is taken from Twitter... I can't tell how good the parse is!

In [23]:
doc = nlp(u"Señor FBI, si está viendo esto: chífleme por favor si tiene un laburito en blanco con aguinaldo por ahí")
for t in doc:
    print("{:<10} | {:<6} | {:<6} <- {}".format(str(t), t.pos_, t.dep_, t.head))

Señor      | NOUN   | nsubj  <- chífleme
FBI        | PROPN  | flat   <- Señor
,          | PUNCT  | punct  <- viendo
si         | SCONJ  | mark   <- viendo
está       | AUX    | aux    <- viendo
viendo     | VERB   | acl    <- Señor
esto       | PRON   | obj    <- viendo
:          | PUNCT  | punct  <- chífleme
chífleme   | VERB   | ROOT   <- chífleme
por        | ADP    | case   <- favor
favor      | NOUN   | obl    <- chífleme
si         | SCONJ  | mark   <- tiene
tiene      | VERB   | advcl  <- chífleme
un         | DET    | det    <- laburito
laburito   | NOUN   | obj    <- tiene
en         | ADP    | case   <- blanco
blanco     | NOUN   | nmod   <- laburito
con        | ADP    | case   <- aguinaldo
aguinaldo  | NOUN   | nmod   <- laburito
por        | ADP    | case   <- ahí
ahí        | ADV    | advmod <- aguinaldo
