# Quick tour of NLP

## Tokenization

(with NLTK and spaCy)

In [2]:
from nltk.tokenize import TweetTokenizer
tweet = u"Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))


['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


In [5]:
import spacy
nlp = spacy.load('en')
text = "Mary, don’t slap the green witch"
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', 'n’t', 'slap', 'the', 'green', 'witch']


## Lemmas and Stems
**Lemmas** are root forms of words. Consider the verb fly. It can be inflected into many different words (flew, flies, flown, flying, and so on), and fly is the lemma for all of these seemingly different words. Sometimes it is useful to reduce tokens to their lemmas to keep the dimensionality of the vector representation low. This reduction is called lemmatization, and you can see it in action here:

In [6]:
import spacy
nlp = spacy.load('en')
doc = nlp(u"he was running late")
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> -PRON-
was --> be
running --> run
late --> late
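
The dimensionality point can be made concrete with a small sketch. It assumes a tiny hand-written lemma table (`LEMMAS` and `lemmatize` are hypothetical names, not spaCy APIs) and compares the vocabulary size before and after mapping tokens to their lemmas.

```python
# Toy lemma table: an illustrative assumption, not the data spaCy uses.
LEMMAS = {
    "flew": "fly", "flies": "fly", "flown": "fly", "flying": "fly",
    "running": "run", "ran": "run",
}

def lemmatize(token):
    """Return the lemma for a token, falling back to the token itself."""
    return LEMMAS.get(token, token)

tokens = ["fly", "flew", "flies", "flown", "flying", "run", "running", "ran"]

raw_vocab = set(tokens)                           # one entry per surface form
lemma_vocab = {lemmatize(t) for t in tokens}      # one entry per lemma

print(len(raw_vocab), len(lemma_vocab))
```

With one vector dimension per vocabulary entry, the eight surface forms collapse to two lemmas, so a bag-of-words representation built on lemmas would need fewer dimensions.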



**Stemming** is the poor man's lemmatization. It uses handcrafted rules to strip the endings of words and reduce them to a common form called a stem. Popular stemmers, often implemented in open source packages, include the Porter and Snowball stemmers.
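
To show what "handcrafted rules" means in practice, here is a minimal suffix-stripping sketch. The rule list and the `toy_stem` function are assumptions made for illustration; they are far cruder than the Porter or Snowball rule sets (which NLTK provides as `PorterStemmer` and `SnowballStemmer` in `nltk.stem`).

```python
# Toy suffix rules, tried in order; an illustrative assumption,
# not the Porter or Snowball algorithm.
SUFFIX_RULES = [
    ("ies", "y"),   # flies -> fly
    ("ing", ""),    # slapping -> slapp (doubled consonant is not repaired)
    ("ed", ""),     # slapped -> slapp
    ("s", ""),      # witches -> witche (crude on purpose)
]

def toy_stem(word, min_stem_len=2):
    """Apply the first matching rule, keeping at least min_stem_len characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched

for word in ["flies", "slapped", "slapping", "witches", "flew"]:
    print("{} --> {}".format(word, toy_stem(word)))
```

The stems need not be real words ("slapp", "witche"), and an irregular form such as "flew" passes through unchanged, whereas a lemmatizer would map it to "fly".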

## POS Tagging

In [8]:
import spacy
nlp = spacy.load('en')
doc = nlp(u"Mary slapped the green witch.")

for token in doc:
    print('{} - {}'.format(token, token.pos_))

Mary - PROPN
slapped - VERB
the - DET
green - ADJ
witch - NOUN
. - PUNCT


In [10]:
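nlp_es = spacy.load('es')

# Use the Spanish pipeline (nlp_es), not the English pipeline (nlp).
doc = nlp_es("Qué onda amiguitoooooo, cómo está todo en ese bello lugar?")

for token in doc:
    print('{} - {}'.format(token, token.pos_))

For comparison, passing the same sentence to the English pipeline `nlp` yields unreliable tags, such as `está` tagged ADJ and `en` tagged X:

Qué - ADJ
onda - NOUN
amiguitoooooo - NOUN
, - PUNCT
cómo - VERB
está - ADJ
todo - NOUN
en - X
ese - ADJ
bello - NOUN
lugar - VERB
? - PUNCT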


## Chunking

Often, we need to label a span of text, that is, a contiguous sequence of several tokens. For example, consider the sentence, “Mary slapped the green witch.” We might want to identify the noun phrases (NP) and verb phrases (VP) in it. The following example extracts the noun phrases with spaCy's `noun_chunks`:

In [11]:

import spacy
nlp = spacy.load('en')
doc = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
    print('{} - {}'.format(chunk, chunk.label_))

Mary - NP
the green witch - NP
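
To illustrate how chunking can build on part-of-speech tags, the sketch below groups tagged tokens into noun phrases with one hand-written rule: optional determiners and adjectives followed by one or more nouns. The `toy_noun_chunks` function is a hypothetical helper and is not how spaCy computes `noun_chunks`; the tags are copied from the POS tagging output for the same sentence.

```python
# (token, coarse POS tag) pairs copied from the POS tagging output.
tagged = [("Mary", "PROPN"), ("slapped", "VERB"), ("the", "DET"),
          ("green", "ADJ"), ("witch", "NOUN"), (".", "PUNCT")]

NOUN_TAGS = {"NOUN", "PROPN"}
MODIFIER_TAGS = {"DET", "ADJ"}

def toy_noun_chunks(tagged_tokens):
    """Return spans of the form (DET|ADJ)* (NOUN|PROPN)+ as strings."""
    chunks = []
    current = []        # tokens of the candidate chunk being built
    seen_noun = False   # whether the candidate already contains a noun
    for token, pos in tagged_tokens:
        if pos in NOUN_TAGS:
            current.append(token)
            seen_noun = True
            continue
        # A non-noun token closes a candidate that already has its noun.
        if seen_noun:
            chunks.append(" ".join(current))
        # Reset unless we are still collecting leading modifiers.
        if seen_noun or pos not in MODIFIER_TAGS:
            current = []
        seen_noun = False
        if pos in MODIFIER_TAGS:
            current.append(token)
    if seen_noun:
        chunks.append(" ".join(current))
    return chunks

for chunk in toy_noun_chunks(tagged):
    print('{} - {}'.format(chunk, 'NP'))
```

On this sentence the rule yields "Mary" and "the green witch". A single pattern like this misses many constructions (for example, possessives or prepositional attachments), which is why trained chunkers or parsers are normally used.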
