## NLP vocabulary

- corpus: a text dataset, made up of raw text and, optionally, metadata
- tokens: units of text, typically alphanumeric strings delimited by whitespace or punctuation (the process of converting text to tokens is tokenization)

Tokenization is not always as easy as splitting on whitespace. In agglutinative languages such as Turkish, for example, a single word can carry what English spreads over several.

In [1]:
import spacy
from nltk.tokenize import TweetTokenizer

In [2]:
# 'en' is a spaCy 2.x shortcut; on spaCy 3.x load a named model such as 'en_core_web_sm'
nlp = spacy.load('en')
text = "Mary, don't slap the green witch."

In [3]:
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch', '.']


In [4]:
tweet = u"Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


- types: unique tokens present in a corpus
- vocabulary: collection of all types
- n-grams: sequences of n consecutive tokens from the corpus

Example: the 2-grams (bigrams) of "Learning PyTorch from a pdf is fun" are

"Learning PyTorch"

"PyTorch from"

"from a"

and so on.

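The definitions of types, vocabulary, and n-grams can be sketched in a few lines of plain Python. This is only an illustration: the helper name `ngrams` and the "frog" sentence are made up for this example, and whitespace splitting stands in for a real tokenizer.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Whitespace splitting is enough for this example sentence.
tokens = "Learning PyTorch from a pdf is fun".lower().split()

bigrams = ngrams(tokens, 2)
print(bigrams[:3])
# [('learning', 'pytorch'), ('pytorch', 'from'), ('from', 'a')]

# Types are the unique tokens; the vocabulary is the collection of all types.
# (Made-up sentence with repeated words, so tokens and types differ.)
corpus_tokens = "the green witch slapped the green frog".split()
vocabulary = sorted(set(corpus_tokens))
print(len(corpus_tokens), "tokens,", len(vocabulary), "types")
# 7 tokens, 5 types
```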
- lemma: the root form of a word; fly is the lemma of flying, flown, flew, and flies
- stemming: a cruder alternative to lemmatization that strips word endings using handcrafted rules

The process of reducing tokens to their lemmas is known as lemmatization. It helps keep the size of the vocabulary down.

In [5]:
# spaCy looks up lemmas using predefined rules and tables (for English, derived from WordNet)
doc = nlp(u"he was running late")

In [6]:
for token in doc:
    print('{} --> {}'.format(token, token.lemma_))

he --> -PRON-
was --> be
running --> run
late --> late
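
For contrast with the lemmatizer output above, stemming can be sketched as plain suffix stripping. This is a toy illustration, not a real stemming algorithm such as Porter's: the rule list `SUFFIXES` and the function `toy_stem` are hypothetical names invented for this example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rule list, checked in order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for word in ["flying", "flies", "flew", "flown", "running"]:
    print("{} --> {}".format(word, toy_stem(word)))
# flying --> fly
# flies --> fli
# flew --> flew
# flown --> flown
# running --> runn
```

Unlike lemmatization, the rules know nothing about the language: irregular forms such as flew and flown are left untouched, and some results (fli, runn) are not real words.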


Categorizing documents is one of the earliest applications of NLP.

We can extend the concept of labelling documents to labelling individual tokens. An example is part-of-speech (POS) tagging.

In [7]:
doc = nlp(u"Mary slapped the green witch.")

In [8]:
for token in doc:
    print("{} --> {}".format(token, token.pos_))

Mary --> PROPN
slapped --> VERB
the --> DET
green --> ADJ
witch --> NOUN
. --> PUNCT


In [9]:
# chunking (shallow parsing): extract noun phrases (NP) from the sentence
for chunk in doc.noun_chunks:
    print("{} --> {}".format(chunk, chunk.label_))

Mary --> NP
the green witch --> NP


Another labelling example is named entity recognition, where spans of text are tagged with real-world types such as person, location, etc.
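
To make the idea of tagging spans with entity types concrete, here is a minimal sketch that uses a dictionary lookup. It is an assumption-laden toy: the `GAZETTEER` table, the `tag_entities` function, and the example sentence are invented for illustration, and statistical taggers such as spaCy's learn from annotated data rather than from a fixed list.

```python
# Hypothetical gazetteer mapping known names (as lowercase token tuples) to entity types.
GAZETTEER = {
    ("mary",): "PERSON",
    ("new", "york"): "LOCATION",
}

def tag_entities(tokens):
    """Return (entity text, type) pairs found by longest-match lookup."""
    entities = []
    max_len = max(len(key) for key in GAZETTEER)
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so "New York" beats "New".
        for size in range(max_len, 0, -1):
            span = tuple(t.lower() for t in tokens[i:i + size])
            if span in GAZETTEER:
                entities.append((" ".join(tokens[i:i + size]), GAZETTEER[span]))
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return entities

print(tag_entities("Mary flew to New York".split()))
# [('Mary', 'PERSON'), ('New York', 'LOCATION')]
```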

Parse trees indicate how grammatical units are related hierarchically.

Words often have more than one meaning, and the different meanings of a word are called **senses**.

Neural network approaches to NLP supplement, rather than replace, traditional NLP approaches. Traditional approaches still work well in production.