# NLP

### Corpus
All NLP methods, be they classic or modern, begin with a text dataset, also called a **corpus** (plural: *corpora*). A corpus usually contains raw text and any metadata associated with the text.

The raw text is a sequence of characters (bytes), but it is usually more useful to group those characters into contiguous units called **tokens**.

### Tokens
Tokens correspond to words and numeric sequences separated by whitespace characters or punctuation.
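
As a rough illustration of this definition, the sketch below splits text into runs of word characters and single punctuation marks using only the standard library. The name `simple_tokenize` and the regular expression are assumptions made for this example; library tokenizers treat contractions, URLs, and emoticons with far more care.

```python
import re

# Assumed pattern: a run of word characters (letters, digits, underscore)
# OR a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word/number tokens and individual punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("Mary, don't slap the green witch."))
# ['Mary', ',', 'don', "'", 't', 'slap', 'the', 'green', 'witch', '.']
```

Note that this naive rule breaks `don't` into three pieces, which is one reason dedicated tokenizers exist.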

### Metadata
The metadata could be any auxiliary piece of information associated with the text, like **identifiers**, **labels**, and **timestamps**.

### Datapoint
In machine learning parlance, the text along with its metadata is called an **instance** or **datapoint**.

### Dataset
A dataset is a collection of **instances**.
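
The relationship between text, metadata, instance, and dataset can be sketched as a small data structure. The class name `Instance`, its fields, and the sample values are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One datapoint: raw text plus whatever metadata accompanies it."""
    text: str
    metadata: dict = field(default_factory=dict)

# A dataset is simply a collection of instances (values here are made up).
dataset = [
    Instance("Mary, don't slap the green witch.", {"id": 1, "label": "fiction"}),
    Instance("Snow White and the Seven Degrees", {"id": 2, "label": "tweet"}),
]

for instance in dataset:
    print(instance.metadata["id"], instance.metadata["label"], instance.text)
```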

The process of breaking a text down into tokens is called **tokenization**.

In [57]:
# Tokenizing Text
import spacy
from nltk.tokenize import TweetTokenizer
import time
# 'en' is the spaCy 2.x shortcut; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')

In [10]:
text = "Mary, don't slap the green witch."
print([str(token) for token in nlp(text.lower())])

['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch', '.']


In [18]:
tweet = u"Snow White and the Seven Degrees #MakeAMovieCold@Midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']


In [19]:
print([str(token) for token in nlp(tweet.lower())])

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#', 'makeamoviecold@midnight:-', ')']


### N-grams
N-grams are fixed-length (*n*) sequences of consecutive tokens occurring in the text.

#### Bigram
An n-gram of two tokens (n = 2).

#### Unigram
An n-gram of one token (n = 1).

In [22]:
# Generating n-grams
def n_grams(text, n):
    """
    Takes tokens or text, returns a list of n-grams
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

In [29]:
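# Each token is its own list element (a missing comma would silently merge ',' and "n't")
cleaned = ['Mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned, 3))

[['Mary', ',', "n't"], [',', "n't", 'slap'], ["n't", 'slap', 'green'], ['slap', 'green', 'witch'], ['green', 'witch', '.']]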


In [30]:
cleaned = ['Mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned, 2))

[['Mary', ','], [',', "n't"], ["n't", 'slap'], ['slap', 'green'], ['green', 'witch'], ['witch', '.']]


In [31]:
cleaned = ['Mary', ',', "n't", 'slap', 'green', 'witch', '.']
print(n_grams(cleaned, 4))

[['Mary', ',', "n't", 'slap'], [',', "n't", 'slap', 'green'], ["n't", 'slap', 'green', 'witch'], ['slap', 'green', 'witch', '.']]
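
Because `n_grams` only slices its input, it also accepts a plain string, in which case the result is character n-grams rather than token n-grams. The sketch below repeats the definition so that it runs on its own; the sample inputs are arbitrary.

```python
def n_grams(text, n):
    """Return every length-n slice of a token list or a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Passing a string yields character n-grams.
print(n_grams("witch", 2))  # ['wi', 'it', 'tc', 'ch']

# Asking for more items than the input holds yields an empty list.
print(n_grams(["green", "witch"], 3))  # []
```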


### Lemmas
Lemmas are the root forms of words. Consider the verb *fly*. It can be inflected into many different words (flies, flew, flown, flying, and so on), and *fly* is the lemma for all of them.

Sometimes, it might be useful to reduce the tokens to their lemmas to keep the dimensionality of the vector representation low. This reduction is called **Lemmatization** 
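
To make the dimensionality point concrete, the sketch below counts distinct vocabulary entries before and after mapping tokens to lemmas. The lookup table `LEMMA_TABLE` is hand-written for this example and is only a stand-in for a real lemmatizer; it assumes a representation with one dimension per distinct token.

```python
# Hand-written lemma table for illustration only; a real lemmatizer
# relies on a dictionary and morphological analysis instead.
LEMMA_TABLE = {"flies": "fly", "flew": "fly", "flown": "fly", "flying": "fly"}

def lemmatize(token):
    """Look the token up in the toy table; unknown tokens map to themselves."""
    return LEMMA_TABLE.get(token, token)

tokens = ["fly", "flies", "flew", "flown", "flying", "witch"]
vocab_before = set(tokens)
vocab_after = {lemmatize(t) for t in tokens}

print(len(vocab_before), len(vocab_after))  # 6 2
```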

### Stemming
Stemming is the poor man's **lemmatization**. It uses hand-crafted rules to strip the endings of words and reduce them to a common form called a **stem**.
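
A minimal sketch of what such hand-crafted rules might look like is shown below. The rule list `SUFFIX_RULES` and the function `toy_stem` are invented for this example; this is not the Porter algorithm or any library's stemmer.

```python
# Hypothetical hand-crafted rules: (suffix, replacement), tried in order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= min_stem_len:
            return word[:stem_len] + replacement
    return word

for word in ["flies", "running", "slapped", "flew"]:
    print(word, "-->", toy_stem(word))
# flies --> fly
# running --> runn
# slapped --> slapp
# flew --> flew
```

Here `running` becomes `runn` and `flew` is left untouched, which shows why stems need not be real words and why irregular forms call for lemmatization.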

In [33]:
# Implementing lemmatization
doc = nlp(u"he was running late")
for token in doc:
    print(f'{token} --> {token.lemma_}')

he --> -PRON-
was --> be
running --> run
late --> late


In [40]:
doc = nlp(u'he was telling a story')
for token in doc:
    print(token.lemma_)

-PRON-
be
tell
a
story


### POS Tagging
Part-of-speech (POS) tagging is a common way of categorizing words: each token is assigned a label such as noun, verb, or determiner.

In [42]:
doc = nlp(u"Mary slapped Max the sinner")
for token in doc:
    print(f'{token} --> {token.pos_}')

Mary --> PROPN
slapped --> VERB
Max --> PROPN
the --> DET
sinner --> NOUN


In [66]:
def to_pos(sentence):
    """
    Takes a sentence and prints each token's POS tag using spaCy.
    """
    print('processing....')
    time.sleep(1)
    doc = nlp(sentence)
    for token in doc:
        print(f'{token} --> {token.pos_}')

In [67]:
sentence_one = "Dave and max are both sinners"
sentence_two = "Max must confess to the priest"

In [68]:
to_pos(sentence_one)

processing....
Dave --> PROPN
and --> CCONJ
max --> NOUN
are --> VERB
both --> DET
sinners --> NOUN


In [69]:
to_pos(sentence_two)

processing....
Max --> PROPN
must --> VERB
confess --> VERB
to --> ADP
the --> DET
priest --> NOUN


### Shallow Parsing 
Shallow parsing, also known as **chunking**, aims to derive higher-order units composed of the grammatical atoms, like nouns, verbs, adjectives, and so on.

In [70]:
# Noun Phrase Chunking
doc = nlp(u"Mary slapped the green witch")
for chunk in doc.noun_chunks:
    print(f'{chunk} - {chunk.label_}')

Mary - NP
the green witch - NP
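
For intuition, the same two chunks can be recovered from POS tags alone with a toy rule: collect each maximal run of determiners, adjectives, and nouns that contains at least one noun. The function `toy_noun_chunks` and the hand-written tags are assumptions made for this sketch, and it is not meant to reproduce how spaCy computes `noun_chunks`.

```python
# Tags that may appear inside a noun phrase under this toy rule.
NP_TAGS = {"DET", "ADJ", "NOUN", "PROPN"}
HEAD_TAGS = {"NOUN", "PROPN"}

def toy_noun_chunks(tagged_tokens):
    """Group maximal runs of NP_TAGS tokens that contain at least one noun."""
    chunks, current = [], []
    # A sentinel with tag None flushes the final run.
    for token, tag in list(tagged_tokens) + [("", None)]:
        if tag in NP_TAGS:
            current.append((token, tag))
            continue
        if any(t in HEAD_TAGS for _, t in current):
            chunks.append(" ".join(tok for tok, _ in current))
        current = []
    return chunks

# Hand-tagged input (the tags are written by hand, not produced by a tagger).
tagged = [("Mary", "PROPN"), ("slapped", "VERB"), ("the", "DET"),
          ("green", "ADJ"), ("witch", "NOUN")]
print(toy_noun_chunks(tagged))  # ['Mary', 'the green witch']
```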


In [74]:
def shallow_parsing(sentence):
    """
    Takes a sentence and outputs chunks
    """
    doc = nlp(sentence)
    for chunk in doc.noun_chunks:
        print(f'{chunk} - {chunk.label_}')

In [78]:
sentence_one = "This is the first sentence"
sentence_two = "Diego wrote the first sentence"
sentence_three = "Diego's last name is medina, Max's last name is devil"

In [76]:
shallow_parsing(sentence_one)

the first sentence - NP


In [77]:
shallow_parsing(sentence_two)

Diego - NP
the first sentence - NP


In [79]:
shallow_parsing(sentence_three)

Diego's last name - NP
medina - NP
Max's last name - NP
devil - NP


### Parsing
Whereas shallow parsing identifies phrasal units, the task of identifying the relationships between them is called **parsing**.
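
One common way to record such relationships is as head-to-dependent arcs. The sketch below stores a hand-annotated parse of "Mary slapped the green witch" as triples and indexes it by head; the arcs, the relation labels, and the helper `children_by_head` are illustrative assumptions, not the output of a parser.

```python
from collections import defaultdict

# Hand-annotated dependency arcs as (head, relation, dependent) triples.
# The relation labels are illustrative, not the output of a parser.
arcs = [
    ("slapped", "nsubj", "Mary"),
    ("slapped", "dobj", "witch"),
    ("witch", "det", "the"),
    ("witch", "amod", "green"),
]

def children_by_head(arcs):
    """Index the arcs so each head maps to its (relation, dependent) pairs."""
    tree = defaultdict(list)
    for head, relation, dependent in arcs:
        tree[head].append((relation, dependent))
    return dict(tree)

tree = children_by_head(arcs)
for head, children in tree.items():
    print(head, "->", children)
```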

### Word Senses and Semantics
Words have meanings, and often more than one. The different meanings of a word are called its *senses*. 