Words can be divided into *content words* and *stopwords*. Stopwords, such as articles and prepositions, serve a mostly grammatical purpose, like filler holding the content words together.
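
To make the distinction concrete, here is a minimal sketch that splits a token list into content words and stopwords. The `STOPWORDS` set and the `split_content_and_stopwords` helper are hypothetical and hand-picked for this example; real libraries ship much longer, curated lists.

```python
# Hypothetical, hand-picked stopword list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "do", "n't"}

def split_content_and_stopwords(tokens):
    """Return (content_words, stopwords) in original order.

    Tokens that are neither stopwords nor purely alphabetic
    (for example punctuation) are dropped from both lists.
    """
    content = [t for t in tokens if t.lower() not in STOPWORDS and t.isalpha()]
    stops = [t for t in tokens if t.lower() in STOPWORDS]
    return content, stops

tokens = ['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']
content, stops = split_content_and_stopwords(tokens)
print(content)  # ['mary', 'slap', 'green', 'witch']
print(stops)    # ['do', "n't", 'the']
```

Note that dropping `n't` as a stopword also drops the negation, which is one reason stopword removal is a task-dependent choice.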

Tokenization
============

In [None]:
import spacy
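# The "en" shortcut was removed in spaCy 3; load the model by its full name
# (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
text = " mary, don't slap the green witch"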
print([str(token) for token in nlp(text.lower())])

[' ', 'mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']


In [None]:
from nltk.tokenize import TweetTokenizer
tweet = u"Snow White and the Serve Degrees #MakeAMovieCOld@midnight :-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'serve', 'degrees', '#makeamoviecold', '@midnight', ':-)']
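
To show why a tweet-aware tokenizer keeps hashtags, mentions, and emoticons intact, here is a rough regex-based sketch. It is not NLTK's actual implementation; `TWEET_PATTERN` and `simple_tweet_tokenize` are hypothetical names, and the emoticon pattern covers only a few simple forms.

```python
import re

# Alternatives are tried left to right, so the special cases come first.
TWEET_PATTERN = re.compile(
    r"[#@]\w+"          # hashtags and @-mentions
    r"|[:;]-?[)(DP]"    # a few simple emoticons such as :-) or ;P
    r"|\w+"             # ordinary words
    r"|[^\w\s]"         # any other single non-space symbol
)

def simple_tweet_tokenize(text):
    """Split text into tokens, keeping hashtags, mentions, and emoticons whole."""
    return TWEET_PATTERN.findall(text)

tweet = "Snow White and the Serve Degrees #MakeAMovieCOld@midnight :-)"
tweet_tokens = simple_tweet_tokenize(tweet.lower())
print(tweet_tokens)
```

On this input the sketch yields the same nine tokens as the `TweetTokenizer` output above, but it will diverge on URLs, longer emoticons, and other cases the real tokenizer handles.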


N-Grams
=======

In [None]:
def n_grams(text,n):
  return [text[i:i+n] for i in range(len(text)-n+1)]

cleaned =   ['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']
print(n_grams(cleaned, 3))

[['mary', ',', 'do'], [',', 'do', "n't"], ['do', "n't", 'slap'], ["n't", 'slap', 'the'], ['slap', 'the', 'green'], ['the', 'green', 'witch']]
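
Because `n_grams` relies only on slicing, the same function also works on strings, where it produces character n-grams. The sketch below repeats the definition so it runs on its own; the sample inputs are illustrative.

```python
def n_grams(text, n):
    """Return every contiguous window of length n from a list or a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Word bigrams from a token list
cleaned = ['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch']
word_bigrams = n_grams(cleaned, 2)
print(word_bigrams[:2])  # [['mary', ','], [',', 'do']]

# Character trigrams from a single word
char_trigrams = n_grams("witch", 3)
print(char_trigrams)  # ['wit', 'itc', 'tch']

# When n exceeds the input length, the range is empty and so is the result
print(n_grams(cleaned, 9))  # []
```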


Lemmas and Stems
================
[Stemming and Lemmatization Tutorial](https://stackabuse.com/python-for-nlp-tokenization-stemming-and-lemmatization-with-spacy-library/)

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in spaCy 3
doc = nlp(u"The geese were waddling like mad")
for token in doc:
  print('{} -->{}'.format(token, token.lemma_))

The -->the
geese -->goose
were -->be
waddling -->waddle
like -->like
mad -->mad


In [None]:
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing', 'geese']
for token in tokens:
    print('{} -->{}'.format(token, stemmer.stem(token)))

compute -->comput
computer -->comput
computed -->comput
computing -->comput
geese -->gees


Part of Speech Tagging
======================

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in spaCy 3
doc = nlp(u"The geese were waddling like mad")
for token in doc:
  print('{} -->{}'.format(token, token.pos_))

The -->DET
geese -->NOUN
were -->AUX
waddling -->VERB
like -->SCONJ
mad -->ADJ


Chunking and Named Entity Recognition
=====================================

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in spaCy 3
doc = nlp(u"Mary slapped the green witch.")
for chunk in doc.noun_chunks:
  print('{} -->{}'.format(chunk, chunk.label_))

Mary -->NP
the green witch -->NP


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")  # the "en" shortcut was removed in spaCy 3
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY


Binary Cross Entropy Loss
=========================

In [3]:
import torch
import torch.nn as nn

bce_loss = nn.BCELoss()
sigmoid = nn.Sigmoid()
#probabilities = sigmoid(torch.randn(4, 1, requires_grad=True))
probabilities = torch.tensor([0.9, 0, 0.9, 0], dtype=torch.float32).view(4,1)
targets = torch.tensor([1, 0, 1, 0], dtype=torch.float32).view(4,1)
loss = bce_loss(probabilities, targets)
print(probabilities)
print(loss)

tensor([[0.9000],
        [0.0000],
        [0.9000],
        [0.0000]])
tensor(0.0527)
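
To see where `0.0527` comes from, the loss can be recomputed by hand as the mean of `-(y*log(p) + (1-y)*log(1-p))` over the four examples. This is a plain-Python sketch rather than PyTorch's implementation; the `binary_cross_entropy` helper and its `log_floor` parameter are hypothetical names. The floor mirrors `nn.BCELoss`, which clamps its log terms to be at least -100 so that a probability of exactly 0 or 1 does not produce an infinite loss.

```python
import math

def binary_cross_entropy(probabilities, targets, log_floor=-100.0):
    """Mean binary cross-entropy with log terms clamped from below."""
    total = 0.0
    for p, y in zip(probabilities, targets):
        # log(0) is undefined, so substitute the floor in that case
        log_p = max(math.log(p), log_floor) if p > 0 else log_floor
        log_1mp = max(math.log(1 - p), log_floor) if p < 1 else log_floor
        total += -(y * log_p + (1 - y) * log_1mp)
    return total / len(probabilities)

loss = binary_cross_entropy([0.9, 0.0, 0.9, 0.0], [1, 0, 1, 0])
print(round(loss, 4))  # 0.0527
```

The two confident, correct predictions of 0 contribute nothing, and each prediction of 0.9 contributes `-log(0.9)`, roughly 0.1054, so the mean is about `2 * 0.1054 / 4`.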


Supervised Training Loop for a Perceptron and Binary Classification
===================================================================


```
# each epoch is a complete pass over the training data
for epoch_i in range(n_epochs):

  # the inner loop is over the batches in the dataset
  for batch_i in range(n_batches):

    # Step 0: Get the data
    x_data, y_target = get_toy_data(batch_size)

    # Step 1: Clear the gradients
    perceptron.zero_grad()

    # Step 2: Compute the forward pass of the model
    y_pred = perceptron(x_data, apply_sigmoid=True)

    # Step 3: Compute the loss value that we wish to optimise
    loss = bce_loss(y_pred, y_target)

    # Step 4: Propagate the loss signal backward
    loss.backward()

    # Step 5: Trigger the optimizer to perform one update
    optimizer.step()

```
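
The loop above depends on objects defined elsewhere (`perceptron`, `optimizer`, `get_toy_data`). As a runnable illustration of the same five steps, here is a torch-free sketch in which hand-written gradients stand in for `loss.backward()` and a plain SGD update stands in for `optimizer.step()`. The toy data generator, the cluster centres, and all hyperparameter values are assumptions chosen for this example.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable

def get_toy_data(batch_size):
    """Hypothetical stand-in: two 2-D Gaussian clusters, one per class."""
    x_data, y_target = [], []
    for _ in range(batch_size):
        label = random.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        x_data.append([random.gauss(center, 1.0), random.gauss(center, 1.0)])
        y_target.append(float(label))
    return x_data, y_target

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

weights, bias = [0.0, 0.0], 0.0
learning_rate, n_epochs, n_batches, batch_size = 0.1, 5, 20, 32
eps = 1e-12  # keeps log() away from zero
epoch_losses = []

for epoch_i in range(n_epochs):
    running_loss = 0.0
    for batch_i in range(n_batches):
        # Step 0: get the data
        x_data, y_target = get_toy_data(batch_size)

        # Step 1: clear the gradients
        grad_w, grad_b = [0.0, 0.0], 0.0

        # Step 2: forward pass (linear layer followed by a sigmoid)
        y_pred = [sigmoid(weights[0] * x[0] + weights[1] * x[1] + bias)
                  for x in x_data]

        # Step 3: mean binary cross-entropy loss over the batch
        loss = -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                    for p, y in zip(y_pred, y_target)) / batch_size

        # Step 4: backward pass; for sigmoid + BCE, dL/dz = (p - y) / batch_size
        for x, p, y in zip(x_data, y_pred, y_target):
            delta = (p - y) / batch_size
            grad_w[0] += delta * x[0]
            grad_w[1] += delta * x[1]
            grad_b += delta

        # Step 5: one plain SGD update
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b

        running_loss += loss
    epoch_losses.append(running_loss / n_batches)
    print("epoch {}: mean loss {:.4f}".format(epoch_i, epoch_losses[-1]))
```

Starting from zero weights, every prediction is 0.5 and the loss is `log(2)`; the mean loss per epoch should fall from there as the weights separate the two clusters.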

