<a href="https://colab.research.google.com/github/ccarpenterg/introNLP/blob/master/01a_intro_NLP_and_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP with Deep Learning



In [0]:
!pip install spacy==2.2.3
!python -m spacy download en_core_web_sm

In [0]:
import spacy

nlp_en = spacy.load('en_core_web_sm')

### Part-of-Speech Tagging

In [0]:
doc = nlp_en("Many Japanese children refuse to go to school")

for token in doc:
    # pos_ is the coarse-grained part of speech; tag_ is the fine-grained tag
    print(token.text, token.pos_, token.tag_)

Many ADJ JJ
Japanese ADJ JJ
children NOUN NNS
refuse VERB VBP
to PART TO
go VERB VB
to ADP IN
school NOUN NN


### Sentence Boundary Disambiguation

In [0]:
doc = nlp_en('Multi-agent planning uses the cooperation and competition of many agents to achieve a given goal. Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.')

for sent in doc.sents:
    print(sent)

Multi-agent planning uses the cooperation and competition of many agents to achieve a given goal.
Emergent behavior such as this is used by evolutionary algorithms and swarm intelligence.


In [0]:
text = ("Facebook, Inc. is an American social media and technology company"
        " based in Menlo Park, California.")

doc = nlp_en(text)

# The period in "Inc." should not be treated as a sentence boundary
for key, sent in enumerate(doc.sents):
    print(key, sent)

0 Facebook, Inc. is an American social media and technology company based in Menlo Park, California.


### spaCy's Language Support

In [0]:
!python -m spacy download pt_core_news_sm

In [0]:
import pt_core_news_sm as language_pt

nlp_pt = language_pt.load()

In [0]:
doc = nlp_pt("China anuncia redução de tarifas de importação de mais de 850 produtos")

for token in doc:
    print(token.text, token.pos_)

China PROPN
anuncia VERB
redução NOUN
de ADP
tarifas NOUN
de ADP
importação NOUN
de ADP
mais ADV
de ADP
850 NUM
produtos SYM


## Language Modeling

"Language modeling is the task of assigning a probability to sentences in a language (what is the probability of seeing the sentence *the lazy dog barked loudly*?). Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or sequence of words) to follow a sequence of words (what is the probability of seeing the word *barked* after seeing the sequence *the lazy dog*?)." [1]

Formally, the probability of a sequence of words $\Large w_{1:n}$, using the chain rule of probability, is:

$$\Large P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1})$$

which can be written more compactly using product notation:

$$ \Large P(w_{1:n}) = \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{1:k-1}) $$

where $\Large w_{1:n}$ is the sequence of words $\Large w_1 w_2 w_3 ... w_{n-1} w_{n}$. For $k=1$ the context $w_{1:0}$ is empty, so the first factor is simply $P(w_1)$.
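
As a small worked example of the chain rule, the sketch below multiplies one conditional probability per word to obtain the probability of the whole sequence. The probabilities in `conditional_probs` are made-up values for illustration, not estimates from any corpus, and the function names are hypothetical. The log-space variant is included because products of many small probabilities can underflow.

```python
import math

# Hypothetical conditional probabilities P(w_k | w_{1:k-1}) for the
# sequence "the lazy dog barked". The numbers are invented for illustration.
conditional_probs = [
    ("the", 0.20),     # P(the)
    ("lazy", 0.01),    # P(lazy | the)
    ("dog", 0.30),     # P(dog | the lazy)
    ("barked", 0.05),  # P(barked | the lazy dog)
]

def sequence_probability(factors):
    """Multiply the chain-rule factors to get P(w_{1:n})."""
    prob = 1.0
    for _, p in factors:
        prob *= p
    return prob

def sequence_log_probability(factors):
    """Sum log-probabilities instead of multiplying probabilities."""
    return sum(math.log(p) for _, p in factors)

prob = sequence_probability(conditional_probs)
log_prob = sequence_log_probability(conditional_probs)
print(prob, math.exp(log_prob))
```

Both printed values should agree up to floating-point error, since exponentiating the summed logs recovers the product.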



### N-grams

N-gram models approximate the probability of a word given all the previous words, $P(w_{n} | w_{1:n-1})$, by conditioning only on the preceding $N - 1$ words.

Bigrams (2-grams)

When $N=2$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-1})$$

Trigrams (3-grams)

When $N=3$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-2:n-1})$$
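
To make the bigram case concrete, here is a minimal sketch that estimates $P(w_n | w_{n-1})$ by counting adjacent word pairs and dividing by the count of the preceding word. This counting estimate (maximum likelihood) is an assumption of the sketch; the text above does not say how the probabilities are obtained. The toy corpus, the `<s>` and `</s>` boundary markers, and the function names are likewise hypothetical.

```python
from collections import Counter

# Toy corpus, invented for illustration.
corpus = [
    "the lazy dog barked",
    "the lazy cat slept",
    "the dog barked loudly",
]

unigram_counts = Counter()  # how often each word appears as a context
bigram_counts = Counter()   # how often each adjacent pair appears
for sentence in corpus:
    # <s> and </s> are assumed sentence-boundary markers
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigram_counts.update(tokens[:-1])
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(word, prev):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

def sentence_prob(sentence):
    """Approximate P(w_{1:n}) as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(word, prev)
    return prob

print(bigram_prob("lazy", "the"))
print(bigram_prob("barked", "dog"))
print(sentence_prob("the lazy dog barked"))
```

Any sentence containing a word pair that never occurs in the corpus gets probability zero under this estimate, which is why practical n-gram models add some form of smoothing.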

## Word Embeddings

### Vector Semantics

"The idea of vector semantics is to represent a word as a point in some multi-dimensional semantic space. Vectors representing words are generally called embeddings, because the word is embedded in a particular vector space."

> **Distributional hypothesis**
> 
> Words that occur in similar contexts tend to have similar meanings.
>
> **Distributional semantics**
>
> A word's meaning is given by the words that frequently appear close by.

The vector semantics model instantiates the distributional hypothesis by learning representations of the meaning of words directly from their distributions in texts. It offers a fine-grained model of meaning that also lets us compute word similarity (and phrase similarity).

Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.
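
One simple way to see the distributional hypothesis at work is to represent each word by the counts of the words that appear near it, then compare words with cosine similarity. The sketch below is a count-based illustration only, not a learned embedding model. The toy corpus, the window size `WINDOW`, and the helper `cosine` are assumptions introduced for this example.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]

WINDOW = 2  # assumed context window: two words on each side

# Map each word to a Counter of the words seen within its window.
cooccurrence = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        start = max(0, i - WINDOW)
        end = min(len(tokens), i + WINDOW + 1)
        for j in range(start, end):
            if j != i:
                cooccurrence[word][tokens[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine(cooccurrence["cat"], cooccurrence["dog"]))
print(cosine(cooccurrence["cat"], cooccurrence["milk"]))
```

In this corpus "cat" and "dog" occur in nearly identical contexts, so their vectors should be much closer to each other than "cat" is to "milk". No labels were needed: the vectors come entirely from the raw text.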
