<a href="https://colab.research.google.com/github/ccarpenterg/introNLP/blob/master/01a_intro_NLP_and_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP with Deep Learning



## Language Modeling

"Language modeling is the task of assigning a probability to sentences in a language (what is the probability of seeing the sentence *the lazy dog barked loudly*?). Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or sequence of words) to follow a sequence of words (what is the probability of seeing the word *barked* after seeing the sequence *the lazy dog*?)." [1]

Formally, the probability of a sequence of words $\Large w_{1:n}$, using the chain rule of probability, is:

$$\Large P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1})$$

which can be written more compactly in product notation:

$$ \Large P(w_{1:n}) = \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{1:k-1}) $$

where $\Large w_{1:n}$ is a sequence of words: $\Large w_1 w_2 w_3 ... w_{n-1} w_{n}$
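
As a minimal sketch of the chain rule above, the snippet below multiplies a list of conditional probabilities for the sentence *the lazy dog barked loudly*. The probability values are made up for illustration and do not come from any real model; the function names are hypothetical too.

```python
import math

# Hypothetical conditional probabilities P(w_k | w_{1:k-1}) for the
# sentence "the lazy dog barked loudly"; the numbers are invented.
cond_probs = [
    ("the", 0.20),     # P(the)
    ("lazy", 0.01),    # P(lazy | the)
    ("dog", 0.30),     # P(dog | the lazy)
    ("barked", 0.10),  # P(barked | the lazy dog)
    ("loudly", 0.25),  # P(loudly | the lazy dog barked)
]

def sequence_probability(probs):
    """Chain rule: multiply the conditional probabilities."""
    return math.prod(p for _, p in probs)

def sequence_log_probability(probs):
    """Same quantity in log space, which avoids underflow on long sequences."""
    return sum(math.log(p) for _, p in probs)

p = sequence_probability(cond_probs)
log_p = sequence_log_probability(cond_probs)
print(p, math.exp(log_p))
```

Working in log space is a common design choice because a product of many small probabilities quickly underflows floating-point precision.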



### N-grams

An N-gram model approximates the probability of a word given all the previous words, $ P(w_{n} | w_{1:n-1}) $, by conditioning only on the preceding $ N - 1 $ words.

**Bigram (2-grams) Model**

For $N=2$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-1})$$

Substituting this bigram approximation into the chain rule (with $w_0$ taken to be a start-of-sentence marker) gives:

$$ \Large P(w_{1:n}) \approx \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{k-1}) $$

**Trigram (3-grams) Model**

For $N=3$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-2:n-1})$$

Substituting this trigram approximation into the chain rule in the same way gives:

$$ \Large P(w_{1:n}) \approx \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{k-2:k-1}) $$

**A concrete N-gram example**

Now we take a look at how our model works using the following sentence:

**"The Industrial Revolution began in the 18th century, when agricultural societies became more industrialized and urban."**

The 3-grams for this sequence of words, written as [word, (two preceding words)], are:

**[Revolution, (The, Industrial)], [began, (Industrial, Revolution)], [in, (Revolution, began)], [the, (began, in)], [18th, (in, the)], [century, (the, 18th)]**, etc.
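
The list above can be reproduced with a few lines of Python. This is only a sketch: the `tokenize` and `ngrams` helpers are hypothetical names, and the tokenization (split on whitespace, strip surrounding punctuation) is a simplifying assumption rather than a proper tokenizer.

```python
import string

def tokenize(text):
    """Whitespace tokenization with surrounding punctuation stripped."""
    tokens = (tok.strip(string.punctuation) for tok in text.split())
    return [tok for tok in tokens if tok]

def ngrams(tokens, n):
    """Return (word, context) pairs; context is the n-1 preceding tokens."""
    return [
        (tokens[i], tuple(tokens[i - n + 1:i]))
        for i in range(n - 1, len(tokens))
    ]

sentence = ("The Industrial Revolution began in the 18th century, when "
            "agricultural societies became more industrialized and urban.")
trigrams = ngrams(tokenize(sentence), 3)

# Print the first six trigrams, matching the list in the text.
for word, context in trigrams[:6]:
    print(word, context)
```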

Suppose we are building a text editor that suggests the next word given the preceding words. To do this, we need to calculate the following probability:

$$\normalsize P(\text{century}|\text{The Industrial Revolution began in the 18th})$$

Using our trigram (3-gram) approximation, we only need the 2 preceding words from this part of the corpus:

$$\normalsize P(\text{century}|\text{the 18th})$$
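
A rough sketch of such a suggestion feature follows. It assumes the example sentence is the entire training corpus, lowercases all tokens, and uses a hypothetical helper named `suggest_next`; a real editor would be trained on far more text.

```python
from collections import Counter, defaultdict
import string

text = ("The Industrial Revolution began in the 18th century, when "
        "agricultural societies became more industrialized and urban.")
# Simplified tokenization (an assumption): split, strip punctuation, lowercase.
tokens = [t.strip(string.punctuation).lower() for t in text.split()]

# Count how often each word follows each pair of preceding words.
followers = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    followers[(w1, w2)][w3] += 1

def suggest_next(preceding_words):
    """Suggest the most frequent next word, looking only at the last two words."""
    context = tuple(w.lower() for w in preceding_words[-2:])
    counts = followers.get(context)
    if not counts:
        return None  # context never seen in the corpus
    word, count = counts.most_common(1)[0]
    return word, count / sum(counts.values())

suggestion = suggest_next("The Industrial Revolution began in the 18th".split())
print(suggestion)
```

Although the full seven-word history is passed in, only `the 18th` is used, which is exactly the trigram approximation.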

### Computing N-gram Probability

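For bigrams (2-grams), the probability is estimated from corpus counts, where $C(\cdot)$ is the number of times a word sequence occurs in the corpus: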

$$ \Large P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{\displaystyle\sum_w C(w_{n-1} w)} $$
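
The count-based estimate above can be sketched as follows. The three-sentence corpus is invented for illustration, and the `<s>` start marker and the `bigram_prob` helper are assumptions of this example.

```python
from collections import Counter

# Toy corpus, made up for illustration; <s> marks the sentence start.
corpus = [
    "<s> the dog barked".split(),
    "<s> the dog slept".split(),
    "<s> the cat slept".split(),
]

bigram_counts = Counter()   # C(w_{n-1} w_n)
context_counts = Counter()  # sum over w of C(w_{n-1} w)
for sent in corpus:
    for prev, word in zip(sent, sent[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def bigram_prob(word, prev):
    """Relative-frequency estimate of P(word | prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context; a real model would apply smoothing
    return bigram_counts[(prev, word)] / context_counts[prev]

print(bigram_prob("dog", "the"))     # 2 of the 3 bigrams starting with "the"
print(bigram_prob("barked", "dog"))  # 1 of the 2 bigrams starting with "dog"
```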

## Word Embeddings

### Vector Semantics

"The idea of vector semantics is to represent a word as a point in some multi-dimensional semantic space. Vectors for representing words are generally called embeddings, because the word is embedded in a particular vector space."

> **Distributional hypothesis**
> 
> Words that occur in similar contexts tend to have similar meanings.
>
> **Distributional semantics**
>
> A word's meaning is given by the words that frequently appear close-by.

The vector semantics model instantiates the distributional hypothesis by learning representations of the meaning of words directly from their distributions in texts. It offers a fine-grained model of meaning that also lets us compute word similarity (and phrase similarity).

Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.
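
To make the idea of word similarity concrete, here is a small sketch using cosine similarity, one common way to compare embeddings (the text above does not prescribe a particular measure). The 3-dimensional vectors are hypothetical values chosen by hand, not learned embeddings.

```python
import math

# Hypothetical 3-dimensional embeddings; real ones are learned from text
# and typically have hundreds of dimensions.
embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_dog_cat = cosine_similarity(embeddings["dog"], embeddings["cat"])
sim_dog_car = cosine_similarity(embeddings["dog"], embeddings["car"])
print(sim_dog_cat, sim_dog_car)
```

With these made-up vectors, `dog` ends up closer to `cat` than to `car`, which is the behavior we would hope for from embeddings learned on real text.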


### word2vec

"Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in that space." [[Wikipedia: Word2vec](https://en.wikipedia.org/wiki/Word2vec)]

#### Skip-gram

One of the word2vec models is skip-gram, which predicts the context words $ \large w_{t+j}$ within a window of fixed size, given a center word $ \large w_t $.

![word2vec diagram](https://user-images.githubusercontent.com/114733/75586670-b5606400-5a53-11ea-9df0-648c9a07e5f8.jpg)
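
The training examples for skip-gram are (center word, context word) pairs. The following sketch generates them for a window of `m` words on each side; the helper name `skipgram_pairs` is hypothetical, and skipping window positions that fall outside the sentence is an assumption of this example.

```python
def skipgram_pairs(tokens, m):
    """Return (center, context) pairs for a window of m words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            # Skip the center word itself and positions outside the sequence.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs

tokens = "the lazy dog barked loudly".split()
pairs = skipgram_pairs(tokens, m=2)

# Context words predicted from the center word "dog".
print([context for center, context in pairs if center == "dog"])
```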


We are looking for the word vectors (embeddings) $\theta$ that maximize the likelihood of the observed context words, over a corpus of $T$ words and a window of size $m$:

$$ \Large  L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{t+j}|w_t; \theta) $$

Equivalently, we minimize the average negative log-likelihood:

$$ \Large  J(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j}|w_t; \theta) $$
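
The sketch below evaluates $J(\theta)$ on a toy sequence. It assumes the usual word2vec parameterization, which this section has not defined: each word has a center vector and a context vector, and $P(o|c)$ is a softmax over the vocabulary of their dot products. The random vectors, the function name `skipgram_loss`, and the handling of sequence boundaries (out-of-range positions are skipped) are all assumptions of this example.

```python
import math
import random

def skipgram_loss(token_ids, center_vecs, context_vecs, m):
    """Average negative log-likelihood J(theta) over a sequence of word ids.

    Assumes P(o | c) is a softmax over dot(context_vecs[w], center_vecs[c]).
    """
    T = len(token_ids)
    total_log_prob = 0.0
    for t, c in enumerate(token_ids):
        # Score of every vocabulary word as a context of center word c.
        scores = [sum(a * b for a, b in zip(u, center_vecs[c]))
                  for u in context_vecs]
        # Log of the softmax normalizer, computed stably (log-sum-exp).
        mx = max(scores)
        log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < T:
                continue  # skip the center word and out-of-range positions
            o = token_ids[t + j]
            total_log_prob += scores[o] - log_z  # log P(w_{t+j} | w_t)
    return -total_log_prob / T

random.seed(0)
vocab = ["the", "lazy", "dog", "barked", "loudly"]
dim = 4  # embedding size, chosen arbitrarily for the example
center_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
context_vecs = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
token_ids = list(range(len(vocab)))  # the sentence "the lazy dog barked loudly"

loss = skipgram_loss(token_ids, center_vecs, context_vecs, m=2)
print(loss)
```

Training would then adjust the vectors to lower this value, for example with gradient descent. The full softmax is expensive for large vocabularies, which is why practical implementations use approximations such as negative sampling.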
