<a href="https://colab.research.google.com/github/ccarpenterg/introNLP/blob/master/01a_intro_NLP_and_word_embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to NLP with Deep Learning



## NLP Tasks

### Part-of-speech tagging

Each word in a sentence belongs to a grammatical class. English has nine basic classes, or part-of-speech (POS) tags:

* Noun (N)
* Verb (V)
* Article (A)
* Adjective (ADJ)
* Adverb (ADV)
* Preposition (P)
* Conjunction (C)
* Pronoun (PRO)
* Interjection (INT)

POS tagging refers to the computational methods for assigning parts of speech to the words in a sentence.

### Named entity recognition

* Person
* Location
* Organization
* Date
* Time
* Quantity





### Sentiment Analysis

### Question Answering

### Summarization

### Machine Translation

### Natural Language Generation

### Speech Recognition

### Text-to-speech

## Language Modeling

"Language modeling is the task of assigning a probability to sentences in a language (what is the probability of seeing the sentence *the lazy dog barked loudly*?). Besides assigning a probability to each sequence of words, the language models also assign a probability for the likelihood of a given word (or sequence of words) to follow a sequence of words (what is the probability of seeing the word *barked* after seeing the sequence *the lazy dog*?)." [2]

Formally, the probability of a sequence of words $\Large w_{1:n}$, using the chain rule of probability, is:

$$\Large P(w_{1:n}) = P(w_1)P(w_2|w_1)P(w_3|w_{1:2})...P(w_n|w_{1:n-1})$$

which can be written more compactly using product notation:

$$ \Large P(w_{1:n}) = \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{1:k-1}) $$

where $\Large w_{1:n}$ is a sequence of words: $\Large w_1 w_2 w_3 ... w_{n-1} w_{n}$
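
As a worked instance, applying the chain rule to the five-word example sentence *the lazy dog barked loudly* gives one factor per word, each conditioned on everything before it:

$$ P(\text{the lazy dog barked loudly}) = P(\text{the}) \, P(\text{lazy}|\text{the}) \, P(\text{dog}|\text{the lazy}) \, P(\text{barked}|\text{the lazy dog}) \, P(\text{loudly}|\text{the lazy dog barked}) $$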



### N-grams

So let's say we are estimating the probabilities of a language model using some specific corpus, and we want to know the following probability:

$$\normalsize P(\text{the}|\text{its water is so transparent that})$$

One method is to count how many times the sequence ***its water is so transparent that*** appears in the corpus, count how many times the longer sequence ***its water is so transparent that the*** appears, and then take the relative frequency:

$$ \normalsize P(\text{the}|\text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}           $$
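
The counting estimate above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `count_sequence` and `relative_frequency` and the two-sentence toy corpus are assumptions made up for this example.

```python
def count_sequence(tokens, phrase):
    """Count how many times the token list `phrase` occurs contiguously in `tokens`."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def relative_frequency(tokens, history, word):
    """Estimate P(word | history) as C(history + word) / C(history)."""
    history_count = count_sequence(tokens, history)
    if history_count == 0:
        return None  # the history never occurs, so the estimate is undefined
    return count_sequence(tokens, history + [word]) / history_count


# Hypothetical toy corpus, invented for illustration only.
corpus = (
    "its water is so transparent that the fish are visible . "
    "its water is so transparent that you can see the bottom ."
).split()

history = "its water is so transparent that".split()
print(relative_frequency(corpus, history, "the"))  # 0.5
```

Even in this tiny corpus the history occurs only twice, which hints at why long histories are hard to estimate from counts alone.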

But as Jurafsky and Martin point out, "while this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn't big enough to give us good estimates in most cases." [1]

N-gram models approximate the probability of a word given all the previous words, $ P(w_{n} | w_{1:n-1}) $, by conditioning only on the preceding $ N - 1 $ words.

When N = 2 we have a bigram (2-gram) model, which uses only the preceding word to calculate the language model probabilities. When N = 3 we have a trigram (3-gram) model, which uses only the two preceding words.

Now we can use this approximation in the definition of our model:

**Bigram (2-grams) Model**

For $N=2$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-1})$$

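Now we replace each factor in our language model's chain-rule product with this bigram approximation (for $k=1$, $w_0$ is taken to be a start-of-sentence symbol):

$$ \Large P(w_{1:n}) \approx \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{k-1}) $$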

**Trigram (3-grams) Model**

For $N=3$ we have:

$$\Large P(w_n | w_{1:n-1}) \approx P(w_n | w_{n-2:n-1})$$

and we do the same using this trigram approximation (padding with start-of-sentence symbols so that the first two words have a full context):

$$ \Large P(w_{1:n}) \approx \displaystyle \prod_{k=1}^{n} P(w_{k} | w_{k-2:k-1}) $$

### Computing N-gram Probability

When working with bigrams (2-grams):

$$ \Large P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{\displaystyle\sum_w C(w_{n-1} w)} $$

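Every occurrence of $ w_{n-1} $ is followed by exactly one word, so the bigram counts in the denominator add up to the count of $ w_{n-1} $ itself, and we can replace the summation with that count: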

$$ \Large P(w_n | w_{n-1}) = \frac{C(w_{n-1} w_n)}{ C(w_{n-1})} $$

Trigrams (3-grams):

$$ \Large P(w_n | w_{n-2} w_{n-1}) = \frac{C(w_{n-2}w_{n-1}w_n)}{\displaystyle\sum_w C(w_{n-2}w_{n-1}w)}   $$

In the denominator we can replace the summation with the count of $ w_{n-2} w_{n-1} $:

$$ \Large P(w_n | w_{n-2} w_{n-1}) = \frac{C(w_{n-2}w_{n-1}w_n)}{ C(w_{n-2}w_{n-1})}   $$
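
The formulas above can be turned into a short Python sketch. The function names `ngram_counts` and `ngram_probability` are hypothetical, and the three-sentence corpus is a toy example chosen only to keep the counts easy to check by hand. The denominator is computed as the sum over all n-grams that start with the context, matching the first form of each formula.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_probability(tokens, context, word):
    """Estimate P(word | context) = C(context word) / sum_w C(context w)."""
    n = len(context) + 1
    counts = ngram_counts(tokens, n)
    context = tuple(context)
    # Denominator: total count of n-grams whose first n-1 words are the context.
    context_total = sum(c for gram, c in counts.items() if gram[:-1] == context)
    if context_total == 0:
        return 0.0  # unseen context; a real model would apply smoothing here
    return counts[context + (word,)] / context_total


# Toy corpus for illustration; sentences are padded with <s> and </s>
# and concatenated into one flat token list for simplicity.
tokens = (
    "<s> I am Sam </s> "
    "<s> Sam I am </s> "
    "<s> I do not like green eggs and ham </s>"
).split()

print(ngram_probability(tokens, ["<s>"], "I"))       # bigram: 2 of 3 sentences start with I
print(ngram_probability(tokens, ["am"], "Sam"))      # bigram: "am" is followed by Sam once out of twice
print(ngram_probability(tokens, ["<s>", "I"], "am")) # trigram: "<s> I" is followed by am once out of twice
```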


### A concrete N-gram example

Now we take a look at how our model works, using the Wikipedia article on the Industrial Revolution as our corpus and focusing on the following sentence:

**The Industrial Revolution began in the 18th century, when agricultural societies became more industrialized and urban.**

Now let's get the 2-grams for this sequence with padding:

**"&lt;s&gt; The", "The Industrial", "Industrial Revolution", "Revolution began", "began in", "in the", "the 18th", "18th century", "century ,", ", when", "when agricultural", "agricultural societies", "societies became", "became more", "more industrialized", "industrialized and", "and urban", "urban &lt;/s&gt;"**
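
The padded bigrams listed above can be produced with a few lines of Python. This is a sketch under simplifying assumptions: the regular-expression tokenizer and the helper names `tokenize` and `padded_bigrams` are made up for this example, and the sentence-final period is dropped to match the listing.

```python
import re


def tokenize(sentence):
    """Split into word tokens and single punctuation tokens (a simplifying assumption)."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def padded_bigrams(tokens):
    """Return the bigrams of a token list after adding <s> and </s> padding."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


sentence = (
    "The Industrial Revolution began in the 18th century, "
    "when agricultural societies became more industrialized and urban."
)
tokens = tokenize(sentence)
if tokens and tokens[-1] == ".":
    tokens = tokens[:-1]  # the listing above omits the sentence-final period

bigrams = padded_bigrams(tokens)
for first, second in bigrams:
    print(first, second)
```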

So let's say we are building  a text editor that is able to suggest the next word given some preceding words, and in order to achieve this we need to calculate the following probability:

$$\normalsize P(\text{century}|\text{The Industrial Revolution began in the 18th})$$

Using our bigram (2-gram) approximation, we now only need the preceding word in this part of the corpus:

$$\normalsize P(\text{century}|\text{18th})$$
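
A next-word suggester along these lines might look like the sketch below. The helper names `train_bigram_counts` and `suggest_next` are hypothetical, and the three lowercase sentences are an invented stand-in for the article; a real estimate would be computed over the full corpus.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for previous, current in zip(tokens, tokens[1:]):
        following[previous][current] += 1
    return following


def suggest_next(following, previous_word, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    counts = following.get(previous_word)
    if not counts:
        return []  # the word was never seen as a context
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common(k)]


# Assumption: a tiny stand-in corpus invented for illustration only.
tokens = (
    "<s> the industrial revolution began in the 18th century </s> "
    "<s> the 18th century saw rapid change </s> "
    "<s> the industrial revolution spread in the 19th century </s>"
).split()

following = train_bigram_counts(tokens)
print(suggest_next(following, "18th"))  # [('century', 1.0)]
print(suggest_next(following, "the"))
```

In this toy corpus *18th* is always followed by *century*, so the bigram estimate of $P(\text{century}|\text{18th})$ is 1.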



## Word Embeddings

The basic idea behind word embeddings is that words are represented as vectors (points) in a vector space, and that the distance between these vector representations is a measure of the semantic similarity between words.
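
To make the idea concrete, here is a small sketch that compares word vectors with cosine similarity. Cosine similarity is one common choice of measure and is an assumption here, since the text does not name a specific one; the three-dimensional vectors are hand-picked hypothetical values, not learned embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings, hand-picked for illustration.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["cat"], embeddings["dog"]))  # close to 1
print(cosine_similarity(embeddings["cat"], embeddings["car"]))  # noticeably lower
```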

### Vector Semantics

Vector semantics is the framework of ideas upon which several word embedding models have been built. The two main ideas of this framework are as follows:

"The idea of vector semantics is to represent a word as a point in some multi-dimensional semantic space. Vectors for representing words are generally called embeddings, because the word is embedded in a particular vector space."

"The idea that two words that occur in very similar distributions (that occur together with very similar words) are likely to have the same meaning."

The second idea, that two words with similar distributions have similar meanings, is the basis of the **distributional hypothesis** and of **distributional semantics**.

> **Distributional hypothesis**
> 
> Words that occur in similar contexts tend to have similar meanings.
>
> **Distributional semantics**
>
> A word's meaning is given by the words that frequently appear close-by.

The vector semantics model instantiates the distributional hypothesis by learning representations of the meaning of words directly from their distributions in texts. It offers a fine-grained model of meaning that also lets us compute word similarity (and phrase similarity).

Vector semantic models are also extremely practical because they can be learned automatically from text without any complex labeling or supervision.
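
One simple way to see how such representations can come from raw text alone is to count, for each word, which words appear near it. The sketch below does this with a hypothetical helper `cooccurrence_vectors`, an assumed window of two words on each side, and three invented sentences; it illustrates the distributional idea and is not the method used by any particular embedding model.

```python
from collections import Counter, defaultdict


def cooccurrence_vectors(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


# Hypothetical unlabeled sentences, invented for illustration only.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the dog",
]

vectors = cooccurrence_vectors(sentences)
print(vectors["cat"])
print(vectors["dog"])
```

Here *cat* and *dog* share most of their context words (*the*, *drinks*, *chases*), which is the kind of distributional overlap the hypothesis relies on.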


### word2vec

"Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in that space." ([Wikipedia](https://en.wikipedia.org/wiki/Word2vec))

#### Skip-gram

One of the word2vec models is skip-gram, which predicts the context words $ \large w_{t+j}$ within a window of fixed size, given a center word $ \large w_t $.

![word2vec diagram](https://user-images.githubusercontent.com/114733/75586670-b5606400-5a53-11ea-9df0-648c9a07e5f8.jpg)
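
The (center word, context word) pairs that skip-gram is trained on can be enumerated as in the sketch below. The function name `skipgram_pairs` is hypothetical, and `m` plays the role of the window size; positions that fall outside the sentence are simply skipped, which is one possible way to handle the boundaries.

```python
def skipgram_pairs(tokens, m=2):
    """Return (center, context) pairs for every offset j with 0 < |j| <= m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j == 0 or not 0 <= t + j < len(tokens):
                continue  # skip the center itself and out-of-range positions
            pairs.append((center, tokens[t + j]))
    return pairs


tokens = "the lazy dog barked loudly".split()
for center, context in skipgram_pairs(tokens, m=1):
    print(center, "->", context)
```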


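We are looking for the word vectors (embeddings) $ \large \theta $ that maximize the likelihood of the observed context words, taken over every position $ \large t = 1, \dots, T $ in the corpus and every offset $ \large j $ within a window of size $ \large m $:

$$ \Large  L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \leq j \leq m \\ j \neq 0}} P(w_{t+j}|w_t; \theta) $$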

and we end up minimizing the average negative log likelihood:

$$ \Large  J(\theta) = - \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j}|w_t; \theta) $$
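
The step from $ \large L(\theta) $ to $ \large J(\theta) $ can be spelled out. Taking the logarithm turns the products into sums, and because the logarithm is monotonically increasing it does not change which $ \large \theta $ is optimal:

$$ \Large \log L(\theta) = \sum_{t=1}^{T} \sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j}|w_t; \theta) $$

$$ \Large J(\theta) = - \frac{1}{T} \log L(\theta) $$

Scaling by $ \large \frac{1}{T} $ and flipping the sign also leave the optimal $ \large \theta $ unchanged, so maximizing $ \large L(\theta) $ is the same as minimizing $ \large J(\theta) $.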


#### Softmax

In order to calculate the probability distribution $ \large P(w_{t+j}|w_t; \theta) $, skip-gram uses the softmax function:

$$ \Large P(o|c) = \frac{\exp(u_{o}^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}  $$

Here $ \large o $ is the outside (context) word and $ \large c $ is the center word; $ \large v_c $ is the center word vector, $ \large u_o $ is the outside word vector, and $ \large V $ is the vocabulary.
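
The softmax formula can be evaluated directly, as in the sketch below. The function name `softmax_probability`, the three-word vocabulary, and the two-dimensional vectors are hypothetical values chosen for illustration; subtracting the maximum score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def softmax_probability(outside_word, center_word, outside_vectors, center_vectors):
    """P(o | c) = exp(u_o . v_c) / sum over w in the vocabulary of exp(u_w . v_c)."""
    v_c = center_vectors[center_word]
    scores = {w: dot(u_w, v_c) for w, u_w in outside_vectors.items()}
    max_score = max(scores.values())  # subtract the max for numerical stability
    exp_scores = {w: math.exp(s - max_score) for w, s in scores.items()}
    return exp_scores[outside_word] / sum(exp_scores.values())


# Hypothetical 2-dimensional vectors for a three-word vocabulary.
outside_vectors = {"dog": [1.0, 0.0], "barked": [0.5, 1.0], "car": [-1.0, 0.5]}  # u_w
center_vectors = {"dog": [0.2, 1.0], "barked": [1.0, 0.3], "car": [-0.5, -0.5]}  # v_w

probs = {
    w: softmax_probability(w, "dog", outside_vectors, center_vectors)
    for w in outside_vectors
}
print(probs)  # the three probabilities sum to 1
```

Note that the sum in the denominator runs over the whole vocabulary, which is cheap here but expensive for a realistic vocabulary size.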

## References

[1] Speech and Language Processing ([3rd edition draft](https://web.stanford.edu/~jurafsky/slp3/)). Daniel Jurafsky, James H. Martin

[2] Neural Network Methods for Natural Language Processing ([2017](http://u.cs.biu.ac.il/~yogo/nnlp.pdf)). Yoav Goldberg

[3] Deep Learning ([online version](https://www.deeplearningbook.org/)). Ian Goodfellow, Yoshua Bengio, Aaron Courville

[4] Introduction and Word Vectors ([slides](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture01-wordvecs1.pdf)) ([video](https://youtu.be/8rXD5-xhemo)). CS224n: NLP with Deep Learning (2019), Stanford University

[5] Recurrent Neural Networks and Language Models ([slides](http://web.stanford.edu/class/cs224n/slides/cs224n-2020-lecture06-rnnlm.pdf)) ([video](https://youtu.be/iWea12EAu6U)). CS224n: NLP with Deep Learning (2019), Stanford University