<a href="https://colab.research.google.com/github/arezaz/nlp-workshop/blob/master/IndoDataWeek_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **The Natural Language Processing Workshop**
---
#### Alireza Rezazadeh (rezaz003@umn.edu)
Indo Data Week - November, 2020


## **1. Word Representation**

#### **Overview**


1. Word Representation:
  * 1.1. Basic Approaches
      * 1.1.1. One-Hot Encoding
      * 1.1.2. Bag of Words
      * 1.1.3. TF-IDF
  * 1.2. Word Embedding
      * 1.2.1. Word2Vec
      * 1.2.2. GloVe
      

#### **1.1. Basic Approaches**


The *vector space model* is an algebraic model that represents text as vectors of identifiers. All the approaches discussed here are, in one way or another, based on this idea:

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/1.png" alt="Drawing" width="350" />

The most basic idea is to assign a unique ID to each word in the vocabulary. This way, we can represent text as a multi-dimensional vector. Let's go over a few ways to implement this idea!

##### **1.1.1. One-Hot Encoding**
<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/2.png" alt="Drawing" width="700" />

Let's try this!

In [1]:
VOCAB = {'yellow':0, 'apple':1, 'dog':2, 'cat':3, 'blue':4, 'black':5, 'sky':6}

In [2]:
def onehot(text):
  ### your code here ###
   
  return onehot_rep

In [None]:
onehot('the sky is blue and the cat is black')
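
One possible way to fill in the exercise above, as a minimal sketch. The name `onehot_sketch` and the decision to silently skip words that are not in `VOCAB` are assumptions made for illustration, not part of the workshop code.

```python
VOCAB = {'yellow': 0, 'apple': 1, 'dog': 2, 'cat': 3, 'blue': 4, 'black': 5, 'sky': 6}

def onehot_sketch(text, vocab=VOCAB):
    """Return one one-hot vector per in-vocabulary word of `text`."""
    vectors = []
    for word in text.lower().split():
        if word not in vocab:
            continue  # assumption: out-of-vocabulary words are skipped
        vec = [0] * len(vocab)   # one slot per vocabulary entry
        vec[vocab[word]] = 1     # switch on the slot for this word's ID
        vectors.append(vec)
    return vectors

rep = onehot_sketch('the sky is blue and the cat is black')
for vec in rep:
    print(vec)
```

Only four of the nine words are in the vocabulary, so the sketch returns four vectors of length seven, each containing a single 1.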

Why is this approach not so practical?
  * High dimensionality of the representation.
  * Sparsity leads to computational difficulties.
  * Does not capture semantic relation between words.




##### **1.1.2. Bag of Words**

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/3.png" alt="Drawing" width="700" />

Let's try this!

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [ 'This is the first document and it is interesting.',
           'This document is the second document and it is in English.',
           'And this is the third one.',
           'Is this the first document?']

count_vect = CountVectorizer() # CountVectorizer(binary=True)
### your code here ###


In [None]:
# Bag-of-words representation of a new document
NewDocument = "dog and dog are friends"
temp = count_vect.transform([NewDocument])
print("BoW of '" + NewDocument + "': ",temp.toarray())
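
To make the counting explicit, here is a from-scratch sketch of a bag-of-words encoder in plain Python. It is not how `CountVectorizer` is implemented; the helper names `tokenize` and `bow` and the simple punctuation stripping are assumptions for illustration.

```python
from collections import Counter

corpus = ['This is the first document and it is interesting.',
          'This document is the second document and it is in English.',
          'And this is the third one.',
          'Is this the first document?']

def tokenize(doc):
    # assumption: lowercase, split on whitespace, strip simple punctuation
    return [w.strip('.,?!').lower() for w in doc.split()]

# The vocabulary is the sorted set of all words seen in the corpus
vocab = sorted({w for doc in corpus for w in tokenize(doc)})

def bow(doc):
    counts = Counter(tokenize(doc))
    # one count per vocabulary word; unseen words are simply dropped
    return [counts[w] for w in vocab]

vec = bow("dog and dog are friends")
print(vocab)
print(vec)
```

In this new document only "and" is in the vocabulary, so every other position is zero and the two occurrences of "dog" leave no trace in the vector.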

A few advantages:
  *   Straightforward and easy to implement.
  *   Captures semantic similarity for documents.
  *   Encoding is fixed length for any document.

Some drawbacks:
  *   Vectors can become sparse for larger vocabularies.
  *   Blind! It cannot capture semantic similarity between words.
  *   What if a new word is not in the vocabulary?
  *   It's a bag! It disregards word order.

##### **1.1.3. TF-IDF**
 

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/4.png" alt="Drawing" width="1400" />

Let's try this!

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

tfidf = TfidfVectorizer()

### your code here ###


In [None]:
NewDocument = "This English document is interesting"
temp = tfidf.transform([NewDocument])
print("TF-IDF for '"+NewDocument+"' :\n", temp.toarray())
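
As a rough illustration of the idea, the sketch below computes one textbook variant of TF-IDF by hand: term frequency times `log(N / df)`. This is an assumption chosen for clarity; `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The helper names are hypothetical.

```python
import math
from collections import Counter

corpus = ['This is the first document and it is interesting.',
          'This document is the second document and it is in English.',
          'And this is the third one.',
          'Is this the first document?']

def tokenize(doc):
    # assumption: lowercase, split on whitespace, strip simple punctuation
    return [w.strip('.,?!').lower() for w in doc.split()]

docs = [tokenize(d) for d in corpus]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    # words never seen in the corpus have no document frequency and are skipped
    return {w: (count / len(doc_tokens)) * math.log(n_docs / df[w])
            for w, count in tf.items() if w in df}

scores = tfidf(docs[0])
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {score:.3f}")
```

Words that occur in every document, such as "this", get a score of zero, while a word that occurs in a single document, such as "interesting", gets the highest score.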

There are a few variations of the TF-IDF formulation. Advantages of this approach are:
*   Like bag-of-words, it captures semantic similarity for documents.
*   Very commonly used (even today!).

But, like any other approach, there are some drawbacks:
*   Curse of dimensionality! Feature vectors can become very high dimensional.
*   Still doesn't capture semantic relations between words.
*   Sparsity in feature vectors.
*   Does not handle out-of-vocabulary words.



#### **1.2. Word Embeddings**

Words that can be used interchangeably often have a strong semantic relation.

> The [*cat*, *dog*] is a [*domestic*, *wild*] species of [*small*, *large*] carnivorous mammal.  - *Wikipedia*

Word embeddings represent words as real-valued vectors in a predefined vector space. In particular, word embeddings belong to the class of dense distributed representations. Unlike sparse representations, such as one-hot encoding, distributed representations are learned from the usage of words.

##### **1.2.1. Word2Vec**
 

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/5.png" alt="Drawing" width="1700" />

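Word embedding models are usually trained on huge datasets with large vocabularies. Luckily, we often don't need to train these neural networks from scratch, since pre-trained models are available. To get familiar with the API first, let's build a small model on a toy dataset!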

In [1]:
from gensim.models import Word2Vec 
from gensim.test.utils import common_texts

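# Build a Word2Vec model on a custom dataset
# (gensim >= 4.0 names the dimensionality argument `vector_size`; older versions call it `size`)
model = Word2Vec(common_texts, vector_size=10, window=5, min_count=1, workers=4)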

In [None]:
# Finding similar words
### your code here ###

# the representation for one of the words
### your code here ###


##### **1.2.2. GloVe**
 

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/6.png" alt="Drawing" width="1700" />

Let's try a pre-trained GloVe model based on Twitter data!

In [None]:
import gensim.downloader

# loading GloVe model trained on tweets
glove_model = gensim.downloader.load('glove-twitter-25')

# Finding similar words
### your code here ###


To measure the similarity between two vectors, we can use the *cosine similarity* metric. It is defined as the cosine of the angle between the two vectors, which is the same as their inner product after both have been normalized to length 1.

<img src="https://www.oreilly.com/library/view/mastering-machine-learning/9781785283451/assets/d258ae34-f4f8-4143-b3c2-0cb10f2b82de.png" alt="Drawing" width="800" />
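
The definition can be written out directly in plain Python. This is a minimal sketch on hand-made vectors; the function name `cosine_similarity` and the example vectors are assumptions for illustration.

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 2], [2, 4]))    # same direction
print(cosine_similarity([1, 0], [0, 1]))    # orthogonal
print(cosine_similarity([1, 0], [-1, 0]))   # opposite direction
```

Note that `scipy.spatial.distance.cosine` returns the cosine *distance*, which is one minus this value, so a similarity function built on it needs to subtract its result from 1.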


In [None]:
from scipy import spatial

# similarity function based on cosine similarity
def similarity(w1,w2):
  ### your code here ###
  
  return  

In [None]:
# compare the similarity of 'man', 'woman', and 'umbrella'!
### your code here ###


In [None]:
# compare the similarity of 'large', 'larger', and 'small', 'smaller'
### your code here ###


This is interesting. But wait! Does data favor everyone equally? What if we want to use a model trained on a large dataset of tweets to make decisions for society?

Let's take a look at how language can be unfair and biased! 

In [None]:
# compare similarities between 'careers' and 'names'!
### your code here ###


We experimented with one of the common word embedding models and observed how a machine learning model trained on biased data can itself be biased. Recently, the subject of AI fairness has been attracting more attention.

---


## **2. Language Models**

#### **Overview**

2. Language Models
  * 2.1 Basic Models
      * 2.1.1. Unigram/Bigram Models
      * 2.1.2. N-gram Models
  * 2.2 Recurrent Neural Networks (RNN)
      * 2.2.1. Basic RNN
      * 2.2.2. Vanilla RNN
      * 2.2.3. LSTM
      * 2.2.4. Deep RNN

* Hands-on: Building a Language Model

#### **2.1. Basic Models**


Language models, originally developed for speech recognition, are widely used in many NLP applications. Probabilistic language models compute the probability of a sentence or sequence of words. Specifically, language models can be developed to predict the likelihood of a given word, or a sequence of words, to follow a sequence of words.

##### **2.1.1. Unigram/Bigram Models**

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/7.png" alt="Drawing" width="1700" />
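
As a small illustration of a bigram model, the sketch below estimates `P(word | previous word)` by counting and multiplies these estimates to score a sentence. The toy corpus, the `<s>`/`</s>` boundary markers, and the function names are assumptions for illustration, and no smoothing is applied.

```python
from collections import Counter

corpus = ["i like nlp", "i like cats", "cats like fish"]  # hypothetical toy corpus

context_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    context_counts.update(tokens[:-1])             # every token that has a successor
    bigram_counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs

def bigram_prob(prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(bigram_prob("i", "like"))
print(sentence_prob("i like cats"))
print(sentence_prob("fish like i"))
```

Any sentence that contains a bigram never seen in the corpus gets probability zero, which is why practical models add smoothing.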

##### **2.1.2. N-gram Models**

<img src="https://raw.githubusercontent.com/arezaz/nlp-workshop/master/contents/8.png" alt="Drawing" width="1800" />


In [None]:
import nltk
from nltk.util import ngrams
nltk.download('punkt')


# function to extract n-grams of sentences
def ngrams_gen(seq, n):
  ### your code here ###

    return

In [None]:
# generate n-grams!: n = 1, 2, 3, 4
sentence = 'The whole is more than the sum of its parts' # -Aristotle

### your code here ###
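
One way to see what an n-gram extractor does, without any library, is to slide a window over the token list. This is a sketch under the assumption of simple whitespace tokenization; the name `ngrams_sketch` is hypothetical and it is not the workshop's intended solution.

```python
def ngrams_sketch(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    # zip stops at the shortest shifted copy, which yields exactly the full windows
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = 'The whole is more than the sum of its parts'
tokens = sentence.split()

for n in (1, 2, 3, 4):
    grams = ngrams_sketch(tokens, n)
    print(n, len(grams), grams[:2])
```

A sentence of ten tokens yields ten unigrams, nine bigrams, eight trigrams, and seven 4-grams.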

In general, N-gram models are very useful, although they are not sufficient models of language, because language essentially depends on complicated long-distance histories.

#### **2.2. Recurrent Neural Networks**

As mentioned earlier, language is sequential in its essence. *Recurrent neural networks* (RNNs) are specifically developed to capture sequential processes. In general, an RNN is composed of units that keep a memory of the preceding history. This memory is temporal, as the units keep updating at every timestep. RNNs are widely used in many NLP applications.


*images courtesy of cs230 - stanford, and colah.github.io*

##### **2.2.1. Basic RNN**

At each timestep $t$, the unit takes input $x^{<t>}$, outputs $y^{<t>}$, and passes a signal $a^{<t>}$ based on its computations to the next timestep $t+1$. This signal acts as a memory in the sequence.


<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/architecture-rnn-ltr.png?9ea4417fc145b9346a3e288801dbdfdc" alt="Drawing" width="600" />

Specifically, 

$a^{<t>}=g_a(W_a [a^{<t-1>}, x^{<t>}]+b_a)$

$\hat{y}^{<t>}=g_y(W_y a^{<t>}+b_y)$

The loss function is defined as: $L(\hat{y}, y) = \sum_{t=1}^{T}L(\hat{y}^{<t>}, y^{<t>})$. The RNN is trained by *backpropagation through time*.
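
The two update equations for $a^{<t>}$ and $\hat{y}^{<t>}$ can be sketched in NumPy as a forward pass over a short sequence. The layer sizes, the random weights, and the choices $g_a = \tanh$ and $g_y = \mathrm{softmax}$ are assumptions for illustration; no training is performed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_a, n_y = 3, 4, 2   # hypothetical input, hidden, and output sizes

# W_a acts on the concatenation [a^{<t-1>}, x^{<t>}]
W_a = rng.normal(scale=0.1, size=(n_a, n_a + n_x))
b_a = np.zeros(n_a)
W_y = rng.normal(scale=0.1, size=(n_y, n_a))
b_y = np.zeros(n_y)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t):
    a_t = np.tanh(W_a @ np.concatenate([a_prev, x_t]) + b_a)  # g_a = tanh (assumption)
    y_hat_t = softmax(W_y @ a_t + b_y)                         # g_y = softmax (assumption)
    return a_t, y_hat_t

a = np.zeros(n_a)                   # initial memory
xs = rng.normal(size=(5, n_x))      # a random sequence of 5 timesteps
outputs = []
for x_t in xs:
    a, y_hat = rnn_step(a, x_t)     # the same weights are reused at every timestep
    outputs.append(y_hat)

print(len(outputs), a.shape)
```

The only thing carried from one timestep to the next is `a`, which is what lets the unit act as a memory of the sequence so far.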

Type | Schematic | Application
 --- |    ---    |     ---
one-to-one | <img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-one-to-one-ltr.png?9c8e3b04d222d178d6bee4506cc3f779" alt="Drawing" width="400" /> | a basic neural network
one-to-many | <img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-one-to-many-ltr.png?d246c2f0d1e0f43a21a8bd95f579cb3b" alt="Drawing" width="400" /> | music generation
many-to-one | <img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-many-to-one-ltr.png?c8a442b3ea9f4cb81f929c089b910c9d" alt="Drawing" width="400" /> | sentiment classification
many-to-many | <img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-many-to-many-same-ltr.png?2790431b32050b34b80011afead1f232" alt="Drawing" width="400" /> | named-entity recognition (NER)
many-to-many | <img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/rnn-many-to-many-different-ltr.png?8ca8bafd1eeac4e8c961d9293858407b" alt="Drawing" width="400" /> | machine translation

##### **2.2.2. Vanilla RNN**

The vanilla RNN is the simplest RNN unit. Its non-linearity is set to $\tanh$, so the update can be written as:

$a^{<t>}=\tanh(W_a[a^{<t-1>}, x^{<t>}]+b_a)$



<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-SimpleRNN.png" alt="Drawing" width="600" />

##### **2.2.3. LSTM**

LSTM (long short-term memory) networks, introduced in the 1990s, work well across a wide variety of sequence modeling tasks. They are formulated to allow learning lags of unknown duration between relevant events in sequential data. A major feature of LSTMs is their ability to express the notion of forgetting.

There are many variations of LSTMs. In general, an LSTM cell has three main components, also called *gates*: input gate, output gate, and forget gate. On top of that, each cell has a *state* associated with it.


<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-chain.png" alt="Drawing" width="600" />

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM2-notation.png" alt="Drawing" width="500" />


So, what is the main idea?

The cell state runs through all timesteps, and the cell can add information to or remove information from this state. The three gates allow the cell to selectively let information through. Each gate is built around a sigmoid layer whose outputs, between 0 and 1, determine how much of each component to let through.

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-C-line.png" alt="Drawing" width="700" />


Let's briefly walk through an LSTM cell:

*1) Decide what to forget:* look at the previous timestep hidden state $h_{t-1}$ and the input at this timestep $x_t$ and through a sigmoid layer determine how much to forget *(forget gate)*: 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-f.png" alt="Drawing" width="700" />

*2) Decide what to remember:* look at the previous timestep hidden state $h_{t-1}$ and the input at this timestep $x_t$. First, decide which values to update through a sigmoid layer *(input gate)*. Then, compute the candidate values for the updates: 

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-i.png" alt="Drawing" width="700" />

*3) Update cell states:* now that we know what to remember and what to forget, we can apply both to the cell state:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-C.png" alt="Drawing" width="700" />

*4) Come up with an output:* look at the previous timestep hidden state $h_{t-1}$ and the input at this timestep $x_t$, and through a sigmoid layer decide which parts of the cell state to output (*output gate*). Then, pass the updated cell state through $\tanh$ and multiply it by the output gate to compute the hidden state at this timestep $h_t$:

<img src="http://colah.github.io/posts/2015-08-Understanding-LSTMs/img/LSTM3-focus-o.png" alt="Drawing" width="700" />
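
The four steps of the walk-through can be sketched as a single NumPy function, following the standard LSTM formulation shown in the figures. The sizes, the random weights, and the names (`lstm_step`, `W_f`, `W_i`, `W_c`, `W_o`) are assumptions for illustration; no training is performed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, n_h = 3, 4   # hypothetical input and hidden sizes

# One weight matrix and bias per layer, each acting on [h_{t-1}, x_t]
W_f, W_i, W_c, W_o = (rng.normal(scale=0.1, size=(n_h, n_h + n_x)) for _ in range(4))
b_f, b_i, b_c, b_o = (np.zeros(n_h) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W_f @ z + b_f)          # 1) forget gate: what to drop from the cell state
    i_t = sigmoid(W_i @ z + b_i)          # 2) input gate: which values to update
    c_tilde = np.tanh(W_c @ z + b_c)      #    candidate values for the update
    c_t = f_t * c_prev + i_t * c_tilde    # 3) new cell state
    o_t = sigmoid(W_o @ z + b_o)          # 4) output gate: what to expose
    h_t = o_t * np.tanh(c_t)              #    new hidden state
    return h_t, c_t

h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.normal(size=(5, n_x)):     # a random sequence of 5 timesteps
    h, c = lstm_step(h, c, x_t)

print(h)
print(c)
```

Because every gate value lies between 0 and 1, the cell state is a blend of old content and new candidates rather than a full overwrite at each step.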

As mentioned before, LSTMs are great and promising in general but we should also be aware of some of their drawbacks:

* LSTMs were developed to mitigate the problem of vanishing gradients during backpropagation. However, they can still suffer from vanishing gradients to some extent.
* The cell function is quite complex, and this can exacerbate the issue of vanishing gradients.
* They are computationally expensive to train to a decent performance level.
* They still struggle to remember the history of a sequence over very long spans of timesteps.

##### **2.2.4. Deep RNN**

How do we make an RNN deep? Just stack them up and build layers on top of each other!

<img src="https://stanford.edu/~shervine/teaching/cs-230/illustrations/deep-rnn-ltr.png?f57da6de44ddd4709ad3b696cac6a912" alt="Drawing" width="400" />


---

## 3. Hands-on: Working with Transformers



### So, what are Transformers really?

In natural language processing, Transformers are state-of-the-art sequence-to-sequence language models. The backbone of a Transformer is the *attention mechanism*, which handles long-distance dependencies.

The attention mechanism weights the different positions of a single sequence in order to compute a representation of that sequence. Here's how it looks:

<img src="https://miro.medium.com/max/700/1*wa4zt-LcMWRIYLfiHfBKvA.png" alt="Drawing" width="450" />
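
As a rough sketch of this weighting, the code below implements single-head scaled dot-product self-attention in NumPy, the formulation commonly used in Transformers. The sizes, the random projection matrices, and the function names are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X holds one row per position; returns new representations and the weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # similarity of every position to every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 5               # hypothetical sizes
X = rng.normal(size=(seq_len, d_model))       # stand-in for 4 token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
print(weights.round(2))
```

Each row of `weights` says how much that position draws on every position in the sequence, including distant ones, which is how attention handles long-distance dependencies in a single step.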




### 3.1. GPT-2
Generative Pre-trained Transformers (GPT) are developed by OpenAI. GPT-2, announced in 2019, is an unsupervised language model released in four sizes (small, medium, large, and XL). The largest variant is a very large model with 1.5 billion parameters. The pre-trained models were trained on text from 8 million web pages. The building block of GPT-2 is:

<img src="https://camo.githubusercontent.com/795bd8868fdeb49b7ca48a935e806b56169a172b/68747470733a2f2f692e696d6775722e636f6d2f305853535842642e706e67" alt="Drawing" width="180" />


#### 3.1.1. Next Word Generation Using GPT-2



In [None]:
!pip install transformers

In [None]:
# importing the GPT-2 classes
# (the pytorch-transformers package has since been renamed to transformers)
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [None]:
# creating an instance of GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

In [None]:
# loading a pre-trained GPT-2 model with a language-modeling head
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.eval()
model

In [None]:
# let's predict the next word in this sequence
text = "artificial intelligence aims to predict the"
indexed_tokens = tokenizer.encode(text)
tokens_tensor = torch.tensor([indexed_tokens])
print(f'tokens are: {indexed_tokens}')

In [None]:
# Predicting the next word in the sequence
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

In [None]:
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])
predicted_text

#### 3.1.2. Next Sentences Generation Using GPT-2



In [None]:
start = 'Intelligence is most often studied in humans but has also been observed in both'
indexed_tokens = tokenizer.encode(start)

for i in range(150):
  tokens_tensor = torch.tensor([indexed_tokens])
  with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    indexed_tokens = indexed_tokens + [predicted_index]

In [None]:
# the loop above already appended every predicted token to indexed_tokens
predicted_text = tokenizer.decode(indexed_tokens)
print(predicted_text)

---
## Relevant Materials
* Vajjala, Majumder, Gupta, Surana - [Practical Natural Language Processing: A Comprehensive Guide to Building Real-World NLP Systems](https://www.amazon.com/Practical-Natural-Language-Processing-Pragmatic/dp/1492054054)
* [Sequence Models](https://www.coursera.org/learn/nlp-sequence-models) by deeplearning.ai
* [Stanford's CS230](https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks) course materials
* Christopher Olah's [blog](https://colah.github.io/)
