
# Introduction to Python and Natural Language Technologies
## Lecture 07, Deep Learning for NLP
### March 23, 2021



## Data science

> The nontrivial extraction of implicit, previously unknown, and
> potentially useful information from data.

-   non-trivial

-   relationship between data points

-   (large) dataset

-   make predictions on unknown examples

## Examples and counterexamples: convenience store statistics

-   Number of customers

    -   trivial information

-   Last month’s income

    -   trivial information

-   Items most frequently bought together

    -   *finding frequent itemsets*

-   How many cashiers need to be open on Friday at 4 pm?

    -   customer queuing model

    -   *time series modeling*



## Vector Representation


- sample  
    - one data point - <span style="color: darkred">vector</span>

- feature  
    - a property or attribute of a sample - <span style="color: darkred">one element of a vector</span>

    - length of the mail

    -   sender

    -   Does it contain the word *Rolex*?

    -   Does it contain the expression *Trust fund*?

- dataset  
    - collection of all samples - <span style="color: darkred">matrix</span>

- label  
    - correct *answers* for all samples in a dataset - <span style="color: darkred">vector</span>
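
The mapping from samples to vectors can be sketched in a few lines of plain Python. This is only an illustration: the two e-mails, the `extract_features` helper and the spam labels are made up, and the sender feature is left out for brevity.

```python
# Hypothetical toy e-mails; texts and labels are invented for illustration.
emails = [
    {"sender": "boss@example.com", "text": "Meeting moved to Monday"},
    {"sender": "promo@example.com", "text": "Cheap Rolex watches, claim your trust fund now"},
]


def extract_features(email):
    """Turn one sample into a feature vector (a list of numbers)."""
    text = email["text"].lower()
    return [
        len(text),                  # length of the mail
        int("rolex" in text),       # does it contain the word "Rolex"?
        int("trust fund" in text),  # does it contain the expression "trust fund"?
    ]


# Dataset: one row per sample, one column per feature (a matrix).
X = [extract_features(e) for e in emails]
# Labels: one correct answer per sample (a vector); here 1 = spam, 0 = not spam.
y = [0, 1]

print(X)
print(y)
```

Each row of `X` is one sample, each column is one feature, and `y` holds one label per row.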


# Machine Learning and Deep Learning

## Neural networks and Deep learning
- Neural networks are inspired by how biological nervous systems process information
- They are composed of neurons arranged in layers, each layer connected to the next
- Deep learning uses neural networks consisting of multiple layers
    - the idea is not new
    - it has returned thanks to the rise of GPUs
    - and good frameworks (PyTorch, TensorFlow)

## Deep Learning

- it has a black-box nature
- interpreting them is hard
- we don't exactly know the reasoning behind a decision
- can we trust deep learning?
- the latest language model of **OpenAI**, **GPT-3**, has 175B trainable parameters [link](https://news.developer.nvidia.com/openai-presents-gpt-3-a-175-billion-parameters-language-model/)

<img src="img/dl/network.png">

This is a __feed-forward neural network__ with two hidden layers. Each neuron applies an activation function:

$$\mathbf{h_1} = \sigma (\mathbf{W_1 x})$$
$$\mathbf{h_2} = \sigma (\mathbf{W_2 h_1})$$
$$\mathbf{y} = \sigma (\mathbf{W_3 h_2})$$

$\sigma$: activation function, typically non-linear such as the sigmoid
function $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

During training, the weights $\mathbf{W_1}$, $\mathbf{W_2}$ and $\mathbf{W_3}$ are learned so that the network can predict the output for a new input.
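
The three equations above can be traced by hand with a small forward pass. The sketch below uses plain Python lists; the layer sizes and weight values are arbitrary assumptions chosen for illustration, not learned weights.

```python
import math


def sigmoid(x):
    """The sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(W, v):
    """Compute sigma(W v) for a weight matrix W (list of rows) and a vector v."""
    return [sigmoid(sum(w * x for w, x in zip(row, v))) for row in W]


# Hypothetical weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
W1 = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
W2 = [[0.2, -0.5, 0.3], [0.7, 0.1, -0.4]]
W3 = [[1.0, -1.0]]

x = [1.0, 2.0]
h1 = layer(W1, x)   # first hidden layer
h2 = layer(W2, h1)  # second hidden layer
y = layer(W3, h2)   # output layer

print(h1, h2, y)
```

Because the last layer also applies the sigmoid, the output always lies strictly between 0 and 1.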

### Activation functions

![act](https://cdn-images-1.medium.com/max/1000/1*4ZEDRpFuCIpUjNgjDdT2Lg.png)

__What is inside a neuron?__

![perceptron](https://c.mql5.com/2/35/artificialneuron__1.gif)


*image from https://www.mql5.com/en/blogs/post/724245*

## ML vs DL

<img src="img/dl/ai_ml_dl.png" />

## ML vs DL

- Deep learning
    - Automatic feature engineering
    - Scalable with big data
    - Can solve non-linearly separable problems as well (traditional methods often struggle with non-linearity)
    - Currently most state-of-the-art methods are based on DL
- Traditional machine learning
    - Feature extraction is done manually
    - Can learn relatively well from small data (DL usually needs more)
    - Scales worse with big data
    - Can be enough for small tasks

## Learning types 
-   **Supervised learning**

    -   is a problem where for every input variable (x) there is an output
        variable (y) in the training data

    -   the output variables are usually prepared by human annotators
        (labeled data)

-   **Unsupervised learning**

    -   is a problem where only the input variable(x) is present in the
        training data

    -   still can be very useful since labeling data is very resource
        hungry and expensive

## Learning problems

-   __Classification__ (supervised learning)

    -   assign a label for each sample

    -   labels are predefined and usually not very numerous

    -   e.g. sentiment analysis

-   __Regression__ (supervised learning)

    -   predict a continuous variable

    -   e.g. predict real estate prices, stock market based on history,
        location, amenities

-   __Clustering__ (unsupervised learning)

    -   group samples into clusters according to a similarity measure
     
    -   e.g. group similar Facebook comments
    
    -   goal: high intra-group similarity (samples in the same cluster
        should be similar to each other), low inter-group similarity
        (samples in different clusters shouldn’t be similar)

<img src="img/dl/sup_unsup.png" />

### Evaluation - Binary classification

<img src="img/dl/true_false.png" />

**Accuracy**: fraction of correctly guessed labels among all the samples

#### __Precision, recall and F-score:__

**Precision**: fraction of truly positive samples among those predicted positive
$$\text{Precision}=\frac{tp}{tp+fp}$$

**Recall**: fraction of all positive samples that were recovered (predicted positive)
$$\text{Recall}=\frac{tp}{tp+fn}$$

**F-score**: harmonic mean of precision and recall
$$\text{F-score} = 2 \cdot \frac{\text{prec} \cdot \text{rec}}{\text{prec} + \text{rec}}$$
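
As a worked example, the metrics above can be computed directly from the true and predicted labels. The label lists are invented toy data, and the helper name `precision_recall_f1` is an assumption for this sketch.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F-score for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F-score all equal 0.75.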

### Evaluation - Regression: root-mean-square error

$$\operatorname{RMSE}=\sqrt{\frac{\sum_{t=1}^n (\hat y_t - y_t)^2}{n}},$$

where $\hat y_t$ is the predicted value and $y_t$ is the true value of
sample $t$, and $n$ is the number of samples.
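
The formula translates almost literally into code. The sketch below uses made-up predicted and true values.

```python
import math


def rmse(y_pred, y_true):
    """Root-mean-square error between predicted and true values."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / n)


# Hypothetical predictions and true values.
predicted = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]

# Squared errors are 1, 0 and 4, so the mean squared error is 5/3.
print(rmse(predicted, actual))
```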

## Train, Validation and Test set

- __training set__
    - part of the dataset used for training


- __validation dataset__
    - part of the dataset used for cross-validation, early stopping and
    hyperparameter tuning


- __test set__
    - part of the dataset used for testing trained models. Your method
    should only be tested once on the test set.

<img src="img/dl/train_test_val.png" />
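
A three-way split can be sketched with the standard library alone. The function name `train_val_test_split`, the 70/15/15 proportions and the seed are assumptions for this example, not values prescribed by the lecture.

```python
import random


def train_val_test_split(samples, val_fraction=0.15, test_fraction=0.15, seed=1234):
    """Shuffle the samples and cut them into train, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train_set, val_set, test_set = train_val_test_split(range(100))
print(len(train_set), len(val_set), len(test_set))
```

The three parts are disjoint, so no sample used for tuning or testing is seen during training.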

## Terminology

-   How do ML/DL algorithms learn?

-   **Loss function**: measures the prediction error of our
    network, which tells us how good or bad our model is.

-   We want to **optimize** the loss/cost function.

-   How?

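    -   **Gradient descent** iteratively moves the weights towards a minimum
        of the loss function (in general a local minimum; finding the global
        one is not guaranteed)

    -   **Backpropagation** computes how much each weight contributed to the
        error (the gradient), which is then used to update the weights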

## Important concepts in machine learning

- __Cost Function__: used to measure how badly our model is performing on the data
- __Parameters__: variables that are updated during the training
- __Sample__: single row in our data
- __Batch size__: the number of samples our model works through before updating the weights
- __Epoch__: one epoch means that each sample in the training dataset has been passed through the model once
- __Iteration__: one update of the weights. It happens once for each batch.
- __Hyperparameters__: settings that are chosen before training and are not learned (number of epochs, batch size, learning rate, etc.)

- __Gradient descent__: iterative method used to find a minimum of the cost function (not necessarily the global one)

<img src="img/dl/gradient.gif?raw=true" />
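
The terms above can be seen together in a tiny mini-batch gradient descent loop. This is a minimal sketch under simple assumptions: the toy data follows $y = 2x$, the model has a single weight `w`, and the hyperparameter values are arbitrary.

```python
import random

# Hypothetical toy data following y = 2x, so the ideal weight is 2.
xs = [0.1 * i for i in range(1, 21)]  # 20 samples
ys = [2.0 * x for x in xs]

w = 0.0              # parameter: updated during training
LEARNING_RATE = 0.1  # hyperparameters: chosen before training
BATCH_SIZE = 5
N_EPOCHS = 50

rng = random.Random(0)
iterations = 0
for epoch in range(N_EPOCHS):
    order = list(range(len(xs)))
    rng.shuffle(order)  # visit the samples in a new order each epoch
    for start in range(0, len(order), BATCH_SIZE):
        batch = order[start:start + BATCH_SIZE]
        # Gradient of the mean squared error (w*x - y)^2 with respect to w.
        grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
        w -= LEARNING_RATE * grad  # one iteration = one weight update
        iterations += 1

print(w, iterations)
```

With 20 samples and a batch size of 5 there are 4 iterations per epoch, so 50 epochs give 200 weight updates.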

## Over- and Underfitting?
![under_over](https://miro.medium.com/max/2400/1*JZbxrdzabrT33Yl-LrmShw.png)
_(image from https://miro.medium.com/max/2400/1*JZbxrdzabrT33Yl-LrmShw.png)_

## Recurrent neural networks

- In NLP, recurrent neural networks (RNNs) are commonly used to analyse sequences.
- An RNN takes in a sequence of words, one at a time, and produces a hidden state ($h$) after each step.
- RNNs are applied recurrently: at each step they receive the current word and the hidden state from the previous word.
- Once we have our final hidden state, $h_T$ (from feeding in the last word in the sequence, $x_T$), we feed it through a linear (fully connected) layer, $f$, to reduce its dimension to the number of labels.
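
The recurrence described above can be sketched in plain Python. The update rule $h_t = \tanh(W_x x_t + W_h h_{t-1})$ is one common choice and is an assumption here, as are the word vectors and all weight values.

```python
import math


def rnn_step(x_t, h_prev, W_x, W_h):
    """One recurrent step: h_t = tanh(W_x x_t + W_h h_prev)."""
    return [
        math.tanh(
            sum(w * x for w, x in zip(W_x[i], x_t))
            + sum(w * h for w, h in zip(W_h[i], h_prev))
        )
        for i in range(len(W_h))
    ]


# Hypothetical 2-dimensional word vectors for a 3-word sentence.
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_x = [[0.5, -0.3], [0.2, 0.7]]  # input-to-hidden weights
W_h = [[0.1, 0.4], [-0.6, 0.3]]  # hidden-to-hidden weights
# Linear layer f: maps the 2-dimensional hidden state to 4 label scores.
W_f = [[1.0, -1.0], [0.5, 0.5], [-0.2, 0.9], [0.3, 0.3]]

h = [0.0, 0.0]  # initial hidden state
for x_t in sentence:
    h = rnn_step(x_t, h, W_x, W_h)  # the same weights are reused at every step

scores = [sum(w * v for w, v in zip(row, h)) for row in W_f]
print(h, scores)
```

Only the final hidden state is passed to the linear layer, which yields one score per label.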

![rnn](https://github.com/bentrevett/pytorch-sentiment-analysis/raw/79bb86abc9e89951a5f8c4a25ca5de6a491a4f5d/assets/sentiment1.png)

_(image from bentrevett)_

![rnn2](https://miro.medium.com/max/1400/1*WMnFSJHzOloFlJHU6fVN-g.gif)

![rnn3](https://miro.medium.com/max/770/1*o-Cq5U8-tfa1_ve2Pf3nfg.gif)

### LSTM

- One of the biggest problems of recurrent neural networks is the vanishing gradient problem.
- It happens when the gradient shrinks during backpropagation.
- If it becomes very small, the network stops learning. This mostly happens with long sentences.
- LSTM networks address this problem with an inner memory cell that remembers important information and forgets the rest.
- An LSTM has a similar flow to an RNN: it processes data and passes information on as it propagates forward.
- The difference is in the operations within the cells.
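
A rough way to see why long sentences cause trouble: during backpropagation the gradient is multiplied by one factor per time step. The sketch below assumes a constant factor of 0.5 purely for illustration; in a real network the factor depends on the weights and the activation derivatives.

```python
def gradient_after(steps, factor=0.5):
    """Scale of a gradient after being multiplied by `factor` at each time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad


short_sentence = gradient_after(5)   # 0.5 ** 5
long_sentence = gradient_after(50)   # 0.5 ** 50
print(short_sentence, long_sentence)
```

After 50 steps the gradient is many orders of magnitude smaller than after 5, so the earliest words barely influence the weight updates.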

![lstm](https://miro.medium.com/max/770/1*0f8r3Vd-i4ueYND1CUrhMA.png)



__LSTM__ consists of:

- __Forget gate__
    - Decides what information should be kept or thrown away
    - Uses information from the previous hidden state and from the current input

![forget](https://miro.medium.com/max/770/1*GjehOa513_BgpDDP6Vkw2Q.gif)


- __Input gate__
    - Decides what information is relevant to add from the current step

![input](https://miro.medium.com/max/770/1*TTmYy7Sy8uUXxUXfzmoKbA.gif)


- __Cell state__

![cell](https://miro.medium.com/max/770/1*S0rXIeO_VoUVOyrYHckUWg.gif)

- __Output gate__
    - Determines what the next hidden state should be

![lstm2](https://miro.medium.com/max/770/1*VOXRGhOShoWWks6ouoDN3Q.gif)

_(images from [link](https://towardsdatascience.com/illustrated-guide-to-lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21))_
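
The gates can be combined into a single step function. The sketch below follows the standard LSTM equations but simplifies everything to scalars; the `lstm_step` helper and the parameter values are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x_t + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # forget gate: how much of the old cell state to keep
    i = sigmoid(pre("input"))        # input gate: how much new information to add
    g = math.tanh(pre("candidate"))  # candidate values for the cell state
    c = f * c_prev + i * g           # new cell state
    o = sigmoid(pre("output"))       # output gate
    h = o * math.tanh(c)             # new hidden state
    return h, c


# Hypothetical parameters, identical for every gate to keep the example small.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x_t in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x_t, h, c, params)
print(h, c)
```

The cell state `c` is changed only by an elementwise scaling and an addition, which is what lets information survive over many steps.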

## Building a classification pipeline with neural networks

The AG's news topic classification dataset was constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the larger AG corpus of news articles. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The dataset is constructed by choosing the 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and the total number of testing samples is 7,600.


In [None]:
!pip install torchtext==0.4
!pip install torch
!pip install pandas
!pip install gensim
!pip install scikit-learn

First we are going to download the dataset using [torchtext](https://pytorch.org/text/stable/index.html):

In [None]:
NGRAMS = 2
from torchtext import data
from torchtext.datasets import text_classification
import os
if not os.path.isdir('./data'):
    os.mkdir('./data')
text_classification.DATASETS['AG_NEWS'](
    root='./data', ngrams=NGRAMS, vocab=None)

In [None]:
#Import the needed libraries
from tqdm import tqdm
from sklearn.model_selection import train_test_split as split
import numpy as np
import pandas as pd

Now we use [pandas](https://pandas.pydata.org/) to read the dataset into a DataFrame. This time we are going to use the whole dataset for training.

In [None]:
#1-World, 2-Sports, 3-Business, 4-Sci/Tech

train_data = pd.read_csv("./data/ag_news_csv/train.csv",quotechar='"', names=['label', 'title', 'description'])
test_data = pd.read_csv("./data/ag_news_csv/test.csv",quotechar='"', names=['label', 'title', 'description'])

In [None]:
#1-World, 2-Sports, 3-Business, 4-Sci/Tech

train_data.groupby("label").size()

## Deep learning model using [Pytorch](https://pytorch.org/)

In [None]:
# First we need to import pytorch and set a fixed random seed number for reproducibility
import torch

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [None]:
# We will use the same CountVectorizer that we did in the last lab
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features=10000)

word_to_ix = vectorizer.fit(train_data.title)
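
What the vectorizer does can be approximated in a few lines of plain Python. This sketch is similar in spirit only: the toy titles are invented, and scikit-learn's `CountVectorizer` uses a different tokenizer and orders its vocabulary differently.

```python
from collections import Counter

# Hypothetical toy titles standing in for the news headlines.
titles = ["stocks rise as oil falls", "oil prices rise", "team wins final"]


def tokenize(text):
    return text.lower().split()


# Keep only the most frequent words, similar in spirit to max_features.
MAX_FEATURES = 4
counts = Counter(word for title in titles for word in tokenize(title))
most_common = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:MAX_FEATURES]
vocab = {word: index for index, (word, _) in enumerate(most_common)}


def to_bow(text):
    """Count how often each vocabulary word occurs in the text."""
    vec = [0] * len(vocab)
    for word in tokenize(text):
        if word in vocab:  # words outside the vocabulary are ignored
            vec[vocab[word]] += 1
    return vec


print(vocab)
print(to_bow("oil prices rise"))
```

Every title becomes a vector of the same length, which is why the model's input dimension equals the vocabulary size.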

In [None]:
# First we define some other important parameters for our model
VOCAB_SIZE = len(word_to_ix.vocabulary_)
NUM_LABELS = 4 

# Split the training data, but this time we will have a validation dataset instead of just having
# train and test
tr_data, val_data = split(train_data, test_size=0.3, random_state=SEED)

In [None]:
# Preparing the data for the training and the validation set
# PyTorch operates on its own datatype, which is very similar to NumPy's arrays
# They are called Torch Tensors: https://pytorch.org/docs/stable/tensors.html
# They are optimized for training neural networks

tr_data_vecs = torch.FloatTensor(word_to_ix.transform(tr_data.title).toarray())
tr_labels = tr_data.label.tolist()

val_data_vecs = torch.FloatTensor(word_to_ix.transform(val_data.title).toarray())
val_labels = val_data.label.tolist()

# Labels in the CSV are 1-4; the loss function expects class indices starting at 0, hence label-1
tr_data_loader = [(sample, label-1) for sample, label in zip(tr_data_vecs, tr_labels)]
val_data_loader = [(sample, label-1) for sample, label in zip(val_data_vecs, val_labels)]

In [None]:
# We then define a BATCH_SIZE for our model
# Usually we don't feed the whole dataset into our model at once
# For this we have the BATCH_SIZE parameter. Choosing this right can also improve the performance
BATCH_SIZE = 64


# The DataLoader(https://pytorch.org/docs/stable/data.html) class helps us to prepare the training batches 
# It has a lot of useful parameters, one of them is _shuffle_, which will randomize the training dataset in each epoch
# This can also improve the performance of our model
from torch.utils.data import DataLoader

train_iterator = DataLoader(tr_data_loader,
                            batch_size=BATCH_SIZE,
                            shuffle=True,
                            )

valid_iterator = DataLoader(val_data_loader,
                          batch_size=BATCH_SIZE,
                          shuffle=False,
                          )

In [None]:
# We will need to copy our model to CPU or GPU, based on what we are using
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [None]:
from torch import nn
import torch.nn.functional as F  # needed for F.log_softmax in forward()

class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # Calls the init function of nn.Module. Don't get confused by the syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  
        # Torch defines nn.Linear(), which provides the affine map.
        # Note that we could add more Linear Layers here connected to each other
        # Then we would also need to have a HIDDEN_SIZE hyperparameter as an input to our model
        # Then, with activation functions between them (e.g. RELU) we could have a "Deep" model
        # This is just an example for a shallow network
        self.linear = nn.Linear(vocab_size, num_labels)

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        # log_softmax returns log-probabilities over the classes,
        # which is the input that nn.NLLLoss expects as our loss function
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [None]:
# The INPUT_DIM is the size of our input vectors
INPUT_DIM = VOCAB_SIZE
# We have 4 classes
OUTPUT_DIM = 4

model = BoWClassifier(OUTPUT_DIM, INPUT_DIM)

In [None]:
# https://pytorch.org/docs/stable/optim.html
import torch.optim as optim

# The optimizer will update the weights of our model based on the loss function
# This is essential for correct training
# The _lr_ parameter is the learning rate: the step size of each weight update
optimizer = optim.Adam(model.parameters(), lr=1e-3)

![NLL](https://ljvmiranda921.github.io/assets/png/cs231n-ann/neg_log_demo.png)

_(image from [link](https://ljvmiranda921.github.io))_
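
To make the negative log-likelihood loss concrete, the sketch below computes it by hand for one sample. The scores are invented, and the helpers `log_softmax` and `nll_loss` are simplified stand-ins written for this example, not the PyTorch functions themselves.

```python
import math


def log_softmax(scores):
    """Log-probabilities from raw scores (shifted by the max for numerical stability)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for a single sample."""
    return -log_probs[target]


# Hypothetical raw scores for the 4 classes; class 0 has the highest score.
scores = [2.0, 0.5, 0.1, -1.0]
log_probs = log_softmax(scores)

loss_if_class_0 = nll_loss(log_probs, 0)  # true label is the top-scoring class
loss_if_class_3 = nll_loss(log_probs, 3)  # true label is the lowest-scoring class
print(loss_if_class_0, loss_if_class_3)
```

The loss is small when the model assigns a high probability to the true class and grows as that probability approaches zero.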

In [None]:
criterion = nn.NLLLoss()

In [None]:
# Copy the model and the loss function to the correct device
model = model.to(device)
criterion = criterion.to(device)

In [None]:
from sklearn.metrics import classification_report
def class_accuracy(preds, y):
    """
    Returns accuracy per batch
    """
    # Get the predicted label from the probabilities
    rounded_preds = preds.argmax(1)
    # Calculate the correct predictions batch-wise
    correct = (rounded_preds == y).float()
    
    # Calculate the accuracy of your model
    acc = correct.sum() / len(correct)
    return acc

In [None]:
import torch.nn.functional as F

# Define the train function
def train(model, iterator, optimizer, criterion):
    
    # We will calculate loss and accuracy epoch-wise based on average batch accuracy
    epoch_loss = 0
    epoch_acc = 0
    
    # You always need to set your model to training mode before training
    # This enables the training-time behaviour of layers such as dropout and batch normalization
    model.train()
    
    # We calculate the error on batches so the iterator will return matrices with shape [BATCH_SIZE, VOCAB_SIZE]
    for texts, labels in iterator:
        # We copy the text and label to the correct device
        texts = texts.to(device)
        labels = labels.to(device)
        
        # We reset the gradients from the last step, because PyTorch accumulates gradients by default
        optimizer.zero_grad()
                
        # This runs the forward function on your model (you don't need to call it directly)    
        predictions = model(texts)

        # Calculate the loss and the accuracy on the predictions (the predictions are log probabilities, remember!)
        loss = criterion(predictions, labels)
        acc = class_accuracy(predictions, labels)
        
        # Backpropagate: compute the gradients of the loss, then let the optimizer update the weights
        loss.backward()
        optimizer.step()
        
        # We add batch-wise loss to the epoch-wise loss
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:

# The evaluation is done on the validation dataset
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    # On the validation dataset we don't want training, so we set the model to evaluation mode
    model.eval()
    
    # Also tell PyTorch not to track gradients
    # This saves memory and time when you only want to make predictions (inference mode)
    with torch.no_grad():
    
        # The rest is the same as in training, except that we don't backpropagate or call the optimizer
        for texts, labels in iterator:
            # We copy the text and label to the correct device
            texts = texts.to(device)
            labels = labels.to(device)
            
            predictions = model(texts)
            loss = criterion(predictions, labels)
            
            acc = class_accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

In [None]:
import time

# This is just for measuring training time!
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
# Define the epoch number parameter
N_EPOCHS = 50

best_valid_loss = float('inf')

# We loop forward on the epoch number
for epoch in range(N_EPOCHS):

    start_time = time.time()
    
    # Train the model on the training set using the dataloader
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    # And validate your model on the validation set
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    # If we find a better model, we save its weights so that we can reload them later
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')
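
The validation loss can also be used for early stopping instead of always running a fixed number of epochs. The sketch below is one simple variant: the `patience` parameter, the helper name and the loss values are assumptions for illustration, not part of the training loop above.

```python
def early_stopping_epoch(valid_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-based, for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch  # stop: no improvement for `patience` epochs
    return best_epoch, len(valid_losses)


# Hypothetical validation losses: they improve for 3 epochs, then get worse.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
best, stopped = early_stopping_epoch(losses)
print(best, stopped)
```

Training would stop after epoch 6, and the weights saved at epoch 3 would be the ones to keep.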