<a target="_blank" href="https://colab.research.google.com/github/ejcv/NLP_course/blob/main/second_session_nlp_course.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Introduction

We have seen the traditional approach to NLP, but since the resurgence of neural networks, neural methods have become the state of the art for NLP. In this session, we will see how to use neural networks for NLP tasks. We will start with the basics of neural networks and then move on to more advanced concepts. We will also see how to use pre-trained models for NLP tasks. We will use PyTorch throughout.

In this session we will cover the following topics:

- Introduction to Neural Networks
- Activation Functions
- Optimizers
- Loss Functions
- GRU and LSTM
- Word Embeddings
- Transformers (Attention mechanism)


## What is a neural network? 🧠

Neural networks are a family of algorithms designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. Neural networks were originally inspired by the way biological nervous systems, such as the brain, process information, although today's architectures bear little resemblance to biological neural networks.

As the name implies, a neural network is a set of neurons which are connected to each other. 
<p align="center">
<img src=https://upload.wikimedia.org/wikipedia/commons/thumb/9/99/Neural_network_example.svg/1200px-Neural_network_example.svg.png height=200>
</p>

## Perceptron 🏋️

The perceptron is the building block of a neural network. It is a single neuron that takes in a set of inputs, performs some calculations, and outputs a value.
<p align="center">
<img src=https://images.deepai.org/glossary-terms/perceptron-6168423.jpg height=200>
</p>

The perceptron's true power comes from the activation function, which lets networks built from perceptrons learn non-linear relationships between the inputs and outputs. Without it, the perceptron is just a linear regression model.
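To make the description concrete, here is a minimal sketch of a single perceptron in plain Python. The weights and bias are hypothetical values picked by hand (not learned) so that the neuron behaves like a logical AND; the function names are illustrative only.

```python
import math

def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def step(z):
    """Classic threshold activation: 1 if z is non-negative, else 0."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Smooth alternative that squashes z into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical hand-picked parameters: the neuron fires only when both inputs are 1
weights, bias = [1.0, 1.0], -1.5
and_outputs = [perceptron([a, b], weights, bias, step) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Swapping `step` for `sigmoid` gives a smooth output that can be trained with gradient-based methods.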

## Activation Functions 🌋
After computing the weighted sum of the inputs, we pass the result through an activation function, which transforms it into the output of the perceptron. There are several different activation functions, and the right choice depends on the problem we are trying to solve. Here are some of the most common ones:

- Sigmoid
- Tanh
- ReLU
- Leaky ReLU
- Softmax

You can find more details [here](https://en.wikipedia.org/wiki/Activation_function).
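For reference, the functions listed above can be sketched in a few lines of plain Python. This is only an illustration of the formulas; in practice we would use the PyTorch versions. The `negative_slope` default of 0.01 is an assumption made for this example.

```python
import math

def sigmoid(z):
    """Maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Maps any real number into (-1, 1)."""
    return math.tanh(z)

def relu(z):
    """Keeps positive values and zeroes out negative ones."""
    return max(0.0, z)

def leaky_relu(z, negative_slope=0.01):
    """Like ReLU, but lets a small fraction of negative values through."""
    return z if z > 0 else negative_slope * z

def softmax(zs):
    """Turns a list of scores into probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([1.0, 2.0, 3.0])
print(probabilities)
```

Note that softmax operates on a whole vector of scores, while the other functions are applied element by element.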


# Loss functions 📉

Training a neural network means finding the set of weights that minimizes the loss function. The loss function measures how far the network's predictions are from the true values. It is also known as the cost function or the error function. There are several different loss functions, and the right choice depends on the problem we are trying to solve. Here are some of the most common ones:

- Mean Squared Error
- Mean Absolute Error
- Binary Cross Entropy
- Categorical Cross Entropy

You can find more details [here](https://en.wikipedia.org/wiki/Loss_function).
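As a rough sketch, the four losses listed above can be written in plain Python as follows. The `eps` clipping value is an assumption added here to avoid taking the logarithm of zero; library implementations handle this in their own way.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Average negative log-likelihood for 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def categorical_cross_entropy(y_true_one_hot, y_prob, eps=1e-12):
    """Negative log-probability assigned to the true class of one example."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true_one_hot, y_prob))

print(mean_squared_error([1, 2, 3], [1, 2, 5]))
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))
```

The first two are typically used for regression and the last two for classification.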

# Optimizers 🚀

The optimizer is the algorithm that will update the weights of the neural network. The optimizer is what makes the neural network learn. There are several different optimizers, depending on the problem we are trying to solve. Here are some of the most common ones:

- Stochastic Gradient Descent
- Adam
- RMSProp
- etc...

<p align="center">
<img src=https://miro.medium.com/v2/resize:fit:1200/1*_osB82GKHBOT8k1idLqiqA.gif height=400>
</p>
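The core idea shared by these optimizers is to move the weights a small step against the gradient of the loss. The sketch below shows plain gradient descent on a toy one-dimensional loss, `(w - 3)^2`, chosen only for illustration; stochastic gradient descent applies the same update but estimates the gradient from a mini-batch of data, and Adam and RMSProp additionally adapt the step size.

```python
def gradient_descent(grad_fn, w0, learning_rate=0.1, steps=100):
    """Repeatedly apply the update w <- w - learning_rate * gradient."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad_fn(w)
    return w

# Toy loss: loss(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3
w_final = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_final)
```

With a learning rate that is too large the updates overshoot and can diverge, which is why the learning rate is one of the first hyperparameters to tune.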


# Let's build a simple neural network to classify tweet emotions

In [1]:
# We will reuse some of the code from the previous notebook, so let's put it here
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import re

# we will load the dataset
df = pd.read_csv('datasets/tweet_emotions.csv')

# We will change the function to replace specific usernames with the tag 'username'
def replace_usernames(text: str) -> str:
    """ A function that replaces usernames (@mentions) in text with the tag 'username'. """
    return re.sub(r'@\w+', 'username', text)

# the tfidf vectorizer from sklearn has an option to use ngrams.... and do the preprocessing for us too :)

# we will create an instance of the TfidfVectorizer
# note: passing a custom preprocessor overrides the built-in strip_accents and lowercase steps
vectorizer = TfidfVectorizer(strip_accents='ascii', lowercase=True, stop_words='english', ngram_range=(1,3), preprocessor=replace_usernames)

# we will fit the vectorizer to the corpus
# but now the corpus is the raw text in the dataframe
vectorizer.fit(df['content'].tolist())

# we will get the vocabulary size
len(vectorizer.vocabulary_)

451093
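To see what the preprocessing step does before the text reaches the vectorizer, here is a small standalone check of `replace_usernames`. The example tweet is made up for illustration and is not taken from the dataset.

```python
import re

def replace_usernames(text: str) -> str:
    """Replace every @mention with the literal tag 'username'."""
    return re.sub(r'@\w+', 'username', text)

# Hypothetical tweet, not from the dataset
example = "@alice thanks for the tip, cc @bob_99!"
cleaned = replace_usernames(example)
print(cleaned)  # username thanks for the tip, cc username!
```

Mapping all mentions to a single token keeps individual account names from inflating the vocabulary.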

In [2]:
# we enumerate the unique values and then convert it to a dictionary where the key is the index and the value is the sentiment
id2sent = dict(enumerate(df['sentiment'].unique()))
sent2id = {v:k for k,v in id2sent.items()}
# now we can create a column with the numeric values
df['sentiment_id'] = df['sentiment'].map(sent2id)
df.head()

Unnamed: 0,tweet_id,sentiment,content,sentiment_id
0,1956967341,empty,@tiffanylue i know i was listenin to bad habi...,0
1,1956967666,sadness,Layin n bed with a headache ughhhh...waitin o...,1
2,1956967696,sadness,Funeral ceremony...gloomy friday...,1
3,1956967789,enthusiasm,wants to hang out with friends SOON!,2
4,1956968416,neutral,@dannycastillo We want to trade with someone w...,3


In [3]:
# get the tfidf matrix
X = vectorizer.transform(df['content'].tolist())
X.shape

(40000, 451093)

In [4]:
import torch
# Check if there is any hardware acceleration available
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps"
    if torch.backends.mps.is_available()
    else "cpu"
)
print(f"Using {device} device")

Using mps device


In [5]:
# let's use PyTorch to build a simple neural network to classify the sentiment of the tweets

INPUT_DIM = X.shape[1]
OUTPUT_DIM = len(id2sent)

model = torch.nn.Sequential()

# add the first layer
model.add_module("layer_1", torch.nn.Linear(INPUT_DIM, 10))
model.add_module("relu_1", torch.nn.ReLU())

# add the second layer
model.add_module("layer_2", torch.nn.Linear(10, OUTPUT_DIM))


# print the model
print(model)


Sequential(
  (layer_1): Linear(in_features=451093, out_features=10, bias=True)
  (relu_1): ReLU()
  (layer_2): Linear(in_features=10, out_features=13, bias=True)
)
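To show what this stack of layers computes, here is a hand-written forward pass in plain Python: a linear layer, a ReLU, and a second linear layer. The sizes and weights are hypothetical toy values (3 inputs, 2 hidden units, 2 outputs), far smaller than the real model, and the helper names are illustrative only.

```python
def linear(x, weight, bias):
    """Compute weight @ x + bias, with weight stored as rows of shape (out_features, in_features)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, vi) for vi in v]

def forward(x, w1, b1, w2, b2):
    """Same structure as the Sequential model: Linear -> ReLU -> Linear."""
    hidden = relu(linear(x, w1, b1))
    return linear(hidden, w2, b2)

# Hypothetical toy parameters chosen by hand
w1 = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
b1 = [0.0, -1.0]
w2 = [[1.0, 2.0], [-1.0, 0.0]]
b2 = [0.1, 0.2]

logits = forward([2.0, 1.0, 1.0], w1, b1, w2, b2)
print(logits)
```

The outputs are raw scores (logits), one per class; the loss function is responsible for turning them into probabilities.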


In [6]:
y = df['sentiment_id'].tolist()
y = torch.LongTensor(y)
# one-hot encode the labels; fixing num_classes keeps the width equal to the model output
y = torch.nn.functional.one_hot(y, num_classes=OUTPUT_DIM).type(torch.FloatTensor)
y

tensor([[1., 0., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        [0., 1., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
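As a small illustration of what one-hot encoding produces, the sketch below encodes a handful of made-up label ids in plain Python. The label values and the number of classes are assumptions for the example only.

```python
def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1.0 at the label's index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

# Hypothetical label ids for four tweets and four classes
encoded = one_hot([0, 1, 1, 2], num_classes=4)
for row in encoded:
    print(row)
```

The index of the 1.0 in each row recovers the original label, which is what `argmax` relies on when accuracy is computed.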

In [7]:
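import numpy as np
# convert the scipy sparse matrix to COO format and then to a sparse PyTorch tensor
X = X.tocoo()
indices = torch.from_numpy(np.vstack((X.row, X.col)).astype(np.int64))
values = torch.from_numpy(X.data.astype(np.float32))
# pass the shape explicitly so that empty trailing rows or columns are not dropped
X = torch.sparse_coo_tensor(indices, values, size=X.shape)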

In [8]:

# we wrap the feature tensor and the labels in a TensorDataset
# and later feed it to a DataLoader to iterate over mini-batches
from torch.utils.data import TensorDataset, DataLoader, random_split

# let's create a dataset
dataset = TensorDataset(X, y)

total_count = len(dataset)
train_count = int(0.8 * total_count)
# use the remainder so that both counts always add up to total_count
test_count = total_count - train_count
train_dataset, test_dataset = random_split(dataset, (train_count, test_count))




In [9]:
from torch import nn

# hyperparameters
learning_rate = 1e-3
batch_size = 64
epochs = 5

# initialize the loss function and the optimizer
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# let's create the dataloaders (the test set does not need shuffling)
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)



In [10]:
model(train_dataloader.dataset[0][0])

tensor([ 0.0069,  0.2645, -0.3012,  0.1854,  0.0121,  0.0988, -0.0268, -0.1467,
        -0.2800,  0.0224,  0.2401, -0.2591, -0.2746], grad_fn=<AddBackward0>)

In [11]:
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    # Set the model to training mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        # Densify the sparse batch, as test_loop does, before the forward pass
        X = X.to_dense()
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        if batch % 100 == 0:
            loss, current = loss.item(), (batch + 1) * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    # Set the model to evaluation mode - important for batch normalization and dropout layers
    # Unnecessary in this situation but added for best practices
    model.eval()
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for X, y in dataloader:
            X = X.to_dense()
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y.argmax(1)).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

In [12]:

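for t in range(epochs):
    # NOTE: this `continue` skips every epoch, so no training or evaluation actually runs;
    # remove it to train the model
    continue
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")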

Done!


## Embeddings
The performance of this model is pretty bad, even though we are using a neural network. This is because we are not optimizing our representations: we are just using sparse vectors with the size of the vocabulary. We can do better than that by using embeddings. Embeddings are a way to represent words as dense vectors in a vector space. The idea is that similar words will be close to each other in that space. We can use pre-trained embeddings or train our own.

The ideal latent space (vector space) is one where semantically similar words are close together. For example, the words "car" and "automobile" are similar, so they should be close together in the vector space. The same goes for "bike" and "bicycle". The words "car" and "bicycle" are less similar, so they should be farther apart.

<p align="center">
<img src="https://miro.medium.com/v2/resize:fit:2000/1*SYiW1MUZul1NvL1kc1RxwQ.png" width="600" />
</p>
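Closeness in the vector space is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not real Word2Vec or GloVe embeddings, which typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, NOT real pre-trained embeddings
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "bike": [0.2, 0.8, 0.1],
    "bicycle": [0.1, 0.9, 0.2],
}

sim_car_automobile = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_car_bicycle = cosine_similarity(toy_embeddings["car"], toy_embeddings["bicycle"])
print(sim_car_automobile, sim_car_bicycle)
```

With these toy vectors, "car" is much closer to "automobile" than to "bicycle", which is the behavior we would like a learned embedding space to have.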

### The two most common ways to create word embeddings are:
- **Word2Vec** (Mikolov et al., 2013)
- **GloVe** (Pennington et al., 2014)


## Should I train my own embedder model? 🤔
It depends, but most of the time it is not worth it. If you have tons of data and enough computational power, you can train your own model. Otherwise, you can use a pre-trained model. And if you have a very specific domain, you can do **fine-tuning**, which means taking advantage of the features the pre-trained representations have already learned and then adjusting the model to your domain.

## Where do I get the pretrained models? 🤗
There are many places where you can find pretrained models, but the most popular one is [Hugging Face](https://huggingface.co/). They host a lot of models for different tasks and different languages. You can also find models in the [TensorFlow Hub](https://tfhub.dev/).


## Transformers 🤖
Until 2017, approaches such as Word2Vec and GloVe were the state of the art, but that year a groundbreaking paper was published: [Attention is all you need](https://arxiv.org/abs/1706.03762). This paper introduced the **Transformer** architecture, which is based on the attention mechanism. The attention mechanism is a way to learn the relationships between the words in a sentence. The Transformer architecture is the basis for models such as **BERT**, **GPT-1** through **GPT-4**, **PaLM**, and **LLaMA**, which are the state-of-the-art models for NLP tasks.
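The core operation of the paper is scaled dot-product attention, `softmax(Q K^T / sqrt(d_k)) V`. The following is a minimal plain-Python sketch of that formula for a single head, without the learned projection matrices, masking, or multi-head logic of a real Transformer. The token vectors are made-up toy values.

```python
import math

def softmax(zs):
    """Turn a list of scores into probabilities that sum to 1."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the values and the weights used."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional vectors for three tokens; self-attention uses them as Q, K and V
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, attention_weights = scaled_dot_product_attention(tokens, tokens, tokens)
print(attention_weights)
```

Each row of `attention_weights` says how much one token attends to every token in the sentence, and each output is the corresponding mixture of the value vectors.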

# You can use a pretrained model in a downstream task

# Fine-tune a pretrained model

In [24]:
from datasets import Dataset
df['label'] = df['sentiment_id']
df['text'] = df['content']

dataset = Dataset.from_pandas(df[['text', 'label']])
dataset = dataset.train_test_split(test_size=0.1)
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 4000
    })
})

In [25]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [34]:
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [35]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=OUTPUT_DIM)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

In [37]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [38]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
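To make explicit what `compute_metrics` measures, here is a plain-Python sketch of the same idea: take the index of the largest logit as the predicted class and count how often it matches the label. The logits and labels are toy values invented for this example.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == t) for p, t in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical logits for four examples and three classes
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 3.0, 0.5], [1.0, 0.9, 0.8]]
toy_labels = [0, 2, 1, 2]
toy_accuracy = accuracy(toy_logits, toy_labels)
print(toy_accuracy)  # 0.75
```

Accuracy is easy to read but can be misleading when some emotions are much more frequent than others in the dataset.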

In [39]:
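from transformers import TrainingArguments, Trainer

# evaluate at the end of every epoch
# note: newer transformers releases rename this argument to `eval_strategy`
training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")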

In [40]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [41]:
trainer.train()
