<a href="https://colab.research.google.com/github/frank-895/machine_learning_journey/blob/main/NLP_disaster_tweets/NLP_disaster_tweets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [61]:
%%capture
!pip install datasets
!pip install evaluate

In [2]:
import pandas as pd, fastai

# Natural Language Processing with Disaster Tweets

## Introduction

In this notebook, I will be making a submission to the following Kaggle competition: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/competitions/nlp-getting-started/data).

The goal is to use **natural language processing** to predict if a tweet is talking about a real natural disaster or not. If `target = 1`, the tweet is talking about a real disaster. If `target = 0` the tweet is **not** talking about a real disaster.

In line with FastAI's lesson 7, I will be using this competition as an opportunity to learn new machine learning skills, including:
- **memory and gradient accumulation**
- **ensembling** a number of models, including over-weighted models.
- creating a **multi-target** model.

## Data Processing

### Collecting Data

Let's start by having a look at our training data.

In [49]:
df = pd.read_csv('train.csv')
df

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [50]:
df_test = pd.read_csv('test.csv')
df_test

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan
...,...,...,...,...
3258,10861,,,EARTHQUAKE SAFETY LOS ANGELES ÛÒ SAFETY FASTE...
3259,10865,,,Storm in RI worse than last hurricane. My city...
3260,10868,,,Green Line derailment in Chicago http://t.co/U...
3261,10874,,,MEG issues Hazardous Weather Outlook (HWO) htt...


My first observation is that there appear to be a number of empty data points, so let's take a closer look.

At this point, let's define our variables:
- `text` is the text of a tweet.
- `keyword` is a keyword from that tweet (although this may be blank!).
- `location` is the location the tweet was sent from (may also be blank).

In [51]:
df.isnull().sum()

Unnamed: 0,0
id,0
keyword,61
location,2533
text,0
target,0


In [52]:
df_test.isnull().sum()

Unnamed: 0,0
id,0
keyword,26
location,1105
text,0


We have lots of empty data points!

Since the fact that `location` or `keyword` is empty could itself be useful to the model, we will replace `NaN` values with the string `'empty'`.

Because we will be using `Transformers` to create our model, we need to rename our `target` column to `labels`, which is the column name the `Trainer` looks for.

In [53]:
df.rename(columns={'target':'labels'}, inplace=True)

### Feature Engineering

At this point, we want to combine all our features into a single input string that we can tokenize and numericalize.

Later, when we create a multi-target model, we will also try to predict the `location` from the tweet, using the subset of both `df` and `df_test` where `location` is not empty. That will require a new model that doesn't use `location` in the input data.

In [54]:
df.fillna("empty", inplace=True)

Now we can combine our features:

In [55]:
df["text"] = "KEYWORD:" + df.keyword + "LOCATION:" + df.location + "TEXT:" + df.text
df.drop(['id', 'keyword', 'location'], axis=1, inplace=True)
df.head()

Unnamed: 0,text,labels
0,KEYWORD:emptyLOCATION:emptyTEXT:Our Deeds are ...,1
1,KEYWORD:emptyLOCATION:emptyTEXT:Forest fire ne...,1
2,KEYWORD:emptyLOCATION:emptyTEXT:All residents ...,1
3,"KEYWORD:emptyLOCATION:emptyTEXT:13,000 people ...",1
4,KEYWORD:emptyLOCATION:emptyTEXT:Just got sent ...,1
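
One thing the output above shows is that the three fields run straight into each other (`KEYWORD:emptyLOCATION:empty...`). As a possible variant, here is a minimal sketch that puts spaces between the fields. The helper name `build_input` is hypothetical and not part of the notebook, and it treats `None` or blank strings as missing, so pandas `NaN` values would still need `fillna` first. I haven't tested whether the extra spacing changes the model's accuracy.

```python
def build_input(keyword, location, text, missing="empty"):
    """Join the three fields into one string, with spaces between them."""
    # Treat None or blank strings as missing, mirroring the 'empty' placeholder.
    keyword = keyword if keyword else missing
    location = location if location else missing
    return f"KEYWORD: {keyword} LOCATION: {location} TEXT: {text}"


print(build_input(None, None, "Forest fire near La Ronge Sask. Canada"))
print(build_input("wildfire", "Canada", "Forest fire near La Ronge Sask. Canada"))
```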


### Validation Set

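We are going to be using a random 80/20 split of the training data as our validation set, which is created with `train_test_split(test_size=0.2)` further down.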

## Memory and Gradient Accumulation

We are going to be **ensembling** 5 different pretrained NLP models from Transformers. I've defined them below.

In [56]:
models = ["distilbert-base-uncased","bert-base-uncased","roberta-base","xlm-roberta-base","google/electra-base-discriminator"]

Firstly, I'm going to introduce the idea of **gradient accumulation**. Then, I'm going to use gradient accumulation to enable the ensembling of these 5 models, some of which are quite large. This will allow me to run them on Colab's free (and somewhat limited) GPU!

This [Kaggle notebook](https://www.kaggle.com/code/jhoward/scaling-up-road-to-the-top-part-3) goes into incredible detail about what gradient accumulation is. If I was using fastai, I could simulate the steps, but it is more challenging (and not really necessary) in Transformers.

Essentially, we divide the batch size by a variable called `accum`, so each forward and backward pass only handles a smaller micro-batch. Rather than updating the model's weights after every micro-batch, we keep **accumulating** the gradients and only update the weights once every `accum` micro-batches!

This explains why in PyTorch, when we write the training loop manually, we need this line:
```
coeffs.grad.zero_()
```

Without this line, the gradients will automatically accumulate. So, when we define the `accum` variable, we will only call `zero_()` when we have completed a **full** batch, like so:

```
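count = 0  # examples seen since the last weight update

for x,y in dl:
  count += len(x)
  calc_loss(coeffs, x, y).backward()  # gradients add up in coeffs.grad

  if count >= batch_size:  # a full batch's worth of gradients has accumulated
    coeffs.data.sub_(coeffs.grad * lr)  # update the weights
    coeffs.grad.zero_()  # reset the gradients for the next batch
    count = 0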
```

**Why is this useful?** Well, mathematically, the training loop is nearly identical to when `accum=1`. However, the amount of memory used by the GPU will be much smaller, as it only has to hold the activations and intermediate values for one small micro-batch at a time rather than for the whole batch.
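
To convince myself of the "nearly identical" part, here is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). It uses a made-up one-weight linear model and made-up data, and compares one update on a full batch of 8 with one update built from 4 micro-batches of 2. The important assumption is that each micro-batch gradient is scaled by the **full** batch size, so the accumulated total matches the full-batch average.

```python
def grad_sum(w, xs, ys):
    """Gradient of the summed squared error sum((w*x - y)**2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


# Made-up data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w, lr = 0.5, 0.01
batch_size, accum = 8, 4
micro = batch_size // accum  # each micro-batch holds 2 examples

# One step on the full batch (the accum=1 case).
full_grad = grad_sum(w, xs, ys) / batch_size
w_full = w - lr * full_grad

# The same step, built from `accum` micro-batches.
acc_grad = 0.0
for start in range(0, batch_size, micro):
    xb, yb = xs[start:start + micro], ys[start:start + micro]
    acc_grad += grad_sum(w, xb, yb) / batch_size  # accumulate, do not update yet
w_accum = w - lr * acc_grad  # single update after the last micro-batch

print(w_full, w_accum)
```

The two weights agree up to floating-point rounding. In a real network, layers that depend on batch statistics (such as batch normalisation) are one reason the match is only "nearly" exact.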

This is fantastic because more expensive GPUs generally have more memory, **but not necessarily much more performance**. Gradient accumulation is a really cost-effective way of training with the batch sizes of a larger GPU without actually needing its memory.

**Why don't we just use a smaller batch size?** Well, with larger batches each weight update is based on more data. This means that the average gradient is less susceptible to noise, as it is calculated from a larger number of examples. This can stabilise training and may improve the model's ability to generalise.
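
The noise argument can be illustrated with a small simulation. This is only a sketch under a strong assumption: the "gradients" here are hypothetical numbers drawn as a true value of 1.0 plus Gaussian noise, not gradients from a real model. It compares how much the batch average jumps around for batches of 8 versus batches of 64.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-example gradients: a true value of 1.0 plus Gaussian noise.
per_example = [1.0 + random.gauss(0, 1) for _ in range(4096)]


def batch_means(values, batch_size):
    """Average the values in consecutive batches of the given size."""
    return [
        statistics.fmean(values[i:i + batch_size])
        for i in range(0, len(values), batch_size)
    ]


# Standard deviation of the batch averages: lower means a less noisy estimate.
spread_small = statistics.pstdev(batch_means(per_example, 8))
spread_large = statistics.pstdev(batch_means(per_example, 64))
print(spread_small, spread_large)
```

The spread of the averages shrinks as the batch grows, which is the sense in which a larger (or accumulated) batch gives a steadier gradient estimate.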

We will be using **gradient accumulation** with some of our models when we ensemble them.

## Ensembling and Weighted Models

Now, we need to run all 5 models and ensemble their predictions!
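
Before training, it helps to be clear about what "ensembling" will mean at the end. One common approach, sketched here in plain Python, is to turn each model's logits into probabilities with a softmax and take a (possibly weighted) average across models. The function names `softmax` and `ensemble`, the logits and the weights are all made up for illustration. In this notebook the logits could come from something like `trainer.predict(...)`, but that step is an assumption and isn't shown here.

```python
import math


def softmax(logits):
    """Turn one row of logits into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble(logits_per_model, weights=None):
    """Weighted average of class probabilities; returns one predicted class per row."""
    n_models = len(logits_per_model)
    weights = weights or [1.0] * n_models
    norm = sum(weights)
    preds = []
    for rows in zip(*logits_per_model):  # the same tweet, seen by each model
        probs = [softmax(r) for r in rows]
        avg = [
            sum(w * p[c] for w, p in zip(weights, probs)) / norm
            for c in range(len(probs[0]))
        ]
        preds.append(avg.index(max(avg)))
    return preds


# Made-up logits for three tweets from two hypothetical models.
model_a = [[2.0, -1.0], [1.5, 0.0], [-1.5, 1.0]]
model_b = [[1.0, 0.0], [-0.5, 1.5], [-0.2, 0.3]]
print(ensemble([model_a, model_b]))                      # equal weights
print(ensemble([model_a, model_b], weights=[3.0, 1.0]))  # trust model_a more
```

With equal weights the two models disagree on the second tweet and `model_b` wins. Giving `model_a` three times the weight flips that prediction, which is the basic idea behind weighting stronger models more heavily.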

In [57]:
from transformers import TrainingArguments, Trainer
from datasets import Dataset
from transformers import AutoTokenizer

dataset = Dataset.from_pandas(df)

data = dataset.train_test_split(test_size=0.2)
train = data['train']
val = data['test']

train_tokenized = {}
val_tokenized = {}

for name in models:
  tokenizer = AutoTokenizer.from_pretrained(name)

  # Tokenize each split once per model (6,090 training rows, 1,523 validation rows).
  # The `labels` column is already part of the dataset, so `map` carries it through.
  train_tokenized[name] = train.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True), batched=True)
  val_tokenized[name] = val.map(lambda x: tokenizer(x['text'], padding='max_length', truncation=True), batched=True)

In [None]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import evaluate  # Hugging Face's metric library
import numpy as np
import torch, gc

trained = []
accuracy_metric = evaluate.load("accuracy")  # Load accuracy metric
gc.collect()
torch.cuda.empty_cache()

def accuracy(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Convert logits to class predictions
    return accuracy_metric.compute(predictions=predictions, references=labels)

training_args = TrainingArguments('outputs',
    eval_strategy="epoch",     # Evaluate after each epoch
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    gradient_accumulation_steps=8,
    learning_rate= 1e-5,
    fp16=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=128,
    num_train_epochs=4,
    weight_decay=0.1,
    report_to='none'
)

for name in models:
  print(name)
  model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

  trainer = Trainer(
        model,
        training_args,
        train_dataset=train_tokenized[name],
        eval_dataset=val_tokenized[name],
        compute_metrics=accuracy
  )

  trainer.train()
  trained.append(model)

  model.to("cpu")
  del model
  gc.collect()
  torch.cuda.empty_cache()

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


distilbert-base-uncased


Epoch,Training Loss,Validation Loss,Accuracy
0,No log,0.600253,0.753775
1,No log,0.470475,0.795798
2,No log,0.445618,0.803677
3,No log,0.440949,0.802364


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


bert-base-uncased


Epoch,Training Loss,Validation Loss,Accuracy
0,No log,0.55144,0.763624
1,No log,0.470223,0.806303
2,No log,0.445619,0.80302
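
As a sanity check on the settings above, here is a rough sketch of the batch arithmetic. It assumes a single GPU and 6,090 training rows (the 80% split), and the exact rounding of steps per epoch may differ between `Trainer` versions, so treat the step counts as approximate.

```python
import math

train_rows = 6090       # rows in the training split (80% of the data)
per_device_batch = 32   # per_device_train_batch_size
accum_steps = 8         # gradient_accumulation_steps
epochs = 4              # num_train_epochs

# Each weight update sees this many examples.
effective_batch = per_device_batch * accum_steps

# Forward/backward passes per epoch, and (approximately) weight updates per epoch.
micro_batches_per_epoch = math.ceil(train_rows / per_device_batch)
updates_per_epoch = math.ceil(micro_batches_per_epoch / accum_steps)
total_updates = updates_per_epoch * epochs

print(effective_batch, micro_batches_per_epoch, updates_per_epoch, total_updates)
```

So the GPU only ever holds 32 examples at a time, while each update behaves like a batch of 256. The whole run is only about 96 weight updates, which is likely why the Training Loss column shows `No log`: as far as I know, `logging_steps` defaults to 500, so no training loss is logged before the run finishes.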


## Multi-target Model

## Conclusion