# Fine-Tuning `BERT` for `Disaster Tweets` Classification


# About the `Problem`

Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. However, identifying such tweets is difficult: the language of tweets is often ambiguous, so it is not always clear whether a person’s words are actually announcing a disaster.

More details are available on the [Kaggle competition page](https://www.kaggle.com/c/nlp-getting-started/overview).

<img src = "img/disaster.png" >

# Installation 

In [None]:
#!pip install transformers

# Setup

To start, we import some Python libraries and initialize a SageMaker session, S3 bucket and prefix, and IAM role.

In [20]:
import os
import numpy as np
import pandas as pd
import sagemaker

sagemaker_session = sagemaker.Session()    # Provides a collection of methods for working with SageMaker resources

bucket = sagemaker_session.default_bucket()
prefix = "sagemaker/DEMO-pytorch-bert"

role = sagemaker.get_execution_role()      # Get the execution role for the notebook instance.
                                           # This is the IAM role that we created for our notebook instance.
                                           # We pass the role to the training job later on.

# Prepare training data


In [21]:
df = pd.read_csv(
    "dataset/raw/data.csv",
    header=None,
    usecols=[1, 3],
    names=["label", "sentence"],
)


sentences = df.sentence.values
labels = df.label.values
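
The `read_csv` call above combines `header=None`, `usecols` and `names`. As a minimal sketch of how these options interact, the snippet below reads two made-up rows from an in-memory string. The column layout (index, label, an unused field, text) is an assumption for illustration and is not taken from `dataset/raw/data.csv`.

```python
import io

import pandas as pd

# Illustrative rows only. Assumed layout: index, label, unused field, text.
raw = io.StringIO(
    "0,1,,Forest fire near the lake\n"
    "1,0,,I love fruits\n"
)

# header=None: the first row is data, not column names.
# usecols=[1, 3]: keep only the second and fourth columns.
# names=[...]: the names given to the two kept columns.
toy = pd.read_csv(raw, header=None, usecols=[1, 3], names=["label", "sentence"])

print(toy)
```

Only the label and text columns survive, which is why `df.tail()` shows just `label` and `sentence` next to the row index.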

In [22]:
df.tail()

Unnamed: 0,label,sentence
7608,1,Two giant cranes holding a bridge collapse int...
7609,1,@aria_ahrary @TheTawniest The out of control w...
7610,1,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...
7611,1,Police investigating after an e-bike collided ...
7612,1,The Latest: More Homes Razed by Northern Calif...


Print a few tweets with their class labels:

In [23]:
list(zip(sentences[80:85], labels[80:85]))

[("mom: 'we didn't get home as fast as we wished' \nme: 'why is that?'\nmom: 'there was an accident and some truck spilt mayonnaise all over ??????",
  0),
 ("I was in a horrible car accident this past Sunday. I'm finally able to get around. Thank you GOD??",
  1),
 ('Can wait to see how pissed Donnie is when I tell him I was in ANOTHER accident??',
  0),
 ("#TruckCrash Overturns On #FortWorth Interstate http://t.co/Rs22LJ4qFp Click here if you've been in a crash&gt;http://t.co/Ld0unIYw4k",
  1),
 ('Accident in #Ashville on US 23 SB before SR 752 #traffic http://t.co/hylMo0WgFI',
  1)]

### Cleaning Text


As the output above shows, the tweets contain some information that is not very useful for classification, such as `URLs`, `Emojis`, `Tags`, etc. So let's clean the dataset before we pass it on for training.

In [24]:
import string
import re

In [25]:
# Helper functions to clean text by removing URLs, emojis, HTML tags and punctuation.

def remove_URL(text):
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)


def remove_emoji(text):
    emoji_pattern = re.compile(
        '['
        u'\U0001F600-\U0001F64F'  # emoticons
        u'\U0001F300-\U0001F5FF'  # symbols & pictographs
        u'\U0001F680-\U0001F6FF'  # transport & map symbols
        u'\U0001F1E0-\U0001F1FF'  # flags (iOS)
        u'\U00002702-\U000027B0'
        u'\U000024C2-\U0001F251'
        ']+',
        flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


def remove_html(text):
    html = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    return re.sub(html, '', text)


def remove_punct(text):
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)



In [26]:
df['sentence'] = df['sentence'].apply(remove_URL)
df['sentence'] = df['sentence'].apply(remove_emoji)
df['sentence'] = df['sentence'].apply(remove_html)
df['sentence'] = df['sentence'].apply(remove_punct)

In [27]:
df.head()

Unnamed: 0,label,sentence
0,1,Our Deeds are the Reason of this earthquake Ma...
1,1,Forest fire near La Ronge Sask Canada
2,1,All residents asked to shelter in place are be...
3,1,13000 people receive wildfires evacuation orde...
4,1,Just got sent this photo from Ruby Alaska as s...


In [28]:
sentences = df.sentence.values
labels = df.label.values

In [29]:
list(zip(sentences[80:85], labels[80:85]))

[('mom we didnt get home as fast as we wished \nme why is that\nmom there was an accident and some truck spilt mayonnaise all over ',
  0),
 ('I was in a horrible car accident this past Sunday Im finally able to get around Thank you GOD',
  1),
 ('Can wait to see how pissed Donnie is when I tell him I was in ANOTHER accident',
  0),
 ('TruckCrash Overturns On FortWorth Interstate  Click here if youve been in a crash',
  1),
 ('Accident in Ashville on US 23 SB before SR 752 traffic ', 1)]
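
To make the effect of the cleaning steps easier to follow on a single string, here is a minimal sketch that chains the URL, HTML and punctuation steps in the same order as the cell above. The `clean_tweet` name is hypothetical, and the emoji step is left out for brevity, so this is an illustration rather than a drop-in replacement for the helpers.

```python
import re
import string

# Patterns copied from the helper functions defined earlier in this notebook.
URL_RE = re.compile(r'https?://\S+|www\.\S+')
HTML_RE = re.compile(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def clean_tweet(text):
    """Hypothetical helper: strip URLs, then HTML tags/entities, then punctuation."""
    text = URL_RE.sub('', text)
    text = HTML_RE.sub('', text)
    return text.translate(PUNCT_TABLE)


sample = "Accident in #Ashville on US 23 SB before SR 752 #traffic http://t.co/hylMo0WgFI"
cleaned = clean_tweet(sample)
print(repr(cleaned))
```

The order matters: URLs are removed first because stripping punctuation earlier would break the `://` and `.` that the URL pattern relies on.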

In [30]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(df)               # Default split ratio 75/25, we can modify using "test_size"
train.to_csv("dataset/train.csv", index=False)
test.to_csv("dataset/test.csv", index=False)
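
The split above uses the defaults, so it changes on every run. As a sketch of the options mentioned in the comment, the snippet below fixes the held-out fraction with `test_size`, makes the split repeatable with `random_state`, and keeps the class ratio equal in both parts with `stratify`. The toy DataFrame and the chosen values (`0.2`, `42`) are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the cleaned tweets DataFrame (values are illustrative only).
toy = pd.DataFrame({
    "label": [0, 1] * 10,
    "sentence": [f"tweet {i}" for i in range(20)],
})

# test_size sets the held-out fraction, random_state makes the split repeatable,
# and stratify keeps the label proportions the same in both parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(len(toy_train), len(toy_test))
print(toy_test["label"].value_counts().to_dict())
```

Whether stratification is worthwhile here depends on how imbalanced the two classes are in the real dataset.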

### Upload both to Amazon S3 for use later

The SageMaker Python SDK provides a helpful function for uploading to Amazon S3:

In [31]:
inputs_train = sagemaker_session.upload_data("dataset/train.csv", bucket=bucket, key_prefix=prefix)
inputs_test = sagemaker_session.upload_data("dataset/test.csv", bucket=bucket, key_prefix=prefix)
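
`upload_data` returns the S3 URI of the uploaded object, which is what `inputs_train` and `inputs_test` hold. For a single file, that URI is expected to have the form `s3://<bucket>/<key_prefix>/<filename>`. The sketch below only rebuilds that string locally and makes no AWS call; `expected_s3_uri` is a hypothetical helper and the bucket name is a placeholder.

```python
import os


def expected_s3_uri(local_path, bucket, key_prefix):
    """Hypothetical helper: build the URI that upload_data is expected to return."""
    filename = os.path.basename(local_path)
    return f"s3://{bucket}/{key_prefix}/{filename}"


# "my-example-bucket" is a placeholder, not the notebook's real default bucket.
uri = expected_s3_uri(
    "dataset/train.csv", "my-example-bucket", "sagemaker/DEMO-pytorch-bert"
)
print(uri)
```

Printing `inputs_train` in the notebook is a quick way to confirm the actual location before starting a training job.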

# Amazon SageMaker Training

## Training script

In [33]:
!pygmentize code/train_bert.py

import argparse
import json
import logging
import os
import sys

import numpy as np
import pandas as pd
import torch
import torch.distributed as dist
import torch.utils.data
import torch.utils.data.distributed
from

# Train on Amazon SageMaker



In [None]:
from sagemaker.pytorch import PyTorch

# 1. Defining the estimator 

estimator = PyTorch(entry_point="ddp-launcher.py",
                    source_dir="code",
                    role=role,
                    framework_version="1.10.2",
                    py_version="py38",
                    instance_count=2,                          # Two instances, so training is distributed across nodes.
                    instance_type="ml.p4d.24xlarge",           # Instance type used for the training job.
                    hyperparameters={"epochs": 20,
                                     "num_labels": 2,
                                     "backend": "nccl",        # gloo for CPU instances; gloo or nccl for GPU instances.
                                    },
                    debugger_hook_config=False,                # Deactivate the debugger to avoid warnings in the model artifact.
                    disable_profiler=True,                     # Keep running resources to a minimum to avoid permission errors.
                   )

# 2. Start the Training 

estimator.fit({"training": inputs_train, "testing": inputs_test})
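
In SageMaker script mode, the `hyperparameters` dictionary is passed to the entry-point script as command-line flags, and each channel given to `fit()` is exposed through an environment variable named `SM_CHANNEL_<NAME>`. The sketch below shows how a script could read both. It is a hypothetical parser written for illustration, not the contents of `code/train_bert.py` or `ddp-launcher.py`, and the `--data-dir` and `--test` flag names are assumptions.

```python
import argparse
import os


def parse_args(argv=None):
    """Hypothetical argument parser for a SageMaker entry-point script."""
    parser = argparse.ArgumentParser()

    # Hyperparameters from the estimator arrive as command-line flags.
    parser.add_argument("--epochs", type=int, default=2)
    parser.add_argument("--num_labels", type=int, default=2)
    parser.add_argument("--backend", type=str, default=None)

    # The "training" and "testing" channels from fit() are exposed as
    # SM_CHANNEL_TRAINING and SM_CHANNEL_TESTING inside the container.
    parser.add_argument("--data-dir", type=str,
                        default=os.environ.get("SM_CHANNEL_TRAINING"))
    parser.add_argument("--test", type=str,
                        default=os.environ.get("SM_CHANNEL_TESTING"))

    return parser.parse_args(argv)


# Simulate the flags produced by the hyperparameters used above.
args = parse_args(["--epochs", "20", "--num_labels", "2", "--backend", "nccl"])
print(args.epochs, args.num_labels, args.backend)
```

Outside a training container the channel variables are unset, so `args.data_dir` and `args.test` fall back to `None` unless the flags are given explicitly.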
