<h3 style='color: blue; text-align: center'>Import Libraries and Load Data</h3>

In [1]:
import pandas as pd
data = pd.read_csv("Disaster.csv")
data.columns

Index(['Name', 'UserName', 'Timestamp', 'Verified', 'Tweets', 'Comments',
       'Retweets', 'Likes', 'Impressions', 'Tags', 'Tweet Link', 'Tweet ID',
       'Disaster'],
      dtype='object')

In [2]:
texts = data['Tweets'].tolist()
labels = data['Disaster'].tolist()

In [3]:
data.Disaster.value_counts()

Disaster
Drought       770
Wildfire      540
Earthquake    500
Floods        436
Hurricanes    178
Tornadoes     135
Name: count, dtype: int64

In [4]:
# Create a mapping dictionary for disaster types
disaster_mapping = {
    'Drought': 0,
    'Earthquake': 1,
    'Wildfire': 2,
    'Floods': 3,
    'Hurricanes': 4,
    'Tornadoes': 5
}

# Apply the mapping to the Disaster column
data['Disaster'] = data['Disaster'].map(disaster_mapping)

In [5]:
data.Disaster.value_counts()

Disaster
0    770
2    540
1    500
3    436
4    178
5    135
Name: count, dtype: int64
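
`Series.map` returns `NaN` for any value that is missing from the dictionary, so a misspelled or unexpected label would silently turn into a missing class id. A minimal guard in plain Python might look like the sketch below; `raw_labels` is a hypothetical stand-in for the `Disaster` column before mapping, and the typo is deliberate.

```python
disaster_mapping = {
    'Drought': 0, 'Earthquake': 1, 'Wildfire': 2,
    'Floods': 3, 'Hurricanes': 4, 'Tornadoes': 5,
}

# Hypothetical raw labels; 'Flood' is a deliberate typo to trigger the check.
raw_labels = ['Drought', 'Wildfire', 'Flood', 'Tornadoes']

# Any label absent from the mapping would otherwise become a missing value.
unknown = sorted(set(raw_labels) - set(disaster_mapping))
if unknown:
    print(f"Unmapped labels: {unknown}")
```

Running a check like this before `map` makes the failure visible instead of letting missing labels reach the training loop.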

<h3 style='color: blue; text-align: center'>Data Preparation and Model Initialization</h3>

In [6]:
import torch
from sklearn.model_selection import train_test_split
from torch.optim import AdamW  # the AdamW bundled with transformers is deprecated
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding

# Load the model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


<h3 style='color: blue; text-align: center'>Sample Data and Split into Train/Test Sets</h3>

In [7]:
# Data Preparation: Select 1000 random samples from the dataset
data = data.sample(1000, random_state=42)

# Split the data into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data['Tweets'], data['Disaster'], test_size=0.2, random_state=42)
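
With classes as uneven as these, a purely random split can leave the rarest classes with only a handful of test examples. `train_test_split` accepts a `stratify` argument (for instance `stratify=data['Disaster']`) that keeps class proportions the same in both splits. The dependency-free sketch below illustrates the idea on toy labels; `stratified_split` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test example per class.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: class 5 is rare, much like Tornadoes in the tweet data.
toy_labels = [0] * 50 + [2] * 30 + [5] * 10
train_idx, test_idx = stratified_split(toy_labels)

rare_in_test = sum(1 for i in test_idx if toy_labels[i] == 5)
print(len(train_idx), len(test_idx), rare_in_test)
```

Each class contributes the same fraction to the test set, so per-class metrics for the small classes rest on a predictable number of examples.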

<h3 style='color: blue; text-align: center'>Tokenize the Texts</h3>

In [8]:
# Tokenize the texts
train_encodings = tokenizer(train_texts.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_texts.tolist(), truncation=True, padding=True)
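
With `padding=True`, the tokenizer pads every sequence in the call to the length of the longest one and returns an attention mask that marks real tokens with 1 and padding with 0. Since each call here receives a whole split, every tweet is padded to the longest tweet in that split. The toy sketch below mimics the behaviour on made-up token ids; `pad_batch` is a hypothetical helper, not part of `transformers`, and the pad id of 0 matches the `[PAD]` token of `bert-base-uncased`.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to the longest length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up ids; 101 and 102 stand for [CLS] and [SEP].
batch = pad_batch([[101, 7, 8, 9, 102], [101, 7, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

Padding per batch rather than per split is usually cheaper; `DataCollatorWithPadding`, imported earlier but not used, is intended for that, though wiring it in would require changes to the dataset class.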

<h3 style='color: blue; text-align: center'>Create Dataset Class and DataLoader</h3>

In [9]:
# Wrap the encodings and labels in a Dataset that returns torch tensors
class DisasterDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [10]:
train_dataset = DisasterDataset(train_encodings, train_labels.tolist())
test_dataset = DisasterDataset(test_encodings, test_labels.tolist())

# Create a DataLoader
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)

<h3 style='color: blue; text-align: center'>Define Optimizer and Training Loop</h3>

In [11]:
# Define the optimizer
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
model.train()
for epoch in range(3):  # Training for 3 epochs
    for batch in train_dataloader:
        optimizer.zero_grad()
        
        inputs = {key: val for key, val in batch.items() if key != 'labels'}
        labels = batch['labels']
        
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        
        print(f"Epoch: {epoch}, Loss: {loss.item()}")



Epoch: 0, Loss: 2.0212249755859375
Epoch: 0, Loss: 1.883760690689087
Epoch: 0, Loss: 1.7661470174789429
Epoch: 0, Loss: 1.8595359325408936
Epoch: 0, Loss: 1.736466646194458
Epoch: 0, Loss: 1.6578303575515747
Epoch: 0, Loss: 1.7566641569137573
Epoch: 0, Loss: 1.675097942352295
Epoch: 0, Loss: 1.4616562128067017
Epoch: 0, Loss: 1.4926913976669312
Epoch: 0, Loss: 1.5711513757705688
Epoch: 0, Loss: 1.5582966804504395
Epoch: 0, Loss: 1.548918604850769
Epoch: 0, Loss: 1.3403314352035522
Epoch: 0, Loss: 1.2531391382217407
Epoch: 0, Loss: 1.4997732639312744
Epoch: 0, Loss: 1.389554500579834
Epoch: 0, Loss: 1.31046462059021
Epoch: 0, Loss: 1.2023437023162842
Epoch: 0, Loss: 1.280198574066162
Epoch: 0, Loss: 1.2630658149719238
Epoch: 0, Loss: 1.0989173650741577
Epoch: 0, Loss: 1.2215421199798584
Epoch: 0, Loss: 1.2026135921478271
Epoch: 0, Loss: 1.116483449935913
Epoch: 0, Loss: 1.0485080480575562
Epoch: 0, Loss: 1.1488882303237915
Epoch: 0, Loss: 1.111499309539795
Epoch: 0, Loss: 0.837650239467
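
Printing the loss after every batch produces a long and noisy log. A common alternative is to collect the batch losses and report one mean per epoch. The sketch below shows only the bookkeeping, using the first three logged values above as stand-ins for `loss.item()`; in the real loop the values would be appended right after `optimizer.step()`.

```python
# First three batch losses from the log above, standing in for loss.item().
batch_losses_per_epoch = {
    0: [2.0212249755859375, 1.883760690689087, 1.7661470174789429],
}

for epoch, batch_losses in batch_losses_per_epoch.items():
    # One summary number per epoch instead of one line per batch.
    epoch_loss = sum(batch_losses) / len(batch_losses)
    print(f"Epoch: {epoch}, Mean loss: {epoch_loss:.4f}")
```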

<h3 style='color: blue; text-align: center'>Evaluate Model Performance</h3>

In [12]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

# Switch the model to evaluation mode
model.eval()

# Create DataLoader for the test set
test_dataloader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)

# Initialize lists to store true labels and predictions
predictions, true_labels = [], []

# Evaluate the model
for batch in test_dataloader:
    inputs = {key: val for key, val in batch.items() if key != 'labels'}
    labels = batch['labels']
    
    with torch.no_grad():
        outputs = model(**inputs)
        
    logits = outputs.logits
    predictions.extend(torch.argmax(logits, dim=-1).cpu().numpy())
    true_labels.extend(labels.cpu().numpy())

# Calculate accuracy, precision, recall, and F1-score
accuracy = accuracy_score(true_labels, predictions)
precision, recall, f1, _ = precision_recall_fscore_support(true_labels, predictions, average='weighted')

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

Accuracy: 0.98
Precision: 0.9804385964912281
Recall: 0.98
F1-Score: 0.9800443673241175
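
With `average='weighted'`, each per-class score is weighted by that class's share of the true labels. One consequence is that weighted recall always equals accuracy, which is why both print as 0.98 above. The sketch below works this out by hand on toy labels (not the notebook's predictions), building a confusion matrix first; `y_true`, `y_pred`, and both helper functions are illustrative.

```python
def confusion(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

def weighted_scores(matrix):
    """Support-weighted precision, recall and F1 from a confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    precision = recall = f1 = 0.0
    for c in range(n):
        tp = matrix[c][c]
        support = sum(matrix[c])                         # true members of class c
        predicted = sum(matrix[r][c] for r in range(n))  # items predicted as c
        p = tp / predicted if predicted else 0.0
        r = tp / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Toy example with three classes.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 1]

matrix = confusion(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[c][c] for c in range(3)) / len(y_true)
precision, recall, f1 = weighted_scores(matrix)
print(matrix)
print(accuracy, recall)
```

Because the weighted averages are dominated by the large classes, per-class rows of the confusion matrix are the place to look for weaknesses on Hurricanes and Tornadoes.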


<h3 style='color: blue; text-align: center'>Model and Tokenizer Saving</h3>

In [13]:
model_save = "disaster_model.pth"
torch.save(model.state_dict(), model_save)

In [14]:
tokenizer.save_pretrained("tokenizer2/")

('tokenizer2/tokenizer_config.json',
 'tokenizer2/special_tokens_map.json',
 'tokenizer2/vocab.txt',
 'tokenizer2/added_tokens.json',
 'tokenizer2/tokenizer.json')

##### Disaster Labels

Here are the disaster labels with their corresponding values:

- **Drought**: 0
- **Earthquake**: 1
- **Wildfire**: 2
- **Floods**: 3
- **Hurricanes**: 4
- **Tornadoes**: 5
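
Because `disaster_model.pth` stores only weights, the label names exist solely in the notebook. A small sketch of saving the mapping as JSON next to the model, so that inference code can recover it, is shown below. The file name `label_map.json` is an assumption, and a temporary directory is used only to keep the example self-contained.

```python
import json
import os
import tempfile

disaster_mapping = {
    'Drought': 0, 'Earthquake': 1, 'Wildfire': 2,
    'Floods': 3, 'Hurricanes': 4, 'Tornadoes': 5,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_map.json")  # hypothetical file name

    # Write the mapping to disk.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(disaster_mapping, f, indent=2)

    # Read it back, as an inference script would.
    with open(path, encoding="utf-8") as f:
        loaded = json.load(f)

# Invert the mapping to look up names from predicted class ids.
id_to_label = {idx: name for name, idx in loaded.items()}
print(id_to_label[2])  # Wildfire
```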

<h3 style='color: blue; text-align: center'>Example Prediction</h3>

In [15]:
# Load the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)
model.load_state_dict(torch.load(model_save, weights_only=True))  # load tensors only
model.eval()

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("tokenizer2/")

# Example prediction
new_texts = ["The smoke from the wildfire is affecting air quality in nearby cities."]
new_encodings = tokenizer(new_texts, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**new_encodings)
    predictions = torch.argmax(outputs.logits, dim=-1)
    print(predictions.item()) 


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


2


<h3 style='color: blue; text-align: center'>Prediction Result</h3>

As we can see, the predicted result for the text:

<span style="color: red;">"The smoke from the wildfire is affecting air quality in nearby cities."</span>

is **2**, which corresponds to **Wildfire**.

<h3 style='color: blue; text-align: center'>Class Distribution Check</h3>

In [16]:
import numpy as np
unique, counts = np.unique(train_labels, return_counts=True)
print(dict(zip(unique, counts)))

{0: 263, 1: 144, 2: 164, 3: 138, 4: 49, 5: 42}
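
The training split is imbalanced: class 0 (Drought) has 263 examples while class 5 (Tornadoes) has 42. One common mitigation is to weight the loss inversely to class frequency. The sketch below computes such weights from the counts above with the `n_samples / (n_classes * count)` heuristic. Whether this helps on this dataset is untested here, and the step of passing the weights to a loss such as `torch.nn.CrossEntropyLoss(weight=...)` is left out to keep the example dependency-free.

```python
# Per-class counts of the training split, copied from the output above.
train_counts = {0: 263, 1: 144, 2: 164, 3: 138, 4: 49, 5: 42}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# Rare classes get weights above 1, frequent classes below 1.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in class_weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Using the weights would mean computing the loss from `outputs.logits` explicitly, since the loss returned by the model in the training loop is unweighted.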


<h3 style='color: blue; text-align: center'>Predict Multiple Texts</h3>

In [21]:
texts_to_predict = [
    "The prolonged drought is severely affecting agricultural output in the region.",
    "The earthquake caused extensive damage to buildings and infrastructure in the city.",
    "Wildfires are raging through the forest, threatening homes and wildlife.",
    "Heavy rains have caused severe flooding in the downtown area.",
    "The hurricane made landfall last night, causing widespread power outages.",
    "A series of tornadoes have torn through the region, causing widespread destruction.",
    "The hurricane's strong winds and heavy rains have led to significant damage.",
    "Emergency shelters have been set up to accommodate those displaced by the hurricane."
]

In [22]:
# Tokenize the texts
encodings = tokenizer(texts_to_predict, truncation=True, padding=True, return_tensors="pt")
model.eval()  # Switch to evaluation mode

with torch.no_grad():  # Disable gradient calculations
    outputs = model(**encodings)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)  # Get the index of the highest logit for each example

# Convert predictions to a list
predicted_labels = predictions.tolist()
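
`argmax` returns a class for every input, however uncertain the model is. Applying a softmax to the logits gives per-class probabilities that can serve as a rough confidence score. The sketch below uses made-up logits for a single example; in the notebook the equivalent would be `torch.softmax(logits, dim=-1)`.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ['Drought', 'Earthquake', 'Wildfire', 'Floods', 'Hurricanes', 'Tornadoes']

# Hypothetical logits in which Hurricanes and Tornadoes are nearly tied.
logits = [-1.2, -0.8, -1.0, 0.3, 2.1, 2.3]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(labels[best], round(probs[best], 3))
```

A top probability well below 1, with a close runner-up, flags a prediction worth a second look rather than a confident call.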


<h3 style='color: blue; text-align: center'>Prediction Results</h3>

In [23]:
# Reverse the disaster mapping
reverse_disaster_mapping = {v: k for k, v in disaster_mapping.items()}

# Convert numeric labels to disaster types
predicted_disasters = [reverse_disaster_mapping[label] for label in predicted_labels]

# Print predictions
for text, disaster in zip(texts_to_predict, predicted_disasters):
    print(f"Text: {text}")
    print(f"Predicted Disaster: {disaster}")
    print("-" * 50)

Text: The prolonged drought is severely affecting agricultural output in the region.
Predicted Disaster: Drought
--------------------------------------------------
Text: The earthquake caused extensive damage to buildings and infrastructure in the city.
Predicted Disaster: Earthquake
--------------------------------------------------
Text: Wildfires are raging through the forest, threatening homes and wildlife.
Predicted Disaster: Wildfire
--------------------------------------------------
Text: Heavy rains have caused severe flooding in the downtown area.
Predicted Disaster: Floods
--------------------------------------------------
Text: The hurricane made landfall last night, causing widespread power outages.
Predicted Disaster: Tornadoes
--------------------------------------------------
Text: A series of tornadoes have torn through the region, causing widespread destruction.
Predicted Disaster: Tornadoes
--------------------------------------------------
Text: The hurricane's stron

<h3 style='color: blue; text-align: center'>Conclusion</h3>

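The BERT-based model reached an accuracy of `0.98` (98%) on the 200-tweet held-out test set, drawn from a 1,000-tweet sample, which suggests that fine-tuned `bert-base-uncased` separates these six disaster types well. The spot checks are less clean: the first hurricane example above was labelled Tornadoes, and Hurricanes and Tornadoes are the two smallest classes in the training split. The headline accuracy should therefore be read alongside per-class results before relying on the model in real-world applications. Below is the confusion matrix for further insights into the model's performance.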