# **NLP Twitter Disaster Classifier Project**
This project classifies tweets by whether or not they describe a real disaster.

---
## **1. Importing data**
Dataset from [Kaggle](https://www.kaggle.com/competitions/nlp-getting-started)

In [1]:
import pandas as pd

df = pd.read_csv('..\\Datasets\\Twitter Disaster\\train.csv')

print(df.head())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  


---
## **2. Preprocessing data**
**This includes:**
- Removing unnecessary text.
- Converting the text to lowercase.
- Tokenizing the text.
- Removing stopwords.
- Applying Lemmatization.
- Extracting important info (such as hashtags).
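
As a rough illustration of the cleanup steps listed above, the sketch below applies the regex stages one at a time to a made-up tweet. It assumes plain whitespace splitting in place of NLTK tokenization and skips stopword removal and lemmatization so that it runs without any downloads; the helper name `clean_steps` and the sample tweet are hypothetical.

```python
import re

def clean_steps(text):
    """Apply the regex cleanup stages and return (tokens, hashtags)."""
    # Extract hashtags from the raw text before '#' is stripped out
    hashtags = re.findall(r'#\w+', text.lower())
    # Replace URLs, then HTML tags, then every non-letter character with a space
    no_urls = re.sub(r'http\S+', ' ', text)
    no_tags = re.sub(r'<.*?>', ' ', no_urls)
    letters_only = re.sub(r'[^a-zA-Z]', ' ', no_tags)
    # Lowercase and split on whitespace (a simplification of word tokenization)
    tokens = letters_only.lower().split()
    return tokens, hashtags

sample = "Forest fire near La Ronge! #wildfire <b>update</b> http://t.co/abc123"
tokens, hashtags = clean_steps(sample)
print(tokens)
print(hashtags)
```

Note that the hashtag word itself survives in the cleaned tokens because only the `#` character is removed, which is why hashtags are collected separately from the raw text.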

In [2]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

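# Opens NLTK's interactive downloader; the code below needs the
# stopwords, WordNet, and Punkt tokenizer resources
nltk.download()

lemmatizer = WordNetLemmatizer()
# A set makes the per-word stopword lookup faster than a list
stop_words = set(stopwords.words('english'))

# Removes URLs, HTML tags, and non-letter characters (including the '#' symbol),
# then lowercases, tokenizes, removes stopwords, and lemmatizes the text.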
def preprocess_text(text):
    text = re.sub(r'http\S+', ' ', text)
    text = re.sub(r'<.*?>', ' ', text)
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    words = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Extract hashtags (lowercased) and join them into a space-separated string
def preprocess_hashtags(text):
    hashtags = re.findall(r'#\w+', text.lower())
    return ' '.join(hashtags)

# Apply preprocessing to the tweet text
df['cleaned_text'] = df['text'].apply(preprocess_text)
# Extract hashtags to a separate column
df['hashtags'] = df['text'].apply(preprocess_hashtags)

print(df.head())

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target                                       cleaned_text  \
0       1         deed reason earthquake may allah forgive u   
1       1              forest fire near la ronge sask canada   
2       1  resident asked shelter place notified officer ...   
3       1  people receive wildfire evacuation order calif...   
4       1  got sent photo ruby alaska smoke wildfire pour...   

             hashtags  
0         #earthquake  
1                      

---
## **3. Feature Extraction**
Vectorize the cleaned text and the extracted hashtags using TF-IDF.
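
To make the weighting concrete, here is a minimal pure-Python sketch of how TF-IDF scores can be computed. It assumes whitespace tokenization, plus the smoothed IDF and L2 row normalization that scikit-learn documents as `TfidfVectorizer` defaults. The function name `tfidf_rows` and the toy documents are made up for illustration, and the sketch is not a replacement for the vectorizer used in this notebook.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in doc_freq.items()}
    rows = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in term_counts.items()}
        # Scale each row to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

toy_docs = ["forest fire near town", "fire crew fire truck", "sunny day near beach"]
rows = tfidf_rows(toy_docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Under these assumptions, a term that appears in fewer documents ("forest") gets a higher weight than a more common one ("fire") within the same row, and a row containing a single term is normalized to a weight of exactly 1.0.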

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack


tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(df['cleaned_text'])
y = df['target']

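# Vectorize hashtags with a separate vectorizer so the text vocabulary
# fitted above is not overwritten
hashtag_vectorizer = TfidfVectorizer()
X_hashtags = hashtag_vectorizer.fit_transform(df['hashtags'])

# Add binary indicator columns to df for a few common hashtags.
# This is a substring check, so '#fire' also matches '#firefighter'.
# These columns are stored in df only and are not part of X below.
hashtag_set = ['#disaster', '#earthquake', '#flood', '#fire']
for hashtag in hashtag_set:
    df[hashtag] = df['hashtags'].apply(lambda x: int(hashtag in x))


# Concatenate the text and hashtag TF-IDF matrices column-wise
X = hstack([X, X_hashtags])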

print(X)

  (0, 3267)	0.4917752267617825
  (0, 10595)	0.3600775896596117
  (0, 3958)	0.33751907334621206
  (0, 8093)	0.3037960127435994
  (0, 355)	0.4247908189689651
  (0, 4931)	0.4917752267617825
  (0, 15326)	1.0
  (1, 4923)	0.306581494866499
  (1, 4756)	0.22047874066399495
  (1, 8831)	0.3154679292758814
  (1, 7285)	0.35138136833297784
  (1, 11094)	0.4917659025364492
  (1, 11339)	0.4917659025364492
  (1, 1936)	0.3882533009679326
  (2, 10838)	0.2760497419903291
  (2, 745)	0.27191221012041994
  (2, 11689)	0.5723111169469418
  (2, 9894)	0.45693338781287907
  (2, 9073)	0.32474807166271447
  (2, 9216)	0.22846669390643953
  (2, 4358)	0.2096271571821481
  (2, 9386)	0.2204021066462517
  (2, 4448)	0.2517005771541364
  (3, 4358)	0.39180400373235424
  (3, 9386)	0.4119429418203297
  :	:
  (7610, 7192)	0.4783083287340091
  (7610, 13882)	0.49463673979851114
  (7610, 5778)	0.5517244286343893
  (7611, 1990)	0.17545585317139523
  (7611, 9986)	0.16956362461618765
  (7611, 7540)	0.1700821329880351
  (7611, 10957)

---
## **4-5. Model Training and Evaluation**
Train multiple models and compare their performance on the test split and under cross-validation.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=26)

print(f'Training data shape: {X_train.shape}')
print(f'Test data shape: {X_test.shape}')

Training data shape: (6090, 16724)
Test data shape: (1523, 16724)


In [5]:
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix, precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Resample data to handle imbalance
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)

# Dictionary of all the models to compare
models = {
    'Logistic Regression': LogisticRegression(class_weight='balanced', max_iter=1000),
    'SVM': SVC(class_weight='balanced', probability=True),
    # Binary task, so use 'logloss' rather than the multiclass 'mlogloss'.
    # y_train is already balanced by SMOTE, so this ratio evaluates to 1.
    'XGBoost': xgb.XGBClassifier(scale_pos_weight=(np.sum(y_train == 0)/np.sum(y_train == 1)), eval_metric='logloss'),
    'Random Forest': RandomForestClassifier(class_weight='balanced'),
}

evaluation_results = {}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=26)

for model_name, model in models.items():
    print(f"Evaluating {model_name}...")
    
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_test)
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    conf_matrix = confusion_matrix(y_test, y_pred)
    classification_rep = classification_report(y_test, y_pred)

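    # Note: X_train was resampled with SMOTE before this point, so the validation
    # folds contain synthetic samples and this score may be optimistic.
    cv_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring='f1')
    mean_cv_score = np.mean(cv_scores)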

    evaluation_results[model_name] = {
        'Classification Report': classification_rep,
        'Confusion Matrix': conf_matrix,
        'Mean Cross-Validation F1 Score': mean_cv_score,
    }
    
    print(f'{model_name} Performance:')
    print('Classification Report:\n', classification_rep)
    print('Mean Cross-Validation F1 Score:', mean_cv_score)
    print('Confusion Matrix:\n', conf_matrix)
    print('\n')


Evaluating Logistic Regression...
Logistic Regression Performance:
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.88      0.84       893
           1       0.80      0.70      0.75       630

    accuracy                           0.81      1523
   macro avg       0.80      0.79      0.80      1523
weighted avg       0.81      0.81      0.80      1523

Mean Cross-Validation F1 Score: 0.8111678611122946
Confusion Matrix:
 [[784 109]
 [187 443]]


Evaluating SVM...
SVM Performance:
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.92      0.85       893
           1       0.85      0.64      0.73       630

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.81      0.80      0.80      1523

Mean Cross-Validation F1 Score: 0.8144967739572859
Confusion Matrix:
 [[821  72]
 [229 401]]


Evalua

---
## **6. Interpretation and Application**
Use the evaluation metrics to pick the best-performing model for this data.

In [6]:
# Determine the best model based on Mean Cross-Validation F1 Score
best_model_name = max(evaluation_results, key=lambda k: evaluation_results[k]['Mean Cross-Validation F1 Score'])
best_model_metrics = evaluation_results[best_model_name]

# Print the model's info
print('======================================')
print(f"The best model is: {best_model_name}")
print('======================================')
print('Metrics:\n---------------------')
print('Classification Report:', best_model_metrics['Classification Report'])
print('Confusion Matrix:\n', best_model_metrics['Confusion Matrix'])
print('Mean Cross-Validation F1 Score:', best_model_metrics['Mean Cross-Validation F1 Score'])

The best model is: SVM
Metrics:
---------------------
Classification Report:               precision    recall  f1-score   support

           0       0.78      0.92      0.85       893
           1       0.85      0.64      0.73       630

    accuracy                           0.80      1523
   macro avg       0.81      0.78      0.79      1523
weighted avg       0.81      0.80      0.80      1523

Confusion Matrix:
 [[821  72]
 [229 401]]
Mean Cross-Validation F1 Score: 0.8144967739572859
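
The disaster-class figures in the report above can be re-derived from the confusion matrix. The short sketch below does the arithmetic, assuming scikit-learn's layout in which rows are true labels and columns are predicted labels; the variable names are chosen for illustration.

```python
# Confusion matrix printed for the SVM: rows = true class, columns = predicted class
conf_matrix = [[821, 72],
               [229, 401]]
(tn, fp), (fn, tp) = conf_matrix

precision = tp / (tp + fp)            # share of predicted disasters that were real
recall = tp / (tp + fn)               # share of real disasters that were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.85 recall=0.64 f1=0.73 accuracy=0.80
```

Read this way, the recall of 0.64 corresponds to 229 of the 630 real disaster tweets in the test split being classified as non-disaster.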


### **How Can We Use This Model Efficiently?**

#### **Real-Time Monitoring**

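We can **connect** the model to a live tweet stream to watch for **disaster-related tweets**. Tweets it flags can be used to **alert authorities** about new situations, helping them respond quickly and coordinate efforts. Because the selected model's recall on the disaster class is 0.64, it misses roughly a third of real disaster tweets, so it works best as a support for human monitoring.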
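
One possible shape for such a monitor is sketched below. It is a sketch only: the function `flag_disaster_tweets`, the stand-in `keyword_score` scorer, the keyword list, and the threshold are all hypothetical. In practice the scorer would wrap the notebook's preprocessing, the fitted vectorizers, and the trained model's probability output.

```python
def flag_disaster_tweets(tweets, score_fn, threshold=0.5):
    """Yield (tweet, score) for every tweet whose score reaches the threshold."""
    for tweet in tweets:
        score = score_fn(tweet)
        if score >= threshold:
            yield tweet, score

def keyword_score(tweet):
    # Stand-in scorer (NOT the trained model): fraction of keywords present
    keywords = ("fire", "flood", "earthquake")
    hits = sum(word in tweet.lower() for word in keywords)
    return hits / len(keywords)

stream = [
    "Wildfire and flood warnings issued",
    "Lovely sunny afternoon",
    "Earthquake felt downtown",
]
alerts = list(flag_disaster_tweets(stream, keyword_score, threshold=0.3))
for tweet, score in alerts:
    print(f"ALERT ({score:.2f}): {tweet}")
```

Keeping the scorer as a parameter means the alerting loop does not need to change when the underlying model is swapped or retrained.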

#### **Spotting Key Information**

By separating real disaster reports from unrelated tweets, the model helps **pick out** and **highlight** crucial info from social media, so that **important updates** reach the right people in time. This helps affected communities stay informed.

#### **Guiding Resource Use**

Counting the tweets the model flags over time and by location reveals **tweet trends** that can help decide where **resources** should go. These counts give a rough indication of how **serious** and **widespread** a disaster is, which supports **smarter planning** and quicker responses.
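
A minimal sketch of that kind of aggregation follows, assuming each record already carries a model prediction and an optional location (the dataset's `location` column is often empty). The record layout, the `predicted` field, and the function name `disaster_counts_by_location` are hypothetical.

```python
from collections import Counter

def disaster_counts_by_location(records):
    """Count tweets predicted as disasters, grouped by location."""
    counts = Counter()
    for record in records:
        if record["predicted"] == 1:
            # Missing locations are grouped under a single placeholder key
            counts[record["location"] or "unknown"] += 1
    return counts

records = [
    {"location": "California", "predicted": 1},
    {"location": "California", "predicted": 1},
    {"location": None, "predicted": 1},
    {"location": "Canada", "predicted": 0},
]
counts = disaster_counts_by_location(records)
print(counts.most_common())
```

Raw counts like these only indicate where flagged tweets concentrate; they would need to be read alongside the model's error rates before driving any allocation decision.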