### Natural Language Processing with Disaster Tweets (Part 3 -- RoBERTa)
# 15. Training and Fine-Tuning RoBERTa-base

----------------------------------------------------------------------------------

## RoBERTa-base with ClassificationModel
Our first attempt at this task used BERT, which, while powerful, proved resource-intensive and difficult to fine-tune effectively. That experience led us to RoBERTa, a variant of BERT with an improved pretraining recipe, which we found considerably easier to work with.

Our BERT pipeline exposed many hyperparameters and took substantial time to optimize. We tuned the learning rate and batch size and added layers to increase model capacity. Despite this effort, the results fell short of expectations, and the complexity of the setup slowed our progress.

In contrast, training RoBERTa through the ClassificationModel class was far more streamlined. The class hides much of the model configuration and training-loop boilerplate, which let us focus on the core aspects of the NLP task. This reduced the time spent on model setup and made it possible to iterate and experiment faster.

Moreover, RoBERTa's pretraining, which uses dynamic masking and a much larger corpus than BERT's, improves its performance on many NLP tasks. Combined with the simple interface of the ClassificationModel class, this let us reach competitive results with less effort than in our earlier BERT experiments.

In short, our move from BERT to RoBERTa shows how much the choice of tools and frameworks matters. With the ClassificationModel class handling the training setup, we can spend our time refining the approach, and we expect further gains as we continue to explore RoBERTa.

In [None]:
!pip install transformers==2.11.0 --quiet
!pip install pyspellchecker --quiet
!pip install simpletransformers --quiet

In [None]:
import random
import torch
import numpy as np
import pandas as pd
import time
import simpletransformers
from simpletransformers.classification import ClassificationModel
import warnings
warnings.simplefilter('ignore')
from scipy.special import softmax
import sklearn
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.metrics import log_loss, f1_score
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 100)

def seed_all(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu  vars
    
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
        torch.backends.cudnn.deterministic = True  #needed
        torch.backends.cudnn.benchmark = False

seed_all(79)

In [None]:
if torch.cuda.is_available():      
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

## Preparing Dataset for training

In [None]:
train = pd.read_csv("/kaggle/input/cleaned-data/df_train.csv")
test = pd.read_csv("/kaggle/input/cleaned-data/df_test.csv")
print("Shape of train data : ",train.shape)
print("Shape of test data : ",test.shape)

In [None]:
# Prepend the keyword column to the text column
# (assign the result of fillna instead of calling it in place on a column selection)
train['keyword'] = train['keyword'].fillna('')
train['final_text'] = train['keyword'] + ' ' + train['text']
test['keyword'] = test['keyword'].fillna('')
test['final_text'] = test['keyword'] + ' ' + test['text']

In [None]:
first_col = ['final_text']
last_cols = [col for col in train.columns if col not in first_col]

train = train[first_col+last_cols]
train.head()

In [None]:
# Drop every column that is not needed for training
train = train.drop(columns=[
    'id', 'keyword', 'b4combine', 'b4embedding_text',
    'word_count', 'unique_word_count', 'stop_word_count',
    'mean_word_length', 'char_count', 'punctuation_count', 'text',
])

In [None]:
train.head()

In [None]:
final=pd.DataFrame()
final['id']=test['id']
final.head()

In [None]:
first_col = ['final_text']
last_cols = [col for col in test.columns if col not in first_col]

test = test[first_col+last_cols]
test.head()

In [None]:
# Drop the same unused columns from the test set
test = test.drop(columns=[
    'id', 'keyword', 'text', 'b4combine', 'b4embedding_text',
    'word_count', 'unique_word_count', 'stop_word_count',
    'mean_word_length', 'char_count', 'punctuation_count',
])
# Placeholder label column; the test set has no ground-truth target
test['label'] = 0


In [None]:
test.head()

In [None]:
train['target'].value_counts()


In [None]:
print("Target Imbalance Rate: 0 vs 1 = ", 4305/3198)

In [None]:
train = train.reindex(np.random.permutation(train.index))
train= train.reset_index(drop=True)
train.head()


In [None]:
f1 = sklearn.metrics.f1_score

## Parameters Tuning
*These are the current parameter settings for training the RoBERTa model in this classification project, together with recommendations for future tuning.*

- Epochs: Two epochs may be insufficient for convergence. Consider increasing this to 3-5 epochs while monitoring performance on the validation set.

- Learning rate: A learning rate of 2e-5 is a common starting point, but it may be worth testing other values (e.g., 1e-5, 3e-5) and using a learning rate scheduler to adapt the rate during training.

- Mixed precision training: If the hardware supports it (e.g., NVIDIA GPUs), enabling mixed precision training (fp16: True) can speed up training and reduce memory usage.

- Class weights: The weight parameter addresses class imbalance. The value used here, [1, 1.346], matches the class ratio in this dataset (4305 / 3198 ≈ 1.346), so mistakes on the less frequent class 1 cost more during training.

- F1 score: F1 is a critical metric under class imbalance because it balances precision and recall, whereas a high overall accuracy can mask poor performance on the minority class. Tracking F1 on each fold shows when class weights or other parameters need adjusting.

Tuning these parameters systematically and measuring their effect on the validation F1 score should lead to better generalization, particularly where class imbalance is a concern.
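To make the point about accuracy masking poor minority-class performance concrete, here is a small worked example in plain Python. The confusion counts are hypothetical and deliberately more imbalanced than this dataset; they are not results from this notebook.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical validation fold: 100 tweets, 20 of them real disasters.
# The model finds only 5 of the 20 disasters and raises 1 false alarm.
tp, fp, fn, tn = 5, 1, 15, 79

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.84 precision=0.83 recall=0.25 f1=0.38
```

The accuracy looks respectable because the majority class dominates, while the F1 score exposes that most disaster tweets were missed.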

In [None]:
# model configuration
model_args = {
    "save_eval_checkpoints": False,
    "save_model_every_epoch": False,
    'reprocess_input_data': True,
    'overwrite_output_dir': True,
    'manual_seed': 79,
    "silent": True,
    'num_train_epochs': 2,
    'learning_rate': 2e-5,
    'fp16': False,
    'max_seq_length': 64,
}
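The learning_rate in the configuration above is a single number, while the tuning notes suggest trying a learning rate scheduler. As a rough illustration of what such a scheduler does, the sketch below implements linear warmup followed by linear decay in plain Python. The function name and the total_steps and warmup_steps values are hypothetical, and this is not necessarily the schedule that simpletransformers applies internally.

```python
def linear_warmup_decay(step, total_steps, peak_lr=2e-5, warmup_steps=60):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: grow from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: shrink from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1000 optimizer steps.
total_steps = 1000
for step in (0, 30, 60, 530, 1000):
    print(step, linear_warmup_decay(step, total_steps))
```

The rate reaches its peak of 2e-5 at the end of warmup and returns to zero at the final step, so early updates stay small while the classification head is still untrained.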

In [None]:
print(train.columns) 

In [None]:
print(train['final_text'].head())  # Check the first few entries
print(train['final_text'].apply(type))  # Check types of entries
print(test['final_text'].head())  # Check the first few entries
print(test['final_text'].apply(type))  # Check types of entries


In [None]:
train['final_text'] = train['final_text'].fillna("")  # Replace NaN with empty strings
test['final_text'] = test['final_text'].fillna("")  # Replace NaN with empty strings

In [None]:
train['final_text'] = train['final_text'].astype(str)  # Convert all entries to string
test['final_text'] = test['final_text'].astype(str)  # Convert all entries to string

## Model information (RoBERTa-base)
- RoBERTa is an advanced version of BERT (Bidirectional Encoder Representations from Transformers) developed by Facebook AI in 2019. 
- It was introduced in the paper:   📄 RoBERTa: A Robustly Optimized BERT Pretraining Approach (Liu et al., 2019).
- RoBERTa builds upon BERT but makes several improvements in the way the model is pre-trained, leading to better performance on many NLP tasks.

-------------------------------
🔥 How is RoBERTa Different from BERT?     
*RoBERTa improves BERT in several key ways:*

1) Trained on More Data 📊
- RoBERTa is trained on about 160GB of text, compared to BERT’s 16GB.
- BERT’s corpus was BooksCorpus and English Wikipedia. RoBERTa keeps both and adds CC-News (drawn from Common Crawl), OpenWebText, and Stories.
2) Removes Next Sentence Prediction (NSP) ❌
- BERT uses Next Sentence Prediction (NSP) during pre-training.
- RoBERTa removes NSP, which was found to be unnecessary and even detrimental.
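3) Uses Dynamic Masking for Masked Language Modeling (MLM) 🔄
- In BERT, 15% of tokens are selected for masking once, during data preprocessing, so a sequence is seen with the same mask repeatedly (static masking).
- RoBERTa generates a new masking pattern every time a sequence is fed to the model, helping it generalize better.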
4) Bigger Batches and Longer Training 🏋️‍♂️
- RoBERTa is trained for more steps and with larger batch sizes compared to BERT.
- BERT’s max batch size: 256 sequences
- RoBERTa’s max batch size: 8,000 sequences
5) Better Hyperparameter Tuning 🎯
- The learning rate, batch size, and training schedules are optimized for better results.
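The difference between static and dynamic masking in point 3 can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual pretraining code: the example sentence and helper name are made up, masking is applied to whole words rather than subword tokens, and the real procedure also sometimes substitutes a random token or leaves the selected token unchanged.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with MASK independently with probability mask_prob."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


# Hypothetical example sentence, split on whitespace for simplicity.
tokens = "wildfire spreads quickly near the coastal highway tonight".split()

# Static masking: one pattern is drawn during preprocessing and reused every epoch.
static_pattern = mask_tokens(tokens, random.Random(0))
static_epochs = [static_pattern for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used.
rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees identical masked positions in every epoch, while dynamic masking lets the masked positions vary across passes over the same sentence.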


📊 Performance: How Well Does RoBERTa Perform?

- RoBERTa generally outperforms BERT on standard NLP benchmarks. Liu et al. (2019) attribute the gains to the pretraining changes listed above (more data, dynamic masking, larger batches, longer training) rather than to a new architecture.

- SQuAD (Stanford Question Answering Dataset): RoBERTa reports an F1 score of 94.6 on the SQuAD v1.1 development set. BERT's frequently quoted 93.2 is a test-set F1 for an ensemble, so the two figures are not directly comparable; the single-model BERT-large baseline on the development set is 90.9.

- GLUE (General Language Understanding Evaluation): RoBERTa scored 88.5 on the public GLUE leaderboard, while the original BERT paper reported a GLUE score of 80.5.

- Named Entity Recognition (NER): RoBERTa is often reported to extract names of people, places, and organizations from unstructured text more reliably than BERT, although no benchmark figures are cited here.

- Sentiment Analysis: Fine-tuned RoBERTa models are likewise often reported to classify sentiment, including subtle emotions, more accurately than BERT.

These results make RoBERTa a common choice for tasks such as text classification, question answering, and sentiment analysis where accuracy matters.

## Training with ClassificationModel Class

The ClassificationModel class in simpletransformers makes it easier to use transformer-based models for text classification. By specifying the model type, pre-trained weights, and training arguments, you can quickly set up and fine-tune a model. Its main parameters are:

1) Model Type:
- The first parameter (e.g., 'roberta') selects the architecture. ClassificationModel supports several, such as BERT, RoBERTa, and DistilBERT.

2) Model Name or Path:
- The second parameter (e.g., 'roberta-base') is the name of a pre-trained checkpoint on Hugging Face's Model Hub or a path to a local model directory. The weights and configuration are loaded from there and then fine-tuned for the classification task.

3) Weight:
- The optional weight parameter sets per-class weights for imbalanced datasets so that training pays more attention to underrepresented classes. In this notebook, weight=[1, 1.346] gives class 1 about 1.346 times the weight of class 0, matching the 4305/3198 class ratio.

4) Arguments (args):
- The args parameter specifies training options such as the learning rate, number of epochs, and maximum sequence length. It is typically a dictionary whose keys correspond to the training options, like model_args above.
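The weight values can be derived from the class counts rather than typed by hand. The sketch below assumes the weights are the majority-class count divided by each class's own count, which is consistent with the 4305/3198 ratio printed earlier in this notebook; the helper name is hypothetical.

```python
def class_weights_from_counts(counts):
    """Weight each class by (largest class count) / (its own count)."""
    largest = max(counts.values())
    # Sort by label so the list position matches the class index.
    return [round(largest / counts[label], 3) for label in sorted(counts)]


# Class counts quoted in the imbalance cell of this notebook (class 0 vs class 1).
counts = {0: 4305, 1: 3198}
weights = class_weights_from_counts(counts)
print(weights)  # [1.0, 1.346]
```

In the notebook itself, the counts dictionary could be built with train['target'].value_counts().to_dict() so the weights follow the data if the cleaning step changes.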


In [None]:
import time
start_time = time.time()
# Clear CUDA cache
torch.cuda.empty_cache()

# Set up Stratified K-Fold
kf = StratifiedKFold(n_splits=15, shuffle=True, random_state=79)
err = []
y_pred_tot = []

# Perform Stratified K-Fold cross-validation
for train_index, test_index in kf.split(train, train['target']):
    train1_trn, train1_val = train.iloc[train_index], train.iloc[test_index]
    
    # Initialize the model
    model_rb = ClassificationModel('roberta', 'roberta-base', weight=[1, 1.346], args=model_args)
    
    # Train the model
    model_rb.train_model(train1_trn, eval_df=train1_val)

    # Evaluate the model
    result, model_outputs, _ = model_rb.eval_model(train1_val, f1=sklearn.metrics.f1_score, acc=sklearn.metrics.accuracy_score)
    print(f"F1 Score: {result['f1']:.4f}, Accuracy: {result['acc']:.4f}")    
    
    err.append(result['f1'])

    # Make predictions
    try:
        predictions, _ = model_rb.predict(test['final_text'].tolist())  # Convert to list
        y_pred_tot.append(predictions)
        
    except Exception as e:
        print("Error during prediction:", e)
    
    # Print mean F1 score after each fold
    print("Mean F1 Score: ", np.mean(err))

# Final mean F1 score across all folds
print("Final Mean F1 Score: ", np.mean(err))

end_time = time.time()
print(f"Execution time: {end_time - start_time} seconds")

# 16. Prediction

In [None]:
target_submit =np.mean(y_pred_tot,0)
print(target_submit[100:150])

In [None]:
target_submit

In [None]:
#predictions, raw_outputs = model_rb.predict(test['final_text'])
final['target']=target_submit
final['target'] = final['target'].apply(lambda x: 1 if x>0.5 else 0)
final.head()
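Because each fold contributes hard 0/1 predictions, averaging them and thresholding at 0.5 amounts to a majority vote across the fold models. A minimal sketch in plain Python, using made-up predictions for three folds and four tweets (the notebook does the same with np.mean over 15 folds):

```python
def average_votes(fold_predictions):
    """Average per-fold 0/1 predictions column-wise (one column per test tweet)."""
    n_folds = len(fold_predictions)
    return [sum(votes) / n_folds for votes in zip(*fold_predictions)]


def to_labels(avg_votes, threshold=0.5):
    """Label a tweet 1 when strictly more than `threshold` of the folds voted 1."""
    return [1 if v > threshold else 0 for v in avg_votes]


# Hypothetical hard predictions from 3 folds for 4 test tweets.
fold_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
]

avg = average_votes(fold_predictions)
labels = to_labels(avg)
print(labels)  # [1, 0, 0, 0]
```

With an odd number of folds, such as the 15 used here, the average can never equal exactly 0.5, so no tie-breaking rule is needed.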

In [None]:
final.to_csv('submission.csv',index=False)

In [None]:
submission = pd.read_csv("/kaggle/input/submis/submission.csv")
submission['target'] = submission['target'].apply(lambda x: 1 if x>0.5 else 0)
submission.head()

In [None]:
submission.to_csv('submission.csv',index=False)

# 17. Submission Result


![Image](https://github.com/user-attachments/assets/c685acef-2ea0-4bde-bcb3-0b8e3d8514d9)

# 18. Conclusion 

### *After Fine-Tuning BERT, and then RoBERTa for Disaster Prediction Using Twitter Text*
- This project aimed to improve disaster prediction by fine-tuning BERT and RoBERTa models on Twitter text data. Both models identified disaster-related tweets effectively, with RoBERTa slightly but consistently outperforming BERT.

- Future plans include exploring additional data augmentation techniques to improve model robustness and experimenting with ensemble methods that combine predictions from both models. We also aim to incorporate sentiment analysis to better understand public sentiment during disasters, which could improve the models' predictions. Continuous evaluation and adaptation will be essential as we gather more diverse and real-time data from Twitter.

- In this project, we focused on predicting disasters through binary classification of Twitter text by fine-tuning BERT and RoBERTa models. The results showed that BERT achieved an accuracy of **0.82071**, while RoBERTa slightly outperformed it with an accuracy of **0.83144**. Although the use of transformer architecture was deemed appropriate for understanding the relationships between words in the text, the score of 0.83144 was not entirely satisfactory.

- The relatively low performance can be attributed in part to the limited dataset of approximately 7,000 tweets, which is far smaller than the vast datasets used to train large language models (LLMs). We expect that acquiring a larger dataset of tweets will move the score closer to 1 in future iterations.

- Additionally, we believe that developing a custom transformer model tailored to our specific needs could also bring us closer to achieving this goal. By leveraging more extensive and diverse data, we aim to enhance the model's predictive capabilities and overall performance in disaster prediction tasks.

- In conclusion, the project has laid a solid foundation for future work, and we are optimistic about the potential improvements that can be made with more data and refined modeling techniques.