### Natural Language Processing with Disaster Tweets (Part2 -- BERT)

# 5. Training, fine-tuning Bert-base

-----------------------------

# BERT-base with Classification Model

In this BERT model experiment, the first setup (with 1 dropout layer and 1 dense layer) performed better than the second setup (with 2 dropout layers and 3 dense layers). Here's how we can interpret the results from different perspectives:

1. Increase in Model Complexity
The second experiment added more dropout and dense layers (2 dropout layers and 3 dense layers). More layers mean more parameters, which can let the model learn more complex patterns but also raises the risk of overfitting: a more complex model may fit the training data too closely and fail to generalize to unseen data. The simpler model in the first experiment may therefore have generalized better.

2. Role of Dropout Layers
Dropout is a regularization technique that helps prevent overfitting by randomly ignoring certain neurons during training. However, adding too many dropout layers can make training unstable. In the first experiment, having just 1 dropout layer likely provided a good balance of regularization without causing too much information loss. In the second experiment, increasing the dropout layers to 2 might have overly constrained the model, leading to performance degradation due to too much regularization.

3. Adding Dense Layers
Adding dense layers increases the representational power of the model, but too many layers can make training harder or cause issues such as vanishing gradients. The single dense layer in the first experiment may have been enough to capture the important patterns in the data. The extra layers in the second experiment may have made the model unnecessarily complex and harder to train, possibly costing performance.

4. Learning Rate and Initialization Issues
When changing the model architecture, the learning rate or parameter initialization may need to be adjusted. In the second experiment, with a more complex architecture, it’s possible that the learning rate or initialization wasn’t ideal for the new setup, leading to poorer performance.
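
To put the difference in model complexity into numbers, the sketch below counts the trainable parameters of each classification head. The layer sizes of the second setup (768 → 256 → 32 → 1) are taken from the `BertClassifier` in this notebook. The first setup is an assumption (a single 768 → 1 linear layer), since that notebook was not saved. `linear_params` is a helper name introduced here only for illustration.

```python
def linear_params(in_features, out_features):
    """Parameters of one fully connected layer: weight matrix plus bias vector."""
    return in_features * out_features + out_features

HIDDEN = 768  # BERT-base hidden size

# First setup (assumed): Dropout -> Linear(768, 1). Dropout has no parameters.
head_simple = linear_params(HIDDEN, 1)

# Second setup: Linear(768, 256) -> Linear(256, 32) -> Linear(32, 1)
head_deep = (
    linear_params(HIDDEN, 256)
    + linear_params(256, 32)
    + linear_params(32, 1)
)

print(head_simple)  # 769
print(head_deep)    # 205121
```

Under these assumptions the deeper head has a few hundred times as many newly initialized parameters, which is consistent with the overfitting and optimization concerns listed above.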


*(One thing I regret is not properly saving the notebook information from the first experiment, so I can only submit the notebook from the second experiment.)*

In [1]:
!pip install -q transformers datasets torch torchvision torchaudio

print("all done")

all done


In [2]:
import os
import gc
gc.enable()
import time
import warnings
warnings.filterwarnings("ignore")
import re
import string
import operator
import urllib.request
import zipfile

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 100)

import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict

from transformers import AutoTokenizer, BertModel, BertTokenizer, BertForSequenceClassification, DataCollatorWithPadding

import torch
from torch.optim import AdamW
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from torch.cuda.amp import autocast, GradScaler
from torch.optim.lr_scheduler import ReduceLROnPlateau

from datasets import load_dataset
from wordcloud import STOPWORDS

import nltk

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer


from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit, GroupKFold, train_test_split, GroupShuffleSplit
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.utils.class_weight import compute_class_weight
import torch.nn.functional as F

import tensorflow as tf
import tensorflow_hub as hub
from tensorflow import keras
from tensorflow.keras.optimizers import SGD, Adam
from tensorflow.keras.layers import Dense, Input, Dropout, GlobalAveragePooling1D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, Callback

In [None]:
def set_seed(seed):
    import random  # only needed here, to seed Python's own RNG
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

In [None]:
# Check and set device (using GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# 6. Encoding with Tokenizer

In [None]:
class TextDataset(Dataset):
    def __init__(self, texts, targets, max_len):
        self.texts = texts
        self.targets = targets  # ✅ targets may be None (e.g. at prediction time)
        self.max_len =  max_len
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            truncation=True,
            padding="max_length",
            max_length=self.max_len,
            return_tensors="pt",
            return_token_type_ids=True  # ✅ added
        )
        
        # targets can be None at prediction time, so labels are added conditionally
        item = {
            "input_ids": encoding["input_ids"].squeeze(0),
            "attention_mask": encoding["attention_mask"].squeeze(0),
            "token_type_ids": encoding["token_type_ids"].squeeze(0),  # ✅ included
        }
        
        if self.targets is not None:
            item["labels"] = torch.tensor(self.targets[idx], dtype=torch.long)
        
        return item
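
The tokenizer call above relies on `truncation=True` and `padding="max_length"` so that every example has the same length. As a rough illustration of what that produces, here is a simplified pure-Python sketch; it is not the Hugging Face implementation. The function name `encode_fixed_length` is hypothetical, and the special-token ids (101 for `[CLS]`, 102 for `[SEP]`, 0 for `[PAD]`) are the ones used by the `bert-base-uncased` vocabulary.

```python
def encode_fixed_length(token_ids, max_len, cls_id=101, sep_id=102, pad_id=0):
    """Wrap token ids in [CLS]/[SEP], truncate to max_len, then pad to max_len."""
    body = token_ids[: max_len - 2]          # leave room for the two special tokens
    input_ids = [cls_id] + body + [sep_id]
    attention_mask = [1] * len(input_ids)    # 1 marks a real token
    n_pad = max_len - len(input_ids)
    input_ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad            # 0 marks padding, which attention ignores
    token_type_ids = [0] * max_len           # single-sentence input: everything is segment 0
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# A short input gets padded; a long input gets truncated.
short = encode_fixed_length([7592, 2088], max_len=6)
long = encode_fixed_length([2001, 2002, 2003, 2004, 2005], max_len=4)
print(short["input_ids"], short["attention_mask"])
print(long["input_ids"], long["attention_mask"])
```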

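# 7. Model building on top of BERT
I implemented a custom `BertClassifier` class that uses the pre-trained BERT layer and adds new layers on top (dropout and dense layers). The model returns raw logits; the sigmoid is applied inside the loss function during training and explicitly at prediction time.

For reference, the BERT-Base Uncased model (`bert-base-uncased` in the code below; listed as bert-en-uncased-l-12-h-768-a-12/2 on TensorFlow Hub) has 12 layers, a hidden size of 768, 12 attention heads, and about 110M parameters: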

- 12-layer: The BERT-Base model consists of 12 transformer encoder layers. Each layer extracts high-dimensional features from the input text to contribute to context understanding.
- 768-hidden: The hidden size (or unit) of each layer is 768. This means that each word is ultimately represented by 768 numbers, which contain the meaning and context of the word.
- 12-heads: Each encoder layer has 12 attention heads. Multi-head attention allows the model to capture information from different aspects of the text. For example, it can focus on grammar, meaning, and structure differently.
- 110M parameters: This model has about 110 million parameters, which indicates the amount and complexity of information learned by the model.
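
The "110M parameters" figure can be reproduced from the numbers above with some arithmetic. The sketch below assumes the standard `bert-base-uncased` configuration values that are not listed in this notebook: a vocabulary of 30,522 tokens, 512 position embeddings, 2 segment types, and a feed-forward size of 3,072. The helper `linear` is introduced here only for illustration.

```python
VOCAB, MAX_POS, TYPES = 30522, 512, 2   # assumed from the bert-base-uncased config
HIDDEN, LAYERS, FFN = 768, 12, 3072

def linear(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Token, position and segment embeddings, followed by a LayerNorm
embeddings = (VOCAB + MAX_POS + TYPES) * HIDDEN + layer_norm

# Self-attention: query, key, value and output projections, then a LayerNorm.
# The number of heads splits the 768 dimensions but does not change the count.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm

# Position-wise feed-forward block, then a LayerNorm
feed_forward = linear(HIDDEN, FFN) + linear(FFN, HIDDEN) + layer_norm

# Pooler: the dense layer behind `pooler_output`
pooler = linear(HIDDEN, HIDDEN)

total = embeddings + LAYERS * (attention + feed_forward) + pooler
print(f"{total:,}")  # 109,482,240
```

About 22% of the parameters sit in the embedding tables; the 12 encoder layers account for most of the rest.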

In [None]:
import torch
import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self, freeze_bert=False):
        super(BertClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False  # keep the BERT weights fixed (optional)

        self.dropout1 = nn.Dropout(0.1)
        self.fc1 = nn.Linear(self.bert.config.hidden_size, 256)
        self.dropout2 = nn.Dropout(0.1)
        self.fc2 = nn.Linear(256, 32)
        self.fc3 = nn.Linear(32, 1)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        clf_output = outputs.pooler_output  # pooled representation of the [CLS] token
        x = self.dropout1(clf_output)
        x = self.fc1(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        x = self.fc3(x)
        return x  # raw logits; no sigmoid here because the loss function applies it
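
One property of the head above is worth noting: there is no activation function between `fc1`, `fc2` and `fc3`. With dropout disabled (as in `model.eval()`), consecutive linear layers without a nonlinearity compose into a single linear map, so the extra layers add parameters without adding expressive power. The tiny pure-Python sketch below illustrates this with made-up 3 → 2 → 1 weights; `affine` and `compose` are hypothetical helper names.

```python
def affine(W, b, x):
    """Compute W x + b, where each row of W produces one output value."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def compose(W2, b2, W1, b1):
    """Return (W, b) such that W x + b equals W2 (W1 x + b1) + b2."""
    cols = list(zip(*W1))  # columns of W1
    W = [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in W2]
    b = [sum(a * bi for a, bi in zip(row, b1)) + b2i for row, b2i in zip(W2, b2)]
    return W, b

# Made-up weights for a 3 -> 2 -> 1 stack of linear layers
W1, b1 = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]], [0.5, -1.0]
W2, b2 = [[2.0, -1.0]], [0.25]
x = [1.0, 2.0, 3.0]

two_layers = affine(W2, b2, affine(W1, b1, x))
W, b = compose(W2, b2, W1, b1)
one_layer = affine(W, b, x)

print(two_layers, one_layer)  # both give the same output
```

Inserting an activation such as ReLU between the linear layers would make the deeper head genuinely nonlinear; that would be a different experiment from the one reported here.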

# 8. Define Metrics to Measure

### Explanation of Training Metrics  

- **Training Precision**:  
  This measures the proportion of **true positive predictions** out of all positive predictions made by the model. It evaluates how many of the predicted positive cases were actually correct, focusing on minimizing false positives (FP). A higher precision indicates a lower rate of incorrect positive predictions.  

  \begin{equation}
  \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}
  \end{equation}
    

- **Training Recall**:  
  This metric assesses the proportion of **actual positive cases** that were correctly predicted by the model. It evaluates how well the model identifies all relevant instances while minimizing false negatives (FN). A higher recall indicates fewer missed positive cases.  

  \begin{equation}
  \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}
  \end{equation}
    

- **Training F1 Score**:  
  The F1 score is the **harmonic mean** of precision and recall, balancing both metrics. It is particularly useful when dealing with imbalanced datasets, as it ensures that neither precision nor recall is disproportionately prioritized.  

  \begin{equation}
  \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
  \end{equation}
    

- **Training Accuracy**:  
  Accuracy represents the proportion of **correct predictions** (both positive and negative) out of all predictions made by the model. While useful in balanced datasets, it may not always be a reliable metric in highly imbalanced cases.  

  \begin{equation}
  \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Samples}}
  \end{equation}
    

- **Training Loss**:  
  The loss function measures the **difference between the predicted output and the actual label**. It is the quantity the optimizer minimizes during training, and lower values indicate a better fit. Common loss functions include Cross-Entropy Loss (for classification) and Mean Squared Error (for regression); this notebook uses binary cross-entropy on raw logits (`nn.BCEWithLogitsLoss`).
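
As a worked example of the formulas above, the following sketch computes the four metrics from a small set of made-up labels and predictions. It also computes the macro-averaged F1 (the unweighted mean of the per-class F1 scores), because the `ClassificationReport` class below calls scikit-learn with `average='macro'`. The function `binary_metrics` and the sample data are illustrative only.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Made-up example: 4 disaster tweets (1) and 6 non-disaster tweets (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # TP=2, FN=2, FP=1, TN=5

pos = binary_metrics(y_true, y_pred, positive=1)
neg = binary_metrics(y_true, y_pred, positive=0)
macro_f1 = (pos["f1"] + neg["f1"]) / 2

print(pos)       # precision 2/3, recall 1/2, F1 4/7, accuracy 0.7
print(macro_f1)  # (4/7 + 10/13) / 2
```

Note that accuracy (0.7) looks better than the positive-class F1 (about 0.57) here, which is why F1 is the more informative number when the classes are not balanced.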


In [None]:
class ClassificationReport:
    def __init__(self):
        self.train_precision_scores = []
        self.train_recall_scores = []
        self.train_f1_scores = []
        self.train_accuracy_scores = []  # Added list for accuracy
        self.train_loss = []
        self.val_precision_scores = []
        self.val_recall_scores = []
        self.val_f1_scores = []
        self.val_accuracy_scores = []  # Added list for accuracy
        self.val_loss = []

    def on_epoch_end(self, model, train_loader, val_loader, device, criterion):
        model.eval()
        
        # 🔹 Evaluate on the training data
        train_preds, train_labels, train_loss = self._predict_with_loss(model, train_loader, device, criterion)
        train_precision = precision_score(train_labels, train_preds, average='macro')
        train_recall = recall_score(train_labels, train_preds, average='macro')
        train_f1 = f1_score(train_labels, train_preds, average='macro')
        train_accuracy = np.mean(train_preds == train_labels)  # Accuracy calculation

        self.train_precision_scores.append(train_precision)
        self.train_recall_scores.append(train_recall)
        self.train_f1_scores.append(train_f1)
        self.train_accuracy_scores.append(train_accuracy)  # Store accuracy
        self.train_loss.append(train_loss)

        # 🔹 Evaluate on the validation data
        val_preds, val_labels, val_loss = self._predict_with_loss(model, val_loader, device, criterion)
        val_precision = precision_score(val_labels, val_preds, average='macro')
        val_recall = recall_score(val_labels, val_preds, average='macro')
        val_f1 = f1_score(val_labels, val_preds, average='macro')
        val_accuracy = np.mean(val_preds == val_labels)  # Accuracy calculation

        self.val_precision_scores.append(val_precision)
        self.val_recall_scores.append(val_recall)
        self.val_f1_scores.append(val_f1)
        self.val_accuracy_scores.append(val_accuracy)  # Store accuracy
        self.val_loss.append(val_loss)

        # 🔹 Print the scores for this epoch
        print(f'- Training Precision: {train_precision:.6f} - Training Recall: {train_recall:.6f} - Training F1: {train_f1:.6f} - Training Accuracy: {train_accuracy:.6f} - Training Loss: {train_loss:.6f}')
        print(f'- Validation Precision: {val_precision:.6f} - Validation Recall: {val_recall:.6f} - Validation F1: {val_f1:.6f} - Validation Accuracy: {val_accuracy:.6f} - Validation Loss: {val_loss:.6f}')

        # 🔹 Free cached CUDA memory
        torch.cuda.empty_cache()

    def _predict_with_loss(self, model, loader, device, criterion):
        all_preds = []
        all_labels = []
        total_loss = 0.0
        with torch.no_grad():
            for batch in loader:
                inputs = {key: val.to(device) for key, val in batch.items() if key != 'labels'}
                labels = batch['labels'].to(device).float().unsqueeze(1)  # reshape to (batch_size, 1)

                outputs = model(**inputs)
                loss = criterion(outputs, labels)    # labels are float, as BCEWithLogitsLoss requires
                total_loss += loss.item()

                # 🔹 Apply `sigmoid`, then `round()` to get 0/1 predictions (binary classification)
                preds = torch.sigmoid(outputs).cpu().numpy()
                preds = np.round(preds)  # convert to 0 or 1

                # Flatten to 1-D so the scikit-learn metrics receive plain label vectors
                all_preds.extend(preds.ravel())
                all_labels.extend(labels.cpu().numpy().ravel())

        return np.array(all_preds), np.array(all_labels), total_loss / len(loader)


# 9. Training Model

- For effective *hyperparameter tuning*, I combined `optim.lr_scheduler` (here `ReduceLROnPlateau`) with a predefined list of learning rates that is applied once the patience limit is reached. The scheduler lowers the learning rate when the validation loss stops improving, and the manual list takes over if the rate has become too small to make progress.

- While fine-tuning the BERT model, I found that it performed best when training started from a very low learning rate (the run below uses `lr=0.00017`). I had not needed such small values when fine-tuning other models, which suggests that BERT is sensitive to the learning rate and needs careful tuning.
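
The plateau-based rule described above can be sketched in a few lines of plain Python. This is a simplified model of the idea, not PyTorch's `ReduceLROnPlateau` implementation: it ignores options such as the improvement threshold and cooldown. The function name `plateau_schedule` and the validation losses are made up; `factor=0.3` matches the training code below, while `patience=2` is shortened from the notebook's 5 to keep the example small.

```python
def plateau_schedule(val_losses, lr, patience, factor):
    """Return the learning rate in effect after each epoch.

    The rate is multiplied by `factor` once the validation loss has failed
    to improve for more than `patience` consecutive epochs.
    """
    best = float("inf")
    bad_epochs = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, bad_epochs = loss, 0   # improvement: reset the counter
        else:
            bad_epochs += 1
        if bad_epochs > patience:
            lr *= factor                 # plateau detected: shrink the rate
            bad_epochs = 0
        history.append(lr)
    return history

losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.8]
lrs = plateau_schedule(losses, lr=1.7e-4, patience=2, factor=0.3)
for epoch, (loss, rate) in enumerate(zip(losses, lrs), start=1):
    print(f"epoch {epoch}: val_loss={loss:.2f} lr={rate:.2e}")
```

In this trace the loss fails to improve for three epochs in a row (epochs 3 to 5), so the rate drops from 1.7e-4 to 5.1e-5 at epoch 5 and stays there.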

In [None]:
class DisasterDetector:
    def __init__(self, max_seq_length, lr, epochs, batch_size, patience, model=None):
        self.model = model  # optional externally supplied model
        self.max_seq_length = max_seq_length
        self.lr = lr
        self.epochs = epochs
        self.batch_size = batch_size
        self.patience = patience
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        self.models = []
        self.scores = {}
        # Model used by predict(); stays None until one is supplied, loaded, or trained
        self.best_model = model.to(self.device) if model is not None else None

        # 🔹 Without a supplied model, fall back to a previously saved checkpoint if it exists
        checkpoint_path = "/kaggle/input/bert-v0224/best_model.pth"
        if self.best_model is None:
            if os.path.exists(checkpoint_path):
                try:
                    state_dict = torch.load(checkpoint_path, map_location=self.device)
                    self.best_model = BertClassifier().to(self.device)
                    self.best_model.load_state_dict(state_dict)
                    print("✅ Model loaded successfully from 'best_model.pth'")
                except Exception as e:
                    print(f"⚠️ Model loading failed: {e}")
                    self.best_model = None  # leave unset if loading fails
            else:
                print("⚠️ No trained model found. Train the model first!")
    
    def train(self, df_train):
        skf = StratifiedKFold(n_splits=5, random_state=42, shuffle=True)
        best_accuracy = -np.inf 
        best_f1_score = -np.inf
        best_model_state = None
        patience_counter = 0
        torch.cuda.empty_cache()  # ✅ free cached GPU memory
        
        device = self.device  

        ### 🔹 Train a freshly created model rather than reusing self.best_model
        model = BertClassifier(freeze_bert=True).to(device)  # freeze_bert=True: only the new head is trained
        model = torch.compile(model)  # ✅ compile the model (optional)

        for fold, (trn_idx, val_idx) in enumerate(skf.split(df_train['text'], df_train['target'])):
            print(f'\n.....[Fold {fold}].....\n')
            
            train_dataset = TextDataset(df_train.loc[trn_idx, 'text'].values, df_train.loc[trn_idx, 'target'].values, max_len=self.max_seq_length)  # data is passed as TextDataset(texts, targets, max_len)
            val_dataset = TextDataset(df_train.loc[val_idx, 'text'].values, df_train.loc[val_idx, 'target'].values, max_len=self.max_seq_length)
            train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True) # collate_fn=collate_fn
            val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)  # collate_fn=collate_fn
            print("Train/Val dataset are loaded...")

            optimizer = optim.Adam(model.parameters(), lr=self.lr, weight_decay=1e-5)
            criterion = nn.BCEWithLogitsLoss()
            scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=5, factor=0.3, verbose=True)
            #scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=self.lr, steps_per_epoch=len(train_loader), epochs=self.epochs)

            metrics = ClassificationReport()
            learning_rate = [0.05, 0.02, 0.105, 0.1, 0.2]  # predefined fallback learning rates
            lr_adjustment_count = 0  # how many times the learning rate has been set manually
            best_f1_score = 0
            best_accuracy = 0
            patience_counter = 0

            for epoch in range(self.epochs):
                print(f'Epoch: {epoch+1}')
                model.train()
                
                for batch in train_loader:
                    optimizer.zero_grad()
                    inputs = {"input_ids": batch["input_ids"].to(self.device), "attention_mask": batch["attention_mask"].to(self.device), "token_type_ids": batch["token_type_ids"].to(self.device)}  # ✅ token_type_ids added
                    labels = batch['labels'].to(self.device).unsqueeze(1).float()
                    outputs = model(**inputs)
                    loss = criterion(outputs.view(-1, 1), labels)
                    loss.backward()
                    optimizer.step()

                metrics.on_epoch_end(model, train_loader, val_loader, self.device, criterion)
                val_loss = metrics.val_loss[-1]  # latest validation loss (a scalar)
                val_accuracy = metrics.val_accuracy_scores[-1]
                val_f1 = metrics.val_f1_scores[-1]  # F1 is the better criterion when classes are imbalanced

                scheduler.step(val_loss)  # 🔹 ReduceLROnPlateau: step exactly once per epoch

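                if val_f1 > best_f1_score:  # update the best model
                    best_f1_score = val_f1
                    # Clone the tensors: state_dict() returns references that later updates would overwrite
                    best_model_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
                    print("Best f1 score is updated!")

                if val_accuracy > best_accuracy:  # early-stopping criterion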
                    best_accuracy = val_accuracy
                    patience_counter = 0  
                    print("Best accuracy updated!")
                else:
                    patience_counter += 1  

                if patience_counter >= self.patience:
                    current_lr = optimizer.param_groups[0]['lr']
                    
                    # 🔹 If the scheduler has already shrunk the LR too far, set it manually
                    if current_lr < 1e-6 and lr_adjustment_count < len(learning_rate):
                        new_lr = learning_rate[lr_adjustment_count]  
                        for param_group in optimizer.param_groups:
                            param_group['lr'] = new_lr  # apply the new learning rate
                        patience_counter = 0  # reset patience and try again
                        lr_adjustment_count += 1
                        print(f"Learning rate manually adjusted! New LR: {new_lr}, Attempts left: {len(learning_rate) - lr_adjustment_count}")
                    elif current_lr < 1e-6 and lr_adjustment_count >= len(learning_rate):
                        print("Early stopping triggered based on accuracy & LR tuning exhausted!")
                        break
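        # 🔹 After training, load the best weights into self.best_model
        if best_model_state:
            # torch.compile prefixes state_dict keys with "_orig_mod."; strip it so the keys match a plain BertClassifier
            best_model_state = {k.removeprefix("_orig_mod."): v for k, v in best_model_state.items()}
            self.best_model = BertClassifier().to(self.device)
            self.best_model.load_state_dict(best_model_state)
            torch.save(self.best_model.state_dict(), "best_model.pth")  # 🔹 save the model weights
            print("Best model is saved in 'best_model.pth'")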
           

    def predict(self, df_test):
        if self.best_model is None:
            raise ValueError("❌ No trained model found. Train the model first!")

        self.best_model.eval()  # best_model is guaranteed not to be None here
        test_dataset = TextDataset(df_test['text'].values, None, max_len=self.max_seq_length)
        test_loader = DataLoader(test_dataset, batch_size=self.batch_size, shuffle=False)
        predictions = []

        with torch.no_grad():
            for batch in test_loader:
                # Keep only the inputs the model expects
                inputs = {key: val.to(self.device) for key, val in batch.items() if key in ["input_ids", "attention_mask", "token_type_ids"]}
                outputs = self.best_model(**inputs)

                # Apply sigmoid to the logits, then round to 0/1 (binary classification).
                # view(-1) keeps a 1-D shape even when the last batch holds a single example.
                preds = torch.sigmoid(outputs).view(-1).cpu().numpy()
                preds = np.round(preds)
                predictions.extend(preds.tolist())

        return predictions

# 10. Debugging 

### Debugging with `torch._dynamo.config.suppress_errors = True`
In deep learning model development using PyTorch, debugging runtime errors can be challenging, especially when utilizing `torch.compile()` or other optimization features. The introduction of `torch._dynamo` provides automatic graph capture and optimization for model execution. However, certain edge cases can lead to internal errors, causing execution failures. To mitigate this, `torch._dynamo.config.suppress_errors = True` is often used as a temporary debugging measure.

### Understanding the Error
When using `torch.compile()`, PyTorch attempts to optimize and trace the model execution graph. If an unexpected error occurs during compilation, it may lead to crashes or obscure error messages, making it difficult to identify the root cause. These errors can arise from various sources, including:
- Unsupported Python constructs or dynamic control flows.
- Incompatible third-party libraries.
- Unhandled exceptions in PyTorch’s internal compilation process.
- Graph-breaking operations that prevent optimization.

### Purpose of `torch._dynamo.config.suppress_errors = True`
Setting `torch._dynamo.config.suppress_errors = True` serves the following purposes:
- **Error Suppression:** Instead of crashing, PyTorch falls back gracefully to eager execution mode when an error occurs during compilation.
- **Improved Debugging Workflow:** This allows developers to continue execution without abruptly terminating the program, helping isolate problematic code sections.
- **Automatic Fallback Mechanism:** When an optimization fails, execution proceeds without compilation, ensuring that the model can still run.
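
The fallback behaviour described in these points can be pictured with a small, framework-free sketch. This is a conceptual illustration only, not how PyTorch implements it: `compile_with_fallback`, `broken_compiler` and `double` are hypothetical names, and the `suppress_errors` argument merely mirrors the role of the config flag.

```python
def compile_with_fallback(fn, compiler, suppress_errors=True):
    """Try to compile `fn`; on failure either fall back to `fn` itself or re-raise."""
    try:
        return compiler(fn), None
    except Exception as exc:
        if not suppress_errors:
            raise                 # errors stay visible, execution stops
        return fn, exc            # run the original ("eager") function instead

def broken_compiler(fn):
    """Stand-in for a compiler backend that hits an unsupported construct."""
    raise RuntimeError("unsupported operation")

def double(x):
    return 2 * x

run, error = compile_with_fallback(double, broken_compiler)
print(run(21))        # the uncompiled function still works
print(repr(error))    # the failure is recorded rather than raised
```

The sketch also shows the drawback listed in the next section: unless the returned error is inspected, nothing indicates that the "compiled" function is really the slow path.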

### Implications and Considerations
While this setting is useful for debugging, it is important to note:
- **Errors are hidden:** Since PyTorch suppresses internal compilation errors, developers might not be immediately aware of optimization failures.
- **Potential Performance Degradation:** If compilation fails and the model runs in eager mode, expected speedups from `torch.compile()` will not be realized.
- **Should Not Be Used in Production:** Suppressing errors is primarily a debugging tool and should not be enabled in a production environment where error visibility is critical.

### Recommended Debugging Approach
To effectively debug PyTorch compilation errors:
1. Run the model **without** `torch.compile()` to ensure it functions correctly in eager mode.
2. Set `torch._dynamo.config.suppress_errors = False` to observe specific error messages.
3. If errors are still unclear, enable suppression to continue execution and isolate failing components.
4. Use `torch._dynamo.explain(model)(*example_inputs)` to analyze graph capture behavior.
5. Check for unsupported operations and try alternative model implementations if needed.

### For the next training
Using `torch._dynamo.config.suppress_errors = True` is a valuable debugging technique when working with PyTorch’s compilation features. While it helps prevent crashes and allows execution to proceed, developers should use it cautiously and aim to resolve underlying issues instead of relying on error suppression as a long-term solution. By systematically analyzing compilation failures, models can be optimized for performance while maintaining robustness.



In [None]:
import torch._dynamo
torch._dynamo.config.suppress_errors = True

# 11. Training

In [None]:
detector = DisasterDetector(max_seq_length=169, lr=0.00017, epochs=10, batch_size=32, patience=3)
detector.train(df_train)  # train the model
print("training is completed.")

# 12. Prediction

In [None]:
df_test

In [None]:
# Run prediction with the trained detector instance
predictions = detector.predict(df_test)

print("Predictions >>>>> \n", predictions)

# 13. Submission

In [None]:
# Load the sample submission file

model_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
print("model_submission.head(): ", model_submission.head())  
print(model_submission.columns)  
print("Row counts (submission vs. predictions):   ", len(model_submission), len(predictions))  

In [None]:
# Assign the predictions to the target column, unwrapping any single-element lists
model_submission["target"] = [p[0] if isinstance(p, list) else p for p in predictions]

# Only 0 or 1 is needed -> convert to int
model_submission["target"] = model_submission["target"].astype(int)
print(model_submission.head())  # sanity check

model_submission.to_csv('/kaggle/working/submission.csv', index=False)

## Submission Score

I conducted experiments using two different model architectures to evaluate their impact on performance.

- **The first approach** added a single dropout layer followed by one linear layer.
- **The second approach** used two dropout layers along with three fully connected (dense) layers.
  
After training and evaluating both models, the first approach achieved better accuracy and F1 score. This suggests that a simpler architecture with fewer layers and less regularization performed better, possibly due to reduced overfitting and more efficient learning. 

One regret: while repeating the experiment I did not save the notebook, so the best parameter settings were lost and only the submission score on the leaderboard remains. The screenshot below is from the first approach, which used a single dropout layer followed by one linear layer.

![Image](https://github.com/user-attachments/assets/72b3c6fc-ce7f-4c78-a677-9011a1359c5a)

# 14. Conclusion

- The better performance of the first experiment (with simpler setting as I mentioned initially) suggests that a simpler model was able to generalize better and avoid overfitting. The second experiment, with more complex layers, likely suffered from difficulties in training or overfitting. This highlights an important lesson: increasing model complexity doesn’t always lead to better performance. Choosing the right model complexity and regularization techniques is key to optimizing performance.
  
- Through our exploration of hyperparameter tuning, we gained a deeper understanding of the role that learning rate adjustments play, particularly for the BERT model. In our runs BERT was sensitive to learning rate changes and performed best with a very low learning rate. This underscores the need for careful tuning, as the learning rate strongly influences how well the model adapts across training phases.

- In our experimentation, we also attempted to enhance the BERT architecture by adding additional layers to improve its capacity for learning complex patterns. However, despite all the efforts, the performance metrics did not meet our expectations. This prompted a reevaluation of our approach, leading us to consider RoBERTa as a promising alternative. Known for its robust training methodology and improved performance on various benchmarks, RoBERTa presents an exciting opportunity to further explore the capabilities of transformer-based models.

- As we transition to RoBERTa, we will keep applying the same hyperparameter tuning principles, including dynamic learning rate schedules that adapt to changes in validation performance. This should help us refine the models and get the most out of them on this classification task.

- In summary, our exploration of BERT and the subsequent decision to pivot toward RoBERTa highlights the iterative nature of model development in machine learning. Each step, whether a success or a setback, contributes to our understanding and ultimately guides us toward achieving superior performance in our natural language processing endeavors. As we continue this journey, we are excited about the possibilities that lie ahead with RoBERTa and the insights we will gain through further experimentation and tuning.