## What are Transformers?


https://www.youtube.com/watch?v=4Bdc55j80l8&t=414s&pp=ygUYdHJhbnNmb3JtZXIgYXJjaGl0ZWN0dXJl


### **1. Transformer Architecture: Encoder and Decoder**

The **Transformer architecture**, introduced in the paper "***Attention is All You Need***" by Vaswani et al. (2017), marked a significant breakthrough in natural language processing (NLP). Unlike traditional RNNs, which process sequences word-by-word, the Transformer processes all tokens in parallel, using a mechanism called **Self-Attention**.

### **1.1. High-Level Architecture**
- The Transformer consists of **two main components**:
  1. **Encoder**: Processes the input text and learns its representation.
  2. **Decoder**: Uses the encoded representation to generate the output text.

Each component is made up of **layers**, and the original Transformer uses **6 layers** for both the Encoder and the Decoder. Let’s break down these parts in detail:

---

### **1.2. Encoder Component**

The **Encoder** is responsible for reading the input sequence and converting it into a high-dimensional representation that captures the relationships and meanings of the words.

#### **Structure of the Encoder**
Each **Encoder layer** consists of:
1. **Self-Attention Mechanism**
2. **Feed-Forward Neural Network**
3. **Layer Normalization**
4. **Residual Connections**

**1.2.1. Self-Attention Mechanism**
- **Self-Attention** allows each word in the input sequence to focus on other words to understand their relationships.
- For example, in the sentence **"The cat chased the mouse,"** the word **"cat"** can learn that **"chased"** and **"mouse"** are related to it, even though they are not immediately next to it.
  
**How Self-Attention Works**:
- **Step 1: Create Queries, Keys, and Values**:
  - Every word in the input sequence is transformed into **three vectors**: a **Query (Q), Key (K),** and **Value (V)**.
  - These vectors are created by multiplying the input embeddings with weight matrices.
  - The idea is that the **Query** represents the current word, **Keys** represent all other words, and **Values** hold the information from those words.
- **Step 2: Calculate Attention Scores**:
  - The attention score for a word is calculated as the **dot product** of the Query vector of the word with the Key vectors of all other words. This tells the model how much focus the current word should give to other words.
  - The higher the dot product, the more attention that word gets. For instance, **"cat"** might have high scores for **"chased"** and **"mouse."**
- **Step 3: Softmax and Weight the Values**:
  - Apply **softmax** to convert scores into probabilities (attention weights), ensuring they sum to 1.
  - Multiply each Value vector by its attention weight to produce the final output. This allows the model to aggregate information from relevant words.

**1.2.2. Multi-Head Attention**
- **Multi-Head Attention** means the self-attention mechanism is run **multiple times** in parallel, each time with different learned weight matrices.
- This allows the model to focus on **different aspects of relationships** (e.g., subject-object relationships, verb tense, etc.). The results from all heads are combined, enabling the model to capture a richer set of features.

**1.2.3. Feed-Forward Neural Network (FFN)**
- After self-attention, the output is passed through a **Feed-Forward Neural Network**.
- This is a simple fully connected network applied independently to each position, introducing non-linearity and learning more complex patterns.

**1.2.4. Layer Normalization and Residual Connections**
- **Layer Normalization** is applied after each sub-layer (Self-Attention and FFN) to stabilize the training process.
- **Residual Connections** are added to enable the model to learn faster by ensuring that gradient signals do not vanish. This means the output of each sub-layer is added to its input before passing it to the next sub-layer.

---

### **1.3. Positional Encoding**
- Since Transformers process all words simultaneously, they need a way to understand the **order** of words (e.g., "dog bites man" vs. "man bites dog").
- **Positional Encoding** adds information about each word’s position in the sequence. This encoding is added to the input embeddings, enabling the model to understand which word comes first, second, and so on.

---

### **1.4. Decoder Component**

The **Decoder** is responsible for generating the output sequence based on the information provided by the Encoder.

#### **Structure of the Decoder**
Each **Decoder layer** consists of:
1. **Masked Self-Attention**
2. **Encoder-Decoder Attention**
3. **Feed-Forward Neural Network**
4. **Layer Normalization**
5. **Residual Connections**

**1.4.1. Masked Self-Attention**
- **Masked Self-Attention** ensures that while generating a word, the Decoder can only look at **previously generated words** and not future ones. This is achieved by masking out future words during the training phase.

**1.4.2. Encoder-Decoder Attention**
- **Encoder-Decoder Attention** allows the Decoder to focus on relevant parts of the input sequence (from the Encoder). It is similar to the Encoder’s self-attention, but it uses the **output of the Encoder** as Keys and Values and the Decoder’s input as Queries.
- For example, while generating the word **"ran"** in the output, the model can attend to **"dog"** and **"park"** in the input, ensuring it understands that **"dog ran to the park."**

**1.4.3. Feed-Forward Neural Network (FFN)**
- Like the Encoder, each Decoder layer has a Feed-Forward Neural Network that learns complex transformations.

---

### **1.5. Putting It All Together**
The Transformer architecture can be summarized as:
- **Encoder**: Reads the entire input sequence and transforms it into a representation that captures the relationships between words using Self-Attention and FFNs.
- **Decoder**: Uses the encoded representation and **masked** self-attention to generate the output sequence one word at a time, focusing on relevant parts of the input as needed.

### **Key Strengths of Transformers**
1. **Parallel Processing**: Unlike RNNs, Transformers can process all words in a sequence simultaneously, leading to faster training times.
2. **Global Context**: Self-Attention allows the model to understand relationships between words, regardless of how far apart they are, enabling it to capture more nuanced patterns in language.
3. **Scalability**: Because Transformers can be parallelized, they scale efficiently, allowing models to be trained on massive datasets.

---


## BERT Architecture

### **2. BERT: Bidirectional Encoder Representations from Transformers**

**BERT** is a model introduced by Google in 2018, and it is one of the most popular Transformer-based models used for a wide range of NLP tasks, including **question answering, sentiment analysis, and named entity recognition**. Unlike the original Transformer architecture, which has both an Encoder and a Decoder, **BERT uses only the Encoder** part of the Transformer.

### **2.1 What Makes BERT Unique?**
1. **Bidirectional Context Understanding**:
   - Traditional models, including early Transformer-based models, often read text in a left-to-right or right-to-left manner. This means they understand each word by considering only the preceding or succeeding words.
   - **BERT is bidirectional**, meaning it reads the entire sequence of words simultaneously and understands each word based on its **full context**. This allows BERT to capture richer relationships between words because it considers **both** the words before and after a given word.

2. **Pre-Training and Fine-Tuning**:
   - **Pre-Training**: BERT is pre-trained on massive datasets (like Wikipedia and BooksCorpus) using two tasks:
     1. **Masked Language Modeling (MLM)**: BERT randomly masks (hides) some words in the input sentence and learns to predict those masked words based on the surrounding words. For example, in **"The cat sat on the [MASK],"** BERT would learn to predict **"mat."** This helps it understand context.
     2. **Next Sentence Prediction (NSP)**: BERT also learns the relationship between two sentences by predicting if the second sentence logically follows the first. This helps in tasks that require understanding the connection between sentences, like **question answering.**
   - **Fine-Tuning**: After pre-training, BERT can be fine-tuned on specific tasks. This means it can be adapted to perform tasks like sentiment analysis, text classification, or question answering with just a small amount of task-specific data.

### **2.2 BERT Architecture**
BERT uses the **Encoder** part of the Transformer architecture, which we've already discussed. However, it has some specifics:

1. **Multiple Encoder Layers**:
   - The original BERT model has **12 layers** (BERT-Base) or **24 layers** (BERT-Large) of Encoders. Each layer is essentially the same as what we described earlier, with **Self-Attention** and **Feed-Forward Neural Networks.**
   - Each layer can process the entire input sequence simultaneously, and each word can attend to every other word using Self-Attention.

2. **Positional Embeddings**:
   - Since BERT does not process input sequentially, it needs **Positional Embeddings** to know the order of words. These embeddings are added to the input word embeddings to provide information about the position of each word in the sentence.

3. **Input Representations in BERT**:
   - BERT’s input representation is **very detailed**. Each input consists of three embeddings:
     1. **Token Embeddings**: The standard word embeddings (like we see in Word2Vec or FastText).
     2. **Segment Embeddings**: Used to distinguish between two sentences in the same input (important for Next Sentence Prediction tasks).
     3. **Positional Embeddings**: Provide information about the position of each word in the input sequence.
   - The final input representation for each token is the **sum of these three embeddings.**

### **2.3 Example: BERT Embeddings**
Suppose we have the input sentence:
> "The cat sat on the mat."

**Input Processing**:
1. **Token Embeddings**: BERT will convert each word into a dense vector (e.g., "The" -> [0.25, 0.45, ...], "cat" -> [0.33, 0.67, ...]).
2. **Positional Embeddings**: Each token gets an additional embedding based on its position in the sentence (e.g., "The" gets the position 0 embedding, "cat" gets position 1).
3. **Segment Embeddings**: If this were part of a pair of sentences, we could assign different embeddings to tokens from different sentences (e.g., "Sentence A" vs. "Sentence B").

**Combined Representation**:
- For the word **"cat,"** the final input embedding will be the sum of **Token + Positional + Segment** embeddings.

### **2.4 Why BERT is Effective for Complex Contexts**
1. **Bidirectional Self-Attention**:
   - Because BERT is **bidirectional**, it can understand context based on both preceding and succeeding words. This is crucial for understanding complex sentence structures and nuanced meanings.
   - For example, BERT can distinguish between **"bank"** in "river bank" and "financial bank" because it reads the whole sentence.

2. **Contextual Word Embeddings**:
   - Unlike traditional word embeddings that are static (e.g., Word2Vec always has the same vector for "bank"), BERT generates **dynamic, context-aware embeddings**. The embedding for "bank" will be different depending on whether it's in the context of "money" or "river."
   - This makes BERT very powerful for tasks where subtle differences in meaning are important.

### **2.5 Use Cases of BERT**
1. **Text Classification**: Sentiment analysis, spam detection, topic categorization.
2. **Question Answering**: Given a passage and a question, BERT can find the answer in the passage.
3. **Named Entity Recognition (NER)**: Identifying entities like names, places, and dates in text.
4. **Text Summarization**: Extracting key information from long texts.

---


## GPT Architecture

Great! Now let's move on to **GPT-based models** and understand how they utilize the **Decoder** part of the Transformer architecture, focusing on how they differ from BERT.

### **3. GPT-Based Models (Generative Pre-trained Transformer)**

**GPT (Generative Pre-trained Transformer)** models, introduced by OpenAI, are designed primarily for **text generation** tasks. Unlike BERT, which uses the Encoder from the Transformer architecture to understand and process input, **GPT relies on the Decoder** to generate coherent and contextually appropriate text.

### **3.1 Key Differences Between BERT and GPT**
1. **Architecture Usage**:
   - **BERT**: Uses the **Encoder** part of the Transformer. It is mainly focused on understanding language, making it effective for tasks like classification, question answering, and named entity recognition (NER).
   - **GPT**: Uses the **Decoder** part of the Transformer. It is primarily designed for **generative tasks** like text completion, conversation, and creative writing.

2. **Directionality**:
   - **BERT is Bidirectional**: It reads the entire sequence of words simultaneously, considering the context from both left and right (past and future words). This helps it understand complex dependencies.
   - **GPT is Unidirectional**: It reads the text from **left to right**. This means GPT generates text by predicting the next word in the sequence based only on the words it has already seen (left context). This makes it ideal for generative tasks where the model needs to create coherent sequences word by word.

3. **Training Objective**:
   - **BERT**: Pre-trained on two tasks—**Masked Language Modeling (MLM)** (predicting masked words in a sentence) and **Next Sentence Prediction (NSP)** (understanding sentence relationships).
   - **GPT**: Pre-trained using a **causal language modeling (CLM)** objective, where it learns to predict the next word in a sequence. This trains it to generate fluent, coherent text that follows naturally from the context.

### **3.2 How GPT Works**
**The GPT model architecture** is a stack of **Decoder layers** from the Transformer, each layer containing:
1. **Masked Self-Attention**:
   - Similar to the Decoder in the original Transformer, GPT uses **Masked Self-Attention**, meaning that while predicting a word, it only considers the previous words (left-to-right context). This mask prevents it from seeing future words, ensuring that it generates text in a coherent, sequential manner.
2. **Feed-Forward Neural Network (FFN)**:
   - After the attention mechanism, the outputs are passed through an FFN, which helps the model learn complex transformations and patterns.
3. **Residual Connections and Layer Normalization**:
   - Like in the Encoder, each layer in GPT includes residual connections and layer normalization to stabilize the training process.

### **3.3 Example: Text Generation with GPT**
Let’s look at how GPT might generate text:
Suppose we have the prompt:
> "The sun was setting over the horizon, casting a warm glow over the..."

1. **Step 1: Input Processing**:
   - The input ("The sun was setting over the horizon, casting a warm glow over the") is tokenized and passed through the GPT model.
2. **Step 2: Masked Self-Attention**:
   - GPT reads each word from left to right and learns which words should come next based on the attention mechanism. For instance, it might focus on **"setting"** and **"horizon"** to understand that the next word could be something related to **"sky"** or **"landscape."**
3. **Step 3: Word Generation**:
   - The model predicts the next word based on the previous context, such as **"ocean."** It then appends **"ocean"** to the input and repeats the process to generate the following word. This continues until it generates a complete sentence or paragraph.

### **3.4 GPT Variants and Evolution**
1. **GPT-1**: The original model that introduced the concept of using the Decoder architecture for generative tasks. It was a proof-of-concept that demonstrated how effective pre-training could be for language generation.
2. **GPT-2**: A larger and more powerful version of GPT-1. It could generate coherent and contextually appropriate paragraphs of text, leading to impressive (and sometimes surprising) results. GPT-2 was **pre-trained on 1.5 billion parameters**.
3. **GPT-3**: The most famous version, capable of performing tasks like **translation, summarization, code generation, and conversation**. It has **175 billion parameters**, making it significantly more powerful than its predecessors. GPT-3 can perform tasks it hasn’t been explicitly trained for through a technique called **few-shot learning**, where it learns from just a few examples provided at inference time.

### **3.5 Use Cases of GPT Models**
1. **Text Completion**: Given a prompt, GPT can generate a continuation of the text, making it useful for creative writing, marketing copy, and storytelling.
2. **Conversation/Chatbots**: GPT can maintain context over a conversation, making it effective for creating more natural chatbots.
3. **Content Creation**: Writing articles, summaries, and even poetry.
4. **Code Generation**: Generating code snippets based on a description (like GitHub Copilot).
5. **Translation**: Translating text from one language to another.

### **3.6 Limitations of GPT Models**
1. **Lack of True Understanding**: While GPT can generate coherent text, it doesn’t truly understand the meaning. It’s pattern-based and can sometimes produce **misleading or nonsensical** content if the input is ambiguous.
2. **Bias and Ethical Concerns**: Since GPT is trained on vast amounts of internet data, it can inadvertently learn and reproduce **biases** present in that data. This has led to discussions about the ethical implications of deploying such models.
3. **No Bidirectional Context**: Unlike BERT, which can consider both past and future context, GPT only reads left-to-right, which might limit its performance in tasks that require a deeper understanding of the entire sentence.

---


## Fine-Tune BERT

In [1]:
# Import necessary libraries
# wandb api-key (to be pasted below): 77d5e6061e08428a2172dead53787acbd14bfe6b
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset, random_split
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from transformers import logging
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Step 1: Preprocess the Dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Step 2: Load and Preprocess the Dataset
file_path = 'data_mental_health.csv'
df = pd.read_csv(file_path)

# Drop unnecessary columns if any (e.g., index column)
df_cleaned = df.drop(columns=['Unnamed: 0'], errors='ignore')

# Convert target labels to numeric
df_cleaned['class'] = df_cleaned['class'].apply(lambda x: 1 if x == 'suicide' else 0)

# Tokenizer for BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Prepare Dataset
X_train, X_test, y_train, y_test = train_test_split(df_cleaned['text'], df_cleaned['class'], test_size=0.2, random_state=42)
train_dataset = TextDataset(X_train.tolist(), y_train.tolist(), tokenizer, max_len=128)
test_dataset = TextDataset(X_test.tolist(), y_test.tolist(), tokenizer, max_len=128)

# Step 3: Load Pre-Trained BERT Model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Define Evaluation Metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Step 4: Fine-Tune the Model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,  # Keep this low for notebook runtime
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Train the Model
trainer.train()

# Evaluate the Model
eval_result = trainer.evaluate()
print(f"Evaluation Result: {eval_result}")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.036,0.259483,0.93,0.928862,0.906746,0.952083
2,0.1117,0.22743,0.949,0.94715,0.942268,0.952083


Evaluation Result: {'eval_loss': 0.22743020951747894, 'eval_accuracy': 0.949, 'eval_f1': 0.9471502590673575, 'eval_precision': 0.9422680412371134, 'eval_recall': 0.9520833333333333, 'eval_runtime': 9.0137, 'eval_samples_per_second': 110.942, 'eval_steps_per_second': 13.868, 'epoch': 2.0}


RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

In [2]:
def predict_text(text, tokenizer, model, max_len=128):
    # Check the device of the model (it will be on GPU if available)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # Tokenize and prepare input
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_len,
        return_attention_mask=True,
        return_tensors='pt',
        padding='max_length',
        truncation=True,
    )
    input_ids = encoding['input_ids'].to(device)  # Move input to the same device
    attention_mask = encoding['attention_mask'].to(device)  # Move input to the same device

    # Predict using the model
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_class = torch.argmax(logits, dim=1).item()

    return "Suicide" if predicted_class == 1 else "Non-Suicide"


In [3]:
tricky_text = "Lately, I’ve been struggling a lot. There are days when I feel completely overwhelmed, like everything is crashing down around me, and I just want to escape. But then there are moments where I think maybe things could get better, that I might find a way through this. I’ve been trying to reach out to friends, and they’ve been supportive, but it’s hard to explain what I’m going through. Some days are okay, but other days, the darkness just feels too heavy to bear. I wish I could see a light at the end of the tunnel, but it’s not always there. I just don’t know what to do anymore."

normal_text = "I’ve been dealing with depression for a few years now, and it hasn’t been easy. There were times when I felt like giving up, but I found strength in seeking help. Therapy and talking to friends really made a difference. Now, I’m in a much better place, and I want to use my experience to support others who might be going through something similar. Mental health is so important, and I believe we need to talk about it openly. If sharing my story can encourage even one person to seek help, then it’s worth it."

print("Tricky Example Prediction:", predict_text(tricky_text, tokenizer, model))
print("Normal Example Prediction:", predict_text(normal_text, tokenizer, model))

Tricky Example Prediction: Suicide
Normal Example Prediction: Suicide


In [24]:
# Import necessary libraries
import pandas as pd
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Step 1: Preprocess the Dataset
class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.samples = []
        self.tokenizer = tokenizer
        self.max_len = max_len

        for text, label in zip(texts, labels):
            # Tokenize the text and get token IDs
            tokenized_text = self.tokenizer.encode(text, add_special_tokens=True)
            # Split the tokenized text into chunks
            for i in range(0, len(tokenized_text), self.max_len - 2):  # Reserve space for special tokens
                chunk = tokenized_text[i:i + self.max_len - 2]
                # Add special tokens
                chunk = [self.tokenizer.cls_token_id] + chunk + [self.tokenizer.sep_token_id]
                # Pad if necessary
                padding_length = self.max_len - len(chunk)
                attention_mask = [1] * len(chunk) + [0] * padding_length
                chunk += [self.tokenizer.pad_token_id] * padding_length
                self.samples.append({
                    'input_ids': torch.tensor(chunk, dtype=torch.long),
                    'attention_mask': torch.tensor(attention_mask, dtype=torch.long),
                    'labels': torch.tensor(label, dtype=torch.long)
                })

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Step 2: Load and Preprocess the Dataset
file_path = 'data_mental_health.csv'
df = pd.read_csv(file_path)

# Drop unnecessary columns if any (e.g., index column)
df_cleaned = df.drop(columns=['Unnamed: 0'], errors='ignore')

# Convert target labels to numeric
df_cleaned['class'] = df_cleaned['class'].apply(lambda x: 1 if x == 'suicide' else 0)

# Tokenizer for BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Prepare Dataset
X_train, X_test, y_train, y_test = train_test_split(
    df_cleaned['text'], df_cleaned['class'], test_size=0.2, random_state=42
)

# Set max sequence length to 512
max_len = 512

# Update dataset creation
train_dataset = TextDataset(X_train.tolist(), y_train.tolist(), tokenizer, max_len=max_len)
test_dataset = TextDataset(X_test.tolist(), y_test.tolist(), tokenizer, max_len=max_len)


# Step 3: Compute Class Weights
class_weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y_train),
    y=y_train
)
class_weights = torch.tensor(class_weights, dtype=torch.float)

# Device Configuration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
class_weights = class_weights.to(device)

# Step 4: Load Pre-Trained BERT Model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)

# Define Evaluation Metrics
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='binary', zero_division=0
    )
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Custom Trainer to include class weights
class WeightedTrainer(Trainer):
    def __init__(self, class_weights, *args, **kwargs):
        super(WeightedTrainer, self).__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels").to(device)
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss
        loss_fct = torch.nn.CrossEntropyLoss(weight=self.class_weights)
        loss = loss_fct(
            logits.view(-1, self.model.config.num_labels),
            labels.view(-1)
        )
        return (loss, outputs) if return_outputs else loss

# Step 5: Fine-Tune the Model
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",
    seed=42  # Set seed for reproducibility
)

trainer = WeightedTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
    class_weights=class_weights
)

# Train the Model
trainer.train()

# Evaluate the Model
eval_result = trainer.evaluate()
print(f"Evaluation Result: {eval_result}")

def predict_text(text, tokenizer, model, max_len=128):
    # Tokenize the text and get token IDs
    tokenized_text = tokenizer.encode(text, add_special_tokens=True)
    logits_list = []

    # Split into chunks
    for i in range(0, len(tokenized_text), max_len - 2):  # Reserve space for special tokens
        chunk = tokenized_text[i:i + max_len - 2]
        # Add special tokens
        chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        # Pad if necessary
        padding_length = max_len - len(chunk)
        attention_mask = [1] * len(chunk) + [0] * padding_length
        chunk += [tokenizer.pad_token_id] * padding_length

        input_ids = torch.tensor([chunk], dtype=torch.long).to(device)
        attention_mask = torch.tensor([attention_mask], dtype=torch.long).to(device)

        # Predict using the model
        model.eval()
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            logits_list.append(logits)

    # Aggregate logits
    aggregated_logits = torch.sum(torch.stack(logits_list), dim=0)
    predicted_class = torch.argmax(aggregated_logits, dim=1).item()

    return "Suicide" if predicted_class == 1 else "Non-Suicide"


# Test Texts
tricky_text = "Lately, I’ve been struggling a lot. There are days when I feel completely overwhelmed, like everything is crashing down around me, and I just want to escape. But then there are moments where I think maybe things could get better, that I might find a way through this. I’ve been trying to reach out to friends, and they’ve been supportive, but it’s hard to explain what I’m going through. Some days are okay, but other days, the darkness just feels too heavy to bear. I wish I could see a light at the end of the tunnel, but it’s not always there. I just don’t know what to do anymore."

normal_text = "I’ve been dealing with depression for a few years now, and it hasn’t been easy. There were times when I felt like giving up, but I found strength in seeking help. Therapy and talking to friends really made a difference. Now, I’m in a much better place, and I want to use my experience to support others who might be going through something similar. Mental health is so important, and I believe we need to talk about it openly. If sharing my story can encourage even one person to seek help, then it’s worth it."

# Predictions
print("Tricky Example Prediction:", predict_text(tricky_text, tokenizer, model))
print("Normal Example Prediction:", predict_text(normal_text, tokenizer, model))


Token indices sequence length is longer than the specified maximum sequence length for this model (573 > 512). Running this sequence through the model will result in indexing errors
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,0.3706,0.184729,0.934579,0.931774,0.983539,0.885185
2,0.0613,0.232448,0.943925,0.944444,0.944444,0.944444
3,0.0005,0.208562,0.952336,0.953082,0.946984,0.959259


Evaluation Result: {'eval_loss': 0.20856234431266785, 'eval_accuracy': 0.9523364485981308, 'eval_f1': 0.953081876724931, 'eval_precision': 0.946983546617916, 'eval_recall': 0.9592592592592593, 'eval_runtime': 31.5315, 'eval_samples_per_second': 33.934, 'eval_steps_per_second': 4.25, 'epoch': 3.0}
Tricky Example Prediction: Suicide
Normal Example Prediction: Suicide


In [36]:
def predict_text(text, tokenizer, model, max_len=128):
    # Tokenize the text and get token IDs
    tokenized_text = tokenizer.encode(text, add_special_tokens=True)
    logits_list = []

    # Split into chunks
    for i in range(0, len(tokenized_text), max_len - 2):  # Reserve space for special tokens
        chunk = tokenized_text[i:i + max_len - 2]
        # Add special tokens
        chunk = [tokenizer.cls_token_id] + chunk + [tokenizer.sep_token_id]
        # Pad if necessary
        padding_length = max_len - len(chunk)
        attention_mask = [1] * len(chunk) + [0] * padding_length
        chunk += [tokenizer.pad_token_id] * padding_length

        input_ids = torch.tensor([chunk], dtype=torch.long).to(device)
        attention_mask = torch.tensor([attention_mask], dtype=torch.long).to(device)

        # Predict using the model
        model.eval()
        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            logits = outputs.logits
            logits_list.append(logits)

    # Aggregate logits
    aggregated_logits = torch.sum(torch.stack(logits_list), dim=0)
    predicted_class = torch.argmax(aggregated_logits, dim=1).item()

    return "Suicide" if predicted_class == 1 else "Non-Suicide"


# Test Texts
tricky_text = "Lately, I’ve been struggling a lot. There are days when I feel completely overwhelmed, like everything is crashing down around me, and I just want to escape. But then there are moments where I think maybe things could get better, that I might find a way through this. I’ve been trying to reach out to friends, and they’ve been supportive, but it’s hard to explain what I’m going through. Some days are okay, but other days, the darkness just feels too heavy to bear. I wish I could see a light at the end of the tunnel, but it’s not always there. I just don’t know what to do anymore."

normal_text = "I want to use my experience to support others who might be going through something similar, and I believe we need to talk about it openly. If sharing my story can encourage even one person to seek help, then it’s worth it."

# Predictions
print("Tricky Example Prediction:", predict_text(tricky_text, tokenizer, model))
print("Normal Example Prediction:", predict_text(normal_text, tokenizer, model))

Tricky Example Prediction: Suicide
Normal Example Prediction: Suicide
