


# **Depression detection using RoBERTa Pre-trained Language Models**




In this notebook, we use the RoBERTa model to classify social media texts into three categories: "Not depressed," "Moderately depressed," and "Severely depressed." We will preprocess the data, define the RoBERTa model, train it on the training dataset, and evaluate its performance on the test dataset.

In [None]:
!pip install transformers datasets


In [None]:
!pip install accelerate -U

In [1]:
import pandas as pd
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import torch

# Load the dataset
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

**Data Preprocessing**  

We preprocess the data by tokenizing the text using the RoBERTa tokenizer, padding and truncating the sequences to a maximum length of 512 tokens.  
**Dataset Preparation**  
We prepare the dataset for the Trainer by creating a custom DepressionDataset class that includes the tokenized encodings and the corresponding labels.

In [2]:
# Preprocess the data
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

def preprocess(data):
    return tokenizer(data['text'].tolist(), padding=True, truncation=True, max_length=512)

train_encodings = preprocess(train_data)
test_encodings = preprocess(test_data)

# Prepare the dataset for the Trainer
class DepressionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = DepressionDataset(train_encodings, train_data['labels'])
test_dataset = DepressionDataset(test_encodings, test_data['labels'])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Model Definition**  
We define the RoBERTa model for sequence classification using the RobertaForSequenceClassification class with the pre-trained 'roberta-base' model and three output labels.

In [3]:
# Define the model
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Training**  

We train the model using the Trainer class with the defined training arguments, including the number of training epochs, batch size, and evaluation steps.

In [4]:

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=lambda p: {'accuracy': accuracy_score(p.label_ids, p.predictions.argmax(-1))}
)

# Train the model
trainer.train()

Step,Training Loss
10,1.1417
20,1.1432
30,1.1139
40,1.0979
50,1.0569
60,1.0062
70,0.9596
80,0.9901
90,0.9728
100,0.9416


TrainOutput(global_step=1128, training_loss=0.7585916586801515, metrics={'train_runtime': 1609.7544, 'train_samples_per_second': 11.193, 'train_steps_per_second': 0.701, 'total_flos': 4740777560623104.0, 'train_loss': 0.7585916586801515, 'epoch': 3.0})

**Evaluation**  

After training, we evaluate the model on the test dataset and print the evaluation results, including the accuracy and a classification report showing precision, recall, and F1-score for each class.

In [5]:



# Evaluate the model
eval_results = trainer.evaluate()
print(f"Accuracy: {eval_results['eval_accuracy']}")
print(classification_report(test_data['labels'], trainer.predict(test_dataset).predictions.argmax(-1)))




Accuracy: 0.6474576271186441
              precision    recall  f1-score   support

           0       0.46      0.56      0.50       228
           1       0.76      0.71      0.74      2169
           2       0.46      0.50      0.48       848

    accuracy                           0.65      3245
   macro avg       0.56      0.59      0.57      3245
weighted avg       0.66      0.65      0.65      3245



The model achieved an overall accuracy of 64.75%, indicating that 64.75% of the predictions made by the model are correct.

Precision:

Class 0 (not depressed): 0.46
Class 1 (moderately depressed): 0.76
Class 2 (severely depressed): 0.46
This indicates that the model is better at predicting class 1 (moderately depressed) compared to the other classes.

Recall:

Class 0: 0.56
Class 1: 0.71
Class 2: 0.50
The model is better at capturing instances of class 1 (moderately depressed) compared to the other classes.

F1-score:

Class 0: 0.50
Class 1: 0.74
Class 2: 0.48
The F1-scores provide a balance between precision and recall, with higher scores indicating better performance.

Support:

Class 0: 228
Class 1: 2169
Class 2: 848
The support values indicate the number of actual occurrences of each class in the test dataset, showing an imbalance in the dataset.

Overall, the model demonstrates moderate performance in detecting depression from social media posts, with higher accuracy and F1-score for class 1 (moderately depressed) compared to the other classes.  

This could be because of the unequal distribution of the data set
In the training dataset we have   

Total Not depressed	650  
Total Moderately Depressed	3101  
Total Severly Depressed	2255  

and in Testing data set we have    
Total Not depressed	228  
Total Moderately Depressed	2169  
Total Severly Depressed	848  

So with a better dataset we might be able to improve the performance of this model.


We provide a function classify_text to classify new social media texts entered by the user. The function tokenizes the input text, makes a prediction using the trained model, and returns the predicted class ("Not depressed," "Moderately depressed," or "Severely depressed").

In [6]:
def classify_text(text):
    # Tokenize the input text
    inputs = tokenizer([text], padding=True, truncation=True, max_length=512, return_tensors="pt")

    # Move the input tensor to the same device as the model
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Make a prediction using the model
    outputs = model(**inputs)
    logits = outputs.logits

    # Get the predicted class
    _, predicted = torch.max(logits, dim=1)

    # Return the predicted class as a string
    if predicted.item() == 2:
        return "Not depressed"
    elif predicted.item() == 1:
        return "Moderately depressed"
    else:
        return "Severely depressed"

In [7]:
# Example usage
user_input = input("Enter a social media text to classify: ")
print(f"The text is classified as: {classify_text(user_input)}")

Enter a social media text to classify: I am Happy
The text is classified as: Not depressed


In [8]:
from sklearn.metrics import confusion_matrix

# Get the actual and predicted labels
y_true = test_data['labels']
y_pred = trainer.predict(test_dataset).predictions.argmax(-1)

# Compute the confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred)

print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[ 127   95    6]
 [ 121 1550  498]
 [  29  395  424]]


References

In [None]:
@inproceedings{10.1007/978-3-031-16364-7_11,
    title={Data Set Creation and Empirical Analysis for Detecting Signs of Depression from Social Media Postings},
    author= {Kayalvizhi, Sampath
    and Thenmozhi, Durairaj},
    editor={Kalinathan, Lekshmi
    and R., Priyadharsini
    and Kanmani, Madheswari
    and S., Manisha},
    booktitle={Computational Intelligence in Data Science},
    year={2022},
    publisher={Springer International Publishing},
    address={Cham},
    pages={136--151},
    isbn={978-3-031-16364-7}
}

The future scope of this work includes webscrapping to get real time data and implementing a similar model in social media to monitor and help at-risk individuals.

The main code ends here. The following is just one model that I found online and used, trying to learn about pipelines.

In [1]:
from transformers import pipeline
predict_task = pipeline(model="mrjunos/depression-reddit-distilroberta-base", task="text-classification")
predict_task("Stop listing your issues here, use forum instead or open ticket.")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/880 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/329M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/351 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/798k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/280 [00:00<?, ?B/s]

[{'label': 'not_depression', 'score': 0.9813856482505798}]

In [3]:
user_input=input("How are you feeling ")
predict_task(user_input)

How are you feeling I am Happy


[{'label': 'not_depression', 'score': 0.8455097079277039}]

In [4]:
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast
from sklearn.metrics import accuracy_score

# Load the pre-trained model and tokenizer
model = RobertaForSequenceClassification.from_pretrained('mrjunos/depression-reddit-distilroberta-base')
tokenizer = RobertaTokenizerFast.from_pretrained('mrjunos/depression-reddit-distilroberta-base')

# Define the text to classify
text = "I am feeling really down today"

# Tokenize the text
inputs = tokenizer(text, return_tensors='pt')

# Make a prediction
outputs = model(**inputs)
logits = outputs.logits

# Get the predicted class
_, predicted = torch.max(logits, dim=1)

# Print the predicted class and the corresponding label
print("Predicted class:", model.config.id2label[predicted.item()])

# Evaluate the model on a sample dataset
sample_dataset = [
    {"text": "I am feeling really down today"},
    {"text": "I am feeling great today"},
    {"text": "I am feeling so-so today"}
]

# Tokenize the dataset
inputs = tokenizer(
    [x["text"] for x in sample_dataset],
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=128
)

# Make predictions
outputs = model(**inputs)
logits = outputs.logits

# Get the predicted labels
_, predicted = torch.max(logits, dim=1)

# Get the true labels
true_labels = torch.tensor([0, 1, 0])

# Print the accuracy
print("Accuracy:", accuracy_score(true_labels.numpy(), predicted.numpy()))

Predicted class: not_depression
Accuracy: 0.6666666666666666
