## Notebook goal
Investigate whether a RoBERTa-base model can transfer what it learned on English data to Arabic for the task of detecting depression.

## Data
### English
https://drive.google.com/file/d/1KUVHGOP6vEaYAt9usv-BDLccSeP6lCHW/view?usp=share_link

### Arabic
Loaded from a CSV file in the code below (`dep_arabic_20k.csv`).

## Methods
- Load the fine-tuned model from drive. Saad previously fine-tuned a mental RoBERTa model that we can reuse. We can also create new fine-tuned models from the English dataset if needed.
model_path = '/home/chqi/NLP4Health/checkpoint-20094'

- Fine-tune the loaded model on the Arabic depression data.

## Baseline
Our baseline performs the same task, binary classification of depression in Arabic, starting from the base model (`roberta-base`, without the prior English fine-tuning) and training it on the same Arabic data. The base model is loaded using:
base_model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)


In [1]:
# Install the transformers library
!pip install transformers[torch]

# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import os

# Path to the fine-tuned model checkpoint directory
# Ensure this path is correct and the directory exists
model_path = '/home/chqi/NLP4Health/checkpoint-20094'

# Check if the directory exists
if not os.path.isdir(model_path):
    print(f"Error: Model directory not found at {model_path}")
else:
    print(f"Model directory found at {model_path}. Attempting to load model and tokenizer.")
    # Load the model and tokenizer for sequence classification
    try:
        # Load the tokenizer first
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        # Load the model, specifying the number of labels
        model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

        print("Model and tokenizer loaded successfully using Auto classes.")
    except Exception as e:
        print(f"An error occurred while loading the model or tokenizer: {e}")

Defaulting to user installation because normal site-packages is not writeable
Model directory found at /home/chqi/NLP4Health/checkpoint-20094. Attempting to load model and tokenizer.
Model and tokenizer loaded successfully using Auto classes.


In [2]:
import pandas as pd
from transformers import RobertaTokenizer, RobertaForSequenceClassification, Trainer, TrainingArguments
import torch
import os

# Load the dataset
df = pd.read_csv('/home/chqi/NLP4Health/dep_arabic_20k.csv')

# Filter for 'depression' and 'control' classes
df_binary = df[df['condition'].isin(['depression', 'control'])].copy()

# Balance the dataset
depression_count = df_binary[df_binary['condition'] == 'depression'].shape[0]
control_df = df_binary[df_binary['condition'] == 'control']
sampled_control_df = control_df.sample(n=depression_count, random_state=42)

balanced_df = pd.concat([df_binary[df_binary['condition'] == 'depression'], sampled_control_df])

# Shuffle the balanced dataset
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Inspect the balanced dataset
print("Balanced dataset class distribution:")
print(balanced_df['condition'].value_counts())

# Map 'depression' to 1 and 'control' to 0
balanced_df['label'] = balanced_df['condition'].apply(lambda x: 1 if x == 'depression' else 0)

# Handle potential missing values in 'selftext' and 'title'
balanced_df['selftext'] = balanced_df['selftext'].fillna('')
balanced_df['title'] = balanced_df['title'].fillna('')

# Combine 'title' and 'selftext' for classification
balanced_df['text'] = balanced_df['title'] + " " + balanced_df['selftext']


Balanced dataset class distribution:
condition
control       20000
depression    20000
Name: count, dtype: int64


In [3]:
from sklearn.model_selection import train_test_split
# Split the dataset into training and validation sets
train_df, val_df = train_test_split(balanced_df, test_size=0.2, random_state=42)

# Inspect label distribution in train and validation sets
print("\nTraining set label distribution:")
print(train_df['label'].value_counts())
print("\nValidation set label distribution:")
print(val_df['label'].value_counts())


Training set label distribution:
label
1    16003
0    15997
Name: count, dtype: int64

Validation set label distribution:
label
0    4003
1    3997
Name: count, dtype: int64


In [4]:
# The tokenizer loaded in In [1] is reused in the next cell; no need to reload it

# Create a custom dataset class
class DepressionDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

In [5]:
# Tokenize the text data using the tokenizer from the previous cell
train_encodings = tokenizer(list(train_df['text']), truncation=True, padding=True, max_length=512)
val_encodings = tokenizer(list(val_df['text']), truncation=True, padding=True, max_length=512)

# Create dataset objects
train_dataset = DepressionDataset(train_encodings, list(train_df['label']))
val_dataset = DepressionDataset(val_encodings, list(val_df['label']))

print("\nBinary classification setup complete. train_dataset and val_dataset are ready.")
print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")

# `model` is the fine-tuned checkpoint loaded in In [1]; it is reused here as-is.
# Alternative checkpoint (not used in this run):
# model = RobertaForSequenceClassification.from_pretrained('/home/chqi/NLP4Health/ptsd_roberta-base', num_labels=2)


# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=1,              # number of training epochs
    per_device_train_batch_size=96,  # batch size per device during training
    per_device_eval_batch_size=96,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    eval_strategy="epoch"
)

# Create Trainer instance
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

print("\nTrainer instance created. Ready to train the model.")

# Train the model
# trainer.train()

# print("Model training complete.")


Binary classification setup complete. train_dataset and val_dataset are ready.
Training dataset size: 32000
Validation dataset size: 8000


Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.



Trainer instance created. Ready to train the model.


In [6]:
trainer.train()
print("Model training complete.")

wandb: Currently logged in as: chqiu (chqiu-george-washington-university) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
  Expected `list[str]` but got `tuple` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(




Epoch,Training Loss,Validation Loss
1,0.6868,0.682562




Model training complete.
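
As a reference point for the loss table above, a binary classifier that assigns probability 0.5 to each class has a cross-entropy of ln 2 ≈ 0.6931. The sketch below computes that value and the gap to the logged losses. The function name `uniform_cross_entropy` is illustrative and not part of the notebook.

```python
import math


def uniform_cross_entropy(num_classes: int) -> float:
    """Cross-entropy (in nats) of a classifier that gives every class equal probability."""
    return -math.log(1.0 / num_classes)


chance_loss = uniform_cross_entropy(2)

# Losses copied from the training table above (fine-tuned checkpoint, 1 epoch).
logged = {"training": 0.6868, "validation": 0.682562}

print(f"chance-level loss: {chance_loss:.4f}")
for split, loss in logged.items():
    print(f"{split} loss {loss:.4f} is {chance_loss - loss:.4f} below chance")
```

Losses this close to ln 2 suggest that, after one epoch, the model is doing only slightly better than guessing.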


In [7]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(trainer, dataset):
    """
    Evaluates the model and prints accuracy, precision, recall, and F1 score.

    Args:
        trainer: The trained Hugging Face Trainer object.
        dataset: The dataset to evaluate on.
    """
    print("Evaluating model...")
    predictions = trainer.predict(dataset)
    preds = predictions.predictions.argmax(-1)
    labels = predictions.label_ids

    accuracy = accuracy_score(labels, preds)
    precision = precision_score(labels, preds)
    recall = recall_score(labels, preds)
    f1 = f1_score(labels, preds)

    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1 Score: {f1:.4f}")

print("Evaluation function defined.")

Evaluation function defined.


In [8]:
evaluate_model(trainer, val_dataset)

Evaluating model...




  Accuracy: 0.5560
  Precision: 0.5507
  Recall: 0.6047
  F1 Score: 0.5764


# Baseline

In [12]:
# Load the pre-trained RoBERTa base model
base_model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# Create a Trainer instance for the base model using the same training arguments and datasets
base_trainer = Trainer(
    model=base_model,                 # the base RoBERTa model
    args=training_args,               # use the same training arguments as before
    train_dataset=train_dataset,      # use the same training dataset
    eval_dataset=val_dataset          # use the same validation dataset
)

print("\nBase model Trainer instance created. Ready to train the base model.")

# Train the base model
# base_trainer.train()

print("Base model training setup complete.")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.



Base model Trainer instance created. Ready to train the base model.
Base model training setup complete.


In [13]:
base_trainer.train()



Epoch,Training Loss,Validation Loss
1,0.6936,0.692781




TrainOutput(global_step=84, training_loss=0.6966583047594342, metrics={'train_runtime': 121.2931, 'train_samples_per_second': 263.824, 'train_steps_per_second': 0.693, 'total_flos': 8419553771520000.0, 'train_loss': 0.6966583047594342, 'epoch': 1.0})
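
The logged `global_step=84` can be checked against the training arguments. With 32,000 training examples and a per-device batch size of 96, 84 steps per epoch is what four devices would give; the device count is an inference, not something the notebook states. Because `warmup_steps=500` exceeds the total step count, a linear warmup would end the run at a fraction of the base learning rate. The sketch below works through that arithmetic, assuming the `TrainingArguments` default learning rate of 5e-5 (the notebook does not set one) and the default linear warmup schedule. `linear_warmup_lr` is a hypothetical helper, not a library function.

```python
import math


def linear_warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate during a linear warmup from 0 to base_lr, capped at base_lr."""
    return base_lr * min(1.0, step / max(1, warmup_steps))


train_size = 32000      # training set size printed by the notebook
per_device_batch = 96   # per_device_train_batch_size
num_devices = 4         # ASSUMPTION: inferred from global_step=84, not stated
warmup_steps = 500      # from TrainingArguments
base_lr = 5e-5          # ASSUMPTION: library default, not set in the notebook

steps_per_epoch = math.ceil(train_size / (per_device_batch * num_devices))
final_lr = linear_warmup_lr(steps_per_epoch, base_lr, warmup_steps)

print(f"steps per epoch: {steps_per_epoch}")
print(f"approximate learning rate at the last step: {final_lr:.2e}")
print(f"fraction of base learning rate reached: {final_lr / base_lr:.1%}")
```

If this reading is right, lowering `warmup_steps` below the total step count, or training for more epochs, would let the schedule reach the base learning rate.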

In [14]:
evaluate_model(base_trainer, val_dataset)

Evaluating model...


  Accuracy: 0.5059
  Precision: 0.5030
  Recall: 0.9267
  F1 Score: 0.6521
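
To see what sits behind these scores, the confusion-matrix counts can be approximately recovered from the reported precision and recall together with the validation class counts printed earlier (3997 depression, 4003 control). This is only a sketch: the metrics are rounded to four decimals, so the counts are a best estimate, and `confusion_from_metrics` is a hypothetical helper rather than part of the notebook.

```python
def confusion_from_metrics(precision, recall, n_pos, n_neg):
    """Estimate (tp, fp, fn, tn) from rounded precision/recall and class counts."""
    tp = round(recall * n_pos)              # recall = tp / n_pos
    predicted_pos = round(tp / precision)   # precision = tp / (tp + fp)
    fp = predicted_pos - tp
    fn = n_pos - tp
    tn = n_neg - fp
    return tp, fp, fn, tn


# Validation class counts from the split output (label 1 = depression).
n_pos, n_neg = 3997, 4003

# (precision, recall) as reported by evaluate_model for each run.
reported = {
    "fine-tuned checkpoint": (0.5507, 0.6047),
    "roberta-base baseline": (0.5030, 0.9267),
}

results = {}
for name, (precision, recall) in reported.items():
    tp, fp, fn, tn = confusion_from_metrics(precision, recall, n_pos, n_neg)
    total = n_pos + n_neg
    results[name] = (tp, fp, fn, tn)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"accuracy={(tp + tn) / total:.4f} "
          f"predicted-positive rate={(tp + fp) / total:.2%}")
```

Under this reconstruction the baseline labels roughly 92% of validation posts as depression, which would explain its high recall and F1 despite near-chance accuracy.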
