## Scaling of Machine Learning Model

The original dataset for this project consisted of mental health-related Reddit data with approximately 50,000 rows. Due to the relatively small size of this dataset, the initial model was trained, tested, and validated using the entire dataset. For the scaling phase of the project, additional data was collected from Reddit. However, privacy regulations made it impossible to obtain new tagged data reflecting the actual mental health conditions of the individuals who wrote the posts. Instead, Reddit data was sourced with tags based on subreddit classifications.

Two such datasets were combined to evaluate the performance of the model and its underlying system. The original dataset included seven classes representing specific mental health conditions. In contrast, the newly cleaned and combined dataset contains over 500,000 rows and spans four subreddit categories. For clarity, "Data 1" refers to the original dataset, while "Data 2" refers to the newly acquired dataset.



In [1]:
import torch
print("PyTorch version:", torch.__version__)
print("Is CUDA available?:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA device name:", torch.cuda.get_device_name(0))

PyTorch version: 2.4.1
Is CUDA available?: True
CUDA device name: NVIDIA GeForce RTX 4070


In [2]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import ReduceLROnPlateau
from tqdm import tqdm  # For progress bar
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
import re
from time import time
from datasets import Dataset

## Data Load 

In [3]:
data1 = pd.read_csv('data_clean.csv', encoding='latin-1')

In [4]:
data1.shape

(48249, 8)

In [3]:
data2 = pd.read_csv('data2_3combined.csv', encoding='latin-1')

In [4]:
data2.shape

(528959, 3)

## Clean and Preprocess the Data

In [7]:
## Functions needed

In [5]:
# Text cleaning function for BERT
def text_clean_for_bert(text):
    text = re.sub(r'\S+@\S+', '', text)  # remove emails
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove URLs
    text = re.sub(r'\d+', '', text)  # remove numbers
    emoji_pattern = re.compile("[" 
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # geometric shapes extended
                               u"\U0001F800-\U0001F8FF"  # supplemental arrows
                               u"\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
                               u"\U0001FA00-\U0001FA6F"  # chess symbols
                               u"\U0001FA70-\U0001FAFF"  # symbols and pictographs extended
                               u"\U00002702-\U000027B0"  # Dingbats
                               u"\U000024C2-\U0001F251"  # Enclosed characters
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # Regex to remove words with non-ASCII characters
    text = re.sub(r'\b\w*[^\x00-\x7F]+\w*\b', '', text)
    
    return text.strip()

def clean_preprocess_and_split(data):
    # Clean data for BERT
    data.loc[:,'bert_clean'] = data['statement'].apply(text_clean_for_bert)
    
    # Keep only rows whose cleaned text has at least 2 space-separated words; take a copy so
    # the column assignment below does not trigger pandas' SettingWithCopyWarning
    data = data[data['bert_clean'].str.split(' ').apply(lambda x: len(x) >= 2)].copy()

    # Label encoding
    encoder = LabelEncoder()
    data.loc[:,'status_encoded'] = encoder.fit_transform(data['status'])
    
    # Test data split
    X_temp, X_test, y_temp, y_test = train_test_split(data['bert_clean'], data['status_encoded'], test_size=0.15, random_state=42, stratify=data['status_encoded'])
    # Train Validation data split
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.15, random_state=42, stratify=y_temp)

    return X_train, y_train, X_val, y_val, X_test, y_test
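
Because the second split is applied to the 85% of rows that remain after the test split, the validation set is 15% of that remainder rather than 15% of the full dataset. The short calculation below works out the resulting fractions. The helper name `split_fractions` is hypothetical and only illustrates the arithmetic; actual row counts also depend on how many rows the cleaning step drops.

```python
def split_fractions(test_size=0.15, val_size=0.15):
    """Return (train, val, test) fractions for two chained train_test_split calls."""
    remaining = 1.0 - test_size      # share left after the test split
    val = remaining * val_size       # validation is carved out of the remainder
    train = remaining - val          # everything else is used for training
    return train, val, test_size


train, val, test = split_fractions()
print(f"train={train:.4f}, val={val:.4f}, test={test:.4f}")
# train=0.7225, val=0.1275, test=0.1500
```

In other words, roughly 72% of the cleaned rows are used for training, 13% for validation, and 15% for testing.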

In [8]:
# Preprocess and split dataset 1
X_train1, y_train1, X_val1, y_val1, X_test1, y_test1 = clean_preprocess_and_split(data1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data.loc[:,'status_encoded'] = encoder.fit_transform(data['status'])


## Train data

In [16]:
class ModelTrainer:
    def __init__(self, batch_size,num_classes):
        self.batch_size = batch_size
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model_name = 'bert-base-uncased'
        self.num_classes = num_classes
        # Define loss function
        self.loss_fn = torch.nn.CrossEntropyLoss()
    

    # Tokenization function
    def tokenize_data(self, tokenizer, texts, labels, max_len=256):
        start_time = time()
        inputs = tokenizer(
            texts.tolist(),  
            padding=True, 
            truncation=True, 
            max_length=max_len, 
            return_tensors="pt"
            )
        tokenization_time = time() - start_time
        dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], torch.tensor(labels.values, dtype=torch.long))
        return dataset, tokenization_time

    def generate_dataloaders(self, X_train, y_train, X_val, y_val, X_test, y_test):
        # Load tokenizer
        tokenizer = BertTokenizer.from_pretrained(self.model_name)

        # Tokenize train and test sets
        y_train = y_train.astype(int)
        y_val = y_val.astype(int)
        y_test = y_test.astype(int)
        train_dataset, tokenization_time = self.tokenize_data(tokenizer, X_train, y_train)
        val_dataset, _  = self.tokenize_data(tokenizer, X_val, y_val)
        test_dataset, _  = self.tokenize_data(tokenizer, X_test, y_test)

        # Create DataLoaders
        train_loader = DataLoader(train_dataset, batch_size=self.batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=self.batch_size, shuffle=False)  # evaluation does not need shuffling
        test_loader = DataLoader(test_dataset, batch_size=self.batch_size, shuffle=False)

        return train_loader, val_loader, test_loader, tokenization_time

    def train(self,train_loader,val_loader):
        start_time = time()
        # Load the pre-trained BERT model
        num_labels = self.num_classes
        model = BertForSequenceClassification.from_pretrained(self.model_name, num_labels=num_labels)
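        # NOTE: the dropout layers are already built inside from_pretrained, so this assignment
        # does not change them; pass hidden_dropout_prob=0.3 to from_pretrained to apply it
        model.config.hidden_dropout_prob = 0.3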

        # Move model to GPU if available
        model.to(self.device)

        # Set model to training mode
        model.train()

        # Define optimizer with weight decay
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

        # Define learning rate scheduler (ReduceLROnPlateau)
        scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=2, min_lr=1e-6, verbose=True)

        # Instantiate EarlyStopping
        early_stopping = EarlyStopping(patience=2, min_delta=0.001)

        # Lists to store loss and accuracy values for each epoch
        train_losses = []
        val_losses = []
        train_accuracies = []
        val_accuracies = []

        # Fine-tuning loop with early stopping and learning rate scheduling
        num_epochs = 4
        for epoch in range(num_epochs):
            model.train()
            total_loss, total_correct, total_samples = 0, 0, 0

            for input_ids, attention_mask, labels in train_loader:
                input_ids, attention_mask, labels = input_ids.to(self.device), attention_mask.to(self.device), labels.to(self.device)
                optimizer.zero_grad()
                outputs = model(input_ids, attention_mask=attention_mask)
                loss = self.loss_fn(outputs.logits, labels)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
                preds = torch.argmax(outputs.logits, dim=-1)
                total_correct += (preds == labels).sum().item()
                total_samples += labels.size(0)

            avg_train_loss = total_loss / len(train_loader)
            train_accuracy = total_correct / total_samples

            # Append the training loss and accuracy
            train_losses.append(avg_train_loss)
            train_accuracies.append(train_accuracy)

            
            # Evaluate on validation set
            val_loss, val_accuracy, _ = self.evaluate(model, val_loader)
            val_losses.append(val_loss)
            val_accuracies.append(val_accuracy)

            # Early stopping check
            early_stopping(val_loss)
            if early_stopping.early_stop:
                print("Early stopping")
                break

            # Adjust learning rate manually after 2 epochs
            if epoch == 1:
                print("Reducing learning rate to 1e-6")
                for param_group in optimizer.param_groups:
                    param_group['lr'] = 1e-6

            # Adjust learning rate if validation loss plateaus
            scheduler.step(val_loss)

            # Clear GPU memory
            torch.cuda.empty_cache()
        
        training_time = time()-start_time
        return model,training_time, train_losses, train_accuracies,val_losses, val_accuracies

    def evaluate(self, model, data_loader):
        model.eval()  # Set the model to evaluation mode
        total_loss, total_correct, total_samples = 0, 0, 0
        y_pred = []

        with torch.no_grad():
            for input_ids, attention_mask, labels in data_loader:
                input_ids, attention_mask, labels = input_ids.to(self.device), attention_mask.to(self.device), labels.to(self.device)
                outputs = model(input_ids, attention_mask=attention_mask)
                loss = self.loss_fn(outputs.logits, labels)
                total_loss += loss.item()
                preds = torch.argmax(outputs.logits, dim=-1)
                y_pred.extend(preds.cpu().numpy())  # Move to CPU and convert to NumPy
                total_correct += (preds == labels).sum().item()
                total_samples += labels.size(0)

        avg_loss = total_loss / len(data_loader)
        accuracy = total_correct / total_samples
        return avg_loss, accuracy, y_pred

    def process(self, X_train, y_train, X_val, y_val, X_test, y_test):
        train_loader, val_loader, test_loader, tokenization_time = self.generate_dataloaders(X_train, y_train, X_val, y_val, X_test, y_test)
        model, training_time, train_losses, train_accuracies,val_losses, val_accuracies = self.train(train_loader,val_loader)
        avg_loss, accuracy, y_pred = self.evaluate(model, test_loader)
        return {            
            "tokenization_time": tokenization_time,
            "training_time": training_time,
            "accuracy": accuracy,
            "y_pred": y_pred,
            "model": model
        }
                 

In [6]:
# Define EarlyStopping class
class EarlyStopping:
    def __init__(self, patience=2, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best_val_loss = None
        self.counter = 0
        self.early_stop = False

    def __call__(self, val_loss):
        if self.best_val_loss is None:
            self.best_val_loss = val_loss
        elif val_loss < self.best_val_loss - self.min_delta:
            self.best_val_loss = val_loss
            self.counter = 0  # Reset the counter if validation loss improves
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.early_stop = True  # Stop training if patience is exceeded
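
To make the stopping rule concrete, the sketch below replays the same logic on a list of validation losses and reports the epoch at which it would fire. The function name `first_stop_epoch` and the loss values are hypothetical, chosen only for illustration; they are not results from this project.

```python
def first_stop_epoch(val_losses, patience=2, min_delta=0.001):
    """Return the 0-based epoch at which the stopping rule fires, or None."""
    best, counter = None, 0
    for epoch, loss in enumerate(val_losses):
        if best is None:
            best = loss                  # the first epoch only sets the baseline
        elif loss < best - min_delta:
            best, counter = loss, 0      # a sufficient improvement resets the counter
        else:
            counter += 1                 # no sufficient improvement this epoch
            if counter >= patience:
                return epoch
    return None


# Hypothetical validation losses, for illustration only
print(first_stop_epoch([0.60, 0.55, 0.50, 0.48]))      # None: improves every epoch
print(first_stop_epoch([0.60, 0.61, 0.62, 0.50]))      # 2: two epochs without improvement
print(first_stop_epoch([0.60, 0.5996, 0.5994, 0.50]))  # 2: gains below min_delta do not count
```

With `patience=2`, the rule cannot fire before the third epoch (index 2), because the first epoch only records the baseline loss.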

In [12]:
num_classes1 = data1['status'].value_counts().shape[0]

In [14]:
print(f"Number of classes in the first dataset is: {num_classes1}")

Number of classes in the first dataset is: 7


In [18]:
# Instantiate the ModelTrainer class
trainer = ModelTrainer(batch_size=32, num_classes=num_classes1)

# Process the splits
results = trainer.process(X_train1, y_train1, X_val1, y_val1, X_test1, y_test1)

# Access results
y_pred = results["y_pred"]
print("Training Time:", results["training_time"])
print("Tokenization Time:", results["tokenization_time"])
print("Test Accuracy:", results["accuracy"])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  attn_output = torch.nn.functional.scaled_dot_product_attention(


Reducing learning rate to 1e-6
Training Time: 1689.0828523635864
Tokenization Time: 33.55921173095703
Test Accuracy: 0.8250587747199557


In [20]:
# Save entire model
torch.save(results["model"], "model1.pth")

## Data 2 

In [7]:
# Preprocess and split dataset 2
X_train2, y_train2, X_val2, y_val2, X_test2, y_test2 = clean_preprocess_and_split(data2)

In [8]:
num_classes = data2['status'].value_counts().shape[0]

In [9]:
print(f"Number of classes in the second dataset is: {num_classes}")

Number of classes in the second dataset is: 4


In [41]:
# Instantiate the ModelTrainer class
trainer = ModelTrainer(batch_size=32, num_classes=num_classes)

# Process the splits
results2 = trainer.process(X_train2, y_train2, X_val2, y_val2, X_test2, y_test2)

# Access results
y_pred2 = results2["y_pred"]
print("Training Time:", results2["training_time"])
print("Tokenization Time:", results2["tokenization_time"])
print("Test Accuracy:", results2["accuracy"])

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Reducing learning rate to 1e-6
Early stopping
Training Time: 18687.928438186646
Tokenization Time: 523.191891670227
Test Accuracy: 0.8306614236741279


In [40]:
# Save entire model
torch.save(results2["model"], "model2.pth")

In [37]:
"""
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model2 = torch.load("model2.pth")
model2.to(device)
"""


Data 1 contains 48,249 rows and Data 2 contains 528,959 rows, a ratio of approximately 11. Training took 1,689 seconds (approximately 28 minutes) for the first model and 18,688 seconds (approximately 5 hours, 11 minutes) for the second. The ratio of training times is also approximately 11, which matches the ratio of dataset sizes.
Tokenization of the training split took 33.6 seconds for the first model and 523 seconds (approximately 9 minutes) for the second, a ratio of approximately 15.6, which is higher than the size ratio. The test accuracy of the first model is 0.825, while the second model achieves 0.831. Because the two models predict different label sets (seven classes versus four), these accuracies are not directly comparable.

These values suggest that training time scales approximately linearly with dataset size, with no unexpected superlinear increase. One caveat is that the second run triggered early stopping, so the two runs may not have completed the same number of epochs.
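
The ratios quoted in this comparison can be reproduced from the figures printed in the notebook outputs. The snippet below is a small sketch of that arithmetic; the dictionary and function names are illustrative, and the row counts are the raw dataset sizes rather than the sizes of the training splits.

```python
# Figures copied from the notebook outputs (raw row counts and wall-clock seconds)
rows = {"data1": 48_249, "data2": 528_959}
train_seconds = {"data1": 1689.0828523635864, "data2": 18687.928438186646}
tokenize_seconds = {"data1": 33.55921173095703, "data2": 523.191891670227}


def ratio(values):
    """Ratio of the Data 2 figure to the Data 1 figure."""
    return values["data2"] / values["data1"]


size_ratio = ratio(rows)
train_ratio = ratio(train_seconds)
tokenize_ratio = ratio(tokenize_seconds)

print(f"size ratio:         {size_ratio:.2f}")      # 10.96
print(f"training ratio:     {train_ratio:.2f}")     # 11.06
print(f"tokenization ratio: {tokenize_ratio:.2f}")  # 15.59

# Training time per raw dataset row (covers all epochs and validation passes)
ms_per_row = {name: train_seconds[name] / rows[name] * 1000 for name in rows}
for name, value in ms_per_row.items():
    print(f"{name}: {value:.1f} ms per row")
```

Both runs come out at roughly 35 ms of training time per raw row, which is another way of seeing the near-linear scaling.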

Training time could be reduced further by parallelizing across multiple GPUs. However, since only one GPU is currently available, we keep the existing system configuration.