The purpose of this notebook is to test various models in search of the best-performing machine learning approach for sorting scholarly articles into the BioPsychoSocial categories, as explained and refined by Dr Karl Maier. The overall goal is to separate the articles into their somewhat distinct categories so that related content in the academic field is easier to find. The models in this notebook are provided through Hugging Face, which hosts a library of models with a standardized way of interacting with each.

# Zero-Shot Classification Attempt


- Begin by running a pip install to gather the required libraries if they are not already installed.

In [None]:
!pip install transformers pandas

- The following piece of code was an initial test, used just to see the Hugging Face pipeline in action. Its accuracy was very poor compared to what the real answer should be.

In [None]:
from transformers import pipeline

# Load zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Text to classify
text = "This study examines the impact of climate change on marine biodiversity."

# Labels with descriptions
labels = [
    "Biophysical: Related to the physical and biological aspects of the environment and living organisms.",
    "Social: Pertaining to human society, interactions, and social structures.",
    "Psychological: Concerned with mental processes and behavior."
]

# Perform classification
result = classifier(text, candidate_labels=labels, multi_label=False)

# Output results
print(result)
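
- One thing worth keeping in mind with long label descriptions: as far as I understand the zero-shot pipeline, each candidate label is dropped into a hypothesis template (the default is, to my knowledge, "This example is {}.") and the model then judges whether the text entails that sentence. The sketch below only mimics that string formatting in plain Python so the resulting hypotheses can be read; build_hypotheses and the alternative template are hypothetical and are not part of the library.

```python
# Default template used by the zero-shot pipeline, to the best of my knowledge
DEFAULT_TEMPLATE = "This example is {}."


def build_hypotheses(candidate_labels, template=DEFAULT_TEMPLATE):
    """Return the hypothesis sentence that would be built for each label."""
    return [template.format(label) for label in candidate_labels]


example_labels = [
    "Biophysical: Related to the physical and biological aspects of the environment and living organisms.",
    "Biophysical",
]
for hypothesis in build_hypotheses(example_labels):
    print(hypothesis)

# A hypothetical alternative template; whether it helps would need testing
alternative = build_hypotheses(
    ["Biophysical"], template="This article is about {} factors."
)
print(alternative[0])
```

- If the wording turns out to matter, the pipeline call accepts a hypothesis_template argument alongside candidate_labels, so a different template can be tried without changing the labels themselves.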

- From there, the data I wanted to test the model on needed to be uploaded, hence the files.upload() call found directly after this. The articles used for testing come from the Zotero directory created by Dr Karl Maier, which was run through some Python data prep beforehand to ensure no bad values were included.

In [None]:
from google.colab import files
files.upload()

- Next, we run the first row of the CSV through the model and compare the model's prediction to the category the article actually belongs to. For this test, it was quite inaccurate.

In [None]:
import pandas as pd
from transformers import pipeline

# Step 2: Load the CSV file
file_path = "processed_articles_sanitized.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Extract the first row of data after the header
first_row = data.iloc[0]
title = first_row['title']
abstract = first_row['abstract']
true_category = first_row['category']

# Combine title and abstract for classification with improved formatting
text_to_classify = f"Title - {title}. Abstract - {abstract}"

# Step 3: Define labels with descriptions
labels = [
    "Biophysical - Involves the physical systems and processes of the body, such as genetics, cellular functions, brain activity, and how they interact with environmental factors to influence health, behavior, and overall functioning.",
    "Social - Encompasses the influence of relationships, cultural norms, socioeconomic structures, and societal dynamics on individual and collective well-being, focusing on the interconnectedness of people within communities and broader social systems.",
    "Psychological - Refers to the mental and emotional processes that shape perception, decision-making, resilience, and responses to challenges, emphasizing internal experiences like thoughts, emotions, and coping strategies."
]

# Step 4: Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Step 5: Perform classification on the first row's text
result = classifier(text_to_classify, candidate_labels=labels, multi_label=False)

# Step 6: Extract the predicted label with the highest score
predicted_label_full = result['labels'][0]  # Highest scoring label
predicted_label = predicted_label_full.split(" - ")[0]  # Extract the short label (e.g., "Biophysical")

# Print results
print("Classification Results:")
print(f"Text to classify: {text_to_classify}")
print(f"True category: {true_category}")
print(f"Predicted category: {predicted_label}")
print(f"Prediction correct? {predicted_label.lower() == true_category.lower()}")

# Print confidence scores for all labels
print("\nModel Confidence Scores:")
for label, score in zip(result['labels'], result['scores']):
    print(f"{label.split(' - ')[0]}: {score:.4f}")

- From there we updated what information was recorded and presented to the user, changing from a simple right or wrong to two metrics: the normalized confidence gap (the correct label's normalized score minus the random-guess baseline of 1 divided by the number of labels) and the weighted accuracy score (1.0 if the correct label ranks first, 0.5 if it ranks second, 0.0 otherwise). The weighted accuracy score isn't the best one to use here, considering some articles could reasonably belong to multiple categories, but it gave more insight into how the model was doing, which for this one test was terrible.

In [None]:
# Import necessary libraries
import pandas as pd
from transformers import pipeline

# Load the CSV file
file_path = "processed_articles_sanitized.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Extract the first row of data after the header
first_row = data.iloc[0]
title = first_row['title']
abstract = first_row['abstract']
true_category = first_row['category']

# Combine title and abstract for classification
text_to_classify = f"Title - {title}. Abstract - {abstract}"

# Define labels with descriptions
labels = [
    "Biophysical - Involves the physical systems and processes of the body, such as genetics, cellular functions, brain activity, and how they interact with environmental factors to influence health, behavior, and overall functioning.",
    "Social - Encompasses the influence of relationships, cultural norms, socioeconomic structures, and societal dynamics on individual and collective well-being, focusing on the interconnectedness of people within communities and broader social systems.",
    "Psychological - Refers to the mental and emotional processes that shape perception, decision-making, resilience, and responses to challenges, emphasizing internal experiences like thoughts, emotions, and coping strategies."
]

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Perform classification
result = classifier(text_to_classify, candidate_labels=labels, multi_label=False)

# Extract scores and labels
scores = result['scores']
predicted_label_full = result['labels'][0]  # Highest scoring label
predicted_label = predicted_label_full.split(" - ")[0]  # Short label (e.g., "Biophysical")

# Find the index of the true category in the result labels
true_index = [label.split(" - ")[0] for label in result['labels']].index(true_category)

# Calculate Normalized Confidence Gap and Weighted Accuracy Score
n_labels = len(scores)
random_baseline = 1 / n_labels

# Normalize scores
normalized_scores = [score / sum(scores) for score in scores]
correct_score = normalized_scores[true_index]
gap_random = correct_score - random_baseline

# Weighted Accuracy Score
# result['labels'] is sorted from highest to lowest score, so the position of the true label is its rank
rank = true_index + 1
weighted_accuracy = 1.0 if rank == 1 else 0.5 if rank == 2 else 0.0

# Output results
print(f"Predicted Label: {predicted_label}")
print(f"Correct Label: {true_category}")
print("\nModel Confidence Scores:")
for label, score in zip(result['labels'], result['scores']):
    print(f"{label.split(' - ')[0]}: {score:.4f}")
print(f"\nNormalized Confidence Gap: {gap_random:.4f}")
print(f"Weighted Accuracy Score: {weighted_accuracy:.1f}")
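
- To make the two metrics easier to check by hand, they can be factored into one small helper. This is only a minimal sketch: score_result is a hypothetical name, the example dictionary is made-up data shaped like the pipeline output (labels sorted from highest to lowest score), and the case-insensitive lookup is my own assumption about how the category column might be written.

```python
def score_result(result, true_category):
    """Return (normalized confidence gap, weighted accuracy) for one result.

    `result` is assumed to look like the zero-shot pipeline output: a dict
    with 'labels' and 'scores', sorted from highest to lowest score.
    """
    short_labels = [label.split(" - ")[0].lower() for label in result["labels"]]
    true_index = short_labels.index(true_category.strip().lower())

    total = sum(result["scores"])
    normalized = [score / total for score in result["scores"]]

    # Gap between the correct label's share of the score and a uniform guess
    gap = normalized[true_index] - 1 / len(normalized)

    # Labels are already ranked, so the position of the true label is its rank
    rank = true_index + 1
    weighted_accuracy = 1.0 if rank == 1 else 0.5 if rank == 2 else 0.0
    return gap, weighted_accuracy


# Made-up scores for illustration only (not real model output)
example = {
    "labels": ["Social - ...", "Biophysical - ...", "Psychological - ..."],
    "scores": [0.5, 0.3, 0.2],
}
gap, weighted_accuracy = score_result(example, "Biophysical")
print(f"gap={gap:.4f}, weighted accuracy={weighted_accuracy}")
```

- In this made-up case the correct label ranks second with a normalized score of 0.3, so the gap is slightly negative (0.3 minus 1/3) and the weighted accuracy is 0.5.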



- Once the foundation was set up, the program was changed to run every article through the model and average the scores across all articles as well as across the articles of each category. This started out extremely slow until the notebook was switched to a Google Colab GPU instead of my own device's hardware. The results were interesting in that almost all the tests averaged out to just about what random guessing would give, which was disheartening to see.

In [None]:
# Import necessary libraries
import pandas as pd
from transformers import pipeline
import time

# Load the CSV file
file_path = "processed_articles_sanitized.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Define labels with descriptions
labels = [
    "Biophysical - Involves the physical systems and processes of the body, such as genetics, cellular functions, brain activity, and how they interact with environmental factors to influence health, behavior, and overall functioning.",
    "Social - Encompasses the influence of relationships, cultural norms, socioeconomic structures, and societal dynamics on individual and collective well-being, focusing on the interconnectedness of people within communities and broader social systems.",
    "Psychological - Refers to the mental and emotional processes that shape perception, decision-making, resilience, and responses to challenges, emphasizing internal experiences like thoughts, emotions, and coping strategies."
]

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Variables for aggregating statistics
category_stats = {label.split(" - ")[0]: {"gap_sum": 0, "accuracy_sum": 0, "count": 0} for label in labels}
overall_gap_sum = 0
overall_accuracy_sum = 0
total_articles = 0

# Start timing the execution
start_time = time.time()

# Define a batch size
batch_size = 100

# Process articles in batches
for batch_start in range(0, len(data), batch_size):
    elapsed_time = time.time() - start_time
    if elapsed_time > 720:
        print("\nTime limit reached. Stopping further processing...")
        break

    batch_end = min(batch_start + batch_size, len(data))
    batch = data.iloc[batch_start:batch_end]

    # Combine titles and abstracts
    texts_to_classify = [
        f"Title - {row['title']}. Abstract - {row['abstract']}" for _, row in batch.iterrows()
    ]
    true_categories = batch['category'].tolist()

    # Perform batch classification
    results = classifier(texts_to_classify, candidate_labels=labels, multi_label=False)

    for i, result in enumerate(results):
        # Extract scores and labels
        scores = result['scores']
        predicted_label_full = result['labels'][0]
        predicted_label = predicted_label_full.split(" - ")[0]

        # Find the index of the true category
        true_category = true_categories[i]
        true_index = [label.split(" - ")[0] for label in result['labels']].index(true_category)

        # Calculate statistics
        n_labels = len(scores)
        random_baseline = 1 / n_labels
        normalized_scores = [score / sum(scores) for score in scores]
        correct_score = normalized_scores[true_index]
        gap_random = correct_score - random_baseline
        rank = true_index + 1  # result['labels'] is sorted by score, so position equals rank
        weighted_accuracy = 1.0 if rank == 1 else 0.5 if rank == 2 else 0.0

        # Update statistics
        category_stats[true_category]["gap_sum"] += gap_random
        category_stats[true_category]["accuracy_sum"] += weighted_accuracy
        category_stats[true_category]["count"] += 1

        overall_gap_sum += gap_random
        overall_accuracy_sum += weighted_accuracy
        total_articles += 1

    # Update console message
    print(f"\rProcessing articles {batch_start + 1}-{batch_end}/{len(data)}...", end="")

# Wrap-up calculations
overall_gap_avg = overall_gap_sum / total_articles if total_articles else 0
overall_accuracy_avg = overall_accuracy_sum / total_articles if total_articles else 0

# Output results
print("\n\nProcessing complete!")
print(f"Processed Articles: {total_articles}")
print(f"Overall Normalized Confidence Gap Average: {overall_gap_avg:.4f}")
print(f"Overall Weighted Accuracy Average: {overall_accuracy_avg:.4f}")

for category, stats in category_stats.items():
    count = stats["count"]
    gap_avg = stats["gap_sum"] / count if count else 0
    accuracy_avg = stats["accuracy_sum"] / count if count else 0
    print(f"\nCategory: {category}")
    print(f"  Articles Processed: {count}")
    print(f"  Average Normalized Confidence Gap: {gap_avg:.4f}")
    print(f"  Average Weighted Accuracy: {accuracy_avg:.4f}")
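
- The running sums above can also be kept in one small object, which makes the overall and per-category averages harder to get out of sync. The sketch below is just one way to do it: RunningAverages is a hypothetical class, the "Overall" bucket name is my own choice (and assumes no real category shares that name), and the numbers fed in are made up.

```python
from collections import defaultdict


class RunningAverages:
    """Accumulate metric sums per category and report their means."""

    def __init__(self):
        self._sums = defaultdict(
            lambda: {"gap": 0.0, "accuracy": 0.0, "count": 0}
        )

    def add(self, category, gap, accuracy):
        # Update both the per-category bucket and the overall bucket
        for key in (category, "Overall"):
            bucket = self._sums[key]
            bucket["gap"] += gap
            bucket["accuracy"] += accuracy
            bucket["count"] += 1

    def averages(self):
        # Buckets only exist once something was added, so count is never zero
        return {
            key: {
                "gap": bucket["gap"] / bucket["count"],
                "accuracy": bucket["accuracy"] / bucket["count"],
                "count": bucket["count"],
            }
            for key, bucket in self._sums.items()
        }


# Made-up per-article metrics for illustration only
stats = RunningAverages()
stats.add("Social", 0.10, 1.0)
stats.add("Social", -0.10, 0.5)
stats.add("Biophysical", 0.20, 1.0)

for name, avg in stats.averages().items():
    print(f"{name}: n={avg['count']}, gap={avg['gap']:.4f}, accuracy={avg['accuracy']:.4f}")
```

- As a reference point for reading the averages: with three labels, a model that ranked them at random would be expected to score (1.0 + 0.5 + 0.0) / 3 = 0.5 on weighted accuracy, and a model that spread its scores evenly would have a normalized confidence gap of 0.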


- I attempted some slight changes to the program, including shortening the labels to be more concise and setting multi_label to True (as it should have been from the start), but in the end saw roughly the same results, just spread out more across the different categories. With this, I switched to trying to fine-tune an existing model.

In [None]:
# Import necessary libraries
import pandas as pd
from transformers import pipeline
import time

# Load the CSV file
file_path = "processed_articles_sanitized.csv"  # Replace with your actual file path
data = pd.read_csv(file_path)

# Define labels with descriptions
labels = [
    "Biophysical - Genetics, brain activity, cellular functions.",
    "Social - Relationships, cultural norms, societal dynamics.",
    "Psychological - Mental processes, emotions, coping strategies."
]

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Variables for aggregating statistics
category_stats = {label.split(" - ")[0]: {"gap_sum": 0, "accuracy_sum": 0, "count": 0} for label in labels}
overall_gap_sum = 0
overall_accuracy_sum = 0
total_articles = 0

# Start timing the execution
start_time = time.time()

# Define a batch size
batch_size = 100

# Process articles in batches
for batch_start in range(0, len(data), batch_size):
    elapsed_time = time.time() - start_time
    if elapsed_time > 720:
        print("\nTime limit reached. Stopping further processing...")
        break

    batch_end = min(batch_start + batch_size, len(data))
    batch = data.iloc[batch_start:batch_end]

    # Combine titles and abstracts
    texts_to_classify = [
        f"Title - {row['title']}. Abstract - {row['abstract']}" for _, row in batch.iterrows()
    ]
    true_categories = batch['category'].tolist()

    # Perform batch classification
    results = classifier(texts_to_classify, candidate_labels=labels, multi_label=True)

    for i, result in enumerate(results):
        # Extract scores and labels
        scores = result['scores']
        predicted_label_full = result['labels'][0]
        predicted_label = predicted_label_full.split(" - ")[0]

        # Find the index of the true category
        true_category = true_categories[i]
        true_index = [label.split(" - ")[0] for label in result['labels']].index(true_category)

        # Calculate statistics
        n_labels = len(scores)
        random_baseline = 1 / n_labels
        normalized_scores = [score / sum(scores) for score in scores]
        correct_score = normalized_scores[true_index]
        gap_random = correct_score - random_baseline
        rank = true_index + 1  # result['labels'] is sorted by score, so position equals rank
        weighted_accuracy = 1.0 if rank == 1 else 0.5 if rank == 2 else 0.0

        # Update statistics
        category_stats[true_category]["gap_sum"] += gap_random
        category_stats[true_category]["accuracy_sum"] += weighted_accuracy
        category_stats[true_category]["count"] += 1

        overall_gap_sum += gap_random
        overall_accuracy_sum += weighted_accuracy
        total_articles += 1

    # Update console message
    print(f"\rProcessing articles {batch_start + 1}-{batch_end}/{len(data)}...", end="")

# Wrap-up calculations
overall_gap_avg = overall_gap_sum / total_articles if total_articles else 0
overall_accuracy_avg = overall_accuracy_sum / total_articles if total_articles else 0

# Output results
print("\n\nProcessing complete!")
print(f"Processed Articles: {total_articles}")
print(f"Overall Normalized Confidence Gap Average: {overall_gap_avg:.4f}")
print(f"Overall Weighted Accuracy Average: {overall_accuracy_avg:.4f}")

for category, stats in category_stats.items():
    count = stats["count"]
    gap_avg = stats["gap_sum"] / count if count else 0
    accuracy_avg = stats["accuracy_sum"] / count if count else 0
    print(f"\nCategory: {category}")
    print(f"  Articles Processed: {count}")
    print(f"  Average Normalized Confidence Gap: {gap_avg:.4f}")
    print(f"  Average Weighted Accuracy: {accuracy_avg:.4f}")
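
- For context on what the multi_label switch changes: as I understand the Hugging Face zero-shot pipeline, with multi_label=False a softmax is taken across the labels' entailment logits, so the labels compete and the scores sum to 1. With multi_label=True each label gets its own softmax over its contradiction and entailment logits, so the scores are independent and need not sum to 1. The sketch below reproduces that arithmetic in plain Python on made-up logits; softmax, single_label_scores, and multi_label_scores are hypothetical helper names, not library functions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def single_label_scores(entailment_logits):
    # multi_label=False: labels compete with each other, scores sum to 1
    return softmax(entailment_logits)


def multi_label_scores(contradiction_logits, entailment_logits):
    # multi_label=True: each label is scored on its own (entailment vs.
    # contradiction), so the scores need not sum to 1
    return [
        softmax([contra, entail])[1]
        for contra, entail in zip(contradiction_logits, entailment_logits)
    ]


# Made-up logits for three labels (illustration only, not model output)
entailment = [2.0, 1.5, -1.0]
contradiction = [-1.0, 0.0, 2.0]

single = single_label_scores(entailment)
multi = multi_label_scores(contradiction, entailment)
print("single-label:", [round(s, 3) for s in single], "sum =", round(sum(single), 3))
print("multi-label: ", [round(s, 3) for s in multi], "sum =", round(sum(multi), 3))
```

- This is also why dividing each score by the sum of the scores does nothing in the multi_label=False runs (the scores already sum to 1) but does rescale the values in the multi_label=True run.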

# Fine-Tuning Attempt


In [None]:
# Install necessary libraries
!pip install transformers datasets


In [None]:
from google.colab import files
files.upload()

In [None]:
# Import libraries (only the ones this section actually uses)
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer


In [None]:
# Load the dataset and inspect the first few rows to understand its structure
file_path = "processed_articles_sanitized.csv"  # Replace with your dataset path
dataset = load_dataset("csv", data_files=file_path)

# Combine title and abstract into one column
def preprocess_function(example):
    # Combine the 'title' and 'abstract' for a single example
    example["text"] = f"Title: {example['title']}. Abstract: {example['abstract']}"
    return example

# Apply preprocessing
dataset = dataset.map(preprocess_function)

# Now rename the column and drop the originals
dataset = dataset.rename_column("category", "label")  # Ensure 'label' matches Hugging Face expectations
dataset = dataset.remove_columns(["title", "abstract", "Key"])  # Remove the original columns to keep 'text' and 'label'

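# Split the dataset into training and validation sets (80% train, 20% validation).
# The split is done once with a fixed seed: calling train_test_split twice reshuffles
# each time, which can put the same rows in both the training and validation sets.
split = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_data = split["train"]
val_data = split["test"]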

# Print one example entry from the dataset to verify
print("One example entry from the dataset:")
print(train_data[0])  # Print the first entry of the train dataset to verify





In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"  # Ensure W&B is fully disabled
os.environ["WANDB_MODE"] = "dryrun"

print(train_data[0])

# Initialize tokenizer and pre-trained model
model_name = "bert-base-uncased"  # Replace with your chosen model
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    # Ensure the "text" is a list of strings, if not, join them into a single string
    texts = [str(text) for text in examples["text"]]  # Make sure they are strings
    return tokenizer(texts, padding="max_length", truncation=True)

label_map = {
    "Biophysical": 0,
    "Psychological": 1,
    "Social": 2
}

def flatten_and_convert_labels(examples):
    # If the label is a list, get the first element
    if isinstance(examples['label'], list):
        examples['label'] = examples['label'][0]  # Extract the first element if it's a list

    # Convert the label from string to integer based on the label_map
    examples['label'] = label_map.get(examples['label'], examples['label'])  # Convert label to integer

    return examples

# Convert the string labels to integers BEFORE tokenizing, so that the tokenized
# datasets handed to the Trainer carry integer labels rather than strings
train_data = train_data.map(flatten_and_convert_labels)
val_data = val_data.map(flatten_and_convert_labels)

tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_val = val_data.map(tokenize_function, batched=True)

print(train_data[0])

# Load the model for sequence classification
num_labels = 3  # For Biophysical, Psychological, Social
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",  # Directory to save model/checkpoints
    evaluation_strategy="epoch",  # Evaluate at the end of each epoch (newer transformers releases call this eval_strategy)
    learning_rate=3e-5,  # Standard learning rate for fine-tuning
    per_device_train_batch_size=8,  # Adjust as needed for memory
    num_train_epochs=5,  # Experiment with 2–5 epochs
    weight_decay=0.01,  # Regularization to avoid overfitting
    save_total_limit=2,  # Keep only the last two checkpoints
    logging_dir="./logs",  # Logging directory
    logging_steps=10,  # Log training progress every 10 steps
    report_to="none",  # Disable W&B and other logging integrations (None falls back to the default reporters)
    lr_scheduler_type="linear",  # Linearly decrease learning rate
)

# Set up the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,
)

# Train the model
trainer.train()

# Save the model
trainer.save_model("fine_tuned_model")

# Evaluate and print results
results = trainer.evaluate()
print("Evaluation Results:", results)


In [None]:
from sklearn.metrics import accuracy_score

# Make predictions on your validation/test set
predictions = trainer.predict(tokenized_val)
predicted_labels = predictions.predictions.argmax(axis=1)

# Compare with the true labels
true_labels = tokenized_val["label"]

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print(f"Accuracy: {accuracy:.4f}")
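
- A single accuracy number can hide differences between categories, which the zero-shot section tracked separately. As a rough sketch, per-category recall can be computed from the same two label sequences. per_category_recall is a hypothetical helper, ID_TO_NAME is assumed to mirror the label_map used earlier, and the label lists below are made up.

```python
from collections import Counter

# Assumed to mirror the label_map used during preprocessing
ID_TO_NAME = {0: "Biophysical", 1: "Psychological", 2: "Social"}


def per_category_recall(true_labels, predicted_labels):
    """Fraction of each true category's articles that were predicted correctly."""
    totals = Counter()
    correct = Counter()
    for true, predicted in zip(true_labels, predicted_labels):
        totals[int(true)] += 1
        if int(true) == int(predicted):
            correct[int(true)] += 1
    return {
        ID_TO_NAME.get(label, str(label)): correct[label] / totals[label]
        for label in sorted(totals)
    }


# Made-up labels for illustration only
true_example = [0, 0, 1, 1, 2, 2, 2, 2]
pred_example = [0, 1, 1, 1, 2, 2, 0, 2]
recalls = per_category_recall(true_example, pred_example)
for name, recall in recalls.items():
    print(f"{name}: {recall:.2f}")
```

- In the notebook itself, the same function could be called with true_labels and predicted_labels from the cell above.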