DistilBert is a smaller version of BERT that is much faster and cheaper.

From the paper,

>"we leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster"

DistilBert Paper: https://arxiv.org/abs/1910.01108v4

In [3]:
import numpy as np
import pandas as pd
import torch
from transformers import DistilBertTokenizer

In [None]:
def get_metadata_scholar(query: str):
    """Quick description

    Long description

    Parameters:
    query (str):

    Returns:

    """
    pass

In [4]:
# ------- LOADING DATA INTO A DATAFRAME -------


# Loading database created by D. Beillouin et al.
xlsx_file = pd.ExcelFile("/home/er/Documents/Cirad/colibri/data/classification_trainset/classification_trainset.xlsx")

# Adding label "excluded" or "included" for each MA
df_incl = xlsx_file.parse("retained_meta-analyses")
df_excl = xlsx_file.parse("non_retained_meta-analyses")
df_incl["Screening"] = "included"
df_excl["Screening"] = "excluded"

# Keeping only useful attributes
attributes_to_keep_incl = [
    "Screening",
    "link",
    "Article Title",
    "Abstract",
    "Keywords",
]
attributes_to_keep_excl = [
    "Screening",
    "lien pour accès",
    "title",
    "Abstract",
    "Keywords",
]
df_incl = df_incl[attributes_to_keep_incl]
df_excl = df_excl[attributes_to_keep_excl]

# Standardising columns names
new_column_names_incl = {"Article Title": "Title", "link": "DOI"}
new_column_names_excl = {"title": "Title", "lien pour accès": "DOI"}
df_incl = df_incl.rename(columns=new_column_names_incl)
df_excl = df_excl.rename(columns=new_column_names_excl)

# Merging exluded and included MA into single dataframe
raw_data = pd.concat([df_incl, df_excl], ignore_index=True)
raw_data = raw_data.fillna("")

size_1 = len(df_incl)
size_2 = len(df_excl)
size_3 = size_1 + size_2
print(
    f"Raw database contains {size_3} entries ({size_1} included MA and {size_2} excluded MA), stored into 'raw_data' variable."
)

# ------- CLEANING -------


# Function to get DOIs from URLs
def extract_doi(url):
    if str(url).startswith("https://doi.org/"):
        return str(url)[len("https://doi.org/") :]
    else:
        return None


# Extracting DOIs from URLs
raw_data["DOI"] = raw_data["DOI"].apply(extract_doi)

# Removing empty DOIs rows
raw_data = raw_data.dropna(subset=["DOI"])
size_4 = len(raw_data)
size_5 = size_3 - size_4
print(f"{size_5} rows removed because of empty DOIs. Cannot check the uniqueness.")

# Removing empty titles rows
raw_data["Title"] = raw_data["Title"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Title"])
size_6 = len(raw_data)
size_7 = size_4 - size_6
print(
    f"{size_7} rows removed because of empty titles. Cannot be processed by the ML model."
)

# Removing empty abstracts rows
raw_data["Abstract"] = raw_data["Abstract"].replace("", np.nan)
raw_data = raw_data.dropna(subset=["Abstract"])
size_8 = len(raw_data)
size_9 = size_6 - size_8
print(
    f"{size_9} rows removed because of empty abstracts. Cannot be processed by the ML model."
)

# Removing DOIs duplicates and titles duplicates
raw_data = raw_data.drop_duplicates(subset=["DOI"], keep="first")
raw_data = raw_data.drop_duplicates(subset="Title", keep="first")
size_10 = len(raw_data)
size_11 = size_8 - size_10
print(f"{size_11} DOI duplicates and title duplicates removed.")

# Droping column 'DOI' now we have unique values. No needed for the ML model
train_set = raw_data.drop(columns=["DOI"])

size_12 = train_set["Screening"].value_counts()
size_incl = size_12.loc["included"]
size_excl = size_12.loc["excluded"]
print(
    f"Cleaned database contains {size_10} entries ({size_incl} included MA and {size_excl} excluded MA), stored into 'train_set' variable."
)

# Shuffling and re-indexing
train_set = train_set.astype(str)
train_set = train_set.sample(frac=1)
train_set = train_set.reset_index(drop=True)

train_set.to_pickle("/home/er/Documents/Cirad/colibri/data/classification_trainset/classification_trainset.pkl")

print("Training set stored into 'train_set' variable and ready to be used.")
print("Summary with the 20 first lines:")
train_set.head(20)

Raw database contains 1007 entries (217 included MA and 790 excluded MA), stored into 'raw_data' variable.
151 rows removed because of empty DOIs. Cannot check the uniqueness.
0 rows removed because of empty titles. Cannot be processed by the ML model.
54 rows removed because of empty abstracts. Cannot be processed by the ML model.
8 DOI duplicates and title duplicates removed.
Cleaned database contains 794 entries (212 included MA and 582 excluded MA), stored into 'train_set' variable.
Training set stored into 'train_set' variable and ready to be used.
Summary with the 20 first lines:


Unnamed: 0,Screening,Title,Abstract,Keywords
0,excluded,Responses of a rice-wheat rotation agroecosyst...,Climate change is likely to affect agroecosyst...,
1,excluded,The number of cycles of neoadjuvant chemothera...,Objective No consensus exists on the number of...,High-grade serous ovarian cancer (HG-SOC); CA-...
2,excluded,Carbon cycling in temperate grassland under el...,An increase in mean soil surface temperature h...,CO; (2); elevated temperature; grassland; heat...
3,included,How can straw incorporation management impact ...,Straw incorporation (SI) is a common practice ...,China; Meta analysis; Soil carbon (C) sequestr...
4,excluded,Comparison of prescribing criteria to evaluate...,Because inappropriate prescribing is prevalent...,
5,included,Responses of microbial biomass carbon and nitr...,Soil microbes play important roles in regulati...,Microbial biomass carbon; Microbial biomass ni...
6,excluded,Are active organic matter fractions suitable i...,Active fractions of organic matter have been p...,Active fractions of organic matter; microbial ...
7,included,Long-term nitrogen fertilization decreases bac...,Long-term elevated nitrogen (N) input from ant...,Actinobacteria; agro-ecosystems; bacterial div...
8,excluded,Comt and mthfr polymorphisms interaction on co...,The investigation of the catechol-O-methyltran...,
9,excluded,Global-scale pattern of peatland sphagnum grow...,High-latitude peatlands contain about one thir...,


In [8]:
import torch
import torch.nn as nn
from transformers import DistilBertTokenizer, DistilBertModel
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader

# Load pre-trained DistilBERT model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = DistilBertTokenizer.from_pretrained(model_name)
model = DistilBertModel.from_pretrained(model_name)

# Convert 'Title', 'Abstract', and 'Keywords' columns to lists of strings
titles = train_set["Title"].tolist()
abstracts = train_set["Abstract"].tolist()
keywords = train_set["Keywords"].tolist()

# Tokenize and encode the titles, abstracts, and keywords
title_inputs = tokenizer(
    titles, padding="max_length", truncation=True, max_length=128, return_tensors="pt"
)
abstract_inputs = tokenizer(
    abstracts,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
keywords_inputs = tokenizer(
    keywords, padding="max_length", truncation=True, max_length=128, return_tensors="pt"
)

# Prepare labels (assuming 'Screening' column has 'included' and 'excluded' values)
labels = torch.tensor(train_set["Screening"].map({"included": 1, "excluded": 0}).values)

# Create attention masks for each input
title_attention_mask = title_inputs["attention_mask"]
abstract_attention_mask = abstract_inputs["attention_mask"]
keywords_attention_mask = keywords_inputs["attention_mask"]

# Split data into training and validation sets
(
    train_title_inputs,
    val_title_inputs,
    train_title_attention_mask,
    val_title_attention_mask,
    train_labels,
    val_labels,
) = train_test_split(
    title_inputs["input_ids"],
    title_attention_mask,
    labels,
    test_size=0.2,
    random_state=42,
)

(
    train_abstract_inputs,
    val_abstract_inputs,
    train_abstract_attention_mask,
    val_abstract_attention_mask,
) = train_test_split(
    abstract_inputs["input_ids"],
    abstract_attention_mask,
    test_size=0.2,
    random_state=42,
)

(
    train_keywords_inputs,
    val_keywords_inputs,
    train_keywords_attention_mask,
    val_keywords_attention_mask,
) = train_test_split(
    keywords_inputs["input_ids"],
    keywords_attention_mask,
    test_size=0.2,
    random_state=42,
)

# Create DataLoader for training and validation sets for each input type
train_title_dataset = TensorDataset(
    train_title_inputs, train_title_attention_mask, train_labels
)
val_title_dataset = TensorDataset(
    val_title_inputs, val_title_attention_mask, val_labels
)

train_abstract_dataset = TensorDataset(
    train_abstract_inputs, train_abstract_attention_mask, train_labels
)
val_abstract_dataset = TensorDataset(
    val_abstract_inputs, val_abstract_attention_mask, val_labels
)

train_keywords_dataset = TensorDataset(
    train_keywords_inputs, train_keywords_attention_mask, train_labels
)
val_keywords_dataset = TensorDataset(
    val_keywords_inputs, val_keywords_attention_mask, val_labels
)

train_title_dataloader = DataLoader(train_title_dataset, batch_size=16, shuffle=True)
val_title_dataloader = DataLoader(val_title_dataset, batch_size=16)

train_abstract_dataloader = DataLoader(
    train_abstract_dataset, batch_size=16, shuffle=True
)
val_abstract_dataloader = DataLoader(val_abstract_dataset, batch_size=16)

train_keywords_dataloader = DataLoader(
    train_keywords_dataset, batch_size=16, shuffle=True
)
val_keywords_dataloader = DataLoader(val_keywords_dataset, batch_size=16)

# Define the classification head (unchanged)
classification_head = nn.Sequential(
    nn.Linear(
        model.config.hidden_size * 3, 64
    ),  # Concatenate the three outputs from title, abstract, and keywords
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(64, 2),  # 2 for binary classification (included/excluded)
)

# Fine-tune the model (unchanged)
optimizer = torch.optim.AdamW(classification_head.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Training loop (unchanged)
num_epochs = 5 
for epoch in range(num_epochs):
    model.train()

    for batch_title, batch_abstract, batch_keywords, batch_labels in zip(
        train_title_dataloader,
        train_abstract_dataloader,
        train_keywords_dataloader,
        train_labels,
    ):
        # Unpack the data for each input type
        title_input_ids, title_attention_mask, _ = batch_title
        abstract_input_ids, abstract_attention_mask, _ = batch_abstract
        keywords_input_ids, keywords_attention_mask, _ = batch_keywords

        optimizer.zero_grad()

        with torch.no_grad():
            title_outputs = model(
                input_ids=title_input_ids, attention_mask=title_attention_mask
            )
            abstract_outputs = model(
                input_ids=abstract_input_ids, attention_mask=abstract_attention_mask
            )
            keywords_outputs = model(
                input_ids=keywords_input_ids, attention_mask=keywords_attention_mask
            )

        # Concatenate the outputs from title, abstract, and keywords
        concatenated_output = torch.cat(
            [
                title_outputs.last_hidden_state[:, 0, :],
                abstract_outputs.last_hidden_state[:, 0, :],
                keywords_outputs.last_hidden_state[:, 0, :],
            ],
            dim=1,
        )

        logits = classification_head(concatenated_output)
        loss = loss_fn(logits, batch_labels)
        loss.backward()
        optimizer.step()

# Validation (updated)
model.eval()
val_loss = 0.0
correct_predictions = 0
total_predictions = 0

# Check if GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

with torch.no_grad():
    for batch_title, batch_abstract, batch_keywords, batch_labels in zip(
        val_title_dataloader,
        val_abstract_dataloader,
        val_keywords_dataloader,
        val_labels,
    ):
        # Unpack the data for each input type
        title_input_ids, title_attention_mask, _ = batch_title
        abstract_input_ids, abstract_attention_mask, _ = batch_abstract
        keywords_input_ids, keywords_attention_mask, _ = batch_keywords
        batch_labels = batch_labels.to(device)  # Send labels to the device

        title_outputs = model(
            input_ids=title_input_ids.to(device),
            attention_mask=title_attention_mask.to(device),
        )
        abstract_outputs = model(
            input_ids=abstract_input_ids.to(device),
            attention_mask=abstract_attention_mask.to(device),
        )
        keywords_outputs = model(
            input_ids=keywords_input_ids.to(device),
            attention_mask=keywords_attention_mask.to(device),
        )

        # Concatenate the outputs from title, abstract, and keywords
        concatenated_output = torch.cat(
            [
                title_outputs.last_hidden_state[:, 0, :],
                abstract_outputs.last_hidden_state[:, 0, :],
                keywords_outputs.last_hidden_state[:, 0, :],
            ],
            dim=1,
        )

        logits = classification_head(concatenated_output)
        loss = loss_fn(logits, batch_labels)
        val_loss += loss.item()

        _, predicted = torch.max(logits, 1)
        correct_predictions += (predicted == batch_labels).sum().item()
        total_predictions += batch_labels.size(0)

avg_val_loss = val_loss / len(val_title_dataloader)
accuracy = correct_predictions / total_predictions

print(
    f"Epoch {epoch + 1}/{num_epochs}, Validation Loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.4f}"
)

# Save the fine-tuned model (unchanged)
torch.save(classification_head.state_dict(), "fine_tuned_model.pt")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


ValueError: Expected input batch_size (16) to match target batch_size (0).