# TNC Title + Abstract Relavance Prediction Tool Using SPECTER

This notebook includes the training processes of two machine learning models which predict a text's relevance based on its title and abstract.

Contributor: Alyssa Wu, Pomona College '28, https://www.linkedin.com/in/zjwualyssa/

# Import Libraries + Dataset

In [12]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

# For metric evaluation
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Using PyTorch to build neural network
import torch
import torch.nn as nn
import torch.optim as optim

In [5]:
# Import training dataset
df = pd.read_csv("TAB_new.csv")
df.head()

""" TAB_new is a dataset of title and abstracts of papers labeled 0 (irrelevant) or
1 (relevant) in relation to agroforestry. The column "TAB_preproc" are those same
title and abstracts preprocessed so the text is lowercase, has no punctuation,
and no stopwords."""

Unnamed: 0,label,TAB,TAB_preproc
0,1,Timber-Yielding Plants of the Tamaulipan Thorn...,plants tamaulipan thorn bioenergy potential cu...
1,1,Restoration: Success and Completion Criteria R...,success completion criteria restoration distur...
2,1,Soil Carbon Sequestration: Ethiopia Sequestrat...,soil carbon ethiopia sequestration soil organi...
3,1,Village Bamboos It has been recognized that ba...,village bamboos recognized bamboos growing wil...
4,1,Physical protection by soil aggregates stabili...,physical protection soil aggregates stabilizes...


# Encoding Using SPECTER

SPECTER is an embedder trained designed to create document-level embeddings for scientific documents papers that incorporates information about inter-document relatedness.

> "...SPECTER incorporates inter-document context into the Transformer (Vaswani et al., 2017) language models (e.g., SciBERT (Beltagy et al., 2019) ) to learn document representations that are effective across a wide-variety of downstream tasks, without the need for any task-specific fine-tuning of the pretrained language model. We specifically use citations as a naturally occurring, inter-document incidental supervision signal indicating which documents are most related and formulate the signal into a triplet-loss pretraining objective. Unlike many prior works, at inference time, our model does not require any citation information. This is critical for embedding new papers that have not yet been cited" (Cohan et al., 2020).

* *SPECTER: Document-level Representation Learning using Citation-informed Transformers*
* https://papertohtml.org/paper?id=a3e4ceb42cbcd2c807d53aff90a8cb1f5ee3f031


Each embedding is a mathematical representation (vector) of a scientific reader's understanding of each title and abstract.

In [6]:
# Load SPECTER (encoder)
from transformers import AutoModel

model_name = 'allenai/specter'
model = AutoModel.from_pretrained(model_name)
model.eval() # inference mode

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/440M [00:00<?, ?B/s]

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(31116, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

In [7]:
# Change pandas dataframe to Hugging Face dataset
HF_df = Dataset.from_pandas(df)

In [8]:
# Embed title + abstract
tokenizer = AutoTokenizer.from_pretrained(model_name)

def encode_tab(example):
    """
    input:: Hugging Face dataset
    return:: embed TAB column to some ex) [0.123, -0.456, 0.789, ..., 0.001]  # Shape: (768,)
    NOTE: extra work is because we loaded BERT & BERT variations as end-to-end models (aka including classification) while SPECTER is embedding only
    """
    input_text = example["TAB_preproc"]
    inputs = tokenizer(input_text, return_tensors = "pt", truncation = True, max_length = 512) # Dict {'input_ids': tensor, 'token_type_ids': tensor}
    with torch.no_grad(): # disables gradient tracking...we are doing inference only, not training --> no need for backprop & store gradient
        outputs = model(**inputs) # ** unpacks the dictionary
        cls_emb = outputs.last_hidden_state[:, 0, :]  # extract CLS token...[:, 0, :] all items, first item, all items
    return {"embedding": cls_emb.squeeze().numpy()}

tokenizer_config.json:   0%|          | 0.00/321 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

In [9]:
from datasets import Features, Sequence, Value

# Add 'embedding' feature as a sequence of floats
features = HF_df.features.copy()
features["embedding"] = Sequence(Value("float32"))

tokenized_df = HF_df.map(
    encode_tab,
    features = features,
    batched = False
)

Map:   0%|          | 0/7789 [00:00<?, ? examples/s]

In [None]:
# Example code to better see how the encoding works
ex = HF_df['TAB_preproc'][6] # random example of preoprocessed title + abstract text
ex # string

ex_token = tokenizer(HF_df['TAB'][6], return_tensors = "pt", truncation = True, max_length = 512)
ex_token['input_ids'].type # dict of shape {'input_ids': Tensor, 'attention_mask': Tensor}

ex_out = model(**ex_token)
ex_out #'BaseModelOutputWithPoolingAndCrossAttentions' object w/ attributes

ex_embedding = ex_out.last_hidden_state[:, 0, :] # ex.out.last_hidden_state has torch.Size([1, 220, 768])
ex_embedding.squeeze().shape # torch.Size([768]), in code grad is disabled so it can be converted to a numpy array

# Neural Networks

To process the high dimensionionality of the embeddings, we will use a neural network. A neural network to learn how to "weigh" the importance of difference features of the (786, ) shaped vector, like how a scientist learns to weigh the importance of certain words and phrases, to determine a paper's relevance.

In [13]:
# Prepare data for models
X = tokenized_df['embedding'] # vector embedding of shape (768, )
y = tokenized_df['label'] # label 0/1 (irrelevant/relevant)

# Split df into train, val, and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.20)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50)

print(f"Train: {len(X_train)}, Val: {len(X_val)}, Test: {len(X_test)}")

Train: 6231, Val: 779, Test: 779


In [14]:
# Turn into PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype = torch.float32)
y_train_tensor = torch.tensor(y_train, dtype = torch.float32)

X_val_tensor = torch.tensor(X_val, dtype = torch.float32)
y_val_tensor = torch.tensor(y_val, dtype = torch.float32)

X_test_tensor = torch.tensor(X_test, dtype = torch.float32)
y_test_tensor = torch.tensor(y_test, dtype = torch.float32)

In [15]:
# Calculate class weights
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced', classes=np.array([0,1]), y=y_train_tensor.numpy())
class_weights = torch.tensor(class_weights, dtype=torch.float32)
class_weights

tensor([0.5813, 3.5769])

## Model 1: Singular Dense Layer


In [16]:
# Define the model
model1 = nn.Sequential(
    nn.Dropout(0.2),
    nn.Linear(768, 1),
    nn.Sigmoid()
 )

print(model1)

Sequential(
  (0): Dropout(p=0.2, inplace=False)
  (1): Linear(in_features=768, out_features=1, bias=True)
  (2): Sigmoid()
)


In [18]:
# Train the model

# Define hyperparameters (all adjustable)
optimizer = optim.Adam(model.parameters(), lr=0.005)
threshold = 0.5
n_epochs = 30
batch_size = 32

# To help save model + weights with highest val_prec * val_recall (identifies the highest % of relevant papers)
highest_score = 0

for epoch in range(n_epochs):
    # Store for per epoch evaluation
    train_targets = []
    train_preds = []
    train_loss = 0
    num_samples = 0
    val_targets = []
    val_prob = []
    val_preds = []

    # Training loop
    model1.train()
    for i in range(0, len(X_train_tensor), batch_size):
        X_batch = X_train_tensor[i:i+batch_size]
        y_batch = y_train_tensor[i:i+batch_size].unsqueeze(1)
        y_pred = model1(X_batch) # forward pass
        y_labels = (y_pred > threshold).float()

        # Weighted binary cross entropy
        loss_fn = nn.BCELoss(weight=torch.where(y_batch == 1, class_weights[1], class_weights[0]))
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Store targets and predictions
        train_targets.extend(y_batch.detach().numpy())
        train_preds.extend(y_labels.detach().numpy())

        # Track loss
        batch_size_actual = y_batch.size(0)
        train_loss += loss.item() * batch_size_actual # in case last batch is smaller than the rest
        num_samples += batch_size_actual

    # Validation loop
    model1.eval()
    with torch.no_grad():
        for j in range(0, len(X_val_tensor), batch_size):
            X_val_batch = X_val_tensor[j:j+batch_size]
            y_val_batch = y_val_tensor[j:j+batch_size].unsqueeze(1)
            y_val_pred = model1(X_val_batch)
            y_val_labels = (y_val_pred > threshold).float()

            # Store targets and predictions
            val_targets.extend(y_val_batch.detach().numpy())
            val_prob.extend(y_val_pred.detach().numpy()) # probability
            val_preds.extend(y_val_labels.detach().numpy())

    # Calculate metrics per epoch
    train_acc = accuracy_score(train_targets, train_preds)
    train_prec = precision_score(train_targets, train_preds, zero_division=0) # zero_division = 0 in case model predicts 0 positives
    train_recall = recall_score(train_targets, train_preds, zero_division=0)
    train_f1 = f1_score(train_targets, train_preds, zero_division=0)
    avg_train_loss = train_loss / num_samples

    val_acc = accuracy_score(val_targets, val_preds)
    val_prec = precision_score(val_targets, val_preds, zero_division=0)
    val_recall = recall_score(val_targets, val_preds, zero_division=0)
    val_f1 = f1_score(val_targets, val_preds, zero_division=0)

    # Save model with highest val_recall
    if val_recall > highest_score:
        highest_score = val_recall
        torch.save(model, 'specter-1layer.pt')
        torch.save(model.state_dict(), 'specter-1layer-parameters.pt')


    print(f"""Epoch {epoch+1}/{n_epochs} ===================================================================================================
          Training::  avg loss: {avg_train_loss}, accuracy: {train_acc}, precision: {train_prec}, recall: {train_recall}, f1: {train_f1}
          Validation:: accuracy: {val_acc}, precision: {val_prec}, recall: {val_recall}, f1; {val_f1}
""")

          Training::  avg loss: 0.6952584109323937, accuracy: 0.5804846734071578, precision: 0.14730878186968838, recall: 0.417910447761194, f1: 0.21783363255535607
          Validation:: accuracy: 0.5969191270860077, precision: 0.17314487632508835, recall: 0.3798449612403101, f1; 0.23786407766990292

          Training::  avg loss: 0.6967111123428933, accuracy: 0.5763119884448724, precision: 0.14662405113863364, recall: 0.4213547646383467, f1: 0.2175459395376408
          Validation:: accuracy: 0.5969191270860077, precision: 0.17314487632508835, recall: 0.3798449612403101, f1; 0.23786407766990292

          Training::  avg loss: 0.6960636349689365, accuracy: 0.5832129674209597, precision: 0.15060728744939272, recall: 0.42709529276693453, f1: 0.22268781801855733
          Validation:: accuracy: 0.5969191270860077, precision: 0.17314487632508835, recall: 0.3798449612403101, f1; 0.23786407766990292

          Training::  avg loss: 0.6963168856726855, accuracy: 0.5804846734071578, precisi

In [None]:
# Load model and assess on test set
model1 = torch.load('specter-1layer.pt', weights_only = False)
model1.eval()

model1_output = model1(X_test_tensor)
model1_preds = (model1_output > threshold).float()

model1_acc = accuracy_score(y_test_tensor, model1_preds)
model1_prec = precision_score(y_test_tensor, model1_preds, zero_division=0)
model1_recall = recall_score(y_test_tensor, model1_preds, zero_division=0)
model1_f1 = f1_score(y_test_tensor, model1_preds, zero_division=0)

print(f"Test:: accuracy: {model1_acc}, precision: {model1_prec}, recall: {model1_recall}, f1: {model1_f1}")

## Model 2: Three Dense Layers

In [None]:
# Define the model
model2 = nn.Sequential(
    nn.Linear(768, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid()
)

print(model2)

In [None]:
# Train the model

# Set hyperparameters (all adjustable)
optimizer = optim.Adam(model.parameters(), lr=0.005)
threshold = 0.5
n_epochs = 30
batch_size = 32

# To help save model + weights with highest val_recall
highest_score = 0

for epoch in range(n_epochs):
    # Store for per epoch evaluation
    train_targets = []
    train_preds = []
    train_loss = 0
    num_samples = 0
    val_targets = []
    val_prob = []
    val_preds = []

    # Training loop
    model2.train()
    for i in range(0, len(X_train_tensor), batch_size):
        X_batch = X_train_tensor[i:i+batch_size]
        y_batch = y_train_tensor[i:i+batch_size].unsqueeze(1)
        y_pred = model2(X_batch) # forward pass
        y_labels = (y_pred > threshold).float()

        # Weighted binary cross entropy
        loss_fn = nn.BCELoss(weight=torch.where(y_batch == 1, class_weights[1], class_weights[0]))
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Store targets and predictions
        train_targets.extend(y_batch.detach().numpy())
        train_preds.extend(y_labels.detach().numpy())

        # Track loss
        batch_size_actual = y_batch.size(0)
        train_loss += loss.item() * batch_size_actual # in case last batch is smaller than the rest
        num_samples += batch_size_actual

    # Validation loop
    model2.eval()
    with torch.no_grad():
        for j in range(0, len(X_val_tensor), batch_size):
            X_val_batch = X_val_tensor[j:j+batch_size]
            y_val_batch = y_val_tensor[j:j+batch_size].unsqueeze(1)
            y_val_pred = model2(X_val_batch)
            y_val_labels = (y_val_pred > threshold).float()

            # Store targets and predictions
            val_targets.extend(y_val_batch.detach().numpy())
            val_prob.extend(y_val_pred.detach().numpy()) # probability
            val_preds.extend(y_val_labels.detach().numpy())

    # Calculate metrics per epoch
    train_acc = accuracy_score(train_targets, train_preds)
    train_prec = precision_score(train_targets, train_preds, zero_division=0) # zero_division = 0 in case model predicts 0 positives
    train_recall = recall_score(train_targets, train_preds, zero_division=0)
    train_f1 = f1_score(train_targets, train_preds, zero_division=0)
    avg_train_loss = train_loss / num_samples

    val_acc = accuracy_score(val_targets, val_preds)
    val_prec = precision_score(val_targets, val_preds, zero_division=0)
    val_recall = recall_score(val_targets, val_preds, zero_division=0)
    val_f1 = f1_score(val_targets, val_preds, zero_division=0)

    # Save model with highest recall
    if val_recall > highest_score:
        highest_score = val_recall
        torch.save(model, 'specter-3layer.pt')
        torch.save(model.state_dict(), 'specter-3layer-parameters.pt')


    print(f"""Epoch {epoch+1}/{n_epochs} ===================================================================================================
          Training::  avg loss: {avg_train_loss}, accuracy: {train_acc}, precision: {train_prec}, recall: {train_recall}, f1: {train_f1}
          Validation:: accuracy: {val_acc}, precision: {val_prec}, recall: {val_recall}, f1; {val_f1}
""")

In [None]:
# Load model and assess on test set
model2 = torch.load('specter-3layer', weights_only = False)
model2.eval()

model2_output = model2(X_test_tensor)
model2_preds = (model2_output > threshold).float()

model2_acc = accuracy_score(y_test_tensor, model2_preds)
model2_prec = precision_score(y_test_tensor, model2_preds, zero_division=0)
model2_recall = recall_score(y_test_tensor, model2_preds, zero_division=0)
model2_f1 = f1_score(y_test_tensor, model2_preds, zero_division=0)

print(f"Test:: accuracy: {model2_acc}, precision: {model2_prec}, recall: {model2_recall}, f1: {model2_f1}")