# Hugging Face model card for DistilBERT uncased text classification model: ORO relevance screening

This notebook takes the model configuration with the hyperparameters selected by the model_selection_excl.py script and fits it on the whole screening dataset. This differs from the predictions obtained from the nested cross-validation in 'binary_predictions_excl.py', which fits models on splits of the data to produce a distribution of predictions. A model card can hold only one model, so the purpose here is just to provide an approximation/example, knowing that this model will likely be overfit compared to the predictions presented in the paper.

## Fit the screening model and push to Hugging Face

In [2]:
# Load modules
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import pandas as pd
import tensorflow as tf
import tensorflow_addons as tfa
from datasets import load_dataset


In [3]:
# Define the model, tokenizer and parameters used to fit the model
MODEL_NAME = 'distilbert-base-uncased'

tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

outer_scores = []
inner_scores = []
params = ['batch_size','weight_decay','learning_rate','num_epochs','class_weight']

In [4]:
# Collect the model selection scores from every fold, then pick the best parameter combination
for k in range(5):
    inner_df = pd.read_csv(f'/home/dveytia/ORO-map-relevance/outputs/model_selection/screen_model_selection_{k}.csv') 
    inner_df = inner_df.sort_values('F1',ascending=False).reset_index(drop=True)
    inner_scores += inner_df.to_dict('records')

inner_scores = pd.DataFrame.from_dict(inner_scores).fillna(-1)
best_model_params = (inner_scores
              .groupby(params)['F1']
              .mean()
              .sort_values(ascending=False)
              .reset_index() 
             ).to_dict('records')[0]



# can have a look at the F1 score for the best model
print(best_model_params)

{'batch_size': 16, 'weight_decay': 0.0, 'learning_rate': 1e-05, 'num_epochs': 4, 'class_weight': -1, 'F1': 0.7021713122907419}
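
The selection step above averages F1 over the folds for each hyperparameter combination and keeps the combination with the highest mean. A minimal sketch of the same logic in plain Python follows; the scores and the reduced parameter set are made up for illustration and are not results from the screening data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-fold records, shaped like rows of the model selection CSVs
records = [
    {"batch_size": 16, "learning_rate": 1e-5, "F1": 0.70},
    {"batch_size": 16, "learning_rate": 1e-5, "F1": 0.72},
    {"batch_size": 32, "learning_rate": 5e-5, "F1": 0.75},
    {"batch_size": 32, "learning_rate": 5e-5, "F1": 0.61},
]
param_names = ["batch_size", "learning_rate"]


def best_mean_f1(records, param_names):
    """Return the parameter combination with the highest mean F1 across folds."""
    scores = defaultdict(list)
    for row in records:
        # Group fold scores by the tuple of parameter values
        key = tuple(row[name] for name in param_names)
        scores[key].append(row["F1"])
    best_key = max(scores, key=lambda k: mean(scores[k]))
    best = dict(zip(param_names, best_key))
    best["F1"] = mean(scores[best_key])
    return best


best = best_mean_f1(records, param_names)
print(best)
```

In this toy data the second configuration has the single best fold but the lower mean, so averaging across folds favours the configuration that is consistently good.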


# Try using Max's code -- model predictions not too good

In [10]:
def init_model(MODEL_NAME, num_labels, params):
    # Load the pretrained checkpoint named by MODEL_NAME with a fresh classification head
    model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)
    optimizer = tfa.optimizers.AdamW(learning_rate=params['learning_rate'], weight_decay=params['weight_decay'])

    loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
    # The model outputs logits, so accuracy is thresholded at 0.0 (probability 0.5)
    metrics = tf.metrics.BinaryAccuracy(threshold=0.0)
    model.compile(
        optimizer=optimizer,
        loss=loss,
        metrics=metrics
    )
    return model

In [13]:
my_model = init_model('distilbert-base-uncased', 1, best_model_params)


In [22]:
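import os

# The access token is read from the environment (HF_TOKEN is an assumed variable name) rather than hard-coded
my_model.push_to_hub("distilbert_ORO_screen", use_auth_token=os.environ["HF_TOKEN"])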

tf_model.h5: 100%|███████████████████████████| 268M/268M [00:34<00:00, 7.73MB/s]


In [23]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
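import os

# The access token is read from the environment (HF_TOKEN is an assumed variable name) rather than hard-coded
tokenizer.push_to_hub("distilbert_ORO_screen", use_auth_token=os.environ["HF_TOKEN"])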

CommitInfo(commit_url='https://huggingface.co/dveytia/distilbert_ORO_screen/commit/4281423f3d5af853f3d8faa72c1ed3034f9b7ece', commit_message='Upload tokenizer', commit_description='', oid='4281423f3d5af853f3d8faa72c1ed3034f9b7ece', pr_url=None, pr_revision=None, pr_num=None)

In [24]:
# save_pretrained() writes to a local directory (its argument is treated as a path, not a URL).
# That directory can then be pushed manually with git, but it's an extra step.
#my_model.save_pretrained("https://huggingface.co/dveytia/distilbert_ORO_screen")

In [25]:
tokenizer.save_pretrained("https://huggingface.co/dveytia/distilbert_ORO_screen")

('https://huggingface.co/dveytia/distilbert_ORO_screen/tokenizer_config.json',
 'https://huggingface.co/dveytia/distilbert_ORO_screen/special_tokens_map.json',
 'https://huggingface.co/dveytia/distilbert_ORO_screen/vocab.txt',
 'https://huggingface.co/dveytia/distilbert_ORO_screen/added_tokens.json')

## Using the best parameters, fit the model on the full dataset

In [6]:
## Read in and Format the screening data 

## The 'seen' data
seen_df = pd.read_csv('/home/dveytia/ORO-map-relevance/data/seen/all-screen-results_screenExcl-codeIncl.txt', delimiter='\t')
seen_df['seen']=1
seen_df = seen_df.rename(columns={'include_screen':'relevant','analysis_id':'id'})
seen_df['relevant']=seen_df['relevant'].astype(int)

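def map_values(x):
    # Flag randomly sampled records (1) versus all other sampling strategies (0)
    value_map = {
        "random": 1,
        "relevance sort": 0,
        "test list": 0,
        "supplemental coding": 0
    }
    # Unknown sampling labels become a real missing value rather than the string "NaN"
    return value_map.get(x, float("nan"))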

seen_df['random_sample'] = seen_df['sample_screen'].apply(map_values)

df = seen_df

#unseen_df = pd.read_csv('/home/dveytia/ORO-map-relevance/data/unseen/unique_references2.txt', delimiter='\t')
#unseen_df.rename(columns={'analysis_id':'id'}, inplace=True)
#unseen_df['seen']=0

#nan_count=unseen_df['abstract'].isna().sum()
#print('Number of missing abstracts is',nan_count)
#nan_articles=unseen_df[unseen_df['abstract'].isna()]
#unseen_df=unseen_df.dropna(subset=['abstract']).reset_index(drop=True)

#df = (pd.concat([seen_df,unseen_df])
#      .sort_values('id')
#      .sample(frac=1, random_state=1)
#      .reset_index(drop=True)
#)


#print('Number of unique references WITH abstract is',len(df))

# Build the model input from title and abstract. Casting to str avoids the TypeError
# raised when a column contains missing (float NaN) values.
df['text'] = df['title'].astype("str") + ". " + df['abstract'].astype("str")
# Append keywords only where they are present; casting a missing value to str
# would otherwise insert the literal text "nan"
has_keywords = df['keywords'].notna()
df.loc[has_keywords, 'text'] = df.loc[has_keywords, 'text'] + " Keywords: " + df.loc[has_keywords, 'keywords'].astype("str")

#seen_index = df[df['seen']==1].index
#unseen_index = df[df['seen']==0].index


In [7]:
## Convert pandas data frame to Dataset
## separate into training (non-randomly sampled) and testing (randomly sampled)

from datasets import Dataset, DatasetDict

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


train_datasets = Dataset.from_pandas(df.loc[df['random_sample'] == 0, ['text','relevant','random_sample']])
eval_datasets = Dataset.from_pandas(df.loc[df['random_sample'] == 1, ['text','relevant','random_sample']])

train_tokenized = train_datasets.map(tokenize_function, batched=True)
eval_tokenized = eval_datasets.map(tokenize_function, batched=True)

Map: 100%|████████████████████████████| 669/669 [00:03<00:00, 198.40 examples/s]
Map: 100%|██████████████████████████| 2083/2083 [00:10<00:00, 196.97 examples/s]


In [8]:
# Convert Dataset to big tensors and use the tf.data.Dataset.from_tensor_slices method
full_train_dataset = train_tokenized
full_eval_dataset = eval_tokenized

tf_train_dataset = full_train_dataset.remove_columns(["text"]).with_format("tensorflow")
train_features = {x: tf_train_dataset[x] for x in tokenizer.model_input_names}
train_tf_dataset = tf.data.Dataset.from_tensor_slices((train_features, tf_train_dataset['relevant']))
train_tf_dataset = train_tf_dataset.shuffle(len(tf_train_dataset)).batch(best_model_params['batch_size'])

tf_eval_dataset = full_eval_dataset.remove_columns(["text"]).with_format("tensorflow")
eval_features = {x: tf_eval_dataset[x] for x in tokenizer.model_input_names}
eval_tf_dataset = tf.data.Dataset.from_tensor_slices((eval_features, tf_eval_dataset['relevant']))
eval_tf_dataset = eval_tf_dataset.shuffle(len(tf_eval_dataset)).batch(best_model_params['batch_size'])


2024-03-12 10:29:34.698184: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 21308 MB memory:  -> device: 0, name: Quadro RTX 6000, pci bus id: 0000:3b:00.0, compute capability: 7.5


In [9]:
# With this, the model can be compiled and trained

# Define the model using the best parameters found during model selection
num_labels = 1  # binary model, so a single output logit
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

optimizer = tfa.optimizers.AdamW(learning_rate=best_model_params['learning_rate'], weight_decay=best_model_params['weight_decay'])
loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
# The model outputs logits, so accuracy is thresholded at 0.0 (probability 0.5)
metrics = tf.metrics.BinaryAccuracy(threshold=0.0)
model.compile(
    optimizer=optimizer,
    loss=loss,
    metrics=metrics
)

# Fit the model using the training and evaluation datasets
history = model.fit(train_tf_dataset, validation_data=eval_tf_dataset, epochs=best_model_params['num_epochs'])

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4

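
Because the model is built with num_labels = 1 and trained with BinaryCrossentropy(from_logits=True), it outputs one raw logit per text rather than a probability. A minimal sketch of turning such a logit into a relevance probability and a binary decision follows; the function names and the 0.5 cut-off are assumptions for illustration, not values taken from the paper.

```python
import math


def logit_to_probability(logit: float) -> float:
    """Apply the sigmoid function to a single raw logit (numerically stable form)."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    exp_logit = math.exp(logit)
    return exp_logit / (1.0 + exp_logit)


def is_relevant(logit: float, threshold: float = 0.5) -> bool:
    """Flag a document as relevant when its probability reaches the threshold."""
    return logit_to_probability(logit) >= threshold


# Illustrative logits only; real values would come from the model's predictions
for logit in (-2.0, 0.0, 1.5):
    print(logit, round(logit_to_probability(logit), 3), is_relevant(logit))
```

A logit of 0.0 maps to a probability of exactly 0.5, which is why a threshold on probabilities of 0.5 is equivalent to a threshold of 0.0 on logits.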

In [10]:
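import os

# The access token is read from the environment (HF_TOKEN is an assumed variable name) rather than hard-coded
model.push_to_hub("distilbert_ORO_screen", use_auth_token=os.environ["HF_TOKEN"])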

tf_model.h5: 100%|███████████████████████████| 268M/268M [00:26<00:00, 9.94MB/s]


In [39]:
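import os
import requests

API_URL = "https://api-inference.huggingface.co/models/dveytia/distilbert_ORO_screen"
# The access token is read from the environment (HF_TOKEN is an assumed variable name) rather than hard-coded
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}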

def query(payload):
    response = requests.post(API_URL, headers=headers, json=payload)
    return response.json()

output = query({
    "inputs": "The Paris Agreement target of limiting global surface warming to 1.5–2◦C compared to pre-industrial levels by 2100 will still heavily impact the ocean. While ambitious mitigation and adaptation are both needed, the ocean provides major opportunities for action to reduce climate change globally and its impacts on vital ecosystems and ecosystem services. A comprehensive and systematic assessment of 13 global- and local-scale, ocean-based measures was performed to help steer the development and implementation of technologies and actions toward a sustainable outcome. We show that (1) all measures have tradeoffs and multiple criteria must be used for a comprehensive assessment of their potential, (2) greatest benefit is derived by combining global and local solutions, some of which could be implemented or scaled-up immediately, (3) some measures are too uncertain to be recommended yet, (4) political consistency must be achieved through effective cross-scale governance mechanisms, (5) scientific effort must focus on effectiveness, co-benefits, disbenefits, and costs of poorly tested as well as new and emerging measures.",
})

# junk code below?

In [None]:
# transformers.Trainer only supports PyTorch models, so the Keras model
# compiled above is evaluated with its own evaluate() method instead
eval_results = model.evaluate(eval_tf_dataset, return_dict=True)
print(eval_results)

In [None]:
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)

model.compile(
    optimizer=tfa.optimizers.AdamW(learning_rate=best_model_params['learning_rate'], weight_decay=best_model_params['weight_decay']),
    loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
    # The model outputs logits, so accuracy is thresholded at 0.0 (probability 0.5)
    metrics=tf.metrics.BinaryAccuracy(threshold=0.0),
)

# -1 is the fillna() placeholder for a missing class weight, so treat it as "no weighting".
# Any other value is assumed to already be in the {class_index: weight} form Keras expects.
class_weight = best_model_params['class_weight']
class_weight = None if class_weight == -1 else class_weight

# train_tf_dataset is already batched, so batch_size must not be passed to fit()
model.fit(train_tf_dataset,
          validation_data=eval_tf_dataset,
          epochs=best_model_params['num_epochs'],
          class_weight=class_weight
)

# save_pretrained() expects a local directory, not a URL
model.save_pretrained("distilbert_ORO_screen")