# Apply BERT to Solve a Cannabis Product Classification Problem 

**Author:** Wenhao Pan, UC Berkeley, Spring 2022.

## Table of Contents

* [Introduction](#Introduction)
* [Basic Setup](#Basic-Setup)
* [One Label](#One-Label)
* [All the Labels](#All-the-Labels)
* [Prediction](#Prediction)


## Introduction

In this notebook, we apply the [BERT model](https://en.wikipedia.org/wiki/BERT_(language_model)), which Google created and published in 2018, to our cannabis product dataset. By running through this notebook, we are able to

* Fine-tune a BERT model for a single label, which is a binary classification task (*One Label*)
* Fine-tune multiple BERT models for multiple labels consecutively, which is a set of binary classification tasks (*All the Labels*)
* Use the fine-tuned BERT models to make the predictions (*Prediction*)

Within each section, it is recommended to run the code cells in order. All the code cells in the *Basic Setup* section should always be run before any other section.

**Note**: This is the Google Colab version of `bert.ipynb`; the original `bert.ipynb` is designed for running locally (if you have a GPU).

## Basic Setup

Run the following cell to connect the notebook to your Google Drive so that you can save the fine-tuned models and predictions permanently in the folder `BERT_exp` if you want.

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import os
try:
    os.mkdir("/content/drive/MyDrive/BERT_exp")
except FileExistsError:  # the folder was created in an earlier run
    print ("Folder is already existed.")
os.chdir("/content/drive/MyDrive/BERT_exp")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Folder is already existed.


Run the following cell to install all the packages that will be used later from PyPI.

In [None]:
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install torch
!pip install transformers
!pip install datasets
!pip install scikit-learn
!pip install gensim



Run the following cell to import all the packages and functions that will be used later. `MODEL_NAME` defines which pre-trained model we want to use. Here, we use `bert-base-uncased`, which can be found [here](https://huggingface.co/bert-base-uncased).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
import os

MODEL_NAME = "bert-base-uncased" # the name of the pre-trained model we want to use
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, load_metric, Dataset

from sklearn.model_selection import train_test_split 
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, confusion_matrix

from gensim.parsing import remove_stopwords, strip_numeric, strip_punctuation, strip_multiple_whitespaces

Run the following cell to confirm that we are using the GPU provided by Colab. If the printout is `cuda`, we are indeed using the GPU. To connect to a GPU, go to **Edit > Notebook settings** or **Runtime > Change runtime type** and select **GPU** as the **Hardware accelerator**.

In [None]:
print(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

cuda


Run the following cell to confirm that the dataset is available. Otherwise, you need to upload the dataset (`in_sample.csv` and `out_sample.csv`) to the `BERT_exp` folder that the first code cell created in your Google Drive. These two CSV files can be downloaded from [here](https://drive.google.com/file/d/1yYhdvl2BRdOW6cUT2k4HcQrymEBRZrZD/view?usp=sharing) and [here](https://drive.google.com/file/d/1xXFebXJaaaWG8lx294J56XevfZlI2NVl/view?usp=sharing).

In [None]:
assert os.path.exists("in_sample.csv") and os.path.exists("out_sample.csv"), "Raw dataset was not detected. You need to upload the dataset first!"

Run the following cell to load the helper functions we need later.

In [None]:
def clean_data(df, field, labels, remove_punctuations=False, remove_stop_words=False, remove_digits=False, minimal=False):
    """Binarizes labels for given dataframe, and exports cleaned dataframes

    Args:
        df (pd.dataframe): dataframe with label columns (see LABELS above)
        field (str): the name of the input field
        labels (list[str]): labels we currently consider
        remove_punctuations (boolean): remove punctuations from the description field if True
        remove_stop_words (boolean): remove stop words from the description field if True
        remove_digits (boolean): remove digits from the description field if True
        minimal (boolean): only keep the description and label fields if True

    Returns:
        df_clean (pd.dataframe): cleaned dataframe with binarized labels
    """
    # work on copies so the column assignments below do not trigger SettingWithCopyWarning
    df_clean = df.dropna(subset=[field]).copy()

    # ensure label fields are all numerical
    for label in labels:
        df_clean = df_clean[(df_clean[label] == 0) | (df_clean[label] == 1) | (df_clean[label] == '0') | (df_clean[label] == '1')].copy()
        df_clean[label] = pd.to_numeric(df_clean[label])
    
    # remove punctuations if wanted
    if remove_punctuations:
        df_clean[field] = df_clean[field].apply(strip_punctuation)

    # remove stopwords if wanted 
    if remove_stop_words:
        df_clean[field] = df_clean[field].apply(remove_stopwords)
    
    # remove digits if wanted
    if remove_digits:
        df_clean[field] = df_clean[field].apply(strip_numeric)

    # drop unnecessary columns
    if minimal:
        df_clean = df_clean[[field] + labels]

    df_clean[field] = df_clean[field].astype(str)
    df_clean[field] = df_clean[field].str.lower() # lowercase all characters
    df_clean[field] = df_clean[field].apply(strip_multiple_whitespaces) # remove repeating whitespace
    df_clean = df_clean.replace(to_replace=[''], value=np.nan).dropna(subset=[field]) # drop empty field
    
    return df_clean


def load_data(field, labels, remove_punctuations=False, remove_stop_words=False, remove_digits=False, minimal=False):
    """Loads in_sample and out_sample data, cleans them, and exports clean csv files

    Args:
        field (str): the name of the input field
        labels (list[str]): labels we currently consider
        remove_punctuations (boolean): remove punctuations from the description field if True
        remove_stop_words (boolean): remove stop words from the description field if True
        remove_digits (boolean): remove digits from the description field if True
        minimal (boolean): only keep the description and label fields if True

    Returns:
        clean_insample (pd.DataFrame): Training Dataset
        clean_outsample (pd.DataFrame): Testing Dataset
    """
    # Check that data is downloaded
    assert os.path.exists("in_sample.csv"), "Need to download in_sample.csv first!"
    assert os.path.exists("out_sample.csv"), "Need to download out_sample.csv first!"

    insample = pd.read_csv("in_sample.csv")
    clean_insample = clean_data(insample, field, labels, remove_punctuations, remove_stop_words, remove_digits, minimal)
    clean_insample.to_csv('clean_in_sample.csv', index=False)

    outsample = pd.read_csv("out_sample.csv")
    clean_outsample = clean_data(outsample, field, labels, remove_punctuations, remove_stop_words, remove_digits, minimal)
    clean_outsample.to_csv('clean_out_sample.csv', index=False) 

    return clean_insample, clean_outsample
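
Regardless of the optional flags, `clean_data` always lowercases the field and collapses repeated whitespace. A minimal standard-library sketch of those two steps, assuming a hypothetical helper `normalize_text` (the notebook itself relies on pandas and gensim's `strip_multiple_whitespaces`, not on this function):

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse every run of whitespace into one space.

    Hypothetical stand-in for the last steps of `clean_data`; it is not the
    notebook's own implementation.
    """
    return re.sub(r"\s+", " ", text.lower())


# Sample text with a run of spaces followed by a newline.
sample = "Blue Dream --- THC = 23.70%   \nBlue Dream, a sativa-dominant hybrid"
normalized = normalize_text(sample)
print(normalized)
```

The newline and the extra spaces collapse into a single space, so each description ends up on one line before it is tokenized.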


## One Label

In this section, we fine-tune the BERT model on a single label. 

### Load the Dataset

Change `LABELS` and `LABEL` to the target label.

In [None]:
LABELS = ['Intoxication'] 
LABEL = 'Intoxication'

Here we chose **not** to remove any punctuation, stop words, or digits, but you can choose differently.

In [None]:
raw_insample = pd.read_csv("in_sample.csv")
raw_outsample = pd.read_csv("out_sample.csv")
clean_insample, clean_outsample = load_data("straindescription", LABELS, remove_stop_words=False, remove_digits=False, minimal=True)

Comparison between the raw and the cleaned description field:

In [None]:
raw_outsample.iloc[1, 1]

'"Blue Dream" Agrijuana --- THC = 23.70%   \nBlue Dream, a sativa-dominant hybrid originating in California, has achieved legendary status among West Coast strains. Crossing a Blueberry indica with the sativa Haze, Blue Dream balances full-body relaxation with gentle cerebral invigoration. Novice and veteran consumers alike enjoy the level effects of Blue Dream, which ease you gently into a calm euphoria. Some Blue Dream phenotypes express a more indica-like look and feel, but the sativa-leaning variety remains most prevalent.'

In [None]:
clean_outsample.iloc[1, 0]

'"blue dream" agrijuana --- thc = 23.70% blue dream, a sativa-dominant hybrid originating in california, has achieved legendary status among west coast strains. crossing a blueberry indica with the sativa haze, blue dream balances full-body relaxation with gentle cerebral invigoration. novice and veteran consumers alike enjoy the level effects of blue dream, which ease you gently into a calm euphoria. some blue dream phenotypes express a more indica-like look and feel, but the sativa-leaning variety remains most prevalent.'

Split the in-sample dataset into training and validation sets.

In [None]:
val_size = 0.2
random_state = 10
train, val = train_test_split(clean_insample, test_size=val_size, random_state=random_state)
train.to_csv('train.csv', index=False)
val.to_csv('val.csv', index=False)
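
The split sizes follow directly from `val_size`. The sketch below, assuming a hypothetical helper `split_indices` and assuming that the validation size is rounded up, reproduces the 7,276 / 1,820 row counts that this notebook reports for the 9,096 cleaned in-sample rows. It uses Python's own `random` module, so the rows it selects will differ from those chosen by `train_test_split`.

```python
import math
import random


def split_indices(n_rows: int, val_size: float, seed: int):
    """Shuffle row indices and split them into (train, val) index lists.

    Hypothetical helper for illustration only; `train_test_split` uses its
    own NumPy-based shuffling.
    """
    n_val = math.ceil(val_size * n_rows)  # assumption: validation size is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded, so the split is repeatable
    return indices[n_val:], indices[:n_val]


train_idx, val_idx = split_indices(n_rows=9096, val_size=0.2, seed=10)
print(len(train_idx), len(val_idx))
```

Fixing the seed matters here because every label's model should be validated on the same held-out rows when runs are compared.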

Load the training, validation, and testing sets into a single object called `dataset`.

In [None]:
dataset = load_dataset('csv', data_files={'train': ['train.csv'], 'val': ['val.csv'], 'test': ['clean_out_sample.csv']})

Using custom data configuration default-5d0fb156f1d0fb3f


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-5d0fb156f1d0fb3f/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-5d0fb156f1d0fb3f/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 7276
    })
    val: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 1820
    })
    test: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 5578
    })
})

In [None]:
dataset['train'][0] # first observation in the training set

{'Intoxication': 0,
 'straindescription': 'indica kief mix (56.6% thc) by cannasol --- indica // 1g for $25 // by cannasol'}

### Tokenize the textual input

The following cell is the collection of the tokenization hyperparameters.

In [None]:
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

Load the pre-trained tokenizer. Text is currently padded and truncated on the right.

In [None]:
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)
tokenizer

PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
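
To make the effect of `padding='max_length'`, `truncation=True`, and `max_len` concrete, here is a simplified sketch of right-side padding and truncation on a list of token ids. The helper `pad_or_truncate` and the token ids are hypothetical; the real tokenizer additionally keeps the closing `[SEP]` token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Right-truncate or right-pad token ids to exactly max_len entries.

    Returns (input_ids, attention_mask), where the mask is 1 for real tokens
    and 0 for padding. Simplified illustration, not the tokenizer's code.
    """
    ids = list(token_ids[:max_len])      # truncation drops tokens on the right
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)           # padding is appended on the right
    return ids + [pad_id] * n_pad, mask + [0] * n_pad


short_ids, short_mask = pad_or_truncate([7, 8, 9], max_len=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

Every example therefore has the same length, at the cost of losing the tail of any description longer than `max_len` tokens.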

Define the helper function for preprocessing/tokenizing the data. We can add more arguments in the call `tokenizer()` below to customize it. See more details [here](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__).

In [None]:
def preprocess_function(examples):
    """
    Preprocess the description field
    ---
    Arguments:
    examples (dict): a batch of rows from the dataset; examples["straindescription"] holds the description strings to tokenize

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions 
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

Tokenize the textual field `straindescription`, then drop the raw text column and rename the target column to `label` so that the dataset is acceptable to the model.

In [None]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns("straindescription")
tokenized_dataset = tokenized_dataset.rename_column(LABEL, "label")
tokenized_dataset

  0%|          | 0/8 [00:00<?, ?ba/s]

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/6 [00:00<?, ?ba/s]

DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7276
    })
    val: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1820
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5578
    })
})

### Train (fine-tune) the model

Set up the metrics. See the [reference](https://huggingface.co/metrics).

In [None]:
val_eval = {}
test_eval = {}
metric_acc = load_metric("accuracy")
metric_f1 = load_metric("f1")
metric_precision = load_metric("precision")
metric_recall = load_metric("recall")
metric_auc = load_metric("roc_auc")

def compute_metrics(eval_pred):
    """
    Compute the metrics 
    ---
    Arguments:
    eval_pred (tuple): the predicted logits and truth labels

    Returns:
    metrics (dict{str: float}): contains the computed metrics 
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    prediction_scores = np.max(logits, axis=-1)
    print(logits.shape, labels.shape)
    print(predictions.shape, prediction_scores.shape)

    pred_true = np.count_nonzero(predictions)
    pred_false = predictions.shape[0] - pred_true
    actual_true = np.count_nonzero(labels)
    actual_false = labels.shape[0] - actual_true

    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1 = metric_f1.compute(predictions=predictions, references=labels)['f1']
    precision = metric_precision.compute(predictions=predictions, references=labels)['precision']
    recall = metric_recall.compute(predictions=predictions, references=labels)['recall']
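    # NOTE: the hard 0/1 predictions (not prediction_scores) are passed as the scores here,
    # so this value equals (sensitivity + specificity) / 2 rather than a threshold-free AUC.
    roc_auc = metric_auc.compute(prediction_scores=predictions, references=labels)['roc_auc']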
    matthews_correlation = matthews_corrcoef(y_true=labels, y_pred=predictions)
    cohen_kappa = cohen_kappa_score(y1=labels, y2=predictions)

    tn, fp, fn, tp = confusion_matrix(y_true=labels, y_pred=predictions).ravel()
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    informedness = specificity + sensitivity - 1

    metrics = {
        "pred_true": pred_true,
        "pred_false": pred_false,
        "actual_true": actual_true,
        "actual_false": actual_false,
        "accuracy": acc,
        "f1_score": f1,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc,
        "matthews_correlation": matthews_correlation,
        "cohen_kappa": cohen_kappa,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "true_positive": tp,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "informedness": informedness
    }
    return metrics
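
Most of the metrics above are simple functions of the four confusion-matrix counts. As a sanity check, the sketch below recomputes them by hand from the counts in the step-100 validation row of the training log in this section (`tn=1482, fp=16, fn=37, tp=285`). It is a plain-Python illustration, not part of the training pipeline.

```python
import math

# Confusion-matrix counts copied from the step-100 validation log row.
tn, fp, fn, tp = 1482, 16, 37, 285
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)                # identical to sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
informedness = specificity + recall - 1
balanced_accuracy = (specificity + recall) / 2

# Matthews correlation coefficient from the four counts.
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Cohen's kappa: observed agreement versus agreement expected by chance.
p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total**2
kappa = (accuracy - p_chance) / (1 - p_chance)

for name, value in [
    ("accuracy", accuracy), ("precision", precision), ("recall", recall),
    ("specificity", specificity), ("f1", f1), ("informedness", informedness),
    ("balanced_accuracy", balanced_accuracy), ("mcc", mcc), ("kappa", kappa),
]:
    print(f"{name}: {value:.6f}")
```

The balanced accuracy computed here coincides with the logged `roc_auc` value of 0.937206. That is what ROC AUC reduces to when it is computed from hard 0/1 predictions, as `compute_metrics` does, instead of from continuous scores.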

The following cell collects all the [model](https://huggingface.co/docs/transformers/v4.18.0/en/model_doc/bert#transformers.BertConfig) and [optimization](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/trainer#transformers.TrainingArguments) hyperparameters we used.

In [None]:
# model hyperparameters
classifier_dropout = 0.15 # dropout ratio for the classification head
num_classes = 2 # number of classes

# optimization hyperparameters ###
model_dir = "bert_" + LABEL
seed = 42 # random seed for splitting the data into batches
batch_size = 16 # batch size for both training and evaluation
grad_acc_steps = 4 # number of steps for gradient accumulation
lr = 5e-5 # initial learning rate
weight_decay = 2e-3 # weight decay to apply in the AdamW optimizer
epochs = 8 # total number of training epochs 
lr_scheduler = "cosine" # type of learning rate scheduler
strategy = "steps" # strategy for logging, evaluation, and saving
steps = 100 # number of steps for logging, evaluation, and saving
eval_metric = "f1_score" # metric for selecting the best model
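
With `lr_scheduler = "cosine"`, the learning rate decays from `lr` toward zero along half a cosine wave. A minimal sketch of that curve, assuming no warmup steps and a schedule that spans exactly `total_steps` optimizer updates; `cosine_lr` is a hypothetical helper, not a Transformers function:

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-5) -> float:
    """Learning rate at a given optimizer step under a half-cosine decay.

    Assumes no warmup; illustration only.
    """
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 100  # arbitrary example length
for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps))
```

The rate starts at `lr`, reaches half of it at the midpoint, and approaches zero at the final step, so late updates make only small changes to the weights.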

Load the pre-trained model. Extra keyword arguments passed to `from_pretrained` override the corresponding values in the model configuration, which is how `classifier_dropout` and `num_labels` are set here.


In [None]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    classifier_dropout=classifier_dropout,
    num_labels=num_classes
)
model.config

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [None]:
# remove any output directory left over from a previous run
!rm -rf $model_dir

training_args = TrainingArguments(
    output_dir=model_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=grad_acc_steps,
    learning_rate=lr,
    weight_decay=weight_decay, 
    num_train_epochs=epochs,
    lr_scheduler_type=lr_scheduler,
    evaluation_strategy=strategy,
    logging_strategy=strategy, 
    save_strategy=strategy,
    eval_steps=steps,
    logging_steps=steps,
    save_steps=steps,
    seed=seed,
    load_best_model_at_end=True,
    metric_for_best_model=eval_metric,
    report_to="none"
)

PyTorch: setting up devices


Set up the trainer. See the [reference](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/trainer#transformers.Trainer).

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    tokenizer=tokenizer,   
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()

***** Running training *****
  Num examples = 7276
  Num Epochs = 1
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 4
  Total optimization steps = 113


Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.176,0.084301,301,1519,322,1498,0.970879,0.914928,0.946844,0.885093,0.937206,0.898127,0.897385,1482,16,37,285,0.989319,0.885093,0.874412


***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


(1820, 2) (1820,)
(1820,) (1820,)


Saving model checkpoint to bert_Intoxication/checkpoint-100
Configuration saved in bert_Intoxication/checkpoint-100/config.json
Model weights saved in bert_Intoxication/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Intoxication/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Intoxication/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Intoxication/checkpoint-100 (score: 0.9149277688603531).


TrainOutput(global_step=113, training_loss=0.163691293876783, metrics={'train_runtime': 195.7194, 'train_samples_per_second': 37.176, 'train_steps_per_second': 0.577, 'total_flos': 557466548544000.0, 'train_loss': 0.163691293876783, 'epoch': 0.99})

### Evaluate the model 

Print out the model architecture.

In [None]:
model

Count the total number of model parameters.

In [None]:
sum(p.numel() for p in model.parameters() if p.requires_grad)

Evaluate the best model (checkpoint) on the validation and testing sets, then save it.

In [None]:
# set up directory paths
best_model_dir = "best_" + model_dir
best_model_dir_zip = "best_" + model_dir + ".zip"
!rm -rf $best_model_dir $best_model_dir_zip # remove possible cache

# evaluate the best model
val_predictions = trainer.predict(tokenized_dataset["val"])
val_eval[LABEL] = val_predictions.metrics
test_predictions = trainer.predict(tokenized_dataset["test"])
test_eval[LABEL] = test_predictions.metrics

# save the best model
model.save_pretrained(best_model_dir)
!zip -r $best_model_dir_zip $best_model_dir

# save the evaluation result of each model
val_eval_df = pd.DataFrame.from_dict(val_eval).transpose()
val_eval_df.to_csv("val_evaluation.csv")
test_eval_df = pd.DataFrame.from_dict(test_eval).transpose()
test_eval_df.to_csv("test_evaluation.csv")

***** Running Prediction *****
  Num examples = 1820
  Batch size = 16


***** Running Prediction *****
  Num examples = 5578
  Batch size = 16


(1820, 2) (1820,)
(1820,) (1820,)


Configuration saved in best_bert_Intoxication/config.json


(5578, 2) (5578,)
(5578,) (5578,)


Model weights saved in best_bert_Intoxication/pytorch_model.bin


  adding: best_bert_Intoxication/ (stored 0%)
  adding: best_bert_Intoxication/config.json (deflated 49%)
  adding: best_bert_Intoxication/pytorch_model.bin (deflated 7%)
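
The predictions evaluated above are raw logits, one row of two values per description, as the printed `(1820, 2)` and `(5578, 2)` shapes show. A minimal sketch of turning such rows into a probability for the positive class with a numerically stable softmax; the helper names and the example logits are made up for illustration:

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


def positive_probability(logit_rows):
    """Probability of class 1 for each [logit_0, logit_1] row."""
    return [softmax(row)[1] for row in logit_rows]


example_logits = [[2.0, -1.0], [-0.5, 0.5], [0.0, 0.0]]  # made-up values
probs = positive_probability(example_logits)
hard_labels = [int(p >= 0.5) for p in probs]
print(probs)
print(hard_labels)
```

Probabilities like these, rather than hard 0/1 labels, are what a threshold-free metric such as ROC AUC expects as scores.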


## All the Labels

In this section, we fine-tune a model for each label. By simply running the following cell, you can get fine-tuned models for all the labels.

Notes about building models for all the labels:
1. We can clean or preprocess the input differently for each label by passing different `remove_punctuations`, `remove_stop_words`, and `remove_digits` arguments to `load_data`. By default, we only lowercase the text and remove extra whitespace.
2. Decreasing `max_len` speeds up training, but it can hurt model performance because longer descriptions are truncated.
3. Although we have a variable called `batch_size`, the effective batch size during training is `batch_size * grad_acc_steps`, which is reported as `Total train batch size` in the training log. This is due to gradient accumulation. See more details about it [here](https://huggingface.co/docs/transformers/main/en/performance#gradient-accumulation).
4. If GPU memory is insufficient, consider lowering `batch_size` or `max_len` so that less data is held in GPU memory at a time. Lowering `epochs` shortens training but does not reduce memory use.
5. If disk space is insufficient, consider increasing `steps` so that fewer model checkpoints are saved.
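
Note 3 can be checked against the training log from the *One Label* section, where 7,276 examples with `batch_size = 16` and `grad_acc_steps = 4` gave a total train batch size of 64 and 113 optimization steps for one epoch. The sketch below reproduces that arithmetic, assuming a single device and assuming that leftover micro-batches at the end of an epoch are floored away, which matches that log but may differ in other library versions. `training_schedule` is a hypothetical helper.

```python
import math


def training_schedule(n_examples, batch_size, grad_acc_steps, epochs=1):
    """Effective batch size and number of optimizer updates (single device).

    Hypothetical helper; the floor division is an assumption chosen to match
    the training log in this notebook.
    """
    batches_per_epoch = math.ceil(n_examples / batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_acc_steps, 1)
    return {
        "effective_batch_size": batch_size * grad_acc_steps,
        "batches_per_epoch": batches_per_epoch,
        "optimizer_updates": updates_per_epoch * epochs,
    }


schedule = training_schedule(n_examples=7276, batch_size=16, grad_acc_steps=4)
print(schedule)
```

Halving `batch_size` while doubling `grad_acc_steps` keeps the effective batch size at 64 but roughly halves the activations held in GPU memory per forward pass.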

In [None]:
LABELS = ["Cannabinoid", "Intoxication", "Medical", "Wellness", "Commoditization"]

### Preprocess Setup ###
# Dataset Splitting Hyperparameters
val_size = 0.2 # validation set size
random_state = 10 # random seed 

# Tokenization Hyperparameters
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

# Load the pre-trained tokenizer ###
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)

# Define the preprocess function ###
def preprocess_function(examples):
    """
    Preprocess the description field
    ---
    Arguments:
    examples (dict): a batch of rows from the dataset; examples["straindescription"] holds the description strings to tokenize

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions 
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

### Evaluation Metrics ###
val_eval = {}
test_eval = {}
metric_acc = load_metric("accuracy")
metric_f1 = load_metric("f1")
metric_precision = load_metric("precision")
metric_recall = load_metric("recall")
metric_auc = load_metric("roc_auc")

def compute_metrics(eval_pred):
    """
    Compute the metrics 
    ---
    Arguments:
    eval_pred (tuple): the predicted logits and truth labels

    Returns:
    metrics (dict{str: float}): contains the computed metrics 
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    prediction_scores = np.max(logits, axis=-1)
    print(logits.shape, labels.shape)
    print(predictions.shape, prediction_scores.shape)

    pred_true = np.count_nonzero(predictions)
    pred_false = predictions.shape[0] - pred_true
    actual_true = np.count_nonzero(labels)
    actual_false = labels.shape[0] - actual_true

    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1 = metric_f1.compute(predictions=predictions, references=labels)['f1']
    precision = metric_precision.compute(predictions=predictions, references=labels)['precision']
    recall = metric_recall.compute(predictions=predictions, references=labels)['recall']
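    # NOTE: the hard 0/1 predictions (not prediction_scores) are passed as the scores here,
    # so this value equals (sensitivity + specificity) / 2 rather than a threshold-free AUC.
    roc_auc = metric_auc.compute(prediction_scores=predictions, references=labels)['roc_auc']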
    matthews_correlation = matthews_corrcoef(y_true=labels, y_pred=predictions)
    cohen_kappa = cohen_kappa_score(y1=labels, y2=predictions)

    tn, fp, fn, tp = confusion_matrix(y_true=labels, y_pred=predictions).ravel()
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    informedness = specificity + sensitivity - 1

    metrics = {
        "pred_true": pred_true,
        "pred_false": pred_false,
        "actual_true": actual_true,
        "actual_false": actual_false,
        "accuracy": acc,
        "f1_score": f1,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc,
        "matthews_correlation": matthews_correlation,
        "cohen_kappa": cohen_kappa,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "true_positive": tp,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "informedness": informedness
    }
    return metrics

### Training and Model Setup ###
# model hyperparameters
classifier_dropout = 0.15 # dropout ratio for the classification head
num_classes = 2 # number of classes

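# optimization hyperparameters ###
# model_dir is set per label inside the training loop below, so this cell
# does not depend on LABEL from the One Label section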
seed = 42 # random seed for splitting the data into batches
batch_size = 16 # batch size for both training and evaluation
grad_acc_steps = 4 # number of steps for gradient accumulation
lr = 5e-5 # initial learning rate
weight_decay = 2e-3 # weight decay to apply in the AdamW optimizer
epochs = 8 # total number of training epochs 
lr_scheduler = "cosine" # type of learning rate scheduler
strategy = "steps" # strategy for logging, evaluation, and saving
steps = 100 # number of steps for logging, evaluation, and saving
eval_metric = "f1_score" # metric for selecting the best model

### Training ###
# fine-tune a separate model for each label
for label in LABELS:

    # load the datasets
    raw_insample = pd.read_csv("in_sample.csv")
    raw_outsample = pd.read_csv("out_sample.csv")
    clean_insample, clean_outsample = load_data("straindescription", LABELS, minimal=True)
    train, val = train_test_split(clean_insample, test_size=val_size, random_state=random_state)
    train.to_csv('train.csv', index=False)
    val.to_csv('val.csv', index=False)
    dataset = load_dataset('csv', data_files={'train': ['train.csv'], 'val': ['val.csv'], 'test': ['clean_out_sample.csv']})

    # preprocess the textual input 
    tokenized_dataset = dataset.map(preprocess_function, batched=True)
    tokenized_dataset = tokenized_dataset.remove_columns("straindescription")

    # set up directory paths
    model_dir = "bert_" + label
    best_model_dir = "best_" + model_dir
    best_model_dir_zip = "best_" + model_dir + ".zip"
    !rm -rf $model_dir $best_model_dir $best_model_dir_zip # remove possible cache

    # remove other labels and rename the target label
    other_labels = list(filter(lambda x: x != label, LABELS))
    tokenized_dataset_label = tokenized_dataset.remove_columns(other_labels)
    tokenized_dataset_label = tokenized_dataset_label.rename_column(label, "label")

    # load the pre-trained model
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        classifier_dropout=classifier_dropout,
        num_labels=num_classes
    )

    # set up the training arguments
    training_args = TrainingArguments(
        output_dir=model_dir,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=grad_acc_steps,
        learning_rate=lr,
        weight_decay=weight_decay, 
        num_train_epochs=epochs,
        lr_scheduler_type=lr_scheduler,
        evaluation_strategy=strategy,
        logging_strategy=strategy, 
        save_strategy=strategy,
        eval_steps=steps,
        logging_steps=steps,
        save_steps=steps,
        seed=seed,
        load_best_model_at_end=True,
        metric_for_best_model=eval_metric,
        report_to="none"
    )

    # set up the trainer 
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset_label['train'],
        eval_dataset=tokenized_dataset_label['val'],
        tokenizer=tokenizer,   
        compute_metrics=compute_metrics,
    )

    # train (fine-tune) the model
    trainer.train()

    # evaluate the best model
    val_predictions = trainer.predict(tokenized_dataset_label["val"])
    val_eval[label] = val_predictions.metrics
    test_predictions = trainer.predict(tokenized_dataset_label["test"])
    test_eval[label] = test_predictions.metrics

    # save the best model
    model.save_pretrained(best_model_dir)
    !zip -r $best_model_dir_zip $best_model_dir

# save the evaluation result of each model
val_eval_df = pd.DataFrame.from_dict(val_eval).transpose()
val_eval_df.to_csv("val_evaluation.csv")
test_eval_df = pd.DataFrame.from_dict(test_eval).transpose()
test_eval_df.to_csv("test_evaluation.csv")

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/resolve/ma

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-edb9a440da475098/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-edb9a440da475098/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.2",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file https://huggingface.co/bert-base-uncased/re

Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.083,0.025923,1517,302,1513,306,0.991204,0.994719,0.993408,0.996034,0.981677,0.968434,0.968404,296,10,6,1507,0.96732,0.996034,0.963355


***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Saving model checkpoint to bert_Cannabinoid/checkpoint-100
Configuration saved in bert_Cannabinoid/checkpoint-100/config.json
Model weights saved in bert_Cannabinoid/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Cannabinoid/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Cannabinoid/checkpoint-100 (score: 0.9947194719471946).
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Configuration saved in best_bert_Cannabinoid/config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Cannabinoid/pytorch_model.bin


  adding: best_bert_Cannabinoid/ (stored 0%)
  adding: best_bert_Cannabinoid/config.json (deflated 49%)
  adding: best_bert_Cannabinoid/pytorch_model.bin (deflated 7%)


Using custom data configuration default-466c26256fe96055


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-466c26256fe96055/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-466c26256fe96055/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


loading weights file https://huggingface.co/bert-base-uncased/re

Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.1829,0.100513,295,1524,325,1494,0.962617,0.890323,0.935593,0.849231,0.918257,0.869333,0.867855,1475,19,49,276,0.987282,0.849231,0.836513


***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Saving model checkpoint to bert_Intoxication/checkpoint-100
Configuration saved in bert_Intoxication/checkpoint-100/config.json
Model weights saved in bert_Intoxication/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Intoxication/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Intoxication/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Intoxication/checkpoint-100 (score: 0.8903225806451613).
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Configuration saved in best_bert_Intoxication/config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Intoxication/pytorch_model.bin


  adding: best_bert_Intoxication/ (stored 0%)
  adding: best_bert_Intoxication/config.json (deflated 49%)
  adding: best_bert_Intoxication/pytorch_model.bin (deflated 7%)


Using custom data configuration default-07e0db75a3b8745c


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-07e0db75a3b8745c/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-07e0db75a3b8745c/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


loading weights file https://huggingface.co/bert-base-uncased/re

Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.1013,0.061243,143,1676,148,1671,0.97856,0.865979,0.881119,0.851351,0.920589,0.85448,0.854331,1654,17,22,126,0.989826,0.851351,0.841178


***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Saving model checkpoint to bert_Medical/checkpoint-100
Configuration saved in bert_Medical/checkpoint-100/config.json
Model weights saved in bert_Medical/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Medical/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Medical/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Medical/checkpoint-100 (score: 0.8659793814432989).
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Configuration saved in best_bert_Medical/config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Medical/pytorch_model.bin


  adding: best_bert_Medical/ (stored 0%)
  adding: best_bert_Medical/config.json (deflated 49%)
  adding: best_bert_Medical/pytorch_model.bin (deflated 7%)


Using custom data configuration default-2da91690bcdd4413


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-2da91690bcdd4413/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-2da91690bcdd4413/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


loading weights file https://huggingface.co/bert-base-uncased/re

Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.148,0.08151,439,1380,432,1387,0.973062,0.943743,0.936219,0.951389,0.965601,0.926087,0.926036,1359,28,21,411,0.979813,0.951389,0.931201


***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Saving model checkpoint to bert_Wellness/checkpoint-100
Configuration saved in bert_Wellness/checkpoint-100/config.json
Model weights saved in bert_Wellness/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Wellness/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Wellness/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Wellness/checkpoint-100 (score: 0.9437428243398392).
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Configuration saved in best_bert_Wellness/config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Wellness/pytorch_model.bin


  adding: best_bert_Wellness/ (stored 0%)
  adding: best_bert_Wellness/config.json (deflated 49%)
  adding: best_bert_Wellness/pytorch_model.bin (deflated 7%)


Using custom data configuration default-bfecff90f808463e


Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-bfecff90f808463e/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-bfecff90f808463e/0.0.0/433e0ccc46f9880962cc2b12065189766fbb2bee57a221866138fb9203c83519. Subsequent calls will reuse this data.


loading weights file https://huggingface.co/bert-base-uncased/re

Step,Training Loss,Validation Loss,Pred True,Pred False,Actual True,Actual False,Accuracy,F1 Score,Precision,Recall,Roc Auc,Matthews Correlation,Cohen Kappa,True Negative,False Positive,False Negative,True Positive,Specificity,Sensitivity,Informedness
100,0.1632,0.100459,903,916,888,931,0.967565,0.967058,0.959025,0.975225,0.967742,0.935245,0.935118,894,37,22,866,0.960258,0.975225,0.935483


***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Saving model checkpoint to bert_Commoditization/checkpoint-100
Configuration saved in bert_Commoditization/checkpoint-100/config.json
Model weights saved in bert_Commoditization/checkpoint-100/pytorch_model.bin
tokenizer config file saved in bert_Commoditization/checkpoint-100/tokenizer_config.json
Special tokens file saved in bert_Commoditization/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Commoditization/checkpoint-100 (score: 0.9670575097710776).
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


Configuration saved in best_bert_Commoditization/config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Commoditization/pytorch_model.bin


  adding: best_bert_Commoditization/ (stored 0%)
  adding: best_bert_Commoditization/config.json (deflated 49%)
  adding: best_bert_Commoditization/pytorch_model.bin (deflated 7%)


In [None]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Prediction

In this section, we use the fine-tuned models to make predictions on the real data, which will then be used in a separate data analysis.

As in the fine-tuning setup, you first need to upload the real data, `full_dataset.csv`.

In [None]:
LABELS = ["Cannabinoid", "Intoxication", "Medical", "Wellness", "Commoditization"]
# LABELS = ["Wellness", "Commoditization"]
# LABELS = ["Cannabinoid", "Intoxication", "Medical"]

full_dataset = pd.read_csv("full_dataset.csv")

FileNotFoundError: ignored

In [None]:
# # downsampling
# full_dataset = full_dataset.sample(50000, random_state=random_state)

Clean the real data so that it can be passed into the model.

In [None]:
full_dataset['straindescription'] = '"' + full_dataset['strain'].astype(str) + '" -- '+ full_dataset['description'].astype(str)
clean_full = clean_data(full_dataset, "straindescription", [], minimal=True)
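
To make the concatenation in the cell above concrete, here is a small sketch on a made-up two-row frame. The strain names and descriptions are invented for illustration only, and the notebook's `clean_data` step is not reproduced here.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "strain": ["Blue Dream", "OG Kush"],
    "description": ["A sweet berry aroma.", "Earthy with a hint of pine."],
})

# same expression as in the cell above: quoted strain name, separator, description
toy["straindescription"] = (
    '"' + toy["strain"].astype(str) + '" -- ' + toy["description"].astype(str)
)

print(toy["straindescription"].tolist())
```

Each row becomes a single string of the form `"<strain>" -- <description>`, which is the text the tokenizer later receives.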

Define the functions and hyperparameters needed for making predictions.

**Warning:** The hyperparameter choices here should be the **same** as those used in the fine-tuning stage.
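
To see why the settings matter, it helps to look at what right-side truncation and right-side padding do to a sequence of token IDs. The following is a simplified pure-Python sketch, not the Hugging Face tokenizer's implementation (the real tokenizer also keeps special tokens such as `[SEP]` in place when truncating). The helper name `pad_or_truncate` is hypothetical, the token IDs are arbitrary examples, and a pad ID of 0 is assumed to match `pad_token_id` in the BERT config.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Right-truncate, then right-pad, a list of token IDs to exactly max_len."""
    kept = token_ids[:max_len]            # truncation on the right side
    n_pad = max_len - len(kept)           # how many pad tokens are needed
    input_ids = kept + [pad_id] * n_pad   # padding on the right side
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask

# a sequence shorter than max_len gets padded
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_len=5)
# a sequence longer than max_len gets cut off
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_len=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

If prediction used a different `max_len` or a different side than fine-tuning did, the model would see inputs shaped differently from the ones it was trained on.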

In [None]:
### Preprocess Setup ###
# Tokenization Hyperparameters
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

# Load the pre-trained tokenizer ###
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)

# Define the preprocess function ###
def preprocess_function(examples):
    """
    Tokenize the `straindescription` field
    ---
    Arguments:
    examples (dict-like batch): examples["straindescription"] holds the sequence or batch of sequences to be encoded/tokenized

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions 
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.19.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/resolve/ma

Predict each label on the real data.
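
The prediction loop in this section turns the model's two-column logits into a 0/1 label with `np.argmax`. A minimal sketch of that step follows, using made-up logits (the numbers are invented, not model output). It also shows, as an optional extra that the notebook itself does not use, how softmax probabilities could be derived if a confidence score is wanted.

```python
import numpy as np

# made-up logits for three examples; columns are class 0 and class 1
logits = np.array([
    [2.0, -1.0],
    [-0.5, 1.5],
    [0.2, 0.1],
])

# predicted label = index of the larger logit in each row
predicted = np.argmax(logits, axis=-1)

# optional: numerically stable softmax to turn logits into class probabilities
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

print(predicted)  # [0 1 0]
print(probs.round(3))
```

Because softmax preserves the ordering of the logits, taking the argmax of the probabilities gives the same labels as taking the argmax of the raw logits.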

In [None]:
# preprocess the textual input 
dataset = Dataset.from_pandas(clean_full)
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["straindescription", "__index_level_0__"])

for label in LABELS:

    # set up directory paths
    best_model_dir = "best_bert_" + label

    model = AutoModelForSequenceClassification.from_pretrained(best_model_dir)

    trainer = Trainer(
        model=model,
    )

    predictions = trainer.predict(tokenized_dataset)
    predict_labels = np.argmax(predictions.predictions, axis=-1)
    full_dataset[(label+"_labeled").lower()] = predict_labels

# reshape the dataframe so that the downstream data analysis code accepts it
full_dataset = full_dataset.rename({"medical_labeled": "medical_undersampled_labeled"}, axis=1)
# add all-zero integer columns
for col in ["medical_labeled", "smellflavor_labeled", "genetics_labeled"]:
    full_dataset[col] = np.zeros(full_dataset.shape[0], dtype=int)
# note: pandas < 1.5 spells this argument `line_terminator`
full_dataset.to_csv("full_dataset_with_labels.csv", index=False, lineterminator='\r\n')


loading configuration file best_bert_Wellness/config.json
Model config BertConfig {
  "_name_or_path": "best_bert_Wellness",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.19.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file best_bert_Wellness/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of BertForSequenceClassifi

loading configuration file best_bert_Commoditization/config.json
Model config BertConfig {
  "_name_or_path": "best_bert_Commoditization",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.19.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file best_bert_Commoditization/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of Be