# Apply BERT to Solve a Cannabis Product Classification Problem 

**Author:** Wenhao Pan, UC Berkeley, Spring 2022.

## Table of Contents

* [Introduction](#Introduction)
* [Basic Setup](#Basic-Setup)
* [One Label](#One-Label)
* [All the Labels](#All-the-Labels)
* [Prediction](#Prediction)


## Introduction

In this notebook, we apply the [BERT model](https://en.wikipedia.org/wiki/BERT_(language_model)), which Google created and published in 2018, to our cannabis product dataset. By running through this notebook, we are able to

* Fine-tune a BERT model for a single label, which is a binary classification task (*One Label*)
* Fine-tune multiple BERT models for multiple labels consecutively, which is a set of binary classification tasks (*All the Labels*)
* Use the fine-tuned BERT models to make the predictions (*Prediction*)

Within each section, run the code cells in order. Always run all the code cells in the *Basic Setup* section before running any other section.

**Note**: This is the local version of `bert_colab.ipynb`; that notebook is designed to run on Google Colab, which is useful if you do not have a GPU.

## Basic Setup

Run the following cell to import all the packages and functions that will be used later. `MODEL_NAME` defines which pre-trained model we want to use. Here, we are using `bert-base-uncased`, which can be found [here](https://huggingface.co/bert-base-uncased).

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.optim as optim
import os

MODEL_NAME = "bert-base-uncased" # the name of the pre-trained model we want to use
import transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, DataCollatorWithPadding
from transformers import TrainingArguments, Trainer
from datasets import load_dataset, load_metric, Dataset

from sklearn.model_selection import train_test_split 
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score, confusion_matrix

from gensim.parsing import remove_stopwords, strip_numeric, strip_punctuation, strip_multiple_whitespaces



Run the following cell to check whether a GPU is available. If the printout is `cuda`, then we are indeed using a GPU.

In [2]:
print(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))

cuda


Run the following cell to confirm that the dataset is available. Otherwise, you need to place the dataset (`in_sample.csv` and `out_sample.csv`) in the `data` folder under the working directory, which is where the code below looks for it. These two CSV files can be downloaded from [here](https://drive.google.com/file/d/1yYhdvl2BRdOW6cUT2k4HcQrymEBRZrZD/view?usp=sharing) and [here](https://drive.google.com/file/d/1xXFebXJaaaWG8lx294J56XevfZlI2NVl/view?usp=sharing).

In [3]:
assert os.path.exists("data/in_sample.csv") and os.path.exists("data/out_sample.csv"), "Raw dataset was not detected. You need to upload the dataset first!"

Run the following cell to load the helper functions we need later.

In [7]:
def clean_data(df, field, labels, remove_punctuations=False, remove_stop_words=False, remove_digits=False, minimal=False):
    """Binarizes labels for given dataframe, and exports cleaned dataframes

    Args:
        df (pd.DataFrame): dataframe with label columns (see `LABELS` in the One Label section)
        field (str): the name of the input field
        labels (list[str]): labels we currently consider
        remove_punctuations (boolean): remove punctuations from the description field if True
        remove_stop_words (boolean): remove stop words from the description field if True
        remove_digits (boolean): remove digits from the description field if True
        minimal (boolean): only keep the description and label fields if True

    Returns:
        df_clean (pd.dataframe): cleaned dataframe with binarized labels
    """
    df_clean = df.dropna(subset=[field])

    # ensure label fields are all numerical
    for label in labels:
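        # keep rows whose label is 0/1 (number or string); copy so the assignment below does not write to a view
        df_clean = df_clean[(df_clean[label] == 0) | (df_clean[label] == 1) | (df_clean[label] == '0') | (df_clean[label] == '1')].copy()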
        df_clean[label] = pd.to_numeric(df_clean[label])
    
    # remove punctuations if wanted
    if remove_punctuations:
        df_clean[field] = df_clean[field].apply(strip_punctuation)

    # remove stopwords if wanted 
    if remove_stop_words:
        df_clean[field] = df_clean[field].apply(remove_stopwords)
    
    # remove digits if wanted
    if remove_digits:
        df_clean[field] = df_clean[field].apply(strip_numeric)

    # drop unnecessary columns
    if minimal:
        df_clean = df_clean[[field] + labels]

    df_clean[field] = df_clean[field].astype(str)
    df_clean[field] = df_clean[field].str.lower() # lowercase all characters
    df_clean[field] = df_clean[field].apply(strip_multiple_whitespaces) # remove repeating whitespace
    df_clean = df_clean.replace(to_replace=[''], value=np.nan).dropna(subset=[field]) # drop empty field
    
    return df_clean


def load_data(field, labels, remove_punctuations=False, remove_stop_words=False, remove_digits=False, minimal=False):
    """Loads in_sample and out_sample data, cleans them, and exports clean csv files

    Args:
        field (str): the name of the input field
        labels (list[str]): labels we currently consider
        remove_punctuations (boolean): remove punctuations from the description field if True
        remove_stop_words (boolean): remove stop words from the description field if True
        remove_digits (boolean): remove digits from the description field if True
        minimal (boolean): only keep the description and label fields if True

    Returns:
        clean_insample (pd.DataFrame): Training Dataset
        clean_outsample (pd.DataFrame): Testing Dataset
    """
    # Check that data is downloaded
    assert os.path.exists("data/in_sample.csv"), "Need to download in_sample.csv first!"
    assert os.path.exists("data/out_sample.csv"), "Need to download out_sample.csv first!"

    insample = pd.read_csv("data/in_sample.csv")
    clean_insample = clean_data(insample, field, labels, remove_punctuations, remove_stop_words, remove_digits, minimal)
    clean_insample.to_csv('data/clean_in_sample.csv', index=False)

    outsample = pd.read_csv("data/out_sample.csv")
    clean_outsample = clean_data(outsample, field, labels, remove_punctuations, remove_stop_words, remove_digits, minimal)
    clean_outsample.to_csv('data/clean_out_sample.csv', index=False) 

    return clean_insample, clean_outsample


## One Label

In this section, we fine-tune the BERT model on a single label. 

### Load the Dataset

Change `LABELS` and `LABEL` to the target label.

In [5]:
LABELS = ['Intoxication'] 
LABEL = 'Intoxication'

Here we chose **not** to remove any stop words or digits, but you can choose differently.

In [8]:
raw_insample = pd.read_csv("data/in_sample.csv")
raw_outsample = pd.read_csv("data/out_sample.csv")
clean_insample, clean_outsample = load_data("straindescription", LABELS, remove_stop_words=False, remove_digits=False, minimal=True)

Comparison between the raw and cleaned description fields:

In [9]:
raw_outsample.iloc[1, 1]

'"Blue Dream" Agrijuana --- THC = 23.70%   \r\nBlue Dream, a sativa-dominant hybrid originating in California, has achieved legendary status among West Coast strains. Crossing a Blueberry indica with the sativa Haze, Blue Dream balances full-body relaxation with gentle cerebral invigoration. Novice and veteran consumers alike enjoy the level effects of Blue Dream, which ease you gently into a calm euphoria. Some Blue Dream phenotypes express a more indica-like look and feel, but the sativa-leaning variety remains most prevalent.'

In [10]:
clean_outsample.iloc[1, 0]

'"blue dream" agrijuana --- thc = 23.70% blue dream, a sativa-dominant hybrid originating in california, has achieved legendary status among west coast strains. crossing a blueberry indica with the sativa haze, blue dream balances full-body relaxation with gentle cerebral invigoration. novice and veteran consumers alike enjoy the level effects of blue dream, which ease you gently into a calm euphoria. some blue dream phenotypes express a more indica-like look and feel, but the sativa-leaning variety remains most prevalent.'
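The default cleaning path shown above (lowercasing plus whitespace collapsing) can be approximated with the standard library alone. This is a minimal sketch rather than the notebook's `clean_data`: `simple_clean` is a hypothetical helper, and the regular expression is assumed to behave like gensim's `strip_multiple_whitespaces`.

```python
import re

def simple_clean(text):
    """Lowercase the text and collapse every run of whitespace into one space."""
    return re.sub(r"\s+", " ", text.lower())

# A shortened version of the raw description printed above.
raw = '"Blue Dream" Agrijuana --- THC = 23.70%   \r\nBlue Dream, a sativa-dominant hybrid'
cleaned = simple_clean(raw)
print(cleaned)
```

The three spaces and the `\r\n` line break collapse into a single space, which is why the cleaned description reads as one line.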

Split the insample dataset into the training and the validation sets.

In [13]:
val_size = 0.2
random_state = 10
train, val = train_test_split(clean_insample, test_size=val_size, random_state=random_state)
train.to_csv('data/train.csv', index=False)
val.to_csv('data/val.csv', index=False)

Load the training, validation, and test sets into a single object called `dataset`.

In [14]:
dataset = load_dataset('csv', data_files={'train': ['data/train.csv'], 'val': ['data/val.csv'], 'test': ['data/clean_out_sample.csv']})

Using custom data configuration default-a309bb0209129b0a


Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-a309bb0209129b0a\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 3082.54it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1541.65it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-a309bb0209129b0a\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 117.92it/s]


In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 7276
    })
    val: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 1820
    })
    test: Dataset({
        features: ['straindescription', 'Intoxication'],
        num_rows: 5578
    })
})
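The `train` and `val` row counts above follow from `val_size = 0.2`. The sketch below reproduces them by hand, under the assumption that `train_test_split` rounds a fractional validation size up to a whole number of rows and gives the remainder to the training set.

```python
import math

n_insample = 7276 + 1820   # rows in the cleaned in-sample data (train + val above)
val_size = 0.2

# Assumption: a float test_size is rounded up to a whole number of rows.
n_val = math.ceil(val_size * n_insample)
n_train = n_insample - n_val
print(n_train, n_val)
```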

In [16]:
dataset['train'][0] # first observation in the training set

{'straindescription': 'indica kief mix (56.6% thc) by cannasol --- indica // 1g for $25 // by cannasol',
 'Intoxication': 0}

### Tokenize the textual input

The following cell collects the tokenization hyperparameters.

In [17]:
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

Load the pre-trained tokenizer. We currently pad and truncate the text input on the right.

In [18]:
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)
tokenizer

Downloading tokenizer_config.json: 100%|██████████| 28.0/28.0 [00:00<00:00, 28.4kB/s]
Downloading config.json: 100%|██████████| 570/570 [00:00<00:00, 586kB/s]
Downloading vocab.txt: 100%|██████████| 226k/226k [00:00<00:00, 709kB/s] 
Downloading tokenizer.json: 100%|██████████| 455k/455k [00:00<00:00, 1.44MB/s]


PreTrainedTokenizerFast(name_or_path='bert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
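To make the `padding='max_length'`, right-side padding, and right-side truncation settings concrete, here is a minimal pure-Python sketch. `pad_or_truncate` is a hypothetical helper, the token ids are illustrative placeholders, and the sketch ignores the special tokens (`[CLS]`, `[SEP]`) that the real tokenizer adds and preserves. The pad id `0` matches `pad_token_id` in the model configuration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Right-truncate or right-pad a list of token ids to exactly max_len."""
    ids = token_ids[:max_len]                      # truncation on the right drops the tail
    attention_mask = [1] * len(ids)                # 1 marks real tokens
    n_pad = max_len - len(ids)
    ids = ids + [pad_id] * n_pad                   # padding on the right appends pad ids
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding positions
    return ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13, 14], max_len=6)
long_ids, long_mask = pad_or_truncate(list(range(1, 10)), max_len=6)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_len = 150`, every description ends up as exactly 150 ids, so short descriptions are mostly padding and long ones lose everything past the cutoff.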

Define the helper function for preprocessing/tokenizing the data. We can add more arguments in the call `tokenizer()` below to customize it. See more details [here](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__).

In [19]:
def preprocess_function(examples):
    """
    Preprocess the description field
    ---
    Arguments:
    examples (dict): a batch of rows from the dataset; `examples["straindescription"]` holds the sequences to be encoded/tokenized

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions 
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

Preprocess the textual field `straindescription` and edit the tokenized dataset so that it is acceptable to the model.

In [20]:
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns("straindescription")
tokenized_dataset = tokenized_dataset.rename_column(LABEL, "label")
tokenized_dataset

100%|██████████| 8/8 [00:00<00:00, 17.45ba/s]
100%|██████████| 2/2 [00:00<00:00,  9.59ba/s]
100%|██████████| 6/6 [00:00<00:00, 18.25ba/s]


DatasetDict({
    train: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 7276
    })
    val: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1820
    })
    test: Dataset({
        features: ['label', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 5578
    })
})

### Train (fine-tune) the model

Set up the metrics. See the [reference](https://huggingface.co/metrics).

In [21]:
val_eval = {}
test_eval = {}
metric_acc = load_metric("accuracy")
metric_f1 = load_metric("f1")
metric_precision = load_metric("precision")
metric_recall = load_metric("recall")
metric_auc = load_metric("roc_auc")

def compute_metrics(eval_pred):
    """
    Compute the metrics 
    ---
    Arguments:
    eval_pred (tuple): the predicted logits and truth labels

    Returns:
    metrics (dict{str: float}): contains the computed metrics 
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    prediction_scores = np.max(logits, axis=-1)
    print(logits.shape, labels.shape)
    print(predictions.shape, prediction_scores.shape)

    pred_true = np.count_nonzero(predictions)
    pred_false = predictions.shape[0] - pred_true
    actual_true = np.count_nonzero(labels)
    actual_false = labels.shape[0] - actual_true

    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1 = metric_f1.compute(predictions=predictions, references=labels)['f1']
    precision = metric_precision.compute(predictions=predictions, references=labels)['precision']
    recall = metric_recall.compute(predictions=predictions, references=labels)['recall']
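    # NOTE: hard 0/1 predictions are passed as scores here, so this value equals
    # (specificity + sensitivity) / 2 rather than a threshold-free ROC AUC.
    roc_auc = metric_auc.compute(prediction_scores=predictions, references=labels)['roc_auc']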
    matthews_correlation = matthews_corrcoef(y_true=labels, y_pred=predictions)
    cohen_kappa = cohen_kappa_score(y1=labels, y2=predictions)

    tn, fp, fn, tp = confusion_matrix(y_true=labels, y_pred=predictions).ravel()
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    informedness = specificity + sensitivity - 1

    metrics = {
        "pred_true": pred_true,
        "pred_false": pred_false,
        "actual_true": actual_true,
        "actual_false": actual_false,
        "accuracy": acc,
        "f1_score": f1,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc,
        "matthews_correlation": matthews_correlation,
        "cohen_kappa": cohen_kappa,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "true_positive": tp,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "informedness": informedness
    }
    return metrics

Downloading builder script: 4.21kB [00:00, 4.32MB/s]                   
Downloading builder script: 6.50kB [00:00, 6.68MB/s]                   
Downloading builder script: 7.55kB [00:00, 7.76MB/s]                   
Downloading builder script: 7.38kB [00:00, 7.57MB/s]                   
Downloading builder script: 9.55kB [00:00, 9.81MB/s]                   
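One detail of `compute_metrics` is worth illustrating: it passes the hard 0/1 `predictions` as `prediction_scores`, so the reported `roc_auc` reduces to the average of specificity and sensitivity. The sketch below uses a hypothetical pure-Python `roc_auc` helper and made-up toy scores (not data from this notebook) to show how an AUC computed from continuous scores can differ from one computed from thresholded predictions.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
margins = [-2.0, -0.5, 0.3, -0.2, 1.0, 2.5]   # toy stand-in for logits[:, 1] - logits[:, 0]
hard = [1 if m > 0 else 0 for m in margins]   # what argmax over two logits would give

auc_from_scores = roc_auc(labels, margins)    # ranks all positive/negative pairs
auc_from_hard = roc_auc(labels, hard)         # equals (specificity + sensitivity) / 2
print(auc_from_scores, auc_from_hard)
```

If a threshold-free AUC is wanted, one option is to pass a positive-class score such as the logit margin instead of `predictions`; doing so would change the `roc_auc` values logged during training.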


The following cell collects all the [model](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/tokenizer#transformers.PreTrainedTokenizerFast.__call__) and [optimization](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/trainer#transformers.TrainingArguments) hyperparameters we use.

In [22]:
# model hyperparameters
classifier_dropout = 0.15 # dropout ratio for the classification head
num_classes = 2 # number of classes

# optimization hyperparameters
model_dir = "bert_" + LABEL
seed = 42 # random seed for splitting the data into batches
batch_size = 16 # batch size for both training and evaluation
grad_acc_steps = 4 # number of steps for gradient accumulation
lr = 5e-5 # initial learning rate
weight_decay = 2e-3 # weight decay to apply in the AdamW optimizer
epochs = 8 # total number of training epochs 
lr_scheduler = "cosine" # type of learning rate scheduler
strategy = "steps" # strategy for logging, evaluation, and saving
steps = 100 # number of steps for logging, evaluation, and saving
eval_metric = "f1_score" # metric for selecting the best model
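These hyperparameters determine the effective batch size and the number of optimizer updates. The arithmetic below is a sketch that assumes a single device and assumes that the number of updates per epoch is the number of batches floor-divided by `grad_acc_steps`. Under those assumptions it reproduces the `Total train batch size = 64` and `Total optimization steps = 904` lines in the training log.

```python
import math

n_train = 7276          # training rows reported by `dataset`
batch_size = 16
grad_acc_steps = 4
epochs = 8

effective_batch = batch_size * grad_acc_steps              # one device assumed
batches_per_epoch = math.ceil(n_train / batch_size)        # the last batch may be smaller
updates_per_epoch = batches_per_epoch // grad_acc_steps    # assumed: floor division
total_steps = updates_per_epoch * epochs
print(effective_batch, batches_per_epoch, updates_per_epoch, total_steps)
```

With `steps = 100`, evaluation and checkpointing therefore happen a little more often than once per epoch (every 100 of roughly 113 updates).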

Load the pre-trained model. Passing additional arguments to `from_pretrained` overrides the corresponding model hyperparameters, which customizes the architecture of the pre-trained model we load.


In [23]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    classifier_dropout=classifier_dropout,
    num_labels=num_classes
)
model.config

Downloading pytorch_model.bin: 100%|██████████| 420M/420M [00:27<00:00, 16.2MB/s] 
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights

BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

In [26]:
# remove checkpoints left over from a previous run (portable alternative to `rm -rf`)
import shutil
shutil.rmtree(model_dir, ignore_errors=True)

training_args = TrainingArguments(
    output_dir=model_dir,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    gradient_accumulation_steps=grad_acc_steps,
    learning_rate=lr,
    weight_decay=weight_decay, 
    num_train_epochs=epochs,
    lr_scheduler_type=lr_scheduler,
    evaluation_strategy=strategy,
    logging_strategy=strategy, 
    save_strategy=strategy,
    eval_steps=steps,
    logging_steps=steps,
    save_steps=steps,
    seed=seed,
    load_best_model_at_end=True,
    metric_for_best_model=eval_metric,
    report_to="none"
)
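The `cosine` scheduler explains the `learning_rate` values that appear in the training log. As a sketch, assuming no warmup steps and a single half-cosine decay from `lr` to zero over the 904 optimization steps, the learning rate after a given step can be computed as follows (`cosine_lr` is a hypothetical helper, not a library function).

```python
import math

def cosine_lr(step, base_lr=5e-5, total_steps=904, warmup_steps=0):
    """Learning rate after `step` updates under linear warmup plus half-cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

lr_100 = cosine_lr(100)   # compare with the value logged at step 100
lr_900 = cosine_lr(900)   # nearly zero close to the end of training
print(lr_100, lr_900)
```

These values agree with the logged learning rates of about `4.85e-05` at step 100 and `2.4e-09` at step 900.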

Set up the trainer function. See the [reference](https://huggingface.co/docs/transformers/v4.18.0/en/main_classes/trainer#transformers.Trainer).

In [27]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['val'],
    tokenizer=tokenizer,   
    compute_metrics=compute_metrics,
)

In [28]:
trainer.train()

***** Running training *****
  Num examples = 7276
  Num Epochs = 8
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 4
  Total optimization steps = 904
 11%|█         | 100/904 [00:39<04:55,  2.72it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.1653, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}


                                                 
 11%|█         | 100/904 [00:43<04:55,  2.72it/s]Saving model checkpoint to bert_Intoxication\checkpoint-100
Configuration saved in bert_Intoxication\checkpoint-100\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.08346753567457199, 'eval_pred_true': 320, 'eval_pred_false': 1500, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9725274725274725, 'eval_f1_score': 0.9221183800623053, 'eval_precision': 0.925, 'eval_recall': 0.9192546583850931, 'eval_roc_auc': 0.9516166482846694, 'eval_matthews_correlation': 0.9054472682157899, 'eval_cohen_kappa': 0.9054407913878382, 'eval_true_negative': 1474, 'eval_false_positive': 24, 'eval_false_negative': 26, 'eval_true_positive': 296, 'eval_specificity': 0.9839786381842457, 'eval_sensitivity': 0.9192546583850931, 'eval_informedness': 0.9032332965693388, 'eval_runtime': 3.8739, 'eval_samples_per_second': 469.817, 'eval_steps_per_second': 29.428, 'epoch': 0.88}


Model weights saved in bert_Intoxication\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:24<04:32,  2.59it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0641, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}


                                                 
 22%|██▏       | 200/904 [01:28<04:32,  2.59it/s]Saving model checkpoint to bert_Intoxication\checkpoint-200
Configuration saved in bert_Intoxication\checkpoint-200\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.08722817897796631, 'eval_pred_true': 322, 'eval_pred_false': 1498, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9736263736263736, 'eval_f1_score': 0.9254658385093167, 'eval_precision': 0.9254658385093167, 'eval_recall': 0.9254658385093167, 'eval_roc_auc': 0.9547222383467812, 'eval_matthews_correlation': 0.9094444766935624, 'eval_cohen_kappa': 0.9094444766935624, 'eval_true_negative': 1474, 'eval_false_positive': 24, 'eval_false_negative': 24, 'eval_true_positive': 298, 'eval_specificity': 0.9839786381842457, 'eval_sensitivity': 0.9254658385093167, 'eval_informedness': 0.9094444766935625, 'eval_runtime': 4.0551, 'eval_samples_per_second': 448.812, 'eval_steps_per_second': 28.112, 'epoch': 1.76}


Model weights saved in bert_Intoxication\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:09<03:54,  2.58it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0499, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}


                                                 
 33%|███▎      | 300/904 [02:13<03:54,  2.58it/s]Saving model checkpoint to bert_Intoxication\checkpoint-300
Configuration saved in bert_Intoxication\checkpoint-300\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.09073109924793243, 'eval_pred_true': 330, 'eval_pred_false': 1490, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9725274725274725, 'eval_f1_score': 0.9233128834355827, 'eval_precision': 0.9121212121212121, 'eval_recall': 0.9347826086956522, 'eval_roc_auc': 0.9577117315841412, 'eval_matthews_correlation': 0.9066836368888807, 'eval_cohen_kappa': 0.9065823512503594, 'eval_true_negative': 1469, 'eval_false_positive': 29, 'eval_false_negative': 21, 'eval_true_positive': 301, 'eval_specificity': 0.9806408544726302, 'eval_sensitivity': 0.9347826086956522, 'eval_informedness': 0.9154234631682825, 'eval_runtime': 4.055, 'eval_samples_per_second': 448.827, 'eval_steps_per_second': 28.113, 'epoch': 2.65}


Model weights saved in bert_Intoxication\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [02:54<03:14,  2.59it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0269, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}


                                                 
 44%|████▍     | 400/904 [02:58<03:14,  2.59it/s]Saving model checkpoint to bert_Intoxication\checkpoint-400
Configuration saved in bert_Intoxication\checkpoint-400\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.09519203752279282, 'eval_pred_true': 320, 'eval_pred_false': 1500, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.978021978021978, 'eval_f1_score': 0.9376947040498443, 'eval_precision': 0.940625, 'eval_recall': 0.9347826086956522, 'eval_roc_auc': 0.9610495152957568, 'eval_matthews_correlation': 0.9243592452190117, 'eval_cohen_kappa': 0.9243526331102706, 'eval_true_negative': 1479, 'eval_false_positive': 19, 'eval_false_negative': 21, 'eval_true_positive': 301, 'eval_specificity': 0.9873164218958611, 'eval_sensitivity': 0.9347826086956522, 'eval_informedness': 0.9220990305915133, 'eval_runtime': 4.0076, 'eval_samples_per_second': 454.141, 'eval_steps_per_second': 28.446, 'epoch': 3.54}


Model weights saved in bert_Intoxication\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:39<02:39,  2.53it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0256, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}


                                                 
 55%|█████▌    | 500/904 [03:43<02:39,  2.53it/s]Saving model checkpoint to bert_Intoxication\checkpoint-500
Configuration saved in bert_Intoxication\checkpoint-500\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.109396792948246, 'eval_pred_true': 334, 'eval_pred_false': 1486, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9736263736263736, 'eval_f1_score': 0.9268292682926829, 'eval_precision': 0.9101796407185628, 'eval_recall': 0.9440993788819876, 'eval_roc_auc': 0.9620363383061473, 'eval_matthews_correlation': 0.9109768347620685, 'eval_cohen_kappa': 0.9107500429086333, 'eval_true_negative': 1468, 'eval_false_positive': 30, 'eval_false_negative': 18, 'eval_true_positive': 304, 'eval_specificity': 0.9799732977303071, 'eval_sensitivity': 0.9440993788819876, 'eval_informedness': 0.9240726766122946, 'eval_runtime': 4.063, 'eval_samples_per_second': 447.941, 'eval_steps_per_second': 28.058, 'epoch': 4.42}


Model weights saved in bert_Intoxication\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:25<01:57,  2.58it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0206, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}


                                                 
 66%|██████▋   | 600/904 [04:29<01:57,  2.58it/s]Saving model checkpoint to bert_Intoxication\checkpoint-600
Configuration saved in bert_Intoxication\checkpoint-600\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.10205859690904617, 'eval_pred_true': 327, 'eval_pred_false': 1493, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9763736263736263, 'eval_f1_score': 0.9337442218798152, 'eval_precision': 0.926605504587156, 'eval_recall': 0.9409937888198758, 'eval_roc_auc': 0.9624862135020608, 'eval_matthews_correlation': 0.9194092084295595, 'eval_cohen_kappa': 0.9193687975998154, 'eval_true_negative': 1474, 'eval_false_positive': 24, 'eval_false_negative': 19, 'eval_true_positive': 303, 'eval_specificity': 0.9839786381842457, 'eval_sensitivity': 0.9409937888198758, 'eval_informedness': 0.9249724270041215, 'eval_runtime': 4.0638, 'eval_samples_per_second': 447.862, 'eval_steps_per_second': 28.053, 'epoch': 5.31}


Model weights saved in bert_Intoxication\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [05:10<01:15,  2.70it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0096, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}


                                                 
 77%|███████▋  | 700/904 [05:15<01:15,  2.70it/s]Saving model checkpoint to bert_Intoxication\checkpoint-700
Configuration saved in bert_Intoxication\checkpoint-700\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.1318700909614563, 'eval_pred_true': 318, 'eval_pred_false': 1502, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9758241758241758, 'eval_f1_score': 0.93125, 'eval_precision': 0.9371069182389937, 'eval_recall': 0.9254658385093167, 'eval_roc_auc': 0.9560573518314274, 'eval_matthews_correlation': 0.9166103841372993, 'eval_cohen_kappa': 0.9165840284664295, 'eval_true_negative': 1478, 'eval_false_positive': 20, 'eval_false_negative': 24, 'eval_true_positive': 298, 'eval_specificity': 0.986648865153538, 'eval_sensitivity': 0.9254658385093167, 'eval_informedness': 0.9121147036628547, 'eval_runtime': 4.1044, 'eval_samples_per_second': 443.426, 'eval_steps_per_second': 27.775, 'epoch': 6.19}


Model weights saved in bert_Intoxication\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [05:57<00:41,  2.51it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0092, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}


                                                 
 88%|████████▊ | 800/904 [06:01<00:41,  2.51it/s]Saving model checkpoint to bert_Intoxication\checkpoint-800
Configuration saved in bert_Intoxication\checkpoint-800\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.14043883979320526, 'eval_pred_true': 315, 'eval_pred_false': 1505, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9752747252747253, 'eval_f1_score': 0.9293563579277865, 'eval_precision': 0.9396825396825397, 'eval_recall': 0.9192546583850931, 'eval_roc_auc': 0.9532855401404771, 'eval_matthews_correlation': 0.914454658274474, 'eval_cohen_kappa': 0.9143735362997658, 'eval_true_negative': 1479, 'eval_false_positive': 19, 'eval_false_negative': 26, 'eval_true_positive': 296, 'eval_specificity': 0.9873164218958611, 'eval_sensitivity': 0.9192546583850931, 'eval_informedness': 0.9065710802809543, 'eval_runtime': 4.3408, 'eval_samples_per_second': 419.273, 'eval_steps_per_second': 26.262, 'epoch': 7.08}


Model weights saved in bert_Intoxication\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [06:45<00:01,  2.47it/s]***** Running Evaluation *****
  Num examples = 1820
  Batch size = 16


{'loss': 0.0074, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}


                                                 
100%|█████████▉| 900/904 [06:50<00:01,  2.47it/s]Saving model checkpoint to bert_Intoxication\checkpoint-900
Configuration saved in bert_Intoxication\checkpoint-900\config.json


(1820, 2) (1820,)
(1820,) (1820,)
{'eval_loss': 0.14063620567321777, 'eval_pred_true': 315, 'eval_pred_false': 1505, 'eval_actual_true': 322, 'eval_actual_false': 1498, 'eval_accuracy': 0.9752747252747253, 'eval_f1_score': 0.9293563579277865, 'eval_precision': 0.9396825396825397, 'eval_recall': 0.9192546583850931, 'eval_roc_auc': 0.9532855401404771, 'eval_matthews_correlation': 0.914454658274474, 'eval_cohen_kappa': 0.9143735362997658, 'eval_true_negative': 1479, 'eval_false_positive': 19, 'eval_false_negative': 26, 'eval_true_positive': 296, 'eval_specificity': 0.9873164218958611, 'eval_sensitivity': 0.9192546583850931, 'eval_informedness': 0.9065710802809543, 'eval_runtime': 4.1217, 'eval_samples_per_second': 441.561, 'eval_steps_per_second': 27.658, 'epoch': 7.96}


Model weights saved in bert_Intoxication\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [06:55<00:00,  1.24s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Intoxication\checkpoint-400 (score: 0.9376947040498443).
100%|██████████| 904/904 [06:55<00:00,  2.17it/s]

{'train_runtime': 415.9625, 'train_samples_per_second': 139.936, 'train_steps_per_second': 2.173, 'train_loss': 0.04188883714934201, 'epoch': 7.99}





TrainOutput(global_step=904, training_loss=0.04188883714934201, metrics={'train_runtime': 415.9625, 'train_samples_per_second': 139.936, 'train_steps_per_second': 2.173, 'train_loss': 0.04188883714934201, 'epoch': 7.99})
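
The `learning_rate` values in the training log can be reproduced by hand. The following is a minimal sketch, assuming a cosine schedule that decays from an initial learning rate of `5e-5` to zero over the 904 optimization steps with no warmup. These assumptions mirror the `lr` and `lr_scheduler` hyperparameters used in this notebook; `cosine_lr` is a hypothetical helper written for illustration, not a `transformers` function.

```python
import math

def cosine_lr(step, total_steps, initial_lr):
    """Cosine decay from initial_lr to 0 over total_steps (no warmup assumed)."""
    progress = step / total_steps
    return initial_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# steps at which the log reports a learning rate
for step in (100, 800, 900):
    print(step, cosine_lr(step, total_steps=904, initial_lr=5e-5))
```

Under these assumptions the values agree with the logged ones, for example about `1.615e-06` at step 800 and about `2.415e-09` at step 900.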

### Evaluate the model 

Print the model architecture.

In [29]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

Count the total number of trainable model parameters.

In [30]:
sum(p.numel() for p in model.parameters() if p.requires_grad)

109483778
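
This count can be checked by hand from the layer shapes. The following is a minimal sketch, assuming the standard `bert-base-uncased` sizes (vocabulary 30522, hidden size 768, 12 layers, intermediate size 3072, 512 positions) plus a pooler and a 2-class classification head. `count_bert_parameters` is a hypothetical helper written for this illustration.

```python
def count_bert_parameters(vocab=30522, hidden=768, layers=12, intermediate=3072,
                          max_positions=512, type_vocab=2, num_labels=2):
    """Count weights and biases of a BERT encoder with a classification head."""
    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix + bias vector

    layer_norm = 2 * hidden  # scale + shift

    # word, position, and token-type embeddings, followed by a LayerNorm
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
    # query, key, value, and output projections, followed by a LayerNorm
    attention = 4 * linear(hidden, hidden) + layer_norm
    # two-layer feed-forward block, followed by a LayerNorm
    feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
    encoder = layers * (attention + feed_forward)
    pooler = linear(hidden, hidden)
    classifier = linear(hidden, num_labels)
    return embeddings + encoder + pooler + classifier

total = count_bert_parameters()
print(total)  # 109483778
```

Almost all of the parameters sit in the embeddings (about 23.8 million) and the 12 encoder layers (about 85.1 million); the classification head itself adds only 1,538.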

Evaluate the best model (checkpoint) on the validation and test sets, then save the model and the evaluation metrics.

In [31]:
# set up directory paths
best_model_dir = "best_" + model_dir
best_model_dir_zip = "best_" + model_dir + ".zip"
os.system(f'rm -rf {best_model_dir} {best_model_dir_zip}') # remove possible cache

# evaluate the best model
val_predictions = trainer.predict(tokenized_dataset["val"])
val_eval[LABEL] = val_predictions.metrics
test_predictions = trainer.predict(tokenized_dataset["test"])
test_eval[LABEL] = test_predictions.metrics

# save the best model
model.save_pretrained(best_model_dir)
os.system(f'zip -r {best_model_dir_zip} {best_model_dir}')

# save the evaluation result of each model
val_eval_df = pd.DataFrame.from_dict(val_eval).transpose()
val_eval_df.to_csv("metrics/val_evaluation.csv")
test_eval_df = pd.DataFrame.from_dict(test_eval).transpose()
test_eval_df.to_csv("metrics/test_evaluation.csv")

***** Running Prediction *****
  Num examples = 1820
  Batch size = 16
100%|██████████| 114/114 [00:03<00:00, 28.78it/s]
***** Running Prediction *****
  Num examples = 5578
  Batch size = 16


(1820, 2) (1820,)
(1820,) (1820,)


100%|██████████| 349/349 [00:12<00:00, 28.24it/s]
Configuration saved in best_bert_Intoxication\config.json


(5578, 2) (5578,)
(5578,) (5578,)


Model weights saved in best_bert_Intoxication\pytorch_model.bin
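
The threshold-based metrics in the evaluation logs all follow from the four confusion-matrix counts. The following is a minimal sketch using the counts logged at step 700 for the Intoxication model. `confusion_metrics` is a hypothetical helper written for illustration; it is not the notebook's `compute_metrics`.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Derive threshold-based metrics from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # also called sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1_score": 2 * precision * recall / (precision + recall),
        "specificity": specificity,
        "informedness": specificity + recall - 1,
        "balanced_accuracy": (specificity + recall) / 2,
    }

# counts taken from the step-700 evaluation log of the Intoxication model
metrics = confusion_metrics(tn=1478, fp=20, fn=24, tp=298)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The balanced accuracy computed this way (about 0.9561) coincides with the logged `eval_roc_auc`. This is what one would expect when the ROC AUC is computed from hard 0/1 predictions rather than from class probabilities.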


## All the Labels

In this section, we fine-tune one model per label. Running the following cell produces fine-tuned models for all the labels.

Notes about building models for all the labels:
1. We can clean or preprocess the input differently for each label by passing different `remove_punctuations`, `remove_stop_words`, and `remove_digits` arguments. By default, we only remove extra white space.
2. Decreasing `max_len` speeds up training, but it can hurt model performance.
3. Although we have a variable called `batch_size`, the effective batch size during training is `batch_size * grad_acc_steps`, which is reported as `Total train batch size` in the training log. This is due to gradient accumulation. See more details about it [here](https://huggingface.co/docs/transformers/main/en/performance#gradient-accumulation).
4. If GPU memory is insufficient, consider lowering `batch_size` or `max_len` so that less data is held in GPU memory at a time.
5. If disk space is insufficient, consider increasing `steps` so that fewer model checkpoints are saved.
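
The effect of gradient accumulation on the number of optimization steps can be worked out directly. The following is a minimal sketch, assuming a single GPU, a training set of 7276 examples (an assumed size consistent with an 80/20 split that leaves 1819 validation examples), and a trainer that counts only complete accumulation groups in each epoch. `training_schedule` is a hypothetical helper, not part of `transformers`.

```python
import math

def training_schedule(num_train_examples, batch_size, grad_acc_steps, epochs):
    """Effective batch size and number of optimizer updates (single device assumed)."""
    batches_per_epoch = math.ceil(num_train_examples / batch_size)
    # only complete groups of grad_acc_steps batches count as an update (assumption)
    updates_per_epoch = max(batches_per_epoch // grad_acc_steps, 1)
    return {
        "effective_batch_size": batch_size * grad_acc_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": updates_per_epoch * epochs,
    }

schedule = training_schedule(num_train_examples=7276, batch_size=16,
                             grad_acc_steps=4, epochs=8)
print(schedule)
```

With `batch_size = 16`, `grad_acc_steps = 4`, and `epochs = 8`, this gives an effective batch size of 64 and 904 updates in total, which agrees with the `904` shown in the training progress bars.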

In [33]:
LABELS = ["Cannabinoid", "Intoxication", "Medical", "Wellness", "Commoditization"]

### Preprocess Setup ###
# Dataset Splitting Hyperparameters
val_size = 0.2 # validation set size
random_state = 10 # random seed 

# Tokenization Hyperparameters
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

### Load the pre-trained tokenizer ###
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)

### Define the preprocess function ###
def preprocess_function(examples):
    """
    Tokenize the description field
    ---
    Arguments:
    examples (dict): a batch of dataset rows; examples["straindescription"] holds the descriptions (str or List[str]) to be tokenized

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

### Evaluation Metrics ###
val_eval = {}
test_eval = {}
metric_acc = load_metric("accuracy")
metric_f1 = load_metric("f1")
metric_precision = load_metric("precision")
metric_recall = load_metric("recall")
metric_auc = load_metric("roc_auc")

def compute_metrics(eval_pred):
    """
    Compute the metrics 
    ---
    Arguments:
    eval_pred (tuple): the predicted logits and truth labels

    Returns:
    metrics (dict{str: float}): contains the computed metrics 
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    prediction_scores = np.max(logits, axis=-1)
    print(logits.shape, labels.shape)
    print(predictions.shape, prediction_scores.shape)

    pred_true = np.count_nonzero(predictions)
    pred_false = predictions.shape[0] - pred_true
    actual_true = np.count_nonzero(labels)
    actual_false = labels.shape[0] - actual_true

    acc = metric_acc.compute(predictions=predictions, references=labels)['accuracy']
    f1 = metric_f1.compute(predictions=predictions, references=labels)['f1']
    precision = metric_precision.compute(predictions=predictions, references=labels)['precision']
    recall = metric_recall.compute(predictions=predictions, references=labels)['recall']
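    # note: the hard 0/1 predictions (not class scores) are passed as prediction_scores,
    # so this value equals the balanced accuracy, (sensitivity + specificity) / 2
    roc_auc = metric_auc.compute(prediction_scores=predictions, references=labels)['roc_auc']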
    matthews_correlation = matthews_corrcoef(y_true=labels, y_pred=predictions)
    cohen_kappa = cohen_kappa_score(y1=labels, y2=predictions)

    tn, fp, fn, tp = confusion_matrix(y_true=labels, y_pred=predictions).ravel()
    specificity = tn / (tn + fp)
    sensitivity = tp / (tp + fn)
    informedness = specificity + sensitivity - 1

    metrics = {
        "pred_true": pred_true,
        "pred_false": pred_false,
        "actual_true": actual_true,
        "actual_false": actual_false,
        "accuracy": acc,
        "f1_score": f1,
        "precision": precision,
        "recall": recall,
        "roc_auc": roc_auc,
        "matthews_correlation": matthews_correlation,
        "cohen_kappa": cohen_kappa,
        "true_negative": tn,
        "false_positive": fp,
        "false_negative": fn,
        "true_positive": tp,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "informedness": informedness
    }
    return metrics

### Training and Model Setup ###
# model hyperparameters
classifier_dropout = 0.15 # dropout ratio for the classification head
num_classes = 2 # number of classes

# optimization hyperparameters
seed = 42 # random seed for splitting the data into batches
batch_size = 16 # batch size for both training and evaluation
grad_acc_steps = 4 # number of steps for gradient accumulation
lr = 5e-5 # initial learning rate
weight_decay = 2e-3 # weight decay to apply in the AdamW optimizer
epochs = 8 # total number of training epochs 
lr_scheduler = "cosine" # type of learning rate scheduler
strategy = "steps" # strategy for logging, evaluation, and saving
steps = 100 # number of steps for logging, evaluation, and saving
eval_metric = "f1_score" # metric for selecting the best model

### Training ###
# fine-tune a separate model for each label
for label in LABELS:

    # load the datasets
    raw_insample = pd.read_csv("data/in_sample.csv")
    raw_outsample = pd.read_csv("data/out_sample.csv")
    clean_insample, clean_outsample = load_data("straindescription", LABELS, minimal=True)
    train, val = train_test_split(clean_insample, test_size=val_size, random_state=random_state)
    train.to_csv('data/train.csv', index=False)
    val.to_csv('data/val.csv', index=False)
    dataset = load_dataset('csv', data_files={'train': ['data/train.csv'], 'val': ['data/val.csv'], 'test': ['data/clean_out_sample.csv']})

    # preprocess the textual input 
    tokenized_dataset = dataset.map(preprocess_function, batched=True)
    tokenized_dataset = tokenized_dataset.remove_columns("straindescription")

    # set up directory paths
    model_dir = "bert_" + label
    best_model_dir = "best_" + model_dir
    best_model_dir_zip = "best_" + model_dir + ".zip"
    os.system(f'rm -rf {model_dir} {best_model_dir} {best_model_dir_zip}') # remove possible cache

    # remove other labels and rename the target label
    other_labels = list(filter(lambda x: x != label, LABELS))
    tokenized_dataset_label = tokenized_dataset.remove_columns(other_labels)
    tokenized_dataset_label = tokenized_dataset_label.rename_column(label, "label")

    # load the pre-trained model
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        classifier_dropout=classifier_dropout,
        num_labels=num_classes
    )

    # set up the training arguments
    training_args = TrainingArguments(
        output_dir=model_dir,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        gradient_accumulation_steps=grad_acc_steps,
        learning_rate=lr,
        weight_decay=weight_decay, 
        num_train_epochs=epochs,
        lr_scheduler_type=lr_scheduler,
        evaluation_strategy=strategy,
        logging_strategy=strategy, 
        save_strategy=strategy,
        eval_steps=steps,
        logging_steps=steps,
        save_steps=steps,
        seed=seed,
        load_best_model_at_end=True,
        metric_for_best_model=eval_metric,
        report_to="none"
    )

    # set up the trainer 
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset_label['train'],
        eval_dataset=tokenized_dataset_label['val'],
        tokenizer=tokenizer,   
        compute_metrics=compute_metrics,
    )

    # train (fine-tune) the model
    trainer.train()

    # evaluate the best model
    val_predictions = trainer.predict(tokenized_dataset_label["val"])
    val_eval[label] = val_predictions.metrics
    test_predictions = trainer.predict(tokenized_dataset_label["test"])
    test_eval[label] = test_predictions.metrics

    # save the best model
    model.save_pretrained(best_model_dir)
    os.system(f'zip -r {best_model_dir_zip} {best_model_dir}')

# save the evaluation result of each model
val_eval_df = pd.DataFrame.from_dict(val_eval).transpose()
val_eval_df.to_csv("metrics/val_evaluation.csv")
test_eval_df = pd.DataFrame.from_dict(test_eval).transpose()
test_eval_df.to_csv("metrics/test_evaluation.csv")

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Wenhao/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/

Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-d184c0856eacff86\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 3081.78it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1541.27it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-d184c0856eacff86\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 114.15it/s]
100%|██████████| 8/8 [00:00<00:00, 17.22ba/s]
100%|██████████| 2/2 [00:00<00:00, 18.25ba/s]
100%|██████████| 6/6 [00:00<00:00, 11.08ba/s]
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Wenhao/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embe

{'loss': 0.1039, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}



 11%|█         | 100/904 [00:42<05:11,  2.58it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-100
Configuration saved in bert_Cannabinoid\checkpoint-100\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.043513040989637375, 'eval_pred_true': 1505, 'eval_pred_false': 314, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9912039582188016, 'eval_f1_score': 0.9946984758117958, 'eval_precision': 0.9973421926910299, 'eval_recall': 0.9920687376074026, 'eval_roc_auc': 0.989498421091283, 'eval_matthews_correlation': 0.9690103018864258, 'eval_cohen_kappa': 0.9688930881604768, 'eval_true_negative': 302, 'eval_false_positive': 4, 'eval_false_negative': 12, 'eval_true_positive': 1501, 'eval_specificity': 0.9869281045751634, 'eval_sensitivity': 0.9920687376074026, 'eval_informedness': 0.9789968421825659, 'eval_runtime': 4.1556, 'eval_samples_per_second': 437.725, 'eval_steps_per_second': 27.433, 'epoch': 0.88}


Model weights saved in bert_Cannabinoid\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:23<04:31,  2.59it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0299, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}



 22%|██▏       | 200/904 [01:28<04:31,  2.59it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-200
Configuration saved in bert_Cannabinoid\checkpoint-200\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.035827070474624634, 'eval_pred_true': 1516, 'eval_pred_false': 303, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9928532160527762, 'eval_f1_score': 0.9957081545064379, 'eval_precision': 0.9947229551451188, 'eval_recall': 0.9966953073364178, 'eval_roc_auc': 0.9852757582433723, 'eval_matthews_correlation': 0.9743788642043096, 'eval_cohen_kappa': 0.9743618099714312, 'eval_true_negative': 298, 'eval_false_positive': 8, 'eval_false_negative': 5, 'eval_true_positive': 1508, 'eval_specificity': 0.9738562091503268, 'eval_sensitivity': 0.9966953073364178, 'eval_informedness': 0.9705515164867444, 'eval_runtime': 4.1051, 'eval_samples_per_second': 443.106, 'eval_steps_per_second': 27.77, 'epoch': 1.76}


Model weights saved in bert_Cannabinoid\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:11<04:02,  2.49it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0207, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}



 33%|███▎      | 300/904 [02:16<04:02,  2.49it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-300
Configuration saved in bert_Cannabinoid\checkpoint-300\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.0395059734582901, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.996031746031746, 'eval_precision': 0.9966909331568498, 'eval_recall': 0.9953734302709848, 'eval_roc_auc': 0.9895167804949696, 'eval_matthews_correlation': 0.9764953203908172, 'eval_cohen_kappa': 0.9764878227430873, 'eval_true_negative': 301, 'eval_false_positive': 5, 'eval_false_negative': 7, 'eval_true_positive': 1506, 'eval_specificity': 0.9836601307189542, 'eval_sensitivity': 0.9953734302709848, 'eval_informedness': 0.979033560989939, 'eval_runtime': 4.1131, 'eval_samples_per_second': 442.245, 'eval_steps_per_second': 27.716, 'epoch': 2.65}


Model weights saved in bert_Cannabinoid\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [02:58<03:20,  2.52it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0171, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}



 44%|████▍     | 400/904 [03:02<03:20,  2.52it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-400
Configuration saved in bert_Cannabinoid\checkpoint-400\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.03570589795708656, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.994502473886751, 'eval_f1_score': 0.9966931216931216, 'eval_precision': 0.9973527465254798, 'eval_recall': 0.9960343688037012, 'eval_roc_auc': 0.9914812366894323, 'eval_matthews_correlation': 0.9804140466887513, 'eval_cohen_kappa': 0.9804065189525728, 'eval_true_negative': 302, 'eval_false_positive': 4, 'eval_false_negative': 6, 'eval_true_positive': 1507, 'eval_specificity': 0.9869281045751634, 'eval_sensitivity': 0.9960343688037012, 'eval_informedness': 0.9829624733788647, 'eval_runtime': 4.1007, 'eval_samples_per_second': 443.579, 'eval_steps_per_second': 27.8, 'epoch': 3.54}


Model weights saved in bert_Cannabinoid\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:46<02:37,  2.57it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0162, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}



 55%|█████▌    | 500/904 [03:50<02:37,  2.57it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-500
Configuration saved in bert_Cannabinoid\checkpoint-500\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.04298040270805359, 'eval_pred_true': 1513, 'eval_pred_false': 306, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.9960343688037012, 'eval_precision': 0.9960343688037012, 'eval_recall': 0.9960343688037012, 'eval_roc_auc': 0.9882132628332232, 'eval_matthews_correlation': 0.9764265256664464, 'eval_cohen_kappa': 0.9764265256664464, 'eval_true_negative': 300, 'eval_false_positive': 6, 'eval_false_negative': 6, 'eval_true_positive': 1507, 'eval_specificity': 0.9803921568627451, 'eval_sensitivity': 0.9960343688037012, 'eval_informedness': 0.9764265256664464, 'eval_runtime': 4.057, 'eval_samples_per_second': 448.357, 'eval_steps_per_second': 28.099, 'epoch': 4.42}


Model weights saved in bert_Cannabinoid\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:33<01:59,  2.53it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0091, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}



 66%|██████▋   | 600/904 [04:38<01:59,  2.53it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-600
Configuration saved in bert_Cannabinoid\checkpoint-600\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.050997599959373474, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.996031746031746, 'eval_precision': 0.9966909331568498, 'eval_recall': 0.9953734302709848, 'eval_roc_auc': 0.9895167804949696, 'eval_matthews_correlation': 0.9764953203908172, 'eval_cohen_kappa': 0.9764878227430873, 'eval_true_negative': 301, 'eval_false_positive': 5, 'eval_false_negative': 7, 'eval_true_positive': 1506, 'eval_specificity': 0.9836601307189542, 'eval_sensitivity': 0.9953734302709848, 'eval_informedness': 0.979033560989939, 'eval_runtime': 4.1454, 'eval_samples_per_second': 438.796, 'eval_steps_per_second': 27.5, 'epoch': 5.31}


Model weights saved in bert_Cannabinoid\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [05:21<01:18,  2.59it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0026, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}



 77%|███████▋  | 700/904 [05:25<01:18,  2.59it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-700
Configuration saved in bert_Cannabinoid\checkpoint-700\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.05447913706302643, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.996031746031746, 'eval_precision': 0.9966909331568498, 'eval_recall': 0.9953734302709848, 'eval_roc_auc': 0.9895167804949696, 'eval_matthews_correlation': 0.9764953203908172, 'eval_cohen_kappa': 0.9764878227430873, 'eval_true_negative': 301, 'eval_false_positive': 5, 'eval_false_negative': 7, 'eval_true_positive': 1506, 'eval_specificity': 0.9836601307189542, 'eval_sensitivity': 0.9953734302709848, 'eval_informedness': 0.979033560989939, 'eval_runtime': 4.0888, 'eval_samples_per_second': 444.877, 'eval_steps_per_second': 27.881, 'epoch': 6.19}


Model weights saved in bert_Cannabinoid\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [06:08<00:40,  2.56it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0025, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}



 88%|████████▊ | 800/904 [06:12<00:40,  2.56it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-800
Configuration saved in bert_Cannabinoid\checkpoint-800\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.05653368681669235, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.996031746031746, 'eval_precision': 0.9966909331568498, 'eval_recall': 0.9953734302709848, 'eval_roc_auc': 0.9895167804949696, 'eval_matthews_correlation': 0.9764953203908172, 'eval_cohen_kappa': 0.9764878227430873, 'eval_true_negative': 301, 'eval_false_positive': 5, 'eval_false_negative': 7, 'eval_true_positive': 1506, 'eval_specificity': 0.9836601307189542, 'eval_sensitivity': 0.9953734302709848, 'eval_informedness': 0.979033560989939, 'eval_runtime': 4.129, 'eval_samples_per_second': 440.542, 'eval_steps_per_second': 27.61, 'epoch': 7.08}


Model weights saved in bert_Cannabinoid\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [06:56<00:01,  2.37it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0017, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}



100%|█████████▉| 900/904 [07:00<00:01,  2.37it/s]Saving model checkpoint to bert_Cannabinoid\checkpoint-900
Configuration saved in bert_Cannabinoid\checkpoint-900\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.05576470121741295, 'eval_pred_true': 1511, 'eval_pred_false': 308, 'eval_actual_true': 1513, 'eval_actual_false': 306, 'eval_accuracy': 0.9934029686641012, 'eval_f1_score': 0.996031746031746, 'eval_precision': 0.9966909331568498, 'eval_recall': 0.9953734302709848, 'eval_roc_auc': 0.9895167804949696, 'eval_matthews_correlation': 0.9764953203908172, 'eval_cohen_kappa': 0.9764878227430873, 'eval_true_negative': 301, 'eval_false_positive': 5, 'eval_false_negative': 7, 'eval_true_positive': 1506, 'eval_specificity': 0.9836601307189542, 'eval_sensitivity': 0.9953734302709848, 'eval_informedness': 0.979033560989939, 'eval_runtime': 4.4148, 'eval_samples_per_second': 412.025, 'eval_steps_per_second': 25.822, 'epoch': 7.96}


Model weights saved in bert_Cannabinoid\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Cannabinoid\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Cannabinoid\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [07:06<00:00,  1.29s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Cannabinoid\checkpoint-400 (score: 0.9966931216931216).
100%|██████████| 904/904 [07:06<00:00,  2.12it/s]
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


{'train_runtime': 426.4783, 'train_samples_per_second': 136.485, 'train_steps_per_second': 2.12, 'train_loss': 0.022567778233381037, 'epoch': 7.99}


100%|██████████| 114/114 [00:04<00:00, 27.54it/s]
***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


100%|██████████| 349/349 [00:12<00:00, 28.43it/s]
Configuration saved in best_bert_Cannabinoid\config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Cannabinoid\pytorch_model.bin
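
With `load_best_model_at_end=True` and `metric_for_best_model` set to `f1_score`, the checkpoint that is reloaded after training is the one with the highest validation F1 score. The following is a minimal sketch of that selection, using the `eval_f1_score` values logged above for the Cannabinoid model. It is a simplification for illustration: the actual `Trainer` logic also handles options such as `greater_is_better`, which are ignored here.

```python
# eval_f1_score logged at each evaluation step for the Cannabinoid model
f1_by_step = {
    100: 0.9946984758117958,
    200: 0.9957081545064379,
    300: 0.996031746031746,
    400: 0.9966931216931216,
    500: 0.9960343688037012,
    600: 0.996031746031746,
    700: 0.996031746031746,
    800: 0.996031746031746,
    900: 0.996031746031746,
}

# pick the step whose checkpoint scored highest on the validation set
best_step = max(f1_by_step, key=f1_by_step.get)
print(f"checkpoint-{best_step}", f1_by_step[best_step])
```

The selected step, 400, matches the `Loading best model from bert_Cannabinoid\checkpoint-400` message in the log, even though the training loss keeps decreasing afterwards.
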
Using custom data configuration default-1cf62e4eda38bbd3


Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-1cf62e4eda38bbd3\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 385.01it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-1cf62e4eda38bbd3\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 36.42it/s]
100%|██████████| 8/8 [00:01<00:00,  7.09ba/s]
100%|██████████| 2/2 [00:00<00:00,  9.52ba/s]
100%|██████████| 6/6 [00:00<00:00, 11.10ba/s]
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Wenhao/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embed

{'loss': 0.1733, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}



 11%|█         | 100/904 [00:43<05:03,  2.65it/s]Saving model checkpoint to bert_Intoxication\checkpoint-100
Configuration saved in bert_Intoxication\checkpoint-100\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11114317178726196, 'eval_pred_true': 340, 'eval_pred_false': 1479, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9642660802638813, 'eval_f1_score': 0.9022556390977443, 'eval_precision': 0.8823529411764706, 'eval_recall': 0.9230769230769231, 'eval_roc_auc': 0.9481515806817011, 'eval_matthews_correlation': 0.8807413000939107, 'eval_cohen_kappa': 0.8804058120539937, 'eval_true_negative': 1454, 'eval_false_positive': 40, 'eval_false_negative': 25, 'eval_true_positive': 300, 'eval_specificity': 0.9732262382864793, 'eval_sensitivity': 0.9230769230769231, 'eval_informedness': 0.8963031613634023, 'eval_runtime': 3.9844, 'eval_samples_per_second': 456.535, 'eval_steps_per_second': 28.612, 'epoch': 0.88}


Model weights saved in bert_Intoxication\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:23<04:39,  2.52it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0812, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}



 22%|██▏       | 200/904 [01:28<04:39,  2.52it/s]Saving model checkpoint to bert_Intoxication\checkpoint-200
Configuration saved in bert_Intoxication\checkpoint-200\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.08136641979217529, 'eval_pred_true': 318, 'eval_pred_false': 1501, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9741616272677295, 'eval_f1_score': 0.926905132192846, 'eval_precision': 0.9371069182389937, 'eval_recall': 0.916923076923077, 'eval_roc_auc': 0.9517680980331583, 'eval_matthews_correlation': 0.9112942466046706, 'eval_cohen_kappa': 0.9112145698954324, 'eval_true_negative': 1474, 'eval_false_positive': 20, 'eval_false_negative': 27, 'eval_true_positive': 298, 'eval_specificity': 0.9866131191432396, 'eval_sensitivity': 0.916923076923077, 'eval_informedness': 0.9035361960663164, 'eval_runtime': 4.1264, 'eval_samples_per_second': 440.819, 'eval_steps_per_second': 27.627, 'epoch': 1.76}


Model weights saved in bert_Intoxication\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:08<03:51,  2.61it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0558, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}



 33%|███▎      | 300/904 [02:12<03:51,  2.61it/s]Saving model checkpoint to bert_Intoxication\checkpoint-300
Configuration saved in bert_Intoxication\checkpoint-300\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.08304107934236526, 'eval_pred_true': 332, 'eval_pred_false': 1487, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9763606377130292, 'eval_f1_score': 0.9345509893455098, 'eval_precision': 0.9246987951807228, 'eval_recall': 0.9446153846153846, 'eval_roc_auc': 0.9639408917722171, 'eval_matthews_correlation': 0.9202060870062253, 'eval_cohen_kappa': 0.9201282979486012, 'eval_true_negative': 1469, 'eval_false_positive': 25, 'eval_false_negative': 18, 'eval_true_positive': 307, 'eval_specificity': 0.9832663989290495, 'eval_sensitivity': 0.9446153846153846, 'eval_informedness': 0.9278817835444342, 'eval_runtime': 3.8323, 'eval_samples_per_second': 474.654, 'eval_steps_per_second': 29.747, 'epoch': 2.65}


Model weights saved in bert_Intoxication\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [02:49<03:00,  2.78it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0372, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}



 44%|████▍     | 400/904 [02:53<03:00,  2.78it/s]Saving model checkpoint to bert_Intoxication\checkpoint-400
Configuration saved in bert_Intoxication\checkpoint-400\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.09848097711801529, 'eval_pred_true': 325, 'eval_pred_false': 1494, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.976910390324354, 'eval_f1_score': 0.9353846153846154, 'eval_precision': 0.9353846153846154, 'eval_recall': 0.9353846153846154, 'eval_roc_auc': 0.9606641952425086, 'eval_matthews_correlation': 0.921328390485017, 'eval_cohen_kappa': 0.921328390485017, 'eval_true_negative': 1473, 'eval_false_positive': 21, 'eval_false_negative': 21, 'eval_true_positive': 304, 'eval_specificity': 0.9859437751004017, 'eval_sensitivity': 0.9353846153846154, 'eval_informedness': 0.9213283904850171, 'eval_runtime': 3.7428, 'eval_samples_per_second': 486.0, 'eval_steps_per_second': 30.458, 'epoch': 3.54}


Model weights saved in bert_Intoxication\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:31<02:27,  2.74it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0322, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}



 55%|█████▌    | 500/904 [03:35<02:27,  2.74it/s]Saving model checkpoint to bert_Intoxication\checkpoint-500
Configuration saved in bert_Intoxication\checkpoint-500\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.10159847885370255, 'eval_pred_true': 326, 'eval_pred_false': 1493, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.977460142935679, 'eval_f1_score': 0.9370199692780339, 'eval_precision': 0.9355828220858896, 'eval_recall': 0.9384615384615385, 'eval_roc_auc': 0.9622026567809701, 'eval_matthews_correlation': 0.9232954779491716, 'eval_cohen_kappa': 0.923293862089607, 'eval_true_negative': 1473, 'eval_false_positive': 21, 'eval_false_negative': 20, 'eval_true_positive': 305, 'eval_specificity': 0.9859437751004017, 'eval_sensitivity': 0.9384615384615385, 'eval_informedness': 0.9244053135619401, 'eval_runtime': 3.7623, 'eval_samples_per_second': 483.477, 'eval_steps_per_second': 30.3, 'epoch': 4.42}


Model weights saved in bert_Intoxication\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:13<01:50,  2.76it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0269, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}



 66%|██████▋   | 600/904 [04:17<01:50,  2.76it/s]Saving model checkpoint to bert_Intoxication\checkpoint-600
Configuration saved in bert_Intoxication\checkpoint-600\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.12242142111063004, 'eval_pred_true': 314, 'eval_pred_false': 1505, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9741616272677295, 'eval_f1_score': 0.9264475743348982, 'eval_precision': 0.9426751592356688, 'eval_recall': 0.9107692307692308, 'eval_roc_auc': 0.9493605189990731, 'eval_matthews_correlation': 0.910979939056208, 'eval_cohen_kappa': 0.9107813170173266, 'eval_true_negative': 1476, 'eval_false_positive': 18, 'eval_false_negative': 29, 'eval_true_positive': 296, 'eval_specificity': 0.9879518072289156, 'eval_sensitivity': 0.9107692307692308, 'eval_informedness': 0.8987210379981465, 'eval_runtime': 3.7771, 'eval_samples_per_second': 481.585, 'eval_steps_per_second': 30.182, 'epoch': 5.31}


Model weights saved in bert_Intoxication\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [04:54<01:14,  2.75it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0163, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}



 77%|███████▋  | 700/904 [04:58<01:14,  2.75it/s]Saving model checkpoint to bert_Intoxication\checkpoint-700
Configuration saved in bert_Intoxication\checkpoint-700\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.1236707791686058, 'eval_pred_true': 319, 'eval_pred_false': 1500, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9747113798790544, 'eval_f1_score': 0.9285714285714286, 'eval_precision': 0.9373040752351097, 'eval_recall': 0.92, 'eval_roc_auc': 0.9533065595716199, 'eval_matthews_correlation': 0.9132675094604481, 'eval_cohen_kappa': 0.9132089875799462, 'eval_true_negative': 1474, 'eval_false_positive': 20, 'eval_false_negative': 26, 'eval_true_positive': 299, 'eval_specificity': 0.9866131191432396, 'eval_sensitivity': 0.92, 'eval_informedness': 0.9066131191432396, 'eval_runtime': 3.776, 'eval_samples_per_second': 481.722, 'eval_steps_per_second': 30.19, 'epoch': 6.19}


Model weights saved in bert_Intoxication\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [05:36<00:38,  2.70it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0145, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}



 88%|████████▊ | 800/904 [05:40<00:38,  2.70it/s]Saving model checkpoint to bert_Intoxication\checkpoint-800
Configuration saved in bert_Intoxication\checkpoint-800\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11874077469110489, 'eval_pred_true': 322, 'eval_pred_false': 1497, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9763606377130292, 'eval_f1_score': 0.9335394126738794, 'eval_precision': 0.937888198757764, 'eval_recall': 0.9292307692307692, 'eval_roc_auc': 0.9579219441870045, 'eval_matthews_correlation': 0.9191779441572238, 'eval_cohen_kappa': 0.919163325902523, 'eval_true_negative': 1474, 'eval_false_positive': 20, 'eval_false_negative': 23, 'eval_true_positive': 302, 'eval_specificity': 0.9866131191432396, 'eval_sensitivity': 0.9292307692307692, 'eval_informedness': 0.9158438883740088, 'eval_runtime': 3.8131, 'eval_samples_per_second': 477.036, 'eval_steps_per_second': 29.897, 'epoch': 7.08}


Model weights saved in bert_Intoxication\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [06:18<00:01,  2.71it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0102, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}



100%|█████████▉| 900/904 [06:22<00:01,  2.71it/s]Saving model checkpoint to bert_Intoxication\checkpoint-900
Configuration saved in bert_Intoxication\checkpoint-900\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11955656111240387, 'eval_pred_true': 322, 'eval_pred_false': 1497, 'eval_actual_true': 325, 'eval_actual_false': 1494, 'eval_accuracy': 0.9763606377130292, 'eval_f1_score': 0.9335394126738794, 'eval_precision': 0.937888198757764, 'eval_recall': 0.9292307692307692, 'eval_roc_auc': 0.9579219441870045, 'eval_matthews_correlation': 0.9191779441572238, 'eval_cohen_kappa': 0.919163325902523, 'eval_true_negative': 1474, 'eval_false_positive': 20, 'eval_false_negative': 23, 'eval_true_positive': 302, 'eval_specificity': 0.9866131191432396, 'eval_sensitivity': 0.9292307692307692, 'eval_informedness': 0.9158438883740088, 'eval_runtime': 3.8277, 'eval_samples_per_second': 475.216, 'eval_steps_per_second': 29.783, 'epoch': 7.96}


Model weights saved in bert_Intoxication\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Intoxication\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Intoxication\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [06:25<00:00,  1.10it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Intoxication\checkpoint-500 (score: 0.9370199692780339).
100%|██████████| 904/904 [06:25<00:00,  2.35it/s]
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


{'train_runtime': 385.4379, 'train_samples_per_second': 151.018, 'train_steps_per_second': 2.345, 'train_loss': 0.04960260415795894, 'epoch': 7.99}


100%|██████████| 114/114 [00:03<00:00, 30.49it/s]
***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


100%|██████████| 349/349 [00:11<00:00, 30.25it/s]
Configuration saved in best_bert_Intoxication\config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Intoxication\pytorch_model.bin
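
The derived metrics in each evaluation dictionary above can be recomputed from the four confusion-matrix counts alone. The sketch below does this for the `checkpoint-500` evaluation of the `Intoxication` label, which the trainer selected as the best model by F1 score. The function name `binary_metrics` is hypothetical and this is not necessarily how the notebook's own metric function is written; only the counts are taken from the log. Note that the logged `eval_roc_auc` coincides with the mean of sensitivity and specificity, which is what ROC AUC reduces to when it is computed from hard 0/1 predictions rather than from probabilities.

```python
import math

def binary_metrics(tn, fp, fn, tp):
    """Recompute common binary-classification metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # also called sensitivity
    specificity = tn / (tn + fp)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1_score": 2 * tp / (2 * tp + fp + fn),
        "informedness": recall + specificity - 1,
        # Equals ROC AUC when the scores are hard 0/1 predictions.
        "balanced_accuracy": (recall + specificity) / 2,
        "matthews_correlation": (tp * tn - fp * fn) / mcc_denominator,
    }

# Counts copied from the checkpoint-500 evaluation of the Intoxication label.
metrics = binary_metrics(tn=1473, fp=21, fn=20, tp=305)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the checkpoint is chosen by F1 score, it is not the checkpoint with the lowest `eval_loss` (that is `checkpoint-200` for this label), so it may be worth checking both when comparing runs.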
Using custom data configuration default-d920446e86ab2f1c


Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-d920446e86ab2f1c\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 1541.08it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-d920446e86ab2f1c\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 123.29it/s]
100%|██████████| 8/8 [00:00<00:00, 14.55ba/s]
100%|██████████| 2/2 [00:00<00:00, 15.47ba/s]
100%|██████████| 6/6 [00:00<00:00,  7.55ba/s]

{'loss': 0.1046, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}



 11%|█         | 100/904 [00:40<04:48,  2.79it/s]Saving model checkpoint to bert_Medical\checkpoint-100
Configuration saved in bert_Medical\checkpoint-100\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.07202532887458801, 'eval_pred_true': 135, 'eval_pred_false': 1684, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9785596481583287, 'eval_f1_score': 0.862190812720848, 'eval_precision': 0.9037037037037037, 'eval_recall': 0.8243243243243243, 'eval_roc_auc': 0.9082722758665308, 'eval_matthews_correlation': 0.8516497832149494, 'eval_cohen_kappa': 0.8505929652897853, 'eval_true_negative': 1658, 'eval_false_positive': 13, 'eval_false_negative': 26, 'eval_true_positive': 122, 'eval_specificity': 0.9922202274087373, 'eval_sensitivity': 0.8243243243243243, 'eval_informedness': 0.8165445517330616, 'eval_runtime': 3.768, 'eval_samples_per_second': 482.744, 'eval_steps_per_second': 30.254, 'epoch': 0.88}


Model weights saved in bert_Medical\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:18<04:17,  2.74it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0436, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}



 22%|██▏       | 200/904 [01:22<04:17,  2.74it/s]Saving model checkpoint to bert_Medical\checkpoint-200
Configuration saved in bert_Medical\checkpoint-200\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.06023724377155304, 'eval_pred_true': 147, 'eval_pred_false': 1672, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9807586586036283, 'eval_f1_score': 0.8813559322033898, 'eval_precision': 0.8843537414965986, 'eval_recall': 0.8783783783783784, 'eval_roc_auc': 0.9341024148025945, 'eval_matthews_correlation': 0.8708923508308618, 'eval_cohen_kappa': 0.8708864250759999, 'eval_true_negative': 1654, 'eval_false_positive': 17, 'eval_false_negative': 18, 'eval_true_positive': 130, 'eval_specificity': 0.9898264512268103, 'eval_sensitivity': 0.8783783783783784, 'eval_informedness': 0.8682048296051887, 'eval_runtime': 3.795, 'eval_samples_per_second': 479.311, 'eval_steps_per_second': 30.039, 'epoch': 1.76}


Model weights saved in bert_Medical\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:00<03:43,  2.70it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.036, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}



 33%|███▎      | 300/904 [02:04<03:43,  2.70it/s]Saving model checkpoint to bert_Medical\checkpoint-300
Configuration saved in bert_Medical\checkpoint-300\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.06426158547401428, 'eval_pred_true': 143, 'eval_pred_false': 1676, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.982957669048928, 'eval_f1_score': 0.8934707903780069, 'eval_precision': 0.9090909090909091, 'eval_recall': 0.8783783783783784, 'eval_roc_auc': 0.9352993028935579, 'eval_matthews_correlation': 0.8843659726777096, 'eval_cohen_kappa': 0.8842117367315467, 'eval_true_negative': 1658, 'eval_false_positive': 13, 'eval_false_negative': 18, 'eval_true_positive': 130, 'eval_specificity': 0.9922202274087373, 'eval_sensitivity': 0.8783783783783784, 'eval_informedness': 0.8705986057871158, 'eval_runtime': 3.7929, 'eval_samples_per_second': 479.576, 'eval_steps_per_second': 30.056, 'epoch': 2.65}


Model weights saved in bert_Medical\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [02:43<03:06,  2.70it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0249, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}



 44%|████▍     | 400/904 [02:47<03:06,  2.70it/s]Saving model checkpoint to bert_Medical\checkpoint-400
Configuration saved in bert_Medical\checkpoint-400\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.06223713606595993, 'eval_pred_true': 150, 'eval_pred_false': 1669, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9835074216602528, 'eval_f1_score': 0.8993288590604026, 'eval_precision': 0.8933333333333333, 'eval_recall': 0.9054054054054054, 'eval_roc_auc': 0.9479151503388487, 'eval_matthews_correlation': 0.8903710542890096, 'eval_cohen_kappa': 0.8903472638055547, 'eval_true_negative': 1655, 'eval_false_positive': 16, 'eval_false_negative': 14, 'eval_true_positive': 134, 'eval_specificity': 0.990424895272292, 'eval_sensitivity': 0.9054054054054054, 'eval_informedness': 0.8958303006776975, 'eval_runtime': 3.8365, 'eval_samples_per_second': 474.129, 'eval_steps_per_second': 29.714, 'epoch': 3.54}


Model weights saved in bert_Medical\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:26<02:29,  2.70it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0146, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}



 55%|█████▌    | 500/904 [03:30<02:29,  2.70it/s]Saving model checkpoint to bert_Medical\checkpoint-500
Configuration saved in bert_Medical\checkpoint-500\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.07681536674499512, 'eval_pred_true': 144, 'eval_pred_false': 1675, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9824079164376031, 'eval_f1_score': 0.8904109589041096, 'eval_precision': 0.9027777777777778, 'eval_recall': 0.8783783783783784, 'eval_roc_auc': 0.9350000808708171, 'eval_matthews_correlation': 0.8809469661755792, 'eval_cohen_kappa': 0.8808492520326534, 'eval_true_negative': 1657, 'eval_false_positive': 14, 'eval_false_negative': 18, 'eval_true_positive': 130, 'eval_specificity': 0.9916217833632556, 'eval_sensitivity': 0.8783783783783784, 'eval_informedness': 0.870000161741634, 'eval_runtime': 3.8196, 'eval_samples_per_second': 476.234, 'eval_steps_per_second': 29.846, 'epoch': 4.42}


Model weights saved in bert_Medical\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:09<01:53,  2.69it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0111, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}



 66%|██████▋   | 600/904 [04:13<01:53,  2.69it/s]Saving model checkpoint to bert_Medical\checkpoint-600
Configuration saved in bert_Medical\checkpoint-600\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.06801323592662811, 'eval_pred_true': 147, 'eval_pred_false': 1672, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9851566794942276, 'eval_f1_score': 0.9084745762711863, 'eval_precision': 0.9115646258503401, 'eval_recall': 0.9054054054054054, 'eval_roc_auc': 0.9488128164070713, 'eval_matthews_correlation': 0.9004042259048921, 'eval_cohen_kappa': 0.9003980993443428, 'eval_true_negative': 1658, 'eval_false_positive': 13, 'eval_false_negative': 14, 'eval_true_positive': 134, 'eval_specificity': 0.9922202274087373, 'eval_sensitivity': 0.9054054054054054, 'eval_informedness': 0.8976256328141425, 'eval_runtime': 3.8738, 'eval_samples_per_second': 469.569, 'eval_steps_per_second': 29.429, 'epoch': 5.31}


Model weights saved in bert_Medical\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [04:52<01:15,  2.72it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0093, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}



 77%|███████▋  | 700/904 [04:56<01:15,  2.72it/s]Saving model checkpoint to bert_Medical\checkpoint-700
Configuration saved in bert_Medical\checkpoint-700\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.07647331058979034, 'eval_pred_true': 137, 'eval_pred_false': 1682, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9840571742715778, 'eval_f1_score': 0.8982456140350877, 'eval_precision': 0.9343065693430657, 'eval_recall': 0.8648648648648649, 'eval_roc_auc': 0.9297394342277646, 'eval_matthews_correlation': 0.8903915026872373, 'eval_cohen_kappa': 0.8896106206172062, 'eval_true_negative': 1662, 'eval_false_positive': 9, 'eval_false_negative': 20, 'eval_true_positive': 128, 'eval_specificity': 0.9946140035906643, 'eval_sensitivity': 0.8648648648648649, 'eval_informedness': 0.8594788684555292, 'eval_runtime': 3.8516, 'eval_samples_per_second': 472.277, 'eval_steps_per_second': 29.598, 'epoch': 6.19}


Model weights saved in bert_Medical\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [05:37<00:42,  2.44it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0055, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}



 88%|████████▊ | 800/904 [05:41<00:42,  2.44it/s]Saving model checkpoint to bert_Medical\checkpoint-800
Configuration saved in bert_Medical\checkpoint-800\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.07809162884950638, 'eval_pred_true': 140, 'eval_pred_false': 1679, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9846069268829027, 'eval_f1_score': 0.9027777777777779, 'eval_precision': 0.9285714285714286, 'eval_recall': 0.8783783783783784, 'eval_roc_auc': 0.9361969689617805, 'eval_matthews_correlation': 0.8948337500663693, 'eval_cohen_kappa': 0.8944265720350225, 'eval_true_negative': 1661, 'eval_false_positive': 10, 'eval_false_negative': 18, 'eval_true_positive': 130, 'eval_specificity': 0.9940155595451825, 'eval_sensitivity': 0.8783783783783784, 'eval_informedness': 0.8723939379235608, 'eval_runtime': 4.2361, 'eval_samples_per_second': 429.403, 'eval_steps_per_second': 26.911, 'epoch': 7.08}


Model weights saved in bert_Medical\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [06:23<00:01,  2.48it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0057, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}



100%|█████████▉| 900/904 [06:28<00:01,  2.48it/s]Saving model checkpoint to bert_Medical\checkpoint-900
Configuration saved in bert_Medical\checkpoint-900\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.07861775159835815, 'eval_pred_true': 140, 'eval_pred_false': 1679, 'eval_actual_true': 148, 'eval_actual_false': 1671, 'eval_accuracy': 0.9846069268829027, 'eval_f1_score': 0.9027777777777779, 'eval_precision': 0.9285714285714286, 'eval_recall': 0.8783783783783784, 'eval_roc_auc': 0.9361969689617805, 'eval_matthews_correlation': 0.8948337500663693, 'eval_cohen_kappa': 0.8944265720350225, 'eval_true_negative': 1661, 'eval_false_positive': 10, 'eval_false_negative': 18, 'eval_true_positive': 130, 'eval_specificity': 0.9940155595451825, 'eval_sensitivity': 0.8783783783783784, 'eval_informedness': 0.8723939379235608, 'eval_runtime': 4.1824, 'eval_samples_per_second': 434.913, 'eval_steps_per_second': 27.257, 'epoch': 7.96}


Model weights saved in bert_Medical\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Medical\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Medical\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [06:31<00:00,  1.04s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Medical\checkpoint-600 (score: 0.9084745762711863).
100%|██████████| 904/904 [06:31<00:00,  2.31it/s]
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


{'train_runtime': 391.9483, 'train_samples_per_second': 148.509, 'train_steps_per_second': 2.306, 'train_loss': 0.028266166423373255, 'epoch': 7.99}


100%|██████████| 114/114 [00:04<00:00, 28.06it/s]
***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


100%|██████████| 349/349 [00:12<00:00, 27.82it/s]
Configuration saved in best_bert_Medical\config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Medical\pytorch_model.bin
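
The `learning_rate` values logged every 100 steps are identical across labels and are consistent with a cosine decay over the 904 training steps. The sketch below reproduces them under assumptions that are inferred from the log rather than stated in this section: a base learning rate of `5e-5`, no warmup steps, and a half-cosine schedule. The names `BASE_LR`, `TOTAL_STEPS`, and `cosine_lr` are hypothetical.

```python
import math

BASE_LR = 5e-5       # assumption: inferred from the logged values, not shown in this section
TOTAL_STEPS = 904    # taken from the progress bars in the log

def cosine_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS):
    """Half-cosine decay from base_lr to 0, assuming zero warmup steps."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Compare against the values logged at the same steps.
for step in (100, 200, 300, 900):
    print(step, cosine_lr(step))
```

If the printed values match the log, the near-zero learning rate at step 900 explains why the evaluation metrics at `checkpoint-800` and `checkpoint-900` are the same: the weights barely change in the final steps.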
Using custom data configuration default-858da31bdbaea2ab


Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-858da31bdbaea2ab\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 3081.78it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 342.47it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-858da31bdbaea2ab\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 96.30it/s]
100%|██████████| 8/8 [00:00<00:00, 12.65ba/s]
100%|██████████| 2/2 [00:00<00:00, 13.67ba/s]
100%|██████████| 6/6 [00:00<00:00, 14.77ba/s]

{'loss': 0.1516, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}



 11%|█         | 100/904 [00:43<05:15,  2.55it/s]Saving model checkpoint to bert_Wellness\checkpoint-100
Configuration saved in bert_Wellness\checkpoint-100\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.09217400848865509, 'eval_pred_true': 423, 'eval_pred_false': 1396, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9697636063771303, 'eval_f1_score': 0.935672514619883, 'eval_precision': 0.9456264775413712, 'eval_recall': 0.9259259259259259, 'eval_roc_auc': 0.9546716868274187, 'eval_matthews_correlation': 0.9159992489394891, 'eval_cohen_kappa': 0.9159125312139375, 'eval_true_negative': 1364, 'eval_false_positive': 23, 'eval_false_negative': 32, 'eval_true_positive': 400, 'eval_specificity': 0.9834174477289113, 'eval_sensitivity': 0.9259259259259259, 'eval_informedness': 0.9093433736548371, 'eval_runtime': 4.0984, 'eval_samples_per_second': 443.828, 'eval_steps_per_second': 27.816, 'epoch': 0.88}


Model weights saved in bert_Wellness\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:27<04:43,  2.48it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0703, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}



 22%|██▏       | 200/904 [01:31<04:43,  2.48it/s]Saving model checkpoint to bert_Wellness\checkpoint-200
Configuration saved in bert_Wellness\checkpoint-200\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.09449244290590286, 'eval_pred_true': 412, 'eval_pred_false': 1407, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9670148433205058, 'eval_f1_score': 0.9289099526066351, 'eval_precision': 0.9514563106796117, 'eval_recall': 0.9074074074074074, 'eval_roc_auc': 0.9464938983684477, 'eval_matthews_correlation': 0.9078831822022991, 'eval_cohen_kappa': 0.907451062862725, 'eval_true_negative': 1367, 'eval_false_positive': 20, 'eval_false_negative': 40, 'eval_true_positive': 392, 'eval_specificity': 0.9855803893294881, 'eval_sensitivity': 0.9074074074074074, 'eval_informedness': 0.8929877967368955, 'eval_runtime': 4.1999, 'eval_samples_per_second': 433.108, 'eval_steps_per_second': 27.144, 'epoch': 1.76}


Model weights saved in bert_Wellness\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:15<04:00,  2.51it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.052, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}



 33%|███▎      | 300/904 [02:19<04:00,  2.51it/s]Saving model checkpoint to bert_Wellness\checkpoint-300
Configuration saved in bert_Wellness\checkpoint-300\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.09355633705854416, 'eval_pred_true': 434, 'eval_pred_false': 1385, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9725123694337549, 'eval_f1_score': 0.9422632794457274, 'eval_precision': 0.9400921658986175, 'eval_recall': 0.9444444444444444, 'eval_roc_auc': 0.9628494752863894, 'eval_matthews_correlation': 0.9242301330127344, 'eval_cohen_kappa': 0.9242258876693565, 'eval_true_negative': 1361, 'eval_false_positive': 26, 'eval_false_negative': 24, 'eval_true_positive': 408, 'eval_specificity': 0.9812545061283345, 'eval_sensitivity': 0.9444444444444444, 'eval_informedness': 0.9256989505727788, 'eval_runtime': 4.1916, 'eval_samples_per_second': 433.964, 'eval_steps_per_second': 27.197, 'epoch': 2.65}


Model weights saved in bert_Wellness\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [03:03<03:18,  2.54it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.036, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}



 44%|████▍     | 400/904 [03:07<03:18,  2.54it/s]Saving model checkpoint to bert_Wellness\checkpoint-400
Configuration saved in bert_Wellness\checkpoint-400\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.0949365645647049, 'eval_pred_true': 448, 'eval_pred_false': 1371, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9736118746564046, 'eval_f1_score': 0.9454545454545454, 'eval_precision': 0.9285714285714286, 'eval_recall': 0.9629629629629629, 'eval_roc_auc': 0.9699457929450719, 'eval_matthews_correlation': 0.9283251817211698, 'eval_cohen_kappa': 0.9280582178687725, 'eval_true_negative': 1355, 'eval_false_positive': 32, 'eval_false_negative': 16, 'eval_true_positive': 416, 'eval_specificity': 0.976928622927181, 'eval_sensitivity': 0.9629629629629629, 'eval_informedness': 0.9398915858901438, 'eval_runtime': 4.1186, 'eval_samples_per_second': 441.658, 'eval_steps_per_second': 27.679, 'epoch': 3.54}


Model weights saved in bert_Wellness\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:51<02:42,  2.49it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0194, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}



 55%|█████▌    | 500/904 [03:55<02:42,  2.49it/s]Saving model checkpoint to bert_Wellness\checkpoint-500
Configuration saved in bert_Wellness\checkpoint-500\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.12827327847480774, 'eval_pred_true': 431, 'eval_pred_false': 1388, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9730621220450797, 'eval_f1_score': 0.9432213209733488, 'eval_precision': 0.9443155452436195, 'eval_recall': 0.9421296296296297, 'eval_roc_auc': 0.9624130484125077, 'eval_matthews_correlation': 0.9255647623567097, 'eval_cohen_kappa': 0.9255636943978394, 'eval_true_negative': 1363, 'eval_false_positive': 24, 'eval_false_negative': 25, 'eval_true_positive': 407, 'eval_specificity': 0.9826964671953857, 'eval_sensitivity': 0.9421296296296297, 'eval_informedness': 0.9248260968250155, 'eval_runtime': 4.1657, 'eval_samples_per_second': 436.665, 'eval_steps_per_second': 27.367, 'epoch': 4.42}


Model weights saved in bert_Wellness\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:39<02:07,  2.38it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0197, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}



 66%|██████▋   | 600/904 [04:43<02:07,  2.38it/s]Saving model checkpoint to bert_Wellness\checkpoint-600
Configuration saved in bert_Wellness\checkpoint-600\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.12506075203418732, 'eval_pred_true': 428, 'eval_pred_false': 1391, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.971412864211105, 'eval_f1_score': 0.9395348837209302, 'eval_precision': 0.9439252336448598, 'eval_recall': 0.9351851851851852, 'eval_roc_auc': 0.9589408261902854, 'eval_matthews_correlation': 0.9208339932978261, 'eval_cohen_kappa': 0.920816911501254, 'eval_true_negative': 1363, 'eval_false_positive': 24, 'eval_false_negative': 28, 'eval_true_positive': 404, 'eval_specificity': 0.9826964671953857, 'eval_sensitivity': 0.9351851851851852, 'eval_informedness': 0.9178816523805708, 'eval_runtime': 4.5669, 'eval_samples_per_second': 398.298, 'eval_steps_per_second': 24.962, 'epoch': 5.31}


Model weights saved in bert_Wellness\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [05:31<01:23,  2.45it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0112, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}



 77%|███████▋  | 700/904 [05:35<01:23,  2.45it/s]Saving model checkpoint to bert_Wellness\checkpoint-700
Configuration saved in bert_Wellness\checkpoint-700\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.13456088304519653, 'eval_pred_true': 439, 'eval_pred_false': 1380, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9741616272677295, 'eval_f1_score': 0.9460390355912743, 'eval_precision': 0.9384965831435079, 'eval_recall': 0.9537037037037037, 'eval_roc_auc': 0.9671186146492563, 'eval_matthews_correlation': 0.9291064403782154, 'eval_cohen_kappa': 0.9290545727034413, 'eval_true_negative': 1360, 'eval_false_positive': 27, 'eval_false_negative': 20, 'eval_true_positive': 412, 'eval_specificity': 0.9805335255948089, 'eval_sensitivity': 0.9537037037037037, 'eval_informedness': 0.9342372292985126, 'eval_runtime': 4.3623, 'eval_samples_per_second': 416.978, 'eval_steps_per_second': 26.133, 'epoch': 6.19}


Model weights saved in bert_Wellness\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [06:19<00:41,  2.50it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0097, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}



 88%|████████▊ | 800/904 [06:23<00:41,  2.50it/s]Saving model checkpoint to bert_Wellness\checkpoint-800
Configuration saved in bert_Wellness\checkpoint-800\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.137597918510437, 'eval_pred_true': 435, 'eval_pred_false': 1384, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9730621220450797, 'eval_f1_score': 0.9434832756632066, 'eval_precision': 0.9402298850574713, 'eval_recall': 0.9467592592592593, 'eval_roc_auc': 0.9640068826937969, 'eval_matthews_correlation': 0.9258099599184247, 'eval_cohen_kappa': 0.9258004067487323, 'eval_true_negative': 1361, 'eval_false_positive': 26, 'eval_false_negative': 23, 'eval_true_positive': 409, 'eval_specificity': 0.9812545061283345, 'eval_sensitivity': 0.9467592592592593, 'eval_informedness': 0.9280137653875937, 'eval_runtime': 4.2185, 'eval_samples_per_second': 431.197, 'eval_steps_per_second': 27.024, 'epoch': 7.08}


Model weights saved in bert_Wellness\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [07:07<00:01,  2.53it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0075, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}



100%|█████████▉| 900/904 [07:11<00:01,  2.53it/s]Saving model checkpoint to bert_Wellness\checkpoint-900
Configuration saved in bert_Wellness\checkpoint-900\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.13786864280700684, 'eval_pred_true': 437, 'eval_pred_false': 1382, 'eval_actual_true': 432, 'eval_actual_false': 1387, 'eval_accuracy': 0.9741616272677295, 'eval_f1_score': 0.9459148446490219, 'eval_precision': 0.9405034324942791, 'eval_recall': 0.9513888888888888, 'eval_roc_auc': 0.9663216975086117, 'eval_matthews_correlation': 0.9289684891027109, 'eval_cohen_kappa': 0.9289419462191942, 'eval_true_negative': 1361, 'eval_false_positive': 26, 'eval_false_negative': 21, 'eval_true_positive': 411, 'eval_specificity': 0.9812545061283345, 'eval_sensitivity': 0.9513888888888888, 'eval_informedness': 0.9326433950172235, 'eval_runtime': 4.1517, 'eval_samples_per_second': 438.131, 'eval_steps_per_second': 27.458, 'epoch': 7.96}


Model weights saved in bert_Wellness\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Wellness\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Wellness\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [07:17<00:00,  1.21s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Wellness\checkpoint-700 (score: 0.9460390355912743).
100%|██████████| 904/904 [07:17<00:00,  2.07it/s]
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


{'train_runtime': 437.5439, 'train_samples_per_second': 133.034, 'train_steps_per_second': 2.066, 'train_loss': 0.04180279130577645, 'epoch': 7.99}


100%|██████████| 114/114 [00:04<00:00, 27.23it/s]
***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


100%|██████████| 349/349 [00:12<00:00, 27.48it/s]
Configuration saved in best_bert_Wellness\config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Wellness\pytorch_model.bin
Using custom data configuration default-819dea91605b1feb


Downloading and preparing dataset csv/default to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-819dea91605b1feb\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a...


Downloading data files: 100%|██████████| 3/3 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 440.30it/s]
                            

Dataset csv downloaded and prepared to C:\Users\Wenhao\.cache\huggingface\datasets\csv\default-819dea91605b1feb\0.0.0\652c3096f041ee27b04d2232d41f10547a8fecda3e284a79a0ec4053c916ef7a. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 25.08it/s]
100%|██████████| 8/8 [00:01<00:00,  6.04ba/s]
100%|██████████| 2/2 [00:00<00:00,  2.29ba/s]
100%|██████████| 6/6 [00:00<00:00,  9.56ba/s]
loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Wenhao/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embed

{'loss': 0.1999, 'learning_rate': 4.850549408038498e-05, 'epoch': 0.88}



 11%|█         | 100/904 [00:47<05:34,  2.40it/s]Saving model checkpoint to bert_Commoditization\checkpoint-100
Configuration saved in bert_Commoditization\checkpoint-100\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11403504759073257, 'eval_pred_true': 903, 'eval_pred_false': 916, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9642660802638813, 'eval_f1_score': 0.9637074260189837, 'eval_precision': 0.955703211517165, 'eval_recall': 0.9718468468468469, 'eval_roc_auc': 0.9644411463020487, 'eval_matthews_correlation': 0.9286464339067967, 'eval_cohen_kappa': 0.9285200843237504, 'eval_true_negative': 891, 'eval_false_positive': 40, 'eval_false_negative': 25, 'eval_true_positive': 863, 'eval_specificity': 0.9570354457572503, 'eval_sensitivity': 0.9718468468468469, 'eval_informedness': 0.9288822926040972, 'eval_runtime': 4.5774, 'eval_samples_per_second': 397.388, 'eval_steps_per_second': 24.905, 'epoch': 0.88}


Model weights saved in bert_Commoditization\checkpoint-100\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-100\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-100\special_tokens_map.json
 22%|██▏       | 200/904 [01:33<04:45,  2.46it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.1136, 'learning_rate': 4.420066015704105e-05, 'epoch': 1.76}



 22%|██▏       | 200/904 [01:37<04:45,  2.46it/s]Saving model checkpoint to bert_Commoditization\checkpoint-200
Configuration saved in bert_Commoditization\checkpoint-200\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.09845311939716339, 'eval_pred_true': 879, 'eval_pred_false': 940, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9708631115997801, 'eval_f1_score': 0.9700056593095643, 'eval_precision': 0.9749715585893061, 'eval_recall': 0.9650900900900901, 'eval_roc_auc': 0.9707297926282888, 'eval_matthews_correlation': 0.9417261724768966, 'eval_cohen_kappa': 0.9416799903694281, 'eval_true_negative': 909, 'eval_false_positive': 22, 'eval_false_negative': 31, 'eval_true_positive': 857, 'eval_specificity': 0.9763694951664876, 'eval_sensitivity': 0.9650900900900901, 'eval_informedness': 0.9414595852565777, 'eval_runtime': 4.2255, 'eval_samples_per_second': 430.484, 'eval_steps_per_second': 26.979, 'epoch': 1.76}


Model weights saved in bert_Commoditization\checkpoint-200\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-200\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-200\special_tokens_map.json
 33%|███▎      | 300/904 [02:22<04:04,  2.48it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0849, 'learning_rate': 3.760018621248e-05, 'epoch': 2.65}



 33%|███▎      | 300/904 [02:26<04:04,  2.48it/s]Saving model checkpoint to bert_Commoditization\checkpoint-300
Configuration saved in bert_Commoditization\checkpoint-300\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11079037934541702, 'eval_pred_true': 867, 'eval_pred_false': 952, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9708631115997801, 'eval_f1_score': 0.9698005698005698, 'eval_precision': 0.9815455594002307, 'eval_recall': 0.9583333333333334, 'eval_roc_auc': 0.9705737558181168, 'eval_matthews_correlation': 0.9419134507348376, 'eval_cohen_kappa': 0.9416617802878334, 'eval_true_negative': 915, 'eval_false_positive': 16, 'eval_false_negative': 37, 'eval_true_positive': 851, 'eval_specificity': 0.9828141783029001, 'eval_sensitivity': 0.9583333333333334, 'eval_informedness': 0.9411475116362333, 'eval_runtime': 4.1784, 'eval_samples_per_second': 435.336, 'eval_steps_per_second': 27.283, 'epoch': 2.65}


Model weights saved in bert_Commoditization\checkpoint-300\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-300\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-300\special_tokens_map.json
 44%|████▍     | 400/904 [03:10<03:25,  2.45it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0695, 'learning_rate': 2.9493228037294702e-05, 'epoch': 3.54}



 44%|████▍     | 400/904 [03:15<03:25,  2.45it/s]Saving model checkpoint to bert_Commoditization\checkpoint-400
Configuration saved in bert_Commoditization\checkpoint-400\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.13342389464378357, 'eval_pred_true': 914, 'eval_pred_false': 905, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9648158328752061, 'eval_f1_score': 0.9644839067702553, 'eval_precision': 0.9507658643326039, 'eval_recall': 0.9786036036036037, 'eval_roc_auc': 0.965134240040255, 'eval_matthews_correlation': 0.9300199014259295, 'eval_cohen_kappa': 0.9296398952237857, 'eval_true_negative': 886, 'eval_false_positive': 45, 'eval_false_negative': 19, 'eval_true_positive': 869, 'eval_specificity': 0.9516648764769066, 'eval_sensitivity': 0.9786036036036037, 'eval_informedness': 0.9302684800805103, 'eval_runtime': 4.2685, 'eval_samples_per_second': 426.148, 'eval_steps_per_second': 26.707, 'epoch': 3.54}


Model weights saved in bert_Commoditization\checkpoint-400\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-400\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-400\special_tokens_map.json
 55%|█████▌    | 500/904 [03:59<02:43,  2.48it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.052, 'learning_rate': 2.0849057390116042e-05, 'epoch': 4.42}



 55%|█████▌    | 500/904 [04:03<02:43,  2.48it/s]Saving model checkpoint to bert_Commoditization\checkpoint-500
Configuration saved in bert_Commoditization\checkpoint-500\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.11876137554645538, 'eval_pred_true': 902, 'eval_pred_false': 917, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9648158328752061, 'eval_f1_score': 0.9642458100558658, 'eval_precision': 0.9567627494456763, 'eval_recall': 0.9718468468468469, 'eval_roc_auc': 0.964978203230083, 'eval_matthews_correlation': 0.9297281431774591, 'eval_cohen_kappa': 0.9296179456826786, 'eval_true_negative': 892, 'eval_false_positive': 39, 'eval_false_negative': 25, 'eval_true_positive': 863, 'eval_specificity': 0.958109559613319, 'eval_sensitivity': 0.9718468468468469, 'eval_informedness': 0.9299564064601658, 'eval_runtime': 4.1863, 'eval_samples_per_second': 434.511, 'eval_steps_per_second': 27.232, 'epoch': 4.42}


Model weights saved in bert_Commoditization\checkpoint-500\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-500\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-500\special_tokens_map.json
 66%|██████▋   | 600/904 [04:48<02:02,  2.48it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0392, 'learning_rate': 1.270117540713368e-05, 'epoch': 5.31}



 66%|██████▋   | 600/904 [04:52<02:02,  2.48it/s]Saving model checkpoint to bert_Commoditization\checkpoint-600
Configuration saved in bert_Commoditization\checkpoint-600\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.1496126651763916, 'eval_pred_true': 909, 'eval_pred_false': 910, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9664650907091809, 'eval_f1_score': 0.9660545353366723, 'eval_precision': 0.9548954895489549, 'eval_recall': 0.9774774774774775, 'eval_roc_auc': 0.9667194046893295, 'eval_matthews_correlation': 0.93317810206732, 'eval_cohen_kappa': 0.9329293097810089, 'eval_true_negative': 890, 'eval_false_positive': 41, 'eval_false_negative': 20, 'eval_true_positive': 868, 'eval_specificity': 0.9559613319011815, 'eval_sensitivity': 0.9774774774774775, 'eval_informedness': 0.933438809378659, 'eval_runtime': 4.2338, 'eval_samples_per_second': 429.637, 'eval_steps_per_second': 26.926, 'epoch': 5.31}


Model weights saved in bert_Commoditization\checkpoint-600\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-600\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-600\special_tokens_map.json
 77%|███████▋  | 700/904 [05:37<01:23,  2.45it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0248, 'learning_rate': 6.0237467168189674e-06, 'epoch': 6.19}



 77%|███████▋  | 700/904 [05:41<01:23,  2.45it/s]Saving model checkpoint to bert_Commoditization\checkpoint-700
Configuration saved in bert_Commoditization\checkpoint-700\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.17037703096866608, 'eval_pred_true': 879, 'eval_pred_false': 940, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9675645959318306, 'eval_f1_score': 0.9666100735710242, 'eval_precision': 0.9715585893060296, 'eval_recall': 0.9617117117117117, 'eval_roc_auc': 0.9674294326549965, 'eval_matthews_correlation': 0.9351235834457434, 'eval_cohen_kappa': 0.9350777251282312, 'eval_true_negative': 906, 'eval_false_positive': 25, 'eval_false_negative': 34, 'eval_true_positive': 854, 'eval_specificity': 0.9731471535982814, 'eval_sensitivity': 0.9617117117117117, 'eval_informedness': 0.9348588653099932, 'eval_runtime': 4.287, 'eval_samples_per_second': 424.31, 'eval_steps_per_second': 26.592, 'epoch': 6.19}


Model weights saved in bert_Commoditization\checkpoint-700\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-700\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-700\special_tokens_map.json
 88%|████████▊ | 800/904 [06:25<00:41,  2.50it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0174, 'learning_rate': 1.615127855610496e-06, 'epoch': 7.08}



 88%|████████▊ | 800/904 [06:29<00:41,  2.50it/s]Saving model checkpoint to bert_Commoditization\checkpoint-800
Configuration saved in bert_Commoditization\checkpoint-800\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.17765575647354126, 'eval_pred_true': 889, 'eval_pred_false': 930, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9675645959318306, 'eval_f1_score': 0.9667979741136747, 'eval_precision': 0.9662542182227222, 'eval_recall': 0.9673423423423423, 'eval_roc_auc': 0.96755946333014, 'eval_matthews_correlation': 0.9350951742903791, 'eval_cohen_kappa': 0.9350946084636278, 'eval_true_negative': 901, 'eval_false_positive': 30, 'eval_false_negative': 29, 'eval_true_positive': 859, 'eval_specificity': 0.9677765843179377, 'eval_sensitivity': 0.9673423423423423, 'eval_informedness': 0.93511892666028, 'eval_runtime': 4.2061, 'eval_samples_per_second': 432.471, 'eval_steps_per_second': 27.104, 'epoch': 7.08}


Model weights saved in bert_Commoditization\checkpoint-800\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-800\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-800\special_tokens_map.json
100%|█████████▉| 900/904 [07:12<00:01,  2.55it/s]***** Running Evaluation *****
  Num examples = 1819
  Batch size = 16


{'loss': 0.0154, 'learning_rate': 2.4153823404732268e-09, 'epoch': 7.96}



100%|█████████▉| 900/904 [07:16<00:01,  2.55it/s]Saving model checkpoint to bert_Commoditization\checkpoint-900
Configuration saved in bert_Commoditization\checkpoint-900\config.json


(1819, 2) (1819,)
(1819,) (1819,)
{'eval_loss': 0.17927874624729156, 'eval_pred_true': 886, 'eval_pred_false': 933, 'eval_actual_true': 888, 'eval_actual_false': 931, 'eval_accuracy': 0.9670148433205058, 'eval_f1_score': 0.9661781285231117, 'eval_precision': 0.9672686230248307, 'eval_recall': 0.9650900900900901, 'eval_roc_auc': 0.9669703941320482, 'eval_matthews_correlation': 0.9339916281620994, 'eval_cohen_kappa': 0.9339893671712926, 'eval_true_negative': 902, 'eval_false_positive': 29, 'eval_false_negative': 31, 'eval_true_positive': 857, 'eval_specificity': 0.9688506981740065, 'eval_sensitivity': 0.9650900900900901, 'eval_informedness': 0.9339407882640964, 'eval_runtime': 4.1245, 'eval_samples_per_second': 441.021, 'eval_steps_per_second': 27.64, 'epoch': 7.96}


Model weights saved in bert_Commoditization\checkpoint-900\pytorch_model.bin
tokenizer config file saved in bert_Commoditization\checkpoint-900\tokenizer_config.json
Special tokens file saved in bert_Commoditization\checkpoint-900\special_tokens_map.json
100%|██████████| 904/904 [07:22<00:00,  1.21s/it]

Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert_Commoditization\checkpoint-200 (score: 0.9700056593095643).
100%|██████████| 904/904 [07:22<00:00,  2.04it/s]
***** Running Prediction *****
  Num examples = 1819
  Batch size = 16


{'train_runtime': 442.4916, 'train_samples_per_second': 131.546, 'train_steps_per_second': 2.043, 'train_loss': 0.06826890756961256, 'epoch': 7.99}


100%|██████████| 114/114 [00:04<00:00, 27.55it/s]
***** Running Prediction *****
  Num examples = 5577
  Batch size = 16


(1819, 2) (1819,)
(1819,) (1819,)


100%|██████████| 349/349 [00:12<00:00, 28.10it/s]
Configuration saved in best_bert_Commoditization\config.json


(5577, 2) (5577,)
(5577,) (5577,)


Model weights saved in best_bert_Commoditization\pytorch_model.bin
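
The evaluation dictionaries in the logs above report many metrics, but all of them can be recomputed from the four confusion-matrix counts. As a sanity check, the sketch below re-derives the *Wellness* metrics at `checkpoint-700` (the checkpoint the trainer reloads as best, by F1 score) from the logged `eval_true_negative`, `eval_false_positive`, `eval_false_negative`, and `eval_true_positive`. The variable names are illustrative and this is not the notebook's own metric function. Note that the logged `eval_roc_auc` coincides with balanced accuracy, which is what ROC AUC reduces to when it is computed from hard 0/1 predictions; that reading is an inference from the numbers, not something this section states.

```python
import math

# Confusion-matrix counts copied from the Wellness log at checkpoint-700 (epoch 6.19).
tn, fp, fn, tp = 1360, 27, 20, 412
total = tn + fp + fn + tp  # 1819 evaluation examples

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # logged as both eval_recall and eval_sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)
informedness = recall + specificity - 1        # Youden's J statistic
balanced_accuracy = (recall + specificity) / 2
# Matthews correlation coefficient for the binary case
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

metrics = {
    "accuracy": accuracy,
    "precision": precision,
    "recall": recall,
    "specificity": specificity,
    "f1_score": f1,
    "informedness": informedness,
    "balanced_accuracy": balanced_accuracy,
    "matthews_correlation": mcc,
}
for name, value in metrics.items():
    print(f"{name:>22}: {value:.4f}")
```

The printed values can be compared digit by digit with the corresponding `eval_*` entries in the log line for epoch 6.19.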


In [None]:
model

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

## Prediction

In this section, we use the fine-tuned models to make predictions on the real data, which will then be used in a separate data analysis.

As in the fine-tuning setup, you need to provide the real data file `full_dataset.csv`; the cell below reads it from `data/full_dataset.csv`.

In [34]:
LABELS = ["Cannabinoid", "Intoxication", "Medical", "Wellness", "Commoditization"]
# LABELS = ["Wellness", "Commoditization"]
# LABELS = ["Cannabinoid", "Intoxication", "Medical"]

full_dataset = pd.read_csv("data/full_dataset.csv")


In [36]:
# downsampling
full_dataset = full_dataset.sample(50000, random_state=random_state)

Clean the real data so that it can be passed into the model.

In [37]:
full_dataset['straindescription'] = '"' + full_dataset['strain'].astype(str) + '" -- '+ full_dataset['description'].astype(str)
clean_full = clean_data(full_dataset, "straindescription", [], minimal=True)

Define the functions and hyperparameters needed for making predictions.

**Warning:** The hyperparameter choices here should be the **same** as those used in the fine-tuning stage.
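
One way to reduce the risk of the two stages drifting apart is to store the tokenization settings in a single file that both the fine-tuning and prediction code read. The snippet below is a minimal sketch of that idea using only the standard library; the dictionary name `tokenization_config` and the file name `tokenization_config.json` are hypothetical and do not appear elsewhere in this notebook.

```python
import json
import os
import tempfile

# Tokenization settings mirroring the values used in this notebook.
tokenization_config = {
    "padding": "max_length",
    "padding_side": "right",
    "truncation": True,
    "truncation_side": "right",
    "max_len": 150,
}

# Hypothetical location; in practice this would sit next to the saved models.
config_path = os.path.join(tempfile.gettempdir(), "tokenization_config.json")

# Fine-tuning stage: write the settings once.
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(tokenization_config, f, indent=2)

# Prediction stage: read the same settings back instead of retyping them.
with open(config_path, encoding="utf-8") as f:
    loaded_config = json.load(f)

print(loaded_config == tokenization_config)
```

With this arrangement, a change to `max_len` or the padding strategy only has to be made in one place.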

In [38]:
### Preprocess Setup ###
# Tokenization Hyperparameters
padding = 'max_length' # padding strategy
padding_side = 'right' # the side on which the model should have padding applied
truncation = True # truncate strategy
truncation_side = 'right' # the side on which the model should have truncation applied
max_len = 150 # maximum length to use by one of the truncation/padding parameters

### Load the pre-trained tokenizer ###
tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    padding_side=padding_side,
    truncation_side=truncation_side
)

### Define the preprocess function ###
def preprocess_function(examples):
    """
    Tokenize the `straindescription` field
    ---
    Arguments:
    examples (dict-like): a row or batch of rows from the dataset; examples["straindescription"] holds the sequence or batch of sequences to be encoded/tokenized

    Returns:
    tokenized (transformers.BatchEncoding): tokenized descriptions
    """
    tokenized = tokenizer(
        examples["straindescription"],
        padding=padding,
        truncation=truncation,
        max_length=max_len
    )

    return tokenized

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\Wenhao/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading file https://huggingface.co/bert-base-uncased/
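
To make the effect of `padding='max_length'`, `truncation=True`, and `max_len` concrete, here is a toy illustration of right-side truncation and right-side padding on plain lists of token ids. It is only a sketch of the idea, not the tokenizer's implementation: the helper `pad_or_truncate` is hypothetical, the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` (which the real tokenizer preserves when truncating) are ignored. The padding id of 0 matches `pad_token_id` in the model config shown above.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Right-truncate, then right-pad, a list of token ids to exactly max_len."""
    kept = token_ids[:max_len]          # drop tokens beyond max_len from the right
    n_pad = max_len - len(kept)         # number of padding positions still needed
    input_ids = kept + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding the model should not attend to
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


# A sequence shorter than max_len is padded on the right.
short_ids, short_mask = pad_or_truncate([11, 12, 13], max_len=6)
print(short_ids, short_mask)  # [11, 12, 13, 0, 0, 0] [1, 1, 1, 0, 0, 0]

# A sequence longer than max_len is cut on the right.
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_len=6)
print(long_ids, long_mask)  # [1, 2, 3, 4, 5, 6] [1, 1, 1, 1, 1, 1]
```

Because every sequence ends up with the same length, all inputs in this notebook have 150 positions regardless of how long the original description is.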

Predict each label on the real data, loading the best saved model for that label.

In [39]:
# preprocess the textual input 
dataset = Dataset.from_pandas(clean_full)
tokenized_dataset = dataset.map(preprocess_function, batched=True)
tokenized_dataset = tokenized_dataset.remove_columns(["straindescription", "__index_level_0__"])

for label in LABELS:

    # set up directory paths
    best_model_dir = "best_bert_" + label

    model = AutoModelForSequenceClassification.from_pretrained(best_model_dir)

    trainer = Trainer(
        model=model,
    )

    predictions = trainer.predict(tokenized_dataset)
    predict_labels = np.argmax(predictions.predictions, axis=-1)
    full_dataset[(label+"_labeled").lower()] = predict_labels

# manipulate the dataframe so that it is acceptable to another data analysis code
# NOTE: the loop above writes lowercased column names (e.g. "medical_labeled"), so this
# rename only has an effect if a "Medical_labeled" column already exists in the data
full_dataset = full_dataset.rename({"Medical_labeled":"Medical_undersampled_labeled"}, axis=1)
# add all-zero integer columns for the labels that are not predicted here
full_dataset["Medical_labeled"] = np.zeros(full_dataset.shape[0], dtype=int)
full_dataset["Smell flavor_labeled"] = np.zeros(full_dataset.shape[0], dtype=int)
full_dataset["Genetics_labeled"] = np.zeros(full_dataset.shape[0], dtype=int)
# NOTE: `line_terminator` was renamed to `lineterminator` in pandas 1.5 and removed in pandas 2.0
full_dataset.to_csv("full_dataset_with_labels.csv", index=False, line_terminator='\r\n')
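
The prediction step relies on `np.argmax(predictions.predictions, axis=-1)` to turn the model output into 0/1 labels. The sketch below shows that reduction on a small made-up array of logits with two columns, one per class, which is the shape suggested by the `(N, 2)` arrays printed in the logs. It also converts the logits to probabilities with a softmax, which is an optional extra not used in the notebook but can help when a confidence score is wanted alongside the label.

```python
import numpy as np

# Made-up logits for three descriptions; columns are class 0 and class 1.
logits = np.array([
    [2.0, -1.0],
    [-0.5, 1.5],
    [0.2, 0.1],
])

# Hard labels: index of the larger logit in each row.
predicted = np.argmax(logits, axis=-1)
print(predicted)  # [0 1 0]

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

# Probability assigned to the positive class (column 1) for each row.
positive_prob = probs[:, 1]
print(np.round(positive_prob, 3))
```

Since softmax preserves the ordering of the logits within a row, taking the argmax of `probs` gives the same labels as taking the argmax of `logits`.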


100%|██████████| 50/50 [00:02<00:00, 17.38ba/s]
loading configuration file best_bert_Cannabinoid\config.json
Model config BertConfig {
  "_name_or_path": "best_bert_Cannabinoid",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": 0.15,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.21.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file best_bert_Cannabinoid\pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceCl