## Importing Required Libraries

In [1]:
import os

import evaluate
import numpy as np
import pandas as pd
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    pipeline,
)

## Get Current Working Directory

In [3]:
my_dir = os.getcwd() + '/'
my_dir

'/home/bilal/bilal_data/nlp/discharge_report_model/'
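Building the path by string concatenation works on Linux, as the output above shows. A slightly more portable sketch uses `os.path.join`; the folder name `save_tokenizer` is the one used later in this notebook, and the variable names are only illustrative.

```python
import os

# Base directory of the notebook (the same value os.getcwd() returns above).
base_dir = os.getcwd()

# os.path.join inserts the correct separator for the current platform,
# so no trailing '/' has to be appended by hand.
tokenizer_dir = os.path.join(base_dir, "save_tokenizer")
print(tokenizer_dir)
```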

## Loading DataSet

In [89]:
df = pd.read_csv('brog_train.csv')
df

Unnamed: 0,label,text
0,0,Discussed with pt HEP and reason for new PT re...
1,0,Caregiver was educated to use WC as needed for...
2,1,Discharged early into treatment plan so unable...
3,0,Patient not showing any signficant progress fo...
4,0,Patient will continue with HEP only and discha...
...,...,...
1811,0,The patient has been admitted for Home health.
1812,0,"At time of last PR, pt's DGI score was slightl..."
1813,0,CNA staff continuously used transport chair or...
1814,1,Pt has demonstrated some improvement with phys...


## Load Dataset Using the load_dataset Function from the datasets Library

In [12]:
dataset = load_dataset('csv', data_files={'train': './brog_train.csv', 'test': './brog_test.csv'})
dataset

Using custom data configuration default-fb72b38ec20e879d
Found cached dataset csv (/home/bilal/.cache/huggingface/datasets/csv/default-fb72b38ec20e879d/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317)



DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 1816
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 1816
    })
})
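Both splits above report `num_rows: 1816`, so it may be worth confirming that `brog_test.csv` is not simply a copy of `brog_train.csv`; if it were, the evaluation accuracy would measure memorization rather than generalization. The following is a minimal sketch of such a check. The helpers `read_texts` and `overlap_fraction` are hypothetical, and the two in-memory CSV strings are toy stand-ins for the real files.

```python
import csv
import io

def read_texts(csv_source):
    """Return the 'text' column of a CSV file-like object as a list."""
    return [row["text"] for row in csv.DictReader(csv_source)]

def overlap_fraction(train_texts, test_texts):
    """Fraction of test texts that also appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    train_set = set(train_texts)
    shared = sum(1 for text in test_texts if text in train_set)
    return shared / len(test_texts)

# Toy data for illustration only. For the real check, pass open file handles
# for the two CSV files instead of io.StringIO objects.
toy_train = io.StringIO("label,text\n0,Patient will continue with HEP\n1,Pt has improved\n")
toy_test = io.StringIO("label,text\n1,Pt has improved\n0,Patient declined therapy\n")

fraction = overlap_fraction(read_texts(toy_train), read_texts(toy_test))
print(f"{fraction:.0%} of test texts also occur in train")
```

A fraction close to 1.0 on the real files would suggest that a separate held-out split is needed before the accuracy figures can be trusted.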

Before we can feed those texts to our model, we need to preprocess them.
This is done by a Transformers tokenizer, which (as the name indicates)
tokenizes the inputs, converts the tokens to their corresponding IDs in the
pretrained vocabulary, puts them in the format the model expects, and
generates the other inputs the model requires.
To do all of this, we instantiate our tokenizer with the AutoTokenizer.from_pretrained
method, which ensures that we get the tokenizer class and the vocabulary that
match the chosen checkpoint.
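To make the token-to-ID step concrete, here is a toy sketch. It is not the WordPiece algorithm that BERT tokenizers actually use; the tiny vocabulary `toy_vocab` and the helper `encode` are hypothetical and only illustrate the idea of mapping tokens to IDs and adding the special tokens the model expects.

```python
# Hypothetical miniature vocabulary; a real BERT vocabulary has tens of
# thousands of entries and splits rare words into sub-word pieces.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "patient": 4, "has": 5, "improved": 6}

def encode(text, vocab):
    """Lower-case, split on whitespace, and map each token to its ID."""
    tokens = ["[CLS]"] + text.lower().split() + ["[SEP]"]
    # Words missing from the vocabulary fall back to the [UNK] ID.
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

ids = encode("Patient has improved greatly", toy_vocab)
print(ids)
```

The word "greatly" is not in the toy vocabulary, so it is mapped to the `[UNK]` ID; a real sub-word tokenizer would usually break such a word into known pieces instead.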

## Tokenizer

In [38]:
checkpoint = "emilyalsentzer/Bio_Discharge_Summary_BERT"

In [39]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint, model_max_length=512)

## Saving the Tokenizer

In [40]:
tokenizer.save_pretrained(my_dir+"save_tokenizer")

('/home/bilal/bilal_data/nlp/discharge_report_model/save_tokenizer/tokenizer_config.json',
 '/home/bilal/bilal_data/nlp/discharge_report_model/save_tokenizer/special_tokens_map.json',
 '/home/bilal/bilal_data/nlp/discharge_report_model/save_tokenizer/vocab.txt',
 '/home/bilal/bilal_data/nlp/discharge_report_model/save_tokenizer/added_tokens.json',
 '/home/bilal/bilal_data/nlp/discharge_report_model/save_tokenizer/tokenizer.json')

## Loading Tokenizer From Save Directory

In [41]:
tokenizer = AutoTokenizer.from_pretrained('./save_tokenizer/')

We can then write the function that will preprocess our samples.
We feed them to the tokenizer with truncation=True, which ensures that
any input longer than the model can handle is truncated to the maximum
length the model accepts, and with padding='max_length', which pads
shorter inputs to that same length.

In [42]:
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
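The combined effect of `truncation=True` and `padding='max_length'` in the function above can be sketched in plain Python. The helper `pad_or_truncate` is hypothetical and ignores the handling of special tokens; it only shows how every sequence ends up with the same length and a matching attention mask.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut token_ids to max_length, or right-pad with pad_id up to max_length."""
    kept = token_ids[:max_length]                    # truncation
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad              # padding
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_seq = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_seq = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_seq)
print(long_seq)
```

Padding every example to the full maximum length is simple but spends compute on padding tokens; padding per batch is a common alternative when inputs are short.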

To apply this function to every text in our dataset,
we use the map method of the dataset object we created earlier.
This applies the function to all the elements of all the splits in
dataset, so our training and testing data are preprocessed in a
single command.

In [43]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets

Loading cached processed dataset at /home/bilal/.cache/huggingface/datasets/csv/default-fb72b38ec20e879d/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317/cache-daf8801bed112dbb.arrow



DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1816
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 1816
    })
})
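With `batched=True`, `map` passes the function a dictionary of columns, each holding a list of values for the batch, rather than one row at a time. The sketch below imitates that calling convention in plain Python; `map_batched` and `add_length` are hypothetical helpers and not part of the `datasets` library.

```python
def map_batched(columns, function, batch_size=1000):
    """Call function on successive slices of a column dict and add the returned columns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size]
                 for name, values in columns.items()}
        for name, values in function(batch).items():
            new_columns.setdefault(name, []).extend(values)
    # Returned columns are added to (or replace) the existing ones.
    return {**columns, **new_columns}

def add_length(batch):
    # Receives {'label': [...], 'text': [...]} and returns new columns as lists.
    return {"n_words": [len(text.split()) for text in batch["text"]]}

toy = {"label": [0, 1, 0],
       "text": ["Patient D/C from services", "Pt improved", "No change"]}
mapped = map_batched(toy, add_length, batch_size=2)
print(mapped)
```

This mirrors why `tokenize_function` can index `examples['text']` directly: in batched mode that expression is a list of strings, which the tokenizer accepts in one call.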

## Train & Test 

In [44]:
train_dataset = tokenized_datasets['train']
eval_dataset = tokenized_datasets['test']

Now that our data is ready, we can download the pretrained model and fine-tune it. 
Since our task is of the text classification kind, we use the AutoModelForSequenceClassification 
class. Like with the tokenizer, the from_pretrained method will download and cache the model for us.

In [45]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)


Some weights of the model checkpoint at emilyalsentzer/Bio_Discharge_Summary_BERT were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from

## Save The Downloaded Model

In [46]:
model.save_pretrained(my_dir+"save_model")

## Loading the Model from the Save Directory

In [47]:
# model = AutoModelForSequenceClassification.from_pretrained("./save_model/", num_labels=2)

## Defining Metric

In [49]:
#metric = load_metric("accuracy")
metric = evaluate.load("accuracy")

In [50]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis = -1)
    return metric.compute(predictions=predictions, references=labels)
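
`compute_metrics` receives raw logits with one row per example and one column per class. The following NumPy-free sketch shows what the `argmax` and accuracy steps compute; the helpers `argmax_rows` and `accuracy` are hypothetical, and the logits and labels are made-up values for illustration.

```python
def argmax_rows(logits):
    """Index of the largest value in each row (what np.argmax(x, axis=-1) gives for 2-D input)."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]

def accuracy(predictions, references):
    """Share of positions where prediction and reference agree."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]

toy_predictions = argmax_rows(toy_logits)
print(toy_predictions)
print({"accuracy": accuracy(toy_predictions, toy_labels)})
```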

Next we define the TrainingArguments, a class that contains all the attributes to
customize the training. It requires one folder name, which will be used
to save the checkpoints of the model; all other arguments are optional:

In [56]:
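training_args = TrainingArguments(
    output_dir="./fine_tuned_model/",
    overwrite_output_dir=True,      # boolean, not the string 'True'
    evaluation_strategy="epoch",    # evaluate at the end of every epoch
    num_train_epochs=10,
    save_strategy="no",             # no intermediate checkpoints are written
    save_total_limit=1,
    load_best_model_at_end=False,
    report_to="none",               # disable external logging integrations
)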

PyTorch: setting up devices


Then we just need to pass all of this along with our datasets to the Trainer:

In [57]:
trainer = Trainer(
    model = model,
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    compute_metrics = compute_metrics,
)

## Training Start

In [58]:
trainer.train()

The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 1816
  Num Epochs = 10
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 2270
  Number of trainable parameters = 108311810
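
The `Total optimization steps = 2270` line follows from the other numbers in the log: each epoch needs one step per batch, and the run lasts 10 epochs. A quick check of the arithmetic, assuming a single device as the log reports and the default `logging_steps` of 500 (an assumption, since the notebook does not set it):

```python
import math

num_examples = 1816      # "Num examples" in the log
batch_size = 8           # "Total train batch size"
num_epochs = 10          # "Num Epochs"
grad_accum_steps = 1     # "Gradient Accumulation steps"
logging_steps = 500      # assumed library default, not set in this notebook

# The last batch of an epoch may be smaller, hence the ceiling.
batches_per_epoch = math.ceil(num_examples / batch_size)
steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
total_steps = steps_per_epoch * num_epochs

# Epoch in which the first training-loss value is logged.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)
```

Under that assumption the first training loss is recorded during epoch 3, which would explain the `No log` entries for epochs 1 and 2 in the results table.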


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.256283,0.924009
2,No log,0.090347,0.979075
3,0.229200,0.028824,0.993392
4,0.229200,0.013717,0.996696
5,0.063200,0.008022,0.998348
6,0.063200,0.001321,0.999449
7,0.014000,4.1e-05,1.0
8,0.014000,3e-05,1.0
9,0.000100,3e-05,1.0
10,0.000100,2.6e-05,1.0
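
Accuracy values this close to 1.0 are easier to read as error counts. Since every evaluation ran on 1816 examples, the number of misclassified examples per epoch can be recovered from the table. This is a small arithmetic sketch; rounding absorbs the six-digit precision of the logged accuracies.

```python
num_eval_examples = 1816
accuracy_per_epoch = [0.924009, 0.979075, 0.993392, 0.996696, 0.998348,
                      0.999449, 1.0, 1.0, 1.0, 1.0]

# errors = (1 - accuracy) * number of evaluated examples
errors_per_epoch = [round((1 - acc) * num_eval_examples)
                    for acc in accuracy_per_epoch]
print(errors_per_epoch)
```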


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 1816
  Batch size = 8

TrainOutput(global_step=2270, training_loss=0.0679853109519781, metrics={'train_runtime': 44806.8426, 'train_samples_per_second': 0.405, 'train_steps_per_second': 0.051, 'total_flos': 4778096765337600.0, 'train_loss': 0.0679853109519781, 'epoch': 10.0})
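
The throughput figures in `TrainOutput` can be reproduced from the runtime and the step count, which also puts the run time in more familiar units:

```python
train_runtime_s = 44806.8426   # 'train_runtime' from TrainOutput, in seconds
num_examples = 1816
num_epochs = 10
global_step = 2270

hours = train_runtime_s / 3600
samples_per_second = num_examples * num_epochs / train_runtime_s
steps_per_second = global_step / train_runtime_s

print(round(hours, 2), round(samples_per_second, 3), round(steps_per_second, 3))
```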

## Saving the Fine Tuned model

In [59]:
trainer.save_model()

Saving model checkpoint to ./fine_tuned_model/
Configuration saved in ./fine_tuned_model/config.json
Model weights saved in ./fine_tuned_model/pytorch_model.bin


## Calling Prediction on eval_dataset

In [62]:
predictions = trainer.predict(eval_dataset)
predicted_values = list(np.argmax(predictions.predictions, axis=-1))
print(predicted_values)

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 1816
  Batch size = 8


[0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 

In [91]:
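# Work on a copy so that df itself is not modified and the cell can be re-run.
# This assumes the rows of eval_dataset are in the same order as the rows of df.
compare = df.copy()
compare.insert(1, column="predicted", value=predicted_values)
compare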

Unnamed: 0,label,predicted,text
0,0,0,Discussed with pt HEP and reason for new PT re...
1,0,0,Caregiver was educated to use WC as needed for...
2,1,1,Discharged early into treatment plan so unable...
3,0,0,Patient not showing any signficant progress fo...
4,0,0,Patient will continue with HEP only and discha...
...,...,...,...
1811,0,0,The patient has been admitted for Home health.
1812,0,0,"At time of last PR, pt's DGI score was slightl..."
1813,0,0,CNA staff continuously used transport chair or...
1814,1,1,Pt has demonstrated some improvement with phys...
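
Beyond eyeballing the table, the `label` and `predicted` columns can be summarized as a confusion matrix. The sketch below uses only the standard library; `confusion_counts` is a hypothetical helper, and the two short lists are made-up stand-ins for `compare['label']` and `compare['predicted']`.

```python
from collections import Counter

def confusion_counts(labels, predictions):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(labels, predictions))

toy_labels = [0, 0, 1, 0, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 0]

counts = confusion_counts(toy_labels, toy_predicted)
for true_label in (0, 1):
    # One row per true label, one column per predicted label.
    row = [counts[(true_label, predicted)] for predicted in (0, 1)]
    print(f"true {true_label}: {row}")
```

Off-diagonal entries show which kind of error the model makes, which a single accuracy number hides.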


Now we will run inference by loading the saved tokenizer and the fine-tuned model.

In [66]:
tokenizer = AutoTokenizer.from_pretrained('./save_tokenizer/')
model = AutoModelForSequenceClassification.from_pretrained("./fine_tuned_model/", num_labels=2)

# Build the pipeline once, outside the loop, instead of on every iteration.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

print("\n\nLabel 0 = No Improvement\nLabel 1 = Improvements\n\n")
for i in range(20):
    print("Enter Text")
    text = input()  # input() already returns a str
    result = classifier(text)
    print("\n\n", result)

loading file vocab.txt
loading file tokenizer.json
loading file added_tokens.json
loading file special_tokens_map.json
loading file tokenizer_config.json
loading configuration file ./fine_tuned_model/config.json
Model config BertConfig {
  "_name_or_path": "./fine_tuned_model/",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 28996
}

loading weights file ./fine_tuned_model/pytorch_model.bin
All model c



Label 0 = No Improvement
Label 1 = Improvements


Enter Text
patient has made valuable achievement in her goals


 [{'label': 'LABEL_1', 'score': 0.9999874830245972}]
Enter Text
patient did not improve anymore


 [{'label': 'LABEL_1', 'score': 0.999976634979248}]
Enter Text
Patient D/C from services


 [{'label': 'LABEL_0', 'score': 0.9999698400497437}]
Enter Text
Not making much progress with telehealth due to no caregiver to assist with activities which pose a safety risk


 [{'label': 'LABEL_0', 'score': 0.9999798536300659}]
Enter Text
 Not able to complete a D/C report in person due to COVID, still not allowing therapist into home.


 [{'label': 'LABEL_0', 'score': 0.9999779462814331}]
Enter Text
Patient regressed significantly without skilled therapy and appears to be more forgetful as she no longer remembers to do her HEP.


 [{'label': 'LABEL_0', 'score': 0.9999759197235107}]
Enter Text
Education will persist with patient and family to start in person PT/OT for functional acti

KeyboardInterrupt: Interrupted by user
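
The pipeline reports the generic names `LABEL_0` and `LABEL_1`. A small post-processing step can translate them into the meanings printed at the start of the cell. The mapping `label_names` and the helper `readable` below are hypothetical; the sample result is copied from the output above.

```python
label_names = {"LABEL_0": "No Improvement", "LABEL_1": "Improvements"}

def readable(results, names):
    """Replace generic pipeline labels with descriptive ones, keeping the score."""
    return [{"label": names.get(r["label"], r["label"]), "score": r["score"]}
            for r in results]

sample = [{"label": "LABEL_0", "score": 0.9999698400497437}]
pretty = readable(sample, label_names)
print(pretty)
```

An alternative, not used in this notebook, is to store an `id2label` mapping in the model configuration so that the pipeline emits descriptive names directly.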