# Report Similarity v5 (Similarity between sections) 
- Uses `final_samples.csv`
- Based on the tutorial: https://towardsdatascience.com/bert-for-measuring-text-similarity-eec91c6bf9e1

In [1]:
import pandas as pd
import re
from datasets import Dataset, load_dataset
from transformers import AutoTokenizer, AutoModel
import torch
import nltk
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("ICLbioengNLP/CXR_BioClinicalBERT_chunkedv1")
model = AutoModel.from_pretrained('ICLbioengNLP/CXR_BioClinicalBERT_chunkedv1')

Some weights of the model checkpoint at ICLbioengNLP/CXR_BioClinicalBERT_chunkedv1 were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertModel were not initialized from the model checkpoint at ICLbioengNLP/CXR_BioClinicalBERT_chunkedv1 and are newly initialized: ['bert.poole

In [2]:
report_df = pd.read_csv('final_samples.csv')
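# drop the leftover index column (a positional `axis` argument is no longer accepted by recent pandas)
report_df = report_df.drop(columns='Unnamed: 0')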
print(len(report_df["study_id"].tolist()))
display(report_df.head(n=5))

164


Unnamed: 0,study_id,diagnosis,diagnosis_id,impression,findings,Atelectasis,Cardiomegaly,Consolidation,Edema,Enlarged Cardiomediastinum,Fracture,Lung Lesion,Lung Opacity,No Finding,Pleural Effusion,Pleural Other,Pneumonia,Pneumothorax
0,s58402174,Atelectasis,1,Increasing bibasilar atelectasis. Possible mi...,AP portable semi upright view of the chest.\n ...,Positive,,,Uncertain,,,,,,,,,
1,s59983953,Atelectasis,1,1. Bibasilar and right upper lobe atelectasis...,An endotracheal tube approximately 7 cm from t...,Positive,,,,,,,,,,,,
2,s55481818,Atelectasis,1,Emphysema and bibasilar atelectasis. No evide...,Linear opacities of the lung bases bilaterally...,Positive,,,,,,,,,,,Negative,
3,s51499550,Atelectasis,1,Limited exam with given low lung volumes with ...,AP portable upright view of the chest. Midli...,Positive,,,,,,,,,,,Negative,
4,s51644170,Atelectasis,1,Persistently low lung volumes with streaky rig...,Patient is status post median sternotomy. Rig...,Positive,,,,,,,,,,,,


In [3]:
sample_dataset = dict.fromkeys(["study_id", "diagnosis", "diagnosis_id", "impression", "findings"])
sample_dataset["study_id"] = report_df["study_id"].tolist()
sample_dataset["diagnosis"] = report_df["diagnosis"].tolist()
sample_dataset["diagnosis_id"] = report_df["diagnosis_id"].tolist()
sample_dataset["impression"] = report_df["impression"].tolist()
sample_dataset["findings"] = report_df["findings"].tolist()


### Only change here: choose the targeted report and section to compare

In [4]:
# Only change this - the index of the report you want to compare against all the others
targeted_index = 120   # from 0 to 163 (164 reports)
section = "findings" # impression or findings

### Moving the targeted report

In [5]:
# move the targeted report to the start of the list
sample_dataset["study_id"].insert(0, sample_dataset["study_id"].pop(targeted_index))
sample_dataset["diagnosis"].insert(0, sample_dataset["diagnosis"].pop(targeted_index))
sample_dataset["diagnosis_id"].insert(0, sample_dataset["diagnosis_id"].pop(targeted_index))
sample_dataset["impression"].insert(0, sample_dataset["impression"].pop(targeted_index))
sample_dataset["findings"].insert(0, sample_dataset["findings"].pop(targeted_index))

print("Sample report: ", sample_dataset["study_id"][0])
print()
print("Diagnosis: ", sample_dataset["diagnosis"][0])
print()
print("Impression: \n" , sample_dataset["impression"][0])
print()
print("Findings: \n", sample_dataset["findings"][0])

Sample report:  s58866273

Diagnosis:  Pneumothorax

Impression: 
 1.  Dobbhoff tube in the stomach.
 2.  Unchanged right basilar loculated hydropneumothorax.

Findings: 
 A single portable AP chest radiograph was obtained.  The tip of a
 Dobbhoff catheter projects over the stomach.  The tip of a right PICC line
 ends in the low SVC.  There is interval improved aeration of lungs with
 persistence of a right basilar loculated hydropneumothorax.  A pigtail
 catheter remains in unchanged position.  There is a small left pleural
 effusion.
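
The five `insert(0, pop(...))` calls above apply the same move to every column list. A minimal sketch of that step as a loop, using a hypothetical helper `move_to_front` and made-up toy data (an illustration, not the notebook's own code):

```python
def move_to_front(columns, index):
    """Move the item at `index` to position 0 in every list of `columns`, in place."""
    for values in columns.values():
        values.insert(0, values.pop(index))
    return columns

# toy stand-in for sample_dataset (values are made up for illustration)
toy = {
    "study_id": ["a", "b", "c", "d"],
    "diagnosis": ["w", "x", "y", "z"],
}
move_to_front(toy, 2)
print(toy["study_id"])   # ['c', 'a', 'b', 'd']
print(toy["diagnosis"])  # ['y', 'w', 'x', 'z']
```

Because every list is moved with the same index, the columns stay aligned row by row.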


### Report embeddings - Mean pooling operation

In [6]:
# initialize dictionary that will contain tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}
sample_size = len(sample_dataset["study_id"])

for i in range(sample_size):
    report = sample_dataset[section][i]
    
    # tokenize the report, truncating or padding it to 150 tokens, and append to the dictionary lists
    new_tokens = tokenizer.encode_plus(report, max_length=150, truncation=True,
                                       padding='max_length', return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

In [7]:
# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
print(tokens['input_ids'])
print(tokens['attention_mask'])

tensor([[  101,   170,  1423,  ...,     0,     0,     0],
        [  101,   170,  1643,  ...,     0,     0,     0],
        [  101,  1126,  1322,  ..., 12602,   174,   102],
        ...,
        [  101,  1175,  1110,  ...,     0,     0,     0],
        [  101,  1103,  5351,  ...,     0,     0,     0],
        [  101,   185,  1161,  ...,     0,     0,     0]])
tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])
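
The printed tensors show the effect of `padding='max_length'` and `truncation=True`: short reports are filled with `0` and masked out, while the third row was cut to the full length and still ends with `102`. A toy sketch of that behaviour in plain Python, assuming `0`, `101` and `102` are the pad, `[CLS]` and `[SEP]` ids as the output suggests; `pad_or_truncate` is a hypothetical helper, not the tokenizer's actual implementation:

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # ids assumed from the printed tensors

def pad_or_truncate(ids, max_length):
    """Return (input_ids, attention_mask), both exactly `max_length` long.

    Toy version: assumes `ids` already starts with [CLS] and ends with [SEP].
    """
    if len(ids) > max_length:
        # keep the closing [SEP] when cutting, as in the third printed row
        ids = ids[:max_length - 1] + [SEP_ID]
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    return ids + [PAD_ID] * n_pad, mask + [0] * n_pad

short_ids = [CLS_ID, 170, 1423, SEP_ID]
long_ids = [CLS_ID, 1126, 1322, 12602, 174, 185, 1161, SEP_ID]

print(pad_or_truncate(short_ids, 6))  # ([101, 170, 1423, 102, 0, 0], [1, 1, 1, 1, 0, 0])
print(pad_or_truncate(long_ids, 6))   # ([101, 1126, 1322, 12602, 174, 102], [1, 1, 1, 1, 1, 1])
```

Any text beyond the length limit is discarded, so a report longer than 150 tokens is compared on its beginning only.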


In [8]:
tokens['input_ids'].shape

torch.Size([164, 150])

In [9]:
# inference only, so skip gradient tracking to save memory
with torch.no_grad():
    outputs = model(**tokens)
outputs.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

In [10]:
embeddings = outputs.last_hidden_state
embeddings.shape

torch.Size([164, 150, 768])

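The `last_hidden_state` tensor in `outputs` holds the dense vector representation of each report: one 768-dimensional vector per token position. Mean pooling averages these over the non-padding tokens to give a single vector per report: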

In [11]:
def mean_pooling(model_output, attention_mask):
    # Access the last_hidden_state
    token_embeddings = model_output.last_hidden_state
    
    # multiply each embedding by its attention_mask value so that [PAD] tokens are ignored
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
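
To see why the mask matters, here is the same arithmetic on a toy batch. This is a sketch in NumPy rather than PyTorch, and `mean_pooling_np` is a hypothetical name used only for this illustration:

```python
import numpy as np

def mean_pooling_np(token_embeddings, attention_mask):
    """Masked mean over the token axis.

    token_embeddings: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    """
    mask = attention_mask[..., None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)   # padded positions add 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)   # avoid division by zero
    return summed / counts

# one "report" with 3 token slots and hidden size 2; the last slot is padding
emb = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])

print(mean_pooling_np(emb, mask))  # [[2. 3.]]
print(emb.mean(axis=1))            # plain mean is pulled up by the padding slot
```

Only the two real tokens contribute to the pooled vector; an unmasked mean would also average in whatever the model produced at the padded position.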

In [12]:
mean_pooled_embeddings = mean_pooling(outputs, tokens['attention_mask'])
mean_pooled_embeddings.shape

torch.Size([164, 768])

### Calculating dense similarity vector - cosine similarity

In [13]:
# convert from PyTorch tensor to numpy array
mean_pooled_NPembeddings = mean_pooled_embeddings.detach().numpy()

# cosine similarity between the targeted report (row 0) and every other report
cos_similarities = cosine_similarity(
    [mean_pooled_NPembeddings[0]],
    mean_pooled_NPembeddings[1:]
)

print(cos_similarities)
print(len(cos_similarities[0]))

[[0.93787754 0.9551142  0.92547214 0.942333   0.9462352  0.9519487
  0.94086635 0.93424726 0.9432101  0.94874936 0.92735004 0.9465703
  0.9391953  0.95078176 0.91532195 0.90332973 0.93629956 0.93811226
  0.93015325 0.9425262  0.9395634  0.93104047 0.94988406 0.9321181
  0.93793094 0.9605327  0.93583643 0.91326207 0.94519097 0.9494306
  0.9279354  0.93748343 0.9364872  0.90917003 0.9536949  0.9630689
  0.9420947  0.94015366 0.9554666  0.9430383  0.9479212  0.9244891
  0.91958034 0.9425365  0.9465373  0.9632773  0.9386427  0.9577706
  0.9488589  0.95145726 0.9351248  0.9531458  0.94290817 0.93517756
  0.94221145 0.95178246 0.93308437 0.93285084 0.95443314 0.9139195
  0.90209246 0.9479429  0.95308316 0.9507174  0.9112017  0.9404598
  0.9239759  0.9550652  0.93906134 0.92977    0.9296198  0.9144655
  0.9348012  0.93759096 0.9422262  0.93825454 0.9314673  0.94310576
  0.9347931  0.9369739  0.9366757  0.9274663  0.95045733 0.9304414
  0.9349586  0.9443276  0.94693446 0.9422597  0.9490072  0.
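
Cosine similarity between two vectors u and v is u·v / (‖u‖ ‖v‖), so it depends only on the angle between them and not on their lengths. A minimal NumPy sketch of the same "row 0 against the rest" comparison on toy vectors; `cosine_to_rest` is a hypothetical helper, and the notebook itself uses scikit-learn's `cosine_similarity`:

```python
import numpy as np

def cosine_to_rest(embeddings):
    """Cosine similarity between row 0 and every other row of a 2-D array."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit[1:] @ unit[0]

toy = np.array([
    [1.0, 0.0],   # target
    [2.0, 0.0],   # same direction as the target
    [0.0, 3.0],   # orthogonal to the target
    [1.0, 1.0],   # 45 degrees from the target
])
print(cosine_to_rest(toy))  # approximately [1. 0. 0.7071]
```

The result has one score per remaining row, which is why the notebook gets 163 values for 164 reports.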

In [14]:
# Put all of them into a table:
# remove the targeted report from sample_dataset first
removed_id = sample_dataset["study_id"].pop(0)
removed_diag = sample_dataset["diagnosis"].pop(0)
removed_diag_id = sample_dataset["diagnosis_id"].pop(0)
removed_imp = sample_dataset["impression"].pop(0)
removed_find = sample_dataset["findings"].pop(0)

In [15]:
# Add cosine_similarity to the dictionary 
sample_dataset['cosine_similarity'] = cos_similarities[0].tolist()

print(len(sample_dataset["study_id"]))
print(len(sample_dataset["diagnosis"]))
print(len(sample_dataset["cosine_similarity"]))

163
163
163


In [16]:
if section == "impression":
    sample_dataset.pop("findings")

if section == "findings":
    sample_dataset.pop("impression")

In [17]:
cos_sim_df = pd.DataFrame.from_dict(sample_dataset)
sort_cos_sim_df = cos_sim_df.sort_values(by=['cosine_similarity'], ascending=False)
print("Targeted report of a diagnosis >>> ", removed_diag)

if section == "impression":
    print("Impression: \n", removed_imp)

if section == "findings":
    print("Findings: \n", removed_find)

print()
print("The 10 most similar reports to >>> ", removed_diag, " :")
display(sort_cos_sim_df.head(n=10))

Targeted report of a diagnosis >>>  Pneumothorax
Findings: 
 A single portable AP chest radiograph was obtained.  The tip of a
 Dobbhoff catheter projects over the stomach.  The tip of a right PICC line
 ends in the low SVC.  There is interval improved aeration of lungs with
 persistence of a right basilar loculated hydropneumothorax.  A pigtail
 catheter remains in unchanged position.  There is a small left pleural
 effusion.

The 10 most similar reports to >>>  Pneumothorax  :


Unnamed: 0,study_id,diagnosis,diagnosis_id,findings,cosine_similarity
104,s50128467,Pleural Effusion,9,Since the prior examination there is little ch...,0.96487
108,s53367019,Pleural Effusion,9,There has been interval placement of a right I...,0.964469
45,s56140154,Edema,4,There has been improvement in mild-to-moderate...,0.963277
35,s58585557,Consolidatio,3,Portable semi-upright radiograph of the chest ...,0.963069
126,s55902256,Pneumothorax,10,Comparison is made to prior study from ___.\n ...,0.96161
112,s53795595,Pleural Effusion,9,There has been interval decrease in size of th...,0.960636
25,s53462705,Cardiomegaly,2,There has been interval removal of a right-sid...,0.960533
160,s56443683,No findings,13,There is persistent opacification of the media...,0.959271
153,s52538997,No findings,13,New left-sided Port-A-Cath is seen entering th...,0.958453
118,s51435896,Pneumothorax,10,In the interim since the most recent prior\n c...,0.958182


In [18]:
print("The 10 least similar reports to >>> ", removed_diag, " :")
display(sort_cos_sim_df.tail(n=10))

The 10 least similar reports to >>>  Pneumothorax  :


Unnamed: 0,study_id,diagnosis,diagnosis_id,findings,cosine_similarity
157,s56078456,No findings,13,Frontal and lateral views of the chest. The l...,0.911429
64,s52775752,Fracture,6,"No focal consolidation, pleural effusion, pneu...",0.911202
132,s56129930,Pneumonia,11,There is increased opacification in the left l...,0.909271
33,s58831403,Consolidatio,3,AP portable upright chest radiograph was provi...,0.90917
152,s58736291,No findings,13,"No focal consolidation, pleural effusion, or p...",0.908885
15,s51513702,Cardiomegaly,2,Single AP portable view of the chest. No prio...,0.90333
60,s58521372,Enlarged Cardiomediastinum,5,Frontal and lateral views of the chest were ob...,0.902092
144,s59535316,Pleural Other,12,Single portable view of the chest. Low lung v...,0.896426
150,s58307391,No findings,13,The lungs are well expanded and clear. The\n ...,0.893125
115,s52971492,Pleural Effusion,9,PA and lateral chest views were obtained with ...,0.892972
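
The two tables are the head and tail of one descending sort on the similarity score. A minimal sketch of that ranking step without pandas, using a hypothetical helper `rank_by_similarity` and three scores copied from the tables above:

```python
def rank_by_similarity(study_ids, similarities, k):
    """Return the k most and k least similar (study_id, score) pairs."""
    ranked = sorted(zip(study_ids, similarities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k], ranked[-k:]

# three scores copied from the tables above
ids = ["s52971492", "s50128467", "s53367019"]
sims = [0.892972, 0.96487, 0.964469]

top, bottom = rank_by_similarity(ids, sims, k=2)
print(top)     # [('s50128467', 0.96487), ('s53367019', 0.964469)]
print(bottom)  # [('s53367019', 0.964469), ('s52971492', 0.892972)]
```

As with `tail(n=10)`, the least similar entries come back in descending order, so the lowest score is last.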
