# **CHECK RESULTS**

## **Author:** Gema De Vargas Romero

## **Master Thesis:** "Development of a Named Entity Recognition System to automatically assign tumor morphology entity mentions to health-related documents in Spanish." 

This notebook computes the performance of the models on the test set, using the gold standard annotations as reference. Performance is measured on exact entity matches rather than on individual BIOES-V tags.
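To make the notion of exact entity match concrete, here is a minimal sketch with toy spans (not taken from the corpus). Each entity is represented as a `(label, start, end)` tuple, and a prediction counts as correct only if all three fields coincide with a gold annotation. The helper name `exact_match_counts` is hypothetical and is not part of the notebook.

```python
def exact_match_counts(gold, pred):
    """Return (TP, FP, FN) when only identical (label, start, end) tuples match."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)   # spans present in both sets
    fp = len(pred - gold)   # predicted spans absent from the gold standard
    fn = len(gold - pred)   # gold spans that were not predicted
    return tp, fp, fn

# Toy data: the second prediction has the right start but the wrong end
gold = [("MORFOLOGIA_NEOPLASIA", 16, 46), ("MORFOLOGIA_NEOPLASIA", 47, 54)]
pred = [("MORFOLOGIA_NEOPLASIA", 16, 46), ("MORFOLOGIA_NEOPLASIA", 47, 50)]

tp, fp, fn = exact_match_counts(gold, pred)
print(tp, fp, fn)
```

The second prediction overlaps a gold span but ends earlier, so under exact matching it counts as both a false positive and a false negative.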

In previous notebooks, each model was trained on the training dataset and evaluated on development datasets 1 and 2. Once the optimal configuration for each machine learning method was found, it was retrained on both the training and development datasets to produce a final model for each method.

The final models were then used to obtain predictions for the test and background files. Since we are only interested in evaluating the models on the test files, these must be distinguished from the background files. For this purpose, we use the file names of the test dataset with gold standards: matching the names across the two datasets lets us select just the test dataset predictions.
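The name-matching idea can be sketched as follows. This is an illustration on shortened toy paths, assuming that file names are unique within each dataset; `select_test_indices` is a hypothetical helper, not a function from the notebook.

```python
import os

def select_test_indices(all_files, gold_files):
    """Indices of the entries in all_files whose file name also appears in gold_files."""
    gold_names = {os.path.basename(p) for p in gold_files}
    return [i for i, p in enumerate(all_files)
            if os.path.basename(p) in gold_names]

# Toy paths: one background file followed by two test files
all_files = [
    "test-background-set-to-publish/S0004-06142005000100009-1.txt",
    "test-background-set-to-publish/cc_onco1006.txt",
    "test-background-set-to-publish/cc_onco978.txt",
]
gold_files = [
    "test_set/cantemist-ner/cc_onco1006.txt",
    "test_set/cantemist-ner/cc_onco978.txt",
]

indices = select_test_indices(all_files, gold_files)
print(indices)
```

Comparing base names through a set avoids depending on the length of the directory prefix and keeps the lookup cost constant per file.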

In [None]:
from google.colab import drive
drive.mount("/content/drive/")

path='drive/My Drive/Ejemplos NER - TFM/'
!ls 'drive/My Drive/Ejemplos NER - TFM/'

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).
 bert
 check_results2.ipynb
 check_results.ipynb
 data
 dev_set
 dev_set2
'Dictionary based NER (spacy).ipynb'
'Ehealth_Dictionary based NER (spacy).ipynb'
 last_step_cantemist.ipynb
 last_step_cantemist_TEST.ipynb
 NER_by_BERT_Cantemist_BIOESV.ipynb
 NER_by_BERT_Cantemist_Competicion.ipynb
 NER_by_BERT_Cantemist.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_BIOESV_2.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_BIOESV.ipynb
 NER_by_BI_LSTM_CRF_Cantemist_Competicion.ipynb
 NER_by_BI_LSTM_CRF_Cantemist.ipynb
 NER_by_CRF_Cantemist_Competicion.ipynb
 NER_by_CRF_Cantemist.ipynb
 NER_by_CRF_Ehealth.ipynb
 NER_by_CRF.ipynb
 Preprocessing_NER_Cantemist.ipynb
 resources
 results_bert
 results_bert2
 results_BILSTM_ap1
 results_BILSTM_ap2
 results_BILSTM_ap3
 results_CRF
 sample_set
 Scielo+Wiki_skipgram_cased.bin
 Scielo+Wiki_skipgram_cased.vec
 test-background-set-to-publish
 test_se

## **Load libraries**

In [None]:
import pandas as pd
import numpy as np
import pickle as pkl
!pip install sklearn-crfsuite
from sklearn_crfsuite.metrics import flat_f1_score
from sklearn_crfsuite.metrics import flat_precision_score
from sklearn_crfsuite.metrics import flat_recall_score
from sklearn_crfsuite.metrics import flat_classification_report

Collecting sklearn-crfsuite
  Downloading https://files.pythonhosted.org/packages/25/74/5b7befa513482e6dee1f3dd68171a6c9dfc14c0eaa00f885ffeba54fe9b0/sklearn_crfsuite-0.3.6-py2.py3-none-any.whl
Collecting python-crfsuite>=0.8.3
  Downloading https://files.pythonhosted.org/packages/95/99/869dde6dbf3e0d07a013c8eebfb0a3d30776334e0097f8432b631a9a3a19/python_crfsuite-0.9.7-cp36-cp36m-manylinux1_x86_64.whl (743kB)
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6


## **Read the files**

We want only the predictions for those files in the test and background dataset that are also in the test dataset with gold standards.

In [None]:
# The with statement closes each file automatically, so no explicit close() is needed
with open(path+'data/files_txt_test_true', 'rb') as file:
  files_txt_test_true = pkl.load(file)

with open(path+'data/files_txt_test', 'rb') as file:
  files_txt_test = pkl.load(file)

In [None]:
# Test set with gold standards
print(files_txt_test_true[0])
print(len(files_txt_test_true))

# Test and background set
print(files_txt_test[0])
print(len(files_txt_test))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.txt
300
drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/S0004-06142005000100009-1.txt
5232


In [None]:
path_test = 'drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_test = len(path_test)
len_path_test_true = len(path_test_true)

# Names of the gold standard test files, stored in a set for fast lookup
names_test_true = {f[len_path_test_true:] for f in files_txt_test_true}

# Indices (within files_txt_test) of the clinical cases that we must keep
file_indices = [i for i, f in enumerate(files_txt_test)
                if f[len_path_test:] in names_test_true]

print(len(file_indices))

print(file_indices)

300
[4736, 4740, 4743, 4747, 4749, 4750, 4751, 4752, 4754, 4756, 4757, 4758, 4760, 4761, 4762, 4763, 4764, 4765, 4768, 4769, 4770, 4773, 4775, 4782, 4784, 4786, 4788, 4790, 4791, 4793, 4794, 4796, 4797, 4799, 4801, 4806, 4809, 4811, 4812, 4813, 4814, 4815, 4817, 4818, 4823, 4824, 4825, 4830, 4831, 4832, 4834, 4836, 4838, 4840, 4842, 4844, 4845, 4847, 4849, 4853, 4855, 4860, 4863, 4864, 4865, 4867, 4868, 4869, 4871, 4875, 4877, 4879, 4881, 4883, 4884, 4885, 4886, 4887, 4888, 4891, 4892, 4895, 4897, 4898, 4899, 4900, 4901, 4904, 4907, 4908, 4909, 4910, 4914, 4915, 4916, 4917, 4918, 4919, 4920, 4922, 4923, 4925, 4929, 4930, 4932, 4933, 4936, 4940, 4943, 4944, 4945, 4949, 4950, 4953, 4958, 4960, 4961, 4963, 4964, 4966, 4972, 4974, 4975, 4976, 4977, 4978, 4981, 4982, 4985, 4987, 4990, 4991, 4994, 4995, 4997, 4998, 5000, 5001, 5003, 5005, 5006, 5008, 5009, 5010, 5011, 5012, 5015, 5016, 5017, 5018, 5019, 5020, 5021, 5023, 5024, 5025, 5026, 5027, 5028, 5029, 5030, 5031, 5032, 5033, 5034, 5035,

In [None]:
print(files_txt_test[file_indices[0]])
print(files_txt_test_true[0])
print()
print(files_txt_test[file_indices[1]])
print(files_txt_test_true[1])
print()
print(files_txt_test[file_indices[2]])
print(files_txt_test_true[2])
print()
print(files_txt_test[file_indices[-1]])
print(files_txt_test_true[-1])

drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/cc_onco1006.txt
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.txt

drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/cc_onco1023.txt
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1023.txt

drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/cc_onco1027.txt
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1027.txt

drive/My Drive/Ejemplos NER - TFM/test-background-set-to-publish/cc_onco978.txt
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.txt


#### **Read gold standard annotation files**

In [None]:
def read_ann(files_ann):
  """Read .ann files; return, for each file, a list of [ID, 'label start end', text] entries."""
  ann = []
  for file in files_ann:
    # The context manager closes the file even if reading fails
    with open(file, mode = 'r') as f:
      lines = f.readlines()

    # Keep only the entities (ID starts with T);
    # relations (ID starts with R) are discarded.
    ann_aux = [line.split("\t") for line in lines if line.startswith('T')]

    ann.append(ann_aux)
  return ann

In [None]:
# READING THE .ann FILES

path_Cantemist_test_true = path+"test_set/cantemist-ner/"

import glob   

path_ann_test_true = path_Cantemist_test_true +'*.ann'  

files_ann_test_true= glob.glob(path_ann_test_true)   

# Sort the files
files_ann_test_true = sorted(files_ann_test_true)

ann_test_true = read_ann(files_ann_test_true)

In [None]:
ann_test_true[1]

[['T1', 'MORFOLOGIA_NEOPLASIA 303 305', 'M1\n'],
 ['T2', 'MORFOLOGIA_NEOPLASIA 336 338', 'M1\n'],
 ['T3', 'MORFOLOGIA_NEOPLASIA 742 744', 'M1\n'],
 ['T4', 'MORFOLOGIA_NEOPLASIA 1144 1146', 'M1\n'],
 ['T5', 'MORFOLOGIA_NEOPLASIA 1264 1266', 'M1\n'],
 ['T6', 'MORFOLOGIA_NEOPLASIA 1811 1813', 'M1\n'],
 ['T7', 'MORFOLOGIA_NEOPLASIA 2955 2974', 'lesiones pulmonares\n'],
 ['T8', 'MORFOLOGIA_NEOPLASIA 1644 1657', 'masa pulmonar\n'],
 ['T9', 'MORFOLOGIA_NEOPLASIA 2218 2234', 'masa suprarrenal\n'],
 ['T10', 'MORFOLOGIA_NEOPLASIA 3561 3571', 'metástasis\n'],
 ['T11', 'MORFOLOGIA_NEOPLASIA 970 985', 'nódulo pulmonar\n'],
 ['T12', 'MORFOLOGIA_NEOPLASIA 2196 2211', 'nódulo pulmonar\n'],
 ['T13', 'MORFOLOGIA_NEOPLASIA 685 703', 'nódulos pulmonares\n'],
 ['T14', 'MORFOLOGIA_NEOPLASIA 1096 1114', 'nódulos pulmonares\n'],
 ['T15', 'MORFOLOGIA_NEOPLASIA 16 46', 'carcinoma renal células claras\n'],
 ['T16', 'MORFOLOGIA_NEOPLASIA 47 54', 'pTxNxM0\n'],
 ['T17', 'MORFOLOGIA_NEOPLASIA 146 176', 'carcinoma re
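Each entry printed above keeps the three tab-separated fields of an annotation line, with the label and the character offsets still packed into the second field. A minimal sketch of splitting them apart is shown below. `Entity` and `parse_entity` are hypothetical names, and the sketch assumes contiguous spans with exactly one start and one end offset.

```python
from typing import NamedTuple

class Entity(NamedTuple):
    entity_id: str
    label: str
    start: int
    end: int
    text: str

def parse_entity(line):
    """Split one 'ID<TAB>label start end<TAB>text' line into typed fields."""
    entity_id, span, text = line.rstrip("\n").split("\t")
    label, start, end = span.split(" ")
    return Entity(entity_id, label, int(start), int(end), text)

entity = parse_entity("T16\tMORFOLOGIA_NEOPLASIA 47 54\tpTxNxM0\n")
print(entity.label, entity.start, entity.end, entity.text)
```

With integer offsets, the span length (`end - start`) can be compared with the length of the annotated text as a quick sanity check.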

In [None]:
print("Number of clinical cases in the gold standard dataset: %d" %len(ann_test_true))

columns = ["clinical_case", "Entity_ID", "code", "Entity"]
frames = []

for cc in range(len(ann_test_true)): #300 clinical cases
  # np.hstack casts every field to str, so clinical_case is stored as a string
  ann_test_true2 = [np.hstack((cc+1,ann_test_true[cc][j])) for j in range(len(ann_test_true[cc]))]
  frames.append(pd.DataFrame(ann_test_true2, columns = columns))

# DataFrame.append was removed in pandas 2.0; concatenate all cases at once
df_ann_test_true = pd.concat(frames)

Number of clinical cases in the gold standard dataset: 300


In [None]:
df_ann_test_true[df_ann_test_true['clinical_case']=='2']

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,2,T1,MORFOLOGIA_NEOPLASIA 303 305,M1\n
1,2,T2,MORFOLOGIA_NEOPLASIA 336 338,M1\n
2,2,T3,MORFOLOGIA_NEOPLASIA 742 744,M1\n
3,2,T4,MORFOLOGIA_NEOPLASIA 1144 1146,M1\n
4,2,T5,MORFOLOGIA_NEOPLASIA 1264 1266,M1\n
5,2,T6,MORFOLOGIA_NEOPLASIA 1811 1813,M1\n
6,2,T7,MORFOLOGIA_NEOPLASIA 2955 2974,lesiones pulmonares\n
7,2,T8,MORFOLOGIA_NEOPLASIA 1644 1657,masa pulmonar\n
8,2,T9,MORFOLOGIA_NEOPLASIA 2218 2234,masa suprarrenal\n
9,2,T10,MORFOLOGIA_NEOPLASIA 3561 3571,metástasis\n


In [None]:
df_ann_test_true = df_ann_test_true.drop_duplicates(['clinical_case', 'code'], keep='first')

### **True labels**

In [None]:
labels = ['B-MOR', 'I-MOR', 'E-MOR', 'S-MOR', 'V-MOR']

### **Functions**

In [None]:
import warnings  # needed for the warning raised when P + R == 0

def calculate_metrics(df_gs, df_pred):
    # Predicted Positives:
    Pred_Pos_per_cc = df_pred.drop_duplicates(subset=['clinical_case', 
                                                  "code"]).groupby("clinical_case")["code"].count()
    Pred_Pos = df_pred.drop_duplicates(subset=['clinical_case', "code"]).shape[0]
    
    # Gold Standard Positives:
    GS_Pos_per_cc = df_gs.drop_duplicates(subset=['clinical_case', 
                                               "code"]).groupby("clinical_case")["code"].count()
    GS_Pos = df_gs.drop_duplicates(subset=['clinical_case', "code"]).shape[0]
    cc = set(df_gs.clinical_case.tolist())
    TP_per_cc = pd.Series(dtype=float)
    for c in cc:
        pred = set(df_pred.loc[df_pred['clinical_case']==c,'code'].values)
        gs = set(df_gs.loc[df_gs['clinical_case']==c,'code'].values)
        TP_per_cc[c] = len(pred.intersection(gs))
        
    TP = sum(TP_per_cc.values)
        
    
    # Calculate Final Metrics:
    P_per_cc =  TP_per_cc / Pred_Pos_per_cc
    P = TP / Pred_Pos
    R_per_cc = TP_per_cc / GS_Pos_per_cc
    R = TP / GS_Pos
    F1_per_cc = (2 * P_per_cc * R_per_cc) / (P_per_cc + R_per_cc)
    if (P+R) == 0:
        F1 = 0
        warnings.warn('Global F1 score automatically set to zero to avoid division by zero')
        return P_per_cc, P, R_per_cc, R, F1_per_cc, F1
    F1 = (2 * P * R) / (P + R)
    
    return round(P_per_cc,3), round(P,3), round(R_per_cc,3), round(R,3), round(F1_per_cc,3), round(F1,3)
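`calculate_metrics` returns both per-case scores and global scores. The global values pool the counts over all clinical cases, so they generally differ from the mean of the per-case values. The following sketch illustrates this with invented counts; none of the numbers come from the notebook.

```python
# (TP, predicted positives, gold positives) per clinical case; invented numbers
counts = {"1": (8, 10, 10), "2": (1, 4, 2)}

# Global scores: sum the counts first, then divide
tp = sum(c[0] for c in counts.values())
pred_pos = sum(c[1] for c in counts.values())
gs_pos = sum(c[2] for c in counts.values())

global_p = tp / pred_pos
global_r = tp / gs_pos
global_f1 = 2 * global_p * global_r / (global_p + global_r)

# Mean of the per-case precisions: divide first, then average
mean_p = sum(c[0] / c[1] for c in counts.values()) / len(counts)

print(round(global_p, 3), round(global_r, 3), round(global_f1, 3), round(mean_p, 3))
```

In this toy example the small second case lowers the mean per-case precision much more than it lowers the pooled precision.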

In [None]:
def calculate_errors(df_gs, df_pred):
  cc = set(df_gs.clinical_case.tolist())
  mismatch_FN_per_cc = []
  mismatch_FP_per_cc = []
  for c in cc:
    pred = set(df_pred.loc[df_pred['clinical_case']==c,'code'].values)
    gs = set(df_gs.loc[df_gs['clinical_case']==c,'code'].values)

    FN_c = gs.difference(pred)
    FP_c = pred.difference(gs)
  
    mismatch_FN_per_cc.extend([np.hstack([c,FN_c_i]) for FN_c_i in FN_c])
    mismatch_FP_per_cc.extend([np.hstack([c,FP_c_i]) for FP_c_i in FP_c])

  # 1. df_mismatch_FN: those entities that are in the gold standard but are not in the predictions
  df_mismatch_FN = pd.DataFrame(mismatch_FN_per_cc, columns=['clinical_case', 'code_GS'])

  # --------------------------------------------------------------------

  # 2. df_mismatch_FP: those entities that are in the predictions but are not in the gold standard
  df_mismatch_FP = pd.DataFrame(mismatch_FP_per_cc, columns=['clinical_case', 'code_PRED'])

  return df_mismatch_FN, df_mismatch_FP
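For a single clinical case, the logic of `calculate_errors` reduces to two set differences. The error analysis cells later in this notebook additionally pair each false negative with a false positive that starts at the same character offset. A minimal sketch in plain Python, using toy `code` strings and a hypothetical helper `start_offset`:

```python
# Toy 'code' strings in the 'label start end' form used by the notebook
gold = {"MORFOLOGIA_NEOPLASIA 16 46", "MORFOLOGIA_NEOPLASIA 970 985"}
pred = {"MORFOLOGIA_NEOPLASIA 16 31", "MORFOLOGIA_NEOPLASIA 2218 2234"}

false_negatives = gold - pred   # in the gold standard, missing from the predictions
false_positives = pred - gold   # in the predictions, missing from the gold standard

def start_offset(code):
    """Return the start offset (second space-separated field) as a string."""
    return code.split(" ")[1]

# Index false positives by start offset; this simplification keeps a single
# prediction per offset, which suffices for the toy data
fp_by_start = {start_offset(code): code for code in false_positives}

# Pair every false negative with the false positive sharing its start, if any
paired = {code: fp_by_start.get(start_offset(code)) for code in false_negatives}
print(paired)
```

A gold entity paired with a prediction points to a boundary error (same start, different end), while a gold entity paired with `None` was missed altogether.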

### **CRF predictions**

**Read the prediction .ann files**

We want to read them in the same order as the gold standard dataset.
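One way to guarantee the same order is to derive every prediction path from the corresponding gold standard path. The sketch below shows the idea on shortened toy paths, assuming each prediction file has the same name as its gold file; `prediction_paths` is a hypothetical helper.

```python
import os

def prediction_paths(gold_paths, pred_dir):
    """Map each gold .ann path to the file of the same name inside pred_dir."""
    return [os.path.join(pred_dir, os.path.basename(p)) for p in sorted(gold_paths)]

gold_paths = [
    "test_set/cantemist-ner/cc_onco978.ann",
    "test_set/cantemist-ner/cc_onco1006.ann",
]
pred_paths = prediction_paths(gold_paths, "results_CRF/ann")
print(pred_paths)
```

Because sorting is lexicographic, `cc_onco1006.ann` comes before `cc_onco978.ann`, which matches the first and last files printed in this notebook.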

In [None]:
path_crf = 'drive/My Drive/Ejemplos NER - TFM/results_CRF/ann/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_crf = len(path_crf)
len_path_test_true = len(path_test_true)

files_ann_crf = []

for i in range(len(files_ann_test_true)):
  name_ann = files_ann_test_true[i][len_path_test_true:] # these are the names of the ann files ex. 'cc_onco1006.ann'
  new_name_ann = path_crf + name_ann

  files_ann_crf.append(new_name_ann)


In [None]:
print(files_ann_test_true[0])
print(files_ann_crf[0])
print(files_ann_test_true[-1])
print(files_ann_crf[-1])

print(len(files_ann_test_true))
print(len(files_ann_crf))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/results_CRF/ann/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann
drive/My Drive/Ejemplos NER - TFM/results_CRF/ann/cc_onco978.ann
300
300


In [None]:
ann_crf = read_ann(files_ann_crf)

In [None]:
print("Number of clinical cases in the CRF prediction dataset: %d" %len(ann_crf))

columns = ["clinical_case", "Entity_ID", "code", "Entity"]
frames = []

for cc in range(len(ann_crf)): #300 clinical cases
  ann_crf2 = [np.hstack((cc+1,ann_crf[cc][j])) for j in range(len(ann_crf[cc]))]
  frames.append(pd.DataFrame(ann_crf2, columns = columns))

# DataFrame.append was removed in pandas 2.0; concatenate all cases at once
df_ann_crf = pd.concat(frames)

Number of clinical cases in the CRF prediction dataset: 300


In [None]:
df_ann_crf

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,1,T1,MORFOLOGIA_NEOPLASIA 794 806,neoformación\n
1,1,T2,MORFOLOGIA_NEOPLASIA 882 894,metastásicas\n
2,1,T3,MORFOLOGIA_NEOPLASIA 1115 1147,adenocarcinoma bien diferenciado\n
3,1,T4,MORFOLOGIA_NEOPLASIA 1590 1602,neoformación\n
4,1,T5,MORFOLOGIA_NEOPLASIA 1678 1690,metastásicas\n
...,...,...,...,...
8,300,T9,MORFOLOGIA_NEOPLASIA 1884 1910,cáncer de mama metastásico\n
9,300,T10,MORFOLOGIA_NEOPLASIA 1912 1915,CMm\n
10,300,T11,MORFOLOGIA_NEOPLASIA 3350 3397,Carcinoma de mama ductal infiltrante estadio IV\n
11,300,T12,MORFOLOGIA_NEOPLASIA 4171 4178,grado I\n


In [None]:
P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_crf)

In [None]:
print("Precision: %f" %P)
print("Recall: %f" %R)
print("F1 score: %f" %F1)

Precision: 0.800000
Recall: 0.768000
F1 score: 0.783000


In [None]:
df_mismatch_FN, df_mismatch_FP = calculate_errors(df_ann_test_true, df_ann_crf)
# df_mismatch_FN: gold standard entities that were not found in the predictions
# df_mismatch_FP: predicted entities that are not in the gold standard;
# usually the entity was found with a different number of words than in the gold standard

# merge
df_mismatch_FN_data = df_mismatch_FN.merge(df_ann_test_true[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_GS'], right_on=['clinical_case','code'])

df_mismatch_FP_data = df_mismatch_FP.merge(df_ann_crf[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_PRED'], right_on=['clinical_case','code'])

df_mismatch_FN_data = df_mismatch_FN_data.drop(['code'],axis = 1)
df_mismatch_FP_data = df_mismatch_FP_data.drop(['code'],axis = 1)

df_mismatch_FN_data['start'] = df_mismatch_FN_data["code_GS"].str.split(" ", expand = True)[1] 
df_mismatch_FP_data['start'] = df_mismatch_FP_data["code_PRED"].str.split(" ", expand = True)[1] 

df_mismatch = df_mismatch_FN_data.merge(df_mismatch_FP_data[['clinical_case','code_PRED', 'Entity','start']], 
                                             how = 'outer', left_on=['clinical_case','start'], right_on=['clinical_case','start'])

print(df_mismatch.isnull().sum())

print("\nTotal number of entities in the gold standard files: %d" %len(df_ann_test_true))
print("Total number of entities in the prediction files: %d" %len(df_ann_crf))
print("Total number of mismatches %d" %len(df_mismatch))

print("Total number of unrecognized entities %d" %df_mismatch.isnull().sum()['code_PRED'])
print("Total number of false positives entities %d" %df_mismatch.isnull().sum()['code_GS'])

matches = df_ann_test_true.merge(df_ann_crf[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'inner', left_on=['clinical_case','code'], right_on=['clinical_case','code'])
print("\nTotal number of matches %d" %len(matches))

print("Total number of FN %d" %len(df_mismatch_FN))
print("Total number of FP %d" %len(df_mismatch_FP))


clinical_case      0
code_GS          345
Entity_ID        345
Entity_x         345
start              0
code_PRED        481
Entity_y         481
dtype: int64

Total number of entities in the gold standard files: 3633
Total number of entities in the prediction files: 3487
Total number of mismatches 1189
Total number of unrecognized entities 481
Total number of false positives entities 345

Total number of matches 2789
Total number of FN 844
Total number of FP 698
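The counts printed above can be checked against the reported CRF metrics. The short calculation below uses only numbers from this output and assumes that the prediction files contain no duplicate spans, so that the 3487 predicted rows equal the predicted positives.

```python
tp, pred_pos, gs_pos = 2789, 3487, 3633   # matches, predicted entities, gold entities

precision = tp / pred_pos
recall = tp / gs_pos
f1 = 2 * precision * recall / (precision + recall)

# FN and FP follow directly from the same three counts
fn = gs_pos - tp
fp = pred_pos - tp

print(round(precision, 3), round(recall, 3), round(f1, 3), fn, fp)
```

The rounded values agree with the reported precision (0.800), recall (0.768) and F1 score (0.783), and the derived FN and FP counts agree with the 844 and 698 printed above.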


In [None]:
#df_mismatch.to_excel(path+'mismatch_crf.xlsx', index = False)

### **BILSTM-CRF approach 1 predictions**

**Read the prediction .ann files**

We want to read them in the same order as the gold standard dataset.

In [None]:
path_bilstm1 = 'drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap1/ann/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_bilstm1 = len(path_bilstm1)
len_path_test_true = len(path_test_true)

files_ann_bilstm1 = []

for i in range(len(files_ann_test_true)):
  name_ann = files_ann_test_true[i][len_path_test_true:] # these are the names of the ann files, e.g. 'cc_onco1006.ann'
  new_name_ann = path_bilstm1 + name_ann

  files_ann_bilstm1.append(new_name_ann)


In [None]:
print(files_ann_test_true[0])
print(files_ann_bilstm1[0])
print(files_ann_test_true[-1])
print(files_ann_bilstm1[-1])

print(len(files_ann_test_true))
print(len(files_ann_bilstm1))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap1/ann/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap1/ann/cc_onco978.ann
300
300


In [None]:
ann_bilstm1 = read_ann(files_ann_bilstm1)

In [None]:
print("Number of clinical cases in the BILSTM-CRF approach 1 prediction dataset: %d" %len(ann_bilstm1))

columns = ["clinical_case", "Entity_ID", "code", "Entity"]
frames = []

for cc in range(len(ann_bilstm1)): #300 clinical cases
  ann_bilstm1_2 = [np.hstack((cc+1,ann_bilstm1[cc][j])) for j in range(len(ann_bilstm1[cc]))]
  frames.append(pd.DataFrame(ann_bilstm1_2, columns = columns))

# DataFrame.append was removed in pandas 2.0; concatenate all cases at once
df_ann_bilstm1 = pd.concat(frames)

Number of clinical cases in the BILSTM-CRF approach 1 prediction dataset: 300


In [None]:
df_ann_bilstm1

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,1,T1,MORFOLOGIA_NEOPLASIA 794 806,neoformación\n
1,1,T2,MORFOLOGIA_NEOPLASIA 882 894,metastásicas\n
2,1,T3,MORFOLOGIA_NEOPLASIA 1115 1147,adenocarcinoma bien diferenciado\n
3,1,T4,MORFOLOGIA_NEOPLASIA 1590 1602,neoformación\n
4,1,T5,MORFOLOGIA_NEOPLASIA 1678 1690,metastásicas\n
...,...,...,...,...
5,300,T6,MORFOLOGIA_NEOPLASIA 1218 1246,carcinoma ductal infiltrante\n
6,300,T7,MORFOLOGIA_NEOPLASIA 1464 1474,metástasis\n
7,300,T8,MORFOLOGIA_NEOPLASIA 1884 1910,cáncer de mama metastásico\n
8,300,T9,MORFOLOGIA_NEOPLASIA 3350 3386,Carcinoma de mama ductal infiltrante\n


In [None]:
P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_bilstm1)

In [None]:
print("Precision: %f" %P)
print("Recall: %f" %R)
print("F1 score: %f" %F1)

Precision: 0.771000
Recall: 0.773000
F1 score: 0.772000


In [None]:
df_mismatch_FN, df_mismatch_FP = calculate_errors(df_ann_test_true, df_ann_bilstm1)
# df_mismatch_FN: gold standard entities that were not found in the predictions
# df_mismatch_FP: predicted entities that are not in the gold standard;
# usually the entity was found with a different number of words than in the gold standard

# merge
df_mismatch_FN_data = df_mismatch_FN.merge(df_ann_test_true[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_GS'], right_on=['clinical_case','code'])

df_mismatch_FP_data = df_mismatch_FP.merge(df_ann_bilstm1[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_PRED'], right_on=['clinical_case','code'])

df_mismatch_FN_data = df_mismatch_FN_data.drop(['code'],axis = 1)
df_mismatch_FP_data = df_mismatch_FP_data.drop(['code'],axis = 1)

df_mismatch_FN_data['start'] = df_mismatch_FN_data["code_GS"].str.split(" ", expand = True)[1] 
df_mismatch_FP_data['start'] = df_mismatch_FP_data["code_PRED"].str.split(" ", expand = True)[1] 

df_mismatch = df_mismatch_FN_data.merge(df_mismatch_FP_data[['clinical_case','code_PRED', 'Entity','start']], 
                                             how = 'outer', left_on=['clinical_case','start'], right_on=['clinical_case','start'])

print(df_mismatch.isnull().sum())

print("\nTotal number of entities in the gold standard files: %d" %len(df_ann_test_true))
print("Total number of entities in the prediction files: %d" %len(df_ann_bilstm1))

print("Total number of mismatches %d" %len(df_mismatch))

print("Total number of unrecognized entities %d" %df_mismatch.isnull().sum()['code_PRED'])
print("Total number of false positives entities %d" %df_mismatch.isnull().sum()['code_GS'])

matches = df_ann_test_true.merge(df_ann_bilstm1[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'inner', left_on=['clinical_case','code'], right_on=['clinical_case','code'])
print("\nTotal number of matches %d" %len(matches))

print("Total number of FN %d" %len(df_mismatch_FN))
print("Total number of FP %d" %len(df_mismatch_FP))

clinical_case      0
code_GS          493
Entity_ID        493
Entity_x         493
start              0
code_PRED        474
Entity_y         474
dtype: int64

Total number of entities in the gold standard files: 3633
Total number of entities in the prediction files: 3643
Total number of mismatches 1319
Total number of unrecognized entities 474
Total number of false positives entities 493

Total number of matches 2807
Total number of FN 826
Total number of FP 836


In [None]:
#df_mismatch.to_excel(path+'mismatch_bilstm1.xlsx', index = False)

### **BILSTM-CRF approach 2 predictions**

**Read the prediction .ann files**

We want to read them in the same order as the gold standard dataset.

In [None]:
path_bilstm2 = 'drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap2/ann/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_bilstm2 = len(path_bilstm2)
len_path_test_true = len(path_test_true)

files_ann_bilstm2 = []

for i in range(len(files_ann_test_true)):
  name_ann = files_ann_test_true[i][len_path_test_true:] # these are the names of the ann files, e.g. 'cc_onco1006.ann'
  new_name_ann = path_bilstm2 + name_ann

  files_ann_bilstm2.append(new_name_ann)


In [None]:
print(files_ann_test_true[0])
print(files_ann_bilstm2[0])
print(files_ann_test_true[-1])
print(files_ann_bilstm2[-1])

print(len(files_ann_test_true))
print(len(files_ann_bilstm2))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap2/ann/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap2/ann/cc_onco978.ann
300
300


In [None]:
ann_bilstm2 = read_ann(files_ann_bilstm2)

In [None]:
print("Number of clinical cases in the BILSTM-CRF approach 2 prediction dataset: %d" %len(ann_bilstm2))

columns = ["clinical_case", "Entity_ID", "code", "Entity"]
frames = []

for cc in range(len(ann_bilstm2)): #300 clinical cases
  ann_bilstm2_2 = [np.hstack((cc+1,ann_bilstm2[cc][j])) for j in range(len(ann_bilstm2[cc]))]
  frames.append(pd.DataFrame(ann_bilstm2_2, columns = columns))

# DataFrame.append was removed in pandas 2.0; concatenate all cases at once
df_ann_bilstm2 = pd.concat(frames)

Number of clinical cases in the BILSTM-CRF approach 2 prediction dataset: 300


In [None]:
df_ann_bilstm2

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,1,T1,MORFOLOGIA_NEOPLASIA 794 806,neoformación\n
1,1,T2,MORFOLOGIA_NEOPLASIA 882 894,metastásicas\n
2,1,T3,MORFOLOGIA_NEOPLASIA 1115 1147,adenocarcinoma bien diferenciado\n
3,1,T4,MORFOLOGIA_NEOPLASIA 1590 1602,neoformación\n
4,1,T5,MORFOLOGIA_NEOPLASIA 1678 1690,metastásicas\n
...,...,...,...,...
5,300,T6,MORFOLOGIA_NEOPLASIA 1218 1246,carcinoma ductal infiltrante\n
6,300,T7,MORFOLOGIA_NEOPLASIA 1464 1477,metástasis M1\n
7,300,T8,MORFOLOGIA_NEOPLASIA 1884 1910,cáncer de mama metastásico\n
8,300,T9,MORFOLOGIA_NEOPLASIA 3350 3399,Carcinoma de mama ductal infiltrante estadio I...


In [None]:
P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_bilstm2)

In [None]:
print("Precision: %f" %P)
print("Recall: %f" %R)
print("F1 score: %f" %F1)

Precision: 0.828000
Recall: 0.769000
F1 score: 0.797000


In [None]:
df_mismatch_FN, df_mismatch_FP = calculate_errors(df_ann_test_true, df_ann_bilstm2)
# df_mismatch_FN: gold standard entities that were not found in the predictions
# df_mismatch_FP: predicted entities that are not in the gold standard;
# usually the entity was found with a different number of words than in the gold standard

# merge
df_mismatch_FN_data = df_mismatch_FN.merge(df_ann_test_true[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_GS'], right_on=['clinical_case','code'])

df_mismatch_FP_data = df_mismatch_FP.merge(df_ann_bilstm2[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_PRED'], right_on=['clinical_case','code'])

df_mismatch_FN_data = df_mismatch_FN_data.drop(['code'],axis = 1)
df_mismatch_FP_data = df_mismatch_FP_data.drop(['code'],axis = 1)

df_mismatch_FN_data['start'] = df_mismatch_FN_data["code_GS"].str.split(" ", expand = True)[1] 
df_mismatch_FP_data['start'] = df_mismatch_FP_data["code_PRED"].str.split(" ", expand = True)[1] 

df_mismatch = df_mismatch_FN_data.merge(df_mismatch_FP_data[['clinical_case','code_PRED', 'Entity','start']], 
                                             how = 'outer', left_on=['clinical_case','start'], right_on=['clinical_case','start'])

print(df_mismatch.isnull().sum())

print("\nTotal number of entities in the gold standard files: %d" %len(df_ann_test_true))
print("Total number of entities in the prediction files: %d" %len(df_ann_bilstm2))

print("Total number of mismatches %d" %len(df_mismatch))

print("Total number of unrecognized entities %d" %df_mismatch.isnull().sum()['code_PRED'])
print("Total number of false positives entities %d" %df_mismatch.isnull().sum()['code_GS'])

matches = df_ann_test_true.merge(df_ann_bilstm2[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'inner', left_on=['clinical_case','code'], right_on=['clinical_case','code'])
print("\nTotal number of matches %d" %len(matches))

print("Total number of FN %d" %len(df_mismatch_FN))
print("Total number of FP %d" %len(df_mismatch_FP))

clinical_case      0
code_GS          270
Entity_ID        270
Entity_x         270
start              0
code_PRED        518
Entity_y         518
dtype: int64

Total number of entities in the gold standard files: 3633
Total number of entities in the prediction files: 3372
Total number of mismatches 1111
Total number of unrecognized entities 518
Total number of false positives entities 270

Total number of matches 2792
Total number of FN 841
Total number of FP 580


In [None]:
#df_mismatch.to_excel(path+'mismatch_bilstm2.xlsx', index = False)

### **BILSTM-CRF approach 3 predictions**

**Read the prediction .ann files**

We want to read them in the same order as the gold standard dataset.

In [None]:
path_bilstm3 = 'drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap3/ann/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_bilstm3 = len(path_bilstm3)
len_path_test_true = len(path_test_true)

files_ann_bilstm3 = []

for i in range(len(files_ann_test_true)):
  name_ann = files_ann_test_true[i][len_path_test_true:] # these are the names of the ann files, e.g. 'cc_onco1006.ann'
  new_name_ann = path_bilstm3 + name_ann

  files_ann_bilstm3.append(new_name_ann)


In [None]:
print(files_ann_test_true[0])
print(files_ann_bilstm3[0])
print(files_ann_test_true[-1])
print(files_ann_bilstm3[-1])

print(len(files_ann_test_true))
print(len(files_ann_bilstm3))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap3/ann/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann
drive/My Drive/Ejemplos NER - TFM/results_BILSTM_ap3/ann/cc_onco978.ann
300
300


In [None]:
ann_bilstm3 = read_ann(files_ann_bilstm3)

In [None]:
print("Number of clinical cases in the BILSTM-CRF approach 3 prediction dataset: %d" %len(ann_bilstm3))

columns = ["clinical_case", "Entity_ID", "code", "Entity"]
frames = []

for cc in range(len(ann_bilstm3)): #300 clinical cases
  ann_bilstm3_2 = [np.hstack((cc+1,ann_bilstm3[cc][j])) for j in range(len(ann_bilstm3[cc]))]
  frames.append(pd.DataFrame(ann_bilstm3_2, columns = columns))

# DataFrame.append was removed in pandas 2.0; concatenate all cases at once
df_ann_bilstm3 = pd.concat(frames)

Number of clinical cases in the BILSTM-CRF approach 3 prediction dataset: 300


In [None]:
df_ann_bilstm3

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,1,T1,MORFOLOGIA_NEOPLASIA 794 806,neoformación\n
1,1,T2,MORFOLOGIA_NEOPLASIA 882 894,metastásicas\n
2,1,T3,MORFOLOGIA_NEOPLASIA 1115 1147,adenocarcinoma bien diferenciado\n
3,1,T4,MORFOLOGIA_NEOPLASIA 1590 1602,neoformación\n
4,1,T5,MORFOLOGIA_NEOPLASIA 1678 1690,metastásicas\n
...,...,...,...,...
6,300,T7,MORFOLOGIA_NEOPLASIA 1464 1474,metástasis\n
7,300,T8,MORFOLOGIA_NEOPLASIA 1884 1910,cáncer de mama metastásico\n
8,300,T9,MORFOLOGIA_NEOPLASIA 1912 1915,CMm\n
9,300,T10,MORFOLOGIA_NEOPLASIA 3350 3399,Carcinoma de mama ductal infiltrante estadio I...


In [None]:
P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_bilstm3)

In [None]:
print("Precision: %f" %P)
print("Recall: %f" %R)
print("F1 score: %f" %F1)

Precision: 0.784000
Recall: 0.759000
F1 score: 0.771000


In [None]:
df_mismatch_FN, df_mismatch_FP = calculate_errors(df_ann_test_true, df_ann_bilstm3)
# df_mismatch_FN: gold standard entities that were not found in the predictions
# df_mismatch_FP: predicted entities that are not in the gold standard;
# usually the entity was found with a different number of words than in the gold standard

# merge
df_mismatch_FN_data = df_mismatch_FN.merge(df_ann_test_true[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_GS'], right_on=['clinical_case','code'])

df_mismatch_FP_data = df_mismatch_FP.merge(df_ann_bilstm3[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_PRED'], right_on=['clinical_case','code'])

df_mismatch_FN_data = df_mismatch_FN_data.drop(['code'],axis = 1)
df_mismatch_FP_data = df_mismatch_FP_data.drop(['code'],axis = 1)

df_mismatch_FN_data['start'] = df_mismatch_FN_data["code_GS"].str.split(" ", expand = True)[1] 
df_mismatch_FP_data['start'] = df_mismatch_FP_data["code_PRED"].str.split(" ", expand = True)[1] 

df_mismatch = df_mismatch_FN_data.merge(df_mismatch_FP_data[['clinical_case','code_PRED', 'Entity','start']], 
                                             how = 'outer', left_on=['clinical_case','start'], right_on=['clinical_case','start'])

print(df_mismatch.isnull().sum())

print("\nTotal number of entities in the gold standard files: %d" %len(df_ann_test_true))
print("Total number of entities in the prediction files: %d" %len(df_ann_bilstm3))

print("Total number of mismatches %d" %len(df_mismatch))

print("Total number of unrecognized entities %d" %df_mismatch.isnull().sum()['code_PRED'])
print("Total number of false positives entities %d" %df_mismatch.isnull().sum()['code_GS'])

matches = df_ann_test_true.merge(df_ann_bilstm3[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'inner', left_on=['clinical_case','code'], right_on=['clinical_case','code'])
print("\nTotal number of matches %d" %len(matches))

print("Total number of FN %d" %len(df_mismatch_FN))
print("Total number of FP %d" %len(df_mismatch_FP))

clinical_case      0
code_GS          422
Entity_ID        422
Entity_x         422
start              0
code_PRED        525
Entity_y         525
dtype: int64

Total number of entities in the gold standard files: 3633
Total number of entities in the prediction files: 3521
Total number of mismatches 1296
Total number of unrecognized entities 525
Total number of false positives entities 422

Total number of matches 2759
Total number of FN 874
Total number of FP 762
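The counts printed above are enough to reproduce the reported scores by hand. The sketch below assumes that `calculate_metrics` returns micro-averaged precision, recall and F1 over all clinical cases; that assumption is not confirmed in this section, but the reported values are consistent with it.

```python
# Counts copied from the output above (BiLSTM-CRF approach 3).
n_gold = 3633   # entities in the gold standard files
n_pred = 3521   # entities in the prediction files
n_match = 2759  # exact matches on (clinical_case, code)

fn = n_gold - n_match  # gold entities without an exact match
fp = n_pred - n_match  # predicted entities without an exact match

precision = n_match / n_pred
recall = n_match / n_gold
f1 = 2 * precision * recall / (precision + recall)

print(fn, fp)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The derived FN and FP counts (874 and 762) agree with the lengths of `df_mismatch_FN` and `df_mismatch_FP`, and the rounded scores agree with the precision, recall and F1 printed earlier.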


In [None]:
#df_mismatch.to_excel(path+'mismatch_bilstm3.xlsx', index = False)

### **BERT predictions**

**Read the prediction .ann files**

The prediction files are read in the same order as the gold standard files, so that a given clinical case number refers to the same document in both datasets.

In [None]:
path_bert = 'drive/My Drive/Ejemplos NER - TFM/results_bert/annotations/'
path_test_true = 'drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/'
len_path_bert = len(path_bert)
len_path_test_true = len(path_test_true)

files_ann_bert = []

for i in range(len(files_ann_test_true)):
  name_ann = files_ann_test_true[i][len_path_test_true:] # these are the names of the .ann files, e.g. 'cc_onco1006.ann'
  new_name_ann = path_bert + name_ann

  files_ann_bert.append(new_name_ann)


In [None]:
print(files_ann_test_true[0])
print(files_ann_bert[0])
print(files_ann_test_true[-1])
print(files_ann_bert[-1])

print(len(files_ann_test_true))
print(len(files_ann_bert))

drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/results_bert/annotations/cc_onco1006.ann
drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann
drive/My Drive/Ejemplos NER - TFM/results_bert/annotations/cc_onco978.ann
300
300
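The same mapping from gold standard paths to prediction paths can be written without slicing by directory length. This is a sketch of an alternative, not the notebook's method; the function name `prediction_paths` is hypothetical.

```python
import os


def prediction_paths(gold_files, pred_dir):
    """Map each gold standard .ann path to the same file name inside pred_dir."""
    return [os.path.join(pred_dir, os.path.basename(f)) for f in gold_files]


# Two of the gold standard paths printed above.
gold = [
    "drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco1006.ann",
    "drive/My Drive/Ejemplos NER - TFM/test_set/cantemist-ner/cc_onco978.ann",
]
pred = prediction_paths(gold, "drive/My Drive/Ejemplos NER - TFM/results_bert/annotations/")
print(pred[0])
```

Using `os.path.basename` keeps the result correct even if the gold standard directory string changes length or loses its trailing slash.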


In [None]:
ann_bert = read_ann(files_ann_bert)

In [None]:
print("Number of clinical cases in the BERT prediction dataset: %d" %len(ann_bert))

df_ann_bert = pd.DataFrame(columns = ["clinical_case", "Entity_ID", "code", "Entity"])

for cc in range(len(ann_bert)): #300 clinical cases
  # prepend the 1-based clinical case number to every annotation of this case
  ann_bert2 = [np.hstack((cc+1,ann_bert[cc][j])) for j in range(len(ann_bert[cc]))]
  df = pd.DataFrame(ann_bert2, columns = ["clinical_case", "Entity_ID", "code", "Entity"])

  # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
  df_ann_bert = pd.concat([df_ann_bert, df])

Number of clinical cases in the BERT prediction dataset: 300


In [None]:
df_ann_bert

Unnamed: 0,clinical_case,Entity_ID,code,Entity
0,1,T1,MORFOLOGIA_NEOPLASIA 794 806,neoformación\n
1,1,T2,MORFOLOGIA_NEOPLASIA 882 894,metastásicas\n
2,1,T3,MORFOLOGIA_NEOPLASIA 1115 1147,adenocarcinoma bien diferenciado\n
3,1,T4,MORFOLOGIA_NEOPLASIA 1590 1602,neoformación\n
4,1,T5,MORFOLOGIA_NEOPLASIA 1678 1690,metastásicas\n
...,...,...,...,...
7,300,T8,MORFOLOGIA_NEOPLASIA 1596 1614,afectación anexial\n
8,300,T9,MORFOLOGIA_NEOPLASIA 1884 1910,cáncer de mama metastásico\n
9,300,T10,MORFOLOGIA_NEOPLASIA 1912 1915,CMm\n
10,300,T11,MORFOLOGIA_NEOPLASIA 3350 3386,Carcinoma de mama ductal infiltrante\n


In [None]:
P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_bert)

In [None]:
print("Precision: %f" %P)
print("Recall: %f" %R)
print("F1 score: %f" %F1)

Precision: 0.756000
Recall: 0.775000
F1 score: 0.765000


In [None]:
df_mismatch_FN, df_mismatch_FP = calculate_errors(df_ann_test_true, df_ann_bert)
# df_mismatch_FN: gold standard entities that were not found in the predictions
# df_mismatch_FP: predicted entities that are not in the gold standard,
# usually because the predicted span has a different number of words than the gold standard one

# merge
df_mismatch_FN_data = df_mismatch_FN.merge(df_ann_test_true[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_GS'], right_on=['clinical_case','code'])

df_mismatch_FP_data = df_mismatch_FP.merge(df_ann_bert[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'left', left_on=['clinical_case','code_PRED'], right_on=['clinical_case','code'])

df_mismatch_FN_data = df_mismatch_FN_data.drop(['code'],axis = 1)
df_mismatch_FP_data = df_mismatch_FP_data.drop(['code'],axis = 1)

df_mismatch_FN_data['start'] = df_mismatch_FN_data["code_GS"].str.split(" ", expand = True)[1] 
df_mismatch_FP_data['start'] = df_mismatch_FP_data["code_PRED"].str.split(" ", expand = True)[1] 

df_mismatch = df_mismatch_FN_data.merge(df_mismatch_FP_data[['clinical_case','code_PRED', 'Entity','start']], 
                                             how = 'outer', left_on=['clinical_case','start'], right_on=['clinical_case','start'])

print(df_mismatch.isnull().sum())

print("\nTotal number of entities in the gold standard files: %d" %len(df_ann_test_true))
print("Total number of entities in the prediction files: %d" %len(df_ann_bert))

print("Total number of mismatches %d" %len(df_mismatch))

print("Total number of unrecognized entities %d" %df_mismatch.isnull().sum()['code_PRED'])
print("Total number of false positives entities %d" %df_mismatch.isnull().sum()['code_GS'])

matches = df_ann_test_true.merge(df_ann_bert[['clinical_case','Entity_ID','code', 'Entity']], 
                                             how = 'inner', left_on=['clinical_case','code'], right_on=['clinical_case','code'])
print("\nTotal number of matches %d" %len(matches))

print("Total number of FN %d" %len(df_mismatch_FN))
print("Total number of FP %d" %len(df_mismatch_FP))

clinical_case      0
code_GS          574
Entity_ID        574
Entity_x         574
start              0
code_PRED        470
Entity_y         470
dtype: int64

Total number of entities in the gold standard files: 3633
Total number of entities in the prediction files: 3728
Total number of mismatches 1390
Total number of unrecognized entities 470
Total number of false positives entities 574

Total number of matches 2817
Total number of FN 816
Total number of FP 911


In [None]:
#df_mismatch.to_excel(path+'mismatch_bert.xlsx', index = False)

### **Comparison**

In [None]:
# Evaluate every model against the gold standard and collect one row per model.
models = [('Crf', df_ann_crf),
          ('Bilstm-crf approach 1', df_ann_bilstm1),
          ('Bilstm-crf approach 2', df_ann_bilstm2),
          ('Bilstm-crf approach 3', df_ann_bilstm3),
          ('Bert', df_ann_bert)]

rows = []
for name, df_ann_pred in models:
  P_per_cc, P, R_per_cc, R, F1_per_cc, F1 = calculate_metrics(df_ann_test_true, df_ann_pred)
  rows.append([name, P, R, F1])

# Building the DataFrame from a list of rows keeps the scores numeric
# and avoids DataFrame.append, which was removed in pandas 2.0.
df_results = pd.DataFrame(rows, columns=['Model', 'Precision', 'Recall', 'F1 score'])


In [None]:
df_results

Unnamed: 0,Model,Precision,Recall,F1 score
0,Crf,0.8,0.768,0.783
1,Bilstm-crf approach 1,0.771,0.773,0.772
2,Bilstm-crf approach 2,0.828,0.769,0.797
3,Bilstm-crf approach 3,0.784,0.759,0.771
4,Bert,0.756,0.775,0.765
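Sorting the table by F1 score makes the ranking explicit. The sketch below rebuilds the table from the values reported above, so that it runs on its own, and then orders the models; the variable names `ranked` and `best` are illustrative.

```python
import pandas as pd

# Scores copied from the comparison table above.
rows = [
    {"Model": "Crf", "Precision": 0.800, "Recall": 0.768, "F1 score": 0.783},
    {"Model": "Bilstm-crf approach 1", "Precision": 0.771, "Recall": 0.773, "F1 score": 0.772},
    {"Model": "Bilstm-crf approach 2", "Precision": 0.828, "Recall": 0.769, "F1 score": 0.797},
    {"Model": "Bilstm-crf approach 3", "Precision": 0.784, "Recall": 0.759, "F1 score": 0.771},
    {"Model": "Bert", "Precision": 0.756, "Recall": 0.775, "F1 score": 0.765},
]
df = pd.DataFrame(rows)

# Rank the models from highest to lowest exact-match F1 score.
ranked = df.sort_values("F1 score", ascending=False).reset_index(drop=True)
best = ranked.loc[0, "Model"]
print(ranked)
print("Best F1:", best)
```

On these figures, BiLSTM-CRF approach 2 has the highest precision and F1 score, while BERT has the highest recall but the lowest precision.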
