# Comparing an AI algorithm to a group of human readers

* The following URL provides background material for this demo:
  * [AI evaluation chapter](https://dpc10ster.github.io/RJafrocFrocBook/standalone-cad-radiologists.html)

# Glossary of terms

* AI: artificial intelligence algorithm that analyzes images to identify targets of interest.
* Treatment and modality are used interchangeably.
* Reader, radiologist or algorithmic observer are also used interchangeably.
* Rating: level of suspicion, recorded on an ordinal scale, with higher values associated with increasing confidence in presence of disease. 
* In the ROC paradigm each case interpretation results in **one** rating.
* In the FROC paradigm each case interpretation results in **zero or more** marked suspicious regions, each of which yields a rating reflecting confidence in presence of disease at that region.
* FROC data consist of mark-rating pairs, where the number of mark-rating pairs on each case is unpredictable a priori.
* K1 = number of non-diseased cases
* K2 = number of diseased cases
* K = K1 + K2 = total number of cases
* I = number of modalities
* J = number of readers
* FOM: the figure of merit, in my opinion the most important concept in the analysis; for a given treatment and reader it is a **scalar** measure of performance, usually in the range 0 to 1, with higher values corresponding to better performance.
* Significance testing: the analytic procedure used to determine if differences in FOM are statistically significant.
* RRRC, random-reader random-case, i.e., the analysis allows for random variability associated with readers and random variability associated with cases; some readers are better than others, some are worse; some cases are more difficult to interpret than others, etc.
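
To make the idea of a scalar FOM concrete, here is a minimal sketch of the Wilcoxon statistic (the empirical area under the ROC curve) for ROC ratings. It assumes the `"Wilcoxon"` option used in this demo refers to this standard statistic; the function name `wilcoxon_fom` and the ratings are hypothetical, and this is not the library's own implementation.

```python
def wilcoxon_fom(nondiseased, diseased):
    """Fraction of (non-diseased, diseased) case pairs in which the diseased
    case gets the higher rating; ties count as one half."""
    total = 0.0
    for x in nondiseased:
        for y in diseased:
            if y > x:
                total += 1.0
            elif y == x:
                total += 0.5
    return total / (len(nondiseased) * len(diseased))


# Hypothetical ratings on a 1-5 ordinal scale (illustration only).
k1_ratings = [1, 2, 2, 3]  # K1 = 4 non-diseased cases
k2_ratings = [2, 4, 5]     # K2 = 3 diseased cases

fom = wilcoxon_fom(k1_ratings, k2_ratings)
print(round(fom, 4))
```

A value of 0.5 corresponds to chance-level ordering of the two case groups and 1 to perfect separation.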

In [1]:
import numpy as np
from StSignificanceTesting import StSignificanceTestingCadVsRad
from DfReadDataFile import DfReadDataFile

## Read the Excel file

In [2]:
ds1 = DfReadDataFile("extdata/NicoRadRoc.xlsx")
statsRoc = StSignificanceTestingCadVsRad(ds1, FOM = "Wilcoxon")

In [9]:
statsRoc

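| | Rad | CAD | Rad-CAD | VarRad | VarError | Cov2 | MSden1T | stdErr1T | ddf1T | Tstat | p val | CI_lo | CI_hi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.848663 | 0.816927 | 0.031736 | 0.000878 | 0.0014 | 0.000924 | 0.009195 | 0.031964 | 877.886378 | 0.992869 | 0.321047 | -0.030999 | 0.094471 |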


In [6]:
(-0.030999 + 0.094471)/2 # average of the CI limits equals Rad-CAD

0.031736
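
The midpoint check suggests the interval is symmetric about the observed difference. As a sketch, and assuming (this is not confirmed by the output itself) that the interval is built as `Rad-CAD ± t_crit * stdErr1T`, the multiplier can be backed out from the reported numbers. The variable names are illustrative; the values are copied from the ROC row above.

```python
# Values copied from the ROC output row.
diff = 0.031736      # Rad-CAD
std_err = 0.031964   # stdErr1T
ci_lo, ci_hi = -0.030999, 0.094471

midpoint = (ci_lo + ci_hi) / 2      # should reproduce Rad-CAD
half_width = (ci_hi - ci_lo) / 2    # distance from the midpoint to either limit

# Under the assumed construction, half_width = t_crit * std_err.
implied_t_crit = half_width / std_err

print(round(midpoint, 6), round(half_width, 6), round(implied_t_crit, 3))
```

The implied multiplier comes out close to 1.96, which is what one would expect for a two-sided 95% interval when the degrees of freedom (`ddf1T`) are large.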

In [3]:
ds2 = DfReadDataFile("extdata/CadFrocData.xlsx", DataType = "FROC")
statsFroc = StSignificanceTestingCadVsRad(ds2, FOM = "wAfroc")

In [4]:
statsFroc

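| | Rad | CAD | Rad-CAD | VarRad | VarError | Cov2 | MSden1T | stdErr1T | ddf1T | Tstat | p val | CI_lo | CI_hi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.791968 | 0.628125 | 0.163843 | 0.001528 | 0.00277 | 0.0021 | 0.020423 | 0.047637 | 1429.873398 | 3.439408 | 0.0006 | 0.070397 | 0.257288 |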


In [5]:
(0.070397 + 0.257288)/2 # average of the CI limits equals Rad-CAD

0.1638425

## Comments

* The FROC dataset yields a significant difference between the average radiologist and CAD (p = 0.0006).
* The ROC dataset does not (p = 0.321).
* The main reason for this is the much larger observed difference, average radiologist minus CAD, in the FROC dataset:
  * 0.031736 for ROC dataset
  * 0.163843 for FROC dataset
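
The comparison can be made explicit by expressing each difference in units of its standard error. The sketch below uses only numbers copied from the two output rows; the dictionary layout and variable names are illustrative choices.

```python
# Values copied from the ROC and FROC output rows.
rows = {
    "ROC":  {"diff": 0.031736, "std_err": 0.031964, "ci_lo": -0.030999, "ci_hi": 0.094471},
    "FROC": {"diff": 0.163843, "std_err": 0.047637, "ci_lo": 0.070397, "ci_hi": 0.257288},
}

ratios = {}
excludes_zero = {}
for name, r in rows.items():
    # Difference measured in standard errors; should match the Tstat column.
    ratios[name] = r["diff"] / r["std_err"]
    # A confidence interval that does not contain zero indicates significance.
    excludes_zero[name] = r["ci_lo"] > 0 or r["ci_hi"] < 0
    print(name, round(ratios[name], 3), excludes_zero[name])

# How much larger the FROC difference and standard error are than the ROC ones.
diff_ratio = rows["FROC"]["diff"] / rows["ROC"]["diff"]
se_ratio = rows["FROC"]["std_err"] / rows["ROC"]["std_err"]
print(round(diff_ratio, 2), round(se_ratio, 2))
```

The FROC difference is roughly five times the ROC difference, while its standard error is only about one and a half times as large, so the larger difference dominates the outcome of the test.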