# Topic 2.7: Validation in medical image analysis

This notebook combines theory with questions to support the understanding of validation metrics in medical image analysis. Use available markdown sections to fill in your answers to questions as you proceed through the notebook.

**Contents:** <br>

1. [Validation (concepts)](#validation)<br>

    1.1 [Quality characteristics](#quality_characteristics)<br>
    
    1.2 [Ground truth](#ground_truth)<br>
    
    1.3 [Measures of quality](#quality_measures)<br>
    
    
2. [Common limitations of performance metrics in biomedical image analysis](#limitations)

<div id='validation'></div>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/read_ico.png" width="42" height="42"></div> 

## 1. Validation (concepts)

Validation of a medical image analysis method is the estimation of the correctness of its results, based on tests on a representative sample set. Before validation, suitable data must be selected, comparison measures must be chosen, and a norm (e.g. [ground-truth](#ground_truth), explained below) must be defined. Validation can also provide information about our method relative to another method used to generate the same results. Remember that it is mandatory to document the validation procedure in detail, together with a well-founded justification of the selected measures, because this allows potential new users of the method to investigate the validity of the arguments used to build the validation scenario.

<div id='quality_characteristics'></div>

### 1.1 Quality characteristics
There are a number of quality characteristics used in validation of medical image analysis methods: 

#### Accuracy

Accuracy determines the deviation of results from a known ground truth. It is computed via a measure of quality ([section 1.3](#quality_measures)) that compares the results with some norm. For decisions with two possible outcomes, accuracy can be calculated as the fraction of correct decisions (true positives and true negatives) among all decisions: $A = \frac{TP + TN}{TP+FP+FN+TN}$.
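As a minimal sketch, the accuracy formula above can be evaluated directly from the four counts. The function name `accuracy` and the example counts are illustrative assumptions, not part of the lecture material.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correct decisions (TP and TN) among all decisions."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one decision is required")
    return (tp + tn) / total


# Hypothetical counts: 40 TP, 10 FP, 5 FN, 45 TN
example_accuracy = accuracy(tp=40, fp=10, fn=5, tn=45)
print(example_accuracy)  # 0.85
```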
    
#### Precision, reproducibility, reliability, replicability 

These characteristics measure the extent to which equal or similar input produces equal or similar results. Reliable methods produce output within a given range of variation (e.g. in terms of appearance). A method is reproducible if two runs with the exact same input and setup produce the exact same results. A method is replicable if two runs with the same input and the same setup arrive at similar conclusions.

#### Robustness 

Robustness of a method characterizes the change of the quality of an analysis result if conditions deviate from assumptions made for analysis (e.g., when noise level increases or if object appearance deviates from prior assumptions). Robust methods are insensitive to variations outside of the given range (e.g. wrong parametrization). 

#### Efficiency

The effort which must be exerted to achieve an analysis result is described by efficiency. You may recall that there are semi-automated methods that require some degree of human interaction or expert knowledge. These factors contribute to the overall determination of a method's efficiency.

#### Fault detection

The ability to discover possible faults while an analysis method is being applied is called fault detection. It is a very useful feature, because it requires the method to test for reliability of its own results. 

<div id='ground_truth'></div>

### 1.2 Ground truth

You may remember one of the lecture slides with the following statement: _"In medical image analysis, the truth is difficult to come by, since the reason for producing images in the first place was to gather information about the human body that cannot be accessed otherwise."_. 

Ground truth is a conceptual term relating to knowledge of the truth about a specific question (the “ideal expected result”). In validation, all measures of quality for an analysis method require comparing the method's results with the true information. Ground truth data can be either real or artificial. In either case, it is never completely certain whether the selected data are representative of the desired ground truth. 

#### Ground truth from real data
Ground truth based on real data can be created by applying the currently established best method to it, if such a method exists at all. An example is the use of mutual information and spline-based non-rigid registration for registering MR brain images. A frequently encountered problem is proving that the conditions under which the method is applied are comparable with those under which it is considered an established standard. Moreover, implementations of established methods are rarely available, although more and more of them are now open-source or integrated into widely used, freely downloadable software packages.  

If an established method is missing, human experts may help produce ground truth data through data annotation. This approach requires a lot of effort both from the method's developer and from the expert, who has to carry out the analysis on several datasets and document the findings. It is sometimes desirable to have the same expert analyse the dataset(s) multiple times to measure intra-observer variability and thereby increase the significance of the results. The developer must provide a sufficiently good user interface so that the quality of the input component does not bias the expert. It may also be beneficial to ask several experts and measure inter-observer variability. In that case, it is crucial to define what counts as agreement (e.g. agreement by all observers or by the majority of observers).

[STAPLE](https://pubmed.ncbi.nlm.nih.gov/15250643/) (Simultaneous Truth and Performance Level Estimation) is an algorithm for the validation of image segmentation that estimates a reference standard from a set of segmentations.
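The agreement rules mentioned above (agreement by all observers or by the majority) can be sketched with simple per-pixel voting. This is a deliberately simple stand-in and not the STAPLE algorithm, which additionally estimates a performance level for each observer. The function name `consensus_mask` and the three annotations are hypothetical.

```python
def consensus_mask(annotations, min_votes):
    """Mark a pixel as foreground if at least `min_votes` observers marked it.

    `annotations` is a list of equally long 0/1 sequences, one per observer.
    """
    n_pixels = len(annotations[0])
    if any(len(a) != n_pixels for a in annotations):
        raise ValueError("all annotations must have the same length")
    votes = [sum(a[i] for a in annotations) for i in range(n_pixels)]
    return [1 if v >= min_votes else 0 for v in votes]


# Three hypothetical observers annotating the same six pixels
observers = [
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
majority = consensus_mask(observers, min_votes=2)   # agreement by the majority
unanimous = consensus_mask(observers, min_votes=3)  # agreement by all
print(majority)   # [0, 1, 1, 1, 0, 0]
print(unanimous)  # [0, 1, 1, 0, 0, 0]
```

The two rules already disagree on the fourth pixel, which illustrates why the agreement criterion has to be fixed before the reference is built.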

#### Ground truth from phantoms

Phantoms can be used as ground truth as well. They are classified as follows:

_Based on real data_

- cadaver phantoms (human or animal; e.g. CT and MRI slices generated in the [Visible Human Project](http://vhp.med.umich.edu/))
- artificial hardware phantoms

_Based on simulated data_

- software phantoms representing the reconstructed image or the imaged measurement distribution
- mathematical simulations (e.g. Shepp-Logan phantom)

Phantoms are characterized by specific properties (material, measurement properties, influences from image reconstruction, shape properties), which determine the tasks they are suited for. Phantoms are only useful for validation when the expected results are known for them: for a detection task, a number of locations must be specified, and for registration tasks, fiducial markers have to be implanted, for example. Material and measurement properties are often idealized. Image artefacts are typically simulated, e.g. by adding zero-mean Gaussian noise to simulate detector noise, by smoothing the data to evoke partial volume effects, or by including artificial shading to model signal fluctuations. 
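The three artefact simulations mentioned above can be sketched on a one-dimensional intensity profile. This is a pure-Python illustration under simplifying assumptions: the function names, the moving-average filter used for smoothing, the linear shading ramp and all parameter values are hypothetical choices, not a description of how any particular phantom is built.

```python
import random


def box_smooth(signal, radius=1):
    """Moving average (window truncated at the borders) to blur sharp edges."""
    n = len(signal)
    smoothed = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def add_shading(signal, strength=0.2):
    """Multiply by a linear ramp running from 1 - strength to 1 + strength."""
    steps = max(len(signal) - 1, 1)
    return [v * (1 - strength + 2 * strength * i / steps)
            for i, v in enumerate(signal)]


def add_gaussian_noise(signal, sigma, seed=0):
    """Add zero-mean Gaussian noise; the seed makes the result repeatable."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in signal]


# Idealized phantom profile: background (0) next to an object (100)
phantom = [0.0] * 5 + [100.0] * 5
blurred = box_smooth(phantom)            # partial volume effect at the edge
shaded = add_shading(blurred)            # slowly varying signal fluctuation
simulated = add_gaussian_noise(shaded, sigma=2.0, seed=0)  # detector noise
```

Because the underlying `phantom` profile is known exactly, any result computed on `simulated` can be compared against it.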

The advantage of a software phantom is that it is more straightforward to account for anatomical variation by creating several phantoms with different shapes, unlike in hardware phantoms, where anatomical variation can hardly be modelled. Examples of software phantoms include the [BrainWeb phantom](www.bic.mni.mcgill.ca/brainweb/); the [Field II ultrasound simulation program](http://server.electro.dtu.dk/personal/jaj/field/); the [XCAT](http://dmip1.rad.jhmi.edu/xcat/) phantom; or the dynamic [MCAT](http://www.bme.unc.edu/*wsegars/index.html) heart phantom simulating a moving heart.

<div id='quality_measures'></div>

### 1.3 Measures of quality

Quality is determined by the kind of analysis which has been conducted on a dataset:

| Task         | Quality measure                                                           |
|:--------------|:---------------------------------------------------------------------------:|
| Delineation  | Correspondence between the delineated object and a reference segmentation |
| Detection    | Ratio between correct and incorrect decisions                             |
| Registration | Deviation from the correct registration transformation                    |

#### Quality for a delineation task

When delineating an object in an image, a measure of comparison between some reference $g$ (usually a ground truth) and the delineated object $f$ is required. Mutual correspondence may be determined by calculating volumetric overlap, overlap between object and background, or by performing distance measurements (of boundary deviations). In 3D cases, volumetric measurements count the number of voxels in both the delineated object and the reference norm, weighted by the volume covered by each voxel. 

Overlaps between objects $f$ and $g$ can be calculated by measures that count over-segmentation (number of elements in $F \setminus G$) and under-segmentation (number of elements in $G \setminus F$). 

The measures most often used for quality assessment are the _Dice similarity coefficient_ (DSC), a.k.a. _Sørensen–Dice coefficient_ ($d$), and the _Jaccard index_ ($j$), a.k.a. _intersection over union_ (IoU), which can also be expressed in terms of the DSC:

\begin{equation}
d = \frac{2|F\cap\,G|}{|F|+|G|}\,\,, \quad \mathrm{and}
\end{equation}

\begin{equation}
j = \frac{|F\cap\,G|}{|F\cup\,G|} = \frac{d}{2 - d}\,\,,
\end{equation}

where $|F\cap\,G|$ is the number of elements (voxels) in the overlap, and $|F|$, $|G|$ are the sizes of the individual volumes. Each coefficient is equal to 1 in case of perfect correspondence; otherwise it is smaller than 1. In the medical image analysis community, the Dice coefficient is more popular, and is therefore also found more often in the literature.
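A minimal sketch of both overlap measures, with the masks represented as sets of pixel coordinates. The function names, the two example masks and the convention of returning 1 for two empty masks are assumptions made for this illustration.

```python
def dice(f, g):
    """Dice similarity coefficient of two sets of pixel/voxel coordinates."""
    f, g = set(f), set(g)
    if not f and not g:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return 2 * len(f & g) / (len(f) + len(g))


def jaccard(f, g):
    """Jaccard index (intersection over union) of two coordinate sets."""
    f, g = set(f), set(g)
    if not f and not g:
        return 1.0  # same assumed convention as in dice()
    return len(f & g) / len(f | g)


# Hypothetical 4x4 squares given as (row, column) pairs, shifted by one pixel
F = {(r, c) for r in range(0, 4) for c in range(0, 4)}
G = {(r, c) for r in range(1, 5) for c in range(1, 5)}

d = dice(F, G)     # overlap 9, sizes 16 + 16  ->  18/32 = 0.5625
j = jaccard(F, G)  # overlap 9, union 23       ->  9/23
print(d, j, d / (2 - d))
```

The last printed value equals `j` up to floating-point rounding, which is the relation between the two coefficients in numerical form.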

Neither the Dice nor the Jaccard index can capture outliers, which matter, for example, in tasks where organ boundaries are delineated as part of access planning in surgery. In minimally invasive procedures, it is crucial to determine the deviation of the delineated boundary from the true boundary. This can be done with the _Hausdorff distance_ (HD) between $F$ and $G$: for every point in one set, the shortest distance $d$ to the other set is determined, and the HD is the maximum of all these shortest distances, taken in both directions. Since this measure is highly sensitive to image artefacts, the quantile Hausdorff distance is often used instead, in which the largest outlier distances are disregarded. It is computed from a quantile $t_q$ of the histogram of distances from $F$ to $G$ and from $G$ to $F$:

$$
\begin{equation}
h^{q} = \mathrm{max}(t_{q}(d(f,G)),t_{q}(d(g,F)))
\end{equation}
$$
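The quantile Hausdorff distance can be sketched in pure Python for small point sets. The function names, the nearest-rank quantile rule and the two example boundaries are assumptions for illustration; with `q=1.0` the function reduces to the ordinary Hausdorff distance.

```python
import math


def directed_distances(a, b):
    """For every point in `a`, the distance to its nearest point in `b`."""
    return [min(math.dist(p, q) for q in b) for p in a]


def quantile(values, q):
    """Nearest-rank quantile (one of several possible quantile definitions)."""
    ordered = sorted(values)
    # the small offset guards against floating-point round-off in q * n
    index = max(0, math.ceil(q * len(ordered) - 1e-9) - 1)
    return ordered[index]


def hausdorff(f_points, g_points, q=1.0):
    """Maximum of the q-quantiles of both directed distance lists."""
    d_fg = directed_distances(f_points, g_points)
    d_gf = directed_distances(g_points, f_points)
    return max(quantile(d_fg, q), quantile(d_gf, q))


# Hypothetical boundaries: G runs parallel to F at distance 1,
# except for a single outlier point at (6, 9)
F = [(0, i) for i in range(10)]
G = [(1, i) for i in range(9)] + [(6, 9)]

print(hausdorff(F, G))         # 6.0, dominated by the single outlier
print(hausdorff(F, G, q=0.9))  # 1.0, the outlier is disregarded
```

The brute-force nearest-point search is quadratic in the number of points, so it is only meant for small examples like this one.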

#### Quality for a detection task

In detection tasks, an object is either found or not found, and it is either present in or absent from the data. The quality of detection is measured by _sensitivity_ (a.k.a. recall or true positive rate) and _specificity_ (a.k.a. true negative rate), which are based on four kinds of outcomes:

- True positives (TP) are objects that are present in the data and were correctly detected. 
- True negatives (TN) are objects that are absent from the data and were correctly reported as absent. 
- False positives (FP) are objects that are absent from the data but were reported as present. 
- False negatives (FN) are objects that are present in the data but were reported as absent.

Sensitivity can be calculated as $\frac{\mathrm{TP}}{\mathrm{TP + FN}}$, while specificity is defined as $\frac{\mathrm{TN}}{\mathrm{TN + FP}}$. A good detection method produces as many TPs and TNs as possible. FPs (e.g. tumor detected, though absent) and FNs (e.g. tumor overlooked) may have different consequences, and are therefore treated as two separate types of error (type-I error and type-II error, respectively). The so-called _confusion matrix_ lists the detection results of a two-class classification problem in an organized way: 
<br>
<br>
<center width="100%"><img src="../reader/assets/confusion_matrix.jpg" width="300"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>
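As a minimal sketch, the four entries of the confusion matrix and the two derived measures can be computed from paired lists of true and predicted labels. The function names and the ten example cases are hypothetical.

```python
def confusion_counts(truth, predicted):
    """Return (TP, FP, FN, TN) for paired binary labels (1 = present)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    return tp, fp, fn, tn


def sensitivity(tp, fn):
    """Fraction of present objects that were detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of absent objects reported as absent: TN / (TN + FP)."""
    return tn / (tn + fp)


# Ten hypothetical cases: 4 with a tumor present, 6 without
truth     = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(truth, predicted)
print(tp, fp, fn, tn)       # 3 1 1 5
print(sensitivity(tp, fn))  # 0.75
print(specificity(tn, fp))  # 5/6, about 0.833
```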

These metrics are commonly used in detection tasks involving medical images. Interestingly, they are also very important when interpreting the performance of any test (e.g. airport security, breast cancer screening, quality assurance in companies, etc.).

In practice, a trade-off between specificity and sensitivity is often targeted. In detection tasks, this trade-off is visualized by the _receiver operating characteristic_ (ROC) curve, which plots sensitivity against 1 − specificity as the decision threshold is varied. The ROC curve can also serve as a measure of human operator performance when several operators performed the same task.
<br>
<br>
<center width="100%"><img src="../reader/assets/roc_curve.jpg" width="300"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>
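An ROC curve can be sketched by sweeping a decision threshold over the scores of a detector and recording sensitivity and 1 − specificity at each step. The function names, the example scores and the trapezoidal area computation are assumptions for this illustration.

```python
def roc_points(truth, scores):
    """Return (1 - specificity, sensitivity) pairs, one per threshold."""
    positives = sum(1 for t in truth if t)
    negatives = len(truth) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(truth, scores) if t and s >= threshold)
        fp = sum(1 for t, s in zip(truth, scores) if not t and s >= threshold)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as sorted (x, y) pairs."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical detector scores (higher means "more likely present")
truth  = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(truth, scores)
auc = area_under_curve(curve)
print(round(auc, 3))  # 0.889
```

A curve that hugs the upper-left corner, and hence an area close to 1, indicates that high sensitivity can be reached without giving up much specificity.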

#### Quality for a registration task

As you already know from the [Introduction to image registration](../reader/1.1_Registration_geometrical-transformations.ipynb), registration aims to find a transformation that maps an $n$-dimensional image onto another one. If the registered objects have different dimensionalities, the transformation includes a projection step from the higher-dimensional scene to the lower-dimensional scene. 

- _Direct measurement_: If the true transformation is known, the quality of a registration method can be measured directly (average deviation from the known transformation parameters together with detection of outliers), based on comparisons between vector fields (in non-rigid registration) or differences in global rotation and translation (in rigid registration). 
- _Indirect measurement_: If the true transformation is unknown, we typically swap the moving and the fixed (still) image and register again. The two resulting transformations should then be inverses of each other. Another option is to compare the locations of fiducial markers after registration (it is essential that the point pairs used for this comparison have not been used for computing the registration transformation). 
<br>
<br>
<center width="100%"><img src="../reader/assets/quality_measures_registration_tasks.png" width="500"></center>

<font size="1">Figure from [Guide to Medical Image Analysis - Methods and Algorithms](https://link.springer.com/book/10.1007/978-1-4471-2751-2)</font>
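Both kinds of measurement can be sketched for a 2D rigid transformation. Everything here is a hypothetical setup: the parameterization as (rotation angle in degrees, translation in x, translation in y), the function names, the marker positions and the assumed registration result.

```python
import math


def rigid_transform(point, angle_deg, tx, ty):
    """Rotate a 2D point about the origin, then translate it."""
    angle = math.radians(angle_deg)
    x, y = point
    return (
        math.cos(angle) * x - math.sin(angle) * y + tx,
        math.sin(angle) * x + math.cos(angle) * y + ty,
    )


def parameter_deviation(estimated, reference):
    """Direct measure: absolute differences of (angle, tx, ty)."""
    return tuple(abs(e - r) for e, r in zip(estimated, reference))


def mean_marker_error(estimated, moving_markers, fixed_markers):
    """Indirect measure: mean distance between mapped and fixed markers."""
    errors = [
        math.dist(rigid_transform(m, *estimated), f)
        for m, f in zip(moving_markers, fixed_markers)
    ]
    return sum(errors) / len(errors)


true_params = (10.0, 2.0, -1.0)       # known only in a phantom experiment
estimated_params = (10.0, 2.5, -1.0)  # assumed output of some registration

# Markers held out from the registration itself
moving_markers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
fixed_markers = [rigid_transform(m, *true_params) for m in moving_markers]

deviation = parameter_deviation(estimated_params, true_params)
marker_error = mean_marker_error(estimated_params, moving_markers, fixed_markers)
print(deviation)     # (0.0, 0.5, 0.0)
print(marker_error)  # approximately 0.5
```

In this example the only error is a translation offset of 0.5 along x, so every marker is displaced by the same amount. A rotation error would instead produce marker errors that grow with the distance from the rotation centre.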

<div id='limitations'></div>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/read_ico.png" width="42" height="42"></div> 

## 2. Common limitations of performance metrics used for segmentation tasks

Recent meta-analytical research has identified major flaws in algorithm validation. Most of these flaws are related to the practical use of performance metrics in a given analysis task. One of the core issues in medical image analysis is the choice of inappropriate metrics [Maier-Hein, L. et al. (2018)](https://www.nature.com/articles/s41467-018-07619-7). The same publication reports that, judging by international challenges, image segmentation is the most popular of all medical image processing tasks. In these competitions, the chosen metrics significantly influence the rankings of the participating methods, and researchers lack guidelines for choosing the right metric for a given problem. 

It is important to understand the mathematical properties of a metric before applying it to a given task. Segmentation of small structures, such as brain lesions (e.g. in multiple sclerosis), is often evaluated with Dice scores, which may not be an appropriate metric because the pathological outlines are often unknown and inter-observer variability is high in such tasks. The predictions of two algorithms may differ by only one pixel, yet the impact on the Dice score is substantial (see figure above).
<br>
<br>
<center width="100%"><img src="../reader/assets/small_structure_segmentation.jpg" width="600"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>
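The sensitivity of the Dice score to a single pixel can be checked with a few lines of arithmetic. In this sketch, a prediction misses exactly one pixel of the reference and contains no false positives; the function name and the three reference sizes are hypothetical.

```python
def dice_from_sizes(overlap, size_prediction, size_reference):
    """Dice score computed from pixel counts only."""
    return 2 * overlap / (size_prediction + size_reference)


dice_by_size = {}
for reference_size in (2, 4, 400):
    prediction_size = reference_size - 1  # exactly one reference pixel missed
    overlap = prediction_size             # every predicted pixel is correct
    dice_by_size[reference_size] = dice_from_sizes(
        overlap, prediction_size, reference_size
    )
    print(reference_size, round(dice_by_size[reference_size], 3))
# prints:
# 2 0.667
# 4 0.857
# 400 0.999
```

The same one-pixel error costs a third of the score for a two-pixel structure but is barely visible for a 400-pixel structure.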

Similar issues may arise in the presence of image artefacts such as noise or errors in reference annotations. As seen in the figure below, a single erroneous pixel in the reference annotation may lead to a large performance decrease.
<br>
<br>
<center width="100%"><img src="../reader/assets/noise_effect_segmentation.jpg" width="400"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>

Overlap-based metrics are incapable of discovering differences in shape, which may have a huge impact, e.g. on radiotherapy applications. Completely different predictions may therefore lead to the exact same DSC value.
<br>
<br>
<center width="100%"><img src="../reader/assets/shape_unawareness.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>

In applications where over- and under-segmentation matter, the DSC does not represent these errors reliably, while the HD is invariant to them.
<br>
<br>
<center width="100%"><img src="../reader/assets/over_under_segmentation.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>

Segmentation metrics such as the DSC are commonly applied to detection and localization problems as well. In general, the DSC tends to be strongly biased against single objects, which is why its application in detection tasks should be avoided. An example where the DSC underperforms can be seen below.
<br>
<br>
<center width="100%"><img src="../reader/assets/detection_performance.jpg" width="500"></center>

<font size="1">Figure from [Common Limitations of Image Processing
Metrics: A Picture Story](https://arxiv.org/abs/2104.05642)</font>

Metrics are typically aggregated over all test cases to produce an overall ranking. However, this can be problematic when values are missing (NA): ignoring the missing values can lead to a substantially higher DSC or a varying HD compared to setting them to zero. Moreover, a single metric usually does not reflect all features that are important for algorithm validation. Though combining multiple metrics helps mitigate this problem, it has to be kept in mind that some metrics are mathematically related to each other, such as DSC and intersection over union (IoU). Combining such related metrics will therefore not change the ranking.
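Both points can be illustrated with a short sketch. The per-case values, the method names and the use of `None` for a missing result are hypothetical. The ranking comparison is made on individual DSC values, where the order is preserved because the conversion from DSC to IoU is strictly increasing.

```python
def mean_ignoring_missing(values):
    """Average over the cases for which a value exists."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def mean_missing_as_zero(values):
    """Average over all cases, counting a missing value as 0."""
    return sum(0.0 if v is None else v for v in values) / len(values)


def iou_from_dice(d):
    """IoU expressed through the DSC: d / (2 - d)."""
    return d / (2 - d)


# Hypothetical per-case DSC values; None marks a case without a result
dsc_per_case = [0.9, 0.8, None, 0.7, None]
print(round(mean_ignoring_missing(dsc_per_case), 2))  # 0.8
print(round(mean_missing_as_zero(dsc_per_case), 2))   # 0.48

# Hypothetical DSC values of three methods on the same case
dsc_by_method = {"A": 0.90, "B": 0.75, "C": 0.60}
rank_by_dsc = sorted(dsc_by_method, key=dsc_by_method.get, reverse=True)
rank_by_iou = sorted(
    dsc_by_method,
    key=lambda m: iou_from_dice(dsc_by_method[m]),
    reverse=True,
)
print(rank_by_dsc == rank_by_iou)  # True
```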

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 1*:

Describe a situation where volume computation would be an appropriate criterion for measuring the quality of a delineation task. When should it not be used?

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 2*:
What information about delineation quality is revealed by the Hausdorff distance? Please describe a scenario where this measure is important to rate a delineation method.

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 3*:
What must be ensured when selecting test data for ground truth?

<font style="color:red">Type your answer here</font>

<div style="float:right;margin:-5px 5px"><img src="../reader/assets/question_ico.png" width="42" height="42"></div>

### *Question 4*:
Why is it necessary to carry out manual segmentation several times, both by different persons and by the same person, if it is to be used as ground truth? How is the information gained from these multiple segmentations used for rating the performance of an algorithm?

<font style="color:red">Type your answer here</font>

## References

[1] Guide to Medical Image Analysis - Methods and Algorithms [LINK](https://link.springer.com/book/10.1007/978-1-4471-2751-2)

[2] Common Limitations of Image Processing Metrics: A Picture Story [LINK](https://arxiv.org/abs/2104.05642)