%matplotlib inline

# MULTI-DISEASE DETECTION IN RETINAL IMAGING - an overview of this paper
---
---
---

## Introduction

***Author:*** Atanas Kuzmanov

***Date:*** 2022-February-20

*This is an article developed as a scientific notebook for an exam project assignment for a Deep Learning course from an Artificial Intelligence module.*

*One of the aims of this article is to understand some Deep Learning (DL) basics, more specifically to understand Neural Networks (NNs) and how to improve them, so we can create models, train them, test them and extract predictions and information we might be interested in.*


*This paper is a retelling and an overview of the original paper, where most of the contents deemed important have been left unchanged:*


**MULTI-DISEASE DETECTION IN RETINAL IMAGING**

**BASED ON ENSEMBLING HETEROGENEOUS DEEP LEARNING MODELS**

*Dominik Müller1, Iñaki Soto-Rey1,2 and Frank Kramer1*

1 IT-Infrastructure for Translational Medical Research, University of
Augsburg, Germany

2 Medical Data Integration Center, University Hospital Augsburg, Germany

Published: `[v1] Fri, 26 Mar 2021 18:02:17 UTC (757 KB)`

`References:`


Paper:

- MULTI-DISEASE DETECTION IN RETINAL IMAGING BASED ON ENSEMBLING HETEROGENEOUS DEEP LEARNING MODELS
[[Reference]](#MULTI-DISEASE-DETECTION-IN-RETINAL-IMAGING-PDF)

- Multi-Disease Detection in Retinal Imaging - papers with code
[[Reference]](#Multi-Disease-Detection-in-Retinal-Imaging---papers-with-code)

Code:

- Multi-Disease Detection in Retinal Imaging - GitHub
[[Reference]](#Multi-Disease-Detection-in-Retinal-Imaging---GitHub)

- AUCMEDI - A Framework for Automated Classification of Medical Images
[[Reference]](#AUCMEDI---A-Framework-for-Automated-Classification-of-Medical-Images)

Data:

- RETINAL FUNDUS MULTI-DISEASE IMAGE DATASET (RFMID)
[[Reference]](#RETINAL-FUNDUS-MULTI-DISEASE-IMAGE-DATASET-(RFMID))

- RFMiD Train Dataset - kaggle
[[Reference]](#RFMiD-Train-Dataset---kaggle)

Other:

- Retinal Image Analysis for multi-Disease Detection Challenge website
[[Reference]](#Retinal-Image-Analysis-for-multi-Disease-Detection-Challenge-website)

- IEEE ISBI 2021 International Symposium on Biomedical Imaging April 13-16 2021
[[Reference]](#IEEE-ISBI-2021-International-Symposium-on-Biomedical-Imaging-April-13-16-2021)

---

## Disclaimer

_My experience of developing, debugging, training, fitting and testing models in Google Colab has been poor and inefficient. I do not mean to say that Google Colab itself is bad or inefficient; it may well be that I am doing something wrong or using it in the wrong way._

_As a result I have not had sufficient computing power at my disposal for proper hyperparameters._

_If I had, I could have chosen a reasonable `batch size` and used calculations to work out how many `steps` to perform for training and validation in each of our `epochs`, based on the number of `batches` in our dataset. That would have been a suitable starting point for good or even state-of-the-art results, followed by hyperparameter tuning of those and other hyperparameters such as `learning_rate`._
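
_To illustrate the kind of calculation I mean, here is a minimal sketch. The split sizes and the batch size are assumed values chosen only for the example; they are not taken from the paper or from my experiments, and `steps_for` is a helper name I made up._

```python
import math


def steps_for(num_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (the last batch may be partial)."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    return math.ceil(num_samples / batch_size)


# Hypothetical 80/20 split of the 1920 training images.
num_train, num_val = 1536, 384
batch_size = 32  # assumed value, not taken from the paper

steps_per_epoch = steps_for(num_train, batch_size)
validation_steps = steps_for(num_val, batch_size)
print(steps_per_epoch, validation_steps)
```

_Values computed this way are what one would typically pass as `steps_per_epoch` and `validation_steps` when fitting a Keras model._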

_As proof of this, please refer to the `screenshots/` directory, located in the [Google Drive Colab Notebooks folder](https://drive.google.com/drive/folders/1wKcqaW1y31s5UXe2BEvwsfcIlvJMKqeK?usp=sharing) (see the [Google Drive](#Google-Drive) section), and check out my pains, struggles and frustrations with running anything in Google Colab._

_Things to look out for in the screenshots:_


- _Check the computer clock and see what time it is between screenshots (the screenshots are also automatically timestamped) of executing one cell, and see **how ridiculously long** it takes to execute. I mean this for normal cells which either process a batch from the dataset or try to train a model, not some cell which is executing something obscure that ends up in an infinite loop or crashes Google Colab._


- _Check out how often Google Colab has crashed on me for no particular reason._


- _Check out how often the Google Colab session has been taken away from me, before I could save my model or before it has reached a checkpoint so it would save itself._

- _Check out how often Google Colab has become unresponsive for long periods of time, sometimes hours, until I have had to manually interrupt my session, lose everything in my session as a result and start over._

- _The GPU allocation would be removed as a capability after about an hour's worth of work in Google Colab._

- _Google Drive and Google Colab synchronization problems._

- _... etc._


_Because of the above, I lost more than half of my time on this project struggling with Google Colab issues, rather than actually working on and developing the project._

---

## Setup and Running this Notebook

_This is a Jupyter Notebook developed and meant to be run in `Google Colab` [[Reference]](#Google-Colab)._

_It is quite feature packed and might take a while to load, depending on the machine you are running it on. Please allow sufficient time for all of it to run all the way, until the last LaTeX formula, Markdown, Python, graphs, plots, images, etc. have loaded and executed. This is also valid if you use `Kernel -> Restart & Run All`; however, that option is not meant to be used for this notebook, as some of the cells are Deep Learning model experiments which run for hours, and running them again will lose some of the results from the experiments._

_If you would like to try and run the notebook in an idempotent way I suggest that you comment out the lines of code which fit any models, and only use the lines of code which load pre-trained saved models and saved history._

### Requirements

- Tensorflow 2.7.0
- Keras 2.7.0

### Google Drive

_Because this notebook is developed as a scientific notebook for an exam project assignment for a Deep Learning course from an Artificial Intelligence module, in order to be assessed you will need to access and download the contents of this shared `Google Drive` folder in order to be able to run it:_

___Note: The contents of this folder are around 10GB (around 5GB of which is for the models I have managed to train).___

[Google Drive Project Shared folder - softuni-ai-dl-project-2022-paper-2](https://drive.google.com/drive/folders/1mG5QXZIMtxP7V41hPM31wm6cUmy8gh7_?usp=sharing)

https://drive.google.com/drive/folders/1mG5QXZIMtxP7V41hPM31wm6cUmy8gh7_?usp=sharing

_In case there is a problem with the links above, here is a link to my entire `Google Drive Colab Notebooks folder`:_

[Google Drive Colab Notebooks folder](https://drive.google.com/drive/folders/1wKcqaW1y31s5UXe2BEvwsfcIlvJMKqeK?usp=sharing)

https://drive.google.com/drive/folders/1wKcqaW1y31s5UXe2BEvwsfcIlvJMKqeK?usp=sharing

_The contents should look like this or similar:_

```
.ipynb_checkpoints/                         --> MacOS folder for Jupyter Notebook checkpoints for this notebook
aucmedi-downloads-from-zendo/               --> Incomplete downloads - models download is broken
aucmedi-lib/                                --> aucmedi-lib - original unmodified library
Evaluation_Set/                             --> Evaluation dataset
paper2-experiments-notebook-1-bkps/         --> Backup copies of the experiments notebook
paper2-experiments-notebook-1.ipynb         --> The final version of the experiments notebook
riadd.aucmedi-repository/                   --> ORIGINAL REPO WITH MODIFIED EXPERIMENTED DEBUGGED SCRIPTS
  |__DEBUG-FILES                            --> DIR WITH DEBUG FILES FROM THE MODIFIED SCRIPTS FROM THE REPO
Screens1/                                   --> Additional screenshots
SoftUni-AI-DL-project-paper-2-2022/         --> PAPER DIRECTORY
  |__SoftUni-AI-DL-project-paper-2-2022-VERSION.ipynb  --> FINAL VERSION OF THIS ACTUAL PAPER
  |__2103.14660v1-resources/                           --> PAPER RESOURCES
  |__SoftUni-AI-DL-project-paper-2-2022-1-bkps/.       --> BACKUP COPIES OF THIS PAPER
Training_Set/                               --> Copy of modified dataset for relevant experiments notebook
Training_Set-orig-full/                     --> Original full dataset
../screenshots/                             --> Screenshots
```


---

## Paper Summary

_Preventable or undiagnosed visual impairment and blindness affect billions of people worldwide. Automated multi-disease detection models offer great potential to address this problem via clinical decision support in diagnosis. In this work, we proposed an innovative multi-disease detection pipeline for retinal imaging which utilizes ensemble learning to combine the predictive capabilities of several heterogeneous deep convolutional neural network models. Our pipeline includes state-of-the-art strategies like transfer learning, class weighting, real-time image augmentation and Focal loss utilization. Furthermore, we integrated ensemble learning techniques like heterogeneous deep learning models, bagging via 5-fold cross-validation and stacked logistic regression models. Through internal and external evaluation, we were able to validate and demonstrate high accuracy and reliability of our pipeline, as well as the comparability with other state-of-the-art pipelines for retinal disease prediction._

_The implemented medical image classification pipeline in this paper can be summarized in the following core steps:_

- _Stratified multi-label 5-fold cross-validation_

- _Class weighted Focal loss and up-sampling_

- _Extensive real-time image augmentation_

- _Multiple deep learning model architectures_

- _Ensemble learning strategies: bagging and stacking_

- _Individual training for multi-disease labels and disease risk detection utilizing transfer learning on ImageNet_

- _Stacked binary logistic regression models for distinct classification_

- _Retinal Imaging Dataset - The RFMiD dataset consists of 3200 retinal images for which 1920 images were used as training dataset. The fundus images were captured by three different fundus cameras having a resolution of 4288x2848 (277 images), 2048x1536 (150 images) and 2144x1424 (1493 images), respectively._


_The unique feature of this paper which captured my attention was the innovative use of ensembles, state-of-the-art strategies and pipelines. If you find this summary interesting, please continue reading for the rest of the paper._

___In addition to reading and researching this paper I had the idea that if I get it to run I would like to modify the Neural Network architectures and/or hyperparameters and try to change them in order to improve them.___

___This proved next to impossible due to the reasons explained in the [Disclaimer](#Disclaimer) section.___

___In addition to that the download with the paper for the original pre-trained models is broken and never finishes no matter how many times I tried it.___

___I wanted to modify the Neural Network architectures and/or hyperparameters in order to improve them, so I started by reducing the `dataset`, so that it would hopefully run in Google Colab. In reducing the dataset I still made sure to keep at least one sample of each class, as it is a multi-label problem.___

_I even created the following dictionary from the data for myself, in order to know where the first occurrence of each class is, to make sure I include at least one in my reduced dataset:_

```
COLUMN - CLASS = INDEX

---

C - DR = 2

D - ARMD = 7

E - MH = 5

F - DN = 14

G - MYA = 7

H - BRVO = 42

I - TSLN = 25

J - ERM = 10

K - LS = 6

L - MS = 16

M - CSR = 54

N - ODC = 5

O - CRVO = 59

P - TV = 19

Q - AH = 113

R - ODP = 28

S - ODE = 56

T - ST = 1222

U - AION = 200

V - PT = 148

W - RT = 36

X - RS = 181

Y - CRS = 31

Z - EDN = 28

AA - RPEC = 61

AB - MHL = 174

AC - RP = 631

AD - OTHER = 83
```

_I did consider doing oversampling, however this is still not implemented for a multi-label problem in imbalanced-learn: https://github.com/scikit-learn-contrib/imbalanced-learn_

___With regards as to where to find the relevant files, please refer to the [Google Drive](#Google-Drive) section above.___

___Please find the modified dataset in the following directory structure:___

```
Training_Set/                               --> Copy of modified dataset for relevant experiments notebook
```

___However I ran into a problem:___

_The problem was that, despite making sure that I have at least one sample of each class, I ended up with exceptions and errors further down the pipeline, stating that at least one of my label columns contains only zeros, meaning a class is missing:_

```
line 81, in compute_multilabel_weights
    weight = compute_class_weight(class_weight=method, classes=[0,1], y=ohe_array[:, i])

    raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y
```

_I even managed to trace the error from the `aucmedi` repo scripts, to the `aucmedi` library, to <https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/class_weight.py>, where the exception is raised, and the exception seems to be raised for the correct reason. The mystery remains as to why I get this problem after reducing the dataset in the first place._

_I did a lot of debugging, generating output in debug files etc. which can be found in the directories below._

_Here is the code from <https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/class_weight.py>:_

```
    elif class_weight == "balanced":
        # Find the weight of each class as present in y.
        le = LabelEncoder()
        y_ind = le.fit_transform(y)
        if not all(np.in1d(classes, le.classes_)):
            raise ValueError("classes should have valid labels that are in y")
```

_On the basis of that code, here is some of my debug code in this script `classifier_DenseNet201.py`:_

```
# Debug fragment: `y_train` is the one-hot encoded label array from the script.
import numpy as np
from sklearn import preprocessing

classes = [0, 1]
# Repeat sklearn's check for every label column of y_train.
for i in range(y_train.shape[1]):
    le = preprocessing.LabelEncoder()
    le.fit(y_train[:, i])
    # A column is only valid if both 0 and 1 occur in it.
    if not all(np.isin(classes, le.classes_)):
        print(">>> error:", classes, "not in", le.classes_)
        print(">>> i:", i)
        with np.printoptions(threshold=np.inf):
            # Note: mode 'w' keeps only the last failing column in the file.
            with open('error1.txt', 'w') as f:
                print(">>> error:", classes, "not in", le.classes_, file=f)
                print(">>> i:", i, file=f)
                print(">>> y_train[:, i]: ", y_train[:, i], file=f)
```

_And here is some output from my debug code, confirming that some label columns indeed contain only `0`s:_

```
>>> y_train shape:  (205, 28)
>>> error: [0, 1] not in [0]
>>> i: 17
>>> error: [0, 1] not in [0]
>>> i: 19
>>> error: [0, 1] not in [0]
>>> i: 26
Traceback (most recent call last):
  File "scripts/classifier_DenseNet201.py", line 228, in <module>
    class_weights = compute_multilabel_weights(ohe_array=y_train)
  File "/usr/local/lib/python3.7/dist-packages/aucmedi/utils/class_weights.py", line 81, in compute_multilabel_weights
    y=ohe_array[:, i])
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/class_weight.py", line 50, in compute_class_weight
    raise ValueError("classes should have valid labels that are in y")
ValueError: classes should have valid labels that are in y
```
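
_The failing check can be illustrated without any library. The sketch below uses toy data and a helper name of my own (`single_valued_columns`). It shows one way such an error can appear: a label with a single positive sample in the whole dataset is missing from any subset that does not contain that sample, for example a training fold of a cross-validation split. Whether this is what actually happened in my run is an assumption I have not verified._

```python
from typing import List, Sequence


def single_valued_columns(one_hot: Sequence[Sequence[int]]) -> List[int]:
    """Return indices of label columns that do not contain both 0 and 1."""
    if not one_hot:
        return []
    num_labels = len(one_hot[0])
    bad = []
    for j in range(num_labels):
        values = {row[j] for row in one_hot}
        if values != {0, 1}:
            bad.append(j)
    return bad


# Toy data: label 2 has exactly one positive sample (the last row).
full = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
train_subset = full[:3]  # a split that happens to exclude the last row

print(single_valued_columns(full))          # no degenerate columns
print(single_valued_columns(train_subset))  # label 2 is all zeros here
```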

___Please find the modified scripts in:___

```
riadd.aucmedi-repository/                   --> ORIGINAL REPO WITH MODIFIED EXPERIMENTED DEBUGGED SCRIPTS
  |__DEBUG-FILES                            --> DIR WITH DEBUG FILES FROM THE MODIFIED SCRIPTS FROM THE REPO
```

___Please see this script in particular:___

```
riadd.aucmedi-repository/
  |__scripts/
      |__classifier_DenseNet201.py
```

_I have done most of the code modifications for debugging the aforementioned problem in there. Please keep in mind this is only debug code, so it is not tidied up._

___Despite all of the aforementioned problems I did manage to do one complete experiment run, but with only some classifiers and detectors, not all due to the reasons explained in the [Disclaimer](#Disclaimer) section.___ 

___In order to do that I had to overcome the problem explained above, and I achieved it by replacing the loss function in the `classifier_DenseNet201.py` script:___

___I replaced this:___

```
#    loss=multilabel_focal_loss(class_weights),
```

___with:___

```
loss='categorical_crossentropy',
```

___With that I did get around the error, but the results did not look good which was to be expected.___

___Please see the following directories and files for the results of this experiment:___

```
riadd.aucmedi-repository/
  |__models
  |__preds
```

___An additional problem was the sacrifices I have had to make in reducing our hyperparameters for Google Colab training, due to the reasons explained in the [Disclaimer](#Disclaimer) section.___

___For additional context and visual representation of the problems explained above please see the screenshots in the following directories:___

```
Screens1/                                   --> Additional screenshots
../screenshots/                             --> Screenshots
```


---

## Notes

### Foreword

_One of the aims of this article is to understand some Deep Learning (DL) basics, basic concepts and intuitions._

_Because of that goal it is important to be able to train and test models multiple times, so we can determine the best hyperparameters and tune models to improve them. Unfortunately at the time of writing this article, in the year 2022, I am using my personal laptop from 2015. I thought this machine had fared rather well for its age, and I have great respect for it and what I have put it through, as we have been together through thick and thin. That was until I had to actually do DL on it for this article, when I realized it is not going to fare well for this purpose. Most of the models were unable to run or finish running once I tried to train them, or once I tried to change the hyperparameters to improve them. My machine would just heat up with fans running at the highest rpm and still seem stuck on executing a cell for more than 30 minutes. If I had carried on like this I would not have been able to finish this article, so instead most or all of the hyperparameters for the models are set severely low or high, depending on the context, in order to reduce iterations or features of the data, so that this notebook would work and I would be able to demonstrate or give an example of the idea I am trying to explain. Please keep this in mind when going through the article._

_This will be sufficient for the purpose of this article, just to demonstrate and help understand DL basics, basic concepts and intuitions, but if you need to try out some of the examples and extend and improve them for actual Deep Learning keep in mind you need a powerful machine with a good GPU, or you can use a cloud platform suitable for DL, such as Google Colab._

### References Notes

_Any and all references, citations, resources or other materials used to understand and explain, provide examples, and build this article have been referenced in order to give credit where credit is due and avoid plagiarism._
_If a citation is the bigger part of a section, and has been edited, added to, modified, etc. the reference to that section would be at the end of it, separated with a horizontal line, like this example:_

> ---
> [[Example Reference]](#ExampleReference)

_If a citation has been inserted and is relatively short, the relevant reference will be at the end of the sentence or paragraph, for example:_

> Example. [[Example Reference]](#ExampleReference)

_In case a reference is missed due to human error, all references can be found in the [References](#References) section. Anything which is found in the [References](#References) should be considered as a valid reference for everything in this paper, even if not explicitly referenced._

### Narrative

_I have tried to give the article a nice flow, ease of reading and a friendly and humorous tone, and at the same time clear and understandable communication. To aid this I have provided a narrative to this article. To distinguish it I have used italics for it throughout the article. Please consider any text in italics, such as the one you are currently reading, as narrative. It can also be both in bold and italics._

> _Example narrative._

### Code

_Currently most of the code in the article has been refactored into separate functions and most of the other code in the article is left fragmented throughout. There is a very good reason for this, which is that one of the aims of this article is to also understand a bit of Deep Learning. This is why the fragments of code throughout this article are used to help us and illustrate and demonstrate different parts of ML as a whole._

_Some of the code quality has been improved by making some functions idempotent with special checks, so that they have the same effect, no matter how many times they are run._

_Most of the commented out code in this article is left on purpose to serve as information, as part of the intent for this article is for it to be a knowledgebase._

### Table of Contents (TOC)

_Please refer to the [Table of Contents](#Table-of-Contents) section in [Appendix A](#Appendix-A) for instructions on how you can get a Table of Contents for this article in Jupyter Notebook._

### Running this Jupyter Notebook

_This Jupyter Notebook is quite feature packed and might take a while to load, depending on the machine you are running it on. Please allow sufficient time for all of it to run all the way, until the last LaTeX formula, Markdown, Python, graphs, plots, images, etc. have loaded and executed. This is also valid if you use `Kernel -> Restart & Run All`._

### Testing

#### Project tests

- _Any mathematics in the project for which I have had doubts or have not understood I have tested using Wolfram Alpha._

- _I have repeatedly run "Kernel -> Restart & Run All" to confirm all is working and have fixed bugs when things have been broken._

#### Code tests

- _There are code tests, however the focus of this notebook is not on code tests. Due to the nature of this notebook, being focused on ML, most of the tests of this notebook are actually metrics, scoring, score analysis, model testing and cross-validation._

- _There are tests in the project. Since code tests are outside of the focus of this project, most of the tests are visual printouts of the data and visual confirmations._

- _Most of the tests in this project are visual and are marked with this "`### Test`" comment above them._

- _There are also tests which are more functional and for example print a message if an assertion error is not thrown._

_I consider this amount of test coverage adequate for the purpose of this article._

---

## Abstract

Preventable or undiagnosed visual impairment and blindness affect
billions of people worldwide. Automated multi-disease detection models
offer great potential to address this problem via clinical decision
support in diagnosis. In this work, we proposed an innovative
multi-disease detection pipeline for retinal imaging which utilizes
ensemble learning to combine the predictive capabilities of several
heterogeneous deep convolutional neural network models. Our pipeline
includes state-of-the-art strategies like transfer learning, class
weighting, real-time image augmentation and Focal loss utilization.
Furthermore, we integrated ensemble learning techniques like
heterogeneous deep learning models, bagging via 5-fold cross-validation
and stacked logistic regression models. Through internal and external
evaluation, we were able to validate and demonstrate high accuracy and
reliability of our pipeline, as well as the comparability with other
state-of-the-art pipelines for retinal disease prediction.

> ***Index Terms---*** Retinal Disease Detection, Ensemble Learning,
> Class

## **1. INTRODUCTION**

Even if the medical progress in the last 30 years made it possible to
successfully treat the majority of diseases causing visual impairment,
growing and aging populations lead to an increasing challenge in retinal
disease diagnosis \[1\]. The World Health Organization (WHO) estimates
the prevalence of blindness and visual impairment at 2.2 billion people
worldwide, of which at least 1 billion cases could have been
prevented or are yet to be addressed \[2\]. Early detection and correct
diagnosis are essential to forestall disease course and prevent
blindness.

The use of clinical decision support (CDS) systems for diagnosis has
been increasing over the past decade \[3\]. Recently, modern deep
learning models allow automated and reliable classification of medical
images with remarkable accuracy comparable to physicians \[4\].
Nevertheless, these models often lack capabilities to detect rare
pathologies such as central retinal artery occlusion or anterior
ischemic optic neuropathy \[5\], \[6\].

In this study we push towards creating a highly accurate and reliable
multi-disease detection pipeline based on ensemble, transfer and deep
learning techniques. Furthermore, we utilize the new Retinal Fundus
Multi-Disease Image Dataset (RFMiD) containing various rare and
challenging conditions to demonstrate our detection capabilities for
uncommon diseases.

## **2. METHODS**

The implemented medical image classification pipeline can be
summarized in the following core steps and is illustrated in Fig. 1:

- Stratified multi-label 5-fold cross-validation
- Class weighted Focal loss and up-sampling
- Extensive real-time image augmentation
- Multiple deep learning model architectures
- Ensemble learning strategies: bagging and stacking
- Individual training for multi-disease labels and disease risk
  detection utilizing transfer learning on ImageNet
- Stacked binary logistic regression models for distinct
  classification

### **2.1. Retinal Imaging Dataset**

The RFMiD dataset consists of 3200 retinal images for which 1920
images were used as training dataset \[7\]. The fundus images were
captured by three different fundus cameras having a resolution of
4288x2848 (277 images), 2048x1536 (150 images) and 2144x1424 (1493
images), respectively.

**Tab. 1.** Annotation frequency for each class in the dataset.

![Tab1](2103.14660v1-resources/images/Tab1.png)

![image1](2103.14660v1-resources/2103.14660v1-paper-images/image1.png)

**Fig. 1**. Flowchart diagram of the implemented medical image
analysis pipeline for multi-disease detection in retinal imaging. The
workflow is starting with the retinal imaging dataset (RFMiD) and ends
with computed predictions for novel images.

The images were annotated with 46 conditions, including various rare and
challenging diseases, through adjudicated consensus of two senior
retinal experts. These 46 conditions are represented by the following
classes, which are also listed in Tab. 1: An overall normal/abnormal
class, 27 specific condition classes and 1 'OTHER' class consisting of
the remaining extremely rare conditions. Besides the training dataset,
the organizers of the RIADD challenge hold 1280 images back for external
validation and testing datasets to ensure robust evaluation \[7\],
\[8\].

### **2.2. Preprocessing and Image Augmentation**

In order to simplify the pattern finding process of the deep learning
model, as well as to increase data variability, we applied several
preprocessing methods.

We utilized extensive image augmentation for up-sampling to balance class distribution and real-time augmentation during
training to obtain novel and unique images in each epoch. The
augmentation techniques consisted of rotation, flipping, and altering in
brightness, saturation, contrast and hue. Through the up-sampling, it
was ensured that each label occurred at least 100 times in the dataset
which increased the total number of training images from 1920 to 3354.
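
_The up-sampling idea can be sketched as follows. This is my own simplified illustration, not the code from the paper: it only decides which samples to repeat until every label reaches a minimum count, whereas the paper creates the additional images through augmentation. The function name `upsampling_plan` and the toy labels are assumptions._

```python
from typing import List, Sequence


def upsampling_plan(one_hot: Sequence[Sequence[int]], min_count: int) -> List[int]:
    """Return indices of samples to repeat (as augmented copies) so that every
    label that occurs at all reaches at least `min_count` occurrences."""
    num_labels = len(one_hot[0])
    counts = [sum(row[j] for row in one_hot) for j in range(num_labels)]
    extra: List[int] = []
    for j in range(num_labels):
        carriers = [i for i, row in enumerate(one_hot) if row[j] == 1]
        k = 0
        while carriers and counts[j] < min_count:
            idx = carriers[k % len(carriers)]
            extra.append(idx)
            # A repeated sample raises the count of every label it carries.
            for m in range(num_labels):
                counts[m] += one_hot[idx][m]
            k += 1
    return extra


# Toy labels: label 0 occurs three times, label 1 only once.
labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
plan = upsampling_plan(labels, min_count=3)
print(plan)  # sample 3 is repeated twice
```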

Afterwards, all images were square padded in order to avoid aspect ratio
loss during posterior resizing. The retinal images were also cropped to
ensure that the fundus is center located in the image. The cropping was
performed individually for each microscope resolution and resulted in
the following image shapes: 1424x1424, 1536x1536 and 3464x3464 pixels.
The images were then resized to model input sizes according to the neural network architecture, which was
380x380 for EfficientNetB4, 299x299 for InceptionV3 and 244x244 for
all remaining architectures \[9\]--\[12\].
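
_A small sketch of the square padding arithmetic, to show why padding before resizing preserves the aspect ratio. The helper names are mine and are not taken from AUCMEDI; only the image resolutions come from the text above._

```python
from typing import Tuple


def square_padding(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) padding that makes the image square."""
    side = max(width, height)
    pad_w = side - width
    pad_h = side - height
    left, top = pad_w // 2, pad_h // 2
    return left, top, pad_w - left, pad_h - top


def axis_scales(width: int, height: int, target: int) -> Tuple[float, float]:
    """Scale factors per axis when resizing directly to target x target."""
    return target / width, target / height


# Resizing 2144x1424 directly to a square distorts the image:
print(axis_scales(2144, 1424, 380))   # the two factors differ
# After square padding both axes share one factor:
left, top, right, bottom = square_padding(2144, 1424)
padded_w, padded_h = 2144 + left + right, 1424 + top + bottom
print(axis_scales(padded_w, padded_h, 380))
```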

Before feeding the image to the deep convolutional neural network, we
applied value intensity normalization as last preprocessing step. The
intensities were zero-centered via the Z-Score normalization approach
based on the mean and standard deviation computed on the ImageNet
dataset \[13\].
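
_As an illustration of the Z-Score normalization, here is a minimal sketch for a single RGB pixel. The mean and standard deviation values are the commonly quoted ImageNet per-channel statistics for RGB values scaled to [0, 1]; that the pipeline uses exactly these numbers is my assumption._

```python
from typing import Sequence, Tuple

# Commonly quoted ImageNet statistics for RGB channels scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: Sequence[float]) -> Tuple[float, ...]:
    """Scale an 8-bit RGB pixel to [0, 1], then zero-center and divide by std."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


print(normalize_pixel((0, 0, 0)))        # all channels negative
print(normalize_pixel((255, 255, 255)))  # all channels positive
```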

### **2.3. Deep Learning Models**

Deep convolutional neural network models are the unmatched
state of the art for medical image classification \[4\], \[14\].
Nevertheless, the hyperparameter configuration and architecture
selection are highly dependent on the required computer vision task
and are the key difference between pipelines \[4\], \[15\]. Thus,
our pipeline combines two different types of image classification
models: The disease risk detector for binary classifying
normal/abnormal images and the disease label classifier for
multi-label annotation of abnormal images.

Both model types were pretrained on the ImageNet
dataset \[13\]. For the fitting process, we applied a transfer
learning training, with frozen architecture layers except for the
classification head, and a fine-tuning strategy with unfrozen layers.
Whereas the transfer learning fitting was performed for 10 epochs
using the Adam optimizer with an initial learning rate of 1e-4,
the fine-tuning had a maximal training time of 290 epochs and used a
dynamic learning rate for the Adam optimizer, starting at 1e-5 and
decreasing to a minimum of 1e-7 (decreasing factor of 0.1 after 8 epochs without improvement on the monitored validation loss)
\[16\]. Furthermore, an early stopping and model checkpoint technique
was utilized for the fine-tuning process, stopping after 20 epochs
without improvement (after epoch 60) and saving the best model measured
according to the validation loss. Instead of defining an epoch as a
cycle through the full training dataset, we establish an epoch to have
250 iterations. This allowed us to increase the number of seen batches and,
thus, to increase the information given to the model during the fitting
process of an epoch. As training loss function, we utilized the weighted
Focal loss from *Lin et al.* \[17\].

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{1}$$

In the above formula, $p_t$ is the probability for the correct ground
truth class $t$, $\gamma$ a tunable focusing parameter (which we set to 2.0)
and $\alpha_t$ the associated weight for class $t$.
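
_To get an intuition for formula (1), here is a minimal scalar sketch in plain Python. It evaluates the loss for a single probability and is not the `multilabel_focal_loss` implementation from AUCMEDI; the function name and the default weight of 1.0 are my own choices._

```python
import math


def focal_loss(p_t: float, alpha_t: float = 1.0, gamma: float = 2.0) -> float:
    """Focal loss for one sample, where p_t is the predicted probability
    of the correct ground truth class."""
    if not 0.0 < p_t <= 1.0:
        raise ValueError("p_t must be in (0, 1]")
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# With gamma = 0 the focal loss reduces to the weighted cross-entropy.
print(focal_loss(0.5, gamma=0.0), -math.log(0.5))
# With gamma = 2 well-classified samples (high p_t) are down-weighted strongly.
print(focal_loss(0.9), focal_loss(0.5), focal_loss(0.1))
```

_The factor $(1 - p_t)^{\gamma}$ shrinks the contribution of samples the model already classifies confidently, so the training signal concentrates on the hard, often rare, cases._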

#### *2.3.1 Disease Risk Detector*

The disease risk detector was established as a binary classifier of the
disease risk class for general categorizing between normal and abnormal
retinal images. Thus, this model type was trained using only the disease
risk class and ignoring all multi-label annotations. Rather than using a
single model architecture, we trained multiple models based on the
DenseNet201 and EfficientNetB4 architecture \[9\], \[10\]. For class
weight computation, we divided the number of samples by the
multiplication of the number of classes (2 for a binary classification)
with the number of class occurrences in the dataset.
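
_The class weight rule described above can be written down in a few lines. This is a sketch with toy labels and a function name of my own; the formula itself, number of samples divided by (number of classes times class occurrences), is the one stated in the text and matches the `balanced` heuristic in scikit-learn._

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable], num_classes: int = 2) -> Dict[Hashable, float]:
    """weight(c) = n_samples / (n_classes * occurrences of c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (num_classes * count) for cls, count in counts.items()}


# Toy binary labels: three normal (0) images and one abnormal (1) image.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)  # the rare class receives the larger weight
```

_Note that a class which never occurs gets no weight at all in this sketch, which is the same situation that makes the library check fail on a label column containing only zeros._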

#### *2.3.2 Disease Label Classifier*

In contrast, the disease label classifier was established as a multi-label
classifier of all 28 remaining classes (excluding disease risk) and was
trained on the one-hot encoded array of the disease labels. Furthermore,
we utilized four different architectures for this model type: ResNet152,
InceptionV3, DenseNet201 and EfficientNetB4 \[9\]--\[12\]. Identical to
the class weight computation of the disease risk detector, we computed the
weights individually as a binary classification for each class. Even if
this classifier is provided with all classes, the binary weights balance
the decision for each label individually.
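A minimal sketch of the per-label weighting, assuming the labels are stored as rows of a one-hot encoded array. The helper `binary_label_weights` and the tiny label matrix are hypothetical; how absent classes are handled (weight 0.0 here) is also an assumption.

```python
def binary_label_weights(one_hot_rows):
    """Return (weight_negative, weight_positive) for every label column."""
    n_samples = len(one_hot_rows)
    n_labels = len(one_hot_rows[0])
    weights = []
    for j in range(n_labels):
        positives = sum(row[j] for row in one_hot_rows)
        negatives = n_samples - positives
        # Binary rule: n_samples / (2 * count); 0.0 if a class never occurs.
        w_neg = n_samples / (2 * negatives) if negatives else 0.0
        w_pos = n_samples / (2 * positives) if positives else 0.0
        weights.append((w_neg, w_pos))
    return weights


# Hypothetical annotations: 4 images, 3 disease labels.
rows = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
]
label_weights = binary_label_weights(rows)
print(label_weights)
```

Each label is balanced on its own, independently of how frequent the other labels are.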

### **2.4. Ensemble Learning Strategy**

#### *2.4.1 Bagging*

Next to the utilization of multiple architectures, we also applied a
5-fold cross-validation as a bagging approach for ensemble
learning. Our aim was to create a large variety of models which were
trained on different subsets of the training data. This approach not
only allowed a more efficient usage of the available training data, but
also increased the reliability of a prediction. This strategy resulted in an ensemble of
10 disease risk detector models (2 architectures with 5 folds each)
and 20 disease label classifier models (4 architectures with 5 folds
each).
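The bookkeeping behind this ensemble can be sketched in a few lines. The splitting helper `k_fold_indices`, the sample count and the random seed are illustrative assumptions; the original pipeline may sample its folds differently (for example with stratification).

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and return k (train, validation) index pairs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, val))
    return splits


detector_archs = ["DenseNet201", "EfficientNetB4"]
classifier_archs = ["ResNet152", "InceptionV3", "DenseNet201", "EfficientNetB4"]

splits = k_fold_indices(n_samples=100, k=5)

# One model per (model type, architecture, fold) combination.
ensemble = [("detector", arch, fold)
            for arch in detector_archs for fold in range(len(splits))]
ensemble += [("classifier", arch, fold)
             for arch in classifier_archs for fold in range(len(splits))]

print(len(ensemble))
```

Every sample lands in exactly one validation fold, so each model is trained on a different 80% subset of the data.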

#### *2.4.2 Stacking*

For combining the predictions of our, in total, 30 models, we
integrated a stacking setup. On top of all deep convolutional neural
networks, we applied a binary logistic regression algorithm for each
class, individually. Thus, the predictions of all models were utilized
as input for computing the classification of a single class. This
approach allowed combining the information of all other class
predictions to derive an inference for one single class. Overall, this
strategy resulted in 29 distinct logistic regression models (1 for
disease risk and 28 for the disease labels, including the 'other'
class). The individual predicted class probabilities are then
concatenated to form the final prediction.

The logistic regression models were also trained with the same 5-fold
cross-validation sampling on a heavily augmented version of the
training dataset, both to avoid overfitting and to avoid training the
logistic regression models on images already seen by the neural
network models. As the logistic regression solver, we utilized the
large-scale bound-constrained optimization (short: 'LBFGS') from *Zhu
et al*. \[18\].
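The stacking idea can be sketched with scikit-learn on synthetic data. Everything below is a toy assumption: the sizes (3 base models, 4 classes), the random labels and the noisy "predictions" stand in for the real 30 networks and 29 classes. For brevity the stackers are fitted and applied on the same toy data, whereas the paper fits them on held-out, augmented samples.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_models, n_classes = 60, 3, 4  # toy sizes, not the paper's setup

# Synthetic ground truth; force both outcomes to occur for every class.
y_true = (rng.random((n_samples, n_classes)) < 0.3).astype(int)
y_true[0, :] = 1
y_true[1, :] = 0

# Synthetic base-model outputs: noisy copies of the labels, clipped to [0, 1].
base_preds = [
    np.clip(y_true + rng.normal(0.0, 0.4, size=y_true.shape), 0.0, 1.0)
    for _ in range(n_models)
]

# Stacking features: all class predictions of all models side by side.
features = np.hstack(base_preds)  # shape: (n_samples, n_models * n_classes)

# One binary logistic regression per class, each seeing every feature.
stackers = [
    LogisticRegression(solver="lbfgs", max_iter=1000).fit(features, y_true[:, c])
    for c in range(n_classes)
]

# Concatenate the per-class probabilities into the final prediction.
final_pred = np.column_stack(
    [model.predict_proba(features)[:, 1] for model in stackers]
)
print(final_pred.shape)
```

Because every stacker sees the predictions for all classes, it can use correlations between conditions when deciding on a single label.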

## **3. RESULTS AND DISCUSSION**

The sequential training of a complete cross-validation for one
architecture on a single NVIDIA TITAN RTX GPU took around 13.5 hours
with 63 epochs on average for each deep convolutional neural network
model. Logistic regression training required less than 30 minutes for
all class models combined. No signs of overfitting were observed for
the disease label classifiers through validation monitoring, as can
be seen in Fig. 2. However, the disease risk detectors showed a strong
trend to overfit after the transfer learning phase. By selecting the
model with the best validation loss, it was still possible to obtain a powerful model for
detection.
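The interplay of learning rate reduction, early stopping and best-model selection described in the training setup can be simulated in plain Python. The function `simulate_training` and the validation loss curve are hypothetical; in particular, how the "after epoch 60" condition interacts with the patience counter is an assumption of this sketch.

```python
def simulate_training(val_losses, lr=1e-5, min_lr=1e-7, factor=0.1,
                      lr_patience=8, stop_patience=20, stop_start_epoch=60):
    """Replay a validation loss curve through the three callbacks."""
    best_loss, best_epoch = float("inf"), None
    since_best = 0        # epochs since the best validation loss
    since_lr_change = 0   # epochs without improvement since last LR change
    lr_history = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Checkpoint: remember the epoch with the best validation loss.
            best_loss, best_epoch = loss, epoch
            since_best = 0
            since_lr_change = 0
        else:
            since_best += 1
            since_lr_change += 1
        # Reduce the learning rate after a plateau, but never below min_lr.
        if since_lr_change >= lr_patience and lr > min_lr:
            lr = max(lr * factor, min_lr)
            since_lr_change = 0
        lr_history.append(lr)
        # Early stopping is only allowed once epoch 60 has passed.
        if epoch > stop_start_epoch and since_best >= stop_patience:
            break
    return best_epoch, best_loss, lr_history


# Hypothetical curve: improves for 10 epochs, then stagnates (overfitting).
losses = [1.0 - 0.05 * i for i in range(10)] + [0.6] * 90
best_epoch, best_loss, lr_history = simulate_training(losses)
print(best_epoch, len(lr_history), lr_history[-1])
```

Even though training continues long after the validation loss stops improving, the checkpoint from the best epoch is the one that is kept.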

![image2](2103.14660v1-resources/2103.14660v1-paper-images/image2.png)


**Fig. 2.** Loss course during the
training process for training and validation data. The lines were
computed via locally estimated scatterplot smoothing and represent the
average loss across all folds. The gray areas around the lines
represent the confidence intervals.

![image3](2103.14660v1-resources/2103.14660v1-paper-images/image3.png)

**Fig. 3.** Receiver operating characteristic (ROC) curves for each
model type applied in our pipeline. The ROC curves show the
individual model performance measured by the true positive and false
positive rates. The cross-validation models were macro-averaged for each
model type to reduce illustration complexity.

### **3.1. Internal Performance Evaluation**

For estimating the performance of our pipeline, we utilized the
validation subsets of the 5-fold cross-validation models from the
heavily augmented version of our dataset. This approach allowed us to
obtain testing samples which were never seen in the training process for
reliable performance evaluation. For the complex multi-label evaluation,
we computed the popular area under the receiver operating characteristic
(AUROC) curve, as well as the mean average precision (mAP). Both scores
were macro-averaged over classes and cross-validation folds to reduce
complexity.
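Both metrics and their macro-average can be sketched without external libraries. The implementations below are simplified assumptions (pairwise AUROC, average precision without special tie handling), and the two toy classes are made up for illustration.

```python
def auroc(y_true, scores):
    """Probability that a positive sample is ranked above a negative one."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of every positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


def macro_average(metric, y_true_by_class, scores_by_class):
    """Unweighted mean of a metric over all classes."""
    values = [metric(y, s) for y, s in zip(y_true_by_class, scores_by_class)]
    return sum(values) / len(values)


# Two hypothetical classes with four samples each.
y_by_class = [[0, 0, 1, 1], [1, 0, 0, 1]]
s_by_class = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.3, 0.7]]

macro_auroc = macro_average(auroc, y_by_class, s_by_class)
mean_ap = macro_average(average_precision, y_by_class, s_by_class)
print(macro_auroc, mean_ap)
```

Macro-averaging gives every class the same influence on the final score, which matters when some conditions are very rare.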

Our multi-disease detection pipeline revealed a strong and robust
classification performance with the capability to also detect rare
conditions accurately in retinal images. Whereas the disease label
classifier models separately achieved only an AUROC of around 0.97 and a
mAP of 0.93, the disease risk detectors demonstrated a
strong predictive power of 0.98 up to 0.99 AUROC and mAP. Among
the classifiers, however, the InceptionV3 architecture showed the worst
performance compared to the other architectures, with only 0.93 AUROC and
0.66 mAP. The associated receiver operating characteristics of the
models are illustrated in Fig. 3.

Training a strong multi-label classifier is in general a complex task.
In addition, the extreme class imbalance between the conditions posed a
hard challenge for building a reliable model \[19\], \[20\]. Our applied
up-sampling and class weighting technique provided a
critical boost to the predictive capabilities of the classifier models.
Nearly all labels could be accurately detected, including the 'OTHER' class consisting of various extremely rare
conditions. Nevertheless, the two classes 'EDN' and 'CRS' were the
most challenging conditions for all classifier models. Both classes
are very rare conditions, together accounting for less than 1.2%
occurrence in the original and 2.5% occurrence in the up-sampled
dataset. Still, our stacked logistic regression algorithm was able to
balance this issue and infer the correct 'EDN' and 'CRS'
classifications through context. Overall, our applied ensemble
learning strategies resulted in a significant performance improvement
compared to the individual deep convolutional neural network models.

More details on the internal performance evaluation are listed in Tab. 2.

### **3.2. External Evaluation through the RIADD Challenge** 

Furthermore, we participated in the RIADD challenge, which was organized by the
authors of the RFMiD dataset \[7\], \[8\]. The challenge participation
allowed not only an independent evaluation of the predictive power of our pipeline on an unseen and
unpublished testing set, but also a comparison with the currently best
retinal disease classifiers in the world.

![Tab2](2103.14660v1-resources/images/Tab2.png)

**Tab. 2**. Achieved results of the internal performance evaluation
showing the average AUROC and mAP score for each model utilized in our
pipeline. The scores were macro-averaged across all cross-validation
folds and classes.

In our participation, we were able to reach rank 19 from a total of 59
teams in the first evaluation phase and rank 8 in the final phase. In
the independent evaluation from the challenge organizers, we achieved an
AUROC of 0.95 for the disease risk classification. For multi-label
scoring, they computed the average between the macro-averaged AUROC and
the mAP, for which we reached a score of 0.70. The top-ranked teams
were separated by only marginal scoring differences, which is why our
final score differed by only 0.05 from that of the first-ranked team. Furthermore,
the participation results demonstrated that ensemble learning based
classification for deep convolutional neural network models is
competitive with or even superior to other approaches in the scientific field,
such as focusing on a single large architecture.

### **3.3. Experiments and Improvements**

Additionally, we experimented with using a weighted cross-entropy loss for
training both of our model types. This resulted in inferior models for
disease label classification. However, the disease risk detector models
fitted with cross-entropy loss showed less overfitting with equal
performance. Further experimentation with loss functions for the disease
risk detector models could provide a solution to avoid overfitting.

An important point for the RIADD challenge participation would be the utilization of more training data, especially
for the difficult 'CRS' and 'EDN' classes. According to the challenge
rules, other publicly available datasets like Kaggle DR, IDRiD, Messidor
or APTOS are allowed to be used as additional training data \[8\]. Our
pipeline, which was trained exclusively on the RFMiD dataset, could be
further improved with more retinal images of very rare conditions.
Besides the training data, further improvement points for future research
in retinal disease detection would be the inclusion of image cropping
strategies to reduce information loss through resolution resizing, the
usage of more architectures (especially with different input
resolutions) to enlarge the model ensemble, and the utilization of
specific retinal filters or retinal vessel segmentation as additional
information for the predictions.

## **4. CONCLUSIONS**

In this study, we introduced a powerful multi-disease detection pipeline
for retinal imaging which exploits ensemble learning techniques to
combine the predictions of various deep convolutional neural network
models. Next to state-of-the-art strategies, such as transfer learning,
class weighting, extensive real-time image augmentation and Focal loss
utilization, we applied 5-fold cross-validation as a bagging technique and
used multiple convolutional neural network architectures to create an ensemble of models. With a stacking
approach of class-wise distinct logistic regression models, we
combined the knowledge of all neural network models to compute highly
accurate and reliable retinal condition predictions. Next to an
internal performance evaluation, we also demonstrated the precision and
comparability of our pipeline through our participation in the RIADD
challenge.

## **APPENDIX**

In order to ensure full reproducibility and to create a base for
further research, the complete code of this study, including extensive
documentation, is available in the following public Git repository:
https://github.com/frankkramer-lab/riadd.aucmedi
Furthermore, the trained models, evaluation results and metadata are
available in the following public Zenodo repository:
https://doi.org/10.5281/zenodo.4573990

## **ACKNOWLEDGMENTS**

We want to thank Dennis Klonnek, Edmund Müller and Johann Frei for
their useful comments and support.

## **COMPLIANCE WITH ETHICAL STANDARDS**

This research study was conducted retrospectively using human subject
data made available in open access by *Pachade et al.* \[7\], \[8\].
Ethical approval was not required as confirmed by the license attached
with the open access data.

## **CONFLICT OF INTEREST**

None declared.

## **FUNDING**

This work is a part of the DIFUTURE project funded by the German
Ministry of Education and Research (Bundesministerium für Bildung und
Forschung, BMBF) grant FKZ01ZZ1804E.

## References from original paper <a id="ReferencesOGPaper"></a>

[1] J. D. Adelson et al., “Causes of blindness and vision impairment
in 2020 and trends over 30 years, and prevalence of avoidable
blindness in relation to VISION 2020: the Right to Sight: an
analysis for the Global Burden of Disease Study,” Lancet Glob.
Heal., vol. 9, no. 2, pp. e144–e160, Feb. 2021, doi:
10.1016/S2214-109X(20)30489-7.

[2] World Health Organization, “Blindness and vision impairment.”
https://www.who.int/news-room/fact-sheets/detail/blindness-and-visual-impairment (accessed Feb. 27, 2021).

[3] R. T. Sutton, D. Pincock, D. C. Baumgart, D. C. Sadowski, R. N.
Fedorak, and K. I. Kroeker, “An overview of clinical decision
support systems: benefits, risks, and strategies for success,” npj
Digital Medicine, vol. 3, no. 1. Nature Research, pp. 1–10, Dec. 01, 2020, doi: 10.1038/s41746-020-0221-y

[4] G. Litjens et al., “A survey on deep learning in medical image
analysis,” Med. Image Anal., vol. 42, no. December 2012, pp. 60–88,
2017, doi: 10.1016/j.media.2017.07.005.

[5] J. Y. Choi, T. K. Yoo, J. G. Seo, J. Kwak, T. T. Um, and T. H.
Rim, “Multi-categorical deep learning neural network to classify
retinal images: A pilot study employing small database,” PLoS
One, vol. 12, no. 11, p. e0187336, Nov. 2017, doi:
10.1371/journal.pone.0187336.

[6] G. Quellec, M. Lamard, P. H. Conze, P. Massin, and B. Cochener,
“Automatic detection of rare pathologies in fundus photographs
using few-shot learning,” Med. Image Anal., vol. 61, p. 101660,
Apr. 2020, doi: 10.1016/j.media.2020.101660.

[7] S. Pachade et al., “Retinal Fundus Multi-Disease Image Dataset
(RFMiD): A Dataset for Multi-Disease Detection Research,” Data,
vol. 6, no. 2, p. 14, Feb. 2021, doi: 10.3390/data6020014.

[8] “Home - RIADD (ISBI-2021) - Grand Challenge.”
https://riadd.grand-challenge.org/Home/ (accessed Feb. 27, 2021).

[9] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger,
“Densely Connected Convolutional Networks,” Proc. - 30th IEEE
Conf. Comput. Vis. Pattern Recognition, CVPR 2017, vol. 2017-January,
pp. 2261–2269, Aug. 2016, Accessed: Feb. 27, 2021.
[Online]. Available: http://arxiv.org/abs/1608.06993.

[10] M. Tan and Q. V. Le, “EfficientNet: Rethinking Model Scaling for
Convolutional Neural Networks,” 36th Int. Conf. Mach. Learn.
ICML 2019, vol. 2019-June, pp. 10691–10700, May 2019,
Accessed: Feb. 27, 2021. [Online]. Available:
http://arxiv.org/abs/1905.11946.

[11] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna,
“Rethinking the Inception Architecture for Computer Vision,” in
Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, Dec. 2016, vol. 2016-December,
pp. 2818–2826, doi: 10.1109/CVPR.2016.308.

[12] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
image recognition,” in Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Dec.
2016, vol. 2016-December, pp. 770–778, doi:
10.1109/CVPR.2016.90.

[13] O. Russakovsky et al., “ImageNet Large Scale Visual Recognition
Challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec.
2015, doi: 10.1007/s11263-015-0816-y.

[14] S. Muhammad et al., “Medical Image Analysis using
Convolutional Neural Networks A Review,” J. Med. Syst., vol. 42,
no. 11, pp. 1–13, Nov. 2018, doi: 10.1007/s10916-018-1088-1.

[15] J. Ker, L. Wang, J. Rao, and T. Lim, “Deep Learning Applications
in Medical Image Analysis,” IEEE Access, vol. 6, pp. 9375–9379,
2017, doi: 10.1109/ACCESS.2017.2788044.

[16] D. P. Kingma and J. Lei Ba, “Adam: A Method for Stochastic
Optimization,” 2014. https://arxiv.org/abs/1412.6980.

[17] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal Loss
for Dense Object Detection,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 42, no. 2, pp. 318–327, Aug. 2017, Accessed: Feb. 27,
2021. [Online]. Available: http://arxiv.org/abs/1708.02002.

[18] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal, “L-BFGS-B: Fortran
Subroutines for Large-Scale Bound-Constrained Optimization,”
ACM Trans. Math. Softw., vol. 23, no. 4, pp. 550–560, Dec. 1997,
doi: 10.1145/279232.279236.

[19] P. Kaur and A. Gosain, “Issues and challenges of class imbalance
problem in classification,” Int. J. Inf. Technol., pp. 1–7, Oct. 2020,
doi: 10.1007/s41870-018-0251-8.

[20] L. Gao, L. Zhang, C. Liu, and S. Wu, “Handling imbalanced
medical image data: A deep-learning-based one-class
classification approach,” Artif. Intell. Med., vol. 108, p. 101935,
Aug. 2020, doi: 10.1016/j.artmed.2020.101935.

---

---

---

## Appendix A

### Table of Contents

*In order to use a Table of Contents for this article, please use the `toc2` extension from `Nbextensions` for Jupyter Notebook. You can find instructions on how to install and use it in this <a href="https://stackoverflow.com/questions/21151450/how-can-i-add-a-table-of-contents-to-a-jupyter-jupyterlab-notebook">link</a>.*

---


## References <a id="ReferencesSection"></a>


### MULTI-DISEASE DETECTION IN RETINAL IMAGING PDF
<https://arxiv.org/pdf/2103.14660v1.pdf>

### Multi-Disease Detection in Retinal Imaging - papers with code
<https://paperswithcode.com/paper/multi-disease-detection-in-retinal-imaging>

### Multi-Disease Detection in Retinal Imaging - GitHub
<https://github.com/frankkramer-lab/riadd.aucmedi>

### AUCMEDI - A Framework for Automated Classification of Medical Images
<https://pypi.org/project/aucmedi/>

### RETINAL FUNDUS MULTI-DISEASE IMAGE DATASET (RFMID)
<https://ieee-dataport.org/open-access/retinal-fundus-multi-disease-image-dataset-rfmid#files>

### RFMiD Train Dataset - kaggle
<https://www.kaggle.com/awsaf49/rfmid-train-dataset>

### Retinal Image Analysis for multi-Disease Detection Challenge website
<https://riadd.grand-challenge.org/>

### IEEE ISBI 2021 International Symposium on Biomedical Imaging April 13-16 2021
<https://biomedicalimaging.org/2021/>

---

### sklearn utils class_weight.py - GitHub
<https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/class_weight.py>

### sklearn.preprocessing.LabelEncoder
<https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html>

### sklearn.utils.class_weight.compute_class_weight
<https://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html>

---


### Austin Powers - Live dangerously meme 1
<https://i.kym-cdn.com/photos/images/newsfeed/000/511/991/3a5.jpg>


---