# Report of Project 05: Implementation and Evaluation of Machine Learning (Support Vector Machine) Segmentation
## Data analysis project - B.Sc. Molecular Biotechnology Heidelberg University
### 19.07.21
### Authors: Michelle Emmert, Juan Andre Hamdan, Laura Sanchis Pla and Gloria Timm

# Abstract

Support Vector Machines, also known as SVMs, are supervised learning models with associated learning algorithms that
analyze data for classification and regression analysis.

Their theoretical foundations and their experimental success encourage further research on their characteristics, as well
as their further use. This report presents the summary of a project that implements and evaluates a Support Vector
Machine for segmentation of cell nuclei using the dice score, synthetically generated images, and a variety of different
pre-processing and preparation methods.



# Table of contents
**1. Introduction** <br>
**2. Our Data** <br>
**3. Our algorithm's pipeline** <br>
**4. Pre-processing** <br>
**5. Support Vector Machine** <br>
**5.1 The Mathematical Background** <br>
**5.2 Data reduction: Principal component analysis and Tiles** <br>
**5.3 Stochastic gradient descent to minimize the loss gradient** <br>
**5.4 K-Fold Cross validation** <br>
**5.5 Implementing the Support Vector Machine** <br>
**6. Evaluation using the Dice coefficient** <br>
**6.1 The Theory behind the Dice Coefficient** <br>
**6.2.1 Implementing the Dice Coefficient** <br>
**6.2.2 Unittesting the Dice Coefficient** <br>
**6.3 Synthetic Images** <br>
**6.3.1 Definition and Goal** <br>
**6.3.2 Image composition** <br>
**6.3.3 Domain randomization** <br>
**7. Results** <br>
**8. Discussion** <br>
**9. Bibliography**


# 1. Introduction
Image segmentation is a process in which important features of a picture are extracted to aid analysis and retrieval
of information. This is done by assigning a label to every pixel of the image; pixels sharing defined traits are given the same label.
Nuclei segmentation is a subform of image segmentation. In medicine, nuclei segmentation is used to classify and
grade cancer based on cancer histopathology. Today, cancer grading is still often done manually by visual analysis of tissue samples.
This method suffers from inter- and intra-observer variability in the quality of the grading, low reproducibility
and high time consumption. Cancer grading is an essential step in quantifying the degree of malignancy and thus
in predicting patient prognosis and prescribing a treatment.

Using a support vector machine is one way to achieve nuclei segmentation.
We use the support vector machine to differentiate between foreground and background pixels. Foreground pixels are
subsequently labeled as nuclei and displayed in white, while background pixels are colored black. This also allows the nuclei to be counted at the end.
To segment our images efficiently with a support vector machine, they first undergo pre-processing to enhance
picture quality. After segmentation, we evaluate the quality of the result using the Dice score.

# 2. Our Data
Our data consists of 3 datasets with a total of 28 very different images, all showing nuclei.
The pictures of the first dataset show GFP-transfected GOWT1 mouse embryonic stem cells. The second set of images shows
Histone-2B-GFP expressing HeLa cells, while our third set consists of pictures of mouse embryonic fibroblasts in which
CD-antigens were tagged with enhanced GFP.
Though all images are microscopic images of nuclei, the three sets vary greatly in other features.
The images have different sizes (1024 x 1024, 1100 x 700 and 1344 x 1024), were all acquired differently and display
different numbers of nuclei, ranging from 15 to 65 nuclei per image.
Additionally, the images pose diverse challenges for our image analysis.
Firstly, brightness and resolution differ between the images. On top of that, every set of images has its individual challenges,
such as white flashes, clustering of nuclei, or nuclei leaving the image or undergoing mitosis.

# 3. Our algorithm's pipeline
Using the data described in #2, we apply either a watershed filter or a Gaussian filter (see #4) to improve the quality of our images.
To evaluate the usefulness of these filtering methods, we also use the images without pre-processing for one SVM run. The filtered image is further
prepared for the SVM by undergoing one of two slightly different principal component analyses (PCA) and/or by being cut into a
specific number of tiles (see #5.2). PCA reduces the features of our images to those that explain the most variance. It is thus a
selective way of data reduction.
Splitting the images into tiles, in contrast, reduces the total amount of data non-selectively. After processing the images
with PCA and/or the tiles algorithm, all intensity values are normalized.
Additionally, to assign the labels, the ground truth images are thresholded to obtain black-and-white-only images.
The next step is the segmentation using the Support Vector Machine (see #5).
Finally, the segmentation results are evaluated by calculating and comparing their dice scores with the help of the ground truth images (see #6).
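
The normalization and thresholding steps can be sketched as follows. This is only an illustration: the report does not state which normalization or threshold was used, so the min-max scaling, the threshold of 0 and the function names `normalize_intensities` and `threshold_ground_truth` are assumptions.

```python
import numpy as np

def normalize_intensities(image):
    """Rescale intensities linearly to the range [0, 1] (assumed min-max scaling)."""
    image = np.asarray(image, dtype=float)
    lowest, highest = image.min(), image.max()
    if highest == lowest:  # constant image: avoid division by zero
        return np.zeros_like(image)
    return (image - lowest) / (highest - lowest)

def threshold_ground_truth(ground_truth, threshold=0):
    """Map every pixel above `threshold` to 1 (nucleus) and all others to 0 (background)."""
    return (np.asarray(ground_truth) > threshold).astype(np.uint8)

image = np.array([[10, 20], [30, 50]])
normalized = normalize_intensities(image)

# Ground truth in which each nucleus carries its own integer label (assumed layout)
labels = threshold_ground_truth(np.array([[0, 3], [7, 0]]))
```

After this step every pixel has a feature value between 0 and 1 and a binary label, which is the input format a two-class SVM expects.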

# 4. Pre-processing
We implemented two different pre-processing methods in our SVM algorithm in order to improve the quality of our raw
images and thereby achieve better segmentation results.
The first method we used is a Gaussian filter. This technique was shown to be particularly useful for filtering images with a lot
of noise, since the results of the filtering are relatively independent of the variance of the Gaussian
kernel (Gedraite, E. et al. 2011). A drawback is that using a Gaussian filter as pre-processing for edge detection can also give
rise to edge position displacement, vanishing edges, and phantom edges (Deng, G. et al. 1993).
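
A minimal sketch of Gaussian filtering, written with NumPy only, is shown below. It is not the project's implementation: the kernel radius of three standard deviations, the reflecting border and the names `gaussian_kernel_1d` and `gaussian_filter_2d` are assumptions made for illustration.

```python
import numpy as np

def gaussian_kernel_1d(sigma):
    """Sampled 1-D Gaussian, truncated at about 3 sigma and normalised to sum 1."""
    radius = int(3 * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_filter_2d(image, sigma=1.0):
    """Blur a 2-D image by filtering rows and then columns (the Gaussian is separable)."""
    kernel = gaussian_kernel_1d(sigma)
    radius = len(kernel) // 2
    padded = np.pad(np.asarray(image, dtype=float), radius, mode="reflect")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
noisy = rng.normal(loc=100.0, scale=10.0, size=(32, 32))  # synthetic noisy test image
smoothed = gaussian_filter_2d(noisy, sigma=1.5)
```

Because the kernel sums to 1, flat regions keep their intensity while pixel-to-pixel noise is averaged out; the price is the blurring of edges discussed above.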

The second pre-processing step is a form of super-pixel segmentation.
The principle behind this type of segmentation is to group similar pixels from a local region together and change
their intensities to a common value. For our approach, we used Watershed, a gradient-ascent-based super-pixel
algorithm.
It works iteratively to refine the super-pixel clusters until a certain criterion is met. The image can be viewed
as a topographic surface with high-intensity peaks and low-intensity valleys. First, every isolated valley is
given a different label. As the intensity level rises, different valleys start to merge. To prevent this,
barriers are built at the merging points. This is repeated until all peaks are labeled. The barriers then form
the final segmentation result.

# 5. Support Vector Machine
In 1992, Vapnik and coworkers proposed a supervised algorithm, developed from statistical learning, to solve classification
problems (Boser et al., 1992). Since then, their machine learning method has evolved into what is now known as
Support Vector Machines (SVMs): a class of algorithms for classification, regression and other applications that
represents the current state of the art in the field (Suthaharan, 2016; Guanasekaran, 2010; Christianini and Ricci, 2008).

Given a training data set with binary labels, the SVM learns how to classify data points based on certain
features. This capability can subsequently be used to classify new data, called test data, using its features (Thai et al., 2012).
SVMs have been successfully applied in a number of areas, ranging from time series prediction and face recognition
to biological data processing for medical diagnosis (Evgeniou, 2001).
In image processing, support vector machines are used for one of the classical challenges: image classification (Evgeniou, 2001).

## 5.1 The Mathematical Background
The mathematical concepts are key to understanding how Support Vector Machines work.
The goal of an SVM is to separate data points into two groups of provided labels with an optimal hyperplane.
This hyperplane is described by

\begin{equation}
(2) \ w \cdot x + b = 0
\end{equation}

and the data points are classified according to the following condition

\begin{equation}
(3) \ h(x_i) =
\left\{
  \begin{aligned}
    +1 \ \ &\text{if} \ \ w \cdot x_i + b \geq +1 - \varepsilon_i\\
    -1 \ \ &\text{if} \ \ w \cdot x_i + b \leq -1 + \varepsilon_i
  \end{aligned}
  \right.
  \ \ \varepsilon_i \geq 0 \ \ \forall i \ ; \ i=1...m
\end{equation}

where, in two dimensions, $w = (a, -1)$ with $a$ being the slope of the line, and $x = (x_1, x_2)$ represents a
data point. $\varepsilon_i$ is a slack variable that describes by how much data point $i$ may violate the margin. It is added to the constraint to
prevent overfitting of the model to the training set. Without $\varepsilon$, the geometric margin M is called a hard margin,
as it does not allow data points of one group to be falsely labeled as members of the other group. This does not necessarily result in
the best model, as single falsely assigned data points can have a lower impact on the quality of the model than a suboptimal
hyperplane. Therefore, $\varepsilon$ is introduced, and M is then called a soft margin.

To choose the optimal hyperplane, we first compute the geometric margin M of a candidate hyperplane as the smallest
normalized distance between the hyperplane and any training point:

\begin{equation}
(4)\ M = \min_{i=1...m} \ y_i\left(\frac{w}{||w||} \cdot x_i + \frac{b}{||w||}\right)
\end{equation}

Out of all margins computed in our training phase, the largest margin M is selected. The variables w and b are
divided by the length of the vector w, calculated with the Euclidean norm, so that the margin is scale invariant.
The aim is to find the values for w and b that correspond to the largest margin.
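
As a small numerical illustration of formula (4), the following sketch computes the geometric margin of a fixed hyperplane for four hand-picked points. The data and the function name `geometric_margin` are invented for this example.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Smallest signed distance of any sample to the hyperplane w . x + b = 0."""
    w = np.asarray(w, dtype=float)
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Four toy points with labels +1 / -1
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])

margin_a = geometric_margin([1.0, 1.0], -2.0, X, y)
margin_b = geometric_margin([2.0, 2.0], -4.0, X, y)  # same hyperplane, w and b doubled
```

Both calls return the same value, the square root of 2, which shows the scale invariance obtained by dividing by the length of w.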

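This leads us to the following optimization problem.
We want to maximize M:

\begin{equation}
(5) \ \max_{w,b} M
\end{equation}

If w and b are rescaled such that the points closest to the hyperplane satisfy $y_i(w \cdot x_i + b) = 1$, the margin
equals $\frac{1}{||w||}$ and this maximization problem is equivalent to

\begin{equation}
(6) \ \max_{w,b} \frac{1}{||w||} \ \ \text{subject to} \ \ y_i(w \cdot x_i+b) \geq 1 \ , \ i=1...m
\end{equation}

Maximizing $\frac{1}{||w||}$ is the same as minimizing $\frac{1}{2}||w||^2$. If, in addition, violations of the constraints
are allowed through the slack variables $\varepsilon_i$ and penalized, the problem can be rewritten as the following minimization problem.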

\begin{equation}
\begin{aligned}
(7) \ &\min_{w,b,\varepsilon}  \ \frac{1}{2}||w||^2\ + \ C\sum^{m}_{i=1}\varepsilon_i \\
&\text{subject to} \ \  y_i(w \cdot x_i+b) \geq 1 - \varepsilon_i \ , \ \varepsilon_i \geq 0 \ \ \forall i \ , \ i=1...m
\end{aligned}
\end{equation}

The regularization parameter C is chosen by the user and determines the weight of $\varepsilon$.
A larger C leads to a higher penalty for errors and therefore to a harder margin.
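
The effect of C can be made concrete with a short sketch. For fixed w and b, the smallest slack that satisfies the constraints of (7) is $\varepsilon_i = \max(0, 1 - y_i(w \cdot x_i + b))$; the code below evaluates the objective with this choice. The toy data and the name `soft_margin_objective` are assumptions made for illustration.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Value of 1/2 ||w||^2 + C * sum(slack) with the smallest feasible slack per sample."""
    w = np.asarray(w, dtype=float)
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * w @ w + C * slack.sum(), slack

X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]])
y = np.array([1, 1, -1])
w, b = np.array([1.0, 0.0]), 0.0

low_penalty, slack = soft_margin_objective(w, b, X, y, C=0.1)
high_penalty, _ = soft_margin_objective(w, b, X, y, C=10.0)
```

Only the second point lies inside the margin and receives a slack of 0.5. With C = 0.1 this violation barely changes the objective (0.55), whereas with C = 10 it dominates it (5.5), so the optimizer is pushed towards a harder margin.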

In order to solve this constrained optimization problem, in which we want to maximize the margin while fulfilling our
constraints, Lagrange multipliers are used. The idea behind this mathematical concept is that at the optimum,
the gradient of our objective function is parallel or antiparallel to the gradient of the constraint function.
The two gradients are therefore multiples of each other, and the Lagrange multiplier $\alpha$ is the corresponding factor.

\begin{equation}
(8) \ \nabla f(x) - \alpha \nabla g(x) = 0
\end{equation}

When we insert our functions, we get the following Lagrangian function:

\begin{equation}
\begin{aligned}
(9) \ &f(x)= \frac{1}{2}||w||^2\ + \ C\sum^{m}_{i=1}\varepsilon_i \\
(10) \ &g(x) = y_i(w \cdot x_i+b) - 1 + \varepsilon_i \\
(11) \ &\mathcal{L}(w,b,\alpha) =\frac{1}{2}||w||^2\ + \ C\sum^{m}_{i=1}\varepsilon_i - \sum^{m}_{i=1}\alpha_i[y_i(w \cdot x_i+b) - 1 + \varepsilon_i]
\end{aligned}
\end{equation}

In order to solve this Lagrange problem, it is converted into its dual problem: the constraints are incorporated
into the function, so that it depends only on the Lagrange multipliers. This makes the problem easier to solve.
The dual problem is obtained from the following two conditions, which set the gradients of the Lagrangian with respect to w and b to zero:

\begin{equation}
\begin{aligned}
(12) \ &\nabla_w \mathcal{L}(w,b,\alpha) = w - \sum^{m}_{i=1} \alpha_i y_i x_i = 0 \\
(13) \ &\nabla_b \mathcal{L}(w,b,\alpha) = - \sum^{m}_{i=1} \alpha_i y_i = 0
\end{aligned}
\end{equation}

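Inserting these conditions into the Lagrange function eliminates w and b. Because the slack variables have to stay non-negative, each multiplier is additionally bounded from above by C, and the dual problem becomes: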
\begin{equation}
\begin{aligned}
(14) \ &\max_{\alpha}  \ \sum^{m}_{i=1}\alpha_i - \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_j y_i y_j \, x_i \cdot x_j \\
&\text{subject to} \ \ 0 \leq \alpha_i \leq C, \ i=1...m, \ \sum^{m}_{i=1}\alpha_i y_i = 0
\end{aligned}
\end{equation}
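
The substitution can be written out step by step. The following is only a sketch, and it assumes one addition that formula (11) does not show: the constraints $\varepsilon_i \geq 0$ receive their own multipliers $\mu_i \geq 0$ (a symbol introduced here), which adds the term $-\sum_i \mu_i \varepsilon_i$ to the Lagrangian.

\begin{equation}
\begin{aligned}
\nabla_{\varepsilon_i} \mathcal{L} &= C - \alpha_i - \mu_i = 0 \ \ \Rightarrow \ \ 0 \leq \alpha_i \leq C \\
\mathcal{L} &= \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_j y_i y_j \, x_i \cdot x_j
 - \sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_j y_i y_j \, x_i \cdot x_j
 - b\sum^{m}_{i=1}\alpha_i y_i + \sum^{m}_{i=1}\alpha_i + \sum^{m}_{i=1}(C - \alpha_i - \mu_i)\varepsilon_i \\
&= \sum^{m}_{i=1}\alpha_i - \frac{1}{2}\sum^{m}_{i=1}\sum^{m}_{j=1}\alpha_i\alpha_j y_i y_j \, x_i \cdot x_j
\end{aligned}
\end{equation}

The second line uses $w = \sum_i \alpha_i y_i x_i$ from (12). The term with b vanishes because of (13), and the term with $\varepsilon_i$ vanishes because of the first line.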

From the above equation, it becomes clear that the maximization depends on the data points only through the dot products
$x_i \cdot x_j$. This is an advantage when dealing with data that is not linearly separable. The 'trick' is to transform the data
into a higher-dimensional space in which a separating hyperplane can be found. However, for a large dataset, calculating the transformation
explicitly would be very time-consuming. For that reason, instead of actually calculating the transformation, the kernel trick is used.
This means that a function is used which calculates the dot product of $x_i$ and $x_j$ as if the two were in the higher-dimensional space.

In general, a kernel function can be described as follows:

\begin{equation}
K(x_i, x_j)=\phi(x_i) \cdot \phi(x_j)
\end{equation}

where $\phi$ is the transformation into the higher-dimensional space. Which kernel is suitable depends on the type of data that has to be classified.
A well-known and commonly used kernel function is the linear kernel, for which $\phi$ is the identity:

\begin{equation}
K(x_i, x_j)=x_i \cdot x_j
\end{equation}
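
To illustrate the idea that a kernel returns a dot product in a higher-dimensional space without ever computing the transformation, the sketch below compares both routes for a degree-2 polynomial kernel. This kernel is not mentioned in the report; it and the names `phi` and `polynomial_kernel` are chosen purely as an example.

```python
import numpy as np

def phi(x):
    """Explicit feature map from 2-D to 3-D that belongs to the kernel below."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2.0) * x1 * x2, x2 ** 2])

def polynomial_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # transform first, then take the dot product
via_kernel = polynomial_kernel(x, z)      # stay in the original 2-D space
```

Both values are equal (here 1.0), but the kernel route never constructs the three-dimensional vectors. For images with many features this difference in cost is what makes the kernel trick attractive.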

*Still missing and to be elaborated: the Wolfe dual problem and the kernel trick.*

## 5.2 Data reduction: Principal component analysis and Tiles
As the flattened images lead to an array with over a million columns, which would take the SVM a huge amount of time to process,
we saw the need to apply a data reduction method beforehand.
The two possibilities we used and compared were Principal Component Analysis and cutting the image into tiles before averaging
over each tile.

Principal Component Analysis (PCA) is a mathematical procedure that transforms
a number of correlated variables into a smaller number of uncorrelated variables called principal components. This
statistical method is frequently used in image processing for data compression, data dimension reduction,
and data decorrelation. In PCA, the information contained in a set of data is stored with reduced dimensions
by projecting the data set onto a subspace spanned by a system of orthogonal axes. The reduced dimensions
are selected so that significant data characteristics are retained with little information loss.
Such a reduction is advantageous in several scientific applications, among them image
compression as in our present implementation (Ashour, A. 2015, Mudrova, M. 2005).
PCA can thus be summarized as a method for specific data reduction.

For PCA, one option is to perform a PCA that does not reduce the number of features subsequently used for the SVM. This can be achieved using the scaling operation
mentioned before. In this case, tiles are needed additionally to reduce the amount of data.
Another option is to use a PCA that reduces the number of features to the smaller of the number of
features and the number of samples. In this case, as too few principal components lead to poor results, the given data needs to be
supplemented with synthetically generated microscopic cell images.
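
The limit on the number of principal components can be seen in a small sketch that computes a PCA through a singular value decomposition. It is an illustration under assumptions, not the project's code: the random data stands in for 28 flattened images, and the name `pca_reduce` is hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the first principal components (PCA via SVD)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal axes, sorted by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    components = vt[:n_components]
    return centered @ components.T, components, explained[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(28, 400))        # 28 samples with 400 features each (made-up data)
max_components = min(X.shape)         # no more axes than samples or features
scores, components, explained = pca_reduce(X, n_components=5)
```

With 28 samples at most 28 components exist, no matter how many pixels each image has. This is why a larger training set, for example extended with synthetic images, allows more components to be kept.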

In contrast to this specific reduction of data, tiles-rendering is a simple approach that reduces data unspecifically by exploiting the fundamental properties of a problem space.
The concept behind this method is to save computational resources by splitting the image pixels into multiple sets of N x N tiles
and calculating the average grey value of each set. The final result of this averaging is another image whose
pixels are the average intensities of the N x N pixel sets of the larger, higher-resolution original image (Rastar, A. 2019).
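
A minimal sketch of the tile averaging is given below. How the project treats images whose side lengths are not multiples of N is not stated in the report; cropping the incomplete border tiles, as done here, is an assumption, as is the name `average_tiles`.

```python
import numpy as np

def average_tiles(image, tile_size):
    """Replace every tile_size x tile_size block by its mean intensity."""
    image = np.asarray(image, dtype=float)
    rows = image.shape[0] - image.shape[0] % tile_size
    cols = image.shape[1] - image.shape[1] % tile_size
    cropped = image[:rows, :cols]  # drop incomplete tiles at the border (assumption)
    tiles = cropped.reshape(rows // tile_size, tile_size, cols // tile_size, tile_size)
    return tiles.mean(axis=(1, 3))

image = np.arange(16).reshape(4, 4)
small = average_tiles(image, tile_size=2)
# small is [[2.5, 4.5], [10.5, 12.5]]
```

A 1024 x 1024 image averaged over 8 x 8 tiles, for example, shrinks from over a million values to 128 x 128 = 16384 values.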

## 5.3 Stochastic gradient descent to minimize the loss gradient
In order to minimize function (7), its gradient is calculated through the Lagrange multipliers.
A common method to minimize such a function is called Gradient Descent.
The gradient ∇P(θ) of an objective function P(θ), which is parameterized by the model's parameters θ, is calculated.
∇P(θ) points in the direction of the steepest ascent of our function.
During Gradient Descent, the parameters are updated in the opposite direction of the gradient. This process is repeated until a
(local) minimum is reached, taking steps whose size is determined beforehand by the learning rate.
To put it differently, the slope of the surface described by P(θ) is followed downwards until the lowest
point is reached (Ruder, 2017).
Different kinds of gradient descent mostly differ only in the amount of data they use to compute the gradient. Essentially,
a trade-off between the accuracy of the parameter update and the runtime has to be made.
As part of our SVM, we used Stochastic Gradient Descent (SGD). In contrast to basic gradient descent, SGD does not use
all of the data for its calculation, but only a randomly selected part of it, called stochastic representation (Johnson and Zhang, 2013).
This reduces computation time significantly and makes the program faster (Johnson and Zhang, 2013).
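
The following sketch shows how SGD could be applied to a linear soft-margin SVM. It is not the project's implementation: it minimizes the hinge-loss form of (7) directly in w and b, uses a single sample per update, and the learning rate, number of epochs, toy data and the name `train_linear_svm_sgd` are all assumptions.

```python
import numpy as np

def train_linear_svm_sgd(X, y, C=1.0, learning_rate=0.01, epochs=200, seed=0):
    """Minimise 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w . x_i + b)) with SGD."""
    rng = np.random.default_rng(seed)
    m, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        for i in rng.permutation(m):      # visit samples in random order, one per update
            margin = y[i] * (X[i] @ w + b)
            grad_w = w / m                # this sample's share of the regularisation term
            grad_b = 0.0
            if margin < 1:                # sample violates the margin: add hinge gradient
                grad_w = grad_w - C * y[i] * X[i]
                grad_b = -C * y[i]
            w = w - learning_rate * grad_w   # step against the gradient
            b = b - learning_rate * grad_b
    return w, b

# Two well-separated toy clusters
X = np.array([[2.0, 2.0], [3.0, 2.5], [2.5, 3.5],
              [-2.0, -2.0], [-3.0, -2.5], [-2.5, -3.5]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = train_linear_svm_sgd(X, y)
predictions = np.sign(X @ w + b)
```

Each update touches only one sample, so its cost does not depend on the size of the data set. The price is a noisier path towards the minimum than with full gradient descent.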


## 5.4 K-Fold Cross validation
Validation is a widely used technique in data science to evaluate how stable a machine learning (ML) model is and to
assess how well the model would generalize to new, independent data. Relevant for these two characteristics is the model's
ability to differentiate between relevant patterns and noise in the available data (Vabalas et al., 2019). As a measure
of how well the model achieves this, the bias-variance tradeoff can be used (Geman et al., 1992; Berrar, 2019).
Bias and variance are both sources of error in ML generalization. With increasing model complexity, bias decreases and
variance increases monotonically (Yang et al., 2020). In short: high bias indicates 'underfitting', high variance points
to an 'overfitted' model (Yang et al., 2020). 'Underfitting' describes the performance of a model that is able to
classify neither its training data nor new data well, because it captures too few patterns. 'Overfitting', on the other hand,
refers to a model that is overly sensitive to inherent noise and random patterns in its training data and for that reason
performs poorly on new data. Ideally, both bias and variance would be minimized (Geman et al., 1992; Berrar, 2019).
In reality, the right balance of both is needed to generate an optimal model (Yang et al., 2020).

One validation technique is k-fold cross-validation (CV).
In CV, the available data, which encompasses n samples, is split into k subsets, where k is an
integer between 2 and n that is chosen beforehand. For each iteration, k-1 subsets are used as training data, while the remaining subset is
used to test the model and thus forms the validation set.
To put it differently: each data sample is part of the testing data once and part of the training data in all other
iterations (Vabalas et al., 2019).
*This approach substantially reduces bias, as it uses most data points for fitting. Simultaneously, variance also decreases.
But as only a few data points, in the extreme case a single one, are used for testing in each iteration, higher variation in the
estimated model effectiveness can occur (Berrar, 2019).*

As our data consists of only 28 images, we used a special case of cross-validation called leave-one-out cross-validation (LOOCV).
In this approach, k equals the number of samples. So for each iteration, 27 images are part of the training set, while 1 image
acts as the validation set.
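
The splitting scheme can be sketched in a few lines of plain Python. The function name `k_fold_splits` is hypothetical, and the sketch assigns consecutive indices to each fold without shuffling, which is an assumption about a detail the report leaves open.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, nearly equal folds."""
    indices = list(range(n_samples))
    # Distribute the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if fold < n_samples % k else 0) for fold in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

loocv = list(k_fold_splits(28, k=28))     # leave-one-out: k equals the number of images
four_fold = list(k_fold_splits(28, k=4))  # ordinary 4-fold CV for comparison
```

With k = 28 every split contains 27 training indices and a single test index, and every image is used for testing exactly once.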

## 5.5 Implementing the Support Vector Machine

*TO BE DONE*

# 6. Dice Coefficient
## 6.1 The Theory behind the Dice Coefficient
The Dice coefficient is a score used to evaluate and compare the accuracy of a segmentation.
Its calculation requires the segmented image as well as a corresponding binary reference, also called
ground truth (Bertels et al., 2019).
As ground truth, researchers mostly use manual segmentations made by humans. We will use the ground truth images
provided with our data sets, which we assume were acquired in this way.
Using the ground truth image, the labels true positive (TP), false positive (FP) and false
negative (FN) are assigned to the pixels of the segmented image (Menze et al., 2015).
This information is then used to calculate the dice coefficient using formula (1):

\begin{equation}
(1) \ dice = {\frac{2TP}{2TP + FP + FN}} \ \ \in \ \ [0,1]
\end{equation} <br>
(Menze et al., 2015)

A dice score of 0 indicates that the ground truth and the segmentation result do not have any overlap. A dice score of 1 on the
other hand, shows a 100% overlap of ground truth and segmented image (Bertels et al., 2019).


## 6.2.1 Implementing the Dice Coefficient

    from SVM_Segmentation.dicescore import dice_score as dicescore
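
The source of the project's `dice_score` function is not reproduced in this report. As a rough idea of how formula (1) can be computed, here is a minimal sketch; the name `dice_coefficient`, the handling of two empty images and the example arrays are assumptions and need not match the project's code.

```python
import numpy as np

def dice_coefficient(segmentation, ground_truth):
    """Dice score 2TP / (2TP + FP + FN) for two binary images of equal shape."""
    seg = np.asarray(segmentation).astype(bool)
    truth = np.asarray(ground_truth).astype(bool)
    tp = np.count_nonzero(seg & truth)    # nucleus in both images
    fp = np.count_nonzero(seg & ~truth)   # nucleus only in the segmentation
    fn = np.count_nonzero(~seg & truth)   # nucleus only in the ground truth
    denominator = 2 * tp + fp + fn
    if denominator == 0:                  # both images empty; 1.0 is a convention chosen here
        return 1.0
    return 2 * tp / denominator

segmentation = np.array([[1, 1, 0], [0, 1, 0]])
ground_truth = np.array([[1, 0, 0], [0, 1, 1]])
score = dice_coefficient(segmentation, ground_truth)  # TP = 2, FP = 1, FN = 1
```

For this example the score is 4 / 6, i.e. about 0.67.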

## 6.2.2 Unittesting the Dice Coefficient
To test the code for the dice coefficient, we used a common method of software testing: unit tests.
Unit tests are a way of validating that a specific chunk of code, a unit, performs as expected and thus yields the anticipated result (Hamill, 2005).

We implemented two kinds of unit tests.
The dice coefficient of an image with itself is always 1.0. For our first unit test, we used this knowledge to test our code.
For this test we generated synthetic masks, i.e. black-and-white synthetic images, with which we performed the unit test.

For the second unit test, we defined two random arrays consisting only of ones and zeros.
One array represented the segmented image, while the other served as ground truth.
Using formula (1), we calculated the dice score by hand and compared our result with our code's output.
In addition, we compared our code's result to the implemented
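
Two tests of this kind could look like the sketch below, written with Python's built-in `unittest` module. It is an illustration only: `dice_coefficient` is a stand-in for the project's function, and fixed arrays are used instead of random ones so that the hand-calculated value is reproducible.

```python
import unittest
import numpy as np

def dice_coefficient(segmentation, ground_truth):
    """Stand-in for the project's dice function, following formula (1)."""
    seg = np.asarray(segmentation).astype(bool)
    truth = np.asarray(ground_truth).astype(bool)
    tp = np.count_nonzero(seg & truth)
    fp = np.count_nonzero(seg & ~truth)
    fn = np.count_nonzero(~seg & truth)
    return 2 * tp / (2 * tp + fp + fn)

class TestDiceCoefficient(unittest.TestCase):
    def test_image_with_itself(self):
        mask = np.zeros((8, 8), dtype=np.uint8)
        mask[2:6, 2:6] = 1                      # simple synthetic mask
        self.assertEqual(dice_coefficient(mask, mask), 1.0)

    def test_hand_calculated_value(self):
        segmentation = np.array([1, 1, 0, 0, 1])
        ground_truth = np.array([1, 0, 0, 1, 1])
        # By hand: TP = 2, FP = 1, FN = 1, so dice = 4 / 6
        self.assertAlmostEqual(dice_coefficient(segmentation, ground_truth), 4 / 6)

# Run the tests without exiting the interpreter, which also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestDiceCoefficient)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```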

## 6.3 Synthetic Images
### 6.3.1 Definition and Goal
The basic concept behind creating synthetic images is to use algorithms and already available images to generate
new ones (Dunn et al., 2019). Our first objective was simply to use these new images to test our code for the dice score.
But while researching this topic, we realized that synthetic images have immense potential, especially for the training
phase of a machine learning algorithm.
This is particularly useful as our data encompasses only 28 images, which leads to a training data set of at most 27 images.
By expanding our training set with diverse images of good quality, we expect a more accurate model (Mayer et al., 2017).
There are various methods for the generation of synthetic images (Ward et al., 2019). Because of the scope of our project and the
kind of images we wanted to produce, we focused on image composition and domain randomization.

### 6.3.2 Image composition
To produce synthetic masks, a white circle is drawn on top of a black background.
While the background stays the same throughout all images, the circle's size and position are varied.
Our algorithm iterates through the random position and scaling generator. By doing this it produces random images that
are black-and-white only, so their pixel intensities are either 0 or 1, and that can therefore be used as masks for our dice score.
We used these masks to test our dice score.
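
A possible way to generate such masks is sketched below. The image size, the radius range, the random number generator and the name `make_circle_mask` are assumptions; the project's own generator may differ.

```python
import numpy as np

def make_circle_mask(height, width, rng, min_radius=5, max_radius=20):
    """Binary mask (0 = background, 1 = circle) with random centre and radius."""
    radius = int(rng.integers(min_radius, max_radius + 1))
    center_row = int(rng.integers(0, height))
    center_col = int(rng.integers(0, width))
    rows, cols = np.ogrid[:height, :width]
    inside = (rows - center_row) ** 2 + (cols - center_col) ** 2 <= radius ** 2
    return inside.astype(np.uint8)

rng = np.random.default_rng(42)
masks = [make_circle_mask(64, 64, rng) for _ in range(5)]
```

Since the centre always lies inside the image, every mask contains at least part of a circle, even when the circle is cut off at the border.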

### 6.3.3 Domain randomization
In order to create synthetic microscopic cell images, we used a method called domain randomization.
At first, domain randomization requires collecting various foreground images, either by separating the images from their
backgrounds or by using images in .png format to begin with.
These foreground images are then pasted onto different backgrounds (Tripathi, 2019).
To obtain more variety among the resulting synthetic images, the foreground images can be modified using different contrasts,
zooms or rotations (Ward et al., 2019; Alghonaim and Johns, 2020).

We used domain randomization to generate a new set of images with cells cut from our data set of 28 images.
After rotating and scaling the cells, they were pasted at a random position onto a background, which had also been cut and
scaled from our dataset.
These new images were then used to enlarge and enrich our training data set.
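
The pasting step can be sketched as follows. This is a simplified illustration and not the project's code: rotation is restricted to multiples of 90 degrees, scaling is left out, the cell and background are artificial arrays, and returning a matching label image is an assumption. The name `paste_cell` is hypothetical.

```python
import numpy as np

def paste_cell(background, cell, cell_mask, rng):
    """Paste a randomly rotated cell crop at a random position onto a copy of the background."""
    turns = int(rng.integers(0, 4))              # number of 90 degree rotations
    cell = np.rot90(cell, turns)
    cell_mask = np.rot90(cell_mask, turns).astype(bool)
    height, width = cell.shape
    row = int(rng.integers(0, background.shape[0] - height + 1))
    col = int(rng.integers(0, background.shape[1] - width + 1))
    image = background.copy()
    label = np.zeros(background.shape, dtype=np.uint8)
    region = image[row:row + height, col:col + width]
    region[cell_mask] = cell[cell_mask]          # region is a view, so this writes into image
    label[row:row + height, col:col + width][cell_mask] = 1
    return image, label

rng = np.random.default_rng(7)
background = np.full((32, 32), 10, dtype=np.uint8)  # stand-in for a background crop
cell = np.full((6, 4), 200, dtype=np.uint8)         # stand-in for a cut-out cell
cell_mask = np.ones((6, 4), dtype=np.uint8)         # pixels of the crop that belong to the cell
synthetic, label = paste_cell(background, cell, cell_mask, rng)
```

Calling the function repeatedly on the same background would place several cells in one image, and the label image could then serve as ground truth for the synthetic sample.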

# 7. Results
Our goal is to determine the best combination of the different possible pre-processing and preparation settings.
To put it differently, we want to evaluate whether, and if so which, specific changes to the original images enhance the
segmentation result. These changes encompass the Gaussian filter, Watershed, Tiles and Principal Component Analysis (PCA).
For a more precise evaluation, we use the dice score function to compare the final segmented images. The comparisons
we made are the following:
1) for the pre-processing: with Gauss-Filter vs. without, with Watershed vs. without, with Watershed vs. with Gauss
2) for the SVM-preparation: with Tiles and PCA (without feature reduction (fr)) vs. with Tiles, with PCA (with fr) vs. with Tiles,
with PCA (with fr) vs. with Tiles and PCA.

# 8. Discussion

*TO BE DONE*

# 9. Bibliography

Alghonaim, R., Johns, E. (2020). Benchmarking Domain Randomisation for Visual Sim-to-Real Transfer. CoRR.

Berrar, D. (2019). Cross-validation. Data Science Laboratory, Tokyo Institute of Technology.

Bertels, J., Eelbode, T., Berman, M., Vandermeulen D., Maes F., Bisschops, R., Blaschko, M. (2020).
Optimization for Medical Image Segmentation: Theory and Practice when evaluating with Dice Score or Jaccard Index.
IEEE Trans Med Imaging.

Boser, B., Guyon, I., Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Proceedings of the fifth
annual workshop on Computational learning theory. Ed. 07.1992, 144–152.

Christianini, N., Ricci, E. (2008). Support Vector Machines. Encyclopedia of Algorithms, Springer.

Deng, G., Cahill, L.W. (1993). An adaptive Gaussian filter for noise reduction and edge detection.
Proc. of Nuclear Science Symposium and Medical Imaging Conference. 3. 1615 - 1619 vol.3.

Dunn, K.W., Fu, C., Ho, D.J., Lee S., Han S., Salama P., Delp E. (2019). DeepSynth: Three-dimensional nuclear segmentation of
biological images using neural networks trained with synthetic data. Sci Rep 9, 18295

Evgeniou, T., Pontil, M. (2001). Support Vector Machines: Theory and Applications. Computer Science

Johnson, R., Zhang, T. (2013). Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. 26th International
Conference on Neural Information Processing Systems.

Gedraite, E., Hadad, M. (2011). Investigation on the effect of a Gaussian Blur in image filtering and segmentation. 393-396.

Geman, S., Bienenstock, E., Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation.

Guanasekaran, T., Shankar Kumar, K.R. (2010). Modified concentric circular microstrip array configurations for
wimax base station. Journal of Theoretical and Applied Information Technology Vol 12. No. 1

Hamill, P. (2005). Unit Test Frameworks: Tools for High-Quality Software Development (S. 1 f.).

Mayer, N., Ilg, E., Fischer, P., Hazirbas, C., Cremers, D., Dosovitskiy, A., Brox, T. (2018). What Makes Good Synthetic
Training Data for Learning Disparity and Optical Flow Estimation?. Int J Comput Vis 126, 942–960.

Mehta, B., Diaz, M., Golemo, F., Pal, C. J., Paull, L. (2020). Active Domain Randomization. Proceedings of Machine Learning Research.

Menze, B., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., Lanczi, L.,
Gerstner, E., Weber, M., Arbel, T., Avants, B., Ayache, N., Buendia, P., Collins, D., Cordier, N., Corso, J., Criminisi, A., Das, T.,
Delingette, H., Demiralp, Ç., Durst, C., Dojat, M., Doyle, S., Festa, J., Forbes, F., Geremia, E., Glocker, B., Golland, P., Guo, X., Hamamci, A.,
Iftekharuddin, K., Jena, R., John, N., Konukoglu, E., Lashkari, D., Mariz, J., Meier, R., Pereira, S., Precup, D., Price, S., Raviv, T.,
Reza, S., Ryan, M., Sarikaya, D., Schwartz, L., Shin, H., Shotton, J., Silva, C., Sousa, N., Subbanna, N., Szekely, G., Taylor, T.,
Thomas, O., Tustison, N., Unal, G., Vasseur, F., Wintermark, M., Ye, D., Zhao, L., Zhao, B., Zikic, D., Prastawa, M., Reyes, M., Van Leemput, K. (2015).
The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging.

Mudrova, M., Prochazka, A. (2005). Principal Component Analysis in Image Processing.

Nandi, D., Ashour, A., Sourav, S., Chakraborty, S., Mohammed Abdel-Megeed, M., Dey, N. (2015).
Principal Component Analysis in Medical Image Processing: A Study. International journal of Image Mining.

Rastar, A. (2019). A Novel Pixel-Averaging Technique for Extracting Training Data from a Single Image, Used in ML-Based Image Enlargement.

Ruder, S. (2017). An overview of gradient descent optimization algorithms. Insight Centre for Data Analytics.

Suthaharan S. (2016). Support Vector Machine. Machine Learning Models and Algorithms for Big Data Classification.
Integrated Series in Information Systems, vol 36. Springer.

Thai, L.H., Tran S.H., Nguyen T.T. (2012). Image Classification using Support Vector Machine and Artificial Neural Network.
International Journal of Information Technology and Computer Science (IJITCS).

Tripathi, S., Chandra S., Agrawal, A., Tyagi, A., Rehg, J. M., Chari, V. (2019). Learning to Generate Synthetic
Data via Compositing. IEEE Xplore.

Vabalas, A., Gowen, E., Poliakoff, E., Casson, A.J. (2019). Machine learning algorithm validation with a limited sample size. Plos one.

Ward, D., Moghadam, P., Hudson, N. (2019). Deep Leaf Segmentation Using Synthetic Data. CoRR.

Yang, Z., Yu, Y., You, Y., Steinhardt, J., Ma, Y. (2020). Rethinking Bias-Variance Trade-off
for Generalization of Neural Networks. Cornell University.






