## Week 1: Literature Review and Scope Definition 

After my kickoff meeting with Professor Rachlin, I began reading and documenting relevant papers in the fields of Neural Network Hyperparameter Tuning, Multi-Objective Optimization, and Evolutionary Deep Learning. These works gave me the background needed to identify the research area: Multi-Objective Hyperparameter Optimization of Neural Network Architecture using Evolutionary Algorithms.

Manual experimentation currently functions as a “black art” with no standardized protocol. Using rational agents to iteratively evolve and optimize neural networks may, at best, provide such a protocol or, at worst, provide information on the relationships between configuration specifications and performance metrics.

These agents not only adjust high-level hyperparameters, such as learning rates or activation functions, but also actively modify the structure of the network itself, including the number of layers, types of layers, and the interconnections between them. By formulating the problem as a multi-objective optimization task, the candidate solutions balance competing goals such as accuracy, complexity, and training efficiency. ML practitioners can use this research to assess which tradeoffs are important for their research and analysis objectives, and specify hyperparameters accordingly.

Bayesian methods

## Key Quotes
- pg 1: But in many applications, we are not only interested in optimizing ML pipelines solely for predictive accuracy; additional metrics or constraints must be considered when determining an optimal configuration, resulting in a multi-objective optimization problem often neglected in practice, due to a lack of knowledge and readily available software implementations for multi-objective hyperparameter optimization
- pg 1: existing optimization strategies, both from the domain of evolutionary algorithms and Bayesian optimization
- pg 2: For a diagnostic test, solely looking at misclassification rates is ill-advised: Misclassifying a sick patient as healthy (false negative) has usually much more severe consequences than classifying a healthy person erroneously as sick (false positive), i.e., different misclassification costs, which are often unknown or hard to quantify, have to be considered
- pg 2: This set of Pareto optimal solutions can then be analyzed by domain experts in a post-hoc manner, and an informed decision can be made as to which trade-off should be used in the application, without requiring the user to specify this a priori
- We restrict the scope of this paper to the realm of supervised ML. Unsupervised ML, in contrast, entails a different set of metrics to the scenario studied in our manuscript and is largely governed by custom, use case-specific measures [87, 205] and sometimes even visual inspection of results
- pg 3: We will categorize these applications through exploring three overarching perspectives on the ML process: (1) Performance metrics, (2) metrics that measure costs and restrictions at deployment like efficiency, and (3) metrics that enforce reliability and interpretability.
- The fundamental ML problem can be defined as follows. **this entire paragraph**
- nested resampling techniques should be applied
- A simple alternative is simply splitting Doptim into two datasets Dtrain and Dval, which leads to the widely known train/validation/test-split
- pg 5: The domain Λ of the problem is called numerical if only numeric hyperparameters 𝝀 are optimized
- By including additional categorical hyperparameters, like the type of kernel used in a support vector machine (SVM), the search space becomes mixed numerical and categorical. Mixed search spaces already require adaption of some optimization strategies, such as BO, which we will discuss in Section 4.4. It can also be necessary to introduce further conditional hierarchies between hyperparameters. For example, when optimizing over different kernel types of an SVM, the 𝛾 kernel hyperparameter is only valid if the kernel type is set to Radial Basis Function (RBF), while for a polynomial kernel, a hyperparameter for the polynomial degree must be specified. These conditional hierarchies can become highly complicated - especially when moving from pure HPO to optimizing over full ML pipelines, i.e., AutoML, or over neural network architectures, referred to as neural architecture search (NAS) [89, 202, 215]
- Feature selection is a topic that borders MOHPO and multi-objective ML and is often handled in a multi-objective manner
- pg 6: fairness with respect to two subpopulations satisfies a specific value
- pg 6: It is the task of an ML practitioner to translate a real-world problem into an ML task - and therefore objectives and constraints - to measure the quality and feasibility of a given model
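- Analogously, we say a HPC $\boldsymbol{\lambda} \in \tilde{\Lambda}$ (Pareto-)dominates another configuration $\boldsymbol{\lambda}'$, if and only if $c(\boldsymbol{\lambda}) \prec c(\boldsymbol{\lambda}')$. In other words: $\boldsymbol{\lambda}$ dominates $\boldsymbol{\lambda}'$, if and only if there is no criterion $c_i$ in which $\boldsymbol{\lambda}'$ is superior to $\boldsymbol{\lambda}$, and at least one criterion $c_j$ in which $\boldsymbol{\lambda}$ is strictly better.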
- This situation arises if there exist $i, j \in \{1, \dots, m\}$ for which $c_i < c'_i$ but also $c'_j < c_j$. Hence, in contrast to single-objective optimization, there is in general no unique single best solution $\boldsymbol{\lambda}^{*}$, but a set of Pareto optimal solutions that are pairwise incomparable with regard to $\prec$. This set of solutions is referred to as the Pareto (optimal) set and defined as $P := \{ \boldsymbol{\lambda} \in \tilde{\Lambda} \mid \nexists\, \boldsymbol{\lambda}' \in \tilde{\Lambda} \text{ s.t. } \boldsymbol{\lambda}' \prec \boldsymbol{\lambda} \}$
- [...] non-dominated solution sets - i.e., within each set, no configuration is dominated by another configuration. The associated approximated Pareto fronts are denoted by $\hat{P}_{S_1}$ and $\hat{P}_{S_2}$ respectively. According to Zitzler et al. [2003], $\hat{P}_{S_1}$ is said to weakly dominate $\hat{P}_{S_2}$, denoted as $\hat{P}_{S_1} \preceq \hat{P}_{S_2}$, if for every solution $\boldsymbol{\lambda}_2 \in S_2$ there is at least one solution $\boldsymbol{\lambda}_1 \in S_1$ which weakly dominates $\boldsymbol{\lambda}_2$. $\hat{P}_{S_1}$ is furthermore said to be better than $\hat{P}_{S_2}$, denoted as $\hat{P}_{S_1} \vartriangleright \hat{P}_{S_2}$, if $\hat{P}_{S_1} \preceq \hat{P}_{S_2}$, but not every solution of $\hat{P}_{S_1}$ is weakly dominated by any solution in $\hat{P}_{S_2}$, i.e., $\hat{P}_{S_2} \npreceq \hat{P}_{S_1}$. This represents the weakest form of superiority between two approximations of the Pareto front
- How well a single solution set represents the Pareto front can be divided into four qualities [170]:
  - Convergence: The proximity to the true Pareto front
  - Spread: The coverage of the Pareto front
  - Uniformity: The evenness of the distribution of the solutions
  - Cardinality: The number of solutions
- The importance of normalized objectives and various methods on how to implement them have been of interest within the multi-objective evolutionary computation community; even the particular effects on certain algorithms has been examined in detail
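
To make the dominance and Pareto set definitions quoted in this list concrete for myself, here is a minimal sketch. It assumes every criterion is minimized; the function names and the four example cost vectors (error rate, number of neurons) are made up for illustration and do not come from the paper or the Evo framework.

```python
from typing import Sequence


def dominates(c_a: Sequence[float], c_b: Sequence[float]) -> bool:
    """True if cost vector c_a Pareto-dominates c_b (all criteria minimized)."""
    # c_a must be no worse in every criterion ...
    no_worse = all(a <= b for a, b in zip(c_a, c_b))
    # ... and strictly better in at least one.
    strictly_better = any(a < b for a, b in zip(c_a, c_b))
    return no_worse and strictly_better


def pareto_set(costs: dict) -> set:
    """Names of the configurations that no other configuration dominates."""
    return {
        name
        for name, c in costs.items()
        if not any(
            dominates(other, c)
            for other_name, other in costs.items()
            if other_name != name
        )
    }


# Hypothetical (error rate, number of neurons) pairs for four configurations.
costs = {
    "A": (0.10, 64),   # dominated by D (same error, more neurons)
    "B": (0.08, 128),
    "C": (0.12, 128),  # dominated by A, B and D
    "D": (0.10, 32),
}
front = pareto_set(costs)
```

Here `B` and `D` are incomparable (one has lower error, the other fewer neurons), which is exactly the situation where no single best solution exists.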

Scalarization transforms a multi-objective goal into a single-objective one, i.e., it is a function $s : \mathbb{R}^m \times T \to \mathbb{R}$ that maps $m$ criteria to a single criterion to be optimized, configured by scalarization hyperparameters $\alpha \in T$. Having only one objective often simplifies the optimization problem [194]. However, there are two main drawbacks to using scalarization for MOO [141]: Firstly, the scalarization hyperparameters $\alpha$ must be chosen sensibly, such that the single-objective represents the desired relationship between the multiple criteria - which is not trivial, especially without extensive prior knowledge of the optimization problem [...] and not adequately represent a multi-objective problem with conflicting objectives
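
A small sketch of what this means in practice, assuming the simplest scalarization I know of, a weighted sum with weights $\alpha$. The two configurations and their normalized costs are invented; the point is only that the "best" configuration flips depending on how $\alpha$ is chosen.

```python
from typing import Sequence


def weighted_sum(costs: Sequence[float], alpha: Sequence[float]) -> float:
    """Collapse m minimized criteria into one number using weights alpha."""
    if len(costs) != len(alpha):
        raise ValueError("costs and alpha must have the same length")
    return sum(w * c for w, c in zip(alpha, costs))


# Hypothetical normalized (error rate, complexity) costs, both minimized.
candidates = {
    "small_net": (0.20, 0.10),
    "large_net": (0.05, 0.60),
}


def best(alpha: Sequence[float]) -> str:
    """Name of the candidate with the lowest scalarized cost under alpha."""
    return min(candidates, key=lambda name: weighted_sum(candidates[name], alpha))


accuracy_focused = (0.9, 0.1)  # error rate matters most
size_focused = (0.3, 0.7)      # complexity matters most
```

With `accuracy_focused` the large network wins; with `size_focused` the small one does, so the choice of weights has to be made before seeing the alternatives.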


## Confusing Statements
- pg 2: However, it is often unclear how a trade-off between different objectives should be defined a priori, i.e., before possible alternative solutions are known
- pg 2: many ML and data mining applications inherently concern trade-offs and thus should be approached via MOO methods
- pg 2: And even if the main interest lies in a single objective it still might be advantageous to approach the problem via MOO methods since they have the potential to reduce local minima
- Mixed and hierarchical search spaces can be treated with BO with special kernel functions or by using a suitable surrogate, e.g., random forests.
- Evolutionary algorithms (Section 4.3), on the other hand, can not select HPCs as effectively as BO and thus usually need more proposals than BO; however, they propose HPCs naturally in batches and can handle mixed and hierarchical search spaces with ease because of the discrete nature of their proposal-generating operations.
- To record the evaluated hyperparameter configurations and their respective scores, we introduce the so-called archive $A = \big( (\boldsymbol{\lambda}^{(1)}, c(\boldsymbol{\lambda}^{(1)})), (\boldsymbol{\lambda}^{(2)}, c(\boldsymbol{\lambda}^{(2)})), \dots \big)$, with $A^{[t+1]} = A^{[t]} \cup (\boldsymbol{\lambda}^{+}, c(\boldsymbol{\lambda}^{+}))$ if a single configuration is presented by an algorithm that iteratively proposes hyperparameter configurations.
- Model parameters are fixed by the ML algorithm at training time in accordance to one or multiple metrics, whereas hyperparameters are chosen by the ML practitioner before training and influence the behavior of the learning algorithm and the structure of its associated hypothesis space.
- While the search space is generally smaller for hyperparameter optimization, the problem tends to be more expensive as multiple evaluations of the ML algorithm are required.
- Quality indicators that focus on all four qualities listed above can be divided into distance-based, which require the knowledge of the true Pareto front or a suitable approximation of it, and volume-based, which measure the volume between the approximated Pareto Front and a method-specific point.
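
To help me picture a volume-based indicator, here is a sketch of the hypervolume for two minimized criteria, where the "method-specific point" is a reference point that is worse than every solution of interest. The function name, the example front, and the reference point are my own illustration, not taken from the paper.

```python
def hypervolume_2d(front, reference):
    """Area dominated by `front` and bounded by `reference` (both criteria minimized)."""
    # Only points strictly better than the reference in both criteria contribute.
    points = sorted(p for p in front if p[0] < reference[0] and p[1] < reference[1])
    area = 0.0
    best_y = reference[1]
    for x, y in points:
        # Sweeping left to right, a point adds area only if it improves the
        # second criterion; otherwise an earlier point already dominates it.
        if y < best_y:
            area += (reference[0] - x) * (best_y - y)
            best_y = y
    return area


# Hypothetical front of three mutually non-dominated points.
front = [(1, 3), (2, 2), (3, 1)]
reference = (4, 4)
volume = hypervolume_2d(front, reference)
```

A larger value means the approximated front covers more of the space between itself and the reference point; adding a dominated point such as `(3, 3)` leaves the value unchanged.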

## Week 2: Evo Framework Familiarization and Strategy

Professor Rachlin provided the latest code for the Evo Framework, and I did a complete deep dive into the code base. First I went line by line, handwriting the code to understand the relationship between the Profile class, Environment class, decorators, and the TA Assignment implementation. Once I understood how to define agents, objectives, solutions, and the environment, I created schematics, UML diagrams, and pseudocode to map out Python scripts for the MOHPO (Multi-Objective Hyperparameter Optimization) problem.

Exploring further papers, I began brainstorming potential objective functions (e.g., training time, memory size, number of neurons, number of layers) and agents (e.g., change layer type, change model optimizer, ...)

How will the model be trained efficiently? We may need to enhance the Evo framework to say whether an objective is being maximized or minimized. For now, convert accuracy to error rate (just to get it to work). Run the actual code and make sure you understand how it works.
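
A sketch of the two options in the note above, under the assumption (implied by the note, not confirmed from the code) that the Evo framework treats every objective as something to minimize. The helper `as_minimization` and the dictionary-shaped solution are hypothetical and not part of the framework.

```python
from typing import Callable


def accuracy(solution: dict) -> float:
    """Fraction of correct predictions (larger is better)."""
    return solution["correct"] / solution["total"]


def error_rate(solution: dict) -> float:
    """Option 1: report 1 - accuracy so that smaller is better."""
    return 1.0 - accuracy(solution)


def as_minimization(
    objective: Callable[[dict], float], maximize: bool
) -> Callable[[dict], float]:
    """Option 2: wrap any objective so that smaller is always better."""
    if not maximize:
        return objective
    # Negating a maximized objective preserves the ranking under minimization.
    return lambda solution: -objective(solution)


# Hypothetical evaluation results for two solutions.
good = {"correct": 45, "total": 50}
bad = {"correct": 30, "total": 50}
neg_accuracy = as_minimization(accuracy, maximize=True)
```

Both options rank `good` ahead of `bad`; the error-rate version is easier to read in a results table, while the wrapper works for any maximized objective.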

how do i want to present the design of the neural net arch?
no. of layers, no. of neurons, backpropagation, number of epochs

how accurate is the network? there are different measurements of accuracy
false positive, false negative (type I is fun, type II you're screwed in health diagnostics)

what parameters should be considered:
specific standardized accuracy. minimize the complexity of network, total number of nodes.

should i be storing the models as pkl loads? and if self.model is not None then whatever? or should the computer be handling all of it in real time?

F(Object[H, M, Metrics]) -> Number
train_model()
train_model(input_data)
build_model()
test_model()
Agent(Solution) -> Solution
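
One way I could express the signatures above in Python, as a rough sketch only. The `Solution` fields (H = hyperparameters, M = model, Metrics), the `layer_sizes` key, and the example objective and agent are my own placeholders, not the Evo framework's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Solution:
    hyperparameters: dict                        # H
    model: object = None                         # M: None until the model is trained
    metrics: dict = field(default_factory=dict)  # e.g. {"error_rate": 0.1}


Objective = Callable[[Solution], float]  # F(Object[H, M, Metrics]) -> Number
Agent = Callable[[Solution], Solution]   # Agent(Solution) -> Solution


def n_layers(sol: Solution) -> float:
    """Example objective: number of hidden layers (smaller is simpler)."""
    return float(len(sol.hyperparameters["layer_sizes"]))


def add_layer(sol: Solution) -> Solution:
    """Example agent: return a new solution with one more 8-unit layer."""
    hyper = dict(sol.hyperparameters)
    hyper["layer_sizes"] = [*hyper["layer_sizes"], 8]
    # Model and metrics are reset because the new architecture must be retrained.
    return Solution(hyperparameters=hyper)


base = Solution({"layer_sizes": [8]})
child = add_layer(base)
```

Keeping `Solution` immutable means an agent can never silently change its parent, which should make it easier to reason about what is in the population.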



## Week 3: Introduction to Keras and Evo Framework Integration

creating a solution object that has 

Learn Keras. Deep Learning with Python textbook. Write function to train and build and test the model. script I might write to build the neural network could be driven by hyperparameters. most of the models did not have that good of a performance

Keep the architecture VERY simple 

Finally I was able to get the code working, but the first working iteration results in an empty solution set. I wonder if there is some issue with how Pareto optimal solutions are calculated that causes this.

I am going to attempt to increase the number of layers to see if the additional complexity changes performance outcome.

**NOTE:** So that was not the issue. Working in Colab changes how you refer to files and directories at run time, and after the first 10 solutions were initiated the function call just ended. I am not sure why the solutions didn't stay in the population, but that is the issue here.

To be a good programmer you have to pay attention to the smallest of things. I kept getting a population of 0 for several reasons, but one of them was that the constraints file set a maximum for time, and because I was measuring in nanoseconds ALL of the algorithms went over time.
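
A sketch of the unit mismatch described above and one way to guard against it: convert the nanosecond reading to seconds before comparing it with the limit. The 60-second limit and the helper names are hypothetical; the real constraints file may use different values and units.

```python
import time

MAX_TRAIN_SECONDS = 60.0  # hypothetical limit, expressed in seconds


def elapsed_seconds(start_ns: int, end_ns: int) -> float:
    """Convert a pair of nanosecond timestamps into elapsed seconds."""
    return (end_ns - start_ns) / 1e9


def within_time_limit(start_ns: int, end_ns: int,
                      limit_seconds: float = MAX_TRAIN_SECONDS) -> bool:
    """Compare in seconds so the measurement and the limit share a unit."""
    return elapsed_seconds(start_ns, end_ns) <= limit_seconds


start = time.perf_counter_ns()
sum(range(1000))  # stand-in for training a model
end = time.perf_counter_ns()
ok = within_time_limit(start, end)
```

Comparing the raw nanosecond difference against `MAX_TRAIN_SECONDS` would reject anything slower than 60 nanoseconds, which matches the empty population I was seeing.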

Use iris dataset for simplicity

removing 'softmax' for now since it requires a specific number of units.. should i include it down the line?

start with basic hyperparameters first. constrain the search space of possibilities...what would be the minimal specification of a neural network that i can construct and run against some dataset...start as simple as possible
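
A sketch of what a deliberately small, constrained search space could look like, with a helper to count how many distinct configurations it contains. The hyperparameter names and candidate values are placeholders I picked for illustration, not the final design.

```python
import random

# Hypothetical minimal search space: a few discrete choices per hyperparameter.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3],
    "units": [4, 8, 16],
    "activation": ["relu", "tanh", "sigmoid"],
    "learning_rate": [0.001, 0.01, 0.1],
}


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration uniformly at random from the search space."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def space_size() -> int:
    """Total number of distinct configurations in the search space."""
    size = 1
    for values in SEARCH_SPACE.values():
        size *= len(values)
    return size


config = sample_config(random.Random(0))  # seeded for reproducibility
total = space_size()
```

With three options for each of four hyperparameters there are only 81 configurations, small enough to sanity-check the evolutionary search against brute force.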

create an object model that has these 3 elements in them and stores them in a way that we can interrogate the object and get back out a metric

## Week 4: Optimize Codebase 

I missed the meeting with Rachlin and prepared a report of the work I have done thus far. 

Evo deep learning book. Look at news outlets to identify topics for datasets that would be relevant to use for analysis. Consider astronomy and bioinformatics, as Rachlin enjoys those. Incorporate a validation set: after the Pareto set of optimal solutions is found, test it again on the validation set, this time reducing to the final set. Look at how many configurations can be made in a short time span. Get the Pareto optimal set of solutions and evaluate the different trade-offs (also the fundamental trade-off between accuracy and complexity: the simplest neural network that still has good accuracy). The number of layers; CNNs for image analysis and ImageNet competitions.

remove sol.data from each solution instance, should only be one
consider different types of layers: add `layers.Dropout(0.5)` dropout layers (applied to the layer that comes before it) pg 151
include other accuracy metrics (recall, precision, sensitivity, etc)
change size to number of layers and number of neurons
include validation set to filter out the final pareto optimal solutions
keep the evolution as is. once you have finished, get the models from the solutions, validate them on the validation set, get the final scores and THEN show what the output is (maybe store all these metrics in metrics, but only focus on certain ones for objectives)

address overfitting (look for similar performance between the test and train sets)
find dataset with biomedical relevance
add epochs as a hyperparameter
keep track of the number of different architectures you are able to generate per hour (10 is too few, 1000 is useful)
show dashboard showing tradeoffs, so what's the relationship between the objectives (maybe can have pareto optimal solutions vs all solutions) --> is there a strong positive correlation between increasing accuracy and increasing complexity?
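
For the correlation question above, a minimal sketch using the Pearson correlation coefficient computed by hand. The accuracy and neuron-count values are invented placeholders; real values would come from the evaluated solutions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (accuracy, total neurons) for four evaluated solutions.
accuracy = [0.70, 0.80, 0.85, 0.90]
neurons = [8, 16, 32, 64]
r = pearson(accuracy, neurons)
```

A value of `r` close to 1 would suggest that accuracy gains mostly come with added complexity; computing it separately for the Pareto optimal solutions and for all solutions could show whether the trade-off only appears on the front.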

review class materials on evo framework and deep learning