# Randomized Optimization

## Optimization
---
- Input space $X$
- Objective function $f: X \rightarrow \mathbb{R}$
    - fitness function
    - maps inputs to score
- Goal:
    - Find $x^* \in X$ such that $f(x^*) = \max_{x \in X} f(x)$
- Optimization helps:
    - Find the best process
    - Find the best route
    - Find the root (where the function crosses zero)
    - Find neural network weights that minimize error
    - Optimize the structure of a decision tree
    - Find the best parameters of learning algorithms

## Optimization Approaches
---
- Generate and test: small input space, complex functions
- Calculus: the function has a derivative, and $f'(x) = 0$ can be solved
- Newton's method: the function has a derivative; improves iteratively; suited to a single optimum   
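
As a rough illustration of the first and third approaches, the sketch below runs generate-and-test over a small candidate set and Newton's method on $f'(x) = 0$. The quadratic objective and the helper names `generate_and_test` and `newton_maximize` are assumptions made for this example only.

```python
def generate_and_test(f, candidates):
    """Score every candidate and keep the best (only viable for small spaces)."""
    return max(candidates, key=f)


def newton_maximize(df, d2f, x0, steps=50, tol=1e-10):
    """Newton's method on f'(x) = 0: x <- x - f'(x) / f''(x).

    It finds a stationary point, so it is only trustworthy when the
    function has a single optimum.
    """
    x = x0
    for _ in range(steps):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x


# Hypothetical objective with a single optimum at x = 3.
f = lambda x: -(x - 3.0) ** 2 + 5.0
df = lambda x: -2.0 * (x - 3.0)   # first derivative
d2f = lambda x: -2.0              # second derivative

best_grid = generate_and_test(f, range(11))
best_newton = newton_maximize(df, d2f, x0=10.0)
```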


- What if assumptions don't hold?
    - big input space
    - complex function
    - no derivative (or hard to find)
    - many local optima
    
In these cases use Randomized Optimization.

## Hill Climbing
---
<img src="../images/hill_climbing.jpeg" width=500 align="right"/>  

N: neighborhood   

- Guess $x \in X$
- Repeat:
    - Let neighbor $n^* = \arg\max_{n \in N(x)} f(n)$
        - find the neighbor with the largest function value
    - if $f(n^*) > f(x)$: $x = n^*$
        - if that neighbor has a higher function value move to that point 
    - else: stop   
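
The loop above can be sketched in a few lines. The two-peak objective `f` and the integer neighborhood `neighbors` are assumptions chosen to show how the result depends on the starting point.

```python
def hill_climb(f, neighbors, x):
    """Move to the best neighbor until no neighbor improves on x."""
    while True:
        best = max(neighbors(x), key=f)
        if f(best) > f(x):
            x = best
        else:
            return x


def f(x):
    # Hypothetical objective: local peak at x = 2, global peak at x = 8.
    return max(4 - (x - 2) ** 2, 9 - (x - 8) ** 2)


def neighbors(x):
    # N(x): the integers one step away, clipped to the range 0..10.
    return [n for n in (x - 1, x + 1) if 0 <= n <= 10]


print(hill_climb(f, neighbors, 0))  # 2 (stuck on the local peak)
print(hill_climb(f, neighbors, 5))  # 8 (reaches the global peak)
```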

## Random Restart Hill Climbing
---
- Once local optimum reached, try again starting from randomly chosen x
- Advantages
    - multiple tries to find good starting place
    - not much more expensive (constant factor)
- If there is only one optimum, random restarts will keep returning the same answer
- In the worst case, randomized hill climbing may do no better than evaluating the whole space, but it won’t do worse  
- The global optimum may be reachable from only a 'sliver' of the space (its basin of attraction). A bigger basin gives better performance. If the basin is too small, finding it is like finding a needle in a haystack.
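
A minimal sketch of random restarts, assuming the same kind of two-peak objective on the integers 0..10. The names `random_restart_hill_climb` and `sample_start` are illustrative, not from the notes.

```python
import random


def hill_climb(f, neighbors, x):
    """Plain hill climbing: move to the best neighbor while it improves."""
    while True:
        best = max(neighbors(x), key=f)
        if f(best) > f(x):
            x = best
        else:
            return x


def random_restart_hill_climb(f, neighbors, sample_start, restarts, rng):
    """Run hill climbing from several random starts and keep the best result."""
    best = None
    for _ in range(restarts):
        candidate = hill_climb(f, neighbors, sample_start(rng))
        if best is None or f(candidate) > f(best):
            best = candidate
    return best


def f(x):
    # Hypothetical objective: local peak at x = 2, global peak at x = 8.
    return max(4 - (x - 2) ** 2, 9 - (x - 8) ** 2)


def neighbors(x):
    return [n for n in (x - 1, x + 1) if 0 <= n <= 10]


rng = random.Random(0)
result = random_restart_hill_climb(
    f, neighbors, lambda r: r.randint(0, 10), restarts=50, rng=rng
)
```

Here the basin of attraction of the global peak covers the starts 5..10, so a handful of restarts is almost certain to land in it. The cost is only the number of restarts times the cost of one climb.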

## Simulated Annealing
---
- Don't always improve (exploit) - Sometimes you need to search (explore)
- The name comes from metallurgy: repeated heating and cooling strengthens the blade

### Annealing Algorithm
- For a finite set of iterations:   
    1. Sample a new point $x_t$ in $N(x)$
    2. Jump to the new sample with probability given by an acceptance probability function $P(x, x_t, T)$ (move to $x_t$ probabilistically)
        - $P(x,x_t,T)$ = 
            - 1  if  $f(x_t) \geq f(x)$
            - $e^{\frac{f(x_t)-f(x)}{T}}$ otherwise (depends on the fitness difference between the two points) 
                 - big T makes the exponent close to 0, so $P \approx e^0 = 1$ (likely to jump to the new point even if it is worse)
                 - small T makes the exponent very negative, so $P \rightarrow e^{-\infty} = 0$ (only hill climb)  
    3. Decrease temperature T (keeping T > 0)
        - **Decrease T slowly** to give the algorithm a chance to find the basin of the global optimum
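
The steps above can be sketched as follows. The geometric cooling schedule (`T *= cooling`), the starting temperature, and the best-so-far bookkeeping are assumptions of this sketch; the notes only require that T decreases slowly.

```python
import math
import random


def acceptance_probability(fx, fx_new, T):
    """P(x, x_t, T): always accept improvements, sometimes accept worse points."""
    if fx_new >= fx:
        return 1.0
    return math.exp((fx_new - fx) / T)


def simulated_annealing(f, neighbor, x, T=10.0, cooling=0.99, T_min=1e-3, rng=None):
    rng = rng or random.Random(0)
    best = x  # remember the best point seen (an addition, not part of the notes)
    while T > T_min:
        x_new = neighbor(x, rng)                      # 1. sample a point in N(x)
        if rng.random() < acceptance_probability(f(x), f(x_new), T):
            x = x_new                                 # 2. jump probabilistically
        if f(x) > f(best):
            best = x
        T *= cooling                                  # 3. decrease T slowly
    return best


def f(x):
    # Hypothetical objective: local peak at x = 2, global peak at x = 8.
    return max(4 - (x - 2) ** 2, 9 - (x - 8) ** 2)


def neighbor(x, rng):
    # Random step of +/- 1, clipped to the range 0..10.
    return min(10, max(0, x + rng.choice((-1, 1))))


result = simulated_annealing(f, neighbor, x=0)
```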
            
            
### Properties of Simulated Annealing   
$T \rightarrow 0$: like hill climbing   
$T \rightarrow \infty$: random walk     

Probability of ending at any point x:   
$P_r($ending at x$) = \frac{e^{\frac{f(x)}{T}}}{Z_T}$   

- Most likely to be in places of high fitness
- As T decreases, the distribution concentrates on the points with the largest $f(x)$ and eventually puts all the probability mass on the maxima
- However, we need to decrease T slowly to avoid getting stuck at a local optimum
- This is called the Boltzmann distribution
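
To see the effect of T numerically, the sketch below evaluates the distribution on a small discrete space. The two-peak objective and the helper name `boltzmann` are assumptions for illustration. Subtracting the maximum fitness before exponentiating only avoids overflow and leaves the distribution unchanged.

```python
import math


def boltzmann(f, xs, T):
    """Return Pr(ending at x) = exp(f(x) / T) / Z_T for every x in xs."""
    m = max(f(x) for x in xs)
    weights = [math.exp((f(x) - m) / T) for x in xs]
    Z = sum(weights)
    return [w / Z for w in weights]


def f(x):
    # Hypothetical objective: local peak at x = 2, global peak at x = 8.
    return max(4 - (x - 2) ** 2, 9 - (x - 8) ** 2)


xs = list(range(11))
hot = boltzmann(f, xs, T=100.0)   # nearly uniform: behaves like a random walk
cold = boltzmann(f, xs, T=0.1)    # nearly all mass on the global peak x = 8
```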

## Genetic Algorithms
---
<img src="../images/genetic_algorithms.jpeg" width=500 align="right"/>   

- Population of individuals (input points)
- Mutation: local search N(x)
- Crossover: combine points to hopefully create something better (population holds information)
- Generations: iterations of improvements   



- Genetic Algorithms perform a randomized, parallel, hill-climbing search for the hypothesis that optimizes a predefined fitness function.
- The algorithm operates by iteratively updating a pool of hypotheses, called the population. On each iteration, all members of the population are evaluated according to the fitness function (a predefined numerical measure for the problem). A new population is then generated by probabilistically selecting the most fit individuals from the current population.

## Genetic Algorithm Skeleton
---
- $P_0$ = initial population of size K   
- Repeat until converged:   
    - compute fitness of all $x \in P_t$
    - select 'most fit' individuals (top half, weighted probability)  
    - pair up individuals, replacing "least fit" individuals via crossover/mutation


- More detail:
    - Initialize population: generate an initial population of hypotheses $P_0$ of size K. $P_t$ denotes the population at the $t^{th}$ generation.
    - Evaluate: compute fitness for all $h \in P_t$
    - Create a new generation $P_{t+1}$:
        1. Select “most fit” individuals according to the fitness function:    
        → Truncation Selection: We take the top half of the population in terms of their scores and declare them to be the most fit.   
        → Roulette Wheel Selection: We select individuals at random, but give higher-scoring individuals a higher probability of being selected (the balance between favoring fitness and exploring is similar to the temperature parameter in simulated annealing).   
        $P_r(h_i) = \frac{Fitness(h_i)}{\sum_{h_j \in P_t} Fitness(h_j)}$   
        2. Crossover: Probabilistically select pairs of hypotheses from $P_t$ according to $P_r (h_i)$. For each pair, produce an offspring by applying the crossover operator (copying selected bits from each parent).
        3. Mutate: Choose a percentage of the members of $P_{t+1}$ with uniform probability. For each, invert one randomly selected bit in its representation.
        4. Update $P_t ← P_{t+1}$
        5. Evaluate: Compute fitness for all $h \in P_t$
        6. Repeat until convergence.   
        
        
- Genetic Algorithms are less likely to get stuck in a local optimum because they move abruptly, replacing parents with offspring that might be radically different. Gradient Descent, by contrast, moves smoothly from one hypothesis to another that is very similar.
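
The generation loop described above can be sketched on bit strings. This is a simplified version under stated assumptions: survivors are the top half (truncation selection), parents are drawn from the survivors by roulette wheel, crossover is single-point, and each child is mutated with probability `mutation_rate`. The fitness used in the example (count of 1 bits) and all function names are hypothetical.

```python
import random


def roulette_select(population, fitnesses, k, rng):
    """Pick k individuals with probability proportional to (non-negative) fitness."""
    weights = [fit + 1e-9 for fit in fitnesses]  # avoid an all-zero weight vector
    return rng.choices(population, weights=weights, k=k)


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(h, rate, rng):
    """With probability `rate`, invert one randomly selected bit."""
    if rng.random() < rate:
        i = rng.randrange(len(h))
        h = h[:i] + (1 - h[i],) + h[i + 1:]
    return h


def genetic_algorithm(fitness, n_bits, K=40, generations=60, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_bits)) for _ in range(K)]
    for _ in range(generations):
        # Evaluate, then keep the top half (truncation selection).
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[: K // 2]
        survivor_fitness = [fitness(h) for h in survivors]
        # Replace the least fit half with mutated offspring of the survivors.
        children = []
        while len(survivors) + len(children) < K:
            a, b = roulette_select(survivors, survivor_fitness, 2, rng)
            children.append(mutate(crossover(a, b, rng), mutation_rate, rng))
        population = survivors + children
    return max(population, key=fitness)


best = genetic_algorithm(fitness=sum, n_bits=12)
```

Because the survivors are carried over unchanged, the best fitness in the population never decreases from one generation to the next.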

## MIMIC
---
### Finding Optima by Estimating Probability Densities  
- Directly model the probability distribution  
- Successively Refine Model
    - Convey Structure

### Problems with Randomized Optimization Algorithms:
- There’s no structure or learning. You start with a point and end up with a point that is closer to the global optimum.
    - only points, no structure
- It’s not clear what kind of probability distribution we’re dealing with.
    - unclear probability distribution

### Probability Model
Directly model the probability distribution and successively refine the model, so that the final model captures structure.   

$P^{\theta_t}(x)$ = 
- $\dfrac{1}{z_{\theta_t}}$, if $f(x) \geq \theta_t$
- $0$ otherwise  


This probability is uniform over all values of x whose fitness is at or above some threshold $\theta_t$, and $z_{\theta_t}$ is the normalizing constant.    


$P^{\theta_{min}}(x)$ = uniform over all of $X$   
$P^{\theta_{max}}(x)$ = uniform over the optima 
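
A small worked example may help here. On a finite space, $z_{\theta}$ is simply the number of points with $f(x) \geq \theta$. The two-peak objective and the helper name `threshold_distribution` are assumptions for this sketch.

```python
def threshold_distribution(f, xs, theta):
    """P^theta(x): uniform over the points with f(x) >= theta, zero elsewhere."""
    z = sum(1 for x in xs if f(x) >= theta)  # normalizer z_theta
    return {x: (1.0 / z if f(x) >= theta else 0.0) for x in xs}


def f(x):
    # Hypothetical objective: local peak at x = 2, global peak at x = 8.
    return max(4 - (x - 2) ** 2, 9 - (x - 8) ** 2)


xs = list(range(11))
p_min = threshold_distribution(f, xs, theta=min(f(x) for x in xs))  # uniform over X
p_mid = threshold_distribution(f, xs, theta=5)                      # uniform over 6..10
p_max = threshold_distribution(f, xs, theta=max(f(x) for x in xs))  # all mass on x = 8
```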

### Pseudo Code
---
- Generate samples from probability distribution $P^{\theta_t}(x) \rightarrow$ Generate population
- Set $\theta_{t+1}$ to the $n^{th}$ percentile of the sampled fitness values
- Retain only those samples such that $f(x) \geq \theta_{t+1} \rightarrow$ Retain fittest
- Estimate $P^{\theta_{t+1}}(x) \rightarrow$ Estimate a new distribution
- Repeat


- More detail:
    1. We have some threshold θ.
    2. We sample from a probability distribution that is uniform over all points that have a fitness value ≥ θ.
        - This means every sample has fitness at least as good as $\theta$.
    3. From those samples, keep the points whose fitness is much higher than θ (for example, the highest 50%); the cutoff becomes the new θ.
    4. Keep repeating until θ reaches $\theta_{max}$.
- This approach helps us retain structure from one time step to the next.
- This should work as intended if:
    - We can estimate $P^{\theta_{t+1}}(x)$ given a finite set of data.
    - $P^{\theta_t}(x) \approx P^{\theta_{t+1}}(x)$. That is, because both distributions are relatively close, the samples generated from $P^{\theta_t}(x)$ also serve as samples for estimating the next distribution $P^{\theta_{t+1}}(x)$.
- This will eventually lead to $\theta_{max}$, whose distribution conveys the global optima.
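
The loop above can be sketched on bit strings. One loud assumption: to keep the example short, the distribution is estimated with independent per-bit marginals, which is a much cruder model than the structured estimate MIMIC is meant to use, so this shows only the sample, threshold, retain, and re-estimate cycle. The names `mimic_sketch`, `estimate_marginals`, and `sample_from` are hypothetical.

```python
import random


def estimate_marginals(samples, n_bits, smoothing=1.0):
    """Estimate P(bit_i = 1) from the retained samples (with light smoothing)."""
    n = len(samples)
    return [
        (sum(s[i] for s in samples) + smoothing) / (n + 2 * smoothing)
        for i in range(n_bits)
    ]


def sample_from(marginals, rng):
    """Draw one bit string from the independent-bit model."""
    return tuple(1 if rng.random() < p else 0 for p in marginals)


def mimic_sketch(f, n_bits, n_samples=200, percentile=0.5, iterations=30, seed=0):
    rng = random.Random(seed)
    marginals = [0.5] * n_bits  # start from the uniform distribution
    best = None
    for _ in range(iterations):
        # Generate population from the current model.
        population = [sample_from(marginals, rng) for _ in range(n_samples)]
        # Set theta_{t+1} to the chosen percentile of the sampled fitness values.
        scores = sorted(f(x) for x in population)
        theta = scores[int(percentile * (n_samples - 1))]
        # Retain the fittest samples and estimate the next distribution from them.
        kept = [x for x in population if f(x) >= theta]
        marginals = estimate_marginals(kept, n_bits)
        top = max(population, key=f)
        if best is None or f(top) > f(best):
            best = top
    return best


best = mimic_sketch(f=sum, n_bits=12)
```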

### Estimating Distributions
--- 
