# Genetic Algorithm
## Real Optimization - Statistical Distribution Fitting

This notebook demonstrates using the GA to find the best fitting distribution for a given dataset. The data is simulated from a known distribution with known parameters, so it is possible to assess the accuracy of the GA result.

There are three major perspectives on data in statistics:
- [Frequentist](https://en.wikipedia.org/wiki/Frequentist_inference) - considers observed data to be a random sample from an unknown population generated by a "real" probability distribution
- [Bayesian](https://en.wikipedia.org/wiki/Bayesian_inference) - considers observed data to be "real", which can be represented by a probability distribution
- [Information Theoretic](https://en.wikipedia.org/wiki/Information_theory) - focuses on determining the maximal amount of information in (or that can be gleaned from) some data

The Frequentist perspective underlies the majority of statistical practice, and gives us hypothesis testing and confidence intervals. An exercise commonly performed in statistics - whether Frequentist or Bayesian - is that of determining the probability distribution $f\left(X\vert\theta\right)$ which best fits a dataset $X$, given a vector of parameters $\theta$ (the length of which depends on $f$). Frequentists pick the distribution and its parameters by maximizing the likelihood function $l\left(\theta\vert X\right)$ or, equivalently, the log likelihood $\log l\left(\theta\vert X\right)$:
\begin{align}
l\left(\theta\vert X\right) =& \prod_{i=1}^{n}f\left(x_i\vert\theta\right)\\
\log l\left(\theta\vert X\right) =& \sum_{i=1}^{n}\log\big(f\left(x_i\vert\theta\right)\big)
\end{align}
Note that the log likelihood is the sum of the log probability densities of the observed datapoints, given parameters $\theta$. The [maximum likelihood estimate](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), or MLE $\hat{\theta}$, is the parameter vector under which the observed sample data are most probable. Instead of finding the parameters which maximize the log likelihood, Bayesians will typically summarize the posterior distribution of the parameters, for example with the [highest posterior density interval (or credible interval)](https://en.wikipedia.org/wiki/Credible_interval). MLEs for some statistical distributions, such as the Gaussian, can be found analytically as a function of the observed sample data. For example, the univariate Gaussian distribution and the MLEs of its parameters $\mu$ and $\sigma$ are:
\begin{align}
f\left(x_i\vert\mu,\sigma\right) =& \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^2}\\
\hat{\mu} =& \bar{X} = \frac{1}{n}\sum_{i=1}^n x_i\\
\hat{\sigma}^2 =& \frac{1}{n}\sum_{i=1}^n \left(x_i-\bar{X}\right)^2
\end{align}
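
As a quick numerical check of the closed-form Gaussian estimates, the sketch below simulates a sample and compares the sample mean and the $\frac{1}{n}$ standard deviation against `scipy.stats.norm.fit`. The seed, sample size, and names (`mu_hat`, `sigma_hat`, `log_lik`) are illustrative assumptions and are not part of the GA code used later in this notebook.

```python
import numpy as np
import scipy.stats as stt

# simulate a sample from a known Gaussian (values chosen only for illustration)
rng = np.random.default_rng(42)
x = rng.normal(loc=42.0, scale=1.0, size=100)

# closed-form MLEs: the sample mean and the 1/n (ddof=0) standard deviation
mu_hat = x.mean()
sigma_hat = x.std(ddof=0)

# scipy's fit for the normal distribution should return the same estimates
mu_fit, sigma_fit = stt.norm.fit(x)

def log_lik(mu, sigma):
    """Sum of the log densities of every observation under N(mu, sigma)."""
    return stt.norm.logpdf(x, loc=mu, scale=sigma).sum()

print('closed form: mu=%0.4f, sigma=%0.4f' % (mu_hat, sigma_hat))
print('scipy fit  : mu=%0.4f, sigma=%0.4f' % (mu_fit, sigma_fit))
print('log likelihood at the MLE = %0.4f' % log_lik(mu_hat, sigma_hat))
```

Because the MLE maximizes the log likelihood, `log_lik(mu_hat, sigma_hat)` is at least as large as the value at any other parameter pair, including the true simulation parameters and the $\frac{1}{n-1}$ estimate of the standard deviation.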

For most other probability distributions, the likelihood function must be numerically optimized. If the MLE for a distribution $f$ fit to a dataset $X$ gives the parameters most likely to have generated the observed sample data, then we can pick the distribution $\hat{f}$ most likely to have generated the sample data as the one with the largest log likelihood when each candidate is evaluated at its own MLE:
\begin{equation}
\hat{f} = \underset{j}{\operatorname{argmax}}\big[\log l_j\left(\hat{\theta}_j\vert X\right)\big]\text{, for }\big[j\in\text{set of distributions}\big].
\end{equation}
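
The selection rule can be sketched with SciPy's own numerical fitting standing in for the GA. This is only an illustration under stated assumptions: the candidate set, the simulated Laplace sample, and the names `candidates`, `log_liks`, and `best` are hypothetical, and `fit` here is SciPy's maximum likelihood routine rather than the GA used in this notebook.

```python
import numpy as np
import scipy.stats as stt

# simulated sample (illustrative choice of distribution and parameters)
rng = np.random.default_rng(42)
data = rng.laplace(loc=42.0, scale=1.0, size=500)

# hypothetical candidate set, keyed by the short codes used in this notebook
candidates = {'NRM': stt.norm, 'LPL': stt.laplace, 'CAU': stt.cauchy}

log_liks = {}
for name, dist in candidates.items():
    # maximum likelihood estimate of this candidate's parameters
    theta_hat = dist.fit(data)
    # log likelihood of the data evaluated at that MLE
    log_liks[name] = dist.logpdf(data, *theta_hat).sum()

# the argmax over candidates picks the best fitting distribution
best = max(log_liks, key=log_liks.get)
for name, score in sorted(log_liks.items(), key=lambda kv: kv[1], reverse=True):
    print('%s: log likelihood = %0.2f' % (name, score))
print('selected:', best)
```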

To use the GA to find the MLEs for a distribution with $p$ parameters, each binary word on which the GA operates should be of length $\sum_{i=1}^p q_i$, with the $i^\text{th}$ parameter being encoded in $q_i$ bits. There is no requirement for $q_i = q_j$.
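
One common way to turn such a binary word into real parameter values is to read each $q_i$-bit slice as an unsigned integer and rescale it linearly onto that parameter's search range. The helper below is a minimal sketch of that idea; `decode_word` and its arguments are hypothetical names, and the decoding actually used inside the GA module of this project may differ.

```python
import numpy as np

def decode_word(word, bits, lower, upper):
    """Map a binary word to real parameters, one slice of bits per parameter."""
    word = np.asarray(word, dtype=int)
    params = np.empty(len(bits))
    start = 0
    for i, q in enumerate(bits):
        chunk = word[start:start + q]
        # integer value of the q-bit slice, most significant bit first
        value = int(chunk.dot(1 << np.arange(q - 1, -1, -1)))
        # rescale from [0, 2^q - 1] onto [lower_i, upper_i]
        params[i] = lower[i] + (upper[i] - lower[i]) * value / (2**q - 1)
        start += q
    return params

# two parameters encoded with different bit counts (4 and 8 bits)
bits = [4, 8]
lower = [0.0, 10.0]
upper = [15.0, 20.0]
word = [0, 1, 0, 1] + [1] * 8
print(decode_word(word, bits, lower, upper))
```

With $q_i$ bits, a parameter can take $2^{q_i}$ evenly spaced values in its range, so the bit count sets the resolution of the search for that parameter.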

- <a href=#SD>Simulate Data</a>
- <a href=#PD>Prepare Distributions</a>
- <a href=#GA>Run GA</a>
- <a href=#PR>Plot Results</a>
- <a href=#end>End</a>

<a id=top></a>

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import ipdb
import re
import time
import sys
import scipy.stats as stt
from collections import OrderedDict

import chart_studio.plotly as ply
import chart_studio.tools as plytool
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as plyoff
import plotly.subplots as plysub

# to use plotly offline, need to initialize with a plot
plyoff.init_notebook_mode(connected=True)
init = go.Figure(data=[go.Scatter({'x':[1, 2], 'y':[42, 42], 'mode':'markers'})], layout=go.Layout(title='Init', xaxis={'title':'x'}))
plyoff.iplot(init)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# my imports
sys.path.append('../src/')
from GA.GA import *
from GA.Objective import *
from Utils.Utils import *

### Simulate Data
Simulate a dataset from a specified statistical distribution with a set of known parameters.

<a href=#top>Go to Top</a>
<a id=SD></a>

In [None]:
''' make some data '''
np.random.seed(42)
n = 100

# get the distribution
simDist = input("Select a distribution: 'NRM', 'GAM', 'EXP', 'CHI', 'STU', 'CAU', 'LPL', 'PAR': ").strip().upper()

# simulate
if simDist == 'NRM':
    # Gaussian
    params = [42, 1]
    distObj = stt.norm(loc=params[0], scale=params[1])
    simName = '%s(%d, %d)'%(simDist, params[0], params[1])
elif simDist == 'GAM':
    # Gamma
    params = [42, 2, 6]
    distObj = stt.gamma(loc=params[0], scale=params[1], a=params[2])
    simName = '%s(%d, %d, %d)'%(simDist, params[0], params[1], params[2])
elif simDist == 'LOG':
    # Lognormal
    params = [42, 1]
    distObj = stt.lognorm(s=1, loc=params[0], scale=params[1])
    simName = '%s(%d, %d)'%(simDist, params[0], params[1])
elif simDist == 'EXP':
    # Exponential
    params = [6]
    distObj = stt.expon(scale=params[0])
    simName = '%s(%d)'%(simDist, params[0])
elif simDist == 'CHI':
    # Chi-squared
    params = [6]
    distObj = stt.chi2(df=params[0])
    simName = '%s(%d)'%(simDist, params[0])
elif simDist == 'STU':
    # Student's t
    params = [6]
    distObj = stt.t(df=params[0])
    simName = '%s(%d)'%(simDist, params[0])
elif simDist == 'CAU':
    # Cauchy
    params = [42]
    distObj = stt.cauchy(loc=params[0])
    simName = '%s(%d)'%(simDist, params[0])
elif simDist == 'LPL':
    # Laplace
    params = [42, 1]
    distObj = stt.laplace(loc=params[0], scale=params[1])
    simName = '%s(%d, %d)'%(simDist, params[0], params[1])
elif simDist == 'PAR':
    # Pareto
    params = [6]
    distObj = stt.pareto(b=params[0])
    simName = '%s(%d)'%(simDist, params[0])
else:
    # no selection, so Uniform
    print('Invalid input, so using Uniform!')
    simDist = 'UNI'
    params = [42, 1]
    distObj = stt.uniform(loc=params[0], scale=params[1])
    simName = '%s(%d, %d)'%(simDist, params[0], params[1])
data = distObj.rvs(size=n)
rng = distObj.ppf([0.01, 0.99])

# plot
x = np.linspace(rng[0], rng[1], 100)
_, y = ComputeLikelihood(x, params, simDist)
trcs = [go.Scatter(x=x, y=y, mode='lines', line={'color':'green'}, name=simName),
        go.Histogram(x=data, nbinsx=20, histnorm='probability density', marker={'color':'green', 'opacity':0.75}, name='Sample Data')]
fig = go.Figure(data=trcs, layout=go.Layout(title='Data and Known Distribution'))
plyoff.plot(fig, filename='../output/SampleData_%s.html'%(re.sub('[^0-9A-Za-z_]', '_', simName)), auto_open=True, include_mathjax='cdn')

### Prepare Distributions
Define the set of distributions which will be fit by the GA. For each distribution fit to the simulated dataset, also store:

- the range in which each parameter is expected to be found
- the likelihood of the data evaluated at the lower and upper bounds of the parameter ranges

<a href=#top>Go to Top</a>
<a id=PD></a>

In [None]:
# define distributions of interest
dists = ['NRM', 'GAM', 'EXP', 'CHI', 'STU', 'CAU', 'LPL', 'PAR']
results = OrderedDict.fromkeys(dists, None)

# get bounds on parameters
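for dist in dists:
    # parameter search range for this distribution: rng[0] = lower bounds, rng[1] = upper bounds
    rng = PDFParamRanges(data, dist, scale=3)
    # likelihood of the data at each end of the range
    likLower = ComputeLikelihood(data, rng[0], dist)
    likUpper = ComputeLikelihood(data, rng[1], dist)
    # store as [range, likelihood at lower bounds, likelihood at upper bounds]
    results[dist] = [rng, likLower, likUpper]
    # report
    print('%s = (%0.3f, %0.3f)'%(dist, likLower[0], likUpper[0]))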

### Run GA
<a href=#top>Go to Top</a>
<a id=GA></a>
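
For readers unfamiliar with how a binary GA searches for an MLE, the following is a minimal, self-contained sketch that fits a Gaussian to simulated data. It is not `RunGARealOptim`: the operators (binary tournament selection, single-point crossover, bit-flip mutation, single-individual elitism), the search bounds, and all parameter values are assumptions made for illustration only.

```python
import numpy as np
import scipy.stats as stt

rng = np.random.default_rng(0)
data = rng.normal(loc=42.0, scale=1.0, size=100)

bits = [16, 16]                    # bits per parameter (mu, sigma)
lower = np.array([30.0, 0.1])      # assumed lower search bounds
upper = np.array([50.0, 5.0])      # assumed upper search bounds

def decode(pop):
    """Decode each row of a 0/1 population matrix into (mu, sigma)."""
    out = np.empty((pop.shape[0], len(bits)))
    start = 0
    for i, q in enumerate(bits):
        weights = 2.0 ** np.arange(q - 1, -1, -1)
        ints = pop[:, start:start + q] @ weights
        out[:, i] = lower[i] + (upper[i] - lower[i]) * ints / (2**q - 1)
        start += q
    return out

def fitness(pop):
    """Gaussian log likelihood of the data for every individual."""
    return np.array([stt.norm.logpdf(data, loc=m, scale=s).sum() for m, s in decode(pop)])

pop_size, n_gens, p_xover, p_mutate = 60, 80, 0.8, 0.02
pop = rng.integers(0, 2, size=(pop_size, sum(bits)))
history = []
for gen in range(n_gens):
    scores = fitness(pop)
    elite = pop[np.argmax(scores)].copy()
    history.append(scores.max())
    # binary tournament selection: the fitter of two random individuals becomes a parent
    a, b = rng.integers(0, pop_size, size=(2, pop_size))
    parents = np.where((scores[a] >= scores[b])[:, None], pop[a], pop[b])
    # single-point crossover on consecutive pairs of parents
    children = parents.copy()
    for j in range(0, pop_size - 1, 2):
        if rng.random() < p_xover:
            cut = rng.integers(1, sum(bits))
            children[j, cut:], children[j + 1, cut:] = parents[j + 1, cut:], parents[j, cut:]
    # bit-flip mutation
    flips = rng.random(children.shape) < p_mutate
    children = np.where(flips, 1 - children, children)
    # elitism: carry the best individual forward unchanged
    children[0] = elite
    pop = children

scores = fitness(pop)
best_theta = decode(pop[[np.argmax(scores)]])[0]
best_score = scores.max()
history.append(best_score)
print('GA estimate : mu=%0.4f, sigma=%0.4f, logL=%0.4f' % (best_theta[0], best_theta[1], best_score))
print('closed form : mu=%0.4f, sigma=%0.4f' % (data.mean(), data.std()))
```

Because the best individual is always carried forward, the best score per generation never decreases, and it can never exceed the log likelihood at the closed-form MLE.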

In [None]:
''' prepare GA input parameters '''
# distribution parameters
bitCnt = 16
dist = 'NRM'
lowerB = results[dist][0][0]
upperB = results[dist][0][1]
bits = [bitCnt]*len(lowerB)

# GA parameters
parmsGA = {'initPerc':0.5, 'showTopRes':10, 'populSize':200, 'numGens':100,
           'noChangeTerm':80, 'convgCrit':0.00001, 'elitism':True, 'mateType':2, 'probXover':0.8,
           'probMutate':0.3, 'probEngineer':0.2, 'optimGoal':1, 'plotFlag':True, 'printFreq':10,
           'xoverType':1, 'bits':bits, 'lowerB':lowerB, 'upperB':upperB}

# data parameters
parmsData = {'data':data, 'name':simName}

# objective parameters
parmsObj = {'function':'ComputeLikelihood',
            'arguments':{'data':None, 'params':None, 'dist':dist}}

In [None]:
''' run the GA - hold on to your butts '''
# parameters
randSeed = None  # e.g. 42
verb = False
MSims = 2

# iterate over distributions
for dist in dists:
    print('Executing GA for %s Distribution'%dist)
    # set the bit parameters
    lowerB = results[dist][0][0]
    upperB = results[dist][0][1]
    bits = [bitCnt]*len(lowerB)
    parmsGA['bits'] = bits
    parmsGA['lowerB'] = lowerB
    parmsGA['upperB'] = upperB
    # set the distribution
    parmsObj['arguments']['dist'] = dist

    # init
    bestRess = [None]*MSims
    bestParams = [None]*MSims
    bestScores = [None]*MSims
    genBests = [None]*MSims
    genBestParams = [None]*MSims
    genScores = [None]*MSims
    randSeeds = [None]*MSims
    timeStamps = [None]*MSims
    figGAProgresss = [None]*MSims

    for sim in range(MSims):
        print('Executing GA %d of %d'%(sim+1, MSims))
        bestRess[sim], bestParams[sim], bestScores[sim], genBests[sim],\
            genBestParams[sim], genScores[sim], randSeeds[sim], timeStamps[sim],\
            figGAProgresss[sim] = RunGARealOptim(parmsGA, parmsData, parmsObj, verb, randSeed)

    # get the overall best
    bestIndx = np.argmax(parmsGA['optimGoal']*np.array(bestScores))
    bestScore = bestScores[bestIndx]
    bestParam = bestParams[bestIndx]
    timeStamp = timeStamps[bestIndx]
    
    # append the GA results to this distribution's entry in the results dict (indices 3 onward)
    results[dist].extend([bestScore, bestParam, bestRess, bestParams, bestScores, genBests, genBestParams, genScores, randSeeds, timeStamps, figGAProgresss])

In [None]:
''' order the distributions by best fit '''
# order of scores
indx = np.argsort([parmsGA['optimGoal']*v[3] for v in results.values()])[::-1]
distsOrd = [dists[i] for i in indx]

# show the best results per distribution
print('Distributions in Order of Fit')
for dist in distsOrd:
    print('Distribution %s Score = %0.2f, Parameters = %r'%(dist, results[dist][3], results[dist][4]))

### Plot Results
Plot the distribution used to simulate the data, the simulated data itself, and the top three fitted distributions found by the GA.

<a href=#top>Go to Top</a>
<a id=PR></a>

In [None]:
''' plot the known distribution plus top 3 fit '''
# start with the known distribution
trcs = [go.Scatter(x=x, y=y, mode='lines', line={'color':'green'}, name=simName)]

# add one trace per fitted distribution, best fit first
for fitDist, color in zip(distsOrd[:3], ['blue', 'purple', 'red']):
    fitScore, fitParam = results[fitDist][3], results[fitDist][4]
    # density of the fitted distribution over the plotting grid
    _, yhat = ComputeLikelihood(x, fitParam, fitDist)
    label = '%s(%s) - (%0.4f)'%(fitDist, ','.join(['%0.4f'%val for val in np.atleast_1d(fitParam.squeeze())]), fitScore)
    trcs.append(go.Scatter(x=x, y=yhat, mode='lines', line={'color':color}, name=label))

# finish with the histogram of the sample data
trcs.append(go.Histogram(x=data, nbinsx=20, histnorm='probability density', marker={'color':'green', 'opacity':0.75}, name='Sample Data'))

fig = go.Figure(data=trcs, layout=go.Layout(title='Known and Top 3 Fit Distributions'))
plyoff.plot(fig, filename='../output/GADistFitResult_%s_%s.html'%(timeStamp, re.sub('[^0-9A-Za-z_]', '_', simName)), auto_open=True, include_mathjax='cdn')

### End

<a href=#top>Go to Top</a>
<a id=end></a>