# Genetic Algorithm
## Feature selection

This notebook demonstrates feature selection with the genetic algorithm (GA) on a given dataset, with the goal of minimizing the error of a machine learning model. The data are simulated from a known dependence structure, so the accuracy of the GA result can be assessed.

Statistical modelers have been trying models on subsets of features for almost as long as statistical modeling (most of what we call "machine learning" is actually statistical modeling) has been around. Perhaps unimaginatively, we call the process of selecting a subset of the available features [feature selection](https://en.wikipedia.org/wiki/Feature_selection). In feature selection, we use some procedure to generate subsets of the existing features, fit a model to each subset, and evaluate those models to find an optimal subset. The goal of feature selection is usually to balance two considerations: model performance and model complexity. It is generally beneficial for a model to be simpler - to use fewer features, for example - so we often prefer a simpler model even if it performs slightly worse than a more complex one. This follows the principle of [Occam's razor](https://en.wikipedia.org/wiki/Occam%27s_razor).

A simple way to perform feature selection that is guaranteed to find the optimal subset of features is combinatorial enumeration - a.k.a. brute force. Combinatorial enumeration does exactly what it sounds like: the model is evaluated on every possible combination of features. This is no mean feat, as the number of ways to combine $p$ features is exponential in $p$; there are $2^p - 1$ possible non-empty subsets. The GA is a useful alternative that searches this space without enumerating it; for $p$ features, each individual is a $p$-length binary string indicating whether each feature is in that solution (1) or out of it (0). If $p=8$, for example, one solution may be $10011001$; in this case, features 1, 4, 5, and 8 will be used, while 2, 3, 6, and 7 will not.
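
The encoding described above can be sketched in a few lines. This is an illustration only, not code from the GA package: `decode_subset` and the 1-based feature names are hypothetical, chosen to match the $p=8$ example. The sketch also counts the non-empty bit strings of length $p$ directly, to show how quickly brute force grows.

```python
from itertools import product


def decode_subset(bits, feature_names):
    """Return the names of the features whose bit is '1' (hypothetical helper)."""
    if len(bits) != len(feature_names):
        raise ValueError("bit string and feature list must have the same length")
    return [name for bit, name in zip(bits, feature_names) if bit == "1"]


# features are numbered from 1 here only to match the example in the text
names = ["X%d" % i for i in range(1, 9)]
selected = decode_subset("10011001", names)
print(selected)  # ['X1', 'X4', 'X5', 'X8']

# brute force would have to visit every non-empty bit string of length p
p = 8
n_subsets = sum(1 for bits in product("01", repeat=p) if "1" in bits)
print(n_subsets, 2**p - 1)  # 255 255
```

With the $p=20$ features simulated later in this notebook, the same count is $2^{20} - 1 = 1{,}048{,}575$ model fits.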

- <a href=#SD>Simulate Data</a>
- <a href=#GA>Run GA</a>
- <a href=#PR>Plot Results</a>
- <a href=#end>End</a>

<a id=top></a>

In [None]:
import numpy as np
import pandas as pd
import datetime as dt
import ipdb
import time
import re  # used below to sanitize output file names
import sys

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, Lasso, Ridge

import chart_studio.plotly as ply
import chart_studio.tools as plytool
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as plyoff
import plotly.subplots as plysub

# to use plotly offline, need to initialize with a plot
plyoff.init_notebook_mode(connected=True)
init = go.Figure(data=[go.Scatter({'x':[1, 2], 'y':[42, 42], 'mode':'markers'})], layout=go.Layout(title='Init', xaxis={'title':'x'}))
plyoff.iplot(init)

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
# my imports
sys.path.append('../src/')
from GA.GA import *
from GA.Objective import *
from Utils.Utils import *

### Simulate Data
Simulate a dataset such that there is a known relationship between (some of) the features and a target variable.
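
Read off the parameters in the cell below, the data-generating process can be written as follows. The symbols mirror the code's variable names (`gamma`, `mu`, `sigma`, `n`, `p`); the index notation $X_{ij}$ is added here for illustration.

$$
X_{ij} \sim \mathrm{Uniform}(0, \gamma), \qquad
y_i = 8X_{i0} - X_{i1} + 4X_{i2} + \varepsilon_i, \qquad
\varepsilon_i \sim \mathcal{N}(\mu, \sigma^2),
$$

for $i = 1, \dots, n$ and $j = 0, \dots, p-1$, with $n=100$, $p=20$, $\gamma=5$, $\mu=0$, and $\sigma=0.1$. Note that the cell draws the noise $\varepsilon_i$ but leaves the `+ noise` term commented out, so as written the target is an exact linear function of the first three features.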

<a href=#top>Go to Top</a>
<a id=SD></a>

In [None]:
''' generate some data '''
# data generating process parameters
p = 20
n = 100
gamma = 5
mu = 0
sigma = 0.1

# real features
simSubs = np.zeros(shape=(p,), dtype=int)
simSubs[:3] = 1
B = np.zeros(shape=(p,), dtype=float)
B[simSubs==1] = [8, -1, 4]

# generate the features & target
np.random.seed(42)
X = np.random.rand(n, p)*gamma
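noise = np.random.normal(loc=mu, scale=sigma, size=n)
# the noise term is left commented out, so the target is an exact linear function of the true features
y = np.sum(X[:,simSubs==1]*B[B != 0], axis=1)# + noise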
simName = '+'.join(['%dX%d'%(B[i], i) for (i, f) in enumerate(simSubs) if f])

# create the dataframe
feats = ['X%d'%i for i in range(p)]
data = pd.DataFrame(data=y, columns=['target'])
data[feats] = X

# talk
display(data.head())

In [None]:
# review the feature correlations
figCorr = correlationsPlot(data.corr(), plotTitl='Feature Correlations Plot',
                           trcLims=(0.0, 0.75, 0.9, 0.95, 1.0), tweaks=(20, None, None, 1.05))
plyoff.plot(figCorr, filename='../output/Correlations_%s.html'%(re.sub('[^0-9A-Za-z_]', '_', simName)), auto_open=True, include_mathjax='cdn')

### Run GA
<a href=#top>Go to Top</a>
<a id=GA></a>
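
The cells below configure the GA with an objective named `RegressionMetric` and `metric='RMSE'`; with `optimGoal` set to -1, the GA appears to be minimizing that score. The objective's implementation lives in the GA package and is not shown in this notebook. As a rough idea of what scoring a candidate subset involves, here is a minimal sketch under stated assumptions: `subset_rmse` is a hypothetical helper, it uses NumPy least squares in place of `LinearRegression(fit_intercept=False)`, and it scores in-sample RMSE on a small made-up dataset.

```python
import numpy as np


def subset_rmse(X, y, subset):
    """In-sample RMSE of a no-intercept least-squares fit on the selected columns.

    Hypothetical stand-in for the notebook's objective; not the GA package's code.
    """
    mask = np.asarray(subset, dtype=bool)
    if not mask.any():
        return float("inf")  # an empty subset cannot be fit; treat it as the worst case
    coefs, *_ = np.linalg.lstsq(X[:, mask], y, rcond=None)
    resid = y - X[:, mask] @ coefs
    return float(np.sqrt(np.mean(resid**2)))


# small made-up dataset: the target depends only on columns 0 and 2
rng = np.random.default_rng(0)
X_demo = rng.random((50, 4))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 2]

print(subset_rmse(X_demo, y_demo, [1, 0, 1, 0]))  # the generating features
print(subset_rmse(X_demo, y_demo, [0, 1, 0, 1]))  # the irrelevant features
```

The subset containing the generating features should score essentially zero, while the irrelevant subset scores noticeably worse. That gap is the signal a minimizing GA follows.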

In [None]:
''' prepare GA input parameters '''
# GA parameters
parmsGA = {'initPerc':0.5, 'forceVars':None, 'showTopSubs':10, 'populSize':200, 'numGens':100,
           'noChangeTerm':80, 'convgCrit':0.00001, 'elitism':True, 'mateType':2, 'probXover':0.8,
           'probMutate':0.3, 'probEngineer':0.2, 'optimGoal':-1, 'plotFlag':True, 'printFreq':10,
           'xoverType':1}
# data parameters
parmsData = {'data':data, 'name':simName}
# objective parameters
#estim = DecisionTreeRegressor()
estim = LinearRegression(fit_intercept=False)
#estim = Lasso()
#estim = Ridge()
parmsObj = {'function':'RegressionMetric',
            'arguments':{'data':None, 'subset':None, 'metric':'RMSE', 'estim':estim, 'optimGoal':parmsGA['optimGoal']}}

In [None]:
''' run the GA - hold on to your butts '''
# parameters
randSeed = None#42
verb = False
MSims = 1

# init
bestSubss = [None]*MSims
bestScores = [None]*MSims
genBestss = [None]*MSims
genScoress = [None]*MSims
randSeeds = [None]*MSims
timeStamps = [None]*MSims
figGAProgresss = [None]*MSims
seedSubs = []

for sim in range(MSims):
    print('Executing GA %d of %d'%(sim+1, MSims))
    bestSubss[sim], bestScores[sim], genBestss[sim], genScoress[sim],\
    randSeeds[sim], timeStamps[sim], figGAProgresss[sim] = RunGASubset(parmsGA, parmsData, parmsObj, seedSubs, verb, randSeed)
    # add the best subset to seed the next GA run, if new
    try:
        seedSubs.index(bestSubss[sim])
    except ValueError:
        # this best is new, so add
        seedSubs.append(bestSubss[sim])

# get the overall best
bestIndx = np.argmax(parmsGA['optimGoal']*np.array(bestScores))
bestScore = bestScores[bestIndx]
bestSubs = bestSubss[bestIndx]
timeStamp = timeStamps[bestIndx]

### Plot Results
Generate residuals-based diagnostic plots for three models, using the following feature sets:

- best subset found by the GA
- subset with all features
- subset used to generate the response

<a href=#top>Go to Top</a>
<a id=PR></a>
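
Because the generating subset is known, the GA result can also be checked directly against it, alongside the plots. The sketch below is one way to do that. `compare_subsets` is a hypothetical helper, not part of the GA package, and it assumes both subsets are 0/1 sequences of the same length.

```python
import numpy as np


def compare_subsets(found, truth):
    """Count agreements and disagreements between two 0/1 subset vectors."""
    found = np.asarray(found).astype(bool).ravel()
    truth = np.asarray(truth).astype(bool).ravel()
    if found.shape != truth.shape:
        raise ValueError("subsets must have the same length")
    return {
        "hits": int(np.sum(found & truth)),               # true features that were selected
        "false_inclusions": int(np.sum(found & ~truth)),  # selected but not in the true model
        "misses": int(np.sum(~found & truth)),            # true features that were left out
        "hamming": int(np.sum(found != truth)),           # total number of differing bits
    }


# made-up example: the true model uses the first three of six features
truth_demo = [1, 1, 1, 0, 0, 0]
found_demo = [1, 0, 1, 0, 1, 0]
result = compare_subsets(found_demo, truth_demo)
print(result)
# {'hits': 2, 'false_inclusions': 1, 'misses': 1, 'hamming': 2}
```

In this notebook the call would be `compare_subsets(bestSubs, simSubs)`, assuming `bestSubs` is a 0/1 sequence of length `p`, as the `zip(bestSubs, feats)` line below suggests. A Hamming distance of 0 means the GA recovered the generating subset exactly.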

In [None]:
# set some objective stuff for the plots
parmsObj['arguments']['data'] = data.copy()
objStr = '%s(%s)'%(parmsObj['function'], ', '.join(['%s=%r'%(key, val) for (key, val) in parmsObj['arguments'].items()\
        if key not in ['data', 'subset']]))
objStr = re.sub('[^0-9A-Za-z_]', '_', objStr)

In [None]:
''' evaluate the best subset '''
# subset name
name = BinaryStr(bestSubs)

# show the selected columns
keep = [f for b, f in zip(bestSubs, feats) if b]
print('Best Subset Columns: %r'%keep)

# get the predictions & model
parmsObj['arguments']['subset'] = bestSubs
_, preds, estim = globals()[parmsObj['function']](**parmsObj['arguments'])

# add the subset results & compute error
data[name] = preds
data['G_error'] = data['target'] - data[name]

# talk
display(data.head())

# plot
figGAPerformance = ResultsPlots(data, sequenceCol=None, responseCol='target',
                                predCol=name, resdCol='G_error', colorCol=None,
                                overall_title='GA Performance: %s = %0.4f'%(name, bestScore), plot_colors=('blue',)*4)
plyoff.plot(figGAPerformance, filename='../output/GAPerformance_%s_%s_%s.html'\
            %(timeStamp, re.sub('[^0-9A-Za-z_]', '_', simName), objStr), auto_open=True, include_mathjax='cdn')

In [None]:
''' evaluate the full subset '''
# subset
fullSubs = np.ones(shape=(p,1))
name = BinaryStr(fullSubs)

# get the predictions & model
parmsObj['arguments']['subset'] = fullSubs.squeeze()
fullScore, preds, estim = globals()[parmsObj['function']](**parmsObj['arguments'])

# add the subset results & compute error
data['full'] = preds
data['F_error'] = data['target'] - data['full']

# talk
display(data.head())

# plot
figFull = ResultsPlots(data, sequenceCol=None, responseCol='target', predCol='full',
                       resdCol='F_error', colorCol=None, overall_title='Full Model = %0.4f'%fullScore,
                       plot_colors=('red',)*4)
plyoff.plot(figFull, filename='../output/FullModel_%s_%s.html'\
            %(re.sub('[^0-9A-Za-z_]', '_', simName), objStr), auto_open=True, include_mathjax='cdn')

In [None]:
''' evaluate the true subset '''
# subset
name = BinaryStr(simSubs)

# get the predictions & model
parmsObj['arguments']['subset'] = simSubs.squeeze()
simScore, preds, estim = globals()[parmsObj['function']](**parmsObj['arguments'])

# add the subset results & compute error
data['True'] = preds
data['T_error'] = data['target'] - data['True']

# talk
display(data.head())

# plot
figTrue = ResultsPlots(data, sequenceCol=None, responseCol='target', predCol='True',
                       resdCol='T_error', colorCol=None, overall_title='True Model = %0.4f'%simScore,
                       plot_colors=('green',)*4)
plyoff.plot(figTrue, filename='../output/TrueModel_%s_%s.html'\
            %(re.sub('[^0-9A-Za-z_]', '_', simName), objStr), auto_open=True, include_mathjax='cdn')

### End

<a href=#top>Go to Top</a>
<a id=end></a>