### Genetic Algorithm 
- Uses the UCI Energy Efficiency dataset: https://archive.ics.uci.edu/ml/datasets/energy+efficiency#


We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ in glazing area, glazing area distribution, and orientation, amongst other parameters. Simulating various settings as functions of these characteristics yields 768 building shapes. The dataset therefore comprises 768 samples and 8 features, and the aim is to predict two real-valued responses. It can also be treated as a multi-class classification problem if each response is rounded to the nearest integer.


Attribute Information:

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically: 
- X1	Relative Compactness 
- X2	Surface Area 
- X3	Wall Area 
- X4	Roof Area 
- X5	Overall Height 
- X6	Orientation 
- X7	Glazing Area 
- X8	Glazing Area Distribution 
- y1	Heating Load 
- y2	Cooling Load



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.model_selection import train_test_split
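try:
    from sklearn import cross_validation  # scikit-learn < 0.20 only
except ImportError:
    cross_validation = None  # module removed in 0.20; sklearn.model_selection.KFold replaces it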
from sklearn import preprocessing
from sklearn import svm



### Get the dataset
- df.sample(frac=1) shuffles the data 

In [2]:
df = pd.read_csv("energy_efficiency.csv")
df = df.sample(frac=1)
rows = len(df)
cols = len(df.keys())
print(rows, cols)
df.head()

768 10


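|     | X1   | X2    | X3    | X4    | X5  | X6 | X7   | X8 | Y1    | Y2    |
|-----|------|-------|-------|-------|-----|----|------|----|-------|-------|
| 436 | 0.9  | 563.5 | 318.5 | 122.5 | 7.0 | 2  | 0.25 | 4  | 32.67 | 32.12 |
| 584 | 0.86 | 588.0 | 294.0 | 147.0 | 7.0 | 2  | 0.4  | 2  | 32.48 | 35.48 |
| 28  | 0.71 | 710.5 | 269.5 | 220.5 | 3.5 | 2  | 0.0  | 0  | 6.37  | 11.27 |
| 408 | 0.74 | 686.0 | 245.0 | 220.5 | 3.5 | 2  | 0.25 | 3  | 11.8  | 14.49 |
| 484 | 0.9  | 563.5 | 318.5 | 122.5 | 7.0 | 2  | 0.25 | 5  | 33.13 | 32.25 |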


### X6 and X8 are categorical variables

In [3]:
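# one-hot encode the categorical columns X6 and X8; dtype=float keeps the array numeric on newer pandas
X = pd.get_dummies(df.iloc[:,0:(cols-2)], columns=['X6', 'X8'], dtype=float).values
# target: Y1 (heating load) only, kept as a column vector
y = df.iloc[:, (cols-2):(cols-1)].values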
print(X[0])
print(y[0])

[9.000e-01 5.635e+02 3.185e+02 1.225e+02 7.000e+00 2.500e-01 1.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 1.000e+00 0.000e+00]
[32.67]
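
To illustrate what the one-hot step does, here is a minimal pure-Python sketch on toy values. The helper name `one_hot` and the sample values are assumptions made for illustration; they are not read from the CSV. In the output above, the 16 columns appear to be the 6 numeric features followed by 4 indicator columns for X6 and 6 for X8.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list, one slot per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

orientation = [2, 3, 4, 5, 2]  # toy X6-like values, not taken from the dataset
categories, rows = one_hot(orientation)
print(categories)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the position of that sample's category.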


### Scale the data with minmax scaler

In [4]:
mms = preprocessing.MinMaxScaler()
X = mms.fit_transform(X)
print(X[0])

[0.77777778 0.16666667 0.42857143 0.11111111 1.         0.625
 1.         0.         0.         0.         0.         0.
 0.         0.         1.         0.        ]
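
For reference, min-max scaling maps each column to [0, 1] using (x - min) / (max - min). The sketch below applies that formula to a toy column in plain Python. The helper name `min_max_scale` and the sample values are assumptions for illustration; the notebook itself relies on `MinMaxScaler`.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]
    return [(v - lo) / span for v in column]

toy_column = [0.62, 0.90, 0.98]  # illustrative values, not read from the CSV
scaled = min_max_scale(toy_column)
print([round(v, 4) for v in scaled])
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.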


### Set genetic algorithm parameters
- M: number of generations (gen in the code)
- N: population size (pop in the code)
- Pc: probability of crossover
- Pm: probability of mutation
- l: chromosome (bit string) length
- k: number of tournament selection contestants

### **Since the GA is tuning an SVM, the parameters to search over are c and gamma**

#### SVM hyperparameters
- kernel: poly, rbf, or linear
- degree: degree of the polynomial (poly kernel only)
- c: controls the tradeoff between low training error and a large margin (a small weight norm); larger c penalizes errors more heavily
- gamma: width of the Gaussian (RBF) kernel; larger gamma means each training sample influences only a smaller neighborhood
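
To make the role of gamma concrete: the RBF kernel is K(x, x') = exp(-gamma * ||x - x'||^2), so a larger gamma makes the similarity between two points decay faster with distance. The sketch below evaluates that formula in plain Python. The helper `rbf_similarity` is an illustrative name, not a scikit-learn function, and the gamma values are only examples.

```python
import math

def rbf_similarity(a, b, gamma):
    """RBF kernel value: exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

a, b = [0.0, 0.0], [1.0, 1.0]  # two toy points with squared distance 2
for gamma in (0.05, 0.5, 0.99):
    print(gamma, round(rbf_similarity(a, b, gamma), 4))
```

The printed similarity shrinks as gamma grows, which is why large gamma values tend to produce more local, more flexible fits.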

In [5]:
#probability of crossover
Pc = .95 #these do not need to sum to 1
#probability of mutation
Pm = .08
#size of population
pop = 100
#number of generations
gen = 400
l = 24 # chromosome length in bits: 12 for c, 12 for gamma

In [6]:
# the chromosome
xy = np.random.choice([0,1], size=(l,), p=[.5, .5])
print(xy)

[1 0 0 1 1 0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 1 1 1 0]


### (1) Create a random population of chromosomes (potential solutions)

In [7]:
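population = np.empty((0, len(xy)))
for i in range(pop): # one chromosome per population member (pop, not gen)
    random.shuffle(xy) # in-place shuffle: every chromosome keeps the same number of 1 bits
    population = np.vstack((population, xy))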

In [8]:
print(population[0:5])

[[1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1.]
 [1. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.]
 [0. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 0. 1.]
 [1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 0. 0. 1. 1. 0. 1. 1.]]


### (2) Calculate precision of the chromosomes
- Range: (a, b)
- Length: l bits
- Precision: (b - a) / (2^l - 1)
- Encode: literal base-2 encoding
- Decode: sum(bit_i * 2^i) * precision + a
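
As a worked example of this mapping, the sketch below decodes a 12-bit string into the range (10, 1000). The function name `decode_bits` and the convention that the last bit is the least significant are assumptions chosen for illustration.

```python
def decode_bits(bits, a, b):
    """Map a list of 0/1 bits to a real value in [a, b].

    The last bit in the list is treated as the least significant one.
    """
    l = len(bits)
    precision = (b - a) / (2 ** l - 1)
    integer = sum(bit * 2 ** i for i, bit in enumerate(reversed(bits)))
    return integer * precision + a

print(decode_bits([0] * 12, 10, 1000))        # all zeros: the lower bound
print(decode_bits([1] * 12, 10, 1000))        # all ones: the upper bound (up to rounding)
print(decode_bits([0] * 11 + [1], 10, 1000))  # one step of size `precision` above the lower bound
```

With 12 bits there are 4096 representable values, so the step size over (10, 1000) is 990 / 4095, roughly 0.24.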

In [9]:
# parameter c, see precision formula
ac = 10
bc = 1000
lc = len(xy)//2 # number of bits for c (integer division keeps it an int)
pc = (bc - ac)/((2**lc)-1)

In [10]:
# parameter gamma, see precision formula
ag = .05
bg = .99
lg = len(xy)//2 # number of bits for gamma (integer division keeps it an int)
pg = (bg - ag)/((2**lg)-1)

### Decode the chromosomes
- returns a real value in the specified range for the parameter

In [11]:
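def decode(xy, index, precision, lowerBound):
    """Decodes half of a chromosome into a real value.

    Reads len(xy)//2 bits walking backwards from index;
    the bit at index is the least significant.
    """
    power = 0 #binary powers
    total = 0 #running integer value (avoids shadowing the built-in sum)
    for i in range(len(xy)//2):
        total += xy[index]*(2**power)
        index -= 1
        power += 1
    return (total * precision) + lowerBound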

In [12]:
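#index of the least significant bit of the c parameter from SVM
cIndex = -1 # c is read backwards from the end of the chromosome (last 12 bits)
#index of the least significant bit of the gamma parameter from SVM
gIndex = (l//2) - 1 # gamma is read backwards from the middle (first 12 bits); (l//2) + 1 would overlap the c bits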
c1 = decode(xy, cIndex, pc, ac)
g1 = decode(xy, gIndex, pg, ag)
print("c bounds: ({},{})\tcvalue: {}".format(ac,bc,c1))
print("g bounds: ({},{})\tgvalue: {}".format(ag,bg,g1))


c bounds: (10,1000)	cvalue: 544.5274725274726
g bounds: (0.05,0.99)	gvalue: 0.9860976800976801


### Perform the algorithm

- keep track of children and mutations 
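
Parent selection here follows a tournament scheme: draw a few random candidates and keep the one with the lowest error. A minimal sketch with a stand-in error function follows. The names `tournament_select` and `toy_error` are hypothetical, and the toy error simply counts zero bits, whereas the notebook scores candidates with cross-validated SVR.

```python
import random

def tournament_select(population, error_fn, k=3, rng=random):
    """Pick k distinct candidates at random and return the one with the lowest error."""
    candidates = rng.sample(population, k)
    return min(candidates, key=error_fn)

def toy_error(chromosome):
    """Stand-in error: number of zero bits, so an all-ones chromosome is best."""
    return chromosome.count(0)

rng = random.Random(0)  # seeded only to make the sketch repeatable
population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(20)]
parent = tournament_select(population, toy_error, k=3, rng=rng)
print(parent, toy_error(parent))
```

A larger k raises selection pressure: with k equal to the population size, the tournament always returns the best chromosome.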

In [30]:
def parentSelection(population, l):
    """
    Returns two parents, each chosen by tournament selection,
    from a population of chromosomes of length l.
    """
    # sklearn.cross_validation was removed in scikit-learn 0.20
    from sklearn.model_selection import KFold

    parents = np.zeros((0, l))
    for _ in range(2):
        #draw 3 random candidates for the tournament
        candidates = population[np.random.choice(len(population), 3, replace=False)]

        #best candidate so far and its mean error
        selection = None
        score = float("inf")

        for choice in candidates:
            #c and gamma values for SVR
            c = decode(choice, cIndex, pc, ac)
            g = decode(choice, gIndex, pg, ag)

            total = 0
            kf = KFold(n_splits=num_folds)

            for train_index, test_index in kf.split(X):
                X_train, X_test = X[train_index], X[test_index]
                y_train, y_test = y[train_index], y[test_index]

                model = svm.SVR(kernel='rbf', C=c, gamma=g)
                model.fit(X_train, y_train.ravel())
                #SVR.score returns R^2, so 1 - R^2 serves as the error
                r2 = model.score(X_test, y_test.ravel())
                total += 1 - r2
            #mean error over the folds
            total = total/num_folds
            print(total)
            if total < score:
                score = total
                selection = choice

        print(selection, score)
        parents = np.vstack((parents, selection))
    return parents

def crossover(parents):
    """Two-point crossover: with probability Pc, swap a random slice between the two parents."""
    children = np.copy(parents)

    #perform crossover where Pc is the probability that crossover takes place
    if np.random.rand() < Pc:
        #two distinct random cut points, in ascending order
        start, end = np.sort(np.random.choice(l, 2, replace=False))
        #print(start, end)

        children[0][start:end] = parents[1][start:end]
        children[1][start:end] = parents[0][start:end]
        return children

    else:
        print("no crossover")
        return parents
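
The parameter Pm is defined earlier, but no mutation step appears in this section. One common choice is bit-flip mutation, where each bit is flipped independently with probability Pm. The sketch below is an assumption about how such a step could look, not the author's code; `mutate` is a hypothetical name and it works on plain Python lists rather than NumPy rows.

```python
import random

def mutate(chromosome, pm, rng=random):
    """Return a copy of the chromosome with each bit flipped with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]

rng = random.Random(1)  # seeded only to make the sketch repeatable
child = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1]
mutated = mutate(child, pm=0.08, rng=rng)
print(child)
print(mutated)
```

With a small pm such as 0.08, most children pass through unchanged or with a single flipped bit, which keeps the search from stalling without destroying good solutions.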


In [14]:
bestC = []
worstC = []
bestG = []
worstG = []

ofg = np.empty((0, len(xy)+2))
ofgf = []

mgm1 = np.empty((0, len(xy)+1))
mgm2 = np.empty((0, len(xy)+1))

mgm11 = np.empty((0, len(xy)+2))
mgm22 = np.empty((0, len(xy)+2))

mgm111 = np.empty((0, len(xy)+2))
mgm222 = np.empty((0, len(xy)+2))

In [31]:
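#index of the least significant bit of the c parameter from SVM
cIndex = -1 # c is read backwards from the end of the chromosome (last 12 bits)
#index of the least significant bit of the gamma parameter from SVM
gIndex = (l//2) - 1 # gamma is read backwards from the middle (first 12 bits); (l//2) + 1 would overlap the c bits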

for i in range(1):
    newPopulation = np.empty((0, len(xy)))
    
    agc1 = np.empty((0, len(xy)+1))
    acg2 = np.empty((0, len(xy)+1))
    
    mgc1 = []
    mgc2 = []
    
    bestgc = np.empty((0, len(xy)+1))
    fbgc = []
    fwgc = []
    
    num_folds = 3
    
    for j in range(1): #pop//2
        #dads = population[np.random.choice(len(population), 3, replace=False)]
        parents = parentSelection(population, l)
        children = crossover(parents)
        #print(parents)
        #print(children)

        
        
        

        

0.0883016160415148
0.29023636380941015
0.11917401133097223
[1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1.] 0.0883016160415148
0.2945123660280436
0.07076947019779456
0.11999671188156058
[0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1.] 0.07076947019779456
[10 13]
[[1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1.]
 [0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1.]]
[[1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 0. 1.]
 [0. 1. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1.]]
