### Genetic Algorithm 
- Uses the UCI Energy Efficiency dataset: https://archive.ics.uci.edu/ml/datasets/energy+efficiency#


We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 768 building shapes. The dataset comprises 768 samples and 8 features, and the aim is to predict two real-valued responses. It can also be treated as a multi-class classification problem if the response is rounded to the nearest integer.


Attribute Information:

The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses. 

Specifically: 
- X1	Relative Compactness 
- X2	Surface Area 
- X3	Wall Area 
- X4	Roof Area 
- X5	Overall Height 
- X6	Orientation 
- X7	Glazing Area 
- X8	Glazing Area Distribution 
- y1	Heating Load 
- y2	Cooling Load



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold  # replaces sklearn.cross_validation, removed in scikit-learn 0.20
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn import svm



### Get the dataset
- df.sample(frac=1) shuffles the data 

In [2]:
df = pd.read_csv("energy_efficiency.csv")
df = df.sample(frac=1)
rows = len(df)
cols = len(df.keys())
print(rows, cols)
df.head()

768 10


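|     | X1   | X2    | X3    | X4     | X5  | X6 | X7   | X8 | Y1    | Y2    |
|-----|------|-------|-------|--------|-----|----|------|----|-------|-------|
| 608 | 0.69 | 735.0 | 294.0 | 220.5  | 3.5 | 2  | 0.4  | 2  | 14.75 | 16.44 |
| 360 | 0.74 | 686.0 | 245.0 | 220.5  | 3.5 | 2  | 0.25 | 2  | 12.35 | 14.73 |
| 723 | 0.98 | 514.5 | 294.0 | 110.25 | 7.0 | 5  | 0.4  | 5  | 32.73 | 34.01 |
| 253 | 0.82 | 612.5 | 318.5 | 147.0  | 7.0 | 3  | 0.1  | 5  | 23.89 | 24.77 |
| 176 | 0.69 | 735.0 | 294.0 | 220.5  | 3.5 | 2  | 0.1  | 3  | 11.22 | 14.44 |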


### X6 and X8 are categorical variables

In [3]:
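# dtype=float keeps the one-hot columns numeric (newer pandas versions default to bool)
X = pd.get_dummies(df.iloc[:,0:(cols-2)], columns=['X6', 'X8'], dtype=float).values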
y = df.iloc[:, (cols-2):(cols-1)].values
print(X[0])
print(y[0])

[6.900e-01 7.350e+02 2.940e+02 2.205e+02 3.500e+00 4.000e-01 1.000e+00
 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
 0.000e+00 0.000e+00]
[14.75]


### Scale the data with minmax scaler

In [4]:
mms = preprocessing.MinMaxScaler()
X = mms.fit_transform(X)
print(X[0])

[0.19444444 0.75       0.28571429 1.         0.         1.
 1.         0.         0.         0.         0.         0.
 1.         0.         0.         0.        ]
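
As a rough illustration of what the min-max scaler does to one column, the sketch below applies x' = (x - min) / (max - min) by hand. The helper name `min_max_scale` and the four sample values are hypothetical; the values assume that the X1 column ranges from 0.62 to 0.98, which is consistent with 0.69 being scaled to about 0.194 in the output above.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical X1 (relative compactness) values, assumed to span the column's range.
compactness = [0.62, 0.69, 0.74, 0.98]
scaled = min_max_scale(compactness)
print(scaled)
```

The smallest value maps to 0, the largest to 1, and everything else falls in between, which keeps features with large raw magnitudes (such as surface area) from dominating the kernel distance.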


### Set genetic algorithm parameters
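- M: number of generations (`gen` in the code)
- N: population size (`pop` in the code)
- Pc: probability of crossover
- Pm: probability of mutation
- l: chromosome (bit string) length
- k: number of contestants in tournament selection (fixed at 3 in `parentSelection`)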

### **Since the GA is tuning an SVM, the parameters to optimize are C and gamma**

#### SVM hyperparameters
- kernel: poly, rbf, or linear
- degree: degree of the polynomial (poly kernel only)
- C: controls the tradeoff between low training error and a wide margin (a small norm of the weight vector)
- gamma: width of the Gaussian (RBF) kernel; controls how far the influence of a single training sample reaches
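
To make the role of gamma concrete, here is a minimal sketch of the RBF kernel, k(a, b) = exp(-gamma * ||a - b||^2), evaluated for a few gamma values inside the search range used later in this notebook. The function name `rbf_kernel` and the two sample points are illustrative assumptions, not part of the notebook.

```python
import math


def rbf_kernel(a, b, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)


# Two hypothetical scaled feature vectors.
a = [0.0, 0.0]
b = [1.0, 1.0]

for gamma in (0.05, 0.5, 0.99):
    # Larger gamma makes the similarity fall off faster with distance.
    print(gamma, rbf_kernel(a, b, gamma))
```

A small gamma gives each training sample a broad influence (smoother fit), while a large gamma makes the influence local, which can overfit.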

In [83]:
#probability of crossover
Pc = .95 #these do not need to sum to 1
#probability of mutation
Pm = .12
#size of population
pop = 6
#number of generations
gen = 3
l = 24 # chromosome length: 12 bits for c and 12 bits for gamma (c is decoded from the last 12 bits, see cIndex)

In [6]:
# the chromosome
xy = np.random.choice([0,1], size=(l,), p=[.5, .5])
print(xy)

[0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 1 1]


### (1) Create a random population of chromosomes (potential solutions)

In [7]:
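population = np.empty((0, len(xy)))
# one chromosome per population member (pop, not gen);
# each row is a reshuffle of xy, so every row has the same number of 1 bits
for i in range(pop):
    random.shuffle(xy)
    population = np.vstack((population, xy))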

In [8]:
print(population[0:5])

[[0. 1. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0. 1. 0. 0. 0. 0. 0. 1.]
 [1. 0. 0. 0. 1. 1. 0. 1. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 0. 1. 1. 1. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 0. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 1.]]


### (2) Calculate precision of the chromosomes
- Range: (a, b)
- Length: l bits
- Precision: (b - a) / (2^l - 1)
- Encode: literal base-2 encoding
- Decode: sum(bit_i * 2^i) * precision + a
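
The formulas above can be sketched as a small round trip between bits and real values. The helper names `precision`, `decode_bits`, and `encode_value` are hypothetical, and the sketch assumes the left-most bit is the most significant one. The 12-bit example and the bounds (10, 1000) are the c portion and c range that appear later in this notebook.

```python
def precision(a, b, n_bits):
    """Step size between adjacent representable values in [a, b]."""
    return (b - a) / (2 ** n_bits - 1)


def decode_bits(bits, a, b):
    """Map a bit list (most significant bit first) to a real value in [a, b]."""
    value = 0
    for bit in bits:
        value = value * 2 + int(bit)
    return a + value * precision(a, b, len(bits))


def encode_value(x, a, b, n_bits):
    """Map a real value in [a, b] to the nearest representable bit list."""
    steps = round((x - a) / precision(a, b, n_bits))
    return [(steps >> i) & 1 for i in reversed(range(n_bits))]


c_bits = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1]  # 3607 in base 2
c_value = decode_bits(c_bits, 10, 1000)
print(c_value)
print(encode_value(c_value, 10, 1000, 12) == c_bits)
```

With 12 bits the c range (10, 1000) is split into 4095 steps of about 0.24 each, so that is the finest resolution the GA can reach for c.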

In [9]:
# parameter c, see precision formula
ac = 10
bc = 1000
lc = len(xy)//2 # number of bits used for c (integer division keeps it an int)
pc = (bc - ac)/((2**lc)-1)

In [10]:
# parameter gamma, see precision formula
ag = .05
bg = .99
lg = len(xy)//2 # number of bits used for gamma (integer division keeps it an int)
pg = (bg - ag)/((2**lg)-1)

### Decode the chromosomes
- returns a real value in the specified range for the parameter

In [11]:
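def decode(xy, index, precision, lowerBound):
    """Decodes half of a chromosome into a real value.

    Reads len(xy)//2 bits, walking backwards from index (the least significant bit)."""
    power = 0 #binary powers
    total = 0 #renamed from sum to avoid shadowing the built-in
    for i in range(len(xy)//2):
        val = xy[index]*(2**power)
        total += val
        index -= 1
        power += 1
    return (total * precision) + lowerBound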

In [12]:
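#index of the least significant bit of the c parameter from SVM
cIndex = -1 # start at the end of the chromosome, decodes bits 12..23
#index of the least significant bit of the gamma parameter from SVM
#note: (l//2) + 1 decodes bits 2..13, which overlaps the c bits at 12 and 13 and leaves bits 0 and 1 unused;
#(l//2) - 1 would decode the first 12 bits (0..11) instead
gIndex = (l//2) + 1 #start in the middle of the chromosome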
c1 = decode(xy, cIndex, pc, ac)
g1 = decode(xy, gIndex, pg, ag)
print("c bounds: ({},{})\tcvalue: {}".format(ac,bc,c1))
print("g bounds: ({},{})\tgvalue: {}".format(ag,bg,g1))


c bounds: (10,1000)	cvalue: 215.73626373626374
g bounds: (0.05,0.99)	gvalue: 0.14824664224664225


### Perform the algorithm

- keep track of children and mutations 
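
Before the full SVR version, the overall loop (tournament selection, two-point crossover, bit-flip mutation, elitism) can be sketched on a toy problem that needs no dataset. All names here (`tournament`, `two_point_crossover`, `bit_flip`, `run_ga`, `toy_fitness`) are hypothetical, the defaults mirror the parameters set earlier, and lower fitness is better as in this notebook. One deliberate difference: elitism here carries over the best chromosome seen so far, a slight variant of replacing the worst child with the best child of the same generation.

```python
import random


def tournament(population, fitness, k, rng):
    """Pick k random contestants and return the one with the lowest error."""
    return min(rng.sample(population, k), key=fitness)


def two_point_crossover(p1, p2, pc, rng):
    """With probability pc, swap the segment between two random cut points."""
    c1, c2 = p1[:], p2[:]
    if rng.random() < pc:
        i, j = sorted(rng.sample(range(len(p1)), 2))
        c1[i:j], c2[i:j] = p2[i:j], p1[i:j]
    return c1, c2


def bit_flip(chromosome, pm, rng):
    """Flip each bit independently with probability pm."""
    return [1 - bit if rng.random() < pm else bit for bit in chromosome]


def run_ga(fitness, n_bits=24, pop_size=6, generations=3,
           pc=0.95, pm=0.12, k=3, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    best = min(population, key=fitness)
    history = []  # best error seen so far, recorded after each generation
    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1 = tournament(population, fitness, k, rng)
            p2 = tournament(population, fitness, k, rng)
            for child in two_point_crossover(p1, p2, pc, rng):
                children.append(bit_flip(child, pm, rng))
        children = children[:pop_size]
        generation_best = min(children, key=fitness)
        if fitness(generation_best) < fitness(best):
            best = generation_best
        # Elitism (variant): overwrite the worst child with the best so far.
        worst = max(range(len(children)), key=lambda i: fitness(children[i]))
        children[worst] = best[:]
        population = children
        history.append(fitness(best))
    return best, history


def toy_fitness(chromosome):
    """Toy error to minimize: the number of 0 bits."""
    return chromosome.count(0)


best, history = run_ga(toy_fitness, generations=30)
print(best, history[-1])
```

Swapping `toy_fitness` for a function that decodes c and gamma and returns a cross-validated error gives the structure used in the cells below.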

In [13]:
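from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

def kfold(choice, num_folds):
    """Mean cross-validated error (1 - R^2) of an RBF SVR decoded from a chromosome"""
    c = decode(choice, cIndex, pc, ac)
    g = decode(choice, gIndex, pg, ag)
    total = 0
    kf = KFold(n_splits=num_folds)

    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        model = svm.SVR(kernel='rbf',C=c, gamma=g)
        #ravel() flattens the (n, 1) target column into the 1-D array SVR expects
        model.fit(X_train, y_train.ravel())
        #score() returns R^2 for a regressor, so the error is 1 - R^2
        r2 = model.score(X_test, y_test)
        err = 1-r2
        total += err
    #average the error over the folds
    total = total/num_folds
    return total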

def parentSelection(population, l):
    """
    returns two parents from a population given the chromosome size l
    """
    parents = np.zeros((0, l))
    for k in range(2):
        #get some random samples to pick good parents
        candidates = population[np.random.choice(len(population), 3, replace=False)]

        #the best selection for a parent
        selection = -1
        score = 10000000

        for choice in candidates:
            #c and gamma values for SVR
            total = kfold(choice, num_folds)
            #print(total)
            if (total < score):
                score = total
                selection = choice

        #print(selection, score)
        parents = np.vstack((parents, selection))
    return parents

def crossover(parents):
    children = np.copy(parents)

    #perform crossover where Pc is the probability that crossover takes place
    if(np.random.rand() < Pc):
        #get 2 random indicies for the crossover portion
        indices = np.random.choice(l,2,replace=False)
        #ensure first index is smaller than the second
        if(indices[0] > indices[1]):
            temp = indices[0]
            indices[0] = indices[1]
            indices[1] = temp
        #print(indices)

        children[0][indices[0]:indices[1]] = parents[1][indices[0]:indices[1]]
        children[1][indices[0]:indices[1]] = parents[0][indices[0]:indices[1]]
    return children

def mutate(children):
    for child in children:
        for i in range(0, len(child)):
            if(np.random.rand() < Pm):
                #print(i)
                if(child[i] == 0):
                    child[i] = 1
                else:
                    child[i] = 0


In [132]:
# bestC = []
# worstC = []
# bestG = []
# worstG = []

fittest = np.empty((0, len(xy)+2))
ofgf = []

min0 = np.empty((0, len(xy)+2))
min1 = np.empty((0, len(xy)+2))

# mgm00 = np.empty((0, len(xy)+2))
# mgm11 = np.empty((0, len(xy)+2))

# mgm000 = np.empty((0, len(xy)+2))
# mgm111 = np.empty((0, len(xy)+2))

In [133]:
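#index of the least significant bit of the c parameter from SVM
cIndex = -1 # start at the end of the chromosome
#index of the least significant bit of the gamma parameter from SVM
#note: with l = 24 this decodes bits 2..13, sharing bits 12 and 13 with c
gIndex = (l//2) + 1 #start in the middle of the chromosome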

for k in range(gen):
    print("generation: ", k)
    
    newPopulation = np.empty((0, len(xy)))
    
    agc0 = np.empty((0, len(xy)+1))
    agc1 = np.empty((0, len(xy)+1))
    
    mgc0 = []
    mgc1 = []
    
    bestgc = np.empty((0, len(xy)+1))
    fbgc = []
    fwgc = []
    
    num_folds = 3
    
    
    for j in range(pop//2): #pop//2
        #print("family: ", j)
        
        parents = parentSelection(population, l)
        children = crossover(parents)
        mutate(children)
        fitness = []
        for child in children:
            fitness.append(kfold(child, num_folds))
        print("fitness of child 0 in family {} is {}".format(j, fitness[0]))
        print("fitness of child 1 in family {} is {}".format(j, fitness[1]))
        
        temp0 = np.hstack((fitness[0], children[0]))
        temp1 = np.hstack((fitness[1], children[1]))

        agc0 = np.vstack((agc0, temp0))
        agc1 = np.vstack((agc1, temp1))
        bestgc = np.vstack((agc0,agc1))  
        
        #create a new population
        newPopulation = np.vstack((newPopulation, children[0], children[1]))
        
        #keep up with our best chromosomes
        best = min(agc0[:,0:1])
        print(best)
        for i in range(0, len(agc0)):
            test = agc0[i]
            if(test[0:1] == best):
                mgc0 = agc0[i,:]
                
        best = min(agc1[:,0:1])
        print(best)
        for i in range(0, len(agc1)):
            test = agc1[i]
            if(test[0:1] == best):
                mgc1 = agc1[i,:]
                
                
    #apply elitism / survival of the fittest (worst child is replaced with best child)
    best = min(bestgc[:,0:1])
    print(best)
    for i in range(0, len(bestgc)):
        test = bestgc[i]
        if(test[0:1] == best):
            fbgc = bestgc[i,:]

    worst = max(bestgc[:,0:1])
    for i in range(0, len(bestgc)):
        test = bestgc[i]
        if(test[0:1] == worst):
            fwgc = bestgc[i,:]
            
    #print("fbgc: ", fbgc)
    #print("fwgc: ", fwgc)
    
    for i in range(0, len(newPopulation)):
        if(np.array_equal(newPopulation[i], fwgc[1:])):
            newPopulation[i] = fbgc[1:]

    population = newPopulation
    print("pop len: ", len(population))

    min0 = np.insert(mgc0, 0, k)
    min1 = np.insert(mgc1, 0, k)
    
    #mgm0 = np.vstack((mgm0, mgc0))
    #mgm1 = np.vstack((mgm1, mgc1))
    fittest = np.vstack((fittest, min0, min1)) #ofg was never defined; accumulate into fittest
    #print("mgm0: ", mgm0)
    #print("mgm1: ", mgm1)

#ofg = np.vstack((mgm0, mgm1))
print("ofg: ", fittest)
best = min(fittest[:,1:2])
print(best)
for i in range(0, len(fittest)):
    test = fittest[i]
    #print(test)
    if(test[1:2] == best):
        ofgf = fittest[i,:]
print("Min for all generations: ", ofgf)

generation:  0
fitness of child 0 in family 0 is 0.06493535111858544
fitness of child 1 in family 0 is 0.0661776139122225
[0.06493535]
[0.06617761]
fitness of child 0 in family 1 is 0.06371402931210786
fitness of child 1 in family 1 is 0.09097046869885843
[0.06371403]
[0.06617761]
fitness of child 0 in family 2 is 0.3076182674427406
fitness of child 1 in family 2 is 0.06826214757496833
[0.06371403]
[0.06617761]
[0.06371403]
generation:  1
fitness of child 0 in family 0 is 0.06508729598197709
fitness of child 1 in family 0 is 0.11529584535923725
[0.0650873]
[0.11529585]
fitness of child 0 in family 1 is 0.1746197336517711
fitness of child 1 in family 1 is 0.20724053980667942
[0.0650873]
[0.11529585]
fitness of child 0 in family 2 is 0.1490189619757036
fitness of child 1 in family 2 is 0.08778699311115827
[0.0650873]
[0.08778699]
[0.0650873]
generation:  2
fitness of child 0 in family 0 is 0.14336960383082042
fitness of child 1 in family 0 is 0.08859824707849
[0.1433696]
[0.08859825]
fit

In [129]:
print(population)

[[1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0.]
 [1. 0. 0. 0. 1. 0. 0. 0. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 0. 1. 1. 1. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1.]
 [1. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 0.]
 [1. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 0. 0. 0.]]


In [118]:
print("final solution: ", ofgf[2:])
print("best accuracy: ", (1-ofgf[1:2]))

final solution:  [1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1.]
best accuracy:  [0.93594746]


In [119]:
c = decode(ofgf[2:], cIndex, pc, ac)
g = decode(ofgf[2:], gIndex, pg, ag)
print(c,g)

882.021978021978 0.08833455433455434


In [120]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.25)

model = svm.SVR(kernel="rbf", C=c, gamma=g)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = model.score(X_test, y_test) # score() returns R^2 for SVR, not a classification accuracy

print(acc)

0.9397641910141139
