# CHAID

The EPC data contains several categorical variables with many distinct values. To find suitable features that retain the most information, three feature sets are explored:
* data driven
* domain driven
* exhaustive

The first approach, termed data driven, uses statistical methods to reduce the number of variables. Because the variables containing textual descriptions of the property were entered free-hand, many contain a large number of unique values, some of which are recorded for only one property. The data driven approach uses a single-level Chi-square Automatic Interaction Detector (CHAID) to group the levels within each categorical variable into a smaller number of groups. CHAID groups values with a similar response rate or, in this context, a similar Energy Efficiency Rating (EER).
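
To make the grouping idea concrete, the sketch below computes a Pearson chi-square statistic for two levels of a categorical variable against a binary target. It is a simplified illustration of the kind of comparison CHAID relies on when deciding whether two levels are similar enough to merge, not the `CHAID` package's own implementation. The function name `chi_square_2x2` and all counts are hypothetical.

```python
def chi_square_2x2(pos_a, n_a, pos_b, n_b):
    """Pearson chi-square statistic for two levels against a binary target.

    pos_a / pos_b are the counts of positive outcomes in each level,
    n_a / n_b are the total counts in each level.
    Assumes both outcomes occur at least once overall (no zero column totals).
    """
    observed = [
        [n_a - pos_a, pos_a],
        [n_b - pos_b, pos_b],
    ]
    row_totals = [n_a, n_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]
    total = n_a + n_b

    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / total
            stat += (observed[r][c] - expected) ** 2 / expected
    return stat


# Hypothetical counts: two levels with similar response rates (42% vs 43%)
similar = chi_square_2x2(pos_a=420, n_a=1000, pos_b=430, n_b=1000)

# Hypothetical counts: two levels with very different rates (42% vs 82%)
different = chi_square_2x2(pos_a=420, n_a=1000, pos_b=820, n_b=1000)

print(similar, different)
```

With one degree of freedom, the 5% critical value is about 3.84. The first pair falls well below it, so the two levels are candidates for merging, while the second pair falls far above it and would stay in separate groups.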

This script runs CHAID and stores the results in a dictionary.

In [1]:
import numpy as np
import pandas as pd
import datetime
import os
import glob
import json
from CHAID import Tree
import re

In [3]:
# set variables from config file
config_path = os.path.abspath('..')[:-7]

with open(config_path + '/config.json', 'r') as f:
    config = json.load(f)

processing_path = config['DEFAULT']['processing_path']
epc_train_fname = config['DEFAULT']['epc_train_fname']
epc_test_fname = config['DEFAULT']['epc_test_fname']
epc_train_clean_fname = config['DEFAULT']['epc_train_clean_fname']
epc_chaid_fname = config['DEFAULT']['epc_chaid_fname']

In [4]:
dtype_dict = {'INSPECTION_DATE':'str'}

epc_train = pd.read_csv(os.path.join(processing_path,epc_train_clean_fname),
                        header = 0,
                        delimiter = ',',
                        dtype = dtype_dict,
                        parse_dates = ['INSPECTION_DATE'])

In [5]:
#Get quantile boundaries
quantiles = epc_train['CURRENT_ENERGY_EFFICIENCY'].describe()
print(quantiles['25%'])
print(quantiles['75%'])

52.0
71.0


In [6]:
#Create a new target field within the training data
min_eff = epc_train['CURRENT_ENERGY_EFFICIENCY'].min()
max_eff = epc_train['CURRENT_ENERGY_EFFICIENCY'].max()
epc_train['eff_flag'] = pd.cut(epc_train['CURRENT_ENERGY_EFFICIENCY'],
                               bins = [min_eff,53.0,71.0,max_eff],
                               labels = ['0','99','1'])

#Drop unwanted '99' level of eff_flag and convert to integer
#(.copy() avoids a SettingWithCopyWarning on the assignment below)
epc_train = epc_train[epc_train['eff_flag'].isin(['0','1'])].copy()
epc_train['eff_flag'] = epc_train['eff_flag'].astype(int)
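
The binning above can be sketched in plain Python to show which scores land in which class. By default `pd.cut` uses right-closed intervals and does not include the lowest edge, so a score exactly equal to `min_eff` falls outside every bin unless `include_lowest=True` is passed. The helper name `eff_flag_for` and the example scores are hypothetical.

```python
def eff_flag_for(score, min_eff, max_eff, low=53.0, high=71.0):
    """Return '0', '99', '1' or None using right-closed intervals.

    Mirrors bins = [min_eff, low, high, max_eff] with the default
    right-closed edges: (min_eff, low], (low, high], (high, max_eff].
    """
    if min_eff < score <= low:
        return '0'    # low efficiency
    if low < score <= high:
        return '99'   # middle band, dropped later
    if high < score <= max_eff:
        return '1'    # high efficiency
    return None       # outside every interval, including score == min_eff


# Hypothetical scores covering each boundary
scores = [1, 40, 53, 54, 71, 72, 100]
labels = [eff_flag_for(s, min_eff=1, max_eff=100) for s in scores]
print(labels)
```

In this sketch the score equal to the minimum receives no label, which is the edge case to check for if rows go missing after the `isin(['0','1'])` filter.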

In [7]:
#Get the {0,1} sample size by taking the smallest class from the 0
#and 1 outcomes
sample_size = epc_train['eff_flag'].value_counts().min()


#Subsample positive and negative samples
neg_eff_flag = epc_train[epc_train['eff_flag'] == 0].sample(sample_size,random_state = 1234,axis = 0)

pos_eff_flag = epc_train[epc_train['eff_flag'] == 1].sample(sample_size,random_state = 1234,axis = 0)

#Concatenate
epc_chaid = pd.concat([neg_eff_flag, pos_eff_flag])
epc_chaid['eff_flag'].value_counts()

#Randomly shuffle epc_chaid and reset the index
#(random_state makes the shuffle reproducible, matching the sampling above)
epc_chaid = epc_chaid.sample(frac = 1, random_state = 1234).reset_index(drop=True)

In [11]:
#Numeric
var_list_num = epc_chaid.select_dtypes(include= 'number').columns.tolist()
var_list_num.remove('CURRENT_ENERGY_EFFICIENCY')

#Categorical
var_list_cat = epc_chaid.select_dtypes(include= ['object','category']).columns.tolist()
var_list_cat.remove('LMK_KEY')
var_list_cat.remove('POSTCODE')
var_list_cat.remove('CURRENT_ENERGY_RATING')
var_list_cat.remove('LODGEMENT_DATE')

### Creating a dictionary of CHAID scores

In [12]:
chaid_dict = {}
for var in var_list_cat:
    #Set the inputs and outputs
    #The inputs are given as a dictionary along with the type
    #The output must be of string type
    #All features are assumed to be nominal; the features dictionary can be changed to include the ordinal type
    features = {var:'nominal'}
    label = 'eff_flag'
    #Create the Tree
    chaid_dict[var] = {}
    tree = Tree.from_pandas_df(epc_chaid, i_variables = features, d_variable = label, alpha_merge = 0.0)
    #Loop through all the nodes and enter into a dictionary
    print('\n\n\nVariable: %s' % var)
    print('p-value: %f' % tree.tree_store[0].split.p)
    print('Chi2: %f' % tree.tree_store[0].split.score)
    for i in range(1, len(tree.tree_store)):
        count = tree.tree_store[i].members[0] + tree.tree_store[i].members[1]
        rate = tree.tree_store[i].members[1] / count
        print('\nNode %i:\n\tCount = %i\tRate = %f' % (i,count,rate))
        print('\t%s' % tree.tree_store[i].choices)
        chaid_dict[var]['node' + str(i)] = tree.tree_store[i].choices




Variable: region
p-value: 0.000000
Chi2: 14047.791587

Node 1:
	Count = 93959	Rate = 0.424217
	['Blaenau Gwent', 'Pembrokeshire', 'Rhondda Cynon Taf', 'Neath Port Talbot', 'Carmarthenshire', 'Powys', 'Conwy']

Node 2:
	Count = 96951	Rate = 0.551124
	['Bridgend', 'Wrexham', 'Monmouthshire', 'Merthyr Tydfil', 'Vale of Glamorgan', 'Caerphilly', 'Flintshire', 'Swansea']

Node 3:
	Count = 57894	Rate = 0.656027
	['Cardiff', 'Torfaen', 'Newport']

Node 4:
	Count = 36030	Rate = 0.309353
	['Ceredigion', 'Debighshire', 'Gwynedd', 'Isle of Anglesey']



Variable: PROPERTY_TYPE
p-value: 0.000000
Chi2: 34442.140161

Node 1:
	Count = 26988	Rate = 0.273158
	['Bungalow', 'Park home']

Node 2:
	Count = 61090	Rate = 0.821051
	['Flat']

Node 3:
	Count = 196756	Rate = 0.431433
	['House', 'Maisonette']



Variable: BUILT_FORM
p-value: 0.000000
Chi2: 6688.231114

Node 1:
	Count = 20896	Rate = 0.772301
	['<missing>', 'Enclosed End-Terrace', 'Enclosed Mid-Terrace']

Node 2:
	Count = 263938	Rate = 0.478442


p-value: 0.000000
Chi2: 99696.632305

Node 1:
	Count = 139864	Rate = 0.697828
	['<missing>', 'Good']

Node 2:
	Count = 48402	Rate = 0.285112
	['Average']

Node 3:
	Count = 27136	Rate = 0.933004
	['Very Good']

Node 4:
	Count = 69432	Rate = 0.082066
	['Very Poor', 'Poor']



Variable: HOT_WATER_ENV_EFF
p-value: 0.000000
Chi2: 82614.861754

Node 1:
	Count = 148200	Rate = 0.652928
	['<missing>', 'Good']

Node 2:
	Count = 105963	Rate = 0.165463
	['Average', 'Very Poor', 'Poor']

Node 3:
	Count = 30671	Rate = 0.916827
	['Very Good']



Variable: FLOOR_DESCRIPTION
p-value: 0.000000
Chi2: 114957.356804

Node 1:
	Count = 57981	Rate = 0.801659
	['<missing>', '(same dwelling below) insulated', 'average thermal transmittance 0.5 w/m²k', 'average thermal transmittance 0.9 w/m²k', 'to unheated space, limited insulation', 'average thermal transmittance 0.8 w/m²k', 'to unheated space, insulated', 'to external air, limited insulation', 'average thermal transmittance 0.4 w/m²k', 'suspended, insulated',

p-value: 0.000000
Chi2: 174883.464798

Node 1:
	Count = 12591	Rate = 0.493607
	['<missing>', 'Average']

Node 2:
	Count = 113652	Rate = 0.833395
	['Good']

Node 3:
	Count = 32716	Rate = 0.993367
	['Very Good']

Node 4:
	Count = 125875	Rate = 0.071388
	['Very Poor', 'Poor']



Variable: WALLS_ENV_EFF
p-value: 0.000000
Chi2: 174883.464798

Node 1:
	Count = 12591	Rate = 0.493607
	['<missing>', 'Average']

Node 2:
	Count = 113652	Rate = 0.833395
	['Good']

Node 3:
	Count = 32716	Rate = 0.993367
	['Very Good']

Node 4:
	Count = 125875	Rate = 0.071388
	['Very Poor', 'Poor']



Variable: ROOF_DESCRIPTION
p-value: 0.000000
Chi2: 120660.324669

Node 1:
	Count = 107081	Rate = 0.559268
	['<missing>', 'Pitched, 350mm loft insulation', 'Pitched, 250mm loft insulation + Pitched, 0mm loft insulation', 'average thermal transmittance 0.8 w/m²k', 'average thermal transmittance 0.6 w/m²k', 'Pitched, 400+mm loft insulation', 'Pitched, 300+mm loft insulation', 'Roof room(s), insulated', 'Pitched', 'Pitched

p-value: 0.000000
Chi2: 106638.422678

Node 1:
	Count = 80319	Rate = 0.839527
	['<missing>', 'Very Good']

Node 2:
	Count = 103069	Rate = 0.613637
	['Good']

Node 3:
	Count = 38440	Rate = 0.243444
	['Average']

Node 4:
	Count = 47060	Rate = 0.017085
	['Very Poor']

Node 5:
	Count = 15946	Rate = 0.098959
	['Poor']



Variable: ROOF_ENV_EFF
p-value: 0.000000
Chi2: 106638.422678

Node 1:
	Count = 80319	Rate = 0.839527
	['<missing>', 'Very Good']

Node 2:
	Count = 103069	Rate = 0.613637
	['Good']

Node 3:
	Count = 38440	Rate = 0.243444
	['Average']

Node 4:
	Count = 47060	Rate = 0.017085
	['Very Poor']

Node 5:
	Count = 15946	Rate = 0.098959
	['Poor']



Variable: MAINHEAT_ENERGY_EFF
p-value: 0.000000
Chi2: 60719.187867

Node 1:
	Count = 28187	Rate = 0.935680
	['<missing>', 'Very Good']

Node 2:
	Count = 167680	Rate = 0.590375
	['Good']

Node 3:
	Count = 88967	Rate = 0.191633
	['Average', 'Very Poor', 'Poor']



Variable: MAINHEAT_ENV_EFF
p-value: 0.000000
Chi2: 51657.096984

Node 1:
	Coun




Variable: SECONDHEAT_DESCRIPTION
p-value: 0.000000
Chi2: 77515.269337

Node 1:
	Count = 148827	Rate = 0.733886
	['<missing>', 'Hot-Water-Only Systems, gas', 'Room heaters, |Gwresogyddion ystafell, |electric|trydan', 'Room heaters, |Gwresogyddion ystafell, |mains gas|nwy prif gyflenwad', 'Electric Underfloor Heating (Standard tariff), electric', 'None|Dim', 'Room heaters, |Gwresogyddion ystafell, |wood logs|logiau coed', 'Room heaters, bulk wood pellets', 'Room heaters, |Gwresogyddion ystafell, |dual fuel (mineral and wood)|dau danwydd (mwynau a choed)', 'Room heaters, |Gwresogyddion ystafell, |wood chips|asglodion coed', 'None']

Node 2:
	Count = 39114	Rate = 0.432556
	['Room heaters, electric', 'Portable electric heaters(assumed)', 'Room heaters, wood chips', 'none', 'Gwresogyddion ystafell, logiau coed', 'Room heaters, wood pellets', 'Dim']

Node 3:
	Count = 71841	Rate = 0.204883
	['Gwresogyddion ystafell, trydan', 'Portable electric heaters', 'Room heaters, mains gas', 'Gwresogyd

In [13]:
%store chaid_dict

Stored 'chaid_dict' (dict)
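
As a sketch of how the stored dictionary might be used downstream, the grouping can be inverted into a level-to-node lookup and applied to raw category values. The example dictionary copies the `PROPERTY_TYPE` nodes printed above. The helper name `build_level_lookup`, the `'unseen'` fallback label and the sample values are assumptions made for illustration.

```python
# Excerpt with the same shape as chaid_dict: {variable: {node name: [levels]}}
example_chaid_dict = {
    'PROPERTY_TYPE': {
        'node1': ['Bungalow', 'Park home'],
        'node2': ['Flat'],
        'node3': ['House', 'Maisonette'],
    }
}


def build_level_lookup(chaid_groups):
    """Invert {var: {node: [levels]}} into {var: {level: node}}."""
    lookup = {}
    for var, nodes in chaid_groups.items():
        lookup[var] = {
            level: node
            for node, levels in nodes.items()
            for level in levels
        }
    return lookup


lookup = build_level_lookup(example_chaid_dict)

# Hypothetical raw values; 'Castle' was never seen by CHAID
raw_values = ['Flat', 'House', 'Bungalow', 'Castle']
grouped = [lookup['PROPERTY_TYPE'].get(v, 'unseen') for v in raw_values]
print(grouped)
```

Levels that did not appear when the tree was built have no node, so an explicit fallback such as the `'unseen'` label here keeps the mapping from failing on new data.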
