# ENEXIS Graduation Project

### Prerequisite: install the cbsodata library

In [None]:
pip install cbsodata

In [None]:
import cbsodata
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nbconvert
# pd.set_option("max_rows", 120)
pd.options.mode.chained_assignment = None  # default='warn'

Let's select a dataset from CBS. One of the most comprehensive is "Kerncijfers wijken en buurten", which is updated every year. It contains demographic data as well as some data on energy consumption, which can be linked to Enexis data via CBS area codes (Gemeente, Wijk and Buurt level). "Kerncijfers wijken en buurten 2019" is the most recent set with largely complete data; in the 2020 and 2021 sets many feature columns are still empty.

In [None]:
a021 = '85039NED' #Kerncijfers wijken en buurten 2021
a020 = '84799NED' #Kerncijfers wijken en buurten 2020
a019 = '84583NED' #Kerncijfers wijken en buurten 2019
a018 = '84286NED' #Kerncijfers wijken en buurten 2018
a017 = '83765NED' #Kerncijfers wijken en buurten 2017
a016 = '83487NED' #Kerncijfers wijken en buurten 2016

a1 = '85126NED' #Energieverbruik woningen; wijkbuurt 2020
a2 = '85010NED' #Zonnestroom; wijken en buurten, 2019
a3 = '85080NED' #Woningen, hoofdverwarmingsinstallaties, gasinstallatie, cv warmtepomp, stadsverwarming, wijken en buurten 2020

In [None]:
selected_dataset = a020
# The 2020 and 2021 tables use the newer column names
dataset_new_format = selected_dataset in (a021, a020)

In [None]:
df_kerncijfers = pd.DataFrame(cbsodata.get_data(selected_dataset))

In [None]:
for name in df_kerncijfers.columns:
    print(name)

In [None]:
# for col in df_orig.columns:
#     print(col)

In [None]:
df_kerncijfers.head(5).transpose()

In [None]:
# Strip leading and trailing whitespace from the string values in object columns
df_kerncijfers = df_kerncijfers.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

Data at the Buurt level are selected for the EDA.

In [None]:
# Take an explicit copy so that later column assignments modify this frame and not a view
df = df_kerncijfers[df_kerncijfers['SoortRegio_2'] == 'Buurt'].copy()

In [None]:
df['IndelingswijzigingWijkenEnBuurten_4'].value_counts() # just checking how many changes have been made since previous year

Some rows have a value of 0 for the `HuishoudensTotaal_28` feature. Because this feature is used below as a denominator in feature engineering, the zeros are replaced with 1.

In [None]:
# Assign the result back: an inplace replace on a selected column may not update df
df['HuishoudensTotaal_28'] = df['HuishoudensTotaal_28'].replace({0: 1})
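As a minimal sketch of why the zeros matter, the toy frame below (invented numbers, not CBS data) shows that dividing by zero households yields `inf`, while replacing 0 with 1 keeps the ratio finite. Note that this is a pragmatic choice; it makes a Buurt without households look like a Buurt with one household.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are assumptions, not CBS figures
toy = pd.DataFrame({
    "HuishoudensTotaal_28": [120, 0, 45],
    "BedrijfsvestigingenTotaal_91": [12, 5, 9],
})

# Dividing by zero households produces inf for the second row
raw_ratio = toy["BedrijfsvestigingenTotaal_91"] / toy["HuishoudensTotaal_28"]

# Replace 0 with 1 and keep the result in a new Series
households = toy["HuishoudensTotaal_28"].replace({0: 1})
safe_ratio = toy["BedrijfsvestigingenTotaal_91"] / households

print(raw_ratio.tolist())
print(safe_ratio.tolist())
```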

Convert the Buurt codes to integers: first remove the leading "BU", then cast the column to an integer type.

In [None]:
# Remove only a literal leading "BU" (lstrip would strip any run of the characters B and U)
df['Codering_3'] = df['Codering_3'].str.replace(r'^BU', '', regex=True)

In [None]:
df['Codering_3'] = df['Codering_3'].astype(int)
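Two caveats of this kind of conversion can be illustrated in plain Python. First, `str.lstrip` treats its argument as a set of characters rather than as a literal prefix. Second, converting to an integer drops leading zeros, so the original code can only be rebuilt if the number of digits is known. The sketch assumes 8-digit Buurt codes and Python 3.9+ (for `str.removeprefix`); the string `"BUB12"` is a made-up value used only to show the difference.

```python
code = "BU00030000"

# lstrip removes any leading run of the characters "B" and "U"
stripped = code.lstrip("BU")            # "00030000"
over_stripped = "BUB12".lstrip("BU")    # "12": the second "B" is removed as well

# removeprefix removes exactly one literal "BU"
exact = "BUB12".removeprefix("BU")      # "B12"

# Integer conversion loses the leading zeros ...
as_int = int(code.removeprefix("BU"))   # 30000

# ... which can be restored by zero-padding, assuming 8-digit codes
restored = f"BU{as_int:08d}"            # "BU00030000"
```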

In [None]:
df.head()

To be usable in the EDA, numeric features have to be intensive variables, i.e. variables that do not depend on the size of the system (here, the size of the Buurt). For this reason, two new variables are created by dividing the extensive variables `AantalInkomensontvangers_70` and `BedrijfsvestigingenTotaal_91` by the number of inhabitants and the number of households, respectively. A third new variable describes the average education level as a weighted average of `OpleidingsniveauLaag_64`, `OpleidingsniveauMiddelbaar_65` and `OpleidingsniveauHoog_66`, with weights of 1, 2 and 3 respectively. It takes values in the range 1 to 3.

In [None]:
df['Gemiddeld_opleidingsniveau'] = (df['OpleidingsniveauLaag_64'] * 1 + df['OpleidingsniveauMiddelbaar_65'] * 2 + 
                                df['OpleidingsniveauHoog_66'] * 3) / (df['OpleidingsniveauLaag_64'] + 
                                df['OpleidingsniveauMiddelbaar_65'] + df['OpleidingsniveauHoog_66'])

In [None]:
df['Percent_inkomensontvangers'] = df['AantalInkomensontvangers_70'] / df['AantalInwoners_5']

In [None]:
df['Bedrijfsvestigingen_per_huishuidens'] = df['BedrijfsvestigingenTotaal_91'] / df['HuishoudensTotaal_28']
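The weighted average can be checked by hand with a small stand-alone function. This is only a sketch of the formula used above; the function name and the input counts are hypothetical.

```python
def average_education_level(low: float, middle: float, high: float) -> float:
    """Weighted mean of education levels with weights 1 (low), 2 (middle), 3 (high)."""
    total = low + middle + high
    if total == 0:
        return float("nan")  # nobody counted: the level is undefined
    return (1 * low + 2 * middle + 3 * high) / total

only_low = average_education_level(100, 0, 0)   # lower bound of the range: 1.0
only_high = average_education_level(0, 0, 80)   # upper bound of the range: 3.0
mixed = average_education_level(50, 30, 20)     # (50 + 60 + 60) / 100 = 1.7
```

The two extreme cases confirm that the variable stays within the range 1 to 3.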

We now take a slice of the dataframe `df` containing only the selected features and the newly added features. The feature names are stored in two lists, `Selected_CBS_features_old_format` and `Selected_CBS_features_new_format`, one for each dataset format.

In [None]:
Selected_CBS_features_old_format = ['WijkenEnBuurten',
 'Gemeentenaam_1',
 'SoortRegio_2',
 'Codering_3',
 'MeestVoorkomendePostcode_113',
 'HuishoudensTotaal_28',
 'GemiddeldeHuishoudensgrootte_32',
 'Bevolkingsdichtheid_33',
 'GemiddeldeWoningwaarde_35',
 'PercentageEengezinswoning_36',
 'Koopwoningen_40',
 'InBezitWoningcorporatie_42',
 'InBezitOverigeVerhuurders_43',
 'BouwjaarVanaf2000_46',
 'GemiddeldElektriciteitsverbruikTotaal_47',
 'GemiddeldAardgasverbruikTotaal_55',
 'Gemiddeld_opleidingsniveau',
 'Percent_inkomensontvangers',
 'Bedrijfsvestigingen_per_huishuidens',
 'PersonenautoSPerHuishouden_102',
 'AfstandTotSchool_108',
 'Omgevingsadressendichtheid_116']

Selected_CBS_features_new_format = ['WijkenEnBuurten',
 'Gemeentenaam_1',
 'SoortRegio_2',
 'Codering_3',
 'MeestVoorkomendePostcode_113',
 'HuishoudensTotaal_28',
 'GemiddeldeHuishoudensgrootte_32',
 'Bevolkingsdichtheid_33',
 'GemiddeldeWOZWaardeVanWoningen_35',
 'PercentageEengezinswoning_36',
 'Koopwoningen_40',
 'InBezitWoningcorporatie_42',
 'InBezitOverigeVerhuurders_43',
 'BouwjaarVanaf2000_46',
 'GemiddeldAardgasverbruikTotaal_55',
 'Gemiddeld_opleidingsniveau',
 'Percent_inkomensontvangers',
 'Bedrijfsvestigingen_per_huishuidens',
 'PersonenautoSPerHuishouden_102',
 'AfstandTotSchool_108',
 'Omgevingsadressendichtheid_116']

In [None]:
features = Selected_CBS_features_new_format if dataset_new_format else Selected_CBS_features_old_format
df = df[features]
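Because column names differ between table versions (for example `GemiddeldeWoningwaarde_35` versus `GemiddeldeWOZWaardeVanWoningen_35` in the lists above), selecting a column that is absent raises a `KeyError`. A possible guard, sketched here on a toy frame with assumed values, is to report the missing names before slicing.

```python
import pandas as pd

# Toy frame for illustration; only two of the wanted columns exist
toy = pd.DataFrame({
    "Codering_3": [30000, 30001],
    "HuishoudensTotaal_28": [120, 45],
})
wanted = ["Codering_3", "HuishoudensTotaal_28", "GemiddeldeWoningwaarde_35"]

# Split the wanted names into those present and those missing
missing = [col for col in wanted if col not in toy.columns]
present = [col for col in wanted if col in toy.columns]

if missing:
    print("Columns not found in this table version:", missing)

subset = toy[present].copy()
```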

### Description of data

We now have the final set of features extracted from the CBS dataset. Let's explore its main characteristics.

In [None]:
df.info()

There is a large number of missing values, which will need to be handled later. For now we perform an initial EDA on the set as it is.
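To quantify the missing values per feature, one option is the share of missing entries per column, sorted from most to least incomplete. The sketch below uses a toy frame with invented values; on the real data the same expression would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Koopwoningen_40": [55.0, np.nan, 61.0, np.nan],
    "Bevolkingsdichtheid_33": [1200.0, 830.0, np.nan, 45.0],
    "Codering_3": [30000, 30001, 30002, 30007],
})

# isna() marks missing cells; the column mean of the booleans is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```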

In [None]:
df.describe(include = 'all').transpose()

In [None]:
df.head().transpose()

Let's first look at the distribution of the individual features, using a set of histograms.

In [None]:
df.drop('Codering_3', axis = 1).hist(bins=25, figsize=(16,14));

### Explore correlations between average energy consumption and other features

Let's explore correlations between features. We will use a correlation matrix with Pearson correlation coefficients. The most interesting quantity for our subject is the average electricity consumption, so the features are ordered by decreasing correlation with `GemiddeldElektriciteitsverbruikTotaal_47`.
Note: The dataset also contains electricity and gas consumption data calculated separately per type of dwelling (Appartement, Tussenwoning, Hoekwoning, Twee-onder-één-kap-woning, Vrijstaande woning) and per type of ownership (Huurwoning and Eigen woning). These differences can be looked into at a later stage.

In [None]:
df.info()

In [None]:
# corr_matrix = df.corr().sort_values(by = 'GemiddeldElektriciteitsverbruikTotaal_47', ascending = False).transpose()
# corr_matrix = corr_matrix.sort_values(by = 'GemiddeldElektriciteitsverbruikTotaal_47', ascending = False)

In [None]:
# plt.figure(figsize = (14,11))
# sns.heatmap(data = corr_matrix, annot = True, fmt='.2f', cmap = 'RdBu_r', linewidths=.1, square=True, vmax=1, center = 0)

Let's focus on the correlation between average electricity consumption and the other features. Values close to +1 or -1 indicate a strong positive or negative linear correlation; values close to 0 indicate a weak linear relationship, although a non-linear relationship may still be present.
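A minimal sketch of this ranking on toy data follows. It assumes the target column is present in the frame, which holds for the old-format feature list but not for the new-format one. The columns `linear_up`, `linear_down` and `quadratic` are hypothetical and serve only to show the three cases: strong positive, strong negative, and a non-linear dependence whose Pearson coefficient is zero.

```python
import pandas as pd

target = "GemiddeldElektriciteitsverbruikTotaal_47"

# Toy data for illustration only
toy = pd.DataFrame({
    target: [1000.0, 2000.0, 3000.0, 4000.0, 5000.0],
    "linear_up": [1.0, 2.0, 3.0, 4.0, 5.0],
    "linear_down": [10.0, 8.0, 6.0, 4.0, 2.0],
    "label": list("abcde"),  # non-numeric column, excluded below
})
# Perfectly determined by the target, but symmetric around its mean
toy["quadratic"] = (toy[target] - 3000.0) ** 2

# Pearson correlation of every numeric feature with the target, strongest first
corr_with_target = (
    toy.select_dtypes(include="number")
    .corr()[target]
    .drop(target)
    .sort_values(ascending=False)
)
print(corr_with_target)
```

The `quadratic` column depends entirely on the target, yet its coefficient is zero, which is why a value near 0 should not be read as "no relationship".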

In [None]:
# plt.figure(figsize = (10,10))
# sns.heatmap(data = corr_matrix[['GemiddeldElektriciteitsverbruikTotaal_47']], 
            # annot = True, fmt='.2f', cmap = 'RdBu_r', linewidths=.1, square=True, vmax=1, center = 0)

Next, let's visualize the relationship between each individual numeric feature (on the x axes) and average electricity consumption (on the y axis of each diagram).

In [None]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_num = df.select_dtypes(include=numerics).drop('Codering_3', axis = 1)

In [None]:
# after the notebook Pima Indians Diabetes © 2020 by Laurence Frank and Daniel Kapitan. 
# https://jakevdp.github.io/PythonDataScienceHandbook/04.08-multiple-subplots.html 
# fig, ax = plt.subplots(5, 4, figsize=(15,15), sharey=True, gridspec_kw={'hspace': 0.3})
# print('num columns', df_num.columns)
# for i, col in enumerate(df_num.columns):
#     # print(i//4, i%4)
#     _ax=ax[i // 4, i % 4]
#     sns.scatterplot(x=col, y='GemiddeldElektriciteitsverbruikTotaal_47', data=df_num, ax=_ax)
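The commented-out cell above uses a fixed 5 x 4 grid. As an alternative sketch, the grid size can be derived from the number of features, so that the code keeps working when the feature list changes. The example uses plain matplotlib and a toy frame with invented values; the choice of two columns per row is arbitrary.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd

target = "GemiddeldElektriciteitsverbruikTotaal_47"

# Toy data for illustration only
toy = pd.DataFrame({
    target: [2500, 3100, 2800, 3600],
    "Bevolkingsdichtheid_33": [4200, 900, 2100, 300],
    "GemiddeldeHuishoudensgrootte_32": [1.8, 2.4, 2.1, 2.7],
    "Koopwoningen_40": [35, 70, 52, 81],
})

features = [col for col in toy.columns if col != target]
n_cols = 2
n_rows = math.ceil(len(features) / n_cols)  # enough rows for all features

fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows),
                         sharey=True, squeeze=False)
flat_axes = axes.ravel()

# One scatter plot per feature, with the target on the shared y axis
for ax, col in zip(flat_axes, features):
    ax.scatter(toy[col], toy[target], s=10)
    ax.set_xlabel(col)

# Hide any unused panels at the end of the grid
for ax in flat_axes[len(features):]:
    ax.set_visible(False)
```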

Finally, we write the dataset to a CSV file so that it can be used as an input file in the next stages of the analysis.

In [None]:
# test to see if it works
# dg = df.loc[:,['WijkenEnBuurten','Codering_3','HuishoudensTotaal_28']]
# dg.to_csv(path_or_buf = 'test_CBS_dataset_' + selected_dataset + '.csv', index = False)

In [None]:
df.to_csv(path_or_buf = 'CBS_dataset_' + selected_dataset + '.csv', decimal=',', sep=';', index = False)
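The file is written with a semicolon as field separator and a comma as decimal mark, so it has to be read back with the same settings. The following round-trip sketch uses an in-memory buffer and a toy frame with invented values instead of a file on disk.

```python
import io

import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    "Codering_3": [30000, 30001],
    "GemiddeldeHuishoudensgrootte_32": [2.1, 1.8],
})

# Write with the same options as above, but into a string buffer
buffer = io.StringIO()
toy.to_csv(buffer, decimal=",", sep=";", index=False)
text = buffer.getvalue()
print(text)

# Read back using matching separator and decimal settings
restored = pd.read_csv(io.StringIO(text), decimal=",", sep=";")
```

Reading the file with the pandas defaults (`sep=','`, `decimal='.'`) would not split the columns correctly.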

In [None]:
#nbconvert.PDFExporter('Enexis CBS data EDA v1')