# ENEXIS Graduation Project

####  EDA of PV installed capacity

### Contents

#### Characteristics of den Bosch - demographic features vs installed PV by buurt
   * [1. Reading of PV installed capacity & demographics dataset](#readpv)
   * [2. Selection of a specific municipality to focus on: 's-Hertogenbosch](#selbest)
   * [3. Demographic characteristics of 's-Hertogenbosch by buurt](#dembest)

#### PV installed capacity  - development in time in den Bosch by buurt
  
   * [1. Total number of installations as a function of time](#total)
   * [2. Percentage of privately owned houses](#koop)
   * [3. Percentage of single-family houses](#een)
   * [4. Households with PV per 100 households in 2022](#pv2022)
   * [5. Province](#prov)
   * [6. Average electricity consumption](#elec)

In [None]:
import pandas as pd
import numpy as np
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import cbsodata
pd.options.mode.chained_assignment = None  # default='warn'

<a id='readpv'></a>
#### 1. Reading of PV installed capacity & demographics dataset

In [None]:
c_path = "../Data/"
v_file = "PV installed capacity & demographics"

In [None]:
df     = pd.read_csv(filepath_or_buffer = c_path + v_file + ".csv",
                      encoding           = 'UTF-8')

In [None]:
df.info()

In [None]:
df.shape

To keep a uniform one-year spacing between time points, the mid-year records (2020-07-01 and 2021-07-01) are removed.

In [None]:
df.head()

In [None]:
df = df.drop(df[df['Year'].isin(['2021-07-01', '2020-07-01'])].index)

In [None]:
df.columns

In [None]:
df = df.drop(['WijkenEnBuurten', 'Gemeentenaam_1', 'SoortRegio_2'], axis = 1)

In [None]:
# Changing the format of the `Year` variable into a 4-digit number (kept as a string)
df['Year'] = df['Year'].apply(lambda x: x[:4])

In [None]:
df = df.sort_values(by = 'Year')

In [None]:
# storing the original dataset under variable `df_orig`
df_orig = df.copy()

In [None]:
#df = df_orig.copy()

In [None]:
df['GM_NAAM'].unique()

<a id='selbest'></a>
#### 2. Selection of a specific municipality to focus on : 's-Hertogenbosch 

In [None]:
df = df[df['GM_NAAM'] == "'s-Hertogenbosch"]

In [None]:
df.info()

In [None]:
df['Year'].value_counts()

Note that many values are missing in 2020, and also in 2021 and 2022.
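
To put numbers on this observation, the missing values can be counted per column and per year. The following is a minimal sketch: `missing_per_year` is a hypothetical helper, and the toy frame holds made-up values that only show the shape of the result. In the notebook it would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_per_year(frame: pd.DataFrame, year_col: str = "Year") -> pd.DataFrame:
    """Count missing values per column, grouped by year."""
    value_cols = frame.columns.drop(year_col)
    # isna() gives a boolean frame; summing booleans per year counts the NaNs
    return frame[value_cols].isna().groupby(frame[year_col]).sum()


# Made-up toy data, for illustration only
toy = pd.DataFrame({
    "Year": ["2020", "2020", "2021"],
    "Koopwoningen_40": [55.0, np.nan, 60.0],
    "GemiddeldeWoningwaarde_35": [np.nan, np.nan, 310.0],
})
missing_counts = missing_per_year(toy)
print(missing_counts)
```

Columns with large counts in a given year are the ones to look at before comparing years.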

In [None]:
df = df.drop(['BU_2021', 'WK_2021', 'WK_NAAM', 'GM_2021',
       'GM_NAAM', 'ProvinciecodePV', 'Provincienaam'], axis = 1)

In [None]:
df.describe(include = 'all').T

In [None]:
df[df.isna().any(axis=1)]

In [None]:
df[df['Opgesteld vermogen'] == 0].T

In [None]:
df[df['BU_NAAM'] == 'De Sprookjesbuurt'].T

<a id='dembest'></a>
#### 3. Demographic characteristics of 's-Hertogenbosch by buurt

In [None]:
df_2022 = df[df['Year'] == '2022']

In [None]:
df_2022 = df_2022.drop(['Year'], axis = 1)

In [None]:
df_2022 = df_2022.sort_values(by = 'Opgesteld_vermogen_per100houshoudens', ascending = False)

In [None]:
df_2022 = df_2022.set_index('BU_NAAM')

In [None]:
df_2022.T

In [None]:
plt.figure(figsize = (15,10))
sns.barplot(data = df_2022, x = df_2022.index, y = 'Opgesteld_vermogen_per100houshoudens', color = 'SteelBlue')
sns.barplot(data = df_2022, x = df_2022.index, y = 'PVinstallaties_per100houshoudens', color = 'Red')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'OV_per_installatie', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

Many buurten with a high average OV (opgesteld vermogen, i.e. installed capacity) per installation have names containing "Bedrijventerrein" or "Landelijk gebied". These buurten probably host a lot of commercial or agricultural activity and are therefore not purely residential. PV figures in these areas may be contaminated by commercial installations, so such buurten should be removed from the dataset.
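
A name-based filter is one simple way to remove such buurten. The sketch below assumes the frame is indexed by buurt name, as `df_2022` is at this point. `drop_non_residential` and the pattern are hypothetical, and the toy names and values are made up.

```python
import pandas as pd

# Name fragments taken from the observation above; this list is an assumption
# and may need extending after inspecting the actual buurt names.
NON_RESIDENTIAL_PATTERN = "Bedrijventerrein|Landelijk gebied"


def drop_non_residential(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without buurten whose name (the index) matches the pattern."""
    names = pd.Series(frame.index.astype(str))
    is_non_residential = names.str.contains(NON_RESIDENTIAL_PATTERN, case=False, regex=True)
    # Use a plain boolean array so that duplicate index labels cannot cause alignment issues
    return frame.loc[~is_non_residential.to_numpy()].copy()


# Made-up buurt names and values, for illustration only
toy = pd.DataFrame(
    {"OV_per_installatie": [25.0, 3.5, 12.0, 4.1]},
    index=["Bedrijventerrein Noord", "Centrum", "Landelijk gebied Oost", "Parkwijk"],
)
residential = drop_non_residential(toy)
print(residential)
```

A name filter only catches buurten that are labelled as such; a threshold on the number of business establishments per household could complement it.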

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'Bevolkingsdichtheid_33', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'GemiddeldeWoningwaarde_35', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'PercentageEengezinswoning_36', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'Koopwoningen_40', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'BouwjaarVanaf2000_46', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'GemiddeldElektriciteitsverbruikTotaal_47', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'Percent_inkomensontvangers', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In one case "Percent_inkomensontvangers" is larger than 1. Is this possible, or is it a data error?
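
To answer this, it helps to isolate the affected buurt and inspect its other columns, for instance to see whether a very small number of households could explain the ratio. A minimal sketch with a hypothetical helper `ratios_above_one` and made-up values:

```python
import pandas as pd


def ratios_above_one(frame: pd.DataFrame,
                     ratio_col: str = "Percent_inkomensontvangers") -> pd.DataFrame:
    """Return the rows where a share expected to lie in [0, 1] exceeds 1."""
    return frame[frame[ratio_col] > 1]


# Made-up values, for illustration only
toy = pd.DataFrame(
    {"Percent_inkomensontvangers": [0.78, 1.12, 0.81]},
    index=["Buurt A", "Buurt B", "Buurt C"],
)
suspicious = ratios_above_one(toy)
print(suspicious.T)
```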

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'Bedrijfsvestigingen_per_huishuidens', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

Again, buurten with a high average OV per installation also tend to have a high number of business establishments per household. This is another argument for treating such buurten as not typically residential and filtering them out. Let's visualize this relationship.

In [None]:
plt.figure(figsize = (10,7))
sns.scatterplot(data = df_2022, x = 'Bedrijfsvestigingen_per_huishuidens', y = 'OV_per_installatie')
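
The scatter plot can be complemented with a correlation coefficient. The sketch below computes Pearson's r and a rank-based (Spearman) coefficient, which is less sensitive to the few extreme buurten. `pearson_and_spearman` is a hypothetical helper and the toy values are made up. Spearman is computed as Pearson on ranks so that only pandas is needed.

```python
import pandas as pd


def pearson_and_spearman(frame: pd.DataFrame, x: str, y: str):
    """Return (Pearson r, Spearman rho) for two columns, ignoring incomplete rows."""
    pair = frame[[x, y]].dropna()
    pearson = pair[x].corr(pair[y])
    # Spearman's rho equals Pearson's r computed on the ranks
    spearman = pair[x].rank().corr(pair[y].rank())
    return pearson, spearman


# Made-up values, for illustration only
toy = pd.DataFrame({
    "Bedrijfsvestigingen_per_huishuidens": [0.1, 0.2, 0.5, 2.0, None],
    "OV_per_installatie": [3.0, 4.0, 5.0, 30.0, 6.0],
})
pearson, spearman = pearson_and_spearman(
    toy, "Bedrijfsvestigingen_per_huishuidens", "OV_per_installatie"
)
print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```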

Let's investigate the outliers, i.e. buurten with OV per installation > 8 or Bedrijfsvestigingen per household > 0.8.

In [None]:
df_2022[(df_2022['Bedrijfsvestigingen_per_huishuidens'] > 0.8) | (df_2022['OV_per_installatie'] > 8)].T

Typical for this group is:
- low total number of households
- house value is often missing
- high percentage of eengezinswoningen and koopwoningen
- high degree of urbanization
- all buurten with "Bedrijventerrein" or "Landelijk gebied" in the name are in this group
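
These impressions can be checked by comparing the outlier group with the remaining buurten side by side. A minimal sketch, using the same selection rule as above; `compare_medians` is a hypothetical helper and the toy values are made up:

```python
import pandas as pd


def compare_medians(frame: pd.DataFrame, is_outlier: pd.Series, columns: list) -> pd.DataFrame:
    """Median of the chosen columns for the outlier group and for the rest."""
    return pd.DataFrame({
        "outliers": frame.loc[is_outlier, columns].median(),
        "rest": frame.loc[~is_outlier, columns].median(),
    })


# Made-up values, for illustration only
toy = pd.DataFrame({
    "Bedrijfsvestigingen_per_huishuidens": [1.5, 0.1, 0.2, 0.9],
    "OV_per_installatie": [20.0, 3.0, 4.0, 10.0],
    "Koopwoningen_40": [90.0, 40.0, 50.0, 80.0],
})
is_outlier = (toy["Bedrijfsvestigingen_per_huishuidens"] > 0.8) | (toy["OV_per_installatie"] > 8)
summary = compare_medians(toy, is_outlier, ["Koopwoningen_40"])
print(summary)
```

Medians are used rather than means because the outlier group is small and the values are skewed.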

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'PersonenautoSPerHuishouden_102', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'AfstandTotSchool_108', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
plt.figure(figsize = (15,6))
sns.barplot(data = df_2022, x = df_2022.index, y = 'MateVanStedelijkheid_115', color = 'SteelBlue')
plt.xticks(rotation = 90)
plt.show()

In [None]:
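# select_dtypes keeps only numeric columns, so corr() does not fail on text columns
corr_matrix = df_2022.select_dtypes(include = 'number').corr().sort_values(by = 'Opgesteld_vermogen_per100houshoudens', ascending = False).transpose()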
corr_matrix = corr_matrix.sort_values(by = 'Opgesteld_vermogen_per100houshoudens', ascending = False)

In [None]:
plt.figure(figsize = (15,12))
sns.heatmap(data = corr_matrix, annot = False, fmt='.2f', cmap = 'RdBu_r', linewidths=.1, square=True, vmax=1, center = 0)

In [None]:
#fig, ax = plt.subplots(3, 4, figsize=(15,10))
#for i, col in enumerate(df_2022.values):
#    _ax=ax[i // 4, i % 4]
#    sns.barplot(x='BU_NAAM', y=col, data=df_2022, ax=_ax)
#plt.subplots_adjust(wspace=0.4, hspace=0.4)
#plt.show()

In [None]:
ax = df_2022[['Opgesteld_vermogen_per100houshoudens', 'PVinstallaties_per100houshoudens']].plot.bar(figsize = (15,10))

In [None]:
plt.figure(figsize = (12,8))
sns.lineplot(data = df, x = 'Year', y = 'PVinstallaties_per100houshoudens', hue = 'BU_NAAM')

It remains to be explained why the number of installations appears to decrease for some buurten.
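
A first step is to list exactly which buurten and years show a drop. The sketch below flags year-over-year decreases per buurt. `year_over_year_decreases` is a hypothetical helper and the toy values are made up. Because the quantity is a ratio per 100 households, a drop may also come from a growing number of households rather than from removed installations; this has not been verified here.

```python
import pandas as pd


def year_over_year_decreases(frame: pd.DataFrame,
                             value_col: str = "PVinstallaties_per100houshoudens",
                             group_col: str = "BU_NAAM",
                             year_col: str = "Year") -> pd.DataFrame:
    """Return the (buurt, year) rows whose value is lower than in the previous record."""
    ordered = frame.sort_values([group_col, year_col])
    # diff() within each buurt gives the change relative to the previous year
    ordered = ordered.assign(change=ordered.groupby(group_col)[value_col].diff())
    return ordered.loc[ordered["change"] < 0, [group_col, year_col, value_col, "change"]]


# Made-up values, for illustration only
toy = pd.DataFrame({
    "BU_NAAM": ["Buurt A"] * 3 + ["Buurt B"] * 3,
    "Year": ["2019", "2020", "2021"] * 2,
    "PVinstallaties_per100houshoudens": [10.0, 12.0, 11.0, 5.0, 6.0, 7.0],
})
decreases = year_over_year_decreases(toy)
print(decreases)
```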

In [None]:
table_Bosch1 = df.groupby('Year').mean(numeric_only = True)  # skip text columns such as BU_NAAM
#table_PV1 = table_PV1.reset_index()

In [None]:
table_Bosch1.T

In [None]:
sns.lineplot(data = df, x = 'Year', y = 'Opgesteld_vermogen_per100houshoudens')

In [None]:
sns.lineplot(data = df, x = 'Year', y = 'PVinstallaties_per100houshoudens')

### EDA of PV installed capacity as a function of demographic features

<a id='total'></a>
#### 1. Total number of installations as a function of time

In [None]:
table_PV1 = df.groupby('Year').mean(numeric_only = True)  # skip text columns such as BU_NAAM
table_PV1 = table_PV1.reset_index()

In [None]:
table_PV1

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV1, x = 'Year', y = 'PVinstallaties_per100houshoudens')

In [None]:
table_PV2 = df.groupby(['Year', 'MateVanStedelijkheid_115']).mean(numeric_only = True)
table_PV2 = table_PV2.reset_index()

In [None]:
table_PV2.head()

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV2, x = 'Year', y = 'Opgesteld vermogen', hue = 'MateVanStedelijkheid_115', palette = "viridis")

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV2, x = 'Year', y = 'OV_per_installatie', hue = 'MateVanStedelijkheid_115', palette = "viridis")

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV2, x = 'Year', y = 'Opgesteld_vermogen_per100houshoudens', hue = 'MateVanStedelijkheid_115', 
             palette = "viridis")

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV2, x = 'Year', y = 'PVinstallaties_per100houshoudens', hue = 'MateVanStedelijkheid_115', 
             palette = "viridis")

<a id='koop'></a>
#### 2. Percentage of privately owned houses

In [None]:
bins = [0, 20, 40, 60, 80, np.inf]
names = ['<20', '21-40', '41-60', '61-80', '81-100']

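# include_lowest = True keeps buurten with exactly 0 % in the first bin instead of NaN
df['%_Koopwoningen'] = pd.cut(df['Koopwoningen_40'], bins, labels=names, include_lowest = True)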
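
A note on the bin edges: `pd.cut` uses right-closed intervals by default, so with bins starting at 0 the first interval is (0, 20] and a value of exactly 0 is not assigned to any bin unless `include_lowest=True` is passed. The small sketch below, with made-up shares, shows the difference.

```python
import numpy as np
import pandas as pd

bins = [0, 20, 40, 60, 80, np.inf]
names = ['<20', '21-40', '41-60', '61-80', '81-100']

# Made-up shares, chosen to hit the bin edges
shares = pd.Series([0.0, 20.0, 20.5, 100.0])

default_cut = pd.cut(shares, bins, labels=names)                          # 0.0 becomes NaN
inclusive_cut = pd.cut(shares, bins, labels=names, include_lowest=True)   # 0.0 lands in '<20'

print(pd.DataFrame({"share": shares, "default": default_cut, "include_lowest": inclusive_cut}))
```

Rows that end up as NaN are dropped by a subsequent `groupby` on the binned column, so the choice affects which buurten enter the averages.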

In [None]:
table_PV3 = df.groupby(['Year', '%_Koopwoningen']).mean(numeric_only = True)
table_PV3 = table_PV3.reset_index()

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV3, x = 'Year', y = 'PVinstallaties_per100houshoudens', hue = '%_Koopwoningen', 
             palette = "viridis")

<a id='een'></a>
#### 3. Percentage of single-family houses

In [None]:
bins = [0, 20, 40, 60, 80, np.inf]
names = ['<20', '21-40', '41-60', '61-80', '81-100']

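# include_lowest = True keeps buurten with exactly 0 % in the first bin instead of NaN
df['Perc_Eengezinswoning'] = pd.cut(df['PercentageEengezinswoning_36'], bins, labels=names, include_lowest = True)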
df.head()

In [None]:
table_PV4 = df.groupby(['Year', 'Perc_Eengezinswoning']).mean(numeric_only = True)
table_PV4 = table_PV4.reset_index()

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV4, x = 'Year', y = 'PVinstallaties_per100houshoudens', hue = 'Perc_Eengezinswoning', 
             palette = "viridis")

<a id='pv2022'></a>
#### 4. Households with PV per 100 households in 2022

In [None]:
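# .copy() makes df2022 an independent frame, so adding a column later does not touch a view of df
df2022 = df[df['Year'] == '2022'].copy()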

In [None]:
df2022

In [None]:
bins = [0, 20, 40, 60, 80, 100, np.inf]
names = ['<20', '21-40', '41-60', '61-80', '81-100', '>100']

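# include_lowest = True keeps buurten with exactly 0 installations in the first bin instead of NaN
df2022['PVinstallaties_per100houshoudens_groups'] = pd.cut(df2022['PVinstallaties_per100houshoudens'], bins, labels=names, include_lowest = True)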

In [None]:
df2022 = df2022[['BU_NAAM', 'PVinstallaties_per100houshoudens_groups']]

In [None]:
df2022.head()

In [None]:
# inner join (the default): buurten without a 2022 record are dropped
df = df.merge(df2022, on = 'BU_NAAM')

In [None]:
df.head()

In [None]:
df['PVinstallaties_per100houshoudens'].nlargest(5)

In [None]:
df[df['PVinstallaties_per100houshoudens'].isin(df['PVinstallaties_per100houshoudens'].nlargest(5))].transpose()

In [None]:
df['PVinstallaties_per100houshoudens_groups'].value_counts()

In [None]:
table_PV5 = df.groupby(['Year', 'PVinstallaties_per100houshoudens_groups']).mean(numeric_only = True)
table_PV5 = table_PV5.reset_index()

In [None]:
table_PV5

In [None]:
plt.figure(figsize = (12, 7))
ax = sns.lineplot(data = table_PV5, x = 'Year', y = 'PVinstallaties_per100houshoudens', 
             hue = 'PVinstallaties_per100houshoudens_groups', palette = "viridis")
ax.set_ylim([0, 100])

<a id='elec'></a>
#### 6. Average electricity consumption

In [None]:
df['GemiddeldElektriciteitsverbruikTotaal_47'].hist(bins = 30)

In [None]:
bins = [0, 2500, 3000, 3500, 4000, 4500, np.inf]
names = ['<2500', '2500-3000', '3000-3500', '3500-4000', '4000-4500', '>4500']

df['Elektriciteitsverbruik_bins'] = pd.cut(df['GemiddeldElektriciteitsverbruikTotaal_47'], bins, labels=names)
df.head().transpose()

In [None]:
df['Elektriciteitsverbruik_bins'].value_counts()

In [None]:
table_PV7 = df.groupby(['Year', 'Elektriciteitsverbruik_bins']).mean(numeric_only = True)
table_PV7 = table_PV7.reset_index()

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV7, x = 'Year', y = 'Opgesteld vermogen', hue = 'Elektriciteitsverbruik_bins', 
             palette = "viridis")

In [None]:
plt.figure(figsize = (12, 7))
sns.lineplot(data = table_PV7, x = 'Year', y = 'PVinstallaties_per100houshoudens', hue = 'Elektriciteitsverbruik_bins', 
             palette = "viridis")

The apparent decrease in the number of installations still has to be explained.