### Data understanding: installed PV in CBS versus Enexis data
What does the data look like?
We look at installed PV over time. What are the differences and similarities between a set of selected municipalities (Den Bosch, Arnhem, Best and Loon op Zand)? And how does the CBS data on installed PV compare to the data from the Enexis files that we use? This notebook helps us find out.

In [None]:
!pip install --upgrade pip
!pip install cbsodata
!pip install --upgrade altair

!pip install jupyter pandas vega
!pip install --upgrade notebook

In [None]:
import cbsodata
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt

In [None]:
# Read in "Zonnestroom; vermogen bedrijven en woningen, regio (indeling 2019)"
zonnestroom_2019 = '84783NED'
df_zonnestroom_2019 = pd.DataFrame(cbsodata.get_data(zonnestroom_2019))

# Arnhem is left out because it is not part of the Enexis data used further on.
selected_municipalities = ["'s-Hertogenbosch", "Loon op Zand", "Best"]
df_zonnestroom_2019 = df_zonnestroom_2019[
    df_zonnestroom_2019['RegioS'].isin(selected_municipalities)
    & (df_zonnestroom_2019['BedrijfstakkenWoningen'] == 'Woningen')
].copy()  # copy so that the column assignment below does not act on a view

# Set the period to December so that the end-of-year CBS values line up
# with the Enexis snapshots of 1 January.
df_zonnestroom_2019['Perioden'] = df_zonnestroom_2019['Perioden'] + "-12"

df_zonnestroom_2019.head()
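
The effect of shifting the CBS year label to December can be sketched with plain dates. The labels below and the helper `months_between` are illustrative assumptions, not values taken from the CBS table; the point is only that an end-of-year value sits one month before a 1 January snapshot, whereas an unshifted year label would appear a full year earlier.

```python
from datetime import datetime


def months_between(earlier: datetime, later: datetime) -> int:
    """Whole calendar months from `earlier` to `later` (hypothetical helper)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


cbs_year = "2019"              # illustrative CBS-style year label
cbs_period = cbs_year + "-12"  # same shift as in the cell above
enexis_period = "2020-01"      # snapshot of 1 January 2020

enexis_date = datetime.strptime(enexis_period, "%Y-%m")
gap_with_shift = months_between(datetime.strptime(cbs_period, "%Y-%m"), enexis_date)
gap_without_shift = months_between(datetime.strptime(cbs_year, "%Y"), enexis_date)

print(gap_with_shift, gap_without_shift)
```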

### Check that Altair is at version 4.2.0 so we can use the right graphs

In [None]:
alt.__version__

In [None]:
alt.Chart(df_zonnestroom_2019).mark_line().encode(
    x=alt.X("Perioden", bin=False, title='Year'),
    y=alt.Y(alt.repeat('layer'), aggregate='mean', title="Installed PV: number vs. power"),
    color=alt.ColorDatum(alt.repeat('layer'))
).repeat(layer=["AantalInstallaties_1", "OpgesteldVermogenVanZonnepanelen_2"])

### Comparing municipalities

In [None]:
chart = alt.Chart(df_zonnestroom_2019).mark_point().encode(
    alt.X(alt.repeat("column"), type='ordinal', title='Year'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='RegioS:N'
).properties(
    width=375,
    height=250
).repeat(
    column=['Perioden'],
    row=['AantalInstallaties_1', 'OpgesteldVermogenVanZonnepanelen_2'],
    
).interactive()
chart

### Compare the CBS data with the Enexis data

The CBS data runs from 2012 to 2019, with each value reflecting the end of the year. The Enexis data starts at 1-1-2020.

Load the Enexis data (2020 to 2022):

In [None]:
import pandas as pd

Helper functions for recurring work:

In [None]:
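def load_enexis_data(file_name):
    # The Enexis files are semicolon-separated and use Dutch number formatting
    # (decimal comma, dot as thousands separator).
    return pd.read_csv(
        file_name,
        sep=';',
        decimal=',',
        thousands='.',
        encoding='unicode_escape',
    )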
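
As a minimal sketch of what these `read_csv` options do, the example below parses a small in-memory sample. The column names match the ones used later in this notebook, but the rows are made up and the exact layout of the real Enexis files is an assumption.

```python
import io

import pandas as pd

# Made-up sample in the assumed layout: ';' separated, Dutch number formatting.
sample = io.StringIO(
    "Gemeente;Aantal aansluitingen met opwekinstallatie;Opgesteld vermogen\n"
    "Best;1.250;4.375,5\n"
    "Loon op Zand;980;3.120,25\n"
)

# decimal=',' and thousands='.' turn "4.375,5" into 4375.5 and "1.250" into 1250.
df_sample = pd.read_csv(sample, sep=";", decimal=",", thousands=".")
print(df_sample.dtypes)
print(df_sample)
```

Without these options the numeric columns would be read as text or misread as small decimals, and the sums computed later would be wrong.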

In [None]:
def filter_on_4_municipalities(df):
    # "s-Hertogenbosch" (without the leading apostrophe) is included because
    # the July 2020 file spells the name that way.
    municipalities = ['Arnhem', 'Best', "'s-Hertogenbosch", "s-Hertogenbosch", 'Loon op Zand']
    return df[df['Gemeente'].isin(municipalities)]
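
Because the same municipality appears under two spellings, an alternative to filtering on both is to normalize the names first. The sketch below uses a hypothetical helper, `normalize_municipality`, that is not part of this notebook; it only covers the 's-Hertogenbosch variants seen here.

```python
def normalize_municipality(name: str) -> str:
    """Map spelling variants of 's-Hertogenbosch onto one canonical name.

    Hypothetical helper: only the variants with or without a leading
    apostrophe are handled; other names are returned stripped of whitespace.
    """
    cleaned = name.strip()
    if cleaned.lstrip("'’`").lower() == "s-hertogenbosch":
        return "'s-Hertogenbosch"
    return cleaned


names = ["'s-Hertogenbosch", "s-Hertogenbosch", "Best", " Loon op Zand "]
normalized = [normalize_municipality(n) for n in names]
print(normalized)
```

Applied to the `Gemeente` column (for example with `df['Gemeente'].map(normalize_municipality)`), this would let every file be indexed with the same key.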

In [None]:
def summarize_number_of_connections(df):
    return df.groupby('Gemeente')['Aantal aansluitingen met opwekinstallatie'].sum()

In [None]:
def summarize_opgesteld_vermogen(df):
    return df.groupby('Gemeente')['Opgesteld vermogen'].sum()
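
Both summary helpers follow the same group-and-sum pattern. A small sketch with made-up rows shows what the resulting Series looks like and how a single municipality is looked up in it:

```python
import pandas as pd

# Made-up rows: several records per municipality, as in the per-area Enexis files.
rows = pd.DataFrame({
    "Gemeente": ["Best", "Best", "Loon op Zand"],
    "Aantal aansluitingen met opwekinstallatie": [10, 5, 7],
    "Opgesteld vermogen": [30.0, 12.5, 20.0],
})

connections = rows.groupby("Gemeente")["Aantal aansluitingen met opwekinstallatie"].sum()
power = rows.groupby("Gemeente")["Opgesteld vermogen"].sum()

# The result is indexed by municipality name, so a total can be read by label.
print(connections["Best"], power["Best"])
```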

Load the data for the five Enexis snapshots in the 'decentrale opwek' files:

In [None]:
decentral_generation_012020 = load_enexis_data('../data/Enexis_decentrale_opwek_kv_(zon_pv)_01012020.csv')
decentral_generation_012020 = filter_on_4_municipalities(decentral_generation_012020)

decentral_generation_072020 = load_enexis_data('../data/Enexis_decentrale_opwek_kv_(zon_pv)_01072020.csv')
decentral_generation_072020 = filter_on_4_municipalities(decentral_generation_072020)

decentral_generation_012021 = load_enexis_data('../data/Enexis_decentrale_opwek_kv_(zon_pv)_01012021.csv')
decentral_generation_012021 = filter_on_4_municipalities(decentral_generation_012021)

decentral_generation_072021 = load_enexis_data('../data/Enexis_decentrale_opwek_kv_(zon_pv)_01072021.csv')
decentral_generation_072021 = filter_on_4_municipalities(decentral_generation_072021)   

decentral_generation_012022 = load_enexis_data('../data/Enexis_decentrale_opwek_kv_(zon_pv)_01012022.csv')
decentral_generation_012022 = filter_on_4_municipalities(decentral_generation_012022)     

The data now contains only the selected municipalities. Get the total number of PV connections per municipality:

In [None]:
totalNumberOfPVConnections012020 = summarize_number_of_connections(decentral_generation_012020)
totalNumberOfPVConnections072020 = summarize_number_of_connections(decentral_generation_072020)
totalNumberOfPVConnections012021 = summarize_number_of_connections(decentral_generation_012021)
totalNumberOfPVConnections072021 = summarize_number_of_connections(decentral_generation_072021)
totalNumberOfPVConnections012022 = summarize_number_of_connections(decentral_generation_012022)

Note that 'Arnhem' is not part of the Enexis data, because it is not in the Enexis service area.

Do the same for the installed capacity ('opgesteld vermogen'):

In [None]:
totalOpgesteldVermogen012020 = summarize_opgesteld_vermogen(decentral_generation_012020)
totalOpgesteldVermogen072020 = summarize_opgesteld_vermogen(decentral_generation_072020)
totalOpgesteldVermogen012021 = summarize_opgesteld_vermogen(decentral_generation_012021)
totalOpgesteldVermogen072021 = summarize_opgesteld_vermogen(decentral_generation_072021)
totalOpgesteldVermogen012022 = summarize_opgesteld_vermogen(decentral_generation_012022)

In [None]:
def create_row(id, regio, period, installations, generative_power):
    return {"ID":id, "BedrijfstakkenWoningen":"Woningen",
        "RegioS":regio, "Perioden":period,
        "AantalInstallaties_1":installations, "OpgesteldVermogenVanZonnepanelen_2":generative_power}

Add the data to the CBS data and create one graph from it.

In [None]:
best = "Best"
den_bosch = "'s-Hertogenbosch"
den_bosch_july_2020 = "s-Hertogenbosch"  # the July 2020 file omits the leading apostrophe
loon_op_zand = "Loon op Zand"

# One entry per Enexis snapshot: period, connection totals, installed power totals
# and the key under which 's-Hertogenbosch appears in that file.
# Periods are written as zero-padded year-month so they parse consistently on the temporal axis.
snapshots = [
    ("2020-01", totalNumberOfPVConnections012020, totalOpgesteldVermogen012020, den_bosch),
    ("2020-07", totalNumberOfPVConnections072020, totalOpgesteldVermogen072020, den_bosch_july_2020),
    ("2021-01", totalNumberOfPVConnections012021, totalOpgesteldVermogen012021, den_bosch),
    ("2021-07", totalNumberOfPVConnections072021, totalOpgesteldVermogen072021, den_bosch),
    ("2022-01", totalNumberOfPVConnections012022, totalOpgesteldVermogen012022, den_bosch),
]

enexis_rows = []
row_id = 67890
for period, connections, power, den_bosch_key in snapshots:
    for label, key in [("Best", best), ("'s-Hertogenbosch", den_bosch_key), ("Loon op Zand", loon_op_zand)]:
        enexis_rows.append(create_row(row_id, f"{label} (Enexis)", period, connections[key], power[key]))
        row_id += 1

# DataFrame.append was removed in pandas 2.0, so combine the rows with pd.concat instead.
df = pd.concat([df_zonnestroom_2019, pd.DataFrame(enexis_rows)], ignore_index=True)


Logarithmic scale

In [None]:
combinedDataChart_log_scale = alt.Chart(df).mark_point().encode(
    alt.X(alt.repeat("column"), type='temporal', title='Year'),
    alt.Y(alt.repeat("row"), type='quantitative',
          scale=alt.Scale(type="log")),
    color='RegioS:N'
).properties(
    width=500,
    height=400
).repeat(
    column=['Perioden'],
    row=['AantalInstallaties_1', 'OpgesteldVermogenVanZonnepanelen_2'],
    
)
combinedDataChart_log_scale

Linear scale

In [None]:
combinedDataChart = alt.Chart(df).mark_point().encode(
    alt.X(alt.repeat("column"), type='temporal', title='Year'),
    alt.Y(alt.repeat("row"), type='quantitative'),
    color='RegioS:N'
).properties(
    width=500,
    height=400
).repeat(
    column=['Perioden'],
    row=['AantalInstallaties_1', 'OpgesteldVermogenVanZonnepanelen_2'],
    
)
combinedDataChart

## Observations:

Comparing the municipalities shows the same general shape in the data, with the level depending mainly on the size of the municipality.

The CBS zonnestroom data for 2019 covers all of 2019, up to 31-12-2019. The Enexis data is from 1-1-2020, so these values, even though they appear to be 'a year' apart, describe about the same situation. Therefore the zonnestroom data has been set to December of each year. There is no Enexis value for Arnhem because it is not in the Enexis service area. Therefore the Arnhem data has been left out of the graphs.

For the municipalities observed, we see close alignment between the CBS zonnestroom and Enexis data for the number of installations. For the installed power, the Enexis values are slightly higher.
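
One way to put a number on "slightly higher" is the relative difference between the Enexis value of 1 January and the CBS end-of-year value. The sketch below uses made-up figures and a hypothetical helper, `relative_difference`; it shows the calculation and is not a result from this notebook.

```python
def relative_difference(cbs_value: float, enexis_value: float) -> float:
    """Relative difference of the Enexis value with respect to the CBS value."""
    return (enexis_value - cbs_value) / cbs_value


# Made-up installed power values for one municipality (same unit in both sources assumed).
cbs_power = 10_000.0     # CBS, end of 2019
enexis_power = 10_400.0  # Enexis, 1 January 2020

diff = relative_difference(cbs_power, enexis_power)
print(f"Enexis is {diff:.1%} higher than CBS")
```

A positive value means the Enexis figure is higher. Running this per municipality on the combined DataFrame would make the comparison in the graphs explicit.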
