# EDA

Let's have a first look at the data. 

The variables in the files are:
* S1: heat power
* S2: flow rate
* S3: leader temperature (initial temperature)
* S4: return temperature

For the years 2020 and 2021 we also have the outside temperature at the location of the heat plant. 

We also have: 
* public holidays in Bavaria (a lot!)
* school vacations in Bavaria

We will first focus only on the years 2020 and 2021, since we have all the data for them. For 2022, the temperature is missing.

## Read the data

In [None]:
import pandas as pd
import numpy as np
#import seaborn as sns
import plotly.graph_objs as go
import plotly.express as px
from plotly.subplots import make_subplots

In [None]:
df_2020 = pd.read_csv("../raw_data/2020_heat.csv", delimiter=";", index_col=False)
df_2021 = pd.read_csv("../raw_data/2021_heat.csv", delimiter=";", index_col=False)
df_2022 = pd.read_csv("../raw_data/2022_heat.csv", delimiter=";", index_col=False)

df = pd.concat([df_2020, df_2021, df_2022], ignore_index=True)
df.rename({"S1": "heat_power", "S2": "flow_rate", "S3": "leader_temp", "S4": "return_temp"}, axis=1, inplace=True)
df

In [None]:
df.info()

In [None]:
fig = px.line(data_frame=df, x="Timestamp", y="heat_power")
fig.show()

As we can see, there are some missing values in October 2020. There are also some values of zero, and one extremely large peak in the summer of 2021. I don't know what to make of these yet, so I will ignore them for now.
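
A quick way to put numbers on these observations is to count the gaps and zeros and to locate the peak. The sketch below uses a small made-up frame (the name `toy` and all values are invented for illustration, not taken from the data):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data: one gap, one zero, one extreme peak
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2021-07-01", periods=6, freq="D"),
    "heat_power": [120.0, np.nan, 0.0, 130.0, 9000.0, 125.0],
})

n_missing = toy["heat_power"].isna().sum()       # number of missing values
n_zero = (toy["heat_power"] == 0).sum()          # number of exact zeros
peak_row = toy.loc[toy["heat_power"].idxmax()]   # row holding the largest value

print(n_missing, n_zero, peak_row["Timestamp"])
```

The same three lines should work on `df` as long as the `heat_power` column is numeric.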

## Fun with datetime
Everyone's favorite pastime: dealing with datetimes. Basically, I want the naive but local datetime as a variable. For now, I am not going to use the datetime as the index, because of things like daylight saving time.
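
To illustrate the daylight saving problem, here is a minimal sketch with two hypothetical UTC timestamps around the end of summer time in 2020. Dropping the time zone with `tz_localize(None)` yields the naive local time, and the two distinct instants collapse onto the same wall-clock value, which is why a naive local datetime makes a poor unique index:

```python
import pandas as pd

# Two hypothetical UTC timestamps, one hour apart, around the switch
# from summer time to winter time on 2020-10-25
utc = pd.Series(pd.to_datetime(["2020-10-25 00:30:00", "2020-10-25 01:30:00"], utc=True))

# Convert to local time (still tz-aware), then drop the tz info to get naive local time
local = utc.dt.tz_convert("Europe/Berlin")
naive_local = local.dt.tz_localize(None)

print(local.tolist())
print(naive_local.tolist())
print(naive_local.duplicated().any())
```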


In [None]:
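# Parse the timestamps, normalize them to UTC, then convert to local time.
# Note: the result is tz-aware (Europe/Berlin), not naive.
df.Timestamp = pd.to_datetime(df.Timestamp, infer_datetime_format=True, utc=True).dt.tz_convert(tz="Europe/Berlin")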


In [None]:
#df.Timestamp = df.Timestamp.dt.tz_convert(tz="Europe/Berlin")

In [None]:
df.info()


In [None]:
df.head()

### Check for correlations
My guess would be that heat power is proportional to temperature difference times flow rate.

In [None]:
df["heat_flow_calc"] = (df.leader_temp - df.return_temp) * df.flow_rate
df.head()

In [None]:
# fig = px.scatter_matrix(df.drop("Timestamp", axis=1))
# fig.show()

In [None]:
fig = px.scatter(data_frame=df, x="heat_power", y="heat_flow_calc")
fig.show()

That is indeed the case. So for now, we will drop all columns except the heat power. 
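
Physically, this matches the usual relation that heat power equals mass flow times specific heat capacity times temperature difference. Besides the scatter plot, the proportionality can be checked numerically with a correlation coefficient and the slope of a line through the origin. The sketch below runs on synthetic data; the frame `toy` and the constant `k_true` are assumptions made up for illustration, not values from the plant:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: heat_power is built as an assumed constant times
# (leader_temp - return_temp) * flow_rate, so the check should recover that constant
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "flow_rate": rng.uniform(5, 20, size=200),
    "leader_temp": rng.uniform(80, 95, size=200),
    "return_temp": rng.uniform(50, 60, size=200),
})
k_true = 1.16  # assumed proportionality constant, for illustration only
toy["heat_flow_calc"] = (toy.leader_temp - toy.return_temp) * toy.flow_rate
toy["heat_power"] = k_true * toy["heat_flow_calc"]

# Pearson correlation and least-squares slope of a line through the origin
corr = toy["heat_power"].corr(toy["heat_flow_calc"])
slope = (toy["heat_power"] * toy["heat_flow_calc"]).sum() / (toy["heat_flow_calc"] ** 2).sum()

print(round(corr, 3), round(slope, 3))
```

On the real data, a correlation close to 1 would support dropping the three sensor columns as redundant.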

In [None]:
df.drop(["flow_rate", "leader_temp", "return_temp", "heat_flow_calc"], axis=1, inplace=True)
df.head()

## Feature engineering
Let's play with the datetime a bit.

In [None]:
#df["date"] = df.Timestamp.dt.date
df["time"] = df.Timestamp.dt.time
df["hour"] = df.Timestamp.dt.hour
df["week_nr"] = df.Timestamp.dt.isocalendar().week
df["weekday"] = df.Timestamp.dt.weekday

In [None]:
df.head()

I want to look at how the hour, week, and weekday influence our heat demand.

In [None]:
# fig = make_subplots(rows=1, cols=3)
# fig.add_scatter(x=df.time, y=df.heat_power, row=1, col=1, mode="markers")
# fig.add_scatter(x=df.weekday, y=df.heat_power, row=1, col=2, mode="markers")
# fig.add_scatter(x=df.week_nr, y=df.heat_power, row=1, col=3, mode="markers")
# fig.show()

OK, we do not really see a lot. To get better insight, we should do some proper averaging.
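
One simple form of such averaging is a `groupby` on the time feature. This is a sketch on made-up numbers (the frame `toy` is hypothetical); the box plots that follow show the same grouping with the full spread instead of a single summary value:

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "hour": [8, 8, 9, 9, 17, 17],
    "heat_power": [100.0, 120.0, 150.0, 170.0, 60.0, 80.0],
})

# Mean and median heat power per hour of the day
hourly = toy.groupby("hour")["heat_power"].agg(["mean", "median"])
print(hourly)
```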

In [None]:
df.boxplot(by="hour", column="heat_power")

We see a maximum at about 9:00 and a minimum at about 17:00, with several ups and downs during the night hours.

In [None]:
df.boxplot(by="weekday", column="heat_power")

No clear pattern visible here...

In [None]:
df.boxplot(by="week_nr", column="heat_power")

Should be no big surprise that the heat demand is larger in winter than in summer...

## Look at temperature
For the years 2020 and 2021 we have the outside temperature at the plant location. Let's have a look!

In [None]:
temp2020 = pd.read_csv("../raw_data/2020_temp.csv", delimiter=";", index_col=False)
temp2021 = pd.read_csv("../raw_data/2021_temp.csv", delimiter=";", index_col=False)
df_temp = pd.concat([temp2020, temp2021], ignore_index=True)
df_temp.rename({"S1": "temperature"}, axis=1, inplace=True)
# Convert the temperature frame's own timestamps (not those of df)
df_temp.Timestamp = pd.to_datetime(df_temp.Timestamp, infer_datetime_format=True, utc=True).dt.tz_convert(tz="Europe/Berlin")
df_temp

In [None]:
# merge data frames
df = pd.merge(df, df_temp, how="outer", on="Timestamp")
df

In [None]:
fig = px.scatter(data_frame=df, y="heat_power", x="temperature")
fig.show()

As expected, when the temperature is low, the heat demand is usually higher than at high temperatures.
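
Keep in mind that the outer merge above keeps rows that have no temperature, such as those from 2022. A minimal sketch of how to count them, using two made-up frames (`heat` and `temp` are hypothetical names) and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Made-up frames: heat data for three timestamps, temperature for only two
heat = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2021-12-31 22:00", "2021-12-31 23:00", "2022-01-01 00:00"]),
    "heat_power": [300.0, 310.0, 320.0],
})
temp = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2021-12-31 22:00", "2021-12-31 23:00"]),
    "temperature": [-2.0, -2.5],
})

# indicator=True adds a "_merge" column telling which frame each row came from
merged = pd.merge(heat, temp, how="outer", on="Timestamp", indicator=True)

print(merged["_merge"].value_counts())
print(merged["temperature"].isna().sum())
```

Rows marked `left_only` have heat data but no temperature and would need to be dropped or imputed before modelling on temperature.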

## Holidays and vacations

In [None]:
holy_2020 = pd.read_csv("../raw_data/feiertage_Bayern_2020.csv", index_col=False)
holy_2020