# Assignment 4: Prediction of renewable energy generation

## Context
A friend recently had to sign a new electricity supply contract. The high prices surprised him, so he decided to look into the electricity market. Among a lot of other information, he learned that electricity is also traded on an exchange.
He found the so-called "day-ahead" market, where electricity is traded for the next day, the most interesting. He found out that the price is almost completely determined by the supply, because the demand hardly changes at such short notice (no household uses less lighting and no industrial company stops production at such short notice). He also found out that there are already very good forecasting models for the demand.
The electricity supply, on the other hand, has become much more exciting in recent years. Wind turbines and large photovoltaic plants have hardly any running costs and can therefore undercut any other power plant (in the short term). However, their production strongly depends on the current weather in the area where the respective plant is located. The "conventional" power plants, which then fill the remaining gap to demand, determine the electricity price based on their operating costs. If you know the current oil, coal and gas prices, this is also relatively easy to predict.

He is now convinced that one could make good money with a good forecast of how much electricity wind power and PV will deliver. Since he has heard that you now have some experience with data analysis, he asks you to help him by creating a forecast model that predicts the amount of electricity produced (wind & photovoltaic) based on a weather report.

As he is aware that data is needed for this, he has already obtained some:
 - From "SMARTD" (part of the regulatory authority) the installed production capacity of the different types of power plants ("energy_installed_capacity.csv") and the amount of electricity produced in reality ("energy_produced.csv").
 - Daily records from the DWD (weather service) at many measuring stations. Two files, one with the measured values ("weather.csv"), one with further information (e.g. location) of the measuring stations ("weatherstations.csv").
Both sources (all four data sets) cover the period from 2016 to 2021 inclusive.

He is also confident that, if the model is good enough to be worthwhile, he can buy sufficiently good weather forecasts in whatever form they are needed. So there are no limitations on how the data is grouped or preprocessed.

## Assignment

Develop a forecast model, evaluate it and answer the question whether it would be useful for the intended use!
Deliver a Jupyter notebook (able to run on the server) that includes the following parts:
- Data analysis and exploration, including preparation for the model. This includes (but is not limited to):
  - understanding the data (continuous/categorical, range of values ...)
  - unifying the time base
  - detecting and dealing with missing data points
  - possibly necessary simplifications
- Develop and evaluate a model for the forecast.
- Conclude whether (at least on the basis of the data) a meaningfully usable forecasting model could be achieved. 
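
As a minimal sketch of "unifying the time base" and "detecting missing data points", the snippet below aggregates sub-daily production readings to daily means and counts the gaps per day. The quarter-hourly resolution, the column name and all values are assumptions made for illustration; the real files may look different.

```python
import numpy as np
import pandas as pd

# Hypothetical quarter-hourly production readings for two days (values are made up).
idx = pd.date_range("2016-01-01", periods=2 * 96, freq="15min")
produced = pd.DataFrame({"Photovoltaics[MW]": np.arange(len(idx), dtype=float)}, index=idx)
produced.iloc[10:14] = np.nan  # simulate a short gap in the raw data

# Count missing readings per day before aggregating, so gaps stay visible.
missing_per_day = produced["Photovoltaics[MW]"].isna().astype(int).resample("D").sum()

# Bring the series to the daily time base of the weather records.
daily_mean = produced.resample("D").mean()  # mean() skips NaN values

print(missing_per_day)
print(daily_mean)
```

Counting the gaps first makes it possible to decide per day whether the daily mean is still trustworthy or whether the day should be dropped or interpolated.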

Also leave draft steps in the Jupyter notebook so that we can understand your approach.
For each decision that is relevant to the result, give a brief justification if it is not clear from the context. For example, after a parameter analysis comparable to Task 3 in Assignment 3, no justification is needed for the choice of epochs and learning rate. No justification is necessary in an "exploratory phase" either, as it forms the basis for later justifications.

**Hints:**
 - Do not underestimate the importance of the data preprocessing.
 - Remember what we discussed in the lectures, where we covered different ways to solve different problems. For time reasons we often chose just one, but that one is not necessarily the right one for this assignment.
 - The data is quite "raw": it has some faults, may not be in the shape you need, and can include unnecessary information.
 - Use your "common sense" especially during the preprocessing stage.
 - You can of course add an arbitrary number of additional cells.
 - If you want to use additional Python libraries, just ask; we will usually be able to provide them.
 - If you have more than one idea to solve a problem, allow yourself to experiment a bit! There is not only one solution, but make very clear at the end which is your final result.
 - There are a total of 60 points for EDA/data preparation and the ANN, distributed roughly equally; depending on where you make certain decisions, the split may shift a bit.
 - For EDA/data preparation you may, as an example, select the three measured parameters and compare the distributions of their values over time, location, etc.
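
The last hint can be sketched with a pandas `groupby`: one summary per calendar month (time) and one per latitude band (location). The long-format table `weather_long`, the column `lat_band` and the random values are hypothetical stand-ins, not the real weather data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2016-01-01", "2016-12-31", freq="D")

# Hypothetical long-format table: one row per day and latitude band (values are random).
frames = []
for lat in (48, 51, 54):
    frames.append(pd.DataFrame({
        "Date": dates,
        "lat_band": lat,
        "Sunshine duration": rng.uniform(0, 12, len(dates)),
    }))
weather_long = pd.concat(frames, ignore_index=True)

# Distribution per calendar month (seasonality) ...
by_month = weather_long.groupby(weather_long["Date"].dt.month)["Sunshine duration"].describe()
# ... and per latitude band (location).
by_band = weather_long.groupby("lat_band")["Sunshine duration"].describe()

print(by_month[["mean", "min", "max"]])
print(by_band[["mean", "min", "max"]])
```

The same two groupings can be repeated for each measured parameter to see where the distributions differ.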

## Exploratory Data Analysis & Data preparation (~30 Points)


In [1]:
import pandas as pd 
import numpy as np 
import datetime

# keras imports for building the neural network
from keras.models import Sequential, load_model
from keras.layers import Dense, Dropout, Activation  # the keras.layers.core path is not available in newer Keras versions
from keras.layers import GaussianNoise
from keras.layers import InputLayer


#sklearn imports for preprocessing 
from sklearn.model_selection import train_test_split
from sklearn import preprocessing

Init Plugin
Init Graph Optimizer
Init Kernel


In [2]:
# importing necessary data
weather_df = pd.read_csv('weather.csv')
# drop the columns assumed to have no influence on the produced energy
weather_df_droplist= ['Minimum Temperature', 'Average Temperature', 'Maximum Temperature', 'relative humidity', 'average air preassure', 'Rain']
weather_df.drop( weather_df_droplist, inplace= True, axis = 1)

ep_df = pd.read_csv('energy_produced.csv', sep= ';' )
#only need PV and Wind
ep_df_droplist= ['Water power[MW]','Biomass[MW]','Nuclear power[MW]','Brown coal[MW]','Coal[MW]','Natural gas[MW]','Pump storage[MW]','Other conventional[MW]', 'Other renewables[MW]'] 
ep_df.drop(ep_df_droplist, inplace= True, axis = 1)

eic_df= pd.read_csv('energy_installed_capacity.csv', sep = ';')
#only need PV and Wind 
eic_df_droplist = ['Biomass[MW]', 'Water power[MW]','Other renewables[MW]', 'Nuclear power[MW]', 'Brown coal[MW]', 'Coal[MW]','Natural gas[MW]','Pump storage[MW]','Other conventional[MW]'] 
eic_df.drop(eic_df_droplist, inplace = True, axis = 1)

ws_df = pd.read_csv('weatherstations.csv')
ws_df_droplist = ['Operator']
ws_df.drop(ws_df_droplist, inplace = True, axis= 1)

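# converting date columns from string to datetime type and setting the date as index
# ep_df['Date'] is left as a string because preprocess_energy() filters it with '%d.%m.%Y' strings
#ep_df['Date'] = pd.to_datetime(ep_df['Date'], format= '%d.%m.%Y')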
eic_df['Date'] = pd.to_datetime(eic_df['Date'], format= '%d.%m.%Y')
weather_df['Date'] = pd.to_datetime(weather_df['Date'], format= '%Y-%m-%d')

#ep_df.set_index('Date', inplace = True)
eic_df.set_index('Date', inplace = True)
#weather_df.set_index('Date', inplace = True)

# calculating the utilization rate of the energy installed capacity over each day 

def preprocess_energy():
    start_date = datetime.date(2016, 1, 1)   # setting start date for the loop, because there are more than one row per date
    end_date = datetime.date(2021, 12, 31)     # setting end date for loop #2021, 12, 31 subset of the given 
    delta = datetime.timedelta(days=1)       # time increase loop    
    
    onshore_list = []                        # lists to fill with data 
    offshore_list = []
    pv_list= []
    date_list = []

    while start_date <= end_date:
        day_str = start_date.strftime('%d.%m.%Y')               # ep_df['Date'] is still a string in this format
        year_str = start_date.strftime('%Y')                    # year used to look up the installed capacity
        filt_ep = (ep_df['Date'] == day_str)                    # filter each day in the given dataframe
        ep_date_mean = ep_df[filt_ep].mean(numeric_only = True) # mean of each numeric column for that day (DataFrame.resample would be an alternative)
        # divide the daily mean by the installed capacity and store the utilization rate in %
        # columns are addressed by position via .iloc (assumed order: offshore, onshore, PV)
        offshore_list.append(round(ep_date_mean.iloc[0]/eic_df.loc[year_str,'Wind Offshore[MW]'].values[0]*100, 2))
        onshore_list.append(round(ep_date_mean.iloc[1]/eic_df.loc[year_str,'Wind Onshore[MW]'].values[0]*100, 2))
        pv_list.append(round(ep_date_mean.iloc[2]/eic_df.loc[year_str,'Photovoltaics[MW]'].values[0]*100, 2))
        date_list.append(start_date)
        start_date += delta

    df = pd.DataFrame(zip(date_list, pv_list, offshore_list, onshore_list), columns= ['Date', 'PV [%]', 'Offshore [%]', 'Onshore [%]']) # building a new dataframe for further tasks
    return df

p_ep_df = preprocess_energy()

# Dividing the weather stations into groups by latitude

def preprocess_weather():
    
    lat_list = [47,48,49,50,51,52,53,54]  # integer latitude bands covering Germany
    df_list = []
    ws_df['geographic latitude'] = ws_df['geographic latitude'].astype(int) # truncate float latitude to int so it can be filtered by band
    
    for i in lat_list:
        filt_df = ws_df['geographic latitude'] == i            # filter only stations in latitude band i
        station_ids = ws_df[filt_df]['Stations_ID'].tolist()   # station ids for latitude band i (avoids shadowing the builtin list)
        filt_df = weather_df['Stations_ID'].isin(station_ids)  # create filter for these station ids
        st_weather_df = weather_df[filt_df]                    # apply filter to df and save
        
        start_date = datetime.date(2016, 1, 1)   # setting start date for the loop, because there are more than one row per date
        end_date = datetime.date(2021, 12, 31)   # setting end date for loop
        delta = datetime.timedelta(days=1)       # time increase loop
    
        avg_wind_list = []
        max_wind_list = []
        sun_list= []
        cloud_list = []
        date_list = []

        while start_date <= end_date:
            # weather_df['Date'] is datetime64 here, so compare with a Timestamp;
            # a '%d.%m.%Y' string would be parsed month-first and swap day and month for days 1-12
            filt_weather = (st_weather_df['Date'] == pd.Timestamp(start_date))
            weather_date_mean = st_weather_df[filt_weather].mean(numeric_only = True)  # mean over all stations of the band; NaN is skipped, 0 counts as a value
            # columns are addressed by position via .iloc (assumed order: Stations_ID, avg. wind, max. wind, sunshine, cloud)
            avg_wind_list.append(round(weather_date_mean.iloc[1], 2))
            max_wind_list.append(round(weather_date_mean.iloc[2], 2))
            sun_list.append(round(weather_date_mean.iloc[3], 2))
            cloud_list.append(round(weather_date_mean.iloc[4], 2))
            date_list.append(start_date)
            start_date += delta

        # build the frame once per latitude band, after the day loop has finished
        df = pd.DataFrame(zip(date_list, avg_wind_list, max_wind_list, sun_list, cloud_list), columns= ['Date', 'Avg. Windspeed', 'Max. Windspeed', 'Sunshine duration', 'Cloud'])
        df = df.set_index('Date')
        df_list.append(df)

    return df_list

test_list=preprocess_weather()

p_weather_df = pd.concat(test_list, axis = 1) # concatenate the eight latitude-band frames column-wise on the date index (8 bands x 4 features = 32 columns)

#powerweather_df = p_weather_df.set_index('Date').join(p_ep_df.set_index('Date')) # Merging the two dataframes together by date as index 
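
The day-by-day loops in this cell can also be expressed with a pandas `groupby`. The following is only a sketch on made-up data: the frames `ep` and `eic`, the integer year index of `eic` and all values are assumptions for illustration, not the contents of the real CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for ep_df and eic_df (column names and values are assumptions).
ep = pd.DataFrame({
    "Date": ["31.12.2016", "31.12.2016", "01.01.2017", "01.01.2017"],
    "Photovoltaics[MW]": [100.0, 300.0, 150.0, 450.0],
})
eic = pd.DataFrame({"Photovoltaics[MW]": [40000.0, 42000.0]}, index=[2016, 2017])  # capacity per year

ep["Date"] = pd.to_datetime(ep["Date"], format="%d.%m.%Y")

# One groupby replaces the explicit loop over all days.
daily = ep.groupby("Date")[["Photovoltaics[MW]"]].mean()

# Look up the installed capacity of each day's year and convert to a utilization rate in %.
capacity = eic.loc[daily.index.year, "Photovoltaics[MW]"].to_numpy()
daily["PV [%]"] = (daily["Photovoltaics[MW]"] / capacity * 100).round(2)

print(daily)
```

Parsing the dates once with an explicit `format` also avoids comparing dates as strings inside the loop.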

In [11]:
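# note: preprocessing.normalize rescales each row (one day) to unit L2 norm; it does not scale each feature column
n = preprocessing.normalize(p_weather_df)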
scaled_weather_df = pd.DataFrame(n)

n = preprocessing.normalize(p_ep_df.set_index('Date'))
scaled_ep_df = pd.DataFrame(n)

weather_train, weather_test, ep_train, ep_test = train_test_split(scaled_weather_df , scaled_ep_df, test_size = 0.25)

weather_train.shape

(1644, 32)
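
For comparison, `train_test_split` shuffles the rows by default, and `preprocessing.normalize` works per row rather than per feature. A possible alternative, sketched here with NumPy on random data, is a chronological split combined with a column-wise standardization that uses only statistics from the training rows. The arrays `X` and `y` and the 75/25 split are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))  # hypothetical features, rows in time order
y = rng.uniform(0, 100, size=(100, 1))             # hypothetical utilization rates in %

# Chronological split: the last 25 % of the days form the test set.
split = int(len(X) * 0.75)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# Standardize each feature column with statistics from the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Fitting the scaling on the training rows alone keeps information from the test period out of the model inputs.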

## Development and evaluation of the ANN (~30 Points)


In [26]:
model = Sequential()
model.add(InputLayer(input_shape=(32,)))

model.add(Dense(200))
model.add(Activation('relu'))

model.add(Dense(3))
model.add(Activation('sigmoid'))

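# note: the three targets are continuous utilization rates, i.e. a regression task;
# 'categorical_crossentropy' and 'accuracy' are designed for one-hot class labels
model.compile(loss= 'categorical_crossentropy', optimizer = 'adam', metrics= ['accuracy'])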


In [27]:
model.fit(weather_train, ep_train, epochs = 10, verbose=True, validation_data=(weather_test, ep_test))

Epoch 1/10

2022-03-04 14:32:36.997449: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


Epoch 2/10
 9/52 [====>.........................] - ETA: 0s - loss: 1.5695 - accuracy: 0.7382

2022-03-04 14:32:37.484757: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:112] Plugin optimizer for device_type GPU is enabled.


Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x16e3dbca0>
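
Because the targets are continuous utilization rates, regression metrics compared against a naive baseline are one way to judge whether the forecast is useful. The sketch below uses NumPy with made-up numbers; `y_pred` is a hypothetical stand-in for the output of `model.predict(...)`, and the helper functions `mae` and `r2` are defined here for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, in the unit of the target."""
    return float(np.mean(np.abs(y_true - y_pred)))

def r2(y_true, y_pred):
    """Coefficient of determination; 1 is perfect, values <= 0 are no better than the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical daily PV utilization rates in % (made-up numbers).
y_train = np.array([10.0, 12.0, 8.0, 14.0])
y_test = np.array([9.0, 11.0, 13.0, 7.0])
y_pred = np.array([10.0, 10.0, 12.0, 8.0])  # stand-in for model.predict(...)

# Naive baseline: always predict the mean of the training targets.
baseline = np.full_like(y_test, y_train.mean())

print("model    MAE:", mae(y_test, y_pred), "R2:", r2(y_test, y_pred))
print("baseline MAE:", mae(y_test, baseline), "R2:", r2(y_test, baseline))
```

A model that does not clearly beat such a baseline on held-out days is unlikely to be worth buying weather forecasts for.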

## Summary (10 Points)

Is this model usable for predicting the amount of generated renewable energy based on weather data?