# Production Data Cleaning

In [1]:
import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt 
import seaborn as sns

pd.set_option('display.max_columns', None)

## Loading Data

The data used in this notebook comes from
> https://ned.nl/

In [2]:
df = pd.read_csv('../data/ProductionData/energie_opwek.csv')
df

Unnamed: 0,validfrom (UTC),validto (UTC),point,type,granularity,timezone,activity,classification,capacity (kW),volume (kWh),percentage,emission (kg CO2),emissionfactor (kg CO2/kWh)
0,2020-12-31 23:00:00,2021-01-01 00:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,418750,418750,0.536859,125625,0.3
1,2021-01-01 00:00:00,2021-01-01 01:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,416000,416000,0.533333,124800,0.3
2,2021-01-01 01:00:00,2021-01-01 02:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,430250,430250,0.551602,129075,0.3
3,2021-01-01 02:00:00,2021-01-01 03:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436750,436750,0.559936,131025,0.3
4,2021-01-01 03:00:00,2021-01-01 04:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436000,436000,0.558974,130800,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
271579,2023-12-31 18:00:00,2023-12-31 19:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3144000,3144000,1.431400,0,0.0
271580,2023-12-31 19:00:00,2023-12-31 20:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3102250,3102250,1.412392,0,0.0
271581,2023-12-31 20:00:00,2023-12-31 21:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3086249,3086249,1.405107,0,0.0
271582,2023-12-31 21:00:00,2023-12-31 22:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3051749,3051749,1.389403,0,0.0


## Data cleaning 

To start the data cleaning process, I display the dataframe to see what needs to be cleaned at first glance.

In [3]:
df

Unnamed: 0,validfrom (UTC),validto (UTC),point,type,granularity,timezone,activity,classification,capacity (kW),volume (kWh),percentage,emission (kg CO2),emissionfactor (kg CO2/kWh)
0,2020-12-31 23:00:00,2021-01-01 00:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,418750,418750,0.536859,125625,0.3
1,2021-01-01 00:00:00,2021-01-01 01:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,416000,416000,0.533333,124800,0.3
2,2021-01-01 01:00:00,2021-01-01 02:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,430250,430250,0.551602,129075,0.3
3,2021-01-01 02:00:00,2021-01-01 03:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436750,436750,0.559936,131025,0.3
4,2021-01-01 03:00:00,2021-01-01 04:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436000,436000,0.558974,130800,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
271579,2023-12-31 18:00:00,2023-12-31 19:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3144000,3144000,1.431400,0,0.0
271580,2023-12-31 19:00:00,2023-12-31 20:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3102250,3102250,1.412392,0,0.0
271581,2023-12-31 20:00:00,2023-12-31 21:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3086249,3086249,1.405107,0,0.0
271582,2023-12-31 21:00:00,2023-12-31 22:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3051749,3051749,1.389403,0,0.0


From the data above I can see that several column names include their unit of measurement, which I remove by renaming those columns.

In [4]:

# Rename columns
df.rename(columns={
    'capacity (kW)': 'capacity',
    'volume (kWh)': 'volume',
    'emission (kg CO2)': 'emission',
    'emissionfactor (kg CO2/kWh)': 'emissionfactor'
}, inplace=True)
df

Unnamed: 0,validfrom (UTC),validto (UTC),point,type,granularity,timezone,activity,classification,capacity,volume,percentage,emission,emissionfactor
0,2020-12-31 23:00:00,2021-01-01 00:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,418750,418750,0.536859,125625,0.3
1,2021-01-01 00:00:00,2021-01-01 01:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,416000,416000,0.533333,124800,0.3
2,2021-01-01 01:00:00,2021-01-01 02:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,430250,430250,0.551602,129075,0.3
3,2021-01-01 02:00:00,2021-01-01 03:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436750,436750,0.559936,131025,0.3
4,2021-01-01 03:00:00,2021-01-01 04:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436000,436000,0.558974,130800,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...
271579,2023-12-31 18:00:00,2023-12-31 19:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3144000,3144000,1.431400,0,0.0
271580,2023-12-31 19:00:00,2023-12-31 20:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3102250,3102250,1.412392,0,0.0
271581,2023-12-31 20:00:00,2023-12-31 21:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3086249,3086249,1.405107,0,0.0
271582,2023-12-31 21:00:00,2023-12-31 22:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3051749,3051749,1.389403,0,0.0
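The explicit mapping above is easy to read, but the same renaming can also be done with a pattern. The following is a minimal sketch on a toy frame (the `sample` and `renamed` names are made up for illustration) that strips any trailing parenthesised unit. Note that, applied to the real data, it would also turn `validfrom (UTC)` into `validfrom`, which differs from what this notebook does.

```python
import re

import pandas as pd

# Toy frame that mimics the style of the original column names.
sample = pd.DataFrame({
    'capacity (kW)': [418750],
    'volume (kWh)': [418750],
    'emission (kg CO2)': [125625],
    'emissionfactor (kg CO2/kWh)': [0.3],
    'percentage': [0.536859],
})

# Strip a trailing " (unit)" suffix from every column name;
# names without such a suffix are left unchanged.
renamed = sample.rename(columns=lambda name: re.sub(r'\s*\([^)]*\)$', '', name))

print(list(renamed.columns))
```

Returning a new frame from `rename` instead of using `inplace=True` keeps the original frame available for comparison.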


With the column names simplified, it is time to look at the information about the dataframe.
I left the validfrom and validto columns out of the rename since we only need one of them.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271584 entries, 0 to 271583
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   validfrom (UTC)  271584 non-null  object 
 1   validto (UTC)    271584 non-null  object 
 2   point            271584 non-null  object 
 3   type             271584 non-null  object 
 4   granularity      271584 non-null  object 
 5   timezone         271584 non-null  object 
 6   activity         271584 non-null  object 
 7   classification   271584 non-null  object 
 8   capacity         271584 non-null  int64  
 9   volume           271584 non-null  int64  
 10  percentage       271584 non-null  float64
 11  emission         271584 non-null  int64  
 12  emissionfactor   271584 non-null  float64
dtypes: float64(2), int64(3), object(8)
memory usage: 26.9+ MB


Here it is visible that the validfrom and validto columns have the object dtype rather than a datetime dtype, so we fix that in the next code cell.

In [6]:
df['datetime'] = pd.to_datetime(df['validfrom (UTC)'])

# Drop unnecessary columns
df.drop(columns=['validfrom (UTC)', 'validto (UTC)'], inplace=True)

# Reorder columns
cols = ['datetime'] + [col for col in df if col != 'datetime']
df = df[cols]
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271584 entries, 0 to 271583
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   datetime        271584 non-null  datetime64[ns]
 1   point           271584 non-null  object        
 2   type            271584 non-null  object        
 3   granularity     271584 non-null  object        
 4   timezone        271584 non-null  object        
 5   activity        271584 non-null  object        
 6   classification  271584 non-null  object        
 7   capacity        271584 non-null  int64         
 8   volume          271584 non-null  int64         
 9   percentage      271584 non-null  float64       
 10  emission        271584 non-null  int64         
 11  emissionfactor  271584 non-null  float64       
dtypes: datetime64[ns](1), float64(2), int64(3), object(6)
memory usage: 24.9+ MB


The validto column has been removed, and validfrom has been converted into a new datetime column. This column represents the time at which each measurement started.
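Before dropping validto, it can be worth confirming that every interval really spans one hour, so that the start time alone describes each row. The snippet below is a sketch on two sample rows taken from the output above; the explicit `format` string is an assumption based on how the timestamps are printed, and the variable names are illustrative.

```python
import pandas as pd

# Two rows in the same shape as the raw data.
sample = pd.DataFrame({
    'validfrom (UTC)': ['2020-12-31 23:00:00', '2021-01-01 00:00:00'],
    'validto (UTC)': ['2021-01-01 00:00:00', '2021-01-01 01:00:00'],
})

# Assumed timestamp layout; an explicit format makes parsing strict.
fmt = '%Y-%m-%d %H:%M:%S'
start = pd.to_datetime(sample['validfrom (UTC)'], format=fmt)
end = pd.to_datetime(sample['validto (UTC)'], format=fmt)

# True only when every interval is exactly one hour long.
all_hourly = bool(((end - start) == pd.Timedelta(hours=1)).all())
print(all_hourly)
```

If the check fails on the full dataset, keeping both columns (or the interval length) would be safer than dropping validto.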

The next step is to check whether the data contains any NaN values.

In [7]:
df.isnull().sum()

datetime          0
point             0
type              0
granularity       0
timezone          0
activity          0
classification    0
capacity          0
volume            0
percentage        0
emission          0
emissionfactor    0
dtype: int64

There are no NaN values in the dataset, so it can be exported to a cleaned CSV file after one last filtering step.
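To illustrate what the check reports when values are missing, here is a small sketch on made-up data (not the production dataset) with one deliberate gap. It also shows a single boolean that could be used as a guard before exporting.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing capacity value.
sample = pd.DataFrame({
    'capacity': [416000, np.nan, 436750],
    'volume': [416000, 430250, 436750],
})

# Per-column count of missing values, as in the cell above.
missing = sample.isnull().sum()
print(missing)

# Single flag: True if any cell in the frame is missing.
has_missing = bool(sample.isnull().any().any())
print(has_missing)
```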

While working with the data, I found records from 2016 in the production dataset. To ensure that we only keep data from 2021 onward, we remove the older records here.

In [8]:
df = df[df['datetime'].dt.year >= 2021]
df

Unnamed: 0,datetime,point,type,granularity,timezone,activity,classification,capacity,volume,percentage,emission,emissionfactor
1,2021-01-01 00:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,416000,416000,0.533333,124800,0.3
2,2021-01-01 01:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,430250,430250,0.551602,129075,0.3
3,2021-01-01 02:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436750,436750,0.559936,131025,0.3
4,2021-01-01 03:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,436000,436000,0.558974,130800,0.3
5,2021-01-01 04:00:00,Nederland,WastePower,Hour,UTC,Providing,Current,435500,435500,0.558333,130650,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...
271579,2023-12-31 18:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3144000,3144000,1.431400,0,0.0
271580,2023-12-31 19:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3102250,3102250,1.412392,0,0.0
271581,2023-12-31 20:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3086249,3086249,1.405107,0,0.0
271582,2023-12-31 21:00:00,Nederland,WindOffshoreC,Hour,UTC,Providing,Current,3051749,3051749,1.389403,0,0.0
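Note that the filter is applied to UTC timestamps, so the row starting at 2020-12-31 23:00:00 (index 0) is dropped as well, as the output above shows. The sketch below reproduces this on toy data; the 2016 timestamp is invented for illustration, and `.copy()` is added so that later modifications do not act on a view of the original frame.

```python
import pandas as pd

# Toy frame: one old record, one record just before 2021 (UTC), one in 2021.
sample = pd.DataFrame({
    'datetime': pd.to_datetime([
        '2016-06-01 12:00:00',
        '2020-12-31 23:00:00',
        '2021-01-01 00:00:00',
    ]),
    'volume': [1, 2, 3],
})

# Boolean mask on the year component, then an explicit copy of the result.
recent = sample[sample['datetime'].dt.year >= 2021].copy()
print(recent)
```

If the first hour of 2021 in local Dutch time should be kept, the cutoff would need to be expressed in that timezone instead; this notebook filters on UTC.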


Finally, we write the dataframe to a CSV file so it can be used in the next steps of the project.

In [9]:
df.to_csv('../data/ProductionData/cleaned_energie_opwek.csv', index=False)
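A CSV file does not store dtypes, so the datetime column will be read back as text unless it is parsed again. The following sketch shows the round trip on a small frame, using an in-memory buffer in place of the real file path so that it runs without touching disk.

```python
import io

import pandas as pd

# Small frame in the same shape as the cleaned data.
sample = pd.DataFrame({
    'datetime': pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 01:00:00']),
    'volume': [416000, 430250],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading the data back.
restored = pd.read_csv(buffer, parse_dates=['datetime'])
print(restored.dtypes)
```

When loading the cleaned file in a later notebook, passing the same `parse_dates` argument to `pd.read_csv` avoids repeating the conversion step.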