### Data Preprocessing

Chosen Data Sets:

GoiEner Data Set

Smart Grid Smart City Customer Trial Data Set

METER UK Household Electricity and Activity Survey, 2016-2019

Original SmartMeter dataset (year 2014)

Import Dependencies

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import holidays

In [2]:
# Read original dataset
smartmeter_data = pd.read_csv('D:/FL Publication/Datasets for the Publication/Smart-meter Dataset/LCL-FullData/CC_LCL-FullData.csv')

In [3]:
# Convert datetime column to pandas DateTime type
smartmeter_data['DateTime'] = pd.to_datetime(smartmeter_data['DateTime'])

In [4]:
# Extract datetime components
smartmeter_data['year'] = smartmeter_data['DateTime'].dt.year.astype('int16')
smartmeter_data['month'] = smartmeter_data['DateTime'].dt.month.astype('int8')
smartmeter_data['day'] = smartmeter_data['DateTime'].dt.day.astype('int8')
smartmeter_data['hour'] = smartmeter_data['DateTime'].dt.hour.astype('int8')
smartmeter_data['minute'] = smartmeter_data['DateTime'].dt.minute.astype('int8')
smartmeter_data['dayofyear'] = smartmeter_data['DateTime'].dt.dayofyear.astype('int16')
smartmeter_data['dayofweek'] = smartmeter_data['DateTime'].dt.dayofweek.astype('int8')
smartmeter_data['is_weekend'] = smartmeter_data['dayofweek'].isin([5, 6]).astype('bool')

In [5]:
uk_holidays = holidays.GB()
smartmeter_data['is_holiday'] = smartmeter_data['DateTime'].dt.date.map(lambda x: x in uk_holidays).astype('bool')
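
The row-wise lookup above tests every reading against the holiday calendar. On a large half-hourly dataset, one possible alternative is to test each distinct date once and broadcast the result back to all rows. The following is a minimal sketch under stated assumptions: `holiday_lookup` is a hypothetical stand-in for `holidays.GB()` (any container supporting `date in container` would do), and the `readings` frame is made-up sample data, not the LCL dataset.

```python
import datetime as dt

import pandas as pd

# Hypothetical stand-in for holidays.GB(); only two known UK holidays are listed.
holiday_lookup = {dt.date(2012, 12, 25), dt.date(2012, 12, 26)}

# Made-up sample readings for illustration only.
readings = pd.DataFrame({
    "DateTime": pd.to_datetime([
        "2012-12-24 23:30:00",
        "2012-12-25 00:00:00",
        "2012-12-25 00:30:00",
        "2012-12-26 12:00:00",
        "2012-12-27 08:00:00",
    ])
})

dates = readings["DateTime"].dt.date

# Evaluate membership once per distinct date, then map the result onto all rows.
is_holiday_by_date = {d: d in holiday_lookup for d in dates.unique()}
readings["is_holiday"] = dates.map(is_holiday_by_date).astype("bool")

print(readings)
```

The resulting column should match the row-wise version; only the number of calendar lookups changes.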

In [None]:
# Keep only readings from 2012 and 2013.
# The 2012 data was inspected and showed no irregularities or potential problems.
# .copy() avoids SettingWithCopyWarning in the in-place operations that follow.
smartmeter_data = smartmeter_data[smartmeter_data['year'].isin([2012, 2013])].copy()

In [7]:
# Drop DateTime column as it is not needed anymore
smartmeter_data.drop(columns=["DateTime"], inplace=True)

In [8]:
# Select only standard-tariff ("Std") customers, then drop the tariff column
smartmeter_data = smartmeter_data[smartmeter_data["stdorToU"] == "Std"].copy()
smartmeter_data.drop(columns=["stdorToU"], inplace=True)

In [None]:
# Encode the meter IDs (LCLid) as integers
label_encoder_lclid = LabelEncoder()
smartmeter_data['LCLid'] = label_encoder_lclid.fit_transform(smartmeter_data['LCLid'])
smartmeter_data['LCLid'] = smartmeter_data['LCLid'].astype('int16')

In [10]:
# Ensure the target is numeric and contains no NULLs:
# non-numeric entries are coerced to NaN and then replaced with 0.0
smartmeter_data['KWH/hh (per half hour) '] = pd.to_numeric(smartmeter_data['KWH/hh (per half hour) '], errors='coerce').fillna(0.0).astype('float64')
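
Because `errors='coerce'` followed by `fillna(0.0)` silently turns unparseable entries into zero consumption, it may be worth counting how many values are affected before filling them. A minimal sketch on made-up data follows; the `"Null"` marker and the `raw` series are assumptions for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Made-up raw values; "Null" is a hypothetical non-numeric marker.
raw = pd.Series(["0.125", "Null", "0.3", None])

# Unparseable or missing entries become NaN.
numeric = pd.to_numeric(raw, errors="coerce")

# Count the affected entries before they are overwritten with 0.0.
n_coerced = int(numeric.isna().sum())
print(f"{n_coerced} of {len(raw)} values were missing or non-numeric")

cleaned = numeric.fillna(0.0).astype("float64")
print(cleaned.tolist())
```

If the count is large, treating those readings as zero could bias the hourly sums, so dropping or interpolating them might be considered instead.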

In [11]:
print(smartmeter_data.columns)

Index(['LCLid', 'KWH/hh (per half hour) ', 'year', 'month', 'day', 'hour',
       'minute', 'dayofyear', 'dayofweek', 'is_weekend', 'is_holiday'],
      dtype='object')


In [12]:
# Calculate hourly consumption 
hourly_data = smartmeter_data.groupby(['LCLid', 'year', 'month', 'day', 'dayofyear','dayofweek', 'hour', 'is_weekend', 'is_holiday'])['KWH/hh (per half hour) '].sum().reset_index()
hourly_data = hourly_data.rename(columns={'KWH/hh (per half hour) ': 'KWH/hh (per hour)'})

In [None]:
# Sort the data 
hourly_data.sort_values(by=['LCLid', 'year', 'month', 'day', 'hour'], inplace=True)

# Save the DataFrame to a CSV file
hourly_data.to_csv('D:/FL Publication/Datasets for the Publication/Smart-meter Dataset/Preprocessed_data.csv', index=False)

In [14]:
print(hourly_data.head(500))

     LCLid  year  month  day  dayofyear  dayofweek  hour  is_weekend  \
0        0  2012     10   12        286          4     0       False   
1        0  2012     10   12        286          4     1       False   
2        0  2012     10   12        286          4     2       False   
3        0  2012     10   12        286          4     3       False   
4        0  2012     10   12        286          4     4       False   
..     ...   ...    ...  ...        ...        ...   ...         ...   
495      0  2012     11    1        306          3    15       False   
496      0  2012     11    1        306          3    16       False   
497      0  2012     11    1        306          3    17       False   
498      0  2012     11    1        306          3    18       False   
499      0  2012     11    1        306          3    19       False   

     is_holiday  KWH/hh (per hour)  
0         False              0.000  
1         False              0.000  
2         False         

Holidays: add this as well

Check whether aggregation is done correctly
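
One way to approach this check is sketched below on made-up data. It assumes three conditions are a reasonable definition of "correct": total energy is preserved by the aggregation, no hour is built from more than two half-hour readings, and each meter-hour appears exactly once. The column name `kwh` and the sample frame are hypothetical and only stand in for the notebook's columns.

```python
import pandas as pd

# Made-up half-hourly readings for two meters (illustration only).
half_hourly = pd.DataFrame({
    "LCLid":  [0, 0, 0, 0, 1, 1],
    "year":   [2012] * 6,
    "month":  [10] * 6,
    "day":    [12] * 6,
    "hour":   [0, 0, 1, 1, 0, 0],
    "minute": [0, 30, 0, 30, 0, 30],
    "kwh":    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

keys = ["LCLid", "year", "month", "day", "hour"]
grouped = half_hourly.groupby(keys)["kwh"]
hourly = grouped.sum().reset_index()

# Check 1: the aggregation preserves total energy (up to float rounding).
totals_match = abs(hourly["kwh"].sum() - half_hourly["kwh"].sum()) < 1e-9

# Check 2: an hour should contain at most two half-hour readings;
# a larger count would suggest duplicated rows in the source data.
max_readings = int(grouped.size().max())

# Check 3: each meter-hour appears exactly once after aggregation.
no_duplicate_keys = not hourly.duplicated(subset=keys).any()

print(totals_match, max_readings, no_duplicate_keys)
```

The same three checks could be run against `smartmeter_data` and `hourly_data` by swapping in the notebook's grouping keys and target column.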