### Data Preprocessing

Chosen Data Sets:

GoiEner Data Set

Smart Grid Smart City Customer Trial Data Set

METER UK Household Electricity and Activity Survey, 2016-2019

Original SmartMeter dataset (year 2014)

Import Dependencies

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import holidays

The METER data set is organised differently from the others. Each record has an Electricity ID (idElectricity) and a Meta ID (Meta_idMeta). The Meta ID links the electricity consumption data to the household, activity, and individual data. Each household was recorded for a 28-hour period at some point between February 2016 and January 2019 (before COVID-19), so the recording dates differ between households. The data is available at 1-minute and 10-minute resolution.
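The linkage through the Meta ID can be sketched as a many-to-one join. This is only an illustration on made-up data: the household table, its `n_people` column, and the assumption that it carries the key under the name `Meta_idMeta` are all hypothetical and should be checked against the actual files.

```python
import pandas as pd

# Toy electricity readings (values are made up).
electricity = pd.DataFrame({
    "Meta_idMeta": [101, 101, 102],
    "Watt": [250.0, 260.0, 900.0],
})

# Hypothetical household table; column names are assumptions.
households = pd.DataFrame({
    "Meta_idMeta": [101, 102],
    "n_people": [2, 4],
})

# Many readings map to one household row; validate raises if that is violated.
merged = electricity.merge(
    households, on="Meta_idMeta", how="left", validate="many_to_one"
)
print(merged)
```

Using `how="left"` keeps every reading even when no household record matches, which makes unmatched IDs visible as missing values instead of silently dropping rows.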

Some Statistics on the Data:

Households participated: 361

Distinct Households: 293



In [None]:
# The data is recorded every second and aggregated into 1-minute and 10-minute intervals. We use the 1-minute version.
# Electricity is measured in watts; we convert it to kW so that it is consistent with the other data sets we have.
# For each Meta ID there are 28 hours of data, starting at 17:00 and ending at 21:00 the next day (with some exceptions).
# The recordings for different households come from different dates.
# idElectricity acts like a row index, so it is not needed at this point.
# Rows where the Meta ID is not available ("NULL") are deleted.
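The 28-hour window and its exceptions can be checked per household before the timestamp column is dropped. The following is a minimal sketch on made-up data; the `toy` frame only mimics the `Meta_idMeta` and `dt` columns, and the exact-equality test against 28 hours is a simplification.

```python
import pandas as pd

# Toy data standing in for electricity_1min.csv (values are made up).
toy = pd.DataFrame({
    "Meta_idMeta": [1, 1, 1, 2, 2],
    "dt": pd.to_datetime([
        "2016-02-01 17:00", "2016-02-02 09:00", "2016-02-02 21:00",
        "2017-05-10 17:00", "2017-05-11 20:30",
    ]),
})

# First and last timestamp per household, and the span between them in hours.
windows = toy.groupby("Meta_idMeta")["dt"].agg(start="min", end="max")
windows["hours"] = (windows["end"] - windows["start"]).dt.total_seconds() / 3600

# Households whose recording window deviates from the nominal 28 hours.
exceptions = windows[windows["hours"] != 28]
print(windows)
print(exceptions)
```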

In [2]:
# Load the data
METER_data = pd.read_csv('D:/FL Publication/Datasets for the Publication/METER UK Household Electricity and Activity Survey, 2016-2019/8634csv_1CB23E7148D5085ACDF690DCCDC0066A_V1/UKDA-8634-csv/csv/electricity_1min.csv')


In [3]:
# Encode the ID of meters
label_encoder_Meta_idMeta = LabelEncoder()
METER_data['Meta_idMeta'] = label_encoder_Meta_idMeta.fit_transform(METER_data['Meta_idMeta'])
METER_data['Meta_idMeta'] = METER_data['Meta_idMeta'].astype('int16')

In [4]:
METER_data['dt'] = pd.to_datetime(METER_data['dt'])

In [5]:
METER_data['year'] = METER_data['dt'].dt.year.astype('int16')
METER_data['month'] = METER_data['dt'].dt.month.astype('int16')
METER_data['day'] = METER_data['dt'].dt.day.astype('int16')
METER_data['hour'] = METER_data['dt'].dt.hour.astype('int16')
METER_data['minute'] = METER_data['dt'].dt.minute.astype('int16')
METER_data['day_of_year'] = METER_data['dt'].dt.day_of_year.astype('int16')
METER_data['day_of_week'] = METER_data['dt'].dt.day_of_week.astype('int16')
METER_data['is_weekend'] = METER_data['dt'].dt.day_of_week >= 5  # Saturday = 5, Sunday = 6; the comparison already yields bool

In [6]:
uk_holidays = holidays.GB()
METER_data['is_holiday'] = METER_data['dt'].dt.date.map(lambda x: x in uk_holidays).astype('bool')

In [7]:
# Making sure weekend and holiday columns are as expected
print(METER_data[METER_data['is_weekend']].head(500))
print(METER_data[METER_data['is_holiday']].head(500))
# Yes, they work as expected. 

        idElectricity                  dt  Meta_idMeta    Watt  year  month  \
0              156358 2000-01-01 07:55:00          313     NaN  2000      1   
1              156359 2000-01-01 07:56:00          313     NaN  2000      1   
2              156360 2000-01-01 07:57:00          313     NaN  2000      1   
3              156361 2000-01-01 07:58:00          313     NaN  2000      1   
4              156362 2000-01-01 07:59:00          313     NaN  2000      1   
...               ...                 ...          ...     ...   ...    ...   
114435         278698 2016-10-22 03:50:00           77  567.80  2016     10   
114436         278699 2016-10-22 03:51:00           77  567.00  2016     10   
114437         278700 2016-10-22 03:52:00           77  566.65  2016     10   
114438         278701 2016-10-22 03:53:00           77  565.75  2016     10   
114439         278702 2016-10-22 03:54:00           77  563.70  2016     10   

        day  hour  minute  day_of_year  day_of_week
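Printing the first 500 rows relies on reading the output by eye. A programmatic check is one alternative, sketched here on made-up timestamps. The first rows printed above are dated 2000-01-01, which lies outside the stated February 2016 to January 2019 recording period, so the sketch also flags such rows; whether they should be dropped is a separate decision that this example does not make.

```python
import pandas as pd

toy = pd.DataFrame({"dt": pd.to_datetime([
    "2000-01-01 07:55",  # outside the study period, like the first printed rows
    "2016-10-22 03:50",  # a Saturday
    "2018-12-25 12:00",  # a Tuesday
])})
toy["is_weekend"] = toy["dt"].dt.day_of_week >= 5

# Cross-check the weekend flag against the day names.
expected = toy["dt"].dt.day_name().isin(["Saturday", "Sunday"])
assert (toy["is_weekend"] == expected).all()

# Flag timestamps outside the stated recording period.
in_period = toy["dt"].between("2016-02-01", "2019-01-31 23:59:59")
print(toy[~in_period])
```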

In [8]:
# Drop the old dt column, as we do not need it anymore.
METER_data.drop(columns=["dt"], inplace=True)

In [9]:
# Drop the idElectricity column
METER_data.drop(columns=["idElectricity"], inplace=True)

In [10]:
# Delete rows with no MetaID

# First, count the rows where the ID or Watt column is missing.
# Note: Meta_idMeta is already an int16 label at this point, so it cannot hold missing values;
# missing raw IDs would have to be checked before the encoding step.

null_count_ID = METER_data['Meta_idMeta'].isnull().sum()
null_count_watt = METER_data['Watt'].isnull().sum()
print(f"Number of rows with missing Meta_idMeta: {null_count_ID}")
print(f"Number of rows with missing Watt: {null_count_watt}")

Number of rows with missing Meta_idMeta: 0
Number of rows with missing Watt: 1068


In [None]:
# Rows where the Meta ID or the Watt value is missing make up 0.18 percent of the overall data set, and visual
# inspection of plots suggests that the values are missing at random. For this reason we simply remove these observations.

METER_data = METER_data.dropna(subset=['Meta_idMeta','Watt'])
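The missing-at-random judgement above rests on visual inspection. A complementary numeric check is to compare the share of missing readings per household, since a few households accounting for most of the gaps would argue against randomness. This is a minimal sketch on made-up values, not the analysis that was actually run.

```python
import numpy as np
import pandas as pd

# Toy readings for two households (values are made up).
toy = pd.DataFrame({
    "Meta_idMeta": [0, 0, 0, 0, 1, 1, 1, 1],
    "Watt": [120.0, np.nan, 130.0, 125.0, 300.0, 310.0, 305.0, 290.0],
})

# Overall share of missing readings and the share within each household.
overall_share = toy["Watt"].isna().mean()
per_household = toy["Watt"].isna().groupby(toy["Meta_idMeta"]).mean()
print(f"Overall missing share: {overall_share:.2%}")
print(per_household.sort_values(ascending=False))

# Same removal step as in the notebook.
cleaned = toy.dropna(subset=["Meta_idMeta", "Watt"])
```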

In [12]:
# Convert Watt to kW
METER_data['kW'] = METER_data['Watt'] / 1000

# Drop Watt column

METER_data.drop(columns=["Watt"], inplace=True)
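Dividing by 1000 gives power in kW, which is not the same quantity as energy in kWh. If the other data sets report energy per interval, the 1-minute readings would additionally need to be scaled by the interval length. The sketch below assumes that each reading is the mean power over its 1-minute interval, which should be confirmed against the data set documentation.

```python
import pandas as pd

# Toy 1-minute readings in watts (values are made up).
toy = pd.DataFrame({"Watt": [600.0, 1200.0, 0.0]})

# Power: W -> kW.
toy["kW"] = toy["Watt"] / 1000

# Energy per interval: kW * hours, with one minute = 1/60 hour.
interval_hours = 1 / 60
toy["kWh"] = toy["kW"] * interval_hours
print(toy)
```

For example, a constant 600 W held for one minute corresponds to 0.6 kW × (1/60) h = 0.01 kWh.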

In [None]:
# Save it as preprocessed csv file

In [13]:
METER_data.to_csv('D:/FL Publication/Code_new/FL_Publication_1/Current_Simulations/Data_Preprocessing/Preprocessed_data/METER_electricity_preprocessed_1min.csv',index=False)

Consider using the household data to improve forecasting (it does not have to be used only for the fairness aspects).

Remove the absolute time; it is unnecessary.

In [14]:
print(METER_data.tail())

        Meta_idMeta  year  month  day  hour  minute  day_of_year  day_of_week  \
518133          311  2019      1   16    20      57           16            2   
518134          311  2019      1   16    20      58           16            2   
518135          311  2019      1   16    20      59           16            2   
518136          311  2019      1   16    21       0           16            2   
518137          311  2019      1   16    21       1           16            2   

        is_weekend  is_holiday        kW  
518133       False       False  0.205033  
518134       False       False  0.212087  
518135       False       False  0.298270  
518136       False       False  0.443230  
518137       False       False  0.362507  
