### Data Preprocessing

Chosen Data Sets:

GoiEner Data Set

Smart Grid Smart City Customer Trial Data Set

METER UK Household Electricity and Activity Survey, 2016-2019

Original SmartMeter dataset (year 2014)

Install Dependencies

There are two versions of this dataset. The first is the raw data, which contains missing values. The second has the missing values imputed and is segmented into three periods: pre-, in-, and post-pandemic. The number of CSV files (and hence customers) differs between the raw and the processed data, because some CSV files appear in more than one of the three periods.

Here we first check which customers have data in more than one of the sub-periods between 2014 and 2022.

In [1]:
import pandas as pd
import os
import holidays

In [2]:
# Set the folder paths for imp_pre, imp_in, and imp_post

folder_paths = {
    "pre_pandemic": "D:/FL Publication/Datasets for the Publication/GoiEner/7362094/imp-pre/goi4_pre/imp_csv",
    "in_pandemic": "D:/FL Publication/Datasets for the Publication/GoiEner/7362094/imp-in/goi4_in/imp_csv",
    "post_pandemic": "D:/FL Publication/Datasets for the Publication/GoiEner/7362094/imp-post/goi4_pst/imp_csv"
}

In [3]:
# Dictionary to store customer IDs for each period
# A set keeps only unique items and is unordered and unindexed; it also supports intersection directly.
customer_ids = {
    "pre_pandemic": set(),
    "in_pandemic": set(),
    "post_pandemic": set()
}

In [4]:
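# Loop through each folder and collect customer IDs from the CSV file names
# (the files themselves are not opened here)
for period, folder_path in folder_paths.items():
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".csv"):
            # The file name without its extension is the customer ID
            customer_id = os.path.splitext(file_name)[0]
            customer_ids[period].add(customer_id)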

In [5]:
# Check for customers that appear in multiple periods
pre_and_in = customer_ids["pre_pandemic"].intersection(customer_ids["in_pandemic"])
pre_and_post = customer_ids["pre_pandemic"].intersection(customer_ids["post_pandemic"])
in_and_post = customer_ids["in_pandemic"].intersection(customer_ids["post_pandemic"])
all_periods = customer_ids["pre_pandemic"].intersection(customer_ids["in_pandemic"], customer_ids["post_pandemic"])

results = {
    "pre_and_in_pandemic": list(pre_and_in),
    "pre_and_post_pandemic": list(pre_and_post),
    "in_and_post_pandemic": list(in_and_post),
    "all_periods": list(all_periods)
}

In [6]:
# The lists differ in length; wrapping each in a Series pads the shorter columns with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in results.items()})
print(df)

                                     pre_and_in_pandemic  \
0      4ae14bd23af5ea9153c82870a7d46b06042970bdee2c80...   
1      897a78edf9694a0a44a7c98fb763d696a5dc2c8503163c...   
2      e071ac0732664381e236fb660361d667d467dc7896d42e...   
3      193b8e83344d8bea06cb6713a3ac8f706cb9a099fa14cc...   
4      832767c8be9103cfd16ede32cf8bc549ce2430bd23e6f2...   
...                                                  ...   
13905                                                NaN   
13906                                                NaN   
13907                                                NaN   
13908                                                NaN   
13909                                                NaN   

                                   pre_and_post_pandemic  \
0      4ae14bd23af5ea9153c82870a7d46b06042970bdee2c80...   
1      897a78edf9694a0a44a7c98fb763d696a5dc2c8503163c...   
2      e071ac0732664381e236fb660361d667d467dc7896d42e...   
3      193b8e83344d8bea06cb6713a3ac8f70
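
The printed table lists the overlapping IDs themselves; to answer how many customers fall in each period and in each overlap, a count summary may be easier to read. The following is a minimal sketch, not part of the original notebook: the helper names `collect_customer_ids` and `summarize_overlap` and the toy customer IDs are assumptions made for illustration, and the temporary folders stand in for the real `folder_paths`.

```python
import os
import tempfile


def collect_customer_ids(folder_paths):
    """Map each period name to the set of CSV file stems found in its folder."""
    ids = {}
    for period, folder in folder_paths.items():
        ids[period] = {
            os.path.splitext(name)[0]
            for name in os.listdir(folder)
            if name.endswith(".csv")
        }
    return ids


def summarize_overlap(ids):
    """Count customers per period, per pair of periods, and across all periods."""
    periods = list(ids)
    summary = {p: len(ids[p]) for p in periods}
    for i, a in enumerate(periods):
        for b in periods[i + 1:]:
            summary[f"{a} & {b}"] = len(ids[a] & ids[b])
    summary["all_periods"] = len(set.intersection(*ids.values()))
    summary["any_period"] = len(set.union(*ids.values()))
    return summary


# Toy demonstration with hypothetical customer IDs in temporary folders
with tempfile.TemporaryDirectory() as root:
    toy_layout = {
        "pre_pandemic": ["a", "b", "c"],
        "in_pandemic": ["b", "c", "d"],
        "post_pandemic": ["c", "d", "e"],
    }
    toy_paths = {}
    for period, customers in toy_layout.items():
        folder = os.path.join(root, period)
        os.makedirs(folder)
        for customer in customers:
            # Create an empty CSV file named after the customer
            open(os.path.join(folder, customer + ".csv"), "w").close()
        toy_paths[period] = folder

    toy_summary = summarize_overlap(collect_customer_ids(toy_paths))

for key, count in toy_summary.items():
    print(f"{key}: {count}")
```

With the real data, passing `folder_paths` to `collect_customer_ids` should give the same sets as `customer_ids`, and `any_period` would then be the number of distinct customers in the imputed data.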

In [4]:
# Preprocess the CSV files one by one and save them in a separate folder for each period.

preprocessed_folder = "D:/FL Publication/Datasets for the Publication/GoiEner/7362094/imp_preprocessed"


In [5]:
es_holidays = holidays.Spain()

In [6]:
def is_holiday(date):
    """Check if a date is a holiday in Spain."""
    return date in es_holidays

In [7]:
# First extract the calendar variables available from the timestamp column, then examine the trend in each one using graphs

# Function to preprocess the data
def preprocess_file(file_path):
    df = pd.read_csv(file_path)

    # Convert 'timestamp' column to datetime type
    df['timestamp'] = pd.to_datetime(df['timestamp'])

    # Extract other variables
    
    df['year'] = df['timestamp'].dt.year.astype('int16')
    df['month'] = df['timestamp'].dt.month.astype('int8')
    df['day'] = df['timestamp'].dt.day.astype('int16')
    df['hour'] = df['timestamp'].dt.hour.astype('int8')
    df['day_of_year'] = df['timestamp'].dt.day_of_year.astype('int16')
    df['day_of_week'] = df['timestamp'].dt.day_of_week.astype('int8')
    # Saturday (5) and Sunday (6) count as the weekend
    df['is_weekend'] = (df['timestamp'].dt.day_of_week >= 5).astype('bool')

    # Add holiday information
    df['is_holiday'] = df['timestamp'].dt.date.map(is_holiday).astype('bool')

    # Drop the original timestamp column
    df.drop(columns=["timestamp"], inplace=True)

    
    return df
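
Mapping `is_holiday` over every row repeats the same lookup for each reading taken on the same day. One possible variant, sketched here under assumptions, evaluates the check once per distinct date and then broadcasts the result to all rows. The helper `flag_dates` and the stand-in predicate `toy_is_holiday` are hypothetical names introduced for this example; the stand-in only knows 1 January and 25 December, so that the sketch runs without the `holidays` package. In the notebook, `is_holiday` would be passed instead.

```python
import pandas as pd


def flag_dates(timestamps, predicate):
    """Apply predicate once per distinct calendar date, then broadcast to all rows."""
    dates = timestamps.dt.date
    lookup = {d: bool(predicate(d)) for d in dates.unique()}
    return dates.map(lookup).astype("bool")


def toy_is_holiday(date):
    """Stand-in for is_holiday: only New Year's Day and Christmas Day."""
    return (date.month, date.day) in {(1, 1), (12, 25)}


# Four hourly readings that straddle midnight before Christmas Day
toy = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2014-12-24 22:00",
        "2014-12-24 23:00",
        "2014-12-25 00:00",
        "2014-12-25 01:00",
    ])
})
toy["is_holiday"] = flag_dates(toy["timestamp"], toy_is_holiday)
print(toy)
```

Whether this is worth doing depends on file size; for a few thousand rows per customer the row-wise version is likely fast enough.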

In [8]:
# Process each file in each folder
for period, folder_path in folder_paths.items():
    # Create the output folder for this period once, before writing any file
    output_dir = os.path.join(preprocessed_folder, period + "_preprocessed")
    os.makedirs(output_dir, exist_ok=True)
    for file_name in os.listdir(folder_path):
        if file_name.endswith(".csv"):
            file_path = os.path.join(folder_path, file_name)
            # Preprocess the file
            processed_df = preprocess_file(file_path)
            # Save to the output folder under the same file name
            processed_df.to_csv(os.path.join(output_dir, file_name), index=False)

Consider including other variables, such as economic activity and other metadata, to improve prediction (this is in addition to the fairness aspect).
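
As a rough illustration of how per-customer metadata could be attached to a preprocessed file, the sketch below adds selected metadata fields as constant columns. Everything specific in it is an assumption: the helper `attach_metadata`, the column names `customer_id` and `economic_activity`, and the toy values are hypothetical, and the actual metadata layout may differ.

```python
import pandas as pd


def attach_metadata(df, customer_id, metadata, columns):
    """Add selected per-customer metadata fields to df as constant columns."""
    match = metadata.loc[metadata["customer_id"] == customer_id]
    if match.empty:
        # No metadata for this customer: keep the rows and mark the fields missing
        return df.assign(**{col: pd.NA for col in columns})
    first = match.iloc[0]
    return df.assign(**{col: first[col] for col in columns})


# Hypothetical metadata table; real column names and values will differ
toy_metadata = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "economic_activity": ["household", "retail"],
})

# Hypothetical preprocessed readings for customer "c2"
toy_readings = pd.DataFrame({"hour": [0, 1, 2], "is_weekend": [False, False, False]})

enriched = attach_metadata(toy_readings, "c2", toy_metadata, ["economic_activity"])
unknown = attach_metadata(toy_readings, "zz", toy_metadata, ["economic_activity"])
print(enriched)
```

Since each file holds one customer, the customer ID can be taken from the file name, as in the ID-collection step earlier.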

Remove absolute time; it is unnecessary, because the date and time variables already uniquely identify each observation. We also want to see how electricity consumption changes over a year, a month, a week, and so on, with seasonality and date-related patterns. Absolute time does not reflect these cyclical patterns.
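
One common way to make the cyclical nature of such variables explicit, which the notebook itself does not do, is a sine/cosine encoding: each value is placed on a circle so that, for example, hour 23 ends up next to hour 0. The sketch below is only an illustration under that assumption; the helper name `add_cyclical` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, column, period):
    """Add sine and cosine columns that place `column` on a circle of length `period`."""
    angle = 2 * np.pi * df[column] / period
    return df.assign(**{
        f"{column}_sin": np.sin(angle),
        f"{column}_cos": np.cos(angle),
    })


toy_hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
toy_hours = add_cyclical(toy_hours, "hour", 24)


def circle_distance(frame, i, j):
    """Euclidean distance between rows i and j in the (sin, cos) plane."""
    a = frame.loc[i, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    b = frame.loc[j, ["hour_sin", "hour_cos"]].to_numpy(dtype=float)
    return float(np.linalg.norm(a - b))


# Hour 23 is closer to hour 0 than hour 12 is, unlike with the raw integers
print(circle_distance(toy_hours, 0, 3), circle_distance(toy_hours, 0, 2))
```

The same helper could be applied to `month` (period 12) or `day_of_week` (period 7) if those features are later fed to a model.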

Holidays: add this as well.