## Dataset preprocessing
This section describes how the dataset was preprocessed before any model was fitted. Preprocessing ensures that the data is clean, consistent, and suitable for analysis.
The dataset is already fairly clean, but we applied the following steps to ensure data quality:

In [7]:
import pandas as pd
import os

Replace "-" with 0 in the 'Gen' column to handle missing generation values. Any remaining cell that cannot be parsed as a number is also set to 0.

In [9]:
def fix_gen_column(csv_path):
    df = pd.read_csv(csv_path, dtype=str)
    df.columns = df.columns.str.strip()

    if "Gen" not in df.columns:
        print(f"'Gen' column not found in {csv_path}. Skipping.")
        return

    # Replace cells that are exactly "-" (allow surrounding spaces) with "0"
    df["Gen"] = df["Gen"].replace(r"^\s*-\s*$", "0", regex=True)

    # Trim whitespace and convert to numeric, non-numeric -> NaN -> fill with 0
    df["Gen"] = pd.to_numeric(df["Gen"].str.strip(), errors="coerce").fillna(0)

    df.to_csv(csv_path, index=False)
    print("Fixed Gen column and saved:", csv_path)
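
To see what the two cleaning steps do without touching a file, here is a minimal sketch on an in-memory Series. The sample values are hypothetical and only illustrate the kinds of cells the function is written to handle; the real files may contain other placeholders.

```python
import pandas as pd

# Hypothetical sample values, not taken from the actual dataset.
gen = pd.Series(["12.5", " - ", "-", "-3.2", "n/a"])

# Cells that consist only of a dash (optionally padded with spaces) become "0".
cleaned = gen.replace(r"^\s*-\s*$", "0", regex=True)

# Whatever is still non-numeric (here "n/a") is coerced to NaN and then filled with 0.
cleaned = pd.to_numeric(cleaned.str.strip(), errors="coerce").fillna(0)

print(cleaned.tolist())
```

Because the pattern is anchored with `^` and `$`, a genuine negative reading such as `-3.2` is left untouched; only cells that are nothing but a dash are replaced.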

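**Data Type Conversion**: Convert the price columns (`RT Busbar`, `RT Hub`, `DA Busbar`, `DA Hub`) to numeric types to facilitate mathematical operations and analysis. Dollar signs and thousands separators are removed before conversion, and values wrapped in parentheses in the Busbar columns are treated as negative. The `P/OP` column is encoded as 1 for `P` and 0 for `OP`.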

In [11]:
def convert_dollar(csv_path):
    data_hist = pd.read_csv(csv_path)

    busbar_cols = ['RT Busbar', 'DA Busbar']
    hub_cols = ['RT Hub', 'DA Hub']
    required = busbar_cols + hub_cols + ['P/OP']
    if any(col not in data_hist.columns for col in required):
        print(f"One or more required columns not found in {csv_path}. Skipping.")
        return

    for col in busbar_cols + hub_cols:
        values = data_hist[col].astype(str).str.strip()
        if col in busbar_cols:
            # Accounting-style negatives: "(12.34)" -> "-12.34".
            # Only the parentheses are touched here; replacing commas with "-"
            # would corrupt values such as "1,234.56".
            values = values.str.replace('(', '-', regex=False).str.replace(')', '', regex=False)
        # Strip dollar signs and thousands separators, then convert to numeric
        values = values.str.replace(r'[\$,]', '', regex=True)
        data_hist[col] = pd.to_numeric(values, errors='coerce')

    # Encode P/OP as 1/0; 'OP' is replaced first so its 'P' is not turned into '1'
    flags = data_hist['P/OP'].astype(str).str.strip()
    flags = flags.str.replace('OP', '0', regex=False).str.replace('P', '1', regex=False)
    data_hist['P/OP'] = pd.to_numeric(flags, errors='coerce')

    data_hist.to_csv(csv_path, index=False)
    print("Converted dollar columns and saved:", csv_path)
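
The string handling behind this conversion can be tried in isolation. The sketch below uses a hypothetical helper, `parse_currency`, which is not part of the notebook, and made-up sample strings. It assumes that a value fully wrapped in parentheses denotes a negative amount.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Hypothetical helper: turn strings like "$1,234.56" or "($12.00)" into floats."""
    text = series.astype(str).str.strip()
    # Accounting notation: a value wrapped in parentheses is negative.
    negative = text.str.startswith("(") & text.str.endswith(")")
    # Drop dollar signs, thousands separators, and parentheses.
    digits = text.str.replace(r"[$,()]", "", regex=True)
    values = pd.to_numeric(digits, errors="coerce")
    # Flip the sign only for the parenthesised entries.
    return values.where(~negative, -values)

# Made-up inputs covering the formats discussed above.
sample = pd.Series(["$1,234.56", "($12.00)", "-$3.50", "n/a"])
parsed = parse_currency(sample)
print(parsed)
```

Entries that cannot be parsed, such as `n/a`, come back as `NaN` because of `errors="coerce"`, so they can be picked up later by the missing-value step.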


**Handling Missing Values**: Check for missing values in the dataset and drop every row that contains one, so that later steps only see complete records.

In [6]:
def drop_missing_values(csv_path):
    df = pd.read_csv(csv_path)
    # Keep only rows without any missing cell
    df = df.dropna()
    df.to_csv(csv_path, index=False)
    print("Dropped rows with missing values and saved:", csv_path)
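
Before dropping rows, it can help to see how many cells are missing and in which columns. The following sketch does this on a small hypothetical frame. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical frame standing in for one of the CSV files.
df = pd.DataFrame({
    "Gen": [10.0, None, 7.5],
    "RT Hub": [21.3, 19.8, None],
    "P/OP": [1, 0, 1],
})

# Count missing cells per column before removing anything.
missing_per_column = df.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing cell.
complete = df.dropna()
print(f"Dropped {len(df) - len(complete)} of {len(df)} rows")
```

If the share of dropped rows turns out to be large, that is a signal to revisit the earlier conversion steps rather than silently discard the data.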

Apply the preprocessing functions to every CSV file in the data directory to ensure consistency across datasets. Note that each function overwrites its input file in place.

In [None]:
data_dir = "../data"
# traverse all csv files in data directory and apply fixes
for filename in os.listdir(data_dir):
    if filename.endswith(".csv"):
        csv_path = os.path.join(data_dir, filename)
        fix_gen_column(csv_path)
        # Convert before dropping, so rows whose values fail numeric conversion (NaN) are removed too
        convert_dollar(csv_path)
        drop_missing_values(csv_path)