## Dataset preprocessing
This section describes how the dataset is preprocessed before it is passed to any model. Preprocessing ensures that the data is clean, consistent, and suitable for analysis.
The dataset already appears largely clean, but we applied the following steps to ensure data quality:


In [7]:
import pandas as pd
import os

1. **Gen Column Cleanup**: Replace "-" with 0 in the 'Gen' column to handle missing generation values. Any other non-numeric entry in that column is also converted to 0.

In [9]:
def fix_gen_column(csv_path):
    df = pd.read_csv(csv_path, dtype=str)
    df.columns = df.columns.str.strip()

    if "Gen" not in df.columns:
        print(f"'Gen' column not found in {csv_path}. Skipping.")
        return

    # Replace cells that are exactly "-" (allow surrounding spaces) with "0"
    df["Gen"] = df["Gen"].replace(r"^\s*-\s*$", "0", regex=True)

    # Trim whitespace and convert to numeric, non-numeric -> NaN -> fill with 0
    df["Gen"] = pd.to_numeric(df["Gen"].str.strip(), errors="coerce").fillna(0)

    df.to_csv(csv_path, index=False)
    print("Fixed Gen column and saved:", csv_path)
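To make the effect of `fix_gen_column` concrete, here is a minimal sketch that applies the same two steps to an in-memory column instead of a file. The sample values are hypothetical and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample values, chosen to cover the interesting cases
sample = pd.DataFrame({"Gen": ["12.5", " - ", "-", "-3.2", "n/a", " 7 "]})

# Step 1: cells that are exactly "-" (optionally padded with spaces) become "0"
gen = sample["Gen"].replace(r"^\s*-\s*$", "0", regex=True)

# Step 2: trim whitespace, convert to numbers, and fill unparseable cells with 0
gen = pd.to_numeric(gen.str.strip(), errors="coerce").fillna(0)

print(gen.tolist())  # [12.5, 0.0, 0.0, -3.2, 0.0, 7.0]
```

Note that a negative number such as "-3.2" is preserved because the pattern only matches a lone dash, while free text such as "n/a" is silently mapped to 0 by the `fillna(0)` call.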

2. **Data Type Conversion**: Convert the relevant columns to numeric types to facilitate mathematical operations and analysis. Remove $ signs and commas before conversion.

3. **Handling Missing Values**: Check the dataset for missing values. If any are found, drop the affected rows to maintain data integrity.

In [6]:
def drop_missing_values(csv_path):
    df = pd.read_csv(csv_path)
    # Drop every row that contains at least one missing value
    df = df.dropna()
    df.to_csv(csv_path, index=False)
    print("Dropped rows with missing values and saved:", csv_path)
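Step 2 above is described without an accompanying cell, so the following is only a sketch of how the conversion might look. The helper name `to_numeric_columns` and the column names `Peak` and `Hub` are assumptions made for illustration; the real files may use different columns.

```python
import pandas as pd

def to_numeric_columns(df, columns):
    """Strip '$' and ',' from the given columns and convert them to numbers.

    Hypothetical helper: returns a new DataFrame and leaves the input untouched.
    """
    out = df.copy()
    for col in columns:
        cleaned = (
            out[col]
            .astype(str)                           # tolerate columns that are already numeric
            .str.replace(r"[$,]", "", regex=True)  # remove dollar signs and thousands separators
            .str.strip()
        )
        # Values that still cannot be parsed become NaN instead of raising
        out[col] = pd.to_numeric(cleaned, errors="coerce")
    return out

# Hypothetical example data, not taken from the real dataset
prices = pd.DataFrame({"Peak": ["$1,234.50", " $42 ", "n/a"], "Hub": ["A", "B", "C"]})
converted = to_numeric_columns(prices, ["Peak"])
print(converted["Peak"].tolist())
```

Unparseable values are left as NaN here rather than filled with 0, so they would be removed by the missing-value step if it runs afterwards.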

Finally, apply both functions to every CSV file in the data directory so that all datasets are cleaned the same way.

In [10]:
data_dir = "../data"
# Apply both fixes to every CSV file in the data directory
for filename in os.listdir(data_dir):
    if filename.endswith(".csv"):
        csv_path = os.path.join(data_dir, filename)
        fix_gen_column(csv_path)
        drop_missing_values(csv_path)

'Gen' column not found in ../data/CAISO-Forward-Prices.csv. Skipping.
Dropped rows with missing values and saved: ../data/CAISO-Forward-Prices.csv
Fixed Gen column and saved: ../data/CAISO-Historical-Data.csv
Dropped rows with missing values and saved: ../data/CAISO-Historical-Data.csv
'Gen' column not found in ../data/MISO-Forward-Prices.csv. Skipping.
Dropped rows with missing values and saved: ../data/MISO-Forward-Prices.csv
Fixed Gen column and saved: ../data/MISO-Historical-Data.csv
Dropped rows with missing values and saved: ../data/MISO-Historical-Data.csv
'Gen' column not found in ../data/ERCOT-Forward-Prices.csv. Skipping.
Dropped rows with missing values and saved: ../data/ERCOT-Forward-Prices.csv
Fixed Gen column and saved: ../data/ERCOT-Historical-Data.csv
Dropped rows with missing values and saved: ../data/ERCOT-Historical-Data.csv
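Because `drop_missing_values` removes every row with at least one empty cell, it can be useful to see how many values are missing per column before or after a run. The sketch below shows one way to do that; `missing_report` is a hypothetical helper, and the small frame stands in for a loaded CSV file.

```python
import pandas as pd

def missing_report(df):
    """Return the number of missing cells per column, omitting complete columns."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Hypothetical frame standing in for one of the loaded CSV files
frame = pd.DataFrame({"Gen": [1.0, None, 3.0], "Price": [10.0, 11.0, 12.0]})

report = missing_report(frame)
print(report.to_dict())

# After dropping incomplete rows, the report should be empty
print(missing_report(frame.dropna()).empty)
```

An empty report after preprocessing indicates that no missing values remain in that file.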
