__PROGRAMMING FOR DATA ANALYTICS : KAN__


__Author    : Clyde Watts__  
__Lecturer : Andrew Beaty__  
__Date      : 2025-11-20__



https://www.sciencedirect.com/science/article/pii/S030626192402227X?via%3Dihub

In [1]:
import torch
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Requires: pip install pykan

import os
import seaborn as sns
import datetime

__Prompt__

I have a dataframe called df_merge_hourly and a list of feature column names called feature_cols. My target variable is PV(W) (or target_col).

Please write a Python code cell using matplotlib and seaborn that iterates through each feature in feature_cols and generates a 4-panel "Feature Report" dashboard for it. For each feature, the figure should have a 2x2 grid layout containing:

1. A Histogram/Distribution plot of the feature.
2. A Scatter plot showing the relationship between the feature (x-axis) and the target PV(W) (y-axis). Use low alpha (transparency) to handle density.
3. A Seasonal Heatmap: The x-axis should be 'Hour of Day', the y-axis should be 'Month', and the color should represent the average value of the feature.
4. A Monthly Boxplot: Showing the feature's distribution across different months.

Ensure the code first checks if 'Month' and 'Hour' columns exist in df_merge_hourly, and creates them from the index if they are missing. Handle any potential non-numeric errors gracefully.
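
The data preparation behind panels 3 and 4 of this prompt could be sketched roughly as follows. This is a minimal illustration, not the code the agent produced: `ensure_time_columns` and `seasonal_means` are hypothetical helper names, the fallback to a `DateTime` column is an assumption, and the toy values are made up.

```python
import pandas as pd


def ensure_time_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df that has 'Month' and 'Hour' columns."""
    out = df.copy()
    if isinstance(out.index, pd.DatetimeIndex):
        stamps = pd.Series(out.index, index=out.index)
    else:
        # Assumption: timestamps live in a 'DateTime' column instead.
        stamps = pd.to_datetime(out["DateTime"])
    if "Month" not in out.columns:
        out["Month"] = stamps.dt.month.to_numpy()
    if "Hour" not in out.columns:
        out["Hour"] = stamps.dt.hour.to_numpy()
    return out


def seasonal_means(df: pd.DataFrame, feature: str) -> pd.DataFrame:
    """Month x Hour table of the feature's mean value (heatmap input)."""
    # Non-numeric entries become NaN and are ignored by the mean.
    values = pd.to_numeric(df[feature], errors="coerce")
    return df.assign(_value=values).pivot_table(
        index="Month", columns="Hour", values="_value", aggfunc="mean"
    )


# Toy data with one non-numeric entry to exercise the error handling.
toy = pd.DataFrame(
    {"GHI": [100.0, 300.0, "n/a", 200.0]},
    index=pd.to_datetime(
        ["2024-06-01 10:00", "2024-06-02 10:00", "2024-06-03 10:00", "2024-07-01 11:00"]
    ),
)
toy = ensure_time_columns(toy)
table = seasonal_means(toy, "GHI")
print(table)
```

A table shaped like this can be passed to `seaborn.heatmap` for the seasonal panel, and the same `Month` column can group the monthly boxplot.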

## Feature Analysis
 
This is a cheat-sheet summary of each feature's metrics, generated with a Copilot agent. It is an experiment and prototype to see whether the concept works, done on the last day of the project to judge whether it is worth pursuing.

## Configure File Paths and Solar Parameters

Identical to the ANN notebook, this sets up:
- Directory paths for data and models
- Solar panel configuration (19 panels, 8,360W total capacity)
- Location coordinates (Bettystown, Ireland)

See the ANN notebook for detailed explanation of these parameters.


In [2]:
# Determine the current path of the notebook
notebook_path = os.path.abspath("big_project.ipynb")
notebook_dir = os.path.dirname(notebook_path).replace('\\', '/')
print("Current notebook directory:", notebook_dir)
HOME_DIR = f'{notebook_dir}'
DATA_DIR = f'{HOME_DIR}/data/'
MODEL_DIR = f'{HOME_DIR}/model/'
print("Data directory set to:", DATA_DIR)
RAW_DATA_DIR = f'{DATA_DIR}/raw_data/'
TRAIN_DATA_DIR = f'{DATA_DIR}/training_data/'
SQL_DB_PATH = f'{DATA_DIR}/db_sqlite/'
SQL_DB_FILE = f'{SQL_DB_PATH}/big_project_db.sqlite3'
BACKUP_FILE_TYPE = 'feather'  # Options: 'csv', 'feather', 'parquet'

# Meteostat setup
METEOSTAT_CACHE_DIR = f'{DATA_DIR}/meteostat_cache/'
SOLAR_SITE_POSITION = (53.6985, -6.2080)  # Bettystown, Ireland
LATITUDE, LONGITUDE = SOLAR_SITE_POSITION
WEATHER_START_DATE = datetime.datetime(2024, 1, 1)
WEATHER_END_DATE = datetime.datetime.now()
# Solar panel configuration 
# Determined this using Gemini and Google Maps measurements
ROOF_PANE_I_ANGLE = 30  # degrees
ROOF_PANE_II_ANGLE = 30  # degrees
ROOF_PANE_I_AZIMUTH = 65  # degrees clockwise from north (East-North-East)
ROOF_PANE_II_AZIMUTH = 245  # degrees clockwise from north (West-South-West)
ROOF_PANE_I_COUNT = 7
ROOF_PANE_II_COUNT = 12
SOLAR_PANEL_POWER_RATING_W = 440  # Watts per panel
TOTAL_SOLAR_PANE_I_CAPACITY_W = ROOF_PANE_I_COUNT * SOLAR_PANEL_POWER_RATING_W
TOTAL_SOLAR_PANE_II_CAPACITY_W = ROOF_PANE_II_COUNT * SOLAR_PANEL_POWER_RATING_W
TOTAL_SOLAR_CAPACITY_W = TOTAL_SOLAR_PANE_I_CAPACITY_W + TOTAL_SOLAR_PANE_II_CAPACITY_W

Current notebook directory: c:/Users/cw171001/OneDrive - Teradata/Documents/GitHub/PFDA-programming-for-data-analytics/big_project
Data directory set to: c:/Users/cw171001/OneDrive - Teradata/Documents/GitHub/PFDA-programming-for-data-analytics/big_project/data/
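
Because `DATA_DIR` already ends in `/`, the f-strings above yield paths with doubled slashes such as `.../data//raw_data/`. Most operating systems tolerate this, but a `pathlib`-based variant avoids it. The sketch below is one possible alternative, not the notebook's own code; `build_paths` is a hypothetical helper and `/projects/big_project` is a placeholder directory.

```python
from pathlib import Path


def build_paths(home_dir: str) -> dict:
    """Assemble the project directories without manual slash handling."""
    home = Path(home_dir)
    data = home / "data"
    return {
        "DATA_DIR": data,
        "MODEL_DIR": home / "model",
        "RAW_DATA_DIR": data / "raw_data",
        "TRAIN_DATA_DIR": data / "training_data",
        "SQL_DB_FILE": data / "db_sqlite" / "big_project_db.sqlite3",
    }


# Placeholder home directory, used only for illustration.
paths = build_paths("/projects/big_project")
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```

`Path` objects can be passed directly to `pd.read_feather`, so the rest of the notebook would not need string paths.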


## Set Nighttime Threshold

Filters out nighttime data where Clear Sky GHI ≤ 50 W/m². 

Only daytime data with meaningful solar radiation is used for training.


In [3]:
hourly_nighlty_threshold = 50

In [4]:
df_merge_hourly = pd.read_feather(f"{TRAIN_DATA_DIR}/hourly_solar_full_data.feather")

# Remove all rows where Clear sky GHI is less than or equal to 50
df_merge_hourly = df_merge_hourly[df_merge_hourly['Clear sky GHI'] > hourly_nighlty_threshold]
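
To make the effect of the strict `>` comparison concrete, here is a small sketch on made-up rows. `filter_daytime` is a hypothetical helper, and the values are illustrative only; a row sitting exactly on the threshold is dropped.

```python
import pandas as pd


def filter_daytime(df: pd.DataFrame, threshold: float = 50.0,
                   ghi_col: str = "Clear sky GHI") -> pd.DataFrame:
    """Keep only rows whose clear-sky GHI is strictly above the threshold."""
    return df.loc[df[ghi_col] > threshold].copy()


# Made-up rows: night, exactly on the threshold, just above it, midday.
toy = pd.DataFrame(
    {"Clear sky GHI": [0.0, 50.0, 50.1, 420.0], "PV(W)": [0.0, 12.0, 15.0, 2900.0]}
)
daytime = filter_daytime(toy)
print(f"kept {len(daytime)} of {len(toy)} rows")  # kept 2 of 4 rows
```

Printing the kept and total row counts after filtering is a cheap check that the threshold removes roughly the expected share of hours.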


## Extract Weather Condition Features

Extracts one-hot encoded weather features:
- **Level 1**: Specific conditions (clear, cloudy, rain, fog, etc.)
- **Level 2**: Broader categories (visibility, precipitation, severe weather)

These help KAN understand weather impacts on solar output.


In [5]:
level1_features = [level for level in df_merge_hourly.columns.tolist() if level.startswith('level1_')]
level2_features = [level for level in df_merge_hourly.columns.tolist() if level.startswith('level2_')]
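
For readers unfamiliar with one-hot encoding, the sketch below shows how columns with a `level1_` prefix might be produced and then picked up by the prefix filter above. How the real dataset was encoded is not shown in this notebook, so the `condition` column and its values are assumptions for illustration.

```python
import pandas as pd

# Illustrative condition labels; the real categories may differ.
toy = pd.DataFrame({"condition": ["clear", "rain", "fog", "rain"]})

# One 0/1 indicator column per distinct condition, named level1_<condition>.
one_hot = pd.get_dummies(toy["condition"], prefix="level1", dtype=int)
toy = pd.concat([toy, one_hot], axis=1)

# Same prefix-based selection as in the cell above.
level1_features = [col for col in toy.columns if col.startswith("level1_")]
print(level1_features)  # ['level1_clear', 'level1_fog', 'level1_rain']
```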

## Define Target Variable and Features

**The cell below only lists the columns and data types of `df_merge_hourly`.** A step with this heading would typically:
- Define the target variable (PV(W), Clearsky_Index, or error terms)
- Select input features
- Remove irrelevant columns

The actual target and feature selection happens in a later cell.


In [6]:

display(pd.DataFrame({"Columns": df_merge_hourly.columns, "Data Types": df_merge_hourly.dtypes}))

| | Columns | Data Types |
|---|---|---|
| index | index | int64 |
| DateTime | DateTime | datetime64[ns] |
| PV(W) | PV(W) | float64 |
| Temperature(C) | Temperature(C) | Float64 |
| Humidity(%) | Humidity(%) | Float64 |
| ... | ... | ... |
| Hour | Hour | int32 |
| Clearsky_Index | Clearsky_Index | float64 |
| PV(W)_error | PV(W)_error | float64 |
| PV(W)_error_index | PV(W)_error_index | float64 |


## Additional Feature Engineering

This cell sets the test metadata and the target column, builds `feature_cols`, removes duplicate entries, and checks that every listed feature exists in `df_merge_hourly`.


In [7]:

# Define target column first; test_name below refers to it
#target_col = 'PV(W)'
# KAN prefers Clearsky_Index
target_col = 'Clearsky_Index'
#target_col = 'PV(W)_error'
#target_col = 'PV(W)_error_index'

# Define test parameters
test_no = 1  # Increment this for each test run
test_name = f"Optimal Features  No Level 2 and No Clearsky - Target {target_col}"
notes = "This is the best combination of features exclude level 2 and no clearsky weather features"

feature_cols = []
# Put change here to add more features
feature_cols.append('Temperature(C)')
feature_cols.append('Humidity(%)')
feature_cols.append('Sunshine Duration')
feature_cols.append('Condition Code')
feature_cols.append('Precipitation(mm)')
feature_cols.append('Dew Point(C)')
feature_cols.append('Wind Direction(deg)')
feature_cols.append('Wind Speed(m/s)')
feature_cols.append('Wind Gust(m/s)')
feature_cols.append('Pressure(hPa)')
feature_cols.append('Snow Depth(cm)')
feature_cols.append('Wind Cooling')
#  level1_features
feature_cols.append('# Observation period')
feature_cols.append('TOA')
feature_cols.append('Clear sky GHI')
feature_cols.append('Clear sky BHI')
feature_cols.append('Clear sky DHI')
feature_cols.append('Clear sky BNI')
feature_cols.append('GHI')
feature_cols.append('BHI')
feature_cols.append('DHI')
feature_cols.append('BNI')
feature_cols.append('Reliability,')
feature_cols.append('POA_Pane_I(W/m^2)')
feature_cols.append('POA_Pane_II(W/m^2)')
feature_cols.append('POAC_Pane_I(W/m^2)')
feature_cols.append('POAC_Pane_II(W/m^2)')
feature_cols.append('Power_Pane_I(W)')
feature_cols.append('Power_Pane_II(W)')
feature_cols.append('Power_ClearSky_Pane_I(W)')
feature_cols.append('Power_ClearSky_Pane_II(W)')
# Related to target: #feature_cols.append('Total_Power_Output(W)')
feature_cols.append('Total_Power_ClearSky_Output(W)')
feature_cols.append('WeekOfYear')
feature_cols.append('Month_Sin')
feature_cols.append('DayOfYear_Sin')
feature_cols.append('HourOfDay_Sin')
feature_cols.append('Month_Cos')
feature_cols.append('DayOfYear_Cos')
feature_cols.append('HourOfDay_Cos')
feature_cols.append('Dew Point(C)_Lag1')
feature_cols.append('Temp_Lag1')
feature_cols.append('Humidity_Lag1')
feature_cols.append('WindSpeed_Lag1')
feature_cols.append('Dew Point(C)_Lag24')
feature_cols.append('Temp_Lag24')
feature_cols.append('Humidity_Lag24')
feature_cols.append('WindSpeed_Lag24')
feature_cols.append('Total_Power_ClearSky_Output(W)_Lag1')
feature_cols.append('Total_Power_ClearSky_Output(W)_Lag24')
# Wind direction as sin and cos
feature_cols.append('WindDir_Sin')
feature_cols.append('WindDir_Cos')

#  level2_features
feature_cols += level2_features
#  level1_features
feature_cols += level1_features

print(f"\nTesting KAN with target: {target_col} and features: {feature_cols}")

# remove any duplicate features, keeping the original order
# (list(set(...)) would reorder the features differently between runs)
feature_cols = list(dict.fromkeys(feature_cols))
print(f"Final feature list (duplicates removed): {feature_cols}")
# check if all feature columns exist in the dataframe
missing_features = [col for col in feature_cols if col not in df_merge_hourly.columns]
if missing_features:
    print(f"❌ ERROR: The following feature columns are missing from the dataframe: {missing_features}")
else:
    print("✅ All feature columns are present in the dataframe.")

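
The feature list above includes cyclical names such as `HourOfDay_Sin` / `HourOfDay_Cos` and lagged names such as `Temp_Lag1` / `Temp_Lag24`. These columns are created upstream and their exact construction is not shown here, so the following is only a sketch of the usual approach: a sine/cosine pair over the cycle length, and row shifts for lags. `add_cyclical` is a hypothetical helper and the temperatures are made up.

```python
import numpy as np
import pandas as pd


def add_cyclical(df: pd.DataFrame, source_col: str, period: float, name: str) -> pd.DataFrame:
    """Encode a periodic column as a sine/cosine pair on the unit circle."""
    angle = 2.0 * np.pi * df[source_col] / period
    df[f"{name}_Sin"] = np.sin(angle)
    df[f"{name}_Cos"] = np.cos(angle)
    return df


# Two days of contiguous hourly rows with made-up temperatures.
stamps = pd.Timestamp("2024-06-01") + pd.to_timedelta(np.arange(48), unit="h")
toy = pd.DataFrame({"Temperature(C)": np.linspace(10.0, 20.0, 48)}, index=stamps)

toy["Hour"] = toy.index.hour
toy = add_cyclical(toy, "Hour", 24, "HourOfDay")

# Row-based lags: one hour back and one day back.
toy["Temp_Lag1"] = toy["Temperature(C)"].shift(1)
toy["Temp_Lag24"] = toy["Temperature(C)"].shift(24)

print(toy[["Hour", "HourOfDay_Sin", "HourOfDay_Cos", "Temp_Lag1", "Temp_Lag24"]].head(3))
```

The sine/cosine pair places hour 23 next to hour 0, which a raw hour number does not. Note that `shift` counts rows, not hours, so lags of this kind only mean "1 hour" or "24 hours" if they are computed on contiguous hourly data, before any nighttime rows are filtered out.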