# Model Development - Solar CAPEX Estimator

This notebook develops a machine learning model that predicts total installed cost (CAPEX) for commercial solar installations using the LBNL Tracking the Sun (TTS) dataset. It walks through five main components, from data loading to model evaluation, that will eventually be combined in a `SolarCapexEstimator` class. The five components are:

1) **DataLoader**, responsible for loading Tracking the Sun data from CSV files and filtering it to rows relevant to our use case (commercial solar installations in the US).

2) **DataCleaner**, responsible for cleaning the data by removing rows with missing or invalid values in the target column (total installed price). It also drops columns where most values are missing or fills in missing values with appropriate strategies.

3) **FeatureEngineer**, responsible for creating new features from the existing data that may help the model learn better, but are still readable and interpretable by users.

4) **ModelTrainer**, responsible for training a machine learning model (e.g. linear regression, random forest, or gradient boosting) on the cleaned and feature-engineered data.

5) **ModelEvaluator**, responsible for evaluating the trained model's performance using appropriate metrics and validation techniques.

Using composition (separate classes coordinated by a higher-level estimator) keeps each step, from data loading and cleaning through feature engineering, training, and evaluation, focused on a single responsibility. This improves modularity, testability, and reuse: each component can be developed, swapped, or improved independently without affecting the rest of the pipeline.
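
As a rough illustration of the composition idea, the sketch below wires two tiny steps into a coordinator that owns them rather than inheriting from them. Everything here is hypothetical: `DropMissingTarget`, `AddPricePerKw`, `EstimatorSketch`, and the `price_per_kw` column are stand-ins invented for this example and are not the notebook's actual classes or features.

```python
import pandas as pd


class DropMissingTarget:
    """Hypothetical cleaning step: drop rows that have no target value."""

    def __init__(self, target_col):
        self.target_col = target_col

    def run(self, df):
        return df.dropna(subset=[self.target_col])


class AddPricePerKw:
    """Hypothetical feature step: installed price per kW of DC capacity."""

    def run(self, df):
        return df.assign(
            price_per_kw=df["total_installed_price"] / df["PV_system_size_DC"]
        )


class EstimatorSketch:
    """Hypothetical coordinator: holds its steps and runs them in order."""

    def __init__(self, steps):
        self.steps = steps

    def run(self, df):
        for step in self.steps:
            df = step.run(df)
        return df


toy = pd.DataFrame({
    "PV_system_size_DC": [10.0, 20.0, 30.0],
    "total_installed_price": [30000.0, None, 60000.0],
})

steps = [DropMissingTarget("total_installed_price"), AddPricePerKw()]
result = EstimatorSketch(steps).run(toy)
print(result)
```

Because the coordinator only relies on each step exposing `run(df)`, a step can be replaced or tested on its own without touching the others.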

## Setup and Imports

Import necessary libraries for data manipulation, visualization, and modeling.

In [1]:
import pandas as pd
import numpy as np
from plotly import graph_objects as go
from pathlib import Path
from typing import Optional, List

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

## 1. DataLoader - Loading TTS Data

We'll start by using our **`DataLoader`** class to load the Tracking the Sun dataset. This class handles the messy details of reading CSV files and filtering them to just the records we care about.

The `DataLoader` class provides:
- Automatic handling of multiple CSV files in a directory
- Date parsing for installation dates
- Year-based filtering (we want 2019-2023)
- Customer segment filtering (commercial and non-residential only)
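
To make the two filters concrete, here is a minimal sketch on a made-up four-row frame. It assumes the same column names as the TTS data (`installation_date`, `customer_segment`) and uses `Series.between`, which is inclusive on both ends, as a compact equivalent of the separate minimum and maximum year checks.

```python
import pandas as pd

# Toy data invented for illustration; only the column names mirror the TTS files.
toy = pd.DataFrame({
    "installation_date": pd.to_datetime(
        ["2018-06-01", "2019-05-09", "2021-03-15", "2024-01-02"]
    ),
    "customer_segment": ["COM", "COM", "RES", "COM"],
})

# Year filter: keep installations from 2019 through 2023 (both ends included).
in_years = toy["installation_date"].dt.year.between(2019, 2023)

# Segment filter: keep only the requested customer segments.
in_segments = toy["customer_segment"].isin(["COM"])

filtered = toy[in_years & in_segments]
print(filtered)
```

Only the 2019 commercial row survives: the 2018 and 2024 rows fall outside the year range, and the 2021 row is residential.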

In [2]:
class DataLoader:
    """
    Data loader for LBNL Tracking the Sun dataset.

    This class handles TTS-specific data loading, cleaning, and filtering
    operations independent of the modeling pipeline.

    Parameters
    ----------
    tts_data_directory : str
        Path to the directory containing the raw TTS data files.

    Attributes
    ----------
    tts_data_directory : Path
        Directory containing the raw TTS data files.
    df : pd.DataFrame or None
        Loaded dataframe; None until load() is called.

    """

    def __init__(self, tts_data_directory: str):
        self.tts_data_directory = Path(tts_data_directory)
        self.df = None

        self.valid_customer_segments = [
            'COM', 'RES_MF', 'RES_SF', 'RES', 'AGRICULTURAL',
            'OTHER TAX-EXEMPT', 'GOV', 'SCHOOL', 'NON-RES', 'NON-PROFIT',
        ]

    def _filter_by_years(self, df, year_min=None, year_max=None):
        """
        Filter data to specific installation years.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to filter.
        year_min : int, optional
            Minimum installation year to include. If None, includes all years.
        year_max : int, optional
            Maximum installation year to include. If None, includes all years.

        Returns
        -------
        pd.DataFrame
            Filtered dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if df is None:
            raise ValueError("Data not loaded. Call load() first.")

        if year_min is not None:
            df = df[df.installation_date.dt.year >= year_min]
        if year_max is not None:
            df = df[df.installation_date.dt.year <= year_max]

        return df

    def _filter_by_customer_segment(self, df, segments):
        """
        Filter data to specific customer segments.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to filter.
        segments : list of str
            Customer segments to include (e.g., ['COM', 'NON-RES']).

        Returns
        -------
        pd.DataFrame
            Filtered dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if df is None:
            raise ValueError("Data not loaded. Call load() first.")

        df = df[df['customer_segment'].isin(segments)]

        return df
    
    def _validate_filters(self, year_min, year_max, customer_segments):
        """
        Validate filter parameters.

        Parameters
        ----------
        year_min : int, optional
            Minimum installation year to include. If None, includes all years.
        year_max : int, optional
            Maximum installation year to include. If None, includes all years.
        customer_segments : list of str, optional
            Customer segments to include (e.g., ['COM', 'NON-RES']).

        Raises
        ------
        ValueError
            If year_min is greater than year_max or if customer_segments is not a list of strings.
        """
        if year_min is not None and year_max is not None and year_min > year_max:
            raise ValueError("year_min cannot be greater than year_max.")
        
        if customer_segments is not None:
            if not isinstance(customer_segments, list) or not all(isinstance(seg, str) for seg in customer_segments):
                raise ValueError("customer_segments must be a list of strings.")
            if not set(customer_segments).issubset(set(self.valid_customer_segments)):
                raise ValueError(f"customer_segments must be a subset of {self.valid_customer_segments}.")
            

    def load(
        self,
        year_min: Optional[int] = None,
        year_max: Optional[int] = None,
        customer_segments: Optional[List[str]] = None
    ):
        """
        Load and filter TTS data with common preprocessing steps.

        Parameters
        ----------
        year_min : int, optional
            Minimum year to filter to. If None, includes all years.
        year_max : int, optional
            Maximum year to filter to. If None, includes all years.
        customer_segments : list of str, optional
            Customer segments to filter to. If None, includes all segments.

        Returns
        -------
        pd.DataFrame
            Filtered dataframe, also stored on ``self.df``.

        Raises
        ------
        ValueError
            If the filters are invalid or the directory contains no CSV files.
        """
        # Validate the filters before reading any files.
        self._validate_filters(year_min, year_max, customer_segments)

        csvs = list(self.tts_data_directory.glob('*.csv'))
        if not csvs:
            raise ValueError(f"No CSV files found in directory {self.tts_data_directory}")

        frames = []
        for csv in csvs:
            csv_df = pd.read_csv(csv, parse_dates=['installation_date'])

            if year_min is not None or year_max is not None:
                csv_df = self._filter_by_years(csv_df, year_min, year_max)

            if customer_segments is not None:
                csv_df = self._filter_by_customer_segment(csv_df, customer_segments)

            frames.append(csv_df)
            print(f"Loaded {len(csv_df)} rows from {csv.name}")

        # Concatenate once instead of growing a dataframe inside the loop.
        self.df = pd.concat(frames, ignore_index=True)
        return self.df


    def get_data(self):
        """
        Get the current dataframe.

        Returns
        -------
        pd.DataFrame
            Current dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if self.df is None:
            raise ValueError("Data not loaded. Call load() first.")

        return self.df

### Instantiate and Load Data

Now we'll create a `DataLoader` instance pointing to our raw data directory and use it to load commercial solar installations from 2019-2023.

In [3]:
tts_dataloader = DataLoader(tts_data_directory='../data/raw')

tts_dataloader.load(year_min=2019, year_max=2023, customer_segments=['COM'])

tts_dataloader.get_data().head()

Loaded 15547 rows from TTS_LBNL_public_file_29-Sep-2025_all.csv


Unnamed: 0,data_provider_1,data_provider_2,system_ID_1,system_ID_2,installation_date,PV_system_size_DC,total_installed_price,rebate_or_grant,customer_segment,expansion_system,multiple_phase_system,TTS_link_ID,new_construction,tracking,ground_mounted,zip_code,city,state,utility_service_territory,third_party_owned,installer_name,self_installed,azimuth_1,azimuth_2,azimuth_3,tilt_1,tilt_2,tilt_3,module_manufacturer_1,module_model_1,module_quantity_1,module_manufacturer_2,module_model_2,module_quantity_2,module_manufacturer_3,module_model_3,module_quantity_3,additional_modules,technology_module_1,technology_module_2,technology_module_3,BIPV_module_1,BIPV_module_2,BIPV_module_3,bifacial_module_1,bifacial_module_2,bifacial_module_3,nameplate_capacity_module_1,nameplate_capacity_module_2,nameplate_capacity_module_3,efficiency_module_1,efficiency_module_2,efficiency_module_3,inverter_manufacturer_1,inverter_model_1,inverter_quantity_1,inverter_manufacturer_2,inverter_model_2,inverter_quantity_2,inverter_manufacturer_3,inverter_model_3,inverter_quantity_3,additional_inverters,micro_inverter_1,micro_inverter_2,micro_inverter_3,built_in_meter_inverter_1,built_in_meter_inverter_2,built_in_meter_inverter_3,output_capacity_inverter_1,output_capacity_inverter_2,output_capacity_inverter_3,DC_optimizer,inverter_loading_ratio,battery_manufacturer,battery_model,battery_rated_capacity_kW,battery_rated_capacity_kWh,battery_price,technology_type,extensions_multiphase_id
0,Frontier Associates,Texas Central Company,19TCC-001,-1,2019-05-09,29.7,54950.0,3920.0,COM,False,False,-1,0.0,0.0,0.0,78045,Laredo,TX,Texas Central Company,0.0,Peg Solar,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,Sonali Energees USA LLC,SS-330,90.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,Multi-c-Si,-1,-1,0,-1,-1,0,-1,-1,330.0,-1.0,-1.0,0.171875,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
1,Frontier Associates,Texas Central Company,19TCC-004,-1,2019-05-22,24.48,74674.0,19584.0,COM,False,False,-1,0.0,0.0,0.0,78041,Laredo,TX,Texas Central Company,0.0,Peg Solar,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,"Trina Solar Co.,Ltd",-1,72.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
2,Frontier Associates,Texas Central Company,19TCC-010,-1,2019-10-08,52.56,181332.0,24390.0,COM,False,False,-1,0.0,0.0,0.0,78570,Mercedes,TX,Texas Central Company,0.0,Ecolectrics,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,SET-Solar,-1,144.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
3,Frontier Associates,Texas Central Company,19TCC-018,-1,2019-03-06,23.8,45764.0,19040.0,COM,False,False,-1,0.0,0.0,0.0,78596,Weslaco,TX,Texas Central Company,0.0,Alba Energy,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,"Trina Solar Co.,Ltd",-1,70.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
4,Frontier Associates,Texas Central Company,19TCC-068,-1,2019-11-27,307.31,952755.47,46155.5,COM,False,False,-1,0.0,-1.0,1.0,78041,Laredo,TX,Texas Central Company,0.0,Freedom Solar Power Tx,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,SunPower,-1,778.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,5.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1


### What We Loaded

We successfully loaded 15.5k rows of commercial solar installation data from 2019-2023. Each row represents one solar installation project with details about system size, location, components, and cost.

-----------------------------------

# 📊 EDA: Datatypes and Missing Values

We want to understand the structure of our data before we clean it and train a model, in particular which columns are numeric versus categorical and which have many missing values. The bar chart below shows the proportion of missing values in each column, with bars colored by datatype (light blue for string columns, salmon for everything else).

Note: We do have to get ahead of ourselves and do sentinel value replacement before we can do a full EDA, since the dataset uses various sentinel values to indicate missing or invalid data. The next section will handle this cleaning step.
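
The effect of sentinel replacement on the missing-value proportions can be seen on a small made-up frame. This is only a sketch: the values are invented, and it assumes, as the chart code does, that `-1` and `"-1"` are the only sentinels that matter.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration, using -1 / "-1" as "missing" markers.
toy = pd.DataFrame({
    "tilt_1": [20.0, -1.0, -1.0, 15.0],
    "module_model_1": ["SS-330", "-1", "-1", "-1"],
    "state": ["TX", "TX", "CA", "CA"],
})

# Without replacement, no cell counts as missing.
print(toy.isna().mean())

# After replacing both the numeric and the string sentinel with NaN,
# isna().mean() gives the share of missing values per column.
missing_share = toy.replace([-1, "-1"], np.nan).isna().mean()
print(missing_share)
```

Skipping the replacement would make every column look complete, which is why it has to happen before the chart is drawn.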

In [4]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=tts_dataloader.get_data().columns,
        y=tts_dataloader.get_data().replace([-1, "-1"], np.nan).isna().mean(),
        marker_color=tts_dataloader.get_data().dtypes.map(lambda dt: 'lightblue' if dt in ['object', 'str'] else 'salmon')
    )
)

fig.update_layout(
    title='Proportion of Missing Values by Column',
    xaxis_title='Column',
    yaxis_title='Proportion of Missing Values',
    xaxis_tickangle=-90,
)

fig.show()

One clear pattern is that many equipment-related columns (e.g. module and inverter details) are mostly missing in their secondary and tertiary slots (e.g. `inverter_model_2`, `inverter_model_3`), and detail, where present, is largely limited to the primary equipment (e.g. `module_manufacturer_1`). This makes sense, since most installations likely use a single module and inverter model, but it also means we will need to drop the secondary and tertiary columns because of their high proportion of missing values.

## 2. DataCleaner - Cleaning the Dataset

Now that we have our raw data loaded, we need to clean it. The **`DataCleaner`** class handles several important cleaning tasks:

- **Sentinel value replacement**: Converts placeholder values like -1 to true NaN
- **Target cleaning**: Removes rows with missing or unrealistically low prices
- **Datatype coercion**: Ensures each column has the appropriate data type
- **Missing data handling**: Drops columns with too many missing values, imputes others
- **Cardinality reduction**: Removes extremely high-cardinality categorical columns that would create too many features
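
The two column-dropping rules in this list can be sketched on a toy frame. The thresholds below are assumed values chosen for the example, and the sketch uses one missing-value threshold for all columns, whereas the class that follows keeps separate thresholds for string and numeric columns.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "installer_name": ["A", "B", "C", "D", "E", "F"],        # every value unique
    "state": ["TX", "TX", "CA", "CA", "TX", "CA"],           # two distinct values
    "tilt_2": [np.nan, np.nan, np.nan, np.nan, 10.0, 12.0],  # 4 of 6 missing
    "PV_system_size_DC": [29.7, 24.5, 52.6, 23.8, 307.3, 40.0],
})

na_threshold = 0.50           # assumed: drop columns with more than 50% missing
cardinality_threshold = 0.50  # assumed: drop text columns with more than 50% unique

# Rule 1: share of missing values per column.
na_share = toy.isna().mean()
high_na = na_share[na_share > na_threshold].index.tolist()

# Rule 2: share of unique values, checked for non-numeric columns only.
text_cols = toy.select_dtypes(exclude="number").columns
unique_share = toy[text_cols].nunique() / len(toy)
high_card = unique_share[unique_share > cardinality_threshold].index.tolist()

print("High-NA columns:", high_na)
print("High-cardinality columns:", high_card)

reduced = toy.drop(columns=high_na + high_card)
```

Here `tilt_2` is dropped for sparsity and `installer_name` for cardinality, while `state` and the numeric size column are kept.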

In [5]:
class DataCleaner:
    """
    Data cleaner for LBNL Tracking the Sun dataset.

    This class handles TTS-specific data cleaning operations independent of the modeling pipeline.


    """

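    def __init__(self, config_min_target_value=10, config_high_cardinality_threshold=0.05, config_na_drop_thresholds=None):
        """
        Initialize the DataCleaner with configuration parameters.

        Parameters
        ----------
        config_min_target_value : float, optional
            Minimum valid target value. Rows with target values below this will be removed. Default is 10.
        config_high_cardinality_threshold : float, optional
            Proportion of unique values above which a string column will be dropped. Default is 0.05 (5%).
        config_na_drop_thresholds : dict, optional
            Proportion of missing values above which a column will be dropped, keyed by
            'string_columns' and 'numeric_columns'. Default is 0.20 for string columns
            and 0.50 for numeric columns.
        """

        self.df = None

        # Build the default here so instances never share one mutable dict.
        if config_na_drop_thresholds is None:
            config_na_drop_thresholds = {'string_columns': 0.20, 'numeric_columns': 0.50}

        self.config_min_target_value = config_min_target_value
        self.config_high_cardinality_threshold = config_high_cardinality_threshold
        self.config_na_drop_thresholds = config_na_drop_thresholds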

    def load_data(self, df):
        """
        Load data into the cleaner.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        None
        """
        self.df = df

    def _make_true_na(self, df):
        """
        Convert common placeholder values for missing data to true NaN.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with true NaN values.
        """
        df = df.replace([-1, "-1"], np.nan)
        
        return df
    
    def _coerce_datatypes(self, df):
        """
        Coerce datatypes of columns to appropriate types.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with coerced datatypes.
        """
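        for col in df.columns:
            if 'date' in col.lower():
                df[col] = pd.to_datetime(df[col], errors='coerce')

            elif "zip" in col.lower() or "postal" in col.lower():
                # Zero-pad ZIP codes as strings. Strip the '.0' left by float columns
                # and keep missing values as NaN rather than the literal string 'nan'.
                zips = df[col].astype(str).str.replace(r'\.0$', '', regex=True).str.zfill(5)
                df[col] = zips.where(df[col].notna())
            elif df[col].dtype in ['bool']:
                df[col] = df[col].astype(int)
            elif df[col].dtype in ['object', 'str']:
                try:
                    df[col] = pd.to_numeric(df[col])
                except Exception:
                    # Convert present values to str but leave missing values as NaN
                    # so that downstream imputers can still detect them.
                    df[col] = df[col].where(df[col].isna(), df[col].astype(str))
        return df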

    def _clean_by_target(self, df, target_col):
        """
        Clean the target variable by removing rows with missing or invalid values.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
        target_col : str
            Name of the target column to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe.
        """
        before_count = len(df)
        df = df.dropna(subset=[target_col])
        df = df[df[target_col] >= self.config_min_target_value]
        after_count = len(df)
        print(f"> Removed {before_count - after_count} rows with missing or invalid target values.")
        return df
    
    def _drop_high_na_columns(self, df):
        """
        Drop columns whose proportion of missing values exceeds the configured thresholds.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with high-NA columns dropped.
        """

        if len(df) == 0:
            print("Warning: Dataframe is empty. Skipping high-NA column drop.")
            return df
        
        string_cols = df.select_dtypes(include=['object']).columns
        numeric_cols = df.select_dtypes(include=[np.number]).columns

        string_na_proportions = df[string_cols].isna().mean()
        numeric_na_proportions = df[numeric_cols].isna().mean()

        cols_to_drop_string = string_na_proportions[string_na_proportions > self.config_na_drop_thresholds['string_columns']].index
        cols_to_drop_numeric = numeric_na_proportions[numeric_na_proportions > self.config_na_drop_thresholds['numeric_columns']].index

        cols_to_drop = list(cols_to_drop_string) + list(cols_to_drop_numeric)
        print(f"> Dropping columns with majority NA values: {cols_to_drop}")
        df = df.drop(columns=cols_to_drop)

        return df


    
    def _drop_high_cardinality_columns(self, df):
        """
        Drop columns that have a high proportion of unique values.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
            
        Returns
        -------
        pd.DataFrame 
            Cleaned dataframe with high-cardinality columns dropped.
        """
        if len(df) == 0:
            print("Warning: Dataframe is empty. Skipping high-cardinality column drop.")
            return df
        
        unique_proportions = df.nunique() / len(df)

        cols_to_drop = unique_proportions[unique_proportions > self.config_high_cardinality_threshold].index
        cols_to_drop = [col for col in cols_to_drop if df[col].dtype in ['object', 'str']]
        print(f"> Dropping high-cardinality columns: {cols_to_drop}")
        df = df.drop(columns=cols_to_drop)

        return df
    
    def _drop_single_value_columns(self, df):
        """
        Drop columns that have only a single unique value.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
            
        Returns
        -------
        pd.DataFrame 
            Cleaned dataframe with single-value columns dropped.
        """
        single_value_cols = df.columns[df.nunique() <= 1]
        print(f"> Dropping single-value columns: {list(single_value_cols)}")
        df = df.drop(columns=single_value_cols)

        return df
    
    def clean(self, target_col='total_installed_price'):
        """
        Perform all cleaning steps on the loaded dataframe.

        Parameters
        ----------
        target_col : str, optional
            Target column to clean. Default is 'total_installed_price'.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if self.df is None:
            raise ValueError("Data not loaded. Call load_data() first.")

        self.df = self._make_true_na(self.df)
        self.df = self._clean_by_target(self.df, target_col)
        self.df = self._drop_high_na_columns(self.df)
        self.df = self._drop_single_value_columns(self.df)
        self.df = self._drop_high_cardinality_columns(self.df)
        self.df = self._coerce_datatypes(self.df)
        return self.df

### Apply Data Cleaning

Let's instantiate our `DataCleaner`, load our data into it, and apply all cleaning operations with a single `.clean()` call.


In [6]:
data_cleaner_config = {
    'config_min_target_value': 10,
    'config_high_cardinality_threshold': 0.10,
    'config_na_drop_thresholds': {'string_columns': 0.10, 'numeric_columns': 0.50}
}

tts_cleaner = DataCleaner(**data_cleaner_config)
tts_cleaner.load_data(tts_dataloader.get_data())
cleaned_df = tts_cleaner.clean(target_col='total_installed_price')

cleaned_df.shape

> Removed 5142 rows with missing or invalid target values.
> Dropping columns with majority NA values: ['data_provider_2', 'system_ID_2', 'TTS_link_ID', 'module_model_1', 'module_manufacturer_2', 'module_model_2', 'module_manufacturer_3', 'module_model_3', 'technology_module_1', 'technology_module_2', 'technology_module_3', 'inverter_manufacturer_1', 'inverter_model_1', 'inverter_manufacturer_2', 'inverter_model_2', 'inverter_manufacturer_3', 'inverter_model_3', 'battery_manufacturer', 'battery_model', 'extensions_multiphase_id', 'new_construction', 'azimuth_2', 'azimuth_3', 'tilt_2', 'tilt_3', 'module_quantity_2', 'module_quantity_3', 'BIPV_module_2', 'BIPV_module_3', 'bifacial_module_2', 'bifacial_module_3', 'nameplate_capacity_module_2', 'nameplate_capacity_module_3', 'efficiency_module_2', 'efficiency_module_3', 'inverter_quantity_2', 'inverter_quantity_3', 'micro_inverter_2', 'micro_inverter_3', 'built_in_meter_inverter_2', 'built_in_meter_inverter_3', 'output_capacity_inverter_2'

(10405, 30)

### What We Cleaned

The `DataCleaner` removed 5,142 rows with missing or invalid prices and dropped sparse, single-value, and high-cardinality columns (including `zip_code`), leaving 10,405 rows and 30 columns. Our data is now clean, properly typed, and ready for feature engineering.

-----------------------

## 3. FeatureEngineer - Creating Useful Features

With clean data in hand, we'll use our **`FeatureEngineer`** class to create new features that help our model learn better patterns.

Currently, our feature engineer creates:
- **`days_since_2000`**: A temporal feature representing how many days after January 1, 2000 the system was installed. This captures time trends in solar pricing more effectively than raw dates.
- **`total_module_count`**: A feature that sums the `module_quantity_*` columns into a single total module count, which may be more predictive than the individual per-slot counts.
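
Both features can be reproduced on a toy frame to see what values they take. The data below is invented; the column names follow the TTS naming used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "installation_date": pd.to_datetime(["2000-01-02", "2001-01-01", "2019-05-09"]),
    "module_quantity_1": [90.0, 72.0, np.nan],
    "module_quantity_2": [np.nan, 10.0, np.nan],
})

# Days elapsed between 2000-01-01 and the installation date.
toy["days_since_2000"] = (
    toy["installation_date"] - pd.Timestamp("2000-01-01")
).dt.days

# Sum all module quantity columns, treating missing quantities as zero.
module_cols = [col for col in toy.columns if "module_quantity" in col]
toy["total_module_count"] = toy[module_cols].fillna(0).sum(axis=1)

print(toy[["days_since_2000", "total_module_count"]])
```

Note that the last row ends up with a total of 0 because every quantity is missing, so a zero in this feature can mean "unknown" rather than "no modules".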

In [7]:
class FeatureEngineer:
    """
    Feature engineer for LBNL Tracking the Sun dataset.

    This class handles TTS-specific feature engineering operations independent of the modeling pipeline.

    Parameters
    ----------
    None

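    Attributes
    ----------
    df : pd.DataFrame or None
        Dataframe to engineer; None until load_data() is called.

    """

    def __init__(self):
        # Initialize df so engineer_features() can detect that no data was loaded.
        self.df = None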

    def load_data(self, df):
        """
        Load data into the feature engineer.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to engineer.

        Returns
        -------
        None
        """
        self.df = df

    def _add_day_count(self, df):
        """
        Add a feature for the number of days between January 1, 2000 and the installation date.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to engineer.

        Returns
        -------
        pd.DataFrame
            Dataframe with new 'days_since_2000' feature.
        """
        df['days_since_2000'] = (df['installation_date'] - pd.Timestamp('2000-01-01')).dt.days
        
        return df
    
    def _combine_module_counts(self, df):
        """
        Combine module count features into a single feature.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to engineer.

        Returns
        -------
        pd.DataFrame
            Dataframe with new 'total_module_count' feature.
        """
        module_cols = [col for col in df.columns if 'module_quantity' in col.lower()]
        df['total_module_count'] = df[module_cols].fillna(0).sum(axis=1)

        df = df.drop(columns=module_cols)
        
        return df
    
    def engineer_features(self):
        """
        Perform all feature engineering steps on the loaded dataframe.

        Parameters
        ----------
        None

        Returns
        -------
        pd.DataFrame
            Dataframe with engineered features.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if self.df is None:
            raise ValueError("Data not loaded. Call load_data() first.")

        df = self._add_day_count(self.df)
        df = self._combine_module_counts(df)

        return df

### Apply Feature Engineer

Below, we instantiate our `FeatureEngineer` and apply it to our cleaned dataset to create the new features. This generates two new features:
* `days_since_2000`: A numeric feature representing how many days after January 1, 2000 the system was installed. This captures time trends in solar pricing more effectively than raw dates.
* `total_module_count`: A numeric feature representing the total number of modules in the system.

In [8]:
feature_engineer = FeatureEngineer()

feature_engineer.load_data(cleaned_df)
engineered_df = feature_engineer.engineer_features()

---------------------

## 4. Preprocessor - Preparing Features for Modeling

Now we need to transform our features into a format that our machine learning model can understand. The **`Preprocessor`** class automatically:

1. **Sorts columns** by type (numerical, binary, low-cardinality categorical, high-cardinality categorical)
2. **Builds a sklearn ColumnTransformer** that imputes missing values and then applies the right transformation to each column type:
   - **One-hot encoding** for low-cardinality categoricals with fewer than 10 unique values (creates binary columns)
   - **Target encoding** for high-cardinality categoricals (replaces each category with an encoding based on the mean target value)
   - **Imputation only** for binary columns (missing values are filled with the most frequent value, with no further encoding)
   - **Standard scaling** for numerical features (normalizes to mean=0, std=1)

Taken together, this preprocessing pipeline puts every feature in a format the model can consume, and target encoding keeps high-cardinality categoricals from expanding into a large number of one-hot columns.



In [9]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

class Preprocessor():
    """
    Preprocessor for our Capex solar cost estimation model.
    """

    def __init__(self, target_col='total_installed_price'):
        """Initialize the Preprocessor with configuration parameters."""
        self.target_col = target_col
        self.columns = None

        self.preprocessor = None

    def _sort_columns(self, df: pd.DataFrame):
        """Sort columns into numerical, binary, low-cardinality categorical, and high-cardinality categorical based on datatypes and unique value counts.
        
        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to use for determining column types.
        """
        self.columns = dict(
            target_col = self.target_col,
            num_cols = [],
            binary_cols = [],
            cat_low_card_cols = [],
            cat_high_card_cols = []
        )
        
        for col in df.columns:
            if col == self.columns['target_col']:
                continue
            elif df[col].dtype in ['int64', 'float64']:
                self.columns['num_cols'].append(col)
            elif df[col].dtype == 'bool' or (df[col].nunique() == 2):
                self.columns['binary_cols'].append(col)
            elif df[col].dtype in ['str', 'object', 'category']:
                if df[col].nunique() < 10:
                    self.columns['cat_low_card_cols'].append(col)
                else:
                    self.columns['cat_high_card_cols'].append(col)

    def get_feature_names(self):
        """Get feature names after preprocessing"""
        if self.preprocessor is None:
            raise ValueError("Preprocessor not built. Must be fit to data first.")
        return self.preprocessor.get_feature_names_out()

    def build_preprocessor(self, df):
        """
        Build the preprocessing pipeline based on the dataframe's columns and datatypes.
        
        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to use for determining column types and building the preprocessor.
        
        Returns
        -------
        ColumnTransformer
            The built preprocessing pipeline.
        """
        self._sort_columns(df)
        
        low_card_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
        ]).set_output(transform="pandas")
        
        high_card_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', TargetEncoder())
        ]).set_output(transform="pandas")
        
        binary_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent'))
        ]).set_output(transform="pandas")
        
        num_pipeline = Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]).set_output(transform="pandas")
        
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('cat_low_card', low_card_pipeline, self.columns['cat_low_card_cols']),
                ('cat_high_card', high_card_pipeline, self.columns['cat_high_card_cols']),
                ('binary', binary_pipeline, self.columns['binary_cols']),
                ('num', num_pipeline, self.columns['num_cols']),
            ], 
            remainder='drop'
        ).set_output(transform="pandas")
        
        return self.preprocessor

In [10]:
preprocessor = Preprocessor(target_col='total_installed_price')
built_preprocessor = preprocessor.build_preprocessor(engineered_df)
preprocessor.preprocessor.fit(engineered_df.drop(columns=['total_installed_price']), engineered_df[preprocessor.target_col])


# 📊 EDA: Identifying Relevant Features

We can use scikit-learn's `SelectKBest` with the `f_regression` score function to rank features by the strength of their univariate linear relationship with the target variable (total installed price). The ranking shows which features carry the most signal for predicting solar CAPEX, and it may let us drop weak features to get a simpler, more interpretable model.

In [11]:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)

preprocessed_df = preprocessor.preprocessor.fit_transform(engineered_df.drop(columns=['total_installed_price']), engineered_df[preprocessor.target_col])

selector.fit(preprocessed_df, engineered_df[preprocessor.target_col])

scores = selector.scores_
feature_names = preprocessor.preprocessor.get_feature_names_out()

feature_scores = pd.DataFrame({'feature': feature_names, 'score': scores})
feature_scores = feature_scores.sort_values(by='score', ascending=False)

In [12]:
fig = go.Figure(data=[go.Bar(x=feature_scores['feature'], y=feature_scores['score'])])
fig.update_layout(title='Feature Scores from SelectKBest', xaxis_title='Feature', yaxis_title='Score', xaxis_tickangle=-45)
fig.show()
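
Beyond the raw scores, a fitted selector also reports which columns made the top-k cut. The following is a minimal sketch on synthetic data, assuming hypothetical column names `f0` to `f5` and a toy target rather than the TTS columns. It pairs `get_support()` with the input column names:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the preprocessed frame (hypothetical columns f0..f5).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"f{i}" for i in range(6)])
# The toy target depends only on f0 and f3, plus a little noise.
y_demo = 5.0 * X_demo["f0"] - 3.0 * X_demo["f3"] + rng.normal(scale=0.1, size=200)

demo_selector = SelectKBest(score_func=f_regression, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns.
selected = X_demo.columns[demo_selector.get_support()].tolist()
print(selected)
```

Because `f_regression` scores each feature on its own with a linear test, it can miss features that only matter through interactions or nonlinear effects, which tree-based models can still exploit.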

----------------

## Model Selection


In [13]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split, cross_val_score

rfr_pipeline = Pipeline([
    ("preprocessor", built_preprocessor),
    ("model", RandomForestRegressor(random_state=42, n_estimators=100, max_depth=9))
])

knn_pipeline = Pipeline([
    ("preprocessor", built_preprocessor),
    ('selector', SelectKBest(score_func=f_regression, k=5)),
    ("model", KNeighborsRegressor())
])

lr_pipeline = Pipeline([
    ("preprocessor", built_preprocessor),
    ('selector', SelectKBest(score_func=f_regression, k=5)),
    ("model", LinearRegression())
])

X = engineered_df.drop(columns=[preprocessor.target_col])
y = engineered_df[preprocessor.target_col]

rfr_score = cross_val_score(rfr_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
knn_score = cross_val_score(knn_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
lr_score = cross_val_score(lr_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

results_df = pd.DataFrame(
    index = ['Random Forest', 'KNN', 'Linear Regression'],
    data = {
    'mae_max': [-rfr_score.min(), -knn_score.min(), -lr_score.min()],
    'mae_mean': [-rfr_score.mean(), -knn_score.mean(), -lr_score.mean()],
    'mae_min': [-rfr_score.max(), -knn_score.max(), -lr_score.max()],
    'mae_std': [rfr_score.std(), knn_score.std(), lr_score.std()]

})


In [14]:
results_df.round(0)

model,mae_max,mae_mean,mae_min,mae_std
Random Forest,279124.0,174416.0,127936.0,54632.0
KNN,326999.0,242563.0,157749.0,71607.0
Linear Regression,293032.0,188852.0,126812.0,61331.0
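
The sign handling in the results table is easy to get backwards: `neg_mean_absolute_error` returns negated errors, so the lowest score corresponds to the highest MAE. A small sketch of the same summary follows. The helper name `summarize_neg_mae` and the fold scores are hypothetical, chosen only for illustration:

```python
import numpy as np

def summarize_neg_mae(scores):
    """Summarize fold scores produced with scoring='neg_mean_absolute_error'."""
    mae = -np.asarray(scores, dtype=float)  # flip the sign back to a positive MAE
    return {
        "mae_max": mae.max(),
        "mae_mean": mae.mean(),
        "mae_min": mae.min(),
        "mae_std": mae.std(),
    }

# Hypothetical fold scores, not taken from the notebook's run.
demo_scores = [-120.0, -100.0, -140.0]
summary = summarize_neg_mae(demo_scores)
print(summary)
```

Negating once up front avoids having to pair `min()` with `mae_max` and `max()` with `mae_min` by hand.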


### Hyperparameter Tuning Setup

Before training our final model, we need to define:
1. **Preprocessor**: We'll use the fitted preprocessor from earlier
2. **Parameter grid**: Define which hyperparameters to search over

In [15]:
from sklearn.model_selection import GridSearchCV

# Use the preprocessor we built earlier
built_preprocessor = preprocessor.preprocessor

# Define parameter grid for Random Forest hyperparameter tuning
param_grid = {
    'model__n_estimators': [50, 100, 150],
    'model__max_depth': [5, 7, 9],
    'model__min_samples_split': [2, 5, 10]
}

print("Preprocessor ready for training pipeline")
print(f"Parameter grid: {param_grid}")

Preprocessor ready for training pipeline
Parameter grid: {'model__n_estimators': [50, 100, 150], 'model__max_depth': [5, 7, 9], 'model__min_samples_split': [2, 5, 10]}


### Model Trainer Class

The **`RFRlTrainer`** class handles the training workflow:
1. **GridSearchCV**: Performs cross-validation to find the best hyperparameters
2. **Model retraining**: Trains the final model with best parameters on the full dataset
3. **Model persistence**: Saves the trained model to disk for later use
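
A note on step 2: with its default `refit=True`, `GridSearchCV` already refits the best configuration on all the data passed to `fit` and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data. The dataset and the tiny grid are assumptions for the example, not the notebook's setup:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (stand-in for the TTS features).
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)

demo_grid = GridSearchCV(
    RandomForestRegressor(random_state=42, n_estimators=20),
    param_grid={"max_depth": [2, 4]},
    cv=3,
    scoring="neg_mean_squared_error",
)  # refit=True is the default
demo_grid.fit(X_demo, y_demo)

# best_estimator_ has already been refit on all of X_demo, y_demo.
final_model = demo_grid.best_estimator_
demo_preds = final_model.predict(X_demo[:3])
print(demo_grid.best_params_, demo_preds.shape)
```

Rebuilding the pipeline by hand, as the trainer class does, should yield an equivalent model when the random seed is fixed. Using `best_estimator_` is simply the shorter route.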

In [16]:
import joblib


class RFRlTrainer:
    def __init__(self, param_grid: dict, preprocessor):
        self.param_grid = param_grid
        self.model_pipeline = Pipeline([
                ("preprocessor", preprocessor),
                ("model", RandomForestRegressor(random_state=42, n_estimators=100, max_depth=9))
            ])
    
    def train_new_model(self, X, y, filepath=Path('../models/best_model.pkl')):
        grid = GridSearchCV(
            self.model_pipeline,
            param_grid=self.param_grid,
            cv=5,
            scoring="neg_mean_squared_error",
            n_jobs=-1
        )
        
        print("Training new model with GridSearchCV...")
        grid.fit(X, y)

        print(f"Best parameters discovered: {grid.best_params_}")
        print(f"Best CV RMSE: {np.sqrt(-grid.best_score_):.4f}")
        print("----------------------------------")
        print("Retraining best model on full dataset...")

        best_params = grid.best_params_

        best_model = Pipeline([
                ("preprocessor", self.model_pipeline.named_steps["preprocessor"]),
                ("model", RandomForestRegressor(random_state=42, **{k.replace('model__', ''): v for k, v in best_params.items()}))
            ])
        
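        best_model.fit(X, y)

        # Make sure the target directory exists before writing the model file
        Path(filepath).parent.mkdir(parents=True, exist_ok=True)
        joblib.dump(best_model, filepath)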
        
        print(f"Best model saved to {filepath}")

In [17]:
model_trainer = RFRlTrainer(param_grid=param_grid, preprocessor=built_preprocessor)
model_trainer.train_new_model(X, y)

Training new model with GridSearchCV...
Best parameters discovered: {'model__max_depth': 9, 'model__min_samples_split': 10, 'model__n_estimators': 150}
Best CV RMSE: 656416.7236
----------------------------------
Retraining best model on full dataset...
Best model saved to ../models/best_model.pkl


### Training Results

GridSearchCV selected the best-scoring hyperparameters within the grid for the Random Forest model. The resulting pipeline was saved to `../models/best_model.pkl` and is ready for evaluation. The selected `max_depth` (9), `min_samples_split` (10), and `n_estimators` (150) all sit at the upper edge of their ranges, so a wider grid might find a better configuration.

The hyperparameter search compared 27 model configurations (3 values for `n_estimators` × 3 for `max_depth` × 3 for `min_samples_split`) using 5-fold cross-validation.
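
The size of the search can be checked directly from the grid. This sketch enumerates the combinations with `itertools.product`. The variable names `candidates`, `n_candidates`, and `n_cv_fits` are introduced here for illustration:

```python
from itertools import product

param_grid = {
    "model__n_estimators": [50, 100, 150],
    "model__max_depth": [5, 7, 9],
    "model__min_samples_split": [2, 5, 10],
}

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]
n_candidates = len(candidates)
n_cv_fits = n_candidates * 5  # 5-fold CV fits each candidate five times

print(n_candidates, n_cv_fits)  # 27 135
```

Adding a fourth value to any one hyperparameter would raise the count to 36 candidates and 180 cross-validation fits, so grid size is worth checking before widening the search.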

-------------------

## 5. Model Evaluation - Validating Performance

Now that we've trained our best model, we evaluate its performance on a test split. This section will:

1. **Split data** into train and test sets
2. **Calculate key metrics** (MAE, RMSE, R²) to quantify prediction accuracy
3. **Visualize predictions** with actual vs predicted plots to spot patterns
4. **Analyze residuals** to check for systematic errors
5. **Explain predictions** using SHAP values to understand what drives the model

One caveat: `best_model` was fit on the full dataset in the previous section, so the test rows used below were also seen during training. The metrics in this section are therefore optimistic, and the cross-validated RMSE from the grid search (about $656,000) is the more reliable estimate of out-of-sample error.

In [18]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the best model
best_model = joblib.load('../models/best_model.pkl')

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Make predictions on test set
y_pred = best_model.predict(X_test)

# Calculate metrics
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print("Model Performance on Test Set:")
print(f"  Mean Absolute Error (MAE): ${mae:,.0f}")
print(f"  Root Mean Squared Error (RMSE): ${rmse:,.0f}")
print(f"  R² Score: {r2:.4f}")
print("\nInterpretation:")
print(f"  On average, predictions are off by ${mae:,.0f}")
print(f"  The model explains {r2*100:.1f}% of the variance in solar CAPEX")

Model Performance on Test Set:
  Mean Absolute Error (MAE): $127,042
  Root Mean Squared Error (RMSE): $348,162
  R² Score: 0.9120

Interpretation:
  On average, predictions are off by $127,042
  The model explains 91.2% of the variance in solar CAPEX
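
For a test score to reflect unseen data, the split has to happen before fitting. The sketch below contrasts the two orderings on synthetic data. The dataset, model settings, and the names `held_out_model` and `leaky_model` are assumptions for illustration, not the notebook's pipeline:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

# Split first, so the test rows never reach fit().
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit on the training rows only: the test set is truly held out.
held_out_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)
# Fit on everything: the test rows leak into training.
leaky_model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_demo, y_demo)

held_out_mae = mean_absolute_error(y_te, held_out_model.predict(X_te))
leaky_mae = mean_absolute_error(y_te, leaky_model.predict(X_te))

print(f"held-out MAE: {held_out_mae:.1f}")
print(f"leaky MAE:    {leaky_mae:.1f}")
```

The leaky model scores better on the same test rows only because it has already seen them. In this notebook, the equivalent fix would be to run the grid search on `X_train, y_train` and reserve `X_test` for the final report.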


### Actual vs Predicted Plot

This scatter plot compares our model's predictions against the actual solar CAPEX values. Points closer to the diagonal line indicate better predictions. Marker color encodes the actual installed price, which makes the highest-cost projects easy to pick out.

In [19]:
fig = go.Figure()

# Add scatter plot of actual vs predicted
fig.add_trace(go.Scatter(
    x=y_test,
    y=y_pred,
    mode='markers',
    marker=dict(
        size=5,
        color=y_test,
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title="Actual Price ($)"),
        opacity=0.6
    ),
    name='Predictions'
))

# Add perfect prediction line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    line=dict(color='red', dash='dash', width=2),
    name='Perfect Prediction'
))

fig.update_layout(
    title=f'Actual vs Predicted Solar CAPEX<br><sub>R² = {r2:.4f}, MAE = ${mae:,.0f}</sub>',
    xaxis_title='Actual Total Installed Price ($)',
    yaxis_title='Predicted Total Installed Price ($)',
    width=800,
    height=600,
    hovermode='closest'
)

fig.show()

### Residual Analysis

Residuals are the differences between actual and predicted values (error = actual - predicted). A good model should have:
- **Residuals centered around zero** (no systematic over/under-prediction)
- **Constant variance across price ranges** (homoscedasticity)
- **No patterns in the residual plot** (random scatter indicates good fit)

Let's visualize the residuals to check these assumptions.

In [20]:
# Calculate residuals
residuals = y_test - y_pred

# Create subplots for comprehensive residual analysis
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Residual Plot', 'Residual Distribution'),
    horizontal_spacing=0.12
)

# Residual scatter plot
fig.add_trace(
    go.Scatter(
        x=y_pred,
        y=residuals,
        mode='markers',
        marker=dict(
            size=5,
            color=np.abs(residuals),
            colorscale='Reds',
            showscale=True,
            colorbar=dict(title="Abs Error ($)", x=0.46),
            opacity=0.6
        ),
        name='Residuals'
    ),
    row=1, col=1
)

# Add zero line
fig.add_trace(
    go.Scatter(
        x=[y_pred.min(), y_pred.max()],
        y=[0, 0],
        mode='lines',
        line=dict(color='black', dash='dash', width=1),
        showlegend=False
    ),
    row=1, col=1
)

# Histogram of residuals
fig.add_trace(
    go.Histogram(
        x=residuals,
        nbinsx=50,
        marker_color='steelblue',
        name='Distribution',
        showlegend=False
    ),
    row=1, col=2
)

fig.update_xaxes(title_text="Predicted Price ($)", row=1, col=1)
fig.update_yaxes(title_text="Residual (Actual - Predicted) ($)", row=1, col=1)
fig.update_xaxes(title_text="Residual ($)", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_layout(
    title='Residual Analysis',
    width=1200,
    height=500,
    showlegend=False
)

fig.show()

print("Residual Statistics:")
print(f"  Mean: ${residuals.mean():,.0f} (should be close to 0)")
print(f"  Std Dev: ${residuals.std():,.0f}")
print(f"  Median: ${residuals.median():,.0f}")

Residual Statistics:
  Mean: $-4,988 (should be close to 0)
  Std Dev: $348,210
  Median: $-10,972
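
The constant-variance check can also be put into numbers rather than read off the plot. One possible approach, sketched below, sorts observations by predicted value, splits them into equal-count bins, and compares the residual spread per bin. The helper `residual_spread_by_bin` and the synthetic data are hypothetical and exist only for this illustration.

```python
import numpy as np

def residual_spread_by_bin(predicted, residual, n_bins=4):
    """Return a list of (mean, std) of residuals within equal-count bins of predicted value."""
    predicted = np.asarray(predicted, dtype=float)
    residual = np.asarray(residual, dtype=float)
    order = np.argsort(predicted)  # indices from lowest to highest prediction
    return [
        (residual[chunk].mean(), residual[chunk].std())
        for chunk in np.array_split(order, n_bins)
    ]

# Synthetic data for illustration: noise is deliberately proportional to the prediction
demo_rng = np.random.default_rng(0)
demo_pred = demo_rng.uniform(50_000, 2_000_000, size=2_000)
demo_resid = demo_rng.normal(0.0, 0.1 * demo_pred)

demo_spread = residual_spread_by_bin(demo_pred, demo_resid)
for i, (bin_mean, bin_std) in enumerate(demo_spread, start=1):
    print(f"bin {i}: mean residual = {bin_mean:,.0f}, std = {bin_std:,.0f}")
```

A standard deviation that climbs from the first bin to the last, as it does for this synthetic data, signals heteroscedasticity. On the notebook's data the call would be `residual_spread_by_bin(y_pred, residuals)`.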


### Feature Importance

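Random Forest models provide a built-in measure of feature importance based on how much each feature decreases impurity across all trees. Features with higher importance scores contribute more to the model's predictions. These impurity-based scores are computed from the training data and tend to favor high-cardinality features, so they are best read as a rough ranking rather than an exact attribution.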

This helps us understand:
- **Which features matter most** for predicting solar CAPEX
- **Whether our feature engineering** created useful predictors
- **If the model aligns** with domain knowledge (e.g., system size should be important)

In [22]:
# Extract feature importance from the Random Forest model
rf_model = best_model.named_steps['model']
feature_names = best_model.named_steps['preprocessor'].get_feature_names_out()
importances = rf_model.feature_importances_

# Create a dataframe of feature importances
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False).head(15)

# Create horizontal bar chart
fig = go.Figure()

fig.add_trace(go.Bar(
    y=importance_df['feature'][::-1],  # Reverse to show highest at top
    x=importance_df['importance'][::-1],
    orientation='h',
    marker=dict(
        color=importance_df['importance'][::-1],  # colorscale only applies when color is numeric
        colorscale='Blues',
        showscale=False
    )
))

fig.update_layout(
    title='Top 15 Most Important Features for Predicting Solar CAPEX',
    xaxis_title='Feature Importance Score',
    yaxis_title='Feature',
    width=900,
    height=600,
    margin=dict(l=200)  # Extra margin for long feature names
)

fig.show()

# Print top 5 features
print("\nTop 5 Most Important Features:")
for _, row in importance_df.head(5).iterrows():
    print(f"  {row['feature']}: {row['importance']:.4f}")


Top 5 Most Important Features:
  num__PV_system_size_DC: 0.8599
  cat_high_card__utility_service_territory: 0.0343
  num__total_module_count: 0.0197
  num__days_since_2000: 0.0110
  cat_high_card__data_provider_1: 0.0087
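
Because impurity-based scores come from the training data, a permutation-based check on held-out data is a common cross-check: shuffle one feature at a time and measure how much the error grows. The sketch below is a minimal NumPy version of that idea, not the notebook's method. The function `permutation_importance_mae`, the stand-in model `toy_predict`, and the demo arrays are all hypothetical.

```python
import numpy as np

def permutation_importance_mae(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in MAE when each column of X is shuffled independently."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean(np.abs(y - predict(X)))
    increases = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between this column and the target
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            increases[col] += np.mean(np.abs(y - predict(X_perm))) - baseline
    return increases / n_repeats

# Toy stand-in model (hypothetical): cost depends only on column 0
def toy_predict(X):
    return 2_000.0 * X[:, 0]

demo_rng = np.random.default_rng(1)
X_demo = demo_rng.uniform(0, 500, size=(300, 2))
y_demo = toy_predict(X_demo)

demo_importance = permutation_importance_mae(toy_predict, X_demo, y_demo)
print(demo_importance)
```

Shuffling the unused second column leaves the error unchanged, so its importance is zero, while the first column's is large. This sketch handles numeric arrays only; for the fitted pipeline with its mixed column types, scikit-learn's `sklearn.inspection.permutation_importance` applied to `best_model` and the test split would be the more natural tool.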


-------------------

## Summary and Key Takeaways

### Model Performance
Our Random Forest model achieved solid performance on commercial solar CAPEX prediction:
- **MAE**: Predictions are off by \$127,042 on average on the test set
- **RMSE**: \$348,162, well above the MAE, which points to a small number of large errors
- **R² Score**: 0.9120, so the model explains 91.2% of the variance in solar installation costs
- **Residuals**: Centered near zero (mean -\$4,988, median -\$10,972), with a slight tendency to over-predict

### Most Important Features
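Based on the Random Forest feature importances, the key drivers of solar CAPEX are:
1. **System size** - `PV_system_size_DC` dominates with an importance of 0.86; larger systems cost more (as expected)
2. **Geographic location** - Utility service territory (0.03) captures regional cost variations due to labor, incentives, and regulations
3. **Module count** - `total_module_count` (0.02), which likely overlaps with system size
4. **Installation date** - `days_since_2000` (0.01) captures time trends in pricing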

### Next Steps
1. **Deploy model** into the `SolarCapexEstimator` class for production use
2. **Monitor performance** on new data as market conditions change
3. **Retrain periodically** with updated TTS data to capture evolving trends
4. **Add uncertainty estimates** to provide prediction intervals, not just point estimates
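
As a starting point for the uncertainty estimates in step 4, one simple option is to look at how much the individual trees of the forest disagree. The sketch below takes a matrix of per-tree predictions and returns a percentile band around the mean. The helper `tree_spread_interval` and the demo numbers are hypothetical, and the band reflects disagreement between trees rather than a calibrated prediction interval.

```python
import numpy as np

def tree_spread_interval(per_tree_preds, lower=5.0, upper=95.0):
    """Mean and percentile band across trees.

    per_tree_preds has shape (n_trees, n_samples).
    """
    per_tree_preds = np.asarray(per_tree_preds, dtype=float)
    low, high = np.percentile(per_tree_preds, [lower, upper], axis=0)
    return per_tree_preds.mean(axis=0), low, high

# Hypothetical per-tree predictions: 3 trees, 2 projects
demo_tree_preds = np.array([
    [100_000.0, 900_000.0],
    [120_000.0, 1_000_000.0],
    [110_000.0, 1_400_000.0],
])

demo_point, demo_low, demo_high = tree_spread_interval(demo_tree_preds)
for p, lo, hi in zip(demo_point, demo_low, demo_high):
    print(f"estimate {p:,.0f} (band {lo:,.0f} to {hi:,.0f})")
```

For the fitted model, the input matrix could be assembled by transforming `X_test` with the pipeline's `preprocessor` step and stacking `tree.predict(...)` for each tree in `rf_model.estimators_`. Whether that band has adequate coverage would need to be checked on held-out data.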