# Model Development - Solar CAPEX Estimator

This notebook develops a machine learning model to predict total installed costs (CAPEX) for commercial solar installations using the LBNL Tracking the Sun dataset. It follows a structured pipeline with clear separation of concerns into distinct classes for each step of the process:

1) **DataLoader**, responsible for loading Tracking the Sun data from CSV files and filtering it to rows relevant to our use case (commercial solar installations in the US).

2) **DataCleaner**, responsible for cleaning the data by removing rows with missing or invalid values in the target column (total installed price). It also drops columns where most values are missing or fills in missing values with appropriate strategies.

3) **FeatureEngineer**, responsible for creating new features from the existing data that may help the model learn better, but are still readable and interpretable by users.

4) **ModelTrainer**, responsible for training a machine learning model (e.g. linear regression, random forest, or gradient boosting) on the cleaned and feature-engineered data.

Along the way, we will also look at exploratory data analysis (EDA) to understand the data distribution and relationships between features and the target variable. Finally, we will evaluate the model's performance using appropriate metrics and visualizations.

Using this composition approach (separate classes coordinated by a higher-level estimator) keeps each step, from data loading and cleaning through feature engineering, training, and evaluation, focused on a single responsibility. This improves modularity, testability, and reuse: each component can be developed, swapped, or improved independently without affecting the rest of the pipeline.

## Setup and Imports

Import necessary libraries for data manipulation, visualization, and modeling.

In [1]:
import pandas as pd
import numpy as np
from plotly import graph_objects as go
from pathlib import Path
from typing import Optional, List

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

# Config dictionary to hold any configuration parameters for the model development process.
config = {}

## 1. DataLoader - Loading TTS Data

We'll start by using our **`DataLoader`** class to load the Tracking the Sun dataset. This class handles the messy details of reading CSV files and filtering them to just the records we care about.

The `DataLoader` class provides:
- Automatic handling of multiple CSV files in a directory
- Date parsing for installation dates
- Year-based filtering (we want 2019-2022)
- Customer segment filtering (commercial installations only, segment code `COM`)
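
To make the two filters concrete, here is a minimal sketch on a made-up four-row dataframe. The column names follow the TTS schema, but the rows are invented for illustration, and the boolean-mask style shown here is just one way to express the same logic the `DataLoader` applies.

```python
import pandas as pd

# Toy stand-in for a TTS extract; the values are made up for illustration.
toy = pd.DataFrame({
    "installation_date": pd.to_datetime(
        ["2018-06-01", "2019-03-15", "2021-11-30", "2023-01-10"]
    ),
    "customer_segment": ["COM", "COM", "RES", "COM"],
    "total_installed_price": [50_000.0, 62_000.0, 18_000.0, 71_000.0],
})

year_min, year_max = 2019, 2022
segments = ["COM"]

# Year filter: compare the year component of the parsed dates (bounds inclusive).
in_years = toy["installation_date"].dt.year.between(year_min, year_max)
# Segment filter: keep rows whose segment appears in the allowed list.
in_segments = toy["customer_segment"].isin(segments)

filtered = toy[in_years & in_segments]
print(filtered)
```

Only the 2019 commercial row survives: the 2018 and 2023 rows fall outside the year range, and the 2021 row is residential.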

In [2]:
import pandas as pd

class DataLoader:
    """
    Data loader for LBNL Tracking the Sun dataset.

    This class handles TTS-specific data loading, cleaning, and filtering
    operations independent of the modeling pipeline.

    Parameters
    ----------
    tts_data_directory : str
        Path to the directory containing the raw TTS data files.

    Attributes
    ----------
    tts_data_directory : Path
        Directory containing the raw TTS data files.
    df : pd.DataFrame
        Loaded dataframe.
    valid_customer_segments : list
        List of valid customer segments.

    """

    def __init__(self, tts_data_directory: str):
        """Initialize the DataLoader."""
        self.tts_data_directory = Path(tts_data_directory)
        self.df = None

        self.valid_customer_segments = ['COM', 'RES_MF', 'RES_SF', 'RES', 'AGRICULTURAL',
       'OTHER TAX-EXEMPT', 'GOV', 'SCHOOL', 'NON-RES', 'NON-PROFIT']

    def _filter_by_years(self, df, year_min=None, year_max=None):
        """
        Filter data to specific installation years.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to filter.
        year_min : int, optional
            Minimum installation year to include. If None, includes all years.
        year_max : int, optional
            Maximum installation year to include. If None, includes all years.

        Returns
        -------
        pd.DataFrame
            Filtered dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if df is None:
            raise ValueError("Data not loaded. Call load_training_data() first.")

        if year_min is not None:
            df = df[df.installation_date.dt.year >= year_min]
        if year_max is not None:
            df = df[df.installation_date.dt.year <= year_max]

        return df

    def _filter_by_customer_segment(self, df, segments):
        """
        Filter data to specific customer segments.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to filter.
        segments : list of str
            Customer segments to filter to. If None, includes all segments.

        Returns
        -------
        pd.DataFrame
            Filtered dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if df is None:
            raise ValueError("Data not loaded. Call load_training_data() first.")

        df = df[df['customer_segment'].isin(segments)]

        return df

    def _validate_filters(self, year_min, year_max, customer_segments):
        """
        Validate filter parameters.

        Parameters
        ----------
        year_min : int, optional
            Minimum installation year to include. If None, includes all years.
        year_max : int, optional
            Maximum installation year to include. If None, includes all years.
        customer_segments : list of str, optional
            Customer segments to include (e.g., ['COM', 'NON-RES']).

        Raises
        ------
        ValueError
            If year_min is greater than year_max or if customer_segments is not a list of strings.
        """
        if year_min is not None and year_max is not None and year_min > year_max:
            raise ValueError("year_min cannot be greater than year_max.")

        if customer_segments is not None:
            if not isinstance(customer_segments, list) or not all(isinstance(seg, str) for seg in customer_segments):
                raise ValueError("customer_segments must be a list of strings.")
            if not set(customer_segments).issubset(set(self.valid_customer_segments)):
                raise ValueError(f"customer_segments must be a subset of {self.valid_customer_segments}.")


    def load_training_data(
        self,
        year_min: Optional[int] = None,
        year_max: Optional[int] = None,
        customer_segments: Optional[List[str]] = None
    ):
        """
        Load and filter TTS data with common preprocessing steps.

        Parameters
        ----------
        year_min : int, optional
            Minimum year to filter to. If None, includes all years.
        year_max : int, optional
            Maximum year to filter to. If None, includes all years.
        customer_segments : list of str, optional
            Customer segments to filter to. If None, includes all segments.

        Returns
        -------
        pd.DataFrame
            Filtered and cleaned dataframe.
        """

        # Validate the filters before touching the filesystem.
        self._validate_filters(year_min, year_max, customer_segments)

        csvs = sorted(self.tts_data_directory.glob('*.csv'))
        if not csvs:
            raise ValueError(f"No CSV files found in directory {self.tts_data_directory}")

        frames = []
        for csv in csvs:
            csv_df = pd.read_csv(csv, parse_dates=['installation_date'], low_memory=False)

            if year_min is not None or year_max is not None:
                csv_df = self._filter_by_years(csv_df, year_min, year_max)

            if customer_segments is not None:
                csv_df = self._filter_by_customer_segment(csv_df, customer_segments)

            frames.append(csv_df)
            print(f"Loaded {len(csv_df)} rows from {csv.name}")

        # Concatenate once instead of growing the dataframe inside the loop.
        self.df = pd.concat(frames, ignore_index=True)
        return self.df


    def get_data(self):
        """
        Get the current dataframe.

        Returns
        -------
        pd.DataFrame
            Current dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if self.df is None:
            raise ValueError("Data not loaded. Call load_training_data() first.")

        return self.df


### Instantiate and Load Data

Now we'll create a `DataLoader` instance pointing to our raw data directory and use it to load commercial solar installations from 2019-2022.

In [3]:
config['loading'] = {
    'year_min': 2019,
    'year_max': 2022,
    'customer_segments': ['COM']
}

tts_dataloader = DataLoader(tts_data_directory='../data/raw')

tts_dataloader.load_training_data(**config['loading'])

tts_dataloader.get_data().head()

Loaded 12580 rows from TTS_LBNL_public_file_29-Sep-2025_all.csv


Unnamed: 0,data_provider_1,data_provider_2,system_ID_1,system_ID_2,installation_date,PV_system_size_DC,total_installed_price,rebate_or_grant,customer_segment,expansion_system,multiple_phase_system,TTS_link_ID,new_construction,tracking,ground_mounted,zip_code,city,state,utility_service_territory,third_party_owned,installer_name,self_installed,azimuth_1,azimuth_2,azimuth_3,tilt_1,tilt_2,tilt_3,module_manufacturer_1,module_model_1,module_quantity_1,module_manufacturer_2,module_model_2,module_quantity_2,module_manufacturer_3,module_model_3,module_quantity_3,additional_modules,technology_module_1,technology_module_2,technology_module_3,BIPV_module_1,BIPV_module_2,BIPV_module_3,bifacial_module_1,bifacial_module_2,bifacial_module_3,nameplate_capacity_module_1,nameplate_capacity_module_2,nameplate_capacity_module_3,efficiency_module_1,efficiency_module_2,efficiency_module_3,inverter_manufacturer_1,inverter_model_1,inverter_quantity_1,inverter_manufacturer_2,inverter_model_2,inverter_quantity_2,inverter_manufacturer_3,inverter_model_3,inverter_quantity_3,additional_inverters,micro_inverter_1,micro_inverter_2,micro_inverter_3,built_in_meter_inverter_1,built_in_meter_inverter_2,built_in_meter_inverter_3,output_capacity_inverter_1,output_capacity_inverter_2,output_capacity_inverter_3,DC_optimizer,inverter_loading_ratio,battery_manufacturer,battery_model,battery_rated_capacity_kW,battery_rated_capacity_kWh,battery_price,technology_type,extensions_multiphase_id
0,Frontier Associates,Texas Central Company,19TCC-001,-1,2019-05-09,29.7,54950.0,3920.0,COM,False,False,-1,0.0,0.0,0.0,78045,Laredo,TX,Texas Central Company,0.0,Peg Solar,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,Sonali Energees USA LLC,SS-330,90.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,Multi-c-Si,-1,-1,0,-1,-1,0,-1,-1,330.0,-1.0,-1.0,0.171875,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
1,Frontier Associates,Texas Central Company,19TCC-004,-1,2019-05-22,24.48,74674.0,19584.0,COM,False,False,-1,0.0,0.0,0.0,78041,Laredo,TX,Texas Central Company,0.0,Peg Solar,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,"Trina Solar Co.,Ltd",-1,72.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
2,Frontier Associates,Texas Central Company,19TCC-010,-1,2019-10-08,52.56,181332.0,24390.0,COM,False,False,-1,0.0,0.0,0.0,78570,Mercedes,TX,Texas Central Company,0.0,Ecolectrics,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,SET-Solar,-1,144.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
3,Frontier Associates,Texas Central Company,19TCC-018,-1,2019-03-06,23.8,45764.0,19040.0,COM,False,False,-1,0.0,0.0,0.0,78596,Weslaco,TX,Texas Central Company,0.0,Alba Energy,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,"Trina Solar Co.,Ltd",-1,70.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,2.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
4,Frontier Associates,Texas Central Company,19TCC-068,-1,2019-11-27,307.31,952755.47,46155.5,COM,False,False,-1,0.0,-1.0,1.0,78041,Laredo,TX,Texas Central Company,0.0,Freedom Solar Power Tx,0.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,SunPower,-1,778.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,5.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1


We successfully loaded **12,580 rows** of commercial solar installation data from 2019-2022. Each row represents one solar installation project with details about system size, location, components, and cost.

-----------------------------------

#### Understanding Our Dataset

The loaded data represents real-world commercial solar installations reported to LBNL's Tracking the Sun program. Key observations:

- **Time period (2019-2022)**: We focused on recent data to ensure our model reflects current market conditions.
- **Commercial segment only**: Narrows our scope to avoid mixing residential and commercial pricing dynamics.
- **Sample size (12,580 installations)**: Large enough for robust modeling without computational challenges.

Each row contains information about:
- **System specifications**: Size (kW), module types, inverter details
- **Location**: State, city, ZIP code, utility service territory
- **Installation details**: Date, installer, ownership structure
- **Cost**: Our target variable - total installed price in dollars

# 📊 EDA: Datatypes and Missing Values

We want to understand the structure of our data before we clean it and train a model. The bar chart below shows the proportion of missing values in each column, with bars colored by datatype (light blue for string columns, salmon for all other types), so we can see at a glance which columns are categorical and which have a lot of missing data.

Note: We do have to get ahead of ourselves and do sentinel value replacement before we can do a full EDA, since the dataset uses various sentinel values to indicate missing or invalid data. The next section will handle this cleaning step.
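
As a small illustration of why the sentinel replacement matters, the sketch below computes missing-value proportions on a made-up three-column frame. The column names echo the TTS schema, but the rows (including the installer name) are invented; without the `replace` call, every proportion would come out as zero.

```python
import numpy as np
import pandas as pd

# Made-up rows: -1 (numeric) and "-1" (string) both stand for "missing".
toy = pd.DataFrame({
    "tilt_1": [20.0, -1.0, 15.0, -1.0],
    "installer_name": ["Acme Solar", "-1", "-1", "-1"],
    "state": ["TX", "TX", "CA", "AZ"],
})

# Turn both sentinel forms into NaN, then take the per-column share of NaN.
missing_share = toy.replace([-1, "-1"], np.nan).isna().mean()
print(missing_share)
```

Here `tilt_1` is half missing, `installer_name` is three-quarters missing, and `state` is complete.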

In [4]:
fig = go.Figure()

fig.add_trace(
    go.Bar(
        x=tts_dataloader.get_data().columns,
        y=tts_dataloader.get_data().replace([-1, "-1"], np.nan).isna().mean(),
        marker_color=tts_dataloader.get_data().dtypes.map(lambda dt: 'lightblue' if dt in ['object', 'str'] else 'salmon')
    )
)

fig.update_layout(
    title='Proportion of Missing Values by Column',
    xaxis_title='Column',
    yaxis_title='Proportion of Missing Values',
    xaxis_tickangle=-90,
)

fig.show()

One clear pattern is that many of the equipment-related columns come in numbered groups (e.g. `module_model_1` through `module_model_3`, `inverter_model_1` through `inverter_model_3`), and the secondary and tertiary columns are mostly missing; detail is usually provided only for the primary equipment. This makes sense, since most installations likely use a single module model and a single inverter model, but it also means we will need to drop the secondary and tertiary columns due to their high proportion of missing values.

## 2. DataCleaner - Cleaning the Dataset

Now that we have our raw data loaded, we need to clean it. The **`DataCleaner`** class handles several important cleaning tasks:

- **Sentinel value replacement**: Converts placeholder values like -1 to true NaN.
- **Target cleaning**: Removes rows with missing or unrealistically low prices.
- **Datatype coercion**: Ensures each column has the appropriate data type.
- **Missing data handling**: Drops columns with too many missing values, imputes others.
- **Cardinality reduction**: Removes extremely high-cardinality categorical columns that would create too many features.

Note: We could do some self-joins using `TTS_link_ID`, but since some customers might be expanding an existing installation, it is best to avoid this complexity for now and treat projects as independent.
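
The two threshold-based rules (missing data and cardinality) can be sketched on a made-up six-row frame as follows. The thresholds here are chosen only so the toy example produces a visible result; they are not the values used later in this notebook, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up toy frame; column names echo the TTS schema but the values are invented.
toy = pd.DataFrame({
    "installer_name": ["A", "B", "C", "D", "E", "F"],         # every value unique
    "state": ["TX", "TX", "CA", "CA", "TX", "CA"],             # only 2 distinct values
    "tilt_2": [np.nan, np.nan, np.nan, np.nan, 10.0, 12.0],    # 4 of 6 missing
    "PV_system_size_DC": [29.7, 24.5, 52.6, 23.8, 307.3, 40.0],
})

na_threshold = 0.50           # drop columns with a larger share of NaN
cardinality_threshold = 0.50  # deliberately loose for a 6-row example

# Rule 1: share of missing values per column.
na_share = toy.isna().mean()
high_na = na_share[na_share > na_threshold].index.tolist()

# Rule 2: unique values as a share of rows, checked on non-numeric columns only.
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
unique_share = toy[text_cols].nunique() / len(toy)
high_card = unique_share[unique_share > cardinality_threshold].index.tolist()

kept = toy.drop(columns=high_na + high_card)
print(high_na, high_card, list(kept.columns))
```

In this toy case `tilt_2` is dropped for missingness and `installer_name` for cardinality, leaving `state` and `PV_system_size_DC`.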

In [5]:
class DataCleaner:
    """
    Data cleaner for LBNL Tracking the Sun dataset.

    This class handles TTS-specific data cleaning operations independent of the modeling pipeline.


    """

    def __init__(self, min_target_value=10, high_cardinality_threshold=0.05, na_drop_thresholds={'string_columns': 0.20, 'numeric_columns': 0.50}):
        """
        Initialize the DataCleaner with configuration parameters.

        Parameters
        ----------
        min_target_value : float, optional
            Minimum valid target value. Rows with target values below this will be removed. Default is 10.
        high_cardinality_threshold : float, optional
            Proportion of unique values above which a column will be dropped. Default is 0.05 (5%).
        na_drop_thresholds : dict, optional
            Thresholds for dropping columns with high NA values. Default is {'string_columns': 0.20, 'numeric_columns': 0.50}.
        """

        self.df = None

        self.min_target_value = min_target_value
        self.high_cardinality_threshold = high_cardinality_threshold
        self.na_drop_thresholds = dict(na_drop_thresholds)  # copy so instances never share the mutable default dict

    def load_data(self, df):
        """
        Load data into the cleaner.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        None
        """
        self.df = df

    def _make_true_na(self, df):
        """
        Convert common placeholder values for missing data to true NaN.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with true NaN values.
        """
        df = df.replace([-1, "-1"], np.nan)
        
        return df
    
    def _coerce_datatypes(self, df):
        """
        Coerce datatypes of columns to appropriate types.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with coerced datatypes.
        """
        for col in df.columns:
            if 'date' in col.lower():
                df[col] = pd.to_datetime(df[col], errors='coerce')

            elif "zip" in col.lower() or "postal" in col.lower():
                df[col] = df[col].astype(str).str.zfill(5)
            elif df[col].dtype in ['bool']:
                df[col] = df[col].astype(int)
            elif df[col].dtype in ['object', 'str']:
                try:
                    df[col] = pd.to_numeric(df[col])
                except Exception:
                    df[col] = df[col].astype('str')
        return df

    def _clean_by_target(self, df, target_col):
        """
        Clean the target variable by removing rows with missing or invalid values.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
        target_col : str
            Name of the target column to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe.
        """
        before_count = len(df)
        df = df.dropna(subset=[target_col])
        df = df[df[target_col] >= self.min_target_value]
        after_count = len(df)
        print(f"> Removed {before_count - after_count} rows with missing or invalid target values.")
        return df
    
    def _drop_high_na_columns(self, df):
        """
        Drop columns whose share of missing values exceeds the configured thresholds.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe with high-NA columns dropped.
        """

        if len(df) == 0:
            print("Warning: Dataframe is empty. Skipping high-NA column drop.")
            return df
        
        string_cols = df.select_dtypes(include=['object']).columns
        numeric_cols = df.select_dtypes(include=[np.number]).columns

        string_na_proportions = df[string_cols].isna().mean()
        numeric_na_proportions = df[numeric_cols].isna().mean()

        cols_to_drop_string = string_na_proportions[string_na_proportions > self.na_drop_thresholds['string_columns']].index
        cols_to_drop_numeric = numeric_na_proportions[numeric_na_proportions > self.na_drop_thresholds['numeric_columns']].index

        cols_to_drop = list(cols_to_drop_string) + list(cols_to_drop_numeric)
        print(f"> Dropping columns with majority NA values: {cols_to_drop}")
        df = df.drop(columns=cols_to_drop)

        return df


    
    def _drop_high_cardinality_columns(self, df):
        """
        Drop columns that have a high proportion of unique values.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
            
        Returns
        -------
        pd.DataFrame 
            Cleaned dataframe with high-cardinality columns dropped.
        """
        if len(df) == 0:
            print("Warning: Dataframe is empty. Skipping high-cardinality column drop.")
            return df
        
        unique_proportions = df.nunique() / len(df)

        cols_to_drop = unique_proportions[unique_proportions > self.high_cardinality_threshold].index
        cols_to_drop = [col for col in cols_to_drop if df[col].dtype in ['object', 'str']]
        print(f"> Dropping high-cardinality columns: {cols_to_drop}")
        df = df.drop(columns=cols_to_drop)

        return df
    
    def _drop_id_and_provider_columns(self, df):
        """
        Drop columns that are likely to be identifiers or provider-specific information.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
            
        Returns
        -------
        pd.DataFrame 
            Cleaned dataframe with ID and provider columns dropped.
        """
        id_cols = ['data_provider_1', 'data_provider_2', 'system_ID_1', 'system_ID_2']
        print(f"> Dropping ID and provider columns: {id_cols}")
        df = df.drop(columns=id_cols)

        return df
    
    def _drop_single_value_columns(self, df):
        """
        Drop columns that have only a single unique value.

        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to clean.
            
        Returns
        -------
        pd.DataFrame 
            Cleaned dataframe with single-value columns dropped.
        """
        single_value_cols = df.columns[df.nunique() <= 1]
        print(f"> Dropping single-value columns: {list(single_value_cols)}")
        df = df.drop(columns=single_value_cols)

        return df
    
    def clean(self, target_col='total_installed_price'):
        """
        Perform all cleaning steps on the loaded dataframe.

        Parameters
        ----------
        target_col : str, optional
            Target column to clean. Default is 'total_installed_price'.
            Rows below ``self.min_target_value`` (set in the constructor) are removed.

        Returns
        -------
        pd.DataFrame
            Cleaned dataframe.

        Raises
        ------
        ValueError
            If dataframe has not been loaded.
        """
        if self.df is None:
            raise ValueError("Data not loaded. Call load_data() first.")

        self.df = self._make_true_na(self.df)
        self.df = self._clean_by_target(self.df, target_col)
        self.df = self._drop_id_and_provider_columns(self.df)
        self.df = self._drop_high_na_columns(self.df)
        self.df = self._drop_single_value_columns(self.df)
        self.df = self._drop_high_cardinality_columns(self.df)
        self.df = self._coerce_datatypes(self.df)
        return self.df

### Apply Data Cleaning

Let's instantiate our `DataCleaner`, load our data into it, and apply all cleaning operations with a single `.clean()` call.


In [6]:
config['cleaning'] = {
    'min_target_value': 10,
    'high_cardinality_threshold': 0.10,
    'na_drop_thresholds': {'string_columns': 0.10, 'numeric_columns': 0.50}
}

tts_cleaner = DataCleaner(**config['cleaning'])
tts_cleaner.load_data(tts_dataloader.get_data())
cleaned_df = tts_cleaner.clean(target_col='total_installed_price')

cleaned_df.shape

> Removed 4284 rows with missing or invalid target values.
> Dropping ID and provider columns: ['data_provider_1', 'data_provider_2', 'system_ID_1', 'system_ID_2']
> Dropping columns with majority NA values: ['TTS_link_ID', 'module_model_1', 'module_manufacturer_2', 'module_model_2', 'module_manufacturer_3', 'module_model_3', 'technology_module_1', 'technology_module_2', 'technology_module_3', 'inverter_manufacturer_1', 'inverter_model_1', 'inverter_manufacturer_2', 'inverter_model_2', 'inverter_manufacturer_3', 'inverter_model_3', 'battery_manufacturer', 'battery_model', 'extensions_multiphase_id', 'new_construction', 'azimuth_2', 'azimuth_3', 'tilt_2', 'tilt_3', 'module_quantity_2', 'module_quantity_3', 'BIPV_module_2', 'BIPV_module_3', 'bifacial_module_2', 'bifacial_module_3', 'nameplate_capacity_module_2', 'nameplate_capacity_module_3', 'efficiency_module_2', 'efficiency_module_3', 'inverter_quantity_2', 'inverter_quantity_3', 'micro_inverter_2', 'micro_inverter_3', 'built_in_meter

(8296, 29)

#### Interpreting the Cleaning Results

We went from **12,580 rows to 8,296 rows** (~34% loss) and from **81 columns to 29 columns** (~64% loss).

**Row loss (4,284 rows removed)**:
- Had missing or invalid target values (total_installed_price missing or below $10)
- Cannot train on data where the target is unreliable
- Better to have 8,296 reliable examples than 12,580 of which 4,284 have questionable targets

**Column loss (52 columns dropped)**:
- Columns exceeding the missing-value thresholds (more than 10% missing for string columns, more than 50% for numeric columns), mostly secondary/tertiary equipment columns (e.g., inverter_2, module_3)
- Identifiers and high-cardinality columns not useful for prediction (system_ID, zip_code, installer_name)
- Single-value column removed (customer_segment - filtered to 'COM' only)

-----------------------

## 3. FeatureEngineer - Creating Useful Features

With clean data in hand, we'll use our **`FeatureEngineer`** class to create new features that help our model learn better patterns.

Currently, our feature engineer creates:
- **`days_since_2000`**: A temporal feature representing how many days after January 1, 2000 the system was installed. This captures time trends in solar pricing more effectively than raw dates.
- **`total_module_count`**: A feature that sums the per-slot module quantity columns (`module_quantity_*`) into a single total module count, which may be more predictive than the individual counts.
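
Both features reduce to a line of pandas each. The sketch below applies them to two made-up rows (the dates and quantities are invented) so the arithmetic is easy to check by hand; it mirrors the logic described above rather than calling the class itself.

```python
import numpy as np
import pandas as pd

# Two made-up installations; the second has a secondary module type.
toy = pd.DataFrame({
    "installation_date": pd.to_datetime(["2000-01-02", "2019-05-09"]),
    "module_quantity_1": [90.0, 72.0],
    "module_quantity_2": [np.nan, 10.0],
})

# Days elapsed since the reference date 2000-01-01.
days_since_2000 = (toy["installation_date"] - pd.Timestamp("2000-01-01")).dt.days

# Missing quantities count as zero modules, then sum across the slots.
module_cols = [c for c in toy.columns if "module_quantity" in c.lower()]
total_module_count = toy[module_cols].fillna(0).sum(axis=1)

print(days_since_2000.tolist())
print(total_module_count.tolist())
```

The first row is one day after the reference date with 90 modules; the second has 72 + 10 = 82 modules.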

In [7]:
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureEngineer(BaseEstimator, TransformerMixin):
    """
    Feature engineer for LBNL Tracking the Sun dataset.

    Scikit-learn compatible transformer.
    """

    def __init__(self):
        # no parameters yet, but required for sklearn compatibility
        pass

    def fit(self, X, y=None):
        """
        Fit does nothing because this transformer is stateless.
        Required for sklearn pipelines.
        """
        return self

    def transform(self, X):
        """
        Apply feature engineering steps.

        Parameters
        ----------
        X : pd.DataFrame

        Returns
        -------
        pd.DataFrame
        """
        df = X.copy()

        df = self._add_day_count(df)
        df = self._combine_module_counts(df)

        return df

    def _add_day_count(self, df):
        df['days_since_2000'] = (
            df['installation_date'] - pd.Timestamp('2000-01-01')
        ).dt.days
        return df

    def _combine_module_counts(self, df):
        if 'total_module_count' in df.columns:
            return df

        module_cols = [
            col for col in df.columns
            if 'module_quantity' in col.lower()
        ]

        if module_cols:
            df['total_module_count'] = (
                df[module_cols].fillna(0).sum(axis=1)
            )
            df = df.drop(columns=module_cols)

        return df

### Apply Feature Engineer

Below, we instantiate our `FeatureEngineer` and apply it to our cleaned dataset to create the new features. This generates two new features:
* `days_since_2000`: A numeric feature representing how many days after January 1, 2000 the system was installed. This captures time trends in solar pricing more effectively than raw dates.
* `total_module_count`: A numeric feature representing the total number of modules in the system.

In [8]:
feature_engineer = FeatureEngineer()

engineered_df = feature_engineer.fit_transform(cleaned_df)

---------------------

## 4. Preprocessor - Preparing Features for Modeling

Now we need to transform our features into a format that our machine learning model can understand. The **`TTSPreprocessor`** class automatically:

1. **Sorts columns** by type (numerical, binary, low-cardinality categorical, high-cardinality categorical)
2. **Builds a sklearn ColumnTransformer** that applies the right transformation to each column type:
   - **One-hot encoding** for low-cardinality categoricals (creates binary columns)
   - **Target encoding** for high-cardinality categoricals (replaces each category with a smoothed mean of the target for that category)
   - **Passthrough** for binary columns (already in good format)
   - **Standard scaling** for numerical features (normalizes to mean=0, std=1)

Taken together, this preprocessing pipeline puts every feature into a numeric format the model can consume. Target encoding also keeps dimensionality in check, because each high-cardinality column becomes a single numeric column instead of one column per category.

Note - the preprocessor is a component that will be added to the pipeline in the `ModelTrainer` class, so we don't need to apply it separately here. The `ModelTrainer` will call the preprocessor as part of its training process. 


In [9]:
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OneHotEncoder, StandardScaler, TargetEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline


class TTSPreprocessor(BaseEstimator, TransformerMixin):
    """
    Preprocessor for our Capex solar cost estimation model.
    sklearn-compatible transformer.
    """

    def __init__(self):
        self.column_dict = None
        self.preprocessor = None

    def fit(self, X: pd.DataFrame, y: Optional[pd.Series] = None):
        """
        Determine column types and fit the preprocessing pipeline.
        """
        df = X.copy()

        # y may be None or an unnamed array, so look up its name defensively.
        self._sort_columns(df=df, target_col=getattr(y, "name", None))
        self._build_preprocessor()

        # Fit the underlying column transformer; y is needed by the target encoder.
        if self.preprocessor is not None:
            self.preprocessor.fit(df, y)

        return self

    def transform(self, X: pd.DataFrame):
        """
        Apply preprocessing.
        """
        if self.preprocessor is None:
            raise ValueError("Preprocessor not fitted. Call fit() first.")

        return self.preprocessor.transform(X)

    def get_feature_names(self):
        """Get feature names after preprocessing."""
        if self.preprocessor is None:
            raise ValueError("Preprocessor not built. Must be fit to data first.")
        return self.preprocessor.get_feature_names_out()

    def _sort_columns(self, df: pd.DataFrame, target_col: Optional[str]):
        """Sort columns into types.
        
        Parameters
        ----------
        df : pd.DataFrame
            Dataframe to analyze column types from.
        target_col : str, optional
            Name of the target column. If None, no column will be treated as target.
        """
        self.column_dict = dict(
            target_col=target_col,
            num_cols=[],
            binary_cols=[],
            cat_low_card_cols=[],
            cat_high_card_cols=[],
        )

        for col in df.columns:
            if col == self.column_dict["target_col"]:
                continue
            elif df[col].dtype in ["int64", "float64"]:
                self.column_dict["num_cols"].append(col)
            elif (
                df[col].dtype == "bool"
                or (df[col].dtype == "object" and df[col].nunique() == 2)
            ):
                self.column_dict["binary_cols"].append(col)
            elif df[col].dtype in ["str", "object", "category"]:
                if df[col].nunique() < 10:
                    self.column_dict["cat_low_card_cols"].append(col)
                else:
                    self.column_dict["cat_high_card_cols"].append(col)

    def _build_preprocessor(self):
        """Build the column transformer in four parts: low-cardinality categorical, high-cardinality categorical, binary, and numeric."""

        low_card_pipeline = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="most_frequent")),
                (
                    "encoder",
                    OneHotEncoder(handle_unknown="ignore", sparse_output=False),
                ),
            ]
        ).set_output(transform="pandas")

        high_card_pipeline = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="most_frequent")),
                ("encoder", TargetEncoder()),
            ]
        ).set_output(transform="pandas")

        binary_pipeline = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="most_frequent")),
                (
                    "encoder",
                    OneHotEncoder(
                        drop="if_binary",
                        handle_unknown="ignore",
                        sparse_output=False,
                    ),
                ),
            ]
        ).set_output(transform="pandas")

        num_pipeline = Pipeline(
            steps=[
                ("imputer", SimpleImputer(strategy="median")),
                ("scaler", StandardScaler()),
            ]
        ).set_output(transform="pandas")

        self.preprocessor = ColumnTransformer(
            transformers=[
                ("cat_low_card", low_card_pipeline, self.column_dict["cat_low_card_cols"]),
                ("cat_high_card", high_card_pipeline, self.column_dict["cat_high_card_cols"]),
                ("binary", binary_pipeline, self.column_dict["binary_cols"]),
                ("num", num_pipeline, self.column_dict["num_cols"]),
            ],
            remainder="drop",
        ).set_output(transform="pandas")


In [10]:
tts_preprocessor = TTSPreprocessor()
preprocessed_df = tts_preprocessor.fit_transform(engineered_df.drop(columns=['total_installed_price']), engineered_df['total_installed_price'])

preprocessed_df.head()

Unnamed: 0,cat_low_card__technology_type_pv+storage,cat_low_card__technology_type_pv-only,cat_low_card__technology_type_storage-only,cat_high_card__state,cat_high_card__utility_service_territory,cat_high_card__module_manufacturer_1,num__PV_system_size_DC,num__rebate_or_grant,num__expansion_system,num__multiple_phase_system,num__tracking,num__ground_mounted,num__third_party_owned,num__self_installed,num__azimuth_1,num__tilt_1,num__additional_modules,num__BIPV_module_1,num__bifacial_module_1,num__nameplate_capacity_module_1,num__efficiency_module_1,num__inverter_quantity_1,num__additional_inverters,num__micro_inverter_1,num__built_in_meter_inverter_1,num__output_capacity_inverter_1,num__DC_optimizer,num__inverter_loading_ratio,num__days_since_2000,num__total_module_count
0,0.0,1.0,0.0,185581.830778,228569.711532,47482.234056,-0.304642,-0.082653,-0.272386,-0.2227,-0.172602,-0.463678,-0.298089,-0.160761,-0.049396,-0.259182,-0.208305,-0.015529,-0.227722,-0.982775,-1.765925,-0.187923,-0.34272,-0.375461,-0.857651,-0.124251,-0.762493,-0.058397,-1.430202,-0.270981
1,0.0,1.0,0.0,185581.830778,228569.711532,60245.226434,-0.313247,0.074557,-0.272386,-0.2227,-0.172602,-0.463678,-0.298089,-0.160761,-0.049396,-0.259182,-0.208305,-0.015529,-0.227722,-0.05614,0.111855,-0.187923,-0.34272,-0.375461,-0.857651,-0.124251,-0.762493,-0.058397,-1.398986,-0.285259
2,0.0,1.0,0.0,185581.830778,228569.711532,181332.0,-0.266961,0.122792,-0.272386,-0.2227,-0.172602,-0.463678,-0.298089,-0.160761,-0.049396,-0.259182,-0.208305,-0.015529,-0.227722,-0.05614,0.111855,-0.206572,-0.34272,-0.375461,-0.857651,-0.124251,-0.762493,-0.058397,-1.065215,-0.228148
3,0.0,1.0,0.0,185581.830778,228569.711532,60245.226434,-0.314367,0.069097,-0.272386,-0.2227,-0.172602,-0.463678,-0.298089,-0.160761,-0.049396,-0.259182,-0.208305,-0.015529,-0.227722,-0.05614,0.111855,-0.187923,-0.34272,-0.375461,-0.857651,-0.124251,-0.762493,-0.058397,-1.583881,-0.286846
4,0.0,1.0,0.0,185581.830778,228569.711532,367062.587574,0.152955,0.341239,-0.272386,-0.2227,-0.172602,2.15667,-0.298089,-0.160761,-0.049396,-0.259182,-0.208305,-0.015529,-0.227722,-0.05614,0.111855,-0.131978,-0.34272,-0.375461,-0.857651,-0.124251,-0.762493,-0.058397,-0.945153,0.274748


# 📊 EDA: Identifying Relevant Features

We can use scikit-learn's `SelectKBest` with the `f_regression` score function to rank features by the strength of their univariate linear relationship with our target variable (total installed price). This helps us see which features matter most for predicting solar CAPEX, and may let us reduce the number of features in the model for better performance and interpretability.

In [11]:
from sklearn.feature_selection import SelectKBest, f_regression

selector = SelectKBest(score_func=f_regression, k=10)

selector.fit(preprocessed_df, engineered_df['total_installed_price'])

scores = selector.scores_
feature_names = tts_preprocessor.get_feature_names()

feature_scores = pd.DataFrame({'feature': feature_names, 'score': scores})
feature_scores = feature_scores.sort_values(by='score', ascending=False)

In [12]:
fig = go.Figure(data=[go.Bar(x=feature_scores['feature'], y=feature_scores['score'])])
fig.update_layout(title='Feature Scores from SelectKBest', xaxis_title='Feature', yaxis_title='Score', xaxis_tickangle=-45)
fig.show()
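One caveat when reading these scores: `f_regression` tests each feature on its own for a linear relationship with the target, so a feature that matters only nonlinearly or through interactions can score low. The sketch below shows this on synthetic data; all feature names and coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
n = 500

# Synthetic features (assumed for illustration only)
linear_feature = rng.normal(size=n)              # linearly related to the target
noise_feature = rng.normal(size=n)               # unrelated to the target
symmetric_feature = rng.uniform(-1, 1, size=n)   # related only through its square

target = 3.0 * linear_feature + 5.0 * symmetric_feature**2 + rng.normal(scale=0.1, size=n)
X_demo = np.column_stack([linear_feature, noise_feature, symmetric_feature])

demo_selector = SelectKBest(score_func=f_regression, k=1).fit(X_demo, target)
linear_score, noise_score, symmetric_score = demo_selector.scores_
selected = demo_selector.get_support()  # boolean mask of the k retained features

print(f"linear: {linear_score:.1f}, noise: {noise_score:.1f}, symmetric: {symmetric_score:.1f}")
```

The symmetric feature drives the target but scores far below the linear one, so a low bar in the chart above is a hint rather than proof that a feature is uninformative.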

----------------

## Model Selection


#### Why Compare Multiple Models?

Different ML algorithms have different strengths and weaknesses. Before committing to one, let's compare three diverse approaches:

**Random Forest** (ensemble tree-based):
- **Strengths**: Handles nonlinear relationships, feature interactions, mixed data types; resistant to overfitting
- **Weaknesses**: Less interpretable than linear models, can be slow on very large datasets
- **Best for**: Complex patterns, when feature interactions matter

**K-Nearest Neighbors** (instance-based):
- **Strengths**: Simple, no training time, naturally captures local patterns
- **Weaknesses**: Slow predictions, sensitive to irrelevant features and scale, needs tuning
- **Best for**: When similar installations should have similar prices

**Linear Regression** (parametric):
- **Strengths**: Highly interpretable, fast training and prediction, well-understood theory
- **Weaknesses**: Assumes linear relationships, struggles with interactions, sensitive to multicollinearity
- **Best for**: When relationships are approximately linear and interpretability is critical

**Evaluation approach**:
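- **5-fold cross-validation**: Estimates how well each model generalizes to data it was not trained on
- **Negative MAE scoring**: scikit-learn's `neg_mean_absolute_error` returns the negated Mean Absolute Error so that higher scores are better; we flip the sign to report MAE in dollars (lower is better)
- **MAE** is preferred over RMSE here because it is less sensitive to outliers; both metrics are in the same units as our target (dollars)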

We'll use the results to choose the best model for hyperparameter tuning.
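The sign convention is easy to trip over, so here is a minimal sketch on synthetic data (the data and coefficients are assumptions for illustration). It shows why the worst fold's MAE comes from the *minimum* score and the best fold's MAE from the *maximum*.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

neg_mae_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

# Scores are <= 0 and "higher is better"; negate them to read them as MAE
mae_per_fold = -neg_mae_scores
worst_fold_mae = mae_per_fold.max()   # same as -neg_mae_scores.min()
best_fold_mae = mae_per_fold.min()    # same as -neg_mae_scores.max()

print(f"MAE per fold: {np.round(mae_per_fold, 3)}")
```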

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_val_score

feature_engineer = FeatureEngineer()
tts_preprocessor = TTSPreprocessor()


rfr_pipeline = Pipeline([
    ("feature_engineering", feature_engineer),
    ("preprocessor", tts_preprocessor),
    ("model", RandomForestRegressor(random_state=0, n_estimators=100, max_depth=9))
])

knn_pipeline = Pipeline([
    ("feature_engineering", feature_engineer),
    ("preprocessor", tts_preprocessor),
    ('selector', SelectKBest(score_func=f_regression, k=5)),
    ("model", KNeighborsRegressor())
])

lr_pipeline = Pipeline([
    ("feature_engineering", feature_engineer),
    ("preprocessor", tts_preprocessor),
    ('selector', SelectKBest(score_func=f_regression, k=5)),
    ("model", LinearRegression())
])

X = cleaned_df.drop(columns=['total_installed_price'])
y = cleaned_df['total_installed_price']

rfr_score = cross_val_score(rfr_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
knn_score = cross_val_score(knn_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
lr_score = cross_val_score(lr_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')

results_df = pd.DataFrame(
    index = ['Random Forest', 'KNN', 'Linear Regression'],
    data = {
    'mae_max': [-rfr_score.min(), -knn_score.min(), -lr_score.min()],
    'mae_mean': [-rfr_score.mean(), -knn_score.mean(), -lr_score.mean()],
    'mae_min': [-rfr_score.max(), -knn_score.max(), -lr_score.max()],
    'mae_std': [rfr_score.std(), knn_score.std(), lr_score.std()]

})



Found unknown categories in columns [0] during transform. These unknown categories will be encoded as all zeros





In [14]:
results_df.round(0)

| Model | mae_max | mae_mean | mae_min | mae_std |
|---|---|---|---|---|
| Random Forest | 294310.0 | 177254.0 | 129310.0 | 59644.0 |
| KNN | 348592.0 | 219310.0 | 157127.0 | 72422.0 |
| Linear Regression | 284012.0 | 186652.0 | 132439.0 | 53686.0 |


#### Model Selection Analysis

Comparing cross-validation results across three model types:

**Random Forest: Best performer** ✓
- **Mean MAE**: $177k (lowest mean error)
- **Std Dev**: $60k across folds (about a third of the mean, so fold-to-fold variation is substantial)
- Captures nonlinear patterns and feature interactions automatically

**Linear Regression: Close second**
- **Mean MAE**: $187k (+$9k vs Random Forest)
- **Std Dev**: $54k (smallest spread across folds)
- Competitive performance suggests the relationships are largely linear; the $9k gap is small relative to the fold-to-fold spread, so the ranking is suggestive rather than conclusive

**KNN: Weakest performer**
- **Mean MAE**: $219k (+$42k vs Random Forest)
- **Std Dev**: $72k (highest variance)
- Run here with default settings on the five top-scoring features; it likely needs careful tuning to be competitive

**Decision**: Proceed with **Random Forest** for hyperparameter tuning: it has the lowest mean MAE, and its spread across folds is comparable to the alternatives.

### Minimizing the Model: Post-Training Feature Selection

Up until now, we have been greedy in our feature selection, utilizing all the usable columns from the dataset. But we haven't done any feature selection based on the trained model itself. Now that we have a trained random forest model, we can look at the feature importances it learned and consider dropping features that have very low importance to simplify our model and reduce the number of features we need to maintain in production and expect when making predictions.

This is a form of post-training feature selection that can help us identify which features are actually contributing to the model's predictions and which ones might be noise. We can then retrain the model with only the most important features to see if we can maintain performance while improving simplicity and interpretability.

In [15]:
rfr_pipeline.fit(X, y)

pd.Series(
    index=rfr_pipeline.named_steps['preprocessor'].get_feature_names(),
    data=rfr_pipeline.named_steps['model'].feature_importances_
).sort_values(ascending=False).head(5)

num__PV_system_size_DC                      0.844514
cat_high_card__utility_service_territory    0.030366
num__total_module_count                     0.019025
num__inverter_loading_ratio                 0.013647
num__days_since_2000                        0.012632
dtype: float64

The five highest-ranked features in the output above are `num__PV_system_size_DC`, `cat_high_card__utility_service_territory`, `num__total_module_count`, `num__inverter_loading_ratio`, and `num__days_since_2000`, with system size alone accounting for about 84% of the total importance. For the reduced model we keep four of them and include `state` in place of `inverter_loading_ratio`, even though `cat_high_card__state` falls outside the top five:
- **num__PV_system_size_DC**: Maps to the original PV_system_size_DC column, which represents the size of the solar installation in kilowatts (kW). This is a critical feature for predicting total installed price, as larger systems generally cost more.
- **cat_high_card__state**: Maps to the original state column, which represents the US state where the installation is located. State-level factors like labor costs, permitting fees, and local incentives can influence installation costs.
- **cat_high_card__utility_service_territory**: Maps to the original utility_service_territory column, which represents the utility service territory where the installation is located. Utility-level factors like rates, incentives, and grid infrastructure can influence installation costs.
- **num__total_module_count**: Maps to the total_module_count column, which represents the total number of solar modules in the installation.
- **num__days_since_2000**: Derived from the original installation_date column.

So the columns that the reduced model takes as input are: 1) PV_system_size_DC, 2) state, 3) utility_service_territory, 4) total_module_count, and 5) installation_date
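Impurity-based importances such as `feature_importances_` are computed from the training data and can favor continuous or high-cardinality features. A common cross-check, not used in this notebook, is permutation importance measured on held-out data. The sketch below compares the two on synthetic data; the feature names, coefficients, and noise levels are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
size = rng.uniform(10, 500, size=n)   # informative feature (stand-in for system size)
noise = rng.normal(size=n)            # uninformative feature
X_demo = np.column_stack([size, noise])
y_demo = 2500.0 * size + rng.normal(scale=5000.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
forest = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=0).fit(X_tr, y_tr)

# Impurity-based importances: derived from the training data, sum to 1
impurity_importance = forest.feature_importances_

# Permutation importance: increase in held-out MAE when a column is shuffled
perm = permutation_importance(
    forest, X_te, y_te, scoring="neg_mean_absolute_error", n_repeats=5, random_state=0
)
perm_importance = perm.importances_mean

print("impurity:   ", np.round(impurity_importance, 3))
print("permutation:", np.round(perm_importance, 1))
```

If both rankings agree on which features carry the signal, dropping the low-ranked ones is a safer simplification.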

In [16]:
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureReducer(BaseEstimator, TransformerMixin):
    """
    Transformer that selects a subset of columns.
    """

    def __init__(self, features_to_keep=None):
        self.features_to_keep = features_to_keep

    def fit(self, X, y=None):
        if self.features_to_keep is None:
            raise ValueError("features_to_keep must be provided.")

        missing = [col for col in self.features_to_keep if col not in X.columns]
        if missing:
            raise ValueError(f"Columns not found in input data: {missing}")

        return self

    def transform(self, X):
        return X[self.features_to_keep].copy()


In [17]:
config['model_features'] = {
    'features': ['PV_system_size_DC', 'state', 'utility_service_territory', 'total_module_count', 'installation_date'],
    'target': 'total_installed_price'
}

feature_reducer = FeatureReducer(features_to_keep=config['model_features']['features'])

min_cleaned_df = cleaned_df.copy()

min_rfr_pipeline = Pipeline([
    ("feature_engineering", feature_engineer),
    ("feature_reducer", feature_reducer),
    ("preprocessor", tts_preprocessor),
    ("model", RandomForestRegressor(random_state=42, n_estimators=100, max_depth=9))
])

min_rfr_score = cross_val_score(min_rfr_pipeline, min_cleaned_df.drop(columns=[config['model_features']['target']]), 
                                min_cleaned_df[config['model_features']['target']], cv=5, scoring='neg_mean_absolute_error')

results_df = pd.concat(
    [
        results_df, 
        pd.DataFrame(
            index = ['Random Forest (Smaller Model)'],
            data = {
                'mae_max': [-min_rfr_score.min()],
                'mae_mean': [-min_rfr_score.mean()],
                'mae_min': [-min_rfr_score.max()],
                'mae_std': [min_rfr_score.std()]
            }
        )
    ]
)

results_df.round(0)

| Model | mae_max | mae_mean | mae_min | mae_std |
|---|---|---|---|---|
| Random Forest | 294310.0 | 177254.0 | 129310.0 | 59644.0 |
| KNN | 348592.0 | 219310.0 | 157127.0 | 72422.0 |
| Linear Regression | 284012.0 | 186652.0 | 132439.0 | 53686.0 |
| Random Forest (Smaller Model) | 290987.0 | 174575.0 | 134303.0 | 58612.0 |


### Model Optimizer - Hyperparameter Tuning and Final Training

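The **`RFRTrainer`** class handles the training workflow:
1. **GridSearchCV**: Performs 5-fold cross-validation on an 80% training split to find the best hyperparameters
2. **Test-set check**: Refits the best configuration on the training split and reports MAE on the remaining 20%
3. **Model retraining**: Trains the final model with the best parameters on the full dataset

**Model persistence** happens outside the class: the returned pipeline is saved to disk with `joblib.dump` for later use.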
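The parameter grid used with a pipeline relies on scikit-learn's `<step>__<parameter>` naming. The sketch below shows that convention on synthetic data with a simplified two-step pipeline; the data, the `StandardScaler` step, and the small grid are assumptions for illustration, not the notebook's configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 10.0 * X_demo[:, 0] + rng.normal(size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# The "model__" prefix routes each parameter to the pipeline step named "model"
demo_grid = GridSearchCV(
    demo_pipeline,
    param_grid={"model__max_depth": [3, 6], "model__min_samples_leaf": [1, 2]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
demo_grid.fit(X_tr, y_tr)

# Stripping the prefix yields keyword arguments for the regressor itself
model_kwargs = {k.replace("model__", ""): v for k, v in demo_grid.best_params_.items()}

# With the default refit=True, best_estimator_ is already refit on all of X_tr
test_mae = mean_absolute_error(y_te, demo_grid.best_estimator_.predict(X_te))
print(model_kwargs, round(test_mae, 3))
```

Since `refit=True` is the default, `best_estimator_` can be scored on the test split directly; rebuilding the pipeline by hand, as the trainer does, yields an equivalent model at the cost of one extra fit.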

In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

class RFRTrainer:
    """Tune, evaluate, and retrain the random forest pipeline."""

    def __init__(self, param_grid: dict, preprocessor: TTSPreprocessor, feature_engineer: FeatureEngineer):
        self.param_grid = param_grid
        self.preprocessor = preprocessor
        self.feature_engineer = feature_engineer
        self.model_pipeline = self._build_pipeline(n_estimators=100, max_depth=9)

    def _build_pipeline(self, **model_params) -> Pipeline:
        """Assemble the full pipeline around a random forest with the given hyperparameters."""
        return Pipeline([
            ("feature_engineering", self.feature_engineer),
            ("feature_reducer", FeatureReducer(features_to_keep=config['model_features']['features'])),
            ("preprocessor", self.preprocessor),
            ("model", RandomForestRegressor(random_state=42, **model_params)),
        ])

    def train_new_model(self, X, y):
        grid = GridSearchCV(
            self.model_pipeline,
            param_grid=self.param_grid,
            cv=5,
            scoring="neg_mean_absolute_error",
            n_jobs=1
        )

        # Hold out 20% of the rows; the grid search only ever sees the training split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        print("Training new model with GridSearchCV...")
        grid.fit(X_train, y_train)

        print(f"Best parameters discovered: {grid.best_params_}")
        print(f"Best CV MAE: {-grid.best_score_:.4f}")
        print("----------------------------------")
        print("Evaluating best model on test set...")

        # Strip the "model__" step prefix so the parameters can be passed to the regressor directly
        best_params = {k.replace('model__', ''): v for k, v in grid.best_params_.items()}

        train_pipeline = self._build_pipeline(**best_params)
        train_pipeline.fit(X_train, y_train)
        test_mae = mean_absolute_error(y_test, train_pipeline.predict(X_test))

        print(f"Test MAE with best parameters: {test_mae:.4f}")
        print('--------------------------------')
        print("Retraining on full dataset...")

        # Final model: same hyperparameters, fitted on every available row
        best_model = self._build_pipeline(**best_params)
        best_model.fit(X, y)

        return best_model

In [19]:
import joblib

config['hyperparameter_search'] = {'param_grid': {
    'model__n_estimators': [100, 200],
    'model__max_depth': [9, 12],
    'model__min_samples_split': [2, 5],
    'model__min_samples_leaf': [1, 2]
}}

X, y = cleaned_df.drop(columns=['total_installed_price']), cleaned_df['total_installed_price']

model_trainer = RFRTrainer(param_grid=config['hyperparameter_search']['param_grid'], preprocessor=tts_preprocessor, feature_engineer=feature_engineer)
model = model_trainer.train_new_model(X, y)

joblib.dump(model, '../models/notebook_model.pkl')

Training new model with GridSearchCV...
Best parameters discovered: {'model__max_depth': 9, 'model__min_samples_leaf': 2, 'model__min_samples_split': 2, 'model__n_estimators': 100}
Best CV MAE: 161588.8327
----------------------------------
Evaluating best model on test set...
Test MAE with best parameters: 161665.6807
--------------------------------
Retraining on full dataset...


['../models/notebook_model.pkl']

-------------------

## Model Validation on Held-Out 2023 Data

Now that we've trained our model on 2019-2022 data, we need to validate its performance on truly held-out 2023 data. This validation:

1. **Prevents data leakage** - No overlap between training (2019-2022) and validation (2023)
2. **Tests temporal generalization** - Can the model predict costs for installations in a future year?
3. **Simulates production usage** - Train on historical data, predict on new installations

We'll follow the same pipeline we used for training:
- Load 2023 data with DataLoader
- Clean with DataCleaner  
- Engineer features with FeatureEngineer
- Make predictions with our trained model
- Evaluate performance metrics
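The idea behind the temporal hold-out can be sketched with plain pandas. The toy records below are hypothetical; only the column names mirror the dataset, and in the notebook itself the year filtering is done by `DataLoader`.

```python
import pandas as pd

# Toy records (hypothetical values); only the column names mirror the dataset
toy = pd.DataFrame({
    "installation_date": pd.to_datetime(
        ["2019-03-01", "2020-07-15", "2021-11-30", "2022-12-31", "2023-01-04", "2023-11-09"]
    ),
    "total_installed_price": [100_000.0, 120_000.0, 90_000.0, 150_000.0, 110_000.0, 130_000.0],
})

years = toy["installation_date"].dt.year
train_part = toy[years.between(2019, 2022)]   # inclusive on both ends
validation_part = toy[years == 2023]

# No row can fall into both periods, and every training row predates validation
overlap = train_part.index.intersection(validation_part.index)
print(len(train_part), len(validation_part), len(overlap))
```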

### Load 2023 Validation Data

Using the same DataLoader configuration, but filtering to 2023 only.

In [20]:
# Configure validation data loading for 2023
config['validation_loading'] = {
    'year_min': 2023,
    'year_max': 2023,
    'customer_segments': ['COM']
}

# Load 2023 data
print("Loading 2023 validation data...")
print("="*60)
validation_dataloader = DataLoader(tts_data_directory='../data/raw')
validation_dataloader.load_training_data(**config['validation_loading'])

print(f"\n2023 validation set size: {len(validation_dataloader.get_data())} installations")
validation_dataloader.get_data().head()

Loading 2023 validation data...
Loaded 2967 rows from TTS_LBNL_public_file_29-Sep-2025_all.csv

2023 validation set size: 2967 installations


Unnamed: 0,data_provider_1,data_provider_2,system_ID_1,system_ID_2,installation_date,PV_system_size_DC,total_installed_price,rebate_or_grant,customer_segment,expansion_system,multiple_phase_system,TTS_link_ID,new_construction,tracking,ground_mounted,zip_code,city,state,utility_service_territory,third_party_owned,installer_name,self_installed,azimuth_1,azimuth_2,azimuth_3,tilt_1,tilt_2,tilt_3,module_manufacturer_1,module_model_1,module_quantity_1,module_manufacturer_2,module_model_2,module_quantity_2,module_manufacturer_3,module_model_3,module_quantity_3,additional_modules,technology_module_1,technology_module_2,technology_module_3,BIPV_module_1,BIPV_module_2,BIPV_module_3,bifacial_module_1,bifacial_module_2,bifacial_module_3,nameplate_capacity_module_1,nameplate_capacity_module_2,nameplate_capacity_module_3,efficiency_module_1,efficiency_module_2,efficiency_module_3,inverter_manufacturer_1,inverter_model_1,inverter_quantity_1,inverter_manufacturer_2,inverter_model_2,inverter_quantity_2,inverter_manufacturer_3,inverter_model_3,inverter_quantity_3,additional_inverters,micro_inverter_1,micro_inverter_2,micro_inverter_3,built_in_meter_inverter_1,built_in_meter_inverter_2,built_in_meter_inverter_3,output_capacity_inverter_1,output_capacity_inverter_2,output_capacity_inverter_3,DC_optimizer,inverter_loading_ratio,battery_manufacturer,battery_model,battery_rated_capacity_kW,battery_rated_capacity_kWh,battery_price,technology_type,extensions_multiphase_id
0,Connecticut Green Bank,-1,PT-102308,-1,2023-02-08,60.75,191794.67,0.0,COM,False,True,-1,-1.0,-1.0,-1.0,6880,Westport,CT,Eversource,-1.0,Earthlight Technologies,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
1,Connecticut Green Bank,-1,PT-102397,-1,2023-11-09,216.2,523947.06,0.0,COM,False,True,tts_extension_id_2015,-1.0,-1.0,-1.0,6460,Milford,CT,United Illuminating,-1.0,Earthlight Technologies,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,1A2A0292E
2,Connecticut Green Bank,-1,PT-102467,-1,2023-12-01,77.0,167561.43,0.0,COM,False,True,tts_extension_id_2016,-1.0,-1.0,-1.0,6514,Hamden,CT,United Illuminating,-1.0,Aec Solar,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,7E7127D2
3,Connecticut Green Bank,-1,PT-102272,-1,2023-09-11,31.0,94700.78,0.0,COM,False,True,-1,-1.0,-1.0,-1.0,6790,Torrington,CT,Eversource,-1.0,Smart Roofs Solar,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1.0,-1.0,pv-only,-1
4,Connecticut Green Bank,-1,ESS-00026,-1,2023-01-04,-1.0,376096.0,132000.0,COM,False,True,-1,-1.0,-1.0,-1.0,6029,Ellington,CT,Eversource,-1.0,Earthlight Technologies,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1,-1,-1.0,-1.0,-1,-1,-1,-1,-1,-1,-1.0,-1.0,-1.0,-1.0,-1.0,-1,-1,125.0,660.0,-1.0,pv+storage,-1


### Clean 2023 Validation Data

Apply the same cleaning configuration used for training data.

In [21]:
# Clean 2023 data using same configuration as training
print("Cleaning 2023 validation data...")
print("="*60)
validation_cleaner = DataCleaner(**config['cleaning'])
validation_cleaner.load_data(validation_dataloader.get_data())
cleaned_validation_df = validation_cleaner.clean(target_col='total_installed_price')

print(f"\nCleaned validation set size: {len(cleaned_validation_df)} installations")
print(f"Shape: {cleaned_validation_df.shape}")

Cleaning 2023 validation data...
> Removed 858 rows with missing or invalid target values.
> Dropping ID and provider columns: ['data_provider_1', 'data_provider_2', 'system_ID_1', 'system_ID_2']
> Dropping columns with majority NA values: ['TTS_link_ID', 'module_model_1', 'module_manufacturer_2', 'module_model_2', 'module_manufacturer_3', 'module_model_3', 'technology_module_1', 'technology_module_2', 'technology_module_3', 'inverter_manufacturer_1', 'inverter_model_1', 'inverter_manufacturer_2', 'inverter_model_2', 'inverter_manufacturer_3', 'inverter_model_3', 'battery_manufacturer', 'battery_model', 'extensions_multiphase_id', 'new_construction', 'azimuth_2', 'azimuth_3', 'tilt_2', 'tilt_3', 'module_quantity_2', 'module_quantity_3', 'BIPV_module_2', 'BIPV_module_3', 'bifacial_module_2', 'bifacial_module_3', 'nameplate_capacity_module_2', 'nameplate_capacity_module_3', 'efficiency_module_2', 'efficiency_module_3', 'inverter_quantity_2', 'inverter_quantity_3', 'micro_inverter_2', 'mi

### Evaluate Model on 2023 Data

Load the trained model and make predictions on the held-out 2023 validation set.

In [22]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the trained model
trained_model = joblib.load('../models/notebook_model.pkl')

validation_X = cleaned_validation_df.drop(columns=['total_installed_price'])
validation_y = cleaned_validation_df['total_installed_price']

# Make predictions on 2023 validation data
validation_predictions = trained_model.predict(validation_X)

# Calculate validation metrics
validation_mae = mean_absolute_error(validation_y, validation_predictions)
validation_rmse = np.sqrt(mean_squared_error(validation_y, validation_predictions))
validation_r2 = r2_score(validation_y, validation_predictions)

print("="*60)
print("2023 VALIDATION RESULTS (Held-Out Data)")
print("="*60)
print(f"Mean Absolute Error (MAE):        ${validation_mae:>15,.0f}")
print(f"Root Mean Squared Error (RMSE):   ${validation_rmse:>15,.0f}")
print(f"R² Score:                         {validation_r2:>16.4f}")
print("="*60)
print("\nInterpretation:")
print(f"  • On average, predictions are off by ${validation_mae:,.0f}")
print(f"  • The model explains {validation_r2*100:.1f}% of variance in 2023 solar CAPEX")
print("  • This is true out-of-sample performance (no data leakage)")
print("  • Training: 2019-2022 | Validation: 2023")

2023 VALIDATION RESULTS (Held-Out Data)
Mean Absolute Error (MAE):        $        156,720
Root Mean Squared Error (RMSE):   $        575,345
R² Score:                                   0.7491

Interpretation:
  • On average, predictions are off by $156,720
  • The model explains 74.9% of variance in 2023 solar CAPEX
  • This is true out-of-sample performance (no data leakage)
  • Training: 2019-2022 | Validation: 2023
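The RMSE reported above is several times the MAE. RMSE can never be smaller than MAE, and the gap widens when a few errors are much larger than the rest, so a ratio like this is consistent with a small number of very large misses. The sketch below illustrates the effect with hypothetical error values, not the notebook's residuals.

```python
import numpy as np

# Hypothetical errors in dollars: 99 modest misses and one very large one
errors = np.array([50_000.0] * 99 + [5_000_000.0])

mae = np.mean(np.abs(errors))
rmse = np.sqrt(np.mean(errors ** 2))

# Squaring lets the single large error dominate RMSE, while MAE weights it linearly
print(f"MAE:  ${mae:,.0f}")
print(f"RMSE: ${rmse:,.0f}")
print(f"RMSE / MAE: {rmse / mae:.1f}")
```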


### Visualization: Actual vs Predicted (2023 Validation Data)

Plot actual vs predicted costs to visualize model performance on the held-out 2023 data.

In [23]:
fig = go.Figure()

# Scatter plot of actual vs predicted
fig.add_trace(go.Scatter(
    x=validation_y,
    y=validation_predictions,
    mode='markers',
    marker=dict(
        size=6,
        color=validation_y,
        colorscale='Viridis',
        showscale=True,
        colorbar=dict(title="Actual Price ($)"),
        opacity=0.6
    ),
    name='2023 Predictions'
))

# Perfect prediction line
min_val = min(validation_y.min(), validation_predictions.min())
max_val = max(validation_y.max(), validation_predictions.max())
fig.add_trace(go.Scatter(
    x=[min_val, max_val],
    y=[min_val, max_val],
    mode='lines',
    line=dict(color='red', dash='dash', width=2),
    name='Perfect Prediction'
))

fig.update_layout(
    title=f'Actual vs Predicted Solar CAPEX (2023 Validation Data)<br><sub>R² = {validation_r2:.4f}, MAE = ${validation_mae:,.0f}</sub>',
    xaxis_title='Actual Total Installed Price ($)',
    yaxis_title='Predicted Total Installed Price ($)',
    width=800,
    height=600,
    hovermode='closest'
)

fig.show()

### Residual Analysis (2023 Validation Data)

Examine residuals (actual - predicted) to identify patterns or systematic errors.

In [24]:
# Calculate residuals
residuals = validation_y - validation_predictions

# Create residual plots
from plotly.subplots import make_subplots

fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Residual Plot', 'Residual Distribution'),
    horizontal_spacing=0.12
)

# Residual scatter plot
fig.add_trace(
    go.Scatter(
        x=validation_predictions,
        y=residuals,
        mode='markers',
        marker=dict(
            size=5,
            color=np.abs(residuals),
            colorscale='Reds',
            showscale=True,
            colorbar=dict(title="Abs Error ($)", x=0.46),
            opacity=0.6
        ),
        name='Residuals'
    ),
    row=1, col=1
)

# Zero line
fig.add_trace(
    go.Scatter(
        x=[validation_predictions.min(), validation_predictions.max()],
        y=[0, 0],
        mode='lines',
        line=dict(color='black', dash='dash', width=1),
        showlegend=False
    ),
    row=1, col=1
)

# Histogram of residuals
fig.add_trace(
    go.Histogram(
        x=residuals,
        nbinsx=50,
        marker_color='steelblue',
        name='Distribution',
        showlegend=False
    ),
    row=1, col=2
)

fig.update_xaxes(title_text="Predicted Price ($)", row=1, col=1)
fig.update_yaxes(title_text="Residual (Actual - Predicted) ($)", row=1, col=1)
fig.update_xaxes(title_text="Residual ($)", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=2)

fig.update_layout(
    title='Residual Analysis - 2023 Validation Data',
    width=1200,
    height=500,
    showlegend=False
)

fig.show()

print(f"Residual Statistics:")
print(f"  Mean:      ${residuals.mean():>12,.0f} (should be ~$0)")
print(f"  Std Dev:   ${residuals.std():>12,.0f}")
print(f"  Median:    ${residuals.median():>12,.0f}")
print(f"  Min:       ${residuals.min():>12,.0f}")
print(f"  Max:       ${residuals.max():>12,.0f}")

Residual Statistics:
  Mean:      $      36,490 (should be ~$0)
  Std Dev:   $     574,323
  Median:    $      -4,430
  Min:       $  -4,532,326
  Max:       $  13,319,114
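With residuals defined as actual minus predicted, a positive mean alongside a slightly negative median is what a long right tail produces: most predictions land near or slightly above the actual price, while a few large projects are substantially underpredicted. This is one reading of the statistics above, not a confirmed diagnosis. The toy numbers below are hypothetical and only reproduce the pattern.

```python
import numpy as np

# Hypothetical actual and predicted prices in dollars
actual = np.array([100_000.0, 250_000.0, 400_000.0, 3_000_000.0])
predicted = np.array([110_000.0, 255_000.0, 420_000.0, 1_500_000.0])

residuals_demo = actual - predicted   # positive means the model underpredicted
mean_residual = residuals_demo.mean()
median_residual = np.median(residuals_demo)

# One large underprediction pulls the mean up; the median stays near the typical case
print(f"Mean residual:   ${mean_residual:,.0f}")
print(f"Median residual: ${median_residual:,.0f}")
```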
