# Decision Trees for Solar Installation Prediction

In this notebook, we apply the decision tree concepts covered earlier to the solar dataset.

Click the badge below to open the notebook in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chuckgrigsby0/agec-784/blob/main/notebooks/03_decision_tree_solar_data.ipynb)

## Setup

### Load packages and data

In [None]:
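import os

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import graphviz
from sklearn import tree
from sklearn.model_selection import train_test_split

# Create the folder used by plt.savefig() later in the notebook
# (savefig does not create missing directories)
os.makedirs('output/figures', exist_ok=True)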

In [None]:
# Base URL for raw GitHub content
base_url = "https://raw.githubusercontent.com/chuckgrigsby0/agec-784/main/data/"

# Load the solar data directly from the GitHub URL
solar_data = pd.read_csv(base_url + 'solar-data.csv')

print("Data loaded successfully!")
print(f"Number of rows and columns: {solar_data.shape}")

### Data exploration

In [None]:
# Print the column names
# Note that .columns is an attribute of solar_data
print(solar_data.columns)

In [None]:
# Print the first 5 rows of the dataset
print(solar_data.head())

In [None]:
# Compute summary statistics, rounded to 4 decimal places 
# Note: only numeric columns are included
np.round(solar_data.describe(), decimals=4)

In [None]:
# To get counts of number of households that installed solar or not, we can use the value_counts() method
solar_data['Install?'].value_counts()

In [None]:
# Convert 'Install?' to categorical with explicit ordering for consistent outputs
solar_data['Install?'] = pd.Categorical(solar_data['Install?'], categories=['No', 'Yes'])
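
To see what the explicit category ordering does, here is a minimal sketch on a made-up series. The `toy` labels, including the stray `'Maybe'`, are illustrative assumptions and do not come from the solar data.

```python
import pandas as pd

# Hypothetical toy labels, not taken from solar-data.csv
toy = pd.Series(['Yes', 'No', 'Yes', 'Maybe'])

# Same conversion as above: declare the categories in a fixed order
toy_cat = pd.Series(pd.Categorical(toy, categories=['No', 'Yes']))

# Integer codes follow the declared order: 'No' -> 0, 'Yes' -> 1
codes = toy_cat.cat.codes.tolist()

# Labels outside the declared categories become missing values (code -1)
n_missing = int(toy_cat.isna().sum())

# value_counts() reports every declared category
counts = toy_cat.value_counts(sort=False)

print(codes, n_missing)
print(counts)
```

Because unlisted labels silently become missing, it is worth checking `value_counts()` before the conversion, as done in the previous cell.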

## Decision Tree Classification

We'll build a classifier to predict whether a household installs solar panels (Yes/No) based on Income and Peak Sun Hours (PSH).

### Prepare the data

Define the features (`X`) and the target variable (`Y`), then split the data into training and testing sets.

In [None]:
X = solar_data[['Income', 'PSH']]
Y = solar_data['Install?']

In [None]:
# Split data into training (70%) and testing (30%) sets
# We evaluate on test data to assess how well the model generalizes to unseen observations
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    train_size=0.7,
    test_size=0.3,
    random_state=731
)
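
When one class is much rarer than the other, a purely random split can leave the two sets with different class shares. The sketch below shows the optional `stratify` argument of `train_test_split` on made-up labels; `demo_X` and `demo_y` are hypothetical, and the notebook's own split above does not use stratification.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 'No' and 20 'Yes'
demo_y = pd.Series(['No'] * 80 + ['Yes'] * 20)
demo_X = pd.DataFrame({'feature': np.arange(100)})

# Passing only train_size is enough; the test set is the complement (30%)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo_X,
    demo_y,
    train_size=0.7,
    random_state=731,
    stratify=demo_y,   # keep the 'Yes' share similar in both sets
)

train_share = (yd_train == 'Yes').mean()
test_share = (yd_test == 'Yes').mean()
print(f"Share of 'Yes' in train: {train_share:.2f}, in test: {test_share:.2f}")
```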

### Train the classifier

In [None]:
clf = tree.DecisionTreeClassifier(
    criterion='gini',              # Measure of split quality for classification
    splitter='best',               # Strategy for choosing splits at each node
    min_samples_split=5,           # Minimum samples required to split a node
    min_samples_leaf=3,            # Minimum samples required in a leaf node
    max_features=None,             # Consider all features for splitting
    random_state=731               # For reproducibility
)
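
As a rough illustration of what `criterion='gini'` measures, the sketch below computes Gini impurity by hand for a made-up parent node and one candidate split. The helper names `gini_impurity` and `weighted_gini` are hypothetical and are not part of scikit-learn.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Hypothetical node with 5 'Yes' and 5 'No' households
parent = ['Yes'] * 5 + ['No'] * 5
# One candidate split of that node
left = ['Yes'] * 4 + ['No'] * 1
right = ['Yes'] * 1 + ['No'] * 4

parent_gini = gini_impurity(parent)
split_gini = weighted_gini(left, right)
print(f"Parent impurity: {parent_gini:.2f}")
print(f"Weighted impurity after split: {split_gini:.2f}")
print(f"Impurity decrease: {parent_gini - split_gini:.2f}")
```

A split is attractive when the weighted impurity of the children is well below the impurity of the parent; a pure node (one class only) has impurity 0.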

In [None]:
clf_train = clf.fit(X_train, y_train)

### Make predictions

In [None]:
clf_preds = clf_train.predict(X_test)

In [None]:
clf_accuracy = clf_train.score(X_test, y_test)
print(f'Accuracy of Decision Tree Classifier: {clf_accuracy:.4f}')

### Evaluate model performance

In [None]:
# Evaluating on Test Data
clf_preds = pd.DataFrame({
    'Actual': y_test,
    'Predicted': clf_train.predict(X_test)
})

clf_preds.head()

### Assessing Model Accuracy

Accuracy = 1 - Misclassification Rate

The `score()` method returns the model's accuracy on the data passed to it. We can also compute accuracy manually by comparing actual and predicted values.

In [None]:
# All yield the same result

print(1 - np.mean(clf_preds['Actual'] != clf_preds['Predicted'])) # Manual calculation of accuracy

print(np.mean(clf_preds['Actual'] == clf_preds['Predicted'])) # Manual calculation of accuracy

print(clf_train.score(X_test, y_test)) # Using score() method

In [None]:
# Confusion matrix shows counts of correct and incorrect predictions for each class
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

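# Rows are actual classes, columns are predicted classes
# Passing the class labels fixes the row/column order and labels the plot axes
cm = confusion_matrix(clf_preds['Actual'], clf_preds['Predicted'], labels=clf_train.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf_train.classes_)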
disp.plot();
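
The four cells of a binary confusion matrix are enough to recover accuracy and related metrics. The sketch below uses made-up labels (`toy_actual` and `toy_predicted` are hypothetical) and treats `'Yes'` as the positive class, which is an assumption for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for 8 households
toy_actual = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes']
toy_predicted = ['Yes', 'No', 'Yes', 'No', 'Yes', 'No', 'No', 'Yes']

# With labels=['No', 'Yes'], the matrix is laid out as
# [[true negatives, false positives],
#  [false negatives, true positives]]
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=['No', 'Yes'])
true_neg, false_pos, false_neg, true_pos = toy_cm.ravel()

toy_accuracy = (true_pos + true_neg) / toy_cm.sum()
toy_recall = true_pos / (true_pos + false_neg)        # share of actual 'Yes' that were found
toy_precision = true_pos / (true_pos + false_pos)     # share of predicted 'Yes' that were right

print(toy_cm)
print(f"Accuracy: {toy_accuracy:.2f}, Recall: {toy_recall:.2f}, Precision: {toy_precision:.2f}")
```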

In [None]:
# Tree depth and number of leaves indicate model complexity
print(f'Depth of Decision Tree Classifier: {clf_train.get_depth()}\n'
      f'Number of leaves: {clf_train.get_n_leaves()}\n')
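
To illustrate why depth and leaf count matter, the sketch below fits a shallow and an unrestricted tree to synthetic, noisy data and compares training and test accuracy. The data and the names `X_syn`, `y_syn`, and `settings` are assumptions made for this example; results on the solar data may look different.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature plus noise
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(300, 2))
y_syn = (X_syn[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, train_size=0.7, random_state=0
)

settings = {'shallow': 2, 'unrestricted': None}
results = {}
for name, depth in settings.items():
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xs_train, ys_train)
    results[name] = {
        'depth': model.get_depth(),
        'leaves': model.get_n_leaves(),
        'train_acc': model.score(Xs_train, ys_train),
        'test_acc': model.score(Xs_test, ys_test),
    }
    print(name, results[name])
```

An unrestricted tree can memorize noisy training data, so a large gap between training and test accuracy is a common sign of overfitting. Parameters such as `max_depth`, `min_samples_split`, and `min_samples_leaf` limit that growth.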

### Examine tree structure

In [None]:
# Extract feature and class names for visualization
fn = clf_train.feature_names_in_.tolist()
cn = clf_train.classes_.tolist()
print(f'Predictor names: {fn}')
print(f'Outcome types: {cn}')

### Visualize the decision tree

#### Using `tree.plot_tree()` to visualize the decision tree

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
tree.plot_tree(
    clf_train,
    ax=ax,
    feature_names=fn,
    class_names=cn,
    filled=True,
    rounded=True,
    fontsize=12
)
plt.tight_layout()
plt.savefig('output/figures/decision_tree_matplotlib.png', 
            dpi=300, bbox_inches='tight')

#### Using graphviz to visualize the decision tree

In [None]:
plot_tree = tree.export_graphviz(clf_train, 
                                 out_file=None,
                                 feature_names=fn,  
                                 class_names=cn,  
                                 filled=True, rounded=True,  
                                 special_characters=True)                                  
graph = graphviz.Source(plot_tree, 
                        filename="decision_tree_solar_data",
                        directory='output/figures/', 
                        format='png')  

graph.render()
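
If Graphviz is not available, a plain-text view of the splitting rules is a possible alternative. The sketch below uses `tree.export_text()` on a small made-up dataset; `toy_X`, `toy_y`, and the rule that income above 80 implies installation are assumptions for illustration, not properties of the solar data.

```python
import numpy as np
from sklearn import tree

# Hypothetical data: two features standing in for Income and PSH
rng = np.random.default_rng(1)
toy_X = np.column_stack([
    rng.uniform(20, 150, size=60),   # stand-in for Income
    rng.uniform(3, 7, size=60),      # stand-in for PSH
])
toy_y = np.where(toy_X[:, 0] > 80, 'Yes', 'No')

toy_clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(toy_X, toy_y)

# export_text() returns the decision rules as an indented string
rules = tree.export_text(toy_clf, feature_names=['Income', 'PSH'])
print(rules)
```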

## Decision Tree Regression

Decision tree regression predicts continuous values (here, Profit) rather than categories. We'll use the same features but a different target variable.
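
A regression tree fit with the squared-error criterion predicts the mean of the training targets in the leaf an observation falls into. The minimal sketch below shows this with a one-split tree on made-up numbers (`x_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data with two clearly separated groups
x_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([5.0, 6.0, 7.0, 20.0, 22.0, 24.0])

# max_depth=1 allows a single split, so the tree has two leaves
stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(x_toy, y_toy)

# One new point per group; each prediction is the mean target of its leaf
stump_preds = stump.predict(np.array([[2.5], [10.5]]))

print(stump_preds)
print(y_toy[:3].mean(), y_toy[3:].mean())   # leaf means for comparison
```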

### Prepare the data

In [None]:
X = solar_data[['Income', 'PSH']]
Y = solar_data['Profit']

In [None]:
# Same train/test split process as classification, but with Profit as the target
X_train, X_test, y_train, y_test = train_test_split(
    X,
    Y,
    train_size=0.7,
    test_size=0.3,
    random_state=731
)

### Train the regressor

In [None]:
tree_regr = tree.DecisionTreeRegressor(
    criterion='squared_error',     # MSE criterion for regression (not gini)
    splitter='best',               # Strategy for choosing splits at each node
    min_samples_split=5,           # Minimum samples required to split a node
    min_samples_leaf=5,            # Minimum samples in a leaf (higher than classifier)
    max_depth=None,                # No maximum depth constraint
    max_features=None,             # Consider all features for splitting
    random_state=731               # For reproducibility
)

In [None]:
tree_regr = tree_regr.fit(X_train, y_train)

### Make predictions

In [None]:
tree_regr_preds = tree_regr.predict(X_test)

In [None]:
# Evaluating on Test Data
tree_regr_preds = pd.DataFrame({
    'Actual': y_test,
    'Predicted': tree_regr.predict(X_test)
})

tree_regr_preds.head()

### Evaluate model performance

### Assessing Model Accuracy

$R^2 = 1 - \frac{\sum_{i}((y_{i} - \hat{y}_{i})^{2})}{\sum_{i}((y_{i} - \bar{y})^{2})}$

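$R^2$ is at most 1, where 1 indicates perfect predictions and 0 means the model performs no better than predicting the mean of the observed values. On test data, $R^2$ can be negative, which means the model performs worse than predicting the mean.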
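
The three reference cases for $R^2$ (perfect predictions, predicting the mean, and predictions worse than the mean) can be checked on a few made-up numbers. The arrays below are hypothetical and use `sklearn.metrics.r2_score`, which implements the formula above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values
actual_toy = np.array([1.0, 2.0, 3.0, 4.0])

perfect_pred = actual_toy.copy()                 # every prediction is exact
mean_pred = np.full(4, actual_toy.mean())        # always predict the mean (2.5)
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])        # predictions in reverse order

r2_perfect = r2_score(actual_toy, perfect_pred)
r2_mean = r2_score(actual_toy, mean_pred)
r2_bad = r2_score(actual_toy, bad_pred)

print(f"Perfect: {r2_perfect:.1f}, Mean: {r2_mean:.1f}, Reversed: {r2_bad:.1f}")
```

For the reversed predictions, the residual sum of squares is 20 and the total sum of squares is 5, so $R^2 = 1 - 20/5 = -3$.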

In [None]:
# Both yield the same result
numerator = np.sum((tree_regr_preds['Actual'] - tree_regr_preds['Predicted']) ** 2)
denominator = np.sum((tree_regr_preds['Actual'] - np.mean(tree_regr_preds['Actual'])) ** 2)

print(1 - numerator / denominator) # Manual calculation of R^2

print(tree_regr.score(X_test, y_test)) # Using score() method

In [None]:
print(f'Depth of Decision Tree: {tree_regr.get_depth()}\n'
      f'Number of leaves: {tree_regr.get_n_leaves()}\n')

### Examine tree structure

In [None]:
fn = tree_regr.feature_names_in_.tolist()
print(f'Predictor names: {fn}')

### Visualize the regression tree

#### Using `tree.plot_tree()` to visualize the decision tree

In [None]:
fig, ax = plt.subplots(figsize=(25, 10))
tree.plot_tree(
    tree_regr,
    ax=ax,
    feature_names=fn,
    filled=True,
    rounded=True,
    fontsize=12
)
plt.tight_layout()
plt.savefig('output/figures/reg_decision_tree_matplotlib.png', 
            dpi=300, bbox_inches='tight')

### Advanced: Computing Feature Importance

**Feature Importance** quantifies how much each feature contributed to the model's predictions. 

**How it's calculated**: For each feature, the model sums the variance reduction (decrease in MSE) at every split using that feature, weighted by the proportion of samples affected by each split. Values are normalized to sum to 1.0.

**How to interpret**: Features with higher importance account for more of the tree's impurity reduction. An importance of 0.30 means the feature accounts for roughly 30% of the total MSE reduction achieved by the tree's splits on the training data. However, low importance doesn't necessarily mean a feature is irrelevant: it may be correlated with another feature that the tree split on instead, or it may matter only for specific subgroups.
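
The calculation described above can be reproduced from the fitted tree's internal arrays. The sketch below is one way to do it on synthetic data; `X_demo`, `y_demo`, and `demo_tree` are made up for this example, and the loop reads the `tree_` attribute of a fitted scikit-learn tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the first feature matters more than the second
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 2))
y_demo = 3 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=200)

demo_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_demo, y_demo)
t = demo_tree.tree_
n_samples = t.weighted_n_node_samples

manual_importance = np.zeros(X_demo.shape[1])
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:   # leaf nodes have no split, so they contribute nothing
        continue
    # Sample-weighted MSE of the node minus that of its two children
    decrease = (
        n_samples[node] * t.impurity[node]
        - n_samples[left] * t.impurity[left]
        - n_samples[right] * t.impurity[right]
    )
    manual_importance[t.feature[node]] += decrease

# Normalize so the importances sum to 1
manual_importance /= manual_importance.sum()

print(np.round(manual_importance, 3))
print(np.round(demo_tree.feature_importances_, 3))
```

Both printed arrays should agree, which confirms that `feature_importances_` is a normalized sum of sample-weighted impurity decreases.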

In [None]:
importances = pd.Series(
    tree_regr.feature_importances_,
    index=X_train.columns
).sort_values(ascending=False)
print(f"Feature importances:\n{np.round(importances, 3)}")

In [None]:
def plot_feature_importance_seaborn(model, feature_names, top_n=10, figsize=(10, 6)):
    """
    Create a seaborn-styled feature importance plot.
    """
    # Get importances
    importances = pd.DataFrame({
        'feature': feature_names,
        'importance': model.feature_importances_
    }).sort_values('importance', ascending=False).head(top_n)
    
    # Set style
    sns.set_style("whitegrid")
    sns.set_palette("husl")
    
    # Create figure
    fig, ax = plt.subplots(figsize=figsize)
    
    # Create barplot
    sns.barplot(
        data=importances,
        y='feature',
        x='importance',
        hue='feature',
        ax=ax,
        palette='viridis'
    )
    
    # Styling
    ax.set_xlabel('Importance', fontsize=12, fontweight='bold')
    ax.set_ylabel('Feature', fontsize=12, fontweight='bold')
    ax.set_title('Top Feature Importances', fontsize=14, fontweight='bold', pad=20)
    
    # Add value labels
    for i, v in enumerate(importances['importance']):
        ax.text(v, i, f' {v:.3f}', va='center', fontsize=10)
    
    plt.tight_layout()
    sns.reset_defaults()  # Reset to avoid affecting other plots
    return fig, ax

# Usage
fig, ax = plot_feature_importance_seaborn(tree_regr, X_train.columns, top_n=10)
plt.savefig('output/figures/feature_importance_seaborn.png', dpi=300, bbox_inches='tight')
plt.show()