# Logistic Regression for Solar Installation Prediction

This notebook applies logistic regression to the solar dataset to predict whether a household installs solar panels.

Click the badge below to open the notebook in Google Colab:

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/chuckgrigsby0/agec-784/blob/main/notebooks/05_logistic_solar_data.ipynb)

## Setup and Data Loading

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

In [None]:
# Load solar installation data from GitHub repository
base_url = "https://raw.githubusercontent.com/chuckgrigsby0/agec-784/main/data/"
solar_data = pd.read_csv(base_url + 'solar-data.csv')

print("Data loaded successfully!")
print(f"Number of rows and columns: {solar_data.shape}")

## Data Exploration

In [None]:
# Display column names
print(solar_data.columns)

In [None]:
# Display first 5 observations
print(solar_data.head())

In [None]:
# Summary statistics for numeric variables
np.round(solar_data.describe(), decimals=4)

In [None]:
# Count of households by installation status
solar_data['Install?'].value_counts()

## Data Preparation

In [None]:
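# Create binary outcome variable (Yes = 1, No = 0), placed right after 'Install?'
# Guard against re-running the cell: insert() raises ValueError if the column already exists
if 'Install' not in solar_data.columns:
    i = solar_data.columns.get_loc('Install?') + 1
    solar_data.insert(i, 'Install', np.where(solar_data['Install?'] == 'Yes', 1, 0))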

In [None]:
# Split data into training (70%) and testing (30%) sets
# random_state ensures reproducibility
train_data, test_data = train_test_split(
    solar_data,
    train_size=0.7,
    test_size=0.3,
    random_state=731
)

## Model Estimation

In [None]:
# Estimate logistic regression model using training data
logit_train = smf.logit('Install ~ Income + PSH', data=train_data).fit()

In [None]:
# Display model summary
print(logit_train.summary())

## Model Prediction and Evaluation

### Generate Predictions on Test Data

In [None]:
# Generate predicted probabilities for test data
pred_prob = logit_train.predict(test_data)
print(f"Range of predicted probabilities: {np.min(pred_prob):.4f}, {np.max(pred_prob):.4f}")
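
For a logit model, each predicted probability is the logistic function applied to the linear index of the predictors. The sketch below reproduces that calculation by hand. The coefficient values are hypothetical and chosen only for illustration; they are not the estimates from `logit_train`, and `predict_prob` is a helper name introduced here.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not estimates from the solar data)
b0, b_income, b_psh = -8.0, 0.08, 0.5

def predict_prob(income, psh):
    """Apply the logistic function to the linear index b0 + b_income*Income + b_psh*PSH."""
    index = b0 + b_income * income + b_psh * psh
    return 1 / (1 + np.exp(-index))

# Three hypothetical income levels at a fixed PSH value
example_prob = predict_prob(income=np.array([40.0, 70.0, 100.0]), psh=5.0)
print(np.round(example_prob, 4))
```

Because the logistic function maps any real number into the interval (0, 1), the predictions are always valid probabilities, and with a positive income coefficient they rise with income.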

### Model Performance Metrics

In [None]:
# Create predictions dataframe (threshold = 0.5)
preds_logit_df = pd.DataFrame({
    'actual': test_data['Install'],
    'pred_prob': pred_prob,
    'pred_class': np.where(pred_prob >= 0.5, 1, 0)
})

preds_logit_df.head()
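
The 0.5 cutoff is a convention rather than a requirement. As a rough illustration, the sketch below applies three cutoffs to a small set of hypothetical probabilities (not taken from `pred_prob`) and counts how many observations are classified as installs under each.

```python
import numpy as np

# Hypothetical predicted probabilities, for illustration only
example_probs = np.array([0.12, 0.35, 0.48, 0.50, 0.63, 0.91])

counts = {}
for threshold in (0.3, 0.5, 0.7):
    # Same rule as above: classify as 1 when the probability meets the threshold
    example_class = np.where(example_probs >= threshold, 1, 0)
    counts[threshold] = int(example_class.sum())
    print(f"threshold={threshold}: {example_class.tolist()}")
```

Lowering the threshold classifies more observations as installs, and raising it classifies fewer.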

### Calculate Accuracy

Accuracy is the proportion of test observations whose predicted class matches the actual class.

In [None]:
# Calculate model accuracy
accuracy = accuracy_score(preds_logit_df['actual'], preds_logit_df['pred_class'])
print(f"Model Accuracy: {accuracy:.4f}")
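
Accuracy is easier to judge against a baseline. One common reference point, sketched below, is the accuracy obtained by always predicting the most frequent class. The arrays are hypothetical and serve only to show the arithmetic; with the notebook's data, the same comparison would use `preds_logit_df['actual']` and `preds_logit_df['pred_class']`.

```python
import numpy as np

# Hypothetical labels and predicted classes, for illustration only
example_actual = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])
example_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# Accuracy: share of predictions that match the actual label
example_accuracy = np.mean(example_actual == example_pred)

# Baseline: accuracy from always predicting the most common class
baseline_accuracy = max(np.mean(example_actual == 1), np.mean(example_actual == 0))

print(f"Accuracy: {example_accuracy:.2f}")
print(f"Majority-class baseline: {baseline_accuracy:.2f}")
```

A model whose accuracy barely exceeds the baseline adds little beyond knowing the class proportions.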

### Confusion Matrix

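The confusion matrix cross-tabulates actual against predicted classes. With `labels=[0, 1]`, rows are actual classes and columns are predicted classes, so the diagonal holds the correct predictions, the top-right cell holds false positives, and the bottom-left cell holds false negatives.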

In [None]:
# Generate confusion matrix
cm = confusion_matrix(preds_logit_df['actual'], preds_logit_df['pred_class'], labels=[0, 1])
print(f"Confusion Matrix:\n{cm}")

In [None]:
# Visualize confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()
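
The four cells of a binary confusion matrix also yield metrics that separate the two kinds of error. The sketch below computes precision, recall, and specificity from a hypothetical matrix laid out the way scikit-learn returns it for `labels=[0, 1]` (rows are actual classes, columns are predicted classes). The counts are made up for illustration and are not results from the solar data.

```python
import numpy as np

# Hypothetical 2x2 confusion matrix, for illustration only
# Layout: [[TN, FP],
#          [FN, TP]]
cm_example = np.array([[50, 10],
                       [5, 35]])

tn, fp, fn, tp = cm_example.ravel()

precision = tp / (tp + fp)    # of predicted installs, share that actually installed
recall = tp / (tp + fn)       # of actual installs, share the model identified
specificity = tn / (tn + fp)  # of actual non-installs, share the model identified

print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"Specificity: {specificity:.4f}")
```

With the notebook's objects, the same unpacking would apply to `cm.ravel()`.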

## Visualization of Marginal Effects

The following plot shows how the predicted probability of solar installation changes with income. To isolate the effect of income, we hold PSH constant at its mean in the test data while varying income across its observed range in the test data. This produces the characteristic S-curve of the logistic function. The marginal effect of income at any income level is the slope of this curve at that point.

In [None]:
# Create prediction data: vary Income, hold PSH at mean
mean_psh = test_data['PSH'].mean()
income_range = np.linspace(test_data['Income'].min(), 
                          test_data['Income'].max(), 100)

pred_data = pd.DataFrame({
    'Income': income_range,
    'PSH': mean_psh
})

# Generate predicted probabilities
pred_probs = logit_train.predict(pred_data)

In [None]:
# Plot marginal effect of income
mako = sns.color_palette("mako", 10)
plt.figure(figsize=(10, 6))

# Actual installation status
sns.scatterplot(x='Income', y='Install', data=test_data, alpha=0.5, label='Actual') 

# Predicted probabilities
plt.plot(income_range, pred_probs, color=mako[2], linewidth=2, 
         label=f'Predicted Probability (PSH={mean_psh:.2f})') 

# Classification threshold
plt.axhline(y=0.5, color='green', linestyle='--', label='Classification Threshold (0.5)')

plt.xlabel('Income ($1000s)')
plt.ylabel('Probability of Installation')
plt.title('Logistic Regression: Marginal Effect of Income on Solar Installation')
plt.legend(loc='lower right')
plt.show()
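
The slope of the curve plotted above can also be computed directly. For a logit model, the marginal effect of a predictor on the probability is its coefficient times P(1 - P), so it is largest where the predicted probability is 0.5 and shrinks toward both tails. The sketch below uses hypothetical coefficients and a hypothetical income grid (not the estimates or data from this notebook) and checks the formula against a finite-difference slope.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not estimates from the solar data)
b0, b_income, b_psh = -8.0, 0.08, 0.5
psh_fixed = 5.0  # hold PSH constant while income varies

# Hypothetical income grid
income_grid = np.linspace(20, 120, 101)

# Predicted probability from the logistic function
index = b0 + b_income * income_grid + b_psh * psh_fixed
prob = 1 / (1 + np.exp(-index))

# Analytical marginal effect of income: dP/dIncome = b_income * P * (1 - P)
marginal_effect = b_income * prob * (1 - prob)

# Numerical check: slope of the probability curve via finite differences
numerical_slope = np.gradient(prob, income_grid)

peak_income = income_grid[np.argmax(marginal_effect)]
print(f"Marginal effect is largest near Income = {peak_income:.0f}")
print(f"Largest marginal effect: {marginal_effect.max():.4f}")
```

The marginal effect can never exceed one quarter of the coefficient, because P(1 - P) peaks at 0.25 when P = 0.5.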