# Lab 6 - CART and Random Forest Predictions

In this final part of the project, you will build and validate both CART and Random Forest models for predicting each target (Secchi Depth and Total Phosphorus) using the parcel features.

## Tasks

### Data Splitting Strategy

Hold out 30% of the lakes as a validation set, then split the remaining 70% of the lakes into training and test sets (again using a 70%/30% split). The resulting random split is:

- **30% of the lakes** marked as "Validation"
- **~49% of the lakes** (70% of 70%) marked as "Training"
- **~21% of the lakes** (30% of 70%) marked as "Test"

### Model Building and Validation

For each target:

1. Perform a grid search for both CART and Random Forests using the training lakes
2. Compare the best model of each type on the test lakes to determine the best overall model
3. Refit the best model on the 70% of the lakes not in the validation set (training + test)
4. Use the validation set to estimate the performance of this model
5. Write up a summary of what we learn

## Strategy Overview

### Step 1: Load and Prepare Data

1. **Load parcel features from Delta Lake** (`./data/parcels.delta`)
2. **Load water quality data** (`./data/water_quality_by_year.parquet`)
3. **Create aggregated features** by grouping parcel data:
   - Group by: `Monit_MAP_CODE1`, `Year`, `distance_category`
   - Calculate numerical summaries (mean, median, std) for: EMV_LAND, EMV_BLDG, EMV_TOTAL, ACRES_POLY, GARAGESQFT, FIN_SQ_FT, YEAR_BUILT, NUM_UNITS, TOTAL_TAX, TAX_CAPAC
   - Calculate proportions for: BASEMENT=='Y', GARAGE=='Y', GREEN_ACRE, OPEN_SPACE, AG_PRESERV
   - Count properties by USE1_DESC categories
4. **Pivot/reshape features** so each lake-year has features from all distance categories
5. **Join with water quality targets** on lake ID and year

### Step 2: Create Lake-Level Splits

Since we want to split by **lakes** (not individual rows):

1. Get unique list of lakes (DNR_ID_Site_Number)
2. Randomly assign each lake to one of two groups:
   - 30% → Validation
   - 70% → Train/Test pool
3. From the 70% Train/Test pool, split again:
   - 70% of that 70% → Training (= 49% of all lakes)
   - 30% of that 70% → Test (= 21% of all lakes)
4. Add a `DataSplit` column to the full dataset based on lake assignment
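
The two-stage split above can be sketched with the standard library alone. The helper name `assign_lake_splits`, the seed, and the synthetic lake IDs are illustrative assumptions; the point is only that the assignment happens per lake, so every year of a lake lands in the same group.

```python
import random

def assign_lake_splits(lake_ids, val_frac=0.30, test_frac=0.30, seed=42):
    """Map each unique lake ID to 'Validation', 'Training', or 'Test'."""
    lakes = sorted(set(lake_ids))            # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(lakes)

    n_val = round(val_frac * len(lakes))     # 30% of all lakes
    pool = lakes[n_val:]                     # remaining 70%: train/test pool
    n_test = round(test_frac * len(pool))    # 30% of the pool

    splits = {lake: "Validation" for lake in lakes[:n_val]}
    splits.update({lake: "Test" for lake in pool[:n_test]})
    splits.update({lake: "Training" for lake in pool[n_test:]})
    return splits

# Synthetic IDs standing in for the unique DNR_ID_Site_Number values.
lake_ids = [f"lake-{i:03d}" for i in range(100)]
splits = assign_lake_splits(lake_ids)

counts = {
    name: sum(group == name for group in splits.values())
    for name in ("Validation", "Training", "Test")
}
print(counts)
```

One way to add the `DataSplit` column is to turn `splits` into a two-column frame (lake ID, `DataSplit`) and join it onto the full dataset by lake ID.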

### Step 3: Model Building - Secchi Depth

**Prepare data:**
- X = all parcel features (drop lake ID, year, targets, DataSplit)
- y = avg_secchi_depth

**Grid Search on Training Set:**

**CART:**
```python
cart_params = {
    'max_depth': [3, 5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'min_samples_leaf': [1, 5, 10, 20, 50]
}
```

**Random Forest:**
```python
rf_params = {
    'n_estimators': [10, 25, 50, 100, 200],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 5, 10, 20],
    'max_features': ['sqrt', 'log2', None]
}
```
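
A grid search over grids like these might be run as sketched below. The data is synthetic, the grids are cut down so the example finishes quickly, and the 5-fold `KFold` and RMSE scoring are assumptions rather than requirements of the lab. Plain `KFold` splits rows, not lakes; scikit-learn's `GroupKFold` is one option if each lake's years should stay in the same fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training lakes' features and target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 6))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + rng.normal(scale=0.3, size=120)

# Reduced grids so the sketch runs in seconds; swap in cart_params / rf_params.
small_cart_params = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 10]}
small_rf_params = {
    "n_estimators": [25, 50],
    "max_depth": [5, None],
    "max_features": ["sqrt", None],
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)

def run_grid_search(estimator, param_grid):
    """Cross-validated grid search scored by (negated) RMSE."""
    search = GridSearchCV(
        estimator,
        param_grid,
        cv=cv,
        scoring="neg_root_mean_squared_error",
    )
    return search.fit(X_train, y_train)

cart_search = run_grid_search(DecisionTreeRegressor(random_state=0), small_cart_params)
rf_search = run_grid_search(RandomForestRegressor(random_state=0), small_rf_params)

# best_score_ is negated RMSE, so flip the sign for reporting.
print("CART:", cart_search.best_params_, -cart_search.best_score_)
print("RF:  ", rf_search.best_params_, -rf_search.best_score_)
```

With the default `refit=True`, each search's `best_estimator_` is already refit on all training rows and can be scored directly on the test lakes.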

**Evaluation on Test Set:**
- Predict with best CART model → calculate RMSE, R²
- Predict with best RF model → calculate RMSE, R²
- Select best overall model

**Refit on Training + Test:**
- Combine training and test lakes (70% total)
- Refit the best model

**Final Validation:**
- Predict on validation set (30% of lakes)
- Report final RMSE, R², MAE
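
The evaluate, refit, and validate sequence could look like the sketch below. The three splits are synthetic, the two candidates use fixed hyperparameters as stand-ins for each grid search's best model, and choosing the winner by test RMSE is an assumption (R² would be an equally reasonable criterion).

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor

def regression_metrics(model, X, y):
    """RMSE, R², and MAE of a fitted model on (X, y)."""
    pred = model.predict(X)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y, pred))),
        "r2": float(r2_score(y, pred)),
        "mae": float(mean_absolute_error(y, pred)),
    }

# Synthetic stand-ins for the lake-level splits (sizes are arbitrary).
rng = np.random.default_rng(0)

def make_split(n_rows):
    X = rng.normal(size=(n_rows, 5))
    y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=n_rows)
    return X, y

X_train, y_train = make_split(150)
X_test, y_test = make_split(60)
X_val, y_val = make_split(90)

# Fixed settings stand in for the best model found by each grid search.
candidates = {
    "CART": DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Evaluation on the test set: fit on training rows, score on test rows.
test_scores = {
    name: regression_metrics(model.fit(X_train, y_train), X_test, y_test)
    for name, model in candidates.items()
}
best_name = min(test_scores, key=lambda name: test_scores[name]["rmse"])

# Refit an unfitted copy of the winner on training + test rows.
final_model = clone(candidates[best_name]).fit(
    np.vstack([X_train, X_test]),
    np.concatenate([y_train, y_test]),
)

# Final validation: the validation rows are touched only here.
val_metrics = regression_metrics(final_model, X_val, y_val)
print(best_name, val_metrics)
```

`clone` copies the hyperparameters without the fitted state, so the refit starts from scratch on the combined 70% of lakes.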

### Step 4: Model Building - Total Phosphorus

Repeat Step 3 with y = avg_total_phosphorus

### Step 5: Summary and Interpretation

Compare results:
- Which features are most important for each target?
- How does distance from lake affect predictions?
- Which model type performs better for each target?
- What's the expected prediction error on new lakes?
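
For the first two questions, one possible starting point is the fitted model's impurity-based `feature_importances_`, ranked per feature and then summed per distance category. The sketch below uses synthetic data and hypothetical feature names that end in a `near`/`far` suffix; the real names depend on how the pivot in Step 1 labels its columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature names: "<summary>_<distance category>".
feature_names = [
    "EMV_LAND_mean_near",
    "EMV_LAND_mean_far",
    "BASEMENT_prop_near",
    "BASEMENT_prop_far",
]

# Synthetic data in which the first feature dominates by construction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, len(feature_names)))
y = 3.0 * X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.2, size=200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-feature importances (they sum to 1 across all features).
importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)

# Total importance per distance category, read from the name suffix.
by_distance = {}
for name, value in importances.items():
    category = name.rsplit("_", 1)[-1]
    by_distance[category] = by_distance.get(category, 0.0) + float(value)

for name, value in ranked:
    print(f"{name:22s} {value:.3f}")
print(by_distance)
```

Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` computed on held-out lakes is a useful cross-check.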

## Step 1: Load and Prepare Data

### Task 1.1: Import Libraries and Load Data

```python
# Import required libraries
import polars as pl
import polars.selectors as cs
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import matplotlib.pyplot as plt
```

```python
# Load parcel data using lazy evaluation (don't read full 2.4GB into memory)
(parcels_lazy := pl.scan_delta('./data/parcel.delta'))
```

```python
# Load water quality data
(wq := pl.read_parquet('./data/water_quality_by_year.parquet'))
```

| DNR_ID_Site_Number | LAKE_NAME | Year | latitude | longitude | avg_secchi_depth | avg_total_phosphorus |
|---|---|---|---|---|---|---|
| str | str | i64 | str | str | f64 | f64 |
| 82009700-01 | La Lake | 2008 | 44.88725237 | -92.9713984 | 1.754545 | 0.118636 |
| 82015300-01 | Sunset Lake | 2006 | 45.13527125 | -92.94157024 | 3.344444 | 0.022111 |
| 82012200-01 | Pine Tree Lake | 2007 | 45.10231359 | -92.95386928 | 2.8 | 0.021556 |
| 82013700-01 | Fish Lake | 2013 | 45.12242547 | -92.97796745 | 1.131923 | 0.061423 |
| 10000200-01 | Riley Lake | 2006 | 44.83469027 | -93.51695191 | 2.29 | 0.0604 |
| … | … | … | … | … | … | … |
| 82033400-01 | Kismet Lake | 2013 | 45.09536417 | -92.8917326 | 1.67 | 0.025 |
| 82012200-01 | Pine Tree Lake | 2012 | 45.10231359 | -92.95386928 | 2.981818 | 0.020182 |
| 82011602-01 | Armstrong Lake | 2009 | 44.96252306 | -92.93917709 | 1.114286 | 0.058929 |
| 82013700-01 | Fish Lake | 2012 | 45.12242547 | -92.97796745 | 1.506333 | 0.108897 |


### Task 1.2: Inspect columns to plan feature aggregation

```python
# Check available columns in parcel data
parcels_lazy.collect_schema().names()
```