# Final Compilation - Ouray County Parcel Risk
**Author:** Bryce A Young  
**Institution:** University of Montana, National Center for Landscape Fire Analysis

**Created:** 2025-03-11 | 
**Modified:** 2025-03-11  

#### Overview
This notebook shows the salient points of predictions, clustering, and analysis for the Ouray Parcel Risk project. Consider this the highlight reel of the analysis folder in this repository.

This notebook compiles code from `risk_pred.ipynb`, `clustering.ipynb` and `structure_archetypes.ipynb`. Data from those notebooks is used in the present notebook and cleaned further. 

Final figures are produced and exported for presentations and for use in Chapters 2 and 3 of my master's thesis (risk score predictions and structure archetypes).

## Environment Setup
---

In [1]:
# setup environment
import os
### Directory ###
# Repository
os.chdir(r'D:/_PROJECTS/P001_OurayParcel/ouray')
# Root workspace
ws = r'D:/_PROJECTS/P001_OurayParcel'

### Data paths ###
# Folder where all the data inputs and outputs will live
data = os.path.join(ws, 'data')
# Scratch folder for intermediate files
scratch = os.path.join(data, '_temp')
# Any final outputs go here
out = os.path.join(data, '_out')
# Figures to export
figs = os.path.join(out, 'figures')

# Confirm the working directory
os.getcwd()

'D:\\_PROJECTS\\P001_OurayParcel\\ouray'

## Risk Rating Predictions
---
In this section, I use a random forest classifier to predict risk ratings. The general workflow is:
1. Import the most recent data containing reclassified features.
2. Redo some of the one-hot encoding so that the most important features are kept (e.g., WUI class 1, wood roof).
3. Select predictive categorical columns for minimal noise, along with the target columns `Risk_Rating` and `Risk_Rating_new`.
4. Split the data into training and testing sets.
5. Run the random forest classifier and analyze the results.

#### **1** Import data

In [None]:
import pandas as pd

pd.set_option('display.max_columns', None)

full_df = pd.read_csv(os.path.join(out, 'df_feat_clusters.csv'))
full_df.columns


#### **2** One-Hot Encoding

In [None]:
# Reduce dataframe to the most salient predictive features
# TODO: fill in the list of salient feature columns
df = full_df[]

from sklearn.preprocessing import OneHotEncoder

# NOTE: change OHE drop='first' argument because I don't want to drop these automatically.
# I will drop the ones I want to later.
# Initialize OHE
ohe = OneHotEncoder(sparse_output=False, drop='first')

# NOTE: change OHE cols to only include the ones I want
# OHE cols
ohe_cols = ['wui_class', 'tax_ARCH', 'tax_EXW', 'tax_RCVR', 'tax_RSTR']
ohe_array = ohe.fit_transform(df[ohe_cols])
# Reuse df's index so the concat below aligns row by row
ohe_df = pd.DataFrame(ohe_array, columns=ohe.get_feature_names_out(ohe_cols), index=df.index)
df = pd.concat([df, ohe_df], axis=1)

df.head()

#### **3** Select salient features

In [None]:
# Define predictive columns (and include target variable)
rf_pred_cols = []

# Create dataframe from predictive columns
pred = df[rf_pred_cols]

#### **4** Train-Test Split

In [None]:
from sklearn.model_selection import train_test_split

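# Define train-test split function
def split_data(df, target='Risk_Rating'):
    # Drop both rating columns from the features so the unused target cannot leak into X
    X = df.drop(columns=['Risk_Rating', 'Risk_Rating_new'], errors='ignore')
    y = df[target]
    return train_test_split(X, y, test_size=0.2, random_state=42)

# Apply to categorical data
X_train, X_test, y_train, y_test = split_data(pred)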
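
Because the risk classes are unevenly populated, a stratified split may be worth considering. The sketch below is not the split used in this notebook. It uses synthetic labels (80 `Low`, 20 `High`) and a placeholder `feature` column to show how the `stratify` argument of `train_test_split` preserves class proportions in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (80 'Low', 20 'High'); not project data
toy = pd.DataFrame({
    "feature": range(100),
    "Risk_Rating": ["Low"] * 80 + ["High"] * 20,
})
X = toy.drop(columns=["Risk_Rating"])
y = toy["Risk_Rating"]

# stratify=y keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_te.value_counts())
```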

#### **5** Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf.fit(X_train, y_train)  # Features were one-hot encoded in step 2

# Make predictions
y_pred = clf.predict(X_test)

#### **6** Analysis of Results

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")

# Confusion Matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Classification Report (Precision, Recall, F1-Score)
print("Classification Report:\n", classification_report(y_test, y_pred, digits=3))
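
To make the metrics above easier to read, the sketch below derives accuracy, precision, recall, and F1 by hand from a confusion matrix. The matrix values are made up for illustration. The layout follows scikit-learn's convention of rows for true labels and columns for predicted labels.

```python
import numpy as np

# Made-up 2-class confusion matrix: rows = true labels, columns = predicted labels
cm = np.array([[5, 1],
               [2, 4]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / truly that class
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.3f}")
print("Precision:", precision.round(3))
print("Recall:", recall.round(3))
print("F1:", f1.round(3))
```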

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix with an explicit label order
classes = clf.classes_  # Class labels seen during training
cm = confusion_matrix(y_test, y_pred, labels=classes)

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=classes, yticklabels=classes)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Get feature importances (same column order the classifier was trained on)
feature_importances = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': clf.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10, min(12, 0.5 * len(feature_importances))))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Feature Name')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.gca().invert_yaxis()  # Highest at top
plt.show()

#### **7** Conclusions
One issue with predicting risk ratings is that the original class breaks are unequal in width:
- 20 <= 'Low' <= 240 (range=220)
- 241 <= 'Moderate' <= 305 (range=64)
- 306 <= 'High' <= 435 (range=129)
- 436 <= 'Very High' <= 505 (range=69)
- 506 <= 'Extreme' <= 1000 (range=494)

Therefore, we reclassified risk ratings using equal-interval breaks, each spanning about 200 points.
- 0 <= 'Low' <= 200
- 201 <= 'Moderate' <= 400
- 401 <= 'High' <= 600
- 601 <= 'Very High' <= 800
- 801 <= 'Extreme' <= 1000

The cell below prints the number of observations per risk rating (original and reclassified).

In [None]:
# Print number of observations per risk rating
print(df['Risk_Rating'].value_counts())
print(df['Risk_Rating_new'].value_counts())
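
The equal-interval reclassification described in the Conclusions could be reproduced with `pandas.cut`. This is a minimal sketch under assumptions: the numeric score column name `Risk_Score` and the sample scores are hypothetical, and the actual reclassification was done in an upstream notebook that is not shown here.

```python
import pandas as pd

# Hypothetical numeric scores; 'Risk_Score' is an assumed column name
scores = pd.DataFrame({"Risk_Score": [0, 200, 201, 400, 401, 600, 601, 800, 801, 1000]})

bins = [0, 200, 400, 600, 800, 1000]
labels = ["Low", "Moderate", "High", "Very High", "Extreme"]

# Right-closed bins; include_lowest=True puts a score of 0 in 'Low'
scores["Risk_Rating_new"] = pd.cut(
    scores["Risk_Score"], bins=bins, labels=labels, include_lowest=True
)
print(scores["Risk_Rating_new"].value_counts(sort=False))
```

With right-closed bins, a score of exactly 200 falls in `Low` and 201 falls in `Moderate`, matching the breaks listed above.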