In [None]:
import holoviews as hv
import geoviews as gv
import geopandas as gpd
import geoviews.feature as gf
import cartopy
import cartopy.feature as cf

from geoviews import opts
from cartopy import crs as ccrs

gv.extension("matplotlib", "bokeh")
gv.output(dpi=120, fig="svg")
hv.output(backend="bokeh")

ss = gpd.read_file("/home/jovyan/carbonplan/shapes/Supersections/")

ss.geometry = ss.simplify(tolerance=5000)  # straightens out < 1km wiggles

states = gpd.read_file(
    "https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_1_states_provinces.zip"
)
ca = states.loc[states.name == "California"]
ca_ecomap = gpd.overlay(ss.to_crs(ca.crs), ca, how="intersection")
ca_ecomap["highlight"] = ca_ecomap["SSection"] == "Southern Cascades"

In [None]:
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import (
    classification_report,
    precision_recall_fscore_support,
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

sys.path.append("/home/jovyan/carbonplan/retrospective")
from retrospective.load.fia import load_fia_tree

[Previously]() we have extensively discussed how ARB calculates "common practice." As a brief
refresher, common practice is derived from the US Forest Service's forest plot network and is meant
to describe the "average carbon" stored in all forested ecosystems across the continental United
States (and coastal Alaska).

To define common practice, ARB first breaks CONUS into smaller "supersections." Then, within those
supersections, ARB defines a series of "assessment areas" or forest types. ARB then calculates
average "standing live aboveground carbon" from all FIA plots that fit those two screening
criteria (geographic boundary, forest type). Most of the `retro` project ultimately boils down to
evaluating the validity of how ARB (with the help of CAR) chose to define these geographic and
forest type aggregations.

This notebook helps motivate and introduce a series of analyses that will allow us to explore how
the second criterion -- aggregating by species -- affects common practice.

# The Problem

Assessment areas are sometimes a conglomeration of species that really aren't all that similar. The
`Southern Cascades` supersection, shown in green below, provides a pretty powerful example of this
problem.


In [None]:
%%output backend='matplotlib', fig='svg'
gv.Polygons(ca_ecomap, vdims=["highlight"]).opts(cmap="Dark2_r")

We'll focus our discussion on the `Mixed Conifer` assessment area within the supersection. Notice
how the supersection (the green thing above) spans from California's coast all the way to the
northern Sierra foothills. Needless to say, this region includes an impressive array of forest
types. The west includes lots and lots of big, productive Douglas fir forests. They look like this:
![dougie](https://www.researchgate.net/profile/Aaron_Weiskittel/publication/283385356/figure/fig1/AS:291766009511936@1446573845818/Old-growth-stand-of-Douglas-fir-in.png)

As you move to the east, things dry out and you get forests that look totally different. On the
eastern edge of the supersection, you have lots of Ponderosa pine forests. They look like this:
![pipo](https://assets.bwbx.io/images/users/iqjWHBFdfxIU/ivcyZMldwqn8/v0/360x-1.jpg)

The visual differences are profound. More quantitatively, those differences translate into wildly
different average carbon stocks across these two forest types. Using FIA data for California, we can
precisely calculate this difference. Across all of California, Douglas fir plots have an average
above-ground carbon stock of 129 t CO2e acre-1, while Ponderosa Pine have an average of 49 t CO2e
acre-1.


In [None]:
from retrospective.load.fia import fia

subset = pd.concat([fia(postal_code, kind="long") for postal_code in ["ca", "or"]])

subset[(subset["owner"] == 46) & (subset["year"] > 2001)].groupby("field_type").slag_co2e_acre.agg(
    ["mean", "count"]
).loc[[201, 221]].rename({201: "Douglas Fir", 221: "Ponderosa"})

This long-winded introduction gets us to the problem: for the `Southern Cascades` supersection, ARB
averages these dissimilar forests and reports a single common practice. That average, in turn, lies
somewhere above what ponderosa pine forests _typically_ look like and somewhere below what doug fir
forests _typically_ look like. This becomes important because projects are awarded credits for the
amount of carbon within the project area that is "above common practice."
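
To make the averaging effect concrete, here is a minimal sketch using the two averages quoted above. The 50/50 split between forest types is purely an assumption for illustration; the real common practice value depends on the actual mix of FIA plots in the assessment area.

```python
# Average carbon stocks quoted above (t CO2e per acre).
douglas_fir = 129.0
ponderosa = 49.0

# ASSUMPTION: the assessment area is an even mix of the two forest types.
share_douglas_fir = 0.5

blended_common_practice = (
    share_douglas_fir * douglas_fir + (1 - share_douglas_fir) * ponderosa
)

# How far a typical stand of each type sits from the blended value.
douglas_fir_gap = douglas_fir - blended_common_practice
ponderosa_gap = ponderosa - blended_common_practice

print(blended_common_practice, douglas_fir_gap, ponderosa_gap)
```

Under that assumed mix, a perfectly ordinary Douglas fir stand sits well above the blended value, and a perfectly ordinary Ponderosa stand sits equally far below it.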

We can be even more explicit and extend our simple Ponderosa/Dougie example one step further. Based
on the above reasoning, we would expect to see projects in the `Southern Cascades Mixed Conifer`
assessment area that have lots and lots of Douglas fir. This would allow project developers to
maximize the difference between their `initial carbon stock` and `common practice`. And this is
exactly the behavior we see. If you open up the
[project viewer](https://retro.staging.carbonplan.org/browser), you'll see that there is a strong
clustering of projects along the western edge of the `Southern Cascades` supersection.


## Our Objective

We want to evaluate what would happen if we compared projects against similar looking forests and
not against the inaccurate, giant basket of potentially dissimilar forests.

To accomplish this, we must first figure out an objective way to classify each project's forest
type. While projects are required to report information about the species present in each project
assessment area, they're under no obligation to report details on this semi-fuzzy concept of "forest
type."

# Data

## Offset project species composition

For each project\* we have information about species composition on a _per assessment area_ basis.
Specifically, for each assessment area we have information about: species, total basal area, and
basal area as a fraction of all basal area within that assessment area.

These data are stored in a json object with the following schema:

```json
{<assessment_area_code>: [{"code": <fia_species_code>, "basal_area": float, "fraction": float}, ...]}
```
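
As a rough sketch of how a record with this schema could be flattened into per-species fractions, consider the snippet below. The record itself is hypothetical: the assessment area code and all values are made up for illustration and do not come from any real project.

```python
import json

# HYPOTHETICAL record: the assessment area code and all values are made up.
raw = """
{
  "297": [
    {"code": 202, "basal_area": 120.0, "fraction": 0.6},
    {"code": 122, "basal_area": 80.0, "fraction": 0.4}
  ]
}
"""

composition = json.loads(raw)

# Flatten each assessment area into a {species_code: fraction} mapping.
# Species codes are cast to str so they line up with string feature names.
fractions_by_area = {
    area: {str(entry["code"]): entry["fraction"] for entry in entries}
    for area, entries in composition.items()
}

print(fractions_by_area)
```

The resulting `{species_code: fraction}` mapping has the same shape as the per-condition features built from FIA later in this notebook.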

## FIA Trees

The FIAdb contains information at the scale of individual trees.

We're interested in the following attributes:

- SPCD: a unique code that identifies the species of the tree
- DIA: the tree's diameter at breast height
- TPA_UNADJ: trees per acre (unadjusted), an expansion factor that allows us to scale from a single
  tree to a "per area" estimate of basal area
- STATUSCD: Flag identifying if the tree is alive (1), dead (2), or removed (3). We're only
  interested in live trees.

For each stem, we calculate `unadjusted basal area` using the following equation:

$$
\text{unadjusted\_basal\_area} = \pi \left(\frac{\text{DIA}}{2}\right)^{2} \cdot \text{TPA\_UNADJ}
$$
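
The calculation can be sketched in `pandas` as follows. The tree records are hypothetical, and the output column name `unadj_basal_area` is an assumption chosen to match the column used further down in this notebook.

```python
import math

import pandas as pd

# HYPOTHETICAL trees: values are made up; column names follow FIAdb.
trees = pd.DataFrame(
    {
        "SPCD": [202, 202, 122, 122],
        "DIA": [20.0, 10.0, 12.0, 8.0],
        "TPA_UNADJ": [6.0, 6.0, 6.0, 6.0],
        "STATUSCD": [1, 1, 1, 2],
    }
)

# Keep live trees only (STATUSCD == 1).
live = trees.loc[trees["STATUSCD"] == 1].copy()

# Cross-sectional area of each stem, scaled to a per-area basis by TPA_UNADJ.
live["unadj_basal_area"] = math.pi * (live["DIA"] / 2) ** 2 * live["TPA_UNADJ"]

print(live[["SPCD", "unadj_basal_area"]])
```

Because the features used later are fractions of total basal area, the units of `DIA` cancel out and no unit conversion is needed for that purpose.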

## FIA Plots (Conditions)

FIAdb contains details about "plot condition" -- we don't need to get into the details of how
conditions differ from plots (it's pretty esoteric, but important when you're planning to measure
100,000s of plots...). For our purposes, we only care about two details:

- every tree in the TREE database maps to a single, unique condition
- each condition is assigned a forest type code

In fact, conditions are assigned _two_ "type codes": a `FORTYPCD`, which is assigned by a computer
algorithm (woof -- we _do not_ trust this) and a `FLDTYPCD` ("field type code"), which is assigned
by a trained forester based on the "balance of evidence" they observe while physically visiting the
plot.

# Model Structure

We are going to train some flavor of a classification model. It might even be an _ensemble_
classifier (!). The exact algorithm isn't really that important. All models, however, will share the
same training data -- features and targets -- which I will describe here.

## Targets

The target variable is forest type, either `FORTYPCD` or `FLDTYPCD`.

## Features

The feature set will consist of per-species estimates of `fractional basal area`, aggregated to the
condition level.

\*: species information remains incomplete for many projects.


# Where are we already?

Okay so we can load data! I made a little function (`load_fia_tree` that lives inside
`retrospective/load/fia.py`) which handles the data loading. For this tutorial, we'll just load a
single state. However, when training the model, it's imperative that we train a single model on
_all_ data. This is because we do not want to make artificial geographic cutoffs about where certain
`FORTYPCD`s occur. Sure, there are zero doug fir forests on the East Coast, but I'm pretty sure that
sitting down to make an exhaustive, biologically grounded list of which regions to include, on a per
FORTYPCD basis, would (i) take forever and (ii) be wrong. Let's skip the hassle and just chuck it
all in memory.


In [None]:
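# it's a demo, don't judge: cache the tree table locally so reruns are fast
try:
    tree_df = pd.read_csv("/home/jovyan/lost+found/ca_tree.csv")
except FileNotFoundError:
    # no cached copy yet, so load from source and write the cache
    tree_df = load_fia_tree("ca")
    tree_df.to_csv("/home/jovyan/lost+found/ca_tree.csv")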

And we also have a really simple function that calculates fractional basal area on a species basis.
This is directly comparable to the information we have about species from the project db.


In [None]:
def fractional_basal_area_by_species(data):
    """For a group of trees, calculate the fraction of total basal area represented by each species."""
    # cast to str so can store sparsely :)
    weights = (
        data.groupby(data["SPCD"].astype(str)).unadj_basal_area.sum() / data.unadj_basal_area.sum()
    ).round(4)
    weights = weights.to_dict()
    return weights

We then generate our feature set and our target variables.


In [None]:
features = tree_df.groupby(["PLT_CN", "CONDID"]).apply(fractional_basal_area_by_species)
targets = tree_df.groupby(["PLT_CN", "CONDID"])[["FORTYPCD", "FLDTYPCD"]].max()

Features is now a big series where the index is a unique plot-condition and the series contains a
bunch of dicts with the following schema:

```json
{<species_code>: <fractional_basal_area>, ...}
```

They look like this:


In [None]:
[x for _, x in features.sample(10).items()]

### A quick note on FLDTYPCDs

`FLDTYPCD` can be null. I don't know why this is the case -- but it happens. We probably need to
remove or otherwise handle those before we train. That said, I'm pretty sure we trust `FLDTYPCD`
more than `FORTYPCD` so...who knows.

Oh! There is also a special forest type code: 999. 999 represents an "unstocked" condition -- meaning
there are just no trees there anymore, either due to harvest or some sort of mortality event (e.g.,
buggies). I remove those data below (filtering on `FLDTYPCD`, since that is the target we use). But
I'll just write this down here -- ARB actually included these unstocked plots in their estimates of
common practice! Whoops.


In [None]:
# This join is probably not strictly necessary, since features and targets should share the same
# ordering -- but yeah...I'm paranoid, and joining on the index is the safe option.
full = targets.join(features.rename("feature_lst"))
full = full.loc[(full["FLDTYPCD"] != 999)]  # drop unstocked conditions
full = full.dropna(subset=["FLDTYPCD"])  # can't have NaN in the target. careful here.

We can then use `DictVectorizer` to make a nice `sklearn` ready feature array. I'll explode the array
to its dense format with `toarray()` -- but certain algorithms can actually train on the sparse
representation. I realize we can just get a big enough machine -- but as someone who once hand rolled
a sparse matrix representation (I was young and foolish) -- I find the fact that you can get
sparseness this easily just super fascinating.
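
For a feel of what `DictVectorizer` does with these dicts, here is a small sketch on hypothetical features (the species codes and fractions are made up). It shows that each distinct key becomes a column and that species missing from a condition are simply stored as implicit zeros.

```python
from sklearn.feature_extraction import DictVectorizer

# HYPOTHETICAL per-condition features: {species_code: fractional_basal_area}.
toy_features = [
    {"202": 0.7, "122": 0.3},
    {"122": 1.0},
    {"81": 0.5, "202": 0.5},
]

toy_vec = DictVectorizer()  # returns a scipy sparse matrix by default
toy_X = toy_vec.fit_transform(toy_features)

# One column per species code seen during fitting (sorted as strings).
print(toy_vec.feature_names_)

# Only the non-zero entries are stored until we ask for a dense array.
print(toy_X.nnz, "stored values in a", toy_X.shape, "matrix")
print(toy_X.toarray())
```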


In [None]:
vec = DictVectorizer()
X = vec.fit_transform(
    full["feature_lst"].values
).toarray()  # .toarray() explodes the sparse array returned from DictVectorizer() out into a dense array
y = full["FLDTYPCD"]

Split our dataset into train and testing:


In [None]:
# train/test subset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

Define some models and train them!


In [None]:
# dict of classifiers we can just train, doing nothing fancy at all!
clfs = {
    "naive_bayes": MultinomialNB(),
    "linear_svm": SVC(kernel="linear", C=0.025),
    "random_forest": RandomForestClassifier(),
}

for name, clf in clfs.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    score = clf.score(X_test, y_test)  # mean accuracy on the held-out set
    print(
        f"{name} mislabeled {(y_test != y_pred).sum()} of {X_test.shape[0]} points "
        f"(accuracy: {score:.3f})."
    )

## What I think is the right next step

As you can see from above -- this thing is fairly decent out of the box. While we'd love to have
good accuracy, we're actually fine if the model returns a range of possible classifications -- we
can take the uncertainty in the classification and propagate that through to our final estimate of
common practice using brute force sampling. That said, I think that some simple hyperparameter tuning
will go a long way. Given that RF is already so good, maybe the next step is to just fiddle around
with RF parameters (e.g., grow more trees, change splitting criteria, etc.):

```python
rf_clf = RandomForestClassifier(n_estimators=1500)
mod = rf_clf.fit(X_train, y_train)
y_preds = mod.predict(X_test)
```

My suspicion is that after some parameter tuning, we should just grow a massive RF classifier using
dask (that should be fun!) and call it a day.
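
The brute force sampling idea mentioned above could look something like the sketch below. Everything in it is an assumption for illustration: the class probabilities stand in for `.predict_proba()` output for a single project, and the per-type common practice values simply reuse the two averages quoted earlier in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# HYPOTHETICAL classifier output for one project: probability of each forest type.
type_codes = np.array([201, 221])
type_probs = np.array([0.8, 0.2])

# HYPOTHETICAL per-type common practice (t CO2e per acre).
common_practice = {201: 129.0, 221: 49.0}

# Draw forest types in proportion to their predicted probability, then look up
# the common practice value that goes with each draw.
draws = rng.choice(type_codes, size=10_000, p=type_probs)
sampled_cp = np.array([common_practice[int(code)] for code in draws])

print(sampled_cp.mean(), np.percentile(sampled_cp, [5, 95]))
```

A fuller version would presumably also resample the plots within each forest type, so that the spread reflects both classification uncertainty and plot-to-plot variability.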

We also probably want to spend some time thinking about `FLDTYPCD`s we don't really care about -- I doubt
the answer is to exclude data. At the same time, some `FLDTYPCD`s are going to be super rare -- rare
to the point where building a training/validation dataset is just going to be plain tough. Obviously
model performance will be driven by the more common types, but I don't want us spending cycles
chasing performance on types that don't matter...
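
A cheap first step would be to simply count how many conditions fall into each type and see how much of the data the rare ones represent. The sketch below uses hypothetical labels in place of `y`, and the cutoff for "rare" is an arbitrary assumption.

```python
from collections import Counter

# HYPOTHETICAL labels standing in for `y`; codes and counts are made up.
labels = [201] * 50 + [221] * 30 + [371] * 3 + [941] * 1

# ASSUMPTION: an arbitrary cutoff for what counts as "rare".
min_count = 5

counts = Counter(labels)
rare_types = sorted(code for code, n in counts.items() if n < min_count)
rare_share = sum(counts[code] for code in rare_types) / len(labels)

print(rare_types, round(rare_share, 3))
```

This only flags rare types; what to do with them (keep, lump together, or ignore when scoring) is still an open question.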

OH! And it's _really_ important that we
[calibrate the classification probabilities](https://scikit-learn.org/stable/modules/calibration.html)
from our model. If we're going to allow fuzzy matching (which we should!), we need to make sure that
the model-reported probabilities are "real" and do not contain artifacts that arise from the
underlying classification model (e.g., how RF's bagging influences the values produced by
`.predict_proba()`).
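
One way this might be done is with scikit-learn's `CalibratedClassifierCV`, sketched below on synthetic data. The synthetic dataset, the `method="sigmoid"` choice, and `cv=3` are all assumptions for illustration, not settings we have evaluated on the FIA data.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and forest type labels.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

# Wrap the forest so its probabilities are recalibrated using internal
# cross-validation. ASSUMPTION: method="sigmoid" and cv=3 are arbitrary picks.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    method="sigmoid",
    cv=3,
)
calibrated.fit(X_tr, y_tr)

# Calibrated class probabilities, one row per test sample.
proba = calibrated.predict_proba(X_te)
print(proba.shape)
```

Whether sigmoid or isotonic calibration works better here would need to be checked, for example with a reliability diagram on held-out data.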


## Other sklearn magic I discovered...


print(classification_report(y_test, y_preds))


precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred)
plt.scatter(recall, support)
