In [None]:
!pip install gspread oauth2client

In [None]:
%load_ext autoreload
%autoreload 2
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
import statsmodels.formula.api as smf

from carbonplan_retro.data import cat
from carbonplan_retro.load.project_db import load_project_db
from carbonplan_retro.analysis.common_practice import get_arbocs_from_ifm1
from carbonplan_retro.load.issuance import load_issuance_table

In [None]:
import seaborn as sns

sns.set(font_scale=1.5)
sns.set_style("white")

In [None]:
def use_for_arbocs(new_ifm1, project_db, name="new_allocation", scale_components=False):
    """recalculate arbocs as a function of predicted ifm1
    only ifm1 changes -- all other components are scaled to be proportional to observed <component>/ifm_1
    scale_components (currently unused): experimented with whether other bits of the baseline needed to move with changes in IFM-1 -- do
    """

    alt_baseline_carbon = new_ifm1 + project_db["baseline"]["components"]["ifm_3"]

    onsite_carbon = (
        project_db["rp_1"]["components"]["ifm_1"] + project_db["rp_1"]["components"]["ifm_3"]
    )

    adjusted_onsite = onsite_carbon * (1 - project_db["rp_1"]["confidence_deduction"]).astype(
        float
    ).round(5)

    delta_onsite = adjusted_onsite - alt_baseline_carbon

    baseline_wood_products = (
        project_db["baseline"]["components"]["ifm_7"]
        + project_db["baseline"]["components"]["ifm_8"]
    )

    actual_wood_products = (
        project_db["rp_1"]["components"]["ifm_7"] + project_db["rp_1"]["components"]["ifm_8"]
    )

    leakage_adjusted_delta_wood_products = (actual_wood_products - baseline_wood_products) * 0.8

    # Never allowed to have positive SE. clip() returns a new Series, so project_db is not modified in place.
    secondary_effects = project_db["rp_1"]["secondary_effects"].clip(upper=0)

    calculated_allocation = delta_onsite + leakage_adjusted_delta_wood_products + secondary_effects
    return calculated_allocation.rename(name)

# Overview

The goal of this notebook is to explore how common practice (CP) relates to ARBOCs. One of the main
goals of this _Retrospective_ is to understand how changes in CP translate into changes in ARBOCs.
Therefore, we need to work through the math that links CP to Baseline Components (specifically,
IFM-1) and, ultimately, ARBOCs.

We should start with the actual equation for calculating ARBOCs. If I'd started here myself, I'd
have saved myself a fair amount of grief...but hey, that's why we keep notes and do research!

ARBOCs are calculated via the following equation: \begin{align} ARBOC_{RP} & = (\Delta
Project_{onsite} - \Delta Baseline_{onsite}) + 0.8(Project_{woodProducts, RP} -
Baseline_{woodProducts, RP}) + SE_{RP}, \end{align} where SE stands for secondary effects due to
harvesting in the project scenario and RP denotes the reporting period. Furthermore, the onsite
carbon pool comprises IFM-1 (standing live) and IFM-3 (standing dead), while the wood products pools
are represented by in-use (IFM-7) and landfilled (IFM-8) wood products. The calculations for SE are
a little more complicated, so we just end up using the value directly reported in each annual OPDR.

Our analysis only concerns the first reporting period, which simplifies things a little bit. Because
we're in the first reporting period, there is no "change" from previous reporting periods, so both
$\Delta$ terms for onsite carbon pools (IFM-1 \& IFM-3) simplify to their initial values:
\begin{align} ARBOC_{RP1} & = (Project_{onsite, RP1} - Baseline_{onsite, RP1}) +
0.8(Project_{woodProducts, RP1} - Baseline_{woodProducts, RP1}) + SE_{RP1} \end{align}
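
As a worked illustration of the first-reporting-period equation above, here is a minimal sketch that computes RP1 ARBOCs from the five inputs. The function name `arbocs_rp1` and every number are hypothetical (not taken from any project), and the onsite inputs are assumed to already include the confidence deduction. Clamping SE at zero mirrors the `use_for_arbocs` helper defined at the top of this notebook.

```python
LEAKAGE_FACTOR = 0.8  # multiplier on the wood-products term in the ARBOC equation


def arbocs_rp1(project_onsite, baseline_onsite, project_wp, baseline_wp, secondary_effects):
    """First-reporting-period ARBOCs (hypothetical helper, for illustration only).

    Onsite terms are IFM-1 + IFM-3; wood-products terms are IFM-7 + IFM-8.
    """
    se = min(secondary_effects, 0.0)  # positive secondary effects are never credited
    delta_onsite = project_onsite - baseline_onsite
    delta_wood_products = LEAKAGE_FACTOR * (project_wp - baseline_wp)
    return delta_onsite + delta_wood_products + se


# Made-up values, in tCO2e.
example = arbocs_rp1(
    project_onsite=1_000_000,
    baseline_onsite=700_000,
    project_wp=5_000,
    baseline_wp=30_000,
    secondary_effects=-10_000,
)
print(f"ARBOCs in RP1: {example:,.0f}")
```

With these inputs the onsite term contributes 300,000, while the wood-products and SE terms both pull the total down.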

We can take this equation one step further by splitting out `initial` carbon from `project` or
`growth` carbon. \begin{align} ARBOC_{RP1} & = ARBOC_{initial} + ARBOC_{growth} \\ ARBOC_{growth} &
= (Project_{onsite, End} - Project_{onsite, Start}) + 0.8(Project_{woodProducts, RP1} -
Baseline_{woodProducts, RP1}) + SE_{RP1}\\ ARBOC_{initial} & = Project_{onsite, Start} -
Baseline_{onsite} \end{align}

Eq. 3 introduces a few new bits of terminology.

From the issuance table we get ARBOCs, which we'll denote as $Issuance_{RP1}$ to emphasize its
provenance from an official ARB source, so: [not clear if this helps or confuses] \begin{align}
ARBOC_{RP1} & = Issuance_{RP1} \\ ARBOC_{initial} & = Issuance_{RP1} - ARBOC_{growth} \end{align}
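
To make the rearrangement concrete, the sketch below backs out initial carbon from an issuance figure by subtracting the growth term. Both helper names and all numbers are hypothetical. It assumes we somehow know $Project_{onsite, Start}$, which is exactly the term that is hard to get in practice.

```python
def arboc_growth(onsite_end, onsite_start, project_wp, baseline_wp, secondary_effects):
    """Growth component: onsite increment + leakage-adjusted wood products + SE."""
    return (onsite_end - onsite_start) + 0.8 * (project_wp - baseline_wp) + secondary_effects


def arboc_initial(issuance_rp1, growth):
    """Initial component: whatever is left of RP1 issuance once growth is removed."""
    return issuance_rp1 - growth


# Made-up values, in tCO2e.
growth = arboc_growth(
    onsite_end=1_000_000,
    onsite_start=980_000,
    project_wp=5_000,
    baseline_wp=30_000,
    secondary_effects=-10_000,
)
initial = arboc_initial(issuance_rp1=270_000, growth=growth)
print(f"growth: {growth:,.0f}, initial: {initial:,.0f}")
```

Note that the growth term comes out negative in this toy case, so the initial component is larger than the issuance itself.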

Meaning that if we know $Project_{onsite, End}$, $Project_{onsite, Start}$,
$Project_{woodProducts, RP1}$, $Baseline_{woodProducts, RP1}$, and $SE_{RP1}$, we can calculate
$ARBOC_{growth}$ (Eq. 2b) and, from it, $ARBOC_{initial}$. Thankfully, we get all the terms but
$Project_{onsite, Start}$ from the annual OPDR. Some projects' OPDRs actually report
$Project_{onsite, Start}$ separately, but reporting is mostly inconsistent and we did not end up
recording it. Looking forward, it should also be possible to figure out the "growth" component of
IFM-1 and IFM-3 by looking across multiple reporting periods. If you just take the average increment
in IFM-1 and IFM-3 over a few reporting periods and subtract it from IFM-1 and IFM-3 for RP_1 --
that would be a fairly good estimate of the onsite carbon at the project outset. Unfortunately, this
approach doesn't work either -- because we only entered values for RP_1! If we did have that data
(and we might be able to get it from Barbara!), we could simply do:

\begin{align} \Delta Project_{onsite} & = Project_{onsite, End} - Project_{onsite, Start} \approx
\overline{\Delta Project_{Onsite}} \end{align}

where $\overline{\Delta Project_{Onsite}}$ is the mean increment in IFM-1 and IFM-3. (N.B., such an
approach would need to make sure to account for varying lengths in reporting periods.) [This math is
a little clunky right now but still helpful in thinking things through. It's likely most helpful to
rewrite in terms of RP0]

Recall, we're just interested in how changes in common practice affect calculations of "upfront"
carbon. In theory, this should just be the difference between project onsite carbon and baseline
onsite carbon, **at the start of the first reporting period**. However, the project values for IFM-1
and IFM-3 are reported at the _end of the reporting period_, meaning the reported values represent
both the initial standing carbon and the "growth" carbon through the end of the first reporting
period. Now, there are several ways to get around this problem.
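
The mean-increment idea described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `estimate_onsite_start`, the input layout (one onsite value per reporting period, plus each period's length in years), and the numbers are all hypothetical. It handles varying reporting-period lengths by annualizing each increment before averaging, which is one of several reasonable choices.

```python
import numpy as np


def estimate_onsite_start(onsite_by_rp, rp_years):
    """Estimate onsite carbon (IFM-1 + IFM-3) at the project outset.

    onsite_by_rp: onsite carbon reported at the END of each reporting period, in order.
    rp_years: length of each reporting period, in years.
    """
    onsite = np.asarray(onsite_by_rp, dtype=float)
    years = np.asarray(rp_years, dtype=float)
    if onsite.size < 2:
        raise ValueError("need at least two reporting periods to estimate an increment")

    # Annualize the change between consecutive reporting periods.
    annual_increment = np.diff(onsite) / years[1:]
    mean_annual_increment = annual_increment.mean()

    # Remove the growth assumed to have accrued during the first reporting period.
    return onsite[0] - mean_annual_increment * years[0]


# Made-up project: three reporting periods, the last one 18 months long.
start = estimate_onsite_start([1_000_000, 1_020_000, 1_050_000], [1.0, 1.0, 1.5])
print(f"estimated onsite carbon at project start: {start:,.0f} tCO2e")
```

The key assumption is that growth during RP1 looks like average growth in later periods. A project that was harvested or disturbed early on would break it.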

For the current iteration, however, we're going to have to estimate $ARBOC_{Upfront}$ from the
issuance table. This estimate requires a semi-strong assumption that the growth in IFM-1 and IFM-3,
the leakage term (based on wood products), and secondary effects (SE) do not change across reporting
periods. If we make that assumption, it should be the case that: \begin{align} ARBOC_{Upfront} & =
Issuance_{RP1} - Issuance_{RP2} \end{align}
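
A toy version of this subtraction, as a sketch: the column names (`opr_id`, `arb_rp_id`, `allocation`) follow the issuance table used later in this notebook, and `'A'` marks the first reporting period there. Treating `'B'` as the second reporting period is an assumption, and the project IDs and numbers are made up.

```python
import pandas as pd

# Hypothetical issuance records. P1's first reporting period was issued in two batches.
issuance = pd.DataFrame(
    {
        "opr_id": ["P1", "P1", "P1", "P2", "P2"],
        "arb_rp_id": ["A", "A", "B", "A", "B"],
        "allocation": [200_000, 70_000, 12_000, 90_000, 8_000],
    }
)

# Sum batches within each project/reporting period, then pivot RPs into columns.
by_rp = issuance.groupby(["opr_id", "arb_rp_id"])["allocation"].sum().unstack("arb_rp_id")

# Upfront ARBOCs: first-period issuance minus second-period issuance.
upfront = (by_rp["A"] - by_rp["B"]).rename("arboc_upfront")
print(upfront)
```

Summing within each reporting period first matters because credits for a single period are not always issued all at once.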

Finally, since we're assuming that $Project_{Onsite, RP1}$ is constant (i.e., it doesn't change in
response to changes in the baseline), any change in onsite baseline carbon caused by a change in CP
leads directly to a change in $ARBOC_{Upfront}$. The remainder of this notebook explores the logic
laid out above and concludes with a general approach for translating from CP to ARBOCs.


## Exploring the logic from the overview

### Load a bunch of data


In [None]:
retro_db = cat.retro_db_light.read()

project_db = load_project_db("/home/jovyan/lost+found/Forest-Offset-Projects-v0.3.json")
project_db = project_db[~project_db["project"]["early_action"].str.startswith("CAR")]
project_db = project_db[
    project_db["baseline"]["initial_carbon_stock"] > project_db["baseline"]["common_practice"]
]  # our analysis only pertains to upfront carbon

issuance_table = load_issuance_table()

In [None]:
exclude_imputed = False
if exclude_imputed:
    project_db = project_db[
        ~project_db["baseline"]["average_slag_comment"].str.startswith("visual")
    ]
    project_db = project_db[~project_db["baseline"]["average_slag_comment"].str.startswith("digit")]

In [None]:
data = pd.concat(
    [
        project_db["rp_1"]["allocation"],
        project_db["baseline"]["components"]["ifm_1"],
        project_db["baseline"]["components"]["ifm_3"],
        project_db["baseline"]["average_slag_baseline"],
        project_db["baseline"]["common_practice"],
        retro_db.set_index("opr_id")["arbocs_calculated"],
        project_db["baseline"]["initial_carbon_stock"],
        project_db["project"]["acreage"],
    ],
    axis=1,
)

data["baseline_onsite"] = data["ifm_1"] + data["ifm_3"]
data["project_onsite"] = (
    project_db["rp_1"]["components"]["ifm_1"] + project_db["rp_1"]["components"]["ifm_3"]
) * (1 - project_db["rp_1"]["confidence_deduction"])
data["delta_onsite"] = data["project_onsite"] - data["baseline_onsite"]

""" MANUALLY ADJUST VALUES 
It's worth noting that the current notebook has a few manually adjusted values. 
These represent changes I've made to the underlying project database (due to inaccurate data entry on my part).
However, because jupyter notebooks cannot read from the project database, I just manually adjust those values here, for now. 
A future iteration of this notebook will remove those lines (and this comment!).
"""
# data.loc[data.index == 'ACR279', 'average_slag_baseline'] = 87.1


# special casing this bad-boy
# CAR973 has lots of different reported acreages.
exclude_problematic_cases = True
if exclude_problematic_cases:
    # CAR973 has a plethora of reported acreages.
    # it seems that the acreage reported in the listing supplement, less non-forested acres
    # would most closely conform with the approach outlined here.
    data.loc[data.index == "CAR973", "acreage"] = np.nan  # 223431-10397
    # harvest problem, IDed to CARB
    data.loc[data.index == "ACR247", "acreage"] = np.nan  # 223431-10397

## Estimate ARBOC_initial

Spilled a fair amount of ink in the Overview, but the upshot is that if we rearrange Eq. 3a, we can
estimate $ARBOC_{initial}$ as: \begin{align} ARBOC_{initial} & = ARBOC_{RP1} - ARBOC_{growth} \\
ARBOC_{growth} & = (Project_{onsite, End} - Project_{onsite, Start}) +
0.8(Project_{woodProducts, RP1} - Baseline_{woodProducts, RP1}) + SE_{RP1}\\ ARBOC_{initial} & =
Project_{onsite, Start} - Baseline_{onsite} \end{align} In other words: start from the allocation,
remove the leakage-adjusted wood products term and the project secondary effects, and the remainder
is initial carbon. Somewhat counter-intuitively, removing these terms raises the estimate above the
allocation, because the leakage and secondary effects terms are negative.


In [None]:
baseline_wp = (
    project_db["baseline"]["components"]["ifm_7"] + project_db["baseline"]["components"]["ifm_8"]
)

actual_wood_products = (
    project_db["rp_1"]["components"]["ifm_7"] + project_db["rp_1"]["components"]["ifm_8"]
)

leakage_adjusted_delta_wood_products = (actual_wood_products - baseline_wp) * 0.8

data["upfront"] = (
    data["allocation"]
    - project_db["rp_1"]["secondary_effects"]
    - leakage_adjusted_delta_wood_products
)

In [None]:
data = data.dropna(subset=["delta_onsite", "upfront"])

In [None]:
mean_squared_error(data["delta_onsite"], data["upfront"]) ** 0.5

In [None]:
g = sns.FacetGrid(data=data / 1_000_000, height=3.5)
g.map(plt.scatter, "delta_onsite", "upfront", s=125, color=".3", alpha=0.55)
g.axes[0][0].plot((0, 15), (0, 15), ls="--", lw=3, c="r")

## Modeling IFM-1 from common practice

So now we need a way to calculate an alternate `delta_onsite`. After a little bit of sleuthing, it
turns out that `average_slag_baseline` (recorded during the data entry stage of the retrospective)
represents the actual constraint on baseline modeling. In many cases, `average_slag_baseline` equals
`common_practice`, but not always! The natural starting point is SLAG \* acreage. But IFM-1 includes
both above- and belowground carbon, and that split isn't constant across species, so we run
SLAG \* acreage through a linear model. Options considered:

- Classic OLS:
- Log-transform regression: ,
- Forced through Origin:

Ultimately, I settled on a simple linear regression, with an intercept term. The intercept
represents the fact that ratio ... It also accounts for uncertainties in both data entry into the
project db and potential errors in the reporting of acreage and/or average baseline SLAG in the
project documents (e.g., CAR1180 doesn't have an Initial OPDR).
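
The three candidate fits listed above can be compared with a sketch like the one below. It uses plain NumPy rather than the `statsmodels` formulas in the next cells, so that each estimator is spelled out. The helper names and the synthetic data (a true intercept of 0.2 and slope of 1.3, in millions of tCO2e) are hypothetical. The log model is fit without an intercept in log space to mirror the `np.log(ifm_1)~-1+np.log(X)` formula.

```python
import numpy as np


def fit_with_intercept(x, y):
    """Classic OLS: y = a + b * x. Returns (a, b)."""
    design = np.column_stack([np.ones_like(x), x])
    (intercept, slope), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(intercept), float(slope)


def fit_through_origin(x, y):
    """OLS forced through the origin: y = b * x. Returns b."""
    return float((x @ y) / (x @ x))


def fit_log_log(x, y):
    """No-intercept fit in log space: log(y) = b * log(x). Returns b."""
    log_x, log_y = np.log(x), np.log(y)
    return float((log_x @ log_y) / (log_x @ log_x))


# Synthetic, noise-free data in millions of tCO2e (not real project values).
x = np.array([1.0, 2.0, 5.0, 10.0])  # stands in for average_slag_baseline * acreage
y = 0.2 + 1.3 * x                    # stands in for reported baseline IFM-1

intercept, slope = fit_with_intercept(x, y)
origin_slope = fit_through_origin(x, y)
log_slope = fit_log_log(x, y)

print(f"with intercept: a={intercept:.3f}, b={slope:.3f}")
print(f"through origin: b={origin_slope:.3f}")
print(f"log-log:        b={log_slope:.3f}")
```

When the data really do have a non-zero intercept, the through-origin slope absorbs it and drifts away from 1.3, which is the kind of difference worth checking before picking a model.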

For model development purposes, it makes total sense to calculate things in terms of ARBOCs -- this
lets us skip some of the assumptions...overview. But for purposes of


In [None]:
data["X"] = data["average_slag_baseline"] * data["acreage"]

In [None]:
log_model = False
if log_model:
    mod = smf.ols("np.log(ifm_1)~-1+np.log(X)", data=data).fit()
    data["pred_ifm_1"] = np.exp(mod.predict(data["X"]))  # transform back out of log space
else:
    mod = smf.ols("ifm_1~-1+X", data=data).fit()
    data["pred_ifm_1"] = mod.predict(data["X"])

data["predicted_arbocs"] = use_for_arbocs(data["pred_ifm_1"], project_db)

# helps motivate modeling approach

In [None]:
g = sns.FacetGrid(data=data / 1_000_000, height=4.5)
g.map(plt.scatter, "pred_ifm_1", "ifm_1", color=".3", alpha=0.55, s=125)
g.axes[0][0].plot((0, 35), (0, 35), lw=3, ls="--", c="r")
plt.ylabel("Reported Baseline IFM-1\n(MtCO2e)")
plt.xlabel("Predicted Baseline IFM-1\n(MtCO2e)")

## Some more stuff


In [None]:
sub = data.dropna(subset=["ifm_1", "pred_ifm_1"])  # some sites dont have avg slag
rmse = mean_squared_error(sub["ifm_1"], sub["pred_ifm_1"]) ** 0.5

In [None]:
# never been quite sure how much i like this metric -- but rRMSE is low
display(f"RMSE: {rmse:.2f} MtCO2e")
display(f"relative RMSE: {rmse/sub['ifm_1'].mean() * 100:.2f}%")  # same subset as the RMSE itself

And this next cell is just for model development -- if we run the ARBOC calculation in the forward
mode (knowing everything!) and just change IFM-1 to be based on our linear model, how much error do
we introduce?

Not a ton!
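
For reference, the two error metrics used in this section can be written out explicitly. This is a minimal sketch with hypothetical helper names and made-up numbers. "Relative RMSE" here means RMSE divided by the mean of the observed values, which is how the surrounding cells compute it.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def relative_rmse(observed, predicted):
    """RMSE as a percentage of the mean observed value."""
    return rmse(observed, predicted) / float(np.mean(observed)) * 100


# Made-up allocations (observed) and model outputs (predicted).
observed = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]

print(f"RMSE: {rmse(observed, predicted):.2f}")
print(f"relative RMSE: {relative_rmse(observed, predicted):.2f}%")
```

Because the denominator is a mean over projects, a few very large projects can make the relative figure look small even when individual small projects are badly predicted.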


In [None]:
arboc_rmse = mean_squared_error(sub["predicted_arbocs"], sub["allocation"]) ** 0.5
display(f"RMSE: {arboc_rmse:.2f}")
display(f"percent RMSE: {arboc_rmse/sub['allocation'].mean()*100:.2f}%")

In [None]:
arboc_rmse = (
    mean_squared_error(sub["project_onsite"] - (sub["pred_ifm_1"] + sub["ifm_3"]), sub["upfront"])
    ** 0.5
)
display(f"RMSE: {arboc_rmse:.2f}")
display(f"percent RMSE: {arboc_rmse/sub['upfront'].mean()*100:.2f}%")

So based on this analysis we're going to scale slag\*acreage by the following factor!


In [None]:
mod.params

## Debugging

This section contains some useful debugging diagnostics.

For example, we know that IFM-1 against X (average_slag_baseline \* acreage) should be fairly linear
across projects. The graph below shows this relationship explicitly. When I first plotted it,
however, there were a handful of really big outliers. Upon further inspection, it turned out that
some of those outliers were actually data entry errors! So if more projects get added in the future,
it would be good to look at this graph. While it's totally possible a project could diverge from
this line (though unlikely) -- it's helpful to double check that this intuition holds.
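
The visual check described above can be complemented by a numeric one. The sketch below flags projects whose IFM-1 / X ratio sits far from the typical ratio. It deliberately uses the median ratio instead of the notebook's OLS fit, so that a single bad entry cannot drag the reference line toward itself. The function name, the 25 percent threshold, and the data are all hypothetical.

```python
import numpy as np


def flag_ratio_outliers(x, ifm_1, threshold=0.25):
    """Return (median_ratio, indices) for points whose ifm_1 / x ratio deviates
    from the median ratio by more than `threshold` (as a fraction)."""
    x = np.asarray(x, dtype=float)
    ifm_1 = np.asarray(ifm_1, dtype=float)
    ratios = ifm_1 / x
    median_ratio = float(np.median(ratios))
    relative_deviation = np.abs(ratios / median_ratio - 1)
    return median_ratio, np.flatnonzero(relative_deviation > threshold)


# Made-up values in millions of tCO2e; the last IFM-1 entry is doubled by "mistake".
x = [1.0, 2.0, 3.0, 4.0, 5.0]
ifm_1 = [1.3, 2.6, 3.9, 5.2, 13.0]

median_ratio, flagged = flag_ratio_outliers(x, ifm_1)
print(f"median IFM-1 / X ratio: {median_ratio:.2f}")
print(f"positions worth a second look: {flagged.tolist()}")
```

A flagged project is only a prompt to re-read the source documents, since a genuine divergence from the line is possible.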


In [None]:
g = sns.FacetGrid(data=data.round() / 1_000_000, height=4.5)
g.map(sns.regplot, "ifm_1", "X")  # , s=125, alpha=0.55, color='.3')
plt.plot((0, 15), (0, 15), ls="--", c="r", lw=3)
# plt.xlabel('Predicted ARBOCs (millions)')
# plt.ylabel('Observed ARBOCs (millions)')
plt.xlim(0, 10)
plt.ylim(0, 10)

It's also helpful to look at projects where the prediction is off. At this point I've checked all of
these and I'm fairly certain that the errors fall into two categories:

- big acreage, meaning small errors at the project level but large "total" ARBOC errors.
- $\Delta Project_{onsite}$ for some projects is just high -- so maybe a rapidly growing forest (or
  I guess slow growing) relative to the average growth across all projects


In [None]:
delta_arbocs = (sub["predicted_arbocs"] - sub["allocation"]) / sub["allocation"]

In [None]:
delta_arbocs.sort_values().describe()

In [None]:
delta_arbocs.sort_values().head(10)

In [None]:
g = sns.FacetGrid(data=data.round() / 1_000_000, height=4.5)
g.map(plt.scatter, "predicted_arbocs", "allocation", s=125, alpha=0.55, color=".3")
plt.plot((0, 15), (0, 15), ls="--", c="r", lw=3)
plt.xlabel("Predicted ARBOCs (millions)")
plt.ylabel("Observed ARBOCs (millions)")

## What about IFM-3?

Originally I had considered also modeling IFM-3 in there -- but it dawned on me that assuming IFM-3
remains constant is actually a fairly decent assumption.

First, in cases where $CP_{Alt}$ > $CP_0$, holding IFM-3 constant is inherently conservative --
essentially we assume that IFM-3 doesn't scale upwards with IFM-1, which will lower our estimate of
over-credited ARBOCs.

Second, it's actually the case that lots of projects don't harvest their IFM-3. To be precise, 75
percent of projects have IFM-3 in the baseline scenario that is within 5 percent of the project
scenario. I'm not really sure how this emerged (e.g., is it a requirement of the protocol...?) --
BUT the pattern is there, so I think we should adhere to the dominant pattern of folks not messing
with IFM-3 over the long run.
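
One way to make the "within 5 percent" check explicit is sketched below. It measures each project's baseline-versus-project IFM-3 gap against that same project's own mean of the two values (an elementwise mean). That definition is an assumption on my part about what "within 5 percent" should mean, and it is not the same as dividing by a single mean pooled across all projects. The function name and the numbers are hypothetical.

```python
import numpy as np


def share_within_tolerance(baseline_sd, project_sd, tolerance=0.05):
    """Fraction of projects whose baseline and project IFM-3 differ by at most
    `tolerance`, relative to the per-project mean of the two values."""
    baseline = np.asarray(baseline_sd, dtype=float)
    project = np.asarray(project_sd, dtype=float)
    pair_mean = (baseline + project) / 2  # one mean per project, not one overall
    relative_difference = np.abs(baseline - project) / pair_mean
    return float(np.mean(relative_difference <= tolerance))


# Made-up standing dead carbon for four projects (tCO2e).
baseline_sd = [100.0, 50.0, 200.0, 10.0]
project_sd = [102.0, 50.0, 150.0, 10.2]

share = share_within_tolerance(baseline_sd, project_sd)
print(f"share of projects within 5 percent: {share:.0%}")
```

Projects with zero IFM-3 in both scenarios would produce a 0/0 here and need to be handled separately.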


In [None]:
baseline_sd = project_db["baseline"]["components"]["ifm_3"]
project_sd = project_db["rp_1"]["components"]["ifm_3"]

sd_close = sum(((baseline_sd - project_sd) / np.mean([baseline_sd, project_sd])).abs() <= 0.05)

sd_close / len(project_db)

And we get to skip all of this now...

### get issuance by reporting period -- ARB doesn't always issue all the credits at once, so aggregate by project and rp_id

rp_sum = issuance_table.groupby(['opr_id', 'arb_rp_id']).allocation.sum().reset_index()
mean_subsequent_rp = rp_sum[rp_sum['arb_rp_id'] != 'A'].groupby('opr_id')['allocation'].mean()
rp_data = rp_sum[rp_sum['arb_rp_id'] == 'A'].join(mean_subsequent_rp.rename('mean_subsequent_allocation'), on=['opr_id'])
rp_data = rp_data.set_index('opr_id')

#### subset to just the 72 projects where ICS > CP

rp_data = rp_data[rp_data.index.isin(project_db.index.values.tolist())]

#### some projects don't have a second reporting period -- we'll assume that their subsequent RP is the median ratio between i) mean subsequent and ii) allocation

rp_data.loc[np.isnan(rp_data['mean_subsequent_allocation']), 'mean_subsequent_allocation'] = rp_data['allocation'] * (rp_data['mean_subsequent_allocation']/rp_data['allocation']).median()

The algebra in the `Overview` says that $Project_{Onsite, Start} - Baseline_{Onsite}$, referred to
as `delta_onsite` in the code, should be fairly close to
$Issuance_{RP1} - \overline{Issuance_{Subsequent}}$. Keeping in mind that we only have
$Project_{Onsite, End}$, it should still be the case that `est_upfront` (calculated in the next
cell) is pretty well correlated with `delta_onsite`. Actually, we can be even more specific --
`delta_onsite` should be slightly higher than `est_upfront` because it has some "growth" carbon
baked into it as well. And that's exactly what we see below: `delta_onsite` is a little higher than
upfront, but the two are very nearly linear with each other.

est_upfront = rp_data['allocation'] - rp_data['mean_subsequent_allocation']

tp = pd.concat([est_upfront.rename('upfront'), data['delta_onsite']], axis=1)
tp = tp.dropna()
rmse = mean_squared_error(tp['upfront'], tp['delta_onsite']) ** 0.5
display(f'fairly low RMSE: {rmse/1_000_000:.2f} million ARBOCs')

And if we look at the distribution of "percent differences" between our estimated upfront carbon and
delta onsite, we see fairly good agreement. The median 'percent difference' is about 7 percent, at
least part (if not most) of which can be explained by $Project_{Onsite, Start}$ differing from
$Project_{Onsite, End}$. But we're also assuming that SE, wood products, AND the increment of IFM-1
\& IFM-3 don't change when we estimate 'upfront'. The big differences (20-30 percent) all deserve a
closer look and might be explained by data entry errors (either ours or in the project documents).
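
The "percent difference" comparison can be sketched as below. The text does not pin down the exact definition, so this assumes the difference is expressed relative to `delta_onsite`. The function name, the 20 percent review threshold, and the numbers are hypothetical.

```python
import numpy as np


def percent_difference(est_upfront, delta_onsite):
    """Percent by which delta_onsite exceeds est_upfront, relative to delta_onsite."""
    est = np.asarray(est_upfront, dtype=float)
    onsite = np.asarray(delta_onsite, dtype=float)
    return (onsite - est) / onsite * 100


# Made-up values for three projects (tCO2e).
est_upfront = [93.0, 180.0, 70.0]
delta_onsite = [100.0, 200.0, 100.0]

pct = percent_difference(est_upfront, delta_onsite)
needs_review = np.flatnonzero(np.abs(pct) >= 20)

print(f"median percent difference: {np.median(pct):.1f}%")
print(f"positions with a 20+ percent gap: {needs_review.tolist()}")
```

Sorting projects by this value puts the ones most likely to contain a data entry error at the top of the list.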
