## Post-processing of the aboveground biomass dataset

### Input

Random forest model predictions from `inference.ipynb`. These are parquet
files (one for each Landsat scene x year) with columns `x`, `y`, and `biomass`.
`x` and `y` are lat/lon coordinates, and `biomass` is in units of Mg biomass / ha
and accounts only for aboveground, live, woody biomass.

### Processes

For each 10x10 degree tile in our template:

1. merge and mosaic all Landsat scenes within the 10x10 degree tile for all
   available years and store the data in zarr format
2. fill gaps within the biomass dataset using xarray `interpolate_na` with the
   linear method (first along dim `time`, then along dim `x`, then dim `y`)
3. mask with the MODIS MCD12Q1 land cover dataset to select only the forest
   pixels
4. calculate belowground biomass, deadwood, and litter
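
The gap-filling order in step 2 can be sketched on a toy array. This is only a minimal illustration, assuming a `DataArray` with dims `time`, `y`, `x`; the values and variable names are made up, and the actual code in `postprocess` may differ.

```python
import numpy as np
import xarray as xr

nan = np.nan
# Toy biomass cube (made-up values): 3 years, 2 rows (y), 3 columns (x).
biomass = xr.DataArray(
    np.array(
        [
            [[10.0, nan, 30.0], [1.0, 2.0, 3.0]],
            [[nan, nan, 32.0], [1.0, 2.0, 3.0]],
            [[14.0, nan, 34.0], [1.0, 2.0, 3.0]],
        ]
    ),
    dims=("time", "y", "x"),
    coords={"time": [2015, 2016, 2017], "y": [0.0, 1.0], "x": [0.0, 1.0, 2.0]},
)

# Interpolate along time first, then x, then y. Each pass only fills
# gaps that the previous passes could not fill.
filled = biomass
for dim in ("time", "x", "y"):
    filled = filled.interpolate_na(dim=dim, method="linear")

print(filled.sel(y=0.0).values)
```

In this toy case the pixel at `x=0` is missing only in 2016, so the time pass fills it. The pixel at `x=1` is missing in every year, so it stays empty after the time pass and is filled from its neighbours along `x`.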

### To do

1. take the difference between years to calculate biomass change:
   - `biomass_change = t0 - t1`
   - `sinks = clip max=0`
   - `emissions = clip min=0`

2. co-locate with fire data, only for the emissions pixels

3. use emission factors to calculate fire-related emissions (line 44 of
   https://docs.google.com/spreadsheets/d/11CCsl1rsAlC2y9Ilfch4jSN6tFTHspMAqfzuavh_yUI/edit#gid=0)
4. calculate non-fire-related emissions

- still need to convert from biomass to carbon: 0.467 \* 3.67 (maybe use
  different numbers depending on whether belowground or aboveground)

5. convert to mass and roll up by country
6. test
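
The biomass change and unit conversion steps in this list could look roughly like the sketch below. It is not the implemented method: the array is made up, and reading 0.467 as the carbon fraction of biomass and 3.67 as the C to CO2 mass ratio (44/12) is an assumption about the note above.

```python
import numpy as np
import xarray as xr

# Toy biomass time series in Mg biomass / ha for two pixels (made-up values).
biomass = xr.DataArray(
    np.array([[100.0, 50.0], [90.0, 60.0], [95.0, 40.0]]),
    dims=("time", "pixel"),
    coords={"time": [2015, 2016, 2017]},
)

# diff() returns t1 - t0, so negate it to get biomass_change = t0 - t1.
# With this sign convention a positive value is a loss of biomass.
biomass_change = -biomass.diff("time")

sinks = biomass_change.clip(max=0)      # biomass gains (<= 0)
emissions = biomass_change.clip(min=0)  # biomass losses (>= 0)

# Assumed conversion factors, taken from the to-do note above.
CARBON_FRACTION = 0.467  # Mg C per Mg biomass (assumption)
C_TO_CO2 = 3.67          # Mg CO2 per Mg C (assumption)
co2_emissions = emissions * CARBON_FRACTION * C_TO_CO2

print(co2_emissions.values)
```

The `time` coordinate of `biomass_change` carries the later year of each pair, because `diff` labels results with the upper coordinate by default.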


In [None]:
%load_ext autoreload
%autoreload 2

from pyproj import CRS
import boto3
from rasterio.session import AWSSession
from s3fs import S3FileSystem
aws_session = AWSSession(
    boto3.Session(),  # profile_name='default'
    requester_pays=True,
)
fs = S3FileSystem(requester_pays=True)
import xgboost as xgb

from osgeo.gdal import VSICurlClearCache
import rasterio as rio
import numpy as np
import xarray as xr
import dask
import os
import fsspec

import rioxarray # for the extension to load
import pandas as pd
from datetime import datetime

from dask_gateway import Gateway
from carbonplan_trace.v1.landsat_preprocess import access_credentials, test_credentials
from carbonplan_trace.v1.inference import predict, predict_delayed 
from carbonplan_trace.v1 import utils, postprocess
from carbonplan_trace.tiles import tiles

from prefect import task, Flow, Parameter


In [None]:
from carbonplan_trace import version

print(version)

In [None]:
# kind_of_cluster = "local"
# kind_of_cluster = "remote"
# if kind_of_cluster == "local":
#     # spin up local cluster. must be on big enough machine
#     from dask.distributed import Client

#     client = Client(
#         n_workers=4,
#         threads_per_worker=8,
#     )
#     client
# else:
#     gateway = Gateway()
#     options = gateway.cluster_options()
#     options.environment = {
#         "AWS_REQUEST_PAYER": "requester",
#         "AWS_REGION_NAME": "us-west-2",
#     }
#     options.worker_cores = 1
#     options.worker_memory = 200

#     options.image = "carbonplan/trace-python-notebook:latest"
#     cluster = gateway.new_cluster(cluster_options=options)
#     cluster.scale(40)

In [None]:
# `cluster` is only defined if the cluster setup cell above is uncommented and run
cluster

In [None]:
client = cluster.get_client()
client

In [None]:
# find existing output and skip those, something like this

# processed_scenes = []
# for year in np.arange(2015, 2021):
#     processed_scenes.extend(
#         fs.ls(f"{bucket}/inference/rf/{year}", recursive=True)
#     )

# processed_scenes = [scene[-19:-8] for scene in processed_scenes]

In [None]:
log_file_mapper = fsspec.get_mapper(
    "s3://carbonplan-climatetrace/junk/text.txt"
)

In [None]:
with fsspec.open("s3://carbonplan-climatetrace/junk/text.txt", mode="w") as f:
    f.write("done")

In [None]:
tasks = []
# define starting and ending years (will want to go back to 2014 but that might not be ready right now)
year0, year1 = 2015, 2021
# define the size of subtile you want to work in (2 degrees recommended)
tile_degree_size = 2
# whether to write the metadata for the zarr store
write_tile_metadata = True

In [None]:
for tile in tiles[3:4]:
    #     if tile not in already_processed:
    # find the lat_lon_box for that tile
    lat_tag, lon_tag = utils.get_lat_lon_tags_from_tile_path(tile)
    lat_lon_box = utils.parse_bounding_box_from_lat_lon_tags(lat_tag, lon_tag)
    min_lat, max_lat, min_lon, max_lon = lat_lon_box

    # initialize empty dataset. only need to do this once, and not if the tile has already been processed
    data_mapper = postprocess.initialize_empty_dataset(
        lat_tag, lon_tag, year0, year1, write_tile_metadata=write_tile_metadata
    )
    # now we'll split up each of those tiles into smaller subtiles of length `tile_degree_size`
    # and run through those. In this case since we've specified 2, we'll have 25 in each box
    # (the `[0:1]` slices below restrict this run to the first subtile only)
    for lat_increment in np.arange(0, 10, tile_degree_size)[0:1]:
        for lon_increment in np.arange(0, 10, tile_degree_size)[0:1]:
            postprocess.postprocess_subtile(
                min_lat,
                min_lon,
                lat_increment,
                lon_increment,
                year0,
                year1,
                tile_degree_size,
                data_mapper,
            )

#         tasks.append(client.compute(postprocess_delayed(subtile_ul_lat, subtile_ul_lon, year0, year1, tile_degree_size, mapper)))
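
For reference, the full set of subtiles that the loop above would visit without the `[0:1]` slices can be enumerated as below. This is only a sketch: the tile corner is a hypothetical value, and treating `min_lat + lat_increment` as the subtile origin is an assumption about how `postprocess_subtile` combines its arguments.

```python
import numpy as np

tile_degree_size = 2
# Hypothetical lower-left corner of a 10x10 degree tile.
min_lat, min_lon = 40, -130

# Offsets of each subtile within the tile, in degrees.
increments = np.arange(0, 10, tile_degree_size)

# One (lat, lon) origin per subtile (assumed to be min + increment).
subtile_origins = [
    (min_lat + int(lat_increment), min_lon + int(lon_increment))
    for lat_increment in increments
    for lon_increment in increments
]

print(len(subtile_origins))
print(subtile_origins[0], subtile_origins[-1])
```

With `tile_degree_size = 2` this gives 5 x 5 = 25 subtiles per tile, matching the comment in the loop.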