# Methods

Alyssa Willson

## Outline

1. [Objectives](#sectionobjectives)
2. [Summary](#sectionsummary)
3. [Generalized Joint Attribute Model](#sectiongjam)
    - [PLS Data](#sectionplsdata)
    - [Environmental Data](#sectionenvdata)
    - ["Effort"](#sectioneffort)
    - [GJAM Drawbacks](#sectiondrawbacks)
4. [New Model](#sectionnewmodel)
5. [Future Directions](#sectionfuturedirections)

<a id = 'sectionobjectives'></a>

## Objectives

The immediate objective of this project is to investigate the relationship between the vegetation and the environment of Indiana and Illinois using the Public Land Survey (PLS) record of tree taxon presence and reconstructions of 12 environmental drivers. In so doing, we test the hypothesis that taxon presence is driven by a combination of environmental drivers and taxon covariance (or biotic interactions between taxa).

In the longer term, we are interested in using this analysis as the starting point for two projects: (1) comparing the drivers of taxon presence between the pre- and post-European settlement periods; and (2) understanding the role of temporal memory in driving vegetation composition and biomass at different temporal scales.

<a id = 'sectionsummary'></a>

## Summary

Here, I first describe the use of the Generalized Joint Attribute Model (GJAM) for investigating the vegetation-environment relationship in Indiana and Illinois with the PLS record. Then, I propose an idea for a new model that addresses some of the shortcomings of GJAM. Finally, I briefly expand on the two long-term projects mentioned above and explain their connection to the analysis at hand.

<a id = 'sectiongjam'></a>

## Generalized Joint Attribute Model

The Generalized Joint Attribute Model (GJAM) was developed by Jim Clark et al. (2017) as a flexible framework for modeling ecological data. The model's advantages are numerous and include the following:
1. Incorporation of multiple classes of response data (e.g., presence/absence, capture/recapture) in a single analysis
2. Handling zero-inflated response variables
3. Explicit quantification of covariance between response variables (e.g., presence of different taxa)
4. Uncertainty specification in a Bayesian framework
5. Easy implementation in R

These advantages prompted us to use GJAM as a starting place for our analysis.

<a id = 'sectionplsdata'></a>

### PLS Data

We used point-level observations from the PLS record; that is, we treated each PLS corner as a point observation. At each point, we recorded the presence or absence of each taxon in the full dataset. Specifically, we considered a taxon to be present if at least one tree recorded at that corner (maximum 4 trees/corner) belonged to that taxon. Any taxon that was not recorded at the corner was considered absent at the point level. In total, we trained our model on 78,224 observations of 15 taxa. We withheld 24,699 observations for out-of-sample validation. All corners were located within Indiana and Illinois.
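The presence/absence rule above can be sketched in a few lines. This is only an illustration under assumed inputs: the `(corner_id, taxon)` record layout, the function name, and the example taxa are hypothetical and do not reflect the actual PLS data format.

```python
from collections import defaultdict


def presence_absence(tree_records, taxa):
    """Build a corner-by-taxon presence/absence table.

    tree_records: iterable of (corner_id, taxon) pairs, one per recorded tree
                  (hypothetical layout; at most 4 trees per corner in PLS).
    taxa: ordered list of every taxon in the full dataset.
    Returns {corner_id: [0 or 1 per taxon, in the order of `taxa`]}.
    """
    observed = defaultdict(set)
    for corner_id, taxon in tree_records:
        observed[corner_id].add(taxon)
    # A taxon is present (1) if at least one tree at the corner belongs to it,
    # and absent (0) otherwise.
    return {
        corner_id: [1 if taxon in present else 0 for taxon in taxa]
        for corner_id, present in observed.items()
    }


# Hypothetical example: two corners, three taxa.
records = [
    ("corner_1", "oak"), ("corner_1", "oak"), ("corner_1", "hickory"),
    ("corner_2", "elm"),
]
table = presence_absence(records, ["oak", "hickory", "elm"])
```

Corners with no recorded trees do not appear in `tree_records` and so are omitted from the table in this sketch; they would need to be added explicitly if they should count as all-absent observations.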

<a id = 'sectionenvdata'></a>

### Environmental Data

At each corner, we collated data on 12 environmental drivers. The drivers are listed below:

| Driver Type | Driver |
| ----------- | ------ |
| Climate     | Total Precipitation |
| Climate     | Mean Temperature |
| Soils       | CaCO<sub>3</sub> |
| Soils       | Cation Exchange Capacity |
| Soils       | Soil % Clay |
| Soils       | Soil % Sand |
| Soils       | Available Water Content |
| Soils       | Hydric Soil Presence (Boolean) |
| Soils       | Floodplain Presence (Boolean) |
| Topographic | Slope |
| Topographic | Aspect |
| Topographic | Saga Wetness Index |

In all GJAM analyses, we also included a random effect of "management area." The management area variable describes the discrete geographic regions within which all corners reside. In total, our training dataset included 15 management areas and our validation dataset included 7 management areas.

<a id = 'sectioneffort'></a>

### "Effort"

GJAM accomplishes the integration of multiple response data types and full covariance between response variables using censoring. The censoring procedure implemented within GJAM requires a variable called "effort," which in our application can be thought of as the precision of the point-level observations of the response variables. To define effort, we used the distance that the land surveyor traveled from the corner to record the observed tree. We considered this distance a proxy for the precision of the observation because it represents the certainty that the tree recorded in the PLS record reflects the trees actually present at the point-level location of the corner. Our results do not appear to be particularly sensitive to the specific method used to construct the effort variable.

<a id = 'sectiondrawbacks'></a>

### GJAM Drawbacks

We successfully implemented GJAM with the PLS data but in so doing identified a number of drawbacks to using GJAM for further research. Specifically,
1. The GJAM architecture does not allow for quantifying spatial covariance
2. There is no straightforward way to extend GJAM to operate over time (i.e., no dynamic modeling option)
3. The large number of operations required to fit such a flexible model (including the censoring) leads to a relatively long computation time

For these reasons, we are interested in building another model that allows us to incorporate correlations through space and time. Because the new model will be less flexible than GJAM, we also hope to reduce computation time enough to make fitting manageable over thousands of years and at the regional spatial scale.

<a id = 'sectionnewmodel'></a>

## New Model

I have started building a Bayesian hierarchical model that has a similar architecture to GJAM. I envision the model working with fractional composition data so that it can be used with the current version of the STEPPS data product. Ideally, the model could be extended to other data types (e.g., continuous biomass) in the future. The objective of the model is to retain the useful features of GJAM (zero-inflation and covariance between response variables). Zero-inflation is important because the majority of taxa are absent in any given grid cell. Covariance between the fractions of each taxon is important because it allows us to identify biotic interactions via correlations that are not a result of similar abiotic niches. In addition, the model should include spatial covariance, which can be accomplished via a distance matrix, and temporal memory, which can be accomplished by using previous time steps as regressors.
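The two structural additions mentioned above, spatial covariance from a distance matrix and temporal memory from lagged regressors, could look roughly like the following. This is a minimal sketch, not the model itself: the exponential decay form, the parameter names `sigma2` and `rho`, and all function names are assumptions chosen for illustration.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between (x, y) grid-cell centers."""
    return [[math.hypot(xi - xj, yi - yj) for (xj, yj) in coords]
            for (xi, yi) in coords]


def exponential_covariance(dist, sigma2, rho):
    """Covariance that decays with distance: sigma2 * exp(-d / rho).

    sigma2 (variance) and rho (range) are assumed parameters that would be
    estimated or given priors in a full model.
    """
    return [[sigma2 * math.exp(-d / rho) for d in row] for row in dist]


def add_lagged_response(env_rows, responses):
    """Append the previous time step's response to each later row of drivers.

    env_rows[t] holds the environmental drivers at time t and responses[t]
    the response at time t. The first time step is dropped because it has
    no predecessor.
    """
    return [list(env_rows[t]) + [responses[t - 1]]
            for t in range(1, len(env_rows))]


# Hypothetical example: two grid cells 5 distance units apart.
dist = distance_matrix([(0.0, 0.0), (3.0, 4.0)])
cov = exponential_covariance(dist, sigma2=1.0, rho=5.0)

# Hypothetical example: one driver and one response over three time steps.
design = add_lagged_response([[1.0], [2.0], [3.0]], [0.2, 0.5, 0.7])
```

Other distance-decay forms (for example Gaussian or Matérn) would slot into `exponential_covariance` in the same way; the exponential form is used here only because it is the simplest to write down.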

I have begun by building a Dirichlet regression model, but I am unfamiliar with the Dirichlet distribution and have not yet been able to model $\boldsymbol{\alpha}$ as a function of the environment. I am actively working on this model now and would appreciate any advice or feedback you can give me.
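One common way to tie $\boldsymbol{\alpha}$ to the environment is a log link, $\alpha_k = \exp(\mathbf{x}^\top \boldsymbol{\beta}_k)$, which keeps every concentration parameter positive. The sketch below illustrates that idea only and is not the model under development: the function names, the example drivers, and the coefficient layout (one coefficient vector per taxon) are all assumptions.

```python
import math


def dirichlet_alpha(x, beta):
    """Concentration parameters alpha_k = exp(x . beta_k), one per taxon.

    x: covariate vector for one grid cell (first entry 1.0 for an intercept).
    beta: list of coefficient vectors, one per taxon (assumed layout).
    """
    return [math.exp(sum(xj * bj for xj, bj in zip(x, beta_k)))
            for beta_k in beta]


def dirichlet_log_density(y, alpha):
    """Log density of a composition y (every y_k > 0, summing to 1)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(yk) for a, yk in zip(alpha, y))


def log_likelihood(X, Y, beta):
    """Sum of Dirichlet log densities over all grid cells."""
    return sum(dirichlet_log_density(y, dirichlet_alpha(x, beta))
               for x, y in zip(X, Y))


# Hypothetical example: two grid cells, an intercept plus one driver, three taxa.
X = [[1.0, 0.3], [1.0, -0.2]]
Y = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
beta_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]  # gives alpha = (1, 1, 1)
ll = log_likelihood(X, Y, beta_zero)
```

With all coefficients at zero, every $\alpha_k$ equals 1 and the Dirichlet is uniform on the simplex, which makes a convenient sanity check before adding priors on $\boldsymbol{\beta}$. A plain Dirichlet density is undefined when any fraction is exactly zero, so the zero-inflation component would still need separate treatment.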

<a id = 'sectionfuturedirections'></a>

## Future Directions

I would like to conclude this document by summarizing some upcoming projects. I would welcome collaboration on these projects if you are interested.

<a id = 'sectionfuture1'></a>

### Comparing pre- and post-European settlement vegetation drivers

I am interested in extending our use of the PLS record in Indiana and Illinois to a comparison between the pre- and post-European settlement eras. Specifically, I am interested in how the drivers of vegetation communities differ between the two periods. I plan to use the model described above to address this question by comparing the magnitude of correlation between taxa in each era, as well as the magnitude of the coefficients describing the relationship between vegetation and environmental drivers. For this project, I plan to use landscape-scale data products describing tree composition that I believe you have previously developed, at least for the PLS period. This project would differ from Kelly's project by taking a more mechanistic approach to the same question, investigating *how* vegetation acts differently between the two periods, instead of identifying differences in the vegetation structure.

<a id = 'sectionfuture2'></a>

### The role of temporal memory

Finally, I am interested in understanding the role of temporal memory in driving long-lived vegetation. My hypothesis is that, because forests are long-lived ecosystems, the processes that drive vegetation demography over the long term can only be fully quantified when using data that span long time periods. The short-term and space-for-time data that are often used to make predictions about vegetation response to climate change are insufficient to fully understand how vegetation responds to changes in climate over the long term. To investigate this hypothesis, I will compare forecasts made using the same model, trained with data at three different time scales. The workflow is as follows:
1. Develop model. This will be the same model described above
2. Fit the model separately with three datasets
    - Long-term data: fossil pollen data product (STEPPS)
    - Medium-term data: tree ring data products (from PalEON)
    - Space-for-time data: satellite hyperspectral data
3. Make forecasts ~150 years into the future using model fit to each dataset separately
    - Long-term data + model -> 150 year forecast
    - Medium-term data + model -> 150 year forecast
    - Space-for-time data + model -> 150 year forecast

I currently have the fractional composition data product from PalEON and tree ring-derived aboveground biomass from PalEON, which could be extended to fractional composition. I am currently working on developing a regional-scale machine learning algorithm for deriving fractional composition from hyperspectral satellite data, using FIA forest plots as training data.

The objective is to investigate differences in the forecasts to demonstrate that the temporal scale of the training data has an impact on the forecast outcome.