Integrate allodb and bmss #1

Closed
maurolepore opened this issue Mar 19, 2018 · 30 comments
@maurolepore
Contributor

@gonzalezeb,

Where in the table should the code look for the parameters in the column equation?

It is clear that DBH is a measurement that the user must provide for each stem. But it is not clear where the other parameters come from. Should we give them in the equations table? Should the user get them from somewhere else and feed them into our code?

For example, where should the code get a from? Or b, or d? Also, is there a lookup table to know what each of those parameters means?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(allodb)

head(equations$equation)
#> [1] "a*(DBH^2)^b"       "a*(DBH^2)^b"       "a+b*DBH+c*(DBH^d)"
#> [4] "a+b*DBH+c*(DBH^d)" "a+b*DBH+c*(DBH^d)" "a+b*DBH+c*(DBH^d)"
glimpse(equations)
#> Observations: 421
#> Variables: 23
#> $ equation_id                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ biomass_component            <chr> "Stem and branches (live)", "Stem...
#> $ equation                     <chr> "a*(DBH^2)^b", "a*(DBH^2)^b", "a+...
#> $ allometry_specificity        <chr> "Species", "Species", "Species", ...
#> $ development_species          <chr> NA, NA, NA, "Ulmus americana", NA...
#> $ geographic_area              <chr> "North Carolina, Georgia", "North...
#> $ dbh_min_cm                   <chr> "14.22", "29.46", "2.5", "2.5", "...
#> $ dbh_max_cm                   <chr> "25.91", "41.66", "40", "40", "55...
#> $ n_trees                      <int> 9, 9, NA, NA, NA, NA, NA, NA, NA,...
#> $ dbh_units_original           <chr> "in", "in", "cm", "cm", "mm", "mm...
#> $ biomass_units_original       <chr> "lb", "lb", "kg", "kg", "kg", "kg...
#> $ allometry_development_method <chr> "harvest", "harvest", "harvest", ...
#> $ model_parameters             <chr> "DBH", "DBH", "DBH", "DBH", "DBH"...
#> $ regression_model             <chr> "linear_multiple", "linear_multip...
#> $ other_equations_tested       <chr> "yes", "yes", NA, NA, NA, NA, NA,...
#> $ log_biomass                  <chr> "10", "10", NA, NA, NA, NA, NA, N...
#> $ bias_corrected               <chr> "yes", "yes", "no", "no", "no", "...
#> $ bias_correction_factor       <chr> "included in model", "included in...
#> $ notes_fitting_model          <chr> "Regression equations were develo...
#> $ original_data_availability   <chr> "1", "1", NA, NA, NA, NA, NA, NA,...
#> $ notes_to_consider            <chr> NA, NA, NA, NA, NA, NA, "DBA = ba...
#> $ warning                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
#> $ ref_id                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, N...
@maurolepore
Contributor Author

maurolepore commented Mar 19, 2018

@gonzalezeb,

This post breaks down the process of calculating biomass from a table of allometric equations (that we provide) and a table with dbh measurements (that the user provides). The goal is to show more clearly what I think is missing from our table of equations.

@gonzalezeb
Contributor

The parameters a, b, c, and d are in our master table. These are the coefficients used for the regression equations compiled from the original publications.

We originally thought that we could replace these parameters in the actual equation, i.e., instead of having a*DBH^b in the equation column we would have 4.13741*DBH^1.08876.

If you think it is best, I will modify data.Rmd so those columns are included in the equations table.

@maurolepore
Contributor Author

Great! Yes, I agree that 4.13741*DBH^1.08876 is better than a*DBH^b.

So we still need to do the conversion, right? I'll have a look to see how big the challenge is to do this in R.

Do you have any other suggestions? I recently learned about OpenRefine -- which might be good to keep in mind.

@gonzalezeb
Contributor

gonzalezeb commented Mar 19, 2018 via email

@maurolepore
Contributor Author

Good news: a quick try shows that we can replace the parameters a-d. This relies on matching strings, so we have to be careful to avoid mismatches. For example, if we replace a by A in apple we get Apple. But if we replace a by A in america we get AmericA.

But I still wonder where to get some other parameters.

All of this is in section 2 of this document
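
The string-matching pitfall above can be sketched in R. This is only an illustration, not the code from the linked document: the helper name substitute_params() and the coefficient values are made up, and it assumes that parameter names appear as whole words in the equation string.

```r
# Hypothetical helper (not part of allodb): insert coefficient values into
# an equation string, matching parameter names as whole words only.
substitute_params <- function(equation, params) {
  for (name in names(params)) {
    # \\b marks a word boundary, so "a" matches the parameter `a`
    # but not a letter inside a longer name.
    pattern <- paste0("\\b", name, "\\b")
    equation <- gsub(pattern, as.character(params[[name]]), equation, perl = TRUE)
  }
  equation
}

# Naive replacement also changes the "b" inside "dbh".
naive <- gsub("b", "2.4", "a+b*dbh", fixed = TRUE)

# Whole-word replacement leaves function and variable names intact.
safe <- substitute_params("exp(a+b*ln(DBH))", c(a = -2.5, b = 2.4))
```

Matching on word boundaries is one possible safeguard; a stricter approach would parse the equation into an expression and substitute symbols instead of text.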

@gonzalezeb
Contributor

I fixed the "other parameters", which were actually part of the original equations. I rewrote equations that needed to take a more general form (i.e., ln(BST)=a+b*ln(DBH) changed to exp(a+b*ln(DBH))).

@maurolepore
Contributor Author

Great! OK, then I'll soon rerun the replacement -- which should result in equations that are only a function of DBH. I'll be in touch in a day or so.

@gonzalezeb
Contributor

There are now 16 unique equations but notice that 11 and 12 have 2 extra variables (WD and Bk) that are embedded within the formula.

distinct(master, equation)
equation
1 a*(DBH^2)^b
2 a*(DBH^b)
3 a+bDBH+c(DBH^d)
4 a*(DBA^b)
5 exp(a+bln(DBH))
6 exp(a+b*DBH+c*(ln(DBH^d)))
7 10^a+b*(log10(DBH^c))
8 a+bDBH
9 a+b*BA
10 exp(a+bln(DBA))
11 exp(a+b×ln(DBH))WDBk(WD=419.814,Bk=1.22)
12 exp(a+b×ln(DBH))WDBk(WD=645.704,Bk=1.05)
13 10^a*DBH^b
14 a+(bDBH)+c(DBH^2)+d*(DBH^3)
15 NA
16 exp(a+(b*(ln(piDBH))))
17 exp(a+b*(DBH/DBH+c))

@gonzalezeb
Contributor

gonzalezeb commented Mar 20, 2018

Now, also notice that some equations are a function of DBA (diameter at base) or BA (basal area). Calculating BA is a step that needs to happen before the actual biomass calculation. Valentine already wrote code to:

  • Calculate the basal area contribution of each stem within a tree.
  • Redistribute the biomass of the main stem to the other stems, using the basal area contribution. For an idea, you can see it here.
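
As a rough sketch of the two steps listed above (this is not Valentine's code, which is only linked from the thread): the column names tree_id and dbh, the toy values, and the per-tree biomass figures are all assumptions, basal area is taken as the area of a circle with diameter dbh, and the redistribution shown is just one possible reading of the second step.

```r
# Basal area of a stem, assuming a circular cross-section of diameter `dbh`.
basal_area <- function(dbh) pi * (dbh / 2)^2

# Toy data: tree 1 has two stems, tree 2 has one (values are made up).
stems <- data.frame(tree_id = c(1, 1, 2), dbh = c(10, 20, 30))
stems$ba <- basal_area(stems$dbh)

# Share of each stem in the total basal area of its tree.
stems$ba_share <- stems$ba / ave(stems$ba, stems$tree_id, FUN = sum)

# One possible redistribution: split a (made-up) tree-level biomass
# across stems in proportion to their basal-area share.
tree_biomass <- c("1" = 50, "2" = 120)
stems$biomass <- unname(tree_biomass[as.character(stems$tree_id)]) * stems$ba_share
```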

@maurolepore
Contributor Author

maurolepore commented Mar 20, 2018

Mmm, this raises too many questions. There are tiny details that can easily lead to confusion and produce a completely wrong result. I think we should meet one day and work on this together -- with Excel, R or whatever -- until we clean all equations. Then you can continue to collect new equations and format them consistently with what we achieve in our meeting. What do you think? If OK, when would it work for you?

For our records, here I list the questions that come to my mind right away.

  • (a) How should equations 11 and 12 be replaced? Can you replace them here manually to see if I can express that in code?

  • (b) Are BA and DBA the exact same thing? If so, we need to pick one and drop the other --
    I suggest keeping BA and dropping DBA because DBA is easily confused with DBH. DBA leaves the reader thinking: did you really mean DBA, or did you mean DBH and type an A instead of an H by mistake?

  • (c) Are the precedence rules correctly expressed within each equation? For example equation 5:

# literal copy paste
exp(a+bln(DBH))

# I think that by `bln` you mean `b * ln(...)`.
# Here I use parentheses to explicitly show the precedence rules:
exp(   a +  ( b * ln(DBH) )    )

# Is this the same?
ln_dbh <- ln(DBH)
b_times_ln_dbh <- b * ln_dbh
exp(a + b_times_ln_dbh)
  • (d) In equations 11 and 12, is b × ln( the same as b * ln( ?
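
For question (c), R's own precedence rules can be checked directly. This is a small sketch under two assumptions: ln in the table means the natural logarithm (base R has no ln(), so it is aliased to log() here), and the values of a, b, and DBH are arbitrary.

```r
# Assumption: `ln` in the equations table is the natural logarithm.
ln <- function(x) log(x)

a <- 1
b <- 2
DBH <- 10

# `*` binds tighter than `+`, so the extra parentheses do not change the result.
implicit <- exp(a + b * ln(DBH))
explicit <- exp(a + (b * ln(DBH)))

# The same equation stored as text evaluates identically once `*` is written out.
from_string <- eval(parse(text = "exp(a+b*ln(DBH))"))
```

The text form only parses when the multiplication sign is explicit: a string such as "exp(a+bln(DBH))" would make R look for a function called bln.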

@gonzalezeb
Contributor

gonzalezeb commented Mar 21, 2018

I ran distinct(master, equation) again and got the correct equations:

1                      a*(DBH^2)^b
2                        a*(DBH^b)
3                a+b*DBH+c*(DBH^d)
4                        a*(DBA^b)
5                 exp(a+b*ln(DBH))
6       exp(a+b*DBH+c*(ln(DBH^d)))
7            10^a+b*(log10(DBH^c))
8                          a+b*DBH
9                           a+b*BA
10                exp(a+b*ln(DBA))
11 exp(a+(b*ln(DBH)))*419.814*1.22
12 exp(a+(b*ln(DBH)))*645.704*1.05
13                      10^a*DBH^b
14   a+(b*DBH)+c*(DBH^2)+d*(DBH^3)
15                          NA     
16         exp(a+(b*(ln(pi*DBH))))
17            exp(a+b*(DBH/DBH+c))

But I will have to check on the precedence rules...

@gonzalezeb
Contributor

BA and DBA are not the same, so we will need to reconsider strategies.

@maurolepore
Contributor Author

maurolepore commented Mar 21, 2018

Great! This addresses almost all my concerns. Here are the few things that I still need to ask you or think about:

  • DBA and BA: Now I see: one is a diameter and the other is an area -- sorry for not reading this carefully enough. I wonder if one can be expressed as a function of the other with Valentine's code -- the link to her work didn't work for me. Is it in a public repo? (here).

  • What should we do when the value of equation is NA?

  • Review equation 17: exp(a+b*(DBH/DBH+c)).

DBH <- 2
c <- 1

# Notice how parentheses produce different results
DBH / DBH + c
#> [1] 2
DBH / (DBH + c)
#> [1] 0.6666667

My updated notes are in section 2 here.

@maurolepore
Contributor Author

Function to calculate basal area:

https://forestgeo.github.io/fgeo.abundance/reference/basal_area.html

@gonzalezeb
Contributor

gonzalezeb commented Mar 30, 2018

I incorporated coefficients a-d in the equations (in a "temporary" column called equation_final); however, I still need to work on more changes, e.g., if the original equations used non-metric units then we will need to convert to metric. This is another issue to consider when calculating biomass.
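
The unit issue could be handled along these lines. This is a sketch under stated assumptions, not existing allodb code: the function name biomass_in_kg() is hypothetical, census DBH is assumed to arrive in cm, and the unit labels follow the values visible in dbh_units_original and biomass_units_original ("in", "cm", "mm", "lb", "kg").

```r
# Hypothetical helper: evaluate an equation fitted in its original units and
# return biomass in kg. `eqn` is a function of DBH in the original DBH units.
biomass_in_kg <- function(dbh_cm, eqn, dbh_unit = "cm", biomass_unit = "kg") {
  # Factors to go from cm to the unit the equation expects.
  cm_to_unit <- c("cm" = 1, "mm" = 10, "in" = 1 / 2.54)
  # Factors to go from the equation's biomass unit to kg.
  unit_to_kg <- c("kg" = 1, "g" = 0.001, "lb" = 0.45359237)

  dbh_original <- dbh_cm * cm_to_unit[[dbh_unit]]
  eqn(dbh_original) * unit_to_kg[[biomass_unit]]
}

# Made-up equation fitted with DBH in inches and biomass in pounds.
toy_eqn <- function(DBH) 2 * DBH
kg <- biomass_in_kg(2.54, toy_eqn, dbh_unit = "in", biomass_unit = "lb")
```

Converting the input before evaluation and the output after it keeps the published coefficients untouched, which may be easier to audit than rewriting each equation in metric form.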

@maurolepore
Contributor Author

maurolepore commented Sep 19, 2018

Summary and moving forward

I believe that most of the comments above have been addressed except these:

  1. https://github.com/forestgeo/allodb/issues/36#issuecomment-377641030
  2. Shrubs: Allow evaluating equations that depend on diameter at base (DBA) and basal area (BA)
    (#41).
  3. What column links the equations table with the census data that users provide? Originally I thought the link would be the columns site or species. Now, the closest to that seem to be the columns development_species and geographic_area.

@gonzalezeb, I'm particularly interested in item 3. Is this one of the things you would like to talk about during my visit to SCBI? Or have we already talked about this? Do you want to chat and refresh my memory before my visit to SCBI?

@ValentineHerr

@maurolepore,
One easy way around our problems would be to have a column in the allodb table (or whatever format it is) that has directly the R code that should be used for the species and site in question.
The code would incorporate "ifelse" statements to deal with DBH thresholds and also convert DBH to intermediary measurements (like DBA, BA, or height) based on other allometries (e.g. for height or diameter at root collar) or calculations (e.g. for shrub allometries).
The function that calculates the AGB would directly evaluate the code. Input would be site, species, DBH and output would be AGB.
It is a less "transparent" solution for the user but it might be the easiest one for developing the package. We could add a column that has a quick description of how the AGB was calculated and have that be displayed in the console when the function is used.
Let me know what you think.
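
A minimal sketch of this idea, assuming a table with one string of R code per site and species: the table layout, the function name evaluate_agb(), and the equation itself are invented for illustration and do not come from allodb.

```r
# Hypothetical table: one string of R code per site and species.
# The threshold and coefficients are made up.
eqn_code <- data.frame(
  site = "scbi",
  species = "species a",
  code = "ifelse(DBH < 10, 2 * DBH, DBH^2)",
  description = "Linear below 10 cm DBH, quadratic above (toy example)",
  stringsAsFactors = FALSE
)

# Evaluate the stored code with DBH as the only user-supplied variable.
# `enclos = baseenv()` keeps the lookup limited to DBH and base R functions.
evaluate_agb <- function(code, DBH) {
  eval(parse(text = code), envir = list(DBH = DBH), enclos = baseenv())
}

agb <- evaluate_agb(eqn_code$code[1], DBH = c(5, 20))
```

Evaluating stored text runs whatever code is in the table, so this approach relies on the table being curated by the package maintainers.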

@maurolepore
Contributor Author

Here I attempt to answer my own question. My conclusion is that we will have all tables linked once we populate the column equation_id.

What column links the equations table with the census data that users provide? (https://github.com/forestgeo/allodb/issues/36#issuecomment-422825299)

LINK VIA SPECIES

Users provide census data, which contains species codes in the column sp. We can translate sp codes to species names by looking them up in the ViewTaxonomy table, and store the result in a column census$species. Then we can link census with sitespecies by matching sitespecies$species and census$species (merge(), dplyr::left_join(), or similar).

In turn, sitespecies can link to equations via equation_id. The full process therefore links census with equations. However, we can't do this right now because all values of equation_id are missing. Should we start populating equation_id?

LINK VIA SITE

Users may provide a string of length 1 giving the name of the site where the data comes from. (We may vectorize over this argument to allow multiple sites -- but let's worry about that later). That string would populate a new column census$site and link to sitespecies by matching the corresponding values of sitespecies$site. sitespecies then links to equations via equation_id (which right now has missing values).
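
The chain of links described above can be sketched with base merge(). All tables here are toy stand-ins with made-up values; only the column names sp, species, site, and equation_id come from the discussion, and the reduced taxonomy table stands in for ViewTaxonomy.

```r
# Toy census supplied by the user; the site is added from a length-1 string.
census <- data.frame(sp = c("spa", "spb"), dbh = c(12, 30), stringsAsFactors = FALSE)
census$site <- "site1"

# Toy lookup from species code to species name.
taxonomy <- data.frame(
  sp = c("spa", "spb"),
  species = c("species a", "species b"),
  stringsAsFactors = FALSE
)

# Toy sitespecies and equations tables, linked by equation_id.
sitespecies <- data.frame(
  site = "site1",
  species = c("species a", "species b"),
  equation_id = c("id1", "id2"),
  stringsAsFactors = FALSE
)
equations <- data.frame(
  equation_id = c("id1", "id2"),
  equation = c("a*(DBH^b)", "a+b*DBH"),
  stringsAsFactors = FALSE
)

# census -> species names -> sitespecies -> equations
linked <- merge(census, taxonomy, by = "sp")
linked <- merge(linked, sitespecies, by = c("site", "species"))
linked <- merge(linked, equations, by = "equation_id")
```

With real data, rows whose species or site has no match would drop out of an inner join like this one, so a left join plus a check for missing equation_id values may be preferable.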

@gonzalezeb
Contributor

Yes, equation_id is the link between tables, see description here.

But, yes, I haven't populated the equation_id. That's part of our conversation next week, as many more sites and equations need to be populated in the allodb_master.

On another note, I like the idea of LINK VIA SPECIES because that opens the use of allodb not just to ForestGEO sites but to anyone who wants to use it, after selecting a region (for example).

There is currently a limitation with species codes: for a few sites whose species lists I got from the ForestGEO website, I don't have codes. I will need to contact the PIs for input.

I have so much work to do!

@maurolepore
Contributor Author

I haven't populated the equation_id ... as many more sites and equations need to be populated in the allodb_master. -- @gonzalezeb

We can discuss this in person, but I'll write down one idea to clarify my thinking and as a reminder.

I think we need a system for assigning equation_ids that we can use every time you add a new equation. I suggest each new equation should get either a random id or a unique id from a time-stamp (e.g. with Sys.time() in R). Multiple rows in the master table may point to the same equation, and therefore to the same value of equation_id. We could normalize the data by having a table with columns equation_id and equation, and we should write a test to ensure that each equation_id is indeed unique.
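
A sketch of the normalization and the uniqueness check, assuming a toy master table with made-up ids; the helper name ids_are_unique() is hypothetical.

```r
# Toy master table: two rows point to the same equation and share its id.
master <- data.frame(
  equation_id = c("id1", "id1", "id2"),
  equation = c("a*(DBH^b)", "a*(DBH^b)", "a+b*DBH"),
  stringsAsFactors = FALSE
)

# Normalized table: one row per distinct equation_id / equation pair.
equations <- unique(master[c("equation_id", "equation")])

# Check that no id is missing and no id repeats.
ids_are_unique <- function(ids) !anyNA(ids) && anyDuplicated(ids) == 0

stopifnot(ids_are_unique(equations$equation_id))
```

If the same id were ever attached to two different equations, the normalized table would contain that id twice and the check would fail, which is the situation the test is meant to catch.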

For a few sites whose species lists I got from the ForestGEO website, I don't have codes.

See ropensci/allodb#43

@maurolepore maurolepore self-assigned this Oct 3, 2018
@maurolepore
Contributor Author

@teixeirak and all, I had expected to resume work on this today but I'll start tomorrow. I had to wrap up a few things that I really needed to get out from my head and into code. I'll keep you posted.

@maurolepore
Contributor Author

maurolepore commented Nov 6, 2018

Follows issue #58 PR ropensci/allodb#61.

Today I set things up in allodb. Most importantly, I updated tests and drafted this report to capture my progress. Tomorrow I'll be working on bmss to adapt the code to work by default with tables from allodb (instead of the dummy tables I had created for testing purposes).

@maurolepore
Contributor Author

Today I wrote some functions to compute biomass with data from SCBI and site-level equations from allodb. For an example see README.

At least for now, the code lives in allodb instead of bmss because it mostly restructures and combines data from allodb. The logic that bmss has is not yet available. Instead, the code follows a simple path to computing biomass.

Tomorrow I'll revisit this code with a fresh brain, and will test it more. Then I'll think what's next.

@maurolepore
Contributor Author

maurolepore commented Nov 9, 2018

Today I wrote code that starts integrating allodb and bmss:

  • New bmss_cns() restructures a census dataframe to the format that bmss understands.
  • New bmss_default_eqn() restructures an equations dataframe (e.g. allodb::master()) to the format that bmss can use as default equations.
  • New bmss() calculates biomass using the output of the two functions above.

This is still work in progress but what's important is that I'm now able to reuse the logic Sean, Gabriel, and I developed a while ago. That logic will surely change but it's a good starting point.

See this updated README to see an example for equations at the species-level and for equations at all-levels.

The code now lives in allodb but will likely move to bmss -- which will make allodb a very independent package (right now it's not).

@maurolepore
Contributor Author

Today I made the code more flexible and simpler than it used to be in bmss. You can see an example in README.

The most important change is not exposed to the user -- it is the ability to order a list of dataframes by index or element-name and then reduce the list to a single dataframe, where each row overwrites the others of lower order-priority. In practice, this allows us to match the user-data with equations of different types, then let the user decide what type of equation overwrites which other type. The result is simpler logic and greater flexibility.

library(allodb)
library(dplyr)

prio <- list(
  prio1 = tibble(rowid = 1:1, x = "prio1"),
  prio2 = tibble(rowid = 1:2, x = "prio2"),
  prio3 = tibble(rowid = 1:3, x = "prio3")
)
rowbind_inorder(prio)
#> # A tibble: 3 x 2
#>   rowid x    
#>   <int> <chr>
#> 1     1 prio1
#> 2     2 prio2
#> 3     3 prio3

# 2 overwrites 1; 3 is dropped
rowbind_inorder(prio, c(2, 1))
#> # A tibble: 2 x 2
#>   rowid x    
#>   <int> <chr>
#> 1     1 prio2
#> 2     2 prio2
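
The overwrite-by-priority idea can be approximated in base R. This is a hypothetical re-implementation meant only to reproduce the two outputs above; it is not the source of rowbind_inorder(), and it assumes every dataframe has a rowid column and that only one row per rowid is kept.

```r
# Hypothetical sketch: bind dataframes in priority order, then keep the
# first (highest-priority) row for each rowid.
rowbind_by_priority <- function(dfs, priority = seq_along(dfs)) {
  bound <- do.call(rbind, unname(dfs[priority]))
  kept <- bound[!duplicated(bound$rowid), , drop = FALSE]
  kept <- kept[order(kept$rowid), , drop = FALSE]
  rownames(kept) <- NULL
  kept
}

prio <- list(
  prio1 = data.frame(rowid = 1:1, x = "prio1", stringsAsFactors = FALSE),
  prio2 = data.frame(rowid = 1:2, x = "prio2", stringsAsFactors = FALSE),
  prio3 = data.frame(rowid = 1:3, x = "prio3", stringsAsFactors = FALSE)
)

default_result <- rowbind_by_priority(prio)
reordered_result <- rowbind_by_priority(prio, c(2, 1))
```

Because dataframes listed earlier win, changing the priority vector is enough to let one equation type overwrite another without touching the matching logic.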

What does face the user is a summary of the available equations of each type -- in the form of a nested dataframe. This is what it looks like:

eqn <- get_equations(census_species)
eqn
#> # A tibble: 5 x 2
#>   eqn_type       data                 
#>   <chr>          <list>               
#> 1 species        <tibble [8,930 x 8]> 
#> 2 genus          <tibble [5,642 x 8]> 
#> 3 mixed_hardwood <tibble [5,516 x 8]> 
#> 4 family         <tibble [10,141 x 8]>
#> 5 woody_species  <tibble [0 x 8]>

Then it's up to the user to use the default priority order or change it.

default_order <- c(
  "species",
  "genus",
  "family",
  "mixed_hardwood",
  "woody_species"
)

pick_best_equations(eqn, order = default_order)
#> # A tibble: 30,229 x 8
#>    rowid site  sp           dbh equation_id eqn        eqn_source eqn_type
#>    <int> <chr> <chr>      <dbl> <chr>       <chr>      <chr>      <chr>   
#>  1     4 scbi  nyssa syl~ 135   8da09d      1.5416 * ~ default    species 
#>  2    21 scbi  liriodend~ 232.  34fe5a      1.0259 * ~ default    species 
#>  3    29 scbi  acer rubr~ 326.  7c72ed      exp(4.589~ default    species 
#>  4    38 scbi  fraxinus ~  42.8 0edaff      0.1634 * ~ default    species 
#>  5    72 scbi  acer rubr~ 289.  7c72ed      exp(4.589~ default    species 
#>  6    77 scbi  quercus a~ 636.  07dba7      1.5647 * ~ default    species 
#>  7    79 scbi  tilia ame~ 475   3f99ba      1.4416 * ~ default    species 
#>  8    79 scbi  tilia ame~ 475   76d19b      0.004884 ~ default    species 
#>  9    84 scbi  fraxinus ~ 170.  0edaff      0.1634 * ~ default    species 
#> 10    89 scbi  fagus gra~  27.2 74186d      2.0394 * ~ default    species 
#> # ... with 30,219 more rows

There are also a few other convenient functions. I'll tidy this up, move the code out of allodb and park the project for a bit until you give some feedback. I'll let you know.

@gonzalezeb
Contributor

This is great! What I can't see (not sure if it is incorporated) is the issue about units, #42.

But now I realize I didn't include a "conversion factor" column in the equations table, which would finally tackle the problem.

Also, I am doing the tedious exercise of checking equations "by hand" and I am making some changes to correct estimates (for example, eq 7aaa22 was incorrect [changed from exp(-10.8036+2.7727dbh+1log(dbh)) to exp((-10.8036dbh))+2.7727log(dbh))]).

@maurolepore
Contributor Author

You're right, #42 and other issues are still undone. What I've done so far is basic and exploratory. But -- after some polishing -- I will have set the road for the rest to come.

@maurolepore
Contributor Author

maurolepore commented Nov 10, 2018

Today I pre-released allodb 0.0.0.9004.

Now allodb is lightweight again (although it depends on two packages that may be removed later), and focused on hosting tables -- not on computing with those tables.

I moved code from allodb to a new package, fgeo.biomass. I deprecated the old bmss package. It remains as a private repository of ideas, but the implementation of those ideas in fgeo.biomass is now totally different -- simpler and more flexible.

Some issues will gradually move from allodb to fgeo.biomass.

@maurolepore maurolepore transferred this issue from ropensci/allodb Nov 10, 2018
@maurolepore
Contributor Author

@gonzalezeb and @teixeirak,

Today I pre-released fgeo.biomass 0.0.0.9000.

With this I finish this iteration of the integration between allodb and fgeo.biomass. There is still a lot to do and I already plan some improvements. But before I continue it would be great to get feedback from you and whoever you want to share this work with.

@maurolepore
Contributor Author

Closing because this issue is unfocused. We may later extract the bits we need and follow up.
