Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPECIES_CODE changes in GAP_PRODUCTS break functionality #54

Closed
sean-rohan-NOAA opened this issue Feb 20, 2024 · 11 comments · Fixed by #55
Closed

SPECIES_CODE changes in GAP_PRODUCTS break functionality #54

sean-rohan-NOAA opened this issue Feb 20, 2024 · 11 comments · Fixed by #55
Assignees

Comments

@sean-rohan-NOAA
Copy link

Issue

I'm trying to wrap my head around a recent change to GAP_PRODUCTS and not entirely sure where this issue should go because it is a problem for how multiple repos interact with GAP_PRODUCTS (e.g. here)

In GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION, SPECIES_CODE 41500 changed to 41099 but there are still catch records for 41500 across RACEBASE tables. For gapindex, that means get_data() retrieves catch data for 41500 but not species data. The lack of species data seems to break CPUE calculations and subsequent steps.

Cross-posting an issue in gapindex.

@zoyafuso-NOAA
Copy link
Collaborator

Thanks for bringing this up, @sean-rohan-NOAA. I guess the surface-layer solution is that if we decide on the species code-groupings for each major invertebrate complex (which would be represented by SPECIES_CODE value 41099 for this example), a lot of the taxonomic changes that occur within the complex will be masked while still accounting for all of the catch. The deeper solution, which is out of my depth, is how to harmonize changes in the taxonomic structure over time with the database, which is the ongoing issue here, which then ultimately affects gapindex.

@Ned-Laman-NOAA
Copy link

The hierarchic classification has Gorgonacea as an Order of the Subclass Octocorallia in the Class Anthozoa. I'm wondering if this is an intent to represent hierarchical classification that isn't fully implemented in these tables yet? I had the impression that the Taxonomics portion of the GAP_PRODUCTS was still under development.

@sean-rohan-NOAA
Copy link
Author

@zoyafuso-NOAA I should have been a bit more specific and included some MWEs showing how the code breaks. The problem is that it doesn't currently account for the catch when the species is missing.

Here are a few examples

Example 1: Numeric vector (error)

library(gapindex)

channel <- gapindex::get_connected()

dat <- gapindex::get_data(year_set = 1991:2023,
                          survey_set = "AI",
                          spp_codes = 41500,
                          sql_channel = channel)

cpue <- gapindex::calc_cpue(racebase_tables = dat)

biomass_strata <- gapindex::calc_biomass_stratum(racebase_tables = dat, cpue = cpue)

Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
no rows to aggregate

Example 2: Grouping variable in a data.frame (error)

dat <- gapindex::get_data(year_set = 1991:2023,
                          survey_set = "AI",
                          spp_codes = data.frame(SPECIES_CODE = 41500, 
                                                 GROUP = "Gorgonacea"),
                          sql_channel = channel)

cpue <- gapindex::calc_cpue(racebase_tables = dat)

biomass_strata <- gapindex::calc_biomass_stratum(racebase_tables = dat, cpue = cpue)

Error in aggregate.data.frame(lhs, mf[-1L], FUN = FUN, ...) :
no rows to aggregate

Example 3: Selecting 41099 and 41500, calculate CPUE and stratum biomass (this runs)

CPUE only calculated for 41099

dat2 <- gapindex::get_data(year_set = 1991:2023,
                          survey_set = "AI",
                          spp_codes = c(41099, 41500),
                          sql_channel = channel)

cpue2 <- gapindex::calc_cpue(racebase_tables = dat2)

biomass_strata2 <- gapindex::calc_biomass_stratum(racebase_tables = dat2, cpue = cpue2)

table(dat2$catch$SPECIES_CODE)

table(cpue2$SPECIES_CODE)

table(dat2$catch$SPECIES_CODE)

41099 41500
45 117

table(cpue2$SPECIES_CODE)

41099
5201

Example 4: Selecting 41099, calculate CPUE and stratum biomass (this runs)

Same result as example 3

dat3 <- gapindex::get_data(year_set = 1991:2023,
                           survey_set = "AI",
                           spp_codes = 41099,
                           sql_channel = channel)

dat3$catch

cpue3 <- gapindex::calc_cpue(racebase_tables = dat3)

biomass_strata3 <- gapindex::calc_biomass_stratum(racebase_tables = dat3, cpue = cpue3)

identical(biomass_strata2, biomass_strata3)

identical(biomass_strata2, biomass_strata3)
[1] TRUE

@zoyafuso-NOAA
Copy link
Collaborator

For my understanding, so the summary of this issue is that the package is only pulling SPECIES_CODES that exist in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION. The SPECIES_CODE value 41500 is now 41099 in 2024 and is not in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION anymore, although it still occurs in RACEBASE.CATCH because we don’t retroactively change SPECIES_CODE values when they are synonymized into another SPECIES_CODE value (or maybe we do, I just don’t know when it happens).

Maybe this is a solution that is already embedded into the package, which is to merge the haul-level catches for 41500 and 41099 so that the catch for both SPECIES_CODES are accounted.

Example 5: Using the spp_codes argument to merge SPECIES_CODE values 41500 and 41099.

dat4 <- gapindex::get_data(year_set = 1991:2023,
                           survey_set = "AI",
                           spp_codes = data.frame(SPECIES_CODE = c(41099, 41500),
                                                  GROUP = 41009),
                           sql_channel = channel)

cpue4 <- gapindex::calc_cpue(racebase_tables = dat4)
biomass_strata4 <- gapindex::calc_biomass_stratum(racebase_tables = dat4, cpue = cpue4)

@sean-rohan-NOAA
Copy link
Author

Although this solution works from an analytical perspective, I'm not sure it really solves the problem of having a dead SPECIES_CODE.

At the very least, it feels like there should be an informative error message when SPECIES_CODE values in the catch table from get_data don't show up in the species table. Otherwise, it seems like there's a pretty good chance of miscalculation. Although the workaround will work for the example, it requires that the user has prior knowledge of which species codes are invalid.

Even with the workaround, I'd still have to employ a pretty kludgy refactor to handle the corner case because the remainder of the code is structured around matching the GROUP (see: internal data setup in esrindex). And yes, I could write more complex generalizable code that could work around taxonomic changes of this type, but I'm a bit hesitant to do so if there could be a more elegant solution.

@zoyafuso-NOAA zoyafuso-NOAA transferred this issue from afsc-gap-products/gap_products Feb 23, 2024
@zoyafuso-NOAA
Copy link
Collaborator

Hi Sean,

Thanks for showing me how it's used in the esrindex package. Yeah, I can see a message output being useful when that happens, and that could leverage the information in GAP_PRODUCTS.TAXONOMIC_CHANGES but that table is still in the works. Would this particular issue be resolved if we reverted to pulling taxonomic information from RACEBASE.SPECIES instead of GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION? I made a little branch thing that allows the user to toggle that option in the get_data() call with an updated script below. It is a bandage because I have a large blind spot about how taxonomic changes affected the database historically and what we want to do about it going forth. Thanks for your patience and support for gapindex and gap_products so far, your input is super helpful.

library(devtools)
devtools::install_github(repo = "afsc-gap-products/gapindex@54-species_code-changes-in-gap_products-break-functionality")
library(gapindex)

channel <- gapindex::get_connected()

### Taxonomic data from RACEBASE.SPECIES
dat_rb <- get_data(year_set = 1991:2023,
                   survey_set = "AI",
                   spp_codes = c(41099, 41500),
                   sql_channel = channel, 
                   taxonomic_source = c("RACEBASE.SPECIES", 
                                        "GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION")[1])

cpue_rb <- gapindex::calc_cpue(racebase_tables = dat_rb)

biomass_strata_rb <- gapindex::calc_biomass_stratum(racebase_tables = dat_rb, 
                                                    cpue = cpue_rb)

### Taxonomic data from GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION
dat_gp <- get_data(year_set = 1991:2023,
                   survey_set = "AI",
                   spp_codes = c(41099, 41500),
                   sql_channel = channel, 
                   taxonomic_source = c("RACEBASE.SPECIES", 
                                        "GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION")[2])

cpue_gp <- gapindex::calc_cpue(racebase_tables = dat_gp)

biomass_strata_gp <- gapindex::calc_biomass_stratum(racebase_tables = dat_gp, 
                                                    cpue = cpue_gp)

### Comparisons
table(dat_rb$catch$SPECIES_CODE)
# 41099 41500 
# 45   117 

table(dat_gp$catch$SPECIES_CODE)
# 41099 41500 
# 45   117 

table(dat_rb$species$SPECIES_CODE)
# 41099 41500 
# 1     1 

table(dat_gp$species$SPECIES_CODE)
# 41099 
# 1 

table(cpue_rb$SPECIES_CODE)
# 41099 41500 
# 5201  5201 

table(cpue_gp$SPECIES_CODE)
# 41099 
# 5201

@zoyafuso-NOAA
Copy link
Collaborator

Alternatively, I could also just get rid of the taxonomic classification data in the get_data() output. It's not used in index production, it's just a nice but ultimately unnecessary thing to have. I made another little branch that comments that part out in the get_data() function.

library(devtools)
devtools::install_github(repo = "afsc-gap-products/gapindex@Remove_Taxon_Info")
library(gapindex)

If you install this branch and run through your initial example, it should resolve. Either branches would seem to work, let me know what you think.

@sean-rohan-NOAA
Copy link
Author

Thank you for addressing this, Zack. Yes, using RACEBASE.SPECIES would solve the issue on my end, but I'm not sure if it is an all-around fix because I'm not sure if/how folks are intending to use GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION.

I definitely think it's helpful to return species info, but I would be a bit concerned if folks are using GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION as a basis to obtain group-level catch/CPUE/biomass since there is a mismatch between species codes in RACEBASE and valid (?) codes in GAP_PRODUCTS.

I supposed I'm leaning towards gapindex@54-species_code-changes-in-gap_products-break-functionality as the best option, but with RACEBASE.SPECIES as the default. But that's just one perspective and I'd be curious to hear what others think (e.g., @EmilyMarkowitz-NOAA, @SarahFriedman-NOAA, @Ned-Laman-NOAA, @MargaretSiple-NOAA, @Duane-Stevenson-NOAA)

@Ned-Laman-NOAA
Copy link

I support reverting to using RACEBase.SPECIES as the precedent home for taxonomic classification for now. I believe that the former is served from RACE_DATA, but not sure I totally have a grip on what goes on behind the scenes there. I suspect that GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION is still in development and will eventually handle things like this, but I'm not clear how just yet.
To comment on the discrepancy between sources, I noted the other day that big skate exists in RACEBase.SPECIES and GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION when citing ITIS, but is not in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION when citing WoRMS so there's a disagreement even within the GAP_PRODUCTS table.

@SarahFriedman-NOAA
Copy link
Collaborator

SarahFriedman-NOAA commented Mar 4, 2024

Sorry, just catching up on everything now after my leave the past two weeks. It sounds like an interesting discussion and, yes, this is an issue I'm aware of. I was attempting to work on this with Heather prior to her retirement but, as we all know, she had a lot on her plate during that time. For now the SPECIES table in RACEBASE should work, but I should note that it is quite of of date (taxonomically) for species. The issue is essentially that species are frequently synonymized and we have no way to accommodate those changes in our existing database structure. I don't want to over-write the old species codes because that has essential information in case any taxonomic changes happen in the future to revert the names (then we lose the ability to distinguish historical records of species). My idea moving forward is to integrate the new TAXONOMIC_CLASSIFICATION table and the TAXONOMIC_CHANGES tables within RACE_DATA, such that synonymized records are pulled for species under the same species code behind this scenes. This would be akin to how the species codes for juvenile and adult pollock are automatically joined under the generalized pollock species code. Unfortunately, my Oracle skills are lacking and I'm unsure about how to implement this structure as yet.

I'm planning to work with OFIS to make these changes to RACE_DATA in accordance with the taxonomic changes, but we have not yet had time to work on this. We also need to wait until these changes are reflected in the deck guides or we risk confusion about species identification at sea. I'm very much open to implementation ideas in the meantime.

@Ned-Laman-NOAA
Copy link

Thanks for weighing in, Sarah.
I suggest that you put together a ticket and submit it to the nmfs.afsc.helpdesk@noaa.gov to engage OFIS in helping to develop the architecture for a fully functional taxonomic system. I know this is already on Autumn's radar, but don't know where OFIS will prioritize it if they have a request in hand. They did recently hire a database architect into their team so now would be a great time to pursue this.
A second thought is that given the speed and frequency of name changes that are happening in the invert world, perhaps a discussion about reasonable controls on how often we realize a name change on our end is in order.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants