-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SPECIES_CODE changes in GAP_PRODUCTS break functionality #54
Comments
Thanks for bringing this up, @sean-rohan-NOAA. I guess the surface-layer solution is that if we decide on the species code-groupings for each major invertebrate complex (which would be represented by SPECIES_CODE value 41099 for this example), a lot of the taxonomic changes that occur within the complex will be masked while still accounting for all of the catch. The deeper solution, which is out of my depth, is how to harmonize changes in the taxonomic structure over time with the database, which is the ongoing issue here, which then ultimately affects gapindex. |
The hierarchic classification has Gorgonacea as an Order of the Subclass Octocorallia in the Class Anthozoa. I'm wondering if this is an intent to represent hierarchical classification that isn't fully implemented in these tables yet? I had the impression that the Taxonomics portion of the GAP_PRODUCTS was still under development. |
@zoyafuso-NOAA I should have been a bit more specific and included some MWEs showing how the code breaks. The problem is that it doesn't currently account for the catch when the species is missing. Here are a few examples Example 1: Numeric vector (error)
Example 2: Grouping variable in a data.frame (error)
Example 3: Selecting 41099 and 41500, calculate CPUE and stratum biomass (this runs)CPUE only calculated for 41099
Example 4: Selecting 41099, calculate CPUE and stratum biomass (this runs)Same result as example 3
|
For my understanding, so the summary of this issue is that the package is only pulling SPECIES_CODES that exist in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION. The SPECIES_CODE value 41500 is now 41099 in 2024 and is not in GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION anymore, although it still occurs in RACEBASE.CATCH because we don’t retroactively change SPECIES_CODE values when they are synonymized into another SPECIES_CODE value (or maybe we do, I just don’t know when it happens). Maybe this is a solution that is already embedded into the package, which is to merge the haul-level catches for 41500 and 41099 so that the catch for both SPECIES_CODES are accounted. Example 5: Using the
|
Although this solution works from an analytical perspective, I'm not sure it really solves the problem of having a dead SPECIES_CODE. At the very least, it feels like there should be an informative error message when SPECIES_CODE values in the catch table from get_data don't show up in the species table. Otherwise, it seems like there's a pretty good chance of miscalculation. Although the workaround will work for the example, it requires that the user has prior knowledge of which species codes are invalid. Even with the workaround, I'd still have to employ a pretty kludgy refactor to handle the corner case because the remainder of the code is structured around matching the GROUP (see: internal data setup in esrindex). And yes, I could write more complex generalizable code that could work around taxonomic changes of this type, but I'm a bit hesitant to do so if there could be a more elegant solution. |
Hi Sean, Thanks for showing me how it's used in the esrindex package. Yeah, I can see a message output being useful when that happens, and that could leverage the information in GAP_PRODUCTS.TAXONOMIC_CHANGES but that table is still in the works. Would this particular issue be resolved if we reverted to pulling taxonomic information from RACEBASE.SPECIES instead of GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION? I made a little branch thing that allows the user to toggle that option in the get_data() call with an updated script below. It is a bandage because I have a large blind spot about how taxonomic changes affected the database historically and what we want to do about it going forth. Thanks for your patience and support for gapindex and gap_products so far, your input is super helpful.
|
Alternatively, I could also just get rid of the taxonomic classification data in the get_data() output. It's not used in index production, it's just a nice but ultimately unnecessary thing to have. I made another little branch that comments that part out in the get_data() function.
If you install this branch and run through your initial example, it should resolve. Either branches would seem to work, let me know what you think. |
Thank you for addressing this, Zack. Yes, using RACEBASE.SPECIES would solve the issue on my end, but I'm not sure if it is an all-around fix because I'm not sure if/how folks are intending to use GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION. I definitely think it's helpful to return species info, but I would be a bit concerned if folks are using GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION as a basis to obtain group-level catch/CPUE/biomass since there is a mismatch between species codes in RACEBASE and valid (?) codes in GAP_PRODUCTS. I supposed I'm leaning towards gapindex@54-species_code-changes-in-gap_products-break-functionality as the best option, but with RACEBASE.SPECIES as the default. But that's just one perspective and I'd be curious to hear what others think (e.g., @EmilyMarkowitz-NOAA, @SarahFriedman-NOAA, @Ned-Laman-NOAA, @MargaretSiple-NOAA, @Duane-Stevenson-NOAA) |
I support reverting to using RACEBase.SPECIES as the precedent home for taxonomic classification for now. I believe that the former is served from RACE_DATA, but not sure I totally have a grip on what goes on behind the scenes there. I suspect that GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION is still in development and will eventually handle things like this, but I'm not clear how just yet. |
Sorry, just catching up on everything now after my leave the past two weeks. It sounds like an interesting discussion and, yes, this is an issue I'm aware of. I was attempting to work on this with Heather prior to her retirement but, as we all know, she had a lot on her plate during that time. For now the SPECIES table in RACEBASE should work, but I should note that it is quite of of date (taxonomically) for species. The issue is essentially that species are frequently synonymized and we have no way to accommodate those changes in our existing database structure. I don't want to over-write the old species codes because that has essential information in case any taxonomic changes happen in the future to revert the names (then we lose the ability to distinguish historical records of species). My idea moving forward is to integrate the new TAXONOMIC_CLASSIFICATION table and the TAXONOMIC_CHANGES tables within RACE_DATA, such that synonymized records are pulled for species under the same species code behind this scenes. This would be akin to how the species codes for juvenile and adult pollock are automatically joined under the generalized pollock species code. Unfortunately, my Oracle skills are lacking and I'm unsure about how to implement this structure as yet. I'm planning to work with OFIS to make these changes to RACE_DATA in accordance with the taxonomic changes, but we have not yet had time to work on this. We also need to wait until these changes are reflected in the deck guides or we risk confusion about species identification at sea. I'm very much open to implementation ideas in the meantime. |
Thanks for weighing in, Sarah. |
Issue
I'm trying to wrap my head around a recent change to GAP_PRODUCTS and not entirely sure where this issue should go because it is a problem for how multiple repos interact with GAP_PRODUCTS (e.g. here)
In GAP_PRODUCTS.TAXONOMIC_CLASSIFICATION, SPECIES_CODE 41500 changed to 41099 but there are still catch records for 41500 across RACEBASE tables. For gapindex, that means get_data() retrieves catch data for 41500 but not species data. The lack of species data seems to break CPUE calculations and subsequent steps.
Cross-posting an issue in gapindex.
The text was updated successfully, but these errors were encountered: