handle SRDB duplicates #34
Comments
Looking into this. |
Right... that's the big problem. For NEE, NEP, and Reco, I think we should drop any SRDB records that are within 1 degree lat/long of what's already in ForC (and flag them for duplicate review). Eddy flux sites just aren't that common... Other variables are trickier, but I'm sure the problem occurs sometimes for them too. Not sure of the best way to handle that. Maybe the same? |
1 degree lat/lon is ~110 km... that is quite large, isn't it? When we look for potential duplicate sites we use 5 km. Should I look only at ForC prior to the GROA import, or also at GROA measurements? I think I'll write a temporary script that just "flags as suspicious" those SRDB records, because it is too complicated to incorporate that into the duplicate system. They won't be brought into ForC_simplified. Also, FYI, I am removing any measurement without lat/lon in ForC_simplified. (I thought I was already doing that... It shouldn't change the results, though, because I think those are ignored later anyway.) |
You’re right, 1 degree lat/long is probably too much. How about 0.1 degrees? We can adjust later. I want to check how big the variation gets among real duplicates.
Let’s exclude just for SRDB, but please flag those in GROA as suspicious as well. Please create a new field for this flag, as the current field with this name is limited to records where we’ve gone back to the original publication (if possible) and think that the values are wrong.
|
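To make the 1 degree versus 0.1 degree versus 5 km comparison above concrete, here is a minimal Python sketch that converts a span in degrees to approximate kilometres. It assumes a spherical Earth with a mean radius of 6371 km; the function name `degrees_to_km` is hypothetical and is not part of the ForC scripts.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean radius of a spherical Earth


def degrees_to_km(deg, lat_deg=0.0):
    """Approximate (north-south km, east-west km) spanned by `deg` degrees.

    The north-south span is the same everywhere on a sphere; the east-west
    span shrinks with the cosine of the latitude.
    """
    km_per_degree = EARTH_RADIUS_KM * math.pi / 180.0
    north_south = deg * km_per_degree
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west


if __name__ == "__main__":
    for deg in (1.0, 0.1):
        for lat in (0, 45, 60):
            ns, ew = degrees_to_km(deg, lat)
            print(f"{deg} deg at latitude {lat}: N-S {ns:.1f} km, E-W {ew:.1f} km")
```

Because the east-west span depends on latitude, a fixed threshold in degrees covers less ground at high latitudes than near the equator, which is one reason a threshold in kilometres can behave differently from one in degrees.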
Should I look at GROA measurements when flagging SRDB, and vice versa? Or should I only look at "original ForC"? |
Hmmm, yes, let's look at GROA when flagging SRDB. I don't think there will be a ton of overlap, though. Let's keep a record of which records are potential conflicts. |
So instead of creating a new column, I created one file for GROA and one for SRDB. Each lists the ForC measurement IDs for which there are records of the same variable within an 11 km cluster (~0.1 degree): GROA records checked against ForC+SRDB in the GROA file, and SRDB records checked against ForC+GROA in the SRDB file. Then, when creating ForC_simplified, I load the SRDB file and remove any measurement ID that appears in its second column. I did this regardless of the variable. |
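The flag-then-drop procedure described in this comment can be sketched roughly as follows. This is not the actual ForC script: the record layout (`id`, `variable`, `lat`, `lon`), the function names, and the example values are all hypothetical, and a simple per-axis 0.1 degree comparison stands in for the 11 km clustering mentioned above.

```python
from collections import defaultdict

THRESHOLD_DEG = 0.1  # threshold proposed in the thread; adjustable


def flag_close_records(reference, candidates, threshold_deg=THRESHOLD_DEG):
    """Return (candidate_id, reference_id) pairs for records of the same
    variable whose latitude and longitude both differ by at most
    `threshold_deg`."""
    by_variable = defaultdict(list)
    for rec in reference:
        by_variable[rec["variable"]].append(rec)

    flagged = []
    for cand in candidates:
        for ref in by_variable.get(cand["variable"], []):
            if (abs(cand["lat"] - ref["lat"]) <= threshold_deg
                    and abs(cand["lon"] - ref["lon"]) <= threshold_deg):
                flagged.append((cand["id"], ref["id"]))
    return flagged


def drop_flagged(records, flagged):
    """Remove every record whose ID appears as a flagged candidate."""
    flagged_ids = {cand_id for cand_id, _ in flagged}
    return [rec for rec in records if rec["id"] not in flagged_ids]


# Made-up example records, for illustration only.
forc = [
    {"id": "F1", "variable": "NEE", "lat": 42.5, "lon": -72.2},
    {"id": "F2", "variable": "GPP", "lat": 10.0, "lon": -84.0},
]
srdb = [
    {"id": "S1", "variable": "NEE", "lat": 42.54, "lon": -72.17},   # near F1, same variable
    {"id": "S2", "variable": "Reco", "lat": 42.54, "lon": -72.17},  # near F1, other variable
    {"id": "S3", "variable": "GPP", "lat": 10.5, "lon": -84.0},     # same variable as F2, far away
]

flagged = flag_close_records(forc, srdb)
kept = drop_flagged(srdb, flagged)
```

Keeping the flagged pairs as a separate list, rather than as a column on the records, mirrors the choice in the comment to write the potential conflicts to their own file so they can be reviewed later.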
Thanks! I want to come back to look at how well this worked more carefully later. Not sure if I'll have time before we submit. |
@ValentineHerr, I want to make sure I have a clear understanding of everything that's done to handle the duplicates. I'm currently drafting an appendix on this based on my current understanding, and will then ask you to review. |
I believe we can say "of same age (if known)".
If you are talking about sites, we handle them independently of records. We look at sites within 5 km of each other and then decide whether they need to be merged (but I think there are a lot that have not been merged, or decided on).
Hmm... unfortunately I don't think I ever got to using supersites... I thought I did but can't find any evidence of it... unless the D.precedence and all was edited by hand while the supersites were assigned to the records...
I am not sure where this rule would be applied. For now, all GROA data that were not identified as duplicates prior to the import have been imported and are considered independent (except maybe a handful that need review), regardless of how far they are from ForC_prior or SRDB records. |
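For the 5 km site check mentioned in this comment, a great-circle distance is one common way to compare coordinates. The sketch below uses the haversine formula on a spherical Earth (6371 km radius assumed); the `name`, `lat`, and `lon` fields and both function names are hypothetical, and the thread does not say which distance calculation the ForC scripts actually use.

```python
import math
from itertools import combinations


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def close_site_pairs(sites, max_km=5.0):
    """Return name pairs of sites lying within `max_km` of each other."""
    return [
        (a["name"], b["name"])
        for a, b in combinations(sites, 2)
        if haversine_km(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_km
    ]


# Made-up coordinates, for illustration only.
sites = [
    {"name": "A", "lat": 10.00, "lon": 20.0},
    {"name": "B", "lat": 10.03, "lon": 20.0},  # roughly 3.3 km north of A
    {"name": "C", "lat": 10.10, "lon": 20.0},  # roughly 11 km north of A
]
candidate_merges = close_site_pairs(sites)
```

The pairwise comparison grows quadratically with the number of sites, so for a large database a spatial index or grid bucketing would likely be needed; that is left out here to keep the sketch short.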
duplicates are a nightmare... and the script that IDs them is running out of steam with ForC getting bigger... |
I pushed everything based on the new rules (mentioned in this issue). |
Thanks! Reviewing now. |
We've done this as well as possible for now. |
I think we're duplicating most of the GPP, Reco, and NEE (NEP) records (and probably some others) across both the original ForC and the SRDB import. I need to look into how to deal with this. It may require dropping SRDB records that are close to ForC records, as we don't have time for a careful review of all the potential duplicates.