
handle SRDB duplicates #34

Closed
teixeirak opened this issue Sep 24, 2020 · 15 comments

Comments

@teixeirak
Member

I think we're duplicating most of the GPP, Reco, and NEE (NEP) records (and probably some others) across both the original ForC and the SRDB import. I need to look into how to deal with this. It may require dropping SRDB records that are close to ForC records, as we don't have time for a careful review of all the potential duplicates.

@ValentineHerr
Member

Looking into this.
There are 753 records of GPP, Reco, or NEP coming from SRDB.
62 of them seem to have had a duplicate issue resolved (D.precedence 0 or 1).
I do see values that are close to other ForC records, but they are not flagged as duplicates because they have a different site or plot name, or a different stand age...

@teixeirak
Member Author

Right... that's the big problem. For NEE, NEP, and Reco, I think we should drop any from SRDB that are within 1 degree lat/long of what's already in ForC (and flag for duplicate review). Eddy flux sites just aren't that common...

Other variables are trickier, but I'm sure the problem sometimes occurs for them too. I'm not sure of the best way to handle that. Maybe the same approach?
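A minimal sketch of the proximity rule proposed above, assuming hypothetical record dicts with `variable`, `lat`, and `lon` keys (not the actual ForC script), and using a great-circle distance with a km threshold in place of a literal 1-degree box:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points (spherical Earth)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def flag_nearby(srdb_records, forc_records, threshold_km=110):
    """Flag SRDB records of the same variable within threshold_km of any
    existing ForC record, so they can be set aside for duplicate review."""
    flagged = []
    for s in srdb_records:
        for f in forc_records:
            if (s["variable"] == f["variable"]
                    and haversine_km(s["lat"], s["lon"],
                                     f["lat"], f["lon"]) <= threshold_km):
                flagged.append(s)
                break
    return flagged
```

The field names and the 110 km default are illustrative; the real data would come from the ForC measurements and sites tables.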

@ValentineHerr
Member

One degree of latitude/longitude is ~110 km... that is quite large, isn't it? When we look for potential duplicate sites we use 5 km.
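For scale, a quick sketch of the degree-to-kilometre conversion behind that ~110 km figure (spherical-Earth approximation; the function name is illustrative):

```python
from math import cos, radians, pi

EARTH_RADIUS_KM = 6371.0

def km_per_degree(lat_deg):
    """Approximate km spanned by one degree of latitude and of longitude
    at a given latitude (spherical Earth)."""
    km_lat = 2 * pi * EARTH_RADIUS_KM / 360        # ~111 km, independent of latitude
    km_lon = km_lat * cos(radians(lat_deg))         # shrinks toward the poles
    return km_lat, km_lon
```

So a 1-degree box is ~111 km north-south everywhere, but narrower east-west away from the equator.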

Should I look only at ForC prior to the GROA import, or also at GROA measurements?

I think I'll write a temporary script that just "flags as suspicious" those SRDB records because it is too complicated to incorporate that in the duplicate system. They won't be brought into ForC_simplified.

Also, FYI, I am removing any measurement without lat/lon from ForC_simplified. (I thought I was already doing that... it shouldn't change the results, though, because I think those are ignored later anyway.)

@ValentineHerr
Member

If we include GROA when comparing "ForC+GROA" to "SRDB", and looking at ForC_simplified:
5 km would remove 1,758 SRDB records and 110 km would remove 3,231.

This gives an idea of the number of records per variable (for the 110 km threshold):
[image: SRDB record counts per variable at the 110 km threshold]

There would still be 3,449 SRDB records remaining (with the 110 km threshold).

@teixeirak
Member Author

teixeirak commented Sep 27, 2020 via email

@ValentineHerr
Member

Should I look at GROA measurements when flagging SRDB, and vice versa? Or should I only look at "original ForC"?

@teixeirak
Member Author

Hmmm, yes, let's look at GROA when flagging SRDB. I don't think there will be a ton of overlap, though. Let's keep a record of which records are potential conflicts.

ValentineHerr added a commit that referenced this issue Sep 27, 2020
@ValentineHerr
Member

So instead of creating a new column, I created a file for GROA and a file for SRDB that list the ForC measurement IDs when there are records for the same variable within an 11 km cluster (~0.1 degree), for ForC+SRDB vs. GROA and ForC+GROA vs. SRDB respectively.
For example, if within an 11 km cluster there are 2 SRDB GPP measurements and 5 ForC GPP measurements (including GROA), the table is populated with 10 rows: the first column holds the IDs of the ForC measurements, each repeated twice; the second holds the IDs of the SRDB measurements (the IDs they have in ForC), each repeated 5 times.
It is not ideal, but for now it will do. And for the record, this script is generating these files.

Then, when creating ForC_simplified, I load the SRDB file mentioned above and remove any measurement ID that appears in the second column.

I did this for all variables.
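The cross-join described above can be sketched as follows (hypothetical IDs; the actual script operates on clustered ForC tables):

```python
from itertools import product

def cluster_pairs(forc_ids, srdb_ids):
    """All (ForC ID, SRDB ID) pairs for one variable within one 11 km
    cluster: len(forc_ids) * len(srdb_ids) rows."""
    return list(product(forc_ids, srdb_ids))

# e.g. 5 ForC GPP measurements (incl. GROA) and 2 SRDB GPP measurements -> 10 rows
rows = cluster_pairs([101, 102, 103, 104, 105], [901, 902])

# When building ForC_simplified, every ID appearing in the second column is dropped
to_drop = {srdb_id for _, srdb_id in rows}
```

This reproduces the "10 rows" example from the comment: each ForC ID repeated twice, each SRDB ID repeated five times.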

teixeirak added a commit that referenced this issue Sep 27, 2020
@teixeirak
Member Author

Thanks! I want to come back to look at how well this worked more carefully later. Not sure if I'll have time before we submit.

@teixeirak
Member Author

@ValentineHerr, I want to make sure I have a clear understanding of everything that's done to handle the duplicates. I'm currently drafting an appendix on this based on my current understanding, and will then ask you to review.

teixeirak added a commit that referenced this issue Sep 28, 2020
@ValentineHerr
Member

> Potential duplicates were defined as geographically proximate records for stands of similar age with the same variable measured in the same year (if known).

I believe we can say "of the same age (if known)".
I believe this is relevant to duplicate records and not sites, which the section name and the next sentence are about. Maybe the section should be renamed "Detecting and reconciling duplicate records"?

> In cases where site and plot names or reported age differed, our script detected potential duplicate sites that were geographically proximate.

If you are talking about sites, we handle them independently from records. We look at sites within 5 km of each other and then decide whether they need to be merged (but I think there are many that have not been merged, or decided on).
If you are talking about records, this is what I added yesterday: any record coming from SRDB is removed if it is within 11 km of a ForC_prior or GROA measurement of the same variable.

> In cases where a single location -- generally an established research site where multiple investigators have worked -- contained multiple plots in nested or unknown relation to one another, we grouped multiple sites into a "supersite" (e.g., Harvard Forest, Barro Colorado Island, Pasoh Forest Reserve), and duplicates within a supersite were handled in the same way as records with matching site and plot names. (VALENTINE, IS THIS ACCURATE?)

Hmm... unfortunately I don't think I ever got to using supersites... I thought I did but can't find any evidence of it... unless the D.precedence and all was edited by hand while the supersites were assigned to the records...

> For suspected duplicate groups that were not flagged as supersites and had not yet been reviewed, we retained only one potential duplicate record, assigning precedence as follows: (1) original GROA record(s), (2) record(s) in ForC prior to SRDB and GROA import, (3) SRDB record(s).

I am not sure where this rule would be applied. For now, all GROA data that was not identified as duplicate prior to the import has been imported and is considered independent (except maybe a handful that need review), regardless of how far they are from ForC_prior or SRDB records.
Only SRDB records are removed, if they are within 11 km of a ForC_prior or GROA record.

@ValentineHerr
Member

Duplicates are a nightmare... and the script that IDs them is running out of steam as ForC gets bigger...

@ValentineHerr
Member

I pushed everything based on the new rules (mentioned in this issue).

@teixeirak
Member Author

Thanks! Reviewing now.

@teixeirak
Member Author

We've done this as well as possible for now.
