Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create script to do some basic database checking #23

Closed
6 of 9 tasks
teixeirak opened this issue Aug 14, 2017 · 21 comments
Closed
6 of 9 tasks

Create script to do some basic database checking #23

teixeirak opened this issue Aug 14, 2017 · 21 comments

Comments

@teixeirak
Copy link
Member

teixeirak commented Aug 14, 2017

It would be good to have a script here (public database) that can be run whenever there is a substantial update to the database to check that the structure of the database remains correct and that there are no egregious errors in values.

Here's the start of a list of things to check:

  • For each site-plot combination in MEASUREMENTS, there is a corresponding site-plot record in PLOTS, and there are no records in PLOTS that lack corresponding records in MEASUREMENTS.
  • For each site in PLOTS, there is a corresponding record in SITES, and there are no sites in SITES that lack records in PLOTS.
  • For each site-plot combination in PLOTS, there is at least one corresponding record in HISTORY, and there are no records in HISTORY that lack corresponding records in PLOTS (the latter can be identified based on whether the site-plot combination and the historyID show up in PLOTS. see metadata to understand how historyID works in PLOTS).
  • All other fields that link across tables are represented once in the table where they are defined and 1+ times in the table where they are used.
Variable Table defined Field(s) used
PFTcode PFT MEASUREMENTS:dominantveg
distcat DISTTYPE HISTORY:distcat
disttype DISTTYPE HISTORY:distcat, PLOTS(various)

(THIS TABLE NEEDS TO BE COMPLETED BASED ON THE RELATIONSHIP ENTITY DIAGRAM, which needs to be updated.)

  • Within each field, all entries are either data of the type specified for that field (metadata tables) or a missing data code.
  • All new records in MEASUREMENTS:mean fall within the range reported for that variable in the VARIABLES table. (It is possible that valid new records will fall outside the range, but script should generate a warning: "Measurement falls outside the range of values that currently exist for this variable in the database. Check value, and if valid update range for this variable in VARIABLES."
  • For all other numerical fields, all records fall within the range specified in the metadata tables (metadata tables). Any values that fall outside the current range are flagged for screening.
  • For all categorical variables, all records match one of the defined variable codes (metadata tables).
  • Check for records where stands of different age are represented by a single plot (see Check for and correct any records where different-aged stands are identified by a single plot #22)
@teixeirak
Copy link
Member Author

@bpbond, you may also be interested in this one.

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

Started on this.

There are 756 measurements with no corresponding plot record
There are 175 plots with no corresponding measurement record

Before I go farther @teixeirak is something like this what you're thinking of? See code at https://github.com/bpbond/ForC/blob/qc-script/scripts/checks.R

@teixeirak
Copy link
Member Author

From a quick look, yes, this does look like what I have in mind. We will want to generate text or files listing the errors.

I'm surprised at the number of errors. When we checked a few months ago, there were none. If you can generate a file listing the errors, I'll figure out what's wrong. I suspect that some very minor changes to plot names (e.g., trailing space) got introduced.

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

OK. I will trim white space, implement the other checks on your list, and post the results.

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

It looks like most of the mismatches are coming on non-English site names that are encoded, e.g., as Xiaoxing''anling in one table but Xiaoxing'anling in another. no

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

OK, the script now checks all the issues above except the last (and I'll do that in a bit). Should I start listing the issues one by one here for you to check and resolve?

@teixeirak
Copy link
Member Author

I don't understand why you crossed out the comment about site names. This does make sense; specifically, I believe that my Matlab script that generates ForC_plots from ForC_history (I will move to public repository soon) introduces this error to sites whose name includes the ' character.

@teixeirak
Copy link
Member Author

It would be great if you could program the script to generate files reporting the errors.
In the long run, I'd like to use this script as something that database managers can run to check new pull requests.

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

@teixeirak This is (apostrophes in site names) definitely a factor, but not the only one.

It would be great if you could program the script to generate files reporting the errors.

Easy, no problem.

In the long run, I'd like to use this script as something that database managers can run to check new pull requests.

Yes yes yes! I'm going to open an issue with a suggestion in this regard.

@bpbond
Copy link
Contributor

bpbond commented Aug 15, 2017

I'm going to open separate issues for all the mismatches, OK? Seems like a cleaner way to deal with things.

@teixeirak
Copy link
Member Author

yes, thanks

@bpbond
Copy link
Contributor

bpbond commented Aug 18, 2017

All new records in MEASUREMENTS:mean fall within the range reported for that variable in the VARIABLES table. (It is possible that valid new records will fall outside the range, but script should generate a warning: "Measurement falls outside the range of valies that currently exist for this variable in the database. Check value, and if valid update range for this variable in VARIABLES."

Note that this isn't currently possible, as the min and max of VARIABLES don't have any data.

@ValentineHerr
Copy link
Member

@bpbond I am running into errors with https://github.com/forc-db/ForC/blob/master/scripts/checks.R:

  • line 111: "Error: by can't contain join column distcat, disttype which is missing from LHS"
  • line 123: "Error in (function (classes, fdef, mtable) :
    unable to find an inherited method for function ‘select’ for signature ‘"tbl_df"’"

I should probably start getting more into Hadley Wickham's packages but I find his system harder to troubleshoot so I'll let you handle that! ;-)

It might just be that we need to load another library...

@bpbond
Copy link
Contributor

bpbond commented Aug 23, 2017

@ValentineHerr Uh oh, this is probably my bad–I think @teixeirak renamed those columns and I didn't properly update the script (or the changes didn't get pushed to the PR). Apologies. I'll turn to this in a bit and look.

teixeirak added a commit that referenced this issue Aug 24, 2017
Fix column name issue found by @ValentineHerr in #23
@ValentineHerr
Copy link
Member

line 111 is catching up some "meaningless" discrepancy between "regrowth(_prior)" in HIST and "Regrowth" in HISTORY (same with "Disturbance(_prior)" and "Disturbance").
@bpbond, could you adapt script to look for pattern "Regrowth" or "Disturbance" instead of matching exactly the character strings?
Krista and I discussed the option of copying the X(_prior) in HIST but that means if something changes in histtype we need to remember to change it for both X and X(_prior). That is why we decided that looking for pattern was better than duplicating descriptions. If that is not too complicated.

@bpbond
Copy link
Contributor

bpbond commented Aug 25, 2017

Is it just for these two, or should there be a general rule that "xxxx(yyy)" always matches "xxxx"?

@teixeirak
Copy link
Member Author

It is just those two.

@bpbond
Copy link
Contributor

bpbond commented Aug 25, 2017

> unique(HISTORY$histcat)
[1] "Disturbance"       "Regrowth"          "Establishment"     "No.disturbance"    "No.info"           "Management"       
[7] "Disturbance_prior" "Regrowth_prior"   

> unique(HIST$histcat)
[1] "Establishment"       "Regrowth(_prior)"    "Disturbance(_prior)" "Management"          "No.disturbance"     
[6] "No.info"       

So, is this what is wanted?

HISTORY$histcat HIST$histcat Should match?
Disturbance Disturbance(_prior) Yes
Regrowth Regrowth(_prior) Yes
Disturbance_prior Disturbance(_prior) Yes?
Regrowth_prior Regrowth(_prior) Yes?

@teixeirak
Copy link
Member Author

Yes.

Disturbance_prior matches Disturbance(_prior), and same for regrowth.

@bpbond
Copy link
Contributor

bpbond commented Aug 25, 2017

Should "Fertilization_N", "Fertilization_P" etc. in HISTORY all match "Fertilization_X" in HIST?

@teixeirak
Copy link
Member Author

Yes. Sorry I missed that before.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants