Add GO release stats to pipeline #842
Full list is:
go_associations
go_audit
go_evidence_views
go_general
go_graph
go_graph_views
go_homology
go_meta
go_obd_bridge
go_optimisations
go_prejoined_views
go_refgenomes_views
go_sequence
go_stats_views
go_taxon_views
|
Some stats are currently calculated as part of AmiGO rollouts. If you could make a list of stats that you'd like to see, we could provide them in the pipeline. Your list is essentially a schema (view) dump, not a report list - nothing here was provided, except as shortcuts for AmiGO and GOOSE. |
@lpalbou is there a new ticket for this ? |
@pgaudet there is a repo https://github.com/geneontology/go-stats/ but we haven't created a ticket yet, so let's use this one to finish quickly.
Stats on GO
@kltm the PyPI package of go-stats is here and the full code (including the PyPI publish setup) is here. I also created PR #1145 on go-site with just go_stats.py (which requires only the requests library, so you probably have it already). To produce the stats from the pipeline (for either a monthly or a daily release):
It will produce 3 files in the folder reports :
@pgaudet, could you let me know if you have everything in go-stats.tsv? It can be reorganized if that helps. Note that I wasn't able to produce the number of GPs by taxon and evidence type: the GOLr "bioentity" view doesn't store the evidence types. We could possibly add this field, as it would also allow enrichment over certain evidence types only.
Notes:
Diff on GO ontology
If the above works correctly, I will make another PyPI package and PR for the diff script, which requires the lightweight OBO parser.
Diff on GO annotations
Lastly, there will be a script to diff the GO annotations of two releases. It will only work after go-stats has been computed, as it takes the current and the previous stats as input.
Note about daily releases
These scripts will work with the daily releases, but we need a strategy to store these files somewhere more permanent than a snapshot. For instance, storing all the go-stats of the current month would provide our QC group with an interactive graph to see the day-by-day evolution of each stat. |
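For illustration only, the idea of diffing two releases from their precomputed stats can be sketched as below. This is not the actual go-stats diff script: the function name `diff_stats`, the flat `{evidence code: count}` layout, and the numbers are all assumptions made up for the example.

```python
from typing import Dict


def diff_stats(previous: Dict[str, int], current: Dict[str, int]) -> Dict[str, int]:
    """Return current minus previous for every key seen in either release.

    Hypothetical helper: a key missing from one release is counted as 0,
    so annotation types that appear or disappear still show up in the diff.
    """
    keys = sorted(set(previous) | set(current))
    return {key: current.get(key, 0) - previous.get(key, 0) for key in keys}


# Made-up counts per evidence code, only to show the shape of the result.
previous_release = {"ISS": 20000, "IDA": 5000, "ND": 300}
current_release = {"ISS": 8000, "IDA": 4900, "EXP": 10}

changes = diff_stats(previous_release, current_release)
for code, delta in changes.items():
    print(f"{code}\t{delta:+d}")
```

A negative value flags a loss relative to the previous release, which is the kind of anomaly a QC pass would want to look at first.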
Hi @lpalbou Thanks for this, it's coming along nicely. A few questions/comments:
Thanks ! Pascale |
@pgaudet the two files we have been using to check the last releases:
What's been confusing however:
I added the ontology aspects in go-stats, but adding merged terms and structural + meta statements will require more work: the stats were computed from GOLr only, and these additions require loading the OBO file.
The total number is there, but the delta is only in the diff, as it requires the previous version. You are right, however, that I was showing the delta in the UI graph; I was getting it from the diff files.
👍 Proposed steps
|
Let's discuss next steps tomorrow on software call |
This is so awesome !! |
@pgaudet @kltm the PR to compute all the stats and diffs from the pipeline is here: #1148
7 files generated:
A quick look at go-annotation-changes.tsv indicates we have lost 12,000 ISS annotations (possibly Arabidopsis thaliana?), a few hundred annotations here and there (ND, IDA, IEP, NAS, EXP), as well as TIGR annotations.
Last notes:
|
yes please
I would keep it for completeness
That was for the website. If I understand correctly these here are the pipeline stats. In that case we want everything. Thanks !! This is looking great ! Pascale |
Two things:
|
Sure. OK, the archive can have the full file. @thomaspd @cmungall @kltm is that OK for you?
OK fine, thanks for the information. |
@pgaudet The stats will be in the archive and the standard locations (release, snapshot, current). |
Hi |
@RLovering This is not yet in the pipeline (it will be very soon), but the locations are: |
Looking good, with results on skyhook branch. Just needs to be folded in. |
@lpalbou
I've been running into this error for a little bit:
Looking through go_stats.py, it seems to have a certain set of species hardwired, and if they are not available it crashes out (my read on what's going on). The pipeline often runs in modes where not all species are available. Can we make these flags, or maybe better to just bypass/catch with a warning--the QC/QA to look at values will occur elsewhere. |
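A minimal sketch of the "catch with a warning" option, assuming the check boils down to comparing the loaded taxa against a list of expected reference species. The function name `missing_reference_species` and its arguments are hypothetical and not taken from go_stats.py.

```python
import warnings
from typing import Iterable, List


def missing_reference_species(loaded: Iterable[str], expected: Iterable[str]) -> List[str]:
    """Warn about expected species that were not loaded, instead of raising.

    Hypothetical helper: returns the missing taxon IDs so the caller can
    skip their stats and let QC/QA happen elsewhere.
    """
    loaded_set = set(loaded)
    missing = [taxon for taxon in expected if taxon not in loaded_set]
    for taxon in missing:
        warnings.warn(f"reference species {taxon} not loaded; skipping its stats")
    return missing


# Example run where only mouse was loaded (NCBITaxon:9606 is human).
skipped = missing_reference_species(
    loaded=["NCBITaxon:10090"],
    expected=["NCBITaxon:9606", "NCBITaxon:10090"],
)
print(skipped)
```

Returning the list rather than exiting lets a partial pipeline run finish and still report which species were skipped.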
@kltm I didn't expect the pipeline to ever run without the human species. There is indeed a static test to check whether the human species was loaded. There is also a list of reference species that should always be present and could trigger an error if you don't load them: see reference_genomes_ids. I will do a commit today to harden the code for your pipeline modes that don't load these species. Is there documentation for these pipeline modes? Also, I think there is an error in your query: -c must point to the new go.obo and you used
@lpalbou Thank you for the fast turnaround. There is no particular documentation for this requirement, just that the pipeline must be able to run any arbitrary subset (or any given GAF/ontology combination), which necessarily includes non-human runs. Ah--thank you for the catch--I have that fixed on branch, ready for the next round. |
@kltm this commit should work with your requirements, let me know if anything else comes up. |
Great--testing. |
Sounds good to me 👍 |
@kltm just a quick reminder to be on the safe side: don't forget to launch the python3 aggregate-stats.py command (to create the aggregate summaries for the website) after the python3 go_reports.py |
Is this not correct? https://github.com/geneontology/pipeline/blob/master/Jenkinsfile#L861 |
hi @kltm How come this is not in a project ? I think it should be in 2019-10 (Berkeley) Data Release Pipeline 1.3 Thanks, Pascale |
@pgaudet I have no idea...maybe I was trying to close and missed? Or maybe I set it free so it could be added to the new project (as SOP)? |
@lpalbou can we close ? |
Hello,
It'd be really useful to have release stats at each GO release (snapshots too, so we could look for anomalies before official releases).
We used to have a lot of queries from GOOSE (I guess), here:
http://geneontology.org/page/go-mysql-database-schema-views