Add GO release stats to pipeline #842

pgaudet · 2018-10-04T10:41:12Z

Hello,

It'd be really useful to have release stats at each GO release (snapshots too, so we could look for anomalies before official releases).

We used to have a lot of queries from GOOSE (I guess), here:
http://geneontology.org/page/go-mysql-database-schema-views

pgaudet · 2018-10-04T10:41:20Z

Full list is:
go_annotation_reports

annotated_publication_total
annotated_publication_total_by_evidence_code_non_additive
annotated_total_gps_by_evidence_code_non_additive
association_contradiction
association_contradiction_count_by_ontology
association_contradiction_direct
association_contradiction_direct_count_by_ontology
association_count_by_association_qualifier
association_count_by_association_qualifier_and_dbname
association_total_by_evidence_code
association_total_by_evidence_code_and_species
avg_total_annotations_per_gp_by_db
avg_total_nonroot_annotations_per_gp_by_db
avg_total_nonroot_pubs_per_gp_by_db
avg_total_nonroot_terms_per_gp_by_db
avg_total_nonroot_transitive_terms_per_gp
avg_total_nonroot_transitive_terms_per_gp_by_db
avg_total_pubs_per_gp_by_db
avg_total_terms_per_gp_by_db
avg_total_transitive_terms_per_gp
avg_total_transitive_terms_per_gp_by_db
evidence_dbxref_summary
evidence_pub_dbxref_summary
gene_product_dbxref_summary
iea_annotated_total_gps
iea_or_iss_annotated_total_gps
non_iea_annotated_total_gps
non_iea_annotated_total_gps_by_dbname
non_iea_or_iss_annotated_total_gps
ont_association_count_by_association_qualifier
seq_dbxref_summary
term_association_count_by_association_qualifier
term_association_count_by_fraction_type
term_association_count_by_fraction_type_and_evidence
term_dbxref_summary
total_annotated_entities_by_dbname_and_type
total_annotations_per_gp
total_gps_by_dbname
total_nonroot_annotations_per_gp
total_nonroot_pubs_per_gp
total_nonroot_terms_per_gp
total_nonroot_transitive_terms_per_gp
total_nonroot_transitive_terms_per_gp_pair
total_pubs
total_pubs_per_gp
total_terms_per_gp
total_transitive_terms_per_gp
total_transitive_terms_per_gp_ont

go_associations

association
association_property
association_qualifier
association_species_qualifier
evidence
evidence_dbxref
gene_product
gene_product_subset
gene_product_synonym
species

go_audit

instance_data
source_audit
term_audit

go_evidence_views

association_evidence_with
association_inference_candidate_pair
db_evidence_summary
ic_evidence
iss_annotation_to_nas_direct
iss_annotation_to_nas_direct_without
nd_evidence
stale_ic_ipr
stale_iss_annotation

go_general

db
dbxref

go_graph

relation_composition
relation_properties
term
term2term

go_graph_views

avg_max_distance_to_leaf_term_by_db
avg_max_distance_to_leaf_term_by_db_and_ontology
avg_max_distance_to_leaf_term_by_db_and_species
distance_to_leaf_stats_by_term
distance_to_root_stats_by_term
leaf_node
max_distance_to_leaf_by_term
max_distance_to_root_by_term
max_max_distance_to_root_by_term
non_root_term
path_to_leaf
path_to_root
root_term
term_ancestor
term_descendent
term_having_max_delta_distance_to_root
term_having_max_max_distance_to_root
term_having_most_paths_to_root
total_paths_to_root_by_term
transitive_association

go_homology

gene_product_ancestor
gene_product_homology
gene_product_homolset
homolset

go_meta

term2term_metadata
term_dbxref
term_definition
term_subset
term_synonym

go_obd_bridge

asserted_link
implied_link
node
node_max_depth

go_optimisations

gene_product_count
graph_path

go_prejoined_views

association_evidence
association_j_evidence
association_j_evidence_j_gene_product
evidence_j_evidence_dbxref_j_dbxref
gene_product_j_dbxref
gene_product_j_dbxref_via_seq
gene_product_j_gene_product_synonym
gene_product_with_term_pair_via_graph
term_j_association
term_j_association_j_evidence_j_gene_product
term_j_association_j_gene_product
term_j_association_j_gene_product_via_graph
term_j_association_j_species_summary_via_graph
term_j_association_j_species_via_graph
term_j_association_via_graph
term_j_term
term_jt_term

go_refgenomes_views

avg_max_distance_to_leaf_term_per_refg_within_refg_species
avg_total_genes_by_homolset_and_ontol
avg_total_transitive_terms_per_refg_gp_for_refspecies
gene_product_in_refg_subset
gene_product_with_subset
gp_outlier_annotation
gp_outlier_annotation_full_report
gp_partial_outlier_annotation_nothing_above
gp_partial_outlier_annotation_nothing_below
homolset_annotation
homolset_annotation_full
homolset_annotation_non_outlier_with_subsumed
homolset_annotation_non_outlier_with_subsumer
homolset_annotation_outlier_full
homolset_annotation_outlier_full2
homolset_annotation_outlier_full_by_checking_ancestors
homolset_annotation_outlier_old
homolset_summary_by_term
homolset_transitive_annotation
homolset_transitive_annotation_full
refg_total_transitive_terms
refg_with_nd
subsumed_by_association
subsumed_by_noniea_association
subsumer_of_association
subsumer_of_noniea_association
total_gps_by_homolset_and_term
trusted_evidence

go_sequence

gene_product_seq
seq
seq_dbxref
seq_property

go_stats_views

annotated_gp_total_by_code
gene_product_count2
implied_annotation
implied_negative_annotation
term_correlation_summary
term_correlation_via_transitive_annotation

go_taxon_views

annotated_species
annotated_species_id
annotated_species_lacks_term
gene_product_count_by_inner_taxon
species_has_term
species_has_term_d
species_lacks_term
species_lacks_term_d

kltm · 2018-10-04T17:32:21Z

@pgaudet

Some stats are currently calculated as part of AmiGO rollouts. If you could make a list of stats that you'd like to see, we could provide them in the pipeline.

Your list is essentially a schema (view) dump, not a report list - nothing here was provided, except as shortcuts for AmiGO and GOOSE.

pgaudet · 2019-07-03T09:46:09Z

@lpalbou is there a new ticket for this ?

lpalbou · 2019-07-17T02:47:38Z

@pgaudet there is a repo https://github.com/geneontology/go-stats/ but we haven't created a ticket yet, so let's use this one to finish quickly.

Stats on GO

@kltm the PyPI package for go-stats is here and the full code (including the PyPI publish setup) is here.

I also created the PR #1145 over go-site with just go_stats.py (which requires only the requests lib so you probably have that already).

To produce the stats from the pipeline (either for a monthly or daily release):

python3 go_stats.py -g http://golr-aux.geneontology.io/solr/ -o reports/

It will produce 3 files in the reports folder:

go-stats.json
go-stats.tsv
go-meta.json (small summary we can use to live fetch data for the go website)

@pgaudet , could you let me know if go-stats.tsv has everything you need ? It can be reorganized if that helps. Note that I wasn't able to produce the number of GPs by taxon and evidence type : the GOLr "bioentity" view doesn't store the types of evidence. We could possibly add this field, as it would also allow enrichment over certain evidence types only.

Notes:

@kltm could you please store these files compressed in S3 ? (e.g. https://geneontology-public.s3.amazonaws.com/go-stats.json is 18k compressed and the browser uncompresses it to 103k)
There may be light updates later to this code but this is a stable and usable version
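
As an illustration of the compressed storage requested above, here is a minimal sketch using only the Python standard library. The helper names gzip_file and read_gzipped_json are hypothetical and are not part of go_stats.py; the upload step itself is left out.

```python
import gzip
import json
from pathlib import Path


def gzip_file(src: Path) -> Path:
    """Write a gzip-compressed copy of src next to it and return its path."""
    dst = src.with_name(src.name + ".gz")
    dst.write_bytes(gzip.compress(src.read_bytes()))
    return dst


def read_gzipped_json(path: Path):
    """Load JSON back from a gzip-compressed file (round-trip check)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

For a browser to decompress the file transparently, the stored object would also need to be served with a Content-Encoding: gzip header; how that is set depends on the upload tool used by the pipeline.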

Diff on GO ontology

If the above works correctly, I will do another PyPI package and PR for the diff script, which requires the lightweight OBO parser.

Diff on GO annotations

Lastly there will be a script to do a diff over the GO annotations of two releases. This will only work after a go-stats has been computed as it requires as input the current and the previous stats.
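
To make the idea concrete, here is a rough sketch of how such a diff could be computed from two sets of stats. It assumes, purely for illustration, that each release's stats can be reduced to a flat mapping from a label (for example an evidence code) to a count; the real go-stats.json layout may differ, and diff_counts is a hypothetical name.

```python
def diff_counts(previous: dict, current: dict) -> dict:
    """Return current minus previous for every key seen in either release.

    A key missing from one release is treated as a count of 0, so newly
    appearing and fully removed categories both show up in the result.
    """
    keys = sorted(set(previous) | set(current))
    return {key: current.get(key, 0) - previous.get(key, 0) for key in keys}
```

Called with the previous and current mappings, it returns a negative number wherever annotations were lost and a positive one wherever they were gained.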

Note about daily releases

These scripts will work with the daily releases but we need a strategy to store these files in something more permanent than a snapshot. For instance, storing all the go-stats of the current month would provide our QC group with an interactive graph to see the day by day evolution of each stat.

pgaudet · 2019-07-17T07:30:09Z

Hi @lpalbou

Thanks for this, it's coming along nicely. A few questions/comments:

I thought there were 2 files, one for ontology and one for annotation. 'go-stats.tsv' is a mix of both; is this right ?
Also, for the GO stats, you used to show newly created terms, newly obsoleted, and newly merged; will this be restored?
Something that would be useful would be to get the species name alongside the NCBI taxon ID.

Thanks !

Pascale

lpalbou · 2019-07-17T19:06:06Z

I thought there were 2 files, one for ontology and one for annotation. 'go-stats.tsv' is a mix of both; is this right ?

@pgaudet the two files we have been using to check the last releases:

go-stats for stats
go-last-changes for diff

What's been confusing however:

there are more ontology stats (e.g. nb of merged terms, structural and meta statements) in go-last-changes than in go-stats
the diff is only for the ontology, not for the annotations

I added the ontology aspects to go-stats, but adding merged terms and structural + meta statements will require more work: the stats were computed from GOLr only, and this requires loading the OBO file.

Also, for the GO stats, you used to show newly created terms, newly obsoleted, and newly merged; will this be restored?

The total number is there but the delta is only in the diff as it requires the previous version. However you are right that I was showing the delta in the UI graph, but I was getting it from the diff files.

Something that would be useful would be to get the species name alongside the NCBI taxon ID

👍

Proposed steps

@kltm checks he can use this script on the pipeline
I add the ontology & annotation diff scripts
I create a master script calling first the ontology diff, then computing the go-stats with it to add the missing ontology stats. At this step, the pipeline will be calling this master script with 4 parameters (GOLr URL; previous OBO file; current OBO file; output folder to store reports)
compute those reports for all releases in release.geneontology.org
permanently store those reports for easy access to stats in time (probably need to alter release.geneontology.org to add a reports/ to each release).
have an API endpoint similar to this to easily retrieve the URLs of each release stats, which will be used to populate the graphs in the go website stats page
Iterate on the prototype of the go website stats page
Iterate on the prototype of an internal stats page for QC / release check
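
The four parameters of the master script mentioned in the steps above could be parsed along these lines. This is only a sketch: the option names are hypothetical and do not describe the confirmed interface of the final script.

```python
import argparse


def parse_args(argv):
    """Parse the four inputs the master script is expected to need."""
    parser = argparse.ArgumentParser(
        description="Compute GO stats and ontology diff reports (sketch)."
    )
    parser.add_argument("-g", "--golr-url", required=True,
                        help="URL of the GOLr instance to query")
    parser.add_argument("-p", "--previous-obo", required=True,
                        help="path or URL of the previous release's OBO file")
    parser.add_argument("-c", "--current-obo", required=True,
                        help="path or URL of the OBO file being released")
    parser.add_argument("-o", "--output-dir", required=True,
                        help="folder where the reports are written")
    return parser.parse_args(argv)
```

Marking all four options as required makes the script fail early with a usage message if the pipeline forgets one of them.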

cmungall · 2019-07-18T00:04:35Z

Let's discuss next steps tomorrow on software call

pgaudet · 2019-07-18T07:50:24Z

https://geneontology-public.s3.amazonaws.com/go-last-changes.tsv

This is so awesome !!

lpalbou · 2019-07-24T08:18:32Z

@pgaudet @kltm the PR to compute all the stats and diffs from the pipeline is here: #1148

7 files generated:

A quick look at go-annotation-changes.tsv indicates we have lost 12 000 ISS (possibly Arabidopsis thaliana ?) as well as a few hundred annotations here and there (ND, IDA, IEP, NAS, EXP), as well as TIGR annotations

Last notes:

still have to add taxon label for readability
in annotation-changes, could remove all the species with 0 differences
we discussed it once, do we want an alternative go-stats without protein binding annotations ?

pgaudet · 2019-07-24T14:00:42Z

still have to add taxon label for readability

yes please

in annotation-changes, could remove all the species with 0 differences

I would keep it for completeness

we discussed it once, do we want an alternative go-stats without protein binding annotations ?

That was for the website. If I understand correctly these here are the pipeline stats. In that case we want everything.

Thanks !! This is looking great !

Pascale

lpalbou · 2019-07-24T15:45:43Z

That was for the website. If I understand correctly these here are the pipeline stats. In that case we want everything.

Two things:

those stats will be in the zenodo archive so accessible to the public. If we don't want that, then they need to be stored somewhere else
there is no other stats file; the website interprets those files to drive the UI graphs, so if we want to show the annotations without protein binding, then it needs to be computed at this level

pgaudet · 2019-07-24T17:00:43Z

those stats will be in the zenodo archive so accessible to the public. If we don't want that, then they need to be stored somewhere else

Sure OK, the archive can have the full file - @thomaspd @cmungall @kltm is that OK for you ?

there is no other stats file; the website interprets those files to drive the UI graphs, so if we want to show the annotations without protein binding, then it needs to be computed at this level

OK fine, thanks for the information.

kltm · 2019-07-25T08:28:20Z

@pgaudet The stats will be in the archive and the standard locations (release, snapshot, current).

RLovering · 2019-07-25T09:01:59Z

Hi
sorry please would you provide the links to the standard locations (release, snapshot, current) as I am not sure where these are
Thanks
Ruth

kltm · 2019-07-25T09:09:01Z

@RLovering This is not yet in the pipeline (v/soon), but the locations are:
http://wiki.geneontology.org/index.php/Release_Pipeline#Data_publishing_and_access


kltm · 2019-08-07T19:07:43Z

Looking good, with results on skyhook branch. Just needs to be folded in.

kltm · 2019-09-23T04:53:27Z

@lpalbou
Running your script basically as:

python3 /tmp/go_reports.py -g http://localhost:8080/solr/ -s https://geneontology.s3.amazonaws.com/temporary/2019-july/go-stats.json -n https://geneontology.s3.amazonaws.com/temporary/2019-july/go-stats-no-pb.json -c http://current.geneontology.org/ontology/go.obo -p https://geneontology.s3.amazonaws.com/archive/2019-07-01_go.obo -o /tmp/stats/ -d $START_DATE

I've been running into this error for a little bit:

[2019-09-23T00:42:13.853Z] Traceback (most recent call last):
[2019-09-23T00:42:13.853Z]   File "/tmp/go_reports.py", line 295, in <module>
[2019-09-23T00:42:13.853Z]     main(sys.argv[1:])
[2019-09-23T00:42:13.853Z]   File "/tmp/go_reports.py", line 146, in main
[2019-09-23T00:42:13.853Z]     json_stats = go_stats.compute_stats(golr_url, release_date)
[2019-09-23T00:42:13.853Z]   File "/tmp/go_stats.py", line 255, in compute_stats
[2019-09-23T00:42:13.853Z]     prepare_globals(all_annotations)
[2019-09-23T00:42:13.853Z]   File "/tmp/go_stats.py", line 310, in prepare_globals
[2019-09-23T00:42:13.853Z]     check = taxon_map['9606'] == 'Homo sapiens'
[2019-09-23T00:42:13.853Z] KeyError: '9606'

Looking through go_stats.py, it seems to have a certain set of species hardwired, and if they are not available it crashes out (my read on what's going on). The pipeline often runs in modes where not all species are available. Can we make these flags, or maybe better to just bypass/catch with a warning--the QC/QA to look at values will occur elsewhere.
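
One way the "catch with a warning" option could look, as a sketch only: instead of indexing taxon_map directly (which raises KeyError when a species was not loaded), report the missing IDs and carry on. The function name missing_taxa and the logger name are hypothetical; this is not the actual change made to go_stats.py.

```python
import logging

logger = logging.getLogger("go_stats_sketch")


def missing_taxa(taxon_map: dict, expected_ids) -> list:
    """Return the expected taxon IDs absent from taxon_map, warning on each.

    Nothing is raised, so a pipeline run on a subset of species can continue.
    """
    missing = [taxon_id for taxon_id in expected_ids if taxon_id not in taxon_map]
    for taxon_id in missing:
        logger.warning("taxon %s not loaded; its stats will be skipped", taxon_id)
    return missing
```

The caller can then decide whether an empty or partial result is acceptable for the current pipeline mode.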

lpalbou · 2019-09-23T16:43:36Z

@kltm I didn't expect the pipeline to ever run without the human species. There is indeed a static test to check if the human species was loaded. There is also a list of reference species which should always be here and could trigger an error if you don't load them - see reference_genomes_ids.

I will push a commit today to make the code robust to pipeline modes that don't load these species. Is there documentation for these pipeline modes ?

Also, I think there is an error in your command: -c must point to the new go.obo, but you used -c http://current.geneontology.org/ontology/go.obo, which points to the last released go.obo, not the one currently being produced by the pipeline.

kltm · 2019-09-23T20:45:12Z

@lpalbou Thank you for the fast turnaround. There is no particular documentation for this requirement, just that the pipeline must be able to run any arbitrary subset (or any given GAF/ontology combination), which necessarily includes non-human runs.

Ah--thank you for the catch--I have that fixed on branch, ready for the next round.

lpalbou · 2019-09-23T22:16:20Z

@kltm this commit should work with your requirements, let me know if anything else comes up.

kltm · 2019-09-23T22:30:55Z

Great--testing.

kltm · 2019-09-24T01:37:33Z

If current master test passes, we'll merge #1181 and switch over to new stats.
Before the next release then, we'll have to get to #1182

lpalbou · 2019-09-24T01:44:24Z

Sounds good to me 👍

lpalbou · 2019-09-26T18:30:29Z

@kltm just a quick reminder to be on the safe side: don't forget to launch the python3 aggregate-stats.py command (to create the aggregate summaries for the website) after the python3 go_reports.py

kltm · 2019-09-26T18:53:54Z

Is this not correct? https://github.com/geneontology/pipeline/blob/master/Jenkinsfile#L861

pgaudet · 2019-11-05T13:24:25Z

hi @kltm

How come this is not in a project ? I think it should be in 2019-10 (Berkeley) Data Release Pipeline 1.3

Thanks, Pascale

kltm · 2019-11-05T21:54:53Z

@pgaudet I have no idea...maybe I was trying to close and missed? Or maybe I set it free so it could be added to the new project (as SOP)?

pgaudet · 2019-11-21T11:49:03Z

@lpalbou can we close ?
and perhaps make new tickets for what remains to be done ?

kltm added the question label Oct 4, 2018

pgaudet assigned lpalbou Jul 3, 2019

kltm added this to In progress in DONE 2019-10 (Berkeley) Data Release Pipeline 1.3 Jul 17, 2019

kltm added enhancement and removed question labels Jul 17, 2019

kltm changed the title ~~GO release stats~~ Add GO release stats to pipeline Jul 17, 2019

kltm mentioned this issue Jul 17, 2019

go-stats : compute stats on go annotations from a GOLr instance #1145

Merged

kltm added a commit to geneontology/pipeline that referenced this issue Aug 5, 2019

initial attempt at continuing solr after stats; work on geneontology/go-site#842

b1448b9

kltm added a commit to geneontology/pipeline that referenced this issue Aug 5, 2019

forgot non-master status; work on geneontology/go-site#842

1f58262

kltm added a commit to geneontology/pipeline that referenced this issue Aug 6, 2019

use command runner, new image, and add lib; work on geneontology/go-site#842

aaad1e3

kltm added a commit to geneontology/pipeline that referenced this issue Aug 7, 2019

better environment in image; work on geneontology/go-site#842

30337e5

kltm assigned kltm and unassigned lpalbou Aug 7, 2019

lpalbou mentioned this issue Aug 12, 2019

Create public stats page geneontology/geneontology.github.io#175

Closed

4 tasks

kltm added a commit to geneontology/pipeline that referenced this issue Sep 20, 2019

dumb typo; work on geneontology/go-site#842

b5d87e2

kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019

fix wrong ontology pointed out by @lpalbou; work on geneontology/go-site#842

d9535d1

kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019

try and speed up testing cycle; work on geneontology/go-site#842

f3285f3

kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019

paren whoops; work on geneontology/go-site#842

f222883

kltm mentioned this issue Sep 24, 2019

Issue 842 add stats #1181

Merged

kltm added a commit to geneontology/pipeline that referenced this issue Sep 24, 2019

test full master version; work on geneontology/go-site#842

1e05fbc

kltm mentioned this issue Sep 24, 2019

Change pipeline stats command to "final form" #1182

Closed

kltm added a commit to geneontology/pipeline that referenced this issue Sep 24, 2019

remove snapshot paths in master; work on geneontology/go-site#842

4c36cb1

kltm added a commit to geneontology/pipeline that referenced this issue Sep 25, 2019

update to snapshot; work on geneontology/go-site#842

4c2d2bd

kltm added a commit to geneontology/pipeline that referenced this issue Sep 25, 2019

update to release; work on geneontology/go-site#842

a3f7ff0

kltm added a commit to geneontology/pipeline that referenced this issue Sep 26, 2019

revert master for merged geneontology/go-site#842

0d1b37f

kltm added a commit to geneontology/pipeline that referenced this issue Sep 26, 2019

non-wrong command for stats; work on geneontology/go-site#842

abc4d3f

kltm added a commit to geneontology/pipeline that referenced this issue Sep 27, 2019

update; work on geneontology/go-site#842

450ae4c

kltm moved this from In progress to Clearing in DONE 2019-10 (Berkeley) Data Release Pipeline 1.3 Oct 2, 2019

kltm mentioned this issue Oct 15, 2019

Add ontology statistics / graphs page #179

Closed

kltm mentioned this issue Oct 24, 2019

Update stats script to final version geneontology/pipeline#141

Merged

kltm removed this from Clearing in DONE 2019-10 (Berkeley) Data Release Pipeline 1.3 Oct 24, 2019

pgaudet closed this as completed Jun 24, 2020
