Add GO release stats to pipeline #842

Closed
pgaudet opened this issue Oct 4, 2018 · 30 comments

@pgaudet
Contributor

pgaudet commented Oct 4, 2018

Hello,

It'd be really useful to have release stats at each GO release (snapshots too, so we could look for anomalies before official releases).

We used to have a lot of queries from GOOSE (I guess), listed here:
http://geneontology.org/page/go-mysql-database-schema-views

@pgaudet
Contributor Author

pgaudet commented Oct 4, 2018

Full list is:
go_annotation_reports

annotated_publication_total
annotated_publication_total_by_evidence_code_non_additive
annotated_total_gps_by_evidence_code_non_additive
association_contradiction
association_contradiction_count_by_ontology
association_contradiction_direct
association_contradiction_direct_count_by_ontology
association_count_by_association_qualifier
association_count_by_association_qualifier_and_dbname
association_total_by_evidence_code
association_total_by_evidence_code_and_species
avg_total_annotations_per_gp_by_db
avg_total_nonroot_annotations_per_gp_by_db
avg_total_nonroot_pubs_per_gp_by_db
avg_total_nonroot_terms_per_gp_by_db
avg_total_nonroot_transitive_terms_per_gp
avg_total_nonroot_transitive_terms_per_gp_by_db
avg_total_pubs_per_gp_by_db
avg_total_terms_per_gp_by_db
avg_total_transitive_terms_per_gp
avg_total_transitive_terms_per_gp_by_db
evidence_dbxref_summary
evidence_pub_dbxref_summary
gene_product_dbxref_summary
iea_annotated_total_gps
iea_or_iss_annotated_total_gps
non_iea_annotated_total_gps
non_iea_annotated_total_gps_by_dbname
non_iea_or_iss_annotated_total_gps
ont_association_count_by_association_qualifier
seq_dbxref_summary
term_association_count_by_association_qualifier
term_association_count_by_fraction_type
term_association_count_by_fraction_type_and_evidence
term_dbxref_summary
total_annotated_entities_by_dbname_and_type
total_annotations_per_gp
total_gps_by_dbname
total_nonroot_annotations_per_gp
total_nonroot_pubs_per_gp
total_nonroot_terms_per_gp
total_nonroot_transitive_terms_per_gp
total_nonroot_transitive_terms_per_gp_pair
total_pubs
total_pubs_per_gp
total_terms_per_gp
total_transitive_terms_per_gp
total_transitive_terms_per_gp_ont

go_associations

association
association_property
association_qualifier
association_species_qualifier
evidence
evidence_dbxref
gene_product
gene_product_subset
gene_product_synonym
species

go_audit

instance_data
source_audit
term_audit

go_evidence_views

association_evidence_with
association_inference_candidate_pair
db_evidence_summary
ic_evidence
iss_annotation_to_nas_direct
iss_annotation_to_nas_direct_without
nd_evidence
stale_ic_ipr
stale_iss_annotation

go_general

db
dbxref

go_graph

relation_composition
relation_properties
term
term2term

go_graph_views

avg_max_distance_to_leaf_term_by_db
avg_max_distance_to_leaf_term_by_db_and_ontology
avg_max_distance_to_leaf_term_by_db_and_species
distance_to_leaf_stats_by_term
distance_to_root_stats_by_term
leaf_node
max_distance_to_leaf_by_term
max_distance_to_root_by_term
max_max_distance_to_root_by_term
non_root_term
path_to_leaf
path_to_root
root_term
term_ancestor
term_descendent
term_having_max_delta_distance_to_root
term_having_max_max_distance_to_root
term_having_most_paths_to_root
total_paths_to_root_by_term
transitive_association

go_homology

gene_product_ancestor
gene_product_homology
gene_product_homolset
homolset

go_meta

term2term_metadata
term_dbxref
term_definition
term_subset
term_synonym

go_obd_bridge

asserted_link
implied_link
node
node_max_depth

go_optimisations

gene_product_count
graph_path

go_prejoined_views

association_evidence
association_j_evidence
association_j_evidence_j_gene_product
evidence_j_evidence_dbxref_j_dbxref
gene_product_j_dbxref
gene_product_j_dbxref_via_seq
gene_product_j_gene_product_synonym
gene_product_with_term_pair_via_graph
term_j_association
term_j_association_j_evidence_j_gene_product
term_j_association_j_gene_product
term_j_association_j_gene_product_via_graph
term_j_association_j_species_summary_via_graph
term_j_association_j_species_via_graph
term_j_association_via_graph
term_j_term
term_jt_term

go_refgenomes_views

avg_max_distance_to_leaf_term_per_refg_within_refg_species
avg_total_genes_by_homolset_and_ontol
avg_total_transitive_terms_per_refg_gp_for_refspecies
gene_product_in_refg_subset
gene_product_with_subset
gp_outlier_annotation
gp_outlier_annotation_full_report
gp_partial_outlier_annotation_nothing_above
gp_partial_outlier_annotation_nothing_below
homolset_annotation
homolset_annotation_full
homolset_annotation_non_outlier_with_subsumed
homolset_annotation_non_outlier_with_subsumer
homolset_annotation_outlier_full
homolset_annotation_outlier_full2
homolset_annotation_outlier_full_by_checking_ancestors
homolset_annotation_outlier_old
homolset_summary_by_term
homolset_transitive_annotation
homolset_transitive_annotation_full
refg_total_transitive_terms
refg_with_nd
subsumed_by_association
subsumed_by_noniea_association
subsumer_of_association
subsumer_of_noniea_association
total_gps_by_homolset_and_term
trusted_evidence

go_sequence

gene_product_seq
seq
seq_dbxref
seq_property

go_stats_views

annotated_gp_total_by_code
gene_product_count2
implied_annotation
implied_negative_annotation
term_correlation_summary
term_correlation_via_transitive_annotation

go_taxon_views

annotated_species
annotated_species_id
annotated_species_lacks_term
gene_product_count_by_inner_taxon
species_has_term
species_has_term_d
species_lacks_term
species_lacks_term_d

@kltm
Member

kltm commented Oct 4, 2018

@pgaudet

Some stats are currently calculated as part of AmiGO rollouts. If you could make a list of stats that you'd like to see, we could provide them in the pipeline.

Your list is essentially a schema (view) dump, not a report list - nothing here was provided, except as shortcuts for AmiGO and GOOSE.

@pgaudet
Contributor Author

pgaudet commented Jul 3, 2019

@lpalbou is there a new ticket for this ?

@lpalbou
Contributor

lpalbou commented Jul 17, 2019

@pgaudet there is a repo https://github.com/geneontology/go-stats/ but we haven't created any ticket yet, so let's use this one to finish quickly.

Stats on GO

@kltm the PyPI package for go-stats is here and the full code (including PyPI publish setup) is here.

I also created the PR #1145 over go-site with just go_stats.py (which requires only the requests lib so you probably have that already).

To produce the stats from the pipeline (either for a monthly or daily release):

python3 go_stats.py -g http://golr-aux.geneontology.io/solr/ -o reports/

It will produce 3 files in the reports folder:

@pgaudet, could you let me know if you have everything in go-stats.tsv? It can be reorganized if that helps. Note that I wasn't able to produce the number of GPs by taxon and evidence type: the GOLr "bioentity" view doesn't store the types of evidence. We could possibly add this field, as it would also allow enrichment over certain evidence types only.
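
For readers unfamiliar with GOLr: it is a Solr instance, and counts of this kind are typically read from Solr facet responses. The following is a minimal sketch, not the code in go_stats.py. The field name `taxon` and the sample response are assumptions for illustration; the only thing relied on is the standard Solr JSON layout, where each entry under `facet_fields` is a flat list alternating value and count.

```python
def facet_counts(solr_response, field):
    """Turn Solr's flat [value, count, value, count, ...] facet list into a dict."""
    flat = (
        solr_response.get("facet_counts", {})
        .get("facet_fields", {})
        .get(field, [])
    )
    # Even positions hold the facet values, odd positions hold their counts.
    return dict(zip(flat[0::2], flat[1::2]))


# Hypothetical response fragment (field name and numbers are made up).
sample = {
    "response": {"numFound": 5},
    "facet_counts": {
        "facet_fields": {"taxon": ["NCBITaxon:9606", 3, "NCBITaxon:10090", 2]}
    },
}
by_taxon = facet_counts(sample, "taxon")
```

A count by taxon and evidence type would need both fields to be present on the same document category, which is the limitation described above for the "bioentity" view.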

Notes:

Diff on GO ontology

If the above works correctly, I will do another PyPI package and PR for the diff script, which requires the lightweight obo parser.

Diff on GO annotations

Lastly there will be a script to do a diff over the GO annotations of two releases. This will only work after a go-stats has been computed as it requires as input the current and the previous stats.

Note about daily releases

These scripts will work with the daily releases but we need a strategy to store these files in something more permanent than a snapshot. For instance, storing all the go-stats of the current month would provide our QC group with an interactive graph to see the day by day evolution of each stat.
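
As a rough illustration of the day-by-day idea, here is a sketch that turns a set of stored daily stats into a time series for one stat. The storage layout (a dict keyed by date) and the stat names are assumptions; nothing here reflects how the pipeline actually stores its files.

```python
from datetime import date


def stat_series(daily_stats, key):
    """Return [(day, value)] sorted by day for one stat, skipping days that lack it."""
    return [
        (day, stats[key])
        for day, stats in sorted(daily_stats.items())
        if key in stats
    ]


# Hypothetical daily snapshots; the keys "annotations" and "terms" are made up.
daily = {
    date(2019, 7, 2): {"annotations": 1010},
    date(2019, 7, 1): {"annotations": 1000},
    date(2019, 7, 3): {"terms": 45000},
}
series = stat_series(daily, "annotations")
```

The resulting list of (day, value) pairs is the shape most plotting libraries expect for a line graph.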

@pgaudet
Contributor Author

pgaudet commented Jul 17, 2019

Hi @lpalbou

Thanks for this, it's coming along nicely. A few questions/comments:

  • I thought there were 2 files, one for ontology and one for annotation. 'go-stats.tsv' is a mix of both; is this right ?

  • Also, for the GO stats, you used to show newly created terms, newly obsoleted, and newly merged; will this be restored?

  • Something that would be useful would be to get the species name alongside the NCBI taxon ID.

Thanks !

Pascale

@lpalbou
Contributor

lpalbou commented Jul 17, 2019

I thought there were 2 files, one for ontology and one for annotation. 'go-stats.tsv' is a mix of both; is this right ?

@pgaudet the two files we have been using to check the last releases:

What's been confusing however:

  • there are more ontology stats (e.g. nb of merged terms, structural and meta statements) in go-last-changes than in go-stats
  • the diff is only for the ontology, not for the annotations

I added the ontology aspects in go-stats, but adding merged terms and structural + meta statements will require more work, as the stats were computed from GOLr only and this requires loading the OBO file.

Also, for the GO stats, you used to show newly created terms, newly obsoleted, and newly merged; will this be restored?

The total number is there but the delta is only in the diff as it requires the previous version. However you are right that I was showing the delta in the UI graph, but I was getting it from the diff files.
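
To make the totals-versus-delta distinction concrete, a minimal sketch: a delta can only be computed once two sets of totals exist. The stat names and numbers below are hypothetical and do not come from any go-stats file.

```python
def stats_delta(previous, current):
    """Per-stat difference (current minus previous); a missing stat counts as 0."""
    keys = sorted(set(previous) | set(current))
    return {key: current.get(key, 0) - previous.get(key, 0) for key in keys}


# Hypothetical totals for two consecutive releases.
previous_totals = {"terms_valid": 44900, "terms_obsolete": 2800}
current_totals = {"terms_valid": 44950, "terms_obsolete": 2810, "terms_merged": 12}
delta = stats_delta(previous_totals, current_totals)
```

A single release's stats file holds only the totals; anything like `delta` has to live in a diff report that takes both releases as input.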

Something that would be useful would be to get the species name alongside the NCBI taxon ID

👍

Proposed steps

  • @kltm checks he can use this script on the pipeline
  • I add the ontology & annotation diff scripts
  • I create a master script calling first the ontology diff, then computing the go-stats with it to add the missing ontology stats. At this step, the pipeline will be calling this master script with 4 parameters (GOLr URL; previous OBO file; current OBO file; output folder to store reports)
  • compute those reports for all releases in release.geneontology.org
  • permanently store those reports for easy access to stats in time (probably need to alter release.geneontology.org to add a reports/ to each release).
  • have an API endpoint similar to this to easily retrieve the URLs of each release stats, which will be used to populate the graphs in the go website stats page
  • Iterate on the prototype of the go website stats page
  • Iterate on the prototype of an internal stats page for QC / release check
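
The master script step above could look roughly like the following command-line skeleton. This is only a sketch: the long option names and the step names are hypothetical, chosen to mirror the four parameters listed (GOLr URL, previous OBO file, current OBO file, output folder), and the actual diff and stats computations are left out.

```python
import argparse


def build_parser():
    """Command-line interface with the four parameters named in the plan."""
    parser = argparse.ArgumentParser(description="Sketch of a stats master script")
    parser.add_argument("--golr-url", required=True, help="GOLr (Solr) base URL")
    parser.add_argument("--previous-obo", required=True, help="previous release OBO file")
    parser.add_argument("--current-obo", required=True, help="current release OBO file")
    parser.add_argument("--output-dir", required=True, help="folder to store reports")
    return parser


def plan_steps(args):
    """Ordered steps: ontology diff first, then stats, then writing the reports."""
    return [
        ("ontology_diff", args.previous_obo, args.current_obo),
        ("go_stats", args.golr_url),
        ("write_reports", args.output_dir),
    ]


# Example invocation with placeholder values.
example_args = build_parser().parse_args(
    [
        "--golr-url", "http://localhost:8080/solr/",
        "--previous-obo", "previous.obo",
        "--current-obo", "current.obo",
        "--output-dir", "reports/",
    ]
)
steps = plan_steps(example_args)
```

Running the ontology diff before the stats step matters because the missing ontology stats (merged terms, structural and meta statements) come from the diff.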

@kltm kltm added enhancement and removed question labels Jul 17, 2019
@kltm kltm changed the title GO release stats Add GO release stats to pipeline Jul 17, 2019
@cmungall
Member

Let's discuss next steps tomorrow on software call

@pgaudet
Contributor Author

pgaudet commented Jul 18, 2019

@lpalbou
Contributor

lpalbou commented Jul 24, 2019

@pgaudet @kltm the PR to compute all the stats and diffs from the pipeline is here: #1148

7 files generated:

A quick look at go-annotation-changes.tsv indicates we have lost 12,000 ISS annotations (possibly Arabidopsis thaliana?), a few hundred annotations here and there (ND, IDA, IEP, NAS, EXP), and some TIGR annotations.

Last notes:

  • still have to add taxon label for readability
  • in annotation-changes, could remove all the species with 0 differences
  • we discussed it once, do we want an alternative go-stats without protein binding annotations ?

@pgaudet
Contributor Author

pgaudet commented Jul 24, 2019

still have to add taxon label for readability

yes please

in annotation-changes, could remove all the species with 0 differences

I would keep it for completeness

we discussed it once, do we want an alternative go-stats without protein binding annotations ?

That was for the website. If I understand correctly these here are the pipeline stats. In that case we want everything.

Thanks !! This is looking great !

Pascale

@lpalbou
Contributor

lpalbou commented Jul 24, 2019

That was for the website. If I understand correctly these here are the pipeline stats. In that case we want everything.

Two things:

  • those stats will be in the zenodo archive so accessible to the public. If we don't want that, then they need to be stored somewhere else
  • there is no other stats file; the website interprets these files to drive the UI graphs, so if we want to show the annotations without protein binding, then it needs to be computed at this level
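
A minimal sketch of what "computed at this level" could mean: producing the same counts twice, once over all annotations and once excluding protein binding. The record layout is an assumption, and the filter only drops direct annotations to GO:0005515 (protein binding), not annotations to its descendant terms; whether the real stats do the same is not stated here.

```python
from collections import Counter

PROTEIN_BINDING = "GO:0005515"  # GO identifier for "protein binding"


def count_by_evidence(annotations, exclude_terms=()):
    """Count annotations per evidence code, skipping any excluded GO terms."""
    excluded = set(exclude_terms)
    return Counter(a["evidence"] for a in annotations if a["term"] not in excluded)


# Hypothetical annotation records (the dict layout is an assumption).
annotations = [
    {"term": "GO:0005515", "evidence": "IPI"},
    {"term": "GO:0005515", "evidence": "IPI"},
    {"term": "GO:0006915", "evidence": "IDA"},
    {"term": "GO:0008150", "evidence": "ND"},
]
all_counts = count_by_evidence(annotations)
no_pb_counts = count_by_evidence(annotations, exclude_terms=[PROTEIN_BINDING])
```

Both results can then be written out side by side, so the website can pick whichever one a graph needs.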

@pgaudet
Contributor Author

pgaudet commented Jul 24, 2019

those stats will be in the zenodo archive so accessible to the public. If we don't want that, then they need to be stored somewhere else

Sure OK, the archive can have the full file - @thomaspd @cmungall @kltm is that OK for you?

there is no other stats file; the website interprets these files to drive the UI graphs, so if we want to show the annotations without protein binding, then it needs to be computed at this level

OK fine, thanks for the information.

@kltm
Member

kltm commented Jul 25, 2019

@pgaudet The stats will be in the archive and the standard locations (release, snapshot, current).

@RLovering
Collaborator

Hi
Sorry, could you please provide the links to the standard locations (release, snapshot, current)? I am not sure where these are.
Thanks
Ruth

@kltm
Member

kltm commented Jul 25, 2019

@RLovering This is not yet in the pipeline (v/soon), but the locations are:
http://wiki.geneontology.org/index.php/Release_Pipeline#Data_publishing_and_access

kltm added a commit to geneontology/pipeline that referenced this issue Aug 5, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Aug 5, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Aug 6, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Aug 7, 2019
@kltm
Member

kltm commented Aug 7, 2019

Looking good, with results on skyhook branch. Just needs to be folded in.

kltm added a commit to geneontology/pipeline that referenced this issue Sep 20, 2019
@kltm
Member

kltm commented Sep 23, 2019

@lpalbou
Running your script basically as:

python3 /tmp/go_reports.py -g http://localhost:8080/solr/ -s https://geneontology.s3.amazonaws.com/temporary/2019-july/go-stats.json -n https://geneontology.s3.amazonaws.com/temporary/2019-july/go-stats-no-pb.json -c http://current.geneontology.org/ontology/go.obo -p https://geneontology.s3.amazonaws.com/archive/2019-07-01_go.obo -o /tmp/stats/ -d $START_DATE

I've been running into this error for a little bit:

[2019-09-23T00:42:13.853Z] Traceback (most recent call last):
[2019-09-23T00:42:13.853Z]   File "/tmp/go_reports.py", line 295, in <module>
[2019-09-23T00:42:13.853Z]     main(sys.argv[1:])
[2019-09-23T00:42:13.853Z]   File "/tmp/go_reports.py", line 146, in main
[2019-09-23T00:42:13.853Z]     json_stats = go_stats.compute_stats(golr_url, release_date)
[2019-09-23T00:42:13.853Z]   File "/tmp/go_stats.py", line 255, in compute_stats
[2019-09-23T00:42:13.853Z]     prepare_globals(all_annotations)
[2019-09-23T00:42:13.853Z]   File "/tmp/go_stats.py", line 310, in prepare_globals
[2019-09-23T00:42:13.853Z]     check = taxon_map['9606'] == 'Homo sapiens'
[2019-09-23T00:42:13.853Z] KeyError: '9606'

Looking through go_stats.py, it seems to have a certain set of species hardwired, and if they are not available it crashes out (my read on what's going on). The pipeline often runs in modes where not all species are available. Can we make these flags, or maybe better to just bypass/catch with a warning--the QC/QA to look at values will occur elsewhere.
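
One way the "bypass/catch with a warning" option could look, as a sketch only. It does not reproduce go_stats.py or the fix that was later committed; the helper names are hypothetical, and the only names taken from the traceback and discussion are `taxon_map` and the taxon ID `'9606'`.

```python
import logging

logger = logging.getLogger("go_stats_sketch")


def taxon_label(taxon_map, taxon_id):
    """Look up a taxon label without raising when the species was not loaded."""
    label = taxon_map.get(taxon_id)
    if label is None:
        # Warn instead of crashing: the pipeline may run on a subset of species.
        logger.warning("taxon %s not loaded; skipping its checks", taxon_id)
    return label


def available_reference_taxa(taxon_map, reference_ids):
    """Keep only the reference species that are present in this run."""
    return [t for t in reference_ids if taxon_label(taxon_map, t) is not None]


# Hypothetical run in which human (9606) was not loaded.
loaded = {"10090": "Mus musculus"}
present = available_reference_taxa(loaded, ["9606", "10090"])
```

With a lookup like this, a run without human data logs a warning and continues, leaving value checks to the QC/QA step mentioned above.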

@lpalbou
Contributor

lpalbou commented Sep 23, 2019

@kltm I didn't expect the pipeline to ever run without the human species. There is indeed a static test that checks whether the human species was loaded. There is also a list of reference species that should always be present and could trigger an error if you don't load them - see reference_genomes_ids.

I will do a commit today to make the code robust to pipeline modes that don't load these species. Is there any documentation for these pipeline modes?

Also, I think there is an error in your command: -c must point to the new go.obo, and you used -c http://current.geneontology.org/ontology/go.obo, which points to the last released go.obo, not the one currently being produced by the pipeline.

@kltm
Member

kltm commented Sep 23, 2019

@lpalbou Thank you for the fast turnaround. There is no particular documentation for this requirement, just that the pipeline must be able to run any arbitrary subset (or any given GAF/ontology combination), which necessarily includes non-human runs.

Ah--thank you for the catch--I have that fixed on branch, ready for the next round.

@lpalbou
Contributor

lpalbou commented Sep 23, 2019

@kltm this commit should work with your requirements, let me know if anything else comes up.

kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019
@kltm
Member

kltm commented Sep 23, 2019

Great--testing.

kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Sep 23, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Sep 24, 2019
@kltm
Member

kltm commented Sep 24, 2019

If current master test passes, we'll merge #1181 and switch over to new stats.
Before the next release then, we'll have to get to #1182

@lpalbou
Contributor

lpalbou commented Sep 24, 2019

Sounds good to me 👍

kltm added a commit to geneontology/pipeline that referenced this issue Sep 24, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Sep 25, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Sep 25, 2019
kltm added a commit to geneontology/pipeline that referenced this issue Sep 26, 2019
@lpalbou
Contributor

lpalbou commented Sep 26, 2019

@kltm just a quick reminder to be on the safe side: don't forget to launch the python3 aggregate-stats.py command (to create the aggregate summaries for the website) after the python3 go_reports.py

@kltm
Member

kltm commented Sep 26, 2019

@pgaudet
Contributor Author

pgaudet commented Nov 5, 2019

hi @kltm

How come this is not in a project ? I think it should be in 2019-10 (Berkeley) Data Release Pipeline 1.3

Thanks, Pascale

@kltm
Member

kltm commented Nov 5, 2019

@pgaudet I have no idea...maybe I was trying to close and missed? Or maybe I set it free so it could be added to the new project (as SOP)?

@pgaudet
Contributor Author

pgaudet commented Nov 21, 2019

@lpalbou can we close ?
and perhaps make new tickets for what remains to be done ?

@pgaudet pgaudet closed this as completed Jun 24, 2020