
Add the NEO into the main pipeline #35

Open
kltm opened this issue Apr 17, 2018 · 63 comments

Comments
@kltm
Member

kltm commented Apr 17, 2018

The general idea is to eliminate as much mechanism as possible around the deployment and maintenance of multiple pipelines and servers. To this end, I've proposed that NEO (the neo.owl owltools ontology load, sorry @cmungall) get folded into the main Solr load and index. This would simply be:

  • adding the metadata for a new document category (e.g. neontology_class)
  • updating the schema accordingly
  • adding an additional owltools call (one that does not add to the general docs); a rough sketch follows below
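For concreteness, a minimal sketch of what that extra load step might look like, patterned on how the existing GOlr ontology load is invoked; the memory value, Solr URL, and the hypothetical neo-config.yaml metadata file are assumptions, not the actual pipeline code:

```bash
# Hedged sketch only: load neo.owl into the existing GOlr index as its own
# document category, without also writing "general" documents.
# The config file name and Solr URL are placeholders.
export OWLTOOLS_MEMORY=32G
owltools http://purl.obolibrary.org/obo/go/noctua/neo.owl \
  --solr-url "http://localhost:8983/solr/golr" \
  --solr-config ./metadata/neo-config.yaml \
  --solr-log /tmp/golr-neo-load.log \
  --solr-load-ontology
```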

A separate issue, not dealt with here, is adding the creation of neo.owl itself to the pipeline. As we are just pulling it from a URL, that can be handled separately.

Another, weaker formulation would be to drop the NEO index separately, but within the new pipeline framework and runs.

@kltm kltm added this to the wishlist milestone Apr 17, 2018
@kltm
Member Author

kltm commented Jun 30, 2018

Well, having started to explore this a little, it will not pan out as a "merged" index--we would clobber the "general" docs, which are used (for example) by the ubernoodle for NEO.
Instead, at least for now, we'll look at making another index in the pipeline and switching over deployment like the other indices we have now.

@cmungall
Member

Switch to Solr 6 and use a separate core?

@kltm
Member Author

kltm commented Jul 2, 2018

The idea is to simplify our current setup, reducing the number of deployed servers and/or the number of distinct pipelines. As the move to Solr 6.x (higher now) is orthogonal, splitting this out separately would mean at least a temporary bump up in both.

@kltm
Member Author

kltm commented Jan 23, 2019

From an earlier experiment, the overlay is problematic. We'll work towards the weaker form to make progress on things like #73 and geneontology/neo#38 (comment)

@kltm kltm removed this from the wishlist milestone Jan 23, 2019
@kltm kltm changed the title from "Explore adding the Solr NEO load into the main pipeline" to "Add the NEO Solr load into the main pipeline" Jan 23, 2019
@kltm
Member Author

kltm commented Jan 23, 2019

Until we have a fix for the NEO job automation, it will be a manual step.

@kltm
Member Author

kltm commented Jan 23, 2019

From @hdrabkin:

I had created 6 new PRO ids and they became available in our MGI GO EI on Friday. That means they are in the mgi.gpi (I verified), which I expected would then make them available in Noctua today, but they are not there.
PR:000050039
PR:000050038
PR:000050037
PR:000050036
PR:000050035
PR:000050034

@kltm
Member Author

kltm commented Jan 23, 2019

Also see geneontology/neo#38 (comment)

@hdrabkin

So does this mean these ids will be available soon?

@kltm
Member Author

kltm commented Jan 24, 2019

A manual load is finishing now and a spot check seems positive -- try them now?

@cmungall I think there may be something up with owltools and the NEO load. It seems to slow down towards the end of the ontology document loading (not for the general docs), eventually giving out. I'll try to get a more nuanced view at some point, but it may be best to look at this as a use case for a new python loader, after the go-cams.

@kltm
Member Author

kltm commented Jan 24, 2019

Actually, I'm not sure we use anything but the "general" doc in the index...
That would greatly speed up and simplify things.

@hdrabkin

Hi @cmungall and @kltm
Just checked this morning and the PRO ids are all available now. Thanks.

@kltm
Member Author

kltm commented Jan 24, 2019

@cmungall We'll need to discuss 1) how we want to migrate the neo build to a new pipeline (whether main or not) and 2) what actual deployment looks like for the ontology

kltm added a commit that referenced this issue Jan 24, 2019
@kltm
Member Author

kltm commented Jan 25, 2019

This will need to be tested a bit more, but it looks like the additional resources and updates on our new pipeline can make short work of the NEO products build:
http://skyhook.berkeleybop.org/issue-35-neo-test/products/solr/
This can be used to juggle updates in and out more safely in the interim.

@kltm
Member Author

kltm commented Jan 25, 2019

From @cmungall: the PURLs are from the given S3 bucket, not Jenkins, so we can just clobber them.
He has also agreed with the plan of a second pipeline to support NEO as a separate product from the main pipeline, with the chance to revisit later.

@kltm
Member Author

kltm commented Jan 25, 2019

Need more mem for Java:

/obo/BFO_0000040> "BFO:0000040"^^xsd:string) AnnotationAssertion(<http://www.geneontology.org/formats/oboInOwl#id> <http://purl.obolibrary.org/obo/CHEBI_23367> "CHEBI:23367"^^xsd:string) }
18:02:38 Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
18:02:38 	at com.carrotsearch.hppcrt.sets.ObjectHashSet$EntryIterator.<init>(ObjectHashSet.java:734)
18:02:38 	at com.carrotsearch.hppcrt.sets.ObjectHashSet$1.create(ObjectHashSet.java:784)
18:02:38 	at com.carrotsearch.hppcrt.sets.ObjectHashSet$1.create(ObjectHashSet.java:779)
18:02:38 	at com.carrotsearch.hppcrt.ObjectPool.<init>(ObjectPool.java:74)
18:02:38 	at com.carrotsearch.hppcrt.IteratorPool.<init>(IteratorPool.java:51)
18:02:38 	at com.carrotsearch.hppcrt.sets.ObjectHashSet.<init>(ObjectHashSet.java:778)
18:02:38 	at com.carrotsearch.hppcrt.sets.ObjectHashSet.<init>(ObjectHashSet.java:157)
18:02:38 	at uk.ac.manchester.cs.owl.owlapi.HPPCSet.<init>(MapPointer.java:444)
18:02:38 	at uk.ac.manchester.cs.owl.owlapi.MapPointer.putInternal(MapPointer.java:324)
18:02:38 	at uk.ac.manchester.cs.owl.owlapi.MapPointer.init(MapPointer.java:151)
18:02:38 	at uk.ac.manchester.cs.owl.owlapi.MapPointer.getValues(MapPointer.java:190)
18:02:38 	at uk.ac.manchester.cs.owl.owlapi.OWLImmutableOntologyImpl.getAxioms(OWLImmutableOntologyImpl.java:1325)
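As a hedged sketch of one way to give the load more heap (the 128G figure and the jar name are guesses, not measured or verified values):

```bash
# Illustrative only: raise the JVM heap for the NEO load. The owltools wrapper
# script honors OWLTOOLS_MEMORY; equivalently, -Xmx can be passed to java
# directly. The value here is a placeholder, not a measured requirement.
export OWLTOOLS_MEMORY=128G
# or, if calling the jar directly (jar path is an assumption):
# java -Xmx128G -jar owltools-runner-all.jar ...
```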

@cmungall
Member

cmungall commented Jan 25, 2019 via email

kltm added a commit that referenced this issue Jan 25, 2019
@hdrabkin

Hi Seth
We have another new ID in our GPI that needs to get into Noctua
PR:A0A1W6AWH1

@kltm
Member Author

kltm commented May 22, 2019

@hdrabkin I believe that this is a different issue. Yours should be cleared up on the completion of
https://github.com/geneontology/noctua/issues/612

@kltm
Member Author

kltm commented Oct 21, 2019

As previously discussed with @cmungall, we would spin this branch out into a new top-level pipeline. After starting work on that, I do not believe it's viable compared to formalizing it as a new branch in the current pipeline: it would either be a very fiddly piece of code that had to play carefully so as not to accidentally clobber skyhook locations, or it would require a small rewrite of how skyhook works. While neither of these is insurmountable, given the small and likely temporary nature of this pipeline, I think formalizing the current branch into something slightly more permanent is the fastest and safest way forward.

@kltm
Member Author

kltm commented Oct 22, 2019

Discussed with @goodb how to make this a workable transition:

  • start producing neo as a file in /ontology/extensions/
    likely just by running the same steps as now, but after the ontology build
  • aim the neo purl at the new file location (snapshot?)
  • turn off the old job
  • incorporate the neo build directly into the go-ontology main build, so that go-lego.owl and robot can use the current (as in during-the-run) version of neo rather than the day-old snapshot; I may need to tap @goodb or @balhoff for this

Once this is complete, we can either build the GOlr index for go-lego in the main pipeline or do it elsewhere. Deployment would still be once a week or so, so it may be fine to keep the degenerate pipeline-neo branch separate.

@kltm
Member Author

kltm commented Apr 17, 2020

From the call today, talking with @goodb and @balhoff, here are the next steps in the issue-35-neo-test branch:

For Noctua GOlr:

  • go-lego.owl and neo.owl

For Minerva:

  • go-lego-reacto.owl ("main" ontology)
    question to @goodb: we don't have anything like that in there at the moment; what is the best command to produce this, or should it just be added into go-ontology master?
  • blazegraph-go-lego-neo-reacto.owl (ontojournal)

Clarification:
We no longer need the journal product blazegraph-go-lego-reacto.owl

  • yes, we no longer need it
  • no, we still want that around

@goodb

goodb commented Apr 17, 2020

  • go-lego-reacto.owl ("main" ontology) question to @goodb: we don't have anything like that in there at the moment, what is the best command to produce this, or should this just be added into go-ontology master?

Probably the best thing is to add this to the ontology makefile by making a go-lego-reacto-edit.ofn file, adding reacto.owl as an import, and adding the target to the makefile just like the go-lego one. Note that the code to make reacto lives in a different location from the ontology makefile, so synchronization may be an issue.
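As a rough illustration of that suggestion (the file name and IRIs are assumptions based on this discussion, not the actual go-ontology files), the edit file could be little more than an importer ontology:

```bash
# Hypothetical sketch of a go-lego-reacto-edit.ofn: an otherwise empty ontology
# whose only job is to import go-lego and reacto, so that a makefile target
# "just like the go-lego one" can merge the import closure into one file.
cat > go-lego-reacto-edit.ofn <<'EOF'
Prefix(owl:=<http://www.w3.org/2002/07/owl#>)
Ontology(<http://purl.obolibrary.org/obo/go/extensions/go-lego-reacto.owl>
Import(<http://purl.obolibrary.org/obo/go/extensions/go-lego.owl>)
Import(<http://purl.obolibrary.org/obo/go/extensions/reacto.owl>)
)
EOF
```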

@kltm
Member Author

kltm commented Apr 17, 2020

@goodb Hm. I don't think there is necessarily any issue with that, as reacto.owl is made during a normal pipeline run anyway and this experimental pipeline will eventually be folded into that. I suppose there is a bit of a trick with the references here, but hopefully that could be accomplished with a catalog or a materialized ontology. That said, it would actually be a convenience to have reacto.owl in the GO Makefile as well, would it not?

@goodb

goodb commented Apr 17, 2020

It would indeed be more convenient to have reacto built in the main makefile along with the others. We might actually promote that to a policy - that all ontology products are produced there. Downstream things like journals and indexes could happen elsewhere in the pipeline.

@kltm
Member Author

kltm commented Apr 17, 2020

The building of reacto.owl is just a few lines, and it feels like it would be an easy win that is easy to back out of if necessary. It seems to need a single remote file and a single binary available for the build, possibly supplied by optional environment variables. As that binary is a release, it might be nice just to add it as a lib in the go-ontology repo to cut the external dependency and make things a little more self-contained.
Since we need this (or a workalike) to continue making progress on getting NEO out and updating minerva, is there any reason not to just go ahead and do this? Otherwise, maybe I can just get the ontology merge command so I can make go-lego-reacto.owl at least on this branch and still get the updates out.

@goodb

goodb commented Apr 17, 2020

@kltm I don't see any reason not to go forward as you suggest. At some point I'd like to figure out why the source code build wasn't working in the pipeline environment and get it posted to maven. For now, I think the binary release approach we have now ought to work.

For merge if needed, pretty straightforward robot command. http://robot.obolibrary.org/merge
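Something along these lines, where the input and output paths are assumptions rather than the actual repo layout:

```bash
# Hedged example of the robot merge mentioned above: combine the full import
# closure of go-lego with reacto into a single go-lego-reacto.owl file.
robot merge \
  --input go-lego.owl \
  --input reacto.owl \
  --output go-lego-reacto.owl
```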

I can work on it this weekend if you want.

@kltm
Member Author

kltm commented Apr 17, 2020

@goodb Okay, it would be great if you could go ahead with this. If at all possible, please be mindful of the relative positions of the files in the directory hierarchy:

/ontology/extensions/reacto.owl
/ontology/neo.owl

In the meantime, let me know if you'd like me to do the relatively straightforward merge to produce go-lego-reacto.owl to unblock you on testing minerva and co.

goodb added a commit to geneontology/go-ontology that referenced this issue Apr 18, 2020
kltm added a commit that referenced this issue Apr 26, 2020
kltm added a commit that referenced this issue Apr 26, 2020
kltm added a commit that referenced this issue Apr 28, 2020
kltm added a commit that referenced this issue Apr 28, 2020
…docker fail) and split derivatives so we can restart faster; work on #35
kltm added a commit that referenced this issue Apr 29, 2020
kltm added a commit that referenced this issue Apr 29, 2020
kltm added a commit that referenced this issue Apr 29, 2020
@kltm
Member Author

kltm commented May 5, 2020

@goodb My current understanding is that, while this pipeline is still separate, it is now creating all of the products that we want.

Besides merging this back into the main pipeline (which may have to wait until we get some speed improvements), we still have some work on checking NEO here: https://github.com/geneontology/pipeline/blob/issue-35-neo-sanity-test/Jenkinsfile#L352. @dougli1sqrd @goodb Is this still in progress?

@dougli1sqrd
Contributor

Yes, I think that's still ongoing? I need to revisit to see the exact state; it's been a little while since I last looked at it.

@goodb

goodb commented May 5, 2020

@kltm my understanding matches yours with regard to pipeline build.

Regarding testing the products, I had written a couple of simple SPARQL queries that could be run on the generated, merged ontologies. I had handed this off to @dougli1sqrd to pipelinify. It looks like he is running something on the merged go-lego.owl file using robot. That ought to work, but if it's slow, the tests could be moved downstream to make use of the blazegraph journals that are now being generated. Test queries could be run with blazegraph-runner and ought to be fast.
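For illustration, a downstream check might look roughly like the following; the blazegraph-runner options are abbreviated from memory, and the journal name and query path are assumptions based on the products listed earlier, not the actual pipeline step:

```bash
# Hypothetical sketch: run a sanity-check SPARQL query against the generated
# journal (which already has go-lego, neo, and reacto merged) rather than
# loading the OWL into robot. File names here are guesses.
blazegraph-runner select \
  --journal=blazegraph-go-lego-neo-reacto.jnl \
  --outformat=tsv \
  sparql/neo/terms-exist.rq neo-check.tsv
```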

BTW @dougli1sqrd, running something with neo in its name ("sparql/neo/profile.txt") against go-lego.owl probably no longer makes sense. neo.owl is not currently included in go-lego.owl; it needs to be treated separately. Downstream there is a blazegraph journal that does merge them, if you need them together.

@kltm
Member Author

kltm commented Jan 27, 2022

From the software call today: we don't want to forget to make reacto creation and exposure better as part of this item.
