Add JSON product production for GO-CAM API to pipeline #265

kltm · 2022-01-12T02:16:43Z

kltm · 2022-01-12T22:47:02Z

Working on issue-265-go-cam-products branch.

…onment; work on #265

kltm · 2022-01-13T01:40:15Z

@dustine32 So, much to my amazement, we seem to have something that is beginning to work...

http://skyhook.berkeleybop.org/issue-265-go-cam-products/products/api-static-files/

I'd like a little feedback and information from you, but there is a start here. I was trying to make something that would work without the modification of any upstreamand without making new docker images, so there is some weirdness in there (e.g. creating new runtime scripts for the blazegraph docker environment, using sed to make runtimes and other changes on the fly, setting up maven and installing nodejs after the fact, nested working directories), but there is a working base here nonetheless. So, questions:

Is what is in there understandable? From:

pipeline/Jenkinsfile

Line 357 in 1a81657

stage("Create GO-CAM JSON products") {
What changes could be made to the upstream repos that would a) keep them sensible and clean but 2) make this pipeline a little less hacky?
What are the commands to get the rest of the required JSON files?
What form do the JSON files need to be in? Gzipped even though the extension is wrong?
Anything else you might think of...?

A lot of questions there, so if it's easier, we can touch bases on voice.

dustine32 · 2022-01-13T19:43:30Z

@kltm Whoa. I'm amazed you were able to hack all of my "manual" commands in to the pipeline. Great job!

Yep. Makes total sense. Though I am wondering about how we will be choosing the blazegraph-production.jnl (in the end, after the rest of the coding/testing is worked out). If a blazegraph-production.jnl is produced in release, snapshot, and master, can we just default to grabbing whatever blazegraph-production.jnl is already produced/laying around locally in that branch/run? Or like, wget -N http://skyhook.berkeleybop.org/$BRANCH_NAME/products/blazegraph/blazegraph-production.jnl.gz?
We can probably get rid of the jetty server and api-gorest-2021 app and replace with these components:
a. blazegraph-runner cmd
b. SPARQL query files (4 of them) - stored in go-site/pipeline/sparql/reports/
c. A small script to convert (handle grouping) the blazegraph-runner output to JSON structure expected by GO-CAM site
I just committed the other two API cmds for the remaining files.
Not yet sure if they actually need to be gzipped. We can test by having a dev instance of web-gocam point to the skyhook files and adjust until it works?
Ummm... hmmm...

dustine32 · 2022-01-13T20:14:57Z

@kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?

kltm · 2022-01-13T20:43:07Z

@dustine32 Okay, comments on comments...

Correct. There are two ways forward: if (when) this gets folded into the core pipeline, it would grab the journal from inside the pipeline and use it before publications; if this remains outside of there core pipeline, it can continue to grab from "current.go.org", as that will be the latest and likely just created. The former is better as it means we can try different loads as experiments for the GO-CAM API, etc.
Re: sparql/jetty vs blazegraph-runner. In the best of all worlds and for a bunch of reasons, I'd rather have a bunch of cli commands instead of servers being spun up and down. For the main pipeline, I think that spinning up blazegraph-runner is a small consideration compared to the simplicity of just having commands. Moreover, if it was irritating enough, we could parallelize or make blzegraph-runner handle batches or something, probably without too much trouble. As well, having just cli from repos would make it easier to bake a single-purpose and easy-to-use docker image to handle all of these things.
My concern for the moment was something along the lines of how to handle the SPARQL output and how to convert it properly into the JSON that's needed? It may be that that's essentially locked into the JS and hard to extract or make cli conversion tools for. Or maybe not? I guess I'm wondering how much conversion is necessary and how hard would it be to extract? Ideally, there would also be a JSON schema so we knew we were doing the right thing, but I think we'd probably just want to move on quickly as this area of the stack will see some evolution this year.
Cheers! I'll re-reun and see what we get. As well, I'll turn on the "full" data load for the next test.
Okay, sounds good. The wrong extension thing kinda weirds me out, so if we can avoid that all the better. For the moment, I'll also make some gzipped products with the proper extension to see if that works as well. Either way, I'd prefer file names that match content.

balhoff · 2022-01-13T20:46:37Z

@kltm Actually, I'm playing around with blazegraph-runner locally now and I realized having 4 separate blazegraph-runner cmds means loading the journal 4 times, which is taking a while. So now I'm appreciating that "load-journal-once" jetty endpoint. I'm thinking we just keep that part?

@dustine32 if I need to run queries in parallel I would build the journal in another target and then cp the journal to a new file in each target before running. Or else just run all queries in one target (probably makes the most sense).

kltm · 2022-01-13T22:09:09Z

@dustine32 The "full" test run on the production file only takes ten minutes on this end, which is pretty good, especially as I can see things that can easily be sped up, like using pigz.
So, the full versions are now available (with "correct" extensions) on S3 at places like:

https://go-public.s3.amazonaws.com/files/gocam-goterms.json
https://go-public.s3.amazonaws.com/files/gocam-goterms.json.gz

and so on. They are also available on skyhook, but those might disappear during runs.

I guess this puts this back into your court with testing against the S3 products to see if they work?

dustine32 · 2022-01-13T22:32:33Z

@kltm Running a local instance of geneontology/web-gocam@ada645b, I tested and confirmed the non-gzipped URLs work with the GO-CAM browser site.

kltm · 2022-01-13T22:39:50Z

@dustine32 Okay, great. Since they aren't too large, I'm going to go ahead and remove the gzipped versions from our new pipeline and deployment.

kltm · 2022-01-13T22:52:11Z

@dustine32 Okay, done. The next step above "GO-CAM API at new products (temporary)" could technically be a stable terminal state (even though we don't want it to be), so a little less worry for us. I think this one is probably on your plate? Would you like people to work on that with you and spread the knowledge? Also, it's probably good to update our internal documentation for this new stable state, even though it's meant to be temporary.

kltm · 2022-01-14T00:24:49Z

Talked to @dustine32 and he clarified some of my confusion: this only needs to update the GO-CAM website, not the GO-CAM API. Things to do above updated accordingly.

kltm · 2022-01-21T01:49:33Z

After group discussion, we'll wrap this after automating @dustine32 .
So, how should we automate you? There are two obvious ways in my mind:

We get the USC credentials, put them into the pipeline, and push direct
pro: we keep what we have so there is almost no chance of side effects; con: we maintain Yet Another Data Drop Point
We aim the GO-CAM web app at the newly minted S3 products (the API does not use them separately)
pro: easy to do(?); con: maybe a higher chance of side effects (i.e. something else is consuming those files that we forgot or don't know about)

I think beyond those two, we'd likely be doing a bit more work. (I'm avoiding adding them to the main pipeline products for the moment, until we know what our roadmap will be.) Do either of these make more or less sense to you?

dustine32 · 2022-01-21T04:10:29Z

@kltm Thanks! My vote is for option 2 since a side effect might be that we get closer to something like a standard set of GO-CAM JSON products tied to GO releases (once this is running in the main pipeline). See geneontology/go-site#1180 (comment) for a bit more detail.

For changing the the GO-CAM web app, I believe the steps are:

Update JSON endpoint URLs in web-gocam here.
Deploy web-gocam changes to S3 static site - Exact details on this are murky right now. But it looks like this deploy.sh script is a good place to start.

kltm · 2022-01-21T19:51:25Z

@dustine32 Okay, it looks like 1 is done and committed; I've created a PR geneontology/web-gocam#18 . I think that's probably safe to merge, no? At worst, it might automatically update and problem solved. I've tested it locally and it appears to be going to the correct location.

For 2, this is a bit worrying: https://github.com/geneontology/web-gocam/blob/c4e4bf6cf4c190c757e40c9fbe47c3260907cfa6/deploy.sh#L2
I'm not wild about a recursive delete of production, but it otherwise seems straightforward. The local tools seem to work as advertised after doing an npm install. However, I think that you are current the go-to person given the .cloud credentials?

dustine32 · 2022-01-21T20:01:07Z

@kltm Yep, I'll try to make sure it's deployed today, tip-toeing around the recursive delete (I'll prob have to do it but I'll see what I accomplish without it first).

kltm · 2022-01-21T20:07:45Z

I suppose a Friday afternoon is probably the best time to try things like this anyways. It will probably all go fine, but if you run into any hiccups, don't hesitate to ping me (or we could do it together if you want company).

kltm · 2022-02-02T21:11:48Z

Caught up with @dustine32 and updated the TODO list above. We'll revisit after this upcoming Friday.

kltm · 2022-02-12T00:32:02Z

Talked to @dustine32 and " complete transfer (or remapping) of S3 and CF resources to USC" completed.

dustine32 · 2022-02-12T00:56:14Z

Expanding on #265 (comment): with the S3 and CF transfer to USC AWS, we now have control over the GO-CAM site code that is served on geneontology.cloud and thus from where the GO-CAM site will fetch the JSON files.

So, if we ever need to change JSON filenames or location, we just have to PR the changes (an example geneontology/web-gocam#18) and run the deploy.sh script. Be sure to update the correct CF distribution ID in the deploy.sh create-invalidation cmd, which we can now find under the USC CF list.

kltm added the bug (B: affects usability) label Jan 12, 2022

kltm mentioned this issue Jan 12, 2022

Fix SPARQL required for GO-CAM website resource files geneontology/api-gorest-2021#2

Closed

kltm added a commit that referenced this issue Jan 12, 2022

first (non-working) base for #265

73f47ee

kltm added a commit that referenced this issue Jan 12, 2022

working fix for #265

d7c6dee

kltm added a commit that referenced this issue Jan 13, 2022

after some futzing around, a first draft at a usable blazegraph envir…

bd5c4a2

…onment; work on #265

kltm added a commit that referenced this issue Jan 13, 2022

fix typo; work on #265

cde53db

kltm added a commit that referenced this issue Jan 13, 2022

non-master typo; work on #265

5ae926f

kltm added a commit that referenced this issue Jan 13, 2022

faster; work on #265

af7e03f

kltm added a commit that referenced this issue Jan 13, 2022

no pigz; work on #265

8a34eb7

kltm added a commit that referenced this issue Jan 13, 2022

port move; work on #265

25807fc

kltm added a commit that referenced this issue Jan 13, 2022

attempt at working environment and file production; work on #265

eefcb6d

kltm added a commit that referenced this issue Jan 13, 2022

sooo close--switch rsync to scp; work on #265

2c7e0e8

kltm added a commit that referenced this issue Jan 13, 2022

scp whoops; work on #265

798b3f2

kltm added a commit that referenced this issue Jan 13, 2022

need identity files; work on #265

1a81657

dustine32 added a commit that referenced this issue Jan 13, 2022

Cmds to get models and pmids JSON; for #265

917fd65

kltm added a commit that referenced this issue Jan 13, 2022

switch to full data load; add gzipped products; work on #265

9bbbf44

kltm added a commit that referenced this issue Jan 13, 2022

add attempted push to S3; work on #265

f047eb6

kltm added a commit that referenced this issue Jan 13, 2022

correct mimetype; work on #265

b415ccd

kltm added a commit that referenced this issue Jan 13, 2022

apparently, cleaning is necessary for the workspace; work on #265

d9c8305

kltm added a commit that referenced this issue Jan 13, 2022

need pip to install s3cmd; work on #265

df576d9

dustine32 added a commit to geneontology/web-gocam that referenced this issue Jan 13, 2022

Point JSON URLs to GO pipeline S3 files; geneontology/pipeline#265

ada645b

kltm added a commit that referenced this issue Jan 13, 2022

switch to pigz; work on #265

d6a151c

kltm added a commit that referenced this issue Jan 13, 2022

remove zipped products; work on #265

ae69af6

kltm added a commit that referenced this issue Jan 13, 2022

doc for #265

1f1c5a4

kltm mentioned this issue Jan 21, 2022

Point JSON URLs to GO pipeline S3 files geneontology/web-gocam#18

Merged

kltm mentioned this issue Feb 23, 2022

Make sure that all public resources are served from a *.geneontology.org domain geneontology/go-site#1800

Open

6 tasks

kltm added a commit that referenced this issue May 23, 2022

catch this branch up to the current state-of-play for #265

839d5a9

kltm added a commit that referenced this issue May 26, 2022

increase memory and jump timeouts to 2m for #265

51cefa8

kltm added a commit that referenced this issue May 26, 2022

type for #265

01a3d49

kltm mentioned this issue Nov 8, 2022

Fold new GO-CAM-related pipeline items to main pipeline #306

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add JSON product production for GO-CAM API to pipeline #265

Add JSON product production for GO-CAM API to pipeline #265

kltm commented Jan 12, 2022 •

edited

Loading

kltm commented Jan 12, 2022

kltm commented Jan 13, 2022 •

edited

Loading

dustine32 commented Jan 13, 2022

dustine32 commented Jan 13, 2022

kltm commented Jan 13, 2022

balhoff commented Jan 13, 2022 •

edited

Loading

kltm commented Jan 13, 2022

dustine32 commented Jan 13, 2022

kltm commented Jan 13, 2022

kltm commented Jan 13, 2022

kltm commented Jan 14, 2022

kltm commented Jan 21, 2022

dustine32 commented Jan 21, 2022

kltm commented Jan 21, 2022 •

edited

Loading

dustine32 commented Jan 21, 2022

kltm commented Jan 21, 2022

kltm commented Feb 2, 2022

kltm commented Feb 12, 2022

dustine32 commented Feb 12, 2022

Add JSON product production for GO-CAM API to pipeline #265

Add JSON product production for GO-CAM API to pipeline #265

Comments

kltm commented Jan 12, 2022 • edited Loading

kltm commented Jan 12, 2022

kltm commented Jan 13, 2022 • edited Loading

dustine32 commented Jan 13, 2022

dustine32 commented Jan 13, 2022

kltm commented Jan 13, 2022

balhoff commented Jan 13, 2022 • edited Loading

kltm commented Jan 13, 2022

dustine32 commented Jan 13, 2022

kltm commented Jan 13, 2022

kltm commented Jan 13, 2022

kltm commented Jan 14, 2022

kltm commented Jan 21, 2022

dustine32 commented Jan 21, 2022

kltm commented Jan 21, 2022 • edited Loading

dustine32 commented Jan 21, 2022

kltm commented Jan 21, 2022

kltm commented Feb 2, 2022

kltm commented Feb 12, 2022

dustine32 commented Feb 12, 2022

kltm commented Jan 12, 2022 •

edited

Loading

kltm commented Jan 13, 2022 •

edited

Loading

balhoff commented Jan 13, 2022 •

edited

Loading

kltm commented Jan 21, 2022 •

edited

Loading