Skip to content
This repository has been archived by the owner on Mar 15, 2023. It is now read-only.

GIN as GitHub Artifacts #266

Closed
wants to merge 12 commits into from
Closed

GIN as GitHub Artifacts #266

wants to merge 12 commits into from

Conversation

CodyCBakerPhD
Copy link
Member

Motivation

Let's see if this works...

The idea is, instead of manually re-downloading the GIN datasets each time we want to do a full test, do (1) an initial download from GIN (takes about 42 minutes) and 'upload' to GitHub actions [as an artifact] (takes less than 5 minutes)(https://docs.github.com/en/actions/guides/storing-workflow-data-as-artifacts), (2) then to perform tests 'download' the directory from the artifacts locally (download time TBD). This, in theory, should allow the full tests to run much faster.

This also simplifies the test_gin script to no longer need datalad through that imported python API (all gets offloaded to workflows), which reduces the amount of code needed by a significant margin.

This also allows the possibility of running full GIN tests on other parts of the OS matrix to ensure our awareness of OS specific issues.

How to test the behavior?

No changes to the way testing behavior is called; the CI will report issues quickly if there's an issue with handling the artifact part.

Checklist

  • Have you checked our Contributing document?
  • Have you ensured the PR description clearly describes the problem and solutions?
  • Have you checked to ensure that there aren't other open or previously closed Pull Requests for the same change?

@CodyCBakerPhD CodyCBakerPhD added code reduction Reduces the amount of code for the same functionality performance This pull request makes the code run faster or take less memory labels Sep 21, 2021
@CodyCBakerPhD CodyCBakerPhD self-assigned this Sep 21, 2021
@CodyCBakerPhD
Copy link
Member Author

Immediately failing to recognize the successfully uploaded artifact; will work on debugging this.

@CodyCBakerPhD
Copy link
Member Author

Aha, I can manually download the artifact from that job run, but to get it on the other workflow I'll have to do something like this

https://stackoverflow.com/questions/60355925/share-artifacts-between-workflows-github-actions

the only thing now is, and this is why I put upload/download in single quotes above, it does seem like it's still spending server bandwidth on its own end rather than somehow storing it somewhere local that can allow memory access... so it remains to see if it's actually any faster than just going through GIN directly.

@CodyCBakerPhD
Copy link
Member Author

CodyCBakerPhD commented Sep 23, 2021

OK - this should be ready to go now.

Basically, once every 90 days, someone would have to go to this branch and do a push (or just manually trigger that specific workflow through actions), and that will update the archived artifact of the ephy testing dataset.

Now, each development test workflow uses this custom action to download that artifact and perform the GIN tests on it.

Some statistics on performance:

The original approach for the "Full tests", running on ubuntu only, with Datalad fetching only specific data formats, takes on average 20-30 minutes to download and run.

This approach, which is currently able to run on both ubuntu and mac, takes about 8-9 minutes for each OS, and can run each in parallel.

One thing to note is currently, the archived artifact is the entire GIN repo (takes about 40 minutes to upload as the artifact, but again that would only be every 90 days at a minumum). We could narrow it down to the formats we currently use, and this could speed it up even more, but will require more complicated workflow code so I'll table it for a next addition.

Another thing to add in later will be an auto-schedule workflow to automatically update the GIN once every month or so so we don't have to do it manually. However, it will take some tinkering to find the best way to also then update the workflow/run IDs across to the development tests so they know where to look for the artifact.

The custom action however, currently throws an error on the windows part of the matrix, so I'll post an issue on their repo to try to get that resolved; looks like something simple involving the decompression. Will extend to include that in coverage once/if it is fixed.

@CodyCBakerPhD CodyCBakerPhD marked this pull request as ready for review September 23, 2021 21:01
@CodyCBakerPhD
Copy link
Member Author

Note also that this removes the "Full tests" from the workflows and the "development tests" instead run test_gin.py every time, making those the ones we want to set to 'required'.

@CodyCBakerPhD
Copy link
Member Author

Raised an issue on windows part of test matrix, we'll see what they say: dawidd6/action-download-artifact#108

@bendichter
Copy link
Collaborator

@CodyCBakerPhD would you be able to check against the git hash or the hash of the files to update the artifacts any time there is a change?

@CodyCBakerPhD
Copy link
Member Author

@CodyCBakerPhD would you be able to check against the git hash or the hash of the files to update the artifacts any time there is a change?

There's probably a few different ways of doing that update; I think the best solution would be to have some kind of internal repo secret or token that gets updated to point to the correct workflow ID each time, but I haven't messed around with that yet to see what's best.

@CodyCBakerPhD
Copy link
Member Author

Point is, however we do it, the value being reference has to be (a) global to all actions/workflows and (b) modifiable by workflows (specifically the one responsible for updating the artifact archive on a schedule)

@bendichter
Copy link
Collaborator

Could we save those hashes as artifacts?

@CodyCBakerPhD
Copy link
Member Author

Could we save those hashes as artifacts?

We would still have to know the workflow ID of the last hash update if it were uploaded as an artifact, even though its content is itself supposed to be a hash leading to the actual most up to date GIN artifact. Bit of a chicken and egg problem there... problem is still how to communicate the most recent hash references across the workflows.

@CodyCBakerPhD
Copy link
Member Author

I suppose another alternative would be to have a .txt file or something (not sure where would be best placement) and have the workflows open/edit/close that file keeping a record of each artifact update with corresponding ID reference

@CodyCBakerPhD CodyCBakerPhD marked this pull request as draft October 6, 2021 19:00
@CodyCBakerPhD
Copy link
Member Author

Looking into using https://github.com/actions/cache now

@CodyCBakerPhD CodyCBakerPhD mentioned this pull request Oct 8, 2021
3 tasks
@CodyCBakerPhD
Copy link
Member Author

PR #274 shows better performance now, and solves many of the other open questions here, such as hashing.

@CodyCBakerPhD CodyCBakerPhD deleted the gin_as_artifact branch October 8, 2021 00:57
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
code reduction Reduces the amount of code for the same functionality performance This pull request makes the code run faster or take less memory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants