
Unify build and deploy processes across the various components of OpenPATH #1048

shankari opened this issue Feb 5, 2024 · 39 comments

@shankari

shankari commented Feb 5, 2024

OpenPATH currently has five main server-side components:

  • webapp
  • analysis
  • public dashboard
  • admin dashboard
  • join page

The webapp and analysis containers are launched from e-mission-server; the others are in separate repos that build on e-mission-server.

There are also additional analysis-only repos (e-mission-eval-private-data and mobility-scripts) that build on e-mission-server but are never deployed directly to production.

In addition, there are internal versions of all the deployable containers that essentially configure them to meet the NREL hosting needs.

We want to unify our build and deploy processes such that:

  • all of the repos are built consistently
  • the internal deployments use the external deployment images directly (if possible) so we don't have to rebuild
  • security fixes are automatically generated using dependabot
@MukuFlash03

MukuFlash03 commented Feb 15, 2024

Shankari mentioned this:
“At the end, we want to have one unified process for building all the images needed for OpenPATH. We are not doing incremental work here, we are doing a major redesign. I am open to merging PRs in each repo separately as an intermediate step but eventually, one merge and everything downstream is built”

Does this mean:

  1. Even if other repos are ready to be merged, we can’t actually merge their changes until the parent repo for the base image (currently e-mission-server) is ready to be merged?

  2. Would completely automating merges skip the PR review process? Or would those PR merges still go through, with nothing actually triggered until the merge in e-mission-server triggers it?

@MukuFlash03

MukuFlash03 commented Feb 15, 2024

The admin and public dashboard repos are built on top of the e-mission-server image - their Dockerfiles build off of the e-mission-server base image. What we would want to do is: when an e-mission-server PR is merged, bump up the dependency in the admin and public dashboard Dockerfiles to the latest tag, and then rebuild those images.
As long as there are no changes to the Dockerfile, there should be no merge conflict; if one does exist, we can take a look at it manually.

The automation would cover only the changes to the Dockerfile that update the base server image reference to the latest tag.
That would then trigger an image build for the repo, and we could potentially trigger image builds on every merge to a specific repo.

This does not include other code changes in PRs, as those would still need to go through the code review process we are currently following. The automated merges with Docker tag updates must occur only when the underlying e-mission-server has been updated. The automated builds for the latest merged code versions of these repos (and not any open / unmerged PRs) can occur, if needed, on every merge.
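As a rough sketch (not the actual workflow), the tag-bump step could be as simple as rewriting the FROM line in the downstream Dockerfile and rebuilding. The image names, file path, and NEW_TAG variable here are placeholders:

    # NEW_TAG is assumed to hold the tag just pushed by the e-mission-server build
    NEW_TAG=2024-02-15--12-34
    # point the downstream Dockerfile at the new base image tag
    sed -i -e "s|^FROM emission/e-mission-server:.*|FROM emission/e-mission-server:${NEW_TAG}|" docker/Dockerfile
    git commit -am "Bump e-mission-server base image to ${NEW_TAG}"
    # rebuild and push the downstream image with the same tag
    docker build -t emission/op-admin-dashboard:${NEW_TAG} -f docker/Dockerfile .
    docker push emission/op-admin-dashboard:${NEW_TAG}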

@nataliejschultz
Copy link

nataliejschultz commented Feb 15, 2024

Suggestions from Thursday:

  1. Use GitHub Actions with job dispatch and/or reusable workflows to work within multiple directories.

Reusable workflows syntax example:

Job:
	uses: owner/reponame/.github/workflows/otheryaml.yml@main

Composite actions with repository dispatching
(this uses webhooks)

  2. GitHub Actions in external repos, Jenkins pipeline extension in internal repos
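For the repository dispatching idea above, a hedged sketch of the sending side (the repo name, event type, and token secret are placeholders, not an agreed-upon setup); the receiving repo would listen for this event type under "on: repository_dispatch":

    $ curl -X POST \
        -H "Accept: application/vnd.github+json" \
        -H "Authorization: Bearer $GH_TOKEN" \
        https://api.github.com/repos/e-mission/em-public-dashboard/dispatches \
        -d '{"event_type": "server-image-updated", "client_payload": {"tag": "<new-tag>"}}'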

Notes:

  • External repo strategy seems fine. Dispatching events to other repos is a good idea. Let’s get this part working well first, then show it to Jianli to discuss re-working the internal repos.
  • Integration with Jenkins is likely to be complicated due to cloud services access.
    • Come up w/ strategy for internal repos as well (separate from Jenkins)
    • Reusable workflows for internal repos with GH actions? Is this possible? There may not be a runner because it’s private. Look into these possibilities
  • Other sets of options:
    • ArgoCD - might be overkill
    • Kubernetes: may be better for larger orgs
  • Clarification: sub-modules are a way to include some code in some other code, like a library. It’s not going to solve the PR issue where you have to merge multiple PRs.
  • Internal repos: Suggests asking ourselves if we need internal repos at all. Open your mind

@shankari

shankari commented Feb 16, 2024

It is "sub-modules" not "some modules" 😄
And if you need internal repos, which of the three models should you follow?

@MukuFlash03

MukuFlash03 commented Feb 19, 2024

So, I did take a look at git submodules, and it might not be a good option for our use case.

Found some information here:

  1. [Git official submodules](https://git-scm.com/book/en/v2/Git-Tools-Submodules)
  2. GitHub Gist Post
  3. Blog Post

What it does:

  • Helpful when code dependencies need to be used as libraries in other projects.
  • Adds repositories as subdirectories under the main repository which uses code from the sub repositories.

Why not good?

  • We are already building off of the server base image, so the required code functionality is available.
  • Complex push/pull process - changes need to be updated in both the submodule and the parent repository.
    • If the latest changes are not committed in either one of these, they won’t be available to all contributors.
    • Need to run “git submodule update” every time; forgetting this will leave the code base in a stale state.
  • Commits can be overwritten if another contributor’s commits are pulled via the update command before you have committed your own changes.
  • Better suited when submodules won’t have frequent commits / changes.
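For reference, a minimal sketch of the submodule workflow described above (the repo URL and path are placeholders):

    # add another repo as a subdirectory of the current repo
    $ git submodule add https://github.com/e-mission/e-mission-server server
    $ git commit -m "Add e-mission-server as a submodule"

    # every contributor then has to remember to sync the submodule pointer
    $ git pull
    $ git submodule update --init --recursive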

@MukuFlash03

Single Internal Repository

A possible redesign for the internal repositories is to have a single internal repository with a sub-directory for each repo, similar to how the server repo is currently used internally.

For the server repo, the internal repo is named nrelopenpath.
This contains two directories, webapp and analysis, corresponding to two AWS ECR images which are both built from the same latest base server image after customizations.

Similarly, we can refactor the join, admin-dash, and public-dash repos to be used the way the server repo is used in the internal GitHub repos.
This would avoid duplication of repos and would include these steps:


  • Docker images can be built for these three repos as well and uploaded to DockerHub.
  • Replace the docker image tags as required.

Need to see how to fit in admin-dash (external remote upstream -> main -> dev) and public-dash (notebook image)


Pros and Cons:
 
Pros:

  1. No repo / codebase duplication.
  2. No need to worry about public -> private merging, since images are built from the public repos themselves.



Cons:


  1. Images for all repos need to be built and pushed to Dockerhub (currently only server and public-dash are).

  2. Docker image tags need to be updated twice for the other repos:
once in the external repo (base image is the server image),
then in the internal repo (base image is the pushed repo image; e.g. join will have e-mission-join_timestamp).

@MukuFlash03

Index for Topic wise updates

  1. Compiled list of Related issues -> Serves as a good reference for others

  2. Learnings from Documentation across OpenPATH project.
  3. Questions and Answers exchanged with the Cloud Team + Key takeaways
  4. Understanding of the internal repos (Need + Rebuild)
  5. Redesign plan updates

We have organized our findings into a series of topics / sections for ease of reading.

@MukuFlash03

MukuFlash03 commented Mar 5, 2024

Topic 1: Compiled list of Related issues

We spent a lot of time scouring the existing documentation in GitHub issues and PRs (both open and closed) spread throughout the repositories of the OpenPATH project.
As we kept finding more and more issues, we thought it’d be a good idea to keep them organized, since we had to keep referring to them back and forth, and this table was pretty helpful.
Hence, we are putting it here so it serves as a good reference for others.

Notes:

  1. Categorization into Super-related, Related, and Possibly Related (or at times unrelated) is done with respect to the current task of redesigning the build and deployment process.
  2. The major focus is on the four external and four internal repos related to the server and dashboards.
    • Other repositories referenced include: e-mission-docs, e-mission-docker.
    • Some labels / descriptions might not be categorized precisely, but they relate in some way to the repository and the task at hand and provide important information.

S. No. Repository Super-related Related Possibly Related (or not)
1. e-mission-docs e-mission-docker cleanup
#791

CI to push multiple images from multiple branches
#752

Public-dash build
#809

Public-dash cleanup
#803

Jenkins, ARGS / ENVS
#822

Docker containerization
#504

Docker testing
e-mission/e-mission-server#731

Docker images, server error
#543

NREL-hosted OpenPATH instance
#721
Docker images + Jupyter notebook 
#56

Dockerized setup procedure 
#26

Nominatim
#937

Automated Docker image build - mamba
#926

Shutdown AWS systems
#474

OTP Docker container
#534

GIS Docker container
#544


Submodules
- Splitting monolithic server into micro services
#506
- Split UI from server
#59

public-dash
#743  

admin-dash
#802

AWS Cognito
#1008

Data Encryption
#384

Heroku Deployment
#714

Docker activation script
#619

Monorepo. Viz scripts 
#605 (comment)

Tokens, Enterprise GitHub, DocumentDB, Containers
#790 (comment)

AWS server, notebook
#264

Deploy OpenPATH app to staging
#732

Token stored on server - containers, enterprise repo, document DB
#790
Tripaware Docker
#469
#624

Migrating trip segment feature to native Android
#410

Sandbox environment
#326

Docker compose UI
#657

Separate participants and testers
#642

Server dashboard crash
#645

Code coverage
#729

DynamoDB + MongoDB incompatibility; Jianli 
#597

Conda error
#511
2. e-mission-server Image Build Push yml origin perhaps?
e-mission/e-mission-server#875

Server split - AWS Cost analysis
#292

Travis CI
e-mission/e-mission-server#728
Image re-build
e-mission/e-mission-server#617

Docker files, image upload
e-mission/e-mission-server#594 (comment)

Remove webapp from server
e-mission/e-mission-server#854

Skip image build for a branch
e-mission/e-mission-server#906
3. em-public-dash Hanging line, Docker cleanup
e-mission/em-public-dashboard#37

NREL-hosted version of public-dash
#743

Production container code copied
e-mission/em-public-dashboard#54

Dockerfile ARG vs ENV
e-mission/em-public-dashboard#63

Notebook stored in .docker
e-mission/em-public-dashboard#75

Build dashboard image from server image
e-mission/em-public-dashboard#84

Pinned notebook image
e-mission/em-public-dashboard#38
Public dash origin perhaps?
#602

Jupyter notebook docker
e-mission/em-public-dashboard#81

AWS Codebuild
e-mission/em-public-dashboard#56

Docker port conflict
e-mission/em-public-dashboard#60

Dockerfile location
e-mission/em-public-dashboard#62

Docker image bump up
e-mission/em-public-dashboard#87

Dockerfile vs Dockerfile dev moved
e-mission/em-public-dashboard#56
e-mission/em-public-dashboard@513e382
Move dashboard to NREL template
e-mission/em-public-dashboard#50

Http-server vulnerabilities
e-mission/em-public-dashboard#58

Http-server global install
e-mission/em-public-dashboard#59

Vulnerabilities - tests removed
e-mission/em-public-dashboard#64
4. op-admin-dash Dev branch usage origin perhaps?
#859
Finalize production docker build
e-mission/op-admin-dashboard#32

sed to jq issue
#595
#714 (comment)
5. nrelopenpath Upgrade server base image
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/pull/19

Push remind command; Jianli, AWS
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/pull/26

Test CI Server Images
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/pull/33

Jenkins deployment issues
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/issues/9
Analysis pipeline not running
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/issues/10

Intake pipeline not running
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/issues/38
Download data from NREL network
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/issues/11
6. nrelopenpath-study-join Vulnerabilities, Flask-caching
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-study-join-page/issues/13

Join page docker container - origin perhaps?
#784
Http-server global install
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-study-join-page/pull/2

Public-dash link; staging, production environments
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-study-join-page/pull/12
7. nrelopenpath-public-dash Notebook image origin perhaps?
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-public-dashboard/pull/1
8. nrelopenpath-admin-dash Admin image not building - prod data issue
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/issues/13
Consistent containers
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/3

sed to jq
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/4

nginx, dash, working, error debug
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/issues/9

Pem file added in Dockerfile
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/6

Main to Dev branch origin perhaps?
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/7
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/8
Hack to load data
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/pull/11
9. e-mission-docker CI image used instead of dev-server
e-mission/e-mission-docker#24

ARGS used
e-mission/e-mission-docker#25
Stripped down docker images, PWD
e-mission/e-mission-docker#3

Pull repo instead of image
e-mission/e-mission-docker#11

Install emission no image
e-mission/e-mission-docker#13
e-mission/e-mission-docker#14

Devapp phone UI notebook like separate docker image
e-mission/e-mission-docker#22

Multi-tier docker compose, cronjob
e-mission/e-mission-docker#4
#410

UI Devapp standard scripts
e-mission/e-mission-docker#18

@MukuFlash03

MukuFlash03 commented Mar 5, 2024

Topic 2: Learnings from Documentation across OpenPATH project

While it’s been a lot of information to digest, we’ve gained a much better understanding of the history behind OpenPATH deployment and have compiled an expansive list of issues that we believe relate to it.
Perhaps not all of the learnings below are relevant to the redesign task, but they are somewhat related and may explain the reasoning behind the way our architecture is currently built.

Some knowledge points from our exploration are mentioned below:


1. Admin-dashboard
A. Switch to Dev from Main [1]

  • Found that we may have [started building off of dev] and not master in admin-dash as initially the master branch just had a dashboard template from the NREL team.
  • This is where the “base branch” was changed from main to dev by Jianli in the internal repo.

B. Wrappers added in Dockerfile (for other repos too) [1]

  • PEM certificates added for accessing AWS Document DB for three of the four internal repos: nrelopenpath, public-dash, admin-dash.

2. Join
A. Sebastian found the need to create a docker container; the page is no longer a redirect to the CMS [1]

  • The first commit to the Dockerfile was made on Sep 9, 2022; this is where it was set up for join.

3. Public-dash

A. Basic understanding of public-dash images [1]

  • Learnt that the public dashboard is a static entity which provides two docker containers as described by Shankari:

”It essentially provisions two docker containers: one for generating the static images (that we basically run notebooks using a cronjob to generate) and a dead simple frontend using jquery that basically displays the image.”

B. AWS Codebuild [1]

  • We were unsure how images are being built for public-dash, join, admin-dash since these were not present in buildspec.yml which builds the webapp and analysis images.
  • Came across this reference to AWS Codebuild, so it looks like there is an action set up in the Jenkins pipeline that builds images from Dockerfiles for these three repos.
  • Changes in this PR involved moving Dockerfile to the parent directory for frontend.

C. Switched to pinned notebook Image [1, 2]

  • Docker container build issue, which was a bitrot issue (which I read up on; it means software deterioration) and was dealt with by updating conda package dependencies.
  • The issue also includes a conversation on using the pre-built server image instead of cloning the repo.
  • Future ideas to build CI for public-dash were mentioned as well, but it looks like they weren’t implemented (glad to take it up now!)
  • Dockerfile renamed to Dockerfile.dev (e-mission/em-public-dashboard@513e382)
  • “Heavy lifting” (e-mission/em-public-dashboard@e08c259)
  • Dockerfile created based on Dockerfile.dev here

D. Point where we switched to using notebook image instead of server image [1]

  • We found that the start_notebook.sh script differs in external and internal versions
    • Python scripts are run for the notebooks; the else part of the CRON MODE condition, which runs scripts from crontab, was implemented.
    • These modifications were made because we are using AWS ECS scheduled tasks instead of crontab; this handles the case where the Docker container fails, in which case the crontab wouldn’t run either.

These public-dash changes, which commented out code (used in the external repo) and ran the notebooks directly to generate plots, were all made around the same time - Sep 2022.
This also coincides with the time when the public-dash base image was changed to use the docker hub notebook image instead of the server image.
So all of this seems to be a part of the hack used by Shankari and Jianli to make sure the latest deployment back then went through successfully.


4. e-mission-server

A. CI publish multiple images from multiple branches + Image_build_push.yml origin [1, 2]

This is the most important issue, as it is very close to the current task we’re working on and highlights the redesign needed.

  • Sheds light on how we moved from e-mission-docker having all related Dockerfiles to providing a Dockerfile for each repo.

Looks like we also went from a monorepo / monolithic design to a split microservices design, especially for the server code:

  • Moving from Monorepo to Microservices here
  • Webapp UI removed here

B. Learnt about the origin of Dockerhub usage + Automatic image build

C. Travis CI was tested by Shankari [1]


  • Compared different CI options back in 2020 for transition to an “external build system” outside of UC Berkeley.

D. AWS Costs Detailed Discussion [1]

E. Why or why not to use Dockerhub (costs, free tier, images removal) [1]

  • Cloud services said that compared to GitHub Container Registry, they’re more familiar with Dockerhub.
  • Dockerhub free tier should be fine as we will be updating images at least once every 6 months and they would be getting pulled as well.
  • Did not find a firm notice on whether the policy for image retention after 6 months was enforced or not. Seems like not, since the emission account images on Dockerhub are still there after last being updated 2 years ago.

Dockerhub resources: [pricing, 6 months policy, policy delayed, retention policy + docker alternatives]

F. MongoDB to AWS DocumentDB switch [1]

  • Explains why we moved from MongoDB to DocumentDB; has notes on AWS setup and issues encountered.
  • Lots of pointers shared by Jianli as well pertaining to Document DB, containers in the NREL cloud environment.
  • Could be useful information when considering the certificates added to internal Dockerfiles.

5. Nrelopenpath [INTERNAL]

  • Incorporating image_build_push.yml in internal repos here
  • CI possible future options mentioned by aguttman here
  • Analysis container commented out in docker-compose but that’s perhaps because the Codebuild scripts use Dockerfiles and not docker-compose directly here
  • Another mention to docker-compose not being used for deploying to AWS here

6. e-mission-docker

A. Origin of multi-tier Docker compose [1]

  • Point where we implemented the current webapp + analysis structure seen in internal nrelopenpath repository.
  • Multi-tier docker compose directory was setup with images being rebuilt using another set of Dockerfiles.

@nataliejschultz

nataliejschultz commented Mar 5, 2024

Topic 3: Questions for the Cloud Team + Key takeaways

We have been collaborating with cloud services back and forth.

Q1:

We understand the current process to build and deploy images looks like this:
a. Build from the Dockerfile in the external repos and push to Dockerhub
b. Rebuild in Jenkins based on the modified Dockerfiles in the internal repos
c. Push to AWS ECR

Could the process be changed to only build once, externally, like this:
a. Build from the Dockerfile in the external repos and push to Dockerhub
b. Run the Jenkins pipeline, which pulls images directly from Dockerhub (and does not rebuild)
c. Push to AWS ECR

A1:

Security is the most important reason that NREL does not pull EXTERNAL images and deploy them directly to NREL infrastructure. That's why Cloud built central ECRs for all cloud project images stored on AWS; all images need to be (regularly) scanned by Cyber.


Q2:
Is one of the reasons we need to modify the images due to using confidential credentials to access AWS Cognito (ie nrel-cloud-computing/ nrelopenpath-admin-dashboard/docker-compose-prod.yml)?

A2:

The Cloud team is not using docker-compose for deployment. The base image is pulled from EXTERNAL, i.e. your DockerHub, and we wrap the image to fit into the AWS ECS deployment. For example, in https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/blob/master/webapp/Dockerfile, we need to install certificates for your app to access the DocumentDB cluster.


Q3:
How and when is nrelopenpath/buildspec.yml run? Is it run multiple times in the pipeline for the staging and production deployments?

A3:

The CodeBuild (buildspec.yaml) run is triggered through Jenkins. On the Jenkins job, there's a checkbox that lets you choose whether or not to build the images. If not, it will use the most recently built image. If yes, it will build the image before the deployment; it builds once and deploys to multiple production instances.


Q4:
How and when are the images for the other repos (i.e. public dash, admin dash, join) built? How are they run if they're not in buildspec.yml, since we only see "webapp" and "analysis" in there?

A4:

Same as in 3: if you choose to build app images, then all of the public, join, admin, and web/analysis images are re-built. Shankari would like the Jenkins job to be simple; that's why we use one button/checkbox to control everything under the hood.


Q5:
Regarding this commit, what was the reasoning behind running the .ipynb notebooks directly?

A5:

We run the .ipynb notebooks directly because the previous code ran them with crontab. But in AWS we are not using crontab to schedule tasks; we are using ECS Scheduled Tasks - basically, we created AWS EventBridge rules to schedule the viz script runs with a cron expression. Since we are using a Docker container, the container could fail, and if it failed, the crontab within the container would not run. That's why we chose to use AWS to schedule and run the cron job as a container.


Q6:
Assuming there are not a lot of such image-wrapping tasks, it seems like tasks such as adding certificates can be moved to the EXTERNAL Dockerfile. If so, that would mean we no longer require the INTERNAL Dockerfile and would not need to build the docker image INTERNALLY.
We would still be pushing to AWS ECR?

A6:

Yes, Cloud has built the pipeline required by Cyber, which scans all images from NREL ECR regularly and reports vulnerabilities/KEVs. This means the pipeline cannot pull and scan external images; it requires all images to be built and pushed to NREL ECR as the central repos.


Q7:
For more clarity, in the buildspec.yml file (https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/blob/master/buildspec.yml), we currently have four phases: install, pre-build, build, post-build.
Would it be possible to have just three phases: install, pre-build, post-build; i.e. skipping the build stage?

Instead, in the post-build stage, just before the “docker push” to AWS ECR, we can pull the required image from Dockerhub, tag it, then push it to AWS ECR.
Thus we would not need to rebuild it, again assuming that we do not have any wrappers to be applied (these can be applied EXTERNALLY).

A7:

It sounds doable; you could update it and give it a try, as long as no NREL-specific info is exposed externally.


Some key takeaways:

  1. It is incredibly important that we store the images that are built during the Jenkins run in AWS ECR. This allows for regular, on-demand vulnerability scanning of images by Cyber.
  2. Some differences between the external and internal images include: the addition of certificates required to access the DocumentDB cluster, and image customizations made by changing certain directories (e.g. conf settings).
  3. Buildspec.yml is triggered through Jenkins. Whoever runs Jenkins can choose which images are re-built or not by checking a box.

@nataliejschultz

nataliejschultz commented Mar 5, 2024

Topic 4: Understanding of the internal repos (Need + Rebuild)

There are two questions to answer, and we’ve put forth our points after discussing with Jianli.

1. Do we need internal repos at all? Why?
Yes. We do need them.

  • Security is the major concern, due to which images need to be scanned and hence stored on AWS ECR [Jianli].
  • Add more layers like installing certificates for accessing DocumentDB. [Jianli].
  • Pushing to ECR requires AWS credentials to be set up, which are better stored in an internally protected environment. Migrating to an external process is possible, but it will be more involved.
  • Conf directories are different in internal and external repos as well as in webapp and analysis internally.

2. Do we need to re-build images both externally and internally (two sets of Dockerfiles for each repo)?

A. Possibly not needed to rebuild:

  • Additional steps like adding certificates could be done in the external Dockerfiles.
  • Could the AWS credentials (e.g. Cognito in the admin-dash config.py file) be stored in GitHub secrets, if that’s allowed by the Cloud team? Natalie has collaborated with Cloud services in the past to create an IAM role that allows direct access to AWS credentials, so it looks like we can.

B. Possibly need to rebuild:

  • For the conf files, can we avoid copying conf and start_script in the internal repo? If we cannot, since conf seems to be different / customized for each study, can it be done without building the docker image internally? If not, then we need to rebuild images to include the internal conf directories instead of the external sample conf.
  • We found some files with redundant purposes in both external and internal repos.
    • For instance, the start_script.sh in e-mission-server and nrelopenpath/webapp are pretty similar except for a few lines. This script is added as the last layer in the Dockerfile via the CMD instruction, which makes it the container's startup command.
    • This start_script.sh is already available in the built docker image, so why do we need it again in the internal repo? One reason is that the internal version of the script adds some personalized commands, like copying conf directories, if present.
    • Similarly, there is a lot of redundancy in the analysis cmd files in nrel-cloud-computing/nrelopenpath directory. All of the cmd files go through the same setup, which could potentially be done just once using a different method. These files are all identical aside from the last line or two which executes a specific script.
    • The reason for this seems to be the way we use AWS ECS Scheduled tasks instead of crontab (See Topic 3: A5).

To summarize:

  • Assuming that the extra tasks defined in the internal Dockerfiles can be completed in the external Dockerfiles themselves, then we do not need to rebuild images internally.
  • However, if some level of personalization / customization is required (like the conf directories, start_script.sh, or running the python notebooks for public-dash directly), then we perhaps need to build images in the internal repos as well.

@nataliejschultz

nataliejschultz commented Mar 5, 2024

Topic 5: Redesign Plan Updates

Redesign steps suggested in previous meeting:

1. Add all four repos to the Multi-tier docker structure

A. Current setup

  • Only server, public-dash (notebook) have images pushed to Dockerhub.
  • For server (webapp, analysis), internal repo contains Dockerfiles and some scripts only.
  • For the other external repos, there are separate internal repos, which are mostly duplicated code.

B. New setup

  • Build and push images to Dockerhub for all four repos.
  • Similar to webapp and analysis, we can have directories for the other repos; there is no need to duplicate the entire external code repos.
  • Can just update Docker image tags in the Dockerfiles and include relevant config files, scripts.
  • Need to work out how to design public-dash, admin-dash whose build process is different.

C. Feedback from previous meeting:

  • Why do we even need to build internally?
  • Remember that the current process, especially for public-dash, is a “bad hack” and must be changed.
  • Dockerhub image limit for free tier?

2. Proposed deployment process

We are still considering the one internal repository structure (mentioned above in 1.) with related files for each repo inside just one subdirectory per repo.
So, all four repos would have a ready-to-use image pushed to Dockerhub.

A. Skipping the “build” job:

  • As discussed in Part III above, in the buildspec.yml file, we currently have four phases: install, pre-build, build, post-build.
  • We can change the yaml file to have three phases: install, pre-build, post-build; i.e. skipping the build stage.
  • Instead, in the post-build stage, just before the “docker push” to AWS ECR, we can “docker pull” the required image from Dockerhub (see the sketch after this list).
  • Thus we would not need to rebuild it, again assuming that we do not have any wrappers to be applied (these can be applied EXTERNALLY) or any customizations to be made to the image.
  • Jianli said that it could be a possible option as long as no NREL-specific info is exposed externally.
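A hedged sketch of what those post-build commands might look like (the account ID, region, repository names, and TAG variable are placeholders, not the actual buildspec):

    # pull the already-built image from Dockerhub instead of rebuilding it
    docker pull emission/e-mission-server:$TAG
    # re-tag it for the central NREL ECR registry so Cyber can scan it there
    docker tag emission/e-mission-server:$TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/openpath-webapp:$TAG
    aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
    docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/openpath-webapp:$TAG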

B. Streamlining repo-specific build processes

i. E-mission-server:

ii. Join

  • Wouldn’t need any changes, just need to fit in the changes from whatever design decision we take.

iii. Admin-dash:

  • Let us consolidate to using one branch (either main or dev) since the commits match for both branches; dev is only ahead by 1 commit which is the merge main into dev commit.
  • This is possible assuming there is a way to specify branch to build from in the Jenkins pipeline.

iv. Public-dash:

  • We now understand why the customizations and the hacky build process mentioned here were used.
  • For instance, we know that the start_notebook.sh script has changes to run the python notebooks directly, to avoid the scenario where the container itself fails and the crontab jobs would not run; AWS ECS Scheduler is used for this instead.
  • We need a workaround here: replacing the cronjobs with the AWS ECS scheduler is what changes the start_notebook.sh script to run the notebooks directly.
  • The hacky method of building the notebook image and then using that image is tied to this use of AWS ECS scheduled tasks in place of crontab.

Two possibilities:

  1. Similar to how cronjobs are being handled in nrelopenpath/analysis, we can rebuild the image and push it to AWS ECR.
  • This means we build the image externally, then use the internal Dockerfile to make customizations that include the scripts to run the python notebooks without using crontab.
  • This would avoid manually re-tagging the public-dashboard notebook image and manually pushing it to Dockerhub.
  • We would still be building the image internally from the latest server image.
  2. But the question is: do we even need to run these notebooks on a schedule?
  • One reason is that, when the analysis scripts are run as cronjobs, we update the dashboard based on their results.
  • If not, can we just use the static images that were generated from the external build?

@shankari

shankari commented Mar 5, 2024

One high-level comment before our next meeting:

Conf directories are different in internal and external repos as well as in webapp and analysis internally.

Do we need all these conf directories? A lot of them have a single configuration.
See also e-mission/e-mission-server#959 (comment)

@MukuFlash03

MukuFlash03 commented Mar 14, 2024

Table for Differences in External and Internal Repositories

S.No. Repository/Container File Difference Findings Solution Needs Rebuilding Internally (Y/N)
1. Join N/A None N/A Internal matches External. No
2. Admin-dash docker/start.sh sed changed to jq Tested changing sed to jq for both admin-dash and public-dash.
Changed in script, rebuilt containers, working with jq.
Change sed to jq in external repos. No
docker/Dockerfile AWS Certificates Cannot move outside since this customization is just for our use case, as we use AWS DocumentDB.
What if someone else is using other DBs like MongoDB or other cloud services?
Natalie has an idea to use environment variables and a script that runs as a Dockerfile layer.
Keep it as it is / Natalie script. Yes
docker-compose-prod.yml Cognito credentials added. For internal Github repo, these will be stored as secrets.
These will be set up as ENV variables in the docker-compose same as current setup but will use secrets instead of directly setting values.
Use GitHub secrets + Environment variables. Yes
config.py, config-fake.py INDEX_STRING_NO_META added. We searched for history behind this addition and found that it was done as a workaround to handle a security issue with a script injection bug.
We found that Shankari had filed an issue with the dash library repository and was able to elicit a response and a fix from the official maintainers of the library.

With dash 2.10.0 (the version that fixed it), flask version <= 2.2.3 is needed.
Hence we chose the next higher versions: 2.14.1 (which raised the flask version limit) and 2.14.2 (a developer mentioned that it works).
It also works with the latest version, 2.16.1.
The issue no longer appears when tested with versions: 2.14.1, 2.14.2, 2.16.1.

Shankari updates
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/commit/23930b6687c6e2e8cd4aeb79d3181fc7af065de6
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/commit/c40f0866c76e2bfa03bef05d0daefda30625a943
plotly/dash#2536
plotly/dash#2540

Related:
plotly/dash#2699
plotly/dash#2707 [2.14.2 works]

Release Tags
https://github.com/plotly/dash/releases/tag/v2.10.0 [Contains fix for Shankari’s issue]
https://github.com/plotly/dash/releases
Upgrade Dash library version to >= 2.14.1 or to latest (2.16.1) in requirements.txt No
app_sidebar_collapsible.py OpenPATH logo removed.
Config file import added.
INDEX_STRING_NO_META added.
Except OpenPATH logo, others can be skipped.
Need more info on whether OpenPATH logo can be added or kept as it is from external version or not.
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-admin-dashboard/commit/47f023dc1d6b5a6531693a90e0575830691356ec
e-mission/op-admin-dashboard#43

Config import and INDEX_STRING_NO_META can be removed as I’ve tested that updating the dash version to version 2.16.1 and above solves the issue and we no longer need a workaround.
Decide on whether OpenPATH logo can be added to internal.
Based on Shankari’s suggestions in the open issue, we believe it’d be a good idea to store the icon as a static image (png or svg) in the local file system.

Config import and INDEX_STRING_NO_META can be removed after Dash version upgrade.
No
3. Public-dash start_notebook.sh sed changed to jq Tested changing sed to jq for both admin-dash and public-dash.
Changed in script, rebuilt containers, working with jq.
Change sed to jq in external repos. No
docker/Dockerfile AWS Certificates Cannot move outside since this customization is just for our use case, as we use AWS DocumentDB.
What if someone else is using other DBs like MongoDB or other cloud services?
Natalie has an idea to use environment variables and a script that runs as a Dockerfile layer.
Keep it as it is / Natalie script. Yes
start_notebook.sh Python notebooks split into multiple execution calls. Related to AWS ECS scheduled tasks being used instead of cron jobs. Natalie’s suggestion: event-driven updates Yes
4. Nrelopenpath
ENV variables
analysis/conf/net/ext_service/push.json Four key-value pairs containing credentials, auth tokens. Need to test whether we can pass the entire dictionary as an environment variable.
If yes, this avoids having to create 4 different ENV variables; we can just use one for the entire file.
Use environment variables. No
webapp/conf/net/auth/secret_list.json One key-value pair containing credentials, auth tokens. Need to test whether we can pass the entire list as an environment variable.
File history
e-mission/e-mission-server#802

Hardcoded now, switch to some channel later
#628 (comment)
Use environment variables. No
5. Nrelopenpath/analysis
CONF Files
conf/log/intake.conf Logging level changed. Debug level set in internal while Warning level set in external.
Debug is lower priority than Warning.
This means internally, we want to log as much as possible.

Mentioned here that this was a hack, not a good long-term solution.
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/commit/6dfc8da32cf2d3b98b200f98c1777462a9194042

Same as log/webserver.conf in webapp.
Just differs in two filenames which are log file locations.
Keep it as it is.
Decide on level of logging and details.
Yes
conf/analysis/debug.conf.json Three key-value pairs. Analysis code config values, keys. Keep it as it is. Yes
6. Nrelopenpath/webapp
CONF Files
conf/log/webserver.conf Logging level changed. Debug level set in internal while Warning level set in external.
Debug is lower priority than Warning.
This means internally, we want to log as much as possible.

Mentioned here that this was a hack, not a good long-term solution.
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/commit/6dfc8da32cf2d3b98b200f98c1777462a9194042

Same as log/intake.conf in analysis.
Just differs in two filenames which are log file locations.
Keep it as it is.
Decide on level of logging and details.
Yes
conf/analysis/debug.conf.json Nine key-value pairs. Analysis code config values, keys. Keep it as it is. Yes
conf/net/api/webserver.conf.sample 2 JSON key-value pairs.
1 pair removed.
Related to auth (2 pairs).

Looks like 404 redirect was added only in external and not in internal.
Found this commit for addition of 404 redirect in external:
e-mission/e-mission-server@964ed28

Why sample file used + skip to secret
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/commit/a66a52a8fbd28bb13c4ebad445ba21e8b478c105

Secret to skip
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/commit/cb4f54e4c201bd132c311d04a5e34a57bae2efb7

Skip to dynamic
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath/commit/52392a303ea60b73e8ea23a199b09778b306d8da
Keep it as it is. Should the redirect be added to internal as well? Yes
7. Nrelopenpath/analysis
Startup Scripts
cmd_intake.sh
cmd_push_remind.sh
cmd_build_trip_model.sh
cmd_push.sh
cmd_reset_pipeline.sh
One external startup script duplicated into 5 scripts. Crontab and start_cron.sh no longer used as they are not used in NREL-hosted production environments
https://github.nrel.gov/nrel-cloud-computing/nrelopenpath-public-dashboard/commit/2a8d8b53d7ecf24e85217fb9b812ef3936054e35.
Instead, scripts are run directly in Dockerfile as ECS can create scheduled task for cron.
TBD

Have a single script that runs these scripts based on parameters.
But how will ECS know when to run each script?
Can we execute these in Dockerfile like public-dash currently runs Python notebooks (bad hack!?)
Yes
8. Nrelopenpath/webapp
Startup Scripts
start_script.sh Additional command to copy conf files added. Need to understand what is being copied where.
Tested and saw that custom conf files added or copied over to conf directory of base server image.
Keep it as it is. Yes

@shankari

High-level thoughts:

  • Cron versus AWS scheduled tasks:
    • Can we make this be event-driven?
      • There are pros and cons of making it be event driven
      • One big con, and the reason I have set it up this way, is the latency expected by users. If a user sees a refresh button and wants to click it, they want to see an updated result in 1 second. If we run in the background instead, we can take 15 minutes to run. This may become more of an issue as we have more data and more complex analyses.
      • There is also the question of how we can schedule an event from a public website - how do we get the AWS credentials to do so into the website without leaking information?
  • Why are we using AWS scheduled tasks? Because NREL Cloud Services wants to. And they want to so that we don't have to have a container running all the time, and can have it start and stop when needed.
  • I think that the solution for this is the same as for the Admin dashboard Dockerfile, i.e. an environment variable: use cronjob or AWS depending on how it is set (see the sketch below).
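A minimal sketch of that switch, assuming a hypothetical USE_CRONTAB variable and illustrative script names (not the actual start scripts):

    #!/bin/bash
    # hypothetical startup script shared by external and NREL deployments
    if [ "$USE_CRONTAB" = "true" ]; then
        # external / self-hosted: register the jobs and keep cron in the foreground
        crontab /usr/src/app/crontab
        cron -f
    else
        # NREL: AWS ECS scheduled tasks start the container, so run the job once and exit
        python bin/intake_multiprocess.py 3
    fi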

Microscopic details:

  • Admin dashboard: docker/Dockerfile: Natalie has an idea to use environment variables and a script that runs as a Dockerfile layer. This is fine, and in that case, we do not need to rebuild locally.
  • Admin-dashboard: Cognito credentials added.: The cognito credentials in the internal repo are hardcoded to the staging values. We cannot use them because they are embedded into the image and will be the same across all instances. This is not the way that they are specified on production. Instead, as part of the AWS deploy, NREL cloud services sets a bunch of environment variables that are specific to each production instance. So:
    • these are not required in the production docker compose
    • even if they were, secrets would not really be needed because this is the internal repo and we are not worried about internal NREL users accessing staging anyway
    • Admin dashboard: config.py, config-fake.py: the current proposal is fine, but I also want to question whether config.py is even needed. IIRC, it just reads environment variables and makes them available to python. Do we need a class that does that, or can we read environment variables directly? Or write a simpler class that just makes all environment variables available as python variables.
  • Admin dashboard: INDEX...: can be upgraded
  • Admin dashboard: local file name: this would be part of the config as well (through a local file system link). You can have the URL of the image as an ARG while building or have the internal dockerfile pull it and include it while building.
  • push.json: this is currently a conf file, so not sure why it is an "ENV variable". Yes, it can be converted to a set of environment variables. 4 variables is fine.
  • secret_list.json this is currently a conf file, so not sure why it is an "ENV variable". this is currently unused and should be removed
  • webapp does not need analysis configuration
  • simplify webserver.conf significantly (several of the variables, e.g. log_base_dir, are no longer used) + convert to ENV variables instead
  • the advantage of converting this to an environment variable is that we don't need to overwrite the sample file with the value from an environment variable. So we don't need jq, and we don't need to convert webserver.conf.sample to webserver.conf, and more importantly, db.conf.sample to db.conf
  • parametrize AWS scripts and switch to cron or AWS depending on variable (see high-level above)
  • copying conf files over: do we have a lot of conf files left? If we have assertion differences between debug and prod, for example, can we just check in both versions and use the correct one based on the environment variable that is passed in (e.g. PROD = true or AWS = true or ....)? See the sketch below.
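As a hedged illustration of the env-variable-with-default idea above (the variable names and conf key path are assumptions, not the actual files):

    # current pattern: rewrite a sample conf with jq before startup
    jq --arg host "$DB_HOST" '.timeseries.url = $host' conf/storage/db.conf.sample > conf/storage/db.conf

    # proposed pattern: skip the conf file and read the value directly, with a default
    export DB_HOST=${DB_HOST:-localhost}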

@nataliejschultz

Current dealings:

@MukuFlash03 and I have been collaborating on minimizing the differences between the internal and external repositories.

I have been working on figuring out how to pass the image tag – created in the server action image-build-push.yml – between repositories. I’ve been attempting to use the upload-artifact/download-artifact method. It worked to upload the file, and we were able to retrieve the file in another repository, but we had to specify the run id for the workflow where the artifact was created. So, this defeats the purpose of automating the image tag in the first place.

We also looked into GitHub release and return dispatch as options, but decided they were not viable.

There are ways to push files from one run to another repository, though we haven’t tried them yet. Write permissions might be a barrier to this, so creating tokens will be necessary. If we can get the file pushing to work, this is our intended workflow:

  1. E-mission-server image-build-push.yml: writes image tag to file
  2. Tag file is pushed to admin dash, and public dash
  3. Push of file triggers image-build-push workflows in the other repos
  4. File is read into image-build-push workflow
  5. Tag from file set as an environment variable for workflow run
  6. Dockerfiles updated with tags
  7. Docker image build and push
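A hedged sketch of step 2 as a shell step inside the server workflow (the token secret, target repo, branch, and file name are placeholders):

    # IMAGE_TAG is assumed to hold the tag produced earlier in the workflow
    echo "$IMAGE_TAG" > image-tag
    git clone https://${GH_PAT}@github.com/e-mission/op-admin-dashboard.git
    cp image-tag op-admin-dashboard/.server-image-tag
    cd op-admin-dashboard
    git config user.name "github-actions" && git config user.email "actions@github.com"
    git add .server-image-tag
    git commit -m "Bump e-mission-server image tag to $IMAGE_TAG"
    git push origin HEAD:master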

@nataliejschultz

Had a meeting with @MukuFlash03 to discuss some issues with testing. Made a plan for documentation of testing and outlined what all needs to be done.

  • Continue adding comments to code changes on each individual PR and suggesting changes/clarifying
  • Run e-mission-server and try to run runalltests.sh. If it passes, take that as confirmation that it's working.
  • Build admin dash on dockerfile in the main directory (not docker) and see if it can be visualized. May need to change the port number in URL. Add clarifying comments on how this Dockerfile is to be used
  • Try to resolve the issue with crontab for public dash, but still show how it's currently working for our changes vs the published repos
  • Document join page result and what command was run. Do this for current external version as well.
  • Add screenshots of results in individual PRs.

@MukuFlash03

MukuFlash03 commented Apr 24, 2024

Docker Commands for Testing Code Changes

Posting a list of the docker commands I used to verify whether the docker images were building successfully.
Next, I also tested whether containers could be run from the images.

I had to ensure that the configurations set up in the docker-compose files were set manually by me, since docker-compose is not used any more for the internal images.
These settings included things like ports, networks, volumes, environment variables.


Creating a network so containers can be connected to each other:

$ docker network create emission

A DB container is needed for storage; data must be loaded into it (I did not load data when I did this testing initially):

$ docker run --name db -d -p 27017:27017 --network emission mongo:4.4.0

A. Internal Repo Images

Check out the multi-tier branch in the internal repo -> nrelopenpath


  1. Webapp
$ docker build -t int-webapp ./webapp/
$ docker run --name int-webapp-1 -d -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e STUDY_CONFIG=stage_program --network emission int-webapp

  2. Analysis
$ docker build -t int-analysis ./analysis/
$ docker run --name int-analysis-1 -d -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e STUDY_CONFIG=stage_program -e PUSH_PROVIDER="firebase" -e PUSH_SERVER_AUTH_TOKEN="Get from firebase console" -e PUSH_APP_PACKAGE_NAME="full package name from config.xml. e.g. edu.berkeley.eecs.emission or edu.berkeley.eecs.embase. Defaults to edu.berkeley.eecs.embase" -e PUSH_IOS_TOKEN_FORMAT="fcm" --network emission int-analysis

  3. Join
$ docker build -t int-join ./join_page
$ docker run --name int-join-1 -d -p 3274:5050 --network emission int-join

Sometimes, during local testing, the join and public-dash frontend pages might load the same html file, since the same host port is mapped to either join or public-dash depending on which is run first.
So, change the join port to a different one (just for testing purposes):

$ docker run --name int-join-1 -d -p 2254:5050 --network emission int-join

  4. Admin-dash
$ docker build -t int-admin-dash ./admin_dashboard
$ docker run --name int-admin-dash-1 -d -e DASH_SERVER_PORT=8050 -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e CONFIG_PATH="https://raw.githubusercontent.com/e-mission/nrel-openpath-deploy-configs/main/configs/" -e STUDY_CONFIG="stage-program" -e DASH_SILENCE_ROUTES_LOGGING=False -e SERVER_BRANCH=master -e REACT_VERSION="18.2.0" -e AUTH_TYPE="basic" -p 8050:8050 --network emission int-admin-dash

  5. Public-dash frontend
$ docker build -t int-public-dash-frontend ./public-dashboard/frontend/
$ docker run --name int-public-dash-frontend-1 -d -p 3274:6060 -v ./plots:/public/plots --network emission int-public-dash-frontend

  6. Public-dash viz_scripts
$ docker build -t int-public-dash-notebook ./public-dashboard/viz_scripts/
$ docker run --name int-public-dash-notebook-1 -d -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e STUDY_CONFIG="stage-program" -p 47962:8888 -v ./plots:/plots --network emission int-public-dash-notebook


B. External Repo Images

  1. Directly pull latest pushed image from Dockerhub for each repo in its Dockerfile
docker run --name container_name image_name 
  2. Alternatively, we can try building from the external repos after switching to the consolidate-differences branch for e-mission-server and the image-push branch for join, admin-dash, public-dash.
    We will have to use docker build commands similar to the internal images above.
docker build -t image_name -f path/to/Dockerfile .
docker run --name container_name image_name
  3. Or, we can run docker-compose commands, since the external repos still have docker compose files.
Join and Public-dash:  $ docker-compose -f docker-compose.dev.yml up -d
Admin-dash: $ docker compose -f docker-compose-dev.yml up -d

Initially, I used option 1.


  1. E-mission-server
$ docker run --name em-server-1 -d -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e STUDY_CONFIG=stage-program --network emission mukuflash03/e-mission-server:image-push-merge_2024-04-16--49-36
  2. Join
$ docker run --name join-2 -d -p 3274:5050 --network emission mukuflash03/nrel-openpath-join-page:image-push-merge_2024-03-26--22-47
  3. Op-admin
$ docker run --name op-admin-2 -d -e DASH_SERVER_PORT=8050 -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e CONFIG_PATH="https://raw.githubusercontent.com/e-mission/nrel-openpath-deploy-configs/main/configs/" -e STUDY_CONFIG="stage-program" -e DASH_SILENCE_ROUTES_LOGGING=False -e SERVER_BRANCH=master -e REACT_VERSION="18.2.0" -e AUTH_TYPE="basic" -p 8050:8050 --network emission mukuflash03/op-admin-dashboard:image-push-merge_2024-04-16--00-11
  4. Public-dash Frontend / dashboard
$ docker run --name public-dash-frontend-1 -d -p 3274:6060 -v ./plots:/public/plots --network emission mukuflash03/em-public-dashboard:image-push-merge_2024-04-16--59-18
  5. Public-dash Viz_scripts / notebook-server
$ docker run --name public-dash-viz-1 -d -e DB_HOST=db -e WEB_SERVER_HOST=0.0.0.0 -e STUDY_CONFIG="stage-program" -p 47962:8888 -v ./plots:/plots --network emission mukuflash03/em-public-dashboard_notebook:image-push-merge_2024-04-16--59-18

@shankari

For automatic updates of the tags, we have three options:

  1. reading the tag automatically from a GitHub action in another repo (not clear that this works)
    • how can you pass a run_id between repositories?
  2. pushing a file from a github action in one repo to a github action in another repo (untried)
  3. use Github hooks and/or the GitHub API to be notified when there is a new release and to trigger a workflow based on that (untried)

@nataliejschultz

For automatic updates of the tags, we have three options:

  1. pushing a file from a github action in one repo to a github action in another repo (untried)

@MukuFlash03 See my comment above, quoted below, for an outline of the steps to try to get the file pushing method to work:

There are ways to push files from one run to another repository, though we haven’t tried them yet. Write permissions might be a barrier to this, so creating tokens will be necessary. If we can get the file pushing to work, this is our intended workflow:

  1. E-mission-server image-build-push.yml: writes image tag to file
  2. Tag file is pushed to admin dash, and public dash
  3. Push of file triggers image-build-push workflows in the other repos
  4. File is read into image-build-push workflow
  5. Tag from file set as an environment variable for workflow run
  6. Dockerfiles updated with tags
  7. Docker image build and push

@shankari

There is also
https://stackoverflow.com/questions/70018912/how-to-send-data-payload-using-http-request-to-github-actions-workflow
which is a GitHub API-fueled approach to passing data between repositories

@MukuFlash03

MukuFlash03 commented Apr 26, 2024

Summary of approaches tried for automating docker image tags

Requirements:

  1. Docker image tags generated in e-mission-server should be available in admin-dash and public-dash repository workflows.
  2. Successful completion of docker image workflow run in e-mission-server should trigger workflows to build and push docker images in admin-dash and public-dash repositories.
  3. Dockerfiles in admin-dash and public-dash should be updated with the latest docker image tags received from the latest successfully completed "docker image" workflow run of e-mission-server.

Notes:

  • Tests done on join repo as well since it had the least amount of changes for the redesign PR and felt like a cleaner repo to experiment on. Once things were working on join repo, I went ahead and implemented them on the dashboard repos.
  • GitHub tokens were utilized with the specific access scopes required, as stated in each approach's documentation.
    • Used classic Personal access tokens for downloading artifacts.
    • Used fine-grained tokens for GitHub REST APIs.

Current status:

Implemented

  • Successfully met requirements 1) and 2).
    • Able to trigger workflows in the other three repos (join repo used for testing) automatically whenever a push or merge is made to e-mission-server repo's branch.
    • Docker image tags generated in the latest successfully completed workflow run are flowing from e-mission-server to the workflows in the other three repositories, and able to read the tags in these repos.

For reference, matrix strategy for workflow dispatch events

  • This e-mission-server run triggered these three runs in admin-dash, public-dash, join.
  • Similarly the actions contains successful workflow runs for the artifact method and the workflow dispatch without matrix strategy method.

Pending

  • Req. 3) is pending and will be the next major task to work on.

Approaches planned to try out:

  1. Artifacts upload / download with the use of workflow run ID.
  2. Pushing files from one repo to the other repos.
  3. Using GitHub REST APIs and webhooks to trigger workflows

Approaches actually tested, implemented and verified to run successfully

  1. Artifacts + Run ID
  2. GitHub REST APIs + webhooks

Reason for not trying out Approach 2:

I tried out and implemented Approaches 1 and 3 first.
Approach 3 was necessary for triggering a workflow based on another workflow in another repository.
Approach 1 sort of included Approach 2 of pushing files in the form of artifacts.
Both Approach 1 and 2 would need Approach 3 to trigger workflows in the dashboard repos at the same time.
With these two approaches implemented, I was done with Requirements 1) and 2).

The next major task was to work on Req. 3), which involves updating the Dockerfiles in the dashboard repos.
Approach 2 is related to this, since actual files in the repos would need to be modified and committed. This is in contrast to the files and text data passed around in Approaches 1 and 3, which were handled entirely inside the GitHub Actions workflow runner; the artifact files were available outside the runner after its execution, but they were still tied to the workflow run.
Since completing Req. 3) would require handling the Dockerfiles outside the workflow runs anyway, I skipped Approach 2, as I would effectively be working on it in the process.

@MukuFlash03

MukuFlash03 commented Apr 26, 2024

Details of approaches

In my forked repositories for e-mission-server, join, admin-dash, and public-dash, there are three branches for the tags automation: tags-artifact, tags-dispatch, tags-matrix.

tags-artifact branch in: e-mission-server, admin-dash, public-dash, join

tags-dispatch branch in: e-mission-server, admin-dash, public-dash, join

tags-matrix branch in: e-mission-server, admin-dash, public-dash, join

Approach 1: tags-artifact:
Approach 3: tags-dispatch, tags-matrix


  1. tags-artifact [Approach 1: Artifact + Run ID]
  • Official documentation: upload-artifact, download-artifact
  • This involves using the artifact upload and download GitHub actions to make any file generated inside the workflow run available outside the runner execution but still inside the workflow as a downloadable .zip file.
  • This file can then be downloaded in another repository using a personal access token with repo access, the workflow run ID, and the source repository name (including the user or organization name).
  • The workflow run ID was an obstacle, as it was unclear how to fetch it automatically.
    • I was finally able to fetch it using a Python script that uses a GitHub REST API endpoint to fetch all runs of a workflow.
    • Then I filtered these runs by status (completed + success) and by the source repo's branch name.
    • Finally, the filtered runs are sorted in descending order of last-updated time, which gives the latest workflow run ID in e-mission-server (a sketch of this lookup follows the cons below).

Cons:

  • An extra Python script and an additional job are needed in the YAML file, which means setting up Python in the workflow runner and installing any dependencies the script requires.
  • Even if this method is kept, we would still need one of the two variants in which Approach 3 is implemented; that is necessary to meet one of the primary requirements (Req. 2): triggering multiple workflows after the workflow run in e-mission-server completes.
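For reference, the run ID lookup described above can also be sketched as a single REST API call from within a workflow step (the actual implementation uses a Python script; the repo, branch, workflow file name, and GH_PAT secret below are placeholders):

    - name: Fetch latest successful e-mission-server run ID
      run: |
        # List successful runs of the server workflow on the given branch,
        # newest first, and take the first run's ID
        RUN_ID=$(curl -s \
          -H "Accept: application/vnd.github+json" \
          -H "Authorization: Bearer ${{ secrets.GH_PAT }}" \
          "https://api.github.com/repos/e-mission/e-mission-server/actions/workflows/image_build_push.yml/runs?branch=master&status=success&per_page=1" \
          | jq -r '.workflow_runs[0].id')
        echo "run_id=$RUN_ID" >> $GITHUB_OUTPUT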

  2. tags-dispatch [Approach 3: GitHub REST APIs]
  • Official documentation: workflow dispatch events
  • This uses a GitHub REST API endpoint for the workflow dispatch events which sends POST requests to a target repository to trigger workflows in those repositories.
  • This required a fine-grained token with the access scope actions: write.
  • Additionally, this can pass data, such as the docker image tags, as JSON via the request body; the target repositories then receive it through the workflow_dispatch event's inputs.
    - name: Trigger workflow in join-page, admin-dash, public-dash
      run: |
        curl -L \
          -X POST \
          -H "Accept: application/vnd.github+json" \
          -H "Authorization: Bearer ${{ secrets.GH_FG_PAT_TAGS }}" \
          -H "X-GitHub-Api-Version: 2022-11-28" \
          https://api.github.com/repos/MukuFlash03/op-admin-dashboard/actions/workflows/90180283/dispatches \
          -d '{"ref":"tags-dispatch", "inputs": {"docker_image_tag" : "${{ steps.date.outputs.date }}"}}'

# Similarly for public-dash, join

  3. tags-matrix [Approach 3: GitHub REST APIs]
  • Official documentation: matrix strategy
  • Similar to tags-dispatch, wherein workflow dispatch events are used.
  • The only difference is that the matrix strategy dispatches parallel events to the target repositories once the source repository (e-mission-server) successfully completes execution.
  • This also shows the dispatch events as a 2nd set of combined jobs in the workflow run graph.
    • This required a fine-grained token with the access scope actions: write.
    strategy:
      matrix:
        repo: ['MukuFlash03/nrel-openpath-join-page', 'MukuFlash03/op-admin-dashboard', 'MukuFlash03/em-public-dashboard']

    - name: Trigger workflow in join-page, admin-dash, public-dash
      run: |
        curl -L \
          -X POST \
          -H "Accept: application/vnd.github+json" \
          -H "Authorization: Bearer ${{ secrets.GH_FG_PAT_TAGS }}" \
          -H "X-GitHub-Api-Version: 2022-11-28" \
          https://api.github.com/repos/${{ matrix.repo }}/actions/workflows/image_build_push.yml/dispatches \
          -d '{"ref":"tags-matrix", "inputs": {"docker_image_tag" : "${{ env.DOCKER_IMAGE_TAG }}"}}'

Pros of workflow dispatch events and the matrix strategy:

  • An advantage of using the workflow dispatch events is that we do not need metadata like the run ID.
  • There is an option to use the workflow ID, but it can be replaced by the workflow file name; hence even the workflow ID isn't needed.
  • I did calculate the workflow ID for each workflow file "image_build_push.yml" in the target repositories by using these API endpoints: e-mission-server, join workflows, admin-dash workflows, public-dash
  • I have kept the same workflow file name across the target repositories; hence, in the tags-matrix branch, I can use the same workflow file name, with only the repository name varying as defined in the matrix, to run the curl command for all the repositories.

@shankari
Contributor Author

The next major task was to work on Req 3) which involves updating the Dockerfiles in the dashboard repos.

I don't think that the solution should be to update the Dockerfiles. That is overkill and is going to lead to merge conflicts. Instead, you should have the Dockerfile use an environment variable, and then set the environment variable directly in the workflow or using a .env file

@nataliejschultz

Instead, you should have the Dockerfile use an environment variable, and then set the environment variable directly in the workflow or using a .env file

Our primary concern with this method was for users building locally. Is it acceptable to tell users to copy the latest image from the docker hub image library in the README?

@shankari
Contributor Author

shankari commented Apr 29, 2024

docker-compose can use local images

The comment around .env was for @MukuFlash03's task to update the tag on the Dockerfile, which is not related to testing, only for the GitHub triggered actions.

@MukuFlash03

MukuFlash03 commented May 2, 2024

Docker tags automation working end-to-end!

Finally got the tags automation to work end-to-end in one click: starting from the e-mission-server workflow, passing the latest timestamp used as the docker image tag suffix, and then triggering the workflows in admin-dashboard and public-dashboard.

Final approach taken for this involves a combination of the artifact and the matrix-dispatch methods discussed here.

Additionally, as suggested by Shankari here, I changed the Dockerfiles to use environment variables set in the workflow runs themselves. Hence, hardcoded timestamp values are no longer used or updated in the Dockerfiles.

I don't think that the solution should be to update the Dockerfiles. That is overkill and is going to lead to merge conflicts. Instead, you should have the Dockerfile use an environment variable, and then set the environment variable directly in the workflow or using a .env file.


There is still a manual element remaining, however this is to do with any users or developers looking to work on the code base locally with the dashboard repositories.
The docker image tag (only the timestamp part) will need to be manually copied from the latest Dockerhub image of the server repo and added to the args in the docker-compose files.

This is also what @nataliejschultz had mentioned here:

Our primary concern with this method was for users building locally. Is it acceptable to tell users to copy the latest image from the docker hub image library in the README?

# Before adding tag
    build:
      args:
        DOCKER_IMAGE_TAG: ''

# After adding tag
    build:
      args:
        DOCKER_IMAGE_TAG: '2024-05-02--16-40'

@MukuFlash03

MukuFlash03 commented May 2, 2024

Implementation
Approaches discussed here.

Combined approach (artifact + matrix) tags-combo-approach branch: e-mission-server, admin-dash, public-dash


Successful workflow runs:

  1. Workflow dispatch on modifying code in server repo:
  • e-mission-server - image build push (link)
    [screenshot: server_push]
  • admin-dash - fetch tag and image build push (link)
    [screenshot: admin_workflow_dispatch]
  • public-dash - fetch tag and image build push (link)
    [screenshot: public_workflow_dispatch]
  2. Push event on modifying code in admin-dash or public-dash repo:
  • admin-dash - push event trigger (link)
    [screenshot: admin_push]
  • public-dash - push event trigger (link)
    [screenshot: public_push]

@MukuFlash03

MukuFlash03 commented May 2, 2024

I decided to go ahead with the matrix-build strategy, which dispatches workflows to multiple repositories when triggered from one source repository. I had implemented this in the tags-matrix branches of the dashboard repos (the join repo as well, but only for initial testing; the final changes are only on the dashboard repos).

Initially, I only had a push event trigger, similar to the docker image build and push workflow in the server repo.
However, I realized that there would now be two types of GitHub Actions events that should trigger the workflows in the admin-dashboard and public-dashboard repos.
The second type of trigger would be a workflow_dispatch event.
This was implemented and working via the matrix-build workflow dispatch branch.

Now, for the workflow dispatch event, I was able to pass the latest generated docker image timestamp directly from the e-mission-server workflow in the form of an input parameter, docker_image_tag.

    - name: Trigger workflow in admin-dash, public-dash
      run: |
        curl -L \
          -X POST \
          -H "Accept: application/vnd.github+json" \
          -H "Authorization: Bearer ${{ secrets.GH_FG_PAT_TAGS }}" \
          -H "X-GitHub-Api-Version: 2022-11-28" \
          https://api.github.com/repos/${{ matrix.repo }}/actions/workflows/image_build_push.yml/dispatches \
          -d '{"ref":"tags-combo-approach", "inputs": {"docker_image_tag" : "${{ env.DOCKER_IMAGE_TAG }}"}}'

This parameter was then accessible in the workflows of the dashboard repos:

on:
  push:
    branches: [ tags-combo-approach ]

  workflow_dispatch:
    inputs:
      docker_image_tag:
        description: "Latest Docker image tags passed from e-mission-server repository on image build and push"
        required: true

@MukuFlash03

MukuFlash03 commented May 2, 2024

With these changes, I believed I was done, but then I came across some more issues. They are all resolved now, but I am mentioning them here for reference.


  1. If a push event triggered the workflow, an empty string was being passed into the ENV variable.
  • This was solved by introducing the artifact method discussed in the comment above

Why I chose to add the artifact method as well

The issue was fetching the latest timestamp for the image tag in the case of a push event trigger. With a workflow dispatch, the server workflow itself triggers the dashboard workflows and is therefore connected to them. A push event, however, only triggers the specific workflow in that specific dashboard repository to build and push the image, so that workflow cannot retrieve the image tag directly.

So, I utilized the artifact upload and download method (sketched below) to:

  • upload the image timestamp as an artifact in the workflow run for future use.
  • download the uploaded artifact from the latest successfully completed workflow run in the e-mission-server repo for a specific branch (currently set to tags-combo-approach, to be changed to master once the changes are final).
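A minimal sketch of the two sides of this, assuming actions/upload-artifact and actions/download-artifact v4; the artifact name, secret name, and run-id wiring are placeholders rather than the exact steps in the PRs:

    # In the e-mission-server workflow: save the timestamp as an artifact
    - name: Save docker image tag to a file
      run: echo "${{ steps.date.outputs.date }}" > tag_file.txt
    - name: Upload docker image tag artifact
      uses: actions/upload-artifact@v4
      with:
        name: docker-image-tag
        path: tag_file.txt

    # In the dashboard workflows: download it from the latest successful server run
    - name: Download docker image tag artifact
      uses: actions/download-artifact@v4
      with:
        name: docker-image-tag
        repository: e-mission/e-mission-server
        run-id: ${{ needs.fetch_run_id.outputs.run_id }}
        github-token: ${{ secrets.GH_PAT }}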

  2. There are three jobs in the workflows in the dashboard repos: fetch_run_id, fetch_tag, build.
    fetch_run_id must always run and complete before the build job begins; but build was finishing first and building images with the incorrect image tag, since the tag wasn't yet available because the fetch jobs hadn't completed.
  • The solution was to use the needs keyword to create chained jobs that depend on each other and always wait for the previous job to complete before executing.
  • Additionally, job output variables and environment variables were used in the workflow to pass values from one job to another (see the sketch below).
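A minimal sketch of the chained jobs, with placeholder step bodies (the real fetch jobs query the REST API and read the downloaded artifact):

    jobs:
      fetch_run_id:
        runs-on: ubuntu-latest
        outputs:
          run_id: ${{ steps.run_id.outputs.run_id }}
        steps:
          - id: run_id
            # placeholder; the real step queries the GitHub REST API
            run: echo "run_id=123456789" >> $GITHUB_OUTPUT

      fetch_tag:
        needs: fetch_run_id
        runs-on: ubuntu-latest
        outputs:
          docker_image_tag: ${{ steps.tag.outputs.docker_image_tag }}
        steps:
          - id: tag
            # placeholder; the real step reads the tag from the artifact
            run: echo "docker_image_tag=2024-05-03--14-37" >> $GITHUB_OUTPUT

      build:
        needs: fetch_tag
        runs-on: ubuntu-latest
        env:
          DOCKER_IMAGE_TAG_1: ${{ needs.fetch_tag.outputs.docker_image_tag }}
        steps:
          - run: echo "building with tag $DOCKER_IMAGE_TAG_1"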

  3. Switching to using ARG environment variables in the Dockerfiles was tricky, as I had to figure out how to pass the appropriate timestamp tags considering the two event triggers: push and workflow_dispatch.

Dockerfiles' FROM layer looks like:

ARG DOCKER_IMAGE_TAG
FROM mukuflash03/e-mission-server:tags-combo-approach_${DOCKER_IMAGE_TAG}

Solution I implemented involves defining two DOCKER_IMAGE_TAGS in the workflow file, one for push, the other for workflow_dispatch:

    env:
      DOCKER_IMAGE_TAG_1: ${{ needs.fetch_tag.outputs.docker_image_tag }}
      DOCKER_IMAGE_TAG_2: ${{ github.event.inputs.docker_image_tag }}

I then passed either of these as the --build-arg for the docker build command depending on the event trigger:

    - name: build docker image
      run: |
        if [ "${{ github.event_name }}" == "workflow_dispatch" ]; then
          docker build --build-arg DOCKER_IMAGE_TAG=$DOCKER_IMAGE_TAG_2 -t $DOCKER_USER/${GITHUB_REPOSITORY#*/}:${GITHUB_REF##*/}_${{ steps.date.outputs.date }} .
        else
          docker build --build-arg DOCKER_IMAGE_TAG=$DOCKER_IMAGE_TAG_1 -t $DOCKER_USER/${GITHUB_REPOSITORY#*/}:${GITHUB_REF##*/}_${{ steps.date.outputs.date }} .
        fi

  4. How should docker image tags be updated when developers want to build locally?
    The solution was to provide an option to add the docker image tag manually from the latest server image pushed to Docker Hub.
    This is discussed in this comment above.

The README.md can contain information on how to fetch this tag, similar to how we ask users to manually set their study-config, DB host, and server host info, for instance.

@shankari
Contributor Author

shankari commented May 2, 2024

wrt merging, I am fine with either approach

  1. put both the automated builds + code cleanup into one PR, and the cross-repo automated launches into another PR. You can have the second PR targeted to merge into the first PR so that it only has the changes that are new for the second level of functionality
  2. One giant PR. I looked at join and I don't think that the changes will be that significant, and I am OK with doing a more complex review if that is easier for you.

@MukuFlash03

Build completely automated!

No manual intervention required, not even by developers using the code locally

Referring to this comment:

There is still a manual element remaining, however this is to do with any users or developers looking to work on the code base locally with the dashboard repositories. The docker image tag (only the timestamp part) will need to be manually copied from the latest Dockerhub image of the server repo and added to the args in the docker-compose files.

I've gone ahead and implemented the automated build workflow with the addition of a .env file in the dashboard repos, which stores the latest timestamp from the last successfully completed server image build.

Thus, the build is now completely automated, and users / developers who want to run the code locally will not have to manually feed in the timestamp from the Docker Hub images.

The .env file is updated and committed automatically in the GitHub Actions workflow, and the changes are pushed to the dashboard repo by the github-actions bot.
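A minimal sketch of that step in the dashboard workflows, assuming actions/checkout has already run and the tag is available in DOCKER_IMAGE_TAG_1 (the commit message and bot identity are illustrative):

    - name: Update .env with the latest server image tag
      run: |
        echo "DOCKER_IMAGE_TAG=$DOCKER_IMAGE_TAG_1" > .env
        # Only commit if the tag actually changed
        if ! git diff --quiet .env; then
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .env
          git commit -m "Update docker image tag in .env"
          git push origin HEAD
        fi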


Links to successful runs

A. Triggered by Workflow_dispatch from e-mission-server
Server run, Admin-dash run, Public-dash run

Automated commits to update .env file:
Admin-dash .env, Public-dash .env


B. Triggered by push to remote dashboard repositories
Admin-dash run, Public-dash run

Automated commits to update .env file:
Admin-dash .env, Public-dash .env


@MukuFlash03

I also tested another scenario: say a developer changed the timestamp in the .env file to test an older server image, and then accidentally pushed this older timestamp to their own repo. What happens when they create a PR whose changes include this older server image?

  • Shankari will be able to see that the .env file was updated and can then ask the developer to revert the changes to the .env file.
  • If this change is somehow missed by Shankari, the workflow will take care of it and restore the latest timestamp tag from the e-mission-server run by updating the .env file correctly within the workflow runs in the dashboard repos.

Thus, the expected workflow steps in this case would be:

  • Push event triggers workflow.
  • It writes DOCKER_IMAGE_TAG_1, fetched from the last successfully completed run, to the .env file.
  • It sees that the latest committed .env file in the dashboard repo differs, since it contains the older timestamp committed by the developer while working locally.
  • Hence, it runs the git commit part of the script to reset to the latest server timestamp stored in DOCKER_IMAGE_TAG_1.

Some outputs from my testing of this scenario, where I manually entered an older timestamp (2024-05-02--16-40), but the workflow automatically updated it to the latest timestamp (2024-05-03--14-37):

A. Public-dash

mmahadik-35383s:em-public-dashboard mmahadik$ cat .env
DOCKER_IMAGE_TAG=2024-05-02--16-40
mmahadik-35383s:em-public-dashboard mmahadik$ git pull origin tags-combo-approach
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Total 2 (delta 1), reused 2 (delta 1), pack-reused 0
Unpacking objects: 100% (2/2), 262 bytes | 65.00 KiB/s, done.
From https://github.com/MukuFlash03/em-public-dashboard
 * branch            tags-combo-approach -> FETCH_HEAD
   9444e60..40beb80  tags-combo-approach -> origin/tags-combo-approach
Updating 9444e60..40beb80
Fast-forward
 .env | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
mmahadik-35383s:em-public-dashboard mmahadik$ cat .env
DOCKER_IMAGE_TAG=2024-05-03--14-37

B. Admin-dash

mmahadik-35383s:op-admin-dashboard mmahadik$ cat .env 
DOCKER_IMAGE_TAG=2024-05-02--16-40
mmahadik-35383s:op-admin-dashboard mmahadik$ git pull origin tags-combo-approach
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 1), reused 3 (delta 1), pack-reused 0
Unpacking objects: 100% (3/3), 308 bytes | 77.00 KiB/s, done.
From https://github.com/MukuFlash03/op-admin-dashboard
 * branch            tags-combo-approach -> FETCH_HEAD
   d98f75c..f1ea34c  tags-combo-approach -> origin/tags-combo-approach
Updating d98f75c..f1ea34c
Fast-forward
 .env | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
mmahadik-35383s:op-admin-dashboard mmahadik$ cat .env
DOCKER_IMAGE_TAG=2024-05-03--14-37

@MukuFlash03

Also added TODOs to change from my repository and branches to the master branch and the e-mission-server repo.

@nataliejschultz

@shankari on the internal repo:

Right now, for the server images that this PR is built on:

  • wait for the server build to complete
  • copy the server image tag using command+C
  • edit webapp/Dockerfile and analysis/Dockerfile and change the tag using command-V
  • commit the changes
  • push to the main branch of the internal repo
  • launch build using Jenkins

Ideally the process would be:

  • something something updates, commits and pushes the updated tags to the main branch of internal repo

    • it is fine for this to be a manual action, at least initially, but I want one manual action (one button or one script)
    • creating a PR that I merge would be OK but sub-optimal. Short-term ideally, this would just push directly to the repo so no merge is required.
    • could this run in Jenkins? No visibility into Jenkins. We should write a script as a template for cloud services if this is even possible.
  • I manually launch build using Jenkins

Initial thoughts about a script (a rough sketch follows this list):

  • Pull the image tags from the external repos (GitHub API?)
  • Write those image tags into the Dockerfiles for each repository
  • Create a PR that's auto-merged, so the tags are ready to go for the Jenkins pipeline
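One possible sketch of such a script, swapping in the Docker Hub tags API to find the latest server tag (the image namespace, Dockerfile paths, branch name, and commit message are placeholders, and the auto-merged-PR step is left out):

    #!/bin/bash
    # Fetch the most recently pushed e-mission-server tag from Docker Hub
    LATEST_TAG=$(curl -s "https://hub.docker.com/v2/repositories/emission/e-mission-server/tags?page_size=25" \
      | jq -r '.results | sort_by(.last_updated) | last | .name')
    echo "Latest server tag: $LATEST_TAG"

    # Write it into the internal Dockerfiles
    sed -i "s|^FROM .*e-mission-server:.*|FROM emission/e-mission-server:${LATEST_TAG}|" \
      webapp/Dockerfile analysis/Dockerfile

    # Commit and push so the Jenkins build can pick it up
    git add webapp/Dockerfile analysis/Dockerfile
    git commit -m "Bump server image tag to ${LATEST_TAG}"
    git push origin main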

Where to run?

  • Could this script potentially run in GitHub Actions?
    • This would require connecting the repos, which we're not sure is possible at the moment
  • Run script locally
  • Run script in Internal repo as part of start script? Potentially with GitHub API

@shankari
Contributor Author

I have created all the tokens needed; we just need to clean this up to a basic level and merge.
Once we are done with the basic level, I think we will still need one more round of polishing for this task, but we can track that in a separate issue

[Screenshot: 2024-05-19 at 11:21 AM]

@shankari
Contributor Author

shankari commented May 19, 2024

I can see that we have used docker build and docker run directly both in the PRs and while testing them (e.g.)
#1048 (comment)
or
https://github.com/e-mission/em-public-dashboard/pull/125/files#diff-bde90ebb933051b12f18fdcfcefe9ed31e2e3950d416ac84aec628f1f9cc2780R136

This is bad. We use docker-compose extensively in the READMEs, and we should be standardizing on it.
Using docker build or docker run makes it more likely that we will make mistakes in the way that the containers are configured and interact with each other.

I have already commented on this before:
e-mission/em-public-dashboard#125 (comment)

I will not review or approve any further changes that use docker build or docker run unless there is a reason that docker-compose will not work

@nataliejschultz

I got docker compose to work in Actions for our process, but had to do it in a roundabout way.
The issue is that we want to push an image with the correct tag, and docker build allows you to specify the image name and tag using the -t flag. Docker compose does not work this way; you have to name the image in the compose file directly, like this:

services:
  dashboard:
    build: .
    image: name_of_image

Originally, I had planned to use an environment variable in my compose call

SERVER_IMAGE_TAG=$DOCKER_IMAGE_TAG_2 ADMIN_DASH_IMAGE_TAG=$DOCKER_USER/${GITHUB_REPOSITORY#*/}:${GITHUB_REF##*/}_${{ steps.date.outputs.date }} docker compose -f docker-compose-dev.yml build

and then set the name of the image to ${ADMIN_DASH_IMAGE_TAG}. However, this does not seem ideal for people running locally. I found a way around this by adding a renaming step in the build process:

    - name: rename docker image
      run: |
        docker image tag e-mission/opdash:0.0.1 $DOCKER_USER/${GITHUB_REPOSITORY#*/}:${GITHUB_REF##*/}_${{ steps.date.outputs.date }}

This way we can keep the names of the images the same and push them correctly. I tested the environment variable version here, and the renaming version here. Both worked!
