
CI to publish multiple e-mission-server images from multiple branches on release/commit #752

Closed
shankari opened this issue Jul 8, 2022 · 22 comments



shankari commented Jul 8, 2022

In the NREL hosted environment, the cronjobs are run as AWS scheduled tasks.
This causes the containers running them to be created and destroyed for every run.
This implies that the containers should be fast to create.

The analysis tasks are currently created using the emission/e-mission-server.dev.server-only images. This allows maximum flexibility, since we clone the server code and set up the related dependencies at run-time. It works fine for a long-running analysis container with scripts that are launched using cronjobs.

However, it is a terrible design for scheduled tasks where the container is created on demand. In particular, setting up the dependencies using conda is extremely slow and resource intensive. When we launched multiple scheduled tasks in parallel on the NREL hosting environment, they consumed the entire cluster.

From @jgu2

In this case, there will be 5 scheduled tasks to run in parallel, it looks like each task could consume much of memory, and the parallel jobs would consume 100% of the cluster.

So we really need to create images with the conda dependencies and the server code preinstalled. The scheduled tasks can then spin up the container and execute the task quickly.

Our default Dockerfile currently does this
https://github.com/e-mission/e-mission-docker/blob/master/Dockerfile

And we currently push it to the emission/e-mission-server dockerhub image
https://github.com/e-mission/e-mission-docker/blob/master/README.md#docker-build-instructions

However, we need to support better automation and flexibility.
@aGuttman can you handle this task in coordination with @jgu2?


shankari commented Jul 8, 2022

Concretely:

  • revisit the design decision around whether we should have the Dockerfile in a separate e-mission-docker repo or move it back to the e-mission-server repo.
    • The latter seems to be more standard and might be better supported by standard GitHub tools
    • but in that case, I think it needs to be changed to copy files directly instead of cloning from the e-mission-server repo (doesn't make sense to have a Dockerfile clone the repo if you are already in the repo). What is standard practice?
    • not sure how best to do this from multiple branches (just check out the branch and then build?)
  • since the image will embed the server code, make sure to create multiple images, at least for the master, gis-based-mode-detection and join_redirect_to_static branches
    • presumably these will be multiple tags for the same image (see the sketch after this list)
  • automatically build the images using CI
    • make a design decision on whether to build on every commit, or only on every release
    • if on release, how do we specify the branch to build from? or do we build all branches for every release?
  • automatically publish the images
  • change the internal analysis container (built by cloud services using an internal GitHub repo) to use the appropriate new image tag
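
For concreteness, a minimal sketch of how a CI job could script the per-branch build and push; the image name, tag scheme, and GIT_BRANCH environment variable here are assumptions for illustration, not decisions.

import os
import subprocess

# Assumed naming scheme: one Docker Hub image, one tag per branch,
# e.g. emission/e-mission-server:gis-based-mode-detection
IMAGE = "emission/e-mission-server"

def build_and_push(branch):
    # Assumes CI has already checked out `branch` into the current
    # directory, so the Dockerfile can COPY the code instead of cloning it.
    tag = f"{IMAGE}:{branch}"
    subprocess.run(["docker", "build", "-t", tag, "."], check=True)
    subprocess.run(["docker", "push", tag], check=True)

if __name__ == "__main__":
    # The CI system would set this; "master" is just a fallback for local runs.
    build_and_push(os.environ.get("GIT_BRANCH", "master"))

The same build step could be triggered per-commit or per-release; only the CI trigger changes, not the script.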

@aGuttman, please LMK if you have any questions around this


aGuttman commented Jul 8, 2022

Just want to acknowledge that I've read this. Will probably have questions soon when I begin actually engaging with it.

@shankari

@aGuttman I also want to know whether there is any principled reason to put the Dockerfile in the repo.

Because at some point, I think that we may want to split up the monolithic server code. The modules for core code which does not change frequently (e.g. the storage or the data models) can be pulled into separate Python packages. Then the analysis code (e.g. in https://github.com/e-mission/e-mission-eval-private-data) can pip install it instead of having to check out the server and add it to the PYTHONPATH.

So there may end up being multiple server repos, not just one.

So maybe a pre-requisite decision is monorepo versus multiple separate repos.


aGuttman commented Aug 5, 2022

Keeping the Dockerfile in the server repo solves several problems. It keeps the GitHub actions needed to build and push images simpler and avoids potential branch confusion issues. Keeping the Dockerfile in a separate repo requires either manual work to initiate build/push, cross-repository dependent actions, or additional management software.

Cross-repository actions can work, but in addition to the yaml file simply being more complex, they create more design decisions about branches that don't have clear best answers. Say branch A in the server repo is updated and we want a new image built and pushed. The docker repo doesn't know anything has changed. We can create a GitHub Action in the branch that checks out the docker repo and uses the Dockerfile it gets from that repo to build. What happens when branch B is updated and needs a new image built? It also checks out the docker repo. That raises several questions:

  • Should the docker repo have a single Dockerfile that is general enough for all server branches to use, with a secondary configuration file kept in each server branch to deal with the differences? Or should the secondary file be kept in the docker repo?
  • Or should we have different Dockerfiles for each branch? How do we determine which configuration gets used?
  • Should the docker repo have different branches that mirror the server branches?
  • If we keep more branch-specific configuration info on the server branches, does that properly keep separate the different responsibilities implied by having a separate docker repo?
  • If more branch-specific configuration info is kept in the docker repo, is a developer working in a server branch responsible for modifying the configuration in another repo as the branch changes? Or when new server branches are made?

If the Dockerfile is kept in the server repo, each branch has its own that can be changed as needed without worrying about interactions outside the repo. When a new image is needed, a GitHub Action will run and code won't need to be copied from any outside source. The image can be built and pushed right there.

Most discussions online prefer keeping the Dockerfile in the app repo. There are those who advocate having a separate Docker repo, but these setups are usually paired with more automation software and seem to assume that you have one privileged main branch, with the others just for dev and not needing to be published.

Differentiating the images of the different branches via tags works.

I prefer building on commit since it makes the GitHub Actions easier and allows different branches to be updated at their own pace. This is also premised on the idea that commits made to these branches are good (as in, the messy parts of development get dealt with locally, on a branch that we care less about, or in a fork, and then merged in), so that we aren't building and pushing bad images.

Host images on Docker Hub to support the cloud team's workflow.

If we split up the server code and have the analysis pip installed, would we still want to provide different images for just the server, or for the server with different analysis packages installed? If so, I feel like the idea of keeping a Dockerfile in the server/analysis repo still makes sense. As the analysis is updated, a Dockerfile in that repo can build an image with the latest changes to be pushed for use.


shankari commented Aug 9, 2022

@aGuttman can you add the original links here for my edification?
also, this seems to assume a monorepo
what happens when we move modules (e.g. the storage/timeseries code) out so that they can be pip installed in other parts of the program, such as the public dashboard?

@aGuttman

Frustratingly, I didn't find a comprehensive resource explaining how things should be done or the principles involved; mostly low-engagement Stack Exchange and Reddit threads:
https://stackoverflow.com/questions/57452441/what-code-repository-should-the-dockerfile-get-committed-to
https://stackoverflow.com/questions/60767371/is-it-the-right-way-to-separate-docker-related-files-and-python-related-files
https://stackoverflow.com/questions/27387811/where-to-keep-dockerfiles-in-a-project
https://softwareengineering.stackexchange.com/questions/349320/splitting-application-code-and-docker-deployment-files
https://www.reddit.com/r/docker/comments/odmfdl/dockerfile_in_separate_repo_or_same_project_repo/

I am assuming a monorepo because separating the modules didn't come up until later. I did try to address the idea of splitting up the modules in the last paragraph, but honestly I'm not sure I understand the codebase or deployment strategies enough to know what to say.


shankari commented Aug 11, 2022

@aGuttman

some of the modules in the e-mission-server repository are reused by other repos.

As a concrete example, the DB access code is accessed by the public dashboard repo. Right now, we handle that by cloning the entire server repo every time, which seems like overkill. There is no reason for the REST API components to be accessible from a repo that runs jupyter notebooks.

An even more apt example here is that the scheduled task instances that will be launched from these images are intended for running modeling tasks. There is no reason that they need access to the REST APIs.

Splitting the storage code out into a library will help with this issue. If we publish the storage library as a pip module, we could simply pip install it in the dashboard repo.

We need to plan out how that will interact with the automatic image creation.


shankari commented Aug 16, 2022

There are two ways to handle this:

  • Split into multiple repos, but rely on the dependencies being version controlled, so a dependency change will require a version update, which is a commit to the main repo (see the sketch after this list)
  • Keep everything as a giant monorepo (similar to UI monorepos - e.g. https://nx.dev/guides/why-monorepos) and just split while creating the release artifacts (so 5 Dockerfiles or something).
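
For the first option, a minimal sketch of what the version-controlled dependency could look like, assuming the storage code has been split out and published under a hypothetical emission-storage name (the version numbers are also made up):

from setuptools import setup, find_packages

# setup.py of the main server repo (sketch only)
setup(
    name='e-mission-server',
    version='3.0.0',
    packages=find_packages(),
    # Hypothetical split-out storage package; bumping this pin is the
    # "version update" commit that pulls a dependency change into the main repo.
    install_requires=['emission-storage==0.1.0'],
)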


shankari commented Aug 16, 2022

Main question for the first one: "Can we pip install from a local directory"?
If so, we can do the beautifulness.
If not, we can't and might want to just stick with monorepo for now.

ETA: Friday
Next Tuesday: PR with final approach complete.

@shankari

Seems like this is just pip install -e ...


aGuttman commented Aug 19, 2022

pip installing from a local directory can be done with pip install -e ... as stated above. A setup.py file is also needed for this, which is easy enough to write.

The difficulty for us is going to be trying to eliminate circular dependencies. For example, it would be nice to break off the entire storage directory, as is, into its own module. However, files in storage import from core, analysis, and net. Files in analysis import from storage, as do files in net.

If I'm thinking about this correctly (please let me know if I'm not), we would need to break these directories into multiple modules each so that our dependencies form a DAG. This should be possible. Since the code works, it doesn't have dependency cycles where fileA depends on fileB and fileB depends on fileA. It does have "pseudo cycles" between directories, where some files in dirA depend on files in dirB and some files in dirB depend on files in dirA. If we try to just make modules out of the current directories, these pseudo cycles will stop us. We should be able to make modules if we repackage our files, possibly moving some files into different directories and/or splitting one directory into several. Since files also import from various files within their own directories, splitting up directories or moving a file from one directory to another also needs to respect those relations, which makes the problem harder.

I don't know if this makes sense to do or not. Looking at a few files by hand, it looks messy, and a good analysis to figure this out isn't coming to mind immediately. That said, I have a nagging feeling that I'm overcomplicating this, and that it is actually fine if I think about it from some different perspective; I'm just not sure what. Perhaps just find a bunch of things that we can package into one big module?
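
For what it's worth, a minimal sketch of one way to see those directory-level pseudo cycles, assuming the usual emission/<dir>/... layout (this is exploratory only, not something the CI work depends on):

import ast
from collections import defaultdict
from pathlib import Path

# Run from the root of an e-mission-server checkout
ROOT = Path("emission")

def target_dir(module_name):
    # "emission.storage.timeseries.abstract_timeseries" -> "storage"
    parts = module_name.split(".")
    return parts[1] if len(parts) > 1 and parts[0] == "emission" else None

# top-level directory -> set of top-level directories it imports from
graph = defaultdict(set)
for py_file in ROOT.rglob("*.py"):
    rel = py_file.relative_to(ROOT)
    if len(rel.parts) < 2:
        continue  # skip files directly under emission/
    src = rel.parts[0]
    try:
        tree = ast.parse(py_file.read_text(), filename=str(py_file))
    except SyntaxError:
        continue
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            dst = target_dir(name)
            if dst and dst != src:
                graph[src].add(dst)

# Directories that import from each other form a "pseudo cycle"
for src in sorted(graph):
    for dst in sorted(graph[src]):
        if src in graph[dst]:
            print(f"pseudo cycle: {src} <-> {dst}")

Any pair this prints would have to be untangled (by moving or splitting files) before those directories could become separate packages.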


shankari commented Aug 22, 2022

@aGuttman just to clarify: the goal here is not to actually modularize the server code right now. That is a fairly tricky change and well outside the scope of this PR (and is tracked in #506 anyway).

The goal of this investigation is to make design choices for the CI that don't preclude splitting up the code later. So I would want to basically test out the pip install -e with the storage directory for example. Having circular dependencies for now should be fine since the dependencies will be included in the e-mission-server code anyway.

That + code cleanup are the only two tasks left on this issue.

By Tuesday, please:

  • include the results of using pip install -e
  • submit the cleaned up PR

@aGuttman

Ok, I guess I'm confused about what I should show for pip install -e <path/to/dir>.

The command works provided you include a setup.py file in the directory:

from setuptools import setup, find_packages

setup(
    name='emiss-stor',
    version='0.1.0',
    packages=find_packages()
)

Running pip install -e against that directory then reports:

Installing collected packages: emiss-stor
  Attempting uninstall: emiss-stor
    Found existing installation: emiss-stor 0.1.0
    Uninstalling emiss-stor-0.1.0:
      Successfully uninstalled emiss-stor-0.1.0
  Running setup.py develop for emiss-stor
Successfully installed emiss-stor
$ pip list
Package            Version   Location
------------------ --------- ------------------------------
...
emiss-stor         0.1.0     /Users/aguttman/Desktop/storage
...

Changes made to the code are used automatically without having to run a reinstall/update.

Version numbering is kept in setup.py and needs to be updated manually when you want to signify a new release.

Moving storage back into e-mission-server to get around the dependency issue works (though it's a new install from a new location). Imports such as import emission.storage.timeseries.abstract_timeseries as esta are changed to import timeseries.abstract_timeseries as esta to use the installed package rather than accessing the Python file directly. The imports from storage to analysis etc. are left alone and appear to resolve just fine that way.

I don't like the naming scheme for the imports. It feels wrong to me that you access timeseries.abstract_timeseries without telling it which package it's from. It feels to me like it should be emiss_store.timeseries.abstract_timeseries, but that is not what it defaults to. I'd like to see if I can change it.

Is that adequate information for pip install -e? Sorry, I had tried this last week, but it seemed boring, I guess. I wasn't sure you wanted me to report it beyond "yes it is possible". Let me know if there is anything I should answer here but haven't.

@aGuttman

I guess the easy thing to do for the naming issue is to just create a dir emiss-stor in storage and move everything into it.


shankari commented Aug 22, 2022

Is that adequate information for pip install -e?

Yes. Our goal was just to see that the option of putting the Dockerfile into the e-mission-server repo would work even after we split out the repo in the future. You can now complete the changes for moving the Dockerfile into the server repo and building it via CI.

I guess the easy thing to do for the naming issue is to just create a dir emiss-stor in storage and move everything into it.

Not sure why an even easier solution is not to just change the name to emission.storage instead of emiss-stor

@aGuttman

Not sure why an even easier solution is not to just change the name to emission.storage instead of emiss-stor

Ah, true. I used emiss-stor when testing because it wasn't used anywhere else. If it resolved the name I could be sure the pip module was being used. Now that I can see it working as expected, emission.storage is easier.
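
For reference, a minimal sketch of a setup.py that would keep the emission.storage import prefix, assuming the split-out code keeps its emission/storage/... directory layout with __init__.py files at both levels:

from setuptools import setup, find_packages

setup(
    # distribution name for pip; the import name stays emission.storage
    name='emission-storage',
    version='0.1.0',
    # picks up emission, emission.storage, emission.storage.timeseries, ...
    packages=find_packages(include=['emission', 'emission.*']),
)

With that layout, imports like import emission.storage.timeseries.abstract_timeseries as esta should keep working unchanged after pip install -e.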

@shankari

@aGuttman any updates on the cleaned up PR to review?

@aGuttman

@shankari

Further cleanup is at
#791

@shankari

Finally closing this!
