Docker pilot run #43

Closed
szahedian opened this issue Aug 11, 2021 · 14 comments
@szahedian
Contributor

In this issue, we will demonstrate replication of a simple project within a Docker container.

szahedian pushed a commit that referenced this issue Aug 13, 2021
@szahedian
Contributor Author

@gentzkow I've written a Dockerfile that produces a minimal working Docker image that could be used to replicate results. Right now, it builds an image that sets up the dependencies and environment required for running run_all.py. It is a minimal version, so it doesn't support Stata or external dependencies.

The steps below set up and run a Docker container that recreates all output files within the template repo. Run them from the repo root.

PROJECT_NAME=template
PROJECT_PATH=<absolute path to repo root, e.g. /Users/snz/Documents/...>

docker build -t $PROJECT_NAME --build-arg PROJECT_DIR=$PROJECT_NAME .
docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME $PROJECT_NAME 
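
Once the run completes, a couple of optional sanity checks (the output paths below are illustrative; adjust them to the repo's actual module layout):

docker image ls $PROJECT_NAME        # confirm the image built
ls */output/                         # output directories should be repopulated after the run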

@szahedian self-assigned this Aug 13, 2021
@gentzkow
Owner

Thanks @szahedian! Very cool

I've installed Docker and played around with their tutorial a bit. On the build step above I get an error when installing R:
error_msg.txt

A couple of questions about how this would work:

  • It seems like this image is built to run one specific application (run_all.py). Is it possible to build the image and then allow a user to run many different commands within it? I.e., can it work like a virtual machine where I can issue commands from a terminal window and have them execute as docker containers?
  • If we built an image like this and shared it on Docker Hub, would that mean a user could run it without the docker build step above? I.e., they would just docker run our pre-built image?

@szahedian
Contributor Author

@gentzkow

I've installed Docker and played around with their tutorial a bit. On the build step above I get an error when installing R

Hmm, that's interesting. I've experimented some more locally, and that section of the Dockerfile turns out to be redundant anyway. I've pushed an update that will hopefully get around the error.

It seems like this image is built to run one specific application (run_all.py). Is it possible to build the image and then allow a user to run many different commands within it? I.e., can it work like a virtual machine where I can issue commands from a terminal window and have them execute as docker containers?

Yes, it is possible to take control of a running container through shell commands, as if you had SSHed into it. The simplest way is to specify interactive mode. You can try this out with something like docker run -it ubuntu.
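
For instance, a minimal sketch using the public ubuntu image (not our project image):

docker run -it ubuntu /bin/bash
# inside the container's shell, commands run against the container's filesystem, e.g.
cat /etc/os-release
exit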

If we built an image like this and shared it on Docker Hub, would that mean a user could run it without the docker build step above? I.e., they would just docker run our pre-built image?

Yes that's right. What I've written imagines the following usage pattern:

  1. User clones project repo.
  2. User builds Docker image.
  3. User spins up Docker container, which populates local repo output directories.

If we put the current image on Docker Hub, the usage pattern would be the same but we could eliminate step 2.

If we continue with Docker, we should probably discuss the usage pattern we want to support. For instance, in this current version, none of the repo code is actually housed in the container; the user's copy of the repo is mounted into the Docker container, and the container is unaware that the files it operates on are "external". We could make edits to make.py, then run docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME $PROJECT_NAME again and the changes would take effect.

But there could be other usage patterns where we, say, do copy the repo into the container, and allow users to execute CLI commands inside a full Docker sandbox. This seems closer to the world where users just docker run our pre-built image.
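
In practice that copy would probably happen at image-build time, but as a rough illustration of a fully self-contained container (the container name template_sandbox is made up):

docker create -it --name template_sandbox $PROJECT_NAME /bin/bash
docker cp . template_sandbox:/home/template
docker start -ai template_sandbox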

@gentzkow
Owner

Thanks @szahedian.

Still getting an error, this time on the LyX step:
error_msg.txt

I'm a bit confused by this:

none of the repo code is actually housed in the container; the user's copy of the repo is mounted into the Docker container

It sounds at first glance like "the repo is not in the container; the repo is in the container." Can you clarify?

An ideal usage pattern to support would be

  1. User clones project repo
  2. User spins up Docker image from Docker Hub
  3. User runs scripts in the repo interactively, e.g.
    • Run run_all.py to populate all /output directories
    • Edit and re-run selected make.py scripts
    • Open R and run a particular R script interactively; edit the script; re-run
    • Open Lyx and compile a PDF manually

@szahedian
Contributor Author

Darn! I've pushed some further edits that will hopefully allow it to build and be run interactively (details below). @gentzkow if this doesn't fix it we can go through a more deliberate debugging process.

It sounds at first glance like "the repo is not in the container; the repo is in the container." Can you clarify?

I can see how that may have been unclear; let me try to do better. The cloned repo exists as bits on your computer's disk, which your OS is able to address. When you mount the repo into the Docker container, the container is able to address those same bits, except that, to the container, they appear at a different location in the filesystem: concretely, /Users/.../local/path/to/template/ on your local machine and /tmp/template/ in the container reference the same underlying data. So the repo is accessible to the container, but the repo isn't "in" the container in any deeper sense. Conversely, if we had copied the repo into, say, ~/template/ in the container, so that two copies existed, that copy would fully "belong" to the container.
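
A quick way to see this for yourself (assuming an interactive container with the repo mounted at /tmp/template, as in the steps below):

echo "hello" > scratch.txt         # on your local machine, from the repo root
cat /tmp/template/scratch.txt      # inside the container: prints "hello"
rm scratch.txt                     # removing it locally removes it from the container's view too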

An ideal usage pattern to support would be...

Sounds good! This doesn't seem too far from the path we were on. Along with the (hopeful) bug fixes, I've edited the Dockerfile so that the container can be run interactively. The steps are similar.

PROJECT_NAME=template
PROJECT_PATH=<absolute path to repo root, e.g. /Users/username/Documents/.../template>

docker build -t $PROJECT_NAME --build-arg PROJECT_DIR=$PROJECT_NAME .
docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME -it $PROJECT_NAME 

Then you should see a bash prompt. Finally, you'll need to run conda activate $(conda env list | awk 'NR==4 {print $1}'). Then you should be all set to play around with running the repo within Docker!
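
A note on that last command (the environment name below is an assumption): conda env list prints two header lines, then base, then the environment created in the Dockerfile, so awk 'NR==4 {print $1}' just picks out that environment's name. If you know the name, the plain form is equivalent:

conda activate template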

@gentzkow
Owner

Your response to the "in the container" question is super clear. Thanks!

Still getting errors. I tried commenting out the Lyx install but I get an error on the Conda install step as well. Shucks.

Rather than spend a bunch of time debugging this, though, what about if you go ahead and push the image to Docker Hub then I try pulling the image from there? In some ways that's actually a better test of what we want to do.

Another question: Is it possible to interact w/ applications running in the container in GUI as well? E.g., could I open up a Jupyter notebook in the container and work with it? Or could I open the Lyx application and make changes to a file in the GUI interface?

@szahedian
Contributor Author

@gentzkow

Rather than spend a bunch of time debugging this, though, what about if you go ahead and push the image to Docker Hub then I try pulling the image from there?

I pushed the build to Docker Hub. It should be accessible by running:

docker run -it -v $PROJECT_PATH:/tmp/template snzahedian/gentzkow:template
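
The image can also be pulled explicitly first, which makes the download step visible before running:

docker pull snzahedian/gentzkow:template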

Another question: Is it possible to interact w/ applications running in the container in GUI as well? E.g., could I open up a Jupyter notebook in the container and work with it? Or could I open the Lyx application and make changes to a file in the GUI interface?

The answer to this question depends on the application you'd like to run graphically.

A Jupyter notebook started in Docker can be accessed graphically. You can test this for yourself by running docker run -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/minimal-notebook, then accessing notebooks through your web browser as usual. The reason this works is that the Jupyter notebook interface is rendered as a webpage: running jupyter notebook starts a web server on a port of the machine it runs on. With Docker, the only change is that we forward a local port (where we'd usually look to receive the notebook interface) to the port on the Docker container that was opened by running jupyter notebook inside the container.

For general applications like LyX that require the OS to render the interface, access through the container becomes more difficult. The procedure is simplest on Linux and seems more complicated on macOS. Even if the steps aren't too difficult, they can become technically opaque enough to outweigh the setup-cost savings of using Docker in the first place.
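
For what it's worth, a rough sketch of the Linux route (untested here; the xhost step and flags may need adjusting for a given setup):

xhost +local:docker                     # allow local containers to connect to the X server
docker run -it \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v $PROJECT_PATH:/tmp/template \
  snzahedian/gentzkow:template
# then launch the GUI application (e.g. lyx) from the container's shell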

In the case of LyX, though, this may not be necessary: a user could install LyX on their local machine, edit files there, and then compile them using the CLI inside Docker. Because of the file mount, changes made on the local machine are visible to the Docker container.
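
For example, from the container's shell (the document name and location below are placeholders, and this assumes LyX is installed in the image):

cd /tmp/template/paper
lyx --export pdf2 paper.lyx    # compiles paper.pdf via pdflatex without opening the GUI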

@gentzkow
Owner

Success! This works great. I successfully ran run_all.py from the container and also played around with running some other scripts a la carte. I think this would be an excellent way to ship replication code.

A couple of things I think we might want to do when we create an actual replication archive:

  1. Replace the gslab_make git submodule with code committed directly to the repo. The submodule is valuable for internal development but I think it might be better to make this totally static for the purpose of replication. (This would also eliminate a step at which the Docker build could perhaps break.)
  2. Turn off the make.py warnings for target files not committed to the repo. This is again helpful for internal development but not for external replication.

Let me know if you agree with those and we can open separate issues to implement.

I guess the remaining question is whether we can make it work with Stata. Do you want to wrap this issue and open a new one to investigate that?

@szahedian
Contributor Author

Awesome! Glad to hear it works.

  1. I believe git submodules are tied to a specific repo commit (I've tested this with a dummy branch), so the submodule will only receive changes from gslab_make if we commit those changes inside template. On the other hand, it's not too much work to just re-clone a new version of gslab_make into /lib when we want to update it inside template (see the sketch after this list). If you feel strongly about static code, let's go with that.
  2. This change makes sense.
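
For (1), a sketch of what updating the submodule would look like (assuming it lives at lib/gslab_make and tracks the default branch of gslab_make):

git submodule update --remote lib/gslab_make
git add lib/gslab_make
git commit -m "Update gslab_make submodule"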

I guess the remaining question is whether we can make it work with Stata. Do you want to wrap this issue and open a new one to investigate that?

Sounds good! How should I proceed with respect to branches? I'm thinking I close this issue, but continue using this same branch to integrate Stata. That way all updates to code and documentation explaining Docker can be made in one go.

@gentzkow
Owner

Great! I don't think we want to commit gslab_make directly to the template. We just made the switch to submodules in #38 and I think that will work great for our internal work. What I was thinking of was flipping over to direct commit when we release replication packages. We can decide that down the line.

For (2), I'm thinking we might just want to add an option to config.yaml called suppress_git_warnings or something like that. We could then flip that switch when we release a replication archive. Unless that would be super easy to implement, though, I'd be fine setting it aside for now.

On Stata, yes -- let's open a new issue but continue work in this branch. And obviously we do not merge the branch back to master.

@szahedian
Contributor Author

Concerning (2), those warnings are caused by gs.get_modified_sources(PATHS, inputs + externals) in the make scripts. We could suppress those warnings at the level of the make script by checking for suppress_git_warnings and conditioning the call to gs.get_modified_sources() on its value.

A more considered approach may be to suppress those warnings at the level of get_modified_sources itself: check for suppress_git_warnings inside get_modified_sources() (around line 321 of gslab_make/check_repo.py) and condition the warning printout on its value.

@gentzkow
Owner

Thanks. We could have get_modified_sources read directly from config.yaml to check that. I guess it would be more elegant to replace PATHS with a structure that can include both paths and options and populate that from config.yaml only once.

It occurs to me that we'd probably always want to disable those warnings when someone runs run_all.py, even if it's not part of a replication package. It would be nice for that reason if the code could flip that switch independently of config.yaml.

@szahedian
Contributor Author

I'll keep thinking about this. I propose we pause on "suppress git warnings" until we close #45, because how we signal to suppress warnings may depend on changes we make to the structure of make.py scripts.

@szahedian
Contributor Author

szahedian commented Sep 20, 2021

Summary: In this issue, we built a Stata-less Docker image that we pushed to Docker Hub and tested.

We close this and #47, with the expectation that progress made on these issues will be merged once we complete #49.

  • Stable link to issue branch here.
  • Stable link to test branch here.
