Docker pilot run #43

Closed
szahedian opened this issue Aug 11, 2021 · 14 comments
@szahedian
Contributor

In this issue, we will demonstrate replication of a simple project within a Docker container.

szahedian pushed a commit that referenced this issue Aug 13, 2021
@szahedian
Contributor Author

@gentzkow I've written a Dockerfile that produces a minimal working Docker image that could be used to replicate results. Right now, it builds an image that sets up the dependencies and environment required for running run_all.py. It is a minimal version, so it doesn't support Stata or external dependencies.

The steps below set up and run a Docker container that recreates all output files within the template repo. Run them from the repo root.

PROJECT_NAME=template
PROJECT_PATH=<absolute path to repo root, e.g. /Users/snz/Documents/...>

docker build -t $PROJECT_NAME --build-arg PROJECT_DIR=$PROJECT_NAME .
docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME $PROJECT_NAME 
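
Once the run completes, a couple of optional sanity checks (the output paths below are illustrative; adjust them to the repo's actual module layout):

docker image ls $PROJECT_NAME        # confirm the image built
ls */output/                         # output directories should be repopulated after the run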

@szahedian self-assigned this Aug 13, 2021
@gentzkow
Owner

Thanks @szahedian! Very cool

I've installed Docker and played around with their tutorial a bit. On the build step above I get an error when installing R:
error_msg.txt

A couple of questions about how this would work:

  • It seems like this image is built to run one specific application (run_all.py). Is it possible to build the image and then allow a user to run many different commands within it? I.e., can it work like a virtual machine where I can issue commands from a terminal window and have them execute as docker containers?
  • If we built an image like this and shared it on Docker Hub, would that mean a user could run it without the docker build step above? I.e., they would just docker run our pre-built image?

@szahedian
Contributor Author

@gentzkow

I've installed Docker and played around with their tutorial a bit. On the build step above I get an error when installing R

Hmm, that's interesting. I've experimented some more locally, and that section of the Dockerfile turns out to be redundant anyway. I've pushed an update that will hopefully get around the error.

It seems like this image is built to run one specific application (run_all.py). Is it possible to build the image and then allow a user to run many different commands within it? I.e., can it work like a virtual machine where I can issue commands from a terminal window and have them execute as docker containers?

Yes, it is possible to take control of a running container through shell commands, as if you had SSHed into it. The simplest way is to specify interactive mode. You can try this out with something like docker run -it ubuntu.
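
For instance, a minimal sketch using the public ubuntu image (not our project image):

docker run -it ubuntu /bin/bash
# inside the container's shell, commands run against the container's filesystem, e.g.
cat /etc/os-release
exit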

If we built an image like this and shared it on Docker Hub, would that mean a user could run it without the docker build step above? I.e., they would just docker run our pre-built image?

Yes that's right. What I've written imagines the following usage pattern:

  1. User clones project repo.
  2. User builds Docker image.
  3. User spins up Docker container, which populates local repo output directories.

If we put the current image on Docker Hub, the usage pattern would be the same but we could eliminate step 2.

If we continue with Docker, we should probably discuss the usage pattern we want to support. For instance, in this current version, none of the repo code is actually housed in the container; the user's copy of the repo is mounted into the Docker container, and the container is unaware that the files it operates on are "external". We could make edits to make.py, then run docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME $PROJECT_NAME again and the changes would take effect.

But there could be other usage patterns where we, say, do copy the repo into the container, and allow users to execute CLI commands inside a full Docker sandbox. This seems closer to the world where users just docker run our pre-built image.
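
In practice that copy would probably happen at image-build time, but as a rough illustration of a fully self-contained container (the container name template_sandbox is made up):

docker create -it --name template_sandbox $PROJECT_NAME /bin/bash
docker cp . template_sandbox:/home/template
docker start -ai template_sandbox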

@gentzkow
Owner

Thanks @szahedian.

Still getting an error, this time on the LyX step:
error_msg.txt

I'm a bit confused by this:

none of the repo code is actually housed in the container; the user's copy of the repo is mounted into the Docker container

It sounds at first glance like "the repo is not in the container; the repo is in the container." Can you clarify?

An ideal usage pattern to support would be

  1. User clones project repo
  2. User spins up Docker image from Docker Hub
  3. User runs scripts in the repo interactively, e.g.
    • Run run_all.py to populate all /output directories
    • Edit and re-run selected make.py scripts
    • Open R and run a particular R script interactively; edit the script; re-run
    • Open Lyx and compile a PDF manually

@szahedian
Contributor Author

Darn! I've pushed some further edits that will hopefully allow it to build and be run interactively (details below). @gentzkow if this doesn't fix it we can go through a more deliberate debugging process.

It sounds at first glance like "the repo is not in the container; the repo is in the container." Can you clarify?

I can see how that may have been unclear; let me try to do better. The cloned repo exists as bits on your computer's disk, which your OS is able to address. When you mount the repo into the Docker container, the container is able to address those same bits, except that, to the container, they appear at a different location in the filesystem: concretely, /Users/.../local/path/to/template/ on your local machine and /tmp/template/ in the container reference the same underlying data. So the repo is accessible to the container, but the repo isn't "in" the container in any deeper sense. Conversely, if we had copied the repo into, say, ~/template/ in the container, so that two copies existed, that copy would fully "belong" to the container.
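
A quick way to see this for yourself (assuming an interactive container with the repo mounted at /tmp/template, as in the steps below):

echo "hello" > scratch.txt         # on your local machine, from the repo root
cat /tmp/template/scratch.txt      # inside the container: prints "hello"
rm scratch.txt                     # removing it locally removes it from the container's view too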

An ideal usage pattern to support would be...

Sounds good! This doesn't seem too far from the path we were on. Along with the (hopeful) bug fixes, I've edited the Dockerfile so that the container can be run interactively. The steps are similar.

PROJECT_NAME=template
PROJECT_PATH=<absolute path to repo root, e.g. /Users/username/Documents/.../template>

docker build -t $PROJECT_NAME --build-arg PROJECT_DIR=$PROJECT_NAME .
docker run -v $PROJECT_PATH:/tmp/$PROJECT_NAME -it $PROJECT_NAME 

Then you should see a bash prompt. Finally, you'll need to run conda activate $(conda env list | awk 'NR==4 {print $1}'). Then you should be all set to play around with running the repo within Docker!
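
A note on that last command (the environment name below is an assumption): conda env list prints two header lines, then base, then the environment created in the Dockerfile, so awk 'NR==4 {print $1}' just picks out that environment's name. If you know the name, the plain form is equivalent:

conda activate template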

@gentzkow
Owner

Your response to the "in the container" question is super clear. Thanks!

Still getting errors. I tried commenting out the Lyx install but I get an error on the Conda install step as well. Shucks.

Rather than spend a bunch of time debugging this, though, what about if you go ahead and push the image to Docker Hub then I try pulling the image from there? In some ways that's actually a better test of what we want to do.

Another question: Is it possible to interact w/ applications running in the container in GUI as well? E.g., could I open up a Jupyter notebook in the container and work with it? Or could I open the Lyx application and make changes to a file in the GUI interface?

@szahedian
Contributor Author

@gentzkow

Rather than spend a bunch of time debugging this, though, what about if you go ahead and push the image to Docker Hub then I try pulling the image from there?

I pushed the build to Docker Hub. It should be accessible by running:

docker run -it -v $PROJECT_PATH:/tmp/template snzahedian/gentzkow:template
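
The image can also be pulled explicitly first, which makes the download step visible before running:

docker pull snzahedian/gentzkow:template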

Another question: Is it possible to interact w/ applications running in the container in GUI as well? E.g., could I open up a Jupyter notebook in the container and work with it? Or could I open the Lyx application and make changes to a file in the GUI interface?

The answer to this question depends on the application you'd like to run graphically.

A Jupyter notebook started in Docker can be accessed graphically. You can test this for yourself by running docker run -p 8888:8888 -v $(pwd):/home/jovyan/work jupyter/minimal-notebook, then accessing notebooks through your web browser as usual. The reason this works is that the Jupyter notebook interface is rendered as a webpage: running jupyter notebook starts a web server on a port of the machine it runs on. With Docker, the only change is that we forward a local port (where we'd usually look to receive the notebook interface) to the port on the Docker container that was opened by running jupyter notebook inside the container.

For general applications like LyX that require the OS to render the interface, access through the container becomes more difficult. The procedure is simplest on Linux and seems more complicated on macOS. Even if the steps aren't too difficult, they can become technically opaque enough to outweigh the setup-cost savings of using Docker in the first place.
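
For what it's worth, a rough sketch of the Linux route (untested here; the xhost step and flags may need adjusting for a given setup):

xhost +local:docker                     # allow local containers to connect to the X server
docker run -it \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  -v $PROJECT_PATH:/tmp/template \
  snzahedian/gentzkow:template
# then launch the GUI application (e.g. lyx) from the container's shell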

In the case of LyX, though, this may not be necessary: a user could install LyX on their local machine, edit files there, and then compile them using the CLI inside Docker. Because of the file mount, changes made on the local machine are visible to the Docker container.
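
For example, from the container's shell (the document name and location below are placeholders, and this assumes LyX is installed in the image):

cd /tmp/template/paper
lyx --export pdf2 paper.lyx    # compiles paper.pdf via pdflatex without opening the GUI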

@gentzkow
Owner

Success! This works great. I successfully ran run_all.py from the container and also played around with running some other scripts a la carte. I think this would be an excellent way to ship replication code.

A couple of things I think we might want to do when we create an actual replication archive:

  1. Replace the gslab_make git submodule with code committed directly to the repo. The submodule is valuable for internal development but I think it might be better to make this totally static for the purpose of replication. (This would also eliminate a step at which the Docker build could perhaps break.)
  2. Turn off the make.py warnings for target files not committed to the repo. This is again helpful for internal development but not for external replication.

Let me know if you agree with those and we can open separate issues to implement.

I guess the remaining question is whether we can make it work with Stata. Do you want to wrap this issue and open a new one to investigate that?

@szahedian
Contributor Author

Awesome! Glad to hear it works.

  1. I believe git submodules are tied to a specific repo commit (I've tested this with a dummy branch), so the submodule will only receive changes from gslab_make if we commit those changes inside template. On the other hand, it's not too much work to just re-clone a new version of gslab_make into /lib when we want to update it inside template (see the sketch after this list). If you feel strongly about static code, let's go with that.
  2. This change makes sense.
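
For (1), a sketch of what updating the submodule would look like (assuming it lives at lib/gslab_make and tracks the default branch of gslab_make):

git submodule update --remote lib/gslab_make
git add lib/gslab_make
git commit -m "Update gslab_make submodule"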

I guess the remaining question is whether we can make it work with Stata. Do you want to wrap this issue and open a new one to investigate that?

Sounds good! How should I proceed with respect to branches? I'm thinking I close this issue, but continue using this same branch to integrate Stata. That way all updates to code and documentation explaining Docker can be made in one go.

@gentzkow
Owner

Great! I don't think we want to commit gslab_make directly to the template. We just made the switch to submodules in #38 and I think that will work great for our internal work. What I was thinking of was flipping over to direct commit when we release replication packages. We can decide that down the line.

For (2), I'm thinking we might just want to add an option to config.yaml called suppress_git_warnings or something like that. We could then flip that switch when we release a replication archive. Unless that would be super easy to implement, though, I'd be fine setting it aside for now.

On Stata, yes -- let's open a new issue but continue work in this branch. And obviously we do not merge the branch back to master.

@szahedian
Contributor Author

Concerning (2), those warnings are caused by gs.get_modified_sources(PATHS, inputs + externals) in the make scripts. We could suppress those warnings at the level of the make script by checking for suppress_git_warnings and conditioning the call to gs.get_modified_sources() on its value.

A more considered approach may be to suppress those warnings at the level of get_modified_sources itself: check for suppress_git_warnings inside get_modified_sources() (around line 321 of gslab_make/check_repo.py) and condition the warning printout on its value.

@gentzkow
Owner

Thanks. We could have get_modified_sources read directly from config.yaml to check that. I guess it would be more elegant to replace PATHS with a structure that can include both paths and options and populate that from config.yaml only once.

It occurs to me that we'd probably always want to disable those warnings when someone runs run_all.py, even if it's not part of a replication package. It would be nice for that reason if the code could flip that switch independently of config.yaml.

@szahedian
Contributor Author

I'll keep thinking about this. I propose we pause on "suppress git warnings" until we close #45, because how we signal to suppress warnings may depend on changes we make to the structure of make.py scripts.

@szahedian
Contributor Author

szahedian commented Sep 20, 2021

Summary: In this issue, we built a Stata-less Docker image that we pushed to Docker Hub and tested.

We close this and #47, with the expectation that progress made on these issues will be merged once we complete #49.

  • Stable link to issue branch here.
  • Stable link to test branch here.
