Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds Dockerfile #46

Merged
merged 14 commits into from
Aug 22, 2014
Merged

adds Dockerfile #46

merged 14 commits into from
Aug 22, 2014

Conversation

ShawnMilo
Copy link
Contributor

This adds a Dockerfile so an environment can be built and tests run with one command. This should make troubleshooting easier, because users and developers can have a shared, known environment for comparing test failures.

Using a Dockerfile means that (with documentation and supporting scripts), only about a kilobyte is added to the repository for all this goodness.

Running tests with the Dockerfile means that textract is completely rebuilt fresh directly from the repository each time. It does not require that any dependencies (such as tesseract, pip, antiword, etc.) be installed, nor any special Python modules.

@deanmalmgren
Copy link
Owner

This looks great, @ShawnMilo. Since I'm not as familiar with docker I've got a knucklehead question so please bear with me: can we reuse the provisioning scripts for the Vagrant VM instead of having to maintain another set of installation scripts?

Specifically I'm looking at the provision/debian.sh and the provision/python.sh scripts. This way we only have to maintain the requirements/* files and things should automatically work on docker or any other environment for that matter (or at least that would be my hope).

With the increasing number of command line tools (which I think of as a really good thing!), another approach might be to start to build and release a PPA that can be installed on any ubuntu system with sudo apt-get install python-textract or something like that. I've never done anything along those lines but it would certainly make textract that much easier to install on all Debian/Ubuntu systems.

Thoughts?

@ShawnMilo
Copy link
Contributor Author

Thanks, I'm glad you like it.

Yes, it's completely possible to use the existing scripts. However, there is a trade-off. The run.sh script does two things:

  1. Build a Docker image. This creates an image once on the user's machine which can then be run many times. Anything done here (for example, apt-get installs) only have to happen once (until we update the Dockerfile).
  2. Run an instance of the Docker image. Every time you run it, it's a "blank slate" based on the image built in step 1. So if you move the apt-get and pip installs into this step, you have to wait for it to happen each time (and use bandwidth to download all the packages).

Currently I have it set up so as much as possible is done in the Dockerfile so it needs to be done only once. If it's preferable to take the time and bandwidth to install the dependencies in "provision" each time, then we can move the apt-get and pip installs into the run_tests.sh script and it'll happen upon the "docker run" instead of "docker build."

@ShawnMilo
Copy link
Contributor Author

I poked at it a bit and it looks like it's possible to have the best of both worlds, using the ADD Dockerfile directive. However, it will require moving files around because during the build the Dockerfile can only ADD things in the current working directory.

Is there any problem with moving the files in /requirements into /provision (or vice versa)? Also, the Dockerfile and accompanying shell scripts will have to be moved into the same folder.

@deanmalmgren
Copy link
Owner

Perfect! There isn't a problem moving everything into the same directory as far as I'm concerned. ./requirements sounds like a good general purpose name for this unless you can think of something better.

@ShawnMilo
Copy link
Contributor Author

Excellent. I'll push it up when it's working.

I can see from the comment in python.sh that this stuff has to work from Travis as well, so I'll try not to mess that up. I may ask for help if my initial push breaks the build.

@deanmalmgren
Copy link
Owner

Sounds good. I'm busy the rest of the day but will definitely get back to you ASAP

* Consolidates requirements, provision, and docker directories.
@ShawnMilo
Copy link
Contributor Author

FYI: Ignore Travis errors for now. I'm aware of them and working on trying to fix the requirements scripts so they work in both Travis and the Docker container.

An attempt to make it work both with Travis and Docker.
@ShawnMilo
Copy link
Contributor Author

Okay, it works in Travis and Docker now. The diff is going to look big because files were moved, but no big changes were made.

Incidentally, I've temporarily changed the Dockerfile to build an Ubuntu 12.04 image instead of an Ubuntu 14.04 image because in 12.04 the tests all pass. So it's definitely a version issue with the third-party dependencies (or maybe just tesseract-ocr).

Otherwise it piles up multiple containers you probably don't need.
@ShawnMilo
Copy link
Contributor Author

Update: Now removes the container (but not the image) after running the tests.

Also, on my machine it takes over seven minutes to do the initial image build, but under four seconds to run subsequently. That means you can run the tests in about four seconds every time you change code, without setting up the external and Python dependencies on your development machine (or with different versions of those dependencies on your machine).

You won't have to do the "seven minute build" again unless the requirements files change or the version of Ubuntu is changed in the Dockerfile.

@deanmalmgren
Copy link
Owner

Thanks for putting this together; this will give people yet another way to develop and run the textract tests. If you don't mind, here are a few minor things to fix and we can get this all merged in:

  • Let's add something to the docs/contributing.md about how to use Docker for development (is there another use case for this that I'm not fully understanding?). Currently there's some instructions for using a Vagrant development environment but this PR goes a long way toward providing another viable alternative that people should know about (thanks again!)
  • Can you rename the requirements/run.sh script to something more descriptive? Maybe requirements/run_docker_tests.sh or something?
  • can you update the paths in the Vagrantfile to not point to the provision directory anymore?
  • I also commented on a few particular lines of code in the commits that would help tidy things up a bit I think.
  • Its kinda a bummer that the Docker stuff mixes up the requirements with the provisioning scripts, but life could be worse. (not really a thing to do, just an observation)

@ShawnMilo
Copy link
Contributor Author

Okay. I'll get on those other comments and let you know when I think it's ready to look at. Ignore pushes before I tell you -- I may push something to make sure it passes Travis.

@deanmalmgren
Copy link
Owner

Sounds good. I'm headed out of town on vacation for a long weekend so I
might not get to it for a while
On Aug 13, 2014 5:25 PM, "Shawn Milochik" notifications@github.com wrote:

Okay. I'll get on those other comments and let you know when I think it's
ready to look at. Ignore pushes before I tell you -- I may push something
to make sure it passes Travis.

Reply to this email directly or view it on GitHub
#46 (comment).

@ShawnMilo
Copy link
Contributor Author

  • Vagrantfile updated
  • CONTRIBUTING.md contains documentation (moved from doc.md from earlier commit)
  • Script renamed as requested
  • Comments on commit 506f40b have been addressed

Brief explanation of Docker images vs. Containers

This is background for section below on combining requirements and provision.

A Docker image is like a template. It is based on a specific Linux distribution and may have packages installed, configurations set, etc. It is never "run" itself -- it is used as a base for containers.

A Docker container is a running instance of an LXC container -- the project upon which Docker is based. It must be based on an image. That can be (for example) a vanilla Ubuntu image provided by Canonical, or a modified Ubuntu image based on the original, modified by a member of the community, and contributed back.

Two ways of acquiring an image is are to get the image directly from Docker (which houses and serves them in a repository), or building one automatically using a Dockerfile (which is similar in concept to a Makefile).

I think using a Dockerfile with a vanilla image is the way to go, because it dynamically builds the image, thus getting the current versions of dependencies. Also, adding or removing dependencies only requires modifying the dependency files debian and python.

If we built a custom image and stored it in Docker's repository, we'd have to replace that image when debian and python are modified. The trade-off is that, on the developer's computer, it takes longer for the initial build as those dependencies are installed.


A note on combining requirements and provision

The mix of the requirements and provisioning isn't strictly necessary, if we move the Dockerfile to the root of the project. Here's why:

  1. The ADD command in a Dockerfile. This updates the image.
    • Pro: Lets important bits be maintained outside of the Dockerfile (in this case, debian.sh and python.sh)
    • Con: requires that the file in the ADD command is in, or in a subdirectory of, the directory containing the Dockerfile
  2. Mounting a volume (or otherwise acquiring a file, for example via wget) after the container is running.
    • Pro: Mounted directory can be anywhere on the filesystem
    • Con: Directory and its contents not available during the image build, so can't be used to bootstrap

@deanmalmgren
Copy link
Owner

@ShawnMilo this is terrific! Thanks for all the effort here and for explaining all of this stuff. I'm looking forward to playing around with Docker a bit and this looks like a great place to start for me. I'm merging this in now and it will be a part of the 1.0.0 release.

So if I understand your last point about the location of the Dockerfile, it sounds like we can put that in the root of the repository and keep the split between requirements and provisioning then; is that right? If so, I may go ahead and switch that up.

I think I'm also going to move the documentation from CONTRIBUTING.md into the docs/contributing.rst as this is another way to quickstart! Great stuff that I hope others find useful as well.

@deanmalmgren deanmalmgren merged commit 40acfa1 into deanmalmgren:master Aug 22, 2014
@ShawnMilo
Copy link
Contributor Author

You're welcome!

Yes, you're correct about being able to move the Dockerfile and put back the directories. I was thinking about it more last night and it's also possible to have the Dockerfile in a subdirectory and have the bash script copy it to the repo root and run it from there, then delete it. It's a hack, but if you value the cleanliness of the root it's an option.

Since you said you're interested in learning more about Docker, here's something I put together yesterday to teach myself. It's a very simple primer on having client and server apps in separate containers:
https://github.com/ShawnMilo/docker_cs_sample

@deanmalmgren
Copy link
Owner

I just finished moving the Dockerfile into the project root and moving a few other things around in 5985e10. When you get a chance, want to try pulling the master branch and make sure everything works with Docker still?

In particular, I moved the docker test scripts into the ./tests directory (and renamed requirements/run_tests.sh to tests/docker_entry.sh). I also made it so that the docker_entry.sh script now runs the full test suite in ./tests/run.py. If I managed to mess any of this up, I totally apologize. I like keeping the requirements and provisioning materials separate to make it relatively easy to reconfigure different development environments :)

@ShawnMilo
Copy link
Contributor Author

The commit breaks all the Docker stuff -- as expected because paths have to be changed. That's not a big deal.

However, other problems were introduced:

  1. Documentation I wrote was lost.
  2. Changes made to debian.sh were reverted.
  3. Changes made to python.sh were reverted.

It looks like this branch was somehow polluted with a different branch. Did you have merge conflicts? The documentation not only lost some of my stuff, but also had different (vagrant-related) things added, so it's not a simple overwrite with an earlier version.

@deanmalmgren
Copy link
Owner

Oof...sorry this breaks the Docker stuff :( Bad form on my part! Would you mind throwing together a pull request that fixes the paths in Docker?

Regarding the other problems:

  1. Documentation I wrote was lost.

I reshuffled the documentation. I want the CONTRIBUTING.md to be narrowly focused on coding practices and things like that. The addition of Docker is another way to get started developing, so I put it in docs/contributing.md. Did you have other documentation that I inadvertantly kludged?

  1. Changes made to debian.sh were reverted.
  2. Changes made to python.sh were reverted.

Sorry about that. The Vagrant provisioning scripts need to use absolute paths in lots of places, not relative paths. Things weren't provisioning properly in Vagrant. Maybe we can work together to find a good resolution on this? I'd be up for a google hangout to pair program if that would be helpful.

@ShawnMilo
Copy link
Contributor Author

No problem about the absolute paths for Vagrant -- I can fix them to take care of that. How can I test the Vagrant setup, or are you able to do that if I push a fix?

No problem about fixing all the Docker paths -- that was never a big deal.

As for the missing docs, the Docker stuff seems to have been dropped from this file:
https://github.com/ShawnMilo/textract/blob/master/CONTRIBUTING.md
Either that or it was modified; a grep doesn't find my TL;DR header quickstart.

@deanmalmgren
Copy link
Owner

I can test the Vagrant setup if you send a PR. Again, sorry for the hastle.

I moved the docs into the primary documentation here, which is in the docs/contributing.md file. What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants