Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chores: editorconfig and dockerfile #30

Merged
merged 1 commit into from Sep 8, 2021

Conversation

SethFalco
Copy link
Contributor

@SethFalco SethFalco commented Sep 5, 2021

I was just exploring the code but wanted to tidy up a bit while I was at it.

EditorConfig

Adds an EditorConfig file, so contributors can use consistent code styles with the rest of the project.
This makes it more convenient to work on the project when a developer's editor settings aren't the same as yours.

Optimizes the Dockerfile

This reduces the number of intermediate layers, and makes the build marginally faster.
Its good practice to minimize layers in a Docker image to make the final image smaller, especially if there's a lot of file system changes between layers. In this case, the difference in size is negligible.

A Docker image consists of read-only layers each of which represents a Dockerfile instruction. The layers are stacked and each one is a delta of the changes from the previous layer.

Each instruction creates one layer:

FROM creates a layer from the ubuntu:18.04 Docker image.
COPY adds files from your Docker client’s current directory.
RUN builds your application with make.
CMD specifies what command to run within the container.

https://docs.docker.com/develop/develop-images/dockerfile_best-practices/

Optimize apt-get

  • Removes the cached lists from /var/lib/apt/lists. (Reduces image size by 0.02 GB.)
  • No longer installs recommended packages. (Reduces image size by another 0.02 GB.)

More info:


Before

$ time docker build -t autosub:dev --build-arg model=0.9.3 --no-cache .
real    3m57.379s
user    0m1.650s
sys     0m1.776s

$ docker images -a
REPOSITORY                  TAG               IMAGE ID       CREATED              SIZE
autosub                     dev               2565f52660cb   About a minute ago   1.86GB
<none>                      <none>            4fd064c626ea   About a minute ago   1.86GB
<none>                      <none>            d9d3ae1ed1db   About a minute ago   1.86GB
<none>                      <none>            cf1540452ffc   2 minutes ago        714MB
<none>                      <none>            dfa2fae14669   2 minutes ago        714MB
<none>                      <none>            9ada5732266d   2 minutes ago        714MB
<none>                      <none>            d6821a416a2c   3 minutes ago        416MB
<none>                      <none>            c918464fff64   3 minutes ago        416MB
<none>                      <none>            594b7f697317   5 minutes ago        114MB
<none>                      <none>            e5f83a59a602   5 minutes ago        114MB
python                      3.8-slim-buster   4728acd2148c   2 days ago           114MB

After

$ time docker build -t autosub:dev --build-arg model=0.9.3 --no-cache .
real    3m38.463s
user    0m1.252s
sys     0m1.209s

$ docker images -a
REPOSITORY                  TAG               IMAGE ID       CREATED          SIZE
autosub                     dev               41dbaca6c36b   7 minutes ago    1.82GB
<none>                      <none>            117e3c54c5cf   7 minutes ago    1.82GB
<none>                      <none>            a911683cc89d   10 minutes ago   114MB
<none>                      <none>            7cf695bce4db   10 minutes ago   114MB
<none>                      <none>            264a99e3d9d1   10 minutes ago   114MB
python                      3.8-slim-buster   4728acd2148c   2 days ago       114MB

@abhirooptalasila
Copy link
Owner

Hey
Thanks for this!
I didn't know about EditorConfig before. Now I'll make sure to include it in future projects :)

@yash-fn
Copy link
Contributor

yash-fn commented Sep 8, 2021

No. I say we should actually remove the dockerfile changes from this commit. First I have already implemented the removal of apt cache in my own PR and the rest actually conflicts with my own changes. Second, I disagree with the approach on reducing layers. The goal should not be to minimize layers at the expense of readability and also development needs.

For example, in this commit, if anything in the code base changes the dockerfile will now copy the changes and then reinstall everything including things that are not dependent on code changes because we merged it into one line.

As a comparison please take a look at what i did in my own PR. The only thing I grouped into a single layer was the apt stuff because that logically only needs to occur once. The rest, including the copying of files, is split into individual lines. So now if I change a file, only that and related files will be added because docker will now reuse previous layers that were unchanged. If we group it all into a single line then even if a part of a layer changes it will recompute the entire layer. This is a waste of resources and will actually INCREASE build time for developers which is the only build time that matters because users will only need to built it once whereas devs will build multiple times. Also, docker automatically will recompute any unchanged layers if they occur after a changed layer, thus order matters too and ideally copy commands should as late as possible. This why I copy the large .pbmm and .scorer files first and THEN the code from this project. This way for a minor kb sized change it won't copy the gigabyte length .pbmm file. Please remove the dockerfile from this commit and instead just use the dockerfile that I published.

Basically, there is no point trying to optimize that which Docker is already trying to optimize for us. There is a benefit from having more layers not less. The only time fewer layers can potentially make sense is when you publish the image for others and you want to simplify the amount they have to download and process, in which case there are other tools available that will just "squash" multiple layers into one after the fact. But for a development utility, there is no point trying to marginally minimize layers like that.

@SethFalco
Copy link
Contributor Author

Just going to put my rational forward.
I don't have a strong enough opinion to make a dispute out of it, so if you still disagree then power to you.

The goal should not be to minimize layers at the expense of readability and also development needs.

The image isn't just for development needs, it can be used by users.
Not to mention, in my opinion it's still effortless to read. (Unless every .sh file is also considered unreadable.)

if anything in the code base changes the dockerfile will now copy the changes and then reinstall everything including things that are not dependent on code changes because we merged it into one line.

However, you should then consider spacing out your image in development, not the final image that's pushed and going to be used by users in production.

But for a development utility

I don't believe this Dockerfile is just for a "development utility".

There is a benefit from having more layers not less.

Have fewer layers can be helpful, especially for future reference. The final image retains all changes of the layers before. As the build process gets larger, the more unnecessary changes get brought along to the final image.

I'm the one who opened an issue suggesting adding a Dockerfile, with the purpose of production use and publishing.

@yash-fn
Copy link
Contributor

yash-fn commented Sep 8, 2021

I understand what you are saying and if one chooses to follow your style of development then your changes certainly make sense. I just disagree with that style. To me end users should just expect to run a docker pull autosub command and get what they want. Users have no business building their own images. In which case, dockerfiles should be made for the developers since we are the ones running the build command every time we want to test a change. So, in my personal opinion Dockerfiles are a development utility exclusively. Docker images are for the user. Also, I don't understand what you are saying in that last part there? Who cares what extraneous things are included in intermediate layers as long as it saves me time during the development build process? And you said yourself the size differences here were marginal, so no problem the way I see it.

@SethFalco
Copy link
Contributor Author

SethFalco commented Sep 8, 2021

Who cares what extraneous things are included in intermediate layers as long as it saves me time during the development build process?

The extra things in intermediate layers impact the size of the final image. (Including what is pushed to Docker Hub.)

In this particular image, the different was marginal, I agree. I'd still consider it a good practice in general as for many images it can save a lot more. Rather than experimenting with "Is it worth it?" before deciding, it's faster to just reduce layers in the first place.

For example, the change here reduces the image from 236 MB to 160 MB:
https://github.com/apache/nifi/pull/5037/files

This reduces the image from 2.91 GB to 1.96 GB:
https://github.com/apache/accumulo-docker/pull/14/files

This reduces the image from 2.37 GB to 1.61 GB:
apache/fluo-docker#18

For all of them, the main change is merging the RUN commands together.

@yash-fn
Copy link
Contributor

yash-fn commented Sep 8, 2021

I like what you were trying to achieve as optimizations should be made where possible and image size is great to reduce, but this type of optimization is just plain wrong ... in my opinion. You are entitled to yours. Take it or leave it. I personally don't like having to download an entire gigabyte file and reinstall entire packages that never changed every time I add a comma to one of the python files.

shasheene added a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
Fixes extraneous colon character in SRT/VTT files, split run-on sentences, 3
digit milliseconds

This is merge was done manually based on pull request abhirooptalasila#32, after removing PR
 abhirooptalasila#30 which was merged before it was ready. abhirooptalasila#30 contained many unnecessary format
changes caused a problems when merging other branches, it's easier to remove
the single commit from abhirooptalasila#30, and just re-implement the best parts as a commit on
top.  See [1] for more details.

Rewriting public history is generally frowned upon, but it's the path of least
resistance here.

[1] abhirooptalasila#32 (comment)
shasheene pushed a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
Adds configuration file for EditorConfig [1]

This commit was split out by @shasheene from the monolithic commit in PR abhirooptalasila#30,
which contained .editorconfig changes, some unrelated Dockerfile changes and a
large amount of formatting changes all in a single commit.

[1] https://editorconfig.org/
shasheene added a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
This merge simply contains the .editorconfig changes from abhirooptalasila#30.

The merge was done manually based on the commit in abhirooptalasila#30, after first removing
it.  abhirooptalasila#30 which was merged before it was ready, and contained a single
monolithic commit with format changes causing when merging other more
substantive branches.

This commit was split out by @shasheene from the monolithic commit in PR abhirooptalasila#30,
which contained .editorconfig changes,

The unrelated Dockerfile changes and the formatting changes can be applied as
separate commits on top, after other work has been merged.

It was easier to remove the single commit from abhirooptalasila#30, and just re-implement the
best parts as a commit on top.  See [1] for more details. Rewriting public
history is generally frowned upon, but it's the path of least resistance here.

[1] abhirooptalasila#32 (comment)
@shasheene
Copy link
Contributor

OK, so as briefly discussed in this comment thread, this pull request was merged before it was ready and is causing some headaches, so it's been reverted and re-applyed.

Basically, it's 3 logically distinct commits squashed into one:

  1. Add .editorconfig file (simple change that nobody has any problems with)
  2. Apply the new formatting rules based on the new .editorconfig
  3. Changes to the Dockerfile

It's the formatting/whitespace/layout changes have caused a huge number of conflicts (and wasted time for me) as I try to merge the several Pull Requests and forks from the past several months.

The Dockerfile changes here are relatively trivial to reconstruct compared to @yash-fn's branch that adds GPU Dockerfile changes.

So what I've done is I've split the commits into the .editorconfig commit (which I've applied already) and the Dockerfile changes.

There is debate on the merits of the changes above, but the key point is the actual changes being done in the Dockerfile can be pretty trivially reconstructed on top of origin/master after @yash-fn has been merged. Also applying the EditorConfig formatting changes across the whole codebase can be done as a commit on top too, if necessary.

I am fully aware that rewriting public git history is deeply frowned upon, but in this case it's the path of least resistance to save time resolving necessary git conflicts.

abhirooptalasila added a commit that referenced this pull request Sep 9, 2021
abhirooptalasila added a commit that referenced this pull request Sep 9, 2021
Revert EditorConfig and clean up #30
shasheene added a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
Applies the .editorconfig auto-formatter updates now that all existing Pull
Requests and tasks have been resolved.

This was done done using Pycharm: for each Python file, add character, save
(which applies the .editorconfig automatically), then delete the added
character and save again.

Ref abhirooptalasila#30
shasheene added a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
Applies the Jetbrain Pycharm feature of autoformatting whitespace now that all
existing PR's have been merged.

Ref abhirooptalasila#30
shasheene added a commit to shasheene/AutoSub-dev that referenced this pull request Sep 9, 2021
Uses Pycharm to remove unused imports (and re-order existing ones), which can
be cleanly applied now that all existing PR's have been merged.

Ref abhirooptalasila#30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants