```{=latex}
\usepackage{hyperref}
\usepackage{graphicx}
\usepackage{listings}
\usepackage{textcomp}
\usepackage{fancyvrb}

\newcommand{\passthrough}[1]{\lstset{mathescape=false}#1\lstset{mathescape=true}}
\providecommand{\tightlist}{%
  \setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}

```

```{=latex}
\title{Building Containers for Python Applications}
\author{Moshe Zadka -- https://cobordism.com}
\date{2021}

\begin{document}
\begin{titlepage}
\maketitle
\end{titlepage}
```

Python is a popular language for many applications.
Those that run in backend services,
now in the 2020s,
are usually run inside containers.
Building containers for Python applications is common.

Often,
with microservice architectures,
it makes sense to build a
"root"
base image which all of the services will build off of.
Most of the following will focus on the
base
image,
since this is where it is easiest to make mistakes.

However,
the applications themselves will also be covered:
what is a good
base
if not something to 
build on top of?

```{=latex}
\frame{\titlepage}
```

Before continuing,
I want to make an acknowledgement of country.
I come from the city of Belmont,
in the San Francisco Bay Area peninsula.

It was built on the ancestral homeland
of the Ramaytush Ohlone people.
You can learn more about them on their
[website](https://www.ramaytush.org/).

```{=latex}
\begin{frame}
\frametitle{Acknowledgement of Country}

Belmont (in San Francisco Bay Area Peninsula)

Ancestral homeland of the Ramaytush Ohlone people

\end{frame}
```

## Good and Bad

Before talking about
*how*
to build good containers,
there needs to be an understanding of what
good containers
*are*.
What distinguishes good containers
from bad ones?

```{=latex}
\begin{frame}
\frametitle{What is good}

\pause


\begin{itemize}
\item To crush your enemies \pause
\item To see them driven before you \pause
\item Um, wrong slides
\end{itemize}


\end{frame}
```

Ah, whoops,
these slides are from a different
talk,
about what is good
*in life*,
not
what is good
*in containers*.
It's time to focus on the topic at hand.
What kind of criteria distinguish
good containers from bad?

```{=latex}
\begin{frame}
\frametitle{What is good}

\begin{itemize}
\item Fast \pause
\item Small \pause
\item Secure \pause
\item Usable
\end{itemize}

\end{frame}
```

This is pretty high-level.
What does
"fast"
mean?
Fast at what?
How small is "small"?
What does it mean to be
"secure"?

These guidelines are rough.
Time to focus on concrete,
measurable,
criteria.

```{=latex}
\begin{frame}
\frametitle{Specifying the requirements}

Let's be more concrete

\begin{itemize}
\item Keep up to date \pause
\item Reproducible builds \pause
\item No compilers in prod \pause
\item Keep size (reasonably) small
\end{itemize}

\end{frame}
```

OK, a bit better.
But still not specific enough.
How exactly is
"keep up to date"
a criterion?
What size is
"reasonably"
small?

Start with
"up to date".
The most important part is that
security updates from the upstream
distribution will be installed
on a regular cadence.
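
As a rough sketch,
assuming a Debian base,
picking up security updates can be a single layer.
The image tag and the rebuild command in the comments are illustrative assumptions,
not a prescribed setup.

```dockerfile
FROM debian:bullseye-slim
# Apply whatever updates the distribution has published so far.
RUN apt-get update \
    && apt-get upgrade -y \
    && rm -rf /var/lib/apt/lists/*
# This layer is cached like any other.
# A scheduled job can force a fresh one with:
#   docker build --pull --no-cache .
```

The cadence comes from how often that rebuild runs,
not from the `Dockerfile` itself.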

```{=latex}
\begin{frame}
\frametitle{Up to date}

\begin{itemize}
\item Install security updates \pause
\item But when?
\end{itemize}


\end{frame}
```

This directly conflicts with the next goal:
"reproducible builds".
The abstract theory of reproducible builds
says that building from the same source
must produce bit-for-bit identical results.
This has many advantages,
but is non-trivial to achieve.

Lowering the bar a bit,
the same source must lead to equivalent results.
While this removes some advantages,
it maintains the most important one.
*Changing*
the source by some amount only results in
*commensurate* changes.

This is the main benefit of reproducible builds.
It allows pushing small fixes
with confidence that there
are no unrelated changes.
This allows less testing for small fixes,
and faster delivery of hot patches.
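
One way to approximate
"equivalent results"
on the Python side is to pin every dependency,
including its hash.
The sketch below assumes a
`requirements.txt`
where each line carries an exact version and
`--hash`
entries
(for example, as generated by `pip-compile --generate-hashes`).
The official
`python`
image is only a stand-in for whatever base image is in use.

```dockerfile
# Stand-in base image; an assumption for the sake of a runnable sketch.
FROM python:3.9-slim-bullseye
COPY requirements.txt /tmp/requirements.txt
# --require-hashes makes pip refuse any package that is unpinned
# or whose downloaded archive does not match a recorded hash.
RUN pip install --require-hashes -r /tmp/requirements.txt
```

With this in place,
a one-line source fix cannot silently pull in a newer dependency.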

```{=latex}
\begin{frame}
\frametitle{Reproducible builds}

Same code gives same results \pause

...mostly

\end{frame}
```

The next criterion sounds almost trivial:
"no compilers in prod".
Compile ahead of time,
and store results in the image.

This criterion is here because without
careful thinking and implementation,
it is surprisingly easy to get wrong.
Many containers have been shipped with
`gcc`
included
because someone did not write their
`Dockerfile`
carefully enough.
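
A minimal sketch of keeping the compiler out,
assuming a multistage build and an illustrative
`/opt/myorg/venv`
path.
The last line is a guard,
not a requirement:
it makes the build fail if
`gcc`
ever sneaks into the runtime stage.

```dockerfile
# Build stage: the compiler lives here and is never shipped.
FROM python:3.9-slim-bullseye AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc libc6-dev
RUN python -m venv /opt/myorg/venv
COPY requirements.txt /tmp/requirements.txt
RUN /opt/myorg/venv/bin/pip install -r /tmp/requirements.txt

# Runtime stage: starts from a clean image and only copies results.
FROM python:3.9-slim-bullseye AS runtime
COPY --from=builder /opt/myorg/venv /opt/myorg/venv
# Fails the build if a gcc binary is found on the PATH.
RUN ! command -v gcc
```

Both stages use the same base image here,
so the virtual environment still points at a valid interpreter after the copy.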

```{=latex}
\begin{frame}
\frametitle{No compilers in prod}

A common anti-pattern \pause

...surprisingly easy to get wrong!


\end{frame}
```

On size,
however,
it is possible to spend an infinite amount of time.
Whether each byte is worth its cost can be debated endlessly.

In practice,
after getting into the low hundreds of megabytes,
this quickly becomes a game of diminishing returns.
Hours of work can go into carefully trimming
a few hundred extra kilobytes.

The point at which to stop depends on the cost structure.
Do you pay per GB? How much?
How many different images use the base image?
Is there something more valuable to do?

In practice,
getting images down to low hundreds of megabytes
(200 or 300)
is fairly easy.
Getting them below 200 is possible with a little more work.
This is usually a good stopping point.

```{=latex}
\begin{frame}
\frametitle{Size}

\begin{itemize}
\item Diminishing returns \pause
\item Cost savings
\end{itemize}

\end{frame}
```

One way to make the process of building a container image
faster and more reliable is to use
*binary wheels*
for packages with native code.
Whether it is in getting the wheels from PyPI,
building wheels into an internal package index,
or even building the wheels as part of a multistage
container build,
binary wheels are a useful tool.
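
The multistage variant might look roughly like this.
The
`/wheels`
directory and the
`/opt/myorg/venv`
path are assumptions made for the example.

```dockerfile
FROM python:3.9-slim-bullseye AS builder
RUN apt-get update \
    && apt-get install -y --no-install-recommends gcc libc6-dev
COPY requirements.txt /tmp/requirements.txt
# Download or build a wheel for every requirement into /wheels.
RUN pip wheel --wheel-dir /wheels -r /tmp/requirements.txt

FROM python:3.9-slim-bullseye AS runtime
COPY requirements.txt /tmp/requirements.txt
COPY --from=builder /wheels /wheels
# --no-index keeps pip away from PyPI; only /wheels is consulted,
# so nothing can be compiled or downloaded in this stage.
RUN python -m venv /opt/myorg/venv \
    && /opt/myorg/venv/bin/pip install \
         --no-index --find-links /wheels -r /tmp/requirements.txt
```

If a wheel is missing from
`/wheels`,
the runtime install fails loudly instead of falling back to a source build.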

```{=latex}
\begin{frame}
\frametitle{Support binary wheels}

Installing and building \pause

Faster \pause

Simplifies images


\end{frame}
```

It is important to add a dedicated user for the
container to run applications as.
This is important for several reasons,
but the overarching theme of all of them
is that it is an important intervention to reduce risk.

In most setups,
root inside the container
is the same as root outside the container.
This makes it much more likely that root
can find a
"container escape".

While it is not impossible for a regular user
to find a privilege escalation bug
and then escape as root,
this increases the complexity of such an attack.
Forcing attackers to use complex attacks
is important,
by both frustrating less dedicated ones
and increasing the chances that a persistent
attacker will trip an auditing alarm.

The other big reason is more mundane:
a root user can do anything
*inside*
the container.
Limiting those abilities is both
a smart bug avoiding strategy
and reduces the attack surface.
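
Adding such a user takes two lines.
In this sketch,
the user name
`app`
is a hypothetical choice,
and the final
`CMD`
is only there to show which user the container ends up running as.

```dockerfile
FROM debian:bullseye-slim
# "app" is an illustrative name for a system account with no login shell.
RUN useradd --system --create-home --shell /usr/sbin/nologin app
# Later build steps, and the running container, use "app" instead of root.
USER app
CMD ["id"]
```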

```{=latex}
\begin{frame}
\frametitle{Not run as root}

General hygiene

\end{frame}
```

Not running as root is also a necessary component for the next good idea:
running with minimal privileges.
Most importantly,
it is a good idea to avoid write permissions as much as possible.
The most important thing to avoid write permissions for is the
virtual environment from which the application is running.

Avoiding such write permissions again lowers the attack surface
by preventing code modifications at runtime.
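
One way to get there,
sketched under the assumption of a hypothetical
`app`
user and an
`/opt/myorg/venv`
path,
is to create and populate the virtual environment as root
and only then switch users.

```dockerfile
FROM python:3.9-slim-bullseye
RUN useradd --system --create-home app
# Created while still root, so every file is owned by root.
RUN python -m venv /opt/myorg/venv
COPY requirements.txt /tmp/requirements.txt
RUN /opt/myorg/venv/bin/pip install -r /tmp/requirements.txt
# From here on, the process can read and execute the environment
# but cannot write to it, so a runtime `pip install` fails.
USER app
ENTRYPOINT ["/opt/myorg/venv/bin/python"]
```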

```{=latex}
\begin{frame}
\frametitle{Minimal privileges}

Especially avoid permissions to \lstinline|pip install|

\end{frame}
```

Leaving security behind,
the next thing to optimize for is performance.
The most important speed-up
criterion here
is
*rebuild*
time.

Modern BuildKit-based builds
try to be smart about which
changes invalidate which cached steps.
In a multistage build,
they also try to run steps which
are provably independent of each other
in parallel.

Writing the
`Dockerfile`
to take advantage of these techniques
is a non-trivial skill to master,
but worthwhile.
Especially useful is to think about
which files change less than others.

One example trick:
first copying
`requirements.txt`
and using it as an argument
to
`pip install -r`,
before copying the source
code and installing it.

This means that downloading and installing
(and sometimes even compiling)
the dependencies will only be
cache-invalidated by the
`requirements.txt`
file.
This allows faster rebuilds for the more common
case
where only the local source code changes.
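
A sketch of that ordering follows.
It assumes the build context contains both a
`requirements.txt`
and an installable project
(a `pyproject.toml` or `setup.py`);
the paths are illustrative.

```dockerfile
FROM python:3.9-slim-bullseye
WORKDIR /src
RUN python -m venv /opt/myorg/venv
# Changes rarely: this layer and the next stay cached across code edits.
COPY requirements.txt .
RUN /opt/myorg/venv/bin/pip install -r requirements.txt
# Changes often: only the layers from here on are rebuilt.
COPY . .
# Dependencies are already installed, so skip resolving them again.
RUN /opt/myorg/venv/bin/pip install --no-deps .
```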

```{=latex}
\begin{frame}
\frametitle{Fast rebuilds}

Responsiveness!

\end{frame}
```

## Bases

To make an apple pie from scratch,
first create the universe.
Creating the universe is a lot of thankless work,
and there are probably more valuable ways to spend work time.

All this is to say you will probably start with
`FROM <some distro>`.
But which distro?
Are we really going back to the 20th
century to relight the distro wars?
Hopefully not!

Instead,
this will cover what kind of issues apply,
specifically,
to base container OSs for Python applications.
Some distributions end up being better for this use-case,
in some ways,
than others.

Make your own choices!

```{=latex}
\begin{frame}
\frametitle{Base OS}

The distro wars are back?

\end{frame}
```

One thing that is more important for containers
than traditional uses of operating systems
is that they are more sensitive to size overhead.
This is because container images tend to be in 1:1
correspondence with applications.

If an application builds a test build on every PR,
and stores it in a registry for a while so that
tests can be run on different environments on this PR,
this stores a lot of versions of the OS
in the registry.

Some of this is alleviated by containers sharing base layers,
but in practice,
less than naively assumed.
This is because images are rebuilt to take in security and critical
bug patches.
This tends to perturb the base OS often enough that caching,
while useful,
is no substitute for a smaller size.

```{=latex}
\begin{frame}
\frametitle{Base - size}

Most modern distros have a decent minimal server \pause

...but Debian is easiest to get smallest.


\end{frame}
```

Since applications are built on top of the base,
it is useful if bumping the base version does not need to happen too often.
Time application teams spend moving to a new base
is time they are not spending developing useful customer-facing features.

This means it is good to find a base that has a long-term support version.
Having a base with around 5 years worth of LTS allows reasonable planning
for upgrades without making it a frequent exercise.

```{=latex}
\begin{frame}
\frametitle{Base - LTS/support}

Usually around 5 years \pause

Gives you time to upgrade!


\end{frame}
```

Together with LTS,
the update policy of the base matters.
Does it update for general bugs?
Only critical bugs?
Security fixes?
Does it backport fixes or try to upgrade to new upstream versions?

```{=latex}
\begin{frame}
\frametitle{Base - Volatility}

How much change?

Security? Backports? Fixes?

\end{frame}
```

Getting concrete,
one popular choice is
`Debian`.
It has a conservative policy on updates,
and a 5 year LTS.

```{=latex}
\begin{frame}
\frametitle{Debian}

LTS: 5 years

Conservative


\end{frame}
```

Another popular choice is `Ubuntu`.
It has slightly more liberal policies
(for example,
it will allow backports for sufficiently good reasons).
Those policies also depend on the subtle differences
between universe and multiverse,
that are beyond the scope of this talk.

```{=latex}
\begin{frame}
\frametitle{Ubuntu}

LTS: 5 years

(Universe, Multiverse, etc...)

Fairly conservative

\end{frame}
```

Alpine is not a good choice for Python-based applications.
Since it uses `musl`, and not `glibc`,
it is not `manylinux` compatible.
This makes a lot of binary wheel issues unnecessarily
more complicated.
This might change in the future with potential `musllinux` support,
but for now,
this is not the best choice.

```{=latex}
\begin{frame}
\frametitle{Alpine (probably not)}

Uses musl, not manylinux compatible

\end{frame}
```

Some distributions have so-called
"rolling releases".
Instead of having a scheduled release
updating to new upstream versions of all packages,
new upstream versions are added as they are released
and integrated.

This works well for desktop,
where using up to date versions is fun.
It can even work well for non-ephemeral servers,
where being able to do in-place upgrades
long term allows minimization
of the need to do complete machine rebuilds.

For containers,
rolling releases are a poor match.
The main benefit
of updating incrementally
is completely lost,
as each image is built from scratch.
Containers are built to be replaced
wholesale.

The biggest downside of rolling releases
is that there is no way to get security updates
without,
potentially,
getting new versions of upstream software.
This can mean an expensive,
immediate,
need to support a new version of an upstream dependency
to push out a security fix.

```{=latex}
\begin{frame}
\frametitle{Rolling releases (probably not)}

Up to date, but... \pause

updates can change major versions!


\end{frame}
```

Starting with CentOS 8,
CentOS is more akin to a rolling release.
It receives new versions of the upstream software
and suffers from the same downsides as a
base OS for containers.

```{=latex}
\begin{frame}
\frametitle{CentOS}

Rolling release!

\end{frame}
```

## Installing Python

Now that there is an operating system
installed in the container,
it is time for the piece de resistance:
a Python interpreter.
Running Python applications requires the interpreter
and the standard library.

Somehow,
the container needs to include them.

```{=latex}
\begin{frame}
\frametitle{How to get Python?}

So many options...

\end{frame}
```

The most obvious choice,
using
`apt install pythonX.Y`
or
`dnf install pythonX.Y`
is also,
unfortunately,
the worst one.
The OS Python is primarily designed
for other packages needing Python,
not for arbitrary applications.

It often does not include
`venv`
or
`pip`
out of the box,
and is sometimes even missing
standard library modules
such as
`sqlite3`.
This makes perfect sense for the
OS Python interpreter.
Packages which need Python are installed
in their own area,
and do not need
`pip`
or
`venv`.
If they need
`sqlite3`,
for example,
they can explicitly note a dependency
on the relevant OS package.

For applications
this makes it a poor choice.


```{=latex}
\begin{frame}
\frametitle{Not system Python}

Distros aim Python at distro packages\pause

not user programs.

\end{frame}
```

There are some 3rd party repositories
packaging Python for use in distributions
as an OS package.
The most famous one is
`deadsnakes`
for Ubuntu,
which precompiles Python packages.

This is a popular choice.
It does mean waiting for the right version
to appear in the repository,
but usually this happens with little delay.

```{=latex}
\begin{frame}
\frametitle{Appropriate repositories}

Famous examples: deadsnakes PPA for Ubuntu

\end{frame}
```

Another option is to use
`pyenv`.
This is particularly useful if a single
`dev`
Python container image needs to have multiple versions
of Python.
The runtime versions can be built from it via
careful copying,
and this allows some flows which require multiple versions of Python
at build time to work.

Even without the need for multiple versions of Python,
`pyenv`
is a popular choice.
It is a well-trusted tool
that can build Python inside a container.

```{=latex}
\begin{frame}
\frametitle{pyenv}

Builds and installs Python


\end{frame}
```

One way to get the biggest benefit of
`pyenv`
without needing some of the overhead that is less useful in containers
(like shims and the ability to switch versions)
is to use
`python-build`.
This is the engine,
inside
`pyenv`,
which builds Python.

Using it directly not only allows skipping redundancies
but also configuring build details on a more
granular basis.
These are possible in
`pyenv`,
but the need to do a pass-through to
`python-build`
makes them more awkward,
especially when there are a lot.
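
As a rough illustration,
a builder stage can call
`python-build`
straight from a checkout of the
`pyenv`
repository.
The Debian release,
the package list,
the Python version,
and the
`/opt/myorg/python`
prefix are all assumptions chosen for the example.

```dockerfile
FROM debian:bullseye AS builder
# Typical CPython build dependencies on Debian, plus tools to fetch sources.
RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential ca-certificates curl git xz-utils \
      libbz2-dev libffi-dev liblzma-dev libncurses-dev \
      libreadline-dev libsqlite3-dev libssl-dev zlib1g-dev
# python-build ships inside the pyenv repository as a plugin.
RUN git clone --depth 1 https://github.com/pyenv/pyenv.git /pyenv
# Usage: python-build <version> <prefix>; 3.9.6 is only an example.
RUN /pyenv/plugins/python-build/bin/python-build 3.9.6 /opt/myorg/python
```

A runtime stage would then copy
`/opt/myorg/python`
out of this one,
and would need the matching shared libraries
(not the `-dev` packages)
installed.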

```{=latex}
\begin{frame}
\frametitle{python-build}

Builds and installs Python

\end{frame}
```

Finally,
or maybe initially,
it is possible to do it
like the people in the
before-times.
The
`configure/make/make install`
flow works,
and removes any barriers
between the developer
and the build.

Any build parameters can be set,
and tweaked.
The main downside is the need
to securely grab a tarball
of the source code
and avoid supply-chain attacks.
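
A sketch of the download step,
assuming the expected checksum was obtained out of band
and is passed in at build time.
The build argument names and the version are hypothetical;
no checksum value is given here on purpose.

```dockerfile
FROM debian:bullseye AS builder
RUN apt-get update && apt-get install -y --no-install-recommends \
      build-essential ca-certificates curl \
      libffi-dev libsqlite3-dev libssl-dev zlib1g-dev
ARG PYTHON_VERSION=3.9.6
# Deliberately no default: pass it with --build-arg PYTHON_SHA256=...
# A build without a checksum fails at the verification step.
ARG PYTHON_SHA256
WORKDIR /build
RUN curl -fsSLO \
      https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz
# sha256sum -c exits non-zero on a mismatch, which aborts the build.
RUN echo "${PYTHON_SHA256}  Python-${PYTHON_VERSION}.tgz" | sha256sum -c -
RUN tar -xzf Python-${PYTHON_VERSION}.tgz
```

Only after the checksum passes does the usual
`configure/make/make install`
flow run inside the unpacked directory.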


```{=latex}
\begin{frame}[fragile]
\frametitle{Source}

\begin{lstlisting}
RUN ./configure [...]
RUN make
RUN make install
\end{lstlisting}

\end{frame}
```

There are inherent trade-offs when choosing.
The trade-off is three-way:

* How much control the local build has over the result
* How much work it is to implement
* The potential for issues

Ultimately,
each team must decide for itself what
trade-offs
are right for it.

```{=latex}
\begin{frame}
\frametitle{Trade-offs}

Control vs. Work vs. Problems

\end{frame}
```

It is usually a good idea to build several versions of the
"base level"
Python containers.
This allows dependent containers to move to a new version
at different times.

The minimum needed for this to work is 2.
While more than 3 are possible,
in practice,
this is usually unnecessary.
Python releases yearly,
so three versions give two years to upgrade to a new,
mostly-backwards compatible,
version of Python.

If a team does not have slack over the course
of two years,
the problem is not one of Python versions.
In practice,
this means the choice is between supporting two or three
versions of Python.


```{=latex}
\begin{frame}
\frametitle{Versions}

Support multiple for upgrade path\pause

2-3


\end{frame}
```


## Thinking in Stages

```{=latex}
\begin{frame}
\frametitle{Docker multistage (quick recap)}

Only one stage output \pause

other stages help

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{FROM}

Use previous stage as starting image

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{COPY --from}

Copy files from previous stage

\end{frame}
```

```{=latex}
\begin{frame}[fragile]
\frametitle{Stages as modules}

\begin{lstlisting}
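FROM ubuntu as security-updates
# Avoid interactive prompts during installs
ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get upgrade -y
# Provides add-apt-repository
RUN apt-get install -y software-properties-common
RUN add-apt-repository -y ppa:deadsnakes/ppa
RUN apt-get update

FROM security-updates as with-38
RUN apt-get install -y python3.8

FROM security-updates as with-39
RUN apt-get install -y python3.9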
\end{lstlisting}

\end{frame}
```

```{=latex}
\begin{frame}[fragile]
\frametitle{Separate build and runtime}

Especially when building from source! \pause

\begin{lstlisting}
FROM ubuntu as builder
# install build dependencies
# build Python into /opt/myorg/python

FROM ubuntu as runtime
COPY --from=builder \
      /opt/myorg/python \
      /opt/myorg/python
\end{lstlisting}

\end{frame}
```

```{=latex}
\begin{frame}[fragile]
\frametitle{Optimizing layers}

Put everything under \lstinline|/opt/myorg|

Use one \lstinline|COPY --from=...|


\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Optimizing size}

After building Python, remove:

\begin{itemize}
\item Tests
\item Builder dependencies (in runtime)
\item ....and more
\end{itemize}

\end{frame}
```

## Use in Applications

```{=latex}
\begin{frame}
\frametitle{Binary wheels}

\begin{itemize}
\item Build with builder
\item Copy to runtime
\item Install in virtual environment
\end{itemize}


\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Binary wheels (alt)}

\begin{itemize}
\item Build with builder
\item Install in virtual environment
\item Copy virtual environment to runtime
\end{itemize}


\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Patchelf}

Used to make wheels self-contained

Newest version needed

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Auditwheel}

Use pip to install

\end{frame}
```

```{=latex}
\begin{frame}[fragile]
\frametitle{Self-contained binary wheels}

Run 

\begin{lstlisting}
auditwheel repair --platform linux_x86_64
\end{lstlisting}

\pause

No need for binary dependencies!
\end{frame}
```

```{=latex}
\begin{frame}[fragile]
\frametitle{Portable binary wheels}

\begin{itemize}
\item Oldest supported?
\end{itemize} 

\pause

Example:

\begin{lstlisting}
auditwheel repair --platform manylinux_2_27_x86_64
\end{lstlisting}


\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Generating binary wheels}

Build instructions in docs
\pause

Build dependencies

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Optimizing layers}

Reduce copies
\pause

Prep

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Optimizing caching}


Where to build wheel?

\pause

What invalidates caching?


\end{frame}
```

## Final Thoughts

```{=latex}
\begin{frame}
\frametitle{Conclusion}

\begin{itemize}
\item Wrong easier than right \pause
\item But right is amazing \pause
\item Think before you docker
\end{itemize}

\end{frame}
```

```{=latex}
\begin{frame}
\frametitle{Further Resources}

Itamar's series -- https://pythonspeed.com/docker/

\end{frame}
```

```{=latex}
\end{document}
```