Universal DPP base image
da115115 committed May 28, 2023
1 parent cd809ae commit c401f3e
Showing 4 changed files with 350 additions and 0 deletions.
45 changes: 45 additions & 0 deletions .github/workflows/docker.yml
@@ -0,0 +1,45 @@
name: Docker

on:
  push:
    branches: main

jobs:
  docker_build:
    environment: docker-hub
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@v1

      - name: Lint pyspark-coretto-8-emr-dbs-universal-python
        uses: hadolint/hadolint-action@v2.0.0
        with:
          dockerfile: pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
          failure-threshold: error

      - name: Run privileged
        run: sudo docker run --privileged --rm tonistiigi/binfmt --install arm64

      - name: Build pyspark-coretto-8-emr-dbs-universal-python
        id: docker_build_1
        uses: docker/build-push-action@v3
        with:
          builder: ${{ steps.buildx.outputs.name }}
          context: ./pyspark-coretto-8-emr-dbs-universal-python
          file: ./pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
          push: false
          tags: infrahelpers/dpp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max


150 changes: 150 additions & 0 deletions README.md
@@ -0,0 +1,150 @@

Container images focusing on Data Processing Pipelines (DPP)
============================================================

[![Docker Cloud Build Status](https://img.shields.io/docker/cloud/build/infrahelpers/dpp)](https://hub.docker.com/repository/docker/infrahelpers/dpp/general)

# Overview
[This project](https://github.com/data-engineering-helpers/dpp-images)
produces [OCI](https://opencontainers.org/)
[(Docker-compliant) images](https://hub.docker.com/repository/docker/infrahelpers/dpp/tags),
which provide environments for Data Processing Pipelines (DPP),
ready to use and to be deployed on Modern Data Stack (MDS),
be it on private or public clouds (_e.g._, AWS, Azure, GCP).

These images are based on
[AWS-supported Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html).

These OCI images are aimed at deploying Data Engineering applications,
typically Data Processing Pipelines (DPP), on a
[Modern Data Stack (MDS)](https://www.montecarlodata.com/blog-what-is-a-data-platform-and-how-to-build-one/).

The author of this repository also maintains general purpose cloud
Python OCI images in a
[dedicated GitHub repository](https://github.com/cloud-helpers/cloud-python-images/)
and
[Docker Hub space](https://hub.docker.com/repository/docker/infrahelpers/cloud-python).

Thanks to
[Docker multi-stage builds](https://docs.docker.com/develop/develop-images/multistage-build/),
the same Docker specification file can easily provide two images: one for
everyday data engineering work, and one to deploy the corresponding
applications onto production environments.
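
As an illustration only, each stage of such a file can then be targeted
separately at build time. The `dev` and `prod` stage names below are
hypothetical and not taken from this repository's Dockerfile:
```bash
# Build the image used for everyday data engineering work
# (assumes the Dockerfile declares a stage named "dev" - hypothetical)
$ docker build --target dev -t infrahelpers/dpp:dev .

# Build the image meant for production deployments
# (assumes a stage named "prod" - hypothetical)
$ docker build --target prod -t infrahelpers/dpp:prod .
```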

The Docker images of this repository simply add various utilities to make them
work out of the box with cloud vendors (_e.g._, Azure and AWS command-line
utilities) and cloud-native tools (_e.g._, Pachyderm), on top of the native
[AWS-supported Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html)
images. They also add specific Python versions.

In the OCI images, Python packages are installed with the `pip` utility.
For testing purposes, outside of the containers, Python virtual environments
may be set up with Pyenv and `pipenv`, as detailed in the
[dedicated procedure](http://github.com/machine-learning-helpers/induction-python/tree/master/installation/virtual-env)
of the
[Python induction notebook sub-project](http://github.com/machine-learning-helpers/induction-python).
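
As a minimal sketch of that procedure (the Python version below is just an
example):
```bash
# Install a specific Python version with Pyenv and select it for the project
$ pyenv install 3.9.16
$ pyenv local 3.9.16

# Create/refresh the project virtual environment with pipenv
$ pipenv --rm; pipenv install; pipenv install --dev
```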

Any additional Python module may be installed either:
* With `pip` and some `requirements.txt` dependency specification file:
```bash
$ python3 -mpip install -r requirements.txt
```
* In a dedicated virtual environment, controlled by `pipenv` through
local `Pipfile` (and potentially `Pipfile.lock`) files,
which should be versioned:
```bash
$ pipenv --rm; pipenv install; pipenv install --dev
```

On the other hand, the OCI images install those modules globally.

The Docker images of this repository are intended to run any Data Engineering
application / Data Processing Pipeline (DPP).

## See also
* [Images on Docker Cloud](https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/dpp)
* Cloud Python images:
+ GitHub:
https://github.com/cloud-helpers/cloud-python-images
+ Docker Cloud:
https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/cloud-python
* General purpose C++ and Python with Debian OCI images:
+ GitHub:
https://github.com/cpp-projects-showcase/docker-images/tree/master/debian10
+ Docker Cloud:
https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/cpppython
* General purpose light Python/Debian OCI images:
+ GitHub: https://github.com/machine-learning-helpers/docker-python-light
+ Docker Cloud:
https://cloud.docker.com/u/infrahelpers/repository/docker/artificialintelligence/python-light
* [Native Python OCI images](https://github.com/docker-library/python):
+ [Python 3.12](https://github.com/docker-library/python/tree/master/3.12-rc)
- https://github.com/docker-library/python/tree/master/3.12-rc/buster
+ [Python 3.11](https://github.com/docker-library/python/tree/master/3.11)
- https://github.com/docker-library/python/tree/master/3.11/buster
+ [Python 3.10](https://github.com/docker-library/python/tree/master/3.10)
- https://github.com/docker-library/python/tree/master/3.10/buster
+ [Python 3.9](https://github.com/docker-library/python/tree/master/3.9)
- https://github.com/docker-library/python/tree/master/3.9/buster

# Simple use
* Download the Docker image:
```bash
$ docker pull infrahelpers/dpp
```

* Launch a Spark application:
```bash
$ docker run -it infrahelpers/dpp
```
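
To run an actual PySpark job, one may for instance mount a local script into
the container. The `my_job.py` file name below is just an illustration, and it
assumes the required PySpark dependencies are available in the image or have
been installed beforehand:
```bash
# Mount a local PySpark script and execute it with the image's Python
$ docker run --rm -v $PWD/my_job.py:/tmp/my_job.py \
    infrahelpers/dpp python3 /tmp/my_job.py
```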

# Build your own container image
* Clone the
[Git repository](https://github.com/data-engineering-helpers/dpp):
```bash
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp.git
$ cd dpp
```

* Build the OCI images (here with Docker, but any other tool may be used):
+ Amazon Linux 2 for Elastic Map Reduce (EMR) 6 and DataBricks,
with a single Python installation (whose version may be chosen more freely)
and JDK 8:
```bash
$ docker build -t infrahelpers/cloud-python:pyspark-emr-dbs-univ pyspark-coretto-8-emr-dbs-universal-python
```

* In addition to what the Docker Hub builds, the CI/CD (GitHub Actions)
pipeline also builds the `infrahelpers/dpp` images,
from the
[`pyspark-coretto-8-emr-dbs-universal-python/` directory](pyspark-coretto-8-emr-dbs-universal-python/),
on two CPU architectures, namely the classical AMD64 and the newer ARM64
(see the sketch just below).
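
A local multi-architecture build may be reproduced with Docker Buildx,
mirroring the CI/CD pipeline. This is just a sketch; the QEMU/binfmt step is
only needed on hosts that cannot natively build both architectures:
```bash
# Register QEMU emulators so that ARM64 images can be built on an AMD64 host
$ sudo docker run --privileged --rm tonistiigi/binfmt --install arm64

# Create a Buildx builder (once), then build for both architectures
$ docker buildx create --name dpp-builder --use
$ docker buildx build --platform linux/amd64,linux/arm64 \
    -t infrahelpers/dpp:latest pyspark-coretto-8-emr-dbs-universal-python
```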

* (Optional) Push the newly built images to Docker Hub.
That step is usually not needed, as the images are automatically
built every time there is
[a change on GitHub](https://github.com/data-engineering-helpers/dpp-images/commits/master):
```bash
$ docker login
$ docker push infrahelpers/dpp
```

* Choose which image should be the latest, tag it and upload it to Docker Hub:
```bash
$ docker tag infrahelpers/dpp:py310 infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
```

* Shut down the Docker container:
```bash
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
7b69efc9dc9a de/dpp "/bin/sh -c 'python …" 48 seconds ago Up 47 seconds 0.0.0.0:9000->8050/tcp vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
```

135 changes: 135 additions & 0 deletions pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
@@ -0,0 +1,135 @@
#
# Source: https://github.com/data-engineering-helpers/dpp-images/tree/main/pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
# On Docker Hub: https://hub.docker.com/repository/docker/infrahelpers/dpp/general
# Usual Docker tag: pyspark-emr-dbs-univ (infrahelpers/dpp:pyspark-emr-dbs-univ)
#
# Base image for Data Processing Pipelines (DPP), with images
# for specific Python versions
#
# Inspired by:
# * EMR: https://aws.amazon.com/blogs/big-data/simplify-your-spark-dependency-management-with-docker-in-emr-6-0-0
# * DataBricks: https://github.com/databricks/containers
#
# A pristine Python installation is performed, with a specific version,
# namely 3.9.16 here (latest release of the 3.9 minor version). A 3.9 version
# allows it to work on both DataBricks (which uses Python 3.8 internally
# by default) and AWS EMR (which uses Python 3.7.10 by default).
# As of March 2023, on EMR 6.9, libraries/wheels developed with newer versions
# of Python (e.g., 3.10+) will most certainly fail on AWS EMR.
#
# AWS Corretto / EMR
# ===================
# + https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html
# + https://docs.aws.amazon.com/corretto/latest/corretto-17-ug/docker-install.html
# The underlying operating system (OS) is Amazon Linux 2, i.e., based on a
# RedHat Linux 7 with some Amazon specific additions.
# The Python version is 3.7.15 by default, if installed with the Linux
# distribution.
# Note that, up to at least version 6.9.0 of EMR, only Java 8 is supported.
# With Java 11+, it generates errors like
# https://confluence.atlassian.com/confkb/unrecognized-jvm-gc-options-when-using-java-11-1002472841.html
#
# DataBricks
# ==========
# + Base image Dockerfile: https://github.com/databricks/containers/tree/master/ubuntu/standard
# + Base image on Docker Hub: https://hub.docker.com/r/databricksruntime/standard
# - Usual Docker tag: latest
#
# The underlying operating system (OS) is Ubuntu 18.04 LTS (Bionic Beaver).
# The Python installation has to be a virtual environment in
# /databricks/python3, and Python is the main one (pristine, installed manually
# by that container image)
#
FROM amazoncorretto:8

LABEL authors "Denis Arnaud <denis.arnaud_fedora@m4x.org>"

# Environment
ENV container="docker"
ENV HOME="/root"
ENV HOMEUSR="/home/ubuntu"
ENV PYSPARK_PYTHON="/databricks/python3/bin/python3"
ENV PYSPARK_DRIVER_PYTHON="python3"
ENV PYTHON_MINOR_VERSION="3.9"
ENV PYTHON_MICRO_VERSION="${PYTHON_MINOR_VERSION}.16"

# Update the OS and install a few packages useful for software development
# (needed for some Python modules like SHAP)
RUN yum -y update && \
yum -y install yum-utils && \
yum -y groupinstall development && \
yum clean all

# Install a few more utilities, including pre-requisites for Python
RUN yum -y install procps net-tools hostname iproute coreutils \
openssl-devel less htop passwd which sudo man vim git tar tree \
wget curl file bash-completion keyutils zlib-devel bzip2-devel gzip \
autoconf automake libtool m4 gcc gcc-c++ cmake cmake3 libffi-devel \
readline-devel sqlite-devel jq fuse fuse-libs && \
yum clean all

#
RUN yum -y install passwd fuse fuse-libs iproute net-tools procps coreutils \
gcc openssl-devel gzip bzip2-devel libffi-devel zlib-devel && \
yum clean all

#
WORKDIR $HOME

# Cloud helpers Shell scripts (https://github.com/cloud-helpers/k8s-job-wrappers)
RUN KJW_VER=$(curl -Ls https://api.github.com/repos/cloud-helpers/k8s-job-wrappers/tags|jq -r '.[].name'|grep "^v"|sort -r|head -1|cut -d'v' -f2,2) && \
curl -Ls \
https://github.com/cloud-helpers/k8s-job-wrappers/archive/refs/tags/v${KJW_VER}.tar.gz \
-o k8s-job-wrappers.tar.gz && \
tar zxf k8s-job-wrappers.tar.gz && rm -f k8s-job-wrappers.tar.gz && \
mv -f k8s-job-wrappers-${KJW_VER} /usr/local/ && \
ln -s /usr/local/k8s-job-wrappers-${KJW_VER} /usr/local/k8s-job-wrappers

# AWS: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html
RUN curl -Ls https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip \
-o awscliv2.zip && \
unzip -q awscliv2.zip && rm -f awscliv2.zip && ./aws/install && \
rm -rf ./aws

# SAML-to-AWS (saml2aws)
# https://github.com/Versent/saml2aws
RUN SAML2AWS_VER=$(curl -Ls https://api.github.com/repos/Versent/saml2aws/releases/latest | grep 'tag_name' | cut -d'v' -f2 | cut -d'"' -f1) && \
curl -Ls \
https://github.com/Versent/saml2aws/releases/download/v${SAML2AWS_VER}/saml2aws_${SAML2AWS_VER}_linux_amd64.tar.gz -o saml2aws.tar.gz && \
tar zxf saml2aws.tar.gz && rm -f saml2aws.tar.gz README.md LICENSE.md && \
mv -f saml2aws /usr/local/bin/ && \
chmod 775 /usr/local/bin/saml2aws

# Copy configuration in the user home, for the root user
ADD bashrc $HOME/.bashrc

# Install the PYTHON_MICRO_VERSION version of Python
RUN curl -kLs \
https://www.python.org/ftp/python/${PYTHON_MICRO_VERSION}/Python-${PYTHON_MICRO_VERSION}.tgz \
-o Python-${PYTHON_MICRO_VERSION}.tgz && \
tar zxf Python-${PYTHON_MICRO_VERSION}.tgz && \
rm -f Python-${PYTHON_MICRO_VERSION}.tgz && \
cd Python-${PYTHON_MICRO_VERSION} && \
./configure --prefix=/usr && \
make && \
make altinstall

# Set the PYTHON_MICRO_VERSION version of Python as system Python
# This is what is used by AWS EMR
RUN cp -f /usr/bin/python${PYTHON_MINOR_VERSION} /usr/bin/python3 && \
cp -f /usr/bin/python${PYTHON_MINOR_VERSION} /usr/bin/python

# Create an ubuntu user
RUN useradd ubuntu && \
echo "ubuntu ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu && \
chmod 0440 /etc/sudoers.d/ubuntu

# Create a /databricks/python3 directory and set environment and permissions
# This is what is used by DataBricks
RUN mkdir -p /databricks/python3 && chown -R ubuntu:ubuntu /databricks
ENV PYSPARK_PYTHON="/databricks/python3/bin/python3"

# Install a virtual environment in /databricks/python3
RUN python3 -mpip install virtualenv && \
virtualenv --system-site-packages /databricks/python3

20 changes: 20 additions & 0 deletions pyspark-coretto-8-emr-dbs-universal-python/bashrc
@@ -0,0 +1,20 @@

# Locale
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"

# History
export HISTTIMEFORMAT="%d/%m/%y %T "

# Source global definitions
if [ -f /etc/bashrc ]; then
. /etc/bashrc
fi

# Aliases
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
alias dir='ls -laFh --color'
alias grep='grep --color'
