Showing 4 changed files with 350 additions and 0 deletions.
**GitHub Actions workflow** (`Docker`):

```yaml
name: Docker

on:
  push:
    branches: main

jobs:
  docker_build:
    environment: docker-hub
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Login to Docker Hub
        uses: docker/login-action@v2
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@v1

      - name: Lint pyspark-coretto-8-emr-dbs-universal-python
        uses: hadolint/hadolint-action@v2.0.0
        with:
          dockerfile: pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
          failure-threshold: error

      - name: Run privileged
        run: sudo docker run --privileged --rm tonistiigi/binfmt --install arm64

      - name: Build pyspark-coretto-8-emr-dbs-universal-python
        id: docker_build_1
        uses: docker/build-push-action@v3
        with:
          builder: ${{ steps.buildx.outputs.name }}
          context: ./pyspark-coretto-8-emr-dbs-universal-python
          file: ./pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
          push: false
          tags: infrahelpers/dpp:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

Container images focusing on Data Processing Pipelines (DPP)
============================================================

[![Docker Cloud Build Status](https://img.shields.io/docker/cloud/build/infrahelpers/dpp)](https://hub.docker.com/repository/docker/infrahelpers/dpp/general)

# Overview
[This project](https://github.com/data-engineering-helpers/dpp-images)
produces [OCI](https://opencontainers.org/)
[(Docker-compliant) images](https://hub.docker.com/repository/docker/infrahelpers/dpp/tags),
which provide environments for Data Processing Pipelines (DPP),
ready to use and to be deployed on a Modern Data Stack (MDS),
be it on private or public clouds (_e.g._, AWS, Azure, GCP).

These images are based on
[AWS-supported Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html).

These OCI images are aimed at deploying Data Engineering applications,
typically Data Processing Pipelines (DPP), on a
[Modern Data Stack (MDS)](https://www.montecarlodata.com/blog-what-is-a-data-platform-and-how-to-build-one/).

The author of this repository also maintains general-purpose cloud
Python OCI images in a
[dedicated GitHub repository](https://github.com/cloud-helpers/cloud-python-images/)
and
[Docker Hub space](https://hub.docker.com/repository/docker/infrahelpers/cloud-python).

Thanks to
[Docker multi-stage builds](https://docs.docker.com/develop/develop-images/multistage-build/),
one can easily have, in the same Docker specification file, two images: one
for every-day data engineering work, and the other one to deploy the
corresponding applications onto production environments.

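As an illustration, here is a minimal two-stage sketch of that idea. The stage names, packages and copied paths below are hypothetical, chosen for illustration, and are not taken from this repository:

```dockerfile
# Development stage: full toolchain for every-day data engineering work
FROM amazoncorretto:8 AS dev
RUN yum -y install gcc git vim && yum clean all

# Production stage: starts clean, copying only what the application needs
FROM amazoncorretto:8 AS prod
COPY --from=dev /usr/local /usr/local
```

Either image can then be produced from the same file with `docker build --target dev .` or `docker build --target prod .`.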
The Docker images of this repository just add various utilities to make them
work out of the box with cloud vendors (_e.g._, Azure and AWS command-line
utilities) and cloud-native tools (_e.g._, Pachyderm), on top of the native
[AWS-supported Corretto](https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html)
images. They also add specific Python versions.

In the OCI image, Python packages are installed by the `pip` utility.
For testing purposes, outside of the container, Python virtual environments
may be installed thanks to Pyenv and `pipenv`, as detailed in the
[dedicated procedure](http://github.com/machine-learning-helpers/induction-python/tree/master/installation/virtual-env)
on the
[Python induction notebook sub-project](http://github.com/machine-learning-helpers/induction-python).

Any additional Python module may be installed either:
* With `pip` and some `requirements.txt` dependency specification file:
```bash
$ python3 -m pip install -r requirements.txt
```
* In a dedicated virtual environment, controlled by `pipenv` through
  local `Pipfile` (and potentially `Pipfile.lock`) files,
  which should be versioned:
```bash
$ pipenv --rm; pipenv install; pipenv install --dev
```

On the other hand, the OCI images install those modules globally.

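A quick, standard-library-only way to check which version of a module ended up globally installed (shown here with `pip` as an example; any distribution name works):

```python
# Query the version of a globally installed Python distribution,
# using only the standard library (Python 3.8+).
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version("pip"))  # version string if pip is installed
print(installed_version("not-a-real-distribution"))  # None
```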
The Docker images of this repository are intended to run any Data Engineering
application / Data Processing Pipeline (DPP).

## See also
* [Images on Docker Cloud](https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/dpp)
* Cloud Python images:
  + GitHub: https://github.com/cloud-helpers/cloud-python-images
  + Docker Cloud: https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/cloud-python
* General purpose C++ and Python with Debian OCI images:
  + GitHub: https://github.com/cpp-projects-showcase/docker-images/tree/master/debian10
  + Docker Cloud: https://cloud.docker.com/u/infrahelpers/repository/docker/infrahelpers/cpppython
* General purpose light Python/Debian OCI images:
  + GitHub: https://github.com/machine-learning-helpers/docker-python-light
  + Docker Cloud: https://cloud.docker.com/u/infrahelpers/repository/docker/artificialintelligence/python-light
* [Native Python OCI images](https://github.com/docker-library/python):
  + [Python 3.12](https://github.com/docker-library/python/tree/master/3.12-rc)
    - https://github.com/docker-library/python/tree/master/3.12-rc/buster
  + [Python 3.11](https://github.com/docker-library/python/tree/master/3.11)
    - https://github.com/docker-library/python/tree/master/3.11/buster
  + [Python 3.10](https://github.com/docker-library/python/tree/master/3.10)
    - https://github.com/docker-library/python/tree/master/3.10/buster
  + [Python 3.9](https://github.com/docker-library/python/tree/master/3.9)
    - https://github.com/docker-library/python/tree/master/3.9/buster

# Simple use
* Download the Docker image:
```bash
$ docker pull infrahelpers/dpp
```

* Launch a Spark application:
```bash
$ docker run -it infrahelpers/dpp
```

# Build your own container image
* Clone the
  [Git repository](https://github.com/data-engineering-helpers/dpp):
```bash
$ mkdir -p ~/dev/infra && cd ~/dev/infra
$ git clone https://github.com/data-engineering-helpers/dpp.git
$ cd dpp
```

* Build the OCI images (here with Docker, but any other tool may be used):
  + Amazon Linux 2 for Elastic Map Reduce (EMR) 6 and DataBricks,
    with a single Python installation (with more freedom on its version)
    and with JDK 8:
```bash
$ docker build -t infrahelpers/cloud-python:pyspark-emr-dbs-univ pyspark-coretto-8-emr-dbs-universal-python
```

* In addition to the Docker Hub builds, the CI/CD (GitHub Actions)
  pipeline also builds the `infrahelpers/dpp` images, from the
  [`pyspark-coretto-8-emr-dbs-universal-python/` directory](pyspark-coretto-8-emr-dbs-universal-python/),
  on two CPU architectures, namely the classical AMD64 and the newer ARM64.

* (Optional) Push the newly built images to Docker Hub.
  That step is usually not needed, as the images are automatically
  built every time there is
  [a change on GitHub](https://github.com/data-engineering-helpers/dpp-images/commits/master):
```bash
$ docker login
$ docker push infrahelpers/dpp
```

* Choose which image should be the latest, tag it and upload it to Docker Hub:
```bash
$ docker tag infrahelpers/dpp:py310 infrahelpers/dpp:latest
$ docker push infrahelpers/dpp:latest
```

* Shut down the Docker container:
```bash
$ docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED          STATUS          PORTS                    NAMES
7b69efc9dc9a   de/dpp   "/bin/sh -c 'python …"   48 seconds ago   Up 47 seconds   0.0.0.0:9000->8050/tcp   vigilant_merkle
$ docker kill vigilant_merkle
vigilant_merkle
$ docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED          STATUS          PORTS                    NAMES
```

**Dockerfile** (`pyspark-coretto-8-emr-dbs-universal-python/Dockerfile`):

```dockerfile
#
# Source: https://github.com/data-engineering-helpers/dpp-images/tree/main/pyspark-coretto-8-emr-dbs-universal-python/Dockerfile
# On Docker Hub: https://hub.docker.com/repository/docker/infrahelpers/dpp/general
# Usual Docker tag: pyspark-emr-dbs-univ (infrahelpers/dpp:pyspark-emr-dbs-univ)
#
# Base image for Data Processing Pipelines (DPP), with images
# for specific Python versions
#
# Inspired by:
# * EMR: https://aws.amazon.com/blogs/big-data/simplify-your-spark-dependency-management-with-docker-in-emr-6-0-0
# * DataBricks: https://github.com/databricks/containers
#
# A pristine Python installation is performed, with a specific version,
# namely 3.9.16 here (the latest release of the 3.9 minor version). A 3.9
# version allows it to work on both DataBricks (which uses Python 3.8
# internally by default) and AWS EMR (which uses Python 3.7.10 by default).
# As of March 2023, on EMR 6.9, libraries/wheels developed with newer versions
# of Python (e.g., 3.10+) will most certainly fail on AWS EMR.
#
# AWS Corretto / EMR
# ==================
# + https://docs.aws.amazon.com/corretto/latest/corretto-8-ug/what-is-corretto-8.html
# + https://docs.aws.amazon.com/corretto/latest/corretto-17-ug/docker-install.html
# The underlying operating system (OS) is Amazon Linux 2, i.e., based on a
# RedHat Linux 7 with some Amazon-specific additions.
# The Python version is 3.7.15 by default, if installed with the Linux
# distribution.
# Note that, up to at least version 6.9.0 of EMR, only Java 8 is supported.
# With Java 11+, it generates errors like
# https://confluence.atlassian.com/confkb/unrecognized-jvm-gc-options-when-using-java-11-1002472841.html
#
# DataBricks
# ==========
# + Base image Dockerfile: https://github.com/databricks/containers/tree/master/ubuntu/standard
# + Base image on Docker Hub: https://hub.docker.com/r/databricksruntime/standard
#   - Usual Docker tag: latest
#
# The underlying operating system (OS) is Ubuntu 18.04 LTS (Bionic Beaver).
# The Python installation has to be a virtual environment in
# /databricks/python3, and Python is the main one (pristine, installed
# manually by that container image)
#
FROM amazoncorretto:8

LABEL authors "Denis Arnaud <denis.arnaud_fedora@m4x.org>"

# Environment
ENV container="docker"
ENV HOME="/root"
ENV HOMEUSR="/home/ubuntu"
ENV PYSPARK_PYTHON="/databricks/python3/bin/python3"
ENV PYSPARK_DRIVER_PYTHON="python3"
ENV PYTHON_MINOR_VERSION="3.9"
ENV PYTHON_MICRO_VERSION="${PYTHON_MINOR_VERSION}.16"

# Update the OS and install a few packages useful for software development
# (needed for some Python modules like SHAP)
RUN yum -y update && \
    yum -y install yum-utils && \
    yum -y groupinstall development && \
    yum clean all

# Install a few more utilities, including pre-requisites for Python
RUN yum -y install procps net-tools hostname iproute coreutils \
    openssl-devel less htop passwd which sudo man vim git tar tree \
    wget curl file bash-completion keyutils zlib-devel bzip2-devel gzip \
    autoconf automake libtool m4 gcc gcc-c++ cmake cmake3 libffi-devel \
    readline-devel sqlite-devel jq fuse fuse-libs && \
    yum clean all

#
RUN yum -y install passwd fuse fuse-libs iproute net-tools procps coreutils \
    gcc openssl-devel gzip bzip2-devel libffi-devel zlib-devel && \
    yum clean all

#
WORKDIR $HOME

# Cloud helpers Shell scripts (https://github.com/cloud-helpers/k8s-job-wrappers)
RUN KJW_VER=$(curl -Ls https://api.github.com/repos/cloud-helpers/k8s-job-wrappers/tags|jq -r '.[].name'|grep "^v"|sort -r|head -1|cut -d'v' -f2,2) && \
    curl -Ls \
      https://github.com/cloud-helpers/k8s-job-wrappers/archive/refs/tags/v${KJW_VER}.tar.gz \
      -o k8s-job-wrappers.tar.gz && \
    tar zxf k8s-job-wrappers.tar.gz && rm -f k8s-job-wrappers.tar.gz && \
    mv -f k8s-job-wrappers-${KJW_VER} /usr/local/ && \
    ln -s /usr/local/k8s-job-wrappers-${KJW_VER} /usr/local/k8s-job-wrappers

# AWS: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-linux.html
RUN curl -Ls https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip \
      -o awscliv2.zip && \
    unzip -q awscliv2.zip && rm -f awscliv2.zip && ./aws/install && \
    rm -rf ./aws

# SAML-to-AWS (saml2aws)
# https://github.com/Versent/saml2aws
RUN SAML2AWS_VER=$(curl -Ls https://api.github.com/repos/Versent/saml2aws/releases/latest | grep 'tag_name' | cut -d'v' -f2 | cut -d'"' -f1) && \
    curl -Ls \
      https://github.com/Versent/saml2aws/releases/download/v${SAML2AWS_VER}/saml2aws_${SAML2AWS_VER}_linux_amd64.tar.gz -o saml2aws.tar.gz && \
    tar zxf saml2aws.tar.gz && rm -f saml2aws.tar.gz README.md LICENSE.md && \
    mv -f saml2aws /usr/local/bin/ && \
    chmod 775 /usr/local/bin/saml2aws

# Copy configuration in the user home, for the root user
ADD bashrc $HOME/.bashrc

# Install the PYTHON_MICRO_VERSION version of Python
RUN curl -kLs \
      https://www.python.org/ftp/python/${PYTHON_MICRO_VERSION}/Python-${PYTHON_MICRO_VERSION}.tgz \
      -o Python-${PYTHON_MICRO_VERSION}.tgz && \
    tar zxf Python-${PYTHON_MICRO_VERSION}.tgz && \
    rm -f Python-${PYTHON_MICRO_VERSION}.tgz && \
    cd Python-${PYTHON_MICRO_VERSION} && \
    ./configure --prefix=/usr && \
    make && \
    make altinstall

# Set the PYTHON_MICRO_VERSION version of Python as system Python
# This is what is used by AWS EMR
RUN cp -f /usr/bin/python${PYTHON_MINOR_VERSION} /usr/bin/python3 && \
    cp -f /usr/bin/python${PYTHON_MINOR_VERSION} /usr/bin/python

# Create an ubuntu user
RUN useradd ubuntu && \
    echo "ubuntu ALL=(root) NOPASSWD:ALL" > /etc/sudoers.d/ubuntu && \
    chmod 0440 /etc/sudoers.d/ubuntu

# Create a /databricks/python3 directory and set environment and permissions
# This is what is used by DataBricks
RUN mkdir -p /databricks/python3 && chown -R ubuntu:ubuntu /databricks
ENV PYSPARK_PYTHON="/databricks/python3/bin/python3"

# Install a virtual environment in /databricks/python3
RUN python3 -m pip install virtualenv && \
    virtualenv --system-site-packages /databricks/python3
```

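One caveat in the `RUN` step that fetches k8s-job-wrappers above: plain `sort -r` orders tags lexicographically, which mis-ranks multi-digit version components; GNU `sort -V` compares them numerically. A quick sketch with made-up tag names illustrates the difference:

```shell
# Made-up tag list, mimicking names already extracted from the GitHub tags API
tags='v0.0.9
v0.0.10
v0.0.11'

# Lexicographic descending sort: the character '9' > '1', so v0.0.9 comes first
printf '%s\n' "$tags" | grep '^v' | sort -r | head -1
# prints: v0.0.9

# Version-aware sort picks the real latest tag
printf '%s\n' "$tags" | grep '^v' | sort -rV | head -1
# prints: v0.0.11
```

Switching the pipeline to `sort -rV` would make the tag selection robust once the project reaches double-digit patch releases.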
**Shell configuration** (`bashrc`, copied into the image as `$HOME/.bashrc`):

```bash
# Locale
export LANG="en_US.UTF-8"
export LC_ALL="en_US.UTF-8"

# History
export HISTTIMEFORMAT="%d/%m/%y %T "

# Source global definitions
if [ -f /etc/bashrc ]; then
  . /etc/bashrc
fi

# Aliases
alias rm='rm -i'
alias cp='cp -i'
alias mv='mv -i'
alias dir='ls -laFh --color'
alias grep='grep --color'
```
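The `HISTTIMEFORMAT` value above uses `strftime` codes; a quick way to preview the resulting timestamp format (outside the image, assuming a system `date` command) is:

```shell
# Prints the current time in the same day/month/year hour:min:sec format
# that will prefix each entry of `history`
date +"%d/%m/%y %T"
```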