# Building a Ludwig base Docker Image

We start our MLOps journey by creating a base Docker image that supports [Ludwig](https://uber.github.io/ludwig/) models.

After we create the Dockerfile and test it locally, we'll push it to our first pipeline, which builds the image and makes it available in ECR.

The image is based on `tensorflow-training` and installs `ludwig` along with the Python libraries needed to serve inference.

## First, let's create a Dockerfile

In [None]:
%%writefile Dockerfile
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/tensorflow-training:1.14-cpu-py3
    
RUN apt-get update -y && apt-get install -y libev-dev
RUN pip install bottle bjoern 
RUN pip install ludwig[text] # 0.2.1
RUN python -m spacy download en

RUN mkdir -p /opt/program
RUN mkdir -p /opt/ml

ENV PYTHONUNBUFFERED=TRUE
ENV PYTHONDONTWRITEBYTECODE=TRUE
ENV PATH="/opt/program:${PATH}"

COPY app.py /opt/program
WORKDIR /opt/program

EXPOSE 8080
ENTRYPOINT ["python", "app.py"]

## Then, a basic application that will host our model code

Notice that we're creating a web service application with two routes: **ping** and **invocations**. `ping` is the health check, and `invocations` calls your model.
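
As a rough illustration of how a client might talk to these two routes, here is a minimal sketch using only the standard library. It assumes the container is running locally on port 8080 and that `invocations` accepts a CSV body with a header row, as the `app.py` defined below does. The column names `title` and `body` are hypothetical; use your own model's input features.

```python
import csv
import io
import urllib.request

# Assumption: the container is running locally and listening on port 8080.
BASE_URL = "http://localhost:8080"


def build_csv_payload(columns, rows):
    """Serialize rows into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def ping(base_url=BASE_URL):
    """Return the HTTP status code of the health check route."""
    with urllib.request.urlopen(base_url + "/ping") as response:
        return response.status


def invoke(payload, base_url=BASE_URL):
    """POST a CSV payload to /invocations and return the response text."""
    request = urllib.request.Request(
        base_url + "/invocations",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "text/csv"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical feature columns, for illustration only.
payload = build_csv_payload(["title", "body"], [["hello", "some text"]])
print(payload)
```

With the container running, `ping()` followed by `invoke(payload)` would exercise both routes. Neither function is called here, so the block runs without a server.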

For a production environment it is important to serve the application through a **WSGI** server. We will combine **bottle** and **bjoern**: bottle is the web framework that defines our API, and bjoern is the WSGI server that runs it. Because bjoern is single-threaded, the endpoint handles one prediction at a time. If you need concurrent predictions, consider a multi-worker server such as gunicorn behind a reverse proxy that protects your endpoint.
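
To make the division of labor concrete: WSGI is just a calling convention, where the server hands the application an `environ` dict and a `start_response` callback. The sketch below is a hand-written WSGI callable with the same two routes, called in-process through a small hypothetical helper named `call`. It is not how bottle is implemented, only an illustration of the interface that a server such as bjoern drives. The echo in `/invocations` is a placeholder for the model call.

```python
import io
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A bare WSGI callable that routes /ping and /invocations."""
    path = environ.get("PATH_INFO", "")
    method = environ.get("REQUEST_METHOD", "GET")

    if path == "/ping" and method == "GET":
        status, body = "200 OK", b"OK"
    elif path == "/invocations" and method == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = environ["wsgi.input"].read(length)
        # Placeholder: a real handler would run the model on the payload.
        status, body = "200 OK", payload
    else:
        status, body = "404 Not Found", b"not found"

    headers = [("Content-Type", "text/plain"),
               ("Content-Length", str(len(body)))]
    start_response(status, headers)
    return [body]


def call(path, method="GET", data=b""):
    """Invoke the WSGI callable in-process, without a network server."""
    environ = {}
    setup_testing_defaults(environ)
    environ.update({
        "PATH_INFO": path,
        "REQUEST_METHOD": method,
        "CONTENT_LENGTH": str(len(data)),
        "wsgi.input": io.BytesIO(data),
    })
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(application(environ, start_response))
    return captured["status"], body


print(call("/ping"))
print(call("/invocations", "POST", b"a,b\n1,2\n"))
```

Because the server processes one call to `application` at a time when it is single-threaded, a slow prediction blocks every other request, including health checks.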

In [None]:
%%writefile app.py
import argparse
import os
import sys
import logging
import json
import time
import numpy as np

# JSON encoder that converts NumPy types (e.g. int64) to native Python types
class NpEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(NpEncoder, self).default(obj)

# Import python serving
import bjoern
import bottle
from bottle import run, request, post, get

# Import ludwig library
import pandas as pd
from io import StringIO
import glob
import ludwig
from ludwig.api import LudwigModel

print('ludwig: {}'.format(ludwig.__version__))

def parse_args():
    parser = argparse.ArgumentParser()

    # parameters for training (TODO: Add horovod etc)
    parser.add_argument('--experiment_name', type=str, default='sagemaker_experiment')
    parser.add_argument('--trial_name', type=str, default='run')
    parser.add_argument('--pandas_engine', type=str, default='python')
    
    # data directories
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN', '/opt/ml/input/data/training'))
    parser.add_argument('--validation', type=str, default=os.environ.get('SM_CHANNEL_VALIDATION', '/opt/ml/input/data/validation'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST', '/opt/ml/input/data/test'))

    # input_config
    parser.add_argument('--config_dir', type=str, default=os.environ.get('SM_INPUT_CONFIG_DIR', '/opt/ml/input/config'))

    # model directory: we will use the default set by SageMaker, /opt/ml/model
    parser.add_argument('--output_dir', type=str, default=os.environ.get('SM_OUTPUT_DATA_DIR', '/opt/ml/output/data'))
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR', '/opt/ml/model'))
    
    return parser.parse_known_args()   

def read_csv_dataframe(path, engine='python'):
    files = glob.glob(os.path.join(path, '*.csv'))
    if len(files) > 0:
        return pd.concat([pd.read_csv(fn, engine=engine) for fn in files], axis=0, ignore_index=True)

def ludwig_train():
    args, _ = parse_args()
    print(args)
    
    # Create model from definition
    ludwig_model = LudwigModel(None, model_definition_file='model_definition.yml')
    
    # Allow specifying training hyperparameters 
    # see: https://uber.github.io/ludwig/user_guide/#training
    training_config_path = os.path.join(args.config_dir, 'hyperparameters.config')
    if os.path.exists(training_config_path):
        with open(training_config_path, 'r') as tc:
            ludwig_model.model_definition['training'] = json.load(tc)

    print('model definition', json.dumps(ludwig_model.model_definition))

    # Load the train/validation/test files
    data_train_df = read_csv_dataframe(args.train, engine=args.pandas_engine)
    data_validation_df = read_csv_dataframe(args.validation, engine=args.pandas_engine)
    data_test_df = read_csv_dataframe(args.test, engine=args.pandas_engine)
    
    print('training model...')
    train_stats = ludwig_model.train(
        skip_save_log=True, # Don't save tensorboard
        skip_save_processed_input=True, # Don't save pre-processed input
        data_train_df=data_train_df,
        data_validation_df=data_validation_df,
        data_test_df=data_test_df,
        output_directory=args.output_dir,
        experiment_name=args.experiment_name,
        model_name=args.trial_name
    )

    # TODO: Output stats for logging
    print('train stats', json.dumps(train_stats))
        
    # Save the ludwig model 
    ludwig_model.save(args.model_dir)
    
#     # Optionally save the model for serving in a numbered directory
#     saved_model_path = os.path.join(args.model_dir, str(int(time.time())))
#     ludwig_model.save_for_serving(saved_model_path)
    
    # Print output files and close
    print('model output', os.listdir(args.model_dir))
    ludwig_model.close()    
        
ludwig_model = None
args = None

def load_model():
    global ludwig_model
    global args
    if ludwig_model is None:
        # Load model and print definition if not already loaded
        args, _ = parse_args()
        print('args', args)
        print('loading model...')
        ludwig_model = LudwigModel.load(args.model_dir)
        print('model definition', json.dumps(ludwig_model.model_definition))        
    return ludwig_model, args
    
@get('/ping')
def ping():
    print('ping')
    # Load/cache the model on ping
    load_model()
    return "OK"

@post('/invocations')
def invoke():
    payload = request.body.read().decode('utf-8')
    data_df = pd.read_csv(StringIO(payload))
    print('invoke')
    print(data_df)
    ludwig_model, _ = load_model()
    predictions = ludwig_model.predict(data_df=data_df)
    return predictions.to_csv(index=False)

if __name__ == "__main__":
    if len(sys.argv) < 2 or sys.argv[1] not in ("serve", "train", "test"):
        raise Exception("Invalid argument: you must pass 'train' for training mode, 'serve' for serving mode, or 'test' for test mode")

    train = sys.argv[1] == "train"
    test = sys.argv[1] == "test"

    # TEMP: Print out all files mounted under /opt/ml
    print([os.path.join(dp, f) for dp, dn, fn in os.walk(os.path.expanduser("/opt/ml")) for f in fn])    
    
    if train:
        ludwig_train()
    elif test:
        # Read and write to local file
        print('test', sys.argv[2], sys.argv[3], sys.argv[4])
        data_df = pd.read_csv(sys.argv[2])
        print(data_df.head())
        ludwig_model, args = load_model()
        predictions, test_stats = ludwig_model.test(data_df=data_df)    
        predictions.to_csv(sys.argv[3], index=False)
        # Write test stats to a file
        with open(sys.argv[4], 'w') as tsf:    
            json.dump(test_stats, tsf, cls=NpEncoder)        
    else:
        bjoern.run(bottle.app(), "0.0.0.0", 8080)

## Finally, let's create the buildspec

CodeBuild will use this file to build our base image and push it to ECR.

In [None]:
%%writefile buildspec.yml
version: 0.2

phases:
  install:
    runtime-versions:
      docker: 18
        
  pre_build:
    commands:
      - echo Logging in to Amazon ECR...
      - $(aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884) 
      - $(aws ecr get-login --no-include-email --region $AWS_DEFAULT_REGION --registry-ids $AWS_ACCOUNT_ID)
  build:
    commands:
      - echo Build started on `date` for commit $CODEBUILD_RESOLVED_SOURCE_VERSION
      - echo Building the Docker image... 
      - docker build -t $IMAGE_REPO_NAME:$IMAGE_TAG .
      - docker tag $IMAGE_REPO_NAME:$IMAGE_TAG $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG

  post_build:
    commands:
      - echo Build completed on `date`
      - echo Pushing the Docker image...
      - echo docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - docker push $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG
      - echo $AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/$IMAGE_REPO_NAME:$IMAGE_TAG > image.url
      - cat image.url
      - echo Done!
artifacts:
  files:
    - image.url
  name: image_url
  discard-paths: yes
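
The buildspec expects `AWS_ACCOUNT_ID`, `AWS_DEFAULT_REGION`, `IMAGE_REPO_NAME`, and `IMAGE_TAG` in the build environment, and combines them into the image URI that ends up in `image.url`. As a small sketch of that composition, the hypothetical helper `ecr_image_uri` below builds the same string from a mapping of variables. The account ID shown is a placeholder.

```python
import os


def ecr_image_uri(env=os.environ):
    """Compose the ECR image URI the same way the buildspec commands do."""
    return "{account}.dkr.ecr.{region}.amazonaws.com/{repo}:{tag}".format(
        account=env["AWS_ACCOUNT_ID"],
        region=env["AWS_DEFAULT_REGION"],
        repo=env["IMAGE_REPO_NAME"],
        tag=env["IMAGE_TAG"],
    )


# Placeholder values for illustration only.
example_env = {
    "AWS_ACCOUNT_ID": "123456789012",
    "AWS_DEFAULT_REGION": "us-east-1",
    "IMAGE_REPO_NAME": "scikit-base",
    "IMAGE_TAG": "test",
}
print(ecr_image_uri(example_env))
```

A missing variable raises `KeyError` here, whereas in the shell an unset variable silently expands to an empty string and produces a malformed URI.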

### Building the image locally, first

Test that we can pull the `tensorflow-training` image and then build our image locally.

By default, the Dockerfile and `buildspec.yml` pull the `tensorflow-training` image from the `us-east-1` region.

In [None]:
!$(aws ecr get-login --no-include-email --region us-east-1 --registry-ids 763104351884)

In [None]:
!sudo docker build -f Dockerfile -t scikit-base:latest .

### Before we push our code to the repo, let's check the build process

In [None]:
import boto3

sts_client = boto3.client("sts")
session = boto3.session.Session()

account_id = sts_client.get_caller_identity()["Account"]
region = session.region_name
credentials = session.get_credentials()
credentials = credentials.get_frozen_credentials()

repo_name='scikit-base'
image_tag='test'

In [None]:
!mkdir -p tests
!cp app.py Dockerfile buildspec.yml tests/
with open('tests/vars.env', 'w') as f:
    f.write("AWS_ACCOUNT_ID=%s\n" % account_id)
    f.write("IMAGE_TAG=%s\n" % image_tag)
    f.write("IMAGE_REPO_NAME=%s\n" % repo_name)
    f.write("AWS_DEFAULT_REGION=%s\n" % region)
    f.write("AWS_ACCESS_KEY_ID=%s\n" % credentials.access_key)
    f.write("AWS_SECRET_ACCESS_KEY=%s\n" % credentials.secret_key)
    f.write("AWS_SESSION_TOKEN=%s\n" % credentials.token )

!cat tests/vars.env
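
Before launching the local build, it can help to confirm that the env file defines every variable the buildspec references, without echoing credentials to the notebook output. The following is a minimal sketch under those assumptions. The helper names (`parse_env_text`, `missing_keys`, `masked`) are hypothetical, and the string `None` is treated as missing because `"%s" % region` writes it when no default region is configured.

```python
REQUIRED_KEYS = (
    "AWS_ACCOUNT_ID",
    "AWS_DEFAULT_REGION",
    "IMAGE_REPO_NAME",
    "IMAGE_TAG",
)
SECRET_MARKERS = ("SECRET", "TOKEN", "ACCESS_KEY")


def parse_env_text(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        values[key.strip()] = value
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or the string 'None'."""
    return [key for key in required if values.get(key) in (None, "", "None")]


def masked(values):
    """Hide values whose key name suggests a credential."""
    return {
        key: "****" if any(marker in key for marker in SECRET_MARKERS) else value
        for key, value in values.items()
    }


# Sample text standing in for the contents of tests/vars.env.
sample = "AWS_ACCOUNT_ID=123456789012\nIMAGE_TAG=test\nAWS_SECRET_ACCESS_KEY=abc\n"
env = parse_env_text(sample)
print(missing_keys(env))
print(masked(env))
```

In the notebook, you could pass `open('tests/vars.env').read()` to `parse_env_text` and print the masked result instead of the raw file.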

In [None]:
%%time

!/tmp/aws-codebuild/local_builds/codebuild_build.sh \
    -a "$PWD/tests/output" \
    -s "$PWD/tests" \
    -i "samirsouza/aws-codebuild-standard:2.0" \
    -e "$PWD/tests/vars.env" \
    -c

## Ok, now it's time to push everything to the correct repo

In [None]:
%%bash

cd ../../../mlops-workshop-images/scikit_base
cp $OLDPWD/buildspec.yml $OLDPWD/app.py $OLDPWD/Dockerfile .

git add buildspec.yml app.py Dockerfile
git commit -a -m " - files for building a scikit learn image"
git push

### Ok, now open the AWS console in another tab and go to the CodePipeline console to see the status of our build pipeline
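
If you prefer to check from the notebook instead of the console, the status can also be read with boto3. The sketch below assumes you know your pipeline's name (the one used here is a placeholder) and that your credentials allow `codepipeline:GetPipelineState`. The summary helper is exercised on a hand-written response so the block runs offline, and `Unknown` is a label chosen here for stages with no execution yet.

```python
def summarize_pipeline_state(state):
    """Map each stage name to the status of its latest execution."""
    summary = {}
    for stage in state.get("stageStates", []):
        latest = stage.get("latestExecution", {})
        summary[stage["stageName"]] = latest.get("status", "Unknown")
    return summary


def fetch_pipeline_state(pipeline_name):
    """Call CodePipeline; requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the summary helper works offline

    client = boto3.client("codepipeline")
    return client.get_pipeline_state(name=pipeline_name)


# Hand-written example response, for illustration only.
example_state = {
    "pipelineName": "my-pipeline",  # placeholder name
    "stageStates": [
        {"stageName": "Source", "latestExecution": {"status": "Succeeded"}},
        {"stageName": "Build", "latestExecution": {"status": "InProgress"}},
    ],
}
print(summarize_pipeline_state(example_state))
```

Against a real account, `summarize_pipeline_state(fetch_pipeline_state("<your pipeline name>"))` would return the same kind of mapping for the live pipeline.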