# Text Classification Using AWS Deep Learning Docker Containers on Singularity

This notebook adapts the following AWS SageMaker lab guide to run on Singularity: https://github.com/lbnl-science-it/aws-sagemaker-keras-text-classification

# Data Exploration

In [3]:
import pandas as pd
import tensorflow as tf
import re
import numpy as np
import os

# Import through the public tf.keras namespace rather than the private tensorflow.python package
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [4]:
column_names = ["TITLE", "URL", "PUBLISHER", "CATEGORY", "STORY", "HOSTNAME", "TIMESTAMP"]
news_dataset = pd.read_csv(os.path.join('./data', 'newsCorpora.csv'), names=column_names, header=None, delimiter='\t')
news_dataset.head()

| | TITLE | URL | PUBLISHER | CATEGORY | STORY | HOSTNAME | TIMESTAMP |
|---|---|---|---|---|---|---|---|
| 1 | Fed official says weak data caused by weather,... | http://www.latimes.com/business/money/la-fi-mo... | Los Angeles Times | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.latimes.com | 1394470370698 |
| 2 | Fed's Charles Plosser sees high bar for change... | http://www.livemint.com/Politics/H2EvwJSK2VE6O... | Livemint | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.livemint.com | 1394470371207 |
| 3 | US open: Stocks fall after Fed official hints ... | http://www.ifamagazine.com/news/us-open-stocks... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371550 |
| 4 | Fed risks falling 'behind the curve', Charles ... | http://www.ifamagazine.com/news/fed-risks-fall... | IFA Magazine | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.ifamagazine.com | 1394470371793 |
| 5 | Fed's Plosser: Nasty Weather Has Curbed Job Gr... | http://www.moneynews.com/Economy/federal-reser... | Moneynews | b | ddUyU0VZz0BRneMioxUPQVP6sIxvM | www.moneynews.com | 1394470372027 |


In [5]:
news_dataset.groupby(['CATEGORY']).size()

CATEGORY
b    115967
e    152469
m     45639
t    108344
dtype: int64
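
The counts above show an uneven class distribution: entertainment (`e`) has more than three times as many headlines as health (`m`). The following is a minimal sketch, using only the counts printed above, of how the class shares and a set of inverse-frequency class weights could be computed. The category names follow the UCI News Aggregator dataset description, and the weighting formula `total / (n_classes * count)` is a common heuristic chosen here for illustration; the original lab does not state that it applies any class weighting.

```python
# Category counts copied from the groupby output above.
category_counts = {"b": 115967, "e": 152469, "m": 45639, "t": 108344}

# Human-readable names, per the UCI News Aggregator dataset description.
category_names = {
    "b": "business",
    "e": "entertainment",
    "m": "health",
    "t": "science and technology",
}

total = sum(category_counts.values())
n_classes = len(category_counts)

# Share of the corpus held by each category.
class_share = {c: n / total for c, n in category_counts.items()}

# Inverse-frequency weights (assumed heuristic): rarer classes get larger weights.
class_weight = {c: total / (n_classes * n) for c, n in category_counts.items()}

for code in sorted(category_counts):
    print(f"{code} ({category_names[code]}): "
          f"share={class_share[code]:.3f}, weight={class_weight[code]:.3f}")
```

If weighting turns out to be useful, a dictionary like `class_weight`, re-keyed by integer label index, is the kind of value Keras accepts through the `class_weight` argument of `Model.fit`.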

# Training your Algorithm

### Building the Singularity Container from the Available AWS Deep Learning Containers Images
https://aws.amazon.com/releasenotes/available-deep-learning-containers-images/
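
Each image on that page is addressed by an ECR URI, which is what goes on the `FROM` line of the `Dockerfile`. The following is a minimal sketch of how such a URI is assembled, using the account ID, region, repository, and tag that appear in this notebook's build. The helper name `dlc_image_uri` is hypothetical, and other images may use different registry accounts, regions, or tags, so check the release notes page before relying on the defaults.

```python
def dlc_image_uri(repository: str, tag: str,
                  region: str = "us-east-2",
                  account_id: str = "763104351884") -> str:
    """Assemble an ECR image URI for a Deep Learning Containers image.

    The defaults mirror the registry ID and region used in this notebook;
    they are assumptions for any other setup.
    """
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Base image used by this notebook's Dockerfile (TensorFlow 1.14, CPU, Python 3.6).
base_image = dlc_image_uri("tensorflow-training", "1.14.0-cpu-py36-ubuntu16.04")
print(f"FROM {base_image}")
```

The region in the URI should match the region passed to the ECR login command, since each regional registry requires its own login.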

The following shell cell downloads the dataset, builds the container image with `docker`, converts it to a `Singularity` image (`.sif`), and runs a local training test.

In [6]:
%%sh

cd container

####################################################
########## Download and unzip the dataset ##########
####################################################
cd ../data/
wget https://danilop.s3-eu-west-1.amazonaws.com/reInvent-Workshop-Data-Backup.zip && unzip reInvent-Workshop-Data-Backup.zip
mv reInvent-Workshop-Data-Backup/* ./
rm -rf reInvent-Workshop-Data-Backup reInvent-Workshop-Data-Backup.zip
cd ../container/

###################################################################################
######### Build the SageMaker Container & Convert it to Singularity image #########
###################################################################################
algorithm_name=sagemaker-keras-text-classification

chmod +x sagemaker_keras_text_classification/train
chmod +x sagemaker_keras_text_classification/serve

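# Get the region defined in the current AWS CLI configuration.
# Fall back to us-east-2 (assumed default: the region of the base image in the Dockerfile).
# The login region must match the region of the registry named in the Dockerfile's FROM line.
region=$(aws configure get region)
region=${region:-us-east-2}

fullname="local_${algorithm_name}:latest"

# Get the login command from ECR and execute it directly
# (`aws ecr get-login` is an AWS CLI v1 command)
$(aws ecr get-login --no-include-email --region ${region} --registry-ids 763104351884)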

# Build the docker image locally with the image name
# In the "Dockerfile", modify the source image to select one of the available deep learning docker containers images:
# https://aws.amazon.com/releasenotes/available-deep-learning-containers-images
docker build  -t ${algorithm_name} .
docker tag ${algorithm_name} ${fullname}

# Build Singularity image from local docker image
sifname="local_sagemaker-keras-text-classification.sif"
sudo singularity build ${sifname} docker-daemon:${fullname}

################################
########## Local Test ########## 
################################
cd ../data
cp -a . ../container/local_test/test_dir/input/data/training/
cd ../container
cd local_test

### Train
./train_local.sh ../${sifname}

### Prediction
#./serve_local.sh ../${sifname}

Archive:  reInvent-Workshop-Data-Backup.zip
   creating: reInvent-Workshop-Data-Backup/
  inflating: reInvent-Workshop-Data-Backup/glove.6B.100d.txt  
  inflating: reInvent-Workshop-Data-Backup/newsCorpora.csv  
Sending build context to Docker daemon  1.093GB
Step 1/9 : FROM 763104351884.dkr.ecr.us-east-2.amazonaws.com/tensorflow-training:1.14.0-cpu-py36-ubuntu16.04
 ---> e6a210ff54e4
Step 2/9 : RUN apt-get update &&     apt-get install -y nginx imagemagick graphviz
 ---> Using cache
 ---> 32ff2dce1af3
Step 3/9 : RUN pip install --upgrade pip
 ---> Using cache
 ---> 4e1b65ea3a65
Step 4/9 : RUN pip install gevent gunicorn flask tensorflow_hub seqeval graphviz nltk spacy tqdm
 ---> Using cache
 ---> d97c22f6de86
Step 5/9 : RUN python -m spacy download en_core_web_sm
 ---> Using cache
 ---> 14c8854a1901
Step 6/9 : RUN python -m spacy download en
 ---> Using cache
 ---> 185661d9e15d
Step 7/9 : ENV PATH="/opt/program:${PATH}"
 ---> Using cache
 ---> b5d5c6867074
Step 8/9 : COPY sagemaker_ke

--2020-06-08 21:40:00--  https://danilop.s3-eu-west-1.amazonaws.com/reInvent-Workshop-Data-Backup.zip
Resolving danilop.s3-eu-west-1.amazonaws.com (danilop.s3-eu-west-1.amazonaws.com)... 52.218.101.40
Connecting to danilop.s3-eu-west-1.amazonaws.com (danilop.s3-eu-west-1.amazonaws.com)|52.218.101.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 163330090 (156M) [application/zip]
Saving to: ‘reInvent-Workshop-Data-Backup.zip’
