# Transfer Learning Using Keras and MADlib

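This notebook walks through a transfer learning example based on https://keras.io/examples/mnist_transfer_cnn/. A convolutional network is first trained on MNIST digits 0 through 4. Its feature (convolutional) layers are then frozen, and only the classification (dense) layers are retrained on digits 5 through 9.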

## Table of contents
<a href="#import_libraries">1. Import libraries</a>

<a href="#load_and_prepare_data">2. Load and prepare data</a>

<a href="#image_preproc">3. Call image preprocessor</a>

<a href="#define_and_load_model">4. Define and load model architecture</a>

<a href="#train">5. Train</a>

<a href="#transfer_learning">6. Transfer learning</a>

In [1]:
%load_ext sql

  warn("IPython.utils.traitlets has moved to a top-level traitlets package.")


In [2]:
# Greenplum Database 5.x on GCP (PM demo machine)
#%sql postgresql://gpadmin@35.184.232.200:5432/madlib
  
# Greenplum Database 5.x on GCP for deep learning (PM demo machine)
%sql postgresql://gpadmin@35.239.240.26:5432/madlib
        
# PostgreSQL local
#%sql postgresql://fmcquillan@localhost:5432/madlib

u'Connected: gpadmin@madlib'

In [3]:
%sql select madlib.version();
#%sql select version();

1 rows affected.


version
"MADlib version: 1.16-dev, git revision: rel/v1.15.1-129-g954609a, cmake configuration time: Fri Jun 21 18:18:35 UTC 2019, build type: release, build system: Linux-3.10.0-957.12.1.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5"


<a id="import_libraries"></a>
# 1.  Import libraries
Import the libraries and define the parameters used in https://keras.io/examples/mnist_transfer_cnn/

In [4]:
from __future__ import print_function

import datetime
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

now = datetime.datetime.now

batch_size = 128
num_classes = 5
epochs = 5

# input image dimensions
img_rows, img_cols = 28, 28
# number of convolutional filters to use
filters = 32
# size of pooling area for max pooling
pool_size = 2
# convolution kernel size
kernel_size = 3

if K.image_data_format() == 'channels_first':
    input_shape = (1, img_rows, img_cols)
else:
    input_shape = (img_rows, img_cols, 1)

Using TensorFlow backend.


Couldn't import dot_parser, loading of dot files will not be possible.


Other libraries needed in this notebook:

In [5]:
import pandas as pd
import numpy as np

<a id="load_and_prepare_data"></a>
# 2.  Load and prepare data

First load MNIST data from Keras, consisting of 60,000 28x28 grayscale images of the 10 digits, along with a test set of 10,000 images.

In [14]:
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# create two datasets one with digits below 5 and one with 5 and above
x_train_lt5 = x_train[y_train < 5]
y_train_lt5 = y_train[y_train < 5]
x_test_lt5 = x_test[y_test < 5]
y_test_lt5 = y_test[y_test < 5]

x_train_gte5 = x_train[y_train >= 5]
y_train_gte5 = y_train[y_train >= 5] - 5
x_test_gte5 = x_test[y_test >= 5]
y_test_gte5 = y_test[y_test >= 5] - 5

# reshape to match model architecture
print(x_test_gte5.shape)
x_train_lt5 = x_train_lt5.reshape(len(x_train_lt5), *input_shape)
x_test_lt5 = x_test_lt5.reshape(len(x_test_lt5), *input_shape)
x_train_gte5 = x_train_gte5.reshape(len(x_train_gte5), *input_shape)
x_test_gte5 = x_test_gte5.reshape(len(x_test_gte5), *input_shape)
print(x_test_gte5.shape)

(4861, 28, 28)
(4861, 28, 28, 1)
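
The split and label shift above can be checked on a toy array before running it on the full dataset. The following is a minimal sketch with made-up 2x2 "images"; the array contents and variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

# Toy stand-ins for MNIST: six 2x2 "images" with digit labels (illustrative only).
x = np.arange(6 * 2 * 2).reshape(6, 2, 2)
y = np.array([0, 7, 4, 5, 9, 2])

# A boolean mask on the labels selects whole images along the first axis.
low = y < 5
x_lt5, y_lt5 = x[low], y[low]

# Labels 5..9 are shifted down to 0..4 so both tasks use the same 5-class output layer.
x_gte5, y_gte5 = x[~low], y[~low] - 5

# Append a trailing channel axis, as reshape(len(...), *input_shape) does for channels_last.
x_gte5 = x_gte5.reshape(len(x_gte5), 2, 2, 1)

print(y_lt5.tolist())   # [0, 4, 2]
print(y_gte5.tolist())  # [2, 0, 4]
print(x_gte5.shape)     # (3, 2, 2, 1)
```

Shifting the labels is what lets the second task reuse the first model's architecture unchanged.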


Load the datasets into database tables using the image loader script:

In [7]:
# MADlib tools directory
import sys
import os
madlib_site_dir = '/Users/fmcquillan/Documents/Product/MADlib/Demos/data'
sys.path.append(madlib_site_dir)

# Import image loader module
from madlib_image_loader import ImageLoader, DbCredentials

In [8]:
# Specify database credentials, for connecting to db
db_creds = DbCredentials(user='gpadmin',
                         host='35.239.240.26',
                         port='5432',
                         password='')

In [9]:
# Initialize ImageLoader (increase num_workers to run faster)
iloader = ImageLoader(num_workers=5, db_creds=db_creds)

In [16]:
# Drop tables
%sql DROP TABLE IF EXISTS train_lt5, test_lt5, train_gte5, test_gte5

# Save images to temporary directories and load into database
iloader.load_np_array_to_table(x_train_lt5, y_train_lt5, 'train_lt5', append=False, img_names=None)
iloader.load_np_array_to_table(x_test_lt5, y_test_lt5, 'test_lt5', append=False, img_names=None)
iloader.load_np_array_to_table(x_train_gte5, y_train_gte5, 'train_gte5', append=False, img_names=None)
iloader.load_np_array_to_table(x_test_gte5, y_test_gte5, 'test_gte5', append=False, img_names=None)

Done.
Executing: CREATE TABLE train_lt5 (id SERIAL, x REAL[], y TEXT)
CREATE TABLE
Created table train_lt5 in madlib db
Spawning 5 workers...
Initializing PoolWorker-31 [pid 36504]
PoolWorker-31: Created temporary directory PoolWorker-31
Initializing PoolWorker-32 [pid 36505]
PoolWorker-32: Created temporary directory PoolWorker-32
Initializing PoolWorker-33 [pid 36506]
PoolWorker-33: Created temporary directory PoolWorker-33
Initializing PoolWorker-34 [pid 36507]
PoolWorker-34: Created temporary directory PoolWorker-34
Initializing PoolWorker-35 [pid 36508]
PoolWorker-35: Created temporary directory PoolWorker-35
PoolWorker-31: Connected to madlib db.
PoolWorker-32: Connected to madlib db.
PoolWorker-33: Connected to madlib db.
PoolWorker-34: Connected to madlib db.
PoolWorker-35: Connected to madlib db.
PoolWorker-32: Wrote 1000 images to /tmp/madlib_BaxJbXtAOV/train_lt50000.tmp
PoolWorker-31: Wrote 1000 images to /tmp/madlib_gijRoaCfJc/train_lt50000.tmp
PoolWorker-33: Wrote 1000 ima

PoolWorker-43: Loaded 1000 images into train_gte5
PoolWorker-44: Loaded 1000 images into train_gte5
PoolWorker-41: Loaded 1000 images into train_gte5
PoolWorker-43: Wrote 1000 images to /tmp/madlib_fHsDrEhTBh/train_gte50001.tmp
PoolWorker-44: Wrote 1000 images to /tmp/madlib_CU2CNf1Yxz/train_gte50001.tmp
PoolWorker-41: Wrote 1000 images to /tmp/madlib_4culpKfGHi/train_gte50001.tmp
PoolWorker-43: Loaded 1000 images into train_gte5
PoolWorker-42: Loaded 1000 images into train_gte5
PoolWorker-45: Loaded 1000 images into train_gte5
PoolWorker-43: Wrote 1000 images to /tmp/madlib_fHsDrEhTBh/train_gte50002.tmp
PoolWorker-41: Loaded 1000 images into train_gte5
PoolWorker-42: Wrote 1000 images to /tmp/madlib_V6Nqwp374Q/train_gte50001.tmp
PoolWorker-45: Wrote 1000 images to /tmp/madlib_1ZkWBU7pyz/train_gte50001.tmp
PoolWorker-41: Wrote 1000 images to /tmp/madlib_4culpKfGHi/train_gte50002.tmp
PoolWorker-43: Loaded 1000 images into train_gte5
PoolWorker-44: Loaded 1000 images into train_gte5
Pool

<a id="image_preproc"></a>
# 3. Call image preprocessor

The preprocessor packs the data from one image per row into multiple images per row for batch optimization. It also normalizes the pixel values and one-hot encodes the class labels.
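
To make the three operations concrete, here is a minimal pure-Python sketch of packing, normalizing and one-hot encoding. It is not MADlib's implementation: the function names `one_hot` and `pack` and the dictionary layout are assumptions chosen for illustration.

```python
def one_hot(label, class_values):
    """Return a 0/1 vector with a single 1 at the label's position."""
    return [1 if label == c else 0 for c in class_values]

def pack(rows, buffer_size, normalizing_const=255.0):
    """Group (pixels, label) rows into buffers of up to buffer_size images."""
    class_values = sorted({label for _, label in rows})
    buffers = []
    for start in range(0, len(rows), buffer_size):
        chunk = rows[start:start + buffer_size]
        buffers.append({
            # Scale each pixel into the range 0..1.
            "x": [[p / normalizing_const for p in pixels] for pixels, _ in chunk],
            # Replace each text label with its one-hot vector.
            "y": [one_hot(label, class_values) for _, label in chunk],
        })
    return buffers

# Three tiny two-pixel "images" with text labels, as in the y TEXT column.
rows = [([0, 255], "0"), ([51, 102], "1"), ([255, 0], "0")]
packed = pack(rows, buffer_size=2)

print(len(packed))      # 2
print(packed[0]["x"])   # [[0.0, 1.0], [0.2, 0.4]]
print(packed[0]["y"])   # [[1, 0], [0, 1]]
```

Three source rows become two packed rows, which is the "multiple images per row" layout the training function consumes.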

Training dataset < 5

In [11]:
%%sql
DROP TABLE IF EXISTS train_lt5_packed, train_lt5_packed_summary;

SELECT madlib.training_preprocessor_dl('train_lt5',               -- Source table
                                       'train_lt5_packed',        -- Output table
                                       'y',                       -- Dependent variable
                                       'x',                       -- Independent variable
                                        NULL,                     -- Buffer size
                                        255                       -- Normalizing constant
                                        );

SELECT * FROM train_lt5_packed_summary;

Done.
1 rows affected.
1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes
train_lt5,train_lt5_packed,y,x,text,"[u'0', u'1', u'2', u'3', u'4']",15298,255.0,5


Test dataset < 5

In [12]:
%%sql
DROP TABLE IF EXISTS test_lt5_packed, test_lt5_packed_summary;

SELECT madlib.validation_preprocessor_dl('test_lt5',                -- Source table
                                         'test_lt5_packed',         -- Output table
                                         'y',                       -- Dependent variable
                                         'x',                       -- Independent variable
                                         'train_lt5_packed'         -- Training preproc table
                                        );

SELECT * FROM test_lt5_packed_summary;

Done.
1 rows affected.
1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes
test_lt5,test_lt5_packed,y,x,text,"[u'0', u'1', u'2', u'3', u'4']",2570,255.0,5


Training dataset >= 5

In [8]:
%%sql
DROP TABLE IF EXISTS train_gte5_packed, train_gte5_packed_summary;

SELECT madlib.training_preprocessor_dl('train_gte5',              -- Source table
                                       'train_gte5_packed',       -- Output table
                                       'y',                       -- Dependent variable
                                       'x',                       -- Independent variable
                                        NULL,                     -- Buffer size
                                        255                       -- Normalizing constant
                                        );

SELECT * FROM train_gte5_packed_summary;

Done.
1 rows affected.
1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes
train_gte5,train_gte5_packed,y,x,text,"[u'0', u'1', u'2', u'3', u'4']",14702,255.0,5


Test dataset >= 5

In [17]:
%%sql
DROP TABLE IF EXISTS test_gte5_packed, test_gte5_packed_summary;

SELECT madlib.validation_preprocessor_dl('test_gte5',             -- Source table
                                         'test_gte5_packed',      -- Output table
                                         'y',                     -- Dependent variable
                                         'x',                     -- Independent variable
                                         'train_gte5_packed'      -- Training preproc table
                                        );

SELECT * FROM test_gte5_packed_summary;

Done.
1 rows affected.
1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes
test_gte5,test_gte5_packed,y,x,text,"[u'0', u'1', u'2', u'3', u'4']",2431,255.0,5
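
The buffer size was passed as NULL for the training tables, so the preprocessor chose the values shown in the summaries. The reported 2431 for test_gte5 is what you would get by spreading its 4861 images evenly over two buffers. The figure of two buffers is inferred from the output and is an assumption, not something the notebook states. A small arithmetic sketch:

```python
def rows_per_buffer(num_rows, num_buffers):
    """Rows per buffer when num_rows are spread as evenly as possible (integer ceiling)."""
    return -(-num_rows // num_buffers)

# 4861 is the test_gte5 image count printed earlier; 2 buffers is an assumption.
print(rows_per_buffer(4861, 2))  # 2431
```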


<a id="define_and_load_model"></a>
# 4. Define and load model architecture

Model with feature and classification layers trainable

In [19]:
# define two groups of layers: feature (convolutions) and classification (dense)
feature_layers = [
    Conv2D(filters, kernel_size,
           padding='valid',
           input_shape=input_shape),
    Activation('relu'),
    Conv2D(filters, kernel_size),
    Activation('relu'),
    MaxPooling2D(pool_size=pool_size),
    Dropout(0.25),
    Flatten(),
]

classification_layers = [
    Dense(128),
    Activation('relu'),
    Dropout(0.5),
    Dense(num_classes),
    Activation('softmax')
]

# create complete model
model = Sequential(feature_layers + classification_layers)

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
activation_1 (Activation)    (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 32)        9248      
_________________________________________________________________
activation_2 (Activation)    (None, 24, 24, 32)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 4608)              0         
__________
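
The shapes and parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the architecture. The sketch below uses the standard formulas for stride-1 'valid' convolutions and fully connected layers; the helper names are illustrative. The Dense counts are computed from the same formulas because that part of the summary is cut off in the output above.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square kernel."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit."""
    return (inputs + 1) * units

def valid_conv_size(size, kernel):
    """Output width/height of a stride-1 'valid' convolution."""
    return size - kernel + 1

side = valid_conv_size(28, 3)      # 26 after conv2d_1
side = valid_conv_size(side, 3)    # 24 after conv2d_2
side //= 2                         # 12 after 2x2 max pooling
flat = side * side * 32            # inputs to the first Dense layer

print(conv2d_params(3, 1, 32))     # 320
print(conv2d_params(3, 32, 32))    # 9248
print(flat)                        # 4608
print(dense_params(flat, 128))     # 589952
print(dense_params(128, 5))        # 645
```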

Load into model architecture table using psycopg2

In [20]:
import psycopg2 as p2
conn = p2.connect('postgresql://gpadmin@35.239.240.26:5432/madlib')
cur = conn.cursor()

%sql DROP TABLE IF EXISTS model_arch_library;
query = "SELECT madlib.load_keras_model('model_arch_library', %s, NULL, %s)"
cur.execute(query,[model.to_json(), "feature + classification layers trainable"])
conn.commit()

# check model loaded OK
%sql SELECT model_id, name FROM model_arch_library;

Done.
1 rows affected.


model_id,name
1,feature + classification layers trainable


Model with feature layers frozen

In [21]:
# freeze feature layers
for layer in feature_layers:
    layer.trainable = False

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 26, 26, 32)        320       
_________________________________________________________________
activation_1 (Activation)    (None, 26, 26, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 24, 24, 32)        9248      
_________________________________________________________________
activation_2 (Activation)    (None, 24, 24, 32)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 12, 12, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 12, 12, 32)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 4608)              0         
__________
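
Because the architecture is handed to the database as JSON, it can be worth confirming that the `trainable` flags are present in the serialized form before loading it. The following is a minimal sketch that assumes the JSON layout Keras 2.x produces, where each layer's `config` carries `name` and `trainable`. The function name `frozen_layer_names` and the sample document are hand-written for illustration and are not real `model.to_json()` output.

```python
import json

def frozen_layer_names(model_json):
    """Return the names of layers whose config has trainable == False."""
    config = json.loads(model_json)["config"]
    # Some Keras versions store a bare list of layers, others a dict with a "layers" key.
    layers = config["layers"] if isinstance(config, dict) else config
    return [layer["config"]["name"] for layer in layers
            if layer["config"].get("trainable") is False]

# Hand-written sample; real serialized models contain many more fields.
sample = json.dumps({
    "class_name": "Sequential",
    "config": {"layers": [
        {"class_name": "Conv2D", "config": {"name": "conv2d_1", "trainable": False}},
        {"class_name": "Dense", "config": {"name": "dense_1", "trainable": True}},
    ]},
})

print(frozen_layer_names(sample))  # ['conv2d_1']
```

In this notebook the same check would be `frozen_layer_names(model.to_json())`, which should list the seven feature layers after the freeze.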

Load the model with frozen feature layers into the same model architecture table using psycopg2

In [22]:
cur.execute(query,[model.to_json(), "only classification layers trainable"])
conn.commit()

# check model loaded OK
%sql SELECT model_id, name FROM model_arch_library ORDER BY model_id;

2 rows affected.


model_id,name
1,feature + classification layers trainable
2,only classification layers trainable


<a id="train"></a>
# 5.  Train
Train the model to classify the five digits 0 through 4.

In [22]:
%%sql
DROP TABLE IF EXISTS mnist_model, mnist_model_summary;

SELECT madlib.madlib_keras_fit('train_lt5_packed',    -- source table
                               'mnist_model',         -- model output table
                               'model_arch_library',  -- model arch table
                                1,                    -- model arch id
                                $$ loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy']$$,  -- compile_params
                                $$ batch_size=128, epochs=1 $$,  -- fit_params
                                5                     -- num_iterations
                              );

Done.
1 rows affected.


madlib_keras_fit


View the model summary:

In [23]:
%%sql
SELECT * FROM mnist_model_summary;

1 rows affected.


source_table,model,dependent_varname,independent_varname,model_arch_table,model_arch_id,compile_params,fit_params,num_iterations,validation_table,metrics_compute_frequency,name,description,model_type,model_size,start_training_time,end_training_time,metrics_elapsed_time,madlib_version,num_classes,class_values,dependent_vartype,normalizing_const,metrics_type,training_metrics_final,training_loss_final,training_metrics,training_loss,validation_metrics_final,validation_loss_final,validation_metrics,validation_loss,metrics_iters
train_lt5_packed,mnist_model,y,x,model_arch_library,1,"loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy']","batch_size=128, epochs=1",5,,5,,,madlib_keras,2344.43066406,2019-06-24 19:08:31.328530,2019-06-24 19:13:50.944601,[319.616029977798],1.16-dev,5,"[u'0', u'1', u'2', u'3', u'4']",text,255.0,[u'accuracy'],0.996045231819,0.0139331035316,[0.996045231819153],[0.013933103531599],,,,,[5]


Evaluate using test data

In [24]:
%%sql
DROP TABLE IF EXISTS mnist_validate;

SELECT madlib.madlib_keras_evaluate('mnist_model',      -- model
                                   'test_lt5_packed',   -- test table
                                   'mnist_validate'     -- output table
                                   );

SELECT * FROM mnist_validate;

Done.
1 rows affected.
1 rows affected.


loss,metric,metrics_type
0.00919340737164,0.997081160545,[u'accuracy']


<a id="transfer_learning"></a>
# 6. Transfer learning

Use UPDATE to copy the trained weights from the previous run (mnist_model) into the row for model 2 in the model architecture table:

In [25]:
%%sql
UPDATE model_arch_library SET model_weights = model_data FROM mnist_model WHERE model_id = 2;

1 rows affected.


[]

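Transfer: train only the dense layers on the new classification task, digits 5 through 9. The frozen feature layers keep the weights learned on digits 0 through 4.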

In [26]:
%%sql
DROP TABLE IF EXISTS mnist_transfer_model, mnist_transfer_model_summary;

SELECT madlib.madlib_keras_fit('train_gte5_packed',   -- source table
                               'mnist_transfer_model',-- model output table
                               'model_arch_library',  -- model arch table
                                2,                    -- model arch id
                                $$ loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy']$$,  -- compile_params
                                $$ batch_size=128, epochs=1 $$,  -- fit_params
                                5                     -- num_iterations
                              );

Done.
1 rows affected.


madlib_keras_fit


View the model summary:

In [27]:
%%sql
SELECT * FROM mnist_transfer_model_summary;

1 rows affected.


source_table,model,dependent_varname,independent_varname,model_arch_table,model_arch_id,compile_params,fit_params,num_iterations,validation_table,metrics_compute_frequency,name,description,model_type,model_size,start_training_time,end_training_time,metrics_elapsed_time,madlib_version,num_classes,class_values,dependent_vartype,normalizing_const,metrics_type,training_metrics_final,training_loss_final,training_metrics,training_loss,validation_metrics_final,validation_loss_final,validation_metrics,validation_loss,metrics_iters
train_gte5_packed,mnist_transfer_model,y,x,model_arch_library,2,"loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy']","batch_size=128, epochs=1",5,,5,,,madlib_keras,2344.43066406,2019-06-24 19:16:55.336042,2019-06-24 19:19:53.589704,[178.253571987152],1.16-dev,5,"[u'0', u'1', u'2', u'3', u'4']",text,255.0,[u'accuracy'],0.991429746151,0.0280887652189,[0.99142974615097],[0.028088765218854],,,,,[5]
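
The two summary tables also let you compare wall-clock training time with and without frozen feature layers. The sketch below parses the start and end timestamps copied from the summaries above; the helper name `elapsed_seconds` is illustrative. This is a single run on one cluster with slightly different training set sizes, so treat the ratio as indicative rather than as a benchmark.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

def elapsed_seconds(start, end):
    """Seconds between two timestamps formatted like those in the summary tables."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()

# Timestamps copied from mnist_model_summary and mnist_transfer_model_summary.
full = elapsed_seconds("2019-06-24 19:08:31.328530", "2019-06-24 19:13:50.944601")
transfer = elapsed_seconds("2019-06-24 19:16:55.336042", "2019-06-24 19:19:53.589704")

print(round(full, 1), round(transfer, 1))  # 319.6 178.3
print(round(transfer / full, 2))           # 0.56
```

Both values agree with the metrics_elapsed_time column to within a fraction of a millisecond.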


Evaluate using test data

In [30]:
%%sql
DROP TABLE IF EXISTS mnist_transfer_validate;

SELECT madlib.madlib_keras_evaluate('mnist_transfer_model',      -- model
                                   'test_gte5_packed',           -- test table
                                   'mnist_transfer_validate'     -- output table
                                   );

SELECT * FROM mnist_transfer_validate;

Done.
1 rows affected.
1 rows affected.


loss,metric,metrics_type
0.0312170274556,0.989714026451,[u'accuracy']
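
To put the two accuracies in perspective, they can be converted into approximate counts of misclassified test images. This is a minimal sketch: the helper name `misclassified` is illustrative, the count 4861 comes from the shape printed when the data was split, and the other test set size is taken as the remainder of the 10,000 MNIST test images.

```python
def misclassified(accuracy, num_images):
    """Approximate number of wrong predictions implied by an accuracy."""
    return int(round((1.0 - accuracy) * num_images))

num_test_gte5 = 4861                   # shape printed when the data was split
num_test_lt5 = 10000 - num_test_gte5   # remainder of the 10,000 test images

print(misclassified(0.997081160545, num_test_lt5))   # 15
print(misclassified(0.989714026451, num_test_gte5))  # 50
```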
