# Preprocessor for image data
This is a mini-batch preprocessor utility for image data:
* training_preprocessor_dl() for training datasets
* validation_preprocessor_dl() for validation datasets

Note that there is a separate mini-batch preprocessor utility for general use cases:
http://madlib.apache.org/docs/latest/group__grp__minibatch__preprocessing.html

The preprocessor for image data was added in MADlib 1.16.

## Table of contents

<a href="#load_data">1. Load data</a>

<a href="#pp_train">2. Run preprocessor for training image data</a>

<a href="#pp_val">3. Run preprocessor for validation image data</a>

<a href="#load_data2">4. Load data, another format</a>

<a href="#pp_train2">5. Run preprocessor for training image data</a>

<a href="#pp_val2">6. Run preprocessor for validation image data</a>

<a href="#change_buffer">7. Change buffer size</a>

<a href="#set_num_classes">8. Setting number of classes</a>

<a href="#distr_rules">9. Using distribution rules</a>

In [1]:
%load_ext sql



In [2]:
# Greenplum Database 5.x on GCP (PM demo machine) - direct external IP access
#%sql postgresql://gpadmin@34.67.65.96:5432/madlib

# Greenplum Database 5.x on GCP - via tunnel
%sql postgresql://gpadmin@localhost:8000/madlib
        
# PostgreSQL local
#%sql postgresql://fmcquillan@localhost:5432/madlib

u'Connected: gpadmin@madlib'

In [3]:
%sql select madlib.version();
#%sql select version();

1 rows affected.


version
"MADlib version: 1.17-dev, git revision: rel/v1.16-54-gec5614f, cmake configuration time: Wed Dec 18 17:08:05 UTC 2019, build type: release, build system: Linux-3.10.0-1062.4.3.el7.x86_64, C compiler: gcc 4.8.5, C++ compiler: g++ 4.8.5"


<a id="load_data"></a>
# 1. Load data

Create an artificial 2x2 resolution color image data set with 3 possible classifications.  The RGB values are per-pixel arrays:

In [4]:
%%sql
DROP TABLE IF EXISTS image_data;

CREATE TABLE image_data AS (
    SELECT ARRAY[
        ARRAY[
            ARRAY[(random() * 256)::integer, -- pixel (1,1)
                (random() * 256)::integer,
                (random() * 256)::integer],
            ARRAY[(random() * 256)::integer, -- pixel (2,1)
                (random() * 256)::integer,
                (random() * 256)::integer]
        ],
        ARRAY[
            ARRAY[(random() * 256)::integer, -- pixel (1,2)
                (random() * 256)::integer,
                (random() * 256)::integer],
            ARRAY[(random() * 256)::integer, -- pixel (2,2)
                (random() * 256)::integer,
                (random() * 256)::integer]
        ]
    ] as rgb, ('{cat,dog,bird}'::text[])[ceil(random()*3)] as species
    FROM generate_series(1, 52)
);

SELECT * FROM image_data;

Done.
52 rows affected.
52 rows affected.


rgb,species
"[[[152, 186, 35], [102, 145, 138]], [[40, 249, 108], [175, 207, 70]]]",cat
"[[[234, 110, 251], [147, 18, 158]], [[55, 79, 14], [140, 50, 143]]]",cat
"[[[179, 202, 20], [219, 198, 173]], [[149, 233, 18], [38, 115, 59]]]",cat
"[[[223, 234, 239], [37, 253, 217]], [[147, 248, 108], [166, 150, 162]]]",bird
"[[[164, 46, 39], [51, 130, 218]], [[253, 150, 181], [195, 66, 75]]]",bird
"[[[85, 113, 32], [144, 145, 255]], [[122, 127, 36], [118, 88, 183]]]",dog
"[[[195, 93, 4], [102, 81, 168]], [[148, 120, 219], [21, 82, 217]]]",bird
"[[[8, 156, 237], [82, 72, 66]], [[196, 104, 210], [84, 103, 75]]]",bird
"[[[139, 194, 43], [66, 48, 239]], [[159, 52, 84], [240, 220, 232]]]",dog
"[[[183, 253, 187], [144, 168, 194]], [[44, 150, 21], [116, 216, 216]]]",bird


<a id="pp_train"></a>
# 2.  Run preprocessor for training image data

Run the preprocessor to generate the packed output table:

In [5]:
%%sql
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;

SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                        'image_data_packed',  -- Output table
                                        'species',            -- Dependent variable
                                        'rgb',                -- Independent variable
                                        NULL,                 -- Buffer size
                                        255                   -- Normalizing constant
                                        );

SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
2 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[26, 2, 2, 3]","[26, 3]",0
"[26, 2, 2, 3]","[26, 3]",1


For small data sets like the one in this example, buffer size is determined mainly by the number of segments in the database. For a Greenplum database with 2 segments, there will be 2 rows with a buffer size of 26. For PostgreSQL, there would be only one row with a buffer size of 52, since it is a single-node database. For larger data sets, factors other than the number of segments also go into computing the buffer size.
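
The shapes reported above can be reproduced with a small sketch. This is only an illustration in plain Python, not MADlib's implementation: `shape_of` and `pack_rows` are hypothetical helpers, and the sketch assumes the rows are simply split into equally sized consecutive groups, one per segment.

```python
def shape_of(nested):
    """Return the dimensions of a regular, non-empty nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return dims


def pack_rows(rows, buffer_size):
    """Group consecutive rows into buffers of at most buffer_size rows."""
    return [rows[i:i + buffer_size] for i in range(0, len(rows), buffer_size)]


# 52 placeholder images, each 2 x 2 pixels with 3 color channels.
image = [[[0, 0, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]]]
images = [image for _ in range(52)]

num_segments = 2                            # assumed 2-segment Greenplum cluster
buffer_size = len(images) // num_segments   # 26 rows per buffer

buffers = pack_rows(images, buffer_size)
shapes = [shape_of(b) for b in buffers]
print(shapes)  # [[26, 2, 2, 3], [26, 2, 2, 3]]
```

The leading dimension of each shape is the number of images in the buffer; the remaining dimensions are those of a single image.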

Review the output summary table:

In [6]:
%%sql
SELECT * FROM image_data_packed_summary;

1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes,distribution_rules,__internal_gpu_config__
image_data,image_data_packed,species,rgb,text,"[u'bird', u'cat', u'dog']",26,255.0,3,all_segments,all_segments
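
The `class_values` and `normalizing_const` columns record what is needed to interpret the packed data. As a rough sketch of what they mean (an assumption based on the summary columns, not MADlib's code), each pixel value is divided by the normalizing constant, and each label becomes a one-hot vector whose positions are assumed to follow the order of `class_values`. The function names are hypothetical.

```python
class_values = ['bird', 'cat', 'dog']   # from the summary table
normalizing_const = 255.0               # from the summary table


def normalize(pixels, const):
    """Divide every value in a flat list of pixel values by const."""
    return [p / const for p in pixels]


def one_hot(label, classes):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


print(normalize([0, 51, 255], normalizing_const))  # [0.0, 0.2, 1.0]
print(one_hot('cat', class_values))                # [0, 1, 0]
```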


<a id="pp_val"></a>
# 3.  Run preprocessor for validation image data

Run the preprocessor for the validation dataset. In this example, we reuse the training images for validation to demonstrate the call, but normally validation data is different from the training data:

In [7]:
%%sql
DROP TABLE IF EXISTS val_image_data_packed, val_image_data_packed_summary;
SELECT madlib.validation_preprocessor_dl(
      'image_data',             -- Source table
      'val_image_data_packed',  -- Output table
      'species',                -- Dependent variable
      'rgb',                    -- Independent variable
      'image_data_packed',      -- From training preprocessor step
      NULL                      -- Buffer size
      ); 
SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
2 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[26, 2, 2, 3]","[26, 3]",0
"[26, 2, 2, 3]","[26, 3]",1


Review the output summary table:

In [8]:
%%sql
SELECT * FROM val_image_data_packed_summary;

1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes,distribution_rules,__internal_gpu_config__
image_data,val_image_data_packed,species,rgb,text,"[u'bird', u'cat', u'dog']",26,255.0,3,all_segments,all_segments


<a id="load_data2"></a>
# 4. Load data, another format
Create an artificial 2x2 resolution color image data set with 3 possible classifications.  The RGB values are unrolled into a flat array:

In [9]:
%%sql
DROP TABLE IF EXISTS image_data;

CREATE TABLE image_data AS (
SELECT ARRAY[
        (random() * 256)::integer, -- R values
        (random() * 256)::integer,
        (random() * 256)::integer,
        (random() * 256)::integer,
        (random() * 256)::integer, -- G values
        (random() * 256)::integer,
        (random() * 256)::integer,
        (random() * 256)::integer,
        (random() * 256)::integer, -- B values
        (random() * 256)::integer,
        (random() * 256)::integer,
        (random() * 256)::integer
    ] as rgb, ('{cat,dog,bird}'::text[])[ceil(random()*3)] as species
FROM generate_series(1, 52)
);

SELECT * FROM image_data;

Done.
52 rows affected.
52 rows affected.


rgb,species
"[19, 126, 250, 219, 119, 255, 86, 152, 200, 36, 57, 188]",cat
"[49, 201, 114, 38, 201, 8, 101, 172, 88, 233, 82, 78]",dog
"[203, 196, 132, 57, 220, 151, 183, 214, 113, 46, 213, 200]",bird
"[157, 236, 255, 90, 38, 48, 35, 152, 86, 236, 160, 187]",dog
"[248, 164, 234, 70, 61, 181, 10, 193, 238, 229, 88, 165]",bird
"[201, 210, 145, 145, 152, 46, 125, 151, 135, 163, 199, 170]",cat
"[29, 150, 219, 216, 46, 211, 124, 24, 25, 186, 205, 35]",dog
"[187, 8, 211, 95, 196, 156, 50, 84, 45, 202, 130, 170]",dog
"[9, 77, 40, 179, 136, 69, 74, 98, 29, 120, 53, 153]",dog
"[78, 83, 93, 113, 206, 23, 121, 160, 119, 61, 60, 168]",dog

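To relate the two input formats, the sketch below unrolls a per-pixel 2x2 image into a 12-element flat list grouped by channel, following the R/G/B grouping indicated by the comments in the cell above. It is illustrative only: `unroll_by_channel` is a hypothetical helper, and this notebook does not state that the preprocessor expects any particular ordering.

```python
def unroll_by_channel(image):
    """Flatten image[row][col][channel] into all R, then all G, then all B."""
    num_channels = len(image[0][0])
    flat = []
    for ch in range(num_channels):
        for row in image:
            for pixel in row:
                flat.append(pixel[ch])
    return flat


# First row of the per-pixel output shown in section 1.
image = [[[152, 186, 35], [102, 145, 138]],
         [[40, 249, 108], [175, 207, 70]]]

flat = unroll_by_channel(image)
print(flat)       # [152, 102, 40, 175, 186, 145, 249, 207, 35, 138, 108, 70]
print(len(flat))  # 12
```
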

<a id="pp_train2"></a>
# 5.  Run preprocessor for training image data

Run the preprocessor to generate the packed output table:

In [10]:
%%sql
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;

SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                        'image_data_packed',  -- Output table
                                        'species',            -- Dependent variable
                                        'rgb',                -- Independent variable
                                        NULL,                 -- Buffer size
                                        255                   -- Normalizing constant
                                        );

SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
2 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[26, 12]","[26, 3]",0
"[26, 12]","[26, 3]",1


<a id="pp_val2"></a>
# 6.  Run preprocessor for validation image data

Run the preprocessor for the validation dataset. In this example, we reuse the training images for validation to demonstrate the call, but normally validation data is different from the training data:

In [11]:
%%sql
DROP TABLE IF EXISTS val_image_data_packed, val_image_data_packed_summary;

SELECT madlib.validation_preprocessor_dl(
    'image_data',             -- Source table
    'val_image_data_packed',  -- Output table
    'species',                -- Dependent variable
    'rgb',                    -- Independent variable
    'image_data_packed',      -- From training preprocessor step
    NULL                      -- Buffer size
    );

SELECT independent_var_shape, dependent_var_shape, buffer_id FROM val_image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
2 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[26, 12]","[26, 3]",0
"[26, 12]","[26, 3]",1


<a id="change_buffer"></a>
# 7.  Change buffer size 

Generally the default buffer size will work well, but if you need to change it, pass the desired value as the buffer size argument:

In [12]:
%%sql
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;

SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                       'image_data_packed',  -- Output table
                                       'species',            -- Dependent variable
                                       'rgb',                -- Independent variable
                                       10,                   -- Buffer size
                                       255                   -- Normalizing constant
                                       );

SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
6 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[8, 12]","[8, 3]",0
"[9, 12]","[9, 3]",1
"[9, 12]","[9, 3]",2
"[9, 12]","[9, 3]",3
"[9, 12]","[9, 3]",4
"[8, 12]","[8, 3]",5

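The six rows above are consistent with splitting the 52 images into ceil(52 / 10) = 6 buffers of nearly equal size, rather than five buffers of 10 plus one of 2. The sketch below reproduces the sizes under that assumption. It is inferred from the output shown here, not taken from MADlib's source, and the order in which the sizes appear may differ; `balanced_buffer_sizes` is a hypothetical helper.

```python
import math


def balanced_buffer_sizes(num_rows, buffer_size):
    """Split num_rows into ceil(num_rows / buffer_size) near-equal buffers."""
    num_buffers = int(math.ceil(num_rows / float(buffer_size)))
    base, extra = divmod(num_rows, num_buffers)
    # The first `extra` buffers each take one additional row.
    return [base + 1 if i < extra else base for i in range(num_buffers)]


sizes = balanced_buffer_sizes(52, 10)
print(len(sizes))     # 6
print(sorted(sizes))  # [8, 8, 9, 9, 9, 9]
print(sum(sizes))     # 52
```
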

Review the output summary table:

In [13]:
%%sql
SELECT * FROM image_data_packed_summary;

1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes,distribution_rules,__internal_gpu_config__
image_data,image_data_packed,species,rgb,text,"[u'bird', u'cat', u'dog']",10,255.0,3,all_segments,all_segments


<a id="set_num_classes"></a>
# 8. Setting number of classes

If you want the one-hot encoded vector to have more classes than are present in the data, use the 'num_classes' parameter, which pads the one-hot vector:

In [14]:
%%sql
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;

SELECT madlib.training_preprocessor_dl('image_data',         -- Source table
                                        'image_data_packed',  -- Output table
                                        'species',            -- Dependent variable
                                        'rgb',                -- Independent variable
                                        NULL,                 -- Buffer size
                                        255,                  -- Normalizing constant
                                        5                     -- Number of desired class values
                                        );

SELECT independent_var_shape, dependent_var_shape, buffer_id FROM image_data_packed ORDER BY buffer_id;

Done.
1 rows affected.
2 rows affected.


independent_var_shape,dependent_var_shape,buffer_id
"[26, 12]","[26, 5]",0
"[26, 12]","[26, 5]",1

Review the output summary table:

In [15]:
%%sql
SELECT * FROM image_data_packed_summary;

1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes,distribution_rules,__internal_gpu_config__
image_data,image_data_packed,species,rgb,text,"[u'bird', u'cat', u'dog', None, None]",26,255.0,5,all_segments,all_segments
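
The padded `class_values` and the `[26, 5]` dependent variable shape can be pictured as follows. This sketch assumes the unused positions are appended at the end, as the `None` entries in the summary output suggest, and that they stay 0 for every row; `pad_classes` and `one_hot` are hypothetical helpers, not MADlib functions.

```python
def pad_classes(classes, num_classes):
    """Append None placeholders until there are num_classes entries."""
    return list(classes) + [None] * (num_classes - len(classes))


def one_hot(label, classes):
    """Return a list with 1 at the label's position and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]


padded = pad_classes(['bird', 'cat', 'dog'], 5)
print(padded)                  # ['bird', 'cat', 'dog', None, None]
print(one_hot('dog', padded))  # [0, 0, 1, 0, 0]
```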


<a id="distr_rules"></a>
# 9. Using distribution rules
The distribution rules parameter specifies how to distribute the 'output_table' across segments. This matters because it determines how the fit function will use resources on the cluster.

To distribute to all segments on hosts with GPUs attached:

In [None]:
%%sql
DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;

SELECT madlib.training_preprocessor_dl('image_data',          -- Source table
                                        'image_data_packed',  -- Output table
                                        'species',            -- Dependent variable
                                        'rgb',                -- Independent variable
                                        NULL,                 -- Buffer size
                                        255,                  -- Normalizing constant
                                        NULL,                 -- Number of classes
                                        'gpu_segments'        -- Distribution rules
                                        );
SELECT * FROM image_data_packed_summary;

To distribute to only specified segments, create a distribution table with a column called 'dbid' that lists the segments you want:

In [17]:
%%sql
DROP TABLE IF EXISTS segments_to_use;
CREATE TABLE segments_to_use(
    dbid INTEGER,
    hostname TEXT
);
INSERT INTO segments_to_use VALUES
(2, 'hostname-01'),
(3, 'hostname-01');

DROP TABLE IF EXISTS image_data_packed, image_data_packed_summary;
SELECT madlib.training_preprocessor_dl('image_data',          -- Source table
                                        'image_data_packed',  -- Output table
                                        'species',            -- Dependent variable
                                        'rgb',                -- Independent variable
                                        NULL,                 -- Buffer size
                                        255,                  -- Normalizing constant
                                        NULL,                 -- Number of classes
                                        'segments_to_use'     -- Distribution rules
                                        );
SELECT * FROM image_data_packed_summary;

Done.
Done.
2 rows affected.
Done.
1 rows affected.
1 rows affected.


source_table,output_table,dependent_varname,independent_varname,dependent_vartype,class_values,buffer_size,normalizing_const,num_classes,distribution_rules,__internal_gpu_config__
image_data,image_data_packed,species,rgb,text,"[u'bird', u'cat', u'dog']",26,255.0,3,"[2, 3]","[0, 1]"
