Feature: Vector-Column Transformations #291

Closed
wants to merge 7 commits into from

ArvindSridhar (Contributor) commented Jul 11, 2018

JIRA: MADLIB-1240

This commit adds a new SQL function called vec2cols and refactors the
current function cols2vec, providing greater integration between the two
modules. We now have a single Python file with separate classes for each
feature. We also have unified unit-tests and dev-check/install-check
tests.

The vec2cols function enables users to split up a single column into
multiple columns, given that the input column contains array entries.
For example, if the input column contained ARRAY[1, 2, 3] in one of its
rows, the output table will contain 3 different columns, one for each
element of the array.
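
To make the splitting behavior concrete, here is a minimal pure-Python sketch of the idea. It is not the MADlib implementation, which works in SQL; `split_vector_column`, the dict-based rows, and the default `f1..fn` names are assumptions made for illustration.

```python
def split_vector_column(rows, vector_col, feature_names=None):
    """Split the list stored under `vector_col` into one key per element.

    Hypothetical helper for illustration only; MADlib does this in SQL.
    """
    result = []
    for row in rows:
        vector = row[vector_col]
        # Default names f1, f2, ... are an assumption of this sketch.
        names = feature_names or ["f%d" % (i + 1) for i in range(len(vector))]
        if len(names) != len(vector):
            raise ValueError("need exactly one feature name per array element")
        # Keep every other column, then append the split elements.
        new_row = {k: v for k, v in row.items() if k != vector_col}
        new_row.update(zip(names, vector))
        result.append(new_row)
    return result


rows = [{"id": 1, "vec": [1, 2, 3]}]
print(split_vector_column(rows, "vec"))
```

Each array element becomes its own key, so a row holding a 3-element array yields 3 new columns.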

asfgit commented Jul 11, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/545/

iyerr3 (Contributor) left a comment

Functionality on its own LGTM. There might be overlap of code between this and #288. It would help to have a unified code structure for the two functions.

import plpy_mock as plpy

m4_changequote(`<!', `!>')
class Vec2ColsTestCase(unittest.TestCase):
Contributor

There is another Vec2ColsTestCase in test_vec2cols.py_in. Is this one redundant?

Contributor Author

Great catch haha, this was a typo. Updating the other PR now.

@@ -243,6 +243,9 @@ class UtilitiesTestCase(unittest.TestCase):
self.assertFalse(s.is_valid_psql_type('boolean[]', s.INCLUDE_ARRAY | s.ONLY_ARRAY))
self.assertFalse(s.is_valid_psql_type('boolean', s.ONLY_ARRAY))
self.assertFalse(s.is_valid_psql_type('boolean[]', s.ONLY_ARRAY))
self.assertTrue(s.is_valid_psql_type('boolean[]', s.ANY_ARRAY))
Contributor

Can we move this and corresponding code changes to another commit and PR?

Contributor Author

Done, should be on PRs 292 and 293

@param: feature_names, list. Python list of the feature names to
use for the split elements in the vector_col array
"""
query = """
Contributor

I'm assuming this was meant to use the is_col_1d_array function?

@fmcquillan99

user docs seem incomplete

asfgit commented Jul 12, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/551/

asfgit commented Jul 13, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/556/

asfgit commented Jul 14, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/560/

asfgit commented Jul 17, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/566/

asfgit commented Jul 17, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/567/

asfgit commented Jul 17, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/568/

asfgit commented Jul 18, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/569/

asfgit commented Jul 18, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/570/

asfgit commented Jul 18, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/571/

iyerr3 (Contributor) commented Jul 18, 2018

Please note that cols2vec.py_in failed the RAT check due to a missing license header.

@ArvindSridhar
Contributor Author

PR is ready to be reviewed once again

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/574/

@fmcquillan99

In cols2vec,

For this table:

CREATE TABLE golf (
    id integer NOT NULL,
    "OUTLOOK" text,
    temperature double precision,
    humidity double precision,
    "Temp_Humidity" double precision[],
    clouds_airquality text[],
    windy boolean,
    class text,
    observation_weight double precision
);

this fails:

SELECT madlib.cols2vec(
    'golf',
    'cols2vec_result',
    'id, temperature'
);

because id is INT and temperature is FLOAT.

It forces the user to do:

SELECT madlib.cols2vec(
    'golf',
    'cols2vec_result',
    'id::FLOAT, temperature'
);

but this is inconvenient especially if you have a big
table and are using '*' to get all columns into the feature
vector and they are a mix of numeric types.

Also a mix of VARCHAR and TEXT fails in a similar way
but should not.

Please use PostgreSQL precedence rules to fix this.
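
One way to picture the request is a helper that picks the widest numeric type and casts the other columns to it. The sketch below is illustrative only: `NUMERIC_ORDER`, `common_numeric_type`, and `cast_columns` are hypothetical names, and the ordering is a simplified assumption, not PostgreSQL's full type-resolution algorithm.

```python
# Assumed widening order for this sketch (narrowest to widest); it is a
# simplification, not PostgreSQL's full type-resolution algorithm.
NUMERIC_ORDER = ["smallint", "integer", "bigint", "numeric", "real",
                 "double precision"]


def common_numeric_type(col_types):
    """Return the widest type among `col_types` according to NUMERIC_ORDER."""
    unknown = [t for t in col_types if t not in NUMERIC_ORDER]
    if unknown:
        raise ValueError("not a numeric type in this sketch: %s" % unknown)
    return max(col_types, key=NUMERIC_ORDER.index)


def cast_columns(cols_and_types):
    """Build a column list that casts every column to the common type."""
    target = common_numeric_type([t for _, t in cols_and_types])
    return ", ".join(
        col if col_type == target else "%s::%s" % (col, target.upper())
        for col, col_type in cols_and_types
    )


print(cast_columns([("id", "integer"), ("temperature", "double precision")]))
```

For the golf table this produces the same kind of column list the user currently has to write by hand, with the cast applied to `id` only.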

fmcquillan99 commented Jul 19, 2018

In vec2cols,

SELECT madlib.vec2cols(
    'golf',                           -- source table
    'vec2cols_result',                -- output table
    'clouds_airquality',              -- column with array entries to split
    ARRAY['clouds', 'air_quality'],   -- feature names
    '"OUTLOOK", id'                   -- columns to keep from source table
);

results in

 clouds | air_quality | OUTLOOK  | id
--------+-------------+----------+----
 none   | unhealthy   | sunny    |  1
 none   | moderate    | sunny    |  2
 low    | moderate    | overcast |  3
 low    | moderate    | rain     |  4
 medium | good        | rain     |  5
 low    | unhealthy   | rain     |  6
 medium | moderate    | overcast |  7
 high   | unhealthy   | sunny    |  8
 high   | good        | sunny    |  9
 medium | good        | rain     | 10
 none   | good        | sunny    | 11
 medium | moderate    | overcast | 12
 medium | moderate    | overcast | 13
 low    | unhealthy   | rain     | 14
(14 rows)

but the split columns clouds and air_quality should
be on the right side of the table. This will make
it consistent with the way cols2vec works, where
the new stuff is put on the right side of the table.
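
The requested ordering can be sketched as a query builder that lists the kept columns first and the split columns last. This is a hedged illustration, not the code in this PR: `build_vec2cols_select` is a hypothetical name, and the sketch does no identifier quoting or validation.

```python
def build_vec2cols_select(source_table, vector_col, feature_names, cols_to_keep):
    """Return a SELECT that lists kept columns first and split columns last.

    Hypothetical helper; it performs no identifier quoting or validation.
    """
    # PostgreSQL arrays are 1-indexed by default, hence i + 1.
    split_exprs = [
        "{0}[{1}] AS {2}".format(vector_col, i + 1, name)
        for i, name in enumerate(feature_names)
    ]
    select_list = list(cols_to_keep) + split_exprs
    return "SELECT {0} FROM {1}".format(", ".join(select_list), source_table)


print(build_vec2cols_select(
    "golf", "clouds_airquality", ["clouds", "air_quality"], ['"OUTLOOK"', "id"]))
```

Putting `cols_to_keep` ahead of the split expressions in the select list is what places the new columns on the right side of the output.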

@fmcquillan99

After the above 2 issues I mentioned are fixed, I will have 1 more commit on user docs to this PR

asfgit commented Jul 19, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/582/

"""
input_tbl_valid(source_table, self.module_name)
output_tbl_valid(output_table, self.module_name)
# cols_to_validate = self.get_cols_helper.get_cols_as_list(cols_to_output) + [vector_col]
Contributor

Guess we can remove this commented line.

cols_to_keep = ', '.join(self.get_cols_helper.get_cols_as_list(cols_to_output,
source_table)) + ", " if cols_to_output else ''

# TODO why don't we call quote_literal here but call it later for feature_cols
Contributor

Could we have a comment explaining this?


output_table_summary = add_postfix(output_table, "_summary")
# Dollar-quote the text to allow single-quotes without escaping
# TODO explain why it's called _outer_
Contributor

Is this still valid?

asfgit commented Jul 20, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/583/

@ArvindSridhar
Contributor Author

All PR comments addressed, commits have been squashed and pushed to this branch, ready to finalize PR

asfgit commented Jul 20, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/584/

asfgit commented Jul 25, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/603/

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/605/

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/606/

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/607/

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/608/

asfgit commented Jul 26, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/609/

distinct_types = set([col_type[1] for col_type in all_cols_and_types
if col_type[0] in self.features_to_nest])
for expr_type in distinct_types:
_assert(not is_valid_psql_type(expr_type, ANY_ARRAY),
Contributor

This is more cleanly written as

_assert(not any(is_valid_psql_type(expr_type, ANY_ARRAY) for expr_type in distinct_types), ...
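
A self-contained sketch of the suggested pattern follows. `is_array_type` is a stand-in for `is_valid_psql_type(expr_type, ANY_ARRAY)` and is assumed here to mean only that the type name ends with `[]`; `check_no_arrays` is likewise a hypothetical wrapper, and it raises `ValueError` instead of calling `_assert`.

```python
def is_array_type(type_name):
    """Stand-in for is_valid_psql_type(type_name, ANY_ARRAY); assumed here
    to mean simply that the type name ends with '[]'."""
    return type_name.endswith("[]")


def check_no_arrays(distinct_types, module_name="cols2vec"):
    # One check over the whole set instead of asserting inside a loop.
    if any(is_array_type(t) for t in distinct_types):
        raise ValueError(
            "{0}: Feature columns to nest cannot be of type array".format(
                module_name))


check_no_arrays({"integer", "text"})  # passes silently
try:
    check_no_arrays({"integer", "double precision[]"})
except ValueError as err:
    print(err)
```

Because `any()` takes a generator, the check stops at the first array type it finds and the condition is expressed in a single statement.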

@@ -513,11 +513,12 @@ def array_col_has_same_dimension(tbl, col):
# ------------------------------------------------------------------------


-def explicit_bool_to_text(tbl, cols, schema_madlib):
+def explicit_bool_to_text(tbl, cols, schema_madlib, is_forced=False):
Contributor

I'm not clear on the need for is_forced. Are there platforms that have a bool-to-text cast but still require this patching?

Contributor

w/ @ArvindSridhar: On platforms that have bool-to-text casting (gpdb5, pg 9.6, pg 10), we still need this to make sure we can create an array of bool and text types.

iyerr3 (Contributor), Jul 27, 2018

We don't need this function, though - we might need ::TEXT. But I believe we decided that we'll let the platform fail if the array cannot be built by it. Wouldn't this build a successful array when the platform is failing?

Contributor Author

That's a good point. We added this arg to make our dev-check pass, because our dev-check didn't have an explicit ::TEXT cast and yet it was running on PG10, which supports bool-to-text casting. Would removing the dev-check test and the is_forced arg be the better way to go?

@@ -221,11 +221,11 @@ def is_psql_numeric_type(arg, exclude=None):
Returns:
Boolean. Returns if 'arg' is one of the numeric types
"""
numeric_types = set(['smallint', 'integer', 'bigint', 'decimal', 'numeric',
'real', 'double precision', 'serial', 'bigserial'])
# numeric_types = set(['smallint', 'integer', 'bigint', 'decimal', 'numeric',
Contributor

These lines (+ other similar lines below) can be deleted.

_assert(not is_valid_psql_type(expr_type, ANY_ARRAY),
"{0}: Feature columns to nest cannot be of type array"
.format(self.module_name))
if len(distinct_types) != 1 and 'boolean' in distinct_types:
Contributor

Please help me understand the need for len(distinct_types) != 1

Contributor

We want to cast boolean to text only if it is mixed with other types. Creating a vector of multiple boolean columns works without casting.

Contributor

Makes sense. I would suggest changing it to len(distinct_types) > 1 and adding what you said above as a comment.
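
The condition discussed in this thread can be sketched as follows. This only illustrates the `len(distinct_types) > 1` check; `cast_booleans_if_mixed` and its list-of-pairs input are hypothetical, and it is not the code that was finally merged.

```python
def cast_booleans_if_mixed(cols_and_types):
    """Return column expressions, casting booleans to TEXT only when mixed.

    Hypothetical helper that takes (column_name, type_name) pairs.
    """
    distinct_types = set(col_type for _, col_type in cols_and_types)
    # A vector built only from boolean columns needs no cast; the cast is
    # applied only when booleans are mixed with at least one other type.
    needs_cast = len(distinct_types) > 1 and "boolean" in distinct_types
    return [
        "{0}::TEXT".format(col) if needs_cast and col_type == "boolean" else col
        for col, col_type in cols_and_types
    ]


print(cast_booleans_if_mixed([("windy", "boolean"), ("class", "text")]))
print(cast_booleans_if_mixed([("windy", "boolean"), ("rainy", "boolean")]))
```

The second call leaves both columns untouched, since an all-boolean vector can be built without any cast.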

asfgit commented Jul 27, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/614/

@ArvindSridhar
Contributor Author

Removed the problematic dev check test that required us to use forced bool to text conversion, should be good to go

asfgit commented Jul 30, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/617/

ArvindSridhar and others added 6 commits July 30, 2018 12:36
JIRA: MADLIB-1240

This commit adds a new SQL function called vec2cols and refactors the
current function cols2vec, providing greater integration between the two
modules. We now have a single Python file with separate classes for each
feature. We also have unified unit-tests and dev-check/install-check
tests.

The vec2cols function enables users to split up a single column into
multiple columns, given that the input column contains array entries.
For example, if the input column contained ARRAY[1, 2, 3] in one of its
rows, the output table will contain 3 different columns, one for each
element of the array.

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Removes type validation from the code and instead lets the underlying
database handle type exceptions and casting. Updated dev-check and
unit-tests accordingly.

Co-authored-by: Orhan Kislal <okislal@pivotal.io>

asfgit commented Jul 30, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/619/

@fmcquillan99

Where did we land on the boolean casting issue? Testing on Greenplum 5, I see:

(psycopg2.ProgrammingError) plpy.SPIError: ARRAY types boolean and text cannot be matched
CONTEXT:  Traceback (most recent call last):
  PL/Python function "cols2vec", line 23, in <module>
    return cols2vec_obj.cols2vec(**globals())
  PL/Python function "cols2vec", line 363, in cols2vec
PL/Python function "cols2vec"
 [SQL: "SELECT madlib.cols2vec(\n    'golf',\n    'cols2vec_result',\n    'windy, class'\n);"]
(psycopg2.ProgrammingError) plpy.SPIError: ARRAY types integer and boolean cannot be matched
CONTEXT:  Traceback (most recent call last):
  PL/Python function "cols2vec", line 23, in <module>
    return cols2vec_obj.cols2vec(**globals())
  PL/Python function "cols2vec", line 363, in cols2vec
PL/Python function "cols2vec"
 [SQL: "SELECT madlib.cols2vec(\n    'golf',\n    'cols2vec_result',\n    'temperature, windy'\n);"]

@fmcquillan99

Wondering about order for varchar and text casting.
For this data set:

DROP TABLE IF EXISTS golf CASCADE;

CREATE TABLE golf (
    id int,
    "OUTLOOK" varchar,
    temperature smallint,
    humidity real,
    "Temp_Humidity" double precision[],
    clouds_airquality text[],
    windy boolean,
    class text,
    observation_weight double precision
);

INSERT INTO golf VALUES
(1,'sunny', 85, 85, ARRAY[85, 85],ARRAY['none', 'unhealthy'], 'false','Don''t Play', 5.0),
(2, 'sunny', 80, 90, ARRAY[80, 90], ARRAY['none', 'moderate'], 'true', 'Don''t Play', 5.0),
(3, 'overcast', 83, 78, ARRAY[83, 78], ARRAY['low', 'moderate'], 'false', 'Play', 1.5),
(4, 'rain', 70, 96, ARRAY[70, 96], ARRAY['low', 'moderate'], 'false', 'Play', 1.0),
(5, 'rain', 68, 80, ARRAY[68, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
(6, 'rain', 65, 70, ARRAY[65, 70], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0),
(7, 'overcast', 64, 65, ARRAY[64, 65], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
(8, 'sunny', 72, 95, ARRAY[72, 95], ARRAY['high', 'unhealthy'], 'false', 'Don''t Play', 5.0),
(9, 'sunny', 69, 70, ARRAY[69, 70], ARRAY['high', 'good'], 'false', 'Play', 5.0),
(10, 'rain', 75, 80, ARRAY[75, 80], ARRAY['medium', 'good'], 'false', 'Play', 1.0),
(11, 'sunny', 75, 70, ARRAY[75, 70], ARRAY['none', 'good'], 'true', 'Play', 5.0),
(12, 'overcast', 72, 90, ARRAY[72, 90], ARRAY['medium', 'moderate'], 'true', 'Play', 1.5),
(13, 'overcast', 81, 75, ARRAY[81, 75], ARRAY['medium', 'moderate'], 'false', 'Play', 1.5),
(14, 'rain', 71, 80, ARRAY[71, 80], ARRAY['low', 'unhealthy'], 'true', 'Don''t Play', 1.0);

(1)

DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary;

SELECT madlib.cols2vec(
    'golf',
    'cols2vec_result',
    '"OUTLOOK", class'
);

produces a varchar array:

select * from INFORMATION_SCHEMA.COLUMNS where table_name = 'out99';
-[ RECORD 1 ]------------+------------------
table_catalog            | madlib
table_schema             | public
table_name               | out99
column_name              | f2
ordinal_position         | 2
column_default           | 
is_nullable              | YES
data_type                | character varying
character_maximum_length | 
character_octet_length   | 1073741824
numeric_precision        | 
numeric_precision_radix  | 
numeric_scale            | 
datetime_precision       | 
interval_type            | 
interval_precision       | 
character_set_catalog    | 
character_set_schema     | 
character_set_name       | 
collation_catalog        | 
collation_schema         | 
collation_name           | 
domain_catalog           | 
domain_schema            | 
domain_name              | 
udt_catalog              | madlib
udt_schema               | pg_catalog
udt_name                 | varchar
scope_catalog            | 
scope_schema             | 
scope_name               | 
maximum_cardinality      | 
dtd_identifier           | 2
is_self_referencing      | NO
is_identity              | NO
identity_generation      | 
identity_start           | 
identity_increment       | 
identity_maximum         | 
identity_minimum         | 
identity_cycle           | 
is_generated             | NEVER
generation_expression    | 
is_updatable             | YES
-[ RECORD 2 ]------------+------------------
table_catalog            | madlib
table_schema             | public
table_name               | out99
column_name              | f1
ordinal_position         | 1
column_default           | 
is_nullable              | YES
data_type                | character varying
character_maximum_length | 
character_octet_length   | 1073741824
numeric_precision        | 
numeric_precision_radix  | 
numeric_scale            | 
datetime_precision       | 
interval_type            | 
interval_precision       | 
character_set_catalog    | 
character_set_schema     | 
character_set_name       | 
collation_catalog        | 
collation_schema         | 
collation_name           | 
domain_catalog           | 
domain_schema            | 
domain_name              | 
udt_catalog              | madlib
udt_schema               | pg_catalog
udt_name                 | varchar
scope_catalog            | 
scope_schema             | 
scope_name               | 
maximum_cardinality      | 
dtd_identifier           | 1
is_self_referencing      | NO
is_identity              | NO
identity_generation      | 
identity_start           | 
identity_increment       | 
identity_maximum         | 
identity_minimum         | 
identity_cycle           | 
is_generated             | NEVER
generation_expression    | 
is_updatable             | YES

(2)

DROP TABLE IF EXISTS cols2vec_result, cols2vec_result_summary;

SELECT madlib.cols2vec(
    'golf',
    'cols2vec_result',
    'class, "OUTLOOK"'
);

produces a text array:

select * from INFORMATION_SCHEMA.COLUMNS where table_name = 'out99';
-[ RECORD 1 ]------------+-----------
table_catalog            | madlib
table_schema             | public
table_name               | out99
column_name              | f2
ordinal_position         | 2
column_default           | 
is_nullable              | YES
data_type                | text
character_maximum_length | 
character_octet_length   | 1073741824
numeric_precision        | 
numeric_precision_radix  | 
numeric_scale            | 
datetime_precision       | 
interval_type            | 
interval_precision       | 
character_set_catalog    | 
character_set_schema     | 
character_set_name       | 
collation_catalog        | 
collation_schema         | 
collation_name           | 
domain_catalog           | 
domain_schema            | 
domain_name              | 
udt_catalog              | madlib
udt_schema               | pg_catalog
udt_name                 | text
scope_catalog            | 
scope_schema             | 
scope_name               | 
maximum_cardinality      | 
dtd_identifier           | 2
is_self_referencing      | NO
is_identity              | NO
identity_generation      | 
identity_start           | 
identity_increment       | 
identity_maximum         | 
identity_minimum         | 
identity_cycle           | 
is_generated             | NEVER
generation_expression    | 
is_updatable             | YES
-[ RECORD 2 ]------------+-----------
table_catalog            | madlib
table_schema             | public
table_name               | out99
column_name              | f1
ordinal_position         | 1
column_default           | 
is_nullable              | YES
data_type                | text
character_maximum_length | 
character_octet_length   | 1073741824
numeric_precision        | 
numeric_precision_radix  | 
numeric_scale            | 
datetime_precision       | 
interval_type            | 
interval_precision       | 
character_set_catalog    | 
character_set_schema     | 
character_set_name       | 
collation_catalog        | 
collation_schema         | 
collation_name           | 
domain_catalog           | 
domain_schema            | 
domain_name              | 
udt_catalog              | madlib
udt_schema               | pg_catalog
udt_name                 | text
scope_catalog            | 
scope_schema             | 
scope_name               | 
maximum_cardinality      | 
dtd_identifier           | 1
is_self_referencing      | NO
is_identity              | NO
identity_generation      | 
identity_start           | 
identity_increment       | 
identity_maximum         | 
identity_minimum         | 
identity_cycle           | 
is_generated             | NEVER
generation_expression    | 
is_updatable             | YES

Why is that?

@ArvindSridhar
Contributor Author

Regarding bool to text/integer conversion in GPDB5, I believe you have to use an explicit cast, as required by the underlying platform. Thus, if you use "temperature, windy::integer" or "windy::text, class", it should work fine. This is how the underlying postgres syntax for GPDB5 works as well.
For the text-varchar ordering, my hunch is that postgres treats these types as having the same priority, and thus just casts the second argument to whatever type the first argument was.
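
As a small illustration of the explicit-cast workaround, the sketch below assembles the column list with user-supplied casts and formats the three-argument `madlib.cols2vec` call used earlier in this thread. `with_explicit_casts` and `cols2vec_call` are hypothetical helpers, and the string formatting does no SQL escaping.

```python
def with_explicit_casts(columns, casts):
    """Append '::type' to each column that has an entry in `casts`."""
    return ", ".join(
        "{0}::{1}".format(col, casts[col]) if col in casts else col
        for col in columns
    )


def cols2vec_call(source_table, output_table, columns, casts):
    # String formatting is for illustration; it does no SQL escaping.
    return "SELECT madlib.cols2vec('{0}', '{1}', '{2}');".format(
        source_table, output_table, with_explicit_casts(columns, casts))


print(cols2vec_call("golf", "cols2vec_result", ["windy", "class"],
                    {"windy": "text"}))
```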

@fmcquillan99

thanks, that makes sense.

I added a type casting example to the user docs.

LGTM

asfgit commented Aug 1, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/628/

@asfgit asfgit closed this in 20f95b3 Aug 1, 2018

6 participants