-
Notifications
You must be signed in to change notification settings - Fork 147
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Vector-Column Transformations #291
Conversation
Refer to this link for build results (access rights to CI server needed): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionality on its own LGTM. There might be overlap of code between this and #288. It would help to have a unified code structure for the two functions.
import plpy_mock as plpy | ||
|
||
m4_changequote(`<!', `!>') | ||
class Vec2ColsTestCase(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is another Vec2ColsTestCase
in test_vec2cols.py_in
. Is this one redundant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch haha, this was a typo. Updating the other PR now.
@@ -243,6 +243,9 @@ class UtilitiesTestCase(unittest.TestCase): | |||
self.assertFalse(s.is_valid_psql_type('boolean[]', s.INCLUDE_ARRAY | s.ONLY_ARRAY)) | |||
self.assertFalse(s.is_valid_psql_type('boolean', s.ONLY_ARRAY)) | |||
self.assertFalse(s.is_valid_psql_type('boolean[]', s.ONLY_ARRAY)) | |||
self.assertTrue(s.is_valid_psql_type('boolean[]', s.ANY_ARRAY)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this and corresponding code changes to another commit and PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, should be on PRs 292 and 293
@param: feature_names, list. Python list of the feature names to | ||
use for the split elements in the vector_col array | ||
""" | ||
query = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this was meant to use the is_col_1d_array
function?
user docs seem incomplete |
1e8bc32
to
3fb2c43
Compare
Refer to this link for build results (access rights to CI server needed): |
3fb2c43
to
543dad5
Compare
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
b580664
to
0d31b49
Compare
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
7ed9ac8
to
e0aa087
Compare
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Please note that |
e0aa087
to
ef23f3a
Compare
PR is ready to be reviewed once again |
Refer to this link for build results (access rights to CI server needed): |
In cols2vec, For this table:
this fails:
because It forces the user to do:
but this is inconvenient especially if you have a big Also a mix of VARCHAR and TEXT fails in a similar way Use PostgreSQL precendence rules to fix this please. |
In vec2cols,
results in
but the split columns |
After the above 2 issues I mentioned are fixed, I will have 1 more commit on user docs to this PR |
Refer to this link for build results (access rights to CI server needed): |
""" | ||
input_tbl_valid(source_table, self.module_name) | ||
output_tbl_valid(output_table, self.module_name) | ||
# cols_to_validate = self.get_cols_helper.get_cols_as_list(cols_to_output) + [vector_col] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess we can remove this commented line.
cols_to_keep = ', '.join(self.get_cols_helper.get_cols_as_list(cols_to_output, | ||
source_table)) + ", " if cols_to_output else '' | ||
|
||
# TODO why don't we call quote_literal here but call it later for feature_cols |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a comment explaining this?
|
||
output_table_summary = add_postfix(output_table, "_summary") | ||
# Dollar-quote the text to allow single-quotes without escaping | ||
# TODO explain why it's called _outer_ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still valid?
Refer to this link for build results (access rights to CI server needed): |
2013aad
to
a3c51a2
Compare
All PR comments addressed, commits have been squashed and pushed to this branch, ready to finalize PR |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
Refer to this link for build results (access rights to CI server needed): |
9b443f3
to
4b77a37
Compare
Refer to this link for build results (access rights to CI server needed): |
distinct_types = set([col_type[1] for col_type in all_cols_and_types | ||
if col_type[0] in self.features_to_nest]) | ||
for expr_type in distinct_types: | ||
_assert(not is_valid_psql_type(expr_type, ANY_ARRAY), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more cleanly written as
_assert(not any(is_valid_psql_type(expr_type, ANY_ARRAY) for expr_type in distinc_types), ...
@@ -513,11 +513,12 @@ def array_col_has_same_dimension(tbl, col): | |||
# ------------------------------------------------------------------------ | |||
|
|||
|
|||
def explicit_bool_to_text(tbl, cols, schema_madlib): | |||
def explicit_bool_to_text(tbl, cols, schema_madlib, is_forced=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear on the need for is_forced
. Are there platforms that have a bool-to-text cast require this patching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
w/ @ArvindSridhar On platforms that has bool to text casting (gpdb5, pg 9.6, pg 10), we still need this to make sure we can create an array of bool and text types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this function, though - we might need ::TEXT
. But I believe we decided that we'll let the platform fail if the array cannot be built by it. Wouldn't this build a successful array when the platform is failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point-we added this arg to make our dev-check pass, because our dev check didn't have an explicit ::TEXT cast and yet it was running on PG10, which supports bool to text casting. Would removing the dev check test and the is_forced arg be the better way to go?
@@ -221,11 +221,11 @@ def is_psql_numeric_type(arg, exclude=None): | |||
Returns: | |||
Boolean. Returns if 'arg' is one of the numeric types | |||
""" | |||
numeric_types = set(['smallint', 'integer', 'bigint', 'decimal', 'numeric', | |||
'real', 'double precision', 'serial', 'bigserial']) | |||
# numeric_types = set(['smallint', 'integer', 'bigint', 'decimal', 'numeric', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines (+ other similar lines below) can be deleted.
_assert(not is_valid_psql_type(expr_type, ANY_ARRAY), | ||
"{0}: Feature columns to nest cannot be of type array" | ||
.format(self.module_name)) | ||
if len(distinct_types) != 1 and 'boolean' in distinct_types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help me understand the need for len(distinct_types) != 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to cast boolean to text only if it is mixed with other types. Creating a vector of multiple boolean columns works without casting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I would suggest changing it to len(distinct_types) > 1
and adding what you said above as a comment.
Refer to this link for build results (access rights to CI server needed): |
Removed the problematic dev check test that required us to use forced bool to text conversion, should be good to go |
Refer to this link for build results (access rights to CI server needed): |
JIRA: MADLIB-1240 This commit adds a new SQL function called vec2cols and refactors the current function cols2vec, providing greater integration between the two modules. We now have a single Python file with separate classes for each feature. We also have unified unit-tests and dev-check/install-check tests. The vec2cols function enables users to split up a single column into multiple columns, given that the input column contains array entries. For example, if the input column contained ARRAY[1, 2, 3] in one of its rows, the output table will contain 3 different columns, one for each element of the array. Co-authored-by: Nandish Jayaram <njayaram@apache.org> Co-authored-by: Rahul Iyer <riyer@apache.org> Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Removes type validation from the code and instead lets the underlying database handle type exceptions and casting. Updated dev-check and unit-tests accordingly. Co-authored-by: Orhan Kislal <okislal@pivotal.io>
3d72503
to
1a7c756
Compare
Refer to this link for build results (access rights to CI server needed): |
Where did we land on the boolean casting issue? Testing on Greenplum 5, I see:
|
Wondering about order for varchar and text casting.
(1)
produces a varchar array:
(2)
produces a text array:
Why is that? |
Regarding bool to text/integer conversion in GPDB5, I believe you have to use an explicit cast as required by the underlying platform. Thus, if you "temperature, windy::integer" or "windy::text, class", it should work fine. This is how the underlying postgres syntax for GPDB5 works as well |
thanks, that makes sense. I added a type casting example to the user docs. LGTM |
Refer to this link for build results (access rights to CI server needed): |
JIRA: MADLIB-1240
This commit adds a new SQL function called vec2cols and refactors the
current function cols2vec, providing greater integration between the two
modules. We now have a single Python file with separate classes for each
feature. We also have unified unit-tests and dev-check/install-check
tests.
The vec2cols function enables users to split up a single column into
multiple columns, given that the input column contains array entries.
For example, if the input column contained ARRAY[1, 2, 3] in one of its
rows, the output table will contain 3 different columns, one for each
element of the array.