Jira:1239: Converts features from multiple columns into a feature array #288
Refer to this link for build results (access rights to CI server needed):
src/config/Modules.yml
@@ -50,3 +50,4 @@ modules:
- name: validation
  depends: ['array_ops', 'regress']
- name: stemmer
- name: cols_vec
I'm not convinced that we need a new module for this functionality. IMO this is better suited for the utilities module as a separate file.
Discussed it with @fmcquillan99; I will move it under the Utilities module in the next commit.
"""
Function to validate input parameters
"""
if list_of_features.strip() != '*':
The checks below are better expressed as assert statements using _assert(...)
This is done.
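A minimal sketch of what the assert-style check could look like. The `_assert` defined here is a stand-in written for illustration (it raises `ValueError` rather than calling `plpy.error`), and `validate_features` is a hypothetical wrapper, not the function from this PR.

```python
def _assert(condition, msg):
    # Stand-in for the helper named in the review; the real one is
    # assumed to report the error through plpy instead of raising.
    if not condition:
        raise ValueError(msg)


def validate_features(list_of_features):
    # Hypothetical wrapper: a single assertion replaces the
    # "if not (...): plpy.error(...)" block.
    _assert(list_of_features and list_of_features.strip(),
            "Features to include is empty")


validate_features("height, weight")  # passes silently
```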
if not (list_of_features and list_of_features.strip()):
    plpy.error("Features to include is empty")

if list_of_features.strip() != '*':
Please combine this with the above if statement.
all_cols = ''
feature_cols = ''
if list_of_features.strip() == '*':
The if and else blocks seem to be very similar except for the source of the all_cols/feature_list. I suggest using the if switch only for populating the source columns. Other statements can be moved out of the if.
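One way to read this suggestion, as a hedged sketch: the branch only picks where the candidate columns come from, and the exclusion step runs once for both cases. `select_features`, `split_names`, and the `table_cols` argument are hypothetical names; `split_names` is a naive comma split standing in for `split_quoted_delimited_str`, and `table_cols` stands in for whatever `get_cols` returns.

```python
def split_names(text):
    # Naive comma split; a stand-in for split_quoted_delimited_str,
    # which is assumed to also cope with quoted names.
    return [name.strip() for name in (text or '').split(',') if name.strip()]


def select_features(list_of_features, list_of_features_to_exclude, table_cols):
    # The branch only decides the source of the candidate columns.
    if list_of_features.strip() == '*':
        candidates = list(table_cols)  # e.g. the result of get_cols(...)
    else:
        candidates = split_names(list_of_features)

    # Everything below is shared by both cases.
    exclude_set = set(split_names(list_of_features_to_exclude))
    return [col for col in candidates if col not in exclude_set]
```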
feature_list = split_quoted_delimited_str(list_of_features)
feature_exclude = split_quoted_delimited_str(
    list_of_features_to_exclude)
return_set = set(feature_list) - set(feature_exclude)
The order of the features is lost in this operation.
Updated the above code as well; the order of features now remains the same.
from {source_table}
""".format(**locals()))

plpy.execute("""
I haven't understood the need for the summary table. Is it just to record the features combined in the array? Can we provide that as an output of the function? Creating a table just to record that one parameter seems unnecessary.
@iyerr3 The vec2cols story (https://issues.apache.org/jira/browse/MADLIB-1240) might consume this summary table if provided as the dictionary param in that module.
If there are 1000+ columns which you want to keep track of, then saving the array of col names in a summary table might be convenient. It is not ideal but should not be onerous to the user, and they can ignore the summary table if they don't care about it.
Since we are writing out a summary table, we may as well add more info to it: source_table TEXT. Source table name. Is this doable, @hpandeycodeit?
What are total_rows_processed and total_rows_skipped? Can you provide more details on these?
Updated my comment above to remove the rows processed and skipped.
Minor corrections
feature_list = ''
if list_of_features.strip() == '*':
    all_cols = get_cols(source_table, schema_madlib)
    all_col_set = set(list(all_cols))
The order of the columns (retained by get_cols) is lost here. I suggest:
feature_list = [col for col in all_cols if col not in exclude_set]
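To illustrate why the list comprehension is preferable, here is a small sketch with made-up column names. A Python set carries no ordering guarantee, so the set difference yields the right members but not necessarily in table order, while filtering the original list keeps the order `get_cols` is said to retain.

```python
# Hypothetical column names, listed in table order.
all_cols = ['id', 'height', 'weight', 'age', 'label']
exclude_set = set(['id', 'label'])

# Set difference: correct members, but iteration order is not guaranteed.
unordered = set(all_cols) - exclude_set

# Filtering the list keeps the original column order.
feature_list = [col for col in all_cols if col not in exclude_set]
```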
validate_cols2vec_args(source_table, output_table, list_of_features,
                       list_of_features_to_exclude, cols_to_output, **kwargs)

all_cols = ''
We can delete lines 70, 71, 72 since we don't need those anymore.
feature_cols = py_list_to_sql_string(
    list(feature_list), "text", False)
filtered_list_of_features = ",".join(
filtered_list_of_features = ",".join(feature_list)
Above changes are done as suggested.
…er minor changes
JIRA: MADLIB-1239 Closes apache#288
JIRA: 1239
Added a new module, cols_vec, which converts features from multiple columns of an input table into a feature array in a single column.
The following files are committed:
cols2vec.py_in
cols2vec.sql_in
test/cols2vec.sql_in
Modules.yml
Special characters are handled by calling py_list_to_sql_string with "long_format = False", and by split_quoted_delimited_str, which quotes each element of the array.
Tests with special characters are added in the install check.
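As a generic illustration of why quoting matters for column names with special characters, the sketch below double-quotes each identifier and doubles any embedded double quote, which is the standard SQL rule for quoted identifiers. `quote_ident` here is a hypothetical helper written for this example, not one of the MADlib utilities named above, and the column names are made up.

```python
def quote_ident(name):
    # Wrap in double quotes and double any embedded double quote.
    return '"' + name.replace('"', '""') + '"'


# Hypothetical column names, two of which need quoting to be valid SQL.
cols = ['height', 'my col', 'say "hi"']
select_list = ", ".join(quote_ident(col) for col in cols)
```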