Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jira:1239: Converts features from multiple columns into a feature array #288

Closed
wants to merge 8 commits into from

Conversation

hpandeycodeit
Copy link
Member

JIRA: 1239

Added a new module cols_vec which Converts features from multiple columns of an input table into a feature array in a single column.

Following files are committed:

cols2vec.py_in
cols2vec.sql_in
test/cols2vec.sql_in
Modules.yml

For special characters handling, using the py_list_to_string with "long_format = False".
Also, split_quoted_delimited_str which quotes each element of the array.
Tests with special characters are added in the install check.

@hpandeycodeit hpandeycodeit changed the title Madlib 1239 Jira:1239: Converts features from multiple columns into a feature array Jul 3, 2018
@asfgit
Copy link

asfgit commented Jul 3, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/535/

@@ -50,3 +50,4 @@ modules:
- name: validation
depends: ['array_ops', 'regress']
- name: stemmer
- name: cols_vec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced that we need a new module for this functionality. IMO this is better suited for the utilities module as a separate file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed it with @fmcquillan99 and I will move it under Utilities module in next commit.

"""
Function to validate input parameters
"""
if list_of_features.strip() != '*':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checks below are better expressed as assert statements using _assert(...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is done.

if not (list_of_features and list_of_features.strip()):
plpy.error("Features to include is empty")

if list_of_features.strip() != '*':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please combine this with the above if statement


all_cols = ''
feature_cols = ''
if list_of_features.strip() == '*':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if and else blocks seem to be very similar except for the source of the all_cols/feature_list. I suggest using the if switch only for populating the source columns. Other statements can be moved out of the if.

feature_list = split_quoted_delimited_str(list_of_features)
feature_exclude = split_quoted_delimited_str(
list_of_features_to_exclude)
return_set = set(feature_list) - set(feature_exclude)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of the features are lost in this operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the above code as well, now the order of features will remain the same.

from {source_table}
""".format(**locals()))

plpy.execute("""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't understood the need for the summary table. Is it just to record the features combined in the array? Can we provide that as an output of the function? Creating a table just to record that one parameter seems unnecessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iyerr3 The vec2cols story (https://issues.apache.org/jira/browse/MADLIB-1240) might consume this summary table if provided as the dictionary param in that module.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are 1000+ columns which you want to keep track of, then saving the array of col names in a summary tables might be convenient. It is not ideal but should not be onerous to the user and they can ignore the summary table if they don't care about it.

@fmcquillan99
Copy link

fmcquillan99 commented Jul 6, 2018

Since we are writing out a summary table, may as well add more info in it.

{code}
A summary table named <out_table>_summary is also created at the same time, which has the following columns:

source_table TEXT. Source table name.
list_of_features Input list of features.
list_of_features_to_exclude Input list of features to exclude.
feature_names TEXT[]. Array of names of features, i.e, dictionary for the feature_vector.
{code}

Is this do-able @hpandeycodeit ?

@hpandeycodeit
Copy link
Member Author

@fmcquillan99 ,

What is total_rows_processed and total_rows_skipped ? Can you provide more details on these?

@fmcquillan99
Copy link

update my comment above to remove the rows processed and skipped.

@asfgit
Copy link

asfgit commented Jul 6, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/536/

Copy link
Contributor

@iyerr3 iyerr3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor corrections

feature_list = ''
if list_of_features.strip() == '*':
all_cols = get_cols(source_table, schema_madlib)
all_col_set = set(list(all_cols))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of the columns (retained by get_cols) is lost here. I suggest:

feature_list = [col for col in all_cols if col not in exclude_set]

validate_cols2vec_args(source_table, output_table, list_of_features,
list_of_features_to_exclude, cols_to_output, **kwargs)

all_cols = ''
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can delete lines 70, 71, 72 since we don't need those anymore.


feature_cols = py_list_to_sql_string(
list(feature_list), "text", False)
filtered_list_of_features = ",".join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filtered_list_of_features = ",".join(feature_list)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above changes are done as suggested.

@asfgit
Copy link

asfgit commented Jul 9, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/537/

@asfgit
Copy link

asfgit commented Jul 10, 2018

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/539/

iyerr3 added a commit to madlib/madlib that referenced this pull request Jul 13, 2018
@asfgit asfgit closed this in 950114c Jul 16, 2018
@hpandeycodeit hpandeycodeit deleted the MADLIB_1239 branch May 1, 2019 23:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants