Jira:1239: Converts features from multiple columns into a feature array #288

hpandeycodeit · 2018-07-03T22:22:05Z

JIRA: 1239

Added a new module, cols_vec, which converts features from multiple columns of an input table into a feature array in a single column.

The following files are committed:

cols2vec.py_in
cols2vec.sql_in
test/cols2vec.sql_in
Modules.yml

Special characters are handled using py_list_to_string with "long_format = False", together with split_quoted_delimited_str, which quotes each element of the array.
Tests with special characters are added to the install check.
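
For readers skimming this PR, here is a minimal plain-Python sketch of the transformation described above. It is not the module's code (which runs SQL through plpy); the function name cols_to_vec, the sample rows, and the use of dicts in place of table rows are all assumptions made for illustration.

```python
def cols_to_vec(rows, feature_names, cols_to_output=()):
    """Pack the named feature columns of each row into one list-valued column.

    Hypothetical illustration only: rows are plain dicts standing in for
    table rows, and 'feature_vector' is the single output array column.
    """
    packed = []
    for row in rows:
        # Carry over any pass-through columns unchanged.
        out = {col: row[col] for col in cols_to_output}
        # Collect the feature columns, in the order given, into one array.
        out["feature_vector"] = [row[name] for name in feature_names]
        packed.append(out)
    return packed


rows = [
    {"id": 1, "temp": 85, "humidity": 85, "windy": False},
    {"id": 2, "temp": 80, "humidity": 90, "windy": True},
]
result = cols_to_vec(rows, ["temp", "humidity"], cols_to_output=["id"])
```

Each output row keeps the pass-through columns and gains one array holding the selected features in the requested order.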


asfgit · 2018-07-03T23:38:52Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/535/

iyerr3 · 2018-07-04T00:18:37Z

src/config/Modules.yml

@@ -50,3 +50,4 @@ modules:
    - name: validation
      depends: ['array_ops', 'regress']
    - name: stemmer
+    - name: cols_vec


I'm not convinced that we need a new module for this functionality. IMO this is better suited for the utilities module as a separate file.

Discussed it with @fmcquillan99; I will move it under the Utilities module in the next commit.

iyerr3 · 2018-07-04T00:20:43Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+    """
+        Function to validate input parameters
+    """
+    if list_of_features.strip() != '*':


The checks below are better expressed as assert statements using _assert(...)

This is done.
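
A rough sketch of what the assert-style validation could look like. The _assert defined here is a hypothetical stand-in that raises ValueError so the snippet runs outside the database; the real helper reports errors through plpy. The function name validate_features_arg is also an assumption.

```python
def _assert(condition, message):
    """Hypothetical stand-in for the project's _assert helper."""
    if not condition:
        raise ValueError(message)


def validate_features_arg(list_of_features):
    """Check that the features argument is present and non-blank."""
    # One assertion replaces the nested if / plpy.error pair.
    _assert(bool(list_of_features and list_of_features.strip()),
            "Features to include is empty")
    return list_of_features.strip()
```

With this shape, each precondition is a single line that states the condition that must hold and the message to report when it does not.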

iyerr3 · 2018-07-04T00:21:29Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+        if not (list_of_features and list_of_features.strip()):
+            plpy.error("Features to include is empty")
+
+    if list_of_features.strip() != '*':


Please combine this with the above if statement

iyerr3 · 2018-07-04T00:28:09Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+
+        all_cols = ''
+        feature_cols = ''
+        if list_of_features.strip() == '*':


The if and else blocks seem to be very similar except for the source of the all_cols/feature_list. I suggest using the if switch only for populating the source columns. Other statements can be moved out of the if.
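
One possible shape for this refactor, as a sketch only. The function name resolve_feature_list is hypothetical, table_columns stands in for whatever get_cols returns, and a plain split on commas is used in place of split_quoted_delimited_str, so quoted column names are not handled here.

```python
def resolve_feature_list(list_of_features, list_of_features_to_exclude,
                         table_columns):
    """Return (feature_list, comma-joined string) for the requested features."""
    # Only the source of the candidate columns depends on the '*' switch.
    if list_of_features.strip() == "*":
        candidates = list(table_columns)
    else:
        candidates = [c.strip() for c in list_of_features.split(",")]

    # Everything below is shared by both cases.
    exclude_set = set(
        c.strip()
        for c in (list_of_features_to_exclude or "").split(",")
        if c.strip()
    )
    feature_list = [c for c in candidates if c not in exclude_set]
    return feature_list, ",".join(feature_list)
```

The branch is reduced to two assignments, and the exclusion and joining logic exists in one place.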

iyerr3 · 2018-07-04T00:34:57Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+            feature_list = split_quoted_delimited_str(list_of_features)
+            feature_exclude = split_quoted_delimited_str(
+                list_of_features_to_exclude)
+            return_set = set(feature_list) - set(feature_exclude)


The order of the features is lost in this operation.

Updated the above code as well; the order of features now remains the same.
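
To illustrate the ordering concern: a Python set has no defined iteration order, so subtracting sets can return the features in any order, whereas filtering the original list keeps them as given. The column names below are made up for the example.

```python
feature_list = ["petal_len", "id", "sepal_len", "species", "sepal_wid"]
feature_exclude = ["id", "species"]

# Set difference: correct membership, but iteration order is not guaranteed.
unordered = set(feature_list) - set(feature_exclude)

# Filtering the list keeps the original column order.
exclude_set = set(feature_exclude)
ordered = [col for col in feature_list if col not in exclude_set]
```

Both results contain the same names; only the list version is safe to use as the dictionary for the positions in the feature array.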

iyerr3 · 2018-07-04T00:37:36Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+    		from {source_table}
+    		""".format(**locals()))
+
+        plpy.execute("""


I haven't understood the need for the summary table. Is it just to record the features combined in the array? Can we provide that as an output of the function? Creating a table just to record that one parameter seems unnecessary.

@iyerr3 The vec2cols story (https://issues.apache.org/jira/browse/MADLIB-1240) might consume this summary table if provided as the dictionary param in that module.

If there are 1000+ columns that you want to keep track of, then saving the array of column names in a summary table might be convenient. It is not ideal, but it should not be onerous to the user, and they can ignore the summary table if they don't care about it.

fmcquillan99 · 2018-07-06T19:05:56Z

Since we are writing out a summary table, may as well add more info in it.

{code}
A summary table named <out_table>_summary is also created at the same time, which has the following columns:

source_table TEXT. Source table name.
list_of_features Input list of features.
list_of_features_to_exclude Input list of features to exclude.
feature_names TEXT[]. Array of names of features, i.e., dictionary for the feature_vector.
{code}

Is this do-able @hpandeycodeit ?
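
A sketch of how such a summary table statement might be assembled, under several assumptions: the helper names quote_literal and build_summary_sql are hypothetical, the two input-list columns are stored as TEXT (the comment above does not give their types), and the output table name is interpolated without identifier quoting. The actual module may build this differently.

```python
def quote_literal(value):
    """Quote a Python string as a SQL string literal, or NULL for None."""
    if value is None:
        return "NULL"
    return "'" + value.replace("'", "''") + "'"


def build_summary_sql(output_table, source_table, list_of_features,
                      list_of_features_to_exclude, feature_names):
    """Build a CREATE TABLE statement for <output_table>_summary."""
    names = ", ".join(quote_literal(n) for n in feature_names)
    return (
        "CREATE TABLE {out}_summary AS SELECT "
        "{src}::TEXT AS source_table, "
        "{inc}::TEXT AS list_of_features, "
        "{exc}::TEXT AS list_of_features_to_exclude, "
        "ARRAY[{names}]::TEXT[] AS feature_names"
    ).format(
        out=output_table,
        src=quote_literal(source_table),
        inc=quote_literal(list_of_features),
        exc=quote_literal(list_of_features_to_exclude),
        names=names,
    )


sql = build_summary_sql("out_tbl", "src_tbl", "*", "id", ["a", "b's"])
```

The resulting string would then be passed to the database, for example via plpy.execute inside the module.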

hpandeycodeit · 2018-07-06T20:36:01Z

@fmcquillan99 ,

What are total_rows_processed and total_rows_skipped? Can you provide more details on these?

fmcquillan99 · 2018-07-06T21:42:45Z

Updated my comment above to remove the rows processed and skipped.

asfgit · 2018-07-06T23:09:22Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/536/

iyerr3

Minor corrections

iyerr3 · 2018-07-06T23:13:00Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+        feature_list = ''
+        if list_of_features.strip() == '*':
+            all_cols = get_cols(source_table, schema_madlib)
+            all_col_set = set(list(all_cols))


The order of the columns (retained by get_cols) is lost here. I suggest:

feature_list = [col for col in all_cols if col not in exclude_set]

iyerr3 · 2018-07-06T23:13:21Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+        validate_cols2vec_args(source_table, output_table, list_of_features,
+                               list_of_features_to_exclude, cols_to_output, **kwargs)
+
+        all_cols = ''


We can delete lines 70, 71, 72 since we don't need those anymore.

iyerr3 · 2018-07-06T23:13:50Z

src/ports/postgres/modules/cols_vec/cols2vec.py_in

+
+        feature_cols = py_list_to_sql_string(
+            list(feature_list), "text", False)
+        filtered_list_of_features = ",".join(


filtered_list_of_features = ",".join(feature_list)

Above changes are done as suggested.

asfgit · 2018-07-09T06:50:37Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/537/


asfgit · 2018-07-10T16:49:23Z

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/madlib-pr-build/539/

JIRA: MADLIB-1239 Closes apache#288

hpandeycodeit added 5 commits June 15, 2018 01:33

Changes for Jira: 1239, Converts features from multiple columns into…

3b5a778

… a feature array

added Module cols_vec

06bef61

added Module cols_vec

40f08e1

Fixed the test file for install-check failure.

3050607

Changes for Jira: 1239, Converts features from multiple columns into…

7dce06b

… a feature array

hpandeycodeit changed the title ~~Madlib 1239~~ Jira:1239: Converts features from multiple columns into a feature array Jul 3, 2018

iyerr3 reviewed Jul 4, 2018


Replaced if with _assert and couple of other minor changes

bb41f57

iyerr3 approved these changes Jul 6, 2018


Minor code changes

5a7889d

Changes to the output summary Table to include more columns with oth…

c6a793f

…er minor changes

iyerr3 added a commit to madlib/madlib that referenced this pull request Jul 13, 2018

Utilties: Refactor and clean cols2vec from 109be7d

625e537

JIRA: MADLIB-1239 Closes apache#288

asfgit closed this in 950114c Jul 16, 2018

hpandeycodeit deleted the MADLIB_1239 branch May 1, 2019 23:19


