Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature: Sessionize funtion - Phase 2 #49

Closed
wants to merge 6 commits into from

Conversation

njayaram2
Copy link
Contributor

JIRA: MADLIB-1001

This contains the phase 2 implementation of the sessionize function.
This commit adds two new optional parameters: output_cols and
create_view. output_cols is a comma separated list of columns to
be projected into the output, while create_view indicates if the
output is a view or a table. The default values of output_cols
and create_view are '*' and True respectively.

These two parameters are for performance reasons. A view is
generally faster when compared to materializing the results to an
actual table, while having to include all columns when not necessary
can be avoided with the output_cols parameter.

JIRA: MADLIB-1001

This contains the phase 2 implementation of the sessionize function.
This commit adds two new optional parameters: output_cols and
create_view. output_cols is a comma separated list of columns to
be projected into the output, while create_view indicates if the
output is a view or a table. The default values of output_cols
and create_view are '*' and True respectively.

These two parameters are for performance reasons. A view is
generally faster when compared to materializing the results to an
actual table, while having to include all columns when not necessary
can be avoided with the output_cols parameter.
if col_name == '*':
return get_cols_str(table_columns_list)
else:
# Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't going to work. One immediate test case that comes to mind is "column with spaces in the name".

If you wanted to do something around quoting, I think you need to do everything that Postgres quote_ident() does.

How is this handled elsewhere in MADlib? What happens if you have "a table with spaces in the name"?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While looking at #47, I noticed that there are already column parsing utilities. See https://github.com/apache/incubator-madlib/pull/47/files#diff-96a0a5136d71ec812cc09b1d3a2e259aR35.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the comment exactly. This method does take care of column names with spaces in the name. Say I had a column name called "user id" (note that it is quoted, since postgres does not let me name a column with spaces without quotes), and another column called revenue.
If my output_cols parameter was:
' "user id"<100, revenue>20 '

I parse this to rename each expression:

  • "user id"<100 is named as "user id<100"
  • revenue>20 is named as "revenue>20"

This ends up creating a select statement as follows:
SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"

All I am doing in this function is creating a new column name for an expression. I quote the expression to create a new column name, but before doing that, I remove any existing quotes from the expression (because postgres does not let me have quotes inside a quoted column name).

I checked again to make sure this is indeed how its working. I will put up some install checks to cover these cases too.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • "user id"<100 is named as "user id<100"
  • revenue>20 is named as "revenue>20"

Ahh, ok. That was completely unclear to me. It would be good to add
comments/docstrings to the various functions explaining what they're doing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1.
I suggest also adding the above examples to the docstrings.

@njayaram2
Copy link
Contributor Author

Thank you for the comments @decibel and @iyerr3. I will improve the install check and documentation,
and update the pull request.

JIRA: MADLIB-1001

This commit has updated install check and documentation in the code,
based on comments in the original PR.
@fmcquillan99
Copy link

fmcquillan99 commented Jun 22, 2016

Thanks for the discussion on this pull request. I would suggest that asking the user to provie a valid SELECT expression as output_cols is not unreasonable. We just need to be clear in the docs.

For example, in the path function http://madlib.incubator.apache.org/docs/latest/group__grp__path.html#examples
we do a similar thing for the aggregate function parameter.

Nandish Jayaram added 2 commits June 24, 2016 15:20
JIRA: MADLIB-1001

Based on the commentss by @fmcquillan99, @decibel and @iyerr3, the
requirements for output_cols has changed a bit, which has made the
implementation significantly simpler!
output_cols is expected to be a valid SELECT expression now. So we
are now using output_cols as is, except for replacing '*' with the
actual column names from the source table.
JIRA: MADLIB-1001

Had missed deleting one of functions which is not used anymore.
@njayaram2
Copy link
Contributor Author

I have pushed the latest code that treats output_cols as a valid select expression. I have removed
the code that was doing a lot of string parsing to rename expressions, and the onus is now on the
user to use ' AS ' and rename expressions in output_cols.

table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE'

output_cols = '*' if output_cols is None else output_cols
# If the output_cols has '*' as one of the elements, expand it to include all columns in the source table. The following list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This block needs to be wrapped into 72 characters and appropriate line space needs to be added to improve readability.

JIRA: MADLIB-1001

Wrapping up lines to 72 characters and improving the line spaces
for better readability.
JIRA: MADLIB-1001

More indentation changes.
@asfgit asfgit closed this in b996544 Jun 29, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants