New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature: Sessionize funtion - Phase 2 #49
Conversation
JIRA: MADLIB-1001 This contains the phase 2 implementation of the sessionize function. This commit adds two new optional parameters: output_cols and create_view. output_cols is a comma separated list of columns to be projected into the output, while create_view indicates if the output is a view or a table. The default values of output_cols and create_view are '*' and True respectively. These two parameters are for performance reasons. A view is generally faster when compared to materializing the results to an actual table, while having to include all columns when not necessary can be avoided with the output_cols parameter.
if col_name == '*': | ||
return get_cols_str(table_columns_list) | ||
else: | ||
# Column name cannot have more than one pair of quotes in it. Removing any existing quotes, and then quoting the new string obtained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't going to work. One immediate test case that comes to mind is "column with spaces in the name".
If you wanted to do something around quoting, I think you need to do everything that Postgres quote_ident() does.
How is this handled elsewhere in MADlib? What happens if you have "a table with spaces in the name"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While looking at #47, I noticed that there are already column parsing utilities. See https://github.com/apache/incubator-madlib/pull/47/files#diff-96a0a5136d71ec812cc09b1d3a2e259aR35.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understand the comment exactly. This method does take care of column names with spaces in the name. Say I had a column name called "user id" (note that it is quoted, since postgres does not let me name a column with spaces without quotes), and another column called revenue.
If my output_cols parameter was:
' "user id"<100, revenue>20 '
I parse this to rename each expression:
- "user id"<100 is named as "user id<100"
- revenue>20 is named as "revenue>20"
This ends up creating a select statement as follows:
SELECT "user id"<100 AS "user id<100", revenue>20 AS "revenue>20"
All I am doing in this function is creating a new column name for an expression. I quote the expression to create a new column name, but before doing that, I remove any existing quotes from the expression (because postgres does not let me have quotes inside a quoted column name).
I checked again to make sure this is indeed how its working. I will put up some install checks to cover these cases too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "user id"<100 is named as "user id<100"
- revenue>20 is named as "revenue>20"
Ahh, ok. That was completely unclear to me. It would be good to add
comments/docstrings to the various functions explaining what they're doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1.
I suggest also adding the above examples to the docstrings.
JIRA: MADLIB-1001 This commit has updated install check and documentation in the code, based on comments in the original PR.
Thanks for the discussion on this pull request. I would suggest that asking the user to provie a valid SELECT expression as output_cols is not unreasonable. We just need to be clear in the docs. For example, in the path function http://madlib.incubator.apache.org/docs/latest/group__grp__path.html#examples |
JIRA: MADLIB-1001 Based on the commentss by @fmcquillan99, @decibel and @iyerr3, the requirements for output_cols has changed a bit, which has made the implementation significantly simpler! output_cols is expected to be a valid SELECT expression now. So we are now using output_cols as is, except for replacing '*' with the actual column names from the source table.
JIRA: MADLIB-1001 Had missed deleting one of functions which is not used anymore.
I have pushed the latest code that treats output_cols as a valid select expression. I have removed |
table_or_view = 'VIEW' if create_view or create_view is None else 'TABLE' | ||
|
||
output_cols = '*' if output_cols is None else output_cols | ||
# If the output_cols has '*' as one of the elements, expand it to include all columns in the source table. The following list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block needs to be wrapped into 72 characters and appropriate line space needs to be added to improve readability.
JIRA: MADLIB-1001 Wrapping up lines to 72 characters and improving the line spaces for better readability.
d248f2f
to
eb86fd5
Compare
JIRA: MADLIB-1001 More indentation changes.
JIRA: MADLIB-1001
This contains the phase 2 implementation of the sessionize function.
This commit adds two new optional parameters: output_cols and
create_view. output_cols is a comma separated list of columns to
be projected into the output, while create_view indicates if the
output is a view or a table. The default values of output_cols
and create_view are '*' and True respectively.
These two parameters are for performance reasons. A view is
generally faster when compared to materializing the results to an
actual table, while having to include all columns when not necessary
can be avoided with the output_cols parameter.