Skip to content

Type of __time column is determined by RowSignature in case of External Datasource#12770

Merged
kfaraz merged 7 commits intoapache:masterfrom
LakshSingla:extern-restrict
Jul 26, 2022
Merged

Type of __time column is determined by RowSignature in case of External Datasource#12770
kfaraz merged 7 commits intoapache:masterfrom
LakshSingla:extern-restrict

Conversation

@LakshSingla
Copy link
Contributor

@LakshSingla LakshSingla commented Jul 12, 2022

Description

Queries like

REPLACE INTO inline_data1 OVERWRITE ALL
SELECT 
  TIME_PARSE("__time") AS __time
FROM TABLE(
  EXTERN(
    '{"type":"inline","data":"{\"__time\":\"2022-02-02T02:02:02.002\"}"}',
    '{"type":"json"}',
    '[{"name":"__time","type":"string"}]'
  )
)
PARTITIONED BY ALL

would fail at the Calcite layer. This is because any column with name __time is considered to be of type SqlTypeName.TIMESTAMP (code) without consideration for the RowSignature that is passed (i.e. the type that is getting passed in EXTERN). This messes up functions like TIME_PARSE which are expected to work with a certain signature.

This PR modifies RowSignatures.toRelDataType() so that the type of __time column is determined by the RowSignature's type.

`This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

// 1. __time in the external data source may map to a different column
// 2. __time in the external data source might be ignored while ingesting the data
// 3. Even if __time is of type LONG, it may be used with some function, which would cause the abovementioned Calcite failure
// The most optimal solution would be to prevent assumption of the __time column having the type SqlTypeName.TIMESTAMP
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this!

Is there a reason why we cannot do the more optimal solution now? Something like: for __time, have RowSignatures.toRelDataType return TIMESTAMP for regular Druid tables, and the type from the provided signature otherwise. To get the information through, we could add a parameter to toRelDataType like boolean isDruidTable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a go through this, but it should be possible. The thing preventing me from doing it was that I was unable to understand the reason for ignoring the RowSignature and casting the __time column to type TIMESTAMP.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was that we don't want to use the default conversion for regular Druid tables, since that would be BIGINT, and we actually want TIMESTAMP. This rationale only really applies to regular Druid tables.

@LakshSingla LakshSingla changed the title Restrict usage of __time in the EXTERN function Type of __time column is determined by RowSignature in case of External Datasource Jul 12, 2022
@LakshSingla
Copy link
Contributor Author

Thanks for the comment @gianm. I have updated the approach in the PR to determine the Rel type from the RowSignature in case of ExternalDataSource.

{
return RowSignatures.toRelDataType(rowSignature, typeFactory);
// For external datasources, the row type should be determined by whatever the row signature has been explicitly
// passed in. Typecasting directly to SqlTypeName.TIMESTAMP will lead to inconsistencies with the Calcite functions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that for future reader, it will be good to clarify with an example of inconsistency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these changes still needed?

Copy link
Contributor Author

@LakshSingla LakshSingla Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. This is because this change is required for the Calcite layer. The two use cases that I am thinking of right now:

  1. Let's say we have a function FUNC(a) that accepts long. If the user passes a __time column of type long, this would be typecasted to SqlTypeName.TIMESTAMP before passing to FUNC(a). This will cause a signature mismatch, even though the query would have produced correct results downstream.
  2. In the future, when the typecasting with the cursor is improved to accept columns of type other than LONG, this would ensure that the explicit type passed by the user is respected.

@cryptoe
Copy link
Contributor

cryptoe commented Jul 18, 2022

LGTM as well!!

@LakshSingla
Copy link
Contributor Author

Please hold off merging this PR for now. I was doing my own testing, and it doesn't seem to work as expected.

@LakshSingla
Copy link
Contributor Author

Reverted the PR to prevent using __time column with type other than LONG. Simultaneously, I am also preventing typecasting of data to SqlTypeName.TIMESTAMP when passed in EXTERN.

  1. The reason for the restriction is that currently RowBasedColumnSelectorFactory typecasts the __time column to LONG irrespective of data type which causes erroneous evaluations of the expressions and the column.
  2. I have decided to keep in the typecasting changes because that seems logical for the Calcite layer, i.e. to go with whatever type has been explicitly mentioned by the user. One advantage of this is that LONG won't get typecasted to TIMESTAMP and functions that operate on LONG won't get affected (which was not the case with the original commit in this PR).

// of inconsistency is that functions such as TIME_PARSE evaluate incorrectly
Optional<ColumnType> timestampColumnTypeOptional = signature.getColumnType(ColumnHolder.TIME_COLUMN_NAME);
if (timestampColumnTypeOptional.isPresent() && !timestampColumnTypeOptional.get().equals(ColumnType.LONG)) {
throw new ISE("Unable to use EXTERN function with data containing a __time column of any type other than long");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw new ISE("Unable to use EXTERN function with data containing a __time column of any type other than long");
throw new ISE("EXTERN function with __time column can be used when __time column is of type long. Please change the column name to something other than __time ");

Copy link
Contributor

@cryptoe cryptoe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor exception message changes.
LGTM

// of inconsistency is that functions such as TIME_PARSE evaluate incorrectly
Optional<ColumnType> timestampColumnTypeOptional = signature.getColumnType(ColumnHolder.TIME_COLUMN_NAME);
if (timestampColumnTypeOptional.isPresent() && !timestampColumnTypeOptional.get().equals(ColumnType.LONG)) {
throw new ISE("EXTERN function with __time column can be used when __time column is of type long. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean this, right?

Suggested change
throw new ISE("EXTERN function with __time column can be used when __time column is of type long. "
throw new ISE("EXTERN function with __time column can be used when __time column is not of type long. "

Copy link
Contributor Author

@LakshSingla LakshSingla Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the original one should be correct. We are only allowing EXTERN function with __time columns iff they are of type long. Because of the special handling of __time column in the cursors, the forced typcasting would produce incorrect results if it is not of type long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exception is thrown when type(__time)!=long. Hence either we use double negation or let the message be as it is no ?

@kfaraz kfaraz merged commit 2e616e6 into apache:master Jul 26, 2022
@kfaraz
Copy link
Contributor

kfaraz commented Jul 26, 2022

Merged as build failure was unrelated.

@paul-rogers
Copy link
Contributor

paul-rogers commented Jul 29, 2022

A bit late to the party on this one. Hit this when merging code. I'm not sure the fixes are quite right.

The fix in ExternalTableMacro essentially restricts the columns that can appear in an external file. Yet, users have no control over the existing data. It is not the job of a random CSV or JSON file to have a __time column of the form needed by Druid. That's the job of the mapping layer.

Then, in DruidTable, we also check. But, again, that's the wrong place. Again, it is not the job of the table metadata to conform.

The proper place for this kind of check is in the validator. But, since INSERT queries are not yet validated, the next best place is in the DruidPlanner where we're about to hand the DruidSqlInsert node off to the QueryMaker. At that point, we can check if the user has done the proper mapping.

The result is that I should be able to have a CSV file with a __time column as a string and do:

INSERT INTO foo
SELECT TIME_PARSE("__time") AS __time, ...
FROM TABLE(...)

To be clear how SQL works: the first __time is in the name space of the input table row signature. The second one, in the AS clause, is in the name space of the row signature of the SELECT statement. Since we match-by-name for INSERT, this is also the column we'd insert into the target data source, foo.

Looks like the code is trying to guess when to cast __time to the right type? This would require that __time either be BIGINT, or be in a parsable string format. The BIGINT is safe, but the string casting (if we do it), is risky because the user does not explicitly state the format, and formats are ambiguous.

If we did want to do this, the function to get the row type, getRowType() isn't quite the right place: the validator is.

@LakshSingla
Copy link
Contributor Author

Thanks for the comment @paul-rogers. The original behavior of the code (Calcite layer) was to cast any column with the name __time to TIMESTAMP for evaluation purposes. The modified behavior of the code checks whether the table was an external table or not, and takes a call depending on that.
The issue with the query in the example:

INSERT INTO foo
SELECT TIME_PARSE("__time") AS __time, ...
FROM TABLE(...)

is that even this would fail because ExternalDataSource would provide the __time column (the left one) to the Calcite layer for execution, which would be interpreted as TIMESTAMP, instead of whatever the user intended it to be. This would cause a Calcite exception because TIME_PARSE will expect a STRING input and not a TIMESTAMP input. By changing the Macro, we are providing the "hint" to the execution layer that we want this column to be interpreted as whatever was provided by the user.

@abhishekagarwal87 abhishekagarwal87 added this to the 24.0.0 milestone Aug 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants