Try to detect file format automatically during schema inference if it's unknown #59092
Conversation
This is an automated comment for commit 2c34a98 with a description of existing statuses. It's updated for the latest CI run. ❌
Force-pushed from d8f64bc to 93fbe1d.
A few small remarks.
{
    size_t rows_read = schema_reader->getNumRowsRead();
    assert(rows_read <= max_rows_to_read);
    max_rows_to_read -= schema_reader->getNumRowsRead();
Minor. Is this better? (suggestion below)
A question: `rows_read` can only increase. Is this correct if we pass here twice for the same file, or should that never happen?
Suggested change:
- max_rows_to_read -= schema_reader->getNumRowsRead();
+ max_rows_to_read -= rows_read;
}
/// We choose the format with larger number of columns in inferred schema. |
Is it really what we want? Consider a tab-separated file with some textual sentences that may contain commas. There might be a sentence with a lot of commas, which would correspond to a greater number of columns, and IMO that is not the "best" fit. Are schema readers smart enough to figure this out under the hood?
> Are schema readers smart enough to figure this out under the hood?

I would say yes. In your example we would be able to infer a schema from such a TSV file only if all of the first 25k rows had the same number of commas (the CSV schema reader requires the same number of columns in every row, otherwise it won't extract a schema at all), and this is unlikely to happen.
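To see why such misdetection is unlikely, here is a minimal standalone sketch of that consistency check (not the actual ClickHouse schema readers; `tryInferColumnCount` is a hypothetical stand-in): a format is only accepted when every sampled row splits into the same number of fields.

```cpp
#include <algorithm>
#include <iostream>
#include <optional>
#include <string>
#include <vector>

/// Hypothetical stand-in for a per-format schema reader: returns a column
/// count only if every sampled row splits into the same number of fields.
std::optional<size_t> tryInferColumnCount(const std::vector<std::string> & rows, char delimiter)
{
    std::optional<size_t> columns;
    for (const auto & row : rows)
    {
        size_t count = 1 + static_cast<size_t>(std::count(row.begin(), row.end(), delimiter));
        if (columns && *columns != count)
            return std::nullopt; /// Inconsistent row width -> this format is rejected.
        columns = count;
    }
    return columns;
}

int main()
{
    /// TSV rows whose free-text column contains a varying number of commas.
    std::vector<std::string> rows = {
        "1\tfoo, bar, baz",
        "2\tjust one, comma",
    };

    auto as_tsv = tryInferColumnCount(rows, '\t'); /// consistent: 2 columns
    auto as_csv = tryInferColumnCount(rows, ',');  /// 3 vs. 2 columns: rejected

    std::cout << "TSV: " << (as_tsv ? std::to_string(*as_tsv) : "rejected") << '\n'
              << "CSV: " << (as_csv ? std::to_string(*as_csv) : "rejected") << '\n';
}
```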
/// if we failed to detect the format, we failed to detect the schema of this file
/// in any format. It doesn't make sense to continue.
if (!format_name)
    throw Exception(ErrorCodes::CANNOT_DETECT_FORMAT, "The data format cannot be detected by the contents of the files. You can specify the format manually");
Can we specify the exact file name that caused the failure?
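Something along these lines would work (a sketch only; `file_name` stands for whatever path or key the storage layer has in scope at this point, which isn't shown in the snippet above):

```cpp
if (!format_name)
    throw Exception(
        ErrorCodes::CANNOT_DETECT_FORMAT,
        "The data format of file {} cannot be detected by its contents. You can specify the format manually",
        file_name); /// hypothetical: assumes the current file name is available here
```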
    storage_metadata.setColumns(columns);
}
else
{
    if (configuration.format == "auto")
        configuration.format = StorageS3::getTableStructureAndFormatFromData(configuration, format_settings, context_).second;
Is it okay that we still need to do all the parsing even if columns are explicitly specified?
But we still don't know the format, so we will need to do some parsing anyway.
Yes, with a known structure we could do something better, for example try format readers instead of schema inference: since we know the structure, that would find the right format for sure. But I considered this use case (unknown format but known structure) rare enough that I didn't want to implement separate logic for it.
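For reference, that alternative could look roughly like this (a self-contained sketch of the idea, not the real input-format API; `try_parse` is a hypothetical stand-in for running a format's reader over a data sample):

```cpp
#include <functional>
#include <optional>
#include <string>
#include <vector>

/// Sketch: when the table structure is already known, try to fully parse a
/// data sample with each candidate format's reader and return the first
/// format that parses cleanly, instead of running schema inference.
std::optional<std::string> detectFormatByParsing(
    const std::vector<std::string> & candidate_formats,
    const std::string & sample,
    const std::function<bool(const std::string &, const std::string &)> & try_parse)
{
    for (const auto & format : candidate_formats)
        if (try_parse(format, sample)) /// try_parse is hypothetical
            return format;
    return std::nullopt;
}
```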
Co-authored-by: Sergei Trifonov <svtrifonov@gmail.com>
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Try to detect file format automatically during schema inference if it's unknown in file/s3/hdfs/url/azureBlobStorage engines. Closes #50576.
Documentation entry for user-facing changes
Examples: