-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Python] Fix issue with dtype
parameter in the read_csv
method.
#8387
Conversation
…, so the 'types' of the relation are accurate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
I had a quick look at the problem and I think it goes a bit deeper than what this fix addresses. I think the problem should actually be solved in ReadCSVRelation
. The issue is here in the constructor:
ReadCSVRelation::ReadCSVRelation(const std::shared_ptr<ClientContext> &context, const string &csv_file,
BufferedCSVReaderOptions options, string alias_p)
: TableFunctionRelation(context, "read_csv_auto", {Value(csv_file)}, nullptr, false), alias(std::move(alias_p)),
auto_detect(true) {
if (alias.empty()) {
alias = StringUtil::Split(csv_file, ".")[0];
}
// Force auto_detect for this constructor
options.auto_detect = true;
BufferedCSVReader reader(*context, std::move(options));
auto &types = reader.GetTypes();
auto &names = reader.GetNames();
for (idx_t i = 0; i < types.size(); i++) {
columns.emplace_back(names[i], types[i]);
}
AddNamedParameter("auto_detect", Value::BOOLEAN(true));
}
The options that are set get passed in to the BufferedCSVReader
, which then uses those options to run auto-detection and obtain the columns and types. This means only flags set in the BufferedCSVReaderOptions
are considered for the initial binding. However, any parameters provided here are subsequently lost.
Afterwards, the actual table function is constructed and AddNamedParameter
is called to add new parameters as named parameters. These parameters are then considered for the actual execution of read_csv
.
All of this is to say this is quite messy - we have two avenues of providing options (BufferedCSVReaderOptions
and AddNamedParameter
) and both affect the relation in different ways. In addition, auto-detection is run multiple times unnecessarily.
Ideally we would only have one way of providing options (AddNamedParameter
- which could be used with the BufferedCSVReaderOptions
using the SetReadOption
function). The auto-detection would run once when creating the relation, and the found columns and types would be saved in the relation (by for example adding a dtypes=[...]
parameter). This would prevent the auto-detection from running again in subsequent iterations.
Yea I agree this is really messy, and error-prone. It's not immediately clear how to fix this in the ReadCSVRelation, but I'll have a look soon using the helpful pointers you provided 👍 |
My suggested fix would be:
Then in the Python method only populate the |
I think #8421 is related to this as well |
…tions, sniff, then convert the buffered csv reader options back to named parameters
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes! LGTM now
…into python_readcsv_types
…ot run as part of coverage
@Mytherin I think this can also be merged CI both here and on my fork passed |
Thanks! |
This PR fixes #8209, #8421, #8434
Relations are bound twice, one to get the names+types, and the second time to execute.
the
dtype
parameter was only handled after the first bind, so its effects were not visible in thecolumns
andtypes
attributes of the produced relation.Also now we only
auto_detect
during the first phase, capturing the options detected by the sniffer and forwarding them to the second bind as named parameters.i.e if a quote was detected, we forward this option as
read_csv(quote=<detected_quote>)