Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Fix issue with dtype parameter in the read_csv method. #8387

Merged
merged 33 commits into from
Sep 17, 2023

Conversation

Tishj
Copy link
Contributor

@Tishj Tishj commented Jul 27, 2023

This PR fixes #8209, #8421, #8434

Relations are bound twice, one to get the names+types, and the second time to execute.
the dtype parameter was only handled after the first bind, so its effects were not visible in the columns and types attributes of the produced relation.

Also now we only auto_detect during the first phase, capturing the options detected by the sniffer and forwarding them to the second bind as named parameters.

i.e if a quote was detected, we forward this option as read_csv(quote=<detected_quote>)

Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

I had a quick look at the problem and I think it goes a bit deeper than what this fix addresses. I think the problem should actually be solved in ReadCSVRelation. The issue is here in the constructor:

ReadCSVRelation::ReadCSVRelation(const std::shared_ptr<ClientContext> &context, const string &csv_file,
                                 BufferedCSVReaderOptions options, string alias_p)
    : TableFunctionRelation(context, "read_csv_auto", {Value(csv_file)}, nullptr, false), alias(std::move(alias_p)),
      auto_detect(true) {

	if (alias.empty()) {
		alias = StringUtil::Split(csv_file, ".")[0];
	}

	// Force auto_detect for this constructor
	options.auto_detect = true;
	BufferedCSVReader reader(*context, std::move(options));

	auto &types = reader.GetTypes();
	auto &names = reader.GetNames();
	for (idx_t i = 0; i < types.size(); i++) {
		columns.emplace_back(names[i], types[i]);
	}

	AddNamedParameter("auto_detect", Value::BOOLEAN(true));
}

The options that are set get passed in to the BufferedCSVReader, which then uses those options to run auto-detection and obtain the columns and types. This means only flags set in the BufferedCSVReaderOptions are considered for the initial binding. However, any parameters provided here are subsequently lost.

Afterwards, the actual table function is constructed and AddNamedParameter is called to add new parameters as named parameters. These parameters are then considered for the actual execution of read_csv.

All of this is to say this is quite messy - we have two avenues of providing options (BufferedCSVReaderOptions and AddNamedParameter) and both affect the relation in different ways. In addition, auto-detection is run multiple times unnecessarily.

Ideally we would only have one way of providing options (AddNamedParameter - which could be used with the BufferedCSVReaderOptions using the SetReadOption function). The auto-detection would run once when creating the relation, and the found columns and types would be saved in the relation (by for example adding a dtypes=[...] parameter). This would prevent the auto-detection from running again in subsequent iterations.

@Tishj
Copy link
Contributor Author

Tishj commented Jul 27, 2023

Yea I agree this is really messy, and error-prone.

It's not immediately clear how to fix this in the ReadCSVRelation, but I'll have a look soon using the helpful pointers you provided 👍

@Mytherin
Copy link
Collaborator

Mytherin commented Jul 27, 2023

My suggested fix would be:

  • Make ReadCSVRelation take as parameter an unordered_set<string, Value> instead of a BufferedCSVReaderOptions
  • Use AddReadOption to populate the BufferedCSVReaderOptions for the auto-detection pass
  • In the ReadCSVRelation itself - call AddNamedParameter with the provided parameters (in unordered_set<string, Value>), plus whatever was found during the auto-detection process (i.e. also populate dtypes, header, delimiter, with what was found during auto-detection) and set auto-detection to false

Then in the Python method only populate the unordered_set<string, Value>, instead of populating both BufferedCSVReaderOptions and using AddNamedParameter

@Mytherin
Copy link
Collaborator

Mytherin commented Aug 1, 2023

I think #8421 is related to this as well

@github-actions github-actions bot marked this pull request as draft August 2, 2023 12:11
@Tishj Tishj marked this pull request as ready for review August 5, 2023 17:24
Copy link
Collaborator

@Mytherin Mytherin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! LGTM now

@github-actions github-actions bot marked this pull request as draft August 7, 2023 12:59
@Tishj Tishj marked this pull request as ready for review September 11, 2023 19:25
@github-actions github-actions bot marked this pull request as draft September 12, 2023 08:25
@Tishj Tishj marked this pull request as ready for review September 14, 2023 21:26
@github-actions github-actions bot marked this pull request as draft September 15, 2023 05:33
@Tishj Tishj marked this pull request as ready for review September 15, 2023 17:23
@Tishj
Copy link
Contributor Author

Tishj commented Sep 17, 2023

@Mytherin I think this can also be merged

CI both here and on my fork passed
I also just merged with main and there are no conflicts

@Mytherin Mytherin merged commit 52a47a6 into duckdb:main Sep 17, 2023
50 checks passed
@Mytherin
Copy link
Collaborator

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Python] relation.types and relation.columns values incorrect after creating relation with read_csv()
2 participants