[Python] Fix issue with `dtype` parameter in the `read_csv` method. #8387

Tishj · 2023-07-27T09:54:02Z

Relations are bound twice: once to get the names and types, and a second time to execute.
The `dtype` parameter was only handled after the first bind, so its effects were not visible in the `columns` and `types` attributes of the produced relation.

We now also run auto-detection only during the first bind, capturing the options detected by the sniffer and forwarding them to the second bind as named parameters.

For example, if a quote character was detected, we forward it as read_csv(quote=<detected_quote>).
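
A minimal sketch of this forwarding idea in plain Python (not DuckDB's implementation): a toy sniffer runs once, and whatever it detects is passed on to the second bind as explicit named parameters. The helpers `sniff_dialect` and `second_bind_params` are hypothetical, and the detection logic is deliberately simplistic.

```python
def sniff_dialect(sample: str) -> dict:
    """Toy stand-in for a CSV sniffer: guess delimiter and quote character."""
    header = sample.splitlines()[0]
    # Pick the candidate delimiter that occurs most often in the header line.
    delimiter = max([",", ";", "|", "\t"], key=header.count)
    # Pick the first quote character that appears anywhere in the sample, if any.
    quote = next((q for q in ('"', "'") if q in sample), None)
    detected = {"delimiter": delimiter}
    if quote is not None:
        detected["quote"] = quote
    return detected


def second_bind_params(user_params: dict, sample: str) -> dict:
    """Merge sniffed options with user options; user-supplied values win."""
    params = {**sniff_dialect(sample), **user_params}
    # The second bind must not sniff again, so auto-detection is switched off.
    params["auto_detect"] = False
    return params


sample = 'id|name\n1|"a|b"\n'
params = second_bind_params({"header": True}, sample)
print(params)
```

Because the detected values travel as ordinary named parameters, the second bind sees exactly the options the first bind settled on.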

Mytherin

Thanks for the PR!

I had a quick look at the problem and I think it goes a bit deeper than what this fix addresses. I think the problem should actually be solved in ReadCSVRelation. The issue is here in the constructor:

ReadCSVRelation::ReadCSVRelation(const std::shared_ptr<ClientContext> &context, const string &csv_file,
                                 BufferedCSVReaderOptions options, string alias_p)
    : TableFunctionRelation(context, "read_csv_auto", {Value(csv_file)}, nullptr, false), alias(std::move(alias_p)),
      auto_detect(true) {

	if (alias.empty()) {
		alias = StringUtil::Split(csv_file, ".")[0];
	}

	// Force auto_detect for this constructor
	options.auto_detect = true;
	BufferedCSVReader reader(*context, std::move(options));

	auto &types = reader.GetTypes();
	auto &names = reader.GetNames();
	for (idx_t i = 0; i < types.size(); i++) {
		columns.emplace_back(names[i], types[i]);
	}

	AddNamedParameter("auto_detect", Value::BOOLEAN(true));
}

The options that are set get passed in to the BufferedCSVReader, which then uses those options to run auto-detection and obtain the columns and types. This means only flags set in the BufferedCSVReaderOptions are considered for the initial binding. However, any parameters provided here are subsequently lost.

Afterwards, the actual table function is constructed and AddNamedParameter is called to add new parameters as named parameters. These parameters are then considered for the actual execution of read_csv.

All of this is to say this is quite messy - we have two avenues of providing options (BufferedCSVReaderOptions and AddNamedParameter) and both affect the relation in different ways. In addition, auto-detection is run multiple times unnecessarily.
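
The divergence between the two channels can be illustrated with a small Python model. This is a hypothetical sketch, not DuckDB code: `ToyRelation`, `reader_options`, and `named_parameters` are invented names standing in for the two ways of providing options.

```python
class ToyRelation:
    """Models a relation whose schema and execution read from different option stores."""

    def __init__(self, detected_types: dict):
        self.detected_types = detected_types  # what a sniffer would infer
        self.reader_options = {}    # channel 1: consulted when computing the schema
        self.named_parameters = {}  # channel 2: consulted when executing

    def schema_types(self) -> dict:
        # First bind: only channel 1 is visible here.
        return {**self.detected_types, **self.reader_options.get("dtype", {})}

    def execution_types(self) -> dict:
        # Second bind: only channel 2 is visible here.
        return {**self.detected_types, **self.named_parameters.get("dtype", {})}


rel = ToyRelation({"id": "BIGINT", "zip": "BIGINT"})
# The override is written to one channel only, mirroring the reported bug.
rel.named_parameters["dtype"] = {"zip": "VARCHAR"}

print(rel.schema_types())     # the reported schema ignores the override
print(rel.execution_types())  # the executed query applies it
```

With a single option store, the two methods could not disagree, which is the motivation for the cleanup discussed in this thread.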

Ideally we would only have one way of providing options (AddNamedParameter - which could be used with the BufferedCSVReaderOptions using the SetReadOption function). The auto-detection would run once when creating the relation, and the detected columns and types would be saved in the relation (for example by adding a dtypes=[...] parameter). This would prevent the auto-detection from running again in subsequent iterations.

Tishj · 2023-07-27T15:37:02Z

Yea I agree this is really messy, and error-prone.

It's not immediately clear how to fix this in the ReadCSVRelation, but I'll have a look soon using the helpful pointers you provided 👍

Mytherin · 2023-07-27T16:13:17Z

My suggested fix would be:

- Make ReadCSVRelation take an unordered_map<string, Value> as its parameter instead of a BufferedCSVReaderOptions
- Use AddReadOption to populate the BufferedCSVReaderOptions for the auto-detection pass
- In the ReadCSVRelation itself, call AddNamedParameter with the provided parameters (from the unordered_map<string, Value>), plus whatever was found during auto-detection (i.e. also populate dtypes, header, and delimiter with the detected values), and set auto-detection to false

Then in the Python method, populate only the unordered_map<string, Value>, instead of both populating BufferedCSVReaderOptions and calling AddNamedParameter.
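
The steps above could look roughly like this in Python. It is a sketch under assumptions, not the merged implementation: `toy_sniff` and `build_relation_params` are hypothetical helpers, a plain dict stands in for the parameter map, and the type inference only distinguishes integers from strings.

```python
def toy_sniff(rows: list, options: dict) -> dict:
    """Single auto-detection pass: infer column types, honoring user dtype overrides."""
    header, *data = rows
    overrides = options.get("dtype", {})
    dtypes = {}
    for i, name in enumerate(header):
        if name in overrides:
            dtypes[name] = overrides[name]
        elif data and all(row[i].lstrip("-").isdigit() for row in data):
            dtypes[name] = "BIGINT"
        else:
            dtypes[name] = "VARCHAR"
    return {"header": True, "dtype": dtypes}


def build_relation_params(user_params: dict, rows: list) -> dict:
    """One source of truth: user parameters plus sniffed results, auto-detection off."""
    sniffed = toy_sniff(rows, user_params)
    # Sniffed values fill the gaps; anything the user set explicitly is kept.
    merged = {**sniffed, **user_params}
    # The sniffed dtype map already contains the user's overrides, so store it whole.
    merged["dtype"] = sniffed["dtype"]
    merged["auto_detect"] = False
    return merged


rows = [["id", "zip"], ["1", "01234"], ["2", "98765"]]
params = build_relation_params({"dtype": {"zip": "VARCHAR"}}, rows)
print(params)
```

Since the final parameter set carries the complete type map and disables auto-detection, both the schema and the execution can be derived from the same values.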

Mytherin · 2023-08-01T08:35:24Z

I think #8421 is related to this as well

Mytherin

Thanks for the changes! LGTM now

Tishj · 2023-09-17T10:25:18Z

@Mytherin I think this can also be merged

CI both here and on my fork passed
I also just merged with main and there are no conflicts

Mytherin · 2023-09-17T10:26:20Z

Thanks!

Tishj added 8 commits July 11, 2023 11:34

write the dtype setting into 'options' so it's available at bind time, so the 'types' of the relation are accurate

bbfec9c

also add it to post-bind options because we need it when determining the result

049b17d

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

1b6d06e

increase uncovered files

f3080c4

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

07803e5

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

d17684a

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

c982775

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

6863ab1

Mytherin reviewed Jul 27, 2023

Tishj added 2 commits July 28, 2023 10:55

working on de-duplicating the options

90cef83

convert to and from named_parameter_map_t and buffered csv reader options

084d0e8

Tishj added 2 commits August 2, 2023 13:48

populate named parameters, then convert that into BufferedCSVReaderOptions, sniff, then convert the buffered csv reader options back to named parameters

c2c07c8

fix the issues with unrecognized or faulty submitted named parameter values

072d24b

github-actions bot marked this pull request as draft August 2, 2023 12:11

Tishj added 3 commits August 3, 2023 09:20

format, update uncovered files

ed1a900

add back support for globbing

d78d2a2

add test for read_csv globbing

d536967

Tishj mentioned this pull request Aug 4, 2023

Python Client read_csv() ignores Delimiter, Quote and Escape Arguments #8421

Closed

2 tasks

Tishj added 4 commits August 4, 2023 11:28

add test from issue duckdb#8421

d7989d6

format

173dcdc

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

b69a3d3

fix coverage issues

a870de3

Tishj marked this pull request as ready for review August 5, 2023 17:24

Merge branch 'master' into python_readcsv_types

59f7956

Mytherin approved these changes Aug 7, 2023

github-actions bot marked this pull request as draft August 7, 2023 12:59

Merge remote-tracking branch 'upstream/master' into python_readcsv_types

4e6edd2

Merge branch 'python_readcsv_types' of https://github.com/Tishj/duckdb into python_readcsv_types

2bc4e9d

Tishj mentioned this pull request Aug 24, 2023

using read_parquet() with a * inside a google colab does not load #8457

Closed

1 task

Tishj added 6 commits August 24, 2023 15:07

Merge branch 'main' into python_readcsv_types

c053b15

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

69568b3

order result in 'test_read_csv_glob' to be deterministic

40a4545

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

da71ec1

these methods are only used in python's 'read_csv' method, which is not run as part of coverage

64e0fcd

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

ae3300a

Tishj mentioned this pull request Sep 9, 2023

[python] read_csv() doesn't support the "names" or "auto_detect" parameters that the CLI/SQL function supports #8857

Closed

1 task

Tishj added 2 commits September 11, 2023 08:57

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

852bc68

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

e7bb41a

Tishj marked this pull request as ready for review September 11, 2023 19:25

fix up uncovered files, remove dead code

7877e80

github-actions bot marked this pull request as draft September 12, 2023 08:25

Merge remote-tracking branch 'upstream/main' into python_readcsv_types

38f05a4

Tishj marked this pull request as ready for review September 14, 2023 21:26

update uncovered_files

cfb335e

github-actions bot marked this pull request as draft September 15, 2023 05:33

Tishj marked this pull request as ready for review September 15, 2023 17:23

Mytherin merged commit 52a47a6 into duckdb:main Sep 17, 2023
50 checks passed
