Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-14644 [R] open_dataset doesn't ignore BOM in csv file #11871

Conversation

dragosmg
Copy link
Contributor

@dragosmg dragosmg commented Dec 6, 2021

This draft PR only adds a failing unit test for the failure to skip BOMs when reading a CSV file with open_dataset().

@github-actions
Copy link

github-actions bot commented Dec 6, 2021

@github-actions
Copy link

github-actions bot commented Dec 6, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@wjones127
Copy link
Member

I've found the particular line where things are going wrong, but I'm not sure how to fix.

Column "a" isn't being added to convert options, as it is somehow triggering the condition on line 120.

for (auto field : scan_options->dataset_schema->fields()) {
if (materialized_fields.find(field->name()) == materialized_fields.end()) continue;
// Ignore virtual columns.
if (column_names.find(field->name()) == column_names.end()) continue;

I'm pretty baffled as it seems like the field name is clearly in the column set. Here's my debugger output from a breakpoint at line 120:

print column_names.find(field->name())
(std::unordered_set<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >::iterator) $0 = {
  __node_ = nullptr
}
print column_names
(std::unordered_set<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::hash<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >) $1 = size=1 {
  [0] = "a"
}
print field->name()
(const std::string) $2 = "a"
print (*column_names.begin())
(const std::__hash_const_iterator<std::__hash_node<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, void *> *>::value_type) $4 = "a"

(I altered the test so it just contains column "a" to make it easier to grab the value in the set; I get similar results with the original test that has both columns.)

I'm fairly sure this is the root problem because when I run column_names.insert(field->name()) at the breakpoint and continue, the test passes.

@wjones127
Copy link
Member

Found it: Looks like column_names still contains BOM, while the field names do not, hence the mismatch.

x/20bc &(*column_names.begin())
0x286780300: \xef\xbb\xbfa\0\0\0\0pb\xe0o\x01\0\0\0\xb0c\xe0o
x/20bc field->name().data()
0x28765f660: a\0\0\0\0\0\0p\x17'v(\0\0\0P\x02\0]\x80

@wjones127
Copy link
Member

@dragosmg Would you be willing to give me permission to push to your repo? I have a fix in C++ I can push to this branch.

@wjones127
Copy link
Member

@dragosmg nvm. Let's just close this in favor of #11892

@dragosmg dragosmg closed this Dec 7, 2021
@dragosmg dragosmg deleted the ARROW-14644_skip_BOM_in_CSV_file_with_open_dataset branch April 26, 2022 09:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants