Skip to content

PARQUET-1581: [C++] Fix undefined behavior in encoding.cc #4336

Closed
emkornfield wants to merge 3 commits intoapache:masterfrom
emkornfield:parquet_san
Closed

PARQUET-1581: [C++] Fix undefined behavior in encoding.cc #4336
emkornfield wants to merge 3 commits intoapache:masterfrom
emkornfield:parquet_san

Conversation

@emkornfield
Copy link
Copy Markdown
Contributor

No description provided.

@wesm
Copy link
Copy Markdown
Member

wesm commented May 17, 2019

Do you know how this UB could occur?

@emkornfield
Copy link
Copy Markdown
Contributor Author

Hmm, this came up after cleaning up unaligned access UBSan errors, but I think had a small bug in that cleanup and reverted. I'll see if I can reproduce without it but will close this for now until I can (I also need to open a thread about UBSan issues on the mailing list).

@emkornfield
Copy link
Copy Markdown
Contributor Author

I haven't following the code paths fully but n DictEncoding/0.CheckDecodeArrowUsingDenseBuilder
when I log:

  541     ARROW_LOG(INFO) << "encoder num entries: " << dict_encoder->num_entries();
  542     plain_decoder_->SetData(dict_encoder->num_entries(), dict_buffer_->data(),
  543                             static_cast<int>(dict_buffer_->size()));

I see

I0517 12:58:39.690502 14292 encoding-test.cc:541] encoder num entries: 99
I0517 12:58:39.692281 14292 encoding-test.cc:541] encoder num entries: 99
I0517 12:58:39.693528 14292 encoding-test.cc:541] encoder num entries: 0

I believe the last entry exercises this code path.

@emkornfield emkornfield reopened this May 17, 2019
@emkornfield
Copy link
Copy Markdown
Contributor Author

If the above doesn't sound right I can dig further.

@wesm
Copy link
Copy Markdown
Member

wesm commented May 17, 2019

OK. I'm interested in the UBSAN report to see what the UB is actually occurring, could we set up a docker-compose build for this?

@emkornfield
Copy link
Copy Markdown
Contributor Author

sorry should have clarified this was a null reference error. ive updated the PR with a more specific fix.

@emkornfield
Copy link
Copy Markdown
Contributor Author

I'll open up follow-up JIRAs per discussion on the ML to get UBsan integrated with the build

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #4336 into master will increase coverage by 0.96%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4336      +/-   ##
==========================================
+ Coverage   88.26%   89.23%   +0.96%     
==========================================
  Files         777      632     -145     
  Lines       97953    86551   -11402     
  Branches     1251        0    -1251     
==========================================
- Hits        86461    77232    -9229     
+ Misses      11256     9319    -1937     
+ Partials      236        0     -236
Impacted Files Coverage Δ
cpp/src/parquet/encoding.cc 93.84% <100%> (ø) ⬆️
go/arrow/ipc/writer.go
go/arrow/math/uint64_amd64.go
go/arrow/memory/memory_avx2_amd64.go
go/arrow/ipc/file_reader.go
js/src/enum.ts
go/arrow/array/builder.go
js/src/Arrow.node.ts
js/src/schema.ts
go/arrow/type_traits_boolean.go
... and 138 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e68ca7f...37c9ef0. Read the comment docs.

@emkornfield
Copy link
Copy Markdown
Contributor Author

I opened up https://issues.apache.org/jira/browse/ARROW-5365 to track adding in UBSan and ASAN to CI (I thought we were already doing ASAN though?)

Copy link
Copy Markdown
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 20, 2019

I thought we were already doing ASAN though?

No we are not. We had Valgrind runs but we disabled them because of regressions when refactoring the CMake build system AFAICT.

@pitrou
Copy link
Copy Markdown
Member

pitrou commented May 20, 2019

Note: null pointers are generally a PITA with C++ undefined behaviour rules. You often need to special-case the null case, even when it seems things should be alright as the size is zero. @bkietz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants