Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[C#] Arrow R/C++ hangs reading binary file generated by C# #23030

Closed
asfimport opened this issue Sep 25, 2019 · 41 comments
Closed

[C#] Arrow R/C++ hangs reading binary file generated by C# #23030

asfimport opened this issue Sep 25, 2019 · 41 comments
Assignees
Milestone

Comments

@asfimport
Copy link

I get random hangs on arrow_read in R (windows) when using a very large file (10-12gb). (the file has 37 columns)

I have memory dumps - All threads seem to be in wait handles.

Are there debug symbols somewhere? 

Is there a way to get the C++ code to produce diagnostic logging from R? 

 

UPDATE: it seems that the hangs are not related to file size, row counts, or # of record batches, but rather the number of columns

Reporter: Anthony Abate / @abbotware
Assignee: Eric Erhardt / @eerhardt

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-6682. Please see the migration documentation for further details.

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
I have a 150 meg file that i generated (with the C# library) random data and it has 50 columns and it hangs on (almost) every load!

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
See the attached file

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Hi, would you mind sharing the code you're running that's hanging, as well as additional system information?

All source code is available at https://github.com/apache/arrow.

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
start_time <- Sys.time()
start_memory <- memory.size()

library(arrow)

dfcs <- read_arrow("e:
Generated_4000Batch_50Columns_100Rows_PerBatch.arrow")

end_memory <- memory.size()
end_time <- Sys.time()

print(end_memory)
end_time - start_time
end_memory - start_memory

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
code above (its trivial)

system 8 cores  (virtual)
64 gigs ram
windows 10

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Note that the C# library isn't being integration tested with the other implementations. So it's possible there's a flaw in the C# implementation. The issues you are describing aren't present on data generated by the C++ library

@asfimport
Copy link
Author

Wes McKinney / @wesm:
cc @eerhardt

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
It loads 'sometimes' - so sounds like threading issues?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
It would be best to try to reproduce the issue with data generated from R rather than C#. If the issue can't be reproduced endogenously (i.e. using binary files generated by R via the C++ library) then we know the issue lies with the C# library.

What is the schema of the file?

@asfimport
Copy link
Author

Eric Erhardt / @eerhardt:
@abbotware  - can you also share the C# code you used to create the .arrow file? I assume the issue is in the C# arrow writer, so having C# code to reproduce the problem would be best.

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Works on my (virtual) machine:

> system.time(tab <- read_arrow("Generated_4000Batch_50Columns_100Rows_PerBatch.arrow"))
   user  system elapsed 
   1.10    3.96    8.69 
> dim(tab)
[1] 400000     50
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] arrow_0.14.1.9000

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 bit_1.1-14       compiler_3.6.0   magrittr_1.5     assertthat_0.2.1 R6_2.4.0        
 [7] tools_3.6.0      glue_1.3.1       Rcpp_1.0.1       bit64_0.9-7      rlang_0.3.4      purrr_0.3.2     
> 

Would you mind installing a nightly build of the R package and see if whatever you're seeing has been resolved in the master branch?

install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@wesm  - Also, I can generate many other files from the C# libraries that no problems being loaded

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson - how many times did you try to load it? - i get it to fail 4 out of every 5 times

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
I'm 5 for 5 so far.

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
if the file is 'bad' shouldn't that easily be determined by examining the attached file? 

 

 

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson- ok sounds like you have no problems with the file - let me try that latest package and I'll let you know

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson  - seems like the same problem:

other than
install.packages("arrow", repos="https://dl.bintray.com/ursalabs/arrow-r")
do i need to do anything else to use the dev package?

is there a version number I can print out via runtime to make sure im using the new code?

@asfimport
Copy link
Author

Wes McKinney / @wesm:
How do I decompress the attached file (I'm on Linux)?

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@wesm - I made a zip version of the file

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
sessionInfo() will show the relevant versions. The nightly package build includes the date in the version string.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
The file core dumps for me in Python in a debug build

In [3]: pa.ipc.open_file('Generated_4000Batch_50Columns_100Rows_PerBatch.arrow').read_all()    
../src/arrow/ipc/reader.cc:683:  Check failed: BitUtil::IsMultipleOf8(block.metadata_length) 
/home/wesm/local/lib/libarrow.so.15(_ZN5arrow4util7CerrLog14PrintBackTraceEv+0x35)[0x7f4a9cd494dd]
/home/wesm/local/lib/libarrow.so.15(_ZN5arrow4util7CerrLogD1Ev+0x5b)[0x7f4a9cd4945f]
/home/wesm/local/lib/libarrow.so.15(_ZN5arrow4util7CerrLogD0Ev+0x18)[0x7f4a9cd49480]
/home/wesm/local/lib/libarrow.so.15(_ZN5arrow4util8ArrowLogD1Ev+0x57)[0x7f4a9cd492c1]
/home/wesm/local/lib/libarrow.so.15(+0x18e9a60)[0x7f4a9d34fa60]
/home/wesm/local/lib/libarrow.so.15(+0x18ea26b)[0x7f4a9d35026b]
/home/wesm/local/lib/libarrow.so.15(_ZN5arrow3ipc21RecordBatchFileReader15ReadRecordBatchEiPSt10shared_ptrINS_11RecordBatchEE+0x48)[0x7f4a9d349f76]
/home/wesm/code/arrow/python/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so(+0x2a4115)[0x7f4a9e088115]
/home/wesm/code/arrow/python/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so(+0x2a21e4)[0x7f4a9e0861e4]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyMethodDef_RawFastCallKeywords+0x141)[0x56025cd81651]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x56025cd89cdf]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x4cbc)[0x56025cdeedac]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyEval_EvalCodeEx+0x44)[0x56025cd307e4]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyEval_EvalCode+0x1c)[0x56025cd3080c]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(+0x1e0c70)[0x56025cdf9c70]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyMethodDef_RawFastCallKeywords+0xe9)[0x56025cd815f9]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyCFunction_FastCallKeywords+0x21)[0x56025cd81891]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x47d4)[0x56025cdee8c4]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyGen_Send+0x2a2)[0x56025cd8aea2]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x1acc)[0x56025cdebbbc]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyGen_Send+0x2a2)[0x56025cd8aea2]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x1acc)[0x56025cdebbbc]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyGen_Send+0x2a2)[0x56025cd8aea2]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyMethodDef_RawFastCallKeywords+0x8c)[0x56025cd8159c]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyMethodDescr_FastCallKeywords+0x4f)[0x56025cd89cdf]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x4cbc)[0x56025cdeedac]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x56025cd80cfb]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x416)[0x56025cdea506]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x56025cd80cfb]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x6f0)[0x56025cdea7e0]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0x387)[0x56025cd80f87]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x14dc)[0x56025cdeb5cc]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0x325)[0x56025cd80f25]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x6f0)[0x56025cdea7e0]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0x325)[0x56025cd80f25]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x6f0)[0x56025cdea7e0]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0xfb)[0x56025cd80cfb]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x6f0)[0x56025cdea7e0]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallDict+0x400)[0x56025cd30c20]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyObject_Call_Prepend+0x63)[0x56025cd4fe23]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyObject_Call+0x6e)[0x56025cd4251e]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x1f4c)[0x56025cdec03c]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyFunction_FastCallKeywords+0x387)[0x56025cd80f87]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalFrameDefault+0x416)[0x56025cdea506]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_PyEval_EvalCodeWithName+0x2f9)[0x56025cd2f929]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyEval_EvalCodeEx+0x44)[0x56025cd307e4]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyEval_EvalCode+0x1c)[0x56025cd3080c]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(+0x22fac4)[0x56025ce48ac4]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyRun_FileExFlags+0xa1)[0x56025ce52db1]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(PyRun_SimpleFileExFlags+0x1c3)[0x56025ce52fa3]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(+0x23b0bf)[0x56025ce540bf]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(_Py_UnixMain+0x3c)[0x56025ce541dc]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f4acfb0bb97]
/home/wesm/miniconda/envs/arrow-3.7/bin/python(+0x1e0d3d)[0x56025cdf9d3d]
Aborted

These issues would be caught in integration tests. All of this points to a problem in the C# library

@asfimport
Copy link
Author

Eric Erhardt / @eerhardt:
@wesm- can you tell which block isn't a multiple of 8? I did fix this in https://issues.apache.org/jira/browse/ARROW-5908, which made it into 0.14.1. But obviously could have missed a spot.

@abbotware  - which version of the C# library were you using to produce the file? was it 0.14.1 or earlier?

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@eerhardt- i used the current nuget version + some code i wrote to build null support

 

@wesm @nealrichardson  - this looks/behaves like a threading issue - I don't get any hanging if i reduce the VM to 1 core (not ideal)

(I can't explain the core dumps though)

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
If the file is bad - Id expect the R library to fail the same as python if they both use the same underlying C++ code.  However, I don't know the R / Python bindings / code.

I would point out that I was able to validate a 30 million row x 37 column data set produced by C# in R including the null support I added.  

The only indication of any issue was a very rare hang on first use of the library in R Studio - if it didn't hang the first time, i was able to do many file loads of 10gb without issue

I was attempting to narrow down that rare hang when it seemed to be a column width issue

@asfimport
Copy link
Author

Wes McKinney / @wesm:
The file is invalid. The check that fails above only occurs in debug builds, so release builds are probably failing in some other way due to a bad memory access that results in non-deterministic failure. I'll look at a release build now since we ideally should not crash on invalid files. Probably the invalid file is causing your VM to swap due to memory overallocation, and the swapping is more likely to occur with multiple threads

Once we have a valid file to look at we can begin to diagnose other issues that come up.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
I don't see valgrind warnings on Linux when accessing the file from Python on the master branch in a release build.

I tried reading the file 100 times in release builds in Python on both 0.14.1 and master on Linux and had no issues. With R I tested the master branch and also read the file more than 10 times. So definitely inconclusive...

I'll look out to learn more if you're able to reproduce the issue and track down what might be going wrong

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
I can repro the problem fairly consistently - I can get more info if needed:

> sessionInfo()R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] arrow_0.14.1.20190925

loaded via a namespace (and not attached):
[1] tidyselect_0.2.5 bit_1.1-14 compiler_3.6.1 magrittr_1.5 assertthat_0.2.1 R6_2.4.0 tools_3.6.1
[8] glue_1.3.1 Rcpp_1.0.2 bit64_0.9-7 rlang_0.4.0 purrr_0.3.2

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson how many cores was your test vm using?

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
Just one

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
the other thing I should point out:  I am using a new rscript.exe process each time:  That way I know for certain the dll is unloaded and reinitialized without any static init related code.

this is the script runner code I am using:

 

(powershell script)

$rpath = "C:\Program Files \R-3.6.1\bin\Rscript.exe"
$rscript = "arrow.benchmark.r"

{{For ($i=0; $i -le 10000; $i++) }}
  Write-Output "run: $i"
  $stopwatch = [system.diagnostics.stopwatch]::StartNew()

  & $rpath --no-save --no-restore --verbose $rscript > c: emp\outputFile.Rout 2>&1

  $stopwatch.Elapsed.TotalSeconds
}

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson - I can't reproduce the issue on a single core, but I can on two cores - can you try a vm with two cores?

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
I also uploaded the exact script files / script loop runner

 

@asfimport
Copy link
Author

Eric Erhardt / @eerhardt:
I see the issue, we are writing the File Footer's Block metaDataLength and appending the length of the 4 or 8 byte size to metaDataLength.

 

https://github.com/apache/arrow/blob/15fca3d0169479fc4d21619b22515d95cd264fc5/csharp/src/Apache.Arrow/Ipc/ArrowStreamWriter.cs#L361-L364

 

checked
{
return _*options.SizeOfIpcLength* + messageData.Length + messagePaddingLength;
}

 

SizeOfIpcLength can either be 4 or 8. In 0.14.1 it was always 4, which means the metaDataLength won't be a multiple of 8, becausemessageData.Length + messagePaddingLength is a  multiple of 8 and we are adding 4 to it.

 

The doc https://arrow.apache.org/docs/format/IPC.html#recordbatch-body-structure clearly states this should be a multiple of 8, and it includes the length prefix.

 

I am working on a fix.

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
It sounds like there might be more than 1 issue here:

  • that malformed file

  • the hanging on R

    It would be troubling if a malformed file can take down / crash the library... (ie a DOS Exploit) 

    When trying to use an out of date C# Feather library in R I did get some indication that the file was invalid:  (C# -> R Files seem invalid kevin-montrose/FeatherDotNet#7

    Is there a way to validate the integrity of the arrow file on open?  (ie check offsets, padding, etc)  - might be slower, but when opening a file from an unknown source, could be safer.

    Regarding the hanging - There does seem to be some threadpool options for the C++ code, but I don't know how to access them in R

@asfimport
Copy link
Author

Neal Richardson / @nealrichardson:
You can try options(arrow.use_threads=FALSE)

cf. 

option_use_threads <- function() {
!is_false(getOption("arrow.use_threads"))
}

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@nealrichardson - setting that option may be a workaround for now
I am not sure what the threads do since there seems to be no performance difference - at least in the read_arrow function

@asfimport
Copy link
Author

Wes McKinney / @wesm:
The threads parallelize conversion from an Arrow table to an R data.frame / tibble.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
Issue resolved by pull request 5520
#5520

@asfimport
Copy link
Author

Anthony Abate / @abbotware:
@wesm - I understand that the file generation is fixed on the C# side, but isnt a malformed file taking down the library another problem?

@eerhardt - Is there a pre-release nuget that I can test out? 

@asfimport
Copy link
Author

Eric Erhardt / @eerhardt:

 Is there a pre-release nuget that I can test out? 

No, you would have to build one yourself. It looks like this change will be included in 0.15 according to the RC2 vote that was opened this morning.

@asfimport
Copy link
Author

Wes McKinney / @wesm:
@abbotware it is a problem, but we haven't really figured out what the problem is. I haven't been able to reproduce the issue so I'm not sure how to proceed

@asfimport asfimport added this to the 0.15.0 milestone Jan 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants