Thread panics on read operations of FST set file #57

Closed
davidblewett opened this issue Feb 12, 2018 · 22 comments

Comments

@davidblewett

I'm seeing a very small percentage of "corrupt" FST set files that are triggering panics in Rust (leading to a Python interpreter segfault). The errors look like:

thread '<unnamed>' panicked at 'index out of bounds: the len is 17498006 but the index is 15336395951936096993', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17

thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17

thread '<unnamed>' panicked at 'index out of bounds: the len is 16285338 but the index is 3532794445444415790', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                      

This occurs on approximately 13 out of 4635 files, ranging in size from 20 MB to over 100 MB. I have not been able to narrow it down further, but wanted to ask what might cause this.

I'm shelling out to the fst-bin crate to combine multiple input files into larger files, then doing set operations on the merged output. The fst binary was built on Rust nightly; I'm not sure of the exact version at the moment.

@davidblewett
Author

Trying to use fst csv edges or fst csv nodes results in:

fatal runtime error: allocator memory exhausted
Illegal instruction (core dumped)

@davidblewett
Author

I can provide a sample, broken file (and a sample of a functional file) if you like.

@BurntSushi
Owner

Interesting bug! Yeah I definitely need a sample to reproduce. Could you also show output with RUST_BACKTRACE=1?

Also, the Python interpreter should not be segfaulting. The ffi layer should be catching panics and converting them to aborts. Otherwise it is UB.
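As a rough illustration of catching a panic at the FFI boundary, here is a minimal sketch using only the standard library. The function `checked_byte_at` is hypothetical and is not part of fst or its Python binding. It returns an error code rather than aborting so that the behavior can be observed; calling `std::process::abort()` in the `Err` arm would give the abort behavior described above.

```rust
use std::panic;

/// Hypothetical FFI entry point (not part of fst): returns the byte at
/// `index`, or -1 if the lookup panicked, e.g. on an out-of-bounds index.
///
/// # Safety
/// `data` must point to `len` readable bytes.
pub unsafe extern "C" fn checked_byte_at(data: *const u8, len: usize, index: usize) -> i32 {
    let bytes = unsafe { std::slice::from_raw_parts(data, len) };
    // The panic is caught here so it never unwinds into the C caller.
    let result = panic::catch_unwind(|| bytes[index]);
    match result {
        Ok(byte) => i32::from(byte),
        // Assumption: the caller treats -1 as "lookup failed".
        Err(_) => -1,
    }
}
```

The default panic hook still prints the panic message to stderr; only the unwinding is stopped at the boundary.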

@davidblewett
Author

backtrace:

thread '<unnamed>' panicked at 'index out of bounds: the len is 89225255 but the index is 15119944950614189002', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/fst-0.3.0/src/raw/node.rs:306:17                     
stack backtrace:                                        
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace                                                 
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49                                                   
   1: std::sys_common::backtrace::_print                
             at libstd/sys_common/backtrace.rs:71       
   2: std::panicking::default_hook::{{closure}}         
             at libstd/sys_common/backtrace.rs:59       
             at libstd/panicking.rs:380                 
   3: std::panicking::default_hook                      
             at libstd/panicking.rs:396                 
   4: std::panicking::rust_panic_with_hook              
             at libstd/panicking.rs:576                 
   5: std::panicking::begin_panic                       
             at libstd/panicking.rs:537                 
   6: std::panicking::begin_panic_fmt                   
             at libstd/panicking.rs:521                 
   7: rust_begin_unwind                                 
             at libstd/panicking.rs:497                 
   8: core::panicking::panic_fmt                        
             at libcore/panicking.rs:71                 
   9: core::panicking::panic_bounds_check
             at libcore/panicking.rs:58
  10: fst::raw::Fst::node
  11: <fst::raw::StreamBuilder<'f, A> as fst::stream::IntoStreamer<'a>>::into_stream
  12: fst_set_streambuilder_finish
  13: ffi_call_unix64
             at ../src/x86/unix64.S:76
  14: ffi_call
             at ../src/x86/ffi64.c:525
  15: cdata_call
             at c/_cffi_backend.c:3025
  16: _PyObject_FastCallDict
             at Objects/abstract.c:2331
  17: call_function
             at Python/ceval.c:4848
  18: _PyEval_EvalFrameDefault
             at Python/ceval.c:3322
  19: _PyFunction_FastCall
             at Python/ceval.c:4906
  20: _PyFunction_FastCallDict
             at Python/ceval.c:5008
  21: _PyObject_FastCallDict
             at Objects/abstract.c:2310
  22: _PyObject_Call_Prepend
             at Objects/abstract.c:2373
  23: PyObject_Call
             at Objects/abstract.c:2261
  24: call_method.constprop.53
             at Objects/typeobject.c:1453
  25: _PyEval_EvalFrameDefault
             at Python/ceval.c:1510
  26: _PyEval_EvalCodeWithName
             at Python/ceval.c:4153
  27: PyEval_EvalCodeEx                                 
             at Python/ceval.c:4174                     
  28: PyEval_EvalCode                                   
             at Python/ceval.c:730                      
  29: PyRun_InteractiveOneObjectEx                      
             at Python/pythonrun.c:1025                 
             at Python/pythonrun.c:246                  
  30: PyRun_InteractiveLoopFlags                        
             at Python/pythonrun.c:114                  
  31: PyRun_AnyFileExFlags                              
             at Python/pythonrun.c:75                   
  32: Py_Main                                           
             at Modules/main.c:338                      
             at Modules/main.c:809                      
  33: main                                              
             at ./Programs/python.c:69                  
  34: __libc_start_main                                 
  35: <unknown>                                         
fatal runtime error: failed to initiate panic, error 5
Aborted (core dumped)                                   

# echo $?
134

@davidblewett
Author

Working on getting the files somewhere.

@davidblewett
Author

@BurntSushi : here are the files: https://drive.google.com/file/d/1xs9NSIEU2yEDoUg0Vl2qEtL06hOvo9f1/view?usp=sharing .

Let me know when you get them so I can unshare them. The data in them should be anonymized to not be a problem, but would like to limit exposure.

@BurntSushi
Owner

@davidblewett Thanks! I downloaded it. Won't be able to look into this until a bit later, hopefully today, but no promises. Also, in the future, if it's a concern, you can email me files if they are small enough. jamslam@gmail.com

Out of curiosity (in case it becomes relevant), how did you build these files? Did you use fst 0.3.0?

@davidblewett
Author

davidblewett commented Feb 12, 2018

Yes, fst 0.3.0. The process looks like:

  1. Build array in memory
  2. Pass array to fst set --sorted - -
  3. Write output to file, compress, upload to S3
  4. Read from S3, decompress, accumulate files for 5 minutes
  5. fst union ... ... temp_dir/output && mv temp_dir/output foo.fst

Steps 1-4 are actually in Ruby; 5 in Python.
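The write-to-a-temporary-path-then-move pattern from step 5 can be sketched in Rust with the standard library alone. The function name `write_then_rename` and the file names are illustrative assumptions, not part of the pipeline described here. The idea is that readers globbing for finished files never see a partially written one, provided the temporary path and the destination are on the same filesystem.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `bytes` to `tmp`, flushes them to disk, then renames `tmp` to `dest`.
/// The destination only ever appears with complete contents.
pub fn write_then_rename(tmp: &Path, dest: &Path, bytes: &[u8]) -> io::Result<()> {
    let mut file = File::create(tmp)?;
    file.write_all(bytes)?;
    // Make sure the data has reached the disk before the file becomes visible.
    file.sync_all()?;
    fs::rename(tmp, dest)
}
```

Keeping the temporary file outside the directory (or naming pattern) that readers scan avoids picking up unfinished output.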

@davidblewett
Author

@BurntSushi could it be step 2, both reading from stdin and writing to stdout that breaks in some circumstances? Should I be outputting to a tempfile on disk?

@BurntSushi
Owner

BurntSushi commented Feb 12, 2018

@davidblewett I don't think so. Is it possible to provide the original array of strings that produced one of these FSTs? That would help debugging on my end.

@davidblewett
Author

@BurntSushi : Unfortunately, it would be fairly involved to try to track that down. I don't have the telemetry for what files were combined in step 5 above. I'm going to purge the files that are exhibiting this behavior, and resume the aggregation process. If it occurs again, I might be able to track that down.

@BurntSushi
Owner

@davidblewett No worries, thanks! I'll see what I can do with what I have. :)

@davidblewett
Author

I've added some telemetry to our process that should allow me to reconstruct the input data. Will let you know if I see it happen again.

@davidblewett
Author

@BurntSushi : after letting this run for a few days, I have not been able to reproduce the error. It's possible that the sample file here hadn't been completely finalized, and was accidentally included in a globbing expression to load finished files.

@davidblewett
Author

@BurntSushi this has reared its ugly head again. I'm pretty sure it's a sequence of events in my larger application that ends up writing out corrupt FST data. However, would you be opposed to using .get(x) on arrays instead of [x]? It would be a fairly invasive change, but would allow a clean way to recover in the face of invalid data.

Alternatively, perhaps some form of check method could be added that can validate the structure of a given FST file?

@BurntSushi
Owner

@davidblewett Hmmm... So I haven't had a chance to look into this. There's unfortunately a large amount of context switching overhead required to dive into fst internals and debug this kind of thing.

I think using get(x) instead of [x] is probably not the right direction to take. It would make the code incredibly noisy. In particular, it's not just about writing get(x) instead of [x], but actually doing case analysis everywhere. And when we do try to access an out-of-bounds index, it's not clear to me that we could do much better than panic anyway.
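To make the "case analysis everywhere" point concrete, here is a small sketch of what checked access looks like. It is not fst's node layout: the function `follow_pointer` and the 4-byte little-endian address format are assumptions made up for the illustration. Each access that could go out of bounds needs its own failure branch, and the caller then has to handle `None` as well.

```rust
/// Hypothetical decoder (not fst's format): reads a little-endian u32
/// "address" at `pos`, then returns the byte that address points to.
pub fn follow_pointer(data: &[u8], pos: usize) -> Option<u8> {
    // Failure branch 1: `pos + 4` could overflow.
    let end = pos.checked_add(4)?;
    // Failure branch 2: the 4 address bytes could run past the end.
    let raw = data.get(pos..end)?;
    let addr = u32::from_le_bytes([raw[0], raw[1], raw[2], raw[3]]) as usize;
    // Failure branch 3: the decoded address could point outside the data.
    data.get(addr).copied()
}
```

With plain indexing the same logic is two lines, which is roughly the trade-off being weighed here.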

A check method or at least some kind of optional checksum seems like a better path to me. In theory, a full check method wouldn't be too bad, but that's only if you use the existing code that deserializes the FST, which I think is the actual problem here, so that doesn't help. A checksum would probably work well to weed out corrupt FST files, but would not fix any issues arising from a non-corrupt but incorrect FST file (e.g., if the builder is producing incorrect data as opposed to some external thing preventing a complete FST from being written).

I'm not too sure what I'll have time for in the short term here unfortunately. How severe would you rate this bug?

@davidblewett
Author

The incidence rate has gone down in the last few days. I like the idea of a checksum, but how would you verify it? Could it be something that is always the "longest" chain in that specific FST?

@BurntSushi
Owner

@davidblewett Naively, a checksum would be written at the end of the FST, and it would correspond to a crc32c sum of all previous bytes in the FST. If they don't line up, then you have pretty high confidence that the FST has been corrupted somehow. The checksum would not however help you if the panics you're seeing are a result of a bug in the FST builder itself.
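A minimal sketch of the trailing-checksum idea, using only the standard library. The function names, the 4-byte little-endian trailer, and the bitwise CRC-32C routine are assumptions for illustration; fst 0.3 does not write such a trailer, and a real implementation would more likely use a table-driven or hardware-accelerated CRC.

```rust
/// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
pub fn crc32c(bytes: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &b in bytes {
        crc ^= u32::from(b);
        for _ in 0..8 {
            crc = if crc & 1 == 1 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Returns `payload` followed by its CRC-32C as 4 little-endian bytes.
pub fn append_checksum(payload: &[u8]) -> Vec<u8> {
    let mut out = payload.to_vec();
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out
}

/// Returns the payload if the trailing checksum matches, otherwise `None`.
pub fn verify_checksum(file: &[u8]) -> Option<&[u8]> {
    if file.len() < 4 {
        return None;
    }
    let (payload, trailer) = file.split_at(file.len() - 4);
    let stored = u32::from_le_bytes([trailer[0], trailer[1], trailer[2], trailer[3]]);
    if crc32c(payload) == stored { Some(payload) } else { None }
}
```

A reader would call `verify_checksum` on the file bytes before handing the payload to the FST parser. This catches truncated or bit-flipped files but, as noted above, not a builder that wrote wrong data in the first place.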

@davidblewett
Author

@BurntSushi I believe I discovered the root of the issue. I don't think the FST files were corrupt; the panics were caused by doing multiple operations in different threads on the same handle. Since the Python binding is basically C, those actions don't trigger any warnings.

In pure Rust, are the Set structs not re-entrant? We have a few hundred thousand FST segments, so need to be careful to not go over the operating system limit on the number of open mmap'd files. If we spawned new Set instances in different threads, we could easily go over that limit if we had multiple concurrent requests for a customer that had tens of thousands of segments.

@BurntSushi
Owner

@davidblewett The FST sets themselves are purely immutable and can be inspected from multiple threads/processes simultaneously without issue. However, the streams produced by FSTs require mutable access and are themselves not legal to access from multiple threads simultaneously without explicit synchronization. Of course, you can have as many streams operating simultaneously as you want. You just need to make sure you're only accessing each stream from one thread.
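The sharing rule can be sketched with a std-only analogue. This is not the fst API: an `Arc<Vec<String>>` stands in for the immutable set, and each thread's iterator stands in for a stream. The function names are hypothetical. The shared data is opened once and read from every thread, while each thread creates and consumes its own iterator.

```rust
use std::sync::Arc;
use std::thread;

/// Counts keys with the given prefix. Each call builds its own iterator,
/// which plays the role of a per-thread stream.
pub fn count_with_prefix(keys: &[String], prefix: &str) -> usize {
    keys.iter().filter(|k| k.starts_with(prefix)).count()
}

/// Runs one query per thread against the same shared, immutable key list.
pub fn count_in_threads(keys: Arc<Vec<String>>, prefixes: Vec<String>) -> Vec<usize> {
    let handles: Vec<_> = prefixes
        .into_iter()
        .map(|prefix| {
            // Cloning the Arc shares the data; nothing is copied or reopened.
            let keys = Arc::clone(&keys);
            thread::spawn(move || count_with_prefix(&keys, &prefix))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}
```

In safe Rust the borrow checker rejects sharing a single mutable iterator across threads without synchronization; across a C or Python binding that check is not available, so the caller has to enforce it.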

@DiSToAGe

Perhaps I have found your problem ...?

In a personal Rust project, I wrote an fst from a sorted list of strings. Then I tried to open it with fst-bin and run a regex on it, but I got almost the same error:

thread 'main' panicked at 'index out of bounds: the len is 267524517 but the index is 148671801088665313', (...)github.com-1ecc6299db9ec823/fst-0.3.3/src/raw/node.rs:307:17

Then I realised there was an error in my code that writes the fst. I used

  1. SetBuilder::new(file)
  2. build.extend_iter(lines.into_iter())
    => but I forgot the "build.finish()" at the end ...

If you don't call "finish()", the fst file is a little smaller than it should be, and opening it gives the error above.
I don't know if it is the same problem as yours ...?

PS: after correcting my code, it creates a correct fst file and the problem no longer occurs.

@BurntSushi
Owner

I'm going to close this issue because it seems like there isn't a problem with FST reading/writing itself. Happy to dig into this more with a better reproduction.
