-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add locks for de-/compressor + initialize decompressor #397
Conversation
@@ -35,14 +35,19 @@ function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=g | |||
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should probably be broken down into a few smaller functions for clarity:
toarrowvector(a::ArrowVector, compression::Nothing, kw...) = a
function toarrowvector(a::ArrowVector, compression::AbstractVector{LZ4FrameCompressor}, kw...)
lock(LZ4_FRAME_COMPRESSOR_LOCK[Threads.threadid()]) do
compress(Meta.CompressionTypes.LZ4_FRAME, compression[tid], a)
end
end
function toarrowvector(a::ArrowVector, compression::LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
compress(Meta.CompressionTypes.LZ4_FRAME, compression, a)
end
function toarrowvector(a::ArrowVector, compression::AbstractVector{ZstdCompressor}, kw...)
compress(Meta.CompressionTypes.ZSTD, compression, a)
end
function toarrowvector(a::ArrowVector, compression::ZstdCompressor, kw...)
lock(ZSTD_COMPRESSOR_LOCK[Threads.threadid()]) do
compress(Meta.CompressionTypes.ZSTD, compression[tid], a)
end
end
function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)"
@debugv 3 x
A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
A = toarrowvector(A, compression)
@debugv 2 "converted top-level column to arrow format: $(typeof(A))"
@debugv 3 A
return A
end
const LZ4_FRAME_DECOMPRESSOR = LZ4FrameDecompressor[] | ||
const ZSTD_DECOMPRESSOR = ZstdDecompressor[] | ||
# add locks for multithreaded (de/)compression (because we index by threadid which might not be safe under Julia >1.8) | ||
const LZ4_FRAME_COMPRESSOR_LOCK = ReentrantLock[] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A single thread can only do one thing at a time so maybe we can have a single lock per thread to be shared amongst all compressors/decompressors?
push!(ZSTD_COMPRESSOR_LOCK, ReentrantLock()) | ||
# Decompressors | ||
zstdd = ZstdDecompressor() | ||
CodecZstd.TranscodingStreams.initialize(zstdd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should be a bit lazier about initializing these as I believe they're quite expensive. Shall we initialize them only as needed? For example, if a user is only decompressing (reading) data then there's no reason to pay the penalty of initializing compressors (and vice versa). Same idea for LZ4 vs ZSTD - if a user only uses one then no need to initialize both.
Fixes #396. As noted in the originally reported issue, enabling debug logging when writing arrow data with compression can result in segfaults because the underlying CodecX package have debug logs, causing task switches/migration and thus making the pattern of using a single `X_COMPRESSOR` array indexed by `Threads.threadid()` unsafe since multiple threads may try using the same compressor at the same time. We fix this by wrapping each compressor in a `Lockable` and ensuring the `compress` (or `uncompress`) operation holds the lock for the duration of the operation. We also: * Add a decompressor per thread to avoid recreating them over and over during reading * Lazily initialize compressors/decompressors in a way that is 1.9+ safe and only creates the object when needed by a specific thread * Switch from WorkerUtilities -> ConcurrentUtilities (the package was renamed) Successor to #397; I've added @svilupp as a co-author here since they started the original movement for the code to go in this direction. --------- Co-authored-by: J S <49557684+svilupp@users.noreply.github.com>
Fixes #396
This PR proposes to:
Threads.threadid()
by locking the compressor/decompressor (see Error/Segfault when writing many partitions #396)arrow-julia/src/Arrow.jl
Line 80 in 9b36c8b
Acquiring locks introduces some additional overhead (I was not confident we could use SpinLocks given we traverse several packages and into C-code, so I opted for a slightly slower ReentrantLock).
I've created a minimalistic workload (100 partitions with 1 row) to see mostly the lock overhead.
I couldn't detect any slowdown in writing speed.
Since we're initializing decompressors upfront, we get c. 30% speed up when reading (no need to initialize for each record batch)