Add locks for de-/compressor + initialize decompressor #397

Closed
wants to merge 1 commit into from

Conversation

@svilupp svilupp commented Mar 12, 2023

Fixes #396

This PR proposes to:

- add locks around the shared per-thread compressors/decompressors, so a stateful codec is never used by two tasks at once
- initialize decompressors upfront, rather than creating one for each record batch

Acquiring locks introduces some additional overhead (I was not confident we could use `SpinLock`s given we traverse several packages and call into C code, so I opted for a slightly slower `ReentrantLock`).
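The locking pattern in question can be sketched as follows; the names here are illustrative stand-ins, not Arrow.jl's actual internals:

```julia
# Illustrative sketch (hypothetical names): guard each per-thread stateful
# compressor with its own ReentrantLock so no two tasks use it concurrently.
mutable struct FakeCompressor  # stands in for e.g. an LZ4FrameCompressor
    calls::Int
end

const COMPRESSORS = [FakeCompressor(0) for _ in 1:Threads.nthreads()]
const COMPRESSOR_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

function compress_safely!(i::Int)
    # Hold the lock for the whole operation: even if a task migrates to
    # another thread mid-operation, the compressor is never shared.
    lock(COMPRESSOR_LOCKS[i]) do
        COMPRESSORS[i].calls += 1  # mutate state, as a real codec would
    end
end
```

A `SpinLock` would be cheaper to acquire, but since the guarded region calls into C and other packages (and may block), a `ReentrantLock` is the safer choice.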

I've created a minimal workload (100 partitions with 1 row each) to measure mostly the lock overhead:

- I couldn't detect any slowdown in writing speed.
- Since we're initializing decompressors upfront, we get c. 30% speed-up when reading (no need to initialize a decompressor for each record batch).

```julia
using Arrow, Tables, Random
using BenchmarkTools  # for @btime

# Speed test
N = 100    # number of rows (one partition each)
len = 100  # length of each random string
t = Tables.rowtable((; x1 = map(x -> randstring(len), 1:N)));
fn = "test_partitioned.arrow"

## Before
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.067 ms (12912 allocations: 13.24 MiB)

# Read:
@btime t = Arrow.Table($fn);
# 718.458 μs (9305 allocations: 533.62 KiB)

## After
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.021 ms (12987 allocations: 13.24 MiB)

# Read:
@btime t = Arrow.Table($fn);
# 461.083 μs (9712 allocations: 543.52 KiB)
```

@@ -35,14 +35,19 @@ function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=g
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)"
@baumgold baumgold Mar 13, 2023


This function should probably be broken down into a few smaller functions for clarity:

```julia
toarrowvector(a::ArrowVector, compression::Nothing, kw...) = a

function toarrowvector(a::ArrowVector, compression::AbstractVector{LZ4FrameCompressor}, kw...)
    tid = Threads.threadid()
    lock(LZ4_FRAME_COMPRESSOR_LOCK[tid]) do
        compress(Meta.CompressionTypes.LZ4_FRAME, compression[tid], a)
    end
end

function toarrowvector(a::ArrowVector, compression::LZ4FrameCompressor, kw...)
    compress(Meta.CompressionTypes.LZ4_FRAME, compression, a)
end

function toarrowvector(a::ArrowVector, compression::AbstractVector{ZstdCompressor}, kw...)
    tid = Threads.threadid()
    lock(ZSTD_COMPRESSOR_LOCK[tid]) do
        compress(Meta.CompressionTypes.ZSTD, compression[tid], a)
    end
end

function toarrowvector(a::ArrowVector, compression::ZstdCompressor, kw...)
    compress(Meta.CompressionTypes.ZSTD, compression, a)
end

function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
    @debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)"
    @debugv 3 x
    A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
    A = toarrowvector(A, compression)
    @debugv 2 "converted top-level column to arrow format: $(typeof(A))"
    @debugv 3 A
    return A
end
```

```julia
const LZ4_FRAME_DECOMPRESSOR = LZ4FrameDecompressor[]
const ZSTD_DECOMPRESSOR = ZstdDecompressor[]
# add locks for multithreaded (de)compression (we index by threadid, which might not be safe on Julia > 1.8)
const LZ4_FRAME_COMPRESSOR_LOCK = ReentrantLock[]
```

A single thread can only do one thing at a time so maybe we can have a single lock per thread to be shared amongst all compressors/decompressors?
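A minimal sketch of this suggestion, with illustrative names: one `ReentrantLock` per thread slot, shared by every compressor and decompressor used from that slot:

```julia
# One lock per thread slot, shared by all codecs on that slot: a single
# thread only runs one codec operation at a time, so one lock suffices.
const THREAD_CODEC_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

# Run any codec operation `f` (LZ4/ZSTD, compress/decompress) for slot
# `tid` while holding that slot's shared lock.
function with_codec_lock(f, tid::Int)
    lock(THREAD_CODEC_LOCKS[tid]) do
        f()
    end
end
```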

```julia
push!(ZSTD_COMPRESSOR_LOCK, ReentrantLock())
# Decompressors
zstdd = ZstdDecompressor()
CodecZstd.TranscodingStreams.initialize(zstdd)
```
@baumgold baumgold Mar 13, 2023


Maybe we should be a bit lazier about initializing these as I believe they're quite expensive. Shall we initialize them only as needed? For example, if a user is only decompressing (reading) data then there's no reason to pay the penalty of initializing compressors (and vice versa). Same idea for LZ4 vs ZSTD - if a user only uses one then no need to initialize both.
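The lazy-initialization idea could look roughly like this; a placeholder `Codec` type stands in for an expensive codec such as a `ZstdDecompressor`:

```julia
# Create a codec only the first time a thread slot needs it, instead of
# eagerly initializing LZ4 and ZSTD codecs for every thread upfront.
struct Codec end  # placeholder for an expensive-to-initialize codec

const CODEC_SLOTS = Union{Nothing,Codec}[nothing for _ in 1:Threads.nthreads()]
const CODEC_SLOT_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

function get_codec(tid::Int)
    lock(CODEC_SLOT_LOCKS[tid]) do
        if CODEC_SLOTS[tid] === nothing
            CODEC_SLOTS[tid] = Codec()  # pay the init cost only on first use
        end
        CODEC_SLOTS[tid]
    end
end
```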

@quinnj quinnj commented May 25, 2023

It seems @svilupp wasn't able to see this through, so I've cleaned up their code here: #445. I can confirm that I was able to reproduce the original segfault on Apple M1 and that the code in that PR resolves the issue. I believe that PR should also address @baumgold's comments from this PR.

@quinnj quinnj closed this May 25, 2023
quinnj added a commit that referenced this pull request May 30, 2023
Fixes #396.

As noted in the originally reported issue, enabling debug logging when
writing arrow data with compression can result in segfaults: the
underlying Codec* packages also emit debug logs, causing task
switches/migrations and thus making the pattern of using a single
`X_COMPRESSOR` array indexed by `Threads.threadid()` unsafe, since
multiple threads may end up using the same compressor at the same time.

We fix this by wrapping each compressor in a `Lockable` and ensuring the
`compress` (or `uncompress`) operation holds the lock for the duration
of the operation. We also:
* Add a decompressor per thread to avoid recreating them over and over
during reading
* Lazily initialize compressors/decompressors in a way that is 1.9+ safe
and only creates the object when needed by a specific thread
* Switch from WorkerUtilities -> ConcurrentUtilities (the package was
renamed)
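The `Lockable` wrapper mentioned above pairs a resource with its lock; a hand-rolled sketch of the pattern (self-contained here; `ConcurrentUtilities.jl` provides an equivalent type, as does `Base.Lockable` on Julia 1.11+):

```julia
# Hand-rolled `Lockable` sketch: pair a stateful resource with a lock and
# only expose the resource while the lock is held.
struct MyLockable{T}
    value::T
    lock::ReentrantLock
end
MyLockable(value) = MyLockable(value, ReentrantLock())

# Run `f` on the wrapped value while holding the lock (mirrors holding the
# lock for the duration of a compress/uncompress operation).
withlock(f, l::MyLockable) = lock(() -> f(l.value), l.lock)

compressor = MyLockable(Ref(0))  # Ref stands in for a stateful compressor
withlock(c -> c[] += 1, compressor)
```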

Successor to #397; I've added
@svilupp as a co-author here since they started the original movement
for the code to go in this direction.

---------

Co-authored-by: J S <49557684+svilupp@users.noreply.github.com>
Successfully merging this pull request may close these issues: Error/Segfault when writing many partitions