Add locks for de-/compressor + initialize decompressor #397

Closed
wants to merge 1 commit into from

Conversation

@svilupp svilupp commented Mar 12, 2023

Fixes #396

This PR proposes to:

- add locks around the shared per-thread compressors/decompressors, so a stateful codec is never used by two tasks at once
- initialize decompressors upfront, rather than creating one for each record batch

Acquiring locks introduces some additional overhead (I was not confident we could use `SpinLock`s given we traverse several packages and call into C code, so I opted for a slightly slower `ReentrantLock`).
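The locking pattern in question can be sketched as follows; the names here are illustrative stand-ins, not Arrow.jl's actual internals:

```julia
# Illustrative sketch (hypothetical names): guard each per-thread stateful
# compressor with its own ReentrantLock so no two tasks use it concurrently.
mutable struct FakeCompressor  # stands in for e.g. an LZ4FrameCompressor
    calls::Int
end

const COMPRESSORS = [FakeCompressor(0) for _ in 1:Threads.nthreads()]
const COMPRESSOR_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

function compress_safely!(i::Int)
    # Hold the lock for the whole operation: even if a task migrates to
    # another thread mid-operation, the compressor is never shared.
    lock(COMPRESSOR_LOCKS[i]) do
        COMPRESSORS[i].calls += 1  # mutate state, as a real codec would
    end
end
```

A `SpinLock` would be cheaper to acquire, but since the guarded region calls into C and other packages (and may block), a `ReentrantLock` is the safer choice.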

I've created a minimal workload (100 partitions with 1 row each) to measure mostly the lock overhead:

- I couldn't detect any slowdown in writing speed.
- Since we're initializing decompressors upfront, we get c. 30% speed-up when reading (no need to initialize a decompressor for each record batch).

```julia
using Arrow, Tables, Random
using BenchmarkTools  # for @btime

# Speed test
N = 100    # number of rows (one partition each)
len = 100  # length of each random string
t = Tables.rowtable((; x1 = map(x -> randstring(len), 1:N)));
fn = "test_partitioned.arrow"

## Before
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.067 ms (12912 allocations: 13.24 MiB)

# Read:
@btime t = Arrow.Table($fn);
# 718.458 μs (9305 allocations: 533.62 KiB)

## After
# Write:
@btime Arrow.write($fn, Iterators.partition($t, 1); compress = :lz4)
# 2.021 ms (12987 allocations: 13.24 MiB)

# Read:
@btime t = Arrow.Table($fn);
# 461.083 μs (9712 allocations: 543.52 KiB)
```

@@ -35,14 +35,19 @@ function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=g
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)"
@baumgold baumgold Mar 13, 2023


This function should probably be broken down into a few smaller functions for clarity:

```julia
toarrowvector(a::ArrowVector, compression::Nothing, kw...) = a

function toarrowvector(a::ArrowVector, compression::AbstractVector{LZ4FrameCompressor}, kw...)
    tid = Threads.threadid()
    lock(LZ4_FRAME_COMPRESSOR_LOCK[tid]) do
        compress(Meta.CompressionTypes.LZ4_FRAME, compression[tid], a)
    end
end

function toarrowvector(a::ArrowVector, compression::LZ4FrameCompressor, kw...)
    compress(Meta.CompressionTypes.LZ4_FRAME, compression, a)
end

function toarrowvector(a::ArrowVector, compression::AbstractVector{ZstdCompressor}, kw...)
    tid = Threads.threadid()
    lock(ZSTD_COMPRESSOR_LOCK[tid]) do
        compress(Meta.CompressionTypes.ZSTD, compression[tid], a)
    end
end

function toarrowvector(a::ArrowVector, compression::ZstdCompressor, kw...)
    compress(Meta.CompressionTypes.ZSTD, compression, a)
end

function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
    @debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(kw.data)"
    @debugv 3 x
    A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
    A = toarrowvector(A, compression)
    @debugv 2 "converted top-level column to arrow format: $(typeof(A))"
    @debugv 3 A
    return A
end
```

```julia
const LZ4_FRAME_DECOMPRESSOR = LZ4FrameDecompressor[]
const ZSTD_DECOMPRESSOR = ZstdDecompressor[]
# add locks for multithreaded (de)compression (we index by threadid, which might not be safe on Julia > 1.8)
const LZ4_FRAME_COMPRESSOR_LOCK = ReentrantLock[]
```

A single thread can only do one thing at a time so maybe we can have a single lock per thread to be shared amongst all compressors/decompressors?
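A minimal sketch of this suggestion, with illustrative names: one `ReentrantLock` per thread slot, shared by every compressor and decompressor used from that slot:

```julia
# One lock per thread slot, shared by all codecs on that slot: a single
# thread only runs one codec operation at a time, so one lock suffices.
const THREAD_CODEC_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

# Run any codec operation `f` (LZ4/ZSTD, compress/decompress) for slot
# `tid` while holding that slot's shared lock.
function with_codec_lock(f, tid::Int)
    lock(THREAD_CODEC_LOCKS[tid]) do
        f()
    end
end
```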

```julia
push!(ZSTD_COMPRESSOR_LOCK, ReentrantLock())
# Decompressors
zstdd = ZstdDecompressor()
CodecZstd.TranscodingStreams.initialize(zstdd)
```
@baumgold baumgold Mar 13, 2023


Maybe we should be a bit lazier about initializing these as I believe they're quite expensive. Shall we initialize them only as needed? For example, if a user is only decompressing (reading) data then there's no reason to pay the penalty of initializing compressors (and vice versa). Same idea for LZ4 vs ZSTD - if a user only uses one then no need to initialize both.
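The lazy-initialization idea could look roughly like this; a placeholder `Codec` type stands in for an expensive codec such as a `ZstdDecompressor`:

```julia
# Create a codec only the first time a thread slot needs it, instead of
# eagerly initializing LZ4 and ZSTD codecs for every thread upfront.
struct Codec end  # placeholder for an expensive-to-initialize codec

const CODEC_SLOTS = Union{Nothing,Codec}[nothing for _ in 1:Threads.nthreads()]
const CODEC_SLOT_LOCKS = [ReentrantLock() for _ in 1:Threads.nthreads()]

function get_codec(tid::Int)
    lock(CODEC_SLOT_LOCKS[tid]) do
        if CODEC_SLOTS[tid] === nothing
            CODEC_SLOTS[tid] = Codec()  # pay the init cost only on first use
        end
        CODEC_SLOTS[tid]
    end
end
```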

@quinnj quinnj commented May 25, 2023

It seems @svilupp wasn't able to see this through, so I've cleaned up their code here: #445. I can confirm that I was able to reproduce the original segfault on Apple M1 and that the code in that PR resolves the issue. I believe that PR should also address @baumgold's comments from this PR.

@quinnj quinnj closed this May 25, 2023
quinnj added a commit that referenced this pull request May 30, 2023
Fixes #396.

As noted in the originally reported issue, enabling debug logging when
writing arrow data with compression can result in segfaults: the
underlying Codec* packages also emit debug logs, causing task
switches/migrations and thus making the pattern of using a single
`X_COMPRESSOR` array indexed by `Threads.threadid()` unsafe, since
multiple threads may end up using the same compressor at the same time.

We fix this by wrapping each compressor in a `Lockable` and ensuring the
`compress` (or `uncompress`) operation holds the lock for the duration
of the operation. We also:
* Add a decompressor per thread to avoid recreating them over and over
during reading
* Lazily initialize compressors/decompressors in a way that is 1.9+ safe
and only creates the object when needed by a specific thread
* Switch from WorkerUtilities -> ConcurrentUtilities (the package was
renamed)
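The `Lockable` wrapper mentioned above pairs a resource with its lock; a hand-rolled sketch of the pattern (self-contained here; `ConcurrentUtilities.jl` provides an equivalent type, as does `Base.Lockable` on Julia 1.11+):

```julia
# Hand-rolled `Lockable` sketch: pair a stateful resource with a lock and
# only expose the resource while the lock is held.
struct MyLockable{T}
    value::T
    lock::ReentrantLock
end
MyLockable(value) = MyLockable(value, ReentrantLock())

# Run `f` on the wrapped value while holding the lock (mirrors holding the
# lock for the duration of a compress/uncompress operation).
withlock(f, l::MyLockable) = lock(() -> f(l.value), l.lock)

compressor = MyLockable(Ref(0))  # Ref stands in for a stateful compressor
withlock(c -> c[] += 1, compressor)
```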

Successor to #397; I've added
@svilupp as a co-author here since they started the original movement
for the code to go in this direction.

---------

Co-authored-by: J S <49557684+svilupp@users.noreply.github.com>
Successfully merging this pull request may close these issues: Error/Segfault when writing many partitions