
Memory Usage Improvement


1. Introduction

Some of the GitHub issues indicate that users run into out-of-memory problems when using cstore_fdw. Some of the related issues are:

  • Issue #19, where the user has problems loading data into a table with many columns,
  • Issue #13, where the user has problems running many concurrent queries on cstore_fdw.

Optimizing away every last bit of memory usage is a hard task. In this document we analyze how cstore_fdw uses memory and identify the easier methods we can use to utilize memory better.

2. Memory usage in current implementation

In the current implementation,

  • when writing, we batch all rows for a stripe and then flush them to disk.
  • when reading, we read all rows for a stripe at once.

Because of the way we read and write cstore_fdw tables, the memory used when reading or writing is a function of the stripe size. Users currently control the stripe size through the stripe_row_count option: increasing stripe_row_count increases memory usage, and decreasing it decreases memory usage.

The memory usage function seems to be of the form M(stripe_size) = coefficient * stripe_size. We calculated the coefficient for different values of stripe_row_count using the lineitem table of scale 1G:

-------------------------------------------------------
Stripe Row Count | Write Coefficient | Read Coefficient
-----------------+-------------------+-----------------
10K              | 6.8               | 5.2
20K              | 5.1               | 3.6
40K              | 4.2               | 2.9
80K              | 3.7               | 2.3
160K             | 3.5               | 2.3
320K             | 3.3               | 2.0 *
640K             | 3.3               | 2.2
1280K            | 3.3               | 2.2
-------------------------------------------------------

(*) This looks like an odd case. I double-checked it; I might have made an error. Investigating it in more detail didn’t seem worth it at the moment.

Based on the data in the above table, the write coefficient seems to converge to 3.3 and the read coefficient to 2.2. This means that when writing, cstore_fdw uses about 3.3 × stripe_size of memory, and when reading, it uses about 2.2 × stripe_size.

We dug further to find out what constitutes these 3.3 and 2.2 coefficients. The major components were:

  • Writing. 1.13 (serialized data buffer) + 1.18 (copy of datums) + 0.82 (array for storing datums) + (other minor structures)
  • Reading. 1.01 (data buffer read from file) + 1.06 (deserialized data buffer) + 0.11 (deserialized exists buffer) + (other minor structures)

3. Effect of compression

When reading from a compressed file, the coefficient stays 2.2. So the memory used is 2.2 times the uncompressed stripe size.

When writing to a compressed file, the coefficient goes from 3.3 to 3.8, i.e. it increases by 0.5 * uncompressed_stripe_size.

Our expectation was that memory usage would stay the same as in the uncompressed case, but we see an extra 0.5 here. The reason seems to be memory fragmentation (or some other scenario in which we have unused resident memory) caused by the allocations and deallocations done while compressing blocks (e.g. when the requested size cannot be served from the current free list, so new pages get requested).

I don’t have direct evidence for this, but the following log for two consecutive stripes suggests it:

INFO:  FlushStripe() started (resident: 422560 kb)
INFO:  exists buffer created (resident: 428268 kb)
INFO:  value buffer created (resident: 643284 kb)
INFO:  starting compression (resident: 643500 kb)
INFO:  compression finished (resident: 765728 kb)
INFO:  during compression, allocated: 195032 kb freed: 195001 kb
INFO:  during compression, palloc count: 2064, pfree count: 2048
INFO:  flushed (resident: 765812 kb)
INFO:  FlushStripe() started (resident: 422544 kb)
INFO:  exists buffer created (resident: 428252 kb)
INFO:  value buffer created (resident: 643284 kb)
INFO:  starting compression (resident: 643484 kb)
INFO:  compression finished (resident: 765712 kb)
INFO:  during compression, allocated: 195020 kb freed: 194990 kb
INFO:  during compression, palloc count: 2064, pfree count: 2048
INFO:  flushed (resident: 765788 kb)

In the above log, we see that the difference between the memory allocated and the memory freed during the compressions is only 31KB, but the resident memory increases by about 120MB during the compression.

Note that during the compression,

  • We allocate a buffer about the same size as the original buffer to store the compressed data. We don’t know the compression ratio beforehand, so we have to allocate a buffer of the largest possible size. The exact allocated size is:
    • #define PGLZ_MAX_OUTPUT(len) ((len) + 4 + sizeof(PGLZ_Header))
  • After the compression, we free the original buffer.
  • The 16 extra allocations are for the array that keeps the compression type for each of the columns.

If, instead of the above algorithm, we allocate a single temporary buffer that is large enough to hold the data of the largest single block, use it as the destination of all compressions, and during each compression:

  • compress into the temp buffer,
  • overwrite the original buffer with the data from the temp buffer,

i.e. do far fewer allocations and frees than before (see the sketch at the end of this section), then the difference between the resident memory before and after the compression disappears:

INFO:  FlushStripe() started (resident: 424404 kb)
INFO:  exists buffer created (resident: 429432 kb)
INFO:  value buffer created (resident: 633736 kb)
INFO:  starting compression (resident: 633896 kb)
INFO:  compression finished (resident: 633896 kb)
INFO:  during compression, allocated: 8 kb freed: 0 kb
INFO:  during compression, palloc count: 16, pfree count: 0
INFO:  flushed (resident: 633896 kb)
INFO:  FlushStripe() started (resident: 424404 kb)
INFO:  exists buffer created (resident: 429432 kb)
INFO:  value buffer created (resident: 633736 kb)
INFO:  starting compression (resident: 633868 kb)
INFO:  compression finished (resident: 633868 kb)
INFO:  during compression, allocated: 8 kb freed: 0 kb
INFO:  during compression, palloc count: 16, pfree count: 0
INFO:  flushed (resident: 633868 kb)

This is further evidence that we are getting memory fragmentation (or some other scenario in which we have unused resident memory) because of the many large allocations and deallocations.

To triple-check that we have unused resident memory, I called mallinfo() a few lines before returning from FlushStripe(). The uordblks field in the result (the total number of bytes in in-use allocations) was about 100MB less than the resident memory.
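
As a rough illustration of the alternative algorithm, here is a minimal sketch in C. It assumes a CompressBlock() helper that stands in for the actual pglz compression call and that each block’s serialized values live in a StringInfo; the names are illustrative, not the real cstore_fdw API.

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/pg_lzcompress.h"   /* for PGLZ_MAX_OUTPUT */

/* hypothetical stand-in for the pglz compression call */
bool CompressBlock(const char *source, Size sourceLength,
                   char *destination, Size *compressedSize);

static void
CompressStripeBlocks(StringInfo *valueBuffers, uint32 blockCount,
                     Size largestBlockSize)
{
    /* one reusable destination buffer, sized for the largest block */
    char *tempBuffer = palloc(PGLZ_MAX_OUTPUT(largestBlockSize));
    uint32 blockIndex = 0;

    for (blockIndex = 0; blockIndex < blockCount; blockIndex++)
    {
        StringInfo valueBuffer = valueBuffers[blockIndex];
        Size compressedSize = 0;

        /* compress into the shared temp buffer ... */
        if (!CompressBlock(valueBuffer->data, valueBuffer->len,
                           tempBuffer, &compressedSize))
        {
            continue;   /* block did not compress well; keep it as-is */
        }

        /* ... and overwrite the original buffer in place, so no new
           per-block buffer is allocated and no large buffer is freed */
        memcpy(valueBuffer->data, tempBuffer, compressedSize);
        valueBuffer->len = (int) compressedSize;
    }

    pfree(tempBuffer);
}

The important property is that the temp buffer is allocated once per stripe instead of once per block, so the allocator’s free lists are not churned by block-sized requests.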

4. How much memory does palloc malloc?

  • If requested_size <= 8KB, palloc rounds (48 bytes + requested_size) up to the next power of two,
  • If requested_size > 8KB, it allocates (48 bytes + requested_size),
  • Note that malloc() adds its own rounding and overhead on top of this.
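
As a worked example of these rules, here is a small sketch, assuming the 48-byte header and 8KB threshold quoted above (the exact figures depend on the PostgreSQL version and platform):

#include <stddef.h>

/* rough estimate of how much memory a palloc(requestedSize) really reserves */
static size_t
EstimatePallocSize(size_t requestedSize)
{
    size_t totalSize = requestedSize + 48;      /* per-chunk header overhead */

    if (requestedSize <= 8 * 1024)
    {
        /* small chunks: round up to the next power of two */
        size_t roundedSize = 1;
        while (roundedSize < totalSize)
            roundedSize *= 2;
        return roundedSize;
    }

    /* large chunks are passed through to malloc(), which adds its own overhead */
    return totalSize;
}

/* e.g. EstimatePallocSize(100) = 256, i.e. roughly 2.5x the requested size */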

5. High Level Improvements

5.1. stripe_max_size foreign table option

As we discussed earlier, memory usage is proportional to the uncompressed stripe size. So if users have control over the stripe size, they can control the memory usage too.

Currently cstore_fdw provides stripe_row_count for controlling the stripe size, but the user needs to know the average row width to decide which stripe_row_count corresponds to a given stripe size.

Our proposal here is to add a stripe_max_size option which limits the number of bytes in each uncompressed stripe.

5.2. max_estimated_memory_usage foreign table option

Similar to the stripe_max_size option, but we automatically calculate stripe_max_size based on our knowledge of the implementation and the user-provided maximum memory usage. For example, in the current implementation we could use:

stripe_max_size = max_estimated_memory_usage / 3.5
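
For example, if a user set max_estimated_memory_usage to 700MB, this formula would pick stripe_max_size = 200MB.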

5.3. Decreasing the default stripe_row_count

Our current default stripe_row_count is 150K. We can decrease this default to some extent (by about 50%) without hurting the performance of the TPC-H queries.

If we decide to do this, we should run more extensive benchmarks to find the minimum value at which performance matches stripe_row_count=150K.

6. Mid-Level Improvements

6.1. Keeping one copy of data

We currently store datums in two places:

  • array of datums,
  • serialized datums in a byte array.

Because we don’t need random access to the datums, we can remove the need for the first storage (i.e. the array of datums) and always read from and write to the byte array.

This applies both to reads and writes.
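
A minimal sketch of what the write path could look like under this change, assuming we append each datum straight into the stripe’s byte buffer; SerializeDatumDirectly() is an illustrative name, not the actual cstore_fdw function, and the sketch ignores the alignment and padding handling the real writer needs:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "utils/datum.h"

static void
SerializeDatumDirectly(StringInfo valueBuffer, Datum datum,
                       bool typeByValue, int typeLength)
{
    if (typeByValue)
    {
        /* by-value types fit inside the Datum itself; append its raw bytes */
        appendBinaryStringInfo(valueBuffer, (char *) &datum, typeLength);
    }
    else
    {
        /* by-reference types: copy the pointed-to bytes into the buffer */
        Size datumLength = datumGetSize(datum, typeByValue, typeLength);

        appendBinaryStringInfo(valueBuffer, DatumGetPointer(datum),
                               (int) datumLength);
    }
}

Reading would similarly walk this buffer sequentially, which is enough because we never need random access to the datums.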

6.2. Keeping data in compressed format (only applies to compressed cstore files)

We currently decompress the whole stripe’s data as soon as we read it from the file. We can optimize this and decompress the data only when we need it.

If we do this, at peak we will hold all of the compressed data plus the uncompressed data for one block, instead of the uncompressed data for all blocks of a stripe.

This applies both to reads and writes.
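
A minimal sketch of the read side under this scheme, assuming a hypothetical DecompressBlock() helper in place of the actual pglz decompression call:

#include "postgres.h"
#include "lib/stringinfo.h"

/* hypothetical stand-in for the pglz decompression call */
void DecompressBlock(const char *source, Size sourceLength,
                     char *destination, Size uncompressedSize);

/* decompress a single block only when its values are about to be read */
static StringInfo
LoadBlockValues(StringInfo compressedBlock, Size uncompressedSize)
{
    StringInfo valueBuffer = makeStringInfo();

    enlargeStringInfo(valueBuffer, (int) uncompressedSize);
    DecompressBlock(compressedBlock->data, compressedBlock->len,
                    valueBuffer->data, uncompressedSize);
    valueBuffer->len = (int) uncompressedSize;

    return valueBuffer;
}

The previous block’s buffer can be freed or reused before the next one is decompressed, so only one uncompressed block is resident at a time.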

7. Low Level Improvements

7.1. Freeing raw buffer for by-value columns

In the cstore_reader, in LoadColumnData(), we read the raw value buffer from the file and then deserialize it.

We don’t free the original raw value buffer, because the deserialized data may contain pointers into the original buffer for the “by reference” columns.

We can free the data for “by value” columns, though. A basic test showed an 11% improvement in read memory usage after doing this optimization.
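
A minimal sketch of that change, with illustrative names rather than the exact cstore_reader ones:

#include "postgres.h"
#include "lib/stringinfo.h"

/* free the raw buffer read from the file once it is safe to do so */
static void
FreeRawBufferIfByValue(StringInfo rawValueBuffer, bool typeByValue)
{
    if (typeByValue)
    {
        /* deserialized by-value datums are self-contained copies,
           so nothing points into the raw buffer any more */
        pfree(rawValueBuffer->data);
        pfree(rawValueBuffer);
    }

    /* by-reference columns must keep the raw buffer alive, because the
       deserialized datums may still point into it */
}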

7.2. Optimizing DatumCopy

The DatumCopy() function in the cstore_writer does a palloc() for each by-reference value. As we saw earlier, the overhead of palloc() is large.

We can use a simple allocator wrapper inside this function which only palloc()s in large chunks: instead of doing a palloc() for every single datum, it does one allocation for roughly every 1MB of datums (see the sketch below).

A basic test showed a 16% improvement in write memory usage after doing this optimization.
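
A minimal sketch of such a wrapper, with illustrative names; the real DatumCopy() change would also have to keep the chunks around for as long as the stripe data lives:

#include "postgres.h"

#define DATUM_CHUNK_SIZE (1024 * 1024)

typedef struct ChunkAllocator
{
    char *currentChunk;     /* current 1MB chunk */
    Size remainingBytes;    /* unused bytes left in currentChunk */
} ChunkAllocator;

/* hand out MAXALIGN'ed pieces of 1MB chunks instead of one palloc per datum */
static void *
ChunkAllocatorAlloc(ChunkAllocator *allocator, Size size)
{
    void *allocation = NULL;

    size = MAXALIGN(size);

    if (size > allocator->remainingBytes)
    {
        /* start a new chunk; oversized requests get a chunk of their own */
        Size chunkSize = Max(size, DATUM_CHUNK_SIZE);

        allocator->currentChunk = palloc(chunkSize);
        allocator->remainingBytes = chunkSize;
    }

    allocation = allocator->currentChunk;
    allocator->currentChunk += size;
    allocator->remainingBytes -= size;

    return allocation;
}

Because the chunks are still palloc()’ed, they are released automatically when the surrounding memory context is reset.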

7.3. Doing less allocations in FlushStripe

As we discussed earlier, the allocation/free pattern used for compressed files in FlushStripe() seems to cause some unused resident memory. Section 3 describes an alternative algorithm that removes this overhead.

A basic test showed a 14% improvement in write memory usage after doing this optimization.

8. Action Plan

For the next release, we do:

  • Determine a new value for stripe_row_count after doing some benchmarks,
  • Implement 6.1, and keep only one copy of data in memory.

These optimizations do not seem too difficult to implement, and their effect should be large. Note that after implementing 6.1, some of the other optimizations lose their effect (mostly the ones in section 7).

My estimate for applying these optimizations is 2-3 weeks.