Python: In-memory duckdb keep increasing indefinitely #2471

Closed
1 of 2 tasks
lamkenn opened this issue Oct 21, 2021 · 8 comments · Fixed by #2723
Comments


lamkenn commented Oct 21, 2021

What happens?

I run an in-memory DuckDB database from Python (initialised with a table of 200K records, with the id column as the primary key; memory is ~250MB after inserting them), and the process subscribes to a stream of updates (a pandas DataFrame) which keeps updating the table via cursor.executemany("UPDATE TABLE set field1 = ?, field2= ? where id = ?", df.to_records()) for 500 records every second.

However, the memory of the Python program keeps increasing even though no new records are inserted (I keep reusing the cursor for the updates).

If I comment out the cursor.executemany statement and just print out the DataFrame, the memory doesn't increase while receiving updates from the data stream.

Therefore, I am quite sure the memory growth is due to the update statement. I also set a memory limit via PRAGMA memory_limit='1GB';.

Moreover, I get a segmentation fault if I try to run an update-select statement (updating a big table with 20K records from a table with 500 records). If the big table only has, say, 5K records, it runs fine.

To Reproduce

I will try to create a sample program later. For now, I'm wondering whether I am doing anything wrong with the in-memory database.

Environment (please complete the following information):

  • OS: Windows 10
  • DuckDB Version: duckdb-0.3.0
  • DuckDB Client: Python

Before Submitting

  • Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.
  • Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
lamkenn changed the title from "In-memory duckdb keep increasing" to "Python: In-memory duckdb keep increasing indefinitely" on Oct 21, 2021

lamkenn commented Oct 21, 2021

And I noticed that even after I deleted all the records from the table, the memory is not freed up.
Any ideas how to reclaim that memory?

Thanks

@Mytherin
Collaborator

Thanks for the report!

> However, the memory of the Python program keeps increasing even though no new records are inserted (I keep reusing the cursor for the updates).

That is perhaps related to the fact that in-memory databases are not checkpointed/flushed, or perhaps related to string updates not being properly cleaned up. What kind of data are you inserting into the tables (types, size of values, etc)?

> Moreover, I get a segmentation fault if I try to run an update-select statement (updating a big table with 20K records from a table with 500 records). If the big table only has, say, 5K records, it runs fine.

Could you share the query/data that triggers this crash please?

> And I noticed that even after I deleted all the records from the table, the memory is not freed up. Any ideas how to reclaim that memory?

DROP TABLE should reclaim the memory. DELETE only marks tuples as deleted, and in-memory databases are not checkpointed/flushed yet, which means the deleted tuples will not be cleaned up from memory. This is related to #109.


lamkenn commented Oct 26, 2021

Will try to create a repeatable example later this week.


lamkenn commented Nov 4, 2021

Hi. I found that there is different behavior between

duckdb.connect(database=':memory', read_only=False)

and

duckdb.connect()

The first one causes a segmentation fault during an UPDATE executemany on a large table.

What is the difference between the two?

Thanks


Mytherin commented Nov 4, 2021

In the first one you are creating an on-disk database called :memory; the colon at the end is missing. The correct syntax is :memory: for creating an in-memory database.

Could you create a reproducible example of the segmentation fault?


lamkenn commented Nov 9, 2021

Here is the reproducible example:

```python
import duckdb
import pandas as pd
# import pyarrow as pa

import faulthandler
faulthandler.enable()

print('Start')
conn: duckdb.DuckDBPyConnection = duckdb.connect()
cursor = conn.cursor()
data = pd.read_csv('data.csv')

conn.execute("create table test_table (isin VARCHAR(12), value VARCHAR(1))")
# arrow_table = pa.Table.from_pydict(data)
# cursor.register_arrow(f'test_table_view', arrow_table)

cursor.register('test_table_view', data)
cursor.execute("insert into test_table SELECT * FROM test_table_view")
cursor.execute("UPDATE test_table set value = tdv.value FROM test_table_view tdv where tdv.isin = test_table.isin")

# for i in range(1000):
#     cursor.execute("UPDATE test_table set value = tdv.value FROM test_table_view tdv where tdv.isin = test_table.isin")

# Not able to reach this statement
cursor.close()
conn.close()
print('End')
```

data.csv


```
root@goorm:/workspace/pythonContainer# python3 /workspace/pythonContainer/update_from_view.py
Start
Fatal Python error: Segmentation fault

Current thread 0x00007fe3b11be600 (most recent call first):
  File "/workspace/pythonContainer/update_from_view.py", line 19 in <module>
Segmentation fault (core dumped)
root@goorm:/workspace/pythonContainer#
```


I noticed that the VARCHAR/TEXT column type is the cause of the segmentation fault.
E.g. if you change the type to INTEGER, it runs fine.


lamkenn commented Nov 16, 2021

@Mytherin could you please take a look? Thanks!

@Mytherin
Collaborator

Thanks for the update! I can indeed reproduce the problem here. I will have a look.

Mytherin added a commit to Mytherin/duckdb that referenced this issue Dec 2, 2021
Mytherin added a commit that referenced this issue Dec 3, 2021
Fix #2471: correctly handle offset passed by ::UpdateSegment, and handle it earlier to clean up code