Python: In-memory duckdb keep increasing indefinitely #2471

Closed
1 of 2 tasks
lamkenn opened this issue Oct 21, 2021 · 8 comments · Fixed by #2723
Comments


lamkenn commented Oct 21, 2021

What happens?

I run an in-memory DuckDB database from Python (initialised with a table of 200K records, with the id column as the primary key; memory is ~250MB after inserting them), and the process subscribes to a stream of updates (a pandas DataFrame) which keeps updating the table via cursor.executemany("UPDATE TABLE set field1 = ?, field2= ? where id = ?", df.to_records()) for 500 records every second.

However, the memory of the Python program keeps increasing even though no new records are inserted (I keep reusing the cursor for the updates).

If I comment out the cursor.executemany statement and just print out the DataFrame, the memory doesn't increase while receiving updates from the data stream.

Therefore, I am quite sure the memory growth is due to the update statement. I also set a memory limit via PRAGMA memory_limit='1GB';.

Moreover, I get a segmentation fault if I try to run an update-select statement (updating a big table with 20K records from a table with 500 records). If the big table only has, say, 5K records, it runs fine.

To Reproduce

I will try to create a sample program later. For now, I'm wondering whether I am doing anything wrong with the in-memory database.

Environment (please complete the following information):

  • OS: Windows 10
  • DuckDB Version: duckdb-0.3.0
  • DuckDB Client: Python

Before Submitting

  • Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.
  • Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
lamkenn changed the title from "In-memory duckdb keep increasing" to "Python: In-memory duckdb keep increasing indefinitely" on Oct 21, 2021

lamkenn commented Oct 21, 2021

And I noticed that even after I deleted all the records from the table, the memory is not freed up.
Any ideas how to reclaim that memory?

Thanks

@Mytherin
Collaborator

Thanks for the report!

> However, the memory of the Python program keeps increasing even though no new records are inserted (I keep reusing the cursor for the updates).

That is perhaps related to the fact that in-memory databases are not checkpointed/flushed, or perhaps related to string updates not being properly cleaned up. What kind of data are you inserting into the tables (types, size of values, etc)?

> Moreover, I get a segmentation fault if I try to run an update-select statement (updating a big table with 20K records from a table with 500 records). If the big table only has, say, 5K records, it runs fine.

Could you share the query/data that triggers this crash please?

> And I noticed that even after I deleted all the records from the table, the memory is not freed up. Any ideas how to reclaim that memory?

DROP TABLE should reclaim the memory. DELETE only marks tuples as deleted, and in-memory databases are not checkpointed/flushed yet, which means the deleted tuples will not be cleaned up from memory. This is related to #109.


lamkenn commented Oct 26, 2021

Will try to create a repeatable example later this week.


lamkenn commented Nov 4, 2021

Hi. I found that there is different behavior between

duckdb.connect(database=':memory', read_only=False)

and

duckdb.connect()

The first one causes a segmentation fault during an UPDATE executemany on a large table.

What is the difference between the two?

Thanks


Mytherin commented Nov 4, 2021

In the first one you are creating an on-disk database called :memory; the colon at the end is missing. The correct syntax is :memory: for creating an in-memory database.

Could you create a reproducible example of the segmentation fault?


lamkenn commented Nov 9, 2021

Here is the reproducible example:

```python
import duckdb
import pandas as pd
# import pyarrow as pa

import faulthandler
faulthandler.enable()

print('Start')
conn: duckdb.DuckDBPyConnection = duckdb.connect()
cursor = conn.cursor()
data = pd.read_csv('data.csv')

conn.execute("create table test_table (isin VARCHAR(12), value VARCHAR(1))")
# arrow_table = pa.Table.from_pydict(data)
# cursor.register_arrow(f'test_table_view', arrow_table)

cursor.register('test_table_view', data)
cursor.execute("insert into test_table SELECT * FROM test_table_view")
cursor.execute("UPDATE test_table set value = tdv.value FROM test_table_view tdv where tdv.isin = test_table.isin")

# for i in range(1000):
#     cursor.execute("UPDATE test_table set value = tdv.value FROM test_table_view tdv where tdv.isin = test_table.isin")

# Not able to reach this statement
cursor.close()
conn.close()
print('End')
```

data.csv


```
root@goorm:/workspace/pythonContainer# python3 /workspace/pythonContainer/update_from_view.py
Start
Fatal Python error: Segmentation fault

Current thread 0x00007fe3b11be600 (most recent call first):
  File "/workspace/pythonContainer/update_from_view.py", line 19 in <module>
Segmentation fault (core dumped)
root@goorm:/workspace/pythonContainer#
```


I noticed that the VARCHAR/TEXT column type is the cause of the segmentation fault.
E.g. if you change the type to INTEGER, it runs fine.


lamkenn commented Nov 16, 2021

@Mytherin could you please take a look? Thanks!

@Mytherin
Collaborator

Thanks for the update! I can indeed reproduce the problem here. I will have a look.

Mytherin added a commit to Mytherin/duckdb that referenced this issue Dec 2, 2021
Mytherin added a commit that referenced this issue Dec 3, 2021
Fix #2471: correctly handle offset passed by ::UpdateSegment, and handle it earlier to clean up code