
Conversation

@igorniebylski requested review from Copilot and solatis on June 23, 2025 07:43
Copilot AI left a comment


Pull Request Overview

This PR addresses a ValueError when calling qdbpd.read_dataframe on an existing but empty table by forcing evaluation of the generator and returning an empty DataFrame.

  • Modified read_dataframe to materialize the generator and handle empty results.
  • Added a new test test_read_dataframe_empty_table_sc16881 to ensure empty tables yield an empty DataFrame.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File descriptions:
  • tests/test_pandas.py: Added test_read_dataframe_empty_table_sc16881.
  • quasardb/pandas/__init__.py: Changed read_dataframe to list(...) the stream and return an empty DataFrame if there are no batches.
Comments suppressed due to low confidence (1)

tests/test_pandas.py:686

  • Consider adding assertions to verify that the empty DataFrame has the expected column names (e.g., ['d'] or including the timestamp column) to strengthen this test.
    assert df.empty
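Copilot's suppressed suggestion could be sketched as follows with plain pandas (no quasardb connection needed); the frame below is a hypothetical stand-in for the result of qdbpd.read_dataframe on an empty table, and the column name "d" is taken from the suggestion above:

```python
import pandas as pd

# Stand-in for qdbpd.read_dataframe(conn, table) on an empty table;
# purely illustrative, not the actual test fixture.
df = pd.DataFrame(columns=["d"])

assert df.empty
assert list(df.columns) == ["d"]  # checks the schema, not just emptiness
```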

# we need to evaluate the generator first, and then concatenate if result set is not empty.
dfs = list(stream_dataframe(conn, table, **kwargs))

if len(dfs) == 0:

Copilot AI Jun 23, 2025


[nitpick] Use if not dfs: for a more idiomatic and concise empty-list check in Python.

Suggested change
if len(dfs) == 0:
if not dfs:

Contributor Author


if len(dfs) == 0: is the same check that pandas.concat itself uses.

dfs = list(stream_dataframe(conn, table, **kwargs))

if len(dfs) == 0:
return pd.DataFrame()

Copilot AI Jun 23, 2025


Returning an empty DataFrame without any columns may break downstream code expecting the original table schema; consider initializing with the table's column names to preserve schema.

Contributor Author


This is the same behavior as the qdbpd.query function.
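For comparison, Copilot's schema-preservation idea could look like the sketch below. The helper name is hypothetical, and `table.list_columns()` / `col.name` are assumptions about the quasardb table API; the stub table exists only to make the sketch runnable:

```python
import pandas as pd
from types import SimpleNamespace

def empty_frame_with_schema(table):
    # `table.list_columns()` and `col.name` are assumed API shapes,
    # used here only to illustrate preserving the schema on the empty path.
    names = [col.name for col in table.list_columns()]
    return pd.DataFrame(columns=names)

# Illustrative stand-in for a real quasardb table with one column "d".
stub = SimpleNamespace(list_columns=lambda: [SimpleNamespace(name="d")])

df = empty_frame_with_schema(stub)
assert df.empty and list(df.columns) == ["d"]
```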

Comment on lines 347 to 349
# stream_dataframe is *lazy* - is a generator, not a list.
# if result of stream_dataframe is empty this could result in ValueError on pd.concat()
# we need to evaluate the generator first, and then concatenate if result set is not empty.

Copilot AI Jun 23, 2025


[nitpick] These three comment lines could be consolidated into a single concise remark to reduce verbosity and improve readability.

Suggested change
# stream_dataframe is *lazy* - is a generator, not a list.
# if result of stream_dataframe is empty this could result in ValueError on pd.concat()
# we need to evaluate the generator first, and then concatenate if result set is not empty.
# stream_dataframe is lazy (a generator, not a list). Evaluate it first to avoid ValueError on pd.concat() if the result is empty.

dfs = list(stream_dataframe(conn, table, **kwargs))
Contributor


This pulls everything into memory at once -- will this cause additional memory consumption when used in conjunction with pd.concat?

The beautiful thing about using a lazy-evaluated generator for pd.concat is that we don't need to pull all individual dataframes into memory before pd.concat merges them into a single dataframe.

Assuming that pd.concat does not copy dataframes, this does not matter. If it does copy the data, this will suddenly cause a doubling of the required memory.

Could you look into this?

Contributor


Pondering further, we really only need to know whether there's at least one item -- we don't need to evaluate all of them.

Is there perhaps a more elegant way, with functools or some other generator helper function/library, to lazily check whether there's at least one item?

Contributor Author

@igorniebylski, Jun 24, 2025


By default, pd.concat() will make a copy of the data; we can set the copy argument to False to override this behavior: https://pandas.pydata.org/docs/reference/api/pandas.concat.html

As for the second comment, I didn't find an easy way to peek at a generator:

  • We can manually call next() once on the generator and check whether it returns a first element or raises StopIteration (or we can pass a default value for when the generator is exhausted). If there is no exception, we can use itertools.chain to combine the first (already evaluated) element with the rest of the generator and pass this new "chain" to pd.concat().
    https://stackoverflow.com/questions/661603/how-do-i-know-if-a-generator-is-empty-from-the-start
  • Just wrap pd.concat() in a try/except ValueError block.
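The first option above could be sketched like this; `concat_lazily` is a hypothetical helper name for illustration, not part of the actual change:

```python
import itertools
import pandas as pd

def concat_lazily(dfs):
    """Peek one element; if the generator is empty, return an empty frame.

    Otherwise stitch the peeked element back on with itertools.chain so
    pd.concat still consumes the rest of the stream lazily.
    """
    it = iter(dfs)
    try:
        first = next(it)
    except StopIteration:
        return pd.DataFrame()
    return pd.concat(itertools.chain([first], it))

# Tiny usage example: an empty stream and a three-batch stream.
assert concat_lazily(iter([])).empty
out = concat_lazily(pd.DataFrame({"d": [i]}) for i in range(3))
assert len(out) == 3
```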

Contributor


Ok, this should be fixed:

  • it should not make a copy, and should take the dataframes by reference instead (= faster);
  • I think try/except on this specific exception is the best approach -- it's the simplest.
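A minimal sketch of the agreed approach, under stated assumptions: the helper name is hypothetical, and copy=False is used as discussed (note that recent pandas versions deprecate this keyword in favor of copy-on-write):

```python
import pandas as pd

def read_all(stream):
    try:
        # Take the dataframes by reference; pd.concat consumes the lazy
        # stream directly, so nothing is materialized up front.
        return pd.concat(stream, copy=False)
    except ValueError:
        # pd.concat raises ValueError ("No objects to concatenate")
        # when the stream yields nothing -- the empty-table case.
        return pd.DataFrame()

assert read_all(iter([])).empty
assert len(read_all(pd.DataFrame({"d": [i]}) for i in range(2))) == 2
```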

Contributor Author


I agree, let's not over-complicate this.

Contributor

@solatis left a comment


Left a comment; it would be good to clarify that.

@igorniebylski requested a review from solatis on June 24, 2025 10:54
@igorniebylski merged commit 3676ac7 into master on June 24, 2025
2 checks passed
@igorniebylski deleted the sc-16881/exception-when-reading-empty-table-with-qdbpd-read-dataframe branch on June 24, 2025 14:03