Importing data from a Pandas dataframe results in an incorrect column type if the Pandas column has object or string dtype and is completely empty.
The documentation is not clear on the implicit casting that occurs when importing a dataframe, but I would expect an object or string Pandas column to become VARCHAR. This is the case when the column contains values, but when the column is empty the resulting type is INTEGER instead.
In my limited testing this does not occur with other column types.
Both 1.0.0 and 1.0.1-dev122 exhibit this bug.
To Reproduce
The following code will create example dataframes and relations and compare types, failing as per bug description when checking relation data types. Changing use_string_type allows testing of both object and string Pandas types.
import random

import pandas as pd
import numpy as np
import duckdb


def get_filled_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe filled with random data of specific types."""
    index = np.arange(rows)
    data: dict = {
        "int_column": [random.randint(0, 100) for _ in range(rows)],
        "bool_column": [random.choice([True, False]) for _ in range(rows)],
        "float_column": [random.uniform(0, 1) for _ in range(rows)],
        "string_column": [random.choice(["foo", "bar", "baz", "qux"]) for _ in range(rows)],
    }
    return pd.DataFrame(data=data, index=index).convert_dtypes(infer_objects=False, convert_string=use_string_type)


def get_empty_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe with empty columns of specific types."""
    df: pd.DataFrame = pd.DataFrame(index=np.arange(rows))
    df["int_column"] = pd.Series(dtype="Int64")
    df["bool_column"] = pd.Series(dtype="boolean")
    df["float_column"] = pd.Series(dtype="Float64")
    if use_string_type:
        df["string_column"] = pd.Series(dtype="string")
    else:
        df["string_column"] = pd.Series(dtype="object")
    return df


print(f"Pandas version: {pd.__version__}")  # 2.2.2
print(f"Numpy version: {np.__version__}")  # 1.26.4
print(f"DuckDB version: {duckdb.__version__}")  # 1.0.0

use_string_type: bool = True
filled_df: pd.DataFrame = get_filled_df(use_string_type=use_string_type)
empty_df: pd.DataFrame = get_empty_df(use_string_type=use_string_type)

# check df correctly constructed
assert len(filled_df) == len(empty_df)

# check dtypes match
for fdf_type, edf_type in zip(filled_df.dtypes, empty_df.dtypes):
    assert fdf_type == edf_type

with duckdb.connect(database=":memory:") as con:
    f_df_rel: duckdb.DuckDBPyRelation = con.from_df(filled_df)
    e_df_rel: duckdb.DuckDBPyRelation = con.from_df(empty_df)

    # check column names match
    for fdf_col, edf_col in zip(f_df_rel.columns, e_df_rel.columns):
        assert fdf_col == edf_col, f"Filled column {fdf_col} does not match empty col {edf_col}"

    # fails with empty string column being INTEGER type instead of VARCHAR
    for (fdf_col, fdf_type), (edf_col, edf_type) in zip(
        zip(f_df_rel.columns, f_df_rel.dtypes), zip(e_df_rel.columns, e_df_rel.dtypes)
    ):
        assert fdf_type == edf_type, f"Filled column {fdf_col} type {fdf_type} does not match empty column {edf_col} type {edf_type}"
This results in:
AssertionError: Filled column string_column type VARCHAR does not match empty column string_column type INTEGER
OS:
Ubuntu 20.04 WSL x64 (5.15.146.1-microsoft-standard-WSL2)
DuckDB Version:
1.0.0, 1.0.1-dev122
DuckDB Client:
Python
Full Name:
Zander Horn
Affiliation:
Stone Three
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
Not applicable - the reproduction does not require a data set
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?