Importing from Pandas has incorrect column type for empty string/object columns #12507

H-SG · 2024-06-12T19:13:40Z

What happens?

Importing data from a Pandas dataframe results in an incorrect column type if the column in Pandas has an object or string dtype and is completely empty.

The documentation is not clear on the implicit casting that occurs when importing a dataframe, but I would expect an object/string Pandas column to become VARCHAR. This is the case when there are values in the column, but not when the column is empty; in that case the column is imported as INTEGER.

In my limited testing this does not occur with other column types.

Both 1.0.0 and 1.0.1-dev122 exhibit this bug.

To Reproduce

The following code will create example dataframes and relations and compare types, failing as per bug description when checking relation data types. Changing use_string_type allows testing of both object and string Pandas types.

import random

import pandas as pd
import numpy as np
import duckdb

def get_filled_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe filled with random data of specific types."""
    index = np.arange(rows)
    data: dict = {
        "int_column": [random.randint(0, 100) for _ in range(rows)],
        "bool_column": [random.choice([True, False]) for _ in range(rows)],
        "float_column": [random.uniform(0, 1) for _ in range(rows)],
        "string_column": [random.choice(["foo", "bar", "baz", "qux"]) for _ in range(rows)]
    }

    return pd.DataFrame(data=data, index=index).convert_dtypes(infer_objects=False, convert_string=use_string_type)

def get_empty_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe with empty columns of specific types."""
    df: pd.DataFrame = pd.DataFrame(index=np.arange(rows))

    df["int_column"] = pd.Series(dtype="Int64")
    df["bool_column"] = pd.Series(dtype="boolean")
    df["float_column"] = pd.Series(dtype="Float64")
    if use_string_type:
        df["string_column"] = pd.Series(dtype="string")
    else:
        df["string_column"] = pd.Series(dtype="object")

    return df

print(f"Pandas version: {pd.__version__}") # 2.2.2
print(f"Numpy version: {np.__version__}") # 1.26.4
print(f"DuckDB version: {duckdb.__version__}") # 1.0.0

use_string_type: bool = True
filled_df: pd.DataFrame = get_filled_df(use_string_type=use_string_type)
empty_df: pd.DataFrame = get_empty_df(use_string_type=use_string_type)

# check df correctly constructed
assert len(filled_df) == len(empty_df)

# check dtypes match
for fdf_type, edf_type in zip(filled_df.dtypes, empty_df.dtypes):
    assert fdf_type == edf_type

with duckdb.connect(database=":memory:") as con:
    f_df_rel: duckdb.DuckDBPyRelation = con.from_df(filled_df)
    e_df_rel: duckdb.DuckDBPyRelation = con.from_df(empty_df)

    for fdf_col, edf_col in zip(f_df_rel.columns, e_df_rel.columns):
        assert fdf_col == edf_col, f"Filled column {fdf_col} does not match empty col {edf_col}"

    # fails with empty string column being INTEGER type instead of VARCHAR
    for (fdf_col, fdf_type), (edf_col, edf_type) in zip(zip(f_df_rel.columns, f_df_rel.dtypes), zip(e_df_rel.columns, e_df_rel.dtypes)):
        assert fdf_type == edf_type, f"Filled column {fdf_col} type {fdf_type} does not match empty column {edf_col} type {edf_type}"

This results in:

AssertionError: Filled column string_column type VARCHAR does not match empty column string_column type INTEGER

OS:

Ubuntu 20.04 WSL x64 (5.15.146.1-microsoft-standard-WSL2)

DuckDB Version:

1.0.0, 1.0.1-dev122

DuckDB Client:

Python

Full Name:

Zander Horn

Affiliation:

Stone Three

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

Yes, I have

H-SG added the needs triage label Jun 12, 2024

szarnyasg added the reproduced label Jun 13, 2024

duckdblabs-bot removed the needs triage label Jun 13, 2024

Tishj mentioned this issue Jun 13, 2024

[Python] Skip the PandasAnalyzer if dtype is 'string' #12511

Merged

Tishj closed this as completed Jun 17, 2024

H-SG commented Jun 12, 2024 • edited by szarnyasg
