Importing from Pandas has incorrect column type for empty string/object columns #12507

Closed
2 tasks done
H-SG opened this issue Jun 12, 2024 · 0 comments

H-SG commented Jun 12, 2024

What happens?

Importing data from a Pandas dataframe results in an incorrect column type if the column in Pandas has the object or string dtype and is completely empty.

The documentation is not clear on the implicit casting that occurs when importing a dataframe, but I would expect the object/string Pandas type to become VARCHAR. This is the case when there are values in the column, but when the column is empty the resulting type is INTEGER instead.

In my limited testing this does not occur with other column types.

Both 1.0.0 and 1.0.1-dev122 exhibit this bug.

To Reproduce

The following code creates example dataframes and relations and compares their types. It fails, as described above, when checking the relation data types. Toggling use_string_type allows testing both the object and the string Pandas dtypes.

import random

import pandas as pd
import numpy as np
import duckdb

def get_filled_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe filled with random data of specific types."""
    index = np.arange(rows)
    data: dict = {
        "int_column": [random.randint(0, 100) for _ in range(rows)],
        "bool_column": [random.choice([True, False]) for _ in range(rows)],
        "float_column": [random.uniform(0, 1) for _ in range(rows)],
        "string_column": [random.choice(["foo", "bar", "baz", "qux"]) for _ in range(rows)]
    }

    return pd.DataFrame(data=data, index=index).convert_dtypes(infer_objects=False, convert_string=use_string_type)

def get_empty_df(rows: int = 100, use_string_type: bool = False) -> pd.DataFrame:
    """Return dataframe with empty columns of specific types."""
    df: pd.DataFrame = pd.DataFrame(index=np.arange(rows))

    df["int_column"] = pd.Series(dtype="Int64")
    df["bool_column"] = pd.Series(dtype="boolean")
    df["float_column"] = pd.Series(dtype="Float64")
    if use_string_type:
        df["string_column"] = pd.Series(dtype="string")
    else:
        df["string_column"] = pd.Series(dtype="object")

    return df

print(f"Pandas version: {pd.__version__}") # 2.2.2
print(f"Numpy version: {np.__version__}") # 1.26.4
print(f"DuckDB version: {duckdb.__version__}") # 1.0.0

use_string_type: bool = True
filled_df: pd.DataFrame = get_filled_df(use_string_type=use_string_type)
empty_df: pd.DataFrame = get_empty_df(use_string_type=use_string_type)

# check df correctly constructed
assert len(filled_df) == len(empty_df)

# check dtypes match
for fdf_type, edf_type in zip(filled_df.dtypes, empty_df.dtypes):
    assert fdf_type == edf_type

with duckdb.connect(database=":memory:") as con:
    f_df_rel: duckdb.DuckDBPyRelation = con.from_df(filled_df)
    e_df_rel: duckdb.DuckDBPyRelation = con.from_df(empty_df)

    for fdf_col, edf_col in zip(f_df_rel.columns, e_df_rel.columns):
        assert fdf_col == edf_col, f"Filled column {fdf_col} does not match empty col {edf_col}"

    # fails with empty string column being INTEGER type instead of VARCHAR
    for (fdf_col, fdf_type), (edf_col, edf_type) in zip(zip(f_df_rel.columns, f_df_rel.dtypes), zip(e_df_rel.columns, e_df_rel.dtypes)):
        assert fdf_type == edf_type, f"Filled column {fdf_col} type {fdf_type} does not match empty column {edf_col} type {edf_type}"

This results in:

AssertionError: Filled column string_column type VARCHAR does not match empty column string_column type INTEGER

OS:

Ubuntu 20.04 WSL x64 (5.15.146.1-microsoft-standard-WSL2)

DuckDB Version:

1.0.0, 1.0.1-dev122

DuckDB Client:

Python

Full Name:

Zander Horn

Affiliation:

Stone Three

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have