Register_parquet not working for pandas parquet files #66

@marvin-lge

Bug description
I am writing a Parquet file from pandas (using gzip compression) and want to query it with datafusion. The same data exported as .csv works fine with this library, but the Parquet version does not return anything.

Steps to reproduce

import datafusion
import pandas as pd

# Write the same small frame as CSV and as Parquet, both uncompressed
df = pd.DataFrame(data={'col1': [1, 2, 4, 5, 6], 'col2': [3, 4, 3, 5, 2], 'col3': [3, 4, 1, 2, 3], 'col4': [3, 4, 4, 5, 6]})
df.to_csv('df.csv', compression=None)
df.to_parquet('df.pq', compression=None)

# Register both files as tables in one session
ctx = datafusion.SessionContext()
ctx.register_csv(name="example_csv", path="df.csv")
ctx.register_parquet(name="example_pq", path="df.pq")

# test csv (separate names so the pandas `df` is not overwritten)
csv_df = ctx.sql("SELECT * FROM example_csv")
csv_result = csv_df.collect()
csv_res = csv_result[0]

# test parquet
pq_df = ctx.sql("SELECT * FROM example_pq")
pq_result = pq_df.collect()
pq_res = pq_result[0]

Expected behavior
Both queries should return the same result.

Additional context
Gzip-compressed CSV files also do not seem to work, and I am not sure why. Consider the following example:

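# `df` is the pandas DataFrame from above; pandas infers gzip from the .gz suffix
df.to_csv('df.csv.gz')
ctx.register_csv(name="example_csv_gz", path="df.csv.gz")

# test gzip-compressed csv
gz_df = ctx.sql("SELECT * FROM example_csv_gz")
gz_result = gz_df.collect()
gz_res = gz_result[0]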
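
To help narrow down whether the files on disk are what they are expected to be, here is a minimal diagnostic sketch using only the Python standard library. It is not part of datafusion. `sniff_format` is a hypothetical helper name introduced here. It relies only on the well-known magic bytes: Parquet files begin and end with `PAR1`, and gzip streams begin with `0x1f 0x8b`.

```python
import os

PARQUET_MAGIC = b"PAR1"   # Parquet files start and end with these 4 bytes
GZIP_MAGIC = b"\x1f\x8b"  # gzip streams start with these 2 bytes


def sniff_format(path):
    """Hypothetical helper: guess the on-disk format from magic bytes."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, os.SEEK_END)
        size = f.tell()
        tail = b""
        if size >= 8:
            # Read the last 4 bytes to check the Parquet footer marker
            f.seek(-4, os.SEEK_END)
            tail = f.read(4)
    if head == PARQUET_MAGIC and tail == PARQUET_MAGIC:
        return "parquet"
    if head[:2] == GZIP_MAGIC:
        return "gzip"
    return "unknown"
```

Calling `sniff_format("df.pq")` and `sniff_format("df.csv.gz")` on the files from the steps above should show whether pandas wrote a well-formed Parquet file and a gzip stream. If both look right, the problem is more likely in how the files are registered or read than in how they were written.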

Labels: bug
