
Issue in load_file some datasets in Snowflake #104

Closed
tatiana opened this issue Feb 11, 2022 · 2 comments
Labels: bug (Something isn't working), priority/high (High priority), product/python-sdk (Label describing products)

Comments

tatiana (Collaborator) commented Feb 11, 2022

Version: astro==0.4.0

Problem
At the moment, we are unable to load the following Tate Gallery dataset into Snowflake: https://github.com/tategallery/collection/blob/master/artwork_data.csv. The same operation works with BigQuery and Postgres. I could not find any obvious issue with the original dataset.

Exception:

  File "/home/tati/Code/astro-fresh/src/astro/sql/operators/agnostic_load_file.py", line 80, in execute
    move_dataframe_to_sql(
  File "/home/tati/Code/astro-fresh/src/astro/utils/load_dataframe.py", line 72, in move_dataframe_to_sql
    write_pandas(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py", line 146, in write_pandas
    create_stage_sql = (
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 173, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1561, in pyarrow.lib.Table.from_pandas
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
    raise e
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert '99' with type str: tried to convert to double", 'Conversion failed for column HEIGHT with type object')

How to reproduce

Download the dataset artwork_data.csv.

Update the tests/benchmark/config.json file to include a dataset similar to:

    {
        "name": "few_mb",
        "size": "24M",
        "path": "<path-to>/artwork_data.csv",
        "rows": 69201
    },

Also add a database entry that uses Snowflake.

From within the tests/benchmark folder, run:

./run.py --dataset=few_mb --database=snowflake

Initial analysis

The first step of load_file is to load the CSV into a pandas DataFrame. For this particular dataset, pandas automatically assigns the following types per column:

(Pdb) df.dtypes
id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object

Analyzing the values within height shows a mixture of strings, floats, and NaN:

(Pdb) len([i for i in df.height if isinstance(i, str)])
31330
(Pdb) len([i for i in df.height if not isinstance(i, str)])
37871
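
The mixed typing alone is enough to reproduce the same ArrowInvalid error outside of the SDK. A minimal sketch (illustrative only, not astro code) that also shows one possible coercion workaround:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# A column mixing floats, strings and NaN, like the inferred `height` column.
df = pd.DataFrame({"height": [12.5, "99", np.nan]})
print(df.dtypes)  # height -> object

try:
    # write_pandas eventually funnels the frame through to_parquet/pyarrow,
    # which must pick a single Arrow type per column.
    pa.Table.from_pandas(df)
except pa.ArrowInvalid as exc:
    print(exc)  # "Could not convert '99' with type str: tried to convert to double"

# One possible workaround: coerce the column to a numeric dtype up front.
df["height"] = pd.to_numeric(df["height"], errors="coerce")
pa.Table.from_pandas(df)  # succeeds now that the column is float64
```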

Why doesn't this happen for BQ & Postgres?

Because those databases currently use a different strategy to write the data frame into the target table (a simplified sketch follows):
https://github.com/astro-projects/astro/blob/4e63302bc5c69401b10568598c4ff738e21563f5/src/astro/utils/load_dataframe.py#L60-L95
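
Roughly speaking, the Snowflake branch stages the frame as Parquet via write_pandas, which is where pyarrow enforces a single type per column, while the other branches write through pandas' to_sql-style APIs that tolerate object columns. A simplified sketch of that branching (not the actual SDK code):

```python
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas


def move_dataframe_to_sql_sketch(df: pd.DataFrame, conn, database: str, table_name: str) -> None:
    """Simplified illustration of the branching in load_dataframe.py, not the real code."""
    if database == "snowflake":
        # Stages the frame as Parquet before loading; pyarrow has to assign one
        # Arrow type to `height`, so the mixed str/float column fails here.
        write_pandas(conn, df, table_name)
    else:
        # Postgres (and, in this sketch, other databases) write via
        # DataFrame.to_sql, which accepts object columns without complaint.
        df.to_sql(table_name, con=conn, if_exists="replace", index=False)
```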

@tatiana tatiana added the bug Something isn't working label Feb 11, 2022
@tatiana tatiana changed the title Issue in load_file dataset in which a column contains floats, ints and empty values Issue in load_file some datasets in Snowflake Feb 11, 2022
@tatiana tatiana added this to the 0.6.1 milestone Feb 21, 2022
@tatiana tatiana added the priority/high High priority label Feb 21, 2022
dimberman (Collaborator) commented
I believe that the solution is to add a feature where a user can manually define a schema using a list of SQLAlchemy columns. I believe @utkarsharma2 had looked into this at some point.
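
For illustration, this is the kind of user-supplied schema such a feature might accept (hypothetical; the feature did not exist when this comment was written):

```python
from sqlalchemy import Column, Float, Integer, String

# Hypothetical user-defined schema that would bypass pandas type inference
# for ambiguous columns such as `height`.
artwork_columns = [
    Column("id", Integer),
    Column("height", Float),
    Column("units", String(16)),
]
```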

@tatiana tatiana modified the milestones: 0.7.x, 0.8.x Mar 8, 2022
@tatiana tatiana removed this from the 0.8.x milestone Apr 11, 2022
@kaxil kaxil added the product/python-sdk Label describing products label Oct 6, 2022
tatiana (Collaborator, Author) commented Jan 17, 2023

This problem is caused by pandas inferring incorrect column types.

We addressed this by allowing users to specify the columns as part of the Table definition, introduced in 0.9 (see the sketch below):
https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090
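
A hedged sketch of that column override (exact import paths and keyword arguments differ across SDK versions, and the connection id and column list below are illustrative; check the changelog/docs above for the version you use):

```python
from sqlalchemy import Column, Float, Integer, String

from astro.sql.table import Table  # import path may differ in newer releases

# Illustrative only: declare `height` explicitly so the SDK does not rely on
# pandas inferring an object dtype for the mixed str/float column.
artwork = Table(
    name="artwork",                # keyword names are assumptions for this sketch
    conn_id="snowflake_default",   # hypothetical connection id
    columns=[
        Column("id", Integer),
        Column("height", Float),
        Column("units", String(16)),
    ],
)

# The table is then passed as output_table to aql.load_file(...) when loading
# artwork_data.csv into Snowflake.
```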

@tatiana tatiana closed this as completed Jan 17, 2023
Projects: None yet
Development: No branches or pull requests
3 participants