
Issue in load_file some datasets in Snowflake #104

Closed
tatiana opened this issue Feb 11, 2022 · 2 comments
Labels: bug (Something isn't working), priority/high (High priority), product/python-sdk (Label describing products)

Comments

tatiana (Collaborator) commented Feb 11, 2022

Version: astro==0.4.0

Problem
At the moment, we are unable to load the following Tate Gallery dataset into Snowflake: https://github.com/tategallery/collection/blob/master/artwork_data.csv. The same operation works with BigQuery and Postgres. I could not find any obvious issue with the original dataset.

Exception:

  File "/home/tati/Code/astro-fresh/src/astro/sql/operators/agnostic_load_file.py", line 80, in execute
    move_dataframe_to_sql(
  File "/home/tati/Code/astro-fresh/src/astro/utils/load_dataframe.py", line 72, in move_dataframe_to_sql
    write_pandas(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/snowflake/connector/pandas_tools.py", line 146, in write_pandas
    create_stage_sql = (
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pandas/io/parquet.py", line 173, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 1561, in pyarrow.lib.Table.from_pandas
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 594, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 581, in convert_column
    raise e
  File "/home/tati/.virtualenvs/astro-py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 575, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 302, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 99, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("Could not convert '99' with type str: tried to convert to double", 'Conversion failed for column HEIGHT with type object')

How to reproduce

Download the dataset artwork_data.csv.

Update the tests/benchmark/config.json file to include a dataset similar to:

    {
        "name": "few_mb",
        "size": "24M",
        "path": "<path-to>/artwork_data.csv",
        "rows": 69201
    },

Also add a database entry that uses Snowflake.

From within the tests/benchmark folder, run:

./run.py --dataset=few_mb --database=snowflake

Initial analysis

The first step of load_file is to load the CSV into a pandas DataFrame. For this particular dataset, pandas automatically assigns the following types per column:

(Pdb) df.dtypes
id                      int64
accession_number       object
artist                 object
artistRole             object
artistId                int64
title                  object
dateText               object
medium                 object
creditLine             object
year                   object
acquisitionYear       float64
dimensions             object
width                  object
height                 object
depth                 float64
units                  object
inscription            object
thumbnailCopyright     object
thumbnailUrl           object
url                    object
dtype: object

Analyzing the values within height shows a mixture of strings, floats, and NaN:

(Pdb) len([i for i in df.height if isinstance(i, str)])
31330
(Pdb) len([i for i in df.height if not isinstance(i, str)])
37871
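
The mixed typing alone is enough to reproduce the same ArrowInvalid error outside of the SDK. A minimal sketch (illustrative only, not astro code) that also shows one possible coercion workaround:

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# A column mixing floats, strings and NaN, like the inferred `height` column.
df = pd.DataFrame({"height": [12.5, "99", np.nan]})
print(df.dtypes)  # height -> object

try:
    # write_pandas eventually funnels the frame through to_parquet/pyarrow,
    # which must pick a single Arrow type per column.
    pa.Table.from_pandas(df)
except pa.ArrowInvalid as exc:
    print(exc)  # "Could not convert '99' with type str: tried to convert to double"

# One possible workaround: coerce the column to a numeric dtype up front.
df["height"] = pd.to_numeric(df["height"], errors="coerce")
pa.Table.from_pandas(df)  # succeeds now that the column is float64
```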

Why doesn't this happen for BQ & Postgres?

Because those databases currently use a different strategy to write the data frame into the target table (a simplified sketch follows):
https://github.com/astro-projects/astro/blob/4e63302bc5c69401b10568598c4ff738e21563f5/src/astro/utils/load_dataframe.py#L60-L95
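
Roughly speaking, the Snowflake branch stages the frame as Parquet via write_pandas, which is where pyarrow enforces a single type per column, while the other branches write through pandas' to_sql-style APIs that tolerate object columns. A simplified sketch of that branching (not the actual SDK code):

```python
import pandas as pd
from snowflake.connector.pandas_tools import write_pandas


def move_dataframe_to_sql_sketch(df: pd.DataFrame, conn, database: str, table_name: str) -> None:
    """Simplified illustration of the branching in load_dataframe.py, not the real code."""
    if database == "snowflake":
        # Stages the frame as Parquet before loading; pyarrow has to assign one
        # Arrow type to `height`, so the mixed str/float column fails here.
        write_pandas(conn, df, table_name)
    else:
        # Postgres (and, in this sketch, other databases) write via
        # DataFrame.to_sql, which accepts object columns without complaint.
        df.to_sql(table_name, con=conn, if_exists="replace", index=False)
```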

@tatiana tatiana added the bug Something isn't working label Feb 11, 2022
@tatiana tatiana changed the title Issue in load_file dataset in which a column contains floats, ints and empty values Issue in load_file some datasets in Snowflake Feb 11, 2022
@tatiana tatiana added this to the 0.6.1 milestone Feb 21, 2022
@tatiana tatiana added the priority/high High priority label Feb 21, 2022
dimberman (Collaborator) commented
I believe that the solution is to add a feature where a user can manually define a schema using a list of SQLAlchemy columns. I believe @utkarsharma2 had looked into this at some point.
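
For illustration, this is the kind of user-supplied schema such a feature might accept (hypothetical; the feature did not exist when this comment was written):

```python
from sqlalchemy import Column, Float, Integer, String

# Hypothetical user-defined schema that would bypass pandas type inference
# for ambiguous columns such as `height`.
artwork_columns = [
    Column("id", Integer),
    Column("height", Float),
    Column("units", String(16)),
]
```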

@tatiana tatiana modified the milestones: 0.7.x, 0.8.x Mar 8, 2022
@tatiana tatiana removed this from the 0.8.x milestone Apr 11, 2022
@kaxil kaxil added the product/python-sdk Label describing products label Oct 6, 2022
tatiana (Collaborator, Author) commented Jan 17, 2023

This problem is caused by pandas inferring incorrect column types.

We addressed this by allowing users to specify the columns as part of the Table definition, introduced in 0.9 (see the sketch below):
https://github.com/astronomer/astro-sdk/blob/main/python-sdk/docs/CHANGELOG.md#090
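
A hedged sketch of that column override (exact import paths and keyword arguments differ across SDK versions, and the connection id and column list below are illustrative; check the changelog/docs above for the version you use):

```python
from sqlalchemy import Column, Float, Integer, String

from astro.sql.table import Table  # import path may differ in newer releases

# Illustrative only: declare `height` explicitly so the SDK does not rely on
# pandas inferring an object dtype for the mixed str/float column.
artwork = Table(
    name="artwork",                # keyword names are assumptions for this sketch
    conn_id="snowflake_default",   # hypothetical connection id
    columns=[
        Column("id", Integer),
        Column("height", Float),
        Column("units", String(16)),
    ],
)

# The table is then passed as output_table to aql.load_file(...) when loading
# artwork_data.csv into Snowflake.
```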

@tatiana tatiana closed this as completed Jan 17, 2023
Projects: None yet
Development: No branches or pull requests
3 participants