
Add native autodetect schema feature #780

Merged: 14 commits, Sep 13, 2022

Conversation

@feluelle (Member) commented Sep 6, 2022

Description

What is the current behavior?

Currently, we always use pandas to infer the schema.
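
Roughly, pandas-based inference amounts to reading a sample of the file and taking the dtypes pandas assigns, which are then mapped to database column types. A minimal illustration (the file name is a placeholder):

import pandas as pd

# Read a sample of the file and let pandas infer the column types.
df = pd.read_csv("sample.csv", nrows=1000)
print(df.dtypes)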

For more information, check the issue below.

closes: #709

What is the new behavior?

Add native autodetect schema feature for Snowflake:

  • add a Snowflake file format data class
  • add an option to use native autodetect schema for databases
  • implement Snowflake-specific autodetect schema (see the sketch below)
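
For context, Snowflake exposes native schema inference through the INFER_SCHEMA table function combined with CREATE TABLE ... USING TEMPLATE. A minimal sketch of that pattern, assuming a hook-like object with a run method; the stage and file format names are placeholders, not the PR's actual code:

def create_table_with_inferred_schema(hook, table_name: str) -> None:
    # INFER_SCHEMA inspects the staged files and returns column definitions;
    # USING TEMPLATE turns them into a CREATE TABLE statement.
    hook.run(
        f"""
        CREATE TABLE {table_name} USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(
                INFER_SCHEMA(
                    LOCATION => '@my_stage',
                    FILE_FORMAT => 'my_file_format'
                )
            )
        )
        """
    )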

Add native autodetect schema for BigQuery:

  • implement BigQuery native autodetect schema (see the sketch below)
  • add tests for BigQuery
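
On the BigQuery side, native autodetection is an option of the load job itself. A rough sketch using the public google-cloud-bigquery client (the URI and table ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,  # let BigQuery infer column names and types natively
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # treat the first CSV row as a header
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",         # placeholder GCS URI
    "my-project.my_dataset.my_table",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait; the table now exists with the inferred schema

Note that this is a load job, so schema detection and data loading happen in one step; that detail resurfaces in the review discussion below.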

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

- add snowflake file format data class
- add option to use native autodetect schema for databases
- implement snowflake specific autodetect schema
@codecov bot commented Sep 6, 2022

Codecov Report

Merging #780 (3d422b3) into main (f1030fd) will increase coverage by 0.03%.
The diff coverage is 94.64%.

@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   93.27%   93.31%   +0.03%     
==========================================
  Files          46       46              
  Lines        1962     2018      +56     
  Branches      247      252       +5     
==========================================
+ Hits         1830     1883      +53     
- Misses        103      105       +2     
- Partials       29       30       +1     
Impacted Files                                      Coverage Δ
python-sdk/src/astro/databases/base.py             95.62% <83.33%> (-0.48%) ⬇️
python-sdk/src/astro/databases/snowflake.py        95.76% <94.73%> (-0.20%) ⬇️
python-sdk/src/astro/databases/google/bigquery.py  96.55% <100.00%> (+0.25%) ⬆️


- add tests for snowflake
- implement snowflake autodetect check
- remove extra arg for setting schema auto detection
.pre-commit-config.yaml (review thread resolved)
@feluelle (Member, Author) commented Sep 8, 2022

The failed checks are unrelated.

@dimberman (Collaborator) left a comment:

This is super cool! Huge +1

@sunank200 (Contributor) left a comment:

Overall looks good. Added a few comments.

python-sdk/src/astro/databases/google/bigquery.py (2 review threads resolved)
python-sdk/src/astro/databases/snowflake.py (2 review threads resolved)
@@ -60,6 +60,9 @@
}
BIGQUERY_WRITE_DISPOSITION = {"replace": "WRITE_TRUNCATE", "append": "WRITE_APPEND"}

NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_TYPES = {FileType.CSV, FileType.NDJSON}
NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_LOCATIONS = {FileLocation.GS}
Collaborator commented:

@feluelle How would we represent cases like:
S3 - Parquet
GCS - CSV, JSON, Parquet

@feluelle (Member, Author) replied:

> GCS - CSV, JSON, Parquet

GS is GCS; CSV and JSON are supported. For Parquet, BigQuery's native inference is used automatically. See comment #780 (comment)

@feluelle (Member, Author) replied:

> S3 - Parquet

I guess there is no native path available, which means it will use pandas instead.

@utkarsharma2 (Collaborator) commented Sep 12, 2022:

So currently this mapping implies that for every supported location (GCS/S3) we support all the file types (CSV/NDJSON), right? But is that always the case?

is_file_type_supported = (
    file.type.name in NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_TYPES
)
is_file_location_supported = (
    file.location.location_type
    in NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_LOCATIONS
)

Would it be a good idea to check the location first and then the file types supported by that location? Something like:

{
    "GCS": [CSV, NDJSON],
    "S3": [PARQUET],  # just an example
}
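
A minimal runnable sketch of that suggestion, with stand-in enums (the real ones live in the SDK's constants module):

from enum import Enum

class FileType(Enum):
    CSV = "csv"
    NDJSON = "ndjson"
    PARQUET = "parquet"

class FileLocation(Enum):
    GS = "gs"
    S3 = "s3"

# Key the supported file types by location instead of keeping two
# independent sets; the S3 entry is illustrative only.
NATIVE_AUTODETECT_SCHEMA_SUPPORT = {
    FileLocation.GS: {FileType.CSV, FileType.NDJSON},
    # FileLocation.S3: {FileType.PARQUET},
}

def is_native_autodetect_available(file_type: FileType, location: FileLocation) -> bool:
    # A file type is supported only if it is listed under its location.
    return file_type in NATIVE_AUTODETECT_SCHEMA_SUPPORT.get(location, set())

assert is_native_autodetect_available(FileType.CSV, FileLocation.GS)
assert not is_native_autodetect_available(FileType.CSV, FileLocation.S3)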

@feluelle (Member, Author) replied:

But we do, in the return statement:

return is_file_type_supported and is_file_location_supported

@feluelle (Member, Author) added:

Ah, okay, I got what you mean now. Yes, this could make sense. I just used the same functionality we use for file loading. Is file loading different from schema autodetection? 🤔

Collaborator replied:

@feluelle No, I guess not. We should change that as well. I just realized that we can do that in a separate PR.

Collaborator commented:

@utkarsharma2 Can you create a separate issue on the separate PR you are talking about?

Collaborator replied:

Added - #853

    file: File,
) -> None:
    """
    Create a SQL table, automatically inferring the schema using the given file via native database support.
@utkarsharma2 (Collaborator) commented Sep 12, 2022:

@feluelle I think this would populate the file into the table as well, right?

Suppose the user has given a pattern that results in two files, for example:

pattern: s3://tmp/

which resolves to the files:

  1. s3://tmp/test1.csv
  2. s3://tmp/test2.csv

With the existing code and the new schema autodetection logic, we will end up with:

Step 1: Autoschema detect - table + 1st file load
Step 2: Load data into the table - 1st file load + 2nd file load

Wouldn't that result in the 1st file being loaded twice? Can we confirm this is not the case?

cc: @sunank200

@feluelle (Member, Author) replied:

> Step 1: Autoschema detect - table + 1st file load

The autoschema detection should not actually load the file.

@feluelle (Member, Author) added:

But good point, I will check that.

@feluelle (Member, Author) added:

I tested it for Snowflake by adding:

    statement = f"SELECT COUNT(*) FROM {database.get_table_qualified_name(table)}"
    count = database.run_sql(statement).scalar()
    assert count == 0

which passed.

Collaborator replied:

@feluelle Nice. Can we add one for BigQuery as well?

@feluelle (Member, Author) replied:

I have added it, see #780 (comment).


    :return: unique file format name
    """
    return (
        "file_format_"
@utkarsharma2 (Collaborator) commented Sep 12, 2022:

@feluelle Looks like we have a duplication of sorts; would it be a good idea to have a single function that generates a random string of the desired length and takes a prefix into account?

unique_id = random.choice(string.ascii_lowercase) + "".join(

Collaborator added:

Or we can use this function in table.py

@feluelle (Member, Author) replied:

I don't have a strong preference here. But if we want to make it more elegant and truly unique, I would prefer using uuid (maybe uuid4) to create a unique name and storing the function in a utils file. WDYT?
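
For illustration, such a helper could be as small as this (the name and module are hypothetical, not part of the PR):

import uuid

def unique_name(prefix: str) -> str:
    """Hypothetical utils helper: return prefix plus a uuid4 hex string.
    A non-digit prefix also keeps the result a valid SQL identifier."""
    return f"{prefix}{uuid.uuid4().hex}"

# unique_name("file_format_") -> e.g. "file_format_9f1c2b0e4a3d..."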

Collaborator replied:

We can use it. I believe we considered UUIDs for table names, but I cannot recall why we decided against them. @tatiana might have better context there.

indirect=True,
ids=["bigquery"],
)
def test_bigquery_create_table_using_native_schema_autodetection(
Collaborator commented:

@feluelle we should also check the number of rows in the created table, which I think should be 0.

@feluelle (Member, Author) replied:

I can do that, but if it is documented that this only creates the schema, IMO that is enough.

@feluelle (Member, Author) added:

BigQuery indeed loads data into the table 🙄

[screenshot: BigQuery table preview, 2022-09-12]

@feluelle (Member, Author) added:

It seems we cannot change that. So we have to DELETE the rows afterwards? 😅 Or wdyt?

@feluelle (Member, Author) added:

8530ffc - Let me know what you think, please.

Collaborator replied:

Yes, we need to delete the rows; not ideal, but we can improve on this later.

Collaborator asked:

Does that mean the file is loaded twice?


configuration=job_config,
)

# We have to clear the table afterwards as bigquery automatically loads the data when creating the table.
Collaborator commented:

If so, are we loading the files twice?

  1. Load entire file and then truncate the table
  2. Load the files again

@feluelle (Member, Author) replied:

In this PR, yes; the rows would even end up duplicated if we did not truncate the table. BigQuery does not let us create the table only.
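
To make the workaround concrete, a sketch of the truncate step with the public BigQuery client (the table ID is a placeholder; the PR's actual code goes through the SDK's database abstraction):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

# The autodetect load job creates the table *and* loads the file's rows,
# so clear the table before the real load to avoid duplicated rows.
client.query(f"TRUNCATE TABLE `{table_id}`").result()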

@kaxil merged commit fa854a8 into main Sep 13, 2022
@kaxil deleted the feature/709-native-autodetect-schema branch September 13, 2022 16:49
@kaxil (Collaborator) commented Sep 13, 2022:

I have merged this PR, but we should take care of the double-loading in the next PR.

@pankajkoti mentioned this pull request Sep 14, 2022