
Add native autodetect schema feature #780

Merged: 14 commits, Sep 13, 2022

Conversation

@feluelle (Member) commented Sep 6, 2022

Description

What is the current behavior?

Currently, we always use pandas to infer the schema.
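
Roughly, pandas-based inference amounts to reading a sample of the file and taking the dtypes pandas assigns, which are then mapped to database column types. A minimal illustration (the file name is a placeholder):

import pandas as pd

# Read a sample of the file and let pandas infer the column types.
df = pd.read_csv("sample.csv", nrows=1000)
print(df.dtypes)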

For more information, check the issue below.

closes: #709

What is the new behavior?

Add native autodetect schema feature for Snowflake:

  • add a Snowflake file format data class
  • add an option to use native autodetect schema for databases
  • implement Snowflake-specific autodetect schema (see the sketch below)
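
For context, Snowflake exposes native schema inference through the INFER_SCHEMA table function combined with CREATE TABLE ... USING TEMPLATE. A minimal sketch of that pattern, assuming a hook-like object with a run method; the stage and file format names are placeholders, not the PR's actual code:

def create_table_with_inferred_schema(hook, table_name: str) -> None:
    # INFER_SCHEMA inspects the staged files and returns column definitions;
    # USING TEMPLATE turns them into a CREATE TABLE statement.
    hook.run(
        f"""
        CREATE TABLE {table_name} USING TEMPLATE (
            SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
            FROM TABLE(
                INFER_SCHEMA(
                    LOCATION => '@my_stage',
                    FILE_FORMAT => 'my_file_format'
                )
            )
        )
        """
    )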

Add native autodetect schema for BigQuery:

  • implement BigQuery native autodetect schema (see the sketch below)
  • add tests for BigQuery
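
On the BigQuery side, native autodetection is an option of the load job itself. A rough sketch using the public google-cloud-bigquery client (the URI and table ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    autodetect=True,  # let BigQuery infer column names and types natively
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # treat the first CSV row as a header
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",         # placeholder GCS URI
    "my-project.my_dataset.my_table",  # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait; the table now exists with the inferred schema

Note that this is a load job, so schema detection and data loading happen in one step; that detail resurfaces in the review discussion below.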

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

- add snowflake file format data class
- add option to use native autodetect schema for databases
- implement snowflake specific autodetect schema
@codecov bot commented Sep 6, 2022

Codecov Report

Merging #780 (3d422b3) into main (f1030fd) will increase coverage by 0.03%.
The diff coverage is 94.64%.

@@            Coverage Diff             @@
##             main     #780      +/-   ##
==========================================
+ Coverage   93.27%   93.31%   +0.03%     
==========================================
  Files          46       46              
  Lines        1962     2018      +56     
  Branches      247      252       +5     
==========================================
+ Hits         1830     1883      +53     
- Misses        103      105       +2     
- Partials       29       30       +1     
Impacted Files                                      Coverage Δ
python-sdk/src/astro/databases/base.py             95.62% <83.33%> (-0.48%) ⬇️
python-sdk/src/astro/databases/snowflake.py        95.76% <94.73%> (-0.20%) ⬇️
python-sdk/src/astro/databases/google/bigquery.py  96.55% <100.00%> (+0.25%) ⬆️


- add tests for snowflake
- implement snowflake autodetect check
- remove extra arg for setting schema auto detection
.pre-commit-config.yaml (review thread resolved)
@feluelle (Member, Author) commented Sep 8, 2022

The failed checks are unrelated.

@dimberman (Collaborator) left a comment:

This is super cool! Huge +1

@sunank200 (Contributor) left a comment:

Overall looks good. Added a few comments.

python-sdk/src/astro/databases/google/bigquery.py (2 review threads resolved)
python-sdk/src/astro/databases/snowflake.py (2 review threads resolved)
@@ -60,6 +60,9 @@
}
BIGQUERY_WRITE_DISPOSITION = {"replace": "WRITE_TRUNCATE", "append": "WRITE_APPEND"}

NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_TYPES = {FileType.CSV, FileType.NDJSON}
NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_LOCATIONS = {FileLocation.GS}
Collaborator commented:

@feluelle How would we represent cases like:
S3 - Parquet
GCS - CSV, JSON, Parquet

@feluelle (Member, Author) replied:

> GCS - CSV, JSON, Parquet

GS is GCS; CSV and JSON are supported. For Parquet, BigQuery's native inference is used automatically. See comment #780 (comment)

@feluelle (Member, Author) replied:

> S3 - Parquet

I guess there is no native path available, which means it will use pandas instead.

@utkarsharma2 (Collaborator) commented Sep 12, 2022:

So currently this mapping implies that for every supported location (GCS/S3) we support all the file types (CSV/NDJSON), right? But is that always the case?

is_file_type_supported = (
    file.type.name in NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_TYPES
)
is_file_location_supported = (
    file.location.location_type
    in NATIVE_AUTODETECT_SCHEMA_SUPPORTED_FILE_LOCATIONS
)

Would it be a good idea to check the location first and then the file types supported by that location? Something like:

{
    "GCS": [CSV, NDJSON],
    "S3": [PARQUET],  # just an example
}
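
A minimal runnable sketch of that suggestion, with stand-in enums (the real ones live in the SDK's constants module):

from enum import Enum

class FileType(Enum):
    CSV = "csv"
    NDJSON = "ndjson"
    PARQUET = "parquet"

class FileLocation(Enum):
    GS = "gs"
    S3 = "s3"

# Key the supported file types by location instead of keeping two
# independent sets; the S3 entry is illustrative only.
NATIVE_AUTODETECT_SCHEMA_SUPPORT = {
    FileLocation.GS: {FileType.CSV, FileType.NDJSON},
    # FileLocation.S3: {FileType.PARQUET},
}

def is_native_autodetect_available(file_type: FileType, location: FileLocation) -> bool:
    # A file type is supported only if it is listed under its location.
    return file_type in NATIVE_AUTODETECT_SCHEMA_SUPPORT.get(location, set())

assert is_native_autodetect_available(FileType.CSV, FileLocation.GS)
assert not is_native_autodetect_available(FileType.CSV, FileLocation.S3)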

@feluelle (Member, Author) replied:

But we do, in the return statement:

return is_file_type_supported and is_file_location_supported

@feluelle (Member, Author) added:

Ah, okay, I got what you mean now. Yes, this could make sense. I just used the same functionality we use for file loading. Is file loading different from schema autodetection? 🤔

Collaborator replied:

@feluelle No, I guess not. We should change that as well. I just realized that we can do that in a separate PR.

Collaborator commented:

@utkarsharma2 Can you create a separate issue on the separate PR you are talking about?

Collaborator replied:

Added - #853

    file: File,
) -> None:
    """
    Create a SQL table, automatically inferring the schema using the given file via native database support.
@utkarsharma2 (Collaborator) commented Sep 12, 2022:

@feluelle I think this would populate the file into the table as well, right?

Suppose the user has given a pattern that results in two files, for example:

pattern: s3://tmp/

which resolves to the files:

  1. s3://tmp/test1.csv
  2. s3://tmp/test2.csv

With the existing code and the new schema autodetection logic, we will end up with:

Step 1: Autoschema detect - table + 1st file load
Step 2: Load data into the table - 1st file load + 2nd file load

Wouldn't that result in the 1st file being loaded twice? Can we confirm this is not the case?

cc: @sunank200

@feluelle (Member, Author) replied:

> Step 1: Autoschema detect - table + 1st file load

The autoschema detection should not actually load the file.

@feluelle (Member, Author) added:

But good point, I will check that.

@feluelle (Member, Author) added:

I tested it for Snowflake by adding:

    statement = f"SELECT COUNT(*) FROM {database.get_table_qualified_name(table)}"
    count = database.run_sql(statement).scalar()
    assert count == 0

which passed.

Collaborator replied:

@feluelle Nice. Can we add one for BigQuery as well?

@feluelle (Member, Author) replied:

I have added it, see #780 (comment).


    :return: unique file format name
    """
    return (
        "file_format_"
@utkarsharma2 (Collaborator) commented Sep 12, 2022:

@feluelle Looks like we have a duplication of sorts; would it be a good idea to have a single function that generates a random string of the desired length and takes a prefix into account?

unique_id = random.choice(string.ascii_lowercase) + "".join(

Collaborator added:

Or we can use this function in table.py

@feluelle (Member, Author) replied:

I don't have a strong preference here. But if we want to make it more elegant and truly unique, I would prefer using uuid (maybe uuid4) to create a unique name and storing the function in a utils file. WDYT?
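
For illustration, such a helper could be as small as this (the name and module are hypothetical, not part of the PR):

import uuid

def unique_name(prefix: str) -> str:
    """Hypothetical utils helper: return prefix plus a uuid4 hex string.
    A non-digit prefix also keeps the result a valid SQL identifier."""
    return f"{prefix}{uuid.uuid4().hex}"

# unique_name("file_format_") -> e.g. "file_format_9f1c2b0e4a3d..."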

Collaborator replied:

We can use it. I believe we considered UUIDs for table names, but I cannot recall why we decided against them. @tatiana might have better context there.

indirect=True,
ids=["bigquery"],
)
def test_bigquery_create_table_using_native_schema_autodetection(
Collaborator commented:

@feluelle we should also check the number of rows in the created table, which I think should be 0.

@feluelle (Member, Author) replied:

I can do that, but if it is documented that this only creates the schema, IMO that is enough.

@feluelle (Member, Author) added:

BigQuery indeed loads data into the table 🙄

[screenshot: BigQuery table preview, 2022-09-12]

@feluelle (Member, Author) added:

It seems we cannot change that. So we have to DELETE the rows afterwards? 😅 Or wdyt?

@feluelle (Member, Author) added:

8530ffc - Let me know what you think, please.

Collaborator replied:

Yes, we need to delete the rows; not ideal, but we can improve on this later.

Collaborator asked:

Does that mean the file is loaded twice?


configuration=job_config,
)

# We have to clear the table afterwards as bigquery automatically loads the data when creating the table.
Collaborator commented:

If so, are we loading the files twice?

  1. Load entire file and then truncate the table
  2. Load the files again

@feluelle (Member, Author) replied:

In this PR, yes; the rows would even end up duplicated if we did not truncate the table. BigQuery does not let us create the table only.
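
To make the workaround concrete, a sketch of the truncate step with the public BigQuery client (the table ID is a placeholder; the PR's actual code goes through the SDK's database abstraction):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.my_dataset.my_table"  # placeholder

# The autodetect load job creates the table *and* loads the file's rows,
# so clear the table before the real load to avoid duplicated rows.
client.query(f"TRUNCATE TABLE `{table_id}`").result()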

@kaxil merged commit fa854a8 into main Sep 13, 2022
@kaxil deleted the feature/709-native-autodetect-schema branch September 13, 2022 16:49
@kaxil (Collaborator) commented Sep 13, 2022:

I have merged this PR, but we should take care of the double-loading in the next PR.

@pankajkoti mentioned this pull request Sep 14, 2022