
375 filesystem storage staging #451

Merged: 67 commits merged into devel from d#/375-filesystem-storage-staging on Jul 15, 2023

Conversation

@sh-rp (Collaborator) commented Jun 23, 2023

@netlify netlify bot commented Jun 23, 2023

Deploy Preview for dlt-hub-docs ready!

🔨 Latest commit: 4ae7fe6
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/64b2cb94d2d5780008fe8a52
😎 Deploy Preview: https://deploy-preview-451--dlt-hub-docs.netlify.app

@sh-rp force-pushed the d#/375-filesystem-storage-staging branch from 41ec238 to 1d42d4b on June 23, 2023 17:36
@sh-rp force-pushed the d#/375-filesystem-storage-staging branch from 89aa188 to 6af9a5c on June 25, 2023 16:44
@sh-rp force-pushed the d#/375-filesystem-storage-staging branch from 651fdeb to 876ec3f on June 26, 2023 14:25
@sh-rp force-pushed the d#/375-filesystem-storage-staging branch from 5d4c6e3 to fed914b on June 26, 2023 15:05
dlt/common/destination/capabilities.py (outdated, resolved)
dlt/common/destination/capabilities.py (outdated, resolved)
dlt/common/destination/reference.py (resolved)
dlt/destinations/bigquery/bigquery.py (outdated, resolved)
dlt/destinations/bigquery/bigquery.py (outdated, resolved)
@@ -113,6 +115,8 @@ def pipeline(
pipelines_dir = get_dlt_pipelines_dir()

destination = DestinationReference.from_name(destination or kwargs["destination_name"])
staging = DestinationReference.from_name(staging or kwargs["staging_name"]) if staging is not None else None
Collaborator: here you could also check whether the spec supports the staging client configuration (i.e., is a subclass of it; see my remarks above). A sketch of such a check follows this exchange.

Collaborator: is it done?
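A minimal sketch of the suggested check; validate_staging_reference and the spec() accessor are assumptions of this sketch, and DestinationClientStagingConfiguration (introduced in this PR) would be passed as staging_spec_base:

def validate_staging_reference(staging_ref, staging_spec_base) -> None:
    # hypothetical sketch: reject a staging destination whose configuration
    # spec does not derive from the staging configuration base class
    if staging_ref is not None and not issubclass(staging_ref.spec(), staging_spec_base):
        raise TypeError(f"{staging_ref.__name__} cannot be used as a staging destination")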

@@ -186,15 +186,17 @@ def __init__(
progress: _Collector,
must_attach_to_local_pipeline: bool,
config: PipelineConfiguration,
-   runtime: RunConfiguration
+   runtime: RunConfiguration,
+   staging: DestinationReference = None
Collaborator: after destination, I'd say

Collaborator: I think the run method takes destination, so it should take staging too

Collaborator: one important thing: the presence of staging must be serialized to state; see _state_to_props and _props_to_state further below. You must test that staging is still functional after the state is restored from the destination.

Collaborator: what else you need to do: maybe we should add all staging information to LoadInfo (in reference.py), so that when you print it, it is clear that staging was used.

Also, the dlt pipeline ... info command should display staging information.

Same for the built-in streamlit app.

Collaborator (Author): I think the pipeline info automatically picked up the staging info after I changed the props-to-state code, no?

dlt/pipeline/pipeline.py (resolved)
dlt/pipeline/pipeline.py (outdated, resolved)
dlt/pipeline/pipeline.py (outdated, resolved)
@sh-rp force-pushed the d#/375-filesystem-storage-staging branch from 77d2e22 to 1d4b648 on July 10, 2023 18:37
@rudolfix rudolfix left a comment

If we go for DestinationClientDwhWithStagingConfiguration, which is a destination that contains the staging config, we must go all the way and refactor the whole thing, i.e.:

  • DestinationClientStagingConfiguration is not needed; the embedded config is always the staging config.
  • There is no need for a staging parameter in the Pipeline class; it is simply part of the destination.

You also de facto force the usage of filesystem as staging (which is fine, maybe?).

Let's discuss it before I move forward with the review.

@rudolfix rudolfix left a comment

OK, I like it now. I suggested some simplifications that we can make; there will also be less code :) Thanks for the docs update. I'll review it and push my changes to this branch.



@configspec
class AwsCredentialsPure(CredentialsConfiguration, CredentialsWithDefault):
Collaborator: yeah, that should have been like this from the very start. I'd just keep the convention we have in GcpCredentials (see the sketch after this list):

  1. You can keep it in one file.
  2. AwsCredentialsPure -> rename to AwsCredentialsWithoutDefaults.
  3. AwsCredentialsWithoutDefaults -> do not inherit from CredentialsWithDefault.
  4. AwsCredentials -> inherit from CredentialsWithDefault.

There are several places (e.g., the dbt helper) that check for CredentialsWithDefault inheritance and then call the CredentialsWithDefault interface, so it should be implemented only where we really check for and retrieve default credentials.
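A minimal sketch of that layout, assuming the dlt config decorators and base classes are importable as shown; the field names and the default-credentials lookup are assumptions:

from dlt.common.configuration import configspec
from dlt.common.configuration.specs import CredentialsConfiguration, CredentialsWithDefault

@configspec
class AwsCredentialsWithoutDefaults(CredentialsConfiguration):
    # field names are assumptions for this sketch
    aws_access_key_id: str = None
    aws_secret_access_key: str = None

@configspec
class AwsCredentials(AwsCredentialsWithoutDefaults, CredentialsWithDefault):
    def on_partial(self) -> None:
        # only here do we check for and retrieve default credentials
        # (e.g. from a boto3 session), then mark the config resolved
        ...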

...

@configspec(init=True)
class DestinationClientDwhWithStagingConfiguration(DestinationClientDwhConfiguration):
Collaborator: OK, I like this idea! So let's double down on it (see the sketch after this list):

  1. It seems you do not need the full config, just the staging credentials!
  2. Put those credentials on DestinationClientDwhConfiguration as staging_credentials and use the generic credentials type (see credentials in the base class).
  3. Another big +1: the user can fill those credentials explicitly (it will not work fully now; we will deal with that later).
  4. We do not need DestinationClientDwhWithStagingConfiguration.
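A minimal sketch of the suggested shape; the class and field names follow the thread and the diff above, while the exact defaults are assumptions:

from typing import Optional

@configspec(init=True)
class DestinationClientDwhConfiguration(DestinationClientConfiguration):
    dataset_name: str = None
    default_schema_name: Optional[str] = None
    # generic credentials type, mirroring credentials in the base class; filled
    # from the staging destination by the pipeline, or explicitly by the user
    staging_credentials: Optional[CredentialsConfiguration] = None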

dlt/common/destination/reference.py (resolved)
- destination_file_name = LoadFilesystemJob.make_destination_filename(file_name, schema_name, load_id)
- fs_client.put_file(local_path, posixpath.join(dataset_path, destination_file_name))
+ self.destination_file_name = LoadFilesystemJob.make_destination_filename(file_name, schema_name, load_id)
+ fs_client.put_file(local_path, posixpath.join(dataset_path, self.destination_file_name))
Collaborator: use make_remote_path instead of posixpath.join; that will work with fsspec (90% sure ;>)

@@ -58,6 +59,38 @@ def is_sql_job(file_path: str) -> bool:
return os.path.splitext(file_path)[1][1:] == "sql"


class CopyFileLoadJob(LoadJob, FollowupJob):
def __init__(self, table: TTableSchema, file_path: str, sql_client: SqlClientBase[Any], forward_staging_credentials: bool = True, staging_config: Optional[DestinationClientStagingConfiguration] = None) -> None:
Collaborator: looks like we do not need forward_staging_credentials; we forward when staging credentials are present, and that's it! (A sketch of the simplified constructor follows below.)

Collaborator (Author): removed!
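A minimal sketch of the simplified constructor; the parameter names mirror the quoted diff, while the attribute used to hold the forwarded credentials is an assumption:

class CopyFileLoadJob(LoadJob, FollowupJob):
    def __init__(
        self,
        table: TTableSchema,
        file_path: str,
        sql_client: SqlClientBase[Any],
        staging_config: Optional[DestinationClientStagingConfiguration] = None,
    ) -> None:
        # no forward_staging_credentials flag: credentials are forwarded
        # exactly when the staging config carries them
        self.staging_credentials = staging_config.credentials if staging_config else None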

return "completed"

@staticmethod
def get_bucket_path(file_path: str) -> str:
Collaborator: it is still not moved. I think those methods are generic enough to belong in the copy job class.

files_clause = ""
stage_file_path = ""

# create storage integration stage for reference jobs if defined, for now gcs only
Collaborator: AFAIK this comment is outdated

credentials_clause = f"""CREDENTIALS=(AWS_KEY_ID='{aws_access_key}' AWS_SECRET_KEY='{aws_secret_key}')"""
from_clause = f"FROM '{bucket_path}'"
# create a stage if so defined
elif stage_name:
Collaborator: I'd always assume that the stage exists if a name is provided, so we can drop this part and simplify the if condition.

Collaborator (Author): I am not quite sure why this is done this way; it was like this before I started working on Snowflake, so I don't understand the reasoning, tbh. I will keep it for now, and if you insist, I will change it :)

Collaborator: heh, when we wrote the code I didn't know exactly how stages work in Snowflake. There's no use case where creating a named stage makes sense (a sketch of the resulting branching follows this list):

  • if stage_name is set, we assume that it exists and use it
  • if it is not set and we have a bucket url (external files), then we fail the load (terminal exception) with a user-friendly message
  • for a local load with no stage, we use the built-in table stage (we do that now)
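A minimal sketch of those rules; client, stage_name, bucket_path and make_qualified_table_name come from the surrounding diff, while TerminalLoadError is a hypothetical stand-in for the loader's terminal exception type:

class TerminalLoadError(Exception):
    # hypothetical stand-in for the loader's terminal exception type
    pass

def resolve_stage(client, table_name: str, stage_name: str, bucket_path: str) -> str:
    if stage_name:
        # a named stage is assumed to already exist
        return stage_name
    if bucket_path:
        # external files without a configured stage: fail the load with
        # a user-friendly, terminal message
        raise TerminalLoadError(f"Cannot load {bucket_path}: set stage_name in the destination configuration")
    # local load with no stage: use the implicit table stage "SCHEMA_NAME"."%TABLE_NAME"
    return client.make_qualified_table_name("%" + table_name)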

else:
# Use implicit table stage by default: "SCHEMA_NAME"."%TABLE_NAME"
stage_name = client.make_qualified_table_name('%'+table_name)

if not bucket_path: # this means we have a local file
Collaborator: theoretically, someone can use file:// with staging. Do you think it makes sense to handle it?

Collaborator (Author): Hm, how would file:// end up in this job? We either have reference jobs to buckets coming in, or the parquet files that come out of the normaliser stage with the right file_path present. Or am I missing something here?

Collaborator: you can use the filesystem destination with file://

staging_client = self._get_staging_client(self.default_schema)
# inject staging config into destination config, TODO: Not super clean I think?
if isinstance(client.config, DestinationClientDwhWithStagingConfiguration):
client.config.staging_config = staging_client.config # type: ignore
Collaborator: I'd set it only if the staging config is not already set (see the sketch below).
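A minimal sketch of the guarded assignment, mirroring the quoted diff (this lives inside a pipeline method, hence self):

staging_client = self._get_staging_client(self.default_schema)
# inject the staging config only when the user has not already set one explicitly
if isinstance(client.config, DestinationClientDwhWithStagingConfiguration):
    if client.config.staging_config is None:
        client.config.staging_config = staging_client.config  # type: ignore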

@sh-rp (Collaborator, Author) commented Jul 13, 2023

Not sure where you want def get_bucket_path(file_path: str) -> str: to be; I moved it into CopyFileLoadJob, but maybe you prefer it to be in the reference job?

# Conflicts:
#	docs/website/docs/dlt-ecosystem/destinations/redshift.md
@rudolfix (Collaborator) replied:

> Not sure where you want def get_bucket_path(file_path: str) -> str: to be; I moved it into CopyFileLoadJob, but maybe you prefer it to be in the reference job?

yes, we should. There's a review comment asking for that which has been hanging for some time :)

@rudolfix rudolfix left a comment

just 2 small things

dataset_name: str = None,
default_schema_name: Optional[str] = None,
as_staging: bool = False,
bucket_url: str = None,
Collaborator: not present in the init

return "completed"

@staticmethod
def get_bucket_path(file_path: str) -> str:
Collaborator: resolve_remote_path would be a better name, IMO

@rudolfix force-pushed the d#/375-filesystem-storage-staging branch from faddc8a to d8f2f2c on July 15, 2023 15:40
@rudolfix force-pushed the d#/375-filesystem-storage-staging branch from d8f2f2c to 4ae7fe6 on July 15, 2023 16:38
@rudolfix rudolfix left a comment

@sh-rp I've fixed a few things, and this is ready to go! Thanks, that was a lot of work.

Here are my final fixes: 4ae7fe6
Let's go through them on Monday; there are a few learnings there.

@rudolfix rudolfix merged commit fb9dabf into devel Jul 15, 2023
35 checks passed
@rudolfix rudolfix deleted the d#/375-filesystem-storage-staging branch July 15, 2023 19:45
codingcyclist pushed a commit to codingcyclist/dlt that referenced this pull request Aug 23, 2023
* add creation of reference followup jobs

* copy job

* add parquet format

* make staging destination none by default

* add automatic resolving of correct file format

* renaming staging destination to staging

* refactor redshift job and inject fs config info

* small cleanup and full parquet file loading test

* add redshift test

* fix resuming existing jobs

* linter fixes and something else i forgot

* move reference job follow up creation into job

* add bigquery staging with gcs

* add jsonl loading for bigquery staging

* better supported file format resolution

* move to existing bigquery load job

* change pipeline args order

* add staging run arg

* some more pipeline fixes

* configure staging via config

* enhance staging load tests

* fix merge disposition on redshift

* add comprehensive staging tests

* fix redshift jsonl loading

* add doc page (not in hierarchy for now)

* move redshift credentials testing to redshift loading location

* change rotation test

* change timing test

* implement snowflake file staging

* switch to staging instead of integration

* add s3 stage (which does not currently work)

* filter out certain combinations for tests

* update docs for snowflake staging

* forward staging config to supported destination configuration types

* move boto out of staging credentials

* authentication to s3 with iam role for redshift

* verify support for snowflake s3 stage and update docs for this

* adds named put stage to snowflake, improves exception messages, fixes and re-enables staging tests

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>