Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

destination config updates #783

Merged
merged 23 commits into from
Dec 1, 2023
Merged

destination config updates #783

merged 23 commits into from
Dec 1, 2023

Conversation

sh-rp
Copy link
Collaborator

@sh-rp sh-rp commented Nov 22, 2023

  • Renames destionation_name to destionation_type on DestinationClientConfiguration
  • Adds new variables name and environment to DestinationClientConfiguration
  • Makes config section respect the destination name.

add name to destination config
add environment to destination config
Copy link

netlify bot commented Nov 22, 2023

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit f48338b
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/65699923a7f6d20008ca1f3f

@sh-rp sh-rp changed the base branch from master to devel November 22, 2023 10:53
@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 22, 2023

General notes:

  • I'm not sure that it makes sense to have the destination name in the config. I think it should be set in code and then you can select a config section based on that name. If you also have the name in the config it might be a bit confusing. Generally I am not quite sure how the setup for a user with multiple named destinations and environments would look like..
  • Should we also store the destination name in the pipeline state? I am not quite sure why the destination info is stored there in the first place, this is one of these (for me) confusing mechanisms where stuff gets stored somewhere and then dlt does thing that are not really reflected in the code. Anyway if we do not save the name but have config segments that are reliant on the name, then the second run might produce other results.
  • I have implemented the dynamic configuration resolution in the same way it was done before in the from_reference constructor on the destination. This seems to work good, but it is different than was specified in the ticket. Let me know if this also works. the one things that does not work now though, is to have common credentials for all duckdb destinations if those duckdb destinations have different names for example (if we also have other destinations types)
  • I have also made the dataset configurable via the config, I am not sure why this was not possible before, so maybe I am missing something there and will have to revert that..

@@ -11,8 +11,8 @@
class PipelineConfiguration(BaseConfiguration):
pipeline_name: Optional[str] = None
pipelines_dir: Optional[str] = None
destination_name: Optional[str] = None
staging_name: Optional[str] = None
destination_type: Optional[str] = None
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should just call this destination?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep _type to be more explicit. also please link to this PR and add info that we changed the names in this config in here #763

also please add on_resolved and please keep destination_name and staging_name in the config but display deprecation warnings. Steini added a method to warn on deprecation somewhere in the Pipeline class

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are people that configure destinations via config.

btw. we may keep destination name in the longer run. so in fact we'll create a destination with given type and given name. so let's do that:

  • if only name is present or only type is present, we create a destination with name == type
  • if they are different we pass both as destination reference.
    i think destination reference supports it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so we do not deprecated destination_name and staging_name, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we keep them

@sh-rp sh-rp added the devel label Nov 22, 2023
# Conflicts:
#	dlt/common/destination/reference.py
#	dlt/common/pipeline.py
#	dlt/destinations/impl/bigquery/factory.py
#	dlt/destinations/impl/duckdb/factory.py
#	dlt/destinations/impl/filesystem/configuration.py
#	dlt/destinations/impl/motherduck/factory.py
#	dlt/destinations/impl/redshift/factory.py
#	dlt/destinations/impl/snowflake/factory.py
#	dlt/destinations/impl/weaviate/factory.py
#	dlt/helpers/dbt/__init__.py
#	dlt/load/load.py
#	dlt/pipeline/__init__.py
#	dlt/pipeline/pipeline.py
#	tests/cli/test_init_command.py
#	tests/load/test_job_client.py
#	tests/load/test_sql_client.py
#	tests/load/utils.py
#	tests/pipeline/test_pipeline_trace.py
@@ -23,3 +23,19 @@ def __str__(self) -> str:
return str(self.staging_config.credentials)
else:
return "[no staging set]"

if TYPE_CHECKING:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥔🥔🥔

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for all missing mypy init delcarations...

I'm not sure that it makes sense to have the destination name in the config. I think it should be set in code and then you can select a config section based on that name. If you also have the name in the config it might be a bit confusing. Generally I am not quite sure how the setup for a user with multiple named destinations and environments would look like..

yes - that may be good idea. I even proposed to make "name" final in the config. so you can completely take it out.
User setup: if destination is entity independent from pipeline we can have many devel or production BigQuery project and users may want to name them and refer to them in many pipelines.

Should we also store the destination name in the pipeline state? I am not quite sure why the destination info is stored there in the first place, this is one of these (for me) confusing mechanisms where stuff gets stored somewhere and then dlt does thing that are not really reflected in the code. Anyway if we do not save the name but have config segments that are reliant on the name, then the second run might produce other results.

We should store both name and type of the destination in the pipeline state. You'll also need to bump state version and write a migration. The goal is to be able to retrieve exactly the same configuration for the pipeline (now we need both type and name - because the type is not stored in the config - I think we can keep it that way).

Right now it is used by streamlit app and CLI to do dlt.attach and obtain fully functional pipeline object outside of the original pipeline script.
We do not talk much about this in the docs - right. But this is quite powerful way for doing automations.

I have implemented the dynamic configuration resolution in the same way it was done before in the from_reference constructor on the destination. This seems to work good, but it is different than was specified in the ticket. Let me know if this also works. the one things that does not work now though, is to have common credentials for all duckdb destinations if those duckdb destinations have different names for example (if we also have other destinations types)

Heh you are right. But putting the config under [destination] will work

I have also made the dataset configurable via the config, I am not sure why this was not possible before, so maybe I am missing something there and will have to revert that..

I think it makes sense.

dlt/common/destination/reference.py Outdated Show resolved Hide resolved
dlt/common/pipeline.py Show resolved Hide resolved
dlt/destinations/impl/athena/factory.py Outdated Show resolved Hide resolved
dlt/destinations/impl/athena/factory.py Outdated Show resolved Hide resolved
@@ -11,8 +11,8 @@
class PipelineConfiguration(BaseConfiguration):
pipeline_name: Optional[str] = None
pipelines_dir: Optional[str] = None
destination_name: Optional[str] = None
staging_name: Optional[str] = None
destination_type: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep _type to be more explicit. also please link to this PR and add info that we changed the names in this config in here #763

also please add on_resolved and please keep destination_name and staging_name in the config but display deprecation warnings. Steini added a method to warn on deprecation somewhere in the Pipeline class

@@ -11,8 +11,8 @@
class PipelineConfiguration(BaseConfiguration):
pipeline_name: Optional[str] = None
pipelines_dir: Optional[str] = None
destination_name: Optional[str] = None
staging_name: Optional[str] = None
destination_type: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are people that configure destinations via config.

btw. we may keep destination name in the longer run. so in fact we'll create a destination with given type and given name. so let's do that:

  • if only name is present or only type is present, we create a destination with name == type
  • if they are different we pass both as destination reference.
    i think destination reference supports it

@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 27, 2023

Questions:

  • Did I understand correctly that destination_name is not to be marked as deprecated?
  • The pipeline steps currently support a destination and stage argument which can be a reference or a destination_type, should it be possible to passe type and name there too? Or should we just assume people will pass a destination instance with this info if they really want to use it?
  • Does the environment also need to be stored in the pipeline props? Probably yes? Should this maybe be a nested structure and not 3 props for staging and destination each? Maybe the TDestinationReferenceArg should be a union of str, destination instance and an object with three values type, name and environment.

@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 27, 2023

Hm, maybe dataset should also be part of this destination object...

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I understand correctly that destination_name is not to be marked as deprecated?
yes

The pipeline steps currently support a destination and stage argument which can be a reference or a destination_type, should it be possible to passe type and name there too? Or should we just assume people will pass a destination instance with this info if they really want to use it?
right now if someone wants to pass both type and name, s/he must pass the Destination instance. we may support a tuple later (if that is what you mean)

Does the environment also need to be stored in the pipeline props? Probably yes? Should this maybe be a nested structure and not 3 props for staging and destination each? Maybe the TDestinationReferenceArg should be a union of str, destination instance and an object with three values type, name and environment.

You can move destination_name and destination_type to destination but we do not need to add environment there. we just need a minimum information to identify right configuration element. OR we just dump a whole config there (without secrets).

what we keep in state supports only minimum information and assumes that config OR code has the rest. if that is not the case some things ie. CLI commands will not work

let's not change the TDestinationReferenceArg but what you can (should do) is to define a NewType for destination_type that is a special kind of string that can be a full module name or destination from dlt default module. and then use it TDestinationReferenceArg . you'll catch a lot of problems with that!

@@ -125,7 +125,7 @@ def _query_data_live(query: str, schema_name: str = None) -> pd.DataFrame:
schema_names = ", ".join(sorted(pipeline.schema_names))
st.markdown(f"""
* pipeline name: **{pipeline.pipeline_name}**
* destination: **{str(credentials)}** in **{pipeline.destination.name}**
* destination: **{str(credentials)}** in **{pipeline.destination.destination_name}**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can display both name and type?

@@ -11,8 +11,8 @@
class PipelineConfiguration(BaseConfiguration):
pipeline_name: Optional[str] = None
pipelines_dir: Optional[str] = None
destination_name: Optional[str] = None
staging_name: Optional[str] = None
destination_type: Optional[str] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, we keep them

@@ -251,7 +251,7 @@ class Pipeline(SupportsPipeline):
"""A working directory of the pipeline"""
destination: TDestination = None
staging: TDestination = None
"""The destination reference which is ModuleType. `destination.name` returns the name string"""
"""The destination reference which is ModuleType. `destination.destination_name` returns the name string"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is a Destination class now

@@ -1407,18 +1415,18 @@ def _restore_state_from_destination(self) -> Optional[TPipelineState]:
if state is None:
logger.info(
"The state was not found in the destination"
f" {self.destination.name}:{dataset_name}"
f" {self.destination.destination_name}:{dataset_name}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we show also type?

tests/destinations/test_destination_name_and_config.py Outdated Show resolved Hide resolved
"test_pipeline", state_v2, state_v2["_state_engine_version"], STATE_ENGINE_VERSION # type: ignore
)
assert state_v2["destination_name"] == "redshift"
assert state_v2["destination_type"] == "redshift"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks wrong. it must always be a full path to a module that can be imported.

assert state_v2["destination_type"] == "redshift"
assert "destination" not in state_v2
assert state_v2["staging_name"] == "filesystem"
assert state_v2["staging_type"] == "filesystem"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same!

@@ -485,6 +490,11 @@ def test_migrate_state(test_storage: FileStorage) -> None:
assert state["_state_engine_version"] == STATE_ENGINE_VERSION
assert "_local" in state

# check destination migration
assert state["destination_name"] == "dlt.destinations.postgres"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name should be postgres

"""The destination name will either be explicitely set while creating the destination or will be taken from the type"""
return self.config_params.get("name") or self.destination_type
return self.config_params.get("destination_name") or self.destination_type

@property
def destination_type(self) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it returning a full class name? from which we can always recreate instance? I think Stein messed up a little here. destination type is always a full type that you can pass to from_reference and get back the instance (with default config)
this must be tested.
make sure that state keeps a full type in pipeline and that this type is valid - also after migration

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the line below returns wrong type

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe let's make it to return Type? then you'll catch all the bugs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what you mean. We already have the type, no? I have changed it that it will return the full path to the destination module and added / updated tests for this.

"""The destination name will either be explicitely set while creating the destination or will be taken from the type"""
return self.config_params.get("name") or self.destination_type
return self.config_params.get("destination_name") or self.destination_type
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use def to_name(ref: TDestinationReferenceArg) -> str: to get name from type. use it everywhere you do that

@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 27, 2023

let's not change the TDestinationReferenceArg but what you can (should do) is to define a NewType for destination_type that is a special kind of string that can be a full module name or destination from dlt default module. and then use it TDestinationReferenceArg . you'll catch a lot of problems with that!

This i don't get. A module name can just be a string, no? In that case the linter will not be able to differentiate between some wrong string and a valid module name. But I probably am missing sth here.

@sh-rp sh-rp force-pushed the d#/destination_config_updates branch from 472fe51 to 4df026d Compare November 27, 2023 20:57
@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 27, 2023

Changing the detination_type property to the full module path changes the output of some other places to this full path as well. I am not sure if we want that everywhere, or if we want to shorten it to only the class name for dlt destinations, you can let me know if the way it is is alright.
I have two questions in the other comments.

@sh-rp
Copy link
Collaborator Author

sh-rp commented Nov 28, 2023

I can't get tests/cli/test_pipeline_command.py::test_pipeline_command_operations to work. If I use the dlt commands regularly from the console, i can create a pipeline, run it and then drop a table. but somehow in the test it fails and I am not sure why..

This reverts commit 0b624b05b7c0d5880099b7a3d9e9ea2a63378002.
assert load_info.environment == "production"


def test_config_respects_name() -> None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rudolfix here is the test for config with name

@sh-rp sh-rp force-pushed the d#/destination_config_updates branch from 5d05aab to a206c5e Compare November 30, 2023 09:30
@rudolfix rudolfix force-pushed the d#/destination_config_updates branch from 9421551 to fdf0615 Compare November 30, 2023 22:15
@rudolfix rudolfix force-pushed the d#/destination_config_updates branch from 9e9715f to 6568d36 Compare December 1, 2023 07:45
@rudolfix rudolfix marked this pull request as ready for review December 1, 2023 09:08
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is good now! thanks! a few simple tests that make sure that pipeline props to state works right is needed but I'll do this with new ticket

@rudolfix rudolfix merged commit fcb7d5e into devel Dec 1, 2023
39 of 40 checks passed
@rudolfix rudolfix deleted the d#/destination_config_updates branch December 1, 2023 09:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants