-
Notifications
You must be signed in to change notification settings - Fork 176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
destination config updates #783
Conversation
sh-rp
commented
Nov 22, 2023
•
edited
Loading
edited
- Renames destionation_name to destionation_type on DestinationClientConfiguration
- Adds new variables name and environment to DestinationClientConfiguration
- Makes config section respect the destination name.
add name to destination config add environment to destination config
✅ Deploy Preview for dlt-hub-docs canceled.
|
General notes:
|
@@ -11,8 +11,8 @@ | |||
class PipelineConfiguration(BaseConfiguration): | |||
pipeline_name: Optional[str] = None | |||
pipelines_dir: Optional[str] = None | |||
destination_name: Optional[str] = None | |||
staging_name: Optional[str] = None | |||
destination_type: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should just call this destination?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep _type
to be more explicit. also please link to this PR and add info that we changed the names in this config in here #763
also please add on_resolved
and please keep destination_name
and staging_name
in the config but display deprecation warnings. Steini added a method to warn on deprecation somewhere in the Pipeline class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are people that configure destinations via config.
btw. we may keep destination name in the longer run. so in fact we'll create a destination with given type and given name. so let's do that:
- if only name is present or only type is present, we create a destination with name == type
- if they are different we pass both as destination reference.
i think destination reference supports it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so we do not deprecated destination_name and staging_name, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we keep them
# Conflicts: # dlt/common/destination/reference.py # dlt/common/pipeline.py # dlt/destinations/impl/bigquery/factory.py # dlt/destinations/impl/duckdb/factory.py # dlt/destinations/impl/filesystem/configuration.py # dlt/destinations/impl/motherduck/factory.py # dlt/destinations/impl/redshift/factory.py # dlt/destinations/impl/snowflake/factory.py # dlt/destinations/impl/weaviate/factory.py # dlt/helpers/dbt/__init__.py # dlt/load/load.py # dlt/pipeline/__init__.py # dlt/pipeline/pipeline.py # tests/cli/test_init_command.py # tests/load/test_job_client.py # tests/load/test_sql_client.py # tests/load/utils.py # tests/pipeline/test_pipeline_trace.py
@@ -23,3 +23,19 @@ def __str__(self) -> str: | |||
return str(self.staging_config.credentials) | |||
else: | |||
return "[no staging set]" | |||
|
|||
if TYPE_CHECKING: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🥔🥔🥔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for all missing mypy init delcarations...
I'm not sure that it makes sense to have the destination name in the config. I think it should be set in code and then you can select a config section based on that name. If you also have the name in the config it might be a bit confusing. Generally I am not quite sure how the setup for a user with multiple named destinations and environments would look like..
yes - that may be good idea. I even proposed to make "name" final in the config. so you can completely take it out.
User setup: if destination is entity independent from pipeline we can have many devel or production BigQuery project and users may want to name them and refer to them in many pipelines.
Should we also store the destination name in the pipeline state? I am not quite sure why the destination info is stored there in the first place, this is one of these (for me) confusing mechanisms where stuff gets stored somewhere and then dlt does thing that are not really reflected in the code. Anyway if we do not save the name but have config segments that are reliant on the name, then the second run might produce other results.
We should store both name and type of the destination in the pipeline state. You'll also need to bump state version and write a migration. The goal is to be able to retrieve exactly the same configuration for the pipeline (now we need both type and name - because the type is not stored in the config - I think we can keep it that way).
Right now it is used by streamlit app and CLI to do dlt.attach
and obtain fully functional pipeline object outside of the original pipeline script.
We do not talk much about this in the docs - right. But this is quite powerful way for doing automations.
I have implemented the dynamic configuration resolution in the same way it was done before in the from_reference constructor on the destination. This seems to work good, but it is different than was specified in the ticket. Let me know if this also works. the one things that does not work now though, is to have common credentials for all duckdb destinations if those duckdb destinations have different names for example (if we also have other destinations types)
Heh you are right. But putting the config under [destination]
will work
I have also made the dataset configurable via the config, I am not sure why this was not possible before, so maybe I am missing something there and will have to revert that..
I think it makes sense.
@@ -11,8 +11,8 @@ | |||
class PipelineConfiguration(BaseConfiguration): | |||
pipeline_name: Optional[str] = None | |||
pipelines_dir: Optional[str] = None | |||
destination_name: Optional[str] = None | |||
staging_name: Optional[str] = None | |||
destination_type: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd keep _type
to be more explicit. also please link to this PR and add info that we changed the names in this config in here #763
also please add on_resolved
and please keep destination_name
and staging_name
in the config but display deprecation warnings. Steini added a method to warn on deprecation somewhere in the Pipeline class
@@ -11,8 +11,8 @@ | |||
class PipelineConfiguration(BaseConfiguration): | |||
pipeline_name: Optional[str] = None | |||
pipelines_dir: Optional[str] = None | |||
destination_name: Optional[str] = None | |||
staging_name: Optional[str] = None | |||
destination_type: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are people that configure destinations via config.
btw. we may keep destination name in the longer run. so in fact we'll create a destination with given type and given name. so let's do that:
- if only name is present or only type is present, we create a destination with name == type
- if they are different we pass both as destination reference.
i think destination reference supports it
Questions:
|
Hm, maybe dataset should also be part of this destination object... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I understand correctly that destination_name is not to be marked as deprecated?
yes
The pipeline steps currently support a destination and stage argument which can be a reference or a destination_type, should it be possible to passe type and name there too? Or should we just assume people will pass a destination instance with this info if they really want to use it?
right now if someone wants to pass both type and name, s/he must pass theDestination
instance. we may support a tuple later (if that is what you mean)
Does the environment also need to be stored in the pipeline props? Probably yes? Should this maybe be a nested structure and not 3 props for staging and destination each? Maybe the TDestinationReferenceArg should be a union of str, destination instance and an object with three values type, name and environment.
You can move destination_name
and destination_type
to destination
but we do not need to add environment there. we just need a minimum information to identify right configuration element. OR we just dump a whole config there (without secrets).
what we keep in state supports only minimum information and assumes that config OR code has the rest. if that is not the case some things ie. CLI commands will not work
let's not change the TDestinationReferenceArg
but what you can (should do) is to define a NewType for destination_type
that is a special kind of string that can be a full module name or destination from dlt default module. and then use it TDestinationReferenceArg
. you'll catch a lot of problems with that!
dlt/helpers/streamlit_helper.py
Outdated
@@ -125,7 +125,7 @@ def _query_data_live(query: str, schema_name: str = None) -> pd.DataFrame: | |||
schema_names = ", ".join(sorted(pipeline.schema_names)) | |||
st.markdown(f""" | |||
* pipeline name: **{pipeline.pipeline_name}** | |||
* destination: **{str(credentials)}** in **{pipeline.destination.name}** | |||
* destination: **{str(credentials)}** in **{pipeline.destination.destination_name}** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can display both name and type?
@@ -11,8 +11,8 @@ | |||
class PipelineConfiguration(BaseConfiguration): | |||
pipeline_name: Optional[str] = None | |||
pipelines_dir: Optional[str] = None | |||
destination_name: Optional[str] = None | |||
staging_name: Optional[str] = None | |||
destination_type: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, we keep them
dlt/pipeline/pipeline.py
Outdated
@@ -251,7 +251,7 @@ class Pipeline(SupportsPipeline): | |||
"""A working directory of the pipeline""" | |||
destination: TDestination = None | |||
staging: TDestination = None | |||
"""The destination reference which is ModuleType. `destination.name` returns the name string""" | |||
"""The destination reference which is ModuleType. `destination.destination_name` returns the name string""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is a Destination class now
dlt/pipeline/pipeline.py
Outdated
@@ -1407,18 +1415,18 @@ def _restore_state_from_destination(self) -> Optional[TPipelineState]: | |||
if state is None: | |||
logger.info( | |||
"The state was not found in the destination" | |||
f" {self.destination.name}:{dataset_name}" | |||
f" {self.destination.destination_name}:{dataset_name}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we show also type?
"test_pipeline", state_v2, state_v2["_state_engine_version"], STATE_ENGINE_VERSION # type: ignore | ||
) | ||
assert state_v2["destination_name"] == "redshift" | ||
assert state_v2["destination_type"] == "redshift" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks wrong. it must always be a full path to a module that can be imported.
assert state_v2["destination_type"] == "redshift" | ||
assert "destination" not in state_v2 | ||
assert state_v2["staging_name"] == "filesystem" | ||
assert state_v2["staging_type"] == "filesystem" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same!
@@ -485,6 +490,11 @@ def test_migrate_state(test_storage: FileStorage) -> None: | |||
assert state["_state_engine_version"] == STATE_ENGINE_VERSION | |||
assert "_local" in state | |||
|
|||
# check destination migration | |||
assert state["destination_name"] == "dlt.destinations.postgres" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name should be postgres
"""The destination name will either be explicitely set while creating the destination or will be taken from the type""" | ||
return self.config_params.get("name") or self.destination_type | ||
return self.config_params.get("destination_name") or self.destination_type | ||
|
||
@property | ||
def destination_type(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it returning a full class name? from which we can always recreate instance? I think Stein messed up a little here. destination type is always a full type that you can pass to from_reference
and get back the instance (with default config)
this must be tested.
make sure that state keeps a full type in pipeline and that this type is valid - also after migration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the line below returns wrong type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe let's make it to return Type? then you'll catch all the bugs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what you mean. We already have the type, no? I have changed it that it will return the full path to the destination module and added / updated tests for this.
dlt/common/destination/reference.py
Outdated
"""The destination name will either be explicitely set while creating the destination or will be taken from the type""" | ||
return self.config_params.get("name") or self.destination_type | ||
return self.config_params.get("destination_name") or self.destination_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use def to_name(ref: TDestinationReferenceArg) -> str:
to get name from type. use it everywhere you do that
This i don't get. A module name can just be a string, no? In that case the linter will not be able to differentiate between some wrong string and a valid module name. But I probably am missing sth here. |
472fe51
to
4df026d
Compare
Changing the detination_type property to the full module path changes the output of some other places to this full path as well. I am not sure if we want that everywhere, or if we want to shorten it to only the class name for dlt destinations, you can let me know if the way it is is alright. |
I can't get tests/cli/test_pipeline_command.py::test_pipeline_command_operations to work. If I use the dlt commands regularly from the console, i can create a pipeline, run it and then drop a table. but somehow in the test it fails and I am not sure why.. |
This reverts commit 0b624b05b7c0d5880099b7a3d9e9ea2a63378002.
assert load_info.environment == "production" | ||
|
||
|
||
def test_config_respects_name() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rudolfix here is the test for config with name
# Conflicts: # Makefile # dlt/pipeline/pipeline.py # dlt/pipeline/state_sync.py # tests/pipeline/test_pipeline_state.py
5d05aab
to
a206c5e
Compare
9421551
to
fdf0615
Compare
9e9715f
to
6568d36
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is good now! thanks! a few simple tests that make sure that pipeline props to state works right is needed but I'll do this with new ticket