Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/fsspec destination #342

Merged
merged 52 commits into from
May 26, 2023
Merged

Feat/fsspec destination #342

merged 52 commits into from
May 26, 2023

Conversation

steinitzu
Copy link
Collaborator

WIP

@netlify
Copy link

netlify bot commented May 19, 2023

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit c530da4
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6470d9b5252f190008b5027a

Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good! decorator to create dynamic type hints seems to work well.

we'll need a few tests for it for a few mock credentials cases, somewhere in tests/common/configuration

dlt/common/configuration/specs/aws_credentials.py Outdated Show resolved Hide resolved
dlt/common/configuration/specs/aws_credentials.py Outdated Show resolved Hide resolved
dlt/common/configuration/specs/aws_credentials.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a few changes

  • default buckets are defined in tests/config.toml ENV override will still work
  • changes the scheme from gcs to gs
  • some fixes when passing gcp credentials

my biggest issue is lack of test coverage, the client is tested in replace mode. you do not test if append is adding more files. not sure if you test if replace is testing if files are deleted.
you do
I do not see you testing if file layout is exactly as in ticket spec
I do not see tests for merge special case

Please add tests where filesystem is used as destination in pipelines. Maybe some of existing tests in load/pipelines could be converted (separating parts that require sql_client, you could look if files are present instead). it is some work but without that it is hard to say if this thing works


# - name: Install self
# run: poetry install --no-interaction

- run: |
poetry run pytest tests/load tests/cli --ignore=tests/load/bigquery -k '(not bigquery)'
poetry run pytest tests/load/filesystem tests/cli --ignore=tests/load/bigquery -k '(not bigquery)'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this change after all tests passed

@@ -117,6 +117,9 @@ def _resolve_config_fields(
unresolved_fields: Dict[str, Sequence[LookupTrace]] = {}

for key, hint in fields.items():
if key in config.__hint_resolvers__:
# Type hint for this field is created dynamically
hint = config.__hint_resolvers__[key](config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please raise a nice configuration error when the key is not available, the key name the available hint resolves and the configuration type being resolved should be present. the message should be clear so user can fix the problem

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an exception with message when the decorator is used without matching fields, e.g. the message:

The config spec ConfigWithInvalidDynamicType has dynamic type resolvers for fields: ['a', 'b', 'c'] but these fields are not defined in the spec.
When using @resolve_type() decorator, Add the fields with 'Any' or another common type hint, example:

>>> class ConfigWithInvalidDynamicType(BaseConfiguration)
>>>    a: Any
>>>    b: Any
>>>    c: Any

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear what you mean with the other cases. The standard config fields missing exception is raised as normal when e.g. aws credentials are missing.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good now, thanks!

dlt/destinations/filesystem/configuration.py Outdated Show resolved Hide resolved
dlt/destinations/filesystem/filesystem.py Outdated Show resolved Hide resolved
tests/common/configuration/test_credentials.py Outdated Show resolved Hide resolved
tests/load/filesystem/conftest.py Outdated Show resolved Hide resolved
tests/load/filesystem/test_filesystem_client.py Outdated Show resolved Hide resolved
tests/load/filesystem/test_filesystem_client.py Outdated Show resolved Hide resolved
tests/common/configuration/test_credentials.py Outdated Show resolved Hide resolved
dlt/destinations/filesystem/filesystem.py Outdated Show resolved Hide resolved
def create_merge_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
return None

def complete_load(self, load_id: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at our use cases and actually it would be useful to create a file here. the reason: when the file is present we now that load to the bucket was finalized and we can use the loaded data. You could probably execute LoadFileSystemJob. write_disposition: append, the file can be empty, and the table name is LOADS_TABLE_NAME (_dlt_loads)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

63a97ec#diff-524215919bce1cfd1cd9bae6d4ae82c9f753d4b3ffb1215a418dd06277113a5dR107-R110
Will this do? Just creates empty file named {schema_name}.{_dlt_loads}.{load_id} . Figure extension and file ID is not needed since it's just a marker per load ID?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it is not needed. thx!

@steinitzu steinitzu marked this pull request as ready for review May 26, 2023 03:58
rudolfix
rudolfix previously approved these changes May 26, 2023
@@ -117,6 +117,9 @@ def _resolve_config_fields(
unresolved_fields: Dict[str, Sequence[LookupTrace]] = {}

for key, hint in fields.items():
if key in config.__hint_resolvers__:
# Type hint for this field is created dynamically
hint = config.__hint_resolvers__[key](config)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good now, thanks!

def create_merge_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
return None

def complete_load(self, load_id: str) -> None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, it is not needed. thx!

@rudolfix rudolfix merged commit 00f94cb into devel May 26, 2023
@rudolfix rudolfix deleted the feat/fsspec-destination branch May 26, 2023 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants