add parquet data writer #403
Conversation
Some notes: I'd like to test the pipeline implementations to see whether the data collected is correct. Is there some kind of way of easily accessing the destination?
it looks really good. kudos for fixing a problem with double rotation of the file. I could not find that and I spent so much time tracing it 👍
I'd like to test the pipeline implementations to see whether the data collected is correct. Is there some kind of way of easily accessing the destination?
in the case of destinations supporting SQL there is the sql client that you know from the verified pipelines.
in the case of file-based destinations we have an internal client in the pipeline, i.e. in test_filesystem_pipeline:
import posixpath

from dlt.destinations.filesystem.filesystem import FilesystemClient

client: FilesystemClient = pipeline._destination_client()  # type: ignore[assignment]
append_glob = posixpath.join(client.dataset_path, 'some_source.some_data.*')
replace_glob = posixpath.join(client.dataset_path, 'some_source.other_data.*')
append_files = client.fs_client.glob(append_glob)
replace_files = client.fs_client.glob(replace_glob)
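For example, the test can then assert on those lists (illustrative only; the expected counts depend on the test data):

# e.g. check that both dispositions produced load files
assert len(append_files) > 0
assert len(replace_files) > 0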
I'm not so sure about the configuration. I don't think the class_factory is the right place, but I also don't know which place would be.
look at how configuration is added in buffered.py:
class BufferedDataWriter:

    @configspec
    class BufferedDataWriterConfiguration(BaseConfiguration):
        buffer_max_items: int = 5000
        file_max_items: Optional[int] = None
        file_max_bytes: Optional[int] = None
        _caps: Optional[DestinationCapabilitiesContext] = None

        __section__ = known_sections.DATA_WRITER

    @with_config(spec=BufferedDataWriterConfiguration)
    def __init__(
        self,
        file_format: TLoaderFileFormat,
        file_name_template: str,
        *,
        buffer_max_items: int = 5000,
        file_max_items: int = None,
        file_max_bytes: int = None,
        _caps: DestinationCapabilitiesContext = None
    ):
you can do the same for ParquetDataWriter
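A minimal sketch of how the same pattern could look for the parquet writer (the spec name, defaults, base class and constructor signature below are assumptions, not the final code of this PR):

@configspec
class ParquetDataWriterConfiguration(BaseConfiguration):
    flavor: str = "spark"
    version: str = "2.4"
    data_page_size: int = 1024 * 1024

    __section__ = known_sections.DATA_WRITER


class ParquetDataWriter(DataWriter):
    @with_config(spec=ParquetDataWriterConfiguration)
    def __init__(
        self,
        f: Any,
        _caps: DestinationCapabilitiesContext = None,
        *,
        flavor: str = "spark",
        version: str = "2.4",
        data_page_size: int = 1024 * 1024
    ) -> None:
        # the injected values can then be passed straight to pyarrow's ParquetWriter
        ...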
@@ -82,7 +82,7 @@ def write_data_item(self, item: TDataItems, columns: TTableSchemaColumns) -> None:
    if self.file_max_bytes and self._file.tell() >= self.file_max_bytes:
        self._rotate_file()
    # rotate on max items
-   if self.file_max_items and self._writer.items_count >= self.file_max_items:
+   elif self.file_max_items and self._writer.items_count >= self.file_max_items:
dlt/common/data_writers/writers.py
Outdated
from dlt.helpers.parquet_helper import pq

# build schema
self.schema = pyarrow.schema([pyarrow.field(name, self.get_data_type(schema_item["data_type"]), nullable=schema_item["nullable"]) for name, schema_item in columns_schema.items()])
this is actually quite a useful function. could we separate it out? a function that gets a pyarrow schema from a dlt table schema.
should we maybe move parquet out of the common file for data writers?
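A sketch of what that separated helper could look like (the function name and the way the type mapping is passed in are assumptions):

from typing import Any, Callable

import pyarrow

from dlt.common.schema.typing import TTableSchemaColumns


def pyarrow_schema_from_columns(columns_schema: TTableSchemaColumns, get_data_type: Callable[[str], Any]) -> pyarrow.Schema:
    # one pyarrow field per dlt column, keeping the nullability flag from the dlt schema
    return pyarrow.schema([
        pyarrow.field(name, get_data_type(schema_item["data_type"]), nullable=schema_item["nullable"])
        for name, schema_item in columns_schema.items()
    ])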
dlt/common/data_writers/writers.py
Outdated
self.schema = pyarrow.schema([pyarrow.field(name, self.get_data_type(schema_item["data_type"]), nullable=schema_item["nullable"]) for name, schema_item in columns_schema.items()])
# find row items that are of the complex type (could be abstracted out for use in other writers?)
self.complex_indices = [i for i, field in columns_schema.items() if field["data_type"] == "complex"]
self.writer = pq.ParquetWriter(self._f, self.schema, flavor="spark")
flavor should be in configuration. I think it is somewhere in the ticket
dlt/common/data_writers/writers.py
Outdated
table = pyarrow.Table.from_pylist(rows, schema=self.schema)
# Write chunks of data
for i in range(0, len(rows), 100):
imo we should just write the table as a whole? I just wonder if that allocates double the memory.
https://arrow.apache.org/docs/python/generated/pyarrow.RecordBatch.html#pyarrow.RecordBatch.from_pylist
and some hallucinations from gpt-4 which are mostly right:
import pyarrow as pa
import pyarrow.parquet as pq

# Your list of dictionaries, broken into chunks
chunks = [
    [{'column1': 'value1', 'column2': 'value2'}, {'column1': 'value3', 'column2': 'value4'}],
    [{'column1': 'value5', 'column2': 'value6'}, {'column1': 'value7', 'column2': 'value8'}],
    # More chunks...
]

# Define the schema
schema = pa.schema([
    ('column1', pa.string()),
    ('column2', pa.string())
])

# Open the binary file you want to write to
with open('output.parquet', 'wb') as binary_file:
    # Create a ParquetWriter with the file and schema
    writer = pq.ParquetWriter(binary_file, schema)

    # Write each chunk to the file
    for chunk in chunks:
        # Create a record batch from the chunk (a list of row dicts)
        batch = pa.RecordBatch.from_pylist(chunk, schema=schema)
        # Write the batch to the file
        writer.write_table(pa.Table.from_batches([batch]))

    # Close the writer when done
    writer.close()
I am just writing the whole table now, that is what you wanted, right?
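Roughly, a sketch of that whole-table write, reusing the names from the diff above:

table = pyarrow.Table.from_pylist(rows, schema=self.schema)
self.writer.write_table(table)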
dlt/pipeline/__init__.py
Outdated
@@ -148,7 +152,7 @@ def attach(
    pipelines_dir = get_dlt_pipelines_dir()
    progress = collector_from_name(progress)
    # create new pipeline instance
-   p = Pipeline(pipeline_name, pipelines_dir, pipeline_salt, None, None, None, None, None, full_refresh, progress, True, last_config(**kwargs), kwargs["runtime"])
+   p = Pipeline(pipeline_name, pipelines_dir, pipeline_salt, None, None, None, None, None, None, full_refresh, progress, True, last_config(**kwargs), kwargs["runtime"])
I'd rather pass it to the run method. it is way more relevant there and more expected; it can then be passed along to normalize without even storing it in the pipeline
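The suggested usage would then look roughly like this (a sketch; data stands in for whatever source or resource is being loaded):

import dlt

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="bigquery")
# pass the file format per run instead of storing it on the pipeline
pipeline.run(data, loader_file_format="parquet")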
dlt/pipeline/__init__.py
Outdated
@@ -21,6 +22,7 @@ def pipeline(
    pipelines_dir: str = None,
    pipeline_salt: TSecretValue = None,
    destination: TDestinationReferenceArg = None,
+   loader_file_format: TLoaderFileFormat = None,
if the function arguments are documented please always update the docs
it's in the run function now of course, but I'm not quite sure where to update this since we don't have a proper reference at this point (or so it seems). I think a page with all available config options would be cool.
(for me a reference is a place where I can find all the parts that a lib has to offer, even the ones that are not mentioned in more prosaic docs, if that makes sense)
from dlt.destinations.filesystem.filesystem import FilesystemClient, LoadFilesystemJob
from dlt.common.schema.typing import LOADS_TABLE_NAME


def test_pipeline_parquet_bigquery_destination() -> None:
let's move that to tests/load/pipelines. we can split them between test_pipelines and test_filesystem_pipeline.
there we have automatic fixtures set up. they are almost the same as in the verified sources repo
done
dlt/common/data_writers/writers.py
Outdated
def write_data(self, rows: Sequence[Any]) -> None:
    super().write_data(rows)
    from dlt.helpers.parquet_helper import pyarrow
we should not include helpers here. helpers are intended to use common, not vice versa. my take would be to move this to a separate file as mentioned earlier and import pyarrow directly (same as done in the helpers - with the nice exception).
the other option would be to move this writer to helpers, but then we need to rewrite the class factory to allow registration of new writers.
the common code has many places like this, i.e. some credentials lazily import bigquery or boto3. I could not find any good pattern to do that. the guiding principle is that dlt can be used with minimal (default) credentials and it asks for extras via these specialized exceptions
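A sketch of that lazy-import pattern (the import path and the exception's constructor arguments are assumptions):

# assumed import path for the specialized exception
from dlt.common.exceptions import MissingDependencyException

try:
    import pyarrow
    import pyarrow.parquet as pq
except ModuleNotFoundError:
    # ask the user to install the optional extra instead of failing with a bare ImportError
    # (exact arguments are an assumption)
    raise MissingDependencyException("ParquetDataWriter", ["pyarrow"], "pyarrow is required to write parquet files")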
I have moved the parquet stuff now into a newly created libs folder, maybe this would be a nice pattern for the wrappers of external libraries, not sure, lmk
dlt/normalize/normalize.py
Outdated
) -> TWorkerRV:
    schema_updates: List[TSchemaUpdate] = []
    total_items = 0
    # process all files with data items and write to buffered item storage
    with Container().injectable_context(destination_caps):
        schema = Schema.from_stored_schema(stored_schema)
-       load_storage = LoadStorage(False, destination_caps.preferred_loader_file_format, LoadStorage.ALL_SUPPORTED_FILE_FORMATS, loader_storage_config)
+       load_storage = LoadStorage(False, loader_file_format or destination_caps.preferred_loader_file_format, LoadStorage.ALL_SUPPORTED_FILE_FORMATS, loader_storage_config)
somehow you pass a list here, which shows up in the test (a list is unhashable)
@@ -18,7 +18,7 @@ def _configure(config: BigQueryClientConfiguration = config.value) -> BigQueryClientConfiguration:
def capabilities() -> DestinationCapabilitiesContext:
    caps = DestinationCapabilitiesContext()
    caps.preferred_loader_file_format = "jsonl"
-   caps.supported_loader_file_formats = ["jsonl", "sql"]
+   caps.supported_loader_file_formats = ["jsonl", "parquet"]
removing sql fails the bigquery tests. sql is a valid file type that gets executed during loading and contains transformations.

fixed the issue in normalize tests (the default file type for filesystem)
this looks good! all my comments are rather minor. the only thing we need to do is to add parquet to our docs. for that we need a new section:
dlt ecosystem/file formats: please add parquet there, with some basic info (pyarrow, how to configure, how to pass file format to run or to pipeline via config etc.)
i'll add a section on jsonl
let's do a separate PR for that
class ParquetDataWriterConfiguration(BaseConfiguration):
    flavor: str = "spark"
    version: str = "2.4"
    data_page_size: int = 1024 * 1024
cool!
let's add a section:
__section__ = known_sections.DATA_WRITER
then "data_writer" is always mandatory and it will blend in with generic settings in buffered
@@ -115,3 +115,7 @@ def test_string_literal_escape_unicode() -> None:
    assert escape_redshift_literal("イロハニホヘト チリヌルヲ ワカヨタレソ ツネナラム") == "'イロハニホヘト チリヌルヲ ワカヨタレソ ツネナラム'"
    assert escape_redshift_identifier("ąćł\"") == '"ąćł"""'
    assert escape_redshift_identifier("イロハニホヘト チリヌルヲ \"ワカヨタレソ ツネナラム") == '"イロハニホヘト チリヌルヲ ""ワカヨタレソ ツネナラム"'
+
+
+def test_parquet_writer() -> None:
then remove placeholder?
os.environ["NORMALIZE__DATA_WRITER__VERSION"] = "2.0" | ||
os.environ["NORMALIZE__DATA_WRITER__DATA_PAGE_SIZE"] = str(1024 * 512) | ||
|
||
with inject_section(ConfigSectionContext(pipeline_name=None, sections=("normalize", "data_writer",))): |
OK I get where the problem is coming from: look at this:
class BufferedDataWriterConfiguration(BaseConfiguration):
    buffer_max_items: int = 5000
    file_max_items: Optional[int] = None
    file_max_bytes: Optional[int] = None
    disable_compression: bool = False
    _caps: Optional[DestinationCapabilitiesContext] = None

    __section__ = known_sections.DATA_WRITER
you need to add __section__ to the config, not section. then you can drop "data_writer" from the tuple of sections in the test and it will work. good that we have this test
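i.e. a sketch of the suggested change to the test, keeping only the normalize section:

with inject_section(ConfigSectionContext(pipeline_name=None, sections=("normalize",))):
    ...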
Implements ticket #389