ARROW-7625: [Parquet][GLib] Add support for writer properties #6336
The parquet::WriterProperties::Builder API is useful in C++ because we can use method chaining to build properties:
builder
.memory_pool(pool)
.enable_dictionary()
.data_pagesize(1024);
But this API isn't useful in C because C has no method chaining. We would have to nest the calls:
gparquet_writer_properties_data_pagesize(
gparquet_writer_properties_enable_dictionary(
gparquet_writer_properties_memory_pool(builder, pool)
),
1024);
Or
gparquet_writer_properties_builder_memory_pool(builder, pool);
gparquet_writer_properties_builder_enable_dictionary(builder);
gparquet_writer_properties_builder_data_pagesize(builder, 1024);
So we don't need to provide the builder API to users.
We can just use parquet::WriterProperties::Builder internally and provide only an accessor API:
gparquet_writer_properties_set_memory_pool(properties, pool);
gparquet_writer_properties_enable_dictionary(properties);
gparquet_writer_properties_set_data_pagesize(properties, 1024);
And we can build parquet::WriterProperties when it's needed:
gparquet_arrow_file_writer_new_*()
{
...
auto parquet_writer_properties =
gparquet_writer_properties_get_raw(writer_properties);
...
}
Could you change the API, or shall I push a base implementation?
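A minimal sketch of the accessor-style design proposed above, written in plain C++ so that it compiles without Parquet. `SketchProperties`, `SketchBuilder` and `PropertiesWrapper` are hypothetical stand-ins (for parquet::WriterProperties, its Builder and GParquetWriterProperties respectively) and are not the code in this PR:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for parquet::WriterProperties, reduced to two settings.
struct SketchProperties {
  bool dictionary_enabled;
  int64_t data_pagesize;
};

// Hypothetical stand-in for parquet::WriterProperties::Builder.
// Setters return the builder itself so that C++ callers can chain them.
class SketchBuilder {
 public:
  SketchBuilder* enable_dictionary() { dictionary_enabled_ = true; return this; }
  SketchBuilder* disable_dictionary() { dictionary_enabled_ = false; return this; }
  SketchBuilder* data_pagesize(int64_t size) { data_pagesize_ = size; return this; }
  std::shared_ptr<SketchProperties> build() const {
    return std::make_shared<SketchProperties>(
        SketchProperties{dictionary_enabled_, data_pagesize_});
  }

 private:
  bool dictionary_enabled_ = true;
  int64_t data_pagesize_ = 1024 * 1024;
};

// Hypothetical wrapper in the role of GParquetWriterProperties:
// callers only see plain setters, the builder stays an implementation detail.
class PropertiesWrapper {
 public:
  void set_data_pagesize(int64_t size) {
    assert(size > 0);  // this sketch treats a non-positive size as a caller error
    builder_.data_pagesize(size);
  }
  void enable_dictionary() { builder_.enable_dictionary(); }
  void disable_dictionary() { builder_.disable_dictionary(); }

  // Counterpart of the proposed gparquet_writer_properties_get_raw():
  // the immutable properties object is only built when the writer needs it.
  std::shared_ptr<SketchProperties> get_raw() const { return builder_.build(); }

 private:
  SketchBuilder builder_;
};
```

Each setter maps to one flat C function taking the properties object as its first argument, which is the shape the accessor API above has.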
Could you enable Parquet on macOS CI?
diff --git a/.github/workflows/ruby.yml b/.github/workflows/ruby.yml
index bc28c6f54..f9ba56b67 100644
--- a/.github/workflows/ruby.yml
+++ b/.github/workflows/ruby.yml
@@ -85,6 +85,7 @@ jobs:
ARROW_HOME: /usr/local
ARROW_JEMALLOC: OFF
ARROW_ORC: OFF
+ ARROW_PARQUET: ON
ARROW_WITH_BROTLI: ON
ARROW_WITH_LZ4: ON
ARROW_WITH_SNAPPY: ON
Thank you for your review.
Could you check my comments?
static void
gparquet_writer_properties_init(GParquetWriterProperties *object)
{
We can always create parquet::WriterProperties::Builder here.
We don't need to receive it as an argument of g_object_new().
gparquet_writer_properties_get_raw(GParquetWriterProperties *properties)
{
  auto priv = GPARQUET_WRITER_PROPERTIES_GET_PRIVATE(properties);
  return priv->builder->build();
I want to cache the build() result:
if (priv->changed) {
priv->properties = priv->builder->build();
}
return priv->properties;
gparquet_writer_properties_get_compression(GParquetWriterProperties *properties)
{
  auto priv = GPARQUET_WRITER_PROPERTIES_GET_PRIVATE(properties);
  return priv->compression_type;
I don't want to keep duplicated information in this object.
auto parquet_properties = gparquet_writer_properties_get_raw(properties);
auto parquet_column_path = parquet::schema::ColumnPath::FromDotString(dotstring); // Or receive as an argument
auto arrow_compression = parquet_properties->compression(parquet_column_path);
return garrow_compression_type_from_raw(arrow_compression);
Thank you for your comments.
Nice catch with the other configurables (dictionary, memory pool) and exposing the per-column settings. I was happy with:
parquet::WriterProperties::Builder builder;
builder.compression(arrow::Compression::SNAPPY);
auto parquet_writer_properties = builder.build();
:P
@ziggythehamster What do you mean? Do you want to use |
I hope I've addressed all review comments. In addition, I've added some properties.
Could you check my comments?
}

/**
 * gparquet_writer_properties_get_compression:
Could you append _dot_string? We will add a parquet::schema::ColumnPath version later.
/**
 * gparquet_writer_properties_get_compression:
 * @properties: A #GParquetWriterProperties.
 * @dotstring: The dot string path.
@dot_string
/**
 * gparquet_writer_properties_dictionary_enabled:
 * @properties: A #GParquetWriterProperties.
 *
@dot_string is missing.
}

/**
 * gparquet_writer_properties_dictionary_enabled:
Could you add is_ because this is a predicate? For example: gparquet_writer_properties_is_dictionary_enabled.
/**
 * gparquet_writer_properties_set_dictionary_pagesize_limit:
 * @properties: A #GParquetWriterProperties.
 * @dictionary_pagesize_limit: The dictionary page size limit.
We can simplify this to @limit.
/**
 * gparquet_writer_properties_set_data_pagesize:
 * @properties: A #GParquetWriterProperties.
 * @data_pagesize: The data page size.
Should we use page_size instead of pagesize?
@@ -133,18 +398,29 @@ gparquet_arrow_file_writer_class_init(GParquetArrowFileWriterClass *klass)
GParquetArrowFileWriter *
gparquet_arrow_file_writer_new_arrow(GArrowSchema *schema,
                                     GArrowOutputStream *sink,
                                     GParquetWriterProperties *writer_properties,
This breaks API backward compatibility but it may be acceptable...
if (priv->changed) {
  priv->properties = priv->builder->build();
}
priv->changed = FALSE;
Could you move this into the if (priv->changed) { clause?
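For illustration, here is a minimal sketch of that caching pattern with the flag reset inside the branch. `Snapshot`, `CachedProperties` and `build_count()` are hypothetical names used only for this example; the real code holds a parquet::WriterProperties::Builder rather than a single integer:

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for the immutable parquet::WriterProperties.
struct Snapshot {
  int64_t data_pagesize;
};

class CachedProperties {
 public:
  void set_data_pagesize(int64_t size) {
    assert(size > 0);  // this sketch treats a non-positive size as a caller error
    data_pagesize_ = size;
    changed_ = true;   // every setter invalidates the cached snapshot
  }

  // Rebuilds only when a setter ran since the last call.
  std::shared_ptr<Snapshot> get_raw() {
    if (changed_) {
      cached_ = std::make_shared<Snapshot>(Snapshot{data_pagesize_});
      changed_ = false;  // reset inside the branch, next to the rebuild
      ++build_count_;
    }
    return cached_;
  }

  // Only here so the example can show how often a rebuild happened.
  int build_count() const { return build_count_; }

 private:
  int64_t data_pagesize_ = 1024 * 1024;
  bool changed_ = true;  // starts dirty so the first get_raw() builds
  std::shared_ptr<Snapshot> cached_;
  int build_count_ = 0;
};
```

Keeping the reset next to the rebuild makes the two statements read as one unit and avoids a redundant write on the cached path.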
def test_compression
  @properties.compression = :gzip
  assert_equal(Arrow::CompressionType.new("gzip"),
               @properties.get_compression("a_column"))
Could you use not-specified or something instead of a_column?
 */
void
gparquet_writer_properties_set_compression(GParquetWriterProperties *properties,
                                           GArrowCompressionType compression_type)
How about adding a nullable path or dot_string argument?
gparquet_writer_properties_set_compression(GParquetWriterProperties *properties,
                                           const gchar *path,
                                           GArrowCompressionType compression_type)
I've added a nullable path.
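A rough sketch of how a nullable path can select between the default and a per-column setting. It assumes a null path means "all columns" and that a column without its own setting falls back to the default; `Codec` and `CompressionSettings` are hypothetical names for illustration, not Parquet or Parquet GLib types:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the compression type enum.
enum class Codec { Uncompressed, Snappy, Gzip };

class CompressionSettings {
 public:
  // path == nullptr: set the default used by every column.
  // path != nullptr: override the setting for that dot-string path only.
  void set_compression(const char* path, Codec codec) {
    // This sketch treats an empty string as a caller error.
    assert(path == nullptr || *path != '\0');
    if (path == nullptr) {
      default_codec_ = codec;
    } else {
      per_column_[path] = codec;
    }
  }

  // Columns without an explicit override report the default.
  Codec get_compression(const std::string& path) const {
    auto it = per_column_.find(path);
    return it == per_column_.end() ? default_codec_ : it->second;
  }

 private:
  Codec default_codec_ = Codec::Uncompressed;
  std::map<std::string, Codec> per_column_;
};
```

Under these assumptions, a test that sets only the default can query any unconfigured column name, such as "not-specified", and expect the default back.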
Thank you for your review. I think I've addressed them.
…properties_get_compression_dot_string
…er_properties_is_dictionary_enabled
We will use "column_path" for Parquet::Schema::ColumnPath.
+1
@shiro615 Sorry for my late review.
I pushed some changes:
- Add the "path" argument to enable_dictionary() and disable_dictionary().
- Rename the "dot_string" argument to "path".
Sorry. How about using "path" as "dot string" in Parquet GLib? We will use "column_path" when we need to export Parquet::Schema::ColumnPath.
Could you confirm them?
Thank you for your review and suggestion.
It sounds good to me too. I'll merge this.