Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-7625: [Parquet][GLib] Add support for writer properties #6336

Closed
wants to merge 24 commits into from

Conversation

shiro615
Copy link
Contributor

@shiro615 shiro615 commented Feb 1, 2020

No description provided.

@github-actions
Copy link

github-actions bot commented Feb 1, 2020

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet::WriterProperties::Builder API is useful for C++ because we can use method chain to build properties:

builder
  .memory_pool(pool)
  .enable_dictionary()
  .data_pagesize(1024);

But this API isn't useful in C because we can't use method chain in C:

gparquet_writer_properties_data_pagesize(
  gparquet_writer_properties_enable_dictionary(
    gparquet_writer_properties_memory_pool(builder, pool)
  ),
  1024);

Or

gparquet_writer_properties_builder_memory_pool(builder, pool);
gparquet_writer_properties_builder_enable_dictionary(builder);
gparquet_writer_properties_builder_data_pagesize(builder, 1024);

So we don't need to provide the builder API to users.

We can just use parquet::WriterProperties::Builder internally. We can provide just accessor API:

gparquet_writer_properties_set_memory_pool(properties, pool);
gparquet_writer_properties_enable_dictionary(properties);
gparquet_writer_properties_set_data_pagesize(properties, 1024);

And we can build parquet::WriterProperties when it's needed:

gparquet_arrow_file_writer_new_*()
{
  ...
  auto parquet_writer_properties =
    gparquet_writer_properties_get_raw(writer_properties);
  ...
}

Could you change the API or do I push base implementation?

@kou
Copy link
Member

kou commented Feb 2, 2020

Could you enable Parquet on macOS CI?

diff --git a/.github/workflows/ruby.yml b/.github/workflows/ruby.yml
index bc28c6f54..f9ba56b67 100644
--- a/.github/workflows/ruby.yml
+++ b/.github/workflows/ruby.yml
@@ -85,6 +85,7 @@ jobs:
       ARROW_HOME: /usr/local
       ARROW_JEMALLOC: OFF
       ARROW_ORC: OFF
+      ARROW_PARQUET: ON
       ARROW_WITH_BROTLI: ON
       ARROW_WITH_LZ4: ON
       ARROW_WITH_SNAPPY: ON

@shiro615
Copy link
Contributor Author

shiro615 commented Feb 4, 2020

Thank you for your review.
I'll try to address them.

@shiro615
Copy link
Contributor Author

shiro615 commented Feb 5, 2020

@kou
fbe3f66
Could you please confirm my understanding?
I'll add the other properties in this PR if this change is fine.

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check my comments?


static void
gparquet_writer_properties_init(GParquetWriterProperties *object)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can always create parquet::WriterPropertiesBuilder here.
We don't need to receive it as an argument of g_object_new().

gparquet_writer_properties_get_raw(GParquetWriterProperties *properties)
{
auto priv = GPARQUET_WRITER_PROPERTIES_GET_PRIVATE(properties);
return priv->builder->build();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to cache build() result.

if (priv->changed) {
  priv->properties = priv->builder->build(); 
}
return priv->properties;

gparquet_writer_properties_get_compression(GParquetWriterProperties *properties)
{
auto priv = GPARQUET_WRITER_PROPERTIES_GET_PRIVATE(properties);
return priv->compression_type;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to keep duplicated information in this object.

auto parquet_properties = gparquet_writer_properties_get_raw(properties);
auto parquet_column_path = parquet::Schema::ColumnPath::FromDotString(dotstring); // Or receive as an argument
auto arrow_compression = parquet_properties->compression(parquet_column_path);
return garrow_compression_type_from_raw(arrow_compression);

@shiro615
Copy link
Contributor Author

shiro615 commented Feb 6, 2020

Thank you for your comments.
I'll fix them in this week.

@ziggythehamster
Copy link

Nice catch with the other configurables (dictionary, memory pool) and exposing the per-column settings. I was happy with:

parquet::WriterProperties::Builder builder;
builder.compression(arrow::Compression::SNAPPY);
auto parquet_writer_properties = builder.build();

:P

@kou
Copy link
Member

kou commented Feb 7, 2020

@ziggythehamster What do you mean? Do you want to use parquet::WriterProperties built by C++ with Apache Arrow GLib?

@shiro615
Copy link
Contributor Author

shiro615 commented Feb 8, 2020

I hope I've addressed all review comments. In addition to them, I've added some properties.

@shiro615 shiro615 requested a review from kou February 8, 2020 00:45
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you check my comments?

}

/**
* gparquet_writer_properties_get_compression:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you append _dot_string?
Because we will add parquet::Schema::ColumnPath version later.

/**
* gparquet_writer_properties_get_compression:
* @properties: A #GParquetWriterProperties.
* @dotstring: The dot string path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dot_string

/**
* gparquet_writer_properties_dictionary_enabled:
* @properties: A #GParquetWriterProperties.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dot_string is missing.

}

/**
* gparquet_writer_properties_dictionary_enabled:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add is_ for predicate?
gparquet_writer_properties_is_dictionary_enabled.

/**
* gparquet_writer_properties_set_dictionary_pagesize_limit:
* @properties: A #GParquetWriterProperties.
* @dictionary_pagesize_limit: The dictionary page size limit.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify this to @limit.

/**
* gparquet_writer_properties_set_data_pagesize:
* @properties: A #GParquetWriterProperties.
* @data_pagesize: The data page size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use page_size instead of pagesize?

@@ -133,18 +398,29 @@ gparquet_arrow_file_writer_class_init(GParquetArrowFileWriterClass *klass)
GParquetArrowFileWriter *
gparquet_arrow_file_writer_new_arrow(GArrowSchema *schema,
GArrowOutputStream *sink,
GParquetWriterProperties *writer_properties,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks API backward compatibility but it may be acceptable...

if (priv->changed) {
priv->properties = priv->builder->build();
}
priv->changed = FALSE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you put this to if (priv->changed) { clause?

def test_compression
@properties.compression = :gzip
assert_equal(Arrow::CompressionType.new("gzip"),
@properties.get_compression("a_column"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use not-specified or something instead of a_column?

*/
void
gparquet_writer_properties_set_compression(GParquetWriterProperties *properties,
GArrowCompressionType compression_type)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding nullable path or dot_string?

gparquet_writer_properties_set_compression((GParquetWriterProperties *properties,
                                           const gchar *path,
                                           GArrowCompressionType compression_type)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added nullable path.

@shiro615 shiro615 force-pushed the glib-parquet-writer-options branch 4 times, most recently from 262290b to 24f723e Compare February 15, 2020 17:27
@shiro615
Copy link
Contributor Author

Thank you for your review. I think I've addressed them.

@shiro615 shiro615 requested a review from kou February 15, 2020 18:13
@kou kou force-pushed the glib-parquet-writer-options branch from 24f723e to c0ba797 Compare February 26, 2020 21:36
Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@shiro615 Sorry for my late review.
I pushed some changes:

  • Add the "path" argument to enable_dictionary() and disable_dictionary().
  • Rename the "dot_string" argument to "path". Sorry. How about using "path" as "dot string" in Parquet GLib? We will use "column_path" when we need to export Parquet::Schema::ColumnPath.

Could you confirm them?

@shiro615
Copy link
Contributor Author

Thank you for your review and suggestion.

How about using "path" as "dot string" in Parquet GLib?

It sounds good for me too.

I'll merge this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants