ARROW-12512: [C++][Python][Dataset] Create CSV writer class and add Datasets support #10230

Closed

Conversation

@lidavidm (Member) commented May 3, 2021

This refactors the CSV write support to expose an explicit CSV writer class, and adds Python bindings and Datasets support.


@lidavidm (Member, Author) commented May 4, 2021

@emkornfield would you be free to take a look (at least the CSV side, if not the Datasets side)? No rush of course.

@jorisvandenbossche (Member):
Does this also enable writing CSV with the dataset API in Python? (write_dataset(..., format="csv"))

@lidavidm (Member, Author) commented May 4, 2021

@jorisvandenbossche I missed that: CsvFileFormat.make_write_options in Python needs to be updated as well.

@nealrichardson (Contributor):
There's probably a very small amount of wiring to propagate this up to the R write_dataset() function; up to you if you want to handle it here or make another JIRA for it.

@lidavidm (Member, Author) commented May 4, 2021

I threw in R support and found & fixed a bug with scanning CSV datasets with manually-specified names.

@lidavidm force-pushed the arrow-12512 branch 2 times, most recently from ddcf58c to 893db1f on May 6, 2021.
@westonpace (Member) left a comment:

I haven't looked at the CSV writer much before, so some of my comments may be about that existing code rather than this change.

Do we have a JIRA for allowing incremental writes to a CSV file (using file append)? Or would that be possible today?

MemoryPool* pool) {
static Result<std::shared_ptr<CSVConverter>> Make(
io::OutputStream* sink, std::shared_ptr<io::OutputStream> owned_sink,
std::shared_ptr<Schema> schema, MemoryPool* pool, const WriteOptions& options) {
Member:

Maybe take in IOContext instead of MemoryPool*? If you later decide to add support for cancellation it'll save you from having to change the API.
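A minimal sketch of what that suggestion could look like (the signature below is hypothetical, not the PR's actual code; io::IOContext lives in arrow/io/interfaces.h and can be constructed from just a MemoryPool*):

#include "arrow/io/interfaces.h"  // io::IOContext, io::OutputStream

// Hypothetical variant of Make() taking an IOContext instead of a bare
// MemoryPool*. IOContext bundles the pool with an executor and a stop token,
// so cancellation could be added later without breaking this signature.
static Result<std::shared_ptr<CSVConverter>> Make(
    io::OutputStream* sink, std::shared_ptr<io::OutputStream> owned_sink,
    std::shared_ptr<Schema> schema, const io::IOContext& io_context,
    const WriteOptions& options);

// A caller that only has a pool can still write:
//   Make(sink, owned_sink, schema, io::IOContext(pool), options);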

}

return Status::OK();
}

Status Close() override { return Status::OK(); }
Member:

No need to close owned_sink_?

Member (Author):

The IPC reader doesn't do this either, oddly. I guess it is not a Rust-style 'exclusively owned' sink but merely 'keep this sink alive'. (Though that does beg the question: what's the point? Either you're the only one keeping it alive, and so you should close it, or you aren't the only one, and you don't need a shared_ptr. I would guess it's just less of a footgun to have a strong reference than a potentially dangling one, though.)

Contributor:

In other places I can think of (Buffer comes to mind), this is handled by passing a unique_ptr instead of a shared_ptr.

Member:

Right, we don't close either in other "writer" classes. This is more flexible, though of course in the general case not very useful.
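For illustration, the keep-alive pattern being discussed, as a minimal sketch (class and member names are hypothetical; it only mirrors the semantics described above, where the writer never closes the stream):

#include <memory>
#include "arrow/io/interfaces.h"
#include "arrow/status.h"

class ExampleWriter {
 public:
  // Borrowing variant: the caller guarantees `sink` outlives the writer.
  explicit ExampleWriter(arrow::io::OutputStream* sink) : sink_(sink) {}

  // Shared variant: the writer keeps a strong reference so the stream
  // cannot dangle, but it still does not Close() it -- the caller does.
  explicit ExampleWriter(std::shared_ptr<arrow::io::OutputStream> owned_sink)
      : sink_(owned_sink.get()), owned_sink_(std::move(owned_sink)) {}

  arrow::Status Close() {
    // Deliberately does not close the sink, matching the IPC writer.
    return arrow::Status::OK();
  }

 private:
  arrow::io::OutputStream* sink_;
  std::shared_ptr<arrow::io::OutputStream> owned_sink_;  // may be null
};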

TableBatchReader reader(table);
reader.set_chunksize(options.batch_size);
RETURN_NOT_OK(PrepareForContentsWrite(options, out));
reader.set_chunksize(max_chunksize > 0 ? max_chunksize : options_.batch_size);
Member:

Seems a little odd to have two options to control batch_size. I suppose it's a "default" batch size and a "specific for this table" batch size?

Member (Author):

There's a bit of an impedance mismatch because I elected to reuse the ipc::RecordBatchWriter interface, which has parameters like that in the API. I could at least introduce an overload that doesn't require specifying it for convenience.
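A convenience overload along those lines might look like this (hypothetical sketch; it simply defers to the two-argument form shown in the diff above):

// Hypothetical convenience overload: callers who are happy with
// WriteOptions::batch_size need not pass a chunk size at all. A
// non-positive value falls through to options_.batch_size in
// `max_chunksize > 0 ? max_chunksize : options_.batch_size`.
Status WriteTable(const Table& table) {
  return WriteTable(table, /*max_chunksize=*/-1);
}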


Status PrepareForContentsWrite(const WriteOptions& options, io::OutputStream* out) {
Status PrepareForContentsWrite() {
Member:

Does data_buffer_ ever revert back to nullptr? Why isn't it just initialized once at construction?

Contributor:

I think at one point I might have been using it as a signal to see if the header was written, but I can't really remember. I agree it is strange and I don't have a strong justification for this pattern. It might have been to avoid having to make a factory function for the private class.

Member:

Sounds good to me.

@@ -355,7 +370,9 @@ class CSVConverter {
return header_length + (kQuoteDelimiterCount * schema_->num_fields());
}

Status WriteHeader(io::OutputStream* out) {
Status WriteHeader() {
if (header_written_) return Status::OK();
Member:

Would it be clearer to return Invalid here to inform the caller they are doing something odd? Or is it sometimes hard for the caller to know when the header will be written?
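For comparison, the stricter variant being floated would look roughly like this (hypothetical; DoWriteHeader stands in for the actual serialization code):

Status WriteHeader() {
  if (header_written_) {
    // Alternative to the silent no-op: surface the double write to the caller.
    return Status::Invalid("CSV header was already written");
  }
  header_written_ = true;
  return DoWriteHeader();  // hypothetical helper doing the real work
}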

new CsvFileWriteOptions(shared_from_this()));
csv_options->options =
std::make_shared<csv::WriteOptions>(csv::WriteOptions::Defaults());
csv_options->pool = default_memory_pool();
Member:

I'm a little surprised that pool is not a property of FileWriteOptions.

class ARROW_DS_EXPORT CsvFileWriteOptions : public FileWriteOptions {
public:
/// Options passed to csv::MakeCSVWriter. use_threads is ignored
std::shared_ptr<csv::WriteOptions> options;
Member:

options is a little ambiguous. Perhaps format_options or csv_options or writer_options?


cdef class WriteOptions(_Weakrefable):
cdef:
unique_ptr[CCSVWriteOptions] options
Member:

Why does this need to be a unique_ptr? CCSVWriteOptions is pretty trivial.

Member (Author):

Mostly for consistency with the other options, and in case we add things to WriteOptions that would make it a non-standard layout type, in which case Cython will generate a lot of compiler warnings as it relies on sizeof.

@@ -1747,8 +1749,15 @@ cdef class CsvFileFormat(FileFormat):
FileFormat.init(self, sp)
self.csv_format = <CCsvFileFormat*> sp.get()

def make_write_options(self):
raise NotImplemented("writing CSV datasets")
def make_write_options(self, WriteOptions options=None,
Member:

This is kind of confusing having a method named make_write_options that takes in an instance of WriteOptions. Perhaps in C++ it wouldn't be so bad but for Python I think we might want something more understandable.

Member (Author):

I could have it take **kwargs which get forwarded to csv.WriteOptions; now that I look, that's what ParquetFileFormat does.

table = pa.table([
pa.array(range(20)), pa.array(np.random.randn(20)),
pa.array(np.repeat(['a', 'b'], 10))
], names=["f1", "f2", "part"])
Member:

The column here is named part, which makes me think it is going to be used for partitioning, but that isn't actually done. I'm not sure this is a problem so much as an observation.

@@ -403,34 +415,44 @@ class CSVConverter {
}

static constexpr int64_t kColumnSizeGuess = 8;
io::OutputStream* sink_;
std::shared_ptr<io::OutputStream> owned_sink_;
Contributor:

shared_ptr seems strange in general for an OutputStream, which for the most part should have only one owner.

Member (Author):

I agree it seems weird, but both the IPC and Parquet writers use shared_ptr for this.

Member:

Well, except that in Python any object can be shared, even if it's logically "owned" by something.

ASSIGN_OR_RAISE(std::unique_ptr<CSVConverter> converter,
CSVConverter::Make(table.schema(), pool));
return converter->WriteCSV(table, options, output);
ASSIGN_OR_RAISE(auto converter, MakeCSVWriter(output, table.schema(), options));
Contributor:

nit: should converter now be writer? (same question below)

@@ -83,6 +82,35 @@ struct ARROW_DS_EXPORT CsvFragmentScanOptions : public FragmentScanOptions {
csv::ReadOptions read_options = csv::ReadOptions::Defaults();
};

class ARROW_DS_EXPORT CsvFileWriteOptions : public FileWriteOptions {
public:
/// Options passed to csv::MakeCSVWriter. use_threads is ignored
Contributor:

Is use_threads used elsewhere? The way the code is structured threads could be used for the casts, so if it is important we might want to file a follow-up JIRA.

Member (Author):

I copied this from the equivalent IPC struct - it doesn't apply here since there's no such parameter of course.

else:
raise TypeError(f"Expected Table or RecordBatch, got '{type(data)}'")


cdef class CsvWriter(_CRecordBatchWriter):
Contributor:

nit: as much as I appreciate the Csv naming convention, I think CSV is used everywhere else?

@@ -1819,6 +1824,28 @@ cdef class CsvFragmentScanOptions(FragmentScanOptions):
self.read_options)


cdef class CsvFileWriteOptions(FileWriteOptions):
Contributor:

same nit on Csv vs CSV

Member (Author):

Unfortunately in the context of datasets (and only datasets) all other classes already use Csv.

@emkornfield (Contributor):

Took a quick pass through; seems OK to me (I didn't look at the R stuff at all), and I agree with Weston's comments.

@lidavidm (Member, Author):

Thanks for the reviews. I think I've addressed all feedback, minus the shared_ptr - while this is weird, it is the pattern used by IPC and Parquet as well, and I think we may as well be consistent across the formats. (Also, IPC exposes both the output-owning and output-borrowing APIs too, even though it expects the caller to close the stream in both cases.)

@lidavidm (Member, Author):

Just to follow up - APIs like FileSystem return shared_ptr<OutputStream> so it would be very annoying to take unique_ptr. And we could just take only OutputStream* but IMO even if the caller is supposed to keep the pointer alive, it's safer to offer the option to take a shared_ptr by default to minimize mistakes.

@lidavidm (Member, Author):

Rebased/fixed conflicts.

@lidavidm (Member, Author):

Rebased/fixed conflicts again.

On the shared_ptr vs raw pointer: I think the other issue is that we allow constructing a shared_ptr<Writer>, at which point we should hold a shared_ptr<OutputStream> or unique_ptr<OutputStream>. But the filesystem interfaces only give out a shared_ptr<OutputStream>, hence we can only take a shared_ptr here.

We could perhaps not allow you to get a shared_ptr<Writer> and only let you use Writer or Writer*, and then only accept an OutputStream*.
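A sketch of the call pattern behind that reasoning (file path and error handling are illustrative; MakeCSVWriter is the factory this PR adds, and FileSystem::OpenOutputStream hands out a shared_ptr):

#include "arrow/csv/writer.h"
#include "arrow/filesystem/localfs.h"
#include "arrow/result.h"
#include "arrow/table.h"

arrow::Status WriteExample(const std::shared_ptr<arrow::Table>& table) {
  arrow::fs::LocalFileSystem fs;
  // The filesystem API returns shared ownership, so a unique_ptr-taking
  // writer factory would be awkward to call from here.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::io::OutputStream> sink,
                        fs.OpenOutputStream("/tmp/example.csv"));
  ARROW_ASSIGN_OR_RAISE(
      auto writer, arrow::csv::MakeCSVWriter(sink, table->schema(),
                                             arrow::csv::WriteOptions::Defaults()));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(writer->Close());
  return sink->Close();  // the caller, not the writer, closes the stream
}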

@emkornfield (Contributor):

I don't think I had any objections before, but I can re-review if you want. The writer API stuff is unfortunate but doesn't need to be addressed here.

@lidavidm (Member, Author):

Ah, ok, I was unsure how to deal with that part. I'd appreciate another look (though I don't think anything substantial has changed; I did have to rebase a few times) before I merge, though.

@lidavidm (Member, Author):

Rebased and fixed conflicts here.

@pitrou (Member) left a comment:

LGTM.

std::vector<std::unique_ptr<ColumnPopulator>> column_populators_;
std::vector<int32_t, arrow::stl::allocator<int32_t>> offsets_;
std::shared_ptr<ResizableBuffer> data_buffer_;
const std::shared_ptr<Schema> schema_;
MemoryPool* pool_;
WriteOptions options_;
Member:

Nit: const?

Member (Author):

I added the const.

}
return std::unique_ptr<CSVConverter>(
new CSVConverter(std::move(schema), std::move(populators), pool));
auto writer = std::shared_ptr<CSVWriterImpl>(new CSVWriterImpl(
Contributor:

nit: std::make_shared?

std::vector<std::unique_ptr<ColumnPopulator>> populators, MemoryPool* pool)
: column_populators_(std::move(populators)),
offsets_(0, 0, ::arrow::stl::allocator<char*>(pool)),
CSVWriterImpl(io::OutputStream* sink, std::shared_ptr<io::OutputStream> owned_sink,
Contributor:

I guess this would need to be public to use std::make_shared.
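The trade-off in this nit, sketched with a hypothetical class: std::make_shared constructs the object itself, so it cannot reach a private constructor, while the factory-function pattern keeps the constructor private at the cost of a separate allocation for the control block.

#include <memory>

class Widget {
 public:
  static std::shared_ptr<Widget> Make(int value) {
    // std::make_shared<Widget>(value) would not compile here: it needs a
    // public constructor. The explicit new keeps construction behind Make().
    return std::shared_ptr<Widget>(new Widget(value));
  }

 private:
  explicit Widget(int value) : value_(value) {}
  int value_;
};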


/// \brief Create a new CSV writer.
///
/// \param[in] sink output stream to write to
Contributor:

also note that ownership is not taken here?

CCSVWriteOptions c_write_options
CMemoryPool* c_memory_pool = maybe_unbox_memory_pool(memory_pool)
_get_write_options(write_options, &c_write_options)
c_write_options.io_context = CIOContext(c_memory_pool)
Contributor:

IOContext is new to me in general. Should we be making new APIs take that instead of MemoryPool?

Member (Author):

I would say yes, since it wraps up a memory pool, thread pool, and cancellation token all in one.
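A minimal sketch of that bundling, assuming the IOContext(MemoryPool*, StopToken) overload in arrow/io/interfaces.h:

#include "arrow/io/interfaces.h"  // arrow::io::IOContext
#include "arrow/memory_pool.h"    // arrow::default_memory_pool
#include "arrow/util/cancel.h"    // arrow::StopSource

// One IOContext carries the memory pool and a cancellation token (and, via
// other overloads, an executor), so an API that accepts it picks up all of
// those capabilities through a single parameter.
arrow::io::IOContext MakeCancellableContext(arrow::StopSource& stop_source) {
  return arrow::io::IOContext(arrow::default_memory_pool(), stop_source.token());
}

// Another thread can later call stop_source.RequestStop() to cancel
// in-flight I/O started with this context.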

@emkornfield (Contributor):

A few random nits, but the core C++ looks OK to me.

@pitrou closed this in 0ebed2b on Jul 5, 2021.
@pitrou (Member) commented Jul 5, 2021

Thanks for the updates @lidavidm !
