-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
We're working on the user experience of our tool that lets users upload a folder of CSV files (e.g.). DataFusion supports this natively, which is great, but if there's a decoding issue in any of the files a single ArrowError is reported for the entire operation, e.g.
incorrect number of fields for line 1, expected 55 got 34
It would be great if we could report in which file the issue was encountered.
some/location.csv: incorrect number of fields for line 1, expected 55 got 34
(Note that there's even more context needed to resolve this specific error. The resolution is to turn off headers, though I've not dug in to understand why that leads to overestimating the number of columns.)
Describe the solution you'd like
I first thought about using Diagnostic for this. A small patch that captures the location for CSV decoding errors is as follows:
diff --git a/datafusion/datasource-csv/src/source.rs b/datafusion/datasource-csv/src/source.rs
index 8b95d9ba9..c2e7788fa 100644
--- a/datafusion/datasource-csv/src/source.rs
+++ b/datafusion/datasource-csv/src/source.rs
@@ -35,7 +35,7 @@ use datafusion_datasource::{
use arrow::csv;
use arrow::datatypes::SchemaRef;
-use datafusion_common::{DataFusionError, Result, Statistics};
+use datafusion_common::{DataFusionError, Diagnostic, Result, Statistics};
use datafusion_common_runtime::JoinSet;
use datafusion_datasource::file::FileSource;
use datafusion_datasource::file_scan_config::FileScanConfig;
@@ -405,7 +405,14 @@ impl FileOpener for CsvOpener {
input,
DecoderDeserializer::new(CsvDecoder::new(decoder)),
);
- Ok(stream.map_err(Into::into).boxed())
+ Ok(stream
+ .map_err(move |error| {
+ DataFusionError::from(error).context(format!(
+ "decoding error in {}",
+ file_meta.object_meta.location
+ ))
+ })
+ .boxed())
}
}
}))
This gives an error such as:
decoding error in some/location.csv
caused by
Arrow error: Csv error: incorrect number of fields for line 1, expected 55 got 34
This is OK, but there are other approaches that might make more sense so I wanted to open the issue first in case there are opinions:
-
On the basis that most arrow errors will be associated with a particular location,
DataFusionError::ArrowErrorcould be converted into a struct variant with a newlocation: Option<Path>field.This would probably be a pretty disruptive change (and an API breaking change), and I haven't checked whether it's actually the case that most arrow errors are associated with a location.
-
We could introduce a new
DataFusionErrorvariant specifically for decoding errors and use that instead of the genericArrowError, e.g.DecodingError { inner: Box<ArrowError>, // Not sure why `Box`, but copying `DataFusionError::ArrowError` location: Path, },
Since
DataFusionErroris exhaustive this would still be an API breaking change, but it would be much less disruptive to introduce. -
Something else?
Describe alternatives you've considered
No response
Additional context
No response