ARROW-6086: [Rust] [DataFusion] Add support for partitioned Parquet data sources #5494
Closed
+55
−51
@@ -37,23 +37,27 @@ use parquet::file::reader::*;

 use crate::datasource::{ScanResult, TableProvider};
 use crate::error::{ExecutionError, Result};
+use crate::execution::physical_plan::common;
 use crate::execution::physical_plan::BatchIterator;

 /// Table-based representation of a `ParquetFile`
 pub struct ParquetTable {
-    filename: String,
+    filenames: Vec<String>,
     schema: Arc<Schema>,
 }

 impl ParquetTable {
     /// Attempt to initialize a new `ParquetTable` from a file path
-    pub fn try_new(filename: &str) -> Result<Self> {
-        let parquet_file = ParquetFile::open(filename, None, 0)?;
-        let schema = parquet_file.projection_schema.clone();
-        Ok(Self {
-            filename: filename.to_string(),
-            schema,
-        })
+    pub fn try_new(path: &str) -> Result<Self> {
+        let mut filenames: Vec<String> = vec![];
+        common::build_file_list(path, &mut filenames, ".parquet")?;
+        if filenames.is_empty() {
+            Err(ExecutionError::General("No files found".to_string()))
+        } else {
+            let parquet_file = ParquetFile::open(&filenames[0], None, 0)?;
+            let schema = parquet_file.projection_schema.clone();
andygrove
Author
Member
+            Ok(Self { filenames, schema })
+        }
     }
 }

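The body of `common::build_file_list` is not part of this hunk. As a rough illustration of what a helper with that call shape might do, here is a minimal sketch using only the standard library. The recursion into subdirectories, the handling of a single-file path, and the `io::Result` return type are all assumptions for illustration, not the PR's implementation.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for `common::build_file_list`: collect every file
/// under `dir` whose name ends with `ext`, descending into subdirectories.
fn build_file_list(dir: &str, filenames: &mut Vec<String>, ext: &str) -> io::Result<()> {
    let path = Path::new(dir);
    if path.is_file() {
        // Assumption: a plain file path is accepted if it has the extension.
        if dir.ends_with(ext) {
            filenames.push(dir.to_string());
        }
        return Ok(());
    }
    for entry in fs::read_dir(path)? {
        let entry_path = entry?.path();
        // Paths that are not valid UTF-8 are skipped in this sketch.
        if let Some(name) = entry_path.to_str() {
            if entry_path.is_dir() {
                build_file_list(name, filenames, ext)?;
            } else if name.ends_with(ext) {
                filenames.push(name.to_string());
            }
        }
    }
    Ok(())
}
```

With a helper like this, `try_new` would see one entry per `.parquet` file found under `path`, which matches the `filenames.is_empty()` check in the new code.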
@@ -70,17 +74,16 @@ impl TableProvider for ParquetTable {
         projection: &Option<Vec<usize>>,
         batch_size: usize,
     ) -> Result<Vec<ScanResult>> {
-        // note that this code currently assumes the filename is a file rather than a directory
-        // and therefore only returns a single partition
-        let parquet_file = match projection {
-            Some(p) => ParquetScanPartition::try_new(
-                &self.filename,
-                Some(p.clone()),
-                batch_size,
-            )?,
-            None => ParquetScanPartition::try_new(&self.filename, None, batch_size)?,
-        };
-        Ok(vec![Arc::new(Mutex::new(parquet_file))])
+        Ok(self
+            .filenames
+            .iter()
+            .map(|filename| {
+                ParquetScanPartition::try_new(filename, projection.clone(), batch_size)
+                    .and_then(|part| {
+                        Ok(Arc::new(Mutex::new(part)) as Arc<Mutex<dyn BatchIterator>>)
+                    })
+            })
+            .collect::<Result<Vec<_>>>()?)
     }
 }

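The new `scan` body relies on collecting an iterator of `Result` values into a single `Result<Vec<_>>`, which stops at the first error. The standalone sketch below shows that pattern in isolation. `open_partition` is a hypothetical stand-in for `ParquetScanPartition::try_new`, and plain `String` values stand in for the partition and error types.

```rust
/// Hypothetical stand-in for `ParquetScanPartition::try_new`: fails for any
/// name that does not end in ".parquet".
fn open_partition(filename: &str) -> Result<String, String> {
    if filename.ends_with(".parquet") {
        Ok(format!("partition({})", filename))
    } else {
        Err(format!("cannot open {}", filename))
    }
}

/// Build one partition per file. Collecting into `Result<Vec<_>, _>` returns
/// the first `Err` encountered and discards any partitions built so far.
fn scan(filenames: &[&str]) -> Result<Vec<String>, String> {
    filenames.iter().map(|f| open_partition(f)).collect()
}
```

Calling `scan(&["a.parquet", "bad.csv", "c.parquet"])` in this sketch yields the error for `bad.csv` and never attempts `c.parquet`, so a single unreadable file fails the whole scan.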
What happens if the `schema` of the files differ? I guess it just fails at execution time when a different `schema` is encountered?
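One possible way to surface a mismatch earlier would be to compare every file's schema against the first one when the table is constructed. The sketch below illustrates the idea only. It is not what this PR does, and the `Schema` alias here is a simplified, hypothetical stand-in (column name and type name pairs) for Arrow's `Schema`.

```rust
/// Hypothetical, simplified schema: (column name, type name) pairs.
/// Real code would compare the Arrow schemas read from each file instead.
type Schema = Vec<(String, String)>;

/// Return the schema shared by all files, or an error naming the first file
/// whose schema differs from that of the first file.
fn common_schema(files: &[(String, Schema)]) -> Result<Schema, String> {
    let (first_name, first_schema) = files
        .first()
        .ok_or_else(|| "No files found".to_string())?;
    for (name, schema) in &files[1..] {
        if schema != first_schema {
            return Err(format!("schema of {} differs from {}", name, first_name));
        }
    }
    Ok(first_schema.clone())
}
```

The trade-off is that this opens every file up front rather than only `filenames[0]`, which may be costly for tables with many partitions.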