Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DriverManager::get_drivers_for_filename for the ability to auto detect compatible Drivers for writing data #510

Merged
merged 24 commits into from
Jan 15, 2024
Merged
Show file tree
Hide file tree
Changes from 10 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
5f44a24
Add DriverManager::get_drivers_for_filename
Atreyagaurav Dec 25, 2023
9fb996b
Add docstring to DriverManager::get_drivers_for_filename; use match
Atreyagaurav Dec 29, 2023
f7d6bcb
Satisfy clippy
Atreyagaurav Dec 29, 2023
3d569e0
Improve readability for DriverManager::get_drivers_for_filename
Atreyagaurav Jan 1, 2024
0b8d80d
only test for available drivers in DriverManager::get_driver_by_name
Atreyagaurav Jan 1, 2024
79300cd
Add test for ESRI Shapefile for .shp file extension
Atreyagaurav Jan 1, 2024
563eb97
bool for vector/raster in DriverManager::get_drivers_for_filename
Atreyagaurav Jan 1, 2024
62f7ccd
Fix: modify test for gpkg.zip only for gdal v 3.7 onwards
Atreyagaurav Jan 2, 2024
3dbc42f
Fix: modify test for shp.zip only for gdal v 3.1 onwards
Atreyagaurav Jan 2, 2024
05f46c4
Fixed the test failed due to Elasticsearch name capitalization change
Atreyagaurav Jan 2, 2024
7c90f62
Rename the function and minor changes
Atreyagaurav Jan 3, 2024
223b846
Merge branch 'master' of github.com:georust/gdal
Atreyagaurav Jan 5, 2024
2feaf29
Use DriverIterator for looping through drivers
Atreyagaurav Jan 5, 2024
5c32a2c
Make `DriverManager::all()` return an Iterator
Atreyagaurav Jan 5, 2024
ed5a991
Add function to get a single driver based on file extension
Atreyagaurav Jan 8, 2024
45473a6
Use `AsRef<Path>` instead of `&str` in guess_driver(s)_for_write
Atreyagaurav Jan 9, 2024
fc0361a
Small cleanups
lnicola Jan 10, 2024
2c1fc76
Fix test
lnicola Jan 10, 2024
773872d
Try to debug test
lnicola Jan 10, 2024
a187d38
Remove debugging code
lnicola Jan 10, 2024
f54cd81
Rename methods and ignore case
lnicola Jan 11, 2024
326feda
Add PR link to CHANGES.md
Atreyagaurav Jan 15, 2024
04a549c
Rename DriverProperties to DriverType
Atreyagaurav Jan 15, 2024
98fbc23
Fix: wrong PR link location
Atreyagaurav Jan 15, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,8 @@

## Unreleased

- Added `DriverManager::get_drivers_for_filename` for the ability to auto detect compatible `Driver`s for writing data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add the PR link here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Done.

- Added `Feature::unset_field`
- <https://github.com/georust/gdal/pull/503>

Expand Down
128 changes: 128 additions & 0 deletions src/driver.rs
Original file line number Diff line number Diff line change
Expand Up @@ -410,6 +410,90 @@ impl DriverManager {
Ok(Driver { c_driver })
}

/// Get the Driver based on the file extension from filename
///
/// Searches for the available extensions in the registered
/// drivers and returns the matches. The determined driver is
/// checked for writing capabilities as
/// [`Dataset::open`](Dataset::open) can already open datasets
/// with just path.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why we limit returned drivers to ones with "write" capabilities... maybe read-only use cases might also want a list of drivers to pick from (e.g. when reading JP2k, there are several options and you might want to inspect the metadata to decide on which to select).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, initially I was thinking that, but then I realized that when we already have a dataset, we can just call Dataset::open and then look at the driver. Having a dataset you can open means you have a lot in your arsenal to detect the driver, like the metadata you mentioned.

The use case we're thinking is to when we need to write a dataset, currently the only way I could find to write to file is to create a file from a driver, which requires you to know the driver in advance. Of course, you can make the user pass the driver name along with the filename and call it a day (that's what I was doing on one of my program), but all users are not likely to know about the driver names. And if we can more or less guess based on the file extension, then we make it convenient and remove any user error (could write a Tiff file in .shp extension file and it'll write it). And since GIS programs let you write to any file, making the program only dependent on the filename makes it easier to integrate with other programs as well.

I think renaming the function to mention it's for writing should help with it.

///
/// See also: [`get_driver_by_name`](Self::get_driver_by_name)
///
/// # Example
///
/// ```rust, no_run
/// use gdal::DriverManager;
/// # fn main() -> gdal::errors::Result<()> {
/// let compatible_drivers =
/// DriverManager::get_drivers_for_filename("test.gpkg", true)
/// .iter()
/// .map(|d| d.short_name())
/// .collect::<Vec<String>>();
/// println!("{:?}", compatible_drivers);
/// # Ok(())
/// # }
/// ```
/// ```text
/// ["GPKG"]
/// ```
pub fn get_drivers_for_filename(filename: &str, is_vector: bool) -> Vec<Driver> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is is_vector necessary? Is it possible to have multiple drivers that read the same file extension but differentiate on vector vs. raster data? If so, is this something we need to impose on every user (IOW, could we let the caller deal with that if it's a rare edge case?)

I guess what I'm wondering here is if we should only be evaluating against DMD_EXTENSION and DMD_EXTENSIONS, and defer capability filtering to a different mechanism (e.g. extension method on Driver). The method name implies we're only looking at matching drivers against filename (extension).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. In principle, the user might have all sorts of extra drivers installed. Matching the GDAL logic seemed like a safe approach.

We could ask Even, I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic I went by is this: there can be multiple drivers for the same extension, and some could support raster some vector and some both. And this function is (related with other comments) made to know the driver in advance while creating a file as if we have a file already, then we can call Dataset::open without driver information. But we cannot create and save data without deciding the driver first. And if we're writing a data, we'd want the list of drivers that supports that type of data, hence that bool.

It was also in the function Inicola sent me as an example, so it's my understanding of that, not my innovation.

let ext = {
let filename = filename.to_ascii_lowercase();
let e = match filename.rsplit_once('.') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to ensure true logical parity with GDAL C behavior, we'd use CPLGetExtension. 50/50 on whether that's good idea or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider using the rust's "extension()" method on the Path as well, but considering that it can only do the last "." onwards, which requires us to do the endswith on the filename for .shp.zip and .gpkg.shp, I thought just using them directly might be better. And the online protocol addresses with prefix (like Elasticsearch in the test) may not be taken as a valid path, so I just did the string thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have this work for Paths. I think you can use extension and then file_stem and again extension to check for this without that much ceremony.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path works fine with those, especially as they're just valid file names on Linux. Path::new is infallible.

I'd like this because otherwise it's pretty annoying to convert a Path or PathBuf to a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly I don't want to do it lol.

Like for that: startswith() we need to do for "PG:" and similar paths, we'd need to convert the path into string right? And iirc it can't convert it into direct string and has to go through Debug trait and ends up with quoted string or something.

I can try, but unless we can use the Path args without having to turn it back to string, it feels like we'll just end up asking for a Path and then internally convert it into string anyway.

Copy link
Contributor Author

@Atreyagaurav Atreyagaurav Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have a Path it's harder to get a string.

That conflicts with the first sentence :D

Users can just call [to_string_lossy](https://doc.rust-lang.org/std/path/struct.Path.html#method.to_string_lossy). If it's easy for us, it's easy for them.

Anyway, I'm trying to do it with AsRef<Path>. I'll make a commit if I am successful.

Edit: users have to call it once to pass the path as string currently.

If we take Path, we have to call to_string_lossy for extension, for filename (second extension), and again to check for prefix...

Copy link
Member

@lnicola lnicola Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's easy for us, it's easy for them.

It's easy for us in the sense that it's infallible, it requires no unsafe code, and it's almost always efficient. It's harder for them because it's ugly and maybe they won't know about to_string_lossy so they might try to debug-format them, as you wanted to do.


There's also something to be said about being lossy, but I guess we only support UTF-8 file names, while GDAL might, in principle, support arbitrary byte sequences). But we hide it, and we could support non-UTF-8 file names today -- at least on Linux -- without changing our API.

On the other hand, if we make the user call to_string_lossy on a non-UTF-8 path, they'll get a different name at the output. It's fine when we do it (and also when the user does it) because it won't affect the driver selection in practice. But as you can see, it's a lot of relatively subtle stuff to think about.

Also, you'll see that to_string_lossy returns a Cow<'_, str>, which might cause some people to give up and go back to Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I'm currently having that same problem. it seems I can't use Cow<'_, str> on pattern. And might have to do, Variants of Cow::. At this point doing if statement seems easier than match. It's really a lot of complications when we have to turn it back into string. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's leave it like this, I'll try to clean it up tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the abovementioned problem with Cow, and not being able to use match properly and so on. I converted it into String after fighting with borrow checking for a while trying to use str.

Some(("", _)) => "", // hidden file no ext
Some((f, "zip")) => {
// zip files could be zipped shp or gpkg
if f.ends_with(".shp") {
"shp.zip"
} else if f.ends_with(".gpkg") {
"gpkg.zip"
} else {
"zip"
}
}
Some((_, e)) => e, // normal file with ext
None => "",
};
e.to_string()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed, but will change anyway if we go with Path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't get it. I did some changes here. But idk.

I'm really reluctant to use path, as we could have URLs for PG and other backends. And rust path library is tested for local file paths only (at least from what I can see) URLs could complicate things in the future.

};

let mut drivers: Vec<Driver> = Vec::new();
for i in 0..DriverManager::count() {
Atreyagaurav marked this conversation as resolved.
Show resolved Hide resolved
let d = DriverManager::get_driver(i).expect("Index for this loop should be valid");
let can_create = d.metadata_item("DCAP_CREATE", "").is_some()
|| d.metadata_item("DCAP_CREATECOPY", "").is_some();
let check_vector = is_vector && d.metadata_item("DCAP_VECTOR", "").is_some();
let check_raster = !is_vector && d.metadata_item("DCAP_RASTER", "").is_some();
let check_vector_translate =
is_vector && d.metadata_item("DCAP_VECTOR_TRANSLATE_FROM", "").is_some();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above... i.e. should we consider selection and filtering two different operations?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this description of what GDAL does when Dataset::open is called.
https://github.com/OSGeo/gdal/blob/master/gcore/gdaldataset.cpp#L3630-L3645

From my 5 minute reading it looks like some of the logic is delegated to GDALIdentifyDriverEx:

https://github.com/OSGeo/gdal/blob/master/gcore/gdaldriver.cpp#L2649C25-L2649C45

My concern here is not wanting to diverge from the accepted GDAL logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDALIdentifyDriverEx is used for opening, but that's usually not necessary.

The functions I linked to in #506 (comment) is used by the GDAL binaries to pick an output format. If we offer this function, we should match the latter, and use a better name (get_driver(s)_for_writing or something like that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to change the function name. I was also looking for ideas as the ones I came up got too long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions I linked to in #506 (comment) is used by the GDAL binaries to pick an output format.

Ahhh... I thought there was something already in the code, but didn't realize it was C++ only :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is for binary apps. Not the part of the gdal library. Otherwise, I was planning on just using the function from gdal-sys.

if !((can_create && (check_vector || check_raster)) || (check_vector_translate)) {
continue;
}

if let Some(e) = &d.metadata_item("DMD_EXTENSION", "") {
if *e == ext {
drivers.push(d);
continue;
}
}
if let Some(e) = d.metadata_item("DMD_EXTENSIONS", "") {
if e.split(' ').collect::<Vec<&str>>().contains(&ext.as_str()) {
drivers.push(d);
continue;
}
}

if let Some(pre) = d.metadata_item("DMD_CONNECTION_PREFIX", "") {
if filename.starts_with(&pre) {
drivers.push(d);
}
}
}

drivers
}

/// Register a driver for use.
///
/// Wraps [`GDALRegisterDriver()`](https://gdal.org/api/raster_c_api.html#_CPPv418GDALRegisterDriver11GDALDriverH)
Expand Down Expand Up @@ -455,6 +539,8 @@ impl DriverManager {

#[cfg(test)]
mod tests {
use std::collections::HashSet;

use super::*;

#[test]
Expand All @@ -466,4 +552,46 @@ mod tests {
assert!(DriverManager::count() > 0);
assert!(DriverManager::get_driver(0).is_ok());
}

#[test]
fn test_driver_extension() {
// convert the driver into short_name for testing purposes
let drivers = |filename, is_vector| {
DriverManager::get_drivers_for_filename(filename, is_vector)
.iter()
.map(|d| d.short_name())
.collect::<HashSet<String>>()
};
let gdal_version: i64 = crate::version::version_info("VERSION_NUM").parse().unwrap();
if DriverManager::get_driver_by_name("ESRI Shapefile").is_ok() {
assert!(drivers("test.shp", true).contains("ESRI Shapefile"));
// `shp.zip` only supported from gdal version 3.1
// https://gdal.org/drivers/vector/shapefile.html#compressed-files
if gdal_version >= 3010000 {
Atreyagaurav marked this conversation as resolved.
Show resolved Hide resolved
assert!(drivers("test.shp.zip", true).contains("ESRI Shapefile"));
}
}
if DriverManager::get_driver_by_name("GPKG").is_ok() {
assert!(drivers("test.gpkg", true).contains("GPKG"));
// `gpkg.zip` only supported from gdal version 3.7
// https://gdal.org/drivers/vector/gpkg.html#compressed-files
if gdal_version >= 3070000 {
Atreyagaurav marked this conversation as resolved.
Show resolved Hide resolved
assert!(drivers("test.gpkg.zip", true).contains("GPKG"));
}
}
if DriverManager::get_driver_by_name("GTiff").is_ok() {
assert!(drivers("test.tiff", false).contains("GTiff"));
}
if DriverManager::get_driver_by_name("netCDF").is_ok() {
assert!(drivers("test.nc", false).contains("netCDF"));
}
if DriverManager::get_driver_by_name("Elasticsearch").is_ok() {
// Elasticsearch short name was ElasticSearch in older versions
if gdal_version >= 3010000 {
assert!(drivers("ES:test", true).contains("Elasticsearch"));
} else {
assert!(drivers("ES:test", true).contains("ElasticSearch"));
}
}
}
metasim marked this conversation as resolved.
Show resolved Hide resolved
}