Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DriverManager::get_drivers_for_filename for the ability to auto detect compatible Drivers for writing data #510

Merged
merged 24 commits into from
Jan 15, 2024

Conversation

Atreyagaurav
Copy link
Contributor

  • I agree to follow the project's code of conduct.
  • I added an entry to CHANGES.md if knowledge of this change could be valuable to users.

Fixes #506

I have the tests at the end of src/driver.rs, let me know if I should remove it.

I've only tested for getting the drivers here, but I tried using that function to see if the code I wrote that wrote GPKG files can write SHP now it works. Though some driver specific functions seem to make it not work if we're not careful (like out_data.start_transaction()? works for GPKG but not SHP).

@Atreyagaurav
Copy link
Contributor Author

Tested locally for:

  cargo fmt --all --check
  cargo clippy --all-targets -- -D warnings
  cargo clippy --all-targets --all-features -- -D warnings
  cargo build
  cargo test
  cargo build --all-features
  cargo test --all-features -- --nocapture

Note: The original code suggested in #506 had the tests for writing capabilities of the driver, and I've kept that as, for reading a file, Dataset::open can already read any dataset without us having to determine a driver. I've added that in docstring.

@Atreyagaurav
Copy link
Contributor Author

The tests pass locally but are failing in CI, So I thought it could be because we need to install something for GPKG driver to be present in the ubuntu distribution.

So I've added conditional in the test to only test extension if the driver we're looking for is already there. It should work.

@Atreyagaurav
Copy link
Contributor Author

Found the problem: The previous versions don't support .gpkg.zip and only .gpkg files.

Relevant link: https://gdal.org/drivers/vector/gpkg.html#compressed-files

$ docker run -it --entrypoint bash ghcr.io/osgeo/gdal:ubuntu-full-3.8.1
root@65b7fb97e696:/# python3
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from osgeo import ogr
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DMD_EXTENSIONS')
'gpkg gpkg.zip'
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DMD_EXTENSION')
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DCAP_CREATE')
'YES'
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DCAP_VECTOR')
'YES'
>>> 

But

$ docker run -it --entrypoint bash ghcr.io/osgeo/gdal:ubuntu-full-3.5.3
root@6691affad24c:/# python
Python 3.8.10 (default, Jun 22 2022, 20:18:18) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from osgeo import ogr
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DMD_EXTENSIONS')
'gpkg'
>>> ogr.GetDriverByName("GPKG").GetMetadataItem('DMD_EXTENSION')
'gpkg'

So I just made a condition on the .gpkg.zip assert in the test for newer versions only. Hopefully it works now

@Atreyagaurav
Copy link
Contributor Author

Atreyagaurav commented Jan 2, 2024

I checked that all upto gdal 3.1.3 has shp.zip support:
https://gdal.org/drivers/vector/shapefile.html#compressed-files

That's why I left it as it was.

This link shows that ubuntu-lts has gdal 3.0 https://askubuntu.com/questions/1267844/installing-libgdal-dev-on-ubuntu-20-04

I've made it to check for that in the test as well.

@Atreyagaurav
Copy link
Contributor Author

Ok, should I remove the ES test? I don't understand it, and I just added it for complete tests. I can't find if something changed in version 3.1 for that. Or should I add the condition of >= 3010000 for it as well.

@Atreyagaurav
Copy link
Contributor Author

Atreyagaurav commented Jan 2, 2024

EDIT: Found the problem, you can only read the last 2 paragraph.

$ docker run -it --entrypoint bash ubuntu:20.04
root@1015807d8b92:/# apt install libpq-dev gdal-bin libgdal-dev
root@1015807d8b92:/# python3
Python 3.8.10 (default, Nov 22 2023, 10:22:35) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from osgeo import ogr
>>> import osgeo
>>> osgeo.__version__
'3.0.4'
>>> ogr.GetDriverByName("Elasticsearch").GetMetadataItem('DMD_CONNECTION_PREFIX')
'ES:'
>>> ogr.GetDriverByName("Elasticsearch").GetMetadataItem('DCAP_CREATE')
'YES'
>>> ogr.GetDriverByName("Elasticsearch").GetMetadataItem('DCAP_VECTOR')
'YES'
>>>

I don't know why it's failing there. the things we're checking seems fine. Should we just put gdal_version >= 3010000 for the test?

lol, the name was different ogr.GetDriverByName("Elasticsearch").name came up as ElasticSearch in 3.0.4, while for others it's Elasticsearch, even though driver_by_name gives us the driver using Elasticsearch. Should be fixed now. Hopefully no more problems.

The logic for the function is correct, tests are troublesome.

src/driver.rs Outdated
/// ```text
/// ["GPKG"]
/// ```
pub fn get_drivers_for_filename(filename: &str, is_vector: bool) -> Vec<Driver> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is is_vector necessary? Is it possible to have multiple drivers that read the same file extension but differentiate on vector vs. raster data? If so, is this something we need to impose on every user (IOW, could we let the caller deal with that if it's a rare edge case?)

I guess what I'm wondering here is if we should only be evaluating against DMD_EXTENSION and DMD_EXTENSIONS, and defer capability filtering to a different mechanism (e.g. extension method on Driver). The method name implies we're only looking at matching drivers against filename (extension).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. In principle, the user might have all sorts of extra drivers installed. Matching the GDAL logic seemed like a safe approach.

We could ask Even, I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic I went by is this: there can be multiple drivers for the same extension, and some could support raster some vector and some both. And this function is (related with other comments) made to know the driver in advance while creating a file as if we have a file already, then we can call Dataset::open without driver information. But we cannot create and save data without deciding the driver first. And if we're writing a data, we'd want the list of drivers that supports that type of data, hence that bool.

It was also in the function Inicola sent me as an example, so it's my understanding of that, not my innovation.

src/driver.rs Outdated
Comment on lines 464 to 469
let can_create = d.metadata_item("DCAP_CREATE", "").is_some()
|| d.metadata_item("DCAP_CREATECOPY", "").is_some();
let check_vector = is_vector && d.metadata_item("DCAP_VECTOR", "").is_some();
let check_raster = !is_vector && d.metadata_item("DCAP_RASTER", "").is_some();
let check_vector_translate =
is_vector && d.metadata_item("DCAP_VECTOR_TRANSLATE_FROM", "").is_some();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above... i.e. should we consider selection and filtering two different operations?

src/driver.rs Outdated
Comment on lines 442 to 443
let filename = filename.to_ascii_lowercase();
let e = match filename.rsplit_once('.') {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to ensure true logical parity with GDAL C behavior, we'd use CPLGetExtension. 50/50 on whether that's good idea or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider using the rust's "extension()" method on the Path as well, but considering that it can only do the last "." onwards, which requires us to do the endswith on the filename for .shp.zip and .gpkg.shp, I thought just using them directly might be better. And the online protocol addresses with prefix (like Elasticsearch in the test) may not be taken as a valid path, so I just did the string thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have this work for Paths. I think you can use extension and then file_stem and again extension to check for this without that much ceremony.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path works fine with those, especially as they're just valid file names on Linux. Path::new is infallible.

I'd like this because otherwise it's pretty annoying to convert a Path or PathBuf to a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly I don't want to do it lol.

Like for that: startswith() we need to do for "PG:" and similar paths, we'd need to convert the path into string right? And iirc it can't convert it into direct string and has to go through Debug trait and ends up with quoted string or something.

I can try, but unless we can use the Path args without having to turn it back to string, it feels like we'll just end up asking for a Path and then internally convert it into string anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you have a Path it's harder to get a string.

That conflicts with the first sentence :D

Users can just call [to_string_lossy](https://doc.rust-lang.org/std/path/struct.Path.html#method.to_string_lossy). If it's easy for us, it's easy for them.

Anyway, I'm trying to do it with AsRef<Path>. I'll make a commit if I am successful.

Edit: users have to call it once to pass the path as string currently.

If we take Path, we have to call to_string_lossy for extension, for filename (second extension), and again to check for prefix...

Copy link
Member

@lnicola lnicola Jan 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's easy for us, it's easy for them.

It's easy for us in the sense that it's infallible, it requires no unsafe code, and it's almost always efficient. It's harder for them because it's ugly and maybe they won't know about to_string_lossy so they might try to debug-format them, as you wanted to do.


There's also something to be said about being lossy, but I guess we only support UTF-8 file names, while GDAL might, in principle, support arbitrary byte sequences). But we hide it, and we could support non-UTF-8 file names today -- at least on Linux -- without changing our API.

On the other hand, if we make the user call to_string_lossy on a non-UTF-8 path, they'll get a different name at the output. It's fine when we do it (and also when the user does it) because it won't affect the driver selection in practice. But as you can see, it's a lot of relatively subtle stuff to think about.

Also, you'll see that to_string_lossy returns a Cow<'_, str>, which might cause some people to give up and go back to Python.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, I'm currently having that same problem. it seems I can't use Cow<'_, str> on pattern. And might have to do, Variants of Cow::. At this point doing if statement seems easier than match. It's really a lot of complications when we have to turn it back into string. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, let's leave it like this, I'll try to clean it up tomorrow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of the abovementioned problem with Cow, and not being able to use match properly and so on. I converted it into String after fighting with borrow checking for a while trying to use str.

src/driver.rs Outdated
Comment on lines 462 to 469
for i in 0..DriverManager::count() {
let d = DriverManager::get_driver(i).expect("Index for this loop should be valid");
let can_create = d.metadata_item("DCAP_CREATE", "").is_some()
|| d.metadata_item("DCAP_CREATECOPY", "").is_some();
let check_vector = is_vector && d.metadata_item("DCAP_VECTOR", "").is_some();
let check_raster = !is_vector && d.metadata_item("DCAP_RASTER", "").is_some();
let check_vector_translate =
is_vector && d.metadata_item("DCAP_VECTOR_TRANSLATE_FROM", "").is_some();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this description of what GDAL does when Dataset::open is called.
https://github.com/OSGeo/gdal/blob/master/gcore/gdaldataset.cpp#L3630-L3645

From my 5 minute reading it looks like some of the logic is delegated to GDALIdentifyDriverEx:

https://github.com/OSGeo/gdal/blob/master/gcore/gdaldriver.cpp#L2649C25-L2649C45

My concern here is not wanting to diverge from the accepted GDAL logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GDALIdentifyDriverEx is used for opening, but that's usually not necessary.

The functions I linked to in #506 (comment) is used by the GDAL binaries to pick an output format. If we offer this function, we should match the latter, and use a better name (get_driver(s)_for_writing or something like that).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do need to change the function name. I was also looking for ideas as the ones I came up got too long.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The functions I linked to in #506 (comment) is used by the GDAL binaries to pick an output format.

Ahhh... I thought there was something already in the code, but didn't realize it was C++ only :/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it is for binary apps. Not the part of the gdal library. Otherwise, I was planning on just using the function from gdal-sys.

src/driver.rs Outdated Show resolved Hide resolved
src/driver.rs Outdated
Comment on lines 416 to 419
/// drivers and returns the matches. The determined driver is
/// checked for writing capabilities as
/// [`Dataset::open`](Dataset::open) can already open datasets
/// with just path.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand why we limit returned drivers to ones with "write" capabilities... maybe read-only use cases might also want a list of drivers to pick from (e.g. when reading JP2k, there are several options and you might want to inspect the metadata to decide on which to select).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, initially I was thinking that, but then I realized that when we already have a dataset, we can just call Dataset::open and then look at the driver. Having a dataset you can open means you have a lot in your arsenal to detect the driver, like the metadata you mentioned.

The use case we're thinking is to when we need to write a dataset, currently the only way I could find to write to file is to create a file from a driver, which requires you to know the driver in advance. Of course, you can make the user pass the driver name along with the filename and call it a day (that's what I was doing on one of my program), but all users are not likely to know about the driver names. And if we can more or less guess based on the file extension, then we make it convenient and remove any user error (could write a Tiff file in .shp extension file and it'll write it). And since GIS programs let you write to any file, making the program only dependent on the filename makes it easier to integrate with other programs as well.

I think renaming the function to mention it's for writing should help with it.

@metasim
Copy link
Contributor

metasim commented Jan 3, 2024

I have the tests at the end of src/driver.rs, let me know if I should remove it.

That's the appropriate place :)

@Atreyagaurav
Copy link
Contributor Author

That's the appropriate place :)

Thank you. That was more of a comment due to worry that the tests might fail on the CI or on other people's computer because some drivers are not available. But that should be fixed now. Thought the tests did give a lot of trouble for other reasons.

@lnicola
Copy link
Member

lnicola commented Jan 3, 2024

Nit: maybe we should use something more common like PG instead of ElasticSearch?

@Atreyagaurav
Copy link
Contributor Author

Nit: maybe we should use something more common like PG instead of ElasticSearch?

Done.

I also renamed the funtion. Idk if PG will also have some quirks on the tests like ElasticSearch though. Hopefully not

src/driver.rs Outdated Show resolved Hide resolved
src/driver.rs Outdated
/// ```text
/// ["GPKG"]
/// ```
pub fn guess_drivers_for_write(filename: &str, is_vector: bool) -> Vec<Driver> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry to bikeshed here, but should we consider:

  1. Also providing guess_drivers_for_read
  2. Have the signature guess_drivers_for_write<P: AsRef<Path>>(filename: P, ...) to be more coherent with other methods accepting files. (&str is AsRef<Path>).

Copy link
Member

@lnicola lnicola Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about guess_drivers_for_read. There's already an API that works for existing files, and do we really need to find a driver for a file that doesn't exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider using AsRef<Path>, but I thought the connections like PG:https://xxxxxx might not work or have unexpected problems. And we need to convert it back to string to use the "endswith" or "split" for double extensions. Hence, I went with string. I haven't tested if those connections work as path or not though.

@metasim
Copy link
Contributor

metasim commented Jan 3, 2024

LGTM

@metasim
Copy link
Contributor

metasim commented Jan 9, 2024

Also, if we want to wait to use the gdal-sys function from the new version, we can wait. Otherwise, we can merge this, and then later replace the implementation by calling that function internally.

This is my vote, assuming the semantics are/will be the same. Would be nice to expose the capability for users using older versions of 3.x, but then add a cfg block for the new version of gdal, upon release.

@lnicola
Copy link
Member

lnicola commented Jan 9, 2024

Yeah, I think we can add it now (with an appropriate signature for the native one), with a note that it's implemented differently before 3.9.

@lnicola
Copy link
Member

lnicola commented Jan 10, 2024

I pushed a couple of changes, but still want to do another pass before merging, sorry.

@Atreyagaurav
Copy link
Contributor Author

Wouldn't the name get_drivers_for_path be better than name? As name is used as driver name driver_by_name, so it might be ambiguous. So filename or path might be better.

@lnicola
Copy link
Member

lnicola commented Jan 10, 2024

Yeah, probably. I had dataset_name, but it seemed a little excessive. path is confusing for PG.

@Atreyagaurav
Copy link
Contributor Author

It's still a path in case of PG and others, just not a local one. Should be fine if people check the doc strings, if they don't I thought name might make people try the driver name instead.

rouault added a commit to rouault/gdal that referenced this pull request Jan 10, 2024
rouault added a commit to rouault/gdal that referenced this pull request Jan 11, 2024
@Atreyagaurav
Copy link
Contributor Author

This is finalized, right? No more modifications?

@@ -10,6 +10,8 @@

- <https://github.com/georust/gdal/pull/504>

- Added `DriverManager::guess_driver_for_write` and `DriverManager::guess_drivers_for_write` for the ability to auto detect compatible `Driver`(s) for writing data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to add the PR link here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. Done.

src/driver.rs Outdated
@@ -461,6 +592,11 @@ impl DriverManager {
}
}

pub enum DriverProperties {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider the name DriverType, as "Properties" usually implies a collection of settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I was initially thinking that properties might be weird. It's renamed now.

@metasim
Copy link
Contributor

metasim commented Jan 15, 2024

Looks good to me. @lnicola OK if I merge?

@lnicola
Copy link
Member

lnicola commented Jan 15, 2024

Yes, but I think it still needs a rebase.

I'm a little worried about that enum, since we should probably add a variat (bit) to warn when there are multiple matching drivers (GDAL has a second parameter, but we don't need to match that). That would make the name a little unfortunate, and something like DriverPreference or GetOutputDriverFlags would be nicer.

But let's merge this now, as I'm not going to have much time for it in the near future and I don't want to keep delaying it forever.

@metasim
Copy link
Contributor

metasim commented Jan 15, 2024

Is a "Squash and merge" sufficient?

@lnicola
Copy link
Member

lnicola commented Jan 15, 2024

Feel free to try it, I'm seeing:

This branch cannot be rebased due to conflicts

@metasim metasim merged commit 87497bf into georust:master Jan 15, 2024
9 checks passed
@metasim
Copy link
Contributor

metasim commented Jan 15, 2024

@Atreyagaurav thanks for your hard, patient work on this. We all appreciate it! :)

@Atreyagaurav
Copy link
Contributor Author

Thank you everyone for reviewing and the guidance. And for making gdal available in rust. I'll try to contribute more if I find some simple issues I can fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature Request] Guess the Driver Type based on the file extension
4 participants