raster_to_grid not working when using retile with GeoTIFF file #558

Open
carlosg-m opened this issue May 3, 2024 · 2 comments
Comments

carlosg-m commented May 3, 2024

  • DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
  • Standard_DS13_v2 (driver and 2 workers)
  • Photon disabled
  • databricks-mosaic 0.4.1
  • GDAL init script installed on the cluster
  • The dataset is a relatively large GeoTIFF raster file with one layer. Each pixel is a uint16 value representing a land use or land cover category (forest, industrial, agricultural, and so on). Out-of-bounds values are masked with the largest uint16 value, 2^16 - 1 = 65535.
  • Dataset source: https://geo2.dgterritorio.gov.pt/cosc/COSc2023.zip
  • Use case: I need to efficiently read and process a raster file stored in a projected coordinate system and convert it to a grid index, so that it can be intersected with points stored in a geographic coordinate system (WGS84).
  • As far as I can tell the GeoTIFF file is valid; I've tested it with Rasterio and NumPy.
  • I'm trying to automate the process with Mosaic.
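
To make the intended lookup concrete, here is a minimal pure-Python sketch of sampling a land-cover class at a point, assuming the point has already been reprojected from WGS84 into the raster's projected CRS (that step is not shown). The function names, the toy grid, and the origin and pixel-size values are all hypothetical; only the 2^16 - 1 mask value comes from the description above.

```python
import math

NODATA = 2**16 - 1  # 65535, the masked value described above


def pixel_for_point(x, y, origin_x, origin_y, pixel_w, pixel_h):
    """Map a projected coordinate to (row, col) in a north-up raster.

    origin_x/origin_y are the map coordinates of the top-left corner;
    pixel_w/pixel_h are positive pixel sizes in map units.
    """
    col = math.floor((x - origin_x) / pixel_w)
    row = math.floor((origin_y - y) / pixel_h)
    return row, col


def sample(grid, x, y, origin_x, origin_y, pixel_w, pixel_h):
    """Return the class at (x, y), or None if outside the raster or masked."""
    row, col = pixel_for_point(x, y, origin_x, origin_y, pixel_w, pixel_h)
    if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
        return None
    value = grid[row][col]
    return None if value == NODATA else value


# Toy 2x3 raster with 10 m pixels and top-left corner at (1000, 2000).
# All values are made up for illustration.
grid = [
    [1, 2, NODATA],
    [3, 4, 5],
]
print(sample(grid, 1015.0, 1995.0, 1000.0, 2000.0, 10.0, 10.0))
```

This only illustrates the pixel arithmetic for a north-up raster; a real raster would take its origin and pixel size from the file's geotransform.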

The following Mosaic read is very slow and appears to suffer from heavy data skew (it gets stuck on the last task):

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

Adding the "retile" option makes the read fail with an error:

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid")  \
        .option("resolution", "2") \
        .option("retile", "true") \
        .option("tileSize", "1000") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.

Don't take this the wrong way: it is a pleasure to work with Shapely/PyGEOS/GeoPandas and even Rasterio together with Spark and Pandas UDFs, but navigating Databricks Mosaic has been an absolute pain (the same happened with Sedona and GeoSpark).

@carlosg-m carlosg-m changed the title raster_to_grid not working when using retile raster_to_grid not working when using retile with GeoTIFF file May 3, 2024
Contributor

mjohns-databricks commented May 4, 2024

Hi @carlosg-m, we are tracking the NetCDF issue with "raster_to_grid" (which also gets into separate bands for any multi-band file); a fix is coming with [WIP] PR #556, hopefully in a week or so.

Author

carlosg-m commented May 6, 2024

> Hi @carlosg-m we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file), fix coming with [WIP] PR #556 hopefully in a week or so.

Thank you for the response, @mjohns-databricks.
Are there any best practices to make the operation I described efficient, using the current version of Mosaic?

The workaround is to generate a table of reading windows (bounding boxes) and process them in parallel with a UDF, loading each window with Rasterio and converting it to a grid index.
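
As a rough illustration of the first step of that workaround, the sketch below generates the table of reading windows in plain Python. The function name `reading_windows` and the raster dimensions are made up for the example; the real width and height would come from the GeoTIFF's metadata.

```python
def reading_windows(width, height, tile_size):
    """Yield (col_off, row_off, win_width, win_height) tiles covering the raster.

    Tiles on the right and bottom edges are clipped to the raster extent.
    """
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            yield (
                col_off,
                row_off,
                min(tile_size, width - col_off),
                min(tile_size, height - row_off),
            )


# Hypothetical raster of 2500 x 1200 pixels, split into 1000-pixel tiles.
windows = list(reading_windows(width=2500, height=1200, tile_size=1000))
print(len(windows))   # number of tiles
print(windows[-1])    # the clipped bottom-right tile
```

Each tuple could then become one row of a Spark DataFrame, with a UDF opening the file and reading only its own tile, for example via `rasterio.windows.Window(col_off, row_off, width, height)` passed to the dataset's `read(1, window=...)`. Since every window has a bounded size, the work per task stays roughly even, which may help with the skew described above.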
