diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 66531397d2cc1..34829c03da694 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -31,6 +31,7 @@ license: | - Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`. - Since Spark 4.2, support for Derby JDBC datasource is deprecated. - Since Spark 4.2, a new default method `mergeWith` has been added to the `CustomTaskMetric` interface. The default implementation sums the two metric values, which is correct for count-type metrics. Data source connector implementations that report non-additive metrics (e.g., maximum, average, compression ratio, or gauge values) must override `mergeWith` to provide correct merge semantics. +- Since Spark 4.2, the geospatial `GEOMETRY` and `GEOGRAPHY` types and the corresponding `ST_*` functions are enabled. See [Geospatial (Geometry/Geography) Types](sql-ref-geospatial-types.html) for additional details. ## Upgrading from Spark SQL 4.0 to 4.1 diff --git a/docs/sql-ref-datatypes.md b/docs/sql-ref-datatypes.md index 743ad4e3abb22..32e3716de366a 100644 --- a/docs/sql-ref-datatypes.md +++ b/docs/sql-ref-datatypes.md @@ -95,8 +95,8 @@ Spark SQL and DataFrames support the following data types: * Spatial types Spatial objects as defined in the [OGC Simple Feature Access](https://portal.ogc.org/files/?artifact_id=25355) specification. - - `GeometryType`: Represents GEOMETRY values—spatial objects in a Cartesian coordinate system. 
The type can be fixed to a single SRID, e.g. `geometry(4326)`, or allow mixed SRIDs with `geometry(any)`. Default SRID when not specified is 4326 (WGS 84). - `GeographyType`: Represents GEOGRAPHY values—spatial objects in a geographic coordinate system (latitude/longitude). Edge interpolation is always SPHERICAL. The type can be fixed to a single SRID, e.g. `geography(4326)`, or allow mixed SRIDs with `geography(any)`. Default SRID is 4326 (WGS 84). + - `GeometryType`: Represents GEOMETRY values: spatial objects in a Cartesian coordinate system. The type must either be fixed to a single SRID, e.g. `geometry(4326)`, or allow mixed SRIDs with `geometry(any)`. In SQL, `GEOMETRY` columns must always be declared with an explicit SRID or `ANY`. + - `GeographyType`: Represents GEOGRAPHY values: spatial objects in a geographic coordinate system (latitude/longitude). Edge interpolation is always SPHERICAL. The type must either be fixed to a single geographic SRID, e.g. `geography(4326)`, or allow mixed SRIDs with `geography(any)`. In SQL, `GEOGRAPHY` columns must always be declared with an explicit SRID or `ANY`. For more details and built-in functions, see [Geospatial (Geometry/Geography) types](sql-ref-geospatial-types.html). * Complex types @@ -143,8 +143,8 @@ from pyspark.sql.types import * |**TimestampNTZType**|datetime.datetime|TimestampNTZType()| |**DateType**|datetime.date|DateType()| |**DayTimeIntervalType**|datetime.timedelta|DayTimeIntervalType()| -|**GeometryType**|Geometry|GeometryType() or GeometryType(*srid*)| -|**GeographyType**|Geography|GeographyType() or GeographyType(*srid*)| +|**GeometryType**|Geometry|GeometryType(*srid*)
**Note:** *srid* is required and may be an `int` or the string `"ANY"`.| +|**GeographyType**|Geography|GeographyType(*srid*)
**Note:** *srid* is required and may be an `int` or the string `"ANY"`.| |**ArrayType**|list, tuple, or array|ArrayType(*elementType*, [*containsNull*])
**Note:**The default value of *containsNull* is True.| |**MapType**|dict|MapType(*keyType*, *valueType*, [*valueContainsNull]*)
**Note:**The default value of *valueContainsNull* is True.| |**StructType**|list or tuple|StructType(*fields*)
**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.| @@ -272,8 +272,8 @@ The following table shows the type names as well as aliases used in Spark SQL pa |**DecimalType**|DECIMAL, DEC, NUMERIC| |**YearMonthIntervalType**|INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH| |**DayTimeIntervalType**|INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND| -|**GeometryType**|GEOMETRY or GEOMETRY(*srid*) or GEOMETRY(ANY)| -|**GeographyType**|GEOGRAPHY or GEOGRAPHY(*srid*) or GEOGRAPHY(ANY)| +|**GeometryType**|GEOMETRY(*srid*) or GEOMETRY(ANY)| +|**GeographyType**|GEOGRAPHY(*srid*) or GEOGRAPHY(ANY)| |**ArrayType**|ARRAY\| |**StructType**|STRUCT
**Note:** ':' is optional.| |**MapType**|MAP| diff --git a/docs/sql-ref-functions-builtin.md b/docs/sql-ref-functions-builtin.md index b6572609a34b8..1912a1e577d59 100644 --- a/docs/sql-ref-functions-builtin.md +++ b/docs/sql-ref-functions-builtin.md @@ -126,3 +126,8 @@ license: | {% include_api_gen generated-variant-funcs-table.html %} #### Examples {% include_api_gen generated-variant-funcs-examples.html %} + +### Geospatial ST Functions +{% include_api_gen generated-st-funcs-table.html %} +#### Examples +{% include_api_gen generated-st-funcs-examples.html %} diff --git a/docs/sql-ref-geospatial-types.md b/docs/sql-ref-geospatial-types.md index ed8b6597ae1f0..d5a9d0fece84b 100644 --- a/docs/sql-ref-geospatial-types.md +++ b/docs/sql-ref-geospatial-types.md @@ -25,8 +25,13 @@ Spark SQL supports **GEOMETRY** and **GEOGRAPHY** types for spatial data, as def | Type | Coordinate system | Typical use and notes | |------|-------------------|------------------------| -| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Default SRID in Spark is 4326. | -| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always **SPHERICAL**. Default SRID is 4326 (WGS 84). | +| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Accepts any SRID in the registry, including SRID 0 (unspecified CRS). | +| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). 
Edge interpolation is always **SPHERICAL**. Only geographic SRIDs are accepted; the most common is 4326 (WGS 84). | + +In SQL, `GEOMETRY` and `GEOGRAPHY` columns must always be declared with an explicit SRID +(or `ANY`); see [Type Syntax in SQL](#type-syntax-in-sql) below. When a value is constructed +via `ST_GeomFromWKB(wkb)` without an explicit SRID, the value's SRID is `0` (unspecified), +while `ST_GeogFromWKB(wkb)` always returns a value with SRID 4326. #### When to use GEOMETRY vs GEOGRAPHY @@ -113,16 +118,18 @@ When parsing WKB, Spark applies the following rules. Violations result in a pars ### Built-in Geospatial (ST) Functions -Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. They are grouped under **st_funcs** in the [Built-in Functions](sql-ref-functions-builtin.html) API. +Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. The full list, +with detailed argument descriptions and examples, is on the +[Built-in Functions](sql-ref-functions-builtin.html#geospatial-st-functions) page under +**Geospatial ST Functions**. The functions provided in the current release are summarized here: | Function | Description | |----------|-------------| -| `ST_AsBinary(geo)` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). | -| `ST_GeomFromWKB(wkb)` | Parses WKB and returns a GEOMETRY with default SRID 0. | -| `ST_GeomFromWKB(wkb, srid)` | Parses WKB and returns a GEOMETRY with the given SRID. | +| `ST_AsBinary(geo[, endianness])` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). The optional `endianness` argument is `'NDR'` for little-endian (default) or `'XDR'` for big-endian. | +| `ST_GeomFromWKB(wkb[, srid])` | Parses WKB and returns a GEOMETRY. The optional `srid` argument sets the SRID; if omitted, the SRID is `0`. | | `ST_GeogFromWKB(wkb)` | Parses WKB and returns a GEOGRAPHY with SRID 4326. 
| | `ST_Srid(geo)` | Returns the SRID of the GEOMETRY or GEOGRAPHY value (NULL if input is NULL). | -| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. | +| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. The new SRID must be valid for the value's type. | **Examples:** @@ -130,6 +137,9 @@ Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY valu SELECT hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'))); -- 0101000000000000000000F03F0000000000000040 +SELECT hex(ST_AsBinary(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'), 'XDR')); +-- 00000000013FF00000000000004000000000000000 + SELECT ST_Srid(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')); -- 4326 @@ -139,9 +149,9 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000 ### SRID and Stored Values -* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column). -* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. -* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. +* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID raises a `GEO_ENCODER_SRID_MISMATCH_ERROR`. Use `ST_SetSrid` to change a value's SRID to match the column. +* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs per row. Each value must still have a valid SRID for the type; an invalid SRID raises `ST_INVALID_SRID_VALUE`. 
+* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column. They do not support persisting `GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`; mixed-SRID types exist for in-memory/query use only. ### Supported SRIDs diff --git a/sql/gen-sql-functions-docs.py b/sql/gen-sql-functions-docs.py index 13f9ae055fa73..2ae00f6db8221 100644 --- a/sql/gen-sql-functions-docs.py +++ b/sql/gen-sql-functions-docs.py @@ -36,7 +36,8 @@ "bitwise_funcs", "conversion_funcs", "csv_funcs", "xml_funcs", "lambda_funcs", "collection_funcs", "url_funcs", "hash_funcs", "struct_funcs", - "table_funcs", "variant_funcs", "protobuf_funcs", "sketch_funcs" + "table_funcs", "variant_funcs", "protobuf_funcs", "sketch_funcs", + "st_funcs" }
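
The expected output of the new `'XDR'` example in `docs/sql-ref-geospatial-types.md` can be cross-checked without a Spark build. The sketch below is not part of the patch and does not use Spark's implementation. It assumes the standard WKB layout for a 2D Point (one byte-order byte, a 4-byte geometry type, then two 8-byte doubles), and the helper name `wkb_point_to_xdr` is hypothetical.

```python
import struct


def wkb_point_to_xdr(wkb: bytes) -> bytes:
    """Re-encode a 2D WKB Point in big-endian (XDR) byte order.

    Hypothetical helper for checking the docs example; it handles only
    the 21-byte 2D Point layout and nothing else.
    """
    if len(wkb) != 21:
        raise ValueError("expected a 21-byte 2D WKB Point")
    # Byte 0 is the byte-order flag: 1 = NDR (little-endian), 0 = XDR (big-endian).
    if wkb[0] == 1:
        order = "<"
    elif wkb[0] == 0:
        order = ">"
    else:
        raise ValueError("invalid WKB byte-order flag")
    # After the flag: a uint32 geometry type, then two float64 coordinates.
    geom_type, x, y = struct.unpack(order + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("this sketch only handles 2D Point (type 1)")
    # Re-pack everything big-endian, with the byte-order flag set to 0 (XDR).
    return struct.pack(">BIdd", 0, geom_type, x, y)


# Same WKB literal as in the docs examples: POINT(1 2) in NDR byte order.
ndr_hex = "0101000000000000000000F03F0000000000000040"
xdr_hex = wkb_point_to_xdr(bytes.fromhex(ndr_hex)).hex().upper()
print(xdr_hex)  # 00000000013FF00000000000004000000000000000
```

The printed value matches the `-- 00000000013FF00000000000004000000000000000` comment added in the docs example, which suggests the documented `'XDR'` output is consistent with the `'NDR'` input literal.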