Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/sql-migration-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ license: |
- Since Spark 4.2, Spark enables order-independent checksums for shuffle outputs by default to detect data inconsistencies during indeterminate shuffle stage retries. If a checksum mismatch is detected, Spark rolls back and re-executes all succeeding stages that depend on the shuffle output. If rolling back is not possible for some succeeding stages, the job will fail. To restore the previous behavior, set `spark.sql.shuffle.orderIndependentChecksum.enabled` and `spark.sql.shuffle.orderIndependentChecksum.enableFullRetryOnMismatch` to `false`.
- Since Spark 4.2, support for Derby JDBC datasource is deprecated.
- Since Spark 4.2, a new default method `mergeWith` has been added to the `CustomTaskMetric` interface. The default implementation sums the two metric values, which is correct for count-type metrics. Data source connector implementations that report non-additive metrics (e.g., maximum, average, compression ratio, or gauge values) must override `mergeWith` to provide correct merge semantics.
- Since Spark 4.2, the geospatial `GEOMETRY` and `GEOGRAPHY` types and the corresponding `ST_*` functions are enabled. See [Geospatial (Geometry/Geography) Types](sql-ref-geospatial-types.html) for additional details.

## Upgrading from Spark SQL 4.0 to 4.1

Expand Down
12 changes: 6 additions & 6 deletions docs/sql-ref-datatypes.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,8 +95,8 @@ Spark SQL and DataFrames support the following data types:

* Spatial types
Spatial objects as defined in the [OGC Simple Feature Access](https://portal.ogc.org/files/?artifact_id=25355) specification.
- `GeometryType`: Represents GEOMETRY valuesspatial objects in a Cartesian coordinate system. The type can be fixed to a single SRID, e.g. `geometry(4326)`, or allow mixed SRIDs with `geometry(any)`. Default SRID when not specified is 4326 (WGS 84).
- `GeographyType`: Represents GEOGRAPHY valuesspatial objects in a geographic coordinate system (latitude/longitude). Edge interpolation is always SPHERICAL. The type can be fixed to a single SRID, e.g. `geography(4326)`, or allow mixed SRIDs with `geography(any)`. Default SRID is 4326 (WGS 84).
- `GeometryType`: Represents GEOMETRY values, spatial objects in a Cartesian coordinate system. The type must be fixed to a single SRID, e.g. `geometry(4326)`, or allow mixed SRIDs with `geometry(any)`. In SQL, `GEOMETRY` columns must always be declared with an explicit SRID or `ANY`.
- `GeographyType`: Represents GEOGRAPHY values, spatial objects in a geographic coordinate system (latitude/longitude). Edge interpolation is always SPHERICAL. The type must be fixed to a single geographic SRID, e.g. `geography(4326)`, or allow mixed SRIDs with `geography(any)`. In SQL, `GEOGRAPHY` columns must always be declared with an explicit SRID or `ANY`.
For more details and built-in functions, see [Geospatial (Geometry/Geography) types](sql-ref-geospatial-types.html).

* Complex types
Expand Down Expand Up @@ -143,8 +143,8 @@ from pyspark.sql.types import *
|**TimestampNTZType**|datetime.datetime|TimestampNTZType()|
|**DateType**|datetime.date|DateType()|
|**DayTimeIntervalType**|datetime.timedelta|DayTimeIntervalType()|
|**GeometryType**|Geometry|GeometryType() or GeometryType(*srid*)|
|**GeographyType**|Geography|GeographyType() or GeographyType(*srid*)|
|**GeometryType**|Geometry|GeometryType(*srid*)<br/>**Note:** *srid* is required and may be an `int` or the string `"ANY"`.|
|**GeographyType**|Geography|GeographyType(*srid*)<br/>**Note:** *srid* is required and may be an `int` or the string `"ANY"`.|
|**ArrayType**|list, tuple, or array|ArrayType(*elementType*, [*containsNull*])<br/>**Note:**The default value of *containsNull* is True.|
|**MapType**|dict|MapType(*keyType*, *valueType*, [*valueContainsNull]*)<br/>**Note:**The default value of *valueContainsNull* is True.|
|**StructType**|list or tuple|StructType(*fields*)<br/>**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
Expand Down Expand Up @@ -272,8 +272,8 @@ The following table shows the type names as well as aliases used in Spark SQL pa
|**DecimalType**|DECIMAL, DEC, NUMERIC|
|**YearMonthIntervalType**|INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH|
|**DayTimeIntervalType**|INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND|
|**GeometryType**|GEOMETRY or GEOMETRY(*srid*) or GEOMETRY(ANY)|
|**GeographyType**|GEOGRAPHY or GEOGRAPHY(*srid*) or GEOGRAPHY(ANY)|
|**GeometryType**|GEOMETRY(*srid*) or GEOMETRY(ANY)|
|**GeographyType**|GEOGRAPHY(*srid*) or GEOGRAPHY(ANY)|
|**ArrayType**|ARRAY\<element_type>|
|**StructType**|STRUCT<field1_name: field1_type, field2_name: field2_type, ...><br/> **Note:** ':' is optional.|
|**MapType**|MAP<key_type, value_type>|
Expand Down
5 changes: 5 additions & 0 deletions docs/sql-ref-functions-builtin.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,3 +126,8 @@ license: |
{% include_api_gen generated-variant-funcs-table.html %}
#### Examples
{% include_api_gen generated-variant-funcs-examples.html %}

### Geospatial ST Functions
{% include_api_gen generated-st-funcs-table.html %}
#### Examples
{% include_api_gen generated-st-funcs-examples.html %}
30 changes: 20 additions & 10 deletions docs/sql-ref-geospatial-types.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,8 +25,13 @@ Spark SQL supports **GEOMETRY** and **GEOGRAPHY** types for spatial data, as def

| Type | Coordinate system | Typical use and notes |
|------|-------------------|------------------------|
| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Default SRID in Spark is 4326. |
| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always **SPHERICAL**. Default SRID is 4326 (WGS 84). |
| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Accepts any SRID in the registry, including SRID 0 (unspecified CRS). |
| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always **SPHERICAL**. Only geographic SRIDs are accepted; the most common is 4326 (WGS 84). |

In SQL, `GEOMETRY` and `GEOGRAPHY` columns must always be declared with an explicit SRID
(or `ANY`); see [Type Syntax in SQL](#type-syntax-in-sql) below. When a value is constructed
via `ST_GeomFromWKB(wkb)` without an explicit SRID, the value's SRID is `0` (unspecified),
while `ST_GeogFromWKB(wkb)` always returns a value with SRID 4326.

#### When to use GEOMETRY vs GEOGRAPHY

Expand Down Expand Up @@ -113,23 +118,28 @@ When parsing WKB, Spark applies the following rules. Violations result in a pars

### Built-in Geospatial (ST) Functions

Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. They are grouped under **st_funcs** in the [Built-in Functions](sql-ref-functions-builtin.html) API.
Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. The full list,
with detailed argument descriptions and examples, is on the
[Built-in Functions](sql-ref-functions-builtin.html#geospatial-st-functions) page under
**Geospatial ST Functions**. The functions provided in the current release are summarized here:

| Function | Description |
|----------|-------------|
| `ST_AsBinary(geo)` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). |
| `ST_GeomFromWKB(wkb)` | Parses WKB and returns a GEOMETRY with default SRID 0. |
| `ST_GeomFromWKB(wkb, srid)` | Parses WKB and returns a GEOMETRY with the given SRID. |
| `ST_AsBinary(geo[, endianness])` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). The optional `endianness` argument is `'NDR'` for little-endian (default) or `'XDR'` for big-endian. |
| `ST_GeomFromWKB(wkb[, srid])` | Parses WKB and returns a GEOMETRY. The optional `srid` argument sets the SRID; if omitted, the SRID is `0`. |
| `ST_GeogFromWKB(wkb)` | Parses WKB and returns a GEOGRAPHY with SRID 4326. |
| `ST_Srid(geo)` | Returns the SRID of the GEOMETRY or GEOGRAPHY value (NULL if input is NULL). |
| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. |
| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. The new SRID must be valid for the value's type. |

**Examples:**

```sql
SELECT hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')));
-- 0101000000000000000000F03F0000000000000040

SELECT hex(ST_AsBinary(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'), 'XDR'));
-- 00000000013FF00000000000004000000000000000

SELECT ST_Srid(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'));
-- 4326

Expand All @@ -139,9 +149,9 @@ SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F00000000000

### SRID and Stored Values

* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the values SRID to match the column).
* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed.
* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required.
* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID raises a `GEO_ENCODER_SRID_MISMATCH_ERROR`. Use `ST_SetSrid` to change a value's SRID to match the column.
* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs per row. Each value must still have a valid SRID for the type; an invalid SRID raises `ST_INVALID_SRID_VALUE`.
* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column. They do not support persisting `GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`; mixed-SRID types exist for in-memory/query use only.

### Supported SRIDs

Expand Down
3 changes: 2 additions & 1 deletion sql/gen-sql-functions-docs.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,8 @@
"bitwise_funcs", "conversion_funcs", "csv_funcs",
"xml_funcs", "lambda_funcs", "collection_funcs",
"url_funcs", "hash_funcs", "struct_funcs",
"table_funcs", "variant_funcs", "protobuf_funcs", "sketch_funcs"
"table_funcs", "variant_funcs", "protobuf_funcs", "sketch_funcs",
"st_funcs"
}


Expand Down