# Quickstart notebook
The example code here shows how to get up and running with Mosaic using the Python API.

In [0]:
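# Explicit imports; note that `abs` here is the Spark column function and shadows the Python builtin
from pyspark.sql.functions import abs, col, collect_list, explode, expr, lit, to_json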

## Enable Mosaic in the notebook
To get started, you'll need to attach the Python library to your cluster and execute the `enable_mosaic` function.

In [0]:
from mosaic import enable_mosaic
enable_mosaic(spark, dbutils)


                Please use a Databricks:
                    - Photon-enabled Runtime for performance benefits
                    - Runtime ML for spatial AI benefits
                Mosaic will stop working on this cluster after v0.3.x.


Mosaic has extra configuration options. Check the docs for more details.

In [0]:
help(enable_mosaic)

Help on function enable_mosaic in module mosaic.api.enable:

enable_mosaic(spark: pyspark.sql.session.SparkSession, dbutils=None) -> None
    Enable Mosaic functions.
    
    Use this function at the start of your workflow to ensure all the required dependencies are installed and
    Mosaic is configured according to your needs.
    
    Parameters
    ----------
    spark : pyspark.sql.SparkSession
            The active SparkSession.
    dbutils : dbruntime.dbutils.DBUtils
            The dbutils object used for `display` and `displayHTML` functions.
            Optional, only applicable to Databricks users.
    
    Returns
    -------
    
    Notes
    -----
    Users can control various aspects of Mosaic's operation with the following Spark confs:
    
    - `spark.databricks.labs.mosaic.jar.autoattach`: 'true' (default) or 'false'
       Automatically attach the Mosaic JAR to the Databricks cluster? (Optional)
    - `spark.databricks.labs.mosaic.jar.location`
       Explicitly 
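
As a minimal sketch of how one of the confs listed in the docstring might be set, assuming it needs to be in place before `enable_mosaic` runs: the helper name `set_mosaic_jar_autoattach` is hypothetical and not part of Mosaic. Only the conf key and its `'true'`/`'false'` values come from the help text above.

```python
def set_mosaic_jar_autoattach(spark, enabled: bool) -> None:
    """Hypothetical helper: record whether Mosaic should auto-attach its JAR.

    `spark` is any object exposing `conf.set(key, value)`, such as a SparkSession.
    """
    # Conf key taken from the enable_mosaic docstring; the value is passed as a string.
    spark.conf.set(
        "spark.databricks.labs.mosaic.jar.autoattach",
        "true" if enabled else "false",
    )
```

In a notebook this would presumably be called as `set_mosaic_jar_autoattach(spark, False)` ahead of `enable_mosaic(spark, dbutils)`.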

## Geometry constructors and the Mosaic internal geometry format

Mosaic allows users to create new Point geometries from a pair of Spark DoubleType columns.

In [0]:
from mosaic import st_point

lons = [-80., -80., -70., -70., -80.]
lats = [ 35.,  45.,  45.,  35.,  35.]

bounds_df = (
  spark
  .createDataFrame({"lon": lon, "lat": lat} for lon, lat in zip(lons, lats))
  .coalesce(1)
  .withColumn("point_geom", st_point("lon", "lat"))
)
bounds_df.show()

+----+-----+--------------------+
| lat|  lon|          point_geom|
+----+-----+--------------------+
|35.0|-80.0|{1, 0, [[[-80.0, ...|
|45.0|-80.0|{1, 0, [[[-80.0, ...|
|45.0|-70.0|{1, 0, [[[-70.0, ...|
|35.0|-70.0|{1, 0, [[[-70.0, ...|
|35.0|-80.0|{1, 0, [[[-80.0, ...|
+----+-----+--------------------+



Mosaic Point geometries can be aggregated into LineString and Polygon geometries using the respective constructors.

In [0]:
from mosaic import st_makeline

bounds_df = (
  bounds_df
  .groupBy()
  .agg(collect_list("point_geom").alias("bounding_coords"))
  .select(st_makeline("bounding_coords").alias("bounding_ring"))
)
bounds_df.show()

+--------------------+
|       bounding_ring|
+--------------------+
|{3, 0, [[[-80.0, ...|
+--------------------+



In [0]:
from mosaic import st_makepolygon

bounds_df = bounds_df.select(st_makepolygon("bounding_ring").alias("bounds"))
bounds_df.show()

+--------------------+
|              bounds|
+--------------------+
|{5, 0, [[[-80.0, ...|
+--------------------+
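
The struct values in these outputs hint at what the internal format holds: a type ID (1 for the Point, 3 for the LineString, 5 for the Polygon), a second field that appears to be an SRID, and nested coordinate arrays. The plain-Python sketch below mimics that layout to show what each constructor does conceptually. It is not Mosaic's implementation: the class `ToyGeometry`, its field names, and the `toy_*` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyGeometry:
    type_id: int                         # 1 = Point, 3 = LineString, 5 = Polygon (as in the outputs above)
    srid: int                            # assumed meaning of the second struct field
    boundaries: List[List[List[float]]]  # paths or rings -> coordinates -> [lon, lat]


def toy_point(lon: float, lat: float) -> ToyGeometry:
    return ToyGeometry(1, 0, [[[lon, lat]]])


def toy_makeline(points: List[ToyGeometry]) -> ToyGeometry:
    # Collect the single coordinate of each point, preserving order.
    path = [p.boundaries[0][0] for p in points]
    return ToyGeometry(3, 0, [path])


def toy_makepolygon(ring: ToyGeometry) -> ToyGeometry:
    path = ring.boundaries[0]
    # A polygon ring must be closed: the last coordinate repeats the first.
    if len(path) < 4 or path[0] != path[-1]:
        raise ValueError("a polygon ring needs at least 4 coordinates and must be closed")
    return ToyGeometry(5, 0, [path])


lons = [-80.0, -80.0, -70.0, -70.0, -80.0]
lats = [35.0, 45.0, 45.0, 35.0, 35.0]
points = [toy_point(lon, lat) for lon, lat in zip(lons, lats)]
bounds = toy_makepolygon(toy_makeline(points))
print(bounds.type_id, bounds.boundaries[0][0])  # 5 [-80.0, 35.0]
```

The closure check also explains why the input lists contain five coordinate pairs for a four-cornered box: the fifth pair repeats the first.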



## Geometry clipping without an index

Mosaic implements spatial predicate functions (contains, intersects, overlaps, etc.). Here `st_contains` is used to clip points to a polygon geometry.

In [0]:
tripsTable = spark.table("delta.`/databricks-datasets/nyctaxi/tables/nyctaxi_yellow`")

In [0]:
from mosaic import st_contains
trips = (
  tripsTable
  .limit(5_000_000)
  .repartition(sc.defaultParallelism * 20)
  .drop("vendorId", "rateCodeId", "store_and_fwd_flag", "payment_type")
  .withColumn("pickup_geom", st_point("pickup_longitude", "pickup_latitude"))
  .withColumn("dropoff_geom", st_point("dropoff_longitude", "dropoff_latitude"))
  .crossJoin(bounds_df)
  .where(st_contains("bounds", "pickup_geom"))
  .where(st_contains("bounds", "dropoff_geom"))
  .cache()
)

In [0]:
trips.show()

+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+--------------------+--------------------+--------------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude|rate_code_id|dropoff_longitude|dropoff_latitude|fare_amount|extra|mta_tax|tip_amount|tolls_amount|total_amount|         pickup_geom|        dropoff_geom|              bounds|
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+--------------------+--------------------+--------------------+
|      CMT|2009-10-31 22:18:30|2009-10-31 22:59:38|              2|          0.9|      -73.993177|       40.73217|        null|      
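
To make the clipping step concrete, here is a plain-Python sketch of a point-in-polygon test using the even-odd ray-casting rule, applied to the same bounding ring. This is a common textbook technique and not necessarily how Mosaic evaluates `st_contains`; the function names and the sample trips are made up for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_ring(point: Point, ring: List[Point]) -> bool:
    """Even-odd ray casting: count ring edges crossed by a ray going east from the point."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def clip_trips(trips: List[Dict], ring: List[Point]) -> List[Dict]:
    """Keep trips whose pickup and dropoff both fall inside the ring."""
    return [
        t for t in trips
        if point_in_ring(t["pickup"], ring) and point_in_ring(t["dropoff"], ring)
    ]


# Closed ring matching the bounding polygon built earlier in this notebook.
BOUNDS = [(-80.0, 35.0), (-80.0, 45.0), (-70.0, 45.0), (-70.0, 35.0), (-80.0, 35.0)]

# Illustrative records only; trip 2 has a dropoff at (0, 0), outside the bounds.
trips = [
    {"id": 1, "pickup": (-73.99, 40.73), "dropoff": (-73.97, 40.76)},
    {"id": 2, "pickup": (-73.99, 40.73), "dropoff": (0.0, 0.0)},
]
print([t["id"] for t in clip_trips(trips, BOUNDS)])  # [1]
```

Running a test like this against every row is what makes the unindexed approach expensive on large tables with complex polygons, which motivates the indexed joins later in the notebook.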

## Read from GeoJSON and compute some basic geometry attributes

You've seen how Mosaic can create geometries from Spark native data types, but it also provides functions to translate Well-Known Text (WKT), Well-Known Binary (WKB) and GeoJSON representations to Mosaic geometries.
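
To illustrate how these representations relate, the following plain-Python sketch converts a GeoJSON Polygon string into the equivalent WKT string using only the standard library. It is a simplified illustration rather than Mosaic code: the function name is hypothetical, and only the `Polygon` type with 2D positions is handled.

```python
import json


def geojson_polygon_to_wkt(geojson_text: str) -> str:
    """Convert a GeoJSON Polygon geometry (as a JSON string) to a WKT string."""
    geom = json.loads(geojson_text)
    if geom.get("type") != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = []
    for ring in geom["coordinates"]:
        # GeoJSON positions are [lon, lat, ...]; WKT separates x and y with a space.
        coords = ", ".join(f"{pos[0]} {pos[1]}" for pos in ring)
        rings.append(f"({coords})")
    return f"POLYGON ({', '.join(rings)})"


sample = json.dumps({
    "type": "Polygon",
    "coordinates": [[[-80.0, 35.0], [-80.0, 45.0], [-70.0, 45.0], [-70.0, 35.0], [-80.0, 35.0]]],
})
print(geojson_polygon_to_wkt(sample))
# POLYGON ((-80.0 35.0, -80.0 45.0, -70.0 45.0, -70.0 35.0, -80.0 35.0))
```

The cell that follows does the analogous conversion at scale: each GeoJSON `geometry` object is serialised with `to_json` and passed to `st_geomfromgeojson`.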

In [0]:
from mosaic import st_geomfromgeojson

geoJsonDF = (
  spark.read.format("json")
  .load("dbfs:/FileStore/shared_uploads/stuart.lynn@databricks.com/NYC_Taxi_Zones.geojson")
  .withColumn("geometry", st_geomfromgeojson(to_json(col("geometry"))))
  .select("properties.*", "geometry")
  .drop("shape_area", "shape_leng")
)

geoJsonDF.show()

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File <command-2717610116254747>:4
      1 from mosaic import st_geomfromgeojson
      3 geoJsonDF = (
----> 4   spark.read.format("json")
      5   .load("dbfs:/FileStore/shared_uploads/stuart.lynn@databricks.com/NYC_Taxi_Zones.geojson")
      6   .withColumn("geometry", st_geomfromgeojson(to_json(col("geometry"))))
      7   .select("properties.*"

Mosaic provides a number of functions for extracting the properties of geometries. Here are some that are relevant to Polygon geometries:

In [0]:
from mosaic import st_area, st_length
(
  geoJsonDF
  .withColumn("calculatedArea", abs(st_area("geometry")))
  .withColumn("calculatedLength", st_length("geometry"))
  .select("geometry", "calculatedArea", "calculatedLength")
).show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254749>:3
      1 from mosaic import st_area, st_length
      2 (
----> 3   geoJsonDF
      4   .withColumn("calculatedArea", abs(st_area("geometry")))
      5   .withColumn("calculatedLength", st_length("geometry"))
      6   .select("geometry", "calculatedArea", "calculatedLength

In [0]:
geoJsonDF.count()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254750>:1
----> 1 geoJsonDF.count()

NameError: name 'geoJsonDF' is not defined

## Example point-in-poly with indexing

Mosaic has built-in support for H3, a popular spatial indexing library. It provides functions for generating the index of a point and the set of indices covering a polygon, which allows point-in-polygon joins to be expressed as ordinary SQL equality joins on the index.

In [0]:
from mosaic import grid_longlatascellid

trips_with_geom = (
  trips
  .withColumn("pickup_h3", grid_longlatascellid(lon="pickup_longitude", lat="pickup_latitude", resolution=lit(10)))
  .withColumn("dropoff_h3", grid_longlatascellid(lon="dropoff_longitude", lat="dropoff_latitude", resolution=lit(10)))
)

trips_with_geom.show()

+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+--------------------+--------------------+--------------------+------------------+------------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|pickup_longitude|pickup_latitude|rate_code_id|dropoff_longitude|dropoff_latitude|fare_amount|extra|mta_tax|tip_amount|tolls_amount|total_amount|         pickup_geom|        dropoff_geom|              bounds|         pickup_h3|        dropoff_h3|
+---------+-------------------+-------------------+---------------+-------------+----------------+---------------+------------+-----------------+----------------+-----------+-----+-------+----------+------------+------------+--------------------+--------------------+--------------------+------------------+------------------+
|      CMT|2009-10-
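
The idea behind the indexed join can be sketched in plain Python. In this toy version a square grid stands in for H3 and the zones are axis-aligned rectangles; both are simplifying assumptions, and every name below is hypothetical. Each point maps to exactly one cell, each zone expands to the set of cells covering it, and the spatial join reduces to an equality lookup on cell IDs.

```python
import math
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def point_to_cell(lon: float, lat: float, size: float) -> Cell:
    """Toy stand-in for a point index: the square cell containing the point."""
    return (math.floor(lon / size), math.floor(lat / size))


def rect_to_cells(rect: Rect, size: float) -> Set[Cell]:
    """Toy stand-in for a polygon fill: cells whose centre lies inside the rectangle."""
    xmin, ymin, xmax, ymax = rect
    cells = set()
    for ix in range(math.floor(xmin / size), math.ceil(xmax / size)):
        for iy in range(math.floor(ymin / size), math.ceil(ymax / size)):
            cx, cy = (ix + 0.5) * size, (iy + 0.5) * size
            if xmin <= cx <= xmax and ymin <= cy <= ymax:
                cells.add((ix, iy))
    return cells


def index_join(points: Dict[str, Tuple[float, float]],
               zones: Dict[str, Rect],
               size: float = 1.0) -> List[Tuple[str, str]]:
    """Join points to zones by matching cell IDs (assumes zones do not overlap)."""
    cell_to_zone = {
        cell: name for name, rect in zones.items() for cell in rect_to_cells(rect, size)
    }
    matches = []
    for point_id, (lon, lat) in points.items():
        cell = point_to_cell(lon, lat, size)
        if cell in cell_to_zone:  # equality match on the index, no geometry test
            matches.append((point_id, cell_to_zone[cell]))
    return matches


zones = {"A": (0.0, 0.0, 2.0, 2.0), "B": (2.0, 0.0, 4.0, 2.0)}
points = {"p1": (0.5, 1.5), "p2": (3.2, 0.1), "p3": (5.0, 5.0)}
print(index_join(points, zones))  # [('p1', 'A'), ('p2', 'B')]
```

Because the toy zones line up with the grid, the match is exact here. With irregular polygons, a cell-level match is only as precise as the cells are small.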

In [0]:
from mosaic import grid_polyfill

neighbourhoods = (
  geoJsonDF
  .repartition(sc.defaultParallelism)
  .select("*", explode(grid_polyfill("geometry", lit(10))).alias("h3"))
  .drop("geometry")
)

neighbourhoods.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254754>:4
      1 from mosaic import grid_polyfill
      3 neighbourhoods = (
----> 4   geoJsonDF
      5   .repartition(sc.defaultParallelism)
      6   .select("*", explode(grid_polyfill("geometry", lit(10))).alias("h3"))
      7   .drop("geometry")
      8 )
     10 neighbourhoods.show()

NameError: name 

In [0]:
joined_df = trips_with_geom.alias("t").join(neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.h3"), how="inner")
joined_df.count()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254755>:1
----> 1 joined_df = trips_with_geom.alias("t").join(neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.h3"), how="inner")
      2 joined_df.count()

NameError: name 'neighbourhoods' is not defined

## Mosaic spatial join optimizations

Mosaic provides easy access to the optimized spatial join technique described in the blog post [Efficient point-in-polygon joins via PySpark and BNG geospatial indexing](https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html).

In [0]:
from mosaic import grid_tessellateexplode

mosaic_neighbourhoods = (
  geoJsonDF
  .repartition(sc.defaultParallelism)
  .select("*", grid_tessellateexplode("geometry", lit(10)))
  .drop("geometry")
)

mosaic_neighbourhoods.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254758>:4
      1 from mosaic import grid_tessellateexplode
      3 mosaic_neighbourhoods = (
----> 4   geoJsonDF
      5   .repartition(sc.defaultParallelism)
      6   .select("*", grid_tessellateexplode("geometry", lit(10)))
      7   .drop("geometry")
      8 )
     10 mosaic_neighbourhoods.show()

NameError: name 'geoJsonDF' is not defined

Mosaic also includes a convenience function for displaying dataframes with geometry columns.

In [0]:
from mosaic import displayMosaic
displayMosaic(mosaic_neighbourhoods)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254760>:2
      1 from mosaic import displayMosaic
----> 2 displayMosaic(mosaic_neighbourhoods)

NameError: name 'mosaic_neighbourhoods' is not defined

This also extends to plotting maps inside the notebook with the kepler.gl visualisation library, via the `%%mosaic_kepler` notebook magic.

In [0]:
from mosaic import st_aswkt
(
  mosaic_neighbourhoods
  .select(st_aswkt(col("index.wkb")).alias("wkt"), col("index.index_id").alias("h3"))
).createOrReplaceTempView("kepler_df")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254762>:3
      1 from mosaic import st_aswkt
      2 (
----> 3   mosaic_neighbourhoods
      4   .select(st_aswkt(col("index.wkb")).alias("wkt"), col("index.index_id").alias("h3"))
      5 ).createOrReplaceTempView("kepler_df")

NameError: name 'mosaic_neighbourhoods' is not defined

In [0]:
%%mosaic_kepler
"kepler_df" "h3" "h3"

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mosaic/utils/kepler_magic.py:107, in MosaicKepler.get_spark_df(table_name)
    106 try:
--> 107     data = config.mosaic_spark.read.table(table_name)
    108 except:

File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
     47 try:
---> 48     res = func(*args, 

![mosaic kepler map example](../images/kepler-example.png)

Now the two datasets can be joined first on H3 index, with any false positives removed through a contains filter on a much simpler geometry.
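
The filter that follows such a join can be sketched in plain Python. The sketch assumes, in line with the `index.is_core` field used in this notebook, that a core chip is a cell lying wholly inside its polygon, while a boundary chip carries only the piece of the polygon that falls in that cell. A match on a core chip therefore needs no geometry test, and a match on a boundary chip is tested against the small chip geometry instead of the full polygon. Rectangular chips and all names below are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def rect_contains(rect: Rect, x: float, y: float) -> bool:
    xmin, ymin, xmax, ymax = rect
    return xmin <= x <= xmax and ymin <= y <= ymax


def chip_join(points: List[Dict], chips: List[Dict]) -> List[Tuple[str, str]]:
    """Join on cell ID, then keep a match if the chip is core or the chip geometry contains the point."""
    chips_by_cell = defaultdict(list)
    for chip in chips:
        chips_by_cell[chip["cell"]].append(chip)

    matches = []
    for p in points:
        for chip in chips_by_cell.get(p["cell"], []):
            # Core chips short-circuit the (comparatively costly) containment test.
            if chip["is_core"] or rect_contains(chip["geom"], *p["xy"]):
                matches.append((p["id"], chip["zone"]))
    return matches


# Toy zone Z spans x in [0, 1.5] and y in [0, 1] on a unit grid:
# cell (0, 0) is fully covered (core); cell (1, 0) is only half covered (boundary).
chips = [
    {"zone": "Z", "cell": (0, 0), "is_core": True, "geom": (0.0, 0.0, 1.0, 1.0)},
    {"zone": "Z", "cell": (1, 0), "is_core": False, "geom": (1.0, 0.0, 1.5, 1.0)},
]
points = [
    {"id": "p1", "cell": (0, 0), "xy": (0.2, 0.5)},  # core cell: accepted without a test
    {"id": "p2", "cell": (1, 0), "xy": (1.2, 0.5)},  # boundary cell, inside the chip
    {"id": "p3", "cell": (1, 0), "xy": (1.8, 0.5)},  # boundary cell, outside: false positive removed
]
print(chip_join(points, chips))  # [('p1', 'Z'), ('p2', 'Z')]
```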

In [0]:
mosaic_joined_df = (
  trips_with_geom.alias("t")
  .join(mosaic_neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.index.index_id"), how="inner")
  .where(
    col("index.is_core") |
    st_contains("index.wkb", "pickup_geom")
  )
)

mosaic_joined_df.show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254766>:3
      1 mosaic_joined_df = (
      2   trips_with_geom.alias("t")
----> 3   .join(mosaic_neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.index.index_id"), how="inner")
      4   .where(
      5     col("index.is_core") |
      6     st_contains("index.wkb", "

## MosaicFrame abstraction for simple indexing and joins

By wrapping our Spark DataFrames with `MosaicFrame`, we can simplify the join process. For example:

In [0]:
from mosaic import MosaicFrame

In [0]:
trips_mdf = MosaicFrame(trips, "pickup_geom")
neighbourhoods_mdf = MosaicFrame(geoJsonDF, "geometry")

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254770>:2
      1 trips_mdf = MosaicFrame(trips, "pickup_geom")
----> 2 neighbourhoods_mdf = MosaicFrame(geoJsonDF, "geometry")

NameError: name 'geoJsonDF' is not defined

In [0]:
(
  trips_mdf
  .set_index_resolution(10)
  .apply_index()
  .join(
    neighbourhoods_mdf
    .set_index_resolution(10)
    .apply_index()
  )
).show()

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
File <command-2717610116254771>:6
      1 (
      2   trips_mdf
      3   .set_index_resolution(10)
      4   .apply_index()
      5   .join(
----> 6     neighbourhoods_mdf
      7     .set_index_resolution(10)
      8     .apply_index()
      9   )
     10 ).show()

NameError: name 'neighbourhoods_mdf' is not defined