# Quickstart notebook
The example code here shows how to get up and running with Mosaic using the Python API.

In [0]:
from pyspark.sql.functions import *

## Enable Mosaic in the notebook
To get started, attach the Mosaic Python library to your cluster and call the `enable_mosaic` function.

In [0]:
from mosaic import enable_mosaic
enable_mosaic(spark, dbutils)

Mosaic has extra configuration options. See the documentation, or the `enable_mosaic` docstring, for details.

In [0]:
help(enable_mosaic)

## Geometry constructors and the Mosaic internal geometry format

Mosaic allows users to create new Point geometries from a pair of Spark DoubleType columns.

In [0]:
from mosaic import st_point

lons = [-80., -80., -70., -70., -80.]
lats = [ 35.,  45.,  45.,  35.,  35.]

bounds_df = (
  spark
  .createDataFrame({"lon": lon, "lat": lat} for lon, lat in zip(lons, lats))
  .coalesce(1)
  .withColumn("point_geom", st_point("lon", "lat"))
)
bounds_df.show()

Mosaic Point geometries can be aggregated into LineString and Polygon geometries using the respective constructors.

In [0]:
from mosaic import st_makeline

bounds_df = (
  bounds_df
  .groupBy()
  .agg(collect_list("point_geom").alias("bounding_coords"))
  .select(st_makeline("bounding_coords").alias("bounding_ring"))
)
bounds_df.show()

In [0]:
from mosaic import st_makepolygon

bounds_df = bounds_df.select(st_makepolygon("bounding_ring").alias("bounds"))
bounds_df.show()

## Geometry clipping without an index

Mosaic implements spatial predicate functions (contains, intersects, overlaps, etc.). Here `st_contains` is used to clip points to a polygon geometry: only trips whose pickup and dropoff points both fall inside the bounding polygon are kept.

In [0]:
tripsTable = spark.table("delta.`/databricks-datasets/nyctaxi/tables/nyctaxi_yellow`")

In [0]:
from mosaic import st_contains
trips = (
  tripsTable
  .limit(5_000_000)
  .repartition(sc.defaultParallelism * 20)
  .drop("vendorId", "rateCodeId", "store_and_fwd_flag", "payment_type")
  .withColumn("pickup_geom", st_point("pickup_longitude", "pickup_latitude"))
  .withColumn("dropoff_geom", st_point("dropoff_longitude", "dropoff_latitude"))
  .crossJoin(bounds_df)
  .where(st_contains("bounds", "pickup_geom"))
  .where(st_contains("bounds", "dropoff_geom"))
  .cache()
)

In [0]:
trips.show()
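
To make the contains predicate concrete, here is a minimal pure-Python sketch of a point-in-polygon test using ray casting. It is an illustration only and not Mosaic's implementation: `point_in_ring` is a hypothetical helper, it does not treat points exactly on the boundary specially, and the ring is the same bounding box built in the constructor cells above.

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: True if (lon, lat) falls inside the closed ring."""
    inside = False
    # Walk consecutive vertex pairs; the ring repeats its first vertex at the end.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast towards increasing longitude.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


# Same coordinates as the lons/lats lists used to build the bounds polygon.
bounds_ring = [(-80.0, 35.0), (-80.0, 45.0), (-70.0, 45.0), (-70.0, 35.0), (-80.0, 35.0)]

nyc_inside = point_in_ring(-73.98, 40.75, bounds_ring)   # a Manhattan pickup
la_inside = point_in_ring(-118.24, 34.05, bounds_ring)   # far outside the box
```

Every candidate point is tested against every edge of the polygon, which is why the cross join above becomes expensive for complex polygons and large point sets.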

## Read from GeoJSON and compute basic geometry attributes

You've seen how Mosaic can create geometries from Spark native data types, but it also provides functions to translate Well-Known Text (WKT), Well-Known Binary (WKB) and GeoJSON representations into Mosaic geometries.

In [0]:
from mosaic import st_geomfromgeojson

geoJsonDF = (
  spark.read.format("json")
  .load("dbfs:/FileStore/shared_uploads/stuart.lynn@databricks.com/NYC_Taxi_Zones.geojson")
  .withColumn("geometry", st_geomfromgeojson(to_json(col("geometry"))))
  .select("properties.*", "geometry")
  .drop("shape_area", "shape_leng")
)

geoJsonDF.show()
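
WKT and GeoJSON are different encodings of the same coordinates. The sketch below converts a GeoJSON Polygon string to WKT in plain Python to show that correspondence. `polygon_geojson_to_wkt` is a hypothetical helper written for this example, it handles only the Polygon type, and it is not how Mosaic parses geometries.

```python
import json


def polygon_geojson_to_wkt(geojson_text):
    """Convert a GeoJSON Polygon string to WKT (Polygon type only)."""
    geom = json.loads(geojson_text)
    if geom["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = []
    # GeoJSON stores each ring as a list of [x, y] pairs;
    # WKT writes the same pairs as "x y", separated by commas.
    for ring in geom["coordinates"]:
        rings.append("(" + ", ".join(f"{x} {y}" for x, y in ring) + ")")
    return "POLYGON (" + ", ".join(rings) + ")"


geojson_text = json.dumps({
    "type": "Polygon",
    "coordinates": [[[-80, 35], [-80, 45], [-70, 45], [-70, 35], [-80, 35]]],
})
wkt = polygon_geojson_to_wkt(geojson_text)
```

In the cell above, `to_json(col("geometry"))` plays the role of `geojson_text`: it turns the parsed struct back into a GeoJSON string for `st_geomfromgeojson`.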

Mosaic provides a number of functions for extracting the properties of geometries. Here are some that are relevant to Polygon geometries:

In [0]:
from mosaic import st_area, st_length
(
  geoJsonDF
  .withColumn("calculatedArea", abs(st_area("geometry")))
  .withColumn("calculatedLength", st_length("geometry"))
  .select("geometry", "calculatedArea", "calculatedLength")
).show()

In [0]:
geoJsonDF.count()
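
As a rough illustration of what area and length mean for a polygon ring, the sketch below computes both in plain Python with the shoelace formula. It assumes planar coordinates, so the results are in the units of the input coordinates (square degrees and degrees for longitude/latitude data). The helpers `ring_signed_area` and `ring_length` are hypothetical and say nothing about how Mosaic computes these values internally.

```python
import math


def ring_signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings, negative for clockwise."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        total += x1 * y2 - x2 * y1
    return total / 2.0


def ring_length(ring):
    """Perimeter: sum of straight-line distances between consecutive vertices."""
    return sum(math.dist(p, q) for p, q in zip(ring, ring[1:]))


# The bounding box from earlier, listed clockwise.
ring = [(-80.0, 35.0), (-80.0, 45.0), (-70.0, 45.0), (-70.0, 35.0), (-80.0, 35.0)]

signed_area = ring_signed_area(ring)
area = abs(signed_area)
length = ring_length(ring)
```

The sign of the shoelace result depends on vertex order, which is one situation where wrapping an area in `abs` is useful.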

## Example point-in-poly with indexing

Mosaic has built-in support for the popular spatial indexing library H3. It provides functions for generating the index of the cell containing a point and the set of cell indices covering a polygon, so a point-in-polygon join can be rewritten as an ordinary SQL equality join on the cell index.

In [0]:
from mosaic import grid_longlatascellid

trips_with_geom = (
  trips
  .withColumn("pickup_h3", grid_longlatascellid(lon="pickup_longitude", lat="pickup_latitude", resolution=lit(10)))
  .withColumn("dropoff_h3", grid_longlatascellid(lon="dropoff_longitude", lat="dropoff_latitude", resolution=lit(10)))
)

trips_with_geom.show()

In [0]:
from mosaic import grid_polyfill

neighbourhoods = (
  geoJsonDF
  .repartition(sc.defaultParallelism)
  .select("*", explode(grid_polyfill("geometry", lit(10))).alias("h3"))
  .drop("geometry")
)

neighbourhoods.show()

In [0]:
joined_df = trips_with_geom.alias("t").join(neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.h3"), how="inner")
joined_df.count()
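
The mechanics of the indexed join can be sketched in plain Python. This toy example uses a square grid in place of H3 and axis-aligned boxes in place of real polygons, and it assumes a polyfill that keeps a cell when its centre lies inside the shape. All names here (`CELL_SIZE`, `cell_id`, `polyfill_box`) are hypothetical and stand in for the Mosaic functions only conceptually.

```python
import math
from collections import defaultdict

CELL_SIZE = 1.0  # toy square grid, standing in for an H3 resolution


def cell_id(lon, lat):
    """Index of the grid cell containing a point."""
    return (math.floor(lon / CELL_SIZE), math.floor(lat / CELL_SIZE))


def polyfill_box(min_lon, min_lat, max_lon, max_lat):
    """Cells whose centre falls inside an axis-aligned box."""
    cells = set()
    for i in range(math.floor(min_lon / CELL_SIZE), math.ceil(max_lon / CELL_SIZE)):
        for j in range(math.floor(min_lat / CELL_SIZE), math.ceil(max_lat / CELL_SIZE)):
            centre_lon = (i + 0.5) * CELL_SIZE
            centre_lat = (j + 0.5) * CELL_SIZE
            if min_lon <= centre_lon <= max_lon and min_lat <= centre_lat <= max_lat:
                cells.add((i, j))
    return cells


zones = {"A": (0.0, 0.0, 2.0, 2.0), "B": (2.0, 0.0, 4.0, 2.0)}
points = {"p1": (0.5, 0.5), "p2": (3.2, 1.9), "p3": (5.0, 5.0)}

# "Explode" each zone into one row per covering cell, keyed by cell id.
cell_to_zones = defaultdict(list)
for zone, box in zones.items():
    for cell in polyfill_box(*box):
        cell_to_zones[cell].append(zone)

# The spatial join is now a dictionary lookup on the cell id (an equality join).
matches = sorted(
    (name, zone)
    for name, (lon, lat) in points.items()
    for zone in cell_to_zones.get(cell_id(lon, lat), [])
)
```

No geometry test runs at join time, which is what makes this fast. The trade-off is accuracy near polygon edges, where a cell can straddle the boundary.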

## Mosaic spatial join optimizations

Mosaic provides easy access to the optimized spatial join technique described in [this blog post](https://databricks.com/blog/2021/10/11/efficient-point-in-polygon-joins-via-pyspark-and-bng-geospatial-indexing.html).

In [0]:
from mosaic import grid_tessellateexplode

mosaic_neighbourhoods = (
  geoJsonDF
  .repartition(sc.defaultParallelism)
  .select("*", grid_tessellateexplode("geometry", lit(10)))
  .drop("geometry")
)

mosaic_neighbourhoods.show()

Mosaic also includes a convenience function for displaying dataframes with geometry columns.

In [0]:
from mosaic import displayMosaic
displayMosaic(mosaic_neighbourhoods)

This also extends to plotting maps inside the notebook with the kepler.gl visualisation library, via the `%%mosaic_kepler` notebook magic.

In [0]:
from mosaic import st_aswkt
(
  mosaic_neighbourhoods
  .select(st_aswkt(col("index.wkb")).alias("wkt"), col("index.index_id").alias("h3"))
).createOrReplaceTempView("kepler_df")

In [0]:
%%mosaic_kepler
"kepler_df" "h3" "h3"

![mosaic kepler map example](../images/kepler-example.png)

Now the two datasets can be joined on the H3 index first, with any false positives then removed by a contains filter on a much simpler geometry.

In [0]:
mosaic_joined_df = (
  trips_with_geom.alias("t")
  .join(mosaic_neighbourhoods.alias("n"), on=expr("t.pickup_h3 = n.index.index_id"), how="inner")
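  .where(
    # Core cells lie entirely inside the polygon, so every point in them matches;
    # only points in boundary cells need the contains check against the clipped geometry.
    col("index.is_core") |
    st_contains("index.wkb", "pickup_geom")
  )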
)

mosaic_joined_df.show()
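
The idea behind this filter is that a point in a core cell is already known to be inside the polygon, so only points in boundary cells need a contains check, and that check runs against the small clipped piece of the polygon rather than the whole shape. A minimal pure-Python sketch follows, using boxes as stand-ins for the clipped geometries. The data layout and the `box_contains` helper are hypothetical and chosen only for illustration.

```python
def box_contains(box, lon, lat):
    """Contains test for an axis-aligned box (stand-in for a clipped geometry)."""
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


# One row per (zone, cell): is_core is True when the cell lies fully inside the zone;
# otherwise the last field holds the part of the zone that falls inside that cell.
chips = [
    ("A", (0, 0), True, (0.0, 0.0, 1.0, 1.0)),
    ("A", (1, 0), False, (1.0, 0.0, 1.5, 1.0)),
]

# Points carry their precomputed cell id alongside their coordinates.
points = {
    "p1": ((0, 0), 0.2, 0.7),  # core cell: accepted without a geometry test
    "p2": ((1, 0), 1.2, 0.5),  # boundary cell, inside the clipped piece
    "p3": ((1, 0), 1.8, 0.5),  # boundary cell, outside: a false positive
}

matches = []
for name, (cell, lon, lat) in points.items():
    for zone, chip_cell, is_core, chip in chips:
        if cell != chip_cell:
            continue  # equality join on the cell id
        if is_core or box_contains(chip, lon, lat):
            matches.append((name, zone))
```

Because `is_core` is evaluated first, the geometry test is skipped for the bulk of points that land in interior cells.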

## MosaicFrame abstraction for simple indexing and joins

By wrapping our Spark DataFrames with `MosaicFrame`, we can simplify the join process. For example:

In [0]:
from mosaic import MosaicFrame

In [0]:
trips_mdf = MosaicFrame(trips, "pickup_geom")
neighbourhoods_mdf = MosaicFrame(geoJsonDF, "geometry")

In [0]:
(
  trips_mdf
  .set_index_resolution(10)
  .apply_index()
  .join(
    neighbourhoods_mdf
    .set_index_resolution(10)
    .apply_index()
  )
).show()