## Jupyter Notebook using ESRI's built-in [Spark](https://spark.apache.org/)

This notebook is borrowed from Mansour's GitHub [ESRI Spark notebook](https://github.com/mraad/spark-esri) to demonstrate spatial binning of AIS data around the port of Miami using Apache Spark. Mansour's notebook creates the spatial bins inside ArcGIS Pro; here I try running the notebook outside ArcGIS Pro.

The AIS broadcast data is in a file geodatabase that can be downloaded from [here](https://marinecadastre.gov/ais).

I downloaded one month of data for testing. It quickly ran out of memory, because a single month holds over 10 million points. I clipped the data to one million points close to the port, and with that subset the Spark SQL runs well.

Create a new conda environment and activate it to use this notebook, as follows:

- Start a `Python Command Prompt` from `Start > ArcGIS`.

- Execute the following:

```
conda create --yes --name spark_esri --clone arcgispro-py3
activate spark_esri
pip install pyarrow
```


### Import the modules.

In [1]:
import os
import arcpy
from python.sparkInit import spark_start, spark_stop
fc = r'C:\Projects\sparkesri\sparkesri.gdb\miami_AISport'

### Start a Spark instance.

Note the `config` argument to [configure the Spark instance](https://spark.apache.org/docs/latest/configuration.html).

In [2]:
config = {"spark.driver.memory":"2G"}
spark = spark_start(config=config)
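
Since the full month of data ran out of memory, it can help to keep the Spark properties in one place so they are easy to adjust. The following is a minimal sketch, not part of the original notebook: `build_spark_config` is a hypothetical helper, and the values are placeholders to tune for your machine. Both `spark.driver.memory` and `spark.driver.maxResultSize` are documented Spark properties.

```python
def build_spark_config(driver_memory="2G", max_result_size="1G"):
    """Return a dict of Spark properties.

    Hypothetical helper for illustration; it is not part of sparkInit.
    """
    return {
        # Heap size of the driver process, which holds the cursor data here.
        "spark.driver.memory": driver_memory,
        # Upper limit on the total size of results returned by collect().
        "spark.driver.maxResultSize": max_result_size,
    }


# Example: give the driver more memory than the 2G used above.
config = build_spark_config(driver_memory="4G")
print(config)
```

The resulting dictionary can be passed the same way as in the cell above, `spark_start(config=config)`.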

### Read the selected Broadcast feature shapes in WebMercator SR.

The ship AIS points are stored in the point feature class `fc`, saved in the downloaded `sparkesri.gdb`.

Note that the `SearchCursor` honors the user-selected features and any active definition query in the layer properties. For example, set the definition query to `Status = 0: Under way using engine` to get the locations of all moving ships, which yields a "heat map" of the port movement.

In [3]:
#fc= r'C:\Projects\sparkesri\Zone17_2013_09.gdb\Zone17_2013_09_Broadcast'

sp_ref = arcpy.SpatialReference(3857)
data = arcpy.da.SearchCursor(fc,["SHAPE@X","SHAPE@Y"],spatial_reference=sp_ref)
#data = arcpy.da.SearchCursor("Broadcast",["SHAPE@X","SHAPE@Y"],spatial_reference=sp_ref)

### Create a Spark data frame of the read data, and create a view named 'v0'.

In [4]:
spark\
    .createDataFrame(data,"x double,y double")\
    .createOrReplaceTempView("v0")

### Aggregate the data into 200 x 200 meter bins.

The aggregation is performed by Spark as a SQL statement in a parallel, shared-nothing fashion, and the resulting bins are collected back into the `rows` list.

This is a nested SQL expression. The inner expression maps the input `x` and `y` to `q` and `r` cell locations for a user-defined bin size, and the outer expression aggregates by counting the points in each `q` and `r` pair, capping the count at 1000 with `least`. Finally, `q` and `r` are mapped back to `x` and `y` to enable placement on a map.

In [5]:
cell0 = 200.0 # meters
cell1 = cell0 * 0.5

rows = spark\
    .sql(f"""
select q*{cell0}+{cell1} x,r*{cell0}+{cell1} y,least(count(1),1000) as pop
from
(select cast(x/{cell0} as long) q,cast(y/{cell0} as long) r from v0)
group by q,r
""")\
    .collect()
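
To make the `q`/`r` mapping concrete, here is a minimal pure-Python sketch of the same binning logic on a handful of made-up points. It is only an illustration of the arithmetic, not how Spark executes the query; `bin_points` and the sample coordinates are hypothetical. It uses `int()`, which truncates toward zero, and I believe this matches the `cast(... as long)` in the SQL.

```python
from collections import Counter


def bin_points(points, cell=200.0, cap=1000):
    """Bin (x, y) pairs into square cells and count the points per cell.

    Hypothetical helper that mirrors the SQL above in plain Python.
    """
    half = cell * 0.5
    # Inner query: map each point to its (q, r) cell index.
    counts = Counter((int(x / cell), int(y / cell)) for x, y in points)
    # Outer query: map each cell back to x, y and cap the count.
    return [
        (q * cell + half, r * cell + half, min(n, cap))
        for (q, r), n in counts.items()
    ]


# Made-up sample points: the first two share a cell, the third does not.
sample = [(10.0, 10.0), (150.0, 199.0), (250.0, 10.0)]
bins = bin_points(sample)
print(sorted(bins))
```

With a 200 meter cell, the first two points land in cell `(0, 0)` and the third in cell `(1, 0)`, so the sketch returns two bins placed at `(100.0, 100.0)` and `(300.0, 100.0)`.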

### Create an in-memory point feature class of the collected bins.

The variable `rows` is a list of Spark `Row` objects, each holding `x`, `y` and `pop`, of the form `[[x0,y0,pop0],[x1,y1,pop1],...,[xN,yN,popN]]`.

In [6]:
ws = "memory"
nm = "Bins"

fc = os.path.join(ws,nm)

# Remove a previous result, if any, so the cell can be re-run.
if arcpy.Exists(fc):
    arcpy.management.Delete(fc)

sp_ref = arcpy.SpatialReference(3857)
arcpy.management.CreateFeatureclass(ws,nm,"POINT",spatial_reference=sp_ref)
arcpy.management.AddField(fc, "POP", "LONG")

with arcpy.da.InsertCursor(fc, ["SHAPE@X","SHAPE@Y", "POP"]) as cursor:
    for row in rows:
        cursor.insertRow(row)

### Apply a graduated colors symbology to highlight the bins.

In [7]:
print(os.path.join(ws, nm))

memory\Bins


In [8]:
#_ = arcpy.ApplySymbologyFromLayer_management(fc, f"{nm}.lyrx")

In [8]:
import pandas as pd

from arcgis.gis import GIS

In [11]:
gis = GIS()
m_map = gis.map(location = 'Miami, FL', zoomlevel = 12)
#m_map

In [14]:
# load your feature class into a spatially enabled dataframe (sedf)
m_sedf = pd.DataFrame.spatial.from_featureclass(fc)
# plot the sedf on your map
m_sedf.spatial.plot(map_widget = m_map,
        renderer_type='c',  # for class breaks renderer
        method='esriClassifyNaturalBreaks',  # classification algorithm
        class_count=5,  # choose the number of classes
        col='POP',  # numeric column to classify
        cmap='prism',  # color map to pick colors from for each class
        alpha=0.7  # specify opacity
       )
# show the map

m_map

MapView(jupyter_target='notebook', layout=Layout(height='400px', width='100%'), ready=True)

### Stop the spark instance.

In [15]:
spark_stop()