# Mapping the world's airports with Spark and GeoMesa

We will import two datasets, country borders and airport locations, and determine which country each airport lies in. The airport dataset already names the country of each airport, but for this exercise we will pretend we don't have that field and map each airport to its country using only geographical information.

## Download and process country borders

We download country border shapes from [Natural Earth Data](http://www.naturalearthdata.com), a public domain dataset.

In [4]:
%sh curl -fLO https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip

In [5]:
%sh unzip -o ne_110m_admin_0_countries.zip

The data is in Shapefile format. GeoMesa's Shapefile converter could probably handle it, but for a pure Python solution we use the *PyShp* module to parse the Shapefile.

In [7]:
dbutils.library.installPyPI("PyShp", version="2.1.0")

In [8]:
import shapefile
myshp = open("ne_110m_admin_0_countries.shp", "rb")
mydbf = open("ne_110m_admin_0_countries.dbf", "rb")
country_shapes = shapefile.Reader(shp=myshp, dbf=mydbf)

We use the *pygeoif* Python module to convert the PyShp records (via the GeoJSON interface) into WKT text format that GeoMesa can parse.

In [10]:
dbutils.library.installPyPI("pygeoif", version="0.7")

In [11]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pygeoif

# Convert a PyShp field description (e.g. "C") into a Spark Type (e.g. StringType())
def getSparkType(f):
  if f[1] == "C":
    return StringType()
  elif f[1] == "N":
    if f[3] == 0:
      return LongType()
    return DoubleType()
  else:
    # Only character ("C") and numeric ("N") fields are handled here
    raise NotImplementedError("Unsupported PyShp field: %r" % (f,))
    
# Convert a PyShp shape into a WKT string
def shapeToWkt(shape):
  shape1 = pygeoif.geometry.as_shape(shape)
  # If the object cannot emit WKT itself (e.g. a feature wrapper), unwrap its geometry
  if not hasattr(shape1, "to_wkt"):
    shape1 = shape1.geometry
  return shape1.to_wkt()
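
To illustrate what this conversion produces, here is a minimal pure-Python sketch that turns a GeoJSON-style geometry dictionary (the kind of structure exposed through the GeoJSON interface) into WKT. The helper `geojson_to_wkt` is hypothetical, assumes 2D coordinates, and only handles `Point`, `Polygon` and `MultiPolygon`; it is not how pygeoif implements the conversion.

```python
def _ring_to_wkt(ring):
    # A ring is a sequence of (x, y) pairs; WKT separates x and y with a space.
    return "(" + ", ".join(f"{x} {y}" for x, y in ring) + ")"

def _polygon_to_wkt(rings):
    # A polygon is an exterior ring followed by optional holes.
    return "(" + ", ".join(_ring_to_wkt(r) for r in rings) + ")"

def geojson_to_wkt(geom):
    # Hypothetical helper: supports only a few 2D geometry types.
    kind, coords = geom["type"], geom["coordinates"]
    if kind == "Point":
        x, y = coords
        return f"POINT ({x} {y})"
    if kind == "Polygon":
        return "POLYGON " + _polygon_to_wkt(coords)
    if kind == "MultiPolygon":
        return "MULTIPOLYGON (" + ", ".join(_polygon_to_wkt(p) for p in coords) + ")"
    raise NotImplementedError(kind)

# Made-up unit square, closed by repeating the first vertex.
square = {"type": "Polygon", "coordinates": [[(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]]}
wkt = geojson_to_wkt(square)
print(wkt)  # POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))
```

The resulting string is plain text, which is why it can be stored in a Spark `StringType` column and parsed later on the Scala side.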

We create a Spark temporary view in order to further process the data with Scala.

In [13]:
recordSchema = StructType([StructField(f[0], getSparkType(f)) for f in country_shapes.fields[1:]])
schema = StructType([StructField("shape", StringType()), StructField("record", recordSchema)])

(spark
  .createDataFrame([[shapeToWkt(s.shape), s.record] for s in country_shapes.shapeRecords()], schema)
  .createOrReplaceTempView("countries")
)

In [14]:
table("countries").count()

## Download and process airport information

We download airport geolocation data from [OpenFlights](https://github.com/jpatokal/openflights).

In [16]:
%sh curl -fo /dbfs/airports.dat https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat

In [17]:
%sh head /dbfs/airports.dat

In [18]:
airports_schema = """Airport_ID string, Name string, City string, Country string,
  IATA string, ICAO string, Latitude double, Longitude double, Altitude double,
  Timezone string, DST string, Tz string, Type string, Source string"""
airports = spark.read.load("dbfs:/airports.dat",
                     format="csv", schema=airports_schema).cache()
display(airports.limit(3))
airports.createOrReplaceTempView("airports")

Airport_ID,Name,City,Country,IATA,ICAO,Latitude,Longitude,Altitude,Timezone,DST,Tz,Type,Source
1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,-6.081689834590001,145.391998291,5282.0,10,U,Pacific/Port_Moresby,airport,OurAirports
2,Madang Airport,Madang,Papua New Guinea,MAG,AYMD,-5.20707988739,145.789001465,20.0,10,U,Pacific/Port_Moresby,airport,OurAirports
3,Mount Hagen Kagamuga Airport,Mount Hagen,Papua New Guinea,HGU,AYMH,-5.826789855957031,144.29600524902344,5388.0,10,U,Pacific/Port_Moresby,airport,OurAirports
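
As a small aside on column order, the sketch below parses one of the rows shown above with Python's standard `csv` module and builds a coordinate pair from it. The variable names are illustrative only. It assumes, as the explicit schema in the cell above suggests, that the file carries no header line.

```python
import csv
import io

COLUMNS = ["Airport_ID", "Name", "City", "Country", "IATA", "ICAO", "Latitude",
           "Longitude", "Altitude", "Timezone", "DST", "Tz", "Type", "Source"]

# One row copied from the displayed output above.
sample = ("1,Goroka Airport,Goroka,Papua New Guinea,GKA,AYGA,"
          "-6.081689834590001,145.391998291,5282.0,10,U,"
          "Pacific/Port_Moresby,airport,OurAirports\n")

# Pair each value with its column name.
row = dict(zip(COLUMNS, next(csv.reader(io.StringIO(sample)))))

# Geometry code works with (x, y) = (longitude, latitude),
# which is the reverse of the order in which the file lists them.
point = (float(row["Longitude"]), float(row["Latitude"]))
print(row["Name"], point)
```

Keeping the longitude first matters later on, when points are built from the `Longitude` and `Latitude` columns.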


In [19]:
table("airports").count()

## Approach 1: Shapely UDF

In this approach, we use the geometric operations in the Shapely library to map airports to countries. This results in a Cartesian join: the geometric `contains` function must be called for every (airport, country) pair, which amounts to well over a million calls.
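
To make the cost of such a join concrete, the following sketch tests every pair of a tiny made-up dataset. It uses the textbook ray-casting (even-odd) test as a stand-in for Shapely's `contains`; this is not Shapely's implementation, and the name `point_in_ring`, the rectangles and the points are all hypothetical.

```python
def point_in_ring(lon, lat, ring):
    """Even-odd ray-casting test for a point against a single closed ring."""
    inside = False
    # Walk consecutive vertex pairs (x1, y1) -> (x2, y2) of the ring.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Hypothetical data: two rectangular "countries" and three "airports".
countries = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10), (10, 0)],
}
airports = {"a1": (5, 5), "a2": (15, 2), "a3": (30, 30)}

# The cross join: every (country, airport) pair is tested once.
tests = 0
matches = {}
for code, ring in countries.items():
    for name, (lon, lat) in airports.items():
        tests += 1
        if point_in_ring(lon, lat, ring):
            matches[name] = code

print(tests, matches)  # 6 {'a1': 'A', 'a2': 'B'}
```

The number of tests is the product of the two dataset sizes, regardless of how many pairs actually match.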

In [21]:
dbutils.library.installPyPI("Shapely", version="1.6.4.post2")

In [22]:
import json
import shapely
import shapely.geometry
import shapely.wkt

from pyspark.sql.functions import udf

@udf("boolean")
def point_inside_polygon_udf(polygon, point):
  sh_polygon = shapely.wkt.loads(polygon)
  sh_point = shapely.geometry.Point(point)
  return sh_polygon.contains(sh_point)
        
spark.conf.set("spark.sql.crossJoin.enabled", True)
display(table("countries")
  .join(table("airports"), point_inside_polygon_udf("countries.shape", array("airports.Longitude", "airports.Latitude")))
  .groupBy("Country")
  .count()
  .orderBy(desc("count"))
  .limit(5)
       )

Country,count
United States,1441
Canada,391
Australia,287
Brazil,258
Russia,255


In [23]:
spark.conf.set("spark.sql.crossJoin.enabled", False)

The command takes a long time to run. This is not the best approach for joining large datasets, but it can suit applications such as geofencing (checking whether a location falls within an assigned shape), where each location is matched against a single polygon only.

## Approach 2: GeoMesa processing

We use the GeoMesa Spark SQL functions (available in Scala) to convert the WKT shapes into GeoTools Geometry objects. We then compute the country area. Note that `st_area` computes the area in *squared degrees* directly from geographical coordinates. Because a degree of longitude covers less ground toward the poles, this overstates the area of high-latitude countries, so comparisons between countries are only approximate.
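
To get a feel for how far squared degrees can drift from ground area, the sketch below compares a 1° × 1° cell at several latitudes. It assumes a spherical Earth with a mean radius of 6371 km, and the function names are hypothetical; it is an illustration of the units, not a reimplementation of `st_area`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; treating the Earth as a sphere is an assumption

def cell_area_sq_deg(width_deg, height_deg):
    # Planar area in squared degrees: latitude plays no role.
    return width_deg * height_deg

def cell_area_km2(lat_south, width_deg, height_deg):
    # Area of a latitude/longitude cell on a sphere:
    # R^2 * delta_lon * (sin(lat_north) - sin(lat_south))
    lat1 = math.radians(lat_south)
    lat2 = math.radians(lat_south + height_deg)
    return EARTH_RADIUS_KM ** 2 * math.radians(width_deg) * (math.sin(lat2) - math.sin(lat1))

for lat in (0, 30, 60):
    print(lat, cell_area_sq_deg(1, 1), round(cell_area_km2(lat, 1, 1)))
```

Every cell measures exactly one squared degree, yet the cell starting at 60° latitude covers roughly half the ground area of the one at the equator.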

In [26]:
%scala
import org.apache.spark.sql.SQLTypes
import org.apache.spark.sql.functions._
import org.locationtech.geomesa.spark.jts._

// Register custom Geometry types
SQLTypes.init(sqlContext)

val countries =
table("countries")
  .withColumn("shape", st_geomFromWKT('shape).as("shape"))
  .select(
    $"record.ADM0_A3_IS".as("country_code"),
    $"record.Name".as("country"),
    $"shape",
    st_area($"shape").as("area")
  )
  .cache
display(countries.limit(1))

country_code,country,shape,area
FJI,Fiji,"MULTIPOLYGON (((180 -16.067132663642447, 180 -16.555216566639196, 179.36414266196414 -16.801354076946883, 178.72505936299711 -17.01204167436804, 178.59683859511713 -16.639150000000004, 179.0966093629971 -16.433984277547403, 179.4135093629971 -16.379054277547404, 180 -16.067132663642447)), ((178.12557 -17.50481, 178.3736 -17.33992, 178.71806 -17.62846, 178.55271 -18.15059, 177.93266000000003 -18.28799, 177.38146 -18.16432, 177.28504 -17.72465, 177.67087 -17.381140000000002, 178.12557 -17.50481)), ((-179.79332010904864 -16.020882256741224, -179.9173693847653 -16.501783135649397, -180 -16.555216566639196, -180 -16.067132663642447, -179.79332010904864 -16.020882256741224)))",1.639510995900778


We use GeoMesa Spark SQL [Spatial joins](https://www.geomesa.org/documentation/tutorials/dwithin-join.html) in order to map airports to countries.

In [28]:
%scala

val airportCountries = countries.as("countries")
  .join(table("airports"), st_contains($"countries.shape", st_point($"airports.Longitude", $"airports.Latitude")))
  .cache

val airportsSummary = airportCountries.groupBy($"countries.country_code")
  .agg(
    first($"countries.country").as("country"),
    first($"countries.area").as("area"),
    count($"airports.Name").as("airports")
  )
  .orderBy(desc("airports"))
  .cache

display(airportsSummary)

country_code,country,area,airports
USA,United States of America,1122.2819207780806,1451
CAN,Canada,1712.9952276493766,391
AUS,Australia,695.5455009461047,287
RUS,Russia,2935.205205440517,261
BRA,Brazil,710.1852431533747,260
CHN,China,954.6353412364664,240
DEU,Germany,45.92359430736882,232
FRA,France,72.61566570396081,212
IND,India,277.92471299206454,140
GBR,United Kingdom,34.20295398919941,131


The United States appears to be the country with the most airports. In Databricks, switch to a World Map visualization to view the data graphically.

Let's normalize the data by country area (and take the square root of that value to generate a more interesting color scale when plotting the data as a world map).
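
As a quick sanity check of this normalization, the sketch below applies it in plain Python to a few rows copied from the summary table above. The names `summary` and `density_sqrt` are hypothetical.

```python
import math

# (country_code, area in squared degrees, airports) copied from the summary table above
summary = [
    ("USA", 1122.2819207780806, 1451),
    ("CAN", 1712.9952276493766, 391),
    ("DEU", 45.92359430736882, 232),
    ("GBR", 34.20295398919941, 131),
]

def density_sqrt(airports, area):
    # Airports per squared degree, square-rooted to compress the range for plotting.
    return math.sqrt(airports / area)

# Rank the sample rows from densest to sparsest.
ranked = sorted(
    ((code, density_sqrt(n, area)) for code, area, n in summary),
    key=lambda row: row[1],
    reverse=True,
)
for code, value in ranked:
    print(code, round(value, 3))
```

Among these four rows, Germany ranks first and Canada last, even though Canada has more airports than Germany in absolute terms.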

In [30]:
%scala
display(airportsSummary
        .select($"country_code", sqrt($"airports"/$"area").as("airports_per_deg_sq_sqrt_scale"))
        .orderBy(desc("airports_per_deg_sq_sqrt_scale"))
)

country_code,airports_per_deg_sq_sqrt_scale
CYP,2.855161221472386
CHE,2.8439298704552245
PRI,2.759370339976614
ISR,2.619056743190124
BEL,2.6054781230675235
CRI,2.599750008360929
VUT,2.5171149647202387
DEU,2.247636399010544
BHS,2.236262370699382
NLD,2.179237860356212


By this measure, Cyprus (CYP) and Switzerland (CHE) are the countries with the highest airport density.