Geographical Information Systems (GIS), like any specialized field, has a wealth of jargon and unique concepts. When represented in software, these concepts can sometimes be skewed or expanded from their original forms. We give a thorough definition of many of the core concepts here, while referencing the Geotrellis objects and source files backing them.
This document aims to be informative to new and experienced GIS users alike. If GIS is brand, brand new to you, this document is a useful high level overview.
- Tile: A grid of numeric cells that represent some data on the Earth.
- Cell: A single unit of data in some grid, also called a Location in GIS.
- Layer: or "Tile Layer", this is a grid (or cube) of Tiles.
- Zoom Layer: a Tile Layer at some zoom level.
- Key: Used to index a Tile in a grid (or cube) of them.
- Key Index: Used to transform higher-dimensional Keys into one dimension.
- Metadata: or "Layer Metadata", stores information critical to Tile Layer IO.
- Layout Definition: A description of a Tile grid (its dimensions, etc).
- Extent: or "Bounding Box", represents some area on the Earth.
- Raster: A Tile with an Extent.
- Vector: or "Geometry", these are Point, Line, and Polygon data.
- Feature: A Geometry with some associated metadata.
- RDD: "Resilient Distributed Datasets" from Apache
Spark. Can be thought of as a highly distributed
Scala
Seq
.
These definitions are expanded upon in other sections of this document.
Tile layers (of Rasters or otherwise) are represented in GeoTrellis with the
type RDD[(K, V)] with Metadata[M]
. This type is used extensively across
the code base, and its contents form the deepest compositional hierarchy we
have:
In this diagram:
CustomTile
,CustomMetadata
, andCustomKey
don't exist, they represent types that you could write yourself for your application.- The
K
seen in several places is the sameK
. - The type
RDD[(K, V)] with Metadata[M]
is a Scala Anonymous Type. In this case, it meansRDD
from Apache Spark with extra methods injected from theMetadata
trait. This type is sometimes aliased in GeoTrellis asContextRDD
. RDD[(K, V)]
resembles a ScalaMap[K, V]
, and in fact has furtherMap
-like methods injected by Spark when it takes this shape. See Spark's PairRDDFunctions Scaladocs for those methods. Note: UnlikeMap
, theK
s here are not guaranteed to be unique.
TileLayerRDD
A common specification of RDD[(K, V)] with Metadata[M]
in GeoTrellis is as follows:
type TileLayerRDD[K] = RDD[(K, Tile)] with Metadata[TileLayerMetadata[K]]
This type represents a grid (or cube!) of Tile
s on the earth, arranged
according to some K
. Features of this grid are:
- Grid location
(0, 0)
is the top-leftmostTile
. - The
Tile
s exist in some CRS. InTileLayerMetadata
, this is kept track of with an actualCRS
field. - In applications,
K
is mostlySpatialKey
orSpaceTimeKey
.
Tile Layer IO
Layer IO requires a Tile Layer Backend. Each backend
has an AttributeStore
, a LayerReader
, and a LayerWriter
.
Example setup (with our File
system backend):
import geotrellis.spark._
import geotrellis.spark.io._
import geotrellis.spark.io.file._
val catalogPath: String = ... /* Some location on your computer */
val store: AttributeStore = FileAttributeStore(catalogPath)
val reader = FileLayerReader(store)
val writer = FileLayerWriter(store)
Writing an entire layer:
/* Zoom level 13 */
val layerId = LayerId("myLayer", 13)
/* Produced from an ingest, etc. */
val rdd: TileLayerRDD[SpatialKey] = ...
/* Order your Tiles according to the Z-Curve Space Filling Curve */
val index: KeyIndex[SpatialKey] = ZCurveKeyIndexMethod.createIndex(rdd.metadata.bounds)
/* Returns `Unit` */
writer.write(layerId, rdd, index)
Reading an entire layer:
/* `.read` has many overloads, but this is the simplest */
val sameLayer: TileLayerRDD[SpatialKey] = reader.read(layerId)
Querying a layer (a "filtered" read):
/* Some area on the earth to constrain your query to */
val extent: Extent = ...
/* There are more types that can go into `where` */
val filteredLayer: TileLayerRDD[SpatialKey] =
reader.query(layerId).where(Intersects(extent)).result
As mentioned in the Tile Layers section, grids (or cubes) of
Tile
s on the earth are organized by keys. This key, often refered to
generically as K
, is typically a SpatialKey
or a SpaceTimeKey
:
case class SpatialKey(col: Int, row: Int)
case class SpaceTimeKey(col: Int, row: Int, instant: Long)
although there is nothing stopping you from defining your own key type.
Assuming some tile layer Extent
on the earth, SpatialKey(0, 0)
would
index the top-leftmost Tile
in the Tile grid.
When doing Layer IO, certain optimizations can be performed if we know that
Tile
s stored near each other in a filesystem or database (like Accumulo or
HBase) are also spatially-close in the grid they're from. To make such a
guarantee, we use a KeyIndex
.
A KeyIndex
is a GeoTrellis trait
that represents Space Filling
Curves. They are a
means by which to translate multi-dimensional indices into a
single-dimensional one, while maintaining spatial locality. In GeoTrellis,
we use these chiefly when writing Tile Layers to one of our Tile Layer
Backends.
Although KeyIndex
is often used in its generic trait
form, we supply
three underlying implementations.
The Z-Curve is the simplest KeyIndex
to use (and implement). It can be
used with both SpatialKey
and SpaceTimeKey
.
val b0: KeyBounds[SpatialKey] = ... /* from `TileLayerRDD.metadata.bounds` */
val b1: KeyBounds[SpaceTimeKey] = ...
val i0: KeyIndex[SpatialKey] = ZCurveKeyIndexMethod.createIndex(b0)
val i1: KeyIndex[SpaceTimeKey] = ZCurveKeyIndexMethod.byDay().createIndex(b1)
val k: SpatialKey = ...
val oneD: Long = i0.toIndex(k) /* A SpatialKey's 2D coords mapped to 1D */
Another well-known curve, available for both SpatialKey
and SpaceTimeKey
.
val b: KeyBounds[SpatialKey] = ...
val index: KeyIndex[SpatialKey] = HilbertKeyIndexMethod.createIndex(b)
Changing the resolution (in bits) of the index causes a rotation and/or reflection of the points with respect to curve-order. Take, for example the following code (which is actually derived from the testing codebase):
HilbertSpaceTimeKeyIndex(SpaceTimeKey(0,0,y2k), SpaceTimeKey(2,2,y2k.plusMillis(1)),2,1)
The last two arguments are the index resolutions. If that were changed to:
HilbertSpaceTimeKeyIndex(SpaceTimeKey(0,0,y2k), SpaceTimeKey(2,2,y2k.plusMillis(1)),3,1)
The index-order of the points would be different. The reasons behind this
are ultimately technical, though you can imagine how a naive implementation
of an index for, say, a 10x10 matrix (in terms of 100 numbers) would need to
be reworked if you were to change the number of cells (100 would no longer
be enough for an 11x11 matrix and the pattern for indexing you chose may no
longer make sense). Obviously, this is complex and beyond the scope of
GeoTrellis' concerns, which is why we lean on Google's uzaygezen
library.
Currently, the spatial and temporal resolution required to index the points, expressed in bits, must sum to 62 bits or fewer.
For example, the following code appears in HilbertSpaceTimeKeyIndex.scala
:
@transient
lazy val chc = {
val dimensionSpec =
new MultiDimensionalSpec(
List(
xResolution,
yResolution,
temporalResolution
).map(new java.lang.Integer(_))
)
}
where xResolution
, yResolution
and temporalResolution
are numbers of
bits required to express possible locations in each of those dimensions. If
those three integers sum to more than 62 bits, an error will be thrown at
runtime.
Row Major is only available for SpatialKey
, but provides the fastest
toIndex
lookup of the three curves. It doesn't however, give good locality
guarantees, so should only be used when locality isn't as important to your
application.
val b: KeyBounds[SpatialKey] = ...
val index: KeyIndex[SpatialKey] = RowMajorKeyIndexMethod.createIndex(b)
Tile
is a core GeoTrellis primitive. As mentioned in Tile
Layers, a common specification of RDD[(K, V)] with Metadata[M]
is:
type TileLayerRDD[K] = RDD[(K, Tile)] with Metadata[TileLayerMetadata[K]]
What is a Tile
exactly? Below is a diagram of our Tile
type hierarchy.
As you can see, any Tile
(via CellGrid
) is effectively a grid of data
cells:
The Tile
trait has operations you'd expect for traversing and transforming
this grid, like:
map: (Int => Int) => Tile
foreach: (Int => Unit) => Unit
combine: Tile => ((Int, Int) => Int) => Tile
color: ColorMap => Tile
Critically, a Tile
must know how big it is, and what its underlying
Cell Type is:
cols: Int
rows: Int
cellType: CellType
Fundamentally, the union of a Tile
and Extent
is how GeoTrellis defines
a Raster
:
case class Raster[+T <: CellGrid](tile: T, extent: Extent) extends CellGrid
For performance reasons, we have opted for Tile
to hold its CellType
as
opposed to making Tile
polymorphic on its underlying numeric type, for
example like trait Tile[T]
. The large type hierarchy above is what results
from this decision. For more information, see
our notes on Tile performance.
What is a Cell Type?
- A
CellType
is a data type plus a policy for handling cell values that may contain no data. - By 'data type' we shall mean the underlying numerical representation
of a
Tile
's cells. NoData
, for performance reasons, is not represented as a value outside the range of the underlying data type (as, e.g.,None
) - if each cell in some tile is aByte
, theNoData
value of that tile will exist within the range [Byte.MinValue
(-128),Byte.MaxValue
(127)].- If attempting to convert between
CellTypes
, see this note onCellType
conversions.
No NoData | Constant NoData | User Defined NoData | |
---|---|---|---|
BitCells | BitCellType |
N/A | N/A |
ByteCells | ByteCellType |
ByteConstantNoDataCellType |
ByteUserDefinedNoDataCellType |
UbyteCells | UByteCellType |
UByteConstantNoDataCellType |
UByteUserDefinedNoDataCellType |
ShortCells | ShortCellType |
ShortConstantNoDataCellType |
ShortUserDefinedNoDataCellType |
UShortCells | UShortCellType |
UShortConstantNoDataCellType |
UShortUserDefinedNoDataCellType |
IntCells | IntCellType |
IntConstantNoDataCellType |
IntUserDefinedNoDataCellType |
FloatCells | FloatCellType |
FloatConstantNoDataCellType |
FloatUserDefinedNoDataCellType |
DoubleCells | DoubleCellType |
DoubleConstantNoDataCellType |
DoubleUserDefinedNoDataCellType |
The above table lists CellType
DataType
s in the leftmost column
and NoData
policies along the top row.
A couple of points are worth
making here:
- Bits are incapable of representing on, off, and some
NoData
value. As a consequence, there is no such thing as a Bit-backed tile which recognizesNoData
. - While the types in the 'No NoData' and 'Constant NoData' are simply
singleton objects that are passed around alongside tiles, the greater
configurability of 'User Defined NoData'
CellType
s means that they require a constructor specifying the value which will count asNoData
.
Let's look to how this information can be used:
/** Here's an array we'll use to construct tiles */
val myData = Array(42, 1, 2, 3)
/** The GeoTrellis-default integer CellType
* Note that it represents `NoData` values with the smallest signed
* integer possible with 32 bits (Int.MinValue or -2147483648).
*/
val defaultCT = IntConstantNoDataCellType
val normalTile = IntArrayTile(myData, 2, 2, defaultCT)
/** A custom, 'user defined' NoData CellType for comparison; we will
* treat 42 as NoData for this one rather than Int.MinValue
*/
val customCellType = IntUserDefinedNoDataValue(42)
val customTile = IntArrayTile(myData, 2, 2, customCellType)
/** We should expect that the first (default celltype) tile has the value 42 at (0, 0)
* This is because 42 is just a regular value (as opposed to NoData)
* which means that the first value will be delivered without surprise
*/
assert(normalTile.get(0, 0) == 42)
assert(normalTile.getDouble(0, 0) == 42.0)
/** Here, the result is less obvious. Under the hood, GeoTrellis is
* inspecting the value to be returned at (0, 0) to see if it matches our
* `NoData` policy and, if it matches (it does, we defined NoData as
* 42 above), return Int.MinValue (no matter your underlying type, `get`
* on a tile will return an `Int` and `getDouble` will return a `Double`).
*
* The use of Int.MinValue and Double.NaN is a result of those being the
* GeoTrellis-blessed values for NoData - below, you'll find a chart that
* lists all such values in the rightmost column
*/
assert(customTile.get(0, 0) == Int.MinValue)
assert(customTile.getDouble(0, 0) == Double.NaN)
A point which is perhaps not intuitive is that get
will always
return an Int
and getDouble
will always return a Double
.
Representing NoData demands, therefore, that we map other celltypes'
NoData
values to the native, default Int
and Double
NoData
values. NoData
will be represented as Int.MinValue
or Double.Nan
.
Why you should care
In most programming contexts, it isn't all that useful to think carefully about the number of bits necessary to represent the data passed around by a program. A program tasked with keeping track of all the birthdays in an office or all the accidents on the New Jersey turnpike simply doesn't benefit from carefully considering whether the allocation of those extra few bits is really worth it. The costs for any lack of efficiency are more than offset by the savings in development time and effort. This insight - that computers have become fast enough for us to be forgiven for many of our programming sins - is, by now, truism.
An exception to this freedom from thinking too hard about implementation details is any software that tries, in earnest, to provide the tools for reading, writing, and working with large arrays of data. Rasters certainly fit the bill. Even relatively modest rasters can be made up of millions of underlying cells. Additionally, the semantics of a raster imply that each of these cells shares an underlying data type. These points - that rasters are made up of a great many cells and that they all share a backing data type - jointly suggest that a decision regarding the underlying data type could have profound consequences. More on these consequences below.
Compliance with the GeoTIFF standard is another reason that management of cell types is important for GeoTrellis. The most common format for persisting a raster is the GeoTIFF. A GeoTIFF is simply an array of data along with some useful tags (hence the 'tagged' of 'tagged image file format'). One of these tags specifies the size of each cell and how those bytes should be interpreted (i.e. whether the data for a byte includes its sign - positive or negative - or whether it counts up from 0 - and is therefore said to be 'unsigned').
In addition to keeping track of the memory used by each cell in a Tile
,
the cell type is where decisions about which values count as data (and
which, if any, are treated as NoData
). A value recognized as NoData
will be ignored while mapping over tiles, carrying out focal operations
on them, interpolating for values in their region, and just about all of
the operations provided by GeoTrellis for working with Tile
s.
Cell Type Performance
There are at least two major reasons for giving some thought to the types of data you'll be working with in a raster: persistence and performance.
Persistence is simple enough: smaller datatypes end up taking less space
on disk. If you're going to represent a region with only true
/false
values on a raster whose values are Double
s, 63/64 bits will be wasted.
Naively, this means somewhere around 63 times less data than if the most
compact form possible had been chosen (the use of BitCells
would
be maximally efficient for representing the bivalent nature of boolean
values). See the chart below for a sense of the relative sizes of these
cell types.
The performance impacts of cell type selection matter in both a local and a distributed (spark) context. Locally, the memory footprint will mean that as larger cell types are used, smaller amounts of data can be held in memory and worked on at a given time and that more CPU cache misses are to be expected. This latter point - that CPU cache misses will increase - means that more time spent shuffling data from the memory to the processor (which is often a performance bottleneck). When running programs that leverage spark for compute distribution, larger data types mean more data to serialize and more data send over the (very slow, relatively speaking) network.
In the chart below, DataType
s are listed in the leftmost column and
important characteristics for deciding between them can be found to the
right. As you can see, the difference in size can be quite stark depending on
the cell type that a tile is backed by. That extra space is the price
paid for representing a larger range of values. Note that bit cells
lack the sufficient representational resources to have a NoData
value.
Bits / Cell | 512x512 Raster (mb) | Range (inclusive) | GeoTrellis NoData Value | |
---|---|---|---|---|
BitCells | 1 | 0.032768 | [0, 1] | N/A |
ByteCells | 8 | 0.262144 | [-128, 128] | -128 (Byte.MinValue ) |
UbyteCells | 8 | 0.262144 | [0, 255] | 0 |
ShortCells | 16 | 0.524288 | [-32768, 32767] | -32768 (Short.MinValue ) |
UShortCells | 16 | 0.524288 | [0, 65535] | 0 |
IntCells | 32 | 1.048576 | [-2147483648, 2147483647] | -2147483648 (Int.MinValue ) |
FloatCells | 32 | 1.048576 | [-3.40E38, 3.40E38] | Float.NaN |
DoubleCells | 64 | 2.097152 | [-1.79E308, 1.79E308] | Double.NaN |
One final point is worth making in the context of CellType
performance: the Constant
types are able to depend upon macros which
inline comparisons and conversions. This minor difference can certainly
be felt while iterating through millions and millions of cells. If possible, Constant
NoData
values are to be preferred. For convenience' sake, we've
attempted to make the GeoTrellis-blessed NoData
values as unobtrusive
as possible a priori.
The limits of expected return types (discussed in the previous section) is used by
macros to squeeze as much speed out of the JVM as possible. Check out
our macros docs
for more on our use of macros like isData
and isNoData
.
“Yes raster is faster, but raster is vaster and vector just SEEMS more corrector.” — C. Dana Tomlin
The entire purpose of geotrellis.raster
is to provide primitive datatypes
which implement, modify, and utilize rasters. In GeoTrellis, a raster is
just a Tile
with an associated Extent
. A tile is just a two-dimensional
collection of evenly spaced data. Tiles are a lot like certain sequences of
sequences (this array of arrays is like a 3x3 tile):
// not real syntax
val myFirstTile = [[1,1,1],[1,2,2],[1,2,3]]
/** It probably looks more like your mental model if we stack them up:
* [[1,1,1],
* [1,2,2],
* [1,2,3]]
*/
In the raster
module of GeoTrellis, the base type of tile is just Tile
.
All GeoTrellis compatible tiles will have inherited from that base class, so
if you find yourself wondering what a given type of tile's powers are,
that's a decent place to start your search. Here's an incomplete list of the
types of things on offer:
- Mapping transformations of arbitrary complexity over the constituent cells
- Carrying out operations (side-effects) for each cell
- Querying a specific tile value
- Rescaling, resampling, cropping
As we've already discussed, tiles are made up of squares which contain
values. We'll sometimes refer to these value-boxes as cells. And, just
like cells in the body, though they are discrete units, they're most
interesting when looked at from a more holistic perspective - rasters encode
relations between values in a uniform space and it is usually these
relations which most interest us. The code found in the mapalgebra
submodule — discussed later in this document — is all about exploiting these
spatial relations.
One of the first questions you'll ask yourself when working with GeoTrellis
is what kinds of representation best models the domain you're dealing with.
What types of value do you need your raster to hold? This question is the
province of GeoTrellis CellType
s.
With a grasp of tiles and CellType
s, we've got all the conceptual tools
necessary to construct our own tiles. Now, since a tile is a combination of
a CellType
with which its cells are encoded and their spatial arrangement,
we will have to somehow combine Tile
(which encodes our expectations about
how cells sit with respect to one another) and the datatype of our choosing.
Luckily, GeoTrellis has done this for us. To keep its users sane, the wise
maintainers of GeoTrellis have organized geotrellis.raster
such that fully
reified tiles sit at the bottom of an pretty simple inheritance chain. Let's
explore that inheritance so that you will know where to look when your
intuitions lead you astray:
From IntArrayTile.scala
:
abstract class IntArrayTile(
val array: Array[Int],
cols: Int,
rows: Int
) extends MutableArrayTile { ... }
From DoubleArrayTile.scala
:
abstract class DoubleArrayTile(
val array: Array[Double],
cols: Int,
rows: Int
) extends MutableArrayTile { ... }
Both IntArrayTile
and DoubleArrayTile
are themselves extended by other
child classes, but they are a good place to start. Critically, they are both
MutableArrayTile
s, which adds some nifty methods for in-place manipulation
of cells (GeoTrellis is about performance, so this minor affront to the gods
of immutability can be forgiven). From MutableArrayTile.scala:
trait MutableArrayTile extends ArrayTile { ... }
One level up is ArrayTile
. It's handy because it implements the behavior
which largely allows us to treat our tiles like big, long arrays of (arrays
of) data. They also have the trait Serializable
, which is neat any time
you can't completely conduct your business within the neatly defined
space-time of the JVM processes which are running on a single machine (this
is the point of GeoTrellis' Spark integration). From ArrayTile.scala:
trait ArrayTile extends Tile with Serializable { ... }
At the top rung in our abstraction ladder we have Tile
. You might be
surprised how much we can say about tile behavior from the base of its
inheritance tree, so the source is worth reading directly. From Tile.scala:
trait Tile extends CellGrid with ... { ... }
Where CellGrid
and its parent Grid
just declare something to be - you
guessed it - a grid of numbers with an explicit CellType
.
As it turns out, CellType
is one of those things that we can mostly ignore
once we've settled on which one is proper for our domain. After all, it appears
as though there's very little difference between tiles that prefer int-like
things and tiles that prefer double-like things.
CAUTION: While it is true, in general, that operations are
CellType
agnostic, bothget
andgetDouble
are methods implemented onTile
. In effect, this means that you'll want to be careful when querying values. If you're working with int-likeCellType
s, probably useget
. If you're working with float-likeCellType
s, usually you'll wantgetDouble
.
In the repl, you can try this out to construct a simple Raster
:
import geotrellis.raster._
import geotrellis.vector._
scala> IntArrayTile(Array(1,2,3),1,3)
res0: geotrellis.raster.IntArrayTile = IntArrayTile([S@338514ad,1,3)
scala> IntArrayTile(Array(1,2,3),3,1)
res1: geotrellis.raster.IntArrayTile = IntArrayTile([S@736a81de,3,1)
scala> IntArrayTile(Array(1,2,3,4,5,6,7,8,9),3,3)
res2: geotrellis.raster.IntArrayTile = IntArrayTile([I@5466441b,3,3)
scala> Extent(0, 0, 1, 1)
res4: geotrellis.vector.Extent = Extent(0.0,0.0,1.0,1.0)
scala> Raster(res2, res4)
res5: geotrellis.raster.Raster = Raster(IntArrayTile([I@7b47ab7,1,3),Extent(0.0,0.0,1.0,1.0))
Here's a fun method for exploring your tiles:
scala> res0.asciiDraw()
res3: String =
" 1
2
3
"
scala> res2.asciiDraw()
res4: String =
" 1 2 3
4 5 6
7 8 9
"
That's probably enough to get started. geotrellis.raster
is a pretty big
place, so you'll benefit from spending a few hours playing with the tools it
provides.
“Raster is faster but vector is correcter.” — Somebody
In addition to working with raster data, Geotrellis provides a number of
tools for the creation, representation, and modification of vector data. The
data types central to this functionality (geotrellis.vector.Feature
and
geotrellis.vector.Geometry
) correspond - and not by accident - to certain
objects found in the GeoJson spec.
Feature
s correspond to the objects listed under features
in a geojson
FeatureCollection
. Geometry
s, to geometries
in a geojson Feature
.
The base Geometry
class can be found in Geometry.scala
. Concrete
geometries include:
geotrellis.vector.Point
geotrellis.vector.MultiPoint
geotrellis.vector.Line
geotrellis.vector.MultiLine
geotrellis.vector.Polygon
geotrellis.vector.MultiPolygon
geotrellis.vector.GeometryCollection
Working with these geometries is a relatively straightforward affair. Let's take a look:
import geotrellis.vector._
/** First, let's create a Point. Then, we'll use its intersection method.
* Note: we are also using intersection's alias '&'.
*/
val myPoint = Point(1.0, 1.1) // Create a point
// Intersection method
val selfIntersection = myPoint intersection Point(1.0, 1.1)
// Intersection alias
val nonIntersection = myPoint & Point(200, 300)
At this point, the values selfIntersection
and nonIntersection
are
GeometryResult
containers. These containers are what many JTS operations
on Geometry
objects will wrap their results in. To idiomatically
destructure these wrappers, we can use the as[G <: Geometry]
function
which either returns Some(G)
or None
.
val pointIntersection = (Point(1.0, 2.0) & Point(1.0, 2.0)).as[Point]
val pointNonIntersection = (Point(1.0, 2.0) & Point(12.0, 4.0)).as[Point]
assert(pointIntersection == Some(Point(1.0, 2.0))) // Either some point
assert(pointNonIntersection == None) // Or nothing at all
As convenient as as[G <: Geometry]
is, it offers no guarantees about the
domain over which it ranges. So, while you can expect a neatly packaged
Option[G <: Geometry]
, it isn't necessarily the case that the
GeometryResult
object produced by a given set of operations is possibly
convertable to the Geometry
subtype you choose. For example, a
PointGeometryIntersectionResult.as[Polygon]
will always return None
.
An alternative approach uses pattern matching and ensures an exhaustive
check of the results.
Results.scala
contains a large ADT
which encodes the possible outcomes for different types of outcomes. The
result type of a JTS-dependent vector operation can be found somewhere on
this tree to the effect that an exhaustive match can be carried out to
determine the Geometry
(excepting cases of NoResult
, for which there is
no Geometry
).
For example, we note that a Point
/Point
intersection has the type
PointOrNoResult
. From this we can deduce that it is either a Point
underneath or else nothing:
scala> import geotrellis.vector._
scala> p1 & p2 match {
| case PointResult(_) => println("A Point!)
| case NoResult => println("Sorry, no result.")
| }
A Point!
Beyond the methods which come with any Geometry
object there are implicits
in many geotrellis modules which will extend Geometry capabilities. For
instance, after importing geotrellis.vector.io._
, it becomes possible to
call the toGeoJson
method on any Geometry
:
import geotrellis.vector.io._
assert(Point(1,1).toGeoJson == """{"type":"Point","coordinates":[1.0,1.0]}""")
If you need to move from a geometry to a serialized representation or
vice-versa, take a look at the io
directory's contents. This naming
convention for input and output is common throughout Geotrellis. So if
you're trying to get spatial representations in or out of your program,
spend some time seeing if the problem has already been solved.
Methods which are specific to certain subclasses of Geometry
exist too.
For example, geotrellis.vector.MultiLine
is implicitly extended by
geotrellis.vector.op
such that this becomes possible:
import geotrellis.vector.op._
val myML = MultiLine.EMPTY
myML.unionGeometries
The following packages extend Geometry
capabilities:
- geotrellis.vector.io.json
- geotrellis.vector.io.WKT
- geotrellis.vector.io.WKB
- geotrellis.vector.op
- geotrellis.vector.op.affine
- geotrellis.vector.reproject
The Feature
class is odd at first glance; it thinly wraps one of the
afforementioned Geometry
objects along with some type of data. Its purpose
will be clear if you can keep in mind the importance of the geojson format
of serialization which is now ubiquitous in the GIS software space. It can
be found in Feature.scala
.
Let's examine some source code so that this is all a bit clearer. From
geotrellis.vector.Feature.scala
:
abstract class Feature[D] {
type G <: Geometry
val geom: G ; val data: D
}
case class PointFeature[D](geom: Point, data: D) extends Feature[D] {type G = Point}
These type signatures tell us a good deal. Let's make this easy on ourselves
and put our findings into a list. - The type G
is some instance or
other of
Geometry
(which we explored just above).
- The value,
geom
, which anything the compiler recognizes as aFeature
must make available in its immediate closure must be of typeG
. - As with
geom
the compiler will not be happy unless aFeature
providesdata
. - Whereas, with
geom
, we could say a good deal about the types of stuff (only things we call geometries) that would satisfy the compiler, we have nothing in particular to say aboutD
.
Our difficulty with D
is shared by the Point
-focused feature,
PointFeature
. PointFeature
uses Point
(which is one of the concrete
instances of Geometry
introduced above) while telling us nothing at all
about data
's type. This is just sugar for passing around a Point
and
some associated metadata.
Let's look at some code which does something with D (code which calls one of D's methods) so that we know what to expect. Remember: types are just contracts which the compiler is kind enough to enforce. In well-written code, types (and type variables) can tell us a great deal about what was in the head of the author.
There's only one package
which does anything with D
, so the constraints
(and our job) should be relatively easy. In
geotrellis.vector.io.json.FeatureFormats.scala
there are ContextBound
s on D
which ensure that they have JsonReader,
JsonWriter, and JsonFormat implicits available (this is a
typeclass,
and it allows for something like type-safe duck-typing).
D
's purpose is clear enough: any D
which comes with the tools necessary
for json serialization and deserialization will suffice. In effect, data
corresponds to the "properties" member of the geojson spec's Feature
object.
If you can provide the serialization tools (that is, implicit conversions
between some type (your D
) and spray
json), the Feature
object in
geotrellis.vector
does the heavy lifting of embedding your (thus
serializable) data into the larger structure which includes a Geometry
.
There's even support for geojson IDs: the "ID" member of a geojson Feature
is represented by the keys of a Map
from String
to Feature[D]
. Data in
both the ID and non-ID variants of geojson Feature formats is easily
transformed.
These submodules define useful methods for dealing with
the entities that call geotrellis.vector
home:
geotrellis.vector.io
defines input/output (serialization) of geometriesgeotrellis.vector.op
defines common operations on geometriesgeotrellis.vector.reproject
defines methods for translating between projections
Data structures: LayoutDefinition
, TileLayout
, CellSize
A Layout Definition describes the location, dimensions of, and organization of a tiled area of a map. Conceptually, the tiled area forms a grid, and the Layout Definitions describes that grid's area and cell width/height. These definitions can be used to chop a bundle of imagery into tiles suitable for being served out on a web map.
Within the context of GeoTrellis, the LayoutDefinition
class extends
GridExtent
, and exposes methods for querying the sizes of the grid and
grid cells. Those values are stored in the TileLayout
(the grid
description) and CellSize
classes respectively. LayoutDefinition
s are
used heavily during the raster reprojection process.
Within the context of Geotrellis, the LayoutDefinition
class extends
GridExtent
, and exposes methods for querying the sizes of the grid and
grid cells. Those values are stored in the TileLayout
(the grid
description) and CellSize
classes respectively. LayoutDefinition
s are
used heavily during the raster reprojection process.
What is a Layout Scheme?
The language here can be vexing, but a LayoutScheme
can be thought of
as a factory which produces LayoutDefinition
s. It is the scheme
according to which some layout definition must be defined - a layout
definition definition, if you will. The most commonly used
LayoutScheme
is the ZoomedLayoutScheme
, which provides the ability
to generate LayoutDefinitions
for the different zoom levels of a
web-based map (e.g. Leaflet).
How are Layout Definitions used throughout Geotrellis?
Suppose that we've got a distributed collection of ProjectedExtent
s
and Tile
s which cover some contiguous area but which were derived from
GeoTIFFs of varying sizes. We will sometimes describe operations like this as
'tiling'. The method which tiles a collection of imagery provided a
LayoutDefinition
, the underlying CellType
of the produced tiles, and
the ResampleMethod
to use for generating data at new resolutions is
tileToLayout
. Let's take a look at its use:
val sourceTiles: RDD[(ProjectedExtent, Tile)] = ??? // Tiles from GeoTIFF
val cellType: CellType = IntCellType
val layout: LayoutDefinition = ???
val resamp: ResampleMethod = NearestNeighbor
val tiled: RDD[(SpatialKey, Tile)] =
tiles.tileToLayout[SpatialKey](cellType, layout, resamp)
In essence, a LayoutDefinition
is the minimum information required to
describe the tiling of some map's area in Geotrellis. The LayoutDefinition
class extends GridExtent
, and exposes methods for querying the sizes of the
grid and grid cells. Those values are stored in the TileLayout
(the grid
description) and CellSize
classes respectively. LayoutDefinition
s are
most often encountered in raster reprojection processes.
Map Algebra is a name given by Dr. Dana Tomlin in the 1980's to a way of manipulating and transforming raster data. There is a lot of literature out there, not least the book by the guy who "wrote the book" on map algebra, so we will only give a brief introduction here. GeoTrellis follows Dana's vision of map algebra operations, although there are many operations that fall outside of the realm of Map Algebra that it also supports.
Map Algebra operations fall into 3 general categories:
Local Operations
Local operations are ones that only take into account the information of on cell at a time. In the animation above, we can see that the blue and the yellow cell are combined, as they are corresponding cells in the two tiles. It wouldn't matter if the tiles were bigger or smaller - the only information necessary for that step in the local operation is the cell values that correspond to each other. A local operation happens for each cell value, so if the whole bottom tile was blue and the upper tile were yellow, then the resulting tile of the local operation would be green.
Focal Operations
Focal operations take into account a cell, and a neighborhood around that
cell. A neighborhood can be defined as a square of a specific size, or
include masks so that you can have things like circular or wedge-shaped
neighborhoods. In the above animation, the neighborhood is a 5x5 square
around the focal cell. The focal operation in the animation is a focalSum
.
The focal value is 0, and all of the other cells in the focal neighborhood;
therefore the cell value of the result tile would be 8 at the cell
corresponding to the focal cell of the input tile. This focal operation
scans through each cell of the raster. You can imagine that along the
border, the focal neighborhood goes outside of the bounds of the tile; in
this case the neighborhood only considers the values that are covered by the
neighborhood. GeoTrellis also supports the idea of an analysis area, which
is the GridBounds that the focal operation carries over, in order to support
composing tiles with border tiles in order to support distributed focal
operation processing.
Zonal Operations
Zonal operations are ones that operate on two tiles: an input tile, and a
zone tile. The values of the zone tile determine what zone each of the
corresponding cells in the input tile belong to. For example, if you are
doing a zonalStatistics
operation, and the zonal tile has a distribution
of zone 1, zone 2, and zone 3 values, we will get back the statistics such
as mean, median and mode for all cells in the input tile that correspond to
each of those zone values.
How to use Map Algebra operations
Map Algebra operations are defined as implicit methods on Tile
or
Traversable[Tile]
, which are imported with import geotrellis.raster._
.
import geotrellis.raster._
val tile1: Tile = ???
val tile2: Tile = ???
// If tile1 and tile2 are the same dimensions, we can combine
// them using local operations
tile1.localAdd(tile2)
// There are operators for some local operations.
// This is equivalent to the localAdd call above
tile1 + tile2
// There is a local operation called "reclassify" in literature,
// which transforms each value of the function.
// We actually have a map method defined on Tile,
// which serves this purpose.
tile1.map { z => z + 1 } // Map over integer values.
tile2.mapDouble { z => z + 1.1 } // Map over double values.
tile1.dualMap({ z => z + 1 })({ z => z + 1.1 }) // Call either the integer value or double version, depending on cellType.
// You can also combine values in a generic way with the combine funciton.
// This is another local operation that is actually defined on Tile directly.
tile1.combine(tile2) { (z1, z2) => z1 + z2 }
The following packages are where Map Algebra operations are defined in GeoTrellis:
geotrellis.raster.local
defines operations which act on a cell without regard to its spatial relations. Need to double every cell on a tile? This is the module you'll want to explore.geotrellis.raster.focal
defines operations which focus on two-dimensional windows (internally referred to as neighborhoods) of a raster's values to determine their outputs.geotrellis.raster.zonal
defines operations which apply over a zones as defined by corresponding cell values in the zones raster.
Conway's Game of Life
can be seen as a focal operation in that each cell's value depends on
neighboring cell values. Though focal operations will tend to look at a
local region of this or that cell, they should not be confused with the
operations which live in geotrellis.raster.local
- those operations
describe transformations over tiles which, for any step of the calculation,
need only know the input value of the specific cell for which it is
calculating an output (e.g. incrementing each cell's value by 1).
Invented by Mapbox, VectorTiles are a combination of the ideas of finite-sized tiles and vector geometries. Mapbox maintains the official implementation spec for VectorTile codecs. The specification is free and open source.
VectorTiles are advantageous over raster tiles in that:
- They are typically smaller to store
- They can be easily transformed (rotated, etc.) in real time
- They allow for continuous (as opposed to step-wise) zoom in Slippy Maps.
Raw VectorTile data is stored in the protobuf format. Any codec implementing
the spec must
decode and encode data according to this .proto
schema.
GeoTrellis provides the geotrellis-vectortile
module, a high-performance
implementation of Version 2.1 of the VectorTile spec. It features:
- Decoding of Version 2 VectorTiles from Protobuf byte data into useful Geotrellis types.
- Lazy decoding of Geometries. Only parse what you need!
- Read/write VectorTile layers to/from any of our backends.
As of 2016 November, ingests of raw vector data into VectorTile sets aren't yet possible.
Small Example
import geotrellis.spark.SpatialKey
import geotrellis.spark.tiling.LayoutDefinition
import geotrellis.vector.Extent
import geotrellis.vectortile.VectorTile
import geotrellis.vectortile.protobuf._
val bytes: Array[Byte] = ... // from some `.mvt` file
val key: SpatialKey = ... // preknown
val layout: LayoutDefinition = ... // preknown
val tileExtent: Extent = layout.mapTransform(key)
/* Decode Protobuf bytes. */
val tile: VectorTile = ProtobufTile.fromBytes(bytes, tileExtent)
/* Encode a VectorTile back into bytes. */
val encodedBytes: Array[Byte] = tile match {
case t: ProtobufTile => t.toBytes
case _ => ??? // Handle other backends or throw errors.
}
See our VectorTile Scaladocs for detailed usage information.
Implementation Assumptions
This particular implementation of the VectorTile spec makes the following assumptions:
- Geometries are implicitly encoded in ''some'' Coordinate Reference system. That is, there is no such thing as a "projectionless" VectorTile. When decoding a VectorTile, we must provide a Geotrellis [[Extent]] that represents the Tile's area on a map. With this, the grid coordinates stored in the VectorTile's Geometry are shifted from their original [0,4096] range to actual world coordinates in the Extent's CRS.
- The
id
field in VectorTile Features doesn't matter. UNKNOWN
geometries are safe to ignore.- If a VectorTile
geometry
list marked asPOINT
has only one pair of coordinates, it will be decoded as a GeotrellisPoint
. If it has more than one pair, it will be decoded as aMultiPoint
. Likewise for theLINESTRING
andPOLYGON
types. A complaint has been made about the spec regarding this, and future versions may include a difference between single and multi geometries.
GeoTiffs are a type of Tiff image file that contain image data pertaining to satellite, aerial, and elevation data among other types of geospatial information. The additional pieces of metadata that are needed to store and display this information is what sets GeoTiffs apart from normal Tiffs. For instance, the positions of geographic features on the screen and how they are projected are two such pieces of data that can be found within a GeoTiff, but is absent from a normal Tiff file.
Because GeoTiffs are Tiffs with extended features, they both have the same file structure. There exist three components that can be found in all Tiff files: the header, the image file directory, and the actual image data. Within these files, the directories and image data can be found at any point within the file; regardless of how the images are presented when the file is opened and viewed. The header is the only section which has a constant location, and that is at the begining of the file.
As stated earlier, the header is found at the beginning of every Tiff file, including GeoTiffs. All Tiff files have the exact same header size of 8 bytes. The first two bytes of the header are used to determine the `ByteOrder` of the file, also known as "Endianness". After these two, comes the next two bytes which are used to determine the file's magic number. `.tif`, `.txt`, `.shp`, and all other file types have a unique identifier number that tells the program kind of file it was given. For Tiff files, the magic number is 42. Due to how the other components can be situated anywhere within the file, the last 4 bytes of the header provide the offset value that points to the first file directory. Without this offset, it would be impossible to read a Tiff file.For every image found in a Tiff file there exists a corresponding image file
directory for that picture. Each property listed in the directory is
referred to as a Tag
. Tag
s contain information on, but not limited to,
the image size, compression types, and the type of color plan. Since we're
working with Geotiffs, geo-spatial information is also documented within the
Tag
s. These directories can vary in size, as users can create their own
tags and each image in the file does not need to have exact same tags.
Other than image attributes, the file directory holds two offset values that play a role in reading the file. One points to where the actual image itself is located, and the other shows where the the next file directory can be found.
A Tiff file can store any number of images within a single file, including none at all. In the case of GeoTiffs, the images themselves are almost always stored as bitmap data. It is important to understand that there are two ways in which the actual image data is formatted within the file. The two methods are: Striped and Tiled.
Striped storage breaks the image into segments of long, vertical bands that stretch the entire width of the picture. Contained within them are columns of bitmapped image data. If your GeoTiff file was created before the realse of Tiff 6.0, then this is the data storage method in which it most likely uses.
If an image has strip storage, then its corresponding file directory contains
the tags: RowsPerStrip
, StripOffsets
, and StripByteCount
. All three of
these are needed to read that given segment. The first one is the number of rows
that are contained within the strips. Every strip within an image must have the
same number of rows within it except for the last one in certain instances.
StripOffsets
is an array of offsets that shows where each strip starts within
the file. The last tag, ByteSegmentCount
, is also an array of values that
contains the size of each strip in terms of Bytes.
Tiff 6.0 introduced a new way to arrange and store data within a Tiff, tiled storage. These rectangular segments have both a height and a width that must be divisible by 16. There are instances where the tiled grid does not fit the image exactly. When this occurs, padding is added around the image so as to meet the requirement of each tile having dimensions of a factor of 16.
As with stips, tiles have specific tags that are needed in order to process each
segment. These new tags are: TileWidth
, TileLength
, TileOffsets
, and
TileByteCounts
. TileWidth
is the number of columns and TileLength
is the
number of rows that are found within the specified tile. As with striped,
TileOffsets
and TileByteCounts
are arrays that contain the begining offset
and the byte count of each tile in the image, respectively.
There exists two ways in which to describe a location in GeoTiffs. One is in Map coordinates which use X and Y values. X's are oriented along the horizontal axis and run from west to east while Y's are on the vertical axis and run from south to north. Thus the further east you are, the larger your X value ; and the more north you are the larger your Y value.
The other method is to use the grid coordinate system. This technique of measurement uses Cols and Rows to describe the relative location of things. Cols run east to west whereas Rows run north to south. This then means that Cols increase as you go east to west, and rows increase as you go north to south.
In some instances, your GeoTiff may contain an amount of data so large that it can no longer be described as a Tiff, but rather by a new name, BigTiff. In order to qualify as a BigTiff, your file needs to be at least 4gb in size or larger. At this point, the methods used to store and find data need to be changed. The accommodation that is made is to change the size of the various offsets and byte counts of each segment. For a normal Tiff, this size is 32-bits, but BigTiffs have these sizes at 64-bit. GeoTrellis supports BigTiffs without any issue, so one need not worry about size when working with their files.
Typeclasses are a common feature of Functional Programming. As stated in the FAQ, typeclasses group data types by what they can do, as opposed to by what they are. If traditional OO inheritance arranges classes in a tree hierarchy, typeclasses arrange them in a graph.
Typeclasses are realized in Scala through a combination of trait
s and
implicit
class wrappings. A typeclass constraint is visible in a
class/method signature like this:
class Foo[A: Order](a: A) { ... }
Meaning that Foo
can accept any A
, so long as it is "orderable". In reality,
this in syntactic sugar for the following:
class Foo[A](a: A)(implicit ev: Order[A]) { ... }
Here's a real-world example from GeoTrellis code:
protected def _write[
K: AvroRecordCodec: JsonFormat: ClassTag,
V: AvroRecordCodec: ClassTag,
M: JsonFormat: GetComponent[?, Bounds[K]]
](layerId: LayerId, rdd: RDD[(K, V)] with Metadata[M], keyIndex: KeyIndex[K]): Unit = { ... }
A few things to notice:
- Multiple constraints can be given to a single type variable:
K: Foo: Bar: Baz
?
refers toM
, helping the compiler with type inference. UnfortunatelyM: GetComponent[M, Bounds[K]]
is not syntactically possible
Below is a description of the most-used typeclasses used in GeoTrellis. All are written by us, unless otherwise stated.
ClassTag
Built-in from scala.reflect
. This allows classes to maintain some type
information at runtime, which in GeoTrellis is important for serialization.
You will never need to use this directly, but may have to annotate your
methods with it (the compiler will let you know).
JsonFormat
From the spray
library. This constraint says that its type can be
converted to and from JSON, like this:
def toJsonAndBack[A: JsonFormat](a: A): A = {
val json: Value = a.toJson
json.convertTo[A]
}
AvroRecordCodec
Any type that can be serialized by Apache Avro.
While references to AvroRecordCodec
appear frequently through GeoTrellis
code, you will never need to use its methods. They are used internally by
our Tile Layer Backends and Spark.
Boundable
Always used on K
, Boundable
means your key type has a finite bound.
trait Boundable[K] extends Serializable {
def minBound(p1: K, p2: K): K
def maxBound(p1: K, p2: K): K
... // etc
}
Component
Component
is a bare-bones Lens
. A Lens
is a pair of functions that
allow one to generically get and set values in a data structure. They are
particularly useful for nested data structures. Component
looks like this:
trait Component[T, C] extends GetComponent[T, C] with SetComponent[T, C]
Which reads as "if I have a T
, I can read a C
out of it" and "if I have
a T
, I can write some C
back into it". The lenses we provide are as follows:
SpatialComponent[T]
- read aSpatialKey
out of a someT
(usuallySpatialKey
orSpaceTimeKey
)TemporalComponent[T]
- read aTemporalKey
of someT
(usuallySpaceTimeKey
)
Functor
A Functor is anything that maintains its shape and semantics when map
'd
over. Things like List
, Map
, Option
and even Future
are Functors.
Set
and binary trees are not, since map
could change the size of a Set
and the semantics of BTree
.
Vanilla Scala does not have a Functor
typeclass, but implements its
functionality anyway. Libraries like Cats and
ScalaZ provide a proper Functor
, but
their definitions don't allow further constraints on your inner type. We
have:
trait Functor[F[_], A] extends MethodExtensions[F[A]]{
/** Lift `f` into `F` and apply to `F[A]`. */
def map[B](f: A => B): F[B]
}
which allows us to do:
def foo[M[_], K: SpatialComponent: λ[α => M[α] => Functor[M, α]]](mk: M[K]) { ... }
which says "M
can be mapped into, and the K
you find is guaranteed to
have a SpatialComponent
as well".
Data Structures: CRS
, LatLng
, WebMercator
, ConusAlbers
In GIS, a projection is a mathematical transformation of Latitude/Longitude coordinates on a sphere onto some other flat plane. Such a plane is naturally useful for representing a map of the earth in 2D. A projection is defined by a Coordinate Reference System (CRS), which holds some extra information useful for reprojection. CRSs themselves have static definitions, have agreed-upon string representations, and are usually made public by standards bodies or companies. They can be looked up at SpatialReference.org.
A reprojection is the transformation of coorindates in one CRS to another.
To do so, coordinates are first converted to those of a sphere. Every CRS
knows how to convert between its coordinates and a sphere's, so a
transformation CRS.A -> CRS.B -> CRS.A
is actually CRS.A -> Sphere -> CRS.B -> Sphere -> CRS.A
. Naturally some floating point error does
accumulate during this process.
Within the context of GeoTrellis, the main projection-related object is the
CRS
trait. It stores related CRS
objects from underlying libraries, and
also provides the means for defining custom reprojection methods, should the
need arise.
Here is an example of using a CRS
to reproject a Line
:
val wm = Line(...) // A `LineString` vector object in WebMercator.
val ll: Line = wm.reproject(WebMercator, LatLng) // The Line reprojected into LatLng.
Data structures: Extent
, ProjectedExtent
, TemporalProjectedExtent
,
GridExtent
, RasterExtent
An Extent
is a rectangular section of a 2D projection of the Earth. It is
represented by two coordinate pairs that are its "min" and "max" corners in
some Coorindate Reference System. "min" and "max" here are CRS specific, as
the location of the point (0,0)
varies between different CRS. An Extent
can also be referred to as a Bounding Box.
Within the context of GeoTrellis, the points within an Extent
always
implicitely belong to some CRS
, while a ProjectedExtent
holds both the
original Extent
and its current CRS
.
Here are some useful Extent
operations, among many more:
Extent.translate: (Double, Double) => Extent
Extent.distance: Extent => Double
Extent.contains: Extent => Boolean
Extent.intersection: Extent => Option[Extent]
ProjectedExtent.reproject: CRS => Extent
Extent
s are most often used to represent the area of an entire Tile layer,
and also the individual Tile
s themselves (especially in the case of
Raster
s).