Geographical Information Systems (GIS), like any specialized field, has a wealth of jargon and unique concepts. When represented in software, these concepts can sometimes be skewed or expanded from their original forms. We give a thorough definition of many of the core concepts here, while referencing the Geotrellis objects and source files backing them.
This document aims to be informative to new and experienced GIS users alike. If GIS is brand, brand new to you, this document is a useful high level overview.
- Tile: A grid of numeric cells that represent some data on the Earth.
- Cell: A single unit of data in some grid, also called a Location in GIS.
- Layer: or "Tile Layer", this is a grid (or cube) of Tiles.
- Zoom Layer: a Tile Layer at some zoom level.
- Key: Used to index a Tile in a grid (or cube) of them.
- Key Index: Used to transform higher-dimensional Keys into one dimension.
- Metadata: or "Layer Metadata", stores information critical to Tile Layer IO.
- Layout Definition: A description of a Tile grid (its dimensions, etc).
- Extent: or "Bounding Box", represents some area on the Earth.
- Raster: A Tile with an Extent.
- Vector: or "Geometry", these are Point, Line, and Polygon data.
- Feature: A Geometry with some associated metadata.
- RDD: "Resilient Distributed Datasets" from Apache
Spark. Can be thought of as a highly
distributed Scala
Seq
.
These definitions are expanded upon in other sections of this document.
Tile layers (of Rasters or otherwise) are represented in GeoTrellis with
the type RDD[(K, V)] with Metadata[M]
. This type is used extensively
across the code base, and its contents form the deepest compositional
hierarchy we have:
In this diagram:
CustomTile
,CustomMetadata
, andCustomKey
don't exist, they represent types that you could write yourself for your application.- The
K
seen in several places is the sameK
. - The type
RDD[(K, V)] with Metadata[M]
is a Scala Anonymous Type. In this case, it meansRDD
from Apache Spark with extra methods injected from theMetadata
trait. This type is sometimes aliased in GeoTrellis asContextRDD
. RDD[(K, V)]
resembles a ScalaMap[K, V]
, and in fact has furtherMap
-like methods injected by Spark when it takes this shape. See Spark's PairRDDFunctions Scaladocs for those methods. Note: UnlikeMap
, theK
s here are not guaranteed to be unique.
A common specification of RDD[(K, V)] with Metadata[M]
in GeoTrellis
is as follows:
type TileLayerRDD[K] = RDD[(K, Tile)] with Metadata[TileLayerMetadata[K]]
This type represents a grid (or cube!) of Tile
s on the earth,
arranged according to some K
. Features of this grid are:
- Grid location
(0, 0)
is the top-leftmostTile
. - The
Tile
s exist in some CRS. InTileLayerMetadata
, this is kept track of with an actualCRS
field. - In applications,
K
is mostlySpatialKey
orSpaceTimeKey
.
Layer IO requires a Tile Layer Backend. Each
backend has an AttributeStore
, a LayerReader
, and a
LayerWriter
.
Example setup (with our File
system backend):
import geotrellis.spark._
import geotrellis.spark.io._
import geotrellis.spark.io.file._
val catalogPath: String = ... /* Some location on your computer */
val store: AttributeStore = FileAttributeStore(catalogPath)
val reader = FileLayerReader(store)
val writer = FileLayerWriter(store)
Writing an entire layer:
/* Zoom level 13 */
val layerId = LayerId("myLayer", 13)
/* Produced from an ingest, etc. */
val rdd: TileLayerRDD[SpatialKey] = ...
/* Order your Tiles according to the Z-Curve Space Filling Curve */
val index: KeyIndex[SpatialKey] = ZCurveKeyIndexMethod.createIndex(rdd.metadata.bounds)
/* Returns `Unit` */
writer.write(layerId, rdd, index)
Reading an entire layer:
/* `.read` has many overloads, but this is the simplest */
val sameLayer: TileLayerRDD[SpatialKey] = reader.read(layerId)
Querying a layer (a "filtered" read):
/* Some area on the earth to constrain your query to */
val extent: Extent = ...
/* There are more types that can go into `where` */
val filteredLayer: TileLayerRDD[SpatialKey] =
reader.query(layerId).where(Intersects(extent)).result
As mentioned in the Tile Layers section, grids (or
cubes) of Tile
s on the earth are organized by keys. This key,
often refered to generically as K
, is typically a SpatialKey
or
a SpaceTimeKey
:
case class SpatialKey(col: Int, row: Int)
case class SpaceTimeKey(col: Int, row: Int, instant: Long)
although there is nothing stopping you from defining your own key type.
Assuming some tile layer Extent
on the earth, SpatialKey(0, 0)
would index the top-leftmost Tile
in the Tile grid.
When doing Layer IO, certain optimizations can be performed if we know
that Tile
s stored near each other in a filesystem or database
(like Accumulo or HBase) are also spatially-close in the grid they're
from. To make such a guarantee, we use a KeyIndex
.
A KeyIndex
is a GeoTrellis trait
that represents **Space
Filling Curves**.
They are a means by which to translate multi-dimensional indices into a
single-dimensional one, while maintaining spatial locality. In
GeoTrellis, we use these chiefly when writing Tile Layers to one of our
Tile Layer Backends.
Although KeyIndex
is often used in its generic trait
form, we
supply three underlying implementations.
The Z-Curve is the simplest KeyIndex
to use (and implement). It can
be used with both SpatialKey
and SpaceTimeKey
.
val b0: KeyBounds[SpatialKey] = ... /* from `TileLayerRDD.metadata.bounds` */
val b1: KeyBounds[SpaceTimeKey] = ...
val i0: KeyIndex[SpatialKey] = ZCurveKeyIndexMethod.createIndex(b0)
val i1: KeyIndex[SpaceTimeKey] = ZCurveKeyIndexMethod.byDay().createIndex(b1)
val k: SpatialKey = ...
val oneD: Long = i0.toIndex(k) /* A SpatialKey's 2D coords mapped to 1D */
Another well-known curve, available for both SpatialKey
and
SpaceTimeKey
.
val b: KeyBounds[SpatialKey] = ...
val index: KeyIndex[SpatialKey] = HilbertKeyIndexMethod.createIndex(b)
Changing the resolution (in bits) of the index causes a rotation and/or reflection of the points with respect to curve-order. Take, for example the following code (which is actually derived from the testing codebase):
HilbertSpaceTimeKeyIndex(SpaceTimeKey(0,0,y2k), SpaceTimeKey(2,2,y2k.plusMillis(1)),2,1)
The last two arguments are the index resolutions. If that were changed to:
HilbertSpaceTimeKeyIndex(SpaceTimeKey(0,0,y2k), SpaceTimeKey(2,2,y2k.plusMillis(1)),3,1)
The index-order of the points would be different. The reasons behind
this are ultimately technical, though you can imagine how a naive
implementation of an index for, say, a 10x10 matrix (in terms of 100
numbers) would need to be reworked if you were to change the number of
cells (100 would no longer be enough for an 11x11 matrix and the pattern
for indexing you chose may no longer make sense). Obviously, this is
complex and beyond the scope of GeoTrellis' concerns, which is why we
lean on Google's uzaygezen
library.
Currently, the spatial and temporal resolution required to index the points, expressed in bits, must sum to 62 bits or fewer.
For example, the following code appears in
HilbertSpaceTimeKeyIndex.scala
:
@transient
lazy val chc = {
val dimensionSpec =
new MultiDimensionalSpec(
List(
xResolution,
yResolution,
temporalResolution
).map(new java.lang.Integer(_))
)
}
where xResolution
, yResolution
and temporalResolution
are
numbers of bits required to express possible locations in each of those
dimensions. If those three integers sum to more than 62 bits, an error
will be thrown at runtime.
Row Major is only available for SpatialKey
, but provides the fastest
toIndex
lookup of the three curves. It doesn't however, give good
locality guarantees, so should only be used when locality isn't as
important to your application.
val b: KeyBounds[SpatialKey] = ...
val index: KeyIndex[SpatialKey] = RowMajorKeyIndexMethod.createIndex(b)
Tile
is a core GeoTrellis primitive. As mentioned in Tile
Layers, a common specification of
RDD[(K, V)] with Metadata[M]
is:
type TileLayerRDD[K] = RDD[(K, Tile)] with Metadata[TileLayerMetadata[K]]
What is a Tile
exactly? Below is a diagram of our Tile
type
hierarchy. As you can see, any Tile
(via CellGrid
) is
effectively a grid of data cells:
The Tile
trait has operations you'd expect for traversing and
transforming this grid, like:
map: (Int => Int) => Tile
foreach: (Int => Unit) => Unit
combine: Tile => ((Int, Int) => Int) => Tile
color: ColorMap => Tile
Critically, a Tile
must know how big it is, and what its underlying
Cell Type is:
cols: Int
rows: Int
cellType: CellType
Fundamentally, the union of a Tile
and Extent
is how GeoTrellis
defines a Raster
:
case class Raster[+T <: CellGrid](tile: T, extent: Extent) extends CellGrid
For performance reasons, we have opted for Tile
to hold its
CellType
as opposed to making Tile
polymorphic on its underlying
numeric type, for example like trait Tile[T]
. The large type
hierarchy above is what results from this decision. For more
information, see our notes on Tile
performance.
- A
CellType
is a data type plus a policy for handling cell values that may contain no data. - By 'data type' we shall mean the underlying numerical representation
of a
Tile
's cells. NoData
, for performance reasons, is not represented as a value outside the range of the underlying data type (as, e.g.,None
) - if each cell in some tile is aByte
, theNoData
value of that tile will exist within the range [Byte.MinValue
(-128),Byte.MaxValue
(127)].- If attempting to convert between
CellTypes
, see this note onCellType
conversions.
No NoData | Constant NoData | User Defined NoData | |
---|---|---|---|
BitCells | ``BitCellType `` | N/A | N/A |
ByteCells | ByteCellTyp
e |
ByteConstantNoDataCel
lType |
ByteUserDefinedNoDataCe
llType |
UbyteCell s | UByteCellTy
pe |
UByteConstantNoDataCe
llType |
UByteUserDefinedNoDataC
ellType |
ShortCell s | ShortCellTy
pe |
ShortConstantNoDataCe
llType |
ShortUserDefinedNoDataC
ellType |
UShortCel ls | UShortCellT
ype |
UShortConstantNoDataC
ellType |
UShortUserDefinedNoData
CellType |
IntCells | ``IntCellType `` | IntConstantNoDataCell
Type |
IntUserDefinedNoDataCel
lType |
FloatCell s | FloatCellTy
pe |
FloatConstantNoDataCe
llType |
FloatUserDefinedNoDataC
ellType |
DoubleCel ls | DoubleCellT
ype |
DoubleConstantNoDataC
ellType |
DoubleUserDefinedNoData
CellType |
The above table lists CellType
DataType
s in the leftmost
column and NoData
policies along the top row. A couple of points are
worth making here:
- Bits are incapable of representing on, off, and some
NoData
value. As a consequence, there is no such thing as a Bit-backed tile which recognizesNoData
. - While the types in the 'No NoData' and 'Constant NoData' are simply
singleton objects that are passed around alongside tiles, the greater
configurability of 'User Defined NoData'
CellType
s means that they require a constructor specifying the value which will count asNoData
.
Let's look to how this information can be used:
/** Here's an array we'll use to construct tiles */
val myData = Array(42, 1, 2, 3)
/** The GeoTrellis-default integer CellType
* Note that it represents `NoData` values with the smallest signed
* integer possible with 32 bits (Int.MinValue or -2147483648).
*/
val defaultCT = IntConstantNoDataCellType
val normalTile = IntArrayTile(myData, 2, 2, defaultCT)
/** A custom, 'user defined' NoData CellType for comparison; we will
* treat 42 as NoData for this one rather than Int.MinValue
*/
val customCellType = IntUserDefinedNoDataValue(42)
val customTile = IntArrayTile(myData, 2, 2, customCellType)
/** We should expect that the first (default celltype) tile has the value 42 at (0, 0)
* This is because 42 is just a regular value (as opposed to NoData)
* which means that the first value will be delivered without surprise
*/
assert(normalTile.get(0, 0) == 42)
assert(normalTile.getDouble(0, 0) == 42.0)
/** Here, the result is less obvious. Under the hood, GeoTrellis is
* inspecting the value to be returned at (0, 0) to see if it matches our
* `NoData` policy and, if it matches (it does, we defined NoData as
* 42 above), return Int.MinValue (no matter your underlying type, `get`
* on a tile will return an `Int` and `getDouble` will return a `Double`).
*
* The use of Int.MinValue and Double.NaN is a result of those being the
* GeoTrellis-blessed values for NoData - below, you'll find a chart that
* lists all such values in the rightmost column
*/
assert(customTile.get(0, 0) == Int.MinValue)
assert(customTile.getDouble(0, 0) == Double.NaN)
A point which is perhaps not intuitive is that get
will always
return an Int
and getDouble
will always return a Double
.
Representing NoData demands, therefore, that we map other celltypes'
NoData
values to the native, default Int
and Double
NoData
values. NoData
will be represented as Int.MinValue
or
Double.Nan
.
In most programming contexts, it isn't all that useful to think carefully about the number of bits necessary to represent the data passed around by a program. A program tasked with keeping track of all the birthdays in an office or all the accidents on the New Jersey turnpike simply doesn't benefit from carefully considering whether the allocation of those extra few bits is really worth it. The costs for any lack of efficiency are more than offset by the savings in development time and effort. This insight - that computers have become fast enough for us to be forgiven for many of our programming sins - is, by now, truism.
An exception to this freedom from thinking too hard about implementation details is any software that tries, in earnest, to provide the tools for reading, writing, and working with large arrays of data. Rasters certainly fit the bill. Even relatively modest rasters can be made up of millions of underlying cells. Additionally, the semantics of a raster imply that each of these cells shares an underlying data type. These points - that rasters are made up of a great many cells and that they all share a backing data type - jointly suggest that a decision regarding the underlying data type could have profound consequences. More on these consequences below.
Compliance with the GeoTIFF standard is another reason that management of cell types is important for GeoTrellis. The most common format for persisting a raster is the GeoTIFF. A GeoTIFF is simply an array of data along with some useful tags (hence the 'tagged' of 'tagged image file format'). One of these tags specifies the size of each cell and how those bytes should be interpreted (i.e. whether the data for a byte includes its sign - positive or negative - or whether it counts up from 0 - and is therefore said to be 'unsigned').
In addition to keeping track of the memory used by each cell in a
Tile
, the cell type is where decisions about which values count as
data (and which, if any, are treated as NoData
). A value recognized
as NoData
will be ignored while mapping over tiles, carrying out
focal operations on them, interpolating for values in their region, and
just about all of the operations provided by GeoTrellis for working with
Tile
s.
There are at least two major reasons for giving some thought to the types of data you'll be working with in a raster: persistence and performance.
Persistence is simple enough: smaller datatypes end up taking less space
on disk. If you're going to represent a region with only
true
/false
values on a raster whose values are Double
s,
63/64 bits will be wasted. Naively, this means somewhere around 63 times
less data than if the most compact form possible had been chosen (the
use of BitCells
would be maximally efficient for representing the
bivalent nature of boolean values). See the chart below for a sense of
the relative sizes of these cell types.
The performance impacts of cell type selection matter in both a local and a distributed (spark) context. Locally, the memory footprint will mean that as larger cell types are used, smaller amounts of data can be held in memory and worked on at a given time and that more CPU cache misses are to be expected. This latter point - that CPU cache misses will increase - means that more time spent shuffling data from the memory to the processor (which is often a performance bottleneck). When running programs that leverage spark for compute distribution, larger data types mean more data to serialize and more data send over the (very slow, relatively speaking) network.
In the chart below, DataType
s are listed in the leftmost column
and important characteristics for deciding between them can be found to
the right. As you can see, the difference in size can be quite stark
depending on the cell type that a tile is backed by. That extra space is
the price paid for representing a larger range of values. Note that bit
cells lack the sufficient representational resources to have a
NoData
value.
Bits / Cell | 512x512 Raster (mb) | Range (inclusive) | GeoTrellis NoData Value | |
---|---|---|---|---|
BitCells | 1 | 0.032768 | [0, 1] | N/A |
ByteCells | 8 | 0.262144 | [-128, 128] | -128 (Byte.MinValue ) |
UbyteCells | 8 | 0.262144 | [0, 255] | 0 |
ShortCells | 16 | 0.524288 | [-32768, 32767] | -32768 (Short.MinValue ) |
UShortCells | 16 | 0.524288 | [0, 65535] | 0 |
IntCells | 32 | 1.048576 | [-2147483648, 2147483647] | -2147483648 (Int.MinValue ) |
FloatCells | 32 | 1.048576 | [-3.40E38, 3.40E38] | Float.NaN |
DoubleCells | 64 | 2.097152 | [-1.79E308, 1.79E308] | Double.NaN |
One final point is worth making in the context of CellType
performance: the Constant
types are able to depend upon macros which
inline comparisons and conversions. This minor difference can certainly
be felt while iterating through millions and millions of cells. If
possible, Constant NoData
values are to be preferred. For
convenience' sake, we've attempted to make the GeoTrellis-blessed
NoData
values as unobtrusive as possible a priori.
The limits of expected return types (discussed in the previous section)
is used by macros to squeeze as much speed out of the JVM as possible.
Check out our macros
docs for more on our
use of macros like isData
and isNoData
.
“Yes raster is faster, but raster is vaster and vector just SEEMS more corrector.” — C. Dana Tomlin
The entire purpose of geotrellis.raster
is to provide primitive
datatypes which implement, modify, and utilize rasters. In GeoTrellis, a
raster is just a Tile
with an associated Extent
. A tile is just
a two-dimensional collection of evenly spaced data. Tiles are a lot like
certain sequences of sequences (this array of arrays is like a 3x3
tile):
// not real syntax
val myFirstTile = [[1,1,1],[1,2,2],[1,2,3]]
/** It probably looks more like your mental model if we stack them up:
* [[1,1,1],
* [1,2,2],
* [1,2,3]]
*/
In the raster
module of GeoTrellis, the base type of tile is just
Tile
. All GeoTrellis compatible tiles will have inherited from that
base class, so if you find yourself wondering what a given type of
tile's powers are, that's a decent place to start your search. Here's an
incomplete list of the types of things on offer:
- Mapping transformations of arbitrary complexity over the constituent cells
- Carrying out operations (side-effects) for each cell
- Querying a specific tile value
- Rescaling, resampling, cropping
As we've already discussed, tiles are made up of squares which contain
values. We'll sometimes refer to these value-boxes as cells. And, just
like cells in the body, though they are discrete units, they're most
interesting when looked at from a more holistic perspective - rasters
encode relations between values in a uniform space and it is usually
these relations which most interest us. The code found in the
mapalgebra
submodule — discussed later in this document — is all
about exploiting these spatial relations.
One of the first questions you'll ask yourself when working with
GeoTrellis is what kinds of representation best models the domain you're
dealing with. What types of value do you need your raster to hold? This
question is the province of GeoTrellis CellType
s.
With a grasp of tiles and CellType
s, we've got all the conceptual
tools necessary to construct our own tiles. Now, since a tile is a
combination of a CellType
with which its cells are encoded and their
spatial arrangement, we will have to somehow combine Tile
(which
encodes our expectations about how cells sit with respect to one
another) and the datatype of our choosing. Luckily, GeoTrellis has done
this for us. To keep its users sane, the wise maintainers of GeoTrellis
have organized geotrellis.raster
such that fully reified tiles sit
at the bottom of an pretty simple inheritance chain. Let's explore that
inheritance so that you will know where to look when your intuitions
lead you astray:
From IntArrayTile.scala
:
abstract class IntArrayTile(
val array: Array[Int],
cols: Int,
rows: Int
) extends MutableArrayTile { ... }
From DoubleArrayTile.scala
:
abstract class DoubleArrayTile(
val array: Array[Double],
cols: Int,
rows: Int
) extends MutableArrayTile { ... }
Both IntArrayTile
and DoubleArrayTile
are themselves extended by
other child classes, but they are a good place to start. Critically,
they are both MutableArrayTile
s, which adds some nifty methods for
in-place manipulation of cells (GeoTrellis is about performance, so this
minor affront to the gods of immutability can be forgiven). From
MutableArrayTile.scala:
trait MutableArrayTile extends ArrayTile { ... }
One level up is ArrayTile
. It's handy because it implements the
behavior which largely allows us to treat our tiles like big, long
arrays of (arrays of) data. They also have the trait Serializable
,
which is neat any time you can't completely conduct your business within
the neatly defined space-time of the JVM processes which are running on
a single machine (this is the point of GeoTrellis' Spark integration).
From ArrayTile.scala:
trait ArrayTile extends Tile with Serializable { ... }
At the top rung in our abstraction ladder we have Tile
. You might be
surprised how much we can say about tile behavior from the base of its
inheritance tree, so the source is worth reading directly. From
Tile.scala:
trait Tile extends CellGrid with ... { ... }
Where CellGrid
and its parent Grid
just declare something to be
- you guessed it - a grid of numbers with an explicit CellType
.
As it turns out, CellType
is one of those things that we can
mostly ignore once we've settled on which one is proper for our
domain. After all, it appears as though there's very little difference
between tiles that prefer int-like things and tiles that prefer
double-like things.
CAUTION: While it is true, in general, that operations areCellType
agnostic, bothget
andgetDouble
are methods implemented onTile
. In effect, this means that you'll want to be careful when querying values. If you're working with int-likeCellType
s, probably useget
. If you're working with float-likeCellType
s, usually you'll wantgetDouble
.
In the repl, you can try this out to construct a simple Raster
:
import geotrellis.raster._
import geotrellis.vector._
scala> IntArrayTile(Array(1,2,3),1,3)
res0: geotrellis.raster.IntArrayTile = IntArrayTile([S@338514ad,1,3)
scala> IntArrayTile(Array(1,2,3),3,1)
res1: geotrellis.raster.IntArrayTile = IntArrayTile([S@736a81de,3,1)
scala> IntArrayTile(Array(1,2,3,4,5,6,7,8,9),3,3)
res2: geotrellis.raster.IntArrayTile = IntArrayTile([I@5466441b,3,3)
scala> Extent(0, 0, 1, 1)
res4: geotrellis.vector.Extent = Extent(0.0,0.0,1.0,1.0)
scala> Raster(res2, res4)
res5: geotrellis.raster.Raster = Raster(IntArrayTile([I@7b47ab7,1,3),Extent(0.0,0.0,1.0,1.0))
Here's a fun method for exploring your tiles:
scala> res0.asciiDraw()
res3: String =
" 1
2
3
"
scala> res2.asciiDraw()
res4: String =
" 1 2 3
4 5 6
7 8 9
"
That's probably enough to get started. geotrellis.raster
is a pretty
big place, so you'll benefit from spending a few hours playing with the
tools it provides.
“Raster is faster but vector is correcter.” — Somebody
In addition to working with raster data, Geotrellis provides a number of
tools for the creation, representation, and modification of vector data.
The data types central to this functionality
(geotrellis.vector.Feature
and geotrellis.vector.Geometry
)
correspond - and not by accident - to certain objects found in the
GeoJson spec. Feature
s
correspond to the objects listed under features
in a geojson
FeatureCollection
. Geometry
s, to geometries
in a geojson
Feature
.
The base Geometry
class can be found in Geometry.scala
. Concrete
geometries include:
geotrellis.vector.Point
geotrellis.vector.MultiPoint
geotrellis.vector.Line
geotrellis.vector.MultiLine
geotrellis.vector.Polygon
geotrellis.vector.MultiPolygon
geotrellis.vector.GeometryCollection
Working with these geometries is a relatively straightforward affair. Let's take a look:
import geotrellis.vector._
/** First, let's create a Point. Then, we'll use its intersection method.
* Note: we are also using intersection's alias '&'.
*/
val myPoint = Point(1.0, 1.1) // Create a point
// Intersection method
val selfIntersection = myPoint intersection Point(1.0, 1.1)
// Intersection alias
val nonIntersection = myPoint & Point(200, 300)
At this point, the values selfIntersection
and nonIntersection
are GeometryResult
containers. These containers are what many JTS
operations on Geometry
objects will wrap their results in. To
idiomatically destructure these wrappers, we can use the
as[G <: Geometry]
function which either returns Some(G)
or
None
.
val pointIntersection = (Point(1.0, 2.0) & Point(1.0, 2.0)).as[Point]
val pointNonIntersection = (Point(1.0, 2.0) & Point(12.0, 4.0)).as[Point]
assert(pointIntersection == Some(Point(1.0, 2.0))) // Either some point
assert(pointNonIntersection == None) // Or nothing at all
As convenient as as[G <: Geometry]
is, it offers no guarantees about
the domain over which it ranges. So, while you can expect a neatly
packaged Option[G <: Geometry]
, it isn't necessarily the case that
the GeometryResult
object produced by a given set of operations is
possibly convertable to the Geometry
subtype you choose. For
example, a PointGeometryIntersectionResult.as[Polygon]
will always
return None
.
An alternative approach uses pattern matching and ensures an exhaustive
check of the results. geotrellis.vector.Results
contains a
large ADT which
encodes the possible outcomes for different types of outcomes. The result
type of a JTS-dependent vector operation can be found somewhere on this tree
to the effect that an exhaustive match can be carried out to determine the
Geometry
(excepting cases of NoResult
, for which there is no
Geometry
).
For example, we note that a Point
/Point
intersection has the
type PointOrNoResult
. From this we can deduce that it is either a
Point
underneath or else nothing:
scala> import geotrellis.vector._
scala> p1 & p2 match {
| case PointResult(_) => println("A Point!)
| case NoResult => println("Sorry, no result.")
| }
A Point!
Beyond the methods which come with any Geometry
object there are
implicits in many geotrellis modules which will extend Geometry
capabilities. For instance, after importing geotrellis.vector.io._
,
it becomes possible to call the toGeoJson
method on any
Geometry
:
import geotrellis.vector.io._
assert(Point(1,1).toGeoJson == """{"type":"Point","coordinates":[1.0,1.0]}""")
If you need to move from a geometry to a serialized representation or
vice-versa, take a look at the io
directory's contents. This naming
convention for input and output is common throughout Geotrellis. So if
you're trying to get spatial representations in or out of your program,
spend some time seeing if the problem has already been solved.
Methods which are specific to certain subclasses of Geometry
exist
too. For example, geotrellis.vector.MultiLine
is implicitly extended
by geotrellis.vector.op
such that this becomes possible:
import geotrellis.vector.op._
val myML = MultiLine.EMPTY
myML.unionGeometries
The following packages extend Geometry
capabilities:
- geotrellis.vector.io.json
- geotrellis.vector.io.WKT
- geotrellis.vector.io.WKB
- geotrellis.vector.op
- geotrellis.vector.op.affine
- geotrellis.vector.reproject
The Feature
class is odd at first glance; it thinly wraps one of the
afforementioned Geometry
objects along with some type of data. Its
purpose will be clear if you can keep in mind the importance of the
geojson format of serialization which is now ubiquitous in the GIS
software space. It can be found in Feature.scala
.
Let's examine some source code so that this is all a bit clearer. From
geotrellis.vector.Feature.scala
:
abstract class Feature[D] {
type G <: Geometry
val geom: G ; val data: D
}
case class PointFeature[D](geom: Point, data: D) extends Feature[D] {type G = Point}
These type signatures tell us a good deal. Let's make this easy on
ourselves and put our findings into a list. - The type G
is some
instance or
other
of Geometry
(which we explored just above).
- The value,
geom
, which anything the compiler recognizes as aFeature
must make available in its immediate closure must be of typeG
. - As with
geom
the compiler will not be happy unless aFeature
providesdata
. - Whereas, with
geom
, we could say a good deal about the types of stuff (only things we call geometries) that would satisfy the compiler, we have nothing in particular to say aboutD
.
Our difficulty with D
is shared by the Point
-focused feature,
PointFeature
. PointFeature
uses Point
(which is one of the
concrete instances of Geometry
introduced above) while telling us
nothing at all about data
's type. This is just sugar for passing
around a Point
and some associated metadata.
Let's look at some code which does something with D (code which calls one of D's methods) so that we know what to expect. Remember: types are just contracts which the compiler is kind enough to enforce. In well-written code, types (and type variables) can tell us a great deal about what was in the head of the author.
There's only one package
which does anything with D
, so the
constraints (and our job) should be relatively easy. In
geotrellis.vector.io.json.FeatureFormats
there are
ContextBound
s on D
which ensure that they have JsonReader,
JsonWriter, and JsonFormat implicits available (this is a typeclass,
and it allows for something like type-safe duck-typing).
D
's purpose is clear enough: any D
which comes with the tools
necessary for json serialization and deserialization will suffice. In
effect, data
corresponds to the "properties" member of the geojson
spec's Feature
object.
If you can provide the serialization tools (that is, implicit
conversions between some type (your D
) and spray
json), the Feature
object
in geotrellis.vector
does the heavy lifting of embedding your (thus
serializable) data into the larger structure which includes a
Geometry
. There's even support for geojson IDs: the "ID" member of a
geojson Feature is represented by the keys of a Map
from String
to Feature[D]
. Data in both the ID and non-ID variants of geojson
Feature formats is easily transformed.
These submodules define useful methods for dealing with the entities
that call geotrellis.vector
home:
geotrellis.vector.io
defines input/output (serialization) of geometriesgeotrellis.vector.op
defines common operations on geometriesgeotrellis.vector.reproject
defines methods for translating between projections
Data structures: LayoutDefinition
, TileLayout
, CellSize
A Layout Definition describes the location, dimensions of, and organization of a tiled area of a map. Conceptually, the tiled area forms a grid, and the Layout Definitions describes that grid's area and cell width/height. These definitions can be used to chop a bundle of imagery into tiles suitable for being served out on a web map.
Within the context of GeoTrellis, the LayoutDefinition
class extends
GridExtent
, and exposes methods for querying the sizes of the grid
and grid cells. Those values are stored in the TileLayout
(the grid
description) and CellSize
classes respectively.
LayoutDefinition
s are used heavily during the raster reprojection
process. Within the context of Geotrellis, the LayoutDefinition
class extends GridExtent
, and exposes methods for querying the sizes
of the grid and grid cells. Those values are stored in the
TileLayout
(the grid description) and CellSize
classes
respectively. LayoutDefinition
s are used heavily during the raster
reprojection process.
What is a Layout Scheme?
The language here can be vexing, but a LayoutScheme
can be thought
of as a factory which produces LayoutDefinition
s. It is the scheme
according to which some layout definition must be defined - a layout
definition definition, if you will. The most commonly used
LayoutScheme
is the ZoomedLayoutScheme
, which provides the
ability to generate LayoutDefinitions
for the different zoom levels
of a web-based map (e.g. Leaflet).
ProjectedExtent
s and Tile
s which cover some contiguous
area but which were derived from GeoTIFFs of varying sizes. We will
sometimes describe operations like this as 'tiling'. The method which
tiles a collection of imagery provided a LayoutDefinition
, the
underlying CellType
of the produced tiles, and the
ResampleMethod
to use for generating data at new resolutions is
tileToLayout
. Let's take a look at its use:val sourceTiles: RDD[(ProjectedExtent, Tile)] = ??? // Tiles from GeoTIFF
val cellType: CellType = IntCellType
val layout: LayoutDefinition = ???
val resamp: ResampleMethod = NearestNeighbor
val tiled: RDD[(SpatialKey, Tile)] =
tiles.tileToLayout[SpatialKey](cellType, layout, resamp)
In essence, a LayoutDefinition
is the minimum information required
to describe the tiling of some map's area in Geotrellis. The
LayoutDefinition
class extends GridExtent
, and exposes methods
for querying the sizes of the grid and grid cells. Those values are
stored in the TileLayout
(the grid description) and CellSize
classes respectively. LayoutDefinition
s are most often encountered
in raster reprojection processes.
Map Algebra is a name given by Dr. Dana Tomlin in the 1980's to a way of manipulating and transforming raster data. There is a lot of literature out there, not least the book by the guy who "wrote the book" on map algebra, so we will only give a brief introduction here. GeoTrellis follows Dana's vision of map algebra operations, although there are many operations that fall outside of the realm of Map Algebra that it also supports.
Map Algebra operations fall into 3 general categories:
Local operations are ones that only take into account the information of on cell at a time. In the animation above, we can see that the blue and the yellow cell are combined, as they are corresponding cells in the two tiles. It wouldn't matter if the tiles were bigger or smaller - the only information necessary for that step in the local operation is the cell values that correspond to each other. A local operation happens for each cell value, so if the whole bottom tile was blue and the upper tile were yellow, then the resulting tile of the local operation would be green.
Focal operations take into account a cell, and a neighborhood around that
cell. A neighborhood can be defined as a square of a specific size, or
include masks so that you can have things like circular or wedge-shaped
neighborhoods. In the above animation, the neighborhood is a 5x5 square
around the focal cell. The focal operation in the animation is a
focalSum
. The focal value is 0, and all of the other cells in the focal
neighborhood; therefore the cell value of the result tile would be 8 at the
cell corresponding to the focal cell of the input tile. This focal operation
scans through each cell of the raster. You can imagine that along the
border, the focal neighborhood goes outside of the bounds of the tile; in
this case the neighborhood only considers the values that are covered by the
neighborhood. GeoTrellis also supports the idea of an analysis area, which
is the GridBounds that the focal operation carries over, in order to support
composing tiles with border tiles in order to support distributed focal
operation processing.
Zonal operations are ones that operate on two tiles: an input tile, and a
zone tile. The values of the zone tile determine what zone each of the
corresponding cells in the input tile belong to. For example, if you are
doing a zonalStatistics
operation, and the zonal tile has a distribution
of zone 1, zone 2, and zone 3 values, we will get back the statistics such
as mean, median and mode for all cells in the input tile that correspond to
each of those zone values.
Map Algebra operations are defined as implicit methods on Tile
or
Traversable[Tile]
, which are imported with import
geotrellis.raster._
.
import geotrellis.raster._
val tile1: Tile = ???
val tile2: Tile = ???
// If tile1 and tile2 are the same dimensions, we can combine
// them using local operations
tile1.localAdd(tile2)
// There are operators for some local operations.
// This is equivalent to the localAdd call above
tile1 + tile2
// There is a local operation called "reclassify" in literature,
// which transforms each value of the function.
// We actually have a map method defined on Tile,
// which serves this purpose.
tile1.map { z => z + 1 } // Map over integer values.
tile2.mapDouble { z => z + 1.1 } // Map over double values.
tile1.dualMap({ z => z + 1 })({ z => z + 1.1 }) // Call either the integer value or double version, depending on cellType.
// You can also combine values in a generic way with the combine funciton.
// This is another local operation that is actually defined on Tile directly.
tile1.combine(tile2) { (z1, z2) => z1 + z2 }
The following packages are where Map Algebra operations are defined in GeoTrellis:
`geotrellis.raster.local
<../../raster/src/main/scala/geotrellis/raster/mapalgebra/local>`__ defines operations which act on a cell without regard to its spatial relations. Need to double every cell on a tile? This is the module you'll want to explore.`geotrellis.raster.focal
<../../raster/src/main/scala/geotrellis/raster/mapalgebra/focal>`__ defines operations which focus on two-dimensional windows (internally referred to as neighborhoods) of a raster's values to determine their outputs.`geotrellis.raster.zonal
<../..raster/src/main/scala/geotrellis/raster/mapalgebra/zonal>`__ defines operations which apply over a zones as defined by corresponding cell values in the zones raster.
Conway's Game of Life can be seen as a
focal operation in that each cell's value depends on neighboring cell
values. Though focal operations will tend to look at a local region of this
or that cell, they should not be confused with the operations which live in
geotrellis.raster.local
- those operations describe transformations over
tiles which, for any step of the calculation, need only know the input value
of the specific cell for which it is calculating an output (e.g.
incrementing each cell's value by 1).
Invented by Mapbox, VectorTiles are a combination of the ideas of finite-sized tiles and vector geometries. Mapbox maintains the official implementation spec for VectorTile codecs. The specification is free and open source.
VectorTiles are advantageous over raster tiles in that:
- They are typically smaller to store
- They can be easily transformed (rotated, etc.) in real time
- They allow for continuous (as opposed to step-wise) zoom in Slippy Maps.
Raw VectorTile data is stored in the protobuf format. Any codec implementing the spec must decode and encode data according to this .proto schema.
GeoTrellis provides the geotrellis-vectortile
module, a
high-performance implementation of Version 2.1 of the VectorTile
spec. It features:
- Decoding of Version 2 VectorTiles from Protobuf byte data into useful Geotrellis types.
- Lazy decoding of Geometries. Only parse what you need!
- Read/write VectorTile layers to/from any of our backends.
As of 2016 November, ingests of raw vector data into VectorTile sets aren't yet possible.
import geotrellis.spark.SpatialKey
import geotrellis.spark.tiling.LayoutDefinition
import geotrellis.vector.Extent
import geotrellis.vectortile.VectorTile
import geotrellis.vectortile.protobuf._
val bytes: Array[Byte] = ... // from some `.mvt` file
val key: SpatialKey = ... // preknown
val layout: LayoutDefinition = ... // preknown
val tileExtent: Extent = layout.mapTransform(key)
/* Decode Protobuf bytes. */
val tile: VectorTile = ProtobufTile.fromBytes(bytes, tileExtent)
/* Encode a VectorTile back into bytes. */
val encodedBytes: Array[Byte] = tile match {
case t: ProtobufTile => t.toBytes
case _ => ??? // Handle other backends or throw errors.
}
See our VectorTile Scaladocs for detailed usage information.
This particular implementation of the VectorTile spec makes the following assumptions:
- Geometries are implicitly encoded in ''some'' Coordinate Reference system. That is, there is no such thing as a "projectionless" VectorTile. When decoding a VectorTile, we must provide a Geotrellis [[Extent]] that represents the Tile's area on a map. With this, the grid coordinates stored in the VectorTile's Geometry are shifted from their original [0,4096] range to actual world coordinates in the Extent's CRS.
- The
id
field in VectorTile Features doesn't matter. UNKNOWN
geometries are safe to ignore.- If a VectorTile
geometry
list marked asPOINT
has only one pair of coordinates, it will be decoded as a GeotrellisPoint
. If it has more than one pair, it will be decoded as aMultiPoint
. Likewise for theLINESTRING
andPOLYGON
types. A complaint has been made about the spec regarding this, and future versions may include a difference between single and multi geometries.
GeoTiffs are a type of Tiff image file that contain image data pertaining to satellite, aerial, and elevation data among other types of geospatial information. The additional pieces of metadata that are needed to store and display this information is what sets GeoTiffs apart from normal Tiffs. For instance, the positions of geographic features on the screen and how they are projected are two such pieces of data that can be found within a GeoTiff, but is absent from a normal Tiff file.
Because GeoTiffs are Tiffs with extended features, they both have the same file structure. There exist three components that can be found in all Tiff files: the header, the image file directory, and the actual image data. Within these files, the directories and image data can be found at any point within the file; regardless of how the images are presented when the file is opened and viewed. The header is the only section which has a constant location, and that is at the begining of the file.
As stated earlier, the header is found at the beginning of every Tiff
file, including GeoTiffs. All Tiff files have the exact same header size
of 8 bytes. The first two bytes of the header are used to determine the
ByteOrder
of the file, also known as "Endianness". After these two,
comes the next two bytes which are used to determine the file's magic
number. .tif
, .txt
, .shp
, and all other file types have a
unique identifier number that tells the program kind of file it was
given. For Tiff files, the magic number is 42. Due to how the other
components can be situated anywhere within the file, the last 4 bytes of
the header provide the offset value that points to the first file
directory. Without this offset, it would be impossible to read a Tiff
file.
For every image found in a Tiff file there exists a corresponding image
file directory for that picture. Each property listed in the directory
is referred to as a Tag
. Tag
s contain information on, but not
limited to, the image size, compression types, and the type of color
plan. Since we're working with Geotiffs, geo-spatial information is also
documented within the Tag
s. These directories can vary in size, as
users can create their own tags and each image in the file does not need
to have exact same tags.
Other than image attributes, the file directory holds two offset values that play a role in reading the file. One points to where the actual image itself is located, and the other shows where the the next file directory can be found.
A Tiff file can store any number of images within a single file, including none at all. In the case of GeoTiffs, the images themselves are almost always stored as bitmap data. It is important to understand that there are two ways in which the actual image data is formatted within the file. The two methods are: Striped and Tiled.
Striped storage breaks the image into segments of long, vertical bands that stretch the entire width of the picture. Contained within them are columns of bitmapped image data. If your GeoTiff file was created before the realse of Tiff 6.0, then this is the data storage method in which it most likely uses.
If an image has strip storage, then its corresponding file directory
contains the tags: RowsPerStrip
, StripOffsets
, and
StripByteCount
. All three of these are needed to read that given
segment. The first one is the number of rows that are contained within
the strips. Every strip within an image must have the same number of
rows within it except for the last one in certain instances.
StripOffsets
is an array of offsets that shows where each strip
starts within the file. The last tag, ByteSegmentCount
, is also an
array of values that contains the size of each strip in terms of Bytes.
Tiff 6.0 introduced a new way to arrange and store data within a Tiff, tiled storage. These rectangular segments have both a height and a width that must be divisible by 16. There are instances where the tiled grid does not fit the image exactly. When this occurs, padding is added around the image so as to meet the requirement of each tile having dimensions of a factor of 16.
As with stips, tiles have specific tags that are needed in order to
process each segment. These new tags are: TileWidth
, TileLength
,
TileOffsets
, and TileByteCounts
. TileWidth
is the number of
columns and TileLength
is the number of rows that are found within
the specified tile. As with striped, TileOffsets
and
TileByteCounts
are arrays that contain the begining offset and the
byte count of each tile in the image, respectively.
There exists two ways in which to describe a location in GeoTiffs. One is in Map coordinates which use X and Y values. X's are oriented along the horizontal axis and run from west to east while Y's are on the vertical axis and run from south to north. Thus the further east you are, the larger your X value ; and the more north you are the larger your Y value.
The other method is to use the grid coordinate system. This technique of measurement uses Cols and Rows to describe the relative location of things. Cols run east to west whereas Rows run north to south. This then means that Cols increase as you go east to west, and rows increase as you go north to south.
In some instances, your GeoTiff may contain an amount of data so large that it can no longer be described as a Tiff, but rather by a new name, BigTiff. In order to qualify as a BigTiff, your file needs to be at least 4gb in size or larger. At this point, the methods used to store and find data need to be changed. The accommodation that is made is to change the size of the various offsets and byte counts of each segment. For a normal Tiff, this size is 32-bits, but BigTiffs have these sizes at 64-bit. GeoTrellis supports BigTiffs without any issue, so one need not worry about size when working with their files.
Typeclasses are a common feature of Functional Programming. As stated in the FAQ, typeclasses group data types by what they can do, as opposed to by what they are. If traditional OO inheritance arranges classes in a tree hierarchy, typeclasses arrange them in a graph.
Typeclasses are realized in Scala through a combination of trait
s
and implicit
class wrappings. A typeclass constraint is visible in a
class/method signature like this:
class Foo[A: Order](a: A) { ... }
Meaning that Foo
can accept any A
, so long as it is "orderable".
In reality, this in syntactic sugar for the following:
class Foo[A](a: A)(implicit ev: Order[A]) { ... }
Here's a real-world example from GeoTrellis code:
protected def _write[
K: AvroRecordCodec: JsonFormat: ClassTag,
V: AvroRecordCodec: ClassTag,
M: JsonFormat: GetComponent[?, Bounds[K]]
](layerId: LayerId, rdd: RDD[(K, V)] with Metadata[M], keyIndex: KeyIndex[K]): Unit = { ... }
A few things to notice:
- Multiple constraints can be given to a single type variable:
K: Foo: Bar: Baz
?
refers toM
, helping the compiler with type inference. UnfortunatelyM: GetComponent[M, Bounds[K]]
is not syntactically possible
Below is a description of the most-used typeclasses used in GeoTrellis. All are written by us, unless otherwise stated.
Built-in from scala.reflect
. This allows classes to maintain some
type information at runtime, which in GeoTrellis is important for
serialization. You will never need to use this directly, but may have to
annotate your methods with it (the compiler will let you know).
From the spray
library. This constraint says that its type can be
converted to and from JSON, like this:
def toJsonAndBack[A: JsonFormat](a: A): A = {
val json: Value = a.toJson
json.convertTo[A]
}
Any type that can be serialized by Apache
Avro. While references to
AvroRecordCodec
appear frequently through GeoTrellis code, you will
never need to use its methods. They are used internally by our Tile
Layer Backends and Spark.
Always used on K
, Boundable
means your key type has a finite
bound.
trait Boundable[K] extends Serializable {
def minBound(p1: K, p2: K): K
def maxBound(p1: K, p2: K): K
... // etc
}
Component
is a bare-bones Lens
. A Lens
is a pair of
functions that allow one to generically get and set values in a data
structure. They are particularly useful for nested data structures.
Component
looks like this:
trait Component[T, C] extends GetComponent[T, C] with SetComponent[T, C]
Which reads as "if I have a T
, I can read a C
out of it" and "if
I have a T
, I can write some C
back into it". The lenses we
provide are as follows:
SpatialComponent[T]
- read aSpatialKey
out of a someT
(usuallySpatialKey
orSpaceTimeKey
)TemporalComponent[T]
- read aTemporalKey
of someT
(usuallySpaceTimeKey
)
A Functor is anything that maintains its shape and semantics when
map
'd over. Things like List
, Map
, Option
and even
Future
are Functors. Set
and binary trees are not, since map
could change the size of a Set
and the semantics of BTree
.
Vanilla Scala does not have a Functor
typeclass, but implements its
functionality anyway. Libraries like
Cats and
ScalaZ provide a proper
Functor
, but their definitions don't allow further constraints on
your inner type. We have:
trait Functor[F[_], A] extends MethodExtensions[F[A]]{
/** Lift `f` into `F` and apply to `F[A]`. */
def map[B](f: A => B): F[B]
}
which allows us to do:
def foo[M[_], K: SpatialComponent: λ[α => M[α] => Functor[M, α]]](mk: M[K]) { ... }
which says "M
can be mapped into, and the K
you find is
guaranteed to have a SpatialComponent
as well".
Data Structures: CRS
, LatLng
, WebMercator
,
ConusAlbers
In GIS, a projection is a mathematical transformation of Latitude/Longitude coordinates on a sphere onto some other flat plane. Such a plane is naturally useful for representing a map of the earth in 2D. A projection is defined by a Coordinate Reference System (CRS), which holds some extra information useful for reprojection. CRSs themselves have static definitions, have agreed-upon string representations, and are usually made public by standards bodies or companies. They can be looked up at SpatialReference.org.
A reprojection is the transformation of coorindates in one CRS to
another. To do so, coordinates are first converted to those of a sphere.
Every CRS knows how to convert between its coordinates and a sphere's,
so a transformation CRS.A -> CRS.B -> CRS.A
is actually
CRS.A -> Sphere -> CRS.B -> Sphere -> CRS.A
. Naturally some floating
point error does accumulate during this process.
Within the context of GeoTrellis, the main projection-related object is
the CRS
trait. It stores related CRS
objects from underlying
libraries, and also provides the means for defining custom reprojection
methods, should the need arise.
Here is an example of using a CRS
to reproject a Line
:
val wm = Line(...) // A `LineString` vector object in WebMercator.
val ll: Line = wm.reproject(WebMercator, LatLng) // The Line reprojected into LatLng.
Data structures: Extent
, ProjectedExtent
,
TemporalProjectedExtent
, GridExtent
, RasterExtent
An Extent
is a rectangular section of a 2D projection of the Earth.
It is represented by two coordinate pairs that are its "min" and "max"
corners in some Coorindate Reference System. "min" and "max" here are
CRS specific, as the location of the point (0,0)
varies between
different CRS. An Extent can also be referred to as a Bounding Box.
Within the context of GeoTrellis, the points within an Extent
always
implicitely belong to some CRS
, while a ProjectedExtent
holds
both the original Extent
and its current CRS
.
Here are some useful Extent
operations, among many more:
Extent.translate: (Double, Double) => Extent
Extent.distance: Extent => Double
Extent.contains: Extent => Boolean
Extent.intersection: Extent => Option[Extent]
ProjectedExtent.reproject: CRS => Extent
Extent
s are most often used to represent the area of an entire
Tile layer, and also the individual Tile
s themselves (especially
in the case of Raster
s).