# Using `shapely` and `rasterio` to combine GeoJSON and `.tif` raster images

In this tutorial, we'll show how to combine night-light image data from the NOAA VIIRS website with GeoJSON boundary data using the `shapely` and `rasterio` packages.

GeoJSON files contain polygons that describe geographical regions such as counties and states. We combine these polygons with the raw pixel arrays in the `.tif` files downloaded from NOAA to create masks that isolate the light data for a specific county of interest.

In this tutorial, we'll look at the GeoJSON description of New York County (which covers Manhattan island), and we'll look at how to overlay the county boundaries on top of a February 2014 satellite night-light image in order to both visualize the light distribution in the county and compute summary statistics.

## Setup

For our tutorial, we'll need IPython's plotting integration (which relies on `matplotlib`) and several libraries: `rasterio`, `shapely`, `numpy`, `pandas`, and the `affine` library (which is a dependency of `rasterio`). To install them, you can run
```
pip install numpy pandas shapely rasterio matplotlib
```

In our tutorial, we assume that the data lives in `~/bh/data/`. In particular, you'll need
 - a satellite image file at `~/bh/data/satellite/SVDNB_npp_20140201-20140228_75N180W_vcmcfg_v10_c201507201052.avg_rade9.tif`
 - a GeoJSON file with the boundaries of U.S. counties at `~/bh/data/us_counties_5m.json`
 - a listing of U.S. states and state codes at `~/bh/data/state.txt`
 
### Data sources

We obtained the US counties GeoJSON file, which is in `latin-1` encoding, from [Eric Celeste](http://eric.clst.org/Stuff/USGeoJSON)'s website. We use the `5m` counties file (1:5,000,000 scale). You can find more information about GeoJSON and shapefile resolutions [here](http://chimera.labs.oreilly.com/books/123000000034/ch12.html#_choose_a_resolution).

In the GeoJSON, the `STATE` property attached to each county is encoded as an integer. We use the reference file from [census.gov](http://www2.census.gov/geo/docs/reference/state.txt) to create a mapping `states` from state codes to state names.

The satellite image can be found [here](http://mapserver.ngdc.noaa.gov/viirs_data/viirs_composite/v10/201402/vcmcfg/SVDNB_npp_20140201-20140228_75N180W_vcmcfg_v10_c201507201052.tgz). It is part of [a larger collection](http://mapserver.ngdc.noaa.gov/viirs_data/viirs_composite/v10/) of satellite data that NOAA has made available to the public.

In [0]:
%pylab inline

First, we import the needed packages and specify the paths for the files.

In [0]:
import os
import json
import rasterio
import rasterio.features
import shapely.geometry
import pandas as pd
from affine import Affine

RASTER_FILE = os.path.join(
    os.path.expanduser('~'), 'bh', 'data', 'satellite',
    'SVDNB_npp_20140201-20140228_75N180W_vcmcfg_'
    'v10_c201507201052.avg_rade9.tif'
)

COUNTIES_GEOJSON_FILE = os.path.join(
    os.path.expanduser('~'), 'bh', 'data',
    'us_counties_5m.json'
)
STATES_TEXT_FILE = os.path.join(
    os.path.expanduser('~'), 'bh', 'data',
    'state.txt'
)


## Taking a look at the GeoJSON

Let's take a look at the GeoJSON file describing U.S. counties. While we're at it, we'll load a dataframe with state codes, because the GeoJSON file tags counties by state code rather than by name.

In [0]:
import io

# The file is latin-1 encoded, so decode it while reading
with io.open(COUNTIES_GEOJSON_FILE, 'r', encoding='latin-1') as f:
    counties_raw_geojson = json.load(f)

states_df = pd.read_csv(STATES_TEXT_FILE, sep='|').set_index('STATE')
states = states_df['STATE_NAME']

This top-level GeoJSON object is a `dict` with two keys:
* `type`, which specifies that this is a `FeatureCollection`, and
* `features`, which is a JSON array of GeoJSON objects, one for each county.
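
To make that structure concrete, here is a hand-written miniature of such a file. The county name, state code, and coordinates are made up for illustration and are not taken from the real data; only the key layout (`type`, `features`, `properties`, `geometry`) follows the description above.

```python
import json

# A hypothetical, hand-written FeatureCollection with a single county.
# The values are illustrative only; the key layout mirrors the real file.
tiny_geojson_text = """
{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {"NAME": "Example", "STATE": "36"},
      "geometry": {
        "type": "Polygon",
        "coordinates": [[[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]
      }
    }
  ]
}
"""

tiny_geojson = json.loads(tiny_geojson_text)

# The top-level object is a dict with the two keys described above.
top_level_keys = sorted(tiny_geojson.keys())

# Each entry of `features` carries its own properties and geometry.
first_feature = tiny_geojson['features'][0]
print(top_level_keys)
print(first_feature['properties']['NAME'])
```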

Since we want to be able to look up counties by name, we rearrange this
into a `dict` with county names as keys and the GeoJSON objects for each
county as values. We build each name from the `properties.NAME` and `properties.STATE`
keys in the county's GeoJSON object. It is important to use the state as well as
the name because several states have counties with the same name.

Note that since there are unicode characters in some county names, we
use `u` in front of the formatting string to avoid ASCII errors.


In [0]:
def get_county_name_from_geo_obj(geo_obj):
    """
    Use the NAME and STATE properties of a county's geojson
    object to get a name "state: county" for that county.
    """
    return u'{state}: {county}'.format(
        state=states[int(geo_obj['properties']['STATE'])],
        county=geo_obj['properties']['NAME']
    )

counties_geojson = {
    get_county_name_from_geo_obj(county_geojson): county_geojson
    for county_geojson in counties_raw_geojson['features']
}

print(sorted(counties_geojson.keys())[:10])


## Using the `shapely` library to work with GeoJSON data

We use the `rasterio` library to work with the satellite data, and the `shapely` library to align the raster data with information from the GeoJSON. Let's start with `shapely`.

As our example, we'll use Manhattan. To get a `shapely.geometry.MultiPolygon` object from a GeoJSON dictionary, we use the `shapely.geometry.shape` function.

In [0]:
ny_shape = shapely.geometry.shape(counties_geojson['New York: New York']['geometry'])
print('%r' % ny_shape)
ny_shape


To find the smallest rectangular region containing Manhattan that we can use to slice into the raster file, we first get the longitude and latitude bounds using the `bounds` property of the `shapely.geometry.MultiPolygon` instance.
This returns `(lon_min, lat_min, lon_max, lat_max)` coordinates:


In [0]:
lon_min, lat_min, lon_max, lat_max = ny_shape.bounds
print('{} {} {} {}'.format(lon_min, lat_min, lon_max, lat_max))
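
As a rough illustration of what `bounds` reports, the sketch below computes the same kind of `(lon_min, lat_min, lon_max, lat_max)` tuple by hand from GeoJSON-style `MultiPolygon` coordinates. It deliberately avoids `shapely`; the helper `bounds_of_multipolygon` and the coordinates are hypothetical and exist only for this example.

```python
def bounds_of_multipolygon(coordinates):
    """
    Return (lon_min, lat_min, lon_max, lat_max) for GeoJSON-style
    MultiPolygon coordinates: a list of polygons, each a list of rings,
    each a list of [lon, lat] points.
    """
    lons = []
    lats = []
    for polygon in coordinates:
        for ring in polygon:
            for lon, lat in ring:
                lons.append(lon)
                lats.append(lat)
    return (min(lons), min(lats), max(lons), max(lats))


# Two made-up triangles standing in for a county made of two pieces.
example_coordinates = [
    [[[-74.02, 40.70], [-73.97, 40.70], [-73.97, 40.75], [-74.02, 40.70]]],
    [[[-73.95, 40.78], [-73.91, 40.78], [-73.91, 40.88], [-73.95, 40.78]]],
]

example_bounds = bounds_of_multipolygon(example_coordinates)
print(example_bounds)
```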


## Loading data from a raster file with `rasterio`

Our next step is to actually load the satellite image data for New York. Before we can do this, we'll
need to convert these latitude and longitude bounds into array indices of the raster file (which is represented
on disk as an array, essentially a bitmap of the image).

Every `rasterio` file object has an `index` method that can do this. The `index`
method accepts `(longitude, latitude)` coordinates and returns `(row, col)` indices for
the corresponding pixel. In the raster file, rows correspond to latitude and
columns to longitude (this is the opposite order of the input, which can lead to confusion
if you aren't careful).
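
The arithmetic behind such a lookup can be sketched in plain Python for a raster that is not rotated. This is an illustration under stated assumptions rather than `rasterio`'s own implementation: the pixel size, the corner coordinates, the helper name `lonlat_to_rowcol`, and the choice to round down are all hypothetical.

```python
import math

# Hypothetical coefficients for a north-up raster: each pixel spans
# 1/240 of a degree and the top-left corner sits at (-180, 75).
PIXEL_WIDTH = 1.0 / 240     # degrees of longitude per column
PIXEL_HEIGHT = -1.0 / 240   # negative: rows run from north to south
LON_ORIGIN = -180.0
LAT_ORIGIN = 75.0


def lonlat_to_rowcol(lon, lat):
    """Map (longitude, latitude) to (row, col); note the swapped order."""
    col = int(math.floor((lon - LON_ORIGIN) / PIXEL_WIDTH))
    row = int(math.floor((lat - LAT_ORIGIN) / PIXEL_HEIGHT))
    return row, col


row, col = lonlat_to_rowcol(-74.01, 40.71)
print((row, col))

# Moving north decreases the row index; moving east increases the column.
north_row, _ = lonlat_to_rowcol(-74.01, 40.81)
_, east_col = lonlat_to_rowcol(-73.91, 40.71)
```

The last two lines show why the northern edge of a bounding box ends up with the smaller row index.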


In [0]:
raster_file = rasterio.open(RASTER_FILE, 'r')

bottom, left = raster_file.index(lon_min, lat_min)
top, right = raster_file.index(lon_max, lat_max)

raster_window = ((top, bottom+1), (left, right+1))
raster_window


Note that the window lists `top` before `bottom`. Row indices in the satellite data increase from north to south, so `top` (computed from `lat_max`) is the smaller row index.

To load the pixel values, we pass the window to the `read` method of the raster file, which returns
a numpy `float32` array of intensities. We specify `indexes=1` to read the first band as a 2D array (rather than a 3D array whose first dimension has size 1).
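
The window is a pair of half-open `(start, stop)` ranges, first for rows and then for columns, which is why the earlier cell added 1 to `bottom` and `right`. The sketch below mimics that with ordinary `numpy` slicing on a made-up array; the array and the index values are hypothetical.

```python
import numpy as np

# A made-up 6x8 "raster" whose values encode their own position.
full_array = np.arange(48).reshape(6, 8)

# Hypothetical inclusive pixel indices for a region of interest.
top, bottom = 1, 3
left, right = 2, 5

# Ranges are half-open (the stop index is excluded), so we add 1 to
# keep the bottom row and the right column.
window = ((top, bottom + 1), (left, right + 1))
(row_start, row_stop), (col_start, col_stop) = window
sub_array = full_array[row_start:row_stop, col_start:col_stop]

print(sub_array.shape)
```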


In [0]:
ny_raster_array = raster_file.read(indexes=1, window=raster_window)
ny_raster_array.shape


We can now plot our data using `imshow` to see Manhattan's beautiful night lights:


In [0]:
from matplotlib import pyplot as plt
plt.imshow(ny_raster_array)
plt.show()


## Working with affine mappings to align GeoJSON and raster data

The bounding-box plot we have so far looks nice. But we see a lot of light from outside Manhattan,
because the entire bounding box shows up in our data. How can we isolate just the data in Manhattan?
To accomplish this, we'll need to think a little more about how `rasterio` represents the mapping
between pixel indices and latitude / longitude coordinates.

Let's look at the `transform` property of the raster file (older `rasterio` releases exposed it as `affine`), which is how it encodes the mapping
between indices in the image and latitude / longitude coordinates:


In [0]:
raster_file.transform


An
```
Affine(a, b, c,
       d, e, f)
```
instance represents a 2D
transformation of the form
$$
\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
a & b & c \\
d & e & f \\
0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}.
$$

In the context of a `rasterio` file, the original coordinates $x$ and $y$
represent columns and rows for the pixel array, and $x'$ and $y'$
represent the longitude and latitude.

The `b` and `d` entries are zero because our image is not rotated: its rows and
columns run parallel to the lines of latitude and longitude. The `a` and `e` entries
give the scalings for longitude and latitude (the negative `e` means we index top to
bottom, as we saw earlier), while `c` and `f` give the top-left corner (minimum
longitude and maximum latitude) of the image.
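
As a small worked example of this formula, the sketch below applies the matrix to one pixel position with `numpy`. The coefficient values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical coefficients: 0.5-degree pixels, top-left corner at
# longitude -10 and latitude 50, no rotation (b = d = 0).
a, b, c = 0.5, 0.0, -10.0
d, e, f = 0.0, -0.5, 50.0

matrix = np.array([
    [a, b, c],
    [d, e, f],
    [0.0, 0.0, 1.0],
])

# Pixel position as (x, y, 1) = (col, row, 1): column 4, row 2.
pixel = np.array([4.0, 2.0, 1.0])
lon, lat, _ = matrix.dot(pixel)
print((lon, lat))

# Column 0, row 0 maps to the top-left corner (c, f).
corner_lon, corner_lat, _ = matrix.dot(np.array([0.0, 0.0, 1.0]))
```

Moving four columns to the right adds `4 * a` to the longitude, and moving two rows down adds `2 * e` (a negative amount) to the latitude.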


## Adjusting the affine mapping to just our bounding box

In order to overlay the `shapely` representation of GeoJSON against the `rasterio`
data, we need to build a new affine mapping for just the bounding box.

The scale of our bounding box is the same as the scale of the overall image, so `a` and
`e` don't need to change. And the bounding box isn't rotated, so `d` and `b` remain zero.
But we need to adjust `c` and `f`, because the top-left corner of our bounding
box for New York county isn't the same as the top-left corner of the full
image:


In [0]:
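# In rasterio 1.0 and later `transform` is an Affine instance
# (older releases exposed it as `affine`)
rfa = raster_file.transform
ny_affine = Affine(
    rfa.a, rfa.b, lon_min,
    rfa.d, rfa.e, lat_max
)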
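
The cell above uses `(lon_min, lat_max)` as the corner. The pixel grid's own corner for row `top` and column `left` follows from the same affine formula, and it generally lies slightly west and north of `(lon_min, lat_max)`, by less than one pixel. The sketch below shows that arithmetic with hypothetical coefficients and a hypothetical bounding-box corner; it assumes pixel indices are obtained by rounding down.

```python
import math

# Hypothetical full-image coefficients (same roles as a, e, c, f above).
a, e = 0.25, -0.25
c, f = -180.0, 75.0

# Hypothetical bounding-box corner in degrees.
lon_min, lat_max = -74.05, 40.9

# Pixel indices of the bounding box's top-left pixel (floor rounding).
left = int(math.floor((lon_min - c) / a))
top = int(math.floor((lat_max - f) / e))

# Top-left corner of that pixel, from x' = a*col + c and y' = e*row + f.
corner_lon = a * left + c
corner_lat = e * top + f

print((corner_lon, corner_lat))
```

If sub-pixel alignment matters for an analysis, these corner values could be used for `c` and `f` in place of `lon_min` and `lat_max`.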

## Using the `rasterize` function to compute a bitmask from `shapely` GeoJSON data

Now we can isolate the pixels inside Manhattan using
the `rasterio.features.rasterize` function. We can represent this kind of filtered data using a `numpy` masked array whose `mask` is 0 for the relevant data (in our case, within Manhattan), and 1 otherwise.
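
The mask convention can be seen on a tiny example. The luminosity values and the mask below are made up; the point is only that `numpy` ignores every position where the mask is 1 (true).

```python
import numpy as np

# Made-up luminosity values for a 2x3 patch of pixels.
values = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]], dtype=np.float32)

# 0 marks pixels we want to keep, 1 marks pixels to ignore.
mask = np.array([[0, 0, 1],
                 [1, 0, 1]], dtype=np.uint8)

masked = np.ma.array(data=values, mask=mask.astype(bool))

# Statistics only see the unmasked values 1, 2 and 5.
kept_count = masked.count()
kept_max = masked.max()
kept_sum = masked.sum()
print(kept_count)
```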

We generate the mask using the `rasterize` function, whose first argument is an iterable of `(geometry, value)` pairs. It also takes an affine mapping from pixel indices to `(longitude, latitude)` coordinates and an array shape, and returns an array in which the pixels within each geometry object are set to the corresponding value and all remaining pixels are set to `fill`.




In [0]:
import rasterio.features
ny_mask = rasterio.features.rasterize(
    shapes=[(ny_shape, 0)],
    out_shape=ny_raster_array.shape,
    transform=ny_affine,
    fill=1,
    dtype=np.uint8,
)
plt.imshow(ny_mask)


## Combining the bitmask with the image data

Finally, using this mask we can work with luminosity data for just Manhattan.
For example, we can re-create our plot of the nighttime lights, zeroing out
all the pixels outside Manhattan:


In [0]:
plt.imshow(ny_raster_array * (1 - ny_mask))
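
The multiplication trick in the cell above can be checked on a small made-up array: `(1 - mask)` is 1 where a pixel is kept and 0 elsewhere, so the product zeroes out everything outside the region. The values here are hypothetical.

```python
import numpy as np

values = np.array([[7.0, 8.0], [9.0, 10.0]], dtype=np.float32)
mask = np.array([[0, 1], [1, 0]], dtype=np.uint8)

# (1 - mask) is 1 where we keep a pixel and 0 elsewhere,
# so the product zeroes out every masked pixel.
zeroed = values * (1 - mask)
print(zeroed)
```

This is convenient for plotting, but the zeros are indistinguishable from genuinely dark pixels, which is one reason to prefer a masked array when computing statistics.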


Similarly, by making a `numpy` masked array, we can use numpy's masked-data
functions to compute statistics about the light distribution in New York
County:


In [0]:
ny_masked = np.ma.array(
    data=ny_raster_array,
    mask=ny_mask.astype(bool)
)
print('min: {}'.format(ny_masked.min()))
print('max: {}'.format(ny_masked.max()))
print('mean: {}'.format(ny_masked.mean()))
print('standard deviation: {}'.format(ny_masked.std()))


This wraps up our demo. We've covered all of the GeoJSON tools and APIs from `shapely`, `rasterio`, and `numpy` needed for our visualizations of the NOAA satellite data.