<a href="https://colab.research.google.com/github/andersknudby/Remote-Sensing/blob/master/Chapter_6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Chapter 6 – Reading, manipulating, and writing raster data
Text files were relatively easy to deal with, because they have a simple format, and we know that we can treat all the data in them as text, and that it is structured line by line. Reading a more complex file format, like those typically used to store raster data, is more complicated. However, much of the complication has been hidden in objects and functions already created for us in what is called GDAL – the Geospatial Data Abstraction Library. \[[GDAL](http://www.gdal.org/)] contains object types that we can think of as complex variables, like a ‘raster dataset’ that has different properties like a ‘datum’ and a ‘number of bands’ and so on. GDAL also contains functions to read, manipulate, and write such objects, so we don’t need to code every little detail of those complex operations ourselves.

One big drawback is that GDAL is oddly complicated to import properly, especially on Windows machines. The main problem is that, to use GDAL, Python needs to know where all its files are, and because GDAL can be installed in many different ways, and especially because a computer can have multiple versions of Python installed, it quickly gets very confusing. But for the time being that is not going to stop us, because importing gdal in Colab is as easy as:

In [None]:
from osgeo import gdal

There are other Python libraries that build on GDAL and add functions, improve user-friendliness, or otherwise enhance its functionality. Here we will use one such library, called rasterio. In this chapter we will use some of rasterio's functions, but we are not going to go through everything. To use this library more in the future you can consult its 'readthedocs' page \[[rasterio](https://rasterio.readthedocs.io/en/latest/index.html)].

The rasterio library doesn't come with Colab, so we need to install it first:

In [None]:
!pip install rasterio

And when we're done installing it we can import it:

In [None]:
import rasterio

For this chapter, we will use a section of an aerial orthophoto of Simon Fraser University, stored in GeoTiff format, called ‘sfu.tif’. If you want to get an idea of what the image looks like before working with it in Python (always a good idea), open it in one of the software packages you already know and take a look at its properties etc. QGIS is good for this, as you can install it on your home computer regardless of its operating system. If you do so, one thing you will notice is that the ‘first band’ in the image (e.g. what ArcGIS calls ‘Band_1’) is the ‘Red’ band. In other words it contains information on the amount of electromagnetic radiation in the 600-700nm wavelength range reaching the camera from different parts of campus. Similarly, the second band is ‘Green’, and the third band is ‘Blue’. If we are to use this image in an intelligent fashion, we need to know this.

##Reading a raster dataset
GDAL structures raster data in a hierarchy, with three main components.

1\) The **dataset** is the entire, well, dataset, including all the data and all the metadata in one object. The dataset has certain properties, like a projection, a datum, a geotransform (which contains e.g. the pixel size), and so on. These are all defined for the dataset because they are properties of the dataset - they have to be the same for all the individual raster data layers in one dataset.

2\) The dataset is organized in **bands**. Bands also have certain properties, like the minimum and the maximum value. Similarly, these properties are defined for each band because they are a property of the band, not the dataset, and not the individual pixel. However, it is important to note that a band in GDAL is a ‘pointer’ to the data that are in the band itself, so you can ‘open the band’ to access its properties (e.g. its minimum value) without having to actually read all the data. This may be confusing for now, but it will be clearer in the examples to follow.

3\) The actual data pointed to by each band can be read and stored in Python as NumPy **arrays**. These arrays have all the same properties other NumPy arrays have, like shape (e.g. number of rows and columns).

rasterio inherits this structure as well, although it merges **bands** and **arrays** in its own data objects. An example will help illustrate how this all works. First we need to make sure we have access to the file through Google Drive:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

myDir = '/content/drive/My Drive/Python files/'

import os
if os.path.exists(myDir + 'sfu.tif'):
  print("Drive mounted and directory found")
else:
  print("No access to the files")

Then we'll open the file with rasterio:

In [None]:
fileName = myDir + 'sfu.tif'

ds = rasterio.open(fileName)  # ds is a commonly used shorthand for 'dataset'

Now we have a dataset, so we can figure out a few things about it:

In [None]:
print("Dataset name is", ds.name)
print("Number of bands in dataset:", ds.count)
print("Number of columns in dataset:", ds.width)
print("Number of rows in dataset:", ds.height)

We can also find out something about the georeference of the data:

In [None]:
print("Dataset coordinate reference system (CRS):", ds.crs)
print("Dataset bounds:", ds.bounds)
print("Dataset geotransform:", ds.transform)

* The EPSG code you see above refers to UTM Zone 10N, based on the WGS84 datum.
* The bounds tell us the area covered by the image.
* And the transform tells us something about the pixel size (0.10, in meters), the x coordinate of the left side of the image (506537.0), and the y coordinate of the top of the image (5458686.0).
* All coordinates for the bounds and the transform refer to the CRS.
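
The link between the transform and individual pixels can be sketched in plain Python. This is a minimal illustration, not rasterio's own implementation: it assumes a north-up image with no rotation, reuses the pixel size and corner coordinates quoted above, and the function name `pixel_to_map` is hypothetical.

```python
# Values quoted from the transform above (assumed: north-up image, no rotation)
PIXEL_SIZE = 0.10    # metres per pixel
LEFT_X = 506537.0    # x coordinate of the left edge of the image
TOP_Y = 5458686.0    # y coordinate of the top edge of the image

def pixel_to_map(row, col):
    """Return the map (x, y) of the upper-left corner of the pixel at (row, col)."""
    x = LEFT_X + col * PIXEL_SIZE   # x increases to the right with the column index
    y = TOP_Y - row * PIXEL_SIZE    # y decreases downwards with the row index
    return x, y

corner_x, corner_y = pixel_to_map(0, 0)       # upper-left corner of the whole image
pixel_x, pixel_y = pixel_to_map(453, 1243)    # an arbitrary pixel inside the image
print(corner_x, corner_y)
print(round(pixel_x, 2), round(pixel_y, 2))
```

rasterio datasets also offer an `xy(row, col)` method for this conversion. It reports the centre of the pixel by default, so its results differ from this sketch by half a pixel.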

You may not have noticed, but opening the dataset was very fast - much faster than if you had opened it in a GIS package. That's because Python didn't actually read all the data - it just read the information about the data, like what we printed in the code blocks above. To read the actual data we need to read the bands, and from the bands we can read the arrays - the actual numerical values the image is made up of.

**Important to note:** GDAL and rasterio start counting bands from 1 (not 0, as is otherwise the default in Python):

In [None]:
band1 = ds.read(1)
band1

Values from the array, i.e. values from individual pixels in the image, can be addressed by their row, column index:

In [None]:
band1[453, 1243]
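
The two counting conventions are easy to mix up, so here is a small sketch that uses a synthetic array in place of the real image (the array `stack` and its dimensions are made up for illustration). It assumes that calling `ds.read()` without a band number returns all bands as one three-dimensional array ordered (band, row, column).

```python
import numpy as np

# Synthetic stand-in for the result of ds.read(): 3 bands, 4 rows, 6 columns
count, height, width = 3, 4, 6
stack = np.arange(count * height * width, dtype='uint8').reshape(count, height, width)

# Band numbers start at 1, but NumPy indices start at 0,
# so band n sits at index n - 1 along the first axis
band1 = stack[0]
band3 = stack[2]

# A single band is a 2-D array shaped (rows, columns), i.e. (height, width)
print(band1.shape)

# Pixels are addressed as [row, column], both counted from 0
value = band3[2, 5]
print(value)
```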

Ok, so now we know how to open an image dataset, get some information about it, and read the values from individual bands into NumPy arrays. Which means we can use our knowledge of NumPy arrays to work with the data!

For example, if we want to describe the brightness of each pixel irrespective of its 'colour' we can calculate that as, say, the average of its value in the three bands. And we can use NumPy's array functions to do that very quickly and efficiently:

In [None]:
import numpy as np
band1 = ds.read(1)
band2 = ds.read(2)
band3 = ds.read(3)
brightness = (band1 + band2 + band3) / 3
brightness

**Warning:** There’s one thing to consider here that GDAL (and rasterio, and NumPy) aren’t particularly well built to help us with. There's actually an important error - a semantic error - in the above code. It is most easily illustrated by looking at a single pixel (and paying attention to the 'RuntimeWarning' shown when we run the code below):


In [None]:
band1Value = band1[100,500]
band2Value = band2[100,500]
band3Value = band3[100,500]
print("Values in the three bands are:", band1Value, band2Value, band3Value)

averageValue = (band1Value + band2Value + band3Value) / 3
print("The average value is:", averageValue)

What is going on here - the three values are 131, 129 and 111 respectively, and when we calculate the average it is... 38.3?!?

The problem is that the data from the image are stored as 8-bit unsigned integers, as demonstrated by:

In [None]:
band1Value.dtype

Feel free to do some searches to learn more about data types, but the short version relevant here is that 8-bit unsigned integers can only contain values between 0 and 255. When we calculate (band1Value + band2Value + band3Value) we get 131 + 129 + 111 = 371, which is more than our data type can handle. When NumPy counts up and passes 255, it 'folds over' and starts counting again from 0 (instead of continuing to 256). So, given that it is bound to an 8-bit unsigned integer, instead of adding up to 371 it ends up at 371 - 256 = 115. And ***then*** it divides by 3, to get 38.3.
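
The wrap-around can be reproduced without the image file. The sketch below puts the three example values into one-element `uint8` arrays (the names `red`, `green` and `blue` are only for illustration) and compares the result with ordinary integer arithmetic modulo 256.

```python
import numpy as np

# The three pixel values from the example above, stored as 8-bit unsigned integers
red = np.array([131], dtype='uint8')
green = np.array([129], dtype='uint8')
blue = np.array([111], dtype='uint8')

# Adding uint8 arrays keeps the uint8 type, so the sum wraps around at 256
wrapped_sum = red + green + blue
print(wrapped_sum, wrapped_sum.dtype)

# The same result follows from ordinary integer arithmetic modulo 256
expected = (131 + 129 + 111) % 256
print(expected)

# Dividing the wrapped sum by 3 gives the misleading 'average'
print(wrapped_sum / 3)
```

Note that when whole arrays are added, NumPy wraps around silently, without the warning seen for single values. This is why the earlier brightness calculation ran without any complaint.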

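To avoid all this nonsense, the easiest solution is to convert data to a more suitable data type when you first read them from the image file. In our example we can do that like this, with NumPy's 'astype' method. A 16-bit unsigned integer can hold values up to 65535, far more than the largest possible sum of three 8-bit values (3 x 255 = 765):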

In [None]:
import numpy as np
band1 = ds.read(1).astype('uint16')
band2 = ds.read(2).astype('uint16')
band3 = ds.read(3).astype('uint16')
brightness = (band1 + band2 + band3) / 3

And to prove that it produces the desired result:

In [None]:
band1Value = band1[100,500]
band2Value = band2[100,500]
band3Value = band3[100,500]
print("Values in the three bands are:", band1Value, band2Value, band3Value)

averageValue = (band1Value + band2Value + band3Value) / 3
print("The average value is:", averageValue)
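
As a variation on the same idea, the sketch below compares the 'astype' approach with stacking the bands and averaging them with `np.mean`. It uses small synthetic arrays rather than the real image, and relies on `np.mean` computing in float64 by default for integer input, so no wrap-around occurs.

```python
import numpy as np

# Synthetic 2 x 2 'bands' of 8-bit data; the top-left pixel repeats the example values
band1 = np.array([[131, 10], [255, 0]], dtype='uint8')
band2 = np.array([[129, 20], [255, 0]], dtype='uint8')
band3 = np.array([[111, 30], [255, 0]], dtype='uint8')

# Option 1: widen the data type before adding, as done in this chapter
brightness_astype = (band1.astype('uint16') + band2.astype('uint16')
                     + band3.astype('uint16')) / 3

# Option 2: stack the bands and average along the band axis;
# np.mean uses float64 for integer input by default, so nothing wraps around
brightness_mean = np.mean(np.stack([band1, band2, band3]), axis=0)

print(brightness_astype)
print(brightness_mean)
```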

##Writing a raster dataset
Now that we have a product from our image analysis we typically want to write the result to a file as a new raster dataset. Writing NumPy arrays to raster files involves a series of steps, as outlined below. It's more complicated than writing a text file, because raster datasets not only have the image data but also the associated information we read earlier, like the coordinate reference system, the bounds, and so on. The arguments passed to rasterio.open in the code below are:
* myDir + 'brightness.tif' indicates the name of the file to create and write to
* 'w' indicates that we want this file to be open for writing
* driver='GTiff' indicates that we want the file format to be GeoTiff
* height and width are the number of rows and columns in the new image
* count is the number of bands in the new image. While we had three in the original, we have only one here (to write the brightness into)
* dtype='float64' indicates that we want to write decimal values into individual pixels
* crs and transform are the same as the original image

In [None]:
newDs = rasterio.open(myDir + 'brightness.tif', 'w', driver='GTiff',
                            height=ds.height, width=ds.width, count=1,
                            dtype='float64', crs=ds.crs, transform=ds.transform)
newDs.write(brightness, 1)
newDs.close()

As in the last chapter, to actually write this file to your Google Drive you need to flush and unmount the drive:

In [None]:
drive.flush_and_unmount()

Go to your Google Drive, download the file called 'brightness.tif', and open it in QGIS next to the original image file. Does the result make sense - i.e. do bright pixels in the original have higher values in the brightness image?

##Exercise
To map vegetation with three-band (red, green and blue) imagery, you can rely on the fact that most things other than vegetation, in the natural world, are not green. You can thus calculate a ‘greenness’ index as a proxy for vegetation. An often used index is the Green Chromatic Coordinate (GCC), which is simply calculated as: Green / (Red + Green + Blue). Modify the code from this chapter to calculate the GCC, and write it to a new file called ‘gcc.tif’. Compare it to the original image to check that high GCC values actually correspond to vegetated areas, and vice versa.