# Introduction 

## Data wrangling and transformation

Often, datasets need to go through a series of data wrangling and transformation steps before they are ready for analysis or visualisation tasks. This lab will demonstrate several data wrangling and transformation operations for raster, vector, and tabular data. 

We will start with a subset of the AgriFieldNet Competition Dataset <a href="https://mlhub.earth/data/ref_agrifieldnet_competition_v1" target="_blank">(Radiant Earth Foundation and IDinsight, 2022)</a> which has been published to encourage people to develop machine learning models that classify a field's crop type from satellite images. This dataset consists of a series of directories with each directory corresponding to a 256 x 256 pixel image footprint. Inside each directory are the following files:

* 12 GeoTIFF files corresponding to spectral reflectance in different wavelengths from Sentinel-2 data. 
* 1 GeoTIFF file with non-zero pixels corresponding to a crop type label. 
* 1 GeoTIFF file with non-zero pixels corresponding to a field id. 
* 1 JSON metadata file. 

This data is subset from a larger dataset covering agricultural fields in four Indian states: Odisha, Uttar Pradesh, Bihar, and Rajasthan. The field boundaries and crop type labels were captured by data collectors from IDinsight's Data on Demand team and the satellite image preparation was undertaken by the Radiant Earth Foundation. 

### Task

Our task is to combine all the raster data in a folder into a tabular dataset that can be used for machine learning tasks to predict a field's crop type. Specifically, we will transform a collection of GeoTIFF files into a tabular dataset with columns for each field id, crop type, and field average spectral reflectance values. We will also store geometry data representing the location of each field in a geometry column. 

![](https://github.com/data-analysis-3300-3003/figs/raw/main/week-4-overview.jpg)

You will learn a range of common data transformation operations to wrangle datasets into a structure suitable for analysis and visualisation.   

**This lab will focus on transformation operations applied to tabular and vector data.** This lab will cover:

* **attribute operations:** data cleaning refresher (from week 3).
* **attribute operations:** subsetting `DataFrame`s based on conditions.
* **attribute operations:** appending (concatenating) rows to tabular `DataFrame` objects.
* **attribute operations:** group-by and summarise operations of tabular `DataFrame` objects.
* **attribute operations:** key-based relational joins between two tables.
* **raster-vector operations:** vectorising raster data.
* **geometry operations:** computing polygon geometry centroids.
* **spatial data operations:** spatial joins of two `GeoDataFrame` objects. 


## Setup

### Run the labs

You can run the labs locally on your machine or you can use cloud environments provided by Google Colab. **If you're working with Google Colab be aware that your sessions are temporary and you'll need to take care to save, backup, and download your work.**

<a href="https://colab.research.google.com/github/data-analysis-3300-3003/colab/blob/main/lab-4-self-guided.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Download data

If you need to download the data for this lab, run the following code snippet. 

In [None]:
import os

if "week-4" not in os.listdir(os.getcwd()):
    os.system('wget "https://github.com/data-analysis-3300-3003/data/raw/main/data/week-4.zip"')
    os.system('unzip "week-4.zip"')

### Working in Colab

If you're working in Google Colab, you'll need to install the required packages that don't come with the Colab environment.

In [None]:
if 'google.colab' in str(get_ipython()):
    !pip install geopandas
    !pip install pyarrow
    !pip install mapclassify
    !pip install rasterio

### Import modules

In [None]:
# Import modules
import os
import pandas as pd
import geopandas as gpd
import plotly.express as px
import numpy as np
import matplotlib.pyplot as plt
import rasterio
import plotly.io as pio
import shapely.geometry
import pprint

from rasterio import features

# setup renderer
if 'google.colab' in str(get_ipython()):
    pio.renderers.default = "colab"
else:
    pio.renderers.default = "jupyterlab"

### Preliminary processing

This week's self-guided lab will pick up from lab-4 where we'd created a program to:

1. read GeoTIFF files into NumPy `ndarray` objects
2. stack the `ndarray` objects to create a multiband raster representation of the data
3. reshape the multiband `ndarray` objects to a tabular-like structure

In this lab we will extend this program by converting the `ndarray` representation of a table to a `DataFrame` object, which we will further process with a range of tabular attribute and vector operations. 

In [None]:
# path to data
image_dir_path = os.path.join(os.getcwd(), "week-4", "images", "ref_agrifieldnet_competition_v1_source_0a664")

# stacking bands 

# Sentinel-2 band names 
s2_bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12']

# empty list to append ndarray of reflectance value for each band to
bands = []
    
# loop over each band, read in the data from the corresponding GeoTIFF file into an ndarray
for b in s2_bands:
    print(f"reading {b}.tif")
    band_path = os.path.join(image_dir_path, b + ".tif")
    with rasterio.open(band_path) as src:
        # append the ndarray storing the Sentinel-2 reflectance data for a band to a list
        bands.append(src.read(1))

# stack all bands in the list to create a multiband raster
multiband_raster = np.stack(bands)
    
# make NDVI band
red = multiband_raster[3,:,:].astype("float64")
nir = multiband_raster[7,:,:].astype("float64")
ndvi = (nir-red)/(nir+red)
ndvi = np.expand_dims(ndvi, axis=0) # add a bands axis
multiband_raster = np.concatenate((multiband_raster, ndvi), axis=0) # stack the ndvi band
    
### HERE WE ARE STACKING THE FIELD ID BAND 
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    field_ids = src.read().astype("float64")
    field_ids[field_ids == 0] = np.nan
    multiband_raster = np.concatenate((multiband_raster, field_ids), axis=0)
    
## HERE WE ARE STACKING THE CROP TYPE LABELS BAND 
labels_path = os.path.join(image_dir_path, "label/raster_labels.tif")
with rasterio.open(labels_path) as src:
    multiband_raster = np.concatenate((multiband_raster, src.read()), axis=0)

# reshape multiband raster to tabular format
rows = multiband_raster.shape[1]
cols = multiband_raster.shape[2]
n_bands = multiband_raster.shape[0]
reshaped = multiband_raster.reshape(n_bands, rows*cols)
tabular = reshaped.T

## Attribute data operations

### Pandas `DataFrame` and data cleaning

As our data is now in a tabular structure it makes sense to convert it from a NumPy `ndarray` object to a Pandas `DataFrame` object. Pandas `DataFrame` objects, and the pandas package more generally, are based on NumPy but have been tailored for working with tabular datasets. For example, a NumPy `ndarray` stores data of the same type in an array-like structure (e.g. all elements are integers). A pandas `DataFrame` can store data of different types in different columns (e.g. column 0 is string, column 1 is floating point, etc.), but the values within each column are the same type and each column is typically a `PandasArray` which is based on a NumPy `ndarray`. 

Columns in a pandas `DataFrame` are called `Series` and a `Series` can be objects in your program independent of a `DataFrame`. A `Series` is an array-like sequence of values stored in a `PandasArray` which wraps a NumPy `ndarray`, and a `DataFrame` creates a tabular structure by combining one or more `Series`. 
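As a minimal sketch of the relationship between a `DataFrame` and its `Series` columns, the snippet below builds a small table and inspects its columns. The column names and values (`crop_name`, `field_area_ha`, `n_pixels`) are made up for illustration and are not part of the lab data.

```python
import pandas as pd

# hypothetical values for illustration only (not from the lab dataset)
toy_df = pd.DataFrame({
    "crop_name": ["wheat", "mustard", "lentil"],  # text
    "field_area_ha": [1.2, 0.8, 2.5],             # floating point
    "n_pixels": [70, 42, 130],                    # integers
})

# each column has its own data type
print(toy_df.dtypes)

# selecting a single column returns a Series
area = toy_df["field_area_ha"]
print(type(area))

# a Series can also exist independently of any DataFrame
ndvi_series = pd.Series([0.61, 0.44, 0.72], name="ndvi")
print(ndvi_series.mean())
```

Printing `toy_df.dtypes` should show a different type for each column, while every value within one column shares that column's type.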

Operations on Pandas `DataFrame`s also borrow from NumPy's style, such as avoiding for-loops; however, they also provide several features and functions geared towards working with tabular datasets. A selection of these features includes:

* indexing using column names
* relational database style operations including key-based joins, conditional filtering and selection of data, and group-by and summarise
* support for working with time-series
* more tools for handling missing data

Thus, Pandas `DataFrame` objects are useful for many attribute data transformation operations.

To convert a NumPy `ndarray` to a Pandas `DataFrame` we pass the array into the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html#pandas.DataFrame" target="_blank">`DataFrame`'s constructor</a> function helpfully named `DataFrame()`. The constructor function expects the NumPy `ndarray` and a list of column labels as arguments. 

Let's quickly recap our data structures after we have performed several raster transformation operations to the GeoTIFF files. We have an `ndarray` object, `tabular`, which is a 2-Dimensional NumPy `ndarray` representing 256 x 256 pixel images in a tabular structure with pixels aligned down the rows (0-axis) and bands aligned along the columns (1-axis). 
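To make the "pixels down the rows, bands along the columns" layout concrete, here is a small sketch that applies the same reshape-and-transpose logic to a toy array. The array `toy_raster` and its values are arbitrary and only stand in for the real multiband raster.

```python
import numpy as np

# toy raster with shape (bands, rows, cols) = (2, 2, 3); values are arbitrary
toy_raster = np.arange(12).reshape(2, 2, 3)

n_bands, rows, cols = toy_raster.shape

# flatten the two spatial axes, then transpose so that
# each row is one pixel and each column is one band
toy_tabular = toy_raster.reshape(n_bands, rows * cols).T

print(toy_tabular.shape)  # (6, 2)

# the first row holds the band 0 and band 1 values of the top-left pixel
print(toy_tabular[0])  # [0 6]
```

The real `tabular` array follows the same pattern, just with 256 x 256 pixels and more bands.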

In [None]:
print(f"tabular is of type {type(tabular)}")
print(f"tabular is an ndarray with shape {tabular.shape}")

In [None]:
# convert ndarray to Pandas DataFrame
s2_bands = ['B01', 'B02', 'B03', 'B04','B05', 'B06', 'B07', 'B08','B8A', 'B09', 'B11', 'B12']

# create a DataFrame object from the tabular ndarray
tmp_df = pd.DataFrame(tabular, columns=s2_bands + ["ndvi", "field_id", "labels"])
tmp_df.head()

Looking at the `DataFrame` we can clearly see some `NaN` values in the `field_id` column. As each row in this `DataFrame` represents a pixel in the image, rows with `NaN` values in the `field_id` column are pixels which don't have a crop type label. Therefore, we can drop them using the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html" target="_blank">`dropna()` method</a> - this will drop all rows from the `DataFrame` where there is a `NaN` value.

Let's inspect some metadata for the `DataFrame` we have created. 

In [None]:
tmp_df.info()

The `DataFrame`'s `info()` method prints a summary of each column's data type, the count of non-null values, and the memory usage of the object. We can see that the `field_id` and `labels` columns are float64; however, these columns are storing categorical data so integer numbers would be a more appropriate type. Therefore, we can use the `astype()` method to convert these columns to integer type. 

#### Recap quiz

**Can you use the `dropna()` and `astype()` methods to i) drop all rows with `NaN` values, and ii) convert the `field_id` and `labels` columns to `int32` type? Use the pandas docs for example uses of the `dropna()` and `astype()` methods.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
tmp_df = tmp_df.dropna()
tmp_df = tmp_df.astype({"field_id": "int32", "labels": "int32"})
# Check the cast to integer type has worked. 
# Also, note the reduced memory usage after dropping all the nan rows
tmp_df.info()
```
</details>

### Grouped summaries

Another common attribute operation when working with tabular data is performing grouped aggregations and summaries. For example, often we want to compute the mean, median, min, max, or sum of data values within groups in our dataset. This could be to generate summary tables for reporting purposes or an intermediate step in a data transformation workflow. 

Our goal for this data transformation workflow is to generate average spectral reflectance values in different wavelengths from Sentinel-2 images for each field with a crop type label. So far we have created a table where each row represents one pixel in a 256 x 256 image footprint and we have many pixels per field. We need to compute the average reflectance values for each field. This is a group-by and summarise operation - grouping by field and summarising using the mean. 

A group-by and summarise operation can be conceptualised as a sequence of split-apply-combine steps (<a href="https://wesmckinney.com/book/data-aggregation.html#groupby_fundamentals" target="_blank">McKinney, 2022</a>):

* **Split** your dataset into groups.
* **Apply** a function to values in each group as a summary.
* **Combine** the results of applying the function to each group. 
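The split-apply-combine steps above can be sketched on a tiny table. The `toy_pixels` table below is hypothetical (two fields with a handful of made-up `B04` values) and is only meant to show the mechanics.

```python
import pandas as pd

# hypothetical pixel-level table: two fields with a few pixels each
toy_pixels = pd.DataFrame({
    "field_id": [1, 1, 1, 2, 2],
    "B04": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# split: iterating over the grouped object shows the subset for each field
for field_id, group in toy_pixels.groupby("field_id"):
    print(f"field {field_id} has {len(group)} pixels")

# apply + combine: compute the mean per group and return one table
toy_means = toy_pixels.groupby("field_id").mean()
print(toy_means)
```

The result has one row per `field_id`, with the group key moved into the index.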

For our dataset we need to group-by `field_id` and `labels` (crop type column) and compute the mean of spectral reflectance values within each group. 

Pandas `DataFrame`s have a `groupby()` method that can take in a list of one or more column names. Calling this method returns a `GroupBy` object that creates groups from your dataset for each of the unique values of the grouping columns and can be used to apply summary operations to each group. 

In [None]:
# create a group using field id and crop type
tmp_df_groups = tmp_df.groupby(["field_id", "labels"])
print(tmp_df_groups)

Finally, we need to apply our summary operations to each group. We can do this by calling a function on the `GroupBy` object. A useful function for data exploration tasks is calling `size()` which tells us the number of observations in each group. 

In [None]:
# size of groups
tmp_df_groups.size()

By calling `size()` on the `GroupBy` object we can see that we have a group with a `field_id` of `81` and a crop type label of `1` (wheat). There are 70 observations in this group. 

If we want to compute the mean of the spectral reflectance values within each group we can call `mean()` instead of `size()`. 

In [None]:
# mean spectral reflectance values per group
tmp_df_groups.mean()

We're now in a position to update our data transformation routine to include converting the `ndarray` object in a tabular-like structure to Pandas `DataFrames`, data cleaning to drop `NaN` pixels, and computing the mean spectral reflectance values for each `field_id` and crop type `label` combination. 

In [None]:
# path to data
image_dir_path = os.path.join(os.getcwd(), "week-4", "images", "ref_agrifieldnet_competition_v1_source_0a664")

# stacking bands 

# Sentinel-2 band names 
s2_bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B09', 'B11', 'B12']

# empty list to append ndarray of reflectance value for each band to
bands = []
    
# loop over each band, read in the data from the corresponding GeoTIFF file into an ndarray
for b in s2_bands:
    print(f"reading {b}.tif")
    band_path = os.path.join(image_dir_path, b + ".tif")
    with rasterio.open(band_path) as src:
        # append the ndarray storing the Sentinel-2 reflectance data for a band to a list
        bands.append(src.read(1))

# stack all bands in the list to create a multiband raster
multiband_raster = np.stack(bands)
    
# make NDVI band
red = multiband_raster[3,:,:].astype("float64")
nir = multiband_raster[7,:,:].astype("float64")
ndvi = (nir-red)/(nir+red)
ndvi = np.expand_dims(ndvi, axis=0) # add a bands axis
multiband_raster = np.concatenate((multiband_raster, ndvi), axis=0) # stack the ndvi band
    
### HERE WE ARE STACKING THE FIELD ID BAND 
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    field_ids = src.read().astype("float64")
    field_ids[field_ids == 0] = np.nan
    multiband_raster = np.concatenate((multiband_raster, field_ids), axis=0)
    
## HERE WE ARE STACKING THE CROP TYPE LABELS BAND 
labels_path = os.path.join(image_dir_path, "label/raster_labels.tif")
with rasterio.open(labels_path) as src:
    multiband_raster = np.concatenate((multiband_raster, src.read()), axis=0)

# reshape multiband raster to tabular format
rows = multiband_raster.shape[1]
cols = multiband_raster.shape[2]
n_bands = multiband_raster.shape[0]
reshaped = multiband_raster.reshape(n_bands, rows*cols)
tabular = reshaped.T
    
### HERE WE CONVERT TO DATAFRAMES AND DROP NAN VALUES
tmp_df = pd.DataFrame(tabular, columns=s2_bands + ["ndvi", "field_id", "labels"])
dfs = tmp_df.dropna()
    
dfs = dfs.astype({"field_id": "int32", "labels": "int32"})
dfs = dfs.groupby(["field_id", "labels"]).mean().reset_index()

Let's quickly inspect the output. We should have one row per `field_id` and crop type `label` group. The `DataFrame` storing the results of this routine is referenced by the variable `dfs`.

In [None]:
dfs

## Raster-vector operations and vector operations

We're almost at the stage where we've processed a number of GeoTIFF files stored across many directories into a tabular dataset in a `DataFrame` object ready for machine learning. However, there are two more columns we need to create and append to the `DataFrame`. The first is a `geometry` column recording the centroid of each field. This allows us to keep a record of each field's geographic location. We'll also use this centroid to identify the district (an administrative boundary below the State-level in India) each field is located in.  

To compute the centroid for each field we need to perform some raster-vector operations where each raster dataset is converted to a vector dataset. This is called vectorisation and can be achieved using `rasterio`'s <a href="https://rasterio.readthedocs.io/en/latest/api/rasterio.features.html#rasterio.features.shapes" target="_blank">`shapes()` function</a> which returns the shape and value of connected regions in a raster dataset. Pixels belonging to the same field in a raster layer should be connected (i.e. their edges touch) and they should have the same value (field id). Thus, applying the `shapes()` function to the raster layer of field ids should return vector polygons for each field outline.

To do this we'll need to use the `field_ids.tif` file. Let's quickly inspect this file again.

In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    print(f"Printing metadata for field_ids.tif")
    pprint.pprint(src.meta)
    print("")

We have printed a dictionary object of metadata for the `field_ids.tif` file. 

Let's look at what the `shapes()` function returns for `field_ids.tif`. 

The `shapes()` function takes in a NumPy `ndarray` of raster values (generated by `src.read()` which reads the raster values from the GeoTIFF file into a NumPy `ndarray` in memory) and returns a generator object which generates a tuple for each shape in the raster data. The first element of the tuple is a dictionary object of coordinates and the type of geometry (e.g. point, line, polygon). The second element of the tuple is the attribute value that corresponds to the geometry. We can convert the generator into a list of tuples. 

![](https://github.com/data-analysis-3300-3003/figs/raw/main/week-4-raster-to-vector.jpg)


In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)
    
    # pretty print the first two elements of field_shapes
    pprint.pprint(field_shapes[0:2])

At this stage, we've converted our raster data to a list of numbers representing coordinates for the shape. We now need to turn this list of coordinates into a geometry object. In Python, geometries are represented as <a href="https://shapely.readthedocs.io/en/stable/geometry.html" target="_blank">Shapely</a> `Geometry` objects. The `geometry` column in a GeoPandas `GeoDataFrame` is a `Series` of Shapely `Geometry` objects.

To create a <a href="https://shapely.readthedocs.io/en/stable/geometry.html" target="_blank">Shapely</a> `Geometry` object we extract the list of coordinates and pass them into the `shapely.geometry.shape()` function. 
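A minimal sketch of this conversion, using a hand-written dictionary in the same GeoJSON-like structure that `shapes()` produces. The square polygon `toy_geometry_dict` is hypothetical and its coordinates are arbitrary.

```python
import shapely.geometry

# hypothetical GeoJSON-like dictionary: a 2 x 2 square
toy_geometry_dict = {
    "type": "Polygon",
    "coordinates": [[(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0), (0.0, 0.0)]],
}

# convert the dictionary to a Shapely Geometry object
toy_polygon = shapely.geometry.shape(toy_geometry_dict)

print(toy_polygon.geom_type)  # Polygon
print(toy_polygon.area)       # 4.0
```

Once the coordinates are wrapped in a `Geometry` object, properties such as the area or centroid are available directly on the object.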

Printing `geom` should return a list of Shapely `Geometry` objects. 

In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)

    # create a list of Shapely Geometry objects
    geom = []
    for s in field_shapes:
        geom.append(shapely.geometry.shape(s[0]))

    print(geom)

We can also plot an element of `geom` to show it is a Shapely `Geometry` object.

In [None]:
geom[0]

In [None]:
geom[1]

#### Recap quiz

<details>
    <summary><b>What object are we creating with <code>[]</code>?</b></summary>
An empty list object.
</details>

<p></p>

<details>
    <summary><b>What type of object is <code>geom</code> and what elements does it store?</b></summary>
It is a list object which is storing a list of Shapely <code>Geometry</code> objects.
</details>

Next, we compute the centroid for the polygon shape of the field. Computing a centroid is a geometry operation where the shape's geometry is converted from a polygon to a point feature. To efficiently compute the centroid for the shapes returned by `shapes()` we can convert the list of `Geometry` objects to a `GeoSeries` and then use the <a href="https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.centroid.html" target="_blank">`GeoSeries` `centroid` attribute</a> to return a `GeoSeries` of centroids. 

Inspecting this `GeoSeries` should reveal a sequence of point `Geometry` objects have been computed. This `GeoSeries` object is now referenced by `geom`.
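To see the centroid operation in isolation, here is a small sketch on two made-up rectangles. The `toy_polygons` object is hypothetical and has no coordinate reference system set, which is fine for illustrating the geometry operation but differs from the lab data.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# two hypothetical rectangles with arbitrary coordinates
toy_polygons = gpd.GeoSeries(
    [
        Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        Polygon([(10, 10), (14, 10), (14, 12), (10, 12)]),
    ]
)

# centroid returns a GeoSeries of point geometries, one per polygon
toy_centroids = toy_polygons.centroid

print(toy_centroids)
```

Each polygon is reduced to a single point, here at (1, 1) and (12, 11).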

In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)

    # create a list of Shapely Geometry objects
    geom = []
    for s in field_shapes:
        geom.append(shapely.geometry.shape(s[0]))

    # compute centroids
    geom = gpd.GeoSeries(geom, crs=src.crs).centroid

    print(geom)

Now that we have computed the centroid for each field, we can reproject the points to a common coordinate reference system, `EPSG:4326`, which uses latitude and longitude values, using the GeoPandas `to_crs()` method. We do this so that we can perform vector operations using two vector datasets in future tasks (it is important that vector datasets have the same coordinate reference system to get correct and intended results). 
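The sketch below shows the effect of `to_crs()` on a single point. The source coordinate reference system used here, `EPSG:32644` (UTM zone 44N), is an assumption made for illustration only; in the lab the source system is read from the GeoTIFF file via `src.crs`.

```python
import geopandas as gpd
from shapely.geometry import Point

# a single point in projected coordinates (metres)
# ASSUMPTION: EPSG:32644 is used only for illustration
toy_point = gpd.GeoSeries([Point(625350.0, 3010380.0)], crs="EPSG:32644")

# reproject to latitude and longitude
toy_point_ll = toy_point.to_crs("EPSG:4326")

print(toy_point.crs)
print(toy_point_ll.crs)
print(toy_point_ll.iloc[0])
```

After reprojection the x and y values are in degrees of longitude and latitude rather than metres.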

In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)

    # create a list of Shapely Geometry objects
    geom = []
    for s in field_shapes:
        geom.append(shapely.geometry.shape(s[0]))

    # compute centroids
    geom = gpd.GeoSeries(geom, crs=src.crs).centroid

    # reproject to EPSG 4326
    geom = geom.to_crs("EPSG:4326")

If we look at one of the objects generated by `features.shapes()`, it's a tuple whose first element stores the coordinates for connected raster pixels with the same value and whose second element is the value of those raster pixels. 

```
({'coordinates': [[(625350.0, 3010380.0),
                    (625370.0, 3010380.0),
                    (625370.0, 3010370.0),
                    (625390.0, 3010370.0),
                    (625390.0, 3010360.0),
                    (625410.0, 3010360.0),
                    (625410.0, 3010350.0),
                    (625430.0, 3010350.0),
                    (625430.0, 3010320.0),
                    (625420.0, 3010320.0),
                    (625420.0, 3010300.0),
                    (625410.0, 3010300.0),
                    (625410.0, 3010280.0),
                    (625400.0, 3010280.0),
                    (625400.0, 3010270.0),
                    (625390.0, 3010270.0),
                    (625390.0, 3010250.0),
                    (625380.0, 3010250.0),
                    (625380.0, 3010230.0),
                    (625370.0, 3010230.0),
                    (625370.0, 3010220.0),
                    (625360.0, 3010220.0),
                    (625360.0, 3010200.0),
                    (625350.0, 3010200.0),
                    (625350.0, 3010180.0),
                    (625310.0, 3010180.0),
                    (625310.0, 3010190.0),
                    (625290.0, 3010190.0),
                    (625290.0, 3010200.0),
                    (625270.0, 3010200.0),
                    (625270.0, 3010240.0),
                    (625280.0, 3010240.0),
                    (625280.0, 3010250.0),
                    (625290.0, 3010250.0),
                    (625290.0, 3010270.0),
                    (625300.0, 3010270.0),
                    (625300.0, 3010290.0),
                    (625310.0, 3010290.0),
                    (625310.0, 3010310.0),
                    (625320.0, 3010310.0),
                    (625320.0, 3010330.0),
                    (625330.0, 3010330.0),
                    (625330.0, 3010350.0),
                    (625340.0, 3010350.0),
                    (625340.0, 3010360.0),
                    (625350.0, 3010360.0),
                    (625350.0, 3010380.0)]],
   'type': 'Polygon'},
  1316.0)
```

#### Recap quiz

This is a challenging quiz question, have a go at each of the steps and follow the pointers to previous code snippets or docs before reviewing the answer. 

**1. We need to create another `Series` of field ids using the value of raster pixels corresponding to the coordinates in a tuple. We can do this by looping over `field_shapes` (the list of tuples) and accessing the element at index position one in the tuple.** *Have a look at how we looped over `field_shapes` and accessed the coordinates at index position 0 in `s` and appended the object to the list `geom`. Use this logic as a template for how you could access the value in index position 1 and append it to a list.*  

**2. Once you have created this `Series`, you should set its type to `int32` using the `astype()` method.** *These are the <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html" target="_blank">docs</a> for `astype()`.* 

**3. Then we need to combine the `Series` of field ids with the `GeoSeries` of field centroids in a `GeoDataFrame`. You can do this using the `GeoDataFrame()` constructor function: `field_id_df = gpd.GeoDataFrame({'field_id': field_ids, 'geometry': geom})` (`field_ids` is a `Series` of field id values and `geom` is a `GeoSeries` of points).**

**4. Finally, we need to drop rows in the `GeoDataFrame` with 0 values in the `field_id` column, as these shapes don't have a crop type label.** *You can use the `loc[]` indexer for this and refer back to the **Subsetting pandas `DataFrame`s** section from the self-guided lab in week 3*.  


In [None]:
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)

    # create a list of Shapely Geometry objects
    geom = []
    for s in field_shapes:
        geom.append(shapely.geometry.shape(s[0]))

    # compute centroids
    geom = gpd.GeoSeries(geom, crs=src.crs).centroid

    # reproject to EPSG 4326
    geom = geom.to_crs("EPSG:4326")
        
    ## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>
    
```python
field_id_path = os.path.join(image_dir_path, "field/field_ids.tif")
with rasterio.open(field_id_path) as src:
    # shapes is a generator
    shapes = features.shapes(src.read(), transform=src.transform)

    # list of geometry and shape value
    field_shapes = list(shapes)

    # create a list of Shapely Geometry objects
    geom = []
    for s in field_shapes:
        geom.append(shapely.geometry.shape(s[0]))

    # compute centroids
    geom = gpd.GeoSeries(geom, crs=src.crs).centroid

    # reproject to EPSG 4326
    geom = geom.to_crs("EPSG:4326")

    # create a Series of field ids
    field_ids = []
    for f in field_shapes:
        field_ids.append(f[1])

    field_ids = pd.Series(field_ids).astype("int32")

    # Combine the Series and GeoSeries into a GeoDataFrame
    field_id_df = gpd.GeoDataFrame({'field_id': field_ids, 'geometry': geom})

    # drop shapes with value 0
    field_id_df = field_id_df.loc[field_id_df["field_id"] > 0, :]

    print(field_id_df)
```
</details>

If you've successfully completed the recap quiz, you should have a variable `field_id_df` that references a `GeoDataFrame` with a column of `field_id` values and a `geometry` column of field centroids.

In [None]:
field_id_df

## Joins

### Key-based joins

We now have two separate data objects in our Python program. We have a `DataFrame` storing average spectral reflectance values for each field, field id, and crop type label attributes (this is referenced by the variable `dfs`). We also have a `GeoDataFrame` storing the field id attribute and the field centroid as a point `Geometry` (this is referenced by the variable `field_id_df`). 

When two tables have a matching column(s) we can use join operations to combine them. Rows in both tables are matched using common values in the matching column(s) and the joined table has columns from both tables. 

Joining tables is a common operation in relational databases using SQL and the same operations can be implemented in Pandas using <a href="https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging" target="_blank">`merge()`</a> functions. 

Some important concepts for join operations:

* The columns with values used to match rows are often called **keys**.
* **one-to-one** joins are where there is exactly one match between rows in the two tables being joined.
* **many-to-one** joins are where a row in one table can match one or more rows in another table.
* **left joins** keep all rows in the left table and only matching rows in the right table. 
* **inner joins** keep only matching rows in the left and right tables. 
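The difference between left and inner joins can be sketched with two small tables. The tables `left_table` and `right_table`, the column names, and the values are all hypothetical and unrelated to the lab data.

```python
import pandas as pd

# hypothetical tables sharing a key column
left_table = pd.DataFrame({"key": [1, 2, 3], "mean_ndvi": [0.5, 0.6, 0.7]})
right_table = pd.DataFrame({"key": [1, 2, 4], "district": ["A", "B", "C"]})

# left join: keep every row of the left table, fill unmatched rows with NaN
left_join = pd.merge(left=left_table, right=right_table, how="left", on="key")
print(left_join)

# inner join: keep only rows whose key appears in both tables
inner_join = pd.merge(left=left_table, right=right_table, how="inner", on="key")
print(inner_join)
```

The left join keeps the row with key 3 (with a missing `district`), while the inner join drops it; the row with key 4 appears in neither result because it only exists in the right table.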


The Pandas <a href="https://pandas.pydata.org/docs/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging" target="_blank">`merge()`</a> docs and <a href="https://wesmckinney.com/book/data-wrangling.html#prep_merge_join" target="_blank">McKinney (2022)</a> provide useful explanations for how join operations work.

![](https://github.com/data-analysis-3300-3003/figs/raw/main/week-4-joins.jpg)

Let's consider these concepts in the context of joining our `DataFrame` `dfs`, which stores average spectral reflectance values and crop type labels, and our `GeoDataFrame` `field_id_df`, which stores the field centroids. 

The matching column in both tables is `field_id`. This is the joining key. 

We are joining the two tables on `field_id` which should be unique to each field. Therefore, we are implementing a one-to-one join. 

As we're using the field centroids for subsequent operations, we only want to keep fields that have a centroid value. Therefore, we'll use an inner join.

Pandas `merge()` function expects the following arguments:

* `left` - left table in the join.
* `right` - right table in the join.
* `how` - whether to use a left or inner join.
* `left_on` - columns in left table to use as keys.
* `right_on` - columns in the right table to use as keys.

#### Recap quiz

**Can you use the `merge()` function to perform an inner join using the `field_id` column combining `dfs` and `field_id_df`? If the join is successful you should see a `geometry` column appended to the columns in `dfs`. Assign the result of this `merge()` to the variable `joined_df`.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
joined_df = pd.merge(left=dfs, right=field_id_df, how="inner", left_on=["field_id"], right_on=["field_id"])
# display only the first few rows
joined_df.head()
```
</details>

In [None]:
# convert joined_df to GeoDataFrame
joined_df = gpd.GeoDataFrame(joined_df, geometry=joined_df.geometry, crs="EPSG:4326")
type(joined_df)

### Spatial Joins

Spatial join operations join the attributes of two vector layers based on their relationship in space. For example, if we have a `GeoDataFrame` storing field boundaries (polygon geometries) and field attributes and another `GeoDataFrame` storing shire boundaries (polygon geometries) and a shire name as an attribute, we can join the two tables based on the largest intersection (overlap) between field boundaries and shire boundaries. If the field boundaries `GeoDataFrame` was the left table in the spatial join, for each row (or geometry feature) the shire name from the shire with the largest intersection would be joined to that table in a new column. 

GeoPandas provides an <a href="https://geopandas.org/en/stable/docs/user_guide/mergingdata.html#spatial-joins" target="_blank">`sjoin()` function</a> that can be used for spatial joins of two `GeoDataFrames`. The `sjoin()` function expects the following as arguments:

* `left_df` - left `GeoDataFrame` in the spatial join.
* `right_df` - right `GeoDataFrame` in the spatial join - columns from the `right_df` will be joined to `left_df`.
* `how` - whether to use a left, inner, or right join.
* `predicate` - a binary predicate that defines the spatial relationship between features in `right_df` and `left_df`. 

Binary predicates that can be used are:

* intersects
* contains
* crosses
* within
* touches
* overlaps

Intersects is the default predicate for spatial joins in GeoPandas. 
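
To make the arguments above more concrete, here is a minimal sketch of a point-in-polygon spatial join on toy data. The `points` and `regions` `GeoDataFrames`, their column names, and their coordinates are hypothetical and are not part of the lab dataset.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical point layer: two points fall inside a region, one falls outside
points = gpd.GeoDataFrame(
    {"field_id": [1, 2, 3]},
    geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5.0, 5.0)],
    crs="EPSG:4326",
)

# Hypothetical polygon layer: two adjacent unit squares
regions = gpd.GeoDataFrame(
    {"region": ["west", "east"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)],
    crs="EPSG:4326",
)

# Inner join: only points that intersect a region are kept
inner = gpd.sjoin(left_df=points, right_df=regions, how="inner", predicate="intersects")

# Left join: all points are kept; unmatched points get NaN for `region`
left = gpd.sjoin(left_df=points, right_df=regions, how="left", predicate="intersects")

print(inner[["field_id", "region"]])
print(left[["field_id", "region"]])
```

Both layers use the same coordinate reference system here. In practice it is worth checking that the `crs` of the two `GeoDataFrames` match before performing a spatial join.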

To complete our data transformation routine we need to add a column to `joined_df` that stores the District that the field is located in. We can do this using a spatial join based on the intersection of the field's centroid (point geometry) and the shape of the District (polygon geometry). 

First, we need to read in the District geometries for India, obtained from <a href="https://www.geoboundaries.org" target="_blank">geoBoundaries</a>. 

In [None]:
india_districts = gpd.read_file(os.path.join(os.getcwd(), "week-4", "india-adm", "geoBoundaries-IND-ADM2_simplified.topojson"))
india_districts.head()

Let's quickly tidy up the India Districts `GeoDataFrame` to keep only the District name and `geometry` columns.

In [None]:
india_districts = india_districts.loc[:, ["shapeName", "geometry"]]
india_districts.columns = ["district", "geometry"]
india_districts = india_districts.set_crs("EPSG:4326")
india_districts.head()

In [None]:
india_districts.plot(column="district")

That looks like India. Let's implement our final data transformation step and perform a spatial join to add a District column to `joined_df`.

#### Recap quiz

**Use the GeoPandas docs to implement a spatial join with the <a href="https://geopandas.org/en/stable/docs/reference/api/geopandas.sjoin.html" target="_blank">`sjoin()` function</a> that joins `india_districts` (as `right_df`) to `joined_df` (as `left_df`) using an inner join and `intersects` predicate. Assign the result to the variable `joined_df_district`.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
# spatial join
joined_df_district = gpd.sjoin(
    left_df=joined_df, 
    right_df=india_districts, 
    how="inner", 
    predicate="intersects"
)
joined_df_district.head()
```
</details>

### Save file

Finally, let's write our processed data to file, ready for training and testing a machine learning model.

#### Recap quiz

**Can you save the data referenced by `joined_df_district` to a GeoJSON file on disk? Save the data with the filename `processed_data.geojson` at the path created by `os.path.join(os.getcwd(), "week-4", "processed_data.geojson")`.**

In [None]:
## ADD CODE HERE ##

<details>
    <summary><b>answer</b></summary>

```python
# save file
out_path = os.path.join(os.getcwd(), "week-4", "processed_data.geojson")
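# set the driver explicitly so the output format does not depend on the GeoPandas version
joined_df_district.to_file(out_path, driver="GeoJSON")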
```
</details>