# Processing boundary data for efficiency

Boundary data can be highly detailed, which is useful when carrying out precise assessments of small local areas.

However, that level of detail is often more than we need when attempting to assess the whole world.

There are several options available to help us reduce the size and quantity of data in use. Here we will cover two main approaches: simplifying polygons and removing small shapes.

Recall that in week 4 we used Rwanda as an example when processing regional boundaries, using the code below.

Today we can do the same for Grenada. Let us first process the level 1 regions, as follows:


In [2]:
# Example
import os
import pandas
import geopandas

path = os.path.join('..', 'data', 'countries.csv')
countries = pandas.read_csv(path, encoding='latin-1')

for idx, country in countries.iterrows():
    
    if not country['iso3'] == 'GRD': # if the current country iso3 does not match GRD...
        continue                     # continue in the loop to the next country 
    
    iso3 = country['iso3']
    gid_region = country['gid_region']
    
    country_folder_path = os.path.join('..', 'data', 'processed', iso3)
    if not os.path.exists(country_folder_path):
        os.makedirs(country_folder_path)
    
    regions_folder_path = os.path.join('..', 'data', 'processed', iso3, 'regions')
    if not os.path.exists(regions_folder_path):
        os.makedirs(regions_folder_path)
        
    filename = 'gadm36_{}.shp'.format(gid_region)
    global_boundaries_path = os.path.join('..', 'data', 'raw', 'gadm36_levels_shp', filename) 
    global_boundaries = geopandas.read_file(global_boundaries_path)
    
    country_boundaries = global_boundaries[global_boundaries['GID_0'] == iso3]
    
    path_out = os.path.join('..', 'data', 'processed', iso3, 'regions', filename)
    country_boundaries.to_file(path_out)
    
    print("Processing complete for {}".format(country['iso3']))



GRD 1


# Simplifying boundaries

Now that we have our country and other data folders set up, we can begin processing our boundaries.

For example, the `geometry.simplify()` method allows us to make boundaries less detailed (and thus smaller in file size).

See the `geopandas` documentation for detailed information: https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.simplify.html

*The algorithm (Douglas-Peucker) recursively splits the original line into smaller parts and connects these parts’ endpoints by a straight line. Then, it removes all points whose distance to the straight line is smaller than tolerance. It does not move any points and it always preserves endpoints of the original line or polygon*
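
To make the description above more concrete, here is a minimal pure-Python sketch of the Douglas-Peucker idea. It is not the implementation that `geopandas` calls internally, and the helper names `point_to_segment_distance` and `douglas_peucker` are hypothetical names chosen for this illustration. It works on a plain list of `(x, y)` tuples rather than on real boundary data.

```python
import math


def point_to_segment_distance(p, a, b):
    """Distance from point p to the straight segment a-b (all (x, y) tuples)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # a and b are the same point
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment, clamping so we stay between a and b
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Return a simplified copy of points, always keeping both endpoints."""
    if len(points) < 3:
        return list(points)
    start, end = points[0], points[-1]
    # Find the interior point furthest from the straight segment start-end
    distances = [point_to_segment_distance(p, start, end) for p in points[1:-1]]
    max_distance = max(distances)
    index = distances.index(max_distance) + 1
    if max_distance < tolerance:
        # Every interior point is closer than the tolerance, so drop them all
        return [start, end]
    # Otherwise split at the furthest point and simplify each half
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right  # avoid repeating the split point


# A made-up line, purely for illustration
line = [(0, 0), (1, 0.1), (2, -0.1), (3, 5), (4, 6),
        (5, 7), (6, 8.1), (7, 9), (8, 9), (9, 9)]

for tolerance in (0.05, 0.5, 5.0):
    simplified = douglas_peucker(line, tolerance)
    print(tolerance, len(simplified), simplified)
```

Note that the sketch only ever removes points, so the output is always a subset of the input with the same first and last point. It makes no attempt to handle the `preserve_topology` option.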

The key parameter is `tolerance`, which affects the degree of simplification undertaken:

*All parts of a simplified geometry will be no more than tolerance distance from the original. It has the same units as the coordinate reference system of the GeoSeries. For example, using tolerance=100 in a projected CRS with meters as units means a distance of 100 meters in reality.*
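
Because the tolerance uses the units of the CRS, a tolerance given for data stored in `epsg:4326` is measured in degrees, not meters. The rough conversion below is a sketch to help build intuition. It assumes a constant of about 111,320 meters per degree and a latitude of roughly 12.1 degrees north for Grenada; both are approximations, and `tolerance_in_meters` is a hypothetical helper rather than part of `geopandas`.

```python
import math

METERS_PER_DEGREE = 111_320  # approximate ground length of one degree


def tolerance_in_meters(tolerance_degrees, latitude_degrees):
    """Approximate north-south and east-west distance of a tolerance in degrees."""
    north_south = tolerance_degrees * METERS_PER_DEGREE
    # Lines of longitude converge towards the poles, so east-west shrinks
    east_west = north_south * math.cos(math.radians(latitude_degrees))
    return north_south, east_west


# Approximate latitude for Grenada (an assumption for illustration)
ns, ew = tolerance_in_meters(0.01, 12.1)
print("tolerance=0.01 degrees is roughly {:.0f} m north-south "
      "and {:.0f} m east-west".format(ns, ew))
```

So a tolerance of 0.01 in degrees corresponds to a distance on the order of one kilometer near Grenada, and the east-west distance gets smaller at higher latitudes.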

Below is a demo of how we would use `simplify()` on the Grenada regions.

In [4]:
for idx, country in countries.iterrows():
    
    if not country['iso3'] == 'GRD': # if the current country iso3 does not match GRD...
        continue                     # continue in the loop to the next country 
    
    iso3 = country['iso3']
    gid_region = country['gid_region']
    
    regions_folder_path = os.path.join('..', 'data', 'processed', iso3, 'regions')
    if not os.path.exists(regions_folder_path):
        os.makedirs(regions_folder_path)
        
    filename = 'gadm36_{}.shp'.format(gid_region)
    path_in = os.path.join('..', 'data', 'processed', iso3, 'regions', filename) 
    boundaries = geopandas.read_file(path_in)
    
    # this is how we simplify the geometries
    boundaries["geometry"] = boundaries.geometry.simplify(
        tolerance=0.01, preserve_topology=True)
    
    filename = 'gadm36_{}_simplified.shp'.format(gid_region)
    path_out = os.path.join('..', 'data', 'processed', iso3, 'regions', filename) 
    boundaries.to_file(path_out, crs='epsg:4326')



# Exercise 1.1

Explore multiple different tolerance levels, and try to select one that best balances resolution vs simplification.


In [None]:
# Example code here


# Exercise 1.2

How does the `preserve_topology` parameter affect the output? Write an explanation for both this and the `tolerance` parameter.

In [None]:
# Example code here


# Removing small shapes

Often there may be many small shapes which are unnecessary when we want to visualize results for a very large area. 

For example, small coastal islands will not be visible on a global map, so it is best to remove them for efficiency. 

The following function is applied to each row of a geopandas dataframe and removes any small geometries present (based on your size preference).

The function works by: 

- Firstly, finding out if the `Shapely` geometry is a polygon or multipolygon object (polygons are returned unchanged).
- Secondly, allocating an area threshold value to any multipolygon (depending on the country and its size).
- Finally, dropping shapes below the area threshold and returning a multipolygon.


In [37]:
# Example
from shapely.geometry import MultiPolygon

def remove_small_shapes(x):
    """
    Remove small shapes from a multipolygon.

    Parameters
    ----------
    x : pandas.Series
        Row of a geopandas dataframe, with 'geometry' and 'GID_0' entries.

    Returns
    -------
    geometry : Polygon or MultiPolygon
        Shapely geometry without tiny shapes.

    """
    if x.geometry.geom_type == 'Polygon':
        return x.geometry

    elif x.geometry.geom_type == 'MultiPolygon':

        area1 = 0.003  # multipolygons smaller than this are left untouched
        area2 = 50     # multipolygons larger than this get a coarser threshold

        if x.geometry.area < area1:
            return x.geometry

        if x['GID_0'] in ['CHL', 'IDN', 'RUS', 'GRL', 'CAN', 'USA']:
            threshold = 0.01
        elif x.geometry.area > area2:
            threshold = 0.1
        else:
            threshold = 0.001

        # keep only the parts of the multipolygon above the threshold
        new_geom = []
        for y in list(x['geometry'].geoms):
            if y.area > threshold:
                new_geom.append(y)

        return MultiPolygon(new_geom)

    # any other geometry type is returned unchanged
    return x.geometry
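
To see the thresholding step in isolation, here is a small sketch that does not need `shapely` at all. It represents each part of a made-up multipolygon as a plain list of `(x, y)` tuples and computes areas with the shoelace formula. The helpers `ring_area` and `drop_small_parts` and all of the coordinates are hypothetical and exist only for this illustration.

```python
def ring_area(ring):
    """Area enclosed by a ring of (x, y) tuples, via the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def drop_small_parts(parts, threshold):
    """Keep only the parts whose area is larger than the threshold."""
    return [part for part in parts if ring_area(part) > threshold]


# A made-up multipolygon: one large island and two tiny islets
mainland = [(0.0, 0.0), (0.3, 0.0), (0.3, 0.2), (0.0, 0.2)]     # area 0.06
islet_a = [(0.5, 0.0), (0.52, 0.0), (0.52, 0.02), (0.5, 0.02)]  # area 0.0004
islet_b = [(0.6, 0.1), (0.61, 0.1), (0.61, 0.11), (0.6, 0.11)]  # area 0.0001

# With a threshold of 0.001 only the mainland is kept
kept = drop_small_parts([mainland, islet_a, islet_b], threshold=0.001)
print(len(kept))
```

Lowering the threshold keeps more of the small parts, which is the same trade-off controlled by the `threshold` values in `remove_small_shapes`.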

Now we can have a go at excluding small shapes as follows:

In [38]:
for idx, country in countries.iterrows():
    
    if not country['iso3'] == 'GRD': # if the current country iso3 does not match GRD...
        continue                     # continue in the loop to the next country 
    
    iso3 = country['iso3']
    gid_region = country['gid_region']
    
    regions_folder_path = os.path.join('..', 'data', 'processed', iso3, 'regions')
    if not os.path.exists(regions_folder_path):
        os.makedirs(regions_folder_path)
        
    filename = 'gadm36_{}.shp'.format(gid_region)
    path_in = os.path.join('..', 'data', 'processed', iso3, 'regions', filename) 
    boundaries = geopandas.read_file(path_in)
    
    # this is how we drop small geometries using the remove_small_shapes function
    boundaries['geometry'] = boundaries.apply(
        remove_small_shapes, axis=1)
        
    filename = 'gadm36_{}_removed_small_shapes.shp'.format(gid_region)
    path_out = os.path.join('..', 'data', 'processed', iso3, 'regions', filename) 
    boundaries.to_file(path_out, crs='epsg:4326')



Take a moment to rerun the function and processing code, after changing the `area1` threshold value. 

How does it affect the boundary processing?

# Exercise 1.3

Can you see any problems with the current function? How would you rewrite the code and function to accept a projected CRS? 

Hint: Recall from week 4 how you convert your geopandas dataframe geometries to a projected CRS.


In [23]:
# Example code here


# Exercise 1.4

How would you combine both processing steps to simplify geometries and remove small shapes? 

Write out your shapes to your regions folder.


In [None]:
# Example code here


# Exercise 1.5

Scale your processing code for all countries in North America.

Hint: Within your `countries.csv` file you have a continent column.

In [24]:
# Example code here
