<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#What?" data-toc-modified-id="What?-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>What?</a></span></li><li><span><a href="#Why?" data-toc-modified-id="Why?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Why?</a></span></li><li><span><a href="#How?" data-toc-modified-id="How?-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>How?</a></span></li><li><span><a href="#The-Data" data-toc-modified-id="The-Data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>The Data</a></span><ul class="toc-item"><li><span><a href="#Select-a-land-cover-type" data-toc-modified-id="Select-a-land-cover-type-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Select a land cover type</a></span></li></ul></li><li><span><a href="#Method-1:-rioxarray-and-geopandas" data-toc-modified-id="Method-1:-rioxarray-and-geopandas-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Method 1: <code>rioxarray</code> and <code>geopandas</code></a></span></li></ul></div>

# Polygonalize your data

## What?

**Polygonalize**: Convert your data from *raster* to *vector*.

## Why?

Polygonalization is most useful when you have a raster of *categorical* data such as land-cover or ecosystem types. You can use the resulting polygons to crop and clip additional datasets, particularly ones with a different grid.

Note that polygonalization can be an intensive process for large datasets! Consider computing and applying a mask instead if your datasets have matching grids.
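
When two rasters share a grid, a boolean mask can do the same job as a polygon clip with much less work. Here is a minimal sketch using small made-up numpy arrays (the arrays and the class code are illustrative only; with xarray, `DataArray.where` plays the same role):

```python
import numpy as np

# Hypothetical 3x3 land-cover raster and a second raster on the same grid
land_cover = np.array([[71, 71, 41],
                       [71, 42, 41],
                       [82, 82, 71]], dtype="uint8")
elevation = np.arange(9, dtype="float32").reshape(3, 3)

# Boolean mask: True wherever the land-cover class is 71
mask = land_cover == 71

# Keep elevation only where the mask is True; everything else becomes NaN
elevation_masked = np.where(mask, elevation, np.nan)
print(elevation_masked)
```

No geometry is ever built, so this scales much better than polygonalizing first.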

## How?

We'll explore two methods here:
  * [geopandas and rioxarray, from spatial-dev.guru](https://spatial-dev.guru/2022/04/16/polygonize-raster-using-rioxarray-and-geopandas/) - easy to parallelize
  * gdal_polygonize - takes care of the complexities

In [1]:
import os

from dask.distributed import Client
import geopandas as gpd
import rioxarray as rxr

## The Data

We'll use National Land Cover Database (NLCD) 2019 CONUS data. From the description:
  > The National Land Cover Database (NLCD) provides nationwide data on land cover and land cover change at a 30m resolution with a 16-class legend based on a modified Anderson Level II classification system. NLCD 2019 represents the latest evolution of NLCD land cover products focused on providing innovative land cover and land cover change data for the Nation.

You can download the NLCD from the [Multi-Resolution Land Characteristics Consortium data download site](https://www.mrlc.gov/data).

Some key details about the NLCD data:
  - About 30 GB: 'medium-sized' data that could be challenging for personal computers
  - [ERDAS IMAGINE](https://www.loc.gov/preservation/digital/formats/fdd/fdd000420.shtml) proprietary format, with .img extension
  - Only 16 classes, so the values fit in a uint8 datatype (and MUCH smaller files)
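
To get a feel for why the datatype matters, the sketch below compares the memory footprint of the same made-up 1000 x 1000 grid stored as `uint8` and as `float32`. Opening a raster with `masked=True` converts it to a floating-point type so that missing values can be stored as NaN, so the in-memory array can be several times larger than the uint8 values would suggest.

```python
import numpy as np

shape = (1000, 1000)  # small illustrative grid, not the NLCD shape

as_uint8 = np.zeros(shape, dtype="uint8")
as_float32 = as_uint8.astype("float32")

print(as_uint8.nbytes)    # 1 byte per pixel
print(as_float32.nbytes)  # 4 bytes per pixel
```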

In [2]:
# Import the data using rioxarray
nlcd_path = os.path.join(
    '..', 'data', 
    'nlcd_2019_land_cover_l48_20210604',
    'nlcd_2019_land_cover_l48_20210604.img')
nlcd_raw = rxr.open_rasterio(nlcd_path, masked=True).squeeze()
nlcd_raw

We aren't going to be able to do much with this `DataArray` because it is ~30GB (my computer has 32GB of RAM, and that's high for a personal computer). As a rule of thumb, your computer will start having trouble with a dataset as its size approaches the size of your computer's RAM.

There are lots of ways to improve performance. We already talked about:
  * chunking
  * local clusters

In [3]:
# This is a medium/large dataset - we need chunks!
nlcd_raw = rxr.open_rasterio(
    nlcd_path, 
    masked=True,
    chunks='auto').squeeze()
nlcd_raw

|       | Array            | Chunk          |
| ----- | ---------------- | -------------- |
| Bytes | 62.70 GiB        | 484.00 MiB     |
| Shape | (104424, 161190) | (11264, 11264) |
| Count | 301 Tasks        | 150 Chunks     |
| Type  | float32          | numpy.ndarray  |


How many chunks?

  * Too many -> lots of overhead from scheduling and tracking many small tasks
  * Too few -> each chunk is too big for memory to handle

We'll let xarray take care of it by using the 'auto' value, but tweaking this parameter can improve performance dramatically.
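
The numbers in the array summary above can be checked by hand. This sketch uses only the shape, chunk shape, and float32 type from that summary; the variable names are my own.

```python
import math

shape = (104424, 161190)   # rows, columns from the array summary
chunk = (11264, 11264)     # chunk shape chosen by chunks='auto'
bytes_per_pixel = 4        # float32

# Number of chunks along each axis (the last chunk on each axis is partial)
n_chunks = math.ceil(shape[0] / chunk[0]) * math.ceil(shape[1] / chunk[1])

# Size of one full chunk and of the whole array
chunk_mib = chunk[0] * chunk[1] * bytes_per_pixel / 2**20
total_gib = shape[0] * shape[1] * bytes_per_pixel / 2**30

print(n_chunks, chunk_mib, round(total_gib, 2))
```

Halving the chunk edge length would quadruple the number of chunks, which is the trade-off between overhead and memory described above.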

In [5]:
# Set up a local cluster with 1 worker process running 4 threads
client = Client(n_workers=1, threads_per_worker=4)
client.cluster


Check out the [dask Worker documentation](https://distributed.dask.org/en/latest/worker.html#memtrim). We need to decide whether to use multiple:
 * **processes**, which each have their own memory, and/or
 * **threads**, which share memory

Communication among processes adds overhead because items in memory must be copied and passed between them. However, Python has something called the Global Interpreter Lock (GIL), which allows only one thread at a time to run Python code within a process.

Libraries like numpy (and xarray, which builds on it) are able to offload work to compiled linear algebra routines that run outside the GIL (this might show up as a kernel task on your computer), so the GIL is not an issue and we may as well use threads and avoid process overhead.

Other types of computations will run into a GIL logjam, and so it's best to have multiple Python processes running without threading.
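
One way to write this rule of thumb down is a small helper that picks `Client` arguments depending on whether the workload releases the GIL. The helper `cluster_kwargs` and its parameters are hypothetical, made up for this sketch; only `n_workers` and `threads_per_worker` are real `Client` arguments.

```python
def cluster_kwargs(releases_gil: bool, n_cores: int = 4) -> dict:
    """Suggest dask Client arguments (hypothetical helper, rule of thumb only)."""
    if releases_gil:
        # numpy-style work: one process, many threads sharing memory
        return {"n_workers": 1, "threads_per_worker": n_cores}
    # Pure-Python work: many single-threaded processes, each with its own GIL
    return {"n_workers": n_cores, "threads_per_worker": 1}


print(cluster_kwargs(releases_gil=True))
print(cluster_kwargs(releases_gil=False))

# Usage (assumes dask.distributed is installed):
#   from dask.distributed import Client
#   client = Client(**cluster_kwargs(releases_gil=True))
```

Real workloads are often a mix of both, so treat the two branches as starting points to benchmark rather than fixed answers.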

### Select a land cover type
We can check out the `nlcd_2019_land_cover_l48_20210604.xml` file to learn more about the land cover classes.

Here are some of the classes, taken from the documentation:

| Name                         | Value |
| ---------------------------- | ----- |
| Developed Open Space         | 21    |
| Developed, High Intensity    | 24    |
| Barren Land                  | 31    |
| Deciduous Forest             | 41    |
| Evergreen Forest             | 42    |
| Dwarf Scrub                  | 51    |
| Grassland/Herbaceous         | 71    |
| Pasture/Hay                  | 81    |
| Cultivated Crops             | 82    |
| Woody Wetlands               | 90    |

Notice that the data publishers have left space in-between the classes as a method of grouping and in case they need to add classes in future years.

We will look at the Grassland/Herbaceous class (value 71) together, but you can try a different one if you like.
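
If you do switch classes, a small lookup table keeps the magic numbers out of your code. The dictionary below copies the values from the table above; the names `NLCD_CLASSES` and `class_value` are my own.

```python
# Class values copied from the table above (a subset of the NLCD legend)
NLCD_CLASSES = {
    "Developed Open Space": 21,
    "Developed, High Intensity": 24,
    "Barren Land": 31,
    "Deciduous Forest": 41,
    "Evergreen Forest": 42,
    "Dwarf Scrub": 51,
    "Grassland/Herbaceous": 71,
    "Pasture/Hay": 81,
    "Cultivated Crops": 82,
    "Woody Wetlands": 90,
}


def class_value(name: str) -> int:
    """Look up the raster value for a land cover class name."""
    try:
        return NLCD_CLASSES[name]
    except KeyError:
        options = sorted(NLCD_CLASSES)
        raise KeyError(f"Unknown class {name!r}; choose from {options}") from None


print(class_value("Grassland/Herbaceous"))
```

The result can then stand in for the hard-coded number, for example `nlcd_raw.where(nlcd_raw == class_value("Evergreen Forest"))`.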

You can try running this without chunks. It crashed my kernel! But with chunks, xarray defers the computation, so the cell runs instantly.

In [6]:
# Select only the grassland pixels
nlcd_grassland = nlcd_raw.where(nlcd_raw==71)
nlcd_grassland

|       | Array            | Chunk          |
| ----- | ---------------- | -------------- |
| Bytes | 62.70 GiB        | 484.00 MiB     |
| Shape | (104424, 161190) | (11264, 11264) |
| Count | 601 Tasks        | 150 Chunks     |
| Type  | float32          | numpy.ndarray  |


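We need to get rid of data ASAP!

When we use `drop=True`, xarray trims away the rows and columns that contain no grassland pixels. That gets rid of a little data, but it also forces xarray to compute the condition right away.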

In [None]:
# Select grassland and crop
nlcd_grassland = nlcd_raw.where(nlcd_raw==71, drop=True)
nlcd_grassland
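
To see what `drop=True` does on a small scale, here is a sketch on a made-up 4 x 4 array (the values and coordinates are invented for illustration). Without `drop`, the shape is unchanged and non-matching pixels become NaN; with `drop=True`, rows and columns that contain no matching pixel are removed as well.

```python
import numpy as np
import xarray as xr

# Toy raster: class 71 appears only in the middle rows and columns
toy = xr.DataArray(
    np.array([[11, 11, 11, 11],
              [11, 71, 71, 11],
              [11, 71, 42, 11],
              [11, 11, 11, 11]], dtype="uint8"),
    dims=("y", "x"),
    coords={"y": [3, 2, 1, 0], "x": [0, 1, 2, 3]},
)

masked = toy.where(toy == 71)              # same shape, NaN elsewhere
cropped = toy.where(toy == 71, drop=True)  # empty rows/columns removed

print(masked.shape)
print(cropped.shape)
```

Note that the cropped array can still contain NaN (the class-42 pixel here), so a later `dropna` step is still needed.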

## Method 1: `rioxarray` and `geopandas`

Adapted from [spatial-dev.guru](https://spatial-dev.guru/2022/04/16/polygonize-raster-using-rioxarray-and-geopandas/)

The basic workflow is:
  1. Calculate the centroids of each pixel in the class
  2. Buffer the centroids to the pixel extent
  3. Merge the buffer polygons
  
This method will include edge pixels in their entirety, and the results may overlap slightly with neighboring polygons. We should keep an eye on the edge effects.
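
The three steps can be sketched on a handful of made-up pixel centroids. This uses `shapely.geometry.box` to build each square pixel footprint directly, as a stand-in for a square-capped buffer; the coordinates are invented, and the tutorial's own code may differ in detail.

```python
from shapely.geometry import box
from shapely.ops import unary_union

resolution = 30.0  # pixel size in metres (NLCD is 30 m)
half = resolution / 2

# Step 1: hypothetical pixel centroids (x, y).
# The first three pixels touch each other; the last one is isolated.
centroids = [(15.0, 15.0), (45.0, 15.0), (45.0, 45.0), (135.0, 135.0)]

# Step 2: expand each centroid to the full pixel footprint
pixels = [box(x - half, y - half, x + half, y + half) for x, y in centroids]

# Step 3: merge touching pixels into polygons
merged = unary_union(pixels)

print(merged.geom_type, len(merged.geoms), merged.area)
```

The three touching pixels dissolve into one L-shaped polygon and the isolated pixel stays separate, so the total area equals four pixels.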

In [None]:
nlcd_df = nlcd_grassland.to_dataframe(name='grassland')

In [9]:
nlcd_df.dropna(inplace=True)
nlcd_df.reset_index(inplace=True)
nlcd_df


In [None]:
# Build point geometries from the pixel centroid coordinates
nlcd_gdf = gpd.GeoDataFrame(
    nlcd_df, 
    geometry=gpd.points_from_xy(nlcd_df.x, nlcd_df.y))

In [None]:
# Don't forget to close the dask Client!
client.close()