vignettes/icosa_2_essentials.Rmd

---
title: "2. Essentials: Binning Point Data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{2. Essentials: Binning Point Data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---


```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(terra)
library(icosa)
library(rgl)
library(knitr)
knit_hooks$set(rgl = hook_rgl)
```

## Introduction

The purpose of this tutorial is to demonstrate the basic usage of the ``icosa`` package. The primary targeted application of the package is in global biological sciences (e.g. in macroecological, biogeographical analyses), but other fields might find the structures and procedures relevant, given that they operate with point vector data and variables that depend on the density of such data. 

Note that this is just a brief introduction to the package's capabilities. The complete documentation of the package and the rest of the tutorials will be made available on the [package website](https://adamkocsis.github.io/icosa/). Relevant how-to guides will be posted on the evolv-ED blog (https://www.evolv-ed.net/). 

* * *

## Point Data 

The primary problem with ecological samples is that due to density and uniformity issues, the data points are to be aggregated to distinct units. As coordinate recording is very efficient on the 2d surface of a polar coordinate system (i.e. latiude and longitude data), this was primarly achieved by rectangular gridding of the surface (for instance 1° times 1° grid cells). Unfortunately, this method suffers from systematic biasing effects: as the poles are approached, the cells become smaller in area, and come closer to each other. 

#### Random points on the sphere

We can illustrate this problem by generating random points and assign them to different spatial bins. The `icosa` package has a utility function to generate such points:

```{r icosa, eval=FALSE}
library(icosa)
```

The function `rpsphere()` generates randomly distributed points on the surface of a sphere. 

```{r points}
# set seed for exact reproducibility
set.seed(0)
# 10000 random points
randPoints <- rpsphere(10000, output="polar")
# the first 6 points
head(randPoints)
```

#### Land Polygons

You can visualize this pointset and plot them on a map, with some example data, such as the [land polygons from Natural Earth](https://www.naturalearthdata.com/downloads/110m-physical-vectors/) - which are included in the package. You can load and plot such vector data with the [`sf`](https://cran.r-project.org/package=sf) package. You can install it from the CRAN with the regular `install.packages()`

```{r sf}
library(sf)
```

Once the package is loaded, you can load the data with `sf::st_read()`:

```{r vispoints, fig.width=9, fig.height=4.5}
# reading in example data
ne <- sf::st_read(file.path(system.file(package="icosa"), "extdata/ne_110m_land.shx"))

# plotting the world map
plot(ne$geometry, col="gray", border=NA)

# plotting the point cloud
points(randPoints, col="#FF000055", pch=3)
```

Note that the points appear to have a non-random distribution: they seem to have higher density in lower-latitude areas. But this is in fact a distorting effect of not the points, but the [equirectangular map projection](https://en.wikipedia.org/wiki/Equirectangular_projection) itself! The points are are created with 3D normal distributions and then they are projected to the surface of the sphere around the center of this distribution.


#### Binning with a Gaussian (UV)-grid

Let's count these data points in a simple longitude-latitude (Gaussian, polar, or UV) grid, which you can easilly implement with the [`terra`](https://cran.r-project.org/package=terra) package (i.e. raster)! Similar to `sf` this package can be installed from the CRAN with `install.packages()`

```{r terra}
library(terra)
```

Let's make a 10x10° raster grid, and count the number of points in every cell: 

```{r raster}
# a 10-degree resolution grid
r <- rast(res=10)

# count the number of points in raster cells
counts <- rasterize(randPoints,r, fun=length)
counts
```

The counts of these points can be easilly visualized with a plot.

```{r rastplot, fig.width=10, fig.height=5}
# the raster itself
plot(counts)

# plotting the map on plot
plot(ne$geometry, col=NA, border="black", add=TRUE)
```


The point counts have a similar latitudinal pattern as we saw with their distribution: as latitude increases, the cell sizes decrease and the expected number of points in them decreases.0

One solution to this problem can be the use of a different projection. For instance the [Lambert's Cylindrical Equal Area](https://en.wikipedia.org/wiki/Cylindrical_equal-area_projection) projection (`["EPSG:54034"](https://epsg.io/54034)`) can be used instead of the Equirectangular projection. This will, however distort the polar cells that will become more and more elongated with higher the latitude values - which might be a problem, or might not.   

* * *

## About icosahedral grids 

The `icosa` package approaches this problem by not using a rectangular grid, i.e. a grid that has cells that are rectangular shape. 

Instead, `icosa` relies on the tessellation of a [regular icosahedron](https://en.wikipedia.org/wiki/Regular_icosahedron) to a given resolution. This procedure ends up with a polyhedral object of triangular faces of higly isometric properties: very similar shapes of cells which are roughly equally distanced, and similar in cell area. Such bodies are often called icospheres.

Here is how such grid looks like in 3D (also plotted with `icosa`, see a later tutorial for such plotting): 

```{r plot3d, rgl=TRUE,fig.width=10, fig.height=10, echo=FALSE}
# create a trigrid class object
tri <- trigrid(8)

# plot the object in 3d
suppressWarnings(plot3d(tri, guides=F))
```

Still, triangular grids are somewhat unintuitive, because distances between the cell can vary considerably: the corner of a cell is much farther away from the center than the side. For this reason, we usually use the inverted versions of these grids: the triangle cell-midpoints become the corners of the cells (i.e. vertices), which defines a grid where cells are hexagonal - with exactly 12 pentagons (where the vertices of the icosahedron used to be). This is the inverted version of the grid above:

```{r plot3d_hex, rgl=TRUE,fig.width=10, fig.height=10, echo=FALSE}
# create a trigrid class object
hexa <- hexagrid(8)

# plot the object in 3d
suppressWarnings(plot3d(hexa, guides=F))
```

* * *

## Creating an icosahedral grid

Once `icosa` is loaded, you can create such a penta-hexagonal grid with a single line of code, using the `hexagrid()` function:

```{r create, echo=TRUE}
# create a trigrid class object
hexa <- hexagrid(deg=5, sf=TRUE)
```

- The argument `deg` refers to the expected length of the edge length of the grid. This is given as angular distance between two cell vertices in **degrees** - which means same as degrees longitudinally (or latitudinally on the equator), 1° =~111 km. This is used to select a `tessellation` vector, which is directly controlling the grid resolution: how the faces of icosahedron are split up. The higher the product of this vector, the higher the resolution - ie. the more cells there will be in the grid. The example `c(2, 4)` means that the original icosahedrons edges are split up 2 * 4 = 8 times. 

- The argument `sf=TRUE` indicates that we want to create a representation of the grid that can be used for additional spatial operations using the `sf` package, for example 2D (projection-based) plotting. Since this is not always needed, and can significantly decrease performance, you have to indicate that you need this option. 

If you type in the name of the object, you can some immediate properties about the structure and resolution of the grid.

```{r show, echo=TRUE}
hexa
```

The most important detail here is th the number of faces of the 3d body (642) i.e. the number of cells in the grid.

* * *

## Plotting the grid in 2d

If you have generated the `sf`-represenation as above, the `plot()` function can be invoked on the grid to see its 2D (longitude-latitude projection). 

```{r plotting, plot=TRUE, fig.width=10, fig.height=5}
plot(hexa)
```

Putting the map and points generated above is just as easy as with any other kind of mapping object:

```{r plotting2, plot=TRUE, fig.width=10, fig.height=5}
# plotting the world map
plot(ne$geometry, col="gray", border=NA)

# the grid
plot(hexa, add=TRUE)

# plotting the point cloud
points(randPoints, col="#FF000022", pch=3)
```

* * *

## Spatial binning with the `hexagrid`

The grids themselves can only be useful if we can figure out how the points actually interact with them: for instance, if we want to repeat the same [binning as earlier](#binning-with-a-gaussian-uv-grid), but with the grid that we created earlier. To make this process as flexible and useful as possible, we will be doing this in three short steps:  

1. Finding the cells that the points fall onto with `locate()` 
2. Count the number of points in a cell. (using base R)
3. Plot the counts on the grid

The cells of the icoshedral grid are named with the convention `"F<integer>"` (`"F"` stands for `face`), such as as `"F12"`. You can see how these are distributed using the `gridlabs()` function:

```{r gridlabs, plot=TRUE, fig.width=10, fig.height=5}
plot(hexa)
gridlabs(hexa, cex=0.5)
```

Cells names are assigned according to a spiral from the North pole of the grid. 


#### Point lookup - `locate()`

The only thing that we need to able to do with the grid to make this process feasible is the ability to find the cells on which every points falls. This is a very basic feature, but with this we can do most computations that we might want to do with the grid. 

The `locate()` function returns the name of the cell under a point: one cell name for every coordinate pairs. The function takes the grid (`hexa`), and the point coordinates (`randPoints`) as arguments and returns a vector of cell names:

```{r location}
cells <- locate(hexa, randPoints)
str(cells)
```

The output of `locate()` is so designed that it can be used as a new column in the table of coordinates (optional):

```{r columns}
# transform this to a data frame
rdf <- data.frame(randPoints)
# assign the face names as a column
rdf$cells <- cells
# the first 6 rows
head(rdf)
```

#### Counting the cells

To assess the density of points, we need to count the number of points in every cell - which is actually nothing else, but the tabulation of the number of times that a point falls on a cell. This we can easilly do with the `table()` function:

```{r table}
tCells <- table(cells)
str(tCells)
```

... which gives us the number of points that fall on one particular cell (cell name in the `names()` attribute), frequency as the value.

#### Plotting the frequency

There is a number of ways to visualize this number on the map, the simplest is to use a dedicated method of `plot()`, that takes the grid (with an `sf` representation, as [above](#land-polygons)) and a named entity (vector, table) and plots this with the features made available via `sf`:

```{r visdistrib, fig.width=9, fig.height=4.5}
plot(hexa, tCells)
```

You can modify this plotting in any way you could with `sf::plot()`. For instance, if you want to use a different color scheme, a different binning

```{r visdistrib_custom, fig.width=9, fig.height=4.5}
plot(hexa, tCells, 
	border="white",
	pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
	breaks=c(0, 10, 15, 20, 40)
)
```

You can also set the plotted cooordinate reference system directly here in the plotting command. For instance, if you would like this to be plotted directly on a Mollweide projection (`"ESRI:54009"`):

```{r final, fig.width=9, fig.height=4.5}
# the base map
plot(hexa, tCells, 
	crs="ESRI:54009", 
	border="white",
	pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
	breaks=c(0, 10, 15, 20, 40)
)
```

If you do not forget to transform the landmasses

```{r transform}
neMoll <- sf::st_transform(ne, "ESRI:54009")

```

... you can also put them on top. Just make sure to set `reset=FALSE` (see `sf::plot` why):

```{r final2,fig.width=9, fig.height=4.5}
# the base map
plot(hexa, tCells,
	crs="ESRI:54009", 
	border="white",
	pal=c("#440154FF", "#2D708EFF", "#29AF7FFF", "#FDE725FF"), 
	breaks=c(0, 10, 15, 20, 40), 
	reset=FALSE
)

# the landmasses tranparent gray with black contour
plot(neMoll$geometry, add=TRUE, col="#66666688")
```

All these options come from `sf:plot()`. If you familiarize yourself with how to use that,then only your imagination will be the limit!


## Using the result in calculations 

The by-product of this workflow is that now we get explicit access to not only the cells' name, but also their attributes. For instance if you want to see whether the number of points in a cell has a latitudinal pattern, you just need to get the coordinates of the face centers, and assign them to the densities. You can get these with `centers()`:


```{r centers}
# translate the returned 2-column matrix to a data.frame
faceInfo <- as.data.frame(centers(hexa))
# the first 6 rows
head(faceInfo)
```

which returns the longitude and latitude coordinates of the face centers. Now we just need to assign the counts to these, which is easy with base R using the `rownames` as a character subscript and assigning the result as a new column of `faceInfo`:

```{r rownames}
faceInfo$count <- tCells[rownames(faceInfo)]
# the first 6 rows
head(faceInfo)
```

Now it is easy to see whether there is an association with latitude:

```{r latplot, fig.width=10, fig.height=6}
plot(faceInfo$lat, faceInfo$count, 
	xlab="Latitude (deg)", ylab="Point count", 
	pch=16, col="#99000044")
```

There is no latitudinal trend in the points, but you can see that the spread decreases latitudinally (because there are fewer cells) at high latitudes.

## Summary 

If you have some data (e.g. `randPoints`) and you want to execute the binning and plotting, you can do it with this much code - now using a different, coarser resolution grid:

```{r summary, fig.width=10, fig.height=6}
gr <- hexagrid(deg=10, sf=TRUE) # create grid
cells<- locate(gr, randPoints) # locate points
tab <- table(cells) # tabulate, calculate
plot(gr, tab) # plot named vector/table
```