# Automating search queries using R

[R (S-plus)](http://r-project.org), a language for statistical computing, is widely used in a variety of scientific and applied problems involving data analysis.

In the 20 or more years since its creation, many useful packages have been developed for R. Together they have grown into a large [R ecosystem](https://cran.r-project.org/mirrors.html) that helps solve a wide range of scientific problems and automate data analysis workflows.

Executable code in this document is given in blocks of the form In[xxx]. These blocks can be executed one by one in an R interpreter (e.g. the R console), or first saved to a text file with the .r extension (an R script file).

This document was created with [Jupyter](http://jupyter.org) and [IRkernel](https://irkernel.github.io/).

## Building a computational environment

Automating search queries in R involves two steps:
* Building an HTTP request to the server according to the [HTTP API](https://github.com/VBGI/herbs/blob/master/herbs/docs/httpapi/ru/http_api.rst);
* Transforming the response, which arrives as JSON, into a form customary in the R environment (e.g. a `data.frame` object).

The R ecosystem includes many packages that handle these steps. We will use [jsonlite](https://cran.r-project.org/web/packages/jsonlite/index.html), which automatically transforms the received data into R `data.frame` objects.

If your local R environment doesn't include `jsonlite`, install it with `install.packages('jsonlite')`.

In [None]:
library(jsonlite)

In [None]:
data<-fromJSON('http://botsad.ru/hitem/json/?collectedby=Пименова')

In [None]:
data$data

To build more complex search requests, it is recommended to decompose them into a vector of parameter names and values:

In [None]:
http_api_base_url <- 'http://botsad.ru/hitem/json/?'
search_parameters <- c('collectedby', 'Пименова', 'identifiedby', 'Крестов')

Here, parameter names and their values alternate in the `search_parameters` vector; we can then use the `paste` function to form the search URL. The URL is a string made of the `http_api_base_url` value followed by `name=value` pairs joined by the `&` symbol. Indexing with `c(TRUE, FALSE)` selects the odd-positioned elements (the names), and indexing with `c(FALSE, TRUE)` selects the even-positioned ones (the values).

In [None]:
search_url <- paste(http_api_base_url, paste(search_parameters[c(TRUE, FALSE)], search_parameters[c(FALSE, TRUE)], sep='=', collapse='&'), sep='')
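The same construction can be wrapped in a small helper. The sketch below is not part of the HTTP API or of `jsonlite`: `build_search_url` is a hypothetical name, the parameters are assumed to be passed as a named character vector, and the values `Smith` and `A & B` are made up for illustration. It also percent-encodes the values with `utils::URLencode`, which may help when they contain spaces or reserved characters; whether the server requires this is an assumption.

```r
# Hypothetical helper (not part of the HTTP API): builds a query URL
# from a named vector such as c(collectedby = "...", identifiedby = "...").
build_search_url <- function(base_url, params) {
  # Percent-encode each value, including reserved characters like '&' and '='.
  encoded <- vapply(
    as.character(params),
    function(value) utils::URLencode(value, reserved = TRUE),
    character(1),
    USE.NAMES = FALSE
  )
  # Join name=value pairs with '&' and append them to the base URL.
  paste0(base_url, paste(names(params), encoded, sep = "=", collapse = "&"))
}

# Illustrative values only.
example_url <- build_search_url(
  "http://botsad.ru/hitem/json/?",
  c(collectedby = "Smith", identifiedby = "A & B")
)
print(example_url)
```

Using a named vector keeps each name next to its value, so the pairing no longer depends on element positions.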

OK, the search URL is now formed. Let's query the server for data.

In [None]:
new_data <- fromJSON(search_url)

In [None]:
dim(new_data$data)

In [None]:
new_data$data

Nested data structures, such as the `dethistory` and `additionals` fields, are correctly transformed into `data.frame`s by the `fromJSON` function.

In [None]:
new_data$data$dethistory
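To see how nested fields are handled without contacting the server, here is a minimal offline sketch. The JSON text is a hypothetical payload that only mimics the general shape of a response; in particular, the `species` field inside `dethistory` is an assumed name, not taken from the HTTP API description.

```r
library(jsonlite)

# Hypothetical JSON payload shaped like a response with a nested field.
json_text <- '{"data": [
  {"id": 1, "family": "Rosaceae", "dethistory": [{"species": "acicularis"}]},
  {"id": 2, "family": "Poaceae",  "dethistory": [{"species": "pratensis"}]}
]}'

parsed <- fromJSON(json_text)

# The top-level array of records becomes a data.frame ...
str(parsed$data, max.level = 1)

# ... and each record's nested array becomes its own data.frame,
# stored as one element of a list column.
parsed$data$dethistory[[1]]
```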

It is useful to assign `data$data` to a separate variable, e.g. `my_data`, to access the data directly.

In [None]:
my_data <- data$data

Let's count how often each family occurs in the retrieved data.

In [None]:
data.frame(table(my_data$family))
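As a small offline illustration of the same idea, the sketch below counts families in a made-up vector and sorts the result by frequency. The family values are hypothetical and stand in for `my_data$family`.

```r
# Hypothetical family values standing in for my_data$family.
families <- c("Rosaceae", "Poaceae", "Rosaceae",
              "Asteraceae", "Rosaceae", "Poaceae")

# table() counts occurrences; as.data.frame() turns the counts into
# a two-column data frame (the family name and its frequency, Freq).
family_freq <- as.data.frame(table(families), stringsAsFactors = FALSE)

# Sort so that the most frequent family comes first.
family_freq <- family_freq[order(-family_freq$Freq), ]
print(family_freq)
```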

If you need to compute, for example, Shannon's biodiversity index, you can use the [vegan](http://cc.oulu.fi/~jarioksa/softhelp/vegan/) package, which includes many useful functions and algorithm implementations for processing datasets that occur in botany and ecology.

In [None]:
library(vegan)

Shannon's biodiversity index computed over the `family` field:

In [None]:
diversity(table(my_data$family))
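For reference, the Shannon index is H = -sum(p_i * ln(p_i)), where p_i is the share of records belonging to family i. The base R sketch below computes it by hand for hypothetical counts; to my understanding, `diversity` uses the same formula with natural logarithms by default, so the two results should agree on the same input.

```r
# Hypothetical record counts per family (not taken from the herbarium data).
family_counts <- c(Rosaceae = 5, Poaceae = 3, Asteraceae = 2)

# Convert counts to proportions that sum to 1.
p <- family_counts / sum(family_counts)

# Shannon index with natural logarithms.
shannon <- -sum(p * log(p))
print(shannon)
```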

## Emulation of OR-type search requests

By default, the Digital Herbarium HTTP API returns records that match all of the provided search conditions (i.e. it executes AND-type search queries). Moreover, the current implementation of the HTTP API service offers no way to build and execute an OR-type search query as a single request, so the only way to make OR-type requests is to emulate them.
Emulating an OR-type search request means executing a sequence of requests and gluing the retrieved data into a single dataset.

Consider an example of OR-type search emulation with two search requests defined by the parameter sets `search_parameters1` and `search_parameters2`:

In [None]:
search_parameters1 <- c('identifiedby', 'Пименова', 'collectedby', 'Пименова')
search_parameters2 <- c('identifiedby', 'Крестов', 'collectedby', 'Крестов')
search_url1 <- paste(http_api_base_url, paste(search_parameters1[c(TRUE, FALSE)], search_parameters1[c(FALSE, TRUE)], sep='=', collapse='&'), sep='')
search_url2 <- paste(http_api_base_url, paste(search_parameters2[c(TRUE, FALSE)], search_parameters2[c(FALSE, TRUE)], sep='=', collapse='&'), sep='')

In [None]:
search_url1

In [None]:
search_url2

In [None]:
dataset1 <- fromJSON(search_url1)
dataset2 <- fromJSON(search_url2)

In [None]:
df1<-data.frame(dataset1$data)
df2<-data.frame(dataset2$data)

In [None]:
merged_data <- rbind(df1, df2)

In [None]:
dim(df2)
dim(df1)
dim(merged_data)

At this step the `merged_data` data frame may include duplicated rows, so we need to clean it up. To remove duplicated rows we will rely on the uniqueness of the `id` field for each record in `merged_data`.

In [None]:
data_without_dups<-merged_data[!duplicated(merged_data$id),]

In [None]:
dim(data_without_dups)

The size of the dataset has changed, i.e. some rows were excluded. The `data_without_dups` variable now holds a dataset whose records match the complex query defined by `search_parameters1` OR `search_parameters2`.
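The merge-and-deduplicate steps generalize to any number of requests. The sketch below is one possible way to package them, run on made-up data frames instead of server responses; `combine_or_results` is a hypothetical helper name, and it assumes that all result sets share the same columns and a unique `id` field.

```r
# Hypothetical helper: stacks several result sets and keeps the first
# occurrence of every record, identified by the key column.
combine_or_results <- function(results, key = "id") {
  merged <- do.call(rbind, results)
  merged[!duplicated(merged[[key]]), , drop = FALSE]
}

# Made-up result sets; the record with id 2 appears in both.
result_a <- data.frame(id = c(1, 2), family = c("Rosaceae", "Poaceae"))
result_b <- data.frame(id = c(2, 3), family = c("Poaceae", "Asteraceae"))

combined <- combine_or_results(list(result_a, result_b))
print(combined)
```

Passing the result sets as a list means a third or fourth query can be added without changing the helper.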

## Filtering data by a user defined region


Let us assume that the region of interest is given as an ESRI shapefile. Since the HTTP API service allows search requests only for rectangular areas, we will divide the task into stages: 1) getting a rectangular area that encloses the desired contour; 2) querying by the rectangular area; 3) excluding records that fall inside the rectangular area but lie outside the given contour.

To read ESRI shapefiles we will use the `rgdal` package. It relies on GDAL, an open source library for processing geospatial data that is frequently used in building geographic information systems (GIS).

We therefore assume that the `rgdal` package is installed in your R environment, which in turn requires the [GDAL](http://www.gdal.org/) library on your system (GDAL is bundled with the Windows distribution of the `rgdal` package, so Windows users don't need to install it separately).

In [None]:
library('rgdal')
shape_rgdal <- readOGR(dsn=path.expand("/home/dmitry/workspace/herbs/herbs/docs/tutorial/R/ru/sakhalin"), layer="sakhalin")

OK, our shapefile was successfully loaded.

In [None]:
shape_rgdal

In [None]:
bbox<-shape_rgdal@bbox

In [None]:
as.numeric(bbox)

The rectangular area enclosing the shapefile's contour is stored in the `bbox` slot of the `shape_rgdal` S4 object.
So we can easily extract the bounding box coordinates:

In [None]:
lonl<-as.numeric(bbox)[1]
lonu<-as.numeric(bbox)[3]
latl<-as.numeric(bbox)[2]
latu<-as.numeric(bbox)[4]
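The index positions used above follow from how R flattens a matrix. The sketch below reproduces this on a hypothetical 2x2 bounding box; the numbers are made up, and the row and column names (`x`, `y`, `min`, `max`) are assumed to match what `sp` objects typically carry.

```r
# Hypothetical bounding box laid out like the bbox slot of an sp object:
# rows are x (longitude) and y (latitude), columns are min and max.
bbox_example <- matrix(
  c(141.0, 45.0, 145.0, 55.0),
  nrow = 2,
  dimnames = list(c("x", "y"), c("min", "max"))
)

# as.numeric() flattens column by column: x min, y min, x max, y max.
flat <- as.numeric(bbox_example)
print(flat)

# Access by row and column names does not depend on element order.
lonl_example <- bbox_example["x", "min"]
latl_example <- bbox_example["y", "min"]
lonu_example <- bbox_example["x", "max"]
latu_example <- bbox_example["y", "max"]
```

Name-based access is more verbose but makes it harder to swap a longitude for a latitude by accident.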

Next, let us form a search URL and make a search request.

In [None]:
search_parameters_sakhalin <- c('lonl', lonl, 'lonu', lonu, 'latl', latl, 'latu', latu)

In [None]:
search_url_sakhalin <- paste(http_api_base_url, paste(search_parameters_sakhalin[c(TRUE, FALSE)], search_parameters_sakhalin[c(FALSE, TRUE)], sep='=', collapse='&'), sep='')

To be sure, inspect the URL:

In [None]:
search_url_sakhalin

In [None]:
sakhalin_data <- fromJSON(search_url_sakhalin)

Now we can check how many records fall within the rectangular area enclosing the contour of Sakhalin Island.

In [None]:
dim(sakhalin_data$data)

In [None]:
number_in_rectangle <- dim(sakhalin_data$data)[1]

In [None]:
sprintf("Therefore, the number of records belonging to the rectangular area: %d", number_in_rectangle)

Additional filtering lets us exclude points outside the contour of Sakhalin Island. In R, the `sp` package is convenient for this kind of filtering.

In [None]:
library(sp)

In [None]:
sakhalin_nonfiltered <- sakhalin_data$data

We need to transform the original dataset `sakhalin_nonfiltered` into an object of spatially distributed data (more precisely, an S4 object from the `sp` package):

In [None]:
coordinates(sakhalin_nonfiltered) <- cbind(sakhalin_nonfiltered$longitude , sakhalin_nonfiltered$latitude)

In [None]:
sakhalin_nonfiltered@proj4string <- CRS(proj4string(shape_rgdal))

Thanks to the concise syntax of the `sp` package, the filtering takes a single line of code:

In [None]:
sakhalin_filtered<-sakhalin_nonfiltered[shape_rgdal,]

In [None]:
dim(sakhalin_filtered)
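The same subsetting syntax can be tried on a toy example that needs no shapefile. In the sketch below, a unit square plays the role of the island contour and three made-up points play the role of herbarium records; all names and coordinates are hypothetical.

```r
library(sp)

# A unit square standing in for the region contour
# (ring given clockwise and closed on its first vertex).
ring <- cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0))
square <- SpatialPolygons(list(Polygons(list(Polygon(ring)), ID = "square")))

# Made-up records: the second point lies outside the square.
records <- data.frame(
  id = 1:3,
  longitude = c(0.5, 2.0, 0.2),
  latitude = c(0.5, 0.5, 0.9)
)

# Promote the data frame to a SpatialPointsDataFrame.
coordinates(records) <- ~ longitude + latitude

# Indexing points by a polygon object keeps only the points inside it.
inside <- records[square, ]
print(inside$id)
```

Neither object has a coordinate reference system here, so they are compatible by construction; with real data the two systems have to match, as in the `proj4string` assignment above.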

In [None]:
number_in_sakhalin <- dim(sakhalin_filtered)[1]

In [None]:
sprintf('The number of records filtered out: %d', number_in_rectangle - number_in_sakhalin)

An attentive reader may be surprised here: the rectangular area contains no land other than Sakhalin Island, so why were so many records filtered out? The explanation is simple: records located on the seashore can be treated as lying outside the Sakhalin contour. This isn't a real error, just a consequence of inaccuracies in the contour definition and in the positioning of herbarium records.

In [None]:
sprintf('Document execution date: %s', Sys.Date())