Ex_10.Rmd

---
title: External Data, Distribution Networks, and Simple Species Distribution Models <br><small>Advanced
  Data Analytics</small></br>
author: "Austin Peay State University"
output:
  html_document: 
    df_print: paged
    rows.print: 10
    theme: cosmo
    highlight: breezedark
    number_sections: yes
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
  pdf_document: default
  html_notebook:
    df_print: paged
    rows.print: 10
    theme: cosmo
    highlight: breezedark
    number_sections: yes
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: yes
editor_options:
  chunk_output_type: inline
  mode: gfm
---

```{=html}
<style type="text/css">

h1.title {
  font-size: 40px;
  font-family: "Times New Roman", Times, serif;
  color: Black;
  text-align: center;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 25px;
  font-family: "Times New Roman", Times, serif;
  font-weight: bold;
  color: #D02349;
  text-align: center;
}

body {
  font-family: Helvetica;
  font-size: 12pt;
}

.zoom {
  transform-origin: 40% 50% 0;
  transition: transform .2s;
  margin: 0 auto;
}
.zoom img{
	width:auto;
	height:auto;	
}
.zoom:hover {
  transform: scale(2);
}

th, td {padding: 5px;}

</style>
```

# Introduction
In this exercise you will examine ways to connect to external data sources such as [Dryad](https://datadryad.org/stash/), [GBIF](https://www.gbif.org/), or data stored in [GitHub Repositories](https://github.com/chrismgentry?tab=repositories). You will also take a look at creating a simple species distribution model using data from *GBIF* and *worldclim.org*. To begin you need to install some packages you may not have used before.

## Packages used in this exercise
There are several packages that allow you to connect to external data sources as well as format and display the data you have collect. To begin this exercise, install the following:

```{r Packages, message=FALSE, warning=FALSE, echo=TRUE, results='hide'}
packages<-c("cowplot","dismo","leaflet","mapdata","OpenStreetMap","rasterVis","rdryad","rgbif","sf","tidyverse")
sapply(packages, library, character.only=T)
```

# Connecting to External Data Sources
Often times you will find data available on websites or within a distributed data sources you want to include in your projects, but downloading and distributing the data with a markdown document or *R* script can be cumbersome. Additionally, the dataset may be large enough that it must be stored in your personal/business cloud storage for distribution even though it is already stored elsewhere on the web. Therefore, it is more convenient to connect to this data directly within your script and avoid secondary distribution all together. 

## Connecting to GitHub or other Websites
For example, in a [previous exercise](https://chrismgentry.github.io/Mapping-Basics/), two \*.kml files of the APSU campus were included in the data folder. If you wanted to use those in a new project you could download the data, include them in your data distribution, and remind users they must download the data in order for the script to work. Alternatively you could navigate to the [repository](https://github.com/chrismgentry/Mapping-Basics) and link directly to the data. To do this, you simply go to the **Data Folder** in the repository, click on the file you need, and click the **Raw** button along the top of the script window.
<p align="center">
![*Raw Button*](C:/Users/gentryc/Google Drive/APSU/Courses/BIOL 5700 Advanced Data Analytics/Exercise_10/raw_button.png "Raw Button")
</p>
Then you can copy the link at the top of the page to get direct access to the file. In this example, if you were to select the raw **Campus_Point.kml** file you would see a link that looks like this: https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml The link will provide you access to that specific files from the repository. Now you can use that file in our new project.

So for this example you are going to recreate the simple static campus map you made in the previous exercise. To do that you will need to link to the raw *kml* file above.

```{r connect to kml, message=FALSE, warning=FALSE, echo = TRUE, results = 'hide'}
campus.kml <- st_read('https://raw.githubusercontent.com/chrismgentry/Mapping-Basics/master/Data/Campus_Points.kml')
```

This simple line of code is all that is needed to create a spatialpointsdataframe object in our environment. To complete the map we can add the various aesthetics and options to create a custom look for your map.

```{r simple campus map, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6, fig.width=8}
campus.points <- campus.kml %>% mutate(x = unlist(map(campus.kml$geometry,1)),
                                       y = unlist(map(campus.kml$geometry,2))) %>%
                                st_drop_geometry(campus.kml)

campus.map <- openmap(c(36.5360,-87.3570),c(36.5300,-87.3495), type='bing')
APSU <- openproj(campus.map, projection = "+proj=longlat +ellps=WGS84 +units=m +no_defs")

autoplot.OpenStreetMap(APSU) +
  geom_point(data=campus.points, aes(x = x, y = y, color = Name), size = 4, alpha = 0.8) +
  geom_text(data=campus.points,aes(x = x, y = y, label = Name), color="white", vjust=-0.75, fontface="bold") +
  labs(x="Longtiude", y="Latitude") + theme(legend.position = "none")
```

> Try it yourself! Link to the Main_Campus.kml file in that same folder and add a polygon of the main APSU campus to your example map above.

The only caveat to this method of adding data is the unknown longevity of the data source. Depending on the activity level of the website or repository, a simple change in storage structure could break the link location and necessitate a rewrite of your script to update to the new location. Additionally, changes in how GitHub handles data could also cause errors in your script.

## Connecting to Data Distribution Networks 

Sites such as **GBIF** or **Dryad** aggregate data from scientists, governments, private organizations, etc. and distribute them on open-access services. These data are likely more stable than a GitHub Repository and should have consistent standards that apply to all datasets. This provides straight-forward access to a much larger catalog of data with better reproducibility.

### DISMO

To start, we are going to use the `dismo` package to gain access to a simple **GBIF** dataset. Although this is slightly more complicated than linking to a ready-made dataset, sites like **GBIF** provide access to well nearly two billion records.

The first step in obtaining data is setting the extent of the bounding box for your search. While you can access global datasets, there are limits on the amount of information you can download and limitations on your local processing power that make doing so difficult. With the extent set, you provide the genus and species along with the following basic arguments:

- geo: logical. If TRUE, only records that have a georeference (longitude and latitude values) will be downloaded

- removeZeros: logical. If TRUE, all records that have a latitude OR longitude of zero will be removed if geo=TRUE; or set to NA if geo=FALSE. If FALSE, only records that have a latitude AND longitude that are zero will be removed or set to NA

- download: logical. If TRUE, records will be downloaded, otherwise only the number of records will be shown

For this example you are going to obtain records for *Pinus longaeva* (Rocky Mountain Bristlecone Pine) within the western United States.

```{r dismo gbif data, echo=TRUE, message=FALSE, warning=FALSE}
pilo.dismo <- gbif("pinus", species = "longaeva", ext = c(-130,-70,20,60),
                   geo = TRUE, download = TRUE, removeZeros = TRUE)
```

Once you have obtained the observations, the next step is to prepare it for display. To do this, we are going to obtain a base map of the US, locate the x,y fields in the dataset, and create our map of known PILO stands in the west.

```{r simple dismo gbif map, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6, fig.width=7}
us <- map_data("state")

ggplot() +
  geom_polygon(data = us, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = pilo.dismo, aes(x=lon, y=lat)) + 
  xlab("Longitude") + ylab("Latitude") +
  coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Pinus longaeva Stands in the Western US") + 
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))
```

> Give it a shot! Change the species and extent information above to create a quick custom map of a species in your thesis or research project.

This is a simple example of using the *GBIF* options in `dismo` to create a quick map of species locations. Be careful when using this sort of open data source. Quality control on the datasets might not be to the standard required for a robust research project as you will see in the next section. 

### RGBIF

While the `dismo` package allowed you to create a quick species map, there were only a few variables obtained with the data. Using `rgbif` will provide a more detailed look at the species by including additional attributes.

For this part you will continue to look at Bristlecone Pine in the Great Basin so you can evaluate the differences in the type of data and distribution of PILO as compared to its relative Rocky Mountain Bristlecone Pine _(Pinus aristata)_. To begin with you will use the `occ_data` function to set the _scientific name_, _bounding coordinates_, and requirements for PILO and PIAR on **GBIF**. In this instance a limit is set to only include up to the first 2000 records.

```{r gbif lookup, echo=TRUE, message=FALSE, warning=FALSE}
pilo.rgbif <- occ_data(scientificName = "Pinus longaeva",
                       hasCoordinate = TRUE, limit = 2000,
                       decimalLongitude = "-125, -65", 
                       decimalLatitude = "24, 50")

piar.rgbif <- occ_data(scientificName = "Pinus aristata",
                       hasCoordinate = TRUE, limit = 2000,
                       decimalLongitude = "-125, -65", 
                       decimalLatitude = "24, 50")
```

In this instance you now have several different records for PILO and PIAR that includes more variables than the simple geographic data collected above with `dismo`. After examining the dataset, `View(pilo_rgbif)`, it is determined that only a handful of the variables should be retained for further analysis. 

```{r pilo piar data, echo=TRUE, message=FALSE, warning=FALSE}
pilo.rgbif.df <- cbind.data.frame(pilo.rgbif$data$species,
                                  pilo.rgbif$data$decimalLatitude,
                                  pilo.rgbif$data$decimalLongitude,
                                  pilo.rgbif$data$stateProvince,
                                  pilo.rgbif$data$verbatimLocality)

piar.rgbif.df <- cbind.data.frame(piar.rgbif$data$species,
                                  piar.rgbif$data$decimalLatitude,
                                  piar.rgbif$data$decimalLongitude,
                                  piar.rgbif$data$stateProvince,
                                  piar.rgbif$data$verbatimLocality)

colnames(pilo.rgbif.df) <- c("species","y","x","state","location")
colnames(piar.rgbif.df) <- c("species","y","x","state","location")
```

There are a number of steps in this code to consider. The data obtained with ```rgbif``` is in a large gbif/tbl_df (S3) class. Because of the format, we will use `cbind.data.frame` instead of the standard _cbind_ to ensure the data remains in the proper format. Next, each of the selected variables were renamed. You can see this simplified structure in the ```str(pilo.rgbif.df)``` dataset.

Now you are ready to finalize the aesthetics and options for displaying the new **gbif** data.

```{r rgbif map, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6, fig.width=8}
ggplot() +
  geom_polygon(data = us, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = pilo.rgbif.df, aes(x=x, y=y, color = species), size = 3) +
  geom_point(data = piar.rgbif.df, aes(x=x, y=y, color = species), size = 3) +  
  coord_fixed(xlim = c(-124,-110), ylim = c(35,45)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Bristlecone Stands in the Western US") + 
  guides(color=guide_legend("Legend", override.aes = list(size = 4))) +
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(legend.position = "bottom") +
  theme(legend.title.align = 0.5, legend.box.just = "center") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))
```

It is here where you can see a potential issue with the data. By viewing the [Forest Service, Distribution and Occurrence](#https://www.fs.fed.us/database/feis/plants/tree/pinlon/all.html#DISTRIBUTION%20AND%20OCCURRENCE) information for Great Basin Bristlecone Pine you will see that it occurs "in a relatively narrow latitudinal range in California, Nevada, and Utah." This information seems to track with the visualization above. However, viewing the [same information](https://www.fs.fed.us/database/feis/plants/tree/pinari/all.html#DISTRIBUTION%20AND%20OCCURRENCE) for Rocky Mountain Bristlecone Pine you will find it "occurs in the southern Rocky Mountains of Colorado, New Mexico, and Arizona." However, the distribution of PIAR on the visualization above extends into Utah, Nevada, and California overlapping the range of PIAL. This underlies the importance of understanding the data you are working with.

You could also use this data in `leaflet` to make an interactive version.

```{r interactive gbif map, echo=TRUE, message=FALSE, warning=FALSE, fig.height=8, fig.width=10}
colors <- colorFactor(c("#F8766D","#00BA38","#619CFF"), pilo.rgbif.df$state)

leaflet(pilo.rgbif.df) %>% 
  addTiles() %>% 
  addCircleMarkers(pilo.rgbif.df$x,
                   pilo.rgbif.df$y,
                   popup = pilo.rgbif.df$species,
                   weight = 2,
                   color = colors(pilo.rgbif.df$state),
                   fillOpacity = 0.7) %>%
  addMiniMap(position = 'topright',
             width = 100, 
             height = 100,
             toggleDisplay = FALSE) %>%
  addScaleBar(position = "bottomright")
```

### DRYAD

> Data for this example is being downloaded from [Schile, Lisa et al. (2016), Abu Dhabi Blue Carbon project, Dryad, Dataset, https://doi.org/10.15146/R3K59Z](https://datadryad.org/stash/dataset/doi:10.15146%2FR3K59Z)

You can download data directly from specific **Dryad** publications to import into your environment. While there is a ```dryad_download()``` command in the `rdryad` package that allows you to link to a specific dataset based on the doi, the function does not include robust features to allow for finite control for where an output file will be saved. So for this example you will use `download.file()`. This function will work regardless of the data format (\*.csv, \*.xlsx, \*.zip) you are linking to. If you are downloading a \*.zip file you may need to use the `unzip` package to open the data. For this example, the authors provided data a \*.xlsx file. Using `download.file` you will identify the link, destination path and file type, in mode information to ensure the file is opened for writing in binary mode.

```{r dryad data, echo=TRUE, message=FALSE, warning=FALSE}
carbon <- download.file("https://datadryad.org/stash/downloads/file_stream/140080", destfile = "./carbon.xlsx", mode = "wb")
```

With this new dataset collected you have options on how to import the file into your project. Using the  `readxl` package you can click on the file dataset within the _files tab_ and choose **Import Dataset...** ![*Import Dataset*](C:/Users/gentryc/Google Drive/APSU/Courses/BIOL 5700 Advanced Data Analytics/Exercise_10/import_xlsx.png "Import Button") to use a GUI interface for importing. This will open a new window where you can give the object a name, select the specific sheet and use drop-down menus to change the type of data, or choose to skip it on import. This can be beneficial when you need to review the dataset prior to importing to **R**. When you have selected all of the appropriate options you can either select import or copy the code and paste it into your script.

<p align="center"><div class="zoom"><img src= "Images/readxl-import-options.png" alt="Import XLSX dataset" style="width:100%"></div></p>

For this dataset the only necessary variables are: site, ecosystem, plot, core depth, latitude, and longitude. Either method above is acceptable for importing the data however only the code will be displayed below:

```{r import from readxl xlsx, echo=TRUE, message=FALSE, warning=FALSE}
carbon <- readxl::read_excel("carbon.xlsx", 
                          sheet = "plot information", 
                          col_types = c("skip","skip", "text", "text", 
                                        "numeric", "numeric", "skip", 
                                        "skip", "skip", "skip", "skip", 
                                        "skip", "skip", "skip", "skip",
                                        "skip", "numeric", "numeric", 
                                        "skip", "skip"))
```

Notice in the script the specific sheet was identified and variables were either skipped, or identified in the prefer format. With this information available you can create a simple visualization.

```{r dryad map, echo=TRUE, message=FALSE, warning=FALSE}
world <- map_data("worldHires")
uae <-map_data("worldHires", "United Arab Emirates")

main_map <- ggplot() +
  geom_polygon(data = world, aes(x=long, y = lat, group = group),
               fill = "gray", color="white") +
  geom_polygon(data = uae, aes(x=long, y = lat, group = group),
               fill = "white", color="black") +
  geom_point(data = carbon, aes(Longitude, Latitude, color = Ecosystem, size = `core depth (cm)`)) +
  coord_fixed(xlim = c(51.5,56.5), ylim = c(23.5,26.5)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Blue Carbon Project") + 
  guides(color=guide_legend("Ecosystems", override.aes = list(size = 3))) + 
  guides(size=guide_legend("Core Depth (cm)")) +  
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + 
  theme(legend.position = "right") +
  theme(legend.title.align = 0.5, legend.box.just = "center") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))
main_map
```

## Creating an Inset Map

Occasionally you may be interested in including a inset or indicator map on your visualization. These are smaller maps that are used to provide additional detail about a specific location. These maps are often small scale versions of the primary map, but sometimes can be larger scale to enhance detail in congested areas. In this case, the data is located in the United Arab Emirates. Because that area of the world is unfamiliar to some, an inset map will help them locate it geographically. 

To begin this process you are going to create an additional ggplot object for countries of the world highlighting UAE.

```{r inset map, echo=TRUE, message=FALSE, warning=FALSE}
inset <- ggplot() + 
  geom_polygon(data = world, aes(x=long, y = lat, group = group),
               fill = "gray", color="white") +
  geom_polygon(data = uae, aes(x=long, y = lat, group = group),
               fill = "blue", color="black") +
  coord_map(xlim = c(30,62), ylim = c(10,40), "polyconic") +
  theme(panel.background = element_rect(fill = "lightblue"), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), 
        axis.line=element_blank(), axis.text.x=element_blank(), axis.text.y=element_blank(),axis.ticks=element_blank(), 
        axis.title.x=element_blank(), axis.title.y=element_blank()) +
  theme(plot.margin=grid::unit(c(0,0,0,0), "mm"))
inset
```

To create an overlay of these two maps you will use `ggdraw` from the **cowplot** package. In this function `draw_plot` calls to the maps created above, and x,y and dimensions are provided to place the inset over the primary map.

```{r final map, echo=TRUE, message=FALSE, warning=FALSE}
ggdraw() +
  draw_plot(main_map) + 
  draw_plot(inset, x = 0.024, y = 0.595, width = 0.25, height = 0.25)
```

# Simple Species Distribution Map

Finally, using a number of the packages included above, we can examine factors that might influence the location of a particular species. For this example we will return to the PILO data and use bioclimatic variables from worldclim.org to examine the relationship between climate variables and PILO stands in the western US.

```{r species distribution, echo=TRUE, message=FALSE, warning=FALSE}
bioclim <- getData(name = "worldclim", res = 2.5, var = "bio", path = "./")

names(bioclim) <- c("Ann Mean Temp","Mean Diurnal Range","Isothermality","Temperature Seasonality",
                    "Max Temp Warmest Mo","Min Temp Coldest Mo","Ann Temp Range","Mean Temp Wettest Qtr",
                    "Mean Temp Driest Qtr","Mean Temp Warmest Qtr","Mean Temp Coldest Qtr","Annual
                    Precip","Precip Wettest Mo","Precip Driest Mo","Precip Seasonality","Precip Wettest
                    Qtr","Precip Driest Qtr","Precip Warmest Qtr","Precip Coldest Qtr")

bio.extent <- extent(x = c(
  min(pilo.rgbif.df$x),
  max(pilo.rgbif.df$x),
  min(pilo.rgbif.df$y),
  max(pilo.rgbif.df$y)))

bioclim.extent <- crop(x = bioclim, y = bio.extent)

bioclim.model <- bioclim(x = bioclim.extent, p = cbind(pilo.rgbif.df$x,pilo.rgbif.df$y))
presence.model <- dismo::predict(object = bioclim.model, 
                                 x = bioclim.extent, 
                                 ext = bio.extent)
```

In this script bioclimatic variable were generated and descriptive names were provided. Because the bioclimatic variables are global, an extent was created out of the PILO dataset to limit the analysis to their geographic range and the extent was used to crop the variables. Finally, a raster layer was created with a prediction based on the model object. To view that information on a map you can use the `gplot` function which is a wrapper for _ggplot_ from the **rasterVis** package used to visualize raster opjects.

```{r species distribution maps, echo=TRUE, message=FALSE, warning=FALSE, fig.height=6, fig.width=10}
rasterVis::gplot(presence.model) + 
  geom_polygon(data = us, aes(x= long, y = lat, group = group),
               fill = "gray", color="black") +
  geom_raster(aes(fill=value)) +
  geom_polygon(data = us, aes(x= long, y = lat, group = group),
               fill = NA, color="black") +
  geom_point(data = pilo.rgbif.df, aes(x = x, y = y), size = 2, color = "black", alpha = 0.5) +
  scale_fill_gradientn(colours=c("brown","yellow","darkgreen"), "Probability") +
  coord_fixed(xlim = c(-122.5,-110.5), ylim = c(36,42.5)) +
  xlab("Longitude") + ylab("Latitude") + ggtitle("Probability of PILO Occurrence") + 
  theme_bw() + theme(plot.title = element_text(hjust = 0.5)) + theme(legend.position = "right") +
  theme(panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "lightblue"))
```

This analysis included "the kitchen sink" of variables. The model could be better specified by removing variables that are unrelated to the growth response of PILO. Additionally you can see areas where PILO has a high probability of occurrence but is not present. In Arizona, this is the range of PIAR. You can also view this information in ```leaflet``` to examine the data interactively.

```{r leaflet species distribution, echo=TRUE, message=FALSE, warning=FALSE, fig.height=8, fig.width=10}
colors <- c("brown","yellow","darkgreen")

leaflet() %>% 
  addTiles() %>%
  addRasterImage(presence.model, colors = colors, opacity = 0.8) %>%
  addCircleMarkers(pilo.rgbif.df$x,
                   pilo.rgbif.df$y,
                   weight = 1,
                   color = "grey",
                   fillColor = "green",
                   fillOpacity = 0.7) %>%
  addMiniMap(position = 'topright',
             width = 100, 
             height = 100,
             toggleDisplay = FALSE) %>%
  addScaleBar(position = "bottomright")
```

# YOUR TURN!
Now it's your turn! Although some of this might not be applicable to your thesis research, try this information out on a dataset of your choice from any of the sources above. If you can apply this to your thesis, then add it to your website! Otherwise create a new repository for your work on Thursday. Hint: If uploading this as a project to GitHub be sure to add the bioclim folder to your \*.gitignore file to avoid errors relating to the size of folder exceeding GitHub standards.