# From BCCVL to _ecocloud_: further analyses on the modelling data
While BCCVL gives you a comprehensive suite of results in your SDM experiment, you might want to do some further analyses that are not included in the BCCVL yet. Here we showcase how you can do some easy data explorations using the data from your SDM. For example, you might be interested in checking:
- What is the range and the average of the values of my climatic variables? For example - how much does the precipitation vary across all the locations in which the species has been observed?
- How often is the species reported at a given value of a climatic variable? For example - is the species most frequently observed in areas with 100 mm of precipitation or 1000 mm of precipitation? Are there any gaps in the environmental variable values across your occurrence records?
- How much are the variables in my model correlated?

Here we will step you through how you might want to address some of these questions using _ecocloud_ and R Notebook.

## Step 1: Obtaining your data from BCCVL
If you haven't already, you will need to download one of the output files from the BCCVL for the next analysis. 
- Under the Results tab of your experiment, select one algorithm. 
- Click the drop down ‘more’ button, which will show all the results for that algorithm.
- Search for the ‘Occurrence points with environmental data’ output, and click the blue download button next to this result.
- Double check that this is the correct file. The filename has the following format: "occurrence_environmental_species.name_algorithm.csv".
- Save this file somewhere that you can access it in the next session.

## Step 2: Uploading your data in _ecocloud_
__Before uploading data, make sure you are currently sitting in the 'workspace' folder (as opposed to 'home'). This will ensure that your files are saved in your _ecocloud_ account.__ 

- Let's first create a new folder to store all the files of this workshop. Click the 'New folder' icon. This creates an 'Untitled Folder'.
- Right click the folder, and select Rename. Name your folder 'ResBaz'.
- Double click the ResBaz folder to make sure you are working in that.
- Click the 'Upload files' button in the top of the panel on the left.
- Find the csv file that you downloaded from BCCVL. This will now be saved in your ResBaz folder.

## Introduction to Jupyter Notebooks <a name="notebook"></a>

Jupyter notebooks allow you to create a live document in which you can edit text, code and equations, or share results. The advantage of using a Jupyter notebook is that you are able to work and annotate your data, which you may want to share with colleagues, or use in published materials. 

In *ecocloud*, we use Jupyter notebooks for this reason. They let you work on a document and save it for later. For more of an introduction to using Jupyter itself, head to our **[support page](https://support.ecocloud.org.au/support/solutions/articles/6000200389-using-jupyter-notebooks)** or the **[Jupyter website](http://jupyter.org/)**.

Notebooks have three different types of cells:
- Markdown: these are used for comments. Check this page for useful [Markdown tips](https://www.ibm.com/support/knowledgecenter/SSQNUZ_current/com.ibm.icpdata.doc/dsx/markd-jupyter.html).
- Code: these are used for R or Python code. The coding language that you can run depends on what kind of server you have running in *ecocloud*.
- Raw: can be used to write output directly.

Use the commands on the top of the notebook to (from left to right):
- Save your notebook.
- Add a new cell (these are code cells by default, so if you want the cell to be markdown, change this in the drop-down menu).
- Cut selected cells.
- Copy selected cells.
- Paste from clipboard.
- Execute a cell. 
- Interrupt the kernel. This will stop the execution of cells.
- Restart the kernel.

## Using Notebooks in your *ecocloud* workspace <a name="upload"></a>
When you open your server in *ecocloud* you will automatically see your workspace loaded in the menu on the left. You can open a notebook from here, or find a notebook elsewhere. For example:
- Go to a **Google Drive** folder by clicking the Google Drive icon.
- Go to a **GitHub repository**. As an example: click the GitHub icon (the cat icon, 3rd from the top). By default you will see the *ecocloud* training repository. If you are running an R server, click on the R-notebooks folder and open the notebook you want to work with.

You can also start a new notebook by clicking "File" > "New" > "Notebook".

**Work through an existing notebook**

If you want to work through a notebook without seeing the outputs of pre-run script, go to "Edit" > "Clear All Outputs". This will give you a clean notebook with just the markdown and code cells.

You can execute cells of code in a notebook by clicking on the cell and then pressing "Shift+Enter" on your keyboard, or by clicking the "Play" icon in the menubar. This will execute the current cell, show any output, and jump to the next cell below. During execution of a cell you will see a * next to the cell.

## Step 4: Save your Notebook

If you use a notebook from a GitHub repository, please save it first to a folder in your own workspace so any of your changes will be saved. 
- Go to 'File' > 'Save Notebook As...'
- Use the following name to save: 'workspace/ResBaz/ResBaz.ipynb'

__Note__: it is really important that when you save your notebook, you always use the workspace (and folder name) instead of 'GitHub:ausecocloud/training/R-notebooks'
Without the 'workspace/' in the name your Notebook will not be saved in the workspace for future use. 

## Step 5: Setting the working directory
Whether you are working in the RStudio server or through a Notebook, setting the working directory is important. Do this by running the code below (press 'Shift+Enter' or click the 'Play' button in the menubar).

In [None]:
setwd("/home/jovyan/workspace/ResBaz")
getwd()

## Step 6: Install and load packages
This script requires a few packages. These are mostly for the graphics. You can find out more about each package by typing `?` before the name of the package, for example `?ggplot2`.

In [None]:
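# Install corrplot only if it is not already available
if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")

library(tidyr)     # data tidying
library(ggplot2)   # histograms
library(reshape2)  # melt(), used to reshape the data for plotting
library(readr)     # reading data files
library(Hmisc)     # general data analysis helpers
library(corrplot)  # correlation plot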
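
If a package is not installed, `library()` stops with an error. As a rough sketch, the cell below reports which packages are missing so that you only install what you need. The helper name `missing_packages` is our own invention and is not part of any package.

```r
# Hypothetical helper: return the names of packages that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages whose functions are called later in this notebook
needed <- c("ggplot2", "reshape2", "corrplot")
to_install <- missing_packages(needed)
print(to_install)

# Uncomment to install anything that is missing:
# if (length(to_install) > 0) install.packages(to_install)
```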

## Step 7: Load data
Although your dataset is sitting in the workspace folder, it has not yet been loaded into memory, so you can't work on it yet. In this example we use the Mountain Ash occurrence data, but you can use whatever data you downloaded. 
__Note:__ The file name is really important. The name in the script must match the name of your file exactly, otherwise you will get an error.

In [None]:
occurrence <- read.csv("occurrence_environmental_Eucalyptus.regnans_maxent.csv")
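
If the file name is misspelt, `read.csv()` fails with a fairly terse error. The sketch below shows one possible way to get a more helpful message that lists the CSV files actually present in the folder. The helper `read_occurrence` and the example file are made up for illustration only.

```r
# Hypothetical helper: read a CSV, or stop with a message that lists the
# CSV files that are present in the folder.
read_occurrence <- function(filename, dir = ".") {
  path <- file.path(dir, filename)
  if (!file.exists(path)) {
    available <- list.files(dir, pattern = "\\.csv$")
    stop("Cannot find '", filename, "'. CSV files in this folder: ",
         paste(available, collapse = ", "))
  }
  read.csv(path)
}

# Example with a small made-up file written to a temporary folder
tmp <- tempdir()
write.csv(data.frame(lon = 145.5, lat = -37.5, B01 = 11.2),
          file.path(tmp, "occurrence_example.csv"), row.names = FALSE)
example <- read_occurrence("occurrence_example.csv", dir = tmp)
head(example)
```

With your own data you would call `read_occurrence()` with the name of the file you uploaded; `dir = "."` means the current working directory.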

## Step 8: Check your data
You always want to check your data. This lets you know how it has been read in, what the columns are called, and what the rows and columns contain. The easiest way to do this is with the `head()` command.

In [None]:
head(occurrence)

Here you can see that the data that was loaded has a column for the longitude, the latitude, the species name, and the climate variables B01, B14, B05, B04, B06, B12 and B13. Your data might look a little different depending on the species you selected and the climate variables you included. 
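
Later steps pick the climate variables by column position, which changes if your file has a different layout. As an alternative sketch, you could pick them by name instead. The data frame below is made up (the column names `lon`, `lat` and `species` are assumptions, so check yours with `head()`), and the pattern assumes the climate columns are named `B` followed by two digits.

```r
# Made-up rows that mimic the layout described above (not real records)
example <- data.frame(
  lon = c(145.6, 145.9, 146.2),
  lat = c(-37.6, -37.4, -37.8),
  species = "Eucalyptus regnans",
  B01 = c(10.8, 11.5, 9.9),
  B12 = c(1350, 1420, 1510)
)

str(example)  # column names and data types

# Keep the columns named "B" followed by two digits
is_bioclim <- grepl("^B[0-9]{2}$", names(example))
predictor_names <- names(example)[is_bioclim]
print(predictor_names)
```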

## Step 9: Generate summary boxplots
There are two plots that we are interested in to explore the data - boxplots and histograms. First we will start with the boxplots. 
__Note__ you can only include variables in a single plot that have the same unit. For example, you cannot include both temperature and precipitation in a summary boxplot, because these have two different units and aren't comparable in a single boxplot. 

A boxplot is a simple way to represent data. The box spans the middle half of the data, from the first quartile (25%) to the third quartile (75%), and the line in the middle of the box represents the median value. The whiskers extend from the box to the most extreme values that are not outliers (by default in R, values within 1.5 times the height of the box). Any outliers are represented as dots. 

For this dataset, a boxplot can be used to show the distribution of the values of the selected climate variables for different occurrences.
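
To see the numbers behind a boxplot, you can use the base R function `boxplot.stats()`. The sketch below uses made-up temperature values, not the Mountain Ash data, with one unusually low value added on purpose.

```r
# Made-up temperature values (degrees C), including one unusually low value
temperature <- c(9.8, 10.1, 10.4, 10.9, 11.2, 11.5, 11.9, 12.3, 3.0)

stats <- boxplot.stats(temperature)

# Five numbers: lower whisker, bottom of box, median, top of box, upper whisker
print(stats$stats)

# Values that would be drawn as individual dots
print(stats$out)
```

The same function can be applied to a single column of your own data, for example `boxplot.stats(occurrence$B01)`.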

For the first plot we have grouped B01, B05 and B06 together because they are all temperature variables.  

__Note__ you may have to adjust the code to represent your data. You can also rename the x-axis or the y-axis by changing the text within the quotation marks `""`

In [None]:
boxplot(occurrence[, c("B01", "B05", "B06")], #change B01 etc to the variables in your model
        xlab = "Climate Variables",
        ylab = "Temperature (C)")

The standard bioclim temperature variables are:
- B01: Annual mean temperature
- B02: Mean diurnal range
- B03: Isothermality (=B02/B07 * 100)
- B04: Temperature seasonality
- B05: Maximum temperature of the warmest month
- B06: Minimum temperature of coldest month
- B07: Temperature annual range (B05-B06)
- B08: Mean temperature of wettest quarter
- B09: Mean temperature of driest quarter
- B10: Mean temperature of warmest quarter
- B11: Mean temperature of coldest quarter

What do the boxplots for the temperature values show you? Are there any outliers?

### __Enter your interpretation here by double clicking on these words__

Next we can generate boxplots for the rainfall variables. The standard bioclim rainfall variables are:
- B12: Annual precipitation
- B13: Precipitation of wettest month
- B14: Precipitation of driest month
- B15: Precipitation seasonality (coefficient of variation)
- B16: Precipitation of wettest quarter
- B17: Precipitation of driest quarter
- B18: Precipitation of warmest quarter
- B19: Precipitation of coldest quarter

Run the code below first, and then write in the box your interpretation for the precipitation values.

In [None]:
boxplot(occurrence[, c("B12", "B13", "B14")],
        xlab = "Climate Variables",
        ylab = "Precipitation")

### __Enter your interpretation here by double clicking on these words__


## Step 10: Get the average value for each predictor variable

The boxplots give a good insight into the spread of your data and show the median, but we would also like to know the average (mean) value of each of the climate variables.

Let's see if you can add a code cell to this notebook (click the + on the top of the page), then add the code below:

predictors <- occurrence[, 4:10] # this creates a subset of the occurrence data with only the predictor variables

colMeans(predictors[sapply(predictors, is.numeric)]) # this gets the average of each numeric column
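
To answer the question about the range as well as the average, you could extend this idea to report the minimum, mean and maximum of each variable in one table. This is only a sketch on made-up values; `predictors_example` stands in for your own `predictors` data.

```r
# Made-up predictor values standing in for the BCCVL output
predictors_example <- data.frame(
  B01 = c(10.8, 11.5, 9.9, 12.1),
  B12 = c(1350, 1420, 1510, 1280)
)

# One column per variable; rows give the minimum, mean and maximum
summary_table <- sapply(predictors_example, function(x) {
  c(min = min(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    max = max(x, na.rm = TRUE))
})
print(summary_table)
```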


## Step 11: Generate summary histograms
Another plot that can be generated is a histogram. Histograms are useful to visualise frequency statistics. Here we use the histogram plot to visualize the frequency of occurrence records across all variables. Each bar represents the number of occurrence records observed for that value of the predictor variable. 

__Note__ this plot is generated by calling the columns of data, which from the `head()` command we saw were columns 4 to 10. Again, you will need to adjust this code so that it is relevant to your data.

In [None]:
ggplot(melt(occurrence[,4:10]),aes(x=value)) + geom_histogram() + facet_wrap(~variable, scales = "free")

If your data is anything like ours you might find it interesting to see that some of the variables have a double peak. What do you think this means for the species you are looking at? 
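
A double peak usually comes with a stretch of values in between that has few or no records. One way to look for such gaps is to compute the histogram counts without drawing the plot. The values and the 200 mm bin width below are made up for illustration.

```r
# Made-up annual precipitation values (mm) with a gap between two clusters
precipitation <- c(900, 950, 980, 1020, 1050, 1600, 1650, 1700)

# Compute the histogram counts without drawing the plot
h <- hist(precipitation, breaks = seq(800, 1800, by = 200), plot = FALSE)

# Table of bin edges and counts; bins with a count of 0 are gaps
bins <- data.frame(from = head(h$breaks, -1),
                   to = tail(h$breaks, -1),
                   count = h$counts)
print(bins)
print(bins[bins$count == 0, ])
```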

### Type your answer in this box by double clicking on these words


## Step 12: Generate a correlation plot
It is good to check whether the predictor variables in your model are correlated as you want to avoid high correlation between predictors.

To generate this plot we use the `corrplot()` function from the corrplot package. This generates a matrix which compares each variable against the others. Colours within the plot represent both the direction (positive or negative) and the strength of the correlation. The strength of the correlation is also represented by the size of the circle. 

__Note__ again, for this plot we selected the data to correlate based on their column numbers, so be sure to adjust this if your columns have different numbers.

In [None]:
predictors <- occurrence[, 4:10]

There are several correlation statistics you can choose from, but here we have chosen the Pearson correlation coefficient (`"pearson"`). If you want to learn more about the options and what they mean, you can enter the following code: `?cor`

In [None]:
variable_correlations <- cor(predictors, method="pearson")

corrplot(variable_correlations, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)

The first thing you might notice is that every variable has a strong positive correlation with itself. That's a good thing! It would be concerning otherwise...

Now we can start interpreting some of these results. If this plot shows you high correlation between two variables, you might want to rerun your SDM in the BCCVL with only one of the two highly correlated variables. 
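
Reading pairs off the plot gets harder with many variables, so it can help to list the highly correlated pairs as a table. The sketch below uses made-up values. The cut-off of 0.7 is an assumption for illustration only, so pick a value that suits your own study.

```r
# Made-up predictor values; B05 is constructed to track B01 closely
predictors_example <- data.frame(
  B01 = c(9.9, 10.4, 10.8, 11.5, 12.1, 12.6),
  B05 = c(21.0, 21.8, 22.1, 23.0, 23.9, 24.3),
  B12 = c(1510, 1290, 1420, 1350, 1480, 1300)
)

cor_matrix <- cor(predictors_example, method = "pearson")

# Assumed cut-off for "high" correlation; adjust to suit your study
threshold <- 0.7

# Look only at the upper triangle so that each pair is listed once
high <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix),
              arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r = cor_matrix[high]
)
print(high_pairs)
```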

## Congratulations!
You have completed the additional analyses in R using _ecocloud_, both in RStudio and through a Notebook. We hope this has given you some insight into how to use a cloud-based tool. While you could quite easily run these analyses on your own laptop, the power of cloud computing could come in really handy when you have large data, if you want to use Notebooks like this, or if you want to collaborate easily with others through Google Drive or GitHub. 

Check the handout for some more useful information about other _ecocloud_ functionalities and other GitHub repositories with training notebooks.