# BEES1041 Exploring the Natural World #
# Week 2 Computer Exercise - Linear models and vegetation structure #
***
This week you will run R code to analyse the tree measurement data you collected. You will calculate some statistics and make some plots, as you did in Excel, and you will learn to define a linear model that describes the relationship between tree height and stem diameter at breast height (DBH). Then you will look at another dataset of tree measurements from the Ausplots Forests project and compare it with the class data. The exercise is a key component of three of the course learning outcomes. It involves understanding experimental methods (CLO1); processing, analysing, and interpreting data (CLO2); and communicating results in visual forms suitable for scientific reports (CLO3).

![image.png](attachment:5a852844-fcb3-4762-b119-2eef14dd1fd3.png)

***
## Linear regression ##
Linear regression is a very commonly used statistical technique among scientists who explore the natural world. It is used to model the relationship between a response (also called dependent) variable y and an explanatory (also called independent or predictor) variable x. For example, we could use linear regression to test whether temperature (the explanatory variable) is a good predictor of plant height (the response variable). We will use it to test how well DBH can predict tree height.

In linear regression the model takes the form $y = \alpha + \beta{x} + \epsilon$
* $\alpha$ is the intercept (the value of y when x = 0)
* $\beta$ is the slope (the amount of change in y for each unit change in x)
* $\epsilon$ is the error term.

The error term is the part that makes the model statistical rather than mathematical. It is drawn from a statistical distribution that captures the random variability in the response, and is assumed to be normal (Gaussian). The goal in linear regression is to obtain the best estimates of the model coefficients ($\alpha$ and $\beta$). You should be aware that the model is sometimes written with different symbols for the coefficients (e.g., y = mx + b), and that it is sometimes called Ordinary Least Squares (OLS) linear regression. This is because the best values for the coefficients are those that minimise the sum of the squared differences between the measured and modelled values. We will explore this concept during the notebook.
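
To make the least-squares idea concrete, the sketch below simulates data from the model with known coefficients and checks that the line chosen by `lm()` gives a smaller sum of squared differences than a slightly steeper line. The data, the coefficient values (5 and 0.3) and the names `sim_x`, `sim_y` and `sum_sq` are invented for illustration and are not part of the class dataset.

```r
# Simulated (hypothetical) data: y = alpha + beta * x + normal error
set.seed(1)
sim_x <- runif(50, min = 5, max = 80)
sim_y <- 5 + 0.3 * sim_x + rnorm(50, mean = 0, sd = 3)

# Fit the linear regression
sim_model <- lm(sim_y ~ sim_x)

# Sum of squared differences between measured and modelled values
# for any candidate intercept and slope
sum_sq <- function(intercept, slope) {
    sum((sim_y - (intercept + slope * sim_x))^2)
}

best <- coef(sim_model)
ss_best <- sum_sq(best[1], best[2])           # the line chosen by lm()
ss_other <- sum_sq(best[1], best[2] + 0.05)   # a slightly steeper line

ss_best
ss_other
```

Changing the intercept or slope in any direction away from the fitted values should increase the sum of squares, which is what "least squares" refers to.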

***
## Why do a regression? ##
Why would we want to predict tree height from DBH measurements?

One reason to predict height from DBH is so that you don't need to measure height during field work. It is pretty easy to measure DBH, as you discovered. Tree height is much more difficult to measure accurately. This is often the point of modelling: to predict a quantity that is difficult to measure from one (or several) that is (or are) much easier to measure.

Another reason is that a linear regression model summarises the shape of the data into a line, described by its intercept and slope. This makes it easier to look at differences between two different datasets. For example, we can see if the relationship between DBH and height is different for trees measured by the class and those from mature forest sites measured by Ausplots Forests. 

***
## Part 1. The class data ##
First we will load the ggplot2 library that we will use for plotting, then we will load the tree measurement data from the CSV file. We can print the column names to remind ourselves how the data is organised, and print the number of rows to see how many trees were measured.

In [None]:
library(ggplot2)

tree_data <- read.csv('BEES1041_tree_measurements.csv')
colnames(tree_data)
nrow(tree_data)

Now we can make a plot of DBH against height, just like we did in Excel. In ggplot it is easy to colour the points by a category. Let's make a plot where trees in forests are a different colour to those that are alone. The first line of code in the cell below defines the size of the plot in inches using the `options()` function. Otherwise ggplot can make the graph a bit too large. You can experiment with different sized plots by changing the width and height values and re-running the cell.

In [None]:
options(repr.plot.width=4.5, repr.plot.height=3)

ggplot(tree_data, aes(x = dbh_cm, y = tree_height_m, colour = forest_or_alone)) +
       geom_point() +
       labs(x = "Diameter (cm)", y = "Height (m)") +
       theme_bw()

Now let's make another plot, but this time we will colour the points by whether the trees are Eucalypts or not, using the `euc_or_not` column.

In [None]:
options(repr.plot.width=4.5, repr.plot.height=3)

ggplot(tree_data, aes(x = dbh_cm, y = tree_height_m, colour = euc_or_not)) +
       geom_point() +
       labs(x = "Diameter (cm)", y = "Height (m)") +
       theme_bw()

If we want to compare our data with that from the Ausplots Forests sites, we should limit our dataset to similar trees. We already know there is a great deal of variability in our class dataset, which includes many varieties of trees from many places. We should select only those trees that are both in forests and are Eucalypts. We can do that with the `subset()` function, and then see how many trees are in that subset by using the `nrow()` function.

In [None]:
tree_subset <- subset(tree_data, (euc_or_not == 'Eucalypt') & (forest_or_alone == 'forest'))
nrow(tree_subset)

Now we can make a plot of only these trees, to see how their DBH and height are related. 

In [None]:
options(repr.plot.width=3, repr.plot.height=3)

ggplot(tree_subset, aes(x = dbh_cm, y = tree_height_m)) +
       geom_point() +
       labs(x = "Diameter (cm)", y = "Height (m)") +
       theme_bw()

We have far fewer trees than in the total dataset, but a lot of the variability has been removed. Now we will fit a linear regression between diameter and height using the `lm()` function, to see how closely the two variables are related, and to characterise the shape of this relationship. The main argument to `lm()` is the model formula y ~ x, where the response variable is on the left of the tilde (~) and the explanatory variable is on the right. The `summary()` function provides a detailed output of the regression, including the values for the intercept and slope.

In [None]:
class_model <- lm(tree_height_m ~ dbh_cm, data = tree_subset)
summary(class_model)

The above summary gives lots of information, and can seem a bit overwhelming. I will explain the important parts that I want you to remember. The part that shows the intercept and slope is under the heading of Coefficients:
<center><div> <img src="attachment:163ad5b3-5c0d-4bee-b102-ffba15720939.png" width="150"/></div></center>

In this case the intercept is 6.63499 and the slope is 0.19371, so the regression equation is:

<center>$height = 6.63499 + 0.19371 \times DBH + \epsilon$</center>

<br><br>
The strength of the modelled relationship is captured by the $R^2$ value, the proportion of variance in the response that is explained by the explanatory variable. The closer this is to 1.0, the more of the variance the model explains, and the better it can predict the response variable from the explanatory variable. In the model summary above the $R^2$ value is reported as `Multiple R-squared:  0.4998`, so DBH explains about half of the variance in height. The reason it is called "multiple" $R^2$ is that you can use multiple predictor variables in the `lm()` function.
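
As a rough check on what $R^2$ means, the sketch below calculates it by hand as one minus the ratio of the unexplained variation to the total variation, and compares the result with the value stored in the model summary. The data are simulated, and names such as `sim_dbh` and `sim_height` are placeholders rather than columns from the class data.

```r
# Simulated (hypothetical) data, not the class measurements
set.seed(2)
sim_dbh <- runif(40, min = 10, max = 60)
sim_height <- 6 + 0.2 * sim_dbh + rnorm(40, sd = 2)
sim_model <- lm(sim_height ~ sim_dbh)

# Variation left unexplained by the model (residual sum of squares)
ss_residual <- sum(resid(sim_model)^2)
# Total variation of the response around its mean
ss_total <- sum((sim_height - mean(sim_height))^2)

r2_by_hand <- 1 - ss_residual / ss_total
r2_from_summary <- summary(sim_model)$r.squared

c(by_hand = r2_by_hand, from_summary = r2_from_summary)
```

The two numbers should agree, which shows that $R^2$ is simply the fraction of the total variation in height that the fitted line accounts for.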

To plot the regression line on the graph, we could use the `geom_smooth()` function in ggplot, which is like adding a trend line to an Excel plot. However, I prefer to do it a different way, which helps to illustrate what the regression line actually is. First, I calculate the position of the line by applying the model to the measured DBH values, which creates a set of height predictions. Then we can add a line to the graph that is defined by the measured DBH values and the predicted height values. Doing it this way emphasises that the vertical distance on the graph between each point and the line is the error in the model, or the difference between the measured and predicted values of height. These errors are also known as the residuals.

In the next cell, the first line of code creates a new predicted_height column in the data frame by applying the `predict()` function to the model that we made earlier. This is the same as if we applied the equation `tree_subset$predicted_height <- 6.63499 + 0.19371*tree_subset$dbh_cm`
    
The second section of code makes the same graph we had before, but now it adds the regression line.

In [None]:
tree_subset$predicted_height <- predict(class_model)

options(repr.plot.width=3, repr.plot.height=3)

ggplot(tree_subset, aes(x = dbh_cm, y = tree_height_m)) +
       geom_point() +
       labs(x = "Diameter (cm)", y = "Height (m)") +
       geom_line(aes(y = predicted_height), size = 1, color = 'red') +
       theme_bw()
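
The claim that `predict()` gives the same values as the regression equation can be checked directly. The sketch below does this on a simulated data frame that stands in for `tree_subset` (the values are invented; only the column names mirror the class data), and then shows how the same model might be used to predict heights for trees that were never measured, assuming hypothetical DBH values of 20, 40 and 60 cm.

```r
# Simulated (hypothetical) data frame standing in for tree_subset
set.seed(3)
sim_trees <- data.frame(dbh_cm = runif(30, min = 10, max = 70))
sim_trees$tree_height_m <- 6 + 0.2 * sim_trees$dbh_cm + rnorm(30, sd = 2)

sim_model <- lm(tree_height_m ~ dbh_cm, data = sim_trees)

# predict() with no new data returns the fitted value for each measured tree
from_predict <- predict(sim_model)

# The same values calculated directly from the intercept and slope
coefs <- coef(sim_model)
by_hand <- coefs["(Intercept)"] + coefs["dbh_cm"] * sim_trees$dbh_cm

all.equal(unname(from_predict), unname(by_hand))

# Predicted heights for trees that were not measured (DBH of 20, 40, 60 cm)
new_trees <- data.frame(dbh_cm = c(20, 40, 60))
new_heights <- predict(sim_model, newdata = new_trees)
new_heights
```

Using `coef()` rather than typing the numbers from the summary avoids rounding differences, and the column name in `newdata` has to match the explanatory variable used in the model formula.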

One thing to check when fitting a linear regression is whether our assumption of a linear relationship is reasonable. There is no point trying to fit a straight line if the relationship between the variables is curved. To test the linear assumption, we can look at a plot of the model residuals, which are the vertical differences between the predicted values (the line) and the measured values (the points). If there is no pattern in a plot of the residuals against the predicted values, then the assumption is reasonable. We can make this residual plot by using the following simple code.

In [None]:
options(repr.plot.width=5, repr.plot.height=4)

plot(class_model, which = 1)

There isn't a strong pattern in the above graph, so our linear model appears appropriate for our data.
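
To see where the numbers in a residual plot come from, the sketch below calculates the residuals by hand as measured minus predicted values and plots them against the predicted values. This is roughly what `plot(model, which = 1)` draws, without the smoothed trend line it adds. The data are simulated and the variable names are placeholders.

```r
# Simulated (hypothetical) data, not the class measurements
set.seed(4)
sim_dbh <- runif(40, min = 10, max = 60)
sim_height <- 6 + 0.2 * sim_dbh + rnorm(40, sd = 2)
sim_model <- lm(sim_height ~ sim_dbh)

# Residual = measured value minus predicted value
res_by_hand <- sim_height - fitted(sim_model)

# resid() returns the same numbers
all.equal(unname(res_by_hand), unname(resid(sim_model)))

# A hand-made version of the residual plot: residuals against predicted values
plot(fitted(sim_model), res_by_hand,
     xlab = "Predicted height (m)", ylab = "Residual (m)")
abline(h = 0, lty = 2)   # points should scatter evenly around this line
```

Because the simulated data really do follow a straight line, the points should show no curve or funnel shape around the dashed zero line.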

***
## Part 2. Ausplots Forests data ##
Now we will look at the Ausplots Forests data. We will conduct an identical analysis to that on the class data, and we will see if the mature forest trees sampled have a different relationship between diameter and height than the class trees.

First we need to load the Ausplots Forests data from the CSV file, and check the column names, and the number of trees in the dataset.

In [None]:
ausplots_data <- read.csv('ausplot_forest_data.csv')
colnames(ausplots_data)
nrow(ausplots_data)

As we want the same types of trees that we used for the class data, we need to select all the Eucalypt trees. But we also need to remove some trees whose data are not comparable. Looking at the column names, and opening the data in Excel, I can see that we want to select only live trees where the diameter was measured at 1.3 m, and where diameter and height were both measured. The `subset()` function will do this in the following cell.

In [None]:
ausplots_subset <- subset(ausplots_data,
                          (Point_Of_Measurement == 1.3) &
                          (Diameter > 0) &
                          (Height > 0) &
                          (Tree_Status == 'A'))
nrow(ausplots_subset)

Now we have far fewer trees than in the original dataset. But we also need to select only Eucalypt trees. To do this we need to create a new column for Genus from the Genus_Species column, and then we can select those trees that belong to the genera Eucalyptus, Angophora, and Corymbia. These are the genera of trees that define Eucalypt forest. One website that describes Eucalypt forest is by the [Australian Government Department of Agriculture, Water and the Environment](https://www.agriculture.gov.au/abares/forestsaustralia/profiles/eucalypt-2019). We can use the `separate()` function, which comes from the `tidyr` package loaded by the `tidyverse` library, to split the text in the Genus_Species column and create new separate columns for Genus and Species. We can then use the `colnames()` function to make sure those new columns were created.

In [None]:
library(tidyverse)

ausplots_subset <- separate(ausplots_subset, Genus_Species, c("Genus", "Species"),
                            sep = " ", extra = "drop",
                            remove = FALSE, fill = "right")
colnames(ausplots_subset)
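
The arguments to `separate()` are easier to follow on a tiny example. The sketch below applies the same call to three made-up names: one with two words, one with extra words, and one with a single word. It loads `tidyr` directly, which is the tidyverse package that provides `separate()`. The data frame `toy` and its contents are invented for illustration.

```r
library(tidyr)

# Made-up names covering the three cases separate() has to handle
toy <- data.frame(Genus_Species = c("Eucalyptus regnans",
                                    "Corymbia maculata subsp. x",
                                    "Unknown"),
                  stringsAsFactors = FALSE)

toy_split <- separate(toy, Genus_Species, c("Genus", "Species"),
                      sep = " ", extra = "drop",
                      remove = FALSE, fill = "right")
toy_split
```

Here `extra = "drop"` discards any words after the second, `fill = "right"` puts `NA` in Species when there is only one word, and `remove = FALSE` keeps the original Genus_Species column alongside the new ones.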

Now we can use the new Genus column to select those trees that are Eucalypts, with the following code. Note how the `|` symbol denotes `OR` in the subset expression rather than using `&`, which denotes `AND`.

In [None]:
ausplots_subset <- subset(ausplots_subset,
                          (Genus == "Eucalyptus") |
                          (Genus == "Angophora") |
                          (Genus == "Corymbia"))
nrow(ausplots_subset)
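
When the list of values to match gets longer, chained `|` conditions can become hard to read. One possible alternative is the `%in%` operator, which tests whether each value appears in a vector. The sketch below uses a made-up data frame (`toy`) to show that both forms select the same rows.

```r
# Made-up genus values for illustration
toy <- data.frame(Genus = c("Eucalyptus", "Acacia", "Corymbia",
                            "Angophora", "Nothofagus"),
                  stringsAsFactors = FALSE)

eucalypt_genera <- c("Eucalyptus", "Angophora", "Corymbia")

# Chained OR conditions, as in the cell above
with_or <- subset(toy, (Genus == "Eucalyptus") |
                       (Genus == "Angophora") |
                       (Genus == "Corymbia"))

# The same selection written with %in%
with_in <- subset(toy, Genus %in% eucalypt_genera)

identical(with_or, with_in)
```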

Now we have even fewer trees left from the original data, but we have made sure that we are comparing the same types of trees that we analysed in the class data.

The next code cell calculates the linear regression between DBH and height on the Ausplots Forests data, in the same way we did for the class data.

In [None]:
ausplot_model <- lm(Height ~ Diameter, data = ausplots_subset)
summary(ausplot_model)

Looking at the summary of the linear regression model, we can see that the intercept, slope and $R^2$ values are different to the model run on the class data.
* The $R^2$ is greater, meaning that in this dataset DBH explains a greater proportion of the variance in tree height.
* The slope is greater, meaning that tree heights in Ausplots Forests increase more quickly with DBH than in our class data.
* The intercept is also greater, meaning that the model predicts that heights are greater when DBH is small in the Ausplots forests.
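
Rather than reading the two summaries side by side, the intercept, slope and $R^2$ of each model can be collected into one small table. The sketch below does this for two simulated datasets. The helper `fit_row()` and all of the data are hypothetical, and the same approach could be applied to `class_model` and `ausplot_model`.

```r
# Two simulated (hypothetical) datasets with different underlying lines
set.seed(5)
dbh_a <- runif(40, min = 10, max = 60)
height_a <- 6 + 0.2 * dbh_a + rnorm(40, sd = 4)
dbh_b <- runif(40, min = 10, max = 60)
height_b <- 10 + 0.4 * dbh_b + rnorm(40, sd = 2)

model_a <- lm(height_a ~ dbh_a)
model_b <- lm(height_b ~ dbh_b)

# Hypothetical helper: one row holding the three numbers being compared
fit_row <- function(model, label) {
    data.frame(dataset = label,
               intercept = unname(coef(model)[1]),
               slope = unname(coef(model)[2]),
               r_squared = summary(model)$r.squared)
}

comparison <- rbind(fit_row(model_a, "A"), fit_row(model_b, "B"))
comparison
```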

Now we should plot the data with the regression line and see the differences.

In [None]:
ausplots_subset$predicted_height <- predict(ausplot_model)

options(repr.plot.width=3, repr.plot.height=3)

ggplot(ausplots_subset, aes(x = Diameter, y = Height)) +
       geom_point() +
       labs(x = "Diameter (cm)", y = "Height (m)") +
       geom_line(aes(y = predicted_height), size = 1, color = 'red') +
       theme_bw()

One thing the plot shows is that the linear model does not fit the data very well at either end of the scatter of points. For small DBH values most of the measured points are below the line. The same is true for large values of DBH.

We should look at a plot of the model residuals, to see if there is a pattern. We can make this residual plot by using the following simple code.

In [None]:
options(repr.plot.width=5, repr.plot.height=4)
plot(ausplot_model, which = 1)

The residual plot confirms that there is a pattern, and that the relationship between DBH and height in the Ausplots Forests data is not linear. We can look at fitting non-linear regression models in a future class. For now we will continue to look at the linear model, and how it differs from the model for the class data.

It will be easier to compare the two regression models if we create a plot with the class data and the Ausplots Forests data side by side. Use the `cowplot` library and the following code to make a figure with two panels. The `coord_cartesian()` function is used to make sure the axes of each graph have the same ranges.

In [None]:
library(cowplot)

tree_plot <- ggplot(tree_subset, aes(x = dbh_cm, y = tree_height_m)) +
                    geom_point() +
                    labs(x = "Diameter (cm)", y = "Height (m)", title = "(A) Class") +
                    theme_bw() +
                    geom_line(aes(y = predicted_height), size = 1, color = 'red') +
                    coord_cartesian(xlim=c(min(tree_subset$dbh_cm), 1.02*max(ausplots_subset$Diameter)),
                                    ylim=c(min(tree_subset$tree_height_m), 1.02*max(ausplots_subset$Height)))

aus_plot <- ggplot(ausplots_subset, aes(x = Diameter, y = Height)) +
                   geom_point() +
                   labs(x = "Diameter (cm)", y = "Height (m)", title = "(B) Ausplots") +
                   theme_bw() +
                   geom_line(aes(y = predicted_height), size = 1, color = 'red') +
                   coord_cartesian(xlim=c(min(tree_subset$dbh_cm), 1.02*max(ausplots_subset$Diameter)),
                                   ylim=c(min(tree_subset$tree_height_m), 1.02*max(ausplots_subset$Height)))

options(repr.plot.width=5.5, repr.plot.height=3)
plot_grid(tree_plot, aus_plot, ncol = 2, align = "h")

Now you should export the graphs you have created to image files. You can look at the code from last week's computer exercise and add a code cell to save the graph.

In [None]:
# Add your code to export the graph here. This line is a comment, as it starts with a hash symbol.


# Comment lines do not affect the code at all.

This is the end of the exercise. You have explored two real datasets of measurements from the natural world, made some plots that can be saved and used in scientific reports, and learned to define linear regression models and interpret their results.

***
# Final instructions #
There are a few things you need to do:
- Don't forget to answer the Moodle quiz questions for this lab.
- If you have any problems or questions, please post on the Moodle forum.
- Save the completed notebook and download it to your computer, as SWAN scratch directories get emptied. Alternatively, you can move the files into your CloudStor directory. You can also download the graph.

***
# Further exercise #

The Ausplots Forests dataset has lots of information that we haven't explored. You can explore it further, to see if the spread of points on the DBH-height graph can be explained by some of the other variables recorded. For example, it would be interesting to see if certain Eucalypt tree species have linear models with greater $R^2$ values.