# BEES1041 Exploring the Natural World #
# Week 1 Computer Exercise - Introduction to R #
***
This is a Jupyter notebook, which organises code and text into cells. This first cell is a text cell, which is formatted using a simple language called markdown. If you double click in this cell, you can edit it. You can also include formulas in markdown, like this one that relates the area (<em>a</em>) and radius (<em>r</em>) of a circle:

$a = \pi{r}^2$.

As this is the first computer exercise, it is a basic introduction to using R in a Jupyter Notebook on SWAN. The following text cells explain several important components of the R language, while the code cells demonstrate each component with an example. The next cell is a code cell, with a single line of code. Click on the cell to highlight it, and then click on the run button to see what the code does. While code is running, the cell has `[*]` next to it, and when it has finished it shows a number such as `[1]`, which records the order in which the cell was run in the current session.

In [None]:
head(iris)

The above cell uses the `head()` function to display the first six rows of a data set. Functions are reusable pieces of code, stored in R's libraries, that take arguments inside their brackets. If you want to know more about how a function works, or what arguments it will accept, you can get a description by using the `help()` function. For example, you can run `help(head)` to see how the `head()` function works.
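
As a small sketch of how arguments change what a function does, the cell below passes the optional `n` argument to `head()`. It also uses `tail()`, the counterpart of `head()`. The object names `first_three` and `last_two` are made up for this illustration and are not used elsewhere in the notebook.

```r
# Show only the first three rows by passing the optional n argument
first_three <- head(iris, n = 3)
first_three

# tail() shows rows from the end of the data set instead of the start
last_two <- tail(iris, n = 2)
last_two
```

Running `help(tail)` opens the same help page as `help(head)`, because the two functions are documented together.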

The `iris` data set is one of several that are built into the R language. They are included to make it easy to get started learning the language, and are convenient for testing bits of code. The `iris` data set contains the measurements in centimeters of different flower components (sepal length and width, and petal length and width), for 50 flowers from each of 3 species of iris plant. To read more on the data set you can run `help(iris)`.

The `iris` data set is stored as an R data frame, which is like a table, with data arranged in rows and columns. The top line of the table, called the header, contains the column names, and each line afterward is a data row. Each column of a data frame can contain different types of data, such as numbers, text, or dates. A column of data can be accessed directly by using the dollar sign, such as in the code below, which uses the `min()` and `max()` functions to show the range of values for the `iris$Petal.Length` column. Click on the cell and run the two lines of code.

In [None]:
min(iris$Petal.Length)
max(iris$Petal.Length)
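
The points above about column types and the dollar sign can be checked with a short sketch. It assumes nothing beyond the built-in `iris` data; the object names `column_types`, `petal_lengths` and `petal_range` are invented for this example.

```r
# Each column can hold a different type of data;
# sapply() applies the class() function to every column in turn
column_types <- sapply(iris, class)
column_types

# The dollar sign returns a single column as a vector of values
petal_lengths <- iris$Petal.Length
length(petal_lengths)

# range() returns the minimum and maximum together
petal_range <- range(petal_lengths)
petal_range
```

The `Species` column is stored as a factor, which is how R represents text that falls into a fixed set of categories.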

When a code cell has more than one line, it executes line by line, from top to bottom. This is similar to running a script, which is just a file that runs all the code it contains.

You can access values of a data frame using their row and column numbers inside a set of square brackets. To see the value in the first row of the second column, run the following cell.

In [None]:
iris[1,2]

You can access a particular row of the data frame by leaving out the column number. For example, to see the third row, run the following cell.

In [None]:
iris[3,]

You can access a column in the same way, by leaving out the row number. If you want a range of rows for a particular column, you can access them using a range of numbers separated by a `:`. To see the values in the fourth column for rows 49 to 52, run the following cell.

In [None]:
iris[49:52, 4]
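
A few variations on square-bracket indexing are sketched below. Selecting a column by its name and combining positions with `c()` are not covered elsewhere in this notebook, and the object names are invented for the example.

```r
# Leaving the row position empty selects every row of the fourth column
petal_widths <- iris[, 4]
length(petal_widths)

# The same column can be selected by its name instead of its number
petal_widths_by_name <- iris[, "Petal.Width"]
identical(petal_widths, petal_widths_by_name)

# c() combines several positions: rows 1 to 3 of the first and fifth columns
iris[1:3, c(1, 5)]
```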

You can get other information about the data frame using many different functions. For example, the `nrow()` function gives you the number of rows, while `ncol()` gives the number of columns.

In [None]:
nrow(iris)
ncol(iris)
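
As one more example of these information functions, `dim()` returns the number of rows and the number of columns together. The object name `iris_dimensions` is made up for this sketch.

```r
# dim() returns the number of rows followed by the number of columns
iris_dimensions <- dim(iris)
iris_dimensions

# The two values match nrow() and ncol()
iris_dimensions[1] == nrow(iris)
iris_dimensions[2] == ncol(iris)
```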

You can assign the output from a function to a new object using the `<-` symbol, which is just a `<` followed by a `-`. You can make up the object name, but it can't contain spaces, and it's best to keep names simple and descriptive. The following code uses the `colnames()` function to create an object called `column_names`. This object is a vector, which is an ordered collection of values of the same type, in this case the column names. You can access particular elements of vectors using the square brackets, like you did with the data frame. The following creates the `column_names` vector and prints the third value.

In [None]:
column_names <- colnames(iris)
column_names[3]
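
Vectors do not have to come from a function like `colnames()`. As a minimal sketch, the `c()` function builds one directly from values you type in. The object name `flowers_per_species` is invented here, and the values come from the description of the `iris` data (50 flowers for each of 3 species).

```r
# c() combines individual values into a vector
flowers_per_species <- c(50, 50, 50)

# length() gives the number of elements in the vector
length(flowers_per_species)

# Square brackets pick out a single element, here the second one
flowers_per_species[2]

# sum() adds the elements; the total matches the number of rows in iris
sum(flowers_per_species) == nrow(iris)
```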

You can change values of a vector or data frame by using the `<-` symbol to assign the new value. For example, the following code changes the third element of the `column_names` vector to `petal_length`. As this is a text element, you need to enclose the text in quotation marks. You can use single quotes `'petal_length'` or double quotes `"petal_length"`, as R treats them both the same.

In [None]:
column_names[3] <- 'petal_length'
column_names
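
The same kind of assignment works on a data frame. The sketch below changes values in a copy, so the built-in `iris` data is left untouched. The name `iris_copy` and the replacement value 3.6 are arbitrary choices for illustration.

```r
# Work on a copy so the original data set is not altered
iris_copy <- iris

# Replace the value in the first row of the second column
iris_copy[1, 2] <- 3.6
iris_copy[1, 2]

# The original data frame still holds its old value
iris[1, 2]

# Column names can be changed in place in the same way
colnames(iris_copy)[3] <- 'petal_length'
colnames(iris_copy)
```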

It is possible to create new data frames that are subsets of existing data frames. For example, if you only want to examine the data on the `virginica` species of iris, you can select those rows using the `subset()` function. Note that the condition used in `subset()` is `Species == 'virginica'`, which uses a double equal sign `==` to test where a condition is true. The `summary()` function is useful for displaying summary statistics for each column.

In [None]:
virginica_data <- subset(iris, Species == 'virginica')
summary(virginica_data)
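
To see what the `==` condition does on its own, the sketch below evaluates it outside `subset()`. The result is a vector of `TRUE` and `FALSE` values, one per row. The object names are invented for this example, and indexing rows with such a vector is an alternative to `subset()` that this notebook does not otherwise use.

```r
# The condition gives TRUE for virginica rows and FALSE for all others
is_virginica <- iris$Species == 'virginica'
head(is_virginica)

# sum() counts the TRUE values, because TRUE counts as 1 and FALSE as 0
sum(is_virginica)

# A TRUE/FALSE vector inside square brackets keeps only the TRUE rows
virginica_rows <- iris[is_virginica, ]
nrow(virginica_rows)
```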

Other conditional statements like 'not equal to' `!=`, 'greater than' `>`, 'less than' `<`, 'greater than or equal to' `>=`, and 'less than or equal to' `<=` can also be used to select subsets. For example, we can select only those rows in the iris data where `Petal.Length` is greater than or equal to 4.

In [None]:
long_petal_data <- subset(iris, Petal.Length >= 4)
nrow(long_petal_data)
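
Two of the other operators listed above are sketched here, using the same `subset()` pattern. The object names `not_setosa` and `short_petal_data` are made up for the example.

```r
# 'Not equal to' keeps every row except those of the setosa species
not_setosa <- subset(iris, Species != 'setosa')
nrow(not_setosa)

# 'Less than' selects the rows that Petal.Length >= 4 leaves out
short_petal_data <- subset(iris, Petal.Length < 4)
nrow(short_petal_data)

# Together, the two petal length subsets account for every row
nrow(short_petal_data) + nrow(subset(iris, Petal.Length >= 4)) == nrow(iris)
```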
min(long_petal_data$Petal.Length)

Sometimes, a single line of code goes over more than one line. This is necessary when the line is too long to display, and/or to make the code more readable. For example, you can combine multiple conditions to select a subset using the `&` operator. As long as the conditions are enclosed in brackets, R will ignore the three separate lines, and read the following call of the `subset()` function as a single line.

In [None]:
long_skinny_virginica <- subset(iris, Species == 'virginica' &
                                      Petal.Length >= 5 &
                                      Petal.Width <= 2)
nrow(long_skinny_virginica)
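
As a quick check that the line breaks make no difference, the sketch below writes the same call on one line and on three lines and compares the results. The object names `single_line` and `multi_line` are invented for this comparison.

```r
# The whole call written on one long line
single_line <- subset(iris, Species == 'virginica' & Petal.Length >= 5 & Petal.Width <= 2)

# The same call spread over three lines inside the brackets
multi_line <- subset(iris, Species == 'virginica' &
                           Petal.Length >= 5 &
                           Petal.Width <= 2)

# identical() returns TRUE when the two data frames are exactly the same
identical(single_line, multi_line)
```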

R lets you do mathematical calculations on whole columns of data, just by using the `+` plus, `-` minus, `/` divide and `*` multiply symbols. It also has many functions for statistical calculations, such as `mean()` to calculate the mean or average, and `sd()` to calculate the standard deviation. The following code calculates an area column as the length column multiplied by the width column. Then it calculates the mean and standard deviation of this new area column.

In [None]:
long_skinny_virginica$Petal.Area <- long_skinny_virginica$Petal.Length * long_skinny_virginica$Petal.Width
mean(long_skinny_virginica$Petal.Area)
sd(long_skinny_virginica$Petal.Area)
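
To make the element-by-element behaviour easier to see, here is a minimal sketch with two short vectors. The values and the names `lengths_cm`, `widths_cm` and `areas` are invented for illustration and are not taken from the `iris` data.

```r
# Two small made-up vectors of the same length
lengths_cm <- c(2, 4, 6)
widths_cm <- c(1, 2, 3)

# Multiplication pairs up the elements: 2*1, 4*2 and 6*3
areas <- lengths_cm * widths_cm
areas

# mean() and sd() each reduce the whole vector to a single number
mean(areas)
sd(areas)
```

Multiplying two data frame columns works the same way, row by row, which is why a single line of code was enough to fill the whole `Petal.Area` column.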

This is the end of the exercise. If you were already familiar with R it should have been very simple, but for those who are new to programming, I hope you followed the important concepts. You can always come back to this notebook to refresh your memory if you forget how to do something. In next week's exercise, you will analyse some real data to produce some graphs and investigate a real research question.

***
# Final instructions #
There are a few things you need to do:
- Don't forget to answer the Moodle quiz questions for this exercise.
- If you have any problems, or questions, please post on the Moodle forum.
- Save the completed notebook and download it to your computer, as Colab directories get emptied. Or you can save the files into your Google Drive.

***
# Further exercises #
- Write a new cell that prints out information on the `trees` dataset, including the column names, the number of rows, the number of columns, and the minimum, maximum, mean and standard deviation of the trees' `Height`.
- Write a markdown cell that explains how the circumference of each tree can be calculated from the diameter or girth measurement.