
# Lab3 Outcomes

In this lab you should read through and run the code in this lab sheet and complete the lab assessment, either on a lab computer or at home. The lab assessment can be found in LMS. Remember that you only have 60 minutes to complete the assessment once you start it. 
Run the following lines of code first please, as they will ensure better graphics throughout this lab.


In [None]:

library(repr)

# Change plot size to 4 x 4
options(repr.plot.width=4, repr.plot.height=4, repr.plot.res = 120)


# Exercise 1: Larvae Data 

So far we have mostly worked with small datasets, and it was fairly easy to get an idea of the main features of the variables by visually inspecting the entire dataset. For larger datasets this becomes impossible. To still get an idea of what our data looks like, we can instead use "summary statistics", which generally include measures of central tendency such as the mean and the median. These characteristics give us an idea of our data without going through a long list of individual data points.  
To practise this, first load the larvae dataset. **Make sure that the csv file is located in the same folder as the .ipynb file you opened**.


To start, read in the larvae dataset. Like last week, the `head()` command will print the first 6 rows of the larvae dataset. 

In [None]:
larvae <- read.csv("larvae.csv")
head(larvae)


The larvae data contains 32 observations for 4 variables: ID, Insecticide, NumberLarvae and Group.

*What do you think the ID variable represents?*

*Can you identify the data type of each of the variables?*

### Summary of Data

Now we will create a summary of each variable using the `summary()` command. For each numerical variable the output should list the minimum, lower quartile, median, mean, upper quartile and maximum. For each categorical variable that is stored as a factor it will list the number of observations in each group.


In [None]:
summary(larvae)

From the above summary try to answer the following questions: 

*What is the range of 'NumberLarvae'?*

*What is the mean of 'Insecticide'?*

*How many levels does 'Group' have?*

*What is the upper quartile of 'ID'?*
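As a hedged illustration of how to read such a summary, the sketch below uses a small made-up data frame (`toy`, with hypothetical columns `Count` and `Group`; it is not the larvae data). It assumes that the categorical column has to be converted with `factor()` before `summary()` reports group counts, which is the case when the column is stored as plain text.

```r
# Hypothetical toy data, NOT the larvae dataset
toy <- data.frame(
  Count = c(3, 8, 1, 12, 7, 5),
  Group = c("A", "B", "A", "B", "A", "B")
)

# Store the categorical column as a factor so that summary() counts its levels
toy$Group <- factor(toy$Group)

summary(toy)

# The range as a single number: maximum minus minimum
count_range <- diff(range(toy$Count))
count_range

# Number of levels of the categorical variable
group_levels <- nlevels(toy$Group)
group_levels
```

The same calls can be applied to a column of `larvae` to check the answers you read off the summary.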

### Boxplot

The summary statistics present a brief overview of some important characteristics of our variables. Another way to make large datasets more presentable is to use visualisations such as the histogram, which we have already used in the previous labs. A further graphical tool is the boxplot, which you learned about in the lecture. Today, we will learn how to create a boxplot in R. Run the code and discuss with your neighbour how the graph matches your summary statistics results.

In [None]:
boxplot(larvae$Insecticide)

Using the `ggplot()` command from last week, we can also use a more sophisticated way of creating a boxplot.

In [None]:
library(ggplot2)
ggplot(data = larvae, mapping = aes(y = Insecticide)) + geom_boxplot(width = 1)
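To see which numbers a boxplot is built from, the following sketch compares the five values drawn by base R's `boxplot()` with `fivenum()` on a made-up vector (`x` is hypothetical toy data, not a column of the larvae dataset). Note that the box edges are "hinges", which can differ slightly from the quartiles printed by `summary()` for some sample sizes.

```r
# Hypothetical toy data, NOT taken from the larvae dataset
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
box_values <- boxplot.stats(x)$stats
box_values

# Tukey's five-number summary for comparison
five <- fivenum(x)
five

# Quartiles as reported by summary()
summary(x)
```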
  

### Quantiles

Now, let's talk about quantiles a little further. Quantiles are important constructs in statistics, which we will use later this semester when we dive into statistical testing. But first, let's familiarise ourselves with the meaning of a **p-quantile**. If you already feel confident enough, try to write down a formal definition of a p-quantile or explain it to your neighbour, and answer the questions in your assessment. Otherwise, keep on reading. 

The idea of a p-quantile is to split the **ascending list of observations** such that p * 100% of the observations lie to the left of the quantile and (1-p) * 100% of the observations lie to the right of it. The value that splits the dataset in this way is called a "p-quantile". For example, going back to your results above, 25% of the observations of the 'Insecticide' variable are smaller than or equal to 1.525 and 75% of the observations are greater than or equal to 1.525.

The `summary()` command yields the 0.25, the 0.5 and the 0.75 quantiles. Together, those quantiles split the entire list of observations into 4 quarters, which is why they are also called **quartiles**. It is of course also possible to obtain other quantiles, and R provides an easy command for this.

In [None]:
quantile(larvae$NumberLarvae, probs = c(0.25, 0.5, 0.75))

Study this command and try to obtain other quantiles of your choice with it. 
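The definition of a p-quantile can be checked numerically. The sketch below uses a made-up vector (`x` is hypothetical toy data, not the larvae dataset) and R's default `quantile()` method, which interpolates between neighbouring observations, so a quantile does not have to be one of the observed values.

```r
# Hypothetical toy data, NOT taken from the larvae dataset
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# The 0.25-quantile (lower quartile)
q25 <- quantile(x, probs = 0.25)
q25

# Share of observations at or below, and at or above, the quantile
share_below <- mean(x <= q25)
share_above <- mean(x >= q25)
share_below   # at least 0.25
share_above   # at least 0.75

# Other quantiles work the same way, for example the deciles
quantile(x, probs = seq(0.1, 0.9, by = 0.1))
```

With only a handful of observations the shares cannot hit 25% and 75% exactly, which is why the check is "at least" rather than "equal to".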

Let us now go back to the boxplot one last time as R also provides a way to compare boxplots of multiple groups.
To do this, we can set the x variable to be `Group`, as shown below:  

In [None]:
ggplot(data = larvae, mapping = aes(x = Group, y = NumberLarvae)) + geom_boxplot()


Draw conclusions from the visual comparison of those two boxplots.

# Exercise 2: Meadowfoam Data 

We will now take a brief look at a different dataset, called "meadowfoam".

In [None]:
meadow <- read.csv("meadowfoam.csv")
head(meadow)

First, obtain the summary statistics as learned in today's lab. What does this tell you about the 4 variables of the dataset?

The most commonly used measure of dispersion is the "standard deviation", which is not reported by the `summary()` command. Instead, try the `sd()` command, or type `?sd` if you require help with it. 

Another measure of dispersion is the "interquartile range" (IQR), which is the positive distance between the upper and lower quartile.


*What is the Inter Quartile Range (IQR) of 'Flowers'?*

*What is the variance of 'Intensity'?*




In [None]:
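# Summary statistics for every variable in the meadowfoam data
summary(meadow)
# The variance is the square of the standard deviation
(sd(meadow$Intensity))^2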
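The relationships between these measures of dispersion can be sketched on a made-up vector (`x` is hypothetical toy data, not the meadowfoam dataset). Base R also offers `var()` and `IQR()`, which should agree with the squared standard deviation and with the difference between the upper and lower quartile.

```r
# Hypothetical toy data, NOT taken from the meadowfoam dataset
x <- c(1, 3, 5, 7, 9, 11, 13, 15, 17)

# Variance computed two ways: squared standard deviation and var()
variance_from_sd <- sd(x)^2
variance_direct  <- var(x)

# IQR computed two ways: by hand from the quartiles and with IQR()
quartiles   <- quantile(x, probs = c(0.25, 0.75))
iqr_by_hand <- unname(quartiles[2] - quartiles[1])
iqr_direct  <- IQR(x)

variance_from_sd
variance_direct
iqr_by_hand
iqr_direct
```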

### Scatterplot



We will now take a look at two more visualisations. First, the scatterplot, which plots two numeric variables against each other. The easiest way to obtain a basic scatterplot looks like this:

In [None]:
plot(meadow$Intensity,meadow$Flowers, main="Scatterplot")

Again, the `ggplot()` command offers a more sophisticated way of plotting.

In [None]:
ggplot(data= meadow, mapping = aes(x = Intensity, y = Flowers )) + geom_point()

What kind of relationship between the variables 'Intensity' and 'Flowers' can you observe from the graph?

Why are those two the only reasonable variables from the dataset to use in a scatterplot?

We can further use coloured points to distinguish between the two 'Time' groups by setting `col = Time`.

In [None]:
ggplot(data= meadow, mapping = aes(x = Intensity, y = Flowers , col = Time)) + geom_point()

In R we can also create so-called "multiplots". Use the `par()` command as shown below to do so. Use `?par` to read up on this command as well if you wish. Again, please run the code in the next cell first to optimise the visualisation.

In [None]:
#resetting plot size
options(repr.plot.width=5, repr.plot.height=5,repr.plot.res = 120)

In [None]:
par(mfrow=c(2,2))
hist(meadow$Flowers, main="Histogram of Flowers")
boxplot(meadow$Flowers, main="Boxplot of Flowers")
plot(meadow$Intensity,meadow$Flowers, main="Scatterplot Intensity and Flowers")
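How `par(mfrow = ...)` arranges the panels can be sketched with made-up data (`toy` is a hypothetical vector of random numbers, not the meadowfoam data). The first number is the number of rows and the second the number of columns, and the panels are filled row by row. Saving the old settings and restoring them afterwards is one possible habit, so that later plots are drawn one per figure again.

```r
# Hypothetical toy data, used only to illustrate the layout mechanics
set.seed(1)
toy <- rnorm(50)

# 1 row, 2 columns; par() returns the previous settings, which we keep
old_par <- par(mfrow = c(1, 2))

hist(toy, main = "Histogram of toy data")    # fills the left panel
boxplot(toy, main = "Boxplot of toy data")   # fills the right panel

# Restore the previous layout (one plot per figure)
par(old_par)
```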

Histograms, boxplots and scatterplots cannot be used to visualise categorical variables. 
Instead, we can visualise the number of observations for each level of such a categorical variable via a so-called 
"barplot". 

In [None]:
counts <- table(meadow$Time)
barplot(counts, main="Time") 
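What `table()` hands over to `barplot()` can be seen on a made-up categorical vector (`time_toy` and its labels are hypothetical, not the values of the 'Time' variable). `table()` counts how often each level occurs, and `barplot()` draws one bar per count.

```r
# Hypothetical toy data, NOT the 'Time' variable of the meadowfoam dataset
time_toy <- c("Early", "Late", "Late", "Early", "Late")

# Count the observations in each level
counts_toy <- table(time_toy)
counts_toy

# One bar per level, with the bar height equal to the count
barplot(counts_toy, main = "Toy counts")
```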

Your last task for today's lab is to create a multiplot, including the histogram and the boxplot of 'Intensity' 
as well as the scatterplot and the barplot from above.