# Lab4 Outcomes


In this lab you should read through and run the code in this lab sheet, then complete the lab assessment, which can be found in the LMS. Remember that you have only 60 minutes to complete the assessment once you start it.
By the end of this week's lab you should be able to use R to:


* transform data
* standardise and compare data
* find the correlation coefficient of two numeric variables and interpret it


Run the code below first to load the packages needed for this lab:

In [None]:
library(ggplot2)  # loading the ggplot2 package
library(repr)     # loading the repr package

# Change plot size for all following plots 
options(repr.plot.width=4, repr.plot.height=4, repr.plot.res = 120)


# Exercise 1: Data Transformation



Today we will look at some common ways of transforming data, without worrying too much about the reasons yet.
First, load the Larvae dataset from last week's lab. Make sure that the "larvae.csv" file is located in the same directory as this "lab4.ipynb" file.

In the code cell below you will also see some comments on what the different commands do. Remember that comments (starting with `#`) are ignored when the code is executed.



In [None]:
larvae <- read.csv("larvae.csv")     # saving the data in the variable 'larvae'
head(larvae)                         # printing the first six rows of the larvae data
summary(larvae)                      # printing the summary of the larvae data

This dataset contains data from an experiment testing the effect of different intensities of a certain insecticide (a/b) on larvae. For now, we will focus on the 'NumberLarvae' variable, which records the number of dead larvae.

By subtracting a constant from every datapoint we can center the data. If the constant we subtract is the mean, we **mean-center** it. Run the following lines of code to do so.



In [None]:
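number_dead <- larvae[,3]                        # column 3, used here as the 'NumberLarvae' variable
mean_dead <- mean(number_dead)                   # avoid reusing the name of the mean() function
number_dead_centered <- number_dead - mean_dead  # subtract the mean from every datapoint

number_dead                                      # original values
round(number_dead_centered, 2)                   # centered values, rounded to 2 decimal places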

* What is the mean (of the transformed data set) now?
* How would you interpret negative/positive datapoints in this context?

Via multiplication, we can also scale our data. Multiplying by a positive value smaller than 1 shrinks the data, while multiplying by a value larger than 1 inflates it. The three boxplots below visualise this.

In [None]:
b1 <- number_dead
b2 <- 0.5 * number_dead
b3 <- 3 * number_dead

par(mfrow=c(1,3))
boxplot(b1)
boxplot(b2)
boxplot(b3)


(Mean-) centering and scaling are so-called linear transformations. In the lecture you have also learned about log-transformation. For the purpose of your assessment, try to think about possible reasons why a linear or log-transformation could be helpful for analysing data. Also try to think about how a linear transformation will affect the descriptive values from the summary statistics.
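To get a feel for how a linear transformation acts on summary statistics, here is a minimal sketch on a small made-up vector. The names `toy`, `a` and `b` are purely illustrative and are not part of the lab data. It compares measures of location and measures of spread before and after the transformation:

```r
# Made-up values for illustration only (not the larvae data)
toy <- c(2, 4, 4, 7, 13)
a <- 2    # hypothetical scaling factor
b <- 5    # hypothetical shift

toy_transformed <- a * toy + b

# Measures of location follow the whole transformation
all.equal(mean(toy_transformed),   a * mean(toy) + b)
all.equal(median(toy_transformed), a * median(toy) + b)

# Measures of spread only respond to the (absolute) scaling factor
all.equal(sd(toy_transformed),  abs(a) * sd(toy))
all.equal(IQR(toy_transformed), abs(a) * IQR(toy))
```

Each `all.equal()` call should return `TRUE`. Try other values of `a` and `b` to check whether the pattern holds.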

For instance, how would a linear transformation of the 'NumberLarvae' data, multiplying every datapoint by 3 and then subtracting 7, affect the summary statistics? You can use the code cell below for calculations.

In [None]:
x <- number_dead 
y <- 3 * number_dead - 7

# Exercise 2: Standardisation

One of the many applications of transforming data is standardisation. Standardising is equivalent to mean-centering the data and then scaling it by the inverse of its standard deviation. Each resulting value expresses a datapoint's deviation from the mean in multiples of the standard deviation. For instance, if a datapoint is equal to 3.5 after standardisation, it lies 3.5 standard deviations above the mean of the corresponding dataset.

Standardisation helps to make data from different populations comparable.
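The definition above can be sketched in a few lines of R. The vector `toy_marks` is made up for illustration and is not the data used in this lab:

```r
# Made-up marks for illustration only
toy_marks <- c(50, 60, 70, 80, 90)

# Standardise: subtract the mean, then divide by the standard deviation
z <- (toy_marks - mean(toy_marks)) / sd(toy_marks)

round(z, 3)
mean(z)   # approximately 0, up to floating-point error
sd(z)     # approximately 1, up to floating-point error
```

After standardisation the values no longer carry the original unit (marks). They only say how far each datapoint lies from its group mean, measured in standard deviations.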

The following dataset contains fictional information on mathematics and physics students, each of whom achieved a mark (out of 100) by the end of the year.

In [None]:
Marks <- read.csv("Marks.csv")
head(Marks)

Retrieve the summary statistics for both groups (Maths and Physics) **separately** first. You can do so by using the `summary()` command and **completing** the code below. 

If you have trouble separating the groups, take a look at Lecture 8 or talk to the friendly facilitators.




In [None]:
by ... Marks$Mark ... summary

We will now try to compare the best maths student with the best physics student.

*From the absolute scores, which of the two appears to be more dominant in his/her respective field of study?*

If we now take the results of the other students into account and therefore use the standardised values instead, which of the two appears to be the more dominant student compared to his/her peers? Use the empty cells below for any calculations you want to perform. Hint: The `min()`, `max()`, `mean()` and `sd()` functions might prove helpful.
You could also try the `scale()` command to speed things up. If you are not familiar with it, make sure you read up on it first via `?scale`.
Whichever method you prefer, **standardising the data will be necessary to perform this task**.
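As a hedged illustration of how `scale()` behaves, the sketch below standardises two made-up groups separately. The vectors `group_a` and `group_b` are hypothetical and are not the Marks data. Note that `scale()` returns a one-column matrix with attributes, so `as.vector()` is used to get a plain vector back:

```r
# Two made-up groups for illustration only (not the Marks data)
group_a <- c(55, 60, 65, 70, 95)
group_b <- c(80, 85, 90, 92, 98)

# By default scale() centers by the mean and divides by the standard deviation;
# as.vector() drops the matrix structure and attributes of the result
z_a <- as.vector(scale(group_a))
z_b <- as.vector(scale(group_b))

# Same result as the manual formula
all.equal(z_a, (group_a - mean(group_a)) / sd(group_a))

# Standardised value of the top score in each group
max(z_a)
max(z_b)
```

In this toy example the group with the lower top score has the higher standardised top score, because its best value lies further from the rest of its own group.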

# Exercise 3: Correlation

For this exercise we will be using the Larvae dataset again.


If you haven't already done so, read in the Larvae dataset. We are interested in the association of the two numeric variables:

* NumberLarvae: This variable records the number of dead larvae
* Insecticide: This variable records different intensities of insecticide treatments (measured in milliliters per liter) 


In [None]:
head(larvae)

In this question, we are interested in how the intensity of the insecticide and the number of dead larvae relate to one another. Below, we have created a scatter plot with Insecticide on the x-axis and NumberLarvae on the y-axis. 

In [None]:
ggplot(larvae, aes(x = Insecticide, y = NumberLarvae)) + geom_point()

What do you notice about the relationship between the two variables? 

*Do you think it is a linear or non-linear relationship?* 

*Does the association between the two variables have a positive or negative direction?* 

*Do you think it is a strong or a weak relationship?*

*Do you notice the different levels of spread for 'NumberLarvae' depending on the insecticide intensity?*

Discuss with your neighbour.


In the lectures, you will have discussed the correlation coefficient. Some important notes from the lecture were: 

* Correlation is only a measure of **linear association**.
* It does not tell us about non-linear associations.
* It does not indicate a cause and effect relationship.
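The first two notes can be illustrated with a small sketch on made-up data (`x_toy` and `y_toy` are hypothetical). Here `y_toy` is completely determined by `x_toy`, yet the relationship is quadratic rather than linear:

```r
# Made-up data: y depends perfectly on x, but not linearly
x_toy <- -5:5
y_toy <- x_toy^2

cor(x_toy, y_toy)      # essentially 0 despite the perfect quadratic relationship
cor(x_toy, 2 * x_toy)  # 1 for a perfectly linear, increasing relationship
```

A correlation close to 0 therefore does not rule out an association. It is always worth looking at the scatter plot as well.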


To calculate the correlation between 'Insecticide' and 'NumberLarvae' we can use the `cor()` function in R, as seen below. Does the correlation value match your findings from the scatterplot?

In [None]:
cor(x= , y= )

In this week's lectures you have learned about the general idea of linear regression. How does the correlation analysis conducted so far fit into this idea? Which variable would you choose to be the dependent/independent variable, and why? Do you think a linear regression analysis would be an appropriate approach to further investigate the relationship at hand?
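One way to see the link between the two ideas is the sketch below, which uses made-up data (`x_toy` and `y_toy` are hypothetical, not the larvae data). In simple linear regression, the fitted slope equals the correlation coefficient multiplied by the ratio of the standard deviations:

```r
# Made-up data for illustration only
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

fit <- lm(y_toy ~ x_toy)        # y_toy as dependent, x_toy as independent variable
slope <- unname(coef(fit)[2])   # estimated slope of the fitted line

# For simple linear regression the slope equals r * sd(y) / sd(x)
r <- cor(x_toy, y_toy)
all.equal(slope, r * sd(y_toy) / sd(x_toy))

# cor() is symmetric in its arguments, whereas lm() is not
all.equal(cor(x_toy, y_toy), cor(y_toy, x_toy))
```

Unlike correlation, regression requires you to decide which variable is the dependent one, so that choice deserves some thought.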

# Exercise 4: Outliers and Data Manipulation

In your LMS you will find a .csv file named 'galaxy'. First, create a variable called 'galaxy', which contains the dataset. 


In [None]:
galaxy <- 


In this dataset we find two numeric variables. The variable 'Velocity' corresponds to the recessional velocity (measured in km per second) of a galaxy moving away from Earth. The variable 'Distance' corresponds to the distance of that galaxy from Earth.

Use boxplots to visually scan both variables for outliers.

In [None]:
#resetting plot size
options(repr.plot.width=5, repr.plot.height=4,repr.plot.res = 120)


par(mfrow=c(1,2))
boxplot(galaxy$Velocity)
boxplot(galaxy$Distance)

Derive the summary statistics via `summary()` and decide which of the descriptive values may be flawed because of the outliers that you have detected. 
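If you want to list the flagged values instead of reading them off the plot, `boxplot.stats()` returns them in its `out` component. The sketch below uses a made-up vector `toy` (not the galaxy data) and compares the summary statistics with and without the flagged value:

```r
# Made-up values with one extreme observation (illustration only)
toy <- c(10, 12, 13, 14, 15, 16, 18, 95)

# Values that boxplot() would draw as individual points (1.5 * IQR rule by default)
flagged <- boxplot.stats(toy)$out
flagged

# Compare summary statistics with and without the flagged values
summary(toy)
summary(toy[!toy %in% flagged])
```

Comparing the two summaries line by line shows which statistics move the most when the extreme value is dropped.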

There appears to be one outlier in the 'Velocity' variable and two outliers in the 'Distance' variable.
For some analytical methods such flawed data is a problem, so this may have to be addressed. However, there is no single way to "treat" outliers. Treatment can be performed in different ways depending on the nature of the outlier in question.

The most drastic approach is of course to delete the observation altogether, if we deem its information useless because its extreme behaviour cannot be explained. However, outliers sometimes result from obvious typing errors and can then easily be corrected manually. You should also bear in mind that outliers do not always result from errors. They may instead be valid, yet extreme, observations that we have no reason to treat or remove.
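The options described above can be sketched on a made-up data frame. All names and values here are hypothetical, including the assumption that 41.0 is a typing error for 4.1. Which option is appropriate always depends on what you know about the observation:

```r
# Made-up data frame for illustration only (not the galaxy data)
toy <- data.frame(id = 1:5, value = c(4.1, 3.9, 41.0, 4.3, 4.0))

# Option 1: correct an obvious typing error
# (assumption for this example: 41.0 should have been 4.1)
toy_corrected <- toy
toy_corrected$value[toy_corrected$id == 3] <- 4.1

# Option 2: remove the affected observation (row) entirely
toy_removed <- toy[toy$id != 3, ]

# Option 3: keep the value unchanged if it is a valid, if extreme, observation
toy_kept <- toy
```

Working on copies (`toy_corrected`, `toy_removed`) keeps the original data intact, so each decision can be revisited later.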

*For the outliers observed above, think about their nature and how you might treat them best.*

The new dataset loaded below contains the original data with these flaws corrected.

In [None]:
galaxy_new <- read.csv("galaxy_new.csv")

In [None]:
par(mfrow=c(1,2))
boxplot(galaxy_new$Velocity)
boxplot(galaxy_new$Distance)

Calculate the correlation coefficient of 'Velocity' and 'Distance'. Use the empty code cell below to do so.

How would you interpret these findings? 

Present a scatterplot to visualize the association. 

You are now all set to sit this week's assessment. Remember to round all your answers to 3 decimal places. Good luck!