# Outcomes Lab5


In this lab you should read through and run the code in the lab sheet and complete the lab assessment. By the end of this lab you should be able to use R to:


* Run a linear regression in R and interpret the findings.
* Create a residual plot.
* Predict the outcome of a variable depending on levels of another variable.
* Log-transform data.


Before running the code in the following questions, we will load the necessary packages below: 

In [None]:
library(ggplot2)
library(repr) 
options(repr.plot.width=4, repr.plot.height=4, repr.plot.res = 120)


# Exercise 1: Linear Regression


At the end of last week's lab you were asked to create a scatterplot to visualise the relationship between the two variables 'Velocity' and 'Distance' from the galaxy dataframe. Here is what it looks like.

In [None]:
galaxy_new <- read.csv("galaxy_new.csv")
ggplot(galaxy_new, aes(x = Velocity, y = Distance)) + geom_point() 

As a reminder, the two variables describe the recessional velocity of a galaxy moving away from Earth (measured in km per second) and the distance of that galaxy from Earth (measured in million light-years).

Derive the summary statistics for both variables via the `summary()` command. Use the empty code cell below.




In last week's lectures you learned about linear regression. You can now use the `geom_smooth()` command to add the regression line to your scatterplot, as shown below. This regression line is the graph of the linear function $y = \beta_0 + \beta_1 x$. Remember from last week that $y$ is therefore a linear transformation of $x$.

In [None]:
ggplot(galaxy_new, aes(x = Velocity, y = Distance)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)


In the equation above, the dependent variable $y$ depends on the independent variable $x$, while $\beta_0$ (the intercept) and $\beta_1$ (the slope) are constants called "regression coefficients". The corresponding regression model is $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where $\epsilon_i$ denotes the vertical deviation of the *i-th* observation from the regression line. These deviations are called "residuals". We will now fit a linear regression model to our data; that is, we will use R to find the coefficients that minimise the sum of squared residuals. To do so, run the lines below.

In [None]:
# fits the linear regression and stores the fitted model (coefficients, residuals, ...)
lmRes <- lm(Distance ~ Velocity, data = galaxy_new)
# summarises the results
summary(lmRes)
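
For simple linear regression, the least-squares coefficients have a closed form: the slope is the covariance of $x$ and $y$ divided by the variance of $x$, and the intercept follows from the two means. The sketch below checks this against `lm()` on a small made-up data set (the vectors `x` and `y` and the names `b0_hand` and `b1_hand` are assumptions for illustration, not the galaxy data).

```r
# Toy data, made up for illustration (not the galaxy data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Closed-form least-squares estimates for simple linear regression
b1_hand <- cov(x, y) / var(x)            # slope
b0_hand <- mean(y) - b1_hand * mean(x)   # intercept

# The same estimates obtained from lm()
fit <- lm(y ~ x)
coef(fit)

# Both approaches agree up to floating-point error
all.equal(unname(coef(fit)), c(b0_hand, b1_hand))
```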


How to interpret the summary of the results:
The first two lines show the command you ran. This is followed by a five-number summary of the residuals. Next comes the coefficients table, which contains the estimates of $\beta_0$ and $\beta_1$. The residual standard error is the estimated standard deviation of the residuals, measured in units of the dependent variable; it indicates the typical (vertical) distance between the observed values and the regression line. You will also find the coefficient of determination $R^2$, the proportion of the variance in the dependent variable that the model explains. You can ignore all other output for now.
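
The quantities described above can also be pulled out of the fitted model directly instead of being read off the printed summary. A minimal sketch, using R's built-in `cars` data set as a stand-in (the data set and the names `fit` and `s` are assumptions for illustration, not part of this lab's data):

```r
# Stand-in example on the built-in 'cars' data (not the galaxy data)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

coef(fit)        # estimated intercept and slope
s$coefficients   # full coefficients table, as printed by summary()
s$r.squared      # coefficient of determination R^2
s$sigma          # residual standard error

# The residual standard error is the square root of the residual
# sum of squares divided by the residual degrees of freedom
sigma_hand <- sqrt(sum(residuals(fit)^2) / df.residual(fit))
all.equal(s$sigma, sigma_hand)
```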


From the above results, answer the following questions: 

* What is the estimated regression line? **Write it out.** 
* From the output, what are the estimates of $\beta_0$ and $\beta_1$, and what is $R^2$?
* For your data, does it appear that 'Distance' depends on 'Velocity'? How would you interpret the regression coefficients?
* Report the standard deviation of the residuals. Note that this is labelled *Residual standard error* in the regression output.


# Exercise 2: Residual Analysis





#### Plotting the Residuals 

As explained above, residuals are vertical deviations from the regression line. Very large residuals (positive or negative) therefore point to possible outliers in the dataframe. Outliers are troublesome for a regression analysis because they pull the regression line towards themselves. Let us take a look at the residuals for the galaxy dataframe. To plot the residuals, you can use the command `plot(lmRes, 1)`, where `lmRes` is the variable we created above.


In [None]:
residuals(lmRes)     # shows all residuals
plot(lmRes, 1)       # plots all residuals against their fitted values (values on the regression line)


In the residual plot above you will find that the most extreme residuals have their row number printed next to them. Bear this in mind, as we will remove them from the dataframe later.

**What do you notice about these residuals?**

* Is there any residual pattern?
* Are there any outliers? 
* What do you think the mean of the residuals might be (from the residual plot, no calculation needed)?


#### Removing the Outliers

Find the three largest residuals (based on their absolute value) and remove the corresponding rows from the dataframe using the code line below.


In [None]:
galaxy_Modified <- galaxy_new[-c(5, ...),] # has to be completed
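
The rows with the largest absolute residuals can also be located programmatically rather than read off the plot. A minimal sketch on made-up data (the `toy` dataframe, its two planted outliers, and the names `worst` and `toy_clean` are assumptions for illustration, not the galaxy data):

```r
# Toy data following y = 2x, with two planted outliers (rows 4 and 9)
toy <- data.frame(x = 1:10,
                  y = c(2, 4, 6, 20, 10, 12, 14, 16, 2, 20))
fit <- lm(y ~ x, data = toy)

# Row positions ordered from largest to smallest absolute residual;
# keep the first two
worst <- order(abs(residuals(fit)), decreasing = TRUE)[1:2]
worst

# Negative indexing drops those rows from the dataframe
toy_clean <- toy[-worst, ]
nrow(toy_clean)
```

Negative indexing with `-c(...)`, as in the lab's code line, works the same way: it keeps every row except the listed ones.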



To see what effect these large residuals have on the results, we can re-run the regression without them. Make a note of any differences in the results (i.e. the values of $\beta_0$ and $\beta_1$).

In [None]:
# fits the linear regression without the outliers and stores the fitted model
lmRes_corrected <- lm(Distance ~ Velocity, data = galaxy_Modified)
# summarises the results
summary(lmRes_corrected)

Now run a residual analysis for the corrected model as well, using a residual plot as above.

In [None]:
plot(lmRes_corrected,1)

* What do you notice about the outlier corrected residuals? 
* How do these plots compare to the plots of the original residuals?
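
For reference, the residuals-versus-fitted plot used in this exercise can be rebuilt by hand from a fitted model, which makes explicit what is on each axis. A minimal sketch on R's built-in `cars` data (a stand-in chosen for illustration, not the galaxy data; unlike `plot(fit, 1)`, this version adds no smoother and no row labels):

```r
# Stand-in example on the built-in 'cars' data (not the galaxy data)
fit <- lm(dist ~ speed, data = cars)

fitted_vals <- fitted(fit)      # values on the regression line
resid_vals  <- residuals(fit)   # observed value minus fitted value

# Residuals against fitted values, with a reference line at zero
plot(fitted_vals, resid_vals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Each observation is its fitted value plus its residual
all.equal(unname(fitted_vals + resid_vals), cars$dist)
```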


# Exercise 3: Prediction

Regression models are often used to predict the dependent variable at different levels of the independent variable. You will now try to predict the number of dead larvae based on the concentration of an insecticide. But first, we will read in the 'larvae' dataframe, which we have already used several times.

In [None]:
larvae <- read.csv("larvae.csv")
head(larvae)

This is the relationship we visualised in last week's lab. Try to add a linear regression line to this visualisation.

In [None]:
ggplot(larvae, aes(x = Insecticide, y = NumberLarvae)) + geom_point() 
ggplot(larvae, aes(x = Insecticide, y = NumberLarvae)) + geom_point() +  ...   # to be completed

Use the empty code cell to fit a linear regression model with 'Insecticide' as the independent variable and 'NumberLarvae' as the dependent variable.


In [None]:
lm2 <- lm()           # needs to be completed

Interpret your findings regarding the regression coefficients $\beta_0$ and $\beta_1$ as well as $R^2$ and the residual standard deviation.

Now, repeat the residual analysis from above for this regression. 

In [None]:
plot()          # needs to be completed

You will now use the estimated regression line to predict how many larvae will die if an insecticide with a concentration of 5 units is used.

Enter the regression coefficients $\beta_0$ and $\beta_1$ from your regression output above in the code below and then run the chunk of code.

In [None]:
# Enter intercept estimate: 
b0 <- 
# Enter b1 estimate: 
b1 <- 

concentration <-   5                                  
PredictedDeadLarvae <- b0 + b1*concentration
PredictedDeadLarvae
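
The same kind of prediction can be obtained with R's `predict()` function instead of typing the coefficients in by hand. The sketch below uses made-up data (the `toy` dataframe and its columns `dose` and `count` are assumptions for illustration, not the larvae data). Note that `predict()` with `newdata` relies on the model being fitted with a `data =` argument and plain column names in the formula.

```r
# Toy data, made up for illustration (not the larvae data)
toy <- data.frame(dose  = c(1, 2, 3, 4, 6),
                  count = c(3, 5, 8, 9, 14))
fit <- lm(count ~ dose, data = toy)

# Prediction by hand from the estimated coefficients
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["dose"]]
by_hand <- b0 + b1 * 5

# Prediction via predict(): newdata must use the same column name as the model
by_predict <- predict(fit, newdata = data.frame(dose = 5))

# Both give the same value
all.equal(by_hand, unname(by_predict))
```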

# Exercise 4: Log Transformation in Linear Regression

**What did you notice about the residuals?** Take another look at the residual plot.

* Is there any residual pattern?
* Are there any outliers? 
* Does the variance of the residuals depend on the fitted values or is it rather constant?

Last week you learned about log-transformation. One of its applications is to reduce non-constant variance in data.
Your last task for today is to log-transform the 'NumberLarvae' variable and investigate what changes when you run the linear regression on the transformed values instead of the original ones. Run the code below to execute the transformation.

In [None]:
larvae$LogNumberLarvae <- log(larvae$NumberLarvae)
head(larvae)
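
To illustrate why a log transformation can help, here is a sketch on simulated data in which the noise is multiplicative, so the spread of $y$ grows with its level. Everything in it (the simulated `x` and `y`, the parameter values, and the helper `spread_ratio()`) is an assumption made for illustration, not the larvae data. The ratio compares the residual spread in the upper half of `x` with that in the lower half; a value near 1 suggests roughly constant variance.

```r
set.seed(1)                      # make the simulation reproducible
x <- rep(1:10, each = 20)        # 200 simulated observations
y <- exp(0.5 + 0.3 * x + rnorm(length(x), sd = 0.3))  # multiplicative noise

fit_raw <- lm(y ~ x)             # regression on the original scale
fit_log <- lm(log(y) ~ x)        # regression on the log scale

# Hypothetical helper: residual spread for x > 5 relative to x <= 5
spread_ratio <- function(fit, x) {
  sd(residuals(fit)[x > 5]) / sd(residuals(fit)[x <= 5])
}

ratio_raw <- spread_ratio(fit_raw, x)
ratio_log <- spread_ratio(fit_log, x)
c(raw = ratio_raw, log = ratio_log)
```

Predictions from a model fitted to log-transformed values are on the log scale; applying `exp()` brings them back to the original units.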

Now, perform a linear regression analysis and a residual analysis using the code cells below.

In [None]:
lm3 <- lm(...)      # linear model, needs to be completed

summary(lm3)

In [None]:
plot(...)           # residual analysis, needs to be completed

**What do you notice about the new residuals?**

* Is there any residual pattern now?
* Are there any outliers? 
* Is the variance across the fitted values similar or rather different?
* What has changed from the previous residual plot? 

Compare your findings for both models with regard to the residuals.

* In which model does the variance of the residuals appear to be more constant (from the residual plot)?

Well done for working through this lab sheet, and good luck with your assignment! Please remember to round your solutions to **3 decimal places** in your assignment (you will be informed when it is available).