<a href="https://colab.research.google.com/github/bca2/593/blob/master/general_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# General linear/non-linear analysis

This document provides the statistical tools needed to perform a linear or non-linear analysis of data.
The code should work for most (all?) high-school experiments where there's a single, continuous independent variable (x) and a single, continuous dependent variable (y).
Custom analyses for specific experiments can also be done, and we'll be working on this for all lab projects supplied by the UIUC Crop Sciences department.


The data must be uploaded with a specific file name, and the data columns must be named 'x' and 'y' (or there needs to be an option to specify these names).

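Excel/Sheets can also perform the linear and curvilinear analyses properly.
The advantage you get in using this document is access to correct non-linear analyses. Excel fits its exponential trendline by running a linear regression on log-transformed $y$, which is not the same as a least-squares fit on the original scale.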

Additionally, advanced students will have access to the code and should be able to customize their graphs and analyses if they like.
The code is hidden by default, but can be revealed and edited.

Supported analyses (currently) are:

1. Simple linear regression
2. Curvilinear (polynomial) regression
3. Exponential models
4. Logistic models

Additional models can be added as needed based on instructor requests.

Comments are welcome and encouraged.

Comments within the code (by me) to explain the code will be added if the instructors believe that this is a useful document.

In [0]:
#@title Enable R (you must do this)

# The following code lets us speak in the R language while working from within Python

%load_ext rpy2.ipython

# Using R inside python in this way tends to produce warnings
# The warnings don't seem to have an effect on the desired output, so I'm just going to suppress the warnings.
# You can go ahead and run the code with and without warnings by excluding/including the following code:

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore", category=RRuntimeWarning)

# To comment out code, just use the pound sign (hashtag) "#"
# Code that's been commented out will not be run, just like this comment I'm writing right now.
# Comments help us in two major ways:
#   1. You can easily leave instructions or clarifications in your code for other people or for future you.
#   2. You can see how your code runs without certain lines of code without actually deleting your code.
#       My advice: Try not to delete your code. You may change your mind later, and it's good to have it readily available.


# Simple linear regression

1. Upload a ".csv" file named "**linear.csv**".
Make sure that the data are stored in two columns: column "A" should be the independent variable (IV) named "x".
Column "B" should be the dependent variable (DV) "y".

2. Check a scatterplot to confirm that a linear regression is sensible.

3. If a linear regression makes sense, follow through with the rest of the analysis.

4. Plot the residuals (if you like), and plot the fitted model against the observed values.

5. If the model fit looks reasonable, get a summary of the model to get the parameter estimates.

6. Interpret the parameter estimates. What does the intercept mean? Is the intercept important in the context of the experiment? (Usually it isn't, but it can be.)
What's your interpretation of the slope?

The simple linear model we'll be fitting is:

$$y = \beta_0 + \beta_1 \cdot x + \epsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $\epsilon$ is the error associated with measuring $y$, which is assumed to be normally distributed with mean zero and constant variance: $\epsilon \sim N(0,\sigma^2)$.

The fitted (or predicted) model is:

$$\hat{y} = \hat{\beta_0} +\hat{\beta_1}\cdot x$$

where $\hat{y}$ is our best prediction of $y$ based on the parameter estimates $\hat{\beta_0}$ and $\hat{\beta_1}$.
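
To make the fitted model concrete, here is a minimal Python sketch of how the two estimates can be computed by ordinary least squares. The function name `fit_line` and the data are made up for illustration; the notebook itself uses R's `lm` for this step.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Slope = covariation of x and y divided by the variation in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Made-up data lying exactly on y = 2 + 3x (no measurement error).
x = [0, 1, 2, 3, 4]
y = [2, 5, 8, 11, 14]

b0, b1 = fit_line(x, y)
y_hat = [b0 + b1 * xi for xi in x]  # fitted (predicted) values
print(b0, b1)
```

With real data the points will not fall exactly on a line, and the differences between `y` and `y_hat` are the residuals checked later in this section.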

In [0]:
#@title Import the data as a file named "linear.csv" in the correct format (see above)
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
#@title Make the scatterplot

%%R
library(ggplot2)

linear = read.csv("linear.csv")

plot = ggplot(linear, aes(x = x, y = y))
plot = plot + geom_point()
plot

In [0]:
#@title Fit a linear model to the data (no output is produced here)
%%R

fit_linear = lm(y ~ x, data = linear)

In [0]:
#@title Check to make sure the residuals look random and have equal variance (shotgun blast, no pattern)
%%R

library(ggplot2)
library(dplyr)
library(tidyr)

linear_res = linear%>%
    mutate(res = residuals(fit_linear))%>%
    mutate(std_res = res/summary(fit_linear)$sigma )
    
res_plot = ggplot(linear_res, aes(x = predict(fit_linear), y = std_res))
res_plot = res_plot + geom_point()
res_plot

In [0]:
#@title Check the residuals for normality (the points should generally fall on the line)
%%R

qqnorm(residuals(fit_linear))
qqline(residuals(fit_linear))

In [0]:
#@title Check our model fit against the observed values (model fit is the green line)
%%R

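plot(linear$x,linear$y)
# Draw the fitted values against x (sorted by x), not against the row index
ord = order(linear$x)
lines(linear$x[ord], predict(fit_linear)[ord], col="green")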

In [0]:
#@title Getting our parameter estimates
%%R

coef(summary(fit_linear))[,1]



## Interpretation

Our estimate of the intercept ($\hat{\beta_0}$) and the slope ($\hat{\beta_1}$, under $x$) are listed in the output above.
Can you figure out what these numbers actually mean?

# Curvilinear model (polynomial model)

1. Upload a ".csv" file named "**poly.csv**".
Make sure that the data are stored in two columns: column "A" should be the independent variable (IV) named "x".
Column "B" should be the dependent variable (DV) "y".

2. Check a scatterplot to confirm that a curvilinear regression is sensible.

3. If a curvilinear regression makes sense, follow through with the rest of the analysis.
**Important: for a polynomial model you must select the degree of the polynomial equation.**
A simple linear model has $degree = 1$, a quadratic model has $degree = 2$, and a cubic model has $degree = 3$.

4. Plot the residuals (if you like), and plot the fitted model against the observed values.

5. If the model fit looks reasonable, get a summary of the model to get the parameter estimates.

6. Interpret the parameter estimates. What does the intercept mean? Is the intercept important in the context of the experiment? (Usually it isn't, but it can be.)
What's your interpretation of the other parameters?

The general linear model we'll be fitting is:

$$y = \beta_0 + \beta_1 \cdot x + \beta_2 \cdot x^2...+\beta_n \cdot x^n+ \epsilon$$

where $\beta_0$ is the intercept, the other $\beta_i$ are the coefficients associated with the $x^i$, and $\epsilon$ is the error associated with measuring $y$, which is assumed to be normally distributed with mean zero and constant variance: $\epsilon \sim N(0,\sigma^2)$.

The fitted (or predicted) model is:

$$\hat{y} = \hat{\beta_0} +\hat{\beta_1}\cdot x + \hat{\beta_2} \cdot x^2...+\hat{\beta}_n \cdot x^n$$

where $\hat{y}$ is our best prediction of $y$ based on the parameter estimates $\hat{\beta_0}$, $\hat{\beta_1}$, ..., $\hat{\beta}_n$.
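
A polynomial model is still fitted by ordinary least squares; the only change is that the powers of $x$ are treated as extra predictors. The Python sketch below shows one way to do this with raw powers of $x$, which is the same parameterization that `raw` polynomials use in the R cells of this section. The function name `fit_poly` and the data are made up for illustration, and solving the normal equations directly is a simplification that is fine for small degrees.

```python
def fit_poly(x, y, degree):
    """Least-squares polynomial coefficients, ordered beta_0, beta_1, ..., beta_n."""
    p = degree + 1
    # Normal equations (X^T X) beta = X^T y, where column j of X is x**j.
    a = [[sum(xi ** (i + j) for xi in x) for j in range(p)] for i in range(p)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, p):
            factor = a[row][col] / a[col][col]
            for k in range(col, p):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(a[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = (b[i] - tail) / a[i][i]
    return beta

# Made-up data lying exactly on y = 1 + 2x + 0.5x^2.
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * xi + 0.5 * xi ** 2 for xi in x]

beta = fit_poly(x, y, degree=2)
print(beta)
```

Setting `degree=1` reduces this to the simple linear model from the previous section.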

In [0]:
#@title Import the data as a file named "poly.csv" in the correct format (see above)
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
#@title Make the scatterplot

%%R
library(ggplot2)

poly = read.csv("poly.csv")

plot = ggplot(poly, aes(x = x, y = y))
plot = plot + geom_point()
plot

In [0]:
#@title Choose the degree of the model (choose the value and then run this cell)
Degree = 2 #@param {type:"integer"}


In [0]:
#@title Fit a curvilinear model to the data (no output is produced here)
%%R -i Degree

fit_poly = lm(y ~ poly(x, Degree, raw = TRUE), data = poly)

In [0]:
#@title Check to make sure the residuals look random and have equal variance (shotgun blast, no pattern)
%%R

library(ggplot2)
library(dplyr)
library(tidyr)

poly_res = poly%>%
    mutate(res = residuals(fit_poly))%>%
    mutate(std_res = res/summary(fit_poly)$sigma )
    
res_plot = ggplot(poly_res, aes(x = predict(fit_poly), y = std_res))
res_plot = res_plot + geom_point()
res_plot

In [0]:
#@title Check the residuals for normality (the points should generally fall on the line)
%%R

qqnorm(residuals(fit_poly))
qqline(residuals(fit_poly))

In [0]:
#@title Check our model fit against the observed values (model fit is the green line)
%%R

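plot(poly$x,poly$y)
# Draw the fitted values against x (sorted by x), not against the row index
ord = order(poly$x)
lines(poly$x[ord], predict(fit_poly)[ord], col="green")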

In [0]:
#@title Getting our parameter estimates
%%R

coef(summary(fit_poly))[,1]



## Interpretation

Our estimates of the intercept ($\hat{\beta_0}$) and the other coefficients ($\hat{\beta_i}$) are listed in the output above.
Can you figure out what these numbers actually mean?

**Note: The output may look confusing. The parameter estimates are always given in ascending order though.**
The intercept is $\hat{\beta_0}$, and each $\hat{\beta_i}$ is labelled `poly(...)i`.
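
Because the estimates come out in ascending order, they can be plugged straight into the fitted equation. Here is a small Python sketch of that step; the function name `predict_poly` and the coefficient values are made up for illustration and are not output from this notebook.

```python
def predict_poly(beta, x):
    """Evaluate beta[0] + beta[1]*x + beta[2]*x**2 + ... (ascending order)."""
    y_hat = 0.0
    # Horner's rule: start from the highest power and work down.
    for coef in reversed(beta):
        y_hat = y_hat * x + coef
    return y_hat

# Hypothetical estimates in ascending order: intercept, x term, x^2 term.
beta = [1.0, 2.0, 0.5]
y_at_4 = predict_poly(beta, 4)  # 1 + 2*4 + 0.5*4**2
print(y_at_4)
```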

# Non-linear models

Data are often not linear.
We'll need to use slightly more complicated methods to properly fit these models.

Note to instructors: I can add more models as needed, this is just a start.

## Exponential model

1. Upload a ".csv" file named "**exp.csv**".
Make sure that the data are stored in two columns: column "A" should be the independent variable (IV) named "x".
Column "B" should be the dependent variable (DV) "y".

2. Check a scatterplot to confirm that an exponential model is sensible.

3. If an exponential model makes sense, follow through with the rest of the analysis.

**Important: You'll have to select starting values for $A$ and $r_{max}$. This is a major difference from a linear model.**

**Part I: Selecting starting values for this model**

4. We'll log-transform our $y$ values and run a linear model.

5. Obtain the parameter estimates from the linear model (don't forget to exponentiate $\beta_0$) and plug them into the starting value estimates for the non-linear model.

**Part II: Fitting the non-linear model**

6. Fit the non-linear model. 

7. Plot the residuals (if you like), and plot the fitted model against the observed values.

8. If the model fit looks reasonable, get a summary of the model to get the parameter estimates.

9. Interpret the parameter estimates. What do they mean?



**Non-linear model**

The general exponential equation we'll be fitting is:

$$y = A\cdot e^{x \cdot r_{max}} +\epsilon$$

where $A$ is the intercept, $r_{max}$ is the growth rate, and $\epsilon$ is the error associated with measuring $y$, which is assumed to be normally distributed with mean zero and constant variance: $\epsilon \sim N(0,\sigma^2)$.

We know $y$ and $x$.
Just like a simple linear model, we'll be estimating only 2 parameters: $A$ and $r_{max}$.

The fitted (or predicted) model is:

$$\hat{y} = \hat{A} \cdot e^{x \cdot \hat{r}_{max}}$$

where $\hat{y}$ is our best prediction of $y$ based on the parameter estimates $\hat{A}$ and $\hat{r}_{max}$.
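
Unlike a linear model, this model has no closed-form solution for $\hat{A}$ and $\hat{r}_{max}$, so the estimates are found by repeatedly improving a starting guess. R's `nls` uses a Gauss-Newton algorithm by default; the Python sketch below is a bare-bones version of that idea for this one model, not a reimplementation of `nls`. The function name, the data, and the starting guesses are all made up for illustration.

```python
import math

def gauss_newton_exp(x, y, A, r, n_iter=50, tol=1e-10):
    """Least-squares fit of y = A * exp(r * x), starting from the guess (A, r)."""
    for _ in range(n_iter):
        # Partial derivatives of the model with respect to A and r.
        dA = [math.exp(r * xi) for xi in x]
        dr = [A * xi * math.exp(r * xi) for xi in x]
        resid = [yi - A * math.exp(r * xi) for xi, yi in zip(x, y)]
        # Normal equations (J^T J) step = J^T resid, written out for 2 parameters.
        s11 = sum(a * a for a in dA)
        s12 = sum(a * b for a, b in zip(dA, dr))
        s22 = sum(b * b for b in dr)
        g1 = sum(a * e for a, e in zip(dA, resid))
        g2 = sum(b * e for b, e in zip(dr, resid))
        det = s11 * s22 - s12 * s12
        step_A = (s22 * g1 - s12 * g2) / det
        step_r = (s11 * g2 - s12 * g1) / det
        A, r = A + step_A, r + step_r
        # Stop once the updates become negligible.
        if abs(step_A) + abs(step_r) < tol:
            break
    return A, r

# Made-up, noise-free data from y = 120 * exp(-0.035 * x).
x = list(range(0, 11))
y = [120 * math.exp(-0.035 * xi) for xi in x]

# Starting guesses picked "by eye" for this illustration.
A_hat, r_hat = gauss_newton_exp(x, y, A=100.0, r=-0.05)
print(A_hat, r_hat)
```

A poor starting guess can send this kind of iteration off in the wrong direction, which is why choosing sensible starting values matters for non-linear models.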

**Linearized model to get starting values**

The log-transformed model (approximately) looks like this:

$$\ln{y}=\ln{A} + r_{max} \cdot x$$

Think about why this model is only an approximation.

Notice that if we fit this linear model, our intercept $\beta_0 = \ln{A}$ and our slope $\beta_1 = r_{max}$.
For this model these are probably pretty good starting values.

$r_{max}$ seems pretty easy to find. To find $A$:

\begin{align}
\begin{aligned}
\beta_0 &= \ln{A}\\
    e^{\beta_0}&= A
\end{aligned}
\end{align}
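
The back-transformation above can be sketched in a few lines of Python. The function name `start_values_exp` and the data are made up for illustration; the notebook does the equivalent step with R's `lm` on `log(y)`.

```python
import math

def start_values_exp(x, y):
    """Starting values (A, r_max) from a straight-line fit to ln(y) versus x."""
    log_y = [math.log(yi) for yi in y]  # requires every y > 0
    n = len(x)
    x_bar = sum(x) / n
    ly_bar = sum(log_y) / n
    slope = (sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = ly_bar - slope * x_bar
    # Intercept is ln(A), so exponentiate it; the slope is r_max as is.
    return math.exp(intercept), slope

# Made-up, noise-free data from y = 5 * exp(0.3 * x).
x = [0, 1, 2, 3, 4, 5]
y = [5 * math.exp(0.3 * xi) for xi in x]

A_start, r_start = start_values_exp(x, y)
print(A_start, r_start)
```

With noise-free data the starting values match the true parameters exactly; with real data they will only be close, which is all a starting value needs to be.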



In [0]:
#@title Import the data as a file named "exp.csv" in the correct format (see above)
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
#@title Make the scatterplot

%%R
library(ggplot2)

exp = read.csv("exp.csv")

plot = ggplot(exp, aes(x = x, y = y))
plot = plot + geom_point()
plot

In [0]:
#@title Fit the log-transformed model to get starting values.

%%R

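library(dplyr)  # needed for %>% and mutate() below

# Straight-line fit to log(y); the back-transformed curve is stored in the column "excel"
exp_lin_fit = lm(log(y)~x, data=exp)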
exp = exp%>%
    mutate(excel = exp(coef(summary(exp_lin_fit))[1,1])*exp(x*coef(summary(exp_lin_fit))[2,1]))
coef(summary(exp_lin_fit))[,1]

Interpreting the output:

\begin{align}
\begin{aligned}
\text{(Intercept)} = \beta_0 &= \ln{A}\\
    e^{\beta_0}&= A\\
    \text{and}\dots\\
    \text{x} = \beta_1 &= r_{max}
\end{aligned}
\end{align}



In [0]:
#@title Choose starting values A and k (k is the growth rate r_max) for the model (choose the values and then run this cell)
A_start = 120.8 #@param {type:"number"}
k_start = -0.03507106 #@param {type:"number"}

list=[A_start,k_start]

In [0]:
#@title Fit an exponential model to the data (no output is produced here)
%%R -i A_start,k_start

# A_start and k_start are passed in directly from the Python cell above

fit_exp <- nls(data=exp,
            formula = y ~ A*exp(x*k),
            start = list(A=A_start,k=k_start))

In [0]:
#@title Check to make sure the residuals look random and have equal variance (shotgun blast, no pattern)
%%R

library(ggplot2)
library(dplyr)
library(tidyr)

exp_res = exp%>%
    mutate(res = residuals(fit_exp))%>%
    mutate(std_res = res/summary(fit_exp)$sigma )
    
res_plot = ggplot(exp_res, aes(x = predict(fit_exp), y = std_res))
res_plot = res_plot + geom_point()
res_plot

In [0]:
#@title Check the residuals for normality (the points should generally fall on the line)
%%R

qqnorm(residuals(fit_exp))
qqline(residuals(fit_exp))

In [0]:
#@title Check our model fit against the observed values (model fit is the green line)
%%R

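plot(exp$x,exp$y)
# Draw the fitted values against x (sorted by x), not against the row index
ord = order(exp$x)
lines(exp$x[ord], predict(fit_exp)[ord], col="green")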

In [0]:
#@title Getting our parameter estimates
%%R

coef(summary(fit_exp))[,1]



In [0]:
#@title Excel vs R (R is Green, Excel is Red)
%%R

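plot(exp$x,exp$y)
# Draw both curves against x (sorted by x), not against the row index
ord = order(exp$x)
lines(exp$x[ord], predict(fit_exp)[ord], col="green")
lines(exp$x[ord], exp$excel[ord], col="red")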

## Logistic model

1. Upload a ".csv" file named "**logistic.csv**".
Make sure that the data are stored in two columns: column "A" should be the independent variable (IV) named "x".
Column "B" should be the dependent variable (DV) "y".

2. Check a scatterplot to confirm that a logistic model is sensible.

3. If a logistic model makes sense, follow through with the rest of the analysis.

**Important: You'll have to select starting values for $K$, $P_0$ and $r_{max}$. Your instructor should be able to help you with this.**

**Part I: Selecting starting values for this model**

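4. Fit a linear model to $\ln{y}$ using only the early observations (those with $x$ below a cutoff that you choose), where growth is still roughly exponential.

5. Use the parameter estimates from that linear model as starting values: the exponentiated intercept for $P_0$ and the slope for $r_{max}$.

**Part II: Fitting the non-linear model**

6. Fit the non-linear model.

7. Plot the residuals (if you like), and plot the fitted model against the observed values.
**Note to instructors: If you don't want to cover residuals then feel free to skip this.**


8. If the model fit looks reasonable, get a summary of the model to get the parameter estimates.

9. Interpret the parameter estimates. What do they mean?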

The general logistic equation we'll be fitting is:

$$y = \frac{K}{1+(\frac{K}{P_0}-1)\cdot e^{-r_{max} \cdot x}} + \epsilon$$

where $K$ is the carrying capacity, $P_0$ is the initial population size, $r_{max}$ is the growth rate, and $\epsilon$ is the error associated with measuring $y$, which is assumed to be normally distributed with mean zero and constant variance: $\epsilon \sim N(0,\sigma^2)$.

We know $y$ and $x$.
We'll be estimating 3 parameters: $K$, $P_0$ and $r_{max}$.

The fitted (or predicted) model is:

$$\hat{y} =\frac{\hat{K}}{1+(\frac{\hat{K}}{\hat{P_0}}-1)\cdot e^{-\hat{r}_{max} \cdot x}}$$

where $\hat{y}$ is our best prediction of $y$ based on the parameter estimates $\hat{K}$, $\hat{P_0}$ and $\hat{r}_{max}$.
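
The Python sketch below shows the logistic curve as a function and one rough way to pick starting values. While $y$ is much smaller than $K$, the curve behaves approximately like $P_0 \cdot e^{r_{max} \cdot x}$, so a straight-line fit to $\ln{y}$ on the early data (below a cutoff in $x$) suggests values for $P_0$ and $r_{max}$. Using the largest observed $y$ as a starting value for $K$ is an assumption made for this illustration, as are the function names and the data.

```python
import math

def logistic(x, K, P0, r):
    """Logistic curve with carrying capacity K, initial size P0, growth rate r."""
    return K / (1 + (K / P0 - 1) * math.exp(-r * x))

def logistic_start_values(x, y, x_cut):
    """Rough starting values: K from the largest y, P0 and r from early data."""
    # Keep only the early observations and log-transform y (requires y > 0).
    early = [(xi, math.log(yi)) for xi, yi in zip(x, y) if xi < x_cut]
    n = len(early)
    x_bar = sum(xi for xi, _ in early) / n
    l_bar = sum(li for _, li in early) / n
    slope = (sum((xi - x_bar) * (li - l_bar) for xi, li in early)
             / sum((xi - x_bar) ** 2 for xi, _ in early))
    intercept = l_bar - slope * x_bar
    return max(y), math.exp(intercept), slope

# Made-up, noise-free data with K = 1000, P0 = 100, r = 0.1.
x = list(range(0, 101, 5))
y = [logistic(xi, K=1000, P0=100, r=0.1) for xi in x]

K_start, P0_start, r_start = logistic_start_values(x, y, x_cut=25)
print(K_start, P0_start, r_start)
```

The starting values will not equal the true parameters, since growth is already slowing inside the early window, but they only need to be close enough for the non-linear fit to take over.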

In [0]:
#@title Import the data as a file named "logistic.csv" in the correct format (see above)
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))

In [0]:
#@title Make the scatterplot

%%R
library(ggplot2)

logistic = read.csv("logistic.csv")

plot = ggplot(logistic, aes(x = x, y = y))
plot = plot + geom_point()
plot

In [0]:
#@title Choose a cutoff for x
xcut = 25 #@param {type:"number"}


In [0]:
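#@title Fit a log-linear model to the early data (x below the cutoff) to get starting values
%%R -i xcut

library(dplyr)  # needed for %>% and filter() below
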
log_exp_dat = logistic%>%
    filter(x < xcut)
  
log_lin_fit = lm(log(y)~x, data=log_exp_dat)
#logistic = logistic%>%
#    mutate(excel = exp(coef(summary(log_lin_fit))[1,1])*exp(x*coef(summary(log_lin_fit))[2,1]))
coef(summary(log_lin_fit))[,1]

In [0]:
#@title Choose starting values for K, P0 and r (choose the values and then run this cell)
K_start = 1100 #@param {type:"number"}
P0_start = 103 #@param {type:"number"}
r_start = .065 #@param {type:"number"}

list = [K_start, P0_start, r_start]

In [0]:
#@title Fit a logistic model to the data (no output is produced here)
%%R -i K_start,P0_start,r_start

# K_start, P0_start and r_start are passed in directly from the Python cell above

fit_logistic <- nls(data=logistic,
            formula = y ~ K/(1+(K/P0-1)*exp(-r*x)),
            start = list(K=K_start,P0=P0_start, r=r_start))

In [0]:
#@title Check to make sure the residuals look random and have equal variance (shotgun blast, no pattern)
%%R

library(ggplot2)
library(dplyr)
library(tidyr)

logistic_res = logistic%>%
    mutate(res = residuals(fit_logistic))%>%
    mutate(std_res = res/summary(fit_logistic)$sigma )
    
res_plot = ggplot(logistic_res, aes(x = predict(fit_logistic), y = std_res))
res_plot = res_plot + geom_point()
res_plot

In [0]:
#@title Check the residuals for normality (the points should generally fall on the line)
%%R

qqnorm(residuals(fit_logistic))
qqline(residuals(fit_logistic))

In [0]:
#@title Check our model fit against the observed values (model fit is the green line)
%%R

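plot(logistic$x,logistic$y)
# Draw the fitted values against x (sorted by x), not against the row index
ord = order(logistic$x)
lines(logistic$x[ord], predict(fit_logistic)[ord], col="green")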

In [0]:
#@title Getting our parameter estimates
%%R

coef(summary(fit_logistic))[,1]

