# Statistical Analysis with R Cheat Sheet

## Base R Statistical Functions for Central Tendency and Variability

Here’s a selection of statistical functions having to do with central tendency and variability that come with the standard R installation. You’ll find many others in R packages.

Each of these statistical functions consists of a function name immediately followed by parentheses, such as `mean()` and `var()`. Inside the parentheses are the arguments. In this context, “argument” doesn’t mean “disagreement,” “confrontation,” or anything like that. It’s just the math term for whatever a function operates on.

|**Function**|**What it Calculates**|
| ----------- | ----------- |
|`mean(x)`|Mean of the numbers in vector x|
|`median(x)`|Median of the numbers in vector x|
|`var(x)`|Estimated variance of the population from which the numbers in vector x are sampled|
|`sd(x)`|Estimated standard deviation of the population from which the numbers in vector x are sampled|
|`scale(x)`|Standard scores (z-scores) for the numbers in vector x|
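
As a quick illustration of the functions in the table, here is a minimal sketch. The vector `x` and its values are made up for the example, and `x_missing` is a hypothetical name used only to show a named argument.

```r
# Hypothetical sample of eight scores (illustrative values only)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)     # arithmetic mean
median(x)   # middle value (here, the average of the two middle values)
var(x)      # sample variance: squared deviations summed, divided by n - 1
sd(x)       # sample standard deviation: square root of var(x)

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
z <- as.numeric(scale(x))
z           # z-scores: (x - mean(x)) / sd(x)

# Arguments can be named, for example to skip missing values
x_missing <- c(x, NA)
mean(x_missing, na.rm = TRUE)
```

Because `var()` and `sd()` divide by n - 1, they estimate the population values from a sample rather than describing the sample itself.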

## Base R Statistical Functions for Relative Standing

Here’s a selection of R statistical functions having to do with relative standing.   

| **Function**| **What it Calculates**|  
| ----------- | ----------- |
|`sort(x)`|The numbers in vector x in increasing order|
|`sort(x)[n]`|The nth smallest number in vector x|
|`rank(x)`|Ranks of the numbers (in increasing order) in vector x|
|`rank(-x)`|Ranks of the numbers (in decreasing order) in vector x|
|`rank(x, ties.method = "average")`|Ranks of the numbers (in increasing order) in vector x, with tied numbers given the average of the ranks that the ties would have attained|
|`rank(x, ties.method = "min")`|Ranks of the numbers (in increasing order) in vector x, with tied numbers given the minimum of the ranks that the ties would have attained|
|`rank(x, ties.method = "max")`|Ranks of the numbers (in increasing order) in vector x, with tied numbers given the maximum of the ranks that the ties would have attained|
|`quantile(x)`|The 0th, 25th, 50th, 75th, and 100th percentiles (i.e., the _quartiles_) of the numbers in vector x. (That’s not a misprint: by default, `quantile(x)` returns the quartiles of x.)|
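
The sketch below runs the relative-standing functions on a small made-up vector that contains one tie, so the effect of `ties.method` is visible. The values and the result names (`r_avg`, `r_min`, and so on) are assumptions for illustration.

```r
# Hypothetical scores; the two 12s are tied
x <- c(15, 12, 20, 12, 18)

sorted <- sort(x)                        # increasing order
second_smallest <- sort(x)[2]            # the 2nd smallest value

r_avg  <- rank(x)                        # default ties.method = "average"
r_min  <- rank(x, ties.method = "min")   # tied values share the lower rank
r_max  <- rank(x, ties.method = "max")   # tied values share the higher rank
r_desc <- rank(-x)                       # ranks in decreasing order

q <- quantile(x)                         # 0%, 25%, 50%, 75%, 100%

# Other percentiles can be requested with the probs argument
quantile(x, probs = c(0.10, 0.90))
```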

## T-Test Functions for Statistical Analysis with R

Here’s a selection of R statistical functions having to do with t-tests.

|**Function**|**What it Calculates**|
| ----------- | ----------- |
|`t.test(x, mu = n, alternative = "two.sided")`|Two-tailed t-test that the mean of the numbers in vector x is different from n.|
|`t.test(x, mu = n, alternative = "greater")`|One-tailed t-test that the mean of the numbers in vector x is greater than n.|
|`t.test(x, mu = n, alternative = "less")`|One-tailed t-test that the mean of the numbers in vector x is less than n.|
|`t.test(x, y, mu = 0, var.equal = TRUE, alternative = "two.sided")`|Two-tailed t-test that the mean of the numbers in vector x is different from the mean of the numbers in vector y. The variances in the two vectors are assumed to be equal.|
|`t.test(x, y, mu = 0, alternative = "two.sided", paired = TRUE)`|Two-tailed t-test that the mean of the numbers in vector x is different from the mean of the numbers in vector y. The vectors represent matched samples.|
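
Here is a minimal sketch of the calls in the table. The vectors `x` and `y` hold made-up measurements, and the test value `mu = 5` is an arbitrary choice for the example. Each call returns an object whose pieces (test statistic, p-value, confidence interval) can be pulled out with `$`.

```r
# Hypothetical measurements: the same eight subjects before (x) and after (y)
x <- c(5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 5.3, 6.1)
y <- c(5.6, 5.2, 6.1, 6.4, 6.3, 5.9, 5.8, 6.5)

# One-sample tests against a hypothesized mean of 5
one_two <- t.test(x, mu = 5, alternative = "two.sided")
one_gt  <- t.test(x, mu = 5, alternative = "greater")

# Two independent samples, assuming equal variances
indep <- t.test(x, y, mu = 0, var.equal = TRUE, alternative = "two.sided")

# Matched (paired) samples
paired <- t.test(x, y, mu = 0, alternative = "two.sided", paired = TRUE)

one_two$statistic   # t value
one_two$p.value     # p-value
paired$conf.int     # confidence interval for the mean difference
```

If `var.equal = TRUE` is omitted in the two-sample case, `t.test()` defaults to the Welch test, which does not assume equal variances.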

## ANOVA and Regression Analysis Functions for Statistical Analysis with R

Here’s a selection of R statistical functions having to do with Analysis of Variance (ANOVA) and correlation and regression.

When you carry out an ANOVA or a regression analysis, store the result in an object. For example,

`a <- lm(y~x, data = d)`

Then, to see the tabled results, use the `summary()` function:

`summary(a)`

**Analysis of Variance (ANOVA)**   

|**Function**|**What it Calculates**|
| ----------- | ----------- |
|`aov(y~x, data = d)`|Single-factor ANOVA, with the numbers in vector y as the dependent variable and the elements of vector x as the levels of the independent variable. The data are in data frame  _d_.|
|`aov(y~x + Error(w/x), data = d)`|Repeated Measures ANOVA, with the numbers in vector y as the dependent variable and the elements in vector x as the levels of an independent variable. Error(w/x) indicates that each element in vector w experiences all the levels of  _x_  (i.e.,  _x_  is a repeated measure). The data are in data frame d.|
|`aov(y~x*z, data = d)`|Two-factor ANOVA, with the numbers in vector y as the dependent variable and the elements of vectors  _x_  and  _z_  as the levels of the two independent variables. The data are in data frame  _d_.|
|`aov(y~x*z + Error(w/z), data = d)`|Mixed ANOVA, with the numbers in vector _y_ as the dependent variable and the elements of vectors _x_ and _z_ as the levels of the two independent variables. Error(w/z) indicates that each element in vector _w_ experiences all the levels of _z_ (i.e., _z_ is a repeated measure). The data are in data frame _d_.|
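
As a rough sketch of two of these calls: the single-factor example uses R's built-in `PlantGrowth` data, and the repeated-measures example uses a small made-up data frame `d` in which five subjects (`w`) are each measured under three conditions (`x`). The values in `d` are invented for illustration.

```r
# Single-factor ANOVA on the built-in PlantGrowth data
fit_one <- aov(weight ~ group, data = PlantGrowth)
summary(fit_one)

# Hypothetical repeated-measures data: 5 subjects x 3 conditions
d <- data.frame(
  w = factor(rep(1:5, times = 3)),               # subject identifier
  x = factor(rep(c("a", "b", "c"), each = 5)),   # condition (repeated measure)
  y = c(10, 12, 11,  9, 13,
        12, 15, 13, 11, 15,
        15, 17, 14, 13, 18)                      # dependent variable
)

# Error(w/x) tells aov() that every subject in w experiences all levels of x
fit_rm <- aov(y ~ x + Error(w/x), data = d)
summary(fit_rm)
```

Note that the subject and condition columns are stored as factors. If they were left as numbers, `aov()` would treat them as numeric predictors rather than as group labels.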

**Correlation and Regression**   


|**Function**|**What it Calculates**|
| ----------- | ----------- |
|`cor(x,y)`|Correlation coefficient between the numbers in vector  _x_  and the numbers in vector  _y_|
|`cor.test(x,y)`|Correlation coefficient between the numbers in vector  _x_  and the numbers in vector  _y_, along with a t-test of the significance of the correlation coefficient.|
|`lm(y~x, data = d)`|Linear regression analysis with the numbers in vector  _y_  as the dependent variable and the numbers in vector _x_  as the independent variable. Data are in data frame  _d_.|
|`coefficients(a)`|Slope and intercept of linear regression model  _a_.|
|`confint(a)`|Confidence intervals of the slope and intercept of linear regression model  _a_|
|`lm(y~x+z, data = d)`|Multiple regression analysis with the numbers in vector y as the dependent variable and the numbers in vectors  _x_  and  _z_  as the independent variables. Data are in data frame _d_.|
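
The following sketch runs these functions on the built-in `mtcars` data, with `mpg` standing in for _y_, `wt` for _x_, and `hp` for _z_. The choice of data set and variables is only for illustration.

```r
# Simple linear regression: fuel economy predicted from weight
a <- lm(mpg ~ wt, data = mtcars)
summary(a)        # tabled results
coefficients(a)   # intercept and slope
confint(a)        # 95% confidence intervals by default

# Correlation, with and without a significance test
r  <- cor(mtcars$wt, mtcars$mpg)
ct <- cor.test(mtcars$wt, mtcars$mpg)
ct$p.value

# Multiple regression with two predictors
m <- lm(mpg ~ wt + hp, data = mtcars)
coefficients(m)
```

With a single predictor, the squared correlation `r^2` equals the R-squared value reported by `summary(a)`.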


## Important libraries to load 
If you don’t have a particular package installed already, install it first. For example: `install.packages("Tmisc")`.
~~~
 library(readr)        # for optimized read with read_csv() instead of read.csv() 
 library(dplyr)        # for filter(), mutate(), %>%, etc. see dplyr lesson. 
 library(ggplot2)      # for making plots in this lesson 
 library(broom)        # OPTIONAL: for model tidying with tidy(), augment(), glance() 
 library(Tmisc)        # OPTIONAL: for gg_na() and propmiss()
 ~~~


## The pipe: `%>%`
When you load the `dplyr` library you can use `%>%`, **the pipe**. Running `x %>% f(args)` is the same as `f(x, args)`. 
Suppose you want to run function `f()` on data `x`, then run function `g()` on that result, then run function `h()` on that result. Instead of nesting the calls as `h(g(f(x)))`, it’s more readable to create a chain or pipeline of functions: `x %>% f %>% g %>% h`.
Pipelines can be spread across multiple lines, with each line except the last ending in `%>%`.
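
To make the equivalence concrete, here is a small sketch that assumes the `dplyr` package is installed. It uses the built-in `mtcars` data, and the names `nested` and `piped` are chosen only for this example.

```r
library(dplyr)

# x %>% f %>% g is the same as g(f(x))
c(1, 4, 9) %>% sqrt %>% sum

# Nested form: read from the inside out
nested <- summarize(filter(mtcars, cyl == 4), mean_mpg = mean(mpg))

# Piped form: read from top to bottom
piped <- mtcars %>%
  filter(cyl == 4) %>%
  summarize(mean_mpg = mean(mpg))

all.equal(nested, piped)   # both forms give the same result
```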

|Function  |Description  |
|--|--|
|`read_csv("path/awesome.csv")` |Read awesome.csv in the path/ folder (readr) |
|`View(df)` |View data frame df in a graphical viewer |
|`head(df)` ; `tail(df)` |Print the first and last few rows of data frame df |
|`mean()`, `median()`, `range()` |Descriptive stats. Remember `na.rm=TRUE` if desired |
|`is.na(x)` |Returns TRUE where x is NA and FALSE otherwise. Use `sum(is.na(x))` to count NAs |
|`filter(df, ...)` |Filters data frame df according to the condition(s) in `...` (dplyr) |
|`t.test(y~grp, data=df)` |T-test comparing the mean of y between the two groups in grp, in data frame df |
|`wilcox.test(y~grp, data=df)` |Wilcoxon rank sum / Mann-Whitney U test |
|`lmfit <- lm(y~x1+x2, data=df)` |Fit a linear model of y against two predictors, x1 and x2 |
|`anova(lmfit)` |Print ANOVA table on object returned from `lm()` |
|`summary(lmfit)` |Get summary information about a model fit with `lm()` |
|`TukeyHSD(aov(lmfit))` |ANOVA post-hoc pairwise contrasts |
|`xt <- xtabs(~x1+x2, data=df)` |Cross-tabulate a contingency table |
|`addmargins(xt)` |Adds summary margins to a contingency table xt |
|`prop.table(xt)` |Turns a count table into proportions (set `margin=1` for row proportions) |
|`chisq.test(xt)` |Chi-square test on a contingency table xt |
|`fisher.test(xt)` |Fisher’s exact test on a contingency table xt |
|`mosaicplot(xt)` |Mosaic plot for a contingency table xt |
|`factor(x, levels=c("wt", "mutant"))` |Create factor specifying level order |
|`relevel(x, ref="wildtype")` |Re-level a factor variable |
|`glm(y~x1+x2, data=df, family="binomial")` |Fit a logistic regression model |
|`power.t.test(n, power, sd, delta)` |T-test power calculations |
|`power.prop.test(n, power, p1, p2)` |Proportions test power calculations |
|`tidy()` `augment()` `glance()` |Model tidying functions in the broom package |
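
As one example of how several of these functions fit together, the sketch below walks through the contingency-table entries using only base R and the built-in `mtcars` data. The choice of the `am` and `vs` columns is arbitrary, and `row_props` is a name made up for this example.

```r
# Cross-tabulate transmission type (am) against engine shape (vs)
xt <- xtabs(~ am + vs, data = mtcars)
xt

addmargins(xt)                           # add row and column totals

row_props <- prop.table(xt, margin = 1)  # proportions within each row
row_props

chisq.test(xt)                           # chi-square test of independence
fisher.test(xt)                          # exact alternative for small counts
```

Without `margin = 1`, `prop.table()` divides every cell by the grand total, so the whole table sums to 1 instead of each row.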


## ggplot2 basics 
Build a plot layer by layer, starting with a call to `ggplot()` that specifies the data and the aesthetic mappings, for instance to x/y coordinates and color. Continue building the plot by adding layers such as geometric objects *(geoms)* or statistics, like a trendline. The example below uses *mydata*, plots *xvar* and *yvar* on the *x* and *y* axes, plots points colored by levels of *groupvar*, and adds a linear model trendline.

`ggplot(mydata, aes(xvar, yvar)) + geom_point(aes(color=groupvar)) + geom_smooth(method="lm")`
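
For a version that can be run as is, here is a sketch that assumes the `ggplot2` package is installed and substitutes the built-in `mtcars` data for *mydata*, with `wt`, `mpg`, and `cyl` standing in for *xvar*, *yvar*, and *groupvar*. Those substitutions are illustrative choices only.

```r
library(ggplot2)

# Layer 1: data and x/y mappings
# Layer 2: points, colored by number of cylinders (as a factor, so the
#          colors are discrete rather than a continuous gradient)
# Layer 3: a single linear-model trendline across all points
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point(aes(color = factor(cyl))) +
  geom_smooth(method = "lm")
```

Typing `p` at the console, or calling `print(p)`, draws the plot. Because the color mapping is set inside `geom_point()` rather than in `ggplot()`, the trendline is fit to all points together instead of once per group.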