# R

## Objects, Statistics, and Packages

## Frequency
- Counting the frequency of an element in `R` is done using the various `table` functions
    - `table` returns a `table` object, which may be converted to a data frame for easier querying
- There is no limit to the number of variables in a cross-tabulation, although it is rare to see anything beyond a two- or three-way table
    - To print higher-dimensional frequencies, pass the table to `ftable`
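
As a minimal sketch of the points above, the following uses the built-in `mtcars` data set (an assumption for illustration; it is not one of the data sets used elsewhere in these notes) to count values, convert the result to a data frame, and flatten a three-way table with `ftable`.

```R
# mtcars ships with base R; cyl, gear and am take only a few distinct values
counts <- table(mtcars$cyl)
print(counts)

# Convert to a data frame so the counts can be filtered like any other column
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("cyl", "n")      # column names chosen for this example
print(counts_df[counts_df$n > 10, ])

# A three-way cross-tabulation, flattened into a single two-dimensional display
three_way <- table(cyl = mtcars$cyl, gear = mtcars$gear, am = mtcars$am)
print(ftable(three_way))
```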

## Frequency of Qualitative Data
- Qualitative Data represents categories
    - No additional preprocessing needed with categorical data

In [None]:
strings <- c("Yes","Yes","No","Maybe","OK","Yes")
print(table(strings))

In [None]:
library(vcd)
head(Bundesliga)

In [None]:
print(table(Bundesliga$HomeTeam))

In [None]:
homeGames <- table(Bundesliga$HomeTeam)
print(head(homeGames[order(-homeGames)]))

In [None]:
## How do we get the total number of games played?

In [None]:
print(head(table(Bundesliga$HomeTeam,Bundesliga$AwayTeam)))

## Frequency of Quantitative Data
- Quantitative Data requires preprocessing
    - The `table` function can only count things; it won't bin numbers for us
- The `cut` function converts numeric data into factors
    - In addition to the vector to cut, we pass either the number of bins or the break points we want to use
    - The parameter `right` controls which side of each interval is closed: `right=TRUE` (the default) gives intervals of the form `(a,b]`, while `right=FALSE` gives `[a,b)`
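
A small sketch of how `right` changes the bins, using a made-up vector of goal counts (the values and break points are assumptions chosen for illustration). `include.lowest = TRUE` is added so that the outermost break point is not dropped.

```R
goals <- c(0, 1, 1, 2, 3, 3, 4, 6)   # made-up values for illustration
breaks <- c(0, 2, 4, 6)              # explicit break points instead of a bin count

# right = TRUE (the default): each interval includes its upper end
right_closed <- cut(goals, breaks = breaks, include.lowest = TRUE)

# right = FALSE: each interval includes its lower end
left_closed <- cut(goals, breaks = breaks, right = FALSE, include.lowest = TRUE)

print(table(right_closed))
print(table(left_closed))
```

The value 2 falls in the first bin in one case and in the second bin in the other, which is why the two tables differ.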

In [None]:
print(max(Bundesliga$HomeGoals))
FactorGoals <- cut(Bundesliga$HomeGoals,3,right=FALSE)
print(table(FactorGoals))

In [None]:
print(head(table(Bundesliga$HomeTeam,FactorGoals)))

In [None]:
goalsByTeam <- as.data.frame(table(Bundesliga$HomeTeam,FactorGoals))
print(head(goalsByTeam))


In [None]:
goalsByTeam <- as.data.frame.matrix(table(Bundesliga$HomeTeam,FactorGoals))
print(head(goalsByTeam))

In [None]:
## [[3]] extracts the third bin as a plain vector, which is what order expects
print(head(goalsByTeam[order(-goalsByTeam[[3]]),]))

## Descriptive Statistics
- Almost every basic statistical function is built into `R`
    - `mean`
    - `median`
    - `sd` - Standard Deviation
    - `max`
    - `min`

In [None]:
print(paste("Our dataset includes the years from",min(Bundesliga$Year),"to",max(Bundesliga$Year)))
print(mean(Bundesliga$AwayGoals))
print(mean(Bundesliga$HomeGoals))
print(sd(Bundesliga$AwayGoals))
print(sd(Bundesliga$HomeGoals))

In [None]:
sumAway <- summary(Bundesliga$AwayGoals)
print(class(sumAway))
print(sumAway)
print(summary(Bundesliga$HomeGoals))

## Applying Over Axis
- When applying a descriptive function like `mean` to a matrix or array, the default behavior is to flatten it into a vector
- To apply it only over rows or only over columns, we need to use another function
    - For the mean, there are the special functions `rowMeans` and `colMeans`
    - In general, we can use the `apply` function, which applies a function over an object across a given margin (sometimes called an axis)
        - In a matrix, 1 applies over the rows, and 2 applies over the columns
```R
    apply(OBJECT,AXIS,FUNCTION)
```
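
The difference between flattening and applying over a margin can be sketched on a tiny matrix (the matrix `m` is made up for illustration):

```R
m <- matrix(1:6, nrow = 2)    # 2 rows, 3 columns, filled column by column

overall <- mean(m)            # one number: the matrix is treated as a vector
by_row  <- rowMeans(m)        # one mean per row
by_col  <- colMeans(m)        # one mean per column

# apply generalizes this to any function
row_max <- apply(m, 1, max)   # margin 1: over the rows
col_sd  <- apply(m, 2, sd)    # margin 2: over the columns

print(overall)
print(by_row)
print(by_col)
print(row_max)
print(col_sd)
```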

In [None]:
library(psych)
print(dim(iqitems))
print(head(iqitems))
iqitems[is.na(iqitems)] <- 0
print(mean(as.matrix(iqitems)))

In [None]:
print(apply(iqitems,2,mean))

## Correlation
- There are many different kinds of correlation; three of the most common are
    - Pearson's r (most common)
    - Kendall's $\tau$ (Rank-based correlation)
    - Spearman's $\rho$ (Rank-based correlation)
- All are available in `R` through the `cor` function, by passing the corresponding string (`"pearson"`, `"kendall"`, or `"spearman"`) to the `method` parameter
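
A short sketch comparing the three methods on the same pair of variables. It uses weight and fuel economy from the built-in `mtcars` data set, which is an assumption made for illustration.

```R
x <- mtcars$wt    # car weight
y <- mtcars$mpg   # miles per gallon

methods <- c("pearson", "kendall", "spearman")

# One correlation per method, returned as a named numeric vector
results <- sapply(methods, function(m) cor(x, y, method = m))
print(round(results, 3))
```

All three coefficients lie between -1 and 1, but they are on different scales, so they should not be compared to each other directly.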

In [None]:
print(cor(Bundesliga$HomeGoals, Bundesliga$AwayGoals,method="spearman"))

## Not really useful because it's comparing ranks, but this is how it is called
print(cor(Bundesliga$HomeGoals, Bundesliga$AwayGoals,method="kendall"))

## PCA
- `R` also comes with numerous exploratory data analysis techniques built in
- Principal Components Analysis (PCA) is a dimensionality reduction technique that attempts to find the most important components
- The PCA function in R is named `prcomp`
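
A minimal sketch of inspecting a `prcomp` result, using the four numeric columns of the built-in `iris` data set. Centering and scaling the columns first is an assumption made here; whether it is appropriate depends on the data.

```R
numeric_cols <- iris[, 1:4]

# center and scale. standardize each column before the decomposition
pca <- prcomp(numeric_cols, center = TRUE, scale. = TRUE)

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))

print(summary(pca))        # same information, formatted by R
print(head(pca$x[, 1:2]))  # observations projected onto the first two components
```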

In [None]:
prcomp(iqitems)

## K-Means
- Clustering is both a machine learning technique as well as a method of exploratory analysis
- The `kmeans` function produces k clusters using the attributes of the data
    - By default, it will use all attributes; if you don't want this, select a subset before passing it to `kmeans`
- A `kmeans` object is returned
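
A sketch of clustering on a subset of attributes and then pulling out the rows of one cluster. The choice of the `iris` petal columns, three clusters, and `nstart = 10` are assumptions for illustration. Because the starting centers are random, `set.seed` is used to make the run repeatable.

```R
set.seed(42)   # k-means starts from random centers

# Select only the attributes we want to cluster on
features <- iris[, c("Petal.Length", "Petal.Width")]

fit <- kmeans(features, centers = 3, nstart = 10)

print(fit$centers)          # one row of attribute means per cluster
print(table(fit$cluster))   # number of observations in each cluster

# fit$cluster has one label per row, so a logical test selects rows
cluster_one <- features[fit$cluster == 1, ]
print(head(cluster_one))
```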

In [None]:
clusters <- kmeans(iqitems,10)
print(clusters)

In [None]:
## str prints the structure itself; wrapping it in print would add a stray NULL
str(clusters)
print(clusters$cluster)


In [None]:
## Rows of iqitems that were assigned to cluster 2
head(iqitems[clusters$cluster==2,])

## Linear Regression
- It is very common after some exploratory analysis to build a model in R
- Linear regression in `R` is performed using the `lm` function
- `lm` is the first function we have looked at that takes a formula as an argument
```R
lm(formula, data = DATAFRAME)
```

## Formulas in R 
- A formula in R has the general form of 
```R
dependent_var ~ independent_vars
```
- Variable names are not quoted, and are expected to refer to columns in the data frame
- If you think there is no interaction between the independent variables, combine them using `+`
- If you think there is interaction, or just want to allow it as a possibility, combine them using `*`
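
To see what `+` and `*` actually expand to, one option is to ask R for the terms of a formula. This sketch uses column names from the built-in `iris` data set. No model is fitted; only the formulas are inspected.

```R
additive    <- Sepal.Length ~ Sepal.Width + Petal.Length
interacting <- Sepal.Length ~ Sepal.Width * Petal.Length

# term.labels lists the terms that would appear on the right-hand side
additive_terms    <- attr(terms(additive), "term.labels")
interacting_terms <- attr(terms(interacting), "term.labels")

print(additive_terms)
print(interacting_terms)

# a * b is shorthand for a + b + a:b, where a:b is the interaction term alone
explicit <- Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length
print(identical(attr(terms(explicit), "term.labels"), interacting_terms))
```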

In [None]:
head(iris)

In [None]:
model1 <- lm(Sepal.Length ~ Sepal.Width + Petal.Length, data = iris)
summary(model1)

In [None]:
model2 <- lm(Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)
summary(model2)

In [None]:
model3 <- lm(Sepal.Length ~ Sepal.Width * Petal.Length * Species, data = iris)
summary(model3)

## ANOVA
- In the social sciences, a very common analysis is to determine which variable is the most significant
    - The most common way of doing this is Analysis of Variance (ANOVA)
- ANOVA is actually a specialized version of a linear model, but we can call it explicitly by using the function `aov`
    - If you already have a linear model, you can print the ANOVA by using the function `anova`
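
A small sketch of the relationship between the two routes, using a one-factor model on the built-in `iris` data set (the model is an assumption chosen to keep the output short). Both routes should report the same F statistic.

```R
fit_lm  <- lm(Sepal.Length ~ Species, data = iris)
fit_aov <- aov(Sepal.Length ~ Species, data = iris)

# anova() on a linear model returns the ANOVA table as a data frame
anova_table <- anova(fit_lm)

# summary() on an aov object returns a list whose first element is that table
aov_table <- summary(fit_aov)[[1]]

f_from_lm  <- anova_table[["F value"]][1]
f_from_aov <- aov_table[["F value"]][1]

print(f_from_lm)
print(f_from_aov)
```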

In [None]:
model4 <- aov(Sepal.Length ~ Sepal.Width * Petal.Length * Species, data = iris)
print(summary(model4))

In [None]:
print(anova(model3))

## Packages in R
- Like most scripting languages, `R` has a very robust package ecosystem
- To install a package in `R`, use the `install.packages` function, and pass the name of the package you want to install as a quoted string
- Once a package is installed, you can use it by calling
```R
    library(PACKAGE_NAME) #No QUOTES
```
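
A common pattern is to install a package only when it is missing. The helper below is a hypothetical sketch (`ensure_package` is not part of R); it relies on `requireNamespace`, which reports whether a package is available without attaching it.

```R
# Hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # the package name is passed as a quoted string
  }
  # TRUE if the package can now be loaded
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so nothing is downloaded here
available <- ensure_package("stats")
print(available)
library(stats)
```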

## Package Documentation
- Most major packages in R come with two forms of documentation
    - The manual, which contains the same information that can be accessed through the `?` operator
    - Vignettes, which are longer-form documentation, often written in the style of an academic paper
- Example
    - https://cran.r-project.org/web/packages/psych/psych.pdf
    - https://cran.r-project.org/web/packages/psych/vignettes/intro.pdf
    - https://cran.r-project.org/web/packages/psych/vignettes/overview.pdf
    

## CRAN
- So where do the packages come from when we perform `install.packages`?
- By default they come from CRAN, the Comprehensive R Archive Network
    - Several other languages have an equivalent, often named similarly (CPAN for Perl, CTAN for TeX)
- Other package repositories exist and can be used, but if you are using a popular package, it is probably published on CRAN

## Finding Packages
- CRAN is great at hosting packages
    - Not great at helping you find packages
- Numerous third party websites exist to help you find a package to accomplish something
    - My personal favorite is https://crantastic.org/

## TidyData
- There are many ways to represent data in a data frame, and due to the history of R, almost all of them are in use
- Recently there has been a push to create commonsense conventions, known as having "Tidy Data"
- Hadley Wickham (Major player in R and the tidy data movement) defines tidy data as 
    - Each variable is in a column.
    - Each observation is a row.
    - Each value is a cell.


## tidyr
- To promote and enable this, the package `tidyr` was released
- It has spawned an entire family of packages, collectively known as the tidyverse
    - You can install just `tidyr` by using `install.packages('tidyr')` (package names are case-sensitive)
    - The entire family can be installed with `install.packages('tidyverse')`
- It contains many functions meant to manipulate data into a tidy form

## The Pipe Operator
- `tidyr` is commonly presented using the operator `%>%`, which comes from an earlier package, `magrittr`
    - It is very similar to the pipe in bash, passing the output of one function as the first argument to the next function
    - The following are equivalent
    
```R
apply(data,1,FUNCTION)

data %>% apply(1,FUNCTION)
```
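
A runnable sketch of the same equivalence, assuming the `magrittr` package is installed. The matrix and the chain of functions are made up for illustration.

```R
library(magrittr)

m <- matrix(1:6, nrow = 2)

# Nested calls are read from the inside out
nested <- round(sqrt(apply(m, 1, sum)), 2)

# Piped calls are read left to right; each result becomes the first argument
piped <- m %>% apply(1, sum) %>% sqrt() %>% round(2)

print(identical(nested, piped))
```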

## Spreading
- The `spread` function converts from long data to wide data
- The syntax of the `spread` function is
```R
    spread(data,key,value)
```
    - Key is the column you want to use to form your new columns
    - Value is the column you want to use to fill the cells
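
A minimal sketch of `spread` on a made-up long data frame (the data and the column names `key` and `value` are assumptions for illustration, and `tidyr` must be installed):

```R
library(tidyr)

# Long form: one row per country and measurement
long_df <- data.frame(
  country = c("A", "A", "B", "B"),
  key     = c("cases", "population", "cases", "population"),
  value   = c(10, 1000, 20, 2000)
)

# Each distinct entry of key becomes a column, filled from value
wide_df <- spread(long_df, key, value)
print(wide_df)
```

The result has one row per country, with `cases` and `population` as separate columns.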

In [None]:
library(DSR)
long <- table2
extra_wide_cases <- table4
combined <- table5

In [None]:
library(tidyr)
print(as.data.frame(spread(long,?,?)))

## Gathering
- Gathering is the opposite of spread
    - While it is uncommon to need this, it is possible someone made a data frame where not every column is a variable, and you need to collapse things a bit
```R
    gather(data, COLUMN_NAME1, COLUMN_NAME2, cols_to_gather)
```
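
A minimal sketch of `gather` on a made-up wide data frame whose column names are really values of a variable (the data are assumptions for illustration, and `tidyr` must be installed):

```R
library(tidyr)

# Wide form: the years are spread across the column names
wide_df <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25),
  check.names = FALSE   # keep the numeric-looking column names as they are
)

# Collapse columns 2 and 3 into a year column and a cases column
long_df <- gather(wide_df, "year", "cases", 2:3)
print(long_df)
```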

In [None]:
gathered_cases <- extra_wide_cases %>% gather("Year","Cases",2:3)
print(gathered_cases)

## Separating and Uniting
- Separating and uniting allow us to create multiple columns from one, or bring together columns that should never have been separated
```R
    separate(data,col_to_separate,new_columns)
    unite(data,col_to_add, from_columns)
```
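
A minimal sketch of both functions on a made-up data frame (the data and the new column names `full_year`, `cases`, and `population` are assumptions for illustration, and `tidyr` must be installed). The `sep` arguments say how to join or split the text, and `convert = TRUE` turns the separated pieces into numbers.

```R
library(tidyr)

messy <- data.frame(
  century = c("19", "20"),
  year    = c("99", "00"),
  rate    = c("10/1000", "20/2000")
)

tidy <- messy %>%
  unite("full_year", century, year, sep = "") %>%   # "19" + "99" -> "1999"
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

print(tidy)
```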

In [None]:
print(combined)
all_good <- combined %>% unite("year",?) %>% separate(?,?)
print(all_good)