# R

## Packages and Graphics

## Packages in R
- Like most scripting languages, `R` has a very robust package ecosystem
- To install a package in `R`, use the `install.packages` function, and pass the quoted name of the package you want to install
- Once a package is installed, you can use it by calling
```
    library(PACKAGE_NAME) #No QUOTES
```
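
As a minimal sketch of the install-then-load workflow, the block below installs a package only when it is missing and then attaches it. The choice of `tidyr` as the package and the `repos` URL (the CRAN cloud mirror) are assumptions made for illustration; any CRAN package and mirror would work the same way.

```R
# Install tidyr only if it is not already available (assumed example package).
if (!requireNamespace("tidyr", quietly = TRUE)) {
  # The package name must be quoted here; the mirror URL is an assumption.
  install.packages("tidyr", repos = "https://cloud.r-project.org")
}

# Attach the package; quotes are optional in library().
library(tidyr)

# Confirm the package is now on the search path.
print("package:tidyr" %in% search())
```

Checking with `requireNamespace` first avoids re-downloading a package every time the notebook is run.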

## Package Documentation
- Most major packages in R come with two forms of documentation
    - The manual, which contains the same information that can be accessed through the `?` operator
    - Vignettes, which are longer-form documentation, often written in the style of an academic paper
- Example
    - https://cran.r-project.org/web/packages/psych/psych.pdf
    - https://cran.r-project.org/web/packages/psych/vignettes/intro.pdf
    - https://cran.r-project.org/web/packages/psych/vignettes/overview.pdf
    

## CRAN
- So where do the packages come from when we perform `install.packages`?
- By default they come from CRAN, the Comprehensive R Archive Network
    - Many other languages have an equivalent, often named similarly (CTAN for TeX, CPAN for Perl)
- Other package repositories exist and can be used, but if you are using a popular package, it is probably published on CRAN

## Finding Packages
- CRAN is great at hosting packages
    - Not great at helping you find packages
- Numerous third party websites exist to help you find a package to accomplish something
    - My personal favorite is https://crantastic.org/

## TidyData
- There are many ways to represent data in a data frame, and due to the history of R, almost all of them are in use
- Recently there has been a push to create commonsense conventions, known as having "Tidy Data"
- Hadley Wickham (a major player in R and the tidy data movement) defines tidy data as
    - Each variable is in a column.
    - Each observation is a row.
    - Each value is a cell.
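
To make the three rules concrete, here is a small sketch in base R that holds the same made-up numbers in an untidy and a tidy layout. The data frames `untidy` and `tidy`, the countries and the counts are all hypothetical.

```R
# Untidy: the columns 1999 and 2000 are really values of one variable (year).
untidy <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25),
  check.names = FALSE  # keep the numeric column names as written
)

# Tidy: one column per variable, one row per observation, one value per cell.
tidy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1999, 2000, 1999, 2000),
  cases   = c(10, 15, 20, 25)
)

print(untidy)
print(tidy)
```

Both tables carry the same information; only the tidy one lets every function address `year` and `cases` by column name.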


## tidyr
- To promote and enable this, the package `tidyr` was released
- It has spawned an entire family of packages, collectively known as the tidyverse
    - You can install just `tidyr` by using `install.packages('tidyr')` (package names are case-sensitive)
    - The entire family can be installed with `install.packages('tidyverse')`
- `tidyr` contains many functions meant to manipulate data into a tidy form

## The Pipe Operator
- `tidyr` is commonly presented using the operator `%>%`, which comes from an earlier package, `magrittr`
    - It is very similar to the pipe in bash, passing the output of one function as the first argument to the next function
    - The following are equivalent
    
```R
# `sum` stands in for any function you want to apply to each row
apply(data, 1, sum)

data %>% apply(1, sum)
```
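
The equivalence can be checked directly. The sketch below assumes `magrittr` is installed and uses a small made-up matrix `m` with `sum` as the applied function.

```R
library(magrittr)

# A 2 x 3 matrix used only for illustration.
m <- matrix(1:6, nrow = 2)

# Ordinary call: the data is written as the first argument.
direct <- apply(m, 1, sum)

# Piped call: m is supplied as the first argument of apply().
piped <- m %>% apply(1, sum)

# Both forms compute the same row sums.
print(identical(direct, piped))
```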

## Spreading
- The `spread` function converts from long data to wide data
- The syntax of the `spread` function is
```R
    spread(data,key,value)
```
- `key` is the column whose values you want to use to form your new columns
- `value` is the column whose values you want to use to fill the cells
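
A small worked sketch of `spread`, using a made-up data frame rather than the course data: `long_example`, its column names and its numbers are all hypothetical. Recent `tidyr` releases mark `spread` as superseded by `pivot_wider`, but it still works as described here.

```R
library(tidyr)

# Long form: one row per country and measurement type.
long_example <- data.frame(
  country = c("A", "A", "B", "B"),
  type    = c("cases", "population", "cases", "population"),
  count   = c(10, 1000, 20, 2000),
  stringsAsFactors = FALSE
)

# The values of `type` become column names; `count` fills the cells.
wide_example <- spread(long_example, key = type, value = count)
print(wide_example)
```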

In [None]:
library(DSR)
long <- table2
extra_wide_cases <- table4
combined <- table5

In [None]:
library(tidyr)
print(as.data.frame(spread(long,?,?)))

## Gathering
- `gather` is the opposite of `spread`: it converts from wide data to long data
    - While it is uncommon to need this, it is possible someone made a data frame where not every column is a variable, and you need to collapse things a bit
```R
    gather(data, KEY_COLUMN_NAME, VALUE_COLUMN_NAME, cols_to_gather)
```
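
As a sketch of `gather` on data you can inspect by eye, the block below collapses two year columns of a made-up data frame into a key column and a value column. `wide_cases` and its contents are hypothetical.

```R
library(tidyr)

# Wide form: the year columns are values, not variables.
wide_cases <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(15, 25),
  check.names = FALSE
)

# Columns 2 and 3 are gathered: their names go into "year",
# their contents go into "cases".
long_cases <- gather(wide_cases, "year", "cases", 2:3)
print(long_cases)
```

Note that the gathered `year` column holds the old column names, so it comes back as text rather than numbers.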

In [None]:
gathered_cases <- extra_wide_cases %>% gather("Year","Cases",2:3)
print(gathered_cases)

## Separating and Uniting
- Separating and uniting allow us to create multiple columns from one, or bring together columns that should never have been separated
```R
    separate(data,col_to_separate,new_columns)
    unite(data,col_to_add, from_columns)
```
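
The two functions can be sketched on a made-up data frame in which a year is split across two columns and two counts are packed into one. The data frame `rates`, the new column name `full_year` and the separators are assumptions chosen for illustration.

```R
library(tidyr)

rates <- data.frame(
  century = c("19", "20"),
  year    = c("99", "00"),
  rate    = c("10/1000", "15/2000"),
  stringsAsFactors = FALSE
)

# unite(): paste century and year into one column with no separator.
united <- unite(rates, "full_year", century, year, sep = "")

# separate(): split rate at "/" into two new columns.
cleaned <- separate(united, rate, into = c("cases", "population"), sep = "/")
print(cleaned)
```

The columns produced by `separate` are character vectors by default, so they may need converting before doing arithmetic.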

In [None]:
print(combined)
all_good <- combined %>% unite("year",?) %>% separate(?,?)
print(all_good)

## dplyr
- `dplyr` is another package in the tidyverse
    - Improves upon an earlier package named `plyr`, which allowed easy manipulation of data
    - Specifically designed for use with data frames
- Just like `tidyr`, commonly uses pipes
- Its main functions are named as verbs


## Selecting Data
- `dplyr` contains two functions to select data
    - `select` selects columns/variables
    - `filter` selects rows/observations
- `select` takes column names and `filter` takes logical conditions on columns; `select` is most useful with the helper functions available in `dplyr`
    - `ends_with`
    - `starts_with`
    - `contains`
    - `one_of`

In [None]:
library(dplyr)
starwars <- as.data.frame(starwars)
row.names(starwars) <- starwars$name
print(starwars)

In [None]:
## Standard Boring Select
select(starwars,hair_color,skin_color, eye_color)

In [None]:
##  Select with Pipes and Ends_with
starwars %>% select(ends_with('color'))

In [None]:
starwars %>% select(-name)

In [None]:
starwars %>% filter(species != "Human")

In [None]:
starwars %>% filter(species %in% c('Wookiee','Ewok'))

## Adding or Changing Variables
- The `mutate` and `transmute` functions are used to add new variables as well as update existing ones
    - `mutate` does not drop old variables
    - `transmute` drops everything except those in the function call

In [None]:
starwars %>% mutate( height = height * 0.393701)

In [None]:
starwars %>% transmute( height = height * 0.393701)

In [None]:
starwars %>% filter(species %in% c('Wookiee','Ewok')) %>% mutate( height = height * 0.393701)

## Summarizing and Counting
- In general, to compute summary values over a data frame, use the `summarize` function
    - `summarize` takes as its parameters calls to other functions that do the calculations
    - The parameters to these inner functions should be the columns you want summarized
    - Multiple summaries can be computed with one call to `summarize`
- If all you want to do is count the frequency of values in a certain column, use the `count` function and pass a column to count
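
As a sketch of several summaries in one call, the block below uses the `starwars` data set that ships with `dplyr`. The summary names (`n_species`, `avg_height`, `max_mass`) are chosen for illustration, and `na.rm = TRUE` is added on the assumption that missing heights and masses should be ignored.

```R
library(dplyr)

overview <- starwars %>%
  summarize(
    n_species  = n_distinct(species),          # number of different species
    avg_height = mean(height, na.rm = TRUE),   # mean height, skipping NAs
    max_mass   = max(mass, na.rm = TRUE)       # largest mass, skipping NAs
  )
print(as.data.frame(overview))
```

Without `na.rm = TRUE`, a single missing value in a column makes `mean` and `max` return `NA`.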

In [None]:
print(starwars %>% summarize(n_distinct(species)))

In [None]:
species_counts <- starwars %>% count(species)
print(as.data.frame(species_counts))

In [None]:
species_counts <- starwars %>% count(species,sort=TRUE)
print(as.data.frame(species_counts))

In [None]:
species_counts <- starwars %>% count(species,homeworld,sort=TRUE)
print(as.data.frame(species_counts))

## Group By
- The `group_by` function allows rows to be grouped based on their values in the given column or columns
- This makes finding averages and other summary data per group very easy
```R
group_by(data,LIST_OF_COLUMNS)
```
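
A small sketch with made-up numbers shows what grouping changes: the summary is computed once per group instead of once for the whole table. The data frame `heights` is hypothetical, and `na.rm = TRUE` is an assumption about how missing values should be treated.

```R
library(dplyr)

heights <- data.frame(
  species = c("Human", "Human", "Droid", "Droid"),
  height  = c(170, 180, 96, NA),
  stringsAsFactors = FALSE
)

per_species <- heights %>%
  group_by(species) %>%
  summarize(
    avg_height = mean(height, na.rm = TRUE),  # mean within each species
    n          = n()                          # rows in each species
  )
print(as.data.frame(per_species))
```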

In [None]:
print(starwars %>% group_by(species,homeworld) %>% summarize(avg_height = mean(height)))

In [None]:
print(starwars %>% 
                  group_by(species,homeworld) %>% 
                      summarize(avg_height = mean(height), min_height=min(height)))

## Combining Data Tables
- The various `join` functions offer database like functionality
    - Matching rows are joined together with their columns
    - Matching is done by default on any common variables, but can be specified
- `bind_rows` and `bind_cols` offer a simpler concatenation style combination
    - `bind_cols` always matches rows by position; `bind_rows` matches columns by name
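
The difference between the join types is easiest to see by counting rows. The sketch below uses two made-up tables, `people` and `plays`, whose key columns have different names, so the match is spelled out with `by`.

```R
library(dplyr)

people <- data.frame(
  name = c("Ann", "Bob", "Cy"),
  band = c("X", "Y", "Z"),
  stringsAsFactors = FALSE
)
plays <- data.frame(
  artist     = c("Ann", "Bob", "Dee"),
  instrument = c("bass", "drums", "guitar"),
  stringsAsFactors = FALSE
)

# Keep only names present in both tables.
inner <- inner_join(people, plays, by = c("name" = "artist"))

# Keep every name from either table, filling gaps with NA.
full <- full_join(people, plays, by = c("name" = "artist"))

# Plain concatenation: no matching on values, just stacking rows.
stacked <- bind_rows(people, people)

print(nrow(inner))
print(nrow(full))
print(nrow(stacked))
```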

In [None]:
print(band_members)

In [None]:
print(band_instruments)

In [None]:
print(full_join(band_members,band_instruments))

In [None]:
print(inner_join(band_members,band_instruments))

In [None]:
print(left_join(band_members,band_instruments))

In [None]:
print(right_join(band_members,band_instruments))

In [None]:
print(band_instruments2)

In [None]:
print(full_join(band_members,band_instruments2,by=c("name" = "artist")))

In [None]:
print(bind_cols(band_members,band_members))

In [None]:
print(bind_rows(band_members,band_members))

## ggplot2
- R has long supported creating graphs from data, but the process was often messy and confusing
- `ggplot2` is a widely used package that standardizes how graphs are created
    - Based on the Grammar of Graphics, a language-independent theory on how graphs should be created
    - A very large community with lots of extensions and enhancements available
    - Works directly on data frames

## The `ggplot` function
- The `ggplot` function sets up the basics for our graph, including which data frame to use, and how to use it
```R
ggplot(data_frame,aes(AESTHETICS))
```
- Aesthetics are what we see on the graph, and are defined using data frame columns
    - x and y position
    - color
    - shape

In [None]:
library(ggplot2)
ggplot(starwars,aes(x=height,y=mass))

## Geometries
- The base `ggplot` function sets up the graph and creates a ggplot object, but doesn't produce anything visually
- We need to specify how we want to display our data using geometries
    - geom_point
    - geom_boxplot
    - geom_histogram
    - geom_density
- Geometries, and every other specification in ggplot2, are added to the original `ggplot` call using `+`
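
As one more sketch of adding a geometry, the block below draws a density curve with `geom_density`, which needs only an `x` aesthetic. It assumes the `starwars` data set from `dplyr`, and `na.rm = TRUE` is added so rows with missing heights are dropped silently.

```R
library(ggplot2)
library(dplyr)  # provides the starwars data set

# The base call names the data and the aesthetic; the geometry is added with +.
density_plot <- ggplot(starwars, aes(x = height)) +
  geom_density(na.rm = TRUE)
print(density_plot)
```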

In [None]:
ggplot(starwars,aes(x=height,y=mass)) + geom_point()

In [None]:
# Fails when drawn: a histogram bins a single variable, so it cannot take both x and y
ggplot(starwars,aes(x=height,y=mass)) + geom_histogram()

In [None]:
ggplot(starwars) + geom_histogram(aes(height)) + geom_histogram(aes(mass))

In [None]:
ggplot(starwars) + geom_histogram(aes(height),fill="blue") + geom_histogram(aes(mass))

In [None]:
ggplot(starwars,aes(x=height,y=mass,color=species)) + geom_point()

In [None]:
interesting <- (starwars %>% 
         filter(!is.na(species)) %>%
             group_by(species) %>% 
             summarize(count = n()) %>% 
             filter(count > 2))$species
to_vis <- starwars %>% 
    filter(species %in% interesting)

In [None]:
base_plot <- ggplot(to_vis,aes(x=species,fill=species,y=height))
base_plot + geom_violin()

## Modifying Other Aspects
- `ggplot2` has a function for almost every aspect of a graph's appearance
- To add titles, use the functions
    - xlab, ylab, ggtitle, labs
- To modify area shown, use
    - xlim, ylim, lims
- To modify colors use one of the `scale_` functions

In [None]:
base_plot2 <- ggplot(starwars,aes(x=mass,y=height,color=species))
scatter <- base_plot2 + geom_point()
plot(scatter)

In [None]:
scatter + ggtitle("Height vs Mass of Starwars Characters")

In [None]:
scatter + labs(title="Height vs Mass of Starwars Characters",x="Mass (kg)",y="Height (cm)")

In [None]:
scatter + labs(title="Height vs Mass of Starwars Characters",x="Mass (kg)",y="Height (cm)") + xlim(0,175) 

In [None]:
scatter + labs(title="Height vs Mass of Starwars Characters",x="Mass (kg)",y="Height (cm)") + xlim(0,175) +
guides(color=guide_legend(title="Species"))

In [None]:
scatter + labs(title="Height vs Mass of Starwars Characters",x="Mass (kg)",y="Height (cm)") + xlim(0,175) +
guides(color=guide_legend(title="Species")) + scale_colour_brewer(palette = "Spectral")

## Themes
- Themes allow you to control things like font, gridline color, etc.
- The elements of the theme can be modified by using the `theme` function and passing the appropriate parameters
- More common is to download or use an existing theme, and add it to your plot using `+ theme_NAME`
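
Modifying individual elements with `theme` can be sketched as follows. The plot itself and the two elements changed here (a bold title and hidden minor gridlines) are arbitrary choices for illustration, and `theme_minimal` is one of the themes that ships with `ggplot2`.

```R
library(ggplot2)
library(dplyr)  # provides the starwars data set

themed <- ggplot(starwars, aes(x = mass, y = height)) +
  geom_point(na.rm = TRUE) +
  ggtitle("Height vs Mass") +
  theme_minimal() +                              # start from a complete theme
  theme(
    plot.title       = element_text(face = "bold"),  # bold title text
    panel.grid.minor = element_blank()               # hide minor gridlines
  )
print(themed)
```

Calling `theme` after a complete theme such as `theme_minimal` matters, because a complete theme added later would overwrite the individual tweaks.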

In [None]:
library(ggthemes)
almost_finished <- scatter + labs(title="Height vs Mass of Starwars Characters",x="Mass (kg)",y="Height (cm)") + 
xlim(0,175) + guides(color=guide_legend(title="Species"))
almost_finished + theme_fivethirtyeight()

In [None]:
almost_finished + theme_wsj()

In [None]:
almost_finished + theme_economist()

In [None]:
almost_finished + theme_tufte()