## Taking a look around in R and RStudio

Notice the various windows. 

1. The Console allows you to enter commands and view results.
2. The Environment window will show you the various objects R is keeping track of (vectors, datasets, lists, etc.)
3. The bottom right window will show you files, figures/graphs, help files, and more. 
4. If you create/open a script, it will appear in the top-left by default. I encourage using an R project and R script for nearly all work. 

## Start an RStudio project

Steps:

- File &rarr; New Project
- Choose a name for the project and folder location.^[I have a folder called "labs" located in my main PUBG 510 folder. If you are primarily going to be using RStudio through the computer lab, then place this folder in your H:\ network space.]
- Now start a new script using the sheet with a plus sign icon from the toolbar or using the File menu. The name of your project should now appear in the top-right corner of the RStudio window.
- With an RStudio project, you don't need to worry about setting a working directory: the project defines it. Just make sure all files written to or read by R are in the project folder or in a nested subfolder. 
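
As a quick check, the sketch below prints the working directory and builds a file path relative to it. The `data` subfolder is hypothetical and only illustrates a nested subfolder inside the project; the file name is reused from later in this lab.

```r
# Show the current working directory; inside an RStudio project this is
# the project folder
getwd()

# Build a path relative to the project folder
# ("data" is a hypothetical subfolder, not part of the lab materials)
csv_path <- file.path("data", "nj_poverty.csv")
csv_path

# Check whether a file exists at that path before trying to read it
file.exists(csv_path)
```

Because the path is relative, the same script should keep working if the whole project folder is moved to another computer.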

## Installing and loading packages

R is open-source, and, frankly, kinda stinks on its own.^[This is called base R.] But there are many, many user-generated packages that improve R's functionality. We'll be using these packages all the time, especially a group of packages called [the tidyverse](https://www.tidyverse.org/packages/). 

You only need to install the package once and then you're good to go (until it needs updating). But you also need to load the package in every R session if you want to use those commands. 

In [None]:
# Install the required packages if not already installed 
# By the way, the hashtag/pound/octothorpe symbol will comment out a line in your script

install.packages(c('tidyverse', 'haven'))

# let's load your packages in the R session

library(tidyverse)
library(haven)
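
Since `install.packages()` reinstalls a package every time it runs, you may prefer to install only what is missing. The following is a minimal sketch of one way to do that in base R; the helper name `missing_packages` is made up for this example.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("tidyverse", "haven"))

# Install only the packages that were not found
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

You still need to call `library()` in each session; this only skips the installation step when it is not needed.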


## Some basics

R can handle a great diversity of *objects* including lists, variables, names, vectors, data frames, scalars, and plots. Let's create a vector of data using the *concatenate* function, `c()`, which is the workhorse for how R handles lists of pretty much any type. We can assign this list of values to an object with the assignment operator `<-`. You can read that as "gets". 

In [None]:
x <- c(1,2,3,4,5,6,7,8,9,10) 

# we can display our vector by typing the name of the object or by using the print() function
x

Let's try a different way of creating the same vector:

In [None]:
x2 <- 1:10

print(x2)
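
A third option is the `seq()` function, which lets you set the start, end, and step size. The sketch below (the name `x3` is introduced only for illustration) also checks that the three approaches hold the same values, even though R stores them slightly differently.

```r
x  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
x2 <- 1:10
x3 <- seq(from = 1, to = 10, by = 1)

# compare the vectors element by element
all(x == x2)   # TRUE
all(x == x3)   # TRUE

# the values match, but the colon operator creates integers
typeof(x)      # "double"
typeof(x2)     # "integer"
```

For the calculations in this lab, the difference between "double" and "integer" storage does not matter.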

We can remove an object with the `rm()` function.

In [None]:
rm(x2)

## Functions

R has many built-in **functions** that you can use to conduct analyses. Functions take some input, perform some calculation or transformation, and produce some output. Functions will often take **arguments** that will let you control how the function works. Let's take a look at a few descriptive functions to analyze a vector of data.

In [None]:
# these functions are simple and self-explanatory
mean(x) 
median(x)
sd(x)
range(x)
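
To see how an argument changes what a function does, here is a small sketch using a made-up vector `z` that contains a missing value. The `na.rm` argument of `mean()` and the `digits` argument of `round()` are standard base R arguments.

```r
# a made-up vector with one missing value
z <- c(2, 4, NA, 8)

# by default, a missing value makes the mean missing too
mean(z)                 # NA

# na.rm = TRUE tells mean() to drop missing values first
mean(z, na.rm = TRUE)   # 4.666667

# functions can be combined; digits controls the rounding
round(mean(z, na.rm = TRUE), digits = 2)   # 4.67
```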

In R, we can easily store any output from a function by assigning it to an object using the assignment operator `<-`.

In [None]:
# let's name the mean of x and save it for the future as x.mean
x.mean <- mean(x)

x.mean

# we can call it up any time we want now that we have stored the value.
# let's do a few calculations using the stored mean:

53 + x.mean
x.mean^3

# we can also use our stored result to perform calculations on a vector of data
y <- x - x.mean

y
mean(y)

## Opening datasets

R has various formats for datasets, typically called *data frames*. You can have multiple data frames (as well as other objects) loaded in R. Most commands will ask you to specify which data frame you're using if you want to access a particular variable inside a data frame. Let's pull a simple .csv file with a few variables from the U.S. Census Bureau's American Community Survey. The data use the 2021 5-year estimates (so from 2017-2021) and measure the variables at the county level.

We can use the *read_csv()* function from the **tidyverse** (specifically, its **readr** package) to download the dataset directly from a website and read it into R:

In [None]:
nj.counties <- read_csv("https://raw.githubusercontent.com/bowendc/510_labs/main/nj_poverty.csv")  
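
Once a dataset is loaded, it helps to look at its structure before analyzing it. The sketch below uses a small made-up data frame so that it runs without an internet connection; the name `toy.counties` and its values are assumptions for illustration, not the real ACS data. The same functions should work on `nj.counties`.

```r
# a tiny made-up dataset, read from text instead of a file
toy.counties <- read.csv(text = "county,totalpop,medincome
A,1000,50000
B,2500,62000
C,400,NA")

str(toy.counties)       # variable names, types, and first values
head(toy.counties, 2)   # first two rows
dim(toy.counties)       # number of rows, then number of columns
names(toy.counties)     # variable names only
summary(toy.counties$medincome)   # quick summary, including a count of NAs
```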

## Using the pipe (|>)

The pipe operator lets you pass an object through a sequence of functions without creating intermediate variables or nesting function calls. The original pipe, `%>%`, comes from the **magrittr** package and is loaded with the **tidyverse**; you'll see many code examples online that use it. It's great. However, starting in R version 4.1, base R includes a native pipe operator, `|>`, that works everywhere in R without loading any packages. Let's look at the example below using **tidyverse**'s *summarize* function, which aggregates data in the ways specified in the command. In this case, we're summarizing two variables (`totalpop` and `medincome`) in the `nj.counties` dataset by asking R to calculate their means. 


In [None]:
nj.counties |> summarize(mean.pop = mean(totalpop, na.rm = TRUE), 
                         mean.inc = mean(medincome, na.rm = TRUE))
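
To see what the pipe is doing, it can help to compare a nested call with its piped equivalent. This is a minimal base R sketch (it assumes R 4.1 or later; the object names are made up for the example).

```r
v <- c(1, 4, 9)

# nested: read from the inside out
nested <- sum(sqrt(v))

# piped: read from left to right; v goes into sqrt(), the result goes into sum()
piped <- v |> sqrt() |> sum()

nested                    # 6
piped                     # 6
identical(nested, piped)  # TRUE
```

Both versions do the same work; the piped version just lists the steps in the order they happen.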

If we didn't want to use the **tidyverse** syntax, we could also get the same results using base R and the `$` operator: 

In [None]:
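# notice that we need to specify the data frame first, then the $, then the variable
# na.rm = TRUE drops missing values, matching the summarize() call above
mean(nj.counties$totalpop, na.rm = TRUE)
mean(nj.counties$medincome, na.rm = TRUE)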

We can use our new pipe operator to do a little recoding of variables in our dataset. Let's create a variable called `region` that will denote whether the county is in North, Central, or South Jersey. We can use the `GEOID` variable from the Census to identify each county. 

In [None]:
nj.counties.recode <- nj.counties |> mutate(region = case_when(
                            GEOID == 34037 | GEOID == 34041 | GEOID == 34031 | 
                            GEOID == 34039 | GEOID == 34003 | GEOID == 34027 | 
                            GEOID == 34013 | GEOID == 34017 ~ "north jersey",
                            GEOID == 34019 | GEOID == 34021 | 
                            GEOID == 34023 | GEOID == 34035 ~ "central jersey", 
                            TRUE ~ "south jersey")                                   # TRUE assigns every unit not already assigned
                                            )

table(nj.counties.recode$region)  # table() is a simple function for viewing the number of units with each value of a variable
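
Long chains of `==` and `|` can be hard to read. One alternative is the `%in%` operator, which checks whether each value appears in a vector. The sketch below uses the same GEOID codes as above on a three-row made-up data frame; the names `north`, `central`, and `toy` are introduced only for this example.

```r
library(dplyr)

# GEOID codes copied from the recode above
north   <- c(34037, 34041, 34031, 34039, 34003, 34027, 34013, 34017)
central <- c(34019, 34021, 34023, 34035)

# toy data: one GEOID from each list, plus one that is in neither
toy <- tibble(GEOID = c(34003, 34021, 34001))

toy.recode <- toy |> mutate(region = case_when(
  GEOID %in% north   ~ "north jersey",
  GEOID %in% central ~ "central jersey",
  TRUE               ~ "south jersey"))

toy.recode$region   # "north jersey" "central jersey" "south jersey"
```

Storing the codes in named vectors also means each list only has to be typed, and checked, once.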

## **filter** and **group_by**

These functions are also part of the **tidyverse**. *filter()* allows us to, well, filter our observations based on some condition or a list of conditions. In the code below, we select just those counties in North Jersey. 

The *group_by()* function allows you to aggregate your data by grouping variables. In the code below, we group the data by region in NJ prior to calculating the means. 

In [None]:
nj.counties.recode |> 
  filter(region == "north jersey") |>
  summarize(min_inc = min(medincome))

nj.counties.recode |> 
  group_by(region) |> 
  summarize(mean_pop = mean(totalpop, na.rm = TRUE), 
            mean_inc = mean(medincome, na.rm = TRUE))
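
As a further sketch, `filter()` accepts several conditions separated by commas (all must be true), and `summarize()` can count the rows in each group with `n()`. The data frame `toy` below is made up for illustration and does not contain real county values.

```r
library(dplyr)

# made-up data: two regions with two rows each
toy <- tibble(
  region    = c("north jersey", "north jersey", "south jersey", "south jersey"),
  totalpop  = c(900, 500, 300, 100),
  medincome = c(90000, 60000, 70000, 50000)
)

# keep rows that satisfy BOTH conditions
rich.north <- toy |> 
  filter(region == "north jersey", medincome > 80000)
nrow(rich.north)   # 1

# count rows and average population within each region
by.region <- toy |> 
  group_by(region) |> 
  summarize(n_counties = n(), 
            mean_pop = mean(totalpop))
by.region
```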

In the first command, the pipe passes the data to be filtered, then passes the filtered dataset to be summarized. In the second, we use the pipe to pass the data frame to be grouped, and then pass the grouped data to be summarized. If you want to insert the pipe using a keyboard shortcut in RStudio, you can use <kbd>ctrl</kbd> + <kbd>shift</kbd> + <kbd>m</kbd>. Note that the shortcut inserts the **tidyverse** pipe by default, but you can change it to the base R `|>` pipe in the settings. Check the "Use native pipe operator, |> (requires R 4.1+)" box in the Tools &rarr; Global Options &rarr; Code menu window.