# Introductory Data Management in R using base R and the `tidyverse`

## Taking a look around in R and RStudio

Notice the various windows. 

1. The Console allows you to enter commands and view results
2. The Environment window will show you the various objects R is keeping track of (vectors, datasets, lists, etc.)
3. The bottom right window will show you files, figures/graphs, help files, and more. 
4. If you create/open a script, it will appear in the top-left by default. I encourage using an R project and R script for nearly all work. 

## Installing and loading packages

R is open-source and has a number of built-in functions.^[This is called base R.] But there are many, many user-contributed packages that extend R's functionality. We'll be using these packages all the time, especially a group of packages called [the tidyverse](https://www.tidyverse.org/packages/). 

You only need to install a package once (until it needs updating), but you must load it in every R session in which you want to use its functions. Alternatively, you can call a function from a package that is installed but not loaded by writing the package name, two colons, and the function name (`package::function()`). Let's see how this works using the `pacman` package. `pacman` is a useful tool: it lets you install, load, and update packages, and it only installs packages that aren't already installed.

In [1]:
# Install pacman only if it is not already installed.
# By the way, the hashtag/pound/octothorpe symbol (#) comments out a line in your script.
if (!requireNamespace("pacman", quietly = TRUE)) {
  install.packages("pacman")
}
library(pacman)

# The pacman package is a useful way to install and load packages. Its
# p_load() function checks whether you have already installed a package.
# If so, it loads it. If not, it downloads and installs the package and
# THEN loads it. So it simplifies the process for you.

# Alternatively, if you have pacman installed, you can use the function
# without loading the library by writing the package name and :: like this.
# Now let's install (if needed) and load the tidyverse package:
pacman::p_load(tidyverse)


Installing package into ‘/home/dan/R/x86_64-redhat-linux-gnu-library/4.5’
(as ‘lib’ is unspecified)
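The `package::function()` form described above can be tried without installing anything, because a few packages ship with every R installation. The sketch below assumes only a standard R setup; `tools` and `stats` are used purely for illustration and are not part of the original lesson.

```r
# The tools package comes with R but is not attached by default,
# so we reach into it with :: instead of calling library(tools).
ext <- tools::file_ext("analysis.R")
ext

# :: also works for packages that are already attached, such as stats.
# Writing the package name makes it explicit where a function comes from.
mid <- stats::median(c(3, 1, 2))
mid

# requireNamespace() reports whether a package is installed
# without attaching it to the session.
has_tools <- requireNamespace("tools", quietly = TRUE)
has_tools
```

Spelling out the package name is also handy when two loaded packages export functions with the same name.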



## Some basics

R can handle a great diversity of *objects* including lists, variables, names, vectors, data frames, scalars, and plots. Let's create a vector of data using the `c()` function (short for "combine" or "concatenate"), which is the workhorse for building vectors of pretty much any type. We can assign this vector of values to an object with the assignment operator `<-`. You can read that as "gets". 

In [2]:
# the line below reads: "x gets the vector of values 1 through 10"
x <- c(1,2,3,4,5,6,7,8,9,10)

# we can display our vector by typing the name of object or by using the print() function
x
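Typing out every value gets tedious for longer sequences. As a small aside that is not part of the original example, the colon operator builds the same run of whole numbers; the object names below are illustrative.

```r
# typed out by hand, as in the cell above
x_typed <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# the colon operator generates the sequence 1, 2, ..., 10
x_colon <- 1:10

# the values match element by element ...
same_values <- all(x_typed == x_colon)
same_values

# ... but R stores them differently: c() with plain numbers gives
# "numeric" (double) data, while the colon operator gives "integer" data
class(x_typed)
class(x_colon)
```

For most data management tasks the numeric/integer distinction does not matter, but it explains why `identical()` can report a difference between two vectors that look the same.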

We can also create text data (called *character* data or *string* data) in the same way:

In [3]:
# notice the use of quotation marks around our text data
z <- c("a", "b", "c", "d", "e", "f", "g", "i", "j", "k")
z

R should recognize the format of these data. We can use the `class` function to see how R has stored our inputs.

In [4]:
class(x)
class(z)

Good - R appropriately sees `x` as a numeric **vector** (a column of data) and `z` as a character vector.

We can also change these formats. Sometimes changing them will damage the data: forcing text into a numeric vector, for example, turns any value that is not a number into missing data (`NA`) and produces a warning. Numeric data can always be stored as characters, although then R will not be able to use the vector for numerical calculations. 

In [5]:
# convert vector x from numeric to character
x <- as.character(x)
class(x)

# convert back to numeric:
x <- as.numeric(x)
class(x)
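To see what a failed conversion looks like, here is a minimal sketch using a made-up vector (`mixed` is hypothetical and not part of the lesson's data). The warning is suppressed only to keep the output tidy.

```r
# a character vector in which one entry is not a number
mixed <- c("1", "2", "three")

# as.numeric() converts what it can; "three" becomes NA.
# Without suppressWarnings(), R would warn about NAs introduced by coercion.
converted <- suppressWarnings(as.numeric(mixed))
converted

# is.na() flags which entries were lost in the conversion
lost <- is.na(converted)
lost
```

Checking `is.na()` after a conversion is a quick way to find entries that need cleaning before analysis.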

Right now, our vectors `x` and `z` exist as separate objects. We can check what objects R sees in our current environment with the `ls` (list objects) function:

In [6]:
ls()

However, we often interact with data in datasets, which R refers to as *data frames*. We can combine our two 10-element vectors into a data frame assigned as a named object:

In [7]:
# this function will stitch together one column of data with another
# so, 1 goes with a, 2 goes with b, etc

simple.df <- data.frame(x, z)


A handful of simple functions are useful for examining data frames:

- `head()`: shows the header and first several rows of the data frame. 
- `names()`: shows the names of each column in your data frame.
- `table()`: shows each distinct value in a column, sorted by value, along with the number of rows that have it.

In [8]:
# by default, head() shows the column names and 
#   the first six rows
head(simple.df)

# we can add rows to our glimpse of data by using the `n =` argument:
head(simple.df, n = 10)

,x,z
,<dbl>,<chr>
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f


,x,z
,<dbl>,<chr>
1,1,a
2,2,b
3,3,c
4,4,d
5,5,e
6,6,f
7,7,g
8,8,i
9,9,j
10,10,k


In [9]:
# names() is an easy function to get all of our column names 
#   we could then store this as a character vector if we wanted to or plug
#   it into some other function
names(simple.df)

When performing analyses on a data frame, we can access column data using the `df$column.name` format. The `table()` function below will present all values of a column of data and the number of rows with that value.

In [10]:
# show all unique values of x and the number of rows with each value
table(simple.df$x)

# now do z
table(simple.df$z)

# we can present all combinations of x and z and the counts for each
#   combined group of values
table(simple.df$x, simple.df$z)



 1  2  3  4  5  6  7  8  9 10 
 1  1  1  1  1  1  1  1  1  1 


a b c d e f g i j k 
1 1 1 1 1 1 1 1 1 1 

    
     a b c d e f g i j k
  1  1 0 0 0 0 0 0 0 0 0
  2  0 1 0 0 0 0 0 0 0 0
  3  0 0 1 0 0 0 0 0 0 0
  4  0 0 0 1 0 0 0 0 0 0
  5  0 0 0 0 1 0 0 0 0 0
  6  0 0 0 0 0 1 0 0 0 0
  7  0 0 0 0 0 0 1 0 0 0
  8  0 0 0 0 0 0 0 1 0 0
  9  0 0 0 0 0 0 0 0 1 0
  10 0 0 0 0 0 0 0 0 0 1
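The `$` accessor returns an ordinary vector, so any function that works on a vector works on a column. A minimal sketch, using a small stand-in data frame (`demo.df` is hypothetical) rather than `simple.df`:

```r
# a small stand-in data frame for illustration
demo.df <- data.frame(x = c(1, 2, 3), z = c("a", "b", "c"))

# pull out one column as a vector
demo.df$x

# double square brackets with a quoted name do the same thing,
# which is useful when the column name is stored in a variable
same_column <- identical(demo.df$x, demo.df[["x"]])
same_column

# because the column is just a vector, we can summarize it directly
avg_x <- mean(demo.df$x)
avg_x
```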

## The tidyverse 

The [`tidyverse`](https://www.tidyverse.org/packages/) is a suite of very popular packages that are useful for data management, data transformation, and graphing in R. By loading `tidyverse`, you also load functions from related packages, such as `readr`, for reading and importing data. 

The `tidyverse` uses a particular kind of data frame storage, known as a `tibble`. The `tidyverse` is very helpful when using data as entire datasets. We can easily filter out observations, aggregate data, and recode variables using `tidyverse` functions. 

In [11]:
# we can convert our data frame to the tibble format easily with as_tibble():
simple.df <- as_tibble(simple.df)
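One way to confirm what the conversion did is to compare the class of an object before and after. This sketch assumes the `tibble` package is installed (it comes with the `tidyverse`) and uses a small stand-in data frame rather than `simple.df`.

```r
library(tibble)

# a plain base R data frame
plain.df <- data.frame(x = c(1, 2, 3), z = c("a", "b", "c"))
class(plain.df)

# the same data as a tibble
tidy.df <- as_tibble(plain.df)
class(tidy.df)

# a tibble is still a data frame, so base R tools keep working on it
still_df <- is.data.frame(tidy.df)
still_df
```

Because a tibble carries the `data.frame` class as well, code written for base R data frames generally continues to work after the conversion.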

## Creating and recoding variables

Many R functions and operators are vectorized; that is, they can take a column of data as an input and perform the same procedure on each cell in the column. So we can easily create a new column of data by performing mathematical operations on existing columns. Here are two ways to accomplish this: 

In [12]:
# base R:
simple.df$x2 <- simple.df$x^2

# in tidyverse (technically dplyr package)
simple.df <- simple.df |> mutate(x3 = x^3)

# let's check our work:
simple.df

x,z,x2,x3
<dbl>,<chr>,<dbl>,<dbl>
1,a,1,1
2,b,4,8
3,c,9,27
4,d,16,64
5,e,25,125
6,f,36,216
7,g,49,343
8,i,64,512
9,j,81,729
10,k,100,1000


We probably need to interpret that `dplyr` line of code:

`simple.df <- simple.df |> mutate(x3 = x^3)`

The `|>` is known as the "pipe" (built into R since version 4.1); it takes the output from what is to its left and "pipes" it into the function on the right as that function's first argument. So in this case, we take our data frame (`simple.df`) and pipe it into the `mutate` function, which is the `tidyverse` function for changing and transforming columns of data in a data frame. Here, we create a new column or variable named `x3` and set it equal to an existing column `x` to the third power. 

If we omitted the `simple.df <-` part at the beginning of the line, R would perform the requested action and print the result in the console, but it would leave `simple.df` unchanged in our environment. Here we assign our new data frame over our existing one. Be careful when you do this - there are many instances in which it would be better to assign your new data to a different data frame rather than writing over the existing one. 
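Both points, how the pipe passes data along and what happens without assignment, can be checked with base R alone. A minimal sketch with a made-up vector (`values` is illustrative), assuming R 4.1 or later for the `|>` pipe:

```r
values <- c(1, 4, 9)

# piping values into sqrt() is the same as calling sqrt(values)
piped  <- values |> sqrt()
nested <- sqrt(values)
same_result <- identical(piped, nested)
same_result

# without an assignment, the result is printed but not stored ...
values |> rev()

# ... so the original object is unchanged
values

# assigning to a new name keeps both versions available
values.reversed <- values |> rev()
values.reversed
```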

We can do more complex recodes as well. In the code below, we use `if_else()` and `case_when()` functions inside of `mutate()` to create a binary variable and an ordinal variable, respectively. Notice that our two new columns of data are both inside the same `mutate()` function call, just separated by a comma. This way you can string together many variable recodes in the same section of code. 

In [13]:
# if_else() allows us to use conditional statements. The first argument is 
#   the condition. In this case, x is less than or equal to 5. 
#   The second argument is the value for rows where the condition is met.
#   In this case, we'll give it a 1. The third argument is 
#   the value for rows where the condition isn't met, here a 2.
#   The last argument is the value for rows where the condition itself is
#   missing. Here, we give the standard NA code for missing data in R.
#   So our new column will be named day.interviewed and will receive the
#   results of the if_else() statement. 

# case_when() is similar. Each line has a condition on the left-hand side (here, values
#   of our string or character variable z) and the code to assign in the new
#   column on the right-hand side. The tilde (~) separates the two. So rows where
#   z is a, b, or c are assigned a value of 1 in the new column.
#   Note the last statement, TRUE ~ NA. TRUE here captures all other values not
#   identified in the previous conditions and codes them as missing data (NA).

# In these conditional statements, you can use == (equals), < (less than), > (greater than),
#   <= (less than or equal to), >= (greater than or equal to), and != (does not equal), 
#   among others. You can also use Boolean operators like & (and) and | (or) to connect 
#   multiple conditions together in the same statement. 

simple.df <- simple.df |> 
                mutate(day.interviewed = if_else(x <= 5,
                                              true = 1,
                                              false = 2,
                                              missing = NA),
                       treat_ordered = case_when(
                                            z == "a" | z == "b" | z == "c" ~ 1,
                                            z == "d" | z == "e" | z == "f" ~ 2,
                                            z == "g" | z == "h" | z == "i" ~ 3,
                                            TRUE ~ NA)
                      )

simple.df

x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,a,1,1,1,1.0
2,b,4,8,1,1.0
3,c,9,27,1,1.0
4,d,16,64,1,2.0
5,e,25,125,1,2.0
6,f,36,216,2,2.0
7,g,49,343,2,3.0
8,i,64,512,2,3.0
9,j,81,729,2,
10,k,100,1000,2,
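Long chains of `==` joined by `|` can get hard to read. As an alternative sketch (not the approach used in the cell above), the base R operator `%in%` checks whether each value appears in a set. The data frame `demo` is a hypothetical stand-in, and `NA_real_` is used instead of `NA` on the assumption that some `dplyr` versions are strict about matching types.

```r
library(dplyr)

# a small stand-in data frame with a few letters
demo <- data.frame(z = c("a", "d", "g", "k"))

demo <- demo |>
  mutate(treat_ordered = case_when(
    z %in% c("a", "b", "c") ~ 1,   # same as z == "a" | z == "b" | z == "c"
    z %in% c("d", "e", "f") ~ 2,
    z %in% c("g", "h", "i") ~ 3,
    TRUE ~ NA_real_                # everything else becomes missing
  ))

demo
```

Here "k" falls through to the final line and is coded as missing, just as "j" and "k" are in the output above.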


## Selecting specific columns and rows of your data

You can easily limit your data frame to particular rows or columns - we might do this if we want to run an analysis on just a subset of observations or make a table using a subset of variables. To accomplish this, we can use `dplyr` functions bundled in the `tidyverse`.

#### `filter()`

The `filter()` function allows you to select specific rows (or observations) in your data. Let's go over some examples:

In [14]:
# keep rows where x is less than 6
simple.df |> filter(x < 6)

# keep rows where x equals 6 (note the use of two equal signs
#   to signal an equality)
simple.df |> filter(x == 6)

# keep rows where x is less than or equal to 6
simple.df |> filter(x <= 6)

# keep rows where x is greater than 4 AND less than 8
simple.df |> filter(x > 4 & x < 8)

# keep rows where z equals d OR k. Note that string data 
#   must be included inside of quotes
simple.df |> filter(z == "d" | z == "k")

x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,a,1,1,1,1
2,b,4,8,1,1
3,c,9,27,1,1
4,d,16,64,1,2
5,e,25,125,1,2


x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
6,f,36,216,2,2


x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
1,a,1,1,1,1
2,b,4,8,1,1
3,c,9,27,1,1
4,d,16,64,1,2
5,e,25,125,1,2
6,f,36,216,2,2


x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
5,e,25,125,1,2
6,f,36,216,2,2
7,g,49,343,2,3


x,z,x2,x3,day.interviewed,treat_ordered
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>
4,d,16,64,1,2.0
10,k,100,1000,2,
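Missing values deserve a note here: `filter()` keeps only rows where the condition is `TRUE`, so rows where the condition evaluates to `NA` are dropped. A minimal sketch with a hypothetical data frame (`demo` and `score` are made up for illustration):

```r
library(dplyr)

demo <- data.frame(id = 1:4, score = c(10, NA, 30, NA))

# rows with a missing score are dropped, because NA > 5 is NA, not TRUE
above_five <- demo |> filter(score > 5)
above_five

# is.na() lets us keep only the rows with missing data ...
missing_rows <- demo |> filter(is.na(score))
missing_rows

# ... and !is.na() keeps only the complete ones
complete_rows <- demo |> filter(!is.na(score))
complete_rows
```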


#### `select()`

The `select()` function allows you to choose columns. 

In [15]:
# select() takes as arguments the names of your columns. No quotes needed.
simple.df |> select(x, z)

# we could also select "not" something, using the ! symbol
#   although we need to include our list of names inside of 
#   c() for this to work.
simple.df |> select(!c(x, z))

x,z
<dbl>,<chr>
1,a
2,b
3,c
4,d
5,e
6,f
7,g
8,i
9,j
10,k


x2,x3,day.interviewed,treat_ordered
<dbl>,<dbl>,<dbl>,<dbl>
1,1,1,1.0
4,8,1,1.0
9,27,1,1.0
16,64,1,2.0
25,125,1,2.0
36,216,2,2.0
49,343,2,3.0
64,512,2,3.0
81,729,2,
100,1000,2,
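`select()` also understands a few shortcuts that save typing when there are many columns. The sketch below shows three of them on a small stand-in data frame (`demo` is hypothetical); these go beyond what the cell above covers.

```r
library(dplyr)

demo <- data.frame(x  = 1:3,
                   z  = c("a", "b", "c"),
                   x2 = (1:3)^2,
                   x3 = (1:3)^3)

# starts_with() picks every column whose name begins with the given text
x_cols <- demo |> select(starts_with("x"))
names(x_cols)

# a colon selects a run of neighboring columns, from x through x2
run_cols <- demo |> select(x:x2)
names(run_cols)

# a minus sign drops a column and keeps the rest
no_z <- demo |> select(-z)
names(no_z)
```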


The pipe is very powerful, and it lets us connect together functions in an efficient way. Let's use both `filter()` and `select()` together:

In [16]:
# this series of functions will take our data frame,
#   pass it to select, which grabs only x2 and x3,
#   and then pipe the result into filter, which keeps just 
#   rows with values of x2 greater than 25 and no greater than 81.
simple.df |> select(x2, x3) |> filter(x2 > 25 & x2 <= 81)

x2,x3
<dbl>,<dbl>
36,216
49,343
64,512
81,729
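The order of the steps in a pipeline matters, because each function only sees what the previous one passed along. The sketch below uses a stand-in data frame (`demo` is hypothetical) to show that filtering on a column only works while that column is still present.

```r
library(dplyr)

demo <- data.frame(x = 1:10, x2 = (1:10)^2, x3 = (1:10)^3)

# select first, then filter: works because x2 survives the select() step
select_first <- demo |> select(x2, x3) |> filter(x2 > 25 & x2 <= 81)

# filter first, then select: gives the same rows and columns
filter_first <- demo |> filter(x2 > 25 & x2 <= 81) |> select(x2, x3)

select_first
filter_first

# filtering on x after it has been dropped would fail, because
# filter() can no longer find the column:
# demo |> select(x2, x3) |> filter(x > 5)
```

When a filter depends on a column you do not want in the final result, put `filter()` before `select()`.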


## Reshaping or transposing your data

Imagine that these data were survey interviews, and the unit of analysis is the ***respondent-interview***. In other words, imagine we have 5 respondents, each interviewed twice, with the day recorded in `day.interviewed`. Let's create a quick identifying variable and add it to our data frame with some clever mutating:

In [17]:
simple.df <- simple.df |> mutate(id = if_else(x <= 5,
                                            true = x,
                                            false = x - 5,
                                            missing = NA))
simple.df

x,z,x2,x3,day.interviewed,treat_ordered,id
<dbl>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,a,1,1,1,1.0,1
2,b,4,8,1,1.0,2
3,c,9,27,1,1.0,3
4,d,16,64,1,2.0,4
5,e,25,125,1,2.0,5
6,f,36,216,2,2.0,1
7,g,49,343,2,3.0,2
8,i,64,512,2,3.0,3
9,j,81,729,2,,4
10,k,100,1000,2,,5


Now, let's say we want to reorganize this data so that each row is a ***respondent***, and the data for each interview session are represented in different columns. In this case, we need to take information that is currently stored in *rows* (the additional interviews) and move it into new *columns*. Let's further assume that variables `x`, `z`, `x2`, `x3`, and `treat_ordered` are all interview-level data, that is, they vary by the interview session. So basically, let's reshape our data from having the ***respondent-interview*** as the unit of analysis to having just the ***respondent*** be the unit of analysis. We can do that with the `pivot_wider()` function, which moves data from rows into new columns. `pivot_longer()` does the reverse; it moves data coded in columns into new rows.

In [18]:
simple.df.wide <- simple.df |> pivot_wider(id_cols = id, names_from = day.interviewed, values_from = c(x, z, x2, x3, treat_ordered))

simple.df.wide

id,x_1,x_2,z_1,z_2,x2_1,x2_2,x3_1,x3_2,treat_ordered_1,treat_ordered_2
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,1,6,a,f,1,36,1,216,1,2.0
2,2,7,b,g,4,49,8,343,1,3.0
3,3,8,c,i,9,64,27,512,1,3.0
4,4,9,d,j,16,81,64,729,2,
5,5,10,e,k,25,100,125,1000,2,


Pretty neat, huh? Now the interview session (the values from `day.interviewed`) has moved into the suffixes of the variable names in the header. So we have two columns of data for `x` - one representing the first interview and another representing the second. 
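A stripped-down version of the same reshape may make the mechanics easier to follow. This sketch uses a hypothetical two-respondent data frame (`long`, `day`, and `score` are made up) with a single value column, plus the `names_prefix` argument to make the new column names more descriptive.

```r
library(tidyr)

# two respondents, each interviewed on day 1 and day 2
long <- data.frame(id    = c(1, 1, 2, 2),
                   day   = c(1, 2, 1, 2),
                   score = c(10, 12, 20, 25))

# one row per respondent; the day values become part of the column names
wide <- long |>
  pivot_wider(id_cols = id,
              names_from = day,
              values_from = score,
              names_prefix = "score_day")

wide
names(wide)
```

With only one `values_from` column, the new names come straight from the `day` values, so a prefix helps keep them readable.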

How would we move data like this back into a *long* format, with a ***respondent-interview*** unit of analysis? It's trickier to get the code right, but still possible to do. 

In [19]:
# ok, this one is really tough! Here, we need to tell R which columns should be shifting
#   to rows. We use `x_1:treat_ordered_2` because it will include those two
#   and the columns in between. We leave off `id` because it will be repeated when we
#   switch back to respondent-interview units. The second argument, names_to, says that
#   the first piece of each column name stays a variable name (".value") and the second
#   piece goes into a new variable called day.interviewed. The last argument, names_pattern,
#   is a regular expression describing how our variable names are structured: some text (.*),
#   an underscore (_), and then other text (.*). The 1s and 2s matched by the second group
#   are placed back into day.interviewed (as character data) and removed from the variable names.

simple.df.longer <- simple.df.wide |>
                        pivot_longer(x_1:treat_ordered_2, 
                                    names_to = c(".value", "day.interviewed"), 
                                    names_pattern = "(.*)_(.*)")
simple.df.longer

id,day.interviewed,x,z,x2,x3,treat_ordered
<dbl>,<chr>,<dbl>,<chr>,<dbl>,<dbl>,<dbl>
1,1,1,a,1,1,1.0
1,2,6,f,36,216,2.0
2,1,2,b,4,8,1.0
2,2,7,g,49,343,3.0
3,1,3,c,9,27,1.0
3,2,8,i,64,512,3.0
4,1,4,d,16,64,2.0
4,2,9,j,81,729,
5,1,5,e,25,125,2.0
5,2,10,k,100,1000,
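Notice in the output that `day.interviewed` came back as character data (`<chr>`), because it was rebuilt from pieces of column names. One simple way to restore it is to convert the column after pivoting. The sketch below is self-contained and uses a hypothetical two-respondent data frame (`wide`) rather than `simple.df.wide`.

```r
library(tidyr)

wide <- data.frame(id  = c(1, 2),
                   x_1 = c(1, 2),
                   x_2 = c(6, 7),
                   z_1 = c("a", "b"),
                   z_2 = c("f", "g"))

long <- wide |>
  pivot_longer(x_1:z_2,
               names_to = c(".value", "day.interviewed"),
               names_pattern = "(.*)_(.*)")

# the session number was extracted from text, so it is stored as character
class(long$day.interviewed)

# convert it back to a number so it can be sorted or compared numerically
long$day.interviewed <- as.numeric(long$day.interviewed)
class(long$day.interviewed)

long
```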


Transposing is really quite difficult, and I typically need to read the supporting documentation to figure out exactly how to do it. That's ok! You can always look up information about functions using the `help()` function in the console. Go ahead and try it! At the end of most help files will be examples. They aren't *always* that useful for new users, but they're at least a place to start. 

In [20]:
help(pivot_longer)

pivot_longer               package:tidyr               R Documentation

_P_i_v_o_t _d_a_t_a _f_r_o_m _w_i_d_e _t_o _l_o_n_g

_D_e_s_c_r_i_p_t_i_o_n:

     ‘pivot_longer()’ "lengthens" data, increasing the number of rows
     and decreasing the number of columns. The inverse transformation
     is ‘pivot_wider()’

     Learn more in ‘vignette("pivot")’.

_U_s_a_g_e:

     pivot_longer(
       data,
       cols,
       ...,
       cols_vary = "fastest",
       names_to = "name",
       names_prefix = NULL,
       names_sep = NULL,
       names_pattern = NULL,
       names_ptypes = NULL,
       names_transform = NULL,
       names_repair = "check_unique",
       values_to = "value",
       values_drop_na = FALSE,
       values_ptypes = NULL,
       values_transform = NULL
     )
     
_A_r_g_u_m_e_n_t_s:

    data: A data frame to pivot.

    cols: <‘tidy-select’> Columns to pivot into longer format.

     ...: Additional arguments passed on