## Describing Variables Using Univariate Statistics and Graphs

In this lab, we will cover a number of core functions for calculating simple univariate statistics in R and for building tables of those descriptive statistics. Before we do that, let's get our required packages, download our data, and do some pre-processing of the data.

In [None]:
# Install pacman only if it is not already installed
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

# load the packages into the R session (p_load() also installs any that are missing)
pacman::p_load(tidyverse, modelsummary)

Now let's load our dataset. You can pull it directly from my GitHub repository using the URL and the `read_csv()` function from **tidyverse**'s **readr** package. The dataset comes from the NJ Department of Education (with some editing by me) and shows the Adjusted 4-Year Cohort Graduation Rate for various populations of students ([Source](https://www.nj.gov/education/schoolperformance/grad/ACGR.shtml)). These data are from 2022.

In [None]:
# we need to give the data frame a name. read_csv() stores the data as a tibble.
nj.grad <- read_csv("https://raw.githubusercontent.com/bowendc/pol200_labs/main/grad_rates.csv")  

# take a look using head() or View()
head(nj.grad, n = 10)

## Using core statistical functions

Base R comes with a number of built-in functions for calculating statistics. Here are some examples of functions you might want to use. Note that the `na.rm = TRUE` argument tells the function to remove missing values before calculating. Otherwise, R will return `NA` whenever the variable contains any missing data. You will nearly always want to include this argument, unless you know the variable has no missing values.

In [None]:
mean(nj.grad$grad.rate.all, na.rm = TRUE)
median(nj.grad$grad.rate.all, na.rm = TRUE) #50th percentile
range(nj.grad$grad.rate.all, na.rm = TRUE)
min(nj.grad$grad.rate.all, na.rm = TRUE)
max(nj.grad$grad.rate.all, na.rm = TRUE)
sd(nj.grad$grad.rate.all, na.rm = TRUE)
quantile(nj.grad$grad.rate.all, probs = .25, na.rm = TRUE) # 25th percentile
quantile(nj.grad$grad.rate.all, probs = .75, na.rm = TRUE) # 75th percentile

# you can also do calculations with these functions: 
mean(nj.grad$grad.rate.all, na.rm = TRUE) - sd(nj.grad$grad.rate.all, na.rm = TRUE)

# or store them as objects: 
rate.mean <- mean(nj.grad$grad.rate.all, na.rm = TRUE)
rate.sd <- sd(nj.grad$grad.rate.all, na.rm = TRUE)

# let's call up our stored statistics, plug them into an equation, and round to the nearest integer
round(rate.mean + rate.sd, 0)
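
To see why `na.rm = TRUE` matters, here is a minimal sketch using a small made-up vector. The object `rates` and its values are hypothetical and are not taken from `nj.grad`.

```r
# Hypothetical graduation rates; NA stands in for a district with no reported value
rates <- c(90, 85, NA, 95)

# without na.rm, a single missing value makes the result NA
mean(rates)

# with na.rm = TRUE, the NA is dropped first: (90 + 85 + 95) / 3
mean(rates, na.rm = TRUE)

# count how many values are missing before deciding how to handle them
sum(is.na(rates))
```

Checking `sum(is.na(...))` first is a quick way to learn how many observations a statistic is silently leaving out.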


These functions can also be used inside the `summarize()` function from **tidyverse**. `summarize()` has two useful properties: it lets us aggregate by another variable when combined with `group_by()`, and it returns the result as a tibble.

In [None]:
nj.grad |> summarize(Mean = mean(grad.rate.all, na.rm = TRUE),
                     Median = median(grad.rate.all, na.rm = TRUE),
                     `Standard Deviation` = sd(grad.rate.all, na.rm = TRUE)) # backticks are needed because the name contains a space

# let's remake the table, this time grouping by NJ county: 
nj.grad |> 
    group_by(County.Name) |>
    summarize(Mean = mean(grad.rate.all, na.rm = TRUE),
              Median = median(grad.rate.all, na.rm = TRUE),
              `Standard Deviation` = sd(grad.rate.all, na.rm = TRUE))
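
The grouping step is easier to follow on a tiny example. The sketch below uses a hypothetical tibble, `toy`, with made-up county labels and rates; it also adds `n()` and a count of missing values, which are not part of the lab's table but can help when checking how many rows sit behind each group's mean.

```r
library(dplyr)

# Hypothetical data: two counties with made-up graduation rates
toy <- tibble(
  county = c("A", "A", "B", "B", "B"),
  rate   = c(90, 80, 70, NA, 90)
)

toy_summary <- toy |>
  group_by(county) |>
  summarize(
    Mean      = mean(rate, na.rm = TRUE),  # mean of the non-missing rates in each county
    Districts = n(),                       # rows per county, including rows with NA
    Missing   = sum(is.na(rate))           # number of missing rates per county
  )

toy_summary
```

Each county collapses to one row, so the result has as many rows as there are groups.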

## Creating a table of descriptive statistics 

To create a table of descriptive statistics, we will use the `datasummary_skim()` and `datasummary()` functions from the **modelsummary** package. These produce clean, well-formatted tables with a range of options for viewing, exporting, or saving. We could also combine them with the `select()` function from **dplyr** in the **tidyverse** if we want to summarize only some columns. Check out the `output` argument in the help file to see the available formats. In the second example below, I'm using `output = "desc_stats.csv"` so that the table is written directly to a `.csv` file (named "desc_stats.csv"), ready for copying and pasting into a paper if needed.

In [None]:
# datasummary_skim() has default settings for each type of variable in your data frame
# you should see an image of a table pop up in your plot window or web browser.
# notice how datasummary_skim() handles numeric and categorical variables differently. 
nj.grad |> 
    datasummary_skim() 

# left-hand side of ~ receives the variable names, separated by +
# right-hand side of ~ receives the statistics you'd like to present in your table
# what I have below is standard: 

# here we are using datasummary(), not datasummary_skim()
datasummary(
            grad.rate.all + grad.rate.poor + grad.rate.Eng.learners + count.all  ~ 
            N + Mean + Median + SD + Min + Max,
        data = nj.grad,
        align = "lcccccc",                          # alignment of columns. l = left, c = center, r = right
        title = "Table of Descriptive Statistics", 
        fmt = 1,                                    # the number of decimal places for rounding
        output = "desc_stats.csv"                   # the output format. Lots of options, check help file
    )
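
If you would rather inspect the table inside R than write it to a file, one option is `output = "data.frame"`. The sketch below assumes a hypothetical data frame, `toy`, with made-up values standing in for `nj.grad`; the formula follows the same pattern as above, with variables on the left of `~` and statistics on the right.

```r
library(modelsummary)

# Hypothetical data standing in for nj.grad
toy <- data.frame(
  rate  = c(90, 85, 95, 70),
  count = c(120, 300, 80, 150)
)

# output = "data.frame" returns the table as an R object instead of writing a file
tab <- datasummary(rate + count ~ N + Mean + SD,
                   data = toy,
                   fmt = 1,
                   output = "data.frame")

tab
```

The returned table has one row per variable listed on the left-hand side of the formula.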