## Describing Variables Using Univariate Statistics and Graphs

We have already covered several functions for calculating simple statistics (quantities of interest calculated from your data): `mean()`, `sd()`, `median()`, `min()`, `max()`, and `range()`. You can also use `var()` to calculate the variance. Each of these functions can be used in base R syntax or in **tidyverse** by piping in the data and combining with `summarize()`: `df |> summarize(some_name = mean(x))`.
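
As a quick illustration of the two styles, here is a minimal sketch on a made-up data frame (`toy` and its column `x` are invented for this example and are not part of the lab data):

```r
library(dplyr)  # part of the tidyverse; provides summarize()

# Toy data, invented for illustration only
toy <- data.frame(x = c(2, 4, 4, 5, 10))

# Base R: call each function on the column directly
mean(toy$x)
median(toy$x)
sd(toy$x)
var(toy$x)

# tidyverse: pipe the data frame into summarize()
toy_stats <- toy |>
  summarize(mean_x   = mean(x),
            median_x = median(x),
            sd_x     = sd(x),
            min_x    = min(x),
            max_x    = max(x))
toy_stats
```

If a column contains missing values, add `na.rm = TRUE` inside each function, for example `mean(x, na.rm = TRUE)`.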

We can also create simple tables of these univariate descriptive statistics. Before we do that, let's get our required packages, download our data, and do some pre-processing of the data.

In [None]:
# Install the required packages if not already installed 
# By the way, the hashtag/pound/octothorpe symbol will comment out a line in your script

install.packages(c('tidyverse', 'vtable'))

# let's load your packages in the R session

library(tidyverse)
library(vtable)

Now let's load our dataset. You can pull it directly from my GitHub repository using the URL and the `read_csv()` function from **tidyverse**. The dataset comes from the NJ Department of Education and shows the Adjusted 4-Year Cohort Graduation Rate for various populations of students ([Source](https://www.nj.gov/education/schoolperformance/grad/ACGR.shtml)). These data are from 2022.

In [None]:
# The `name_repair = make.names` argument replaces the spaces in the column headers with periods, which will make our lives easier

nj.grad <- read_csv("https://raw.githubusercontent.com/bowendc/510_labs/main/nj_schools_grad_rate.csv", name_repair = make.names)  

# take a look at your data frame using view() or by clicking on
# the table icon in the Environment window next to the data frame.

view(nj.grad)

We need to process these data a bit. There are a couple of weird issues we need to address. First, missing values in the data are recorded as "N" and "*" rather than with R's standard `NA` missing value code. I suspect R is reading these variables in as "character" or "string" data (text), rather than as "numeric" variables. Let's check quickly:

In [None]:
class(nj.grad$Cohort.Count)

Yup - it's a character variable. We can convert this variable (and any other that we want) to numeric using the `as.numeric()` function. This function will force any non-numeric entry to be coded as NA (R will print a warning that NAs were introduced by coercion, which is expected here).

In [None]:
nj.grad <- nj.grad |> mutate(Graduation.Rate.Num = as.numeric(Graduation.Rate), # here we write over our current nj.grad data frame
                             Cohort.Count.Num = as.numeric(Cohort.Count))       # and add two new variables.

# let's check to make sure it worked. We're not assigning this, just printing it:
nj.grad |> select(Graduation.Rate, Graduation.Rate.Num,
                  Cohort.Count, Cohort.Count.Num) |>
              slice(1:20)
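
To see the coercion in isolation, here is a small sketch on made-up values (the vector `raw` is invented for illustration; "N" and "*" mimic the codes in the real data):

```r
# Toy values, invented for illustration only
raw <- c("95.2", "N", "88", "*")

class(raw)                                  # "character"

# as.numeric() warns that NAs were introduced by coercion;
# suppressWarnings() just keeps this example quiet
num <- suppressWarnings(as.numeric(raw))

num                                         # the two text codes are now NA
sum(is.na(num))                             # how many entries became NA
mean(num, na.rm = TRUE)                     # drop the NAs before averaging
```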

Good! Now on to the second issue. The structure of the data is odd here. What is the **unit of analysis**? Right now each row is a school-by-student-group combination. In most cases, we would rather have the student groups represented as *columns* rather than *rows*, so that each row is a single school. To do that we need to **reshape** the data from a `long` format to a `wide` format. We can do that with `pivot_wider()`.

In [None]:
nj.grad.wide <- nj.grad |> pivot_wider(names_from = Student.Group,  # these values will become new variable headers
                                       values_from = c(Graduation.Rate, # these are the data values to populate the new variables
                                                       Cohort.Count,
                                                       Graduated,
                                                       Graduation.Rate.Num,
                                                       Cohort.Count.Num),
                                        names_repair = make.names)   # again, we can fix those spaces in variable names

head(nj.grad.wide)
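
If the reshape is hard to picture, here is a minimal sketch on a made-up long data frame (`long` and its columns are invented for illustration). It also shows where the new column names come from: each one is the `values_from` name, an underscore, and the group name.

```r
library(tidyr)  # part of the tidyverse; provides pivot_wider()

# Toy long-format data, invented for illustration: one row per school-group pair
long <- data.frame(
  School = c("A", "A", "B", "B"),
  Group  = c("Total", "Female", "Total", "Female"),
  Rate   = c(90, 93, 85, 88),
  Count  = c(200, 100, 150, 80)
)

wide <- long |>
  pivot_wider(names_from  = Group,            # group labels become column-name suffixes
              values_from = c(Rate, Count))   # these columns supply the cell values

wide          # now one row per school
names(wide)   # "School" "Rate_Total" "Rate_Female" "Count_Total" "Count_Female"
```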

## Creating a table of descriptive statistics 

To create a table of descriptive statistics, we will use the *sumtable()* function of the **vtable** package. It produces a clean, well-formatted table and gives us a nice set of options for exporting, viewing, or saving it. The *select()* function is part of **tidyverse** and lets us keep only the columns we want to summarize. You could also do this directly in *sumtable()* using the *vars* argument. Check out the *out* argument in the help file to see some other options. I'm using *out = "return"* so that the table will show up as output in the console, but the default settings will open an HTML version in the RStudio Viewer pane or your web browser. You can also export to a csv file.

In [None]:
nj.grad.wide |> 
    select(starts_with(c("Graduation.Rate.Num", "Cohort.Count.Num"))) |>    # starts_with() keeps every column whose name begins with either prefix
    sumtable(out = "return")       
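
For comparison, here is a sketch of the *vars* route on a made-up data frame (`toy` and its columns are invented for illustration). It assumes **vtable** is installed and skips the *select()* step entirely.

```r
library(vtable)

# Toy data, invented for illustration only
toy <- data.frame(rate  = c(90, 85, 97, NA, 78),
                  count = c(200, 150, 320, 40, 95),
                  id    = 1:5)

# vars names the columns to summarize, so id is left out of the table
tab <- sumtable(toy, vars = c("rate", "count"), out = "return")
tab
```

Because `out = "return"` hands the table back as a data frame, you can store it in an object like `tab` and work with it further.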

## Making Graphs with ggplot2

**ggplot2** is an outstanding, albeit confusing, graphic syntax for R, bundled as part of the **tidyverse**. Let's explore some of the basic plot types and `ggplot()` syntax. To make our lives easier, let's process the data once more to remove District and State-wide graduation rates, leaving us only the high school data.

In [None]:
hs <- nj.grad.wide |> filter(School.Name != "District" & County.Name!="State") # != means "does not equal". & means "and"

### Histograms and density plots

Histograms are graphs that present the distribution of a variable measured at the continuous or interval level by grouping the values into "bins". 

In [None]:
# in base R
hist(hs$Graduation.Rate.Num_Total)

# in ggplot

ggplot(data = hs, mapping = aes(x = Graduation.Rate.Num_Total)) +
    geom_histogram(binwidth = 10)                       # try changing the binwidth
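
To make the idea of "bins" concrete, here is a small sketch that does the counting by hand in base R. The `rates` vector is invented for illustration, and the bin edges are chosen by hand, so they need not match the edges ggplot picks.

```r
# Toy graduation rates, invented for illustration only
rates <- c(62, 71, 78, 83, 85, 88, 91, 92, 95, 97)

# Bins of width 10 from 60 to 100; plot = FALSE returns the counts without drawing
h <- hist(rates, breaks = seq(60, 100, by = 10), plot = FALSE)

h$counts                                # number of schools in each bin

# Converting counts to percentages: divide by the total and multiply by 100
pct <- 100 * h$counts / sum(h$counts)
pct                                     # share of schools in each bin
sum(pct)                                # the shares add up to 100
```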

Now, let's make it look better by changing the y-axis to a percentage, fiddling with the theme and colors, and adding labels: 

In [None]:
ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Total)) +
  geom_histogram(aes(y = after_stat(100 * count / sum(count))), binwidth = 5,  # after_stat() replaces the deprecated ..count.. notation
                 fill = "navy") +
    labs(title = "2022 Adjusted 4-Year Graduation Rates, NJ High Schools",
       x = "Graduation Rate",
       y = "Percentage") +
  theme_minimal()

Better! What if we wanted to compare the distributions of two variables? Let's add another `geom_histogram()` layer to this graph. To improve visibility, let's change the opacity of the colors using the argument `alpha`.

In [None]:
ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Total)) +
  geom_histogram(aes(y = after_stat(100 * count / sum(count))), binwidth = 5,
                 fill = "navy", alpha = .3) +
  geom_histogram(aes(x = Graduation.Rate.Num_Economically.Disadvantaged.Students,
                     y = after_stat(100 * count / sum(count))), binwidth = 5,
                 fill = "gold", alpha = .3) +
    labs(title = "2022 Graduation Rates, NJ High Schools",
       x = "Graduation Rate",
       y = "Percentage") +
  theme_minimal()

We can present the same graph as smoothed distributions using `geom_density` instead of `geom_histogram`:

In [None]:
ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Total)) +
  geom_density(aes(y = after_stat(100 * count / sum(count))),
               fill = "navy", alpha = .3) +
  geom_density(aes(x = Graduation.Rate.Num_Economically.Disadvantaged.Students,
                   y = after_stat(100 * count / sum(count))),
               fill = "gold", alpha = .3) +
    labs(title = "2022 Graduation Rates, NJ High Schools",
       x = "Graduation Rate",
       y = "Percentage") +
  theme_minimal()

We could even add reference lines if we want. Here we mark each group's mean (solid line) and median (dashed line).

In [None]:
# store means and medians
mean.total <- mean(hs$Graduation.Rate.Num_Total, na.rm = TRUE)
med.total <- median(hs$Graduation.Rate.Num_Total, na.rm = TRUE)
mean.eds <- mean(hs$Graduation.Rate.Num_Economically.Disadvantaged.Students, na.rm = TRUE)
med.eds <- median(hs$Graduation.Rate.Num_Economically.Disadvantaged.Students, na.rm = TRUE)

ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Total)) +
  geom_density(aes(y = after_stat(100 * count / sum(count))),
               fill = "navy", alpha = .3) +
  geom_density(aes(x = Graduation.Rate.Num_Economically.Disadvantaged.Students,
                   y = after_stat(100 * count / sum(count))),
               fill = "gold", alpha = .3) +
  geom_vline(aes(xintercept = mean.total), color = "navy") +
  geom_vline(aes(xintercept = mean.eds), color = "gold") +
  geom_vline(aes(xintercept = med.total), color = "navy", linetype = "dashed") +
  geom_vline(aes(xintercept = med.eds), color = "gold", linetype = "dashed") +
    labs(title = "2022 Graduation Rates, NJ High Schools",
       x = "Graduation Rate",
       y = "Percentage") +
  theme_minimal()


### Bar Graphs

Bar graphs are distributional graphs, like histograms, but for discrete (ordinal or categorical/nominal) data. 

In [None]:
ggplot(data = hs, mapping = aes(x = County.Name)) +
    geom_bar()

# sorting the bars by the count of high schools in each county
hs |> count(County.Name) |>
    ggplot(mapping = aes(x = reorder(County.Name, n), y = n)) +
        geom_bar(stat = 'identity') 
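
To see what `count()` and `reorder()` are doing before the plot is drawn, here is a small sketch on a made-up data frame (`schools` is invented for illustration):

```r
library(dplyr)  # part of the tidyverse; provides count()

# Toy data, invented for illustration: one row per school
schools <- data.frame(County = c("Essex", "Bergen", "Essex", "Salem", "Bergen", "Essex"))

# count() collapses the data to one row per county, with the tally stored in n
counts <- schools |> count(County)
counts

# reorder() returns a factor whose levels are sorted by n, smallest first
asc <- levels(reorder(counts$County, counts$n))
asc

# negating n sorts from largest to smallest instead
desc <- levels(reorder(counts$County, -counts$n))
desc
```

ggplot draws the bars in the order of the factor levels, which is why wrapping the variable in `reorder()` changes the order of the bars.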

### Boxplots

Boxplots pack a lot of information into a simple picture: the median, the interquartile range (IQR), and any outliers. Boxplots are also handy for graphing distributions by a grouping variable. Let's take a look at the graduation rate for Black students by county using `geom_boxplot()`.

In [None]:
ggplot(hs, mapping = aes(y = Graduation.Rate.Num_Black.or.African.American,
                         x = County.Name)) + 
  geom_boxplot()

# try switching the x and y axes:

ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Black.or.African.American,
                         y = County.Name)) + 
  geom_boxplot()

It's good, but we can make it even better! Let's make several changes:

1. show the actual values of the graduation rates by high school using `geom_jitter()`, which adds a bit of random noise to the dots;
2. change the shape of the points and weight the size by the number of Black students in the school;
3. make the points fairly transparent;
4. change the theme;
5. reorder the counties by median graduation rate in the county

In [None]:
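hs$County.Name <- as_factor(hs$County.Name) # store county as a "factor" variable so fct_reorder() can reorder its levels

ggplot(hs, mapping = aes(x = Graduation.Rate.Num_Black.or.African.American,
                         y = fct_reorder(County.Name, Graduation.Rate.Num_Black.or.African.American,
                                         median, na.rm = TRUE))) +  # na.rm = TRUE is passed on to median()
        geom_boxplot(outlier.shape = NA) +   # hide the boxplot's outlier dots; the jittered points show every school
        geom_jitter(aes(size = Cohort.Count.Num_Black.or.African.American), # only size varies with the data, so only it goes inside aes()
                    width = 0.01,            # fixed settings go outside aes()
                    alpha = .05,             # try a larger value if the points are too faint
                    shape = "bullet",
                    show.legend = FALSE) +
        theme_minimal()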
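
For reference, the quantities a boxplot draws can also be computed by hand. This is a minimal sketch on a made-up vector (`rates` is invented for illustration), using the common rule that points more than 1.5 times the IQR beyond the box are flagged as outliers.

```r
# Toy graduation rates, invented for illustration only
rates <- c(40, 82, 85, 88, 90, 91, 93, 95, 97)

q   <- quantile(rates, c(.25, .5, .75))  # the box edges (25% and 75%) and the median line (50%)
iqr <- IQR(rates)                        # width of the box: third quartile minus first quartile

# Fences: anything outside them is drawn as an individual outlier point
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- rates[rates < lower_fence | rates > upper_fence]
outliers
```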