# Graphing with ggplot2, Part I

**ggplot2** is an outstanding, albeit complex, graphics package for R, bundled as part of the **tidyverse**. Let's explore some of the basic plot types and the `ggplot()` syntax. To make our lives easier, we'll use the `fake` dataframe from some of our previous labs. First, let's use the `pacman` package to load the **tidyverse**. We'll also install and load **RColorBrewer**, which provides some nice color palettes we can use for graphing.

In [None]:
# install pacman only if it is not already available:
if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

pacman::p_load(tidyverse, RColorBrewer)

In [None]:
# ingest the data from github and store as "fake"
fake <- read.csv("https://raw.githubusercontent.com/bowendc/pol200_labs/refs/heads/main/fake.csv")

# to make the graphs a little more interesting, let's change our y variable slightly.
# Don't do this if you are using real data!!!!
fake$y <- fake$y + 15*fake$w

## Histograms and density plots

Histograms present the distribution of a continuous, interval-level variable by grouping its values into "bins" and drawing one bar per bin.
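
To see what binning does, here is a small sketch in base R. The `vals` vector is made up for illustration (it is not the `fake` data), and the bin width of 10 is an arbitrary choice.

```r
# Hypothetical values standing in for a continuous variable (not the lab data)
vals <- c(2, 7, 11, 14, 18, 23, 27, 29, 35, 41)

# Cut the range into bins of width 10: [0,10), [10,20), and so on
bins <- cut(vals, breaks = seq(0, 50, by = 10), right = FALSE)

# Count how many values land in each bin; a histogram draws one bar per count
bin_counts <- table(bins)
print(bin_counts)
```

A histogram is essentially a picture of this table, with bar heights equal to the counts.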

The `ggplot()` syntax works like this. First, we create the plot using `ggplot()` and specify the data we'll be using. Then, we add layers with `+` to control the type of plot and the look of the graph.
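
As a minimal sketch of this layering idea, the code below builds a plot in two steps. It uses R's built-in `mtcars` data as a stand-in so that it runs without downloading anything; `base_plot` and `hist_plot` are just illustrative names.

```r
library(ggplot2)

# mtcars ships with R; it stands in for the lab's data here
base_plot <- ggplot(data = mtcars, mapping = aes(x = mpg))  # data and aesthetics only

# Adding a geom with + returns a new plot object; base_plot itself is unchanged
hist_plot <- base_plot + geom_histogram(bins = 10)

length(base_plot$layers)  # 0: no geoms yet, so printing it shows only empty axes
length(hist_plot$layers)  # 1: the histogram layer
```

Because each `+` returns a new object, we can store a plot at any stage and keep adding to it later.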

### Histograms

`geom_histogram()` will draw a histogram. In the code below, we draw four histograms: first with the default number of bins, then with the number of bins limited to 7, then with bins of a fixed width, and finally with the y axis changed to measure the percentage of total cases rather than counts.

In [None]:
p1 <- ggplot(data = fake,               # create plot, store as p1
             mapping = aes(x = x))      # tell R which variable(s) we'll graph

p1 + geom_histogram()                   # create histogram
p1 + geom_histogram(bins = 7)           # limit to only 7 bins
p1 + geom_histogram(binwidth = 5)       # fix bin width to 5
p1 + geom_histogram(aes( y = 100*after_stat(count) /sum(after_stat(count))), # make y scale a percentage
                    binwidth = 10)      # change bin width to 10
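
If you want to convince yourself that the percentage version really adds up, one option is to pull out the values ggplot2 computed for the bars. This is a rough check using the built-in `mtcars` data as a stand-in for `fake`; `pct_plot` and `bars` are illustrative names.

```r
library(ggplot2)

# Same percentage recipe as above, applied to a stand-in dataset
pct_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(aes(y = 100 * after_stat(count) / sum(after_stat(count))),
                 binwidth = 5)

# layer_data() returns the data frame ggplot2 computed for a layer,
# including the final bar heights in the y column
bars <- layer_data(pct_plot)

# The bar heights are percentages of all cases, so they should total 100
# (up to floating-point rounding)
sum(bars$y)
```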


Let's say we're pretty happy with what we've created. Let's store this and start editing the look of the graph. Below, we change the color of the plot using `fill`, the transparency using `alpha` and the outline color using `color`.

In [None]:
p2 <- p1 + geom_histogram(aes( y = 100*after_stat(count) /sum(after_stat(count))),
                    binwidth = 10,
                    fill = "#193d19",  # set fill color
                    alpha = .4,          # set transparency amount. Closer to 0 is more transparent.
                    color = "black")   # set outline color
p2
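
Notice that `fill`, `alpha`, and `color` sit outside `aes()` here, so each is a single fixed value applied to every bar. Placing an argument inside `aes()` instead ties it to a variable. The sketch below contrasts the two using the built-in `mtcars` data as a stand-in; the object names are illustrative.

```r
library(ggplot2)

# Setting: a constant outside aes() applies to every bar
set_plot <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "navy", alpha = 0.4, color = "black")

# Mapping: inside aes(), fill varies with a variable and a legend is drawn
map_plot <- ggplot(mtcars, aes(x = mpg, fill = factor(am))) +
  geom_histogram(binwidth = 5, alpha = 0.4, color = "black")

set_plot$layers[[1]]$aes_params$fill  # the fixed color stored on the layer
names(map_plot$mapping)               # fill now appears among the mapped aesthetics
```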

We can add another layer to alter the background of the plot and other styling using one of the `theme_*()` layers.

In [None]:
# all of these themes are built into ggplot2
p2 + theme_bw()
p2 + theme_light()
p2 + theme_minimal()
p2 + theme_gray()
p2 + theme_linedraw()


Plot and axis titles can be created using the `labs` layer.

In [None]:
p2 + labs(title = "Distribution of Variable X",
          y = "Percent of Cases",
          x = "A Variable X (fake data)") +
     theme_bw()

### Bar Charts

Bar charts are versions of histograms in which each value of the variable gets its own bar, so they are suitable for discrete (ordinal or nominal) data. Let's take what we've learned from `geom_histogram()` and make a bar chart instead using `geom_bar()`.

In [None]:
# here, we use geom_bar which has some nice default settings
# we also tell R to interpret z as ordered data rather than 
#   a number, which makes the x axis a little nicer. 

ggplot(data = fake,
        mapping = aes(x = ordered(z))) +
        geom_bar(aes( y = 100*after_stat(count) /sum(after_stat(count))),
                    fill = "navy") +
     labs(title = "Distribution of Ordinal Variable Z",
          y = "Percent of Cases",
          x = "A Discrete Independent Variable Z (fake data)") +
     theme_minimal()
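
To see what `ordered()` is doing for us, here is a small base R sketch. The `z_demo` vector is hypothetical (made-up codes, not the actual `z` column).

```r
# Hypothetical ordinal codes, not the lab data
z_demo <- c(1, 3, 2, 3, 3, 1, 2, 3)

# ordered() converts the numbers into an ordered factor with one level per value
z_ord <- ordered(z_demo)
levels(z_ord)      # "1" "2" "3"
is.numeric(z_ord)  # FALSE: R now treats z as categories, not as numbers

# One bar per level; these are the percentages a bar chart would display
pct <- 100 * prop.table(table(z_ord))
print(pct)
```

Since the variable is no longer numeric, the x axis gets one labeled tick per category rather than a continuous scale.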

### Density plots

Density plots display the distribution of a variable as a smoothed curve, without bins or bars. By default, `geom_density()` draws just the outline of the density; I typically use the `fill` argument to create a shaded distribution.
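
For intuition about what the smoothing produces, here is a rough sketch with base R's `density()` function, which, as far as I know, is the same kernel density estimator ggplot2 uses for density plots. The `draws` vector is simulated for illustration and is not the `fake` data.

```r
set.seed(1)
draws <- rnorm(200, mean = 50, sd = 10)  # simulated stand-in values

# Kernel density estimate, evaluated on a grid of 512 points by default
kde <- density(draws)
length(kde$x)

# Approximate the area under the curve with the trapezoid rule;
# a density curve should enclose an area very close to 1
area <- sum(diff(kde$x) * (head(kde$y, -1) + tail(kde$y, -1)) / 2)
area
```

The height of the curve is therefore a density, not a count or a percentage: what matters is the area under it.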

In [None]:
ggplot(data = fake,
        mapping = aes(x = y)) +
        geom_density(fill = "#901c16",  # y defaults to the estimated density
                    alpha = .4,
                    color = "black") +
     labs(title = "Distribution of Variable Y",
          y = "Density",
          x = "A Dependent Variable Y (fake data)") +
     theme_minimal()


### Multivariate plots

We can present these distributions across the values of another variable. There are multiple ways of doing this. First, the distribution could be split into categories of another variable and then we can change some attribute (like color, shape, or border) to denote the values of the second variable. We could also split the graph entirely based on the categories of the other variable. Let's go over both of these strategies.

Here, let's split the plot into two distributions occupying the same plot space. To visually distinguish the distributions, we can use different `fill` colors. The easiest way to do this is to map `fill` to the grouping variable inside `aes()`. We use the `factor()` function to inform R that `w` is a nominal, not numeric, variable. Then, we use `scale_fill_brewer()` to define the color scale we want to use. Here, we use the `Dark2` palette from the `RColorBrewer` package. There are many cool color palettes. You can read up on how to use them [here](https://ggplot2-book.org/scales-colour#sec-colour-discrete).
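
If you're curious which colors a palette contains, you can ask `RColorBrewer` directly. This is a small sketch; the `w_demo` vector is hypothetical and simply assumes a two-category variable coded 0 and 1.

```r
library(RColorBrewer)

# Pull the first three colors of the Dark2 palette as hex codes
# (brewer.pal() expects n of at least 3)
dark2 <- brewer.pal(n = 3, name = "Dark2")
print(dark2)

# Hypothetical 0/1 codes; factor() turns them into categories,
# and each category gets its own color from the palette, in order
w_demo <- factor(c(0, 1, 1, 0))
nlevels(w_demo)  # 2 categories, so 2 fill colors are needed
```

Running `display.brewer.all()` will draw every available palette if you want to browse the options.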

In [None]:
ggplot(data = fake, 
       mapping = aes( x = y, fill = factor(w))) + # fill will group observations by the variable w
        geom_density(alpha = .4,        # y defaults to the estimated density of each group
                    color = "black") +
     labs(title = "Distribution of Variable Y by W",
          y = "Density",
          x = "A Dependent Variable Y (fake data)",
          fill = "W") +       # change the label of the legend
     scale_fill_brewer(palette = "Dark2") +
     theme_minimal() 


Now let's try faceting the graph: splitting our single plot into two separate plots. You don't need to use the `fill` argument when faceting, but it does allow you to use a different color in each panel.

In [None]:
ggplot(data = fake, 
       mapping = aes( x = y, fill = factor(w))) +
        geom_density(alpha = .4,        # y defaults to the estimated density of each group
                    color = "black") +
     labs(title = "Distribution of Variable Y by W",
          y = "Density",
          x = "A Dependent Variable Y (fake data)",
          fill = "W") +
     scale_fill_brewer(palette = "Accent") +
     theme_minimal() + 
     facet_grid(~ w)          # facet plot by variable w (one column per value)
                              # facet_grid(w ~ .) would instead give you two rows
                              # rather than two columns. You could also put one
                              # variable on each side of the ~ to facet by rows
                              # and columns at the same time.
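
The row and column options described in the comment above can be sketched as follows. This uses the built-in `mtcars` data as a stand-in for `fake`, with its `am` and `vs` columns playing the role of the faceting variables; the object names are illustrative.

```r
library(ggplot2)

# A simple density plot to facet (mtcars is a stand-in dataset)
base_density <- ggplot(mtcars, aes(x = mpg)) +
  geom_density(fill = "grey60", alpha = 0.4)

by_cols <- base_density + facet_grid(~ am)    # variable after the ~ : one column per value
by_rows <- base_density + facet_grid(am ~ .)  # variable before the ~ : one row per value
both    <- base_density + facet_grid(vs ~ am) # rows by vs, columns by am

# The layout table of a built plot records where each panel sits
ggplot_build(both)$layout$layout[, c("PANEL", "ROW", "COL")]
```

Note the `.` in `am ~ .`: it is a placeholder meaning "no variable on this side", and the formula needs something on the right of the `~`.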