# Graphing with `ggplot`, Part II

In this lab, we will cover a few more graph types and illustrate additional ways to control the look of your graphs in R using `ggplot`. We will use a famous data source: [DW-Nominate scores](https://voteview.com/data), developed by Poole and Rosenthal (with additional contributors). Technically they measure agreement in roll-call voting, but we often use them as estimates of ideological ideal points for members of Congress over time. They can be used to study polarization and extremism among legislators. Higher values are more conservative and lower values are more liberal. I have aggregated the DW-Nominate scores by Congress for members of each party in the U.S. House from 1871 (42nd Congress) until today. Let's get started!

In [None]:
# install.packages('pacman')
pacman::p_load(tidyverse)

Pull the data from GitHub.

In [None]:
house <- read_csv("https://raw.githubusercontent.com/bowendc/pol200_labs/refs/heads/main/house.medians.csv")

Now, let's make a few changes to these data to aid in our analysis. First, let's create a measure of ideological polarization of the parties. We can do that by taking the absolute value of the difference between each party's median ideal-point estimate. These are currently stored in separate variables, `Dem.med` and `Rep.med`. Second, let's code a new `year` variable from the included `congress` variable. `congress` ranges from 42 (42nd Congress) to 119 (119th Congress). Finally, let's use our `year` variable to code a `decade` variable. The code below uses the integer-division operator `%/%` to divide the year by 10 and discard the remainder, which keeps the first three digits of the year (for example, `1885` would result in `188`). Then it multiplies that number by 10 to get back to the decade: `1880`.

In [None]:
house <- house |> mutate(polarization = abs(Dem.med - Rep.med),
                         year = 1789 + (congress - 1)*2,
                         decade = 10*(year %/% 10))
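
To check the arithmetic by hand, here is a minimal sketch in base R. The two helper functions are hypothetical names introduced only for illustration; they wrap the same formulas used in the `mutate()` call above.

```r
# Hypothetical helpers (not part of the lab's code) wrapping the formulas above.
congress_to_year <- function(congress) 1789 + (congress - 1) * 2  # first year of a Congress
year_to_decade   <- function(year) 10 * (year %/% 10)             # round down to the decade

congress_to_year(c(42, 119))   # 1871 2025
year_to_decade(c(1871, 1885))  # 1870 1880

1885 %/% 10   # 188: integer division keeps the quotient
1885 %% 10    # 5:   the remainder that %/% throws away
```
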


### Line Graphs

Note that our dataset has polarization and party medians measured across different "regions", coded in the variable `southern`. There are observations for "Southern States", "Border States" (here, I've coded OK, MO, and KY as border states), "Northern States", and "All States". Let's make a line graph in `ggplot` showing polarization over time.

In [None]:
ggplot() +
    geom_line(data = house,                # line graph 
        aes(x = year, y = polarization))

Well, that's weird. What's going on? I thought we were making a line graph! 

The issue is that we have multiple values of `polarization` for each value of `year`. A line graph only makes sense when there is a single value of the y variable for each value of x (within each line). Here, we need to filter by `southern`. 

In [None]:
ggplot() +
    geom_line(data = house |> filter(southern == "All States"),  # notice we can load data inside of geom_ calls
        aes(x = year, y = polarization))

Wow. Polarization has increased dramatically since the mid-20th century! 
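
If you want to confirm the diagnosis (several rows per year before filtering, one row per year after), counting rows is a quick check. The sketch below uses a small toy data frame with made-up numbers as a stand-in for `house`; it assumes the region labels listed earlier are the exact values stored in `southern`.

```r
library(dplyr)

# Toy stand-in for `house`; the polarization values are made up for illustration.
toy <- tibble(
  year = rep(c(1871, 1873), each = 4),
  southern = rep(c("All States", "Border States",
                   "Northern States", "Southern States"), times = 2),
  polarization = c(0.80, 0.70, 0.90, 0.60, 0.82, 0.75, 0.88, 0.65)
)

# Before filtering: several rows (one per region) for every year
rows_per_year <- toy |> count(year)
rows_per_year

# After filtering: exactly one row per year, so a single line is well defined
rows_all_states <- toy |>
  filter(southern == "All States") |>
  count(year)
rows_all_states
```

Running the same `count(year)` on the real `house` data should show the same pattern.
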

Alternatively, we could provide a different line for each value of `southern`:

In [None]:
ggplot() +
    geom_line(data = house,
        aes(x = year, 
            y = polarization, 
            color = southern))   # groups observations by southern and assigns unique color

It looks like the change over time in polarization is more of an issue in the South than in other parts of the country. 
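
The comment in the code above says that `color = southern` both groups and colors the observations. As a rough sketch with made-up toy data, the `group` aesthetic does the grouping on its own, which is useful when you want separate lines without a color legend. `layer_data()` returns the data ggplot actually draws, including the group each row was assigned to.

```r
library(ggplot2)

# Toy data with two regions; values are made up for illustration.
toy <- data.frame(
  year = rep(c(1871, 1873, 1875), times = 2),
  southern = rep(c("Northern States", "Southern States"), each = 3),
  polarization = c(0.90, 0.88, 0.85, 0.60, 0.65, 0.70)
)

# group = southern: one line per region, all drawn in the same color
p_grouped <- ggplot(toy, aes(x = year, y = polarization, group = southern)) +
  geom_line()

# color = southern: one line per region, plus a color for each region
p_colored <- ggplot(toy, aes(x = year, y = polarization, color = southern)) +
  geom_line()

# Both versions split the rows into the same number of groups (lines)
n_groups_grouped <- length(unique(layer_data(p_grouped)$group))
n_groups_colored <- length(unique(layer_data(p_colored)$group))
c(n_groups_grouped, n_groups_colored)
```
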

### Connected Scatterplots

Another option is to create a *connected scatterplot*, which mixes points with lines to show trends. Let's graph party median ideology scores by decade. 

In [None]:
ggplot(house |> filter(southern == "All States"), aes(x = decade, y = Dem.med)) +
    geom_point() +
    geom_line()

Oh no. We did it again. Clearly something is up with these lines. This time the issue is that we're using our `decade` variable, but we haven't aggregated the data up to the decade level (it is still measured at the level of each Congress). So again we have multiple values of y for a single value of x, which `geom_line` can't draw as one clean line. We can fix this by using `group_by` and `summarize` to get a single estimate of our y variables for each decade. 

In [None]:
# aggregate by taking the mean of each variable within decade and region
means <- house |> 
            group_by(decade, southern) |>                          # group by x and z variables
            summarize(mean.dem = mean(Dem.med, na.rm = TRUE),
                      mean.rep = mean(Rep.med, na.rm = TRUE),
                      mean.pol = mean(polarization, na.rm = TRUE),
                      .groups = "drop")                            # return an ungrouped data frame

Now we can plot!

In [None]:
ggplot() +
    geom_line(data = means |> filter(southern == "All States"),   # notice we're using our new df, `means`
              aes(x = decade, y = mean.dem),
              color = "navy") +
    geom_point(data = means |> filter(southern == "All States"), 
              aes(x = decade, y = mean.dem),
              color = "navy") +
    theme_light()

Looks like Democrats get more conservative from 1900 to 1950 and then start shifting to the left.
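
The plot above shows only the Democratic means. One possible way to put both parties on the same connected scatterplot is to reshape the data so that party becomes a variable you can map to `color`. The sketch below uses a toy stand-in for the "All States" rows of `means` (the numbers are made up), and `toy_means`, `toy_long`, `party`, and `mean.score` are names introduced only for this illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for the "All States" rows of `means`; values are made up.
toy_means <- tibble(
  decade   = c(1900, 1950, 2000),
  mean.dem = c(-0.35, -0.20, -0.38),
  mean.rep = c(0.40, 0.25, 0.45)
)

# Reshape from one column per party to one row per decade-party pair
toy_long <- toy_means |>
  pivot_longer(cols = c(mean.dem, mean.rep),
               names_to = "party",
               values_to = "mean.score")

# Points plus lines, one color per party
p_both <- ggplot(toy_long, aes(x = decade, y = mean.score, color = party)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = c(mean.dem = "navy", mean.rep = "#5F0505")) +
  theme_light()
p_both
```

Because `color = party` also groups the rows, each party gets its own line, and there is still only one y value per decade within each line.
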

### Bar Charts

*Bar charts*, or bar graphs, can be nice ways to represent mean comparison tests. Here, let's present the average party median for Republicans by decade. We'll use `geom_col`, which sets the height of each bar to the exact value of a variable. Since we have already aggregated the data to represent the mean ideology score by decade, region, and party, we are good to go. Take a look at some of the other options in the code below, including how to change the labeled values on the x axis. 

In [None]:
ggplot() +
    geom_col(data = means |> filter(southern == "All States"), 
              aes(x = decade, y = mean.rep),
              fill = "#5F0505") +
    labs(title = "Average Republican Party Median in U.S. House Per Decade",
          x = "Decade",
          y = "Average of DW-Nominate Medians of House Republican Party Caucus",
          caption = "Source: Lewis et al. (2025). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/") +
    scale_x_continuous(breaks = c(1880, 1900, 1920, 1940,                  # use scale_x_discrete when you have categorical data          
                                1960, 1980, 2000, 2020),                   # define which categories get placed on the axis
                       labels = c("1880s", "1900s", "1920s", "1940s",      # label the categories (in order)
                                "1960s", "1980s", "2000s", "2020s")) +
    theme_light()
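
To see what "the exact value of a variable" means in practice, here is a small sketch contrasting `geom_col` with `geom_bar` on made-up toy data. `geom_bar` counts rows for each x value, while `geom_col` draws bars at the y values you give it, which is why we aggregate first.

```r
library(ggplot2)

# Toy data: three rows for party A, one row for party B (values are made up).
toy <- data.frame(
  party = c("A", "A", "A", "B"),
  score = c(0.2, 0.4, 0.6, 0.5)
)

# geom_bar(): bar height is the number of rows per party
p_bar <- ggplot(toy, aes(x = party)) +
  geom_bar()
bar_heights <- layer_data(p_bar)$count
bar_heights   # 3 1

# geom_col(): aggregate first, then bar height is the value itself
toy_means <- aggregate(score ~ party, data = toy, FUN = mean)
p_col <- ggplot(toy_means, aes(x = party, y = score)) +
  geom_col()
col_heights <- layer_data(p_col)$y
col_heights   # the two party means
```

If you pass unaggregated data to `geom_col`, it stacks the rows that share an x value, so the bar shows their sum rather than their mean.
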

### Legends, Color Scales, and More

So both parties have become more extreme since the mid-20th century, and the change is related to the South. Let's return to our line graphs and examine the Democratic Party, which completely dominated most southern states for the first half of this time period. The code below makes a number of alterations. Notice that we'll use all categories of `southern` except for the "All States" category, and we create a unique line for each region using `color = southern`. `scale_color_brewer()` defines those colors using the `Dark2` pre-set color palette. 

In [None]:
p <- ggplot(house |> filter(southern != "All States"), 
        aes(x = year, 
        color = southern)) +           # group by southern and color 
        geom_line(aes(y = Dem.med),
                linewidth = 1.5) +     # make lines a bit thicker               
        scale_color_brewer(palette = "Dark2") +   # use pre-defined Dark2 color scheme for southern
        labs(title = "DW-Nominate Medians for Democratic Party in U.S. House by Region",
          x = "Year",
          y = "DW-Nominate Medians of Democrats in U.S. House",
          color = "Region",            # change legend title to Region
          caption = "Source: Lewis et al. (2025). Voteview: Congressional Roll-Call Votes Database. https://voteview.com/") +
        theme_light() + 
        theme(legend.position = c(0.9, 0.9),        # move legend inside the graph, 90% on x axis, 90% on y axis
                                                    # (ggplot2 >= 3.5.0 prefers legend.position = "inside" plus legend.position.inside = c(0.9, 0.9))
              legend.background = element_blank())  # make legend background transparent
p
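
If you are curious which colors `Dark2` assigns, the sketch below looks them up with the RColorBrewer package (normally installed alongside ggplot2). It assumes the three remaining values of `southern` are exactly "Border States", "Northern States", and "Southern States", and that they are ordered alphabetically, which is the default for a character variable.

```r
library(RColorBrewer)

# The first three colors of the Dark2 palette, as hex codes
dark2 <- brewer.pal(n = 3, name = "Dark2")
dark2

# Assumed region labels, in the alphabetical order ggplot uses by default
names(dark2) <- c("Border States", "Northern States", "Southern States")
dark2
```

A named vector like this could also be passed to `scale_color_manual(values = dark2)` if you ever want to fix which region gets which color.
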

This graph dramatically highlights what happened to Southern Democrats. They moved sharply to the right between 1900 and 1950 and then shifted back to the left. One possible explanation is that Jim Crow laws destroyed the political, social, and economic power of Black Americans, especially in the South. White Southern Democrats were focused on maintaining their position in the racial hierarchy during this period, and began to reject the more active federal government emerging from the Progressive and New Deal (Northern/Western) wings of the party. Eventually, Congress passed the Civil Rights Act and the Voting Rights Act in the mid-1960s, which secured Black access to the ballot (among other political and economic benefits).

We can illustrate our intuition by including a shaded region on our graph and adding text to the plot using `annotate()`.

In [None]:
p.final <- p + annotate("rect", xmin = 1877, 
                     xmax = 1965,
                     ymin = -.5, 
                     ymax = 0,
                      fill = "gray", alpha = .4) +
                annotate("text", x = 1920,
                     y = -.025,
                     label = "Jim Crow state-sponsored \n racial caste system in effect",
                     color = "black",
                     size = 3)
p.final

Once Jim Crow ended, Republicans made inroads into the South, peeling off conservative (mostly white) Southerners. New members were elected, many of whom were Black and Democratic, and the Southern wing of the party aligned with the Northern wing. It is this sorting by ideology inside the South (and the Northeast, to a lesser extent) that we observe when we see rising party polarization.

Let's say we're happy with this graph. We can save it with `ggsave` and even adjust the aspect ratio by manually setting the width and height. 

In [None]:
ggsave("dem_medians.png",  # name and format of image file
        plot = p.final,    # plot you want to save to file
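        width = 10,        # width and height together set the
        height = 7,        # size and aspect ratio of the image
        units = "in",      # units for width and height (inches)
        dpi = 600)         # resolution, which controls image quality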
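
For a sense of what these settings produce, here is a quick back-of-the-envelope calculation in base R. The variable names are just for illustration; the numbers are the ones passed to `ggsave` above.

```r
# Settings used in the ggsave() call above
width_in  <- 10
height_in <- 7
dpi       <- 600

# Pixel dimensions of the saved image: inches times dots per inch
pixels <- c(width = width_in * dpi, height = height_in * dpi)
pixels   # 6000 4200

# Aspect ratio (width divided by height)
aspect_ratio <- width_in / height_in
round(aspect_ratio, 2)   # 1.43
```

Raising `dpi` increases the pixel count (and file size) without changing the aspect ratio; changing `width` or `height` changes both.
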