IntroToDataVis_r.Rmd

---
title: An R Markdown document converted from "IntroToDataVis_r.ipynb"
output: html_document
---

# Hands-on with R+ ggplot2

# 1. Improving Pie Charts

*What is wrong with this figure?*

![](https://drive.google.com/uc?id=1K6hCHovjZV5Icbn3zd-gW86RSRjiH0i-)

## Let's agree that this is a monstrosity. Now, how do we improve it?

```{r}
# import the necessary library
library(ggplot2)
```

## 1.1. Read in the data

*This is a made up data set from a colleague of mine. We have 10 items, each with a text label and a numeric value.*

*I'm using `read.csv` to read in the data.*

```{r}
url = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/bar/bar.csv'
data = read.csv(url)
data
```

## 1.2. For many uses cases (including this) a bar chart is a better option than a pie chart.

*Humans can more easily interpret differences in bar charts. Pie charts require us to interpret areas = slow, while bar charts use position = fast. Generally, you should choose a bar chart over a pie chart when:*

-   *There are too many categories to easily distinguish between pie chart areas (as we have here).*
-   *Slice sizes in the pie chart are too similar (as we have here).*
-   *You have multiple data sets (which we do not have here).*
-   *When the raw percentages can provide as much (or more) meaning than fraction of a whole (as we have here).*

*Pie charts are only useful when there are few categories, each category has a very different percentage, AND the purpose of your visualization is to show fractions of a whole.*

*Here is the default bar chart from ggplot. Leaves lots to be desired...*

```{r}
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") # use stat = "identity" because we are supplying the actual bar values
```

## 1.3. Improve the axis labels and add a plot title

*The text for the bars are unreadable. How should we fix that?*

```{r}
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "Percent")
```

## 1.4. Fix the bar text, sort the data, add the percentage values to each bar

```{r}
ggplot(data, aes(x = reorder(Label, Value), y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "") + 
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2) + # add labels
    ylim(0,11) # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
```

## 1.5. Clean this up a bit

-   *I don't want the grid lines anymore*
-   *We can remove the axes entirely*
-   *Make the font larger*
-   *Let's change the colors, and highlight one of them*
-   *Save the plot*

```{r fig.height=8, fig.width=15, message=TRUE}
# Make plot wider for display
options(repr.plot.width = 15, repr.plot.height = 8)

ggplot(data, aes(x = reorder(Label, Value),
                  y = Value,
                  fill = factor(ifelse(Label == "Color Choice", "Highlighted", "Normal")))) + # to highlight one bar
    geom_bar(stat = "identity", show.legend = FALSE) + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage in Data Visualization", x = "", y = "") + 
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2, size = 6) + # add labels
    ylim(0,11) + # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
    scale_fill_manual(name = "", values = c("orange","grey50")) + # set the colors for highlighting
    theme_classic() + # there are many themes to choose from : https://ggplot2.tidyverse.org/reference/ggtheme.html
    theme(axis.line = element_blank(), # remove the remaining axis lines
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.ticks.y = element_blank(),  # remove y axis ticks
          axis.text = element_text(size = 20), # increase the font size of the labels
          plot.title = element_text(size = 30) # increase the font size of the title
          )

# save the figure (have to specify size here again)
# ggsave("bar_r.pdf", device = "pdf", width = 15, height = 8)
```

# 2. Scatter Plots

```{r}
# import the necessary library
library(ggplot2)
library(ggforce) # only needed to draw the large annotation circles on the scatter plots
```

## 2.1. Read in the data

*These two data sets are from <https://voteview.com/data> and use the using the [DW-NOMINATE method](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)) to evaluate the political characteristics of individuals on a scale from -1 to 1. Each row in the data is for a different congress person and contains the name, and "x" value and an "alt" value. The horizonal axis, "x", measures the level of liberal (low "x") or conservative (high "x") ideology and can also be interpreted as the position on government intervention in the economy. The vertical axis, "alt" can be interpreted as the position on cross-cutting, salient issues of the day. Most experts agree that the "x" dimension explains the vast majority of differences in voting behaviors.*

*I'm using `read.csv` to read in the data.*

```{r}
url90 = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/scatter/congress90.csv'
c90 = read.csv(url90)
head(c90)
```

```{r}
url116 = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/scatter/congress116.csv'
c116 = read.csv(url116)
head(c116)
```

## 2.2 Let's plot these as two subplots

*Is there anything that we should improve upon here?*

```{r}
# There are a couple ways that you can do this.  
# One would be to create two separate plots and use the gridExtra library to put them into one figure.
# The other, which we will use here, is to use facets.

# First, add a descriptive column to each dataframe and combine them
c90$label <- "Congress 90"
c116$label <- "Congress 116"

data2 <- rbind(c90, c116)

# create the scatter plot with facets
ggplot(data = data2, aes(x = x, y = alt)) +
    geom_point() + 
    facet_wrap(~label)
```

## 2.3 Let's improve this

-   *add some descriptive labels to the axes*
-   *improve the colors*
-   *increase the font sizes*

```{r fig.height=8, fig.width=14, message=TRUE}
# more descriptive labels for the facets
c90$label <- "1967 - 1969"
c116$label <- "2019 - 2021"

data2 <- rbind(c90, c116)

ggplot(data = data2, aes(x = x, y = alt)) +
    geom_circle(aes(x0 = 0, y0 = 0, r = 1), inherit.aes = FALSE, fill = "white") + # draw a bounding circle
    geom_point(aes(color = x), size = 4, show.legend = FALSE) + 
    geom_point(shape = 1, size = 4, color = "black") +
    facet_wrap(~label) + 
    labs(title = "The US Congress Has Become More Politically Polarized", 
         x = "Political ideology\n (In each panel liberal is to the left, conservative is to the right.)",
         y = "Position on salient issues") +
    scale_color_gradient2(midpoint = 0, limits = c(-1,1), low = "#0000FF", mid = "white", high = "#FF0000") +
    xlim(-1,1) + ylim(-1,1) +
    coord_fixed() +  # to ensure an equal aspect ratio
    geom_vline(aes(xintercept = 0)) + # add a y-axis at 0
    geom_hline(aes(yintercept = 0)) + # add a x-axis at 0
    theme(panel.grid.major = element_blank(),  # remove the grid
          panel.grid.minor = element_blank(), # remove the grid
          axis.title = element_text(size = 26), # increase the font size of the axis titles
          plot.title = element_text(size = 36), # increase the font size of the title
          strip.text.x = element_text(size = 26), # increase size of the facet labels
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.text.y = element_blank(),  # remove y axis labels
          axis.ticks.y = element_blank()  # remove y axis ticks
          )

# save the figure (have to specify size here)
# ggsave("scatter_r.pdf", device = "pdf", width = 14, height = 8)
```

## 2.4 Would this be better as two overlapping histograms?

*If we don't really care about the y axis, we don't need to use it.*

```{r fig.height=8, fig.width=15, message=TRUE}
# In order to get the look that I want, I will plot each of the groups twice
#   once for the shaded interior of the histograms, using geom_histogram
#   again for the outline, using stat_bin with geom = "step"
# Each of these histograms will show the density of the respective data set

# set the colors
myColors <- c("1967 - 1969" = "#386B5D", "2019 - 2021" = "#3D007A")

# define the binwidth
binwidth <- 0.1

ggplot(data = data2, aes(x = x, color = label, fill = label)) +
    geom_histogram(data = subset(data2, label == "1967 - 1969"),
        aes(fill = label, y = ..density..), 
        binwidth = binwidth, center = 0, color = NA,
        alpha = 0.5, size = 0, show.legend = FALSE) + 
    geom_histogram(data = subset(data2, label == "2019 - 2021"),
        aes(fill = label, y = ..density..), 
        binwidth = binwidth, center = 0, color = NA,
        alpha = 0.5, size = 0, show.legend = FALSE) +
    stat_bin(data = subset(data2, label == "1967 - 1969"),
        aes(y = ..density..), geom = "step",
        binwidth = binwidth, center = 0,
        size = 2, position = position_nudge(x = -binwidth/2.), show.legend = FALSE) + 
    stat_bin(data = subset(data2, label == "2019 - 2021"),
        aes(y = ..density..), geom = "step",
        binwidth = binwidth, center = 0,
        size = 2, position = position_nudge(x = -binwidth/2.), show.legend = FALSE) + 
    scale_fill_manual(values = myColors) + # set the color for the filled histograms
    scale_color_manual(values = myColors) + # set the color for the lines in stat_bin
    labs(title = "The US Congress Has Become More Politically Polarized.", 
         subtitle = "Conservatives have moved further to the right.",
         x = "Liberal                 \u2192         Conservative",
         y = "") + 
    annotate("text", x = 0.1, y = 0.68, label = "1967 - 1969", color = myColors["1967 - 1969"], size = 7, hjust = 0) +
    annotate("text", x = 0.4, y = 0.68, label = "2019 - 2021", color = myColors["2019 - 2021"], size = 7, hjust = 0) +
    theme_classic() + 
    geom_vline(aes(xintercept = 0)) + # add a y-axis at 0
    geom_hline(aes(yintercept = 0)) + # add a x-axis at 0
    theme(axis.line = element_blank(), # remove the remaining axis lines
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.text.y = element_blank(), # remove y axis labels
          axis.ticks.y = element_blank(),  # remove y axis ticks
          axis.title = element_text(size = 26), # increase the font size of the title
          plot.title = element_text(size = 36, hjust = 0.5), # increase the font size of the title and center align
          plot.subtitle = element_text(size = 26, hjust = 0.5) # increase the font size of the subtitle and center align
          )

# save the figure (have to specify size here)
# ggsave("hist_r.png", device = "png", width = 15, height = 8)
```