---
title: An R Markdown document converted from "IntroToDataVis_r.ipynb"
output: html_document
---
# Hands-on with R + ggplot2
# 1. Improving Pie Charts
*What is wrong with this figure?*
![](https://drive.google.com/uc?id=1K6hCHovjZV5Icbn3zd-gW86RSRjiH0i-)
## Let's agree that this is a monstrosity. Now, how do we improve it?
```{r}
# import the necessary library
library(ggplot2)
```
## 1.1. Read in the data
*This is a made-up data set from a colleague of mine. We have 10 items, each with a text label and a numeric value.*
*I'm using `read.csv` to read in the data.*
```{r}
url = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/bar/bar.csv'
data = read.csv(url)
data
```
## 1.2. For many use cases (including this one), a bar chart is a better option than a pie chart.
*Humans can more easily interpret differences in bar charts. Pie charts require us to interpret areas = slow, while bar charts use position = fast. Generally, you should choose a bar chart over a pie chart when:*
- *There are too many categories to easily distinguish between pie chart areas (as we have here).*
- *Slice sizes in the pie chart are too similar (as we have here).*
- *You have multiple data sets (which we do not have here).*
- *The raw percentages provide as much meaning as (or more than) the fraction of a whole (as we have here).*
*Pie charts are only useful when there are few categories, each category has a very different percentage, AND the purpose of your visualization is to show fractions of a whole.*
*Here is the default bar chart from ggplot. Leaves lots to be desired...*
```{r}
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") # use stat = "identity" because we are supplying the actual bar values
```
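To see why position is easier to read than area, it can help to draw the same values both ways. The chunk below is a minimal sketch using a small made-up data frame (`toy`, with hypothetical labels and values, not the data set above). In ggplot2, a pie chart is a single stacked bar drawn in polar coordinates.

```{r}
library(ggplot2)

# hypothetical toy data (not the workshop data): four categories with similar values
toy <- data.frame(Label = c("A", "B", "C", "D"),
                  Value = c(24, 26, 23, 27))

# a pie chart is a single stacked bar wrapped into polar coordinates
pie <- ggplot(toy, aes(x = "", y = Value, fill = Label)) +
    geom_col(width = 1) +
    coord_polar(theta = "y")

# the same values as bars: differences are read from position along a common axis
bars <- ggplot(toy, aes(x = Label, y = Value)) +
    geom_col()

pie
bars
```

With slices this similar, the ranking is hard to read from the pie but easy to read from the bars.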
## 1.3. Improve the axis labels and add a plot title
*The text labels for the bars are unreadable. How should we fix that?*
```{r}
ggplot(data, aes(x = Label, y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "Percent")
```
## 1.4. Fix the bar text, sort the data, add the percentage values to each bar
```{r}
ggplot(data, aes(x = reorder(Label, Value), y = Value)) +
    geom_bar(stat = "identity") + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage", x = "", y = "") +
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2) + # add labels
    ylim(0,11) # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
```
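The sorting comes from `reorder()`, which turns the labels into a factor whose levels are ordered by the values; ggplot draws a discrete axis in level order. A minimal sketch with made-up labels and values (`item_labels` and `item_values` are hypothetical, not the data set above):

```{r}
# hypothetical toy data
item_labels <- c("A", "B", "C")
item_values <- c(8, 2, 5)

# by default a factor's levels are alphabetical
levels(factor(item_labels))

# reorder() sorts the levels by the supplied values (ascending)
sorted <- reorder(item_labels, item_values)
levels(sorted)

# negate the values to sort in descending order instead
descending <- reorder(item_labels, -item_values)
levels(descending)
```

With `coord_flip()`, the first level is drawn at the bottom, so ascending order puts the largest bar at the top of the plot.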
## 1.5. Clean this up a bit
- *I don't want the grid lines anymore*
- *We can remove the axes entirely*
- *Make the font larger*
- *Let's change the colors, and highlight one of them*
- *Save the plot*
```{r fig.height=8, fig.width=15, message=TRUE}
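# Make the plot wider for display (the repr options apply when running in a Jupyter R kernel;
# when knitting, the fig.width and fig.height chunk options above set the size)
options(repr.plot.width = 15, repr.plot.height = 8)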
ggplot(data, aes(x = reorder(Label, Value),
                 y = Value,
                 fill = factor(ifelse(Label == "Color Choice", "Highlighted", "Normal")))) + # to highlight one bar
    geom_bar(stat = "identity", show.legend = FALSE) + # use stat = "identity" because we are supplying the actual bar values
    labs(title = "Percentage of Poor Usage in Data Visualization", x = "", y = "") +
    coord_flip() + # this flips the plot to horizontal
    geom_text(aes(label = paste0(Value,"%")), vjust = 0, hjust = -0.2, size = 6) + # add labels
    ylim(0,11) + # add some space for the text labels; since we flipped the plot we use "ylim" (instead of "xlim")
    scale_fill_manual(name = "", values = c("orange","grey50")) + # set the colors for highlighting
    theme_classic() + # there are many themes to choose from : https://ggplot2.tidyverse.org/reference/ggtheme.html
    theme(axis.line = element_blank(), # remove the remaining axis lines
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.ticks.y = element_blank(), # remove y axis ticks
          axis.text = element_text(size = 20), # increase the font size of the labels
          plot.title = element_text(size = 30) # increase the font size of the title
    )
# save the figure (have to specify size here again)
# ggsave("bar_r.pdf", device = "pdf", width = 15, height = 8)
```
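In the chunk above, the two colors are matched to the fill groups by position, and the groups are ordered alphabetically ("Highlighted" before "Normal"). A slightly more robust variant names each color explicitly, so the mapping does not depend on that ordering. This is a minimal sketch with a made-up data frame (`toy`, its labels, and the `group` column are hypothetical):

```{r}
library(ggplot2)

# hypothetical toy data
toy <- data.frame(Label = c("Color Choice", "Other A", "Other B"),
                  Value = c(9, 4, 6))

# flag the bar to highlight
toy$group <- ifelse(toy$Label == "Color Choice", "Highlighted", "Normal")

# naming the colors ties each one to a group regardless of ordering
p <- ggplot(toy, aes(x = reorder(Label, Value), y = Value, fill = group)) +
    geom_col(show.legend = FALSE) +
    coord_flip() +
    scale_fill_manual(values = c("Normal" = "grey50", "Highlighted" = "orange"))
p
```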
# 2. Scatter Plots
```{r}
# import the necessary library
library(ggplot2)
library(ggforce) # only needed to draw the large annotation circles on the scatter plots
```
## 2.1. Read in the data
*These two data sets are from <https://voteview.com/data> and use the [DW-NOMINATE method](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)) to evaluate the political characteristics of individuals on a scale from -1 to 1. Each row in the data is for a different member of Congress and contains the name, an "x" value, and an "alt" value. The horizontal axis, "x", measures the level of liberal (low "x") or conservative (high "x") ideology and can also be interpreted as the position on government intervention in the economy. The vertical axis, "alt", can be interpreted as the position on cross-cutting, salient issues of the day. Most experts agree that the "x" dimension explains the vast majority of differences in voting behaviors.*
*I'm using `read.csv` to read in the data.*
```{r}
url90 = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/scatter/congress90.csv'
c90 = read.csv(url90)
head(c90)
```
```{r}
url116 = 'https://raw.githubusercontent.com/ageller/IDEAS_FSS-Vis/master/matplotlib/scatter/congress116.csv'
c116 = read.csv(url116)
head(c116)
```
## 2.2. Let's plot these as two subplots
*Is there anything that we should improve upon here?*
```{r}
# There are a couple ways that you can do this.
# One would be to create two separate plots and use the gridExtra library to put them into one figure.
# The other, which we will use here, is to use facets.
# First, add a descriptive column to each dataframe and combine them
c90$label <- "Congress 90"
c116$label <- "Congress 116"
data2 <- rbind(c90, c116)
# create the scatter plot with facets
ggplot(data = data2, aes(x = x, y = alt)) +
    geom_point() +
    facet_wrap(~label)
```
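One thing to watch for with facets: a character column is converted to a factor with alphabetically sorted levels, so "Congress 116" is drawn before "Congress 90". Setting the factor levels explicitly controls the panel order. A minimal sketch with made-up points (the `toy90` and `toy116` data frames are hypothetical stand-ins for the real data):

```{r}
library(ggplot2)

# hypothetical stand-ins for the two data sets
toy90  <- data.frame(x = c(-0.3, 0.2), alt = c(0.1, -0.4), label = "Congress 90")
toy116 <- data.frame(x = c(-0.5, 0.6), alt = c(0.3, 0.2), label = "Congress 116")
toy <- rbind(toy90, toy116)

# set the level order explicitly so the panels appear chronologically
toy$label <- factor(toy$label, levels = c("Congress 90", "Congress 116"))

p <- ggplot(toy, aes(x = x, y = alt)) +
    geom_point() +
    facet_wrap(~label)
p
```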
## 2.3. Let's improve this
- *add some descriptive labels to the axes*
- *improve the colors*
- *increase the font sizes*
```{r fig.height=8, fig.width=14, message=TRUE}
# more descriptive labels for the facets
c90$label <- "1967 - 1969"
c116$label <- "2019 - 2021"
data2 <- rbind(c90, c116)
ggplot(data = data2, aes(x = x, y = alt)) +
    geom_circle(aes(x0 = 0, y0 = 0, r = 1), inherit.aes = FALSE, fill = "white") + # draw a bounding circle
    geom_point(aes(color = x), size = 4, show.legend = FALSE) +
    geom_point(shape = 1, size = 4, color = "black") +
    facet_wrap(~label) +
    labs(title = "The US Congress Has Become More Politically Polarized",
         x = "Political ideology\n (In each panel liberal is to the left, conservative is to the right.)",
         y = "Position on salient issues") +
    scale_color_gradient2(midpoint = 0, limits = c(-1,1), low = "#0000FF", mid = "white", high = "#FF0000") +
    xlim(-1,1) + ylim(-1,1) +
    coord_fixed() + # to ensure an equal aspect ratio
    geom_vline(aes(xintercept = 0)) + # add a y-axis at 0
    geom_hline(aes(yintercept = 0)) + # add a x-axis at 0
    theme(panel.grid.major = element_blank(), # remove the grid
          panel.grid.minor = element_blank(), # remove the grid
          axis.title = element_text(size = 26), # increase the font size of the axis titles
          plot.title = element_text(size = 36), # increase the font size of the title
          strip.text.x = element_text(size = 26), # increase size of the facet labels
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.text.y = element_blank(), # remove y axis labels
          axis.ticks.y = element_blank() # remove y axis ticks
    )
# save the figure (have to specify size here)
# ggsave("scatter_r.pdf", device = "pdf", width = 14, height = 8)
```
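If installing `ggforce` is not an option, the bounding circle can also be drawn from points on the unit circle. This is an alternative sketch, not the approach used above; the `circle` data frame is a hypothetical helper.

```{r}
library(ggplot2)

# points on a circle of radius 1 centered at (0, 0)
theta <- seq(0, 2 * pi, length.out = 200)
circle <- data.frame(x = cos(theta), y = sin(theta))

p <- ggplot(circle, aes(x = x, y = y)) +
    geom_polygon(fill = "white", color = "black") + # filled circle with an outline
    coord_fixed() # equal aspect ratio so the circle is not squashed
p
```

In a plot with other layers, the same polygon could be added with its own `data = circle` argument and `inherit.aes = FALSE`, placed before the point layers so the points are drawn on top.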
## 2.4. Would this be better as two overlapping histograms?
*If we don't really care about the y axis, we don't need to use it.*
```{r fig.height=8, fig.width=15, message=TRUE}
# In order to get the look that I want, I will plot each of the groups twice
# once for the shaded interior of the histograms, using geom_histogram
# again for the outline, using stat_bin with geom = "step"
# Each of these histograms will show the density of the respective data set
# set the colors
myColors <- c("1967 - 1969" = "#386B5D", "2019 - 2021" = "#3D007A")
# define the binwidth
binwidth <- 0.1
# note: after_stat() and linewidth need ggplot2 >= 3.4.0; older versions use ..density.. and size
ggplot(data = data2, aes(x = x, color = label, fill = label)) +
    geom_histogram(data = subset(data2, label == "1967 - 1969"),
                   aes(fill = label, y = after_stat(density)),
                   binwidth = binwidth, center = 0, color = NA,
                   alpha = 0.5, linewidth = 0, show.legend = FALSE) +
    geom_histogram(data = subset(data2, label == "2019 - 2021"),
                   aes(fill = label, y = after_stat(density)),
                   binwidth = binwidth, center = 0, color = NA,
                   alpha = 0.5, linewidth = 0, show.legend = FALSE) +
    stat_bin(data = subset(data2, label == "1967 - 1969"),
             aes(y = after_stat(density)), geom = "step",
             binwidth = binwidth, center = 0,
             linewidth = 2, position = position_nudge(x = -binwidth/2.), show.legend = FALSE) +
    stat_bin(data = subset(data2, label == "2019 - 2021"),
             aes(y = after_stat(density)), geom = "step",
             binwidth = binwidth, center = 0,
             linewidth = 2, position = position_nudge(x = -binwidth/2.), show.legend = FALSE) +
    scale_fill_manual(values = myColors) + # set the color for the filled histograms
    scale_color_manual(values = myColors) + # set the color for the lines in stat_bin
    labs(title = "The US Congress Has Become More Politically Polarized.",
         subtitle = "Conservatives have moved further to the right.",
         x = "Liberal \u2192 Conservative",
         y = "") +
    annotate("text", x = 0.1, y = 0.68, label = "1967 - 1969", color = myColors["1967 - 1969"], size = 7, hjust = 0) +
    annotate("text", x = 0.4, y = 0.68, label = "2019 - 2021", color = myColors["2019 - 2021"], size = 7, hjust = 0) +
    theme_classic() +
    geom_vline(aes(xintercept = 0)) + # add a y-axis at 0
    geom_hline(aes(yintercept = 0)) + # add a x-axis at 0
    theme(axis.line = element_blank(), # remove the remaining axis lines
          axis.text.x = element_blank(), # remove x axis labels
          axis.ticks.x = element_blank(), # remove x axis ticks
          axis.text.y = element_blank(), # remove y axis labels
          axis.ticks.y = element_blank(), # remove y axis ticks
          axis.title = element_text(size = 26), # increase the font size of the axis titles
          plot.title = element_text(size = 36, hjust = 0.5), # increase the font size of the title and center align
          plot.subtitle = element_text(size = 26, hjust = 0.5) # increase the font size of the subtitle and center align
    )
# save the figure (have to specify size here)
# ggsave("hist_r.png", device = "png", width = 15, height = 8)
```
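The density scaling is what makes the two histograms comparable even if the groups differ in size: each bar height is the bin count divided by (number of observations × bin width), so the bar areas sum to 1. A minimal check with base R's `hist()` on made-up values (the vector `x_toy` is hypothetical, not the Congress data):

```{r}
# hypothetical values on the same -1 to 1 scale
set.seed(1)
x_toy <- runif(200, min = -1, max = 1)

binwidth <- 0.1
h <- hist(x_toy, breaks = seq(-1, 1, by = binwidth), plot = FALSE)

# density = count / (n * binwidth)
manual_density <- h$counts / (length(x_toy) * binwidth)
all.equal(manual_density, h$density)

# the bar areas sum to 1
sum(h$density * binwidth)
```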