-
Notifications
You must be signed in to change notification settings - Fork 8
/
15-factors.Rmd
268 lines (207 loc) · 7.6 KB
/
15-factors.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
```{r setup15, include=FALSE, message = FALSE, warning=FALSE}
knitr::opts_chunk$set(echo = TRUE, cache = TRUE)
library(ggplot2)
library(dplyr)
library(tidyr)
library(nycflights13)
library(lubridate)
library(forcats)
```
# Ch. 15: Factors
```{block2, type='rmdimportant'}
**Key questions:**
* 15.3.1. #1, 3 (make the visualization and table)
* 15.5.1. #1
```
```{block2, type='rmdtip'}
**Functions and notes:**
```
* `factor` make variable a factor based on `levels` provided
* `fct_rev` reverses order of factors
* `fct_infreq` orders levels in increasing frequency
* `fct_relevel` lets you move levels to front of order
* `fct_inorder` orders existing factor by order values show-up in in data
* `fct_reorder` orders input factors by other specified variables value (median by default), 3 inputs: `f`: factor to modify, `x`: input var to order by, `fun`: function to use on x, also have `desc` option
* `fct_reorder2` orders input factor by max of other specified variable (good for making legends align as expected)
* `fct_recode` lets you change value of each level
* `fct_collapse` is variant of `fct_recode` that allows you to provide multiple old levels as a vector
* `fct_lump` allows you to lump together small groups, use `n` to specify number of groups to end with
Create factors by order they come-in:
Avoiding dropping levels with `drop = FALSE`
```{r}
gss_cat %>%
ggplot(aes(race))+
geom_bar()+
scale_x_discrete(drop = FALSE)
```
## 15.4: General Social Survey
### 15.3.1
1. Explore the distribution of `rincome` (reported income). What makes the default bar chart hard to understand? How could you improve the plot?
* Default bar chart has categories across the x-asix, I flipped these to be across the y-axis
* Also, have highest values at the bottom rather than at the top and have different version of NA showing-up at both top and bottom, all should be on one side
* In `bar_prep`, I used reg expressions to extract the numeric values, arrange by that, and then set factor levels according to the new order
+ Solution is probably unnecessarily complicated...
```{r}
bar_prep <- gss_cat %>%
tidyr::extract(col = rincome, into =c("dollars1", "dollars2"), "([0-9]+)[^0-9]*([0-9]*)", remove = FALSE) %>%
mutate_at(c("dollars1", "dollars2"), ~ifelse(is.na(.) | . == "", 0, as.numeric(.))) %>%
arrange(dollars1, dollars2) %>%
mutate(rincome = fct_inorder(rincome))
bar_prep %>%
ggplot(aes(x = rincome)) +
geom_bar() +
scale_x_discrete(drop = FALSE) +
coord_flip()
```
1. What is the most common `relig` in this survey? What's the most common `partyid`?
```{r}
gss_cat %>%
count(relig, sort = TRUE)
gss_cat %>%
count(partyid, sort = TRUE)
```
* `relig` most common -- Protestant, 10846,
* `partyid` most common -- Independent, 4119
1. Which `relig` does `denom` (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
*With visualization:*
```{r}
gss_cat %>%
ggplot(aes(x=relig, fill=denom))+
geom_bar()+
coord_flip()
```
* Notice which have the widest variety of colours -- are protestant, and Christian slightly
*With table:*
```{r}
gss_cat %>%
count(relig, denom) %>%
count(relig, sort = TRUE)
```
## 15.4: Modifying factor order
### 15.4.1
1. There are some suspiciously high numbers in `tvhours`. Is the mean a good summary?
```{r}
gss_cat %>%
mutate(tvhours_fct = factor(tvhours)) %>%
ggplot(aes(x = tvhours_fct)) +
geom_bar()
```
* Distribution is reasonably skewed with some values showing-up as 24 hours which seems impossible, in addition to this we have a lot of `NA` values, this may skew results
* Given high number of missing values, `tvhours` may also just not be reliable, do `NA`s associate with other variables? -- Perhaps could try and impute these `NA`s
1. For each factor in `gss_cat` identify whether the order of the levels is arbitrary or principled.
```{r}
gss_cat %>%
purrr::keep(is.factor) %>%
purrr::map(levels)
```
* `rincome` is principaled, rest are arbitrary
1. Why did moving "Not applicable" to the front of the levels move it to the bottom of the plot?
* Becuase is moving this factor to be first in order
## 15.5: Modifying factor levels
Example with `fct_recode`
```{r}
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
```
### 15.5.1
1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
*As a line plot: *
```{r}
gss_cat %>%
mutate(partyid = fct_collapse(
partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(year, partyid) %>%
group_by(year) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(
x = year,
y = prop,
colour = fct_reorder2(partyid, year, prop)
)) +
geom_line() +
labs(colour = "partyid")
```
*As a bar plot: *
```{r}
gss_cat %>%
mutate(partyid = fct_collapse(
partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(year, partyid) %>%
group_by(year) %>%
mutate(prop = n / sum(n)) %>%
ungroup() %>%
ggplot(aes(
x = year,
y = prop,
fill = fct_reorder2(partyid, year, prop)
)) +
geom_col() +
labs(colour = "partyid")
```
* Suggests proportion of republicans has gone down with independents and other going up.
1. How could you collapse `rincome` into a small set of categories?
```{r}
other = c("No answer", "Don't know", "Refused", "Not applicable")
high = c("$25000 or more", "$20000 - 24999", "$15000 - 19999", "$10000 - 14999")
med = c("$8000 to 9999", "$7000 to 7999", "$6000 to 6999", "$5000 to 5999")
low = c("$4000 to 4999", "$3000 to 3999", "$1000 to 2999", "Lt $1000")
mutate(gss_cat,
rincome = fct_collapse(
rincome,
other = other,
high = high,
med = med,
low = low
)) %>%
count(rincome)
```
## Appendix
### Viewing all levels
A few ways to get an initial look at the levels or counts across a dataset
```{r, eval = FALSE}
gss_cat %>%
purrr::map(unique)
gss_cat %>%
purrr::map(table)
gss_cat %>%
purrr::map(table) %>%
purrr::map(plot)
gss_cat %>%
mutate_if(is.factor, ~fct_lump(., 14)) %>%
sample_n(1000) %>%
GGally::ggpairs()
```
*Percentage NA each level*:
```{r, eval = FALSE}
gss_cat %>%
purrr::map(~(sum(is.na(.x)) / length(.x))) %>%
as_tibble()
# essentially equivalent...
gss_cat %>%
summarise_all(~(sum(is.na(.)) / length(.)))
```
*Print all levels of tibble*:
```{r, eval = FALSE}
gss_cat %>%
count(age) %>%
print(n = Inf)
```