/
inspect_cat_examples.Rmd
91 lines (60 loc) · 2.98 KB
/
inspect_cat_examples.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
---
title: "Exploring and visualising categorical features"
output: github_document
---
Illustrative data: `starwars`
---
The examples below make use of the `starwars` from the `dplyr` package.
```{r, message=FALSE, warning=FALSE}
library(dplyr)
data(starwars, package = "dplyr")
# print the first few rows
head(starwars)
```
## `inspect_cat()` for a single data frame
`inspect_cat()` returns a tibble summarising categorical features in a data frame, combining the functionality of the `inspect_imb()` and `table()` functions. The tibble generated contains the columns
+ `col_name` name of each categorical column
+ `cnt` the number of unique levels in the feature
+ `common` the most common level (see also `inspect_imb()`)
+ `common_pcnt` the percentage occurrence of the most dominant level
+ `levels` a list of tibbles each containing frequency tabulations of all levels
```{r}
library(inspectdf)
# explore the categorical features
x <- inspect_cat(starwars)
x
```
For example, the levels for the `hair_color` column are
```{r}
# show frequency tibble for `hair_color` column:
x$levels$hair_color
```
Note that by default, if missing (`NA`) values are present, they are counted as a distinct categorical level. A barplot showing the composition of each categorical column can be created using the `show_plot()` function. Note how missing values are shown as grey bars:
```{r}
x %>% show_plot()
```
The argument `high_cardinality` in the `show_plot()` function can be used to bundle together categories that occur only a small number of times. For example, to combine categories only occurring once, use:
```{r}
x %>%
show_plot(high_cardinality = 1)
```
The resulting bundles are shown in purple.
## `inspect_cat()` for two data frames
To illustrate the comparison of two data frames, we first create two new data frames by randomly sampling the rows of `starwars` and also dropping some of the columns. The results are assigned to the objects `star_1` and `star_2`:
```{r}
# sample 50 rows from `starwars`
star_1 <- starwars %>% sample_n(50)
# sample 50 rows from `starwars` and drop the first two columns
star_2 <- starwars %>% sample_n(50) %>% select(-1, -2)
```
To compare the character columns in a pair of data frames, use the `inspect_cat()`:
```{r}
inspect_cat(star_1, star_2)
```
The tibble returned has the following columns
+ `jsd`, the Jensen-Shannon divergence: a measure of how different the distribution of levels are between columns with the same name present in both data frames. Values are between 0 and 1 - values closer to 1 indicate bigger differences in distribution.
+ `pval`, _p_ values associated with a modified $\chi^2$ test of the relative frequencies of levels in columns with the same name present in both data frames.
+ `lvls_1` and `lvl2_2` are named list columns containing the frequency tables for each column in the first and second data frame input to `inspect_cat()`
<!-- # ```{r} -->
<!-- # inspect_cat(star_1, star_2) %>% show_plot() -->
<!-- # ``` -->