In [4]:
library(tidyverse)

# Lump together factor levels into "other"

A family of functions for lumping together levels that meet some criteria.

* `fct_lump_min()`: lumps levels that appear fewer than min times.

* `fct_lump_prop()`: lumps levels that appear fewer than prop * n times.

* `fct_lump_n()`: lumps all levels except for the n most frequent (or least frequent if n < 0).

* `fct_lump_lowfreq()`: lumps together the least frequent levels, ensuring that "other" is still the smallest level.

`fct_lump()` exists primarily for historical reasons, as it automatically picks between these different methods depending on its arguments. We no longer recommend that you use it.
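
Because `fct_lump()` only chooses a method based on which arguments are supplied, existing calls can usually be rewritten with the explicit helper. The following is a minimal sketch, assuming an installed forcats version that provides the `fct_lump_*()` family; the factor `x` is illustrative data.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Compare each legacy call with the explicit helper it should correspond to.
same_n       <- identical(fct_lump(x, n = 4),      fct_lump_n(x, n = 4))
same_prop    <- identical(fct_lump(x, prop = 0.1), fct_lump_prop(x, prop = 0.1))
same_lowfreq <- identical(fct_lump(x),             fct_lump_lowfreq(x))

c(n = same_n, prop = same_prop, lowfreq = same_lowfreq)
```

Spelling out the helper makes the lumping rule visible at the call site instead of depending on which argument happens to be present.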

```r
fct_lump(
  f,
  n,
  prop,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)

fct_lump_min(f, min, w = NULL, other_level = "Other")

fct_lump_prop(f, prop, w = NULL, other_level = "Other")

fct_lump_n(
  f,
  n,
  w = NULL,
  other_level = "Other",
  ties.method = c("min", "average", "first", "last", "random", "max")
)

fct_lump_lowfreq(f, other_level = "Other")
```

**Arguments**  
`f`	
A factor (or character vector).

`n`	
Positive n preserves the most common n values. Negative n preserves the least common -n values. If there are ties, you will get at least abs(n) values.

`prop`	
Positive prop lumps values which do not appear at least prop of the time. Negative prop lumps values that appear more than -prop of the time, so only values appearing at most -prop of the time are preserved.

`w`	
An optional numeric vector giving weights for frequency of each value (not level) in f.

`other_level`	
Value of level used for "other" values. Always placed at end of levels.

`ties.method`	
A character string specifying how ties are treated. See rank() for details.

`min`	
Preserve levels that appear at least min number of times.
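
The `w` argument is easiest to see on a small factor where every level occurs once, so that only the weights decide which levels are common. This is a minimal sketch; the factor `f` and the weights `w` are made-up illustrative values.

```r
library(forcats)

f <- factor(c("a", "b", "c", "d"))
w <- c(1, 10, 5, 2)  # one weight per value of f, not per level

# Weighted frequencies: a = 1, b = 10, c = 5, d = 2.
# Keeping the two heaviest levels should leave b and c.
lumped_w <- fct_lump_n(f, n = 2, w = w)

levels(lumped_w)
as.character(lumped_w)
```

Without `w`, all four levels would tie with a count of one, so the weights are what separate `b` and `c` from the rest here.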

# Examples

In [33]:
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Frequency table
x %>% fct_count()

f,n
A,40
B,10
C,5
D,27
E,1
F,1
G,1
H,1
I,1


In [19]:
# Preserve the 4 most common categories (A, B, C, D).
# Change the rest to 'OUTLIER'. The result will have 5 categories.
x %>% fct_lump_n(n = 4, other_level = 'OUTLIER') %>% fct_count()

f,n
A,40
B,10
C,5
D,27
OUTLIER,5


In [29]:
# preserve 6 least common categories (C, E, F, G, H, I)
# change the rest to 'Other' (default). The result will have 7 categories. 
x %>% fct_lump_n(n = -6) %>% fct_count()

f,n
C,5
E,1
F,1
G,1
H,1
I,1
Other,77
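
The note on ties in the description of `n` can be checked on the same data: levels E to I all occur once, so they tie for fifth place. The sketch below assumes that `ties.method` is passed through to `rank()` as the argument description states; it compares the default with `ties.method = "first"`.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Default ties.method = "min": every level tied for fifth place is kept,
# so asking for 5 levels returns more than 5.
default_ties <- fct_lump_n(x, n = 5)
nlevels(default_ties)

# ties.method = "first" breaks the tie by level order, so only one of the
# tied levels is kept and the rest go to "Other".
first_ties <- fct_lump_n(x, n = 5, ties.method = "first")
levels(first_ties)
```

Choosing `"first"` or `"random"` gives exactly `abs(n)` kept levels at the cost of an arbitrary choice among tied levels.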


In [30]:
# Preserve categories that appear at least 6 times. Change the rest to 'outlier'
x %>% fct_lump_min(6, other_level = 'outlier') %>% fct_count()

f,n
A,40
B,10
D,27
outlier,10


In [40]:
# Lump the least frequent levels while 'Other' stays the smallest level
x %>% fct_lump_lowfreq() %>% fct_count()

f,n
A,40
D,27
Other,20
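
To see why B and C are lumped even though they are not the rarest levels, it may help to reproduce the cutoff by hand. The helper `keep_levels()` below is hypothetical and is not forcats' own code; it is one way to express the rule "keep a level only while the levels after it do not yet sum to less than it", and it is checked against `fct_lump_lowfreq()` on this data only.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Hypothetical helper: walk the levels from most to least frequent and stop
# at the first level whose count exceeds the total of everything after it.
keep_levels <- function(f) {
  counts <- sort(table(f), decreasing = TRUE)
  remaining <- sum(counts)
  for (i in seq_along(counts)) {
    remaining <- remaining - counts[[i]]
    if (counts[[i]] > remaining) {
      return(names(counts)[seq_len(i)])
    }
  }
  names(counts)
}

kept_by_hand <- keep_levels(x)
kept_by_forcats <- setdiff(levels(fct_lump_lowfreq(x)), "Other")
setequal(kept_by_hand, kept_by_forcats)
```

Here A (40) is not larger than the remaining 47, but D (27) is larger than the remaining 20, so everything after D is combined into "Other" with a count of 20.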


<hr>

In [32]:
# Relative frequency table
x %>% fct_count(prop = TRUE)

f,n,p
A,40,0.45977011
B,10,0.11494253
C,5,0.05747126
D,27,0.31034483
E,1,0.01149425
F,1,0.01149425
G,1,0.01149425
H,1,0.01149425
I,1,0.01149425


In [34]:
# Preserve categories that appear at least 3% of the time
x %>% fct_lump_prop(.03) %>% fct_count(prop = TRUE)

f,n,p
A,40,0.45977011
B,10,0.11494253
C,5,0.05747126
D,27,0.31034483
Other,5,0.05747126
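
The 3% threshold can be cross-checked with base R. With 87 observations, 0.03 * 87 = 2.61, so any level with fewer than 3 occurrences falls below it. The sketch below computes the shares by hand; no level in this data sits exactly on the threshold, so it does not show how a share equal to `prop` is treated.

```r
library(forcats)

x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Share of each level, computed without forcats.
shares <- prop.table(table(x))
kept_by_hand <- names(shares)[shares > 0.03]

# Levels that fct_lump_prop() leaves untouched.
lumped <- fct_lump_prop(x, prop = 0.03)
kept_by_forcats <- setdiff(levels(lumped), "Other")

identical(kept_by_hand, kept_by_forcats)
```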


In [37]:
# Preserve categories that appear at most 6% of the time. Change the rest to 'outlier'
x %>% fct_lump_prop(-0.06, other_level = 'outlier') %>% fct_count(prop = TRUE)

f,n,p
C,5,0.05747126
E,1,0.01149425
F,1,0.01149425
G,1,0.01149425
H,1,0.01149425
I,1,0.01149425
outlier,77,0.88505747
