dfuzz

The goal of {dfuzz} is to help you cleaning up a messy column of strings of characters in your tibble or data.frame.

This package is highly experimental and is not yet ready for being used for real applications.

It is build around two dependencies which themselves have no dependencies:

{rlang}
{stringdist}, and it is possible to use the full power of the function stringdist() from this excellent package.

{dfuzz} aims at being compatible with both tidyverse and base R dialects.

Installation

You can install this package using {remotes} (or {devtools}):

remotes::install_github("courtiol/dfuzz")

Example

library(dfuzz)

## a toy example:
test_df <- data.frame(fruit = c("banana", "blueberry", "limon", "pinapple",
                                "aple", "apple", "ApplE", "bonana"))
test_df
#>       fruit
#> 1    banana
#> 2 blueberry
#> 3     limon
#> 4  pinapple
#> 5      aple
#> 6     apple
#> 7     ApplE
#> 8    bonana

## fast and dirty workflow:
clean_df1 <- fuzzy_tidy(test_df, fruit)
clean_df1
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>          aple       aple
#> 6     apple        <NA>          aple       aple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## more subtle workflow:
template_fruit <- fuzzy_match(test_df, fruit)
template_fruit
#>   selected  syn_1  syn_2
#> 1     aple   aple  apple
#> 2   banana banana bonana
template_fruit$selected[1] <- "apple"
clean_df2 <- fuzzy_tidy(test_df, fruit, template_fruit)
clean_df2
#>       fruit fruit.clean fruit.cleaned fruit.tidy
#> 1    banana        <NA>        banana     banana
#> 2 blueberry   blueberry          <NA>  blueberry
#> 3     limon       limon          <NA>      limon
#> 4  pinapple    pinapple          <NA>   pinapple
#> 5      aple        <NA>         apple      apple
#> 6     apple        <NA>         apple      apple
#> 7     ApplE       ApplE          <NA>      ApplE
#> 8    bonana        <NA>        banana     banana

## fast and dirty workflow with {tidyverse}:
library(tidyverse)
#> ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
#> ✓ tibble  3.0.4     ✓ dplyr   1.0.2
#> ✓ tidyr   1.1.2     ✓ stringr 1.4.0
#> ✓ readr   1.4.0     ✓ forcats 0.5.0
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
test_df %>%
  fuzzy_tidy(fruit) %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 banana   
#> 2 blueberry
#> 3 limon    
#> 4 pinapple 
#> 5 aple     
#> 6 aple     
#> 7 ApplE    
#> 8 banana

## more subtle workflow with {tidyverse}:
test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_match(fruit) -> template_fruit
template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <chr>    <chr>  <chr> 
#> 1 Aple     Aple   Apple 
#> 2 Banana   Banana Bonana

template_fruit %>%
  mutate(selected = fct_recode(selected, Apple = "Aple")) -> better_template_fruit

better_template_fruit
#> # A tibble: 2 x 3
#>   selected syn_1  syn_2 
#>   <fct>    <chr>  <chr> 
#> 1 Apple    Aple   Apple 
#> 2 Banana   Banana Bonana

test_df %>%
  mutate(fruit = str_to_title(fruit)) %>%
  fuzzy_tidy(fruit, better_template_fruit) -> clean_df3
clean_df3
#> # A tibble: 8 x 4
#>   fruit     fruit.clean fruit.cleaned fruit.tidy
#>   <chr>     <chr>       <chr>         <chr>     
#> 1 Banana    <NA>        Banana        Banana    
#> 2 Blueberry Blueberry   <NA>          Blueberry 
#> 3 Limon     Limon       <NA>          Limon     
#> 4 Pinapple  Pinapple    <NA>          Pinapple  
#> 5 Aple      <NA>        Apple         Apple     
#> 6 Apple     <NA>        Apple         Apple     
#> 7 Apple     <NA>        Apple         Apple     
#> 8 Bonana    <NA>        Banana        Banana

clean_df3 %>%
  mutate(fruit = fruit.tidy) %>%
  select(-contains("fruit."))
#> # A tibble: 8 x 1
#>   fruit    
#>   <chr>    
#> 1 Banana   
#> 2 Blueberry
#> 3 Limon    
#> 4 Pinapple 
#> 5 Apple    
#> 6 Apple    
#> 7 Apple    
#> 8 Banana

Help & feedbacks wanted!

If you find that this package is an idea worth pursuing, please let me know. Developing is always more fun when it becomes a collaborative work. So please also email me (or leave an issue) if you want to get involved!

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
R		R
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
dfuzz.Rproj		dfuzz.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dfuzz

Installation

Example

Help & feedbacks wanted!

About

Releases

Packages

Languages

courtiol/dfuzz

Folders and files

Latest commit

History

Repository files navigation

dfuzz

Installation

Example

Help & feedbacks wanted!

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages