Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimal joining? #18

Open
Deleetdk opened this issue Sep 19, 2016 · 5 comments
Open

Optimal joining? #18

Deleetdk opened this issue Sep 19, 2016 · 5 comments

Comments

@Deleetdk
Copy link

I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr explaining the problem here.

In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.

@dgrtwo
Copy link
Owner

dgrtwo commented Sep 19, 2016

The general approach I'd use is to join the two using a high (possibly infinite) distance, to use the distance_col argument to add a distance column, and then to pick the best match in each group (which I'd do with group_by and top_n from dplyr, but there are alternatives).

EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")

library(dplyr)
library(fuzzyjoin)

en <- data_frame(EN)
da <- data_frame(DA)

en %>%
  stringdist_inner_join(da, by = c(EN = "DA"), max_dist = Inf, distance_col = "distance") %>%
  group_by(EN) %>%
  top_n(1, -distance)

I do see that this would still allow 1-to-many matches, though. You're looking for a global minimum of distance?

@Deleetdk
Copy link
Author

I expanded the knitr to include my (experimental!) implementation of the proposed function:

http://rpubs.com/EmilOWK/209456

It enforces 1-to-1 joining. In some cases, this will result in incorrect joins. E.g. if there are two pairs of cases with wildly different names, they may end up being paired up in the opposite way. For relatively small datasets, this should be rare, but it remains to be tested in the wild.

However, I use another method of dealing with names of political units: translate names to ISO names -- using fuzzy matching if necessary -- then join as normal. This knitr showcases that kind of solution.

Your call above not only violates 1-to-1, it gets the results wrong twice because it matches Germany with Danmark and Rusland, and not with Tyskland, the correct match.

@ahcyip
Copy link

ahcyip commented Jun 6, 2017

group_by and top_n works for me! Thanks.

@prokopyev
Copy link

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

@moh-salah
Copy link

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

Nice. It would be useful to make it work with pipes and allow joining on more than one column.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants