Optimal joining? #18

Deleetdk · 2016-09-19T20:38:19Z

I like the idea of this package, but it does not work well in practice for my needs. I wrote a knitr explaining the problem here.

In brief: often the task is to match up two n-length vectors of strings against each other 1-to-1. The present join algorithm does not enforce 1-to-1 joins, i.e. sometimes one string gets joined to two others, and sometimes one gets joined to none.

dgrtwo · 2016-09-19T21:07:55Z

The general approach I'd use is to join the two using a high (possibly infinite) distance, to use the distance_col argument to add a distance column, and then to pick the best match in each group (which I'd do with group_by and top_n from dplyr, but there are alternatives).

EN = c("Denmark", "Norway", "USA", "Russia", "Germany")
DA = c("Danmark", "Norge", "USA", "Rusland", "Tyskland")

library(dplyr)
library(fuzzyjoin)

en <- data_frame(EN)
da <- data_frame(DA)

en %>%
  stringdist_inner_join(da, by = c(EN = "DA"), max_dist = Inf, distance_col = "distance") %>%
  group_by(EN) %>%
  top_n(1, -distance)

I do see that this would still allow 1-to-many matches, though. You're looking for a global minimum of distance?

Deleetdk · 2016-09-19T21:37:31Z

I expanded the knitr to include my (experimental!) implementation of the proposed function:

http://rpubs.com/EmilOWK/209456

It enforces 1-to-1 joining. In some cases, this will result in incorrect joins. E.g. if there are two pairs of cases with wildly different names, they may end up being paired up in the opposite way. For relatively small datasets, this should be rare, but it remains to be tested in the wild.

However, I use another method of dealing with names of political units: translate names to ISO names -- using fuzzy matching if necessary -- then join as normal. This knitr showcases that kind of solution.

Your call above not only violates 1-to-1, it gets the results wrong twice because it matches Germany with Danmark and Rusland, and not with Tyskland, the correct match.

ahcyip · 2017-06-06T07:27:10Z

group_by and top_n works for me! Thanks.

prokopyev · 2018-05-21T21:02:11Z

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

moh-salah · 2018-12-12T20:13:37Z

This gist offers a solution with its best_only parameter:

https://gist.github.com/gdmcdonald/bacfaafe2cccff18b6a81b319cbc3580

Nice. It would be useful to make it work with pipes and allow joining on more than one column.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimal joining? #18

Optimal joining? #18

Deleetdk commented Sep 19, 2016

dgrtwo commented Sep 19, 2016

Deleetdk commented Sep 19, 2016

ahcyip commented Jun 6, 2017

prokopyev commented May 21, 2018

moh-salah commented Dec 12, 2018

Optimal joining? #18

Optimal joining? #18

Comments

Deleetdk commented Sep 19, 2016

dgrtwo commented Sep 19, 2016

Deleetdk commented Sep 19, 2016

ahcyip commented Jun 6, 2017

prokopyev commented May 21, 2018

moh-salah commented Dec 12, 2018