Blocking as a feature for scoring #1103

fgregg · 2022-09-24T01:02:09Z

Right now, blocking and scoring are two distinct phases.

All the information about how two records came to be blocked together is unused by the scorer. This is a bit silly, as the fact that two records are blocked together by multiple predicates could be a pretty good indicator of co-reference.

I'm not sure what the best way to take advantage of blocking information in scoring is, though.

A few ideas:

1. Ensemble model: treat each blocking predicate as a classifier, and put them in an ensemble with the scorer.
2. Blocking as a feature: add dummy features indicating which predicate rules cover a pair. These features get fed into the scorer.

In both cases, I'm not quite sure how to set up the training.
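To make the second idea concrete, here is a minimal sketch of what "blocking as a feature" could look like. It is not the library's API: the predicate functions, the `blocking_features` and `featurize` helpers, and the record fields (`name`, `zip`) are all hypothetical, and a predicate is assumed to map a record to a set of block keys.

```python
from typing import Callable, Dict, List, Sequence, Set

Record = Dict[str, str]
# Assumption: a blocking predicate maps a record to a set of block keys.
Predicate = Callable[[Record], Set[str]]


def first_three_of_name(record: Record) -> Set[str]:
    """Hypothetical predicate: first three characters of the lowercased name."""
    name = record.get("name", "").lower()
    return {name[:3]} if name else set()


def same_zip(record: Record) -> Set[str]:
    """Hypothetical predicate: the whole zip code."""
    zip_code = record.get("zip", "")
    return {zip_code} if zip_code else set()


PREDICATES: List[Predicate] = [first_three_of_name, same_zip]


def blocking_features(
    a: Record, b: Record, predicates: Sequence[Predicate] = PREDICATES
) -> List[float]:
    """One dummy feature per predicate: 1.0 if the pair shares a block key."""
    return [1.0 if predicate(a) & predicate(b) else 0.0 for predicate in predicates]


def featurize(a: Record, b: Record, distances: List[float]) -> List[float]:
    """Append the blocking dummies to the usual field-distance features."""
    return distances + blocking_features(a, b)


a = {"name": "Robert Smith", "zip": "60601"}
b = {"name": "Rob Smith", "zip": "60601"}
c = {"name": "Alice Jones", "zip": "60601"}

print(blocking_features(a, b))  # covered by both predicates
print(blocking_features(a, c))  # covered by the zip predicate only
```

One thing this makes visible: if only blocked pairs are ever scored, at least one dummy is always 1, so the signal the scorer can learn from is which predicates fire and how many.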

NickCrews · 2022-09-24T20:41:04Z

Splink uses something very similar to method 2. See https://youtu.be/msz3T741KQI?t=2035 for a nice explanation of how they think about the different "types" of comparisons that can happen. The whole video has some other great thoughts and visualizations too.

fgregg mentioned this issue Sep 24, 2022

Split up Datamodel into predicates, rename to Featurizer #1088

Open
