# Data Transformation 3: Data Cleaning
1. Hygiene for Cleaning
1. Outlier detection and handling
1. Imputation (missing values)
1. String proximity and Entity Resolution


## Hygiene for Data "Cleaning"
The very term "data cleaning" is problematic.
- Presumes that the raw data is "truly" dirty. We don't know that.
- We are inherently imposing a model over the data
- Let us not impose our will on the raw data!

Better to think of today's tasks as Transformation functions!
- With recorded input and output
- And the "lineage" of how the output is computed

### Embrace some simple metadata 
If you are transforming just one column
- Keep the original column
- Add the derived column immediately to its right if possible
- Name the derived column something meaningful

If you are transforming much of a data set
- Create a new derived dataset and store it "near" the original
  - filesystem directory, database schema, git repo, etc
- Name the derived dataset something that hints at the lineage

In all cases, keep the transformation code!
- Manage/version it as you would source code
- Document it as you would source code
- Hopefully in the same repository/toolchain as the data

We'll talk more about metadata and data lineage later in the semester.

# Outlier Detection and Handling
What is an "outlier"?

## Normal Distribution
Center and dispersion (spread).
- Center: mean
- Dispersion: stddev

Outliers are values that lie "far" from the center, measured in units of the dispersion.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Fig size
plt.rcParams["figure.figsize"]=12,8

## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://jmh@localhost:5432/jmh

## Normal Distributions are nice
Let's set up a good old-fashioned univariate Gaussian (normal) in SQL:

In [None]:
%sql DROP TABLE IF EXISTS observations CASCADE;
%sql CREATE TABLE observations AS \
     SELECT normal_rand AS x FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)

## Detecting Gaussian Outliers
One rule of thumb: outliers are values more than 2 stddevs ($2\sigma$) from the mean on either side.
- For a normal distribution, about 95% of the data lies within 2 stddevs of the mean
- So outliers are roughly the values below p2.5 and above p97.5
- We could of course pick $3\sigma$ (99.7% of the data) or more
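
The same rule of thumb can be sketched in plain Python. This is a minimal illustration, not the notebook's method: `random.gauss` stands in for `normal_rand`, and the seed is an arbitrary choice to make the run repeatable.

```python
import random
import statistics

random.seed(0)  # arbitrary seed so the sketch is repeatable
xs = [random.gauss(50, 5) for _ in range(1000)]  # stand-in for normal_rand(1000, 50, 5)

mean = statistics.fmean(xs)
sd = statistics.stdev(xs)  # sample standard deviation, like SQL stddev()
lo, hi = mean - 2 * sd, mean + 2 * sd

# Anything outside [lo, hi] is an outlier under the 2-sigma rule.
outliers = [x for x in xs if not (lo <= x <= hi)]
share_inside = 1 - len(outliers) / len(xs)
print(f"bounds=({lo:.1f}, {hi:.1f}) outliers={len(outliers)} inside={share_inside:.1%}")
```

The share inside the bounds should land near 95%, though it varies from sample to sample.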

In [None]:
%%sql
CREATE OR REPLACE VIEW normal_outliers AS
 WITH bounds AS (
 SELECT avg(x) - 2*stddev(x) AS lo, avg(x) + 2*stddev(x) AS hi
   FROM observations
 )
 SELECT x FROM observations o, bounds b
  WHERE x NOT BETWEEN b.lo AND b.hi;
    
SELECT * FROM normal_outliers;

## Handling Gaussian Outliers
- One option is simply to *delete* the outlying values from consideration. 
- Now let's look at the data with and without outliers.

In [None]:
results = %sql SELECT x, 'original' as label FROM observations \
                UNION ALL \
               SELECT x, 'cleaned' FROM observations \
                WHERE x NOT IN (SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20)

## Non-Gaussian Data
What if you corrupt just one value to be very large? 
- Right-skewed: not a normal distribution anymore! 
- Messes up "textbook" definitions of center and dispersion!
    - They assumed a Gaussian distribution

In [None]:
## corrupt one value
%sql UPDATE observations SET x = x*10 \
      WHERE x = (SELECT MAX(x) FROM OBSERVATIONS);

### What Happened??

In [None]:
results = %sql SELECT x, 'orig' as label FROM observations \
                UNION ALL \
               SELECT x, 'cleaned' FROM observations \
                WHERE x NOT IN (SELECT * FROM normal_outliers)
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20, rug=True)
%sql select x from normal_outliers;

## Masking
The $10x$ value is *masking* our earlier outliers
- Even the one on the left!
- We can mask any outlier we please with an even bigger outlier!

Gaussian definitions of "center" and "dispersion" are not **robust**
  - 1 value can drag the mean and stddev *as far as you want*!

Robust measures should tolerate some corruption
  - after all, the whole point is to handle dirty data!
  - we'll define robustness formally shortly
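
Masking is easy to reproduce outside the database. The sketch below is an assumption-laden stand-in for the SQL above: it generates its own sample with an arbitrary seed, applies the 2-stddev rule, then multiplies only the maximum by 10, as in the UPDATE above.

```python
import random
import statistics

def two_sigma_outliers(values):
    """Return the values more than 2 sample stddevs from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 2 * sd]

random.seed(1)  # arbitrary seed
xs = [random.gauss(50, 5) for _ in range(1000)]
before = two_sigma_outliers(xs)

# Corrupt only the maximum value.
corrupted = list(xs)
i = corrupted.index(max(corrupted))
corrupted[i] *= 10
after = two_sigma_outliers(corrupted)

print(len(before), "outliers before;", len(after), "after:", after)
```

The single corrupted value inflates the stddev so much that it becomes the only value the rule still flags.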

## Trimming: Percentile Outliers
- Suppose we define the outliers by order statistics (percentiles)
- *Trimming*: dropping outliers based on order statistics
- E.g. a "1% trimmed distribution" drops the 1% on either end
  - p1 and p99.
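
A minimal Python sketch of trimming follows. The helper `percentile_cont` is written here for illustration, in the spirit of the SQL aggregate of the same name (linear interpolation between neighboring order statistics); the data and seed are arbitrary.

```python
import random

def percentile_cont(values, fraction):
    """Linearly interpolated percentile, in the spirit of SQL percentile_cont."""
    ordered = sorted(values)
    pos = fraction * (len(ordered) - 1)
    below = int(pos)
    above = min(below + 1, len(ordered) - 1)
    return ordered[below] + (pos - below) * (ordered[above] - ordered[below])

def trim(values, k=0.01):
    """Drop values below the k percentile or above the (1 - k) percentile."""
    lo, hi = percentile_cont(values, k), percentile_cont(values, 1 - k)
    return [v for v in values if lo <= v <= hi]

random.seed(2)  # arbitrary seed
xs = [random.gauss(50, 5) for _ in range(1000)]
trimmed = trim(xs, 0.01)
print(len(xs), "->", len(trimmed))
```

Note that the cut points depend only on the ordering of the values, not on how far out the extreme values are.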

In [None]:
%sql DROP TABLE observations CASCADE;
%sql CREATE TABLE observations AS SELECT normal_rand AS x FROM normal_rand(1000, 50, 5);

results = %sql SELECT x FROM observations
sns.displot(results.dict(), fill=True, kde=True, bins=20)

In [None]:
%%sql
CREATE OR REPLACE VIEW p1p99 AS
SELECT percentile_cont(.01) WITHIN GROUP (ORDER BY x) AS p1,
           percentile_cont(.99) WITHIN GROUP (ORDER BY x) AS p99
      FROM observations;
SELECT * FROM p1p99;

In [None]:
%%sql
CREATE OR REPLACE VIEW trimmed_observations AS
SELECT o.x, 'trimmed' AS label
  FROM observations o, p1p99 p
 WHERE o.x BETWEEN p.p1 AND p.p99
UNION ALL
SELECT o.x, 'original' AS label
  FROM observations o;

CREATE OR REPLACE VIEW trimmed_outliers AS
SELECT o.*
  FROM observations o, p1p99 p
 WHERE o.x NOT BETWEEN p.p1 AND p.p99;

In [None]:
results = %sql SELECT * from trimmed_observations
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20)

results = %sql SELECT x from trimmed_outliers
results


What if you corrupt just one value to be very large? 

In [None]:
%sql UPDATE observations SET x = x*10 WHERE x = (SELECT MAX(x) FROM observations)

In [None]:
# WHERE x < 500
results = %sql SELECT * FROM trimmed_observations
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20)
%sql SELECT * FROM trimmed_outliers;

## Is Trimming More Robust than Stddev-Based?
Minor masking on the right, but not nearly as bad.

- Maybe we should have trimmed less? More? How much? Hmmm...
  - Seems like it should depend on the data!
- Before we answer that, one more standard outlier handling scheme.

## Winsorizing
- Trimming:
  - dropped the $k\%$ tails 
- Winsorizing:
  - *replace* those values with the $k$-percentile value.
  - $k\%$ tails contain the same repeated value
- This preserves the probability density of the tails.
  - usually not a big difference from trimming
    - mostly seen in the stddev, not the mean
  - Winsorize preferred to Trimming if something downstream forbids NULL
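
To make the contrast with trimming concrete, here is a small sketch on toy data. The cut points 48 and 52 are hand-picked stand-ins for the p1 and p99 values, not computed percentiles.

```python
import statistics

def winsorize(values, lo, hi):
    """Clamp each value into [lo, hi] instead of dropping it."""
    return [min(max(v, lo), hi) for v in values]

xs = [1, 48, 49, 50, 51, 52, 500]  # toy data with one wild value on each side
lo, hi = 48, 52                    # hypothetical stand-ins for p1 and p99
wins = winsorize(xs, lo, hi)
trimmed = [v for v in xs if lo <= v <= hi]

print(wins)
print(len(wins), len(trimmed))  # winsorizing keeps the row count
print(statistics.stdev(wins), statistics.stdev(trimmed))
```

Winsorizing keeps every row, and the repeated boundary values leave the stddev a little larger than trimming does.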

In [None]:
%%sql
CREATE OR REPLACE VIEW winsorized_observations AS
SELECT CASE WHEN o.x BETWEEN p.p1 AND p.p99 THEN o.x
            WHEN o.x < p.p1 THEN p.p1
            WHEN o.x > p.p99 THEN p.p99
        END AS x,
      'winsorized' AS label
  FROM observations o, p1p99 p
UNION ALL
SELECT o.x, 'original' AS label
  FROM observations o;

The Winsorized distribution against the original:

In [None]:
# WHERE x < 500
results = %sql SELECT * from winsorized_observations
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20)

Seems close to trimmed. Let's look more closely:

In [None]:
%%sql 
SELECT 'orig' AS distribution, min(x),
       percentile_disc(.25) WITHIN GROUP (ORDER BY x) as p25,
       percentile_disc(.50) WITHIN GROUP (ORDER BY x) as median,
       percentile_disc(.75) WITHIN GROUP (ORDER BY x) as p75,
       max(x), avg(x), stddev(x), count(x) 
       FROM observations
UNION ALL
SELECT 'winsorized', min(x),
       percentile_disc(.25) WITHIN GROUP (ORDER BY x) as p25,
       percentile_disc(.50) WITHIN GROUP (ORDER BY x) as median,
       percentile_disc(.75) WITHIN GROUP (ORDER BY x) as p75,
       max(x), avg(x), stddev(x), count(x) 
       FROM winsorized_observations WHERE label = 'winsorized'
UNION ALL 
SELECT 'trimmed', min(x),
       percentile_disc(.25) WITHIN GROUP (ORDER BY x) as p25,
       percentile_disc(.50) WITHIN GROUP (ORDER BY x) as median,
       percentile_disc(.75) WITHIN GROUP (ORDER BY x) as p75,
       max(x), avg(x), stddev(x), count(x) 
       FROM trimmed_observations WHERE label = 'trimmed';

## Robustness
Robustness is a worst-case analysis. So we think in terms of an *adversary*.

Suppose the adversary could "corrupt" data values arbitrarily. What does that do to an "estimator" (i.e. an aggregate like mean, stddev, etc.)?
- Defn: **Breakdown Point** of an estimator
  - smallest fraction of values the adversary must corrupt to return an *arbitrary result*
  - i.e. to change the center and spread to *whatever it wants*
  - Depends on the definition of center and spread!

- What is the breakdown point of the $1\%$ trimmed mean?

- What $k\%$ gives us maximum robustness via trimming? 
- Could we do any better with another scheme?

### How Robust Can You Be?
The median of any distribution is maximally robust
- up to $50\%$ corruption of the data!
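
A quick way to see the breakdown point is to play the adversary. In this sketch the data (the integers 1 to 101) and the junk value are arbitrary choices; the adversary overwrites the n largest values.

```python
import statistics

clean = list(range(1, 102))  # 101 values: 1..101, median 51

def corrupt(values, n, junk=1e12):
    """Adversary replaces the n largest values with an arbitrarily large number."""
    ordered = sorted(values)
    return ordered[: len(ordered) - n] + [junk] * n

for n in (1, 50, 51):
    dirty = corrupt(clean, n)
    print(n, statistics.fmean(dirty), statistics.median(dirty))
```

One corrupted value is enough to move the mean wherever the adversary likes, while the median stays at a clean value until more than half of the data is corrupted.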

## Robust Estimators of a Distribution
- Center: Median
- Dispersion: the Median Absolute Deviation (MAD)

Given dataset $X$ with $\tilde X = \mbox{median}(X)$, we define the MAD as:
$$MAD(X) = \mbox{median}(|X_i - \tilde X|)$$

The Median and MAD are both maximally robust!
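
The definition translates directly into Python. One caveat of this sketch: `statistics.median` averages the two middle values for even-sized inputs (closer to `percentile_cont`), whereas the SQL in this notebook uses `percentile_disc`.

```python
import statistics

def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)

xs = [2, 4, 4, 5, 6, 9, 1000]  # toy data with one wild value
print(statistics.median(xs), mad(xs), statistics.stdev(xs))
```

The wild value dominates the stddev but barely registers in the median or the MAD.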

In [None]:
%%sql
-- percentile_disc returns an actual data value near the percentile
-- percentile_cont returns an interpolated value at the percentile
CREATE OR REPLACE VIEW median AS
(SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median
  FROM observations);

In [None]:
%%sql
CREATE OR REPLACE VIEW mad AS
WITH
absdevs AS
(SELECT abs(x - median) as d
   FROM observations, median)
SELECT percentile_disc(0.5) WITHIN GROUP (ORDER BY d) as mad
  FROM absdevs;
    
SELECT median, mad
  FROM median, mad;

## Other Robust Centers/Dispersion
You'll commonly see people use:
- center: k% trimmed mean
- center: k% winsorized mean
- dispersion: Interquartile Range (IQR: p75 - p25)

Recall the Tukey numbers for assessing univariate numerics:
- min, p25, median, p75, max: robust center/dispersion
- center: median
- spread: informed by min/max *and* IQR

## A Robust Outlier Metric: Hampel x84
Quartiles are just a rule of thumb: fixed percentile cutoffs ignore how dispersed the data actually is.

Back to the earlier question: How much should we trim or winsorize?
- Let's use our intuition from the normal distribution.
- E.g. "$2\sigma$ from the mean"

### Hampel X84: Intuition from Normal
"Translate" normal estimators to robust center/dispersion.

- assume a standard normal distribution (mean 0 stddev 1)
- convert standard deviation to MADs
    - in this standard case, 1 stddev = 1.4826 MADs!
    - (Challenge: write Python or SQL to test that!)
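
One possible take on the challenge, in Python. For a standard normal, half of the probability mass lies within the 75th-percentile distance of the median, so the MAD equals that quantile; the sample size and seed below are arbitrary.

```python
import random
import statistics

# Analytic: the MAD of N(0, 1) is its 75th percentile, so 1 stddev = 1 / that many MADs.
mad_of_standard_normal = statistics.NormalDist().inv_cdf(0.75)
print(1 / mad_of_standard_normal)

# Empirical check on a large sample.
random.seed(3)  # arbitrary seed
xs = [random.gauss(0, 1) for _ in range(100_000)]
center = statistics.median(xs)
sample_mad = statistics.median(abs(x - center) for x in xs)
ratio = statistics.stdev(xs) / sample_mad
print(ratio)
```

Both numbers should agree with 1.4826 to within sampling noise.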

Hampel x84: define outliers as $k*1.4826$ MADs from the median!
- benefit: outliers defined by (robust) dispersion of the data
- as opposed to IQR, etc which ignores dispersion

## Redoing our Outliers with Hampel x84
Let's find/trim outliers more than $2*1.4826$ MADs from the median!
- This is just like the order statistics above
- But also like $2\sigma$
  - trims based on a (robust) metric of spread
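
As a compact illustration, the rule can be written as a small Python function. The toy data is made up, with one value corrupted by a factor of 10; the default `k=2.0` mirrors the choice above.

```python
import statistics

def hampel_x84_outliers(values, k=2.0):
    """Values more than k * 1.4826 MADs away from the median."""
    center = statistics.median(values)
    mad = statistics.median(abs(v - center) for v in values)
    cutoff = k * 1.4826 * mad
    return [v for v in values if abs(v - center) > cutoff]

xs = [44, 47, 48, 49, 50, 50, 51, 52, 53, 56, 660]
print(hampel_x84_outliers(xs))
```

Unlike the stddev-based rule, the corrupted value does not mask the milder outliers on either side. If the MAD is 0 (more than half the values identical), every other value gets flagged, so that case deserves a separate check.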

In [None]:
%%sql
CREATE OR REPLACE VIEW hampelx84x2_observations AS
SELECT o.x,
      'hampelx84x2' AS label
  FROM observations o, median, mad
 WHERE o.x BETWEEN (median-2*1.4826*mad) AND (median+2*1.4826*mad)
UNION ALL
SELECT o.x, 'orig' AS label
  FROM observations o;

CREATE OR REPLACE VIEW Hampel84x2_outliers AS
SELECT x
  FROM observations o, median, mad
 WHERE x NOT BETWEEN (median - 2*1.4826*mad) AND (median + 2*1.4826*mad);

In [None]:
# WHERE x < 500
results = %sql SELECT * FROM hampelx84x2_observations
sns.displot(results.dict(), x="x", kind='hist', hue='label', kde=True, bins=20)
%sql SELECT * FROM Hampel84x2_outliers;

## Model-Based Outlier Detection
Up to now we've been cleaning stored data values. A lot of outlier detection discussion is around model fitting.

Assume you're fitting a model to your data.
- E.g. linear regression

Q: Which data points are "outliers" with respect to the model?
- "Surprising" values
- Use our outlier metrics, but apply to the *model residuals*
    - E.g. L2 distance between actual value and predicted
- Assumption: residuals of your model are normally distributed

Anything further down this path is well into the realm of data analysis, not engineering.
- So we'll stop here

## Outliers: Summing up
Detection: Center and Spread
- The normal distribution gives nice intuition, but not robust
- Robustness: want high breakdown point (at most 50%)!
- Order statistics like percentiles are robust
    - But don't take dispersion into account
- Median and MAD are robust estimators of center and dispersion, respectively
- Hampel X84: robust outlier metric, considers dispersion

Outlier Handling:
- Trimming
- Winsorizing
- Watch your hygiene!
  - Keep the raw data, close by
  - Document

# Data Imputation
Sometimes when data is missing, we fill in "likely" values. Why?
- Missing data can lead to bias
- Some downstream operators won't tolerate missing data
    - E.g. a stats package that needs a dense tensor

Strategies for defining what's "likely"?

- Default values for a column
    - Typically an aggregate of the column
    - E.g. the center (mean/median)
- Correlation across columns (e.g. $P(\mbox{elevation} | \mbox{latitude})$)
- Sampled from a model (possibly trained on other data)
- Interpolation across (ordered) rows

## Choice of Imputation Methods
What is a *good* imputation scheme for your setting?
- It depends!
- This is part of the art of statistics
- We will not offer prescriptions here
   - Focus on *how* rather than *what*

For purposes of illustration, let's introduce some missing values into our data.

In [None]:
## replace the database connection with a database of your own!
%reload_ext sql
%sql postgresql://jmh@localhost:5432/gnis


In [None]:
%%sql
SELECT setseed(0.12345);
DROP TABLE IF EXISTS holey CASCADE;
CREATE TABLE holey AS 
SELECT feature_id, feature_name, feature_class, state, county_name, 
       primary_latitude_dec, primary_longitude_dec, 
       CASE WHEN random() > 0.9 THEN NULL
            ELSE elevation_meters
        END AS elevation_meters
  FROM national;
SELECT count(elevation_meters)::float / count(*) FROM holey;
    

## Default Value Imputation in SQL
Two pass: i.e. an aggregate CTE followed by a query. E.g. mean imputation:

In [None]:
%%sql
WITH elevavg AS (SELECT avg(elevation_meters) FROM holey)
SELECT h.*, 
       CASE WHEN h.elevation_meters IS NOT NULL THEN h.elevation_meters
            ELSE e.avg
        END AS imputed_elevation_meters
  FROM holey h, elevavg e
 LIMIT 100;
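
The same two-pass shape can be sketched in plain Python, with `None` playing the role of NULL. The elevation values are made up for illustration.

```python
import statistics

elevations = [120.0, None, 95.0, None, None, 310.0]  # None plays the role of NULL

# Pass 1: aggregate over the non-missing values (SQL avg() also skips NULLs).
present = [e for e in elevations if e is not None]
default = statistics.fmean(present)

# Pass 2: keep the original list and build a derived one beside it.
imputed = [default if e is None else e for e in elevations]
print(imputed)
```

Keeping `elevations` untouched and writing to a new list follows the hygiene advice from the start of this notebook.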

## Correlation Across Columns
Given a correlation model, applying it is just a scalar function!
- E.g. linear regression from longitude to elevation_meters?
- Just apply slope and intercept to longitude!

In [None]:
# Here we'll train the model in SQL just for fun
result = %sql SELECT regr_slope(elevation_meters, primary_longitude_dec), regr_intercept(elevation_meters, primary_longitude_dec) FROM holey
slope, intercept = result[0]

In [None]:
%%sql
SELECT *,
       CASE WHEN elevation_meters IS NOT NULL THEN elevation_meters
            ELSE :slope*primary_longitude_dec + :intercept
        END AS imputed_elevation_meters
  FROM holey
 LIMIT 100;

## Exercise
Compare the original data and the imputed data
- Compute the residuals
- What is the distribution of residuals?
- What do the outliers of the residuals look like?

## Model-Based Interpolation
Like the previous case, just trained on different data in advance.

Call a scalar prediction function for the model, passing in any constants plus the column values from the row:
```
SELECT *,
       CASE WHEN column IS NOT NULL THEN column
            ELSE model_predict(<constants>, <columns>)
        END
  FROM table;
```

## Correlation across ordered rows
This typically involves a window function. A simple example is just to "fill down":

In [None]:
%%sql
-- The following doesn't fully work: lag(..., 1) looks back exactly one row,
-- so any NULL that is preceded by another NULL stays NULL.
WITH buggy AS (
SELECT *,
       CASE WHEN elevation_meters IS NOT NULL THEN elevation_meters
            ELSE lag(elevation_meters, 1)
                 OVER (ORDER BY feature_id)
        END AS imputed_elevation_meters
  FROM holey
)
SELECT * FROM buggy LIMIT 500;

## Using a UDA to Simulate lag(...) IGNORE NULLS

In [None]:
%%sql
-- Here's a UDA fix from
-- https://stackoverflow.com/questions/18987791/how-do-i-efficiently-select-the-previous-non-null-value
CREATE OR REPLACE FUNCTION coalesce_agg_sfunc(state anyelement, value anyelement) RETURNS anyelement AS
$$
    SELECT coalesce(value, state);
$$ LANGUAGE SQL;

CREATE AGGREGATE coalesce_agg(anyelement) (
    SFUNC = coalesce_agg_sfunc,
    STYPE  = anyelement);
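
To see why this aggregate fills down, it helps to view it as a fold: the state function is applied to the running state and each new value in order. The Python sketch below mirrors that with `itertools.accumulate`; the sample values are made up, and `None` plays the role of NULL.

```python
from itertools import accumulate

def coalesce_sfunc(state, value):
    """Mirror of coalesce(value, state): keep the newest non-missing value."""
    return value if value is not None else state

elevations = [None, 120.0, None, None, 95.0, None]
filled = list(accumulate(elevations, coalesce_sfunc))
print(filled)
```

Used as a window aggregate with an ORDER BY, the default frame runs from the first row to the current row, which gives the same running behavior. Rows before the first non-missing value have nothing to inherit, so they stay missing.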

## Redoing our Fill Down Imputation

In [None]:
%%sql
-- Fixed to handle repeated NULLs
WITH fixed AS (
SELECT *,
       coalesce_agg(elevation_meters) OVER (order by feature_id) AS imputed_elevation_meters
  FROM holey
)
SELECT * FROM fixed LIMIT 500;

In [None]:
%%sql
-- Test for NULLs
WITH fixed AS (
SELECT *,
       coalesce_agg(elevation_meters) OVER (order by feature_id) AS imputed_elevation_meters
  FROM holey
)
SELECT count(*) FROM fixed WHERE imputed_elevation_meters IS NULL;

## General Interpolation Across Ordered Rows
This turns out to be relatively tricky/expensive: requires multiple passes over window aggs!

Our goal is to be interpolating across "runs" of NULLs.
- To interpolate, we need to label every row with
  - a unique "run" number (`run`) for each value and its subsequent NULLs
  - the initial, non-NULL value in the run (`run_start`)
  - this row's index in the run (`run_rank`)
  - the total number of rows in this run (`run_size`)
  - the next non-NULL value in order (`run_end`)

With this, we can interpolate via whatever scalar math we like (e.g. linear).
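
The labels are easier to follow on a tiny example. The sketch below is a single-pass Python version over an already-ordered list, using the same names; it assumes row position (not a key value) defines the rank within a run, and it leaves leading or trailing gaps untouched.

```python
def interpolate_runs(values):
    """Linearly interpolate None gaps that have a known value on both sides."""
    out = list(values)
    start = None  # index of the most recent non-missing value
    for i, v in enumerate(values):
        if v is None:
            continue
        if start is not None and i - start > 1:
            run_start, run_end = values[start], v
            run_size = i - start  # the start row plus the missing rows after it
            for j in range(start + 1, i):
                run_rank = j - start + 1  # rank 1 is the start row itself
                out[j] = run_start + (run_rank - 1) * (run_end - run_start) / run_size
        start = i
    return out

print(interpolate_runs([10.0, None, None, None, 50.0, None, 70.0, None]))
```

Here the first run has `run_start` 10, `run_end` 50 and `run_size` 4, so its three gaps are filled in steps of 10.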

# PUT PICTURE HERE

## Strategy for Interpolation
A 3-pass algorithm:
1. Forward: 
  - compute `run`
  - propagate `run_start`
  - get `nextval` into last row of run
2. Backward, given `run` partitions: 
  - compute `run_size`
  - compute `run_rank`
  - propagate `run_end` from `nextval`
3. The final query uses scalars to interpolate

Can you do better? If you don't use SQL?

In [None]:
%%sql
-- 1. Forward assign "run" numbers to rows, propagate val, get nextval
CREATE OR REPLACE VIEW forward AS
SELECT *,
       SUM(CASE WHEN elevation_meters IS NULL THEN 0 ELSE 1 END) 
         OVER (ORDER BY feature_id) AS run,
       coalesce_agg(elevation_meters) OVER (ORDER BY feature_id) AS run_start,
       CASE WHEN elevation_meters IS NULL 
              THEN lead(elevation_meters, 1) OVER (ORDER BY feature_id)
            ELSE NULL
             END AS nextval
  FROM holey;

SELECT * FROM forward
 LIMIT 500;

In [None]:
%%sql
-- 2. Backward: assign run_end, run_size, run_rank
CREATE OR REPLACE VIEW backward AS
SELECT *,
       CASE WHEN elevation_meters IS NOT NULL THEN elevation_meters
            ELSE coalesce_agg(nextval) OVER (PARTITION BY run ORDER BY feature_id DESC)
        END AS run_end,
       count(*) OVER (PARTITION BY run) AS run_size,
       1 + feature_id - (min(feature_id) OVER (PARTITION BY run)) AS run_rank
  FROM forward;

SELECT * from backward LIMIT 500;

In [None]:
%%sql
-- 3. Simple scalar pass
CREATE OR REPLACE VIEW final AS
SELECT *, 
       run_start + (run_rank-1)*((run_end-run_start)/(run_size))
         AS interpolated
  FROM backward;

SELECT * FROM final
LIMIT 500;

How well did PostgreSQL do? Two sorts! Could you do better?

In [None]:
%sql EXPLAIN SELECT * from backward LIMIT 500;

## Final Notes on Imputation
- What we've discussed is standard *single imputation*
- There are fancier statistical methods even [on Wikipedia](https://en.wikipedia.org/wiki/Imputation_(statistics)):
  - E.g. *Multiple imputation* averages across multiple imputed datasets
- Getting fancier may require (even) more query gymnastics!
- You've seen enough to be dangerous!

# Entity Resolution

# Distance Functions on Strings
Which is more likely:
- "Aditya" $\rightarrow$ "Aditi"
- "Aditya" $\rightarrow$ "Adversary"

We want a notion of string "distance". There are many in the literature
- [Levenshtein](https://en.wikipedia.org/wiki/Levenshtein_distance): number of single-character edits
    - Edits include insert, delete or substitute a character
- [Jaro-Winkler](https://www.postgresql.org/docs/current/fuzzystrmatch.html): another edit distance, favors similar prefixes
- Sound indexes: [Soundex](https://en.wikipedia.org/wiki/Soundex)/[Metaphone](https://en.wikipedia.org/wiki/Metaphone)/[Double Metaphone](https://en.wikipedia.org/wiki/Metaphone#Double_Metaphone)
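
To make the first of these concrete, here is a sketch of Levenshtein distance using the textbook dynamic-programming recurrence, keeping only one previous row of the table. It is an illustration, not the implementation any particular database uses.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]

for other in ("Aditi", "Adversary"):
    print(other, levenshtein("Aditya", other))
```

By this measure "Aditi" is much closer to "Aditya" than "Adversary" is.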

PostgreSQL's [fuzzystrmatch extension](https://www.postgresql.org/docs/current/fuzzystrmatch.html) provides functions for Levenshtein, Soundex, Metaphone and Double Metaphone (enable it with `CREATE EXTENSION fuzzystrmatch;`).

In [None]:
%%sql
SELECT levenshtein('Aditya', 'Aditi'),
       soundex('Aditya') AS soundex1, 
       soundex('Aditi') AS soundex2,
       metaphone('Aditya', 10) AS metaphone1, 
       metaphone('Aditi', 10) AS metaphone2,
       dmetaphone('Aditya') AS dmetaphone1,
       dmetaphone_alt('Aditya') AS dmetaphone_alt1,
       dmetaphone('Aditi') AS dmetaphone2,
       dmetaphone_alt('Aditi') AS dmetaphone_alt2
UNION ALL
SELECT levenshtein('Aditya', 'Adversary'),
       soundex('Aditya'), soundex('Adversary'),
       metaphone('Aditya', 10), metaphone('Adversary', 10),
       dmetaphone('Aditya'), dmetaphone_alt('Aditya'),
       dmetaphone('Adversary'), dmetaphone_alt('Adversary')
UNION ALL
SELECT levenshtein('Joe', 'Joel'),
       soundex('Joe'), soundex('Joel'),
       metaphone('Joe', 10), metaphone('Joel', 10),
       dmetaphone('Joe'), dmetaphone_alt('Joe'),
       dmetaphone('Joel'), dmetaphone_alt('Joel')


# Entity Resolution
A.k.a. Record Linkage, Data Matching, Deduplication, Standardization

Suppose I have a column of product names.
- Might different names represent the same real-world "entity"?

Suppose I have a table of product tuples
- Might different tuples represent the same real-world "entity"?

## Simple Case: Matching to Reference Data
Given noisy input data $D$, and a curated reference set $R$.
Map each item in $D$ to the best match in $R$ (or NULL).

Example: Suppose you have a curated table `Products`. 
- You then receive sales data from a subsidiary
- You want to match their product names to `Products`:
- Proximity join!
```
SELECT *
  FROM new_sales, products
 WHERE levenshtein(new_sales.product_name, products.name) < 5;
```
This cross join will be nightmarishly slow!
- How slow?

## Matching to Reference Data, Cont.
Desired: first-pass filter 
- Use a text-search index on `products.name`
- Only compute levenshtein distance on top $k$ matches for each

Many systems provide this
- E.g. external text indexing via Elastic or the like
    - Have to keep in sync with other databases
- E.g. in-database text indexing like Postgres' [Gin indexes](https://www.postgresql.org/docs/current/gin-intro.html)
    - Stays automatically in sync
    - Expensive on insertion
    - Not super fast compared to a custom text index

How does all that work?
- We'll return to this topic when we discuss unstructured data and search

## Entity Resolution (Without Reference Data)
Given input data $D$, partition into equivalence classes corresponding to distinct real-world entities.

Clearly a heuristic problem!

- Approach: as before, need to filter first!
  - *Blocking*: group the data into (possibly overlapping) subsets
  - *Matching*: Within each block, try to match up entities
      - *Pairwise*: Create weighted edges via distance function between pairs in a block
      - *Transitive*: Partition the resulting graph into *clusters*
          - "close" nodes should be in the same cluster
          - "distant" nodes in different clusters

## Entity Resolution 1: Blocking
Idea:
  1. Extract one or more "blocking keys" from each entity
  2. Form blocks out of entities that have same/similar blocking keys
  
Standard approach: partition by some column(s)
  - Each group is a distinct block
  - Each item is in only one block, like GROUP BY

Fast, easy to understand/engineer.
  - Can prep data for this by deriving blocking "features"
    - E.g. initials/acronyms

## Other Blocking Schemes
*Many* heuristics here. We'll learn $q$-gram blocking, mostly because $q$-grams are handy to know about anyhow.

Definition: the **$q$-grams** of a string are the substrings of length $q$.
  - E.g. trigrams of `(picasso, pablo)` are {'pic', 'ica', 'cas', 'ass', 'sso', 'so\\$', 'o\\$p', '\\$pa', 'pab', 'abl', 'blo', 'lo\\$'}
  - Handy for computing string-similarity
  - Handy for indexing: build a B-tree on $q$-grams for finding misspellings!
 
One block per distinct $q$-gram in your dataset. How to choose $q$? Heuristic.
  - $q$ too small? Most entities are in most blocks (slow! bad *precision*)
  - $q$ too big? Entities that should be together are not (bad *recall*)
  - Can extend to require matches on multiple $q$-grams
    - blows up the # of blocks, but each block smaller
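
A minimal sketch of $q$-gram blocking in Python follows. It works on single lowercased strings and omits the field-delimiter padding shown in the trigram example above; the names are made up.

```python
from collections import defaultdict

def qgrams(s, q=3):
    """All substrings of length q (a set, so repeats collapse)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_blocks(names, q=3):
    """Map each q-gram to the set of names containing it (one block per q-gram)."""
    blocks = defaultdict(set)
    for name in names:
        for gram in qgrams(name.lower(), q):
            blocks[gram].add(name)
    return blocks

names = ["Picasso", "Picaso", "Pablo Picasso", "Matisse"]
blocks = qgram_blocks(names)
print(sorted(blocks["pic"]))
print(sorted(blocks["iss"]))
```

Each name lands in several overlapping blocks, so the misspelled "Picaso" still shares blocks with "Picasso", while "Matisse" never meets either of them.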

## Entity Resolution 2a: Distance Metrics for Matching
- Univariate distance
  - String distances as above are limited
  - Abbreviations: IBM = International Business Machines
  - Synonyms: Eggplant = Aubergine
  - Knowledge-base hierarchies: *tangerine* is closer to *orange* than it is to *apple*
  - Upshot: there are many many heuristics in this domain!
- Multivariate (tuple) distance
  - `(Davis, Miles, 1950)`, `(Marsalis, Wynton, 1986)`, `(Davis, Miles, 1986)`
  - `(Kelloggs, Corn Flakes, 1950)`, `(Post, Shredded Wheat, 1950)`, `(Kelloggs, Corn Flakes, 2021)`

## Distance Metric Practicalities
  - In the absence of a supervised model, combine simple distance functions & weights
  - If you can get labeled data, train a model!
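
As a sketch of the unsupervised option, the function below combines per-field distances with hand-picked weights. Everything numeric here is an assumption for illustration: the weights, the `year_scale`, and the choice of `difflib.SequenceMatcher` as the string measure.

```python
from difflib import SequenceMatcher

def string_distance(a, b):
    """0.0 for identical strings, up to 1.0 for nothing in common (difflib ratio)."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def tuple_distance(r, s, weights=(0.4, 0.4, 0.2), year_scale=50):
    """Weighted sum of per-field distances; the weights and year_scale are made up."""
    last, first, year = weights
    return (
        last * string_distance(r[0], s[0])
        + first * string_distance(r[1], s[1])
        + year * min(abs(r[2] - s[2]) / year_scale, 1.0)
    )

a = ("Davis", "Miles", 1950)
b = ("Marsalis", "Wynton", 1986)
c = ("Davis", "Miles", 1986)
print(tuple_distance(a, c), tuple_distance(b, c))
```

With these weights the two Miles Davis tuples come out closer to each other than either is to the Marsalis tuple, despite the shared year; different weights could reverse that, which is why labeled data helps.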

## Entity Resolution 2b: Clustering
Given: a block of size $b$ with $b^2$ pairwise distances
  - This is a fully-connected graph on $b$ nodes
  - Again we can heuristically threshold
    - 2 nodes close enough? Declared "same"
    - 2 nodes far enough? Prune the edge
    - Others: uncertain
  - Reasonable clustering goal (Correlation Clustering):
    - maximize the sum of intra-cluster edge weights
    - minimize the sum of inter-cluster edge weights
    - This is *intractable* (NP-Hard) in theory
    - Various approximation algorithms will do a good job
        - E.g. [Clustering Aggregation](http://cs-people.bu.edu/evimaria/cs565/aggregated-journal.pdf)
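
The simplest transitive scheme is worth seeing in code. Note that this is not correlation clustering: it thresholds the edges and takes connected components (via union-find), so one chain of "close" pairs can merge items that are far apart. The numeric items, the distance function and the threshold are stand-ins for a real block.

```python
def cluster_by_threshold(items, distance, threshold):
    """Union-find over pairs closer than threshold; returns a list of clusters."""
    parent = {item: item for item in items}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Compare every pair in the block and merge the close ones.
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if distance(a, b) < threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for item in items:
        clusters.setdefault(find(item), []).append(item)
    return list(clusters.values())

block = [1.0, 1.2, 1.3, 5.0, 5.1, 9.0]
print(cluster_by_threshold(block, lambda a, b: abs(a - b), 0.5))
```

The pairwise loop is quadratic in the block size, which is one more reason blocking matters.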

## Entity Resolution in Practice

Do-it-yourself:
1. Pick a blocking technique
    - Easy: GROUP BY on some columns
2. Pick some distance metrics
3. Call a clustering library

Use a product:
- [open source](https://www.biggorilla.org/software_cat/entity-matching/index.html) not so strong at scale
    - but with good blocking maybe scale's manageable
- various commercial tools

Be aware of products for special cases!
- mailing address deduplication
- company name deduplication
- person deduplication

## Assessing the Results of Entity Resolution

Suppose you start with 1 Million records. Your Entity Resolution software comes back with 250K entities. How well did it do?

This is a HARD PROBLEM.

- Painful to review 250,000 sets and fix by hand!
- Maybe the software can prioritize things "on the bubble"
  - items that have been put together with low confidence
  - items that have been separated with low confidence
  - requires confidence metric for the distance/clustering
- Maybe your fixes reparameterize the ER model
- An example of *Active Learning*
- Not clear that the user effort is worthwhile

## Summing Up: Entity Resolution Techniques
Basic approach:
- Blocking: many approaches (heuristics)
- Matching: 
  - Many distance functions (heuristics)
  - Many clustering techniques (heuristics)
- Assessing/Fixing results: Hard!

## Summing Up: Entity Resolution in Practice
- Entity Resolution is often very important!
- Entity Resolution is often rather dodgy 
  - Heuristics upon heuristics
  - Limited human ability to assess results
- Quite reliable in specific mature domains
  - E.g. mailing address deduplication
- Or in very general settings where training data is plentiful
  - E.g. NLP
- Or when human review is practical
  - E.g. crowd-sourceable