# `fuzzup` Showcase
`fuzzup` offers (1) a simple approach for clustering string entities based on
[Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) using
[Fuzzy Matching](https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation))
in conjunction with a simple rule-based clustering method.

`fuzzup` also provides (2) functions for computing the prominence of  
entity clusters resulting from (1).

In this section we will go through the nuts and bolts of `fuzzup` by applying it to a realistic setting.

## Designed for Handling Output from NER
An important use-case for `fuzzup` is *organizing, structuring and analyzing* output from [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) (NER).

For this reason, `fuzzup` has been tailored specifically to the predictions returned by the [Hugging Face](https://huggingface.co/) [transformers](https://github.com/huggingface/transformers) [NER pipeline](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/pipelines#transformers.TokenClassificationPipeline).
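As a rough illustration of the input `fuzzup` expects, the sketch below builds a few records by hand in the shape that a `transformers` token-classification pipeline with an aggregation strategy returns (a list of dicts with `entity_group`, `score`, `word`, `start` and `end`). The values are made up for illustration and are not taken from a real model run.

```python
# Hypothetical records in the shape returned by a Hugging Face
# token-classification pipeline with an aggregation strategy, e.g.
# transformers.pipeline("ner", aggregation_strategy="simple").
# All values below are invented for illustration.
ner_output = [
    {"entity_group": "PER", "score": 0.98, "word": "Joe Biden", "start": 0, "end": 9},
    {"entity_group": "PER", "score": 0.91, "word": "Biden", "start": 42, "end": 47},
    {"entity_group": "LOC", "score": 0.87, "word": "Washington", "start": 60, "end": 70},
]

# Keep only person entities and pull out their surface strings.
persons = [rec["word"] for rec in ner_output if rec["entity_group"] == "PER"]
print(persons)
```

The `word` strings are what gets compared and clustered; the remaining keys travel along with each record.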


## Use-case
First off, import the dependencies needed later.

In [8]:
from rapidfuzz.fuzz import partial_token_set_ratio
import pandas as pd
import numpy as np

from fuzzup.datasets import simulate_ner_data
from fuzzup.fuzz import (
    fuzzy_cluster, 
    compute_prominence, 
    compute_fuzzy_matrix,
)
from fuzzup.whitelists import match_whitelist

Say we have used a Hugging Face `transformers` NER pipeline to identify names of persons in a news article. The output from the algorithm is a list of string entities that looks like this (simulated data).

In [9]:
PERSONS_NER = simulate_ner_data()
pd.DataFrame.from_records(PERSONS_NER)

Unnamed: 0,word,entity_group,score,start,end
0,Donald Trump,PER,0.303782,15,46
1,Donald Trump,PER,0.783635,95,79
2,J. biden,PER,0.610754,29,76
3,joe biden,PER,0.324899,32,6
4,Biden,PER,0.492239,27,49
5,Bide,PER,0.315379,33,51
6,mark esper,PER,0.165254,61,60
7,Christopher c . miller,PER,0.353775,25,10
8,jim mattis,PER,0.577723,53,47
9,Nancy Pelosi,PER,0.700337,24,79


As you can see, the output is rather messy, partly due to the stochastic nature of the algorithm. Another reason is that the same person is mentioned several times in different ways; Joe Biden, for instance, appears as 'joe biden', 'J. biden' and 'Biden'.

We want to organize these string entities by forming meaningful clusters in which the entities are closely related based on their pairwise edit distances.

## Workflow

`fuzzup` offers functionality for:

1. Computing all of the mutual string distances (Levenshtein distances/fuzzy ratios) between the string entities
2. Forming clusters of string entities based on the distances from (1)
3. Computing prominence of the clusters from (2) based on the number of entity occurrences, their positions in the text etc.
4. Matching entities (clusters) with entity whitelists

Together these steps constitute an end-to-end approach for organizing and structuring the output from NER. Below we go through a simple example of the `fuzzup` workflow.    

### Step 1: Compute Pairwise Edit Distances

First, `fuzzup` computes pairwise fuzzy ratios for all pairs of string entities.

[Fuzzy ratios](https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)) are numbers between 0 and 100 that measure the similarity between two strings. They are derived from the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance), a string metric that measures the distance between two strings.

In short, the Levenshtein distance (also known as 'edit distance') between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
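To make the definition concrete, here is a minimal pure-Python sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The `similarity` helper shows one simple way to turn a distance into a 0-100 score; it is an illustrative assumption and not necessarily the formula used by the `rapidfuzz` scorers, which have their own definitions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    # previous[j] is the distance between the prefix of `a` handled so far
    # (initially the empty prefix) and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score: 100 for identical strings, lower with more edits."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Biden", "Bide"))      # 1
```

Turning 'Biden' into 'Bide' takes a single deletion, which is why such near-duplicates end up with high fuzzy ratios.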

`fuzzup` has a separate function for this, `compute_fuzzy_matrix`, which presents the mutual fuzzy ratios as a cross-tabulated matrix of all pairs.

In [10]:
from fuzzup.fuzz import fuzzy_cluster
persons = [x.get('word') for x in PERSONS_NER]
compute_fuzzy_matrix(persons, scorer=partial_token_set_ratio)

Unnamed: 0,mark esper,Biden,trumps,Donald,Christopher c . miller,Bide,Nancy Pelosi,J. biden,Trump,Donald Trump,jim mattis,joe biden,miller
mark esper,100.0,33.333332,33.333332,22.222221,50.0,40.0,26.666666,22.222221,40.0,28.571428,37.5,30.76923,40.0
Biden,33.333332,100.0,0.0,33.333332,40.0,100.0,33.333332,88.888885,0.0,25.0,28.571428,80.0,40.0
trumps,33.333332,0.0,100.0,0.0,33.333332,0.0,25.0,0.0,80.0,80.0,44.444443,0.0,28.571428
Donald,22.222221,33.333332,0.0,100.0,22.222221,40.0,25.0,28.571428,0.0,100.0,18.181818,25.0,22.222221
Christopher c . miller,50.0,40.0,33.333332,22.222221,100.0,50.0,30.0,42.857143,40.0,27.272728,35.294117,33.333332,100.0
Bide,40.0,100.0,0.0,40.0,50.0,100.0,40.0,75.0,0.0,25.0,33.333332,75.0,50.0
Nancy Pelosi,26.666666,33.333332,25.0,25.0,30.0,40.0,100.0,26.666666,0.0,23.529411,23.529411,35.294117,28.571428
J. biden,22.222221,88.888885,0.0,28.571428,42.857143,75.0,26.666666,100.0,0.0,18.181818,26.666666,100.0,40.0
Trump,40.0,0.0,80.0,0.0,40.0,0.0,0.0,0.0,100.0,100.0,25.0,0.0,33.333332
Donald Trump,28.571428,25.0,80.0,100.0,27.272728,25.0,22.222221,18.181818,100.0,100.0,25.0,25.0,33.333332


The different string representations of e.g. Donald Trump and Joe Biden have high mutual fuzzy ratios. In comparison, representations of different persons have relatively small fuzzy ratios.

You can think of this matrix as a similarity matrix, analogous to a correlation matrix: it shows how strongly each pair of strings is related.
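The structure of such a matrix can be sketched with the standard library alone. The example below is not `compute_fuzzy_matrix` itself: `stand_in_ratio` and `pairwise_matrix` are hypothetical helpers, the scorer is `difflib.SequenceMatcher` rather than a `rapidfuzz` scorer, and collapsing duplicate strings is an assumption.

```python
from difflib import SequenceMatcher


def stand_in_ratio(a: str, b: str) -> float:
    """Stand-in 0-100 scorer from the standard library; not a rapidfuzz scorer."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_matrix(strings, scorer=stand_in_ratio):
    """Return {row: {column: score}} over the distinct input strings."""
    unique = sorted(set(strings))
    return {a: {b: scorer(a, b) for b in unique} for a in unique}


matrix = pairwise_matrix(["Donald Trump", "Trump", "joe biden", "Biden", "Trump"])
for row, scores in matrix.items():
    print(row, {col: round(score, 1) for col, score in scores.items()})
```

Every string scores 100 against itself, and variants of the same name score higher against each other than against unrelated names.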

### Step 2: Forming Clusters
Clusters of entities can be formed from the output of (1) with a naive approach: two string entities are clustered together if their mutual fuzzy ratio exceeds a certain threshold.

Computing the pairwise fuzzy ratios and forming the clusters can be done in one take by simply invoking the `fuzzy_cluster` function.


In [11]:
clusters = fuzzy_cluster(PERSONS_NER, 
                         scorer=partial_token_set_ratio, 
                         cutoff=70,
                         merge_output=True)
pd.DataFrame.from_records(clusters)

Unnamed: 0,word,entity_group,score,start,end,cluster_id
0,Donald Trump,PER,0.303782,15,46,Donald Trump
1,Donald Trump,PER,0.783635,95,79,Donald Trump
2,J. biden,PER,0.610754,29,76,joe biden
3,joe biden,PER,0.324899,32,6,joe biden
4,Biden,PER,0.492239,27,49,joe biden
5,Bide,PER,0.315379,33,51,joe biden
6,mark esper,PER,0.165254,61,60,mark esper
7,Christopher c . miller,PER,0.353775,25,10,Christopher c . miller
8,jim mattis,PER,0.577723,53,47,jim mattis
9,Nancy Pelosi,PER,0.700337,24,79,Nancy Pelosi


Note that the original entities are now equipped with a 'cluster_id', which assigns each entity to an entity cluster.

We see from the results that different string representations of e.g. 'Donald Trump' have been clustered together. As you can see, the 'cluster_id' of each cluster is the longest string within the entity cluster.

In this case we applied the `partial_token_set_ratio` scorer and a cutoff threshold of 70 on the pairwise fuzzy ratios. Depending on your use case, you should choose an appropriate scorer from `rapidfuzz.fuzz` and 'fine-tune' the cutoff threshold on your own data.
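The threshold rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the `fuzzy_cluster` implementation: it uses `difflib.SequenceMatcher` as a stand-in scorer, treats scores at or above the cutoff as links, and merges linked strings transitively with a small union-find. `stand_in_ratio` and `naive_cluster` are hypothetical names.

```python
from difflib import SequenceMatcher


def stand_in_ratio(a: str, b: str) -> float:
    """Stand-in 0-100 scorer from the standard library; not a rapidfuzz scorer."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_cluster(strings, scorer=stand_in_ratio, cutoff=70):
    """Link strings scoring at or above `cutoff`; label each group by its longest member."""
    unique = sorted(set(strings))
    parent = {s: s for s in unique}

    def find(s):
        # Follow parent links to the representative of the group containing s.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Merge the groups of every pair that reaches the cutoff.
    for i, a in enumerate(unique):
        for b in unique[i + 1:]:
            if scorer(a, b) >= cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for s in unique:
        groups.setdefault(find(s), []).append(s)
    # Map every string to the longest string in its group.
    return {s: max(members, key=len) for members in groups.values() for s in members}


labels = naive_cluster(
    ["Donald Trump", "Donald Trump", "J. biden", "joe biden", "Biden", "Bide", "Nancy Pelosi"]
)
print(labels)
```

With this stand-in scorer, 'Bide' is not directly similar enough to 'joe biden', but it is linked to 'Biden', which in turn is linked to 'joe biden', so all four variants end up in one cluster. A whole-string scorer like this one is stricter than `partial_token_set_ratio`, which is one reason the choice of scorer matters.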

### Step 3: Compute Prominence of Entity Clusters
A naïve approach for computing the 'prominence' of the different string clusters is to just count the number of nodes/strings in each cluster. This is the default behaviour of `compute_prominence()`.

In [12]:
clusters = compute_prominence(clusters,
                              merge_output=True)
pd.DataFrame.from_records(clusters).sort_values('prominence_rank', ascending=True)

Unnamed: 0,word,entity_group,score,start,end,cluster_id,prominence_score,prominence_rank
0,Donald Trump,PER,0.303782,15,46,Donald Trump,5.0,1
1,Donald Trump,PER,0.783635,95,79,Donald Trump,5.0,1
10,trumps,PER,0.206012,77,55,Donald Trump,5.0,1
11,Trump,PER,0.021739,17,59,Donald Trump,5.0,1
12,Donald,PER,0.610556,26,68,Donald Trump,5.0,1
2,J. biden,PER,0.610754,29,76,joe biden,4.0,2
3,joe biden,PER,0.324899,32,6,joe biden,4.0,2
4,Biden,PER,0.492239,27,49,joe biden,4.0,2
5,Bide,PER,0.315379,33,51,joe biden,4.0,2
7,Christopher c . miller,PER,0.353775,25,10,Christopher c . miller,2.0,3


In this case, the 'prominence score' of the 'Donald Trump' entity cluster is 5, because Donald Trump is mentioned 5 times in different variations. This is the highest frequency among the clusters and therefore the 'Donald Trump' cluster is scored as the most prominent cluster.

The clusters are ranked by their prominence scores in the `prominence_rank` column, where rank 1 is the most prominent cluster.
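The counting approach can be sketched in a few lines. This is an illustration rather than the `compute_prominence` implementation: `count_prominence` is a hypothetical helper, and the dense ranking (clusters with equal counts share a rank) is an assumption.

```python
from collections import Counter


def count_prominence(records):
    """Score each cluster by its number of mentions and rank clusters (1 = most prominent)."""
    counts = Counter(rec["cluster_id"] for rec in records)
    # Dense ranking (assumption): clusters with equal counts share a rank.
    distinct_scores = sorted(set(counts.values()), reverse=True)
    rank_of = {score: rank for rank, score in enumerate(distinct_scores, start=1)}
    return [
        {
            **rec,
            "prominence_score": float(counts[rec["cluster_id"]]),
            "prominence_rank": rank_of[counts[rec["cluster_id"]]],
        }
        for rec in records
    ]


records = [
    {"word": "Donald Trump", "cluster_id": "Donald Trump"},
    {"word": "Trump", "cluster_id": "Donald Trump"},
    {"word": "joe biden", "cluster_id": "joe biden"},
]
ranked = count_prominence(records)
for row in ranked:
    print(row)
```

Here the 'Donald Trump' cluster has two mentions and gets rank 1, while the single-mention 'joe biden' cluster gets rank 2.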

### Step 4: Matching with Whitelists
When analyzing the output from NER, it can be useful to have one or more whitelists with specific entities of interest. Assume that we are only interested in Donald Trump and Joe Biden.

We construct a minimal whitelist.

In [13]:
whitelist = ['Donald Trump', 'Joe Biden']

Now we can match it against our predicted entities using the `match_whitelist` function.

In [14]:
match_whitelist(clusters,
                whitelist,
                scorer=partial_token_set_ratio,
                score_cutoff=80,
                aggregate_cluster=True,
                to_dataframe=True).sort_values('prominence_rank', ascending=True)

Unnamed: 0,word,entity_group,score,start,end,cluster_id,prominence_score,prominence_rank,matches
0,Donald Trump,PER,0.303782,15,46,Donald Trump,5.0,1,[Donald Trump]
1,Donald Trump,PER,0.783635,95,79,Donald Trump,5.0,1,[Donald Trump]
10,trumps,PER,0.206012,77,55,Donald Trump,5.0,1,[Donald Trump]
11,Trump,PER,0.021739,17,59,Donald Trump,5.0,1,[Donald Trump]
12,Donald,PER,0.610556,26,68,Donald Trump,5.0,1,[Donald Trump]
2,J. biden,PER,0.610754,29,76,joe biden,4.0,2,[Joe Biden]
3,joe biden,PER,0.324899,32,6,joe biden,4.0,2,[Joe Biden]
4,Biden,PER,0.492239,27,49,joe biden,4.0,2,[Joe Biden]
5,Bide,PER,0.315379,33,51,joe biden,4.0,2,[Joe Biden]
7,Christopher c . miller,PER,0.353775,25,10,Christopher c . miller,2.0,3,[]
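Conceptually, whitelist matching attaches to each record the whitelist terms that are similar enough to it. The sketch below is a simplified illustration and not the `match_whitelist` implementation: it uses `difflib.SequenceMatcher` as a stand-in scorer, and comparing the whitelist against each record's `cluster_id` is an assumption about what cluster-level matching could look like. `stand_in_ratio` and `naive_whitelist_match` are hypothetical names.

```python
from difflib import SequenceMatcher


def stand_in_ratio(a: str, b: str) -> float:
    """Stand-in 0-100 scorer from the standard library; not a rapidfuzz scorer."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def naive_whitelist_match(records, whitelist, scorer=stand_in_ratio, score_cutoff=80):
    """Add a 'matches' list with the whitelist terms scoring at or above the cutoff."""
    return [
        {
            **rec,
            # Assumption: compare against the cluster label, not the raw word.
            "matches": [term for term in whitelist
                        if scorer(rec["cluster_id"], term) >= score_cutoff],
        }
        for rec in records
    ]


records = [
    {"word": "Bide", "cluster_id": "joe biden"},
    {"word": "Trump", "cluster_id": "Donald Trump"},
    {"word": "Christopher c . miller", "cluster_id": "Christopher c . miller"},
]
matched = naive_whitelist_match(records, ["Donald Trump", "Joe Biden"])
for row in matched:
    print(row["word"], row["matches"])
```

Because the comparison runs against the cluster label, even a truncated mention such as 'Bide' inherits the match of its cluster.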


Whitelist matching can also be conducted using `Whitelist` subclasses. In the example below, NER output is compared to a `Whitelist` consisting of `Cities`.

In [22]:
from fuzzup.whitelists import Cities

LOCATIONS = [{'word': 'Viborg', 'entity_group': 'LOC', 'cluster_id' : 'vibbe'}, 
             {'word': 'Uldum', 'entity_group': 'ORG', 'cluster_id' : 'uldum' }]

# initialize whitelist
cities = Cities()

# clustering and whitelist matching
clusters = fuzzy_cluster(LOCATIONS)
matches = cities(clusters,
                 score_cutoff=90)

matches

INFO:fuzzup.whitelists:Loading whitelist: city
INFO:fuzzup.whitelists:Done loading.


[{'word': 'Viborg',
  'entity_group': 'LOC',
  'cluster_id': 'vibbe',
  'matches': ['Viborg'],
  'mappings': [{'municipality': 'Viborg'}]}]