# `fuzzup` 
`fuzzup` offers (1) a simple approach for clustering string entitities based on 
[Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance) using
[Fuzzy Matching](https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation))
in conjunction with a simple rule-based clustering method. 

`fuzzup` also provides (2) functions for computing the prominence of the resulting 
entity clusters resulting from (1).

In this section we will go through the nuts and bolts of `fuzzup` by applying it to a realistic setting.

## Designed for handling output from NER
An important use-case for `fuzzup` is organizing and structuring output from [Named-Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition)(=NER).

For this reason `fuzzup` has been designed to fit the output from NER predictions from the [Hugging Face](https://huggingface.co/) [transformers](https://github.com/huggingface/transformers) [NER pipeline](https://huggingface.co/docs/transformers/v4.16.2/en/main_classes/pipelines#transformers.TokenClassificationPipeline) specifically.


## Use-case
Say, we have used an NER algorithm to extract person names from a news article. The output from the algorithm is a list of strings entities and looks like this.

In [None]:
PERSONS = ['Donald Trump', 'Donald Trump', 'J. biden', 'joe biden', 'Biden', 'Bide', 'mark esper', 'Christopher c . miller', 'jim mattis', 'Nancy Pelosi', 'trumps', 'Trump', 'Donald', 'miller']

: 

As you can see, the output is rather messy (partly due to the stochastic nature of the algorithm). Another reason for the output looking messy is, that for instance 'Joe Biden' has been mentioned a lot of times but in different ways, i.e. 'Joe Biden', 'J. Biden' and 'Biden'. 

We want to organize these strings entities by forming meaningful clusters from them, in which the entities are closely related based on their pairwise edit distances. 

## Workflow

The solution `fuzzup` offers for this task consists of three steps

1. Compute all of the mutual string distances (Levensteihn Distances/fuzzy ratios) between the strings
2. Form clusters of strings based on the distances from (1)
3. Rank the clusters by simply counting the number of nodes(strings) in each cluster

### Step 1: Compute String Distances

First, we compute all of the mutual [fuzzy ratios]() for the strings.

[Fuzzy ratios](https://en.wikipedia.org/wiki/Fuzzy_matching_(computer-assisted_translation)) are numbers between 0 and 100 are measures of similarity between strings. They are derived from the [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance) - a string metric, that measures the distance between strings. 

In short the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. 

`fuzzup` has a function `compute_fuzzy_matrix` for this, that presents the output - the mutual fuzzy ratios - as a cross-tabular matrix with all ratios. We have chosen `partial_ratio` from the `fuzzywuzzy` package to do this. As the name suggests, it matches strings partially.


In [7]:
from fuzzup.gear import compute_fuzzy_matrix, form_clusters, form_clusters_and_rank
from fuzzywuzzy import fuzz
fuzzy_matrix = compute_fuzzy_matrix(strings, ratio = fuzz.partial_ratio)
fuzzy_matrix

ImportError: cannot import name 'compute_clusters' from 'fuzzup.gear' (/Users/lars.kjeldgaard/projects/fuzzup/venv/lib/python3.8/site-packages/fuzzup/gear.py)

As you see, the different string representations of e.g. Donald Trump and Joe Biden have high mutual fuzzy ratios. In comparision representations of different persons have relatively small fuzzy ratios.

You can think of this matrix as a correlation matrix, that shows the correlation between strings.

### Step 2: Forming Clusters
We can now use unsupervised learning to form clusters of the strings, if we treat the individual strings as observations with the mutual fuzzy ratios (in our `fuzzy_matrix`) as features.

We will apply hierarchical clustering (see [ISLR p. 390-396](https://statlearning.com/ISLR%20Seventh%20Printing.pdf ) for a good description of hierarchical clustering).

The clustering algorithm will form the clusters in order to minimize pairwise string distances intra-cluster. This is done by invoking the `form_clusters` function, i.e.

In [13]:
form_clusters(
    fuzzy_matrix,
    args_linkage = {'method': 'average',
                    'metric': 'euclidean'},
)

[['donald trump', 'trump', 'trumps'],
 ['christopher c . miller', 'miller'],
 ['mark esper'],
 ['george floyds'],
 ['joe biden'],
 ['jim mattis'],
 ['nancy pelosi']]

Here we have applied the 'average' linkage function (popular choice), and we measure the pairwise distances between strings/clusters in an euclidean space spanned by the mutual fuzzy ratios.

We see from the results, that different string representations of e.g. 'Donald Trump' have been clustered together.

Depending on your use case, you can customize arguments for computing (1) the pairwise distrances 'args_pdist', (2) arguments for the linkage function 'args_linkage' (i.e. 'method' and 'metric'), (3) arguments for the hierarchical clustering algorithm 'args_cluster' and (4) the 'flatten_coef' that is a coefficient for computing a generic cutoff for deciding how many clusters to create.

### Step 3: Rank Clusters
A simple/naïve approach for assigning weights to the different string clusters is to just count the number of nodes/strings in each cluster. `fuzzup` has a function for this: `form_clusters_and_rank`. `form_clusters_and_rank` performs steps 1 and 2 and ranks the clusters using this simple logic:

In [None]:
form_clusters_and_rank(strings)

As you see, the longest string from each cluster has been promoted to 'PROMOTED_STRING' for the given cluster.