# NAMA Demo

First, let's create some simple data and import the packages we need.

In [1]:
import pandas as pd
import numpy as np
from nama import Matcher

df1 = pd.DataFrame(['ABC Inc.','abc inc','A.B.C. INCORPORATED','The XYZ Company','X Y Z CO'],columns=['name'])
df2 = pd.DataFrame(['ABC Inc.','XYZ Co.'],columns=['name'])

print(f'Toy data:\ndf1=\n{df1}\ndf2=\n{df2}')

Toy data:
df1=
                  name
0             ABC Inc.
1              abc inc
2  A.B.C. INCORPORATED
3      The XYZ Company
4             X Y Z CO
df2=
       name
0  ABC Inc.
1   XYZ Co.


Nama is built around an object called a `Matcher`, which holds matching information about a set of strings and partitions the strings into non-overlapping groups.
   - Strings in the same group are considered "matched".
   - Strings in different groups are not matched.

Nama provides tools for creating, modifying, saving, and loading matchers. Matchers can then be used to generate unique group ids for a set of strings, or to perform two-way merges between pandas dataframes according to the match groups.
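
To make the partition idea concrete, here is a minimal sketch of a disjoint-set structure in plain Python. It is not how nama implements `Matcher`: the class `ToyPartition` and its methods are hypothetical names chosen for illustration. The labelling rule (most common string, ties broken alphabetically) follows the description given in this demo.

```python
from collections import Counter, defaultdict


class ToyPartition:
    """Hypothetical stand-in for a matcher: strings partitioned into groups."""

    def __init__(self):
        self.counts = Counter()  # how often each string was added
        self.parent = {}         # disjoint-set forest: string -> parent string

    def add_strings(self, strings):
        # New strings start as singleton groups (each is its own root)
        for s in strings:
            self.counts[s] += 1
            self.parent.setdefault(s, s)

    def _find(self, s):
        # Walk up to the root, shortening the path along the way
        while self.parent[s] != s:
            self.parent[s] = self.parent[self.parent[s]]
            s = self.parent[s]
        return s

    def unite(self, strings):
        # Merge every group that contains one of the passed strings
        strings = list(strings)
        root = self._find(strings[0])
        for s in strings[1:]:
            self.parent[self._find(s)] = root

    def groups(self):
        members = defaultdict(list)
        for s in self.parent:
            members[self._find(s)].append(s)
        # Label each group by its most common string, ties broken alphabetically
        labelled = {}
        for group in members.values():
            label = min(group, key=lambda s: (-self.counts[s], s))
            labelled[label] = sorted(group)
        return labelled


toy = ToyPartition()
toy.add_strings(['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED', 'The XYZ Company', 'X Y Z CO'])
toy.add_strings(['ABC Inc.', 'XYZ Co.'])
toy.unite(['X Y Z CO', 'XYZ Co.'])
toy_groups = toy.groups()
print(toy_groups)
```

Because every string has exactly one root, a string can never belong to two groups at once, which is the non-overlapping property described above.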

In [10]:
# We start matching by creating an empty matcher
matcher = Matcher()

# First we need to add all the strings we want to match to the matcher
# (in this case, the strings in the name column of each dataframe)
matcher = matcher.add_strings(df1['name'])
matcher = matcher.add_strings(df2['name'])

# Initially, strings are automatically assigned to singleton groups
# (Groups are automatically labelled according to the most common string,
# with ties broken alphabetically)
print(f'Initial string groups:\n{matcher.groups}')

# At this point we can merge on exact matches, but there isn't much point
# (equivalent to pandas merge function)
print(f"Exact matching with singleton groups:\n{matcher.merge_dfs(df1,df2,on='name')}")

# To get better results, we need to modify the matcher.
# Unite merges all groups that contain the passed strings.
matcher = matcher.unite(['X Y Z CO','XYZ Co.'])
print(f'Updated string groups:\n{matcher.groups}')

# We can inspect the united groups in 3 ways.
# First, we can get the group that any string belongs to with
print(matcher['XYZ Co.'])
# We can inspect all the strings in the same group (i.e. that match) with
print(matcher.matches('XYZ Co.'))
# Lastly, we can convert the matcher to a dataframe
print(matcher.to_df())

Initial string groups:
{'ABC Inc.': ['ABC Inc.'], 'abc inc': ['abc inc'], 'A.B.C. INCORPORATED': ['A.B.C. INCORPORATED'], 'The XYZ Company': ['The XYZ Company'], 'X Y Z CO': ['X Y Z CO'], 'XYZ Co.': ['XYZ Co.']}
Exact matching with singleton groups:
     name_x match_group    name_y
0  ABC Inc.    ABC Inc.  ABC Inc.
Updated string groups:
{'ABC Inc.': ['ABC Inc.'], 'abc inc': ['abc inc'], 'A.B.C. INCORPORATED': ['A.B.C. INCORPORATED'], 'The XYZ Company': ['The XYZ Company'], 'X Y Z CO': ['X Y Z CO', 'XYZ Co.']}
X Y Z CO
['X Y Z CO', 'XYZ Co.']
                string  count                group
0             ABC Inc.      2             ABC Inc.
1             X Y Z CO      1             X Y Z CO
2              XYZ Co.      1             X Y Z CO
3  A.B.C. INCORPORATED      1  A.B.C. INCORPORATED
4      The XYZ Company      1      The XYZ Company
5              abc inc      1              abc inc


`unite` is very flexible. We can pass a single set of strings, a nested list of strings, or a mapping from strings to group labels. The mapping can even be a function that evaluates strings and generates a label. This makes it very simple to do hash collision matching.

Hash collision matching works by matching any strings that have the same hash. A hash could be almost anything, but one useful way to do collision matching is to match strings that are identical after simplifying both strings.
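
The idea can be sketched in a few lines of plain Python. The helpers `toy_simplify` and `collide` below are hypothetical, and the suffix list is an assumption made for this example; they are not nama's own simplification rules.

```python
import re
from collections import defaultdict

# Assumed suffix list for illustration; a real simplifier would cover many more
SUFFIXES = {'inc', 'incorporated', 'ltd', 'co', 'company'}


def toy_simplify(name):
    """Lowercase, drop punctuation, then drop a leading 'the' and one trailing suffix."""
    tokens = re.sub(r'[^\w\s]', '', name.lower()).split()
    if tokens and tokens[0] == 'the':
        tokens = tokens[1:]
    if tokens and tokens[-1] in SUFFIXES:
        tokens = tokens[:-1]
    return ' '.join(tokens)


def collide(strings, hash_func):
    """Bucket strings by their hash; strings sharing a bucket are matched."""
    buckets = defaultdict(list)
    for s in strings:
        buckets[hash_func(s)].append(s)
    return dict(buckets)


names = ['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED',
         'The XYZ Company', 'X Y Z CO', 'XYZ Co.']
buckets = collide(names, toy_simplify)
for key, members in buckets.items():
    print(f'{key!r}: {members}')
```

With these toy rules, 'X Y Z CO' stays in its own bucket because the simplifier keeps the spaces between the letters. Collision matching is only as good as the hash function.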

Nama provides some useful simplification functions in `nama.utils`. `simplify_corp` strips punctuation and capitalization, and removes common parts of names, such as a leading "the" or a trailing "inc" or "ltd".

In [11]:
from nama.utils import simplify_corp

# Make a new matcher for comparison
corp_matcher = Matcher(matcher.strings())

# Unite strings with the same simplified representation
corp_matcher = corp_matcher.unite(simplify_corp)

print(f'Groups after uniting by simplify_corp:\n{corp_matcher.groups}')

Groups after uniting by simplify_corp:
{'A.B.C. INCORPORATED': ['A.B.C. INCORPORATED', 'ABC Inc.', 'abc inc'], 'The XYZ Company': ['The XYZ Company', 'XYZ Co.'], 'X Y Z CO': ['X Y Z CO']}


Another useful approach to matching is to construct a similarity measure between strings. The standard way to do this is to break strings into "tokens" (words or short substrings) and use a measure like a weighted Jaccard similarity index to summarize the overlap between the tokens in pairs of strings. The token_similarity module provides tools for matching based on token similarity.
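
As a rough illustration of the idea (not nama's own implementation), the sketch below tokenizes names into lowercase words and scores pairs with a weighted Jaccard index: the total weight of the shared tokens divided by the total weight of all tokens in either string. The function names and the inverse-document-frequency style weighting are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(name):
    """Split a name into a set of lowercase word tokens."""
    return set(re.findall(r'\w+', name.lower()))


def idf_weights(strings):
    """Give rarer tokens a larger weight (assumed weighting scheme)."""
    doc_freq = Counter(tok for s in strings for tok in tokenize(s))
    n = len(strings)
    return {tok: math.log(1 + n / df) for tok, df in doc_freq.items()}


def weighted_jaccard(a, b, weights):
    """Weight of shared tokens divided by weight of all tokens in either string."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = sum(weights.get(t, 1.0) for t in tokens_a | tokens_b)
    if union == 0:
        return 0.0
    shared = sum(weights.get(t, 1.0) for t in tokens_a & tokens_b)
    return shared / union


names = ['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED',
         'The XYZ Company', 'X Y Z CO', 'XYZ Co.']
weights = idf_weights(names)

same = weighted_jaccard('ABC Inc.', 'abc inc', weights)
partial = weighted_jaccard('The XYZ Company', 'XYZ Co.', weights)
disjoint = weighted_jaccard('ABC Inc.', 'The XYZ Company', weights)
print(same, partial, disjoint)
```

Word tokens cannot see that 'A.B.C.' and 'ABC' are related, which is one reason short substrings are sometimes used as tokens instead.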

First, create a TokenSimilarity model. This can be customized with different tokenizers, similarity measures, and token weighting methods.

In [None]:
from nama.similarity import TokenSimilarity

token_model = TokenSimilarity()

# In the future: A training set can be used to automatically pick the optimal similarity threshold for uniting strings.
# For now: Just set the threshold manually.

# Then we can use the similarity model to predict matches between the matcher
# strings. The predict method returns a new matcher.
token_matcher = token_model.predict(matcher.strings(), threshold=0.05)
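
How a threshold turns pairwise scores into groups can be sketched as follows. This is an assumption about the general technique, not a description of what `predict` does internally: every pair scoring at or above the threshold is linked, and linked strings end up in the same group. The scorer is `difflib.SequenceMatcher` from the standard library, used only as a stand-in similarity measure, and `unite_above` is a hypothetical helper.

```python
from difflib import SequenceMatcher
from itertools import combinations


def toy_score(a, b):
    """Stand-in similarity in [0, 1]; not the measure used by nama."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def unite_above(strings, threshold, score=toy_score):
    """Group unique strings by linking every pair whose score is >= threshold."""
    group_of = {s: {s} for s in strings}
    for a, b in combinations(strings, 2):
        if score(a, b) >= threshold and group_of[a] is not group_of[b]:
            merged = group_of[a] | group_of[b]
            for s in merged:
                group_of[s] = merged
    # Several strings share one set object; keep each distinct set once
    unique = {id(g): g for g in group_of.values()}
    return sorted(sorted(g) for g in unique.values())


names = ['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED',
         'The XYZ Company', 'X Y Z CO', 'XYZ Co.']
loose = unite_above(names, threshold=0.0)    # every pair qualifies
strict = unite_above(names, threshold=1.01)  # no pair can qualify
exact = unite_above(['ABC Inc', 'abc inc', 'XYZ Co'], threshold=1.0)
print(loose, strict, exact, sep='\n')
```

Because links chain together, a low threshold can pull unrelated names into one group through a sequence of weak matches, so the threshold is worth checking against known examples.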



Notice that the combination of the two matchers correctly groups all the strings. It is often useful to combine multiple matching techniques.

We can integrate the corp and token matchers into the original matcher with unite.

In [None]:
matcher = matcher.unite(corp_matcher)
matcher = matcher.unite(token_matcher)

# Now merging the dataframes gives us the desired output
print(f"Merging with the final matcher:\n{matcher.merge_dfs(df1,df2,on='name')}")
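
The logic behind merging on match groups can be sketched without nama: map every name to its group label, then join the rows whose labels agree. The `group_of` mapping below is written by hand for illustration (its labels are assumed, not taken from actual `merge_dfs` output), and `merge_on_groups` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written mapping from each name to an assumed group label
group_of = {
    'ABC Inc.': 'ABC Inc.',
    'abc inc': 'ABC Inc.',
    'A.B.C. INCORPORATED': 'ABC Inc.',
    'The XYZ Company': 'The XYZ Company',
    'X Y Z CO': 'The XYZ Company',
    'XYZ Co.': 'The XYZ Company',
}


def merge_on_groups(left_names, right_names, group_of):
    """Inner join: pair each left name with every right name in the same group."""
    right_by_group = defaultdict(list)
    for name in right_names:
        right_by_group[group_of[name]].append(name)
    rows = []
    for name in left_names:
        group = group_of[name]
        for match in right_by_group.get(group, []):
            rows.append((name, group, match))
    return rows


left = ['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED', 'The XYZ Company', 'X Y Z CO']
right = ['ABC Inc.', 'XYZ Co.']
rows = merge_on_groups(left, right, group_of)
for row in rows:
    print(row)
```

In pandas, the same inner join can be expressed by adding a group column to each dataframe and calling `DataFrame.merge` on that column.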

We can also use the Embedding Similarity model to predict the similarity of larger and more complex groupings.

In [None]:
from nama.utils import load_similarity_model

# First we'll need to load a model,
# for example from a file
sim = load_similarity_model("path-to-model.bin")

# Or we can use our standard model from huggingface
# .... TBD

# Then we'll have the model embed our matcher
embeddings = sim.embed(matcher)

# Now we can do some matching
# We can unite strings according to their predicted pairwise similarity
sim_matcher_similar = embeddings.unite_similar(threshold=0.5)

# We can unite strings with each string's most similar target string
# This method requires a set of target strings which will be matched to our embedded strings
sim_matcher_nearest = embeddings.unite_nearest(target_strings=corp_matcher,threshold=0)

# We can also manipulate the embeddings by slicing like so
first_string = embeddings[0]
#...

# Lastly we can save the embeddings for later use
embeddings.save("path-to-save-embeddings.bin")
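
Both uniting strategies rest on comparing embedding vectors. As a hedged illustration, the sketch below uses made-up 2-dimensional vectors and cosine similarity to pick each string's nearest target. The vectors, the helper names, and the choice of cosine similarity are all assumptions, since this demo does not say how the model scores pairs.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_target(vectors, targets, threshold=0.0):
    """Map each string to its most similar target, if the score clears the threshold."""
    matches = {}
    for name, vec in vectors.items():
        best = max(targets, key=lambda t: cosine(vec, targets[t]))
        if cosine(vec, targets[best]) >= threshold:
            matches[name] = best
    return matches


# Made-up embeddings, for illustration only
toy_vectors = {'abc inc': (0.9, 0.1), 'X Y Z CO': (0.2, 0.8)}
toy_targets = {'ABC Inc.': (1.0, 0.0), 'XYZ Co.': (0.0, 1.0)}

nearest = nearest_target(toy_vectors, toy_targets, threshold=0.5)
none_close_enough = nearest_target(toy_vectors, toy_targets, threshold=0.999)
print(nearest)
print(none_close_enough)
```

Matching against nearest targets gives each string at most one partner, whereas uniting all pairs above a threshold can grow groups of any size.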

We can also train an Embedding Similarity model to predict the similarity of larger and more complex groupings for which we have some target values.

In [None]:
from nama.similarity import EmbeddingSimilarity

train_kwargs = {
    'max_epochs': 2,
    'warmup_frac': 0.2,
    'transformer_lr':1e-5,
    'score_lr':10,
    'batch_size':8,
}

sim = EmbeddingSimilarity()

history_df = sim.train(matcher, verbose=True, **train_kwargs)

With a trained model we can run some tests

In [None]:
# We can test the similarity model with a single threshold
test_scores = sim.test(matcher, threshold=0.5)

# Or we can run a test over multiple thresholds to find the optimal one
test_scores = sim.test(matcher, threshold=np.linspace(0,1,20))
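
Sweeping thresholds amounts to scoring the same predictions at several cut-offs and keeping the best one. The sketch below assumes F1 as the metric and uses hand-made pair scores. The demo does not state which metrics `test` reports, so treat both as illustrative.

```python
def f1_at(threshold, scored_pairs):
    """F1 when every pair scoring >= threshold is predicted to be a match."""
    tp = sum(1 for score, is_match in scored_pairs if score >= threshold and is_match)
    fp = sum(1 for score, is_match in scored_pairs if score >= threshold and not is_match)
    fn = sum(1 for score, is_match in scored_pairs if score < threshold and is_match)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hand-made (similarity score, is a true match) pairs, for illustration only
toy_pairs = [(0.95, True), (0.80, True), (0.60, False),
             (0.55, True), (0.30, False), (0.10, False)]

# 20 evenly spaced thresholds from 0 to 1, mirroring np.linspace(0, 1, 20)
thresholds = [i / 19 for i in range(20)]

best_threshold = max(thresholds, key=lambda t: f1_at(t, toy_pairs))
best_f1 = f1_at(best_threshold, toy_pairs)
print(best_threshold, best_f1)
```

When several thresholds tie, `max` keeps the first one it sees, so this sketch reports the lowest of the tied thresholds.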

We can then save our model to a file for later use.

In [None]:
sim.save("path-to-model.bin")

The matcher can also be converted to a dataframe if we want to cluster the names in one dataset or create a mapping to string groups that can be used across multiple datasets.

In [None]:
matcher_df = matcher.to_df()
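
One way to use such a table across datasets is to turn it into a lookup from string to group id. The rows below copy the `string` and `group` columns printed earlier in this demo. The integer ids and the helper name `build_group_ids` are assumptions for illustration.

```python
# (string, group) pairs copied from the to_df() output shown earlier
rows = [
    ('ABC Inc.', 'ABC Inc.'),
    ('X Y Z CO', 'X Y Z CO'),
    ('XYZ Co.', 'X Y Z CO'),
    ('A.B.C. INCORPORATED', 'A.B.C. INCORPORATED'),
    ('The XYZ Company', 'The XYZ Company'),
    ('abc inc', 'abc inc'),
]


def build_group_ids(rows):
    """Assign one integer id per group label and map every string to its id."""
    label_ids = {}
    string_ids = {}
    for string, group in rows:
        # Reuse the id if the label was seen before, otherwise take the next one
        group_id = label_ids.setdefault(group, len(label_ids))
        string_ids[string] = group_id
    return string_ids


string_ids = build_group_ids(rows)

# The same lookup can now tag names in any number of datasets
dataset_a = ['X Y Z CO', 'abc inc']
dataset_b = ['XYZ Co.', 'Unknown LLC']
ids_a = [string_ids.get(name) for name in dataset_a]
ids_b = [string_ids.get(name) for name in dataset_b]
print(ids_a, ids_b)
```

Names that the matcher has never seen come back as `None`, which makes gaps in coverage easy to spot.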

Finally, we can save the matcher in CSV format for later use.

In [None]:
matcher.to_csv('matcher.csv')

In [None]:
from nama import read_csv

# ...and load it again at a later time
loaded_matcher = read_csv('matcher.csv')