# NAMA Demo

First, let's create some simple data and import the packages we need.

In [23]:
import pandas as pd
import numpy as np
from nama import MatchData

df1 = pd.DataFrame(['ABC Inc.','abc inc','A.B.C. INCORPORATED','The XYZ Company','X Y Z CO'],columns=['name'])
df2 = pd.DataFrame(['ABC Inc.','XYZ Co.'],columns=['name'])

print(f'Toy data:\ndf1=\n{df1}\ndf2=\n{df2}')

Toy data:
df1=
                         name
0                    ABC Inc.
1                     abc inc
2         A.B.C. INCORPORATED
3             The XYZ Company
4                    X Y Z CO
5        GlobalTech Solutions
6         GlobalTechSolutions
7                  GLOBALTECH
8   Innovate Innovations Ltd.
9        Innovate Innovations
10              Innovate Inc.
11        Sunrise Enterprises
12              Sunrise Corp.
13                Sunrise Co.
14          NexGen Innovators
15                NexGen Ltd.
16                     NEXGEN
17             EcoGreen Group
18             EcoGreen Corp.
19        EcoGreen Industries
20         TechFusion Systems
21                 TechFusion
22            TechFusion Ltd.
23    AlphaOmega Technologies
24            AlphaOmega Tech
25                    AO Tech
26             SwiftSync Inc.
27        SwiftSync Solutions
28                  SWIFTSYNC
29         VitalCore Dynamics
30             VitalCore Inc.
31            VitalCore G

## Match Data

Nama is built around an object called `MatchData`, which holds matching information about a set of strings and partitions the strings into non-overlapping groups.
   - Strings in the same group are considered "matched".
   - Strings in different groups are not matched.

Nama provides tools for creating, modifying, saving, and loading matches. These matches can then be used to generate unique group ids for a set of strings, or to perform two-way merges between pandas dataframes according to the match groups.
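
To make the idea of a partition concrete, here is a minimal pure-Python sketch of the bookkeeping described above. It is not Nama's implementation: the `ToyPartition` class and its methods are hypothetical names used only for illustration, and the way it picks a label is an arbitrary choice.

```python
class ToyPartition:
    """Hypothetical illustration of a partition of strings into groups."""

    def __init__(self, strings):
        # Every string starts in its own singleton group, keyed by itself.
        self.group_of = {s: s for s in strings}

    def unite(self, strings):
        # Merge every group that contains one of the passed strings.
        labels = {self.group_of[s] for s in strings}
        new_label = min(labels)  # arbitrary but deterministic label choice
        for s, label in self.group_of.items():
            if label in labels:
                self.group_of[s] = new_label

    def matched(self, a, b):
        # Two strings match exactly when they share a group.
        return self.group_of[a] == self.group_of[b]


p = ToyPartition(['ABC Inc.', 'abc inc', 'A.B.C. INCORPORATED', 'XYZ Co.'])
p.unite(['ABC Inc.', 'A.B.C. INCORPORATED'])
print(p.matched('ABC Inc.', 'A.B.C. INCORPORATED'))  # True
print(p.matched('ABC Inc.', 'XYZ Co.'))              # False
```

Because each string maps to exactly one label, the groups can never overlap, which is the property the rest of this demo relies on.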

In [24]:
# We start matching by creating an empty MatchData object
matches = MatchData()

# First we need to add all the strings we want to match
# (in this case, the strings in the name column of each dataframe)
matches = matches.add_strings(df1['name'])
matches = matches.add_strings(df2['name'])

# Initially, strings are automatically assigned to singleton groups
# (Groups are automatically labelled according to the most common string,
# with ties broken alphabetically)
print(f'Initial string groups:\n{matches.groups}')

Initial string groups:
{'ABC Inc.': ['ABC Inc.'], 'abc inc': ['abc inc'], 'A.B.C. INCORPORATED': ['A.B.C. INCORPORATED'], 'The XYZ Company': ['The XYZ Company'], 'X Y Z CO': ['X Y Z CO'], 'GlobalTech Solutions': ['GlobalTech Solutions'], 'GlobalTechSolutions': ['GlobalTechSolutions'], 'GLOBALTECH': ['GLOBALTECH'], 'Innovate Innovations Ltd.': ['Innovate Innovations Ltd.'], 'Innovate Innovations': ['Innovate Innovations'], 'Innovate Inc.': ['Innovate Inc.'], 'Sunrise Enterprises': ['Sunrise Enterprises'], 'Sunrise Corp.': ['Sunrise Corp.'], 'Sunrise Co.': ['Sunrise Co.'], 'NexGen Innovators': ['NexGen Innovators'], 'NexGen Ltd.': ['NexGen Ltd.'], 'NEXGEN': ['NEXGEN'], 'EcoGreen Group': ['EcoGreen Group'], 'EcoGreen Corp.': ['EcoGreen Corp.'], 'EcoGreen Industries': ['EcoGreen Industries'], 'TechFusion Systems': ['TechFusion Systems'], 'TechFusion': ['TechFusion'], 'TechFusion Ltd.': ['TechFusion Ltd.'], 'AlphaOmega Technologies': ['AlphaOmega Technologies'], 'AlphaOmega Tech': ['AlphaOm

In [25]:
# At this point we can merge on exact matches, but there isn't much point
# (equivalent to pandas merge function)
print(f"Exact matching with singleton groups:\n{matches.merge_dfs(df1,df2,on='name')}")

Exact matching with singleton groups:
     name_x match_group    name_y
0  ABC Inc.    ABC Inc.  ABC Inc.


In [26]:
# To get better results, we need to modify the matches.
# Unite merges all groups that contain the passed strings.
matches = matches.unite(['ABC Inc.', 'A.B.C. INCORPORATED'])
print(f'Updated string groups:\n{matches.groups}')

Updated string groups:
{'ABC Inc.': ['ABC Inc.', 'A.B.C. INCORPORATED'], 'abc inc': ['abc inc'], 'The XYZ Company': ['The XYZ Company'], 'X Y Z CO': ['X Y Z CO'], 'GlobalTech Solutions': ['GlobalTech Solutions'], 'GlobalTechSolutions': ['GlobalTechSolutions'], 'GLOBALTECH': ['GLOBALTECH'], 'Innovate Innovations Ltd.': ['Innovate Innovations Ltd.'], 'Innovate Innovations': ['Innovate Innovations'], 'Innovate Inc.': ['Innovate Inc.'], 'Sunrise Enterprises': ['Sunrise Enterprises'], 'Sunrise Corp.': ['Sunrise Corp.'], 'Sunrise Co.': ['Sunrise Co.'], 'NexGen Innovators': ['NexGen Innovators'], 'NexGen Ltd.': ['NexGen Ltd.'], 'NEXGEN': ['NEXGEN'], 'EcoGreen Group': ['EcoGreen Group'], 'EcoGreen Corp.': ['EcoGreen Corp.'], 'EcoGreen Industries': ['EcoGreen Industries'], 'TechFusion Systems': ['TechFusion Systems'], 'TechFusion': ['TechFusion'], 'TechFusion Ltd.': ['TechFusion Ltd.'], 'AlphaOmega Technologies': ['AlphaOmega Technologies'], 'AlphaOmega Tech': ['AlphaOmega Tech'], 'AO Tech': ['
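
The labelling rule mentioned in the earlier comments (most common string, ties broken alphabetically) explains why the united group is labelled `'ABC Inc.'`: that string appears in both dataframes. A small sketch of such a rule follows. The `pick_label` helper is hypothetical, and it assumes "alphabetically" means ordinary Python string ordering.

```python
from collections import Counter


def pick_label(counts):
    """Return the most frequent string; break ties by plain string ordering."""
    # Sort key: higher count first, then ordinary (code point) string order.
    return min(counts, key=lambda s: (-counts[s], s))


# 'ABC Inc.' occurs twice (once per dataframe), so it wins on frequency.
print(pick_label(Counter({'ABC Inc.': 2, 'A.B.C. INCORPORATED': 1})))

# With equal counts, the string that sorts first becomes the label.
print(pick_label(Counter({'ABC Inc.': 1, 'abc inc': 1, 'A.B.C. INCORPORATED': 1})))
```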

`unite` is very flexible. We can pass a single set of strings, a nested list of strings, or a mapping from strings to group labels. The mapping can even be a function that evaluates strings and generates a label. This makes it very simple to do hash collision matching.

Hash collision matching works by matching any strings that have the same hash. A hash could be almost anything, but one useful way to do collision matching is to match strings that are identical after simplifying both strings.
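
As a rough sketch of the idea, independent of Nama, strings can be bucketed by a hash key, and every bucket with more than one member is a set of matches. Here the key is simply the case-folded string with non-alphanumeric characters dropped. Both `toy_key` and `collide` are hypothetical helpers.

```python
from collections import defaultdict


def toy_key(s):
    # Hypothetical hash: lowercase and keep only letters and digits.
    return ''.join(ch for ch in s.casefold() if ch.isalnum())


def collide(strings, key):
    # Bucket strings by key; strings sharing a bucket "collide" and match.
    buckets = defaultdict(list)
    for s in strings:
        buckets[key(s)].append(s)
    return dict(buckets)


names = ['ABC Inc.', 'abc inc', 'X Y Z CO', 'XYZ Co.']
print(collide(names, toy_key))
# {'abcinc': ['ABC Inc.', 'abc inc'], 'xyzco': ['X Y Z CO', 'XYZ Co.']}
```

The choice of key controls how aggressive the matching is: a key that throws away more detail produces more collisions, and therefore more (and riskier) matches.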

Nama provides some useful simplification functions in `nama.utils`. `simplify_corp` strips punctuation and capitalization, and removes common name components such as a leading "the" or a trailing "inc" or "ltd".
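
The exact rules of `simplify_corp` are not reproduced here. The following is only a guess at what a function of this kind might look like, written as a hypothetical `toy_simplify_corp`. The suffix list is an assumption and the real function may behave differently.

```python
import re

# Assumed list of trailing tokens to drop; the real list may differ.
SUFFIXES = {'inc', 'incorporated', 'ltd', 'co', 'company', 'corp'}


def toy_simplify_corp(name):
    # Lowercase, then delete punctuation (so 'A.B.C.' becomes 'abc').
    s = re.sub(r'[^\w\s]', '', name.lower())
    tokens = s.split()
    # Drop a leading "the" and any trailing corporate suffixes.
    if tokens and tokens[0] == 'the':
        tokens = tokens[1:]
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return ' '.join(tokens)


for n in ['ABC Inc.', 'A.B.C. INCORPORATED', 'The XYZ Company', 'X Y Z CO']:
    print(repr(n), '->', repr(toy_simplify_corp(n)))
```

Note that whitespace is kept, so `'X Y Z CO'` simplifies to `'x y z'` rather than `'xyz'` and does not collide with `'The XYZ Company'` under this toy rule.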

In [27]:
from nama import simplify_corp

# Make a new MatchData object for comparison
corp_matches = MatchData(matches.strings())

# Unite strings with the same simplified representation
corp_matches = corp_matches.unite(simplify_corp)

print(f'Groups after uniting by simplify_corp:\n{corp_matches.groups}')

Groups after uniting by simplify_corp:
{'A.B.C. INCORPORATED': ['A.B.C. INCORPORATED', 'ABC Inc.', 'abc inc'], 'The XYZ Company': ['The XYZ Company', 'XYZ Co.'], 'X Y Z CO': ['X Y Z CO'], 'GlobalTech Solutions': ['GlobalTech Solutions'], 'GlobalTechSolutions': ['GlobalTechSolutions'], 'GLOBALTECH': ['GLOBALTECH'], 'Innovate Innovations': ['Innovate Innovations', 'Innovate Innovations Ltd.'], 'Innovate Inc.': ['Innovate Inc.'], 'Sunrise Enterprises': ['Sunrise Enterprises'], 'Sunrise Co.': ['Sunrise Co.', 'Sunrise Corp.'], 'NexGen Innovators': ['NexGen Innovators'], 'NEXGEN': ['NEXGEN', 'NexGen Ltd.'], 'EcoGreen Corp.': ['EcoGreen Corp.', 'EcoGreen Group'], 'EcoGreen Industries': ['EcoGreen Industries'], 'TechFusion Systems': ['TechFusion Systems'], 'TechFusion': ['TechFusion', 'TechFusion Ltd.'], 'AlphaOmega Technologies': ['AlphaOmega Technologies'], 'AlphaOmega Tech': ['AlphaOmega Tech'], 'AO Tech': ['AO Tech'], 'SwiftSync Solutions': ['SwiftSync Solutions'], 'SWIFTSYNC': ['SWIFTSYNC

We can also inspect the united groups

In [28]:
# First, we can get the group that any string belongs to with
print(matches['A.B.C. INCORPORATED'])
# We can inspect all the strings in the same group (i.e. that match) with
print(matches.matches('A.B.C. INCORPORATED'))
# Lastly, we can convert the matches to a dataframe
print(matches.to_df())

ABC Inc.
['ABC Inc.', 'A.B.C. INCORPORATED']
                       string  count                      group
0                    ABC Inc.      2                   ABC Inc.
1         A.B.C. INCORPORATED      1                   ABC Inc.
2                     AO Tech      1                    AO Tech
3             AlphaOmega Tech      1            AlphaOmega Tech
4     AlphaOmega Technologies      1    AlphaOmega Technologies
5                 CRYSTALPEAK      1                CRYSTALPEAK
6             CrystalPeak Co.      1            CrystalPeak Co.
7     CrystalPeak Enterprises      1    CrystalPeak Enterprises
8              EcoGreen Corp.      1             EcoGreen Corp.
9              EcoGreen Group      1             EcoGreen Group
10        EcoGreen Industries      1        EcoGreen Industries
11                 GLOBALTECH      1                 GLOBALTECH
12       GlobalTech Solutions      1       GlobalTech Solutions
13        GlobalTechSolutions      1        GlobalTechSolut

The matches can also be converted to a dataframe if we want to cluster the names in one dataset or create a mapping to string groups that can be used across multiple datasets.
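
For example, once every string has a group label, that label can serve as a join key between datasets. Below is a minimal pure-Python sketch with made-up records. The `revenue` and `employees` fields are purely illustrative and are not part of the demo data.

```python
# string -> group mapping, as could be read off the 'string' and 'group' columns
group_of = {
    'ABC Inc.': 'ABC Inc.',
    'A.B.C. INCORPORATED': 'ABC Inc.',
    'XYZ Co.': 'XYZ Co.',
}

# Two illustrative datasets that spell the same firm differently
left = [{'name': 'A.B.C. INCORPORATED', 'revenue': 10}]
right = [{'name': 'ABC Inc.', 'employees': 5}, {'name': 'XYZ Co.', 'employees': 7}]

# Index the right-hand rows by group, then join each left row on its group
right_by_group = {}
for row in right:
    right_by_group.setdefault(group_of[row['name']], []).append(row)

merged = [
    {'group': group_of[lrow['name']], 'name_x': lrow['name'], 'name_y': rrow['name'],
     'revenue': lrow['revenue'], 'employees': rrow['employees']}
    for lrow in left
    for rrow in right_by_group.get(group_of[lrow['name']], [])
]
print(merged)
```

The two rows are joined even though their names differ, because both names map to the same group label.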

In [7]:
matches_df = matches.to_df()
matches_df

                string  count            group
0             ABC Inc.      2         ABC Inc.
1  A.B.C. INCORPORATED      1         ABC Inc.
2      The XYZ Company      1  The XYZ Company
3             X Y Z CO      1         X Y Z CO
4              XYZ Co.      1          XYZ Co.
5              abc inc      1          abc inc


Finally, we can save the matches in CSV format for later use.

In [8]:
matches.to_csv('matches.csv')

In [9]:
from nama import read_csv

# ...and load it again at a later time
loaded_matches = read_csv('matches.csv')
loaded_matches.to_df()

                string  count            group
0             ABC Inc.      2         ABC Inc.
1  A.B.C. INCORPORATED      1         ABC Inc.
2      The XYZ Company      1  The XYZ Company
3             X Y Z CO      1         X Y Z CO
4              XYZ Co.      1          XYZ Co.
5              abc inc      1          abc inc
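
Because the table has only three columns, it is also easy to handle with standard tools. The sketch below round-trips rows with the `string`, `count` and `group` columns through Python's `csv` module using an in-memory buffer. Whether `matches.csv` uses exactly this layout is an assumption.

```python
import csv
import io

rows = [
    {'string': 'ABC Inc.', 'count': 2, 'group': 'ABC Inc.'},
    {'string': 'A.B.C. INCORPORATED', 'count': 1, 'group': 'ABC Inc.'},
    {'string': 'abc inc', 'count': 1, 'group': 'abc inc'},
]

# Write to an in-memory buffer instead of a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['string', 'count', 'group'])
writer.writeheader()
writer.writerows(rows)

# Read it back; csv returns every field as text, so restore the integer counts
buffer.seek(0)
loaded = [{**r, 'count': int(r['count'])} for r in csv.DictReader(buffer)]
print(loaded == rows)
```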


## Embedding Similarity

The Embedding Similarity model allows us to predict the similarity of larger and more complex strings.

First, we'll need to train a `SimilarityModel` on matches for which we have some target values.

In [10]:
from nama import SimilarityModel

train_kwargs = {
    'max_epochs': 2,
    'warmup_frac': 0.2,
    'transformer_lr':1e-5,
    'score_lr':10,
    'batch_size':8,
}

sim = SimilarityModel()

history_df = sim.train(matches, verbose=True, **train_kwargs)

# Save our trained model to disk
sim.save("path-to-model.bin")

Embedding strings: 100%|██████████| 6/6 [00:02<00:00,  2.91it/s]
training epoch 0: 100%|██████████| 1/1 [00:11<00:00, 11.13s/it]
training epoch 1: 100%|██████████| 1/1 [00:06<00:00,  6.61s/it]


The Embeddings Model has some powerful functions that allow us to unite strings in various ways.

The `unite_similar` function allows us to match similar strings based on their predicted pairwise similarity.
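
Conceptually, this kind of threshold-based uniting can be pictured as follows: compute a similarity score for each pair of embeddings and merge the groups of any pair scoring at or above the threshold. The sketch below uses cosine similarity on hand-made 2-D vectors and a small union-find. It is an illustration only, and Nama's actual scoring and grouping may work differently.

```python
import math
from itertools import combinations

# Hand-made toy vectors standing in for learned embeddings
vectors = {
    'ABC Inc.': (1.0, 0.1),
    'abc inc': (0.9, 0.2),
    'XYZ Co.': (0.1, 1.0),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def unite_by_threshold(vectors, threshold):
    parent = {s: s for s in vectors}

    def find(s):
        # Follow parent links to the representative of s's group
        while parent[s] != s:
            s = parent[s]
        return s

    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in vectors:
        groups.setdefault(find(s), []).append(s)
    return sorted(groups.values())


print(unite_by_threshold(vectors, threshold=0.9))
# [['ABC Inc.', 'abc inc'], ['XYZ Co.']]
```

Matching of this kind is transitive: if A is similar to B and B is similar to C, all three end up in one group even when A and C score below the threshold.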

The `unite_nearest` function allows us to unite embedded strings with their most similar target strings. This function is particularly useful when you have a set of target strings and want to match each embedded string to its nearest target string.
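
A simplified picture of nearest-target matching: for each string, find the target with the highest similarity and accept the pair only if that similarity clears the threshold. The following sketch uses toy 2-D vectors and cosine similarity. The `nearest_target` helper is hypothetical and not part of Nama.

```python
import math

# Toy vectors standing in for embeddings of strings and of target strings
strings = {'abc inc': (0.9, 0.2), 'X Y Z CO': (0.2, 0.9), 'Sunrise Co.': (-1.0, 0.0)}
targets = {'ABC Inc.': (1.0, 0.1), 'XYZ Co.': (0.1, 1.0)}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_target(vec, targets, threshold):
    # Pick the most similar target; return None if it is not similar enough
    best = max(targets, key=lambda t: cosine(vec, targets[t]))
    return best if cosine(vec, targets[best]) >= threshold else None


assignment = {s: nearest_target(v, targets, threshold=0.5) for s, v in strings.items()}
print(assignment)
# {'abc inc': 'ABC Inc.', 'X Y Z CO': 'XYZ Co.', 'Sunrise Co.': None}
```

Unlike pairwise uniting, each string is linked to at most one target, so unrelated targets are never chained together through intermediate strings.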

In [29]:
from nama import load_similarity_model, load_pretrained_model

# We can use our trained model directly or load it from the saved file
#sim = load_similarity_model("path-to-model.bin")

# Or we can use the standard model from Hugging Face; downloading it could take a few minutes
sim = load_pretrained_model("base")

# Then we'll have the model embed our matches
embeddings = sim.embed(matches)

# Now we can do some matching
# We can unite strings according to their predicted pairwise similarity
sim_matches_similar = embeddings.unite_similar(threshold=0.5)

# We can unite strings with each string's most similar target string
# This method requires a set of target strings which will be matched to our embedded strings
sim_matches_nearest = embeddings.unite_nearest(target_strings=corp_matches.strings(),threshold=0.5)

# We can also manipulate the embeddings by slicing like so
first_embedding = embeddings[0:1]
print("Embedding shape: ", first_embedding.V.shape)

# Lastly we can save the embeddings for later use
embeddings.save("path-to-save-embeddings.bin")

Embedding strings: 100%|██████████| 48/48 [00:03<00:00, 13.49it/s]


Embedding shape:  torch.Size([1, 64])


In [30]:
sim_matches_nearest.to_df()

                    string  count                    group
0                 ABC Inc.      2                 ABC Inc.
1      A.B.C. INCORPORATED      1      A.B.C. INCORPORATED
2                  AO Tech      1                  AO Tech
3          AlphaOmega Tech      1          AlphaOmega Tech
4  AlphaOmega Technologies      1  AlphaOmega Technologies
5              CRYSTALPEAK      1              CRYSTALPEAK
6          CrystalPeak Co.      1          CrystalPeak Co.
7  CrystalPeak Enterprises      1  CrystalPeak Enterprises
8           EcoGreen Corp.      1           EcoGreen Corp.
9           EcoGreen Group      1           EcoGreen Group


With a trained model we can run some tests

In [31]:
# We can test the similarity model with a single threshold
test_scores = sim.test(matches, threshold=0.5)
pd.DataFrame([test_scores])

Embedding strings: 100%|██████████| 48/48 [00:03<00:00, 12.03it/s]


   TP  FP    TN  FN  coverage  accuracy  precision  recall  F1
0   0  21  1152   2       1.0         0          0       0   0


In [32]:
# Or we can also run a test over multiple thresholds to find the optimal one
test_scores = sim.test(matches, threshold=np.linspace(0,1,11))
pd.DataFrame(test_scores)

Embedding strings: 100%|██████████| 48/48 [00:04<00:00, 11.27it/s]


   TP    FP    TN  FN  coverage  accuracy  precision  recall        F1  threshold
0   2  1173     0   0       1.0  0.001702   0.001702     1.0  0.003398        0.0
1   2    49  1124   0       1.0  0.958298   0.039216     1.0  0.075472        0.1
2   2    44  1129   0       1.0  0.962553   0.043478     1.0  0.083333        0.2
3   0    43  1130   2       1.0       0.0        0.0     0.0       0.0        0.3
4   0    33  1140   2       1.0       0.0        0.0     0.0       0.0        0.4
5   0    21  1152   2       1.0       0.0        0.0     0.0       0.0        0.5
6   0    16  1157   2       1.0       0.0        0.0     0.0       0.0        0.6
7   0    12  1161   2       1.0       0.0        0.0     0.0       0.0        0.7
8   0    11  1162   2       1.0       0.0        0.0     0.0       0.0        0.8
9   0     2  1171   2       1.0       0.0        0.0     0.0       0.0        0.9
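
The precision, recall and F1 columns follow the usual definitions in terms of the confusion counts. As a check, the sketch below recomputes them from the TP, FP and FN values in the table above and picks the threshold with the highest F1. The helper name `prf` is hypothetical.

```python
def prf(tp, fp, fn):
    # Standard definitions, with 0.0 returned when a denominator is zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# (threshold, TP, FP, FN) copied from the rows of the table
rows = [
    (0.0, 2, 1173, 0), (0.1, 2, 49, 0), (0.2, 2, 44, 0), (0.3, 0, 43, 2),
    (0.4, 0, 33, 2), (0.5, 0, 21, 2), (0.6, 0, 16, 2), (0.7, 0, 12, 2),
    (0.8, 0, 11, 2), (0.9, 0, 2, 2),
]

scored = [(t, *prf(tp, fp, fn)) for t, tp, fp, fn in rows]
for t, p, r, f1 in scored:
    print(f'threshold={t:.1f} precision={p:.6f} recall={r:.1f} F1={f1:.6f}')

best_threshold = max(scored, key=lambda row: row[3])[0]
print('best threshold by F1:', best_threshold)  # 0.2
```

On this toy data there are only two true matching pairs, so the scores are dominated by false positives and should not be read as a measure of the model's quality.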
