# Demo of Company Name Matcher

This notebook demonstrates the basic use cases of Company Name Matcher and compares it with another matching technique, RapidFuzz.

# Import and Setup

In [None]:
!pip install RapidFuzz pandas

In [2]:
import pandas as pd # Data I/O 
import re # Preprocess
from rapidfuzz import fuzz # matcher 1
from company_name_matcher import CompanyNameMatcher # matcher 2

# Initialize the matchers

## RapidFuzz

In [3]:
def rapid_fuzz_matcher(x1, x2):
    return fuzz.ratio(x1, x2) / 100
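
RapidFuzz documents `fuzz.ratio` as a normalized Indel similarity on a 0 to 100 scale, which is why the wrapper above divides by 100. The following is a minimal pure-Python sketch of that formula, meant only to show what the score measures; it is not RapidFuzz's implementation, and `lcs_length` and `indel_ratio` are hypothetical helper names introduced here.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_ratio(a: str, b: str) -> float:
    """Normalized Indel similarity in [0, 1]: 2 * LCS / (len(a) + len(b))."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return 2 * lcs_length(a, b) / total


for name in ["Apple Inc", "Google", "Apple Computer Inc"]:
    print(f"Apple vs {name}: {indel_ratio('Apple', name):.2f}")
```

For these inputs the sketch yields 0.71, 0.36, and 0.43, the same RapidFuzz values reported in the pair-wise section. The comparison is case-sensitive and purely character-based, so names written in different scripts share no characters and score 0.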

## Matcher with a default model

In [4]:
# Simple function to clean the names; stop words (e.g., "limited", "inc") could also be removed if needed.
def preprocess_name1(name):
    return re.sub(r'[^a-zA-Z0-9\s]', '', name.lower()).strip()
    
default_matcher = CompanyNameMatcher(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", 
    preprocess_fn = preprocess_name1
)
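
One property of the cleaning pattern is worth checking before relying on it for multilingual input: `[^a-zA-Z0-9\s]` removes every character outside ASCII letters, digits, and whitespace, so accented letters are dropped and a name written entirely in Chinese becomes an empty string. The sketch below reproduces the pattern and compares it with a Unicode-aware variant. `preprocess_unicode` is a hypothetical alternative, not part of the library, and whether the stripping affects the default matcher's scores is not established here.

```python
import re


def preprocess_ascii(name: str) -> str:
    # Same pattern as preprocess_name1: keep only ASCII letters, digits, whitespace.
    return re.sub(r"[^a-zA-Z0-9\s]", "", name.lower()).strip()


def preprocess_unicode(name: str) -> str:
    # Hypothetical alternative: for str patterns in Python 3, \w matches
    # Unicode letters and digits (and "_"), so non-Latin names survive cleaning.
    return re.sub(r"[^\w\s]", "", name.lower()).strip()


for raw in ["Amy-Mary, llc", "alfred Jäggi AG", "苹果公司"]:
    print(repr(preprocess_ascii(raw)), "|", repr(preprocess_unicode(raw)))
```

Both versions strip punctuation the same way; they differ only in how they treat non-ASCII letters.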

## Matcher with our fine-tuned model

In [5]:
def preprocess_name2(name):
    return "$" + name.strip() + "$"
    
finetuned_matcher = CompanyNameMatcher(
    "models/multilingual-MiniLM-small-v1", 
    preprocess_fn = preprocess_name2
)

# 1. Pair-wise matching

In [6]:
basic_companies = [
    "Microsoft Corporation",
    "Apple Inc",
    "Google",
    "Apple Computer Inc",
    "苹果公司",          # Apple Inc. in Chinese
]
for company in basic_companies:
    similarity1 = rapid_fuzz_matcher("Apple", company)
    similarity2 = default_matcher.compare_companies("Apple", company)
    similarity3 = finetuned_matcher.compare_companies("Apple", company)
    print(f"Apple vs {company}")
    print("-" * 50)
    print(f"Rapid Fuzz: {similarity1: .2f}")
    print(f"Default Matcher: {similarity2: .2f}")
    print(f"Finetuned Matcher: {similarity3: .2f}")
    print("\n")

Apple vs Microsoft Corporation
--------------------------------------------------
Rapid Fuzz:  0.08
Default Matcher:  0.34
Finetuned Matcher:  0.15


Apple vs Apple Inc
--------------------------------------------------
Rapid Fuzz:  0.71
Default Matcher:  0.91
Finetuned Matcher:  0.74


Apple vs Google
--------------------------------------------------
Rapid Fuzz:  0.36
Default Matcher:  0.32
Finetuned Matcher:  0.42


Apple vs Apple Computer Inc
--------------------------------------------------
Rapid Fuzz:  0.43
Default Matcher:  0.84
Finetuned Matcher:  0.61


Apple vs 苹果公司
--------------------------------------------------
Rapid Fuzz:  0.00
Default Matcher:  0.31
Finetuned Matcher:  0.98




# 2. Bulk matching

In [7]:
data = pd.read_csv("tests/test_data.csv")
data

| Unnamed: 0 | Name_x | Name_y | Targets |
|---|---|---|---|
| 0 | alfred Jäggi AG | A. Jaggi Ag | 1 |
| 1 | Amy Mary LLC | Amy-Mary, llc | 1 |
| 2 | aNM INTERNATIONAL Co. | Anm International | 1 |
| 3 | Antalis Verpackungen GmbH | Antalis Verpackungen Gmbh | 1 |
| 4 | Apofruit Italia - Pievesestina | Apofruit Italia Soc.Coop.Agr. | 1 |
| ... | ... | ... | ... |
| 195 | Qingdao Taibo Trading Co., Ltd. | QINGDAO TYQ TRADING CO LTD | 0 |
| 196 | Guangzhou Wenzhao LCD Technology Co., Ltd. | Guangzhou Weidi Technology Co., Ltd. | 0 |
| 197 | Dongguan Qisheng Optoelectronics Co., Ltd | Dongguan Meisen Electronics Co., Ltd | 0 |
| 198 | Wanhua Chemical (Yantai) Sales Co.，LTD. | Hanhua Chemical Ningbo Co., Ltd | 0 |
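
The `Targets` column labels each pair as a match (1) or non-match (0), so any scorer that returns a value in [0, 1] can be checked against it. This notebook does not run such an evaluation; the sketch below only illustrates the idea on a few rows copied from the table, using `difflib.SequenceMatcher` as a stand-in scorer. `standin_scorer` and `accuracy_at_threshold` are hypothetical names.

```python
from difflib import SequenceMatcher

# A few labeled rows copied from the table: (Name_x, Name_y, Targets).
labeled_pairs = [
    ("Amy Mary LLC", "Amy-Mary, llc", 1),
    ("aNM INTERNATIONAL Co.", "Anm International", 1),
    ("Antalis Verpackungen GmbH", "Antalis Verpackungen Gmbh", 1),
    ("Qingdao Taibo Trading Co., Ltd.", "QINGDAO TYQ TRADING CO LTD", 0),
    ("Dongguan Qisheng Optoelectronics Co., Ltd", "Dongguan Meisen Electronics Co., Ltd", 0),
]


def standin_scorer(a, b):
    # Stand-in for any matcher returning a score in [0, 1]; in the notebook,
    # rapid_fuzz_matcher or finetuned_matcher.compare_companies could be used.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def accuracy_at_threshold(pairs, scorer, threshold):
    """Share of pairs where (score >= threshold) agrees with the 0/1 label."""
    correct = sum(int(scorer(a, b) >= threshold) == label for a, b, label in pairs)
    return correct / len(pairs)


for threshold in (0.5, 0.7, 0.9):
    acc = accuracy_at_threshold(labeled_pairs, standin_scorer, threshold)
    print(f"threshold={threshold:.1f}  accuracy={acc:.2f}")
```

Sweeping the threshold this way shows how sensitive a scorer is to the cut-off, which matters because the search calls later in the notebook take a `threshold` argument.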


## Build, Load, Expand Index

We only need to build the index once.

In [8]:
# Further cleaning could be applied here
companies_to_match = data["Name_y"].to_list()

In [9]:
finetuned_matcher.build_index(
    companies_to_match, 
    n_clusters = 20, 
    save_dir="index_files"
)

Next time, we can simply load the saved index files.

In [10]:
finetuned_matcher.load_index(load_dir="index_files")

We can optionally expand the index without rebuilding it from scratch.

In [11]:
new_companies = [
    "Palantir Technologies",
    "Dell Technologies"
]

In [12]:
finetuned_matcher.expand_index(
    new_companies, 
    save_dir="index_files" # Update the existing index files
)

## Exact Search

In [13]:
%%time

print("Exact Search Results:")
exact_matches = finetuned_matcher.find_matches(
    "Palantir",
    threshold=0.7,
    use_approx=False
)
print(f"Exact matches: {exact_matches}\n")

Exact Search Results:
Exact matches: [('Palantir Technologies', 0.78598243)]

CPU times: user 42.3 ms, sys: 8.78 ms, total: 51.1 ms
Wall time: 51.8 ms


## Approximate Search 

In [14]:
%%time

print("Approximate Search Results:")
approx_matches = finetuned_matcher.find_matches(
    "Palantir",
    threshold=0.7,
    k=1,
    use_approx=True
)
print(f"Approximate matches: {approx_matches}\n")

Approximate Search Results:
Approximate matches: [('Palantir Technologies', 0.7859824)]

CPU times: user 54.2 ms, sys: 58.2 ms, total: 112 ms
Wall time: 44.7 ms
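
The notebook does not describe how `find_matches` works internally. Given the `n_clusters` argument to `build_index`, one plausible reading is that approximate search scores only candidates from the cluster nearest to the query, while exact search scores every indexed name. The sketch below illustrates that general trade-off with toy 2-D vectors; the vectors, centroids, and function names are all hypothetical and the code makes no claim about the library's actual algorithm.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


# Toy 2-D "embeddings"; real ones would come from the model.
index = {
    "Palantir Technologies": (0.9, 0.1),
    "Dell Technologies": (0.6, 0.8),
    "Antalis Verpackungen Gmbh": (0.1, 0.9),
    "Anm International": (0.2, 1.0),
}
# Hypothetical precomputed cluster centroids (e.g., from k-means).
centroids = [(1.0, 0.0), (0.0, 1.0)]


def nearest_centroid(vec):
    return max(range(len(centroids)), key=lambda i: cosine(vec, centroids[i]))


# Group indexed names by their nearest centroid once, at build time.
clusters = {}
for name, vec in index.items():
    clusters.setdefault(nearest_centroid(vec), []).append(name)


def exact_search(query_vec, threshold):
    # Score every indexed vector and keep those at or above the threshold.
    hits = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])


def approx_search(query_vec, threshold, k):
    # Score only the members of the query's nearest cluster, then take the top k.
    members = clusters.get(nearest_centroid(query_vec), [])
    hits = [(name, cosine(query_vec, index[name])) for name in members]
    return sorted((h for h in hits if h[1] >= threshold), key=lambda h: -h[1])[:k]


query = (1.0, 0.2)
print(exact_search(query, threshold=0.7))
print(approx_search(query, threshold=0.7, k=2))
```

In this toy setup the exact search returns two names above the threshold, while the approximate search returns only the one in the query's cluster even with `k=2`. That is the usual cost of pruning candidates: fewer comparisons, with a chance of missing matches that sit in a neighbouring cluster.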


# 3. Working with Embeddings

In [15]:
print("3. Working with Embeddings")
print("-" * 40)
# Single company embedding
single_embedding = finetuned_matcher.get_embedding("Apple Inc")
print(f"Single company embedding shape: {single_embedding.shape}")

3. Working with Embeddings
----------------------------------------
Single company embedding shape: (384,)
