firmmatchr is an R package designed to match messy, user-generated company names (e.g., from surveys, resumes, or web scraping) against a clean, authoritative dictionary (e.g., Orbis, Commercial Registers).
Unlike simple fuzzy matching, firmmatchr uses a cascading pipeline of four distinct algorithms to maximize recall while maintaining high precision. It also includes an optional LLM-based validation step (via Azure OpenAI) to verify doubtful matches.
- Cascading Architecture: Matches are found in stages. Easy matches are caught early; harder matches are passed to more complex engines.
- German-Specific Normalization: Handles Umlauts, legal forms (GmbH, AG, GmbH & Co. KG), and noise words (Standort, Germany); see the illustrative sketch below.
- High Performance: Built on `data.table` and `RSQLite` (FTS5) for speed.
- LLM Validator: An integrated module to double-check "fuzzy" matches using GPT-4 models.
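As a rough illustration of what that normalization involves, here is a minimal sketch. It is not the package's internal implementation; the helper name and the exact rule set are hypothetical:

```r
# Hypothetical demo of German-style name normalization (not firmmatchr's actual code)
normalize_name_demo <- function(x) {
  x <- tolower(x)
  # Transliterate Umlauts and sharp s
  for (sub in list(c("ä", "ae"), c("ö", "oe"), c("ü", "ue"), c("ß", "ss"))) {
    x <- gsub(sub[1], sub[2], x, fixed = TRUE)
  }
  # Strip common legal forms and noise words (simplified rule set)
  x <- gsub("\\bgmbh & co\\. kg\\b|\\bgmbh\\b|\\bag\\b|\\bkg\\b|\\bse\\b", "", x)
  x <- gsub("\\bstandort\\b|\\bgermany\\b|\\bdeutschland\\b", "", x)
  # Drop punctuation and squeeze whitespace
  x <- gsub("[[:punct:]]", " ", x)
  trimws(gsub("\\s+", " ", x))
}

normalize_name_demo("Müller GmbH & Co. KG (Standort München)")
#> [1] "mueller muenchen"
```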
You can install the package directly from GitHub (once the repository is pushed) or from a local copy:
```r
# Install devtools if you haven't already
install.packages("devtools")

# Install from GitHub
devtools::install_github("swediot/firmmatchr")

# Or install from a local copy
devtools::install("path/to/firmmatchr")
```

The core function `match_companies()` runs the full cascade.
```r
library(firmmatchr)
library(data.table)

# 1. Load data
# 'queries':    your messy data (e.g. from a survey)
# 'dictionary': your clean reference list (e.g. Orbis)
queries    <- fread("data/survey_results.csv")
dictionary <- fread("data/orbis_germany.csv")

# 2. Run the pipeline
results <- match_companies(
  queries    = queries,
  dictionary = dictionary,

  # Column mapping
  query_col     = "employer_name",  # Name column in 'queries'
  unique_id_col = "response_id",    # ID column in 'queries'
  dict_col      = "company_name",   # Name column in 'dictionary'
  dict_id_col   = "orbis_id",       # ID column in 'dictionary'

  # Thresholds (tunable)
  threshold_jw     = 0.85,  # Strictness of fuzzy matching (0-1)
  threshold_zoomer = 0.4,   # Looseness of blocking (lower = more candidates)
  threshold_rarity = 1.0    # Threshold for TF-IDF style matching
)

# 3. View results
print(results)
# Returns: query_id, dict_id, match_type (Perfect, Fuzzy, Rarity, etc.)
```

If you have matches that are fuzzy but not certain, you can use the LLM module to have an AI act as a human verifier.
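For example, you might send only the less certain stages to the validator. This assumes `results` comes back as a data.table and that "Perfect" labels the exact-match stage, as in the output note above; check `unique(results$match_type)` for the labels your run actually produces:

```r
# Send only non-exact matches to the LLM validator to save tokens
uncertain <- results[match_type != "Perfect"]
```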
Prerequisites: Set your Azure OpenAI credentials in your `.Renviron` file:

```
AZURE_ENDPOINT="https://your-resource.openai.azure.com"
AZURE_API_KEY="your-secret-key"
AZURE_DEPLOYMENT="gpt-4-mini"
```
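Before kicking off a long validation run, you can confirm the credentials are visible to your R session using base R only:

```r
# Check that all three Azure variables are set in the current session
creds <- Sys.getenv(c("AZURE_ENDPOINT", "AZURE_API_KEY", "AZURE_DEPLOYMENT"))
if (any(!nzchar(creds))) {
  stop("Missing in .Renviron: ", paste(names(creds)[!nzchar(creds)], collapse = ", "))
}
```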
Running Validation:

```r
# 1. Join names back onto the results (the LLM needs text, not IDs!)
results_full <- merge(results, queries, by.x = "query_id", by.y = "response_id")
results_full <- merge(results_full, dictionary, by.x = "dict_id", by.y = "orbis_id")

# 2. Run validation
validated <- validate_matches_llm(
  data           = results_full,
  query_name_col = "employer_name",  # The messy name
  dict_name_col  = "company_name",   # The official dictionary name
  output_dir     = "llm_cache",      # Saves progress in case of a crash
  batch_size     = 50
)

# Output includes 'LLM_decision' (CORRECT/INCORRECT) and 'LLM_reason'
```
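A typical follow-up, assuming `validated` is a data.table and the decision labels are exactly as documented above, is to split confirmed from rejected matches:

```r
# Keep LLM-confirmed matches; set the rest aside for manual review
confirmed <- validated[LLM_decision == "CORRECT"]
to_review <- validated[LLM_decision != "CORRECT"]
```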
The `match_companies` function processes unmatched records through four engines, in order (a short illustrative sketch of the fuzzy and rarity ideas follows the list):

1. Exact Match
   - Joins on the normalized string (lowercase, specialized cleaning).
   - Result: Highest confidence.

2. Fuzzy Zoomer (Blocking)
   - Uses `zoomerjoin` (LSH/MinHash) to find candidates even with spelling errors.
   - Candidates are refined by Jaro-Winkler distance.
   - Best for: Typos like "Daimler" vs "Diamler".

3. FTS5 (Search Engine)
   - Creates an in-memory SQLite database with Full-Text Search 5.
   - Allows token-based matching (e.g., finding "Bosch" inside "Robert Bosch GmbH").
   - Best for: Missing words or reordered tokens.

4. Rarity Weighted (TF-IDF)
   - Calculates the "rareness" of every word in the dictionary.
   - Matches are scored based on the sum of normalized weights of shared rare tokens.
   - Best for: Companies with unique names that are heavily misspelled or where word order varies significantly.
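To make the fuzzy and rarity ideas concrete, here is a small self-contained sketch. It does not call firmmatchr internals, the scoring is deliberately simplified (the rarity weights are not normalized), and the `stringdist` package is used only because it provides Jaro-Winkler similarity:

```r
library(stringdist)   # provides Jaro-Winkler similarity

dict  <- c("Robert Bosch GmbH", "Daimler AG", "Siemens AG")
query <- "Diamler"

# Fuzzy idea (engine 2): Jaro-Winkler similarity against every dictionary entry
jw <- stringsim(tolower(query), tolower(dict), method = "jw", p = 0.1)
dict[which.max(jw)]
#> [1] "Daimler AG"

# Rarity idea (engine 4): weight shared tokens by how rare they are overall
tokens   <- tolower(unlist(strsplit(dict, "\\s+")))
doc_freq <- table(tokens)                                 # token counts across the dictionary
rarity   <- setNames(log(length(dict) / as.numeric(doc_freq)), names(doc_freq))

query_tokens  <- tolower(strsplit("Bosch Robert", "\\s+")[[1]])
score_against <- function(entry) {
  shared <- intersect(query_tokens, tolower(strsplit(entry, "\\s+")[[1]]))
  sum(rarity[shared])   # reordered or missing tokens still contribute
}
sapply(dict, score_against)
# "Robert Bosch GmbH" scores ~2.2; the other entries score 0
```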
See `0_run_package.R` for example usage with synthetic data.
MIT License. See LICENSE for details.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.