NAMA: The NAme MAtching tool

Fast, flexible name matching for large datasets

Installation

Recommended install via pip

  1. (Optional) Create and activate a virtual environment.
  2. Install nama: pip install git+https://github.com/bradhackinen/nama.git@master

Install from source with conda

  1. Install Anaconda

  2. Clone nama

git clone https://github.com/bradhackinen/nama.git
  3. Enter the conda directory, which contains the conda environment file, with
cd conda
  4. Create a new conda environment with
conda create --name <env-name>
  5. Activate the new environment with
conda activate <env-name>
  6. Download and install pytorch-mutex
conda install pytorch-mutex-1.0-cuda.tar.bz2
  7. Download and install pytorch
conda install pytorch-1.10.2-py3.9_cuda11.3_cudnn8.2.0_0.tar.bz2
  8. Install the rest of the dependencies with
conda install --file conda_env.txt
  9. Exit the conda directory with
cd ..
  10. Install the package with
pip install .

Installing from source with pip

  1. Clone nama: git clone https://github.com/bradhackinen/nama.git
  2. Create and activate a virtual environment: python -m venv nama_env && source nama_env/bin/activate
  3. Install dependencies: pip install -r requirements.txt
  4. Install the package: pip install ./nama
  • From the project root directory: pip install .
  • From another directory: pip install /path-to-project-root
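
After any of the install routes above, one quick sanity check is to ask Python whether the package can be found in the active environment. This is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper for illustration and is not part of nama.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package_name) is not None


# Reports whether nama is visible from the interpreter currently in use.
print("nama installed:", is_installed("nama"))
```

If this prints False inside the environment you just set up, the install most likely went into a different interpreter or environment.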

Demo

Usage

Using the Matcher()

Importing data

To import data into the matcher we can either pass nama a pandas DataFrame with

import nama

training_data = nama.from_df(
    df,
    group_column='group_column',
    string_column='string_column')
print(training_data)

or we can pass nama a .csv file directly

import nama

testing_data = nama.read_csv(
    'path-to-data',
    match_format=match_format,
    group_column=group_column,
    string_column=string_column)
print(testing_data)

See from_df and read_csv for parameters and function details.
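
The snippets above leave `df` and the CSV file undefined. As a rough sketch of what such input might look like, the example below writes a small file with one row per name string and a column labelling the group it belongs to. This layout, the column names `group` and `name`, and the sample rows are all assumptions made for illustration; the source does not specify the expected format or the valid values of `match_format`.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample data: each row is one name string plus its group label.
rows = [
    {"group": "acme", "name": "ACME Corp."},
    {"group": "acme", "name": "Acme Corporation"},
    {"group": "globex", "name": "Globex Inc"},
    {"group": "globex", "name": "GLOBEX INCORPORATED"},
]

# Write the rows to a temporary CSV file.
csv_path = Path(tempfile.mkdtemp()) / "names.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["group", "name"])
    writer.writeheader()
    writer.writerows(rows)

print(csv_path.read_text(encoding="utf-8"))

# With pandas and nama installed, the same rows could feed the calls above:
#   df = pandas.DataFrame(rows)
#   training_data = nama.from_df(df, group_column="group", string_column="name")
```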

Using the EmbeddingSimilarityModel()

Initialization

We can initialize a model like so

from nama.embedding_similarity import EmbeddingSimilarityModel

sim = EmbeddingSimilarityModel()

When using a GPU, we need to move the model to the GPU device with

sim.to(gpu_device)
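
The call above leaves `gpu_device` undefined. A minimal sketch of one way to choose it, assuming the model accepts the usual PyTorch device names such as "cuda" and "cpu" (the source does not confirm this); `pick_device` is a hypothetical helper, not part of nama.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


gpu_device = pick_device()
print(gpu_device)

# Assumed usage, mirroring the call above once `sim` exists:
#   sim.to(gpu_device)
```

Falling back to "cpu" keeps the same script usable on machines without a GPU.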

Training

To train a model, we simply need to specify the training parameters and the training data

train_kwargs = {
    'max_epochs': 1,
    'warmup_frac': 0.2,
    'transformer_lr': 1e-5,
    'score_lr': 30,
    'use_counts': False,
    'batch_size': 8,
    'early_stopping': False,
}

history_df, val_df = sim.train(training_data, verbose=True, **train_kwargs)

We can also save the trained model for later

sim.save("path-to-save-model")
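
The source does not explain how `warmup_frac` is applied during training. One common convention, assumed here purely for illustration, is a linear learning-rate warmup over that fraction of the total training steps. The sketch below shows that schedule; `lr_at_step` is a hypothetical function and may not match what nama does internally.

```python
def lr_at_step(step: int, total_steps: int, base_lr: float, warmup_frac: float) -> float:
    """Linear warmup to base_lr over the first warmup_frac of steps, then constant."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp up so that the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


# With warmup_frac=0.2 and 100 steps, the first 20 steps ramp up to 1e-5.
for step in (0, 9, 19, 50):
    print(step, lr_at_step(step, total_steps=100, base_lr=1e-5, warmup_frac=0.2))
```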

Testing

We can use the model we trained above directly like

embeddings = sim.embed(testing_data)

Or load a previously trained model

from nama.embedding_similarity import load_similarity_model

new_sim = load_similarity_model("path-to-saved-model")
embeddings = new_sim.embed(testing_data)
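
The source does not show what `embed` returns or how embeddings are compared. Assuming each name ends up represented by a numeric vector, one standard way to score how close two names are is cosine similarity. The sketch below uses made-up vectors and a hypothetical `cosine_similarity` helper; it is not nama's own scoring code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for name embeddings.
vec_a = [0.9, 0.1, 0.0]
vec_b = [0.8, 0.2, 0.1]
vec_c = [0.0, 0.1, 0.9]

print(cosine_similarity(vec_a, vec_b))  # similar directions give a score near 1
print(cosine_similarity(vec_a, vec_c))  # dissimilar directions give a score near 0
```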

MORE TO COME
