fuzzyjoin

Join two tables by a fuzzy comparison of text columns.

Features

Command line utility to quickly join CSV files.
Ngram blocking to reduce the total number of comparisons.
Pure python levenshtein edit distance using pylev.
Fast levenshtein edit distance using editdistance.
License: MIT

Installation

Pure python: pip install fuzzyjoin
Optimized: pip install fuzzyjoin[fast]

Description

The goal of this package is to provide a quick and convenient way to join two tables on a pair of text columns, which often contain variations of names for the same entity. fuzzyjoin satisfies the simple and common case of joining by a single column from each table for datasets in the thousands of records.

For a more sophisticated and comprehensive treatment of the topic that will allow you to join records using multiple fields, see the packages below:

CLI Help

\> fuzzyjoin --help

Usage: fuzzyjoin_cli.py [OPTIONS] LEFT_CSV RIGHT_CSV

  Inner join <left_csv> and <right_csv> by a fuzzy comparison of
  <left_field> and <right_field>.

Options:
  -f, --fields TEXT...   <left_field> <right_field>  [required]
  -t, --threshold FLOAT  Only return matches above this score.  [default: 0.7]
  -o, --output TEXT      File to write the matches to.
  --multiples TEXT       File for left IDs with multiple matches.
  --exclude TEXT         Function used to exclude records. See:
                         <fuzzyjoin.compare.default_exclude>
  --collate TEXT         Function used to collate <fields>. See:
                         <fuzzyjoin.collate.default_collate>
  --compare TEXT         Function used to compare records. See:
                         <fuzzyjoin.compare.default_compare>
  --numbers-exact        Numbers and order must match exactly.
  --numbers-permutation  Numbers must match but may be out of order.
  --numbers-subset       Numbers must be a subset.
  --ngram-size INTEGER   The ngram size to create blocks with.  [default: 3]
  --no-progress          Do not show comparison progress.
  --debug                Exit to PDB on exception.
  --yes                  Yes to all prompts.
  --help                 Show this message and exit.

CLI Usage

# Use field `name` from left.csv and field `full_name` from right.csv
\> fuzzyjoin --fields name full_name left.csv right.csv
# Export rows with multiple matches from left.csv to a separate file.
\> fuzzyjoin --multiples multiples.csv --fields name full_name left.csv right.csv
# Increase the ngram size, reducing execution time but removing tokens small than `ngram_size`
# as possible matches.
\> fuzzyjoin --ngram-size 5 --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields and in the same order.
\> fuzzyjoin --numbers-exact --fields name full_name left.csv right.csv
# Ensure any numbers that appear are in both fields but may be in a different order.
\> fuzzyjoin --numbers-permutation --fields name full_name left.csv right.csv
# Ensure numbers that appear in one field are at least a subset of the other.
\> fuzzyjoin --numbers-subset --fields name full_name left.csv right.csv
# Use importable function `package.func` from PATH as the comparison function
# instead of `fuzzyjoin.compare.default_compare`.
\> fuzzyjoin --compare package.func --fields name full_name left.csv right.csv

API Usage

from fuzzyjoin import io

# Specify which field to use from the left and right CSV files.
options = Options(
    field_1='name',
    field_2='full_name'
)
matches = io.inner_join_csv_files('left.csv', 'right.csv', options)
io.write_matches(matches, output_file='matches.csv')

TODO

Test transformation and exclude functions.
Implement left join and full join.
Check that the ID is actually unique.
Add documentation.
Option to rename headers and disambiguate duplicate header names.

History

0.5.2 (2019-04-15)

Fix API Usage history docs in README.

0.5.1 (2019-04-15)

Update usage docs.
Rename key_* params to field_* for consistency.

0.5.0 (2019-04-15)

Removed ID field requirement.

0.4.1 (2019-04-12)

Options parameter.

0.3.4 (2019-04-11)

Fix function defaults.
Minor optimizations.
Additional CLI parameters.

0.3.3 (2019-04-10)

Cleanup checks.

0.3.2 (2019-04-10)

Include basic installation instructions.

0.3.1 (2019-04-10)

Minor README updates.

0.3.0 (2019-04-10)

Use editdistance if available, otherwise fallback to pylev.
Report progress by default.
Number comparison options.
Renamed get_multiples to filter_multiples.

0.2.1 (2019-04-10)

Additional docs and tests.

0.2.0 (2019-04-09)

Write multiples matches to a separate file.
Added types and docstrings.

0.1.2 (2019-04-09)

Duplicate release of 0.1.1

0.1.1 (2019-04-09)

First release on PyPI.

Name		Name	Last commit message	Last commit date
Latest commit History 77 Commits
data		data
fuzzyjoin		fuzzyjoin
requirements		requirements
tests		tests
.flake8		.flake8
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
VERSION		VERSION
fuzzyjoin_cli.py		fuzzyjoin_cli.py
pytest.ini		pytest.ini
requirements.txt		requirements.txt
setup.cfg		setup.cfg
setup.py		setup.py
tasks.py		tasks.py

License

chancyk/fuzzyjoin

Folders and files

Latest commit

History

Repository files navigation

fuzzyjoin

Features

Installation

Description

CLI Help

CLI Usage

API Usage

TODO

History

0.5.2 (2019-04-15)

0.5.1 (2019-04-15)

0.5.0 (2019-04-15)

0.4.1 (2019-04-12)

0.3.4 (2019-04-11)

0.3.3 (2019-04-10)

0.3.2 (2019-04-10)

0.3.1 (2019-04-10)

0.3.0 (2019-04-10)

0.2.1 (2019-04-10)

0.2.0 (2019-04-09)

0.1.2 (2019-04-09)

0.1.1 (2019-04-09)

About

Resources

License

Stars

Watchers

Forks

Languages