Welcome to the PyMinHash documentation

MinHashing is an efficient way to find similar records in a dataset based on Jaccard similarity. PyMinHash implements minhashing for Pandas dataframes. See the instructions below or the example notebook to get started.
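To make the idea concrete, here is a minimal sketch of how a MinHash signature approximates Jaccard similarity. It is not PyMinHash's implementation: the word-level tokenization, the seeded `blake2b` hashing, and the helper names `tokens`, `signature` and `estimate` are all assumptions made for illustration.

```python
import hashlib


def tokens(text):
    """Split a string into a set of lowercase word tokens (assumed tokenization)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(token_set, n_hashes=128):
    """MinHash signature: the minimum hash value per seeded hash function.

    Assumes token_set is non-empty.
    """
    return [min(_hash(t, seed) for t in token_set) for seed in range(n_hashes)]


def estimate(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature positions."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = tokens("acme trading company ltd")
b = tokens("acme trading co ltd")

exact = jaccard(a, b)                           # 3 shared tokens out of 5 distinct ones
approx = estimate(signature(a), signature(b))   # close to `exact`, not identical
```

The estimate gets tighter as `n_hashes` grows, at the cost of more hashing per record. Comparing short signatures instead of full token sets is what makes the approach cheap on large datasets.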

Installation

Install directly from PyPI:

pip install pyminhash

or using conda-forge:

conda install -c conda-forge pyminhash

Usage

Apply record matching to the column name of your Pandas dataframe df as follows:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
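For intuition about which row pairs qualify, the sketch below finds them by exhaustive comparison on a tiny list of names. This is a brute-force illustration, not what PyMinHash does internally and not the output format of `fit_predict`; the sample names, the word-level tokenization and the helper `nonzero_pairs` are assumptions.

```python
from itertools import combinations

# Hypothetical sample records standing in for the 'name' column.
names = [
    "acme trading company",
    "acme trading co",
    "globex corporation",
    "initech",
]


def word_set(text):
    """Lowercase word tokens of a record (assumed tokenization)."""
    return set(text.lower().split())


def nonzero_pairs(records):
    """Return (i, j, similarity) for every pair of rows sharing at least one token."""
    sets = [word_set(r) for r in records]
    pairs = []
    for i, j in combinations(range(len(sets)), 2):
        union = sets[i] | sets[j]
        if not union:
            continue  # two empty records have no defined similarity
        sim = len(sets[i] & sets[j]) / len(union)
        if sim > 0:
            pairs.append((i, j, sim))
    return pairs


pairs = nonzero_pairs(names)
# Rows 0 and 1 share "acme" and "trading" out of 4 distinct tokens.
```

Exhaustive comparison grows quadratically with the number of rows, which is the cost minhashing is meant to avoid.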

installation
Tutorial.ipynb
api/modules

Indices and tables

genindex
modindex
search
