Welcome to the PyMinHash documentation

MinHashing is an efficient way of finding similar records in a dataset based on Jaccard similarity. PyMinHash implements minhashing for Pandas dataframes. See the instructions below or the example notebook to get started.
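For readers new to the term, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The sketch below computes it exactly for two strings, assuming each record is split into lowercase whitespace-separated words. The function name `jaccard` and this tokenization are illustrative assumptions, not part of PyMinHash.

```python
def jaccard(a: str, b: str) -> float:
    """Exact Jaccard similarity between the word sets of two strings."""
    # Assumption: records are tokenized into lowercase whitespace-separated words.
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        # Two empty records share nothing; treat them as dissimilar by convention.
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# "john" and "smith" are shared; the union is {"john", "a", "smith"}.
print(jaccard("john smith", "john a smith"))
```

Computing this exactly for every pair of rows grows quadratically with the number of records, which is the cost that minhashing is meant to avoid.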

Installation

Install directly from PyPI:

pip install pyminhash

or using conda-forge:

conda install -c conda-forge pyminhash

Usage

Apply record matching to the name column of your Pandas dataframe df as follows:

from pyminhash import MinHash

myHasher = MinHash(n_hash_tables=10)
myHasher.fit_predict(df, 'name')

This will return the row pairs from df that have non-zero Jaccard similarity.
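To give a feel for how minhashing approximates Jaccard similarity, here is a minimal, self-contained sketch in plain Python. It is not PyMinHash's implementation: the helper names (`minhash_signature`, `estimate_jaccard`), the use of seeded MD5 hashes, and the whitespace tokenization are all assumptions chosen for illustration.

```python
import hashlib
from typing import List


def _hash(token: str, seed: int) -> int:
    """Deterministic 64-bit hash of a token; the seed selects the hash function."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text: str, n_hashes: int = 128) -> List[int]:
    """For each seeded hash function, keep the minimum hash over the tokens."""
    # Assumption: text is non-empty and tokenized on whitespace.
    tokens = set(text.lower().split())
    return [min(_hash(t, seed) for t in tokens) for seed in range(n_hashes)]


def estimate_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


sig_1 = minhash_signature("john a smith")
sig_2 = minhash_signature("john smith")
print(estimate_jaccard(sig_1, sig_2))  # an estimate of the exact value 2/3
```

Two records agree at a given signature position with probability equal to their Jaccard similarity, so more hash functions give a tighter estimate at the cost of more computation.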

  • installation
  • Tutorial.ipynb
  • api/modules

Indices and tables

  • genindex
  • modindex
  • search