In [None]:
# default_exp core

In [None]:
#hide
from nbdev.showdoc import *

# Cleanroom Core SDK

A feature parser, feature hasher, and client for the Proof Zero Cleanroom matching engine.

## Benefits

Generating matches across datasets is difficult. The same entity is often described by different data (for example, one person with several phone numbers or addresses). Data also arrive in different formats -- sometimes just a postal code is stored, sometimes a geo-region, sometimes a street address.

However, all of these different physical data representations refer to the same logical entities.

The Cleanroom SDK generates pseudokeys that match entities across data sets containing different physical data representations. These pseudokeys enable a match between incomplete, incorrect, and unclean data sets. The SDK also reports the strength of each pseudokey-based match as a percentage.

The SDK also acts as a general-purpose feature hashing engine.

## Workflow

To join two datasets using pseudokeys, we first `load` our data and then:

1. `tokenize` it -- decompose our data into standard formats,
2. `index` it -- recompose our tokens into cryptographic indexes (privacy-aware feature hashes),
3. `match` it -- compare indexed data sets to generate matches and measure match quality.

This is the "TIM" workflow: `tokenize`, `index`, `match`.
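
To make the three steps concrete, here is a minimal sketch of why tokenizing before hashing matters. It is a toy illustration, not the SDK's implementation: `tokenize_phone`, `index_tokens`, and `match_score` are hypothetical stand-ins written for this example, the phone numbers are invented, and plain SHA-256 from the standard library stands in for whatever hash the SDK uses.

```python
import hashlib


def tokenize_phone(raw: str) -> dict:
    """Toy tokenizer (hypothetical): keep digits only and split a 10-digit number."""
    digits = "".join(ch for ch in raw if ch.isdigit())
    national = digits[-10:]  # drop a leading country code, if present
    return {
        "area_code": national[:3],
        "line_number": national[-4:],
        "national_number": national,
    }


def index_tokens(tokens: dict) -> dict:
    """Toy indexer (hypothetical): replace every token value with its SHA-256 digest."""
    return {
        name: hashlib.sha256(value.encode("utf-8")).hexdigest()
        for name, value in tokens.items()
    }


def match_score(left: dict, right: dict) -> float:
    """Toy matcher (hypothetical): share of distinct hashes the two records have in common."""
    a, b = set(left.values()), set(right.values())
    return len(a & b) / len(a | b)


# The same number in two physical formats, plus a different number.
left = index_tokens(tokenize_phone("1 (598) 742-6794"))
right = index_tokens(tokenize_phone("598.742.6794"))
other = index_tokens(tokenize_phone("598 742 0000"))

same_entity = match_score(left, right)   # all three hashes agree
near_miss = match_score(left, other)     # only the area-code hash agrees
```

Hashing the raw strings directly would give two unrelated digests for the first two numbers. Decomposing them into standard components first is what lets the hashes line up.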

## Security and data privacy

All SDK features run locally, so data remain under the user's control. `match` optionally allows use of the Cleanroom compute cluster and matching models to enhance speed and accuracy.

Because matching operates on `IndexFrame` objects, which hold hashes rather than raw values, the underlying data remain in the control of the local system, even when using the Cleanroom compute cluster.

## Exported types

We provide several lightweight types that track progress through the TIM workflow:

* `TokenFrame` is a `pandas` DataFrame that has been tokenized.
* `IndexFrame` is a `TokenFrame` that has been hashed/indexed.
* `MatchFrame` is a match between two `IndexFrame`s.

These types help systems and analysts keep track of where a given dataset is in the tokenize, index, match (TIM) workflow.
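
These types are built with `typing.NewType`, which means they are labels for static type checkers rather than runtime classes. The short sketch below illustrates that behaviour with a made-up one-row frame. The type definitions repeat the ones this notebook exports, and the data are invented.

```python
from typing import NewType

import pandas as pd

# Same definitions as the ones exported by this notebook.
TokenFrame = NewType("TokenFrame", pd.DataFrame)
IndexFrame = NewType("IndexFrame", TokenFrame)

df = pd.DataFrame({"name": ["Kristin"]})  # invented example data

# "Casting" is a no-op at runtime: the very same object comes back.
tokens = TokenFrame(df)
indexed = IndexFrame(tokens)

same_object = tokens is df and indexed is df
still_a_dataframe = isinstance(indexed, pd.DataFrame)
```

Because no wrapping or copying occurs, every `pandas` method remains available on a `TokenFrame` or `IndexFrame`. The stage tracking is enforced only by tools such as a static type checker.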

## Loading data

We load our data into standard `pandas` DataFrames.

In [None]:
#export
import pandas as pd
def load(filename: str) -> pd.DataFrame:
    """
    Convenience function for loading a CSV file using `pandas`. The full version also supports XLSX, etc.
    """
    return pd.read_csv(filename)
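
The docstring above mentions that a fuller loader handles other formats. As a rough sketch of what that could look like (an assumption, not the SDK's actual code), the hypothetical `load_any` below dispatches on the file extension using `pandas.read_csv` and `pandas.read_excel`.

```python
from pathlib import Path

import pandas as pd


def load_any(filename: str) -> pd.DataFrame:
    """Hypothetical loader: choose a pandas reader from the file extension."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".csv":
        return pd.read_csv(filename)
    if suffix in {".xlsx", ".xls"}:
        # read_excel needs an Excel engine (for example openpyxl) to be installed.
        return pd.read_excel(filename)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

Raising on unknown extensions, rather than guessing a format, keeps a mislabelled file from being silently parsed into a malformed frame.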

### Data Loading Testing

In [None]:
data_1 = load('./data_1.csv')
assert (data_1[:1].equals(pd.DataFrame({'acct_num': 'PXCG66212484637575', 'name': 'Kristin Sanchez MD', 'address': '31417 Gina Lodge, Bradleytown, MB P7G 4N5', 'phone': '1 (598) 742-6794', 'SIN': '203 268 552', 'DOB': '1943/04/15'}, index=[0])))
data_1[:1]

Unnamed: 0,acct_num,name,address,phone,SIN,DOB
0,PXCG66212484637575,Kristin Sanchez MD,"31417 Gina Lodge, Bradleytown, MB P7G 4N5",1 (598) 742-6794,203 268 552,1943/04/15


## Tokenizing data

We apply parsers to the DataFrames we want to process in order to decompose the physical data into standard types. This lets us re-compose a hashed index later that encodes knowledge of the underlying data structure.

In [None]:
#export
import itertools
import functools
from typing import Callable, NewType

# The `TokenFrame` type is a wrapper around a `pandas` DataFrame that represents a DataFrame that has been tokenized using the `tokenize` function.
TokenFrame = NewType('TokenFrame', pd.DataFrame)

def tokenize(
    df: pd.DataFrame, schema: dict, suffix_delim: str = ""
) -> TokenFrame:
    """
    Takes a `pandas` DataFrame and a schema. The schema is a `dict` that maps columns in the DataFrame to a list of parsers that are executed in order.
    
    Denormalizes the passed DataFrame by applying the parsers in the schema.
    
    Returns the denormalized DataFrame.
    """
    def map_schema(map_row, map_schema, map_delim):
        results = [
            functools.reduce(
                lambda data, fxn: fxn(data), map_schema[i], map_row[i]
            )
            for i in map_row.index
        ]
        indices = [
            # Allow parsers to return a named component and use that name to index
            # our new columns, else use the index number as a string.
            map_delim.join(
                [map_row.index[i], str(v[1]) if isinstance(v, tuple) else str(j)]
            )
            for i, u in enumerate(map_row.index)
            for j, v in enumerate(results[i])
        ]
        series = pd.Series(
            data=[v[0] if isinstance(v, tuple) else v for v in itertools.chain(*results)], index=indices
        )
        return series

    return pd.concat(
        [
            df,
            df.apply(
                map_schema,
                axis=1,
                args=(schema, suffix_delim),
                result_type="expand",
            ),
        ],
        axis=1,
    )
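
The column-naming rule inside `tokenize` can be isolated in a few lines. The sketch below is an illustration under assumptions: `parse_date_parts` and `parse_whole` are hypothetical parsers written for this example (they are not the `proofzero_sdk.util` parsers), and `column_names` and `column_values` are hypothetical helpers that mirror how `tokenize` treats each parsed component. A `(value, name)` tuple yields a named column, and anything else is numbered by position.

```python
def parse_date_parts(value: str) -> list:
    """Hypothetical parser: split 'YYYY/MM/DD' into named (value, name) components."""
    year, month, day = value.replace("-", "/").split("/")
    return [(int(year), "year"), (int(month), "month"), (int(day), "day")]


def parse_whole(value: str) -> list:
    """Hypothetical parser: return the value unchanged as a single unnamed component."""
    return [value]


def column_names(column: str, parsed: list, delim: str = "_") -> list:
    """Join the source column with each component's name, or its position if unnamed."""
    return [
        delim.join([column, str(item[1]) if isinstance(item, tuple) else str(position)])
        for position, item in enumerate(parsed)
    ]


def column_values(parsed: list) -> list:
    """Strip the names off named components, leaving only the values."""
    return [item[0] if isinstance(item, tuple) else item for item in parsed]


dob = parse_date_parts("1943/04/15")
dob_names = column_names("DOB", dob)      # named components
dob_values = column_values(dob)
acct_names = column_names("acct_num", parse_whole("PXCG66212484637575"))  # positional
```

Named components matter for matching across frames: two datasets only share a column such as `DOB_year` if their parsers emit the same component names.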

### Sample Tokenization

This is an example of running `tokenize` on a passed `DataFrame` to generate a `TokenFrame` (in this case of one record). See `match` for a complete workflow.

In [None]:
from proofzero_sdk.util import parseString, parseName, parseAddress, parsePhone, parseSIN, parseDate
tokens_1 = tokenize(df=data_1[:1], schema={
    'acct_num': [parseString],
    'name': [parseName],
    'address': [parseAddress],
    'phone': [parsePhone],
    'SIN': [parseSIN],
    'DOB': [parseDate]
}, suffix_delim='_')
tokens_1[:1]

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,acct_num_0,name_title,name_first,name_middle,...,phone_country_code_source,phone_preferred_domestic_carrier,SIN_0,SIN_1,SIN_2,SIN_3,DOB_iso_format,DOB_year,DOB_month,DOB_day
0,PXCG66212484637575,Kristin Sanchez MD,"31417 Gina Lodge, Bradleytown, MB P7G 4N5",1 (598) 742-6794,203 268 552,1943/04/15,PXCG66212484637575,,Kristin,,...,10,,203-268-552,203,268,552,1943-04-15,1943,4,15


## Indexing data

The `index` function applies the selected hash function to every value in the tokenized `DataFrame` (i.e., the `TokenFrame`) and returns an `IndexFrame`. Any salt is added by the hash function itself; `index` adds none.

In [None]:
#export

from proofzero_sdk.util import sha2

# The `IndexFrame` type is a wrapper around a `TokenFrame` (i.e., a `pandas` DataFrame) that represents a tokenized DataFrame that has been indexed using the `index` function.
IndexFrame = NewType('IndexFrame', TokenFrame)

def index(
    df: TokenFrame, hasher: Callable = sha2
) -> IndexFrame:
    """
    Applies a hash function to the tokenized features and returns an `IndexFrame`. The default hash function is `sha2`, from the Cleanroom utility SDK.
    """
    return df.applymap(hasher)
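
Because `hasher` is any callable applied to each value, a salted hash can be supplied without changing `index`. The sketch below is one possible approach, not the SDK's `sha2`: `make_salted_hasher` is a hypothetical helper, the salt and data are invented, and SHA-256 from the standard library is used as the hash. It applies the hasher column by column with `Series.map`, which behaves the same across `pandas` versions (recent releases deprecate `DataFrame.applymap` in favor of `DataFrame.map`).

```python
import hashlib
from typing import Callable

import pandas as pd


def make_salted_hasher(salt: str) -> Callable[[object], str]:
    """Hypothetical helper: build a hasher that prefixes every value with `salt`."""
    def hasher(value: object) -> str:
        payload = (salt + str(value)).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
    return hasher


# Invented token data standing in for a TokenFrame.
tokens = pd.DataFrame({"name_first": ["Kristin", "Tammy"], "DOB_year": [1943, 1944]})

hasher = make_salted_hasher("example-salt")
indexed = tokens.apply(lambda column: column.map(hasher))  # hash every cell

# The same value hashed under a different salt produces a different digest.
other_digest = make_salted_hasher("another-salt")("Kristin")
```

Two frames must be indexed with the same hasher, including the same salt, for their hashes to be comparable in the `match` step.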

### Sample Indexing

This is an example of running `index` on a passed `TokenFrame`, in this case returning a single record. See `match` for a complete workflow.

In [None]:
index_1 = index(df=tokens_1)
index_1

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,acct_num_0,name_title,name_first,name_middle,...,phone_country_code_source,phone_preferred_domestic_carrier,SIN_0,SIN_1,SIN_2,SIN_3,DOB_iso_format,DOB_year,DOB_month,DOB_day
0,9802534054846731338401687891444750119150976961...,4418369248196592421047499630841732388346555327...,9594784023836219556184433361991357495783756657...,7344459144365265142091263159766424283511268480...,1010505653944236458406177846488691380962181960...,5141391390246777661242717464678167327763772345...,9802534054846731338401687891444750119150976961...,3877226117079751550214273725156091025388555585...,9229611837054111477081027882940374069625694253...,3877226117079751550214273725156091025388555585...,...,9639508933775287426333904353415401342534309318...,6586455234700639270600574409018427069812517107...,1148673251264303964479456609021610855802601825...,2948056107977080813370872483486457740221751010...,5545403145286023377289154093249644763249517306...,1018998564061470247772449997659560822438884579...,7644637171458321163635300620903733517024394144...,1020423811237369131177406386728771151256212182...,6275815013823685249316828430045322558920567663...,9930297093447128312648856684579760333304390435...


## Matching data

`match` finds records within a passed `IndexFrame`, comparing across all columns, whose similarity to a query record exceeds a configurable baseline sensitivity (formally, a Jaccard Index threshold).

This function is powerful because it matches hashed values without referencing the underlying data. It is therefore appropriate for privacy-first computation. It uses optional metadata (generated by the parsers in the `tokenize` workflow step) to assist matching.

This function optionally connects to the Cleanroom compute cluster to enhance speed and accuracy of returned matches (contact us for an API key: admin@proofzero.io).
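
For reference, the Jaccard Index of two sets is the size of their intersection divided by the size of their union. The sketch below is a small, self-contained illustration. `jaccard_index` is a helper name chosen for this example, and the "hashes" are invented placeholder strings rather than real digests.

```python
def jaccard_index(a: set, b: set) -> float:
    """Return |a & b| / |a | b|, treating two empty sets as having no similarity."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Invented placeholder hashes for a query record and a candidate record.
query = {"h_kristin", "h_sanchez", "h_1943", "h_598"}
candidate = {"h_kristin", "h_sanchez", "h_1943", "h_742"}

score = jaccard_index(query, candidate)  # 3 shared hashes out of 5 distinct hashes
is_match = score > 0.5                   # compared against a sensitivity threshold
```

A score of 1.0 means the two records share every distinct hash, and a score of 0.0 means they share none.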

In [None]:
#export

# The `MatchFrame` type is a wrapper around an `IndexFrame` (i.e., a `TokenFrame` and ultimately a `pandas` DataFrame) that represents an indexed DataFrame that has been matched using the `match` function.
MatchFrame = NewType('MatchFrame', IndexFrame)

def match(
    index_0: IndexFrame, index_1: IndexFrame, sensitivity = 0.5, api_key: str = None
) -> pd.DataFrame:
    """
    Discover matches between the two passed `IndexFrame`s. If `api_key` is set, the indexed data can be matched using the Cleanroom compute cluster (this local implementation ignores it).

    `index_0` must contain exactly one row. If a whole-frame match is required, pass this function to `pandas.DataFrame.apply`.

    `sensitivity` is the lowest [Jaccard Index](https://en.wikipedia.org/wiki/Jaccard_index), expressed as a fraction between 0 and 1, that indicates a match. Only rows scoring strictly above it are returned.

    This function is potentially compute-intensive. [Contact us](mailto:admin@proofzero.io) for cloud scaling help.
    """
    if (len(index_0) != 1):
        raise RuntimeError('Pass exactly one row as index_0.')

    column_intersect = list(set(index_0.columns) & set(index_1.columns))
#     column_union = list(set(index_0.columns) | set(index_1.columns))
#     column_jaccard = len(column_intersect) / len(column_union) # Can use the column Jaccard Index to normalize the row Jaccard, below.

    if (len(column_intersect) < 1):
        raise RuntimeError('No schema overlap -- some columns must match (parsing functions in schema must emit tags that match across both frames).')

    df = pd.DataFrame(index_1)
    df = df[df.columns[df.apply(lambda c: len(c.unique()) > 1)]]
    df['_match'] = df.apply(lambda r: len(set(r) & set(index_0.iloc[0])) / len(set(r) | set(index_0.iloc[0])), axis=1)
    df = df.sort_values(by='_match', ascending=False)
    return df[df['_match'] > sensitivity]
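
Since `match` accepts only a single-row query, matching a whole frame means calling it once per row. The sketch below shows one way to do that, under assumptions: `match_one` is a simplified stand-in for `match` (it skips the schema-overlap check and the removal of constant columns), `match_all` is a hypothetical wrapper, and the frames contain invented placeholder hashes.

```python
import pandas as pd


def match_one(query: pd.DataFrame, space: pd.DataFrame, sensitivity: float = 0.5) -> pd.DataFrame:
    """Simplified stand-in for `match`: score each row of `space` against a one-row query."""
    if len(query) != 1:
        raise ValueError("query must contain exactly one row")
    target = set(query.iloc[0])
    scores = space.apply(lambda row: len(set(row) & target) / len(set(row) | target), axis=1)
    ranked = space.assign(_match=scores).sort_values(by="_match", ascending=False)
    return ranked[ranked["_match"] > sensitivity]


def match_all(queries: pd.DataFrame, space: pd.DataFrame, sensitivity: float = 0.5) -> dict:
    """Hypothetical wrapper: run `match_one` for every row and key results by row label."""
    return {
        label: match_one(queries.loc[[label]], space, sensitivity)
        for label in queries.index
    }


# Invented placeholder hashes.
queries = pd.DataFrame({"a": ["x1", "y1"], "b": ["x2", "y2"]})
space = pd.DataFrame({"a": ["x1", "z1", "y1"], "b": ["x2", "z2", "q"]})

tight = match_all(queries, space)                  # default threshold of 0.5
wide = match_all(queries, space, sensitivity=0.3)  # lower threshold admits partial matches
```

With the default threshold only the exact duplicate of the first query is returned. Lowering `sensitivity` also surfaces the row that shares one of three distinct hashes with the second query.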

### Sample End-to-End Matching Workflow

This workflow demonstrates end-to-end usage of the Cleanroom SDK. We take the single row that we loaded, tokenized, and indexed above and match it against another dataset. **Important note:** these data intentionally contain errors in order to illustrate the algorithm working with broken data.

We now load a new `DataFrame` and privately search for that record.

### Load the search space

In [None]:
data_2 = load('./data_2.csv')
data_2

Unnamed: 0,acct_num,name,address,phone,SIN,DOB
0,CSHE38391159497596,Tammy Grant,"245 Katherine Via Suite 081, Burtonton, NT E3M4V6",232-641-0083,184 381 564,1944-04-23
1,EKOT22531160348874,Clémence Guillaume,"702, boulevard Joséphine Vincent, 25009 Saint...",386-006-4400,674 134 051,1969-02-27
2,CUVC30129179791293,Kristin Sanchez MD,"31417 Gina Lodge, Bradleytown, MB P7G 4N5",742 598 6794,203 268 552,1943/04/15
3,EVTK79966801588380,Dr. Megan Gomez,"887 Williams Road, West Ralphshire, NL Y6M9C4",(996) 489-6057,071 388 466,2011-09-04
4,AXMY64726382738404,Robert Evans,"7156 Tracy Points Suite 587, Perezville, AB M7...",608 900 6570,123 827 644,1991-10-29
5,ROZX17924884888416,Riley Bell DDS,"872 Hebert Parks, New Wanda, NT L9A 3X1",647 927 4901,544 325 061,1979-05-10
6,KJSJ20498435130926,Donald Sullivan DDS,"05620 Patton Drives, West Susan, NT J1C8S7",868.610.7617,335 441 184,1979-07-09
7,THXW69049993202165,John Bush,"764 White Springs Apt. 124, Brandonfurt, YT T6...",641 411 5627,815 346 754,1924-01-23
8,IAKB97381626750673,Thierry Hebert,"75, rue Alves, 19169 Moulin",1 (671) 051-7137,580 532 836,1981-02-03
9,BUQQ02115919743234,Mrs. Tracey Collins,"981 Blake Viaduct, Port Courtney, SK A9A 3E7",834 463 5846,170 346 886,1913-07-29


### Tokenize the search space

We now decompose the loaded `DataFrame` we are going to search by passing it and a schema to `tokenize`.

The schema is a `dict` that maps column names (keys) to `list`s of functions that will be applied in order to values in the respective column.

This results in a `TokenFrame` where columns include metadata tags that will support the match.

In [None]:
tokens_2 = tokenize(df=data_2, schema={
    'acct_num': [parseString],
    'name': [parseName],
    'address': [parseAddress],
    'phone': [parsePhone],
    'SIN': [parseSIN],
    'DOB': [parseDate]
}, suffix_delim='_').reset_index(drop=True)
tokens_2

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,DOB_day,DOB_iso_format,DOB_month,DOB_year,...,phone_country_code,phone_country_code_source,phone_extension,phone_full_phone,phone_italian_zero,phone_leading_zero_count,phone_line_number,phone_national_number,phone_preferred_domestic_carrier,phone_raw
0,CSHE38391159497596,Tammy Grant,"245 Katherine Via Suite 081, Burtonton, NT E3M4V6",232-641-0083,184 381 564,1944-04-23,23,1944-04-23,4,1944,...,1,20,,tel:+1-232-641-0083,,,83,2326410083,,232-641-0083
1,EKOT22531160348874,Clémence Guillaume,"702, boulevard Joséphine Vincent, 25009 Saint...",386-006-4400,674 134 051,1969-02-27,27,1969-02-27,2,1969,...,1,20,,tel:+1-386-006-4400,,,4400,3860064400,,386-006-4400
2,CUVC30129179791293,Kristin Sanchez MD,"31417 Gina Lodge, Bradleytown, MB P7G 4N5",742 598 6794,203 268 552,1943/04/15,15,1943-04-15,4,1943,...,1,20,,tel:+1-742-598-6794,,,6794,7425986794,,742 598 6794
3,EVTK79966801588380,Dr. Megan Gomez,"887 Williams Road, West Ralphshire, NL Y6M9C4",(996) 489-6057,071 388 466,2011-09-04,4,2011-09-04,9,2011,...,1,20,,tel:+1-996-489-6057,,,6057,9964896057,,(996) 489-6057
4,AXMY64726382738404,Robert Evans,"7156 Tracy Points Suite 587, Perezville, AB M7...",608 900 6570,123 827 644,1991-10-29,29,1991-10-29,10,1991,...,1,20,,tel:+1-608-900-6570,,,6570,6089006570,,608 900 6570
5,ROZX17924884888416,Riley Bell DDS,"872 Hebert Parks, New Wanda, NT L9A 3X1",647 927 4901,544 325 061,1979-05-10,10,1979-05-10,5,1979,...,1,20,,tel:+1-647-927-4901,,,4901,6479274901,,647 927 4901
6,KJSJ20498435130926,Donald Sullivan DDS,"05620 Patton Drives, West Susan, NT J1C8S7",868.610.7617,335 441 184,1979-07-09,9,1979-07-09,7,1979,...,1,20,,tel:+1-868-610-7617,,,7617,8686107617,,868.610.7617
7,THXW69049993202165,John Bush,"764 White Springs Apt. 124, Brandonfurt, YT T6...",641 411 5627,815 346 754,1924-01-23,23,1924-01-23,1,1924,...,1,20,,tel:+1-641-411-5627,,,5627,6414115627,,641 411 5627
8,IAKB97381626750673,Thierry Hebert,"75, rue Alves, 19169 Moulin",1 (671) 051-7137,580 532 836,1981-02-03,3,1981-02-03,2,1981,...,1,10,,tel:+1-671-051-7137,,,7137,6710517137,,1 (671) 051-7137
9,BUQQ02115919743234,Mrs. Tracey Collins,"981 Blake Viaduct, Port Courtney, SK A9A 3E7",834 463 5846,170 346 886,1913-07-29,29,1913-07-29,7,1913,...,1,20,,tel:+1-834-463-5846,,,5846,8344635846,,834 463 5846


### Index the search space

We now hash the `TokenFrame` to generate an `IndexFrame` using the `index` function. This applies a hash function to every value in the frame.

`IndexFrame`s encode data and metadata that allow for private comparison of data.

In [None]:
index_2 = index(df=tokens_2)
index_2

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,DOB_day,DOB_iso_format,DOB_month,DOB_year,...,phone_country_code,phone_country_code_source,phone_extension,phone_full_phone,phone_italian_zero,phone_leading_zero_count,phone_line_number,phone_national_number,phone_preferred_domestic_carrier,phone_raw
0,9352391750577855971435718263705245987255069699...,2286469055134425837256437511113943078399915788...,1662836679935136219346165696806043302595180201...,6803476586732225874988534598639217266507945391...,1105230399750924418265628994290112269218121494...,6422048618153117207851186583422705470380740205...,6557095579074420993106375022262607033247559721...,6422048618153117207851186583422705470380740205...,6275815013823685249316828430045322558920567663...,3073278372982378113836217057325296307431671824...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,1044290470638575181681436323587881468021630185...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,2865553556831804391881151340185794355881904637...,2871566366896193952836971386518419114925059914...,6586455234700639270600574409018427069812517107...,6803476586732225874988534598639217266507945391...
1,7648492088929775047729857745339394440483050245...,7173845473716532081973976595696249047733478027...,6639959451600362907651915301633764411274838796...,7679130770068404427594818912209342916763734244...,9467357279037345957822471375862774448184139061...,1080243633790715765797058260546576300705715625...,7954363958932136039762609371070148394491817317...,1080243633790715765797058260546576300705715625...,2427484933034463179650951670100484315564935015...,2463484216937542648069538645574222400470592853...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,4757384045636314203805490169031777761441884714...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,3281336301241987920828526158359502094875847983...,8785399967929720641290699613736549519957908443...,6586455234700639270600574409018427069812517107...,7679130770068404427594818912209342916763734244...
2,3118710362407207060820519817066124611602675044...,4418369248196592421047499630841732388346555327...,9594784023836219556184433361991357495783756657...,5103122257192237244690193744677031637888782293...,1010505653944236458406177846488691380962181960...,5141391390246777661242717464678167327763772345...,9930297093447128312648856684579760333304390435...,7644637171458321163635300620903733517024394144...,6275815013823685249316828430045322558920567663...,1020423811237369131177406386728771151256212182...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,7341275782727882540070624817464065162426234694...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,5598189335674161013176229442725892028281044088...,2209634314611670011518640833334290179954075682...,6586455234700639270600574409018427069812517107...,5103122257192237244690193744677031637888782293...
3,9484334146537988355728459332866502132929799924...,9318338305054446347926254659255222545484711531...,4935503645211391673912889753844162833274778751...,6486692424528110891178009898447355217786763806...,8204045005077247441814741234893571359393397113...,1480736501988108952136288534524016843838889772...,6275815013823685249316828430045322558920567663...,1480736501988108952136288534524016843838889772...,8309389957186075165092560624942708659677963045...,8356795650354981540143983036404099727469169323...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,6887763359356051501250786368179416002546089276...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,1843676371410185153127045889751246799596599688...,1360659012191643202926955576358558507267719450...,6586455234700639270600574409018427069812517107...,6486692424528110891178009898447355217786763806...
4,1071018365514043172837255808126169933999636232...,1114326782808393365232341598419425129588183764...,1152869418489848773624024499307078600209855856...,4132947177542828096187307334685875194356396400...,1064920995823064904534557796589321431559378713...,9661192392824024868844046871643226699778270959...,4015104045131944362036482432099818132485504687...,9661192392824024868844046871643226699778270959...,9639508933775287426333904353415401342534309318...,2806674585915850563367143233752860736861333529...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,9112657378520514729919687040798975455706196413...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,1604774408038238001616894894981213396335227855...,3899256164169218318470815743768918876191092915...,6586455234700639270600574409018427069812517107...,4132947177542828096187307334685875194356396400...
5,2496563627104139277169365876460382559038435572...,6454403102432039287053110166990066106002084798...,3651982899382164801461990073630033283287844954...,3463588876646302938116912807130962241816660749...,8423106688771213837410693638542511504917139477...,9218923082173177104284294915100902987111843609...,9639508933775287426333904353415401342534309318...,9218923082173177104284294915100902987111843609...,7141540202189356383931936714304556824551311787...,8890606647136029945955242371960322683693604096...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,1908582433149376928682468664470283852831797634...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,2538568968468527516589094882186601211166153015...,6943853430047425091849724175437711841508113243...,6586455234700639270600574409018427069812517107...,3463588876646302938116912807130962241816660749...
6,3391740647349871678463249016707737099073381208...,6510695797218171340179950464347791190367619038...,1356721171139435678139640805919245873905005893...,4452033968511225499961759392516478615400951249...,1129346035519486222072149185290677775961657288...,3204819792041604548297263397723071743723729083...,8309389957186075165092560624942708659677963045...,3204819792041604548297263397723071743723729083...,3670113380018528025358701623221752071603559479...,8890606647136029945955242371960322683693604096...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,3274476228479190890379417191182101453744803451...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,1094476508019270150735424240679754909398049478...,7869173949783705573601297745130574015741517693...,6586455234700639270600574409018427069812517107...,4452033968511225499961759392516478615400951249...
7,2896386555688245340973815017167035869033084574...,3740050141912127749762587322711714228295478755...,3313168774184323914800263778513565031691012483...,9615368487897068366175598029056652005663279036...,4094676365399055458422893148292763515937224229...,5935339395422512591922443135064481007877189544...,6557095579074420993106375022262607033247559721...,5935339395422512591922443135064481007877189544...,3408518341937466763398361241378156279395974866...,1082895589924779261153191963616926382817514424...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,8113856301539894102590427744572848805829834382...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,2146881114898945340989815055433395842846525431...,9068029800339798913586060103079508368927777089...,6586455234700639270600574409018427069812517107...,9615368487897068366175598029056652005663279036...
8,5662691832183266586601746727091876475468238588...,8793299419530868389766942220344102210611327904...,7177501832337986028722142167758486731529362980...,6984995904642672233742553695146030160722109026...,5840638335963406219321396630626013263555840849...,5456143119833488579746085901041402303164055840...,9345861891868189917479994948603472140492772331...,5456143119833488579746085901041402303164055840...,2427484933034463179650951670100484315564935015...,3709477872779721260616925464012062405089662830...,...,3408518341937466763398361241378156279395974866...,9639508933775287426333904353415401342534309318...,6586455234700639270600574409018427069812517107...,2190823024237635695351018496824155996488847686...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,3177436389617842158951839862673130634419658207...,6539377457810780674641823567689985700724229582...,6586455234700639270600574409018427069812517107...,6984995904642672233742553695146030160722109026...
9,1035600337502627784175275552158822330557269199...,4735428363639079114833199783652315610058912369...,1027512345073784590551214333221150318978954428...,4293738925975353113301284603696259532686371483...,4747762857438858189525102125328485908535398898...,6720182719191230318238434100847481789129316947...,4015104045131944362036482432099818132485504687...,6720182719191230318238434100847481789129316947...,3670113380018528025358701623221752071603559479...,3598231929181579401121477957565026611222369701...,...,3408518341937466763398361241378156279395974866...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,1842647058941403076514827134355086252610370362...,6586455234700639270600574409018427069812517107...,6586455234700639270600574409018427069812517107...,4193877429411831103009386854891233981512646764...,2793946552859464032316271621194932311142742246...,6586455234700639270600574409018427069812517107...,4293738925975353113301284603696259532686371483...


### Private Searching With `match`

We can now compare `index_1`, a single-record `IndexFrame`, with `index_2`, the hashed search space we created as part of this end-to-end example workflow.

In [None]:
tight_match = match(index_1, index_2)
tight_match

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,DOB_day,DOB_iso_format,DOB_month,DOB_year,...,phone_area_code,phone_central_office,phone_country_code_source,phone_extension,phone_full_phone,phone_italian_zero,phone_line_number,phone_national_number,phone_raw,_match
2,3118710362407207060820519817066124611602675044...,4418369248196592421047499630841732388346555327...,9594784023836219556184433361991357495783756657...,5103122257192237244690193744677031637888782293...,1010505653944236458406177846488691380962181960...,5141391390246777661242717464678167327763772345...,9930297093447128312648856684579760333304390435...,7644637171458321163635300620903733517024394144...,6275815013823685249316828430045322558920567663...,1020423811237369131177406386728771151256212182...,...,1535860380141259287318590160938114468677934395...,6055181033144981996992600968534808922389255124...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,7341275782727882540070624817464065162426234694...,6586455234700639270600574409018427069812517107...,5598189335674161013176229442725892028281044088...,2209634314611670011518640833334290179954075682...,5103122257192237244690193744677031637888782293...,0.692308


This `tight_match` result indicates an approximately 69% match in row `2`. **The score is below 100% because the data we have been using contain data quality issues. The algorithm is designed to work with, and report on, broken data.**

We now conduct a wider match, at 8% sensitivity, to illustrate fuzzy matching.

In [None]:
wide_match = match(index_1, index_2, sensitivity=0.08)
wide_match

Unnamed: 0,acct_num,name,address,phone,SIN,DOB,DOB_day,DOB_iso_format,DOB_month,DOB_year,...,phone_area_code,phone_central_office,phone_country_code_source,phone_extension,phone_full_phone,phone_italian_zero,phone_line_number,phone_national_number,phone_raw,_match
2,3118710362407207060820519817066124611602675044...,4418369248196592421047499630841732388346555327...,9594784023836219556184433361991357495783756657...,5103122257192237244690193744677031637888782293...,1010505653944236458406177846488691380962181960...,5141391390246777661242717464678167327763772345...,9930297093447128312648856684579760333304390435...,7644637171458321163635300620903733517024394144...,6275815013823685249316828430045322558920567663...,1020423811237369131177406386728771151256212182...,...,1535860380141259287318590160938114468677934395...,6055181033144981996992600968534808922389255124...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,7341275782727882540070624817464065162426234694...,6586455234700639270600574409018427069812517107...,5598189335674161013176229442725892028281044088...,2209634314611670011518640833334290179954075682...,5103122257192237244690193744677031637888782293...,0.692308
29,7057292769575579361050821897241918248606011753...,5617328753057921737658028497456179958088764425...,7924640378432814821929107008537496792425512191...,8056021172802262843011021254097992748539323027...,5560362402943197755118457963525278312724320540...,6620527415006331229996536496131348073731698661...,9930297093447128312648856684579760333304390435...,6620527415006331229996536496131348073731698661...,9639508933775287426333904353415401342534309318...,6599897967096895158358548374144717673673624601...,...,9849018411840847459262668700455075173693650456...,2678058268810809356502314573059130680737698216...,1950749364813865911797004310287114589195880318...,6586455234700639270600574409018427069812517107...,1077246510238277462451958566530793134111446230...,6586455234700639270600574409018427069812517107...,1626144164801629588189582668234624653071589971...,7731075304118589669482615807785889205915999986...,8056021172802262843011021254097992748539323027...,0.081967
35,1522443405030172063833003621862690447511520910...,1510764360549103267758070134620527638540320606...,1032291683791627335537779811266356817509601779...,1142221496686735015740876987085575466380068692...,5857799411911002468644558003037190101677920889...,3743201676640474796187789696596758503012640993...,3408518341937466763398361241378156279395974866...,3743201676640474796187789696596758503012640993...,6275815013823685249316828430045322558920567663...,8170746109861022644513680967690352776868691534...,...,2320325385253540053731626185167439000920282668...,7101302949556099159753348202150164381759588516...,9639508933775287426333904353415401342534309318...,6586455234700639270600574409018427069812517107...,3326602003511100398995841115268137770285393235...,6586455234700639270600574409018427069812517107...,1413121891302400327783318164790066508156007952...,5713873125695594953562533575559997276590948364...,1142221496686735015740876987085575466380068692...,0.081967
13,8949147714013812800645782586271897047378394637...,1313930567137998699556395488990178716538720514...,8966437230744534859680604116595486340241455709...,5909103129090115836423108419915119846082693792...,7154471947853744924794385215635589120690511641...,3260431721073643449026212734830373343915435688...,2427484933034463179650951670100484315564935015...,3260431721073643449026212734830373343915435688...,6275815013823685249316828430045322558920567663...,2830895001478673387714523141784160510126249519...,...,9220158232455259385532883106212347638569721719...,2205842866729370346316112159386094892855153174...,9639508933775287426333904353415401342534309318...,6586455234700639270600574409018427069812517107...,5022088046076237582333433439579355700366510251...,6586455234700639270600574409018427069812517107...,4846998203962533123753486638213503164965004563...,2440978461227243293204259400603359034151316360...,5909103129090115836423108419915119846082693792...,0.081967


Fuzzy matching allows for the private generation of "lookalike" marketing sets across data silos, for example.