# Cleanroom SDK

A parsing and feature-hashing client for the Cleanroom compute cluster matching system.

## Install

Within your Python environment:

```bash
tar -xvf proofzero_sdk.tar.bz2
pip install ./proofzero_sdk
```

## Use Case

To illustrate an end-to-end use case that exercises the Cleanroom SDK, we will privately search a cryptographically hashed dataset. To do this, we will:

1. Load, tokenize, and hash a single record from one dataset.
1. Load, tokenize, and hash a second dataset.
1. Conduct a fuzzy search for the record from the first set within the second set.

**Important Note:** The records we load and search contain data quality errors, so that the example shows the algorithm working with realistic data.
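
The idea behind searching hashed data can be sketched in a few lines. This is a minimal illustration only, not the SDK's implementation: the helper names `hash_token` and `hash_record` are hypothetical, and SHA-256 is an assumption (this README does not state which hash function the SDK uses). It shows that equal tokens produce equal hashes, so fields can be compared without access to the plaintext.

```python
import hashlib


def hash_token(token: str) -> int:
    """Hash one token to a large integer (SHA-256 is an assumption here)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def hash_record(record: dict) -> dict:
    """Normalize each field value, then replace it with its hash."""
    return {
        field: hash_token(value.strip().lower())
        for field, value in record.items()
    }


query = hash_record({"name": "Kristin Sanchez MD", "SIN": "203 268 552"})
candidate = hash_record({"name": "kristin sanchez md ", "SIN": "203 268 552"})

# Equal tokens hash to equal values, so fields can be compared
# without ever looking at the plaintext.
shared = [field for field in query if query[field] == candidate[field]]
print(shared)  # ['name', 'SIN']
```

Normalizing before hashing matters in this sketch: once a value is hashed, differences in case or whitespace can no longer be repaired.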

## Quickstart

Import the code into your `.py` file or Jupyter notebook:

```python
import proofzero_sdk.core as p0
import proofzero_sdk.util as p1
```

### Load, Tokenize, and Index Our Query Data

First, we set up our `query_record` by loading, tokenizing, and indexing a row from our first dataset.

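```python
# Load the first dataset and keep only its first row as the query.
query_data = p0.load('./data_1.csv')[:1]

# Map each column to the parser(s) used to split it into tokens.
query_tokens = p0.tokenize(query_data, schema={
    'acct_num': [p1.parseString],
    'name': [p1.parseName],
    'address': [p1.parseAddress],
    'phone': [p1.parsePhone],
    'SIN': [p1.parseSIN],
    'DOB': [p1.parseDate]
}, suffix_delim='_')

# Index the tokens to produce the hashed record shown below.
query_record = p0.index(query_tokens)
query_record
```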

| Unnamed: 0 | acct_num | name | address | phone | SIN | DOB | acct_num_0 | name_title | name_first | name_middle | ... | phone_country_code_source | phone_preferred_domestic_carrier | SIN_0 | SIN_1 | SIN_2 | SIN_3 | DOB_iso_format | DOB_year | DOB_month | DOB_day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9802534054846731338401687891444750119150976961... | 4418369248196592421047499630841732388346555327... | 9594784023836219556184433361991357495783756657... | 7344459144365265142091263159766424283511268480... | 1010505653944236458406177846488691380962181960... | 5141391390246777661242717464678167327763772345... | 9802534054846731338401687891444750119150976961... | 3877226117079751550214273725156091025388555585... | 9229611837054111477081027882940374069625694253... | 3877226117079751550214273725156091025388555585... | ... | 9639508933775287426333904353415401342534309318... | 6586455234700639270600574409018427069812517107... | 1148673251264303964479456609021610855802601825... | 2948056107977080813370872483486457740221751010... | 5545403145286023377289154093249644763249517306... | 1018998564061470247772449997659560822438884579... | 7644637171458321163635300620903733517024394144... | 1020423811237369131177406386728771151256212182... | 6275815013823685249316828430045322558920567663... | 9930297093447128312648856684579760333304390435... |
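
To make the role of `schema` and `suffix_delim` more concrete, here is a rough sketch of how a schema of parsers could expand one column into suffixed columns such as `DOB_year` or `acct_num_0`. Everything in it is hypothetical: `parse_date`, `parse_string`, and `tokenize_record` are illustrative stand-ins, not the SDK's `parseDate`, `parseString`, or `tokenize`, and the rule that dict results are named while list results are numbered is an assumption inferred from the column names in the output.

```python
from datetime import datetime


def parse_date(value: str) -> dict:
    """Hypothetical parser: split a YYYY/MM/DD date into named parts."""
    parsed = datetime.strptime(value, "%Y/%m/%d").date()
    return {
        "iso_format": parsed.isoformat(),
        "year": str(parsed.year),
        "month": str(parsed.month),
        "day": str(parsed.day),
    }


def parse_string(value: str) -> list:
    """Hypothetical parser: return the trimmed value as a single token."""
    return [value.strip()]


def tokenize_record(record: dict, schema: dict, suffix_delim: str = "_") -> dict:
    """Run each column's parsers and add the tokens as suffixed columns."""
    tokens = dict(record)  # keep the original columns alongside the new ones
    for column, parsers in schema.items():
        for parser in parsers:
            parsed = parser(record[column])
            # Assumption: dict results supply their own suffixes,
            # list results are numbered 0, 1, 2, ...
            items = parsed.items() if isinstance(parsed, dict) else enumerate(parsed)
            for suffix, token in items:
                tokens[f"{column}{suffix_delim}{suffix}"] = token
    return tokens


row = {"acct_num": "PXCG66212484637575", "DOB": "1943/04/15"}
tokens = tokenize_record(row, {"acct_num": [parse_string], "DOB": [parse_date]})
print(sorted(tokens))
```

In this sketch each value in the schema is a list, so more than one parser could be applied to the same column.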


### Load, Tokenize, and Index Our Search Space

Second, we set up our search space by loading, tokenizing, and indexing our entire second dataset.

```python
search_data = p0.load('./data_2.csv')
search_tokens = p0.tokenize(search_data, schema={
    'acct_num': [p1.parseString],
    'name': [p1.parseName],
    'address': [p1.parseAddress],
    'phone': [p1.parsePhone],
    'SIN': [p1.parseSIN],
    'DOB': [p1.parseDate]
}, suffix_delim='_')
search_space = p0.index(search_tokens)
search_space
```

| Unnamed: 0 | acct_num | name | address | phone | SIN | DOB | DOB_day | DOB_iso_format | DOB_month | DOB_year | ... | phone_country_code | phone_country_code_source | phone_extension | phone_full_phone | phone_italian_zero | phone_leading_zero_count | phone_line_number | phone_national_number | phone_preferred_domestic_carrier | phone_raw |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 9352391750577855971435718263705245987255069699... | 2286469055134425837256437511113943078399915788... | 1662836679935136219346165696806043302595180201... | 6803476586732225874988534598639217266507945391... | 1105230399750924418265628994290112269218121494... | 6422048618153117207851186583422705470380740205... | 6557095579074420993106375022262607033247559721... | 6422048618153117207851186583422705470380740205... | 6275815013823685249316828430045322558920567663... | 3073278372982378113836217057325296307431671824... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 1044290470638575181681436323587881468021630185... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 2865553556831804391881151340185794355881904637... | 2871566366896193952836971386518419114925059914... | 6586455234700639270600574409018427069812517107... | 6803476586732225874988534598639217266507945391... |
| 1 | 7648492088929775047729857745339394440483050245... | 7173845473716532081973976595696249047733478027... | 6639959451600362907651915301633764411274838796... | 7679130770068404427594818912209342916763734244... | 9467357279037345957822471375862774448184139061... | 1080243633790715765797058260546576300705715625... | 7954363958932136039762609371070148394491817317... | 1080243633790715765797058260546576300705715625... | 2427484933034463179650951670100484315564935015... | 2463484216937542648069538645574222400470592853... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 4757384045636314203805490169031777761441884714... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 3281336301241987920828526158359502094875847983... | 8785399967929720641290699613736549519957908443... | 6586455234700639270600574409018427069812517107... | 7679130770068404427594818912209342916763734244... |
| 2 | 3118710362407207060820519817066124611602675044... | 4418369248196592421047499630841732388346555327... | 9594784023836219556184433361991357495783756657... | 5103122257192237244690193744677031637888782293... | 1010505653944236458406177846488691380962181960... | 5141391390246777661242717464678167327763772345... | 9930297093447128312648856684579760333304390435... | 7644637171458321163635300620903733517024394144... | 6275815013823685249316828430045322558920567663... | 1020423811237369131177406386728771151256212182... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 7341275782727882540070624817464065162426234694... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 5598189335674161013176229442725892028281044088... | 2209634314611670011518640833334290179954075682... | 6586455234700639270600574409018427069812517107... | 5103122257192237244690193744677031637888782293... |
| 3 | 9484334146537988355728459332866502132929799924... | 9318338305054446347926254659255222545484711531... | 4935503645211391673912889753844162833274778751... | 6486692424528110891178009898447355217786763806... | 8204045005077247441814741234893571359393397113... | 1480736501988108952136288534524016843838889772... | 6275815013823685249316828430045322558920567663... | 1480736501988108952136288534524016843838889772... | 8309389957186075165092560624942708659677963045... | 8356795650354981540143983036404099727469169323... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 6887763359356051501250786368179416002546089276... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 1843676371410185153127045889751246799596599688... | 1360659012191643202926955576358558507267719450... | 6586455234700639270600574409018427069812517107... | 6486692424528110891178009898447355217786763806... |
| 4 | 1071018365514043172837255808126169933999636232... | 1114326782808393365232341598419425129588183764... | 1152869418489848773624024499307078600209855856... | 4132947177542828096187307334685875194356396400... | 1064920995823064904534557796589321431559378713... | 9661192392824024868844046871643226699778270959... | 4015104045131944362036482432099818132485504687... | 9661192392824024868844046871643226699778270959... | 9639508933775287426333904353415401342534309318... | 2806674585915850563367143233752860736861333529... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 9112657378520514729919687040798975455706196413... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 1604774408038238001616894894981213396335227855... | 3899256164169218318470815743768918876191092915... | 6586455234700639270600574409018427069812517107... | 4132947177542828096187307334685875194356396400... |
| 5 | 2496563627104139277169365876460382559038435572... | 6454403102432039287053110166990066106002084798... | 3651982899382164801461990073630033283287844954... | 3463588876646302938116912807130962241816660749... | 8423106688771213837410693638542511504917139477... | 9218923082173177104284294915100902987111843609... | 9639508933775287426333904353415401342534309318... | 9218923082173177104284294915100902987111843609... | 7141540202189356383931936714304556824551311787... | 8890606647136029945955242371960322683693604096... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 1908582433149376928682468664470283852831797634... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 2538568968468527516589094882186601211166153015... | 6943853430047425091849724175437711841508113243... | 6586455234700639270600574409018427069812517107... | 3463588876646302938116912807130962241816660749... |
| 6 | 3391740647349871678463249016707737099073381208... | 6510695797218171340179950464347791190367619038... | 1356721171139435678139640805919245873905005893... | 4452033968511225499961759392516478615400951249... | 1129346035519486222072149185290677775961657288... | 3204819792041604548297263397723071743723729083... | 8309389957186075165092560624942708659677963045... | 3204819792041604548297263397723071743723729083... | 3670113380018528025358701623221752071603559479... | 8890606647136029945955242371960322683693604096... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 3274476228479190890379417191182101453744803451... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 1094476508019270150735424240679754909398049478... | 7869173949783705573601297745130574015741517693... | 6586455234700639270600574409018427069812517107... | 4452033968511225499961759392516478615400951249... |
| 7 | 2896386555688245340973815017167035869033084574... | 3740050141912127749762587322711714228295478755... | 3313168774184323914800263778513565031691012483... | 9615368487897068366175598029056652005663279036... | 4094676365399055458422893148292763515937224229... | 5935339395422512591922443135064481007877189544... | 6557095579074420993106375022262607033247559721... | 5935339395422512591922443135064481007877189544... | 3408518341937466763398361241378156279395974866... | 1082895589924779261153191963616926382817514424... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 8113856301539894102590427744572848805829834382... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 2146881114898945340989815055433395842846525431... | 9068029800339798913586060103079508368927777089... | 6586455234700639270600574409018427069812517107... | 9615368487897068366175598029056652005663279036... |
| 8 | 5662691832183266586601746727091876475468238588... | 8793299419530868389766942220344102210611327904... | 7177501832337986028722142167758486731529362980... | 6984995904642672233742553695146030160722109026... | 5840638335963406219321396630626013263555840849... | 5456143119833488579746085901041402303164055840... | 9345861891868189917479994948603472140492772331... | 5456143119833488579746085901041402303164055840... | 2427484933034463179650951670100484315564935015... | 3709477872779721260616925464012062405089662830... | ... | 3408518341937466763398361241378156279395974866... | 9639508933775287426333904353415401342534309318... | 6586455234700639270600574409018427069812517107... | 2190823024237635695351018496824155996488847686... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 3177436389617842158951839862673130634419658207... | 6539377457810780674641823567689985700724229582... | 6586455234700639270600574409018427069812517107... | 6984995904642672233742553695146030160722109026... |
| 9 | 1035600337502627784175275552158822330557269199... | 4735428363639079114833199783652315610058912369... | 1027512345073784590551214333221150318978954428... | 4293738925975353113301284603696259532686371483... | 4747762857438858189525102125328485908535398898... | 6720182719191230318238434100847481789129316947... | 4015104045131944362036482432099818132485504687... | 6720182719191230318238434100847481789129316947... | 3670113380018528025358701623221752071603559479... | 3598231929181579401121477957565026611222369701... | ... | 3408518341937466763398361241378156279395974866... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 1842647058941403076514827134355086252610370362... | 6586455234700639270600574409018427069812517107... | 6586455234700639270600574409018427069812517107... | 4193877429411831103009386854891233981512646764... | 2793946552859464032316271621194932311142742246... | 6586455234700639270600574409018427069812517107... | 4293738925975353113301284603696259532686371483... |


### Privately Search

Finally, we run the search to find our `query_record` within our `search_space`. See the documentation for `match` for a more detailed walkthrough.

**Note:** We could pass an API key to `match` to scale the query using the Cleanroom compute cluster to improve match speed and accuracy. [Get in touch with us](mailto:admin@proofzero.io) for an API key.

```python
results = p0.match(query_record, search_space)
results
```

| Unnamed: 0 | acct_num | name | address | phone | SIN | DOB | DOB_day | DOB_iso_format | DOB_month | DOB_year | ... | phone_area_code | phone_central_office | phone_country_code_source | phone_extension | phone_full_phone | phone_italian_zero | phone_line_number | phone_national_number | phone_raw | _match |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | 3118710362407207060820519817066124611602675044... | 4418369248196592421047499630841732388346555327... | 9594784023836219556184433361991357495783756657... | 5103122257192237244690193744677031637888782293... | 1010505653944236458406177846488691380962181960... | 5141391390246777661242717464678167327763772345... | 9930297093447128312648856684579760333304390435... | 7644637171458321163635300620903733517024394144... | 6275815013823685249316828430045322558920567663... | 1020423811237369131177406386728771151256212182... | ... | 1535860380141259287318590160938114468677934395... | 6055181033144981996992600968534808922389255124... | 1950749364813865911797004310287114589195880318... | 6586455234700639270600574409018427069812517107... | 7341275782727882540070624817464065162426234694... | 6586455234700639270600574409018427069812517107... | 5598189335674161013176229442725892028281044088... | 2209634314611670011518640833334290179954075682... | 5103122257192237244690193744677031637888782293... | 0.692308 |
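
One simple way to think about a score like `_match` is as the share of hashed columns on which two records agree. The sketch below illustrates that idea only. It is an assumption, not the SDK's scoring method: `match_score` and `best_match` are hypothetical helpers, and small integers stand in for the long hashed values.

```python
def match_score(query: dict, candidate: dict) -> float:
    """Fraction of the query's hashed columns that the candidate matches exactly."""
    if not query:
        return 0.0
    hits = sum(
        1 for column, value in query.items() if candidate.get(column) == value
    )
    return hits / len(query)


def best_match(query: dict, search_space: list) -> tuple:
    """Return (row_position, score) for the highest-scoring row."""
    scores = [match_score(query, row) for row in search_space]
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]


# Toy hashed values stand in for the long integers in the SDK output.
query = {"name": 11, "address": 22, "phone": 33, "SIN": 44}
search_space = [
    {"name": 90, "address": 91, "phone": 92, "SIN": 93},
    {"name": 11, "address": 22, "phone": 99, "SIN": 44},  # phone differs
]

position, score = best_match(query, search_space)
print(position, score)  # 1 0.75
```

Because the score is a fraction rather than an all-or-nothing test, a record can still rank first when some of its fields disagree. Note that `best_match` as written expects a non-empty search space.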


### Map Our Results

Now we map the result of our private query back to our internal data. Here is our original plaintext record:

```python
query_data
```

| Unnamed: 0 | acct_num | name | address | phone | SIN | DOB |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | PXCG66212484637575 | Kristin Sanchez MD | 31417 Gina Lodge, Bradleytown, MB P7G 4N5 | 1 (598) 742-6794 | 203 268 552 | 1943/04/15 |


And here is the result of our private search:

```python
search_data.iloc[[results.index[0]]]
```

| Unnamed: 0 | acct_num | name | address | phone | SIN | DOB |
| --- | --- | --- | --- | --- | --- | --- |
| 2 | CUVC30129179791293 | Kristin Sanchez MD | 31417 Gina Lodge, Bradleytown, MB P7G 4N5 | 742 598 6794 | 203 268 552 | 1943/04/15 |


**Note:** Even though the search ran on cryptographically hashed data, it found the matching record despite a miskeyed phone number and differing internal primary keys (`acct_num`). The `_match` value returned for this row was 0.692308.
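
To see why splitting a field into tokens can help with a miskeyed value, compare the two phone numbers from the records above as sets of digit groups. This is an illustrative sketch only: `phone_tokens` and `jaccard` are hypothetical helpers, and whether the SDK compares phone tokens this way is an assumption. Plaintext digits are used for readability; the same set comparison works on hashed tokens, since equal tokens hash to equal values.

```python
import re


def phone_tokens(raw: str) -> set:
    """Split a phone number into its digit groups, ignoring punctuation."""
    return set(re.findall(r"\d+", raw))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if a | b else 0.0


query_phone = phone_tokens("1 (598) 742-6794")  # {'1', '598', '742', '6794'}
found_phone = phone_tokens("742 598 6794")      # {'598', '742', '6794'}

print(jaccard(query_phone, found_phone))  # 0.75
```

The raw strings differ, so a hash of the whole field would not match, but three of the four digit groups are shared, which is enough for a token-level comparison to register strong overlap.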