## Ingest
Load the source .csv files into the system.

In [4]:
import os
from ingest import load_file
os.listdir('assets/data')

['dws_wages.csv',
 'usbe_students.csv',
 'ushe_students.csv',
 'ustc_students.csv']

In [5]:
load_file('assets/data/dws_wages.csv', 'dws_wages')
load_file('assets/data/usbe_students.csv', 'usbe_students')
load_file('assets/data/ustc_students.csv', 'ustc_students')
load_file('assets/data/ushe_students.csv', 'ushe_students')

## Linking Steps

Process of Record Linkage

0. Load
1. Preprocess
2. Index
3. Compare
4. Classify
5. Evaluate
6. Update (Post MPI - Adding new information, new MPI, to MPI Master Record)

### Master Person Architectures
Inspect how raw MPI records are stored.  NOTE: data may not be present in every database option.  Use whichever cell matches the configured loading behavior.

**SQL** View

In [1]:
from db import get_session
import pandas as pd 


with get_session() as session:
    result = session.execute(
        'SELECT * FROM mpi_vectors LIMIT 5'
    ).fetchall()
    count_mpi = session.execute(
        'SELECT COUNT(*) FROM mpi_vectors'
    ).fetchone()
pd.DataFrame(result)

Unnamed: 0,0,1,2,3
0,1,,elvis,11
1,1,presley,elvis,12
2,1,costello,elvis,13
3,2,austin,jane,11
4,2,austin,janet,12


In [9]:
print('Total MPI in system: ', str(count_mpi[0]))

Total MPI in system:  5


**NoSQL** View

In [2]:
from db import get_mongo
import json

db = get_mongo()
count_docs = db.raw.count_documents({})
x = db.raw.find_one({})
x['_id'] = str(x['_id'])

print(json.dumps(x, indent=2))

{
  "_id": "5fd19de3056d4354ab1f9b45",
  "mpi": 1,
  "sources": [
    {
      "guid": 11,
      "score": 0.0,
      "fields": [
        {
          "fieldname": "last_name",
          "value": null
        },
        {
          "fieldname": "first_name",
          "value": "elvis"
        }
      ]
    },
    {
      "guid": 12,
      "score": 0.0,
      "fields": [
        {
          "fieldname": "last_name",
          "value": "presley"
        },
        {
          "fieldname": "first_name",
          "value": "elvis"
        }
      ]
    },
    {
      "guid": 13,
      "score": 0.0,
      "fields": [
        {
          "fieldname": "last_name",
          "value": "costello"
        },
        {
          "fieldname": "first_name",
          "value": "elvis"
        }
      ]
    }
  ]
}
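
The SQL and NoSQL views appear to hold the same records in two shapes: one nested document per MPI versus one flat row per source record. The following is a minimal sketch of flattening a document like the one above into rows like those in the SQL view. It assumes the unnamed SQL columns are (mpi, last_name, first_name, guid); `flatten_mpi_document` is a hypothetical helper, not part of the `db` module.

```python
import json

# Sample document copied from the NoSQL view (trimmed to two sources).
doc = json.loads("""
{
  "mpi": 1,
  "sources": [
    {"guid": 11, "score": 0.0, "fields": [
      {"fieldname": "last_name", "value": null},
      {"fieldname": "first_name", "value": "elvis"}]},
    {"guid": 12, "score": 0.0, "fields": [
      {"fieldname": "last_name", "value": "presley"},
      {"fieldname": "first_name", "value": "elvis"}]}
  ]
}
""")


def flatten_mpi_document(document):
    """Hypothetical helper: one (mpi, last_name, first_name, guid) row per source."""
    rows = []
    for source in document["sources"]:
        # Collapse the list of {fieldname, value} pairs into a plain dict.
        fields = {f["fieldname"]: f["value"] for f in source["fields"]}
        rows.append((
            document["mpi"],
            fields.get("last_name"),
            fields.get("first_name"),
            source["guid"],
        ))
    return rows


rows = flatten_mpi_document(doc)
for row in rows:
    print(row)
```

Each source inside the document becomes one row, with the MPI repeated on every row.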


In [3]:
# Delete all records (skip if not needed)
d = db.raw.delete_many({})
d.deleted_count

1

In [50]:
print('Total MPI in system: ', count_docs)

Total MPI in system:  0


## Prepare Data

Prepare the identity view (MPI vectors) and the data view (distinct mapped columns from the source).

In [4]:
source_tablename = 'usbe_students'
# source_tablename = 'ustc_students'
# source_tablename = 'ushe_students'
# source_tablename = 'dws_wages'

In [5]:
from mpi.prepare import create_data_view, create_identity_view

raw, subset = create_data_view(source_tablename)
dview = subset.drop_duplicates()
iview = create_identity_view(mapped_columns=dview.columns.to_list())

### Performance Option / Seeding
Check here whether a match is possible.  If no match is possible on the available fields, the linkage process can be skipped and the MPIs generated here.

In [6]:
# Check for match availability.  If not, halt process and create MPIs
from mpi.link import is_match_available
from mpi.update import generate_mpi, write_mpi_data, gen_mpi_insert
from mpi.update import update_mpi_vector_table
from mpi.preprocess import clean_raw


if is_match_available(dview, iview):
    print('Match available.  Proceed with linking process.')
else:
    print('Match unavailable.  Generated MPIs for data view.')
    temp = generate_mpi(
        clean_raw(dview)
    )
    write_mpi_data(gen_mpi_insert(temp))
    update_mpi_vector_table()
    
    # Recreate a view from the MPI table with valid identity data
    iview = create_identity_view(mapped_columns=dview.columns.tolist())
    

Match unavailable.  Generated MPIs for data view.
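
The internals of `is_match_available` are not shown in this notebook. As a rough illustration only, one plausible rule is that linkage needs at least one existing identity and at least one field present in both views. `match_possible` below is a hypothetical stand-in for that idea, not the actual implementation.

```python
def match_possible(data_columns, identity_columns, identity_row_count):
    """Hypothetical rule: linkage needs identities to compare against
    and at least one field present on both sides."""
    shared = set(data_columns) & set(identity_columns)
    return identity_row_count > 0 and len(shared) > 0


data_cols = ["first_name_pool", "last_name_pool", "ssid_pool", "guid"]

# Case 1: empty identity store, so every row would simply get a new MPI.
empty_store = match_possible(data_cols, ["first_name_pool", "mpi"], 0)

# Case 2: populated identity store with an overlapping field.
overlap = match_possible(data_cols, ["first_name_pool", "mpi"], 250)

# Case 3: populated identity store, but no field in common.
no_overlap = match_possible(data_cols, ["dws_id_pool", "mpi"], 250)

print(empty_store, overlap, no_overlap)
```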


In [7]:
raw.head(1)

Unnamed: 0,BIRTH_DATE,FIRST_ENTERED_US,FIRST_NAME,GENDER,LAST_NAME,MIDDLE_NAME,SSID,STUDENT_ID,id,guid
0,1/3/1856,17-FEB-09 12.00.00.000000000 AM,Jasmyn Rei,n,Zitzman,Adrianne,0x33e5fc3c0x67cbf878,529342.0,74093,1936218537643632373


In [8]:
dview.head(1)

Unnamed: 0,birth_date_pool,first_name_pool,gender_pool,last_name_pool,middle_name_pool,ssid_pool,usbe_student_id_pool,guid
0,1/3/1856,Jasmyn Rei,n,Zitzman,Adrianne,0x33e5fc3c0x67cbf878,529342.0,1936218537643632373


In [9]:
iview.head(1)

Unnamed: 0,last_name_pool,ssid_pool,middle_name_pool,usbe_student_id_pool,birth_date_pool,gender_pool,first_name_pool,freq_score,mpi
0,zitzman,0x33e5fc3c0x67cbf878,adrianne,529342.0,1/3/1856,n,jasmyn rei,1.0,3249281-15634576-7762471-3383175


In [10]:
len(iview)

10000

## Building Record Linkage and MPI Classification

In [11]:
from mpi.preprocess import clean_raw, match_dtype

### Preprocessing

Standardize data across data and identity views.

In [12]:
# Match Dtypes - Align data types prior to cleaning.
#    This helps the cleaner by segmenting string/object and numeric fields

# Cast columns to matching datatypes for comparisons later on
source_data, id_data = match_dtype(dview, iview)  

# Clean data and re-index comparison.
subset = clean_raw(subset)
source_data = clean_raw(source_data)
id_data = clean_raw(id_data)
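
The outputs in this notebook show name fields lowercased after cleaning (`Jasmyn Rei` becomes `jasmyn rei`). The full behavior of `clean_raw` is not shown, so the following is only a sketch of the kind of per-value normalization such a step might apply. `clean_value` is a hypothetical helper, and the whitespace and empty-string handling are assumptions.

```python
import re


def clean_value(value):
    """Hypothetical cleaner: normalize a single cell for comparison."""
    if value is None:
        return None
    if isinstance(value, str):
        # Trim, collapse internal whitespace, and lowercase.
        text = re.sub(r"\s+", " ", value.strip()).lower()
        return text or None  # treat empty strings as missing
    return value  # leave numeric values untouched


record = {
    "first_name_pool": "  Jasmyn  Rei ",
    "last_name_pool": "Zitzman",
    "usbe_student_id_pool": 529342.0,
}
cleaned = {key: clean_value(val) for key, val in record.items()}
print(cleaned)
```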

In [13]:
source_data.head(1)

Unnamed: 0,birth_date_pool,first_name_pool,gender_pool,last_name_pool,middle_name_pool,ssid_pool,usbe_student_id_pool,guid
0,1/3/1856,jasmyn rei,n,zitzman,adrianne,0x33e5fc3c0x67cbf878,529342.0,1936218537643632373


In [14]:
id_data.head(1)

Unnamed: 0,last_name_pool,ssid_pool,middle_name_pool,usbe_student_id_pool,birth_date_pool,gender_pool,first_name_pool,freq_score,mpi
0,zitzman,0x33e5fc3c0x67cbf878,adrianne,529342.0,1/3/1856,n,jasmyn rei,1.0,3249281-15634576-7762471-3383175


## Indexing

Make record pairs: pair each row needing a match with its potential identity candidates.

Indexing serves two purposes:

1. Create the list of pairs to check (candidate link).  Example: row 1 from table 1 to row 199 from table 2.

2. Reduce the potential number of pairs to check (candidates).
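
To make the two purposes concrete, here is a small standard-library sketch of blocking on a single key, compared with the full cross join. The toy rows and key values are invented for illustration and are not drawn from the source tables.

```python
from collections import defaultdict
from itertools import product

# Toy data: (row_id, blocking_key) for the data view and the identity view.
data_rows = [(0, "a1"), (1, "b2"), (2, "c3")]
identity_rows = [(0, "a1"), (1, "zz"), (2, "b2"), (3, "b2")]

# Full index: every data row paired with every identity row.
full_pairs = list(product(
    [row_id for row_id, _ in data_rows],
    [row_id for row_id, _ in identity_rows],
))

# Blocked index: only pair rows that share the same blocking key.
identity_by_key = defaultdict(list)
for row_id, key in identity_rows:
    identity_by_key[key].append(row_id)

blocked_pairs = [
    (data_id, identity_id)
    for data_id, key in data_rows
    for identity_id in identity_by_key.get(key, [])
]

print(len(full_pairs), len(blocked_pairs))
print(blocked_pairs)
```

Blocking is cheap because it is a dictionary lookup per row, but it misses any true match whose key differs by even one character, which is why fuzzier rules are usually combined with it.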

In [15]:
from mpi.index import build_indexer
from utils import match_dataframe_columns

In [16]:
# Create indexer on dataview
#    Indexer is a set of rules to generate 
#    candidate matches from data -> identities

source_matched, id_matched = match_dataframe_columns(source_data, id_data)

indexer = build_indexer(source_matched)

# Check index algorithms (generated from data view columns)
indexer.algorithms

[<SortedNeighbourhood left_on='last_name_pool', right_on='last_name_pool'>,
 <SortedNeighbourhood left_on='middle_name_pool', right_on='middle_name_pool'>,
 <SortedNeighbourhood left_on='first_name_pool', right_on='first_name_pool'>,
 <Block left_on='ssid_pool', right_on='ssid_pool'>,
 <Block left_on='usbe_student_id_pool', right_on='usbe_student_id_pool'>]
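
The name fields use a sorted-neighbourhood rule rather than exact blocking. The idea, sketched below under simplifying assumptions, is to sort the distinct key values from both sides and pair records whose values sit within a small window of each other in that ordering. `sorted_neighbourhood_pairs` is a hypothetical illustration; it uses a nested loop for readability, which a real indexer would avoid.

```python
def sorted_neighbourhood_pairs(left, right, window=3):
    """Pair (row_id, key) records whose keys are near each other when sorted.

    window is the total neighbourhood size; window=3 allows a rank
    difference of at most 1 between the two keys.
    """
    # Rank every distinct key from both sides in one sorted ordering.
    keys = sorted({key for _, key in left} | {key for _, key in right})
    rank = {key: position for position, key in enumerate(keys)}
    half = window // 2

    pairs = []
    for left_id, left_key in left:
        for right_id, right_key in right:
            if abs(rank[left_key] - rank[right_key]) <= half:
                pairs.append((left_id, right_id))
    return pairs


data_names = [(0, "smith"), (1, "jones")]
identity_names = [(0, "smyth"), (1, "smith"), (2, "adams")]

sn_pairs = sorted_neighbourhood_pairs(data_names, identity_names)
print(sn_pairs)
```

Unlike exact blocking, this pairs `smith` with `smyth`, at the cost of also pairing unrelated neighbours such as `jones` and `adams`.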

In [17]:
# Run indexer on dataview, identity view
candidates = indexer.index(source_matched, id_matched)

# Full indexing is a cross join of data and all possible identities.

# Demonstrating full indexing size:
print('Full Index Length: ', len(source_data) * len(id_data))

# Examine the multi-index.  The left level is the data view index; the right is the identity view index.
print('Algorithmic Index Length: ', len(candidates))

# Estimate Savings
print('Savings: ', (1- len(candidates)/(len(source_data) * len(id_data))) * 100)

# Preview indices:
for pair in candidates[0:5]:
    print(f'Data-row {pair[0]}', f'ID-row {pair[1]}')

Full Index Length:  100000000
Algorithmic Index Length:  94924
Savings:  99.90507600000001
Data-row 0 ID-row 0
Data-row 0 ID-row 2360
Data-row 0 ID-row 2473
Data-row 0 ID-row 2568
Data-row 0 ID-row 4017


## Comparing

Indexing does not normally store the outcome of its findings.  Indexing algorithms are meant to be fast, so they can be error-prone.  Comparison algorithms do the more careful scoring and can be tuned for string (many options), numeric, and time/date fields.

The output of comparison is a clean feature matrix for the classifier to train/predict on.

In [18]:
from mpi.compare import build_comparator

In [19]:
# Create comparator on dataview
#    Comparator is a set of algorithms for each feature to be compared.
#    These are generally much more expensive than indexing functions
cmp = build_comparator(source_matched)

# Check comparison algorithms and fields
cmp.features

[<Exact 'ssid_pool'>,
 <Numeric 'usbe_student_id_pool'>,
 <String 'last_name_pool'>,
 <String 'middle_name_pool'>,
 <String 'gender_pool'>,
 <String 'first_name_pool'>]

In [20]:
# Compute comparisons
#    Gives clean match dataset for classification
comparisons = cmp.compute(candidates, source_data, id_data)
comparisons.head()

Unnamed: 0,Unnamed: 1,ssid_pool,usbe_student_id_pool,last_name_pool,middle_name_pool,gender_pool,first_name_pool
0,0,1,1.0,1.0,1.0,1.0,1.0
0,2360,0,0.0,0.0,0.0,0.0,0.0
0,2473,0,0.0,0.0,0.0,0.0,0.0
0,2568,0,0.0,0.0,1.0,0.0,0.0
0,4017,0,0.0,0.0,0.0,1.0,0.0
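
Each row of the table above is a comparison vector for one candidate pair. The sketch below shows how such a vector might be built from an exact rule and a thresholded string rule. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure; the actual comparator may use a different string metric and threshold, and the second identity row here is invented for illustration.

```python
from difflib import SequenceMatcher


def exact(a, b):
    """1 if both values are present and identical, else 0."""
    return 1 if a is not None and a == b else 0


def string_match(a, b, threshold=0.85):
    """1.0 if the similarity ratio reaches the (assumed) threshold, else 0.0."""
    if a is None or b is None:
        return 0.0
    ratio = SequenceMatcher(None, a, b).ratio()
    return 1.0 if ratio >= threshold else 0.0


def compare(data_row, identity_row):
    """Build one comparison vector for a candidate pair."""
    return {
        "ssid_pool": exact(data_row["ssid_pool"], identity_row["ssid_pool"]),
        "last_name_pool": string_match(
            data_row["last_name_pool"], identity_row["last_name_pool"]),
        "first_name_pool": string_match(
            data_row["first_name_pool"], identity_row["first_name_pool"]),
    }


data_row = {"ssid_pool": "0x33e5", "last_name_pool": "zitzman",
            "first_name_pool": "jasmyn rei"}
same_person = dict(data_row)
near_miss = {"ssid_pool": "0x9999", "last_name_pool": "zitzmann",
             "first_name_pool": "carl"}

vector_same = compare(data_row, same_person)
vector_near = compare(data_row, near_miss)
print(vector_same)
print(vector_near)
```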


## Classification

Score candidates for match.  

#### Two approaches: Supervised vs Unsupervised
 * **Supervised** approach requires a training set.
 * **Unsupervised** does not require a training set and operates only on the comparison table itself.
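
As an illustration of the supervised approach, the sketch below fits a tiny logistic regression to labeled comparison vectors with plain gradient descent, using only the standard library. The training vectors, labels, learning rate, and epoch count are invented for the example, and the code is not the classifier built by `build_classifier`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def match_probability(weights, bias, vector):
    """Probability that a comparison vector represents a true match."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, vector)))


def train_logistic(vectors, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(vectors[0])
    bias = 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for vector, label in zip(vectors, labels):
            error = match_probability(weights, bias, vector) - label
            for j, x in enumerate(vector):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Toy training set: comparison vectors with known match (1) / non-match (0) labels.
train_vectors = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
train_labels = [1, 1, 0, 0, 0]

weights, bias = train_logistic(train_vectors, train_labels)
p_match = match_probability(weights, bias, [1, 1, 1, 1, 1, 1])
p_nonmatch = match_probability(weights, bias, [0, 0, 0, 1, 0, 0])
print(p_match, p_nonmatch)
```

An unsupervised method would skip the labels entirely and infer the match and non-match groups from the distribution of comparison vectors alone.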

In [21]:
from mpi.classify import estimate_true, build_classifier

# Get estimated true linkages for supervised model
links_true = estimate_true(comparisons)

# Create classifier
clf = build_classifier('logistic', comparisons, match_index=links_true)

# Check probabilities (score) of each comparison -- NOT IN USE IN THIS VERSION
predictions = clf.prob(comparison_vectors=comparisons)
predictions

0     0       9.993244e-01
      2360    8.937190e-07
      2473    8.937190e-07
      2568    2.973369e-05
      4017    3.979535e-06
                  ...     
9999  4003    8.937190e-07
      4326    2.973369e-05
      5186    2.621460e-05
      6481    3.868693e-03
      9999    9.993244e-01
Length: 94924, dtype: float64

## Evaluate
Express classification quality and explore outliers

In [22]:
from recordlinkage import reduction_ratio
from recordlinkage import confusion_matrix

links_pred = clf.predict(comparison_vectors=comparisons)

rratio = reduction_ratio(links_pred, source_data)
cmatrix = confusion_matrix(links_true, links_pred, candidates)

In [23]:
# Review confusion matrix
#    TP-FN
#    |  |
#    FP-TN
print(cmatrix)

# Review reduction ratio
print(rratio)

[[ 9998     0]
 [    2 84924]]
0.9997999799979999


The confusion matrix may not be particularly useful here, because the generation of true links is itself prone to error. The reduction ratio is more sensitive than binary predictions in this case.
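
The printed numbers can be unpacked with a little arithmetic. The sketch below derives precision and recall from the confusion matrix values shown above, and reproduces the reduction ratio under the assumption that the total is counted as the n(n-1)/2 distinct pairs within a single 10,000-row table. That assumption is inferred from the printed value rather than stated in this notebook.

```python
# Values copied from the printed confusion matrix: [[TP, FN], [FP, TN]].
tp, fn = 9998, 0
fp, tn = 2, 84924

precision = tp / (tp + fp)  # share of predicted links that are true links
recall = tp / (tp + fn)     # share of true links that were predicted

# Reduction ratio: 1 - (pairs kept) / (all possible pairs).
n_records = 10000
n_links = tp + fp                                # predicted links
total_pairs = n_records * (n_records - 1) // 2   # assumed single-table total
reduction = 1 - n_links / total_pairs

print(precision, recall)
print(reduction)
```

Because the "true" links here are themselves estimated, the precision and recall describe agreement with that estimate rather than accuracy against ground truth.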

In [24]:
# Review findings
#   Interesting that the logistic model predicted an MPI index for each row, given an incomplete target list.

# Is the relationship one-to-one?
def split_list(pairs):
    """Split (data index, identity index) pairs into two parallel lists."""
    return [ix[0] for ix in pairs], [ix[1] for ix in pairs]


i1, i2 = split_list(links_pred)
len(set(i1)), len(set(i2))

(10000, 10000)

## Update

Append matched MPIs and match scores to the data view, then merge back to the original data.

In [25]:
from mpi.update import expand_match_to_raw

# Join data view (DISTINCT identities in source table), now containing matched and generated MPIs, to raw table.
#    This can be done a few ways.  Here, the data view (whose columns have been renamed and processed)
#    is joined to the original subset (whose columns were just renamed).  The subset is then indexed back
#    onto the raw table so original column names and source formatting are preserved.


updated, matched, unmatched = expand_match_to_raw(raw, subset, source_data, id_data, links_pred)
updated.head(1)

Unnamed: 0,BIRTH_DATE,FIRST_ENTERED_US,FIRST_NAME,GENDER,LAST_NAME,MIDDLE_NAME,SSID,STUDENT_ID,id,guid,mpi
0,1/3/1856,17-FEB-09 12.00.00.000000000 AM,Jasmyn Rei,n,Zitzman,Adrianne,0x33e5fc3c0x67cbf878,529342.0,74093,1936218537643632373,3249281-15634576-7762471-3383175


In [26]:
# Update the MPI Vectors table for future use
from mpi.update import update_mpi_vector_table
update_mpi_vector_table()

### De-Identification

Create the de-identified table while the match is still available in memory or as a referenced temp table.

In [27]:
from assets.mapping import colmap
from db import dataframe_to_db
from di import simple_di

dataframe_to_db(
    simple_di(updated), 
    tablename=source_tablename + '_di'
)

'usbe_students_di'

## Flag MPI

Rule 1:  MPI contains disagreement in blocking identifiers (local_id, ssn, ssid)

Rule 2:  Blocking identifier shared between multiple MPIs
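
No implementation of these rules appears in this notebook, so the following is only a sketch of how the two flags could be computed for one blocking identifier. The records and the `flag_mpis` helper are hypothetical.

```python
from collections import defaultdict

# Toy records: one row per source record with its assigned MPI.
records = [
    {"mpi": "A", "ssid": "s1"},
    {"mpi": "A", "ssid": "s2"},  # same MPI, different ssid -> Rule 1
    {"mpi": "B", "ssid": "s3"},
    {"mpi": "C", "ssid": "s3"},  # same ssid, different MPIs -> Rule 2
    {"mpi": "D", "ssid": "s4"},
]


def flag_mpis(rows, identifier):
    """Return (rule1_mpis, rule2_mpis) for a single blocking identifier."""
    values_by_mpi = defaultdict(set)
    mpis_by_value = defaultdict(set)
    for row in rows:
        value = row.get(identifier)
        if value is None:
            continue  # missing identifiers cannot agree or disagree
        values_by_mpi[row["mpi"]].add(value)
        mpis_by_value[value].add(row["mpi"])

    # Rule 1: an MPI whose records disagree on the identifier.
    rule1 = {mpi for mpi, values in values_by_mpi.items() if len(values) > 1}

    # Rule 2: an identifier value shared by more than one MPI.
    rule2 = set()
    for mpis in mpis_by_value.values():
        if len(mpis) > 1:
            rule2 |= mpis
    return rule1, rule2


rule1_flags, rule2_flags = flag_mpis(records, "ssid")
print(sorted(rule1_flags), sorted(rule2_flags))
```

Running the same check once per blocking identifier (local_id, ssn, ssid) and taking the union of the results would cover all three fields named in Rule 1.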