# Intermediate/advanced data manipulation in a data linkage context

This notebook is intended to be a demo of what you *could* use dlh_utils for.

Much of this code will probably look familiar to many of you, since we have all taken similar approaches to the vast majority of the problems faced here. We've just wrapped these mostly standard approaches up into reusable functions, hopefully saving everyone doing linkage both time and headaches!

In [7]:
# To start, install dlh_utils if not installed already. Notice the '-U' argument to upgrade existing installations. 
!pip3 install -U 'dlh_utils'

Looking in indexes: http://sccm_functional:****@art-p-01/artifactory/api/pypi/yr-python/simple


In [8]:
# Import necessary libraries
import pyspark.sql.functions as F
import pandas as pd

from dlh_utils import utilities
from dlh_utils import dataframes
from dlh_utils import linkage
from dlh_utils import standardisation
from dlh_utils import sessions
from dlh_utils import profiling
from dlh_utils import flags

In [9]:
# You can use our sessions module to set up your spark session
# This will also create a Spark UI, which you can use to track your code's efficiency
spark = sessions.getOrCreateSparkSession(appName = 'dlh_utils_demo', size = 'medium')

In [10]:
# Read in raw data
census = pd.read_csv("/home/cdsw/dlh_utils_demo/census_residents.csv")
ccs = pd.read_csv("/home/cdsw/dlh_utils_demo/ccs_perturbed.csv")

# note, if this was stored in Hue, the read_format() function from the utilities module would've been useful

# For demo purposes, let's convert this to a spark df using utilities
census = utilities.pandas_to_spark(census)
ccs = utilities.pandas_to_spark(ccs)

# Profiling

To give a quick overview of the features of our data, we can use the **df_describe()** function from the profiling module:

In [12]:
# To see one of our function docstrings, we can put the module and function name, followed by a ?

profiling.df_describe?

In [5]:
descriptive_census = profiling.df_describe(census,
                                           output_mode = 'pandas',
                                           approx_distinct = False,
                                           rsd = 0.05
                                           )
descriptive_census

Unnamed: 0,variable,type,row_count,distinct,percent_distinct,null,percent_null,not_null,percent_not_null,empty,percent_empty,min,max,min_l,max_l,max_l_before_point,min_l_before_point,max_l_after_point,min_l_after_point
0,Address,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,19,51,,,,
1,ENUM_FNAME,string,100001,1092,1.091989,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
2,ENUM_SNAME,string,100001,1493,1.492985,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
3,ID,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,20,20,,,,
4,Marital_Status,string,100001,6,0.006,0,0.0,100001,100.0,13386,13.385866,,,1,17,,,,
5,Postcode,string,100001,99457,99.456005,0,0.0,100001,100.0,0,0.0,,,6,8,,,,
6,Sex,string,100001,10,0.01,0,0.0,100001,100.0,5636,5.635944,,,1,6,,,,
7,Resident_Day_Of_Birth,bigint,100001,31,0.031,0,0.0,100001,100.0,0,0.0,1.0,31.0,1,2,,,,
8,Resident_Month_Of_Birth,bigint,100001,12,0.012,0,0.0,100001,100.0,0,0.0,1.0,12.0,1,2,,,,
9,Resident_Year_Of_Birth,bigint,100001,89,0.088999,0,0.0,100001,100.0,0,0.0,1934.0,2022.0,4,4,,,,


From this we can see that our Sex variable has 10 distinct values, far more than the two or so categories we would expect. This could suggest a high level of missingness, but the rest of the output shows that we don't have any null Sex values, so it is more likely that some values have been coded inconsistently.

We can also see that, whilst there are no nulls, the Sex variable contains a lot of empty values (5,636). This suggests that missing values are represented in several different ways, which we can cast to true Nones later when we standardise the data.

On bigger data, these observations can give quick insights into which variables may be the most/least useful for matching. 

The **value_counts()** function shows the top or bottom n values in our data, and is another crucial step when profiling our data. Along with cleaning and standardising our data, this is one of the most time-consuming parts of data linkage.

Value counts can give us an overview of the different types of missingness in these variables, which will be useful when we come to standardise missingness in our data later.

Whilst it is important to know and understand both sets of data, for the purpose of the demo we will only look at the CCS dataset. 

In [6]:
value_counts_ccs = profiling.value_counts(ccs,
                                         limit = 5,
                                         output_mode = 'pandas'
                                         )

# the value counts function returns two dataframes; one for the top n values in each variable and one for the bottom n values. 
# we can select the top value count dataframe by subsetting the value_counts_ccs tuple:

value_counts_ccs[0]

Unnamed: 0,Address,Address_count,FNAME,FNAME_count,SNAME,SNAME_count,Marital_Status,Marital_Status_count,Postcode,Postcode_count,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,-9,48,-9,51,-7,50,Single,287,-7,35,Female,228,-9,45,-9,43,-7,34,c1289733399550728998,1
1,"Studio 4\nSimpson glens, Lake Paul",4,Victoria,9,Smith,26,,136,CV4P 2DD,5,Male,214,1938,22,84,21,10-03-1987,4,c1406628632687313907,1
2,"Studio 4\nDiane underpass, Eleanorton",3,Mr Ryan,7,Jones,16,Divorced,129,M30 7WF,4,F,134,2005,21,17,20,03-01-2022,4,c1462481395779002923,1
3,"Flat 05F\nGarry knolls, Jamieport",3,Tracey,7,Roberts,13,Married,117,WA5M 2PS,4,M,131,2020,21,11,20,23-08-1963,4,c1610482117117913758,1
4,"Studio 5\nFuller burgs, New Lindsey",3,Glen,6,Taylor,12,Civil partnership,109,W9D 0EZ,4,-7,99,1997,20,2,20,20-10-1951,4,c1631997721075661206,1


In [7]:
# we can do the same for the bottom values: 

value_counts_ccs[1]

Unnamed: 0,Address,Address_count,FNAME,FNAME_count,SNAME,SNAME_count,Marital_Status,Marital_Status_count,Postcode,Postcode_count,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,"1 Jones centers, Cooperhaven",1,Lauren,1,ChapmKan,1,Civil partnerQship,1,BT9 8UR,1,1,45,1939,2,83,3,01-09-1972,1,c1289733399550728998,1
1,"45 Alan plains, Denisshire",1,Mr Julie,1,Der-Barton,1,Divorce?d,1,EX2Y 1BU,1,2,55,1975,5,43,4,02-06-1987,1,c1406628632687313907,1
2,"1 Murray meadows, West Jessica",1,LeslEy,1,D\er-Hunt,1,Si0ngle,1,CB1 3SS,1,,57,1979,5,46,5,02-06-1991,1,c1462481395779002923,1
3,"28 James row, New Tobystad",1,JUlie,1,Sinclair,1,siNgle,1,B7E 8YN,1,-9,63,1976,6,14,6,03-10-2006,1,c1610482117117913758,1
4,"43 Kelly mills, Dixonport",1,Lydia,1,Heath,1,CivJil partnership,1,E3 6TJ,1,NAN,70,2022,6,57,6,07-03-2020,1,c1631997721075661206,1
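
To make it clearer what a top/bottom value count is doing, here is a minimal plain-Python sketch using `collections.Counter`. It runs on a tiny made-up list, and `top_and_bottom` is a hypothetical helper for illustration only; it is not how dlh_utils implements **value_counts()**.

```python
from collections import Counter

def top_and_bottom(values, limit=5):
    """Return the `limit` most and least frequent values with their counts."""
    counts = Counter(values).most_common()  # sorted by count, highest first
    top = counts[:limit]
    # Stable sort by count, so the lowest counts come first
    bottom = sorted(counts, key=lambda pair: pair[1])[:limit]
    return top, bottom

# Made-up sample values, for illustration only
sex_values = ["Female", "Female", "Male", "F", "-7", "Female", "Male"]
top, bottom = top_and_bottom(sex_values, limit=2)
print(top)
print(bottom)
```

The bottom counts are often the more interesting ones for linkage, as this is where typos and one-off codes tend to show up.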


To flag out of scope values in our data, we can use the **flag()** function:

In [7]:
# This will flag invalid values in our data, for example by checking whether anyone in our data is aged 115 or over:

out_of_scope = flags.flag(df = census,
                          ref_col = 'Resident_Age',
                          condition = '>=',
                          condition_value = 115,
                          alias = None,
                          prefix = 'FLAG',
                          fill_null = None
                         )

out_of_scope.filter(F.col('FLAG_Resident_Age>=115')== 'true').select('FLAG_Resident_Age>=115', 'Resident_Age', 'DOB')

FLAG_Resident_Age>=115,Resident_Age,DOB
True,143,05/07/1879
True,143,05/07/1879
True,143,05/07/1879
True,176,25/09/1846
True,176,25/09/1846
True,143,05/07/1879
True,176,25/09/1846
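
The logic behind a flag like this can be sketched in plain Python. This is only an illustration under stated assumptions: `flag_value` and `CONDITIONS` are hypothetical names, and returning `fill_null` for missing values is our assumption about sensible behaviour rather than a description of the dlh_utils implementation.

```python
import operator

# Map condition strings (like the '>=' used above) onto Python comparison functions
CONDITIONS = {
    ">=": operator.ge, ">": operator.gt,
    "<=": operator.le, "<": operator.lt,
    "==": operator.eq, "!=": operator.ne,
}

def flag_value(value, condition, condition_value, fill_null=None):
    """Return True/False for the check, or `fill_null` when the value is missing."""
    if value is None:
        return fill_null  # assumption: missing values are not compared
    return CONDITIONS[condition](value, condition_value)

# Made-up ages, for illustration only
ages = [66, 143, None, 9, 176]
age_flags = [flag_value(age, ">=", 115) for age in ages]
print(age_flags)
```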


In [9]:
census

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB
Studio 48 Cooper ...,Mrs Margaret,Ross,c4064232788196233825,NAN,CV25 4ZY,,6,8,1956,66,06/08/1956
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,E2 0LP,-7,29,12,2013,9,29/12/2013
"04 Lane shores, S...",Mrs Eric,Bibi,c8205386463232611653,Single,DY01 1TR,Female,11,10,2016,6,11/10/2016
"7 Noble valley, L...",Diane,Kent,c2381984462771197706,,SO2P 9WS,Male,28,1,1972,50,28/01/1972
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,EH4P 9RN,Female,26,11,1966,56,26/11/1966
0 Jeremy mountain...,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,3,3,1963,59,03/03/1963
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,9,4,1960,62,09/04/1960
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,FY9W 4RU,F,28,11,1981,41,28/11/1981
Studio 73 Clayton...,Hazel,Der-Barton,c9188328200772085537,Married,LN96 9XA,Male,13,4,1947,75,13/04/1947
Flat 92B Ross exp...,Harriet,Chapman,c1862566591390004870,Single,S0D 9AX,M,8,11,2001,21,08/11/2001


In [10]:
ccs

Address,FNAME,SNAME,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,NAN,CV25 4Z/Y,,1956,66,06-08-1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,Single,-7,Male,2013,9,29-12-2013,c7839596180442651345
-9,Mrs Eric,Bibi,Single,PL38 0PR,Female,2016,6,09-11-2015,c3258728696626565719
"7 Noble valley, ...",Mark,Kent,,SO2P 9WS,Male,1972,50,28-01-1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,Civil partnership,eh4p 9Rn,Female,1966,56,26-11-1966,c1945351111358374057
-9,Mrs Chloe,Chandler,Divorced,G7F 3RE,2,1963,59,11-03-2012,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,Divorced,G88 6DB,Female,1960,62,15-08-1978,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,Divorced,W7- 2TY,F,1981,41,28-11-1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,-9,LN96 9XA,Male,1947,-9,29-12-2013,c1293143515607798169
Studio 33o8 Hazel...,Keith,ChapmKan,Single,S0D 9AX,M,2001,21,08-11-2001,c1864186096263678574


# Data Cleaning & Standardisation

Cleaning and standardisation can be a time-consuming process, but it is worth the effort: even with good linkage methods, messy and unstandardised data will lead to a poor match rate.

This is because, when matching data, the dataframes need to be as similar to each other as possible, which is what this step aims to achieve.

In [11]:
# Looks like there is a new line character in address - this will need to be removed
# We can replace these '\n' values with spaces:

census = standardisation.reg_replace(df = census, dic = {' ': '\n'})
ccs = standardisation.reg_replace(df = ccs, dic = {' ': '\n'})

ccs.select('Address').show(truncate = False)

+-------------------------------------------+
|Address                                    |
+-------------------------------------------+
|Studio 48 Cooper street, Port Fredericktown|
|43 Rebecca street, Harveytown              |
|-9                                         |
| 7 Noble valley, Lake Simonville           |
|57 Pearson corner, Joannaborough           |
|-9                                         |
|Studio 5 Fuller burgs, New Lindsey         |
|644 Garry walk, Blackburnville             |
|Studio 73 Clayton mountains, Stevenbury    |
|Studio 33o8 Hazel river, Adamsstad         |
|8 Grant spurs, South Philip                |
| 3 Smith mount, New Frances                |
|21 Stephen island, terrymouth              |
|flat 78 Jones Glen, marIonbuRgh            |
|-9                                         |
|Flat 2 Oliver corners, Baxterton           |
|Studio 6 Dixon bypass, New Marian          |
|824 J,effrey roads, Mathewfort             |
|Flat 22 Kennedy keys, Port Valeri

Let's standardise the date format to be consistent across our data in a **dd/MM/yyyy** format:

In [12]:
ccs = standardisation.standardise_date(ccs, col_name = "DOB", in_date_format = "dd-MM-yyyy", out_date_format = "dd/MM/yyyy")

ccs

Address,FNAME,SNAME,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,NAN,CV25 4Z/Y,,1956,66,06/08/1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,Single,-7,Male,2013,9,29/12/2013,c7839596180442651345
-9,Mrs Eric,Bibi,Single,PL38 0PR,Female,2016,6,09/11/2015,c3258728696626565719
"7 Noble valley, ...",Mark,Kent,,SO2P 9WS,Male,1972,50,28/01/1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,Civil partnership,eh4p 9Rn,Female,1966,56,26/11/1966,c1945351111358374057
-9,Mrs Chloe,Chandler,Divorced,G7F 3RE,2,1963,59,11/03/2012,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,Divorced,G88 6DB,Female,1960,62,15/08/1978,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,Divorced,W7- 2TY,F,1981,41,28/11/1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,-9,LN96 9XA,Male,1947,-9,29/12/2013,c1293143515607798169
Studio 33o8 Hazel...,Keith,ChapmKan,Single,S0D 9AX,M,2001,21,08/11/2001,c1864186096263678574
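
For anyone wanting to see the same idea outside of Spark, here is a minimal sketch using Python's `datetime`. Note that Python uses `strptime`-style directives (`%d-%m-%Y`) rather than the `dd-MM-yyyy` style patterns passed to the function above. `reformat_date` is a hypothetical helper, and returning None for values that don't parse is our assumption.

```python
from datetime import datetime

def reformat_date(value, in_format="%d-%m-%Y", out_format="%d/%m/%Y"):
    """Re-format a date string; return None if it is missing or does not parse."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, in_format).strftime(out_format)
    except ValueError:
        # e.g. a missingness code such as '-7' sitting in a date column
        return None

print(reformat_date("06-08-1956"))
```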


Next, we have an 'ID' column in census, but a 'Resident_ID' column in ccs. We also have name variables named differently in each dataset. Let's align these variable names.

We can use **rename_columns()** from the dataframes module to rename all of these at once. 

In [13]:
census = dataframes.rename_columns(census, rename_dict = {"ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})
ccs = dataframes.rename_columns(ccs, rename_dict = {"Resident_ID":"ID","FNAME":"FORENAME","SNAME":"SURNAME"})

ccs.columns

['Address',
 'FORENAME',
 'SURNAME',
 'Marital_Status',
 'Postcode',
 'Sex',
 'Resident_Year_Of_Birth',
 'Resident_Age',
 'DOB',
 'ID']

Now let's set all of our variables to upper case for consistency, using **standardise_case()**:

In [14]:
census = standardisation.standardise_case(census)
ccs = standardisation.standardise_case(ccs)

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,-9,ROSS,NAN,CV25 4Z/Y,,1956,66,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,SINGLE,-7,MALE,2013,9,29/12/2013,C7839596180442651345
-9,MRS ERIC,BIBI,SINGLE,PL38 0PR,FEMALE,2016,6,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,,SO2P 9WS,MALE,1972,50,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56,26/11/1966,C1945351111358374057
-9,MRS CHLOE,CHANDLER,DIVORCED,G7F 3RE,2,1963,59,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,DIVORCED,G88 6DB,FEMALE,1960,62,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,DIVORCED,W7- 2TY,F,1981,41,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,-9,DER-BARTON,-9,LN96 9XA,MALE,1947,-9,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D 9AX,M,2001,21,08/11/2001,C1864186096263678574


Next, the values for missingness are all over the place. I can spot a few NaNs, -7s, -9s, and whitespaces. Let's standardise missingness with the **standardise_null()** function. We can retrieve these null values from the previous **value_counts()** outputs:

In [16]:
# We can use the standardise_null function to replace these with true None values:
# We use regex to do this: https://regex101.com/ 
# Raw strings (r"...") stop Python from treating '\s' as an invalid escape sequence
census = standardisation.standardise_null(census, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")
ccs = standardisation.standardise_null(ccs, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,SINGLE,,MALE,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,SINGLE,PL38 0PR,FEMALE,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,,SO2P 9WS,MALE,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,DIVORCED,G7F 3RE,2,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,DIVORCED,G88 6DB,FEMALE,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,DIVORCED,W7- 2TY,F,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,,LN96 9XA,MALE,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D 9AX,M,2001,21.0,08/11/2001,C1864186096263678574
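
If the regex looks a bit cryptic, this plain-Python sketch shows what each alternative matches. It uses Python's `re` module rather than Spark, and `to_none` is a hypothetical helper for illustration; the anchors (`^` and `$`) are what stop a name like 'NANCY' from being nulled.

```python
import re

# Same pattern as above: each alternative must match the whole value
NULL_PATTERN = re.compile(r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")

def to_none(value):
    """Return None for recognised missingness codes, otherwise the value unchanged."""
    if value is None or NULL_PATTERN.search(value):
        return None
    return value

# Made-up values, for illustration only
samples = ["NAN", "", "   ", "-7", "-77", "NANCY", "SINGLE"]
print([to_none(value) for value in samples])
```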


Next, we have a mix of 1s, 2s, Ms, and Fs in our sex column. Let's standardise this to be either 1s or 2s. For this we can use **reg_replace()**:

In [17]:
# reg_replace() takes a dictionary, where the value is the regex to replace, and the key is what this will be replaced with
# so we're replacing 'M' or 'MALE' with '1', and 'F' or 'FEMALE' with '2':
census = standardisation.reg_replace(census, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})
ccs = standardisation.reg_replace(ccs, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,SINGLE,PL38 0PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,DIVORCED,G7F 3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,DIVORCED,G88 6DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,DIVORCED,W7- 2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,,LN96 9XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
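
The key/value direction of that dictionary is easy to get the wrong way round, so here is a small plain-Python sketch of the same recode. `recode` is a hypothetical helper using Python's `re.sub`, not the dlh_utils implementation; the dictionary is laid out the same way as above (replacement as the key, regex as the value).

```python
import re

# Replacement value as the key, regex to replace as the value
SEX_RECODE = {"1": "^M$|^MALE$", "2": "^F$|^FEMALE$"}

def recode(value, mapping):
    """Apply each regex in turn, replacing matches with the dictionary key."""
    if value is None:
        return None
    for replacement, pattern in mapping.items():
        value = re.sub(pattern, replacement, value)
    return value

# Made-up values, for illustration only
print([recode(value, SEX_RECODE) for value in ["M", "FEMALE", "2", None]])
```

Because the patterns are anchored, 'FEMALE' is not caught by the 'MALE' alternative.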


Now that our sex column is populated with just 1s and 2s, we might want to change the type from string to integer. This can be done using the **cast_type()** function in the standardisation module:

In [18]:
# Casting strings to integer can increase performance as it makes the data type smaller, taking up less storage space

census = standardisation.cast_type(census, subset = ['SEX'], types = "integer")
ccs = standardisation.cast_type(ccs, subset = ['SEX'], types = "integer")

ccs.select('SEX').dtypes

[('SEX', 'int')]

Next, let's focus on our name variables. Forenames still contain titles and some surnames have common prefixes like 'Van' or 'Der'. We can strip out titles and concatenate surname prefixes with our **clean_forename()** and **clean_surname()** functions. 

In [19]:
census = standardisation.clean_forename(census, subset = 'FORENAME')
ccs = standardisation.clean_forename(ccs, subset = 'FORENAME')

census = standardisation.clean_surname(census, subset = 'SURNAME')
ccs = standardisation.clean_surname(ccs, subset = 'SURNAME')

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL38 0PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F 3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G88 6DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W7- 2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN96 9XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Moving on, a lot of the variables contain multiple white spaces in a row. We can remove white spaces altogether using the **standardise_white_space()** function, setting white space level (wsl) to none.

However, we don't want to remove all of the white spaces in our address variable, as this would make the text unreadable. Instead, we can set wsl to one, meaning gaps of 2 spaces or more will be reduced to a single space.

In [20]:
# Using a list comprehension, we remove all white spaces from all columns except Address
census = standardisation.standardise_white_space(census, 
                                                 subset = [column for column in census.columns if column != 'Address'], 
                                                 wsl = "none")
ccs = standardisation.standardise_white_space(ccs, 
                                              subset = [column for column in ccs.columns if column != 'Address'], 
                                              wsl = "none")

# Then we allow a single white space for the Address column
census = standardisation.standardise_white_space(census, subset = 'Address', wsl = "one")
ccs = standardisation.standardise_white_space(ccs, subset = 'Address', wsl = "one")

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
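
The difference between the two white space levels can be sketched with a couple of regular expressions. This is an illustration only: `squeeze_white_space` is a hypothetical helper, and we are assuming that "one" means runs of two or more white space characters are reduced to a single space.

```python
import re

def squeeze_white_space(value, wsl="one"):
    """Remove all white space (wsl='none') or collapse long gaps to one space (wsl='one')."""
    if value is None:
        return None
    if wsl == "none":
        return re.sub(r"\s+", "", value)
    # Assumption: only gaps of 2 or more characters are changed
    return re.sub(r"\s{2,}", " ", value)

print(squeeze_white_space("CV25  4ZY", wsl="none"))
print(squeeze_white_space("7  NOBLE   VALLEY, LAKE SIMONVILLE", wsl="one"))
```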


Leading and trailing whitespaces can be hard to spot, so some of our variables might still contain them. It's therefore good practice to use **trim()** to remove these:

In [21]:
census = standardisation.trim(census)
ccs = standardisation.trim(ccs)

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Next, let's strip out numbers from our name variables. Again, we can use the **reg_replace()** function for this:

In [22]:
census = standardisation.reg_replace(census, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})
ccs = standardisation.reg_replace(ccs, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


This still leaves apostrophes and hyphens in our name variables. The **remove_punct()** function can handle these. While we're at it, let's also use **remove_punct()** to get rid of dashes in our address field, but we'll have to specify the optional argument **keep** to make sure it doesn't strip out commas from addresses:

In [23]:
# First remove all punctuation from every column except Address and DOB
census = standardisation.remove_punct(census, 
                                      subset = [column for column in census.columns if column not in ['Address','DOB']], 
                                      )

ccs = standardisation.remove_punct(ccs, 
                                   subset = [column for column in ccs.columns if column not in ['Address','DOB']]
                                  )

#Then remove the punctuation from address except for the commas
census = standardisation.remove_punct(census, subset = 'Address', keep = ',')
ccs = standardisation.remove_punct(ccs, subset = 'Address', keep = ',')

ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV254ZY,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W72TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
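
As a rough picture of what a **keep** argument does, here is a plain-Python sketch built on a negated character class. `strip_punct` is a hypothetical helper, and it assumes ASCII letters and digits only (accented characters would be stripped too), so treat it as an illustration rather than the dlh_utils behaviour.

```python
import re

def strip_punct(value, keep=""):
    """Remove everything except ASCII letters, digits, white space and `keep` characters."""
    if value is None:
        return None
    # re.escape makes characters such as '-' safe inside the character class
    pattern = r"[^A-Za-z0-9\s" + re.escape(keep) + "]"
    return re.sub(pattern, "", value)

print(strip_punct("CV254Z/Y"))
print(strip_punct("FLAT 2-B OLIVER CORNERS, BAXTERTON", keep=","))
```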


We have now finished the cleaning and standardisation steps. Before we move on, let's look at what improvements have been made:

In [24]:
ccs

Address,FORENAME,SURNAME,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,ID
STUDIO 48 COOPER ...,,ROSS,,CV254ZY,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,DIVORCED,W72TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


# Derive Variables

We've got quite a few identifying variables that we can split out into further variables for matching. These can be useful if, for instance, records don't match on house number due to an error, but do match on street. 

First, let's derive street and town from the address variable. The **split()** function from the dataframes module will be useful here, splitting on comma. 

In [25]:
# This will create a new column called "ADDRESS_SPLIT" that contains an array of the address elements, split on the comma
census = dataframes.split(census, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")
ccs = dataframes.split(ccs, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")

ccs.select("ADDRESS", "ADDRESS_SPLIT").show(truncate = False)

+-------------------------------------------+----------------------------------------------+
|ADDRESS                                    |ADDRESS_SPLIT                                 |
+-------------------------------------------+----------------------------------------------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|[STUDIO 48 COOPER STREET,  PORT FREDERICKTOWN]|
|43 REBECCA STREET, HARVEYTOWN              |[43 REBECCA STREET,  HARVEYTOWN]              |
|null                                       |null                                          |
|7 NOBLE VALLEY, LAKE SIMONVILLE            |[7 NOBLE VALLEY,  LAKE SIMONVILLE]            |
|57 PEARSON CORNER, JOANNABOROUGH           |[57 PEARSON CORNER,  JOANNABOROUGH]           |
|null                                       |null                                          |
|STUDIO 5 FULLER BURGS, NEW LINDSEY         |[STUDIO 5 FULLER BURGS,  NEW LINDSEY]         |
|644 GARRY WALK, BLACKBURNVILLE             |[644 GARRY WALK,  BLACKBU

In [26]:
# We can then select the first element of the 'split address' to create the 'street address' variable
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)

# The second element contains the town name, which we can also put into a new column
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)

# Since we no longer need the 'ADDRESS_SPLIT' column, we can remove it using our drop_columns() function
census = dataframes.drop_columns(census, subset = 'ADDRESS_SPLIT')
ccs = dataframes.drop_columns(ccs, subset = 'ADDRESS_SPLIT')

ccs.select("ADDRESS", "STREET", "TOWN").show(truncate = False)

+-----------------------------------------+---------------------------+-------------------+
|ADDRESS                                  |STREET                     |TOWN               |
+-----------------------------------------+---------------------------+-------------------+
|null                                     |null                       |null               |
|183 PAIGE SPURS, SOUTH DAMIENBURGH       |183 PAIGE SPURS            | SOUTH DAMIENBURGH |
|6 SKINNER MISSION, S7OUTH MELANIEPORT    |6 SKINNER MISSION          | S7OUTH MELANIEPORT|
|69 NEIL HILL, TURNERBURY                 |69 NEIL HILL               | TURNERBURY        |
|794 STEWART SUMMIT, BELLCHESTER          |794 STEWART SUMMIT         | BELLCHESTER       |
|2 BARLOW ESTATES, EAST CLAREBURY         |2 BARLOW ESTATES           | EAST CLAREBURY    |
|STUDIO 43R JASMINE FORGE, WEST DAMIANLAND|STUDIO 43R JASMINE FORGE   | WEST DAMIANLAND   |
|72 CROSS EXTENSION, LAKE PAMELA          |72 CROSS EXTENSION         | LAKE PAM
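
The split-then-index approach can be sketched in plain Python as follows. `split_address` is a hypothetical helper for illustration. Unlike the output above, where TOWN keeps the space that followed the comma, this sketch strips each element.

```python
def split_address(address, split_on=","):
    """Return (street, town); missing parts come back as None."""
    if address is None:
        return None, None
    parts = [part.strip() for part in address.split(split_on)]
    street = parts[0] if parts[0] else None
    town = parts[1] if len(parts) > 1 and parts[1] else None
    return street, town

print(split_address("43 REBECCA STREET, HARVEYTOWN"))
```

One thing to watch for with this approach: an address containing a stray comma would put the wrong element into the town column, so it is worth profiling the derived variables too.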

We can create a 'full name' variable by concatenating the two existing name columns together, using **concat()**:

In [27]:
census = dataframes.concat(census, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")
ccs = dataframes.concat(ccs, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")

ccs.select("FORENAME", "SURNAME", "FULL_NAME")

FORENAME,SURNAME,FULL_NAME
LUKE,,LUKE
BEN,KELLY,BEN KELLY
NICOLA,HARRISON,NICOLA HARRISON
SAMUEL,VANMORTON,SAMUEL VANMORTON
KERRY,COLE,KERRY COLE
STUART,TURNER,STUART TURNER
MAURICE,PARKER,MAURICE PARKER
SHEILA,,SHEILA
PATRICIA,BIBI,PATRICIA BIBI
DEBRA,WRIGHT,DEBRA WRIGHT
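
The output above shows that a missing surname gives `LUKE` with no trailing separator. As a rough illustration of that null-safe behaviour, here is a minimal pure-Python sketch. The helper `concat_non_null` is hypothetical (it is not the `dlh_utils` implementation), and returning `None` when every value is missing is an assumption.

```python
def concat_non_null(values, sep=" "):
    """Join the non-missing values with `sep`.

    Hypothetical helper for illustration only. Returning None when
    every value is missing is an assumption, not documented behaviour.
    """
    parts = [v.strip() for v in values if v is not None and v.strip() != ""]
    return sep.join(parts) if parts else None


print(concat_non_null(["BEN", "KELLY"]))
print(concat_non_null(["LUKE", None]))
```

Skipping missing values before joining is what avoids stray separators such as `"LUKE "`, which would otherwise stop exact matches on the derived column.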


For data collected over the phone, our usual matching methods that compare the spelling of strings might not be as effective. Instead, we can use phonetic encoders to capture the way names *sound*, which compensates for this type of error.

We have functions for this in the linkage module. 

In [28]:
census = linkage.metaphone(df = census, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
census = linkage.soundex(df = census, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs = linkage.metaphone(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
ccs = linkage.soundex(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs.select("FORENAME", "FORENAME_METAPHONE", "FORENAME_SOUNDEX")

FORENAME,FORENAME_METAPHONE,FORENAME_SOUNDEX
GRACE,KRS,G620
CHLOE,XL,C400
LYDIA,LT,L300
RACHAEL,RXL,R240
NATASHA,NTX,N320
MARTYN,MRTN,M635
DIANA,TN,D500
GORDON,KRTN,G635
,,
KAYLEIGH,KL,K420
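
To show what a phonetic code captures, here is a minimal pure-Python sketch of the standard American Soundex rules. It is only an illustration and is not necessarily how `linkage.soundex` is implemented; the function name `soundex_sketch` is hypothetical.

```python
# Consonant groups used by American Soundex; vowels, Y, H and W have no code.
_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex_sketch(name):
    """Return a 4-character Soundex code, or None for missing input."""
    if not name:
        return None
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return None
    encoded = letters[0]                 # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")    # its code still blocks repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                     # H and W do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code                      # a vowel resets prev to ""
    return (encoded + "000")[:4]         # pad or truncate to 4 characters


for forename in ["GRACE", "MARTYN", "GORDON"]:
    print(forename, soundex_sketch(forename))
```

Names that sound alike but are spelled differently, such as `MARTYN` and `MARTIN`, receive the same code, which is why these columns are useful in a matchkey.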


Similarly, if names contain spelling mistakes, alphabetising the characters of a string column may also aid matching, because names whose letters have been transposed then produce the same value. We have a function for this in the linkage module. 

In [29]:
census = linkage.alpha_name(census, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')
ccs = linkage.alpha_name(ccs, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')

ccs.select("FORENAME", "ALPHABETISE_FORENAME")

FORENAME,ALPHABETISE_FORENAME
GRACE,ACEGR
CHLOE,CEHLO
LYDIA,ADILY
RACHAEL,AACEHLR
NATASHA,AAAHNST
MARTYN,AMNRTY
DIANA,AADIN
GORDON,DGNOOR
,
KAYLEIGH,AEGHIKLY
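
The idea behind alphabetising can be sketched in a couple of lines of plain Python. The helper name `alphabetise` is hypothetical and this is not the `dlh_utils` code, just an illustration of the technique.

```python
def alphabetise(name):
    """Sort the characters of a name; missing input stays missing."""
    if name is None:
        return None
    return "".join(sorted(name.upper()))


# A transposition typo gives the same alphabetised value as the correct spelling
print(alphabetise("MARTYN"), alphabetise("MATRYN"))
```

Note that this only helps with transposed letters; an inserted or substituted letter still produces a different value.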


There are more common matching variables we could still derive. For example, a common practice in data linkage is to derive a postcode district variable instead of using the full postcode. 

The second part of a postcode is *always* 3 characters, whilst the first part can range from 2 to 4 characters. Therefore, to derive postcode district, we remove the last 3 characters from the postcode. 

This can be done using the **substring()** function:

In [30]:
census = dataframes.substring(census, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)
ccs = dataframes.substring(ccs, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)

ccs.select("POSTCODE", "PC_DISTRICT")

POSTCODE,PC_DISTRICT
IM6M5LD,IM6M
B58TU,B5
HP399XG,HP39
LL235XT,LL23
W75FG,W7
AB5E7ZL,AB5E
N6A1DJ,N6A
LE081BB,LE08
E727SA,E72
,
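
The rule described above (drop the final 3 characters) can be sketched in plain Python as follows. The helper `postcode_district` is hypothetical, and removing spaces first is an assumption; the demo data already stores postcodes without spaces.

```python
def postcode_district(postcode):
    """Return everything before the final 3 characters of a postcode."""
    if postcode is None:
        return None
    compact = postcode.replace(" ", "").upper()  # assumption: ignore spacing
    # Anything of 3 characters or fewer cannot hold both parts of a postcode
    return compact[:-3] if len(compact) > 3 else None


for pc in ["IM6M5LD", "B58TU", "N6A 1DJ"]:
    print(pc, postcode_district(pc))
```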


# Deduplication

This is easily done by defining our duplicate matchkey(s) and using the **deduplicate()** function:

In [31]:
# Define our matchkey
deduplicate_mkey = ['FORENAME', 'SURNAME', 'Resident_Age', 'SEX', 'POSTCODE', 'Address']
ccs.count()

1096

In [32]:
census = linkage.deduplicate(df = census, record_id = 'ID', mks = deduplicate_mkey)[0]
ccs = linkage.deduplicate(df = ccs, record_id = 'ID', mks = deduplicate_mkey)[0]
ccs.count()

1090
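
For intuition, deduplicating on a matchkey amounts to treating records with identical values across the key columns as the same record. Below is a minimal pure-Python sketch of that idea. The helper `drop_duplicates_on_key` is hypothetical, and keeping the first record seen is an assumption; check the `linkage.deduplicate` docstring for what it actually keeps and returns.

```python
def drop_duplicates_on_key(records, key_fields):
    """Keep the first record seen for each combination of key values."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec.get(field) for field in key_fields)
        if key in seen:
            continue  # same matchkey values as an earlier record
        seen.add(key)
        unique.append(rec)
    return unique


people = [
    {"ID": "a", "FORENAME": "BEN", "SURNAME": "KELLY", "Resident_Age": 40},
    {"ID": "b", "FORENAME": "BEN", "SURNAME": "KELLY", "Resident_Age": 40},
    {"ID": "c", "FORENAME": "BEN", "SURNAME": "KELLY", "Resident_Age": 41},
]
deduped = drop_duplicates_on_key(people, ["FORENAME", "SURNAME", "Resident_Age"])
print([rec["ID"] for rec in deduped])
```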

# Deterministic Matching (rule-based)

Now that we've removed duplicates, we can start to investigate some matchkeys:

In [33]:
# First, let's suffix each dataset's columns to distinguish the two dataframes 
census = dataframes.suffix_columns(census, suffix = '_census')
ccs = dataframes.suffix_columns(ccs, suffix = '_ccs')

# To help our code run quicker, we can persist both dataframes (calling count() forces the cache to be populated)
census.persist().count()
ccs.persist().count()

1090

In [34]:
census.columns

['Address_census',
 'FORENAME_census',
 'SURNAME_census',
 'ID_census',
 'Marital_Status_census',
 'POSTCODE_census',
 'SEX_census',
 'Resident_Day_Of_Birth_census',
 'Resident_Month_Of_Birth_census',
 'Resident_Year_Of_Birth_census',
 'Resident_Age_census',
 'DOB_census',
 'STREET_census',
 'TOWN_census',
 'FULL_NAME_census',
 'FORENAME_METAPHONE_census',
 'FORENAME_SOUNDEX_census',
 'ALPHABETISE_FORENAME_census',
 'PC_DISTRICT_census']

In [35]:
# Defining a very strict matchkey
MK1 = [census.FULL_NAME_census == ccs.FULL_NAME_ccs,
       census.SEX_census == ccs.SEX_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.POSTCODE_census == ccs.POSTCODE_ccs]

# Now allowing for misspellings in forename, using a string comparator - standardised Levenshtein edit distance
MK2 = [linkage.std_lev_score(F.col('FORENAME_census'),F.col('FORENAME_ccs')) > 0.5,
       census.SURNAME_census == ccs.SURNAME_ccs,
       census.SEX_census == ccs.SEX_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs]

# Taking the phonetic encoding of forename - using the metaphone algorithm
MK3 = [census.FORENAME_METAPHONE_census == ccs.FORENAME_METAPHONE_ccs,
       census.SEX_census == ccs.SEX_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs]

matchkeys = [MK1,MK2,MK3]
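
To make the string comparator in MK2 more concrete, here is a pure-Python sketch of a standardised Levenshtein score. Dividing the edit distance by the length of the longer string is an assumption about how the score is standardised; the function names are hypothetical and this is not the `dlh_utils` implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def std_lev(a, b):
    """1 - distance / longer length: 1.0 means identical (assumed scaling)."""
    if not a or not b:
        return None
    return 1 - levenshtein(a, b) / max(len(a), len(b))


print(std_lev("CARJL", "CARL"))
```

With this scaling, a threshold of 0.5 accepts pairs where fewer than half of the characters of the longer name need editing.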

In [38]:
links = linkage.deterministic_linkage(df_l = census, df_r = ccs, id_l = 'ID_census', id_r = 'ID_ccs', 
                                      matchkeys = matchkeys, out_dir = '/user/hartj2/census_ccs_links')


MATCHKEY 1
matches on matchkey:  265
total matches:  265
left residual:  99736
right residual:  825

MATCHKEY 2
matches on matchkey:  125
total matches:  390
left residual:  99611
right residual:  700

MATCHKEY 3
matches on matchkey:  85
total matches:  475
left residual:  99526
right residual:  615
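
The residual counts above fall by the number of matches at each step, which suggests matched records are set aside before the next matchkey is tried. A minimal pure-Python sketch of that hierarchical approach is below. It assumes equality-only matchkeys and only accepts a link when the key is unique on both sides; both are simplifying assumptions, and `link_in_order` is a hypothetical helper rather than the `dlh_utils` algorithm.

```python
from collections import defaultdict


def _index(records, fields):
    """Group record IDs by their matchkey values, skipping missing values."""
    groups = defaultdict(list)
    for rec_id, rec in records.items():
        key = tuple(rec.get(field) for field in fields)
        if None not in key:
            groups[key].append(rec_id)
    return groups


def link_in_order(left, right, matchkeys):
    """Apply matchkeys strictest-first, removing linked records as we go."""
    links = []
    left_res, right_res = dict(left), dict(right)
    for number, fields in enumerate(matchkeys, start=1):
        left_idx = _index(left_res, fields)
        right_idx = _index(right_res, fields)
        for key, left_ids in left_idx.items():
            right_ids = right_idx.get(key, [])
            if len(left_ids) == 1 and len(right_ids) == 1:  # unique both sides
                links.append((left_ids[0], right_ids[0], number))
                del left_res[left_ids[0]]
                del right_res[right_ids[0]]
    return links, left_res, right_res


left = {"L1": {"name": "ANN LEE", "age": 30, "pc": "B58TU"},
        "L2": {"name": "BOB RAY", "age": 41, "pc": "W75FG"},
        "L3": {"name": "CY DOE", "age": 52, "pc": "N6A1DJ"}}
right = {"R1": {"name": "ANN LEE", "age": 30, "pc": "B58TU"},
         "R2": {"name": "BOB RAY", "age": 41, "pc": "W76FG"},
         "R3": {"name": "SY DOE", "age": 52, "pc": "N6A1DJ"}}

links, left_res, right_res = link_in_order(
    left, right, [["name", "age", "pc"], ["name", "age"]])
print(links, sorted(left_res), sorted(right_res))
```

Ordering matchkeys from strictest to loosest means each record is linked by the most reliable rule that applies to it.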


In [39]:
links

ID_census,ID_ccs,matchkey
C1433277973806035679,C6883631728895847355,1
C1966360112880317856,C4789040885158284449,1
C2325862273912155266,C5535561504398033731,1
C2518542149903085123,C3607616730251071161,1
C2859856486916324368,C6041478008464468015,1
C2909764675783086411,C3350943424263520857,1
C2967947797887915562,C6953693924981990396,1
C3385408824654441031,C7288498466475780443,1
C3447648659395475990,C4134795768120584870,1
C3799523357043416091,C7335138963629014380,1


It's all well and good that we've matched some records using the rules we've come up with, but *how good* are these matches? To work that out, we need to do clerical review: a manual review of the matches we've made, for quality assurance.

To do this, we can use the `matchkey_dataframe` and `clerical_sample` functions:

In [40]:
mk_df = linkage.matchkey_dataframe(mks = matchkeys)

This creates a dataframe storing each matchkey and its details:

In [41]:
mk_df

matchkey,description
1,[(FULL_NAME_censu...
2,[((1-(levenshtein...
3,[(FORENAME_METAPH...


We can feed this into our `clerical_sample` function to generate samples for clerical review:

In [43]:
linkage.clerical_sample(links, mk_df, df_l = census, df_r = ccs, id_l = 'id_census', id_r = 'id_ccs', n_ids = 5)

matchkey,id_census,id_ccs,ALPHABETISE_FORENAME_ccs,ALPHABETISE_FORENAME_census,Address_ccs,Address_census,DOB_ccs,DOB_census,FORENAME_METAPHONE_ccs,FORENAME_METAPHONE_census,FORENAME_SOUNDEX_ccs,FORENAME_SOUNDEX_census,FORENAME_ccs,FORENAME_census,FULL_NAME_ccs,FULL_NAME_census,Marital_Status_ccs,Marital_Status_census,PC_DISTRICT_ccs,PC_DISTRICT_census,POSTCODE_ccs,POSTCODE_census,Resident_Age_ccs,Resident_Age_census,Resident_Day_Of_Birth_census,Resident_Month_Of_Birth_census,Resident_Year_Of_Birth_ccs,Resident_Year_Of_Birth_census,SEX_ccs,SEX_census,STREET_ccs,STREET_census,SURNAME_ccs,SURNAME_census,TOWN_ccs,TOWN_census,description
1,C2841930630583919923,C4225029719784582868,AERSTTW,AERSTTW,796 JEFFREY TRACK...,796 JEFFREY TRACK...,23/02/1971,23/02/1971,STWRT,STWRT,S363,S363,STEWART,STEWART,STEWART WALTERS,STEWART WALTERS,MARRIED,MARRIED,KT1,KT1,KT10YD,KT10YD,51,51,23,2,1971,1971,1,1,796 JEFFREY TRACK,796 JEFFREY TRACK,WALTERS,WALTERS,STOREYBOROUGH,STOREYBOROUGH,[(FULL_NAME_censu...
1,C3183129369158987851,C8320916867829381826,ADNNO,ADNNO,STUDIO 0 JORDAN T...,STUDIO 0 JORDAN T...,20/01/2015,20/01/2015,TN,TN,D500,D500,DONNA,DONNA,DONNA BRADLEY,DONNA BRADLEY,SINGLE,SINGLE,B48,B48,B486XG,B486XG,7,7,20,1,2015,2015,2,2,STUDIO 0 JORDAN T...,STUDIO 0 JORDAN T...,BRADLEY,BRADLEY,STANLEYPORT,STANLEYPORT,[(FULL_NAME_censu...
1,C3934888329202972175,C4086595455689543106,AAJNNO,AAJNNO,"1 RICHARDS WAY, P...","1 RICHARDS WAY, P...",10/12/2004,10/12/2004,JN,JN,J500,J500,JOANNA,JOANNA,JOANNA BROWN,JOANNA BROWN,MARRIED,MARRIED,KW1,KW1,KW18ST,KW18ST,18,18,10,12,2004,2004,1,1,1 RICHARDS WAY,1 RICHARDS WAY,BROWN,BROWN,PORT JAMES,PORT JAMES,[(FULL_NAME_censu...
1,C7019507166971004966,C3399518453768614955,AAENSSV,AAENSSV,STUDIO 83K COOK I...,STUDIO 83K COOK I...,09/10/2011,09/10/2011,FNS,FNS,V520,V520,VANESSA,VANESSA,VANESSA VANROBERTS,VANESSA VANROBERTS,SINGLE,SINGLE,IG59,IG59,IG599NW,IG599NW,11,11,9,10,2011,2011,1,1,STUDIO 83K COOK I...,STUDIO 83K COOK I...,VANROBERTS,VANROBERTS,GORDONTON,GORDONTON,[(FULL_NAME_censu...
1,C7601528133291145405,C4955257115378102352,EEEILN,EEEILN,"8 FOSTER MISSION,...","8 FOSTER MISSION,...",18/04/1967,18/04/1967,ELN,ELN,E450,E450,EILEEN,EILEEN,EILEEN MORGAN,EILEEN MORGAN,DIVORCED,DIVORCED,S4,S4,S42PU,S42PU,55,55,18,4,1967,1967,2,2,8 FOSTER MISSION,8 FOSTER MISSION,MORGAN,MORGAN,GLENNSIDE,GLENNSIDE,[(FULL_NAME_censu...
2,C1804691338017573148,C3694484938528976612,ACJLR,ACLR,FLAT 06 LEWIS STR...,FLAT 06 LEWIS STR...,12/01/1985,12/01/1985,KRJL,KRL,C624,C640,CARJL,CARL,CARJL JOHNSON,CARL JOHNSON,,,NP6,NP6,NP64EL,NP64EL,37,37,12,1,1985,1985,1,1,FLAT 06 LEWIS STR...,FLAT 06 LEWIS STR...,JOHNSON,JOHNSON,EAST ALAN,EAST ALAN,[((1-(levenshtein...
2,C3385011852067108429,C7123950296791221435,EJNU,EJNU,FLAT 3 ELEANOR MA...,FLAT 01 REYNOLDS ...,,08/09/1964,JN,JN,J500,J500,JUNE,JUNE,JUNE DERLONG,JUNE DERLONG,SINGLE,SINGLE,M68O,M68,M68O7HW,M687HW,58,58,8,9,1964,1964,2,2,FLAT 3 ELEANOR MA...,FLAT 01 REYNOLDS ...,DERLONG,DERLONG,WALLCHESTER,EAST SAMUELBERG,[((1-(levenshtein...
2,C5442881238456089562,C2374184616021943202,ACHIINRST,ACHIINRST,481 ALISON JUNCTI...,481 ALISON JUNCTI...,24/07/1941,24/07/1941,XRSXN,XRSXN,C623,C623,CHRISTIAN,CHRISTIAN,CHRISTIAN VANGILBERT,CHRISTIAN VANGILBERT,SINNGLE,SINGLE,N3S,M1B,N3S9LR,M1B4FG,81,81,24,7,1941,1941,1,1,481 ALISON JUNCTION,481 ALISON JUNCTION,VANGILBERT,VANGILBERT,PORT RUSSELLSTAD,PORT RUSSELLSTAD,[((1-(levenshtein...
2,C6607443554311123697,C5355458740593446896,ADLNOR,ADLNOR,,"4 SYKES LODGE, LA...",21/03/1948,29/10/1963,RNLT,RNLT,R543,R543,RONALD,RONALD,RONALD SMITH,RONALD SMITH,,,WV78,NR89,WV787HR,NR892TX,59,59,29,10,1963,1963,1,1,,4 SYKES LODGE,SMITH,SMITH,,LAWSONMOUTH,[((1-(levenshtein...
2,C6646705785289839024,C8315020339982090861,ACIKPRT,ACIKPRT,348 BRENDA TURNPI...,STUDIO 40 BELL CA...,09/04/2014,09/04/2014,PTRK,PTRK,P362,P362,PATRICK,PATRICK,PATRICK JONES,PATRICK JONES,SINGLE,SINGLE,ST2Y,M00,ST2Y3QU,M009YP,8,8,9,4,2014,2014,1,1,348 BRENDA TURNPIKE,STUDIO 40 BELL CAPE,JONES,JONES,PORT KYLE,LAKE NORMANTON,[((1-(levenshtein...
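
The sample above contains five links for each matchkey shown. The sampling idea can be sketched in plain Python as below, assuming a simple random sample of up to `n_per_key` links per matchkey. That is an assumption about the sampling scheme, and `sample_links_per_matchkey` is a hypothetical helper, not the `dlh_utils` function.

```python
import random
from collections import defaultdict


def sample_links_per_matchkey(links, n_per_key, seed=0):
    """Draw up to n_per_key links at random from each matchkey."""
    by_key = defaultdict(list)
    for link in links:
        by_key[link[2]].append(link)   # link = (left_id, right_id, matchkey)
    rng = random.Random(seed)          # fixed seed so the sample is repeatable
    sample = []
    for matchkey in sorted(by_key):
        group = by_key[matchkey]
        sample.extend(rng.sample(group, min(n_per_key, len(group))))
    return sample


demo_links = [(f"L{i}", f"R{i}", 1) for i in range(8)]
demo_links += [(f"L{i}", f"R{i}", 2) for i in range(8, 11)]
demo_sample = sample_links_per_matchkey(demo_links, n_per_key=5)
print(len(demo_sample))
```

Sampling within each matchkey, rather than across all links, lets you estimate the error rate of each rule separately, which matters because looser matchkeys usually produce more false matches.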


This is only scratching the surface of the linkage-specific functions in `dlh_utils`. There are also methods for clustering our records so we can identify conflicting matches (where record A matches to both records B and C), a function for blocking, and more. Take a look at our [repository](https://github.com/Data-Linkage/dlh_utils) for more info :)
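
As a small taste of what a conflicting match looks like, here is a pure-Python sketch that flags any record linked to more than one partner. It only illustrates the idea; `find_conflicts` is a hypothetical helper and not one of the `dlh_utils` clustering functions.

```python
from collections import defaultdict


def find_conflicts(links):
    """Return records that are linked to more than one partner record."""
    partners = defaultdict(set)
    for left_id, right_id in links:
        partners[("left", left_id)].add(right_id)
        partners[("right", right_id)].add(left_id)
    return {node: ids for node, ids in partners.items() if len(ids) > 1}


# Record A matches both B and C, so it is reported as a conflict
conflicts = find_conflicts([("A", "B"), ("A", "C"), ("D", "E")])
print(conflicts)
```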