# Intermediate/advanced data manipulation in a data linkage context

This notebook is intended to be a demo of what you *could* use dlh_utils for.

Much of this code will probably look familiar, since most of us have taken similar approaches to the problems faced here. We've just wrapped these mostly standard approaches up into reusable functions, hopefully saving everyone doing linkage both time and headaches!

In [92]:
# to start, install dlh_utils if not installed already. Notice the '-U' argument to upgrade existing installations. 
!pip3 install -U 'dlh_utils'



In [93]:
# import necessary libraries
import pyspark.sql.functions as F
import pandas as pd

from dlh_utils import utilities
from dlh_utils import dataframes
from dlh_utils import linkage
from dlh_utils import standardisation
from dlh_utils import sessions
from dlh_utils import profiling
from dlh_utils import flags

In [94]:
# you can use our sessions module to set up your spark session
# this will also create a Spark UI, which you can use to track your code's efficiency
spark = sessions.getOrCreateSparkSession(appName = 'dlh_utils_demo', size = 'medium')

In [95]:
# read in raw data
census = pd.read_csv("/home/cdsw/dlh_utils_demo/census_residents.csv")
ccs = pd.read_csv("/home/cdsw/dlh_utils_demo/ccs_perturbed.csv")

# note, if this was stored in Hue, the read_format() function from the utilities module would've been useful

# for demo purposes, let's convert this to a spark df using utilities
census = utilities.pandas_to_spark(census)
ccs = utilities.pandas_to_spark(ccs)

To give a quick overview of the features of our data, we can use the **df_describe()** function from the profiling module:

In [49]:
descriptive_census = profiling.df_describe(census,
                                           output_mode = 'pandas',
                                           approx_distinct = False,
                                           rsd = 0.05
                                           )
descriptive_census

Unnamed: 0,variable,type,row_count,distinct,percent_distinct,null,percent_null,not_null,percent_not_null,empty,percent_empty,min,max,min_l,max_l,max_l_before_point,min_l_before_point,max_l_after_point,min_l_after_point
0,Address,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,19,51,,,,
1,ENUM_FNAME,string,100001,1092,1.091989,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
2,ENUM_SNAME,string,100001,1493,1.492985,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
3,ID,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,20,20,,,,
4,Marital_Status,string,100001,6,0.006,0,0.0,100001,100.0,13386,13.385866,,,1,17,,,,
5,Postcode,string,100001,99457,99.456005,0,0.0,100001,100.0,0,0.0,,,6,8,,,,
6,Sex,string,100001,10,0.01,0,0.0,100001,100.0,5636,5.635944,,,1,6,,,,
7,Resident_Day_Of_Birth,bigint,100001,31,0.031,0,0.0,100001,100.0,0,0.0,1.0,31.0,1,2,,,,
8,Resident_Month_Of_Birth,bigint,100001,12,0.012,0,0.0,100001,100.0,0,0.0,1.0,12.0,1,2,,,,
9,Resident_Year_Of_Birth,bigint,100001,89,0.088999,0,0.0,100001,100.0,0,0.0,1934.0,2022.0,4,4,,,,


From this we can see that the Sex variable has 10 distinct values, far more than the two we would expect. This could suggest a high level of missingness, but the rest of the output shows no null Sex values, suggesting that some values have been coded inconsistently.

We can also see that, whilst there are no nulls, the Sex variable contains a lot of empty values. This suggests missingness is being recorded in more than one way; we can cast these to true None values later when we standardise the data.

On bigger data, these observations can give quick insights into which variables may be the most/least useful for matching. 
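As a rough illustration of what the distinct, null and empty counts in the profiling table mean, here is a plain-Python sketch for a single column. It is not how dlh_utils computes them; `describe_column` and the sample values are made up for this example.

```python
# Plain-Python illustration only; this is NOT the dlh_utils implementation.
def describe_column(values):
    """Return row, distinct, null and empty counts for one column."""
    not_null = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        # Distinct is counted over non-null values; empty strings count as a value here.
        "distinct": len(set(not_null)),
        "null": len(values) - len(not_null),
        # "Empty" means a string containing nothing but white space.
        "empty": sum(1 for v in not_null if isinstance(v, str) and v.strip() == ""),
    }


# Hypothetical sample resembling the Sex column before cleaning.
sex = ["Male", "F", "", "M", "Female", None, "2", ""]
summary = describe_column(sex)
print(summary)
```

A column like this reports few true nulls but several empty strings and many distinct codes, which is the pattern described above.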

The **value_counts()** function shows the top or bottom n values in our data, and is another crucial step when profiling our data. Along with cleaning and standardising our data, this is one of the most time-consuming parts of data linkage.

Value counts can give us an overview of the different types of missingness in these variables, which will be useful when we come to standardise missingness in our data later.

Whilst it is important to know and understand both sets of data, for the purpose of the demo we will only look at the CCS dataset. 

In [50]:
value_counts_ccs = profiling.value_counts(ccs,
                                         limit = 5,
                                         output_mode = 'pandas'
                                         )

# the value counts function returns two dataframes; one for the top n values in each variable and one for the bottom n values. 
# we can select the top value count dataframe by subsetting the value_counts_ccs tuple:

value_counts_ccs[0]

Unnamed: 0,Address,Address_count,FNAME,FNAME_count,SNAME,SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,-9,48,-9,51,-7,50,-7,36,Single,287,...,Female,228,-9,45,-9,43,-7,34,c1019879803218059747,1
1,"Studio 4\nSimpson glens, Lake Paul",4,Victoria,9,Smith,26,c7949424308517863587,3,,136,...,Male,214,1938,22,84,21,10-03-1987,4,c1116695461193772332,1
2,"Studio 4\nDiane underpass, Eleanorton",3,Tracey,7,Jones,16,c5305314753251312254,3,Divorced,129,...,F,134,2005,21,2,20,03-01-2022,4,c1126124671950375198,1
3,"Flat 05F\nGarry knolls, Jamieport",3,Mr Ryan,7,Roberts,13,c5873915529124500787,3,Married,117,...,M,131,2020,21,17,20,23-08-1963,4,c1338471321100034480,1
4,"Studio 5\nFuller burgs, New Lindsey",3,Howard,6,Taylor,12,c8751933968454216474,2,Civil partnership,109,...,-7,99,1997,20,11,20,20-10-1951,4,c1690168800947222415,1


In [51]:
# we can do the same for the bottom values: 

value_counts_ccs[1]

Unnamed: 0,Address,Address_count,FNAME,FNAME_count,SNAME,SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,"1 Jones centers, Cooperhaven",1,Lauren,1,Davison,1,C5979341665824936717,1,Civil partnerQship,1,...,1,45,1939,2,83,3,01-09-1972,1,c1289733399550728998,1
1,"45 Alan plains, Denisshire",1,Mr Julie,1,Der-Butler,1,c1045458737369U422279,1,Divorce?d,1,...,2,55,1975,5,43,4,02-06-1987,1,c1406628632687313907,1
2,"1 Murray meadows, West Jessica",1,LeslEy,1,Hope,1,c1364885074130231367,1,Si0ngle,1,...,,57,1979,5,46,5,02-06-1991,1,c1462481395779002923,1
3,"28 James row, New Tobystad",1,JUlie,1,Godfrey,1,c1874534031179144583,1,siNgle,1,...,-9,63,1976,6,49,6,03-10-2006,1,c1610482117117913758,1
4,"43 Kelly mills, Dixonport",1,Lydia,1,Carroll,1,c2060046012088437501,1,CivJil partnership,1,...,NAN,70,2022,6,10,6,07-03-2020,1,c1631997721075661206,1
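The idea behind the top and bottom value counts shown above can be sketched with the standard library. This is only an illustration for one column; `top_and_bottom` is a hypothetical helper, not part of dlh_utils, and the sample values are invented.

```python
from collections import Counter


def top_and_bottom(values, limit=5):
    """Return the `limit` most and least frequent values with their counts."""
    ordered = Counter(values).most_common()  # most frequent first
    top = ordered[:limit]
    # Re-sort ascending by count to get the rarest values.
    bottom = sorted(ordered, key=lambda pair: pair[1])[:limit]
    return top, bottom


# Hypothetical marital status values, including a missing-value code and a typo.
marital = ["Single"] * 4 + ["-9"] * 3 + ["Married"] * 2 + ["Si0ngle"]
top, bottom = top_and_bottom(marital, limit=2)
print(top)
print(bottom)
```

The most frequent values tend to expose missing-value codes such as -9, while the least frequent values tend to expose typos.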


To flag out of scope values in our data, we can use the **flag()** function:

In [99]:
# This will flag invalid values in our data, for example, by seeing if there is anyone over the age of 110 in our data:

out_of_scope = flags.flag(df = census,
                          ref_col = 'Resident_Age',
                          condition = '<=',
                          condition_value = 110,
                          alias = None,
                          prefix = 'FLAG',
                          fill_null = None
                         )

out_of_scope.filter(F.col('FLAG_Resident_Age<=110')== 'false')

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB,FLAG_Resident_Age<=110
"65 Maurice rest, ...",Mrs Ronald,Phillips,c8508241626308406829,Single,N1 1XZ,F,5,7,1979,143,05/07/1879,False
Studio 49l Barnes...,Jane,Sullivan,c4347378261099643605,Married,S62 2SF,Female,5,7,1979,143,05/07/1879,False
"66 Terence key, D...",Clifford,Parkes,c6230487516652079992,Single,B4A 9ZJ,Male,5,7,1979,143,05/07/1879,False
Flat 41P Kenneth ...,Geraldine,Jones,c9013672160856151452,Single,FK82 2QX,Male,25,9,1946,176,25/09/1846,False
"12 Rhys ports, Be...",Mr Louise,Pearson,c8509480630649621294,Single,N9 6EE,M,25,9,1946,176,25/09/1846,False
"2 Booth bridge, N...",Marion,Greenwood,c2099225951626557541,Married,ME9 3GD,Female,5,7,1979,143,05/07/1879,False
Flat 94 White mew...,Mrs Antony,Patel,c2271341991490523787,Divorced,SO34 8LY,Male,25,9,1946,176,25/09/1846,False


We can see some apparent supercentenarians in our data (ages of 143 and 176), which is probably wrong. For these records the year in DOB (for example 1879) is also a century out from Resident_Year_Of_Birth (1979), so the date fields themselves disagree.

If you are working with larger data, the **flag_check()** and **flag_summary()** functions can produce more detailed flag metrics that will help you spot issues like this more readily. 
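To make the flagging logic concrete, here is a minimal plain-Python sketch of a condition-based flag. It mirrors the argument names used in the cell above, but `flag_rows` is a hypothetical helper, and leaving the flag as None for null values is an assumption rather than documented dlh_utils behaviour.

```python
import operator

# Map condition strings to comparison functions (illustrative subset).
CONDITIONS = {
    "<=": operator.le, "<": operator.lt,
    ">=": operator.ge, ">": operator.gt,
    "==": operator.eq, "!=": operator.ne,
}


def flag_rows(rows, ref_col, condition, condition_value, prefix="FLAG"):
    """Add a boolean flag to each row: True where the condition holds."""
    compare = CONDITIONS[condition]
    flag_name = f"{prefix}_{ref_col}{condition}{condition_value}"
    flagged = []
    for row in rows:
        value = row.get(ref_col)
        # Assumption: null values get a null flag rather than True or False.
        result = None if value is None else compare(value, condition_value)
        flagged.append({**row, flag_name: result})
    return flagged


people = [{"Resident_Age": 66}, {"Resident_Age": 143}, {"Resident_Age": None}]
flagged = flag_rows(people, "Resident_Age", "<=", 110)
out_of_scope_rows = [r for r in flagged if r["FLAG_Resident_Age<=110"] is False]
print(out_of_scope_rows)
```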

Let's move on to cleaning and standardising where we can start to deal with these issues.

In [81]:
census

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB
Studio 48 Cooper ...,Mrs Margaret,Ross,c4064232788196233825,NAN,CV25 4ZY,,6,8,1956,66,06/08/1956
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,E2 0LP,-7,29,12,2013,9,29/12/2013
"04 Lane shores, S...",Mrs Eric,Bibi,c8205386463232611653,Single,DY01 1TR,Female,11,10,2016,6,11/10/2016
"7 Noble valley, L...",Diane,Kent,c2381984462771197706,,SO2P 9WS,Male,28,1,1972,50,28/01/1972
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,EH4P 9RN,Female,26,11,1966,56,26/11/1966
0 Jeremy mountain...,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,3,3,1963,59,03/03/1963
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,9,4,1960,62,09/04/1960
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,FY9W 4RU,F,28,11,1981,41,28/11/1981
Studio 73 Clayton...,Hazel,Der-Barton,c9188328200772085537,Married,LN96 9XA,Male,13,4,1947,75,13/04/1947
Flat 92B Ross exp...,Harriet,Chapman,c1862566591390004870,Single,S0D 9AX,M,8,11,2001,21,08/11/2001


In [82]:
ccs

Address,FNAME,SNAME,ID,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,c4064232788196233825,NAN,CV25 4Z/Y,,1956,66,06-08-1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,-7,Male,2013,9,29-12-2013,c7839596180442651345
-9,Mrs Eric,Bibi,c8205386463232611653,Single,PL38 0PR,Female,2016,6,09-11-2015,c3258728696626565719
"7 Noble valley, ...",Mark,Kent,c2381984462771197706,,SO2P 9WS,Male,1972,50,28-01-1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,eh4p 9Rn,Female,1966,56,26-11-1966,c1945351111358374057
-9,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,1963,59,11-03-2012,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,1960,62,15-08-1978,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,W7- 2TY,F,1981,41,28-11-1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,c9188328200772085537,-9,LN96 9XA,Male,1947,-9,29-12-2013,c1293143515607798169
Studio 33o8 Hazel...,Keith,ChapmKan,c1862566591390004870,Single,S0D 9AX,M,2001,21,08-11-2001,c1864186096263678574


# Data Cleaning & Standardisation

In [100]:
# Looks like there is a new line character in address - this will need to be removed
# We can replace these '\n' values with spaces:

census = standardisation.reg_replace(df = census, dic = {' ': '\n'})
ccs = standardisation.reg_replace(df = ccs, dic = {' ': '\n'})

ccs.select('Address').show(truncate = False)

+-------------------------------------------+
|Address                                    |
+-------------------------------------------+
|Studio 48 Cooper street, Port Fredericktown|
|43 Rebecca street, Harveytown              |
|-9                                         |
| 7 Noble valley, Lake Simonville           |
|57 Pearson corner, Joannaborough           |
|-9                                         |
|Studio 5 Fuller burgs, New Lindsey         |
|644 Garry walk, Blackburnville             |
|Studio 73 Clayton mountains, Stevenbury    |
|Studio 33o8 Hazel river, Adamsstad         |
|8 Grant spurs, South Philip                |
| 3 Smith mount, New Frances                |
|21 Stephen island, terrymouth              |
|flat 78 Jones Glen, marIonbuRgh            |
|-9                                         |
|Flat 2 Oliver corners, Baxterton           |
|Studio 6 Dixon bypass, New Marian          |
|824 J,effrey roads, Mathewfort             |
|Flat 22 Kennedy keys, Port Valeri

Let's standardise the date format to be consistent across our data in a **dd/MM/yyyy** format:

In [101]:
ccs = standardisation.standardise_date(ccs, col_name = "DOB", in_date_format = "dd-MM-yyyy", out_date_format = "dd/MM/yyyy")

ccs

Address,FNAME,SNAME,ID,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,c4064232788196233825,NAN,CV25 4Z/Y,,1956,66,06/08/1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,-7,Male,2013,9,29/12/2013,c7839596180442651345
-9,Mrs Eric,Bibi,c8205386463232611653,Single,PL38 0PR,Female,2016,6,09/11/2015,c3258728696626565719
"7 Noble valley, ...",Mark,Kent,c2381984462771197706,,SO2P 9WS,Male,1972,50,28/01/1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,eh4p 9Rn,Female,1966,56,26/11/1966,c1945351111358374057
-9,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,1963,59,11/03/2012,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,1960,62,15/08/1978,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,W7- 2TY,F,1981,41,28/11/1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,c9188328200772085537,-9,LN96 9XA,Male,1947,-9,29/12/2013,c1293143515607798169
Studio 33o8 Hazel...,Keith,ChapmKan,c1862566591390004870,Single,S0D 9AX,M,2001,21,08/11/2001,c1864186096263678574
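For a single value, the date conversion above amounts to parsing with one format and writing with another. The sketch below uses Python's `datetime` format codes (`%d-%m-%Y`) rather than the Spark-style patterns (`dd-MM-yyyy`) passed to the function above; `reformat_date` is a hypothetical helper, and returning None for unparseable values is an assumption.

```python
from datetime import datetime


def reformat_date(value, in_format="%d-%m-%Y", out_format="%d/%m/%Y"):
    """Parse `value` using `in_format` and return it written in `out_format`."""
    if value is None:
        return None
    try:
        return datetime.strptime(value, in_format).strftime(out_format)
    except ValueError:
        # Assumption: values that are not valid dates (e.g. '-7') become None.
        return None


dates = ["06-08-1956", "29-12-2013", "-7", None]
converted = [reformat_date(d) for d in dates]
print(converted)
```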


Next, we have generic 'ID' columns in each dataset. We also have address and name variables named differently in each dataset. 

We can use **rename_columns()** from the dataframes module to rename all of these at once. 

In [102]:
census = dataframes.rename_columns(census, rename_dict = {"ID":"ID_Census","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})
ccs = dataframes.rename_columns(ccs, rename_dict = {"ID":"ID_CCS","FNAME":"FORENAME","SNAME":"SURNAME"})

ccs.columns

['Address',
 'FORENAME',
 'SURNAME',
 'ID_CCS',
 'Marital_Status',
 'Postcode',
 'Sex',
 'Resident_Year_Of_Birth',
 'Resident_Age',
 'DOB',
 'Resident_ID']

Now let's set all of our variables to upper case for consistency, using **standardise_case()**:

In [103]:
census = standardisation.standardise_case(census)
ccs = standardisation.standardise_case(ccs)

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,-9,ROSS,C4064232788196233825,NAN,CV25 4Z/Y,,1956,66,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,-7,MALE,2013,9,29/12/2013,C7839596180442651345
-9,MRS ERIC,BIBI,C8205386463232611653,SINGLE,PL38 0PR,FEMALE,2016,6,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,C2381984462771197706,,SO2P 9WS,MALE,1972,50,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56,26/11/1966,C1945351111358374057
-9,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2,1963,59,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,FEMALE,1960,62,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,W7- 2TY,F,1981,41,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,-9,DER-BARTON,C9188328200772085537,-9,LN96 9XA,MALE,1947,-9,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,M,2001,21,08/11/2001,C1864186096263678574


Next, the values for missingness are all over the place. I can spot a few NaNs, minus 7/9s, and whitespaces. Let's standardise missingness with the **standardise_null()** function. We can retrieve these null values from the previous **value_counts()** outputs: 

In [104]:
# we can use the standardise_null function to replace these with true None values:
# we use regex to do this: https://regex101.com/ 
# a raw string (r"...") stops Python treating '\s' as an invalid string escape
census = standardisation.standardise_null(census, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")
ccs = standardisation.standardise_null(ccs, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,,MALE,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,C8205386463232611653,SINGLE,PL38 0PR,FEMALE,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,C2381984462771197706,,SO2P 9WS,MALE,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,FEMALE,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,W7- 2TY,F,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,C9188328200772085537,,LN96 9XA,MALE,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,M,2001,21.0,08/11/2001,C1864186096263678574


Great, these now all show up as true nulls. 
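To see exactly which values the pattern above treats as missing, it can help to test it on a handful of strings first. This sketch uses Python's `re` module, whose syntax agrees with Spark's regex engine for this simple pattern; `to_none` is a hypothetical helper written for the example.

```python
import re

# Same pattern as in the cell above: each alternative is anchored with ^ and $,
# so only whole values match (e.g. '-7' but not '-70').
NULL_PATTERN = re.compile(r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")


def to_none(value):
    """Return None for values matching the missingness pattern."""
    if value is None or NULL_PATTERN.search(value):
        return None
    return value


raw = ["NAN", "-9", "   ", "", "SINGLE", "-70", None]
cleaned = [to_none(v) for v in raw]
print(cleaned)
```

Because the data was upper-cased first, the pattern only needs to list the upper-case forms.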

Next, we have a mix of 1s, 2s, Ms, and Fs in our sex column. Let's standardise this to be either 1s or 2s. For this we can use **reg_replace()**:

In [105]:
# reg_replace() takes a dictionary, where the value is the regex to replace, and the key is what this will be replaced with
# so we're replacing 'M' with '1', and 'F' with '2':
census = standardisation.reg_replace(census, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})
ccs = standardisation.reg_replace(ccs, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,C8205386463232611653,SINGLE,PL38 0PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,C2381984462771197706,,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,W7- 2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,C9188328200772085537,,LN96 9XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
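The dictionary direction used by the replacement above (key is the replacement, value is the regex) is easy to get backwards, so here is a small plain-Python sketch of the same mapping. `replace_by_dict` is a hypothetical stand-in and not the dlh_utils implementation.

```python
import re


def replace_by_dict(value, dic):
    """Apply each {replacement: pattern} pair in turn to a single value."""
    if value is None:
        return None
    for replacement, pattern in dic.items():
        value = re.sub(pattern, replacement, value)
    return value


# Key = what to write, value = regex to look for (same as the cell above).
sex_map = {"1": "^M$|^MALE$", "2": "^F$|^FEMALE$"}
recoded = [replace_by_dict(v, sex_map) for v in ["MALE", "F", "2", None, "FEMALE"]]
print(recoded)
```

Anchoring each alternative with `^` and `$` matters here: without the anchors, the `M` pattern would also match the M inside FEMALE.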


Now that our sex column is populated with just 1s and 2s, we might want to change the type from string to integer. This can be done using the **cast_type()** function in the standardisation module:

In [106]:
# Casting strings to integer can increase performance as it makes the data type smaller, taking up less storage space

census = standardisation.cast_type(census, subset = ['SEX'], types = "integer")
ccs = standardisation.cast_type(ccs, subset = ['SEX'], types = "integer")

ccs.select('SEX').dtypes

[('SEX', 'int')]


Next, let's focus on our name variables. Forenames still contain titles and some surnames have common prefixes like 'Van' or 'Der'. We can strip out titles and concatenate surname prefixes with our **clean_forename()** and **clean_surname()** functions. 

In [107]:
census = standardisation.clean_forename(census, subset = 'FORENAME')
ccs = standardisation.clean_forename(ccs, subset = 'FORENAME')

census = standardisation.clean_surname(census, subset = 'SURNAME')
ccs = standardisation.clean_surname(ccs, subset = 'SURNAME')

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,PL38 0PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, ...",MARK,KENT,C2381984462771197706,,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G88 6DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,W7- 2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN96 9XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
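As a rough picture of what the name cleaning above does, the sketch below strips a leading title and joins a leading surname prefix onto the rest of the name. The title and prefix lists are assumptions chosen for illustration, and `strip_title` and `join_prefix` are hypothetical helpers; the lists dlh_utils actually uses may differ.

```python
import re

# Assumed lists, for illustration only.
TITLES = r"^(MR|MRS|MISS|MS|DR)\s+"
PREFIXES = r"^(VAN|DER|DE|VON)[\s-]+"


def strip_title(forename):
    """Remove a leading title such as 'MRS ' from a forename."""
    return None if forename is None else re.sub(TITLES, "", forename)


def join_prefix(surname):
    """Join a leading prefix to the surname, e.g. 'DER-BARTON' -> 'DERBARTON'."""
    return None if surname is None else re.sub(PREFIXES, r"\1", surname)


forenames = [strip_title(n) for n in ["MRS DARREN", "MR GRACE", "MARK", None]]
surnames = [join_prefix(n) for n in ["DER-ANDERSON", "VAN DYKE", "DERBY"]]
print(forenames)
print(surnames)
```

Requiring white space after the title, and a space or hyphen after the prefix, keeps names such as MARK and DERBY intact.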


Let's begin to have a look at our postcode, address and name variables. 

A lot of the variables contain multiple white spaces in a row. We can remove white spaces altogether using the **standardise_white_space()** function, setting white space level (wsl) to none.

However, we don't want to remove all of the white spaces in our address variable, as this would make the text unreadable. Instead, we can set wsl to one, meaning any run of two or more spaces is collapsed to a single space.

In [108]:
# Using a list comprehension, we remove all white spaces from all columns except Address
census = standardisation.standardise_white_space(census, 
                                                 subset = [column for column in census.columns if column != 'Address'], 
                                                 wsl = "none")
ccs = standardisation.standardise_white_space(ccs, 
                                              subset = [column for column in ccs.columns if column != 'Address'], 
                                              wsl = "none")

# Then we allow a single white space for the Address column
census = standardisation.standardise_white_space(census, subset = 'Address', wsl = "one")
ccs = standardisation.standardise_white_space(ccs, subset = 'Address', wsl = "one")

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,C2381984462771197706,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
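The two white space levels used above can be pictured with a pair of regex substitutions. This is a plain-Python sketch of the idea, not the dlh_utils code; `squeeze_white_space` is a hypothetical helper that borrows the `wsl` argument name.

```python
import re


def squeeze_white_space(value, wsl="one"):
    """Remove all white space ('none') or collapse runs to one space ('one')."""
    if value is None:
        return None
    if wsl == "none":
        return re.sub(r"\s+", "", value)
    if wsl == "one":
        return re.sub(r"\s+", " ", value)
    raise ValueError("wsl must be 'none' or 'one'")


postcode = squeeze_white_space("CV25  4ZY", wsl="none")
address = squeeze_white_space("7  NOBLE   VALLEY,  LAKE SIMONVILLE", wsl="one")
print(postcode)
print(address)
```

Note that collapsing to one space does not remove a single leading or trailing space, which is why trimming is still a separate step.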


Some of our variables may still have leading or trailing white space. This is hard to spot by eye, so it's good practice to use **trim()** to remove it:

In [109]:
census = standardisation.trim(census)
ccs = standardisation.trim(ccs)

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,C2381984462771197706,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Finally, let's strip out numbers from our name variables. Again, we can use the **reg_replace()** function for this:

In [110]:
census = standardisation.reg_replace(census, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})
ccs = standardisation.reg_replace(ccs, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254Z/Y,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,C2381984462771197706,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,W7-2TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


This still leaves apostrophes and hyphens in our name variables. The **remove_punct()** function can handle these. While we're at it, let's also use **remove_punct()** to get rid of dashes in our address field, but we'll have to specify the optional argument **keep** to make sure it doesn't strip out commas from addresses:

In [111]:
#First remove all punctuation from every column except Address and DOB
census = standardisation.remove_punct(census, 
                                      subset = [column for column in census.columns if column not in ['Address','DOB']], 
                                      )

ccs = standardisation.remove_punct(ccs, 
                                   subset = [column for column in ccs.columns if column not in ['Address','DOB']]
                                  )


#Then remove the punctuation from address except for the commas
census = standardisation.remove_punct(census, subset = 'Address', keep = ',')
ccs = standardisation.remove_punct(ccs, subset = 'Address', keep = ',')

ccs

Address,FORENAME,SURNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254ZY,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,PL380PR,2.0,2016,6.0,09/11/2015,C3258728696626565719
"7 NOBLE VALLEY, L...",MARK,KENT,C2381984462771197706,,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,11/03/2012,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,15/08/1978,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,W72TY,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,29/12/2013,C1293143515607798169
STUDIO 33O8 HAZEL...,KEITH,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574
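The effect of a `keep` argument like the one used above can be sketched with a negated character class: everything that is not a letter, digit, white space or a kept character is dropped. `strip_punctuation` is a hypothetical helper for illustration and may not match the exact character set dlh_utils removes (for example, underscores survive in this sketch because `\w` includes them).

```python
import re


def strip_punctuation(value, keep=""):
    """Remove punctuation from `value`, except any characters listed in `keep`."""
    if value is None:
        return None
    # Negated class: keep word characters, white space and the `keep` characters.
    pattern = r"[^\w\s" + re.escape(keep) + r"]"
    return re.sub(pattern, "", value)


postcode = strip_punctuation("CV254Z/Y")
surname = strip_punctuation("O'NEIL-SMITH")
address = strip_punctuation("FLAT 2-A OLIVER CORNERS, BAXTERTON", keep=",")
print(postcode, surname, address)
```

Keeping the comma in addresses matters for the next step, since the address is later split on it.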


# Derive Variables

We've got quite a few identifying variables that we can split out into further variables for matching. These can be useful if, for instance, records don't match on house number due to an error, but do match on street. 

First, let's derive street and town from the address variable. The **split()** function from the dataframes module will be useful here, splitting on comma. 

In [112]:
# This will create a new column called "ADDRESS_SPLIT" that contains an array of each address element, separated by a comma
census = dataframes.split(census, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")
ccs = dataframes.split(ccs, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")

ccs.select("ADDRESS", "ADDRESS_SPLIT").show(truncate = False)

+-------------------------------------------+----------------------------------------------+
|ADDRESS                                    |ADDRESS_SPLIT                                 |
+-------------------------------------------+----------------------------------------------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|[STUDIO 48 COOPER STREET,  PORT FREDERICKTOWN]|
|43 REBECCA STREET, HARVEYTOWN              |[43 REBECCA STREET,  HARVEYTOWN]              |
|null                                       |null                                          |
|7 NOBLE VALLEY, LAKE SIMONVILLE            |[7 NOBLE VALLEY,  LAKE SIMONVILLE]            |
|57 PEARSON CORNER, JOANNABOROUGH           |[57 PEARSON CORNER,  JOANNABOROUGH]           |
|null                                       |null                                          |
|STUDIO 5 FULLER BURGS, NEW LINDSEY         |[STUDIO 5 FULLER BURGS,  NEW LINDSEY]         |
|644 GARRY WALK, BLACKBURNVILLE             |[644 GARRY WALK,  BLACKBU

In [113]:
# We can then select the first element of the 'split address' to create the 'street address' variable
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)

# The second element holds the town name, which we write to a new TOWN column (it keeps the leading space left after the comma)
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)

# Since we no longer need the 'ADDRESS_SPLIT' column, we can remove it using our drop_columns() function
census = dataframes.drop_columns(census, subset = 'ADDRESS_SPLIT')
ccs = dataframes.drop_columns(ccs, subset = 'ADDRESS_SPLIT')

ccs.select("ADDRESS", "STREET", "TOWN").show(truncate = False)

+------------------------------------------+---------------------------+-----------------+
|ADDRESS                                   |STREET                     |TOWN             |
+------------------------------------------+---------------------------+-----------------+
|STUDIO 5 FULLER BURGS, NEW LINDSEY        |STUDIO 5 FULLER BURGS      | NEW LINDSEY     |
|STUDIO 73 CLAYTON MOUNTAINS, STEVENBURY   |STUDIO 73 CLAYTON MOUNTAINS| STEVENBURY      |
|710 HODGSON RIDGE, HILLVILLE              |710 HODGSON RIDGE          | HILLVILLE       |
|FLAT 73 JESSICA MOUNT, MARTINBERG         |FLAT 73 JESSICA MOUNT      | MARTINBERG      |
|STUDIO 41 BROWN MOUNTAIN, PORT DOUGLASLAND|STUDIO 41 BROWN MOUNTAIN   | PORT DOUGLASLAND|
|2 PRITCHARD STRAVENUE, DARRENMOUTH        |2 PRITCHARD STRAVENUE      | DARRENMOUTH     |
|573 BRUCE GREENS, EAST HELEN              |573 BRUCE GREENS           | EAST HELEN      |
|STUDIO 83K COOK INLET, GORDONTON          |STUDIO 83K COOK INLET      | GORDONTON       |
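
As the output above shows, splitting on a bare comma leaves a leading space on each town. The plain-Python sketch below illustrates the same street/town derivation with the whitespace trimmed. It is independent of dlh_utils, and `split_address` is a hypothetical helper used only for illustration.

```python
def split_address(address):
    """Return (street, town) from a 'street, town' string, trimming whitespace."""
    if address is None:
        # Mirror the null rows in the output above
        return None, None
    parts = [part.strip() for part in address.split(",")]
    street = parts[0] if parts[0] else None
    town = parts[1] if len(parts) > 1 and parts[1] else None
    return street, town


street, town = split_address("43 REBECCA STREET, HARVEYTOWN")
```

In Spark, the same trimming could be applied to the derived column with `F.trim`, for example `ccs.withColumn("TOWN", F.trim(F.col("TOWN")))`.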

We can create a 'full name' variable by concatenating the two existing name columns together, using **concat()**:

In [114]:
census = dataframes.concat(census, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")
ccs = dataframes.concat(ccs, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")

ccs.select("FORENAME", "SURNAME", "FULL_NAME")

FORENAME,SURNAME,FULL_NAME
KATIE,DERANDERSON,KATIE DERANDERSON
,DERBARTON,DERBARTON
AIMEE,EVANS,AIMEE EVANS
MRSANN,WALTON,MRSANN WALTON
RACHAEL,VANGALLAGHER,RACHAEL VANGALLAGHER
ELLIE,STEPHENS,ELLIE STEPHENS
JULIE,GIBSON,JULIE GIBSON
SAMUEL,EVACNS,SAMUEL EVACNS
JONATHAN,TUCKER,JONATHAN TUCKER
CAROLINE,JONES,CAROLINE JONES
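
In the output above, the record with a missing forename gets a full name with no stray separator. A minimal plain-Python sketch of that behaviour is below; `concat_names` is a hypothetical helper and is not how dlh_utils implements **concat()**.

```python
def concat_names(*parts, sep=" "):
    """Join the non-missing name parts with `sep`, skipping nulls and blanks."""
    present = [part.strip() for part in parts if part is not None and part.strip()]
    # Return None rather than an empty string when every part is missing
    return sep.join(present) if present else None


full_name = concat_names("KATIE", "DERANDERSON")
surname_only = concat_names(None, "DERBARTON")
```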


For data that has been collected over the phone, our usual matching methods that look for differences in strings might not be as effective. Instead we can capture the way names *sound* with phonetic encoders to compensate for this type of error. 

We have functions for this in the linkage module. 

In [115]:
census = linkage.metaphone(df = census, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
census = linkage.soundex(df = census, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs = linkage.metaphone(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
ccs = linkage.soundex(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs.select("FORENAME", "FORENAME_METAPHONE", "FORENAME_SOUNDEX")

FORENAME,FORENAME_METAPHONE,FORENAME_SOUNDEX
KATIE,KT,K300
,,
AIMEE,AM,A500
MRSANN,MRSN,M625
RACHAEL,RXL,R240
ELLIE,EL,E400
JULIE,JL,J400
SAMUEL,SML,S540
JONATHAN,JN0N,J535
CAROLINE,KRLN,C645
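
To show what a phonetic encoder does under the hood, here is a plain-Python sketch of the standard (American) Soundex rules. It reproduces the FORENAME_SOUNDEX codes shown above, but whether dlh_utils follows exactly this variant is an assumption.

```python
# Consonant groups that share a Soundex digit; vowels, H, W and Y have no code
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name):
    """Return the 4-character Soundex code for a name, or None if it is empty."""
    if not name:
        return None
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return None
    encoded = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch, "")
        # Append a digit only when it differs from the previous coded letter
        if code and code != previous:
            encoded += code
        # H and W do not break a run of identical codes; vowels do
        if ch not in "HW":
            previous = code
    # Pad with zeros and truncate to one letter plus three digits
    return (encoded + "000")[:4]


codes = {name: soundex(name) for name in ["KATIE", "RACHAEL", "JONATHAN"]}
```

Names that sound alike but are spelled differently, such as "JON" and "JOHN", receive the same code, which is what makes the encoding useful as a matching variable.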


Similarly, if there have been spelling mistakes in names, alphabetising string columns may also aid matching. We have a function for this in the linkage module. 

In [116]:
census = linkage.alpha_name(census, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')
ccs = linkage.alpha_name(ccs, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')

ccs.select("FORENAME", "ALPHABETISE_FORENAME")

FORENAME,ALPHABETISE_FORENAME
KATIE,AEIKT
,
AIMEE,AEEIM
MRSANN,AMNNRS
RACHAEL,AACEHLR
ELLIE,EEILL
JULIE,EIJLU
SAMUEL,AELMSU
JONATHAN,AAHJNNOT
CAROLINE,ACEILNOR
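
The idea behind alphabetising can be sketched in a couple of lines of plain Python. `alphabetise` is a hypothetical stand-in for illustration, not the dlh_utils implementation.

```python
def alphabetise(name):
    """Sort the characters of a name so that transposed letters no longer matter."""
    if name is None:
        return None
    return "".join(sorted(name.upper()))


# A transposition error produces the same key as the correct spelling
same_key = alphabetise("SAMEUL") == alphabetise("SAMUEL")

# An inserted letter does not, so this only helps with reordered characters
different_key = alphabetise("EVACNS") != alphabetise("EVANS")
```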


There are more common matching variables we could still derive. For example, a common practice in data linkage is to derive a postcode district variable instead of using the full postcode. 

The second part of a postcode is *always* 3 characters long, whilst the first part can be 2 to 4 characters long. Therefore, to derive postcode district, we remove the last 3 characters from the postcode. 

This can be done using the **substring()** function:

In [117]:
census = dataframes.substring(census, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)
ccs = dataframes.substring(ccs, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)

ccs.select("POSTCODE", "PC_DISTRICT")

POSTCODE,PC_DISTRICT
G886DB,G88
LN969XA,LN96
B060DT,B06
E2C564ZS,2C56
B03FA,B0
E2A1NY,E2A
L23LX,L2
M418JD,M41
CH3E4HT,CH3E
LE081BB,LE08
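
The rule described above (drop the final three characters) can be sketched in plain Python as follows. `postcode_district` is a hypothetical helper, and removing embedded spaces first is an added assumption.

```python
def postcode_district(postcode):
    """Return the postcode with its last three characters removed."""
    if postcode is None:
        return None
    # Remove any whitespace so 'G88 6DB' and 'G886DB' behave the same
    compact = "".join(postcode.split()).upper()
    if len(compact) <= 3:
        # Too short to contain anything before the final three characters
        return None
    return compact[:-3]


districts = [postcode_district(pc) for pc in ["G886DB", "LN969XA", "B03FA"]]
```

Unlike the **substring()** call above with `length = 4`, this sketch does not cap the result at four characters, so a perturbed value such as `E2C564ZS` would give `E2C56` rather than `2C56`.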


If you have a time lag between the collection of two surveys you are trying to link together, you may want to align respondent ages for matching. We can do this using the **age_at()** function.

This function takes a few arguments:
* the dataframe
* the name of the date of birth column
* the date format the date of birth column is in
* the date(s) to calculate age at

In [118]:
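# We can find out their age at the most recent Census, for example.
# Note: the date format passed to age_at() needs to match how DOB is stored. The output below shows
# DOB as yyyy-MM-dd with blank ages, so 'dd/MM/yyyy' may need changing for this data.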
census_date = '21/03/2021'

census = standardisation.age_at(census, 'DOB', 'dd/MM/yyyy', census_date)
ccs = standardisation.age_at(ccs, 'DOB', 'dd/MM/yyyy', census_date)

census.select('DOB','age_at_21/03/2021')

DOB,age_at_21/03/2021
1986-07-04,
2011-04-08,
1986-07-12,
1965-07-16,
1999-07-16,
1992-11-28,
1958-09-08,
1981-05-29,
2009-04-19,
1937-07-02,
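
For reference, the age calculation itself can be sketched with the standard library. This is an illustration under stated assumptions, not the dlh_utils implementation: `age_on` is a hypothetical helper, and the format string uses Python's `strptime` codes rather than Spark's date patterns.

```python
from datetime import datetime


def age_on(dob, reference, fmt="%d/%m/%Y"):
    """Return age in whole years on the reference date, or None if dob is missing."""
    if not dob:
        return None
    born = datetime.strptime(dob, fmt).date()
    ref = datetime.strptime(reference, fmt).date()
    # Subtract a year if the birthday has not yet occurred in the reference year
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)


census_date = "21/03/2021"
ages = [age_on(dob, census_date) for dob in ["04/07/1986", "08/04/2011", None]]
```

Here the dates of birth are written in the same `dd/MM/yyyy` layout as the reference date. If the stored dates use a different layout, the format has to be changed to match.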


In [1]:
# Note: the sections below were taken from the DAP version of this notebook before it was deleted, and have not been verified

# Deduplication

Deduplication is straightforward: we define our duplicate matchkey(s) and pass them to the **deduplicate()** function:

In [None]:
# define our matchkey
deduplicate_mkey = ['First_Name', 'Last_Name','Resident_Age','Sex','Postcode','Address']
ccs.count()

In [None]:
census = linkage.deduplicate(df = census, record_id = 'Resident_ID', mks = deduplicate_mkey)
ccs = linkage.deduplicate(df = ccs, record_id = 'Resident_ID', mks = deduplicate_mkey)
census.count()
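
As an illustration of deduplicating on a matchkey, the plain-Python sketch below keeps the first record for each distinct combination of matchkey values. This is one common interpretation only. Whether **deduplicate()** keeps one record per group or drops every duplicated record is not confirmed here, and `drop_duplicate_records` is a hypothetical helper.

```python
def drop_duplicate_records(records, matchkey):
    """Keep the first record seen for each distinct combination of matchkey values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(column) for column in matchkey)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


people = [
    {"Resident_ID": 1, "First_Name": "KATIE", "Last_Name": "EVANS", "Sex": "F"},
    {"Resident_ID": 2, "First_Name": "KATIE", "Last_Name": "EVANS", "Sex": "F"},
    {"Resident_ID": 3, "First_Name": "AIMEE", "Last_Name": "EVANS", "Sex": "F"},
]
deduplicated = drop_duplicate_records(people, ["First_Name", "Last_Name", "Sex"])
kept_ids = [record["Resident_ID"] for record in deduplicated]
```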

# Deterministic Matching (rule-based)

Now that we've removed duplicates, we can start to investigate some matchkeys:

In [75]:
# first, let's suffix each dataset's columns to distinguish the two dataframes 
census = dataframes.suffix_columns(census, suffix = '_census')
ccs = dataframes.suffix_columns(ccs, suffix = '_ccs')

census.persist().count()
ccs.persist().count()

In [76]:
MK1 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# letting middle name be a mismatch 
MK2 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# taking the phonetic encoding of forename - using the metaphone algorithm
MK3 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# Now allowing for misspellings rather than mishearings of names, using standardised Levenshtein edit distance
MK4 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

matchkeys = [MK1,MK2,MK3,MK4]

# name conditions not currently included in the matchkeys above:
# census.Full_Name_census == ccs.Full_Name_ccs
# census.First_Name_census == ccs.First_Name_ccs
# census.Last_Name_census == ccs.Last_Name_ccs

AttributeError: 'DataFrame' object has no attribute 'Sex_census'
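
The MK4 comment above mentions a standardised Levenshtein edit distance. A plain-Python sketch of one common definition, 1 minus the edit distance divided by the length of the longer string, is given below. The exact definition used by dlh_utils is an assumption here, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def standardised_lev(a, b):
    """Return a similarity in [0, 1], where 1 means the strings are identical."""
    if a is None or b is None:
        return None
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


# One inserted letter in a six-character surname
score = standardised_lev("EVANS", "EVACNS")
```

A matchkey could then accept a pair of names when the score is above a chosen threshold instead of requiring exact agreement.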

In [None]:
links = linkage.deterministic_linkage(df_l = census, df_r = ccs, id_l = 'Resident_ID_census', id_r = 'Resident_ID_ccs',
                                      matchkeys = matchkeys, out_dir = '/user/edwara5/census_ccs_links')

In [None]:
links.show()

In [None]:
mk_df = linkage.matchkey_dataframe(matchkeys)
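
To make the idea of a matchkey hierarchy concrete, here is a simplified plain-Python sketch. It assumes that matchkeys are applied in order, that only one-to-one agreements are accepted, and that linked records are removed before the next, looser matchkey is tried. These are assumptions about rule-based linkage in general rather than a description of `deterministic_linkage` itself. For simplicity, each matchkey is a list of column names compared for equality, and all helper names are hypothetical.

```python
def index_by_key(records, key, id_col, exclude):
    """Group record ids by their matchkey values, skipping linked or incomplete records."""
    index = {}
    for record in records:
        values = tuple(record.get(column) for column in key)
        if record[id_col] in exclude or None in values:
            continue
        index.setdefault(values, []).append(record[id_col])
    return index


def deterministic_links(left, right, matchkeys, id_col="id"):
    """Return (left_id, right_id, matchkey_number) for unique agreements, strictest key first."""
    links = []
    linked_left, linked_right = set(), set()
    for number, key in enumerate(matchkeys, start=1):
        left_index = index_by_key(left, key, id_col, linked_left)
        right_index = index_by_key(right, key, id_col, linked_right)
        for values, left_ids in left_index.items():
            right_ids = right_index.get(values, [])
            # Accept only one-to-one agreements on this matchkey
            if len(left_ids) == 1 and len(right_ids) == 1:
                links.append((left_ids[0], right_ids[0], number))
                linked_left.add(left_ids[0])
                linked_right.add(right_ids[0])
    return links


census_rows = [
    {"id": "c1", "forename": "KATIE", "sex": "F", "age": 34, "postcode": "G886DB"},
    {"id": "c2", "forename": "SAMUEL", "sex": "M", "age": 40, "postcode": "M418JD"},
]
ccs_rows = [
    {"id": "s1", "forename": "KATIE", "sex": "F", "age": 34, "postcode": "G886DB"},
    {"id": "s2", "forename": "SAMEUL", "sex": "M", "age": 40, "postcode": "M418JD"},
]
example_keys = [["forename", "sex", "age", "postcode"], ["sex", "age", "postcode"]]
example_links = deterministic_links(census_rows, ccs_rows, example_keys)
```

The misspelled forename fails the first matchkey but is picked up by the second, and the matchkey number recorded with each link shows how strict the agreement was.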