# Intermediate/advanced data manipulation in a data linkage context

This notebook is intended to be a demo of what you *could* use DLH_utils for. 

I'm sure to many a lot of this code may look very similar! We will have taken similar approaches for the vast majority of the problems faced here. We've just wrapped these mostly standard approaches up into reusable functions, hopefully to save everyone doing linkage both some time and headaches! 

In [1]:
# to start, install dlh_utils if not installed already. Notice the '-U' argument to upgrade existing installations. 
!pip3 install -U 'dlh_utils'

Looking in indexes: http://sccm_functional:****@art-p-01/artifactory/api/pypi/yr-python/simple


In [2]:
# import necessary libraries
import pyspark.sql.functions as F
import pandas as pd

from dlh_utils import utilities
from dlh_utils import dataframes
from dlh_utils import linkage
from dlh_utils import standardisation
from dlh_utils import sessions
from dlh_utils import profiling
from dlh_utils import flags

In [3]:
# you can use our sessions module to set up your spark session
# this will also create a Spark UI, which you can use to track your code's efficiency
spark = sessions.getOrCreateSparkSession(appName = 'dlh_utils_demo', size = 'medium')

In [4]:
# read in raw data
census = pd.read_csv("/home/cdsw/dlh_utils_demo/census_residents.csv")
ccs = pd.read_csv("/home/cdsw/dlh_utils_demo/ccs_perturbed.csv")

# note, if this was stored in Hue, the read_format() function from the utilities module would've been useful

# for demo purposes, let's convert this to a spark df using utilities
census = utilities.pandas_to_spark(census)
ccs = utilities.pandas_to_spark(ccs)

To give a quick overview of the features of our data, we can use the **describe()** function from the profiling module:

In [5]:
descriptive_census = profiling.df_describe(census,
                                           output_mode = 'pandas',
                                           approx_distinct = False,
                                           rsd = 0.05
                                           )
descriptive_census

Unnamed: 0,variable,type,row_count,distinct,percent_distinct,null,percent_null,not_null,percent_not_null,empty,percent_empty,min,max,min_l,max_l,max_l_before_point,min_l_before_point,max_l_after_point,min_l_after_point
0,Address,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,19,51,,,,
1,ENUM_FNAME,string,100001,1092,1.091989,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
2,ENUM_SNAME,string,100001,1493,1.492985,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
3,ID,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,20,20,,,,
4,Marital_Status,string,100001,6,0.006,0,0.0,100001,100.0,0,0.0,,,3,17,,,,
5,Postcode,string,100001,99457,99.456005,0,0.0,100001,100.0,0,0.0,,,6,8,,,,
6,Sex,string,100001,10,0.01,0,0.0,100001,100.0,5636,5.635944,,,1,6,,,,
7,Resident_Day_Of_Birth,bigint,100001,31,0.031,0,0.0,100001,100.0,0,0.0,1.0,31.0,1,2,,,,
8,Resident_Month_Of_Birth,bigint,100001,12,0.012,0,0.0,100001,100.0,0,0.0,1.0,12.0,1,2,,,,
9,Resident_Year_Of_Birth,bigint,100001,92,0.091999,0,0.0,100001,100.0,0,0.0,1856.0,2022.0,4,4,,,,


From this we can see that we have a percentage distinct in our sex variable far from 50% which we would expect. This could suggest a high level of missingness, but we can see from the rest of the output that we don't have any missing or null sex values, suggesting some have been incorrectly coded or skewed in the data.

We can also see that, whilst there are no nulls, the Sex variable contains a lot of empty values. This suggests there are different definitions for nulls, which we can cast to True Nones later when we standardise the data. 

On bigger data, these observations can give quick insights into which variables may be the most/least useful for matching. 

The **value_counts()** functions shows the top or bottom n values in our data, and is another crucial step when profiling our data. Along with cleaning and standardising our data, this is one of the most time consuming part of data linkage.  

Value counts can give us an overview of the different types of missingness in these variables, which will be useful when we come to standardise missingness in our data later.

Whilst it is important to know and understand both sets of data, for the purpose of the demo we will only look at the CCS dataset. 

In [6]:
value_counts_ccs = profiling.value_counts(ccs,
                                         limit = 5,
                                         output_mode = 'pandas'
                                         )

# the value counts function returns two dataframes; one for the top n values in each variable and one for the bottom n values. 
# we can select the top value count dataframe by subsetting the value_counts_ccs tuple:

value_counts_ccs[0]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,-9,50,-9,48,-7,51,-7,36,Single,284,...,Female,229,-9,45,-9,43,-7,37,c1289733399550728998,1
1,"Studio 4\nSimpson glens, Lake Paul",4,Victoria,10,Smith,28,c7949424308517863587,3,###,136,...,Male,213,1938,22,84,21,16-11-1994,4,c1406628632687313907,1
2,"Studio 36\nKing forges, Rileyburgh",3,Tracey,7,Jones,16,c5305314753251312254,3,Divorced,128,...,M,135,2020,21,11,20,04-09-1955,4,c1462481395779002923,1
3,"Studio 5\nFuller burgs, New Lindsey",3,Howard,7,Roberts,13,c5873915529124500787,3,Married,116,...,F,133,2005,21,17,20,23-08-1963,4,c1610482117117913758,1
4,"Studio 4\nDiane underpass, Eleanorton",3,Glen,6,Taylor,12,c3119031928535250479,2,Civil partnership,110,...,-7,97,1997,20,2,20,10-03-1987,4,c1631997721075661206,1


In [7]:
# we can do the same for the bottom values: 

value_counts_ccs[1]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,"1 Jones centers, Cooperhaven",1,Lauren,1,ChapmKan,1,C5979341665824936717,1,Civil partnerQship,1,...,1,44,1939,2,83,3,07-03-2020,1,c1289733399550728998,1
1,"45 Alan plains, Denisshire",1,Mr Julie,1,Der-Barton,1,c1045458737369U422279,1,Divorce?d,1,...,2,54,1975,5,43,4,13-03-1955,1,c1406628632687313907,1
2,"1 Murray meadows, West Jessica",1,LeslEy,1,D\er-Hunt,1,c1364885074130231367,1,Si0ngle,1,...,,58,1979,5,46,5,08-02-1972,1,c1462481395779002923,1
3,"0 Higgins glens, North Joshuaborough",1,Harry,1,Sinclair,1,c1874534031179144583,1,"si,ngle",1,...,-9,63,2022,6,49,6,02-06-1987,1,c1610482117117913758,1
4,"28 James row, New Tobystad",1,Lydia,1,Heath,1,c2060046012088437501,1,DivorcSed,1,...,NAN,70,1973,6,10,6,08-09-1997,1,c1631997721075661206,1


To flag out of scope values in our data, we can use the **flag()** function:

In [46]:
# This will flag invalid values in our data, for example, by seeing if there is anyone over the age of 100 in our data:

out_of_scope = flags.flag(df = census,
                          ref_col = 'Resident_Age',
                          condition = '>=',
                          condition_value = '110',
                          alias = None,
                          prefix = 'FLAG',
                          fill_null = None
                         )

out_of_scope

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB
Studio 48 Cooper ...,Mrs Margaret,Ross,c4064232788196233825,NAN,CV25 4ZY,,6,8,1956,66,06/08/1956
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,E2 0LP,-7,29,12,2013,9,29/12/2013
"04 Lane shores, S...",Mrs Eric,Bibi,c8205386463232611653,Single,DY01 1TR,Female,11,10,2016,6,11/10/2016
"7 Noble valley, L...",Diane,Kent,c2381984462771197706,###,SO2P 9WS,Male,28,1,1972,50,28/01/1972
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,EH4P 9RN,Female,26,11,1966,56,26/11/1966
0 Jeremy mountain...,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,3,3,1963,59,03/03/1963
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,9,4,1960,62,09/04/1960
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,FY9W 4RU,F,28,11,1981,41,28/11/1981
Studio 73 Clayton...,Hazel,Der-Barton,c9188328200772085537,Married,LN96 9XA,Male,13,4,1947,75,13/04/1947
Flat 92B Ross exp...,Harriet,Chapman,c1862566591390004870,Single,S0D 9AX,M,8,11,2001,21,08/11/2001


We can see we have supercentenarian Ben in our data, which is probably wrong, but we've also got a few different date types that have been flagged as well. 

If you are working with larger data, the **flag_check()** and **flag_summary()** functions can produce more detailed flag metrics that will help you spot issues like this more readily. 

Let's move on to cleaning and standardising where we can start to deal with these issues.

In [47]:
census

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB
Studio 48 Cooper ...,Mrs Margaret,Ross,c4064232788196233825,NAN,CV25 4ZY,,6,8,1956,66,06/08/1956
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,E2 0LP,-7,29,12,2013,9,29/12/2013
"04 Lane shores, S...",Mrs Eric,Bibi,c8205386463232611653,Single,DY01 1TR,Female,11,10,2016,6,11/10/2016
"7 Noble valley, L...",Diane,Kent,c2381984462771197706,###,SO2P 9WS,Male,28,1,1972,50,28/01/1972
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,EH4P 9RN,Female,26,11,1966,56,26/11/1966
0 Jeremy mountain...,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,3,3,1963,59,03/03/1963
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,9,4,1960,62,09/04/1960
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,FY9W 4RU,F,28,11,1981,41,28/11/1981
Studio 73 Clayton...,Hazel,Der-Barton,c9188328200772085537,Married,LN96 9XA,Male,13,4,1947,75,13/04/1947
Flat 92B Ross exp...,Harriet,Chapman,c1862566591390004870,Single,S0D 9AX,M,8,11,2001,21,08/11/2001


In [48]:
ccs

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,c4064232788196233825,NAN,CV25 4ZYC,,1956,66,06-08-1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,-7,M,2013,9,29-12-2013,c7839596180442651345
-9,Mrs Eric,Bibi,c8205386463232611653,Single,SP26 8TN,Female,2016,6,21-09-1994,c3258728696626565719
"7 Noble valley, ...",Mrs Margaret,Kent,c2381984462771197706,###,SO2P 9WS,Male,1972,50,28-01-1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,eH4p 9rn,Female,1966,56,26-11-1966,c1945351111358374057
-9,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,1963,59,09-07-1934,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,1960,62,-7,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,L2 6LG,F,1981,41,28-11-1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,c9188328200772085537,-9,LN96 9XA,Male,1947,-9,16-11-1994,c1293143515607798169
"311 Eric track, L...",Hayley,ChapmKan,c1862566591390004870,Single,S0D 9AX,M,2001,21,08-11-2001,c1864186096263678574


# Data Cleaning & Standardisation

In [77]:
# Looks like there is a new line character in address - this will need to be removed
# We can replace these '\n' values with spaces:

census = standardisation.reg_replace(df = census, dic = {' ': '\n'})
ccs = standardisation.reg_replace(df = ccs, dic = {' ': '\n'})

ccs.select('Address')

AnalysisException: "cannot resolve '`Address`' given input columns: [Resident_ID_df2_ccs, Resident_Year_Of_Birth_df2_ccs, POSTCODE_df2_ccs, Address_df2_ccs, ENUM_FNAME_df2_ccs, TOWN_df2_ccs, ENUM_SNAME_df2_ccs, Resident_Age_df2_ccs, SEX_df2_ccs, DOB_df2_ccs, STREET_df2_ccs, Marital_Status_df2_ccs, ID_CCS_df2_ccs, PC_DISTRICT_df2_ccs];;\n'Project ['Address]\n+- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14704, DOB_df2_ccs#14719, Resident_ID_df2_ccs#14734, STREET_df2_ccs#14749, TOWN_df2_ccs#14764, regexp_replace(PC_DISTRICT_df2_ccs#14329, \n,  ) AS PC_DISTRICT_df2_ccs#14779]\n   +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14704, DOB_df2_ccs#14719, Resident_ID_df2_ccs#14734, STREET_df2_ccs#14749, regexp_replace(TOWN_df2_ccs#14314, \n,  ) AS TOWN_df2_ccs#14764, PC_DISTRICT_df2_ccs#14329]\n      +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14704, DOB_df2_ccs#14719, Resident_ID_df2_ccs#14734, regexp_replace(STREET_df2_ccs#14299, \n,  ) AS STREET_df2_ccs#14749, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n         +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14704, DOB_df2_ccs#14719, regexp_replace(Resident_ID_df2_ccs#14284, \n,  ) AS Resident_ID_df2_ccs#14734, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n            +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14704, regexp_replace(DOB_df2_ccs#14269, \n,  ) AS DOB_df2_ccs#14719, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n               +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14689, regexp_replace(Resident_Age_df2_ccs#14254, \n,  ) AS Resident_Age_df2_ccs#14704, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                  +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, SEX_df2_ccs#14674, regexp_replace(Resident_Year_Of_Birth_df2_ccs#14239, \n,  ) AS Resident_Year_Of_Birth_df2_ccs#14689, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                     +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14659, regexp_replace(SEX_df2_ccs#14224, \n,  ) AS SEX_df2_ccs#14674, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                        +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14644, regexp_replace(POSTCODE_df2_ccs#14209, \n,  ) AS POSTCODE_df2_ccs#14659, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                           +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14629, regexp_replace(Marital_Status_df2_ccs#14194, \n,  ) AS Marital_Status_df2_ccs#14644, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                              +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14614, regexp_replace(ID_CCS_df2_ccs#14179, \n,  ) AS ID_CCS_df2_ccs#14629, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                                 +- Project [Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14599, regexp_replace(ENUM_SNAME_df2_ccs#14164, \n,  ) AS ENUM_SNAME_df2_ccs#14614, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                                    +- Project [Address_df2_ccs#14584, regexp_replace(ENUM_FNAME_df2_ccs#14149, \n,  ) AS ENUM_FNAME_df2_ccs#14599, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                                       +- Project [regexp_replace(Address_df2_ccs#14134, \n,  ) AS Address_df2_ccs#14584, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2_ccs#14329]\n                                          +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2_ccs#14314, PC_DISTRICT_df2#13550 AS PC_DISTRICT_df2_ccs#14329]\n                                             +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2_ccs#14299, TOWN_df2#13535 AS TOWN_df2_ccs#14314, PC_DISTRICT_df2#13550]\n                                                +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2_ccs#14284, STREET_df2#13520 AS STREET_df2_ccs#14299, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                   +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2_ccs#14269, Resident_ID_df2#13505 AS Resident_ID_df2_ccs#14284, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                      +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2_ccs#14254, DOB_df2#13490 AS DOB_df2_ccs#14269, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                         +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2#13475 AS Resident_Age_df2_ccs#14254, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                            +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2#13460 AS Resident_Year_Of_Birth_df2_ccs#14239, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                               +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2_ccs#14209, SEX_df2#13445 AS SEX_df2_ccs#14224, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                  +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2_ccs#14194, POSTCODE_df2#13430 AS POSTCODE_df2_ccs#14209, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                     +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2_ccs#14179, Marital_Status_df2#13415 AS Marital_Status_df2_ccs#14194, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                        +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2_ccs#14164, ID_CCS_df2#13400 AS ID_CCS_df2_ccs#14179, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                           +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2#13385 AS ENUM_SNAME_df2_ccs#14164, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                              +- Project [Address_df2_ccs#14134, ENUM_FNAME_df2#13370 AS ENUM_FNAME_df2_ccs#14149, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                                 +- Project [Address_df2#13355 AS Address_df2_ccs#14134, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT_df2#13550]\n                                                                                    +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN_df2#13535, PC_DISTRICT#13069 AS PC_DISTRICT_df2#13550]\n                                                                                       +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET_df2#13520, TOWN#12953 AS TOWN_df2#13535, PC_DISTRICT#13069]\n                                                                                          +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID_df2#13505, STREET#12922 AS STREET_df2#13520, TOWN#12953, PC_DISTRICT#13069]\n                                                                                             +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB_df2#13490, Resident_ID#12624 AS Resident_ID_df2#13505, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age_df2#13475, DOB#12281 AS DOB_df2#13490, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                   +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth_df2#13460, Resident_Age#12612 AS Resident_Age_df2#13475, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                      +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX_df2#13445, Resident_Year_Of_Birth#12600 AS Resident_Year_Of_Birth_df2#13460, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                         +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE_df2#13430, SEX#12588 AS SEX_df2#13445, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                            +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status_df2#13415, POSTCODE#13084 AS POSTCODE_df2#13430, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                               +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS_df2#13400, Marital_Status#12564 AS Marital_Status_df2#13415, POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                  +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME_df2#13385, ID_CCS#12552 AS ID_CCS_df2#13400, Marital_Status#12564, POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                     +- Project [Address_df2#13355, ENUM_FNAME_df2#13370, ENUM_SNAME#12540 AS ENUM_SNAME_df2#13385, ID_CCS#12552, Marital_Status#12564, POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                        +- Project [Address_df2#13355, ENUM_FNAME#12528 AS ENUM_FNAME_df2#13370, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                           +- Project [Address#12649 AS Address_df2#13355, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                              +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, reverse(POSTCODE#13055) AS POSTCODE#13084, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, PC_DISTRICT#13069]\n                                                                                                                                 +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, POSTCODE#13055, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953, reverse(substring(POSTCODE#13055, 4, 4)) AS PC_DISTRICT#13069]\n                                                                                                                                    +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, reverse(POSTCODE#12576) AS POSTCODE#13055, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953]\n                                                                                                                                       +- Deduplicate [DOB#12281, TOWN#12953, Marital_Status#12564, Resident_Age#12612, ID_CCS#12552, Address#12649, ENUM_FNAME#12528, SEX#12588, STREET#12922, Postcode#12576, ENUM_SNAME#12540, Resident_Year_Of_Birth#12600, Resident_ID#12624]\n                                                                                                                                          +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, TOWN#12953]\n                                                                                                                                             +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12922, ADDRESS_SPLIT#12882[1] AS TOWN#12953, ADDRESS_SPLIT#12882]\n                                                                                                                                                +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, ADDRESS_SPLIT#12882[0] AS STREET#12922, TOWN#12810, ADDRESS_SPLIT#12882]\n                                                                                                                                                   +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12780, TOWN#12810, CASE WHEN (isnull(ADDRESS#12649) || isnan(cast(ADDRESS#12649 as double))) THEN cast(null as array<string>) ELSE split(ADDRESS#12649, ,) END AS ADDRESS_SPLIT#12882]\n                                                                                                                                                      +- Deduplicate [DOB#12281, TOWN#12810, Marital_Status#12564, Resident_Age#12612, ID_CCS#12552, Address#12649, ENUM_FNAME#12528, SEX#12588, STREET#12780, Postcode#12576, ENUM_SNAME#12540, Resident_Year_Of_Birth#12600, Resident_ID#12624]\n                                                                                                                                                         +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, STREET#12780, TOWN#12810]\n                                                                                                                                                            +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, ADDRESS_SPLIT#12743, STREET#12780, ADDRESS_SPLIT#12743[1] AS TOWN#12810]\n                                                                                                                                                               +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, ADDRESS_SPLIT#12743, ADDRESS_SPLIT#12743[0] AS STREET#12780]\n                                                                                                                                                                  +- Project [Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624, CASE WHEN (isnull(ADDRESS#12649) || isnan(cast(ADDRESS#12649 as double))) THEN cast(null as array<string>) ELSE split(ADDRESS#12649, ,) END AS ADDRESS_SPLIT#12743]\n                                                                                                                                                                     +- Project [regexp_replace(Address#12318, [^A-Za-z0-9 ,], ) AS Address#12649, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, Resident_ID#12624]\n                                                                                                                                                                        +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, Resident_Age#12612, DOB#12281, regexp_replace(Resident_ID#12293, [^A-Za-z0-9 ], ) AS Resident_ID#12624]\n                                                                                                                                                                           +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, Resident_Year_Of_Birth#12600, regexp_replace(Resident_Age#12269, [^A-Za-z0-9 ], ) AS Resident_Age#12612, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                              +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, SEX#12588, regexp_replace(Resident_Year_Of_Birth#12257, [^A-Za-z0-9 ], ) AS Resident_Year_Of_Birth#12600, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                 +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, Postcode#12576, regexp_replace(SEX#12245, [^A-Za-z0-9 ], ) AS SEX#12588, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                    +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, Marital_Status#12564, regexp_replace(Postcode#12233, [^A-Za-z0-9 ], ) AS Postcode#12576, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                       +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, ID_CCS#12552, regexp_replace(Marital_Status#12221, [^A-Za-z0-9 ], ) AS Marital_Status#12564, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                          +- Project [Address#12318, ENUM_FNAME#12528, ENUM_SNAME#12540, regexp_replace(ID_CCS#12209, [^A-Za-z0-9 ], ) AS ID_CCS#12552, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                             +- Project [Address#12318, ENUM_FNAME#12528, regexp_replace(ENUM_SNAME#12197, [^A-Za-z0-9 ], ) AS ENUM_SNAME#12540, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                                +- Project [Address#12318, regexp_replace(ENUM_FNAME#12185, [^A-Za-z0-9 ], ) AS ENUM_FNAME#12528, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                                   +- Project [regexp_replace(Address#11993, [^A-Za-z0-9 ,], ) AS Address#12318, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, Resident_ID#12293]\n                                                                                                                                                                                                      +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, DOB#12281, regexp_replace(Resident_ID#11968, [^A-Za-z0-9 ], ) AS Resident_ID#12293]\n                                                                                                                                                                                                         +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, Resident_Age#12269, regexp_replace(DOB#11956, [^A-Za-z0-9 ], ) AS DOB#12281, Resident_ID#11968]\n                                                                                                                                                                                                            +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, Resident_Year_Of_Birth#12257, regexp_replace(Resident_Age#11944, [^A-Za-z0-9 ], ) AS Resident_Age#12269, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                               +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, SEX#12245, regexp_replace(Resident_Year_Of_Birth#11932, [^A-Za-z0-9 ], ) AS Resident_Year_Of_Birth#12257, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                  +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, Postcode#12233, regexp_replace(SEX#11920, [^A-Za-z0-9 ], ) AS SEX#12245, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                     +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, Marital_Status#12221, regexp_replace(Postcode#11908, [^A-Za-z0-9 ], ) AS Postcode#12233, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                        +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, ID_CCS#12209, regexp_replace(Marital_Status#11896, [^A-Za-z0-9 ], ) AS Marital_Status#12221, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                           +- Project [Address#11993, ENUM_FNAME#12185, ENUM_SNAME#12197, regexp_replace(ID_CCS#11884, [^A-Za-z0-9 ], ) AS ID_CCS#12209, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                              +- Project [Address#11993, ENUM_FNAME#12185, regexp_replace(ENUM_SNAME#11872, [^A-Za-z0-9 ], ) AS ENUM_SNAME#12197, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                                 +- Project [Address#11993, regexp_replace(ENUM_FNAME#11860, [^A-Za-z0-9 ], ) AS ENUM_FNAME#12185, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                                    +- Project [regexp_replace(Address#11324, [^A-Za-z0-9 ,], ) AS Address#11993, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, Resident_ID#11968]\n                                                                                                                                                                                                                                       +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, DOB#11956, regexp_replace(Resident_ID#11432, [^A-Za-z0-9 ], ) AS Resident_ID#11968]\n                                                                                                                                                                                                                                          +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, Resident_Age#11944, regexp_replace(DOB#11420, [^A-Za-z0-9 ], ) AS DOB#11956, Resident_ID#11432]\n                                                                                                                                                                                                                                             +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, Resident_Year_Of_Birth#11932, regexp_replace(Resident_Age#11408, [^A-Za-z0-9 ], ) AS Resident_Age#11944, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, SEX#11920, regexp_replace(Resident_Year_Of_Birth#11396, [^A-Za-z0-9 ], ) AS Resident_Year_Of_Birth#11932, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                   +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, Postcode#11908, regexp_replace(cast(SEX#10419 as string), [^A-Za-z0-9 ], ) AS SEX#11920, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                      +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, Marital_Status#11896, regexp_replace(Postcode#11384, [^A-Za-z0-9 ], ) AS Postcode#11908, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                         +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, ID_CCS#11884, regexp_replace(Marital_Status#11372, [^A-Za-z0-9 ], ) AS Marital_Status#11896, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                            +- Project [Address#11324, ENUM_FNAME#11860, ENUM_SNAME#11872, regexp_replace(ID_CCS#11360, [^A-Za-z0-9 ], ) AS ID_CCS#11884, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                               +- Project [Address#11324, ENUM_FNAME#11860, regexp_replace(ENUM_SNAME#11637, [^A-Za-z0-9 ], ) AS ENUM_SNAME#11872, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                  +- Project [Address#11324, regexp_replace(ENUM_FNAME#11625, [^A-Za-z0-9 ], ) AS ENUM_FNAME#11860, ENUM_SNAME#11637, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                     +- Project [Address#11324, ENUM_FNAME#11625, regexp_replace(ENUM_SNAME#11550, [0-9], ) AS ENUM_SNAME#11637, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                        +- Project [Address#11324, regexp_replace(ENUM_FNAME#11538, [0-9], ) AS ENUM_FNAME#11625, ENUM_SNAME#11550, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                           +- Project [Address#11324, ENUM_FNAME#11538, regexp_replace(ENUM_SNAME#11348, [0-9], ) AS ENUM_SNAME#11550, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                              +- Project [Address#11324, regexp_replace(ENUM_FNAME#11336, [0-9], ) AS ENUM_FNAME#11538, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, Resident_ID#11432]\n                                                                                                                                                                                                                                                                                 +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, DOB#11420, trim(Resident_ID#11049, None) AS Resident_ID#11432]\n                                                                                                                                                                                                                                                                                    +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, Resident_Age#11408, trim(DOB#11024, None) AS DOB#11420, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                       +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, Resident_Year_Of_Birth#11396, trim(Resident_Age#10999, None) AS Resident_Age#11408, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                          +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, Postcode#11384, SEX#10419, trim(Resident_Year_Of_Birth#10974, None) AS Resident_Year_Of_Birth#11396, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                             +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, Marital_Status#11372, trim(Postcode#10948, None) AS Postcode#11384, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, ID_CCS#11360, trim(Marital_Status#10923, None) AS Marital_Status#11372, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                   +- Project [Address#11324, ENUM_FNAME#11336, ENUM_SNAME#11348, trim(ID_CCS#10898, None) AS ID_CCS#11360, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                      +- Project [Address#11324, ENUM_FNAME#11336, trim(ENUM_SNAME#10873, None) AS ENUM_SNAME#11348, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                         +- Project [Address#11324, trim(ENUM_FNAME#10848, None) AS ENUM_FNAME#11336, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                            +- Project [trim(Address#11101, None) AS Address#11324, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                               +- Project [regexp_replace(Address#11089, \\s+,  ) AS Address#11101, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                                  +- Project [trim(Address#10086, None) AS Address#11089, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                                     +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, regexp_replace(Resident_ID#11037, \\s+, ) AS Resident_ID#11049]\n                                                                                                                                                                                                                                                                                                                        +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, DOB#11024, trim(Resident_ID#10206, None) AS Resident_ID#11037]\n                                                                                                                                                                                                                                                                                                                           +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, regexp_replace(DOB#11012, \\s+, ) AS DOB#11024, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                              +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, Resident_Age#10999, trim(DOB#10194, None) AS DOB#11012, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                 +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, regexp_replace(Resident_Age#10987, \\s+, ) AS Resident_Age#10999, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                    +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10974, trim(Resident_Age#10182, None) AS Resident_Age#10987, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                       +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, regexp_replace(Resident_Year_Of_Birth#10962, \\s+, ) AS Resident_Year_Of_Birth#10974, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                          +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, Postcode#10948, SEX#10419, trim(Resident_Year_Of_Birth#10170, None) AS Resident_Year_Of_Birth#10962, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                             +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, regexp_replace(Postcode#10936, \\s+, ) AS Postcode#10948, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, Marital_Status#10923, trim(Postcode#10146, None) AS Postcode#10936, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                   +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, regexp_replace(Marital_Status#10911, \\s+, ) AS Marital_Status#10923, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                      +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, ID_CCS#10898, trim(Marital_Status#10134, None) AS Marital_Status#10911, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                         +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, regexp_replace(ID_CCS#10886, \\s+, ) AS ID_CCS#10898, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                            +- Project [Address#10086, ENUM_FNAME#10848, ENUM_SNAME#10873, trim(ID_CCS#10122, None) AS ID_CCS#10886, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                               +- Project [Address#10086, ENUM_FNAME#10848, regexp_replace(ENUM_SNAME#10861, \\s+, ) AS ENUM_SNAME#10873, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                  +- Project [Address#10086, ENUM_FNAME#10848, trim(ENUM_SNAME#10484, None) AS ENUM_SNAME#10861, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                     +- Project [Address#10086, regexp_replace(ENUM_FNAME#10836, \\s+, ) AS ENUM_FNAME#10848, ENUM_SNAME#10484, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                        +- Project [Address#10086, trim(ENUM_FNAME#10459, None) AS ENUM_FNAME#10836, ENUM_SNAME#10484, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                           +- Project [Address#10086, ENUM_FNAME#10459, regexp_replace(ENUM_SNAME#10110, \\bNO SURNAME\\b|\\bSURNAME\\b|(?<=\\bDE)[ -]|(?<=\\bDA)[ -]|(?<=\\bDU)[ -]|(?<=\\bST)[ -]|(?<=\\bMC)[ -]|(?<=\\bMAC)[ -]|(?<=\\bVAN)[ -]|(?<=\\bVON)[ -]|(?<=\\bLA)[ -]|(?<=\\bLE)[ -]|(?<=\\bO)[ -]|(?<=\\bAL)[ -]|(?<=\\bDER)[ -]|(?<=\\bEL)[ -]|(?<=\\bDI)[ -]|(?<=\\bDEL)[ -]|(?<=\\bUL)[ -]|(?<=\\bBIN)[ -]|(?<=\\bSAN)[ -]|(?<=\\bBA)[ -], ) AS ENUM_SNAME#10484, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                              +- Project [Address#10086, regexp_replace(ENUM_FNAME#10098, \\bMR\\b|\\bMRS\\b|\\bDR\\b|\\bMISS\\b|\\bNO NAME\\b|\\bNAME\\b|\\bFORENAME\\b|\\bMS\\b|\\bMSTR\\b|\\bPROF\\b|\\bSIR\\b|\\bLADY\\b, ) AS ENUM_FNAME#10459, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                                 +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, cast(SEX#10324 as int) AS SEX#10419, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                                    +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, regexp_replace(SEX#10312, ^F$|^FEMALE$, 2) AS SEX#10324, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                                       +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, regexp_replace(SEX#10158, ^M$|^MALE$, 1) AS SEX#10312, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                                          +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, Sex#10158, Resident_Year_Of_Birth#10170, Resident_Age#10182, DOB#10194, CASE WHEN Resident_ID#9850 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Resident_ID#9850 END AS Resident_ID#10206]\n                                                                                                                                                                                                                                                                                                                                                                                             +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, Sex#10158, Resident_Year_Of_Birth#10170, Resident_Age#10182, CASE WHEN DOB#9837 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE DOB#9837 END AS DOB#10194, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, Sex#10158, Resident_Year_Of_Birth#10170, CASE WHEN Resident_Age#9824 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Resident_Age#9824 END AS Resident_Age#10182, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                   +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, Sex#10158, CASE WHEN Resident_Year_Of_Birth#9811 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Resident_Year_Of_Birth#9811 END AS Resident_Year_Of_Birth#10170, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                      +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, Postcode#10146, CASE WHEN Sex#9798 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Sex#9798 END AS Sex#10158, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                         +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, Marital_Status#10134, CASE WHEN Postcode#9785 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Postcode#9785 END AS Postcode#10146, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                            +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, ID_CCS#10122, CASE WHEN Marital_Status#9772 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Marital_Status#9772 END AS Marital_Status#10134, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                               +- Project [Address#10086, ENUM_FNAME#10098, ENUM_SNAME#10110, CASE WHEN ID_CCS#9759 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE ID_CCS#9759 END AS ID_CCS#10122, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                                  +- Project [Address#10086, ENUM_FNAME#10098, CASE WHEN ENUM_SNAME#9746 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE ENUM_SNAME#9746 END AS ENUM_SNAME#10110, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                                     +- Project [Address#10086, CASE WHEN ENUM_FNAME#9733 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE ENUM_FNAME#9733 END AS ENUM_FNAME#10098, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                                        +- Project [CASE WHEN Address#9720 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$|^###$ THEN cast(null as string) ELSE Address#9720 END AS Address#10086, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                                           +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, DOB#9837, upper(Resident_ID#9369) AS Resident_ID#9850]\n                                                                                                                                                                                                                                                                                                                                                                                                                              +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, Resident_Age#9824, upper(DOB#9398) AS DOB#9837, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                 +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, Resident_Year_Of_Birth#9811, upper(Resident_Age#9345) AS Resident_Age#9824, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                    +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, Sex#9798, upper(Resident_Year_Of_Birth#9333) AS Resident_Year_Of_Birth#9811, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                       +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, Postcode#9785, upper(Sex#9321) AS Sex#9798, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                          +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, Marital_Status#9772, upper(Postcode#9309) AS Postcode#9785, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                             +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, ID_CCS#9759, upper(Marital_Status#9297) AS Marital_Status#9772, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                +- Project [Address#9720, ENUM_FNAME#9733, ENUM_SNAME#9746, upper(ID_CCS#9517) AS ID_CCS#9759, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                   +- Project [Address#9720, ENUM_FNAME#9733, upper(ENUM_SNAME#9273) AS ENUM_SNAME#9746, ID_CCS#9517, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                      +- Project [Address#9720, upper(ENUM_FNAME#9261) AS ENUM_FNAME#9733, ENUM_SNAME#9273, ID_CCS#9517, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                         +- Project [upper(Address#9249) AS Address#9720, ENUM_FNAME#9261, ENUM_SNAME#9273, ID_CCS#9517, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                            +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285 AS ID_CCS#9517, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                               +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, from_unixtime(DOB#9386L, dd/MM/yyyy, Some(Europe/London)) AS DOB#9398, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                  +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, unix_timestamp(DOB#9357, dd-MM-yyyy, Some(Europe/London)) AS DOB#9386L, Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                     +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, DOB#9357, regexp_replace(Resident_ID#6705, \n,  ) AS Resident_ID#9369]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                        +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, Resident_Age#9345, regexp_replace(DOB#6704, \n,  ) AS DOB#9357, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                           +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, Resident_Year_Of_Birth#9333, regexp_replace(cast(Resident_Age#6703L as string), \n,  ) AS Resident_Age#9345, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                              +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, Sex#9321, regexp_replace(cast(Resident_Year_Of_Birth#6702L as string), \n,  ) AS Resident_Year_Of_Birth#9333, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, Postcode#9309, regexp_replace(Sex#6701, \n,  ) AS Sex#9321, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, Marital_Status#9297, regexp_replace(Postcode#6700, \n,  ) AS Postcode#9309, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, ID#9285, regexp_replace(Marital_Status#6699, \n,  ) AS Marital_Status#9297, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          +- Project [Address#9249, ENUM_FNAME#9261, ENUM_SNAME#9273, regexp_replace(ID#6698, \n,  ) AS ID#9285, Marital_Status#6699, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             +- Project [Address#9249, ENUM_FNAME#9261, regexp_replace(ENUM_SNAME#6697, \n,  ) AS ENUM_SNAME#9273, ID#6698, Marital_Status#6699, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                +- Project [Address#9249, regexp_replace(ENUM_FNAME#6696, \n,  ) AS ENUM_FNAME#9261, ENUM_SNAME#6697, ID#6698, Marital_Status#6699, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   +- Project [regexp_replace(Address#6695, \n,  ) AS Address#9249, ENUM_FNAME#6696, ENUM_SNAME#6697, ID#6698, Marital_Status#6699, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705]\n                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      +- LogicalRDD [Address#6695, ENUM_FNAME#6696, ENUM_SNAME#6697, ID#6698, Marital_Status#6699, Postcode#6700, Sex#6701, Resident_Year_Of_Birth#6702L, Resident_Age#6703L, DOB#6704, Resident_ID#6705], false\n"

Let's standardise the date format to be consistent across our data in a **ddMMyyyy** format:

In [50]:
ccs = standardisation.standardise_date(ccs, col_name = "DOB", in_date_format = "dd-MM-yyyy", out_date_format = "dd/MM/yyyy")

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
Studio 48 Cooper ...,-9,Ross,c4064232788196233825,NAN,CV25 4ZYC,,1956,66,06/08/1956,c2026847926404610461
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,-7,M,2013,9,29/12/2013,c7839596180442651345
-9,Mrs Eric,Bibi,c8205386463232611653,Single,SP26 8TN,Female,2016,6,21/09/1994,c3258728696626565719
"7 Noble valley, ...",Mrs Margaret,Kent,c2381984462771197706,###,SO2P 9WS,Male,1972,50,28/01/1972,c2287010195568088798
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,eH4p 9rn,Female,1966,56,26/11/1966,c1945351111358374057
-9,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,1963,59,09/07/1934,c7831454145019129197
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,1960,62,,c6030118478018109776
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,L2 6LG,F,1981,41,28/11/1981,c6446332115853614978
Studio 73 Clayton...,-9,Der-Barton,c9188328200772085537,-9,LN96 9XA,Male,1947,-9,16/11/1994,c1293143515607798169
"311 Eric track, L...",Hayley,ChapmKan,c1862566591390004870,Single,S0D 9AX,M,2001,21,08/11/2001,c1864186096263678574


Next, we have generic 'ID' columns in each dataset. We also have address and name variables named differently in each dataset. 

We can use **rename_columns()** from the dataframes module to rename all of these at once. 

In [51]:
census = dataframes.rename_columns(census, rename_dict = {"ID":"ID_Census","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})
ccs = dataframes.rename_columns(ccs, rename_dict = {"ID":"ID_CCS","FNAME":"FORENAME","SNAME":"SURNAME"})

ccs.columns

['Address',
 'ENUM_FNAME',
 'ENUM_SNAME',
 'ID_CCS',
 'Marital_Status',
 'Postcode',
 'Sex',
 'Resident_Year_Of_Birth',
 'Resident_Age',
 'DOB',
 'Resident_ID']

Now let's set all of variables to upper case for consistency, using **standardise_case()**:

In [52]:
census = standardisation.standardise_case(census)
ccs = standardisation.standardise_case(ccs)

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,-9,ROSS,C4064232788196233825,NAN,CV25 4ZYC,,1956,66,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,-7,M,2013,9,29/12/2013,C7839596180442651345
-9,MRS ERIC,BIBI,C8205386463232611653,SINGLE,SP26 8TN,FEMALE,2016,6,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, ...",MRS MARGARET,KENT,C2381984462771197706,###,SO2P 9WS,MALE,1972,50,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56,26/11/1966,C1945351111358374057
-9,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2,1963,59,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,FEMALE,1960,62,,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,L2 6LG,F,1981,41,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,-9,DER-BARTON,C9188328200772085537,-9,LN96 9XA,MALE,1947,-9,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,M,2001,21,08/11/2001,C1864186096263678574


Next, the values for missingness are all over the place. I can spot a few NaNs, minus 7/9s, hashtags, and whitespaces. Let's standardise missingness with the **standardise_null()** function. We can retrieve these null values from the previous **value_counts()** outputs: 

In [53]:
# we can use the standardise_null function to replace these with true None values:
# we use regex to do this: https://regex101.com/ 
census = standardisation.standardise_null(census, replace = "^NAN$|^NULL$|^\s*$|^-7$|^-9$|^###$")
ccs = standardisation.standardise_null(ccs, replace = "^NAN$|^NULL$|^\s*$|^-7$|^-9$|^###$")

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,Sex,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,,M,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,C8205386463232611653,SINGLE,SP26 8TN,FEMALE,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, ...",MRS MARGARET,KENT,C2381984462771197706,###,SO2P 9WS,MALE,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,FEMALE,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,FEMALE,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,L2 6LG,F,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,C9188328200772085537,,LN96 9XA,MALE,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,M,2001,21.0,08/11/2001,C1864186096263678574


Great, these now all show up as true nulls. 

Next, we have a mix of 1s, 2s, Ms, and Fs in our sex column. Let's standardise this to be either 1s or 2s. For this we can use **reg_replace()**:

In [54]:
# reg_replace() takes a dictionary, where the value is the regex to replace, and the key is what this will be replaced with
# so we're replacing 'M' with '1', and 'F' with '2':
census = standardisation.reg_replace(census, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})
ccs = standardisation.reg_replace(ccs, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,MRS DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,MRS ERIC,BIBI,C8205386463232611653,SINGLE,SP26 8TN,2.0,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, ...",MRS MARGARET,KENT,C2381984462771197706,###,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,MR GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,MRS CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2.0,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,MR KATIE,DER-ANDERSON,C7179219388724687888,DIVORCED,G88 6DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",MRS DENISE,KING,C3458599216452476033,DIVORCED,L2 6LG,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DER-BARTON,C9188328200772085537,,LN96 9XA,1.0,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Now that our sex column is populated with just 1s and 2s, we might want to change the type from string to integer. This can be done using the **cast_type()** function in the standardisation module:

In [55]:
# Casting strings to integer can increase performance as it makes the data type smaller, taking up less storage space

census = standardisation.cast_type(census, subset = ['SEX'], types = "integer")
ccs = standardisation.cast_type(ccs, subset = ['SEX'], types = "integer")

ccs.select('SEX').dtypes

[('SEX', 'int')]

SORT OUT COMMENTARY NOW HAVE RESHUFFLED

Next, let's focus on our name variables. Forenames still contain titles and some surnames have common prefixes like 'Van' or 'Der'. We can strip out titles and concatenate surname prefixes with our **clean_forename()** and **clean_surname()** functions. 

In [57]:
census = standardisation.clean_forename(census, subset = 'FORENAME')
ccs = standardisation.clean_forename(ccs, subset = 'FORENAME')

census = standardisation.clean_surname(census, subset = 'SURNAME')
ccs = standardisation.clean_surname(ccs, subset = 'SURNAME')

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV25 4ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,SP26 8TN,2.0,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, ...",MARGARET,KENT,C2381984462771197706,###,SO2P 9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVIL PARTNERSHIP,EH4P 9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F 3RE,2.0,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G88 6DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,L2 6LG,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN96 9XA,1.0,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D 9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Let's begin to have a look at our postcode, address and name variables. 

A lot of the variables contain multiple white spaces in a row. We can remove white spcaes altogether using the **standardise_white_space()** function, setting white space level (wsl) to none. 

However, we don't want to remove all of the white spaces in our address variable, as this would make the text unreadable. Instead, we can set wsl to one, meaning only gaps of 2 spaces or more will be removed.  

In [58]:
# Using a list comprehension, we remove all white spaces from all columns except Address
census = standardisation.standardise_white_space(census, 
                                                 subset = [column for column in census.columns if column != 'Address'], 
                                                 wsl = "none")
ccs = standardisation.standardise_white_space(ccs, 
                                              subset = [column for column in ccs.columns if column != 'Address'], 
                                              wsl = "none")

# Then we allow a single white space for the Address column
census = standardisation.standardise_white_space(census, subset = 'Address', wsl = "one")
ccs = standardisation.standardise_white_space(ccs, subset = 'Address', wsl = "one")

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,SP268TN,2.0,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, L...",MARGARET,KENT,C2381984462771197706,###,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,L26LG,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


We might have some leading/trailing whitespaces in some of our variables as they can be hard to spot, and therefore it's good practice to use **trim()** to remove these:

In [59]:
census = standardisation.trim(census)
ccs = standardisation.trim(ccs)

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,SP268TN,2.0,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, L...",MARGARET,KENT,C2381984462771197706,###,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,L26LG,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


Finally, let's strip out numbers from our name variables. Again, we can use the **reg_replace()** function for this:

In [61]:
census = standardisation.reg_replace(census, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})
ccs = standardisation.reg_replace(ccs, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254ZYC,,1956,66.0,06/08/1956,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29/12/2013,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,SP268TN,2.0,2016,6.0,21/09/1994,C3258728696626565719
"7 NOBLE VALLEY, L...",MARGARET,KENT,C2381984462771197706,###,SO2P9WS,1.0,1972,50.0,28/01/1972,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26/11/1966,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,09/07/1934,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,L26LG,2.0,1981,41.0,28/11/1981,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,16/11/1994,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,08/11/2001,C1864186096263678574


This still leaves apostrophes and hyphens in our name variables. The **remove_punct()** function can handle these. While we're at it, let's also use **remove_punct()** to get rid of dashes in our address field, but we'll have to specify the optional argument **keep** to make sure it doesn't strip out commas from addresses:

In [64]:
#First remove all punction from every column except address
census = standardisation.remove_punct(census, 
                                      subset = [column for column in census.columns if column not in ['Address','DOB']], 
                                      )

ccs = standardisation.remove_punct(ccs, 
                                   subset = [column for column in ccs.columns if column not in ['Address','DOB']]
                                  )


#Then remove the punctuation from address except for the commas
census = standardisation.remove_punct(census, subset = 'Address', keep = ',')
ccs = standardisation.remove_punct(ccs, subset = 'Address', keep = ',')

ccs

Address,ENUM_FNAME,ENUM_SNAME,ID_CCS,Marital_Status,Postcode,SEX,Resident_Year_Of_Birth,Resident_Age,DOB,Resident_ID
STUDIO 48 COOPER ...,,ROSS,C4064232788196233825,,CV254ZYC,,1956,66.0,6081956.0,C2026847926404610461
43 REBECCA STREET...,DARREN,BALDWIN,C7365350289112516537,SINGLE,,1.0,2013,9.0,29122013.0,C7839596180442651345
,ERIC,BIBI,C8205386463232611653,SINGLE,SP268TN,2.0,2016,6.0,21091994.0,C3258728696626565719
"7 NOBLE VALLEY, L...",MARGARET,KENT,C2381984462771197706,,SO2P9WS,1.0,1972,50.0,28011972.0,C2287010195568088798
57 PEARSON CORNER...,GRACE,BAKER,C6384487823194391043,CIVILPARTNERSHIP,EH4P9RN,2.0,1966,56.0,26111966.0,C1945351111358374057
,CHLOE,CHANDLER,C7777611692672993318,DIVORCED,G7F3RE,2.0,1963,59.0,9071934.0,C7831454145019129197
STUDIO 5 FULLER B...,KATIE,DERANDERSON,C7179219388724687888,DIVORCED,G886DB,2.0,1960,62.0,,C6030118478018109776
"644 GARRY WALK, B...",DENISE,KING,C3458599216452476033,DIVORCED,L26LG,2.0,1981,41.0,28111981.0,C6446332115853614978
STUDIO 73 CLAYTON...,,DERBARTON,C9188328200772085537,,LN969XA,1.0,1947,,16111994.0,C1293143515607798169
"311 ERIC TRACK, L...",HAYLEY,CHAPMKAN,C1862566591390004870,SINGLE,S0D9AX,1.0,2001,21.0,8112001.0,C1864186096263678574


# Derive Variables

We've got quite a few identifying variables that we can split out into further variables for matching. These can be useful if, for instance, records don't match on house number due to an error, but do match on street. 

First, let's derive street and town from the address variable. The **split()** function from the dataframes module will be useful here, splitting on comma. 

In [68]:
# This will create a new column called "ADDRESS_SPLIT" that contains an array of each address element, separated by a comma
census = dataframes.split(census, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")
ccs = dataframes.split(ccs, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")

ccs.select("ADDRESS", "ADDRESS_SPLIT")

+------------------------------------------+---------------------------------------------+
|ADDRESS                                   |ADDRESS_SPLIT                                |
+------------------------------------------+---------------------------------------------+
|57 PEARSON CORNER, JOANNABOROUGH          |[57 PEARSON CORNER,  JOANNABOROUGH]          |
|null                                      |null                                         |
|STUDIO 16Y SIMPSON DRIVE, THORNTONBOROUGH |[STUDIO 16Y SIMPSON DRIVE,  THORNTONBOROUGH] |
|STUDIO 41 BROWN MOUNTAIN, PORT DOUGLASLAND|[STUDIO 41 BROWN MOUNTAIN,  PORT DOUGLASLAND]|
|FLAT 5 HELEN LOAF, EAST JONATHANSIDE      |[FLAT 5 HELEN LOAF,  EAST JONATHANSIDE]      |
|FLAT 54P JENNINGS HIGHWAY, BARLOWPORT     |[FLAT 54P JENNINGS HIGHWAY,  BARLOWPORT]     |
|FLAT 55B JAKE ROUTE, PORT KYLEPORT        |[FLAT 55B JAKE ROUTE,  PORT KYLEPORT]        |
|STUDIO 33 MEGAN DAM, POWELLBURGH          |[STUDIO 33 MEGAN DAM,  POWELLBURGH]          |

In [69]:
# We can then select the first element of the 'split address' to create the 'street address' variable
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)

# The second element contains the town name, which we can append to a new column also 
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)

# Since we no longer need the 'ADDRESS_SPLIT' column, we can remove it using our drop_columns() function
census = dataframes.drop_columns(census, subset = 'ADDRESS_SPLIT')
ccs = dataframes.drop_columns(ccs, subset = 'ADDRESS_SPLIT')

ccs.select("ADDRESS", "STREET", "TOWN")

+------------------------------------------+--------------------------+------------------+
|ADDRESS                                   |STREET                    |TOWN              |
+------------------------------------------+--------------------------+------------------+
|57 PEARSON CORNER, JOANNABOROUGH          |57 PEARSON CORNER         | JOANNABOROUGH    |
|null                                      |null                      |null              |
|STUDIO 16Y SIMPSON DRIVE, THORNTONBOROUGH |STUDIO 16Y SIMPSON DRIVE  | THORNTONBOROUGH  |
|STUDIO 41 BROWN MOUNTAIN, PORT DOUGLASLAND|STUDIO 41 BROWN MOUNTAIN  | PORT DOUGLASLAND |
|FLAT 5 HELEN LOAF, EAST JONATHANSIDE      |FLAT 5 HELEN LOAF         | EAST JONATHANSIDE|
|FLAT 54P JENNINGS HIGHWAY, BARLOWPORT     |FLAT 54P JENNINGS HIGHWAY | BARLOWPORT       |
|FLAT 55B JAKE ROUTE, PORT KYLEPORT        |FLAT 55B JAKE ROUTE       | PORT KYLEPORT    |
|STUDIO 33 MEGAN DAM, POWELLBURGH          |STUDIO 33 MEGAN DAM       | POWELLBURGH      |

We can create a 'full name' variable by concatenating the two existing name columns together, using **concat()**:

In [29]:
census = dataframes.concat(census, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")
ccs = dataframes.concat(ccs, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")

ccs.select("FORENAME", "SURNAME", "FULL_NAME")

+-----------+------------+--------------------+
|   FORENAME|     SURNAME|           FULL_NAME|
+-----------+------------+--------------------+
|   MRSCHLOE|    CHANDLER|   MRSCHLOE CHANDLER|
|      REECE|        LONG|          REECE LONG|
|   MRSMOLLY|     SKINNER|    MRSMOLLY SKINNER|
|      LYDIA|        WEBB|          LYDIA WEBB|
|    SUZANNE|VANGALLAGHER|SUZANNE VANGALLAGHER|
|   ROSEMARY|      CLARKE|     ROSEMARY CLARKE|
|      SCOTT|       YATES|         SCOTT YATES|
|       GLEN|       CLARK|          GLEN CLARK|
|   GEORGINA|      FULLER|     GEORGINA FULLER|
|    MRSHUGH|      HARRIS|      MRSHUGH HARRIS|
|MRCATHERINE|  FITZGERALD|MRCATHERINE FITZG...|
|   MROLIVER|     SIMPSON|    MROLIVER SIMPSON|
|     JUSTIN|      THORPE|       JUSTIN THORPE|
|   MRJOANNE|      DAWSON|     MRJOANNE DAWSON|
|  MRSRONALD|        LORD|      MRSRONALD LORD|
|     HILARY|      MISTRY|       HILARY MISTRY|
|       RHYS|     MANNING|        RHYS MANNING|
|   CAROLINE|       JONES|      CAROLINE

For data that has been collected over the phone, our usual matching methods that look for differences in strings might not be as effective. Instead we can capture the way names *sound* with phonetic encoders to compensate for this type of error. 

We have functions for this in the linkage module. 

In [30]:
census = linkage.metaphone(df = census, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
census = linkage.soundex(df = census, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs = linkage.metaphone(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
ccs = linkage.soundex(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs.select("FORENAME", "FORENAME_METAPHONE", "FORENAME_SOUNDEX")

+-----------+------------------+----------------+
|   FORENAME|FORENAME_METAPHONE|FORENAME_SOUNDEX|
+-----------+------------------+----------------+
|   MRSCHLOE|             MRSXL|            M624|
|      REECE|                RS|            R200|
|   MRSMOLLY|             MRSML|            M625|
|      LYDIA|                LT|            L300|
|    SUZANNE|               SSN|            S250|
|   ROSEMARY|              RSMR|            R256|
|      SCOTT|               SKT|            S300|
|       GLEN|               KLN|            G450|
|   GEORGINA|              JRJN|            G625|
|    MRSHUGH|               MRX|            M622|
|MRCATHERINE|            MRK0RN|            M623|
|   MROLIVER|             MRLFR|            M641|
|     JUSTIN|              JSTN|            J235|
|   MRJOANNE|              MRJN|            M625|
|  MRSRONALD|           MRSRNLT|            M626|
|     HILARY|               HLR|            H460|
|       RHYS|                RS|            R200|


Similarly, if there have been spelling mistakes in names, alphabetising string columns may also aid matching. We have a function for this in the linkage module. 

In [31]:
census = linkage.alpha_name(census, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')
ccs = linkage.alpha_name(ccs, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')

ccs.select("FORENAME", "ALPHABETISE_FORENAME")

+-----------+--------------------+
|   FORENAME|ALPHABETISE_FORENAME|
+-----------+--------------------+
|   MRSCHLOE|            CEHLMORS|
|      REECE|               CEEER|
|   MRSMOLLY|            LLMMORSY|
|      LYDIA|               ADILY|
|    SUZANNE|             AENNSUZ|
|   ROSEMARY|            AEMORRSY|
|      SCOTT|               COSTT|
|       GLEN|                EGLN|
|   GEORGINA|            AEGGINOR|
|    MRSHUGH|             GHHMRSU|
|MRCATHERINE|         ACEEHIMNRRT|
|   MROLIVER|            EILMORRV|
|     JUSTIN|              IJNSTU|
|   MRJOANNE|            AEJMNNOR|
|  MRSRONALD|           ADLMNORRS|
|     HILARY|              AHILRY|
|       RHYS|                HRSY|
|   CAROLINE|            ACEILNOR|
|   MRCHERYL|            CEHLMRRY|
|   KAYLEIGH|            AEGHIKLY|
+-----------+--------------------+
only showing top 20 rows



There are more common matching variables we could still derive, for example, a common practice in data linkage is to derive a postcode district variable instead of using full postcode. 

The second part of a postcode is *always* 3 characters, whilst the first part can range from 2-4. Therefore, to derive postcode district, we remove the last 3 characters from postcode. 

This can be done using the **substring()** function:

In [70]:
census = dataframes.substring(census, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)
ccs = dataframes.substring(ccs, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)

ccs.select("POSTCODE", "PC_DISTRICT")

POSTCODE,PC_DISTRICT
EH4P9RN,EH4P
TN6M6JB,TN6M
B060DT,B06
BS03FA,BS0
DE14PW,DE1
HP399XG,HP39
LE820DP,LE82
B2S3SS,B2S
B04XYX,B04
RM518XW,RM51


If you have a time lag between the collection of two surveys you are trying to link together, you may want to align respondent ages for matching. We can do this using the **age_at()** function.

This function takes a few arguments:
* the dataframe
* the name of the date of birth column
* the data format the date of birth column is in
* the date(s) to calculate age at

In [33]:
# We can find out their age at the most recent Census, for example:
census_date = '21/03/2021'

census = standardisation.age_at(census, 'DOB', 'dd/MM/yyyy', census_date)
ccs = standardisation.age_at(ccs, 'DOB', 'dd/MM/yyyy', census_date)

census.select('DOB','age_at_21/03/2021')

DOB,age_at_21/03/2021
,
,
,
,
,
,
,
,
,
,


In [1]:
# NOT SURE IF THE BELOW IS RIGHT - JUST TAKING IT FROM DAP VERSION BEFORE DELETED

# Deduplication

This is quite easily done, defining our duplicate matchkey(s) and using the **deduplicate** function:

In [None]:
# define our matchkey
deduplicate_mkey = ['First_Name', 'Last_Name','Resident_Age','Sex','Postcode','Address']
ccs.count()

In [None]:
census = linkage.deduplicate(df = census, record_id - 'Resident_ID', mks = deduplicate_mkey)
ccs = linkage.deduplicate(df = ccs, record_id - 'Resident_ID', mks = deduplicate_mkey)
census.count()

# Deterministic Matching (rule-based)

Now that we've removed duplicates, we can start to investigate some matchkeys:

In [75]:
# first, let's suffix each dataset's columns to distinguish the two dataframes 
census = dataframes.suffix_columns(census, suffix = '_census')
ccs = dataframes.suffix_columns(ccs, suffix = '_ccs')

census.persist().count()
ccs.persist().count()

In [76]:
MK1 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# letting middle name be a mismatch 
MK2 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# taking the phonetic encoding of forename - using the metaphone algorithm
MK3 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# Now allowing for misspellings rather than mishearings of names, using standardised Levenshtein edit distance
MK4 = [census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

matchkeys = [MK1,MK2,MK3,MK4]

#census.Full_Name_census == ccs.Full_Name_ccs,
#census.First_Name_census == ccs.First_Name_ccs,
 #      census.Last_Name_census == ccs.Last_Name_ccs

AttributeError: 'DataFrame' object has no attribute 'Sex_census'

In [None]:
links = linkage.deterministic_linkage(df_l = census, df_r = ccs, id_l = 'Resident_ID_census', id_r = 'Resident_ID_ccs', 
                                      matchkeys = matchkeys, our_dir = '/user/edwara5/census_ccs_links')

In [None]:
links.show()

In [None]:
mk_df = linkage.matchkey_dataframe(matchkeys)