# DLH_utils demo

This notebook is intended to be a demo of what you *could* use DLH_utils for. 

Much of this code will probably look familiar, since most of us have taken similar approaches to the vast majority of the problems faced here. We've just wrapped these mostly standard approaches up into reusable functions, hopefully saving everyone doing linkage both time and headaches!

In [2]:
# to start, install dlh_utils if not installed already. Notice the '-U' argument to upgrade existing installations. 
!pip3 install -U 'dlh_utils'

Looking in indexes: http://sccm_functional:****@art-p-01/artifactory/api/pypi/yr-python/simple
Collecting dlh_utils
  Downloading http://art-p-01/artifactory/api/pypi/yr-python/packages/packages/22/7e/26cf5df2df2fc1d1559fd3d64ac0aaf79fe7f72b3120b4deb2950b2cb5f3/dlh_utils-0.2.4-py3-none-any.whl (56kB)
Collecting importlib_metadata<5.0.0,>=4.8.3 (from dlh_utils)
  Downloading http://art-p-01/artifactory/api/pypi/yr-python/packages/packages/a0/a1/b153a0a4caf7a7e3f15c2cd56c7702e2cf3d89b1b359d1f1c5e59d68f4ce/importlib_metadata-4.8.3-py3-none-any.whl
Collecting zipp>=0.5 (from importlib_metadata<5.0.0,>=4.8.3->dlh_utils)
  Downloading http://art-p-01/artifactory/api/pypi/yr-python/packages/packages/bd/df/d4a4974a3e3957fd1c1fa3082366d7fff6e428ddb55f074bf64876f8e8ad/zipp-3.6.0-py3-none-any.whl
Installing collected packages: zipp, importlib-metadata, dlh-utils
  Found existing installation: dlh-utils 0.1
    Uninstalling

In [3]:
# import necessary libraries
import pyspark.sql.functions as F
import pandas as pd

from dlh_utils import utilities
from dlh_utils import dataframes
from dlh_utils import linkage
from dlh_utils import standardisation
from dlh_utils import sessions
from dlh_utils import profiling
from dlh_utils import flags

In [4]:
# you can use our sessions module to set up your spark session
# this will also create a Spark UI, which you can use to track your code's efficiency
spark = sessions.getOrCreateSparkSession(appName = 'dlh_utils_demo', size = 'medium')

In [6]:
# read in raw data
census = pd.read_csv("/home/cdsw/dlh_utils_demo/census_residents.csv")
ccs = pd.read_csv("/home/cdsw/dlh_utils_demo/ccs_perturbed.csv")

# note, if this was stored in Hue, the read_format() function from the utilities module would've been useful

#for demo purposes, let's convert this to a spark df using utilities
census = utilities.pandas_to_spark(census)
ccs = utilities.pandas_to_spark(ccs)

To give a quick overview of the features of our data, we can use the **df_describe()** function from the profiling module:

In [10]:
descriptive_census = profiling.df_describe(census,
                                           output_mode = 'pandas',
                                           approx_distinct = False,
                                           rsd = 0.05
                                           )
descriptive_census

Unnamed: 0,variable,type,row_count,distinct,percent_distinct,null,percent_null,not_null,percent_not_null,empty,percent_empty,min,max,min_l,max_l,max_l_before_point,min_l_before_point,max_l_after_point,min_l_after_point
0,Address,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,19,51,,,,
1,ENUM_FNAME,string,100001,364,0.363996,0,0.0,100001,100.0,0,0.0,,,3,11,,,,
2,ENUM_SNAME,string,100001,500,0.499995,0,0.0,100001,100.0,0,0.0,,,3,11,,,,
3,ID,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,20,20,,,,
4,Marital_Status,string,100001,6,0.006,0,0.0,100001,100.0,13304,13.303867,,,1,17,,,,
5,Postcode,string,100001,99457,99.456005,0,0.0,100001,100.0,0,0.0,,,6,8,,,,
6,Sex,string,100001,10,0.01,0,0.0,100001,100.0,5564,5.563944,,,1,6,,,,
7,Resident_Day_Of_Birth,bigint,100001,31,0.031,0,0.0,100001,100.0,0,0.0,1.0,31.0,1,2,,,,
8,Resident_Month_Of_Birth,bigint,100001,12,0.012,0,0.0,100001,100.0,0,0.0,1.0,12.0,1,2,,,,
9,Resident_Year_Of_Birth,bigint,100001,89,0.088999,0,0.0,100001,100.0,0,0.0,1934.0,2022.0,4,4,,,,


Note that there are no nulls, but there are empty values. These are different definitions of missingness; we can cast the empty values to nulls (true Nones) later.

From this we can see that our sex variable has 10 distinct values, where we would expect only two. This could suggest a high level of missingness, but the rest of the output shows no null sex values (although 5,564 are empty), suggesting that some values have been coded inconsistently.

Postcode is almost fully distinct (99.5%), while Marital_Status and Sex have a lot of 'empty' values (13.3% and 5.6% respectively). On bigger data, these observations can give quick insights into which variables may be the most/least useful for matching.
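
The difference between the null and empty counts in the output above can be sketched in plain Python. This is only an illustration of the idea, not how df_describe() works internally: the `missingness` helper and the toy rows are made up for the example, and treating whitespace-only strings as empty is an assumption.

```python
# Toy rows standing in for a dataframe (hypothetical data).
rows = [
    {"Sex": "Male", "Marital_Status": "Single"},
    {"Sex": "", "Marital_Status": "Married"},
    {"Sex": "F", "Marital_Status": ""},
    {"Sex": None, "Marital_Status": "   "},
]

def missingness(rows, column):
    """Return (null_count, empty_count) for one column."""
    values = [row[column] for row in rows]
    # a true null is a missing value (None)
    nulls = sum(value is None for value in values)
    # assumption: "empty" means a string that is blank once whitespace is stripped
    empties = sum(isinstance(value, str) and value.strip() == "" for value in values)
    return nulls, empties

sex_nulls, sex_empties = missingness(rows, "Sex")
marital_nulls, marital_empties = missingness(rows, "Marital_Status")
print(sex_nulls, sex_empties, marital_nulls, marital_empties)
```

A column can therefore report zero nulls and still be missing a lot of information, which is why both counts are worth checking.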

The **value_counts()** function shows the top or bottom n values in our data. This can give us an overview of the different types of missingness in these variables, which will be useful when we come to standardise missingness in our data later.

In [12]:
top_5_value_counts_census = profiling.value_counts(census,
                                                   limit = 5,
                                                   output_mode = 'pandas'
                                                   )
# the value_counts function returns two dataframes: one for the top n values in each variable and one for the bottom n values.
# we can select the top value count dataframe by taking the first element of the top_5_value_counts_census tuple:

top_5_value_counts_census[0]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Resident_Day_Of_Birth,Resident_Day_Of_Birth_count,Resident_Month_Of_Birth,Resident_Month_Of_Birth_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count
0,"0 Adams divide, Damienberg",1,Toby,319,Smith,2897,c1000593123146290284,1,Single,33409,...,13,3422,10,8579,1934,1217,88,1217,09/10/1990,12
1,"0 Alan drive, Katiemouth",1,Michelle,315,Jones,2367,c1001639032354420555,1,Married,13491,...,28,3358,7,8577,1948,1204,74,1204,20/09/1935,12
2,"0 Allen plains, Hughesville",1,Jessica,312,Williams,1713,c1002072654919421883,1,Divorced,13412,...,2,3339,1,8552,2007,1190,15,1190,08/04/1966,11
3,"0 Amelia mills, West Terenceshire",1,Shaun,309,Taylor,1304,c1003117462577937457,1,,13304,...,20,3331,3,8546,1990,1183,32,1183,18/09/1937,11
4,"0 Archer locks, Lake Paula",1,Glenn,308,Brown,1281,c1007430416000428392,1,Civil partnership,13290,...,9,3324,8,8495,1959,1181,63,1181,10/06/1980,11
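
The top/bottom idea can be sketched for a single variable with the standard library. This is a rough illustration only: `top_and_bottom` and the sample surnames are hypothetical and are not part of dlh_utils.

```python
from collections import Counter

# hypothetical sample of one variable
surnames = ["Smith", "Jones", "Smith", "Taylor", "Jones", "Smith", "Brown"]

def top_and_bottom(values, limit):
    """Return (top, bottom) lists of (value, count) pairs."""
    ordered = Counter(values).most_common()   # sorted by count, descending
    top = ordered[:limit]                     # most frequent values first
    bottom = ordered[::-1][:limit]            # least frequent values first
    return top, bottom

top, bottom = top_and_bottom(surnames, limit=2)
print(top)
print(bottom)
```

As in the output above, missing-value codes such as -9 or NAN tend to surface near the top of the counts, which makes them easy to spot.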


In [13]:
top_5_value_counts_ccs = profiling.value_counts(ccs,
                                                limit = 5,
                                                output_mode = 'pandas'
                                                )
# the value_counts function returns two dataframes: one for the top n values in each variable and one for the bottom n values.
# we can select the top value count dataframe by taking the first element of the top_5_value_counts_ccs tuple:

top_5_value_counts_ccs[0]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status_CCS,Marital_Status_CCS_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,-9,52,-9,45,-9,55,-9,47,Single,289,...,Female,263,-9,31,-9,43,-9,46,c1289733399550728998,1
1,"Flat 15\nSteven pike, Lake Peterhaven",4,Kieran,8,Smith,33,c4872972226932063802,3,,149,...,Male,201,1938,21,2,21,23/08/1963,5,c1406628632687313907,1
2,"Studio 0\nFrancis shore, West Lukeview",4,Phillip,8,Jones,19,c3392644228012631624,3,Married,117,...,F,116,2020,21,25,20,07/02/1956,4,c1462481395779002923,1
3,"056 Josephine ville, West Lesleyberg",4,Olivia,7,Taylor,14,c2930018223824357607,3,Divorced,110,...,M,107,2005,21,11,19,02/01/1971,4,c1610482117117913758,1
4,"Studio 16y\nSimpson drive, Thorntonborough",4,Russell,7,Davies,13,c8831963511127540189,3,NAN,104,...,-9,102,1997,20,84,19,10/07/1972,4,c1631997721075661206,1


In [24]:
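# explore values that start with a minus sign (e.g. the -9 codes seen in the value counts above)
# the triple-quoted block below holds earlier attempts, kept for reference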
'''
for count,column in enumerate(ccs.columns,1):
    if count == 1:
        if ccs.filter(F.col(column).startswith('-')).count()>1:
            df = ccs.filter(F.col(column).startswith('-')).distinct()
    else:
        df = df.union(ccs.filter(F.col(column).startswith('-')).distinct())

df.show()
 

df = [ccs.filter(F.col(column).startswith('-')) for column in ccs.columns]
df
'''
from functools import reduce

df = ccs.filter(
    reduce(
        lambda x, y: x | y,    # `|` means `or`; use `&` if you want `and`
        [(F.col(c).startswith('-')) for c in ccs.columns]
    )
)

df.show()

+--------------------+----------+----------+--------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|             Address|ENUM_FNAME|ENUM_SNAME|                  ID|Marital_Status_CCS| Postcode|   Sex|Resident_Year_Of_Birth|Resident_Age|       DOB|         Resident_ID|
+--------------------+----------+----------+--------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|                  -9|      eriC|      Bibi|c3552902187723607632|            Single| DY01 1TR|  Male|                  2016|           6|11/10/2016|c3258728696626565719|
|                  -9|     chloe|  Chandler|c2381984462771197706| Civil partnership|       -9|   NAN|                  1963|          59|28/12/1996|c7831454145019129197|
|644 Garry walk, B...|    Denise|      King|c1548782761489667658|               NAN| FY9W 4RU|    -9|                  1981|          41|28/11/1981|c6
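
The reduce() call in the cell above folds one condition per column into a single condition, pairwise: ((c1 | c2) | c3) and so on. The same pattern on plain Python booleans looks like this; the rows and the `starts_with_minus` helper are hypothetical stand-ins for the Spark dataframe and filter.

```python
from functools import reduce
from operator import or_

# hypothetical rows standing in for the ccs dataframe
rows = [
    {"Address": "-9", "Sex": "Male"},
    {"Address": "1 High street", "Sex": "-9"},
    {"Address": "2 Low road", "Sex": "F"},
]
columns = ["Address", "Sex"]

def starts_with_minus(row):
    """True if any column value in the row starts with '-'."""
    # one boolean per column, folded into a single boolean with `or`
    checks = [str(row[column]).startswith("-") for column in columns]
    return reduce(or_, checks)

flagged = [row for row in rows if starts_with_minus(row)]
print(flagged)
```

Swapping `or_` for `operator.and_` would keep only rows where every column starts with a minus sign, mirroring the `|` versus `&` comment in the cell.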

To flag out of scope values in our data, we can use the **flag()** function:

In [31]:
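# Column has no .lt() method, so the original call raised "TypeError: 'Column' object is not callable";
# use the < operator instead. '00/00/1900' is also not a valid date, so we compare against 01/01/1900.
# this assumes DOB is held as dd/MM/yyyy strings, as in the output above
ccs.filter(F.to_date(F.col('DOB'), 'dd/MM/yyyy') < F.to_date(F.lit('01/01/1900'), 'dd/MM/yyyy')).show()
'''
out_of_scope = flags.flag(df = census,
                          ref_col = 'DOB',
                          condition = '>=',
                          condition_value = '00/00/1900',
                          condition_col = None,
                          alias = None,
                          prefix = 'FLAG',
                          fill_null = None
                         )
out_of_scope.show()
'''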

We can see we have supercentenarian Ben in our data, which is probably wrong, but we've also got a few different date types that have been flagged as well. 

If you are working with larger data, the **flag_check()** and **flag_summary()** functions can produce more detailed flag metrics that will help you spot issues like this more readily. 
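
To make the out-of-scope idea concrete, here is a minimal plain-Python sketch. It is not how flag() is implemented: `parse_dob`, `flag_out_of_scope`, the 01/01/1900 cutoff and the sample values are all assumptions made for the illustration, using Python's strptime codes rather than Spark's pattern letters.

```python
from datetime import datetime

def parse_dob(text, fmt="%d/%m/%Y"):
    """Parse a date string, returning None when it does not match fmt."""
    try:
        return datetime.strptime(text, fmt).date()
    except (TypeError, ValueError):
        return None

# assumed cutoff: anything earlier is treated as out of scope
cutoff = parse_dob("01/01/1900")

def flag_out_of_scope(text):
    """True if the date is before the cutoff or cannot be parsed at all."""
    parsed = parse_dob(text)
    return parsed is None or parsed < cutoff

dobs = ["06/08/1956", "12/03/1899", "1956-08-06", "-9"]
flags_out = [flag_out_of_scope(dob) for dob in dobs]
print(flags_out)
```

Note that values in a different date format get flagged alongside genuinely implausible dates, so a flag like this usually needs a follow-up look at why each row was caught.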

Let's move on to cleaning and standardising, where we can start to deal with these issues.

In [33]:
census.show(truncate = False)

+-------------------------------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|Address                                    |ENUM_FNAME|ENUM_SNAME|ID                  |Marital_Status   |Postcode|Sex   |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB       |
+-------------------------------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|Studio 48
Cooper street, Port Fredericktown|Margaret  |Ross      |c4064232788196233825|Divorced         |CV25 4ZY|Male  |6                    |8                      |1956                  |66          |06/08/1956|
|43 Rebecca street, Harveytown              |Darren    |Baldwin   |c6330546597769552216|Single           |E2 0LP  |      |29            

In [37]:
ccs.show(truncate = False)

#looks like there is a new line character in address - this will need to be removed

+-------------------------------------------+----------+----------+---------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|Address                                    |ENUM_FNAME|ENUM_SNAME|ID                   |Marital_Status_CCS|Postcode |Sex   |Resident_Year_Of_Birth|Resident_Age|DOB       |Resident_ID         |
+-------------------------------------------+----------+----------+---------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|Studio 48
Cooper street, Port Fredericktown|MargaDret |Ro?ss     |c4064232788196233825 |Divorced          |CV25 4ZY |Male  |1956                  |66          |06/08/1956|c2026847926404610461|
|43 Rebecca street, Harveytown              |Oliver    |Baldwin   |c6330546597769552216 |Single            |e2 0lp   |      |2013                  |9           |29/12/2013|c7839596180442651345|
|-9                           

# Data Cleaning & Standardisation

In [38]:
# replace '\n' values with spaces:

census = standardisation.reg_replace(df = census, dic = {' ': '\n'})
ccs = standardisation.reg_replace(df = ccs, dic = {' ': '\n'})

ccs.select('Address').show(truncate = False)

+-------------------------------------------+
|Address                                    |
+-------------------------------------------+
|Studio 48 Cooper street, Port Fredericktown|
|43 Rebecca street, Harveytown              |
|-9                                         |
| 7 Noble valley, Lake Simonville           |
|464 Victor mews, Janemouth                 |
|-9                                         |
|Studio 5 Fuller burgs, New Lindsey         |
|644 Garry walk, Blackburnville             |
|Studio 73 Clayton mountains, Stevenbury    |
|Flat 92B Ross expressway, Brayshire        |
|8 Grant spurs, South Philip                |
|414 Forster plains, Aimeemouth             |
|21 Stephen island, terrymouth              |
|flat 78 Jones Glen, marIonbuRgh            |
|-9                                         |
|69 Neil hill, Turnerbury                   |
|Studio 6 Dixon bypass, New Marian          |
|Flat 00 John bridge, Ahmedport             |
|Flat 22 Kennedy keys, Port Valeri

Let's standardise the date format to be consistent across our data in a **dd/MM/yyyy** format:
(TODO: this step needs changing to cover just CCS once the data has changed.)

In [41]:
ccs = standardisation.standardise_date(ccs, col_name = "DOB", in_date_format = "yyyy-MM-dd", out_date_format = "dd/MM/yyyy")

census.show()

+--------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|             Address|ENUM_FNAME|ENUM_SNAME|                  ID|   Marital_Status|Postcode|   Sex|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|Studio 48 Cooper ...|  Margaret|      Ross|c4064232788196233825|         Divorced|CV25 4ZY|  Male|                    6|                      8|                  1956|          66|null|
|43 Rebecca street...|    Darren|   Baldwin|c6330546597769552216|           Single|  E2 0LP|      |                   29|                     12|                  2013|           9|null|
|04 Lane shores, S...|      Eric|      Bibi|c3552902187723607632|
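
Converting between date formats boils down to parsing with the input format and writing back out with the output format. Here is a small sketch with the standard library; `reformat_date` is a hypothetical helper, and the format codes are Python's (`%Y-%m-%d` plays the role of `yyyy-MM-dd`).

```python
from datetime import datetime

def reformat_date(text, in_format, out_format):
    """Re-express a date string in out_format, or None if it does not match in_format."""
    try:
        return datetime.strptime(text, in_format).strftime(out_format)
    except (TypeError, ValueError):
        return None

# input matches the declared input format
converted = reformat_date("1956-08-06", "%Y-%m-%d", "%d/%m/%Y")
# input is already dd/MM/yyyy, so it does not match the declared input format
mismatched = reformat_date("06/08/1956", "%Y-%m-%d", "%d/%m/%Y")
print(converted, mismatched)
```

In this sketch a mismatch between the declared input format and the actual data yields None, so it is worth checking the real format of a column before converting it.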

Next, we have generic 'ID' columns in each dataset. We also have address and name variables named differently in each dataset. 

We can use **rename_columns()** from the dataframes module to rename all of these at once. 

In [42]:
census = dataframes.rename_columns(census, rename_dict = {"ID":"ID_Census","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})
ccs = dataframes.rename_columns(ccs, rename_dict = {"ID":"ID_CCS","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})

census.columns

['Address',
 'FORENAME',
 'SURNAME',
 'ID_Census',
 'Marital_Status',
 'Postcode',
 'Sex',
 'Resident_Day_Of_Birth',
 'Resident_Month_Of_Birth',
 'Resident_Year_Of_Birth',
 'Resident_Age',
 'DOB']

Now let's set all of our variables to upper case for consistency, using **standardise_case()**:

In [43]:
census = standardisation.standardise_case(census)
ccs = standardisation.standardise_case(ccs)

census.show()

+--------------------+--------+--------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|   Marital_Status|Postcode|   Sex|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|         DIVORCED|CV25 4ZY|  MALE|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|           SINGLE|  E2 0LP|      |                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|           SINGLE|DY01 1

Next, the values for missingness are all over the place. I can spot a few NaNs, nulls, and whitespaces. Let's standardise missingness with the **standardise_null()** function. We can retrieve these null values from the previous **value_counts()** outputs: 

In [45]:
# we can use the standardise_null function to replace these with true None values:
# we use regex to do this: https://regex101.com/ 
# raw strings (r"...") stop Python from treating \s as a string escape sequence
census = standardisation.standardise_null(census, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")
ccs = standardisation.standardise_null(ccs, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")

census.show()

+--------------------+--------+--------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|   Marital_Status|Postcode|   Sex|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|         DIVORCED|CV25 4ZY|  MALE|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|           SINGLE|  E2 0LP|  null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|           SINGLE|DY01 1

Great, these now all show up as true nulls. 
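
To see exactly what that regex catches, here is the same pattern applied with Python's re module. This is just a sketch on a list of strings; `to_none` is a hypothetical helper, not the dlh_utils implementation.

```python
import re

# same pattern as passed to standardise_null() above
NULL_PATTERN = re.compile(r"^NAN$|^NULL$|^\s*$|^-7$|^-9$")

def to_none(value):
    """Map any value matching the missingness pattern to None."""
    if value is None or NULL_PATTERN.search(value):
        return None
    return value

raw = ["MALE", "NAN", "", "   ", "-9", "-90", "NULLIFY"]
cleaned = [to_none(value) for value in raw]
print(cleaned)
```

The `^` and `$` anchors matter: without them, values such as `-90` or `NULLIFY` would also be wiped out.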

Next, we have a mix of 1s, 2s, Ms, and Fs in our sex column. Let's standardise this to be either 1s or 2s. For this we can use **reg_replace()**:

In [47]:
# reg_replace() takes a dictionary, where the value is the regex to replace, and the key is what this will be replaced with
# so we're replacing 'M' with '1', and 'F' with '2':
census = standardisation.reg_replace(census, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})
ccs = standardisation.reg_replace(ccs, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})

census.show()

+--------------------+--------+--------+--------------------+-----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|   Marital_Status|Postcode| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+-----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|         DIVORCED|CV25 4ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|           SINGLE|  E2 0LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|           SINGLE|DY01 1TR|   1|  
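
The dictionary orientation (key is the replacement, value is the regex) is easy to get backwards, so here is a plain-Python sketch of the same convention. `apply_replacements` is a hypothetical helper written for this illustration; it only mirrors the convention described in the cell comments.

```python
import re

def apply_replacements(value, replacements):
    """replacements maps replacement text -> regex to be replaced."""
    if value is None:
        return None
    for replacement, pattern in replacements.items():
        value = re.sub(pattern, replacement, value)
    return value

# same dictionary as used for the SEX column above
sex_codes = {"1": "^M$|^MALE$", "2": "^F$|^FEMALE$"}
recoded = [apply_replacements(value, sex_codes)
           for value in ["M", "MALE", "F", "FEMALE", "1", None]]
print(recoded)
```

Because the patterns are anchored, values that are already coded as 1 or 2 pass through untouched.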

Now that our sex column is populated with just 1s and 2s, we might want to change the type from string to integer. This can be done using the **cast_type()** function in the standardisation module:

In [48]:
census = standardisation.cast_type(census, subset = ['SEX'], types = "integer")
ccs = standardisation.cast_type(ccs, subset = ['SEX'], types = "integer")

census.select('SEX').dtypes

[('SEX', 'int')]

Let's begin to have a look at our postcode, address, and name variables now. It looks like we sometimes have whitespace in the postcode column, and sometimes have multiple consecutive whitespaces in our name/address columns. 

We can use the **standardise_white_space()** function to limit whitespace in the address column to a single space, then use it again on every other column (including postcode), via the subset argument, to remove whitespace altogether.

In [50]:
census = standardisation.standardise_white_space(census, subset = 'Address', wsl = "one")
ccs = standardisation.standardise_white_space(ccs, subset = 'Address', wsl = "one")

census = standardisation.standardise_white_space(census, 
                                                 subset = [column for column in census.columns if column != 'Address'], 
                                                 wsl = "none")
ccs = standardisation.standardise_white_space(ccs, 
                                              subset = [column for column in ccs.columns if column != 'Address'], 
                                              wsl = "none")

census.show()

+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|  Marital_Status|POSTCODE| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|        DIVORCED| CV254ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|          SINGLE| DY011TR|   1|        
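
The two whitespace settings can be sketched with a single regex. This is an illustration only: `limit_whitespace` is a hypothetical helper whose `mode` values simply mirror the `wsl` values used above.

```python
import re

def limit_whitespace(value, mode):
    """Collapse runs of whitespace to one space ('one') or remove them ('none')."""
    if value is None:
        return None
    replacement = " " if mode == "one" else ""
    return re.sub(r"\s+", replacement, value)

address = limit_whitespace("STUDIO  48   COOPER STREET", "one")
postcode = limit_whitespace("CV25 4ZY", "none")
padded = limit_whitespace("  7 NOBLE VALLEY ", "one")
print(repr(address), repr(postcode), repr(padded))
```

In this sketch, collapsing to one space still leaves a single leading or trailing space behind, so a separate trim step is needed to remove it.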

We still have some leading/trailing whitespace in some of our variables, so let's **trim** these:

In [51]:
census = standardisation.trim(census)
ccs = standardisation.trim(ccs)

census.show()

+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|  Marital_Status|POSTCODE| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|        DIVORCED| CV254ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|          SINGLE| DY011TR|   1|        

Next, let's focus on our name variables. Forenames still contain titles and some surnames have common prefixes like 'Van' or 'Der'. We can strip out titles and concatenate surname prefixes with our **clean_forename()** and **clean_surname()** functions. 

In [53]:
census = standardisation.clean_forename(census, subset = 'FORENAME')
ccs = standardisation.clean_forename(ccs, subset = 'FORENAME')

census = standardisation.clean_surname(census, subset = 'SURNAME')
ccs = standardisation.clean_surname(ccs, subset = 'SURNAME')

census.show()

+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|  Marital_Status|POSTCODE| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|        DIVORCED| CV254ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|          SINGLE| DY011TR|   1|        

Finally, let's strip out numbers from our name variables. Again, we can use the **reg_replace()** function for this:

In [58]:
census = standardisation.reg_replace(census, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})
ccs = standardisation.reg_replace(ccs, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})

census.show()

+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|  Marital_Status|POSTCODE| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|        DIVORCED| CV254ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|          SINGLE| DY011TR|   1|        

This still leaves apostrophes and hyphens in our name variables. The **remove_punct()** function can handle these. While we're at it, let's also use **remove_punct()** to get rid of dashes in our address field, but we'll have to specify the optional argument **keep** to make sure it doesn't strip out commas from addresses:

In [59]:
#First remove all punctuation from every column except address
census = standardisation.remove_punct(census,
                                      subset = [column for column in census.columns if column != 'Address']
                                      )

ccs = standardisation.remove_punct(ccs,
                                   subset = [column for column in ccs.columns if column != 'Address']
                                   )


#Then remove the punctuation from address except for the commas
census = standardisation.remove_punct(census, subset = 'Address', keep = ',')
ccs = standardisation.remove_punct(ccs, subset = 'Address', keep = ',')

census.show()

+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|             Address|FORENAME| SURNAME|           ID_Census|  Marital_Status|POSTCODE| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+--------+--------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----+
|STUDIO 48 COOPER ...|MARGARET|    ROSS|C4064232788196233825|        DIVORCED| CV254ZY|   1|                    6|                      8|                  1956|          66|null|
|43 REBECCA STREET...|  DARREN| BALDWIN|C6330546597769552216|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|null|
|04 LANE SHORES, S...|    ERIC|    BIBI|C3552902187723607632|          SINGLE| DY011TR|   1|        
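
One way to think about the keep argument is as extra characters added to a "do not remove" character class. The sketch below assumes a pattern of the form `[^A-Za-z0-9 ]`, which is what appears in the Spark query plan printed further down this notebook; `strip_punct` itself is a hypothetical helper, not the library function.

```python
import re

def strip_punct(value, keep=""):
    """Remove everything except letters, digits, spaces and any characters in keep."""
    if value is None:
        return None
    # re.escape() stops kept characters such as '-' or ']' breaking the character class
    pattern = "[^A-Za-z0-9 " + re.escape(keep) + "]"
    return re.sub(pattern, "", value)

surname = strip_punct("O'BRIEN-SMITH")
address = strip_punct("FLAT 2-B, HIGH ST.", keep=",")
print(surname, "|", address)
```

Keeping the comma in addresses matters for the next section, where the address is split on it.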

# Derive Variables

We've got quite a few identifying variables that we can split out into further variables for matching. 

First, let's derive street and town from the address variable. The **split()** function from the dataframes module will be useful here, splitting on comma. 

In [61]:
# this will create a new column called "ADDRESS_SPLIT" that contains an array of each address element, separated by a comma
census = dataframes.split(census, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")
ccs = dataframes.split(ccs, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")

census.select("ADDRESS", "ADDRESS_SPLIT").show(truncate = False)

+-------------------------------------------+----------------------------------------------+
|ADDRESS                                    |ADDRESS_SPLIT                                 |
+-------------------------------------------+----------------------------------------------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|[STUDIO 48 COOPER STREET,  PORT FREDERICKTOWN]|
|43 REBECCA STREET, HARVEYTOWN              |[43 REBECCA STREET,  HARVEYTOWN]              |
|04 LANE SHORES, SOUTH DANIELFORT           |[04 LANE SHORES,  SOUTH DANIELFORT]           |
|7 NOBLE VALLEY, LAKE SIMONVILLE            |[7 NOBLE VALLEY,  LAKE SIMONVILLE]            |
|57 PEARSON CORNER, JOANNABOROUGH           |[57 PEARSON CORNER,  JOANNABOROUGH]           |
|0 JEREMY MOUNTAINS, NORTH FRANK            |[0 JEREMY MOUNTAINS,  NORTH FRANK]            |
|STUDIO 5 FULLER BURGS, NEW LINDSEY         |[STUDIO 5 FULLER BURGS,  NEW LINDSEY]         |
|644 GARRY WALK, BLACKBURNVILLE             |[644 GARRY WALK,  BLACKBU
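
The split-then-select pattern can be sketched on a single string. `split_address` and `element_at` are hypothetical helpers for this illustration, and returning None for a missing element is a choice made here rather than documented library behaviour.

```python
def split_address(address, sep=","):
    """Split an address string into a list of elements, passing None through."""
    if address is None:
        return None
    return address.split(sep)

def element_at(parts, index):
    """Return parts[index], or None if parts is None or too short."""
    if parts is None or index >= len(parts):
        return None
    return parts[index]

parts = split_address("43 REBECCA STREET, HARVEYTOWN")
street = element_at(parts, 0)
town = element_at(parts, 1)
print(repr(street), repr(town))
```

Note the leading space on the second element, which is also visible in the ADDRESS_SPLIT output above, so derived columns may need trimming after the split.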

In [63]:
# we can then select the first element of the 'split address' to create the 'street address' variable
# (TODO: could we use a loop here?)
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)

# the second element contains the town name, which we can append to a new column also 
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)

# since we no longer need the 'ADDRESS_SPLIT' column, we can remove it using our drop_columns() function
census = dataframes.drop_columns(census, subset = 'ADDRESS_SPLIT')
ccs = dataframes.drop_columns(ccs, subset = 'ADDRESS_SPLIT')

census.select("ADDRESS", "STREET", "TOWN").show(truncate = False)

AnalysisException: "cannot resolve '`ADDRESS_SPLIT`' given input columns: [Resident_Year_Of_Birth, TOWN, SURNAME, Marital_Status, STREET, Resident_Month_Of_Birth, Resident_Age, Resident_Day_Of_Birth, FORENAME, ID_Census, DOB, SEX, Address, POSTCODE];;\n'Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, 'ADDRESS_SPLIT[0] AS STREET#9576, TOWN#9505]\n+- Deduplicate [DOB#9209, TOWN#9505, Marital_Status#9118, Resident_Age#9196, Resident_Day_Of_Birth#9157, Address#9342, Resident_Month_Of_Birth#9170, SEX#9144, STREET#9476, ID_Census#9105, FORENAME#9079, POSTCODE#9131, Resident_Year_Of_Birth#9183, SURNAME#9092]\n   +- Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, STREET#9476, TOWN#9505]\n      +- Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, ADDRESS_SPLIT#9440, STREET#9476, ADDRESS_SPLIT#9440[1] AS TOWN#9505]\n         +- Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, ADDRESS_SPLIT#9440, ADDRESS_SPLIT#9440[0] AS STREET#9476]\n            +- Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, CASE WHEN (isnull(ADDRESS#9342) || isnan(cast(ADDRESS#9342 as double))) THEN cast(null as 
array<string>) ELSE split(ADDRESS#9342, ,) END AS ADDRESS_SPLIT#9440]\n               +- Project [Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209, CASE WHEN (isnull(ADDRESS#9342) || isnan(cast(ADDRESS#9342 as double))) THEN cast(null as array<string>) ELSE split(ADDRESS#9342, ,) END AS ADDRESS_SPLIT#9404]\n                  +- Project [regexp_replace(Address#8398, [^A-Za-z0-9 ,], ) AS Address#9342, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, DOB#9209]\n                     +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, Resident_Age#9196, regexp_replace(DOB#8528, [^A-Za-z0-9 ], ) AS DOB#9209]\n                        +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, Resident_Year_Of_Birth#9183, regexp_replace(Resident_Age#8515, [^A-Za-z0-9 ], ) AS Resident_Age#9196, DOB#8528]\n                           +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#9170, regexp_replace(Resident_Year_Of_Birth#8502, [^A-Za-z0-9 ], ) AS Resident_Year_Of_Birth#9183, Resident_Age#8515, DOB#8528]\n                              +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, Resident_Day_Of_Birth#9157, regexp_replace(Resident_Month_Of_Birth#8489, [^A-Za-z0-9 ], ) AS Resident_Month_Of_Birth#9170, 
Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                 +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, SEX#9144, regexp_replace(Resident_Day_Of_Birth#8476, [^A-Za-z0-9 ], ) AS Resident_Day_Of_Birth#9157, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                    +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, POSTCODE#9131, regexp_replace(cast(SEX#7147 as string), [^A-Za-z0-9 ], ) AS SEX#9144, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                       +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, Marital_Status#9118, regexp_replace(POSTCODE#8463, [^A-Za-z0-9 ], ) AS POSTCODE#9131, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                          +- Project [Address#8398, FORENAME#9079, SURNAME#9092, ID_Census#9105, regexp_replace(Marital_Status#8450, [^A-Za-z0-9 ], ) AS Marital_Status#9118, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                             +- Project [Address#8398, FORENAME#9079, SURNAME#9092, regexp_replace(ID_Census#8437, [^A-Za-z0-9 ], ) AS ID_Census#9105, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                +- Project [Address#8398, FORENAME#9079, regexp_replace(SURNAME#9005, [^A-Za-z0-9 ], ) AS SURNAME#9092, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, 
Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                   +- Project [Address#8398, regexp_replace(FORENAME#8992, [^A-Za-z0-9 ], ) AS FORENAME#9079, SURNAME#9005, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                      +- Project [Address#8398, FORENAME#8992, regexp_replace(SURNAME#8760, [0-9], ) AS SURNAME#9005, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                         +- Project [Address#8398, regexp_replace(FORENAME#8735, [0-9], ) AS FORENAME#8992, SURNAME#8760, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                            +- Project [Address#8398, FORENAME#8735, regexp_replace(SURNAME#8424, \\bNO SURNAME\\b|\\bSURNAME\\b|(?<=\\bDE)[ -]|(?<=\\bDA)[ -]|(?<=\\bDU)[ -]|(?<=\\bST)[ -]|(?<=\\bMC)[ -]|(?<=\\bMAC)[ -]|(?<=\\bVAN)[ -]|(?<=\\bVON)[ -]|(?<=\\bLA)[ -]|(?<=\\bLE)[ -]|(?<=\\bO)[ -]|(?<=\\bAL)[ -]|(?<=\\bDER)[ -]|(?<=\\bEL)[ -]|(?<=\\bDI)[ -]|(?<=\\bDEL)[ -]|(?<=\\bUL)[ -]|(?<=\\bBIN)[ -]|(?<=\\bSAN)[ -]|(?<=\\bBA)[ -], ) AS SURNAME#8760, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                               +- Project [Address#8398, regexp_replace(FORENAME#8411, \\bMR\\b|\\bMRS\\b|\\bDR\\b|\\bMISS\\b|\\bNO NAME\\b|\\bNAME\\b|\\bFORENAME\\b|\\bMS\\b|\\bMSTR\\b|\\bPROF\\b|\\bSIR\\b|\\bLADY\\b, ) AS FORENAME#8735, 
SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, DOB#8528]\n                                                                  +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, Resident_Age#8515, trim(DOB#8122, None) AS DOB#8528]\n                                                                     +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8502, trim(Resident_Age#8095, None) AS Resident_Age#8515, DOB#8122]\n                                                                        +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8489, trim(Resident_Year_Of_Birth#8068, None) AS Resident_Year_Of_Birth#8502, Resident_Age#8095, DOB#8122]\n                                                                           +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8476, trim(Resident_Month_Of_Birth#8041, None) AS Resident_Month_Of_Birth#8489, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                              +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, POSTCODE#8463, SEX#7147, trim(Resident_Day_Of_Birth#8014, None) AS Resident_Day_Of_Birth#8476, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                 +- 
Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, Marital_Status#8450, trim(POSTCODE#7986, None) AS POSTCODE#8463, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                    +- Project [Address#8398, FORENAME#8411, SURNAME#8424, ID_Census#8437, trim(Marital_Status#7959, None) AS Marital_Status#8450, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                       +- Project [Address#8398, FORENAME#8411, SURNAME#8424, trim(ID_Census#7932, None) AS ID_Census#8437, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                          +- Project [Address#8398, FORENAME#8411, trim(SURNAME#7905, None) AS SURNAME#8424, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                             +- Project [Address#8398, trim(FORENAME#7878, None) AS FORENAME#8411, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, DOB#8122]\n                                                                                                +- Project [trim(Address#7826, None) AS Address#8398, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, 
Resident_Age#8095, DOB#8122]\n                                                                                                   +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, regexp_replace(DOB#8109, \\s+, ) AS DOB#8122]\n                                                                                                      +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, Resident_Age#8095, trim(DOB#7459, None) AS DOB#8109]\n                                                                                                         +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, regexp_replace(Resident_Age#8082, \\s+, ) AS Resident_Age#8095, DOB#7459]\n                                                                                                            +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#8068, trim(Resident_Age#7432, None) AS Resident_Age#8082, DOB#7459]\n                                                                                                               +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, regexp_replace(Resident_Year_Of_Birth#8055, \\s+, ) AS Resident_Year_Of_Birth#8068, Resident_Age#7432, DOB#7459]\n                                                                                  
                                +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#8041, trim(Resident_Year_Of_Birth#7405, None) AS Resident_Year_Of_Birth#8055, Resident_Age#7432, DOB#7459]\n                                                                                                                     +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, regexp_replace(Resident_Month_Of_Birth#8028, \\s+, ) AS Resident_Month_Of_Birth#8041, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                        +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#8014, trim(Resident_Month_Of_Birth#7378, None) AS Resident_Month_Of_Birth#8028, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                           +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, regexp_replace(Resident_Day_Of_Birth#8001, \\s+, ) AS Resident_Day_Of_Birth#8014, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                              +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, POSTCODE#7986, SEX#7147, trim(Resident_Day_Of_Birth#7351, None) AS Resident_Day_Of_Birth#8001, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                         
                                                        +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, regexp_replace(POSTCODE#7973, \\s+, ) AS POSTCODE#7986, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                    +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, Marital_Status#7959, trim(POSTCODE#7737, None) AS POSTCODE#7973, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                       +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, regexp_replace(Marital_Status#7946, \\s+, ) AS Marital_Status#7959, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                          +- Project [Address#7826, FORENAME#7878, SURNAME#7905, ID_Census#7932, trim(Marital_Status#7296, None) AS Marital_Status#7946, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                             +- Project [Address#7826, FORENAME#7878, SURNAME#7905, regexp_replace(ID_Census#7919, \\s+, ) AS ID_Census#7932, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                  
                                                                                                              +- Project [Address#7826, FORENAME#7878, SURNAME#7905, trim(ID_Census#7269, None) AS ID_Census#7919, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                   +- Project [Address#7826, FORENAME#7878, regexp_replace(SURNAME#7892, \\s+, ) AS SURNAME#7905, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                      +- Project [Address#7826, FORENAME#7878, trim(SURNAME#7242, None) AS SURNAME#7892, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                         +- Project [Address#7826, regexp_replace(FORENAME#7865, \\s+, ) AS FORENAME#7878, SURNAME#7242, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                            +- Project [Address#7826, trim(FORENAME#7215, None) AS FORENAME#7865, SURNAME#7242, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, 
Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                               +- Project [regexp_replace(Address#7813, \\s+,  ) AS Address#7826, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                                  +- Project [trim(Address#7188, None) AS Address#7813, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                                     +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, regexp_replace(POSTCODE#7724, \\s+, ) AS POSTCODE#7737, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                                        +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, trim(POSTCODE#7323, None) AS POSTCODE#7724, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, DOB#7459]\n                                                                                                                                                                           +- Project [Address#7188, FORENAME#7215, 
SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, regexp_replace(DOB#7446, \\s+,  ) AS DOB#7459]\n                                                                                                                                                                              +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, Resident_Age#7432, trim(DOB#6790, None) AS DOB#7446]\n                                                                                                                                                                                 +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, regexp_replace(Resident_Age#7419, \\s+,  ) AS Resident_Age#7432, DOB#6790]\n                                                                                                                                                                                    +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#7405, trim(Resident_Age#6777, None) AS Resident_Age#7419, DOB#6790]\n                                                                                                                                                                                       +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, regexp_replace(Resident_Year_Of_Birth#7392, \\s+,  ) AS 
Resident_Year_Of_Birth#7405, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                          +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#7378, trim(Resident_Year_Of_Birth#6764, None) AS Resident_Year_Of_Birth#7392, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                             +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, regexp_replace(Resident_Month_Of_Birth#7365, \\s+,  ) AS Resident_Month_Of_Birth#7378, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, Resident_Day_Of_Birth#7351, trim(Resident_Month_Of_Birth#6751, None) AS Resident_Month_Of_Birth#7365, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                   +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, regexp_replace(Resident_Day_Of_Birth#7338, \\s+,  ) AS Resident_Day_Of_Birth#7351, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                            
                                                                                                                                                          +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, Postcode#7323, SEX#7147, trim(Resident_Day_Of_Birth#6738, None) AS Resident_Day_Of_Birth#7338, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                         +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, regexp_replace(Postcode#7310, \\s+,  ) AS Postcode#7323, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                            +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, Marital_Status#7296, trim(Postcode#6712, None) AS Postcode#7310, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                               +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, regexp_replace(Marital_Status#7283, \\s+,  ) AS Marital_Status#7296, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                         
                                                                         +- Project [Address#7188, FORENAME#7215, SURNAME#7242, ID_Census#7269, trim(Marital_Status#6699, None) AS Marital_Status#7283, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                     +- Project [Address#7188, FORENAME#7215, SURNAME#7242, regexp_replace(ID_Census#7256, \\s+,  ) AS ID_Census#7269, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                        +- Project [Address#7188, FORENAME#7215, SURNAME#7242, trim(ID_Census#6686, None) AS ID_Census#7256, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                           +- Project [Address#7188, FORENAME#7215, regexp_replace(SURNAME#7229, \\s+,  ) AS SURNAME#7242, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                  
                            +- Project [Address#7188, FORENAME#7215, trim(SURNAME#6673, None) AS SURNAME#7229, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                 +- Project [Address#7188, regexp_replace(FORENAME#7202, \\s+,  ) AS FORENAME#7215, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                    +- Project [Address#7188, trim(FORENAME#6660, None) AS FORENAME#7202, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                       +- Project [regexp_replace(Address#7175, \\s+,  ) AS Address#7188, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                    
                      +- Project [trim(Address#6647, None) AS Address#7175, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                             +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, cast(SEX#7072 as int) AS SEX#7147, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                                +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, regexp_replace(SEX#7059, ^F$|^FEMALE$, 2) AS SEX#7072, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                                   +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, regexp_replace(SEX#6985, ^M$|^MALE$, 1) AS SEX#7059, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                     
                                                                 +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, regexp_replace(SEX#6972, ^F$, 2) AS SEX#6985, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                                         +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, regexp_replace(SEX#6725, ^M$, 1) AS SEX#6972, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, DOB#6790]\n                                                                                                                                                                                                                                                            +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, Sex#6725, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, Resident_Age#6777, CASE WHEN DOB#6465 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE DOB#6465 END AS DOB#6790]\n                                                                                                                                                                                                                                                               +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, Sex#6725, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6764, CASE WHEN Resident_Age#6452 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE 
Resident_Age#6452 END AS Resident_Age#6777, DOB#6465]\n                                                                                                                                                                                                                                                                  +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, Sex#6725, Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6751, CASE WHEN Resident_Year_Of_Birth#6439 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Resident_Year_Of_Birth#6439 END AS Resident_Year_Of_Birth#6764, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                     +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, Sex#6725, Resident_Day_Of_Birth#6738, CASE WHEN Resident_Month_Of_Birth#6426 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Resident_Month_Of_Birth#6426 END AS Resident_Month_Of_Birth#6751, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                        +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, Sex#6725, CASE WHEN Resident_Day_Of_Birth#6413 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Resident_Day_Of_Birth#6413 END AS Resident_Day_Of_Birth#6738, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                        
                                                                                                                                                                                                                                   +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, Postcode#6712, CASE WHEN Sex#6400 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Sex#6400 END AS Sex#6725, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                              +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, Marital_Status#6699, CASE WHEN Postcode#6387 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Postcode#6387 END AS Postcode#6712, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                                 +- Project [Address#6647, FORENAME#6660, SURNAME#6673, ID_Census#6686, CASE WHEN Marital_Status#6374 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Marital_Status#6374 END AS Marital_Status#6699, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                              
                                                                                      +- Project [Address#6647, FORENAME#6660, SURNAME#6673, CASE WHEN ID_Census#6361 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE ID_Census#6361 END AS ID_Census#6686, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                                       +- Project [Address#6647, FORENAME#6660, CASE WHEN SURNAME#6348 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE SURNAME#6348 END AS SURNAME#6673, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                                          +- Project [Address#6647, CASE WHEN FORENAME#6335 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE FORENAME#6335 END AS FORENAME#6660, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                                             +- Project [CASE WHEN 
Address#6322 RLIKE ^NAN$|^NULL$|^\\s*$|^-7$|^-9$ THEN cast(null as string) ELSE Address#6322 END AS Address#6647, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, DOB#6465]\n                                                                                                                                                                                                                                                                                                +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, Resident_Age#6452, CASE WHEN DOB#6129 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE DOB#6129 END AS DOB#6465]\n                                                                                                                                                                                                                                                                                                   +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6439, CASE WHEN Resident_Age#6115 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Resident_Age#6115 END AS Resident_Age#6452, DOB#6129]\n                                                                                                                                                                                                                                                                                                      +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, 
Sex#6400, Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6426, CASE WHEN Resident_Year_Of_Birth#6101 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Resident_Year_Of_Birth#6101 END AS Resident_Year_Of_Birth#6439, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                         +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, Resident_Day_Of_Birth#6413, CASE WHEN Resident_Month_Of_Birth#6087 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Resident_Month_Of_Birth#6087 END AS Resident_Month_Of_Birth#6426, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                            +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, Postcode#6387, Sex#6400, CASE WHEN Resident_Day_Of_Birth#6073 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Resident_Day_Of_Birth#6073 END AS Resident_Day_Of_Birth#6413, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                               +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, 
Marital_Status#6374, Postcode#6387, CASE WHEN Sex#6059 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Sex#6059 END AS Sex#6400, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                  +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, Marital_Status#6374, CASE WHEN Postcode#6045 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Postcode#6045 END AS Postcode#6387, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                     +- Project [Address#6322, FORENAME#6335, SURNAME#6348, ID_Census#6361, CASE WHEN Marital_Status#6031 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE Marital_Status#6031 END AS Marital_Status#6374, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                        +- Project [Address#6322, FORENAME#6335, SURNAME#6348, CASE WHEN ID_Census#6017 RLIKE 
^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE ID_Census#6017 END AS ID_Census#6361, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                           +- Project [Address#6322, FORENAME#6335, CASE WHEN SURNAME#6003 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE SURNAME#6003 END AS SURNAME#6348, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                              +- Project [Address#6322, CASE WHEN FORENAME#5989 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN cast(null as string) ELSE FORENAME#5989 END AS FORENAME#6335, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                                 +- Project [CASE WHEN Address#5975 RLIKE ^NAN$|^NULL$|^\\s*$|^----$|^####$ THEN 
cast(null as string) ELSE Address#5975 END AS Address#6322, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, DOB#6129]\n                                                                                                                                                                                                                                                                                                                                    +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, Resident_Age#6115, upper(DOB#5825) AS DOB#6129]\n                                                                                                                                                                                                                                                                                                                                       +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#6101, upper(Resident_Age#5649) AS Resident_Age#6115, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                          +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#6087, upper(Resident_Year_Of_Birth#5636) AS 
Resident_Year_Of_Birth#6101, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                             +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, Resident_Day_Of_Birth#6073, upper(Resident_Month_Of_Birth#5623) AS Resident_Month_Of_Birth#6087, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, Sex#6059, upper(Resident_Day_Of_Birth#5610) AS Resident_Day_Of_Birth#6073, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                   +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, Postcode#6045, upper(Sex#5597) AS Sex#6059, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                        
                                                                                                                                                                                                              +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, Marital_Status#6031, upper(Postcode#5584) AS Postcode#6045, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                         +- Project [Address#5975, FORENAME#5989, SURNAME#6003, ID_Census#6017, upper(Marital_Status#5571) AS Marital_Status#6031, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                            +- Project [Address#5975, FORENAME#5989, SURNAME#6003, upper(ID_Census#5899) AS ID_Census#6017, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                
                               +- Project [Address#5975, FORENAME#5989, upper(SURNAME#5925) AS SURNAME#6003, ID_Census#5899, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                  +- Project [Address#5975, upper(FORENAME#5912) AS FORENAME#5989, SURNAME#5925, ID_Census#5899, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                     +- Project [upper(Address#5519) AS Address#5975, FORENAME#5912, SURNAME#5925, ID_Census#5899, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                        +- Project [Address#5519, FORENAME#5912, ENUM_SNAME#5545 AS SURNAME#5925, ID_Census#5899, Marital_Status#5571, Postcode#5584, 
Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                           +- Project [Address#5519, ENUM_FNAME#5532 AS FORENAME#5912, ENUM_SNAME#5545, ID_Census#5899, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                              +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558 AS ID_Census#5899, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                                 +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, from_unixtime(DOB#5812L, dd/MM/yyyy, 
Some(Europe/London)) AS DOB#5825]\n                                                                                                                                                                                                                                                                                                                                                                                    +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, unix_timestamp(DOB#5662, yyyy-MM-dd, Some(Europe/London)) AS DOB#5812L]\n                                                                                                                                                                                                                                                                                                                                                                                       +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, Resident_Age#5649, regexp_replace(DOB#11, \n,  ) AS DOB#5662]\n                                                                                                                                                                                                                                                                                                                                                                                          +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#5636, regexp_replace(cast(Resident_Age#10L as string), \n,  ) AS Resident_Age#5649, 
DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                             +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#5623, regexp_replace(cast(Resident_Year_Of_Birth#9L as string), \n,  ) AS Resident_Year_Of_Birth#5636, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, Resident_Day_Of_Birth#5610, regexp_replace(cast(Resident_Month_Of_Birth#8L as string), \n,  ) AS Resident_Month_Of_Birth#5623, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                   +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, Sex#5597, regexp_replace(cast(Resident_Day_Of_Birth#7L as string), \n,  ) AS Resident_Day_Of_Birth#5610, Resident_Month_Of_Birth#8L, 
Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                      +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, Postcode#5584, regexp_replace(Sex#6, \n,  ) AS Sex#5597, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                         +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, Marital_Status#5571, regexp_replace(Postcode#5, \n,  ) AS Postcode#5584, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                            +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, ID#5558, regexp_replace(Marital_Status#4, \n,  ) AS Marital_Status#5571, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, 
DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                               +- Project [Address#5519, ENUM_FNAME#5532, ENUM_SNAME#5545, regexp_replace(ID#3, \n,  ) AS ID#5558, Marital_Status#4, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                                  +- Project [Address#5519, ENUM_FNAME#5532, regexp_replace(ENUM_SNAME#2, \n,  ) AS ENUM_SNAME#5545, ID#3, Marital_Status#4, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                                     +- Project [Address#5519, regexp_replace(ENUM_FNAME#1, \n,  ) AS ENUM_FNAME#5532, ENUM_SNAME#2, ID#3, Marital_Status#4, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                           
                                                                                                                                                                                                                                                                                                                                                                             +- Project [regexp_replace(Address#0, \n,  ) AS Address#5519, ENUM_FNAME#1, ENUM_SNAME#2, ID#3, Marital_Status#4, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11]\n                                                                                                                                                                                                                                                                                                                                                                                                                           +- LogicalRDD [Address#0, ENUM_FNAME#1, ENUM_SNAME#2, ID#3, Marital_Status#4, Postcode#5, Sex#6, Resident_Day_Of_Birth#7L, Resident_Month_Of_Birth#8L, Resident_Year_Of_Birth#9L, Resident_Age#10L, DOB#11], false\n"

We can create a 'full name' variable by concatenating the two existing name columns together, using **concat()**:

In [65]:
census = dataframes.concat(census, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")
ccs = dataframes.concat(ccs, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")

census.select("FORENAME", "SURNAME", "FULL_NAME").show()

+---------+---------+----------------+
| FORENAME|  SURNAME|       FULL_NAME|
+---------+---------+----------------+
|   DENISE|     KING|     DENISE KING|
| NICHOLAS|  ROBERTS|NICHOLAS ROBERTS|
|     LUKE|   THOMAS|     LUKE THOMAS|
| CAROLINE|  BENNETT|CAROLINE BENNETT|
|    JULIE|WHITTAKER| JULIE WHITTAKER|
|    MEGAN|    EVANS|     MEGAN EVANS|
|     DAWN|  COLEMAN|    DAWN COLEMAN|
|  DOMINIC|  ROBERTS| DOMINIC ROBERTS|
|  ANTHONY|     CARR|    ANTHONY CARR|
|      KIM|     HALL|        KIM HALL|
|   CAROLE| HARRISON| CAROLE HARRISON|
|    DYLAN|     KING|      DYLAN KING|
|    WAYNE|  COLLIER|   WAYNE COLLIER|
|     JOEL|    CLARK|      JOEL CLARK|
|  GREGORY|   THOMAS|  GREGORY THOMAS|
|   JOSEPH|     ROWE|     JOSEPH ROWE|
|     LUCY|TOMLINSON|  LUCY TOMLINSON|
|   RONALD|     MANN|     RONALD MANN|
|   GEORGE|     COOK|     GEORGE COOK|
|JOSEPHINE|   THOMAS|JOSEPHINE THOMAS|
+---------+---------+----------------+
only showing top 20 rows



For ethnically diverse datasets, phonetic encodings of name variables may aid matching. We have functions for this in the linkage module. 

In [67]:
census = linkage.metaphone(df = census, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
census = linkage.soundex(df = census, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs = linkage.metaphone(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
ccs = linkage.soundex(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

census.select("FORENAME", "FORENAME_METAPHONE", "FORENAME_SOUNDEX").show()  

+---------+------------------+----------------+
| FORENAME|FORENAME_METAPHONE|FORENAME_SOUNDEX|
+---------+------------------+----------------+
|   DENISE|               TNS|            D520|
| NICHOLAS|              NXLS|            N242|
|     LUKE|                LK|            L200|
| CAROLINE|              KRLN|            C645|
|    JULIE|                JL|            J400|
|    MEGAN|               MKN|            M250|
|     DAWN|                TN|            D500|
|  DOMINIC|              TMNK|            D552|
|  ANTHONY|              AN0N|            A535|
|      KIM|                KM|            K500|
|   CAROLE|               KRL|            C640|
|    DYLAN|               TLN|            D450|
|    WAYNE|                WN|            W500|
|     JOEL|                JL|            J400|
|  GREGORY|              KRKR|            G626|
|   JOSEPH|               JSF|            J210|
|     LUCY|                LS|            L200|
|   RONALD|              RNLT|          
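
To give a feel for what a phonetic encoding does, here is a minimal pure-Python sketch of the standard American Soundex rules. The `soundex` function below is a hypothetical helper written for illustration only; it is not the implementation behind `linkage.soundex`, which may differ in edge cases.

```python
def soundex(name: str) -> str:
    """Illustrative American Soundex: first letter plus three digits."""
    # Consonant groups that share a Soundex digit
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for letter in letters:
            codes[letter] = digit

    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""

    result = name[0]
    previous = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            # H and W do not separate consonants that share a code
            continue
        digit = codes.get(ch, "")
        if digit and digit != previous:
            result += digit
        # Vowels reset `previous`, so a repeated code after a vowel is kept
        previous = digit

    # Pad with zeros and truncate to four characters
    return (result + "000")[:4]


print(soundex("DENISE"))    # D520
print(soundex("NICHOLAS"))  # N242
```

Names that sound alike but are spelled differently, such as "ANTHONY" and "ANTONY", receive the same code under these rules, which is what makes the encoding useful as a matching variable.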

Similarly, if names have been misspelled, for example with two letters transposed, alphabetising the characters of string columns may also aid matching. We have a function for this in the linkage module.

In [68]:
census = linkage.alpha_name(census, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')
ccs = linkage.alpha_name(ccs, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')

census.select("FORENAME", "ALPHABETISE_FORENAME").show()

+---------+--------------------+
| FORENAME|ALPHABETISE_FORENAME|
+---------+--------------------+
|   DENISE|              DEEINS|
| NICHOLAS|            ACHILNOS|
|     LUKE|                EKLU|
| CAROLINE|            ACEILNOR|
|    JULIE|               EIJLU|
|    MEGAN|               AEGMN|
|     DAWN|                ADNW|
|  DOMINIC|             CDIIMNO|
|  ANTHONY|             AHNNOTY|
|      KIM|                 IKM|
|   CAROLE|              ACELOR|
|    DYLAN|               ADLNY|
|    WAYNE|               AENWY|
|     JOEL|                EJLO|
|  GREGORY|             EGGORRY|
|   JOSEPH|              EHJOPS|
|     LUCY|                CLUY|
|   RONALD|              ADLNOR|
|   GEORGE|              EEGGOR|
|JOSEPHINE|           EEHIJNOPS|
+---------+--------------------+
only showing top 20 rows
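
As a rough illustration of why this helps, the sketch below alphabetises names in plain Python. The `alphabetise` helper is hypothetical and only mimics the output shown above; it is not the code behind `linkage.alpha_name`.

```python
def alphabetise(name: str) -> str:
    """Return the characters of an upper-cased name in alphabetical order."""
    return "".join(sorted(name.upper()))


print(alphabetise("DENISE"))  # DEEINS

# A transposition error produces the same alphabetised key
print(alphabetise("DENSIE") == alphabetise("DENISE"))  # True
```

The trade-off is that genuinely different names that happen to be anagrams also share a key, so an alphabetised name is usually best combined with other variables in a matchkey.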



There are more common matching variables we could still derive. Taking the **substring()** of our postcode column can help us derive less granular geographic variables for matching:

In [70]:
census = dataframes.substring(census, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)
ccs = dataframes.substring(ccs, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)

census.select("POSTCODE", "PC_DISTRICT").show()

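# judging by the output below, with from_end = True the start and length are counted from the end of the string:
# the last 3 characters are skipped and up to 4 characters before them are kept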

+--------+-----------+
|POSTCODE|PC_DISTRICT|
+--------+-----------+
| FY9W4RU|       FY9W|
|   E85FD|         E8|
|  S7G3JZ|        S7G|
| WC1M0ZL|       WC1M|
| IM392DA|       IM39|
|  N222RT|        N22|
|  G7G5UB|        G7G|
|  S205ZS|        S20|
| MK047HX|       MK04|
| BB2N6YE|       BB2N|
| WA2A6NR|       WA2A|
| FY122QE|       FY12|
| GY3H6QL|       GY3H|
|   E00PJ|         E0|
|  B154FA|        B15|
|  PH72SN|        PH7|
|   S72AU|         S7|
|   N71DZ|         N7|
| OL938AA|       OL93|
| UB7E5TX|       UB7E|
+--------+-----------+
only showing top 20 rows
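
One way to picture a substring taken from the end is to reverse the string, take an ordinary 1-indexed substring, and reverse the result back. The sketch below does this in plain Python; `substring_from_end` is a hypothetical helper, and the assumption that `dataframes.substring` works this way is inferred only from the output above.

```python
def substring_from_end(value: str, start: int, length: int) -> str:
    """Take `length` characters starting `start` places from the end (1-indexed)."""
    reversed_value = value[::-1]
    piece = reversed_value[start - 1:start - 1 + length]
    return piece[::-1]


for postcode in ["FY9W4RU", "E85FD", "S7G3JZ"]:
    print(postcode, substring_from_end(postcode, start=4, length=4))
# FY9W4RU FY9W
# E85FD E8
# S7G3JZ S7G
```

Counting from the end suits postcodes because the final three characters always have the same length, while the part before them varies between two and four characters.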



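If you have a time lag between the collection of two surveys you are trying to link together, you may want to align respondent ages for matching. We can do this using the **age_at()** function. In the call below the arguments are passed positionally: the dataframe, the name of the date of birth column, the format of the dates in that column, and the reference date at which to calculate age. The result is stored in a new column named after the reference date: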

In [75]:
# we can find out their age at the most recent Census, for example:
census_date = '21/03/2021'

census = standardisation.age_at(census, 'DOB', 'dd/MM/yyyy', census_date)
ccs = standardisation.age_at(ccs, 'DOB', 'dd/MM/yyyy', census_date)

census.select('DOB','age_at_21/03/2021')

DOB,age_at_21/03/2021
,
,
,
,
,
,
,
,
,
,
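
The underlying calculation can be sketched in plain Python as age in completed years at a reference date. The `age_in_years` helper below is hypothetical and assumes that is what is wanted; it is not the `standardisation.age_at` implementation. Note that Python's `%d/%m/%Y` plays the role of the Spark pattern `dd/MM/yyyy`.

```python
from datetime import datetime


def age_in_years(dob: str, reference: str, fmt: str = "%d/%m/%Y") -> int:
    """Age in completed years at the reference date."""
    born = datetime.strptime(dob, fmt).date()
    ref = datetime.strptime(reference, fmt).date()
    # Subtract a year if the birthday has not yet happened in the reference year
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)


print(age_in_years("21/03/1990", "21/03/2021"))  # 31
print(age_in_years("22/03/1990", "21/03/2021"))  # 30
```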


In [1]:
# NOT SURE IF WE NEED THE BELOW - JUST TAKING IT FROM DAP VERSION BEFORE DELETED

# Deduplication

This is easily done by defining our duplicate matchkey(s) and passing them to the **deduplicate()** function:

In [None]:
# define our matchkey
deduplicate_mkey = ['First_Name', 'Last_Name','Resident_Age','Sex','Postcode','Address']
census.count()

In [None]:
census = linkage.deduplicate(df = census, record_id = 'Resident_ID', mks = deduplicate_mkey)
ccs = linkage.deduplicate(df = ccs, record_id = 'Resident_ID', mks = deduplicate_mkey)
census.count()

Now that we've removed duplicates, we can start to investigate some matchkeys:

In [None]:
# first, let's suffix each dataset's columns to distinguish the two dataframes
# the suffixes need to match the column names used in the matchkeys below
census = dataframes.suffix_columns(census, suffix = '_census')
ccs = dataframes.suffix_columns(ccs, suffix = '_ccs')

census.persist().count()
ccs.persist().count()

In [None]:
MK1 = [census.Full_Name_census == ccs.Full_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# letting middle name be a mismatch 
MK2 = [census.First_Name_census == ccs.First_Name_ccs,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# taking the phonetic encoding of forename - using the metaphone algorithm
MK3 = [census.forename_metaphone_census == ccs.forename_metaphone_ccs,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# Now allowing for misspellings rather than mishearings of names, using standardised Levenshtein edit distance
MK4 = [linkage.std_lev_score(F.col('First_Name_census'), F.col('First_Name_ccs')) > 0.7,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# similar to the above, but now using a different string comparison algorithm - the Jaro comparator
MK5 = [linkage.jaro(F.col('First_Name_census'), F.col('First_Name_ccs')) > 0.7,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

matchkeys = [MK1,MK2,MK3,MK4,MK5]
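
To make the 0.7 threshold in MK4 more concrete, here is a minimal pure-Python sketch of a standardised Levenshtein score. It assumes the common standardisation of one minus the edit distance divided by the length of the longer string; `levenshtein` and `std_lev` are hypothetical helpers and may not match `linkage.std_lev_score` exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def std_lev(a: str, b: str) -> float:
    """Edit distance rescaled to 0-1, where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(std_lev("CAROLE", "CAROLINE"))      # 0.75
print(std_lev("KIM", "TIM") > 0.7)        # False
```

Under this definition a single edit in a short name costs proportionally more than in a long one, so a fixed threshold is stricter for short names.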

In [None]:
links = linkage.deterministic_linkage(df_l = census, df_r = ccs, id_l = 'Resident_ID_census', id_r = 'Resident_ID_ccs', 
                                      matchkeys = matchkeys, out_dir = '/user/edwara5/census_ccs_links')

In [None]:
links.show()
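
Deterministic linkage with several matchkeys is often run as a cascade: the strictest key is applied first, and records it links are removed before the next, looser key is tried. The sketch below shows that idea on plain Python dictionaries. `cascade_link`, the field names and the rule of keeping only one-to-one links are all assumptions made for illustration, and `linkage.deterministic_linkage` may behave differently.

```python
from collections import defaultdict


def cascade_link(left, right, matchkeys, id_field="id"):
    """Link records key by key, excluding records linked by an earlier key."""
    links = []
    linked_left, linked_right = set(), set()
    for number, fields in enumerate(matchkeys, start=1):
        # Group the still-unlinked records on each side by their key values
        left_groups, right_groups = defaultdict(list), defaultdict(list)
        for record in left:
            if record[id_field] not in linked_left:
                key = tuple(record[f] for f in fields)
                left_groups[key].append(record[id_field])
        for record in right:
            if record[id_field] not in linked_right:
                key = tuple(record[f] for f in fields)
                right_groups[key].append(record[id_field])
        # Keep only unambiguous one-to-one agreements on this key
        for key, left_ids in left_groups.items():
            right_ids = right_groups.get(key, [])
            if len(left_ids) == 1 and len(right_ids) == 1:
                links.append((left_ids[0], right_ids[0], number))
                linked_left.add(left_ids[0])
                linked_right.add(right_ids[0])
    return links


census_rows = [
    {"id": "c1", "forename": "DENISE", "surname": "KING", "postcode": "FY9W4RU"},
    {"id": "c2", "forename": "LUKE", "surname": "THOMAS", "postcode": "S7G3JZ"},
]
ccs_rows = [
    {"id": "s1", "forename": "DENISE", "surname": "KING", "postcode": "FY9W4RU"},
    {"id": "s2", "forename": "LUCAS", "surname": "THOMAS", "postcode": "S7G3JZ"},
]
keys = [["forename", "surname", "postcode"], ["surname", "postcode"]]

print(cascade_link(census_rows, ccs_rows, keys))
# [('c1', 's1', 1), ('c2', 's2', 2)]
```

Recording which matchkey produced each link, as the third element of each tuple does here, makes it easier to review the looser keys separately when checking link quality.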