# DLH_utils demo

This notebook is intended to be a demo of what you *could* use DLH_utils for. 

Much of this code will probably look familiar, as most of us have taken similar approaches to the vast majority of the problems faced here. We've just wrapped these mostly standard approaches up into reusable functions, hopefully saving everyone doing linkage both time and headaches!

In [1]:
# to start, install dlh_utils if not installed already. Notice the '-U' argument to upgrade existing installations. 
!pip3 install -U 'dlh_utils'

Looking in indexes: http://sccm_functional:****@art-p-01/artifactory/api/pypi/yr-python/simple
Requirement already up-to-date: dlh_utils in /home/cdsw/.local/lib/python3.6/site-packages (0.2.4)
You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
# import necessary libraries
import pyspark.sql.functions as F
import pandas as pd

from dlh_utils import utilities
from dlh_utils import dataframes
from dlh_utils import linkage
from dlh_utils import standardisation
from dlh_utils import sessions
from dlh_utils import profiling
from dlh_utils import flags

In [3]:
# you can use our sessions module to set up your spark session
# this will also create a Spark UI, which you can use to track your code's efficiency
spark = sessions.getOrCreateSparkSession(appName = 'dlh_utils_demo', size = 'medium')

In [4]:
# read in raw data
census = pd.read_csv("/home/cdsw/dlh_utils_demo/census_residents.csv")
ccs = pd.read_csv("/home/cdsw/dlh_utils_demo/ccs_perturbed.csv")

# note, if this was stored in Hue, the read_format() function from the utilities module would've been useful

#for demo purposes, let's convert this to a spark df using utilities
census = utilities.pandas_to_spark(census)
ccs = utilities.pandas_to_spark(ccs)

To give a quick overview of the features of our data, we can use the **df_describe()** function from the profiling module:

In [5]:
descriptive_census = profiling.df_describe(census,
                                           output_mode = 'pandas',
                                           approx_distinct = False,
                                           rsd = 0.05
                                           )
descriptive_census

Unnamed: 0,variable,type,row_count,distinct,percent_distinct,null,percent_null,not_null,percent_not_null,empty,percent_empty,min,max,min_l,max_l,max_l_before_point,min_l_before_point,max_l_after_point,min_l_after_point
0,Address,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,19,51,,,,
1,ENUM_FNAME,string,100001,1092,1.091989,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
2,ENUM_SNAME,string,100001,1493,1.492985,0,0.0,100001,100.0,0,0.0,,,3,15,,,,
3,ID,string,100001,100001,100.0,0,0.0,100001,100.0,0,0.0,,,20,20,,,,
4,Marital_Status,string,100001,6,0.006,0,0.0,100001,100.0,0,0.0,,,3,17,,,,
5,Postcode,string,100001,99457,99.456005,0,0.0,100001,100.0,0,0.0,,,6,8,,,,
6,Sex,string,100001,10,0.01,0,0.0,100001,100.0,5636,5.635944,,,1,6,,,,
7,Resident_Day_Of_Birth,bigint,100001,31,0.031,0,0.0,100001,100.0,0,0.0,1.0,31.0,1,2,,,,
8,Resident_Month_Of_Birth,bigint,100001,12,0.012,0,0.0,100001,100.0,0,0.0,1.0,12.0,1,2,,,,
9,Resident_Year_Of_Birth,bigint,100001,89,0.088999,0,0.0,100001,100.0,0,0.0,1934.0,2022.0,4,4,,,,



From this we can see that the Sex variable has 10 distinct values, far more than the handful of categories we would expect. This could suggest a high level of missingness, but the rest of the output shows that we don't have any null Sex values, which suggests that some values have been coded inconsistently.

We can also see that, whilst there are no nulls, the Sex variable contains a lot of empty values. This suggests there are different definitions for nulls, which we can cast to True Nones later when we standardise the data. 

On bigger data, these observations can give quick insights into which variables may be the most/least useful for matching. 
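
To make the difference between nulls and empty values concrete, here is a minimal plain-Python sketch of the kind of per-column counts shown in the table above. It is not how `df_describe()` is implemented; the helper name `describe_column` and the choice to treat whitespace-only strings as empty are assumptions made for illustration.

```python
def describe_column(values):
    """Count nulls, empty strings and distinct values in one column."""
    row_count = len(values)
    # a true null is a missing value (None), not a placeholder string
    null = sum(v is None for v in values)
    # an empty value is a string with nothing (or only white space) in it
    empty = sum(isinstance(v, str) and v.strip() == "" for v in values)
    distinct = len({v for v in values if v is not None})
    return {
        "row_count": row_count,
        "distinct": distinct,
        "null": null,
        "percent_null": 100 * null / row_count,
        "empty": empty,
        "percent_empty": 100 * empty / row_count,
    }


# hypothetical Sex values mixing codings, placeholders and one true null
sex = ["Male", "F", "", "Female", " ", "M", "-9", None]
summary = describe_column(sex)
print(summary)
```

Placeholders such as `-9` count as ordinary distinct values here, which is why a column can report zero nulls while still being full of missing data.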

The **value_counts()** function shows the top or bottom n values in our data. This can give us an overview of the different types of missingness in these variables, which will be useful when we come to standardise missingness in our data later.

In [6]:
top_5_value_counts_census = profiling.value_counts(census,
                                            limit = 5,
                                            output_mode = 'pandas'
                                            )
# the value counts function returns two dataframes; one for the top n values in each variable and one for the bottom n values. 
# we can select the top value count dataframe by subsetting the top_5_value_counts_census tuple:

top_5_value_counts_census[0]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Resident_Day_Of_Birth,Resident_Day_Of_Birth_count,Resident_Month_Of_Birth,Resident_Month_Of_Birth_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count
0,"0 Adams divide, Damienberg",1,Glenn,201,Smith,2560,c1000622967623155092,1,Single,33183,...,13,3422,10,8579,1934,1217,88,1217,09/10/1990,12
1,"0 Alan drive, Katiemouth",1,Michelle,196,Jones,2119,c1001178750255994031,1,Divorced,13517,...,28,3358,7,8577,1948,1204,74,1204,20/09/1935,12
2,"0 Allen plains, Hughesville",1,Teresa,196,Williams,1567,c1003932737970777024,1,###,13386,...,2,3339,1,8552,2007,1190,15,1190,08/04/1966,11
3,"0 Amelia mills, West Terenceshire",1,Luke,194,Brown,1159,c1006571884205429715,1,NAN,13366,...,20,3331,3,8546,1990,1183,32,1183,18/09/1937,11
4,"0 Archer locks, Lake Paula",1,Sean,190,Taylor,1152,c1007387130106763057,1,Civil partnership,13330,...,9,3324,8,8495,1959,1181,63,1181,10/06/1980,11
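
For a single column, the same top and bottom n idea can be sketched with `collections.Counter`. This is only an illustration of the concept, not the `profiling.value_counts()` implementation; the helper `top_and_bottom` and the sample values are made up.

```python
from collections import Counter


def top_and_bottom(values, limit=5):
    """Return the `limit` most common and least common values with counts."""
    counts = Counter(values).most_common()  # sorted by count, descending
    top = counts[:limit]
    bottom = counts[-limit:][::-1]  # least common first
    return top, bottom


# hypothetical marital status values, including two missingness codes
marital_status = ["Single", "Single", "Single", "###", "###", "NAN", "Married"]
top, bottom = top_and_bottom(marital_status, limit=2)
print(top)
print(bottom)
```

Placeholder codes such as `###` and `NAN` tend to surface in the top values, which is what makes this a quick way to collect the patterns to standardise later.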


In [7]:
# let's do the same thing for the CCS
top_5_value_counts_ccs = profiling.value_counts(ccs,
                                            limit = 5,
                                            output_mode = 'pandas'
                                            )

top_5_value_counts_ccs[0]

Unnamed: 0,Address,Address_count,ENUM_FNAME,ENUM_FNAME_count,ENUM_SNAME,ENUM_SNAME_count,ID,ID_count,Marital_Status,Marital_Status_count,...,Sex,Sex_count,Resident_Year_Of_Birth,Resident_Year_Of_Birth_count,Resident_Age,Resident_Age_count,DOB,DOB_count,Resident_ID,Resident_ID_count
0,-9,50,-9,48,-7,51,-7,36,Single,284,...,Female,229,-9,45,-9,43,-7,37,c1289733399550728998,1
1,"Studio 4\nSimpson glens, Lake Paul",4,Victoria,10,Smith,28,c7949424308517863587,3,###,136,...,Male,213,1938,22,84,21,16-11-1994,4,c1406628632687313907,1
2,"Studio 36\nKing forges, Rileyburgh",3,Tracey,7,Jones,16,c5305314753251312254,3,Divorced,128,...,M,135,2020,21,11,20,04-09-1955,4,c1462481395779002923,1
3,"Studio 5\nFuller burgs, New Lindsey",3,Howard,7,Roberts,13,c5873915529124500787,3,Married,116,...,F,133,2005,21,17,20,23-08-1963,4,c1610482117117913758,1
4,"Studio 4\nDiane underpass, Eleanorton",3,Glen,6,Taylor,12,c3119031928535250479,2,Civil partnership,110,...,-7,97,1997,20,2,20,10-03-1987,4,c1631997721075661206,1


In [24]:
#STEP TO EXPLORE MINUS VALUES
'''
for count,column in enumerate(ccs.columns,1):
    if count == 1:
        if ccs.filter(F.col(column).startswith('-')).count()>1:
            df = ccs.filter(F.col(column).startswith('-')).distinct()
    else:
        df = df.union(ccs.filter(F.col(column).startswith('-')).distinct())

df.show()
 

df = [ccs.filter(F.col(column).startswith('-')) for column in ccs.columns]
df
'''
from functools import reduce

df = ccs.filter(
    reduce(
        lambda x, y: x | y,    # `|` means `or`; use `&` if you want `and`
        [(F.col(c).startswith('-')) for c in ccs.columns]
    )
)

df.show()

+--------------------+----------+----------+--------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|             Address|ENUM_FNAME|ENUM_SNAME|                  ID|Marital_Status_CCS| Postcode|   Sex|Resident_Year_Of_Birth|Resident_Age|       DOB|         Resident_ID|
+--------------------+----------+----------+--------------------+------------------+---------+------+----------------------+------------+----------+--------------------+
|                  -9|      eriC|      Bibi|c3552902187723607632|            Single| DY01 1TR|  Male|                  2016|           6|11/10/2016|c3258728696626565719|
|                  -9|     chloe|  Chandler|c2381984462771197706| Civil partnership|       -9|   NAN|                  1963|          59|28/12/1996|c7831454145019129197|
|644 Garry walk, B...|    Denise|      King|c1548782761489667658|               NAN| FY9W 4RU|    -9|                  1981|          41|28/11/1981|c6
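
The `reduce` call above folds one condition per column into a single "any column starts with a minus" filter. Here is the same pattern on plain Python dictionaries, as a sketch with made-up rows, so you can see what the fold does without a Spark session:

```python
from functools import reduce
from operator import or_

# hypothetical rows standing in for the CCS dataframe
rows = [
    {"Address": "-9", "FORENAME": "eriC", "Sex": "Male"},
    {"Address": "644 Garry walk", "FORENAME": "Denise", "Sex": "-9"},
    {"Address": "43 Rebecca street", "FORENAME": "Darren", "Sex": "M"},
]
columns = ["Address", "FORENAME", "Sex"]


def starts_with_minus(row):
    # one boolean per column, combined pairwise with `or`
    checks = [str(row[c]).startswith("-") for c in columns]
    return reduce(or_, checks)


flagged = [row for row in rows if starts_with_minus(row)]
print(flagged)
```

In plain Python `any(checks)` would do the same job; `reduce` is used here because Spark `Column` expressions have to be combined with `|` rather than tested one by one.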

To flag out of scope values in our data, we can use the **flag()** function:

In [31]:
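# Column has no .lt() method, so calling it raises "TypeError: 'Column' object is not callable"; use < instead
# '00/00/1900' is not a real date, so we compare against 01/01/1900 and parse both sides explicitly
# (this assumes DOB is held as dd/MM/yyyy; rows in other formats will parse to null and be dropped)
ccs.filter(F.to_date(F.col('DOB'), 'dd/MM/yyyy') < F.to_date(F.lit('01/01/1900'), 'dd/MM/yyyy')).show()
'''
out_of_scope = flags.flag(df = census,
                          ref_col = 'DOB',
                          condition = '>=',
                          condition_value = '00/00/1900',
                          condition_col = None,
                          alias = None,
                          prefix = 'FLAG',
                          fill_null = None
                         )
out_of_scope.show()
'''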

We can see we have supercentenarian Ben in our data, which is probably wrong, but we've also got a few different date types that have been flagged as well. 

If you are working with larger data, the **flag_check()** and **flag_summary()** functions can produce more detailed flag metrics that will help you spot issues like this more readily. 
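
As a rough illustration of what an out-of-scope flag on a date column involves, here is a plain-Python sketch. The helper `flag_dob_in_scope`, the 01/01/1900 cut-off and the sample dates are assumptions for the example, and this is not how `flags.flag()` is implemented.

```python
from datetime import datetime


def flag_dob_in_scope(dob, earliest="01/01/1900", fmt="%d/%m/%Y"):
    """Return True/False for in/out of scope, or None if dob cannot be parsed."""
    try:
        parsed = datetime.strptime(dob, fmt)
    except (TypeError, ValueError):
        # None, or a string in a different date format
        return None
    return parsed >= datetime.strptime(earliest, fmt)


# hypothetical values: in scope, too early, wrong format, missing
dobs = ["06/08/1956", "12/03/1890", "16-11-1994", None]
flag_values = [flag_dob_in_scope(d) for d in dobs]
print(flag_values)
```

Parsing the strings into dates before comparing matters: comparing `dd/MM/yyyy` strings directly would order them by day of month rather than by year.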

Let's move on to cleaning and standardising where we can start to deal with these issues.

In [8]:
census

Address,ENUM_FNAME,ENUM_SNAME,ID,Marital_Status,Postcode,Sex,Resident_Day_Of_Birth,Resident_Month_Of_Birth,Resident_Year_Of_Birth,Resident_Age,DOB
Studio 48 Cooper ...,Mrs Margaret,Ross,c4064232788196233825,NAN,CV25 4ZY,,6,8,1956,66,06/08/1956
43 Rebecca street...,Mrs Darren,Baldwin,c7365350289112516537,Single,E2 0LP,-7,29,12,2013,9,29/12/2013
"04 Lane shores, S...",Mrs Eric,Bibi,c8205386463232611653,Single,DY01 1TR,Female,11,10,2016,6,11/10/2016
"7 Noble valley, L...",Diane,Kent,c2381984462771197706,###,SO2P 9WS,Male,28,1,1972,50,28/01/1972
57 Pearson corner...,Mr Grace,Baker,c6384487823194391043,Civil partnership,EH4P 9RN,Female,26,11,1966,56,26/11/1966
0 Jeremy mountain...,Mrs Chloe,Chandler,c7777611692672993318,Divorced,G7F 3RE,2,3,3,1963,59,03/03/1963
Studio 5 Fuller b...,Mr Katie,Der-Anderson,c7179219388724687888,Divorced,G88 6DB,Female,9,4,1960,62,09/04/1960
"644 Garry walk, B...",Mrs Denise,King,c3458599216452476033,Divorced,FY9W 4RU,F,28,11,1981,41,28/11/1981
Studio 73 Clayton...,Hazel,Der-Barton,c9188328200772085537,Married,LN96 9XA,Male,13,4,1947,75,13/04/1947
Flat 92B Ross exp...,Harriet,Chapman,c1862566591390004870,Single,S0D 9AX,M,8,11,2001,21,08/11/2001


In [10]:
ccs.show(truncate = False)

+-------------------------------------------+------------+------------+---------------------+-----------------+----------+------+----------------------+------------+----------+--------------------+
|Address                                    |ENUM_FNAME  |ENUM_SNAME  |ID                   |Marital_Status   |Postcode  |Sex   |Resident_Year_Of_Birth|Resident_Age|DOB       |Resident_ID         |
+-------------------------------------------+------------+------------+---------------------+-----------------+----------+------+----------------------+------------+----------+--------------------+
|Studio 48
Cooper street, Port Fredericktown|-9          |Ross        |c4064232788196233825 |NAN              |CV25 4ZYC |      |1956                  |66          |06-08-1956|c2026847926404610461|
|43 Rebecca street, Harveytown              |Mrs Darren  |Baldwin     |c7365350289112516537 |Single           |-7        |M     |2013                  |9           |29-12-2013|c7839596180442651345|
|-9       

# Data Cleaning & Standardisation

In [38]:
# Looks like there is a new line character in address - this will need to be removed
# We can replace these '\n' values with spaces:

census = standardisation.reg_replace(df = census, dic = {' ': '\n'})
ccs = standardisation.reg_replace(df = ccs, dic = {' ': '\n'})

ccs.select('Address').show(truncate = False)

+-------------------------------------------+
|Address                                    |
+-------------------------------------------+
|Studio 48 Cooper street, Port Fredericktown|
|43 Rebecca street, Harveytown              |
|-9                                         |
| 7 Noble valley, Lake Simonville           |
|464 Victor mews, Janemouth                 |
|-9                                         |
|Studio 5 Fuller burgs, New Lindsey         |
|644 Garry walk, Blackburnville             |
|Studio 73 Clayton mountains, Stevenbury    |
|Flat 92B Ross expressway, Brayshire        |
|8 Grant spurs, South Philip                |
|414 Forster plains, Aimeemouth             |
|21 Stephen island, terrymouth              |
|flat 78 Jones Glen, marIonbuRgh            |
|-9                                         |
|69 Neil hill, Turnerbury                   |
|Studio 6 Dixon bypass, New Marian          |
|Flat 00 John bridge, Ahmedport             |
|Flat 22 Kennedy keys, Port Valeri

Let's use **standardise_date()** to make the date format consistent across our data, in a **dd/MM/yyyy** format:

In [41]:
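# the CCS dates shown earlier look like dd-MM-yyyy (e.g. 06-08-1956), so in_date_format should match that pattern
ccs = standardisation.standardise_date(ccs, col_name = "DOB", in_date_format = "dd-MM-yyyy", out_date_format = "dd/MM/yyyy")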

census.show(truncate = False)

+--------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|             Address|ENUM_FNAME|ENUM_SNAME|                  ID|   Marital_Status|Postcode|   Sex|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age| DOB|
+--------------------+----------+----------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----+
|Studio 48 Cooper ...|  Margaret|      Ross|c4064232788196233825|         Divorced|CV25 4ZY|  Male|                    6|                      8|                  1956|          66|null|
|43 Rebecca street...|    Darren|   Baldwin|c6330546597769552216|           Single|  E2 0LP|      |                   29|                     12|                  2013|           9|null|
|04 Lane shores, S...|      Eric|      Bibi|c3552902187723607632|
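
Date standardisation is sensitive to the input pattern: if the pattern does not describe the strings being parsed, the values cannot be read. The plain-Python sketch below shows the idea with `datetime`; the helper `convert_date` is hypothetical and uses Python's `%d-%m-%Y` style patterns rather than Spark's `dd-MM-yyyy` style. A mismatch like this is one possible cause of a date column coming back as nulls.

```python
from datetime import datetime


def convert_date(value, in_format, out_format="%d/%m/%Y"):
    """Re-format a date string, returning None when it cannot be parsed."""
    try:
        return datetime.strptime(value, in_format).strftime(out_format)
    except (TypeError, ValueError):
        return None


raw = ["06-08-1956", "29-12-2013"]

# the input pattern matches the data: day-month-year
converted = [convert_date(v, "%d-%m-%Y") for v in raw]

# the input pattern does not match the data: year-month-day
mismatched = [convert_date(v, "%Y-%m-%d") for v in raw]

print(converted)
print(mismatched)
```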

Next, we have generic 'ID' columns in each dataset. We also have address and name variables named differently in each dataset. 

We can use **rename_columns()** from the dataframes module to rename all of these at once. 

In [11]:
census = dataframes.rename_columns(census, rename_dict = {"ID":"ID_Census","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})
ccs = dataframes.rename_columns(ccs, rename_dict = {"ID":"ID_CCS","ENUM_FNAME":"FORENAME","ENUM_SNAME":"SURNAME"})

census.columns

['Address',
 'FORENAME',
 'SURNAME',
 'ID_Census',
 'Marital_Status',
 'Postcode',
 'Sex',
 'Resident_Day_Of_Birth',
 'Resident_Month_Of_Birth',
 'Resident_Year_Of_Birth',
 'Resident_Age',
 'DOB']

Now let's set all of our variables to upper case for consistency, using **standardise_case()**:

In [12]:
census = standardisation.standardise_case(census)
ccs = standardisation.standardise_case(ccs)

census.show(truncate = False)

+--------------------+------------+------------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|             Address|    FORENAME|     SURNAME|           ID_Census|   Marital_Status|Postcode|   Sex|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|       DOB|
+--------------------+------------+------------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48
COOPER ...|MRS MARGARET|        ROSS|C4064232788196233825|              NAN|CV25 4ZY|      |                    6|                      8|                  1956|          66|06/08/1956|
|43 REBECCA STREET...|  MRS DARREN|     BALDWIN|C7365350289112516537|           SINGLE|  E2 0LP|    -7|                   29|                     12|                  2013|           9|29/12/2013|
|04 LANE SHORES

Next, the values for missingness are all over the place. I can spot a few NaNs, minus 7/9s, and whitespaces. Let's standardise missingness with the **standardise_null()** function. We can retrieve these null values from the previous **value_counts()** outputs: 

In [17]:
# we can use the standardise_null function to replace these with true None values:
# we use regex to do this: https://regex101.com/ 
census = standardisation.standardise_null(census, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$|^###$")
ccs = standardisation.standardise_null(ccs, replace = r"^NAN$|^NULL$|^\s*$|^-7$|^-9$|^###$")

census.show(truncate = False)

+-------------------------------------------+------------+------------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|Address                                    |FORENAME    |SURNAME     |ID_Census           |Marital_Status   |Postcode|Sex   |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB       |
+-------------------------------------------+------------+------------+--------------------+-----------------+--------+------+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48
COOPER STREET, PORT FREDERICKTOWN|MRS MARGARET|ROSS        |C4064232788196233825|null             |CV25 4ZY|null  |6                    |8                      |1956                  |66          |06/08/1956|
|43 REBECCA STREET, HARVEYTOWN              |MRS DARREN  |BALDWIN     |C7365350289112516537|SINGLE           |E2 0LP  |n

Great, these now all show up as true nulls. 
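
If you want to check what a pattern like this will and won't catch before running it on the full data, you can try it on a few values in plain Python. This sketch uses Python's `re` module, whose syntax agrees with Spark's regex engine for a simple pattern like this one; the helper `to_none` is made up for the example.

```python
import re

# same pattern as above: each alternative is anchored with ^ and $,
# so only whole values match, never substrings
NULL_PATTERN = re.compile(r"^NAN$|^NULL$|^\s*$|^-7$|^-9$|^###$")


def to_none(value):
    """Replace recognised missingness codes with a true None."""
    if value is None or NULL_PATTERN.search(value):
        return None
    return value


raw = ["NAN", "", "   ", "-7", "-9", "###", "SINGLE", "NANCY", "-90"]
cleaned = [to_none(v) for v in raw]
print(cleaned)
```

The anchors are what keep genuine values such as `NANCY` or `-90` from being wiped out along with the placeholders.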

Next, we have a mix of 1s, 2s, Ms, and Fs in our sex column. Let's standardise this to be either 1s or 2s. For this we can use **reg_replace()**:

In [19]:
# reg_replace() takes a dictionary, where the value is the regex to replace, and the key is what this will be replaced with
# so we're replacing 'M' with '1', and 'F' with '2':
census = standardisation.reg_replace(census, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})
ccs = standardisation.reg_replace(ccs, subset = "SEX", dic = {"1":"^M$|^MALE$","2":"^F$|^FEMALE$"})

census.show(truncate = False)

+-------------------------------------------+------------+------------+--------------------+-----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|Address                                    |FORENAME    |SURNAME     |ID_Census           |Marital_Status   |Postcode|SEX |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB       |
+-------------------------------------------+------------+------------+--------------------+-----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48
COOPER STREET, PORT FREDERICKTOWN|MRS MARGARET|ROSS        |C4064232788196233825|null             |CV25 4ZY|null|6                    |8                      |1956                  |66          |06/08/1956|
|43 REBECCA STREET, HARVEYTOWN              |MRS DARREN  |BALDWIN     |C7365350289112516537|SINGLE           |E2 0LP  |null|29  
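
The dictionary passed to `reg_replace()` reads "replacement: pattern", which is easy to get the wrong way round. Here is a small plain-Python sketch of that convention using `re.sub`. The helper `apply_replacements` is hypothetical and is not the library's implementation.

```python
import re


def apply_replacements(value, dic):
    """Apply {replacement: pattern} pairs in order; None passes through."""
    if value is None:
        return None
    for replacement, pattern in dic.items():
        value = re.sub(pattern, replacement, value)
    return value


# key = what to write, value = the regex to look for
sex_map = {"1": "^M$|^MALE$", "2": "^F$|^FEMALE$"}
recoded = [apply_replacements(v, sex_map)
           for v in ["M", "MALE", "F", "FEMALE", "1", "2", None]]
print(recoded)

# the same convention replaces new line characters with spaces
print(apply_replacements("Studio 48\nCooper street", {" ": "\n"}))
```

Anchoring each pattern with `^` and `$` means `FEMALE` is not partly rewritten by the `MALE` rule.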

Now that our sex column is populated with just 1s and 2s, we might want to change the type from string to integer. This can be done using the **cast_type()** function in the standardisation module:

In [20]:
# Casting strings to integer can increase performance

census = standardisation.cast_type(census, subset = ['SEX'], types = "integer")
ccs = standardisation.cast_type(ccs, subset = ['SEX'], types = "integer")

census.select('SEX').dtypes

[('SEX', 'int')]

Let's begin to have a look at our postcode, address and name variables. 

A lot of the variables contain multiple white spaces in a row. We can remove white spaces altogether using the **standardise_white_space()** function, setting the white space level (wsl) to none. 

However, we don't want to remove all of the white spaces in our address variable, as this would make the text unreadable. Instead, we can set wsl to one, meaning that gaps of 2 spaces or more will be reduced to a single space.  

In [21]:
# Using a list comprehension, we remove all white spaces from all columns except Address
census = standardisation.standardise_white_space(census, 
                                                 subset = [column for column in census.columns if column != 'Address'], 
                                                 wsl = "none")
ccs = standardisation.standardise_white_space(ccs, 
                                              subset = [column for column in ccs.columns if column != 'Address'], 
                                              wsl = "none")

# Then we allow a single white space for the Address column
census = standardisation.standardise_white_space(census, subset = 'Address', wsl = "one")
ccs = standardisation.standardise_white_space(ccs, subset = 'Address', wsl = "one")

census.show(truncate = False)

+--------------------+-----------+------------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|             Address|   FORENAME|     SURNAME|           ID_Census|  Marital_Status|Postcode| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|       DOB|
+--------------------+-----------+------------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48 COOPER ...|MRSMARGARET|        ROSS|C4064232788196233825|            null| CV254ZY|null|                    6|                      8|                  1956|          66|06/08/1956|
|43 REBECCA STREET...|  MRSDARREN|     BALDWIN|C7365350289112516537|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|29/12/2013|
|04 LANE SHORES, S...|    MRSERIC| 
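
The two white space levels can be pictured with a couple of regex substitutions. This is a sketch of the idea only: the helper `squeeze_white_space` is hypothetical, and whether the library treats tabs and new lines the same way as spaces is an assumption here.

```python
import re


def squeeze_white_space(value, wsl="one"):
    """Remove all white space ('none') or collapse runs to one space ('one')."""
    if value is None:
        return None
    if wsl == "none":
        return re.sub(r"\s+", "", value)
    if wsl == "one":
        return re.sub(r"\s+", " ", value)
    raise ValueError("wsl must be 'one' or 'none'")


print(squeeze_white_space("CV25  4ZY", wsl="none"))
print(squeeze_white_space("STUDIO 48   COOPER  STREET", wsl="one"))
# leading and trailing gaps shrink to one space but are not removed
print(repr(squeeze_white_space("  7 NOBLE  VALLEY ", wsl="one")))
```

Collapsing to one space still leaves a single leading or trailing space behind, which is why a separate trimming step is useful afterwards.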

We still have some leading/trailing whitespaces in some of our variables, let's **trim()** these:

In [22]:
census = standardisation.trim(census)
ccs = standardisation.trim(ccs)

census.show(truncate = False)

+--------------------+-----------+------------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|             Address|   FORENAME|     SURNAME|           ID_Census|  Marital_Status|Postcode| SEX|Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|       DOB|
+--------------------+-----------+------------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48 COOPER ...|MRSMARGARET|        ROSS|C4064232788196233825|            null| CV254ZY|null|                    6|                      8|                  1956|          66|06/08/1956|
|43 REBECCA STREET...|  MRSDARREN|     BALDWIN|C7365350289112516537|          SINGLE|   E20LP|null|                   29|                     12|                  2013|           9|29/12/2013|
|04 LANE SHORES, S...|    MRSERIC| 

Next, let's focus on our name variables. Forenames still contain titles and some surnames have common prefixes like 'Van' or 'Der'. We can strip out titles and concatenate surname prefixes with our **clean_forename()** and **clean_surname()** functions. 

In [24]:
census = standardisation.clean_forename(census, subset = 'FORENAME')
ccs = standardisation.clean_forename(ccs, subset = 'FORENAME')

census = standardisation.clean_surname(census, subset = 'SURNAME')
ccs = standardisation.clean_surname(ccs, subset = 'SURNAME')

census.show(truncate = False)

+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|Address                                    |FORENAME   |SURNAME    |ID_Census           |Marital_Status  |Postcode|SEX |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB       |
+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|MRSMARGARET|ROSS       |C4064232788196233825|null            |CV254ZY |null|6                    |8                      |1956                  |66          |06/08/1956|
|43 REBECCA STREET, HARVEYTOWN              |MRSDARREN  |BALDWIN    |C7365350289112516537|SINGLE          |E20LP   |null|29                 
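
Note that the forenames in the output above still carry their titles (MRSMARGARET, MRSDARREN). One possible explanation, which we have not checked against the dlh_utils source, is that title removal relies on the title being a separate word, and the earlier step that removed all white space from FORENAME has glued the two together. The sketch below uses a made-up title pattern to show how the order of cleaning steps can matter:

```python
import re

# hypothetical title pattern: a title only counts if it is a whole word
TITLE = re.compile(r"\b(MR|MRS|MISS|MS|DR)\b\s*")


def strip_title(forename):
    """Remove a stand-alone title from a forename."""
    if forename is None:
        return None
    return TITLE.sub("", forename).strip()


# title still separated by a space: it is removed
print(strip_title("MRS MARGARET"))
# white space already removed: there is no word boundary, so nothing changes
print(strip_title("MRSMARGARET"))
```

If this is what is happening, cleaning names before removing white space from them would avoid the problem.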

Finally, let's strip out numbers from our name variables. Again, we can use the **reg_replace()** function for this:

In [25]:
census = standardisation.reg_replace(census, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})
ccs = standardisation.reg_replace(ccs, subset = ["FORENAME","SURNAME"], dic = {"": "[0-9]"})

census.show(truncate = False)

+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|Address                                    |FORENAME   |SURNAME    |ID_Census           |Marital_Status  |Postcode|SEX |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB       |
+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+----------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|MRSMARGARET|ROSS       |C4064232788196233825|null            |CV254ZY |null|6                    |8                      |1956                  |66          |06/08/1956|
|43 REBECCA STREET, HARVEYTOWN              |MRSDARREN  |BALDWIN    |C7365350289112516537|SINGLE          |E20LP   |null|29                 

This still leaves apostrophes and hyphens in our name variables. The **remove_punct()** function can handle these. While we're at it, let's also use **remove_punct()** to get rid of dashes in our address field, but we'll have to specify the optional argument **keep** to make sure it doesn't strip out commas from addresses:

In [26]:
#First remove all punctuation from every column except address
census = standardisation.remove_punct(census, 
                                      subset = [column for column in census.columns if column != 'Address'], 
                                      )

ccs = standardisation.remove_punct(ccs, 
                                   subset = [column for column in ccs.columns if column != 'Address']
                                  )


#Then remove the punctuation from address except for the commas
census = standardisation.remove_punct(census, subset = 'Address', keep = ',')
ccs = standardisation.remove_punct(ccs, subset = 'Address', keep = ',')

census.show(truncate = False)

+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+--------+
|Address                                    |FORENAME   |SURNAME    |ID_Census           |Marital_Status  |Postcode|SEX |Resident_Day_Of_Birth|Resident_Month_Of_Birth|Resident_Year_Of_Birth|Resident_Age|DOB     |
+-------------------------------------------+-----------+-----------+--------------------+----------------+--------+----+---------------------+-----------------------+----------------------+------------+--------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|MRSMARGARET|ROSS       |C4064232788196233825|null            |CV254ZY |null|6                    |8                      |1956                  |66          |06081956|
|43 REBECCA STREET, HARVEYTOWN              |MRSDARREN  |BALDWIN    |C7365350289112516537|SINGLE          |E20LP   |null|29                   |12   
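
Here is a minimal plain-Python sketch of punctuation removal with a `keep` option. The helper `strip_punct` is hypothetical, and which characters the library counts as punctuation is not something we have verified; this version uses `string.punctuation`.

```python
import string


def strip_punct(value, keep=""):
    """Delete punctuation characters, except any listed in `keep`."""
    if value is None:
        return None
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return value.translate(str.maketrans("", "", to_remove))


print(strip_punct("O'NEIL-SMITH"))
print(strip_punct("FLAT 2-B ROSS EXPRESSWAY, BRAYSHIRE", keep=","))
# slashes are punctuation too, so unprotected dates lose their separators
print(strip_punct("06/08/1956"))
```

The last call mirrors what happened to DOB in the output above: because it was included in the first subset, `06/08/1956` became `06081956`. Leave DOB out of the subset if you want to keep the separators.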

# Derive Variables

We've got quite a few identifying variables that we can split out into further variables for matching. 

First, let's derive street and town from the address variable. The **split()** function from the dataframes module will be useful here, splitting on comma. 

In [27]:
# This will create a new column called "ADDRESS_SPLIT" that contains an array of each address element, separated by a comma
census = dataframes.split(census, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")
ccs = dataframes.split(ccs, col_in = "ADDRESS", col_out = "ADDRESS_SPLIT", split_on = ",")

census.select("ADDRESS", "ADDRESS_SPLIT").show(truncate = False)

+-------------------------------------------+----------------------------------------------+
|ADDRESS                                    |ADDRESS_SPLIT                                 |
+-------------------------------------------+----------------------------------------------+
|STUDIO 48 COOPER STREET, PORT FREDERICKTOWN|[STUDIO 48 COOPER STREET,  PORT FREDERICKTOWN]|
|43 REBECCA STREET, HARVEYTOWN              |[43 REBECCA STREET,  HARVEYTOWN]              |
|04 LANE SHORES, SOUTH DANIELFORT           |[04 LANE SHORES,  SOUTH DANIELFORT]           |
|7 NOBLE VALLEY, LAKE SIMONVILLE            |[7 NOBLE VALLEY,  LAKE SIMONVILLE]            |
|57 PEARSON CORNER, JOANNABOROUGH           |[57 PEARSON CORNER,  JOANNABOROUGH]           |
|0 JEREMY MOUNTAINS, NORTH FRANK            |[0 JEREMY MOUNTAINS,  NORTH FRANK]            |
|STUDIO 5 FULLER BURGS, NEW LINDSEY         |[STUDIO 5 FULLER BURGS,  NEW LINDSEY]         |
|644 GARRY WALK, BLACKBURNVILLE             |[644 GARRY WALK,  BLACKBU

In [28]:
# We can then select the first element of the 'split address' to create the 'street address' variable
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "STREET", index = 0)

# The second element contains the town name, which we can write to a new column as well
census = dataframes.index_select(census, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)
ccs = dataframes.index_select(ccs, split_col = "ADDRESS_SPLIT", out_col = "TOWN", index = 1)

# Since we no longer need the 'ADDRESS_SPLIT' column, we can remove it using our drop_columns() function
census = dataframes.drop_columns(census, subset = 'ADDRESS_SPLIT')
ccs = dataframes.drop_columns(ccs, subset = 'ADDRESS_SPLIT')

census.select("ADDRESS", "STREET", "TOWN").show(truncate = False)

+-----------------------------------------+------------------------+-----------------+
|ADDRESS                                  |STREET                  |TOWN             |
+-----------------------------------------+------------------------+-----------------+
|0 JEREMY MOUNTAINS, NORTH FRANK          |0 JEREMY MOUNTAINS      | NORTH FRANK     |
|FLAT 2 OLIVER CORNERS, BAXTERTON         |FLAT 2 OLIVER CORNERS   | BAXTERTON       |
|273 KIM CIRCLES, WEST CARL               |273 KIM CIRCLES         | WEST CARL       |
|STUDIO 48 NICOLA FORKS, CATHERINEFORT    |STUDIO 48 NICOLA FORKS  | CATHERINEFORT   |
|191 GEORGE VIEWS, WEST BERNARD           |191 GEORGE VIEWS        | WEST BERNARD    |
|FLAT 63G JACK STREAM, DANNYCHESTER       |FLAT 63G JACK STREAM    | DANNYCHESTER    |
|1 MARTIN CREEK, IANBERG                  |1 MARTIN CREEK          | IANBERG         |
|4 WARREN ESTATES, HERBERTCHESTER         |4 WARREN ESTATES        | HERBERTCHESTER  |
|2 PRITCHARD STRAVENUE, DARRENMOUTH       |
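
Note that the TOWN values in the output above keep the space that followed the comma in ADDRESS. As a rough illustration of the split-and-select logic (not the dlh_utils implementation), here is a minimal pure-Python sketch. The helper name `split_address_sketch` is hypothetical, and trimming the whitespace is an added assumption about what you might want before matching:

```python
def split_address_sketch(address, sep=","):
    """Split an address on `sep` and return (street, town), both trimmed.

    Hypothetical helper for illustration only; it is not part of dlh_utils.
    """
    parts = address.split(sep)
    street = parts[0].strip()
    # Trim the town so ' NORTH FRANK' and 'NORTH FRANK' compare as equal
    town = parts[1].strip() if len(parts) > 1 else None
    return street, town


street, town = split_address_sketch("0 JEREMY MOUNTAINS, NORTH FRANK")
print(repr(street), repr(town))
```

In Spark, a similar trim could be applied to the derived column with `F.trim()` from `pyspark.sql.functions`.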

We can create a 'full name' variable by concatenating the two existing name columns together, using **concat()**:

In [29]:
census = dataframes.concat(census, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")
ccs = dataframes.concat(ccs, columns = ["FORENAME", "SURNAME"], sep = " ", out_col = "FULL_NAME")

census.select("FORENAME", "SURNAME", "FULL_NAME").show()

+-----------+------------+--------------------+
|   FORENAME|     SURNAME|           FULL_NAME|
+-----------+------------+--------------------+
|   MRSCHLOE|    CHANDLER|   MRSCHLOE CHANDLER|
|      REECE|        LONG|          REECE LONG|
|   MRSMOLLY|     SKINNER|    MRSMOLLY SKINNER|
|      LYDIA|        WEBB|          LYDIA WEBB|
|    SUZANNE|VANGALLAGHER|SUZANNE VANGALLAGHER|
|   ROSEMARY|      CLARKE|     ROSEMARY CLARKE|
|      SCOTT|       YATES|         SCOTT YATES|
|       GLEN|       CLARK|          GLEN CLARK|
|   GEORGINA|      FULLER|     GEORGINA FULLER|
|    MRSHUGH|      HARRIS|      MRSHUGH HARRIS|
|MRCATHERINE|  FITZGERALD|MRCATHERINE FITZG...|
|   MROLIVER|     SIMPSON|    MROLIVER SIMPSON|
|     JUSTIN|      THORPE|       JUSTIN THORPE|
|   MRJOANNE|      DAWSON|     MRJOANNE DAWSON|
|  MRSRONALD|        LORD|      MRSRONALD LORD|
|     HILARY|      MISTRY|       HILARY MISTRY|
|       RHYS|     MANNING|        RHYS MANNING|
|   CAROLINE|       JONES|      CAROLINE

For ethnically diverse datasets, phonetic encodings of name variables may aid matching. We have functions for this in the linkage module. 
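
To make the idea concrete, here is a minimal pure-Python sketch of the classic American Soundex rules. It is an illustration only and not the code behind `linkage.soundex()`; the name `soundex_sketch` is hypothetical. Names that sound alike, such as REECE and RHYS, end up with the same code:

```python
# Consonant groups that share a Soundex digit
_SOUNDEX_CODES = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit


def soundex_sketch(name):
    """Return the 4-character Soundex code of `name` (illustrative sketch)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            # H and W are skipped and do not separate repeated codes
            continue
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        # Vowels have no code, so they reset `previous`
        previous = code
    # Pad with zeros and keep the first four characters
    return (encoded + "000")[:4]


print(soundex_sketch("REECE"), soundex_sketch("RHYS"))
```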

In [30]:
census = linkage.metaphone(df = census, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
census = linkage.soundex(df = census, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

ccs = linkage.metaphone(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_METAPHONE')
ccs = linkage.soundex(df = ccs, input_col = 'FORENAME', output_col = 'FORENAME_SOUNDEX')

census.select("FORENAME", "FORENAME_METAPHONE", "FORENAME_SOUNDEX").show()  

+-----------+------------------+----------------+
|   FORENAME|FORENAME_METAPHONE|FORENAME_SOUNDEX|
+-----------+------------------+----------------+
|   MRSCHLOE|             MRSXL|            M624|
|      REECE|                RS|            R200|
|   MRSMOLLY|             MRSML|            M625|
|      LYDIA|                LT|            L300|
|    SUZANNE|               SSN|            S250|
|   ROSEMARY|              RSMR|            R256|
|      SCOTT|               SKT|            S300|
|       GLEN|               KLN|            G450|
|   GEORGINA|              JRJN|            G625|
|    MRSHUGH|               MRX|            M622|
|MRCATHERINE|            MRK0RN|            M623|
|   MROLIVER|             MRLFR|            M641|
|     JUSTIN|              JSTN|            J235|
|   MRJOANNE|              MRJN|            M625|
|  MRSRONALD|           MRSRNLT|            M626|
|     HILARY|               HLR|            H460|
|       RHYS|                RS|            R200|


Similarly, if there have been spelling mistakes, alphabetising string columns may also aid matching. We have a function for this in the linkage module. 

In [31]:
census = linkage.alpha_name(census, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')
ccs = linkage.alpha_name(ccs, input_col = 'FORENAME', output_col = 'ALPHABETISE_FORENAME')

census.select("FORENAME", "ALPHABETISE_FORENAME").show()

+-----------+--------------------+
|   FORENAME|ALPHABETISE_FORENAME|
+-----------+--------------------+
|   MRSCHLOE|            CEHLMORS|
|      REECE|               CEEER|
|   MRSMOLLY|            LLMMORSY|
|      LYDIA|               ADILY|
|    SUZANNE|             AENNSUZ|
|   ROSEMARY|            AEMORRSY|
|      SCOTT|               COSTT|
|       GLEN|                EGLN|
|   GEORGINA|            AEGGINOR|
|    MRSHUGH|             GHHMRSU|
|MRCATHERINE|         ACEEHIMNRRT|
|   MROLIVER|            EILMORRV|
|     JUSTIN|              IJNSTU|
|   MRJOANNE|            AEJMNNOR|
|  MRSRONALD|           ADLMNORRS|
|     HILARY|              AHILRY|
|       RHYS|                HRSY|
|   CAROLINE|            ACEILNOR|
|   MRCHERYL|            CEHLMRRY|
|   KAYLEIGH|            AEGHIKLY|
+-----------+--------------------+
only showing top 20 rows
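
The reason this helps is that sorting the letters removes their order, so transposed characters no longer cause a mismatch. A minimal pure-Python sketch of the idea follows; `alpha_name_sketch` is a hypothetical name, and this is not the dlh_utils implementation:

```python
def alpha_name_sketch(name):
    """Return the letters of `name` in alphabetical order (illustrative)."""
    return "".join(sorted(c for c in name.upper() if c.isalpha()))


# 'LDYIA' is 'LYDIA' with two letters swapped; both give the same key
print(alpha_name_sketch("LYDIA"), alpha_name_sketch("LDYIA"))
```

The trade-off is that genuinely different names made of the same letters also collide, so an alphabetised name is best used alongside other variables in a matchkey.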



There are more common matching variables we could still derive. For example, a common practice in data linkage is to derive a postcode district variable instead of using the full postcode.

The second part of a postcode is *always* 3 characters, whilst the first part can range from 2 to 4 characters. Therefore, to derive postcode district, we remove the last 3 characters from the postcode.

This can be done using the **substring()** function:

In [32]:
census = dataframes.substring(census, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)
ccs = dataframes.substring(ccs, out_col = "PC_DISTRICT", target_col = "POSTCODE", start = 4, length = 4, from_end = True)

census.select("POSTCODE", "PC_DISTRICT").show()

+--------+-----------+
|POSTCODE|PC_DISTRICT|
+--------+-----------+
|  G7F3RE|        G7F|
| CM6R8DA|       CM6R|
| RH3R4NW|       RH3R|
|  IM54HR|        IM5|
|  E6C2JS|        E6C|
| RH712UT|       RH71|
|  E5D9EQ|        E5D|
|  E589JW|        E58|
|  E2A1NY|        E2A|
|  M307WF|        M30|
| LE820DP|       LE82|
|  S361ZJ|        S36|
|  W156AS|        W15|
|  S1E7GA|        S1E|
| FK167EH|       FK16|
| WF714QS|       WF71|
|   M99SR|         M9|
| LE081BB|       LE08|
|  B2S1RU|        B2S|
|  S050ED|        S05|
+--------+-----------+
only showing top 20 rows
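
The rule itself is easy to check outside Spark. The sketch below is a plain-Python illustration of "drop the last 3 characters"; the name `postcode_district_sketch` is hypothetical, and stripping spaces first is an added assumption for postcodes that are stored with a space:

```python
def postcode_district_sketch(postcode):
    """Return the postcode with its last 3 characters removed (illustrative)."""
    compact = postcode.replace(" ", "").upper()
    # Anything of 3 characters or fewer cannot contain a district
    return compact[:-3] if len(compact) > 3 else None


for pc in ["G7F3RE", "CM6R8DA", "M99SR"]:
    print(pc, postcode_district_sketch(pc))
```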



If you have a time lag between the collection of two surveys you are trying to link together, you may want to align respondent ages for matching. We can do this using the **age_at()** function.

This function takes a few arguments:
* the dataframe
* the name of the date of birth column
* the date format the date of birth column is in
* the date(s) to calculate age at
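
As a rough illustration of the calculation involved, here is a minimal pure-Python sketch of "age in whole years at a reference date". It is not the dlh_utils implementation, and `age_at_sketch` is a hypothetical name. Note that Python's `%d/%m/%Y` is the equivalent of the Spark pattern `dd/MM/yyyy`:

```python
from datetime import datetime


def age_at_sketch(dob, reference_date, date_format="%d/%m/%Y"):
    """Return age in completed years on `reference_date` (illustrative)."""
    born = datetime.strptime(dob, date_format).date()
    ref = datetime.strptime(reference_date, date_format).date()
    # Subtract one year if the birthday has not yet occurred in the reference year
    had_birthday = (ref.month, ref.day) >= (born.month, born.day)
    return ref.year - born.year - (0 if had_birthday else 1)


print(age_at_sketch("22/03/2000", "21/03/2021"))  # birthday is the next day
print(age_at_sketch("21/03/2000", "21/03/2021"))  # birthday is on the reference date
```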

In [33]:
# We can find out their age at the most recent Census, for example:
census_date = '21/03/2021'

census = standardisation.age_at(census, 'DOB', 'dd/MM/yyyy', census_date)
ccs = standardisation.age_at(ccs, 'DOB', 'dd/MM/yyyy', census_date)

census.select('DOB','age_at_21/03/2021')

DOB,age_at_21/03/2021
,
,
,
,
,
,
,
,
,
,


In [1]:
# NOT SURE IF THE BELOW IS RIGHT - JUST TAKING IT FROM DAP VERSION BEFORE DELETED

# Deduplication

This is easily done by defining our duplicate matchkey(s) and passing them to the **deduplicate()** function:

In [None]:
# define our matchkey
deduplicate_mkey = ['First_Name', 'Last_Name','Resident_Age','Sex','Postcode','Address']
census.count()

In [None]:
census = linkage.deduplicate(df = census, record_id = 'Resident_ID', mks = deduplicate_mkey)
ccs = linkage.deduplicate(df = ccs, record_id = 'Resident_ID', mks = deduplicate_mkey)
census.count()
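
Conceptually, deduplicating on a matchkey means treating records that agree on every matchkey variable as the same entity. The pure-Python sketch below shows one simple policy, keeping the first record seen for each key. That policy is an assumption made for illustration; `linkage.deduplicate()` may handle duplicates differently, and `deduplicate_sketch` and the toy records are hypothetical:

```python
def deduplicate_sketch(records, key_fields):
    """Keep the first record for each combination of `key_fields`."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Toy records, made up for illustration
people = [
    {"Resident_ID": 1, "First_Name": "REECE", "Last_Name": "LONG"},
    {"Resident_ID": 2, "First_Name": "REECE", "Last_Name": "LONG"},
    {"Resident_ID": 3, "First_Name": "LYDIA", "Last_Name": "WEBB"},
]
unique_people = deduplicate_sketch(people, ["First_Name", "Last_Name"])
print([p["Resident_ID"] for p in unique_people])
```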

Now that we've removed duplicates, we can start to investigate some matchkeys:

In [None]:
# first, let's suffix each dataset's columns to distinguish the two dataframes 
census = dataframes.suffix_columns(census, suffix = '_census')
ccs = dataframes.suffix_columns(ccs, suffix = '_ccs')

census.persist().count()
ccs.persist().count()

In [None]:
MK1 = [census.Full_Name_census == ccs.Full_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# letting middle name be a mismatch 
MK2 = [census.First_Name_census == ccs.First_Name_ccs,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# taking the phonetic encoding of forename - using the metaphone algorithm
MK3 = [census.forename_metaphone_census == ccs.forename_metaphone_ccs,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# Now allowing for misspellings rather than mishearings of names, using standardised Levenshtein edit distance
MK4 = [linkage.std_lev_score(F.col('First_Name_census'),F.col('First_Name_ccs')) > 0.7,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

# similar to the above, but now using a different string comparison algorithm - the Jaro comparator
MK5 = [linkage.jaro(F.col('First_Name_census'),F.col('First_Name_ccs')) > 0.7,
       census.Last_Name_census == ccs.Last_Name_ccs,
       census.Sex_census == ccs.Sex_ccs,
       census.Resident_Age_census == ccs.Resident_Age_ccs,
       census.Postcode_census == ccs.Postcode_ccs]

matchkeys = [MK1,MK2,MK3,MK4,MK5]
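
To give a feel for what a threshold such as 0.7 means in MK4, here is a minimal pure-Python sketch of a standardised Levenshtein score. It assumes the score is one minus the edit distance divided by the length of the longer string; this is a common definition, but it is an assumption here rather than a confirmed description of `linkage.std_lev_score()`. The function names are hypothetical:

```python
def levenshtein_sketch(a, b):
    """Return the Levenshtein edit distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def std_lev_score_sketch(a, b):
    """Scale edit distance to the range 0 to 1, where 1 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein_sketch(a, b) / longest


# One substitution in nine characters scores well above 0.7
print(round(std_lev_score_sketch("CATHERINE", "KATHERINE"), 3))
```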

In [None]:
links = linkage.deterministic_linkage(df_l = census, df_r = ccs, id_l = 'Resident_ID_census', id_r = 'Resident_ID_ccs',
                                      matchkeys = matchkeys, out_dir = '/user/edwara5/census_ccs_links')

In [None]:
links.show()
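
For readers new to matchkeys, the sketch below illustrates one common way hierarchical deterministic linkage is organised: matchkeys are applied from strictest to loosest, and records linked by an earlier matchkey are removed before the next one is tried. This is an assumption about the general technique, not a description of how `linkage.deterministic_linkage()` is implemented. The function name, field names and toy records are all hypothetical, and only exact-agreement matchkeys are shown:

```python
def deterministic_link_sketch(left, right, matchkeys):
    """Link records that agree exactly on each matchkey, applied in order.

    Returns a list of (left_id, right_id, matchkey_number) tuples.
    """
    links = []
    linked_left, linked_right = set(), set()
    for mk_number, fields in enumerate(matchkeys, start=1):
        # Index the still-unlinked right-hand records by their key values
        index = {}
        for rec in right:
            if rec["id"] not in linked_right:
                key = tuple(rec[f] for f in fields)
                index.setdefault(key, []).append(rec["id"])
        new_links = []
        for rec in left:
            if rec["id"] in linked_left:
                continue
            for right_id in index.get(tuple(rec[f] for f in fields), []):
                new_links.append((rec["id"], right_id, mk_number))
        links.extend(new_links)
        # Remove linked records from consideration by later matchkeys
        linked_left.update(l for l, _, _ in new_links)
        linked_right.update(r for _, r, _ in new_links)
    return links


# Toy records, made up for illustration
toy_census = [
    {"id": "c1", "full_name": "REECE LONG", "forename": "REECE",
     "surname": "LONG", "postcode": "G7F3RE"},
    {"id": "c2", "full_name": "LYDIA WEBB", "forename": "LYDIA",
     "surname": "WEBB", "postcode": "IM54HR"},
]
toy_ccs = [
    {"id": "s1", "full_name": "REECE LONG", "forename": "REECE",
     "surname": "LONG", "postcode": "G7F3RE"},
    {"id": "s2", "full_name": "LYDIA J WEBB", "forename": "LYDIA",
     "surname": "WEBB", "postcode": "IM54HR"},
]
toy_matchkeys = [["full_name", "postcode"],
                 ["forename", "surname", "postcode"]]

toy_links = deterministic_link_sketch(toy_census, toy_ccs, toy_matchkeys)
print(toy_links)
```

Recording which matchkey produced each link is useful for clerical review, since links made on looser matchkeys usually deserve closer scrutiny.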