# Retirement in Australia
## Data Pre-Processing

This project uses unit record data from the Household, Income and Labour Dynamics in Australia (HILDA) Survey. The HILDA Survey was initiated and is funded by the Australian Government Department of Social Services (DSS) and is managed by the Melbourne Institute of Applied Economic and Social Research (Melbourne Institute). The findings and views reported in this project, however, are those of the author and should not be attributed to the Australian Government, DSS or the Melbourne Institute. DOI: 10.26193/R4IN30.

<br>
<br>

### Step 1. Convert SAS files to .csv format

The data is sourced from the Australian Data Archive Dataverse system (ADA Dataverse): [HILDA General Release 22, Waves 1-22](https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.26193/R4IN30)

To access the data, you need to submit an application to ADA Dataverse and have it approved. The data files can be downloaded in SAS, Stata or SPSS format. This project uses the .sas7bdat files: 20 files, lettered *a* to *t*, one per survey wave from 2001 to 2020:

| File Name                 | Description                        |
|---------------------------|------------------------------------|
| combined_a200c.sas7bdat   | Wave A - Combined data for 2001    |
| combined_b200c.sas7bdat   | Wave B - Combined data for 2002    |
| combined_c200c.sas7bdat   | Wave C - Combined data for 2003    |
| combined_d200c.sas7bdat   | Wave D - Combined data for 2004    |
| combined_e200c.sas7bdat   | Wave E - Combined data for 2005    |
| combined_f200c.sas7bdat   | Wave F - Combined data for 2006    |
| combined_g200c.sas7bdat   | Wave G - Combined data for 2007    |
| combined_h200c.sas7bdat   | Wave H - Combined data for 2008    |
| combined_i200c.sas7bdat   | Wave I - Combined data for 2009    |
| combined_j200c.sas7bdat   | Wave J - Combined data for 2010    |
| combined_k200c.sas7bdat   | Wave K - Combined data for 2011    |
| combined_l200c.sas7bdat   | Wave L - Combined data for 2012    |
| combined_m200c.sas7bdat   | Wave M - Combined data for 2013    |
| combined_n200c.sas7bdat   | Wave N - Combined data for 2014    |
| combined_o200c.sas7bdat   | Wave O - Combined data for 2015    |
| combined_p200c.sas7bdat   | Wave P - Combined data for 2016    |
| combined_q200c.sas7bdat   | Wave Q - Combined data for 2017    |
| combined_r200c.sas7bdat   | Wave R - Combined data for 2018    |
| combined_s200c.sas7bdat   | Wave S - Combined data for 2019    |
| combined_t200c.sas7bdat   | Wave T - Combined data for 2020    |


You do not need SAS software to read .sas7bdat files; you can open them directly in Python or R. The *src* folder in my GitHub repository contains two helper files:
* convert_sas_to_csv.py - converts SAS files to .csv format using *pandas* and *pyreadstat*.
* convert_sas_to_csv.R - converts SAS files to .csv format using *haven* from the *tidyverse* family.

Depending on your Python library versions, you may see errors during conversion, starting from *combined_n200c.sas7bdat*. The R helper takes longer to convert the SAS files, but it is robust to these errors.

Once the conversion is done you will have 20 files in .csv format, with the same names as the original files but with the .csv extension.
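
As a rough illustration of the conversion step, the sketch below loops over every .sas7bdat file in a folder and writes a .csv with the same base name. It is not the repository's convert_sas_to_csv.py: the function name `convert_folder`, its `reader` parameter and the folder paths in the usage comment are hypothetical. It assumes *pyreadstat* is installed and relies on `pyreadstat.read_sas7bdat`, which returns a dataframe together with a metadata object.

```python
from pathlib import Path


def convert_folder(sas_folder, csv_folder, reader=None):
    """Convert every .sas7bdat file in sas_folder to a .csv in csv_folder.

    `reader` is a hypothetical hook: any callable that takes a file path
    and returns a pandas DataFrame. By default pyreadstat is used.
    """
    if reader is None:
        import pyreadstat  # imported lazily so a custom reader works without it

        def reader(path):
            df, _meta = pyreadstat.read_sas7bdat(str(path))
            return df

    out_dir = Path(csv_folder)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for sas_path in sorted(Path(sas_folder).glob("*.sas7bdat")):
        csv_path = out_dir / (sas_path.stem + ".csv")
        # index=False keeps the row index out of the csv file
        reader(sas_path).to_csv(csv_path, index=False)
        written.append(csv_path.name)
    return written


# Hypothetical usage (paths are assumptions):
# convert_folder("../data/raw_sas/", "../data/raw_csv/")
```

Writing with `index=False` is a design choice in this sketch; it avoids the extra `Unnamed: 0` style columns that appear when a saved row index is read back in.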


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Folder where you store hilda csv files
folder_path = "../data/raw_csv/"

# Sample file
filename = "combined_c200c.csv"

# Read in the sample file
df = pd.read_csv(folder_path + filename, low_memory=False)
print(df.shape)
display(df.head())

(17690, 4480)


Unnamed: 0.1,Unnamed: 0,xwaveid,chhrpid,chhpno,chhstate,xhhstrat,xhhraid,chhrhid,chhhqivw,chhadst,...,crwsc41,crwsc42,crwsc43,crwsc44,crwsc45,chhwtscs,cscmatch,chhinthi,chhinthf,chhinthq
0,1,100001,7974101,1,2,50,279,79741,28/09/2003,3,...,1143.046875,1176.072266,1155.821289,1158.679688,1146.063477,0.85999,1,1058,-1,1058
1,2,100002,7974102,2,2,50,279,79741,28/09/2003,3,...,1086.798828,1118.537109,1100.366211,1098.357422,1080.396484,0.816616,1,1058,-1,1058
2,3,100003,5576101,1,4,84,143,55761,03/10/2003,3,...,926.969238,899.085449,934.718262,924.816895,900.067871,0.703248,1,2021,-1,2021
3,4,100004,5576102,2,4,84,143,55761,03/10/2003,3,...,873.90332,851.333984,889.961426,876.783691,845.202637,0.665602,1,2021,-1,2021
4,5,100005,5576103,3,4,84,143,55761,03/10/2003,3,...,902.345703,869.174805,910.70166,892.404785,861.791992,0.681132,1,2021,-1,2021


Each csv file is structured so the rows represent the respondents and columns represent responses to survey questions. The field *xwaveid* is a unique identifier of a respondent.

<br>
<br> 

### Step 2. Select Survey Questions

The full data dictionary is maintained by the Melbourne Institute of Applied Economic and Social Research and is available here: [HILDA Online Data Dictionary](https://hildaodd.app.unimelb.edu.au/help_info.html).

For the project we only need to select the questions related to retirement.

In [3]:
# Metadata file
metadata = pd.read_csv("../data/metadata/hilda_metadata.csv")
display(metadata)

Unnamed: 0,Group,Variable,Description,1,2,3,4,5,6,7,...,13,14,15,16,17,18,19,20,21,22
0,Identifier,waveid,Cross wave ID,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,Background,hgsex,Sex,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,Background,hhiage,Age,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,Background,anbcob,Country of birth,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,Background,aneab,How well speaks English,0,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
114,Retirement - coordination with partner,rtwsp,Work status of spouse when you retired,0,0,1,0,0,0,1,...,0,0,1,0,0,0,1,0,0,0
115,Retirement - superannuation,rtsup,"When (partly) retired, had any money in a supe...",0,0,1,0,0,0,1,...,0,0,1,0,0,0,1,0,0,0
116,Retirement - superannuation,rtlump,Total lump sum value of your superannuation wh...,0,0,1,0,0,0,1,...,0,0,1,0,0,0,1,0,0,0
117,Retirement - income stream,rtconv,"At around the time that you (partly) retired, ...",0,0,1,0,0,0,1,...,0,0,1,0,0,0,1,0,0,0


In [4]:
metadata.groupby('Group')['Variable'].count()

Group
Background                                 5
Employment                                 4
Family                                     7
Health                                     3
Housing                                    8
Identifier                                 1
Income                                     5
Job Satisfaction                           6
Life Satisfaction                          6
Lifestyle                                  1
Major Life Events                         16
Retirement                                 6
Retirement - coordination with partner     2
Retirement - income stream                 2
Retirement - superannuation                2
Retirement Status                          1
Retirement reasons                        25
Wealth - Home                              2
Wealth - Other Property                    9
Wealth - Superannuation                    8
Name: Variable, dtype: int64

The file *hilda_metadata.csv* lists the 119 variables we need from the HILDA dataset, grouped by category (Employment, Family, Health, etc.). Some questions are only asked every four years. Columns *'1'* to *'22'* each hold a binary value (0 or 1), where 1 indicates that the question is available in that wave.
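
To make the availability flags concrete, here is a minimal sketch that lists the variables flagged as available in a given wave. The toy dataframe mirrors the layout of *hilda_metadata.csv* using three rows from the preview above; the helper name `variables_in_wave` is hypothetical.

```python
import pandas as pd

# Toy metadata in the same layout as hilda_metadata.csv (first three waves only)
toy_metadata = pd.DataFrame({
    "Group": ["Background", "Background", "Retirement - superannuation"],
    "Variable": ["hgsex", "aneab", "rtsup"],
    "1": [1, 0, 0],
    "2": [1, 1, 0],
    "3": [1, 1, 1],
})


def variables_in_wave(metadata, wave_number):
    """Return the variable names flagged as available in the given wave."""
    flag_column = str(wave_number)  # wave columns are named '1', '2', ...
    return metadata.loc[metadata[flag_column] == 1, "Variable"].tolist()


print(variables_in_wave(toy_metadata, 1))  # ['hgsex']
print(variables_in_wave(toy_metadata, 3))  # ['hgsex', 'aneab', 'rtsup']
```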

<br>
<br> 

### Step 3. Combine .csv files

In [5]:
# Define function to create a long-form dataset, combining multiple waves
def get_long_dataset(waves, folder, metadata):
    '''
    Parameters:
        waves - list of wave letters, e.g. ['a', 'b', ...]
        folder - location of network folder containing HILDA .csv files
        metadata - dataframe containing column 'Variable' from HILDA data dictionary
    Returns:
        creates a long form dataset at person and year granularity
    '''
    # Load column description
    dfcols = metadata.copy()

    # Create an empty dataframe
    out = pd.DataFrame()
    
    # Load data
    for letter in waves:
        filename = 'combined_' + letter + '200c.csv'
        hilda = pd.read_csv(folder + filename, low_memory=False)

        # Create lists of available columns 
        cols_temp = letter + dfcols['Variable']
        cols = ['xwaveid'] # list of columns with added wave letter
        cols_original = ['xwaveid'] # list of columns
        
        for col in cols_temp:
            if col in list(hilda.columns):
                cols.append(col)
                cols_original.append(col[1:])

        # Select wave data
        df = hilda[cols].copy()
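        # Note: wrapping the list in [] yields one-level MultiIndex columns such as ('xwaveid',);
        # they are flattened back to plain names in a later cell
        df.columns = [cols_original]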
        df['wave'] = letter
        
        # Append data
        out = pd.concat([out, df], ignore_index = True)

    return out
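
Each wave file holds thousands of columns (4,480 in the sample above), while only the variables listed in the metadata are needed. As an optional variation, not the approach used in this notebook, the sketch below asks `pd.read_csv` to load only the wanted columns by passing a callable to `usecols`. The function name `read_wave_subset` and the toy csv content are assumptions for illustration.

```python
import io
import pandas as pd


def read_wave_subset(source, letter, variables):
    """Read one wave file, keeping only xwaveid and the requested variables.

    source    - file path or file-like object accepted by pd.read_csv
    letter    - wave letter, e.g. 'c'
    variables - un-prefixed variable names, e.g. ['hgsex', 'hhiage']
    """
    wanted = {"xwaveid"} | {letter + v for v in variables}
    # A callable lets read_csv silently skip variables missing from this wave
    df = pd.read_csv(source, usecols=lambda c: c in wanted, low_memory=False)
    # Strip the wave prefix from everything except the identifier
    df = df.rename(columns={c: c[1:] for c in df.columns if c != "xwaveid"})
    df["wave"] = letter
    return df


# Tiny in-memory stand-in for combined_c200c.csv
toy_csv = io.StringIO("xwaveid,chgsex,chhiage,cother\n100001,1,49,7\n100002,2,49,8\n")
print(read_wave_subset(toy_csv, "c", ["hgsex", "hhiage", "rtage"]))
```

Because the unwanted columns are never loaded, this can reduce memory use per wave, at the cost of a slightly less transparent read step.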

In [6]:
# Create long-form dataset
folder_path = "../data/raw_csv/"
waves_list = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't']
hilda_df = get_long_dataset(waves_list, folder_path, metadata)
hilda_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410658 entries, 0 to 410657
Columns: 120 entries, ('xwaveid',) to ('rtcamt',)
dtypes: float64(83), int64(36), object(1)
memory usage: 376.0+ MB


In [7]:
# Clean up the column names (take the single level of each MultiIndex column)
colnew = []
for col in hilda_df.columns:
    colnew.append(col[0])
hilda_df.columns = colnew

In [8]:
# Clean up the identifier by stripping stray b and ' characters (early waves have a letter prefix)
hilda_df['xwaveid'] = hilda_df['xwaveid'].astype('str')
hilda_df['xwaveid'] = hilda_df['xwaveid'].str.replace(r"[\'b]", "", regex=True)
hilda_df.head()

Unnamed: 0,xwaveid,hgsex,hhiage,anbcob,edhigh1,esempdt,jbhruc,jbmo61,jbmi61,rtage,...,gcecf,ledhm,lsinthm,wsfes,savaln2,rtpage,rtsup,rtlump,rtconv,rtcamt
0,100001,1,49,3,5,-1,-1.0,-1,-1,-1,...,,,,,,,,,,
1,100002,2,49,1,9,1,20.0,4,17,55,...,,,,,,,,,,
2,100003,1,49,1,9,5,25.0,7,9,60,...,,,,,,,,,,
3,100004,2,39,2,5,1,15.0,8,14,-1,...,,,,,,,,,,
4,100005,2,16,1,9,1,9.0,6,7,-1,...,,,,,,,,,,
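
To show what the identifier clean-up does, the short sketch below applies the same regular expression to two made-up values. The character class `[\'b]` matches every `b` and every single quote, so both are removed wherever they occur. The raw value `b'0100001'` is an assumed example of a prefixed identifier, not a value taken from the data.

```python
import pandas as pd

# Illustrative raw identifiers: one with a prefix and quotes, one already clean
ids = pd.Series(["b'0100001'", "100002"])

cleaned = ids.str.replace(r"[\'b]", "", regex=True)
print(cleaned.tolist())  # ['0100001', '100002']
```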


In [9]:
# Check the overall counts
hilda_df.groupby('wave')['xwaveid'].count()

wave
a    19914
b    18295
c    17690
d    17209
e    17467
f    17453
g    17280
h    17144
i    17632
j    17855
k    23415
l    23182
m    23299
n    23114
o    23305
p    23507
q    23442
r    23267
s    23256
t    22932
Name: xwaveid, dtype: int64

In [10]:
# Check age
hilda_df.groupby('wave')['hhiage'].agg(['count', 'min', 'max','mean'])

Unnamed: 0_level_0,count,min,max,mean
wave,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a,19914,-10,100,27.621422
b,18295,-10,97,28.422957
c,17690,-10,98,28.853477
d,17209,-10,99,29.036318
e,17467,-10,100,29.509361
f,17453,-10,99,29.996047
g,17280,-10,100,30.104456
h,17144,-10,101,30.545439
i,17632,-10,98,30.826225
j,17855,-10,98,30.96718


In [11]:
# Save output to csv
hilda_df.to_csv('../data/processed/hilda_combined.csv', index=False)

<br>
<br> 

### Step 4. Apply Data Dictionary

The full description of survey responses is provided in the [HILDA Online Data Dictionary](https://hildaodd.app.unimelb.edu.au/srchVarnameUsingCategoriesCrossWave.aspx).

Missing survey responses are coded as negative numbers.
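
As a minimal sketch of that convention, the helper below turns every negative code into NaN and leaves other values unchanged. The name `negatives_to_nan` is hypothetical. Most cells in this notebook use the test `x > 0` instead, which also treats 0 as missing, and a few variables give -1 a substantive meaning, so the rule needs checking per variable.

```python
import pandas as pd


def negatives_to_nan(series):
    """Replace negative (missing-response) codes with NaN, keep the rest."""
    return series.where(series >= 0)


raw = pd.Series([49, -1, 0, -10, 72])
print(negatives_to_nan(raw).tolist())  # [49.0, nan, 0.0, nan, 72.0]
```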

#### Columns that are expected to have no missing values

In [12]:
# Create a new dataframe, containing cross-wave person ID and wave letter
df = hilda_df[['xwaveid', 'wave']].copy() 

In [13]:
# Add column: year of survey
df['year'] = df['wave'].map({
    'a':	2001,
    'b':	2002,
    'c':	2003,
    'd':	2004,
    'e':	2005,
    'f':	2006,
    'g':	2007,
    'h':	2008,
    'i':	2009,
    'j':	2010,
    'k':	2011,
    'l':	2012,
    'm':	2013,
    'n':	2014,
    'o':	2015,
    'p':	2016,
    'q':	2017,
    'r':	2018,
    's':	2019,
    't':	2020})

<br>
<br>

#### Background, Education, Family

In [14]:
# Add column: age at the date of interview
df['age'] = hilda_df['hhiage'].apply(lambda x: x if x > 0 else np.nan)

# Add columns: whether a person has reached a milestone age (60, 65, 67, 70, 75)
df['reached_60'] = df['age'].apply(lambda x: 1 if x >= 60 else (0 if x > 0 else np.nan))
df['reached_65'] = df['age'].apply(lambda x: 1 if x >= 65 else (0 if x > 0 else np.nan))
df['reached_67'] = df['age'].apply(lambda x: 1 if x >= 67 else (0 if x > 0 else np.nan))
df['reached_70'] = df['age'].apply(lambda x: 1 if x >= 70 else (0 if x > 0 else np.nan))
df['reached_75'] = df['age'].apply(lambda x: 1 if x >= 75 else (0 if x > 0 else np.nan))
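
The five milestone flags follow one pattern, so they could also be built in a loop. The sketch below is an equivalent vectorised variant under the assumption that `age` is already NaN wherever it is missing; the function name `add_milestone_flags` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_milestone_flags(frame, milestones=(60, 65, 67, 70, 75)):
    """Add reached_<age> columns: 1.0 / 0.0, or NaN where age is missing."""
    out = frame.copy()
    for m in milestones:
        flag = (out["age"] >= m).astype(float)
        # Comparisons with NaN give False, so restore NaN explicitly
        out[f"reached_{m}"] = flag.where(out["age"].notna())
    return out


toy = pd.DataFrame({"age": [59.0, 65.0, np.nan]})
print(add_milestone_flags(toy)[["reached_60", "reached_65"]])
```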

In [15]:
# Add column: age retired or intends to retire
# Responses over 990 mean no intention to retire
df['retirement_age'] = hilda_df['rtage'].apply(
    lambda x: 100 if x > 990 
    else (x if x > 0 else np.nan))

In [16]:
# Add column: flag if respondent is male
df['male'] = hilda_df['hgsex'].map({1:1,  # Male
                                    2:0}) # Female

In [17]:
# Add columns: born overseas
# Main English speaking countries are United Kingdom, New Zealand, Canada, USA, Ireland and South Africa
df['born_overseas_english'] = hilda_df['anbcob'].map({1:0,  # born in Australia
                                                      2:1,  # born in a main English speaking country
                                                      3:0}) # born in other country

df['born_overseas_nonenglish'] = hilda_df['anbcob'].map({1:0,  # born in Australia
                                                         2:0,  # born in a main English speaking country
                                                         3:1}) # born in other country

In [18]:
# Add column: English ability
df['english_limited'] = hilda_df['aneab'].map({-1:0,   # Native speaker
                                                1:0,   # Very well
                                                2:0,   # Well
                                                3:1,   # Not well
                                                4:1 }) # Not at all

In [19]:
# Add columns: highest level of education achieved
df['education_tertiary'] = hilda_df['edhigh1'].map({
                                      1: 1, # Postgraduate
                                      2: 1, # Graduate
                                      3: 1, # Bachelor or honours
                                      4: 1, # Adv Diploma or Diploma
                                      5: 0, # Certificate III/IV
                                      8: 0, # Year 12
                                      9: 0, # Year 11 and below
                                      10: 0}) #Undetermined


df['education_year11_or_below'] = hilda_df['edhigh1'].map({
                                      1: 0, # Postgraduate
                                      2: 0, # Graduate
                                      3: 0, # Bachelor or honours
                                      4: 0, # Adv Diploma or Diploma
                                      5: 0, # Certificate III/IV
                                      8: 0, # Year 12
                                      9: 1, # Year 11 and below
                                      10: 1}) #Undetermined

In [20]:
# Add columns: marital status
df['married_defacto'] = hilda_df['mrcms'].map({
                                         1:1,  # Married 
                                         2:0,  # Separated
                                         3:0,  # Divorced
                                         4:0,  # Widowed
                                         5:1,  # Never married,in relationship
                                         6:0}) # Never married

df['divorced_separated'] = hilda_df['mrcms'].map({ 
                                         1:0,  # Married 
                                         2:1,  # Separated
                                         3:1,  # Divorced
                                         4:0,  # Widowed
                                         5:0,  # Never married,in relationship
                                         6:0}) # Never married

df['widowed'] = hilda_df['mrcms'].map({
                                 1:0,  # Married 
                                 2:0,  # Separated
                                 3:0,  # Divorced
                                 4:1,  # Widowed
                                 5:0,  # Never married,in relationship
                                 6:0}) # Never married

In [21]:
# Add columns: household type
# Reference: households with other related or non-related
df['hh_lone_person'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x == 24 
    else (0 if x > 0 else np.nan))

df['hh_empty_nesters'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x == 1 
    else (0 if x > 0 else np.nan))

df['hh_couple_with_children'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x in (4,7,10) 
    else (0 if x > 0 else np.nan))

df['hh_lone_parent_with_children'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x in (13,16,19) 
    else (0 if x > 0 else np.nan))

# Add columns: resident children
df['hh_dependent_children'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x in (4,5,6,7,8,9,13,14,15,16,17,18) 
    else (0 if x > 0 else np.nan))

df['hh_independent_children'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x in (10,11,12,19,20,21) 
    else (0 if x > 0 else np.nan))

# Add column: other related people in the household
df['hh_other_related'] = hilda_df['hhtype'].apply(
    lambda x: 1 if x in (11,14,17,20,2,5,8,22,23) 
    else (0 if x > 0 else np.nan))

<br>
<br>

#### Health, Caring for Others

In [22]:
# Add column: self-assessed health
# 1=Excellent, 2=Very good, 3=Good, 4=Fair, 5=Poor

df['self_assessed_health_1-5'] = hilda_df['gh1'].apply(lambda x: x if x > 0 else np.nan)

df['long_term_health_condition'] = hilda_df['helth'].map({1:1,  # 1=Yes, 
                                                          2:0}) # 2=No

df['health_condition_limits_work'] = hilda_df['helthwk'].map({1: 4,  # Limit type or amount of work ~ fair health
                                                              2: 3,  # No impact ~ good health
                                                              3: 5}) # Not capable of working ~ poor health

# Use self-assessed health where available, otherwise fall back to the work-limitation proxy
def impute_health(row):
    if row['self_assessed_health_1-5'] > 0:
        return row['self_assessed_health_1-5']
    else: 
        return row['health_condition_limits_work']

df['self_assessed_health_imputed'] = df.apply(impute_health, axis=1)
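
The row-wise imputation above can also be expressed with `Series.fillna`, which tends to be faster on a dataframe of this size. This is a sketch on toy data, assuming the self-assessed column is already NaN wherever the response is missing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "self_assessed_health_1-5":     [2.0, np.nan, np.nan],
    "health_condition_limits_work": [4.0, 5.0, np.nan],
})

# Take the self-assessed score where present, otherwise the proxy score
toy["self_assessed_health_imputed"] = (
    toy["self_assessed_health_1-5"].fillna(toy["health_condition_limits_work"])
)
print(toy["self_assessed_health_imputed"].tolist())  # [2.0, 5.0, nan]
```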

In [23]:
# Add columns: grandchildren care
df['grandchildren_care_often'] = hilda_df['gcecf'].apply(
    lambda x: 1 if x in (1,2,3) 
    else (0 if x >= -1 else np.nan)) #-1 not asked

df['grandchildren_care_occasionally'] = hilda_df['gcecf'].apply(
    lambda x: 1 if x in (4,5,6,7) 
    else (0 if x >= -1 else np.nan)) #-1 not asked

In [24]:
# Add column: hours per week spent caring for others
df['hours_caring_per_week'] = hilda_df['lshrcar'].apply(
    lambda x: x if x >= 0 else np.nan)

In [25]:
# Flag main carer
df['main_carer_res'] = hilda_df['hencam'].map({
                                 1:1, # main_carer
                                 2:0, # share care
                                -1:0})  # not applicable

df['main_carer_nonres'] = hilda_df['hercam'].map({
                                 1:1, # main_carer
                                 2:0, # share care
                                -1:0})  # not applicable

def impute_main_carer(row):
    if row['main_carer_res'] == 1:
        return 1
    elif row['main_carer_nonres'] == 1:
        return 1
    else: 
        return 0

df['flag_main_carer'] = df.apply(impute_main_carer, axis=1)

<br>
<br>

#### Housing

In [26]:
# Add columns: housing
df['housing_owned'] = hilda_df['hstenr'].map({ 
                                        1:1,  # Own/currently paying off mortgage
                                        2:0,  # Rent (or pay board)
                                        3:0,  # Involved in a rent-buy scheme
                                        4:0}) # Live here rent free / Life Tenure

df['housing_rent'] = hilda_df['hstenr'].map({ 
                                        1:0,  # Own/currently paying off mortgage
                                        2:1,  # Rent (or pay board)
                                        3:1,  # Involved in a rent-buy scheme
                                        4:0}) # Live here rent free / Life Tenure

df['housing_free'] = hilda_df['hstenr'].map({ 
                                        1:0,  # Own/currently paying off mortgage
                                        2:0,  # Rent (or pay board)
                                        3:0,  # Involved in a rent-buy scheme
                                        4:1}) # Live here rent free / Life Tenure

In [27]:
# Add column: mortgage paid off completely
df['home_loan_paid_off'] = hilda_df['hsmgpd'].map({1:1,  # Yes 
                                                   2:0}) # No

In [28]:
# Add column: own house, mortgage-free
def mortgage_free(row):
    if (row['housing_owned'] == 1) and (row['home_loan_paid_off']==1):
        return 1
    elif pd.isna(row['housing_owned']):
        return np.nan
    elif pd.isna(row['home_loan_paid_off']):
        return np.nan
    else: 
        return 0
df['own_house_no_mortgage'] = df.apply(mortgage_free, axis=1)

# Add column: own house with mortgage
def mortgage(row):
    if (row['housing_owned'] == 1) and (row['home_loan_paid_off']==0):
        return 1
    elif pd.isna(row['housing_owned']):
        return np.nan
    elif pd.isna(row['home_loan_paid_off']):
        return np.nan
    else: 
        return 0
df['own_house_with_mortgage'] = df.apply(mortgage, axis=1)

In [29]:
# Add column: socio-economic disadvantage (1) to advantage (10)
df['area_seifa_irsad'] = hilda_df['hhsad10'].apply(lambda x: x if x > 0 else np.nan)

In [30]:
# Add columns - remoteness
df['remoteness_major_cities']= hilda_df['hhsra'].map({
                                   0:1,  # Major Cities of Australia
                                   1:0,  # Inner Regional Australia
                                   2:0,  # Outer Regional Australia
                                   3:0,  # Remote Australia
                                   4:0,  # Very Remote Australia
                                   5:0,  # Migratory - Offshore - Shipping (NSW)
                                   9:0}) # No usual address (NSW)

df['remoteness_remote']= hilda_df['hhsra'].map({
                                   0:0,  # Major Cities of Australia
                                   1:0,  # Inner Regional Australia
                                   2:0,  # Outer Regional Australia
                                   3:1,  # Remote Australia
                                   4:1,  # Very Remote Australia
                                   5:0,  # Migratory - Offshore - Shipping (NSW)
                                   9:0}) # No usual address (NSW)
                            

In [31]:
df['home_internet'] = hilda_df['lsinthm'].apply(
    lambda x: 1 if x == 1 else (0 if x > 0 else np.nan))

<br>
<br>

#### Employment

In [32]:
df['paid_work'] = hilda_df['esempdt'].map({   1:1, # Employee
                                              2:1, # Own business with employees
                                              3:1, # Own business without employees
                                              4:1, # Employer/Self-employed with employees
                                              5:1, # Employer/Self-employed without employees
                                              6:0}) # Unpaid family worker

In [33]:
df['work_hours'] = hilda_df['jbhruc'].apply(lambda x: x if x > 0 else np.nan)

In [34]:
# Add column: occupation group
df['occupation_group'] = hilda_df['jbmo61'].map({
                                              1: 'Managers',
                                              2: 'Professionals',
                                              3: 'Technicians and Trades Workers',
                                              4: 'Community and Personal Service Workers',
                                              5: 'Clerical and Administrative Workers',
                                              6: 'Sales Workers',
                                              7: 'Machinery Operators and Drivers',
                                              8: 'Labourers'})

In [35]:
df['industry'] = hilda_df['jbmi61'].map({
                                       1  :'Agriculture, Forestry and Fishing',
                                       2  :'Mining',
                                       3  :'Manufacturing',
                                       4  :'Electricity, Gas, Water and Waste Services',
                                       5  :'Construction',
                                       6  :'Wholesale Trade',
                                       7  :'Retail Trade',
                                       8  :'Accommodation and Food Services',
                                       9  :'Transport, Postal and Warehousing',
                                       10 :'Information Media and Telecommunications',
                                       11 :'Financial and Insurance Services',
                                       12 :'Rental, Hiring and Real Estate Services',
                                       13 :'Professional, Scientific and Technical Services',
                                       14 :'Administrative and Support Services',
                                       15 :'Public Administration and Safety',
                                       16 :'Education and Training',
                                       17 :'Health Care and Social Assistance',
                                       18 :'Arts and Recreation Services',
                                       19 :'Other Services'})

In [36]:
# Add columns: retirement status
df['retired_completely'] = hilda_df['rtstat'].map({
                               -1:0,  # Not asked
                                1:1,  # Completely retired
                                2:0,  # Partly retired
                                3:0,  # Not retired
                                4:0}) # Have never been in paid work

df['retired_partially'] = hilda_df['rtstat'].map({
                               -1:0,  # Not asked
                                1:0,  # Completely retired
                                2:1,  # Partly retired
                                3:0,  # Not retired
                                4:0}) # Have never been in paid work

df['not_retired'] = hilda_df['rtstat'].map({
                               -1:0,  # Not asked
                                1:0,  # Completely retired
                                2:0,  # Partly retired
                                3:1,  # Not retired
                                4:0}) # Have never been in paid work

<br>
<br>

#### Major Life Events

In [37]:
# Add columns: major life event
mapping = {1:0, # No
           2:1} # Yes

df['le_death_close_friend']         = hilda_df['ledfr'].map(mapping)
df['le_death_close_relative']       = hilda_df['ledrl'].map(mapping)
df['le_death_partner_child']        = hilda_df['ledsc'].map(mapping)
df['le_major_improvement_finances'] = hilda_df['lefni'].map(mapping)
df['le_major_worsening_finances']   = hilda_df['lefnw'].map(mapping)
df['le_fired_redundant']            = hilda_df['lefrd'].map(mapping)
df['le_illness_family']             = hilda_df['leinf'].map(mapping)
df['le_illness_personal']           = hilda_df['leins'].map(mapping)
df['le_changed_job']                = hilda_df['lejob'].map(mapping)
df['le_changed_residence']          = hilda_df['lemvd'].map(mapping)
df['le_retired']                    = hilda_df['lertr'].map(mapping)
df['le_separated_from_spouse']      = hilda_df['lesep'].map(mapping)
df['le_back_together_spouse']       = hilda_df['lercl'].map(mapping)
df['le_weather_disaster']           = hilda_df['ledhm'].map(mapping)
df['le_got_married']                = hilda_df['lemar'].map(mapping)
df['le_promoted']                   = hilda_df['leprm'].map(mapping)

<br>
<br>

#### Wealth

In [38]:
# Add columns: wealth
df['super_value'] = hilda_df['saest'].apply(lambda x: x if x > 0 else np.nan)
df['super_retired'] = hilda_df['pwsupri'].apply(lambda x: x if x > 0 else np.nan)
df['super_notretired'] = hilda_df['pwsupwi'].apply(lambda x: x if x > 0 else np.nan)

df['home_value'] = hilda_df['hsvalui'].apply(lambda x: x if x > 0 else np.nan)
df['home_debt'] = hilda_df['hsdebti'].apply(lambda x: x if x > 0 else np.nan)

# Value comes from 'opvalue' and debt from 'opdt' (the two were swapped)
df['otherprop_value'] = hilda_df['opvalue'].apply(lambda x: x if x > 0 else np.nan)
df['otherprop_debt'] = hilda_df['opdt'].apply(lambda x: x if x > 0 else np.nan)

df['otherprop_count'] = hilda_df['opnum'].apply(lambda x: x if x > 0 else np.nan)
df['otherprop_count_earned_rent'] = hilda_df['oprntn'].apply(lambda x: x if x > 0 else np.nan)

In [39]:
def impute_super(row):
    if row['super_value'] > 0:
        return row['super_value']

    elif row['super_retired'] > 0:
        return row['super_retired']

    elif row['super_notretired'] > 0:
        return row['super_notretired']
    
    else: 
        return np.nan

df['super_imputed'] = df.apply(impute_super, axis = 1)
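
The same priority rule (take `super_value`, else `super_retired`, else `super_notretired`) can be written without a row-wise function. The sketch below back-fills across the three columns in priority order and keeps the first one; the toy values are made up, and it assumes non-positive values have already been set to NaN as in the previous cell.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "super_value":      [1000.0, np.nan, np.nan, np.nan],
    "super_retired":    [np.nan, 2000.0, np.nan, np.nan],
    "super_notretired": [500.0, 300.0, 3000.0, np.nan],
})

# Columns listed from highest to lowest priority
priority = ["super_value", "super_retired", "super_notretired"]

# bfill(axis=1) pulls the first non-missing value in each row into the first column
toy["super_imputed"] = toy[priority].bfill(axis=1).iloc[:, 0]
print(toy["super_imputed"].tolist())  # [1000.0, 2000.0, 3000.0, nan]
```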

<br>
<br>

#### View dataframe information

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410658 entries, 0 to 410657
Data columns (total 79 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   xwaveid                          410658 non-null  object 
 1   wave                             410658 non-null  object 
 2   year                             410658 non-null  int64  
 3   age                              305143 non-null  float64
 4   reached_60                       305143 non-null  float64
 5   reached_65                       305143 non-null  float64
 6   reached_67                       305143 non-null  float64
 7   reached_70                       305143 non-null  float64
 8   reached_75                       305143 non-null  float64
 9   retirement_age                   113069 non-null  float64
 10  male                             410658 non-null  int64  
 11  born_overseas_english            305061 non-null  float64
 12  bo

<br>
<br>

#### Columns that need to be filled forward

In [41]:
# Sort dataframe
df = df.sort_values(by = ['xwaveid', 'year'])

In [42]:
# Occupation and industry questions are only asked if a person is still working.
# For analysis, we need to forward-fill the latest record within each person
df['occupation_group_latest'] = (
    df.groupby('xwaveid')['occupation_group']
      .ffill()) # carry the last non-missing value forward

df['industry_latest'] = (
    df.groupby('xwaveid')['industry']
      .ffill()) # carry the last non-missing value forward
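
The toy example below illustrates why the forward fill is done per person after sorting: a value is carried forward only within the same `xwaveid`, so a respondent with no earlier record keeps a missing value. The data are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "xwaveid":  ["100001", "100001", "100001", "100002", "100002"],
    "year":     [2001, 2002, 2003, 2001, 2002],
    "industry": ["Mining", np.nan, np.nan, np.nan, "Retail Trade"],
})

# Sort so that 'forward' means forward in time within each person
toy = toy.sort_values(["xwaveid", "year"])
toy["industry_latest"] = toy.groupby("xwaveid")["industry"].ffill()

print(toy["industry_latest"].tolist())
# ['Mining', 'Mining', 'Mining', nan, 'Retail Trade']
```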


In [43]:
# Add columns: occupation
df['occupation_managers'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x == 'Managers' 
    else (np.nan if pd.isna(x) else 0)) 

df['occupation_professionals'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x == 'Professionals' 
    else (np.nan if pd.isna(x) else 0))

df['occupation_tech_trade'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x == 'Technicians and Trades Workers' 
    else (np.nan if pd.isna(x) else 0))

df['occupation_care_service'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x == 'Community and Personal Service Workers'
    else (np.nan if pd.isna(x) else 0))

df['occupation_admin_sales'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x in ('Clerical and Administrative Workers', 'Sales Workers')
    else (np.nan if pd.isna(x) else 0))

df['occupation_labourers_drivers'] = df['occupation_group_latest'].apply(
    lambda x: 1 if x in ('Labourers', 'Machinery Operators and Drivers') 
    else (np.nan if pd.isna(x) else 0))


In [44]:
# Add columns: industry
df['industry_public_sector'] = df['industry_latest'].apply(
    lambda x: 1 if x == 'Public Administration and Safety' 
    else (np.nan if pd.isna(x) else 0))

df['industry_health_social_care'] = df['industry_latest'].apply(
    lambda x: 1 if x == 'Health Care and Social Assistance' 
    else (np.nan if pd.isna(x) else 0))


In [45]:
# Wealth-related questions are asked once every 4 years 
# (1 year before the retirement questions are asked)

df['super_value_latest'] = (
    df.groupby('xwaveid')['super_value']
      .ffill()) # carry the last non‑missing value forward

df['home_value_latest'] = (
    df.groupby('xwaveid')['home_value']
      .ffill()) # carry the last non‑missing value forward

df['home_debt_latest'] = (
    df.groupby('xwaveid')['home_debt']
      .ffill()) # carry the last non‑missing value forward

df['otherprop_value_latest'] = (
    df.groupby('xwaveid')['otherprop_value']
      .ffill()) # carry the last non‑missing value forward

df['otherprop_debt_latest'] = (
    df.groupby('xwaveid')['otherprop_debt']
      .ffill()) # carry the last non‑missing value forward

df['otherprop_count_latest'] = (
    df.groupby('xwaveid')['otherprop_count']
      .ffill()) # carry the last non‑missing value forward

df['otherprop_count_earned_rent_latest'] = (
    df.groupby('xwaveid')['otherprop_count_earned_rent']
      .ffill()) # carry the last non‑missing value forward
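
Since every wealth column gets the same treatment, the fill can also be sketched as a single call over a list of columns. The example uses made-up numbers and only two of the columns; `wealth_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy panel (made-up numbers); wealth is observed only in some waves
toy = pd.DataFrame({
    'xwaveid': [1, 1, 1, 2, 2],
    'year': [2002, 2003, 2004, 2002, 2003],
    'super_value': [100.0, np.nan, np.nan, np.nan, 50.0],
    'home_value': [500.0, np.nan, 550.0, np.nan, np.nan],
}).sort_values(by=['xwaveid', 'year'])

wealth_cols = ['super_value', 'home_value']

# Forward-fill all listed columns within each person, then rename with a suffix
filled = toy.groupby('xwaveid')[wealth_cols].ffill().add_suffix('_latest')
toy = toy.join(filled)

print(toy)
```

Note that an unlimited forward fill carries a value across any number of waves. If that is a concern, the groupby `ffill` accepts a `limit` argument that caps how many consecutive rows are filled.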

<br>
<br> 

### Step 5. Save output to csv

In [46]:
# Filter dataframe to only include respondents aged 45 and over
df = df[df['age']>= 45].copy()

In [47]:
cols1 = ['xwaveid', 'wave', 'year', 
         'age', 
         'reached_60', 'reached_65','reached_67', 'reached_70', 'reached_75', 
         'retirement_age', 
         'male',
         'born_overseas_english', 'born_overseas_nonenglish', 
         'english_limited',
         'education_tertiary', 'education_year11_or_below', 
         'married_defacto', 'divorced_separated', 'widowed', 
         'hh_lone_person', 'hh_empty_nesters','hh_couple_with_children', 'hh_lone_parent_with_children',
         'hh_dependent_children', 'hh_independent_children', 'hh_other_related',
         
         'self_assessed_health_imputed',
         'grandchildren_care_often', 'grandchildren_care_occasionally',
         'hours_caring_per_week',
         'flag_main_carer']

cols2 = ['housing_owned', 'housing_rent', 'housing_free',
         'home_loan_paid_off', 
         'own_house_no_mortgage', 'own_house_with_mortgage', 
         'area_seifa_irsad',
         'remoteness_major_cities', 'remoteness_remote', 
         'home_internet',
         'paid_work', 
         'work_hours', 
         'retired_completely', 'retired_partially', 'not_retired',

         'occupation_managers','occupation_professionals', 'occupation_tech_trade',
         'occupation_care_service', 'occupation_admin_sales',
         'occupation_labourers_drivers', 
         
         'industry_public_sector', 'industry_health_social_care']

cols3 = ['le_death_close_friend', 'le_death_close_relative',
       'le_death_partner_child', 'le_major_improvement_finances',
       'le_major_worsening_finances', 'le_fired_redundant',
       'le_illness_family', 'le_illness_personal', 'le_changed_job',
       'le_changed_residence', 'le_retired', 'le_separated_from_spouse',
       'le_back_together_spouse', 'le_weather_disaster', 'le_got_married',
       'le_promoted',
         
       'super_value_latest',
       'home_value_latest', 'home_debt_latest', 
       'otherprop_value_latest','otherprop_debt_latest', 
       'otherprop_count_latest','otherprop_count_earned_rent_latest',

      ]
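
With three hand-written lists, a misspelled or repeated name is easy to miss: a missing name makes `df[cols]` raise a `KeyError`, and a repeated one ends up twice in the output file. The check below is a hypothetical sketch (the function `check_columns` is not part of the notebook), shown on a toy frame.

```python
import pandas as pd

def check_columns(frame, *column_lists):
    """Return (missing, duplicated) column names across the given lists."""
    requested = [c for cols in column_lists for c in cols]
    missing = [c for c in requested if c not in frame.columns]
    duplicated = sorted({c for c in requested if requested.count(c) > 1})
    return missing, duplicated

# Toy example with a handful of column names
toy = pd.DataFrame(columns=['xwaveid', 'year', 'age'])
missing, duplicated = check_columns(toy, ['xwaveid', 'year'], ['age', 'year', 'male'])

print(missing)     # ['male']
print(duplicated)  # ['year']
```

In the notebook this would be called as `check_columns(df, cols1, cols2, cols3)`, with both returned lists expected to be empty.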

In [48]:
df[cols1].describe().transpose()

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|----------|-------|------|-----|-----|-----|-----|-----|-----|
| year | 146156.0 | 2011.409631 | 5.692465 | 2001.0 | 2007.0 | 2012.0 | 2016.0 | 2020.0 |
| age | 146156.0 | 61.224459 | 11.568008 | 45.0 | 52.0 | 59.0 | 69.0 | 101.0 |
| reached_60 | 146156.0 | 0.495833 | 0.499984 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| reached_65 | 146156.0 | 0.361709 | 0.480497 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| reached_67 | 146156.0 | 0.313466 | 0.463904 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| reached_70 | 146156.0 | 0.247126 | 0.431342 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| reached_75 | 146156.0 | 0.153808 | 0.360766 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| retirement_age | 113044.0 | 62.676595 | 13.6727 | 15.0 | 58.0 | 63.0 | 65.0 | 110.0 |
| male | 146156.0 | 0.465414 | 0.498804 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| born_overseas_english | 146117.0 | 0.134146 | 0.34081 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [49]:
df[cols2].describe().transpose()

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|----------|-------|------|-----|-----|-----|-----|-----|-----|
| housing_owned | 139778.0 | 0.800519 | 0.399611 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| housing_rent | 139778.0 | 0.173747 | 0.378893 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| housing_free | 139778.0 | 0.025734 | 0.15834 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| home_loan_paid_off | 77145.0 | 0.498192 | 0.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| own_house_no_mortgage | 74013.0 | 0.49313 | 0.499956 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| own_house_with_mortgage | 74013.0 | 0.506641 | 0.499959 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| area_seifa_irsad | 146140.0 | 5.481963 | 2.901609 | 1.0 | 3.0 | 6.0 | 8.0 | 10.0 |
| remoteness_major_cities | 146140.0 | 0.624285 | 0.484309 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| remoteness_remote | 146140.0 | 0.016505 | 0.127407 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| home_internet | 91367.0 | 0.841245 | 0.36545 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |


In [50]:
df[cols3].describe().transpose()

| Variable | count | mean | std | min | 25% | 50% | 75% | max |
|----------|-------|------|-----|-----|-----|-----|-----|-----|
| le_death_close_friend | 127581.0 | 0.157406 | 0.364184 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_death_close_relative | 127616.0 | 0.125752 | 0.331571 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_death_partner_child | 127512.0 | 0.01316 | 0.113958 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_major_improvement_finances | 127672.0 | 0.033234 | 0.179247 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_major_worsening_finances | 127669.0 | 0.030328 | 0.17149 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_fired_redundant | 127545.0 | 0.023521 | 0.151552 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_illness_family | 127342.0 | 0.171797 | 0.377206 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_illness_personal | 127503.0 | 0.114687 | 0.318646 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_changed_job | 127553.0 | 0.058242 | 0.234202 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| le_changed_residence | 127744.0 | 0.082415 | 0.274997 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


In [51]:
# Save output to csv
df[cols1 + cols2 + cols3].to_csv('../data/processed/hilda_transformed.csv', index=False)
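
Two practical points about the save step can be sketched on toy data: `to_csv` does not create missing folders, and reading the file back is a cheap way to confirm the shape and columns survived. The example writes to a temporary directory and a made-up file name rather than the project path.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy frame (made-up values, not HILDA data)
toy = pd.DataFrame({
    'xwaveid': [1, 2],
    'age': [45, 60],
    'retirement_age': [np.nan, 58.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, 'data', 'processed')
    os.makedirs(out_dir, exist_ok=True)  # to_csv does not create missing folders
    path = os.path.join(out_dir, 'toy_transformed.csv')
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Round-trip checks: same shape and the same column order
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```

If `xwaveid` is stored as a string with leading zeros, passing `dtype={'xwaveid': str}` to `pd.read_csv` keeps it from being parsed as an integer on reload.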

In [52]:
%load_ext watermark
%watermark -n -u -v -iv

Last updated: Sat Apr 19 2025

Python implementation: CPython
Python version       : 3.12.8
IPython version      : 8.31.0

pandas: 2.2.2
numpy : 1.26.4



<br>
<br>