# Retirement in Australia
## Data Pre-Processing

This project uses unit record data from the Household, Income and Labour Dynamics in Australia (HILDA) Survey. The HILDA Survey was initiated and is funded by the Australian Government Department of Social Services (DSS) and is managed by the Melbourne Institute of Applied Economic and Social Research (Melbourne Institute). The findings and views reported in this project, however, are those of the author and should not be attributed to the Australian Government, DSS or the Melbourne Institute. DOI: 10.26193/R4IN30.

<br>
<br>

### Step 1. Convert SAS files to .csv format

The data is sourced from the Australian Data Archive Dataverse system (ADA Dataverse): [HILDA General Release 22, Waves 1-22](https://dataverse.ada.edu.au/dataset.xhtml?persistentId=doi:10.26193/R4IN30)

To access the data, you need to submit an application through ADA Dataverse and have it approved. The data files are available for download in SAS, Stata or SPSS format. This project uses the .sas7bdat files. There are 20 files, coded with the letters *a* to *t*, each corresponding to one survey wave for the years 2001-2020:

| File Name               | Description                     |
|-------------------------|---------------------------------|
| combined_a200c.sas7bdat | Wave A - Combined data for 2001 |
| combined_b200c.sas7bdat | Wave B - Combined data for 2002 |
| combined_c200c.sas7bdat | Wave C - Combined data for 2003 |
| combined_d200c.sas7bdat | Wave D - Combined data for 2004 |
| combined_e200c.sas7bdat | Wave E - Combined data for 2005 |
| combined_f200c.sas7bdat | Wave F - Combined data for 2006 |
| combined_g200c.sas7bdat | Wave G - Combined data for 2007 |
| combined_h200c.sas7bdat | Wave H - Combined data for 2008 |
| combined_i200c.sas7bdat | Wave I - Combined data for 2009 |
| combined_j200c.sas7bdat | Wave J - Combined data for 2010 |
| combined_k200c.sas7bdat | Wave K - Combined data for 2011 |
| combined_l200c.sas7bdat | Wave L - Combined data for 2012 |
| combined_m200c.sas7bdat | Wave M - Combined data for 2013 |
| combined_n200c.sas7bdat | Wave N - Combined data for 2014 |
| combined_o200c.sas7bdat | Wave O - Combined data for 2015 |
| combined_p200c.sas7bdat | Wave P - Combined data for 2016 |
| combined_q200c.sas7bdat | Wave Q - Combined data for 2017 |
| combined_r200c.sas7bdat | Wave R - Combined data for 2018 |
| combined_s200c.sas7bdat | Wave S - Combined data for 2019 |
| combined_t200c.sas7bdat | Wave T - Combined data for 2020 |


You do not need SAS software to read .sas7bdat files; you can open them directly in Python or R. The folder src in my GitHub repository contains two helper files:
* convert_sas_to_csv.py - converts SAS files to .csv format using *pandas* and *pyreadstat*.
* convert_sas_to_csv.R - converts SAS files to .csv format using *haven* from the *tidyverse* family.

Depending on your Python library versions, the conversion may raise errors starting from *combined_n200c.sas7bdat*. The R helper code takes longer to convert the SAS files, but it is robust to errors.

Once the conversion is done you will have 20 files in .csv format, with the same names as the original files but with the .csv extension.
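As a rough illustration of the conversion step (not the repository's actual helper script), the sketch below uses `pandas.read_sas`, which reads .sas7bdat files without SAS installed. The function names, the folder arguments and the `latin-1` encoding are assumptions made for this example; the real helper relies on *pyreadstat* and may differ.

```python
from pathlib import Path

import pandas as pd


def csv_path_for(sas_path, out_dir):
    """Return the .csv path that mirrors a .sas7bdat file name."""
    return Path(out_dir) / (Path(sas_path).stem + ".csv")


def convert_folder(in_dir, out_dir, encoding="latin-1"):
    """Convert every combined_*200c.sas7bdat file in in_dir to .csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sas_path in sorted(Path(in_dir).glob("combined_*200c.sas7bdat")):
        # Passing an encoding decodes text columns to str instead of bytes
        df = pd.read_sas(sas_path, format="sas7bdat", encoding=encoding)
        target = csv_path_for(sas_path, out_dir)
        # index=False avoids writing the row index as an extra unnamed column
        df.to_csv(target, index=False)
        written.append(target)
    return written
```

Calling `convert_folder("../data/raw_sas/", "../data/raw_csv/")` would then write one .csv per wave file; the input folder name here is a placeholder.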


In [1]:
import pandas as pd
import numpy as np

In [2]:
# Folder where you store hilda csv files
folder_path = "../data/raw_csv/"

# Sample file
filename = "combined_t200c.csv"

# Read in the sample file
df = pd.read_csv(folder_path + filename, low_memory=False)
print(df.shape)
display(df.head())

(22932, 5766)


Unnamed: 0.1,Unnamed: 0,xwaveid,thhrpid,thhpno,thhstate,xhhstrat,xhhraid,thhhqivw,thgdli1,thgdli2,...,tcvmg,tcvmgrs,tcvmgha,tcvmgwk,tcvmgmt,tcvmgat,thhinthq,thhhqlen,tnsctc,tnpctc
0,1,100003,19397101,1,4,84,143,03/09/2020,08/08/2019,-1/-1/ -1,...,-1,-1,-1,-1,-1,-1,10015,4,-1,-1
1,2,100005,21035101,1,6,84,143,12/09/2020,14/09/2019,-1/-1/ -1,...,-1,-1,-1,-1,-1,-1,6012,5,-1,-1
2,3,100006,19384101,1,4,84,143,22/09/2020,22/09/2019,22/09/2019,...,-1,-1,-1,-1,-1,-1,7012,9,0,43
3,4,100010,14826101,1,2,41,258,14/08/2020,14/08/2019,14/08/2019,...,2,-1,-1,-1,-1,-1,1002,6,-1,-1
4,5,100014,12138101,1,1,22,479,15/08/2020,06/08/2019,-1/-1/ -1,...,2,-1,-1,-1,-1,-1,13038,5,-1,-1


Each .csv file is structured so that rows represent respondents and columns represent responses to survey questions. The field *xwaveid* uniquely identifies a respondent.
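A quick way to confirm this structure is to check that *xwaveid* has no duplicates within a single wave file. The snippet below is a minimal sketch on a made-up three-row frame (the values are placeholders, not HILDA data); with the real file the same check would be run on `df`.

```python
import pandas as pd

# Toy stand-in for one wave file; values are illustrative only
sample = pd.DataFrame({
    "xwaveid": [100003, 100005, 100006],
    "thhstate": [4, 6, 4],
})

# True when every respondent appears exactly once in the wave
one_row_per_respondent = sample["xwaveid"].is_unique
print(one_row_per_respondent)
```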

<br>
<br> 

### Step 2. Select Survey Questions

The full data dictionary is maintained by the Melbourne Institute of Applied Economic and Social Research and is available here: [HILDA Online Data Dictionary](https://hildaodd.app.unimelb.edu.au/help_info.html).

For the project we only need to select the questions related to retirement.

In [3]:
# Metadata file
metadata = pd.read_csv("../data/metadata/hilda_metadata.csv")
display(metadata.head())

Unnamed: 0,Group,Variable,Description,1,2,3,4,5,6,7,...,13,14,15,16,17,18,19,20,21,22
0,Identifier,waveid,Cross wave ID,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
1,Background,hgsex,Sex,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
2,Background,hhiage,Age,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
3,Background,anbcob,Country of birth,1,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1
4,Background,aneab,How well speaks English,0,1,1,1,1,1,1,...,1,1,1,1,1,1,1,1,1,1


In [4]:
metadata.groupby('Group')['Variable'].count()

Group
Background                  5
Employment                  4
Family                      7
Health                      3
Housing                     8
Identifier                  1
Income                      5
Job Satisfaction            6
Life Satisfaction           6
Lifestyle                   1
Major Life Events          16
Retirement                  6
Retirement reasons         25
Wealth - Home               2
Wealth - Other Property     9
Wealth - Superannuation     8
Name: Variable, dtype: int64

The file *hilda_metadata.csv* lists the 112 variables we need from the HILDA dataset, grouped by category (Employment, Family, Health etc.). Some questions are only asked every four years. Columns *'1'* to *'22'* contain binary values (0 or 1), where 1 indicates that the question is available in that wave.
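To make the availability flags concrete, the sketch below picks the variables asked in a given wave from a small made-up metadata frame. The helper `variables_for_wave` and the toy rows are assumptions for illustration; only the column layout (`Variable` plus wave-number columns stored as strings) follows the file described above.

```python
import pandas as pd

# Toy metadata with the same layout as hilda_metadata.csv (three waves only)
toy_metadata = pd.DataFrame({
    "Group": ["Background", "Background", "Retirement"],
    "Variable": ["hgsex", "aneab", "rtage"],
    "1": [1, 0, 1],
    "2": [1, 1, 0],
    "3": [1, 1, 1],
})


def variables_for_wave(metadata, letter):
    """List variables flagged as available for a wave letter ('a' = wave 1)."""
    wave_number = str(ord(letter) - ord("a") + 1)
    return metadata.loc[metadata[wave_number] == 1, "Variable"].tolist()


print(variables_for_wave(toy_metadata, "a"))
print(variables_for_wave(toy_metadata, "b"))
```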

<br>
<br> 

### Step 3. Combine .csv files

In [5]:
# Define function to create a long-form dataset, combining multiple waves
def get_long_dataset(waves, folder, metadata):
    '''
    Parameters:
        waves - list of wave letters, i.e. ['a', 'b', ...]
        folder - location of network folder containing HILDA .csv files
        metadata - dataframe containing column 'Variable' from HILDA data dictionary
    Returns:
        creates a long form dataset at person and year granularity
    '''
    # Load column description
    dfcols = metadata.copy()

    # Create an empty dataframe
    out = pd.DataFrame()
    
    # Load data
    for letter in waves:
        filename = 'combined_' + letter + '200c.csv'
        hilda = pd.read_csv(folder + filename, low_memory=False)

        # Create lists of available columns 
        cols_temp = letter + dfcols['Variable']
        cols = ['xwaveid'] # list of columns with added wave letter
        cols_original = ['xwaveid'] # list of columns
        
        for col in cols_temp:
            if col in hilda.columns:
                cols.append(col)
                cols_original.append(col[1:])

        # Select wave data
        df = hilda[cols].copy()
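        # Note: wrapping the list in another list gives one-element tuple labels,
        # e.g. ('xwaveid',); they are flattened in the clean-up cell further down
        df.columns = [cols_original]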
        df['wave'] = letter
        
        # Append data
        out = pd.concat([out, df], ignore_index = True)

    return out

In [6]:
# Create long-form dataset
folder_path = "../data/raw_csv/"
waves_list = list('abcdefghijklmnopqrst')  # waves a to t (2001-2020)
hilda_df = get_long_dataset(waves_list, folder_path, metadata)
hilda_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 410658 entries, 0 to 410657
Columns: 113 entries, ('xwaveid',) to ('rtpage',)
dtypes: float64(111), object(2)
memory usage: 354.0+ MB
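The tuple-style labels in the summary above, such as `('xwaveid',)`, come from assigning a list wrapped in another list to `df.columns`, which pandas treats as a one-level MultiIndex. The toy frame below (made-up values) is a minimal sketch of that behaviour and of one way to flatten the labels again.

```python
import pandas as pd

toy = pd.DataFrame({"xwaveid": [100001], "ahgsex": [1]})

# A nested list is interpreted as MultiIndex levels, so labels become tuples
toy.columns = [["xwaveid", "hgsex"]]
tuple_labels = list(toy.columns)
print(tuple_labels)

# Taking level 0 restores plain string labels
toy.columns = toy.columns.get_level_values(0)
flat_labels = list(toy.columns)
print(flat_labels)
```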


In [7]:
# Clean up the column names: each label is a one-element tuple, e.g. ('xwaveid',)
hilda_df.columns = [col[0] for col in hilda_df.columns]

# Clean up the identifier: remove apostrophes and the letter b from the strings
hilda_df['xwaveid'] = hilda_df['xwaveid'].str.replace(r"[\'b]", "", regex=True)
hilda_df.head()

Unnamed: 0,xwaveid,hgsex,hhiage,anbcob,edhigh1,esempdt,jbhruc,jbmo61,jbmi61,rtage,...,oprntn,oprnty,opt2hnr,opt2hr,gcecf,ledhm,lsinthm,wsfes,savaln2,rtpage
0,100001,1.0,49.0,3.0,5.0,-1.0,-1.0,-1.0,-1.0,-1.0,...,,,,,,,,,,
1,100002,2.0,49.0,1.0,9.0,1.0,20.0,4.0,17.0,55.0,...,,,,,,,,,,
2,100003,1.0,49.0,1.0,9.0,5.0,25.0,7.0,9.0,60.0,...,,,,,,,,,,
3,100004,2.0,39.0,2.0,5.0,1.0,15.0,8.0,14.0,-1.0,...,,,,,,,,,,
4,100005,2.0,16.0,1.0,9.0,1.0,9.0,6.0,7.0,-1.0,...,,,,,,,,,,
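The identifier clean-up can be checked on a few hand-written strings. The sketch below assumes the raw values look like `b'100001'` (a bytes literal that ended up as text during conversion); that input shape is an assumption based on the regular expression used above, not something shown in the output.

```python
import pandas as pd

# Hypothetical raw identifiers as they might appear after the SAS conversion
raw_ids = pd.Series(["b'100001'", "b'100002'", "100003"])

# Same pattern as above: drop every apostrophe and every letter b
clean_ids = raw_ids.str.replace(r"[\'b]", "", regex=True)
print(clean_ids.tolist())
```

The pattern removes those characters anywhere in the string, which is safe here only because the identifiers consist of digits.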


In [8]:
# Save output to csv
hilda_df.to_csv('../data/processed/hilda_combined.csv', index=False)

<br>
<br> 

### Step 4. Apply Data Dictionary

In [9]:
# Create new data frame
num_cols = ['hhiage','rtage','rtcage','rtpage', 'hhsad10', 'hhsec10','hhsed10','tcr', 
            'jbhruc','wsfei','wsfes','bncpeni','hsdebti','hsvalui','saest','sacfnda',
            'pwsupri','pwsuprt','pwsupwi','pwsupwk','opdt','opnum','oprntn','opvalue',
            'gh1', 'lshrcar','jbmsall','jbmsflx','jbmshrs','jbmssec','jbmspay','jbmswrk',
            'losat','losatfs','losatyh','losatft','losateo','losatlc']
df = hilda_df[num_cols].copy()

# Rename columns
df = df.rename(columns={'hhiage'  : 'age',
                        'rtage'   : 'retirement_age',
                        'rtcage'  : 'age_completely_retired',
                        'rtpage'  : 'age_partially_retired',
                    
                        'hhsad10' :	'area_seifa_irsad',
                        'hhsec10' :	'area_seifa_ier',
                        'hhsed10' :	'area_seifa_ieo',

                        'tcr'     : 'children_resident',
                        'jbhruc'  : 'work_hours',
                        'wsfei'   : 'salary_wages',
                        'wsfes'   : 'salary_wages_incl_salsac',
                        'bncpeni' : 'weekly_aged_pension',
                        'hsdebti' : 'total_home_debt',
                        'hsvalui' : 'home_value_imputed',

                        'saest'   : 'super_value',
                        'sacfnda' : 'super_funds_value',
                        'pwsupri' : 'super_retired_imputed', 
                        'pwsuprt' : 'super_retired', 
                        'pwsupwi' : 'super_notretired_imputed',
                        'pwsupwk' : 'super_notretired', 

                        'opdt'   : 'otherprop_debt', 
                        'opnum'  : 'otherprop_count', 
                        'oprntn' : 'otherprop_count_earned_rent', 
                        'opvalue': 'otherprop_value',  

                        'gh1'     : 'self_assessed_health_1-5', 
                        'lshrcar' : 'hours_caring_per_week',
                        
                        'jbmsall' :	'job satisfaction_overall',
                        'jbmsflx' :	'job satisfaction_flexibility',
                        'jbmshrs' :	'job satisfaction_hours',
                        'jbmssec' : 'job satisfaction_security',
                        'jbmspay' :	'job satisfaction_pay',
                        'jbmswrk' :	'job satisfaction_work',
                        
                        'losat'   :	'satisfaction_life',
                        'losatfs' :	'satisfaction_financial_situation',
                        'losatyh' :	'satisfaction_health',
                        'losatft' :	'satisfaction_free_time',
                        'losateo' :	'satisfaction_employment_opportunities',
                        'losatlc' :	'satisfaction_part_of_community'
                       })

In [10]:
# Add columns - employment and education
df['occupation_group'] = hilda_df['jbmo61'].map({  -1: np.nan,
                                             -4: np.nan,
                                             -7: np.nan,
                                              1: 'Managers',
                                              2: 'Professionals',
                                              3: 'Technicians and Trades Workers',
                                              4: 'Community and Personal Service Workers',
                                              5: 'Clerical and Administrative Workers',
                                              6: 'Sales Workers',
                                              7: 'Machinery Operators and Drivers',
                                              8: 'Labourers'})

df['industry'] = hilda_df['jbmi61'].map({
                                       1  :'Agriculture, Forestry and Fishing',
                                       10 :'Information Media and Telecommunications',
                                       11 :'Financial and Insurance Services',
                                       12 :'Rental, Hiring and Real Estate Services',
                                       13 :'Professional, Scientific and Technical Services',
                                       14 :'Administrative and Support Services',
                                       15 :'Public Administration and Safety',
                                       16 :'Education and Training',
                                       17 :'Health Care and Social Assistance',
                                       18 :'Arts and Recreation Services',
                                       19 :'Other Services',
                                       2 :'Mining',
                                       3 :'Manufacturing',
                                       4 :'Electricity, Gas, Water and Waste Services',
                                       5 :'Construction',
                                       6 :'Wholesale Trade',
                                       7 :'Retail Trade',
                                       8 :'Accommodation and Food Services',
                                       9 :'Transport, Postal and Warehousing'})

df['education'] = hilda_df['edhigh1'].map( {1: 'Postgraduate', #Masters or PhD
                                      2: 'Graduate',
                                      3: 'Bachelor/Diploma', # 'Bachelor honours',
                                      4: 'Bachelor/Diploma', # 'Adv Diploma or Diploma',
                                      5: 'Year 12 or Certificate III/IV', #'Certificate III or IV',
                                      8: 'Year 12 or Certificate III/IV', #'Year 12',
                                      9:  'Year 11 and below',
                                      10: 'Year 11 and below'})

df['employment_status'] = hilda_df['esempdt'].map({-1: np.nan,
                                              1: 'Employee', 
                                              2: 'Own business with employees',
                                              3: 'Own business without employees',
                                              4: 'Employer/Self-employed with employees', 
                                              5: 'Employer/Self-employed without employees',
                                              6: 'Unpaid family worker'})

df['paid_work'] =  hilda_df['esempdt'].apply(lambda x: 1 if x in (1,2,3,4,5) else 0)
df['retired_completely_from_workforce'] = hilda_df['rtcomp'].apply(lambda x: 1 if x == 1 else 0)
df['retired_completely_not_working'] = hilda_df['rtcompn'].apply(lambda x: 1 if x == 1 else 0)
df['retired_completely_from_paid_work'] = hilda_df['sartpw'].apply(lambda x: 1 if x ==1 else 0)
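One detail worth knowing when reading these mappings: `Series.map` with a dictionary returns `NaN` for any value that has no key, so codes left out of a dictionary end up missing just like the ones mapped to `np.nan` explicitly. A minimal sketch with made-up codes:

```python
import numpy as np
import pandas as pd

codes = pd.Series([1, 2, -1, -4, 99])

# -1 is mapped to NaN explicitly; -4 and 99 have no key and also become NaN
labels = codes.map({-1: np.nan, 1: "Managers", 2: "Professionals"})
print(labels.tolist())
```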

In [11]:
# Add columns - background
df['gender'] = hilda_df['hgsex'].map( {1:'Male', 
                                 2:'Female'})

# Main English speaking countries are United Kingdom, New Zealand, Canada, USA, Ireland and South Africa
df['country_birth'] = hilda_df['anbcob'].map({ 1:'Australia', 
                                         2:'Main English Speaking', 
                                         3:'Other', 
                                        -4: np.nan})

df['marital_status'] = hilda_df['mrcms'].map({ 1:'Married', 
                                         2:'Separated', 
                                         3:'Divorced', 
                                         4:'Widowed', 
                                         5:'Never married,in relationship', 
                                         6:'Never married'})

df['grandchildren_care'] = hilda_df['gcecf'].map( {1: 'Several times a week', #'Daily', 
                                             2:'Several times a week', 
                                             3: 'Once a week to once a month', #'Once a week', 
                                             4: 'Once a week to once a month', # 'Once a week to once a month',
                                             5: 'Rarely', #'A few times a year', 
                                             6: 'Rarely', #'Once a year', 
                                             7: 'Rarely', #'Less than once a year', 
                                             8: 'Never', 
                                             -1:np.nan})

df['receives_age_pension'] = hilda_df['bncap'].apply(lambda x: 1 if x == 1 else 0)
df['receives_disability_pension'] = hilda_df['bncdsp'].apply(lambda x: 1 if x == 1 else 0)

In [12]:
# Add columns - life events
df['le_death_close_friend']         = hilda_df['ledfr'].apply(lambda x: 1 if x==2 else 0)
df['le_death_close_relative']       = hilda_df['ledrl'].apply(lambda x: 1 if x==2 else 0)
df['le_death_partner_child']        = hilda_df['ledsc'].apply(lambda x: 1 if x==2 else 0)
df['le_major_improvement_finances'] = hilda_df['lefni'].apply(lambda x: 1 if x==2 else 0)
df['le_major_worsening_finances']   = hilda_df['lefnw'].apply(lambda x: 1 if x==2 else 0)
df['le_fired_redundant']            = hilda_df['lefrd'].apply(lambda x: 1 if x==2 else 0)
df['le_illness_family']             = hilda_df['leinf'].apply(lambda x: 1 if x==2 else 0)
df['le_illness_personal']           = hilda_df['leins'].apply(lambda x: 1 if x==2 else 0)
df['le_changed_job']                = hilda_df['lejob'].apply(lambda x: 1 if x==2 else 0)
df['le_changed_residence']          = hilda_df['lemvd'].apply(lambda x: 1 if x==2 else 0)
df['le_retired']                    = hilda_df['lertr'].apply(lambda x: 1 if x==2 else 0)
df['le_separated_from_spouse']      = hilda_df['lesep'].apply(lambda x: 1 if x==2 else 0)
df['le_back_together_spouse']       = hilda_df['lercl'].apply(lambda x: 1 if x==2 else 0)
df['le_weather_disaster']           = hilda_df['ledhm'].apply(lambda x: 1 if x==2 else 0)
df['le_got_married']                = hilda_df['lemar'].apply(lambda x: 1 if x==2 else 0)
df['le_promoted']                   = hilda_df['leprm'].apply(lambda x: 1 if x==2 else 0)
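The sixteen life-event flags above all follow one pattern (code 2 becomes 1, anything else becomes 0), so they could also be produced in a loop with a vectorised comparison. This is only an alternative sketch on toy data; the `life_events` dictionary lists two of the columns for brevity and the sample values are made up.

```python
import numpy as np
import pandas as pd

toy_hilda = pd.DataFrame({
    "ledfr": [2.0, 1.0, np.nan],
    "lertr": [1.0, 2.0, -1.0],
})

# Source column -> output flag name (two of the sixteen shown)
life_events = {
    "ledfr": "le_death_close_friend",
    "lertr": "le_retired",
}

flags = pd.DataFrame(index=toy_hilda.index)
for source, target in life_events.items():
    # The comparison yields booleans; NaN compares as False, so it becomes 0
    flags[target] = (toy_hilda[source] == 2).astype(int)

print(flags)
```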

In [13]:
# Add columns - housing
df['housing_type'] = hilda_df['hstenr'].map({ 1: 'Own/currently paying off mortgage',
                                        2: 'Rent (or pay board)',
                                        3: 'Involved in a rent-buy scheme',
                                        4: 'Live here rent free / Life Tenure'
                                        })

df['home_loan_paid_off'] = hilda_df['hsmgpd'].map({1:'Yes', 2:'No', -1:'n/a', -3:'n/a', -4:'n/a'})

df['housing_other'] = hilda_df['hsfrea'].map({   1: 'Housing part of job compensation',
                                           2: 'Home owned by a relative not living here',
                                           3: 'Home owned by someone unrelated',
                                           4: 'Sold home but have not moved yet',
                                           5: 'Public Housing',
                                           6: 'Staying with friends or relatives rent-free',
                                           7: 'Home owned by a household member relative',
                                           8: 'Life Tenure contract',
                                           98: 'Other'
                                        })

df['landlord'] = hilda_df['hsllord'].map({ 1: 'Private rent',
                                     2: 'Caravan park rent',
                                     3: 'Government housing authority',
                                     4: 'Community housing group',
                                     5: 'Employer',
                                     6: 'Other',
                                     7: 'Private rent'
                                    })

In [14]:
# Add columns - household
df['remoteness'] = hilda_df['hhsra'].map({
                                   0: 'Major Cities', #Major Cities of Australia
                                   1: 'Inner Regional', #Inner Regional Australia
                                   2: 'Outer Regional', #Outer Regional Australia
                                   3: 'Remote', #Remote Australia
                                   4: 'Remote', #Very Remote Australia
                                   5: 'Other', #Migratory - Offshore - Shipping (NSW)
                                   9: 'Other', #No usual address (NSW)
                                   -7: 'Other'})

df['household_type'] = hilda_df['hhtype'].map({
   1:  'Couple family wo children or others',
   10: 'Couple family with ndepchild wo others',
   11: 'Couple family with ndepchild w other related',
   12: 'Couple family with ndepchild w other not related',
   13: 'Lone parent with children < 15 wo others',
   14: 'Lone parent with children < 15 w other related',
   15: 'Lone parent with children < 15 w other not related',
   16: 'Lone parent with depst wo others',
   17: 'Lone parent with depst w other related',
   18: 'Lone parent with depst w other not related',
   19: 'Lone parent with ndepchild wo others',
   2:  'Couple family wo children w other related',
   20: 'Lone parent with ndepchild w other related',
   21: 'Lone parent with ndepchild w other not related',
   22:  'Other related family wo children < 15 or others',
   23: 'Other related family wo children < 15 w others',
   24: 'Lone person',
   25: 'Group household',
   26: 'Multi family household',
   3: 'Couple family wo children w other not related',
   4: 'Couple family with children < 15 wo others',
   5: 'Couple family with children < 15 w other related',
   6: 'Couple family with children < 15 w other not related',
   7: 'Couple family with depst wo others',
   8: 'Couple family with depst w other related',
   9: 'Couple family with depst w other not related',
   99: 'Other' #Not yet classified
})

df['household_group'] = hilda_df['hhtype'].map({
        1:  'Empty nesters',
        10: 'Couple family with children',
        11: 'Extended family with children',
        12: 'Extended family with children',
        13: 'Other',
        14: 'Other',
        15: 'Other',
        16: 'Other',
        17: 'Other',
        18: 'Other',
        19: 'Other',
        2:  'Extended family',
        20: 'Extended family',
        21: 'Extended family',
        22: 'Extended family',
        23: 'Extended family',
        24: 'Lone person',
        25: 'Other',
        26: 'Extended family',
        3:  'Extended family',
        4: 'Couple family with children',
        5: 'Extended family with children',
        6: 'Extended family with children',
        7: 'Couple family with children',
        8: 'Extended family with children',
        9: 'Extended family with children',
        99: 'Other' #Not yet classified
})

In [15]:
#Add columns - retirement reasons
df['rtreaep'] = hilda_df['rtreaep'].apply(lambda x: 'Became eligible for the age pension, ' if x ==1 else '')
df['rtreavr'] = hilda_df['rtreavr'].apply(lambda x: 'Offered reasonable financial terms to retire early or accept a voluntary redundancy, ' if x ==1 else '')
df['rtreafa'] = hilda_df['rtreafa'].apply(lambda x: 'Superannuation rules made it financially advantageous to retire at that time, ' if x ==1 else '')
df['rtreaca'] = hilda_df['rtreaca'].apply(lambda x: 'Could afford to retire / Had enough income, ' if x ==1 else '')
df['rtreasi'] = hilda_df['rtreasi'].apply(lambda x: 'Spouses / partners income enabled me to retire, ' if x ==1 else '')
df['rtreamr'] = hilda_df['rtreamr'].apply(lambda x: 'Made redundant / Dismissed / Had no choice, ' if x ==1 else '')
df['rtreara'] = hilda_df['rtreara'].apply(lambda x: 'Reached compulsory retirement age, ' if x ==1 else '')
df['rtreanj'] = hilda_df['rtreanj'].apply(lambda x: 'Could not find another job, ' if x ==1 else '')
df['rtreaws'] = hilda_df['rtreaws'].apply(lambda x: 'Fed up with working / work stresses, demands, ' if x ==1 else '')
df['rtreape'] = hilda_df['rtreape'].apply(lambda x: 'Pressure from employer or others at work, ' if x ==1 else '')
df['rtreaho'] = hilda_df['rtreaho'].apply(lambda x: 'Own ill health, ' if x ==1 else '')
df['rtreahf'] = hilda_df['rtreahf'].apply(lambda x: 'Ill health of other family member, ' if x ==1 else '')
df['rtreahs'] = hilda_df['rtreahs'].apply(lambda x: 'Ill health of spouse / partner, ' if x ==1 else '')
df['rtreapr'] = hilda_df['rtreapr'].apply(lambda x: 'Partner had just retired or was about to retire, ' if x ==1 else '')
df['rtreasw'] = hilda_df['rtreasw'].apply(lambda x: 'Spouse / partner wanted me to retire, ' if x ==1 else '')
df['rtreatf'] = hilda_df['rtreatf'].apply(lambda x: 'To spend more time with other family members, ' if x ==1 else '')
df['rtreats'] = hilda_df['rtreats'].apply(lambda x: 'To spend more time with spouse / partner, ' if x ==1 else '')
df['rtrealt'] = hilda_df['rtrealt'].apply(lambda x: 'To have more personal / leisure time, ' if x ==1 else '')
df['rtreach'] = hilda_df['rtreach'].apply(lambda x: 'To have children/ start family/ to care for children, ' if x ==1 else '')
df['rtrearf'] = hilda_df['rtrearf'].apply(lambda x: 'Refused, ' if x ==1 else '')
df['rtreane'] = hilda_df['rtreane'].apply(lambda x: 'NEI to classify, ' if x ==1 else '')
df['rtreano'] = hilda_df['rtreano'].apply(lambda x: 'None of these, ' if x ==1 else '')
df['rtreaos'] = hilda_df['rtreaos'].apply(lambda x: 'Other reason, ' if x ==1 else '')
df['rtreadn'] = hilda_df['rtreadn'].apply(lambda x: 'Dont know, ' if x ==1 else '')

df['rt_retirement_reasons'] = df['rtreaep'] + df['rtreavr'] + df['rtreafa'] + df['rtreaca'] + df['rtreasi'] + df['rtreamr'] + df['rtreara'] + df['rtreanj'] + df['rtreaws'] + df['rtreape'] + df['rtreaho'] + df['rtreahf'] + df['rtreahs'] + df['rtreapr'] + df['rtreasw'] + df['rtreatf'] + df['rtreats'] + df['rtrealt'] + df['rtreach'] + df['rtrearf'] + df['rtreane'] + df['rtreano'] + df['rtreaos'] + df['rtreadn'] 

df['rt_main_reason'] = hilda_df['rtmrea'].map({
1: 'Became eligible for the age pension',
2:	'Offered reasonable financial terms to retire early or accept a voluntary redundancy',
3:	'Superannuation rules made it financially advantageous to retire at that time',
4:	'Could afford to retire / Had enough income',
5:	'Spouses / partners income enabled me to retire',
6:	'Made redundant / Dismissed / Had no choice',
7:	'Reached compulsory retirement age',
8:	'Could not find another job',
9:	'Fed up with working / work stresses, demands',
10:	'Pressure from employer or others at work',
11:	'Own ill health',
12:	'Ill health of spouse / partner',
13:	'Ill health of other family member',
14:	'Partner had just retired or was about to retire',
15:	'Spouse / partner wanted me to retire',
16:	'To spend more time with spouse / partner',
17:	'To spend more time with other family members',
18:	'To have more personal / leisure time',
19:	'To have children/ start family/ to care for children',
95:	'NEI to classify',
98:	'Other',
997:	'None of these'
})
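Because each reason string above carries its own trailing `', '`, the combined `rt_retirement_reasons` value always ends with a separator when at least one reason is flagged. If that matters downstream, one alternative is to collect the labels per row and join them. The sketch below is an illustration on toy data: the `reason_labels` dictionary covers only three of the reason columns and `combine_reasons` is a hypothetical helper.

```python
import pandas as pd

toy_hilda = pd.DataFrame({
    "rtreaep": [1, 0, 0],
    "rtreaho": [1, 1, 0],
    "rtrealt": [0, 1, 0],
})

# Reason column -> label (three of the reason columns shown)
reason_labels = {
    "rtreaep": "Became eligible for the age pension",
    "rtreaho": "Own ill health",
    "rtrealt": "To have more personal / leisure time",
}


def combine_reasons(row):
    """Join the labels of all reasons flagged with 1, without a trailing separator."""
    return ", ".join(label for col, label in reason_labels.items() if row[col] == 1)


combined = toy_hilda.apply(combine_reasons, axis=1)
print(combined.tolist())
```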



In [16]:
# Add columns - health
df['main_carer_non_res'] = hilda_df['hencam'].map({
    1: 'main_carer',
    2: 'shared_care',
    -1: 'n/a'})

df['main_carer_res'] = hilda_df['hercam'].map({
    1: 'main_carer',
    2: 'shared_care',
    -1: 'n/a'})

df['home_internet'] = hilda_df['lsinthm'].apply(lambda x: 1 if x == 1 else 0)
df['long_term_health_condition'] = hilda_df['helth'].apply(lambda x: 1 if x == 1 else 0)

df['health_condition_limits_work'] = hilda_df['helthwk'].map({
    1: 'Limit type or amount of work',
    2: 'No impact',
    3: 'Not capable of working',
    -1:'n/a',
    -3:'n/a',
    -4:'n/a'})

In [17]:
# Add columns - other
df['superannuation_group'] = hilda_df['savaln2'].map({
    -4: np.nan, 
    -3: np.nan,
    -1: np.nan,
    1: '0-20k',
    2: '0-20k',
    3: '20-50k',
    4: '50-100k',
    5: '100-200k',
    6: '200-500k',
    7: '500k-1m',
    8: '1m+',
    9: '1m+',
    10: '1m+',
    97: 'no super'}) 

df['flag_super_funds'] = hilda_df['sacfnd'].apply(lambda x: 1 if x ==1 else 0)
df['flag_other_property'] = hilda_df['optoh'].apply(lambda x: 1 if x ==1 else 0)
df['flag_otherprop_earned_rent'] = hilda_df['oprnty'].apply(lambda x: 1 if x ==1 else 0)
df['flag_otherprop_not_rented_out'] = hilda_df['opt2hnr'].apply(lambda x: 1 if x ==1 else 0)
df['flag_otherprop_rented_out'] = hilda_df['opt2hr'].apply(lambda x: 1 if x ==1 else 0)

# How well speaks English {-1:'Native speaker', 1:'Very well', 2:'Well', 3:'Not well', 4:'Not at all'}
df['english_fluent'] = hilda_df['aneab'].apply(lambda x: 1 if x in (-1,1) else 0)

In [18]:
# Add columns - year
df['year'] = hilda_df['wave'].map({
'a':	2001,
'b':	2002,
'c':	2003,
'd':	2004,
'e':	2005,
'f':	2006,
'g':	2007,
'h':	2008,
'i':	2009,
'j':	2010,
'k':	2011,
'l':	2012,
'm':	2013,
'n':	2014,
'o':	2015,
'p':	2016,
'q':	2017,
'r':	2018,
's':	2019,
't':	2020})
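Since the wave letters run consecutively from *a* (2001), the same mapping can be derived arithmetically rather than typed out. A minimal sketch, with `wave_to_year` as a hypothetical helper name:

```python
import pandas as pd


def wave_to_year(letter, first_year=2001):
    """Convert a wave letter to its survey year ('a' -> 2001)."""
    return first_year + ord(letter) - ord("a")


waves = pd.Series(["a", "n", "t"])
years = waves.map(wave_to_year)
print(years.tolist())
```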

In [19]:
# Save output to csv
df.to_csv('../data/processed/hilda_transformed.csv', index=False)

In [20]:
%load_ext watermark
%watermark -n -u -v -iv

Last updated: Mon Apr 14 2025

Python implementation: CPython
Python version       : 3.12.8
IPython version      : 8.31.0

numpy : 1.26.4
pandas: 2.2.2


