# Cork

The 2022 Census is used to derive the vulnerability index for Ireland. To use this data, the relevant variables first need to be identified and then normalised. The method used is set out below.

## Census Data

The Central Statistics Office (CSO) has produced a dataset of [small area statistics](https://www.cso.ie/en/media/csoie/census/census2022/) for the 2022 Census. This is the main data source for the Irish Vulnerability Assessment.


### R Libraries

The relevant R libraries are loaded into the kernel:

In [1]:
# load the libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<ht

In [2]:
# create the pipeline directory if it does not exist
pipelineDir <- file.path("../..","2_pipeline","Cork","1a_CensusData","2022")
if(!dir.exists(pipelineDir)){
    dir.create(pipelineDir, recursive = TRUE)
    print(paste0(pipelineDir, " created"))
}

### Import the csv data

The data can be imported directly from the CSO website (this is the default) or using a local version.

In [3]:
#get the data from the CSO website
#smallAreaCSOData <- read.csv("https://www.cso.ie/en/media/csoie/census/census2022/SAPS_2022_Small_Area_270923.csv",  header=TRUE, sep=",")

#get the data locally
smallAreaCSOData <- read.csv('../../0_data/Cork/IrishCensus/2022/SAPS_2022_Small_Area_270923.csv', header=TRUE, sep=",", stringsAsFactors = FALSE)

# remove 'IE0' row from Census 2022 CSV supplied by CSO, and reindex
smallAreaCSOData <- smallAreaCSOData[smallAreaCSOData$GUID != "IE0", ]
row.names(smallAreaCSOData) <- 1:nrow(smallAreaCSOData)
head(smallAreaCSOData)

Unnamed: 0_level_0,GUID,GEOGID,GEOGDESC,T1_1AGE0M,T1_1AGE1M,T1_1AGE2M,T1_1AGE3M,T1_1AGE4M,T1_1AGE5M,T1_1AGE6M,⋯,T15_1_2C,T15_1_3C,T15_1_GE4C,T15_1_NSC,T15_1_TC,T15_2_BB,T15_2_OIC,T15_2_NO,T15_2_NS,T15_2_T
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,00b00ae4-229d-455d-84f1-d6face4876b1,147002002/02,147002002/02,2,6,3,2,6,5,4,⋯,20,3,1,20,116,88,4,4,20,116
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,167070005/01/167070001,167070005/01/167070001,1,3,2,1,1,3,1,⋯,43,16,4,7,119,93,1,17,8,119
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,087064009/03,087064009/03,4,1,2,3,1,2,1,⋯,59,9,2,6,114,106,2,1,5,114
4,08e82f06-46ee-4141-aa07-79a793a12b27,087001012/02,087001012/02,1,1,0,2,1,0,1,⋯,26,3,0,2,86,80,1,3,2,86
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,167053005/01,167053005/01,2,6,5,5,4,3,7,⋯,59,3,0,9,98,84,0,5,9,98
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,227032010/01,227032010/01,3,0,2,0,0,1,3,⋯,14,1,0,2,37,34,0,1,2,37
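
The choice between the local copy and the CSO website described above could also be made automatically. The following is a minimal sketch, not part of the original notebook: `chooseCensusSource` is a hypothetical helper, the URL is a placeholder, and a temporary file stands in for the real CSV.

```r
# Hypothetical helper: use the local copy if it exists, otherwise the remote URL
chooseCensusSource <- function(localPath, remoteUrl) {
  if (file.exists(localPath)) localPath else remoteUrl
}

# A temporary file stands in for the local census CSV (toy values only)
tmpCsv <- tempfile(fileext = ".csv")
write.csv(data.frame(GUID = c("IE0", "a1"), T1_1AGETT = c(10, 4)),
          tmpCsv, row.names = FALSE)

# The placeholder URL is never contacted because the local file exists
src <- chooseCensusSource(tmpCsv, "https://example.org/placeholder.csv")
demoData <- read.csv(src, header = TRUE, stringsAsFactors = FALSE)

# Drop the 'IE0' row and reset the row names, as in the cell above
demoData <- demoData[demoData$GUID != "IE0", , drop = FALSE]
row.names(demoData) <- NULL
```

Because `read.csv` accepts both file paths and URLs, the same call works for either source.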


## Select only the relevant data

In total there are 799 different variables in the small area dataset. However, only a smaller subset are useful for our purposes. We therefore need to extract the relevant data, then combine these to create our vulnerability indicators.

The dataset includes data at both the persons level (number of people in a small area) and the household level (number of households in a small area). As the preprocessing differs slightly between the two, they are processed separately below.
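
Since only a subset of the variables is extracted by name, a misspelt variable code would cause the subsetting to fail. One possible safeguard, sketched here on toy data, is to check the requested names before subsetting. `checkVariables` is a hypothetical helper and is not used elsewhere in this notebook.

```r
# Hypothetical helper: stop early if any requested census variable is missing
checkVariables <- function(data, variables) {
  missingVars <- setdiff(variables, names(data))
  if (length(missingVars) > 0) {
    stop("Missing columns: ", paste(missingVars, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data with two of the age variables
toy <- data.frame(GUID = c("a", "b"), T1_1AGE0T = c(2, 3), T1_1AGE1T = c(1, 0))

# Passes silently: both variables are present
checkVariables(toy, c("T1_1AGE0T", "T1_1AGE1T"))

# Fails with a message naming the absent column
result <- tryCatch(
  checkVariables(toy, c("T1_1AGE0T", "T1_1AGE9T")),
  error = function(e) conditionMessage(e)
)
```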

### Small Area ID

First, we need to get the unique ID data for each of the small areas:

In [4]:
# get Small Area GUID
smallAreaID <- smallAreaCSOData[, c('GUID'), drop = FALSE]
colnames(smallAreaID)[colnames(smallAreaID) == "GUID"] = "SA_GUID__1"
head(smallAreaID)

Unnamed: 0_level_0,SA_GUID__1
Unnamed: 0_level_1,<chr>
1,00b00ae4-229d-455d-84f1-d6face4876b1
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db
3,06650182-eeaa-4c6c-847c-f85ddaf5361b
4,08e82f06-46ee-4141-aa07-79a793a12b27
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb
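
The later steps assume each small area ID appears exactly once. A minimal sketch of how that assumption could be checked is given below; `assertUniqueIds` is a hypothetical helper and the IDs are shortened toy values.

```r
# Hypothetical helper: confirm every small area ID appears exactly once
assertUniqueIds <- function(ids) {
  dupes <- unique(ids[duplicated(ids)])
  if (length(dupes) > 0) {
    stop("Duplicated IDs: ", paste(dupes, collapse = ", "))
  }
  invisible(TRUE)
}

# Passes silently when all IDs are distinct
toyIds <- c("00b0", "0300", "0665")
assertUniqueIds(toyIds)

# Reports the repeated ID otherwise
dupMessage <- tryCatch(
  assertUniqueIds(c("00b0", "0300", "00b0")),
  error = function(e) conditionMessage(e)
)
```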


### Persons Level Data

We then get the persons level data and combine the variables together to create indicators:

In [5]:
#PERSONS DATA

#POPULATION TOTAL
populationTotalData <- smallAreaCSOData[, 'T1_1AGETT', drop = FALSE]
names(populationTotalData)[1] <- 'populationTotal'

#AGE - YOUNG 
ageYoungVariables <- c(
    'T1_1AGE0T', #Age 0 - Total
    'T1_1AGE1T', #Age 1 - Total
    'T1_1AGE2T', #Age 2 - Total
    'T1_1AGE3T', #Age 3 - Total
    'T1_1AGE4T', #Age 4 - Total
    'T1_1AGE5T'  #Age 5 - Total
)

ageYoungData <- smallAreaCSOData[,ageYoungVariables, drop = FALSE]
ageYoungData$young <- apply(ageYoungData,1,sum)
ageYoungData <- select(ageYoungData, 'young')

#AGE - OLD (75 and over only)
ageOldVariables <- c(
    'T1_1AGE75_79T', #Age 75 - 79 - Total
    'T1_1AGE80_84T', #Age 80 - 84 - Total
    'T1_1AGEGE_85T'  #Age 85 and over - Total
)
ageOldData <- smallAreaCSOData[, ageOldVariables, drop = FALSE]
ageOldData$old <- apply(ageOldData,1,sum)
ageOldData <- select(ageOldData, 'old')

#PRIMARY SCHOOL AGE
primarySchoolAgeVariables <- c(
    'T1_1AGE4T',  #Age 4 - Total
    'T1_1AGE5T',  #Age 5 - Total
    'T1_1AGE6T',  #Age 6 - Total
    'T1_1AGE7T',  #Age 7 - Total
    'T1_1AGE8T',  #Age 8 - Total
    'T1_1AGE9T',  #Age 9 - Total
    'T1_1AGE10T', #Age 10 - Total
    'T1_1AGE11T', #Age 11 - Total
    'T1_1AGE12T'  #Age 12 - Total
)

primarySchoolAgeData <- smallAreaCSOData[, primarySchoolAgeVariables, drop = FALSE]
primarySchoolAgeData$priSch <- apply(primarySchoolAgeData,1,sum)
primarySchoolAgeData <- select(primarySchoolAgeData, 'priSch')

#NOT VOLUNTEERS IN THE COMMUNITY
volunteersData <- smallAreaCSOData[, 'T7_1_VOL', drop = FALSE] 
volunteersData$volunteers <- apply(volunteersData,1,sum)
volunteersData <- select(volunteersData, 'volunteers')
notVolunteersData = populationTotalData - volunteersData
colnames(notVolunteersData) <- "notVolunteers"

#HEALTH -  BAD HEALTH (Choice of: Very good, Good, Fair, Bad, Very bad, and Not stated) 
healthVariables <- c(
    'T12_3_BT', #Bad - Total
    'T12_3_VBT' #Very bad - Total
)
healthData <- smallAreaCSOData[, healthVariables, drop = FALSE]
healthData$poorHealth <- apply(healthData,1,sum)
healthData <- select(healthData, 'poorHealth')

#DISABILITY 
disabilitiesData <- smallAreaCSOData[, 'T12_1_T', drop = FALSE] 
disabilitiesData$disability <- apply(disabilitiesData,1,sum)
disabilitiesData <- select(disabilitiesData, 'disability')

#UNEMPLOYMENT 
unemploymentVariables <- c(
    'T8_1_LFFJT',   #Looking for first regular job - Total
    #'T8_1_ULGUPJT', #Unemployed having lost or given up previous job - Total
    'T8_1_UTWSDT',  #Unable to work due to permanent sickness or disability - Total - MAY CORRELATE WITH HEALTH TOO MUCH
    'T8_1_LAHFT',   #Looking after home/family - Total - NOT SURE ABOUT THIS ONE
    'T8_1_STUT',    #Short Term Unemployed  - Total 
    'T8_1_LTUT'     #Long Term Unemployed - Total
)

unemploymentData <- smallAreaCSOData[, unemploymentVariables, drop = FALSE]
unemploymentData$unemploy <- apply(unemploymentData,1,sum)
unemploymentData <- select(unemploymentData, 'unemploy')

#LOW SKILLED EMPLOYMENT
lowSkilledEmploymentVariables <- c(
    'T9_2_PE', #E Manual skilled (No. of persons)
    'T9_2_PF', #F Semi-skilled (No. of persons)
    'T9_2_PG'  #G Unskilled (No. of persons)
)

lowSkilledEmploymentData <- smallAreaCSOData[, lowSkilledEmploymentVariables, drop = FALSE]
lowSkilledEmploymentData$lowSkill <- apply(lowSkilledEmploymentData,1,sum)
lowSkilledEmploymentData <- select(lowSkilledEmploymentData, 'lowSkill')

#FARMERS
farmingEmploymentVariables <- c(
    'T9_2_PI' #I Farmers (No. of persons)
    #'T9_2_PJ'  #J Agricultural workers (No. of persons) Forestry and fishing also included
)

farmingEmploymentData <- smallAreaCSOData[, farmingEmploymentVariables, drop = FALSE]
farmingEmploymentData$farming<- apply(farmingEmploymentData,1,sum)
farmingEmploymentData <- select(farmingEmploymentData, 'farming')


#TENURE - Permanent private households by type of occupancy 
rentVariables <- c(
    'T6_3_RPLP',  #Rented from private landlord (No. of persons) 
    'T6_3_RLAP',  #Rented from Local Authority (No. of persons)
    'T6_3_RVCHBP' #Rented from voluntary/co-operative housing body (No. of persons)
)

rentData <- smallAreaCSOData[, rentVariables, drop = FALSE]
rentData$rent <- apply(rentData,1,sum)
rentData <- select(rentData, 'rent')

#EDUCATION 
educationVariables <- c(
    'T10_4_NFT' #No formal education - Total
#     'T10_4_PT'   #Primary education - Total
)

educationData <- smallAreaCSOData[, educationVariables, drop = FALSE]
educationData$education <- apply(educationData,1,sum)
educationData <- select(educationData, 'education')

#ENGLISH ABILITY - Speakers of foreign languages by ability to speak English
englishVariables <- c(
    'T2_6NW', #Not well
    'T2_6NAA' #Not at all
)

englishData <- smallAreaCSOData[, englishVariables, drop = FALSE] 
englishData$engLang <- apply(englishData,1,sum)
englishData <- select(englishData, 'engLang')

#NEW RESIDENTS - Usually resident population aged 1 year and over by usual residence 1 year before Census Day
newResidentsVariables <- c(
    'T2_3EI', #Elsewhere in Ireland
    'T2_3OI'  #Outside Ireland
)

newResidentsData <- smallAreaCSOData[, newResidentsVariables, drop = FALSE] 
newResidentsData$newRes <- apply(newResidentsData,1,sum)
newResidentsData <- select(newResidentsData, 'newRes')

#TRAVEL TIME - Population aged 5 years and over by journey time to work, school or college 
travelTimeVariables <- c(
    'T11_3_D5', #1 hour - under 1 1/2 hours
    'T11_3_D6'  #1 1/2 hours and over
)

travelTimeData <- smallAreaCSOData[, travelTimeVariables, drop = FALSE] 
travelTimeData$travelTime <- apply(travelTimeData,1,sum)
travelTimeData <- select(travelTimeData, 'travelTime')

#combine all the data into one table
personsData <- cbind(smallAreaID,
                     populationTotalData,
                     ageYoungData,
                     ageOldData,
                     primarySchoolAgeData,
                     notVolunteersData,
                     healthData,
                     disabilitiesData,
                     unemploymentData,
                     lowSkilledEmploymentData,
                     farmingEmploymentData,
                     rentData,
                     educationData,
                     englishData,
                     newResidentsData,
                     travelTimeData
                    )

#get the number of columns in the data
personsDataColLength = ncol(personsData)

head(personsData)

#output the data as a csv
outputFile <- file.path(pipelineDir, "personsSmallAreaRawData.csv")
write.csv(personsData, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,populationTotal,young,old,priSch,notVolunteers,poorHealth,disability,unemploy,lowSkill,farming,rent,education,engLang,newRes,travelTime
Unnamed: 0_level_1,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,00b00ae4-229d-455d-84f1-d6face4876b1,376,44,2,92,353,5,82,65,130,0,290,3,3,2,13
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,310,24,41,34,244,8,83,54,67,37,26,11,4,5,23
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,375,29,5,55,332,3,54,22,44,0,33,3,4,16,34
4,08e82f06-46ee-4141-aa07-79a793a12b27,225,25,2,43,211,3,45,38,64,0,148,3,14,3,17
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,344,48,0,68,308,3,42,25,46,4,109,0,3,26,43
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,164,9,8,13,138,3,32,16,11,4,3,3,0,6,8
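
The cell above repeats the same three steps for each indicator: subset the columns, sum across each row, and keep only the summed column. As an alternative sketch (not the method used in this notebook), the pattern could be wrapped in one base R function using `rowSums`. `sumIndicator` is a hypothetical name and the values are toy data.

```r
# Hypothetical helper: sum a set of census columns into one named indicator
sumIndicator <- function(data, variables, indicatorName) {
  out <- data.frame(rowSums(data[, variables, drop = FALSE]))
  names(out) <- indicatorName
  out
}

# Toy data using the two health variable codes
toy <- data.frame(T12_3_BT = c(4, 2, 0), T12_3_VBT = c(1, 1, 3))

# One-column data frame holding the row-wise sums
poorHealthToy <- sumIndicator(toy, c("T12_3_BT", "T12_3_VBT"), "poorHealth")
```

Note that `rowSums` returns doubles, whereas `apply(..., 1, sum)` on integer columns returns integers, so the column type in the printed table would differ.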


### Household Level Data

We then get the household level data and combine the variables together to create indicators:

In [6]:
#HOUSEHOLD DATA

#HOUSEHOLDS TOTAL
householdsTotalData <- smallAreaCSOData[, 'T5_1T_H', drop = FALSE] #Total households (No. of households)
names(householdsTotalData)[1] <- 'householdsTotal'


#NO HEATING - Permanent private households by central heating - Households
noHeatingData <- smallAreaCSOData[, 'T6_5_NCH', drop = FALSE]  #No central heating
noHeatingData$noHeating <- apply(noHeatingData,1,sum)
noHeatingData <- select(noHeatingData, 'noHeating')

#RENEWABLE ENERGY SOURCE – Households with no renewable energy source
noRenewableEnergyHousesVariables <- c(
    'T6_10_NORE', #No Renewable Energy
    'T6_10_NS'    #Not stated
)

noRenewablenergyData <- smallAreaCSOData[, noRenewableEnergyHousesVariables, drop = FALSE]
noRenewablenergyData$noRenewableEnergyHouses <- apply(noRenewablenergyData,1,sum)
noRenewablenergyData <- select(noRenewablenergyData, 'noRenewableEnergyHouses')

#YEAR PROPERTY BUILT - Permanent private households by year built (Pre 1919, 1919-1945, 1946-1960, 1961-1970, 
#1971-1980, 1981-1990, 1991-2000, 2001-2010, 2011 or Later, Not stated)

yearBuiltVariables <- c(
    'T6_2_PRE19H', #Pre 1919 (No. of households)
    'T6_2_19_45H'  #1919 - 1945 (No. of households)
)

yearBuiltData <- smallAreaCSOData[, yearBuiltVariables, drop = FALSE]
yearBuiltData$yearBuilt <- apply(yearBuiltData,1,sum)
yearBuiltData <- select(yearBuiltData, 'yearBuilt')


#CARAVAN/MOBILE HOME (House/Bungalow, Flat/Apartment Bed-Sit, Caravan/Mobile home, Not stated)
mobileHomeData <- smallAreaCSOData[, 'T6_1_CM_H', drop = FALSE] #Caravan/Mobile home (No. of households)
mobileHomeData$mobHome <- apply(mobileHomeData,1,sum)
mobileHomeData <- select(mobileHomeData, 'mobHome')

#UNOCCUPIED DWELLINGS - Occupancy Status of Permanent Dwellings on Census Night
unoccupiedDwellingsVariables <- c(
    'T6_8_UHH', #Unoccupied holiday homes (No. of households)
    'T6_8_TA',  #Temporarily absent (No. of households)
    'T6_8_OVD'  #Other vacant dwellings (No. of households)
)

unoccupiedDwellingsData <- smallAreaCSOData[, unoccupiedDwellingsVariables, drop = FALSE]
unoccupiedDwellingsData$unoccupiedDwellings <- apply(unoccupiedDwellingsData,1,sum)
unoccupiedDwellingsData <- select(unoccupiedDwellingsData, 'unoccupiedDwellings')


#ONE PARENT HOUSEHOLDS
oneParentVariables <- c(
    'T5_1OPFC_H', #One parent family (father) with children households (No. of households)
    'T5_1OPMC_H', #One parent family (mother) and children households (No. of households)
    'T5_1OPFCO_H',#One parent family (father) with children and others households (No. of households)
    'T5_1OPMCO_H' #One parent family (mother) with children and others households (No. of households)
)

oneParentData <- smallAreaCSOData[, oneParentVariables, drop = FALSE]
oneParentData$oneParent <- apply(oneParentData,1,sum)
oneParentData <- select(oneParentData, 'oneParent')

#ONE PERSON HOUSEHOLDS
onePersonData <- smallAreaCSOData[, 'T5_1OP_H', drop = FALSE] #One person households (No. of households)
onePersonData$onePerson <- apply(onePersonData,1,sum)
onePersonData <- select(onePersonData, 'onePerson')

#CAR OWNERSHIP
noCarData <- smallAreaCSOData[, 'T15_1_NC', drop = FALSE] #No motor car (No. of households)
noCarData$noCar <- apply(noCarData,1,sum)
noCarData <- select(noCarData, 'noCar')


#NO INTERNET
noInternetData <- smallAreaCSOData[, 'T15_2_NO', drop = FALSE] #No internet (No. of households)
noInternetData$noInternet <- apply(noInternetData,1,sum)
noInternetData <- select(noInternetData, 'noInternet')

#WATER SUPPLY - private water supplies at risk of disease due to reduced quality control - *BIG ASSUMPTION*
waterSupplyVariables <- c(
    'T6_6_GSP', #Group scheme with private source
    'T6_6_OP'   #Other private source
)

waterSupplyData <- smallAreaCSOData[, waterSupplyVariables, drop = FALSE]
waterSupplyData$priWater <- apply(waterSupplyData,1,sum)
waterSupplyData <- select(waterSupplyData, 'priWater')


#FAMILY UNITS - FAMILIES WITH 3 OR MORE CHILDREN
familyUnitsVariables <- c(
    'T4_2_3CT',  #Families with 3 children - Total
    'T4_2_4CT',  #Families with 4 children - Total
    'T4_2_GE5CT'  #Families with 5+ children - Total
)

familyUnitsData <- smallAreaCSOData[, familyUnitsVariables, drop = FALSE]
familyUnitsData$familyUnits <- apply(familyUnitsData,1,sum)
familyUnitsData <- select(familyUnitsData, 'familyUnits')





#combine all the data into one table
householdData <- cbind(smallAreaID,
                       householdsTotalData,
                       noHeatingData,
                       noRenewablenergyData,
                       yearBuiltData,
                       mobileHomeData,
                       unoccupiedDwellingsData,
                       oneParentData,
                       onePersonData,
                       noCarData,
                       noInternetData,
                       waterSupplyData,
                       familyUnitsData
                    )
#inspect the table
head(householdData)

#get the number of columns in the data
householdDataColLength = ncol(householdData)

#output the data as a csv
outputFile <- file.path(pipelineDir, "householdSmallAreaRawData.csv")
write.csv(householdData, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,householdsTotal,noHeating,noRenewableEnergyHouses,yearBuilt,mobHome,unoccupiedDwellings,oneParent,onePerson,noCar,noInternet,priWater,familyUnits
Unnamed: 0_level_1,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,00b00ae4-229d-455d-84f1-d6face4876b1,116,0,36,2,0,2,38,14,15,4,0,23
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,119,1,70,34,0,11,7,34,8,17,90,16
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,114,0,71,0,0,4,12,7,2,1,0,26
4,08e82f06-46ee-4141-aa07-79a793a12b27,86,0,79,0,0,12,29,13,16,3,0,10
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,98,0,54,1,0,9,12,12,2,5,0,15
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,37,1,9,0,0,13,1,11,1,1,0,1


## Percentages

The raw counts are not suitable for use within the vulnerability assessment because small areas differ in size. The counts therefore need to be normalised by the number of people or households in each small area, so each indicator is converted to a percentage of the total persons or households in that area.
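
As a worked illustration of the percentage calculation, the sketch below uses toy values. It also shows what happens when a total is zero: whether the census data contains such areas is not stated here, and replacing the result with `NA` is an assumption made for this example only.

```r
# Toy data: the third area has no residents
toy <- data.frame(
  SA_GUID__1 = c("a", "b", "c"),
  populationTotal = c(200, 50, 0),
  young = c(50, 25, 0)
)

# Percentage of the population in the 'young' group
toy$young_pct <- (toy$young / toy$populationTotal) * 100

# 0 / 0 evaluates to NaN in R; mark it as missing (an assumption for this sketch)
toy$young_pct[is.nan(toy$young_pct)] <- NA
```

Areas "a" and "b" give 25% and 50% respectively, while area "c" is left as missing rather than `NaN`.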

### Persons Percentages

In [7]:
#PERSONS DATA

#Copy the data
personsDataPct <- personsData

#Calculate the percentages for each of the relevant columns - starting at the 3rd column
for(col in names(personsDataPct)[3:personsDataColLength]) {
  personsDataPct[paste0(col, "_pct")] = (personsDataPct[col] / personsDataPct$populationTotal)*100
}

#remove the original data to leave only the percentages
personsDataPct <- personsDataPct[-c(2:personsDataColLength)]
head(personsDataPct)

#output the data as a csv
outputFile <- file.path(pipelineDir, "personsSmallAreaPctData.csv")
write.csv(personsDataPct, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,young_pct,old_pct,priSch_pct,notVolunteers_pct,poorHealth_pct,disability_pct,unemploy_pct,lowSkill_pct,farming_pct,rent_pct,education_pct,engLang_pct,newRes_pct,travelTime_pct
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,00b00ae4-229d-455d-84f1-d6face4876b1,11.702128,0.5319149,24.468085,93.88298,1.329787,21.80851,17.287234,34.574468,0.0,77.12766,0.7978723,0.7978723,0.5319149,3.457447
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,7.741935,13.2258065,10.967742,78.70968,2.580645,26.77419,17.419355,21.612903,11.935484,8.387097,3.5483871,1.2903226,1.6129032,7.419355
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,7.733333,1.3333333,14.666667,88.53333,0.8,14.4,5.866667,11.733333,0.0,8.8,0.8,1.0666667,4.2666667,9.066667
4,08e82f06-46ee-4141-aa07-79a793a12b27,11.111111,0.8888889,19.111111,93.77778,1.333333,20.0,16.888889,28.444444,0.0,65.777778,1.3333333,6.2222222,1.3333333,7.555556
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,13.953488,0.0,19.767442,89.53488,0.872093,12.2093,7.267442,13.372093,1.162791,31.686047,0.0,0.872093,7.5581395,12.5
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,5.487805,4.8780488,7.926829,84.14634,1.829268,19.5122,9.756098,6.707317,2.439024,1.829268,1.8292683,0.0,3.6585366,4.878049


### Household Percentages

In [8]:
#HOUSEHOLD DATA

#Copy the data
householdDataPct <- householdData

#Calculate the percentages for each of the relevant columns - starting at the 3rd column
for(col in names(householdDataPct)[3:ncol(householdDataPct)]) {
  householdDataPct[paste0(col, "_pct")] = (householdDataPct[col] / householdDataPct$householdsTotal)*100
}

#remove the original data to leave only the percentages
householdDataPct <- householdDataPct[-c(2:householdDataColLength)]
head(householdDataPct)

#output the data as a csv
outputFile <- file.path(pipelineDir, "householdSmallAreaNormalisedData.csv")
write.csv(householdDataPct, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,noHeating_pct,noRenewableEnergyHouses_pct,yearBuilt_pct,mobHome_pct,unoccupiedDwellings_pct,oneParent_pct,onePerson_pct,noCar_pct,noInternet_pct,priWater_pct,familyUnits_pct
Unnamed: 0_level_1,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,00b00ae4-229d-455d-84f1-d6face4876b1,0.0,31.03448,1.724138,0,1.724138,32.758621,12.068966,12.931034,3.448276,0.0,19.827586
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,0.8403361,58.82353,28.571429,0,9.243697,5.882353,28.571429,6.722689,14.285714,75.63025,13.445378
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,0.0,62.2807,0.0,0,3.508772,10.526316,6.140351,1.754386,0.877193,0.0,22.807018
4,08e82f06-46ee-4141-aa07-79a793a12b27,0.0,91.86047,0.0,0,13.953488,33.72093,15.116279,18.604651,3.488372,0.0,11.627907
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,0.0,55.10204,1.020408,0,9.183673,12.244898,12.244898,2.040816,5.102041,0.0,15.306122
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,2.7027027,24.32432,0.0,0,35.135135,2.702703,29.72973,2.702703,2.702703,0.0,2.702703


## Z-Scores

Percentages alone are still not suitable for use within the vulnerability assessment, because the indicators have different means and spreads. The data therefore needs to be standardised, so the percentages are converted to z-scores. Z-scores are:

>"A statistical measurement of a score's relationship to the mean (average value) in a group of scores. A Z-score of 0 means the score is the same as the mean (average value). A Z-score can be positive or negative, indicating whether it is above or below the mean and by how many standard deviations. Z-score standardisation represents the deviation of a raw score from its mean in standard deviation units." (Kazmierczak et al., 2015)
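
In symbols, the z-score of a value $x$ is $z = (x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation of the indicator. The sketch below, using toy percentages, checks that R's `scale()` gives the same result as this formula when the sample standard deviation from `sd()` is used.

```r
# Toy percentages for one indicator
pct <- c(10, 20, 30, 40)

# Manual z-scores: subtract the mean, divide by the sample standard deviation
zManual <- (pct - mean(pct)) / sd(pct)

# scale() returns a one-column matrix; as.numeric() flattens it to a vector
zScale <- as.numeric(scale(pct))

# The two approaches agree up to floating point tolerance
sameResult <- isTRUE(all.equal(zManual, zScale))
```

After standardisation the indicator has a mean of 0 and a standard deviation of 1, which is what makes different indicators comparable.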

### Persons Z-scores

In [9]:
#PERSONS DATA

#Copy the data
personsDataZ <- personsDataPct

#Calculate the z scores for each of the relevant columns - starting at the 2nd column
for(col in names(personsDataZ)[2:ncol(personsDataZ)]) {
  personsDataZ[paste0(col, "_Z")] = scale(personsDataZ[col])
}


#remove the original data to leave only the z scores
personsDataZ <- personsDataZ[-c(2:ncol(personsDataPct))]
# summary(personsDataZ)
head(personsDataZ)

#output the data as a csv
outputFile <- file.path(pipelineDir, "personsSmallAreaZData.csv")
write.csv(personsDataZ, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,young_pct_Z,old_pct_Z,priSch_pct_Z,notVolunteers_pct_Z,poorHealth_pct_Z,disability_pct_Z,unemploy_pct_Z,lowSkill_pct_Z,farming_pct_Z,rent_pct_Z,education_pct_Z,engLang_pct_Z,newRes_pct_Z,travelTime_pct_Z
Unnamed: 0_level_1,<chr>,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>"
1,00b00ae4-229d-455d-84f1-d6face4876b1,1.7131001,-1.206032,2.9123134,1.5943961,-0.32805634,-0.03080027,0.7576813,1.5271694,-0.5775435,2.1279286,-0.5407622,-0.4608183,-0.7106716,-0.5840908
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,0.3415865,1.208063,-0.1770142,-1.4769239,0.55757518,0.75212763,0.7825959,0.1713787,1.50707,-0.8133532,1.1870518,-0.2363478,-0.4294834,0.7320148
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,0.3386074,-1.05362,0.6694228,0.5115419,-0.70315591,-1.19888325,-1.3959424,-0.8620328,-0.5775435,-0.7956858,-0.5394257,-0.3382955,0.2608173,1.2792351
4,08e82f06-46ee-4141-aa07-79a793a12b27,1.5084163,-1.138143,1.6864596,1.5731018,-0.32554563,-0.31594402,0.6825637,0.8859637,-0.5775435,1.6422881,-0.2043972,2.0117291,-0.5022055,0.7772593
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,2.4928027,-1.30719,1.8366499,0.7142718,-0.65211266,-1.54428555,-1.1317924,-0.6906172,-0.3744542,0.1835659,-1.0419684,-0.4269868,1.1169999,2.4197535
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,-0.4390753,-0.379494,-0.8728761,-0.3764557,0.02558588,-0.3928551,-0.6624963,-1.3877584,-0.1515513,-1.0939506,0.1071385,-0.8245075,0.1026297,-0.1121813


### Household Z-scores

In [10]:
#HOUSEHOLD DATA

#Copy the data
householdDataZ <- householdDataPct

#Calculate the z scores for each of the relevant columns - starting at the 2nd column
for(col in names(householdDataZ)[2:ncol(householdDataZ)]) {
  householdDataZ[paste0(col, "_Z")] = scale(householdDataZ[col])
}

#remove the original data to leave only the z scores
householdDataZ <- householdDataZ[-c(2:ncol(householdDataPct))]
# summary(householdDataZ)
head(householdDataZ)

#output the data as a csv
outputFile <- file.path(pipelineDir, "householdSmallAreaZData.csv")
write.csv(householdDataZ, outputFile, row.names = FALSE)

Unnamed: 0_level_0,SA_GUID__1,noHeating_pct_Z,noRenewableEnergyHouses_pct_Z,yearBuilt_pct_Z,mobHome_pct_Z,unoccupiedDwellings_pct_Z,oneParent_pct_Z,onePerson_pct_Z,noCar_pct_Z,noInternet_pct_Z,priWater_pct_Z,familyUnits_pct_Z
Unnamed: 0_level_1,<chr>,"<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>","<dbl[,1]>"
1,00b00ae4-229d-455d-84f1-d6face4876b1,-0.7784105,-2.6542228,-0.7424513,-0.1877926,-0.6509695,3.2416831,-1.1159543,-0.02238299,-0.8408819,-0.5390374,1.2664209585
2,03003797-1fcd-4fcf-8dde-b2188e3fb1db,-0.2296817,-1.0415869,0.7315812,-0.1877926,-0.3124909,-0.8026652,0.4895645,-0.48506338,0.7935542,2.3691507,0.2809909677
3,06650182-eeaa-4c6c-847c-f85ddaf5361b,-0.7784105,-0.8409625,-0.8371139,-0.1877926,-0.5706376,-0.1038404,-1.6927471,-0.85532893,-1.2286369,-0.5390374,1.7264531244
4,08e82f06-46ee-4141-aa07-79a793a12b27,-0.7784105,0.8755911,-0.8371139,-0.1877926,-0.1004887,3.3864917,-0.8194822,0.40044644,-0.8348348,-0.5390374,0.0003685581
5,0920215b-86d3-4a53-9fc0-6008ae5c91f9,-0.7784105,-1.2575499,-0.7810891,-0.1877926,-0.3151928,0.1547723,-1.0988379,-0.83398255,-0.5914712,-0.5390374,0.5682948632
6,0daa02c9-4e30-45fe-90d7-6a1d3e7af2cb,0.98642,-3.0436224,-0.8371139,-0.1877926,0.8529627,-1.2811399,0.6022552,-0.7846551,-0.9533247,-0.5390374,-1.3777068492


## Combine Data

The persons level and household level data are then combined, and the raw, percentage and z-score versions are each written to their own CSV file:

In [11]:
#Combine the RAW persons and household data
personsHouseholdDataCombined <- cbind(personsData,
                                       householdData[2:ncol(householdData)])

#output the data as a csv
outputFile <- file.path(pipelineDir, "censusData.csv")
write.csv(personsHouseholdDataCombined, outputFile, row.names = FALSE)

#Combine the % persons and household data
personsHouseholdPctDataCombined <- cbind(personsDataPct,
                                       householdDataPct[2:ncol(householdDataPct)])

#output the data as a csv
outputFile <- file.path(pipelineDir, "censusDataPercent.csv")
write.csv(personsHouseholdPctDataCombined, outputFile, row.names = FALSE)

#Combine the Z-score persons and household data
personsHouseholdZDataCombined <- cbind(personsDataZ,
                                       householdDataZ[2:ncol(householdDataZ)])

names(personsHouseholdZDataCombined) <- gsub("_pct_Z","",names(personsHouseholdZDataCombined))

#output the data as a csv
outputFile <- file.path(pipelineDir, "censusDataZ.csv")
write.csv(personsHouseholdZDataCombined, outputFile, row.names = FALSE)
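
`cbind` relies on both tables having their rows in the same order, which holds here because both are built from the same source data frame. If that could not be guaranteed, joining on the ID column would be a safer alternative. The following is a minimal sketch on toy data and is not the approach used above.

```r
# Toy tables whose rows are deliberately in different orders
personsToy <- data.frame(SA_GUID__1 = c("a", "b", "c"), young = c(1, 2, 3))
householdToy <- data.frame(SA_GUID__1 = c("c", "a", "b"), noCar = c(30, 10, 20))

# merge() matches rows on the shared key rather than on position
combinedToy <- merge(personsToy, householdToy, by = "SA_GUID__1")
```

By default `merge` sorts the result by the key column, so the row order may differ from the input tables.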

**END**