## Crash Data Code

This code loads and wrangles crash statistics from NZTA's Crash Analysis System (CAS). It filters the crashes down to those that occurred in the Canterbury region, then joins them with a table of area unit names to give an overview of the crash statistics for each area unit in Canterbury.

In [20]:
library(tidyverse) # Load the necessary tidyverse library.

First, the initial crash dataframe was loaded. It covers the whole country from 2000 to the present and can be found here: https://opendata-nzta.opendata.arcgis.com/datasets/NZTA::crash-analysis-system-cas-data-1/about. The name of the first column, X, was read in with stray characters in front of it (`ï..X`), most likely the remains of a byte-order mark at the start of the file. It was renamed to just X to avoid problems further down the track.

In [None]:
initial.df <- read.csv("Crash_Analysis_System_(CAS)_data.csv") # Read the dataframe into R
initial.df <- initial.df %>% rename("X" = "ï..X") # Remove the stray characters from the first column name
head(initial.df)
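
A prefix like `ï..` is what a UTF-8 byte-order mark (BOM) typically turns into when a file is read under a different encoding. As a possible alternative to renaming the column afterwards, base R's `read.csv` accepts `fileEncoding = "UTF-8-BOM"`, which strips the mark while reading. The sketch below uses a small made-up temporary file rather than the CAS export, and that the CAS file actually starts with a BOM is an assumption.

```r
# Build a tiny CSV that starts with a UTF-8 byte-order mark (illustrative data only).
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
tmp <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("X,Y\n1565751,5181488\n")), tmp)

# Declaring the encoding lets read.csv drop the mark, so the first column is just "X".
demo.df <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo.df)
```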

The dataset was filtered to keep only crashes in Canterbury that happened from 2016 to 2020 inclusive.

In [None]:
crash.df <- initial.df %>% filter(region == "Canterbury Region" & crashYear >= 2016 & crashYear < 2021) # Keep Canterbury crashes from 2016 to 2020
head(crash.df)

This dataframe was saved and exported because it was useful to look at it in Excel to get a clearer picture of all the columns.

In [None]:
write_csv(crash.df, "/home/mathuser/R/x86_64-pc-linux-gnu-library/crash.csv") # Save the crash.df dataframe.

Because the original raw dataset was so big, loading it could take a while and sometimes crashed the program. The filtered dataframe saved above was therefore often loaded instead while the code was being developed.

In [4]:
crash.df <- read.csv("crash.csv") # Read the dataframe into R
#crash.df <- crash.df %>% rename("X" = "ï..X") # fix the weird symbol in the column name
head(crash.df)

,X,Y,OBJECTID,advisorySpeed,areaUnitID,bicycle,bridge,bus,carStationWagon,cliffBank,...,train,tree,truck,unknownVehicleType,urban,vanOrUtility,vehicle,waterRiver,weatherA,weatherB
,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>
1,1565751,5181488,123,,589700,0,,0,2,,...,,,0,0,Urban,0,,,Fine,Null
2,1566976,5180563,158,,590701,1,,0,1,,...,,,0,0,Urban,0,,,Light rain,Null
3,1571770,5179923,163,,593501,0,,0,2,,...,,,0,0,Urban,0,,,Heavy rain,Null
4,1568150,5183313,257,,592100,1,,0,1,,...,,,0,0,Urban,0,,,Light rain,Null
5,1570162,5178324,276,,594600,0,,0,0,,...,,,1,0,Urban,0,,,Fine,Null
6,1568514,5178239,305,,595000,0,0.0,0,1,0.0,...,0.0,0.0,1,0,Open,0,0.0,0.0,Fine,Null
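
The habit of loading the saved, filtered file instead of the raw one can be wrapped in a small helper. This is only a sketch: `load_or_build` is a hypothetical function name, and the toy `slow_build` function stands in for the expensive read-and-filter step on the real CAS data.

```r
# Hypothetical helper: return the cached filtered data if it exists,
# otherwise build it with `build` and cache it for next time.
load_or_build <- function(cache_path, build) {
    if (file.exists(cache_path)) {
        return(read.csv(cache_path))
    }
    df <- build()
    write.csv(df, cache_path, row.names = FALSE)
    df
}

# Toy stand-in for the expensive read-and-filter step (not the real CAS data).
cache <- tempfile(fileext = ".csv")
slow_build <- function() {
    data.frame(crashYear = c(2016L, 2019L), region = "Canterbury Region")
}

first  <- load_or_build(cache, slow_build)  # builds the data and writes the cache
second <- load_or_build(cache, slow_build)  # reads the cached file instead
```

With this pattern the slow step only runs when the cache file is missing, so deleting the cache forces a rebuild from the raw data.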


To help with developing the code, all the columns that contained location information were selected into the dataframe below. This allowed testing with the different geographic keys that we were thinking of using, i.e. meshblock, area unit, and the X and Y coordinates.

In [5]:
crash_location.df <- crash.df %>% select(X, Y, areaUnitID, crashLocation1, crashYear, meshblockId, tlaName) # Select only the location columns.
head(crash_location.df)
dim(crash_location.df)

,X,Y,areaUnitID,crashLocation1,crashYear,meshblockId,tlaName
,<dbl>,<dbl>,<int>,<chr>,<int>,<int>,<chr>
1,1565751,5181488,589700,MAIDSTONE ROAD,2019,2536900,Christchurch City
2,1566976,5180563,590701,HINAU ST,2016,2470000,Christchurch City
3,1571770,5179923,593501,FITZGERALD AVENUE,2016,2608800,Christchurch City
4,1568150,5183313,592100,BLIGHS ROAD,2019,2670800,Christchurch City
5,1570162,5178324,594600,SH 76,2016,2617204,Christchurch City
6,1568514,5178239,595000,076-0003/02.19-I,2018,2642500,Christchurch City


The geographic key that we finally decided on was area unit, so a dataset containing the area unit ID and name was loaded in. All the other columns were removed to give a dataframe that could be joined onto the crash dataframe.

In [6]:
initial_area_unit.df <- read.csv("area-unit-2017-generalised-version.csv") # Read the area unit dataframe into R

In [7]:
area_unit.df <- initial_area_unit.df %>% select(AU2017, AU2017_NAME) %>% rename(areaUnitID = AU2017) # Select only the wanted columns and rename the ID column for joining.
head(area_unit.df)

,areaUnitID,AU2017_NAME
,<int>,<chr>
1,500100,Awanui
2,500202,Karikari Peninsula-Maungataniwha
3,500203,Taipa Bay-Mangonui
4,500204,Herekino
5,500206,North Cape
6,500207,Houhora


The area unit dataframe was joined onto the crash dataframe so that each crash had the name of the area unit it occurred in. Because the crash dataframe is on the right-hand side, the right join keeps every crash, whether or not its area unit ID appears in the list; area units with no crashes are dropped rather than kept, so this join cannot show whether any area units had no crashes. The dimensions of the resulting dataframe were checked to see whether any crashes were lost. As it happened, none were, and there were no crashes whose area unit was missing from the list.

In [8]:
crash_area.df <- area_unit.df %>% right_join(crash.df, by = "areaUnitID") #Join the crashes and the area unit dataframes.
head(crash_area.df)
dim(crash_area.df)

,areaUnitID,AU2017_NAME,X,Y,OBJECTID,advisorySpeed,bicycle,bridge,bus,carStationWagon,...,train,tree,truck,unknownVehicleType,urban,vanOrUtility,vehicle,waterRiver,weatherA,weatherB
,<int>,<chr>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<int>,<int>,...,<int>,<int>,<int>,<int>,<chr>,<int>,<int>,<int>,<chr>,<chr>
1,585506,Amuri,1596680,5263056,1160,,0,0.0,0,0,...,0.0,0.0,0,0,Open,0,0.0,0.0,Fine,Null
2,585506,Amuri,1581726,5275489,3452,,0,,0,0,...,,,1,0,Open,0,,,Fine,Null
3,585506,Amuri,1589119,5270593,5523,,0,0.0,0,0,...,0.0,0.0,0,0,Open,1,0.0,0.0,Fine,Null
4,585506,Amuri,1550535,5305477,14109,,0,0.0,0,1,...,0.0,0.0,0,0,Open,0,0.0,0.0,Light rain,Null
5,585506,Amuri,1605633,5281957,17902,,0,0.0,0,1,...,0.0,0.0,0,0,Open,0,0.0,0.0,Fine,Null
6,585506,Amuri,1583273,5287666,18864,,0,,0,0,...,,,0,0,Open,1,,,Fine,Null
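
Comparing dimensions confirms that no crash rows were lost or duplicated, but it does not directly list unmatched keys. One way to make both checks explicit is dplyr's `anti_join`, sketched here on made-up toy tables (`area_unit_demo` and `crash_demo` are illustrative stand-ins, not the real data).

```r
library(dplyr)

# Toy stand-ins for area_unit.df and crash.df (made-up rows).
area_unit_demo <- data.frame(areaUnitID = c(1L, 2L, 3L),
                             AU2017_NAME = c("A", "B", "C"))
crash_demo <- data.frame(areaUnitID = c(1L, 1L, 2L, 9L),
                         OBJECTID = 101:104)

# Right join: every crash is kept; area unit 3 (no crashes) is dropped.
joined <- area_unit_demo %>% right_join(crash_demo, by = "areaUnitID")

# Crashes whose area unit ID is missing from the lookup table.
unmatched_crashes <- crash_demo %>% anti_join(area_unit_demo, by = "areaUnitID")

# Area units with no crashes at all.
empty_units <- area_unit_demo %>% anti_join(crash_demo, by = "areaUnitID")
```

If `unmatched_crashes` has zero rows, every crash found a name; `empty_units` lists the area units that a full join would have kept with empty crash columns.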


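All the NA values in the above dataframe were replaced with 0 by indexing with is.na(). This was to make further analysis easier. Assigning 0 this way turns the integer columns that contained NAs into doubles, as the column types below show.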

In [9]:
crash_area.df[is.na(crash_area.df)] <- 0  # Set all NA values to 0
head(crash_area.df)

,areaUnitID,AU2017_NAME,X,Y,OBJECTID,advisorySpeed,bicycle,bridge,bus,carStationWagon,...,train,tree,truck,unknownVehicleType,urban,vanOrUtility,vehicle,waterRiver,weatherA,weatherB
,<int>,<chr>,<dbl>,<dbl>,<int>,<dbl>,<int>,<dbl>,<int>,<int>,...,<dbl>,<dbl>,<int>,<int>,<chr>,<int>,<dbl>,<dbl>,<chr>,<chr>
1,585506,Amuri,1596680,5263056,1160,0,0,0,0,0,...,0,0,0,0,Open,0,0,0,Fine,Null
2,585506,Amuri,1581726,5275489,3452,0,0,0,0,0,...,0,0,1,0,Open,0,0,0,Fine,Null
3,585506,Amuri,1589119,5270593,5523,0,0,0,0,0,...,0,0,0,0,Open,1,0,0,Fine,Null
4,585506,Amuri,1550535,5305477,14109,0,0,0,0,1,...,0,0,0,0,Open,0,0,0,Light rain,Null
5,585506,Amuri,1605633,5281957,17902,0,0,0,0,1,...,0,0,0,0,Open,0,0,0,Fine,Null
6,585506,Amuri,1583273,5287666,18864,0,0,0,0,0,...,0,0,0,0,Open,1,0,0,Fine,Null
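
Replacing every NA with the double `0` changes integer columns to `dbl` and would also write `"0"` into any character column that contained NAs. A type-preserving variant, sketched below on made-up data, restricts the replacement to integer columns using `across()` from dplyr and `replace_na()` from tidyr. This is an alternative under those assumptions, not the approach used in the notebook.

```r
library(dplyr)
library(tidyr)

# Toy data with NAs in integer and character columns (made-up values).
demo.df <- data.frame(bridge = c(0L, NA, 1L),
                      advisorySpeed = c(NA, 30L, NA),
                      weatherA = c("Fine", NA, "Light rain"))

# Replace NA with an integer zero in integer columns only;
# character columns keep their NAs and integer columns stay integer.
filled.df <- demo.df %>%
    mutate(across(where(is.integer), ~ replace_na(.x, 0L)))

sapply(filled.df, class)
```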


Then a dataframe was created that summarises the important crash statistics for each area unit, including the total number of crashes. The old table was grouped by area unit and the summarise function was used to do the aggregation. The type.convert function was then used to give every column an appropriate type, as beforehand the whole-number columns were a mixture of int and dbl types. This was used as the final crash data table.

In [19]:
crash_summary.df <- crash_area.df %>% 
    add_column(counter = 1) %>%  # Add a counter column so the total crashes in each area unit can be summed.
    group_by(AU2017_NAME) %>%    # Group by area unit.
    summarise(                   # Create one column per statistic of interest.
        Total_crashes = sum(counter), Fatalities = sum(fatalCount),
        Serious_injuries = sum(seriousInjuryCount), Minor_injuries = sum(minorInjuryCount),
        Bicycles_involved = sum(bicycle), Motorbikes_involved = sum(moped) + sum(motorcycle),
        Pedestrians_involved = sum(pedestrian), Median_speed_limit = median(speedLimit)
    )
crash_summary.df <- crash_summary.df %>% rename(Area_unit = AU2017_NAME) # Rename the area unit column to be more explicit.
crash_summary.df <- type.convert(crash_summary.df, as.is = TRUE) # Convert the columns to their appropriate type.
head(crash_summary.df)

Area_unit,Total_crashes,Fatalities,Serious_injuries,Minor_injuries,Bicycles_involved,Motorbikes_involved,Pedestrians_involved,Median_speed_limit
<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
Addington,127,1,9,27,16,3,10,50
Aidanfield,41,1,4,5,4,1,2,50
Akaroa,22,0,2,1,0,0,1,50
Akaroa Harbour,99,3,15,28,0,18,0,100
Allenton East,26,0,1,10,1,0,2,50
Allenton West,21,0,3,10,4,0,1,50
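
The counter column works, but dplyr's `n()` gives the number of rows in each group directly and returns an integer. The sketch below shows that variant on a small made-up table that borrows a few of the column names used above; it is an alternative formulation, not the code that produced the table.

```r
library(dplyr)

# Toy crashes (made-up values) with a few of the column names used above.
demo.df <- data.frame(AU2017_NAME = c("A", "A", "B"),
                      fatalCount = c(0L, 1L, 0L),
                      moped = c(0L, 1L, 0L),
                      motorcycle = c(1L, 0L, 0L),
                      speedLimit = c(50L, 100L, 50L))

demo_summary.df <- demo.df %>%
    group_by(AU2017_NAME) %>%
    summarise(Total_crashes = n(),                            # rows per group, no counter column needed
              Fatalities = sum(fatalCount),
              Motorbikes_involved = sum(moped + motorcycle),  # same total as sum(moped) + sum(motorcycle)
              Median_speed_limit = median(speedLimit))
```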


In [13]:
write_csv(crash_summary.df, "/home/mathuser/R/x86_64-pc-linux-gnu-library/crash_final_final.csv")

A dataframe containing only the data going into the main table, i.e. the total number of crashes in each area unit, was also formed.

In [None]:
crash_total.df <- crash_area.df %>% 
    add_column(counter = 1) %>%
    group_by(AU2017_NAME) %>% 
        summarise(Total_Crashes = sum(counter))  # count the number of crashes in each area unit.

In [None]:
write_csv(crash_total.df, "/home/mathuser/R/x86_64-pc-linux-gnu-library/crash_final.csv")

In [None]:
areaXY <- crash_area.df %>% group_by(AU2017_NAME) %>%  # Group by area unit.
    summarise(meanX = mean(X), meanY = mean(Y))        # Mean X and Y coordinates of the crashes in each area unit.

In [None]:
write_csv(areaXY, "/home/mathuser/R/x86_64-pc-linux-gnu-library/areaXY.csv")