# INFOSYS 722 - BDAS ITERATION

# Predicting Crash Severity On New Zealand Roads

Ferdinand Djohar (adjo446)

## PREREQUISITES
Initialise findspark and start a Spark session.

In [16]:
import findspark
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('bdas').getOrCreate()

## DATA UNDERSTANDING & DATA PREPARATION

### 2.2 Data Description
The data set consists of 104,032 records in total, sourced from three CSV files (one file per year), with 38 attributes.

Documentation provided by NZTA tells us that most of the variables in the data set are categorical or numeric. Some variables are derived from others: for example, **URBAN** is derived from **SPD_LIM**, taking the value *'Urban'* where **SPD_LIM** is less than 80 and *'Open Road'* where **SPD_LIM** is greater than or equal to 80.
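
The derivation rule for **URBAN** can be written out as a small function. This is a minimal sketch in plain Python, not code from NZTA or from this notebook; `derive_urban` is a hypothetical helper name, and it assumes **SPD_LIM** is always a number (missing speed limits are not handled).

```python
def derive_urban(spd_lim):
    """Return the URBAN category implied by a SPD_LIM value."""
    # Rule as described in the text: below 80 is 'Urban',
    # 80 or above is 'Open Road'.
    return "Urban" if spd_lim < 80 else "Open Road"

# A few illustrative speed limits, including the boundary value 80.
examples = {limit: derive_urban(limit) for limit in (50, 70, 80, 100)}
print(examples)
```

On a Spark data frame the same rule could be expressed as a column expression with `pyspark.sql.functions.when(...).otherwise(...)`, which would allow the stored **URBAN** values to be checked against **SPD_LIM**.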

Please refer to *Appendix A* in the report for a more detailed list of the data set's attributes, extracted from the NZTA documentation.

In [19]:
#df = spark.read.option("header", "true").csv("Data/*.csv")
df = spark.read.csv("Data/*.csv", header = True, inferSchema = True)

#Print out the dimensions of the data frame
print(df.count()) #rows
print(len(df.columns)) #columns

104032
38


### 2.3 Data Exploration

Schema of the data frame

In [20]:
df.printSchema()

root
 |-- CRASH_YEAR: integer (nullable = true)
 |-- CRASH_SEV: string (nullable = true)
 |-- MULTI_VEH: string (nullable = true)
 |-- HOLIDAY: string (nullable = true)
 |-- LG_REGION_DESC: string (nullable = true)
 |-- EASTING: integer (nullable = true)
 |-- NORTHING: integer (nullable = true)
 |-- CRASH_LOCN1: string (nullable = true)
 |-- CRASH_LOCN2: string (nullable = true)
 |-- OUTDTD_LOCN_DESC: string (nullable = true)
 |-- CRASH_RP_RS: integer (nullable = true)
 |-- INTERSECTION: string (nullable = true)
 |-- JUNCTION_TYPE: string (nullable = true)
 |-- CR_RD_SIDE_RD: integer (nullable = true)
 |-- CRASH_DIRN_DESC: string (nullable = true)
 |-- CRASH_DIST: integer (nullable = true)
 |-- CRASH_RP_DIRN_DESC: string (nullable = true)
 |-- DIRN_ROLE1_DESC: string (nullable = true)
 |-- CRASH_RP_DISP: integer (nullable = true)
 |-- CRASH_SH_DESC: string (nullable = true)
 |-- CRASH_RP_SH: string (nullable = true)
 |-- CRASH_RP_NEWS_DESC: string (nullable = true)
 |-- INTSN_MIDBLOCK:

### 2.4 Data Quality Verification
The data quality of the data set seems to be fairly high. Only three of the 38 variables have missing data (see the result of the code chunk below), although one of them, **CRASH_DIRN_DESC**, is blank in roughly a third of the records.

Furthermore, based on the data profiling report, we noticed that some variables have a high proportion of *"Unknown"* and zero values. Such variables may be identified as the least important ones and therefore excluded in later stages of our work.

Apart from the issues mentioned above, we found nothing of concern regarding the quality of the data set.
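
One way to quantify the *"Unknown"* and zero values mentioned above is to compute, per variable, the share of entries that equal one of those placeholders. The sketch below is plain Python and is not the profiling tool used for the report; `placeholder_share` is a hypothetical helper, and the sample values are made up for illustration.

```python
def placeholder_share(values, placeholders=("Unknown", 0)):
    """Fraction of entries in `values` equal to a placeholder marker."""
    if not values:
        return 0.0
    hits = sum(1 for v in values if v in placeholders)
    return hits / len(values)

# Made-up columns, not taken from the crash data.
categorical_sample = ["Unknown", "Fine", "Unknown", "Rain"]
numeric_sample = [0, 3, 0, 0]

print(placeholder_share(categorical_sample))
print(placeholder_share(numeric_sample))
```

A variable whose share is close to 1 carries little information, which is why it would be likely to rank low in importance.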

In [64]:
import pandas as pd

from pyspark.sql.functions import isnan, when, count, col

#Use spark to calculate the number of missing values for each column and convert the result to pandas.
#Note: the result will contain a single row with 38 columns, small enough to be handled by pandas in memory
missing_values = pd.concat(
    [
        df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).toPandas().transpose(),
        df.select([count(when(col(c) == " ", c)).alias(c) for c in df.columns]).toPandas().transpose()
    ],
    axis = 1
)

missing_values.columns = ["null", "whitespace"]
missing_values["total"] = df.count()
missing_values["missing"] = missing_values["null"] + missing_values["whitespace"]
missing_values["percent"] = missing_values["missing"] / missing_values["total"] * 100

In [65]:
missing_values.loc[missing_values['percent'] > 0].sort_values("percent", ascending = False)

| | null | whitespace | total | missing | percent |
|---|---|---|---|---|---|
| CRASH_DIRN_DESC | 0 | 35316 | 104032 | 35316 | 33.947247 |
| ROAD_LANE | 0 | 54 | 104032 | 54 | 0.051907 |
| CR_RD_SIDE_RD | 1 | 0 | 104032 | 1 | 0.000961 |
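
Note that the whitespace check in the code above matches only a value that is exactly one space. The following is a minimal sketch of a broader per-value test, in plain Python rather than Spark. It assumes that empty strings and longer runs of whitespace should also count as missing, which is our assumption and not something stated in the NZTA documentation. `is_missing` and `percent_missing` are hypothetical helper names, and the sample values are made up.

```python
import math

def is_missing(value):
    """True for None, NaN, and empty or whitespace-only strings."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Assumption: a string made up only of whitespace is a blank entry.
    return isinstance(value, str) and value.strip() == ""

def percent_missing(values):
    """Percentage of entries in `values` that is_missing flags."""
    if not values:
        return 0.0
    return sum(is_missing(v) for v in values) / len(values) * 100

# Made-up sample, not taken from the crash data.
sample = ["North", " ", "", None, "South", "   ", float("nan"), "East"]
print(percent_missing(sample))
```

The percentage follows the same formula as the `percent` column: the number of missing entries divided by the total, times 100.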
