# Linear SVC to predict injury in an accident

The data set was downloaded from the National Highway Traffic Safety Administration (NHTSA) website. This study uses motor vehicle fatality information from 2015 through 2019 from the Fatality Analysis Reporting System (FARS). The data set has 248,107 records. The code below loads and cleans the FARS data to predict accident injury.

------------------------------------------------------------------------------------------------------------------------------

##### Create a Spark session and load the Fatality Analysis Reporting System (FARS) data set

In [0]:
from pyspark.sql import SparkSession

In [0]:
spark = SparkSession.builder.appName('IMMLSVC').getOrCreate()

In [0]:
file_location = "/FileStore/tables/FARS_BINARY.csv"
file_type = "csv"

# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

------------------------------------------------------------------------------------------------------------------------------

##### Data pre-processing

In [0]:
# Import the required libraries

from pyspark.sql.functions import datediff,date_format,to_date,to_timestamp

In [0]:
import pyspark.sql.functions as f

In [0]:
# Select the dependent variable (INJ_SEVNAME) and the independent variables identified as most useful for prediction.
# Note: IMPACT1NAME is listed twice, which yields a duplicate column in the selection.

data = df.select(['STATENAME','MONTHNAME','HOUR','RUR_URBNAME','FUNC_SYSNAME',
                  'MOD_YEARNAME','ROLLOVERNAME','IMPACT1NAME','IMPACT1NAME','FIRE_EXPNAME','AGE','SEXNAME','INJ_SEVNAME',
                  'REST_USENAME','REST_MISNAME','AIR_BAGNAME','EJECTIONNAME','ALC_RESNAME','Year','Overlimit'])

In [0]:
data=data.dropna()

In [0]:
# Create a 70-30 train test split

train_data,test_data=data.randomSplit([0.7,0.3])
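
`randomSplit` assigns rows to the splits at random, so the 70-30 proportions are approximate, and without a seed the split changes from run to run. The sketch below imitates that behaviour in plain Python on toy rows. The helper `random_split` and the seed value are illustrative assumptions, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split using one uniform draw per row (hypothetical helper)."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share of [0, 1)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, upper in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end
            if draw < upper or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```

In PySpark, `randomSplit` accepts an optional `seed` argument for the same purpose, for example `data.randomSplit([0.7, 0.3], seed=42)`.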

------------------------------------------------------------------------------------------------------------------------------

### Building the Linear SVC model

In [0]:
# Import the required libraries

from pyspark.ml.classification import LinearSVC
from pyspark.ml.feature import VectorAssembler,StringIndexer,StandardScaler
from pyspark.ml import Pipeline

In [0]:
# Use StringIndexer to convert the categorical columns to numeric indices

STATENAME_indexer = StringIndexer(inputCol='STATENAME',outputCol='STATENAME_index',handleInvalid='keep')
MONTHNAME_indexer = StringIndexer(inputCol='MONTHNAME',outputCol='MONTHNAME_index',handleInvalid='keep')
HOUR_indexer = StringIndexer(inputCol='HOUR',outputCol='HOUR_index',handleInvalid='keep')
RUR_URBNAME_indexer = StringIndexer(inputCol='RUR_URBNAME',outputCol='RUR_URBNAME_index',handleInvalid='keep')
FUNC_SYSNAME_indexer = StringIndexer(inputCol='FUNC_SYSNAME',outputCol='FUNC_SYSNAME_index',handleInvalid='keep')
MOD_YEARNAME_indexer = StringIndexer(inputCol='MOD_YEARNAME',outputCol='MOD_YEARNAME_index',handleInvalid='keep')
ROLLOVERNAME_indexer = StringIndexer(inputCol='ROLLOVERNAME',outputCol='ROLLOVERNAME_index',handleInvalid='keep')
IMPACT1NAME_indexer = StringIndexer(inputCol='IMPACT1NAME',outputCol='IMPACT1NAME_index',handleInvalid='keep')
FIRE_EXPNAME_indexer = StringIndexer(inputCol='FIRE_EXPNAME',outputCol='FIRE_EXPNAME_index',handleInvalid='keep')
SEXNAME_indexer = StringIndexer(inputCol='SEXNAME',outputCol='SEXNAME_index',handleInvalid='keep')
REST_USENAME_indexer = StringIndexer(inputCol='REST_USENAME',outputCol='REST_USENAME_index',handleInvalid='keep')
REST_MISNAME_indexer = StringIndexer(inputCol='REST_MISNAME',outputCol='REST_MISNAME_index',handleInvalid='keep')
AIR_BAGNAME_indexer = StringIndexer(inputCol='AIR_BAGNAME',outputCol='AIR_BAGNAME_index',handleInvalid='keep')
EJECTIONNAME_indexer = StringIndexer(inputCol='EJECTIONNAME',outputCol='EJECTIONNAME_index',handleInvalid='keep')
ALC_RESNAME_indexer = StringIndexer(inputCol='ALC_RESNAME',outputCol='ALC_RESNAME_index',handleInvalid='keep')
Year_indexer = StringIndexer(inputCol='Year',outputCol='Year_index',handleInvalid='keep')
Overlimit_indexer = StringIndexer(inputCol='Overlimit',outputCol='Overlimit_index',handleInvalid='keep')
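
By default `StringIndexer` orders labels by descending frequency, so the most common category receives index 0.0. With `handleInvalid='keep'`, a label that was not seen during fitting goes into one extra bucket whose index equals the number of fitted labels. A minimal plain-Python sketch of that mapping on made-up values follows. The helper names are hypothetical, and ties between equally frequent labels are simply broken alphabetically here.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (hypothetical helper)."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform_keep(mapping, values):
    """Look up indices; unseen labels share one extra index, as with handleInvalid='keep'."""
    unseen_index = float(len(mapping))
    return [mapping.get(v, unseen_index) for v in values]

# Toy training column, not taken from the FARS data
train_values = ["Urban", "Rural", "Urban", "Urban", "Rural", "Unknown"]
mapping = fit_string_index(train_values)
print(mapping)

# "Not Reported" was never seen during fitting, so it falls into the extra bucket
print(transform_keep(mapping, ["Rural", "Urban", "Not Reported"]))
```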



In [0]:
# Vector assembler is used to create a vector of input features

assembler = VectorAssembler(inputCols= ['STATENAME_index','MONTHNAME_index','HOUR_index','RUR_URBNAME_index','FUNC_SYSNAME_index','MOD_YEARNAME_index','ROLLOVERNAME_index','IMPACT1NAME_index','FIRE_EXPNAME_index','SEXNAME_index','REST_USENAME_index','REST_MISNAME_index','AIR_BAGNAME_index','EJECTIONNAME_index','ALC_RESNAME_index','Year_index','Overlimit_index'],
                            outputCol="unscaled_features")

In [0]:
# StandardScaler rescales the features to a common scale so that the linear SVC trains well

scaler = StandardScaler(inputCol="unscaled_features",outputCol="features")
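
With its default settings (`withStd=True`, `withMean=False`), `StandardScaler` divides each feature by its sample standard deviation and does not subtract the mean, so a zero entry stays zero after scaling. The sketch below shows this for a single toy column in plain Python; `scale_to_unit_std` is a hypothetical helper, not the Spark implementation.

```python
import statistics

def scale_to_unit_std(column):
    """Divide a column by its sample standard deviation without centering."""
    std = statistics.stdev(column)  # sample standard deviation (n - 1 denominator)
    if std == 0.0:
        # A constant column carries no information; return zeros for it
        return [0.0 for _ in column]
    return [x / std for x in column]

# Toy index values for one feature
column = [0.0, 1.0, 0.0, 3.0, 1.0]
scaled = scale_to_unit_std(column)
print(scaled)
```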

In [0]:
# Create an object for the Linear SVC model

svc_model = LinearSVC(labelCol='INJ_SEVNAME')

In [0]:
# A Pipeline passes the data through the indexers, assembler, scaler, and model in sequence. It also ensures that the test data
# is pre-processed in the same way as the training data.

pipe = Pipeline(stages= [STATENAME_indexer,MONTHNAME_indexer,HOUR_indexer,RUR_URBNAME_indexer,FUNC_SYSNAME_indexer,MOD_YEARNAME_indexer,ROLLOVERNAME_indexer,IMPACT1NAME_indexer,FIRE_EXPNAME_indexer,SEXNAME_indexer,REST_USENAME_indexer,REST_MISNAME_indexer,AIR_BAGNAME_indexer,EJECTIONNAME_indexer,ALC_RESNAME_indexer,Year_indexer,Overlimit_indexer,assembler,scaler,svc_model])

In [0]:
# Training the model took around 30 minutes

fit_model=pipe.fit(train_data)

In [0]:
# Store the results in a dataframe

results = fit_model.transform(test_data)
display(results)

STATENAME,MONTHNAME,HOUR,RUR_URBNAME,FUNC_SYSNAME,MOD_YEARNAME,ROLLOVERNAME,IMPACT1NAME,IMPACT1NAME.1,FIRE_EXPNAME,AGE,SEXNAME,INJ_SEVNAME,REST_USENAME,REST_MISNAME,AIR_BAGNAME,EJECTIONNAME,ALC_RESNAME,Year,Overlimit,STATENAME_index,MONTHNAME_index,HOUR_index,RUR_URBNAME_index,FUNC_SYSNAME_index,MOD_YEARNAME_index,ROLLOVERNAME_index,IMPACT1NAME_index,FIRE_EXPNAME_index,SEXNAME_index,REST_USENAME_index,REST_MISNAME_index,AIR_BAGNAME_index,EJECTIONNAME_index,ALC_RESNAME_index,Year_index,Overlimit_index,unscaled_features,features,rawPrediction,prediction
Alabama,April,3,Rural,Principal Arterial - Other,2010,No Rollover,12 Clock Point,12 Clock Point,No or Not Reported,23,Female,1,Shoulder and Lap Belt Used,NO MIS USE,Deployed,Not Ejected,0.086 % BAC,2019,Overlimit,12.0,8.0,22.0,1.0,0.0,15.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,144.0,3.0,2.0,"Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 3, 5, 9, 12, 14, 15, 16), values -> List(12.0, 8.0, 22.0, 1.0, 15.0, 1.0, 1.0, 144.0, 3.0, 2.0))","Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 3, 5, 9, 12, 14, 15, 16), values -> List(1.0329185598836175, 2.354995200000869, 3.2643648038551847, 2.007305365321481, 1.8857767671261807, 2.2082663989739832, 1.4851399563404872, 2.745873481695224, 2.1300797126344237, 2.8366820474761036))","Map(vectorType -> dense, length -> 2, values -> List(-1.4817577749592492, 1.4817577749592492))",1.0
Alabama,April,3,Urban,Principal Arterial - Other,2010,No Rollover,12 Clock Point,12 Clock Point,No or Not Reported,52,Female,1,None Used/Not Applicable,None Used/Not Applicable,Deployed,Not Ejected,0.000 % BAC,2019,Under,12.0,8.0,22.0,0.0,0.0,15.0,0.0,0.0,0.0,1.0,3.0,1.0,1.0,0.0,1.0,3.0,1.0,"Map(vectorType -> dense, length -> 17, values -> List(12.0, 8.0, 22.0, 0.0, 0.0, 15.0, 0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 1.0, 0.0, 1.0, 3.0, 1.0))","Map(vectorType -> dense, length -> 17, values -> List(1.0329185598836175, 2.354995200000869, 3.2643648038551847, 0.0, 0.0, 1.8857767671261807, 0.0, 0.0, 0.0, 2.2082663989739832, 1.1289612247417684, 3.618297583373824, 1.4851399563404872, 0.0, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(-1.1817718614733512, 1.1817718614733512))",1.0
Alabama,April,4,Urban,Minor Collector,2006,Rollover,12 Clock Point,12 Clock Point,No or Not Reported,28,Male,1,None Used/Not Applicable,None Used/Not Applicable,Deployed,Ejected,0.000 % BAC,2019,Under,12.0,8.0,23.0,0.0,6.0,1.0,1.0,0.0,0.0,0.0,3.0,1.0,1.0,2.0,1.0,3.0,1.0,"Map(vectorType -> dense, length -> 17, values -> List(12.0, 8.0, 23.0, 0.0, 6.0, 1.0, 1.0, 0.0, 0.0, 0.0, 3.0, 1.0, 1.0, 2.0, 1.0, 3.0, 1.0))","Map(vectorType -> dense, length -> 17, values -> List(1.0329185598836175, 2.354995200000869, 3.4127450222122384, 0.0, 3.462480986772321, 0.12571845114174537, 2.7611379748114624, 0.0, 0.0, 0.0, 1.1289612247417684, 3.618297583373824, 1.4851399563404872, 3.2780647308110185, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(-7.196879786641762, 7.196879786641762))",1.0
Alabama,April,5,Rural,Minor Collector,1999,No Rollover,5 Clock Point,5 Clock Point,No or Not Reported,29,Male,1,None Used/Not Applicable,None Used/Not Applicable,Not Deployed,Not Ejected,0.000 % BAC,2019,Under,12.0,8.0,18.0,1.0,6.0,17.0,0.0,16.0,0.0,0.0,3.0,1.0,0.0,0.0,1.0,3.0,1.0,"Map(vectorType -> dense, length -> 17, values -> List(12.0, 8.0, 18.0, 1.0, 6.0, 17.0, 0.0, 16.0, 0.0, 0.0, 3.0, 1.0, 0.0, 0.0, 1.0, 3.0, 1.0))","Map(vectorType -> dense, length -> 17, values -> List(1.0329185598836175, 2.354995200000869, 2.670843930426969, 2.007305365321481, 3.462480986772321, 2.1372136694096713, 0.0, 3.338617525459975, 0.0, 0.0, 1.1289612247417684, 3.618297583373824, 0.0, 0.0, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(0.8182271260569383, -0.8182271260569383))",0.0
Alabama,April,7,Rural,Minor Arterial,2007,No Rollover,1 Clock Point,1 Clock Point,No or Not Reported,41,Female,1,Shoulder and Lap Belt Used,NO MIS USE,Deployed,Not Ejected,Test Not Given,2019,unknown,12.0,8.0,14.0,1.0,1.0,0.0,0.0,5.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,"Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 3, 4, 7, 9, 12, 15), values -> List(12.0, 8.0, 14.0, 1.0, 1.0, 5.0, 1.0, 1.0, 3.0))","Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 3, 4, 7, 9, 12, 15), values -> List(1.0329185598836175, 2.354995200000869, 2.077323056998754, 2.007305365321481, 0.5770801644620535, 1.0433179767062422, 2.2082663989739832, 1.4851399563404872, 2.1300797126344237))","Map(vectorType -> dense, length -> 2, values -> List(-0.9999970477091751, 0.9999970477091751))",1.0
Alabama,April,7,Urban,Principal Arterial - Other,2006,Rollover,12 Clock Point,12 Clock Point,No or Not Reported,45,Male,1,None Used/Not Applicable,None Used/Not Applicable,Not Deployed,Ejected,Test Not Given,2019,unknown,12.0,8.0,14.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,3.0,1.0,0.0,2.0,0.0,3.0,0.0,"Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 5, 6, 10, 11, 13, 15), values -> List(12.0, 8.0, 14.0, 1.0, 1.0, 3.0, 1.0, 2.0, 3.0))","Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 5, 6, 10, 11, 13, 15), values -> List(1.0329185598836175, 2.354995200000869, 2.077323056998754, 0.12571845114174537, 2.7611379748114624, 1.1289612247417684, 3.618297583373824, 3.2780647308110185, 2.1300797126344237))","Map(vectorType -> dense, length -> 2, values -> List(-5.196882214778258, 5.196882214778258))",1.0
Alabama,April,10,Urban,Major Collector,2015,No Rollover,4 Clock Point,4 Clock Point,No or Not Reported,71,Female,0,Shoulder and Lap Belt Used,NO MIS USE,Not Deployed,Not Ejected,0.000 % BAC,2019,Under,12.0,8.0,15.0,0.0,3.0,4.0,0.0,18.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,1.0,"Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 4, 5, 7, 9, 14, 15, 16), values -> List(12.0, 8.0, 15.0, 3.0, 4.0, 18.0, 1.0, 1.0, 3.0, 1.0))","Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 4, 5, 7, 9, 14, 15, 16), values -> List(1.0329185598836175, 2.354995200000869, 2.2257032753558077, 1.7312404933861605, 0.5028738045669815, 3.7559447161424715, 2.2082663989739832, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(1.0000008534821367, -1.0000008534821367))",0.0
Alabama,April,12,Rural,Interstate,1999,No Rollover,12 Clock Point,12 Clock Point,Yes,47,Male,1,Shoulder and Lap Belt Used,NO MIS USE,Unknown,Not Ejected,0.000 % BAC,2019,Under,12.0,8.0,9.0,1.0,2.0,17.0,0.0,0.0,1.0,0.0,0.0,0.0,2.0,0.0,1.0,3.0,1.0,"Map(vectorType -> dense, length -> 17, values -> List(12.0, 8.0, 9.0, 1.0, 2.0, 17.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 2.0, 0.0, 1.0, 3.0, 1.0))","Map(vectorType -> dense, length -> 17, values -> List(1.0329185598836175, 2.354995200000869, 1.3354219652134844, 2.007305365321481, 1.154160328924107, 2.1372136694096713, 0.0, 0.0, 5.569712153372797, 0.0, 0.0, 0.0, 2.9702799126809745, 0.0, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(-4.862346415514309, 4.862346415514309))",1.0
Alabama,April,12,Urban,Interstate,2012,Rollover,12 Clock Point,12 Clock Point,No or Not Reported,57,Male,1,Shoulder and Lap Belt Used,NO MIS USE,Deployed,Not Ejected,Test Not Given,2019,unknown,12.0,8.0,9.0,0.0,2.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,3.0,0.0,"Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 4, 5, 6, 12, 15), values -> List(12.0, 8.0, 9.0, 2.0, 10.0, 1.0, 1.0, 3.0))","Map(vectorType -> sparse, length -> 17, indices -> List(0, 1, 2, 4, 5, 6, 12, 15), values -> List(1.0329185598836175, 2.354995200000869, 1.3354219652134844, 1.154160328924107, 1.2571845114174538, 2.7611379748114624, 1.4851399563404872, 2.1300797126344237))","Map(vectorType -> dense, length -> 2, values -> List(-2.9999948327386847, 2.9999948327386847))",1.0
Alabama,April,13,Rural,Interstate,2019,Rollover,12 Clock Point,12 Clock Point,Yes,26,Male,1,Reported as Unknown,None Used/Not Applicable,Unknown,Not Ejected,0.000 % BAC,2019,Under,12.0,8.0,8.0,1.0,2.0,25.0,1.0,0.0,1.0,0.0,4.0,1.0,2.0,0.0,1.0,3.0,1.0,"Map(vectorType -> dense, length -> 17, values -> List(12.0, 8.0, 8.0, 1.0, 2.0, 25.0, 1.0, 0.0, 1.0, 0.0, 4.0, 1.0, 2.0, 0.0, 1.0, 3.0, 1.0))","Map(vectorType -> dense, length -> 17, values -> List(1.0329185598836175, 2.354995200000869, 1.1870417468564307, 2.007305365321481, 1.154160328924107, 3.1429612785436345, 2.7611379748114624, 0.0, 5.569712153372797, 0.0, 1.5052816329890246, 3.618297583373824, 2.9702799126809745, 0.0, 0.019068565845105724, 2.1300797126344237, 1.4183410237380518))","Map(vectorType -> dense, length -> 2, values -> List(-7.044122066395374, 7.044122066395374))",1.0
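
In the output above, `rawPrediction` holds two values of equal magnitude and opposite sign. The second value is the signed margin for class 1, and `prediction` is 1.0 when that margin exceeds the threshold, which is 0.0 by default for `LinearSVC`. The short sketch below reproduces the `prediction` column from margins copied (rounded) from four of the rows above; `predict_from_raw` is a hypothetical helper.

```python
def predict_from_raw(raw_prediction, threshold=0.0):
    """Return 1.0 if the class-1 margin exceeds the threshold, else 0.0."""
    margin = raw_prediction[1]
    return 1.0 if margin > threshold else 0.0

# (class-0 score, class-1 score) pairs, rounded from the displayed rows
raw_rows = [
    (-1.4818, 1.4818),
    (-7.1969, 7.1969),
    (0.8182, -0.8182),
    (1.0000, -1.0000),
]
predictions = [predict_from_raw(r) for r in raw_rows]
print(predictions)
```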


In [0]:
results.select(['INJ_SEVNAME','prediction']).show()

+-----------+----------+
|INJ_SEVNAME|prediction|
+-----------+----------+
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          1|       0.0|
|          1|       1.0|
|          1|       1.0|
|          0|       0.0|
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          1|       0.0|
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          1|       1.0|
|          0|       1.0|
|          1|       1.0|
|          1|       1.0|
+-----------+----------+
only showing top 20 rows



-------------------------------------------------------------------------------------------------------------------------------

### Evaluating the model

#####  1. Area under the ROC

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
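# Note: the evaluator is scored on the hard 0/1 'prediction' column rather than the continuous 'rawPrediction' margin
AUC_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='INJ_SEVNAME',metricName='areaUnderROC')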

In [0]:
AUC = AUC_evaluator.evaluate(results)

In [0]:
print("The area under the curve is {}".format(AUC))

The area under the curve is 0.8056726278966728


An area under the ROC curve of roughly 0.81 indicates that the model performs reasonably well at predicting whether an accident results in an injury.
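
The evaluator here receives the hard 0/1 `prediction` column as its score. The area under the ROC curve can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting one half. When there are only two distinct scores, this reduces to the mean of the true-positive rate and the true-negative rate. The sketch below checks that on toy labels; `pairwise_auc` is a hypothetical helper, not the evaluator's implementation.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: true labels and hard 0/1 predictions used as scores
labels      = [1, 1, 1, 1, 0, 0, 0, 1]
hard_scores = [1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]

tpr = sum(1 for y, s in zip(labels, hard_scores) if y == 1 and s == 1.0) / labels.count(1)
tnr = sum(1 for y, s in zip(labels, hard_scores) if y == 0 and s == 0.0) / labels.count(0)
auc = pairwise_auc(labels, hard_scores)
print(auc, (tpr + tnr) / 2)
```

Scoring on the continuous margin instead (the evaluator's default `rawPredictionCol` is `'rawPrediction'`) would rank examples by confidence and would generally give a different value.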

------------------------------------------------------------------------------------------------------------------------------

#####  2. Area under the PR

In [0]:
PR_evaluator = BinaryClassificationEvaluator(rawPredictionCol='prediction',labelCol='INJ_SEVNAME',metricName='areaUnderPR')

In [0]:
PR = PR_evaluator.evaluate(results)

In [0]:
print("The area under the PR curve is {}".format(PR))

The area under the PR curve is 0.8877342432801796


------------------------------------------------------------------------------------------------------------------------------

#####  3. Accuracy

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

In [0]:
ACC_evaluator = MulticlassClassificationEvaluator(
    labelCol="INJ_SEVNAME", predictionCol="prediction", metricName="accuracy")

In [0]:
accuracy = ACC_evaluator.evaluate(results)

In [0]:
print("The accuracy of the model is {}".format(accuracy))

The accuracy of the model is 0.8270744111349037


------------------------------------------------------------------------------------------------------------------------------

#####  4. Confusion Matrix

In [0]:
from sklearn.metrics import confusion_matrix

In [0]:
y_true = results.select("INJ_SEVNAME")
y_true = y_true.toPandas()

y_pred = results.select("prediction")
y_pred = y_pred.toPandas()

cnf_matrix = confusion_matrix(y_true, y_pred)
print("Below is the confusion matrix: \n {}".format(cnf_matrix))

Below is the confusion matrix: 
 [[15792  5063]
 [ 7858 46007]]
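
Several of the metrics reported in this section can be recovered from the confusion matrix. In scikit-learn's `confusion_matrix`, rows are true labels and columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]` for labels 0 and 1. The sketch below recomputes the accuracy and shows that the mean of the true-positive and true-negative rates matches the reported area under the ROC, which is what one would expect when that evaluator is scored on hard predictions. The variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = true label, columns = prediction
cnf = [[15792, 5063],
       [7858, 46007]]

tn, fp = cnf[0]
fn, tp = cnf[1]
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
recall = tp / (tp + fn)        # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
precision = tp / (tp + fp)
mean_tpr_tnr = (recall + specificity) / 2

print("accuracy:", accuracy)
print("recall:", recall)
print("specificity:", specificity)
print("precision:", precision)
print("mean of TPR and TNR:", mean_tpr_tnr)
```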
