
# Get Data

This file 

    - Pulls the data from the '../../Big_Files' folder (not in the GitHub repository) for each year,
    - From each year pulls the Accident, Vehicle, and Person .csv files,
    - Drops features in the Vehicle and Person files that are repeats of features in the other files,
    - Drops features that are not useful, even for imputing missing data, and
    - Writes the results to one big file, '../../Big_Files/CRSS_Merged_Raw_Data.csv'.
    
In a change from our previous versions, we will keep the crashes involving pedestrians through the imputation stages, then drop them when building the models.  

# readme
## Directory Structure

We designed this code to fit in a GitHub repository, which limits individual files to 100 MB.  Many of our input files exceed this limit.  After preprocessing (selecting and binning features) we save the data as a .csv file of about 150 MB, so we can tweak later code without rerunning the preprocessing, and we save it again after imputing missing data.  To keep every file in the repository under 100 MB, we save these large files in a separate directory outside the repository.

- CRSS Data Files
    - We use seven years of data, 2016-2022. Once later years come out, they can be easily added. 
    - Each year's data is 100-200 MB
    - The files we use from each year are listed below.  The file names are uppercase through 2018 and lowercase from 2019 on, and the file sizes also jump starting in 2019.
        - ACCIDENT.CSV or accident.csv, now about 30 MB
        - VEHICLE.CSV or vehicle.csv, now about 180 MB
        - PERSON.CSV or person.csv, now about 150 MB
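
Because the file names change case between years, one option is to look the file up without regard to case rather than hard-coding both spellings. The following is a minimal sketch of that idea; `find_csv` is a hypothetical helper and is not part of this notebook.

```python
from pathlib import Path

def find_csv(folder, stem):
    """Return the path of `stem`.csv inside `folder`, ignoring letter case.

    Hypothetical helper: `folder` is one year's CRSS directory and
    `stem` is 'accident', 'vehicle', or 'person'.
    """
    target = f"{stem}.csv".lower()
    for path in Path(folder).iterdir():
        if path.name.lower() == target:
            return path
    raise FileNotFoundError(f"No {stem}.csv (in any case) found in {folder}")
```

With this, the same call works for a 2016 folder containing ACCIDENT.CSV and a 2022 folder containing accident.csv.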

- Big_Files
    - CRSS_Files
        - CRSS2016CSV (22 files, 160 MB)
        - CRSS2017CSV (22 files, 189 MB)
        - CRSS2018CSV (22 files, 169 MB)
        - CRSS2019CSV (23 files, 633 MB)
        - CRSS2020CSV (29 files, 719 MB)
        - CRSS2021CSV (29 files, 736 MB)
        - CRSS2022CSV (29 files, 738 MB)
        
        
    - *Intermediate .csv files*
- GitHub_Repository
    - Code_Files
        - Analyze_Proba
        - Confusion_Matrices
        - Images

# Setup

## Import Libraries

In [2]:
print ('Install Packages')

import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)

import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import json # We will use json ('JavaScript Object Notation') to write and read dictionaries to/from files
print ('JSON version:  {}'.format(json.__version__))

# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
import random
#np.random.seed(42) # NumPy
#random.seed(42) # Python
#tf.random.set_seed(42) # Tensorflow

import warnings
warnings.filterwarnings('ignore')

print ('Finished Installing Packages')
print ()

Install Packages
Python version: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
NumPy version: 1.24.2
Pandas version:  1.5.3
JSON version:  2.0.9
Finished Installing Packages



# Get Data and Preprocess

## Read CRSS Files
- We have the CRSS dataset in 
    - Big_Files/CRSS_2020_Update/
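- In one directory for each year,
    - CRSS2016CSV
    - CRSS2017CSV
    - CRSS2018CSV
    - CRSS2019CSV
    - CRSS2020CSV
    - CRSS2021CSV
    - CRSS2022CSV
- In each year, the CRSS dataset comes in three main files (uppercase names through 2018, lowercase from 2019 on),
    - ACCIDENT.CSV or accident.csv
    - VEHICLE.CSV or vehicle.csv
    - PERSON.CSV or person.csv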
- Collect those and merge into three files,
    - Accident_Raw.csv
    - Vehicle_Raw.csv
    - Person_Raw.csv
- and also three files with category names,
    - Accident_Raw_with_NAMES.csv
    - Vehicle_Raw_with_NAMES.csv
    - Person_Raw_with_NAMES.csv
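
The collect-and-merge step described above is the same for all three tables, so it can be sketched as one generic function. This is only an illustration under stated assumptions: `import_table` and its parameters are hypothetical names, and unlike the functions in this notebook it resets the row index while stacking (the index is not written to the output files, so this should not affect them).

```python
import pandas as pd

def import_table(paths, keep_names=False, **read_kwargs):
    """Read one CSV per year and stack them into a single dataframe.

    Hypothetical helper: `paths` is a list of CSV paths, one per year;
    `read_kwargs` is passed through to pd.read_csv (e.g. encoding='latin1').
    """
    frames = [pd.read_csv(p, index_col=None, **read_kwargs) for p in paths]
    df = pd.concat(frames, ignore_index=True)
    if not keep_names:
        # Same rule as the notebook: drop every column whose name contains 'NAME'.
        name_cols = [c for c in df.columns if 'NAME' in c]
        df = df.drop(columns=name_cols)
    return df
```

Collecting the yearly frames in a list and calling `pd.concat` once avoids repeatedly copying a growing dataframe. Note that the keyword arguments apply to every path, so years that need different read options would have to be read in separate calls.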


### accident.csv from CRSS

In [3]:
def Import_Data_Accident(NAMES):
    print ('Import_Data_Accident()')

    df = pd.DataFrame([])
#    for year in ['2018']:
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/ACCIDENT.CSV'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0

#    for year in ['2020']:
    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/accident.csv'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0
    
    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### vehicle.csv from CRSS

In [4]:
def Import_Data_Vehicle(NAMES):
    print ('Import_Data_Vehicle()')

    df = pd.DataFrame([])
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/VEHICLE.CSV'
        temp = pd.read_csv(filename, index_col=None, low_memory=False)
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0

    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/vehicle.csv'
        temp = pd.read_csv(filename, index_col=None, encoding='latin1', low_memory=False)
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0

    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### person.csv from CRSS

In [5]:
def Import_Data_Person(NAMES):
    print ('Import_Data_Person()')

    df = pd.DataFrame([])
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/PERSON.CSV'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0

    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/person.csv'
        temp = pd.read_csv(filename, index_col=None, encoding='latin1')
        print (year, len(temp))
        df = pd.concat([df, temp])  # DataFrame.append was removed in pandas 2.0

    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### Get Data
- Get_Data_from_Original() reads the original CRSS files from the CRSS directory and writes them, both without and with the NAME columns, to files in a folder outside this GitHub repo (because the files are too large for the repository).  It returns the dataframes that include the NAME columns.
- Get_Data_from_Temp_Files() reads the temp files without the NAME columns and returns the dataframes.  I created this option for running repeatedly while writing and debugging, because it's much faster.
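
The two functions described above amount to a manual cache: build once from the originals, then reread the saved copy. A minimal sketch of folding that choice into one call follows; `load_or_build` is a hypothetical helper, not something this notebook defines.

```python
import os
import pandas as pd

def load_or_build(cache_path, build):
    """Return the dataframe saved at `cache_path`, building it first if needed.

    Hypothetical helper: `build` is a zero-argument function that returns
    a dataframe, for example a wrapper around one of the import functions.
    """
    if os.path.exists(cache_path):
        # Fast path: reuse the copy written by an earlier run.
        return pd.read_csv(cache_path, low_memory=False)
    df = build()
    df.to_csv(cache_path, index=False)
    return df
```

The cached file has to be deleted by hand whenever the build step changes, which is the same trade-off as commenting one of the two calls in and out.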

In [6]:
def Get_Data_from_Original():
    print ('Get_Data_from_Original()')
    
    df_Accident = Import_Data_Accident(0)
    df_Vehicle = Import_Data_Vehicle(0)
    df_Person = Import_Data_Person(0)
    
    df_Accident.to_csv('../../Big_Files/Accident_Raw.csv', index=False)
    df_Vehicle.to_csv('../../Big_Files/Vehicle_Raw.csv', index=False)
    df_Person.to_csv('../../Big_Files/Person_Raw.csv', index=False)
    

    df_Accident = Import_Data_Accident(1)
    df_Vehicle = Import_Data_Vehicle(1)
    df_Person = Import_Data_Person(1)
    
    df_Accident.to_csv('../../Big_Files/Accident_Raw_with_NAMES.csv', index=False)
    df_Vehicle.to_csv('../../Big_Files/Vehicle_Raw_with_NAMES.csv', index=False)
    df_Person.to_csv('../../Big_Files/Person_Raw_with_NAMES.csv', index=False)
    

    return df_Accident, df_Vehicle, df_Person

#df_Accident, df_Vehicle, df_Person = Get_Data_from_Original()

In [7]:
def Get_Data_from_Temp_Files():
    print ('Get_Data_from_Temp_Files()')
    df_Acc = pd.read_csv('../../Big_Files/Accident_Raw.csv', low_memory=False)
    df_Veh = pd.read_csv('../../Big_Files/Vehicle_Raw.csv', low_memory=False)
    df_Per = pd.read_csv('../../Big_Files/Person_Raw.csv', low_memory=False)
    
    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    return df_Acc, df_Veh, df_Per

#df_Acc, df_Veh, df_Per = Get_Data_from_Temp_Files()

## Drop Features

- We now have three dataframes from the Accident, Vehicle, and Person files.  
- Some features are repeated, so we will drop the ones in Vehicle or Person that appear in Accident, and drop those in Person that appear in Vehicle. 
- There are two repeated features we need to keep for merging the three dataframes:
    - CASENUM tells us to which accident the vehicle and person correspond
    - VEH_NO tells us which vehicle the person was in.
- Some features have no predictive power and/or resemble random numbers, like the VIN (Vehicle Identification Number) and the minute of the accident time.  
- For details on the features, see the *Crash Report Sampling System Analytical User's Manual 2016-2020.*

### Drop Repeated Features

In [8]:
def Drop_Repeated_Features(df_Acc, df_Veh, df_Per):
    print ('Drop_Repeated_Features()')
    Acc_Cols = df_Acc.columns.tolist()
    Veh_Cols = df_Veh.columns.tolist()
    Per_Cols = df_Per.columns.tolist()
    
    Drop_Veh = [x for x in Veh_Cols if x in Acc_Cols]
    Drop_Per = [x for x in Per_Cols if (x in Acc_Cols or x in Veh_Cols)]
        
    print ('Drop_Veh:')
    for item in Drop_Veh:
        print (item)
    print ()

    print ('Drop_Per:')
    for item in sorted(Drop_Per):
        print (item)
    print ()
    
    # We need to keep these for merging the dataframes.
    Drop_Veh.remove('CASENUM')
    Drop_Per.remove('CASENUM')
    Drop_Per.remove('VEH_NO')
    
    df_Veh.drop(columns=Drop_Veh, inplace=True)
    df_Per.drop(columns=Drop_Per, inplace=True)

    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    return df_Acc, df_Veh, df_Per
                                        

### Drop Irrelevant Features

We will later drop features that are unknowable from the notification, like drug test results, but we still need those features for imputing missing data.  Here we drop features that are either 

    - Just noise (like VIN, Vehicle Identification Number), 
    - Only given for a small number of crashes (like trailer weight), or 
    - Only present for some years but not others.
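
Since some of these features exist only in certain years, a drop list can contain names that are absent from a given dataframe, and `DataFrame.drop` raises a `KeyError` for absent columns by default. The toy example below (made-up data, not CRSS records) sketches one tolerant way to handle that: report which names are absent, then drop with `errors='ignore'`.

```python
import pandas as pd

# Toy dataframe standing in for df_Veh; the values are invented.
df = pd.DataFrame({'CASENUM': [1, 2], 'VIN': ['A', 'B'], 'V_CONFIG': [0, 1]})

drop_list = ['VIN', 'V_CONFIG', 'V_Config']

# Report the names that are not present, so typos do not pass silently.
missing = [c for c in drop_list if c not in df.columns]
print('Not present:', missing)

# errors='ignore' skips absent columns instead of raising KeyError.
df = df.drop(columns=drop_list, errors='ignore')
print(df.columns.tolist())
```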

In [9]:
def Drop_Irrelevant_Features(df_Acc, df_Veh, df_Per):
    
    print ('Drop_Irrelevant_Features')
    
    Drop_Accident = [
        'CF1',
        'CF2',
        'CF3',
        'MINUTE',
        'MINUTE_IM',
        'PSU_VAR',
        'PSUSTRAT',
        'STRATUM',
        'WEATHER1',
        'WEATHER2',
        'WEIGHT',
    ]
    
    df_Acc.drop(columns=Drop_Accident, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Acc')
    for feature in Drop_Accident:
        feature_name = feature + 'NAME'
        if feature_name in df_Acc:
            print (feature_name)
            df_Acc.drop(columns=[feature_name], inplace=True)
    
    # List of features in df_Veh that aren't repeats from df_Acc 
    # that we don't want to use, even for imputation, because
    # they're only for some years or are like random numbers
    Drop_Vehicle = [
        'DR_SF1',
        'DR_SF2',
        'DR_SF3',
        'DR_SF4',
#        'DR_ZIP',
        'GVWR',
        'GVWR_FROM',
        'GVWR_TO',
        'HAZ_ID',
        'ICFINALBODY',
        'MCARR_I1',
        'MCARR_I2',
        'MCARR_ID',
        'TRLR1GVWR',
        'TRLR1VIN',
        'TRLR2GVWR',
        'TRLR2VIN',
        'TRLR3GVWR',
        'TRLR3VIN',
        'UNITTYPE',
        'V_CONFIG',
        'V_Config',
        'VEH_SC1',
        'VEH_SC2',
        'VIN',
        'VPICBODYCLASS',
        'VPICMAKE',
        'VPICMODEL',
    ]
    
    df_Veh.drop(columns=Drop_Vehicle, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Veh')
    for feature in Drop_Vehicle:
        feature_name = feature + 'NAME'
        if feature_name in df_Veh:
            print (feature_name)
            df_Veh.drop(columns=[feature_name], inplace=True)
    
    Drop_Person = [
        'ATST_TYP',
        'DEVTYPE',
        'DEVMOTOR',
        'DRUGRES1',
        'DRUGRES2',
        'DRUGRES3',
        'DRUGTST1',
        'DRUGTST2',
        'DRUGTST3',
        'DSTATUS',
        'HELM_MIS',
        'HELM_USE',
        'P_SF1',
        'P_SF2',
        'P_SF3',
        'STR_VEH',
    ]
    
    df_Per.drop(columns=Drop_Person, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Per')
    for feature in Drop_Person:
        feature_name = feature + 'NAME'
        if feature_name in df_Per:
            print (feature_name)
            df_Per.drop(columns=[feature_name], inplace=True)
    
    
    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    
    return df_Acc, df_Veh, df_Per

## Merge Accident, Vehicle, and Person Dataframes

In [10]:
def Merge(df_Acc, df_Veh, df_Per):
    print ('Merge()')
    print ()

    data = pd.merge(
        df_Acc, df_Veh, 
        on=['CASENUM'],
        how="inner", sort=False
    )
    
    print ('df_Acc.shape')
    print (df_Acc.shape)
    print ('df_Veh.shape')
    print (df_Veh.shape)
    print ('data.shape')
    print (data.shape)
    print ()

    
    data = pd.merge(
        data, df_Per, 
        on=['CASENUM', 'VEH_NO'],
        how="inner", sort=False
    )
    
    print ('df_Acc.shape')
    print (df_Acc.shape)
    print ('df_Veh.shape')
    print (df_Veh.shape)
    print ('df_Per.shape')
    print (df_Per.shape)
    print ('data.shape')
    print (data.shape)
    print ()


    print (data.head())

    return data
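
The merge above assumes each accident appears once in the accident table and each (CASENUM, VEH_NO) pair appears once in the vehicle table. As a hedged sketch on toy data (not CRSS records), the `validate` argument of `pd.merge` can check that assumption and raise `pandas.errors.MergeError` if the left-hand keys are not unique.

```python
import pandas as pd

# Toy tables with the same key structure as the three CRSS dataframes.
acc = pd.DataFrame({'CASENUM': [1, 2], 'HOUR': [8, 17]})
veh = pd.DataFrame({'CASENUM': [1, 1, 2], 'VEH_NO': [1, 2, 1]})
per = pd.DataFrame({'CASENUM': [1, 1, 1, 2],
                    'VEH_NO': [1, 1, 2, 1],
                    'AGE': [30, 5, 41, 60]})

# One accident to many vehicles.
data = pd.merge(acc, veh, on=['CASENUM'], how='inner',
                validate='one_to_many')

# One vehicle to many persons.
data = pd.merge(data, per, on=['CASENUM', 'VEH_NO'], how='inner',
                validate='one_to_many')

print(data.shape)
```

The result has one row per person, which matches the shape the notebook's merge is meant to produce.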

## Drop Pedestrian Crashes

A vehicle hitting another vehicle, a tree, or something else large can cause a sudden deceleration different enough from hard braking to trigger an automated notification, but an impact with a pedestrian or bicycle does not.  Our work needs to focus on crashes likely to trigger an automated notification, so we drop pedestrian crashes from our dataset.

One could argue that we should keep pedestrian crashes because they carry information useful for imputing missing data, but a record for a pedestrian lacks all vehicle information, so those records would do more harm than good.

In [11]:
def Remove_Pedestrian_Crashes(data):
    print ('Remove_Pedestrian_Crashes()')
    display(data.PEDS.value_counts())
    n = len(data[data.PEDS>0])
    print ('Removing %d crashes that involve a pedestrian.' % n)
    data = data[data.PEDS==0]
    print ('data.shape: ', data.shape)
    
    # Drop the PEDS column, which now only has value "0".
    # Drop the LOCATION feature, which now only has value "0".
    # Assign the result instead of using inplace=True, because data is a filtered slice.
    data = data.drop(columns=['PEDS','LOCATION'])
    return data

## Make Names Dictionary

In [12]:
def Make_Names_Dictionary(data):
    C = {}
    for feature in data:
        feature_name = feature + 'NAME'
        if feature_name in data:
            A = pd.DataFrame()
            A[feature] = data[feature]
            A[feature_name] = data[feature_name]
            A.dropna(inplace=True)
            A.drop_duplicates(inplace=True)
            A = A.set_index(feature)
#            display(A)
            B = A.to_dict()
#            display(B)
            C.update(B)
#            display(C)
    with open("../../Big_Files/Names_Dictionary.json", "w") as outfile: 
        json.dump(C, outfile)
        
    print ('Reading in Names Dictionary')
    with open('../../Big_Files/Names_Dictionary.json') as json_file:
        D = json.load(json_file)
#    display(D)

    print ()
    print ("print (D['MONTHNAME']['11'])")
    print (D['MONTHNAME']['11'])
    print ()
    
    return D
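
The lookup `D['MONTHNAME']['11']` uses the string `'11'` rather than the integer `11` because JSON object keys are always strings, so integer codes come back as strings after the dictionary is written and reread. A small illustration with a made-up two-entry dictionary:

```python
import json

names = {'MONTHNAME': {11: 'November', 12: 'December'}}

# Round trip through JSON text, as json.dump / json.load do with a file.
restored = json.loads(json.dumps(names))

print(restored['MONTHNAME']['11'])   # the key is now the string '11'
print(11 in restored['MONTHNAME'])   # the integer key is gone

# If integer keys are wanted again, convert them after loading.
as_int = {int(k): v for k, v in restored['MONTHNAME'].items()}
print(as_int[11])
```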
    


## Make List of Values Signifying Missing or Unknown
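- A code is treated as missing or unknown when its label contains 'Missing' or 'Unknown' (and does not contain 'type'), and the code itself consists only of the digits 8 and 9 and starts with 9 (for example 9, 98, 99, or 9999).
- These criteria aren't perfect, but they work well.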
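
To make the rule easier to test on individual codes, it can be written as a standalone predicate. The sketch below is an assumption-labelled restatement of the test used in Missing_Unknown(); `is_missing_code` is a hypothetical name and is not defined elsewhere in this notebook.

```python
def is_missing_code(key, label):
    """Return True if (key, label) matches the missing/unknown rule.

    Hypothetical helper: `key` is a category code and `label` is its name
    from the names dictionary.
    """
    text = str(label)
    mentions = any(word in text for word in ('Missing', 'Unknown'))
    if not mentions or 'type' in text:
        return False
    digits = str(key)
    # Only 8's and 9's, and the first digit is 9.
    return set(digits) <= {'8', '9'} and digits.startswith('9')

print(is_missing_code('99', 'Reported as Unknown'))       # True
print(is_missing_code('91', 'Unknown Object Not Fixed'))  # False
print(is_missing_code('5', 'Injured, Severity Unknown'))  # False
```

The second and third examples show the cases the rule deliberately leaves alone: codes whose labels mention "Unknown" but still carry partial information.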

In [13]:
def Missing_Unknown(Dict):
    MU = ['Missing', 'Unknown']
    Unknowns = {}
    for x, obj in Dict.items():
        Name = x[:len(x) - 4]
        print ()
        print(Name)
        if '_IM' not in Name:
            A = []
            for key in obj:
                value = obj[key]
                for mu in MU:
                    if mu in str(value) and 'type' not in str(value): # If the word 'Missing' or 'Unknown' is in the value
                        S = set(str(key))
                        if S <= set(['8','9']) and str(key)[0] == '9': # If the key only has 8's and 9's, and starts with a 9
                            print('0    ', S, " ", key + ':', value)
                            A.append(int(key))
                        else:
                            print('1    ', S, " ", key + ':', value)
            B = {Name:A}
#            print (B)
            Unknowns.update(B)
#            print (Unknowns)
            
    with open("../../Big_Files/Missing_Unknown.json", "w") as outfile: 
        json.dump(Unknowns, outfile)
        
    print ('Reading in Missing/Unknown Dictionary')
    with open('../../Big_Files/Missing_Unknown.json') as json_file:
        C = json.load(json_file)
    display(C)

    return Unknowns
    

## Run:  Get Data and Preprocess
- CPU times: user 3min 33s, sys: 19.2 s, total: 3min 52s
- Wall time: 3min 56s

In [14]:
%%time
def Preprocess_Data():
    print ('Preprocess_Data()')
    df_Acc, df_Veh, df_Per = Get_Data_from_Original()
    
#    for feature in df_Acc:
#        print ('df_Acc', feature)
#    print ()
#    for feature in df_Veh:
#        print ('df_Veh', feature)
#    print ()
#    for feature in df_Per:
#        print ('df_Per', feature)
#    print ()

#    df_Acc, df_Veh, df_Per = Get_Data_from_Temp_Files()
    df_Acc, df_Veh, df_Per = Drop_Repeated_Features(df_Acc, df_Veh, df_Per)    
    df_Acc, df_Veh, df_Per = Drop_Irrelevant_Features (df_Acc, df_Veh, df_Per)    

    data = Merge (df_Acc, df_Veh, df_Per)
    data = Remove_Pedestrian_Crashes(data)
    
    Dict = Make_Names_Dictionary(data)
    Unknowns = Missing_Unknown(Dict)
    
    # Bin the target variable.  
    # Either the person went to the hospital or didn't; we don't care how the person got to the hospital.
    data['HOSPITAL'] = data['HOSPITAL'].apply(lambda x:1 if x in [1,2,3,4,5] else 0)
    
    data.to_csv('../../Big_Files/CRSS_Merged_Raw_Data.csv', index=False)
    
    data.sample(frac=0.1).to_csv('../../Big_Files/CRSS_Merged_Raw_Data_Sample.csv', index=False)
    print ('Finished Preprocess_Data()')

Preprocess_Data()

Preprocess_Data()
Get_Data_from_Original()
Import_Data_Accident()
2016 46511
2017 54969
2018 48443
2019 54409
2020 54745
2021 54200
2022 53955
(367232, 51)

Import_Data_Vehicle()
2016 82149
2017 97625
2018 86105
2019 96717
2020 94718
2021 95785
2022 94756
(647855, 98)

Import_Data_Person()
2016 117759
2017 138913
2018 120230
2019 135410
2020 131962
2021 133734
2022 132175
(910183, 71)

Import_Data_Accident()
2016 46511
2017 54969
2018 48443
2019 54409
2020 54745
2021 54200
2022 53955
(367232, 90)

Import_Data_Vehicle()
2016 82149
2017 97625
2018 86105
2019 96717
2020 94718
2021 95785
2022 94756
(647855, 187)

Import_Data_Person()
2016 117759
2017 138913
2018 120230
2019 135410
2020 131962
2021 133734
2022 132175
(910183, 125)

Drop_Repeated_Features()
Drop_Veh:
CASENUM
PSU
PJ
STRATUM
VE_FORMS
MONTH
HOUR
MINUTE
HARM_EV
MAN_COLL
URBANICITY
REGION
PSUSTRAT
PSU_VAR
WEIGHT
STRATUMNAME
REGIONNAME
URBANICITYNAME
MONTHNAME
HOURNAME
MINUTENAME
HARM_EVNAME
MAN_COLLNAME

Drop_Per:
BODY_TYP
BODY_T

0     833798
1      38631
2       1168
3        134
4         36
6          8
5          6
11         1
7          1
8          1
Name: PEDS, dtype: int64

Removing 39986 crashes that involve a pedestrian.
data.shape:  (833798, 224)
Reading in Names Dictionary

print (D['MONTHNAME']['11'])
November


NUM_INJ
0     {'9'}   99: All Persons in Crash are Unknown If Injured

MONTH

YEAR

DAY_WEEK

HOUR
0     {'9'}   99: Unknown Hours

HARM_EV
1     {'1', '9'}   91: Unknown Object Not Fixed
1     {'3', '9'}   93: Unknown Fixed Object
0     {'9'}   99: Reported as Unknown

ALCOHOL
0     {'9'}   9: Reported as Unknown

MAX_SEV
0     {'9'}   9: Unknown/Not Reported
1     {'5'}   5: Injured, Severity Unknown

MAN_COLL
0     {'9'}   99: Reported as Unknown

RELJCT1
0     {'9'}   9: Reported as Unknown

RELJCT2
0     {'9'}   99: Reported as Unknown

TYP_INT
0     {'9'}   99: Reported as Unknown

WRK_ZONE
1     {'4'}   4: Work Zone, Type Unknown

REL_ROAD
1     {'6'}   6: Off Roadway-Location Unknown
0     {'9'}   99: Reported as Unknown

LGT_COND
0     {'9'}   9: Reported as Unknown
1     {'6'}   6: Dark - Unknown Lighting

WEATHER
0     {'9'}   99: 

{'NUM_INJ': [99],
 'MONTH': [],
 'YEAR': [],
 'DAY_WEEK': [],
 'HOUR': [99],
 'HARM_EV': [99],
 'ALCOHOL': [9],
 'MAX_SEV': [9],
 'MAN_COLL': [99],
 'RELJCT1': [9],
 'RELJCT2': [99],
 'TYP_INT': [99],
 'WRK_ZONE': [],
 'REL_ROAD': [99],
 'LGT_COND': [9],
 'WEATHER': [99],
 'SCH_BUS': [],
 'INT_HWY': [9],
 'URBANICITY': [],
 'REGION': [],
 'NUMOCCS': [99],
 'HIT_RUN': [9],
 'MAKE': [99],
 'BODY_TYP': [],
 'MOD_YEAR': [9999],
 'MAK_MOD': [99999, 99898, 99989, 99998, 99988, 98999, 9999],
 'TOW_VEH': [9],
 'J_KNIFE': [],
 'CARGO_BT': [98, 99],
 'HAZ_INV': [],
 'HAZ_PLAC': [],
 'HAZ_CNO': [],
 'HAZ_REL': [],
 'BUS_USE': [99],
 'SPEC_USE': [99],
 'EMER_USE': [9],
 'TRAV_SP': [999],
 'ROLLOVER': [9],
 'ROLINLOC': [9],
 'IMPACT1': [99],
 'DEFORMED': [9],
 'TOWED': [9],
 'M_HARM': [99],
 'VEH_ALCH': [9],
 'MAX_VSEV': [9],
 'NUM_INJV': [],
 'FIRE_EXP': [],
 'DR_PRES': [9],
 'DR_ZIP': [99999],
 'SPEEDREL': [9],
 'VTRAFWAY': [9],
 'VNUM_LAN': [9],
 'VSPD_LIM': [99],
 'VALIGN': [9],
 'VPROFILE': [9

Finished Preprocess_Data()
CPU times: user 3min 14s, sys: 16.3 s, total: 3min 30s
Wall time: 3min 35s
