In [1]:
%%latex
\tableofcontents

<IPython.core.display.Latex object>

# Get Data

- This file 

    - Pulls the data from the '../../Big_Files' folder (not in the GitHub repository) for each year,
    - From each year pulls the Accident, Vehicle, and Person .csv files,
    - Drops features in the Vehicle and Person files that are repeats of features in the other files,
    - Drops features that are not useful, even for imputing missing data, because they are just noise or were not collected in some years,
    - Merges the Accident, Vehicle, and Person files so that each sample represents a crash person,
    - Drops crashes involving a pedestrian, 
    - Drops features with NaN values:  Three of the "_IM" imputed features that were only imputed in some years,
    - Engineers a "VEH_AGE" feature from "MOD_YEAR" and "YEAR," because a 2016 car in the 2016 data is different from a 2016 car in the 2022 data, but a new car is a new car,
    - Drops features where more than 20% of the samples are missing,
    - Drops features where one value is more than 99% of the samples,
    - Drops samples where the target variable, HOSPITAL, is unknown, 
    - Bins HOSPITAL into 0 (did not go to hospital) and 1 (went to hospital by any means).  We will use this binary HOSPITAL to order the values of unordered features by correlation to hospitalization, then bin them by gaps in correlation,
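    - Writes the results to one big file, '../../Big_Files/CRSS_01.csv' (the line that writes '../../Big_Files/CRSS_Merged_Raw_Data.csv' is commented out in the code),
    - Can also write samples ('../../Big_Files/CRSS_Merged_Raw_Data_Sample_frac_01.csv' and '../../Big_Files/CRSS_Merged_Raw_Data_Sample_n_1000.csv') for testing code; those lines are currently commented out.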

- This file also creates two dictionaries and writes them in JSON (JavaScript Object Notation) format to file.

    - '../../Big_Files/Imputed_Feature_Dict.json' gives the Raw:Imputed pairs for features imputed by CRSS,
    - '../../Big_Files/Missing_Unknown_Dict.json' gives the values for each feature that signify "Missing" or "Unknown."  It's complicated because some values signify an unknown subcategory within a category, not an entirely unknown value.  
    
- We started with 220 features and ended with 69 raw features and 19 imputed features.

- We started with 910,183 crash persons, 
    - then dropped to 833,798 crash persons when we eliminated crashes with a pedestrian, 
    - then down to 817,623 crash persons when we deleted crash persons with unknown hospitalization.
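
As a rough illustration of the last two filtering steps listed above, the sketch below drops samples with unknown hospitalization and bins the rest into 0/1.  The toy values are made up and are not CRSS records; the cutoffs (codes 8 and above treated as unknown, codes 1 through 5 treated as "went to hospital") mirror the code in Preprocess_Data() later in this notebook.

```python
import pandas as pd

# Toy data; the code values are illustrative, not real CRSS records.
toy = pd.DataFrame({'HOSPITAL': [0, 1, 5, 8, 9, 3, 0]})

# Keep only samples where hospitalization is known (codes below 8).
known = toy[toy['HOSPITAL'] < 8].copy()

# Bin the target: 1 if transported by any means (codes 1-5), else 0.
known['HOSPITAL'] = known['HOSPITAL'].isin([1, 2, 3, 4, 5]).astype(int)

print(known['HOSPITAL'].tolist())
```

Taking a .copy() after filtering keeps the later assignment from acting on a view of the original frame.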
    


# readme
## Directory Structure

We designed this code to fit in a GitHub repository with files under 100MB.  Many of the files we input are over this limit, and after preprocessing (selecting features, binning features) we saved the data as a .csv file of about 150 MB so we can tweak later code without having to run the preprocessing again.  We saved it again after imputing missing data.  To keep the files in our GitHub repository under 100 MB, we saved these into a different directory.

- CRSS Data Files
    - We use seven years of data, 2016-2022. Once later years come out, they can be easily added. 
    - Each year's data is 160-740 MB (see the sizes listed under Big_Files below)
    - The files we're really interested in each year are these.  The names were uppercase until 2018, then lowercase, and also after 2018 the file sizes jumped.
        - accident.csv or ACCIDENT.csv, now about 30 MB
        - vehicle.csv or VEHICLE.csv, now about 180 MB
        - person.csv or PERSON.csv, now about 150 MB

- Big_Files
    - CRSS_Files
        - CRSS2016CSV (22 files, 160 MB)
        - CRSS2017CSV (22 files, 189 MB)
        - CRSS2018CSV (22 files, 169 MB)
        - CRSS2019CSV (23 files, 633 MB)
        - CRSS2020CSV (29 files, 719 MB)
        - CRSS2021CSV (29 files, 736 MB)
        - CRSS2022CSV (29 files, 738 MB)
        
        
    - *Intermediate .csv files*
- GitHub_Repository
    - Code_Files
        - Analyze_Proba
        - Confusion_Matrices
        - Images

## Note on DR_ZIP Field

- The field DR_ZIP in the Vehicle file signifies the driver's ZIP code.
- The Zone Improvement Plan (ZIP) assigns a five-digit code to each postal zone in the US.  We use it as a proxy for small geographic regions.  
- The ZIP code is five digits, but can start with zeros, so it is not exactly a five-digit number.
- In DR_ZIP, some of the entries are three or four digits, like 1846, which appears 602 times in CRSS 2016-2022, and 802, which appears seven times.  
- None of the five-digit DR_ZIP entries begin with a zero, so we suspect that CRSS has dropped leading zeros.
- ZIP code 01846 is around Reading, MA, and ZIP code 00802 is in the US Virgin Islands, so if we padded leading zeroes we would get legitimate ZIP codes (at least in these two examples).
- CRSS did not cut off the leading zeros in fields MCARR_I2 or MCARR_ID (Motor Carrier Identification Number), where 00000000000 signifies "Not Applicable."  When one opens the .csv in Excel it appears as "0", but when one opens it in a terminal window all of the digits appear.  I don't see any technical reason why CRSS had to cut off leading zeroes in the ZIP codes.
- We could pad zeroes onto the DR_ZIP entries and save them as strings, but that may be unnecessary.  Each three- or four-digit code still signifies a unique place just as well as it would as a five-character string.  
- We will leave the DR_ZIP values as they are.  
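
For reference, if the padded form were ever needed (for example, to join against an external ZIP code table), a minimal sketch of the padding might look like this.  The Series below is a toy stand-in for the DR_ZIP column, assuming pandas has read the values as integers.

```python
import pandas as pd

# Toy DR_ZIP values as integers, the way pandas would typically read them from the CSV.
dr_zip = pd.Series([1846, 802, 90210, 99999])

# Convert to strings and left-pad with zeros to five characters.
dr_zip_padded = dr_zip.astype(str).str.zfill(5)

print(dr_zip_padded.tolist())
```

The number of distinct values is the same before and after padding, which is why leaving the integers alone loses nothing for our purposes.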


# Notes on NaN
- Three of the features imputed by CRSS have NaN values for entire years, because they were not imputed in those years.
- RELJCT1 was not imputed in 2019, so RELJCT1_IM has NaN in that year.
- HIT_RUN was not imputed in 2020, 2021, or 2022, so HITRUN_IM has NaN in those years.
- BODY_TYP was not imputed in 2021 or 2022, so BDYTYP_IM has NaN in those years.
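
A small sketch of how one might spot features that are NaN for an entire year.  The frame below is toy data with made-up values; only the column names are borrowed from CRSS.

```python
import numpy as np
import pandas as pd

# Toy data: RELJCT1_IM is NaN for all of 2019, HOUR_IM is always present.
toy = pd.DataFrame({
    'YEAR':       [2018, 2018, 2019, 2019, 2020],
    'RELJCT1_IM': [1, 2, np.nan, np.nan, 3],
    'HOUR_IM':    [10, 11, 12, 13, 14],
})

# Fraction of NaN values per feature within each year.
nan_by_year = toy.drop(columns='YEAR').isna().groupby(toy['YEAR']).mean()

# Features that are entirely NaN in at least one year.
mask = (nan_by_year == 1.0).any()
fully_missing = mask[mask].index.tolist()

print(nan_by_year)
print(fully_missing)
```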

# Setup

## Import Libraries

In [2]:
print ('Install Packages')

import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)

import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import json # We will use json ('JavaScript Object Notation') to write and read dictionaries to/from files
print ('JSON version:  {}'.format(json.__version__))

# THERE IS NO RANDOMNESS IN THIS NOTEBOOK
# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
#import random
#np.random.seed(42) # NumPy
#random.seed(42) # Python
#tf.random.set_seed(42) # Tensorflow

import warnings
warnings.filterwarnings('ignore')

print ('Finished Installing Packages')
print ()

Install Packages
Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
NumPy version: 1.26.4
Pandas version:  2.2.2
JSON version:  2.0.9
Finished Installing Packages



# Get Data and Preprocess

## Read CRSS Files
- We have the CRSS dataset in 
    - Big_Files/CRSS_2020_Update/
- In one directory for each year,
    - CRSS2016CSV
    - CRSS2017CSV
    - CRSS2018CSV
    - CRSS2019CSV
    - CRSS2020CSV    
    - CRSS2021CSV    
    - CRSS2022CSV    
- In each year, the CRSS dataset comes in three main files, 
    - Accident.csv
    - Vehicle.csv
    - Person.csv
- Collect those and merge into three files,
    - Accident_Raw.csv
    - Vehicle_Raw.csv
    - Person_Raw.csv
- and also three files with category names,
    - Accident_Raw_with_NAMES.csv
    - Vehicle_Raw_with_NAMES.csv
    - Person_Raw_with_NAMES.csv


### accident.csv from CRSS

In [3]:
def Import_Data_Accident(NAMES):
    print ('Import_Data_Accident()')

    # DataFrame.append was removed in pandas 2.0, so collect each year and concatenate once.
    frames = []
#    for year in ['2018']:
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/ACCIDENT.CSV'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        frames.append(temp)

#    for year in ['2020']:
    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/accident.csv'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        frames.append(temp)

    df = pd.concat(frames)
    
    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### vehicle.csv from CRSS

In [4]:
def Import_Data_Vehicle(NAMES):
    print ('Import_Data_Vehicle()')

    # DataFrame.append was removed in pandas 2.0, so collect each year and concatenate once.
    frames = []
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/VEHICLE.CSV'
        temp = pd.read_csv(filename, index_col=None, low_memory=False)
        print (year, len(temp))
        frames.append(temp)

    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/vehicle.csv'
        temp = pd.read_csv(filename, index_col=None, encoding='latin1', low_memory=False)
        print (year, len(temp))
        frames.append(temp)

    df = pd.concat(frames)

    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### person.csv from CRSS

In [5]:
def Import_Data_Person(NAMES):
    print ('Import_Data_Person()')

    # DataFrame.append was removed in pandas 2.0, so collect each year and concatenate once.
    frames = []
    for year in ['2016','2017','2018']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/PERSON.CSV'
        temp = pd.read_csv(filename, index_col=None)
        print (year, len(temp))
        frames.append(temp)

    for year in ['2019','2020','2021', '2022']:
        filename = '../../Big_Files/CRSS_2020_Update/CRSS' + year + 'CSV/person.csv'
        temp = pd.read_csv(filename, index_col=None, encoding='latin1')
        print (year, len(temp))
        frames.append(temp)

    df = pd.concat(frames)

    if NAMES==0:
        for feature in df:
            if 'NAME' in feature:
                df.drop(columns=[feature], inplace=True)

    print (df.shape)
    print ()
    return df

### Get Data
- Get_Data_from_Original() reads the original CRSS files from the CRSS directory, preprocesses them, writes them to files in a folder outside this GitHub repo (because the files are too large for my subscription), and returns the dataframes.
- Get_Data_from_Temp_Files() reads those temp files and returns the dataframes.  I created this option for running repeatedly while writing and debugging, because it's much faster.

In [6]:
def Get_Data_from_Original():
    print ('Get_Data_from_Original()')
    
    df_Accident = Import_Data_Accident(0)
    df_Vehicle = Import_Data_Vehicle(0)
    df_Person = Import_Data_Person(0)
    
    df_Accident.to_csv('../../Big_Files/Accident_Raw.csv', index=False)
    df_Vehicle.to_csv('../../Big_Files/Vehicle_Raw.csv', index=False)
    df_Person.to_csv('../../Big_Files/Person_Raw.csv', index=False)
    

    df_Accident_NAMES = Import_Data_Accident(1)
    df_Vehicle_NAMES = Import_Data_Vehicle(1)
    df_Person_NAMES = Import_Data_Person(1)
    
    df_Accident_NAMES.to_csv('../../Big_Files/Accident_Raw_with_NAMES.csv', index=False)
    df_Vehicle_NAMES.to_csv('../../Big_Files/Vehicle_Raw_with_NAMES.csv', index=False)
    df_Person_NAMES.to_csv('../../Big_Files/Person_Raw_with_NAMES.csv', index=False)
    

    return df_Accident, df_Vehicle, df_Person

#df_Accident, df_Vehicle, df_Person = Get_Data_from_Original()

In [7]:
def Get_Data_from_Temp_Files():
    print ('Get_Data_from_Temp_Files()')
    df_Acc = pd.read_csv('../../Big_Files/Accident_Raw.csv', low_memory=False)
    df_Veh = pd.read_csv('../../Big_Files/Vehicle_Raw.csv', low_memory=False)
    df_Per = pd.read_csv('../../Big_Files/Person_Raw.csv', low_memory=False)
    
    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    return df_Acc, df_Veh, df_Per

#df_Acc, df_Veh, df_Per = Get_Data_from_Temp_Files()

## Drop Features

- We now have three dataframes from the Accident, Vehicle, and Person files.  
- Some features are repeated, so we will drop the ones in Vehicle or Person that appear in Accident, and drop those in Person that appear in Vehicle. 
- There are two repeated features we need to keep for merging the three dataframes:
    - CASENUM tells us to which accident the vehicle and person correspond
    - VEH_NO tells us which vehicle the person was in.
- Some features have no predictive power and/or resemble random numbers, like the VIN (Vehicle Identification Number) and the minute of the accident time.  
- For details on the features, see the *Crash Report Sampling System Analytical User's Manual 2016-2020.*

### Drop Repeated Features

In [8]:
def Drop_Repeated_Features(df_Acc, df_Veh, df_Per):
    print ('Drop_Repeated_Features()')
    Acc_Cols = df_Acc.columns.tolist()
    Veh_Cols = df_Veh.columns.tolist()
    Per_Cols = df_Per.columns.tolist()
    
    Drop_Veh = [x for x in Veh_Cols if x in Acc_Cols]
    Drop_Per = [x for x in Per_Cols if (x in Acc_Cols or x in Veh_Cols)]
        
#    print ('Drop_Veh:')
#    for item in Drop_Veh:
#        print (item)
#    print ()

#    print ('Drop_Per:')
#    for item in sorted(Drop_Per):
#        print (item)
#    print ()
    
    # We need to keep these for merging the dataframes.
    Drop_Veh.remove('CASENUM')
    Drop_Per.remove('CASENUM')
    Drop_Per.remove('VEH_NO')
    
    df_Veh.drop(columns=Drop_Veh, inplace=True)
    df_Per.drop(columns=Drop_Per, inplace=True)

    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    return df_Acc, df_Veh, df_Per
                                        

### Drop Irrelevant Features

We will later drop features that are unknowable from the notification, like drug test results, but we still need those features for imputing missing data.  Here we drop features that are any of the following:

- Just noise (like VIN, Vehicle Identification Number), 
- Only given for a small number of crashes (like trailer weight), or 
- Only present for some years but not others.

In [9]:
def Drop_Irrelevant_Features(df_Acc, df_Veh, df_Per):
    print ('Drop_Irrelevant_Features()')
    
    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    Drop_Accident = [
        'CF1',
        'CF2',
        'CF3',
        'MINUTE',
        'MINUTE_IM',
        'PSU_VAR',
        'PSUSTRAT',
        'STRATUM',
        'WEATHER1',
        'WEATHER2',
        'WEIGHT',
    ]
    
    df_Acc.drop(columns=Drop_Accident, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Acc')
    for feature in Drop_Accident:
        feature_name = feature + 'NAME'
        if feature_name in df_Acc:
#            print (feature_name)
            df_Acc.drop(columns=[feature_name], inplace=True)
    
    # List of features in df_Veh that aren't repeats from df_Acc 
    # that we don't want to use, even for imputation, because
    # they're only for some years or are like random numbers
    Drop_Vehicle = [
        'DR_SF1',
        'DR_SF2',
        'DR_SF3',
        'DR_SF4',
#        'DR_ZIP',
        'GVWR',
        'GVWR_FROM',
        'GVWR_TO',
        'HAZ_ID',
        'ICFINALBODY',
        'MCARR_I1',
        'MCARR_I2',
        'MCARR_ID',
        'TRLR1GVWR',
        'TRLR1VIN',
        'TRLR2GVWR',
        'TRLR2VIN',
        'TRLR3GVWR',
        'TRLR3VIN',
        'UNDEROVERRIDE',
        'UNITTYPE',
        'V_CONFIG',
        'V_Config',
        'VEH_SC1',
        'VEH_SC2',
        'VIN',
        'VPICBODYCLASS',
        'VPICMAKE',
        'VPICMODEL',
    ]
    
    df_Veh.drop(columns=Drop_Vehicle, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Veh')
    for feature in Drop_Vehicle:
        feature_name = feature + 'NAME'
        if feature_name in df_Veh:
#            print (feature_name)
            df_Veh.drop(columns=[feature_name], inplace=True)
    
    Drop_Person = [
        'ATST_TYP',
        'DEVTYPE',
        'DEVMOTOR',
        'DRUGRES1',
        'DRUGRES2',
        'DRUGRES3',
        'DRUGTST1',
        'DRUGTST2',
        'DRUGTST3',
        'DSTATUS',
        'HELM_MIS',
        'HELM_USE',
        'P_SF1',
        'P_SF2',
        'P_SF3',
        'STR_VEH',
    ]
    
    df_Per.drop(columns=Drop_Person, inplace=True)
    
    print ('Drop Names of Dropped Features in df_Per')
    for feature in Drop_Person:
        feature_name = feature + 'NAME'
        if feature_name in df_Per:
#            print (feature_name)
            df_Per.drop(columns=[feature_name], inplace=True)
    
    
    print ('df_Acc.shape = ', df_Acc.shape)
    print ('df_Veh.shape = ', df_Veh.shape)
    print ('df_Per.shape = ', df_Per.shape)
    print ()
    
    return df_Acc, df_Veh, df_Per

## Merge Accident, Vehicle, and Person Dataframes

In [10]:
def Merge(df_Acc, df_Veh, df_Per):
    print ('Merge()')
    print ()

    data = pd.merge(
        df_Acc, df_Veh, 
        on=['CASENUM'],
        how="inner", sort=False
    )
    
    print ('df_Acc.shape')
    print (df_Acc.shape)
    print ('df_Veh.shape')
    print (df_Veh.shape)
    print ('data.shape')
    print (data.shape)
    print ()

    # In this step we go from 910,183 Person samples to 873,784 merged samples 
    #     because 36,399 of the Person samples have no corresponding Vehicle (they are pedestrians).
    # In a later step we'll also delete the merged samples that describe the people in the vehicles 
    #     involved in pedestrian crashes.
    data = pd.merge(
        data, df_Per, 
        on=['CASENUM', 'VEH_NO'],
        how="inner", sort=False
    )
    data.drop(columns=['CASENUM', 'VEH_NO', 'PER_NO'], inplace=True)
    
    print ('df_Acc.shape')
    print (df_Acc.shape)
    print ('df_Veh.shape')
    print (df_Veh.shape)
    print ('df_Per.shape')
    print (df_Per.shape)
    print ('data.shape')
    print (data.shape)
    print ()


#    print (data.head())

    return data
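
The comment in Merge() attributes the Person rows lost in the inner merge to people with no matching vehicle.  One way to check a claim like that on any pair of frames is a left merge with indicator=True, which labels each row by where it was found.  The frames below are made-up toy data, and using VEH_NO 0 for a person outside any vehicle is an assumption for illustration.

```python
import pandas as pd

# Toy vehicle and person frames; not CRSS records.
veh = pd.DataFrame({'CASENUM': [1, 1, 2], 'VEH_NO': [1, 2, 1]})
per = pd.DataFrame({
    'CASENUM': [1, 1, 2, 2],
    'VEH_NO':  [1, 2, 1, 0],   # assumption: 0 stands in for a person not in a vehicle
    'PER_NO':  [1, 1, 1, 1],
})

# A left merge keeps every person; the _merge column says whether a vehicle matched.
check = pd.merge(per, veh, on=['CASENUM', 'VEH_NO'], how='left', indicator=True)
n_unmatched = int((check['_merge'] == 'left_only').sum())

print(check)
print('Persons without a matching vehicle:', n_unmatched)
```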

## Drop Pedestrian Crashes

A vehicle hitting another vehicle, a tree, or something else large can result in sudden deceleration different enough from hard braking to trigger an automated notification, but an impact with a pedestrian or bicycle does not.  Our work needs to focus on crashes likely to trigger an automated notification, so we will drop pedestrian crashes from our dataset.  

One could argue that we should keep pedestrian crashes for information relevant for imputing missing data, but a sample with a pedestrian lacks all vehicle information, so those records would be more harm than help.  

In [11]:
def Remove_Pedestrian_Crashes(data):
    print ('Remove_Pedestrian_Crashes()')
    display(data.PEDS.value_counts())
    n = len(data[data.PEDS>0])
    print ('Removing %d crashes that involve a pedestrian.' % n)
    data = data[data.PEDS==0].copy()  # .copy() so the in-place drop below acts on an independent frame
    print ('data.shape: ', data.shape)
    
    # Drop the PEDS column, which now only has value "0".
    # Drop the LOCATION feature, which now only has value "0".
    data.drop(columns=['PEDS','LOCATION'], inplace=True)
    print ()
    
    return data

## Remove Feature Names

In [12]:
def Remove_Feature_Names(data):
    print ('Remove_Feature_Names()')
    for feature in list(data.columns):
        if feature[-4:] == 'NAME':
            data.drop(columns=[feature], inplace=True)

    print ()
    
    return data

## Feature Engineering
- Merge YEAR and MOD_YEAR into VEH_AGE
- Why?  Because MOD_YEAR is mostly useful for telling us the age of the car, but we have seven years of data, so a car that was new in 2016 won't be new in 2022.  A feature that gives the age of the vehicle in the year of the crash is much more useful.
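
A toy illustration of the idea, using made-up rows.  The unknown codes 9998 and 9999 and the sentinel 99 follow the values this notebook uses for MOD_YEAR and VEH_AGE.

```python
import pandas as pd

toy = pd.DataFrame({
    'YEAR':     [2016, 2022, 2022, 2019],
    'MOD_YEAR': [2016, 2016, 2022, 9999],   # 9999 stands for an unknown model year
})
unknown_mod_year = [9998, 9999]

# Age of the vehicle in the year of the crash.
toy['VEH_AGE'] = toy['YEAR'] - toy['MOD_YEAR']

# Where the model year is unknown, replace the meaningless difference with the sentinel 99.
toy['VEH_AGE'] = toy['VEH_AGE'].where(~toy['MOD_YEAR'].isin(unknown_mod_year), 99)

print(toy['VEH_AGE'].tolist())
```

The 2016 car is age 0 in the 2016 data and age 6 in the 2022 data, while the 2022 car in the 2022 data is again age 0.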

In [13]:
def Feature_Engineering(data, Missing_Unknown_Dict):
    print ('Feature_Engineering()')
    print (data.shape)
    
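    data['VEH_AGE'] = data['YEAR'] - data['MOD_YEAR']
    # Assign the result back rather than calling where(..., inplace=True) on a column selection,
    # which may act on a temporary copy.  Unknown model years get the sentinel 99.
    data['VEH_AGE'] = data['VEH_AGE'].where(
        ~data['MOD_YEAR'].isin(Missing_Unknown_Dict['MOD_YEAR']),
        99
    )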
    
    data.drop(columns=['YEAR','MOD_YEAR', 'MDLYR_IM'], inplace=True)
    print (data.shape)
    print ()
    
    return data   

## Make Imputed Feature Dict

In [14]:
def Make_Imputed_Feature_Dict(data):
    print ('Make_Imputed_Feature_Dict()')

    Imputed_Feature_Dict = {
        'AGE':'AGE_IM',
        'ALCOHOL':'ALCHL_IM',
        'BODY_TYP':'BDYTYP_IM',
        'DAY_WEEK':'WKDY_IM',
        'DRINKING':'PERALCH_IM',
        'EJECTION':'EJECT_IM',
        'HARM_EV':'EVENT1_IM',
        'HIT_RUN':'HITRUN_IM',
        'HOUR':'HOUR_IM',
        'IMPACT1':'IMPACT1_IM',
        'INJ_SEV':'INJSEV_IM',
        'LGT_COND':'LGTCON_IM',
        'M_HARM':'VEVENT_IM',
        'MAN_COLL':'MANCOL_IM',
        'MAX_SEV':'MAXSEV_IM',
        'MAX_VSEV':'MXVSEV_IM',
        'MINUTE':'MINUTE_IM',
        'MOD_YEAR':'MDLYR_IM',
        'NUM_INJ':'NO_INJ_IM',
        'NUM_INJV':'NUMINJ_IM',
        'P_CRASH1':'PCRASH1_IM',
        'RELJCT1':'RELJCT1_IM',
        'RELJCT2':'RELJCT2_IM',
        'SEAT_POS':'SEAT_IM',
        'SEX':'SEX_IM',
        'VEH_ALCH':'V_ALCH_IM',
        'WEATHER':'WEATHR_IM',
    }
        
    with open("../../Big_Files/Imputed_Feature_Dict.json", "w") as outfile: 
        json.dump(Imputed_Feature_Dict, outfile)
        
    print ('Reading in Imputed_Feature__Dict')
    with open('../../Big_Files/Imputed_Feature_Dict.json') as json_file:
        D = json.load(json_file)
    print ("Did the read/write work? ", D == Imputed_Feature_Dict)
    print ()

    return Imputed_Feature_Dict
    

## Make_Missing_Unknown_Dict

In [15]:
def Make_Missing_Unknown_Dict(data, Imputed_Feature_Dict):
    print ('Make_Missing_Unknown_Dict()')
    
    Missing_Unknown_Dict = {
        # Features Imputed by CRSS.  The "Missing" values are the same as the ones imputed by CRSS
        'AGE': [998,999],
        'ALCOHOL': [9], # The value 8 is converted to attribute code 2 [User's Manual 2016-2021, page 64]
        'BODY_TYP': [49,79,98,99], # Imputation discontinued after 2020
        'DAY_WEEK': [9],
        'DRINKING': [8,9],
        'EJECTION': [7,9],
        'HARM_EV': [98,99],
        'HOUR': [99],
        'IMPACT1': [98,99],
        'INJ_SEV': [9],
        'LGT_COND': [8,9],
        'M_HARM': [98,99],
        'MAN_COLL': [98,99],
        'MAX_SEV': [9],
        'MAX_VSEV': [9],
        'MOD_YEAR': [9998,9999],
        'NUM_INJ': [99], # The value 98 is converted to code 0 in the imputed version [User's Manual 2016-2021, page 63]
        'NUM_INJV': [99], # The value 98 is converted to code 0 in the imputed version [User's Manual 2016-2021, page 114]
        'RELJCT1': [8,9], # Imputed data element discontinued in 2019 and added back in 2020
        'SEAT_POS': [19,29,39,49,98,99],
        'SEX': [8,9],
        'VEH_ALCH': [9], # Value 8 is converted to code 2 in the imputed version [User's Manual 2016-21, page 115]
        'WEATHER': [98,99],
        # Features Not Imputed by CRSS.  What qualifies as "Missing" is a little subjective.
        'ACC_TYPE': [98,99],
        'AIR_BAG': [98,99],
        'ALC_RES': [996,997,998,999],
        'ALC_STATUS': [8,9],
        'BUS_USE': [98,99],
        'CARGO_BT': [97,98,99],
        'DEFORMED': [8,9],
        'DRUGS': [8,9],
        'DR_PRES': [9],
        'DR_ZIP': [99998,99999],
        'EMER_USE': [8,9],
        'FIRE_EXP': [],
        'HAZ_CNO': [88],
        'HAZ_INV': [],
        'HAZ_PLAC': [8],
        'HAZ_REL': [8],
        'HIT_RUN': [9],
        'HOSPITAL': [],
        'INT_HWY': [9],
        'J_KNIFE': [],
        'MAKE': [97,98,99],
        'MAK_MOD': [97997,97998,97999,98997,98998,98999,99997,99998,99999],
        'MODEL': [997,998,999],
        'MONTH': [],
        'NUMOCCS': [99],
        'PCRASH4': [9],
        'PCRASH5': [9],
        'PERMVIT': [],
        'PERNOTMVIT': [],
        'PER_TYP': [9],
        'PJ': [],
        'PSU': [],
        'PVH_INVL': [],
        'P_CRASH1': [98,99],
        'P_CRASH2': [98,99],
        'P_CRASH3': [98,99],
        'REGION': [],
        'RELJCT2': [98,99],
        'REL_ROAD': [98,99],
        'REST_MIS': [],
        'REST_USE': [98,99],
        'ROLINLOC': [9],
        'ROLLOVER': [],
        'SCH_BUS': [],
        'SPEC_USE': [98,99],
        'SPEEDREL': [9],
        'TOWED': [8,9],
        'TOW_VEH': [9],
        'TRAV_SP': [998,999],
        'TYP_INT': [98,99],
        'URBANICITY': [],
        'VALIGN': [8,9],
        'VE_FORMS': [],
        'VE_TOTAL': [],
        'VNUM_LAN': [8,9],
        'VPROFILE': [8,9],
        'VSPD_LIM': [98,99],
        'VSURCOND': [98,99],
        'VTCONT_F': [8,9],
        'VTRAFCON': [97,98,99],
        'VTRAFWAY': [8,9],
        'WRK_ZONE': [],
        'YEAR': [],
        'VEH_AGE': [99],
    }

    """
    Features = [feature for feature in data]
    Keys = Missing_Unknown_Dict.keys()
    Features.sort()
    N = len(data)
    for feature in Features:
        if '_IM' not in feature:
            if feature in Keys:
                print (feature, Missing_Unknown_Dict[feature])
                s = 0
                D = Missing_Unknown_Dict[feature]
                for d in D:
                    t = len(data[data[feature]==d])
                    s += t
                    print (feature, d, t)
                print ()
                print (feature, s, round(s/N*100,2), "% Missing")
                if feature in Imputed_Feature_Dict.keys():
                    print (Imputed_Feature_Dict[feature], " in data")
                else:
                    print ("Imputed Feature not in data")
                print ()
                    
            else:
                print (feature, " MISSING")
    """

    with open("../../Big_Files/Missing_Unknown_Dict.json", "w") as outfile: 
        json.dump(Missing_Unknown_Dict, outfile)
        
    print ('Reading in Missing_Unknown_Dict')
    with open('../../Big_Files/Missing_Unknown_Dict.json') as json_file:
        D = json.load(json_file)
    print ("Did the read/write work? ", D == Missing_Unknown_Dict)
    print ()

    return Missing_Unknown_Dict


## Consolidate Values following CRSS Imputation
- When CRSS imputed some features, they consolidated the "Not Applicable" code into the "None" code.  
- Here we change those codes in the unimputed features.
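
A minimal sketch of the same recoding on toy data, written as a nested dictionary passed to DataFrame.replace.  The code pairs (ALCOHOL 8 to 2, NUM_INJ 98 to 0) are the ones used in the function below; the rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'ALCOHOL': [1, 2, 8, 9, 8],
    'NUM_INJ': [0, 3, 98, 99, 1],
})

# feature -> {old code: new code}
recode = {
    'ALCOHOL': {8: 2},
    'NUM_INJ': {98: 0},
}

# A nested dict tells replace() which value to change in which column.
toy = toy.replace(recode)

print(toy)
```

Keeping the pairs in one dictionary makes it easy to compare them against the User's Manual in one place.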

In [16]:
def Consolidate_Values_following_CRSS_Imputation(data):
    print ('Consolidate_Values_following_CRSS_Imputation()')
    
    print ('ALCOHOL should change 8 to 2')
    print (data['ALCOHOL'].value_counts())
    # Use .loc so the assignment modifies data itself rather than a temporary copy.
    data.loc[data['ALCOHOL']==8, 'ALCOHOL'] = 2
    print (data['ALCOHOL'].value_counts())
    
    
    data.loc[data['NUM_INJ']==98, 'NUM_INJ'] = 0
    data.loc[data['NUM_INJV']==98, 'NUM_INJV'] = 0
    data.loc[data['VEH_ALCH']==8, 'VEH_ALCH'] = 2
    
    print ()
    
    return data

## Drop Features with Excessive Missing
- Drop features with more than 20% of values missing or unknown
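
A toy sketch of the 20% rule: count how often each feature takes one of its "Missing" or "Unknown" codes, then flag the features over the threshold.  The rows are made up; the codes for TRAV_SP and LGT_COND are taken from Missing_Unknown_Dict.

```python
import pandas as pd

toy = pd.DataFrame({
    'TRAV_SP':  [35, 998, 999, 999, 60],
    'LGT_COND': [1, 2, 1, 3, 3],
})
missing_codes = {'TRAV_SP': [998, 999], 'LGT_COND': [8, 9]}

# Fraction of samples in each feature that carry a missing/unknown code.
frac_missing = {f: toy[f].isin(codes).mean() for f, codes in missing_codes.items()}

# Features over the 20% threshold would be dropped.
to_drop = [f for f, frac in frac_missing.items() if frac > 0.20]

print(frac_missing)
print(to_drop)
```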

In [17]:
def Drop_Features_with_Excessive_Missing(data, Missing_Unknown_Dict, Imputed_Feature_Dict):
    print ('Drop_Features_with_Excessive_Missing()')
    
    N = len(data)
    
    for feature in data:
        if feature[-3:] != '_IM':
            s = len(data[data[feature].isin(Missing_Unknown_Dict[feature])])
            if s/N > 0.20:
                print (feature, s, round(s/N*100,2))
                data.drop(columns=[feature], inplace=True)
                print ('Dropped ', feature)
                if feature in Imputed_Feature_Dict.keys():
                    data.drop(columns=[Imputed_Feature_Dict[feature]], inplace=True)
                    print ('Dropped ', Imputed_Feature_Dict[feature])
                print (data.shape)
                print ()
    print ()
                
    return data
    

In [18]:
def Count_NaN(data):
    print ('Count NaN in each feature and drop column if it has NaN')
    print ('For comparison, here are the YEAR value counts for the whole dataset:')
    print (data['YEAR'].value_counts())
    print ()
    for feature in data:
        n = data[feature].isna().sum()
        if n>0 and 'NAME' not in feature:
            print (feature, n)
            B = data[[feature, 'YEAR']].copy()
            B = B[pd.isnull(B[feature])]
#            B.dropna(subset = [feature], inplace=True)
            print (B['YEAR'].value_counts())
            print ()
            data.drop(columns=[feature], inplace=True)
            print (data.shape)
    print ()
    
    return data

In [19]:
def Drop_Features_with_Dominant_Value(data, Missing_Unknown_Dict, Imputed_Feature_Dict):
    print ('Drop_Features_with_Dominant_Value()')
    for feature in data:
        if '_IM' not in feature:
            MU = Missing_Unknown_Dict[feature]
            Feature = data[feature]
            for mu in MU:
                Feature = Feature[Feature != mu]
            U = Feature.unique()
            V = list(Feature.value_counts(normalize=True))
            if V[0] > 0.95:
                print (feature, len(U), V[0])
                if V[0] > 0.99:
                    data.drop(columns=[feature], inplace=True)
                    print ('Dropped ', feature)
                    if feature in Imputed_Feature_Dict.keys():
                        data.drop(columns=[Imputed_Feature_Dict[feature]], inplace=True)
                        print ('Dropped ', Imputed_Feature_Dict[feature])
                    print (data.shape)
            if V[0] < 0.05:
                print (feature,len(U), V[0])
    print ()
    
    return data            

In [20]:
def Count_Features(data):
    n = 0
    m = 0
    for feature in data:
        if feature[-3:] == '_IM':
            m += 1
        else:
            n += 1
    print (n, ' Raw Features')
    print (m, ' Imputed Features')

## Run:  Get Data and Preprocess
- CPU times: user 3min 33s, sys: 19.2 s, total: 3min 52s
- Wall time: 3min 56s

In [21]:
%%time
def Preprocess_Data():
    print ('Preprocess_Data()')
    
#    df_Acc, df_Veh, df_Per = Get_Data_from_Original()
    
#    for feature in df_Acc:
#        print ('df_Acc', feature)
#    print ()
#    for feature in df_Veh:
#        print ('df_Veh', feature)
#    print ()
#    for feature in df_Per:
#        print ('df_Per', feature)
#    print ()

    df_Acc, df_Veh, df_Per = Get_Data_from_Temp_Files()
    df_Acc, df_Veh, df_Per = Drop_Repeated_Features(df_Acc, df_Veh, df_Per)    
    df_Acc, df_Veh, df_Per = Drop_Irrelevant_Features (df_Acc, df_Veh, df_Per)    

    data = Merge (df_Acc, df_Veh, df_Per)
    data = Remove_Pedestrian_Crashes(data)
    data = Count_NaN(data)    
    
    Imputed_Feature_Dict = Make_Imputed_Feature_Dict(data)
    Missing_Unknown_Dict = Make_Missing_Unknown_Dict(data, Imputed_Feature_Dict)

    data = Consolidate_Values_following_CRSS_Imputation(data) 
    data = Feature_Engineering(data, Missing_Unknown_Dict)
    
    data = Drop_Features_with_Excessive_Missing(data, Missing_Unknown_Dict, Imputed_Feature_Dict)
    data = Drop_Features_with_Dominant_Value(data, Missing_Unknown_Dict, Imputed_Feature_Dict)
    
    # Drop samples where we don't know whether the person went to the hospital.
#    print (data['HOSPITAL'].value_counts())
    print ('len(data) before dropping missing hospital values = ', len(data))
    data = data[data['HOSPITAL'] < 8].copy()  # .copy() so the HOSPITAL assignment below acts on an independent frame
    print ('len(data) after dropping missing hospital values = ', len(data))
    print ()
    
    # Bin the target variable.  
    # Either the person went to the hospital or didn't; we don't care how the person got to the hospital.
    data['HOSPITAL'] = data['HOSPITAL'].apply(lambda x:1 if x in [1,2,3,4,5] else 0)
    
#    Analyze_Stuff(data)

    Count_Features(data)
    
    A = []
    for feature in data:
        if '_IM' not in feature:
            A.append(feature)
    A.sort()
    for feature in A:
        print ("        '%s'," % feature)
    
#    data.to_csv('../../Big_Files/CRSS_Merged_Raw_Data.csv', index=False)
    data.to_csv('../../Big_Files/CRSS_01.csv', index=False)
    
    # Make a sample of the dataset for testing while writing code
#    data.sample(frac=0.1).to_csv('../../Big_Files/CRSS_Merged_Raw_Data_Sample_frac_01.csv', index=False)

    # Make a really small sample of the dataset for testing while writing code
#    data.sample(n=1000).to_csv('../../Big_Files/CRSS_Merged_Raw_Data_Sample_n_1000.csv', index=False)

    print ('Finished Preprocess_Data()')

Preprocess_Data()

#CPU times: user 19.4 s, sys: 4.52 s, total: 24 s
#Wall time: 25.2 s

Preprocess_Data()
Get_Data_from_Temp_Files()
df_Acc.shape =  (367232, 51)
df_Veh.shape =  (647855, 98)
df_Per.shape =  (910183, 71)

Drop_Repeated_Features()
df_Acc.shape =  (367232, 51)
df_Veh.shape =  (647855, 84)
df_Per.shape =  (910183, 40)

Drop_Irrelevant_Features()
df_Acc.shape =  (367232, 51)
df_Veh.shape =  (647855, 84)
df_Per.shape =  (910183, 40)

Drop Names of Dropped Features in df_Acc
Drop Names of Dropped Features in df_Veh
Drop Names of Dropped Features in df_Per
df_Acc.shape =  (367232, 40)
df_Veh.shape =  (647855, 56)
df_Per.shape =  (910183, 24)

Merge()

df_Acc.shape
(367232, 40)
df_Veh.shape
(647855, 56)
data.shape
(647855, 95)

df_Acc.shape
(367232, 40)
df_Veh.shape
(647855, 56)
df_Per.shape
(910183, 24)
data.shape
(873784, 114)

Remove_Pedestrian_Crashes()


PEDS
0     833798
1      38631
2       1168
3        134
4         36
6          8
5          6
11         1
7          1
8          1
Name: count, dtype: int64

Removing 39986 crashes that involve a pedestrian.
data.shape:  (833798, 114)

Count NaN in each feature and drop column if it has NaN
For comparison, here are the YEAR value counts for the whole dataset:
YEAR
2017    127396
2019    123958
2021    122262
2020    120404
2022    120232
2018    111020
2016    108526
Name: count, dtype: int64

RELJCT1_IM 123958
YEAR
2019    123958
Name: count, dtype: int64

(833798, 111)
HITRUN_IM 362898
YEAR
2021    122262
2020    120404
2022    120232
Name: count, dtype: int64

(833798, 110)
BDYTYP_IM 242494
YEAR
2021    122262
2022    120232
Name: count, dtype: int64

(833798, 109)

Make_Imputed_Feature_Dict()
Reading in Imputed_Feature__Dict
Did the read/write work?  True

Make_Missing_Unknown_Dict()
Reading in Missing_Unknown_Dict
Did the read/write work?  True

Consolidate_Values_following_CRSS_Imputation()
ALCOHOL should change 8 to 2
ALCOHOL
2    589293
9    210364
1     34027
8       114
Name: count, dtype: int64
ALCOHOL
2    589407
9    210364
1  