In [1]:
%%latex
\tableofcontents


# Explanation

- This second piece of the code gives us guidance on how to bin each categorical feature into fewer categories.  Binary classification algorithms generally work best with fewer than ten categories per feature, but some of our features have hundreds of unique values.

- Running this code is not necessary for running the next notebook, Ambulance_Dispatch_03_Bin_Data.ipynb, which starts with the output from the previous notebook, Ambulance_Dispatch_01_Get_Data.ipynb.

## Binning by Human Understanding of Meaning of Codes:  HOSPITAL

Some features, like HOSPITAL, we can bin by looking at what each code signifies and grouping the codes by meaning.  We are only interested in whether the crash person went to the hospital, not in how the person got there, but the CRSS codes differentiate six ways a person might have gone to the hospital and two ways the information can be unknown.  We will keep CRSS code 0 as 0 ("No"), bin codes 1-6 together as 1 ("Yes"), and bin codes 8-9 as 'Unknown', to be imputed later.

| CRSS Attribute Code | Meaning | Our Bin |
|---|---|---|
| 0  | Not Transported for Treatment  | 0 |
| 1  | EMS Air  | 1 |
| 2  | Law Enforcement  | 1 |
| 3  | EMS Unknown Mode  | 1 |
| 4  | Transported Unknown Source  | 1 |
| 5  | EMS Ground  | 1 |
| 6  | Other  | 1 |
| 8  | Not Reported  | 'Unknown' |
| 9  | Reported as Unknown  | 'Unknown' |
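
As a minimal sketch of how the table above could be applied to a column of raw codes, the snippet below uses a plain dictionary and `pandas.Series.map`.  The names `HOSPITAL_BINS` and `bin_hospital` and the example codes are made up for illustration; the next notebook may do this differently.

```python
import pandas as pd

# Mapping read off the table above: 0 stays 0, codes 1-6 become 1,
# and codes 8-9 are labelled 'Unknown' so they can be imputed later.
HOSPITAL_BINS = {0: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1,
                 8: 'Unknown', 9: 'Unknown'}

def bin_hospital(codes: pd.Series) -> pd.Series:
    """Replace raw CRSS HOSPITAL codes with their binned values."""
    return codes.map(HOSPITAL_BINS)

example = pd.Series([0, 5, 1, 9, 6, 8])  # made-up codes, not real CRSS rows
binned = bin_hospital(example)
print(binned.tolist())
```

Any code missing from the dictionary would come back as `NaN`, which makes unexpected codes easy to spot.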

## Binning of Ordered Codes:  HOUR

The crash hours are discrete but ordered.  When binning "similar" times together, we take "similar" to mean a similar likelihood that a crash person at that time of day will go to the hospital.  In the table below, crashes at midnight (HOUR = 0) account for 1.3% of crash persons, and 23% of those crash persons went to the hospital.  The likelihood of going to the hospital drops sharply between 4 am and 7 am, by about four percentage points each hour.  It then stays roughly level, between 12.6% and 14.8%, until 6 pm, when it starts to rise again.  Where exactly to cut the bins is a somewhat arbitrary decision.

| HOUR | % of Crash Persons | % to Hospital | Our Bin | Meaning |
|---|---|---|---|---|
| 0 | 1.344 | 23.0705 | 6 | Late Night |
| 1 | 1.0665 | 26.5419 | 6 | Late Night |
| 2 | 0.9483 | 26.6044 | 6 | Late Night |
| 3 | 0.7306 | 26.8372 | 6 | Late Night |
| 4 | 0.7089 | 25.6741 | 6 | Late Night |
| 5 | 1.1962 | 21.2548 | 0 | Early Morning |
| 6 | 2.4013 | 17.1788 | 0 | Early Morning |
| 7 | 4.6948 | 13.1665 | 1 | Morning |
| 8 | 4.5183 | 13.3792 | 1 | Morning |
| 9 | 3.8258 | 14.3828 | 1 | Morning |
| 10 | 4.079 | 14.8444 | 1 | Morning |
| 11 | 5.0904 | 14.1232 | 2 | Mid-Day |
| 12 | 6.2797 | 13.4761 | 2 | Mid-Day |
| 13 | 6.2852 | 14.0212 | 2 | Mid-Day |
| 14 | 7.0741 | 14.1841 | 2 | Mid-Day |
| 15 | 8.6077 | 12.9617 | 3 | Rush Hour |
| 16 | 8.6935 | 13.3526 | 3 | Rush Hour |
| 17 | 9.3121 | 12.6166 | 3 | Rush Hour |
| 18 | 7.0147 | 14.0458 | 4 | Early Evening |
| 19 | 4.8031 | 16.2731 | 4 | Early Evening |
| 20 | 3.7818 | 17.9284 | 5 | Evening |
| 21 | 3.2342 | 18.7128 | 5 | Evening |
| 22 | 2.4795 | 20.4524 | 5 | Evening |
| 23 | 1.8303 | 22.846 | 6 | Late Night |

![image.png](attachment:image.png)
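
The bin assignments in the HOUR table can be written down as a lookup from hour to bin.  The sketch below is one way to do that; `HOUR_BIN_RANGES` and `HOUR_BINS` are hypothetical names, and the cut points are simply the ones shown in the table.

```python
# Each entry is (bin label, hours in that bin), read off the table above.
# Bin 6 ("Late Night") wraps around midnight, so it is listed explicitly.
HOUR_BIN_RANGES = [
    (0, range(5, 7)),           # Early Morning: 5-6
    (1, range(7, 11)),          # Morning: 7-10
    (2, range(11, 15)),         # Mid-Day: 11-14
    (3, range(15, 18)),         # Rush Hour: 15-17
    (4, range(18, 20)),         # Early Evening: 18-19
    (5, range(20, 23)),         # Evening: 20-22
    (6, [23, 0, 1, 2, 3, 4]),   # Late Night: 23-4
]

# Flatten into a dictionary {hour: bin}.
HOUR_BINS = {hour: label for label, hours in HOUR_BIN_RANGES for hour in hours}

print(HOUR_BINS[0], HOUR_BINS[8], HOUR_BINS[17], HOUR_BINS[23])
```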

## Automated Binning:  BODY_TYP

The CRSS dataset in these six years differentiates 68 different vehicle body types.  Some of them, like "4: 4-Door Sedan, Hardtop" and "14: Compact Utility" are common, with 36% and 16% of crash persons, respectively.  Some like "21: Large Van" are less common (1%), and some are rare, like "32: Pickup With Slide-in Camper (2016-2017 Only)" (0.0007%).  

We want to put the 68 codes into about five bins by likelihood of going to the hospital.

To bin a feature like BODY_TYP, the code in this notebook orders the CRSS codes by proportion of crash persons going to the hospital, then assigns the codes to about five bins so that approximately the same number of crash persons are in each bin.  Large categories like "4: 4-Door Sedan, Hardtop" will be their own bin.  
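
The procedure just described can be sketched as a small standalone function.  This is a simplified illustration, not the notebook's exact code: the function name `greedy_bins` and the toy rows are made up, while the two thresholds (10% for a code to close its bin immediately, 21% as the cap on a bin's total) mirror the values used in `Correlation()` later in this notebook.

```python
def greedy_bins(rows, own_bin_pct=10.0, max_bin_pct=21.0):
    """Group codes into bins, ordered by hospital rate.

    rows: list of (code, pct_of_persons, pct_to_hospital) tuples.
    Returns a list of bins, each a list of codes.
    """
    # Order codes from most to least likely to go to the hospital.
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    bins, current, total = [], [], 0.0
    for i, (code, pct, _rate) in enumerate(ordered):
        current.append(code)
        total += pct
        is_last = i == len(ordered) - 1
        # Close the bin if adding the next code would push it past the cap.
        next_overflows = (not is_last) and total + ordered[i + 1][1] > max_bin_pct
        if pct > own_bin_pct or next_overflows or is_last:
            bins.append(current)
            current, total = [], 0.0
    return bins

# Toy data: (code, % of crash persons, % to hospital); percentages sum to 100.
toy_rows = [('A', 2.0, 40.0), ('B', 30.0, 20.0), ('C', 8.0, 35.0), ('D', 6.0, 18.0),
            ('E', 9.0, 15.0), ('F', 12.0, 12.0), ('G', 5.0, 10.0), ('H', 28.0, 8.0)]
toy_bins = greedy_bins(toy_rows)
print(toy_bins)
```

In the toy data, the large codes B, F, and H each end up alone in a bin, and the small codes are grouped with their neighbours in the ordering.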

Unsurprisingly, many of the codes in the most dangerous bin are motorcycles, and many of the codes in the least dangerous bin are large trucks.  

The table below shows some of the data that the code below considers when cutting the bins.  CRSS codes 4 and 14 are large enough to get their own bins.  Codes 20 and 34 are not large enough to get their own bins, but too large to be in the same bin.  This notebook outputs such a table for each feature in $\LaTeX$ format.

| BODY_TYP | % of Crash Persons | % to Hospital | Our Bin |
|---|---|---|---|
| 86 | 0.0003 | 100.0000 | 0 |
| ... | | | |
| 1 | 0.6528 | 16.2667 | 0 |
| 2 | 3.0509 | 16.085 | 0 |
| 19 | 0.9264 | 15.9881 | 0 |
| 52 | 0.1562 | 15.9703 | 0 |
| 59 | 0.0309 | 15.9624 | 0 |
|||||
| 4 | 36.1961 | 15.9386 | 1 |
|||||
| 30 | 0.3609 | 15.7154 | 2 |
| 5 | 2.5745 | 14.8298 | 2 |
| 9 | 2.8609 | 14.6233 | 2 |
| 10 | 0.0142 | 14.2857 | 2 |
| 91 | 0.001 | 14.2857 | 2 |
| 6 | 5.2201 | 14.1666 | 2 |
| 16 | 0.2965 | 14.09 | 2 |
|||||
| 14 | 16.1724 | 13.823 | 3 |
|||||
| 22 | 0.0132 | 13.1868 | 4 |
| 20 | 4.1037 | 12.9021 | 4 |
| 40 | 0.0717 | 12.5506 | 4 |
|||||
| 34 | 9.824 | 11.7167 | 5 |
| 29 | 0.2 | 11.4576 | 5 |
| 15 | 5.4602 | 11.2749 | 5 |
| 31 | 1.4085 | 11.2358 | 5 |
| 17 | 0.0149 | 10.6796 | 5 |
| ... | | | |
| 41 | 0.0001 | 0.0000 | 5 |

The cell below is actual output from this notebook, in a format we could cut and paste into the next notebook.  The comments show the percentage of crash persons in each bin.

In [2]:
A = [
        ['0', [86,87,82,83,89,81,84,80,88,85,90,11,96,95,97,45,58,12,32,8,42,3,1,2,19,52,59,]], #  9.0438 %
        ['1', [4,]], #  36.1961 %
        ['2', [30,5,9,10,91,6,16,]], #  11.3281 %
        ['3', [14,]], #  16.1724 %
        ['4', [22,20,40,]], #  14.0126 %
        ['5', [34,29,15,31,17,39,55,28,21,93,92,48,50,7,51,61,67,63,62,66,65,78,64,72,60,71,73,94,41,]], #  13.2471 %
        ['Unknowns', [98, 99, 49, 79, ]]
    ]
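
A list in this format can be turned into a code-to-bin lookup with a dictionary comprehension.  The sketch below assumes the same `[label, [codes]]` layout as the cell above but uses a shortened, illustrative list; how the next notebook actually consumes the list is not shown here, so treat `code_to_bin` and the sample column as hypothetical.

```python
import pandas as pd

# Shortened list in the same [label, [codes]] layout as the cell above.
bins = [
    ['0', [86, 1, 2]],
    ['1', [4]],
    ['Unknowns', [98, 99]],
]

# Invert the list into {code: label} so each raw code can be looked up directly.
code_to_bin = {code: label for label, codes in bins for code in codes}

body_typ = pd.Series([4, 2, 99, 86])  # made-up BODY_TYP codes
body_typ_binned = body_typ.map(code_to_bin)
print(body_typ_binned.tolist())
```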

# Setup
## Import Libraries

In [3]:
import sys, copy, math, time

print ('Python version: {}'.format(sys.version))

from IPython.display import display, HTML

from collections import Counter

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)

import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import json # We will use json ('JavaScript Object Notation') to write and read dictionaries to/from files
print ('JSON version:  {}'.format(json.__version__))


print ('Finished Importing Libraries')



Python version: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
NumPy version: 1.24.2
Pandas version:  1.5.3
JSON version:  2.0.9
Finished Importing Libraries


## Import Data
- Read the data file 
- Drop the NAME columns (text labels for the codes), the imputed columns (names containing `_IM`), and CASENUM
- Read in the dictionary of feature values signifying "Missing" or "Unknown."

In [4]:
def Import_Stuff():
    print ('Import_Stuff()')
#    filename = '../../Big_Files/CRSS_Merged_Raw_Data.csv'
    filename = '../../Big_Files/CRSS_Merged_Raw_Data_Sample.csv'
    data = pd.read_csv(filename, index_col=None, low_memory=False)
    print ('data.shape: ', data.shape)

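    # Drop the NAME (text label) and _IM (imputed) columns, plus the case identifier.
    # Collecting the names first avoids dropping columns while iterating over them.
    drop_columns = [feature for feature in data.columns if 'NAME' in feature or '_IM' in feature]
    data.drop(columns=drop_columns + ['CASENUM'], inplace=True)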
        
    print ('data.shape: ', data.shape)
    print ()
    
    print ('Reading in Missing/Unknown Dictionary')
    filename = '../../Big_Files/Missing_Unknown.json'
    with open(filename) as json_file:
        Missing_Unknown = json.load(json_file)
    print ()

    
    return data, Missing_Unknown

#Import_Stuff()


In [5]:
def Remove_Unknowns_in_Feature(data, Missing_Unknown, feature):
    print ('Remove_Unknowns_in_Feature()')

    # Drop rows with NaN in any column (here, the target or the feature).
    data = data.dropna()

    # Drop rows whose value for this feature is one of its "Missing"/"Unknown" codes.
    # A single isin() filter handles every code at once, so no loop is needed.
    if feature in Missing_Unknown:
        data = data[~data[feature].isin(Missing_Unknown[feature])]
    return data
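
To illustrate the filtering step on its own, here is a small sketch with made-up data.  It assumes the Missing/Unknown dictionary maps each feature name to a list of codes, which is how the function above uses it; the dictionary contents and the rows below are hypothetical, not taken from `Missing_Unknown.json`.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary loaded from Missing_Unknown.json.
missing_unknown = {'BODY_TYP': [98, 99]}

toy = pd.DataFrame({
    'HOSPITAL': [0, 1, 0, 1],
    'BODY_TYP': [4, 98, 14, 99],
})

# Keep only the rows whose BODY_TYP is not one of the "unknown" codes.
known = toy[~toy['BODY_TYP'].isin(missing_unknown['BODY_TYP'])]
print(known['BODY_TYP'].tolist())
```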
    
    
    

In [6]:
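def Contingency(data, target, feature):
    # Both columns are expected to be binary (0/1), so the crosstab is 2x2:
    # rows are target = 0/1 and columns are feature = 0/1.
    contingency_matrix = pd.crosstab(data[target], data[feature])
    cm = contingency_matrix.values.tolist()

    if len(cm)==2 and len(cm[0])==2:
        # corr: of the rows with feature == 1, the share that also have target == 1
        corr = cm[1][1] / (cm[0][1] + cm[1][1])
        # per: the share of all rows that have feature == 1
        per = (cm[0][1] + cm[1][1])/(cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])
    else:
        # Happens when either column has only one distinct value.
        corr = 0
        per = 0
        print ('Error in Contingency Matrix Dimensions')
        print ('data[feature].unique()')
        print (data[feature].unique())

    # Report both as percentages, rounded to four decimal places.
    per = round(per*100,4)
    corr = round(corr*100,4)

    return per, corr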
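
A worked example may make the two numbers returned by `Contingency()` more concrete.  The sketch below builds a ten-row toy table (the column name `FLAG` and all values are made up) and repeats the same arithmetic by hand.

```python
import pandas as pd

toy = pd.DataFrame({
    'HOSPITAL': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
    'FLAG':     [1, 1, 0, 0, 0, 0, 0, 1, 0, 0],  # 1 if the row has the code of interest
})

# Rows of the crosstab are HOSPITAL = 0/1; columns are FLAG = 0/1.
cm = pd.crosstab(toy['HOSPITAL'], toy['FLAG']).values.tolist()
print(cm)

flagged = cm[0][1] + cm[1][1]            # rows with FLAG == 1
total = sum(cm[0]) + sum(cm[1])          # all rows
per = round(flagged / total * 100, 4)    # % of crash persons with this code
corr = round(cm[1][1] / flagged * 100, 4)  # % of those who went to the hospital
print(per, corr)
```

Here three of the ten rows carry the code, and two of those three went to the hospital, so `per` is 30% and `corr` is about 66.67%.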

In [7]:
def Correlation(data, target, Missing_Unknown):
    print ("Correlation()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
    for feature in data:
        print (feature)
        A = []
        D = {}
        if feature not in [target]:
            A = data[[target,feature]].copy()
            A = Remove_Unknowns_in_Feature(A, Missing_Unknown, feature)
            U = np.sort(A[feature].unique())
#            print (U)
#            print ()
            
            # For each value of the feature, one-hot encode it and measure
            # per (% of crash persons with that value) and
            # corr (% of those crash persons who went to the hospital).
            C = []
            for value in U:
                B = A[[target]].copy()
                B[feature] = (A[feature] == value).astype(int)
                per, corr = Contingency(B, target, feature)
                C.append([value, per, corr])
            # Order the values from most to least likely to go to the hospital.
            C.sort(key=lambda x:x[2], reverse=True)

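            # Walk down the ordered values, collecting them into bins.
            E = []   # values in the current bin
            j = 0    # label of the current bin
            s = 0.0  # % of crash persons in the current bin so far
            for i, c in enumerate(C):
                E.append(c[0])
                s += c[1]
                # Close the current bin when
                #   - this value alone covers more than 10% of crash persons, or
                #   - adding the next value would push the bin past 21%, or
                #   - this is the last value.
                if (
                    c[1] > 10 or
                    (i<len(C)-1 and s + C[i+1][1] > 21) or
                    i==len(C)-1
                ):
                    D.update({j:E})
                    j += 1
                    E = []
                    s = 0.0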
        Correlation_Dictionary.update({feature:D})
            
    return Correlation_Dictionary
            
        

In [None]:
def Main():
    target = 'HOSPITAL'
    data, Missing_Unknown = Import_Stuff()
    Correlation_Dictionary = Correlation(data, target, Missing_Unknown)
    
    with open("../../Big_Files/Correlation_Dictionary.json", "w") as outfile: 
        json.dump(Correlation_Dictionary, outfile, default=str)
        
    print ('Reading in Correlation Dictionary')
    with open('../../Big_Files/Correlation_Dictionary.json') as json_file:
        C = json.load(json_file)
    print (C)
    
Main()

Import_Stuff()
data.shape:  (83380, 222)
data.shape:  (83380, 89)

Reading in Missing/Unknown Dictionary

Correlation()
HOSPITAL
ACC_TYPE
Remove_Unknowns_in_Feature()
AGE
Remove_Unknowns_in_Feature()
AIR_BAG
Remove_Unknowns_in_Feature()
ALCOHOL
Remove_Unknowns_in_Feature()
ALC_RES
Remove_Unknowns_in_Feature()
ALC_STATUS
Remove_Unknowns_in_Feature()
BODY_TYP
Remove_Unknowns_in_Feature()
BUS_USE
Remove_Unknowns_in_Feature()
CARGO_BT
Remove_Unknowns_in_Feature()
DAY_WEEK
Remove_Unknowns_in_Feature()
DEFORMED
Remove_Unknowns_in_Feature()
DRINKING
Remove_Unknowns_in_Feature()
DRUGS
Remove_Unknowns_in_Feature()
DR_PRES
Remove_Unknowns_in_Feature()
DR_ZIP
Remove_Unknowns_in_Feature()
EJECTION
Remove_Unknowns_in_Feature()
EMER_USE
Remove_Unknowns_in_Feature()
FIRE_EXP
Remove_Unknowns_in_Feature()
HARM_EV
Remove_Unknowns_in_Feature()
HAZ_CNO
Remove_Unknowns_in_Feature()
HAZ_INV
Remove_Unknowns_in_Feature()
HAZ_PLAC
Remove_Unknowns_in_Feature()
HAZ_REL
Remove_Unknowns_in_Feature()
HIT_RUN
Remove