In [1]:
%%latex
\tableofcontents


# Explanation

- This second notebook gives us guidance on how to bin each of the categorical features into fewer categories.  Binary classification algorithms generally work best with fewer than ten categories per feature, but some of our features have hundreds of unique values.  

- Running this code is not necessary for running the next notebook, Ambulance_Dispatch_03_Bin_Data.ipynb, which starts with the output from the previous notebook, Ambulance_Dispatch_01_Get_Data.ipynb.

## Binning by Human Understanding of Meaning of Codes:  HOSPITAL

Some of the features, like HOSPITAL, we can bin by hand, using human understanding of what each code signifies.  We are only interested in whether the crash person went to the hospital; we do not really care how the person got there, but the CRSS codes differentiate six ways a person might have gone to the hospital and two ways the information can be unknown.  We will keep code 0 as 0 ("No"), bin CRSS codes 1-6 together as 1 ("Yes"), and bin codes 8-9 as 'Unknown', to be imputed later.  

| CRSS Attribute Code | Meaning | Our Bin |
|---|---|---|
| 0  | Not Transported for Treatment  | 0 |
| 1  | EMS Air  | 1 |
| 2  | Law Enforcement  | 1 |
| 3  | EMS Unknown Mode  | 1 |
| 4  | Transported Unknown Source  | 1 |
| 5  | EMS Ground  | 1 |
| 6  | Other  | 1 |
| 8  | Not Reported  | 'Unknown' |
| 9  | Reported as Unknown  | 'Unknown' |
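
The table above can be applied with a plain dictionary lookup. The following is a minimal sketch, not the code of the next notebook; the names `HOSPITAL_BINS` and `bin_hospital` and the sample codes are made up for illustration.

```python
import pandas as pd

# Mapping copied from the table above.
# 'Unknown' is a placeholder to be imputed later.
HOSPITAL_BINS = {
    0: 0,
    1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1,
    8: 'Unknown', 9: 'Unknown',
}

def bin_hospital(codes):
    """Map raw CRSS HOSPITAL codes to 0, 1, or 'Unknown' (hypothetical helper)."""
    return codes.map(HOSPITAL_BINS)

raw = pd.Series([0, 5, 1, 9, 6, 8, 0])  # made-up sample codes
binned = bin_hospital(raw)
print(binned.tolist())
```

Any code missing from the dictionary would come out as NaN, which makes unexpected codes easy to spot.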

## Binning of Ordered Codes:  HOUR

The crash hours are discrete but ordered.  When binning "similar" times together, we look for "similar" in terms of the likelihood that, at that time of day, a crash person will go to the hospital.  In the table below, crashes at midnight (HOUR = 0) account for 1.3% of crash persons, and 23% of those crash persons went to the hospital.  Between 4 and 7 am the likelihood that a crash person will go to the hospital drops sharply, by about 4 percentage points each hour.  The percentage of crash persons going to the hospital then stays about the same until 6 pm, when it starts to rise again.  Where exactly to cut the bins is a somewhat arbitrary decision.  

| HOUR | % of Crash Persons | % to Hospital | Our Bin | Meaning |
|---|---|---|---|---|
| 0 | 1.344 | 23.0705 | 6 | Late Night |
| 1 | 1.0665 | 26.5419 | 6 | Late Night |
| 2 | 0.9483 | 26.6044 | 6 | Late Night |
| 3 | 0.7306 | 26.8372 | 6 | Late Night |
| 4 | 0.7089 | 25.6741 | 6 | Late Night |
| 5 | 1.1962 | 21.2548 | 0 | Early Morning |
| 6 | 2.4013 | 17.1788 | 0 | Early Morning |
| 7 | 4.6948 | 13.1665 | 1 | Morning |
| 8 | 4.5183 | 13.3792 | 1 | Morning |
| 9 | 3.8258 | 14.3828 | 1 | Morning |
| 10 | 4.079 | 14.8444 | 1 | Morning |
| 11 | 5.0904 | 14.1232 | 2 | Mid-Day |
| 12 | 6.2797 | 13.4761 | 2 | Mid-Day |
| 13 | 6.2852 | 14.0212 | 2 | Mid-Day |
| 14 | 7.0741 | 14.1841 | 2 | Mid-Day |
| 15 | 8.6077 | 12.9617 | 3 | Rush Hour |
| 16 | 8.6935 | 13.3526 | 3 | Rush Hour |
| 17 | 9.3121 | 12.6166 | 3 | Rush Hour |
| 18 | 7.0147 | 14.0458 | 4 | Early Evening |
| 19 | 4.8031 | 16.2731 | 4 | Early Evening |
| 20 | 3.7818 | 17.9284 | 5 | Evening |
| 21 | 3.2342 | 18.7128 | 5 | Evening |
| 22 | 2.4795 | 20.4524 | 5 | Evening |
| 23 | 1.8303 | 22.846 | 6 | Late Night |

![image.png](attachment:image.png)
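
Because the hours are consecutive integers, the bins in the table can be written as a 24-entry list indexed by HOUR. This is only a sketch of one way to encode the table; `HOUR_BINS` and `bin_hour` are hypothetical names, and the sample hours are made up.

```python
import pandas as pd

# Bin numbers from the table above: list index = HOUR, value = bin.
HOUR_BINS = (
    [6] * 5      # 0-4    Late Night
    + [0] * 2    # 5-6    Early Morning
    + [1] * 4    # 7-10   Morning
    + [2] * 4    # 11-14  Mid-Day
    + [3] * 3    # 15-17  Rush Hour
    + [4] * 2    # 18-19  Early Evening
    + [5] * 3    # 20-22  Evening
    + [6]        # 23     Late Night
)

def bin_hour(hours):
    """Map HOUR values 0-23 to bin numbers; anything else becomes NaN."""
    lookup = dict(enumerate(HOUR_BINS))
    return hours.map(lookup)

hours = pd.Series([0, 6, 7, 14, 17, 19, 22, 23])  # made-up sample
hour_bins = bin_hour(hours)
print(hour_bins.tolist())
```

The unknown code HOUR = 99 has no entry in the lookup, so it maps to NaN rather than to a bin.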

## Automated Binning:  BODY_TYP

The CRSS dataset in these six years differentiates 68 different vehicle body types.  Some of them, like "4: 4-Door Sedan, Hardtop" and "14: Compact Utility", are common, with 36% and 16% of crash persons, respectively.  Some, like "21: Large Van", are less common (1%), and some are rare, like "32: Pickup With Slide-in Camper (2016-2017 Only)" (0.0007%).  

We want to put the 68 codes into about five bins by likelihood of going to the hospital.

To bin a feature like BODY_TYP, the code in this notebook orders the CRSS codes by proportion of crash persons going to the hospital, then assigns the codes to about five bins so that approximately the same number of crash persons are in each bin.  Large categories like "4: 4-Door Sedan, Hardtop" will be their own bin.  
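
The idea in the paragraph above can be sketched as a greedy pass over the codes. This is an illustration under stated assumptions, not the notebook's exact loop: the function name `greedy_bins` is hypothetical, the thresholds (a code above 10% counts as large, a bin closes once it passes 20%) are borrowed from the constants in `Correlation_by_Value`, and the rows are a hand-picked subset of the BODY_TYP table.

```python
def greedy_bins(rows, big=10.0, target=20.0):
    """Group codes into bins (hypothetical helper).

    rows: list of (code, share_pct, hospital_pct) tuples.
    Codes are ordered by hospital_pct, highest first.  A code whose
    share exceeds `big` gets a bin to itself; otherwise the open bin
    is closed once its running share exceeds `target`.
    """
    ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    bins, current, running = [], [], 0.0
    for code, share, _ in ordered:
        if share > big:
            if current:              # close the open bin first
                bins.append(current)
            bins.append([code])      # large code stands alone
            current, running = [], 0.0
            continue
        current.append(code)
        running += share
        if running > target:
            bins.append(current)
            current, running = [], 0.0
    if current:                      # leftover codes form the last bin
        bins.append(current)
    return bins

# (code, % of crash persons, % to hospital), subset of the BODY_TYP table
rows = [
    (1, 0.6528, 16.2667), (2, 3.0509, 16.085), (19, 0.9264, 15.9881),
    (4, 36.1961, 15.9386), (30, 0.3609, 15.7154), (5, 2.5745, 14.8298),
    (14, 16.1724, 13.823), (20, 4.1037, 12.9021),
]
bins = greedy_bins(rows)
print(bins)
```

With only eight codes the small bins stay far below 20%, but the two large codes, 4 and 14, still land in bins of their own.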

Unsurprisingly, many of the codes in the most dangerous bin are motorcycles, and many of the codes in the least dangerous bin are large trucks.  

The table below shows some of the data that the code considers when cutting the bins.  CRSS codes 4 and 14 are large enough to get their own bins.  Codes 20 and 34 are not large enough to get their own bins, but too large to be in the same bin.  This notebook outputs such a table for each feature in $\LaTeX$ format.

| BODY_TYP | % of Crash Persons | % to Hospital | Our Bin |
|---|---|---|---|
| 86 | 0.0003 | 100.0000 | 0 |
| ... | | | |
| 1 | 0.6528 | 16.2667 | 0 |
| 2 | 3.0509 | 16.085 | 0 |
| 19 | 0.9264 | 15.9881 | 0 |
| 52 | 0.1562 | 15.9703 | 0 |
| 59 | 0.0309 | 15.9624 | 0 |
|||||
| 4 | 36.1961 | 15.9386 | 1 |
|||||
| 30 | 0.3609 | 15.7154 | 2 |
| 5 | 2.5745 | 14.8298 | 2 |
| 9 | 2.8609 | 14.6233 | 2 |
| 10 | 0.0142 | 14.2857 | 2 |
| 91 | 0.001 | 14.2857 | 2 |
| 6 | 5.2201 | 14.1666 | 2 |
| 16 | 0.2965 | 14.09 | 2 |
|||||
| 14 | 16.1724 | 13.823 | 3 |
|||||
| 22 | 0.0132 | 13.1868 | 4 |
| 20 | 4.1037 | 12.9021 | 4 |
| 40 | 0.0717 | 12.5506 | 4 |
|||||
| 34 | 9.824 | 11.7167 | 5 |
| 29 | 0.2 | 11.4576 | 5 |
| 15 | 5.4602 | 11.2749 | 5 |
| 31 | 1.4085 | 11.2358 | 5 |
| 17 | 0.0149 | 10.6796 | 5 |
| ... | | | |
| 41 | 0.0001 | 0.0000 | 5 |

The cell below is actual output from this notebook, in a format we could cut and paste into the next notebook.  The comments show the percentage of crash persons in each bin.

In [2]:
A = [
        ['0', [86,87,82,83,89,81,84,80,88,85,90,11,96,95,97,45,58,12,32,8,42,3,1,2,19,52,59,]], #  9.0438 %
        ['1', [4,]], #  36.1961 %
        ['2', [30,5,9,10,91,6,16,]], #  11.3281 %
        ['3', [14,]], #  16.1724 %
        ['4', [22,20,40,]], #  4.1886 %
        ['5', [34,29,15,31,17,39,55,28,21,93,92,48,50,7,51,61,67,63,62,66,65,78,64,72,60,71,73,94,41,]], #  23.0711 %
        ['Unknowns', [98, 99, 49, 79, ]]
    ]
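
A list in this shape is easy to invert into a code-to-bin lookup. The sketch below shows one way to do that; it does not reproduce `Build_Individual_Feature_with_Dict` from the next notebook, whose body is not shown here, and `code_to_bin` and the shortened list `A_small` are hypothetical.

```python
import pandas as pd

# Shortened, made-up version of the list above
A_small = [
    ['0', [1, 2, 19]],
    ['1', [4]],
    ['Unknowns', [98, 99]],
]

def code_to_bin(bins):
    """Invert [[label, [codes...]], ...] into {code: label}."""
    return {code: label for label, codes in bins for code in codes}

lookup = code_to_bin(A_small)
body_typ = pd.Series([4, 19, 99, 2])   # made-up sample codes
labels = body_typ.map(lookup)
print(labels.tolist())
```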

# Setup
## Import Libraries

In [3]:
import sys, copy, math, time

print ('Python version: {}'.format(sys.version))

from IPython.display import display, HTML

from collections import Counter

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)

import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

print ('Finished Importing Libraries')



Python version: 3.9.16 (main, Dec  7 2022, 10:02:13) 
[Clang 14.0.0 (clang-1400.0.29.202)]
NumPy version: 1.24.0
Pandas version:  1.5.2
Finished Importing Libraries


## Import Data

In [4]:
def Import_Data():
    print ('Import_Data()')
#    filename = '../../Big_Files/CRSS_Merged_Raw_Data.csv'
    filename = '../../Big_Files/CRSS_Merged_Raw_Data_Sample.csv'
    data = pd.read_csv(filename, index_col=None)
    
    print ('data.shape: ', data.shape)
    
    return data

#Import_Data()


# Tools

## Narrow_Dataset()

In [5]:
def Narrow_Dataset(data, Features):
    print ('Narrow_Dataset()')
    # Keep only the requested columns, ordered alphabetically by name
    data_narrow = data[sorted(Features)].copy()
    
    print ()
    return data_narrow

## Feature_Names()

In [6]:
def Feature_Names(data, Named_Features):
    print ('Feature_Names')
    D = {}
    for f in Named_Features:
        g = f + 'NAME'
        A = pd.concat([data[f],data[g]], axis=1)
        A.drop_duplicates(inplace=True)
        A.dropna(inplace=True)
#        print (f)
#        print (len(A))
#        print (A.head())
#        print ()
        B = dict(zip(A[f],A[g]))
        D[f] = B
#        print (B)
#        print ()
#    print (D)
    print ()
    return D
        

## Remove_Unknowns_in_Feature()

In [7]:
def Remove_Unknowns_in_Feature(data, feature):
    
    Unknowns = {     
    # Accident
        'DAY_WEEK': [9],
        'HOUR': [99],
        'INT_HWY': [9],
        'LGT_COND': [8,9],
#        'MAN_COLL': [98,99],
        'MONTH': [],
        'REL_ROAD': [98,99],
        'RELJCT2': [98,99],
        'TYP_INT': [98,99],
        'WEATHER': [98,99],
    # Vehicle
        'BDYTYP_IM': [],
        'BODY_TYP': [98, 99, 49, 79],
        'BUS_USE': [98, 99],
        'DR_ZIP': [9998, 9999],
        'EMER_USE': [8, 9],
        'MAKE': [99],
        'MOD_YEAR': [9998, 9999],
        'MODEL': [],
        'NUMOCCS': [99],
        'VALIGN': [8, 9],
        'VNUM_LAN': [8, 9],
        'VPROFILE': [8, 9],
        'VSPD_LIM': [98, 99],
        'VSURCOND': [98, 99],
        'VTRAFCON': [97, 99],
        'VTRAFWAY': [8, 9],
    # Person
        'SEX_IM': [],
        'AGE': [998,999,],
        'HOSPITAL': [],
#        'LOCATION': [98,99,],
        'PER_TYP': [],        
    }
    

    if feature in Unknowns.keys():
        print ('Remove_Unknowns_in_Feature ', feature, Unknowns[feature], len(data[data[feature].isin(Unknowns[feature])]), ' unknown')
        data_temp = data[~data[feature].isin(Unknowns[feature])]
        return data_temp, Unknowns[feature]
    else:
        # No unknown codes are listed for this feature; return the data unchanged
        data_temp = data
        return data_temp, []
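
The filtering step inside `Remove_Unknowns_in_Feature` comes down to a boolean mask built with `isin`. A small stand-alone illustration, using a made-up frame and the BODY_TYP unknown codes from the dictionary above:

```python
import pandas as pd

toy = pd.DataFrame({'BODY_TYP': [4, 98, 14, 99, 49, 20]})  # made-up rows
unknown_codes = [98, 99, 49, 79]

# True where the row holds one of the unknown codes
is_unknown = toy['BODY_TYP'].isin(unknown_codes)

# Keep only the rows whose code is known
known = toy[~is_unknown]
print(is_unknown.sum(), known['BODY_TYP'].tolist())
```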
    


## Correlation(data, target, feature)

- Measures how strongly one feature value is associated with hospitalization
- The input data for the feature is a binary indicator:
    - 1 if the feature has that value
    - 0 if the feature does not have that value
- Input data for the target:
    - 1 if hospitalized
    - 0 if not
- Returns `(per, corr)`, both as percentages rounded to four decimal places:
    - `per`:  The percentage of all samples that have that feature value
    - `corr`:  The percentage of the samples with that feature value that were hospitalized.  This is a conditional rate, not a correlation coefficient.
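
A tiny worked example may help fix the two quantities. The frame below is made up: ten crash persons, four of whom have the feature value, and one of those four was hospitalized. The column name `IS_VALUE` is a placeholder for the binary indicator.

```python
import pandas as pd

toy = pd.DataFrame({
    'HOSPITAL': [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
    'IS_VALUE': [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# cm[t][v] counts rows with target t and indicator v
cm = pd.crosstab(toy['HOSPITAL'], toy['IS_VALUE']).values

with_value = cm[0][1] + cm[1][1]               # rows that have the value
per = round(100 * with_value / cm.sum(), 4)    # share of all rows
corr = round(100 * cm[1][1] / with_value, 4)   # hospitalized share among them
print(per, corr)
```

Here `per` is 40.0 (four of ten rows have the value) and `corr` is 25.0 (one of those four went to the hospital).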


In [8]:
def Correlation(data, target, feature):
    contingency_matrix = pd.crosstab(data[target], data[feature])
    cm = contingency_matrix.values.tolist()

    if len(cm)==2 and len(cm[0])==2:
        corr = cm[1][1] / (cm[0][1] + cm[1][1])
        per = (cm[0][1] + cm[1][1])/(cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])
    else:
        corr = 0
        per = 0
        print ('Error in Contingency Matrix Dimensions')
        print ('data[feature].unique()')
        print (data[feature].unique())

    per = round(per*100,4)
    corr = round(corr*100,4)

    return (per, corr)

In [9]:
def Correlation_by_Value(data, target, feature, Feature_Names_Dict, Unknowns):
# I decided against the np.unique because it treats each nan as a separate entry.
#    V = np.unique(data[feature].values) 
    V = data[feature].unique()
    print ('V = data[feature].unique()')
    print (V)
    B = []

    for value in V:
        A = pd.DataFrame()
        # Change to binary:  1 if feature has that value, 0 otherwise
        A[feature] = data[feature].apply(lambda x: 1 if x==value else 0)
        # Target is already binary
        A[target] = data[target]
        if feature in Feature_Names_Dict:
            if value in Feature_Names_Dict[feature]:
                name = Feature_Names_Dict[feature][value]
            else:
                name=str(value)
        else:
            name = str(value)
#        if len(name)>30:
#            name = name[:30]
        per, corr = Correlation(A, target, feature)
        B.append([feature, value, name, per, corr])
#    print (feature)

    # Reverse sort values by correlation
    B = sorted(B, key=lambda x:x[4], reverse=True)
    print ("B: ")
    for b in B:
        print (b)
    print ()
    
    # Codes are printed below exactly as they appear in the data.

    # Print grouped into 100/p blocks of same size
    print ("    feature = '%s'" % feature)
    print ('    A = [')
    p = 20
    s = 0.0 # Running total of percentage of dataset
    s2 = 0.0
    n=0
    print ("        ['%d', [" % n , end='')
    for b in B:
        t = s + b[3]
#        if b[3]<10: # If that value accounts for less than 10% of dataset...
#            s2 = s2 + b[3] # See what adding the next value will do
#        q = int(s/p)
#        r = int((t-0.001)/p)
#        if r>q or b[3]>10:
#            print ("]], # ", round(s2,4), '%', end='')
#            print (' q = ', q, ' r = ', r, ' s = ', s, ' t = ', t, ' b[3] = ', b[3] )
#            s2 = 0.0 # Running total in this block for percentage of dataset
#            n += 1
#            print ("        ['%d', [" % n , end='')
#        s = t
        s2 = s2 + b[3]
        
#        c = b[1]
#        try:
#            c = int(c)
#        except:
#            c=c
#        else:
#            c = int(c)
#        print (c, end=',')
        print (b[1], end=',')
        if b[3]>10 or s2 > 20:
#            print ("]], # ", round(b[3],4), '%')
            print ("]], # ", round(s2,4), '%')
            s2=0.0
            n += 1
            print ("        ['%d', [" % n , end='')
    print ("]], # ", round(s2,4), '%')
    print ("        ['Unknowns', [", end='')
    for u in Unknowns:
        print (u, end=', ')
    print ("]]" )
    print ('    ]')
    print ('    data = Build_Individual_Feature_with_Dict(df_Per, data, feature, A)')
    print ()
    
    C = pd.DataFrame(B)
    C.columns = ['Feature', 'Code', 'Name', 'Per', 'Corr']
#    C.drop(C[C['Per'] < 0.1].index, inplace=True)
#    print (C)
    display(C)

    # Write one LaTeX table row per code, ordered by decreasing correlation
    E = [c for c in B if c[3]>=0.0]
    with open('../Correlation_2024/Correlation_' + feature + '.tex', 'w') as TeX:
        for c in E:
            a = c[0]
            b = c[1]
            d = c[2]
            e = "{:.4f}".format(c[3])
            f = "{:.4f}".format(c[4])
            TeX.write('\t & \\verb|%s| & %s & %s & %s & %s \\cr\n' % (a,b,d,e,f))

    # Write the same rows again, this time ordered by code
    E = sorted(B, key=lambda x:x[1], reverse=False)
    with open('../Correlation_2024/Correlation_Ordered_' + feature + '.tex', 'w') as TeX:
        for c in E:
            a = c[0]
            b = c[1]
            d = c[2]
            e = "{:.4f}".format(c[3])
            f = "{:.4f}".format(c[4])
            TeX.write('\t & \\verb|%s| & %s & %s & %s & %s \\cr\n' % (a,b,d,e,f))


    print ()
    return B
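
The loop in `Correlation_by_Value` builds a fresh indicator column and cross-tabulation for every code. The same two numbers can also be obtained with a single group-by, which may be worth trying if run time becomes a concern. The following is an alternative sketch, not the notebook's method; `rate_by_value` is a hypothetical name, the frame is made up, and the target is assumed to be coded 0/1.

```python
import pandas as pd

def rate_by_value(data, target, feature):
    """Per-code share of rows ('Per') and hospitalization rate ('Corr'), in percent."""
    grouped = data.groupby(feature)[target].agg(['size', 'mean'])
    out = pd.DataFrame({
        'Per': (100 * grouped['size'] / len(data)).round(4),
        'Corr': (100 * grouped['mean']).round(4),
    })
    # Highest hospitalization rate first, as in the tables above
    return out.sort_values('Corr', ascending=False)

toy = pd.DataFrame({
    'BODY_TYP': [4, 4, 4, 4, 80, 80, 14, 14, 14, 14],
    'HOSPITAL': [1, 0, 0, 0, 1, 1, 0, 0, 1, 0],
})
table = rate_by_value(toy, 'HOSPITAL', 'BODY_TYP')
print(table)
```

Note that `groupby` drops rows whose feature value is NaN by default, while `len(data)` still counts them, so the `Per` column would not sum to 100 if blanks were present.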

In [10]:
def Correlation_All(data, target, Feature_Names_Dict):
    print ('Correlation_All')
    
    C = []
#    for feature in data:
    for feature in ['BODY_TYP']:
        data_temp, Unknowns = Remove_Unknowns_in_Feature(data, feature)
        U = data_temp[feature].unique()
#        print (feature, len(U))
        if len(U)<10000:
            B = Correlation_by_Value(
                data_temp, target, feature, Feature_Names_Dict, Unknowns
            )
            for b in B:
                C.append(b)
#            print ()
#        print ()
#    for c in C:
#        print (c)
#    print ()
    C = sorted(C, key=lambda x:x[4], reverse=True)
    D = pd.DataFrame(C)
    D.columns = ['Feature', 'Code', 'Name', 'Per', 'Corr']
    print (D)
    print ()
    
    D.drop(D[D['Per'] < 0.5].index, inplace=True)
    print (D)
    print ()
    
    # Write the rows that cover at least 0.5% of crash persons to a LaTeX table
    E = [c for c in C if c[3]>=0.5]
    with open('../Correlation_2024/Correlation.tex', 'w') as TeX:
        for c in E:
            a = c[0]
            b = c[1]
            d = c[2]
            e = "{:.4f}".format(c[3])
            f = "{:.4f}".format(c[4])
            TeX.write('\\verb|%s| & %s & %s & %s & %s \\cr\n' % (a,b,d,e,f))
    
    return 0

    

# Main()
- CPU times: user 4min 58s, sys: 24.3 s, total: 5min 22s
- Wall time: 5min 23s

In [11]:
%%time
def Main():
    target = 'HOSPITAL'
    data = Import_Data()
    
    Features = [
    # Accident Dataset
        'DAY_WEEK',
        'HOUR',
        'INT_HWY',
        'LGT_COND',
        'MONTH',
        'PERMVIT',
        'PERNOTMVIT',
        'PJ',
        'PSU',
        'PVH_INVL',
        'REGION',
        'REL_ROAD',
        'RELJCT1',
        'RELJCT2',
        'SCH_BUS',
        'TYP_INT',
        'URBANICITY',
        'VE_FORMS',
        'VE_TOTAL',
        'WEATHER',
        'WRK_ZONE',
        'YEAR',
    # Vehicle Dataset
        'BODY_TYP',
        'BUS_USE',
        'DR_ZIP',
        'EMER_USE',
        'MAKE',
        'MOD_YEAR',
        'MODEL',
        'NUMOCCS',
        'VALIGN',
        'VNUM_LAN',
        'VPROFILE',
        'VSPD_LIM',
        'VSURCOND',
        'VTRAFCON',
        'VTRAFWAY',
    # Person Dataset
        'AGE',
        'HOSPITAL',
        'PER_TYP',
        'SEX',
    ]

    # What if we don't narrow the dataset?
    data = Narrow_Dataset(data, Features)
    
    print ('Features in data, with Number of Unique Values and Number of Blank Values')
    for feature in data:
        U = data[feature].unique()
        s = data[feature].isna().sum()
        print (feature, len(U), s)
    print ()
        
#    Feature_Names_Dict = Feature_Names(data, Features)
    Feature_Names_Dict = {}

#    print (Feature_Names_Dict)
    
    Correlation_All(data, target, Feature_Names_Dict)


Main()

Import_Data()
data.shape:  (16676, 222)
Narrow_Dataset()

Features in data, with Number of Unique Values and Number of Blank Values
AGE 98 0
BODY_TYP 56 0
BUS_USE 10 0
DAY_WEEK 7 0
DR_ZIP 4516 0
EMER_USE 8 0
HOSPITAL 2 0
HOUR 25 0
INT_HWY 2 0
LGT_COND 9 0
MAKE 65 0
MODEL 124 0
MOD_YEAR 53 0
MONTH 12 0
NUMOCCS 19 0
PERMVIT 23 0
PERNOTMVIT 5 0
PER_TYP 3 0
PJ 409 0
PSU 60 0
PVH_INVL 6 0
REGION 4 0
RELJCT1 4 0
RELJCT2 14 0
REL_ROAD 13 0
SCH_BUS 2 0
SEX 4 0
TYP_INT 11 0
URBANICITY 2 0
VALIGN 7 0
VE_FORMS 12 0
VE_TOTAL 12 0
VNUM_LAN 10 0
VPROFILE 9 0
VSPD_LIM 19 0
VSURCOND 13 0
VTRAFCON 19 0
VTRAFWAY 10 0
WEATHER 13 0
WRK_ZONE 5 0
YEAR 7 0

Correlation_All
Remove_Unknowns_in_Feature  BODY_TYP [98, 99, 49, 79] 622  unknown
V = data[feature].unique()
[34 14 15  5  3 80  4 19  6 21  9 89 66 16 20 52  2 30 31 59 39 67  1 29
 62 50 83 51 61 63 40 81 48 84  8 78 64 10 96 22 58 65 55 88 17 28 92 60
 72 82  7 90]
B: 
['BODY_TYP', 82, '82', 0.0125, 100.0]
['BODY_TYP', 90, '90', 0.0062, 100.0]
['BODY_

Unnamed: 0,Feature,Code,Name,Per,Corr
0,BODY_TYP,82,82,0.0125,100.0
1,BODY_TYP,90,90,0.0062,100.0
2,BODY_TYP,89,89,0.1121,72.2222
3,BODY_TYP,84,84,0.1184,68.4211
4,BODY_TYP,80,80,2.3857,62.9243
5,BODY_TYP,83,83,0.0498,62.5
6,BODY_TYP,81,81,0.1433,60.8696
7,BODY_TYP,88,88,0.0561,55.5556
8,BODY_TYP,58,58,0.0187,33.3333
9,BODY_TYP,17,17,0.0249,25.0



     Feature  Code Name      Per      Corr
0   BODY_TYP    82   82   0.0125  100.0000
1   BODY_TYP    90   90   0.0062  100.0000
2   BODY_TYP    89   89   0.1121   72.2222
3   BODY_TYP    84   84   0.1184   68.4211
4   BODY_TYP    80   80   2.3857   62.9243
5   BODY_TYP    83   83   0.0498   62.5000
6   BODY_TYP    81   81   0.1433   60.8696
7   BODY_TYP    88   88   0.0561   55.5556
8   BODY_TYP    58   58   0.0187   33.3333
9   BODY_TYP    17   17   0.0249   25.0000
10  BODY_TYP     1    1   0.6478   22.1154
11  BODY_TYP    40   40   0.0872   21.4286
12  BODY_TYP    16   16   0.3551   21.0526
13  BODY_TYP     5    5   2.7719   17.7528
14  BODY_TYP    30   30   0.2990   16.6667
15  BODY_TYP    22   22   0.0374   16.6667
16  BODY_TYP    55   55   0.0374   16.6667
17  BODY_TYP     4    4  35.9163   16.1464
18  BODY_TYP     3    3   0.8347   15.6716
19  BODY_TYP     2    2   3.1830   14.8728
20  BODY_TYP     9    9   3.0211   14.8454
21  BODY_TYP    20   20   3.8246   14.1694
22  BODY_T