In [1]:
%%latex
\tableofcontents

<IPython.core.display.Latex object>

# Explanation

- This second piece of the code gives us guidance in how to bin each of the categorical features into fewer categories.  Binary classification algorithms generally work best with fewer than ten categories per feature, but some of our features have hundreds of unique values.  

- None of the CRSS features are floats.  All of the values are integers.  
    - Some of them, like HOUR, are ordered.
        - These ordered features we will bin by hand.
    - Many of them, like DR_ZIP, driver's ZIP code, are unordered.
        - For the unordered features, we impose an order by the percent of crash persons with that value who were hospitalized.  
        - Once we have an order, we slice it into 5-10 bins.

- The output of this code is a dictionary for automatic binning by correlation with the target variable, HOSPITAL.



## Binning by Human Understanding of Meaning of Codes:  HOSPITAL

Some of the features, like HOSPITAL, we can bin by looking at what each value signifies and putting them together using human understanding of what the codes signify.  We are only interested in whether the crash person went to the hospital; we do not really care how the person got to the hospital, but the CRSS codes differentiate six ways a person might have gone to the hospital and two ways the information can be unknown.  We will bin CRSS codes 1-6 together as "Yes" and codes 8-9 as 'Unknown,' to be imputed later.  

| CRSS Attribute Code | Meaning | Our Bin |
|---|---|---|
| 0  | Not Transported for Treatment  | 0 |
| 1  | EMS Air  | 1 |
| 2  | Law Enforcement  | 1 |
| 3  | EMS Unknown Mode  | 1 |
| 4  | Transported Unknown Source  | 1 |
| 5  | EMS Ground  | 1 |
| 6  | Other  | 1 |
| 8  | Not Reported  | 'Unknown' |
| 9  | Reported as Unknown  | 'Unknown' |

## Binning of Ordered Codes:  HOUR

The crash hours are discrete but ordered.  When binning "similar" times together, we look for "similar" in terms of the likelihood that, at that time of day, a crash person will go to the hospital.  In the table below, crashes at midnight (HOUR = 0) account for 1.5% of crash persons, and 23% of those crash persons went to the hospital.  We see a significant drop between 4 and 7 am in the likelihood that a crash person will go to the hospital, about a 4% drop each hour.  The percentage of crash person going to the hospital stays about the same until 6pm, when it starts to rise again.  Where exactly to cut the bins is a somewhat arbitrary decision.  

| HOUR | % of Crash Persons | % to Hospital | Our Bin | Meaning |
|---|---|---|---|---|
| 0 | 1.344 | 23.0705 | 6 | Late Night |
| 1 | 1.0665 | 26.5419 | 6 | Late Night |
| 2 | 0.9483 | 26.6044 | 6 | Late Night |
| 3 | 0.7306 | 26.8372 | 6 | Late Night |
| 4 | 0.7089 | 25.6741 | 6 | Late Night |
| 5 | 1.1962 | 21.2548 | 0 | Early Morning |
| 6 | 2.4013 | 17.1788 | 0 | Early Morning |
| 7 | 4.6948 | 13.1665 | 1 | Morning |
| 8 | 4.5183 | 13.3792 | 1 | Morning |
| 9 | 3.8258 | 14.3828 | 1 | Morning |
| 10 | 4.079 | 14.8444 | 1 | Morning |
| 11 | 5.0904 | 14.1232 | 2 | Mid-Day |
| 12 | 6.2797 | 13.4761 | 2 | Mid-Day |
| 13 | 6.2852 | 14.0212 | 2 | Mid-Day |
| 14 | 7.0741 | 14.1841 | 2 | Mid-Day |
| 15 | 8.6077 | 12.9617 | 3 | Rush Hour |
| 16 | 8.6935 | 13.3526 | 3 | Rush Hour |
| 17 | 9.3121 | 12.6166 | 3 | Rush Hour |
| 18 | 7.0147 | 14.0458 | 4 | Early Evening |
| 19 | 4.8031 | 16.2731 | 4 | Early Evening |
| 20 | 3.7818 | 17.9284 | 5 | Evening |
| 21 | 3.2342 | 18.7128 | 5 | Evening |
| 22 | 2.4795 | 20.4524 | 5 | Evening |
| 23 | 1.8303 | 22.846 | 6 | Late Night |![image.png](attachment:image.png)

## Automated Binning:  BODY_TYP

The CRSS dataset in these six years differentiates 68 different vehicle body types.  Some of them, like "4: 4-Door Sedan, Hardtop" and "14: Compact Utility" are common, with 36% and 16% of crash persons, respectively.  Some like "21: Large Van" are less common (1%), and some are rare, like "32: Pickup With Slide-in Camper (2016-2017 Only)" (0.0007%).  

We want to put the 68 codes into about five bins by likelihood of going to the hospital.

To bin a feature like BODY_TYP, the code in this notebook orders the CRSS codes by proportion of crash persons going to the hospital, then assigns the codes to about five bins so that approximately the same number of crash persons are in each bin.  Large categories like "4: 4-Door Sedan, Hardtop" will be their own bin.  

Unsurprisingly, many of the codes in the most dangerous bin are motorcycles, and many of the codes in the least dangerous bin are large trucks.  

The table below shows some of the data that the code below considers when cutting the bins.  CRSS codes 4 and 14 are large enough to get their own bins.  Codes 20 and 34 are not large enough to get their own bins, but too large to be in the same bin.  This notebook ouputs such a table for each feature in $\LaTeX$ format.

| BODY_TYP | % of Crash Persons | % to Hospital | Our Bin |
|---|---|---|---|
| 86 | 0.0003 | 100.0000 | 0 |
| ... | | | |
| 1 | 0.6528 | 16.2667 | 0 |
| 2 | 3.0509 | 16.085 | 0 |
| 19 | 0.9264 | 15.9881 | 0 |
| 52 | 0.1562 | 15.9703 | 0 |
| 59 | 0.0309 | 15.9624 | 0 |
|||||
| 4 | 36.1961 | 15.9386 | 1 |
|||||
| 30 | 0.3609 | 15.7154 | 2 |
| 5 | 2.5745 | 14.8298 | 2 |
| 9 | 2.8609 | 14.6233 | 2 |
| 10 | 0.0142 | 14.2857 | 2 |
| 91 | 0.001 | 14.2857 | 2 |
| 6 | 5.2201 | 14.1666 | 2 |
| 16 | 0.2965 | 14.09 | 2 |
|||||
| 14 | 16.1724 | 13.823 | 3 |
|||||
| 22 | 0.0132 | 13.1868 | 4 |
| 20 | 4.1037 | 12.9021 | 4 |
| 40 | 0.0717 | 12.5506 | 4 |
|||||
| 34 | 9.824 | 11.7167 | 5 |
| 29 | 0.2 | 11.4576 | 5 |
| 15 | 5.4602 | 11.2749 | 5 |
| 31 | 1.4085 | 11.2358 | 5 |
| 17 | 0.0149 | 10.6796 | 5 |
| ... | | | |
| 41 | 0.0001 | 0.0000 | 5 |

- The cell below is a sample of the kind of dictionary entry this code generates and saves in JSON format as 'Correlation_Dictionary.json'.  
- ACC_TYPE is 99 categories of accident type.
- The percentages indicate the percentage of samples in that bin.

In [2]:
Sample = """
{
    ACC_TYPE:
    {
        0: [41, 55, 60, 61, 51, 34, 59, 50, 4, 52, 16, 10, 6, 53, 14, 1, 7, 2, 58, 5, 8, 89, 0], # 12.02%
        1: [69, 3, 87, 66, 64, 9, 83, 38, 65, 86], # 13.26%
        2: [88, 68, 90, 85, 62, 91, 82, 26, 22, 30, 31, 77, 25], # 14.03%
        3: [98], # 9.53%
        4: [27, 11, 71, 79, 12, 73, 24, 32, 67, 29, 39, 72, 48], # 11.39%
        5: [21], # 8.92%
        6: [33, 15, 80, 76, 44, 28, 49, 75, 81, 78, 23], # 8.4%
        7: [20], # 8.99%
        8: [45, 13, 84, 47, 70, 74, 46, 93, 92, 40, 42, 43, 56], # 13.44%
    }
}
"""

## Performance Notes

- I tried three methods for measuring correlation between the values of a feature and the target:
    - The first, Correlation(), loops over each value of the feature, making (in a custom way) a one-hot encoding, then uses the Pandas crosstabs function to make a confusion matrix.  Takes over an hour.
    - The next, Correlation_One_Hot(), makes a one-hot encoding of the feature, perhaps creating thousands of new features, and loops over those features to make a confusion matrix for each one.  Takes about ten minutes.
    - The third method, Correlation_Count(), takes about one minute to run the whole notebook, most of which time is loading the data and narrowing the features. For each feature, 
        - makes a new dataframe of that feature and the target, 
        - loops over each unique value
            - filters the new dataframe for just that value of the feature,
            - finds the sum of the target value in that filtered dataframe,
            - and calculates the correlation from those.
- I tested the three methods with the "ACC_TYPE" and "DR_ZIP" features, and all three methods gave the same correlation library.
  


# Setup
## Import Libraries

In [3]:
%matplotlib notebook

import sys, copy, math, time

print ('Python version: {}'.format(sys.version))

from IPython.display import display, HTML

from collections import Counter

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)

from numpy import array, linspace

import scipy as sc
print ('SciPy version: {}'.format(sc.__version__))
from scipy.signal import argrelextrema

import sklearn
print ('sklearn version: {}'.format(sklearn.__version__))
from sklearn.neighbors import KernelDensity

import matplotlib
print ('Matplotlib version: {}'.format(matplotlib.__version__))
from matplotlib.pyplot import plot

import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)
pd.options.mode.chained_assignment = None 

import json # We will use json ('JavaScript Object Notation') to write and read dictionaries to/from files
print ('JSON version:  {}'.format(json.__version__))


print ('Finished Importing Libraries')



Python version: 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ]
NumPy version: 1.24.2
SciPy version: 1.7.3




sklearn version: 1.2.2
Matplotlib version: 3.7.1
Pandas version:  1.5.3
JSON version:  2.0.9
Finished Importing Libraries


## Import Data
- Read the data file 
- Take out the NAME files and the IMputed files
- Read in the dictionary of feature values signifying "Missing" or "Unknown."

In [4]:
def Import_Stuff(file_number):
    print ('Import_Stuff()')
    filename_dict = {
        0: '../../Big_Files/CRSS_Merged_Raw_Data.csv',
        1: '../../Big_Files/CRSS_Merged_Raw_Data_Sample_frac_01.csv',
        2: '../../Big_Files/CRSS_Merged_Raw_Data_Sample_n_1000.csv'
    }
    filename = filename_dict[file_number]
    print (filename)
    data = pd.read_csv(filename, index_col=None, low_memory=False)
    print ('data.shape: ', data.shape)

    for feature in data:
        if 'NAME' in feature or '_IM' in feature:
            data.drop(columns=[feature], inplace=True)

    print ('data.shape: ', data.shape)
    print ()
    
    print ('Reading in Missing/Unknown Dictionary')
    filename = '../../Big_Files/Missing_Unknown.json'
    with open(filename) as json_file:
        Missing_Unknown = json.load(json_file)
    print ()

    
    return data, Missing_Unknown

#Import_Data()


In [5]:
def Remove_Unknowns_in_Feature(data, Missing_Unknown, feature):
#    print ('Remove_Unknowns_in_Feature()')
#    print (feature)

    data.dropna(subset=[feature], inplace=True)

    if feature in Missing_Unknown.keys():
         data = data[~data[feature].isin(Missing_Unknown[feature])]
#        print (data.shape)
#        print (data[feature].unique())
#        print ()
#    print ()
    return data
    
    
    

In [6]:
def Contingency(data, target, feature):
    contingency_matrix = pd.crosstab(data[target], data[feature])
    cm = contingency_matrix.values.tolist()

    if len(cm)==2 and len(cm[0])==2:
        corr = cm[1][1] / (cm[0][1] + cm[1][1])
        per = (cm[0][1] + cm[1][1])/(cm[0][0] + cm[0][1] + cm[1][0] + cm[1][1])
    else:
        corr = 0
        per = 0
        print ('Error in Contingency Matrix Dimensions')
        print ('data[feature].unique()')
        print (data[feature].unique())

    per = round(per*100,6)
    corr = round(corr*100,6)

    return per, corr

In [7]:
def Correlation(data, target, Missing_Unknown):
    print ("Correlation()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
    for feature in data:
        print (feature)
        A = []
        D = {}
        if feature not in [target]:
            A = data[[target,feature]].copy()
            A = Remove_Unknowns_in_Feature(A, Missing_Unknown, feature)
            U = np.sort(A[feature].unique())
#            print (U)
#            print ()
            
            C = []
            for i, value in enumerate(U):
                if i%1000==0:
                    print (i, value)
                # Track progress through features with many values, like DR_ZIP
#                print (value)
                B = A[[target]].copy()
                B[feature] = A[feature].apply(lambda x: 1 if x==value else 0)
#                display (B.head())
#                print ()
                per, corr = Contingency(B, target, feature)
#                print ('    ', value, per, corr)        
                C.append([value, per, corr])
            C.sort(key=lambda x:x[2], reverse=True)
#            display(C)

            E = []
            j = 0
            s = 0.0
            for i, c in enumerate(C):
                E.append(c[0])
                s += c[1]
#                print (len(C), i, c, j, E, s)
                if (
                    c[1] > 8 or 
                    (i<len(C)-1 and s + C[i+1][1] > 15) or 
                    i==len(C)-1
                ):
                    D.update({j:E})
#                    print (j, ': ', E, ', # ', round(s,2), '%', sep='')
#                    print (D)
#                    print (s)
#                    print ()
                    j += 1
                    E = []
                    s = 0.0
#            print ()
        Correlation_Dictionary.update({feature:D})
            
    return Correlation_Dictionary
            
        

In [8]:
def Correlation_One_Hot(data, target, Missing_Unknown):
    print ("Correlation_One_Hot()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
    Target = data[target].copy()
    
    for feature in data:
#    for feature in ['ACC_TYPE']:
        print (feature)
        D = {}
        if feature != target:
            A = Remove_Unknowns_in_Feature(data[[target, feature]], Missing_Unknown, feature)
            B = pd.get_dummies(A[feature])
            B[target] = A[target]
#            B.columns = B.columns.astype(str)
            U = sorted(A[feature].unique())
#            U = [str(u) for u in U]
            C = []
            
            for i, u in enumerate(U): # Loop over the unique values of the feature
                per, corr = Contingency(B[[target, u]], target, u)
                C.append([u, per, corr])
                
            C.sort(key=lambda x:x[2], reverse=True)
#            display(C)
#            print (sum([c[1] for c in C]))
            
            E = []
            j = 0
            s = 0.0
            for i, c in enumerate(C):
                E.append(c[0])
                s += c[1]
#                print (len(C), i, c, j, E, s)
                if (
                    c[1] > 8 or 
                    (i<len(C)-1 and s + C[i+1][1] > 15) or 
                    i==len(C)-1
                ):
                    D.update({j:E})
#                    print (j, ': ', E, ', # ', round(s,2), '%', sep='')
#                    print (D)
#                    print (s)
#                    print ()
                    j += 1
                    E = []
                    s = 0.0
#            print ()
        Correlation_Dictionary.update({feature:D})
            
    return Correlation_Dictionary
            
        

In [9]:
def Correlation_Count(data, target, Missing_Unknown):
    print ("Correlation_Count()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
    for feature in data:
#    for feature in ['ACC_TYPE']:
        print (feature)
        D = {}
        if feature != target:
            A = Remove_Unknowns_in_Feature(data[[target, feature]], Missing_Unknown, feature)
            U = sorted(A[feature].unique())
            C = []
            for u in U:
                B = A[A[feature]==u]
                TP = B[target].sum()
                PP = len(B)
                corr = round(TP/PP*100,6)
                per = round(len(B)/len(A)*100,6)
        
                C.append([u, per, corr])
            C.sort(key=lambda x:x[2], reverse=True)
#            display(C)
#            print (sum([c[1] for c in C]))
            E = []
            j = 0
            s = 0.0
            for i, c in enumerate(C):
                E.append(c[0])
                s += c[1]
#                print (len(C), i, c, j, E, s)
                if (
                    c[1] > 8 or 
                    (i<len(C)-1 and s + C[i+1][1] > 15) or 
                    i==len(C)-1
                ):
                    D.update({j:E})
#                    print (j, ': ', E, ', # ', round(s,2), '%', sep='')
#                    print (D)
#                    print (s)
#                    print ()
                    j += 1
                    E = []
                    s = 0.0
#            print ()
        Correlation_Dictionary.update({feature:D})
    
    return Correlation_Dictionary
            
 

In [10]:
def Largest_Gaps(A, feature, U, C):
    for i in range (len(C)-1):
        C[i].append(round(C[i][2] - C[i+1][2],4))
    C[-1].append(0)
#    print ('C ordered by decreasing corr with difference in corr')
#    display(C[:20])
    C.sort(key=lambda x:x[3], reverse=True)
#    print ('C ordered by difference in corr')    
#    display(C[:20])

#    for i in range (len(C)-2):
#        C[i].append(C[i][3] - C[i+1][3])
#    C[-1].append(0)
#    C[-2].append(0)
#    print ('C ordered by difference in corr with difference of difference')
#    display(C[:20])

#    display(C)
    D = [round(c[3],2) for c in C]
    E = [c[2] for c in C if c[3] > 1]
    E = E[:6]
    E.sort(reverse=True)
    E = [101] + E + [0]
    Dict = {}
    for i in range (0, len(E)-1):
        F = []
        for c in C:
            if E[i] > c[2]  and c[2] >= E[i+1]:
                F = F + c[0]
        for f in F:
            Dict[f] = i
        
#    print (feature)
#    C.sort(key=lambda x:x[2], reverse=True)
#    display(C)
#    print (len(U), len(C), D[:12])
    print (len(E), E)
#    print (Dict)
 
    return Dict

In [11]:
def KDE (A, feature, U, C):
    print ('KDE')

    D = dict([[c[0], c[2]] for c in C]) # Dictionary of unique values in feature and correlation to 
#    print ('display (D)')
#    display (D)
#    E = [D[a] for a in list(A[feature])]
    E = [D[a] for a in U]
#    temp = [[list(A[feature])[i], E[i]] for i in range (10)]
#    print ('display (temp)')
#    display (temp)
            
            
    a = array(E).reshape(-1, 1)
#    print ('display (a)')
#    display (a[:10])
#    print ('len(a) = ', len(a))
    kde = KernelDensity(kernel='gaussian', bandwidth=2.0).fit(a)
    s = linspace(0,100)
    e = kde.score_samples(array(s).reshape(-1,1))
#    print ('e: ', e)
#    display(plot(s, e))
    mi, ma = argrelextrema(e, np.less)[0], argrelextrema(e, np.greater)[0]
    G = list(s[mi])
    G.sort(reverse=True)
    G = [round(t,4) for t in G]
    print ("Minima:", G)
#    print ("Maxima:", s[ma])
#    print ('mi: ', mi)
            
#    display(plot(
#        s[:mi[0]+1], e[:mi[0]+1], 'r',
#        s[mi[0]:mi[1]+1], e[mi[0]:mi[1]+1], 'g',
#        s[mi[1]:], e[mi[1]:], 'b',
#        s[ma], e[ma], 'go',
#        s[mi], e[mi], 'ro'
#    ))

    G = [101] + G + [0]
    Dict = {}
    for i in range (0, len(G)-1):
        F = []
        for c in C:
            if G[i] > c[2]  and c[2] >= G[i+1]:
                F.append(int(c[0]))
        for f in F:
            Dict[f] = i
            
    return Dict

            


In [12]:
def Merge_Same_Corr(C, delta):
    for i in range (len(C)-2, -1, -1):
        if C[i][2] - C[i+1][2] < delta:
            C[i][0] = C[i][0] + C[i+1][0]
            C[i][2] = (C[i][1]*C[i][2] + C[i+1][1]*C[i+1][2])/(C[i][1] + C[i+1][1])
            C[i][1] = C[i][1] + C[i+1][1]
            del(C[i+1])
#            print (i)
#            display(C)
#            print ()
    return C


In [13]:
def Merge_Small_Per(C, delta):
    
    for i in range (len(C)-2, -1, -1):
        if C[i][1] + C[i+1][1] < delta:
            C[i][0] = C[i][0] + C[i+1][0]
            C[i][2] = (C[i][1]*C[i][2] + C[i+1][1]*C[i+1][2])/(C[i][1] + C[i+1][1])
            C[i][1] = C[i][1] + C[i+1][1]
            del(C[i+1])

    C.sort(key=lambda x:x[2], reverse=False)
    for i in range (len(C)-2, -1, -1):
        if C[i+1][1] < delta:
            C[i][0] = C[i][0] + C[i+1][0]
            C[i][2] = (C[i][1]*C[i][2] + C[i+1][1]*C[i+1][2])/(C[i][1] + C[i+1][1])
            C[i][1] = C[i][1] + C[i+1][1]
            del(C[i+1])

    C.sort(key=lambda x:x[2], reverse=True)
    for i in range (len(C)-2, -1, -1):
        if C[i+1][1] < delta:
            C[i][0] = C[i][0] + C[i+1][0]
            C[i][2] = (C[i][1]*C[i][2] + C[i+1][1]*C[i+1][2])/(C[i][1] + C[i+1][1])
            C[i][1] = C[i][1] + C[i+1][1]
            del(C[i+1])

    for c in C:
        for i in range (1,3):
            c[i] = round(c[i],4)

    return C
    

In [14]:
def Correlation_Count_KDE(data, target, Missing_Unknown):
    
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
    print ("Correlation_Count()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
    for feature in data:
#    for feature in ['PJ']:
        print (feature)
        if feature != target and len(data[feature].unique())>10:
            A = Remove_Unknowns_in_Feature(data[[target, feature]], Missing_Unknown, feature)
            U = sorted(A[feature].unique())
            C = []
            for u in U:
                B = A[A[feature]==u]
                TP = B[target].sum()
                PP = len(B)
                corr = round(TP/PP*100,4)
                per = round(len(B)/len(A)*100,4)
        
                C.append([[u], per, corr])

            C.sort(key=lambda x:x[2], reverse=True)
            D = C.copy()
#            display(C[:10])
            
            """
            data[feature] is a Pandas series of the values of the feature for each the 800,000+ samples
            A is the a Pandas dataframe with two columns:
                    HOSPITAL (the target feature)
                    feature (the current feature)
                with the samples that have "Missing" and "Unknown" values for the current feature removed
            U is a list of the unique values in A
            for each unique value u in U:
                B is the Pandas series A filtered to just have the samples with that unique value u
            C is a list, for each unique value u in U, of:
                u, the unique value, 
                per, the percentage of the samples in list A that has that value
                corr (correlation), the percent of the samples that have this value that are hospitalized
                
            """

            C = Merge_Small_Per(C,2.0)
            Gaps_Dict = Largest_Gaps(A, feature, U, C)
        
            F = [[c[0][0], c[1], c[2]] for c in D]
            KDE_Dict = KDE(A, feature, U, F)
            print (feature)
            print (KDE_Dict)
            print (Missing_Unknown[feature])
            print ()
            for mu in Missing_Unknown[feature]:
                KDE_Dict[mu] = 99
            print (KDE_Dict)
            print ()
                
            
            Correlation_Dictionary[feature] = KDE_Dict
            
    return Correlation_Dictionary
            
            
            
            


In [15]:
def Correlation_Count_Iterative(data, target, Missing_Unknown):
    
    # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html
    print ("Correlation_Count()")
    data = data.reindex(sorted(data.columns), axis=1)
    data = data[ [target] + [col for col in data.columns if col != target]]
    
    Correlation_Dictionary = {}
    
#    for feature in data:
    for feature in ['MAKE']:
        print (feature)
        D = {}
        if feature != target:
            A = Remove_Unknowns_in_Feature(data[[target, feature]], Missing_Unknown, feature)
            U = sorted(A[feature].unique())
            print (U)
            C = []
            for u in U:
                B = A[A[feature]==u]
                TP = B[target].sum()
                PP = len(B)
                corr = TP/PP*100
                per = len(B)/len(A)*100
        
                C.append([[u], per, corr])

            C.sort(key=lambda x:x[2], reverse=True)
            
            print ('display (C)')
            display(C[:10])
            for i in range (len(C)-1):
                gap = C[i][2] - C[i+1][2]
                C[i] = C[i] + [gap, gap*C[i][1]]
            C[-1] = C[-1] + [0,0]
            
            D = [[c[0], round(c[1],4), round(c[2],4), round(c[3],4), round(c[4],4)] for c in C]
            display(D)
            
            i=-1
            while i<len(C)-2:
                i+= 1
                print (i, len(C), C[i])
                if C[i][3]==0.0:
                    C[i][0] = C[i][0] + C[i+1][0]
                    C[i][1] += C[i+1][1]
                    C[i][2] = (C[i][2]*C[i][1] + C[i+1][2]*C[i+1][1])/(C[i][1] + C[i+1][1])
                    if i < len(C)-2:
                        C[i][3] = C[i][2] - C[i+2][2]
                        C[i][4] = C[i][3]*C[i][1]
                    else:
                        C[i][3] = 0
                        C[i][4] = 0
                    del C[i+1]
                    print ("New ",i, len(C), C[i])
                    print ()
                    i -= 1
            
            D = [[c[0], round(c[1],4), round(c[2],4), round(c[3],4), round(c[4],4)] for c in C]
            display(D)
            
            while len(C) > 10:
                M = [c[4] for c in C]
                M.remove(0)
                i = M.index(min(M))
#                print (C[i])
                C[i][0] = C[i][0] + C[i+1][0]
                C[i][1] += C[i+1][1]
                C[i][2] = (C[i][2]*C[i][1] + C[i+1][2]*C[i+1][1])/(C[i][1] + C[i+1][1])
                if i < len(C)-2:
                    C[i][3] = C[i][2] - C[i+2][2]
                    C[i][4] = C[i][3]*C[i][1]
                else:
                    C[i][3] = 0
                    C[i][4] = 0
                del C[i+1]
#                print ("New ",i, len(C), C[i])
#                print ()
            D = [[c[0], round(c[1],4), round(c[2],4), round(c[3],4), round(c[4],4)] for c in C]
            display(D)
                
            
            """
            data[feature] is a Pandas series of the values of the feature for each the 800,000+ samples
            A is the a Pandas dataframe with two columns:
                    HOSPITAL (the target feature)
                    feature (the current feature)
                with the samples that have "Missing" and "Unknown" values for the current feature removed
            U is a list of the unique values in A
            for each unique value u in U:
                B is the Pandas series A filtered to just have the samples with that unique value u
            C is a list, for each unique value u in U, of:
                u, the unique value, 
                per, the percentage of the samples in list A that has that value
                corr (correlation), the percent of the samples that have this value that are hospitalized
            """
            
            
            


In [16]:
def Make_List_of_Features(data, Missing_Unknown, target):
    print ('Features and number of unique values in feature')
    
    B = []
    for feature in data:
        if feature != target:
            A = Remove_Unknowns_in_Feature(data[[target, feature]], Missing_Unknown, feature)
            U = sorted(A[feature].unique())
            B.append([feature, len(U)])
    B.sort(key = lambda x:x[1])
    for b in B:
        print ('    %s, # %d' % (b[0], b[1]))
    print ()

In [17]:
%%time
def Main():
    target = 'HOSPITAL'
    data, Missing_Unknown = Import_Stuff(2)
    
    print ('Missing_Unknown')
    print (Missing_Unknown)
    print ()
    
    Make_List_of_Features(data, Missing_Unknown, target)
    
#    Correlation_Dictionary = Correlation(data, target, Missing_Unknown)
#    Correlation_Dictionary = Correlation_One_Hot(data, target, Missing_Unknown)
#    Correlation_Dictionary = Correlation_Count(data, target, Missing_Unknown)
    Correlation_Dictionary = Correlation_Count_KDE(data, target, Missing_Unknown)

    with open("../../Big_Files/Correlation_Dictionary.json", "w") as outfile: 
        json.dump(Correlation_Dictionary, outfile)
        
    print ('Reading in Correlation Dictionary')
    with open('../../Big_Files/Correlation_Dictionary.json') as json_file:
        C = json.load(json_file)
    print (C)
    
Main()

Import_Stuff()
../../Big_Files/CRSS_Merged_Raw_Data_Sample_n_1000.csv
data.shape:  (1000, 217)
data.shape:  (1000, 86)

Reading in Missing/Unknown Dictionary

Missing_Unknown
{'PSU': [], 'PJ': [], 'VE_TOTAL': [], 'VE_FORMS': [], 'PVH_INVL': [], 'PERMVIT': [], 'PERNOTMVIT': [], 'NUM_INJ': [99], 'MONTH': [], 'YEAR': [], 'DAY_WEEK': [], 'HOUR': [99], 'HARM_EV': [98, 99], 'ALCOHOL': [9], 'MAX_SEV': [9], 'MAN_COLL': [98, 99], 'RELJCT1': [8, 9], 'RELJCT2': [98, 99], 'TYP_INT': [98, 99], 'WRK_ZONE': [], 'REL_ROAD': [98, 99], 'LGT_COND': [8, 9], 'WEATHER': [98, 99], 'SCH_BUS': [], 'INT_HWY': [9], 'URBANICITY': [], 'REGION': [], 'NUMOCCS': [99], 'HIT_RUN': [9], 'MAKE': [97, 99], 'MODEL': [], 'BODY_TYP': [98], 'MOD_YEAR': [9998, 9999], 'MAK_MOD': [98999, 99999], 'TOW_VEH': [9], 'J_KNIFE': [], 'CARGO_BT': [99], 'HAZ_INV': [], 'HAZ_PLAC': [8], 'HAZ_CNO': [88], 'HAZ_REL': [8], 'BUS_USE': [98, 99], 'SPEC_USE': [98, 99], 'EMER_USE': [8, 9], 'TRAV_SP': [998, 999], 'ROLLOVER': [], 'ROLINLOC': [9], 'IMP

HOUR
7 [101, 26.0869, 19.1781, 13.5135, 10.5263, 8.6957, 0]
KDE
Minima: [34.6939, 28.5714, 22.449]
HOUR
{3: 0, 5: 1, 19: 2, 18: 3, 1: 3, 14: 3, 4: 3, 23: 3, 15: 3, 21: 3, 2: 3, 13: 3, 8: 3, 12: 3, 22: 3, 16: 3, 17: 3, 0: 3, 20: 3, 7: 3, 9: 3, 11: 3, 6: 3, 10: 3}
[99]

{3: 0, 5: 1, 19: 2, 18: 3, 1: 3, 14: 3, 4: 3, 23: 3, 15: 3, 21: 3, 2: 3, 13: 3, 8: 3, 12: 3, 22: 3, 16: 3, 17: 3, 0: 3, 20: 3, 7: 3, 9: 3, 11: 3, 6: 3, 10: 3, 99: 99}

IMPACT1
8 [101, 37.5, 21.6216, 18.75, 14.2857, 11.2613, 9.836, 0]
KDE
Minima: [30.6122, 4.0816]
IMPACT1
{0: 0, 10: 0, 83: 1, 14: 1, 62: 1, 9: 1, 8: 1, 3: 1, 82: 1, 12: 1, 2: 1, 4: 1, 61: 1, 1: 1, 6: 1, 63: 1, 11: 1, 81: 1, 19: 1, 5: 2, 7: 2}
[98, 99]

{0: 0, 10: 0, 83: 1, 14: 1, 62: 1, 9: 1, 8: 1, 3: 1, 82: 1, 12: 1, 2: 1, 4: 1, 61: 1, 1: 1, 6: 1, 63: 1, 11: 1, 81: 1, 19: 1, 5: 2, 7: 2, 98: 99, 99: 99}

INJ_SEV
INT_HWY
J_KNIFE
LGT_COND
MAKE
8 [101, 76.1897, 33.3334, 24.2424, 20.5882, 17.9487, 9.6774, 0]
KDE
Minima: [83.6735, 67.3469, 53.0612, 42.8571, 28.57

RELJCT2
6 [101, 18.251, 16.4671, 13.1148, 8.0851, 0]
KDE
Minima: [28.5714, 22.449, 12.2449, 4.0816]
RELJCT2
{19: 0, 6: 1, 7: 1, 20: 2, 2: 2, 1: 2, 8: 2, 3: 3, 5: 3, 18: 3, 4: 4, 17: 4}
[98, 99]

{19: 0, 6: 1, 7: 1, 20: 2, 2: 2, 1: 2, 8: 2, 3: 3, 5: 3, 18: 3, 4: 4, 17: 4, 98: 99, 99: 99}

REL_ROAD
REST_MIS
REST_USE
5 [101, 60.3775, 12.6885, 10.5263, 0]
KDE
Minima: [79.5918, 46.9388, 30.6122, 22.449, 6.1224]
REST_USE
{5: 0, 17: 0, 20: 1, 7: 2, 19: 3, 1: 4, 3: 4, 2: 4, 8: 4, 4: 4, 10: 5, 11: 5, 12: 5, 97: 5}
[98, 99]

{5: 0, 17: 0, 20: 1, 7: 2, 19: 3, 1: 4, 3: 4, 2: 4, 8: 4, 4: 4, 10: 5, 11: 5, 12: 5, 97: 5, 98: 99, 99: 99}

ROLINLOC
ROLLOVER
SCH_BUS
SEAT_POS
5 [101, 17.1875, 14.7712, 13.1579, 0]
KDE
Minima: [24.4898, 8.1633]
SEAT_POS
{12: 0, 13: 1, 11: 1, 23: 1, 22: 1, 21: 2, 19: 2, 29: 2, 31: 2, 33: 2, 41: 2, 51: 2}
[98, 99]

{12: 0, 13: 1, 11: 1, 23: 1, 22: 1, 21: 2, 19: 2, 29: 2, 31: 2, 33: 2, 41: 2, 51: 2, 98: 99, 99: 99}

SEX
SPEC_USE
SPEEDREL
TOWED
TOW_VEH
TRAV_SP
8 [101, 57.1429, 

Minima: [ 2.04081633 14.28571429 28.57142857 38.7755102  46.93877551 55.10204082
 73.46938776 93.87755102]
Minima: [ 4.08163265 12.24489796 22.44897959 30.6122449  34.69387755 38.7755102
 46.93877551 75.51020408]    