In [1]:
%%latex
\tableofcontents

<IPython.core.display.Latex object>

# readme
- Most of our other Jupyter Notebooks have a main() function at the bottom that runs everything.  
- This notebook is structured differently, with several functions that run in sequence.  
- The reason for the difference is that part of the work has to be done outside this notebook.  
- The IVEware imputation software is available in several languages, but not Python.  We ran it in R using scrlib.  
- This notebook prepares the data for the Mode, Random Forest, and IVEware imputations, and does the first two.  Then the user must separately run the IVEware software.  Finally, this notebook pulls in those results and compares the three methods.  

# Methods

- We have the discretized CRSS dataset in '../../Big_Files/CRSS_Binned_Data.csv'
- MissForest is a round-robin imputation method most commonly implemented in R, generally considered one of the best imputation methods.  It has several Python implementations.
- The Python implementation we found most current and referenced, at https://pypi.org/project/MissForest/ , was not appropriate for our work because all of our data is categorical.  The MissForest algorithm starts from some imputed state, and we wanted to start with imputation to mode, but that implementation only offered imputation to mean or median, which are appropriate for continuous variables but not for categorical, so we wrote our own implementation.  
- We compare here four methods:
    - Round-Robin Random Forest 
        - Our own implementation of Round Robin, using scikit-learn's random forest
        - Using imputation by mode as the starting point
    - Imputation by mode
    - Random Imputation
    - IVEware, using the hyperparameters in the CRSS Imputation report
- To compare, we followed the example for MissForest.
    - We dropped all samples with a missing value, so we would have ground truth, going from 817,623 samples to 232,333 samples to make a Pandas dataframe data_Ground_Truth
    - We erased ~15% of the values in each sample to make data_NaN
    - We used each imputation method to impute the missing values.
    - To compare methods, we counted:
        - For each method, what percentage of imputed values did not match ground truth (28-44%)
        - For each pair of methods, which method did a better job on how many features
        - For each pair of methods, how many values are different
- Our round-robin method
    - In data_NaN, change all of the 'Unknown' to np.NaN.
    - In each feature, count the number of unknown samples.
    - In another copy, data_Mode, impute by mode in all of the features.
    - Starting with the feature with the least (nonzero) number of missing samples:
        - Copy that feature from data_NaN into data_Mode, so that only that feature has missing values.
        - Separate the dataframe into two, one with known values in the target variable (X) and one with unknown values (Z).
        - From the dataframe with known values (X), separate out the target variable (call it 'y')
        - Using Random Forest, build a model that maps X to y.  
        - Use the model to impute the missing values
    - At each iteration we replace the mode-imputed values with RF-imputed values.
- Our Random Imputation method
    - We did not choose randomly from the unique values in the feature, because some values may be much more common than others.  We wanted (approximately) the same distribution of values.
    - We started with 232,333 samples with 67 features.
    - We erased values with a probability of 15%, but that doesn't mean that exactly 34,849.95 values are missing from each feature, but we did erase *about* 35,000 values from each feature.  The exact number erased from each feature is printed out when the code runs.
    - For each feature:
        - Create a temporary copy of the feature, which will have 232,333 samples, about 35,000 of which are NaN.
        - Drop the NaN samples in the temp feature, leaving about 200,000 samples.
        - Resample the temp feature to have 232,333 samples.  The resampling will change the order of the values but keep about the same distribution.
        - In the original feature, replace the NaN values with the non-NaN corresponding values in the temporary feature.
- The IVEware implementation is available in several platforms, but Python is not one of them.  We run it in R outside this notebook.  Be aware that the random selection of values to erase is different for each run, so the IVEware imputation must be run anew. 

- Once we had analyzed the results and decided that the Random Forest method is best for our work, we implemented it and saved the results to CRSS_Imputed_Data.csv.

## What is going on with IVEware using "seed 0;" ?
- When we set the random seed to 0, the accuracy of IVEware jumps from about 70% to about 80%, from slightly worse than Random Forest to MUCH better.  WHAT ???

- These runs have the same random seed for Python and NumPy, have the five multicollinear features used in the imputation but dropped for the evaluation.  

- Having the same Python and NumPy random seed means that the input datasets for the IVEware imputation have the same samples have the same missing feature values.  

- "seed 0;" in IVEware_CRSS_Imputation.xml


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  438,072  |  21.84 % | 

- "seed 1;"
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  592,313  |  29.52 % | 

- "seed 2;"
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  568,719  |  28.35 % | 
    
    
<br><br>
- Found what was going on. "seed 1;" in IVEware is setting the random seed in R, but "seed 0;" is something different.
- Cite IVEware_User_Guide, page 17

"SEED number;

Specifies a seed for the random draws from the posterior predictive distribution. Number should be greater than zero. A zero seed will result in no perturbations of the predicted values or the regression coefficients. If the SEED keyword is missing from the setup file then the seed will be determined by your computer’s internal clock."

- set.seed(int) in R does not have this behavior at int=0.  I tried set.seed(0) in R and it worked just fine.  
- SAS requires that the random seed be a positive integer, and SAS is one of the implementations of IVEware, so that may be why the IVEware authors thought to implement this functionality for their seed.

- According to this ~2017 scraping of GitHub Python code to count the choices of random seeds,
    - https://www.kaggle.com/code/residentmario/kernel16e284dcb7
    - 0 is the most common (19%)
    - 1 and 42 are next(9% and 4%, respectively)
    
- According to this 2014 scraping of 100 top R repositories owned by 27 people, 
    - https://www.r-bloggers.com/2014/03/what-are-the-most-common-rng-seeds-used-in-r-scripts-on-github/
    - 1 is by far the most common (60 examples)
    - 123 is next (about 25)
    - 0 is not on the list
    
### Is this just an anomaly, or might "seed 0;" be useful?

- Test Method
    - Test with all 67 features, not dropping five multicollinear features
    - We have results with seeds 1 and 42
    - Test with seed 0 in IVEware, Python, and NumPy

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,167,826  | 100% | 
    | RF |  591,364  |  27.28 % | 
    | Mode |  739,696  |  34.12 % | 
    | Random |  971,759  |  44.83 % | 
    | IVEware |  447,881  |  20.66 % | 
    
    <br><br>
    - Test with seed 0 in IVEware but seed 42 in Python and NumPy in the Binning and Imputation

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  587,221  |  27.07 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  445,195  |  20.53 % | 
    
- Another test method
    - Randomly sample from 67 to 40 features and test again
    - Note that dropping features will increase the number of samples that have no missing values, so data_Ground_Truth and data_NaN will have fewer features but more samples, so having about the same number of total missing values over the 40 features is not a problem.
    - Do it twice with two random seeds.  
    - The same random seed for Python and NumPy will preserve, but different random seeds will change:
        - Which features get dropped
        - Which 15% of the samples will get dropped to make data_NaN for testing the imputation
    - Seed 0 in Python and Numpy, seed 0 in IVEware:
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  556,618  |  26.87 % | 

    - Seed 0 in Python and NumPy, seed 1 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  738,201  |  35.63 % | 
    
    
    - Seed 1 in Python and Numpy, seed 0 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  370,546  |  19.9 % | 

    - Seed 1 in Python and Numpy, seed 1 in IVEware:
    
   | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  486,820  |  26.15 % | 
    
    - Analysis
        - Seed 1 (compared with seed 0) for Python and NumPy appears to have chosen features that are easier to impute
        - Within each seed for Python and Numpy, choosing seed 0 for IVEware gave much better results.  
    
### Conclusion
- Setting the IVEware seed to zero is not recommended in the manual, and we think it shouldn't work well, but it works dramatically well with our test methods.  
- Use two sets of data from here on, one imputed with Random Forest and another imputed with IVEware with random seed zero.  See which gives best results at the end.  

# Results of Comparison of Four Imputation Methods

- We start with the binned (discretized) data, CRSS_Binned_Data.csv, with 817,623 samples in 67 features.
<br><br>
- Dropping any sample with a missing value, we have 232,333 samples of Ground Truth.

<br><br>
- First run with random seed  42 in Python, NumPy, and R:
    <br><br>
    - Samples Incorrectly Imputed
    
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  589,714  |  27.19 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  622,622  |  28.71 % | 

    <br><br>
    - Comparison of number of errors in the 67 features.  For instance, comparing Random Forest to Mode, in 50 features RF had fewer errors than Mode, in 17 features the two methods had the same number of errors, and in no features did RF have more errors than Mode.  

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  50  |  17  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  34  |  0  |  33  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  24  |  0  |  43  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,168,989  | 100% |
    | RF Different from Mode |  260,334  |  12.0 % |
    | RF Different from Random |  805,376  |  37.13 % |
    | RF Different from IVEware |  649,467  |  29.94 % |
    | Mode Different from Random |  739,564  |  34.1 % |
    | Mode Different from IVEware |  780,385  |  35.98 % |
    | Random Different from IVEware |  1,003,065  |  46.25 % |    
    
<br><br>
- Second Run, Same random seed (42) to make sure the random seed is implemented correctly.  Same results. 

    <br><br>
     - Percentage of Samples Incorrectly Imputed
     

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  589,714  |  27.19 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  622,622  |  28.71 % | 

    <br><br>
     - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  50  |  17  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  34  |  0  |  33  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  24  |  0  |  43  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,168,989  | 100% |
    | RF Different from Mode |  260,334  |  12.0 % |
    | RF Different from Random |  805,376  |  37.13 % |
    | RF Different from IVEware |  649,467  |  29.94 % |
    | Mode Different from Random |  739,564  |  34.1 % |
    | Mode Different from IVEware |  780,385  |  35.98 % |
    | Random Different from IVEware |  1,003,065  |  46.25 % |

<br><br>
- Third run, with random seed 1:

    <br><br>
    - Samples Incorrectly Imputed by Method

   | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,167,935  | 100% | 
    | RF |  587,894  |  27.12 % | 
    | Mode |  738,676  |  34.07 % | 
    | Random |  970,865  |  44.78 % | 
    | IVEware |  595,206  |  27.45 % | 

    <br><br>
- Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  51  |  16  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  35  |  0  |  32  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  22  |  0  |  45  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
- Number of NaN Imputed Differently by Pairs of Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,167,935  | 100% |
    | RF Different from Mode |  252,911  |  11.67 % |
    | RF Different from Random |  802,733  |  37.03 % |
    | RF Different from IVEware |  620,312  |  28.61 % |
    | Mode Different from Random |  738,742  |  34.08 % |
    | Mode Different from IVEware |  751,752  |  34.68 % |
    | Random Different from IVEware |  978,679  |  45.14 % |








## Drop Multicollinear Features before Imputing?  Compare two methods
- First Method
    - After Binning, reduce dimensionality
        - Removes MAX_VSEV, VE_FORMS, VTCONT_F, MAX_SEV, NUM_INJV
        - Reduces from 67 to 62 features
    - Impute
- Second Method
    - Impute with all 67 features
    - Before evaluating the imputation, remove the five features and only evaluate the results on the 62 features used in the comparison above
- We used random seed 42 for both methods
- First Method Results

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,463  | 100% | 
    | RF |  569,509  |  28.37 % | 
    | Mode |  681,753  |  33.96 % | 
    | Random |  889,794  |  44.32 % | 
    | IVEware |  606,632  |  30.22 % | 
    
- Second Method Results


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,235  | 100% | 
    | RF |  558,936  |  27.85 % | 
    | Mode |  681,996  |  33.98 % | 
    | Random |  888,845  |  44.28 % | 
    | IVEware |  606,062  |  30.19 % | 


### Analysis
- Mode was the same, as it should be.
- Random was slightly different, perhaps because the features were in a different order?
- IVEware was not significantly different in the two methods.
- Random Forest was slightly but significantly better (0.52%) with the second method, not removing the multicollinear features before imputing, which is surprising.  

### Conclusion
- Run again with different random seed = 1

### Second Round Results
- First Method

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,643  | 100% | 
    | RF |  568,909  |  28.35 % | 
    | Mode |  681,061  |  33.94 % | 
    | Random |  889,048  |  44.31 % | 
    | IVEware |  592,233  |  29.51 % | 


- Second Method


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,005,955  | 100% | 
    | RF |  558,742  |  27.85 % | 
    | Mode |  680,715  |  33.93 % | 
    | Random |  887,944  |  44.27 % | 
    | IVEware |  564,254  |  28.13 % | 
    
### Analysis

- Again, the second method, leaving in multicollinear features, is better for both Random Forest and IVEware

### Conclusion
- When we impute 

## Discussion

- Random imputation is clearly worse than Mode and RF on every feature.
- Random is overall worse than IVEware, but on one of our runs there are five features on which Random is better than IVEware.
- Random Forest is as good or better than Mode on every feature, which is not surprising, as RF starts at Mode and improves on it.  
- Random Forest is as good or better than IVEware on more than half of the features, but not overwhelmingly, and slightly better in the count of missing samples correctly imputed.
- IVEware and Mode are comparable in the number of features, but IVEware is much better in the count of missing samples correctly imputed.
- Random Forest and Mode make the same mistakes.  
- IVEware makes different mistakes from Random Forest and Mode.

## Conclusion

- Use Random Forest

## Opportunities for Future Research
(or, "Things we didn't do")

- Which features are better imputed by Random imputation than by IVEware, and why?
- Which features are better imputed by IVEware than by Random Forest, and why?
- Would a different mix of features make IVEware perform better than Random Forest?
- Is it okay to use one imputation method for some features and another method for other features?

# Setup
## Import Libraries

In [2]:
import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)


import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import sklearn
print ('SciKit-Learn version: {}'.format(sklearn.__version__))
from sklearn.model_selection import train_test_split

import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from missforest.missforest import MissForest

# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
import random
random_seed = 1
np.random.seed(random_seed) # NumPy
random.seed(random_seed) # Python
#tf.set_random_seed(random_seed) # Tensorflow

from IPython.display import Audio
sound_file = './beep.wav'

import warnings
warnings.filterwarnings('ignore')

print ('Finished Importing Libraries')


Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
NumPy version: 1.26.4
Pandas version:  2.2.2
SciKit-Learn version: 1.5.0
Finished Importing Libraries


## Get Data

This notebook pulls in the saved output of Ambulance_Dispatch_2024_02_Binning.

In [3]:
def Get_Data():
    print ('Get_Data')
    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data_Seed_42.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Reduced_Dimensionality_Data.csv', low_memory=False)
    print ('data.shape = ', data.shape)
    print ()

    # We already dropped the imputed columns in the Binning stage
    print ('Drop Imputed Columns')
    for feature in data:
        if '_IM' in feature:
            print (feature)
            data.drop(columns=feature, inplace=True)
 

    # Method for dropping from 67 to 40 features 
    # to test whether it was just this particular mix of features 
    # that made the IVEware behave strangely well with random seed of zero.
#    print ('data.shape = ', data.shape)
#    data = data.sample(n=40, axis='columns')
    
    print ('data.shape = ', data.shape)
    print ()
    
#    print ("Remaining Features:")
#    Features = sorted(list(data.columns))
#    for feature in Features:
#        print ("    ",feature)
    
    return data

In [4]:
#data = Get_Data()


## Tools

In [5]:
def Impute_MissForest(data):
    print('Impute_MissForest()')

    print (data.shape)
    display(data.head(20))
#    data.replace({np.nan: ''}, inplace=True)
#    display(data.head(20))

    categorical = list(data)
    print (categorical)
    
    clf = RandomForestClassifier(
        n_estimators=100, 
        max_depth=10, 
#        max_features=0.5
    )
    rgr = RandomForestRegressor(
        n_estimators=100, 
        max_depth=10, 
#        max_features=0.5
    )

    data_MF = MissForest(clf, rgr).fit_transform(
        x = data,
        categorical=categorical,
    )
    display(data_MF.head(20))
    print ('Finished Impute_MissForest()')
    print ()
    
    return data_MF
    

In [6]:
def Test_Impute_MissForest():
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('int64')

    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth = data_Ground_Truth.astype('int64')
    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    # Randomly pick 15% of the values from each row
    # and set them to be missing
    print ('Remove 15% of values from each row')
    frac = .15
    data_NaN = data_Ground_Truth.copy(deep=True)
    N = data_NaN.shape[0] * frac # Number of NaN in each feature
    for c in data_NaN.columns:
        idx = np.random.choice(a=data_NaN.index, size=int(len(data_NaN) * frac))
        data_NaN.loc[idx, c] = np.NaN
#    for feature in data_NaN:
#        data_NaN[feature] = pd.to_numeric(data_NaN[feature])
#    data_NaN.astype('int64')


    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head())
    
    data_NaN = data_NaN.astype('Int8')
    
    data_NaN = data_NaN.sample(n=220000)
    print (data_NaN.shape)

    
    # Perform MissForest imputation
    data_MF = Impute_MissForest(data_NaN)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
    data_MF = data_MF.astype('Int64')
    
#Test_Impute_MissForest()


In [7]:
def Test_Impute_MissForest_2():
    data = Get_Data()
    print (data.shape)
    display(data.head(10))
    data = data.sample(n=1000)
    print (data.shape)
    data.replace({99:np.nan}, inplace=True)
    display(data.head(10))
    data_MF = Impute_MissForest(data)
    
#Test_Impute_MissForest_2()

In [8]:
def Impute_Round_Robin(data):
    print ('Impute()')
    pd.set_option('display.max_columns', None)
    
    # Replace 'Unknown' with np.NaN
#    data.replace({'Unknown': np.nan}, inplace=True)
    data.replace({99: np.nan}, inplace=True)
    display(data.head(20))
    print ()
    
    # Make a list of features with missing samples, 
    #     ordered by the number of missing samples, 
    #     from least to most.  
    Missing = []
    Complete = []
    for feature in data:
        s = data[feature].isna().sum()
        if s==0:
            Complete.append([feature, s])
        if s>0:
            Missing.append([feature, s])
    Missing = sorted (Missing, key=lambda x:x[1], reverse=False)
    print ()
    print ('Complete[]')
    display(Complete)
    print ()
    print ('Missing[]')
    display(Missing)
    print ()
    
    print ('Make data_Mode')
    print ()
    data_Mode = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Mode[feature] = data[feature]
    for M in Missing:
        feature = M[0]
        m = data[feature].mode()[0]
        print (feature, M[1], m)
        data_Mode[feature] = data[feature].fillna(m)
    print ('data_Mode')
    display(data_Mode.head(20))

    print ()
    print ('Make starting point for data_Imputed')
    data_Imputed = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Imputed[feature] = data[feature]
    for X in Missing:
        feature = X[0]
        data_Imputed[feature] = data_Mode[feature]
    print ('data_Imputed')
    display(data_Imputed.head(20))
    print ()

    print ('Start Loop')
    print ()
    n = 0
    for M in Missing:
        n += 1
        print (M)
        feature = M[0]
        data_Imputed[feature] = data[feature]
#        print ()
#        print ('data[feature].isna().sum()')
#        print (data[feature].isna().sum())
#        print ('data_Imputed[feature].isna().sum()')
#        print (data_Imputed[feature].isna().sum())
#        print ()
        W = data_Imputed.dropna(subset=[feature])
        X = data_Imputed.dropna(subset=[feature])
        y = X[feature]
        X.drop(columns=feature, inplace=True)
        Z = data_Imputed[data_Imputed[feature].isna()]
        Z.drop(columns=feature, inplace=True)
#        Z.reset_index(drop=True, inplace=True)
#        print (data.shape)
#        print (X.shape)
#        display(X.head(40))
#        display(y.head(40))
#        print (Z.shape)
#        display(Z)
        clf = RandomForestClassifier(max_depth=2, random_state=random_seed)
        clf.fit(X,y)
#        print ('clf.predict(Z)')
        z = clf.predict(Z)
        print (len(z))
        display(z)
        Z[feature] = z
#        display(Z)
        data_Imputed = pd.concat([Z, W])
#        display(data_Imputed.head(60))
        print (data_Imputed.shape)
        print ()
#        data_Imputed.sort_values(
#            by = ['CASENUM', 'VEH_NO', 'PER_NO'], 
#            ascending = [True, True, True], 
#            inplace=True
#        )
#        print ()
#        print ('data.PER_NO.equals(data_Imputed.PER_NO)')
#        print (data.PER_NO.equals(data_Imputed.PER_NO))
#        print ()
               
        Check_Feature(data, data_Imputed, feature)
#        if n==10:
#            return data_Imputed
    
    
    display(data_Imputed.head(20))

    
    print ()
    return data_Imputed

In [9]:
def Check(data, data_Imputed):
    Features = data.columns
    print (Features)
    for feature in Features:
        U = pd.unique(data[feature]).tolist()
        print (U)
        A = []
        for u in U:
            a = len(data[data[feature]==u])
            b = len(data_Imputed[data_Imputed[feature]==u])
            A.append([u, a, b])
        display(A)
        print ()


In [10]:
def Check_Feature(data, data_Imputed, feature):
    U = pd.unique(data[feature]).tolist()
    U = [x for x in U if x == x]
    print (U)
    A = []
    for u in U:
        a = len(data[data[feature]==u])
        b = len(data_Imputed[data_Imputed[feature]==u])
        A.append([u, a, b, b-a])
    a = data[feature].isna().sum()
    b = data_Imputed[feature].isna().sum()
    A.append(['NaN', a, b, 0])
    A = pd.DataFrame(A, columns=['Value', 'Original', 'Imputed', 'Difference'])
    display(A)
    print ()


In [11]:
def Impute_Randomly(data):
    print ()
    print ('Impute_Randomly()')
    print ()
    
    data.sample(frac=1, replace=True) # Randomly shuffle the rows of the dataset
    for feature in data:
        print (feature)
#        print ('display(data[feature].head())')
#        display(data[feature].head())
        dfA = data[feature]
#        print ('display(dfA.head())')
#        display(dfA.head())
        dfA.dropna(inplace=True)
#        print ('display(dfA.head()) after dfA.dropna(inplace=True)')
#        display(dfA.head())
        print ('Original Value Counts')
        print (dfA.value_counts(normalize=True))
        dfA = dfA.sample(n = len(data), replace=True)
#        print ('display(dfA.head()) after dfA.sample(n = len(data), replace=True)')
#        display(dfA.head())
        print ('Value Counts after Sampling')
        print (dfA.value_counts(normalize=True))
        dfA.reset_index(drop=True, inplace=True)
#        print ('display(dfA.head()) after dfA.reset_index(drop=True)')
#        display(dfA.head())
        data[feature].fillna(dfA, inplace=True)
#        print ('display(data[feature].head())')
#        display(data[feature].head())        
        print ()
        
    return data
        
def Test_Impute_Randomly():
    Dict = {
        'A':[0,0,0,1,np.nan],
        'B':[1,2,3,4,np.nan]
    }
    
    data = pd.DataFrame(Dict)
    display(data)
    data = Impute_Randomly(data)
    display(data)
    
#Test_Impute_Randomly()
        

# Compare Imputation Methods

## Mode Imputation
## Random Forest Imputation
## Prepare Data for IVEware

In [12]:
def Compare_Imputation_Methods_Part_1():
    print ()
    print ('Compare_Imputation_Methods_Part_1()')
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('Int32')

    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth = data_Ground_Truth.astype('int64')
    
    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    data_Ground_Truth = data_Ground_Truth.sample(n=200000)
    data_Ground_Truth.reset_index(inplace=True, drop=True)

    print ('data_Ground_Truth.shape after resampling')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    # Randomly pick 15% of the values from each row
    # and set them to be missing
    print ('Remove 15% of values from each row')
    frac = .15
    data_NaN = data_Ground_Truth.copy(deep=True)
    N = data_NaN.shape[0] * frac # Number of NaN in each feature
    for c in data_NaN.columns:
        idx = np.random.choice(a=data_NaN.index, size=int(len(data_NaN) * frac))
        data_NaN.loc[idx, c] = np.NaN
#    for feature in data_NaN:
#        data_NaN[feature] = pd.to_numeric(data_NaN[feature])
#    data_NaN = data_NaN.astype('Int32')
    
    
    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head())
    
    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data_NaN.copy(deep=True)
#    data_IVEware = data_IVEware.astype('str')
    data_IVEware = data_IVEware.fillna('')
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    data_Mode = pd.DataFrame()
    for feature in data_NaN:
        data_Mode[feature] = data_NaN[feature].fillna(data_NaN[feature].mode()[0])
    data_Mode = data_Mode.astype('Int32')
    print ('data_Mode.shape')
    print (data_Mode.shape)
    display(data_Mode.head())
    
    # Perform Round Robin imputation using Random Forest Classifier
    data_RF = Impute_Round_Robin(data_NaN)
    data_RF.sort_index(inplace=True)
    data_RF = data_RF[data.columns]  
    data_RF = data_RF.astype('Int32')
    
    print ('data_RF.shape')
    print (data_RF.shape)
    display(data_RF.head())
#    print ()

    # Perform MissForest imputation
    data_MF = data_NaN.copy(deep=True)
    data_MF = Impute_MissForest(data_MF)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
    data_MF = data_MF.astype('Int32')
    
    print ('data_MF.shape')
    print (data_MF.shape)
    display(data_MF.head())
#    print ()

    # Impute randomly
    data_Random = data_NaN.copy(deep=True)
    data_Random = Impute_Randomly(data_Random)
    data_Random = data_Random.astype('Int32')
    
    print ('data_Random.shape')
    print (data_Random.shape)
    display(data_Random.head())
#    print ()

    return data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random

In [13]:
%%time 
data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random = Compare_Imputation_Methods_Part_1()


Compare_Imputation_Methods_Part_1()
Get_Data
data.shape =  (802700, 67)

Drop Imputed Columns
data.shape =  (802700, 67)

(802700, 67)
data_Ground_Truth.shape
(232333, 67)
data_Ground_Truth.shape after resampling
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,1,2,3,0,1,0,6,...,2,2,3,3,0,1,0,1,0,0
1,0,8,6,1,2,2,0,3,2,6,...,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,...,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,8,...,1,1,1,4,0,2,2,0,0,0
4,0,9,6,4,2,2,0,3,2,2,...,1,1,3,5,0,2,2,0,2,0


Remove 15% of values from each row
data_NaN.shape
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,,,3.0,0.0,1.0,0.0,6.0,...,2.0,2.0,3.0,3.0,,1.0,0.0,,0.0,0.0
1,0.0,8.0,6.0,1.0,,,0.0,,2.0,6.0,...,1.0,1.0,3.0,4.0,0.0,1.0,0.0,,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,,0.0,,0.0,5.0,...,2.0,2.0,0.0,9.0,0.0,1.0,0.0,,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,,...,1.0,1.0,1.0,4.0,,,,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,,2.0,0.0,3.0,,2.0,...,1.0,1.0,,5.0,0.0,2.0,2.0,,2.0,0.0


data_Mode.shape
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,...,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,...,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,...,2,2,0,9,0,1,0,0,2,0
3,0,8,6,4,2,2,0,4,0,5,...,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,0,2,...,1,1,3,5,0,2,2,0,2,0


Impute()


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,,,3.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,7.0,3.0,0.0,0.0,1.0,8.0,8.0,3.0,3.0,3.0,4.0,0.0,2.0,2.0,0.0,,2.0,1.0,1.0,0.0,4.0,4.0,0.0,,8.0,1.0,,3.0,2.0,2.0,1.0,,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,3.0,,1.0,0.0,,0.0,0.0
1,0.0,8.0,6.0,1.0,,,0.0,,2.0,6.0,0.0,3.0,0.0,5.0,1.0,2.0,0.0,,3.0,3.0,2.0,3.0,,2.0,2.0,0.0,,0.0,1.0,,2.0,1.0,3.0,1.0,4.0,4.0,0.0,1.0,9.0,1.0,0.0,,2.0,2.0,1.0,2.0,2.0,2.0,0.0,,2.0,4.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,,0.0,,0.0,5.0,0.0,,,4.0,1.0,,1.0,,3.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,2.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,,0.0,1.0,8.0,1.0,0.0,,2.0,2.0,1.0,,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,,2.0,3.0,2.0,2.0,0.0,9.0,0.0,1.0,0.0,,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,,0.0,3.0,0.0,4.0,,3.0,0.0,0.0,3.0,6.0,4.0,3.0,3.0,3.0,,0.0,2.0,1.0,0.0,0.0,2.0,1.0,3.0,0.0,6.0,8.0,0.0,,8.0,1.0,,,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,4.0,,,,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,,2.0,0.0,3.0,,2.0,0.0,3.0,0.0,5.0,1.0,,0.0,0.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,0.0,2.0,,0.0,0.0,2.0,1.0,5.0,1.0,0.0,3.0,,4.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,2.0,1.0,2.0,,1.0,1.0,,5.0,0.0,2.0,2.0,,2.0,0.0
5,1.0,3.0,9.0,4.0,2.0,2.0,0.0,2.0,,0.0,0.0,3.0,0.0,4.0,,2.0,0.0,0.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,,2.0,0.0,1.0,,2.0,2.0,5.0,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,1.0,2.0,2.0,1.0,2.0,2.0,,1.0,,2.0,3.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,1.0,3.0,0.0,1.0,,0.0,0.0,0.0
6,0.0,8.0,6.0,4.0,,7.0,0.0,,4.0,5.0,0.0,3.0,0.0,4.0,1.0,3.0,0.0,0.0,3.0,6.0,6.0,3.0,1.0,,7.0,1.0,2.0,0.0,1.0,0.0,,1.0,5.0,1.0,2.0,,,1.0,9.0,1.0,0.0,3.0,2.0,,1.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,0.0,,0.0,2.0,6.0,1.0,1.0,3.0,2.0,,2.0,1.0,,0.0,
7,0.0,8.0,6.0,4.0,2.0,9.0,,1.0,0.0,6.0,0.0,3.0,0.0,,1.0,3.0,0.0,,3.0,,9.0,3.0,3.0,3.0,,,2.0,0.0,,,2.0,1.0,5.0,1.0,5.0,4.0,0.0,,,1.0,0.0,0.0,2.0,2.0,,2.0,2.0,,1.0,,2.0,4.0,2.0,,,2.0,3.0,1.0,1.0,,6.0,0.0,,0.0,2.0,,0.0
8,0.0,3.0,6.0,4.0,2.0,4.0,0.0,2.0,2.0,8.0,0.0,3.0,0.0,5.0,2.0,3.0,0.0,0.0,3.0,6.0,5.0,2.0,3.0,3.0,5.0,0.0,2.0,,0.0,0.0,,1.0,3.0,1.0,,8.0,,1.0,3.0,,0.0,2.0,,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,0.0,,2.0,2.0,1.0,1.0,,4.0,0.0,,0.0,,0.0,0.0
9,,,1.0,4.0,2.0,4.0,0.0,4.0,0.0,5.0,0.0,3.0,0.0,6.0,1.0,3.0,0.0,0.0,3.0,6.0,3.0,2.0,1.0,3.0,5.0,3.0,2.0,2.0,1.0,0.0,2.0,1.0,4.0,0.0,5.0,,0.0,1.0,7.0,2.0,0.0,2.0,,,2.0,2.0,2.0,1.0,,1.0,,0.0,0.0,0.0,,2.0,4.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,




Complete[]


[]


Missing[]


[['VPROFILE', 27762],
 ['MAK_MOD', 27763],
 ['RELJCT1', 27777],
 ['NUMOCCS', 27785],
 ['ALC_STATUS', 27786],
 ['SEAT_POS', 27786],
 ['VALIGN', 27799],
 ['ROLINLOC', 27803],
 ['ROLLOVER', 27804],
 ['VE_TOTAL', 27807],
 ['PERMVIT', 27810],
 ['NUM_INJ', 27816],
 ['AGE', 27817],
 ['VSURCOND', 27817],
 ['PSU', 27819],
 ['SPEC_USE', 27821],
 ['VTRAFCON', 27823],
 ['IMPACT1', 27830],
 ['P_CRASH1', 27830],
 ['INT_HWY', 27831],
 ['TYP_INT', 27834],
 ['MAN_COLL', 27837],
 ['MAX_SEV', 27838],
 ['REGION', 27838],
 ['PCRASH4', 27839],
 ['VTRAFWAY', 27840],
 ['SPEEDREL', 27841],
 ['HARM_EV', 27842],
 ['M_HARM', 27843],
 ['REST_USE', 27843],
 ['NUM_INJV', 27844],
 ['J_KNIFE', 27845],
 ['PCRASH5', 27847],
 ['INJ_SEV', 27850],
 ['LGT_COND', 27850],
 ['URBANICITY', 27850],
 ['DR_ZIP', 27852],
 ['HOUR', 27853],
 ['ACC_TYPE', 27854],
 ['HIT_RUN', 27855],
 ['WRK_ZONE', 27855],
 ['VE_FORMS', 27858],
 ['VSPD_LIM', 27860],
 ['VTCONT_F', 27862],
 ['MODEL', 27867],
 ['RELJCT2', 27870],
 ['EJECTION', 27875],
 ['


Make data_Mode

VPROFILE 27762 3.0
MAK_MOD 27763 3.0
RELJCT1 27777 0.0
NUMOCCS 27785 0.0
ALC_STATUS 27786 2.0
SEAT_POS 27786 2.0
VALIGN 27799 2.0
ROLINLOC 27803 2.0
ROLLOVER 27804 2.0
VE_TOTAL 27807 1.0
PERMVIT 27810 3.0
NUM_INJ 27816 0.0
AGE 27817 6.0
VSURCOND 27817 0.0
PSU 27819 3.0
SPEC_USE 27821 1.0
VTRAFCON 27823 0.0
IMPACT1 27830 1.0
P_CRASH1 27830 1.0
INT_HWY 27831 0.0
TYP_INT 27834 0.0
MAN_COLL 27837 3.0
MAX_SEV 27838 3.0
REGION 27838 1.0
PCRASH4 27839 2.0
VTRAFWAY 27840 0.0
SPEEDREL 27841 2.0
HARM_EV 27842 3.0
M_HARM 27843 2.0
REST_USE 27843 1.0
NUM_INJV 27844 0.0
J_KNIFE 27845 0.0
PCRASH5 27847 1.0
INJ_SEV 27850 3.0
LGT_COND 27850 3.0
URBANICITY 27850 1.0
DR_ZIP 27852 5.0
HOUR 27853 5.0
ACC_TYPE 27854 8.0
HIT_RUN 27855 0.0
WRK_ZONE 27855 0.0
VE_FORMS 27858 1.0
VSPD_LIM 27860 3.0
VTCONT_F 27862 1.0
MODEL 27867 3.0
RELJCT2 27870 1.0
EJECTION 27875 0.0
TOW_VEH 27878 0.0
VEH_AGE 27881 1.0
P_CRASH2 27882 8.0
PJ 27883 5.0
CARGO_BT 27884 0.0
MAX_VSEV 27885 3.0
MONTH 27885 0.0
MAKE 

Unnamed: 0,VPROFILE,MAK_MOD,RELJCT1,NUMOCCS,ALC_STATUS,SEAT_POS,VALIGN,ROLINLOC,ROLLOVER,VE_TOTAL,PERMVIT,NUM_INJ,AGE,VSURCOND,PSU,SPEC_USE,VTRAFCON,IMPACT1,P_CRASH1,INT_HWY,TYP_INT,MAN_COLL,MAX_SEV,REGION,PCRASH4,VTRAFWAY,SPEEDREL,HARM_EV,M_HARM,REST_USE,NUM_INJV,J_KNIFE,PCRASH5,INJ_SEV,LGT_COND,URBANICITY,DR_ZIP,HOUR,ACC_TYPE,HIT_RUN,WRK_ZONE,VE_FORMS,VSPD_LIM,VTCONT_F,MODEL,RELJCT2,EJECTION,TOW_VEH,VEH_AGE,P_CRASH2,PJ,CARGO_BT,MAX_VSEV,MONTH,MAKE,HOSPITAL,PER_TYP,DAY_WEEK,AIR_BAG,REL_ROAD,SEX,DEFORMED,WEATHER,TOWED,PVH_INVL,REST_MIS,BODY_TYP
0,3.0,8.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,1.0,0.0,2.0,0.0,4.0,1.0,0.0,7.0,1.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,1.0,1.0,6.0,5.0,7.0,0.0,0.0,2.0,3.0,1.0,4.0,3.0,0.0,0.0,1.0,8.0,4.0,0.0,3.0,0.0,8.0,0.0,0.0,1.0,4.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0,3.0
1,3.0,2.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,6.0,0.0,4.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,2.0,3.0,1.0,6.0,5.0,8.0,0.0,0.0,1.0,4.0,1.0,2.0,1.0,0.0,0.0,1.0,9.0,4.0,0.0,2.0,0.0,3.0,0.0,1.0,3.0,1.0,2.0,0.0,2.0,2.0,4.0,0.0,2.0,2.0
2,0.0,3.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,1.0,6.0,0.0,3.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,1.0,0.0,1.0,3.0,3.0,1.0,5.0,4.0,8.0,0.0,0.0,2.0,9.0,1.0,3.0,1.0,0.0,0.0,3.0,8.0,1.0,0.0,0.0,3.0,3.0,0.0,1.0,3.0,1.0,2.0,0.0,0.0,2.0,0.0,0.0,2.0,2.0
3,1.0,4.0,0.0,1.0,2.0,0.0,2.0,2.0,2.0,1.0,3.0,0.0,6.0,0.0,8.0,1.0,0.0,1.0,1.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,5.0,4.0,8.0,0.0,0.0,1.0,4.0,1.0,3.0,1.0,0.0,0.0,1.0,8.0,6.0,0.0,3.0,0.0,6.0,0.0,0.0,4.0,4.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0
4,3.0,3.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,0.0,6.0,0.0,3.0,1.0,2.0,1.0,4.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,2.0,5.0,9.0,0.0,0.0,1.0,5.0,2.0,3.0,3.0,0.0,0.0,1.0,9.0,0.0,0.0,3.0,0.0,2.0,0.0,1.0,3.0,4.0,2.0,0.0,0.0,2.0,4.0,0.0,2.0,2.0
5,1.0,2.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,9.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,2.0,0.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,2.0,2.0,3.0,1.0,0.0,4.0,3.0,0.0,0.0,1.0,3.0,1.0,1.0,1.0,0.0,0.0,4.0,6.0,1.0,0.0,2.0,0.0,2.0,1.0,1.0,2.0,4.0,2.0,1.0,0.0,0.0,3.0,0.0,2.0,2.0
6,3.0,6.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,6.0,0.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,1.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,0.0,5.0,4.0,8.0,0.0,0.0,1.0,2.0,2.0,7.0,3.0,0.0,0.0,6.0,9.0,2.0,0.0,3.0,1.0,6.0,0.0,1.0,3.0,4.0,2.0,1.0,4.0,0.0,4.0,0.0,2.0,7.0
7,3.0,9.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,0.0,6.0,0.0,4.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,3.0,1.0,2.0,2.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,6.0,5.0,8.0,0.0,0.0,1.0,6.0,1.0,3.0,0.0,0.0,2.0,3.0,8.0,5.0,0.0,3.0,0.0,6.0,0.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0,4.0,0.0,2.0,9.0
8,3.0,5.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,0.0,6.0,0.0,8.0,1.0,0.0,2.0,1.0,0.0,0.0,2.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,8.0,5.0,3.0,0.0,0.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,2.0,3.0,5.0,0.0,3.0,0.0,6.0,0.0,1.0,2.0,4.0,2.0,0.0,2.0,0.0,4.0,0.0,2.0,4.0
9,3.0,3.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,1.0,1.0,0.0,0.0,2.0,1.0,2.0,2.0,0.0,2.0,3.0,2.0,2.0,0.0,0.0,1.0,3.0,3.0,1.0,5.0,6.0,8.0,0.0,0.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,4.0,7.0,5.0,0.0,3.0,3.0,6.0,0.0,0.0,4.0,4.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,4.0



Make starting point for data_Imputed
data_Imputed


Unnamed: 0,VPROFILE,MAK_MOD,RELJCT1,NUMOCCS,ALC_STATUS,SEAT_POS,VALIGN,ROLINLOC,ROLLOVER,VE_TOTAL,PERMVIT,NUM_INJ,AGE,VSURCOND,PSU,SPEC_USE,VTRAFCON,IMPACT1,P_CRASH1,INT_HWY,TYP_INT,MAN_COLL,MAX_SEV,REGION,PCRASH4,VTRAFWAY,SPEEDREL,HARM_EV,M_HARM,REST_USE,NUM_INJV,J_KNIFE,PCRASH5,INJ_SEV,LGT_COND,URBANICITY,DR_ZIP,HOUR,ACC_TYPE,HIT_RUN,WRK_ZONE,VE_FORMS,VSPD_LIM,VTCONT_F,MODEL,RELJCT2,EJECTION,TOW_VEH,VEH_AGE,P_CRASH2,PJ,CARGO_BT,MAX_VSEV,MONTH,MAKE,HOSPITAL,PER_TYP,DAY_WEEK,AIR_BAG,REL_ROAD,SEX,DEFORMED,WEATHER,TOWED,PVH_INVL,REST_MIS,BODY_TYP
0,3.0,8.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,2.0,1.0,0.0,2.0,0.0,4.0,1.0,0.0,7.0,1.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,1.0,1.0,6.0,5.0,7.0,0.0,0.0,2.0,3.0,1.0,4.0,3.0,0.0,0.0,1.0,8.0,4.0,0.0,3.0,0.0,8.0,0.0,0.0,1.0,4.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0,3.0
1,3.0,2.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,6.0,0.0,4.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,2.0,3.0,1.0,6.0,5.0,8.0,0.0,0.0,1.0,4.0,1.0,2.0,1.0,0.0,0.0,1.0,9.0,4.0,0.0,2.0,0.0,3.0,0.0,1.0,3.0,1.0,2.0,0.0,2.0,2.0,4.0,0.0,2.0,2.0
2,0.0,3.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,4.0,1.0,6.0,0.0,3.0,1.0,0.0,1.0,1.0,1.0,0.0,3.0,0.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,1.0,0.0,1.0,3.0,3.0,1.0,5.0,4.0,8.0,0.0,0.0,2.0,9.0,1.0,3.0,1.0,0.0,0.0,3.0,8.0,1.0,0.0,0.0,3.0,3.0,0.0,1.0,3.0,1.0,2.0,0.0,0.0,2.0,0.0,0.0,2.0,2.0
3,1.0,4.0,0.0,1.0,2.0,0.0,2.0,2.0,2.0,1.0,3.0,0.0,6.0,0.0,8.0,1.0,0.0,1.0,1.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,5.0,4.0,8.0,0.0,0.0,1.0,4.0,1.0,3.0,1.0,0.0,0.0,1.0,8.0,6.0,0.0,3.0,0.0,6.0,0.0,0.0,4.0,4.0,2.0,0.0,0.0,0.0,2.0,0.0,2.0,2.0
4,3.0,3.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,0.0,6.0,0.0,3.0,1.0,2.0,1.0,4.0,0.0,2.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,2.0,5.0,9.0,0.0,0.0,1.0,5.0,2.0,3.0,3.0,0.0,0.0,1.0,9.0,0.0,0.0,3.0,0.0,2.0,0.0,1.0,3.0,4.0,2.0,0.0,0.0,2.0,4.0,0.0,2.0,2.0
5,1.0,2.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,9.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,2.0,0.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,2.0,2.0,3.0,1.0,0.0,4.0,3.0,0.0,0.0,1.0,3.0,1.0,1.0,1.0,0.0,0.0,4.0,6.0,1.0,0.0,2.0,0.0,2.0,1.0,1.0,2.0,4.0,2.0,1.0,0.0,0.0,3.0,0.0,2.0,2.0
6,3.0,6.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,6.0,0.0,3.0,1.0,1.0,1.0,1.0,0.0,0.0,3.0,1.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,0.0,5.0,4.0,8.0,0.0,0.0,1.0,2.0,2.0,7.0,3.0,0.0,0.0,6.0,9.0,2.0,0.0,3.0,1.0,6.0,0.0,1.0,3.0,4.0,2.0,1.0,4.0,0.0,4.0,0.0,2.0,7.0
7,3.0,9.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,0.0,6.0,0.0,4.0,1.0,0.0,1.0,1.0,0.0,0.0,3.0,3.0,1.0,2.0,2.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,6.0,5.0,8.0,0.0,0.0,1.0,6.0,1.0,3.0,0.0,0.0,2.0,3.0,8.0,5.0,0.0,3.0,0.0,6.0,0.0,1.0,1.0,4.0,2.0,1.0,0.0,0.0,4.0,0.0,2.0,9.0
8,3.0,5.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,0.0,6.0,0.0,8.0,1.0,0.0,2.0,1.0,0.0,0.0,2.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,8.0,5.0,3.0,0.0,0.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,2.0,3.0,5.0,0.0,3.0,0.0,6.0,0.0,1.0,2.0,4.0,2.0,0.0,2.0,0.0,4.0,0.0,2.0,4.0
9,3.0,3.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,4.0,1.0,1.0,0.0,3.0,1.0,0.0,1.0,1.0,0.0,0.0,2.0,1.0,2.0,2.0,0.0,2.0,3.0,2.0,2.0,0.0,0.0,1.0,3.0,3.0,1.0,5.0,6.0,8.0,0.0,0.0,1.0,4.0,1.0,5.0,2.0,0.0,0.0,4.0,7.0,5.0,0.0,3.0,3.0,6.0,0.0,0.0,4.0,4.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,4.0



Start Loop

['VPROFILE', 27762]
27762


array([3., 3., 3., ..., 3., 3., 3.])

(200000, 67)

[3.0, 0.0, 1.0, 4.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,140656,167698,27042
1,0.0,6396,6396,0
2,1.0,6542,6542,0
3,4.0,4915,5635,720
4,2.0,13729,13729,0
5,,27762,0,0



['MAK_MOD', 27763]
27763


array([3., 3., 3., ..., 3., 3., 3.])

(200000, 67)

[8.0, 2.0, 3.0, 4.0, 6.0, 9.0, 5.0, 1.0, 0.0, 7.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,8.0,4798,4798,0
1,2.0,13222,13222,0
2,3.0,47626,68284,20658
3,4.0,37749,37749,0
4,6.0,31974,38126,6152
5,9.0,3949,4513,564
6,5.0,17866,17866,0
7,1.0,5829,5829,0
8,0.0,3165,3554,389
9,7.0,6059,6059,0



['RELJCT1', 27777]
27777


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,161867,189644,27777
1,1.0,10356,10356,0
2,,27777,0,0



['NUMOCCS', 27785]
27785


array([0., 0., 1., ..., 0., 0., 0.])

(200000, 67)

[2.0, 0.0, 1.0, 3.0, 4.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,19036,19036,0
1,0.0,90305,112810,22505
2,1.0,45149,50429,5280
3,3.0,10568,10568,0
4,4.0,7157,7157,0
5,,27785,0,0



['ALC_STATUS', 27786]
27786


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,168537,196323,27786
1,0.0,3587,3587,0
2,1.0,90,90,0
3,,27786,0,0



['SEAT_POS', 27786]
27786


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[0.0, 2.0, 1.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,29195,29200,5
1,2.0,121613,149298,27685
2,1.0,8111,8111,0
3,3.0,13295,13391,96
4,,27786,0,0



['VALIGN', 27799]
27799


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,152063,179268,27205
1,0.0,6445,6445,0
2,1.0,8793,8793,0
3,3.0,4900,5494,594
4,,27799,0,0



['ROLINLOC', 27803]
27803


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,166378,194181,27803
1,0.0,4630,4630,0
2,1.0,1189,1189,0
3,,27803,0,0



['ROLLOVER', 27804]
27804


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,166403,194207,27804
1,0.0,2799,2799,0
2,1.0,2994,2994,0
3,,27804,0,0



['VE_TOTAL', 27807]
27807


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[2.0, 1.0, 0.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,15529,15529,0
1,1.0,133208,158515,25307
2,0.0,19593,22093,2500
3,3.0,3863,3863,0
4,,27807,0,0



['PERMVIT', 27810]
27810


array([3., 3., 5., ..., 3., 5., 0.])

(200000, 67)

[1.0, 3.0, 4.0, 5.0, 2.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,2954,2954,0
1,3.0,69857,81844,11987
2,4.0,24523,24523,0
3,5.0,57288,71086,13798
4,2.0,4596,4596,0
5,0.0,12972,14997,2025
6,,27810,0,0



['NUM_INJ', 27816]
27816


array([0., 1., 1., ..., 0., 1., 0.])

(200000, 67)

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,84922,103630,18708
1,1.0,47756,56864,9108
2,2.0,22221,22221,0
3,3.0,9250,9250,0
4,4.0,4355,4355,0
5,5.0,3680,3680,0
6,,27816,0,0



['AGE', 27817]
27817


array([6., 6., 6., ..., 6., 6., 6.])

(200000, 67)

[2.0, 6.0, 9.0, 1.0, 7.0, 3.0, 4.0, 0.0, 5.0, 8.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,5907,5907,0
1,6.0,88003,115820,27817
2,9.0,8255,8255,0
3,1.0,7555,7555,0
4,7.0,26581,26581,0
5,3.0,7110,7110,0
6,4.0,4938,4938,0
7,0.0,4575,4575,0
8,5.0,9702,9702,0
9,8.0,9557,9557,0



['VSURCOND', 27817]
27817


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,139438,166526,27088
1,2.0,6554,7283,729
2,1.0,26191,26191,0
3,,27817,0,0



['PSU', 27819]
27819


array([4., 3., 4., ..., 3., 8., 3.])

(200000, 67)

[4.0, 8.0, 3.0, 0.0, 5.0, 6.0, 1.0, 7.0, 2.0, 9.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,4.0,40595,46767,6172
1,8.0,19229,22560,3331
2,3.0,43754,62022,18268
3,0.0,3671,3671,0
4,5.0,27099,27147,48
5,6.0,11339,11339,0
6,1.0,3656,3656,0
7,7.0,10180,10180,0
8,2.0,12333,12333,0
9,9.0,325,325,0



['SPEC_USE', 27821]
27821


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 0.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,171182,199003,27821
1,0.0,515,515,0
2,2.0,482,482,0
3,,27821,0,0



['VTRAFCON', 27823]
27823


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,120359,146048,25689
1,2.0,15585,15585,0
2,1.0,36233,38367,2134
3,,27823,0,0



['IMPACT1', 27830]
27830


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[7.0, 1.0, 2.0, 6.0, 3.0, 4.0, 9.0, 5.0, 0.0, 8.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,7.0,41502,46438,4936
1,1.0,78640,101534,22894
2,2.0,10490,10490,0
3,6.0,5929,5929,0
4,3.0,5669,5669,0
5,4.0,9912,9912,0
6,9.0,3281,3281,0
7,5.0,5574,5574,0
8,0.0,7559,7559,0
9,8.0,3614,3614,0



['P_CRASH1', 27830]
27830


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 4.0, 0.0, 2.0, 5.0, 3.0, 6.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,90835,118609,27774
1,4.0,27015,27071,56
2,0.0,11053,11053,0
3,2.0,16773,16773,0
4,5.0,9565,9565,0
5,3.0,10695,10695,0
6,6.0,6234,6234,0
7,,27830,0,0



['INT_HWY', 27831]
27831


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,150264,178095,27831
1,1.0,21905,21905,0
2,,27831,0,0



['TYP_INT', 27834]
27834


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,50652,57449,6797
1,0.0,100335,121372,21037
2,1.0,21179,21179,0
3,,27834,0,0



['MAN_COLL', 27837]
27837


array([3., 2., 2., ..., 1., 3., 3.])

(200000, 67)

[3.0, 2.0, 1.0, 0.0, 4.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,73376,91464,18088
1,2.0,46710,53023,6313
2,1.0,22199,25635,3436
3,0.0,7527,7527,0
4,4.0,22351,22351,0
5,,27837,0,0



['MAX_SEV', 27838]
27838


array([2., 3., 3., ..., 2., 3., 3.])

(200000, 67)

[3.0, 0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,84817,103527,18710
1,0.0,20221,21335,1114
2,2.0,40609,47934,7325
3,1.0,26515,27204,689
4,,27838,0,0



['REGION', 27838]
27838


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 0.0, 2.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,127182,155020,27838
1,0.0,5550,5550,0
2,2.0,35001,35001,0
3,3.0,4429,4429,0
4,,27838,0,0



['PCRASH4', 27839]
27839


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,164430,192269,27839
1,0.0,3646,3646,0
2,1.0,4085,4085,0
3,,27839,0,0



['VTRAFWAY', 27840]
27840


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 5.0, 1.0, 3.0, 4.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,76397,101848,25451
1,2.0,36381,38090,1709
2,5.0,4901,5581,680
3,1.0,43418,43418,0
4,3.0,3890,3890,0
5,4.0,7173,7173,0
6,,27840,0,0



['SPEEDREL', 27841]
27841


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,163683,191524,27841
1,0.0,2481,2481,0
2,1.0,5995,5995,0
3,,27841,0,0



['HARM_EV', 27842]
27842


array([3., 3., 3., ..., 3., 3., 3.])

(200000, 67)

[3.0, 1.0, 2.0, 0.0, 4.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,153094,179849,26755
1,1.0,5709,6735,1026
2,2.0,4951,4965,14
3,0.0,4523,4570,47
4,4.0,3881,3881,0
5,,27842,0,0



['M_HARM', 27843]
27843


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 1.0, 0.0, 4.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,149993,175453,25460
1,1.0,8872,10752,1880
2,0.0,6869,7372,503
3,4.0,3846,3846,0
4,3.0,2577,2577,0
5,,27843,0,0



['REST_USE', 27843]
27843


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 2.0, 0.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,155538,182977,27439
1,2.0,2175,2175,0
2,0.0,8501,8905,404
3,3.0,5943,5943,0
4,,27843,0,0



['NUM_INJV', 27844]
27844


array([0., 1., 0., ..., 0., 0., 0.])

(200000, 67)

[1.0, 0.0, 2.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,43817,52541,8724
1,0.0,108746,127866,19120
2,2.0,13060,13060,0
3,3.0,6533,6533,0
4,,27844,0,0



['J_KNIFE', 27845]
27845


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,168814,196392,27578
1,2.0,3249,3516,267
2,1.0,92,92,0
3,,27845,0,0



['PCRASH5', 27847]
27847


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 2.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,136417,161882,25465
1,2.0,17476,17476,0
2,0.0,18260,20642,2382
3,,27847,0,0



['INJ_SEV', 27850]
27850


array([3., 3., 3., ..., 2., 3., 3.])

(200000, 67)

[3.0, 2.0, 1.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,119965,146119,26154
1,2.0,26719,28388,1669
2,1.0,16567,16594,27
3,0.0,8899,8899,0
4,,27850,0,0



['LGT_COND', 27850]
27850


array([3., 3., 3., ..., 3., 3., 3.])

(200000, 67)

[1.0, 3.0, 2.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,26581,26581,0
1,3.0,124718,152568,27850
2,2.0,5081,5081,0
3,0.0,15770,15770,0
4,,27850,0,0



['URBANICITY', 27850]
27850


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,129535,157385,27850
1,0.0,42615,42615,0
2,,27850,0,0



['DR_ZIP', 27852]
27852


array([5., 5., 5., ..., 5., 5., 5.])

(200000, 67)

[6.0, 5.0, 2.0, 0.0, 8.0, 7.0, 4.0, 9.0, 1.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,6.0,27603,27603,0
1,5.0,51098,76280,25182
2,2.0,8826,8826,0
3,0.0,6397,6397,0
4,8.0,16918,16918,0
5,7.0,24896,27566,2670
6,4.0,20092,20092,0
7,9.0,6598,6598,0
8,1.0,3394,3394,0
9,3.0,6326,6326,0



['HOUR', 27853]
27853


array([5., 5., 5., ..., 5., 5., 5.])

(200000, 67)

[5.0, 4.0, 6.0, 3.0, 1.0, 0.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,5.0,59697,84654,24957
1,4.0,54789,56960,2171
2,6.0,24673,25398,725
3,3.0,16342,16342,0
4,1.0,5394,5394,0
5,0.0,5150,5150,0
6,2.0,6102,6102,0
7,,27853,0,0



['ACC_TYPE', 27854]
27854


array([8., 8., 8., ..., 7., 5., 8.])

(200000, 67)

[7.0, 8.0, 9.0, 3.0, 4.0, 0.0, 6.0, 2.0, 1.0, 5.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,7.0,32850,37622,4772
1,8.0,50902,67535,16633
2,9.0,6883,6883,0
3,3.0,15004,15009,5
4,4.0,8908,8908,0
5,0.0,7116,7122,6
6,6.0,19062,19062,0
7,2.0,4644,5359,715
8,1.0,5819,7357,1538
9,5.0,20958,25143,4185



['HIT_RUN', 27855]
27855


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,171355,199210,27855
1,1.0,790,790,0
2,,27855,0,0



['WRK_ZONE', 27855]
27855


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,168656,196511,27855
1,1.0,1849,1849,0
2,2.0,1640,1640,0
3,,27855,0,0



['VE_FORMS', 27858]
27858


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[2.0, 1.0, 3.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,15007,15007,0
1,1.0,131822,156722,24900
2,3.0,3617,3617,0
3,0.0,21696,24654,2958
4,,27858,0,0



['VSPD_LIM', 27860]
27860


array([3., 3., 3., ..., 3., 3., 3.])

(200000, 67)

[3.0, 4.0, 9.0, 5.0, 2.0, 6.0, 0.0, 1.0, 7.0, 8.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,49629,75690,26061
1,4.0,42517,43543,1026
2,9.0,11598,11598,0
3,5.0,7156,7156,0
4,2.0,8109,8109,0
5,6.0,21269,21277,8
6,0.0,4987,5752,765
7,1.0,13929,13929,0
8,7.0,4635,4635,0
9,8.0,8311,8311,0



['VTCONT_F', 27862]
27862


array([1., 1., 1., ..., 1., 2., 1.])

(200000, 67)

[1.0, 2.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,117054,138760,21706
1,2.0,54934,61090,6156
2,0.0,150,150,0
3,,27862,0,0



['MODEL', 27867]
27867


array([3., 3., 7., ..., 3., 3., 3.])

(200000, 67)

[4.0, 2.0, 3.0, 1.0, 7.0, 5.0, 6.0, 0.0, 9.0, 8.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,4.0,20280,20280,0
1,2.0,26697,26697,0
2,3.0,47285,70225,22940
3,1.0,5017,5017,0
4,7.0,22557,26502,3945
5,5.0,21969,21969,0
6,6.0,13954,13954,0
7,0.0,3105,3559,454
8,9.0,3617,4145,528
9,8.0,7652,7652,0



['RELJCT2', 27870]
27870


array([1., 1., 1., ..., 3., 1., 3.])

(200000, 67)

[3.0, 1.0, 0.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,34045,37551,3506
1,1.0,76891,95330,18439
2,0.0,43239,48639,5400
3,2.0,17955,18480,525
4,,27870,0,0



['EJECTION', 27875]
27875


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,168364,195784,27420
1,1.0,543,543,0
2,2.0,3218,3673,455
3,,27875,0,0



['TOW_VEH', 27878]
27878


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,168672,196312,27640
1,2.0,3340,3578,238
2,1.0,110,110,0
3,,27878,0,0



['VEH_AGE', 27881]
27881


array([1., 1., 1., ..., 1., 1., 1.])

(200000, 67)

[1.0, 3.0, 4.0, 6.0, 2.0, 5.0, 9.0, 7.0, 8.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,54354,82235,27881
1,3.0,23517,23517,0
2,4.0,15145,15145,0
3,6.0,12592,12592,0
4,2.0,18177,18177,0
5,5.0,13765,13765,0
6,9.0,12228,12228,0
7,7.0,5173,5173,0
8,8.0,7813,7813,0
9,0.0,9355,9355,0



['P_CRASH2', 27882]
27882


array([8., 8., 8., ..., 8., 1., 8.])

(200000, 67)

[8.0, 9.0, 6.0, 3.0, 7.0, 5.0, 2.0, 1.0, 0.0, 4.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,8.0,67509,90156,22647
1,9.0,24267,24267,0
2,6.0,22718,25709,2991
3,3.0,11485,11485,0
4,7.0,9340,9340,0
5,5.0,13856,13856,0
6,2.0,4513,4513,0
7,1.0,6994,9238,2244
8,0.0,4419,4419,0
9,4.0,7017,7017,0



['PJ', 27883]
27883


array([5., 5., 8., ..., 5., 8., 5.])

(200000, 67)

[4.0, 1.0, 6.0, 0.0, 2.0, 5.0, 8.0, 7.0, 3.0, 9.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,4.0,21986,21986,0
1,1.0,5025,5025,0
2,6.0,24563,24563,0
3,0.0,7128,7128,0
4,2.0,7550,7550,0
5,5.0,35437,58080,22643
6,8.0,21763,26031,4268
7,7.0,25244,26216,972
8,3.0,19360,19360,0
9,9.0,4061,4061,0



['CARGO_BT', 27884]
27884


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,167631,195028,27397
1,2.0,3788,4275,487
2,1.0,697,697,0
3,,27884,0,0



['MAX_VSEV', 27885]
27885


array([3., 3., 1., ..., 3., 3., 3.])

(200000, 67)

[3.0, 2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,108740,129735,20995
1,2.0,30186,34932,4746
2,0.0,11649,11649,0
3,1.0,21540,23684,2144
4,,27885,0,0



['MONTH', 27885]
27885


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 3.0, 1.0, 4.0, 5.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,38627,64783,26156
1,3.0,31383,31827,444
2,1.0,26772,26772,0
3,4.0,31928,33211,1283
4,5.0,15268,15268,0
5,2.0,28137,28139,2
6,,27885,0,0



['MAKE', 27888]
27888


array([9., 6., 6., ..., 6., 6., 6.])

(200000, 67)

[8.0, 3.0, 6.0, 2.0, 4.0, 1.0, 7.0, 0.0, 5.0, 9.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,8.0,7076,7076,0
1,3.0,14622,14622,0
2,6.0,60165,87062,26897
3,2.0,25288,25288,0
4,4.0,10955,10955,0
5,1.0,5139,5139,0
6,7.0,16734,16734,0
7,0.0,2799,3278,479
8,5.0,26305,26305,0
9,9.0,3029,3541,512



['HOSPITAL', 27895]
27895


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,143302,170004,26702
1,1.0,28803,29996,1193
2,,27895,0,0



['PER_TYP', 27895]
27895


array([1., 1., 1., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,50584,55706,5122
1,1.0,121521,144294,22773
2,,27895,0,0



['DAY_WEEK', 27896]
27896


array([3., 3., 3., ..., 3., 0., 3.])

(200000, 67)

[1.0, 4.0, 3.0, 2.0, 0.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,23972,23972,0
1,4.0,29508,29508,0
2,3.0,51247,75417,24170
3,2.0,24732,24732,0
4,0.0,42645,46371,3726
5,,27896,0,0



['AIR_BAG', 27898]
27898


array([4., 4., 4., ..., 4., 4., 4.])

(200000, 67)

[1.0, 4.0, 3.0, 0.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,1.0,13907,13907,0
1,4.0,131314,159212,27898
2,3.0,4763,4763,0
3,0.0,12320,12320,0
4,2.0,9798,9798,0
5,,27898,0,0



['REL_ROAD', 27904]
27904


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,155696,181311,25615
1,0.0,12773,15062,2289
2,1.0,3627,3627,0
3,,27904,0,0



['SEX', 27905]
27905


array([0., 0., 0., ..., 1., 1., 0.])

(200000, 67)

[0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,82617,97788,15171
1,1.0,89478,102212,12734
2,,27905,0,0



['DEFORMED', 27914]
27914


array([0., 0., 4., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 4.0, 1.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,72848,87162,14314
1,2.0,41134,41134,0
2,4.0,52709,66309,13600
3,1.0,2773,2773,0
4,3.0,2622,2622,0
5,,27914,0,0



['WEATHER', 27917]
27917


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 2.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,123350,151267,27917
1,2.0,31936,31936,0
2,1.0,16797,16797,0
3,,27917,0,0



['TOWED', 27922]
27922


array([4., 4., 4., ..., 4., 0., 0.])

(200000, 67)

[2.0, 4.0, 0.0, 3.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,14138,14138,0
1,4.0,92602,112057,19455
2,0.0,58178,66645,8467
3,3.0,7160,7160,0
4,,27922,0,0



['PVH_INVL', 27942]
27942


array([0., 0., 0., ..., 0., 0., 0.])

(200000, 67)

[0.0, 1.0, 2.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,0.0,169321,197263,27942
1,1.0,2284,2284,0
2,2.0,453,453,0
3,,27942,0,0



['REST_MIS', 27943]
27943


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[2.0, 0.0, 1.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,2.0,165271,193017,27746
1,0.0,5065,5262,197
2,1.0,1721,1721,0
3,,27943,0,0



['BODY_TYP', 27965]
27965


array([2., 2., 2., ..., 2., 2., 2.])

(200000, 67)

[3.0, 2.0, 7.0, 9.0, 4.0, 6.0, 1.0, 0.0, 5.0, 8.0]


Unnamed: 0,Value,Original,Imputed,Difference
0,3.0,6425,6425,0
1,2.0,66786,89547,22761
2,7.0,5864,5864,0
3,9.0,2328,2648,320
4,4.0,41656,41656,0
5,6.0,28378,32826,4448
6,1.0,8347,8347,0
7,0.0,3103,3539,436
8,5.0,7647,7647,0
9,8.0,1501,1501,0





Unnamed: 0,VPROFILE,MAK_MOD,RELJCT1,NUMOCCS,ALC_STATUS,SEAT_POS,VALIGN,ROLINLOC,ROLLOVER,VE_TOTAL,PERMVIT,NUM_INJ,AGE,VSURCOND,PSU,SPEC_USE,VTRAFCON,IMPACT1,P_CRASH1,INT_HWY,TYP_INT,MAN_COLL,MAX_SEV,REGION,PCRASH4,VTRAFWAY,SPEEDREL,HARM_EV,M_HARM,REST_USE,NUM_INJV,J_KNIFE,PCRASH5,INJ_SEV,LGT_COND,URBANICITY,DR_ZIP,HOUR,ACC_TYPE,HIT_RUN,WRK_ZONE,VE_FORMS,VSPD_LIM,VTCONT_F,MODEL,RELJCT2,EJECTION,TOW_VEH,VEH_AGE,P_CRASH2,PJ,CARGO_BT,MAX_VSEV,MONTH,MAKE,HOSPITAL,PER_TYP,DAY_WEEK,AIR_BAG,REL_ROAD,SEX,DEFORMED,WEATHER,TOWED,PVH_INVL,REST_MIS,BODY_TYP
158160,3.0,3.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,6.0,0.0,5.0,1.0,0.0,7.0,4.0,0.0,0.0,3.0,1.0,1.0,2.0,1.0,2.0,3.0,2.0,1.0,1.0,0.0,1.0,1.0,3.0,1.0,7.0,6.0,6.0,0.0,0.0,1.0,4.0,1.0,2.0,3.0,0.0,0.0,0.0,8.0,6.0,0.0,1.0,2.0,6.0,0.0,1.0,3.0,4.0,2.0,1.0,1.0,0.0,4.0,0.0,2.0,2.0
58451,3.0,6.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,0.0,5.0,1.0,6.0,0.0,3.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,1.0,0.0,1.0,3.0,3.0,0.0,0.0,1.0,8.0,0.0,0.0,0.0,8.0,1.0,6.0,1.0,0.0,0.0,4.0,8.0,0.0,0.0,0.0,2.0,2.0,0.0,1.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0
24748,3.0,6.0,0.0,1.0,2.0,2.0,2.0,2.0,2.0,1.0,4.0,0.0,3.0,0.0,3.0,1.0,1.0,1.0,2.0,0.0,2.0,0.0,3.0,1.0,2.0,1.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,1.0,1.0,1.0,6.0,5.0,0.0,0.0,1.0,3.0,2.0,5.0,0.0,0.0,0.0,1.0,6.0,4.0,0.0,3.0,2.0,5.0,0.0,1.0,0.0,4.0,2.0,0.0,1.0,0.0,4.0,0.0,2.0,2.0
16366,3.0,0.0,1.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,1.0,7.0,0.0,3.0,1.0,2.0,7.0,4.0,0.0,0.0,3.0,1.0,1.0,2.0,3.0,2.0,3.0,2.0,0.0,1.0,0.0,1.0,2.0,3.0,1.0,5.0,5.0,8.0,0.0,1.0,1.0,3.0,2.0,0.0,3.0,2.0,0.0,4.0,8.0,5.0,0.0,1.0,3.0,0.0,1.0,1.0,3.0,3.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,0.0
97345,4.0,3.0,0.0,1.0,2.0,0.0,3.0,2.0,2.0,1.0,3.0,0.0,8.0,2.0,6.0,1.0,2.0,7.0,1.0,0.0,0.0,3.0,3.0,2.0,2.0,5.0,2.0,3.0,2.0,2.0,0.0,0.0,1.0,3.0,3.0,1.0,4.0,4.0,6.0,0.0,0.0,1.0,0.0,2.0,3.0,3.0,0.0,0.0,2.0,8.0,8.0,0.0,3.0,5.0,5.0,0.0,0.0,4.0,4.0,2.0,0.0,2.0,0.0,4.0,0.0,2.0,2.0
196467,3.0,6.0,0.0,0.0,2.0,2.0,2.0,2.0,2.0,1.0,5.0,0.0,6.0,0.0,7.0,1.0,1.0,1.0,1.0,0.0,1.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,1.0,8.0,5.0,8.0,0.0,0.0,1.0,3.0,2.0,5.0,3.0,0.0,0.0,2.0,8.0,7.0,0.0,3.0,2.0,6.0,0.0,1.0,3.0,4.0,2.0,1.0,4.0,0.0,4.0,0.0,2.0,2.0
75866,3.0,3.0,0.0,2.0,0.0,2.0,2.0,2.0,2.0,1.0,3.0,1.0,6.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,2.0,3.0,2.0,1.0,3.0,0.0,1.0,0.0,1.0,1.0,0.0,5.0,8.0,0.0,0.0,1.0,3.0,1.0,3.0,1.0,0.0,0.0,4.0,1.0,3.0,0.0,1.0,5.0,6.0,1.0,1.0,0.0,0.0,2.0,1.0,0.0,2.0,0.0,0.0,2.0,2.0
44924,3.0,2.0,0.0,2.0,2.0,0.0,2.0,2.0,2.0,1.0,3.0,0.0,6.0,0.0,4.0,1.0,0.0,7.0,1.0,0.0,0.0,3.0,3.0,1.0,2.0,0.0,2.0,3.0,2.0,1.0,0.0,0.0,1.0,3.0,3.0,0.0,5.0,6.0,7.0,0.0,0.0,1.0,3.0,1.0,4.0,1.0,0.0,0.0,4.0,8.0,5.0,0.0,3.0,3.0,7.0,0.0,0.0,3.0,4.0,2.0,1.0,4.0,0.0,4.0,0.0,2.0,2.0
197226,3.0,6.0,0.0,3.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0,8.0,1.0,0.0,7.0,1.0,1.0,0.0,3.0,3.0,1.0,2.0,1.0,2.0,3.0,2.0,3.0,0.0,0.0,1.0,3.0,3.0,1.0,8.0,4.0,8.0,0.0,0.0,1.0,9.0,1.0,6.0,1.0,0.0,0.0,1.0,8.0,8.0,0.0,3.0,3.0,6.0,0.0,0.0,0.0,4.0,2.0,1.0,4.0,0.0,4.0,0.0,2.0,2.0
73799,3.0,4.0,1.0,1.0,2.0,0.0,2.0,2.0,2.0,1.0,3.0,1.0,6.0,0.0,2.0,1.0,0.0,4.0,2.0,0.0,2.0,2.0,0.0,1.0,2.0,4.0,2.0,3.0,2.0,1.0,1.0,0.0,1.0,3.0,3.0,1.0,5.0,5.0,5.0,0.0,0.0,1.0,3.0,1.0,4.0,0.0,0.0,0.0,9.0,6.0,3.0,0.0,0.0,0.0,6.0,0.0,0.0,3.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,2.0



data_RF.shape
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


Impute_MissForest()
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,,,3.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,7.0,3.0,0.0,0.0,1.0,8.0,8.0,3.0,3.0,3.0,4.0,0.0,2.0,2.0,0.0,,2.0,1.0,1.0,0.0,4.0,4.0,0.0,,8.0,1.0,,3.0,2.0,2.0,1.0,,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,3.0,,1.0,0.0,,0.0,0.0
1,0.0,8.0,6.0,1.0,,,0.0,,2.0,6.0,0.0,3.0,0.0,5.0,1.0,2.0,0.0,,3.0,3.0,2.0,3.0,,2.0,2.0,0.0,,0.0,1.0,,2.0,1.0,3.0,1.0,4.0,4.0,0.0,1.0,9.0,1.0,0.0,,2.0,2.0,1.0,2.0,2.0,2.0,0.0,,2.0,4.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,,0.0,,0.0,5.0,0.0,,,4.0,1.0,,1.0,,3.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,2.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,,0.0,1.0,8.0,1.0,0.0,,2.0,2.0,1.0,,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,,2.0,3.0,2.0,2.0,0.0,9.0,0.0,1.0,0.0,,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,,0.0,3.0,0.0,4.0,,3.0,0.0,0.0,3.0,6.0,4.0,3.0,3.0,3.0,,0.0,2.0,1.0,0.0,0.0,2.0,1.0,3.0,0.0,6.0,8.0,0.0,,8.0,1.0,,,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,4.0,,,,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,,2.0,0.0,3.0,,2.0,0.0,3.0,0.0,5.0,1.0,,0.0,0.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,0.0,2.0,,0.0,0.0,2.0,1.0,5.0,1.0,0.0,3.0,,4.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,2.0,1.0,2.0,,1.0,1.0,,5.0,0.0,2.0,2.0,,2.0,0.0
5,1.0,3.0,9.0,4.0,2.0,2.0,0.0,2.0,,0.0,0.0,3.0,0.0,4.0,,2.0,0.0,0.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,,2.0,0.0,1.0,,2.0,2.0,5.0,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,1.0,2.0,2.0,1.0,2.0,2.0,,1.0,,2.0,3.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,1.0,3.0,0.0,1.0,,0.0,0.0,0.0
6,0.0,8.0,6.0,4.0,,7.0,0.0,,4.0,5.0,0.0,3.0,0.0,4.0,1.0,3.0,0.0,0.0,3.0,6.0,6.0,3.0,1.0,,7.0,1.0,2.0,0.0,1.0,0.0,,1.0,5.0,1.0,2.0,,,1.0,9.0,1.0,0.0,3.0,2.0,,1.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,0.0,,0.0,2.0,6.0,1.0,1.0,3.0,2.0,,2.0,1.0,,0.0,
7,0.0,8.0,6.0,4.0,2.0,9.0,,1.0,0.0,6.0,0.0,3.0,0.0,,1.0,3.0,0.0,,3.0,,9.0,3.0,3.0,3.0,,,2.0,0.0,,,2.0,1.0,5.0,1.0,5.0,4.0,0.0,,,1.0,0.0,0.0,2.0,2.0,,2.0,2.0,,1.0,,2.0,4.0,2.0,,,2.0,3.0,1.0,1.0,,6.0,0.0,,0.0,2.0,,0.0
8,0.0,3.0,6.0,4.0,2.0,4.0,0.0,2.0,2.0,8.0,0.0,3.0,0.0,5.0,2.0,3.0,0.0,0.0,3.0,6.0,5.0,2.0,3.0,3.0,5.0,0.0,2.0,,0.0,0.0,,1.0,3.0,1.0,,8.0,,1.0,3.0,,0.0,2.0,,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,0.0,,2.0,2.0,1.0,1.0,,4.0,0.0,,0.0,,0.0,0.0
9,,,1.0,4.0,2.0,4.0,0.0,4.0,0.0,5.0,0.0,3.0,0.0,6.0,1.0,3.0,0.0,0.0,3.0,6.0,3.0,2.0,1.0,3.0,5.0,3.0,2.0,2.0,1.0,0.0,2.0,1.0,4.0,0.0,5.0,,0.0,1.0,7.0,2.0,0.0,2.0,,,2.0,2.0,2.0,1.0,,1.0,,0.0,0.0,0.0,,2.0,4.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,


['HOSPITAL', 'ACC_TYPE', 'AGE', 'AIR_BAG', 'ALC_STATUS', 'BODY_TYP', 'CARGO_BT', 'DAY_WEEK', 'DEFORMED', 'DR_ZIP', 'EJECTION', 'HARM_EV', 'HIT_RUN', 'HOUR', 'IMPACT1', 'INJ_SEV', 'INT_HWY', 'J_KNIFE', 'LGT_COND', 'MAKE', 'MAK_MOD', 'MAN_COLL', 'MAX_SEV', 'MAX_VSEV', 'MODEL', 'MONTH', 'M_HARM', 'NUMOCCS', 'NUM_INJ', 'NUM_INJV', 'PCRASH4', 'PCRASH5', 'PERMVIT', 'PER_TYP', 'PJ', 'PSU', 'PVH_INVL', 'P_CRASH1', 'P_CRASH2', 'REGION', 'RELJCT1', 'RELJCT2', 'REL_ROAD', 'REST_MIS', 'REST_USE', 'ROLINLOC', 'ROLLOVER', 'SEAT_POS', 'SEX', 'SPEC_USE', 'SPEEDREL', 'TOWED', 'TOW_VEH', 'TYP_INT', 'URBANICITY', 'VALIGN', 'VEH_AGE', 'VE_FORMS', 'VE_TOTAL', 'VPROFILE', 'VSPD_LIM', 'VSURCOND', 'VTCONT_F', 'VTRAFCON', 'VTRAFWAY', 'WEATHER', 'WRK_ZONE']


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,4.0,2.0,3.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,7.0,3.0,0.0,0.0,1.0,8.0,8.0,3.0,3.0,3.0,4.0,0.0,2.0,2.0,0.0,0.0,2.0,1.0,1.0,0.0,4.0,4.0,0.0,1.0,8.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,3.0,0.0,1.0,0.0,2.0,0.0,0.0
1,0.0,8.0,6.0,1.0,2.0,2.0,0.0,3.0,2.0,6.0,0.0,3.0,0.0,5.0,1.0,2.0,0.0,0.0,3.0,3.0,2.0,3.0,1.0,2.0,2.0,0.0,2.0,0.0,1.0,0.0,2.0,1.0,3.0,1.0,4.0,4.0,0.0,1.0,9.0,1.0,0.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,2.0,4.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,2.0,0.0,4.0,0.0,5.0,0.0,3.0,0.0,4.0,1.0,3.0,1.0,0.0,3.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,2.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,4.0,0.0,1.0,8.0,1.0,0.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,1.0,2.0,3.0,2.0,2.0,0.0,9.0,0.0,1.0,0.0,2.0,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,4.0,0.0,3.0,0.0,4.0,1.0,3.0,0.0,0.0,3.0,6.0,4.0,3.0,3.0,3.0,4.0,0.0,2.0,1.0,0.0,0.0,2.0,1.0,3.0,0.0,6.0,8.0,0.0,1.0,8.0,1.0,0.0,1.0,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,2.0,2.0,0.0,3.0,2.0,2.0,0.0,3.0,0.0,5.0,1.0,3.0,0.0,0.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,0.0,2.0,0.0,0.0,0.0,2.0,1.0,5.0,1.0,0.0,3.0,0.0,4.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,2.0,4.0,0.0,2.0,1.0,2.0,4.0,1.0,1.0,3.0,5.0,0.0,2.0,2.0,0.0,2.0,0.0
5,1.0,3.0,9.0,4.0,2.0,2.0,0.0,2.0,2.0,0.0,0.0,3.0,0.0,4.0,1.0,2.0,0.0,0.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,4.0,2.0,0.0,1.0,0.0,2.0,2.0,5.0,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,1.0,3.0,0.0,1.0,0.0,0.0,0.0,0.0
6,0.0,8.0,6.0,4.0,2.0,7.0,0.0,3.0,4.0,5.0,0.0,3.0,0.0,4.0,1.0,3.0,0.0,0.0,3.0,6.0,6.0,3.0,1.0,3.0,7.0,1.0,2.0,0.0,1.0,0.0,2.0,1.0,5.0,1.0,2.0,3.0,0.0,1.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,0.0,0.0,0.0,2.0,6.0,1.0,1.0,3.0,2.0,0.0,2.0,1.0,1.0,0.0,0.0
7,0.0,8.0,6.0,4.0,2.0,9.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,1.0,3.0,0.0,0.0,3.0,7.0,9.0,3.0,3.0,3.0,2.0,4.0,2.0,0.0,0.0,0.0,2.0,1.0,5.0,1.0,5.0,4.0,0.0,1.0,8.0,1.0,0.0,0.0,2.0,2.0,1.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,2.0,0.0,1.0,2.0,3.0,1.0,1.0,3.0,6.0,0.0,1.0,0.0,2.0,0.0,0.0
8,0.0,3.0,6.0,4.0,2.0,4.0,0.0,2.0,2.0,8.0,0.0,3.0,0.0,5.0,2.0,3.0,0.0,0.0,3.0,6.0,5.0,2.0,3.0,3.0,5.0,0.0,2.0,0.0,0.0,0.0,2.0,1.0,3.0,1.0,6.0,8.0,0.0,1.0,3.0,1.0,0.0,2.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,2.0,4.0,0.0,0.0,1.0,2.0,2.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0
9,0.0,3.0,1.0,4.0,2.0,4.0,0.0,4.0,0.0,5.0,0.0,3.0,0.0,6.0,1.0,3.0,0.0,0.0,3.0,6.0,3.0,2.0,1.0,3.0,5.0,3.0,2.0,2.0,1.0,0.0,2.0,1.0,4.0,0.0,5.0,3.0,0.0,1.0,7.0,2.0,0.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,0.0


Finished Impute_MissForest()

data_MF.shape
(200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,2,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,1,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,0,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,4,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,4,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,4,0,3,0,4,1,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,2,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,4,1,1,3,5,0,2,2,0,2,0



Impute_Randomly()

HOSPITAL
Original Value Counts
HOSPITAL
0.0    0.832643
1.0    0.167357
Name: proportion, dtype: float64
Value Counts after Sampling
HOSPITAL
0.0    0.832565
1.0    0.167435
Name: proportion, dtype: float64

ACC_TYPE
Original Value Counts
ACC_TYPE
8.0    0.295691
7.0    0.190826
5.0    0.121745
6.0    0.110732
3.0    0.087159
4.0    0.051747
0.0    0.041337
9.0    0.039984
1.0    0.033803
2.0    0.026977
Name: proportion, dtype: float64
Value Counts after Sampling
ACC_TYPE
8.0    0.294825
7.0    0.191515
5.0    0.122205
6.0    0.111980
3.0    0.086845
4.0    0.051975
0.0    0.040590
9.0    0.039045
1.0    0.033850
2.0    0.027170
Name: proportion, dtype: float64

AGE
Original Value Counts
AGE
6.0    0.511102
7.0    0.154376
5.0    0.056347
8.0    0.055505
9.0    0.047943
1.0    0.043878
3.0    0.041293
2.0    0.034307
4.0    0.028679
0.0    0.026571
Name: proportion, dtype: float64
Value Counts after Sampling
AGE
6.0    0.512790
7.0    0.153020
5.0    0.056445
8.0  

Original Value Counts
NUMOCCS
0.0    0.524374
1.0    0.262166
2.0    0.110536
3.0    0.061365
4.0    0.041559
Name: proportion, dtype: float64
Value Counts after Sampling
NUMOCCS
0.0    0.524040
1.0    0.261975
2.0    0.110430
3.0    0.061990
4.0    0.041565
Name: proportion, dtype: float64

NUM_INJ
Original Value Counts
NUM_INJ
0.0    0.493205
1.0    0.277354
2.0    0.129054
3.0    0.053722
4.0    0.025293
5.0    0.021372
Name: proportion, dtype: float64
Value Counts after Sampling
NUM_INJ
0.0    0.494380
1.0    0.276265
2.0    0.128160
3.0    0.053630
4.0    0.025560
5.0    0.022005
Name: proportion, dtype: float64

NUM_INJV
Original Value Counts
NUM_INJV
0.0    0.631671
1.0    0.254519
2.0    0.075861
3.0    0.037948
Name: proportion, dtype: float64
Value Counts after Sampling
NUM_INJV
0.0    0.632035
1.0    0.253635
2.0    0.076035
3.0    0.038295
Name: proportion, dtype: float64

PCRASH4
Original Value Counts
PCRASH4
2.0    0.955094
1.0    0.023728
0.0    0.021178
Name: proportion


VSURCOND
Original Value Counts
VSURCOND
0.0    0.809824
1.0    0.152111
2.0    0.038064
Name: proportion, dtype: float64
Value Counts after Sampling
VSURCOND
0.0    0.809720
1.0    0.152145
2.0    0.038135
Name: proportion, dtype: float64

VTCONT_F
Original Value Counts
VTCONT_F
1.0    0.680001
2.0    0.319128
0.0    0.000871
Name: proportion, dtype: float64
Value Counts after Sampling
VTCONT_F
1.0    0.679765
2.0    0.319405
0.0    0.000830
Name: proportion, dtype: float64

VTRAFCON
Original Value Counts
VTRAFCON
0.0    0.699042
1.0    0.210440
2.0    0.090517
Name: proportion, dtype: float64
Value Counts after Sampling
VTRAFCON
0.0    0.698945
1.0    0.211045
2.0    0.090010
Name: proportion, dtype: float64

VTRAFWAY
Original Value Counts
VTRAFWAY
0.0    0.443756
1.0    0.252196
2.0    0.211321
4.0    0.041665
5.0    0.028468
3.0    0.022595
Name: proportion, dtype: float64
Value Counts after Sampling
VTRAFWAY
0.0    0.444280
1.0    0.251780
2.0    0.211220
4.0    0.041665
5.0    0.

Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,3,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,1,0,0
1,0,8,6,1,2,8,0,4,2,6,0,3,0,5,1,2,0,0,3,3,2,3,3,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,1,2,0
2,0,8,6,1,2,6,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,4,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,0,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,2,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,2,8,1,0,2,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,1,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


CPU times: user 5min 50s, sys: 16.3 s, total: 6min 6s
Wall time: 6min 11s


## Do IVEware Imputation (Outside this Jupyter Notebook)
- Go to the IVEware folder and run (at the command line) IVE_12_22_22.bat
- Requires scrlib and R.  You may need to, in the batch file, change the path to your scrlib installation.
- Notes to self:
    - Open srcshell
    - From srcshell, open IVEware_CRSS_Imputation.xml
    - Run
- Run time: ./IVEware_CRSS_Imputation.bat  1069.08s user 12.92s system 98% cpu 18:23.92 total

In [14]:
data_IVEware = pd.read_csv('../../Big_Files/data_IVEware.csv')
data_IVEware.drop(columns='Unnamed: 0', inplace=True)

print ('data_Ground_Truth', data_Ground_Truth.shape)
display(data_Ground_Truth.head(10))
print ('data_NaN', data_NaN.shape)
display(data_NaN.head(10))
print ('data_RF', data_RF.shape)
display(data_RF.head(10))
print ('data_MF', data_MF.shape)
display(data_MF.head(10))
print ('data_Mode', data_Mode.shape)
display(data_Mode.head(10))
print ('data_IVEware', data_IVEware.shape)
display(data_IVEware.head(10))


data_Ground_Truth (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,1,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,1,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,8,0,3,0,4,7,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,2,2,0,0,0
4,0,9,6,4,2,2,0,3,2,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,4,1,1,3,5,0,2,2,0,2,0
5,1,3,9,4,2,2,0,2,2,0,0,3,0,4,2,2,0,0,3,2,2,3,2,2,1,3,2,0,1,1,2,2,5,1,1,0,0,1,6,0,0,1,2,2,1,2,2,2,1,1,2,3,0,0,1,2,4,1,1,1,3,0,1,0,0,0,0
6,0,8,6,4,2,7,0,2,4,5,0,3,0,4,1,3,0,0,3,6,6,3,1,3,7,1,2,0,1,0,2,1,5,1,2,5,0,1,9,1,0,3,2,2,1,2,2,2,1,1,2,4,0,1,0,2,6,1,1,3,2,0,2,1,0,0,0
7,0,8,6,4,2,9,2,1,0,6,0,3,0,3,1,3,0,2,3,9,9,3,3,3,9,0,2,0,0,0,2,1,5,1,5,4,0,1,8,1,0,0,2,2,1,2,2,2,1,1,2,4,2,1,1,2,3,1,1,3,6,0,1,0,2,0,0
8,0,3,6,4,2,4,0,2,2,8,0,3,0,5,2,3,0,0,3,6,5,2,3,3,5,0,2,1,0,0,2,1,3,1,8,8,0,1,3,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,2,1,1,3,4,0,1,0,1,0,0
9,0,6,1,4,2,4,0,4,0,5,0,3,0,6,1,3,0,0,3,6,3,2,1,3,5,3,2,2,1,0,2,1,4,0,5,4,0,1,7,2,0,2,2,2,2,2,2,1,0,1,2,0,0,0,1,2,4,1,1,3,4,0,1,0,0,0,0


data_NaN (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,,,3.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,7.0,3.0,0.0,0.0,1.0,8.0,8.0,3.0,3.0,3.0,4.0,0.0,2.0,2.0,0.0,,2.0,1.0,1.0,0.0,4.0,4.0,0.0,,8.0,1.0,,3.0,2.0,2.0,1.0,,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,3.0,,1.0,0.0,,0.0,0.0
1,0.0,8.0,6.0,1.0,,,0.0,,2.0,6.0,0.0,3.0,0.0,5.0,1.0,2.0,0.0,,3.0,3.0,2.0,3.0,,2.0,2.0,0.0,,0.0,1.0,,2.0,1.0,3.0,1.0,4.0,4.0,0.0,1.0,9.0,1.0,0.0,,2.0,2.0,1.0,2.0,2.0,2.0,0.0,,2.0,4.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,,0.0,,0.0,5.0,0.0,,,4.0,1.0,,1.0,,3.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,2.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,,0.0,1.0,8.0,1.0,0.0,,2.0,2.0,1.0,,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,,2.0,3.0,2.0,2.0,0.0,9.0,0.0,1.0,0.0,,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,,0.0,3.0,0.0,4.0,,3.0,0.0,0.0,3.0,6.0,4.0,3.0,3.0,3.0,,0.0,2.0,1.0,0.0,0.0,2.0,1.0,3.0,0.0,6.0,8.0,0.0,,8.0,1.0,,,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,4.0,,,,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,,2.0,0.0,3.0,,2.0,0.0,3.0,0.0,5.0,1.0,,0.0,0.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,0.0,2.0,,0.0,0.0,2.0,1.0,5.0,1.0,0.0,3.0,,4.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,2.0,1.0,2.0,,1.0,1.0,,5.0,0.0,2.0,2.0,,2.0,0.0
5,1.0,3.0,9.0,4.0,2.0,2.0,0.0,2.0,,0.0,0.0,3.0,0.0,4.0,,2.0,0.0,0.0,3.0,2.0,2.0,3.0,2.0,2.0,1.0,,2.0,0.0,1.0,,2.0,2.0,5.0,1.0,1.0,0.0,0.0,1.0,6.0,0.0,0.0,1.0,2.0,2.0,1.0,2.0,2.0,,1.0,,2.0,3.0,0.0,0.0,1.0,2.0,4.0,1.0,1.0,1.0,3.0,0.0,1.0,,0.0,0.0,0.0
6,0.0,8.0,6.0,4.0,,7.0,0.0,,4.0,5.0,0.0,3.0,0.0,4.0,1.0,3.0,0.0,0.0,3.0,6.0,6.0,3.0,1.0,,7.0,1.0,2.0,0.0,1.0,0.0,,1.0,5.0,1.0,2.0,,,1.0,9.0,1.0,0.0,3.0,2.0,,1.0,2.0,2.0,2.0,1.0,1.0,2.0,4.0,0.0,,0.0,2.0,6.0,1.0,1.0,3.0,2.0,,2.0,1.0,,0.0,
7,0.0,8.0,6.0,4.0,2.0,9.0,,1.0,0.0,6.0,0.0,3.0,0.0,,1.0,3.0,0.0,,3.0,,9.0,3.0,3.0,3.0,,,2.0,0.0,,,2.0,1.0,5.0,1.0,5.0,4.0,0.0,,,1.0,0.0,0.0,2.0,2.0,,2.0,2.0,,1.0,,2.0,4.0,2.0,,,2.0,3.0,1.0,1.0,,6.0,0.0,,0.0,2.0,,0.0
8,0.0,3.0,6.0,4.0,2.0,4.0,0.0,2.0,2.0,8.0,0.0,3.0,0.0,5.0,2.0,3.0,0.0,0.0,3.0,6.0,5.0,2.0,3.0,3.0,5.0,0.0,2.0,,0.0,0.0,,1.0,3.0,1.0,,8.0,,1.0,3.0,,0.0,2.0,,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,0.0,,2.0,2.0,1.0,1.0,,4.0,0.0,,0.0,,0.0,0.0
9,,,1.0,4.0,2.0,4.0,0.0,4.0,0.0,5.0,0.0,3.0,0.0,6.0,1.0,3.0,0.0,0.0,3.0,6.0,3.0,2.0,1.0,3.0,5.0,3.0,2.0,2.0,1.0,0.0,2.0,1.0,4.0,0.0,5.0,,0.0,1.0,7.0,2.0,0.0,2.0,,,2.0,2.0,2.0,1.0,,1.0,,0.0,0.0,0.0,,2.0,4.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,0.0,0.0,


data_RF (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0
5,1,3,9,4,2,2,0,2,0,0,0,3,0,4,1,2,0,0,3,2,2,3,2,2,1,0,2,0,1,1,2,2,5,1,1,0,0,1,6,0,0,1,2,2,1,2,2,2,1,1,2,3,0,0,1,2,4,1,1,1,3,0,1,0,0,0,0
6,0,8,6,4,2,7,0,3,4,5,0,3,0,4,1,3,0,0,3,6,6,3,1,3,7,1,2,0,1,0,2,1,5,1,2,3,0,1,9,1,0,3,2,2,1,2,2,2,1,1,2,4,0,2,0,2,6,1,1,3,2,0,2,1,0,0,0
7,0,8,6,4,2,9,0,1,0,6,0,3,0,5,1,3,0,0,3,6,9,3,3,3,7,0,2,0,0,0,2,1,5,1,5,4,0,1,8,1,0,0,2,2,1,2,2,2,1,1,2,4,2,0,1,2,3,1,1,3,6,0,1,0,2,0,0
8,0,3,6,4,2,4,0,2,2,8,0,3,0,5,2,3,0,0,3,6,5,2,3,3,5,0,2,0,0,0,2,1,3,1,8,8,0,1,3,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,2,1,1,3,4,0,1,0,0,0,0
9,0,8,1,4,2,4,0,4,0,5,0,3,0,6,1,3,0,0,3,6,3,2,1,3,5,3,2,2,1,0,2,1,4,0,5,4,0,1,7,2,0,2,2,2,2,2,2,1,0,1,2,0,0,0,1,2,4,1,1,3,4,0,1,0,0,0,0


data_MF (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,2,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,1,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,0,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,4,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,4,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,4,0,3,0,4,1,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,2,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,4,1,1,3,5,0,2,2,0,2,0
5,1,3,9,4,2,2,0,2,2,0,0,3,0,4,1,2,0,0,3,2,2,3,2,2,1,4,2,0,1,0,2,2,5,1,1,0,0,1,6,0,0,1,2,2,1,2,2,2,1,1,2,3,0,0,1,2,4,1,1,1,3,0,1,0,0,0,0
6,0,8,6,4,2,7,0,3,4,5,0,3,0,4,1,3,0,0,3,6,6,3,1,3,7,1,2,0,1,0,2,1,5,1,2,3,0,1,9,1,0,3,2,2,1,2,2,2,1,1,2,4,0,0,0,2,6,1,1,3,2,0,2,1,1,0,0
7,0,8,6,4,2,9,0,1,0,6,0,3,0,5,1,3,0,0,3,7,9,3,3,3,2,4,2,0,0,0,2,1,5,1,5,4,0,1,8,1,0,0,2,2,1,2,2,2,1,1,2,4,2,0,1,2,3,1,1,3,6,0,1,0,2,0,0
8,0,3,6,4,2,4,0,2,2,8,0,3,0,5,2,3,0,0,3,6,5,2,3,3,5,0,2,0,0,0,2,1,3,1,6,8,0,1,3,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,2,1,1,3,4,0,1,0,0,0,0
9,0,3,1,4,2,4,0,4,0,5,0,3,0,6,1,3,0,0,3,6,3,2,1,3,5,3,2,2,1,0,2,1,4,0,5,3,0,1,7,2,0,2,2,2,2,2,2,1,1,1,2,0,0,0,1,2,4,1,1,3,4,0,1,0,0,0,0


data_Mode (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,3,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,0,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,0,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0
5,1,3,9,4,2,2,0,2,0,0,0,3,0,4,1,2,0,0,3,2,2,3,2,2,1,0,2,0,1,0,2,2,5,1,1,0,0,1,6,0,0,1,2,2,1,2,2,2,1,1,2,3,0,0,1,2,4,1,1,1,3,0,1,0,0,0,0
6,0,8,6,4,2,7,0,3,4,5,0,3,0,4,1,3,0,0,3,6,6,3,1,3,7,1,2,0,1,0,2,1,5,1,2,3,0,1,9,1,0,3,2,2,1,2,2,2,1,1,2,4,0,0,0,2,6,1,1,3,2,0,2,1,0,0,0
7,0,8,6,4,2,9,0,1,0,6,0,3,0,5,1,3,0,0,3,6,9,3,3,3,3,0,2,0,0,0,2,1,5,1,5,4,0,1,8,1,0,0,2,2,1,2,2,2,1,1,2,4,2,0,1,2,3,1,1,3,6,0,1,0,2,0,0
8,0,3,6,4,2,4,0,2,2,8,0,3,0,5,2,3,0,0,3,6,5,2,3,3,5,0,2,0,0,0,2,1,3,1,5,8,0,1,3,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,2,1,1,3,4,0,1,0,0,0,0
9,0,8,1,4,2,4,0,4,0,5,0,3,0,6,1,3,0,0,3,6,3,2,1,3,5,3,2,2,1,0,2,1,4,0,5,3,0,1,7,2,0,2,2,2,2,2,2,1,1,1,2,0,0,0,1,2,4,1,1,3,4,0,1,0,0,0,0


data_IVEware (200000, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,8,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,2,1,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0
5,1,3,9,4,2,2,0,2,2,0,0,3,0,4,1,2,0,0,3,2,2,3,2,2,1,0,2,0,1,1,2,2,5,1,1,0,0,1,6,0,0,1,2,2,1,2,2,2,1,1,2,3,0,0,1,2,4,1,1,1,3,0,1,0,0,0,0
6,0,8,6,4,2,7,0,3,4,5,0,3,0,4,1,3,0,0,3,6,6,3,1,3,7,1,2,0,1,0,2,1,5,1,2,5,0,1,9,1,0,3,2,2,1,2,2,2,1,1,2,4,0,2,0,2,6,1,1,3,2,0,2,1,0,0,0
7,0,8,6,4,2,9,2,1,0,6,0,3,0,5,1,3,0,2,3,9,9,3,3,3,9,0,2,0,0,0,2,1,5,1,5,4,0,1,9,1,0,0,2,2,1,2,2,2,1,1,2,4,2,1,1,2,3,1,1,3,6,0,1,0,2,0,0
8,0,3,6,4,2,4,0,2,2,8,0,3,0,5,2,3,0,0,3,6,5,2,3,3,5,0,2,1,0,0,2,1,3,1,8,8,0,1,3,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,2,1,1,3,4,0,1,0,1,0,0
9,0,6,1,4,2,4,0,4,0,5,0,3,0,6,1,3,0,0,3,6,3,2,1,3,5,3,2,2,1,0,2,1,4,0,5,4,0,1,7,2,0,2,2,2,2,2,2,1,0,1,2,0,0,0,1,2,4,1,1,3,4,0,1,0,0,0,0


## Compare Three Imputation Methods

In [19]:
def Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random, data_IVEware
):
    print ('Compare_Imputation_Methods_Part_2')
    
    """
    print ('Drop Multicollinear Features')
    Drop = ['MAX_VSEV', 'VE_FORMS', 'VTCONT_F', 'MAX_SEV', 'NUM_INJV']
    DF = [data_Ground_Truth, data_NaN, data_RF, data_Mode, data_Random, data_IVEware]
    
    for df in DF:
        for feature in Drop:
            if feature in df:
                df.drop(columns=[feature], inplace=True)
                print ('Drop ', feature)
    print ()
    """
    
    print ('data_Ground_Truth.shape: ', data_Ground_Truth.shape)
    print ('data_NaN.shape: ', data_NaN.shape)
    print ('data_RF.shape: ', data_RF.shape)
    print ('data_MF.shape: ', data_MF.shape)
    print ('data_Mode.shape: ', data_Mode.shape)
    print ('data_Random.shape: ', data_Random.shape)
    print ('data_IVEware.shape: ', data_IVEware.shape)
    print ()
    
    print ('data_Ground_Truth')
    display(data_Ground_Truth.head())
    print ('data_NaN')
    display(data_NaN.head())
    print ('data_RF')
    display(data_RF.head())
    print ('data_MF')
    display(data_MF.head())
    print ('data_Mode')
    display(data_Mode.head())
    print ('data_Random')
    display(data_Random.head())
    print ('data_IVEware')
    display(data_IVEware.head())
    
    
    
    A = []
    for feature in data_NaN:
        nNaN = data_NaN[feature].isna().sum()
#        print (feature, N)
#        print ()
        D = data_Ground_Truth[feature] != data_RF[feature]
        d = D.sum()
        E = data_Ground_Truth[feature] != data_MF[feature]
        e = E.sum()
        F = data_Ground_Truth[feature] != data_Mode[feature]
        f = F.sum()
        G = data_Ground_Truth[feature] != data_Random[feature]
        g = G.sum()
        H = data_Ground_Truth[feature] != data_IVEware[feature]
        h = H.sum()
        I = data_RF[feature] != data_MF[feature]
        i = I.sum()
        J = data_RF[feature] != data_Mode[feature]
        j = J.sum()
        K = data_RF[feature] != data_Random[feature]
        k = K.sum()
        L = data_RF[feature] != data_IVEware[feature]
        l = L.sum()
        M = data_MF[feature] != data_Mode[feature]
        m = M.sum()
        N = data_MF[feature] != data_Random[feature]
        n = N.sum()
        O = data_MF[feature] != data_IVEware[feature]
        o = O.sum()
        P = data_Mode[feature] != data_Random[feature]
        p = P.sum()
        Q = data_Mode[feature] != data_IVEware[feature]
        q = Q.sum()
        R = data_Random[feature] != data_IVEware[feature]
        r = R.sum()
        print (feature, nNaN, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r)
        print (
            feature, 
            data_Ground_Truth.dtypes[feature],
            data_NaN.dtypes[feature],
            data_RF.dtypes[feature],
            data_MF.dtypes[feature],
            data_Mode.dtypes[feature],
            data_Random.dtypes[feature],
            data_IVEware.dtypes[feature],
        )
        A.append([
            feature, # 0
            nNaN,  # 1
            d, int(d/nNaN*100), # 2, 3
            e, int(e/nNaN*100), # 4, 5
            f, int(f/nNaN*100), # 6, 7
            g, int(g/nNaN*100), # 8, 9
            h, int(h/nNaN*100), # 10, 11
            i, int(i/nNaN*100), # 12, 13
            j, int(j/nNaN*100), # 14, 15
            k, int(k/nNaN*100), # 16, 17
            l, int(l/nNaN*100), # 18, 19
            m, int(m/nNaN*100), # 20, 21
            n, int(n/nNaN*100), # 22, 23
            o, int(o/nNaN*100), # 24, 25
            p, int(p/nNaN*100), # 26, 27
            q, int(q/nNaN*100), # 28, 28
            r, int(r/nNaN*100), # 30, 31
        ])
    print ()
    
    A = sorted(A, key=lambda x:x[3])
    B = pd.DataFrame(
        A, 
        columns=[
            'Feature', # 0
            'nNaN',  # 1
            'nRF Incorrect', 'pRF Incorrect', # 2, 3
            'nMF Incorrect', 'pMF Incorrect', # 4, 5
            'nMode Incorrect', 'pMode Incorrect', # 6, 7
            'nRandom Incorrect', 'pRandom Incorrect', # 8, 9
            'nIVEware Incorrect', 'pIVEware Incorrect', # 10, 11
            'RF and MF Different', 'RF v/s MF %', # 12, 13
            'RF and Mode Different', 'RF v/s Mode %', # 14, 15
            'RF and Random Different', 'RF v/s Random %', # 16, 17
            'RF and IVEware Different', 'RF v/s IVEware %', # 18, 19
            'MF and Mode Different', 'MF v/s Mode %', # 20, 21
            'MF and Random Different', 'MF v/s Random %', # 22, 23
            'MF and IVEware Different', 'MF v/s IVEware %', # 24, 25
            'Mode and Random Different', 'Mode v/s Random %', # 26, 27
            'Mode and IVEware Different', 'Mode v/s IVEware %', #, 28, 29
            'Random and IVEware Different', 'Random v/s IVEware %', # 30, 31
        ]
    )
    display(B)
    a = sum([x[1] for x in A]) # nNaN
    b = sum([x[2] for x in A]) # nRF Incorrect
    c = sum([x[4] for x in A]) # nMF Incorrect
    d = sum([x[6] for x in A]) # nMode INcorrect
    e = sum([x[8] for x in A]) # nRandom Incorrect
    f = sum([x[10] for x in A]) # nIVEware Incorrect
    g = round(b/a*100,2)
    h = round(c/a*100,2)
    i = round(d/a*100,2)
    j = round(e/a*100,2)
    k = round(f/a*100,2)

    RF_less_MF = sum([x[2] < x[4] for x in A])
    RF_equal_MF = sum([x[2] == x[4] for x in A])
    RF_greater_MF = sum([x[2] > x[4] for x in A])

    RF_less_Mode = sum([x[2] < x[6] for x in A])
    RF_equal_Mode = sum([x[2] == x[6] for x in A])
    RF_greater_Mode = sum([x[2] > x[6] for x in A])

    RF_less_Random = sum([x[2] < x[8] for x in A])
    RF_equal_Random = sum([x[2] == x[8] for x in A])
    RF_greater_Random = sum([x[2] > x[8] for x in A])

    RF_less_IVEware = sum([x[2] < x[10] for x in A])
    RF_equal_IVEware = sum([x[2] == x[10] for x in A])
    RF_greater_IVEware = sum([x[2] > x[10] for x in A])

    MF_less_Mode = sum([x[4] < x[6] for x in A])
    MF_equal_Mode = sum([x[4] == x[6] for x in A])
    MF_greater_Mode = sum([x[4] > x[6] for x in A])

    MF_less_Random = sum([x[4] < x[8] for x in A])
    MF_equal_Random = sum([x[4] == x[8] for x in A])
    MF_greater_Random = sum([x[4] > x[8] for x in A])

    MF_less_IVEware = sum([x[4] < x[10] for x in A])
    MF_equal_IVEware = sum([x[4] == x[10] for x in A])
    MF_greater_IVEware = sum([x[4] > x[10] for x in A])

    Mode_less_Random = sum([x[6] < x[8] for x in A])
    Mode_equal_Random = sum([x[6] == x[8] for x in A])
    Mode_greater_Random = sum([x[6] > x[8] for x in A])

    Mode_less_IVEware = sum([x[6] < x[10] for x in A])
    Mode_equal_IVEware = sum([x[6] == x[10] for x in A])
    Mode_greater_IVEware = sum([x[6] > x[10] for x in A])

    Random_less_IVEware = sum([x[8] < x[10] for x in A])
    Random_equal_IVEware = sum([x[8] == x[10] for x in A])
    Random_greater_IVEware = sum([x[8] > x[10] for x in A])

    print ()
    print ('    | | Number | Percentage |')
    print ('    | --- | --- | --- | ')    
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% | ')
    print ('    | RF | ', f'{b:,d}', ' | ', g, '% | ')
    print ('    | MF | ', f'{c:,d}', ' | ', h, '% | ')
    print ('    | Mode | ', f'{d:,d}', ' | ', i, '% | ')
    print ('    | Random | ', f'{e:,d}', ' | ', j, '% | ')
    print ('    | IVEware | ', f'{f:,d}', ' | ', k, '% | ')
    print ()
    print ('    |  | Fewer | Equal | More | Total | ')
    print ('    | --- | --- | --- | --- | --- | ')
    print ('    | Compare RF to MF | ', RF_less_MF, ' | ', RF_equal_MF,  ' | ' ,RF_greater_MF,  ' |', len(A), ' |' )
    print ('    | Compare RF to Mode | ', RF_less_Mode, ' | ', RF_equal_Mode,  ' | ' ,RF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare RF to Random | ', RF_less_Random, ' | ' , RF_equal_Random,  ' | ' , RF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware | ', RF_less_IVEware, ' | ' , RF_equal_IVEware, ' | ' , RF_greater_IVEware, ' |', len(A), ' |' )
    print ('    | Compare MF to Mode | ', MF_less_Mode, ' | ', MF_equal_Mode,  ' | ' ,MF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare MF to Random | ', MF_less_Random, ' | ' , MF_equal_Random,  ' | ' , MF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare MF to IVEware | ', MF_less_IVEware, ' | ' , MF_equal_IVEware, ' | ' , MF_greater_IVEware, ' |', len(A), ' |' )
    print ('    | Compare Mode to Random | ', Mode_less_Random, ' | ' , Mode_equal_Random, ' | ' , Mode_greater_Random, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware | ', Mode_less_IVEware, ' | ' , Mode_equal_IVEware, ' | ' , Mode_greater_IVEware, ' |', len(A), ' |' )
    print ('    | Compare Random to IVEware | ', Random_less_IVEware, ' | ' , Random_equal_IVEware, ' | ' , Random_greater_IVEware, ' |', len(A), ' |' )
    print ()
    
    p = sum([x[12] for x in A])
    q = sum([x[14] for x in A])
    r = sum([x[16] for x in A])
    s = sum([x[18] for x in A])
    t = sum([x[20] for x in A])
    u = sum([x[22] for x in A])
    v = sum([x[24] for x in A])
    w = sum([x[26] for x in A])
    x = sum([x[28] for x in A])
    y = sum([x[30] for x in A])
    f = round(p/a*100,2)
    g = round(q/a*100,2)
    h = round(r/a*100,2)
    i = round(s/a*100,2)
    j = round(t/a*100,2)
    k = round(u/a*100,2)
    l = round(v/a*100,2)
    m = round(w/a*100,2)
    n = round(x/a*100,2)
    o = round(y/a*100,2)
    
    print ('    |  | Number |  Percentage |')
    print ('    | --- | --- | -- |')
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% |' )
    print ('    | RF Different from MF | ', f'{p:,d}', ' | ', f, '% |')
    print ('    | RF Different from Mode | ', f'{q:,d}', ' | ', g, '% |')
    print ('    | RF Different from Random | ', f'{r:,d}', ' | ', h, '% |')
    print ('    | RF Different from IVEware | ', f'{s:,d}', ' | ', i, '% |')
    print ('    | MF Different from Mode | ', f'{t:,d}', ' | ', j, '% |')
    print ('    | MF Different from Random | ', f'{u:,d}', ' | ', k, '% |')
    print ('    | MF Different from IVEware | ', f'{v:,d}', ' | ', l, '% |')
    print ('    | Mode Different from Random | ', f'{w:,d}', ' | ', m, '% |')
    print ('    | Mode Different from IVEware | ', f'{x:,d}', ' | ', n, '% |')
    print ('    | Random Different from IVEware | ', f'{y:,d}', ' | ', o, '% |' )
    print ()
        
#    display(Audio(sound_file, autoplay=True))
    
    


In [20]:
Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random, data_IVEware
)

Compare_Imputation_Methods_Part_2
data_Ground_Truth.shape:  (200000, 67)
data_NaN.shape:  (200000, 67)
data_RF.shape:  (200000, 67)
data_MF.shape:  (200000, 67)
data_Mode.shape:  (200000, 67)
data_Random.shape:  (200000, 67)
data_IVEware.shape:  (200000, 67)

data_Ground_Truth


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,1,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,1,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,8,0,3,0,4,7,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,2,2,0,0,0
4,0,9,6,4,2,2,0,3,2,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,4,1,1,3,5,0,2,2,0,2,0


data_NaN


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0.0,7.0,2.0,,,3.0,0.0,1.0,0.0,6.0,0.0,3.0,0.0,5.0,7.0,3.0,0.0,0.0,1.0,8.0,8.0,3.0,3.0,3.0,4.0,0.0,2.0,2.0,0.0,,2.0,1.0,1.0,0.0,4.0,4.0,0.0,,8.0,1.0,,3.0,2.0,2.0,1.0,,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,3.0,,1.0,0.0,,0.0,0.0
1,0.0,8.0,6.0,1.0,,,0.0,,2.0,6.0,0.0,3.0,0.0,5.0,1.0,2.0,0.0,,3.0,3.0,2.0,3.0,,2.0,2.0,0.0,,0.0,1.0,,2.0,1.0,3.0,1.0,4.0,4.0,0.0,1.0,9.0,1.0,0.0,,2.0,2.0,1.0,2.0,2.0,2.0,0.0,,2.0,4.0,0.0,0.0,1.0,2.0,1.0,1.0,1.0,3.0,4.0,0.0,1.0,0.0,,2.0,0.0
2,0.0,8.0,6.0,1.0,2.0,,0.0,,0.0,5.0,0.0,,,4.0,1.0,,1.0,,3.0,3.0,3.0,3.0,0.0,0.0,3.0,3.0,2.0,2.0,1.0,1.0,2.0,1.0,4.0,1.0,1.0,,0.0,1.0,8.0,1.0,0.0,,2.0,2.0,1.0,,2.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,,2.0,3.0,2.0,2.0,0.0,9.0,0.0,1.0,0.0,,2.0,0.0
3,0.0,8.0,6.0,4.0,2.0,2.0,0.0,4.0,0.0,,0.0,3.0,0.0,4.0,,3.0,0.0,0.0,3.0,6.0,4.0,3.0,3.0,3.0,,0.0,2.0,1.0,0.0,0.0,2.0,1.0,3.0,0.0,6.0,8.0,0.0,,8.0,1.0,,,2.0,2.0,1.0,2.0,2.0,0.0,0.0,1.0,2.0,2.0,0.0,2.0,1.0,2.0,1.0,1.0,1.0,1.0,4.0,,,,0.0,0.0,0.0
4,0.0,9.0,6.0,4.0,,2.0,0.0,3.0,,2.0,0.0,3.0,0.0,5.0,1.0,,0.0,0.0,3.0,2.0,3.0,3.0,3.0,3.0,3.0,0.0,2.0,,0.0,0.0,2.0,1.0,5.0,1.0,0.0,3.0,,4.0,9.0,1.0,0.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,0.0,1.0,,4.0,0.0,2.0,1.0,2.0,,1.0,1.0,,5.0,0.0,2.0,2.0,,2.0,0.0


data_RF


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


data_MF


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,2,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,1,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,0,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,4,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,4,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,4,0,3,0,4,1,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,2,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,4,1,1,3,5,0,2,2,0,2,0


data_Mode


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,3,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,0,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,1,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,0,0,0,0
4,0,9,6,4,2,2,0,3,0,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


data_Random


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,3,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,1,0,0
1,0,8,6,1,2,8,0,4,2,6,0,3,0,5,1,2,0,0,3,3,2,3,3,2,2,0,2,0,1,0,2,1,3,1,4,4,0,1,9,1,0,2,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,1,2,0
2,0,8,6,1,2,6,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,4,0,1,8,1,0,0,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,0,2,0
3,0,8,6,4,2,2,0,4,0,5,0,3,0,4,2,3,0,0,3,6,4,3,3,3,4,0,2,1,0,0,2,1,3,0,6,8,0,2,8,1,0,2,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,1,1,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


data_IVEware


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,HARM_EV,HIT_RUN,HOUR,IMPACT1,INJ_SEV,INT_HWY,J_KNIFE,LGT_COND,MAKE,MAK_MOD,MAN_COLL,MAX_SEV,MAX_VSEV,MODEL,MONTH,M_HARM,NUMOCCS,NUM_INJ,NUM_INJV,PCRASH4,PCRASH5,PERMVIT,PER_TYP,PJ,PSU,PVH_INVL,P_CRASH1,P_CRASH2,REGION,RELJCT1,RELJCT2,REL_ROAD,REST_MIS,REST_USE,ROLINLOC,ROLLOVER,SEAT_POS,SEX,SPEC_USE,SPEEDREL,TOWED,TOW_VEH,TYP_INT,URBANICITY,VALIGN,VEH_AGE,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,7,2,4,2,3,0,1,0,6,0,3,0,5,7,3,0,0,1,8,8,3,3,3,4,0,2,2,0,0,2,1,1,0,4,4,0,4,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,2,2,3,3,0,1,0,0,0,0
1,0,8,6,1,2,2,0,3,2,6,0,3,0,5,1,2,0,0,3,3,2,3,2,2,2,0,2,0,1,1,2,1,3,1,4,4,0,1,9,1,0,1,2,2,1,2,2,2,0,1,2,4,0,0,1,2,1,1,1,3,4,0,1,0,0,2,0
2,0,8,6,1,2,2,0,3,0,5,0,3,0,4,1,3,1,0,3,3,3,3,0,0,3,3,2,2,1,1,2,1,4,1,1,3,0,1,8,1,0,1,2,2,1,2,2,2,0,1,2,0,0,0,1,2,3,2,2,0,9,0,1,0,2,2,0
3,0,8,6,4,2,2,0,4,0,8,0,3,0,4,1,3,0,0,3,6,4,3,3,3,3,0,2,1,0,0,2,1,3,0,6,8,0,1,8,1,0,3,2,2,1,2,2,0,0,1,2,2,0,2,1,2,1,1,1,1,4,0,2,1,0,0,0
4,0,9,6,4,2,2,0,3,4,2,0,3,0,5,1,3,0,0,3,2,3,3,3,3,3,0,2,0,0,0,2,1,5,1,0,3,0,4,9,1,0,3,2,2,1,2,2,2,0,1,2,4,0,2,1,2,1,1,1,3,5,0,2,2,0,2,0


HOSPITAL 27895 3886 4691 4691 7842 2888 1193 1193 5514 2986 2 4689 3995 4687 3997 7370
HOSPITAL int64 float64 Int32 Int32 Int32 Int32 int64
ACC_TYPE 27854 13145 19240 19600 23340 7564 15914 11221 21707 10971 10340 21889 18293 19833 17659 22951
ACC_TYPE int64 float64 Int32 Int32 Int32 Int32 int64
AGE 27817 13555 13656 13555 19481 12952 255 0 13463 933 255 13579 1177 13463 933 13928
AGE int64 float64 Int32 Int32 Int32 Int32 int64
AIR_BAG 27898 6546 6544 6546 11110 6546 2 0 6534 0 2 6536 2 6534 0 6534
AIR_BAG int64 float64 Int32 Int32 Int32 Int32 int64
ALC_STATUS 27786 570 570 570 1126 570 0 0 576 0 0 576 0 576 0 576
ALC_STATUS int64 float64 Int32 Int32 Int32 Int32 int64
BODY_TYP 27965 13190 17039 17088 21152 10674 5217 5204 18437 8549 156 17225 13431 17186 13476 20087
BODY_TYP int64 float64 Int32 Int32 Int32 Int32 int64
CARGO_BT 27884 265 747 747 1433 98 487 487 1188 200 0 721 687 721 687 1377
CARGO_BT int64 float64 Int32 Int32 Int32 Int32 int64
DAY_WEEK 27896 19166 19679 19642 21743 196

VPROFILE 27762 4426 5159 5146 8917 4365 739 720 5614 61 19 5067 800 5054 781 5665
VPROFILE int64 float64 Int32 Int32 Int32 Int32 int64
VSPD_LIM 27860 18858 20788 19838 22844 18505 14145 1799 19974 3866 13324 21387 14748 19676 3825 20681
VSPD_LIM int64 float64 Int32 Int32 Int32 Int32 int64
VSURCOND 27817 4580 5309 5309 8877 2254 729 729 5866 2498 0 5278 3227 5278 3227 7509
VSURCOND int64 float64 Int32 Int32 Int32 Int32 int64
VTCONT_F 27862 2774 8916 8916 12042 1098 6156 6156 11040 2263 0 8830 8419 8830 8419 11876
VTCONT_F int64 float64 Int32 Int32 Int32 Int32 int64
VTRAFCON 27823 6875 8484 8484 12930 3403 2134 2134 9461 6010 0 8410 8144 8410 8144 12384
VTRAFCON int64 float64 Int32 Int32 Int32 Int32 int64
VTRAFWAY 27840 13622 17956 15565 19322 12000 16108 2389 16099 5377 15702 18858 15993 15520 7755 17285
VTRAFWAY int64 float64 Int32 Int32 Int32 Int32 int64
WEATHER 27917 7892 10196 7892 12321 5800 4457 0 7883 4087 4457 10261 7838 7883 4087 10368
WEATHER int64 float64 Int32 Int32 Int32 In

Unnamed: 0,Feature,nNaN,nRF Incorrect,pRF Incorrect,nMF Incorrect,pMF Incorrect,nMode Incorrect,pMode Incorrect,nRandom Incorrect,pRandom Incorrect,nIVEware Incorrect,pIVEware Incorrect,RF and MF Different,RF v/s MF %,RF and Mode Different,RF v/s Mode %,RF and Random Different,RF v/s Random %,RF and IVEware Different,RF v/s IVEware %,MF and Mode Different,MF v/s Mode %,MF and Random Different,MF v/s Random %,MF and IVEware Different,MF v/s IVEware %,Mode and Random Different,Mode v/s Random %,Mode and IVEware Different,Mode v/s IVEware %,Random and IVEware Different,Random v/s IVEware %
0,CARGO_BT,27884,265,0,747,2,747,2,1433,5,98,0,487,1,487,1,1188,4,200,0,0,0,721,2,687,2,721,2,687,2,1377,4
1,EJECTION,27875,205,0,659,2,659,2,1207,4,116,0,455,1,455,1,1009,3,89,0,0,0,575,2,544,1,575,2,544,1,1096,3
2,HIT_RUN,27855,112,0,112,0,112,0,237,0,112,0,0,0,0,0,125,0,0,0,0,0,125,0,0,0,125,0,0,0,125,0
3,SPEC_USE,27821,181,0,181,0,181,0,327,1,158,0,0,0,0,0,146,0,57,0,0,0,146,0,57,0,146,0,57,0,203,0
4,J_KNIFE,27845,281,1,544,1,544,1,1098,3,51,0,267,0,267,0,833,2,266,0,0,0,584,2,533,1,584,2,533,1,1089,3
5,PVH_INVL,27942,398,1,398,1,398,1,818,2,114,0,0,0,0,0,429,1,376,1,0,0,429,1,376,1,429,1,376,1,796,2
6,TOW_VEH,27878,285,1,518,1,518,1,1068,3,51,0,238,0,238,0,800,2,253,0,0,0,566,2,491,1,566,2,491,1,1045,3
7,ALC_STATUS,27786,570,2,570,2,570,2,1126,4,570,2,0,0,0,0,576,2,0,0,0,0,576,2,0,0,576,2,0,0,576,2
8,REL_ROAD,27904,822,2,2695,9,2695,9,4943,17,652,2,2289,8,2289,8,4545,16,487,1,0,0,2667,9,2536,9,2667,9,2536,9,4785,17
9,WRK_ZONE,27855,564,2,605,2,564,2,1110,3,564,2,47,0,0,0,560,2,0,0,47,0,607,2,47,0,560,2,0,0,560,2



    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,866,090  | 100% | 
    | RF |  506,018  |  27.12 % | 
    | MF |  678,186  |  36.34 % | 
    | Mode |  636,443  |  34.11 % | 
    | Random |  836,392  |  44.82 % | 
    | IVEware |  385,685  |  20.67 % | 

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  56  |  9  |  2  | 67  |
    | Compare RF to Mode |  49  |  18  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  4  |  5  |  58  | 67  |
    | Compare MF to Mode |  12  |  24  |  31  | 67  |
    | Compare MF to Random |  64  |  0  |  3  | 67  |
    | Compare MF to IVEware |  1  |  2  |  64  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  0  |  8  |  59  | 67  |
    | Compare Random to IVEware |  0  |  0  |  67  | 67  |

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  1,866,090  | 100%

# Impute using Random Forest and Save for Next Step

In [None]:
def Impute_Using_Random_Forest():
    data = Get_Data()
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_Round_Robin(data)
    data_Imputed.to_csv('../../Big_Files/CRSS_Imputed_by_RF_Data.csv', index=False)
#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))
    return 0

Impute_Using_Random_Forest()

In [None]:
def Impute_Using_IVEware():
    data = Get_Data()
    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data.copy(deep=True)
    print (data_IVEware.shape)
    display(data_IVEware.head(10))
    
    data_IVEware = data_IVEware.replace(99,'')
    display(data_IVEware.head(10))
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    return 0

Impute_Using_IVEware()
# About one hour