In [1]:
%%latex
\tableofcontents

<IPython.core.display.Latex object>

# Ambulance_Dispatch_2024_03_Impute_Missing_Data

# readme
- Most of our other Jupyter Notebooks have a main() function at the bottom that runs everything.  
- This notebook is structured differently, with several functions that run in sequence.  
- The reason for the difference is that part of the work has to be done outside this notebook.  
- The IVEware imputation software is available in several languages, but not Python.  We ran it in R using srclib.  
- This notebook prepares the data for the Mode, Random Forest, and IVEware imputations, and does the first two.  Then the user must separately run the IVEware software.  Finally, this notebook pulls in those results and compares the three methods.  

# Methods

## Goal
- We have about 3% of the values in the dataset missing.  
- CRSS used IVEware to impute missing values in some, but not most, of the features.  
- We can use IVEware to impute the rest of the features, but we should compare to other methods.  

## Dataset
- We have the discretized CRSS dataset in '../../Big_Files/CRSS_Binned_Data.csv'.
- The dataset is 802,700 samples with 67 features.
- In that dataset, each feature has values in {0,1,2,3,4,5,6,7,8,9,99}, with most features having fewer values and 99 signifying "Missing" or "Unknown."
- Overall, about 3% of the values are 99:
    - In the features, thirteen features have no missing values, and six features have more than ten percent missing, the highest being RELJCT1 with 18% missing.  
    - In the rows, 29% have no missing values, 25% one missing value, 16% two, ..., 1% eight, ..., and 0.3% thirteen missing values.  
    - See results of the Analyze_Data() function for full details.  

## Test Method
- Start with the binned dataset with 802,700 samples in 67 features, and analyze the distribution of NaN.
    - In the binned dataset, "Missing" is represented as 99, but for clarity we will call it NaN.  Also, many things we want to do are Pandas functions for NaN, so we often convert the 99 to NaN.  
    - For each feature, find the proportion of samples missing.  For the full list, run Analyze_Data() below.
    - Note that we previously dropped all features with more than 20% of samples missing.

    | Feature | NaN % |
    |---|---|
    | HOSPITAL | 0.00% |
    | ACC_TYPE | 9.82% |
    | AGE | 4.11% |
    | AIR_BAG | 6.47% |
    | ALC_STATUS | 16.49% |
    | BODY_TYP | 2.65% |
    | $\vdots$ | $\vdots$ |
    | VTRAFCON | 8.56% |
    | VTRAFWAY | 15.53% |
    | WEATHER | 3.33% |
    | WRK_ZONE | 0.00% |
    
    - For each sample, find the number of values missing.
    

    | Number of NaN in Sample | % of Dataset |
    |---|---|
    | 0 | 28.94% |
    | 1 | 24.66% |
    | 2 | 16.04% |
    | 3 | 9.92% |
    | 4 | 6.46% |
    | 5 | 4.56% |
    | 6 | 3.18% |
    | 7 | 2.18% |
    | 8 | 1.40% |
    | 9 | 0.89% |
    | 10 | 0.59% |
    | 11 | 0.45% |
    | 12 | 0.38% |
    | 13 | 0.35% |

- Drop all of the samples with any NaN, leaving 28.94% of the dataset, 232,333 samples.
    - The reason to test with a dataset with known values is to know whether the imputation method has imputed the missing values correctly.  
    - Call this dataframe data_Ground_Truth.
- Make the test dataset, data_NaN, by dropping values from data_Ground_Truth in the same proportions as in the original dataset. 
    - Each feature in data_NaN has the same proportion of NaN as that feature has in the binned data.
    - The same proportion of rows have no NaN (28.94%), one NaN (24.66%), ..., thirteen NaN (0.35%).
    - Details below with the Create_data_NaN_Method_3() function.
- For each of the six methods:
    - Impute missing values on data_NaN.
    - By comparing with data_Ground_Truth, count the number of NaN imputed incorrectly.
- For each pair of methods:
    - Compare by feature which method did better.
    - Count the number of differently imputed values, to see which methods give similar results.
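
The scoring and pairwise comparison steps above can be sketched on a toy dataframe as follows. This is a minimal illustration, not the notebook's implementation: the helper names `count_incorrect`, `compare_by_feature`, and `count_different` are hypothetical, and the sketch assumes that every imputed dataframe shares its index and columns with data_Ground_Truth.

```python
import numpy as np
import pandas as pd

def count_incorrect(truth, masked, imputed):
    """Per-feature count of masked cells whose imputed value differs from the truth."""
    was_missing = masked.isna()                # cells the method had to fill in
    wrong = was_missing & (imputed != truth)   # filled in, but not with the true value
    return wrong.sum()                         # Series indexed by feature

def compare_by_feature(errors_a, errors_b):
    """Number of features where method A has fewer, equal, or more errors than B."""
    return {
        'Fewer': int((errors_a < errors_b).sum()),
        'Equal': int((errors_a == errors_b).sum()),
        'More': int((errors_a > errors_b).sum()),
    }

def count_different(masked, imputed_a, imputed_b):
    """Number of masked cells that two methods filled in with different values."""
    return int((masked.isna() & (imputed_a != imputed_b)).sum().sum())

# Toy ground truth with two features, and a copy with three values removed
truth = pd.DataFrame({'A': [0, 1, 1, 2], 'B': [3, 3, 4, 4]})
masked = truth.astype(float)
masked.loc[[0, 2], 'A'] = np.nan
masked.loc[1, 'B'] = np.nan

# Two hypothetical imputations of the same masked data
imputed_1 = masked.copy()
imputed_1.loc[[0, 2], 'A'] = [0.0, 1.0]   # both correct
imputed_1.loc[1, 'B'] = 4.0               # incorrect (truth is 3)

imputed_2 = masked.copy()
imputed_2.loc[[0, 2], 'A'] = [1.0, 1.0]   # first incorrect (truth is 0)
imputed_2.loc[1, 'B'] = 3.0               # correct

errors_1 = count_incorrect(truth, masked, imputed_1)
errors_2 = count_incorrect(truth, masked, imputed_2)
comparison = compare_by_feature(errors_1, errors_2)
n_different = count_different(masked, imputed_1, imputed_2)
print(errors_1.to_dict(), errors_2.to_dict(), comparison, n_different)
```

Restricting every count to cells where `masked.isna()` is true keeps the untouched values, which all methods trivially get right, out of the error totals.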



## Imputation Methods
- We compare here six methods:
    - MissForest, implemented in Python
        -  https://pypi.org/project/MissForest/, version 2.5.5
        - We made significant modifications to optimize memory usage and correct what we think is a mistake in the logic
        - Using scikit-learn's random forest classifier and regressor with the default options
    - Our Round-Robin Random Forest 
        - Our own implementation of the MissForest method
        - We wrote it to help us understand the MissForest method, so we could understand how to get the above Python implementation to work
        - Written just for categorical features
        - Only does one iteration
    - IVEware with random seed 1
        - Using the hyperparameters given in the CRSS Imputation report, 
            - minrsqd 0.01;
            - maxpred 15;
            - "The minimum marginal r-squared required for a predictor to be included in the model is set to 0.01. The maximum number of predictors in a model 15."
            - Footnote #6 on page 8
            - Herbert, G. C. (2019, September). Crash Report Sampling System: Imputation (Report No. DOT HS 812 795). Washington, DC: National Highway Traffic Safety Administration.
    - IVEware with random seed 0
        - "A zero seed will result in no perturbations of the predicted values or the regression coefficients," according to the IVEware documentation.
        - We discovered by accident that a random seed of 0 makes a huge difference, in the case of our dataset a difference for the better.  
        - In Python and R, zero is a perfectly acceptable random seed, and in Python code it is the most common choice.  In SAS, another language in which IVEware can be run, random seeds have to be positive, which may explain why the IVEware authors chose ``seed 0;`` as the switch to turn off perturbations.
    - Imputation by mode
        - We include this as a control.  
        - Both MissForest and our Round-Robin Random Forest start with imputation by mode and are supposed to improve on it.
        - If the results of either MissForest or our Round-Robin Random Forest are not significantly better than those of imputation by mode, something has gone wrong.
    - Random Imputation
        - With the distribution of the values in each feature matching the values in the original dataset
        - We include this as another control.  Anything should be better than this.  
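
The two control methods can be sketched as below. This is a hedged illustration rather than the notebook's code: `impute_mode` and `impute_random` are hypothetical names, and the random control here samples from the observed (non-missing) values of each feature in the dataframe it is given, which is an assumption about how "matching the distribution" is realized.

```python
import numpy as np
import pandas as pd

def impute_mode(data_nan):
    """Fill each feature's NaN with that feature's most common observed value."""
    modes = data_nan.mode(dropna=True).iloc[0]   # first mode of each column
    return data_nan.fillna(modes)

def impute_random(data_nan, rng):
    """Fill each feature's NaN by sampling from that feature's observed values."""
    out = data_nan.copy()
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        if missing.any() and len(observed) > 0:
            # Sampling observed values with replacement reproduces the
            # feature's empirical distribution in expectation.
            out.loc[missing, col] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return out

# Toy data with one missing value per feature
toy = pd.DataFrame({'A': [1, 1, 2, np.nan], 'B': [0, np.nan, 0, 5]})
mode_filled = impute_mode(toy)
random_filled = impute_random(toy, np.random.default_rng(0))
print(mode_filled)
print(random_filled)
```

Because the mode is the single most likely value, imputation by mode should beat the random control on any feature with an uneven distribution, which is why the random control serves as a floor.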



# Results of Comparison of Six Imputation Methods

- We start with the binned (discretized) data, CRSS_Binned_Data.csv, with 802,700 samples in 67 features.
<br><br>
- Dropping any sample with a missing value, we have 232,333 samples of Ground Truth.


<br><br>
- First run with random seed  0 in Python and NumPy, and both 0 and 1 as random seeds for IVEware, 7/3/24
    <br><br>
    - Samples Incorrectly Imputed
    
   

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  484,710  | 100% | 
    | RF |  130,045  |  26.83 % | 
    | MF |  80,995  |  16.71 % | 
    | Mode |  173,102  |  35.71 % | 
    | Random |  226,013  |  46.63 % | 
    | IVEware_seed_0 |  105,212  |  21.71 % | 
    | IVEware_seed_1 |  158,163  |  32.63 % | 

   
    <br><br>
    - Comparison of number of errors in the 67 features.  For instance, comparing our Round-Robin Random Forest method (RF) to MissForest (MF), in 2 features RF had fewer errors than MF, in 13 features the two methods had the same number of errors, and in 52 features RF had more errors than MF.  

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  2  |  13  |  52  | 67  |
    | Compare RF to Mode |  41  |  26  |  0  | 67  |
    | Compare RF to Random |  56  |  8  |  3  | 67  |
    | Compare RF to IVEware_seed_0 |  5  |  19  |  43  | 67  |
    | Compare RF to IVEware_seed_1 |  29  |  15  |  23  | 67  |
    | Compare MF to Mode |  55  |  10  |  2  | 67  |
    | Compare MF to Random |  57  |  10  |  0  | 67  |
    | Compare MF to IVEware_seed_0 |  44  |  17  |  6  | 67  |
    | Compare MF to IVEware_seed_1 |  52  |  15  |  0  | 67  |
    | Compare Mode to Random |  52  |  10  |  5  | 67  |
    | Compare Mode to IVEware_seed_0 |  1  |  19  |  47  | 67  |
    | Compare Mode to IVEware_seed_1 |  19  |  13  |  35  | 67  |
    | Compare Random to IVEware_seed_0 |  2  |  9  |  56  | 67  |
    | Compare Random to IVEware_seed_1 |  6  |  8  |  53  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods
         - Smaller numbers mean the two methods give similar results

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  484,710  | 100% |
    | RF Different from MF |  86,478  |  17.84 % |
    | RF Different from Mode |  68,474  |  14.13 % |
    | RF Different from Random |  188,373  |  38.86 % |
    | RF Different from IVEware_seed_0 |  58,420  |  12.05 % |
    | RF Different from IVEware_seed_1 |  154,275  |  31.83 % |
    | MF Different from Mode |  136,683  |  28.2 % |
    | MF Different from Random |  209,244  |  43.17 % |
    | MF Different from IVEware_seed_0 |  63,811  |  13.16 % |
    | MF Different from IVEware_seed_1 |  139,804  |  28.84 % |
    | Mode Different from Random |  173,284  |  35.75 % |
    | Mode Different from IVEware_seed_0 |  113,233  |  23.36 % |
    | Mode Different from IVEware_seed_1 |  191,622  |  39.53 % |
    | Random Different from IVEware_seed_0 |  203,925  |  42.07 % |
    | Random Different from IVEware_seed_1 |  232,558  |  47.98 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  128,807  |  26.57 % |



<br><br>
- Second Run, Same random seed (0) to make sure the random seed is implemented correctly.  Same results. 

    <br><br>
     - Percentage of Samples Incorrectly Imputed
     
    | | Number | NaN Imputed Incorrectly |
    | --- | --- | --- | 
    | Total NaN |  484,710  | 100% | 
    | RF |  130,045  |  26.83 % | 
    | MF |  80,995  |  16.71 % | 
    | Mode |  173,102  |  35.71 % | 
    | Random |  226,013  |  46.63 % | 
    | IVEware_seed_0 |  105,212  |  21.71 % | 
    | IVEware_seed_1 |  158,163  |  32.63 % | 


    <br><br>
     - Percentage of Samples Correctly Imputed
     
    | | Number Imputed Incorrectly | NaN Imputed Correctly |
    | --- | --- | --- | 
    | Total NaN |  484,710  |  | 
    | RF |  130,045  |  73.17 % | 
    | MF |  80,995  |  83.29 % | 
    | Mode |  173,102  |  64.29 % | 
    | Random |  226,013  |  53.37 % | 
    | IVEware_seed_0 |  105,212  |  78.29 % | 
    | IVEware_seed_1 |  158,163  |  67.37 % | 


    <br><br>
     - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  2  |  13  |  52  | 67  |
    | Compare RF to Mode |  41  |  26  |  0  | 67  |
    | Compare RF to Random |  56  |  8  |  3  | 67  |
    | Compare RF to IVEware_seed_0 |  5  |  19  |  43  | 67  |
    | Compare RF to IVEware_seed_1 |  29  |  15  |  23  | 67  |
    | Compare MF to Mode |  55  |  10  |  2  | 67  |
    | Compare MF to Random |  57  |  10  |  0  | 67  |
    | Compare MF to IVEware_seed_0 |  44  |  17  |  6  | 67  |
    | Compare MF to IVEware_seed_1 |  52  |  15  |  0  | 67  |
    | Compare Mode to Random |  52  |  10  |  5  | 67  |
    | Compare Mode to IVEware_seed_0 |  1  |  19  |  47  | 67  |
    | Compare Mode to IVEware_seed_1 |  19  |  13  |  35  | 67  |
    | Compare Random to IVEware_seed_0 |  2  |  9  |  56  | 67  |
    | Compare Random to IVEware_seed_1 |  6  |  8  |  53  | 67  |


    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  484,710  | 100% |
    | RF Different from MF |  86,478  |  17.84 % |
    | RF Different from Mode |  68,474  |  14.13 % |
    | RF Different from Random |  188,373  |  38.86 % |
    | RF Different from IVEware_seed_0 |  58,420  |  12.05 % |
    | RF Different from IVEware_seed_1 |  154,275  |  31.83 % |
    | MF Different from Mode |  136,683  |  28.2 % |
    | MF Different from Random |  209,244  |  43.17 % |
    | MF Different from IVEware_seed_0 |  63,811  |  13.16 % |
    | MF Different from IVEware_seed_1 |  139,804  |  28.84 % |
    | Mode Different from Random |  173,284  |  35.75 % |
    | Mode Different from IVEware_seed_0 |  113,233  |  23.36 % |
    | Mode Different from IVEware_seed_1 |  191,622  |  39.53 % |
    | Random Different from IVEware_seed_0 |  203,925  |  42.07 % |
    | Random Different from IVEware_seed_1 |  232,558  |  47.98 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  128,807  |  26.57 % |

<br><br>
- Third run, with random seed 42 in Python and NumPy, with both 0 and 1 as random seeds for IVEware.
    - Note that the IVEware results are different in this run than in the previous runs, even though we used the same random seeds in IVEware.  The reason for the change is the different seed for Python and NumPy, which changed which values were missing in the dataset that we fed into IVEware.  

    <br><br>
    - Samples Incorrectly Imputed by Method
    
    | | Number | NaN Imputed Incorrectly |
    | --- | --- | --- | 
    | Total NaN |  484,712  | 100% | 
    | RF |  131,564  |  27.14 % | 
    | MF |  80,422  |  16.59 % | 
    | Mode |  173,078  |  35.71 % | 
    | Random |  225,585  |  46.54 % | 
    | IVEware_seed_0 |  104,722  |  21.6 % | 
    | IVEware_seed_1 |  141,782  |  29.25 % | 



    <br><br>
    - Percentage of Samples Correctly Imputed

    | | Number Imputed Incorrectly | NaN Imputed Correctly |
    | --- | --- | --- | 
    | Total NaN |  484,712  |  | 
    | RF |  131,564  |  72.86 % | 
    | MF |  80,422  |  83.41 % | 
    | Mode |  173,078  |  64.29 % | 
    | Random |  225,585  |  53.46 % | 
    | IVEware_seed_0 |  104,722  |  78.4 % | 
    | IVEware_seed_1 |  141,782  |  70.75 % | 

    <br><br>
    - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  1  |  14  |  52  | 67  |
    | Compare RF to Mode |  41  |  25  |  1  | 67  |
    | Compare RF to Random |  56  |  10  |  1  | 67  |
    | Compare RF to IVEware_seed_0 |  2  |  21  |  44  | 67  |
    | Compare RF to IVEware_seed_1 |  28  |  12  |  27  | 67  |
    | Compare MF to Mode |  55  |  11  |  1  | 67  |
    | Compare MF to Random |  57  |  8  |  2  | 67  |
    | Compare MF to IVEware_seed_0 |  43  |  19  |  5  | 67  |
    | Compare MF to IVEware_seed_1 |  53  |  14  |  0  | 67  |
    | Compare Mode to Random |  54  |  12  |  1  | 67  |
    | Compare Mode to IVEware_seed_0 |  1  |  18  |  48  | 67  |
    | Compare Mode to IVEware_seed_1 |  21  |  8  |  38  | 67  |
    | Compare Random to IVEware_seed_0 |  1  |  10  |  56  | 67  |
    | Compare Random to IVEware_seed_1 |  5  |  10  |  52  | 67  |

    <br><br>
    - Number of NaN Imputed Differently by Pairs of Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  484,712  | 100% |
    | RF Different from MF |  88,045  |  18.16 % |
    | RF Different from Mode |  66,204  |  13.66 % |
    | RF Different from Random |  187,311  |  38.64 % |
    | RF Different from IVEware_seed_0 |  61,250  |  12.64 % |
    | RF Different from IVEware_seed_1 |  137,491  |  28.37 % |
    | MF Different from Mode |  137,294  |  28.32 % |
    | MF Different from Random |  209,192  |  43.16 % |
    | MF Different from IVEware_seed_0 |  63,481  |  13.1 % |
    | MF Different from IVEware_seed_1 |  121,428  |  25.05 % |
    | Mode Different from Random |  173,048  |  35.7 % |
    | Mode Different from IVEware_seed_0 |  113,640  |  23.44 % |
    | Mode Different from IVEware_seed_1 |  173,642  |  35.82 % |
    | Random Different from IVEware_seed_0 |  203,859  |  42.06 % |
    | Random Different from IVEware_seed_1 |  226,543  |  46.74 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  110,441  |  22.78 % |






## Drop Multicollinear Features before Imputing?  Compare two methods
- First Method
    - After Binning, reduce dimensionality
        - Removes MAX_VSEV, VE_FORMS, VTCONT_F, MAX_SEV, NUM_INJV
        - Reduces from 67 to 62 features
    - Impute
- Second Method
    - Impute with all 67 features
    - Before evaluating the imputation, remove the five features and only evaluate the results on the 62 features used in the comparison above
- We used random seed 42 for both methods
- First Method Results

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,463  | 100% | 
    | RF |  569,509  |  28.37 % | 
    | Mode |  681,753  |  33.96 % | 
    | Random |  889,794  |  44.32 % | 
    | IVEware |  606,632  |  30.22 % | 
    
- Second Method Results


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,235  | 100% | 
    | RF |  558,936  |  27.85 % | 
    | Mode |  681,996  |  33.98 % | 
    | Random |  888,845  |  44.28 % | 
    | IVEware |  606,062  |  30.19 % | 


### Analysis
- Mode was essentially the same (33.96% vs. 33.98%), as it should be.
- Random was slightly different, perhaps because the features were in a different order?
- IVEware was not significantly different in the two methods.
- Random Forest was slightly but significantly better (0.52 percentage points) with the second method, not removing the multicollinear features before imputing, which is surprising.  
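
The Random Forest difference can be checked from the counts in the two tables above. The sketch below recomputes the two error rates and, as one possible way to gauge whether the gap is larger than chance, a pooled two-proportion z statistic. The z-test is an assumption of this sketch, not something the analysis above states it used; it treats each imputed value as an independent trial and the two runs as independent samples, neither of which is guaranteed here.

```python
import math

def pct(errors, total):
    """Error rate as a percentage."""
    return 100.0 * errors / total

def two_proportion_z(err_1, n_1, err_2, n_2):
    """Pooled two-proportion z statistic for the difference in error rates."""
    p_1, p_2 = err_1 / n_1, err_2 / n_2
    pooled = (err_1 + err_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    return (p_1 - p_2) / se

# Random Forest counts from the First Method and Second Method tables
rf_first = pct(569_509, 2_007_463)
rf_second = pct(558_936, 2_007_235)
diff = rf_first - rf_second
z = two_proportion_z(569_509, 2_007_463, 558_936, 2_007_235)
print(round(rf_first, 2), round(rf_second, 2), round(diff, 2), round(z, 1))
```

With roughly two million imputed values in each run, even a gap of about half a percentage point yields a large z statistic under these assumptions, so a second run with a different seed is still a useful check on whether the direction of the difference is stable.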

### Conclusion
- Run again with different random seed = 1

### Second Round Results
- First Method

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,643  | 100% | 
    | RF |  568,909  |  28.35 % | 
    | Mode |  681,061  |  33.94 % | 
    | Random |  889,048  |  44.31 % | 
    | IVEware |  592,233  |  29.51 % | 


- Second Method


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,005,955  | 100% | 
    | RF |  558,742  |  27.85 % | 
    | Mode |  680,715  |  33.93 % | 
    | Random |  887,944  |  44.27 % | 
    | IVEware |  564,254  |  28.13 % | 
    
### Analysis

- Again, the second method, leaving in multicollinear features, is better for both Random Forest and IVEware

### Conclusion
- When we do the big runs, try it both ways, and see which one builds the best models.

## Discussion -- REDO

- Random imputation is clearly worse than Mode and RF on every feature.
- Random is overall worse than IVEware, but on one of our runs there are five features on which Random is better than IVEware.
- Random Forest is as good as or better than Mode on every feature, which is not surprising, as RF starts at Mode and improves on it.  
- Random Forest is as good as or better than IVEware on more than half of the features, but not overwhelmingly, and slightly better in the count of missing samples correctly imputed.
- IVEware and Mode are comparable in the number of features, but IVEware is much better in the count of missing samples correctly imputed.
- Random Forest and Mode make the same mistakes.  
- IVEware makes different mistakes from Random Forest and Mode.

## Conclusion

- Try using both MissForest and IVEware with random seed zero.

## Opportunities for Future Research
(or, "Things we didn't do")


- Is it okay to use one imputation method for some features and another method for other features?

# Setup
## Import Libraries

In [2]:
import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)


import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import sklearn
print ('SciKit-Learn version: {}'.format(sklearn.__version__))
from sklearn.model_selection import train_test_split

import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import missforest.missforest_Brad_1
from missforest.missforest_Brad_1 import MissForest 
print ('MissForest version:  {}'.format(missforest.missforest_Brad_1.__version__))

# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
import random
random_seed = 1
print ('Random seed for Python and NumPy set to ', random_seed)
np.random.seed(random_seed) # NumPy
random.seed(random_seed) # Python
#tf.set_random_seed(random_seed) # Tensorflow
#filename_prefix = '_0_0'
#filename_prefix = '_0_1'
#filename_prefix = '_1_0'
filename_prefix = '_1_1'
print ('filename_prefix = ', filename_prefix)

from IPython.display import Audio
sound_file = './beep.wav'

import warnings
warnings.filterwarnings('ignore')

print ('Finished Importing Libraries')
print ()

comment = """
Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
NumPy version: 1.26.4
Pandas version:  2.2.2
SciKit-Learn version: 1.5.0
MissForest version:  2.5.5
"""

Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
NumPy version: 1.26.4
Pandas version:  2.2.2
SciKit-Learn version: 1.5.0
MissForest version:  2.5.5_Brad_1_Stuff
Random seed for Python and NumPy set to  1
filename_prefix =  _1_1
Finished Importing Libraries



## Get Data

This notebook pulls in the saved output of Ambulance_Dispatch_2024_02_Binning.

In [3]:
def Get_Data():
    print ('Get_Data')
#    data = pd.read_csv('../../Big_Files/CRSS_02_Binned_Data.csv', low_memory=False)

    # The first '_0' or '_1' is for the random seed for Python and Numpy
    # The second '_0' (or '_1') is for the dimensionality not being reduced (being reduced) before imputing
    filename = '../../Big_Files/CRSS_02' + filename_prefix + '.csv'
    print (filename)
    print ()
    data = pd.read_csv(filename, low_memory=False) 
    print ('data.shape = ', data.shape)
    print ()

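    # The imputed columns ('..._IM') should already have been dropped in the Binning stage;
    # this loop is a safeguard in case any remain.
    print ('Drop Imputed Columns')
    for feature in list(data.columns):  # iterate over a copy, since we drop in place
        if '_IM' in feature:
            print (feature)
            data.drop(columns=feature, inplace=True)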
 

    # Method for dropping from 67 to 40 features 
    # to test whether it was just this particular mix of features 
    # that made the IVEware behave strangely well with random seed of zero.
#    print ('data.shape = ', data.shape)
#    data = data.sample(n=40, axis='columns')
    
    print ('data.shape = ', data.shape)
    print ()
    
    print ('Total number of NaN')
    print (data.replace({99:np.nan}).isnull().sum().sum())
    print ()
    
#    print ("Remaining Features:")
#    Features = sorted(list(data.columns))
#    for feature in Features:
#        print ("    ",feature)
    print ('Finished Get_Data()')
    print ()
    
    return data

In [None]:
data = Get_Data()


In [None]:
def Analyze_Data():
    data = Get_Data()
    print ('Analyze_Data')
    data = data.replace({99:np.nan})
    
    print ('Total NaN')
    s = data.isna().sum().sum()
    rows = data.shape[0]
    cols = data.shape[1]
    print (s, " ", round(s/(rows*cols)*100,2))
    print ()

    print ('Percentage of NaN in each feature')
    Feature_NaN_Counts = []
    for feature in data:
        s = data[feature].isna().sum()
        n = len(data)
#        print (feature, s, round(s/n*100,2))
        Feature_NaN_Counts.append([feature, round(s/n*100,6)])
    for row in Feature_NaN_Counts:
        print (row)
    print ()
    print ('Distribution of number of NaN in each sample')
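    A = data.isna().sum(axis=1)
    # value_counts() sorts by frequency, so sort by number of NaN and fill any gaps with 0;
    # position i in the returned list is then the proportion of samples with i NaN.
    Row_NaN_Counts = A.value_counts(normalize=True).sort_index()
    Row_NaN_Counts = Row_NaN_Counts.reindex(range(int(A.max()) + 1), fill_value=0.0)
    display(Row_NaN_Counts)
    Row_NaN_Counts = Row_NaN_Counts.to_list()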
    
    print ('Finished Analyze_Data()')
    print ()
    
    return Feature_NaN_Counts, Row_NaN_Counts
    
def Run_Analyze_Data():
    Feature_NaN_Counts, Row_NaN_Counts = Analyze_Data()
    
    print ('| Feature | NaN % |')
    print ('|---|---|')
    for row in Feature_NaN_Counts:
        print ('| %s | %.2f%% |' % (row[0], row[1]))
    print ()
    
    
    print ('| Number of NaN in Sample | % of Dataset |')
    print ('|---|---|')
    for i, row in enumerate(Row_NaN_Counts):
        print ('| %d | %.2f%% |' % (i, round(row*100,2)))
    print ()
    
    
#Run_Analyze_Data()  
    

# Tools

## Create_data_NaN_Method_3()
- This function makes a dataset for comparing methods for imputing missing values.
- It takes data_Ground_Truth and creates data_NaN.
    - data_Ground_Truth is the 232,333 samples with no missing values of the 802,700 samples with binned values created in the previous notebook, with 232,333 being 28.94% of 802,700.
    - For convenience, we will call the 802,700 samples in 67 features the "original dataset" because it is the origin of this notebook, even though it is not the original CRSS dataset that had more samples and features.
    - data_NaN will be data_Ground_Truth with sample values removed, and we will compare the imputation methods by how well they impute the data_NaN missing values by comparing the imputed values with the ground truth.  
- data_NaN will have missing values in the same patterns as the original dataset of 802,700 samples:
    - In each feature of data_NaN, the same proportion of missing values as in the same feature in the original dataset.
    
    | Feature | NaN % |
    |---|---|
    | HOSPITAL | 0.00% |
    | ACC_TYPE | 9.82% |
    | AGE | 4.11% |
    | AIR_BAG | 6.47% |
    | ALC_STATUS | 16.49% |
    | BODY_TYP | 2.65% |
    | CARGO_BT | 2.20% |
    | DAY_WEEK | 0.00% |
    | DEFORMED | 16.19% |
    | DR_ZIP | 5.85% |
    | EJECTION | 4.23% |
    | HARM_EV | 0.03% |
    | HIT_RUN | 0.00% |
    | HOUR | 0.22% |
    | IMPACT1 | 1.14% |
    | INJ_SEV | 1.28% |
    | INT_HWY | 0.01% |
    | J_KNIFE | 0.00% |
    | LGT_COND | 0.28% |
    | MAKE | 1.50% |
    | MAK_MOD | 0.22% |
    | MAN_COLL | 0.31% |
    | MAX_SEV | 0.24% |
    | MAX_VSEV | 1.07% |
    | MODEL | 2.29% |
    | MONTH | 0.00% |
    | M_HARM | 0.04% |
    | NUMOCCS | 2.04% |
    | NUM_INJ | 0.24% |
    | NUM_INJV | 1.07% |
    | PCRASH4 | 3.81% |
    | PCRASH5 | 0.28% |
    | PERMVIT | 0.00% |
    | PER_TYP | 0.02% |
    | PJ | 0.00% |
    | PSU | 0.00% |
    | PVH_INVL | 0.00% |
    | P_CRASH1 | 1.03% |
    | P_CRASH2 | 1.74% |
    | REGION | 0.00% |
    | RELJCT1 | 18.50% |
    | RELJCT2 | 5.02% |
    | REL_ROAD | 0.02% |
    | REST_MIS | 0.00% |
    | REST_USE | 6.70% |
    | ROLINLOC | 0.05% |
    | ROLLOVER | 0.00% |
    | SEAT_POS | 1.47% |
    | SEX | 2.24% |
    | SPEC_USE | 0.89% |
    | SPEEDREL | 1.33% |
    | TOWED | 4.97% |
    | TOW_VEH | 0.02% |
    | TYP_INT | 8.80% |
    | URBANICITY | 0.00% |
    | VALIGN | 5.14% |
    | VEH_AGE | 1.21% |
    | VE_FORMS | 0.00% |
    | VE_TOTAL | 0.00% |
    | VPROFILE | 13.20% |
    | VSPD_LIM | 12.60% |
    | VSURCOND | 3.95% |
    | VTCONT_F | 8.24% |
    | VTRAFCON | 8.56% |
    | VTRAFWAY | 15.53% |
    | WEATHER | 3.33% |
    | WRK_ZONE | 0.00% |

    - The samples of data_NaN will have the same distribution of the number of missing values per sample.  In the original dataset, 28.94% of the samples had no missing values, so in data_NaN, 28.94% will have no missing values.  Similarly, in the original dataset 24.66% had one missing value, 16% two, 10% three, ..., 0.3487% thirteen missing values, so data_NaN will have the same distribution.  
    
| Number of NaN in Sample | % of Dataset |
|---|---|
| 0 | 28.94% |
| 1 | 24.66% |
| 2 | 16.04% |
| 3 | 9.92% |
| 4 | 6.46% |
| 5 | 4.56% |
| 6 | 3.18% |
| 7 | 2.18% |
| 8 | 1.40% |
| 9 | 0.89% |
| 10 | 0.59% |
| 11 | 0.45% |
| 12 | 0.38% |
| 13 | 0.35% |

- For clarity of explanation, we will use these numbers to describe the method, though none of the numbers are hard coded in.

- Start by creating rows with the same distribution of number of missing values per sample as in the original dataset.  
    - Make an empty list, data_NaN_list.
    - Append $232,333 \times 28.94\%$ rows with 67 zeros.
    - Append $232,333 \times 24.66\%$ rows with 1 one and 66 zeros, each row randomly shuffled.
    - Append $232,333 \times 16.04\%$ rows with 2 ones and 65 zeros, each row randomly shuffled.
    - ...
    - Append $232,333 \times 0.35\%$ rows with 13 ones and 54 zeros, each row randomly shuffled.
    - Due to rounding errors, the total number of rows may be more or less than 232,333.
        - Pad each number of rows a bit so we have more than 232,333 rows
        - Sample the result down to 232,333 rows.
    - data_NaN_list now has (approximately) row sums in the same distribution as in the original data.
- Because Pandas has appropriate tools, now change data_NaN_list to a Pandas dataframe, data_NaN.  
- Use a greedy algorithm to modify data_NaN to get each feature to have the same proportion of NaN as in the original data.  In each iteration:
    - Calculate the percentage of NaN in the feature, and the difference between the current percentage and the goal.  Sort by the difference.
    
    | Feature | % NaN Needed | Current % NaN | Difference |
    |---|---|---|---|
    |'J_KNIFE' | 0.0 | 3.165284 | -3.165284 |
    |'PERMVIT' | 0.0 | 3.159689 | -3.159689 |
    |'PVH_INVL' | 0.0 | 3.154093 | -3.154093 |
    |'PJ' | 0.0 | 3.151941 | -3.151941 |
    | ... | | | |
    |'VTRAFWAY' | 15.526349 | 3.042616 | 12.483733 |
    |'DEFORMED' | 16.189485 | 3.020234 | 13.169251 |
    |'ALC_STATUS' | 16.487106 | 3.083505 | 13.403601 |
    |'RELJCT1' | 18.504547 | 3.105026 | 15.399521 |
    
    - The first feature in the list has the most NaN to give, and the last feature in the list has the most to take.  Call those "give_feature" and "take_feature".
        - In the example above, J_KNIFE is the give_feature and RELJCT1 is the take_feature
    - Count how many 1's the give_feature needs to get rid of; call it "nGive".
        - nGive = abs(-3.165284)% * 232,333 = 7,354
        - J_KNIFE needs 7,354 fewer 1's
    - Count how many 1's the take_feature needs to get; call it "nTake".
        - nTake = 15.399521% * 232,333 = 35,778
        - RELJCT1 needs 35,778 1's
    - Filter the dataset to samples that have 1 in give_feature and 0 in take_feature; call it "Swap".
    - The number of samples in Swap, Swap.shape[0], is the number of samples available to swap between the two features.
        - Swap.shape[0] = 6,939
        - There are 6,939 samples where J_KNIFE == 1 and RELJCT1 == 0
    - Take the minimum of nGive, nTake, and Swap.shape[0]; call it nSample.
        - nSample = 6,939
    - If nSample==0, then go up the list to get a new take_feature and repeat the process.
    - Sample Swap down to nSample rows.
        - In this first round, nSample is the length of Swap, so sampling just shuffles the rows of Swap.
    - In the samples in the (possibly shortened) Swap, change the 1's in give_feature to 0 and the 0's in take_feature to 1.  
        - J_KNIFE now only needs to get rid of 7,354 - 6,939 = 415 more 1's.
        - RELJCT1 now only needs 35,778 - 6,939 = 28,839 more 1's.
        - In this process we do not change the number of 0's and 1's in any sample, so the distribution of number of 1's in rows is preserved.
    - Repeat until we're within epsilon of our goal.  
        - Maximum percentage error in NaN_Counts:   0.001291%
- The stopping mechanism we finally decided on was if nSample==0 twenty times for one give_feature, indicating that we can't find things left for us to swap and we're within rounding error of the number of NaN we need in each feature.  
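
A single swap step of the greedy algorithm described above can be sketched on a toy 0/1 mask as follows. This is an illustration under stated assumptions, not the function defined below: `swap_once` is a hypothetical helper, the toy mask has only three features, and the dataframe index is assumed to be unique. The last lines also recheck the worked nGive and nTake numbers from the example.

```python
import numpy as np
import pandas as pd

def swap_once(mask_df, give_feature, take_feature, n_give, n_take, rng):
    """Move up to min(n_give, n_take) ones from give_feature to take_feature.

    Only rows with give_feature == 1 and take_feature == 0 are eligible, so
    every row keeps the same number of ones. Returns the number of rows swapped.
    """
    eligible = mask_df.index[(mask_df[give_feature] == 1) & (mask_df[take_feature] == 0)]
    n_sample = min(n_give, n_take, len(eligible))
    if n_sample == 0:
        return 0
    chosen = rng.choice(eligible.to_numpy(), size=n_sample, replace=False)
    mask_df.loc[chosen, give_feature] = 0
    mask_df.loc[chosen, take_feature] = 1
    return n_sample

# Toy mask: 1 marks a value that will become NaN
mask_df = pd.DataFrame({
    'J_KNIFE': [1, 1, 1, 0, 0, 0],
    'RELJCT1': [0, 0, 1, 0, 1, 0],
    'AGE':     [0, 1, 0, 1, 0, 0],
})
row_sums_before = mask_df.sum(axis=1).copy()
rng = np.random.default_rng(0)

# J_KNIFE has three ones to give away and RELJCT1 wants five more,
# but only two rows have J_KNIFE == 1 and RELJCT1 == 0.
n_swapped = swap_once(mask_df, 'J_KNIFE', 'RELJCT1', n_give=3, n_take=5, rng=rng)
print(n_swapped)
print(mask_df)

# Worked numbers from the description: percentage difference times number of rows
n_give_example = int(round(3.165284 / 100 * 232_333))
n_take_example = int(round(15.399521 / 100 * 232_333))
print(n_give_example, n_take_example)
```

Because each swap moves a 1 within a row rather than adding or removing one, the per-row distribution built in the first stage is preserved while the per-feature proportions move toward their targets.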


In [None]:
def Create_data_NaN_Method_3(data_Ground_Truth):
    print ('Create_data_NaN_Method_3()')
    nRows = data_Ground_Truth.shape[0]
    nCols = data_Ground_Truth.shape[1]
    print ('nRows = ', nRows, ' nCols = ', nCols)
    
    Feature_NaN_Counts, Row_NaN_Counts = Analyze_Data()
    data_NaN_list = []
    for i in range (len(Row_NaN_Counts)):
        Ones = i
        Zeros = nCols - i
        Number = int(Row_NaN_Counts[i] * nRows + 1)
        for j in range (Number):
            New_row = [1]*Ones + [0]*Zeros
            random.shuffle(New_row)
            data_NaN_list.append(New_row.copy())
    
    data_NaN = pd.DataFrame(data_NaN_list, columns = data_Ground_Truth.columns)
#    print (data_NaN.shape)
    data_NaN = data_NaN.sample(n = nRows)
#    print (data_NaN.shape)
#    display(data_NaN.head(20))
#    display(data_NaN.tail(20))
    
#    print ('Distribution of number of NaN in each sample')
    A = data_NaN.sum(axis=1)
    Row_NaN_Counts = A.value_counts(normalize=True)
#    display(Row_NaN_Counts)
#    Row_NaN_Counts = Row_NaN_Counts.to_list()

    Feature_NaN_Counts = [[x[0], x[1], 0, 0] for x in Feature_NaN_Counts]
    Feature_NaN_Counts = Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN)
    print ('Feature_NaN_Counts')
    for row in Feature_NaN_Counts:
        print (row)
    print ()
    
    old_give = ''
    old_take = ''
    
    stop = False
    while not stop:
#        if Feature_NaN_Counts[0][3] > -0.001 or Feature_NaN_Counts[-1][3] < 0.001:
#            stop = True

        give_feature = Feature_NaN_Counts[0][0]
        take_i = -1
        take_feature = Feature_NaN_Counts[take_i][0]
        nGive = int(round(-1/100 * Feature_NaN_Counts[0][3] * nRows,0))
        nTake = int(round(1/100 * Feature_NaN_Counts[take_i][3] * nRows,0))
        
        mask = ((data_NaN[give_feature]==1) & (data_NaN[take_feature] == 0))
        Swap = data_NaN[mask]
        nSample = min([nGive, nTake, Swap.shape[0]])
        print ('give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample')
        print (give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample)
        while nSample==0:
            take_i = take_i - 1
            take_feature = Feature_NaN_Counts[take_i][0]
            nGive = int(round(-1/100 * Feature_NaN_Counts[0][3] * nRows,0))
            nTake = int(round(1/100 * Feature_NaN_Counts[take_i][3] * nRows,0))
            mask = ((data_NaN[give_feature]==1) & (data_NaN[take_feature] == 0))
            Swap = data_NaN[mask]
            nSample = min([nGive, nTake, Swap.shape[0]])
            print ('give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample')
            print (give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample)
            print ()
            if nSample==0 and take_i < -20:
                stop = True
                break
        
#        if nSample == 0:
#            stop = True
        Swap = Swap.sample(n=nSample)
#        display(Swap[[give_feature, take_feature]])
        # Flip all sampled rows at once instead of looping row by row
        swap_index = Swap.index
        data_NaN.loc[swap_index, give_feature] = 0
        data_NaN.loc[swap_index, take_feature] = 1
            
#        print ()
#        data_NaN[give_feature],data_NaN[take_feature]=np.where(mask,(data_NaN[take_feature],data_NaN[give_feature]),(data_NaN[give_feature],data_NaN[take_feature]))
        Feature_NaN_Counts = Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN)

    print ('Feature_NaN_Counts')
    for row in Feature_NaN_Counts:
        print (row)
    print ()
    print ('Maximum percentage error in NaN_Counts:  ', max([abs(x[3]) for x in Feature_NaN_Counts]))

    data_NaN = data_NaN.sample(frac=1)
#    print ('data_NaN')
#    display(data_NaN.head(10))
#    print ('data_NaN reindexed')
    data_NaN.reset_index(inplace=True, drop=True)
 #   display(data_NaN.head(10))
 #   print ('data_Ground_Truth')
 #   display(data_Ground_Truth.head(10))
 #   print ('data_NaN')
 #   display(data_NaN.head(10))
    data_NaN = data_Ground_Truth.where(data_NaN==0)
#    print ('data_NaN')
#    display(data_NaN.head(20))
#    display(data_NaN.tail(20))
    
    
    display(data_NaN.isna().sum())
    print ('data_NaN.isna().sum().sum() = ', data_NaN.isna().sum().sum())

    print ('Finished Create_data_NaN_Method_3()')
    print ()
    
    return data_NaN
    

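def Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN):
    # Each row is [feature, target % NaN, current % NaN, target - current].
    # Returns the list sorted by (target - current), so the feature with the
    #     largest surplus of 1's comes first and the largest deficit comes last.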
    for row in Feature_NaN_Counts:
        feature = row[0]
        s = data_NaN[feature].sum()
        n = len(data_NaN)
#        print (feature, s, round(s/n*100,2))
        row[2] = round(s/n*100,6)
        row[3] = round(row[1] - row[2], 6)
    Feature_NaN_Counts = sorted(Feature_NaN_Counts, key=lambda x:x[3])
    
    return Feature_NaN_Counts

    
def Test_Create_data_NaN_Method_3():
    print ('Test_Create_data_NaN_Method_3()')
    data = Get_Data()
    print (data.shape)
    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
#    data_Ground_Truth.astype('Int64')

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head(10))
    Create_data_NaN_Method_3(data_Ground_Truth)
    print ('Finished Test_Create_data_NaN_Method_3')
    print ()

    
#Test_Create_data_NaN_Method_3()

In [None]:
def Create_data_NaN(data_Ground_Truth):
    print ('Create_data_NaN()')
    
    # Create a 2d list of the same shape as "data"
    # Each row has "1" in about 15% of the columns, "0" otherwise
        # By "15%," we mean "(# columns)*0.15" rounded to the nearest integer (drops_in_row).
    # Rows cycle through drops_in_row, drops_in_row + 1, and drops_in_row - 1 ones,
    #     so the average number of 1's per row is drops_in_row.
    # Each row is shuffled independently, so each column has about 15% 1's on average.
    # An earlier version (commented out below) rotated a single shuffled row instead.
    # Finally, shuffle the rows.
    
    rows = data_Ground_Truth.shape[0]
    columns = data_Ground_Truth.shape[1]
    drops_in_row = int(round(columns*0.15,0))
    print ('drops_in_row = ', drops_in_row)
    
    Rand_Drop = []
    Row = [1]*(drops_in_row) + [0]*(columns - drops_in_row)

    for i in range (rows):
        if i%3==0:
            Row = [1]*(drops_in_row) + [0]*(columns - drops_in_row)
        elif i%3==1:
            Row = [1]*(drops_in_row + 1) + [0]*(columns - drops_in_row - 1)
        else:
            Row = [1]*(drops_in_row - 1) + [0]*(columns - drops_in_row + 1)
        random.shuffle(Row)
        Rand_Drop.append(Row.copy())

#    for i in range (rows):
#        if i%columns==0:
#            random.shuffle(Row)
#        Row.append(Row.pop(0))
#        Rand_Drop.append(Row.copy())

    random.shuffle(Rand_Drop)

#    for i in range (columns):
#        print (i, sum([x[i] for x in Rand_Drop]))

    # Turn the 2d list into a dataframe
        
    Rand_Drop_df = pd.DataFrame(Rand_Drop, columns=data_Ground_Truth.columns)
    display(Rand_Drop_df)
    
#    for feature in Rand_Drop_df:
#        print (feature, Rand_Drop_df[feature].sum())

    # Change the Ground Truth values to NaN where the corresponding value in Rand_Drop_df is 1
    data_NaN = data_Ground_Truth.where(Rand_Drop_df==0)
#    data_NaN = data_NaN.astype('Int')
    
    print ('data_NaN')
    display(data_NaN)
    
    print ('data_NaN.isna().sum()')
    display(data_NaN.isna().sum())
    
    print ('data_NaN.dropna().shape')
    print (data_NaN.dropna().shape)
    
    print ('Finished Create_data_NaN()')
    print ()

    return data_NaN
    
    
def Test_Create_data_NaN():
    print ('Test_Create_data_NaN()')
    data = Get_Data()
    print (data.shape)
    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('Int32') # No effect: astype() returns a new frame and the result is not assigned

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    Create_data_NaN(data_Ground_Truth)
    
    print ('Finished Test_Create_data_NaN()')
    print ()

#Test_Create_data_NaN()
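
The masking step in `Create_data_NaN()` relies on `DataFrame.where`, which keeps a value where the condition is True and writes NaN elsewhere.  A small sketch on made-up data (the toy frame and the name `drop_mask` are illustrative assumptions):

```python
import random
import pandas as pd

random.seed(0)
ground_truth = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8],
    'C': [0, 1, 0, 1],
    'D': [2, 2, 2, 2],
})

# Build a 0/1 mask with exactly one 1 per row, shuffled independently per row.
drops_in_row = 1
rows = []
for _ in range(len(ground_truth)):
    row = [1] * drops_in_row + [0] * (ground_truth.shape[1] - drops_in_row)
    random.shuffle(row)
    rows.append(row)
drop_mask = pd.DataFrame(rows, columns=ground_truth.columns)

# Keep the ground-truth value where the mask is 0; write NaN where it is 1.
masked = ground_truth.where(drop_mask == 0)
print(masked)
```

Both frames must share the same index and columns, because `where` aligns the condition with the data before applying it.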


In [None]:
def Create_data_NaN_Old(data_Ground_Truth):
    """
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('int64')

    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth = data_Ground_Truth.astype('int64')
    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())
    """

    # For each feature (column), randomly pick 15% of the rows
    # and set that feature's value in those rows to be missing
    print ('Remove 15% of values from each feature')
    frac = .15
    data_NaN = data_Ground_Truth.copy(deep=True)
    n_drop = int(len(data_NaN) * frac) # Number of NaN in each feature
    for c in data_NaN.columns:
        # replace=False so that exactly n_drop distinct rows are chosen
        idx = np.random.choice(a=data_NaN.index, size=n_drop, replace=False)
        data_NaN.loc[idx, c] = np.nan
#    for feature in data_NaN:
#        data_NaN[feature] = pd.to_numeric(data_NaN[feature])
#    data_NaN.astype('int64')


#    print ('data_NaN.shape')
#    print (data_NaN.shape)
#    display(data_NaN.head())
    
    data_NaN = data_NaN.astype('Int8')
    
#    data_NaN = data_NaN.sample(n=200000)
#    print (data_NaN.shape)
#    print (data_NaN.head(20))
    
    return data_NaN

#Create_data_NaN_Old()

    

In [4]:
def Impute_MissForest(data):
    print('Impute_MissForest()')

    print (data.shape)
    display(data.head(20))
#    data.replace({np.nan: ''}, inplace=True)
#    display(data.head(20))

    categorical = list(data)
    print ('categorical features: ', categorical)
    
    clf = RandomForestClassifier(
#        n_estimators=100, 
#        max_depth=10, 
#        verbose=2,
#        max_features=0.5
    )
    rgr = RandomForestRegressor(
#        n_estimators=100, 
#        max_depth=10, 
#        verbose=2,
#        max_features=0.5
    )

    data_MF = MissForest(clf, rgr, max_iter = 10).fit_transform(
        x = data,
        categorical=categorical,
    )
    display(data_MF.head(20))
    print ('Finished Impute_MissForest()')
    print ()
    
    return data_MF
    

## Modifying and Testing our Versions of MissForest
- We made some changes to MissForest version 2.5.5
    - Change the logic of when to end the iterations
    - Optimize memory usage
        - Original took about 40 GB on the 232,333-sample test set
        - First revision took 1.9 GB
        - Second revision took 1.1 GB
- Note that we will only talk about what MissForest does for the categorical features, because all of our features are categorical, but there's a parallel system for numerical features.
- What we think the original version was supposed to do:
    - The number of changes from one iteration to the next should be decreasing.  If it starts increasing, stop.
    - The metric Gamma is the count of all of the changes from one iteration to the next, divided by the number of categorical features.  
    - The list all_x_imp_cat is a list of the dataframes, one appended at the end of each imputing iteration.
    - At the end of each iteration, calculate Gamma between all_x_imp_cat[-1] and all_x_imp_cat[-2].
    - Append that Gamma to all_gamma_cat.
    - If the new Gamma in all_gamma_cat is larger than the one from the previous iteration, stop, because the imputation is diverging.
- What the original version did instead:
    - Because of an indenting error (?), it added the current dataframe to all_x_imp_cat at the end of each feature, not at the end of each iteration.  
        - Instead of saving one copy of the dataset each iteration, it saved 67 (number of features) copies of the dataset in each iteration, which took so much memory that it caused the process to crash.
        - Instead of the changes from all_x_imp_cat[-2] to all_x_imp_cat[-1] being a whole iteration of changes over 67 features, the difference at the end of the iteration (when Gamma was calculated) was just the change in the last feature, WRK_ZONE.  So all_gamma_cat tracked the number of changes in the last feature in each iteration, and the stopping mechanism only looked at the number of changes in one feature.
    - When the imputation started to diverge, it stopped and returned the dataframe from the last iteration, the one that was diverging.  Our opinion is that the returned dataframe should be the one before that, when the changes were minimized.  
- Our first revision:
    - Moved the indentation back
        - We were saving only one copy of the dataframe per iteration
        - Gamma measured the change over one iteration, not the change over one feature
    - Returned the next-to-last iteration
- Our second revision additionally:
    - Only kept the current and the previous iteration in memory.  
        - Only saving two copies of the dataframe, not nIterations + 1 copies
        - Before the first iteration, and at the end of each iteration, ``x_imp_previous = x_imp``
        - Gamma measures the difference between x_imp_previous and x_imp.
- We tested the two revisions to make sure they gave the same results.  At the end of each iteration they gave the same Gammas (exactly), and we take that as reasonable assurance that they're doing the same thing.  
- The revisions should give the same dataset results as the original code if they stop after the same number of iterations.
- The revisions should not give the same Gamma values as the original code because they're comparing different things.
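
A minimal sketch of the stopping logic in our second revision, keeping only the previous and current frames.  This is not the MissForest package code: `gamma_cat`, `iterate_until_gamma_rises`, and the stand-in `impute_once` callable are hypothetical names used only for illustration.

```python
import pandas as pd

def gamma_cat(x_imp_previous, x_imp, categorical):
    """Number of changed cells in the categorical features, divided by the number of features."""
    changed = (x_imp_previous[categorical] != x_imp[categorical]).sum().sum()
    return changed / len(categorical)

def iterate_until_gamma_rises(x_start, impute_once, categorical, max_iter=10):
    """Run impute_once repeatedly; when Gamma rises, return the iteration before the rise."""
    x_imp_previous = x_start
    gamma_previous = None
    for _ in range(max_iter):
        x_imp = impute_once(x_imp_previous)          # one full pass over all features
        gamma = gamma_cat(x_imp_previous, x_imp, categorical)
        if gamma_previous is not None and gamma > gamma_previous:
            return x_imp_previous                    # next-to-last iteration
        x_imp_previous, gamma_previous = x_imp, gamma
    return x_imp_previous

# Toy sequence of "imputed" frames: Gamma goes 3, 1, 4, so the third frame is returned.
frames = [pd.DataFrame({'A': v}) for v in
          ([0, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1])]
remaining = iter(frames[1:])
result = iterate_until_gamma_rises(frames[0], lambda previous: next(remaining), ['A'])
print(result['A'].tolist())
```

Only two frames are alive at any time, which is why memory use stays flat across iterations in this variant.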


In [None]:
%%time
def Test_Impute_MissForest():
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('Int64') # No effect: astype() returns a new frame and the result is not assigned

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    data_NaN = Create_data_NaN_Method_3(data_Ground_Truth)


    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head())
    
#    data_NaN = data_NaN.astype('Int8')
    
#    data_NaN = data_NaN.sample(n=10000)
#    print (data_NaN.shape)
#    print (data_NaN.head(20))

    
    # Perform MissForest imputation
    print ('Start Imputation')
    data_MF = Impute_MissForest(data_NaN)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
#    data_MF = data_MF.astype('Int64')
    print (data_MF.head(20))
    
    print ('Finished Test_Impute_MissForest()')
    print ()
    
#Test_Impute_MissForest()

# With missforest.py with Brad's modifications that store only the current and previous iterations of the dataframe
# all_gamma_cat =  [2001.7611940298507, 337.089552238806, 293.4776119402985, 287.53731343283584, 287.4925373134328, 285.5970149253731, 285.2238805970149, 286.82089552238807]
# 7 iterations
# 60 minutes
# Memory Usage:
    # Before running, python3.10 is taking 50 MB.
    # Running things up to this test, 130 MB.
    # During Iteration 0, between 950 MB and 1400 MB, typically 1100 MB.
    # About the same in Iterations 1, 3, and 6.
    # 

# with missforest_Brad_2.py with Brad's modifications that store a new dataframe at each iteration
# all_gamma_cat =  [2001.7611940298507, 337.089552238806, 293.4776119402985, 287.53731343283584, 287.4925373134328, 285.5970149253731, 285.2238805970149, 286.82089552238807]
# 7 iterations
# Same time
# Memory usage:
    # In iteration 0, 920 MB
    # In iteration 1, 1.13 GB
    # In iteration 2, 1.36 GB
    # In iteration 3, 1.49 GB
    # In iteration 4, 1.52 GB
    # In iteration 5, 1.7 GB
    # In iteration 6, 1.8 GB
    # In iteration 7, 1.9 GB

In [None]:
%%time
# Test imputing the big dataset
def Test_Impute_MissForest_2():
    print ('Test_Impute_MissForest_2()')
    data = Get_Data()
    print (data.shape)
    display(data.head(10))
#    data = data.sample(n=1000)
    print (data.shape)
    data.replace({99:np.nan}, inplace=True)
    display(data.head(10))
    data_MF = Impute_MissForest(data)
    
    print ('Finished Test_Impute_MissForest_2()')
    print ()

#Test_Impute_MissForest_2()

In [None]:
def Impute_Round_Robin(data):
    print ('Impute_Round_Robin()')
    pd.set_option('display.max_columns', None)
    
    # Replace 'Unknown' with np.NaN
#    data.replace({'Unknown': np.nan}, inplace=True)
#    data.replace({99: np.nan}, inplace=True)
    display(data.head(20))
    print ()
    
    # Make a list of features with missing samples, 
    #     ordered by the number of missing samples, 
    #     from least to most.  
    Missing = []
    Complete = []
    for feature in data:
        s = data[feature].isna().sum()
        if s==0:
            Complete.append([feature, s])
        if s>0:
            Missing.append([feature, s])
    Missing = sorted (Missing, key=lambda x:x[1], reverse=False)
#    print ()
#    print ('Complete[]')
#    display(Complete)
#    print ()
#    print ('Missing[]')
#    display(Missing)
#    print ()
    
#    print ('Make data_Mode')
#    print ()
    data_Mode = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Mode[feature] = data[feature]
    for M in Missing:
        feature = M[0]
        m = data[feature].mode()[0]
#        print (feature, M[1], m)
        data_Mode[feature] = data[feature].fillna(m)
#    print ('data_Mode')
    display(data_Mode.head(20))

#    print ()
#    print ('Make starting point for data_Imputed')
    data_Imputed = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Imputed[feature] = data[feature]
    for X in Missing:
        feature = X[0]
        data_Imputed[feature] = data_Mode[feature]
#    print ('data_Imputed')
#    display(data_Imputed.head(20))
#    print ()

    print ('Start Loop')
    print ()
    n = 0
    for M in Missing:
        n += 1
        print (M)
        feature = M[0]
        data_Imputed[feature] = data[feature]
#        print ()
#        print ('data[feature].isna().sum()')
#        print (data[feature].isna().sum())
#        print ('data_Imputed[feature].isna().sum()')
#        print (data_Imputed[feature].isna().sum())
#        print ()
        # W: rows where this feature is known (training rows, kept unchanged)
        W = data_Imputed.dropna(subset=[feature])
        y = W[feature]
        X = W.drop(columns=feature)
        # Z: rows where this feature is missing (to be predicted)
        Z = data_Imputed[data_Imputed[feature].isna()].drop(columns=feature)
#        Z.reset_index(drop=True, inplace=True)
#        print (data.shape)
#        print (X.shape)
#        display(X.head(40))
#        display(y.head(40))
#        print (Z.shape)
#        display(Z)
        clf = RandomForestClassifier(max_depth=2, random_state=random_seed)
        clf.fit(X,y)
#        print ('clf.predict(Z)')
        z = clf.predict(Z)
#        print (len(z))
#        display(z)
        Z[feature] = z
#        display(Z)
        data_Imputed = pd.concat([Z, W])
#        display(data_Imputed.head(60))
#        print (data_Imputed.shape)
#        print ()
#        data_Imputed.sort_values(
#            by = ['CASENUM', 'VEH_NO', 'PER_NO'], 
#            ascending = [True, True, True], 
#            inplace=True
#        )
#        print ()
#        print ('data.PER_NO.equals(data_Imputed.PER_NO)')
#        print (data.PER_NO.equals(data_Imputed.PER_NO))
#        print ()
               
#        Check_Feature(data, data_Imputed, feature)
#        if n==10:
#            return data_Imputed
    
    
    display(data_Imputed.head(20))

    print ('Finished Impute_Round_Robin()')
    print ()
    return data_Imputed

In [5]:
def Check(data, data_Imputed):
    print ('Check()')
    Features = data.columns
    print (Features)
    for feature in Features:
        U = pd.unique(data[feature]).tolist()
        print (U)
        A = []
        for u in U:
            a = len(data[data[feature]==u])
            b = len(data_Imputed[data_Imputed[feature]==u])
            A.append([u, a, b])
        display(A)
        print ()
    print ('Finished Check()')
    print ()


In [None]:
def Check_Feature(data, data_Imputed, feature):
    print ('Check_Feature(%s)' % feature)
    U = pd.unique(data[feature]).tolist()
    U = [x for x in U if x == x] # Drop NaN, which is the only value not equal to itself
    print (U)
    A = []
    for u in U:
        a = len(data[data[feature]==u])
        b = len(data_Imputed[data_Imputed[feature]==u])
        A.append([u, a, b, b-a])
    a = data[feature].isna().sum()
    b = data_Imputed[feature].isna().sum()
    A.append(['NaN', a, b, b-a])
    A = pd.DataFrame(A, columns=['Value', 'Original', 'Imputed', 'Difference'])
    display(A)
    
    print ('Finished Check_Feature()')
    print ()


In [None]:
def Impute_Randomly(data):
    print ('Impute_Randomly()')
    print ()
    
    data.sample(frac=1, replace=True) # No effect: the sampled copy is discarded, so the row order of data is unchanged
    for feature in data:
        print (feature)
#        print ('display(data[feature].head())')
#        display(data[feature].head())
        dfA = data[feature].dropna() # Non-missing values only; dropna() returns a new Series
#        print ('display(dfA.head())')
#        display(dfA.head())
#        print ('display(dfA.head()) after dfA.dropna(inplace=True)')
#        display(dfA.head())
#        print ('Original Value Counts')
#        print (dfA.value_counts(normalize=True))
        dfA = dfA.sample(n = len(data), replace=True)
#        print ('display(dfA.head()) after dfA.sample(n = len(data), replace=True)')
#        display(dfA.head())
#        print ('Value Counts after Sampling')
#        print (dfA.value_counts(normalize=True))
        dfA.reset_index(drop=True, inplace=True)
#        print ('display(dfA.head()) after dfA.reset_index(drop=True)')
#        display(dfA.head())
        data[feature] = data[feature].fillna(dfA)
#        print ('display(data[feature].head())')
#        display(data[feature].head())        
#        print ()
        
    print ('Finished Impute_Randomly()')
    print ()
    return data
        
def Test_Impute_Randomly():
    Dict = {
        'A':[0,0,0,1,np.nan],
        'B':[1,2,3,4,np.nan]
    }
    
    data = pd.DataFrame(Dict)
    display(data)
    data = Impute_Randomly(data)
    display(data)
    
#Test_Impute_Randomly()
        

# Compare Imputation Methods

## Mode Imputation
## Random Forest Imputation
## Prepare Data for IVEware

In [None]:
def Compare_Imputation_Methods_Part_1():
    print ('Compare_Imputation_Methods_Part_1()')
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('Int32') # No effect: astype() returns a new frame and the result is not assigned

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
#    data_Ground_Truth = data_Ground_Truth.sample(n=200000)
#    data_Ground_Truth.reset_index(inplace=True, drop=True)

#    print ('data_Ground_Truth.shape after resampling')
#    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    print ('Remove values from each row')
    data_NaN = Create_data_NaN_Method_3(data_Ground_Truth)
    
    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head(10))
    display(data_NaN.tail(10))
    
    # Perform MissForest imputation
    data_MF = data_NaN.copy(deep=True)
#    data_MF = data_NaN_Old.copy(deep=True)
#    data_MF = data_MF.astype('Int8')
    print ('data_MF')
    display(data_MF.head(20))
    data_MF = Impute_MissForest(data_MF)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
#    data_MF = data_MF.astype('Int32')
    
    print ('data_MF.shape')
    print (data_MF.shape)
    display(data_MF.head())
#    print ()

    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data_NaN.copy(deep=True)
#    data_IVEware = data_IVEware.astype('str')
    data_IVEware = data_IVEware.fillna('')
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    data_Mode = pd.DataFrame()
    for feature in data_NaN:
        data_Mode[feature] = data_NaN[feature].fillna(data_NaN[feature].mode()[0])
    data_Mode = data_Mode.astype('Int32')
    print ('data_Mode.shape')
    print (data_Mode.shape)
    display(data_Mode.head())
    
    # Perform Round Robin imputation using Random Forest Classifier
    data_RF = Impute_Round_Robin(data_NaN)
    data_RF.sort_index(inplace=True)
    data_RF = data_RF[data.columns]  
    data_RF = data_RF.astype('Int32')
    
    print ('data_RF.shape')
    print (data_RF.shape)
    display(data_RF.head())
#    print ()

    # Impute randomly
    data_Random = data_NaN.copy(deep=True)
    data_Random = Impute_Randomly(data_Random)
    data_Random = data_Random.astype('Int32')
    
    print ('data_Random.shape')
    print (data_Random.shape)
    display(data_Random.head())
#    print ()

    print ('Finished Compare_Imputation_Methods_Part_1()')
    print ()

    return data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random

In [None]:
%%time 
# about an hour
data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random = Compare_Imputation_Methods_Part_1()

## Do IVEware Imputation (Outside this Jupyter Notebook)
- Go to the IVEware folder and run (at the command line) IVE_12_22_22.bat
- Requires srclib and R.  You may need to change the path to your srclib installation in the batch file.
- Notes to self:
    - Open srcshell
    - From srcshell, open IVEware_CRSS_Imputation.xml
    - Run
- Run time: ./IVEware_CRSS_Imputation.bat  1069.08s user 12.92s system 98% cpu 18:23.92 total

In [None]:
data_IVEware_seed_0 = pd.read_csv('../../Big_Files/data_IVEware_Compare_seed_0.csv')
data_IVEware_seed_0.drop(columns='Unnamed: 0', inplace=True)

data_IVEware_seed_1 = pd.read_csv('../../Big_Files/data_IVEware_Compare_seed_1.csv')
data_IVEware_seed_1.drop(columns='Unnamed: 0', inplace=True)

print ('data_Ground_Truth', data_Ground_Truth.shape)
display(data_Ground_Truth.head(10))
print ('data_NaN', data_NaN.shape)
display(data_NaN.head(10))
print ('data_RF', data_RF.shape)
display(data_RF.head(10))
print ('data_MF', data_MF.shape)
display(data_MF.head(10))
print ('data_IVEware_seed_0', data_IVEware_seed_0.shape)
display(data_IVEware_seed_0.head(10))
print ('data_IVEware_seed_1', data_IVEware_seed_1.shape)
display(data_IVEware_seed_1.head(10))
print ('data_Mode', data_Mode.shape)
display(data_Mode.head(10))
print ('data_Random', data_Random.shape)
display(data_Random.head(10))


## Compare Six Imputation Methods
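
The per-feature error used in the comparison is the number of cells where an imputed dataset differs from the ground truth, expressed as a percentage of the number of NaN that were introduced in that feature.  A small sketch on made-up data (the toy frames and the column names in `summary` are illustrative assumptions):

```python
import numpy as np
import pandas as pd

ground_truth = pd.DataFrame({'A': [0, 1, 1, 0, 1], 'B': [2, 2, 3, 3, 2]})
with_nan = pd.DataFrame({'A': [0, np.nan, 1, np.nan, 1], 'B': [2, 2, np.nan, 3, 2]})
imputed = pd.DataFrame({'A': [0, 1, 1, 1, 1], 'B': [2, 2, 2, 3, 2]})

summary = []
for feature in with_nan:
    n_nan = int(with_nan[feature].isna().sum())
    # Cells that were never NaN are identical in both frames, so every
    # mismatch comes from an imputed cell.
    n_wrong = int((ground_truth[feature] != imputed[feature]).sum())
    summary.append([feature, n_nan, n_wrong, round(n_wrong / n_nan * 100, 4)])

summary = pd.DataFrame(summary, columns=['Feature', 'nNaN', 'nIncorrect', 'pIncorrect'])
print(summary)
```

A feature with no NaN would give a division by zero here, so this sketch assumes every feature has at least one missing value.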

In [None]:
def Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, 
    data_NaN, 
    data_RF, 
    data_MF, 
    data_Mode, 
    data_Random, 
    data_IVEware_seed_0, 
    data_IVEware_seed_1
):
    print ('Compare_Imputation_Methods_Part_2')
    
    """
    print ('Drop Multicollinear Features')
    Drop = ['MAX_VSEV', 'VE_FORMS', 'VTCONT_F', 'MAX_SEV', 'NUM_INJV']
    DF = [data_Ground_Truth, data_NaN, data_RF, data_Mode, data_Random, data_IVEware]
    
    for df in DF:
        for feature in Drop:
            if feature in df:
                df.drop(columns=[feature], inplace=True)
                print ('Drop ', feature)
    print ()
    """
    
    Datasets = [
        ['data_Ground_Truth', data_Ground_Truth],
        ['data_NaN', data_NaN],
        ['data_RF', data_RF],
        ['data_MF', data_MF],
        ['data_Mode', data_Mode],
        ['data_Random', data_Random],
        ['data_IVEware_seed_0', data_IVEware_seed_0],
        ['data_IVEware_seed_1', data_IVEware_seed_1],
    ]
    
    for dataset in Datasets:
        name = dataset[0]
        data = dataset[1]
        print (name, '.shape: ', data.shape)
    print ()
    
    for dataset in Datasets:
        name = dataset[0]
        data = dataset[1]
        print (name)
        display(data.head())
    print ()
    
    Datasets = [
        ['data_Ground_Truth', data_Ground_Truth],
#        ['data_NaN', data_NaN],
        ['data_RF', data_RF],
        ['data_MF', data_MF],
        ['data_Mode', data_Mode],
        ['data_Random', data_Random],
        ['data_IVEware_seed_0', data_IVEware_seed_0],
        ['data_IVEware_seed_1', data_IVEware_seed_1],
    ]
    
    A = []
    for feature in data_NaN:
        B = []
        B.append(feature)
        B.append(data_NaN[feature].isna().sum())
        for i in range (len(Datasets)-1):
            for j in range (i+1, len(Datasets)):
                C = (Datasets[i][1][feature] != Datasets[j][1][feature]).sum()
                D = round(C/B[1]*100,4)
                B.append(C)
#                print ('B[', len(B)-1, '] counts differences between ', Datasets[i][0], ' and ', Datasets[j][0], '.')
                B.append(D)
#                print ('B[', len(B)-1, '] gives the count as a percentage.')
#        print ()
        
#        print (B)
        A.append(B)
    print ()
    
    A = sorted(A, key=lambda x:x[3])
    B = pd.DataFrame(
        A, 
        columns=[
            'Feature', # 0
            'nNaN',  # 1
            'nRF Incorrect', 'pRF Incorrect', # 2, 3
            'nMF Incorrect', 'pMF Incorrect', # 4, 5
            'nMode Incorrect', 'pMode Incorrect', # 6, 7
            'nRandom Incorrect', 'pRandom Incorrect', # 8, 9
            'nIVEware_seed_0 Incorrect', 'pIVEware_seed_0 Incorrect', # 10, 11
            'nIVEware_seed_1 Incorrect', 'pIVEware_seed_1 Incorrect', # 12, 13
            'RF and MF Different', 'RF v/s MF %', # 14, 15
            'RF and Mode Different', 'RF v/s Mode %', # 16, 17
            'RF and Random Different', 'RF v/s Random %', # 18, 19
            'RF and IVEware_seed_0 Different', 'RF v/s IVEware_seed_0 %', # 20, 21
            'RF and IVEware_seed_1 Different', 'RF v/s IVEware_seed_1 %', # 22, 23
            'MF and Mode Different', 'MF v/s Mode %', # 24, 25
            'MF and Random Different', 'MF v/s Random %', # 26, 27
            'MF and IVEware_seed_0 Different', 'MF v/s IVEware_seed_0 %', # 28, 29
            'MF and IVEware_seed_1 Different', 'MF v/s IVEware_seed_1 %', # 30, 31
            'Mode and Random Different', 'Mode v/s Random %', # 32, 33
            'Mode and IVEware_seed_0 Different', 'Mode v/s IVEware_seed_0 %', # 34, 35
            'Mode and IVEware_seed_1 Different', 'Mode v/s IVEware_seed_1 %', # 36, 37
            'Random and IVEware_seed_0 Different', 'Random v/s IVEware_seed_0 %', # 38, 39
            'Random and IVEware_seed_1 Different', 'Random v/s IVEware_seed_1 %', # 40, 41
            'IVEware_seed_0 and IVEware_seed_1 Different', 'IVEware_seed_0 v/s IVEware_seed_1 %', # 42, 43
        ]
    )
    display(B)
    a = sum([x[1] for x in A]) # nNaN
    b = sum([x[2] for x in A]) # nRF Incorrect
    c = sum([x[4] for x in A]) # nMF Incorrect
    d = sum([x[6] for x in A]) # nMode Incorrect
    e = sum([x[8] for x in A]) # nRandom Incorrect
    f = sum([x[10] for x in A]) # nIVEware_seed_0 Incorrect
    g = sum([x[12] for x in A]) # nIVEware_seed_1 Incorrect
    h = round(b/a*100,2)
    i = round(c/a*100,2)
    j = round(d/a*100,2)
    k = round(e/a*100,2)
    l = round(f/a*100,2)
    m = round(g/a*100,2)

    RF_less_MF = sum([x[2] < x[4] for x in A])
    RF_equal_MF = sum([x[2] == x[4] for x in A])
    RF_greater_MF = sum([x[2] > x[4] for x in A])
    
    RF_less_Mode = sum([x[2] < x[6] for x in A])
    RF_equal_Mode = sum([x[2] == x[6] for x in A])
    RF_greater_Mode = sum([x[2] > x[6] for x in A])

    RF_less_Random = sum([x[2] < x[8] for x in A])
    RF_equal_Random = sum([x[2] == x[8] for x in A])
    RF_greater_Random = sum([x[2] > x[8] for x in A])

    RF_less_IVEware_seed_0 = sum([x[2] < x[10] for x in A])
    RF_equal_IVEware_seed_0 = sum([x[2] == x[10] for x in A])
    RF_greater_IVEware_seed_0 = sum([x[2] > x[10] for x in A])

    RF_less_IVEware_seed_1 = sum([x[2] < x[12] for x in A])
    RF_equal_IVEware_seed_1 = sum([x[2] == x[12] for x in A])
    RF_greater_IVEware_seed_1 = sum([x[2] > x[12] for x in A])

    MF_less_Mode = sum([x[4] < x[6] for x in A])
    MF_equal_Mode = sum([x[4] == x[6] for x in A])
    MF_greater_Mode = sum([x[4] > x[6] for x in A])

    MF_less_Random = sum([x[4] < x[8] for x in A])
    MF_equal_Random = sum([x[4] == x[8] for x in A])
    MF_greater_Random = sum([x[4] > x[8] for x in A])

    MF_less_IVEware_seed_0 = sum([x[4] < x[10] for x in A])
    MF_equal_IVEware_seed_0 = sum([x[4] == x[10] for x in A])
    MF_greater_IVEware_seed_0 = sum([x[4] > x[10] for x in A])

    MF_less_IVEware_seed_1 = sum([x[4] < x[12] for x in A])
    MF_equal_IVEware_seed_1 = sum([x[4] == x[12] for x in A])
    MF_greater_IVEware_seed_1 = sum([x[4] > x[12] for x in A])

    Mode_less_Random = sum([x[6] < x[8] for x in A])
    Mode_equal_Random = sum([x[6] == x[8] for x in A])
    Mode_greater_Random = sum([x[6] > x[8] for x in A])

    Mode_less_IVEware_seed_0 = sum([x[6] < x[10] for x in A])
    Mode_equal_IVEware_seed_0 = sum([x[6] == x[10] for x in A])
    Mode_greater_IVEware_seed_0 = sum([x[6] > x[10] for x in A])

    Mode_less_IVEware_seed_1 = sum([x[6] < x[12] for x in A])
    Mode_equal_IVEware_seed_1 = sum([x[6] == x[12] for x in A])
    Mode_greater_IVEware_seed_1 = sum([x[6] > x[12] for x in A])

    Random_less_IVEware_seed_0 = sum([x[8] < x[10] for x in A])
    Random_equal_IVEware_seed_0 = sum([x[8] == x[10] for x in A])
    Random_greater_IVEware_seed_0 = sum([x[8] > x[10] for x in A])

    Random_less_IVEware_seed_1 = sum([x[8] < x[12] for x in A])
    Random_equal_IVEware_seed_1 = sum([x[8] == x[12] for x in A])
    Random_greater_IVEware_seed_1 = sum([x[8] > x[12] for x in A])

    IVEware_seed_0_less_IVEware_seed_1 = sum([x[10] < x[12] for x in A])
    IVEware_seed_0_equal_IVEware_seed_1 = sum([x[10] == x[12] for x in A])
    IVEware_seed_0_greater_IVEware_seed_1 = sum([x[10] > x[12] for x in A])

    print ()
    print ('    | | Number | NaN Imputed Incorrectly |')
    print ('    | --- | --- | --- | ')    
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% | ')
    print ('    | RF | ', f'{b:,d}', ' | ', h, '% | ')
    print ('    | MF | ', f'{c:,d}', ' | ', i, '% | ')
    print ('    | Mode | ', f'{d:,d}', ' | ', j, '% | ')
    print ('    | Random | ', f'{e:,d}', ' | ', k, '% | ')
    print ('    | IVEware_seed_0 | ', f'{f:,d}', ' | ', l, '% | ')
    print ('    | IVEware_seed_1 | ', f'{g:,d}', ' | ', m, '% | ')
    print ()
    print ()
    print ('    | | Number | NaN Imputed Correctly |')
    print ('    | --- | --- | --- | ')    
    print ('    | Total NaN | ', f'{a:,d}', ' |  | ')
    # Number imputed correctly = total NaN minus number imputed incorrectly
    print ('    | RF | ', f'{a-b:,d}', ' | ', round(100-h,2), '% | ')
    print ('    | MF | ', f'{a-c:,d}', ' | ', round(100-i,2), '% | ')
    print ('    | Mode | ', f'{a-d:,d}', ' | ', round(100-j,2), '% | ')
    print ('    | Random | ', f'{a-e:,d}', ' | ', round(100-k,2), '% | ')
    print ('    | IVEware_seed_0 | ', f'{a-f:,d}', ' | ', round(100-l,2), '% | ')
    print ('    | IVEware_seed_1 | ', f'{a-g:,d}', ' | ', round(100-m,2), '% | ')
    print ()
    print ('    |  | Fewer | Equal | More | Total | ')
    print ('    | --- | --- | --- | --- | --- | ')
    print ('    | Compare RF to MF | ', RF_less_MF, ' | ', RF_equal_MF,  ' | ' ,RF_greater_MF,  ' |', len(A), ' |' )
    print ('    | Compare RF to Mode | ', RF_less_Mode, ' | ', RF_equal_Mode,  ' | ' ,RF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare RF to Random | ', RF_less_Random, ' | ' , RF_equal_Random,  ' | ' , RF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware_seed_0 | ', RF_less_IVEware_seed_0, ' | ' , RF_equal_IVEware_seed_0, ' | ' , RF_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware_seed_1 | ', RF_less_IVEware_seed_1, ' | ' , RF_equal_IVEware_seed_1, ' | ' , RF_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare MF to Mode | ', MF_less_Mode, ' | ', MF_equal_Mode,  ' | ' ,MF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare MF to Random | ', MF_less_Random, ' | ' , MF_equal_Random,  ' | ' , MF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare MF to IVEware_seed_0 | ', MF_less_IVEware_seed_0, ' | ' , MF_equal_IVEware_seed_0, ' | ' , MF_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare MF to IVEware_seed_1 | ', MF_less_IVEware_seed_1, ' | ' , MF_equal_IVEware_seed_1, ' | ' , MF_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare Mode to Random | ', Mode_less_Random, ' | ' , Mode_equal_Random, ' | ' , Mode_greater_Random, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware_seed_0 | ', Mode_less_IVEware_seed_0, ' | ' , Mode_equal_IVEware_seed_0, ' | ' , Mode_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware_seed_1 | ', Mode_less_IVEware_seed_1, ' | ' , Mode_equal_IVEware_seed_1, ' | ' , Mode_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare Random to IVEware_seed_0 | ', Random_less_IVEware_seed_0, ' | ' , Random_equal_IVEware_seed_0, ' | ' , Random_greater_IVEware_seed_0, ' |', len(A), ' |' )
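    print ('    | Compare Random to IVEware_seed_1 | ', Random_less_IVEware_seed_1, ' | ' , Random_equal_IVEware_seed_1, ' | ' , Random_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare IVEware_seed_0 to IVEware_seed_1 | ', IVEware_seed_0_less_IVEware_seed_1, ' | ' , IVEware_seed_0_equal_IVEware_seed_1, ' | ' , IVEware_seed_0_greater_IVEware_seed_1, ' |', len(A), ' |' )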
    print ()
    
    b = sum([x[14] for x in A])
    c = sum([x[16] for x in A])
    d = sum([x[18] for x in A])
    e = sum([x[20] for x in A])
    f = sum([x[22] for x in A])
    g = sum([x[24] for x in A])
    h = sum([x[26] for x in A])
    i = sum([x[28] for x in A])
    j = sum([x[30] for x in A])
    k = sum([x[32] for x in A])
    l = sum([x[34] for x in A])
    m = sum([x[36] for x in A])
    n = sum([x[38] for x in A])
    o = sum([x[40] for x in A])
    p = sum([x[42] for x in A])
    
    print ('    |  | Number |  Percentage |')
    print ('    | --- | --- | --- |')
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% |' )
    print ('    | RF Different from MF | ', f'{b:,d}', ' | ', round(b/a*100,2), '% |')
    print ('    | RF Different from Mode | ', f'{c:,d}', ' | ', round(c/a*100,2), '% |')
    print ('    | RF Different from Random | ', f'{d:,d}', ' | ', round(d/a*100,2), '% |')
    print ('    | RF Different from IVEware_seed_0 | ', f'{e:,d}', ' | ', round(e/a*100,2), '% |')
    print ('    | RF Different from IVEware_seed_1 | ', f'{f:,d}', ' | ', round(f/a*100,2), '% |')
    print ('    | MF Different from Mode | ', f'{g:,d}', ' | ', round(g/a*100,2), '% |')
    print ('    | MF Different from Random | ', f'{h:,d}', ' | ',  round(h/a*100,2), '% |')
    print ('    | MF Different from IVEware_seed_0 | ', f'{i:,d}', ' | ',  round(i/a*100,2), '% |')
    print ('    | MF Different from IVEware_seed_1 | ', f'{j:,d}', ' | ', round(j/a*100,2), '% |')
    print ('    | Mode Different from Random | ', f'{k:,d}', ' | ', round(k/a*100,2), '% |')
    print ('    | Mode Different from IVEware_seed_0 | ', f'{l:,d}', ' | ', round(l/a*100,2), '% |')
    print ('    | Mode Different from IVEware_seed_1 | ', f'{m:,d}', ' | ', round(m/a*100,2), '% |')
    print ('    | Random Different from IVEware_seed_0 | ', f'{n:,d}', ' | ', round(n/a*100,2), '% |')
    print ('    | Random Different from IVEware_seed_1 | ', f'{o:,d}', ' | ', round(o/a*100,2), '% |')
    print ('    | IVEware_seed_0 Different from IVEware_seed_1 | ', f'{p:,d}', ' | ', round(p/a*100,2), '% |')
    print ()
        
#    display(Audio(sound_file, autoplay=True))
    
    print ('Finished Compare_Imputation_Methods_Part_2')



In [None]:
Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, 
    data_RF, data_MF, 
    data_Mode, data_Random, 
    data_IVEware_seed_0, data_IVEware_seed_1
)
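
The fewer/equal/more tallies in `Compare_Imputation_Methods_Part_2` are written out by hand for each pair of methods, which makes a mislabeled or missing row easy to introduce. As a minimal sketch, assuming each row of `A` keeps the "incorrect" counts at the same positions used above (2, 4, 6, 8, 10, 12), the same tallies could be generated in a loop. The names `INCORRECT_COL`, `tally_pairwise`, and `print_pairwise_table` are hypothetical and not part of the notebook.

```python
from itertools import combinations

# Position of each method's "incorrect" count within a row of A.
# These indices mirror the ones used in the function above.
INCORRECT_COL = {
    'RF': 2, 'MF': 4, 'Mode': 6, 'Random': 8,
    'IVEware_seed_0': 10, 'IVEware_seed_1': 12,
}

def tally_pairwise(A, columns=INCORRECT_COL):
    """For every pair of methods, count the features (rows of A) where the
    first method has fewer, equal, or more incorrect imputations."""
    results = {}
    for (name_1, col_1), (name_2, col_2) in combinations(columns.items(), 2):
        fewer = sum(row[col_1] < row[col_2] for row in A)
        equal = sum(row[col_1] == row[col_2] for row in A)
        more = sum(row[col_1] > row[col_2] for row in A)
        results[(name_1, name_2)] = (fewer, equal, more)
    return results

def print_pairwise_table(A, columns=INCORRECT_COL):
    """Print the tallies as a markdown table in the notebook's style."""
    print('    |  | Fewer | Equal | More | Total | ')
    print('    | --- | --- | --- | --- | --- | ')
    for (name_1, name_2), (fewer, equal, more) in tally_pairwise(A, columns).items():
        print(f'    | Compare {name_1} to {name_2} | {fewer} | {equal} | {more} | {len(A)} |')
```

With six methods this yields all fifteen pairs, and each row label is built from the same names as the counts it reports.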

# Impute using Random Forest and Save for Next Step

In [None]:
def Impute_Using_Random_Forest():
    print ('Impute_Using_Random_Forest()')
    data = Get_Data()
    data = data.replace({99:np.nan})
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_Round_Robin(data)
    data_Imputed.to_csv('../../Big_Files/CRSS_Imputed_by_RF_Data.csv', index=False)
#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))

    print ('Finished Impute_Using_Random_Forest()')
    print ()
    
    return 0

#Impute_Using_Random_Forest()

# Impute using IVEware
- IVEware gives this warning:
    - Warning: Too many iterations, probably due to colinearity with dependent variable
- With filename_prefix '_0_0', meaning that we did not reduce dimensionality before imputing, the warning appears for these features:
    - Iteration 1:
        - ACC_TYPE
        - CARGO_BT
        - HARM_EV
        - MAKE
        - MAX_VSEV
        - NUM_INJ
        - TOWED
        - TYP_INT
    - Iteration 2:
        - AIR_BAG
        - BODY_TYP
        - CARGO_BT
        - HARM_EV
        - MAX_SEV
        - NUM_INJ
        - NUM_INJV
        - TOWED
        - TYP_INT
        - VSPD_LIM
    - Iteration 3:
        - AIR_BAG
        - BODY_TYP
        - CARGO_BT
        - HARM_EV
        - MAKE
        - MAK_MOD
        - MAX_SEV
        - NUM_INJ
        - TOWED
        - TYP_INT
        - VSPD_LIM
    - ... and similarly through Iteration 10. 
- With dimensionality reduction:
    - Iteration 1:
        - ACC_TYPE
        - CARGO_BT
        - HARM_EV
        - MAKE
        - [MAX_VSEV Removed]
        - NUM_INJ
        - TOWED
        - TYP_INT
    - Otherwise the list looks the same as without dimensionality reduction.
- Some of these features have an $R^2$ score greater than 0.8, so lowering our $R^2$ threshold from 0.9 to 0.8 would remove some, but not all, of them:
    - BODY_TYP
    - HARM_EV
    - INJ_SEV
    - MODEL
    - MAX_SEV
    - NUM_INJ
    - NUM_INJV
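
This section does not show how the $R^2$ scores were computed. One common approach, sketched here purely as an assumption, is to regress each feature on all the others with ordinary least squares and flag the features whose $R^2$ exceeds the threshold. The function names are hypothetical, the sketch needs a complete numeric matrix (no NaN), and treating binned category codes as numeric is itself a simplification.

```python
import numpy as np

def r2_against_others(X):
    """For each column j of X, return the R^2 of a least-squares fit that
    predicts column j from all other columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    scores = np.empty(n_cols)
    for j in range(n_cols):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        residual = y - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        # A constant column has no variance to explain; report 0 for it.
        scores[j] = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return scores

def features_above_threshold(X, names, threshold):
    """Names of the features whose R^2 against the others exceeds threshold."""
    scores = r2_against_others(X)
    return [name for name, score in zip(names, scores) if score > threshold]
```

Calling `features_above_threshold` once with 0.9 and once with 0.8 on the same matrix would show which additional features the lower threshold removes.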


In [6]:
%%time
def Impute_Using_IVEware():
    print ('Impute_Using_IVEware()')
    data = Get_Data()
    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data.copy(deep=True)
    print (data_IVEware.shape)
    display(data_IVEware.head(10))
    
    data_IVEware = data_IVEware.replace(99,'')
    display(data_IVEware.head(10))
#    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)

    # filename_prefix supplies the first two suffixes:
    #   first '_0' or '_1':  the random seed for Python and Numpy
    #   second '_0' or '_1': dimensionality not reduced ('_0') or reduced ('_1') before binning
    # The third suffix, '_1', marks the file as being for IVEware with R random seed 0
    filename = '../../Big_Files/data_IVEware' + filename_prefix + '_1.txt'
    print (filename)
    print ()
    data_IVEware.to_csv(filename, sep='\t', index=False) # filename_prefix records whether dimensionality was reduced before binning
#    data_IVEware.to_csv('../../Big_Files/data_IVEware_1_1.txt', sep='\t', index=False) # Dimensionality reduced before binning
    
    print ('Finished Impute_Using_IVEware()')
    print ()
    
    return 0

Impute_Using_IVEware()

# Now run IVEware outside this notebook.
# Takes about 1 GB of memory
# Run twice, once with seed 0 and once with seed 1, writing to different files
# About one hour for each of two runs

Impute_Using_IVEware()
Get_Data
../../Big_Files/CRSS_02_1_1.csv

data.shape =  (802700, 64)

Drop Imputed Columns
data.shape =  (802700, 64)

Total number of NaN
1599835

Finished Get_Data()

(802700, 64)


Unnamed: 0,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,...,VEH_AGE,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE,HOSPITAL
0,9,9,4,99,2,0,1,2,8,0,...,6,1,3,2,0,99,99,0,0,0
1,8,6,4,99,3,0,1,2,8,0,...,1,1,3,2,0,99,99,0,0,0
2,5,3,1,99,7,0,1,0,8,0,...,9,1,2,2,0,99,99,0,0,0
3,3,4,1,99,2,0,1,0,8,0,...,5,1,2,2,0,99,99,0,0,0
4,3,4,3,2,2,0,1,0,8,0,...,5,1,2,2,0,99,99,0,0,0
5,3,3,3,2,2,0,1,0,8,0,...,5,1,2,2,0,99,99,0,0,0
6,2,6,4,99,2,0,2,0,8,1,...,6,0,3,5,0,99,1,0,0,1
7,0,5,4,2,2,0,4,4,8,0,...,5,1,2,3,1,99,99,0,0,0
8,8,7,99,2,7,0,3,3,5,0,...,9,1,0,3,0,0,1,0,0,0
9,7,6,99,2,2,0,3,4,5,0,...,2,1,0,3,0,0,1,0,0,0


Unnamed: 0,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,...,VEH_AGE,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE,HOSPITAL
0,9,9,4.0,,2,0,1,2,8,0,...,6,1,3,2,0,,,0,0,0
1,8,6,4.0,,3,0,1,2,8,0,...,1,1,3,2,0,,,0,0,0
2,5,3,1.0,,7,0,1,0,8,0,...,9,1,2,2,0,,,0,0,0
3,3,4,1.0,,2,0,1,0,8,0,...,5,1,2,2,0,,,0,0,0
4,3,4,3.0,2.0,2,0,1,0,8,0,...,5,1,2,2,0,,,0,0,0
5,3,3,3.0,2.0,2,0,1,0,8,0,...,5,1,2,2,0,,,0,0,0
6,2,6,4.0,,2,0,2,0,8,1,...,6,0,3,5,0,,1.0,0,0,1
7,0,5,4.0,2.0,2,0,4,4,8,0,...,5,1,2,3,1,,,0,0,0
8,8,7,,2.0,7,0,3,3,5,0,...,9,1,0,3,0,0.0,1.0,0,0,0
9,7,6,,2.0,2,0,3,4,5,0,...,2,1,0,3,0,0.0,1.0,0,0,0


../../Big_Files/data_IVEware_1_1_1.txt

Finished Impute_Using_IVEware()

CPU times: user 7.71 s, sys: 686 ms, total: 8.4 s
Wall time: 8.65 s


0
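
The three-part suffix described in the comments of `Impute_Using_IVEware()` can be captured in one small helper. This is a sketch only; `build_filename` and its parameter names are hypothetical, while the folder and the meaning of each suffix are taken from the comments and the printed filename above.

```python
def build_filename(stem, seed, reduced, method, extension,
                   folder='../../Big_Files/'):
    """Compose a path such as '../../Big_Files/data_IVEware_1_1_1.txt'.

    seed:    random seed for Python and Numpy (0 or 1)
    reduced: 0 if dimensionality was not reduced before binning, 1 if it was
    method:  0 for MissForest, 1 for IVEware with R random seed 0
    """
    # The notebook keeps the first two suffixes in filename_prefix.
    filename_prefix = f'_{seed}_{reduced}'
    return f'{folder}{stem}{filename_prefix}_{method}{extension}'
```

For example, `build_filename('data_IVEware', 1, 1, 1, '.txt')` reproduces the path printed by the cell above.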

# Impute using MissForest and Save for Next Step


In [7]:
%%time
def Impute_Using_MissForest():
    print ('Impute_Using_MissForest()')
    data = Get_Data()
    data = data.replace({99:np.nan})
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_MissForest(data)
#    data_Imputed.to_csv('../../Big_Files/CRSS_03_Imputed_by_MF_Data.csv', index=False)


    # The first '_0' or '_1' is for the random seed for Python and Numpy
    # The second '_0' (or '_1') is for the dimensionality not being reduced (being reduced) before binning
    # The third '_0' is for imputation with MissForest.  '_1' will be for IVEware with R random seed 0.
    filename = '../../Big_Files/CRSS_03' + filename_prefix + '_0.csv'
    print (filename)
    print ()
    data_Imputed.to_csv(filename, index=False) 

#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))

    print ('Finished Impute_Using_MissForest()')
    print ()

    return 0

Impute_Using_MissForest()

# Takes 2-3 GB of memory
# CPU times: user 1h 17min 49s, sys: 1min 25s, total: 1h 19min 14s
# Wall time: 1h 19min 35s

Impute_Using_MissForest()
Get_Data
../../Big_Files/CRSS_02_1_1.csv

data.shape =  (802700, 64)

Drop Imputed Columns
data.shape =  (802700, 64)

Total number of NaN
1599835

Finished Get_Data()

Impute_MissForest()
(802700, 64)


Unnamed: 0,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,...,VEH_AGE,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE,HOSPITAL
0,9.0,9.0,4.0,,2.0,0.0,1,2.0,8.0,0.0,...,6.0,1,3.0,2.0,0.0,,,0.0,0,0
1,8.0,6.0,4.0,,3.0,0.0,1,2.0,8.0,0.0,...,1.0,1,3.0,2.0,0.0,,,0.0,0,0
2,5.0,3.0,1.0,,7.0,0.0,1,0.0,8.0,0.0,...,9.0,1,2.0,2.0,0.0,,,0.0,0,0
3,3.0,4.0,1.0,,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,,,0.0,0,0
4,3.0,4.0,3.0,2.0,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,,,0.0,0,0
5,3.0,3.0,3.0,2.0,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,,,0.0,0,0
6,2.0,6.0,4.0,,2.0,0.0,2,0.0,8.0,1.0,...,6.0,0,3.0,5.0,0.0,,1.0,0.0,0,1
7,0.0,5.0,4.0,2.0,2.0,0.0,4,4.0,8.0,0.0,...,5.0,1,2.0,3.0,1.0,,,0.0,0,0
8,8.0,7.0,,2.0,7.0,0.0,3,3.0,5.0,0.0,...,9.0,1,0.0,3.0,0.0,0.0,1.0,0.0,0,0
9,7.0,6.0,,2.0,2.0,0.0,3,4.0,5.0,0.0,...,2.0,1,0.0,3.0,0.0,0.0,1.0,0.0,0,0


categorical features:  ['ACC_TYPE', 'AGE', 'AIR_BAG', 'ALC_STATUS', 'BODY_TYP', 'CARGO_BT', 'DAY_WEEK', 'DEFORMED', 'DR_ZIP', 'EJECTION', 'HARM_EV', 'HIT_RUN', 'HOUR', 'IMPACT1', 'INJ_SEV', 'INT_HWY', 'J_KNIFE', 'LGT_COND', 'MAKE', 'MAK_MOD', 'MAN_COLL', 'MAX_SEV', 'MODEL', 'MONTH', 'M_HARM', 'NUMOCCS', 'NUM_INJ', 'NUM_INJV', 'PCRASH4', 'PCRASH5', 'PERMVIT', 'PER_TYP', 'PJ', 'PSU', 'PVH_INVL', 'P_CRASH1', 'P_CRASH2', 'REGION', 'RELJCT1', 'RELJCT2', 'REL_ROAD', 'REST_MIS', 'REST_USE', 'ROLINLOC', 'ROLLOVER', 'SEAT_POS', 'SEX', 'SPEC_USE', 'SPEEDREL', 'TOWED', 'TOW_VEH', 'TYP_INT', 'URBANICITY', 'VALIGN', 'VEH_AGE', 'VE_TOTAL', 'VPROFILE', 'VSPD_LIM', 'VSURCOND', 'VTRAFCON', 'VTRAFWAY', 'WEATHER', 'WRK_ZONE', 'HOSPITAL']
Start fit_transform()
Start transform()
Iteration  0  of  10 , feature  ACC_TYPE
Iteration  0  of  10 , feature  AGE
Iteration  0  of  10 , feature  AIR_BAG
Iteration  0  of  10 , feature  ALC_STATUS
Iteration  0  of  10 , feature  BODY_TYP
Iteration  0  of  10 , feature

Iteration  3  of  10 , feature  P_CRASH2
Iteration  3  of  10 , feature  RELJCT1
Iteration  3  of  10 , feature  RELJCT2
Iteration  3  of  10 , feature  REL_ROAD
Iteration  3  of  10 , feature  REST_USE
Iteration  3  of  10 , feature  ROLINLOC
Iteration  3  of  10 , feature  SEAT_POS
Iteration  3  of  10 , feature  SEX
Iteration  3  of  10 , feature  SPEC_USE
Iteration  3  of  10 , feature  SPEEDREL
Iteration  3  of  10 , feature  TOWED
Iteration  3  of  10 , feature  TOW_VEH
Iteration  3  of  10 , feature  TYP_INT
Iteration  3  of  10 , feature  VALIGN
Iteration  3  of  10 , feature  VEH_AGE
Iteration  3  of  10 , feature  VPROFILE
Iteration  3  of  10 , feature  VSPD_LIM
Iteration  3  of  10 , feature  VSURCOND
Iteration  3  of  10 , feature  VTRAFCON
Iteration  3  of  10 , feature  VTRAFWAY
Iteration  3  of  10 , feature  WEATHER
compute_gamma_categorical()
len(self.categorical) = 64
all_gamma_cat =  [6876.015625, 1721.203125, 1615.375, 1594.296875]
Iteration  4  of  10 , feature  A

Unnamed: 0,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,EJECTION,...,VEH_AGE,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE,HOSPITAL
0,9.0,9.0,4.0,2.0,2.0,0.0,1,2.0,8.0,0.0,...,6.0,1,3.0,2.0,0.0,0.0,0.0,0.0,0,0
1,8.0,6.0,4.0,2.0,3.0,0.0,1,2.0,8.0,0.0,...,1.0,1,3.0,2.0,0.0,0.0,0.0,0.0,0,0
2,5.0,3.0,1.0,2.0,7.0,0.0,1,0.0,8.0,0.0,...,9.0,1,2.0,2.0,0.0,1.0,0.0,0.0,0,0
3,3.0,4.0,1.0,2.0,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,1.0,0.0,0.0,0,0
4,3.0,4.0,3.0,2.0,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,1.0,0.0,0.0,0,0
5,3.0,3.0,3.0,2.0,2.0,0.0,1,0.0,8.0,0.0,...,5.0,1,2.0,2.0,0.0,1.0,0.0,0.0,0,0
6,2.0,6.0,4.0,2.0,2.0,0.0,2,0.0,8.0,1.0,...,6.0,0,3.0,5.0,0.0,0.0,1.0,0.0,0,1
7,0.0,5.0,4.0,2.0,2.0,0.0,4,4.0,8.0,0.0,...,5.0,1,2.0,3.0,1.0,1.0,0.0,0.0,0,0
8,8.0,7.0,4.0,2.0,7.0,0.0,3,3.0,5.0,0.0,...,9.0,1,0.0,3.0,0.0,0.0,1.0,0.0,0,0
9,7.0,6.0,4.0,2.0,2.0,0.0,3,4.0,5.0,0.0,...,2.0,1,0.0,3.0,0.0,0.0,1.0,0.0,0,0


Finished Impute_MissForest()

../../Big_Files/CRSS_03_1_1_0.csv

Check()
Index(['ACC_TYPE', 'AGE', 'AIR_BAG', 'ALC_STATUS', 'BODY_TYP', 'CARGO_BT',
       'DAY_WEEK', 'DEFORMED', 'DR_ZIP', 'EJECTION', 'HARM_EV', 'HIT_RUN',
       'HOUR', 'IMPACT1', 'INJ_SEV', 'INT_HWY', 'J_KNIFE', 'LGT_COND', 'MAKE',
       'MAK_MOD', 'MAN_COLL', 'MAX_SEV', 'MODEL', 'MONTH', 'M_HARM', 'NUMOCCS',
       'NUM_INJ', 'NUM_INJV', 'PCRASH4', 'PCRASH5', 'PERMVIT', 'PER_TYP', 'PJ',
       'PSU', 'PVH_INVL', 'P_CRASH1', 'P_CRASH2', 'REGION', 'RELJCT1',
       'RELJCT2', 'REL_ROAD', 'REST_MIS', 'REST_USE', 'ROLINLOC', 'ROLLOVER',
       'SEAT_POS', 'SEX', 'SPEC_USE', 'SPEEDREL', 'TOWED', 'TOW_VEH',
       'TYP_INT', 'URBANICITY', 'VALIGN', 'VEH_AGE', 'VE_TOTAL', 'VPROFILE',
       'VSPD_LIM', 'VSURCOND', 'VTRAFCON', 'VTRAFWAY', 'WEATHER', 'WRK_ZONE',
       'HOSPITAL'],
      dtype='object')
[9.0, 8.0, 5.0, 3.0, 2.0, 0.0, 7.0, nan, 6.0, 4.0, 1.0]


[[9.0, 30801, 31358],
 [8.0, 206437, 230173],
 [5.0, 94292, 100590],
 [3.0, 65072, 72917],
 [2.0, 21069, 21802],
 [0.0, 27759, 31336],
 [7.0, 126924, 153341],
 [nan, 0, 0],
 [6.0, 82681, 90268],
 [4.0, 43129, 44378],
 [1.0, 25715, 26537]]


[9.0, 6.0, 3.0, 4.0, 5.0, 7.0, 1.0, 2.0, 8.0, nan, 0.0]


[[9.0, 36859, 36871],
 [6.0, 402658, 432524],
 [3.0, 29902, 29920],
 [4.0, 20657, 20658],
 [5.0, 40729, 40735],
 [7.0, 125355, 125490],
 [1.0, 29420, 30838],
 [2.0, 22889, 23347],
 [8.0, 43629, 43634],
 [nan, 0, 0],
 [0.0, 17579, 18683]]


[4.0, 1.0, 3.0, nan, 0.0, 2.0]


[[4.0, 595378, 644724],
 [1.0, 45696, 46710],
 [3.0, 16772, 16841],
 [nan, 0, 0],
 [0.0, 41608, 42481],
 [2.0, 51326, 51944]]


[nan, 2.0, 0.0, 1.0]


[[nan, 0, 0], [2.0, 655465, 787788], [0.0, 14643, 14662], [1.0, 250, 250]]


[2.0, 3.0, 7.0, 4.0, 1.0, 6.0, 9.0, 8.0, 0.0, 5.0, nan]


[[2.0, 282564, 299292],
 [3.0, 52838, 52841],
 [7.0, 26388, 26392],
 [4.0, 170645, 174859],
 [1.0, 38348, 38353],
 [6.0, 124124, 124302],
 [9.0, 16278, 16278],
 [8.0, 16775, 16786],
 [0.0, 21659, 21659],
 [5.0, 31825, 31938],
 [nan, 0, 0]]


[0.0, 1.0, 2.0, nan]


[[0.0, 762384, 765714], [1.0, 4402, 4404], [2.0, 18241, 32582], [nan, 0, 0]]


[1, 2, 4, 3, 0]


[[1, 110914, 110914],
 [2, 115126, 115126],
 [4, 136970, 136970],
 [3, 238408, 238408],
 [0, 201282, 201282]]


[2.0, 0.0, 4.0, 3.0, nan, 1.0]


[[2.0, 163477, 202488],
 [0.0, 266824, 289480],
 [4.0, 204651, 271029],
 [3.0, 14457, 14527],
 [nan, 0, 0],
 [1.0, 23338, 25176]]


[8.0, 5.0, 4.0, 9.0, 7.0, nan, 2.0, 6.0, 1.0, 0.0, 3.0]


[[8.0, 84289, 88561],
 [5.0, 211789, 235383],
 [4.0, 90819, 92020],
 [9.0, 32233, 33273],
 [7.0, 102896, 113912],
 [nan, 0, 0],
 [2.0, 44370, 44858],
 [6.0, 109844, 112399],
 [1.0, 19411, 19586],
 [0.0, 30566, 33102],
 [3.0, 29526, 29606]]


[0.0, 1.0, nan, 2.0]


[[0.0, 743562, 777468], [1.0, 2758, 2774], [nan, 0, 0], [2.0, 22427, 22458]]


[3.0, 2.0, 0.0, 4.0, 1.0, nan]


[[3.0, 702923, 702994],
 [2.0, 24911, 24948],
 [0.0, 27548, 27561],
 [4.0, 19208, 19209],
 [1.0, 27849, 27988],
 [nan, 0, 0]]


[0.0, 1.0, nan]


[[0.0, 777792, 777810], [1.0, 24890, 24890], [nan, 0, 0]]


[4.0, 6.0, 2.0, 5.0, 0.0, 3.0, 1.0, nan]


[[4.0, 261736, 262448],
 [6.0, 114099, 114601],
 [2.0, 28830, 28834],
 [5.0, 269181, 269599],
 [0.0, 25094, 25096],
 [3.0, 74425, 74427],
 [1.0, 27590, 27695],
 [nan, 0, 0]]


[4.0, 0.0, 1.0, 2.0, 7.0, 5.0, 6.0, 3.0, 8.0, 9.0, nan]


[[4.0, 38883, 39221],
 [0.0, 46527, 46954],
 [1.0, 357264, 364912],
 [2.0, 44192, 44511],
 [7.0, 189288, 189469],
 [5.0, 27331, 27394],
 [6.0, 26654, 26727],
 [3.0, 29867, 29936],
 [8.0, 16441, 16482],
 [9.0, 17068, 17094],
 [nan, 0, 0]]


[3.0, 2.0, 0.0, 1.0, nan]


[[3.0, 565203, 575072],
 [2.0, 120354, 120584],
 [0.0, 39430, 39474],
 [1.0, 67474, 67570],
 [nan, 0, 0]]


[0.0, 1.0, nan]


[[0.0, 715981, 716044], [1.0, 86649, 86656], [nan, 0, 0]]


[0, 2, 1]


[[0, 781759, 781759], [2, 20511, 20511], [1, 430, 430]]


[3.0, 1.0, 2.0, nan, 0.0]


[[3.0, 576924, 578517],
 [1.0, 134512, 134756],
 [2.0, 23647, 23647],
 [nan, 0, 0],
 [0.0, 65396, 65780]]


[2.0, 6.0, 7.0, 1.0, 5.0, nan, 4.0, 9.0, 0.0, 8.0, 3.0]


[[2.0, 112792, 113400],
 [6.0, 277003, 281097],
 [7.0, 76241, 76256],
 [1.0, 22608, 22955],
 [5.0, 117352, 117694],
 [nan, 0, 0],
 [4.0, 47870, 47880],
 [9.0, 18407, 21088],
 [0.0, 16078, 19282],
 [8.0, 37062, 37297],
 [3.0, 65261, 65751]]


[4.0, 6.0, 3.0, 1.0, 8.0, 0.0, 2.0, 9.0, 5.0, 7.0, nan]


[[4.0, 165878, 166211],
 [6.0, 136372, 136488],
 [3.0, 204176, 205406],
 [1.0, 41832, 41832],
 [8.0, 22508, 22559],
 [0.0, 21988, 21988],
 [2.0, 67768, 67770],
 [9.0, 35429, 35429],
 [5.0, 78366, 78369],
 [7.0, 26648, 26648],
 [nan, 0, 0]]


[4.0, 0.0, 1.0, 3.0, 2.0, nan]


[[4.0, 102961, 103441],
 [0.0, 34060, 34069],
 [1.0, 114844, 114904],
 [3.0, 315553, 316366],
 [2.0, 232793, 233920],
 [nan, 0, 0]]


[3.0, 2.0, 0.0, 1.0, nan]


[[3.0, 395534, 397467],
 [2.0, 193231, 193231],
 [0.0, 96097, 96098],
 [1.0, 115904, 115904],
 [nan, 0, 0]]


[3.0, 6.0, 7.0, 5.0, 2.0, 1.0, 4.0, 8.0, 9.0, 0.0, nan]


[[3.0, 194410, 205086],
 [6.0, 57842, 57897],
 [7.0, 94828, 95052],
 [5.0, 106717, 106961],
 [2.0, 139600, 143316],
 [1.0, 20631, 20886],
 [4.0, 83154, 86262],
 [8.0, 34055, 34131],
 [9.0, 31444, 31445],
 [0.0, 21664, 21664],
 [nan, 0, 0]]


[0, 1, 2, 3, 4, 5]


[[0, 177800, 177800],
 [1, 123494, 123494],
 [2, 134577, 134577],
 [3, 145610, 145610],
 [4, 150456, 150456],
 [5, 70763, 70763]]


[2.0, 0.0, 4.0, 1.0, 3.0, nan]


[[2.0, 689764, 689854],
 [0.0, 35444, 35454],
 [4.0, 20106, 20108],
 [1.0, 42225, 42401],
 [3.0, 14872, 14883],
 [nan, 0, 0]]


[0.0, 2.0, 3.0, 1.0, 4.0, nan]


[[0.0, 428660, 443089],
 [2.0, 81781, 82011],
 [3.0, 46859, 46972],
 [1.0, 195875, 197392],
 [4.0, 33169, 33236],
 [nan, 0, 0]]


[0.0, 4.0, 1.0, 2.0, 3.0, 5.0, nan]


[[0.0, 395550, 397483],
 [4.0, 19303, 19303],
 [1.0, 224915, 224916],
 [2.0, 99933, 99933],
 [3.0, 42718, 42718],
 [5.0, 18347, 18347],
 [nan, 0, 0]]


[0.0, 1.0, 3.0, 2.0, nan]


[[0.0, 516528, 524946],
 [1.0, 194943, 195079],
 [3.0, 28089, 28089],
 [2.0, 54584, 54586],
 [nan, 0, 0]]


[2.0, 0.0, nan, 1.0]


[[2.0, 743219, 767803], [0.0, 12689, 16185], [nan, 0, 0], [1.0, 16207, 18712]]


[2.0, 1.0, 0.0, nan]


[[2.0, 79101, 79232], [1.0, 633795, 635716], [0.0, 87552, 87752], [nan, 0, 0]]


[5, 3, 0, 4, 1, 2]


[[5, 268478, 268478],
 [3, 313905, 313905],
 [0, 66121, 66121],
 [4, 115648, 115648],
 [1, 16509, 16509],
 [2, 22039, 22039]]


[1.0, 0.0, nan]


[[1.0, 587731, 587858], [0.0, 214839, 214842], [nan, 0, 0]]


[9, 5, 0, 1, 4, 2, 6, 3, 7, 8]


[[9, 32598, 32598],
 [5, 140700, 140700],
 [0, 32531, 32531],
 [1, 30131, 30131],
 [4, 116361, 116361],
 [2, 34286, 34286],
 [6, 110607, 110607],
 [3, 112608, 112608],
 [7, 106310, 106310],
 [8, 86568, 86568]]


[9, 2, 3, 4, 6, 5, 8, 0, 1, 7]


[[9, 20484, 20484],
 [2, 69248, 69248],
 [3, 197221, 197221],
 [4, 139149, 139149],
 [6, 71467, 71467],
 [5, 97129, 97129],
 [8, 70128, 70128],
 [0, 29027, 29027],
 [1, 56499, 56499],
 [7, 52348, 52348]]


[0, 1, 2]


[[0, 785238, 785238], [1, 14278, 14278], [2, 3184, 3184]]


[5.0, 1.0, 2.0, 4.0, nan, 0.0, 3.0, 6.0]


[[5.0, 40511, 40796],
 [1.0, 413264, 418832],
 [2.0, 80341, 81047],
 [4.0, 129042, 130128],
 [nan, 0, 0],
 [0.0, 45520, 45878],
 [3.0, 49470, 49575],
 [6.0, 36315, 36444]]


[8.0, 6.0, 3.0, 0.0, 9.0, 7.0, 5.0, 2.0, 4.0, 1.0, nan]


[[8.0, 293670, 302652],
 [6.0, 106561, 107350],
 [3.0, 48798, 49388],
 [0.0, 21502, 21839],
 [9.0, 103939, 104772],
 [7.0, 47690, 48651],
 [5.0, 71247, 71627],
 [2.0, 23047, 23895],
 [4.0, 41268, 41295],
 [1.0, 31043, 31231],
 [nan, 0, 0]]


[3, 2, 1, 0]


[[3, 120799, 120799],
 [2, 142507, 142507],
 [1, 451311, 451311],
 [0, 88083, 88083]]


[0.0, 1.0, nan]


[[0.0, 615492, 760794], [1.0, 38672, 41906], [nan, 0, 0]]


[1.0, 0.0, 3.0, 2.0, nan]


[[1.0, 280013, 318298],
 [0.0, 226950, 226973],
 [3.0, 185240, 185265],
 [2.0, 70237, 72164],
 [nan, 0, 0]]


[2.0, 0.0, 1.0, nan]


[[2.0, 723058, 723131], [0.0, 60746, 60787], [1.0, 18768, 18782], [nan, 0, 0]]


[2, 1, 0]


[[2, 734023, 734023], [1, 6870, 6870], [0, 61807, 61807]]


[1.0, 0.0, 3.0, nan, 2.0]


[[1.0, 659701, 679584],
 [0.0, 45223, 78971],
 [3.0, 23782, 23940],
 [nan, 0, 0],
 [2.0, 20204, 20205]]


[2.0, 1.0, 0.0, nan]


[[2.0, 774586, 774586], [1.0, 6596, 6750], [0.0, 21109, 21364], [nan, 0, 0]]


[2, 0, 1]


[[2, 774586, 774586], [0, 13800, 13800], [1, 14314, 14314]]


[2.0, 0.0, 3.0, 1.0, nan]


[[2.0, 587598, 587728],
 [0.0, 117231, 125851],
 [3.0, 53461, 56330],
 [1.0, 32586, 32791],
 [nan, 0, 0]]


[1.0, 0.0, nan, 2.0]


[[1.0, 428354, 439915], [0.0, 356326, 362783], [nan, 0, 0], [2.0, 2, 2]]


[1.0, 2.0, 0.0, nan]


[[1.0, 788331, 795433], [2.0, 3391, 3411], [0.0, 3856, 3856], [nan, 0, 0]]


[2.0, 1.0, nan, 0.0]


[[2.0, 742749, 753337], [1.0, 36619, 36688], [nan, 0, 0], [0.0, 12675, 12675]]


[4.0, 0.0, 3.0, nan, 1.0, 2.0]


[[4.0, 430720, 466808],
 [0.0, 223212, 226165],
 [3.0, 34150, 34643],
 [nan, 0, 0],
 [1.0, 25348, 25348],
 [2.0, 49360, 49736]]


[0.0, 2.0, 1.0, nan]


[[0.0, 781136, 781332], [2.0, 20652, 20652], [1.0, 716, 716], [nan, 0, 0]]


[0.0, nan, 2.0, 1.0]


[[0.0, 408835, 415382],
 [nan, 0, 0],
 [2.0, 230256, 282458],
 [1.0, 92939, 104860]]


[0, 1]


[[0, 181758, 181758], [1, 620942, 620942]]


[2.0, nan, 1.0, 0.0, 3.0]


[[2.0, 678108, 719317],
 [nan, 0, 0],
 [1.0, 38391, 38416],
 [0.0, 25978, 25979],
 [3.0, 18988, 18988]]


[6.0, 1.0, 9.0, 5.0, 2.0, 0.0, 4.0, 3.0, 7.0, nan, 8.0]


[[6.0, 58041, 58199],
 [1.0, 251976, 259703],
 [9.0, 58931, 59521],
 [5.0, 64698, 64838],
 [2.0, 82224, 82366],
 [0.0, 46126, 46141],
 [4.0, 66100, 66198],
 [3.0, 103794, 104578],
 [7.0, 23841, 23844],
 [nan, 0, 0],
 [8.0, 37275, 37312]]


[1, 0, 3, 2]


[[1, 585044, 585044], [0, 93192, 93192], [3, 31598, 31598], [2, 92866, 92866]]


[3.0, 2.0, 0.0, nan, 4.0, 1.0]


[[3.0, 578378, 684174],
 [2.0, 57261, 57412],
 [0.0, 19585, 19585],
 [nan, 0, 0],
 [4.0, 18988, 18988],
 [1.0, 22541, 22541]]


[2.0, 5.0, 3.0, 4.0, 6.0, 1.0, nan, 7.0, 0.0, 8.0, 9.0]


[[2.0, 45401, 45870],
 [5.0, 25534, 25957],
 [3.0, 210720, 281436],
 [4.0, 155749, 164094],
 [6.0, 79711, 84955],
 [1.0, 68774, 80307],
 [nan, 0, 0],
 [7.0, 14136, 14522],
 [0.0, 19237, 19237],
 [8.0, 41069, 41925],
 [9.0, 41224, 44397]]


[0.0, 1.0, nan, 2.0]


[[0.0, 633413, 663569],
 [1.0, 110557, 112100],
 [nan, 0, 0],
 [2.0, 26988, 27031]]


[nan, 0.0, 1.0, 2.0]


[[nan, 0, 0],
 [0.0, 463988, 520559],
 [1.0, 192172, 202791],
 [2.0, 77793, 79350]]


[nan, 1.0, 0.0, 2.0, 3.0, 4.0, 5.0]


[[nan, 0, 0],
 [1.0, 156860, 169766],
 [0.0, 303273, 407390],
 [2.0, 152529, 159592],
 [3.0, 18348, 18855],
 [4.0, 28072, 28109],
 [5.0, 18988, 18988]]


[0.0, nan, 1.0, 2.0]


[[0.0, 576937, 601935],
 [nan, 0, 0],
 [1.0, 70198, 71328],
 [2.0, 128840, 129437]]


[0, 1, 2]


[[0, 788008, 788008], [1, 7959, 7959], [2, 6733, 6733]]


[0, 1]


[[0, 676444, 676444], [1, 126256, 126256]]


Finished Check()

Finished Impute_Using_MissForest()

CPU times: user 1h 43min 19s, sys: 2min 44s, total: 1h 46min 4s
Wall time: 1h 48min 17s


0
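
The triples printed by `Check()` read as [value, count before imputation, count after imputation]. In the block beginning `[[4.0, 595378, 644724], ...`, for instance, the after-counts sum to 802,700, the number of rows. The implementation of `Check()` is not shown in this section, so the following is only a sketch of the same kind of sanity check on a single column, using hypothetical function names.

```python
import math
from collections import Counter

def is_missing(v):
    """True for a float NaN, the notebook's marker for a missing value."""
    return isinstance(v, float) and math.isnan(v)

def value_counts_before_after(before, after):
    """Return ([value, count_before, count_after] rows, (missing_before, missing_after))."""
    counts_before = Counter(v for v in before if not is_missing(v))
    counts_after = Counter(v for v in after if not is_missing(v))
    values = sorted(set(counts_before) | set(counts_after))
    table = [[v, counts_before[v], counts_after[v]] for v in values]
    missing = (sum(is_missing(v) for v in before),
               sum(is_missing(v) for v in after))
    return table, missing

def imputation_is_consistent(before, after):
    """Check that imputation filled every gap and changed nothing else."""
    table, (_, missing_after) = value_counts_before_after(before, after)
    no_missing_left = missing_after == 0
    counts_only_grow = all(c_after >= c_before for _, c_before, c_after in table)
    total_preserved = sum(c_after for _, _, c_after in table) == len(before)
    observed_unchanged = all(is_missing(b) or b == a for b, a in zip(before, after))
    return no_missing_left and counts_only_grow and total_preserved and observed_unchanged
```

A column passes when no NaN remains, every value's count stays the same or grows, the after-counts add up to the number of rows, and every originally observed entry is unchanged.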