In [None]:
%%latex
\tableofcontents

# readme
- Most of our other Jupyter Notebooks have a main() function at the bottom that runs everything.  
- This notebook is structured differently, with several functions that run in sequence.  
- The reason for the difference is that part of the work has to be done outside this notebook.  
- The IVEware imputation software is available in several languages, but not Python.  We ran it in R using srclib.  
- This notebook prepares the data for the Mode, Random Forest, and IVEware imputations, and does the first two.  Then the user must separately run the IVEware software.  Finally, this notebook pulls in those results and compares the three methods.  

# Methods

- We have the discretized CRSS dataset in '../../Big_Files/CRSS_Binned_Data.csv'
- MissForest is a round-robin imputation method most commonly implemented in R, generally considered one of the best imputation methods.  It has several Python implementations.
- The most current and widely referenced Python implementation we found, at https://pypi.org/project/MissForest/ , was not appropriate for our work because all of our data is categorical.  The MissForest algorithm starts from some imputed state, and we wanted to start from imputation by mode.  That implementation offered only imputation to mean or median, which suit continuous variables but not categorical ones, so we wrote our own implementation.  
- We compare here four methods:
    - Round-Robin Random Forest 
        - Our own implementation of Round Robin, using scikit-learn's random forest
        - Using imputation by mode as the starting point
    - Imputation by mode
    - Random Imputation
    - IVEware, using the hyperparameters in the CRSS Imputation report
- To compare, we followed the example for MissForest.
    - We dropped all samples with a missing value so that we would have ground truth, going from 817,623 samples to 232,333, and stored the result in a Pandas dataframe, data_Ground_Truth
    - We erased ~15% of the values in each feature to make data_NaN
    - We used each imputation method to impute the missing values.
    - To compare methods, we counted:
        - For each method, what percentage of imputed values did not match ground truth (28-44%)
        - For each pair of methods, which method did a better job on how many features
        - For each pair of methods, how many values are different
- Our round-robin method
    - In data_NaN, change every 'Unknown' value (coded as 99 in the binned data) to np.nan.
    - In each feature, count the number of unknown samples.
    - In another copy, data_Mode, impute by mode in all of the features.
    - Starting with the feature with the fewest (nonzero) missing samples:
        - Copy that feature from data_NaN into data_Mode, so that only that feature has missing values.
        - Separate the dataframe into two, one with known values in the target variable (X) and one with unknown values (Z).
        - From the dataframe with known values (X), separate out the target variable (call it 'y')
        - Using Random Forest, build a model that maps X to y.  
        - Use the model to impute the missing values
    - At each iteration we replace the mode-imputed values with RF-imputed values.
- Our Random Imputation method
    - We did not choose randomly from the unique values in the feature, because some values may be much more common than others.  We wanted (approximately) the same distribution of values.
    - We started with 232,333 samples with 67 features.
    - We erased values in each feature at a target rate of 15%, i.e. about 34,850 of the 232,333 values.  Because the row indices are drawn with replacement, duplicate draws make the number of distinct erased values somewhat smaller (the Total NaN counts reported below average roughly 32,400 per feature).  The exact number erased from each feature is printed out when the code runs.
    - For each feature:
        - Create a temporary copy of the feature, which will have 232,333 samples, a little under 15% of which are NaN.
        - Drop the NaN samples in the temp feature, leaving about 200,000 samples.
        - Resample the temp feature to have 232,333 samples.  The resampling will change the order of the values but keep about the same distribution.
        - In the original feature, replace the NaN values with the non-NaN corresponding values in the temporary feature.
- The IVEware implementation is available on several platforms, but Python is not one of them.  We run it in R outside this notebook.  Be aware that the random selection of values to erase is different for each run, so the IVEware imputation must be run anew each time. 
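
The distribution-preserving random imputation described in the list above can be sketched as follows. This is a simplified illustration rather than the notebook's `Impute_Randomly` code: it draws one replacement per missing entry directly from the observed values, and the helper name `impute_from_observed`, the toy series, and the seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def impute_from_observed(column: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Fill NaN entries by drawing (with replacement) from the column's observed values."""
    observed = column.dropna().to_numpy()
    missing = column.isna().to_numpy()
    filled = column.copy()
    # One draw per missing entry; frequent values are drawn proportionally more often.
    filled[missing] = rng.choice(observed, size=int(missing.sum()), replace=True)
    return filled

rng = np.random.default_rng(0)  # seed chosen only for this illustration
toy = pd.Series([0, 0, 0, 1, np.nan, 0, np.nan, 1, 0, 0])
imputed = impute_from_observed(toy, rng)
print(imputed.tolist())
```

Drawing from the observed values, rather than uniformly from the set of unique values, is what keeps the imputed distribution close to the original one.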

- Once we had analyzed the results and decided that the Random Forest method is best for our work, we implemented it and saved the results to CRSS_Imputed_Data.csv.
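
A minimal sketch of the round-robin procedure described above, assuming integer-coded categorical features with NaN marking the gaps. It is not the notebook's `Impute_Round_Robin` function: the name `round_robin_rf`, the toy columns `a`, `b`, `c`, and the forest settings (`n_estimators=20`, default depth) are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def round_robin_rf(data_nan: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Mode-initialised, single-pass round-robin imputation of categorical codes."""
    n_missing = data_nan.isna().sum()
    # Features that need imputing, fewest missing values first.
    order = n_missing[n_missing > 0].sort_values().index
    # Starting state: every gap filled with its feature's mode.
    imputed = data_nan.fillna(data_nan.mode().iloc[0])
    for feature in order:
        unknown = data_nan[feature].isna()
        predictors = imputed.drop(columns=feature)
        clf = RandomForestClassifier(n_estimators=20, random_state=seed)
        clf.fit(predictors[~unknown], data_nan.loc[~unknown, feature])
        # Overwrite the mode-imputed entries with the forest's predictions.
        imputed.loc[unknown, feature] = clf.predict(predictors[unknown])
    return imputed

# Toy data: "b" is fully determined by "a"; "c" is unrelated noise.
rng = np.random.default_rng(0)
truth = pd.DataFrame({"a": rng.integers(0, 3, 200)})
truth["b"] = (truth["a"] + 1) % 3
truth["c"] = rng.integers(0, 2, 200)
data_nan = truth.astype(float)
data_nan.loc[rng.choice(200, 30, replace=False), "b"] = np.nan
data_nan.loc[rng.choice(200, 20, replace=False), "c"] = np.nan

result = round_robin_rf(data_nan)
print(result.isna().sum().sum(), "missing values remain")
```

Each feature's model is trained on predictors that are either observed, already RF-imputed (features earlier in the order), or still mode-imputed (features later in the order), which is why the order from fewest to most missing values matters.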

## What is going on with IVEware using "seed 0;" ?
- When we set the random seed to 0, the accuracy of IVEware jumps from about 70% to about 80%, from slightly worse than Random Forest to MUCH better.  WHAT ???

- These runs use the same random seed for Python and NumPy, and the five multicollinear features are used in the imputation but dropped for the evaluation.  

- Having the same Python and NumPy random seed means that the input datasets for the IVEware imputation have the same samples with the same missing feature values.  

- "seed 0;" in IVEware_CRSS_Imputation.xml


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  438,072  |  21.84 % | 

- "seed 1;"
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  592,313  |  29.52 % | 

- "seed 2;"
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  568,719  |  28.35 % | 
    
    
<br><br>
- Found what was going on. "seed 1;" in IVEware is setting the random seed in R, but "seed 0;" is something different.
- Cite IVEware_User_Guide, page 17

"SEED number;

Specifies a seed for the random draws from the posterior predictive distribution. Number should be greater than zero. A zero seed will result in no perturbations of the predicted values or the regression coefficients. If the SEED keyword is missing from the setup file then the seed will be determined by your computer’s internal clock."

- set.seed(int) in R does not have this behavior at int=0.  I tried set.seed(0) in R and it worked just fine.  
- SAS requires that the random seed be a positive integer, and SAS is one of the implementations of IVEware, so that may be why the IVEware authors thought to implement this functionality for their seed.
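
One possible reading of why an unperturbed imputation scores better on our match-rate metric, not verified against IVEware's internals: if "no perturbations" means that the most likely category is imputed instead of a random draw from the predictive distribution, the expected match rate can only go up. The sketch below works this out for a single missing value; the function name and the three-category distribution are hypothetical.

```python
def expected_match_rates(probs):
    """Return the expected match rate when imputing the most likely category,
    and when drawing the imputed category at random from the same distribution."""
    deterministic = max(probs)               # always impute the most likely category
    random_draw = sum(p * p for p in probs)  # truth and draw both follow probs
    return deterministic, random_draw

# Hypothetical predictive distribution over three categories for one missing value.
det, rnd = expected_match_rates([0.7, 0.2, 0.1])
print(f"most likely category: {det:.2f}, random draw: {rnd:.2f}")
```

Since the sum of squared probabilities never exceeds the largest probability, the deterministic choice is never worse on this metric. The usual trade-off is that unperturbed imputations understate the uncertainty of the imputed values, which matters for variance estimates but not for the match rate used here.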

- According to this ~2017 scraping of GitHub Python code to count the choices of random seeds,
    - https://www.kaggle.com/code/residentmario/kernel16e284dcb7
    - 0 is the most common (19%)
    - 1 and 42 are next (9% and 4%, respectively)
    
- According to this 2014 scraping of 100 top R repositories owned by 27 people, 
    - https://www.r-bloggers.com/2014/03/what-are-the-most-common-rng-seeds-used-in-r-scripts-on-github/
    - 1 is by far the most common (60 examples)
    - 123 is next (about 25)
    - 0 is not on the list
    
### Is this just an anomaly, or might "seed 0;" be useful?

- Test Method
    - Test with all 67 features, not dropping five multicollinear features
    - We have results with seeds 1 and 42
    - Test with seed 0 in IVEware, Python, and NumPy

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,167,826  | 100% | 
    | RF |  591,364  |  27.28 % | 
    | Mode |  739,696  |  34.12 % | 
    | Random |  971,759  |  44.83 % | 
    | IVEware |  447,881  |  20.66 % | 
    
    <br><br>
    - Test with seed 0 in IVEware but seed 42 in Python and NumPy in the Binning and Imputation

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  587,221  |  27.07 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  445,195  |  20.53 % | 
    
- Another test method
    - Randomly sample from 67 to 40 features and test again
    - Note that dropping features will increase the number of samples that have no missing values, so data_Ground_Truth and data_NaN will have fewer features but more samples, so having about the same number of total missing values over the 40 features is not a problem.
    - Do it twice with two random seeds.  
    - Reusing the same random seed for Python and NumPy keeps the following fixed, while a different seed changes them:
        - Which features get dropped
        - Which ~15% of the values get erased to make data_NaN for testing the imputation
    - Seed 0 in Python and NumPy, seed 0 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  556,618  |  26.87 % | 

    - Seed 0 in Python and NumPy, seed 1 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  738,201  |  35.63 % | 
    
    
    - Seed 1 in Python and NumPy, seed 0 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  370,546  |  19.9 % | 

    - Seed 1 in Python and NumPy, seed 1 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  486,820  |  26.15 % | 
    
    - Analysis
        - Seed 1 (compared with seed 0) for Python and NumPy appears to have chosen features that are easier to impute
        - Within each seed for Python and NumPy, choosing seed 0 for IVEware gave much better results.  
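
The feature-subsampling step of this second test can be sketched as below. The helper name `subsample_features` and the toy column names are hypothetical, and passing `random_state` explicitly is an assumption made for the illustration; the notebook relies on the global NumPy seed instead.

```python
import numpy as np
import pandas as pd

def subsample_features(data: pd.DataFrame, n_features: int, seed: int) -> pd.DataFrame:
    """Keep a reproducible random subset of the columns."""
    return data.sample(n=n_features, axis="columns", random_state=seed)

# Toy stand-in for the binned data: 67 columns named f00..f66.
toy = pd.DataFrame(
    np.zeros((3, 67), dtype=int),
    columns=[f"f{i:02d}" for i in range(67)],
)
subset_a = subsample_features(toy, 40, seed=0)
subset_b = subsample_features(toy, 40, seed=0)  # same seed, same 40 features
subset_c = subsample_features(toy, 40, seed=1)  # different seed, different draw
print(subset_a.shape, list(subset_a.columns) == list(subset_b.columns))
```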
    
### Conclusion
- Setting the IVEware seed to zero is not recommended in the manual, and we think it shouldn't work well, but it works dramatically well with our test methods.  
- Use two sets of data from here on, one imputed with Random Forest and another imputed with IVEware with random seed zero.  See which gives best results at the end.  

# Results of Comparison of Four Imputation Methods

- We start with the binned (discretized) data, CRSS_Binned_Data.csv, with 817,623 samples in 67 features.
<br><br>
- Dropping any sample with a missing value, we have 232,333 samples of Ground Truth.
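
The "Total NaN" counts in the tables below are close to 14% of all values rather than 15%. A short worked calculation suggests why, assuming the erasure step draws row indices with replacement (NumPy's default for `np.random.choice`, and what the erasure code in this notebook does); the variable names are ours.

```python
n_rows, n_features, frac = 232_333, 67, 0.15

draws = int(n_rows * frac)             # draws per feature
p_hit = 1 - (1 - 1 / n_rows) ** draws  # chance that a given row is drawn at least once
expected_total = n_rows * n_features * p_hit

print(f"{draws:,} draws per feature")
print(f"{p_hit:.2%} of each feature erased on average")
print(f"about {expected_total:,.0f} erased values over {n_features} features")
```

The result, roughly 2.17 million, is in line with the totals reported for the 67-feature runs.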

<br><br>
- First run with random seed  42 in Python, NumPy, and R:
    <br><br>
    - Samples Incorrectly Imputed
    
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  589,714  |  27.19 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  622,622  |  28.71 % | 

    <br><br>
    - Comparison of number of errors in the 67 features.  For instance, comparing Random Forest to Mode, in 50 features RF had fewer errors than Mode, in 17 features the two methods had the same number of errors, and in no features did RF have more errors than Mode.  

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  50  |  17  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  34  |  0  |  33  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  24  |  0  |  43  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,168,989  | 100% |
    | RF Different from Mode |  260,334  |  12.0 % |
    | RF Different from Random |  805,376  |  37.13 % |
    | RF Different from IVEware |  649,467  |  29.94 % |
    | Mode Different from Random |  739,564  |  34.1 % |
    | Mode Different from IVEware |  780,385  |  35.98 % |
    | Random Different from IVEware |  1,003,065  |  46.25 % |    
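
The three kinds of counts reported above can be computed along these lines. This is a sketch on a four-row toy example, not the notebook's `Compare_Imputation_Methods_Part_2`; the helper names and the toy frames are hypothetical. The notebook compares whole columns, which gives the same counts as long as the non-erased cells equal the ground truth.

```python
import pandas as pd

def errors_per_feature(truth, imputed, erased):
    """Count, per feature, the erased cells whose imputed value differs from truth."""
    return ((truth != imputed) & erased).sum()

def tally(errors_a, errors_b):
    """Number of features where method A has fewer, equal, or more errors than B."""
    return {
        "Fewer": int((errors_a < errors_b).sum()),
        "Equal": int((errors_a == errors_b).sum()),
        "More": int((errors_a > errors_b).sum()),
    }

truth = pd.DataFrame({"x": [1, 1, 2, 2], "y": [0, 1, 0, 1], "z": [3, 3, 3, 4]})
erased = pd.DataFrame({"x": [True, False, True, False],
                       "y": [False, True, True, False],
                       "z": [False, False, False, True]})
method_a = pd.DataFrame({"x": [1, 1, 2, 2], "y": [0, 0, 0, 1], "z": [3, 3, 3, 3]})
method_b = pd.DataFrame({"x": [2, 1, 1, 2], "y": [0, 0, 0, 1], "z": [3, 3, 3, 4]})

err_a = errors_per_feature(truth, method_a, erased)
err_b = errors_per_feature(truth, method_b, erased)
comparison = tally(err_a, err_b)
# Erased cells where the two methods imputed different values.
disagreements = int(((method_a != method_b) & erased).sum().sum())
print(comparison, disagreements)
```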
    
<br><br>
- Second Run, Same random seed (42) to make sure the random seed is implemented correctly.  Same results. 

    <br><br>
     - Percentage of Samples Incorrectly Imputed
     

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  589,714  |  27.19 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  622,622  |  28.71 % | 

    <br><br>
     - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  50  |  17  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  34  |  0  |  33  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  24  |  0  |  43  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,168,989  | 100% |
    | RF Different from Mode |  260,334  |  12.0 % |
    | RF Different from Random |  805,376  |  37.13 % |
    | RF Different from IVEware |  649,467  |  29.94 % |
    | Mode Different from Random |  739,564  |  34.1 % |
    | Mode Different from IVEware |  780,385  |  35.98 % |
    | Random Different from IVEware |  1,003,065  |  46.25 % |

<br><br>
- Third run, with random seed 1:

    <br><br>
    - Samples Incorrectly Imputed by Method

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,167,935  | 100% | 
    | RF |  587,894  |  27.12 % | 
    | Mode |  738,676  |  34.07 % | 
    | Random |  970,865  |  44.78 % | 
    | IVEware |  595,206  |  27.45 % | 

    <br><br>
    - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to Mode |  51  |  16  |  0  | 67  |
    | Compare RF to Random |  67  |  0  |  0  | 67  |
    | Compare RF to IVEware |  35  |  0  |  32  | 67  |
    | Compare Mode to Random |  67  |  0  |  0  | 67  |
    | Compare Mode to IVEware |  22  |  0  |  45  | 67  |
    | Compare Random to IVEware |  5  |  0  |  62  | 67  |

    <br><br>
    - Number of NaN Imputed Differently by Pairs of Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  2,167,935  | 100% |
    | RF Different from Mode |  252,911  |  11.67 % |
    | RF Different from Random |  802,733  |  37.03 % |
    | RF Different from IVEware |  620,312  |  28.61 % |
    | Mode Different from Random |  738,742  |  34.08 % |
    | Mode Different from IVEware |  751,752  |  34.68 % |
    | Random Different from IVEware |  978,679  |  45.14 % |








## Drop Multicollinear Features before Imputing?  Compare two methods
- First Method
    - After Binning, reduce dimensionality
        - Removes MAX_VSEV, VE_FORMS, VTCONT_F, MAX_SEV, NUM_INJV
        - Reduces from 67 to 62 features
    - Impute
- Second Method
    - Impute with all 67 features
    - Before evaluating the imputation, remove the five features and only evaluate the results on the 62 features used in the comparison above
- We used random seed 42 for both methods
- First Method Results

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,463  | 100% | 
    | RF |  569,509  |  28.37 % | 
    | Mode |  681,753  |  33.96 % | 
    | Random |  889,794  |  44.32 % | 
    | IVEware |  606,632  |  30.22 % | 
    
- Second Method Results


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,235  | 100% | 
    | RF |  558,936  |  27.85 % | 
    | Mode |  681,996  |  33.98 % | 
    | Random |  888,845  |  44.28 % | 
    | IVEware |  606,062  |  30.19 % | 


### Analysis
- Mode was the same, as it should be.
- Random was slightly different, perhaps because the features were in a different order?
- IVEware was not significantly different in the two methods.
- Random Forest was slightly but significantly better (27.85% vs. 28.37% errors, a difference of 0.52 percentage points) with the second method, not removing the multicollinear features before imputing, which is surprising.  

### Conclusion
- Run again with different random seed = 1

### Second Round Results
- First Method

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,643  | 100% | 
    | RF |  568,909  |  28.35 % | 
    | Mode |  681,061  |  33.94 % | 
    | Random |  889,048  |  44.31 % | 
    | IVEware |  592,233  |  29.51 % | 


- Second Method


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,005,955  | 100% | 
    | RF |  558,742  |  27.85 % | 
    | Mode |  680,715  |  33.93 % | 
    | Random |  887,944  |  44.27 % | 
    | IVEware |  564,254  |  28.13 % | 
    
### Analysis

- Again, the second method, leaving in multicollinear features, is better for both Random Forest and IVEware
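
The evaluation used for the second method, imputing with every feature but scoring only the retained ones, can be sketched as follows. The five excluded names come from the list above; the function `error_rate`, the toy frames, and the columns `F1` and `F2` are hypothetical.

```python
import pandas as pd

MULTICOLLINEAR = ["MAX_VSEV", "VE_FORMS", "VTCONT_F", "MAX_SEV", "NUM_INJV"]

def error_rate(truth, imputed, erased, exclude=()):
    """Share of erased cells imputed incorrectly, ignoring the excluded features."""
    excluded = set(exclude)
    keep = [c for c in truth.columns if c not in excluded]
    wrong = ((truth[keep] != imputed[keep]) & erased[keep]).sum().sum()
    return wrong / erased[keep].sum().sum()

truth = pd.DataFrame({"MAX_SEV": [1, 2, 3, 4], "F1": [0, 0, 1, 1], "F2": [5, 5, 6, 6]})
erased = pd.DataFrame({"MAX_SEV": [True, True, False, False],
                       "F1": [True, False, True, False],
                       "F2": [False, False, True, True]})
imputed = pd.DataFrame({"MAX_SEV": [1, 2, 3, 4], "F1": [0, 0, 0, 1], "F2": [5, 5, 5, 6]})

rate_all = error_rate(truth, imputed, erased)                      # every feature scored
rate_kept = error_rate(truth, imputed, erased, exclude=MULTICOLLINEAR)
print(f"all features: {rate_all:.2%}, retained features only: {rate_kept:.2%}")
```

Scoring both methods on the same retained features is what makes their error percentages comparable.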

### Conclusion
- When we impute 

## Discussion

- Random imputation is clearly worse than Mode and RF on every feature.
- Random is overall worse than IVEware, but on one of our runs there are five features on which Random is better than IVEware.
- Random Forest is as good as or better than Mode on every feature, which is not surprising, as RF starts at Mode and improves on it.  
- Random Forest is better than IVEware on slightly more than half of the features (34 or 35 of 67), and slightly better in the count of missing samples correctly imputed.
- IVEware and Mode are comparable in the number of features, but IVEware is much better in the count of missing samples correctly imputed.
- Random Forest and Mode make largely the same mistakes: their imputations differ on only about 12% of the erased values.  
- IVEware makes different mistakes from Random Forest and Mode.

## Conclusion

- Use Random Forest

## Opportunities for Future Research
(or, "Things we didn't do")

- Which features are better imputed by Random imputation than by IVEware, and why?
- Which features are better imputed by IVEware than by Random Forest, and why?
- Would a different mix of features make IVEware perform better than Random Forest?
- Is it okay to use one imputation method for some features and another method for other features?

# Setup
## Import Libraries

In [None]:
import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)


import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import sklearn
print ('SciKit-Learn version: {}'.format(sklearn.__version__))
from sklearn.model_selection import train_test_split

import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

from missforest.missforest import MissForest

# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
import random
random_seed = 1
np.random.seed(random_seed) # NumPy
random.seed(random_seed) # Python
#tf.set_random_seed(random_seed) # Tensorflow

from IPython.display import Audio
sound_file = './beep.wav'

import warnings
warnings.filterwarnings('ignore')

print ('Finished Importing Libraries')


## Get Data

This notebook pulls in the saved output of Ambulance_Dispatch_2024_02_Binning.

In [None]:
def Get_Data():
    print ('Get_Data')
    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data_Seed_42.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Reduced_Dimensionality_Data.csv', low_memory=False)
    print ('data.shape = ', data.shape)
    print ()

    # We already dropped the imputed columns in the Binning stage
    print ('Drop Imputed Columns')
    for feature in list(data.columns):  # iterate over a copy, since columns are dropped inside the loop
        if '_IM' in feature:
            print (feature)
            data.drop(columns=feature, inplace=True)
 

    # Method for dropping from 67 to 40 features 
    # to test whether it was just this particular mix of features 
    # that made the IVEware behave strangely well with random seed of zero.
#    print ('data.shape = ', data.shape)
#    data = data.sample(n=40, axis='columns')
    
    print ('data.shape = ', data.shape)
    print ()
    
    print ("Remaining Features:")
    Features = sorted(list(data.columns))
    for feature in Features:
        print ("    ",feature)
    
    return data

In [None]:
data = Get_Data()


## Tools

In [None]:
def Impute_MissForest(data):

    display(data.head(20))
#    data.replace({np.nan: ''}, inplace=True)
#    display(data.head(20))

    categorical = list(data)
    print (categorical)
    data_MF = MissForest().fit_transform(
        x = data,
        categorical=categorical,
    )
    display(data_MF.head(20))
    
    return data_MF
    
data = Get_Data()
data.replace({99:np.nan}, inplace=True)
#data = data.sample(20, axis='columns')
#data = data.sample(frac=0.50, axis='rows')
print (data.shape)
Impute_MissForest(data)

In [None]:
def Impute_Round_Robin(data):
    print ('Impute()')
    pd.set_option('display.max_columns', None)
    
    # Replace the 'Unknown' code (99 in the binned data) with np.nan
#    data.replace({'Unknown': np.nan}, inplace=True)
    data.replace({99: np.nan}, inplace=True)
    display(data.head(20))
    print ()
    
    # Make a list of features with missing samples, 
    #     ordered by the number of missing samples, 
    #     from least to most.  
    Missing = []
    Complete = []
    for feature in data:
        s = data[feature].isna().sum()
        if s==0:
            Complete.append([feature, s])
        if s>0:
            Missing.append([feature, s])
    Missing = sorted (Missing, key=lambda x:x[1], reverse=False)
    print ()
    print ('Complete[]')
    display(Complete)
    print ()
    print ('Missing[]')
    display(Missing)
    print ()
    
    print ('Make data_Mode')
    print ()
    data_Mode = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Mode[feature] = data[feature]
    for M in Missing:
        feature = M[0]
        m = data[feature].mode()[0]
        print (feature, M[1], m)
        data_Mode[feature] = data[feature].fillna(m)
    print ('data_Mode')
    display(data_Mode.head(20))

    print ()
    print ('Make starting point for data_Imputed')
    data_Imputed = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Imputed[feature] = data[feature]
    for X in Missing:
        feature = X[0]
        data_Imputed[feature] = data_Mode[feature]
    print ('data_Imputed')
    display(data_Imputed.head(20))
    print ()

    print ('Start Loop')
    print ()
    n = 0
    for M in Missing:
        n += 1
        print (M)
        feature = M[0]
        data_Imputed[feature] = data[feature]
#        print ()
#        print ('data[feature].isna().sum()')
#        print (data[feature].isna().sum())
#        print ('data_Imputed[feature].isna().sum()')
#        print (data_Imputed[feature].isna().sum())
#        print ()
        W = data_Imputed.dropna(subset=[feature])  # rows where the target is known (target column kept)
        X = W.drop(columns=feature)                # predictors for the known rows
        y = W[feature]                             # known target values
        Z = data_Imputed[data_Imputed[feature].isna()].drop(columns=feature)  # predictors for the rows to impute
#        Z.reset_index(drop=True, inplace=True)
#        print (data.shape)
#        print (X.shape)
#        display(X.head(40))
#        display(y.head(40))
#        print (Z.shape)
#        display(Z)
        clf = RandomForestClassifier(max_depth=2, random_state=random_seed)
        clf.fit(X,y)
#        print ('clf.predict(Z)')
        z = clf.predict(Z)
        print (len(z))
        display(z)
        Z[feature] = z
#        display(Z)
        data_Imputed = pd.concat([Z, W])
#        display(data_Imputed.head(60))
        print (data_Imputed.shape)
        print ()
#        data_Imputed.sort_values(
#            by = ['CASENUM', 'VEH_NO', 'PER_NO'], 
#            ascending = [True, True, True], 
#            inplace=True
#        )
#        print ()
#        print ('data.PER_NO.equals(data_Imputed.PER_NO)')
#        print (data.PER_NO.equals(data_Imputed.PER_NO))
#        print ()
               
        Check_Feature(data, data_Imputed, feature)
#        if n==10:
#            return data_Imputed
    
    
    display(data_Imputed.head(20))

    
    print ()
    return data_Imputed

In [None]:
def Check(data, data_Imputed):
    Features = data.columns
    print (Features)
    for feature in Features:
        U = pd.unique(data[feature]).tolist()
        print (U)
        A = []
        for u in U:
            a = len(data[data[feature]==u])
            b = len(data_Imputed[data_Imputed[feature]==u])
            A.append([u, a, b])
        display(A)
        print ()


In [None]:
def Check_Feature(data, data_Imputed, feature):
    U = pd.unique(data[feature]).tolist()
    U = [x for x in U if x == x]
    print (U)
    A = []
    for u in U:
        a = len(data[data[feature]==u])
        b = len(data_Imputed[data_Imputed[feature]==u])
        A.append([u, a, b, b-a])
    a = data[feature].isna().sum()
    b = data_Imputed[feature].isna().sum()
    A.append(['NaN', a, b, 0])
    A = pd.DataFrame(A, columns=['Value', 'Original', 'Imputed', 'Difference'])
    display(A)
    print ()


In [None]:
def Impute_Randomly(data):
    print ()
    print ('Impute_Randomly()')
    print ()
    
    data.sample(frac=1, replace=True) # Result is discarded, so `data` is not reordered (and replace=True would resample rather than shuffle)
    for feature in data:
        print (feature)
#        print ('display(data[feature].head())')
#        display(data[feature].head())
        dfA = data[feature].dropna()  # observed values only; returns a new Series and leaves `data` untouched
#        print ('display(dfA.head())')
#        display(dfA.head())
#        print ('display(dfA.head()) after dfA.dropna(inplace=True)')
#        display(dfA.head())
        print ('Original Value Counts')
        print (dfA.value_counts(normalize=True))
        dfA = dfA.sample(n = len(data), replace=True)
#        print ('display(dfA.head()) after dfA.sample(n = len(data), replace=True)')
#        display(dfA.head())
        print ('Value Counts after Sampling')
        print (dfA.value_counts(normalize=True))
        dfA.reset_index(drop=True, inplace=True)
#        print ('display(dfA.head()) after dfA.reset_index(drop=True)')
#        display(dfA.head())
        data[feature] = data[feature].fillna(dfA)  # fill NaN positions from the resampled values, aligned by index
#        print ('display(data[feature].head())')
#        display(data[feature].head())        
        print ()
        
    return data
        
def Test_Impute_Randomly():
    Dict = {
        'A':[0,0,0,1,np.nan],
        'B':[1,2,3,4,np.nan]
    }
    
    data = pd.DataFrame(Dict)
    display(data)
    data = Impute_Randomly(data)
    display(data)
    
#Test_Impute_Randomly()
        

# Compare Imputation Methods

## Mode Imputation
## Random Forest Imputation
## Prepare Data for IVEware

In [None]:
def Compare_Imputation_Methods_Part_1():
    print ()
    print ('Compare_Imputation_Methods_Part_1()')
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    # Convert every feature to integer codes
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth = data_Ground_Truth.astype('int64')
    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    # For each feature (column), draw 15% of the row indices and set those
    # values to missing.  np.random.choice samples with replacement by default,
    # so duplicate draws leave slightly fewer than 15% of the values erased.
    print ('Remove about 15% of values from each feature')
    frac = .15
    data_NaN = data_Ground_Truth.copy(deep=True)
    N = data_NaN.shape[0] * frac # Number of draws per feature
    for c in data_NaN.columns:
        idx = np.random.choice(a=data_NaN.index, size=int(len(data_NaN) * frac))
        data_NaN.loc[idx, c] = np.nan
#    for feature in data_NaN:
#        data_NaN[feature] = pd.to_numeric(data_NaN[feature])
#    data_NaN.astype('int64')


    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head())
    
    # Perform MissForest imputation
    data_MF = Impute_MissForest(data_NaN)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
    data_MF = data_MF.astype('int64')
    
    print ('data_MF.shape')
    print (data_MF.shape)
    display(data_MF.head())
#    print ()

    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data_NaN.copy(deep=True)
    data_IVEware = data_IVEware.fillna('')
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    data_Mode = pd.DataFrame()
    for feature in data_NaN:
        data_Mode[feature] = data_NaN[feature].fillna(data_NaN[feature].mode()[0])
    data_Mode = data_Mode.astype('int64')
    print ('data_Mode.shape')
    print (data_Mode.shape)
    display(data_Mode.head())
    
    # Perform Round Robin imputation using Random Forest Classifier
    data_RF = Impute_Round_Robin(data_NaN)
    data_RF.sort_index(inplace=True)
    data_RF = data_RF[data.columns]  
    data_RF = data_RF.astype('int64')
    
    print ('data_RF.shape')
    print (data_RF.shape)
    display(data_RF.head())
#    print ()

    # Impute randomly
    data_Random = data_NaN.copy(deep=True)
    data_Random = Impute_Randomly(data_Random)
    data_Random = data_Random.astype('int64')
    
    print ('data_Random.shape')
    print (data_Random.shape)
    display(data_Random.head())
#    print ()

    return data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random

In [None]:
%%time 
data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random = Compare_Imputation_Methods_Part_1()

## Do IVEware Imputation (Outside this Jupyter Notebook)
- Go to the IVEware folder and run (at the command line) IVE_12_22_22.bat
- Requires srclib and R.  You may need to change the path to your srclib installation in the batch file.
- Notes to self:
    - Open srcshell
    - From srcshell, open IVEware_CRSS_Imputation.xml
    - Run
- Run time: ./IVEware_CRSS_Imputation.bat  1069.08s user 12.92s system 98% cpu 18:23.92 total
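
Because the erased values can differ between runs, a stale IVEware output from an earlier erasure would silently distort the comparison. A check along these lines may be worth running after loading the file. It is a sketch: `matches_erasure` is a hypothetical helper, and it assumes the IVEware output has the same row order and column order as data_NaN.

```python
import numpy as np
import pandas as pd

def matches_erasure(data_nan: pd.DataFrame, imputed: pd.DataFrame) -> bool:
    """True if `imputed` has the same layout as `data_nan`, has no gaps, and
    agrees with `data_nan` wherever `data_nan` holds an observed value."""
    if imputed.shape != data_nan.shape:
        return False
    if list(imputed.columns) != list(data_nan.columns):
        return False
    if imputed.isna().any().any():
        return False
    observed = data_nan.notna().to_numpy()
    # Compare only the cells that were never erased, position by position.
    return bool(np.array_equal(imputed.to_numpy()[observed],
                               data_nan.to_numpy()[observed]))

data_nan = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 0.0, 1.0]})
fresh = pd.DataFrame({"a": [1, 2, 3], "b": [1, 0, 1]})  # consistent with data_nan
stale = pd.DataFrame({"a": [1, 2, 4], "b": [1, 0, 1]})  # an observed value differs
print(matches_erasure(data_nan, fresh), matches_erasure(data_nan, stale))
```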

In [None]:
data_IVEware = pd.read_csv('../../Big_Files/data_IVEware.csv')
data_IVEware.drop(columns='Unnamed: 0', inplace=True)

print ('data_Ground_Truth', data_Ground_Truth.shape)
display(data_Ground_Truth.head(10))
print ('data_NaN', data_NaN.shape)
display(data_NaN.head(10))
print ('data_RF', data_RF.shape)
display(data_RF.head(10))
print ('data_MF', data_MF.shape)
display(data_MF.head(10))
print ('data_Mode', data_Mode.shape)
display(data_Mode.head(10))
print ('data_IVEware', data_IVEware.shape)
display(data_IVEware.head(10))


## Compare Three Imputation Methods

In [None]:
def Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random, data_IVEware
):
    print ('Compare_Imputation_Methods_Part_2')
    
    """
    print ('Drop Multicollinear Features')
    Drop = ['MAX_VSEV', 'VE_FORMS', 'VTCONT_F', 'MAX_SEV', 'NUM_INJV']
    DF = [data_Ground_Truth, data_NaN, data_RF, data_Mode, data_Random, data_IVEware]
    
    for df in DF:
        for feature in Drop:
            if feature in df:
                df.drop(columns=[feature], inplace=True)
                print ('Drop ', feature)
    print ()
    """
    
    print ('data_Ground_Truth.shape: ', data_Ground_Truth.shape)
    print ('data_NaN.shape: ', data_NaN.shape)
    print ('data_RF.shape: ', data_RF.shape)
    print ('data_MF.shape: ', data_MF.shape)
    print ('data_Mode.shape: ', data_Mode.shape)
    print ('data_Random.shape: ', data_Random.shape)
    print ('data_IVEware.shape: ', data_IVEware.shape)
    print ()
    
    
    
    A = []
    for feature in data_NaN:
        N = data_NaN[feature].isna().sum()
#        print (feature, N)
#        print ()
        D = data_Ground_Truth[feature] != data_RF[feature]
        d = D.sum()
        E = data_Ground_Truth[feature] != data_Mode[feature]
        e = E.sum()
        F = data_Ground_Truth[feature] != data_Random[feature]
        f = F.sum()
        G = data_Ground_Truth[feature] != data_IVEware[feature]
        g = G.sum()
        H = data_RF[feature] != data_Mode[feature]
        h = H.sum()
        I = data_RF[feature] != data_Random[feature]
        i = I.sum()
        J = data_RF[feature] != data_IVEware[feature]
        j = J.sum()
        K = data_Mode[feature] != data_Random[feature]
        k = K.sum()
        L = data_Mode[feature] != data_IVEware[feature]
        l = L.sum()
        M = data_Random[feature] != data_IVEware[feature]
        m = M.sum()
        print (feature, N, d, e, f, g, h, i, j, k, l, m)
        print (
            feature, 
            data_Ground_Truth.dtypes[feature],
            data_NaN.dtypes[feature],
            data_RF.dtypes[feature],
            data_Mode.dtypes[feature],
            data_Random.dtypes[feature],
            data_IVEware.dtypes[feature],
        )
        A.append([
            feature, # 0
            N,  # 1
            d, int(d/N*100), # 2, 3
            e, int(e/N*100), # 4, 5
            f, int(f/N*100), # 6, 7
            g, int(g/N*100), # 8, 9
            h, int(h/N*100), # 10, 11
            i, int(i/N*100), # 12, 13
            j, int(j/N*100), # 14, 15
            k, int(k/N*100), # 16, 17
            l, int(l/N*100), # 18, 19
            m, int(m/N*100), # 20, 21
        ])
    print ()
    
    A = sorted(A, key=lambda x:x[3])
    B = pd.DataFrame(
        A, 
        columns=[
            'Feature', # 0
            'nNaN',  # 1
            'nRF Incorrect', 'pRF Incorrect', # 2, 3
            'nMode Incorrect', 'pMode Incorrect', # 4, 5
            'nRandom Incorrect', 'pRandom Incorrect', # 6, 7
            'nIVEware Incorrect', 'pIVEware Incorrect', # 8, 9
            'RF and Mode Different', 'RF v/s Mode %', # 10, 11
            'RF and Random Different', 'RF v/s Random %', # 12, 13
            'RF and IVEware Different', 'RF v/s IVEware %', # 14, 15
            'Mode and Random Different', 'Mode v/s Random %', # 16, 17
            'Mode and IVEware Different', 'Mode v/s IVEware %', # 18, 19
            'Random and IVEware Different', 'Random v/s IVEware %', # 20, 21
        ]
    )
    display(B)
    a = sum([x[1] for x in A]) # nNaN
    b = sum([x[2] for x in A]) # nRF Incorrect
    c = sum([x[4] for x in A]) # nMode Incorrect
    d = sum([x[6] for x in A]) # nRandom Incorrect
    e = sum([x[8] for x in A]) # nIVEware Incorrect
    f = round(b/a*100,2)
    g = round(c/a*100,2)
    h = round(d/a*100,2)
    i = round(e/a*100,2)

    RF_less_Mode = sum([x[2] < x[4] for x in A])
    RF_equal_Mode = sum([x[2] == x[4] for x in A])
    RF_greater_Mode = sum([x[2] > x[4] for x in A])

    RF_less_Random = sum([x[2] < x[6] for x in A])
    RF_equal_Random = sum([x[2] == x[6] for x in A])
    RF_greater_Random = sum([x[2] > x[6] for x in A])

    RF_less_IVEware = sum([x[2] < x[8] for x in A])
    RF_equal_IVEware = sum([x[2] == x[8] for x in A])
    RF_greater_IVEware = sum([x[2] > x[8] for x in A])

    Mode_less_Random = sum([x[4] < x[6] for x in A])
    Mode_equal_Random = sum([x[4] == x[6] for x in A])
    Mode_greater_Random = sum([x[4] > x[6] for x in A])

    Mode_less_IVEware = sum([x[4] < x[8] for x in A])
    Mode_equal_IVEware = sum([x[4] == x[8] for x in A])
    Mode_greater_IVEware = sum([x[4] > x[8] for x in A])

    Random_less_IVEware = sum([x[6] < x[8] for x in A])
    Random_equal_IVEware = sum([x[6] == x[8] for x in A])
    Random_greater_IVEware = sum([x[6] > x[8] for x in A])

    print ()
    print ('    | | Number | Percentage |')
    print ('    | --- | --- | --- | ')    
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% | ')
    print ('    | RF | ', f'{b:,d}', ' | ', f, '% | ')
    print ('    | Mode | ', f'{c:,d}', ' | ', g, '% | ')
    print ('    | Random | ', f'{d:,d}', ' | ', h, '% | ')
    print ('    | IVEware | ', f'{e:,d}', ' | ', i, '% | ')
    print ()
    print ('    |  | Fewer | Equal | More | Total | ')
    print ('    | --- | --- | --- | --- | --- | ')
    print ('    | Compare RF to Mode | ', RF_less_Mode, ' | ', RF_equal_Mode,  ' | ' ,RF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare RF to Random | ', RF_less_Random, ' | ' , RF_equal_Random,  ' | ' , RF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware | ', RF_less_IVEware, ' | ' , RF_equal_IVEware, ' | ' , RF_greater_IVEware, ' |', len(A), ' |' )
    print ('    | Compare Mode to Random | ', Mode_less_Random, ' | ' , Mode_equal_Random, ' | ' , Mode_greater_Random, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware | ', Mode_less_IVEware, ' | ' , Mode_equal_IVEware, ' | ' , Mode_greater_IVEware, ' |', len(A), ' |' )
    print ('    | Compare Random to IVEware | ', Random_less_IVEware, ' | ' , Random_equal_IVEware, ' | ' , Random_greater_IVEware, ' |', len(A), ' |' )
    print ()
    
    p = sum([x[10] for x in A])
    q = sum([x[12] for x in A])
    r = sum([x[14] for x in A])
    s = sum([x[16] for x in A])
    t = sum([x[18] for x in A])
    u = sum([x[20] for x in A])
    f = round(p/a*100,2)
    g = round(q/a*100,2)
    h = round(r/a*100,2)
    i = round(s/a*100,2)
    j = round(t/a*100,2)
    k = round(u/a*100,2)
    
    print ('    |  | Number |  Percentage |')
    print ('    | --- | --- | --- |')
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% |' )
    print ('    | RF Different from Mode | ', f'{p:,d}', ' | ', f, '% |')
    print ('    | RF Different from Random | ', f'{q:,d}', ' | ', g, '% |')
    print ('    | RF Different from IVEware | ', f'{r:,d}', ' | ', h, '% |')
    print ('    | Mode Different from Random | ', f'{s:,d}', ' | ', i, '% |')
    print ('    | Mode Different from IVEware | ', f'{t:,d}', ' | ', j, '% |')
    print ('    | Random Different from IVEware | ', f'{u:,d}', ' | ', k, '% |' )
    print ()
        
#    display(Audio(sound_file, autoplay=True))
    
    


In [None]:
Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random, data_IVEware
)

# Impute using Random Forest and Save for Next Step
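
The Methods section describes the approach as round-robin imputation that starts from imputation by mode and uses scikit-learn's random forest.  The sketch below shows that general idea on toy data.  It is not the notebook's `Impute_Round_Robin`; the function name, the number of rounds, and the forest size are assumptions chosen for illustration, and it assumes every column has at least one observed value and holds integer category codes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def round_robin_rf_sketch(data, n_rounds=2, random_state=0):
    """Illustrative round-robin imputation for integer-coded categorical data."""
    missing = data.isna()
    # Starting state: fill every gap with the column's mode
    filled = data.copy()
    for col in data.columns:
        filled[col] = filled[col].fillna(data[col].mode().iloc[0])
    for _ in range(n_rounds):
        for col in data.columns:
            mask = missing[col].to_numpy()
            if not mask.any():
                continue  # nothing to impute in this column
            # Predict this column from all the others, using their current fills
            X = filled.drop(columns=[col]).to_numpy()
            y = filled[col].to_numpy().astype('int64')
            model = RandomForestClassifier(n_estimators=20, random_state=random_state)
            model.fit(X[~mask], y[~mask])  # train only on originally observed rows
            filled.loc[mask, col] = model.predict(X[mask])
    return filled.astype('int64')

# Toy data: B mirrors A, with the last two values of B missing
toy = pd.DataFrame({
    'A': [0, 0, 0, 1, 1, 1, 0, 1],
    'B': [0, 0, 0, 1, 1, 1, np.nan, np.nan],
})
imputed = round_robin_rf_sketch(toy)
print(imputed)
```

Each pass re-predicts only the originally missing cells, so observed values are never overwritten.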

In [None]:
def Impute_Using_Random_Forest():
    data = Get_Data()
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_Round_Robin(data)
    data_Imputed.to_csv('../../Big_Files/CRSS_Imputed_by_RF_Data.csv', index=False)
#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))
    return 0

Impute_Using_Random_Forest()

In [None]:
def Impute_Using_IVEware():
    data = Get_Data()
    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data.copy(deep=True)
    print (data_IVEware.shape)
    display(data_IVEware.head(10))
    
    data_IVEware = data_IVEware.replace(99,'')
    display(data_IVEware.head(10))
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    return 0

Impute_Using_IVEware()
# About one hour