In [3]:
%%latex
\tableofcontents

<IPython.core.display.Latex object>

# Ambulance_Dispatch_2024_03_Impute_Missing_Data

# readme
- Most of our other Jupyter Notebooks have a main() function at the bottom that runs everything.  
- This notebook is structured differently, with several functions that run in sequence.  
- The reason for the difference is that part of the work has to be done outside this notebook.  
- The IVEware imputation software is available in several languages, but not Python.  We ran it in R using scrlib.  
- This notebook prepares the data for the Mode, Random Forest, and IVEware imputations, and does the first two.  Then the user must separately run the IVEware software.  Finally, this notebook pulls in those results and compares the three methods.  

# Methods

## Goal
- We have about 3% of the values in the dataset missing.  
- CRSS used IVEware to impute missing values in some, but not most, of the features.  
- We can use IVEware to impute the rest of the features, but we should compare to other methods.  

## Dataset
- We have the discretized CRSS dataset in '../../Big_Files/CRSS_Binned_Data.csv'.
- The dataset is 802,700 samples with 67 features.
- In that dataset, each feature has values in {0,1,2,3,4,5,6,7,8,9,99}, with most features having fewer values and 99 signifying "Missing" or "Unknown."
- Overall, about 3% of the values are 99.
    - By feature: thirteen features have no missing values, and six features have more than ten percent missing, the highest being RELJCT1 with 18% missing.  
    - By row: 29% of the samples have no missing values, 25% have one missing value, 16% have two, ..., 1% have eight, ..., and 0.3% have thirteen.  
    - See results of the Analyze_Data() function for full details.  

## Imputation Methods
### MissForest
- MissForest is a round-robin imputation method most commonly implemented in R, generally considered one of the best imputation methods.  It has several Python implementations. The Python implementation we found most current and referenced is https://pypi.org/project/MissForest/, and that's the one we used, version 2.5.5.  There are other current implementations that we did not try.  
- MissForest doesn't work with some datasets.  It imputes each feature by separating the samples into those with a missing value in that feature and those with a known value, building a model on the set with known values, and applying the model to the set with unknown values.  The problem comes when it builds the model: it first drops every sample with a missing value in any feature.  If all samples have at least two missing values, the learning algorithm gets an empty dataframe on which to build the model, and it (rightly) throws an error.  Our original test for comparing imputation methods took out 15% of the values in each feature, so each model-building round had very little clean data, if any.  The models were poor, which gave MissForest poor results.  
- MissForest takes an enormous amount of memory, sometimes enough to crash the process on the computer we were using.  While imputing the ground-truth set we will use to compare imputation methods (232,333 samples with 67 features), it used 40+ GB of memory and crashed in the sixth iteration.  While imputing the full set on which we want to use the best imputation method (802,700 samples with 67 features), it crashed in the second iteration.
- Interesting bug (feature?) in the code:  It starts counting iterations at zero and stops at the end of an iteration `` if n_iter > self.max_iter ``, so it will go through one more iteration than ``max_iter``.
- We couldn't find any detailed documentation for MissForest, and ended up reading the code itself.  
    - It does use mode imputation for categorical features (line 192 of missforest.py, version 2.5.5)
- We added one line to the transform() function in missforest.py, a print statement to let us know which iteration it was on and which feature it was imputing.  
```python
        while True:
            for c in missing_rows:
                print ('Iteration ', n_iter, ' of ', self.max_iter,  ', feature ', c)
```
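
The off-by-one iteration count noted above can be reproduced with a stripped-down loop. This is only a sketch of the control flow as we read it, not the package's code: `count_iterations` is a hypothetical helper, and we assume the counter is incremented before the stopping check.

```python
def count_iterations(max_iter):
    """Count loop passes when the counter starts at zero and the loop
    stops only once n_iter > max_iter, checked at the end of a pass."""
    n_iter = 0
    passes = 0
    while True:
        passes += 1      # one full imputation pass would happen here
        n_iter += 1      # assumption: incremented before the check
        if n_iter > max_iter:
            break
    return passes

print(count_iterations(4))
```

With `max_iter = 4`, as in our call to MissForest, the loop body runs five times.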


- We compare here four methods:
    - Round-Robin Random Forest 
        - Our own implementation of Round Robin, using scikit-learn's random forest
        - Using imputation by mode as the starting point
    - Imputation by mode
    - Random Imputation
    - IVEware, using the hyperparameters in the CRSS Imputation report
- To compare, we followed the example for MissForest.
    - We dropped all samples with a missing value, so we would have ground truth, going from 817,623 samples to 232,333 samples to make a Pandas dataframe data_Ground_Truth
    - We erased ~15% of the values in each sample to make data_NaN
    - We used each imputation method to impute the missing values.
    - To compare methods, we counted:
        - For each method, what percentage of imputed values did not match ground truth (28-44%)
        - For each pair of methods, which method did a better job on how many features
        - For each pair of methods, how many values are different
- Our round-robin method
    - In data_NaN, change every 'Unknown' value (99) to np.nan.
    - In each feature, count the number of unknown samples.
    - In another copy, data_Mode, impute by mode in all of the features.
    - Starting with the feature with the fewest (nonzero) missing samples:
        - Copy that feature from data_NaN into data_Mode, so that only that feature has missing values.
        - Separate the dataframe into two, one with known values in the target variable (X) and one with unknown values (Z).
        - From the dataframe with known values (X), separate out the target variable (call it 'y')
        - Using Random Forest, build a model that maps X to y.  
        - Use the model to impute the missing values
    - At each iteration we replace the mode-imputed values with RF-imputed values.
- Our Random Imputation method
    - We did not choose randomly from the unique values in the feature, because some values may be much more common than others.  We wanted (approximately) the same distribution of values.
    - We started with 232,333 samples with 67 features.
    - We erased each value with a probability of 15%.  That does not mean that exactly 34,849.95 values are missing from each feature, but we did erase *about* 35,000 values from each feature.  The exact number erased from each feature is printed out when the code runs.
    - For each feature:
        - Create a temporary copy of the feature, which will have 232,333 samples, about 35,000 of which are NaN.
        - Drop the NaN samples in the temp feature, leaving about 200,000 samples.
        - Resample the temp feature to have 232,333 samples.  The resampling will change the order of the values but keep about the same distribution.
        - In the original feature, replace the NaN values with the non-NaN corresponding values in the temporary feature.
- IVEware is available on several platforms, but Python is not one of them.  We run it in R outside this notebook.  Be aware that the random selection of values to erase is different for each run, so the IVEware imputation must be rerun each time. 
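
A minimal sketch of the random imputation described above, on a toy dataframe. The helper names `erase_values` and `impute_random` are hypothetical, not the functions used later in this notebook. Instead of resampling the whole feature and copying over the values that line up with the gaps, the sketch draws only as many known values as there are gaps, which should give the same distribution.

```python
import numpy as np
import pandas as pd

def erase_values(data, p=0.15, seed=0):
    """Return a copy of data with each cell set to NaN with probability p."""
    rng = np.random.default_rng(seed)
    erased = rng.random(data.shape) < p
    return data.mask(erased)

def impute_random(data_nan, seed=0):
    """Fill the NaN in each feature by resampling that feature's known values,
    so common values are drawn more often than rare ones."""
    rng = np.random.default_rng(seed)
    out = data_nan.copy()
    for feature in out.columns:
        missing = out[feature].isna()
        known = out.loc[~missing, feature].to_numpy()
        if missing.any() and len(known) > 0:
            draws = rng.choice(known, size=int(missing.sum()), replace=True)
            out.loc[missing, feature] = draws
    return out

# Toy data: 'A' is mostly 1, so most imputed values in 'A' should be 1.
toy = pd.DataFrame({'A': [0, 1, 1, 1] * 50, 'B': [2, 2, 3, 4] * 50})
toy_nan = erase_values(toy)
toy_random = impute_random(toy_nan)
print(toy_nan.isna().sum().to_dict())
print(int(toy_random.isna().sum().sum()))
```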

- Once we had analyzed the results and decided that the Random Forest method is best for our work, we implemented it and saved the results to CRSS_Imputed_Data.csv.
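
The round-robin steps listed above can be sketched as follows. This is a toy-scale illustration under our own assumptions, not the implementation used later in the notebook: `impute_round_robin` is a hypothetical name, the sketch makes a single pass over the features, and the forest size is chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def impute_round_robin(data_nan, seed=0):
    """Start from mode imputation, then replace each feature's mode-imputed
    values with Random Forest predictions, one feature at a time."""
    data_mode = data_nan.fillna(data_nan.mode().iloc[0])
    counts = data_nan.isna().sum()
    # Visit features from fewest to most missing values, skipping complete ones.
    order = counts[counts > 0].sort_values().index
    for feature in order:
        missing = data_nan[feature].isna()
        if missing.all():
            continue  # no known values to learn from
        predictors = data_mode.drop(columns=feature)
        model = RandomForestClassifier(n_estimators=20, random_state=seed)
        # X: rows with a known target; y: the known target values.
        model.fit(predictors[~missing], data_mode.loc[~missing, feature])
        # Z: rows with an unknown target; overwrite the mode-imputed values.
        data_mode.loc[missing, feature] = model.predict(predictors[missing])
    return data_mode

rng = np.random.default_rng(0)
a = rng.integers(0, 3, size=300)
toy = pd.DataFrame({'A': a, 'B': a, 'C': rng.integers(0, 2, size=300)}).astype(float)
toy_nan = toy.mask(rng.random(toy.shape) < 0.15)
toy_rf = impute_round_robin(toy_nan)
print(int(toy_rf.isna().sum().sum()))
```

Because the working copy is updated in place, features imputed later in the pass are modeled on the RF-imputed values of earlier features rather than on their mode-imputed values.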

## What is going on with IVEware using "seed 0;" ?
- When we set the random seed to 0, the accuracy of IVEware jumps from about 70% to about 80%, from slightly worse than Random Forest to MUCH better.  WHAT ???

- These runs have the same random seed for Python and NumPy, and the five multicollinear features are used in the imputation but dropped for the evaluation.  

- Having the same Python and NumPy random seed means that the input datasets for the IVEware imputations contain the same samples with the same feature values missing.  

- "seed 0;" in IVEware_CRSS_Imputation.xml


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  438,072  |  21.84 % | 

- "seed 1;"

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  592,313  |  29.52 % | 

- "seed 2;"

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,143  | 100% | 
    | RF |  558,626  |  27.85 % | 
    | Mode |  681,514  |  33.97 % | 
    | Random |  888,663  |  44.3 % | 
    | IVEware |  568,719  |  28.35 % | 
    
    
<br><br>
- Found what was going on. "seed 1;" in IVEware is setting the random seed in R, but "seed 0;" is something different.
- From the IVEware User Guide, page 17:

"SEED number;

Specifies a seed for the random draws from the posterior predictive distribution. Number should be greater than zero. A zero seed will result in no perturbations of the predicted values or the regression coefficients. If the SEED keyword is missing from the setup file then the seed will be determined by your computer’s internal clock."

- set.seed(int) in R does not have this behavior at int=0.  I tried set.seed(0) in R and it worked just fine.  
- SAS requires that the random seed be a positive integer, and SAS is one of the implementations of IVEware, so that may be why the IVEware authors thought to implement this functionality for their seed.

- According to this ~2017 scraping of GitHub Python code to count the choices of random seeds,
    - https://www.kaggle.com/code/residentmario/kernel16e284dcb7
    - 0 is the most common (19%)
    - 1 and 42 are next (9% and 4%, respectively)
    
- According to this 2014 scraping of 100 top R repositories owned by 27 people, 
    - https://www.r-bloggers.com/2014/03/what-are-the-most-common-rng-seeds-used-in-r-scripts-on-github/
    - 1 is by far the most common (60 examples)
    - 123 is next (about 25)
    - 0 is not on the list
    
### Is this just an anomaly, or might "seed 0;" be useful?

- Test Method
    - Test with all 67 features, not dropping five multicollinear features
    - We have results with seeds 1 and 42
    - Test with seed 0 in IVEware, Python, and NumPy

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,167,826  | 100% | 
    | RF |  591,364  |  27.28 % | 
    | Mode |  739,696  |  34.12 % | 
    | Random |  971,759  |  44.83 % | 
    | IVEware |  447,881  |  20.66 % | 
    
    <br><br>
    - Test with seed 0 in IVEware but seed 42 in Python and NumPy in the Binning and Imputation

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,168,989  | 100% | 
    | RF |  587,221  |  27.07 % | 
    | Mode |  739,903  |  34.11 % | 
    | Random |  971,670  |  44.8 % | 
    | IVEware |  445,195  |  20.53 % | 
    
- Another test method
    - Randomly sample from 67 to 40 features and test again
    - Note that dropping features will increase the number of samples that have no missing values, so data_Ground_Truth and data_NaN will have fewer features but more samples, so having about the same number of total missing values over the 40 features is not a problem.
    - Do it twice with two random seeds.  
    - The same random seed for Python and NumPy will preserve, but different random seeds will change:
        - Which features get dropped
        - Which 15% of the values will get erased to make data_NaN for testing the imputation
    - Seed 0 in Python and NumPy, seed 0 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  556,618  |  26.87 % | 

    - Seed 0 in Python and NumPy, seed 1 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,071,755  | 100% | 
    | RF |  652,983  |  31.52 % | 
    | Mode |  774,049  |  37.36 % | 
    | Random |  997,759  |  48.16 % | 
    | IVEware |  738,201  |  35.63 % | 
    
    
    - Seed 1 in Python and NumPy, seed 0 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  370,546  |  19.9 % | 

    - Seed 1 in Python and NumPy, seed 1 in IVEware:

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  1,861,972  | 100% | 
    | RF |  449,626  |  24.15 % | 
    | Mode |  547,876  |  29.42 % | 
    | Random |  737,527  |  39.61 % | 
    | IVEware |  486,820  |  26.15 % | 
    
    - Analysis
        - Seed 1 (compared with seed 0) for Python and NumPy appears to have chosen features that are easier to impute
        - Within each seed for Python and NumPy, choosing seed 0 for IVEware gave much better results.  
    
### Conclusion
- Setting the IVEware seed to zero is not recommended in the manual, and we think it shouldn't work well, but it works dramatically well with our test methods.  
- Use two sets of data from here on, one imputed with Random Forest and another imputed with IVEware with random seed zero.  See which gives best results at the end.  

# Results of Comparison of Six Imputation Methods

- We start with the binned (discretized) data, CRSS_Binned_Data.csv, with 817,623 samples in 67 features.
<br><br>
- Dropping any sample with a missing value, we have 232,333 samples of Ground Truth.

- Replacing 15% of the values in each feature with NaN would give about $232,333  \times 67 \times 0.15 \approx 2,334,947$ missing values to be imputed.  

<br><br>
- First run, with random seed 42 in Python and NumPy, and both 0 and 1 as random seeds for IVEware
    <br><br>
    - Samples Incorrectly Imputed
   
    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  417,246  | 100% | 
    | RF |  112,645  |  27.0 % | 
    | MF |  85,624  |  20.52 % | 
    | Mode |  148,831  |  35.67 % | 
    | Random |  194,305  |  46.57 % | 
    | IVEware_seed_0 |  90,130  |  21.6 % | 
    | IVEware_seed_1 |  123,928  |  29.7 % | 

    <br><br>
    - Comparison of number of errors in the 67 features.  For instance, comparing our Random Forest Round-Robin method to MissForest, in 1 feature RF had fewer errors than MissForest, in 18 features the two methods had the same number of errors, and in 48 features RF had more errors than MissForest.  

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  1  |  18  |  48  | 67  |
    | Compare RF to Mode |  40  |  27  |  0  | 67  |
    | Compare RF to Random |  54  |  13  |  0  | 67  |
    | Compare RF to IVEware_seed_0 |  4  |  19  |  44  | 67  |
    | Compare RF to IVEware_seed_1 |  30  |  14  |  23  | 67  |
    | Compare MF to Mode |  50  |  17  |  0  | 67  |
    | Compare MF to Random |  54  |  13  |  0  | 67  |
    | Compare MF to IVEware_seed_0 |  31  |  22  |  14  | 67  |
    | Compare MF to IVEware_seed_1 |  53  |  12  |  2  | 67  |
    | Compare Mode to Random |  52  |  15  |  0  | 67  |
    | Compare Mode to IVEware_seed_0 |  1  |  17  |  49  | 67  |
    | Compare Mode to IVEware_seed_1 |  22  |  12  |  33  | 67  |
    | Compare Random to IVEware_seed_0 |  1  |  11  |  55  | 67  |
    | Compare Random to IVEware_seed_1 |  7  |  13  |  47  | 67  |
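
The Fewer / Equal / More counts in a table like the one above can be computed as in this sketch. It is an illustration on toy data, not the notebook's own comparison code: `compare_by_feature` and the boolean `erased` frame (True where a value was erased from the ground truth) are hypothetical names.

```python
import pandas as pd

def compare_by_feature(truth, erased, imputed_a, imputed_b):
    """Count the features where method A has fewer, equal, or more errors
    than method B.  Only the cells flagged True in `erased` are scored."""
    errors_a = ((imputed_a != truth) & erased).sum()   # errors per feature
    errors_b = ((imputed_b != truth) & erased).sum()
    return {
        'Fewer': int((errors_a < errors_b).sum()),
        'Equal': int((errors_a == errors_b).sum()),
        'More': int((errors_a > errors_b).sum()),
        'Total': len(truth.columns),
    }

truth = pd.DataFrame({'A': [0, 1, 2, 1], 'B': [3, 3, 4, 4], 'C': [5, 6, 5, 6]})
erased = pd.DataFrame({'A': [True, True, False, False],
                       'B': [False, True, True, False],
                       'C': [True, False, False, True]})
method_a = pd.DataFrame({'A': [0, 1, 2, 1], 'B': [3, 4, 4, 4], 'C': [6, 6, 5, 5]})
method_b = pd.DataFrame({'A': [1, 1, 2, 1], 'B': [3, 4, 4, 4], 'C': [5, 6, 5, 6]})
result = compare_by_feature(truth, erased, method_a, method_b)
print(result)
```

In the toy data, method A is better on feature A, tied on B, and worse on C, so each of the three counts is 1.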


    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  417,246  | 100% |
    | RF Different from MF |  53,566  |  12.84 % |
    | RF Different from Mode |  57,378  |  13.75 % |
    | RF Different from Random |  161,299  |  38.66 % |
    | RF Different from IVEware_seed_0 |  48,866  |  11.71 % |
    | RF Different from IVEware_seed_1 |  120,019  |  28.76 % |
    | MF Different from Mode |  99,730  |  23.9 % |
    | MF Different from Random |  174,445  |  41.81 % |
    | MF Different from IVEware_seed_0 |  38,763  |  9.29 % |
    | MF Different from IVEware_seed_1 |  103,645  |  24.84 % |
    | Mode Different from Random |  148,772  |  35.66 % |
    | Mode Different from IVEware_seed_0 |  96,820  |  23.2 % |
    | Mode Different from IVEware_seed_1 |  153,020  |  36.67 % |
    | Random Different from IVEware_seed_0 |  175,503  |  42.06 % |
    | Random Different from IVEware_seed_1 |  196,586  |  47.12 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  97,299  |  23.32 % |
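
The error counts and the pairwise "Different from" counts in the tables above both reduce to counting mismatches over the erased cells. A minimal sketch on toy data, with hypothetical helper names (`count_errors`, `count_different`) and a hypothetical boolean `erased` frame marking the erased cells:

```python
import pandas as pd

def count_errors(truth, erased, imputed):
    """Number and percentage of erased cells whose imputed value is wrong."""
    total = int(erased.sum().sum())
    wrong = int(((imputed != truth) & erased).sum().sum())
    return wrong, round(100 * wrong / total, 2)

def count_different(erased, imputed_a, imputed_b):
    """Number and percentage of erased cells where two methods disagree,
    regardless of which one (if either) matches the ground truth."""
    total = int(erased.sum().sum())
    different = int(((imputed_a != imputed_b) & erased).sum().sum())
    return different, round(100 * different / total, 2)

truth = pd.DataFrame({'A': [0, 1, 2, 1], 'B': [3, 3, 4, 4], 'C': [5, 6, 5, 6]})
erased = pd.DataFrame({'A': [True, True, False, False],
                       'B': [False, True, True, False],
                       'C': [True, False, False, True]})
method_a = pd.DataFrame({'A': [0, 1, 2, 1], 'B': [3, 4, 4, 4], 'C': [6, 6, 5, 5]})
method_b = pd.DataFrame({'A': [1, 1, 2, 1], 'B': [3, 4, 4, 4], 'C': [5, 6, 5, 6]})
print(count_errors(truth, erased, method_a))
print(count_errors(truth, erased, method_b))
print(count_different(erased, method_a, method_b))
```

Two methods can have similar error rates and still disagree on many cells, which is why we report both numbers.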



<br><br>
- Second run, with the same random seed (42), to make sure the random seed is implemented correctly.  Same results. 

    <br><br>
     - Percentage of Samples Incorrectly Imputed
     


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  417,246  | 100% | 
    | RF |  112,645  |  27.0 % | 
    | MF |  85,624  |  20.52 % | 
    | Mode |  148,831  |  35.67 % | 
    | Random |  194,305  |  46.57 % | 
    | IVEware_seed_0 |  90,130  |  21.6 % | 
    | IVEware_seed_1 |  123,928  |  29.7 % | 

    <br><br>
     - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  1  |  18  |  48  | 67  |
    | Compare RF to Mode |  40  |  27  |  0  | 67  |
    | Compare RF to Random |  54  |  13  |  0  | 67  |
    | Compare RF to IVEware_seed_0 |  4  |  19  |  44  | 67  |
    | Compare RF to IVEware_seed_1 |  30  |  14  |  23  | 67  |
    | Compare MF to Mode |  50  |  17  |  0  | 67  |
    | Compare MF to Random |  54  |  13  |  0  | 67  |
    | Compare MF to IVEware_seed_0 |  31  |  22  |  14  | 67  |
    | Compare MF to IVEware_seed_1 |  53  |  12  |  2  | 67  |
    | Compare Mode to Random |  52  |  15  |  0  | 67  |
    | Compare Mode to IVEware_seed_0 |  1  |  17  |  49  | 67  |
    | Compare Mode to IVEware_seed_1 |  22  |  12  |  33  | 67  |
    | Compare Random to IVEware_seed_0 |  1  |  11  |  55  | 67  |
    | Compare Random to IVEware_seed_1 |  7  |  13  |  47  | 67  |

    <br><br>
     - Number of NaN Imputed Differently by Different Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  417,246  | 100% |
    | RF Different from MF |  53,566  |  12.84 % |
    | RF Different from Mode |  57,378  |  13.75 % |
    | RF Different from Random |  161,299  |  38.66 % |
    | RF Different from IVEware_seed_0 |  48,866  |  11.71 % |
    | RF Different from IVEware_seed_1 |  120,019  |  28.76 % |
    | MF Different from Mode |  99,730  |  23.9 % |
    | MF Different from Random |  174,445  |  41.81 % |
    | MF Different from IVEware_seed_0 |  38,763  |  9.29 % |
    | MF Different from IVEware_seed_1 |  103,645  |  24.84 % |
    | Mode Different from Random |  148,772  |  35.66 % |
    | Mode Different from IVEware_seed_0 |  96,820  |  23.2 % |
    | Mode Different from IVEware_seed_1 |  153,020  |  36.67 % |
    | Random Different from IVEware_seed_0 |  175,503  |  42.06 % |
    | Random Different from IVEware_seed_1 |  196,586  |  47.12 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  97,299  |  23.32 % |

<br><br>
- Third run, with random seed 0 in Python and NumPy, with both 0 and 1 as random seeds for IVEware.
    - Note that the IVEware results are different in this run than in the previous runs, even though we used the same random seeds in IVEware.  The reason for the change is the different seed for Python and NumPy, which changed which values were missing in the dataset that we fed into IVEware.  

    <br><br>
    - Samples Incorrectly Imputed by Method


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  417,237  | 100% | 
    | RF |  112,416  |  26.94 % | 
    | MF |  85,273  |  20.44 % | 
    | Mode |  149,065  |  35.73 % | 
    | Random |  194,724  |  46.67 % | 
    | IVEware_seed_0 |  89,391  |  21.42 % | 
    | IVEware_seed_1 |  122,642  |  29.39 % | 


    <br><br>
    - Comparison of number of errors in the 67 features:

    |  | Fewer | Equal | More | Total | 
    | --- | --- | --- | --- | --- | 
    | Compare RF to MF |  2  |  14  |  51  | 67  |
    | Compare RF to Mode |  39  |  28  |  0  | 67  |
    | Compare RF to Random |  53  |  14  |  0  | 67  |
    | Compare RF to IVEware_seed_0 |  3  |  20  |  44  | 67  |
    | Compare RF to IVEware_seed_1 |  33  |  11  |  23  | 67  |
    | Compare MF to Mode |  53  |  13  |  1  | 67  |
    | Compare MF to Random |  56  |  10  |  1  | 67  |
    | Compare MF to IVEware_seed_0 |  31  |  18  |  18  | 67  |
    | Compare MF to IVEware_seed_1 |  54  |  11  |  2  | 67  |
    | Compare Mode to Random |  51  |  16  |  0  | 67  |
    | Compare Mode to IVEware_seed_0 |  0  |  21  |  46  | 67  |
    | Compare Mode to IVEware_seed_1 |  21  |  10  |  36  | 67  |
    | Compare Random to IVEware_seed_0 |  0  |  13  |  54  | 67  |
    | Compare Random to IVEware_seed_1 |  8  |  8  |  51  | 67  |



    <br><br>
    - Number of NaN Imputed Differently by Pairs of Methods

    |  | Number |  Percentage |
    | --- | --- | -- |
    | Total NaN |  417,237  | 100% |
    | RF Different from MF |  54,162  |  12.98 % |
    | RF Different from Mode |  58,329  |  13.98 % |
    | RF Different from Random |  161,621  |  38.74 % |
    | RF Different from IVEware_seed_0 |  57,464  |  13.77 % |
    | RF Different from IVEware_seed_1 |  118,643  |  28.44 % |
    | MF Different from Mode |  101,716  |  24.38 % |
    | MF Different from Random |  174,740  |  41.88 % |
    | MF Different from IVEware_seed_0 |  38,365  |  9.2 % |
    | MF Different from IVEware_seed_1 |  102,005  |  24.45 % |
    | Mode Different from Random |  148,709  |  35.64 % |
    | Mode Different from IVEware_seed_0 |  104,535  |  25.05 % |
    | Mode Different from IVEware_seed_1 |  151,758  |  36.37 % |
    | Random Different from IVEware_seed_0 |  176,490  |  42.3 % |
    | Random Different from IVEware_seed_1 |  195,705  |  46.9 % |
    | IVEware_seed_0 Different from IVEware_seed_1 |  96,334  |  23.09 % |







## Drop Multicollinear Features before Imputing?  Compare two methods
- First Method
    - After Binning, reduce dimensionality
        - Removes MAX_VSEV, VE_FORMS, VTCONT_F, MAX_SEV, NUM_INJV
        - Reduces from 67 to 62 features
    - Impute
- Second Method
    - Impute with all 67 features
    - Before evaluating the imputation, remove the five features and only evaluate the results on the 62 features used in the comparison above
- We used random seed 42 for both methods
- First Method Results

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,463  | 100% | 
    | RF |  569,509  |  28.37 % | 
    | Mode |  681,753  |  33.96 % | 
    | Random |  889,794  |  44.32 % | 
    | IVEware |  606,632  |  30.22 % | 
    
- Second Method Results


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,007,235  | 100% | 
    | RF |  558,936  |  27.85 % | 
    | Mode |  681,996  |  33.98 % | 
    | Random |  888,845  |  44.28 % | 
    | IVEware |  606,062  |  30.19 % | 
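
The evaluation step of the second method, scoring only the features that remain after the five multicollinear features are set aside, can be sketched as below. This is an illustration under our own assumptions: `error_rate_on_kept_features` and the boolean `erased` frame are hypothetical names, and the toy frames contain only one of the five features.

```python
import pandas as pd

# The five features named above; the toy frames below contain only MAX_SEV.
MULTICOLLINEAR = ['MAX_VSEV', 'VE_FORMS', 'VTCONT_F', 'MAX_SEV', 'NUM_INJV']

def error_rate_on_kept_features(truth, erased, imputed, dropped=MULTICOLLINEAR):
    """Percentage of erased cells imputed wrongly, ignoring the dropped features."""
    kept = [c for c in truth.columns if c not in dropped]
    wrong = ((imputed[kept] != truth[kept]) & erased[kept]).sum().sum()
    total = erased[kept].sum().sum()
    return round(100 * wrong / total, 2)

truth = pd.DataFrame({'AGE': [1, 2, 3, 4], 'MAX_SEV': [0, 1, 0, 1]})
erased = pd.DataFrame({'AGE': [True, True, False, False],
                       'MAX_SEV': [True, False, True, False]})
imputed = pd.DataFrame({'AGE': [1, 3, 3, 4], 'MAX_SEV': [1, 1, 1, 1]})
rate = error_rate_on_kept_features(truth, erased, imputed)
print(rate)
```

The errors in MAX_SEV are ignored, so only the one wrong value out of two erased AGE cells is counted.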


### Analysis
- Mode was the same, as it should be.
- Random was slightly different, perhaps because the features were in a different order?
- IVEware was not significantly different in the two methods.
- Random Forest was slightly but significantly better (by 0.52 percentage points) with the second method, not removing the multicollinear features before imputing, which is surprising.  

### Conclusion
- Run again with different random seed = 1

### Second Round Results
- First Method

    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,006,643  | 100% | 
    | RF |  568,909  |  28.35 % | 
    | Mode |  681,061  |  33.94 % | 
    | Random |  889,048  |  44.31 % | 
    | IVEware |  592,233  |  29.51 % | 


- Second Method


    | | Number | Percentage |
    | --- | --- | --- | 
    | Total NaN |  2,005,955  | 100% | 
    | RF |  558,742  |  27.85 % | 
    | Mode |  680,715  |  33.93 % | 
    | Random |  887,944  |  44.27 % | 
    | IVEware |  564,254  |  28.13 % | 
    
### Analysis

- Again, the second method, leaving in the multicollinear features, is better for both Random Forest and IVEware.

### Conclusion
- When we impute 

## Discussion

- Random imputation is clearly worse than Mode and RF on every feature.
- Random is overall worse than IVEware, but on one of our runs there are five features on which Random is better than IVEware.
- Random Forest is as good as or better than Mode on every feature, which is not surprising, as RF starts at Mode and improves on it.  
- Random Forest is as good as or better than IVEware on more than half of the features, but not overwhelmingly, and slightly better in the count of missing samples correctly imputed.
- IVEware and Mode are comparable in the number of features, but IVEware is much better in the count of missing samples correctly imputed.
- Random Forest and Mode make the same mistakes.  
- IVEware makes different mistakes from Random Forest and Mode.

## Conclusion

- Use Random Forest

## Opportunities for Future Research
(or, "Things we didn't do")

- Which features are better imputed by Random imputation than by IVEware, and why?
- Which features are better imputed by IVEware than by Random Forest, and why?
- Would a different mix of features make IVEware perform better than Random Forest?
- Is it okay to use one imputation method for some features and another method for other features?

# Setup
## Import Libraries

In [4]:
import sys, copy, math, time, os

print ('Python version: {}'.format(sys.version))

import numpy as np
print ('NumPy version: {}'.format(np.__version__))
np.set_printoptions(suppress=True)


import pandas as pd
print ('Pandas version:  {}'.format(pd.__version__))
pd.set_option('display.max_rows', 500)

import sklearn
print ('SciKit-Learn version: {}'.format(sklearn.__version__))
from sklearn.model_selection import train_test_split

import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor

import missforest.missforest
from missforest.missforest import MissForest
print ('MissForest version:  {}'.format(missforest.missforest.__version__))

# Set Randomness.  Copied from https://www.kaggle.com/code/abazdyrev/keras-nn-focal-loss-experiments
import random
random_seed = 0
print ('Random seed for Python and NumPy set to ', random_seed)
np.random.seed(random_seed) # NumPy
random.seed(random_seed) # Python
#tf.set_random_seed(random_seed) # Tensorflow

from IPython.display import Audio
sound_file = './beep.wav'

import warnings
warnings.filterwarnings('ignore')

print ('Finished Importing Libraries')
print ()


Python version: 3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]
NumPy version: 1.26.4
Pandas version:  2.2.2
SciKit-Learn version: 1.5.0
MissForest version:  2.5.5
Random seed for Python and NumPy set to  0
Finished Importing Libraries



## Get Data

This notebook pulls in the saved output of Ambulance_Dispatch_2024_02_Binning.

In [5]:
def Get_Data():
    print ('Get_Data')
    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Data_Seed_42.csv', low_memory=False)
#    data = pd.read_csv('../../Big_Files/CRSS_Binned_Reduced_Dimensionality_Data.csv', low_memory=False)
    print ('data.shape = ', data.shape)
    print ()

    # We already dropped the imputed columns in the Binning stage
    print ('Drop Imputed Columns')
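    # Collect the names first so we don't modify the dataframe while iterating over it
    imputed_columns = [feature for feature in data.columns if '_IM' in feature]
    for feature in imputed_columns:
        print (feature)
    data.drop(columns=imputed_columns, inplace=True)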
 

    # Method for dropping from 67 to 40 features 
    # to test whether it was just this particular mix of features 
    # that made the IVEware behave strangely well with random seed of zero.
#    print ('data.shape = ', data.shape)
#    data = data.sample(n=40, axis='columns')
    
    print ('data.shape = ', data.shape)
    print ()
    
    print ('Total number of NaN')
    print (data.replace({99:np.nan}).isnull().sum().sum())
    print ()
    
#    print ("Remaining Features:")
#    Features = sorted(list(data.columns))
#    for feature in Features:
#        print ("    ",feature)
    print ('Finished Get_Data()')
    print ()
    
    return data

In [6]:
#data = Get_Data()


In [7]:
def Analyze_Data():
    data = Get_Data()
    print ('Analyze_Data')
    data = data.replace({99:np.nan})
    
    print ('Total NaN')
    s = data.isna().sum().sum()
    rows = data.shape[0]
    cols = data.shape[1]
    print (s, " ", round(s/(rows*cols)*100,2))
    print ()

    print ('Number of NaN in each feature')
    Feature_NaN_Counts = []
    for feature in data:
        s = data[feature].isna().sum()
        n = len(data)
#        print (feature, s, round(s/n*100,2))
        Feature_NaN_Counts.append([feature, round(s/n*100,6)])
    for row in Feature_NaN_Counts:
        print (row)
    print ()
    print ('Distribution of number of NaN in each sample')
    A = data.isna().sum(axis=1)
    Row_NaN_Counts = A.value_counts(normalize=True)
    display(Row_NaN_Counts)
    Row_NaN_Counts = Row_NaN_Counts.to_list()
    
    print ('Finished Analyze_Data()')
    print ()
    
    return Feature_NaN_Counts, Row_NaN_Counts
    
#Feature_NaN_Counts, Row_NaN_Counts = Analyze_Data()
    
    

## Tools

In [8]:
def Impute_MissForest(data):
    print('Impute_MissForest()')

    print (data.shape)
    display(data.head(20))
#    data.replace({np.nan: ''}, inplace=True)
#    display(data.head(20))

    categorical = list(data)
    print ('categorical features: ', categorical)
    
    clf = RandomForestClassifier(
        n_estimators=100, 
        max_depth=10, 
#        verbose=2,
#        max_features=0.5
    )
    rgr = RandomForestRegressor(
        n_estimators=100, 
        max_depth=10, 
#        verbose=2,
#        max_features=0.5
    )

    data_MF = MissForest(clf, rgr, max_iter = 4).fit_transform(
        x = data,
        categorical=categorical,
    )
    display(data_MF.head(20))
    print ('Finished Impute_MissForest()')
    print ()
    
    return data_MF
    

In [9]:
def Test_Impute_MissForest():
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
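    # Note: astype() returns a new dataframe.  Without assignment this line has no effect,
    # so the dtypes stay as they were (see the float columns in the output below).
    data_Ground_Truth.astype('Int64')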

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    data_NaN = Create_data_NaN_Method_3(data_Ground_Truth)


    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head())
    
#    data_NaN = data_NaN.astype('Int8')
    
#    data_NaN = data_NaN.sample(n=200000)
    print (data_NaN.shape)
    print (data_NaN.head(20))

    
    # Perform MissForest imputation
    print ('Start Imputation')
    data_MF = Impute_MissForest(data_NaN)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
#    data_MF = data_MF.astype('Int64')
    print (data_MF.head(20))
    
    print ('Finished Test_Impute_MissForest()')
    print ()
    
Test_Impute_MissForest()


Get_Data
data.shape =  (802700, 67)

Drop Imputed Columns
data.shape =  (802700, 67)

Total number of NaN
1674506

Finished Get_Data()

(802700, 67)
data_Ground_Truth.shape
(232333, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,4.0,7.0,3.0,2.0,7.0,0.0,4,0.0,5.0,...,1,1,3.0,5.0,0.0,1.0,0.0,0.0,0.0,0
1,0,2.0,6.0,1.0,2.0,4.0,0.0,3,0.0,5.0,...,0,0,3.0,3.0,1.0,1.0,0.0,2.0,1.0,0
2,1,2.0,5.0,1.0,2.0,4.0,0.0,3,0.0,5.0,...,0,0,3.0,3.0,1.0,1.0,0.0,2.0,1.0,0
3,0,2.0,5.0,3.0,2.0,4.0,0.0,3,0.0,5.0,...,0,0,3.0,3.0,1.0,1.0,0.0,2.0,1.0,0
4,0,7.0,6.0,4.0,2.0,2.0,0.0,3,4.0,5.0,...,1,1,3.0,3.0,0.0,1.0,0.0,1.0,2.0,0


NameError: name 'Create_data_NaN_Method_3' is not defined

In [11]:
def Test_Impute_MissForest_2():
    print ('Test_Impute_Miss_Forest_2()')
    data = Get_Data()
    print (data.shape)
    display(data.head(10))
#    data = data.sample(n=1000)
    print (data.shape)
    data.replace({99:np.nan}, inplace=True)
    display(data.head(10))
    data_MF = Impute_MissForest(data)
    
    print ('Finished Test_Impute_Miss_Forest_2()')
    print ()

Test_Impute_MissForest_2()

Test_Impute_Miss_Forest_2()
Get_Data
data.shape =  (802700, 67)

Drop Imputed Columns
data.shape =  (802700, 67)

Total number of NaN
1674506

Finished Get_Data()

(802700, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,9,9,4,99,2,0,1,2,8,...,1,1,3,2,0,99,99,99,0,0
1,0,8,6,4,99,3,0,1,2,8,...,1,1,3,2,0,99,99,99,0,0
2,0,5,3,1,99,7,0,1,0,8,...,1,1,2,2,0,99,99,99,0,0
3,0,3,4,1,99,2,0,1,0,8,...,1,1,2,2,0,99,99,99,0,0
4,0,3,4,3,2,2,0,1,0,8,...,1,1,2,2,0,99,99,99,0,0
5,0,3,3,3,2,2,0,1,0,8,...,1,1,2,2,0,99,99,99,0,0
6,1,2,6,4,99,2,0,2,0,8,...,0,0,3,5,0,99,99,1,0,0
7,0,0,5,4,2,2,0,4,4,8,...,1,1,2,3,1,99,99,99,0,0
8,0,8,7,99,2,7,0,3,3,5,...,1,1,0,3,0,1,0,1,0,0
9,0,7,6,99,2,2,0,3,4,5,...,1,1,0,3,0,1,0,1,0,0


(802700, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,9.0,9.0,4.0,,2.0,0.0,1,2.0,8.0,...,1,1,3.0,2.0,0.0,,,,0.0,0
1,0,8.0,6.0,4.0,,3.0,0.0,1,2.0,8.0,...,1,1,3.0,2.0,0.0,,,,0.0,0
2,0,5.0,3.0,1.0,,7.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
3,0,3.0,4.0,1.0,,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
4,0,3.0,4.0,3.0,2.0,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
5,0,3.0,3.0,3.0,2.0,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
6,1,2.0,6.0,4.0,,2.0,0.0,2,0.0,8.0,...,0,0,3.0,5.0,0.0,,,1.0,0.0,0
7,0,0.0,5.0,4.0,2.0,2.0,0.0,4,4.0,8.0,...,1,1,2.0,3.0,1.0,,,,0.0,0
8,0,8.0,7.0,,2.0,7.0,0.0,3,3.0,5.0,...,1,1,0.0,3.0,0.0,1.0,0.0,1.0,0.0,0
9,0,7.0,6.0,,2.0,2.0,0.0,3,4.0,5.0,...,1,1,0.0,3.0,0.0,1.0,0.0,1.0,0.0,0


Impute_MissForest()
(802700, 67)


Unnamed: 0,HOSPITAL,ACC_TYPE,AGE,AIR_BAG,ALC_STATUS,BODY_TYP,CARGO_BT,DAY_WEEK,DEFORMED,DR_ZIP,...,VE_FORMS,VE_TOTAL,VPROFILE,VSPD_LIM,VSURCOND,VTCONT_F,VTRAFCON,VTRAFWAY,WEATHER,WRK_ZONE
0,0,9.0,9.0,4.0,,2.0,0.0,1,2.0,8.0,...,1,1,3.0,2.0,0.0,,,,0.0,0
1,0,8.0,6.0,4.0,,3.0,0.0,1,2.0,8.0,...,1,1,3.0,2.0,0.0,,,,0.0,0
2,0,5.0,3.0,1.0,,7.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
3,0,3.0,4.0,1.0,,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
4,0,3.0,4.0,3.0,2.0,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
5,0,3.0,3.0,3.0,2.0,2.0,0.0,1,0.0,8.0,...,1,1,2.0,2.0,0.0,,,,0.0,0
6,1,2.0,6.0,4.0,,2.0,0.0,2,0.0,8.0,...,0,0,3.0,5.0,0.0,,,1.0,0.0,0
7,0,0.0,5.0,4.0,2.0,2.0,0.0,4,4.0,8.0,...,1,1,2.0,3.0,1.0,,,,0.0,0
8,0,8.0,7.0,,2.0,7.0,0.0,3,3.0,5.0,...,1,1,0.0,3.0,0.0,1.0,0.0,1.0,0.0,0
9,0,7.0,6.0,,2.0,2.0,0.0,3,4.0,5.0,...,1,1,0.0,3.0,0.0,1.0,0.0,1.0,0.0,0


categorical features:  ['HOSPITAL', 'ACC_TYPE', 'AGE', 'AIR_BAG', 'ALC_STATUS', 'BODY_TYP', 'CARGO_BT', 'DAY_WEEK', 'DEFORMED', 'DR_ZIP', 'EJECTION', 'HARM_EV', 'HIT_RUN', 'HOUR', 'IMPACT1', 'INJ_SEV', 'INT_HWY', 'J_KNIFE', 'LGT_COND', 'MAKE', 'MAK_MOD', 'MAN_COLL', 'MAX_SEV', 'MAX_VSEV', 'MODEL', 'MONTH', 'M_HARM', 'NUMOCCS', 'NUM_INJ', 'NUM_INJV', 'PCRASH4', 'PCRASH5', 'PERMVIT', 'PER_TYP', 'PJ', 'PSU', 'PVH_INVL', 'P_CRASH1', 'P_CRASH2', 'REGION', 'RELJCT1', 'RELJCT2', 'REL_ROAD', 'REST_MIS', 'REST_USE', 'ROLINLOC', 'ROLLOVER', 'SEAT_POS', 'SEX', 'SPEC_USE', 'SPEEDREL', 'TOWED', 'TOW_VEH', 'TYP_INT', 'URBANICITY', 'VALIGN', 'VEH_AGE', 'VE_FORMS', 'VE_TOTAL', 'VPROFILE', 'VSPD_LIM', 'VSURCOND', 'VTCONT_F', 'VTRAFCON', 'VTRAFWAY', 'WEATHER', 'WRK_ZONE']
Iteration  0  of  4 , feature  ACC_TYPE
Iteration  0  of  4 , feature  AGE


KeyboardInterrupt: 

In [None]:
def Impute_Round_Robin(data):
    print ('Impute_Round_Robin()')
    pd.set_option('display.max_columns', None)
    
    # Replace the 'Unknown' code (99) with np.nan, 
    #     working on a copy so the caller's DataFrame is not modified
#    data.replace({'Unknown': np.nan}, inplace=True)
    data = data.replace({99: np.nan})
    display(data.head(20))
    print ()
    
    # Make a list of features with missing samples, 
    #     ordered by the number of missing samples, 
    #     from least to most.  
    Missing = []
    Complete = []
    for feature in data:
        s = data[feature].isna().sum()
        if s==0:
            Complete.append([feature, s])
        if s>0:
            Missing.append([feature, s])
    Missing = sorted (Missing, key=lambda x:x[1], reverse=False)
#    print ()
#    print ('Complete[]')
#    display(Complete)
#    print ()
#    print ('Missing[]')
#    display(Missing)
#    print ()
    
#    print ('Make data_Mode')
#    print ()
    data_Mode = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Mode[feature] = data[feature]
    for M in Missing:
        feature = M[0]
        m = data[feature].mode()[0]
#        print (feature, M[1], m)
        data_Mode[feature] = data[feature].fillna(m)
#    print ('data_Mode')
    display(data_Mode.head(20))

#    print ()
#    print ('Make starting point for data_Imputed')
    data_Imputed = pd.DataFrame()
    for X in Complete:
        feature = X[0]
        data_Imputed[feature] = data[feature]
    for X in Missing:
        feature = X[0]
        data_Imputed[feature] = data_Mode[feature]
#    print ('data_Imputed')
#    display(data_Imputed.head(20))
#    print ()

    print ('Start Loop')
    print ()
    n = 0
    for M in Missing:
        n += 1
        print (M)
        feature = M[0]
        data_Imputed[feature] = data[feature]
#        print ()
#        print ('data[feature].isna().sum()')
#        print (data[feature].isna().sum())
#        print ('data_Imputed[feature].isna().sum()')
#        print (data_Imputed[feature].isna().sum())
#        print ()
        # W: rows where this feature is observed (training rows)
        W = data_Imputed.dropna(subset=[feature])
        # X, y: predictors and target for training
        X = W.drop(columns=feature)
        y = W[feature]
        # Z: rows where this feature is missing (rows to impute)
        Z = data_Imputed[data_Imputed[feature].isna()].drop(columns=feature)
#        Z.reset_index(drop=True, inplace=True)
#        print (data.shape)
#        print (X.shape)
#        display(X.head(40))
#        display(y.head(40))
#        print (Z.shape)
#        display(Z)
        clf = RandomForestClassifier(max_depth=2, random_state=random_seed)
        clf.fit(X,y)
#        print ('clf.predict(Z)')
        z = clf.predict(Z)
#        print (len(z))
#        display(z)
        Z[feature] = z
#        display(Z)
        data_Imputed = pd.concat([Z, W])
#        display(data_Imputed.head(60))
#        print (data_Imputed.shape)
#        print ()
#        data_Imputed.sort_values(
#            by = ['CASENUM', 'VEH_NO', 'PER_NO'], 
#            ascending = [True, True, True], 
#            inplace=True
#        )
#        print ()
#        print ('data.PER_NO.equals(data_Imputed.PER_NO)')
#        print (data.PER_NO.equals(data_Imputed.PER_NO))
#        print ()
               
        Check_Feature(data, data_Imputed, feature)
#        if n==10:
#            return data_Imputed
    
    
    display(data_Imputed.head(20))

    print ('Finished Impute_Round_Robin()')
    print ()
    return data_Imputed

In [None]:
def Check(data, data_Imputed):
    print ('Check()')
    Features = data.columns
    print (Features)
    for feature in Features:
        U = pd.unique(data[feature]).tolist()
        print (U)
        A = []
        for u in U:
            a = len(data[data[feature]==u])
            b = len(data_Imputed[data_Imputed[feature]==u])
            A.append([u, a, b])
        display(A)
        print ()
    print ('Finished Check()')
    print ()


In [None]:
def Check_Feature(data, data_Imputed, feature):
    print ('Check_Feature()')
    U = pd.unique(data[feature]).tolist()
    U = [x for x in U if x == x]
    print (U)
    A = []
    for u in U:
        a = len(data[data[feature]==u])
        b = len(data_Imputed[data_Imputed[feature]==u])
        A.append([u, a, b, b-a])
    a = data[feature].isna().sum()
    b = data_Imputed[feature].isna().sum()
    A.append(['NaN', a, b, b-a])
    A = pd.DataFrame(A, columns=['Value', 'Original', 'Imputed', 'Difference'])
    display(A)
    
    print ('Finished Check_Feature()')
    print ()


In [None]:
def Impute_Randomly(data):
    print ('Impute_Randomly()')
    print ()
    
    # Each missing value is filled with a value drawn at random (with replacement)
    #     from the observed values of the same feature, 
    #     so the rows do not need to be shuffled first.
    for feature in data:
        print (feature)
        # Observed (non-missing) values of this feature
        dfA = data[feature].dropna()
#        print ('Original Value Counts')
#        print (dfA.value_counts(normalize=True))
        # Draw one candidate value for every row of the dataset
        dfA = dfA.sample(n = len(data), replace=True)
#        print ('Value Counts after Sampling')
#        print (dfA.value_counts(normalize=True))
        # Give the draws the same index as data so that fillna lines them up row by row
        dfA.index = data.index
        # Assign the result back to the column rather than 
        #     calling fillna(inplace=True) on data[feature], 
        #     which may act on a temporary copy
        data[feature] = data[feature].fillna(dfA)
        
    print ('Finished Impute_Randomly()')
    print ()
    return data
        
def Test_Impute_Randomly():
    Dict = {
        'A':[0,0,0,1,np.nan],
        'B':[1,2,3,4,np.nan]
    }
    
    data = pd.DataFrame(Dict)
    display(data)
    data = Impute_Randomly(data)
    display(data)
    
#Test_Impute_Randomly()
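
The idea behind Impute_Randomly() can be shown on a toy DataFrame.  The block below is a minimal sketch, not the notebook's function: the name `impute_randomly_sketch` and the `seed` parameter are assumptions added for illustration, and it uses NumPy's random generator so that the result is reproducible.

```python
import numpy as np
import pandas as pd

def impute_randomly_sketch(df, seed=0):
    """Fill each column's NaN with random draws from that column's observed values."""
    rng = np.random.default_rng(seed)   # seeded generator (assumption, for reproducibility)
    out = df.copy()                     # leave the caller's DataFrame untouched
    for col in out.columns:
        missing = out[col].isna()
        observed = out.loc[~missing, col].to_numpy()
        if missing.any() and len(observed) > 0:
            # One draw, with replacement, for every missing cell in this column
            out.loc[missing, col] = rng.choice(
                observed, size=int(missing.sum()), replace=True
            )
    return out

toy = pd.DataFrame({'A': [0, 0, 0, 1, np.nan], 'B': [1, 2, 3, 4, np.nan]})
filled = impute_randomly_sketch(toy)
print(filled)
```

Because the draws come from the observed values, a value that is common in a feature is proportionally more likely to be imputed, which is what makes this a useful baseline next to Mode imputation.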
        

In [8]:
def Create_data_NaN_Method_3(data_Ground_Truth):
    print ('Create_data_NaN_Method_3()')
    nRows = data_Ground_Truth.shape[0]
    nCols = data_Ground_Truth.shape[1]
    print ('nRows = ', nRows, ' nCols = ', nCols)
    
    Feature_NaN_Counts, Row_NaN_Counts = Analyze_Data()
    data_NaN_list = []
    for i in range (len(Row_NaN_Counts)):
        Ones = i
        Zeros = nCols - i
        Number = int(Row_NaN_Counts[i] * nRows + 1)
        for j in range (Number):
            New_row = [1]*Ones + [0]*Zeros
            random.shuffle(New_row)
            data_NaN_list.append(New_row.copy())
    
    data_NaN = pd.DataFrame(data_NaN_list, columns = data_Ground_Truth.columns)
#    print (data_NaN.shape)
    data_NaN = data_NaN.sample(n = nRows)
#    print (data_NaN.shape)
#    display(data_NaN.head(20))
#    display(data_NaN.tail(20))
    
#    print ('Distribution of number of NaN in each sample')
    A = data_NaN.sum(axis=1)
    Row_NaN_Counts = A.value_counts(normalize=True)
#    display(Row_NaN_Counts)
#    Row_NaN_Counts = Row_NaN_Counts.to_list()

    Feature_NaN_Counts = [[x[0], x[1], 0, 0] for x in Feature_NaN_Counts]
    Feature_NaN_Counts = Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN)
#    for row in Feature_NaN_Counts:
#        print (row)
#    print ()
    
    old_give = ''
    old_take = ''
    
    stop = False
    while stop == False:
        if Feature_NaN_Counts[0][3] > -0.001 or Feature_NaN_Counts[-1][3] < 0.001:
            stop = True

        give_feature = Feature_NaN_Counts[0][0]
        take_i = -1
        take_feature = Feature_NaN_Counts[take_i][0]
        nGive = int(round(-1/100 * Feature_NaN_Counts[0][3] * nRows,0))
        nTake = int(round(1/100 * Feature_NaN_Counts[take_i][3] * nRows,0))
        
        mask = ((data_NaN[give_feature]==1) & (data_NaN[take_feature] == 0))
        Swap = data_NaN[mask]
        nSample = min([nGive, nTake, Swap.shape[0]])
        print ('give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample')
        print (give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample)
        while nSample==0:
            take_i = take_i - 1
            take_feature = Feature_NaN_Counts[take_i][0]
            nGive = int(round(-1/100 * Feature_NaN_Counts[0][3] * nRows,0))
            nTake = int(round(1/100 * Feature_NaN_Counts[take_i][3] * nRows,0))
            mask = ((data_NaN[give_feature]==1) & (data_NaN[take_feature] == 0))
            Swap = data_NaN[mask]
            nSample = min([nGive, nTake, Swap.shape[0]])
            print ('give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample')
            print (give_feature, nGive, take_i, take_feature, nTake, Swap.shape[0], nSample)
            print ()
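            if nSample==0 and take_i < -10:
                # No usable swap partner found; set the flag before leaving 
                #     the inner loop so that the outer loop also ends
                stop = True
                break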
        
#        if nSample == 0:
#            stop = True
        Swap = Swap.sample(n=nSample)
#        display(Swap[[give_feature, take_feature]])
        # Move the selected NaN flags from give_feature to take_feature in one step
        swap_index = Swap.index
        data_NaN.loc[swap_index, give_feature] = 0
        data_NaN.loc[swap_index, take_feature] = 1
            
#        print ()
#        data_NaN[give_feature],data_NaN[take_feature]=np.where(mask,(data_NaN[take_feature],data_NaN[give_feature]),(data_NaN[give_feature],data_NaN[take_feature]))
        Feature_NaN_Counts = Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN)

#    for row in Feature_NaN_Counts:
#        print (row)
#    print ()

    data_NaN = data_NaN.sample(frac=1)
#    print ('data_NaN')
#    display(data_NaN.head(10))
#    print ('data_NaN reindexed')
    data_NaN.reset_index(inplace=True, drop=True)
 #   display(data_NaN.head(10))
 #   print ('data_Ground_Truth')
 #   display(data_Ground_Truth.head(10))
 #   print ('data_NaN')
 #   display(data_NaN.head(10))
    data_NaN = data_Ground_Truth.where(data_NaN==0)
#    print ('data_NaN')
#    display(data_NaN.head(20))
#    display(data_NaN.tail(20))
    
    display(data_NaN.isna().sum())
#    print ('Finished Create_data_NaN_Method_3')

    print ('Finished Create_data_NaN_Method_3()')
    print ()
    
    return data_NaN
    

def Feature_NaN_Counts_Update(Feature_NaN_Counts, data_NaN):
    for row in Feature_NaN_Counts:
        feature = row[0]
        s = data_NaN[feature].sum()
        n = len(data_NaN)
#        print (feature, s, round(s/n*100,2))
        row[2] = round(s/n*100,6)
        row[3] = round(row[1] - row[2], 6)
    Feature_NaN_Counts = sorted(Feature_NaN_Counts, key=lambda x:x[3])
    
    return Feature_NaN_Counts

    
def Test_Create_data_NaN_Method_3():
    print ('Test_Create_data_NaN_Method_3()')
    data = Get_Data()
    print (data.shape)
    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
#    data_Ground_Truth.astype('Int64')

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head(10))
    Create_data_NaN_Method_3(data_Ground_Truth)
    print ('Finished Test_Create_data_NaN_Method_3')
    print ()

    
#Test_Create_data_NaN_Method_3()
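
The first half of Create_data_NaN_Method_3() builds a 0/1 mask whose rows follow a given distribution of "number of missing values per row," then applies it with `DataFrame.where`.  The block below is a simplified sketch of that half only: it does not perform the give/take swapping that matches the per-feature missing rates, and the function name, the `seed` parameter, and the example probabilities are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def make_mask_sketch(n_rows, columns, row_count_probs, seed=0):
    """Return a 0/1 DataFrame; row_count_probs[k] is the share of rows with k ones."""
    rng = np.random.default_rng(seed)
    n_cols = len(columns)
    # Number of cells to blank out in each row
    ks = rng.choice(len(row_count_probs), size=n_rows, p=row_count_probs)
    mask = np.zeros((n_rows, n_cols), dtype=int)
    for r, k in enumerate(ks):
        # Choose k distinct columns in this row
        mask[r, rng.choice(n_cols, size=k, replace=False)] = 1
    return pd.DataFrame(mask, columns=columns)

truth = pd.DataFrame(np.arange(40).reshape(10, 4), columns=list('WXYZ'))
mask = make_mask_sketch(len(truth), truth.columns, [0.5, 0.3, 0.2])

# Keep the ground-truth value where the mask is 0, NaN where it is 1
data_nan = truth.where(mask == 0)
print(data_nan)
```

Both frames need the same index and the same column labels for `where` to line up, which is why the notebook resets the index of both before this step.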

In [None]:
def Create_data_NaN(data_Ground_Truth):
    print ('Create_data_NaN()')
    
    # Create a 2d list of the same shape as "data_Ground_Truth"
    # Each row has "1" in about 15% of the columns, "0" otherwise
    #     By "15%," we mean "(# columns)*0.15" rounded to the nearest integer.
    #     The rows cycle through that count, one more, and one fewer, 
    #     so the average number of "1" per row stays at about 15%.
    # Each row is shuffled independently, 
    #     so each column also has about 15% of the samples 1.
    # Shuffle the rows.
    # Then shuffling the columns would be redundant.
    
    rows = data_Ground_Truth.shape[0]
    columns = data_Ground_Truth.shape[1]
    drops_in_row = int(round(columns*0.15,0))
    print ('drops_in_row = ', drops_in_row)
    
    Rand_Drop = []
    Row = [1]*(drops_in_row) + [0]*(columns - drops_in_row)

    for i in range (rows):
        if i%3==0:
            Row = [1]*(drops_in_row) + [0]*(columns - drops_in_row)
        elif i%3==1:
            Row = [1]*(drops_in_row + 1) + [0]*(columns - drops_in_row - 1)
        else:
            Row = [1]*(drops_in_row - 1) + [0]*(columns - drops_in_row + 1)
        random.shuffle(Row)
        Rand_Drop.append(Row.copy())

#    for i in range (rows):
#        if i%columns==0:
#            random.shuffle(Row)
#        Row.append(Row.pop(0))
#        Rand_Drop.append(Row.copy())

    random.shuffle(Rand_Drop)

#    for i in range (columns):
#        print (i, sum([x[i] for x in Rand_Drop]))

    # Turn the 2d list into a dataframe
        
    Rand_Drop_df = pd.DataFrame(Rand_Drop, columns=data_Ground_Truth.columns)
    display(Rand_Drop_df)
    
#    for feature in Rand_Drop_df:
#        print (feature, Rand_Drop_df[feature].sum())

    # Change the Ground Truth values to NaN where the corresponding value in Rand_Drop_df is 1
    data_NaN = data_Ground_Truth.where(Rand_Drop_df==0)
#    data_NaN = data_NaN.astype('Int')
    
    print ('data_NaN')
    display(data_NaN)
    
    print ('data_NaN.isna().sum()')
    display(data_NaN.isna().sum())
    
    print ('data_NaN.dropna().shape')
    print (data_NaN.dropna().shape)
    
    print ('Finished Create_data_NaN()')
    print ()

    return data_NaN
    
    
def Test_Create_data_NaN():
    print ('Test_Create_data_NaN()')
    data = Get_Data()
    print (data.shape)
    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
#    data_Ground_Truth.astype('Int32')

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    Create_data_NaN(data_Ground_Truth)
    
    print ('Finished Test_Create_data_NaN()')
    print ()

#Test_Create_data_NaN()


In [None]:
def Create_data_NaN_Old(data_Ground_Truth):
    """
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth.astype('int64')

    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
    data_Ground_Truth = data_Ground_Truth.astype('int64')
    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())
    """

    # Randomly pick 15% of the values from each feature (column)
    # and set them to be missing
    print ('Remove 15% of values from each feature')
    frac = .15
    data_NaN = data_Ground_Truth.copy(deep=True)
    N = int(len(data_NaN) * frac) # Number of NaN in each feature
    for c in data_NaN.columns:
        # replace=False so that exactly N distinct rows are chosen
        idx = np.random.choice(a=data_NaN.index, size=N, replace=False)
        data_NaN.loc[idx, c] = np.nan
#    for feature in data_NaN:
#        data_NaN[feature] = pd.to_numeric(data_NaN[feature])
#    data_NaN.astype('int64')


#    print ('data_NaN.shape')
#    print (data_NaN.shape)
#    display(data_NaN.head())
    
    data_NaN = data_NaN.astype('Int8')
    
#    data_NaN = data_NaN.sample(n=200000)
#    print (data_NaN.shape)
#    print (data_NaN.head(20))
    
    return data_NaN

#Create_data_NaN_Old()

    

# Compare Imputation Methods

## Mode Imputation
## Random Forest Imputation
## Prepare Data for IVEware
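
As a small illustration of two of the steps in the cell below, the following sketch shows Mode imputation and the tab-delimited export with blanks for missing values on a toy DataFrame.  It writes to an in-memory buffer rather than to '../../Big_Files/data_IVEware.txt', and the toy data and variable names are assumptions for illustration.

```python
import io
import numpy as np
import pandas as pd

toy = pd.DataFrame({'A': [0, 0, 1, np.nan], 'B': [2, np.nan, 2, 3]})

# Mode imputation: fill each column with its most frequent observed value.
# toy.mode() returns a DataFrame; its first row holds one mode per column.
toy_mode = toy.fillna(toy.mode().iloc[0])
print(toy_mode)

# Tab-delimited text with an empty field wherever a value is missing
buffer = io.StringIO()
toy.to_csv(buffer, sep='\t', index=False, na_rep='')
text = buffer.getvalue()
print(text)
```

When a column has more than one mode, `mode()` lists them in sorted order, so taking the first row picks the smallest of the tied values.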

In [None]:
def Compare_Imputation_Methods_Part_1():
    print ('Compare_Imputation_Methods_Part_1()')
    data = Get_Data()
    print (data.shape)

    # Drop all samples with missing data, so we have ground truth
    data_Ground_Truth = data.replace({99:np.nan})
    data_Ground_Truth = data_Ground_Truth.dropna()
    data_Ground_Truth.reset_index(inplace=True, drop=True)
    for feature in data_Ground_Truth:
        data_Ground_Truth[feature] = pd.to_numeric(data_Ground_Truth[feature])
#    data_Ground_Truth.astype('Int32')

    print ('data_Ground_Truth.shape')
    print (data_Ground_Truth.shape)
    data_Ground_Truth = data_Ground_Truth.sample(n=200000)
    data_Ground_Truth.reset_index(inplace=True, drop=True)

    print ('data_Ground_Truth.shape after resampling')
    print (data_Ground_Truth.shape)
    display(data_Ground_Truth.head())

    print ('Remove values from each row')
    data_NaN = Create_data_NaN_Method_3(data_Ground_Truth)
    
    print ('data_NaN.shape')
    print (data_NaN.shape)
    display(data_NaN.head(10))
    display(data_NaN.tail(10))
    
    # Perform MissForest imputation
    data_MF = data_NaN.copy(deep=True)
#    data_MF = data_NaN_Old.copy(deep=True)
#    data_MF = data_MF.astype('Int8')
    print ('data_MF')
    display(data_MF.head(20))
    data_MF = Impute_MissForest(data_MF)
    data_MF.sort_index(inplace=True)
    data_MF = data_MF[data.columns]  
#    data_MF = data_MF.astype('Int32')
    
    print ('data_MF.shape')
    print (data_MF.shape)
    display(data_MF.head())
#    print ()

    
    # Create .txt file to feed into IVEware imputation
    data_IVEware = data_NaN.copy(deep=True)
#    data_IVEware = data_IVEware.astype('str')
    data_IVEware = data_IVEware.fillna('')
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    data_Mode = pd.DataFrame()
    for feature in data_NaN:
        data_Mode[feature] = data_NaN[feature].fillna(data_NaN[feature].mode()[0])
    data_Mode = data_Mode.astype('Int32')
    print ('data_Mode.shape')
    print (data_Mode.shape)
    display(data_Mode.head())
    
    # Perform Round Robin imputation using Random Forest Classifier
    data_RF = Impute_Round_Robin(data_NaN)
    data_RF.sort_index(inplace=True)
    data_RF = data_RF[data.columns]  
    data_RF = data_RF.astype('Int32')
    
    print ('data_RF.shape')
    print (data_RF.shape)
    display(data_RF.head())
#    print ()

    # Impute randomly
    data_Random = data_NaN.copy(deep=True)
    data_Random = Impute_Randomly(data_Random)
    data_Random = data_Random.astype('Int32')
    
    print ('data_Random.shape')
    print (data_Random.shape)
    display(data_Random.head())
#    print ()

    print ('Finished Compare_Imputation_Methods_Part_1()')
    print ()

    return data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random

In [None]:
%%time 
data_Ground_Truth, data_NaN, data_RF, data_MF, data_Mode, data_Random = Compare_Imputation_Methods_Part_1()

## Do IVEware Imputation (Outside this Jupyter Notebook)
- Go to the IVEware folder and run IVE_12_22_22.bat at the command line.
- Requires srclib and R.  You may need to change the path to your srclib installation in the batch file.
- Notes to self:
    - Open srcshell
    - From srcshell, open IVEware_CRSS_Imputation.xml
    - Run
- Run time: ./IVEware_CRSS_Imputation.bat  1069.08s user 12.92s system 98% cpu 18:23.92 total

In [None]:
data_IVEware_seed_0 = pd.read_csv('../../Big_Files/data_IVEware_Compare_seed_0.csv')
data_IVEware_seed_0.drop(columns='Unnamed: 0', inplace=True)

data_IVEware_seed_1 = pd.read_csv('../../Big_Files/data_IVEware_Compare_seed_1.csv')
data_IVEware_seed_1.drop(columns='Unnamed: 0', inplace=True)

print ('data_Ground_Truth', data_Ground_Truth.shape)
display(data_Ground_Truth.head(10))
print ('data_NaN', data_NaN.shape)
display(data_NaN.head(10))
print ('data_RF', data_RF.shape)
display(data_RF.head(10))
print ('data_MF', data_MF.shape)
display(data_MF.head(10))
print ('data_Mode', data_Mode.shape)
display(data_Mode.head(10))
print ('data_IVEware_seed_0', data_IVEware_seed_0.shape)
display(data_IVEware_seed_0.head(10))
print ('data_IVEware_seed_1', data_IVEware_seed_1.shape)
display(data_IVEware_seed_1.head(10))
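
Because the IVEware results come from outside the notebook, it may help to check that they line up with data_NaN before comparing methods.  The block below is a hedged sketch of such a check; the function name `check_imputed_frame` and the toy frames are assumptions, and the comparison is positional (it ignores index labels).

```python
import numpy as np
import pandas as pd

def check_imputed_frame(data_nan, imputed):
    """Return a list of problems found; an empty list means the frames line up."""
    problems = []
    if list(imputed.columns) != list(data_nan.columns):
        problems.append('columns differ')
    if imputed.shape != data_nan.shape:
        problems.append('shape differs')
    if not problems:
        if imputed.isna().to_numpy().any():
            problems.append('imputed frame still has NaN')
        # Cells that were observed before imputation must be unchanged
        observed = data_nan.notna().to_numpy()
        before = data_nan.to_numpy(dtype=float)[observed]
        after = imputed.to_numpy(dtype=float)[observed]
        if not (before == after).all():
            problems.append('observed values were changed')
    return problems

toy_nan = pd.DataFrame({'A': [1, np.nan, 3], 'B': [np.nan, 5, 6]})
toy_good = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
toy_bad = pd.DataFrame({'A': [9, 2, 3], 'B': [4, 5, 6]})

print(check_imputed_frame(toy_nan, toy_good))
print(check_imputed_frame(toy_nan, toy_bad))
```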


## Compare Six Imputation Methods
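
The comparison table in the next cell counts, for each feature, how many values differ between every pair of datasets, first against the ground truth and then between methods.  The block below is a compact sketch of that counting using `itertools.combinations`; the function name, the column labels, and the toy data are assumptions, and it is not a replacement for the function below.

```python
from itertools import combinations
import numpy as np
import pandas as pd

def pairwise_mismatch_sketch(frames, data_nan):
    """frames: dict of name -> DataFrame, with the ground truth listed first."""
    rows = []
    for feature in data_nan.columns:
        n_nan = int(data_nan[feature].isna().sum())
        row = {'Feature': feature, 'nNaN': n_nan}
        # Pairs come out in insertion order: truth vs each method, then method vs method
        for (name_a, a), (name_b, b) in combinations(frames.items(), 2):
            n_diff = int((a[feature].to_numpy() != b[feature].to_numpy()).sum())
            row[name_a + ' vs ' + name_b + ' n'] = n_diff
            # Express the count as a percentage of the values that were imputed
            row[name_a + ' vs ' + name_b + ' %'] = (
                round(n_diff / n_nan * 100, 4) if n_nan else np.nan
            )
        rows.append(row)
    return pd.DataFrame(rows)

toy_truth = pd.DataFrame({'A': [1, 2, 3, 4]})
toy_nan = pd.DataFrame({'A': [1, np.nan, 3, np.nan]})
toy_rf = pd.DataFrame({'A': [1, 2, 3, 1]})
toy_mode = pd.DataFrame({'A': [1, 1, 3, 1]})

table = pairwise_mismatch_sketch(
    {'Truth': toy_truth, 'RF': toy_rf, 'Mode': toy_mode}, toy_nan
)
print(table)
```

Since the observed values are identical in every dataset, any difference can only occur in a cell that was imputed, which is why the counts are divided by nNaN rather than by the number of rows.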

In [None]:
def Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, 
    data_NaN, 
    data_RF, 
    data_MF, 
    data_Mode, 
    data_Random, 
    data_IVEware_seed_0, 
    data_IVEware_seed_1
):
    print ('Compare_Imputation_Methods_Part_2')
    
    """
    print ('Drop Multicollinear Features')
    Drop = ['MAX_VSEV', 'VE_FORMS', 'VTCONT_F', 'MAX_SEV', 'NUM_INJV']
    DF = [data_Ground_Truth, data_NaN, data_RF, data_Mode, data_Random, data_IVEware]
    
    for df in DF:
        for feature in Drop:
            if feature in df:
                df.drop(columns=[feature], inplace=True)
                print ('Drop ', feature)
    print ()
    """
    
    Datasets = [
        ['data_Ground_Truth', data_Ground_Truth],
        ['data_NaN', data_NaN],
        ['data_RF', data_RF],
        ['data_MF', data_MF],
        ['data_Mode', data_Mode],
        ['data_Random', data_Random],
        ['data_IVEware_seed_0', data_IVEware_seed_0],
        ['data_IVEware_seed_1', data_IVEware_seed_1],
    ]
    
    for dataset in Datasets:
        name = dataset[0]
        data = dataset[1]
        print (name, '.shape: ', data.shape)
    print ()
    
    for dataset in Datasets:
        name = dataset[0]
        data = dataset[1]
        print (name)
        display(data.head())
    print ()
    
    Datasets = [
        ['data_Ground_Truth', data_Ground_Truth],
#        ['data_NaN', data_NaN],
        ['data_RF', data_RF],
        ['data_MF', data_MF],
        ['data_Mode', data_Mode],
        ['data_Random', data_Random],
        ['data_IVEware_seed_0', data_IVEware_seed_0],
        ['data_IVEware_seed_1', data_IVEware_seed_1],
    ]
    
    A = []
    for feature in data_NaN:
        B = []
        B.append(feature)
        B.append(data_NaN[feature].isna().sum())
        for i in range (len(Datasets)-1):
            for j in range (i+1, len(Datasets)):
                C = (Datasets[i][1][feature] != Datasets[j][1][feature]).sum()
                D = round(C/B[1]*100,4)
                B.append(C)
#                print ('B[', len(B)-1, '] counts differences between ', Datasets[i][0], ' and ', Datasets[j][0], '.')
                B.append(D)
#                print ('B[', len(B)-1, '] gives the count as a percentage.')
#        print ()
        
#        print (B)
        A.append(B)
    print ()
    
    A = sorted(A, key=lambda x:x[3])
    B = pd.DataFrame(
        A, 
        columns=[
            'Feature', # 0
            'nNaN',  # 1
            'nRF Incorrect', 'pRF Incorrect', # 2, 3
            'nMF Incorrect', 'pMF Incorrect', # 4, 5
            'nMode Incorrect', 'pMode Incorrect', # 6, 7
            'nRandom Incorrect', 'pRandom Incorrect', # 8, 9
            'nIVEware_seed_0 Incorrect', 'pIVEware_seed_0 Incorrect', # 10, 11
            'nIVEware_seed_1 Incorrect', 'pIVEware_seed_1 Incorrect', # 12, 13
            'RF and MF Different', 'RF v/s MF %', # 14, 15
            'RF and Mode Different', 'RF v/s Mode %', # 16, 17
            'RF and Random Different', 'RF v/s Random %', # 18, 19
            'RF and IVEware_seed_0 Different', 'RF v/s IVEware_seed_0 %', # 20, 21
            'RF and IVEware_seed_1 Different', 'RF v/s IVEware_seed_1 %', # 22, 23
            'MF and Mode Different', 'MF v/s Mode %', # 24, 25
            'MF and Random Different', 'MF v/s Random %', # 26, 27
            'MF and IVEware_seed_0 Different', 'MF v/s IVEware_seed_0 %', # 28, 29
            'MF and IVEware_seed_1 Different', 'MF v/s IVEware_seed_1 %', # 30, 31
            'Mode and Random Different', 'Mode v/s Random %', # 32, 33
            'Mode and IVEware_seed_0 Different', 'Mode v/s IVEware_seed_0 %', # 34, 35
            'Mode and IVEware_seed_1 Different', 'Mode v/s IVEware_seed_1 %', # 36, 37
            'Random and IVEware_seed_0 Different', 'Random v/s IVEware_seed_0 %', # 38, 39
            'Random and IVEware_seed_1 Different', 'Random v/s IVEware_seed_1 %', # 40, 41
            'IVEware_seed_0 and IVEware_seed_1 Different', 'IVEware_seed_0 v/s IVEware_seed_1 %', # 42, 43
        ]
    )
    display(B)
    a = sum([x[1] for x in A]) # nNaN
    b = sum([x[2] for x in A]) # nRF Incorrect
    c = sum([x[4] for x in A]) # nMF Incorrect
    d = sum([x[6] for x in A]) # nMode Incorrect
    e = sum([x[8] for x in A]) # nRandom Incorrect
    f = sum([x[10] for x in A]) # nIVEware_seed_0 Incorrect
    g = sum([x[12] for x in A]) # nIVEware_seed_1 Incorrect
    h = round(b/a*100,2)
    i = round(c/a*100,2)
    j = round(d/a*100,2)
    k = round(e/a*100,2)
    l = round(f/a*100,2)
    m = round(g/a*100,2)

    RF_less_MF = sum([x[2] < x[4] for x in A])
    RF_equal_MF = sum([x[2] == x[4] for x in A])
    RF_greater_MF = sum([x[2] > x[4] for x in A])
    
    RF_less_Mode = sum([x[2] < x[6] for x in A])
    RF_equal_Mode = sum([x[2] == x[6] for x in A])
    RF_greater_Mode = sum([x[2] > x[6] for x in A])

    RF_less_Random = sum([x[2] < x[8] for x in A])
    RF_equal_Random = sum([x[2] == x[8] for x in A])
    RF_greater_Random = sum([x[2] > x[8] for x in A])

    RF_less_IVEware_seed_0 = sum([x[2] < x[10] for x in A])
    RF_equal_IVEware_seed_0 = sum([x[2] == x[10] for x in A])
    RF_greater_IVEware_seed_0 = sum([x[2] > x[10] for x in A])

    RF_less_IVEware_seed_1 = sum([x[2] < x[12] for x in A])
    RF_equal_IVEware_seed_1 = sum([x[2] == x[12] for x in A])
    RF_greater_IVEware_seed_1 = sum([x[2] > x[12] for x in A])

    MF_less_Mode = sum([x[4] < x[6] for x in A])
    MF_equal_Mode = sum([x[4] == x[6] for x in A])
    MF_greater_Mode = sum([x[4] > x[6] for x in A])

    MF_less_Random = sum([x[4] < x[8] for x in A])
    MF_equal_Random = sum([x[4] == x[8] for x in A])
    MF_greater_Random = sum([x[4] > x[8] for x in A])

    MF_less_IVEware_seed_0 = sum([x[4] < x[10] for x in A])
    MF_equal_IVEware_seed_0 = sum([x[4] == x[10] for x in A])
    MF_greater_IVEware_seed_0 = sum([x[4] > x[10] for x in A])

    MF_less_IVEware_seed_1 = sum([x[4] < x[12] for x in A])
    MF_equal_IVEware_seed_1 = sum([x[4] == x[12] for x in A])
    MF_greater_IVEware_seed_1 = sum([x[4] > x[12] for x in A])

    Mode_less_Random = sum([x[6] < x[8] for x in A])
    Mode_equal_Random = sum([x[6] == x[8] for x in A])
    Mode_greater_Random = sum([x[6] > x[8] for x in A])

    Mode_less_IVEware_seed_0 = sum([x[6] < x[10] for x in A])
    Mode_equal_IVEware_seed_0 = sum([x[6] == x[10] for x in A])
    Mode_greater_IVEware_seed_0 = sum([x[6] > x[10] for x in A])

    Mode_less_IVEware_seed_1 = sum([x[6] < x[12] for x in A])
    Mode_equal_IVEware_seed_1 = sum([x[6] == x[12] for x in A])
    Mode_greater_IVEware_seed_1 = sum([x[6] > x[12] for x in A])

    Random_less_IVEware_seed_0 = sum([x[8] < x[10] for x in A])
    Random_equal_IVEware_seed_0 = sum([x[8] == x[10] for x in A])
    Random_greater_IVEware_seed_0 = sum([x[8] > x[10] for x in A])

    Random_less_IVEware_seed_1 = sum([x[8] < x[12] for x in A])
    Random_equal_IVEware_seed_1 = sum([x[8] == x[12] for x in A])
    Random_greater_IVEware_seed_1 = sum([x[8] > x[12] for x in A])

    IVEware_seed_0_less_IVEware_seed_1 = sum([x[10] < x[12] for x in A])
    IVEware_seed_0_equal_IVEware_seed_1 = sum([x[10] == x[12] for x in A])
    IVEware_seed_0_greater_IVEware_seed_1 = sum([x[10] > x[12] for x in A])

    print ()
    print ('    | | Number | Percentage |')
    print ('    | --- | --- | --- | ')    
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% | ')
    print ('    | RF | ', f'{b:,d}', ' | ', h, '% | ')
    print ('    | MF | ', f'{c:,d}', ' | ', i, '% | ')
    print ('    | Mode | ', f'{d:,d}', ' | ', j, '% | ')
    print ('    | Random | ', f'{e:,d}', ' | ', k, '% | ')
    print ('    | IVEware_seed_0 | ', f'{f:,d}', ' | ', l, '% | ')
    print ('    | IVEware_seed_1 | ', f'{g:,d}', ' | ', m, '% | ')
    print ()
    print ('    |  | Fewer | Equal | More | Total | ')
    print ('    | --- | --- | --- | --- | --- | ')
    print ('    | Compare RF to MF | ', RF_less_MF, ' | ', RF_equal_MF,  ' | ' ,RF_greater_MF,  ' |', len(A), ' |' )
    print ('    | Compare RF to Mode | ', RF_less_Mode, ' | ', RF_equal_Mode,  ' | ' ,RF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare RF to Random | ', RF_less_Random, ' | ' , RF_equal_Random,  ' | ' , RF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware_seed_0 | ', RF_less_IVEware_seed_0, ' | ' , RF_equal_IVEware_seed_0, ' | ' , RF_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare RF to IVEware_seed_1 | ', RF_less_IVEware_seed_1, ' | ' , RF_equal_IVEware_seed_1, ' | ' , RF_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare MF to Mode | ', MF_less_Mode, ' | ', MF_equal_Mode,  ' | ' ,MF_greater_Mode,  ' |', len(A), ' |' )
    print ('    | Compare MF to Random | ', MF_less_Random, ' | ' , MF_equal_Random,  ' | ' , MF_greater_Random,  ' |', len(A), ' |' )
    print ('    | Compare MF to IVEware_seed_0 | ', MF_less_IVEware_seed_0, ' | ' , MF_equal_IVEware_seed_0, ' | ' , MF_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare MF to IVEware_seed_1 | ', MF_less_IVEware_seed_1, ' | ' , MF_equal_IVEware_seed_1, ' | ' , MF_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare Mode to Random | ', Mode_less_Random, ' | ' , Mode_equal_Random, ' | ' , Mode_greater_Random, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware_seed_0 | ', Mode_less_IVEware_seed_0, ' | ' , Mode_equal_IVEware_seed_0, ' | ' , Mode_greater_IVEware_seed_0, ' |', len(A), ' |' )
    print ('    | Compare Mode to IVEware_seed_1 | ', Mode_less_IVEware_seed_1, ' | ' , Mode_equal_IVEware_seed_1, ' | ' , Mode_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare Random to IVEware_seed_0 | ', Random_less_IVEware_seed_0, ' | ' , Random_equal_IVEware_seed_0, ' | ' , Random_greater_IVEware_seed_0, ' |', len(A), ' |' )
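    print ('    | Compare Random to IVEware_seed_1 | ', Random_less_IVEware_seed_1, ' | ' , Random_equal_IVEware_seed_1, ' | ' , Random_greater_IVEware_seed_1, ' |', len(A), ' |' )
    print ('    | Compare IVEware_seed_0 to IVEware_seed_1 | ', IVEware_seed_0_less_IVEware_seed_1, ' | ' , IVEware_seed_0_equal_IVEware_seed_1, ' | ' , IVEware_seed_0_greater_IVEware_seed_1, ' |', len(A), ' |' )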
    print ()
    
    b = sum([x[14] for x in A])
    c = sum([x[16] for x in A])
    d = sum([x[18] for x in A])
    e = sum([x[20] for x in A])
    f = sum([x[22] for x in A])
    g = sum([x[24] for x in A])
    h = sum([x[26] for x in A])
    i = sum([x[28] for x in A])
    j = sum([x[30] for x in A])
    k = sum([x[32] for x in A])
    l = sum([x[34] for x in A])
    m = sum([x[36] for x in A])
    n = sum([x[38] for x in A])
    o = sum([x[40] for x in A])
    p = sum([x[42] for x in A])
    
    print ('    |  | Number |  Percentage |')
    print ('    | --- | --- | --- |')
    print ('    | Total NaN | ', f'{a:,d}', ' | 100% |' )
    print ('    | RF Different from MF | ', f'{b:,d}', ' | ', round(b/a*100,2), '% |')
    print ('    | RF Different from Mode | ', f'{c:,d}', ' | ', round(c/a*100,2), '% |')
    print ('    | RF Different from Random | ', f'{d:,d}', ' | ', round(d/a*100,2), '% |')
    print ('    | RF Different from IVEware_seed_0 | ', f'{e:,d}', ' | ', round(e/a*100,2), '% |')
    print ('    | RF Different from IVEware_seed_1 | ', f'{f:,d}', ' | ', round(f/a*100,2), '% |')
    print ('    | MF Different from Mode | ', f'{g:,d}', ' | ', round(g/a*100,2), '% |')
    print ('    | MF Different from Random | ', f'{h:,d}', ' | ',  round(h/a*100,2), '% |')
    print ('    | MF Different from IVEware_seed_0 | ', f'{i:,d}', ' | ',  round(i/a*100,2), '% |')
    print ('    | MF Different from IVEware_seed_1 | ', f'{j:,d}', ' | ', round(j/a*100,2), '% |')
    print ('    | Mode Different from Random | ', f'{k:,d}', ' | ', round(k/a*100,2), '% |')
    print ('    | Mode Different from IVEware_seed_0 | ', f'{l:,d}', ' | ', round(l/a*100,2), '% |')
    print ('    | Mode Different from IVEware_seed_1 | ', f'{m:,d}', ' | ', round(m/a*100,2), '% |')
    print ('    | Random Different from IVEware_seed_0 | ', f'{n:,d}', ' | ', round(n/a*100,2), '% |')
    print ('    | Random Different from IVEware_seed_1 | ', f'{o:,d}', ' | ', round(o/a*100,2), '% |')
    print ('    | IVEware_seed_0 Different from IVEware_seed_1 | ', f'{p:,d}', ' | ', round(p/a*100,2), '% |')
    print ()
        
#    display(Audio(sound_file, autoplay=True))
    
    print ('Finished Compare_Imputation_Methods_Part_2')



In [None]:
Compare_Imputation_Methods_Part_2(
    data_Ground_Truth, data_NaN, 
    data_RF, data_MF, 
    data_Mode, data_Random, 
    data_IVEware_seed_0, data_IVEware_seed_1
)
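
The pairwise "Fewer / Equal / More" counts above are written out by hand for every pair of methods, which makes it easy to mislabel a row or leave a pair out. As a minimal sketch, assuming each row of `A` holds the per-method values at the same even indices used above (2 for RF, 4 for MF, 6 for Mode, 8 for Random, 10 and 12 for the two IVEware seeds), the same table could be generated from one mapping. The names `METHOD_COLUMNS`, `compare_pairs`, and `A_toy` are hypothetical and appear nowhere else in this notebook.

```python
from itertools import combinations

# Hypothetical mapping: method name -> index of that method's value in a row of A
METHOD_COLUMNS = {
    'RF': 2, 'MF': 4, 'Mode': 6, 'Random': 8,
    'IVEware_seed_0': 10, 'IVEware_seed_1': 12,
}

def compare_pairs(A, columns):
    """For every pair of methods, count the rows of A where the first
    method's value is less than, equal to, or greater than the second's."""
    rows = []
    for (name_1, i), (name_2, j) in combinations(columns.items(), 2):
        fewer = sum(x[i] < x[j] for x in A)
        equal = sum(x[i] == x[j] for x in A)
        more = sum(x[i] > x[j] for x in A)
        rows.append((name_1, name_2, fewer, equal, more, len(A)))
    return rows

# Toy data only; these numbers are made up for illustration
A_toy = [
    [0, 0, 5, 0, 6, 0, 9, 0, 12, 0, 5, 0, 4],
    [0, 0, 3, 0, 3, 0, 8, 0, 10, 0, 4, 0, 4],
]

rows = compare_pairs(A_toy, METHOD_COLUMNS)

print ('    |  | Fewer | Equal | More | Total | ')
print ('    | --- | --- | --- | --- | --- | ')
for name_1, name_2, fewer, equal, more, total in rows:
    print (f'    | Compare {name_1} to {name_2} | {fewer} | {equal} | {more} | {total} |')
```

Because the labels and the indices come from the same dictionary, a label cannot drift away from the columns it describes, and all fifteen pairs are covered automatically.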

# Impute using Random Forest and Save for Next Step

In [None]:
def Impute_Using_Random_Forest():
    print ('Impute_Using_Random_Forest()')
    data = Get_Data()
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_Round_Robin(data)
    data_Imputed.to_csv('../../Big_Files/CRSS_Imputed_by_RF_Data.csv', index=False)
#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))

    print ('Finished Impute_Using_Random_Forest()')
    print ()
    
    return 0

Impute_Using_Random_Forest()

In [None]:
def Impute_Using_MissForest():
    print ('Impute_Using_MissForest()')
    data = Get_Data()
    
#    data_Imputed = Impute_Full(data)
    data_Imputed = Impute_MissForest(data)
    data_Imputed.to_csv('../../Big_Files/CRSS_Imputed_by_MF_Data.csv', index=False)
#    display(data_Imputed.head(50))
    
    Check(data, data_Imputed)
#    display(Audio(sound_file, autoplay=True))

    print ('Finished Impute_Using_MissForest()')
    print ()

    return 0

Impute_Using_MissForest()

In [None]:
def Impute_Using_IVEware():
    print ('Impute_Using_IVEware()')
    data = Get_Data()
    
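    # Create the tab-delimited .txt file to feed into the IVEware imputation.
    # This function only prepares the input; IVEware itself is run outside this notebook.
    # The value 99 ("Missing" or "Unknown") is written as a blank field.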
    data_IVEware = data.copy(deep=True)
    print (data_IVEware.shape)
    display(data_IVEware.head(10))
    
    data_IVEware = data_IVEware.replace(99,'')
    display(data_IVEware.head(10))
    data_IVEware.to_csv('../../Big_Files/data_IVEware.txt', sep='\t', index=False)
    
    print ('Finished Impute_Using_IVEware()')
    print ()
    
    return 0

Impute_Using_IVEware()
# About one hour
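
Before handing the file to IVEware, it may be worth confirming that every 99 really became a blank field. The following is a small sketch of such a round-trip check on a toy DataFrame, written to an in-memory buffer rather than to '../../Big_Files/data_IVEware.txt'. The names `toy`, `buffer`, and `round_trip` are hypothetical, and the check says nothing about how IVEware itself parses the file.

```python
import io
import pandas as pd

# Toy stand-in for the binned data; 99 marks "Missing" or "Unknown"
toy = pd.DataFrame({'A': [1, 99, 3], 'B': [99, 2, 99]})
n_missing = int((toy == 99).sum().sum())

# Same transformation as in Impute_Using_IVEware(), but written to memory
buffer = io.StringIO()
toy.replace(99, '').to_csv(buffer, sep='\t', index=False)

# Read the text back; pandas parses blank fields as NaN by default
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep='\t')
n_blank = int(round_trip.isna().sum().sum())

print (n_missing, n_blank)
assert n_blank == n_missing, 'Some 99 values were not written as blanks'
```

On the full dataset the same comparison could be made by reading the written file back in, at the cost of one more pass over the data.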