An attempt to recreate MIDA (Gondara and Wang, 2018) in Python, based on the R implementation at https://gist.github.com/lgondara/18387c5f4d745673e9ca8e23f3d7ebd3

# 1. Loading Dataset

## 1.1. Load a dataset and introduce missingness

Dataset used: Shuttle Dataset (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle))

### 1.1.1. Load the dataset and store it as a numeric dataframe

In [3]:
import pandas as pd

In [4]:
def get_dataframe_from_csv(filename, header_row=None):
    """
    Read a whitespace-delimited file (full path) and return its contents as a dataframe.
    
    TO DO: 
        -: Headerless files are currently read with header=None; decide how to handle files that do have a header
        -: Should the last column name be replaced with "label"?
        -: Add functionality for space-delimited or comma-delimited files
        -: Improve logging, make it module specific logging
    """
    assert isinstance(filename, str), "Input complete filename as a string"
    import pandas as pd
    import logging
    logger = logging.getLogger()
    logger.setLevel(logging.DEBUG)
    
    logging.info("Input filename has to be space separated data")
    
    # sep=r"\s+" splits on any run of whitespace (delim_whitespace=True is deprecated in recent pandas).
    # header_row is passed straight through, so data_orig is defined whether or not a header row is given.
    data_orig = pd.read_csv(filename, sep=r"\s+", header=header_row)
    return data_orig

In [5]:
#Test
filename = "data/shuttle/shuttle_trn"
train_df = get_dataframe_from_csv(filename)

INFO:root:Input filename has to be space separated data
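
The TO DO items about headers and delimiters could be handled with two extra arguments. The following is only a minimal sketch: `load_table`, `sep` and `has_header` are hypothetical names that do not appear in the original notebook, and the in-memory strings are made-up stand-ins for real files.

```python
import io

import pandas as pd


def load_table(source, sep=r"\s+", has_header=False):
    """Read a delimited text file (or file-like object) into a dataframe.

    sep        : delimiter passed to pandas; r"\s+" means any run of whitespace
    has_header : if True, the first row is used as column names
    """
    header = 0 if has_header else None
    return pd.read_csv(source, sep=sep, header=header)


# Made-up in-memory data standing in for files on disk
space_separated = io.StringIO("50 21 77 2\n55 0 92 4\n")
comma_with_header = io.StringIO("a,b,label\n1,2,0\n3,4,1\n")

df_space = load_table(space_separated)
df_comma = load_table(comma_with_header, sep=",", has_header=True)

print(df_space.shape)
print(list(df_comma.columns))
```

With `has_header=False` pandas assigns integer column names, which matches the headerless Shuttle file used here.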


In [6]:
train_df.head()

    0   1   2  3   4    5   6   7   8  9
0  50  21  77  0  28    0  27  48  22  2
1  55   0  92  0   0   26  36  92  56  4
2  53   0  82  0  52   -5  29  30   2  1
3  37   0  76  0  28   18  40  48   8  1
4  37   0  79  0  34  -26  43  46   2  1


In [7]:
train_df.dtypes

0    int64
1    int64
2    int64
3    int64
4    int64
5    int64
6    int64
7    int64
8    int64
9    int64
dtype: object

In [8]:
len(train_df)

43500

### 1.1.2. Inducing missingness

After loading the dataset, induce missingness.

To start off, introduce a simple random missing pattern (Missing Completely At Random, MCAR): sample half of the variables and set observations in those variables to missing wherever an appended random uniform vector has a value below a certain threshold. With a threshold of 0.2, about 20% of the rows should end up with missing values in the sampled variables.

In [53]:
def induce_missingness(dataframe, perc_variables_sampled=0.5, threshold=0.2, logger_level=20):
    """
    Steps:
        1. Append a random uniform vector to the dataframe
        2. Decide the threshold (default = 0.2, i.e. 20%)
        3. Sample variables (default = 50%)
        4. In the sampled variables (from 3), set a row to NaN if its value in the random column is below the threshold (2)
    
    """
    import pandas as pd
    import numpy as np
    import logging
    assert isinstance(dataframe, pd.DataFrame)
    logger = logging.getLogger()
    logger.setLevel(logger_level)
    
    RANDOM_SEED = 18
    np.random.seed(RANDOM_SEED) #Reproducibility
    
    observations_number, variables_number = dataframe.shape[0], dataframe.shape[1]-1 #-1 to account for the "label"
    sampled_dataframe = dataframe.iloc[:,:-1].sample(n = int(variables_number*perc_variables_sampled), axis = 1) #sample perc_variables_sampled, -1: to account for "label" 
    sampled_variables = list(sampled_dataframe.columns)
    sampled_dataframe["random"] = np.random.uniform(size = observations_number)
    
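    new_df = dataframe.copy() #explicit copy, so the caller's dataframe is never modified
    new_df.loc[sampled_dataframe["random"] < threshold, sampled_variables] = np.nan #np.nan: the np.NAN alias was removed in NumPy 2.0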
    
    logging.debug(f"\n{new_df.head()}")
    logging.debug(f"\n{sampled_dataframe.head()}")
    logging.debug(f"\n{dataframe.head()}")
    
    logging.info(" Returning new dataframe with missingness(MCAR) induced")
    
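    #Note: this evaluates to 1 - (total number of NaN cells / number of rows), which is neither the share of NaN cells nor the share of rows containing NaNs
    perc_of_nans = 1-sum(len(new_df)-new_df.count())/len(new_df)
    logging.info(f" Percentage of NaNs in returned dataframe : {perc_of_nans*100:.2f}")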
    
    return new_df

In [54]:
#test
df1 = train_df[:]
df2 = induce_missingness(df1,logger_level=20)

INFO:root: Returning new dataframe with missingness(MCAR) induced
INFO:root: Percentage of NaNs in returned dataframe : 20.89
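
The logged percentage comes from the expression `1 - sum(len(new_df) - new_df.count()) / len(new_df)`, which divides the total number of NaN cells by the number of rows. The sketch below, on a made-up toy frame (not the Shuttle data), shows how that value differs from two more direct measures: the fraction of missing cells and the fraction of rows with at least one missing value. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up toy frame: 5 rows, 4 columns, one row missing in columns "a" and "b"
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],
    "label": [0, 1, 0, 1, 0],
})

# Expression used in induce_missingness: 1 - (NaN cells / number of rows)
logged_value = 1 - sum(len(toy) - toy.count()) / len(toy)   # 1 - 2/5

# Fraction of all cells that are NaN: 2 of 20 cells
cell_fraction = toy.isna().to_numpy().mean()

# Fraction of rows with at least one NaN: 1 of 5 rows
row_fraction = toy.isna().any(axis=1).mean()

# Fraction of NaNs per column
column_fraction = toy.isna().mean()

print(logged_value, cell_fraction, row_fraction)
print(column_fraction)
```

For a threshold of 0.2, the row-based fraction is the one expected to land near 20%.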


In [51]:
df1.head()

    0   1   2  3   4    5   6   7   8  9
0  50  21  77  0  28    0  27  48  22  2
1  55   0  92  0   0   26  36  92  56  4
2  53   0  82  0  52   -5  29  30   2  1
3  37   0  76  0  28   18  40  48   8  1
4  37   0  79  0  34  -26  43  46   2  1


In [52]:
df2.head()

    0   1   2  3     4    5     6     7     8  9
0  50  21  77  0   NaN    0   NaN   NaN   NaN  2
1  55   0  92  0   0.0   26  36.0  92.0  56.0  4
2  53   0  82  0  52.0   -5  29.0  30.0   2.0  1
3  37   0  76  0  28.0   18  40.0  48.0   8.0  1
4  37   0  79  0  34.0  -26  43.0  46.0   2.0  1


### 1.1.3. Create Train-Test split

Split the data into 70% training data and 30% test data, both with induced missingness, and keep a copy of the test data without missingness so that imputation performance can be calculated.

In [114]:
def create_train_test_split(dataframe, test_perc=0.3, logger_level = 20):
    """
    Steps:
    
    1. Induce missingness in the dataframe <use induce_missingness>
    2. Split the resultant dataframe into train, test sets
    3. Return both along with a third - test set without missingness 
    
    TO DO: 
        -: Figure out better way to extract elements via indexing
    
    """
    
    import pandas as pd
    import numpy as np
    assert isinstance(dataframe, pd.DataFrame)
    
    import logging
    logger = logging.getLogger()
    logger.setLevel(logger_level)
            
    from sklearn.model_selection import train_test_split
    RANDOM_SEED = 18
    np.random.seed(RANDOM_SEED) #Reproducibility
    
    _, full_test_df = train_test_split(dataframe, test_size = test_perc, random_state = RANDOM_SEED)
    train_df, test_df = train_test_split(induce_missingness(dataframe=dataframe), test_size = test_perc, random_state = RANDOM_SEED)
    #Used the same random_state to split the data twice in the same way, so full_test_df will be the filled part of dataframe
  
    '''
    TO DO: 
        -: Figure out why the following indexing is not working
    full_test_df = dataframe[dataframe.index.isin(test_df.index)]
    '''
    
    logging.info(f" Returning train_df, test_df, full_test_df after splitting dataframe in {1-test_perc}/{test_perc} split ")
    logging.info(" Note: full_test_df is the same as test_df but without NaNs")
    return train_df, test_df, full_test_df
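
An alternative to splitting twice with the same `random_state` is to look up the complete rows by the index of the test set. The sketch below is an assumption-laden illustration, not the notebook's method: it uses made-up data and `DataFrame.sample` in place of `train_test_split` to stay self-contained. It also suggests one possible reason the `isin` approach in the TO DO looks wrong: it returns the right rows, but in the original order rather than the shuffled order of `test_df`.

```python
import numpy as np
import pandas as pd

# Made-up complete data: 10 rows, 4 columns
full = pd.DataFrame(np.arange(40).reshape(10, 4), columns=[0, 1, 2, 3])

# Copy with NaNs in a few rows of columns 0 and 1
with_nans = full.astype(float)
with_nans.loc[[2, 5, 7], [0, 1]] = np.nan

# 70/30 split by sampling rows; the remaining rows form the training set
test_df = with_nans.sample(frac=0.3, random_state=18)
train_df = with_nans.drop(test_df.index)

# Label-based lookup: same rows, same (shuffled) order as test_df
full_test_by_loc = full.loc[test_df.index]

# Boolean mask: same rows, but in the original row order
full_test_by_isin = full[full.index.isin(test_df.index)]

print(list(test_df.index))
print(list(full_test_by_loc.index))
print(list(full_test_by_isin.index))
```

If row order matters for a later comparison, `.loc[test_df.index]` keeps the two frames aligned row by row.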

In [117]:
#Test
a,b,c = create_train_test_split(df1)
print(a.head())
print(b.head())
print(c.head())

INFO:root: Returning new dataframe with missingness(MCAR) induced
INFO:root: Percentage of NaNs in returned dataframe : 20.89
INFO:root: Returning train_df, test_df, full_test_df after splitting dataframe in 0.7/0.3 split 
INFO:root: Note: full_test_df is the same as test_df but without NaNs


        0  1    2  3     4  5     6     7    8  9
7476   55  0   98  0   NaN -4   NaN   NaN  NaN  4
31355  50 -5  102  2  50.0  0  52.0  53.0  0.0  1
38462  37  0   77  0  36.0 -2  40.0  41.0  2.0  1
20525  55 -2   95  0  46.0 -3  40.0  49.0  8.0  4
34457  55  0   92  8   NaN  0   NaN   NaN  NaN  4
        0  1   2  3     4   5     6     7     8  9
15528  45 -1  76  0   NaN -16   NaN   NaN   NaN  1
14327  37  0  95  0  10.0   7  58.0  84.0  26.0  1
12125  37  0  75 -4  30.0   0  38.0  44.0   6.0  1
39952  55  0  96  0  50.0   4  41.0  47.0   6.0  4
1339   41 -1  76  0   NaN -14   NaN   NaN   NaN  1
        0  1   2  3   4   5   6   7   8  9
15528  45 -1  76  0  44 -16  31  32   2  1
14327  37  0  95  0  10   7  58  84  26  1
12125  37  0  75 -4  30   0  38  44   6  1
39952  55  0  96  0  50   4  41  47   6  4
1339   41 -1  76  0  38 -14  35  37   2  1
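
With `test_df` and `full_test_df` aligned row by row, imputation quality can be scored on exactly the cells that were set to NaN. The following is a rough sketch under stated assumptions: `masked_rmse` is a hypothetical helper, RMSE is just one common choice of metric, the data is made up, and the column-mean fill is only a placeholder for whatever imputation model is used later.

```python
import numpy as np
import pandas as pd


def masked_rmse(imputed, complete, with_missing):
    """RMSE between imputed and true values, restricted to originally missing cells."""
    mask = with_missing.isna().to_numpy()
    diff = imputed.to_numpy(dtype=float) - complete.to_numpy(dtype=float)
    return float(np.sqrt(np.mean(diff[mask] ** 2)))


# Made-up complete data and a copy with two cells removed
complete = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                         "y": [10.0, 20.0, 30.0, 40.0]})
with_missing = complete.copy()
with_missing.loc[1, "x"] = np.nan
with_missing.loc[3, "y"] = np.nan

# Placeholder imputer: fill each NaN with the mean of the observed values in its column
imputed = with_missing.fillna(with_missing.mean())

score = masked_rmse(imputed, complete, with_missing)
print(score)
```

The helper assumes all three frames share the same shape and row order, and that at least one cell is missing.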


# 2. Modelling

Proceed to modelling.

In R:
Start by initializing the `h2o` package and then reading the training and test datasets into the format that `h2o` supports.
Then run the imputation model multiple times, since each new start initializes the weights with different values.<br>
The [h2o](https://cran.r-project.org/web/packages/h2o/h2o.pdf) package offers an easy-to-use function for implementing autoencoders. 
More information is available at this [link](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf).

In Python: