An attempt to recreate MIDA (Gondara and Wang, 2018) in Python using PyTorch, based on the R implementation at https://gist.github.com/lgondara/18387c5f4d745673e9ca8e23f3d7ebd3.

### Note: Section 1 has been tested and moved to utils.py

# 1. Loading Dataset

## 1.1. Load a dataset and introduce missingness

Dataset used: Shuttle Dataset (https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle))

### 1.1.1. Load the dataset and store it as a numeric dataframe

In [1]:
import pandas as pd
import utils

In [2]:
#Test
filename = "data/shuttle/shuttle_trn"
train_df = utils.get_dataframe_from_csv(filename).iloc[:,:-1]  #remove label

INFO:root:Input filename has to be space separated data


In [3]:
train_df.head()

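|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 21 | 77 | 0 | 28 | 0 | 27 | 48 | 22 |
| 1 | 55 | 0 | 92 | 0 | 0 | 26 | 36 | 92 | 56 |
| 2 | 53 | 0 | 82 | 0 | 52 | -5 | 29 | 30 | 2 |
| 3 | 37 | 0 | 76 | 0 | 28 | 18 | 40 | 48 | 8 |
| 4 | 37 | 0 | 79 | 0 | 34 | -26 | 43 | 46 | 2 |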
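`utils.get_dataframe_from_csv` lives in utils.py and is not shown in this notebook. A minimal sketch of an equivalent loader, assuming the file is whitespace-separated with no header row (as the log message suggests). The name `load_space_separated` is hypothetical, and the last value on each sample line is a placeholder class label.

```python
import io

import pandas as pd


def load_space_separated(source):
    """Read whitespace-separated data with no header row into a numeric dataframe."""
    df = pd.read_csv(source, sep=r"\s+", header=None)
    # Coerce anything non-numeric to NaN so the result is numeric throughout.
    return df.apply(pd.to_numeric, errors="coerce")


# Two rows shaped like the Shuttle data: 9 features plus a placeholder label.
sample = io.StringIO("50 21 77 0 28 0 27 48 22 1\n55 0 92 0 0 26 36 92 56 1\n")
features = load_space_separated(sample).iloc[:, :-1]  # drop the label column
print(features.shape)
```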


### 1.1.2. Inducing missingness

After loading the dataset, induce missingness.

To start off, introduce a simple random missing pattern (Missing Completely At Random, MCAR): sample half of the variables and set an observation in those variables to missing whenever an appended random uniform vector has a value below a certain threshold. With a threshold of 0.2, about 20% of the rows lose their values in the sampled variables, which is roughly 10% of all cells (the run below reports 8.79%).
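A minimal sketch of this MCAR procedure, not the code in utils.py. The function name `induce_mcar`, its parameters, and the use of one shared uniform draw per row are assumptions based on the description above.

```python
import numpy as np
import pandas as pd


def induce_mcar(df, threshold=0.2, seed=0):
    """Return a copy of df with MCAR missingness in a random half of the columns."""
    rng = np.random.default_rng(seed)
    # Sample half of the variables (rounded down) without replacement.
    cols = rng.choice(df.columns.to_numpy(), size=df.shape[1] // 2, replace=False).tolist()
    # Cast the sampled columns to float so they can hold NaN; astype returns a new frame.
    out = df.astype({c: float for c in cols})
    # One uniform draw per row: rows below the threshold lose all sampled columns.
    u = rng.uniform(size=len(out))
    out.loc[u < threshold, cols] = np.nan
    return out
```

With 9 variables, 4 are sampled, so the expected share of missing cells is 0.2 × 4/9, a little under 9%.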

In [4]:
#test
df1 = train_df[:]
df2 = utils.induce_missingness(df1,logger_level=20)

INFO:root: Returning new dataframe with missingness(MCAR) induced
INFO:root: Percentage of NaNs in returned dataframe : 8.79


In [5]:
df1.head()

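|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 21 | 77 | 0 | 28 | 0 | 27 | 48 | 22 |
| 1 | 55 | 0 | 92 | 0 | 0 | 26 | 36 | 92 | 56 |
| 2 | 53 | 0 | 82 | 0 | 52 | -5 | 29 | 30 | 2 |
| 3 | 37 | 0 | 76 | 0 | 28 | 18 | 40 | 48 | 8 |
| 4 | 37 | 0 | 79 | 0 | 34 | -26 | 43 | 46 | 2 |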


In [6]:
df2.head()

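|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 21 | 77 | 0 | NaN | 0 | NaN | NaN | NaN |
| 1 | 55 | 0 | 92 | 0 | 0.0 | 26 | 36.0 | 92.0 | 56.0 |
| 2 | 53 | 0 | 82 | 0 | 52.0 | -5 | 29.0 | 30.0 | 2.0 |
| 3 | 37 | 0 | 76 | 0 | 28.0 | 18 | 40.0 | 48.0 | 8.0 |
| 4 | 37 | 0 | 79 | 0 | 34.0 | -26 | 43.0 | 46.0 | 2.0 |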


### 1.1.3. Create Train-Test split

Create a 70% training set and a 30% test set, both with missingness induced, plus a copy of the test set without missingness so that imputation performance can be calculated.
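A rough sketch of such a split, assuming the complete dataframe and its missingness-induced copy share the same index. The helper `split_with_ground_truth` is hypothetical and differs from `utils.create_train_test_split`, which takes only the complete dataframe and induces missingness itself.

```python
import numpy as np
import pandas as pd


def split_with_ground_truth(complete_df, missing_df, train_frac=0.7, seed=0):
    """Shuffle the rows, then return (train, test, full_test).

    train and test come from missing_df; full_test holds the same rows as
    test but is taken from complete_df, so it has no induced NaNs.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(complete_df.index.to_numpy())
    n_train = int(round(train_frac * len(shuffled)))
    train_idx, test_idx = shuffled[:n_train], shuffled[n_train:]
    return missing_df.loc[train_idx], missing_df.loc[test_idx], complete_df.loc[test_idx]
```

Keeping `full_test` aligned row-for-row with `test` lets the imputed cells be compared directly against their true values.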

In [7]:
#Test
a,b,c = utils.create_train_test_split(df1)
print(a.head())
print(b.head())
print(c.head())

INFO:root: Returning new dataframe with missingness(MCAR) induced
INFO:root: Percentage of NaNs in returned dataframe : 8.79
INFO:root: Returning train_df, test_df, full_test_df after splitting dataframe in 0.7/0.3 split 
INFO:root: Note: full_test_df is the same as test_df but without NaNs


        0  1    2  3     4  5     6     7    8
7476   55  0   98  0   NaN -4   NaN   NaN  NaN
31355  50 -5  102  2  50.0  0  52.0  53.0  0.0
38462  37  0   77  0  36.0 -2  40.0  41.0  2.0
20525  55 -2   95  0  46.0 -3  40.0  49.0  8.0
34457  55  0   92  8   NaN  0   NaN   NaN  NaN
        0  1   2  3     4   5     6     7     8
15528  45 -1  76  0   NaN -16   NaN   NaN   NaN
14327  37  0  95  0  10.0   7  58.0  84.0  26.0
12125  37  0  75 -4  30.0   0  38.0  44.0   6.0
39952  55  0  96  0  50.0   4  41.0  47.0   6.0
1339   41 -1  76  0   NaN -14   NaN   NaN   NaN
        0  1   2  3   4   5   6   7   8
15528  45 -1  76  0  44 -16  31  32   2
14327  37  0  95  0  10   7  58  84  26
12125  37  0  75 -4  30   0  38  44   6
39952  55  0  96  0  50   4  41  47   6
1339   41 -1  76  0  38 -14  35  37   2


# 2. Modelling

Proceed to modelling.

In R:
Start by initializing the 'h2o' package, then read the training and test datasets into the format h2o supports.
Then run the imputation model multiple times, since each new start initializes the weights with different values.<br>
Info at: <br>
The [h2o](https://cran.r-project.org/web/packages/h2o/h2o.pdf) package offers an easy-to-use function for implementing autoencoders. 
More information is available at this [link](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/booklets/DeepLearningBooklet.pdf).
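A PyTorch analogue of the "run the model multiple times" step could look like the sketch below. It is not a port of the h2o code: the small network, the optimiser settings, and the names `impute_once` and `multiple_imputations` are all assumptions for illustration. The input is assumed to be already filled (for example with column means), with a boolean mask marking the originally missing cells.

```python
import torch
import torch.nn as nn


def impute_once(x_filled, missing_mask, seed, epochs=50):
    """Train one small autoencoder from a seed-specific initialisation and
    return x_filled with the masked cells replaced by its reconstruction."""
    torch.manual_seed(seed)  # a different seed gives different initial weights
    n = x_filled.shape[1]
    model = nn.Sequential(
        nn.Dropout(0.5), nn.Linear(n, n + 7), nn.Tanh(), nn.Linear(n + 7, n)
    )
    optimiser = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        optimiser.zero_grad()
        loss = loss_fn(model(x_filled), x_filled)
        loss.backward()
        optimiser.step()
    model.eval()  # disable dropout for the final reconstruction
    with torch.no_grad():
        reconstruction = model(x_filled)
    # Keep observed values; only the originally missing cells are replaced.
    return torch.where(missing_mask, reconstruction, x_filled)


def multiple_imputations(x_filled, missing_mask, n_runs=5):
    """One imputed copy of the data per random restart."""
    return [impute_once(x_filled, missing_mask, seed) for seed in range(n_runs)]
```

The spread of the imputed values across runs gives a rough sense of how uncertain each imputation is.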

In Python:

In [8]:
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F

In [9]:
# Settings for device, randomization seed, default tensor type, and DataLoader-style kwargs #DevSeedTensKwargs
RANDOM_SEED = 18
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)  # seed the default generator on both code paths

if torch.cuda.is_available():
    device = 'cuda'
    torch.cuda.manual_seed(RANDOM_SEED)
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
    kwargs = {'num_workers': 4, 'pin_memory': True}
else:
    device = 'cpu'
    torch.set_default_tensor_type(torch.FloatTensor)
    kwargs = {}  # no extra options needed on CPU

In [10]:
import dataset_module

In [11]:
from importlib import reload

In [12]:
reload(dataset_module)

<module 'dataset_module' from 'C:\\Users\\asree\\Downloads\\SRIP\\Summer\\Code\\dataset_module.py'>

In [26]:
df2.head()

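|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | 21 | 77 | 0 | NaN | 0 | NaN | NaN | NaN |
| 1 | 55 | 0 | 92 | 0 | 0.0 | 26 | 36.0 | 92.0 | 56.0 |
| 2 | 53 | 0 | 82 | 0 | 52.0 | -5 | 29.0 | 30.0 | 2.0 |
| 3 | 37 | 0 | 76 | 0 | 28.0 | 18 | 40.0 | 48.0 | 8.0 |
| 4 | 37 | 0 | 79 | 0 | 34.0 | -26 | 43.0 | 46.0 | 2.0 |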


In [14]:
trainset = dataset_module.DataSetForImputation(df2)

In [15]:
trainset

Dataframe Size:43500, Perc of NaNs: 8.79

In [16]:
len(trainset)

43500

In [17]:
trainset.variables()

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8], dtype='int64')

In [19]:
trainset[0]

(tensor([50.0000, 21.0000, 77.0000,  0.0000, 34.5129,  0.0000, 37.1039, 50.9072,
         13.9429]),
 tensor([50.0000, 21.0000, 77.0000,  0.0000, 34.5129,  0.0000, 37.1039, 50.9072,
         13.9429]))
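The outputs above suggest that `DataSetForImputation` fills NaNs (apparently with column means) and returns each row twice, as input and target. A minimal sketch of a dataset with that behaviour follows. `MeanFilledDataset` is a hypothetical stand-in inferred from those outputs, not the contents of dataset_module.py.

```python
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset


class MeanFilledDataset(Dataset):
    """Hypothetical stand-in: NaNs replaced by column means, rows served as (x, x)."""

    def __init__(self, df):
        self.columns = df.columns
        self.nan_percent = 100.0 * df.isna().to_numpy().mean()
        filled = df.fillna(df.mean())  # assumption: mean imputation as placeholder
        self.data = torch.tensor(filled.to_numpy(dtype=np.float32))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data[idx]
        return row, row  # an autoencoder reconstructs its own input

    def variables(self):
        return self.columns

    def __repr__(self):
        return f"Dataframe Size:{len(self)}, Perc of NaNs: {self.nan_percent:.2f}"
```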

In [23]:
import Modelling
net = Modelling.DenoisingAutoEncoder(len(trainset.variables()))

In [25]:
net(trainset[0][0].unsqueeze(0))

tensor([[ 0.1237, -0.1411,  0.0052, -0.3088, -0.2209, -0.0398, -0.1120,  0.1270,
          0.0689]], grad_fn=<AddmmBackward>)
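Modelling.py is not shown in this notebook. The sketch below is one plausible shape for such a network, loosely following the overcomplete design described by Gondara and Wang (2018): dropout corrupts the input, each encoder layer is wider than the last, and the decoder mirrors it. The class name, the widening step `theta=7`, the dropout rate, and the tanh activations are assumptions here, not the actual `Modelling.DenoisingAutoEncoder`.

```python
import torch
import torch.nn as nn


class DenoisingAutoEncoderSketch(nn.Module):
    """Overcomplete denoising autoencoder: n -> n+t -> n+2t -> n+3t -> n+2t -> n+t -> n."""

    def __init__(self, n_features, theta=7, drop_p=0.5):
        super().__init__()
        d0, d1, d2, d3 = (n_features + i * theta for i in range(4))
        self.corrupt = nn.Dropout(p=drop_p)  # input corruption, active in train mode only
        self.encoder = nn.Sequential(
            nn.Linear(d0, d1), nn.Tanh(),
            nn.Linear(d1, d2), nn.Tanh(),
            nn.Linear(d2, d3), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(d3, d2), nn.Tanh(),
            nn.Linear(d2, d1), nn.Tanh(),
            nn.Linear(d1, d0),  # linear output layer, no final activation
        )

    def forward(self, x):
        return self.decoder(self.encoder(self.corrupt(x)))
```

Ending on a plain `nn.Linear` is consistent with the `AddmmBackward` grad function in the output above, and it lets the reconstruction take any real value rather than being squashed into (-1, 1).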