# Tabular Data Augmentation - Credit Card Dataset

This notebook contains an ML assignment that is part of the recruitment process at the Dataiku Lab.

The cells will walk you through the questions and parts of code to complete/comment. 

If a question is not marked as [OPTIONAL], it is mandatory, and answering it might be necessary to run the subsequent cells of the notebook. [OPTIONAL] questions can be skipped.

Answering the mandatory questions should take around 2 hours. Answering all questions should take around 4 hours.

Pay attention to the `#TODO` tags and good work!

In [1]:
import pandas as pd
import numpy as np
import torch
from torch import nn
from torch import optim
from torch.utils.data import DataLoader
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

In [2]:
seed = 42
np.random.seed(seed)
torch.manual_seed(seed)  # also seed PyTorch (weight initialization, shuffling, sampling)

### Load Data

In [3]:
DATA_PATH = 'data/creditcard.csv'
df = pd.read_csv(DATA_PATH)

df.head()

| | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | ... | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
| 1 | 0.0 | 1.191857 | 0.266151 | 0.16648 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | ... | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.16717 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
| 2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.37978 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | ... | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
| 3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | ... | -0.1083 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.5 | 0 |
| 4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | ... | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.20601 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |


Preprocess data for training

In [4]:
target = 'Class'
X_train_df, X_test_df, y_train, y_test = train_test_split(df.drop(target, axis=1), df[target], test_size=0.3, random_state=seed)

x_scaler = StandardScaler()
X_train = x_scaler.fit_transform(X_train_df)

X_test = x_scaler.transform(X_test_df)


Q1. What can be said about the class balance? Compute the `difference_in_class_occurences` and the `class_occurences_ratio`.

In [None]:
difference_in_class_occurences = #TODO
class_occurences_ratio = #TODO
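
As a reminder of how class counts can be inspected, here is a minimal sketch on made-up labels. The names `y_toy`, `toy_difference` and `toy_ratio` are hypothetical and are not part of the assignment; how the two quantities above should be defined for this dataset is left to you.

```python
import numpy as np

# Hypothetical toy labels: 95 negatives (0) and 5 positives (1)
y_toy = np.array([0] * 95 + [1] * 5)

# Count occurrences of each class; index i holds the count of class i
counts = np.bincount(y_toy)

majority, minority = counts.max(), counts.min()
toy_difference = int(majority - minority)  # gap between the two class counts
toy_ratio = float(majority / minority)     # majority-to-minority ratio

print(counts, toy_difference, toy_ratio)
```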

### Train Data Synthesizer and Augment Data

Below is an implementation of the Tabular Variational Autoencoder (TVAE), a model used to synthesize tabular data.

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable


class TVAE(nn.Module):
    """ Tabular Variational Auto Encoder
    """
    def __init__(self, D_in:int, lin_layers:list, latent_dim:int=3):
       
        #Encoder
        super(TVAE,self).__init__()
        no = [D_in] + lin_layers
        list_of_layers = [
            nn.Sequential(nn.Linear(no[i], no[i+1]), nn.BatchNorm1d(no[i+1]), nn.ReLU())
            for i in range(len(no)-1)
            ]
        self.encoder = nn.Sequential(*list_of_layers)
        self.out_features_ = self.encoder[-1][0].out_features
        
        # Latent vectors mu and sigma
        self.fc1 = nn.Linear(self.out_features_, latent_dim)
        self.bn1 = nn.BatchNorm1d(num_features=latent_dim)
        self.fc21 = nn.Linear(latent_dim, latent_dim)
        self.fc22 = nn.Linear(latent_dim, latent_dim)

        # Sampling vector
        self.fc3 = nn.Linear(latent_dim, latent_dim)
        self.fc_bn3 = nn.BatchNorm1d(latent_dim)
        self.fc4 = nn.Linear(latent_dim, self.out_features_)
        self.fc_bn4 = nn.BatchNorm1d(self.out_features_)
        
        # Decoder
        no =  lin_layers[::-1] + [D_in]
        list_of_layers= [
            nn.Sequential(nn.Linear(no[i], no[i+1]), nn.BatchNorm1d(no[i+1]), nn.ReLU())
            for i in range(len(no)-1)
            ]
        # no ReLU in the last layer
        list_of_layers[-1] = list_of_layers[-1][:-1]
        self.decoder = nn.Sequential(*list_of_layers)
        
    def encode(self, x):

        fc1 = F.relu(self.bn1(self.fc1(self.encoder(x))))
        r1 = self.fc21(fc1)
        r2 = self.fc22(fc1)
        
        return r1, r2
    
    def reparameterize(self, mu, logvar):
        if self.training:
            # std = exp(logvar / 2); torch.randn_like replaces the deprecated Variable API
            std = torch.exp(0.5 * logvar)
            eps = torch.randn_like(std)
            return mu + eps * std
        else:
            return mu
        
    def decode(self, z):
        fc3 = F.relu(self.fc_bn3(self.fc3(z)))
        fc4 = F.relu(self.fc_bn4(self.fc4(fc3)))

        return self.decoder(fc4)
        
    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
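
To experiment with the sampling step of `reparameterize` in isolation, the following sketch reproduces it on stand-in tensors. The names `mu_toy`, `logvar_toy` and the shapes are assumptions chosen for illustration; they are not outputs of the model above.

```python
import torch

# Hypothetical encoder outputs for a batch of 4 samples and a 3-dimensional latent space
mu_toy = torch.zeros(4, 3, requires_grad=True)
logvar_toy = torch.zeros(4, 3, requires_grad=True)

std_toy = torch.exp(0.5 * logvar_toy)   # sigma = exp(logvar / 2)
eps_toy = torch.randn_like(std_toy)     # noise drawn independently of the parameters
z_toy = mu_toy + eps_toy * std_toy      # a differentiable function of mu and logvar

# Backpropagate through the sampled latent vector and inspect the gradients
z_toy.sum().backward()
print(z_toy.shape, mu_toy.grad is not None, logvar_toy.grad is not None)
```

Inspecting `mu_toy.grad` and `logvar_toy.grad` after the backward pass may help when thinking about Q2.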

[OPTIONAL] Q2: What does the `reparameterize` function do during training?

The next cell implements the loss we need to minimize to train the TVAE. It is composed of two terms: the Mean Squared Error (MSE) loss and the Kullback-Leibler Divergence (KLD) loss.
* Q3. Complete the implementation of the MSE loss.
* Q4. Explain what the MSE loss measures and why we want it to be low.
* [OPTIONAL] Q5. The expression `loss_KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())` is the analytical expression of the KL divergence between two Gaussian distributions: $\mathcal{N}(\mu, e^{\text{logvar}})$ and $\mathcal{N}(0, 1)$, where the second argument is the variance. Explain what the KLD loss measures and why we want it to be low.
* [OPTIONAL] Q6. What could be an alternative implementation of `loss_KLD` using losses available in torch, for example if you did not know that an analytical expression exists?
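
One way to sanity-check the analytical expression quoted in Q5 is to compare it numerically with `torch.distributions.kl_divergence`. The sketch below does this on random stand-in tensors; `mu_toy`, `logvar_toy` and their shapes are assumptions for illustration. It is only one possible check, not necessarily the answer expected for Q6.

```python
import torch
from torch.distributions import Normal, kl_divergence

# Hypothetical encoder outputs: batch of 8 samples, 16 latent dimensions
mu_toy = torch.randn(8, 16)
logvar_toy = torch.randn(8, 16)

# Analytical expression, as quoted in Q5
kld_analytical = -0.5 * torch.sum(1 + logvar_toy - mu_toy.pow(2) - logvar_toy.exp())

# Same quantity via torch.distributions: KL(N(mu, sigma^2) || N(0, 1)), summed over all entries
q_toy = Normal(mu_toy, torch.exp(0.5 * logvar_toy))
p_toy = Normal(torch.zeros_like(mu_toy), torch.ones_like(mu_toy))
kld_distributions = kl_divergence(q_toy, p_toy).sum()

print(kld_analytical.item(), kld_distributions.item())
```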

In [None]:
class customLoss(nn.Module):
    def __init__(self):
        super(customLoss, self).__init__()
        self.mse_loss = # TODO
    
    def forward(self, x_recon, x, mu, logvar):
        loss_MSE = self.mse_loss(x_recon, x)
        loss_KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        
        return loss_MSE + loss_KLD

Let's prepare the training data for the TVAE containing only the fraud class

In [8]:
X_train_fraud = X_train[y_train==1]
X_test_fraud = X_test[y_test==1]

Now, let's instantiate the synthesizer model and train on fraud data. 

Q7. What is the input dimension of the model? Complete below.

In [None]:
D_in = # TODO

In [10]:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

layers_in = [64, 32]
latent_dim = 16

model_vae = TVAE(D_in, layers_in, latent_dim=latent_dim).to(device)

opt = optim.Adam(model_vae.parameters(), lr=0.01)

loss_func = customLoss()

Run the training loop below and check the loss behavior.

In [None]:
n_epochs = 100
batch_size = 64

x = torch.tensor(X_train_fraud.astype('float32'))
x_test = torch.tensor(X_test_fraud.astype('float32'))

training_loader = DataLoader(x, batch_size=batch_size, shuffle=True, num_workers=1)
validation_loader = DataLoader(x_test, batch_size=batch_size, shuffle=False, num_workers=1)

loss_by_epoch = []
vloss_by_epoch = []

for epoch in range(n_epochs):

    model_vae.train(True)

    running_loss = 0.

    for i, x_batch in enumerate(training_loader):
        # Move the batch to the same device as the model
        x_batch = x_batch.to(device)
        opt.zero_grad()
        recon, mu, logvar = model_vae(x_batch)
        # Compute the loss and its gradients
        loss = loss_func(recon, x_batch, mu, logvar)
        loss.backward()
        # Adjust learning weights
        opt.step()

        # Gather data and report
        running_loss += loss.item()

    avg_loss = running_loss / (i + 1)

    if epoch % 20 == 0:
        # Evaluate on the validation set every 20 epochs, without tracking gradients
        model_vae.train(False)

        running_vloss = 0.0
        with torch.no_grad():
            for i, vdata in enumerate(validation_loader):
                vinputs = vdata.to(device)
                voutputs, vmu, vlogvar = model_vae(vinputs)
                vloss = loss_func(voutputs, vinputs, vmu, vlogvar)
                running_vloss += vloss.item()

        avg_vloss = running_vloss / (i + 1)
        print('Epoch {} - LOSS train {} valid {}'.format(epoch, avg_loss, avg_vloss))
        model_vae.train(True)
    else:
        print('Epoch {} - LOSS train {}'.format(epoch, avg_loss))

    loss_by_epoch.append(avg_loss)
    # Between evaluations, the most recent validation loss is carried forward
    vloss_by_epoch.append(avg_vloss)
    
    

In [None]:
plt.plot(loss_by_epoch, label='train loss')
plt.plot(vloss_by_epoch, color='darkorange', label='val loss')
plt.legend();

### Generate data

The function below uses the trained TVAE to generate synthetic data, sampling from a Gaussian distribution with specific mean and variance. The drawn samples are then decoded and mapped into the input space as synthetic new samples. 
* Q8. What are the mean and variance of the Gaussian distribution we sample from?
* [OPTIONAL] Q9. What other kinds of sampling could we use to generate new latent vectors, and hence new input samples?

In [13]:
def generate_data(model_vae, opt, training_loader, no_samples:int):
    # `opt` is unused; it is kept so that existing calls keep working
    device = next(model_vae.parameters()).device
    # evaluation mode: BatchNorm layers use their running statistics
    model_vae.eval()

    # get embeddings
    mus, logvars = [], []
    with torch.no_grad():
        for xb in training_loader:
            _, mu_, logvar_ = model_vae(xb.to(device))
            mus.append(mu_)
            logvars.append(logvar_)
    mu = torch.cat(mus, dim=0)
    logvar = torch.cat(logvars, dim=0)

    # sample from distribution defined by embeddings
    sigma = torch.exp(logvar/2)
    q = torch.distributions.Normal(mu.mean(dim=0), sigma.mean(dim=0))
    z = q.rsample(sample_shape=torch.Size([no_samples]))

    with torch.no_grad():
        pred = model_vae.decode(z).cpu().numpy()

    return pred
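
The sampling call used in `generate_data` can be tried on its own. The sketch below uses made-up location and scale vectors (`loc_toy` and `scale_toy` are assumptions, not values taken from the trained model) to show the shape returned by `rsample` when a `sample_shape` is passed.

```python
import torch

# Hypothetical per-dimension location and scale for a 16-dimensional latent space
loc_toy = torch.linspace(-1.0, 1.0, 16)
scale_toy = torch.full((16,), 0.5)

q_toy = torch.distributions.Normal(loc_toy, scale_toy)
# One row per requested sample, one column per latent dimension
z_toy = q_toy.rsample(sample_shape=torch.Size([1000]))

print(z_toy.shape)
print(z_toy.mean(dim=0)[:3], z_toy.std(dim=0)[:3])
```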

Now, generate new fraud samples so as to obtain a balanced training set.

In [14]:
X_fake_fraud = generate_data(model_vae, opt, training_loader, no_samples=difference_in_class_occurences)

Let's display a scatter plot of the feature means of the original and synthetic data.
* [OPTIONAL] Q10. What should you expect to see in the ideal case where the synthetic data comes from the very same distribution as the real data?
* [OPTIONAL] Q11. What more could we do to be sure that the synthetic and real distributions are very similar?

In [None]:
plt.scatter(X_train_fraud.mean(axis=0), X_fake_fraud.mean(axis=0))
plt.xlabel('Features Means of Train data')
plt.ylabel('Features Means of Synthetic data')
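
Feature means alone cannot reveal differences in spread or in dependencies between features. As one possible illustration, the sketch below compares two made-up matrices with a per-feature two-sample Kolmogorov-Smirnov statistic from SciPy and with the gap between their correlation matrices. `X_real_toy` and `X_synth_toy` are hypothetical stand-ins, not the notebook's data, and other checks are equally valid.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Hypothetical stand-ins for real and synthetic feature matrices (rows = samples)
X_real_toy = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
X_synth_toy = rng.normal(loc=0.0, scale=1.0, size=(1000, 3))
X_synth_toy[:, 2] *= 3.0  # same mean as the real feature, but a much larger spread

# Two-sample KS statistic per feature: values near 0 mean similar empirical CDFs
ks_stats = [
    ks_2samp(X_real_toy[:, j], X_synth_toy[:, j]).statistic
    for j in range(X_real_toy.shape[1])
]

# Largest absolute difference between the two feature correlation matrices
corr_gap = np.abs(
    np.corrcoef(X_real_toy, rowvar=False) - np.corrcoef(X_synth_toy, rowvar=False)
).max()

print(np.round(ks_stats, 3), round(float(corr_gap), 3))
```

In this toy setup the third feature has matching means but a clearly larger KS statistic, which a scatter plot of means would not show.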

### Train Prediction Model

Q12. Use any scikit-learn classifier and train it on real data only to predict fraud.

Q13. Use scikit-learn metrics to check our model performance. 

Remember that our test set consists of real data and that this is an imbalanced problem. In such cases, accuracy is not the right metric to look at.
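
To make the point about accuracy concrete, here is a small sketch on made-up labels, where a trivial predictor always outputs the majority class. The arrays `y_true_toy` and `y_pred_toy` are hypothetical; which metrics to report for Q13 remains your choice.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, f1_score

# Hypothetical imbalanced test labels: 990 legitimate (0) and 10 fraud (1)
y_true_toy = np.array([0] * 990 + [1] * 10)

# A trivial "model" that always predicts the majority class
y_pred_toy = np.zeros_like(y_true_toy)

acc = accuracy_score(y_true_toy, y_pred_toy)                # high despite missing every fraud case
rec = recall_score(y_true_toy, y_pred_toy, zero_division=0) # fraction of fraud cases caught
f1 = f1_score(y_true_toy, y_pred_toy, zero_division=0)      # precision is undefined here, so F1 falls back to 0

print(acc, rec, f1)
```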

Q14. Try using class weights. What do you observe?
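
Many scikit-learn classifiers accept a `class_weight` argument, with `"balanced"` as one of the options. The sketch below shows, on made-up labels (`y_toy` is hypothetical), the weights this option implies, computed with `sklearn.utils.class_weight.compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 samples of class 0 and 10 samples of class 1
y_toy = np.array([0] * 90 + [1] * 10)

# "balanced" weights follow n_samples / (n_classes * np.bincount(y))
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

print(dict(zip([0, 1], weights)))
```

The rarer class receives the larger weight, so its errors count more in the training objective.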

A better-calibrated model could also help your model achieve performance that is more evenly distributed across classes.
* [OPTIONAL] Q15. Why?
* [OPTIONAL] Q16. Try a calibrated classifier from scikit-learn. What do you observe?

Now let's merge real and synthesized data to train predictive models that classify fraud samples.

In [20]:
X_train_aug_vae = np.concatenate([X_train, X_fake_fraud])
y_train_aug_vae = np.concatenate([y_train, np.ones((difference_in_class_occurences,))])

np.unique(y_train_aug_vae, return_counts=True)

(array([0., 1.]), array([199008, 199008]))

Q17. Train your first classifier again, this time on the augmented train set, and evaluate its performance.

* Q18. What can we conclude for this dataset about augmenting real data with synthetic data generated by the TVAE?
* [OPTIONAL] Q19. How can we make these conclusions more robust?

The table below compares several tabular data synthesizers. Focus on the last two columns.

The metric shown for classification problems is the F1-score, while the one used for regression problems is the R2 score. In both cases, these scores represent the average performance of multiple ML models trained only on synthesized data and tested on real data. For instance, for classification, the average performance of MLP, Logistic Regression, AdaBoost and Decision Tree models is reported.

The first row, 'Identity', instead shows the performance of models trained only on real data.

<center><img src="ctgan_results.png" alt="ctgan results" width="500"/></center>
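
The "train on synthetic, test on real" protocol described above can be sketched as follows. Everything here is a stand-in: the "real" data comes from `make_classification`, the "synthetic" set is just a noisy copy rather than the output of any synthesizer, and only two of the listed model families are used. It illustrates the evaluation loop only and says nothing about the results in the table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Hypothetical "real" data; the "synthetic" set is a noisy copy standing in for a generator's output
X_real_toy, y_real_toy = make_classification(n_samples=600, n_features=8, random_state=0)
X_synth_toy = X_real_toy[:300] + rng.normal(scale=0.3, size=(300, 8))
y_synth_toy = y_real_toy[:300]
X_holdout_toy, y_holdout_toy = X_real_toy[300:], y_real_toy[300:]

models = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

# Train each model on synthetic data only, then score it on held-out real data
scores = [
    f1_score(y_holdout_toy, m.fit(X_synth_toy, y_synth_toy).predict(X_holdout_toy))
    for m in models
]
mean_f1 = float(np.mean(scores))

print(scores, mean_f1)
```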



Do you agree with the following conclusions? Explain your reasoning for each point.
 
* Q20. The TVAE is significantly better than the CTGAN for both classification and regression, thus I should use TVAE for tabular data augmentation.
* Q21. Regardless of the ML model I want to use to predict on my data, TVAE would be a better synthesizer than CTGAN.
* Q22. Both the TVAE and the CTGAN are anyway much worse than Identity, so there's no way I could train on data generated by TVAE or CTGAN: my performance wouldn't be good enough. There's no practical value in using these synthesizers.

* [OPTIONAL] Q23. The TVAE used in this notebook can only work on numerical continuous variables. What could you do to make it work on categorical data? Any idea?