<a href="https://colab.research.google.com/github/desstaw/DataAnonymPipeline/blob/main/II_DataAnonymizationVAE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow.keras import layers, models, backend as K
import numpy as np

# Load data
url = "https://raw.githubusercontent.com/desstaw/Seminar_DataManagement23/main/datasets/heart.csv"
df = pd.read_csv(url)

# Standardize the data
scaler = StandardScaler()
X = scaler.fit_transform(df)
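
As a small illustration of what the standardization step does, the sketch below reproduces the arithmetic with plain NumPy. The `toy` array and its values are made up for the example and are not taken from `heart.csv`. It assumes the default `StandardScaler` behavior of subtracting each column's mean and dividing by its population standard deviation.

```python
import numpy as np

# Hypothetical toy data: 4 rows, 2 features (not taken from heart.csv)
toy = np.array([[50.0, 120.0],
                [60.0, 140.0],
                [40.0, 130.0],
                [70.0, 150.0]])

# Standardize: subtract the column mean, divide by the population std (ddof=0)
mean = toy.mean(axis=0)
std = toy.std(axis=0)
toy_scaled = (toy - mean) / std

# Inverse transform: undo the scaling to return to the original units
toy_restored = toy_scaled * std + mean

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

The same round trip is what `scaler.inverse_transform` relies on later, when the decoder output is mapped back to the original units.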



*The function `make_encoder` builds the encoder network of the Variational Autoencoder (VAE). The encoder takes input data of a specified shape (`input_shape`) and maps it to the parameters (mean and log variance) of a distribution over a lower-dimensional latent space with `latent_dim` dimensions.*

### First: Encode

#### A breakdown of each line in the function:

`inputs = layers.Input(shape=input_shape)`: This creates an input layer for the neural network that matches the specified input_shape.

`x = layers.Dense(128, activation='relu')(inputs)`: This creates a fully connected (dense) layer with 128 neurons, which takes in the input data and applies the ReLU activation function to its output.

`x = layers.Dense(64, activation='relu')(x)`: This creates another dense layer with 64 neurons, which takes in the output from the previous layer and applies the ReLU activation function to its output.

`z_mean = layers.Dense(latent_dim)(x)`: This creates the mean layer for the latent space representation, which is another dense layer with latent_dim neurons.

`z_log_var = layers.Dense(latent_dim)(x)`: This creates the log variance layer for the latent space representation, which is also a dense layer with latent_dim neurons.

`return models.Model(inputs, [z_mean, z_log_var])`: This creates a Keras Model object with the inputs layer as the input and `[z_mean, z_log_var]` as the output.

The **"latent_dim"** is the dimensionality of the latent space, which is the space in which the input data is encoded into a compressed representation by the encoder network. In other words, it is the number of hidden variables that are used to represent the input data in a lower-dimensional space.

The **"input_shape"** is defined by the shape of the input data. In this code, the input is the standardized array `X` derived from the pandas DataFrame, and `X.shape[1:]` is passed as `input_shape`: a one-element tuple holding the number of columns, that is, the number of features that describe each data point. The input shape defines the shape of the input layer of the encoder network, so that the network knows how to process the input data.

The **"latent space"** is a lower-dimensional representation of the input data that is learned by the encoder network. This representation is then used by the decoder network to generate new data. By learning a compressed, lower-dimensional representation of the input data, the VAE can capture the underlying structure and patterns in the data, making it more efficient to generate new data points in the latent space rather than in the high-dimensional input space.

In [5]:
# Define the encoder network
def make_encoder(input_shape, latent_dim):
    inputs = layers.Input(shape=input_shape)
    x = layers.Dense(128, activation='relu')(inputs)
    x = layers.Dense(64, activation='relu')(x)
    z_mean = layers.Dense(latent_dim)(x)
    z_log_var = layers.Dense(latent_dim)(x)
    return models.Model(inputs, [z_mean, z_log_var])
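
To make the shapes in the encoder concrete, here is a minimal NumPy sketch of the same layer sizes. It is not the Keras model: the `dense` helper is hypothetical, the weights are random and untrained, and the batch is random noise. The feature count of 15 and `latent_dim = 2` are the values that appear later in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, latent_dim = 15, 2  # 15 data columns, latent_dim = 2 as set later in the notebook

def dense(x, n_out, relu=False):
    """Hypothetical dense layer with random, untrained weights (illustration only)."""
    w = rng.normal(scale=0.1, size=(x.shape[1], n_out))
    b = np.zeros(n_out)
    y = x @ w + b
    return np.maximum(y, 0.0) if relu else y

batch = rng.normal(size=(32, n_features))  # stand-in for one batch of standardized rows
h = dense(batch, 128, relu=True)
h = dense(h, 64, relu=True)
z_mean = dense(h, latent_dim)      # no activation: any real value is allowed
z_log_var = dense(h, latent_dim)   # log variance can be negative, so no activation either

print(h.shape, z_mean.shape, z_log_var.shape)  # (32, 64) (32, 2) (32, 2)
```

Each row of the batch ends up described by two numbers for the mean and two for the log variance, which is why the model returns a list of two outputs.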



### Second: Decode
The decoder network is the second part of the variational autoencoder that takes the low-dimensional encoded data from the encoder and reconstructs the original high-dimensional data. In this code, the `make_decoder` function is defined to create the decoder network.

The `latent_dim` parameter that we defined earlier represents the number of dimensions in the low-dimensional space to which we are encoding the high-dimensional data. This same parameter is also passed into the `make_decoder` function to ensure that the input shape of the decoder matches the output shape of the encoder.

First, the function takes in the `latent_dim` parameter and creates an input layer with shape `(latent_dim,)`. This input layer will take in the encoded data that was output from the encoder network. This is where the compressed latent space representation is fed into the decoder.

The input is then passed through two dense layers with 64 and 128 nodes respectively, both with ReLU activation functions. These hidden layers transform the low-dimensional data into a representation that can be mapped back to the high-dimensional space. ReLU is used because it makes the layers non-linear, so the decoder can learn non-linear mappings.

Finally, the output layer has as many nodes as the original input data has features (`X.shape[1]`) and uses a linear activation. The output values are therefore not constrained to a particular range, so the decoder can produce any real value, including the negative values that standardized features take. The decoder model is then returned using the `models.Model` function from Keras, with the input layer and output layer passed as arguments.

In [6]:
# Define the decoder network
def make_decoder(latent_dim):
    inputs = layers.Input(shape=(latent_dim,))
    x = layers.Dense(64, activation='relu')(inputs)
    x = layers.Dense(128, activation='relu')(x)
    outputs = layers.Dense(X.shape[1], activation='linear')(x)
    return models.Model(inputs, outputs)
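
The choice of a linear output activation can be illustrated with a tiny NumPy sketch. The `pre_activation` values are hypothetical and only stand in for what the last dense layer might compute for one row.

```python
import numpy as np

# Hypothetical pre-activation values of the output layer for one row with three features
pre_activation = np.array([[-1.2, 0.3, 2.0]])

linear_out = pre_activation                 # linear activation: values pass through unchanged
relu_out = np.maximum(pre_activation, 0.0)  # ReLU: negative values are clipped to 0

print(linear_out)
print(relu_out)
```

Because standardized features are centered at zero, roughly half of the target values are negative. A ReLU output layer could never reproduce them, while a linear one can.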



### Third: Sample
The Sampling layer takes the mean and log variance of the learned distribution of the latent space and randomly samples new points from this distribution to generate new data points.

`class Sampling(layers.Layer)` - define a new Keras layer for sampling the latent space

`def call(self, inputs)` - define the method to be called when the layer is used in a model

`z_mean, z_log_var = inputs` - extract the mean and log variance of the distribution of the latent space from the input

`batch = K.shape(z_mean)[0]` - get the batch size of the input data

`dim = K.int_shape(z_mean)[1]` - get the dimensionality of the latent space

`epsilon = K.random_normal(shape=(batch, dim))` - sample random noise from a standard normal distribution, with the same shape as `z_mean`

`return z_mean + K.exp(0.5 * z_log_var) * epsilon` - return a new point in the latent space by adding the mean to the product of the exponential of half the log variance and the random noise.

In [7]:
# Define the sampling layer
class Sampling(layers.Layer):
    def call(self, inputs):
        z_mean, z_log_var = inputs
        batch = K.shape(z_mean)[0]
        dim = K.int_shape(z_mean)[1]
        epsilon = K.random_normal(shape=(batch, dim))
        return z_mean + K.exp(0.5 * z_log_var) * epsilon
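
The formula in the last line of `call` is commonly called the reparameterization trick. The NumPy sketch below, using made-up values for `z_mean` and `z_log_var`, checks empirically that the samples it produces have the intended mean and standard deviation.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical encoder outputs for a single row with a 2-dimensional latent space
z_mean = np.array([[1.0, -2.0]])
z_log_var = np.log(np.array([[0.25, 4.0]]))  # variances 0.25 and 4.0, so std 0.5 and 2.0

# Draw many noise vectors and apply the same formula as the Sampling layer
n_draws = 100_000
epsilon = rng.standard_normal(size=(n_draws, 2))
z = z_mean + np.exp(0.5 * z_log_var) * epsilon

print(z.mean(axis=0))  # close to z_mean
print(z.std(axis=0))   # close to exp(0.5 * z_log_var)
```

Since `exp(0.5 * z_log_var)` is the standard deviation, the network can output any real number for `z_log_var` and still yield a valid, positive spread.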


### Fourth: VAE model
The VAE model combines the encoder, decoder, and sampling layers to learn a compressed representation of the input data.

`inputs = layers.Input(shape=X.shape[1:])` creates the input layer, whose shape matches one row of `X`.

`z_mean, z_log_var = encoder(inputs)` gets the mean and log variance of the latent distribution from the encoder network.

`z = Sampling()([z_mean, z_log_var])` applies the Sampling layer to the mean and log variance to draw a sample from the latent space.

`outputs = decoder(z)` passes the sampled latent space through the decoder network to generate the reconstructed output.

`reconstruction_loss = K.mean(K.square(inputs - outputs), axis=-1)` calculates the reconstruction loss, which measures how well the output matches the input.

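`kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)` calculates the Kullback-Leibler divergence between the encoder's distribution over the latent space (a Gaussian with mean `z_mean` and log variance `z_log_var`) and a standard normal prior. This term acts as a regularizer that keeps the latent codes close to the standard normal distribution from which new latent points are drawn at generation time.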

`vae_loss = K.mean(reconstruction_loss + beta * kl_loss)` combines the reconstruction and KL losses with a hyperparameter beta to get the overall VAE loss.

`vae = models.Model(inputs, outputs)` defines the VAE model as a Keras model with the input and output tensors.

`vae.add_loss(vae_loss)` adds the VAE loss to the model as an additional loss function to be optimized during training.
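
Written as formulas, the three loss lines above correspond to the following. The symbols are notation introduced here for illustration: $d$ is the number of features, $K$ is `latent_dim`, $N$ is the batch size, $\hat{x}$ is the decoder output, $\mu_k$ is `z_mean`, and $\log\sigma_k^2$ is `z_log_var`.

```latex
\begin{aligned}
\mathcal{L}_{\text{rec}}(x) &= \frac{1}{d}\sum_{j=1}^{d}\left(x_j - \hat{x}_j\right)^2 \\
\mathcal{L}_{\text{KL}}(x) &= -\frac{1}{2}\sum_{k=1}^{K}\left(1 + \log\sigma_k^2 - \mu_k^2 - \sigma_k^2\right) \\
\mathcal{L} &= \frac{1}{N}\sum_{i=1}^{N}\left[\mathcal{L}_{\text{rec}}\big(x^{(i)}\big) + \beta\,\mathcal{L}_{\text{KL}}\big(x^{(i)}\big)\right]
\end{aligned}
```

Note that the reconstruction term is averaged over features while the KL term is summed over latent dimensions, so `beta` also controls the balance between the two scales.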

In [8]:
# Define the VAE model
def make_vae(encoder, decoder, beta=1.0):
    inputs = layers.Input(shape=X.shape[1:])
    z_mean, z_log_var = encoder(inputs)
    z = Sampling()([z_mean, z_log_var])
    outputs = decoder(z)
    reconstruction_loss = K.mean(K.square(inputs - outputs), axis=-1)
    kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
    vae_loss = K.mean(reconstruction_loss + beta * kl_loss)
    vae = models.Model(inputs, outputs)
    vae.add_loss(vae_loss)
    return vae
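
To get a feel for the KL term, the following NumPy sketch evaluates the same expression on a few hand-picked latent parameters and then combines it with reconstruction errors. The helper name `kl_to_standard_normal` and all numeric values are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL divergence from N(z_mean, exp(z_log_var)) to N(0, I), summed over latent dims."""
    return -0.5 * np.sum(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var), axis=-1)

# Row 0 matches the prior exactly, row 1 has a shifted mean, row 2 has a smaller variance
z_mean = np.array([[0.0, 0.0], [1.0, -1.0], [0.0, 0.0]])
z_log_var = np.array([[0.0, 0.0], [0.0, 0.0], [-2.0, -2.0]])

kl = kl_to_standard_normal(z_mean, z_log_var)
print(kl)  # zero for row 0, positive for rows 1 and 2

# Combine with made-up per-row reconstruction errors, as make_vae does
reconstruction = np.array([0.5, 0.2, 0.1])
beta = 1.0
total_loss = np.mean(reconstruction + beta * kl)
print(total_loss)
```

The penalty is zero only when the encoder outputs exactly the prior, so any information the encoder stores in the latent code has to be paid for by a lower reconstruction error.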


### Fifth: Execute

In [9]:
# Define the dimensions of the latent space
latent_dim = 2

# Create the encoder, decoder, and VAE models
encoder = make_encoder(X.shape[1:], latent_dim)
decoder = make_decoder(latent_dim)
vae = make_vae(encoder, decoder)

# Compile the VAE model
vae.compile(optimizer='adam')

# Train the VAE model
vae.fit(X, epochs=100, batch_size=32, validation_split=0.1)

# Encode the data (z_mean is not used below)
z_mean, _ = encoder.predict(X)

# Generate new, anonymous data by decoding samples from the standard normal prior
new_data = decoder.predict(np.random.normal(size=(X.shape[0], latent_dim)))
new_data = scaler.inverse_transform(new_data)

# Collect the new data in a DataFrame (no CSV file is written here)
new_df = pd.DataFrame(new_data, columns=df.columns)

Epoch 1/100
...
Epoch 77/100
Epoch 78

In [10]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  1025 non-null   float32
 1   age         1025 non-null   float32
 2   sex         1025 non-null   float32
 3   cp          1025 non-null   float32
 4   trestbps    1025 non-null   float32
 5   chol        1025 non-null   float32
 6   fbs         1025 non-null   float32
 7   restecg     1025 non-null   float32
 8   thalach     1025 non-null   float32
 9   exang       1025 non-null   float32
 10  oldpeak     1025 non-null   float32
 11  slope       1025 non-null   float32
 12  ca          1025 non-null   float32
 13  thal        1025 non-null   float32
 14  target      1025 non-null   float32
dtypes: float32(15)
memory usage: 60.2 KB


In [11]:
new_df.head(10)

,Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,460.029755,54.451252,0.69824,0.962146,131.848846,246.713257,0.149817,0.513378,148.952881,0.338499,1.087283,1.38593,0.749259,2.316273,0.513445
1,456.681793,54.470539,0.692738,0.964454,131.966736,246.868759,0.151549,0.516347,148.908264,0.338098,1.085373,1.388475,0.751461,2.310172,0.515827
2,452.670624,54.421345,0.683312,0.961687,132.111099,246.999451,0.155819,0.52152,148.912186,0.338436,1.087829,1.393505,0.762546,2.300417,0.5145
3,462.284302,54.499207,0.691673,0.969326,131.896927,245.124405,0.148044,0.510949,148.851456,0.33585,1.082627,1.381604,0.754386,2.311039,0.503752
4,458.265472,54.465622,0.680545,0.967206,132.059708,245.732147,0.148039,0.527088,148.822678,0.333296,1.087184,1.386384,0.755247,2.304846,0.500989
5,457.90094,54.450405,0.694098,0.963276,131.900864,246.60379,0.150277,0.515062,148.960678,0.33799,1.078507,1.384395,0.750936,2.309982,0.511529
6,464.172302,54.414387,0.69897,0.957538,131.80687,246.318756,0.149754,0.514418,148.94664,0.339153,1.086369,1.383556,0.7497,2.319414,0.511601
7,459.606049,54.393829,0.695242,0.961261,131.769897,246.508377,0.148725,0.518543,149.028931,0.338974,1.079389,1.383984,0.749446,2.308835,0.507876
8,461.419983,54.448143,0.698043,0.959627,131.822723,246.602768,0.149997,0.514159,148.96785,0.338263,1.087912,1.385875,0.751017,2.316669,0.512908
9,462.451477,54.382946,0.699124,0.959325,131.799332,246.689728,0.149475,0.518075,148.993652,0.340402,1.090774,1.384789,0.746561,2.316405,0.512106
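
In the ten rows shown above, each column varies only slightly (for example, `age` stays between roughly 54.38 and 54.50). One possible way to quantify this is to compare the per-column standard deviation of the generated data with that of the original data. The sketch below is only an illustration: `spread_ratio` is a hypothetical helper, and the two arrays are random toy data standing in for `df.values` and `new_df.values`.

```python
import numpy as np

def spread_ratio(original, synthetic):
    """Per-column std of the synthetic data divided by the std of the original data."""
    return np.std(synthetic, axis=0) / np.std(original, axis=0)

# Hypothetical toy arrays standing in for df.values and new_df.values
rng = np.random.default_rng(0)
original = rng.normal(loc=54.0, scale=9.0, size=(1000, 1))
synthetic = rng.normal(loc=54.0, scale=0.05, size=(1000, 1))

ratio = spread_ratio(original, synthetic)
print(ratio)  # values far below 1 mean the synthetic column varies much less
```

With the notebook's own variables, the call would be `spread_ratio(df.values, new_df.values)`. Ratios near 1 would suggest that the generated data preserves the spread of the original columns.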
