
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 1200px">
</div>

# Generative Adversarial Networks (GANs)
<br/>
0. The promise of deep learning is to discover *rich, hierarchical* models that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. 
0. The most striking successes in deep learning have involved *discriminative models*, usually those that map a high-dimensional, rich sensory input to a class label. These striking successes have primarily been based on the *backpropagation and dropout algorithms*, using piecewise linear units which have a particularly well-behaved gradient.
0. *Deep generative models* have had less of an impact, due to the difficulty of approximating many intractable probabilistic computations that arise in *maximum likelihood estimation* and related strategies, and due to the difficulty of leveraging the benefits of piecewise linear units in the generative context. The GAN paper proposes a new generative model estimation procedure that sidesteps these difficulties.
0. In the *adversarial nets framework*, the *generative* model is pitted against an *adversary*: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.
<br/><br/>
For details see [GANs](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)
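
The counterfeiter-versus-police game described above is written in that paper as a two-player minimax objective. As a compact reference, with $G$ the generator, $D$ the discriminator, $p_{\text{data}}$ the data distribution and $p_z$ the noise distribution:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Here $D(x)$ is the discriminator's estimated probability that $x$ came from the data rather than from $G$. The discriminator tries to make $V$ large, while the generator tries to make it small.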
<br/><br/>
![](https://files.training.databricks.com/images/gans.png)

Source: https://lilianweng.github.io/lil-log/assets/images/GAN.png

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
 - Learn about generative and discriminative models
 - Get hands-on experience creating such models on the California Housing dataset

In [3]:
%run "./Includes/Classroom-Setup"

In [4]:
from sklearn.datasets import fetch_california_housing
import numpy as np
import pandas as pd

# Fetch the data & create Pandas DataFrame
cal_housing = fetch_california_housing()
X, y = cal_housing.data, cal_housing.target
data = pd.concat([pd.DataFrame(X, columns=cal_housing.feature_names), pd.DataFrame(y, columns=["label"])], axis=1)

#### In order to implement GANs, we need to create three networks:
0. Generative model
0. Discriminative model
0. GAN

#### There are also multiple parameters/hyperparameters:
0. Number of iterations
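0. Sample size (number of rows drawn from the data and from the noise in each batch)
0. Number of steps (how many times the discriminator is trained per iteration)
0. Shape of the noise distribution (here we use Gaussian noise with mean 1000 and standard deviation 500)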
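
As a quick illustration of the noise parameter, the sketch below draws one batch of Gaussian noise with NumPy. It mirrors the values used in this notebook (`mu`, `sigma`, `sample_size`, `dimension`), but the random generator `rng` and the large check sample `big` are assumptions added for illustration only. The second argument of a NumPy normal sampler is a standard deviation, not a variance.

```python
import numpy as np

# Values mirroring this notebook's settings
mu = 1000          # mean of the Gaussian noise
sigma = 500        # standard deviation (not variance) of the noise
sample_size = 8    # rows per batch
dimension = 9      # 8 features + 1 label

# `rng` is a hypothetical local generator used only in this sketch
rng = np.random.default_rng(42)

# One batch of noise, shaped like one batch of data rows
noise = rng.normal(loc=mu, scale=sigma, size=(sample_size, dimension))
print(noise.shape)  # (8, 9)

# A large sample confirms that `scale` behaves as a standard deviation
big = rng.normal(loc=mu, scale=sigma, size=100_000)
print(big.mean(), big.std())
```

The empirical mean and standard deviation of `big` should land close to 1000 and 500 respectively.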

In [6]:
import tensorflow as tf
from tensorflow.keras import models, layers
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

# Number of iterations 
epoch = 500

# Sample size used to sample data and noise
sample_size = 8

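# Number of DIS training steps per iteration (GEN gets twice as many)
step_size = 10

# Noise mean: we will be using Gaussian noise
mu = 1000

# Noise standard deviation (np.random.normal takes a standard deviation, not a variance)
sigma = 500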

# Dimension of the data (including label)
dimension = 9

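Define the GEN model. Choosing its architecture (depth, width, and weight initialization) is more of an art than a science and usually takes some experimentation.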

In [8]:
from tensorflow.keras import initializers

# This function creates the generator model
def generative_model():
  generative_model = models.Sequential()
  generative_model.add(layers.Dense(20, input_dim=dimension, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=1)))
  generative_model.add(layers.Dense(20, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=1)))
  generative_model.add(layers.Dense(20, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=1)))
  generative_model.add(layers.Dense(20, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=1)))
  generative_model.add(layers.Dense(20, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=1)))
  generative_model.add(layers.Dense(dimension, activation="linear"))
  return generative_model

Define the DIS model.

In [10]:
# This function creates the discriminator network
def discriminative_model():
  discriminative_model = models.Sequential()
  discriminative_model.add(layers.Dense(10, input_dim=dimension, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=2)))
  discriminative_model.add(Dropout(0.2))
  discriminative_model.add(layers.Dense(10, activation="relu", kernel_initializer=initializers.RandomNormal(stddev=2)))
  discriminative_model.add(Dropout(0.2))
  discriminative_model.add(layers.Dense(1, activation="sigmoid", kernel_initializer=initializers.RandomNormal(stddev=2)))
  discriminative_model.compile(optimizer="adam", loss="binary_crossentropy")
  return discriminative_model

In [11]:
# This function puts together Gen model and Dis model
def gan_model(gen,dis):
  inp = Input(shape=(dimension,))
  # Important: freeze DIS inside the combined model so that only GEN's weights are updated when the GAN is trained
  dis.trainable = False
  merged = Model(inputs=inp, outputs=dis(gen(inp)))
  merged.compile(optimizer="adam", loss="binary_crossentropy")
  return merged

In [12]:
# This function trains a GAN
def train_gans(epoch): 
  tf.keras.backend.clear_session()
  np.random.seed(42)
  tf.random.set_seed(42)
  
  # Create networks
  dis_model = discriminative_model()
  gen_model = generative_model()
  gans_model = gan_model(gen_model,dis_model)

  gans_loss = []
  dis_loss = []

  # Train GAN
  for i in range(epoch):
    # Train DIS on batches made of real rows followed by generated rows
    for j in range(step_size):
      noise = np.random.normal(mu, sigma, [sample_size, dimension])
      data_sample = data.sample(sample_size)
      gen_data = gen_model.predict(noise)
      real_gen = np.concatenate((data_sample.iloc[:, 0:dimension], gen_data))
      # Label smoothing: use 0.9/0.1 instead of 1/0 as targets for real/fake rows
      real_label = np.full((sample_size, 1), 0.9)
      fake_label = np.full((sample_size, 1), 0.1)
      real_gen_label = np.concatenate([real_label, fake_label])
      dis_loss.append(dis_model.train_on_batch(real_gen, real_gen_label))

    # Train GEN through the frozen DIS: the noise is given the "real" label,
    # so GEN is rewarded when DIS is fooled
    for k in range(step_size*2):
      noise = np.random.normal(mu, sigma, [sample_size, dimension])
      gans_loss.append(gans_model.train_on_batch(noise, real_label))

    print(f"This is epoch {i}. DIS' loss is {dis_loss[-1]} and GAN's loss is {gans_loss[-1]}")
  return (gen_model, dis_model)

In [13]:
gen_model, dis_model = train_gans(epoch)
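
The training loop uses label smoothing: real rows get the target 0.9 and fake rows 0.1, rather than 1 and 0. The NumPy sketch below is an illustration only and not part of the training code. It evaluates binary cross-entropy over a grid of predictions to show where the loss is smallest under a hard target and under a smoothed one. The helper `bce` and the grid `preds` are hypothetical names introduced here.

```python
import numpy as np

def bce(target, pred, eps=1e-7):
    """Element-wise binary cross-entropy between a target and predictions."""
    pred = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# Candidate discriminator outputs: 0.01, 0.02, ..., 0.99
preds = np.linspace(0.01, 0.99, 99)

# Prediction with the lowest loss under a hard target (1.0)
hard_best = preds[np.argmin(bce(1.0, preds))]

# Prediction with the lowest loss under a smoothed target (0.9)
smooth_best = preds[np.argmin(bce(0.9, preds))]

print(f"{hard_best:.2f} {smooth_best:.2f}")  # 0.99 0.90
```

With a hard target the loss keeps falling as the prediction approaches 1, which pushes the discriminator toward extreme confidence. With the smoothed target the loss is smallest at 0.9, so very confident outputs are penalized.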

Let's generate some fake data and compare its distribution with that of the actual data.

In [15]:
# This function returns the fake data generated by the generator model inside GAN
def generate_fake_data(gen_model):
  tf.keras.backend.clear_session()
  np.random.seed(42)
  tf.random.set_seed(42)
  noise = np.random.normal(mu, sigma, [20000, dimension])
  fake_data_array = gen_model.predict(noise)
  fake_data = pd.DataFrame(fake_data_array)
  return fake_data

Since we are dealing with high-dimensional data, we cannot plot the joint distribution directly. Instead, we are going to look at the first two principal components. Here we consider three scenarios:
0. Look at the first two principal components of the real data
0. Look at the first two principal components of the real/fake data after 2 epochs
0. Look at the first two principal components of the real/fake data after 10 epochs
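
For real and fake points to be comparable in one scatter plot, both sets should be expressed in the same coordinates. The sketch below shows one way to do that with plain NumPy: standardize with the real data's statistics, take the principal directions of the real data from an SVD, and project both sets onto them. The arrays `real` and `fake` are synthetic stand-ins, not the housing data, and using an SVD here is an assumption made to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 9))            # stand-in for the real table
fake = rng.normal(loc=0.5, size=(500, 9))   # stand-in for generated rows

# Standardize both sets with the mean and std of the real data
mean, std = real.mean(axis=0), real.std(axis=0)
real_z = (real - mean) / std
fake_z = (fake - mean) / std

# Rows of vt are the principal directions of the standardized real data
_, _, vt = np.linalg.svd(real_z, full_matrices=False)
components = vt[:2]

# Project both sets onto the same two directions
real_2d = real_z @ components.T
fake_2d = fake_z @ components.T
print(real_2d.shape, fake_2d.shape)  # (500, 2) (500, 2)
```

Because the projection is shared, an offset or a difference in spread between `fake_2d` and `real_2d` reflects a difference between the two datasets rather than a difference between two separately fitted coordinate systems.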

In [17]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.patches as mpatches

# This function plots the first two principal components of the real data and, optionally, the fake data
def plot_pca(data, fake_dataset, original = False):  
  # Standardize both datasets with the scaler fitted on the real data
  scaler = StandardScaler()
  transformed_real_data = scaler.fit_transform(data.values)
  transformed_fake_data = scaler.transform(fake_dataset.values)

  # Find two principal components of the real data and project both datasets onto them,
  # so that real and fake points share the same axes
  pca = PCA(n_components=2)
  principal_components_real_data = pca.fit_transform(transformed_real_data)
  principal_components_fake_data = pca.transform(transformed_fake_data)
  principal_real_data = pd.DataFrame(data=principal_components_real_data, columns=["principal_component_1", "principal_component_2"])
  principal_fake_data = pd.DataFrame(data=principal_components_fake_data, columns=["principal_component_1", "principal_component_2"])

  # Plot the principal components of the real data
  plt.scatter(principal_real_data["principal_component_1"], principal_real_data["principal_component_2"], color="green", s=70, alpha=0.3)
  handles = [mpatches.Patch(color="green", label="Real data")]

  # Overlay the fake data unless only the original data was requested
  if not original:
    plt.scatter(principal_fake_data["principal_component_1"], principal_fake_data["principal_component_2"], color="blue", s=70, alpha=0.3)
    handles.append(mpatches.Patch(color="blue", label="Fake data"))

  plt.title("2 principal components")
  plt.xlabel("component 1")
  plt.ylabel("component 2")
  plt.legend(handles=handles)

In [18]:
# Train GAN for 2 epochs and plot the real data
gen_model, dis_model = train_gans(2)
fake_data = generate_fake_data(gen_model)
plot_pca(data, fake_data, True)

In [19]:
# Plot real and fake data after 2 epochs
plot_pca(data, fake_data, False)

In [20]:
# Train the model for 10 epochs and plot the real and fake data
gen_model, dis_model = train_gans(10)
fake_data = generate_fake_data(gen_model)
plot_pca(data, fake_data, False)

### Conclusion ###
In practice, GANs are more commonly trained on image data. Since 2014, researchers have proposed several techniques for training GANs on tabular data (for example, see https://arxiv.org/pdf/1907.00503.pdf). Although the GAN we have seen works to some extent, in practice it has drawbacks whose solutions have been the object of intensive research since the original 2014 paper. The major drawbacks have to do with the **training of the GAN**.
<br><br>


0. Training a GAN is highly sensitive to the choice of hyperparameters.
0. The loss functions are not informative: the generated samples may come to resemble the true data closely, approximating its distribution well, without the losses showing any corresponding trend. This means that we can't simply run a hyperparameter optimizer on the losses and must instead tune the hyperparameters manually and iteratively.
0. Generating categorical data is a particularly difficult problem for GANs.
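
Since the losses say little about sample quality, one option is to measure the mismatch between real and generated values directly, one feature at a time. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) with NumPy. The helper `ks_statistic` and the synthetic arrays are assumptions for illustration. This is one possible diagnostic, not a method prescribed by this lesson.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])  # the gap can only peak at a sample point
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.default_rng(0)
real_col = rng.normal(0.0, 1.0, 2000)   # stand-in for one real feature
good_fake = rng.normal(0.0, 1.0, 2000)  # same distribution as the real feature
bad_fake = rng.normal(3.0, 1.0, 2000)   # shifted distribution

ks_good = ks_statistic(real_col, good_fake)
ks_bad = ks_statistic(real_col, bad_fake)
print(ks_good, ks_bad)
```

The statistic lies between 0 and 1. It is close to 0 when the two samples follow the same distribution and approaches 1 as they separate, so `ks_good` should come out far smaller than `ks_bad`. Note that it looks at each feature in isolation and says nothing about relationships between features.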

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>