This report diagnoses a VAE trained to generate samples from a Dirichlet distribution. Section 1 describes how the training data are generated and the structure of the VAE; Section 2 states the problem; Section 3 gives the diagnosis; Section 4 gives further diagnosis after small changes to the code; Section 5 proposes two possible solutions; Section 6 is an appendix containing the code.

In particular, the diagnosis in Sections 3 and 4 shows three things. First, the outputs of the VAE are comparatively dispersed at the beginning of training, but after a few iterations they become dense (tightly clustered). Second, after training, passing data drawn from the Dirichlet distribution through the VAE gives dense outputs as well. Third, changing the learning rate or the dimensions of the hidden layers does not help.

# 1. Algorithm

## 1.1 Training Data

### 1.1.1 Data 1

Draw 1 sample: <br /> 
$~~~~~$$\alpha \sim$ Uniform(0.5,2) (the range used by the code in Section 6) <br /> 
$~~~~~$generate random samples ($p^{(i)}_1,p^{(i)}_2,p^{(i)}_3$) $\sim$ Dir($\alpha,\alpha,\alpha$), $i=1,...,200$ <br /> 
Repeat $10^4$ times
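
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the appendix code: the function name `simulate_data1` and its arguments are hypothetical, and the lower bound 0.5 for $\alpha$ follows the `Dirdata` class in Section 6.

```python
import numpy as np

def simulate_data1(n_repeats=10_000, n_points=200, low=0.5, high=2.0, seed=0):
    """Draw alpha ~ Uniform(low, high), then n_points vectors from Dir(alpha, alpha, alpha)."""
    rng = np.random.default_rng(seed)
    # one concentration value per repeat
    alphas = rng.uniform(low, high, size=n_repeats)
    thetas = np.empty((n_repeats, n_points, 3))
    for r, a in enumerate(alphas):
        # each row of thetas[r] is one point (p1, p2, p3) on the simplex
        thetas[r] = rng.dirichlet([a, a, a], size=n_points)
    return alphas, thetas

# small run for illustration; the report repeats the draw 10^4 times
alphas, thetas = simulate_data1(n_repeats=5, n_points=200)
```

Each `thetas[r]` corresponds to one "sample" in the sense used above: 200 points that share the same $\alpha$.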

__Visualization of 1 sample__ 

<figure>
  <img src="data1.png" alt="my alt text" width=300/>
</figure>

### 1.1.2 Data 2

Draw 1 sample: <br /> 
$~~~~~$($\alpha_1,\alpha_2,\alpha_3$)=(1,1,0.5) <br /> 
$~~~~~$generate random samples ($p_1,p_2,p_3$)$\sim$Dir($\alpha_1,\alpha_2,\alpha_3$)<br /> 
Repeat $10^4$ times

__Visualization of 200 samples__ 

<figure>
  <img src="data2.png" alt="my alt text" width=300/>
</figure>

## 1.2 Structure

<figure>
  <img src="diagnosis1.png" alt="my alt text" width=1000/>
</figure>

hidden dim1=40, hidden dim2=30, latent dim (dim3)=20 <br />
$z_1,...,z_{dim3}\sim \mathcal{N}(0,\mathbb{1})$

# 2. Problem

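The generated data are too dense: samples drawn from the trained VAE cluster tightly together instead of spreading out like the Dirichlet training data.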

<figure>
  <img src="alg1res2.png" alt="my alt text" width="300"/>
</figure>
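
One way to put a number on "too dense" is to compare the per-coordinate spread of generated samples with the theoretical Dirichlet standard deviation. The sketch below assumes the training distribution is Dir(1,1,0.5) as in Section 3; the names `dirichlet_moments` and `spread_ratio` are illustrative and do not appear in the appendix code, and the `collapsed` array is only a synthetic stand-in for dense VAE outputs.

```python
import numpy as np

def dirichlet_moments(alpha):
    """Mean and variance of each coordinate of Dir(alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    a0 = alpha.sum()
    mean = alpha / a0
    var = alpha * (a0 - alpha) / (a0 ** 2 * (a0 + 1.0))
    return mean, var

def spread_ratio(samples, alpha):
    """Empirical std of each coordinate divided by the Dirichlet std."""
    _, var = dirichlet_moments(alpha)
    return np.asarray(samples, dtype=float).std(axis=0) / np.sqrt(var)

alpha = [1.0, 1.0, 0.5]
mean, var = dirichlet_moments(alpha)

rng = np.random.default_rng(0)
# reference: genuine Dirichlet draws
reference = rng.dirichlet(alpha, size=20_000)
# synthetic stand-in for dense outputs: tiny noise around the mean
collapsed = np.tile(mean, (200, 1)) + rng.normal(0.0, 0.01, size=(200, 3))

ratio_ref = spread_ratio(reference, alpha)
ratio_col = spread_ratio(collapsed, alpha)
```

A ratio close to 1 means the spread matches the Dirichlet; values far below 1 correspond to dense outputs. As a sense of scale for the reconstruction term of the loss, `var.sum()` (about 0.18 for Dir(1,1,0.5)) is the expected squared error of always predicting the mean.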

# 3. Diagnosis (using training data Dir(1,1,0.5), lr=0.001)

## 3.1 Output of VAE While Training

<figure>
  <img src="3.1.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>


<figure>
  <img src="3.1.2.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

## 3.2 After training, pass data from Dir(1,1,0.5) through the whole VAE

<figure>
  <img src="3.2.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs after training. Outputs are all the same).</figcaption>
</figure>
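
Identical outputs for different inputs mean that either the encoder mean barely changes with the input or the decoder ignores the latent variables. One check that may help separate the two, sketched here as an assumption rather than as part of the original diagnosis, is to look at the KL term of the loss per latent dimension. The function `kl_per_dimension` and the example arrays are hypothetical; in practice `z_mu` and `z_logvar` would be the two encoder outputs collected over a batch.

```python
import numpy as np

def kl_per_dimension(z_mu, z_logvar):
    """Batch-averaged KL(N(mu, exp(logvar)) || N(0, 1)) for each latent dimension.

    Uses the same expression as the KLD term in calculate_loss,
    but keeps the latent dimensions separate instead of summing them.
    """
    z_mu = np.asarray(z_mu, dtype=float)
    z_logvar = np.asarray(z_logvar, dtype=float)
    kl = -0.5 * (1.0 + z_logvar - z_mu ** 2 - np.exp(z_logvar))
    return kl.mean(axis=0)

# hypothetical encoder outputs: 4 inputs, 3 latent dimensions
z_mu = np.array([[0.0,  1.0, 0.0],
                 [0.0, -1.0, 0.0],
                 [0.0,  2.0, 0.0],
                 [0.0, -2.0, 0.0]])
z_logvar = np.tile([0.0, -2.0, 0.0], (4, 1))

kl = kl_per_dimension(z_mu, z_logvar)
```

Dimensions whose KL is essentially zero carry no information about the input (the posterior equals the prior). If every dimension looks like that, the encoder output is not varying with the input.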

## 3.3 Change learning rate

Change the learning rate from 0.001 to 0.1.

<figure>
  <img src="3.3.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>

<figure>
  <img src="3.3.2.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

# 4. Approaches (learning rate changed back to 0.001)

## 4.1. Change dimensionality of hidden layers

__Change hidden dim 1, hidden dim 2 and latent dim to 7, 5, 2__

<figure>
  <img src="4.1.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>

<figure>
  <img src="4.1.2.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

__Change hidden dim 1, hidden dim 2 and latent dim to 100, 80, 60__

<figure>
  <img src="4.1.3.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>

<figure>
  <img src="4.1.4.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

## 4.2 Try the 10-dimensional Dir(1,1.5,0.5,0.7,1.2,1,0.8,1.8,0.2,0.9) as training dataset

<figure>
  <img src="4.2.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>

<figure>
  <img src="4.2.2.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

## 4.3 Change activation function

__(1) Delete some tanh activations__

<figure>
  <img src="4.3.1.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs at the beginning).</figcaption>
</figure>

<figure>
  <img src="4.3.2.png" alt="my alt text" width="500"/>
<figcaption>(Batches of outputs during training. Outputs stay at similar values for the rest of training.)</figcaption>
</figure>

# 5. Possible Solution

## 5.1 Generate the parameters of the Dirichlet distribution instead of the data

### 5.1.1 Structure 

<figure>
  <img src="5.1.1.png" alt="my alt text" width="1000"/>
</figure>

(1) $z_1,...,z_{dim3}\sim \mathcal{N}(0,\mathbb{1})$ <br />
(2) $z_i \sim \mathrm{Gamma}(\alpha,\beta)$ for $i=1,...,dim3$, sampled with an approximation of the inverse Gamma CDF: <br />
$~~~~~$$u_1,...,u_{dim3}\sim \mathrm{Uniform}(0,1)$ <br />
$~~~~~$$z_i\approx \beta^{-1}(u_i\alpha\Gamma (\alpha))^{1/\alpha}$ for $i=1,...,dim3$
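
The approximate inverse CDF in (2) can be sketched as follows. This is only an illustration under stated assumptions: `approx_gamma_inverse_cdf` is a hypothetical helper, $\beta$ is treated as a rate parameter, and the values $\alpha=0.5$, $\beta=1$, dim3 $=20$ are chosen for the example only.

```python
import math
import numpy as np

def approx_gamma_inverse_cdf(u, alpha, beta=1.0):
    """Approximate Gamma(alpha, beta) quantile: z = (u * alpha * Gamma(alpha))**(1/alpha) / beta."""
    u = np.asarray(u, dtype=float)
    return (u * alpha * math.gamma(alpha)) ** (1.0 / alpha) / beta

rng = np.random.default_rng(0)
alpha, beta, dim3 = 0.5, 1.0, 20          # illustrative values only
u = rng.uniform(0.0, 1.0, size=(5, dim3))  # u_1, ..., u_dim3 for 5 draws
z = approx_gamma_inverse_cdf(u, alpha, beta)

# round trip through the small-z form of the CDF, F(z) ~ (beta*z)**alpha / (alpha*Gamma(alpha))
u_back = (beta * z) ** alpha / (alpha * math.gamma(alpha))
```

The formula inverts the small-$z$ form of the Gamma CDF, so it is most accurate when $u$ is small. Note also that the approximate $z$ is bounded above by $\beta^{-1}(\alpha\Gamma(\alpha))^{1/\alpha}$, whereas a true Gamma variable is unbounded.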


# 6. Appendix

__Code__

VAE

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from Dirdata2 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]; despite the name, it is used as
        # the log-variance (see reparameterize and calculate_loss)
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softmax(dim=1)
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = self.out1(hidden2)
        # softmax makes the input_dim outputs sum to 1
        pred = self.out2(out1)
        # pred is of shape [batch_size, input_dim]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training, draw a random sample from the learned
           z_dim-dimensional normal distribution; during inference, return its mean.
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def forward(self, x):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        return generated_x, z_mu,z_sd

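def calculate_loss(reconstructed1,target, mean, log_sd):
    # reconstruction loss: squared error summed over the batch
    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    # KL divergence between N(mean, exp(log_sd)) and N(0, 1), summed over the batch;
    # log_sd is used as the log-variance here
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD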

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=200
    batch_size = 5
    hidden_dim1 = 40
    hidden_dim2 = 30
    z_dim = 10
    samples = 10000
    num_param=3
    input_dim = 3

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds =Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(50)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device)
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu, z_sd = model(x_)
            print(reconstructed_x)
            #change dimensionality for computing loss function
            #(use the actual batch size so a smaller final batch does not break the reshape)
            reconstructed_x1=reconstructed_x.reshape(x_.shape[0],1,-1)[:,0]
            #loss 
            loss=calculate_loss(reconstructed_x1,x_,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[torch.isnan(param.grad)]=0
                
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        # len(ds) equals samples, the number of points actually seen per epoch
        t.set_description(f'Loss is {total_loss/len(ds):.3}')
    
    ###### Sampling 200 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(200, z_dim).to(device) # random draw
    with torch.no_grad():
        sampled_y = model.decoder(z)
    # move the tensor to the CPU and convert to NumPy before building the DataFrame
    df=pd.DataFrame(sampled_y.cpu().numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
    fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
    fig.show()

Generate training data

In [None]:
import os,sys
#sys.path.append(os.path.join(os.path.dirname(__file__), '../'))
import numpy as np 
import matplotlib.pyplot as plt 
from torch.utils.data import Dataset, DataLoader
import torch
import random
from scipy.stats import dirichlet
import pandas as pd
import plotly.express as px
from sklearn.utils import shuffle

class Dirdata(Dataset):
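    def __init__(self, dataPoints=20, samples=10000,
                        seed=None,indicate=0,num_param=3):
        self.dataPoints = dataPoints
        self.samples = samples
        # draw the default seed when the object is created, not once when the class is defined
        self.seed = np.random.randint(20) if seed is None else seed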
        self.Max_Points = samples * dataPoints
        self.indicate=indicate
        self.num_param=num_param
        np.random.seed(self.seed)
        self.evalPoints, self.data, self.data1,self.occure = self.__simulatedata__()
        
    def __len__(self):
        return self.samples
    
    def __getitem__(self, idx=0):
        return(self.evalPoints, self.data[idx],self.data1[idx],self.occure[idx])
    
    def __simulatedata__(self):
        # Dir(alpha,alpha,alpha), alpha~Uniform(0.5,2)
        if (self.indicate==0):
            #generate alpha
            alpha=np.random.uniform(0.5,2,self.samples)
            #repeat alpha
            alpha=np.array([alpha]*self.num_param).transpose()
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta 
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha[idx,:],self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta1 = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta1[idx][idy,:],size=1)
            occurrence =occurrence.reshape(self.samples*self.dataPoints,self.num_param)
            return (alpha ,theta,theta1, occurrence)

        
        # Dir(1,1,0.2)
        if (self.indicate==4):
            alpha=np.array([1,1,0.2])
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha,self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta1 = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta1[idx][idy,:],size=1)
            occurrence=occurrence.reshape(self.samples*self.dataPoints,self.num_param)
            return (alpha ,theta, theta1,occurrence)

        
        # Dir(1,1.5,0.5,0.7,1.2,1,0.8,1.8,0.2,0.9), 10-dimensional (requires num_param=10)
        if (self.indicate==5):
            alpha=np.array([1,1.5,0.5,0.7,1.2,1,0.8,1.8,0.2,0.9])
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha,self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta1 = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta1[idx][idy,:],size=1)
            occurrence=occurrence.reshape(self.samples*self.dataPoints,self.num_param)
            return (alpha ,theta, theta1,occurrence)

if __name__ == '__main__':
    ds =Dirdata(dataPoints=200, samples=1, indicate=0,num_param=3)
    dataloader = DataLoader(ds, batch_size=1, shuffle=True)
    fig = plt.figure(figsize=(8,8))
    for no,dt in enumerate(dataloader):
        # dt[2] is a tensor of shape [1, 200, 3]; convert to a NumPy array for pandas
        df=pd.DataFrame(dt[2].reshape(200,3).numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()