# Generate Dirichlet Distribution by VAE 

This Jupyter notebook describes the 8 models (Model 1 to Model 8) I tried for generating samples from a Dirichlet distribution with a VAE. Model 2 and Model 3 are the best two so far, but they still have some issues. <br />

I first explain how I generate the training datasets, then discuss the models in the following order: the standard VAE, the models with comparatively better performance, and the other models. For each model I give the result, the problem (if applicable), and a possible reason for the problem or success. I also list some approaches I haven't tried or don't know how to implement.

The code is provided in the appendix.

## 1. Generate Training Data 

### 1.1 Data 1

Draw 1 sample: <br /> 
$~~~~~$$\alpha \sim$Uniform(0,2) <br /> 
$~~~~~$generate random samples ($p^{(i)}_1,p^{(i)}_2,p^{(i)}_3$)$\sim$Dir($\alpha,\alpha,\alpha$), i=1,...,200 <br /> 
Repeat $10^4$ times and shuffle the whole training dataset

<figure>
  <img src="data1.png" alt="my alt text" width=300/>
</figure>
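The sampling procedure for Data 1 can be sketched in a few lines. This is a minimal sketch using only the Python standard library (the notebook's own generator in Appendix A uses NumPy); the names `sample_dirichlet` and `make_data1` are hypothetical, and a Dirichlet draw is obtained here by normalizing independent Gamma draws. The text above states Uniform(0,2) while Appendix A draws from Uniform(0.5,2), so the bounds are left as parameters, with the Appendix A values as defaults.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one simplex point by normalizing independent Gamma(alpha_i, 1) draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def make_data1(samples=10_000, data_points=200, low=0.5, high=2.0, seed=0):
    """Return `samples` groups of `data_points` simplex points, shuffled across groups."""
    rng = random.Random(seed)
    points = []
    for _ in range(samples):
        alpha = rng.uniform(low, high)      # one symmetric concentration per sample
        for _ in range(data_points):
            points.append(sample_dirichlet([alpha] * 3, rng))
    rng.shuffle(points)                     # shuffle the whole training dataset
    # regroup the shuffled points into `samples` groups of `data_points`
    return [points[i * data_points:(i + 1) * data_points] for i in range(samples)]

# small run for illustration; the notebook uses samples=10**4
data = make_data1(samples=5, data_points=200)
```

Shuffling across groups means each training sample mixes points drawn under different values of $\alpha$, which matches the reshape, shuffle, reshape sequence in Appendix A.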

### 1.2 Data 2

Draw 1 sample: <br /> 
$~~~~~$($\alpha_1,\alpha_2,\alpha_3$)=(1,1,0.5) <br /> 
$~~~~~$generate random samples ($p^{(i)}_1,p^{(i)}_2,p^{(i)}_3$)$\sim$Dir($\alpha_1,\alpha_2,\alpha_3$), i=1,...,200 <br /> 
Repeat $10^4$ times and shuffle the whole training dataset

<figure>
  <img src="data2.png" alt="my alt text" width=300/>
</figure>

I also tried ($\alpha_1,\alpha_2,\alpha_3$)=(1,1,0.2):

<figure>
  <img src="data3.png" alt="my alt text" width=300/>
</figure>

* __Code__ provided in __Appendix A__

## 2. Models
Current order: standard VAE (Model 1) $\rightarrow$ models with comparatively better performance (Model 2, 3) $\rightarrow$ other models (Model 4, 5, 6, 7, 8). <br />

Original order: Model 1 $\rightarrow$ Model 4 $\rightarrow$ Model 5 $\rightarrow$ Model 6 $\rightarrow$ Model 2 $\rightarrow$ Model 3  $\rightarrow$ Model 7 $\rightarrow$ Model 8

### Standard VAE

### 2.1 Model 1

A VAE in the standard setting: the prior on the latent variables is a standard multivariate normal distribution, and the output layer applies a softmax. The model is trained by minimizing the loss function: <br /> 
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$MSE(p,p$^{\prime\prime}$)+KL(Z||N(0,1)) <br /> 
Details are shown in the graphical model below: 

![alg1](alg1.png)
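The two terms of this loss can be written out explicitly. The sketch below is plain Python rather than the PyTorch code of Appendix B, and the function names are hypothetical. It follows two choices visible in Appendix B: the reconstruction term is a summed squared error (`reduction='sum'`), and the encoder's second output is treated as a log-variance, which gives the closed-form KL term $-\frac{1}{2}\sum_d (1+\log\sigma_d^2-\mu_d^2-\sigma_d^2)$.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def vae_loss(target, reconstructed, mu, log_var):
    """Summed squared reconstruction error plus the KL term."""
    rec = sum((t - r) ** 2 for t, r in zip(target, reconstructed))
    return rec + kl_to_standard_normal(mu, log_var)

# a perfect reconstruction with a posterior equal to the prior gives zero loss
zero_loss = vae_loss([0.2, 0.3, 0.5], [0.2, 0.3, 0.5], [0.0, 0.0], [0.0, 0.0])
# shifting one latent mean to 1 with unit variance costs 0.5 nats
shifted_kl = kl_to_standard_normal([1.0], [0.0])
```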

* __Result__: 

<figure>
  <img src="alg1res1.png" alt="my alt text" width=300/>
  <figcaption>(Data 1).</figcaption>
</figure>

<figure>
  <img src="alg1res2.png" alt="my alt text" width="300"/>
  <figcaption>(1,1,0.5).</figcaption>
</figure>

<figure>
  <img src="alg1res3.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.2).</figcaption>
</figure>

* __Problem__: The data generated by this model are too concentrated, and the more training samples are used, the more concentrated the generated datapoints become.

* __Possible Reason for Result__: this model only learns where the training datapoints concentrate and ignores their sparsity. See https://arxiv.org/abs/1901.02739 for more detail.

* __Code__ provided in __Appendix B__

### Models with comparatively better performance

### 2.2 Model 2

In the models tried before this one, the decoder outputs data points directly. In this model I changed the outputs to the parameters of a Dirichlet distribution, and those parameters are then used to generate data. The loss function used to train the model is  <br /> $~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$MSE(p,p$^{\prime\prime}$)+KL(Z||N(0,1)) <br /> 

![alg3](alg3.png)
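During training, Appendix C turns the decoder's parameters into points with an approximation to the inverse Gamma CDF, $F^{-1}(u;\alpha,\beta)\approx\beta^{-1}\,(u\,\alpha\,\Gamma(\alpha))^{1/\alpha}$ with $u\sim$Uniform(0,1), followed by normalization. The sketch below restates that step in plain Python; the name `approx_dirichlet_sample` is hypothetical. The approximation comes from the small-argument expansion of the incomplete gamma function, so it is expected to be most accurate for small $\alpha$; how well it holds for the values used here is not checked in this sketch.

```python
import math
import random

def approx_dirichlet_sample(alphas, beta=1.0, rng=random):
    """Map uniform noise to an approximate Dirichlet draw.

    Each coordinate is v_i = (u_i * alpha_i * Gamma(alpha_i)) ** (1 / alpha_i) / beta,
    and the v_i are then normalized to sum to one.
    """
    vs = []
    for a in alphas:
        u = rng.random()
        # exp(lgamma(a)) equals Gamma(a); Appendix C uses torch.lgamma the same way
        vs.append((u * a * math.exp(math.lgamma(a))) ** (1.0 / a) / beta)
    total = sum(vs)
    return [v / total for v in vs]

rng = random.Random(0)
draws = [approx_dirichlet_sample([1.0, 1.0, 0.5], rng=rng) for _ in range(1000)]
```

Because the noise `u` is drawn independently of the parameters, the output is a differentiable function of $\alpha$, which is what lets the MSE term be backpropagated through the sampling step.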

* __Result__:

   <figure>
  <img src="alg1res10.png" alt="my alt text" width=300/>
  <figcaption>(Data 1).</figcaption>
</figure>
   <figure>
  <img src="alg1res11.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.5).</figcaption>
</figure>
 
 <figure>
  <img src="alg1res9.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.2).</figcaption>
</figure>

* __Possible reason for its success__: sampling data from the learned parameters, instead of emitting points directly, keeps the generated data random.

* __Further Discussion__: 
    1. Is learning parameters meaningful in a VAE? 
    2. The results for (1,1,0.2) and (1,1,0.5) are similar, so this model is not sensitive to changes in the parameters. I also tried (0.8,1.2,0.4), and the result is the same as for (1,1,0.2). Approaches tried: (1) removing tanh in the decoder; (2) increasing the number of samples from $10^4$ to $10^5$; (3) changing the prior of the latent variables to a Dirichlet or Gamma distribution. The first two approaches did not improve performance, but the last one did (see __Model 3__ for more detail).


* __Code__ provided in __Appendix C__

### 2.3 Model 3 (the best so far)

As in Model 2, the outputs of Model 3 are parameters rather than data. The prior of the latent variables changes from a standard multivariate normal to a Dirichlet distribution Dir($\hat{\alpha},...,\hat{\alpha}$) with $\hat{\alpha}=1-1/$z_dim. To optimize the parameters, minimize the loss function: <br />
$~~~~~~~~~~~~~~~~~~~~~~~~$MSE(p,p$^{\prime\prime}$)+KL(Z||Dir($\hat{\alpha},...,\hat{\alpha}$)), where $\hat{\alpha}=1-1/$z_dim <br />
Details are shown in the graphical model below:


![alg4](alg4.png)
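The KL term against the Dirichlet prior has a closed form when the approximate posterior is itself a Dirichlet. The expression below is the textbook formula, given only as a reference under that assumption: the part of Appendix D shown here does not include the KL computation, so the notebook may use a different form. The posterior parameters $\alpha=(\alpha_1,\dots,\alpha_K)$ are a symbol introduced for illustration, with $K=$ z_dim, $\alpha_0=\sum_k\alpha_k$, and $\psi$ the digamma function.

$$
\mathrm{KL}\big(\mathrm{Dir}(\alpha)\,\|\,\mathrm{Dir}(\hat{\alpha},\dots,\hat{\alpha})\big)
= \log\Gamma(\alpha_0)-\sum_{k=1}^{K}\log\Gamma(\alpha_k)
-\log\Gamma(K\hat{\alpha})+K\log\Gamma(\hat{\alpha})
+\sum_{k=1}^{K}(\alpha_k-\hat{\alpha})\big(\psi(\alpha_k)-\psi(\alpha_0)\big)
$$

With $\hat{\alpha}=1-1/K$ the prior's total concentration is $K\hat{\alpha}=K-1$.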

* __Result__:
<figure>
  <img src="alg2res6.png" alt="my alt text" width=300/>
  <figcaption>(Data 1).</figcaption>
</figure>
<figure>
  <img src="alg2res5.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.5).</figcaption>
</figure>
<figure>
  <img src="alg2res4.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.2).</figcaption>
</figure>

* __Possible reason for its success__: sampling data from the learned parameters keeps the generated data random, and using a Dirichlet distribution as the prior of the latent variables gives better performance.

* __Further Discussion__: 1. The performance improved, but not enough. For example, if the training dataset is drawn from Dir(0.4,0.8,1.2), the outputs are around (0.6007, 1.2494, 1.3129). How can this be improved? (I tried using a Gamma distribution as the prior, but it doesn't help.)

* __Code__ provided in __Appendix D__
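One detail in the decoders of Appendix C and Appendix D may be relevant to the outputs reported above: the last linear layer passes through `tanh` before `Softplus`, so every output parameter is confined to the interval $(\log(1+e^{-1}),\ \log(1+e))\approx(0.313,\ 1.313)$. The third reported output, 1.3129, sits almost at that upper bound. Whether this explains the gap is not established (removing tanh was reported not to help for Model 2). The sketch below only checks the range, with a hand-written `softplus` standing in for `nn.Softplus(beta=1)`.

```python
import math

def softplus(x):
    """log(1 + exp(x)), the beta=1 softplus."""
    return math.log1p(math.exp(x))

def decoder_output_activation(x):
    """tanh followed by softplus, as applied to the last decoder layer."""
    return softplus(math.tanh(x))

lower = softplus(-1.0)   # approached as the pre-activation goes to -infinity
upper = softplus(1.0)    # approached as the pre-activation goes to +infinity
outputs = [decoder_output_activation(x) for x in (-50.0, -1.0, 0.0, 1.0, 50.0)]
```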

### Other Models

### 2.4 Model 4

Based on Model 1, with the loss function changed to <br />
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ $\sum_{k=1}^{batch}\sum_{j=1}^{datapoints}(\alpha_1-1)\log(p^{\prime\prime(kj)}_1)+(\alpha_2-1)\log(p^{\prime\prime(kj)}_2)+(\alpha_3-1)\log(p^{\prime\prime(kj)}_3)$

Details are shown in the graphical model below (same as Model 1):

![alg1](alg1.png)

* __Result__:

   <figure>
  <img src="alg1res4.png" alt="my alt text" width=300/>
  <figcaption>(One dot for any training data).</figcaption>
</figure>

* __Problem__: The result is a single dot regardless of the training dataset.

* __Possible reason for result__: the point that optimizes this loss function is the mode of the Dirichlet distribution, so the model collapses to that single point.


* __Code__ provided in __Appendix E__.
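The collapse to a single point can be illustrated numerically. The sketch below assumes the quantity $\sum_i(\alpha_i-1)\log p_i$ is maximized (equivalently, its negative is minimized) and uses hypothetical values $\alpha=(2,3,4)$ with every $\alpha_i>1$, for which the Dirichlet mode is $(\alpha_i-1)/(\sum_j\alpha_j-3)$. A grid search over the simplex recovers exactly that point. The objective depends only on $p$ and has no term that rewards spread, so a decoder trained on it has no incentive to output anything other than its optimizer.

```python
import math

def objective(p, alphas):
    """sum_i (alpha_i - 1) * log(p_i), the p-dependent part of the Dirichlet log-density."""
    return sum((a - 1.0) * math.log(x) for a, x in zip(alphas, p))

def grid_argmax(alphas, n=60):
    """Search the interior of the 3-simplex on a grid with spacing 1/n."""
    best, best_p = -math.inf, None
    for i in range(1, n - 1):
        for j in range(1, n - i):
            k = n - i - j                    # k >= 1 by construction
            p = (i / n, j / n, k / n)
            val = objective(p, alphas)
            if val > best:
                best, best_p = val, p
    return best_p

alphas = (2.0, 3.0, 4.0)                     # hypothetical, all greater than 1
mode = tuple((a - 1.0) / (sum(alphas) - len(alphas)) for a in alphas)
found = grid_argmax(alphas)
```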

### 2.5 Model 5

Based on Model 1, I added a regularization term to the loss function that pushes the outputs for one sample away from the outputs for the other samples. The loss function is: <br />
$~~~~~~~~~~~~~~~~~~~~~~~$ $MSE(p,p^{\prime\prime})+KL(Z||N(0,1))-a\sum_{k=1}^{batch}MSE(p^{\prime\prime(k)},p^{\prime\prime(-k)})$

* __Result__  (a=1000): 

  <figure>
  <img src="alg1res6.png" alt="my alt text" width=300/>
  <figcaption>(Data 1).</figcaption>
</figure>
   <figure>
  <img src="alg1res7.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.5).</figcaption>
</figure>
   <figure>
  <img src="alg1res5.png" alt="my alt text" width=300/>
  <img src="alg1res5.1.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.2).</figcaption>
</figure>

* __Problem__: Although the datapoints are more dispersed after adding the regularization term, the dispersion is still not enough to generate new data from this VAE. In addition, the results for the training dataset Dir(1,1,0.2) are not stable. A large "a", such as 1000, significantly reduces the difference in scale between the MSE term and the regularization term, but the performance with a=1000 is still not good.

* __Possible reason for result__: I only use a single parameter "a" to scale the regularization term; a more suitable scaling method should be used. 
    

* __Code__ provided in __Appendix F__.

### 2.6 Model 6

Based on Model 1, change the prior of the latent variables from a standard multivariate normal distribution to a Dirichlet distribution. Train the model by minimizing <br />
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ $MSE(p,p^{\prime\prime})+KL(Z||Dir(\hat{\alpha},...,\hat{\alpha}))$ <br /> where $\hat{\alpha}$=1-1/z_dim

Details are shown in the graphical model below:

![alg2](alg2.png)

* __Result__:

<figure>
  <img src="alg2res2.png" alt="my alt text" width=300/>
  <figcaption>(Data 1).</figcaption>
</figure>

<figure>
  <img src="alg2res3.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.5).</figcaption>
</figure>

<figure>
  <img src="alg2res1.png" alt="my alt text" width=300/>
  <figcaption>(1,1,0.2).</figcaption>
</figure>

* __Problem__: The generated datapoints are too concentrated.

* __Possible Reason for Result__: this model only learns where the training datapoints concentrate and ignores their sparsity. 

* __Code__ provided in __Appendix G__

### 2.7 Model 7

Based on Model 2, with the loss function changed to <br />
$~~~~~~~~~~~~~~~~~~~~~~~~~$$\sum_{k=1}^{batch}\sum_{j=1}^{datapoints}\log(\Gamma(\sum_{i}\alpha^{kj}_i))+\sum_{i}\log(\Gamma(\alpha^{kj}_i))+\sum_{i}(\alpha^{kj}_i-1)\log(p^{kj}_i)$ <br />
The graphical model is shown below (same as Model 2):

![alg3](alg3.png)
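For comparison, the standard Dirichlet log-density is $\log\Gamma(\sum_i\alpha_i)-\sum_i\log\Gamma(\alpha_i)+\sum_i(\alpha_i-1)\log p_i$, with a minus sign on the second term, and a likelihood-based loss would minimize its negative. The formula as written above has a plus sign on that term and no overall negation; whether that reflects the code in Appendix H or only the write-up cannot be checked from the code shown here. The sketch below is a plain-Python reference for the standard form, with hypothetical function names.

```python
import math

def dirichlet_log_pdf(p, alphas):
    """Log-density of Dir(alphas) at the simplex point p."""
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    return log_norm + sum((a - 1.0) * math.log(x) for a, x in zip(alphas, p))

def dirichlet_nll(points, alphas_per_point):
    """Negative log-likelihood summed over points, one parameter vector per point."""
    return -sum(dirichlet_log_pdf(p, a) for p, a in zip(points, alphas_per_point))

# Dir(1,1,1) is uniform on the simplex with density Gamma(3) = 2 everywhere
flat = dirichlet_log_pdf((0.2, 0.3, 0.5), (1.0, 1.0, 1.0))
nll = dirichlet_nll([(0.2, 0.3, 0.5), (0.6, 0.3, 0.1)],
                    [(1.0, 1.0, 1.0), (1.0, 1.0, 1.0)])
```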

* __Result__: 

   <figure>
  <img src="alg1res8.png" alt="my alt text" width=300/>
  <figcaption>(for all data).</figcaption>
</figure>

* __Problem of this method__: The result is the same regardless of the training dataset.

* __Possible reason for result__: all three parameter outputs blow up to their maximum value. 

* __Code__ provided in __Appendix H__.

### 2.8 Model 8

Based on Model 3, with the loss function changed to <br />
$~~~~~~~~~~~~~~~~~~~~~~~~$$\sum_{k=1}^{batch}\sum_{j=1}^{datapoints}\log(\Gamma(\sum_{i}\alpha^{kj}_i))+\sum_{i}\log(\Gamma(\alpha^{kj}_i))+\sum_{i}(\alpha^{kj}_i-1)\log(p^{kj}_i)$ <br />
Details are shown in the graphical model below (same as Model 3):

![alg4](alg4.png)

* __Result__:
    <figure>
  <img src="alg2res9.png" alt="my alt text" width=300/>
  <figcaption>(for all datasets).</figcaption>
</figure>

* __Problem of this method__: The result is the same regardless of the training dataset.

* __Possible reason for result__: all three parameter outputs blow up to their maximum value. 

* __Code__ provided in __Appendix I__

## 3. Methods I haven't tried

### 3.1
The reference https://openreview.net/forum?id=Hkex2a4FPr introduces a method that can fill the triangle visualization of a Dirichlet distribution in a text setting, but it is presented for text labelled with positive and negative sentiment.

### 3.2
Output only one set of parameters with dimensionality 3, instead of dimensionality 3*200=600, and then generate 200 datapoints from $(p_1,p_2,p_3)\sim Dir(\alpha^\prime_1,\alpha^\prime_2,\alpha^\prime_3)$. Details are shown in the graphical model below:


![alg5](alg5.png)

### 3.3 
Use 3 VAEs, one for each entry of $(p_1,p_2,p_3)$.

## 4. Appendix

### 4.1 Appendix A

In [None]:
import os,sys
#sys.path.append(os.path.join(os.path.dirname(__file__), '../'))
import numpy as np 
import matplotlib.pyplot as plt 
from torch.utils.data import Dataset, DataLoader
import torch
import random
from scipy.stats import dirichlet
import pandas as pd
import plotly.express as px
from sklearn.utils import shuffle

class Dirdata(Dataset):
    def __init__(self, dataPoints=20, samples=10000,
                        seed=np.random.randint(20),indicate=0,num_param=3):
        self.dataPoints = dataPoints
        self.samples = samples
        self.seed = seed
        self.Max_Points = samples * dataPoints
        self.indicate=indicate
        self.num_param=num_param
        np.random.seed(self.seed)
        self.evalPoints, self.data, self.occure = self.__simulatedata__()
        
    def __len__(self):
        return self.samples
    
    def __getitem__(self, idx=0):
        return(self.evalPoints, self.data[idx],self.occure[idx])
    
    def __simulatedata__(self):
        # Dir(alpha,alpha,alpha), alpha~Uniform(0.5,2)
        if (self.indicate==0):
            #generate alpha
            alpha=np.random.uniform(0.5,2,self.samples)
            #repeat alpha
            alpha=np.array([alpha]*self.num_param).transpose()
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta 
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha[idx,:],self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta[idx][idy,:],size=1)
            
            return (alpha ,theta, occurrence)

        #Dir(alpha,alpha,alpha), alpha has equally spaced grid
        if (self.indicate==1):
            #generate alpha
            alpha=np.linspace(0.5, 2,self.samples)
            #repeat alpha
            alpha=np.array([alpha]*self.num_param).transpose()
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta 
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha[idx,:],self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta[idx][idy,:],size=1)
            
            return (alpha ,theta, occurrence)

        # Dir(alpha1,alpha2,alpha3), alpha1,alpha2,alpha3~Uniform(0.5,2)
        if (self.indicate==2):
            #generate alpha
            alpha=np.random.uniform(0.5,2,(self.samples,self.num_param))
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta 
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha[idx,:],self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta[idx][idy,:],size=1)
            
            return (alpha ,theta, occurrence)

        # Dir(alpha1,alpha2,alpha3), alpha1,alpha2,alpha3 have equally spaced grids
        if (self.indicate==3):
            #generate alpha
            alpha=np.linspace(0.5, 2, self.samples*self.num_param)
            #shuffle alpha
            alpha = np.random.choice(alpha, (self.samples,self.num_param))
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha[idx,:],self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta[idx][idy,:],size=1)
            
            return (alpha ,theta, occurrence)

        # Dir(1,1,0.5); change the last entry to 0.2 to generate the Dir(1,1,0.2) dataset
        if (self.indicate==4):
            alpha=np.array([1,1,0.5])
            #initialize theta and counts (counts are only used in inference, not training)
            theta = np.zeros((self.samples,self.dataPoints,self.num_param))
            occurrence = np.zeros((self.samples, self.dataPoints,self.num_param))
            #generate theta
            for idx in range(self.samples):
                #generate theta from dirichlet distr
                theta[idx]=np.random.dirichlet(alpha,self.dataPoints)
            #shuffle theta
            theta =theta.reshape(self.samples*self.dataPoints,self.num_param)
            theta=shuffle(theta, random_state=0)
            theta = theta.reshape(self.samples,self.dataPoints,self.num_param)
            #generate counts
            for idx in range(self.samples):
                for idy in range(self.dataPoints):
                    occurrence[idx][idy,:]=np.random.multinomial(50,theta[idx][idy,:],size=1)

            return (alpha ,theta, occurrence)

if __name__ == '__main__':
    ds =Dirdata(dataPoints=200, samples=3, indicate=4,num_param=3)
    dataloader = DataLoader(ds, batch_size=1, shuffle=True)
    fig = plt.figure(figsize=(8,8))
    for no,dt in enumerate(dataloader):
        df=pd.DataFrame(dt[1][0],columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()


### 4.2 Appendix B

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from Dirdata import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softmax(dim=2)
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = self.out1(hidden2)
        #reshape output for further procedure
        out1 = torch.reshape(out1,(x.shape[0],dataPoints,num_param))
        #ensure sum of 3 elements to be 1
        pred = self.out2(out1)
        # pred is of shape [batch_size, dataPoints, num_param]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which combines an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training random sample from the learned ZDIMS-dimensional
           normal distribution; during inference its mean.
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize: z_sd is used as a log-variance, so exp(z_sd / 2) is the std
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def forward(self, x):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        return generated_x, z_mu,z_sd

def calculate_loss(reconstructed1,target, mean, log_sd):
    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=200
    batch_size = 50
    hidden_dim1 = 300
    hidden_dim2 = 250
    z_dim = 150
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds =Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu, z_sd = model(x_)
            #change dimensionality for computing loss function
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #loss 
            loss=calculate_loss(reconstructed_x1,x_,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
                
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        sampled_y = model.decoder(z)
    
    for no, y in enumerate(sampled_y):
        #create a dataframe
        df=pd.DataFrame(y.cpu().numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        #plot
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()

### 4.3 Appendix C

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from DirData1 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
        
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softplus(beta=1,threshold=20)
        
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = torch.tanh(self.out1(hidden2))
        # softplus makes the outputs positive; after the tanh above they lie in about (0.313, 1.313)
        out2 = self.out2(out1)
        out2 = torch.reshape(out2,(x.shape[0],dataPoints,num_param))
        # out2 is of shape [batch_size, dataPoints, num_param]
        return out2

class VAE(nn.Module):
    ''' This is the VAE, which combines an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training random sample from the learned ZDIMS-dimensional
           normal distribution; during inference its mean.
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def reparameterize1(self,x,beta,dataPoints,num_param):
        '''Turn the decoder's parameter outputs into approximate Dirichlet draws.
           Each entry uses the approximate inverse Gamma CDF
           v = (u * alpha * Gamma(alpha))**(1/alpha) / beta with u ~ Uniform(0,1);
           each row of v is then normalized to sum to one.
        '''
        x = x.reshape((x.shape[0],dataPoints,num_param))
        # create the noise and the output on the same device as x (needed on GPU)
        u = torch.rand((x.shape[0],dataPoints,num_param), device=x.device)
        y = torch.zeros((x.shape[0],dataPoints,num_param), device=x.device)
        for idx in range(x.shape[0]):
            v = 1/beta*((torch.mul(torch.mul(u[idx],x[idx]),torch.exp(torch.lgamma(x[idx])))).pow(1/x[idx]))
            y[idx]=torch.mul(v,1/torch.transpose(torch.sum(v,1).repeat(x[idx].shape[1],1),0,1))
        return y

    def forward(self, x,beta,dataPoints,num_param):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        pred = self.reparameterize1(generated_x,beta,dataPoints,num_param)
        return pred, z_mu,z_sd

def calculate_loss(reconstructed1, target, mean, log_sd):

    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD 

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=300
    batch_size = 50
    hidden_dim1 = 200
    hidden_dim2 = 150
    z_dim =100
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param
    beta=1

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds = Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(50)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu,z_sd = model(x_,beta,dataPoints,num_param)
            #change dimension
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #loss 
            loss=calculate_loss(reconstructed_x1,x_,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        alpha = model.decoder(z)
    alpha=alpha.detach().cpu().numpy()  # move to CPU and convert for np.random.dirichlet
    sampled_y=np.zeros((5,dataPoints,num_param))
    #generate dirichlet distribution data
    for idx in range(5):
        for idy in range(dataPoints):
            sampled_y[idx][idy,:]=np.random.dirichlet(alpha[idx][idy,:])
    print(alpha)        
    for no, y in enumerate(sampled_y):
        df=pd.DataFrame(y,columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()

### 4.4 Appendix D

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from DirData1 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
        
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softplus(beta=1,threshold=20)
        
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = torch.tanh(self.out1(hidden2))
        # softplus makes the outputs positive
        out2 = self.out2(out1)
        # reshape to [batch_size, dataPoints, num_param] (dataPoints and num_param are globals)
        out2 = torch.reshape(out2,(x.shape[0],dataPoints,num_param))
        return out2

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training, draw a random sample from the learned z_dim-dimensional
           normal distribution; during inference, return its mean.
           Note that z_sd holds the log-variance, so std = exp(z_sd / 2).
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def reparameterize1(self,x,beta,dataPoints,num_param):
        '''Draw an approximate Dirichlet sample whose concentration parameters
           are the decoder outputs x: transform uniform noise into approximate
           Gamma draws, then normalize each row to sum to 1.
        '''
        # uniform noise, created on the same device as x
        u = torch.rand((x.shape[0],dataPoints,num_param), device=x.device)
        x = x.reshape((x.shape[0],dataPoints,num_param))
        y = torch.zeros((x.shape[0],dataPoints,num_param), device=x.device)
        for idx in range(x.shape[0]):
            # approximate Gamma draws: (u * x * Gamma(x))^(1/x) / beta
            v = 1/beta*((torch.mul(torch.mul(u[idx],x[idx]),torch.exp(torch.lgamma(x[idx])))).pow(1/x[idx]))
            # normalize each row so that its entries sum to 1
            y[idx]=torch.mul(v,1/torch.transpose(torch.sum(v,1).repeat(x[idx].shape[1],1),0,1))
        return y

    def forward(self, x,beta,dataPoints,num_param):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        pred = self.reparameterize1(generated_x,beta,dataPoints,num_param)
        return pred, z_mu,z_sd

def calculate_loss(reconstructed1, target, mean, log_sd):

    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD 

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=300
    batch_size = 50
    hidden_dim1 = 200
    hidden_dim2 = 150
    z_dim =100
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param
    beta=1

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds = Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(50)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu,z_sd = model(x_,beta,dataPoints,num_param)
            #change dimension
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #loss 
            loss=calculate_loss(reconstructed_x1,x_,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        alpha = model.decoder(z)
    alpha = alpha.detach().cpu().numpy() # NumPy array on the CPU, as np.random.dirichlet expects
    sampled_y=np.zeros((5,dataPoints,num_param))
    #generate dirichlet distribution data
    for idx in range(5):
        for idy in range(dataPoints):
            sampled_y[idx][idy,:]=np.random.dirichlet(alpha[idx][idy,:])
    print(alpha)        
    for no, y in enumerate(sampled_y):
        df=pd.DataFrame(y,columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
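
The per-sample loop in `reparameterize1` can also be written without a loop. The sketch below is only an illustration under my own assumptions: the function name `approx_dirichlet_rsample` and the demo tensors are hypothetical, the computation is done in log space (a choice of this sketch, not of the code above), and the uniform noise is clamped away from zero to avoid `log(0)`. Mathematically it applies the same transform, $(u\,\alpha\,\Gamma(\alpha))^{1/\alpha}/\beta$, followed by a normalization over the last dimension.

```python
import torch

def approx_dirichlet_rsample(alpha: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Vectorized sketch: approximate Gamma draws, normalized over the last dimension."""
    # Uniform noise on the same device and dtype as alpha, kept strictly positive.
    u = torch.rand_like(alpha).clamp_min(torch.finfo(alpha.dtype).tiny)
    # log of (u * alpha * Gamma(alpha))^(1/alpha), computed in log space.
    log_v = (torch.log(u) + torch.log(alpha) + torch.lgamma(alpha)) / alpha
    v = torch.exp(log_v) / beta
    # Normalize so that the entries along the last dimension sum to 1.
    return v / v.sum(dim=-1, keepdim=True)

torch.manual_seed(0)
# Stand-in for the decoder output: positive parameters of shape [batch, dataPoints, num_param].
alpha_demo = torch.rand(5, 200, 3) + 0.5
y_demo = approx_dirichlet_rsample(alpha_demo)
print(y_demo.shape)
```

Because every step is an ordinary differentiable tensor operation, gradients flow from the normalized sample back to `alpha`, which is the reason for sampling this way instead of calling a NumPy sampler inside the model.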

### 4.5 Appendix E

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from DirData1 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
        
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softmax(dim=2)
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = torch.tanh(self.out1(hidden2))
        # reshape to [batch_size, dataPoints, num_param] (dataPoints and num_param are globals)
        out1 = torch.reshape(out1,(x.shape[0],dataPoints,num_param))
        # softmax over the last dimension so that each triple sums to 1
        pred = self.out2(out1)
        # pred is of shape [batch_size, dataPoints, num_param]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        #torch.autograd.set_detect_anomaly(True)
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training, draw a random sample from the learned z_dim-dimensional
           normal distribution; during inference, return its mean.
           Note that z_sd holds the log-variance, so std = exp(z_sd / 2).
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def forward(self, x):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        return generated_x, z_mu,z_sd

def calculate_loss(likeli, mean, log_sd):
    RCL = -torch.sum(likeli)
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD


if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=300
    batch_size = 5
    hidden_dim1 = 39
    hidden_dim2 = 25
    z_dim =20
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds =Dirdata(dataPoints=dataPoints, samples=samples, indicate=0,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu, z_sd = model(x_)
            #change dimension for computing loss function
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #training data
            alpha=x[1].float().to(device)
            #log-likelihood without the terms that do not depend on the model output; summed later in calculate_loss
            likeli=torch.mul(torch.log(reconstructed_x+1e-7),alpha-1)
            #loss 
            loss=calculate_loss(likeli,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        sampled_y = model.decoder(z)
    
    for no, y in enumerate(sampled_y):
        #create a dataframe
        df=pd.DataFrame(y.cpu().numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        #plot
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
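
The `KLD` line in `calculate_loss` is the closed-form KL divergence between the encoder's normal distribution and a standard normal prior, with the encoder's second output read as a log-variance. As a sanity check, the sketch below compares that expression with `torch.distributions.kl_divergence`. The helper name `gaussian_kld` and the random test tensors are my own illustration, not part of the notebook.

```python
import torch
from torch.distributions import Normal, kl_divergence

def gaussian_kld(mean: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Closed-form KL( N(mean, exp(log_var)) || N(0, 1) ), summed over all entries."""
    return -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())

torch.manual_seed(0)
mean = torch.randn(4, 20)      # stand-in for z_mu
log_var = torch.randn(4, 20)   # stand-in for z_sd (a log-variance)

closed_form = gaussian_kld(mean, log_var)

# Reference value from torch.distributions, using std = exp(log_var / 2).
posterior = Normal(mean, torch.exp(log_var / 2))
prior = Normal(torch.zeros_like(mean), torch.ones_like(mean))
reference = kl_divergence(posterior, prior).sum()

print(closed_form.item(), reference.item())
```

The two numbers should agree up to floating-point error, which confirms that `z_sd` has to be interpreted as a log-variance for the loss and `reparameterize` to be consistent with each other.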

### 4.6 Appendix F

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from Dirdata import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softmax(dim=2)
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = self.out1(hidden2)
        # reshape to [batch_size, dataPoints, num_param] (dataPoints and num_param are globals)
        out1 = torch.reshape(out1,(x.shape[0],dataPoints,num_param))
        # softmax over the last dimension so that each triple sums to 1
        pred = self.out2(out1)
        # pred is of shape [batch_size, dataPoints, num_param]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training, draw a random sample from the learned z_dim-dimensional
           normal distribution; during inference, return its mean.
           Note that z_sd holds the log-variance, so std = exp(z_sd / 2).
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def forward(self, x):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        return generated_x, z_mu,z_sd

#sum of squared differences between pairs of rows of X (each unordered pair counted once)
def meansq(X):
    #as written, the loops visit the first X.shape[1] rows of X
    pair_sse=torch.zeros(X.shape[1],X.shape[1],device=X.device)
    for idx in range(X.shape[1]):
        for idy in range(X.shape[1]):
            pair_sse[idx,idy]=F.mse_loss(X[idx,:],X[idy,:], reduction='sum')
    return torch.sum(pair_sse)/2

def calculate_loss(reconstructed1,target, mean, log_sd,a,meansqsum):
    REGU = -a*meansqsum
    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD+REGU

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=200
    batch_size = 50
    hidden_dim1 = 300
    hidden_dim2 = 250
    z_dim = 150
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param
    #parameter multiplied to regularization term
    a=1000

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds =Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu, z_sd = model(x_)
            #compute regularization term
            meansqsum=0
            for j in range(batch_size):
                meansqsum=meansqsum+meansq(reconstructed_x[j])
            #change dimensionality for computing loss function
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #loss 
            loss=calculate_loss(reconstructed_x1,x_,z_mu,z_sd,a,meansqsum)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        sampled_y = model.decoder(z)
    
    for no, y in enumerate(sampled_y):
        #create a dataframe
        df=pd.DataFrame(y.cpu().numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        #plot
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
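
The regularization term rewards outputs whose points are spread apart, by summing squared differences between pairs of rows. The double loop in `meansq` can be replaced by broadcasting, as in the sketch below. Two caveats: the name `pairwise_sq_dist_sum` is hypothetical, and the sketch sums over all pairs of rows, which is my assumption about the intent. The loop in `meansq` iterates over `X.shape[1]` indices, so passing `X[:X.shape[1]]` to the sketch reproduces what that loop computes as written.

```python
import torch

def pairwise_sq_dist_sum(X: torch.Tensor) -> torch.Tensor:
    """Sum of squared Euclidean distances over all unordered pairs of rows of X."""
    # diff[i, j] = X[j] - X[i]; shape [n, n, d] via broadcasting.
    diff = X.unsqueeze(0) - X.unsqueeze(1)
    # Every unordered pair appears twice (i, j) and (j, i), hence the division by 2.
    return diff.pow(2).sum() / 2

torch.manual_seed(0)
X_demo = torch.rand(200, 3)   # stand-in for one reconstructed sample
all_pairs = pairwise_sq_dist_sum(X_demo)
first_rows_only = pairwise_sq_dist_sum(X_demo[:X_demo.shape[1]])
print(all_pairs.item(), first_rows_only.item())
```

Note that the intermediate tensor has `n * n * d` entries, so for large `dataPoints` this trades memory for speed.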

### 4.7 Appendix G

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from Dirdata import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"


class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.linear3 = nn.Linear(hidden_dim2, z_dim)
        self.softplus1 = torch.nn.Softplus(beta=1, threshold=20)

    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        alpha = self.linear3(hidden2)
        # ensure outputs are positive
        alpha = self.softplus1(alpha)
        # alpha is of shape [batch_size, z_dim]
        return alpha

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.linear3 = nn.Linear(hidden_dim1, input_dim)

    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = self.linear1(x)
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = self.linear2(hidden1)
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = self.linear3(hidden2)
        # reshape to [batch_size, dataPoints, num_param] (dataPoints and num_param are globals)
        out1 = torch.reshape(out1,(x.shape[0],dataPoints,num_param))
        # divide each triple by its sum so that its entries sum to 1
        pred = torch.zeros((x.shape[0],dataPoints,num_param), device=x.device)
        for idx in range(x.shape[0]):
            pred[idx] = torch.mul(out1[idx],1/torch.transpose(torch.sum(out1[idx],1).repeat(num_param,1),0,1))
        # pred is of shape [batch_size, dataPoints, num_param]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim,beta):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)
        self.beta = beta

    def reparameterize(self, alpha,beta):
        '''Draw an approximate Dirichlet sample with concentration parameters alpha:
           transform uniform noise into approximate Gamma draws,
           then normalize each row to sum to 1.
        '''
        # uniform noise with the same shape and device as alpha
        u = torch.rand(alpha.shape, device=alpha.device)
        # approximate Gamma draws: (u * alpha * Gamma(alpha))^(1/alpha) / beta
        v = 1/beta*((torch.mul(torch.mul(u,alpha),torch.exp(torch.lgamma(alpha)))).pow(1/alpha))
        # normalize each row so that its entries sum to 1
        v = torch.mul(v,1/torch.transpose(torch.sum(v,1).repeat(alpha.shape[1],1),0,1))
        return v

    def forward(self, x,beta):
        # encode
        alpha = self.encoder(x)
        # reparameterize
        v = self.reparameterize(alpha,beta)
        # decode
        generated_x = self.decoder(v)
        return generated_x, alpha, v

def calculate_loss(reconstructed1,target, alpha):
    RCL = F.mse_loss(reconstructed1, target, reduction='sum')
    # constant prior concentration 1 - 1/K, where K = alpha.shape[1] is the latent dimension
    K = alpha.shape[1]
    log_gamma_c = math.lgamma(1-1/K)
    digamma_c = torch.digamma(torch.tensor(1-1/K))
    KLD = (torch.sum(torch.lgamma(alpha),1)
           - K*log_gamma_c
           + torch.sum((log_gamma_c-alpha)*digamma_c,1))
    KLD = torch.sum(KLD)
    return RCL , KLD

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints = 200
    batch_size = 5
    hidden_dim1 = 150
    hidden_dim2 = 100
    z_dim = 50
    samples = 1000
    num_param = 3
    input_dim = dataPoints*num_param
    beta = 1

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim,beta)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds =Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, alpha, v = model(x_,beta)
            #change dimensionality for computing loss function
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #loss 
            rcl,kdl=calculate_loss(reconstructed_x1,x_,alpha)
            loss=rcl+kdl
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    # draw 5 latent vectors from the prior (concentration 1 - 1/z_dim) with the same approximate sampler
    u_ = torch.rand(5,z_dim,device=device)
    alpha_ = torch.full((5,z_dim),1-1/z_dim,device=device)
    v_ = 1/beta*((torch.mul(torch.mul(u_,alpha_),torch.exp(torch.lgamma(alpha_)))).pow(1/alpha_))
    v_ = torch.mul(v_,1/torch.transpose(torch.sum(v_,1).repeat(z_dim,1),0,1))
    with torch.no_grad():
        sampled_y = model.decoder(v_)

    for no, y in enumerate(sampled_y):
        #create a dataframe
        df=pd.DataFrame(y.cpu().numpy(),columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        #plot
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
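
As a reference point for the `KLD` term in `calculate_loss`, the sketch below computes the textbook closed-form KL divergence between two Gamma distributions with rate 1, $\sum_k \left[\log\Gamma(\alpha_k) - \log\Gamma(\hat\alpha_k) + (\hat\alpha_k - \alpha_k)\,\psi(\hat\alpha_k)\right]$, where $\hat\alpha$ is the encoder output and $\alpha$ the prior. It then checks the result against `torch.distributions.kl_divergence`. This is not the expression used in the notebook. The helper name `gamma_kld` and the random inputs are hypothetical, and I offer it only so the two can be compared numerically.

```python
import torch
from torch.distributions import Gamma, kl_divergence

def gamma_kld(alpha_post: torch.Tensor, alpha_prior: torch.Tensor) -> torch.Tensor:
    """KL( Gamma(alpha_post, 1) || Gamma(alpha_prior, 1) ), summed over all entries."""
    kl = (torch.lgamma(alpha_prior) - torch.lgamma(alpha_post)
          + (alpha_post - alpha_prior) * torch.digamma(alpha_post))
    return kl.sum()

torch.manual_seed(0)
K = 50                                            # latent dimension, as z_dim above
alpha_post = torch.rand(5, K) + 0.5               # stand-in for the encoder output
alpha_prior = torch.full_like(alpha_post, 1 - 1 / K)

manual = gamma_kld(alpha_post, alpha_prior)

# Reference value from torch.distributions with both rates fixed to 1.
ones = torch.ones_like(alpha_post)
reference = kl_divergence(Gamma(alpha_post, ones), Gamma(alpha_prior, ones)).sum()

print(manual.item(), reference.item())
```

A KL divergence is never negative, so a quick diagnostic for any hand-written version is to evaluate it on a few random `alpha` tensors and confirm that the result stays at or above zero.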

### 4.8 Appendix H

In [None]:
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from DirData1 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.mu = nn.Linear(hidden_dim2, z_dim)
        self.sd = nn.Linear(hidden_dim2, z_dim)
        
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        z_mu = self.mu(hidden2)
        # z_mu is of shape [batch_size, z_dim]
        z_sd = self.sd(hidden2)
        # z_sd is of shape [batch_size, z_dim]
        return z_mu, z_sd

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softplus(beta=1,threshold=20)
        
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = torch.tanh(self.out1(hidden2))
        # softplus makes the outputs positive
        out2 = self.out2(out1)
        # reshape to [batch_size, dataPoints, num_param] (dataPoints and num_param are globals)
        out2 = torch.reshape(out2,(x.shape[0],dataPoints,num_param))
        return out2

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim):
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)

    def reparameterize(self, z_mu, z_sd):
        '''During training, draw a random sample from the learned z_dim-dimensional
           normal distribution; during inference, return its mean.
           Note that z_sd holds the log-variance, so std = exp(z_sd / 2).
        '''
        if self.training:
            # sample from the distribution having latent parameters z_mu, z_sd
            # reparameterize
            std = torch.exp(z_sd / 2)
            eps = torch.randn_like(std)
            return (eps.mul(std).add_(z_mu))
        else:
            return z_mu

    def reparameterize1(self,x,beta,dataPoints,num_param):
        '''Draw an approximate Dirichlet sample whose concentration parameters
           are the decoder outputs x: transform uniform noise into approximate
           Gamma draws, then normalize each row to sum to 1.
        '''
        # uniform noise, created on the same device as x
        u = torch.rand((x.shape[0],dataPoints,num_param), device=x.device)
        x = x.reshape((x.shape[0],dataPoints,num_param))
        y = torch.zeros((x.shape[0],dataPoints,num_param), device=x.device)
        for idx in range(x.shape[0]):
            # approximate Gamma draws: (u * x * Gamma(x))^(1/x) / beta
            v = 1/beta*((torch.mul(torch.mul(u[idx],x[idx]),torch.exp(torch.lgamma(x[idx])))).pow(1/x[idx]))
            # normalize each row so that its entries sum to 1
            y[idx]=torch.mul(v,1/torch.transpose(torch.sum(v,1).repeat(x[idx].shape[1],1),0,1))
        return y

    def forward(self, x,beta,dataPoints,num_param):
        # encode
        z_mu, z_sd = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(z_mu, z_sd)
        # decode
        generated_x = self.decoder(x_sample)
        pred = self.reparameterize1(generated_x,beta,dataPoints,num_param)
        return pred, z_mu,z_sd

def calculate_loss(likeli1,likeli2,likeli3, mean, log_sd):

    RCL = -(torch.sum(likeli1)+torch.sum(likeli2)+torch.sum(likeli3))/(batch_size*dataPoints)
    KLD = -0.5 * torch.sum(1 + log_sd - mean.pow(2) - log_sd.exp())
    return RCL + KLD 

if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=200
    batch_size = 500
    hidden_dim1 = 100
    hidden_dim2 = 75
    z_dim =50
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param
    beta=1

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds = Dirdata(dataPoints=dataPoints, samples=samples, indicate=0,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, z_mu, z_sd = model(x_,beta,dataPoints,num_param)
            #change dimension
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #log-likelihood
            likeli1=torch.mul(torch.log(x_),reconstructed_x1-1).reshape(batch_size,dataPoints,num_param)
            likeli2=-torch.sum(torch.lgamma(reconstructed_x),2)
            likeli3=torch.lgamma(torch.sum(reconstructed_x,2))
            #loss 
            loss=calculate_loss(likeli1,likeli2,likeli3,z_mu,z_sd)
            #compute gradient
            loss.backward() 
            #if gradient is nan, change to 0
            for param in model.parameters():
                param.grad[param.grad!=param.grad]=0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    z = torch.randn(5, z_dim).to(device) # random draw
    with torch.no_grad():
        alpha = model.decoder(z)
    alpha = alpha.detach().cpu().numpy() # NumPy array on the CPU, as np.random.dirichlet expects
    sampled_y=np.zeros((5,dataPoints,num_param))
    #generate dirichlet distribution data
    for idx in range(5):
        for idy in range(dataPoints):
            sampled_y[idx][idy,:]=np.random.dirichlet(alpha[idx][idy,:])
    print(alpha)       
    for no, y in enumerate(sampled_y):
        df=pd.DataFrame(y,columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
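
The three terms `likeli1`, `likeli2` and `likeli3` in the training loop above add up to the Dirichlet log-density of the data, with the model output playing the role of the concentration parameters. Below is a minimal sketch in plain Python that writes the same three terms for a single point. The helper name `dirichlet_log_density` is hypothetical and is not part of the notebook; it is only meant as a sanity check of the formula.

```python
import math

def dirichlet_log_density(x, alpha):
    """Log-density of Dirichlet(alpha) at a point x on the simplex.

    Hypothetical helper for illustration; it mirrors the three terms
    likeli1, likeli2 and likeli3 used in the training loop.
    """
    # sum_k (alpha_k - 1) * log(x_k), corresponds to likeli1
    term1 = sum((a - 1.0) * math.log(xi) for xi, a in zip(x, alpha))
    # -sum_k log Gamma(alpha_k), corresponds to likeli2
    term2 = -sum(math.lgamma(a) for a in alpha)
    # log Gamma(sum_k alpha_k), corresponds to likeli3
    term3 = math.lgamma(sum(alpha))
    return term1 + term2 + term3

# Dirichlet(1, 1, 1) is uniform on the simplex with constant density Gamma(3) = 2,
# so the two printed numbers should agree up to floating-point error.
print(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]), math.log(2.0))
```

Because `term2` and `term3` depend only on the concentration parameters, the data enter the loss only through `term1`.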
    

### 4.9 Appendix I

########### Problem: the outputs explode to their maximum value, because that gives the maximum log-likelihood
from scipy.stats import dirichlet as diri
import numpy as np
import matplotlib.pyplot as plt
import random
import torch.nn as nn
import torch.nn.functional as F 
import torch.optim as optim
import math
from scipy.special import gamma, factorial
from tqdm import tqdm, trange
from DirData1 import Dirdata
import torch
from scipy.stats import multinomial
from torch.utils.data import Dataset, DataLoader
from scipy.stats import dirichlet
import os,sys
import pystan
import pandas as pd
import warnings
import plotly.express as px
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

class Encoder(nn.Module):
    ''' This is the encoder part of the VAE.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, z_dim):
        super().__init__()
        self.linear1 = nn.Linear(input_dim, hidden_dim1)
        self.linear2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.linear3 = nn.Linear(hidden_dim2, z_dim)
        self.softplus1 = torch.nn.Softplus(beta=1, threshold=20)
        
    def forward(self, x):
        # x is of shape [batch_size, input_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim1]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim2]
        alpha = self.linear3(hidden2)
        # ensure outputs are positive
        alpha = self.softplus1(alpha)
        # alpha is of shape [batch_size, z_dim]
        return alpha

class Decoder(nn.Module):
    ''' This is the decoder part of the VAE.
    '''
    def __init__(self,z_dim, hidden_dim1, hidden_dim2, input_dim):
        super().__init__()
        self.linear1 = nn.Linear(z_dim, hidden_dim2)
        self.linear2 = nn.Linear(hidden_dim2, hidden_dim1)
        self.out1 = nn.Linear(hidden_dim1, input_dim)
        self.out2 = nn.Softplus(beta=1,threshold=20)
        
    def forward(self, x):
        # x is of shape [batch_size, z_dim]
        hidden1 = torch.tanh(self.linear1(x))
        # hidden1 is of shape [batch_size, hidden_dim2]
        hidden2 = torch.tanh(self.linear2(hidden1))
        # hidden2 is of shape [batch_size, hidden_dim1]
        out1 = torch.tanh(self.out1(hidden2))
        # softplus keeps the outputs positive; since out1 is in (-1, 1), out2 lies in roughly (0.31, 1.31)
        out2 = self.out2(out1)
        # reshape output (dataPoints and num_param are module-level globals set in the main block)
        pred = torch.reshape(out2,(x.shape[0],dataPoints,num_param))
        # pred is of shape [batch_size, dataPoints, num_param]
        return pred

class VAE(nn.Module):
    ''' This is the VAE, which takes an encoder and a decoder.
    '''
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, latent_dim,beta):
        #torch.autograd.set_detect_anomaly(True)
        super().__init__()
        self.encoder = Encoder(input_dim, hidden_dim1, hidden_dim2, latent_dim)
        self.decoder = Decoder(latent_dim, hidden_dim1, hidden_dim2, input_dim)
        self.beta = beta

    def reparameterize(self, alpha,beta):
        '''Draw an approximate Dirichlet sample with concentration alpha.
           Each component is an approximate Gamma(alpha, beta) draw from the
           small-alpha inverse-CDF approximation (u*alpha*Gamma(alpha))^(1/alpha)/beta;
           the components are then normalized to sum to 1 along dim 1.
        '''
        # uniform noise with the same shape, dtype and device as alpha
        u = torch.rand_like(alpha)
        v = 1/beta*((torch.mul(torch.mul(u,alpha),torch.exp(torch.lgamma(alpha)))).pow(1/alpha))
        # normalize each row so that it sums to 1
        v = v/torch.sum(v,1,keepdim=True)
        return v

    def reparameterize1(self,x,beta,dataPoints,num_param):
        '''Draw an approximate Dirichlet sample for every data point, using the
           decoder outputs x as concentration parameters (same approximation
           as in reparameterize).
        '''
        x = x.reshape((x.shape[0],dataPoints,num_param))
        # uniform noise with the same shape, dtype and device as x
        u = torch.rand_like(x)
        v = 1/beta*((torch.mul(torch.mul(u,x),torch.exp(torch.lgamma(x)))).pow(1/x))
        # normalize over the last dimension so that each data point sums to 1
        y = v/torch.sum(v,2,keepdim=True)
        return y

    def forward(self, x,beta,dataPoints,num_param):
        # encode
        alpha = self.encoder(x)
        # reparameterize
        x_sample = self.reparameterize(alpha,beta)
        # decode
        generated_x = self.decoder(x_sample)
        pred = self.reparameterize1(generated_x,beta,dataPoints,num_param)
        return pred, alpha

def calculate_loss(likeli1,likeli2,likeli3, alpha):
    # negative log-likelihood
    RCL = -(torch.sum(likeli1)+torch.sum(likeli2)+torch.sum(likeli3))
    # KL-style penalty against a symmetric prior with parameter 1-1/K (K = latent dimension)
    K = alpha.shape[1]
    prior = 1-1/K
    lg_prior = math.lgamma(prior)
    dg_prior = torch.digamma(torch.tensor(prior)).item()
    # NOTE: the last term uses lgamma(prior)-alpha; the closed-form
    # KL(Gamma(prior,b) || Gamma(alpha,b)) has prior-alpha in that position, which may be worth checking.
    KLD = torch.sum(torch.lgamma(alpha),1)-K*lg_prior+torch.sum((lg_prior-alpha)*dg_prior,1)
    KLD = torch.sum(KLD)
    return RCL + KLD 


if __name__ == '__main__':
    ###### initializing data and model parameters
    dataPoints=200
    batch_size = 50
    hidden_dim1 = 100
    hidden_dim2 = 50
    z_dim =20
    samples = 1000
    num_param=3
    input_dim = dataPoints*num_param
    beta=1

    model = VAE(input_dim, hidden_dim1, hidden_dim2, z_dim,beta)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    device = 'cuda' if torch.cuda.is_available() else 'cpu' 
    model = model.to(device)
    
    ###### creating data
    ds = Dirdata(dataPoints=dataPoints, samples=samples, indicate=4,num_param=num_param)
    train_dl = DataLoader(ds, batch_size=batch_size, shuffle=True)
    
    ###### train
    t = trange(20)
    for e in t:
        model.train()
        total_loss = 0
        for i,x in enumerate(train_dl):
            #input for VAE (flattened)
            x_ = x[1].float().to(device).reshape(batch_size,1,-1)[:,0]
            #make gradient to be zero in each loop
            optimizer.zero_grad()
            #get output
            reconstructed_x, alpha = model(x_,beta,dataPoints,num_param )
            #change dimension for computing loss function
            reconstructed_x1=reconstructed_x.reshape(batch_size,1,-1)[:,0]
            #log-likelihood
            likeli1=torch.mul(torch.log(x_),reconstructed_x1-1).reshape(batch_size,dataPoints,num_param)
            likeli2=-torch.sum(torch.lgamma(reconstructed_x),2)
            likeli3=torch.lgamma(torch.sum(reconstructed_x,2))
            #loss 
            loss=calculate_loss(likeli1,likeli2,likeli3,alpha)
            #compute gradient
            loss.backward() 
            #if a gradient entry is NaN, set it to 0
            for param in model.parameters():
                if param.grad is not None:
                    param.grad[torch.isnan(param.grad)] = 0
            #add to total loss
            total_loss += loss.item()
            optimizer.step() # update the weights
        t.set_description(f'Loss is {total_loss/(samples*dataPoints):.3}')
    
    ###### Sampling 5 draws from learnt model
    model.eval() # model in eval mode
    # draw 5 latent vectors from the symmetric Dirichlet prior with parameter 1-1/z_dim
    z=torch.distributions.dirichlet.Dirichlet(torch.full((5,z_dim),1-1/z_dim)).sample()
    z=z.to(device)
    with torch.no_grad():
        alpha = model.decoder(z)
    # move to CPU so NumPy can read the values
    alpha=alpha.detach().cpu()
    sampled_y=np.zeros((5,dataPoints,num_param))
    #generate Dirichlet samples from the decoded parameters
    for idx in range(5):
        for idy in range(dataPoints):
            sampled_y[idx][idy,:]=np.random.dirichlet(alpha[idx][idy,:].numpy())
    print(alpha)
    for no, y in enumerate(sampled_y):
        df=pd.DataFrame(y,columns=['$\\theta_1$', '$\\theta_2$', '$\\theta_3$'])
        fig =px.scatter_ternary(df, a='$\\theta_1$', b='$\\theta_2$', c='$\\theta_3$',title="Dirichlet Distribution Visualization")
        fig.show()
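
The `reparameterize` method above relies on an approximation to the inverse CDF of a gamma distribution, $F^{-1}(u;\alpha,\beta)\approx\beta^{-1}(u\,\alpha\,\Gamma(\alpha))^{1/\alpha}$, and then normalizes the draws so that they lie on the simplex. The sketch below repeats the same two steps in plain Python, so that they can be inspected without the network. The helper names `approx_gamma_inverse_cdf` and `approx_dirichlet_draw` are hypothetical and not part of the notebook. As far as I understand, the approximation is intended for small $\alpha$, so it should be read as a rough illustration rather than an exact sampler.

```python
import math
import random

def approx_gamma_inverse_cdf(u, alpha, beta=1.0):
    """Approximate inverse CDF of Gamma(alpha, rate=beta) at u in (0, 1).

    Hypothetical helper; same expression as in VAE.reparameterize:
    (u * alpha * Gamma(alpha)) ** (1 / alpha) / beta.
    """
    return (u * alpha * math.exp(math.lgamma(alpha))) ** (1.0 / alpha) / beta

def approx_dirichlet_draw(alphas, rng=random):
    """Normalize approximate gamma draws to obtain a point on the simplex."""
    v = [approx_gamma_inverse_cdf(rng.random(), a) for a in alphas]
    total = sum(v)
    return [vi / total for vi in v]

# One draw with the prior parameter used for z_dim = 20, i.e. 1 - 1/20 = 0.95
# (only three components here to keep the output short).
rng = random.Random(0)
draw = approx_dirichlet_draw([0.95, 0.95, 0.95], rng)
print(draw, sum(draw))
```

Since the uniform noise `u` does not depend on the parameters, the draw is a differentiable function of `alpha`, which is what lets the gradient flow through the sampling step.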