### Extensions to a multimodal VAE

Another way to extend this model is to let the latent $z$ "cause" both the emotion ratings and the facial expression. This is an example of a Multimodal VAE (Wu & Goodman, 2018).

There is a nice theoretical motivation for this model too. Throughout, we've assumed that the space of emotions is exactly what we measured (e.g., some value of happiness, some value of sadness). But the latent space may be structured along dimensions other than these discrete emotion categories, such as "good" vs. "bad". In emotion theory, this undifferentiated space is called affect, and it is often low-dimensional (2 to 3 dimensions capture most of the variance in empirical data).

We could thus posit a latent *affect* variable. We would still want a separate $z$ that captures non-emotional aspects of the face; learning to disentangle such latent variables is an active area of research (Narayanaswamy et al., 2017).

And finally, we can add the "outcome to appraisal to affect" part back into this multimodal model.

<div style="width: 300px; margin: auto; ">![Graphical Model](images/graphicalModel_MVAE.png)
</div>

In [None]:
#from __future__ import division, print_function, absolute_import
from __future__ import print_function

%matplotlib inline

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader

import pyro
import pyro.distributions as dist
from pyro.distributions import Normal
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam


from torchvision import transforms, utils, datasets
from torchvision.transforms import ToPILImage
from skimage import io, transform
from scipy.special import expit
from PIL import Image
from matplotlib.pyplot import imshow

from pyro.contrib.examples.util import print_and_log, set_seed
import pyro.poutine as poutine
# custom helperCode for this tutorial, in helperCode.py
import helperCode
from utils.custom_mlp import MLP, Exp


from visdom import Visdom

#from utils.vae_plots import plot_llk, plot_vae_samples
from utils.mnist_cached import  mkdir_p, setup_data_loaders
from utils.vae_plots import plot_conditional_samples_ssvae, plot_vae_samples

EMBED_DIM = 50
IMG_WIDTH = 64
IMG_SIZE = IMG_WIDTH*IMG_WIDTH*3
BATCH_SIZE = 32
DEFAULT_HIDDEN_DIMS = [200,200] #[500, 500]
DEFAULT_Z_DIM = 25#50#2

# FACE_VAR_NAMES = ['facePath']
OUTCOME_VAR_NAMES = ['payoff1', 'payoff2', 'payoff3', 
                     'prob1', 'prob2', 'prob3', 
                     'win', 'winProb', 'angleProp']
EMOTION_VAR_NAMES = ['happy', 'sad', 'anger', 'surprise', 
                     'disgust', 'fear', 'content', 'disapp']

OUTCOME_VAR_DIM = len(OUTCOME_VAR_NAMES)
EMOTION_VAR_DIM = len(EMOTION_VAR_NAMES)

OUTCOME_VAR_DIM_COLLAPSE = len(OUTCOME_VAR_NAMES) - 2 + 3
EMOTION_VAR_DIM_COLLAPSE = len(EMOTION_VAR_NAMES) * 9

def swish(x):
    # torch.sigmoid replaces the deprecated F.sigmoid
    return x * torch.sigmoid(x)

#### Word embeddings

First we define some helper functions for comparing word similarity.

In [None]:
def normalize(v):
    norm = np.sqrt(v.dot(v))
    return v / norm

def cosine_sim_np(a, b):
    if normalize_embeddings:
        return np.dot(a, b)
    else:
        return np.dot(normalize(a), normalize(b))

def cosine_sim_torch(a, b):
    a_norm = a / a.norm()
    b_norm = b / b.norm()    
    return torch.dot(a_norm, b_norm)
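
As a quick illustration of what these helpers measure, here is a minimal, self-contained sketch. The 3-dimensional vectors and the function name `cosine_similarity` are made up for illustration and are not part of the tutorial code. Cosine similarity depends only on the direction of the vectors, not on their length.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors, made up for illustration (not real GloVe embeddings)
v = np.array([1.0, 2.0, 3.0])
same_direction = 10.0 * v                 # rescaling does not change the angle
orthogonal = np.array([3.0, 0.0, -1.0])   # dot product with v is 0

sim_same = cosine_similarity(v, same_direction)   # close to 1
sim_orth = cosine_similarity(v, orthogonal)       # close to 0
sim_opposite = cosine_similarity(v, -v)           # close to -1
print(sim_same, sim_orth, sim_opposite)
```

This is why pre-normalizing the embeddings (the `normalize_embeddings` flag) lets the similarity reduce to a plain dot product.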

Now we will load the GloVe word embeddings. (Warning: may take up to a minute...)

In [None]:
embed_path = os.path.join(os.path.abspath('..'), "glove", "glove.6B.50d.txt")

# Whether or not to normalize the word vectors
normalize_embeddings = False
    
def load_glove_embeddings(path):
    print("Loading GloVe embeddings")
    with open(path,'r') as f:
        model = {}
        for line in f:
            split_line = line.split()
            word = split_line[0]
            embedding = np.array([float(val) for val in split_line[1:]], dtype=np.float32)
            if normalize_embeddings:
                embedding = normalize(embedding)
            model[word] = embedding
        print("Done.",len(model)," words loaded!")
        return model

embeddings = load_glove_embeddings(embed_path)

#### Dataset

Here we define a class to load multimodal data (word embeddings, faces, emotion ratings, and outcomes), allowing for missing values (which are set to 0).

In [None]:
class MultimodalDataset(Dataset):
    """A multimodal experimental dataset."""
    
    def __init__(self, csv_file, embeddings=None, img_dir=None, transform=None):
        """
        Args:
            csv_file (string): Path to the experiment csv file 
            embeddings (dict, optional): Maps each utterance to its word embedding.
            img_dir (string, optional): Directory with all the images.
            transform (callable, optional): Optional transform to be applied
                to each image.
        """
        self.expdata = pd.read_csv(csv_file)
        self.embeddings = embeddings
        self.img_dir = img_dir
        self.transform = transform
        
        self.has_utterances = False
        self.has_faces = False
        self.has_emotions = False
        self.has_outcomes = False
        
        # Check if dataset has utterances
        if "utterance" in self.expdata.columns and embeddings is not None:
            self.has_utterances = True
        # Check if dataset has face images
        if "facePath" in self.expdata.columns and img_dir is not None:
            self.has_faces = True
        # Check if dataset has emotion ratings
        if set(EMOTION_VAR_NAMES).issubset(self.expdata.columns):
            self.has_emotions = True
            self.normalize_emotions()
        # Check if dataset has outcomes
        if set(OUTCOME_VAR_NAMES).issubset(self.expdata.columns):
            self.has_outcomes = True
            self.normalize_outcomes()

    def __len__(self):
        return len(self.expdata)

    def __getitem__(self, idx):

        if self.has_emotions:
            emotions = np.array(self.expdata.iloc[idx]["happy":"disapp"], np.float32)
        else:
            emotions = 0

        if self.has_outcomes:
            outcomes = np.array(self.expdata.iloc[idx]["payoff1":"angleProp"], np.float32)
        else:
            outcomes = 0

        if self.has_utterances:
            word = self.expdata.iloc[idx]["utterance"]
            embed = self.embeddings[word]
        else:
            word = ""
            embed = 0
        
        if self.has_faces:
            img_name = os.path.join(self.img_dir, self.expdata.iloc[idx]["facePath"] + ".png")
            try:
                image = Image.open(img_name).convert('RGB')
                if self.transform:
                    image = self.transform(image)
            except:
                print(img_name)
                raise
        else:
            image = 0
            
        return word, embed, image, emotions, outcomes
    
    def normalize_outcomes(self):
        """Normalizes outcome data.
        
        payoff1, payoff2, payoff3 and win are between 0 and 100
        need to normalize to [0,1] to match the rest of the variables,
        by dividing payoff1, payoff2, payoff3 and win by 100.
        """        
        self.expdata.loc[:,"payoff1"] = self.expdata.loc[:,"payoff1"]/100
        self.expdata.loc[:,"payoff2"] = self.expdata.loc[:,"payoff2"]/100
        self.expdata.loc[:,"payoff3"] = self.expdata.loc[:,"payoff3"]/100
        self.expdata.loc[:,"win"]     = self.expdata.loc[:,"win"]/100
    
    def normalize_emotions(self):
        """Normalize emotion ratings.
        
        Emotions were rated on a 1-9 Likert scale.
        use emo <- (emo-1)/8 to transform to within [0,1]
        """
        self.expdata.loc[:,"happy":"disapp"] = (self.expdata.loc[:,"happy":"disapp"]-1)/8
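
To see why a `0` placeholder is a convenient marker for a missing modality, here is a minimal sketch using a hypothetical `ToyDataset` (a stand-in, not the tutorial's `MultimodalDataset`). It assumes PyTorch's default collate function, which stacks scalar placeholders into a 1-D tensor while observed modalities keep their feature dimension.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in: ratings are observed, images are missing."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        ratings = torch.full((3,), float(idx))  # an observed 3-dimensional modality
        image = 0                               # placeholder for a missing modality
        return image, ratings

loader = DataLoader(ToyDataset(), batch_size=4, shuffle=False)
images, ratings = next(iter(loader))

# Scalar placeholders are stacked into a 1-D tensor (one entry per item),
# whereas the observed modality has a batch and a feature dimension.
print(images.shape)
print(ratings.shape)

# A modality can therefore be treated as missing when its batch is 1-D
images = None if images.dim() == 1 else images
```

The training loop later in this section relies on the same idea when it checks `len(x.shape) == 1` before calling `svi.step`.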
    



We load and store the face/outcome/emotion data in `face_outcome_emotion_dataset`. There are N=1,587 observations, and each observation consists of:

- an accompanying face image
- a 9-dimensional outcome vector that parameterizes the gamble that agents played, and
- an 8-dimensional emotion rating vector

In [None]:
img_transform = transforms.Compose([
    # Note that we downsample to 64 x 64 here, because we wanted a nice power of 2 
    #(and DCGAN architecture assumes input image of 64x64) 
    transforms.Resize(64),
    transforms.CenterCrop(64),
    transforms.ToTensor()
    ])

# data location
faces_path = os.path.join(os.path.abspath('..'), "CognitionData", "faces")
face_outcome_emotion_path = os.path.join(os.path.abspath('..'), "CognitionData", "data_faceWheel.csv")

# reads in datafile.
print("Reading in dataset...")

face_outcome_emotion_dataset = MultimodalDataset(csv_file=face_outcome_emotion_path, 
                                                 img_dir=faces_path, 
                                                 transform=img_transform)
face_outcome_emotion_loader = torch.utils.data.DataLoader(face_outcome_emotion_dataset,
                                                          batch_size=BATCH_SIZE, shuffle=True,
                                                          num_workers=4)

N_samples = len(face_outcome_emotion_dataset)
print("Number of observations:", N_samples)

# taking a sample observation
word, embed, img, emo, out = face_outcome_emotion_dataset[np.random.randint(0, N_samples)]
print("Sample Observation: ")
print("Ratings:")
row_fmt ="{:<8} " * len(emo)
print(row_fmt.format(*helperCode.EMOTION_VAR_NAMES))
row_fmt ="{:<8.3f} " * len(emo)
print(row_fmt.format(*emo))
print("Outcomes:")
row_fmt ="{:<8} " * len(out)
print(row_fmt.format(*helperCode.OUTCOME_VAR_NAMES))
row_fmt ="{:<8.3f} " * len(out)
print(row_fmt.format(*out))
Image.fromarray(helperCode.TensorToPILImage(img*255.))

Now we shall also load the dataset of utterances, outcomes and emotions.

In [None]:
# Data location
word_outcome_emotion_path = os.path.join(os.path.abspath(".."), "CognitionData", "dataSecondExpt_utteranceWheel.csv")
expdata = pd.read_csv(word_outcome_emotion_path)

# Print utterances
utterances = list(sorted(pd.unique(expdata.loc[:]["utterance"])))
print(utterances)

In [None]:
# Read in datafile.
print("Reading in dataset...")

word_outcome_emotion_dataset = MultimodalDataset(csv_file=word_outcome_emotion_path, 
                                                 embeddings=embeddings)
word_outcome_emotion_loader = torch.utils.data.DataLoader(word_outcome_emotion_dataset,
                                                          batch_size=BATCH_SIZE, shuffle=True,
                                                          num_workers=4)

N_samples = len(word_outcome_emotion_dataset)
print("Number of observations:", N_samples)

# Taking a sample observation
word, embed, img, emo, out = word_outcome_emotion_dataset[np.random.randint(0, N_samples)]
print("Sample Observation: ")
print("Utterance:", word)
print("Embedding:")
print(embed)
print("Ratings:")
row_fmt ="{:<8} " * len(emo)
print(row_fmt.format(*helperCode.EMOTION_VAR_NAMES))
row_fmt ="{:<8.3f} " * len(emo)
print(row_fmt.format(*emo))
print("Outcomes:")
row_fmt ="{:<8} " * len(out)
print(row_fmt.format(*helperCode.OUTCOME_VAR_NAMES))
row_fmt ="{:<8.3f} " * len(out)
print(row_fmt.format(*out))

#### Encoders and Decoders

Here we define the neural network encoders and decoders.

First, some helper modules. The product of experts combines multiple independent Gaussians into a single Gaussian whose mean is the precision-weighted average of the experts' means (Wu & Goodman, 2018). The Swish module is an activation function similar to ReLU that has been reported to perform better in some deep networks (https://arxiv.org/abs/1710.05941).

In [None]:
class ProductOfExperts(nn.Module):
    """
    Return parameters for product of independent experts.
    See https://arxiv.org/pdf/1410.7827.pdf for equations.

    @param loc: M x D for M experts
    @param scale: M x D for M experts
    """
    def forward(self, loc, scale, eps=1e-8):
        scale = scale + eps # numerical constant for stability
        # precision of i-th Gaussian expert (T = 1/sigma^2);
        # scale is a standard deviation, as expected by dist.Normal
        T = 1. / (scale ** 2)
        product_loc = torch.sum(loc * T, dim=0) / torch.sum(T, dim=0)
        # convert the combined variance back to a standard deviation
        product_scale = torch.sqrt(1. / torch.sum(T, dim=0))
        return product_loc, product_scale
    
class Swish(nn.Module):
    """https://arxiv.org/abs/1710.05941"""
    def forward(self, x):
        # torch.sigmoid replaces the deprecated F.sigmoid
        return x * torch.sigmoid(x)
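
To make the product-of-experts idea concrete, here is a minimal NumPy sketch of the standard formula for a product of Gaussian densities. It assumes each expert is given as a mean and a standard deviation (the convention of `dist.Normal`); the function name `product_of_gaussians` is illustrative only. Each expert contributes a precision $T_i = 1/\sigma_i^2$, the combined variance is $1/\sum_i T_i$, and the combined mean is $\sum_i \mu_i T_i / \sum_i T_i$.

```python
import numpy as np

def product_of_gaussians(locs, scales):
    """Combine independent Gaussian experts given as means and standard deviations."""
    locs = np.asarray(locs, dtype=float)
    scales = np.asarray(scales, dtype=float)
    precisions = 1.0 / scales ** 2                 # T_i = 1 / sigma_i^2
    combined_var = 1.0 / precisions.sum(axis=0)    # 1 / sum_i T_i
    combined_loc = (locs * precisions).sum(axis=0) * combined_var
    return combined_loc, np.sqrt(combined_var)

# A prior expert N(0, 1) combined with a more confident expert N(2, 0.5)
loc, scale = product_of_gaussians([0.0, 2.0], [1.0, 0.5])
print(loc, scale)  # loc = 1.6, scale = sqrt(0.2), roughly 0.447
```

The combined mean is pulled toward the more confident expert, and the combined scale is smaller than either expert's scale, which is how adding an observed modality sharpens the posterior over $z$.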

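For the images, the encoder and decoder follow the DCGAN architecture (https://arxiv.org/pdf/1511.06434.pdf): strided convolutions in the encoder and transposed convolutions in the decoder. Only the architecture is borrowed; there is no adversarial training here.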

In [None]:
class ImageEncoder(nn.Module):
    """
    define the PyTorch module that parametrizes q(z|image).
    This goes from images to the latent z
    
    This is the standard DCGAN architecture.

    @param z_dim: integer
                  size of the tensor representing the latent random variable z
    """
    def __init__(self, z_dim):
        super(ImageEncoder, self).__init__()
        #torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, 
        #                padding=0, dilation=1, groups=1, bias=True)
        # H_out = floor( (H_in + 2*padding - dilation(kernel_size-1) -1) / stride    +1)
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1, bias=False),
            Swish(),
            nn.Conv2d(32, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            Swish(),
            nn.Conv2d(64, 128, 4, 2, 1, bias=False),
            nn.BatchNorm2d(128),
            Swish(),
            nn.Conv2d(128, 256, 4, 1, 0, bias=False),
            nn.BatchNorm2d(256),
            Swish())
        # Here, we define two layers, one to give z_loc and one to give z_scale
        self.z_loc_layer = nn.Sequential(
            nn.Linear(256 * 5 * 5, 512), # it's 256 * 5 * 5 if input is 64x64.
            #nn.Linear(256 * 9 * 9, 512), # it's 256 * 9 * 9 if input is 100x100.
            Swish(),
            nn.Dropout(p=0.1),
            nn.Linear(512, z_dim))
        self.z_scale_layer = nn.Sequential(
            nn.Linear(256 * 5 * 5, 512), # it's 256 * 5 * 5 if input is 64x64.
            #nn.Linear(256 * 9 * 9, 512), # it's 256 * 9 * 9 if input is 100x100.
            Swish(),
            nn.Dropout(p=0.1),
            nn.Linear(512, z_dim))
        self.z_dim = z_dim

    def forward(self, image):
        hidden = self.features(image)
        hidden = hidden.view(-1, 256 * 5 * 5) # it's 256 * 5 * 5 if input is 64x64.
        #image = image.view(-1, 256 * 9 * 9) # it's 256 * 9 * 9 if input is 100x100.
        z_loc = self.z_loc_layer(hidden)
        z_scale = torch.exp(self.z_scale_layer(hidden)) #add exp so it's always positive
        return z_loc, z_scale
    
class ImageDecoder(nn.Module):
    """
    define the PyTorch module that parametrizes p(image|z).
    This goes from the latent z to the images
    
    This is the standard DCGAN architecture.

    @param z_dim: integer
                  size of the tensor representing the latent random variable z
    """
    def __init__(self, z_dim):
        super(ImageDecoder, self).__init__()
        self.upsample = nn.Sequential(
            nn.Linear(z_dim, 256 * 5 * 5),  # it's 256 * 5 * 5 if input is 64x64.
            #nn.Linear(z_dim, 256 * 9 * 9),  # it's 256 * 9 * 9 if input is 100x100.
            Swish())
        self.hallucinate = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 1, 0, bias=False),
            nn.BatchNorm2d(128),
            Swish(),
            nn.ConvTranspose2d(128, 64, 4, 2, 1, bias=False),
            nn.BatchNorm2d(64),
            Swish(),
            nn.ConvTranspose2d(64, 32, 4, 2, 1, bias=False),
            nn.BatchNorm2d(32),
            Swish(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1, bias=False))

    def forward(self, z):
        # the input will be a vector of size |z_dim|
        z = self.upsample(z)
        z = z.view(-1, 256, 5, 5) # it's 256 * 5 * 5 if input is 64x64.
        #z = z.view(-1, 256, 9, 9) # it's 256 * 9 * 9 if input is 100x100.
        # but if 100x100, the output image size is 96x96
        image = self.hallucinate(z) # this is the image
        return image  # NOTE: no sigmoid here. See train.py
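
The `256 * 5 * 5` size used in the linear layers follows from the convolution output-size formula quoted in the encoder's comments. The sketch below checks that arithmetic in plain Python, assuming dilation 1 and no output padding; the helper names `conv_out` and `deconv_out` are illustrative.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of nn.Conv2d along one spatial dimension (dilation = 1)."""
    return (size + 2 * padding - (kernel - 1) - 1) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Output size of nn.ConvTranspose2d (dilation = 1, output_padding = 0)."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride, padding) for the four encoder convolutions
encoder_layers = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 0)]
size = 64
encoder_sizes = [size]
for k, s, p in encoder_layers:
    size = conv_out(size, k, s, p)
    encoder_sizes.append(size)
print(encoder_sizes)  # [64, 32, 16, 8, 5]

# (kernel, stride, padding) for the four decoder transposed convolutions
decoder_layers = [(4, 1, 0), (4, 2, 1), (4, 2, 1), (4, 2, 1)]
size = 5
decoder_sizes = [size]
for k, s, p in decoder_layers:
    size = deconv_out(size, k, s, p)
    decoder_sizes.append(size)
print(decoder_sizes)  # [5, 8, 16, 32, 64]
```

The decoder mirrors the encoder, so a 64x64 input is reduced to a 5x5 feature map with 256 channels and then expanded back to 64x64.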

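For the other modalities, we use a common fully connected network structure with two hidden layers for both the encoder and decoder. Each network has two output heads: a location (mean) and a scale, where the scale is passed through an exponential so that it is always positive.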

In [None]:
class Encoder(nn.Module):
    """
    define the PyTorch module that parametrizes q(z|input).
    This goes from inputs to the latent z

    @param z_dim: integer
                  size of the tensor representing the latent random variable z
    """
    def __init__(self, z_dim, input_dim, hidden_dim=512):
        super(Encoder, self).__init__()
        self.net = nn.Linear(input_dim, hidden_dim)
        self.z_loc_layer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            Swish(),
            nn.Linear(hidden_dim, z_dim))
        self.z_scale_layer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            Swish(),
            nn.Linear(hidden_dim, z_dim))
        self.z_dim = z_dim

    def forward(self, input):
        hidden = self.net(input)
        z_loc = self.z_loc_layer(hidden)
        z_scale = torch.exp(self.z_scale_layer(hidden))
        return z_loc, z_scale


class Decoder(nn.Module):
    """
    define the PyTorch module that parametrizes p(output|z).
    This goes from the latent z to the output

    @param z_dim: integer
                  size of the tensor representing the latent random variable z
    """
    def __init__(self, z_dim, output_dim, hidden_dim=512):
        super(Decoder, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hidden_dim),
            Swish())
        self.output_loc_layer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            Swish(),
            nn.Linear(hidden_dim, output_dim))
        self.output_scale_layer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            Swish(),
            nn.Linear(hidden_dim, output_dim))

    def forward(self, z):
        hidden = self.net(z)
        output_loc = self.output_loc_layer(hidden)
        output_scale = torch.exp(self.output_scale_layer(hidden))
        return output_loc, output_scale  # NOTE: no softmax here. See train.py


#### Multimodal VAE

Now we define the multimodal VAE itself.

In [None]:
class MVAE(nn.Module):
    """
    This class encapsulates the parameters (neural networks), models & guides needed to train a
    multimodal variational auto-encoder.
    Modified from https://github.com/mhw32/multimodal-vae-public
    Multimodal Variational Autoencoder.

    @param z_dim: integer
                  size of the tensor representing the latent random variable z
                  
    Currently all the neural network dimensions are hard-coded; 
    in a future version will make them be inputs into the constructor
    """
    def __init__(self, z_dim, use_cuda=False):
        super(MVAE, self).__init__()
        self.z_dim = z_dim
        self.image_encoder = ImageEncoder(z_dim)
        self.image_decoder = ImageDecoder(z_dim)
        self.word_encoder = Encoder(z_dim, EMBED_DIM)
        self.word_decoder = Decoder(z_dim, EMBED_DIM)
        self.rating_encoder = Encoder(z_dim, EMOTION_VAR_DIM)
        self.rating_decoder = Decoder(z_dim, EMOTION_VAR_DIM)
        self.outcome_encoder = Encoder(z_dim, OUTCOME_VAR_DIM)
        self.outcome_decoder = Decoder(z_dim, OUTCOME_VAR_DIM)
        self.experts = ProductOfExperts()
        self.use_cuda = use_cuda
        
        # using GPUs for faster training of the networks
        if self.use_cuda:
            self.cuda()

    def reparametrize(self, loc, scale):
        if self.training:
            # reparametrization trick: z = loc + scale * eps, with eps ~ N(0, I)
            # (the encoders and the product of experts return a scale,
            # not a log-variance)
            eps = torch.randn_like(scale)
            return loc + eps * scale
        else:  # return mean during inference
            return loc

    def forward(self, word=None, image=None, rating=None, outcome=None):
        loc, scale = self.infer(word, image, rating, outcome)
        # reparametrization trick to sample
        z = self.reparametrize(loc, scale)
        # reconstruct inputs based on that gaussian
        # (the word, rating and outcome decoders each return a (loc, scale) pair)
        word_recon = self.word_decoder(z)
        image_recon = self.image_decoder(z)
        rating_recon = self.rating_decoder(z)
        outcome_recon = self.outcome_decoder(z)
        return word_recon, image_recon, rating_recon, outcome_recon, loc, scale

    def infer(self, word=None, image=None, rating=None, outcome=None):
        # infer the batch size from whichever modality is present;
        # inputs are expected to have a leading batch dimension
        if word is not None:
            batch_size = word.size(0)
        elif image is not None:
            batch_size = image.size(0)
        elif rating is not None:
            batch_size = rating.size(0)
        elif outcome is not None:
            batch_size = outcome.size(0)
        else:
            raise ValueError("At least one modality must be provided.")

        # initialize the universal prior expert N(0, 1);
        # the leading dimension indexes the experts
        loc = torch.zeros(1, batch_size, self.z_dim)
        scale = torch.ones(1, batch_size, self.z_dim)
        if self.use_cuda:
            loc, scale = loc.cuda(), scale.cuda()

        if word is not None:
            word_loc, word_scale = self.word_encoder(word)
            loc = torch.cat((loc, word_loc.unsqueeze(0)), dim=0)
            scale = torch.cat((scale, word_scale.unsqueeze(0)), dim=0)

        if image is not None:
            image_loc, image_scale = self.image_encoder(image)
            loc = torch.cat((loc, image_loc.unsqueeze(0)), dim=0)
            scale = torch.cat((scale, image_scale.unsqueeze(0)), dim=0)

        if rating is not None:
            rating_loc, rating_scale = self.rating_encoder(rating)
            loc = torch.cat((loc, rating_loc.unsqueeze(0)), dim=0)
            scale = torch.cat((scale, rating_scale.unsqueeze(0)), dim=0)

        if outcome is not None:
            outcome_loc, outcome_scale = self.outcome_encoder(outcome)
            loc = torch.cat((loc, outcome_loc.unsqueeze(0)), dim=0)
            scale = torch.cat((scale, outcome_scale.unsqueeze(0)), dim=0)

        # product of experts to combine gaussians
        loc, scale = self.experts(loc, scale)
        return loc, scale
    
    def model(self, words=None, images=None, ratings=None, outcomes=None):
        # register this pytorch module and all of its sub-modules with pyro
        pyro.module("mvae", self)
        
        batch_size = 0
        if words is not None:
            batch_size = words.size(0)
        elif images is not None:
            batch_size = images.size(0)
        elif ratings is not None:
            batch_size = ratings.size(0)
        elif outcomes is not None:
            batch_size = outcomes.size(0)
        
        with pyro.iarange("data", batch_size):
            if outcomes is not None:
                # sample from outcome prior, compute p(z|outcome)
                outcome_prior_loc = torch.zeros(torch.Size((batch_size, OUTCOME_VAR_DIM)))
                outcome_prior_scale = torch.ones(torch.Size((batch_size, OUTCOME_VAR_DIM)))
                pyro.sample("obs_outcome", dist.Normal(outcome_prior_loc, outcome_prior_scale).independent(1),
                            obs=outcomes.reshape(-1, OUTCOME_VAR_DIM))
                
                z_loc, z_scale = self.outcome_encoder.forward(outcomes)
            else:
                # setup hyperparameters for prior p(z)
                z_loc = torch.zeros(torch.Size((batch_size, self.z_dim)))
                z_scale = torch.ones(torch.Size((batch_size, self.z_dim)))
            
            # sample from prior (value will be sampled by guide when computing the ELBO)
            z = pyro.sample("latent", dist.Normal(z_loc, z_scale).independent(1))
            # decode the latent code z

            word_loc, word_scale = self.word_decoder.forward(z)
            # score against actual words
            if words is not None:
                pyro.sample("obs_word", dist.Normal(word_loc, word_scale).independent(1), 
                            obs=words.reshape(-1, EMBED_DIM))
            
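            # the decoder returns unnormalized values (no sigmoid), so squash
            # them to (0, 1) to obtain valid Bernoulli probabilities
            img_loc = torch.sigmoid(self.image_decoder.forward(z))
            # score against actual images
            if images is not None:
                # images have three event dimensions (channel, height, width)
                pyro.sample("obs_img", dist.Bernoulli(img_loc).independent(3), 
                            obs=images.reshape(-1, 3,IMG_WIDTH,IMG_WIDTH))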
            
            rating_loc, rating_scale = self.rating_decoder.forward(z)
            # score against actual ratings
            if ratings is not None:
                pyro.sample("obs_rating", dist.Normal(rating_loc, rating_scale).independent(1), 
                            obs=ratings.reshape(-1, EMOTION_VAR_DIM))

            # return the loc so we can visualize it later
            return word_loc, img_loc, rating_loc
    
    def guide(self, words=None, images=None, ratings=None, outcomes=None):
        # register this pytorch module and all of its sub-modules with pyro
        pyro.module("mvae", self)
        
        batch_size = 0
        if words is not None:
            batch_size = words.size(0)
        elif images is not None:
            batch_size = images.size(0)
        elif ratings is not None:
            batch_size = ratings.size(0)
        elif outcomes is not None:
            batch_size = outcomes.size(0)
            
        with pyro.iarange("data", batch_size):
            # use the encoder to get the parameters used to define q(z|x)
                        
            # initialize the prior expert N(0, 1)
            # the additional leading dimension (1) indexes the experts, so that
            # each encoder's output can be concatenated along dim 0
            z_loc = torch.zeros(torch.Size((1, batch_size, self.z_dim)))
            z_scale = torch.ones(torch.Size((1, batch_size, self.z_dim)))
            if self.use_cuda:
                z_loc, z_scale = z_loc.cuda(), z_scale.cuda()
            
            # add one expert per observed modality
            if outcomes is not None:
                outcome_z_loc, outcome_z_scale = self.outcome_encoder.forward(outcomes)
                z_loc = torch.cat((z_loc, outcome_z_loc.unsqueeze(0)), dim=0)
                z_scale = torch.cat((z_scale, outcome_z_scale.unsqueeze(0)), dim=0)

            if words is not None:
                word_z_loc, word_z_scale = self.word_encoder.forward(words)
                z_loc = torch.cat((z_loc, word_z_loc.unsqueeze(0)), dim=0)
                z_scale = torch.cat((z_scale, word_z_scale.unsqueeze(0)), dim=0)                
                
            if images is not None:
                image_z_loc, image_z_scale = self.image_encoder.forward(images)
                z_loc = torch.cat((z_loc, image_z_loc.unsqueeze(0)), dim=0)
                z_scale = torch.cat((z_scale, image_z_scale.unsqueeze(0)), dim=0)
            
            if ratings is not None:
                rating_z_loc, rating_z_scale = self.rating_encoder.forward(ratings)
                z_loc = torch.cat((z_loc, rating_z_loc.unsqueeze(0)), dim=0)
                z_scale = torch.cat((z_scale, rating_z_scale.unsqueeze(0)), dim=0)
            
            z_loc, z_scale = self.experts(z_loc, z_scale)
            # sample the latent z
            pyro.sample("latent", dist.Normal(z_loc, z_scale).independent(1))
    

#### Training

Here we set up the training parameters.

In [None]:
pyro.clear_param_store()

class Args:
    learning_rate = 5e-5
    num_epochs = 500 #1000
    hidden_layers = DEFAULT_HIDDEN_DIMS
    z_dim = DEFAULT_Z_DIM
    seed = 10
    cuda = False
    visdom_flag = False
    #visualize = True
    #logfile = "./tmp.log"
    
args = Args()

# setup the VAE
mvae = MVAE(z_dim=args.z_dim, use_cuda=args.cuda)
#vae = VAE(z_dim=args.z_dim, use_cuda=args.cuda)


# setup the optimizer
adam_args = {"lr": args.learning_rate}
optimizer = Adam(adam_args)

# setup the inference algorithm
svi = SVI(mvae.model, mvae.guide, optimizer, loss=Trace_ELBO())
#svi = SVI(vae.model, vae.guide, optimizer, loss=Trace_ELBO())

And now we actually train the MVAE!

In [None]:
train_elbo = []
# training loop
for epoch in range(args.num_epochs):
    # initialize loss accumulator
    epoch_loss = 0.
    # do a training epoch over each mini-batch returned
    # by the data loader
    for batch_num, (_, words, faces, ratings, outcomes) in enumerate(word_outcome_emotion_loader):
        # if on GPU put mini-batch into CUDA memory
        if args.cuda:
            faces = faces.cuda()
        # do ELBO gradient and accumulate loss
        #print("Batch: ", batch_num, "out of", len(train_loader))
        #epoch_loss += svi.step(faces)
        #epoch_loss += svi.step(ratings)
        if len(words.shape) == 1:
            words = None
        if len(faces.shape) == 1:
            faces = None
        if len(ratings.shape) == 1:
            ratings = None
        if len(outcomes.shape) == 1:
            outcomes = None
        epoch_loss += svi.step(words, faces, ratings, outcomes)
        

    # report training diagnostics
    # (normalize by the size of the dataset we actually trained on)
    normalizer_train = len(word_outcome_emotion_loader.dataset)
    total_epoch_loss_train = epoch_loss / normalizer_train
    train_elbo.append(total_epoch_loss_train)
    print("[epoch %03d]  average training loss: %.4f" % (epoch, total_epoch_loss_train))

Here we can save or load the model.

In [None]:
# save model
savemodel = True
if savemodel:
    pyro.get_param_store().save('models/word_mvae_pretrained.save')

In [None]:
loadmodel = True
if loadmodel:
    pyro.get_param_store().load('models/word_mvae_pretrained.save')
    pyro.module("mvae", mvae, update_module_params=True)

Now we can evaluate the reconstructed data.

In [None]:
import heapq

# Flag whether to evaluate labelled or non-labelled examples
eval_training = False
# Training set of words
train_words = ['awesome', 'cool', 'damn', 'dang', 'man', 'meh', 'oh', 'wow', 'yay', 'yikes']
# Test set of words
test_words = ['amazing', 'nope', 'nice', 'wonderful', 'jeez', 'gah', 'shit', 'sigh', 'ugh']

if eval_training:
    # Use training set as samples
    samples = []
    df = word_outcome_emotion_dataset.expdata
    for w in train_words:
        # Lookup emotion ratings for each word
        df_ratings = df[df['utterance']==w].loc[:,"happy":"disapp"]
        # Average across all observations
        ratings = df_ratings.mean(axis=0).values
        samples.append((w, torch.from_numpy(embeddings[w]), ratings))
else:
    # Use test set as samples
    samples = []
    for w in test_words:
        samples.append((w, torch.from_numpy(embeddings[w]), 0))    

# Number of nearest neighbors to the reconstructed vector to find
k_neighbors = 0 # 4
# Candidate words for the nearest-neighbor search. The original candidate list
# is not defined in this notebook; using the train and test words is an assumption.
exclamations = train_words + test_words

# Switch to evaluation mode: use the posterior mean and disable dropout
mvae.eval()

print("Reconstruction similarity, neighbors and emotion ratings")
for word, embed, ratings in samples:
    # Reconstruct the data (the encoders expect a leading batch dimension)
    (word_recon, image_recon, rating_recon, outcome_recon, mu, logvar) =\
        mvae.forward(embed.unsqueeze(0), None, None, None)
    # The word and rating decoders return (loc, scale); keep the means
    word_recon_loc = word_recon[0].squeeze(0)
    rating_recon_loc = rating_recon[0].squeeze(0).detach().numpy()
    # Find cosine similarity
    sim = cosine_sim_torch(embed, word_recon_loc).item()

    if k_neighbors > 0:
        embed_np = word_recon_loc.detach().numpy()
        nb_words = heapq.nlargest(k_neighbors, exclamations,
                                  key=lambda x: cosine_sim_np(embed_np, embeddings[x]))
        
    # Print reconstruction similarity
    print("{:8} : {:10.4f}".format(word, sim))
    if k_neighbors > 0:
        print("neighbors: ", nb_words)
    str_row_fmt ="{:<8.8} " * len(rating_recon_loc)
    print(str_row_fmt.format(*EMOTION_VAR_NAMES))
    num_row_fmt ="{:<8.1f} " * len(rating_recon_loc)
    # Print average of observed ratings if evaluating training
    if eval_training:
        print(num_row_fmt.format(*(ratings*8+1)))     
    print(num_row_fmt.format(*(rating_recon_loc*8+1)))

In [None]:
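# taking a sample observation
# (the dataset returns word, embedding, image, emotions, outcomes)
_, _, img1, emo1, out1 = face_outcome_emotion_dataset[5]
print("Sample Observation: ")
print(helperCode.EMOTION_VAR_NAMES)
print(emo1)
print(helperCode.OUTCOME_VAR_NAMES)
print(out1)
# rescale from [0, 1] to [0, 255], as in the earlier cell
Image.fromarray(helperCode.TensorToPILImage(img1*255.))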

In [None]:
mvae.image_encoder

-----

Written by: Desmond Ong (desmond.c.ong@gmail.com), Harold Soh (hsoh@comp.nus.edu.sg), Mike Wu (wumike@stanford.edu)

References:

Pyro [VAE tutorial](http://pyro.ai/examples/vae.html)

Wu, M., & Goodman, N. D. (2018). Multimodal Generative Models for Scalable Weakly-Supervised Learning. To appear, NIPS 2018, https://arxiv.org/abs/1802.05335
Repo here: https://github.com/mhw32/multimodal-vae-public

DCGAN https://arxiv.org/pdf/1511.06434.pdf

Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. *The Journal of Machine Learning Research*, 14(1), 1303-1347.

Kingma, D. P., Mohamed, S., Rezende, D. J., & Welling, M. (2014). Semi-supervised learning with deep generative models. In *Advances in Neural Information Processing Systems*, pp. 3581-3589. https://arxiv.org/abs/1406.5298

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In *The International Conference on Learning Representations*. https://arxiv.org/abs/1312.6114


Narayanaswamy, S., Paige, T. B., van de Meent, J. W., Desmaison, A., Goodman, N. D., Kohli, P., Wood, F. & Torr, P. (2017). Learning Disentangled Representations with Semi-Supervised Deep Generative Models. In *Advances in Neural Information Processing Systems*, pp. 5927-5937. https://arxiv.org/abs/1706.00400

Data from https://github.com/desmond-ong/affCog, from the following paper:

Ong, D. C., Zaki, J., & Goodman, N. D. (2015). Affective Cognition: Exploring lay theories of emotion. *Cognition*, 143, 141-162.