# **Starter Notebook to Combine the Image Captioning Network Output with the Image Generator Network to Evaluate Image Captioning Performance**   

Final Project - Mikayla Biggs, Kevin Steele, Austin Strom    
AML 4/26/2021


### Approach: 
* Pre-trained image captioning network model
    * used as a forward loss to improve T2F performance
    * the model is essentially used as a discriminator
* V1 uses only the forward loss
    * V2 uses the forward loss + backward
        * trains T2F and the image captioning network at the same time, in case captioning bottlenecks performance in V1

### Retrieve Text2Face Repository and Data
This cell is set up to download the v0.1 and v1.0 data from the Google Drive links. The v1.0 data has not been cleaned, so v0.1 should be used for initial testing of the base project.

In [1]:
import tensorflow as tf

# You'll generate plots of attention in order to see which parts of an image
# our model focuses on during captioning
import matplotlib.pyplot as plt

import collections
import random
import numpy as np
import os
import time
import json
from PIL import Image

In [2]:
!git clone https://github.com/austin-strom/T2F.git
!gdown --id 1nD6kNAgIVjxpzIScJNLqUyRA1qEkc4Op
!gdown --id 1cwcYbl0dhXEzmdbee_K_H6jcndbsxT2o

!unzip -u -q face2text_v0.1.zip -d face2text_v0.1
!unzip -u -q face2text_v1.0.zip -d face2text_v1.0

# Move the v0.1 data to the proper data dir for testing
!mkdir T2F/data/LFW/Face2Text
!mv face2text_v0.1/ T2F/data/LFW/Face2Text/.

!wget http://vis-www.cs.umass.edu/lfw/lfw.tgz
!tar -xf lfw.tgz
!mv lfw T2F/data/LFW/.

fatal: destination path 'T2F' already exists and is not an empty directory.
Downloading...
From: https://drive.google.com/uc?id=1nD6kNAgIVjxpzIScJNLqUyRA1qEkc4Op
To: /content/face2text_v0.1.zip
100% 156k/156k [00:00<00:00, 58.2MB/s]
Downloading...
From: https://drive.google.com/uc?id=1cwcYbl0dhXEzmdbee_K_H6jcndbsxT2o
To: /content/face2text_v1.0.zip
100% 217k/217k [00:00<00:00, 64.7MB/s]
mkdir: cannot create directory ‘T2F/data/LFW/Face2Text’: File exists
mv: cannot move 'face2text_v0.1/' to 'T2F/data/LFW/Face2Text/./face2text_v0.1': Directory not empty
--2021-05-03 01:00:29--  http://vis-www.cs.umass.edu/lfw/lfw.tgz
Resolving vis-www.cs.umass.edu (vis-www.cs.umass.edu)... 128.119.244.95
Connecting to vis-www.cs.umass.edu (vis-www.cs.umass.edu)|128.119.244.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180566744 (172M) [application/x-gzip]
Saving to: ‘lfw.tgz.2’


2021-05-03 01:00:31 (81.7 MB/s) - ‘lfw.tgz.2’ saved [180566744/180566744]

mv: cannot move 'lfw

In [3]:
# !pip uninstall pro-gan-pth==1.3.3

In [4]:
# !git clone 'https://github.com/austin-strom/pro_gan_pytorch.git'

In [5]:
# Move the v0.1 data to the proper data dir for testing

!mkdir face2text_v0.1/data
# !mv face2text_v0.1/ Face2Text/.

!wget http://vis-www.cs.umass.edu/lfw/lfw.tgz
!tar -xf lfw.tgz
!mv lfw face2text_v0.1/data/.

--2021-05-03 01:00:34--  http://vis-www.cs.umass.edu/lfw/lfw.tgz
Resolving vis-www.cs.umass.edu (vis-www.cs.umass.edu)... 128.119.244.95
Connecting to vis-www.cs.umass.edu (vis-www.cs.umass.edu)|128.119.244.95|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180566744 (172M) [application/x-gzip]
Saving to: ‘lfw.tgz.3’


2021-05-03 01:00:37 (80.1 MB/s) - ‘lfw.tgz.3’ saved [180566744/180566744]



In [6]:
%load_ext tensorboard
import datetime, os

In [7]:
!rm -rf T2F/implementation/runs

In [8]:
!git clone 'https://github.com/austin-strom/pro_gan_pytorch.git'
!cd pro_gan_pytorch && git checkout revert_to_v1_3_3
!mv pro_gan_pytorch/pro_gan_pytorch T2F/implementation/pro_gan_pytorch

fatal: destination path 'pro_gan_pytorch' already exists and is not an empty directory.
D	pro_gan_pytorch/CustomLayers.py
D	pro_gan_pytorch/Losses.py
D	pro_gan_pytorch/PRO_GAN.py
D	pro_gan_pytorch/__init__.py
Already on 'revert_to_v1_3_3'
Your branch is up to date with 'origin/revert_to_v1_3_3'.
mv: cannot stat 'pro_gan_pytorch/pro_gan_pytorch': No such file or directory


In [9]:
# %cd T2F/implementation/

# %tensorboard --logdir=runs

# !mkdir training_runs
# !mkdir training_runs/generated_samples training_runs/losses training_runs/saved_models
# !python3 train_network.py --config=configs/1.conf

# %cd ../../

In [10]:
import sys
import datetime
import time
import torch as th
import numpy as np
import argparse
import yaml
import os
import pickle
import timeit
import tensorflow as tf 

from torch.backends import cudnn

# Append path to sys
sys.path.append("T2F/implementation")
sys.path.append("T2F/implementation/networks")
sys.path.append("T2F/implementation/data_processing")
sys.path.append("T2F/implementation/pro_gan_pytorch")

In [11]:
%cd T2F/implementation/

import data_processing.DataLoader as dl

# define the device for the training script
device = th.device("cuda" if th.cuda.is_available() else "cpu")

# set torch manual seed for consistent output
th.manual_seed(3)

# Start fast training mode:
cudnn.benchmark = True
logdir = "runs/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
file_writer = tf.summary.create_file_writer(logdir)

/content/T2F/implementation


In [12]:
def parse_keyword_arguments():
  # pass dictionary into main
  """
  command line arguments parser
  :return: args => parsed command line arguments
  """
  parser = argparse.ArgumentParser()
  parser.add_argument("--config", action="store", type=str, default="configs/1.conf",
    help="default configuration for the Network")
  parser.add_argument("--start_depth", action="store", type=int, default=0,
    help="Starting depth for training the network")
  parser.add_argument("--encoder_file", action="store", type=str, default=None,
    help="pretrained Encoder file (compatible with my code)")
  parser.add_argument("--ca_file", action="store", type=str, default=None,
    help="pretrained Conditioning Augmentor file (compatible with my code)")
  parser.add_argument("--generator_file", action="store", type=str, default=None,
    help="pretrained Generator file (compatible with my code)")
  parser.add_argument("--discriminator_file", action="store", type=str, default=None,
    help="pretrained Discriminator file (compatible with my code)")

  args = parser.parse_args()

  return args
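
In a notebook, `parse_args()` called without arguments reads the kernel's own `sys.argv`, which usually carries Jupyter flags that this parser does not define. The sketch below is one possible workaround, not part of T2F: it parses an explicit list and returns a plain dictionary, which is the form `main` indexes with `args['config']`. The helper name `notebook_args` is hypothetical, and only two of the options are reproduced.

```python
import argparse


def notebook_args(argv=None):
    """Parse an explicit argument list (hypothetical helper, not part of T2F)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=str, default="configs/1.conf")
    parser.add_argument("--start_depth", type=int, default=0)
    # An empty list means "use the defaults" and never reads sys.argv.
    namespace = parser.parse_args([] if argv is None else argv)
    # vars() turns the Namespace into a dict, e.g. args['config'].
    return vars(namespace)


args_defaults = notebook_args()
args_depth2 = notebook_args(["--start_depth", "2"])
```

Passing the list explicitly keeps the cell usable both in Colab and from a script.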

In [13]:
def get_config(conf_file):
  """
  parse and load the provided configuration
  :param conf_file: configuration file
  :return: conf => parsed configuration
  """
  from easydict import EasyDict as edict

  with open(conf_file, "r") as file_descriptor:
      # safe_load avoids the missing-Loader error raised by newer PyYAML versions
      data = yaml.safe_load(file_descriptor)

  # convert the data into an easyDictionary
  return edict(data)

In [14]:
def create_grid(samples, scale_factor, img_file, real_imgs=False):
  """
  utility function to create a grid of GAN samples
  :param samples: generated samples for storing
  :param scale_factor: factor for upscaling the image
  :param img_file: name of file to write
  :param real_imgs: turn off the scaling of images
  :return: None (saves a file)
  """
  from torchvision.utils import save_image
  from torch.nn.functional import interpolate

  samples = th.clamp((samples / 2) + 0.5, min=0, max=1)
  # print(samples)

  # upsample the image
  if not real_imgs and scale_factor > 1:
      samples = interpolate(samples,
                            scale_factor=scale_factor)
      
  # call new captioning method on samples for new loss calculation
  generated_captions = evaluate(samples.detach())

  # print("Image Caption Loss: ", generated_captions)


  # save the images:
  save_image(samples, img_file, nrow=int(np.sqrt(len(samples))))
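
The rescaling and grid layout in `create_grid` can be checked with plain numbers. This is a minimal sketch in pure Python, assuming generator outputs lie in roughly [-1, 1]; the helper names `to_unit_range` and `grid_columns` are hypothetical and only mirror the arithmetic of `th.clamp((samples / 2) + 0.5, min=0, max=1)` and `nrow=int(np.sqrt(len(samples)))`.

```python
import math


def to_unit_range(value):
    """Map a value from [-1, 1] to [0, 1], clamping anything outside."""
    return min(1.0, max(0.0, value / 2 + 0.5))


def grid_columns(num_samples):
    """Images per row, mirroring nrow=int(np.sqrt(len(samples)))."""
    return int(math.sqrt(num_samples))


# The last value is deliberately out of range to show the clamp.
pixels = [to_unit_range(v) for v in (-1.0, 0.0, 1.0, 1.6)]
columns = grid_columns(16)
```

With a batch of 16 fixed samples this gives a 4 x 4 grid. Batch sizes that are not perfect squares leave a partly filled last row.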

In [15]:
def create_descriptions_file(file, captions, dataset):
  """
  utility function to create a file for storing the captions
  :param file: file for storing the captions
  :param captions: encoded_captions or raw captions
  :param dataset: the dataset object for transforming captions
  :return: None (saves a file)
  """
  from functools import reduce

  # transform the captions to text:
  if isinstance(captions, th.Tensor):
      captions = list(map(lambda x: dataset.get_english_caption(x.cpu()),
                          [captions[i] for i in range(captions.shape[0])]))

      with open(file, "w") as filler:
          for caption in captions:
              filler.write(reduce(lambda x, y: x + " " + y, caption, ""))
              filler.write("\n\n")
  else:
      with open(file, "w") as filler:
          for caption in captions:
              filler.write(caption)
              filler.write("\n\n")

### Add step in loss function to compare intermediate image output result from image captioning network   

Acts as an additional metric to tune the loss and, ideally, improve generated image quality.   

**Where?**    
In the pro_gan_pytorch project by Akanimax, on the output of the image generator during training. It currently uses the WGAN-GP loss. 

**TODO:**   
No validation or testing step that captures intermediate generated images has been found yet. Investigate the training loop with Austin to see whether one can be added, so that the loss function can be changed to include the image captioning network in the pipeline.

[ProGAN Loss Functions - Akanimax T2F](https://github.com/akanimax/pro_gan_pytorch/blob/cdd9002ad171ee47c65c3670318473a76eb682e2/pro_gan_pytorch/losses.py#L35)   

[Word Importance - Akanimax T2F](https://github.com/austin-strom/T2F/blob/master/implementation/networks/InferSent/encoder/demo.ipynb)   


[Evaluate Method - Image Captioning](https://github.com/austin-strom/text2face/blob/main/TFImageCap.ipynb)
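
One way to picture the extra loss step is a weighted sum of the GAN generator loss and a caption-based penalty. The sketch below shows only that arithmetic. The function name, the `caption_weight` parameter, and its default are assumptions, not anything defined in pro_gan_pytorch. A penalty computed from sampled captions is a plain number, so it would not by itself send gradients back to the generator.

```python
def generator_loss_with_caption_term(gan_loss, caption_penalty, caption_weight=0.1):
    """Weighted sum of a GAN generator loss and a caption penalty (hypothetical)."""
    if caption_weight < 0:
        raise ValueError("caption_weight must be non-negative")
    return gan_loss + caption_weight * caption_penalty


# Illustrative numbers only; they are not taken from a training run.
total = generator_loss_with_caption_term(
    gan_loss=0.5, caption_penalty=2.0, caption_weight=0.25
)
```

Setting `caption_weight` to 0 recovers the unmodified generator loss, which makes it easy to compare runs with and without the captioning term.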

In [16]:
class BahdanauAttention(tf.keras.Model):
  def __init__(self, units):
    super(BahdanauAttention, self).__init__()
    self.W1 = tf.keras.layers.Dense(units)
    self.W2 = tf.keras.layers.Dense(units)
    self.V = tf.keras.layers.Dense(1)

  def call(self, x):
    features = x[0]
    hidden = x[1]
    # features(CNN_encoder output) shape == (batch_size, 64, embedding_dim)

    # hidden shape == (batch_size, hidden_size)
    # hidden_with_time_axis shape == (batch_size, 1, hidden_size)
    hidden_with_time_axis = tf.expand_dims(hidden, 1)

    # attention_hidden_layer shape == (batch_size, 64, units)
    attention_hidden_layer = (tf.nn.tanh(self.W1(features) +
                                         self.W2(hidden_with_time_axis)))

    # score shape == (batch_size, 64, 1)
    # This gives you an unnormalized score for each image feature.
    score = self.V(attention_hidden_layer)

    # attention_weights shape == (batch_size, 64, 1)
    attention_weights = tf.nn.softmax(score, axis=1)

    # context_vector shape after sum == (batch_size, hidden_size)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis=1)

    return context_vector, attention_weights
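
The last two steps of `BahdanauAttention.call` (softmax over the location axis, then a weighted sum of the features) can be traced by hand. This is a small pure-Python sketch of that arithmetic on nested lists, for one example rather than a batch. It does not reproduce the `W1`, `W2`, and `V` layers, and the scores are simply given as input.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(features, scores):
    """Weight each location's feature vector by its softmax score and sum."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return context, weights


# Two image locations with 2-dimensional features and equal scores.
features = [[1.0, 0.0], [0.0, 1.0]]
context, weights = context_vector(features, scores=[0.0, 0.0])
```

Equal scores give equal weights, so the context vector is the average of the two feature vectors. A larger score for one location pulls the context toward that location's features.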


In [17]:
class RNN_Decoder(tf.keras.Model):
  def __init__(self, embedding_dim, units, vocab_size):
    super(RNN_Decoder, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(self.units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
    self.fc1 = tf.keras.layers.Dense(self.units)
    self.fc2 = tf.keras.layers.Dense(vocab_size)

    self.attention = BahdanauAttention(self.units)

  def call(self, x):
    features = x[1]
    hidden = x[2]
    x = x[0]
    # defining attention as a separate model
    context_vector, attention_weights = self.attention([features, hidden])

    # x shape after passing through embedding == (batch_size, 1, embedding_dim)
    x = self.embedding(x)

    # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

    # passing the concatenated vector to the GRU
    output, state = self.gru(x)

    # shape == (batch_size, max_length, hidden_size)
    x = self.fc1(output)

    # x shape == (batch_size * max_length, hidden_size)
    x = tf.reshape(x, (-1, x.shape[2]))

    # output shape == (batch_size * max_length, vocab)
    x = self.fc2(x)

    return x, state, attention_weights

  def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))


In [18]:
class CNN_Encoder(tf.keras.Model):
    # Since you have already extracted the features and dumped it
    # This encoder passes those features through a Fully connected layer
    def __init__(self, embedding_dim):
        super(CNN_Encoder, self).__init__()
        # shape after fc == (batch_size, 64, embedding_dim)
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

In [19]:
# Feel free to change these parameters according to your system's configuration
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = 5000 + 1
# num_steps = len(img_name_train) // BATCH_SIZE
# Shape of the vector extracted from InceptionV3 is (64, 2048)
# These two variables represent that vector shape
features_shape = 2048
attention_features_shape = 64
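
The 64 in `attention_features_shape` is the number of spatial locations in the final InceptionV3 feature map, and it depends on the input size. The sketch below estimates that number for a square input, assuming the layer layout of the Keras InceptionV3 implementation, in which only the unpadded layers shrink the map. The function name is hypothetical.

```python
def inception_v3_grid_size(input_size):
    """Side length of the final InceptionV3 feature map for a square input.

    Assumes the Keras layout: only the unpadded ('valid') layers shrink the map.
    """
    def valid(size, kernel, stride=1):
        return (size - kernel) // stride + 1

    size = valid(input_size, 3, 2)  # stem conv, stride 2
    size = valid(size, 3)           # stem conv
    size = valid(size, 3, 2)        # max pool (the padded conv before it keeps the size)
    size = valid(size, 3)           # 3x3 conv after the 1x1 conv
    size = valid(size, 3, 2)        # max pool
    size = valid(size, 3, 2)        # first grid reduction
    size = valid(size, 3, 2)        # second grid reduction
    return size


locations_299 = inception_v3_grid_size(299) ** 2
locations_128 = inception_v3_grid_size(128) ** 2
```

Under these assumptions a 299 x 299 input gives the 64 locations noted above, while a 128 x 128 input, the size `load_image` resizes to in this notebook, would give 4. The shape comments that mention 64 may therefore not hold at that size.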

In [20]:
encoder = CNN_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)

encoder.built=True
decoder.built=True

In [21]:
%cd ../../
!pwd
!git clone https://github.com/austin-strom/text2face.git
!mkdir encoder
!mkdir decoder
!mv text2face/ImgCapModel/decoder* decoder/.
!mv text2face/ImgCapModel/encoder* encoder/.
!mv text2face/ImgCapModel/tokenizer.pickle .

%cd T2F/implementation/

/content
/content
Cloning into 'text2face'...
remote: Enumerating objects: 69, done.
remote: Counting objects: 100% (69/69), done.
remote: Compressing objects: 100% (58/58), done.
remote: Total 69 (delta 31), reused 21 (delta 9), pack-reused 0
Unpacking objects: 100% (69/69), done.
/content/T2F/implementation


In [22]:
saving_encoder_path = '/content/encoder/encoder'
saving_decoder_path = '/content/decoder/decoder'
saving_tokenizer_path = '/content/tokenizer.pickle'
load_model = True

In [23]:
import pickle as pkl
import pandas as pd
import numpy as np
import torch
# from TFImageCap import RNN_Decoder, CNN_Encoder
if load_model:
  encoder.load_weights(saving_encoder_path)
  decoder.load_weights(saving_decoder_path)

  # loading
  with open(saving_tokenizer_path, 'rb') as handle:
      tokenizer = pickle.load(handle)

In [24]:
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output

image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/inception_v3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5


In [25]:
# compare words generated from image captioning network and original caption to 
# see if the image generator is preserving key features of the original subject
def eval_key_words(im_cap, true_cap):
  # 1. identify important words (maybe use model to eval word importance like RF)
  # 2. calculate some difference metric like MSE
  # 3. return the metric to be used in GAN loss fxn

  return

# Should be the same as what is used in the image captioning network Kevin worked on
# input image is intermediate output from text-to-image GAN generator
def load_image(img):
  img = np.asarray(img.cpu()).transpose(1,2,0)
  img = tf.image.resize(img, (128, 128))
  img = tf.keras.applications.inception_v3.preprocess_input(img)
  return img

# get the image caption for the intermediate generated output image
# NOTE: may need more inputs based on what is needed for the captioning network
# NOTE: need to use the evaluate method from TFImageCap notebook to generate caption
def evaluate(images):
  result = list()

  for im in images:
    hidden = decoder.reset_state(batch_size=1)

    temp_input = tf.expand_dims(load_image(im), 0)
    print("temp_input: ", np.shape(temp_input))
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0],
                                              -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    current_im_result = []

    for i in range(94):
      predictions, hidden, attention_weights = decoder([dec_input,
                                                        features,
                                                        hidden])

      predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()

      if predicted_id not in tokenizer.index_word.keys():
        predicted_id = 0

      # fall back to '<unk>' if the id is missing from the tokenizer vocabulary
      word = tokenizer.index_word.get(predicted_id, '<unk>')
      current_im_result.append(word)

      # stop at the end token; the caption is appended once, after the loop
      if word == '<end>':
        break

      dec_input = tf.expand_dims([predicted_id], 0)
    result.append(current_im_result)

  return result
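
The three steps listed in `eval_key_words` could be filled in many ways. The sketch below is one hedged possibility, not the project's method: it treats every non-stop-word as a keyword and uses the Jaccard distance between the two keyword sets instead of the MSE named in step 2. The function name, the stop-word list, and the example captions are all illustrative assumptions.

```python
def caption_keyword_distance(generated_tokens, true_caption, stop_words=None):
    """Jaccard distance between the keyword sets of two captions.

    Returns 0.0 when the keyword sets match and 1.0 when they share nothing.
    """
    if stop_words is None:
        # Illustrative stop list; a real one would be chosen for the dataset.
        stop_words = {"a", "an", "the", "with", "and", "is", "has",
                      "<start>", "<end>"}
    generated = {t.lower() for t in generated_tokens} - stop_words
    reference = {t.lower() for t in true_caption.split()} - stop_words
    union = generated | reference
    if not union:
        return 0.0
    return 1.0 - len(generated & reference) / len(union)


# Token list in the shape produced by evaluate(); the caption text is made up.
distance = caption_keyword_distance(
    ["a", "man", "with", "gray", "hair", "<end>"],
    "An old man with gray hair and glasses",
)
```

Because `evaluate` samples words with `tf.random.categorical`, repeated calls on the same image can give different captions, so a metric like this would likely need averaging over several samples.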

In [26]:
def train_networks(encoder, ca, c_pro_gan, dataset, epochs,
                  encoder_optim, ca_optim, fade_in_percentage,
                  batch_sizes, start_depth, num_workers, feedback_factor,
                  log_dir, sample_dir, checkpoint_factor,
                  save_dir, use_matching_aware_dis=True):
  # required only for type checking
  from networks.TextEncoder import PretrainedEncoder

  # input assertions
  assert c_pro_gan.depth == len(batch_sizes), "batch_sizes not compatible with depth"
  assert c_pro_gan.depth == len(epochs), "epochs_sizes not compatible with depth"
  assert c_pro_gan.depth == len(fade_in_percentage), "fip_sizes not compatible with depth"

  # put all the Networks in training mode:
  ca.train()
  c_pro_gan.gen.train()
  c_pro_gan.dis.train()

  if not isinstance(encoder, PretrainedEncoder):
      encoder.train()

  print("Starting the training process ... ")

  # create fixed_input for debugging
  temp_data = dl.get_data_loader(dataset, batch_sizes[start_depth], num_workers=3)
  fixed_captions, fixed_real_images = next(iter(temp_data))

  fixed_real_images = fixed_real_images.to(device)
  fixed_captions = fixed_captions.to(device)

  fixed_embeddings = encoder(fixed_captions)

  fixed_embeddings = (fixed_embeddings).to(device)

  fixed_c_not_hats, _, _ = ca(fixed_embeddings)

  fixed_noise = th.randn(len(fixed_captions),
                          c_pro_gan.latent_size - fixed_c_not_hats.shape[-1]).to(device)

  fixed_gan_input = th.cat((fixed_c_not_hats, fixed_noise), dim=-1)

  # save the fixed_images once:
  fixed_save_dir = os.path.join(sample_dir, "__Real_Info")
  os.makedirs(fixed_save_dir, exist_ok=True)
  create_grid(fixed_real_images, None,  # scale factor is not required here
              os.path.join(fixed_save_dir, "real_samples.png"), real_imgs=True)
  create_descriptions_file(os.path.join(fixed_save_dir, "real_captions.txt"),
                            fixed_captions,
                            dataset)

  # create a global time counter
  global_time = time.time()

  # delete temp data loader:
  del temp_data

  for current_depth in range(start_depth, c_pro_gan.depth):

      print("\n\nCurrently working on Depth: ", current_depth)
      current_res = np.power(2, current_depth + 2)
      print("Current resolution: %d x %d" % (current_res, current_res))

      data = dl.get_data_loader(dataset, batch_sizes[current_depth], num_workers)

      ticker = 1

      for epoch in range(1, epochs[current_depth] + 1):
          start = timeit.default_timer()  # record time at the start of epoch

          print("\nEpoch: %d" % epoch)
          total_batches = len(data)
          fader_point = int((fade_in_percentage[current_depth] / 100)
                            * epochs[current_depth] * total_batches)

          for (i, batch) in enumerate(data, 1):
              # calculate the alpha for fading in the layers
              alpha = ticker / fader_point if ticker <= fader_point else 1

              # extract current batch of data for training
              captions, images = batch

              if encoder_optim is not None:
                  captions = captions.to(device)

              images = images.to(device)

              # perform text_work:
              embeddings = encoder(captions).to(device)
              if encoder_optim is None:
                  # detach the LSTM from backpropagation
                  embeddings = embeddings.detach()
              c_not_hats, mus, sigmas = ca(embeddings)

              z = th.randn(
                  len(captions),
                  c_pro_gan.latent_size - c_not_hats.shape[-1]
              ).to(device)

              gan_input = th.cat((c_not_hats, z), dim=-1)

              # optimize the discriminator:
              dis_loss = c_pro_gan.optimize_discriminator(gan_input, images,
                                                          embeddings.detach(),
                                                          current_depth, alpha,
                                                          use_matching_aware_dis)

              # optimize the generator:
              z = th.randn(
                  captions.shape[0] if isinstance(captions, th.Tensor) else len(captions),
                  c_pro_gan.latent_size - c_not_hats.shape[-1]
              ).to(device)

              gan_input = th.cat((c_not_hats, z), dim=-1)

              if encoder_optim is not None:
                  encoder_optim.zero_grad()

              ca_optim.zero_grad()
              gen_loss = c_pro_gan.optimize_generator(gan_input, embeddings,
                                                      current_depth, alpha)

              # once the optimize_generator is called, it also sends gradients
              # to the Conditioning Augmenter and the TextEncoder. Hence the
              # zero_grad statements prior to the optimize_generator call
              # now perform optimization on those two as well
              # obtain the loss (KL divergence from ca_optim)
              kl_loss = th.mean(0.5 * th.sum((mus ** 2) + (sigmas ** 2)
                                              - th.log((sigmas ** 2)) - 1, dim=1))
              kl_loss.backward()
              ca_optim.step()
              if encoder_optim is not None:
                  encoder_optim.step()

              # provide a loss feedback
              if i % int(total_batches / feedback_factor) == 0 or i == 1:
                  elapsed = time.time() - global_time
                  elapsed = str(datetime.timedelta(seconds=elapsed))
                  print("Elapsed [%s]  batch: %d  d_loss: %f  g_loss: %f  kl_los: %f"
                        % (elapsed, i, dis_loss, gen_loss, kl_loss.item()))

                  # also write the losses to the log file:
                  os.makedirs(log_dir, exist_ok=True)
                  log_file = os.path.join(log_dir, "loss_" + str(current_depth) + ".log")
                  with open(log_file, "a") as log:
                      log.write(str(dis_loss) + "\t" + str(gen_loss)
                                + "\t" + str(kl_loss.item()) + "\n")

                  # create a grid of samples and save it
                  gen_img_file = os.path.join(sample_dir, "gen_" + str(current_depth) +
                                              "_" + str(epoch) + "_" +
                                              str(i) + ".png")

                  create_grid(
                      samples=c_pro_gan.gen(
                          fixed_gan_input,
                          current_depth,
                          alpha
                      ),
                      scale_factor=int(np.power(2, c_pro_gan.depth - current_depth - 1)),
                      img_file=gen_img_file,
                  )

              # increment the ticker:
              ticker += 1

          stop = timeit.default_timer()
          print("Time taken for epoch: %.3f secs" % (stop - start))

          if epoch % checkpoint_factor == 0 or epoch == 0:
              # save the Model
              encoder_save_file = os.path.join(save_dir, "Encoder_" +
                                                str(current_depth) + ".pth")
              ca_save_file = os.path.join(save_dir, "Condition_Augmentor_" +
                                          str(current_depth) + ".pth")
              gen_save_file = os.path.join(save_dir, "GAN_GEN_" +
                                            str(current_depth) + ".pth")
              dis_save_file = os.path.join(save_dir, "GAN_DIS_" +
                                            str(current_depth) + ".pth")

              os.makedirs(save_dir, exist_ok=True)

              if encoder_optim is not None:
                  th.save(encoder.state_dict(), encoder_save_file, pickle)
              th.save(ca.state_dict(), ca_save_file, pickle)
              th.save(c_pro_gan.gen.state_dict(), gen_save_file, pickle)
              th.save(c_pro_gan.dis.state_dict(), dis_save_file, pickle)

  print("Training completed ...")
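
The fade-in factor `alpha` and the per-depth resolution used in `train_networks` follow simple formulas that can be checked in isolation. This is a minimal sketch that copies those two expressions into standalone functions. The function names are hypothetical and the batch count is an illustrative value, not one measured on the Face2Text data.

```python
def fade_in_alpha(ticker, fade_in_percentage, epochs, total_batches):
    """Blend factor for the newest layer after `ticker` batches at this depth."""
    fader_point = int((fade_in_percentage / 100) * epochs * total_batches)
    return ticker / fader_point if ticker <= fader_point else 1


def resolution_at(depth):
    """Image side length trained at a given depth (depth 0 is 4 x 4)."""
    return 2 ** (depth + 2)


# Illustrative values: 50% fade-in, 20 epochs, 10 batches per epoch.
alphas = [fade_in_alpha(t, 50, 20, 10) for t in (1, 50, 100, 150)]
resolutions = [resolution_at(d) for d in range(7)]
```

With these numbers `alpha` rises linearly to 1 over the first 100 batches and stays there. A depth of 7 ends at 256 x 256, which matches the `img_dims` printed in the configuration.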



In [27]:
def main(args):
  """
  Main function for the script
  :param args: all args from cmdl as a dictionary (key,val) 
  :return: None
  """

  print("Using Device:", device)
  from networks.TextEncoder import Encoder
  from networks.ConditionAugmentation import ConditionAugmentor
  from pro_gan_pytorch.PRO_GAN import ConditionalProGAN

  print(args['config'])
  config = get_config(args['config'])
  print("Current Configuration:", config)

  # create the dataset for training
  if config.use_pretrained_encoder:
      dataset = dl.RawTextFace2TextDataset(
          annots_file=config.annotations_file,
          img_dir=config.images_dir,
          img_transform=dl.get_transform(config.img_dims)
      )
      from networks.TextEncoder import PretrainedEncoder
      # create a new session object for the pretrained encoder:
      text_encoder = PretrainedEncoder(
          model_file=config.pretrained_encoder_file,
          embedding_file=config.pretrained_embedding_file,
          device=device
      )
      encoder_optim = None
  else:
      dataset = dl.Face2TextDataset(
          pro_pick_file=config.processed_text_file,
          img_dir=config.images_dir,
          img_transform=dl.get_transform(config.img_dims),
          captions_len=config.captions_length
      )
      text_encoder = Encoder(
          embedding_size=config.embedding_size,
          vocab_size=dataset.vocab_size,
          hidden_size=config.hidden_size,
          num_layers=config.num_layers,
          device=device
      )
      encoder_optim = th.optim.Adam(text_encoder.parameters(),
                                    lr=config.learning_rate,
                                    betas=(config.beta_1, config.beta_2),
                                    eps=config.eps)

  # create the networks

  if args['encoder_file'] is not None:
      # Note this should not be used with the pretrained encoder file
      print("Loading encoder from:", args['encoder_file'])
      text_encoder.load_state_dict(th.load(args['encoder_file']))

  condition_augmenter = ConditionAugmentor(
      input_size=config.hidden_size,
      latent_size=config.ca_out_size,
      use_eql=config.use_eql,
      device=device
  )

  if args['ca_file'] is not None:
      print("Loading conditioning augmenter from:", args['ca_file'])
      condition_augmenter.load_state_dict(th.load(args['ca_file']))

  c_pro_gan = ConditionalProGAN(
      embedding_size=config.hidden_size,
      depth=config.depth,
      latent_size=config.latent_size,
      compressed_latent_size=config.compressed_latent_size,
      learning_rate=config.learning_rate,
      beta_1=config.beta_1,
      beta_2=config.beta_2,
      eps=config.eps,
      drift=config.drift,
      n_critic=config.n_critic,
      use_eql=config.use_eql,
      loss=config.loss_function,
      use_ema=config.use_ema,
      ema_decay=config.ema_decay,
      device=device
  )

  if args['generator_file'] is not None:
      print("Loading generator from:", args['generator_file'])
      c_pro_gan.gen.load_state_dict(th.load(args['generator_file']))

  if args['discriminator_file'] is not None:
      print("Loading discriminator from:", args['discriminator_file'])
      c_pro_gan.dis.load_state_dict(th.load(args['discriminator_file']))

  # create the optimizer for Condition Augmenter separately
  ca_optim = th.optim.Adam(condition_augmenter.parameters(),
                            lr=config.learning_rate,
                            betas=(config.beta_1, config.beta_2),
                            eps=config.eps)

  print("Generator Config:")
  print(c_pro_gan.gen)

  print("\nDiscriminator Config:")
  print(c_pro_gan.dis)

  # train all the networks
  train_networks(
      encoder=text_encoder,
      ca=condition_augmenter,
      c_pro_gan=c_pro_gan,
      dataset=dataset,
      encoder_optim=encoder_optim,
      ca_optim=ca_optim,
      epochs=config.epochs,
      fade_in_percentage=config.fade_in_percentage,
      start_depth=args['start_depth'],
      batch_sizes=config.batch_sizes,
      num_workers=config.num_workers,
      feedback_factor=config.feedback_factor,
      log_dir=config.log_dir,
      sample_dir=config.sample_dir,
      checkpoint_factor=config.checkpoint_factor,
      save_dir=config.save_dir,
      use_matching_aware_dis=config.use_matching_aware_discriminator
  )

In [28]:
!pwd

/content/T2F/implementation


In [29]:
args = {
    'config': 'configs/1.conf',
    'start_depth': 0,
    'encoder_file': None,
    'ca_file': None,
    'generator_file': None,
    'discriminator_file': None
}

main(args)

%cd ../../

Using Device: cuda
configs/1.conf
Current Configuration: {'images_dir': '../data/LFW/lfw', 'processed_text_file': 'processed_annotations/processed_text.pkl', 'annotations_file': '../data/LFW/Face2Text/face2text_v0.1/clean.json', 'pretrained_encoder_dir': 'tf-hub_modules/text_encoder/1fb57c3ffe1a38479233ee9853ddd7a8ac8a8c47', 'log_dir': 'training_runs/1/losses/', 'sample_dir': 'training_runs/1/generated_samples/', 'save_dir': 'training_runs/1/saved_models/', 'captions_length': 100, 'img_dims': [256, 256], 'download_pretrained_encoder': False, 'use_pretrained_encoder': False, 'p_proc_gpu_mem': 0.3, 'embedding_size': 128, 'hidden_size': 512, 'num_layers': 3, 'ca_out_size': 256, 'compressed_latent_size': 128, 'use_eql': True, 'use_ema': True, 'ema_decay': 0.999, 'depth': 7, 'latent_size': 512, 'learning_rate': 0.001, 'beta_1': 0, 'beta_2': 0.99, 'eps': 1e-08, 'drift': 0.001, 'n_critic': 1, 'epochs': [20, 40, 40, 40, 40, 40, 40], 'fade_in_percentage': [50, 50, 50, 50, 50, 50, 50], 'batch_si



temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)


Currently working on Depth:  0
Current resolution: 4 x 4

Epoch: 1
Elapsed [0:00:01.237357]  batch: 1  d_loss: 8.546745  g_loss: 0.129311  kl_los: 3982.495361
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)
temp_input:  (1, 128, 128, 3)


KeyboardInterrupt: ignored