<span style='background:yellow'><font color="blue">
UC5, DEEP IMAGE ANNOTATION
</font></span>

DeepHealth Project

Franco Alberto Cardillo (`francoalberto.cardillo@ilc.cnr.it`)


TEXT GENERATION ON THE INDIANA UNIVERSITY CHEST X-RAY DATASET

PRE:

     - pre-trained CNN module;
     - pre-trained (unrolled) RNN module;
     - ECVL dataset;
     - img <-> text dataset;
     - input images.

Standalone notebook: the files listed above are generated by the Python modules available in the project's GitHub repository.

In [1]:
# imports
import humanize as H
from nltk import sent_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import numpy as np
from numpy import count_nonzero as nnz
import pandas as pd
import pickle
from posixpath import join
import pyecvl.ecvl as ecvl
import pyeddl.eddl as eddl
from pyeddl.tensor import Tensor, DEV_CPU, DEV_GPU
import time

In [2]:
# I M A G E - R E L A T E D   C L A S S E S   A N D   F U N C T I O N S
def get_augmentations_chest_iu(img_size=224):
    mean = [0.48197903, 0.48197903, 0.48197903]
    std = [0.26261734, 0.26261734, 0.26261734]
    
    train_augs = ecvl.SequentialAugmentationContainer([
                    ecvl.AugResizeDim([300, 300]),
                    ecvl.AugRotate([-5,5]),
                    ecvl.AugToFloat32(divisor=255.0),
                    ecvl.AugNormalize(mean, std),
                    ecvl.AugRandomCrop([img_size, img_size])
            ])

    test_augs =  ecvl.SequentialAugmentationContainer([
                        ecvl.AugResizeDim([300, 300]),
                        ecvl.AugToFloat32(divisor=255.0),
                        ecvl.AugNormalize(mean, std),
                        ecvl.AugCenterCrop([img_size, img_size])
                    ])
    return train_augs, test_augs
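
The test-time pipeline above (scale to [0, 1], standardize with the dataset mean and standard deviation, center crop) can be sketched in plain NumPy. This is only an illustration of the arithmetic, assuming the image has already been resized to 300x300; the helper name `normalize_and_center_crop` is hypothetical, and ECVL's own rounding and crop offsets may differ in detail.

```python
import numpy as np

MEAN = 0.48197903  # per-channel mean used in get_augmentations_chest_iu
STD = 0.26261734   # per-channel standard deviation used there


def normalize_and_center_crop(img_u8, crop=224):
    """Scale to [0, 1], standardize, then cut a centered crop x crop window."""
    x = img_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    h, w = x.shape[:2]
    top, left = (h - crop) // 2, (w - crop) // 2
    return x[top:top + crop, left:left + crop]


# hypothetical input: a mid-gray RGB image, already resized to 300x300
img = np.full((300, 300, 3), 128, dtype=np.uint8)
out = normalize_and_center_crop(img)
print(out.shape)
```

The training pipeline differs only in the small random rotation and in taking a random crop instead of the centered one.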

In [3]:
# T E X T - R E L A T E D   C L A S S E S   A N D   F U N C T I O N S
class Vocabulary:
    PAD = 0
    OOV = 1
    BOS = 2
    EOS = 3

    def __init__(self):
        self.initialize()

    def initialize(self):
        self.idx2word = {Vocabulary.PAD: "<pad>", Vocabulary.OOV: "<oov>", Vocabulary.BOS: "<bos>", Vocabulary.EOS: "<eos>"}
        self.word2idx = {w:i for i,w in self.idx2word.items()}
        assert self.word2idx["<pad>"] == 0

        self.word2count = {}
        self.idx = len(self.idx2word)
        self.word_count = 0

    def add_word(self, word):
        if word not in self.word2idx.keys():
            self.word2idx[word] = self.idx
            self.word2count[word] = 1
            self.idx2word[self.idx] = word
            self.idx += 1
        else:
            self.word2count[word] += 1
        self.word_count += 1
    #<

    def add_sentence(self, sentence):
        for word in sentence.split(" "):
            if word != ".":
                # commas and other punctuation already removed
                self.add_word(word)
    #<

    def add_text(self, text):
        for sentence in sent_tokenize(text):
            self.add_sentence(sentence)
    #<
    
    def keep_n_words(self, n_words: int):
        n = self.word_count
        print("(vocabulary) initial word count (total):", self.word_count)
        print("(vocabulary) initial number of words:", len(self.word2count))

        wc = list(self.word2count.items())
        wc = sorted(wc, key=lambda elem: -elem[1])
        # wc does not contain special tokens
        keep = wc[:n_words]
        rem = wc[n_words:]
        self.initialize()
        for w, _ in keep:
            self.add_word(w)

        print("(vocabulary) after iterating with add_word number of words:", len(self.word2count))
        assert len(self.word2idx) == n_words+4, f"words: {len(self.word2count)}, requested: {n_words}+4"  # number of special tokens
        # reset self.word2count
        self.word_count = 0
        for w, c in keep:
            self.word_count += c
            self.word2count[w] = c
        print("(vocabulary) final word_count (total): ", self.word_count)
        print("(vocabulary) final number of words:", len(self.word2count))

        #print("diff:", n - self.word_count)
        #print("removed words:", len(rem))
        #print("coverage:", self.word_count / n)

    def decode(self, idxs):
        return [self.idx2word[i] for i in idxs]
#< class Vocabulary
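
A compact, standalone sketch of the same idea as `keep_n_words`: keep the most frequent words and map everything else to the out-of-vocabulary index. The functions `build_word2idx` and `encode` are hypothetical helpers written for this illustration (the `Vocabulary` class above has no `encode` method), and the three sentences are made-up input.

```python
from collections import Counter

PAD, OOV, BOS, EOS = 0, 1, 2, 3
SPECIALS = ["<pad>", "<oov>", "<bos>", "<eos>"]


def build_word2idx(sentences, n_words):
    """Keep the n_words most frequent words; indexes 0-3 stay reserved."""
    counts = Counter(w for s in sentences for w in s.split(" ") if w != ".")
    word2idx = {w: i for i, w in enumerate(SPECIALS)}
    for word, _ in counts.most_common(n_words):
        word2idx[word] = len(word2idx)
    return word2idx


def encode(sentence, word2idx):
    """Map each word to its index, falling back to OOV for unknown words."""
    return [word2idx.get(w, OOV) for w in sentence.split(" ") if w != "."]


sentences = ["the heart is normal", "the lungs are clear", "the heart is enlarged"]
word2idx = build_word2idx(sentences, n_words=3)
encoded = encode("the heart is mildly enlarged", word2idx)
print(encoded)  # [4, 5, 6, 1, 1]
```

With `n_words=3` only "the", "heart" and "is" survive, so "mildly" and "enlarged" both become `OOV`.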


#> t e x t   c o l l a t i o n
def collate_fn_one_s(enc_text, n_sents=1, max_tokens=12, verbose=False):
    assert type(enc_text) is list, f"expected type 'list', received {type(enc_text)}"
    if verbose: 
        print(f"collating_one_s, len {len(enc_text)}:", enc_text)
    
    if type(enc_text[0]) is list:
        enc_text = enc_text[0]
        
    bos = Vocabulary.BOS
    eos = Vocabulary.EOS
    pad = Vocabulary.PAD
    
    enc_text = [bos] + enc_text + [eos]
    l = len(enc_text)
    if verbose: print(f"len with bos and eos: {l}, max is {max_tokens}")
    if l > max_tokens:
        if verbose: print("truncating")
        enc_text[max_tokens-1] = eos
        enc_text = enc_text[:max_tokens]
    elif l < max_tokens:
        if verbose: print("padding", flush=True)
        enc_text += ([pad] * (max_tokens -l) )
    
    if verbose: print(f"returning collation ({len(enc_text)}):", enc_text)
    assert len(enc_text) == max_tokens
    return np.array(enc_text)
#<

    
def collate_fn_n_sents(enc_text, n_sents, max_tokens, verbose=False):
    res = []
    for i, enc_sent in enumerate(enc_text):
        if i == n_sents:
            break
        v = collate_fn_one_s(enc_sent, n_sents=1, max_tokens=max_tokens, verbose=verbose)
        res.append(v)

    if len(res) > n_sents:
        res = res[:n_sents]
    elif len(res) < n_sents:
        padded = [Vocabulary.PAD] * max_tokens
        for i in range(n_sents - len(res)):
            res.append(np.array(padded))

    res = np.array(res)
    if verbose: print(f"collate_fn_n_sents, returning {res.shape}")
    return res
#<
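
The collation rule for a single sentence (add BOS and EOS, then truncate or pad to `max_tokens`) can be checked with a few lines of plain Python. `pad_or_truncate` is a hypothetical stand-in for `collate_fn_one_s` that returns a list instead of a NumPy array; the token ids are the first encoded sentences of two rows of `img_text_dataset.pkl` as printed in the output of section 2.

```python
PAD, BOS, EOS = 0, 2, 3


def pad_or_truncate(token_ids, max_tokens=12):
    """Wrap a sentence in BOS/EOS and force it to exactly max_tokens ids."""
    seq = [BOS] + list(token_ids) + [EOS]
    if len(seq) > max_tokens:
        seq = seq[:max_tokens]
        seq[-1] = EOS  # a truncated sentence still ends with EOS
    else:
        seq += [PAD] * (max_tokens - len(seq))
    return seq


short = pad_or_truncate([186, 17, 7, 41, 9, 432])
long_ = pad_or_truncate([5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 87])
print(short)  # [2, 186, 17, 7, 41, 9, 432, 3, 0, 0, 0, 0]
print(long_)  # [2, 5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 3]
```

Both results match the `target_text` column of the corresponding rows: the short sentence is padded, the long one loses its last word to make room for EOS.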


In [4]:
# E C V L   D A T A S E T  /  D A T A L O A D E R

def load_ecvl_dataset(config):
    _, test_augs = get_augmentations_chest_iu(config["img_size"])
    augs = ecvl.DatasetAugmentations(augs=[test_augs, test_augs, test_augs])
    ecvl.AugmentationParam.SetSeed(config["seed"])
    ecvl.DLDataset.SetSplitSeed(config["shuffle_seed"])
    print("loading dataset:", config["ecvl_ds_fn"])
    
    if config["eddl_cs"] == "cpu":
        num_workers = 16
    else:
        num_workers = 8 if nnz(config["gpu_id"]) == 1 else 4 * nnz(config["gpu_id"])

    print(f"using num workers = {num_workers}")
    print("using batch size:", config["bs"])
    dataset = ecvl.DLDataset(join(config["in_fld"], config["ecvl_ds_fn"]), 
                        batch_size=config["bs"], 
                        augs=augs, 
                        ctype=ecvl.ColorType.RGB, ctype_gt=ecvl.ColorType.GRAY, 
                        num_workers=num_workers, queue_ratio_size= 4 * nnz(config["gpu_id"]), 
                        drop_last={"training": False, "validation": False, "test": False}) # drop_last defined in training.augmentations
                        
    return dataset

In [5]:
# aux functions
def configure(
        in_fld="/opt/uc5/results/demo/text_gen", 
        img_fld="/mnt/datasets/uc5/std-dataset/image",
        ecvl_ds_fn="ecvl_ds.yml",
        bs=128,
        n_tokens=12,
        eddl_cs_mem="mid_mem", 
        eddl_cs="gpu", 
        gpu_id=[1],
        cnn_fn="best_cnn",
        rnn_fn="best_rnn",
        vocab_fn="vocab.pkl",
        img_size=224,
        seed=1234,
        shuffle_seed=5678):
    config = locals()
    return config
#<


def get_eddl_cs(config):
    return  eddl.CS_GPU(g=config["gpu_id"], mem=config["eddl_cs_mem"]) if config["eddl_cs"] == "gpu" else eddl.CS_CPU()
#<


def load_cnn(config):
    cnn_fn = join(config["in_fld"], config["cnn_fn"] + ".onnx")
    cnn = eddl.import_net_from_onnx_file(cnn_fn)
    eddl.build(cnn, 
        eddl.adam(lr=1e-04), # not relevant
        ["binary_cross_entropy"],  # not relevant
        ["binary_accuracy"], # not relevant
        get_eddl_cs(config) , 
        init_weights=False)  # losses, metrics
    eddl.set_mode(cnn, 0)  # inference
    return cnn
#<

def load_rnn(config):
    rnn_fn = join(config["in_fld"], config["rnn_fn"] + ".onnx")
    rnn = eddl.import_net_from_onnx_file(rnn_fn)
    eddl.build(rnn, 
        eddl.adam(lr=1e-04), 
        ["binary_cross_entropy"], 
        ["binary_accuracy"], 
        get_eddl_cs(config), 
        init_weights=False)
    return rnn
#<


def load_vocabulary(config):
    with open(join(config["in_fld"], config["vocab_fn"]), "rb") as f:
        vocab = pickle.load(f)
    return vocab
#<


def load_image_text_ds(config):
    text_dataset = pd.read_pickle(join(config["in_fld"], "img_text_dataset.pkl")).set_index("image_filename")
    return text_dataset
#<


def create_rnn_for_generation(unrolled_rnn, visual_dim, semantic_dim, lstm_dim, n_words, word_emb_dim, config, verbose=False):
    if verbose:
        print("CREATE RNN FOR GENERATION")
        eddl.summary(unrolled_rnn)
        print(f"Dimensions: visual dim: {visual_dim}, semantic dim: {semantic_dim}, lstm size: {lstm_dim}, n_words: {n_words}")
    #<

    cnn_top_in = eddl.Input([visual_dim], name="in_visual_features")
    cnn_out_in = eddl.Input([semantic_dim], name="in_semantic_features")
    features = eddl.Concat([cnn_top_in, cnn_out_in], name="cnn_concat")  # there is no coattention, name kept for subsequent models
    
    lstm_in = eddl.Input([n_words])
    lstate = eddl.States([2, lstm_dim])
    
    to_lstm = eddl.ReduceArgMax(lstm_in, [0])  # word index
    to_lstm = eddl.Embedding(to_lstm, n_words, 1, word_emb_dim, name="word_embs")
    to_lstm = eddl.Concat([to_lstm, features])
    lstm = eddl.LSTM([to_lstm, lstate], lstm_dim, True, name="lstm_cell")
    lstm.isrecurrent = False
    
    out_lstm = eddl.Softmax(
                eddl.Dense(lstm, n_words, name="out_dense"), 
                name="rnn_out")
    
    # *** model
    model = eddl.Model([cnn_top_in, cnn_out_in, lstm_in, lstate], [out_lstm])
    eddl.build(model, 
        eddl.adam(lr=1e-04),  # not relevant
        ["binary_cross_entropy"],  # not relevant
        ["binary_accuracy"],  # not relevant
        get_eddl_cs(config), 
        init_weights=False)

    # if the model is saved in onnx, there is the same error as in the recurrent model when loaded: 
    #       LDense only works over 2D tensors (LDense)
    return model
#<
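
What "non-recurrent" means here: the generation model runs a single LSTM step per forward pass and receives the previous state as an explicit input. A NumPy sketch of such a step follows. The gate order (i, f, g, o) and the weight layout are conventions of this sketch, not necessarily those used internally by EDDL; the weights are random.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. x: (bs, in_dim); h and c: (bs, hid)."""
    z = x @ W + h @ U + b                # (bs, 4 * hid)
    i, f, g, o = np.split(z, 4, axis=1)  # gate order is an assumption
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new


def lstm_param_count(in_dim, hid):
    """Input weights, recurrent weights and biases for the four gates."""
    return 4 * (in_dim * hid + hid * hid + hid)


rng = np.random.default_rng(0)
bs, in_dim, hid = 2, 5, 4
W = rng.normal(size=(in_dim, 4 * hid))
U = rng.normal(size=(hid, 4 * hid))
b = np.zeros(4 * hid)

# the caller owns the state and feeds it back at every step
h = np.zeros((bs, hid))
c = np.zeros((bs, hid))
xs = rng.normal(size=(3, bs, in_dim))
for x in xs:
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape, lstm_param_count(1070, 512))
```

With the dimensions printed by `eddl.summary` (input 1070, LSTM size 512), `lstm_param_count` gives 3241984, the parameter count reported for `lstm_cell`.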
    

In [6]:
def generate_text_predict_next(rnn, n_tokens, visual_batch=None, semantic_batch=None, dev=False):
    assert (visual_batch is not None) and (semantic_batch is not None)

    bs = visual_batch.shape[0]
    lstm = eddl.getLayer(rnn, "lstm_cell")
    lstm_size = lstm.output.shape[1]
    last_layer = eddl.getLayer(rnn, "rnn_out")
    voc_size = last_layer.output.shape[1]
    
    # return value
    generated_tokens = np.zeros( (bs, n_tokens), dtype=int)

    # lstm cell states
    state_t = Tensor.zeros([bs, 2, lstm_size])
    
    # token: input to lstm cell
    token = Tensor.zeros([bs, voc_size])
    
    for j in range(0, n_tokens):
        if dev:
            print(f" *** token {j}/{n_tokens} ***")
            print(f"cnn_visual: {visual_batch.shape}")
            print(f"cnn_semant: {semantic_batch.shape}")
            print(f"token: {token.shape}")
            print(f"state_t: {state_t.shape}")

        # forward: token and state_t update after the forward step
        eddl.forward(rnn, [visual_batch, semantic_batch, token, state_t])     
        states = eddl.getStates(lstm)

        # save the state for the next token: it must be copied into a Tensor (state_t)
        for si in range(len(states)):
            states[si].reshape_([ states[si].shape[0], 1, states[si].shape[1] ])
            state_t.set_select( [":", str(si), ":"] , states[si] )
        
        out_soft = eddl.getOutput(last_layer)
        # pass control to numpy for argmax
        wis = np.argmax(out_soft, axis=-1)
        # print(wis)
        # if dev:
        #     print(wis.shape)
        #     print(f"next_token {wis[0]}")
        generated_tokens[:, j] = wis
        
        #> next input token to the lstm
        word_index = Tensor.fromarray(wis.astype(float))
        word_index.reshape_([bs, 1])  # add dimension for one-hot encoding
        token = Tensor.onehot(word_index, voc_size)
        
        # print(token.shape)
        token.reshape_([bs, voc_size])  # remove singleton dim
        #<
    #< for n_tokens
    return generated_tokens
#<
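
The loop above is plain greedy decoding: take the argmax of the output distribution, one-hot encode it and feed it back together with the updated state. The sketch below reproduces that control flow without EDDL. `toy_step` and its transition table are hypothetical stand-ins for the trained LSTM, and the boolean `started` flag plays the role of the recurrent state.

```python
import numpy as np

PAD, OOV, BOS, EOS = 0, 1, 2, 3
VOC_SIZE = 6

# hypothetical transition table standing in for the trained model
NEXT = {PAD: PAD, OOV: OOV, BOS: 4, 4: 5, 5: EOS, EOS: PAD}


def toy_step(token_onehot, started):
    """Return (probabilities, new state) for one decoding step."""
    bs = token_onehot.shape[0]
    prev = token_onehot.argmax(axis=-1)
    probs = np.zeros((bs, VOC_SIZE))
    for i in range(bs):
        nxt = NEXT[int(prev[i])] if started[i] else BOS
        probs[i, nxt] = 1.0
    return probs, np.ones(bs, dtype=bool)


def greedy_decode(step_fn, bs, n_tokens):
    generated = np.zeros((bs, n_tokens), dtype=int)
    token = np.zeros((bs, VOC_SIZE))   # all-zero first input
    state = np.zeros(bs, dtype=bool)   # initial state
    for j in range(n_tokens):
        probs, state = step_fn(token, state)
        wis = probs.argmax(axis=-1)    # greedy choice per sample
        generated[:, j] = wis
        token = np.eye(VOC_SIZE)[wis]  # one-hot input for the next step
    return generated


tokens = greedy_decode(toy_step, bs=2, n_tokens=8)
print(tokens[0])  # [2 4 5 3 0 0 0 0]
```

As in the real loop, decoding always runs for `n_tokens` steps; anything produced after EOS is discarded later.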


class TextGenerator:
    def __init__(self, cnn, rnn, dataset, text_dataset, vocab, n_tokens=12, bs=None):
        self.cnn = cnn
        self.rnn = rnn
        self.ds = dataset
        self.text_ds = text_dataset
        self.vocab = vocab
        self.n_tokens = n_tokens
        self.bs = bs
        
    def generate(self, stages=None):
        stages = stages or {
            # "train": ecvl.SplitType.training,
            # "valid": ecvl.SplitType.validation,
            "test": ecvl.SplitType.test
        }

        cnn = self.cnn
        cnn_out = eddl.getLayer(cnn, "cnn_out")
        cnn_top = eddl.getLayer(cnn, "top")
        
        rnn = self.rnn
        ds = self.ds
        text_ds = self.text_ds
        results = {}
        
        for stage, split_type in stages.items():
            gen_sents = []
            target_sents = []
            print("text generation, stage:", stage)
            ds.SetSplit(split_type)
            ds.ResetBatch()
            ds.Start()
            n_batches = ds.GetNumBatches()
            t0 = time.perf_counter()
            for bi in range(n_batches):
                if (bi + 1) % 100 == 0:
                    print(f"batch {bi+1} / {n_batches}")

                I, X, Y = ds.GetBatch()
                image_ids = [sample.location_[0] for sample in I]
                texts = text_ds.loc[image_ids, "target_text"].tolist()
                # texts = np.array(texts.tolist()).astype(np.float32)
                cnn.forward([X])
                cnn_semantic = eddl.getOutput(cnn_out)
                cnn_visual = eddl.getOutput(cnn_top)
                gen_sentence = generate_text_predict_next(rnn, self.n_tokens, visual_batch=cnn_visual, semantic_batch=cnn_semantic, dev=False)
                # gen_s = np.array(gen_sentence)
                gen_sents.append(gen_sentence)
                target_sents.append(texts)
            #< for over batches
            t1 = time.perf_counter()
            avg_t_batch = (t1 - t0) / n_batches
            avg_t_image = avg_t_batch / self.bs if self.bs else np.nan  # NaN when the batch size is unknown
            print(f"Stage {stage}, all texts generated in {H.precisedelta(t1-t0)}; {H.precisedelta(avg_t_batch)} per {self.bs}-image batch, average time for a single image: {H.precisedelta(avg_t_image)}")
            ds.Stop()
            results[stage] = (np.concatenate(gen_sents, axis=0), np.concatenate(target_sents, axis=0))                
        #< for over stage
        
        dfs = [] # stage dfs
        def clean(s):
            r = []
            for w in s:
                r.append(w)
                if w == Vocabulary.EOS:
                    break
            return r

        for stage, (gen_sents, target_sents) in results.items():
            #print(gen_sents.shape)
            #print(target_sents.shape)
            clean_generated = []
            clean_targets = []
            for i in range(gen_sents.shape[0]):
                ww = np.squeeze(gen_sents[i,:])
                tt = np.squeeze(target_sents[i,:])
                clean_generated.append(clean(ww))                
                clean_targets.append(clean(tt))
            df = pd.DataFrame({"generated_i": clean_generated, "target_i": clean_targets})
            df["stage"] = stage
            dfs.append(df)
        results = pd.concat(dfs, axis=0)
        
        def decode(tokens):
            return " ".join([self.vocab.idx2word[t] for t in tokens])
        
        results["generated"] = results.generated_i.apply(decode)
        results["target"] = results.target_i.apply(decode)

        smoothing_function = SmoothingFunction()
        smooth = smoothing_function.method3
        

        # rows are accessed by column label: positional x[0] / x[1] on a
        # label-indexed Series is deprecated in recent pandas versions
        results["bleu_1"] = results[["target", "generated"]].apply(lambda x:
                sentence_bleu(
                        [x["target"].split(" ")],
                        x["generated"].split(" "),
                        weights=(1, 0, 0, 0), smoothing_function=smooth), axis=1)

        #>
        def bleu2(target, generated):
            target = target.split(" ")
            generated = generated.split(" ")
            score = sentence_bleu([target], generated, weights=(0.5, 0.5, 0, 0), smoothing_function=smooth)
            return score
        #
        results["bleu_2"] = results[["target", "generated"]].apply(lambda x:
                    bleu2(x["target"], x["generated"]), axis=1)
        #<

        results["bleu_3"] = results[["target", "generated"]].apply(lambda x:
            sentence_bleu(
                    [x["target"].split(" ")],
                    x["generated"].split(" "),
                    weights=(1/3, 1/3, 1/3, 0), smoothing_function=smooth), axis=1)

        results["bleu_4"] = results[["target", "generated"]].apply(lambda x:
            sentence_bleu(
                    [x["target"].split(" ")],
                    x["generated"].split(" "),
                    weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smooth), axis=1)
        return results
        
#<
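
To make the BLEU-2 weights `(0.5, 0.5, 0, 0)` concrete, here is a hand-rolled, unsmoothed BLEU-2: the geometric mean of the clipped 1-gram and 2-gram precisions, times the brevity penalty. This is not the NLTK implementation used above. In particular it returns 0 whenever a precision is zero, which is exactly the case the smoothing function in `generate` is there to handle. The sentence pair is taken from the test-partition output of section 1.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with clipping."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu2_unsmoothed(target, generated):
    ref, cand = target.split(" "), generated.split(" ")
    p1 = modified_precision(ref, cand, 1)
    p2 = modified_precision(ref, cand, 2)
    if p1 == 0 or p2 == 0:
        return 0.0  # no smoothing in this sketch
    # brevity penalty: only candidates shorter than the reference are penalized
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(0.5 * math.log(p1) + 0.5 * math.log(p2))


score = bleu2_unsmoothed("<bos> the heart is mildly enlarged <eos>",
                         "<bos> the heart is normal in size <eos>")
print(round(score, 6))  # 0.517549
```

Here 5 of 8 unigrams and 3 of 7 bigrams match, so the score is sqrt(5/8 * 3/7), the same value reported for this pair in the results table. Note that `<bos>` and `<eos>` take part in the matching, which raises the scores of short sentences.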


<span style='background:yellow'><font color="blue">
GENERATION STARTS HERE
</font></span>




<span style='background:yellow'><font color="blue">
1: GENERATE TEXT ON THE TEST PARTITION AND PRINT THE MEAN BLEU-2 SCORE
</font></span>


In [7]:
conf = configure()
cnn = load_cnn(conf)
vocab = load_vocabulary(conf)
trained_rnn = load_rnn(conf)
img_text_ds = load_image_text_ds(conf)
ecvl_ds = load_ecvl_dataset(conf)
# now build non-recurrent version of the recurrent module

# dimensions
visual_dim = eddl.getLayer(cnn, "top").output.shape[1]
semantic_dim = eddl.getLayer(cnn, "cnn_out").output.shape[1]
n_words = len(vocab.idx2word)
word_emb_dim = eddl.getLayer(trained_rnn, "word_embs").output.shape[1]
lstm_dim = eddl.getLayer(trained_rnn, "lstm_cell").output.shape[1]

# build a non-recurrent RNN with the same weights as the trained RNN
rnn = create_rnn_for_generation(trained_rnn, visual_dim, semantic_dim, lstm_dim, n_words, word_emb_dim, conf)

# copy weights from the trained (unrolled) RNN to the generation model
layers_to_copy = [ "word_embs", "lstm_cell", "out_dense" ]
for l in layers_to_copy:
    eddl.copyParam(eddl.getLayer(trained_rnn, l), eddl.getLayer(rnn, l))

print("", flush=True)  # simply to keep order of output
eddl.set_mode(rnn, 0)
eddl.summary(rnn)

generator =  TextGenerator(cnn, rnn, ecvl_ds, img_text_ds, vocab, bs=conf["bs"])
results = generator.generate()


# ALL BLEUs
# print(results[["bleu_1", "bleu_2", "bleu_3", "bleu_4", "stage"]].groupby("stage").mean())
# BLEU-2

for stage, df in results[["target", "generated", "bleu_2", "stage"]].groupby("stage"):
    print("STAGE:", stage)
    display( df[df.bleu_2 > 0.2].sample(5) )
    display( df[df.bleu_2 < 0.2].sample(5) )

print("KPI")
display( results[["bleu_2", "stage"]].groupby("stage").mean() )

print("all done.")

Generating Random Table
CS with mid memory setup
Building model without initialization
Selecting GPU device 0
EDDL is running on GPU device 0, Tesla V100-SXM2-32GB
CuBlas initialized on GPU device 0, Tesla V100-SXM2-32GB
CuRand initialized on GPU device 0, Tesla V100-SXM2-32GB
CuDNN initialized on GPU device 0, Tesla V100-SXM2-32GB
copying onnx params to devices
CS with mid memory setup
Building model without initialization
copying onnx params to devices


loading dataset: ecvl_ds.yml
using num workers = 8
using batch size: 128
copy all params from word_embs to word_embs
copy all params from lstm_cell to lstm_cell
copy all params from out_dense to out_dense
text generation, stage: test


CS with mid memory setup
Building model without initialization
copying onnx params to devices


-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
in_visual_features  |  (512)               =>   (512)               0         
in_semantic_features|  (46)                =>   (46)                0         
cnn_concat          |  (512)               =>   (558)               0         
input1              |  (1004)              =>   (1004)              0         
reduction_argmax1   |  (1004)              =>   (1)                 0         
word_embs           |  (1)                 =>   (512)               514048    
concat1             |  (512)               =>   (1070)              0         
State1              |  (2, 512)            =>   (2, 512)            0         
lstm_cell           |  (1070)              =>   (512)               3241984   
out_dense           |  (512)               =>   (1004)              515052    
rnn_out             |  (1004)              =

index,target,generated,bleu_2,stage
618,<bos> the heart is mildly enlarged <eos>,<bos> the heart is normal in size <eos>,0.517549,test
356,<bos> the heart is normal in size <eos>,<bos> the heart is normal in size <eos>,1.0,test
715,<bos> the heart size and pulmonary vascularity...,<bos> the heart size and pulmonary vascularity...,1.0,test
79,<bos> there is stable cardiomegaly with xxxx p...,<bos> there is mild cardiomegaly <eos>,0.212395,test
453,<bos> the heart size and pulmonary vascularity...,<bos> the heart is normal in size <eos>,0.280769,test


index,target,generated,bleu_2,stage
358,<bos> _NUM_ images <eos>,<bos> the heart is normal in size <eos>,0.133631,test
721,<bos> the cardiomediastinal silhouette and pul...,<bos> heart size normal <eos>,0.067533,test
521,<bos> lungs are clear <eos>,<bos> the heart is normal in size and contour ...,0.105409,test
340,<bos> clear lungs <eos>,<bos> the heart is enlarged <eos>,0.182574,test
32,<bos> heart size is normal <eos>,<bos> the heart pulmonary xxxx and mediastinum...,0.123091,test


KPI


stage,bleu_2
test,0.257341


all done.


<span style='background:yellow'><font color="blue">
2: GENERATE TEXT ON SINGLE IMAGES
</font></span>



In [8]:
conf = configure()
cnn = load_cnn(conf)
vocab = load_vocabulary(conf)
trained_rnn = load_rnn(conf)
img_text_ds = load_image_text_ds(conf)
ecvl_ds = load_ecvl_dataset(conf)
# now build non-recurrent version of the recurrent module

# dimensions
visual_dim = eddl.getLayer(cnn, "top").output.shape[1]
semantic_dim = eddl.getLayer(cnn, "cnn_out").output.shape[1]
n_words = len(vocab.idx2word)
word_emb_dim = eddl.getLayer(trained_rnn, "word_embs").output.shape[1]
lstm_dim = eddl.getLayer(trained_rnn, "lstm_cell").output.shape[1]

# build a non-recurrent RNN with the same weights as the trained RNN
rnn = create_rnn_for_generation(trained_rnn, visual_dim, semantic_dim, lstm_dim, n_words, word_emb_dim, conf)

# copy weights from the trained (unrolled) RNN to the generation model
layers_to_copy = [ "word_embs", "lstm_cell", "out_dense" ]
for l in layers_to_copy:
    eddl.copyParam(eddl.getLayer(trained_rnn, l), eddl.getLayer(rnn, l))
eddl.set_mode(rnn, 0)


print("", flush=True)  # simply to keep order of output
eddl.summary(rnn)

#
# -----------------------------------------------------
#


_, augs = get_augmentations_chest_iu()
filenames = ["CXR2086_IM-0717-1001.png", "CXR2814_IM-1239-1001.png", "CXR1688_IM-0450-1001.png", "CXR2655_IM-1137-2001.png", "CXR1721_IM-0476-2001.png"]
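# full paths under conf["img_fld"]: img_text_ds is indexed by the full image path, so the target lookup in the loop needs them
filenames = [join(conf["img_fld"], fn) for fn in filenames]
# filenames = [join("./images", fn) for fn in filenames]  # local copies; these keys are not in img_text_ds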

cnn_out = eddl.getLayer(cnn, "cnn_out")
cnn_top = eddl.getLayer(cnn, "top")

# cut the generated text at EOS
def clean(s):
    r = []
    for w in s:
        r.append(w)
        if w == Vocabulary.EOS:
            break
    return r

def decode(wis):
    return " ".join([vocab.idx2word[word_index] for word_index in wis])

display(img_text_ds.head())

for i, fn in enumerate(filenames):
    t0 = time.perf_counter()
    # read image from disk
    img = ecvl.ImRead(fn)  # , flags=ecvl.ImReadMode.GRAYSCALE)
    augs.Apply(img)
    ecvl.RearrangeChannels(img, img, "cxy")
    print(f"* Image {i+1}/{len(filenames)}: {fn}: {img.dims_}")
    img = np.array(img, copy=False)
    
    # add a batch dimension and wrap the array in an EDDL Tensor
    img = np.expand_dims(img, 0)  # add "batch" dimension
    img = Tensor(img)
    
    # forward through CNN
    cnn.forward([img])
    cnn_semantic = eddl.getOutput(cnn_out)
    cnn_visual = eddl.getOutput(cnn_top)
    
    # generate text
    word_indexes = generate_text_predict_next(rnn, conf["n_tokens"], cnn_visual, cnn_semantic)[0]
    t1 = time.perf_counter()
    print(f"- image annotated in {t1-t0:.3f}s")
    print("- generated word indexes:", word_indexes)
    text = decode(word_indexes)
    print("\tdecoded:", text)
    # clean generated text: remove word indexes after EOS
    word_indexes = clean(word_indexes)
    print("\tgenerated word indexes cut at EOS:", word_indexes)
    text = decode(word_indexes)
    print("- generated text:", text)
    # target text
    target_word_indexes = img_text_ds.loc[fn, "target_text"]
    target_text = decode(clean(target_word_indexes))
    print("- target text:", target_text)
    print()

print("all done.")
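
The post-processing at the end of the loop (cut at EOS, then map indexes to words) in isolation. The partial `idx2word` table is reconstructed from the run output of this cell and covers only the words needed here; `cut_at_eos` is the same logic as `clean`, renamed for this standalone sketch.

```python
EOS = 3

# partial index-to-word table, reconstructed from the printed run output
idx2word = {0: "<pad>", 2: "<bos>", 3: "<eos>", 5: "the", 6: "is",
            13: "normal", 16: "heart", 19: "size", 22: "in"}


def cut_at_eos(token_ids):
    """Keep tokens up to and including the first EOS."""
    out = []
    for t in token_ids:
        out.append(int(t))
        if t == EOS:
            break
    return out


def decode(token_ids):
    return " ".join(idx2word[t] for t in token_ids)


generated = [2, 5, 16, 6, 13, 22, 19, 3, 0, 0, 0, 0]
text = decode(cut_at_eos(generated))
print(text)  # <bos> the heart is normal in size <eos>
```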

CS with mid memory setup
Building model without initialization
copying onnx params to devices
CS with mid memory setup
Building model without initialization
copying onnx params to devices
CS with mid memory setup
Building model without initialization


loading dataset: ecvl_ds.yml
using num workers = 8
using batch size: 128
copy all params from word_embs to word_embs
copy all params from lstm_cell to lstm_cell

copy all params from out_dense to out_dense
-------------------------------------------------------------------------------
model
-------------------------------------------------------------------------------
in_visual_features  |  (512)               =>   (512)               0         
in_semantic_features|  (46)                =>   (46)                0         
cnn_concat          |  (512)               =>   (558)               0         
input2              |  (1004)              =>   (1004)              0         
reduction_argmax2   |  (1004)              =>   (1)                 0         
word_embs           |  (1)                 =>   (512)               514048    
concat2             |  (512)               =>   (1070)              0         
State2              |  (2, 512)            =>   (2, 512)            0      

copying onnx params to devices


image_filename,id,text,enc_text,target_text
/mnt/datasets/uc5/std-dataset/image/CXR1_1_IM-0001-3001.png,1,the cardiac silhouette and mediastinum size ar...,"[[5, 62, 39, 9, 55, 19, 7, 21, 13, 66], [15, 6...","[2, 5, 62, 39, 9, 55, 19, 7, 21, 13, 66, 3]"
/mnt/datasets/uc5/std-dataset/image/CXR1_1_IM-0001-4001.png,1,the cardiac silhouette and mediastinum size ar...,"[[5, 62, 39, 9, 55, 19, 7, 21, 13, 66], [15, 6...","[2, 5, 62, 39, 9, 55, 19, 7, 21, 13, 66, 3]"
/mnt/datasets/uc5/std-dataset/image/CXR10_IM-0002-1001.png,10,the cardiomediastinal silhouette is within nor...,"[[5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 87], [5...","[2, 5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 3]"
/mnt/datasets/uc5/std-dataset/image/CXR10_IM-0002-2001.png,10,the cardiomediastinal silhouette is within nor...,"[[5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 87], [5...","[2, 5, 40, 39, 6, 21, 13, 66, 50, 19, 9, 3]"
/mnt/datasets/uc5/std-dataset/image/CXR100_IM-0002-1001.png,100,both lungs are clear and expanded. heart and m...,"[[186, 17, 7, 41, 9, 432], [16, 9, 55, 13], [4...","[2, 186, 17, 7, 41, 9, 432, 3, 0, 0, 0, 0]"


>   (1004)              0         
-------------------------------------------------------------------------------
Total params: 4271084
Trainable params: 4271084
Non-trainable params: 0

* Image 1/5: /mnt/datasets/uc5/std-dataset/image/CXR2086_IM-0717-1001.png: [3, 224, 224]
- image annotated in 0.538s
- generated word indexes: [ 2  5 16  6 13 22 19  3  0  0  0  0]
	decoded: <bos> the heart is normal in size <eos> <pad> <pad> <pad> <pad>
	generated word indexes cut at EOS: [2, 5, 16, 6, 13, 22, 19, 3]
- generated text: <bos> the heart is normal in size <eos>
- target text: <bos> the cardiomediastinal silhouette is within normal limits <eos>

* Image 2/5: /mnt/datasets/uc5/std-dataset/image/CXR2814_IM-1239-1001.png: [3, 224, 224]
- image annotated in 0.085s
- generated word indexes: [  2  42  46 111   3   0   0   0   0   0   0   0]
	decoded: <bos> stable mild cardiomegaly <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
	generated word indexes cut at EOS: [2, 42, 46, 111, 3]
- generated