Update prepro.py
fix vocab bug
jiasenlu committed Feb 23, 2016
1 parent 0067be1 commit 84dc169
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions prepro.py
@@ -80,7 +80,7 @@ def apply_vocab_question(imgs, wtoi):
     # apply the vocab on test.
     for img in imgs:
         txt = img['processed_tokens']
-        question = [w if wtoi.get(w,len(wtoi)) != len(wtoi) else 'UNK' for w in txt]
+        question = [w if wtoi.get(w,len(wtoi)+1) != (len(wtoi)+1) else 'UNK' for w in txt]

@jnhwkim

jnhwkim Mar 1, 2016

Contributor

@jiasenlu I think this modification is not necessary, because UNK is added to the vocabulary set for training.
See https://github.com/VT-vision-lab/VQA_LSTM_CNN/blob/84dc16904065e46f3d42bca3d4b48af224a76572/prepro.py#L70

         img['final_question'] = question

     return imgs
@@ -142,7 +142,7 @@ def encode_mc_answer(imgs, atoi):
 def filter_question(imgs, atoi):
     new_imgs = []
     for i, img in enumerate(imgs):
-        if atoi.get(img['ans'],len(atoi)) != len(atoi):
+        if atoi.get(img['ans'],len(atoi)+1) != len(atoi)+1:
             new_imgs.append(img)

     print 'question number reduce from %d to %d '%(len(imgs), len(new_imgs))
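
The effect of changing the `dict.get` default in both hunks can be sketched in isolation. The example below assumes the lookup tables are 1-indexed, so that the last entry's index equals `len(table)`. That assumption, the helper names (`build_one_indexed`, `is_known_old`, `is_known_new`) and the toy vocabulary are illustrative and are not taken from prepro.py. It is written for Python 3, whereas the diff itself uses Python 2 syntax.

```python
def build_one_indexed(vocab):
    # Hypothetical helper: map each entry to a 1-based index (assumption).
    return {w: i + 1 for i, w in enumerate(vocab)}


def is_known_old(table, key):
    # Pre-commit check: the default len(table) is also the index of the
    # last entry in a 1-indexed table, so that entry looks "missing".
    return table.get(key, len(table)) != len(table)


def is_known_new(table, key):
    # Post-commit check: len(table) + 1 is never a valid 1-based index,
    # so only keys that are truly absent hit the default.
    return table.get(key, len(table) + 1) != len(table) + 1


# Toy vocabulary, for illustration only.
wtoi = build_one_indexed(["what", "color", "is"])

print(is_known_old(wtoi, "is"))     # last entry, index 3 == len(wtoi)
print(is_known_new(wtoi, "is"))
print(is_known_new(wtoi, "zebra"))  # absent key
```

Under the same assumption, a plain membership test (`key in table`) would avoid the sentinel value altogether. Whether that fits the rest of prepro.py is not something this sketch checks.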

1 comment on commit 84dc169

@jnhwkim
Contributor


FYI, for split=2 and test-dev2015:

  1. reduced # of questions: 320006 => 320029 (+23)
  2. total words: 2284474 => 2284620 (+146)
  3. # of vocab: 14770 (the same; however, the order is shuffled because a few additional questions were introduced)
  4. # of affected questions: 687 (train) + 142 (test) (approx. 0.26%)
