We implement a basic tokenizer class, built on torchtext basic_english tokenizer. Its purpose is to transform text into a series of token_ids (integers), that will be fed to a word vector.
multimodal.text.BasicTokenizer
The pretrained tokenizer for VQA v2 is called pretrained-vqa2
.
from multimodal.text import BasicTokenizer
tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")
tokens = tokenizer("What color is the car?")
# feed tokens to model