Tokenizers

Base class

We implement a basic tokenizer class built on torchtext's basic_english tokenizer. It transforms text into a sequence of token ids (integers), which are then used to look up word vectors.

multimodal.text.BasicTokenizer
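
To make the text-to-ids step concrete, here is a minimal, self-contained sketch of the idea. It is not the implementation of BasicTokenizer: ToyTokenizer, its split method, and the unk_token parameter are hypothetical names introduced only for illustration, and the regex is a much-simplified stand-in for the normalization done by torchtext's basic_english tokenizer.

```python
import re


class ToyTokenizer:
    """Illustrative stand-in only; not multimodal's BasicTokenizer."""

    def __init__(self, sentences, unk_token="<unk>"):
        # Id 0 is reserved for tokens that are not in the vocabulary.
        self.unk_token = unk_token
        self.token_to_id = {unk_token: 0}
        for sentence in sentences:
            for token in self.split(sentence):
                # Assign the next free id the first time a token is seen.
                self.token_to_id.setdefault(token, len(self.token_to_id))

    @staticmethod
    def split(text):
        # Lowercase, then separate words from punctuation marks.
        return re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text.lower())

    def __call__(self, text):
        unk_id = self.token_to_id[self.unk_token]
        return [self.token_to_id.get(tok, unk_id) for tok in self.split(text)]


# Build a vocabulary from one sentence, then encode another one.
tokenizer = ToyTokenizer(["What color is the car?"])
token_ids = tokenizer("What color is the bus?")
print(token_ids)  # [1, 2, 3, 4, 0, 6]  ("bus" is unknown, so it maps to 0)
```

In this sketch each id is simply an index into the vocabulary; a model would typically use such ids as row indices into an embedding table to obtain the word vectors.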

VQA v2

The pretrained tokenizer for VQA v2 is called pretrained-vqa2.

from multimodal.text import BasicTokenizer
tokenizer = BasicTokenizer.from_pretrained("pretrained-vqa2")
tokens = tokenizer("What color is the car?")
# feed tokens to model