ficstamas/huclip-the-text

The Project

Object identification and question answering. The main goal of the project is to make decisions about images based on a provided question. The question MUST contain the possible answers, for example:

'Is this apple red or blue?'

We extract the keywords with a Hungarian spaCy model, so you can swap out that part to adapt the model to your language. Keyword extraction relies on the CCONJ (coordinating conjunction) part-of-speech tag, so the question must also include a word that fills that role (like 'or'). The model then returns an answer based on which keyword best represents the image.
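To illustrate the idea of splitting a question into candidate answers around a CCONJ token, here is a minimal sketch. It is not the repository's implementation: the `candidate_answers` helper, the hand-written `(text, POS)` pairs, and the rule "every adjective is an option, the first noun is the subject" are all assumptions made for illustration. In practice the tags would come from a spaCy pipeline for your language.

```python
from typing import List, Tuple

# Hypothetical input format: (token text, Universal POS tag) pairs,
# such as a POS tagger would produce for the question.
Token = Tuple[str, str]


def candidate_answers(tokens: List[Token]) -> List[str]:
    """Build one candidate answer per adjective, paired with the first noun.

    Assumption: the options are adjectives and the subject is the first noun.
    """
    # The question must contain a coordinating conjunction such as "or".
    if not any(pos == "CCONJ" for _, pos in tokens):
        raise ValueError("question must contain a coordinating conjunction (CCONJ)")

    options = [text for text, pos in tokens if pos == "ADJ"]
    nouns = [text for text, pos in tokens if pos == "NOUN"]
    subject = nouns[0] if nouns else ""

    # Combine each option with the shared subject, e.g. "red" + "apple".
    return [f"{option} {subject}".strip() for option in options]


# 'Is this apple red or blue?' tagged by hand for the example.
question = [
    ("Is", "AUX"), ("this", "DET"), ("apple", "NOUN"),
    ("red", "ADJ"), ("or", "CCONJ"), ("blue", "ADJ"), ("?", "PUNCT"),
]
answers = candidate_answers(question)
print(answers)  # ['red apple', 'blue apple']
```

A question without a conjunction raises `ValueError` in this sketch, which mirrors the requirement that the possible answers be listed in the question.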

Installation & Requirements

  • Python = 3.9.*
pip install git+https://github.com/ficstamas/huclip-the-text.git

Example Usage

from huclip_the_text.model.clip import KeywordCLIP
from PIL import Image

model = KeywordCLIP(model_name='M-BERT-Base-ViT-B')
img = Image.open('bananas.jpg')
out = model.evaluate(img, 'Sárga, kerek vagy lila banánt látsz?')

# Output:
# Probability of the answer 'Sárga banán' is 0.601322591304779
# Probability of the answer 'kerek banán' is 0.20016320049762726
# Probability of the answer 'lila banán' is 0.19851425290107727
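The probabilities in the output above sum to approximately 1, which is consistent with a softmax over one score per candidate answer. The sketch below shows that step in isolation, under that assumption. The `softmax` helper and the score values are hypothetical and are not taken from the library; only the answer strings come from the example output.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


answers = ["Sárga banán", "kerek banán", "lila banán"]
# Hypothetical image-text similarity scores, one per candidate answer.
scores = [1.1, 0.0, -0.01]

probs = softmax(scores)
# The selected answer is the candidate with the highest probability.
best = answers[probs.index(max(probs))]

for answer, prob in zip(answers, probs):
    print(f"Probability of the answer '{answer}' is {prob}")
print(f"Selected answer: {best}")
```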

Available Models

Pre-trained models and projection weights are from MultilingualCLIP.

| Name | Language Model | Model Base | Vision Model | Pre-trained Languages | Target Languages | #Parameters |
| --- | --- | --- | --- | --- | --- | --- |
| M-BERT-Distil-40 | M-BERT Distil 40 | M-BERT Distil | RN50x4 | 101 Languages | 40 Languages | 66 M |
| M-BERT-Base-69 | M-BERT Base 69 | M-BERT Base | RN50x4 | 101 Languages | 68 Languages | 110 M |
| M-BERT-Base-ViT-B | M-BERT Base ViT-B | M-BERT Base | ViT-B/32 | 101 Languages | 68 Languages | 110 M |

Related Works