This repository contains code to train a Korean CLIP model on MS-COCO using the Korean annotations provided by AI-HUB. To obtain additional Korean captions, we translate the English captions of the VizWiz dataset into Korean with the Naver Papago translator.
The original CLIP was trained on a very large dataset, whereas ours is much smaller. Because Korean caption data is scarce, we start from pretrained language and vision models so that good representations can still be learned from the smaller dataset.
- The pretrained language model (PLM) is fixed to klue/roberta-large from Hugging Face to obtain strong Korean text representations.
- As pretrained vision models (PVMs), we use google/vit-base-patch16-224-in21k from Hugging Face and RN101 from torchvision to obtain image representations (see the sketch after this list).
- The images themselves do not depend on the amount of Korean data, but because CLIP is trained on text-image pairs, Ko-CLIP is trained only on the limited set of images that have Korean captions.
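The repository's actual training code may differ, but as a minimal sketch of the dual-encoder setup described above, the two pretrained encoders can be combined with small projection heads and a symmetric contrastive (InfoNCE) loss. The class name `KoCLIP`, the projection dimension, and the loss layout are illustrative assumptions; only the checkpoint names come from the list above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class KoCLIP(nn.Module):
    """Dual encoder: a Korean PLM for text and a PVM for images,
    each followed by a linear projection into a shared embedding space."""
    def __init__(self, text_model="klue/roberta-large",
                 vision_model="google/vit-base-patch16-224-in21k",
                 embed_dim=512):  # embed_dim is an illustrative choice
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(text_model)
        self.vision_encoder = AutoModel.from_pretrained(vision_model)
        self.text_proj = nn.Linear(self.text_encoder.config.hidden_size, embed_dim)
        self.vision_proj = nn.Linear(self.vision_encoder.config.hidden_size, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.6592))  # log(1/0.07), as in CLIP

    def encode_text(self, **text_inputs):
        cls = self.text_encoder(**text_inputs).last_hidden_state[:, 0]  # [CLS] token
        return F.normalize(self.text_proj(cls), dim=-1)

    def encode_image(self, pixel_values):
        cls = self.vision_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        return F.normalize(self.vision_proj(cls), dim=-1)

    def forward(self, pixel_values, **text_inputs):
        image_emb = self.encode_image(pixel_values)
        text_emb = self.encode_text(**text_inputs)
        logits = self.logit_scale.exp() * image_emb @ text_emb.t()
        labels = torch.arange(len(logits), device=logits.device)
        # Symmetric InfoNCE over the in-batch image-text pairs
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
```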
See the WandB dashboard to check the training records and to compare model performance across the pretrained vision models.
For zero-shot classification, we evaluate on the CIFAR-10 and CIFAR-100 datasets.
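The snippet below is only a hedged sketch of how such an evaluation could look, reusing the `KoCLIP` class from the previous example: each CIFAR-10 class is turned into a Korean prompt, every image is matched against all prompts by cosine similarity, and the highest-scoring class is taken as the prediction. The prompt templates, the `zero_shot_predict` helper, and the processor choices are assumptions, not the repository's actual evaluation code.

```python
import torch
from torchvision.datasets import CIFAR10
from transformers import AutoTokenizer, AutoImageProcessor

@torch.no_grad()
def zero_shot_predict(model, tokenizer, image_processor, images, class_prompts):
    """Match each image against one Korean prompt per class and
    return the index of the best-matching class."""
    text_inputs = tokenizer(class_prompts, padding=True, return_tensors="pt")
    text_emb = model.encode_text(**text_inputs)                      # (C, D)
    pixel_values = image_processor(images, return_tensors="pt")["pixel_values"]
    image_emb = model.encode_image(pixel_values)                     # (B, D)
    # Embeddings are L2-normalized, so the dot product is cosine similarity.
    return (image_emb @ text_emb.t()).argmax(dim=-1)                 # (B,)

# Illustrative usage: hypothetical Korean prompts for the 10 CIFAR-10 classes.
tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
prompts = ["비행기 사진", "자동차 사진", "새 사진", "고양이 사진", "사슴 사진",
           "개 사진", "개구리 사진", "말 사진", "배 사진", "트럭 사진"]
test_set = CIFAR10(root="./data", train=False, download=True)
images = [test_set[i][0] for i in range(8)]        # a few PIL images
model = KoCLIP().eval()                            # from the sketch above (untrained here)
preds = zero_shot_predict(model, tokenizer, image_processor, images, prompts)
```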
We refer to CLIP, the clip-training repository for the training code, the koclip idea, and the other pretrained models mentioned above.