# Planning

## Overview

Feedback:
* analysis and experiments need more focus; the current scope is too broad in both models and evaluation
* implementing a new model from scratch adds little value; focus on the experiments/evaluation and how they illuminate the science of the model
* suggestion: pick one multimodal model & better analyze and understand the evaluated results, what worked and what didn't, what they reveal, etc.
* suggestion: focus only on one type of commonsense to evaluate

Revision:
* pick a single pre-existing multimodal model, justify the decision
* focus on physical commonsense only
* emphasize the analysis of the commonsense results: why they differ between the uni- and multimodal models, and which elements of the architecture, embeddings, etc. drive those differences (or what is evidently different between the models)
* what's the science behind it?
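
One hedged way to start the uni- vs. multimodal comparison is a per-example breakdown of where the two models agree and disagree. The sketch below assumes each model's predictions are already available as lists of choice indices; the function name `compare_predictions` and the toy data are illustrative only and are not results.

```python
from collections import Counter


def compare_predictions(labels, unimodal_preds, multimodal_preds):
    """Count examples by which model(s) answered correctly."""
    if not (len(labels) == len(unimodal_preds) == len(multimodal_preds)):
        raise ValueError("all inputs must have the same length")
    counts = Counter()
    for gold, uni, multi in zip(labels, unimodal_preds, multimodal_preds):
        uni_ok, multi_ok = uni == gold, multi == gold
        if uni_ok and multi_ok:
            counts["both_correct"] += 1
        elif uni_ok:
            counts["only_unimodal"] += 1
        elif multi_ok:
            counts["only_multimodal"] += 1
        else:
            counts["both_wrong"] += 1
    return dict(counts)


# Toy example with made-up predictions (not real results)
labels = [0, 1, 1, 0]
breakdown = compare_predictions(labels, [0, 1, 0, 1], [0, 0, 1, 1])
print(breakdown)
```

The `only_unimodal` and `only_multimodal` buckets are the examples worth reading by hand, since they are where the two models actually diverge.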

## References

* [Li 2021](https://arxiv.org/abs/2111.00607)
* [Saharia 2022](https://arxiv.org/abs/2205.11487)
* [A dive into vision-language models](https://huggingface.co/blog/vision_language_pretraining#supporting-vision-language-models-in-%F0%9F%A4%97-transformers)

## Models

Unimodal:
* Gopher LLM (Li et al 2021)
* Llama 2?

Multimodal (vision-text):
* ViLT ([huggingface](https://huggingface.co/docs/transformers/v4.35.2/en/model_doc/vilt#transformers.ViltProcessor), [HF tutorial for QA](https://huggingface.co/docs/transformers/tasks/visual_question_answering), [paper](https://arxiv.org/abs/2102.03334))




## Data

Training:

Data:


## Benchmarks

HellaSwag (used in Li et al 2021): given story, choose a continuation
* [site](https://rowanzellers.com/hellaswag/)
* [github](https://github.com/rowanz/hellaswag)
* [paper](https://arxiv.org/abs/1905.07830)

PIQA (used in Li et al 2021): given situational prompt, choose a logical answer
* [github](https://github.com/ybisk/ybisk.github.io/tree/master/piqa)
* [paper](https://arxiv.org/abs/1911.11641)
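
Both benchmarks are multiple-choice, so evaluation reduces to scoring each candidate and picking the highest. A minimal sketch, assuming some model-specific function already yields one score per candidate (for a language model, often the log-likelihood of the candidate given the prompt); `predict_choice` and `accuracy` are hypothetical helper names and the scores are made up.

```python
def predict_choice(scores):
    """Return the index of the highest-scoring candidate."""
    return max(range(len(scores)), key=lambda i: scores[i])


def accuracy(all_scores, labels):
    """Fraction of examples whose top-scoring candidate matches the gold label."""
    if len(all_scores) != len(labels) or not labels:
        raise ValueError("need one non-empty score list per label")
    correct = sum(predict_choice(s) == y for s, y in zip(all_scores, labels))
    return correct / len(labels)


# Toy scores: HellaSwag has 4 candidates per example, PIQA has 2
toy_scores = [[-3.2, -1.1, -4.0, -2.5], [-0.7, -1.9]]
toy_labels = [1, 1]
print(accuracy(toy_scores, toy_labels))
```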


In [1]:
import torch
import torch.nn as nn
import pandas as pd
import json
import transformers

# Testing

## Benchmarks

In [None]:
# ### HellaSwag

# # train set
# !wget https://raw.githubusercontent.com/rowanz/hellaswag/master/data/hellaswag_train.jsonl

# # val set
# !wget https://raw.githubusercontent.com/rowanz/hellaswag/master/data/hellaswag_val.jsonl

# # test set
# !wget https://raw.githubusercontent.com/rowanz/hellaswag/master/data/hellaswag_test.jsonl

--2023-12-02 22:46:22--  https://raw.githubusercontent.com/rowanz/hellaswag/master/data/hellaswag_train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 47496131 (45M) [text/plain]
Saving to: ‘hellaswag_train.jsonl’


2023-12-02 22:46:23 (185 MB/s) - ‘hellaswag_train.jsonl’ saved [47496131/47496131]

--2023-12-02 22:46:23--  https://raw.githubusercontent.com/rowanz/hellaswag/master/data/hellaswag_val.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12246618 (12M) [text/plain]
Saving to: ‘hellaswag_val.jsonl’


2023-12-02

In [None]:
with open('hellaswag_train.jsonl', 'r') as f:
    hellaswag_train = pd.read_json(f, lines=True)
    print(hellaswag_train.head())

   ind         activity_label  \
0    4  Removing ice from car   
1    8         Baking cookies   
2    9         Baking cookies   
3   12         Baking cookies   
4   27      Getting a haircut   

                                               ctx_a  \
0  Then, the man writes over the snow covering th...   
1  A female chef in white uniform shows a stack o...   
2  A female chef in white uniform shows a stack o...   
3  A tray of potatoes is loaded into the oven and...   
4  The man in the center is demonstrating a hairs...   

                       ctx_b  \
0                       then   
1                   the pans   
2                    a knife   
3       a large tray of meat   
4  the man in the blue shirt   

                                                 ctx  split split_type  label  \
0  Then, the man writes over the snow covering th...  train   indomain      3   
1  A female chef in white uniform shows a stack o...  train   indomain      3   
2  A female chef in white un

In [None]:
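def preprocess_function(df):
    # Assumed completion: repeat each context once per candidate ending
    first_sentences = [[ctx] * len(ends) for ctx, ends in zip(df["ctx"], df["endings"])]
    return first_sentences, df["endings"].tolist()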

In [None]:
### PIQA

# train data
!wget https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/train.jsonl -O piqa_train.jsonl

# train labels
!wget https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/train-labels.lst -O piqa_train_labels.lst

# val data
!wget https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/valid.jsonl -O piqa_val.jsonl

# val labels
!wget https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/valid-labels.lst -O piqa_val_labels.lst

# test data
!wget https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/tests.jsonl -O piqa_test.jsonl

--2023-11-26 04:03:06--  https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/train.jsonl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4382162 (4.2M) [text/plain]
Saving to: ‘piqa_train.jsonl’


2023-11-26 04:03:06 (43.4 MB/s) - ‘piqa_train.jsonl’ saved [4382162/4382162]

--2023-11-26 04:03:06--  https://raw.githubusercontent.com/ybisk/ybisk.github.io/master/piqa/data/train-labels.lst
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 32226 (31K) [text/plain]
Saving to: ‘piqa_train_labels.lst’


2023-11-26 04:0
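
The PIQA data and labels arrive in separate files, so they need to be joined by line position. A minimal sketch, assuming each JSONL record has `goal`, `sol1`, and `sol2` fields and each line of the labels file holds `0` or `1`; the in-memory strings below stand in for `piqa_train.jsonl` and `piqa_train_labels.lst`, and `load_piqa` is a hypothetical helper name.

```python
import io
import json


def load_piqa(data_file, labels_file):
    """Pair each PIQA record with its label (line i of labels matches line i of data)."""
    records = [json.loads(line) for line in data_file if line.strip()]
    labels = [int(line) for line in labels_file if line.strip()]
    if len(records) != len(labels):
        raise ValueError("data and label files have different lengths")
    for record, label in zip(records, labels):
        record["label"] = label
    return records


# Made-up records in the assumed format, for illustration only
data = io.StringIO(
    '{"goal": "Keep a door open.", "sol1": "Use a doorstop.", "sol2": "Use a spoon of water."}\n'
    '{"goal": "Dry wet hands.", "sol1": "Rub them with ice.", "sol2": "Wipe them on a towel."}\n'
)
labels = io.StringIO("0\n1\n")
examples = load_piqa(data, labels)
print(examples[1]["sol2"], examples[1]["label"])
```

With the downloaded files, the same function could be called as `load_piqa(open("piqa_train.jsonl"), open("piqa_train_labels.lst"))`. The test split ships without labels, so it would need separate handling.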

# Text model: Llama 2

# Vision-language model: ViLT (87.4M parameters)
* [fine-tuned on nlvr2](https://huggingface.co/dandelin/vilt-b32-finetuned-nlvr2)
* [fine-tuned on vqa](https://huggingface.co/dandelin/vilt-b32-finetuned-vqa)

In [4]:
from transformers import ViltProcessor, ViltForQuestionAnswering

text = "How many hot dogs should I eat?"

processor = ViltProcessor.from_pretrained('dandelin/vilt-b32-finetuned-vqa')
model = ViltForQuestionAnswering.from_pretrained('dandelin/vilt-b32-finetuned-vqa')
tokenizer = processor.tokenizer  # the processor already bundles a BERT tokenizer instance

# Text-only call: ViltProcessor also requires an `images` argument, so this raises the TypeError below
x = processor(text=text)

TypeError: ignored

In [8]:
from transformers import ViltProcessor, ViltForQuestionAnswering
import requests
from PIL import Image

# prepare image + question
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "How many cats are there?"

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

# prepare inputs
encoding = processor(image, text, return_tensors="pt")

# drop the image to test whether ViLT can run on text alone
encoding['pixel_values'] = None
encoding

{'input_ids': tensor([[ 101, 2129, 2116, 8870, 2024, 2045, 1029,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]]), 'pixel_values': None, 'pixel_mask': tensor([[[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]])}

In [9]:
# forward pass (fails with the ValueError below, most likely because pixel_values is None)
outputs = model(**encoding)
logits = outputs.logits
idx = logits.argmax(-1).item()
print("Predicted answer:", model.config.id2label[idx])

ValueError: ignored

# Vision-language model: FLAVA

In [10]:
from transformers import BertTokenizer, FlavaModel

model = FlavaModel.from_pretrained('facebook/flava-full')
tokenizer = BertTokenizer.from_pretrained("facebook/flava-full")

`text_config_dict` is provided which will be used to initialize `FlavaTextConfig`. The value `text_config["id2label"]` will be overriden.
`multimodal_config_dict` is provided which will be used to initialize `FlavaMultimodalConfig`. The value `multimodal_config["id2label"]` will be overriden.
`image_codebook_config_dict` is provided which will be used to initialize `FlavaImageCodebookConfig`. The value `image_codebook_config["id2label"]` will be overriden.



In [11]:
inputs = tokenizer(text=['a photo of a dog'], return_tensors='pt', padding='max_length', max_length=77)

text_embedding = model.get_text_features(**inputs)
text_embedding



tensor([[[-0.0185, -0.0586,  0.0698,  ..., -0.0318, -0.1119,  0.0276],
         [-0.0281, -0.0103, -0.0487,  ...,  0.0140,  0.0151, -0.0027],
         [ 0.0016, -0.0051, -0.0045,  ..., -0.0174,  0.0044, -0.0500],
         ...,
         [-0.0257, -0.0431,  0.0912,  ..., -0.0370, -0.0974,  0.0564],
         [-0.0431, -0.1407, -0.0212,  ..., -0.0893,  0.0681,  0.0106],
         [-0.0094, -0.0838, -0.0464,  ..., -0.0334,  0.0361,  0.0641]]],
       grad_fn=<ViewBackward0>)
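
The tensor above holds one vector per token position, including padding up to `max_length=77`. To compare whole sentences, one common option is masked mean pooling over the non-padding positions. A minimal sketch on a random stand-in tensor, assuming a hidden size of 768 and that the tokenizer's `attention_mask` marks real tokens with 1; `masked_mean_pool` is a hypothetical helper, not part of FLAVA's API.

```python
import torch


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors over non-padding positions.

    token_embeddings: (batch, seq_len, hidden)
    attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Random stand-in for model.get_text_features(**inputs); shapes are assumptions
token_embeddings = torch.randn(1, 77, 768)
attention_mask = torch.zeros(1, 77, dtype=torch.long)
attention_mask[0, :7] = 1  # pretend the first 7 positions are real tokens
sentence_embedding = masked_mean_pool(token_embeddings, attention_mask)
print(sentence_embedding.shape)
```

With the cell above, this would be called as `masked_mean_pool(text_embedding, inputs['attention_mask'])`. Whether mean pooling or the first-token vector is the better sentence representation for FLAVA is an open choice to check empirically.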