.gitignore:

__pycache__
.vscode
data/
results/
models/
README.md:
# Multimodal Consistent Chain-of-Thought (MC-CoT): Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training

This repository contains the code for the paper "Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training". Our work focuses on enhancing smaller multimodal reasoning models so that they achieve performance comparable to that of larger models.

## Abstract

Multimodal reasoning is a challenging task that requires models to reason across multiple modalities to answer questions. Existing approaches have made progress by incorporating language and visual modalities into a two-stage reasoning framework that separates rationale generation from answer inference. However, these approaches often fall short due to the inadequate quality of the generated rationales. In this work, we delve into the importance of rationales in model reasoning. We observe that when rationales are completely accurate, the model's accuracy improves significantly, highlighting the need for high-quality rationale generation. Motivated by this, we propose MC-CoT, a self-consistency training strategy that generates multiple rationales and answers and then selects the most accurate ones through a voting process. This approach not only improves the quality of the generated rationales but also leads to more accurate and robust answers. Through extensive experiments, we demonstrate that our approach significantly improves model performance across various benchmarks. Remarkably, even smaller base models equipped with our approach can achieve results comparable to those of larger models, illustrating the potential of rationales for improved multimodal reasoning.

A schematic comparison of different Chain-of-Thought (CoT) prompt-based reasoning methods:
- Basic input-output prompting.
- Chain-of-Thought, with intermediate chain-like reasoning.
- Chain-of-Thought Self-Consistency (CoT-SC), which uses multiple independent reasoning chains.
- Multimodal-CoT, which infers a rationale from text and image inputs.
- MC-CoT, which derives a high-quality rationale through word-level voting (sketched below).

![Framework Comparison](assets/framework_comparison.png)
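
The word-level voting step can be pictured with a minimal sketch. Assume each of K sampled decoding passes yields a probability distribution over the vocabulary at every decoding step; the vote averages the K distributions per step and keeps the argmax token. The function name, the shapes, and the toy data below are illustrative assumptions, not the repository's actual API.

```python
import numpy as np

def word_level_vote(step_probs):
    """Aggregate K sampled decoding passes by word-level voting.

    step_probs: array of shape (K, T, V) -- K passes, T decoding steps,
    V vocabulary entries (illustrative shapes). Returns one token id per step.
    """
    mean_probs = step_probs.mean(axis=0)        # (T, V): consensus distribution
    return mean_probs.argmax(axis=-1).tolist()  # highest-voted token per step

# Toy demo: K=3 passes, T=2 steps, V=4 vocabulary entries.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=(3, 2))  # each row sums to 1
print(word_level_vote(probs))
```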

## Datasets

The models are trained and evaluated on two open-source datasets:
- ScienceQA, available at:
  - [Hugging Face Repository](https://huggingface.co/cooelf/vision_features/tree/main)
  - [Google Drive Link 1](https://drive.google.com/file/d/13B0hc_F_45-UlqPLKSgRz-ALtFQ8kIJr/view?pli=1)
  - [Google Drive Link 2](https://drive.google.com/drive/folders/1w8imCXWYn2LxajmGeGH_g5DaL2rabHev)
- A-OKVQA, available at the [AllenAI Project Page](https://allenai.org/project/a-okvqa/home).

The processed vision features for ScienceQA are available at https://huggingface.co/cooelf/vision_features/tree/main.

The folder with all related files looks like this:

```
mc-cot
├── assets
├── results
│   ├── base_pretrained_scienceqa
│   │   ├── answer
│   │   │   ├── ...
│   │   ├── rationale
│   │   │   ├── ...
├── models
│   ├── unifiedqa-t5-base
├── data
│   ├── vision_features
│   ├── scienceqa
```
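
Before training or evaluating, a quick existence check can confirm this layout is in place; the following is an illustrative sketch, not a script shipped with the repository.

```python
from pathlib import Path

# Directories the layout above expects (illustrative check, run from mc-cot/).
expected = [
    "assets",
    "results/base_pretrained_scienceqa",
    "models/unifiedqa-t5-base",
    "data/vision_features",
    "data/scienceqa",
]
missing = [d for d in expected if not Path(d).is_dir()]
print("Missing:", ", ".join(missing) if missing else "none")
```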

## Usage

To run inference with our pretrained weights (`results/base_pretrained_scienceqa/`), run `run_eval_scienceqa.sh`.

To train the model yourself, run `run_train_scienceqa.sh`.
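
If it is more convenient to launch either script from Python (for example, inside a notebook), here is a minimal sketch, assuming the scripts sit in the repository root and `bash` is available:

```python
import subprocess

# Evaluate with the released weights; substitute
# "run_train_scienceqa.sh" to train from scratch.
subprocess.run(["bash", "run_eval_scienceqa.sh"], check=True)
```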

## Acknowledgements

We thank the authors of "Multimodal Chain-of-Thought Reasoning in Language Models" ([paper](https://arxiv.org/abs/2302.00923), [code](https://github.com/amazon-science/mm-cot)).

## Reference

```
TBD
```
Evaluation utilities (Python):
'''
Adapted from https://github.com/lupantech/ScienceQA
'''

import re
from rouge_cal import Rouge
from nltk.translate.bleu_score import sentence_bleu
from sentence_transformers import util


########################
## BLEU
########################
def tokenize(text):
    # Split on whitespace and periods, then drop empty fragments.
    tokens = re.split(r'\s|\.', text)
    tokens = [t for t in tokens if len(t) > 0]
    return tokens
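
# Illustrative behaviour of the regex split:
#   tokenize("The sun is a star.") -> ['The', 'sun', 'is', 'a', 'star']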


def bleu_score(reference, hypothesis, gram):
    reference_tokens = tokenize(reference)
    hypothesis_tokens = tokenize(hypothesis)

    # Uniform n-gram weights select BLEU-1 through BLEU-4.
    if gram == 1:
        bleu = sentence_bleu([reference_tokens], hypothesis_tokens, (1.,))  # BLEU-1
    elif gram == 2:
        bleu = sentence_bleu([reference_tokens], hypothesis_tokens, (1. / 2., 1. / 2.))  # BLEU-2
    elif gram == 3:
        bleu = sentence_bleu([reference_tokens], hypothesis_tokens, (1. / 3., 1. / 3., 1. / 3.))  # BLEU-3
    elif gram == 4:
        bleu = sentence_bleu([reference_tokens], hypothesis_tokens, (1. / 4., 1. / 4., 1. / 4., 1. / 4.))  # BLEU-4
    else:
        raise ValueError("gram must be 1, 2, 3, or 4")

    return bleu


def caculate_bleu(results, data, gram):
    # Average BLEU-n over all (prediction, reference) pairs keyed by question id.
    bleus = []
    for qid, output in results.items():
        prediction = output
        target = data[qid].strip()
        if target == "":
            continue
        bleu = bleu_score(target, prediction, gram)
        bleus.append(bleu)

    avg_bleu = sum(bleus) / len(bleus)

    return avg_bleu
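
# Example (illustrative): `results` maps question ids to generated text and
# `data` maps the same ids to reference text.
#   caculate_bleu({"q1": "The sun is a star."},
#                 {"q1": "The sun is a star."}, gram=1)  # -> 1.0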


########################
## Rouge-L
########################
def score_rouge(str1, str2):
    # ROUGE-L F1 for a single hypothesis/reference pair.
    rouge = Rouge(metrics=["rouge-l"])
    scores = rouge.get_scores(str1, str2, avg=True)
    rouge_l = scores['rouge-l']['f']
    return rouge_l


def caculate_rouge(results, data):
    # Average ROUGE-L F1 over all pairs, skipping empty strings.
    rouges = []
    for qid, output in results.items():
        prediction = output
        target = data[qid].strip()
        if prediction == "":
            continue
        if target == "":
            continue
        rouge = score_rouge(target, prediction)
        rouges.append(rouge)

    avg_rouge = sum(rouges) / len(rouges)
    return avg_rouge
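
# Example (illustrative), mirroring caculate_bleu:
#   caculate_rouge({"q1": "The sun is a star."},
#                  {"q1": "The sun is a star."})  # -> 1.0 for an exact match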


########################
## Sentence Similarity
########################
def similariry_score(str1, str2, model):
    # Compute an embedding for each string, then their cosine similarity.
    embedding_1 = model.encode(str1, convert_to_tensor=True)
    embedding_2 = model.encode(str2, convert_to_tensor=True)
    score = util.pytorch_cos_sim(embedding_1, embedding_2).item()
    return score


def caculate_similariry(results, data, model):
    # Average prediction/reference cosine similarity over all pairs.
    scores = []
    for qid, output in results.items():
        prediction = output
        target = data[qid].strip()

        score = similariry_score(target, prediction, model)
        scores.append(score)

    avg_score = sum(scores) / len(scores)
    return avg_score
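
# Example (illustrative): any SentenceTransformer encoder can supply `model`,
# e.g. the publicly available "all-MiniLM-L6-v2":
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   caculate_similariry({"q1": "The sun is a star."},
#                       {"q1": "The sun is a star."}, model)  # -> ~1.0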