# LLM Science Exam Optimise Ensemble Weights 

In this competition, the high-scoring public notebooks that stand out are ensembles of multiple models. Ensembles are known empirically to be very powerful in NLP competitions.

[The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1), and many notebooks have been published on this basis.

On the other hand, ensembles that use predicted probabilities appear to be less common.

This notebook introduces ensembles using probabilities and shows how to optimise model weights with **scipy.optimize**.
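
A minimal sketch of the idea, not the exact code used later in this notebook: each model's class probabilities are blended with non-negative weights that sum to 1, and `scipy.optimize.minimize` searches for the weights that minimise a loss on labelled validation data. The synthetic predictions, the helper names (`fake_model_probs`, `blend`, `log_loss`), the SLSQP solver and the use of log loss as a smooth stand-in for the competition metric are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_samples, n_classes = 500, 5
labels = rng.integers(0, n_classes, size=n_samples)  # synthetic ground truth


def fake_model_probs(noise):
    """Hypothetical stand-in for one model's validation probabilities."""
    logits = rng.normal(scale=noise, size=(n_samples, n_classes))
    logits[np.arange(n_samples), labels] += 1.0  # make the true class more likely
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


# Shape (n_models, n_samples, n_classes): three models of decreasing quality
preds = np.stack([fake_model_probs(0.5), fake_model_probs(1.0), fake_model_probs(2.0)])
n_models = preds.shape[0]


def blend(weights):
    """Weighted average of the per-model probabilities."""
    return np.tensordot(weights, preds, axes=1)


def log_loss(weights):
    """Mean negative log-likelihood of the true class under the blend."""
    p = np.clip(blend(weights), 1e-12, 1.0)
    return -np.mean(np.log(p[np.arange(n_samples), labels]))


init = np.full(n_models, 1.0 / n_models)  # start from equal weights
result = minimize(
    log_loss,
    init,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,                               # each weight in [0, 1]
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},   # weights sum to 1
)
best_weights = result.x
print("weights:", best_weights.round(3), "loss:", round(log_loss(best_weights), 4))
```

With real data, `preds` would hold each model's validation probabilities stacked along the first axis, and `labels` the integer-encoded answers.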

Normally, OOF (out-of-fold) predictions are used to optimise model weights, but the training data used by the public models looks mixed, and most of the shared weights are for single models. Therefore, I'll use an evaluation dataset that appears not to have been used for training: the [MMLU-Dataset](https://www.kaggle.com/datasets/peiyuanliu2001/mmlu-dataset) shared by [Peiyuan Liu](https://www.kaggle.com/peiyuanliu2001). [See his discussion for details.](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/433168) Please note that this dataset contains more than just STEM questions, so it may not be ideal as an evaluation dataset.

edit: For some reason I was unable to submit with the MMLU dataset attached, so I've created a separate dataset.

edit: [Chris Deotte](https://www.kaggle.com/cdeotte) once again published an [amazing dataset](https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2) and notebooks. His [training code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) and his [inference code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2). This version also uses his trained weights.

### References and further reading

Related to weight optimization

* [Optimise Blending Weights with Bonus :0](https://www.kaggle.com/code/gogo827jz/optimise-blending-weights-with-bonus-0/notebook) by [Yirun Zhang](https://www.kaggle.com/gogo827jz)

Related to OpenBook and its tuning (there are too many to list, so this is only a partial selection)

* [OpenBook DeBERTaV3-Large Baseline (Single Model)](https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model) by [Anil Ozturk](https://www.kaggle.com/nlztrk)

* [[0.807] Sharing my trained-with-context model](https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model/notebook) by [MGöksu](https://www.kaggle.com/mgoksu)

Training and inference on the OpenBook dataset with context

* [How To Train Open Book Model - Part 1](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) by [Chris Deotte](https://www.kaggle.com/cdeotte)

* [How To Train Open Book Model - Part 2](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2) by [Chris Deotte](https://www.kaggle.com/cdeotte)

Voting ensemble (there are too many, so only the original is listed)

* [The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1)

### My other Notebooks

In this competition

* [Incorporate MAP@k metrics into HF Trainer](https://www.kaggle.com/code/itsuki9180/incorporate-map-k-metrics-into-hf-trainer)

* [Introducing Adversarial Weight Perturbation (AWP)](https://www.kaggle.com/code/itsuki9180/introducing-adversarial-weight-perturbation-awp)

* [Adversarial Weight Perturbation (AWP) Inference](https://www.kaggle.com/code/itsuki9180/adversarial-weight-perturbation-awp-inference)

* [Using DeepSpeed with HF🤗 Trainer](https://www.kaggle.com/code/itsuki9180/using-deepspeed-with-hf-trainer)

Related to weight optimization (almost the same as Yirun Zhang's)

* [G2Net_oof_weight_optimizer](https://www.kaggle.com/code/itsuki9180/g2net-oof-weight-optimizer)

# How To Train Model for Open Book Q&A Technique - Part 2
The notebook you are reading is a fork of Mgoksu's great notebook [here][1]. Mgoksu (@mgoksu) demonstrated how to achieve top public LB=0.807 using Open Book technique. The Open Book method was first presented by JJ (@jjinho) [here][2], then Quangteo (@quangbk) improved RAM usage [here][3], and Anil (@nlztrk) combined with Q&A [here][4]. Radek (@radek1) demonstrated the strength of Q&A [here][5].

In my previous notebook [here][6] (i.e. Part 1), we demonstrated how to train a model for Open Book. The model was trained using my 60k Kaggle dataset [here][7]. If you enjoy the notebook you are reading, please upvote the dataset too. Thanks!

In this notebook, we will load the trained model output from my previous notebook. We will run inference with this model after running the code from Mgoksu's public notebook, which uses Open Book to search Wikipedia for context. For each test sample in the hidden dataset, we will append Wikipedia context. Then our trained model will infer the multiple-choice answer (using both the question and the appended Wikipedia context). When predicting the answer, this notebook uses a 50%/50% ensemble of the new Q&A model we trained and Mgoksu's original model. Here is a diagram showing the Open Book method:

![](https://miro.medium.com/v2/resize:fit:800/format:webp/1*bTGY3fKIgNefQxNsOYpnBw.png)

(image source [here][8])
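
To make the 50%/50% ensemble concrete, here is a small sketch under stated assumptions: the two probability arrays are made up, `top3_predictions` is a hypothetical helper, and the space-separated top-3 letters are assumed to be the format the predictions are reported in.

```python
import numpy as np

OPTIONS = np.array(list("ABCDE"))


def top3_predictions(probs_a, probs_b, weight_a=0.5):
    """Blend two (n_samples, 5) probability arrays and return the top-3 letters per row."""
    blended = weight_a * probs_a + (1.0 - weight_a) * probs_b
    # Sort descending by blended probability and keep the three best options
    top3 = np.argsort(-blended, axis=1, kind="stable")[:, :3]
    return [" ".join(OPTIONS[row]) for row in top3]


# Invented probabilities for two questions from two models
probs_a = np.array([[0.60, 0.10, 0.15, 0.10, 0.05],
                    [0.10, 0.20, 0.30, 0.35, 0.05]])
probs_b = np.array([[0.20, 0.50, 0.15, 0.10, 0.05],
                    [0.10, 0.10, 0.20, 0.55, 0.05]])

predictions = top3_predictions(probs_a, probs_b)
print(predictions)
```

Changing `weight_a` away from 0.5 turns the plain average into a weighted blend.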

[1]: https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model
[2]: https://www.kaggle.com/code/jjinho/open-book-llm-science-exam
[3]: https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
[4]: https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model
[5]: https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training
[6]: https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model
[7]: https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2
[8]: https://blog.gopenai.com/enrich-llms-with-retrieval-augmented-generation-rag-17b82a96b6f0

In [1]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

Processing /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Processing ./sentence-transformers
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=126125 sha256=7bf3155e94a3cfdc04a53589892975f47aecdddf26e04ceef5fdfe6091e7e0fa
  Stored in directory: /root/.cache/pip/wheels/6c/ea/76/d9a930b223b1d3d5d6aff69458725316b0fe205b854faf1812
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2
Processing /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl
Installing collected packages: blingfire
Succe

In [2]:
from __future__ import annotations

# Standard library
import ctypes
import gc
import os
import re
import time
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

# Third-party
import blingfire as bf
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from faiss import write_index, read_index
from scipy.special import softmax
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Handle to the C standard library (Linux only)
libc = ctypes.CDLL("libc.so.6")



In [3]:
# NOTE: despite its name, `trn` holds the competition test set
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")
trn.head()

Unnamed: 0,id,prompt,A,B,C,D,E
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...


In [4]:
val = pd.read_csv('/kaggle/input/mmlu-dataset-valid-only/valid_mmlu_1526_ind0.csv').sample(1500)

val['E'] = ''  # dummy fifth option so the five-option preprocessing also works for this four-option dataset
val = val.replace(np.nan, '')  # np.nan instead of np.NaN, which was removed in NumPy 2.0

# make sure every option column is a string
for col in ['A', 'B', 'C', 'D', 'E']:
    val[col] = val[col].map(str)

val.head()

Unnamed: 0.1,Unnamed: 0,prompt,A,B,C,D,answer,E
230,230,Immediately after two separated charged partic...,the same,opposite,Either of these,Need more information,C,
671,671,For which of these two scenarios does the main...,"Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",B,
881,881,Which of the following is a correct statement?,Average total cost equals marginal cost plus a...,Average total cost equals marginal costs plus ...,Average total cost equals average fixed costs ...,Total fixed costs vary with output.,C,
1345,1345,If an empty room measures about 50 feet wide b...,1000,3000,6000,9000,A,
475,475,"""Behavior is personality"" best characterizes w...",psychodynamic,behavioral,biological,sociocultural,B,


In [5]:
val.rename(columns = {'Unnamed: 0' : 'id'}, inplace =True)
val.head()

Unnamed: 0,id,prompt,A,B,C,D,answer,E
230,230,Immediately after two separated charged partic...,the same,opposite,Either of these,Need more information,C,
671,671,For which of these two scenarios does the main...,"Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",B,
881,881,Which of the following is a correct statement?,Average total cost equals marginal cost plus a...,Average total cost equals marginal costs plus ...,Average total cost equals average fixed costs ...,Total fixed costs vary with output.,C,
1345,1345,If an empty room measures about 50 feet wide b...,1000,3000,6000,9000,A,
475,475,"""Behavior is personality"" best characterizes w...",psychodynamic,behavioral,biological,sociocultural,B,


In [6]:
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl


Processing /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
faiss-gpu is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [7]:
# Additional imports for retrieval; everything else was already imported in In [2]
import pickle
import unicodedata

import matplotlib.pyplot as plt
import transformers
from datasets import load_dataset, load_from_disk
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import LongformerTokenizer, LongformerForMultipleChoice

In [8]:
!cp -r /kaggle/input/stem-wiki-cohere-no-emb /kaggle/working
!cp -r /kaggle/input/all-paraphs-parsed-expanded /kaggle/working/

In [9]:
gc.collect()  # the original line lacked the parentheses, so no collection was triggered
torch.cuda.empty_cache()

In [10]:
def SplitList(mylist, chunk_size):
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]

def get_relevant_documents_parsed(df_valid):
    df_chunk_size=600
    paraphs_parsed_dataset = load_from_disk("/kaggle/working/all-paraphs-parsed-expanded")
    modified_texts = paraphs_parsed_dataset.map(lambda example:
                                             {'temp_text':
                                              f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         paraphs_parsed_dataset[idx.item()]["title"],
                         paraphs_parsed_dataset[idx.item()]["text"],
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles



def get_relevant_documents(df_valid):
    df_chunk_size=800
    
    cohere_dataset_filtered = load_from_disk("/kaggle/working/stem-wiki-cohere-no-emb")
    modified_texts = cohere_dataset_filtered.map(lambda example:
                                             {'temp_text':
                                              unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         cohere_dataset_filtered[idx.item()]["title"],
                         unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles



def retrieval(df_valid, modified_texts):
    
    corpus_df_valid = df_valid.apply(lambda row:
                                     f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["A"]}\n{row["B"]}\n{row["C"]}\n{row["D"]}\n{row["E"]}',
                                     axis=1).values
    vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words)
    vectorizer1.fit(corpus_df_valid)
    vocab_df_valid = vectorizer1.get_feature_names_out()
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words,
                                 vocabulary=vocab_df_valid)
    vectorizer.fit(modified_texts[:500000])
    corpus_tf_idf = vectorizer.transform(corpus_df_valid)
    
    print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names_out())}")

    chunk_size = 100000
    top_per_chunk = 10
    top_per_query = 10

    all_chunk_top_indices = []
    all_chunk_top_values = []

    for idx in tqdm(range(0, len(modified_texts), chunk_size)):
        wiki_vectors = vectorizer.transform(modified_texts[idx: idx+chunk_size])
        temp_scores = (corpus_tf_idf * wiki_vectors.T).toarray()
        chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
        chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]

        all_chunk_top_indices.append(chunk_top_indices + idx)
        all_chunk_top_values.append(chunk_top_values)

    top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
    top_values_array = np.concatenate(all_chunk_top_values, axis=1)
    
    merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
    merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
    articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
    
    return articles_indices, merged_top_scores


def prepare_answering_input(
        tokenizer, 
        question,  
        options,   
        context,   
        max_seq_length=2500,
    ):
    c_plus_q   = context + ' ' + tokenizer.bos_token + ' ' + question
    c_plus_q_4 = [c_plus_q] * len(options)
    tokenized_examples = tokenizer(
        c_plus_q_4, options,
        max_length=max_seq_length,
        padding="longest",
        truncation=False,
        return_tensors="pt",
    )
    input_ids = tokenized_examples['input_ids'].unsqueeze(0)
    attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
    example_encoded = {
        "input_ids": input_ids.to(model.device.index),
        "attention_mask": attention_mask.to(model.device.index),
    }
    return example_encoded
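
The chunked scoring in `retrieval` can be hard to follow, so here is a NumPy-only sketch of the same pattern on random scores: take the top `k` candidates inside each chunk with `argpartition`, shift the chunk-local indices back to global ones, then merge the candidates and keep the overall top `k`. The function name `chunked_top_k` and the random score matrix are illustrative and not part of the notebook.

```python
import numpy as np


def chunked_top_k(scores, chunk_size, k):
    """Return (indices, values) of the k largest scores per row, scanning column chunks."""
    n_docs = scores.shape[1]
    chunk_indices, chunk_values = [], []
    for start in range(0, n_docs, chunk_size):
        block = scores[:, start:start + chunk_size]
        kk = min(k, block.shape[1])
        # argpartition leaves the kk largest entries of each row in the last kk slots (unordered)
        idx = np.argpartition(block, -kk, axis=1)[:, -kk:]
        chunk_values.append(np.take_along_axis(block, idx, axis=1))
        chunk_indices.append(idx + start)  # shift chunk-local ids back to global ids
    cand_idx = np.concatenate(chunk_indices, axis=1)
    cand_val = np.concatenate(chunk_values, axis=1)
    # Ascending sort over all candidates: the best candidate ends up in the last column
    order = np.argsort(cand_val, axis=1)[:, -k:]
    return (np.take_along_axis(cand_idx, order, axis=1),
            np.take_along_axis(cand_val, order, axis=1))


rng = np.random.default_rng(0)
scores = rng.random((4, 1000))  # 4 queries scored against 1000 hypothetical documents
top_idx, top_val = chunked_top_k(scores, chunk_size=300, k=10)
print(top_idx.shape, top_val[:, -1])
```

Because the merge uses an ascending sort, the best match for each query sits in the last column, which is why the notebook later reads retrieved articles from index `-1` backwards.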

In [11]:
stop_words = ['each', 'you', 'the', 'use', 'used',
                  'where', 'themselves', 'nor', "it's", 'how', "don't", 'just', 'your',
                  'about', 'himself', 'with', "weren't", 'hers', "wouldn't", 'more', 'its', 'were',
                  'his', 'their', 'then', 'been', 'myself', 're', 'not',
                  'ours', 'will', 'needn', 'which', 'here', 'hadn', 'it', 'our', 'there', 'than',
                  'most', "couldn't", 'both', 'some', 'for', 'up', 'couldn', "that'll",
                  "she's", 'over', 'this', 'now', 'until', 'these', 'few', 'haven',
                  'of', 'wouldn', 'into', 'too', 'to', 'very', 'shan', 'before', 'the', 'they',
                  'between', "doesn't", 'are', 'was', 'out', 'we', 'me',
                  'after', 'has', "isn't", 'have', 'such', 'should', 'yourselves', 'or', 'during', 'herself',
                  'doing', 'in', "shouldn't", "won't", 'when', 'do', 'through', 'she',
                  'having', 'him', "haven't", 'against', 'itself', 'that',
                  'did', 'theirs', 'can', 'those',
                  'own', 'so', 'and', 'who', "you've", 'yourself', 'her', 'he', 'only',
                  'what', 'ourselves', 'again', 'had', "you'd", 'is', 'other',
                  'why', 'while', 'from', 'them', 'if', 'above', 'does', 'whom',
                  'yours', 'but', 'being', "wasn't", 'be']

In [12]:
retrieved_articles_parsed_train = get_relevant_documents_parsed(trn)
gc.collect()

retrieved_articles_train = get_relevant_documents(trn)
gc.collect()

Map (num_proc=2):   0%|          | 0/2101279 [00:00<?, ? examples/s]

  0%|          | 0/1 [00:00<?, ?it/s]



length of vectorizer vocab is 11222


  0%|          | 0/22 [00:00<?, ?it/s]

Map (num_proc=2):   0%|          | 0/2781652 [00:00<?, ? examples/s]

  0%|          | 0/1 [00:00<?, ?it/s]

length of vectorizer vocab is 11222


  0%|          | 0/28 [00:00<?, ?it/s]

37

In [13]:
model_dir = "/kaggle/input/llm-science-exam-context-v2-models/deberta-checkpoint-7600/checkpoint-7600"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

DebertaV2ForMultipleChoice(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_aff

In [14]:
val.head()

Unnamed: 0,id,prompt,A,B,C,D,answer,E
230,230,Immediately after two separated charged partic...,the same,opposite,Either of these,Need more information,C,
671,671,For which of these two scenarios does the main...,"Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",B,
881,881,Which of the following is a correct statement?,Average total cost equals marginal cost plus a...,Average total cost equals marginal costs plus ...,Average total cost equals average fixed costs ...,Total fixed costs vary with output.,C,
1345,1345,If an empty room measures about 50 feet wide b...,1000,3000,6000,9000,A,
475,475,"""Behavior is personality"" best characterizes w...",psychodynamic,behavioral,biological,sociocultural,B,


In [15]:
# Move 'answer' to the last column so that val has the same column order as trn
st = val['answer']
val = val.drop(columns = ['answer'])
val['answer'] = st
val.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
230,230,Immediately after two separated charged partic...,the same,opposite,Either of these,Need more information,,C
671,671,For which of these two scenarios does the main...,"Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",,B
881,881,Which of the following is a correct statement?,Average total cost equals marginal cost plus a...,Average total cost equals marginal costs plus ...,Average total cost equals average fixed costs ...,Total fixed costs vary with output.,,C
1345,1345,If an empty room measures about 50 feet wide b...,1000,3000,6000,9000,,A
475,475,"""Behavior is personality"" best characterizes w...",psychodynamic,behavioral,biological,sociocultural,,B


In [16]:
trn['answer'] = 'A'  # dummy label: test.csv has no answers, but preprocess() expects this column
trn.head()

Unnamed: 0,id,prompt,A,B,C,D,E,answer
0,0,Which of the following statements accurately d...,MOND is a theory that reduces the observed mis...,MOND is a theory that increases the discrepanc...,MOND is a theory that explains the missing bar...,MOND is a theory that reduces the discrepancy ...,MOND is a theory that eliminates the observed ...,A
1,1,Which of the following is an accurate definiti...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,Dynamic scaling refers to the non-evolution of...,Dynamic scaling refers to the evolution of sel...,A
2,2,Which of the following statements accurately d...,The triskeles symbol was reconstructed as a fe...,The triskeles symbol is a representation of th...,The triskeles symbol is a representation of a ...,The triskeles symbol represents three interloc...,The triskeles symbol is a representation of th...,A
3,3,What is the significance of regularization in ...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,Regularizing the mass-energy of an electron wi...,A
4,4,Which of the following statements accurately d...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,The angular spacing of features in the diffrac...,A


In [17]:
tokenizer = AutoTokenizer.from_pretrained(model_dir)

In [18]:
LL = ['A', 'B', 'C', 'D', 'E']
indices = list(range(5))

option_to_index = {option: index for option, index in zip(LL, indices)}
index_to_option = {index: option for option, index in zip(LL, indices)}

def preprocess(example):
    # Pair the (context + question) prompt with each of the five options
    first_sentence = [example['prompt']] * 5
    second_sentence = [example[o] for o in LL]

    # Truncate only the prompt side so that the answer options are never cut off
    tokenized_example = tokenizer(first_sentence, second_sentence, max_length=5000, truncation='only_first')
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

def tokenize_dataframe(df, tokenizer, retrieved_articles, retrieved_articles_parsed):
    submit_ids = []
    contexts_1 = []
    contexts_2 = []

    for index in tqdm(range(df.shape[0])):
        # relies on the column order id, prompt, A..E, answer
        submit_ids.append(df.iloc[index].values[0])
        # retrieval results are sorted by ascending score, so the last entries are the best matches
        context1 = f"{retrieved_articles[index][-4][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-1][2]}"
        context2 = f"{retrieved_articles_parsed[index][-3][2]}\n{retrieved_articles_parsed[index][-2][2]}\n{retrieved_articles_parsed[index][-1][2]}"
        # cap each context at 5000 characters
        contexts_1.append(context1[:5000])
        contexts_2.append(context2[:5000])

    df_contexts_1 = pd.DataFrame(list(zip(submit_ids, contexts_1)), columns=['id', 'context'])
    df_contexts_2 = pd.DataFrame(list(zip(submit_ids, contexts_2)), columns=['id', 'context'])

    df_merged_1 = pd.merge(df, df_contexts_1, on="id")
    df_merged_2 = pd.merge(df, df_contexts_2, on="id")
    
#     if data_type == "Train":
#         df_merged_1['answer'] = 'A'
#         df_merged_2['answer'] = 'B'

    df_merged_1 = df_merged_1.drop(columns=['id'])
    df_merged_2 = df_merged_2.drop(columns=['id'])

    df_merged_1['prompt'] = df_merged_1['context'] + "####" + df_merged_1['prompt']
    df_merged_2['prompt'] = df_merged_2['context'] + "####" + df_merged_2['prompt']

    df_merged_1 = df_merged_1.drop(columns=['context'])
    df_merged_2 = df_merged_2.drop(columns=['context'])
    
    tokenized_dataset_1 = Dataset.from_pandas(df_merged_1[['prompt', 'A', 'B', 'C', 'D', 'E', 'answer']]).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
    tokenized_dataset_2 = Dataset.from_pandas(df_merged_2[['prompt', 'A', 'B', 'C', 'D', 'E', 'answer']]).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
    df_contexts_1 = df_contexts_2 = df_merged_1 = df_merged_2 = contexts_1 = contexts_2 = submit_ids = None
    del df_contexts_1, df_contexts_2, df_merged_1, df_merged_2, contexts_1, contexts_2, submit_ids
    gc.collect()
    return tokenized_dataset_1, tokenized_dataset_2


data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
tokenized_test_dataset_1, tokenized_test_dataset_2 = tokenize_dataframe(trn, tokenizer, retrieved_articles_train,retrieved_articles_parsed_train )
test_dataloader_1 = DataLoader(tokenized_test_dataset_1, batch_size=1, shuffle=False, collate_fn=data_collator)
test_dataloader_2 = DataLoader(tokenized_test_dataset_2, batch_size=1, shuffle=False, collate_fn=data_collator)

  0%|          | 0/200 [00:00<?, ?it/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Map:   0%|          | 0/200 [00:00<?, ? examples/s]
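
For readers unfamiliar with the input layout of multiple-choice models, the following sketch mimics what `DataCollatorForMultipleChoice` does, using plain NumPy instead of the tokenizer: flatten the per-option sequences, pad them to a common length, and reshape to `(batch_size, num_choices, seq_len)`. The token ids are made up, `collate_multiple_choice` is a hypothetical helper, and `pad_id=0` is an assumption (it matches the `padding_idx=0` shown in the model printout).

```python
import numpy as np


def collate_multiple_choice(features, pad_id=0):
    """Pad every (prompt, option) sequence and stack to (batch, num_choices, seq_len)."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten to batch_size * num_choices sequences so they can be padded together
    flat = [seq for feature in features for seq in feature]
    max_len = max(len(seq) for seq in flat)

    input_ids = np.full((len(flat), max_len), pad_id, dtype=np.int64)
    attention_mask = np.zeros((len(flat), max_len), dtype=np.int64)
    for row, seq in enumerate(flat):
        input_ids[row, :len(seq)] = seq
        attention_mask[row, :len(seq)] = 1  # 1 marks real tokens, 0 marks padding

    return {
        "input_ids": input_ids.reshape(batch_size, num_choices, max_len),
        "attention_mask": attention_mask.reshape(batch_size, num_choices, max_len),
    }


# Two questions with two options each (invented token ids, fewer options for brevity)
features = [
    [[1, 5, 2], [1, 6, 7, 2]],
    [[1, 8, 2], [1, 9, 2]],
]
batch = collate_multiple_choice(features)
print(batch["input_ids"].shape)
```

The model then scores each option separately and returns one logit per option, giving a `(batch_size, num_choices)` array.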

# Inference

In [19]:
retrieved_articles_train = None
retrieved_articles_parsed_train = None
retrieved_articles_parsed_val = None
retrieved_articles_val = None
tokenized_val_dataset_1 = None
tokenized_val_dataset_2 = None
tokenized_test_dataset_1= None
tokenized_test_dataset_2 = None
trn = None

In [20]:
del retrieved_articles_train, retrieved_articles_parsed_train, retrieved_articles_parsed_val, retrieved_articles_val, tokenized_val_dataset_1, tokenized_val_dataset_2, tokenized_test_dataset_1, tokenized_test_dataset_2, trn, val 

In [21]:
gc.collect()

0

In [22]:

test_pred_1 = []
test_pred_2 = []
for batch in tqdm(test_dataloader_1):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_1.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_1 = torch.cat(test_pred_1)

# Run inference on the second context variant (test_dataloader_2)
for batch in tqdm(test_dataloader_2):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_2.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_2 = torch.cat(test_pred_2)

# Softmax and averaging
test_pred_1 = softmax(test_pred_1, axis=1).numpy()
test_pred_2 = softmax(test_pred_2, axis=1).numpy()
test_pred_a = (test_pred_1 + test_pred_2) / 2

  0%|          | 0/200 [00:00<?, ?it/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


  0%|          | 0/200 [00:00<?, ?it/s]

In [23]:
test_pred_1 = None
test_pred_2 = None
model = None
del model, test_pred_1, test_pred_2
gc.collect()
torch.cuda.empty_cache()

In [24]:
model_dir = "/kaggle/input/how-to-train-open-book-model-part-1/model_v2"
# tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

DebertaV2ForMultipleChoice(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_aff

In [25]:
test_pred_1 = []
test_pred_2 = []
for batch in tqdm(test_dataloader_1):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_1.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_1 = torch.cat(test_pred_1)

# Run inference on the second context variant (test_dataloader_2)
for batch in tqdm(test_dataloader_2):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_2.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_2 = torch.cat(test_pred_2)

# Softmax and averaging
test_pred_1 = softmax(test_pred_1, axis=1).numpy()
test_pred_2 = softmax(test_pred_2, axis=1).numpy()
test_pred_b = (test_pred_1 + test_pred_2) / 2

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

In [26]:
test_pred_1 = None
test_pred_2 = None
model = None
del test_pred_1, test_pred_2
del model
gc.collect()
torch.cuda.empty_cache()

In [27]:
model_dir = "/kaggle/input/checkpoint-5975-09025"
# tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

DebertaV2ForMultipleChoice(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_aff

In [28]:

test_pred_1 = []
test_pred_2 = []
for batch in tqdm(test_dataloader_1):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_1.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_1 = torch.cat(test_pred_1)

# Run inference on the second context variant (test_dataloader_2)
for batch in tqdm(test_dataloader_2):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_pred_2.append(outputs.logits.cpu().detach())
    outputs = None
    del outputs
    gc.collect()
test_pred_2 = torch.cat(test_pred_2)

# Softmax and averaging
test_pred_1 = softmax(test_pred_1, axis=1).numpy()
test_pred_2 = softmax(test_pred_2, axis=1).numpy()
test_pred_c = (test_pred_1 + test_pred_2) / 2
test_pred_1 = None
test_pred_2 = None
model = None
del model
del test_pred_1, test_pred_2
gc.collect()
torch.cuda.empty_cache()

  0%|          | 0/200 [00:00<?, ?it/s]

  0%|          | 0/200 [00:00<?, ?it/s]

# Apply weights and make submission

In [29]:
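# Weighted sum of the three models' probabilities.
# The three weights below are identical, so the resulting ranking matches a plain average.
predictions_overall = test_pred_a * 1.84540578e-17 + test_pred_b * 1.84540578e-17 + test_pred_c * 1.84540578e-17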
predictions_overall.shape

(200, 5)
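
The weighted sum above can also be written as a small helper. This is a minimal sketch, not the notebook's own code: the name `blend`, the example weights, and the synthetic arrays standing in for `test_pred_a`, `test_pred_b` and `test_pred_c` are all assumptions made for illustration. It assumes non-negative weights with a positive sum.

```python
import numpy as np


def blend(preds, weights):
    """Weighted average of per-model probability arrays of shape (n_rows, n_classes)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()      # rescale so the weights sum to 1
    stacked = np.stack(preds)              # shape: (n_models, n_rows, n_classes)
    return np.tensordot(weights, stacked, axes=1)


# Synthetic stand-ins for the three prediction arrays (hypothetical data).
rng = np.random.default_rng(0)
preds = [rng.dirichlet(np.ones(5), size=4) for _ in range(3)]

blended = blend(preds, [0.2, 0.5, 0.3])
top3 = np.argsort(-blended, axis=1)[:, :3]   # indices of the three most likely answers
```

Because the helper rescales the weights, multiplying all of them by the same positive constant leaves the blend, and therefore the top-3 ranking, unchanged.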

In [30]:
# Keep the indices of the three most probable answers for each question
predictions_overall = np.argsort(-predictions_overall, axis=1)[:, :3]
predictions_overall[:5]

array([[3, 4, 1],
       [0, 1, 3],
       [0, 3, 2],
       [2, 0, 1],
       [3, 2, 0]])

In [31]:
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_overall]
predictions_as_answer_letters[:3]

array([['D', 'E', 'B'],
       ['A', 'B', 'D'],
       ['A', 'D', 'C']], dtype='<U1')

In [32]:
test_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")

In [33]:
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
predictions_as_string[:3]

['D E B', 'A B D', 'A D C']

In [34]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)

pd.read_csv('submission.csv').head(10)

   id prediction
0   0      D E B
1   1      A B D
2   2      A D C
3   3      C A B
4   4      D C A
5   5      B C E
6   6      A D C
7   7      D B E
8   8      C B A
9   9      A B C


In conclusion, we were at least able to confirm that the openbook model (based on Ozturk's and Chris's work), which uses a different method from the other models and scores highly on its own, receives a higher weight.

Now it's your turn to blend. Try adding weights for your own models.
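
As a starting point, here is a minimal sketch of optimising blend weights with `scipy.optimize.minimize`. It is not necessarily the procedure used in this notebook: the log-loss objective, the SLSQP method, the helper names (`log_loss`, `map_at_3`, `optimise_weights`, `fake_model`) and the synthetic predictions are all assumptions made for illustration. MAP@3 is only reported at the end, because it is a step function of the weights and is awkward to optimise directly.

```python
import numpy as np
from scipy.optimize import minimize


def log_loss(labels, probs, eps=1e-12):
    """Mean negative log-likelihood of the true class."""
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return -np.mean(np.log(p))


def map_at_3(labels, probs):
    """MAP@3 when each row has exactly one correct answer."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    hits = top3 == labels[:, None]
    return np.mean((hits / np.array([1.0, 2.0, 3.0])).sum(axis=1))


def optimise_weights(preds, labels):
    """Find non-negative weights summing to 1 that minimise log loss."""
    stacked = np.stack(preds)              # (n_models, n_rows, n_classes)
    n_models = len(preds)
    result = minimize(
        lambda w: log_loss(labels, np.tensordot(w, stacked, axes=1)),
        x0=np.full(n_models, 1.0 / n_models),
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,
        constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0},
    )
    return result.x


# Synthetic evaluation set: one informative model and one pure-noise model.
rng = np.random.default_rng(0)
n_rows, n_classes = 300, 5
labels = rng.integers(0, n_classes, size=n_rows)
one_hot = np.eye(n_classes)[labels]


def fake_model(signal):
    """Hypothetical model: softmax over noisy logits boosted on the true class."""
    logits = signal * one_hot + rng.normal(size=(n_rows, n_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


preds = [fake_model(2.0), fake_model(0.0)]
weights = optimise_weights(preds, labels)
blended = np.tensordot(weights, np.stack(preds), axes=1)
print("weights:", weights, "MAP@3:", map_at_3(labels, blended))
```

To use it with real models, replace `preds` with each model's probability array on a held-out evaluation set and `labels` with the integer index of the correct answer.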

Also, running the notebook, especially inference for the openbook model, takes a long time, so it's a good idea to split the work into separate notebooks, one for calculating the weights and one for submitting, as in Yirun Zhang's base notebook.
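
One simple way to hand the weights from one notebook to the other is a small JSON file. The sketch below is only an illustration: the file name, the model keys, the weight values and the helpers `save_weights` and `load_weights` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_weights(path, weights_by_model):
    """Write a {model_name: weight} mapping to a JSON file."""
    Path(path).write_text(json.dumps(weights_by_model, indent=2))


def load_weights(path):
    """Read the {model_name: weight} mapping back."""
    return json.loads(Path(path).read_text())


# Weight-calculation notebook: store the optimised weights (hypothetical values).
weights_path = Path(tempfile.gettempdir()) / "ensemble_weights.json"
save_weights(weights_path, {"model_a": 0.2, "model_b": 0.5, "model_c": 0.3})

# Submission notebook: load them again before blending.
loaded = load_weights(weights_path)
```

On Kaggle, the file written by the first notebook could be attached to the second as a notebook output or dataset, so the slow optimisation step does not have to run at submission time.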

It is also important to switch to an evaluation dataset that is relevant to STEM. If a model's weight is unnaturally high, suspect a leak, and make sure the evaluation dataset was not used for training.
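
A quick first check for leakage is to look for evaluation prompts that also appear in the training data. The sketch below is a rough heuristic under stated assumptions: `normalise` and `find_overlap` are hypothetical helpers, the prompts are made-up examples, and only exact matches after lowercasing and whitespace cleanup are caught, so paraphrased duplicates would be missed.

```python
import re


def normalise(text):
    """Lowercase and collapse whitespace so trivial differences do not hide a match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_overlap(train_prompts, eval_prompts):
    """Return the evaluation prompts that also occur in the training prompts."""
    train_set = {normalise(p) for p in train_prompts}
    return [p for p in eval_prompts if normalise(p) in train_set]


# Made-up prompts for illustration only.
train_prompts = ["What is the speed of light?", "Which gas do plants absorb?"]
eval_prompts = ["what is the  speed of light?", "What is the boiling point of water?"]

leaked = find_overlap(train_prompts, eval_prompts)
print(f"{len(leaked)} of {len(eval_prompts)} evaluation prompts also appear in training")
```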

### Wishing you happy kaggling!