# LLM Science Exam Optimise Ensemble Weights 

Among the high-scoring notebooks in this competition, ensembles of multiple models stand out. Ensembles are empirically known to be very powerful in NLP competitions.

[The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1), and many notebooks have been published on that basis.

On the other hand, ensembles with predicted probabilities appear to be less used.
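To make the difference concrete, here is a minimal sketch with made-up probabilities for a single question. The `vote` helper is a simplified top-1 majority vote, not radek1's exact scheme, and all names and numbers are illustrative assumptions.

```python
from collections import Counter

OPTIONS = "ABCDE"

# Hypothetical per-model probabilities for one question (each row sums to 1).
model_probs = [
    [0.40, 0.35, 0.10, 0.10, 0.05],  # model 1 narrowly prefers A
    [0.40, 0.35, 0.10, 0.10, 0.05],  # model 2 narrowly prefers A
    [0.05, 0.85, 0.04, 0.03, 0.03],  # model 3 strongly prefers B
]


def vote(probs_per_model):
    """Majority vote over each model's top-1 answer (confidence is ignored)."""
    top1 = [OPTIONS[max(range(5), key=p.__getitem__)] for p in probs_per_model]
    return Counter(top1).most_common(1)[0][0]


def average(probs_per_model):
    """Average the probabilities across models, then pick the best option."""
    n = len(probs_per_model)
    mean = [sum(p[i] for p in probs_per_model) / n for i in range(5)]
    return OPTIONS[max(range(5), key=mean.__getitem__)], mean


voted = vote(model_probs)
averaged, mean_probs = average(model_probs)
print(voted, averaged)
```

Voting only counts each model's first choice, whereas averaging lets a confident model outweigh two hesitant ones, so the two schemes can disagree on the same predictions.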

This notebook introduces ensembles using probabilities and shows how to optimise model weights with **scipy.optimize**.
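As a rough illustration of the idea, the sketch below fits blending weights on synthetic predictions with `scipy.optimize.minimize`. The data, the log-loss objective, and the SLSQP setup (weights bounded in [0, 1] and summing to 1) are assumptions made for this example; the objective used later in this notebook may differ.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n_questions, n_options, n_models = 200, 5, 3

# Synthetic stand-ins for held-out predictions: labels plus noisy per-model probabilities.
labels = rng.integers(0, n_options, size=n_questions)
one_hot = np.eye(n_options)[labels]
noise_levels = [0.5, 1.0, 3.0]  # hypothetical: model 0 is the most accurate
preds = []
for noise in noise_levels:
    logits = 2.0 * one_hot + noise * rng.normal(size=(n_questions, n_options))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    preds.append(exp / exp.sum(axis=1, keepdims=True))
preds = np.stack(preds)  # shape: (n_models, n_questions, n_options)


def log_loss(weights):
    """Multiclass log loss of the weighted average of the model probabilities."""
    blend = np.tensordot(weights, preds, axes=1)  # (n_questions, n_options)
    picked = blend[np.arange(n_questions), labels]
    return -np.mean(np.log(np.clip(picked, 1e-12, 1.0)))


# Start from equal weights and keep them non-negative and summing to 1.
start = np.full(n_models, 1.0 / n_models)
result = minimize(
    log_loss,
    start,
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},
)
best_weights = result.x
print(best_weights, log_loss(start), log_loss(best_weights))
```

Log loss is used here because it is smooth; a rank-based metric could be swapped in, but it is piecewise constant in the weights and would probably need a derivative-free method.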

Normally, out-of-fold (OOF) predictions are used to optimise model weights. However, the training data behind the public models appears to be mixed, and most of the published weights are single models rather than per-fold models. Therefore, I use an evaluation dataset that appears not to have been used for training: the [MMLU-Dataset](https://www.kaggle.com/datasets/peiyuanliu2001/mmlu-dataset) shared by [Peiyuan Liu](https://www.kaggle.com/peiyuanliu2001). [See his discussion for details.](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/433168) Please note that this dataset contains more than just STEM questions, so it may not be ideal as an evaluation dataset.
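Whichever held-out set is used, a blend can be scored on it with the competition metric, MAP@3. The following is a minimal sketch; the helper names and the three example questions are hypothetical.

```python
def map_at_3(ranked_predictions, answers):
    """Mean average precision at 3 for questions with exactly one correct answer."""
    total = 0.0
    for ranked, answer in zip(ranked_predictions, answers):
        for rank, option in enumerate(ranked[:3], start=1):
            if option == answer:
                total += 1.0 / rank  # credit 1, 1/2 or 1/3 depending on the rank
                break
    return total / len(answers)


def top3(probabilities, options="ABCDE"):
    """Return the three options with the highest probability, best first."""
    order = sorted(range(len(options)), key=lambda i: probabilities[i], reverse=True)
    return [options[i] for i in order[:3]]


# Hypothetical blended probabilities for three held-out questions.
blended = [
    [0.60, 0.20, 0.10, 0.05, 0.05],  # top-3: A B C
    [0.10, 0.50, 0.30, 0.05, 0.05],  # top-3: B C A
    [0.05, 0.05, 0.10, 0.30, 0.50],  # top-3: E D C
]
answers = ["A", "C", "A"]

score = map_at_3([top3(p) for p in blended], answers)
print(score)
```

The first question scores 1, the second 1/2 (correct answer ranked second), and the third 0, which gives a mean of 0.5.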

edit: For some reason I was unable to submit with the MMLU dataset attached, so I've created a separate dataset.

edit: [Chris Deotte](https://www.kaggle.com/cdeotte) once again published an [amazing dataset](https://www.kaggle.com/datasets/cdeotte/60k-data-with-context-v2) and notebooks. His [training code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) and his [inference code is here](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2). This version also uses his trained weights.

### References and further reading

Weight optimization related 

* [Optimise Blending Weights with Bonus :0](https://www.kaggle.com/code/gogo827jz/optimise-blending-weights-with-bonus-0/notebook) by [Yirun Zhang](https://www.kaggle.com/gogo827jz)

OpenBook and its tuning (too many to list, so only a selection)

* [OpenBook DeBERTaV3-Large Baseline (Single Model)](https://www.kaggle.com/code/nlztrk/openbook-debertav3-large-baseline-single-model) by [Anil Ozturk](https://www.kaggle.com/nlztrk)

* [[0.807] Sharing my trained-with-context model](https://www.kaggle.com/code/mgoksu/0-807-sharing-my-trained-with-context-model/notebook) by [MGöksu](https://www.kaggle.com/mgoksu)

Training and inference on the OpenBook dataset with context

* [How To Train Open Book Model - Part 1](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-1) by [Chris Deotte](https://www.kaggle.com/cdeotte)

* [How To Train Open Book Model - Part 2](https://www.kaggle.com/code/cdeotte/how-to-train-open-book-model-part-2) by [Chris Deotte](https://www.kaggle.com/cdeotte)

Voting ensemble (too many to list, so only the original)

* [The voting ensemble was introduced](https://www.kaggle.com/code/radek1/an-introduction-to-voting-ensemble) by [radek1](https://www.kaggle.com/radek1)

### My other Notebooks

In this competition

* [Incorporate MAP@k metrics into HF Trainer](https://www.kaggle.com/code/itsuki9180/incorporate-map-k-metrics-into-hf-trainer)

* [Introducing Adversarial Weight Perturbation (AWP)](https://www.kaggle.com/code/itsuki9180/introducing-adversarial-weight-perturbation-awp)

* [Adversarial Weight Perturbation (AWP) Inference](https://www.kaggle.com/code/itsuki9180/adversarial-weight-perturbation-awp-inference)

* [Using DeepSpeed with HF🤗 Trainer](https://www.kaggle.com/code/itsuki9180/using-deepspeed-with-hf-trainer)

Weight optimization related (almost the same as Yirun Zhang's)

* [G2Net_oof_weight_optimizer](https://www.kaggle.com/code/itsuki9180/g2net-oof-weight-optimizer)

In [1]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

Processing /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Processing ./sentence-transformers
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=126125 sha256=af71f334818656184faba28410e915a145c26916afd175d397b364605e34180b
  Stored in directory: /root/.cache/pip/wheels/6c/ea/76/d9a930b223b1d3d5d6aff69458725316b0fe205b854faf1812
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2
Processing /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl
Installing collected packages: blingfire

In [2]:
# The __future__ import has to come first if this cell is ever run as a plain script
from __future__ import annotations

import ctypes
import gc
import os, time
import re
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

import blingfire as bf
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from faiss import write_index, read_index
from scipy.special import softmax
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Handle to glibc so that malloc_trim() can hand freed memory back to the OS
libc = ctypes.CDLL("libc.so.6")



In [3]:
SIM_MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
DEVICE = 0
MAX_LENGTH = 384
BATCH_SIZE = 32

trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop("id", axis=1)

DEBUG = False
# DEBUG = len(trn) == 200  # Uncomment to save GPU quota while the 200-row public test set is loaded; the weights computed when saving the notebook will then be inaccurate
FILTER_LEN = 1 if DEBUG else 10
IND_SEARCH = 1 if DEBUG else 7
NUM_SENTENCES_INCLUDE = 1 if DEBUG else 22
CONTEXT_LEN = 1000 if DEBUG else 2305
IS_TEST_SET = len(trn) != 200

WIKI_PATH = "/kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

In [4]:
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = FILTER_LEN,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    
    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values, 
                        df.document_id.values,
                        df.offset.values, 
                        filter_len, 
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:

    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = FILTER_LEN,
               disable_progress_bar: bool = False) -> pd.DataFrame:

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that blingfire cannot split
            continue
    return pd.DataFrame(document_sentences)

In [5]:
trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop("id", axis=1)
trn.head()

| | prompt | A | B | C | D | E |
|---|---|---|---|---|---|---|
| 0 | Which of the following statements accurately d... | MOND is a theory that reduces the observed mis... | MOND is a theory that increases the discrepanc... | MOND is a theory that explains the missing bar... | MOND is a theory that reduces the discrepancy ... | MOND is a theory that eliminates the observed ... |
| 1 | Which of the following is an accurate definiti... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... | Dynamic scaling refers to the non-evolution of... | Dynamic scaling refers to the evolution of sel... |
| 2 | Which of the following statements accurately d... | The triskeles symbol was reconstructed as a fe... | The triskeles symbol is a representation of th... | The triskeles symbol is a representation of a ... | The triskeles symbol represents three interloc... | The triskeles symbol is a representation of th... |
| 3 | What is the significance of regularization in ... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... | Regularizing the mass-energy of an electron wi... |
| 4 | Which of the following statements accurately d... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... | The angular spacing of features in the diffrac... |


In [6]:
if IS_TEST_SET:
    model = SentenceTransformer(SIM_MODEL, device='cuda')
    model.max_seq_length = MAX_LENGTH
    model = model.half()

    sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

    prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
    prompt_embeddings = prompt_embeddings.detach().cpu().numpy()

    _ = gc.collect()

    ## Get the top IND_SEARCH pages that are likely to contain the topic of interest
    search_score, search_index = sentence_index.search(prompt_embeddings, IND_SEARCH)

    ## Save memory - delete sentence_index since it is no longer necessary
    del sentence_index
    del prompt_embeddings
    _ = gc.collect()
    libc.malloc_trim(0)

# [86.2] with only 270K articles!

In [7]:
!cp /kaggle/input/datasets-wheel/datasets-2.14.4-py3-none-any.whl /kaggle/working
!pip install  /kaggle/working/datasets-2.14.4-py3-none-any.whl
!cp /kaggle/input/backup-806/util_openbook.py .

Processing ./datasets-2.14.4-py3-none-any.whl
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.3
    Uninstalling datasets-2.14.3:
      Successfully uninstalled datasets-2.14.3
Successfully installed datasets-2.14.4
cp: cannot stat '/kaggle/input/backup-806/util_openbook.py': No such file or directory


In [8]:
import ctypes
import gc
import os
import re
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

import blingfire as bf
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from faiss import write_index, read_index
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoTokenizer
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

# Handle to glibc so that malloc_trim() can hand freed memory back to the OS
libc = ctypes.CDLL("libc.so.6")


def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = 3,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Main helper function to split documents into sections and, optionally, sentences.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param split_sentences: Flag to determine whether to further split sections into sentences
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values,
                        df.document_id.values,
                        df.offset.values,
                        filter_len,
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Treats each document as a single section spanning the whole text and
    records its start and end character offsets.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = 3,
               disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Split a document into sentences. Can be used with `sectionize_documents`
    to further split documents into more manageable pieces. Takes in offsets
    to ensure that after splitting, the sentences can be matched to the
    location in the original documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param offsets: Iterable tuple of the start and end indices
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents),
                                              disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1] - o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0] + offset[0], o[1] + offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that blingfire cannot split
            continue
    return pd.DataFrame(document_sentences)


def get_contexts():
    SIM_MODEL = '/kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
    DEVICE = 0
    MAX_LENGTH = 384
    BATCH_SIZE = 16

    WIKI_PATH = "/kaggle/input/wikipedia-20230701"
    wiki_files = os.listdir(WIKI_PATH)

    trn = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv").drop("id", axis=1)

    model = SentenceTransformer(SIM_MODEL, device='cuda')
    model.max_seq_length = MAX_LENGTH
    model = model.half()

    sentence_index = read_index("/kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")

    # prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
    prompt_embeddings = model.encode(
        trn.apply(lambda row: f"{row['prompt']}\n{row['A']}\n{row['B']}\n{row['C']}\n{row['D']}\n{row['E']}",
                  axis=1).values,
        batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)

    prompt_embeddings = prompt_embeddings.detach().cpu().numpy()
    _ = gc.collect()

    # Get the top 20 pages that are likely to contain the topic of interest
    search_score, search_index = sentence_index.search(prompt_embeddings, 20)

    # Save memory - delete sentence_index since it is no longer necessary
    del sentence_index
    del prompt_embeddings
    _ = gc.collect()
    libc.malloc_trim(0)

    df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet",
                         columns=['id', 'file'])

    # Get the article and associated file location using the index
    wikipedia_file_data = []

    for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
        scr_idx = idx
        _df = df.loc[scr_idx].copy()
        _df['prompt_id'] = i
        wikipedia_file_data.append(_df)
    wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
    wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(
        ['file', 'id']).reset_index(drop=True)

    # Save memory - delete df since it is no longer necessary
    del df
    _ = gc.collect()
    libc.malloc_trim(0)

    # Get the full text data
    wiki_text_data = []

    for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
        _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file'] == file]['id'].tolist()]
        _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text', 'title'])

        _df_temp = _df[_df['id'].isin(_id)].copy()
        del _df
        _ = gc.collect()
        libc.malloc_trim(0)
        wiki_text_data.append(_df_temp)
    wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)
    _ = gc.collect()

    # Parse documents into sentences
    processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

    # Get embeddings of the wiki text data
    wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                        batch_size=BATCH_SIZE,
                                        device=DEVICE,
                                        show_progress_bar=True,
                                        convert_to_tensor=True,
                                        normalize_embeddings=True)  # .half()
    wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

    _ = gc.collect()

    # Combine all answers
    trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)

    # Search using the prompt and answers to guide the search
    trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

    question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE,
                                       show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
    question_embeddings = question_embeddings.detach().cpu().numpy()

    # Parameter to determine how many relevant sentences to include
    NUM_SENTENCES_INCLUDE = 6

    # List containing just Context
    contexts = []

    for r in tqdm(trn.itertuples(), total=len(trn)):

        prompt_id = r.Index

        prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(
            wikipedia_file_data[wikipedia_file_data['prompt_id'] == prompt_id]['id'].values)].index.values

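        # Reset the context for every prompt so that a prompt with no retrieved
        # sentences does not silently reuse the previous prompt's context
        context = ""

        if prompt_indices.shape[0] > 0:
            prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
            prompt_index.add(wiki_data_embeddings[prompt_indices])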

            # Get the top matches
            ss, ii = prompt_index.search(question_embeddings, NUM_SENTENCES_INCLUDE)
            for _s, _i in zip(ss[prompt_id], ii[prompt_id]):
                context += processed_wiki_text_data.loc[prompt_indices]['text'].iloc[_i] + " "
        contexts.append(context)

    trn['context'] = contexts

    trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)


@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch


def generate_openbook_output():
    test_df = pd.read_csv("test_context.csv")
    test_df.index = list(range(len(test_df)))
    test_df['id'] = list(range(len(test_df)))
    test_df["prompt"] = test_df["context"].apply(lambda x: x[:1750]) + " #### " + test_df["prompt"]
    test_df['answer'] = 'A'
    model_dir = "/kaggle/input/llm-science-run-context-2"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
    model.eval()

    # We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
    options = 'ABCDE'
    indices = list(range(5))

    option_to_index = {option: index for option, index in zip(options, indices)}
    index_to_option = {index: option for option, index in zip(options, indices)}

    def preprocess(example):
        # The AutoModelForMultipleChoice class expects a set of question/answer pairs
        # so we'll copy our question 5 times before tokenizing
        first_sentence = [example['prompt']] * 5
        second_sentence = []
        for option in options:
            second_sentence.append(example[option])
        # Our tokenizer will turn our text into token IDs BERT can understand
        tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
        tokenized_example['label'] = option_to_index[example['answer']]
        return tokenized_example

    tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
    tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
    test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

    test_predictions = []
    i = 0
    for batch in test_dataloader:
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        with torch.no_grad():
            outputs = model(**batch)
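        # Softmax over the five options turns the logits into probabilities
        logits = outputs.logits.detach().cpu()
        preds = torch.softmax(logits, dim=-1)
        if i < 10:
            # Print the first few batches as a sanity check
            print(logits.numpy(), preds)
            i += 1
        test_predictions.append(torch.squeeze(preds).tolist())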
    

    prob_labels = ['A_prob', 'B_prob', 'C_prob', 'D_prob', 'E_prob']
    print(np.array(test_predictions).shape)
    df_prob = pd.DataFrame(np.array(test_predictions), index=test_df.index, columns=prob_labels)
    df_prob.to_csv('preds_backup.csv', index=False)

#     predictions_as_ids = np.argsort(-test_predictions, 1)

#     predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_as_ids]
#     # predictions_as_answer_letters[:3]

#     predictions_as_string = test_df['prediction'] = [
#         ' '.join(row) for row in predictions_as_answer_letters[:, :3]
#     ]

#     submission = test_df[['id', 'prediction']]
#     submission.to_csv('submission_backup.csv', index=False)

In [9]:
# import pickle

# get_contexts()
# #generate_openbook_output()

# import gc
# gc.collect()

In [10]:
import pandas as pd
#backup_model_predictions = pd.read_csv("preds_backup.csv")
#backup_model_predictions

In [11]:
import os
import pickle
import unicodedata

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import load_dataset, load_from_disk
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from transformers import LongformerTokenizer, LongformerForMultipleChoice

In [12]:
!cp -r /kaggle/input/stem-wiki-cohere-no-emb /kaggle/working
!cp -r /kaggle/input/all-paraphs-parsed-expanded /kaggle/working/

In [13]:
def SplitList(mylist, chunk_size):
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]

def get_relevant_documents_parsed(df_valid):
    df_chunk_size=600
    paraphs_parsed_dataset = load_from_disk("/kaggle/working/all-paraphs-parsed-expanded")
    modified_texts = paraphs_parsed_dataset.map(lambda example:
                                             {'temp_text':
                                              f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         paraphs_parsed_dataset[idx.item()]["title"],
                         paraphs_parsed_dataset[idx.item()]["text"],
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles_val = SplitList(articles_flatten, top_per_query)
    
    return retrieved_articles_val



def get_relevant_documents(df_valid):
    df_chunk_size=800
    
    cohere_dataset_filtered = load_from_disk("/kaggle/working/stem-wiki-cohere-no-emb")
    modified_texts = cohere_dataset_filtered.map(lambda example:
                                             {'temp_text':
                                              unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         cohere_dataset_filtered[idx.item()]["title"],
                         unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles_val = SplitList(articles_flatten, top_per_query)
    
   
    return retrieved_articles_val



def retrieval(df_valid, modified_texts):
    options = [x for x in df_valid.columns if x in {'A', 'B', 'C', 'D', 'E'}]
    corpus_df_valid = df_valid.apply(lambda row:
                                     f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n' + '\n'.join([row[ops] for ops in options]),
                                     axis=1).values
    vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words)
    vectorizer1.fit(corpus_df_valid)
    vocab_df_valid = vectorizer1.get_feature_names_out()
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words,
                                 vocabulary=vocab_df_valid)
    vectorizer.fit(modified_texts[:500000])
    corpus_tf_idf = vectorizer.transform(corpus_df_valid)
    
    print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names_out())}")

    chunk_size = 100000
    top_per_chunk = 10
    top_per_query = 10

    all_chunk_top_indices = []
    all_chunk_top_values = []

    for idx in tqdm(range(0, len(modified_texts), chunk_size)):
        wiki_vectors = vectorizer.transform(modified_texts[idx: idx+chunk_size])
        temp_scores = (corpus_tf_idf * wiki_vectors.T).toarray()
        chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
        chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]

        all_chunk_top_indices.append(chunk_top_indices + idx)
        all_chunk_top_values.append(chunk_top_values)

    top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
    top_values_array = np.concatenate(all_chunk_top_values, axis=1)
    
    merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
    merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
    articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
    
    return articles_indices, merged_top_scores


def prepare_answering_input(
        tokenizer, 
        question,  
        options,   
        context,   
        max_seq_length=4096,
    ):
    c_plus_q   = context + ' ' + tokenizer.bos_token + ' ' + question
    c_plus_q_4 = [c_plus_q] * len(options)
    tokenized_examples = tokenizer(
        c_plus_q_4, options,
        max_length=max_seq_length,
        padding="longest",
        truncation=False,
        return_tensors="pt",
    )
    input_ids = tokenized_examples['input_ids'].unsqueeze(0)
    attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
    example_encoded = {
        "input_ids": input_ids.to(model.device.index),
        "attention_mask": attention_mask.to(model.device.index),
    }
    return example_encoded

In [14]:
stop_words = ['each', 'you', 'the', 'use', 'used',
                  'where', 'themselves', 'nor', "it's", 'how', "don't", 'just', 'your',
                  'about', 'himself', 'with', "weren't", 'hers', "wouldn't", 'more', 'its', 'were',
                  'his', 'their', 'then', 'been', 'myself', 're', 'not',
                  'ours', 'will', 'needn', 'which', 'here', 'hadn', 'it', 'our', 'there', 'than',
                  'most', "couldn't", 'both', 'some', 'for', 'up', 'couldn', "that'll",
                  "she's", 'over', 'this', 'now', 'until', 'these', 'few', 'haven',
                  'of', 'wouldn', 'into', 'too', 'to', 'very', 'shan', 'before', 'the', 'they',
                  'between', "doesn't", 'are', 'was', 'out', 'we', 'me',
                  'after', 'has', "isn't", 'have', 'such', 'should', 'yourselves', 'or', 'during', 'herself',
                  'doing', 'in', "shouldn't", "won't", 'when', 'do', 'through', 'she',
                  'having', 'him', "haven't", 'against', 'itself', 'that',
                  'did', 'theirs', 'can', 'those',
                  'own', 'so', 'and', 'who', "you've", 'yourself', 'her', 'he', 'only',
                  'what', 'ourselves', 'again', 'had', "you'd", 'is', 'other',
                  'why', 'while', 'from', 'them', 'if', 'above', 'does', 'whom',
                  'yours', 'but', 'being', "wasn't", 'be']

In [15]:
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set

In [16]:
if IS_TEST_SET:
    test_retrieved_articles_parsed = get_relevant_documents_parsed(test_df)
    gc.collect()

    test_retrieved_articles = get_relevant_documents(test_df)
    gc.collect()


In [17]:
if IS_TEST_SET:
    tokenizer = LongformerTokenizer.from_pretrained("/kaggle/input/longformer-race-model/longformer_qa_model")
    model = LongformerForMultipleChoice.from_pretrained("/kaggle/input/longformer-race-model/longformer_qa_model").cuda()

In [18]:
def longformer_predict(df, retrieved_articles, retrieved_articles_parsed):
    probabilities = []
    for index in tqdm(range(df.shape[0])):
        columns = df.iloc[index].values
        question = columns[1]
        options = [columns[2], columns[3], columns[4], columns[5], columns[6]]  # answer options A-E
        context1 = f"{retrieved_articles[index][-4][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-1][2]}"
        context2 = f"{retrieved_articles_parsed[index][-3][2]}\n{retrieved_articles_parsed[index][-2][2]}\n{retrieved_articles_parsed[index][-1][2]}"
        inputs1 = prepare_answering_input(
            tokenizer=tokenizer, question=question,
            options=options, context=context1,
            )
        inputs2 = prepare_answering_input(
            tokenizer=tokenizer, question=question,
            options=options, context=context2,
            )

        with torch.no_grad():
            outputs1 = model(**inputs1)    
            losses1 = -outputs1.logits[0].detach().cpu().numpy()
            probability1 = torch.softmax(torch.tensor(-losses1), dim=-1)

        with torch.no_grad():
            outputs2 = model(**inputs2)
            losses2 = -outputs2.logits[0].detach().cpu().numpy()
            probability2 = torch.softmax(torch.tensor(-losses2), dim=-1)

        probability_ = (probability1 + probability2)/2
     #   if probability_.max() < 0.4:
      #      probability_ = list(backup_model_predictions.iloc[index, :5])
      #      probabilities.append(probability_)
      #  else:
        probabilities.append(probability_.tolist())
    return np.array(probabilities)   #torch.cat(probabilities)


In [19]:
if IS_TEST_SET:
    longtrans_preds = longformer_predict(test_df, test_retrieved_articles, test_retrieved_articles_parsed)

In [20]:
if IS_TEST_SET:
    del test_retrieved_articles_parsed, test_retrieved_articles

# Getting Sentences from the Relevant Titles

In [21]:
if IS_TEST_SET:
    model = SentenceTransformer(SIM_MODEL, device='cuda')
    model.max_seq_length = MAX_LENGTH
    model = model.half()

    df = pd.read_parquet("/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet", columns=['id', 'file'])

    ## Get the article and associated file location using the index
    wikipedia_file_data = []

    for i, (scr, idx) in tqdm(enumerate(zip(search_score, search_index)), total=len(search_score)):
        scr_idx = idx
        _df = df.loc[scr_idx].copy()
        _df['prompt_id'] = i
        wikipedia_file_data.append(_df)
    wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
    wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)


    del df
    _ = gc.collect()
    libc.malloc_trim(0)

    ## Get the full text data
    wiki_text_data = []

    for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
        _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file']==file]['id'].tolist()]
        _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

        _df_temp = _df[_df['id'].isin(_id)].copy()
        del _df
        _ = gc.collect()
        libc.malloc_trim(0)
        wiki_text_data.append(_df_temp)
    wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)



    _ = gc.collect()

    processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)


    wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                        batch_size=BATCH_SIZE,
                                        device=DEVICE,
                                        show_progress_bar=True,
                                        convert_to_tensor=True,
                                        normalize_embeddings=True)#.half()
    wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

    trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)
    trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']

    question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
    question_embeddings = question_embeddings.detach().cpu().numpy()

# Extracting Matching Prompt-Sentence Pairs

In [22]:
if IS_TEST_SET:
    contexts = []

    for r in tqdm(trn.itertuples(), total=len(trn)):

        prompt_id = r.Index

        prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values

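        # Start every prompt with an empty context so that a prompt without any
        # matching sentences does not reuse the context of the previous prompt.
        context = ""

        if prompt_indices.shape[0] > 0:
            prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
            prompt_index.add(wiki_data_embeddings[prompt_indices])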

            ## Get the top matches
            ss, ii = prompt_index.search(question_embeddings, NUM_SENTENCES_INCLUDE)
            for _s, _i in zip(ss[prompt_id], ii[prompt_id]):
                context += processed_wiki_text_data.loc[prompt_indices]['text'].iloc[_i] + " "

        contexts.append(context)
    trn['context'] = contexts
    trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)


# Inference

In [23]:
if IS_TEST_SET:
    test_df = pd.read_csv("test_context.csv")
    test_df.index = list(range(len(test_df)))
    test_df['id'] = list(range(len(test_df)))
    test_df["prompt"] = test_df["context"].apply(lambda x: x[:CONTEXT_LEN]) + " #### " +  test_df["prompt"]
    test_df['answer'] = 'A'

In [24]:
if IS_TEST_SET:
    model_dir = "/kaggle/input/llm-science-run-context-2"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
    model.eval()

In [25]:

options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
  
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation='only_first')
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

In [26]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch

In [27]:
if IS_TEST_SET:
    tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
    tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
    test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

In [28]:
if IS_TEST_SET:
    test_predictions = []

    for batch in tqdm(test_dataloader):
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        with torch.no_grad():
            outputs = model(**batch)
        test_predictions.append(outputs.logits.cpu().detach())

    test_predictions = torch.cat(test_predictions)
    test_predictions = softmax(test_predictions, axis=1).numpy()
    ob_preds = test_predictions
    del test_predictions

In [29]:
if IS_TEST_SET:
    # model_dir = "/kaggle/input/how-to-train-open-book-model-part-1/model_v2"
    model_dir = "/kaggle/input/checkpoint-5975-09025"
    # model_dir = "/kaggle/input/checkpoint-7100-09108"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
    model.eval()
    
    test_predictionsc = []

    for batch in tqdm(test_dataloader):
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        with torch.no_grad():
            outputs = model(**batch)
        test_predictionsc.append(outputs.logits.cpu().detach())

    test_predictionsc = torch.cat(test_predictionsc)
    test_predictionsc = softmax(test_predictionsc, axis=1).numpy()    
    gc.collect()

In [30]:
if IS_TEST_SET:
    model_dir = "/kaggle/input/using-deepspeed-with-hf-trainer/checkpoints_1"
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
    model.eval()

    test_predictionsi = []

    for batch in tqdm(test_dataloader):
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        with torch.no_grad():
            outputs = model(**batch)
        test_predictionsi.append(outputs.logits.cpu().detach())

    test_predictionsi = torch.cat(test_predictionsi)
    test_predictionsi = softmax(test_predictionsi, axis=1).numpy()

#### To increase diversity, we also use some models that do not use the openbook approach.

In [31]:
from typing import Optional, Union
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel
from torch.utils.data import DataLoader
deberta_v3_large = '/kaggle/input/deberta-v3-large-hf-weights'
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [32]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}

def preprocess(example):
    first_sentence = [example['prompt']] * 5
    second_sentences = [example[option] for option in 'ABCDE']
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    tokenized_example['label'] = option_to_index[example['answer']]
    
    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch 

In [33]:
if IS_TEST_SET:
    tokenizer = AutoTokenizer.from_pretrained(deberta_v3_large)

    test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
    test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set


    tokenized_test_dataset = Dataset.from_pandas(test_df.drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
    test_dataloader = DataLoader(tokenized_test_dataset, 1, shuffle=False, collate_fn=data_collator, num_workers=0, pin_memory=True,)

In [34]:
if IS_TEST_SET:
    model = AutoModelForMultipleChoice.from_pretrained('/kaggle/input/2023kagglellm-deberta-v3-large-model1').cuda()
    model.eval()

    preds = []

    for batch in tqdm(test_dataloader, total=len(test_dataloader)):
        for k in batch.keys():
            batch[k] = batch[k].cuda()
        with torch.no_grad():
            outputs = model(**batch)
        preds.append(outputs.logits.cpu().detach())

    hyc_preds = torch.cat(preds)

    del model
    torch.cuda.empty_cache()

    hyc_preds = softmax(hyc_preds, axis=1).numpy()

    gc.collect()

In [35]:
gc.collect()

163

In [36]:
import os, glob
from typing import Optional, Union
import pandas as pd
import numpy as np
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer, AutoConfig
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel

In [37]:
MODEL_DIR = '/kaggle/input/llm-kaggle-awp'
CONF_PATH = MODEL_DIR + '/deberta-v3-large_config.pth'
MODEL_PATH = MODEL_DIR + '/best_model_public.pt'

In [38]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cuda')

In [39]:
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set
test_df = test_df.replace(np.nan, '')  # np.nan instead of np.NaN, which was removed in NumPy 2.0

In [40]:
if IS_TEST_SET:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR+'/tokenizer')
    tokenizer

In [41]:
class LlmseDataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.df = df
        self.a2i = {alp: idx for idx, alp in enumerate('ABCDE')}
        self.i2a = {v: k for k,v in self.a2i.items()}
        self.perm_dict = {0: [1,2,3,4],
                     1: [2,3,4,0], 
                     2: [3,4,0,1],
                     3: [4,0,1,2],
                     4: [0,1,2,3]}
  
    def __len__(self):
        return len(self.df)
        
    def __getitem__(self, idx):
        example = self.df.iloc[idx]
        tokenized_example = dict()              

        first_sentence = [example['prompt']] * 5
        second_sentences = [example[option] for option in 'ABCDE']
        other_sentences = [[] for i in range(5)]

        for i, p in enumerate(range(5)):
            value = self.perm_dict[p] 
            for v in value:
                al = self.i2a[v] 
                second_sentences[i]+= ' ' + example[al]

        tokenized_example = tokenizer(first_sentence, 
                                      second_sentences,
                                      truncation='only_first')
        tokenized_example['label'] = option_to_index[example['answer']]
        return tokenized_example
            
test_ds = LlmseDataset(test_df)

In [42]:
if IS_TEST_SET:
    data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)


    test_dl = DataLoader(
        test_ds, 
        batch_size=1, 
        shuffle=False, 
        collate_fn=data_collator,
        num_workers=0,
        pin_memory=True,
        drop_last=False
    )

In [43]:
class CustomModel(nn.Module):
    def __init__(self, model_conf, *, dropout=0.2, pretrained=True):
        super().__init__()

        # Transformer
        #self.config = AutoConfig.from_pretrained(model_conf)

        self.transformer = AutoModelForMultipleChoice.from_config(model_conf)

        #self._init_weights(self.fc, self.config)

    def _init_weights(self, module, config):
        module.weight.data.normal_(mean=0.0, std=config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.transformer(input_ids, attention_mask, token_type_ids=token_type_ids)
        x = out['logits'] 

        return x

In [44]:
if IS_TEST_SET:
    config = torch.load(CONF_PATH)
    model = CustomModel(model_conf=config)
    model.load_state_dict(torch.load(MODEL_PATH))
    model.to(device)
    model.eval()

    y_preds = []

    with tqdm(test_dl, leave=True) as pbar:
        with torch.no_grad():
            for idx, batch in enumerate(pbar):
                inp_ids = batch['input_ids'].to(device)
                att_mask = batch['attention_mask'].to(device)
                token_type_ids = batch['token_type_ids'].to(device)

                y_pred = model(input_ids=inp_ids, 
                               attention_mask=att_mask, 
                               token_type_ids=token_type_ids)

                y_pred = y_pred.to(torch.float)

                y_preds.append(y_pred.cpu())


    itk_preds = torch.cat(y_preds)
    del model, y_preds
    torch.cuda.empty_cache()

    itk_preds = softmax(itk_preds, axis=1).numpy()
    gc.collect()

## Optimising Blending Weights

Maximising MAP@3 directly is very difficult (is it even possible?), so we minimise the cross-entropy (CE) loss here instead.
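
As a rough illustration of this idea, the sketch below blends the predicted probabilities of several models and searches for weights that minimise the CE loss with `scipy.optimize.minimize`. It is not the exact code used to obtain the weights in this notebook: the synthetic data, the helper names (`cross_entropy`, `map_at_3`, `optimise_weights`, `fake_model`) and the choice of the SLSQP method with weights constrained to be non-negative and sum to 1 are all assumptions made for the example. MAP@3 is only reported afterwards as a check.

```python
import numpy as np
from scipy.optimize import minimize


def cross_entropy(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the correct option."""
    p_true = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(-np.log(p_true).mean())


def map_at_3(probs, labels):
    """MAP@3 when every question has exactly one correct option."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    hits = top3 == labels[:, None]
    # A hit at rank 1, 2 or 3 scores 1, 1/2 or 1/3 respectively.
    return float((hits / np.array([1.0, 2.0, 3.0])).sum(axis=1).mean())


def optimise_weights(model_probs, labels):
    """model_probs has shape (n_models, n_samples, n_options)."""
    n_models = model_probs.shape[0]

    def objective(ws):
        blended = np.tensordot(ws, model_probs, axes=1)
        return cross_entropy(blended, labels)

    result = minimize(
        objective,
        x0=np.full(n_models, 1.0 / n_models),   # start from a plain average
        method="SLSQP",
        bounds=[(0.0, 1.0)] * n_models,          # assumption: non-negative weights
        constraints={"type": "eq", "fun": lambda ws: ws.sum() - 1.0},
    )
    return result.x


# Synthetic stand-in for the evaluation set predictions (assumption).
rng = np.random.default_rng(0)
n_samples, n_options = 500, 5
labels = rng.integers(0, n_options, size=n_samples)


def fake_model(noise):
    """Softmax probabilities that favour the true label, with some noise."""
    logits = rng.normal(scale=noise, size=(n_samples, n_options))
    logits[np.arange(n_samples), labels] += 2.0
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


model_probs = np.stack([fake_model(1.0), fake_model(2.0), fake_model(4.0)])
ws = optimise_weights(model_probs, labels)
blended = np.tensordot(ws, model_probs, axes=1)

print("weights:", ws.round(3))
print("CE loss:", cross_entropy(blended, labels))
print("MAP@3  :", map_at_3(blended, labels))
```

With real models, `model_probs` would be the stacked evaluation-set probabilities of each model and `labels` the indices of the correct answers. The resulting `ws` can then be used in the same way as the hard-coded `ws` list in the next cell.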

# Apply weights and make submission

In [45]:
if IS_TEST_SET:
#     ws = [1/6., 1/6., 1/6., 1/6., 1/6., 1/6.]
#     ws = [6.47750967e-01, 3.86703564e-18, 7.57196376e-03, 7.00792352e-02, 2.46658558e-01, 2.79392764e-02]
#     ws = [0.411877448, 8.8371708e-20, 1.736642515e-18, 0.00924490935, 0.0788776425, 0.5]
#     ws = [2.66547532e-01, 1.59127328e-18, 3.11583981e-03, 2.88373899e-02, 1.01499238e-01, 6.00000000e-01]
#     ws = [6.66368830e-01, 3.97818320e-18, 7.78959953e-03, 7.20934747e-02, 2.53748096e-01]
    ws = [0.399821298, 2.38690992e-18, 0.00467375972, 0.0432560848, 0.152248858, 0.4]
    predictions_overall = test_predictionsc * ws[0] + ob_preds * ws[1] + test_predictionsi * ws[2] + hyc_preds * ws[3] + itk_preds * ws[4] + longtrans_preds * ws[5]
    print(predictions_overall.shape)
#     predictions_overall = np.where(np.logical_and(longtrans_preds.max(axis=1) < 0.4, predictions_overall.max(axis=1) > 0.4).reshape(-1,1).repeat(5, axis=1), predictions_overall, longtrans_preds)

    # Keep the indices of the three most probable options for each question
    predictions_overall = np.argsort(-predictions_overall)[:,:3]
    print(predictions_overall[:5])

    predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_overall]
    print(predictions_as_answer_letters[:3])

    predictions_as_string = test_df['prediction'] = [
        ' '.join(row) for row in predictions_as_answer_letters[:, :3]
    ]
    print(predictions_as_string[:3])

else:
    test_df['prediction'] = 'A B C'
    
submission = test_df[['id', 'prediction']]

submission.to_csv('submission.csv', index=False)

pd.read_csv('submission.csv').head(10)


| | id | prediction |
|---|---|---|
| 0 | 0 | A B C |
| 1 | 1 | A B C |
| 2 | 2 | A B C |
| 3 | 3 | A B C |
| 4 | 4 | A B C |
| 5 | 5 | A B C |
| 6 | 6 | A B C |
| 7 | 7 | A B C |
| 8 | 8 | A B C |
| 9 | 9 | A B C |


In conclusion, we were at least able to confirm that the openbook model (based on Ozturk's and Chris's work), which uses a different method from the other models and scores highly on its own, receives a higher weight.

Now it's your turn to blend. Try adding weights for your own models.

Also, running the notebook, especially inference for the openbook model, takes a long time, so it is a good idea to use separate notebooks for calculating the weights and for submitting, as in Yirun Zhang's base notebook.
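
One possible way to split the work is to have the weight-calculation notebook write its result to a small file that the submission notebook reads back. The sketch below is only an illustration: the file name `blend_weights.json` and the helpers `save_weights` and `load_weights` are hypothetical, and a temporary directory stands in for a Kaggle dataset or notebook output. The keys reuse the prediction variable names and the `ws` values from the blending cell above.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Order of the models as used in the blending cell of this notebook.
MODEL_NAMES = ["test_predictionsc", "ob_preds", "test_predictionsi",
               "hyc_preds", "itk_preds", "longtrans_preds"]


def save_weights(names, weights, path):
    """Weight-calculation notebook: store one weight per model name."""
    payload = {name: float(w) for name, w in zip(names, weights)}
    Path(path).write_text(json.dumps(payload, indent=2))


def load_weights(names, path):
    """Submission notebook: read the weights back in a fixed model order."""
    payload = json.loads(Path(path).read_text())
    return np.array([payload[name] for name in names])


ws = [0.399821298, 2.38690992e-18, 0.00467375972, 0.0432560848, 0.152248858, 0.4]

with tempfile.TemporaryDirectory() as tmp_dir:
    weights_file = Path(tmp_dir) / "blend_weights.json"  # hypothetical file name
    save_weights(MODEL_NAMES, ws, weights_file)
    loaded_ws = load_weights(MODEL_NAMES, weights_file)

print(dict(zip(MODEL_NAMES, loaded_ws)))
```

Storing the weights by model name rather than by position makes it harder to mix up the order when a model is added or removed.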

It is also important to change the evaluation dataset to one that is relevant to STEM. If a model's weight is unnaturally high, suspect a leak, and make sure the evaluation dataset was not used for training.

### Wishing you happy kaggling!