I previously shared a [notebook](url) (see the discussion [here](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/441128)) that identified a cluster of relevant Wikipedia STEM articles, around 270K in total; the resulting dataset is released [here](https://www.kaggle.com/datasets/mbanaei/stem-wiki-cohere-no-emb).

However, due to issues with WikiExtractor, some numbers or even whole paragraphs are missing from the final Wiki parsing. Therefore, for the same set of articles, I used the Wiki API to gather the articles' contexts (see the discussion [here](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/442483)); the resulting dataset is released [here](https://www.kaggle.com/datasets/mbanaei/all-paraphs-parsed-expanded).

To show that the selected articles cover not only the train dataset articles but also a majority of the LB gold articles, I am releasing this notebook, which uses a simple retrieval model (without any prior indexing) together with a model trained only on the RACE dataset (not fine-tuned on any competition-similar dataset).

The main design choices for the notebook are:
- Using a simple TF-IDF to retrieve contexts from both datasets for every given question.
- Although most high-performing public models use DeBERTa-V3 for inference in their pipeline, I used a Longformer Large model, which allows a much longer prefix context within limited GPU memory. More specifically, unlike many public notebooks, there is no sentence-level splitting: whole paragraphs are retrieved and passed to the classifier as context. The main reason we avoid OOM and still get relatively fast inference is that Longformer does not compute full attention, unlike standard models such as BERT.
- I use a fall-back model (based on a public notebook that takes an openbook approach and scores 81.5 on the LB) for prediction when the main model has low confidence in its top choice.
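
The fall-back rule in the last design choice can be sketched as follows. This is a minimal illustration, not the notebook's actual logic: the function name `apply_fallback`, the `threshold` value of 0.5, and the toy probabilities are all assumptions made for the example.

```python
import numpy as np

def apply_fallback(main_probs, fallback_probs, threshold=0.5):
    """Replace rows where the main model's top probability is below `threshold`.

    `threshold` is a hypothetical value; it is not taken from the notebook.
    """
    main_probs = np.asarray(main_probs, dtype=float)
    fallback_probs = np.asarray(fallback_probs, dtype=float)
    # A question counts as "low confidence" when even the best option is weak.
    low_confidence = main_probs.max(axis=1) < threshold
    combined = np.where(low_confidence[:, None], fallback_probs, main_probs)
    return combined, low_confidence

# Toy probabilities over the five options A-E for two questions.
main = np.array([[0.90, 0.04, 0.03, 0.02, 0.01],
                 [0.30, 0.25, 0.20, 0.15, 0.10]])
backup = np.array([[0.10, 0.60, 0.10, 0.10, 0.10],
                   [0.05, 0.05, 0.80, 0.05, 0.05]])

combined, used_fallback = apply_fallback(main, backup)
print(used_fallback.tolist())            # [False, True]
print(combined.argmax(axis=1).tolist())  # [0, 2]
```

The first question keeps the main model's answer, while the second falls back because its top probability (0.30) is below the assumed threshold.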

P.S.: Although the model performs relatively well compared to other public notebooks, many design choices can be revised to improve both inference time and performance (e.g., context retrieval currently seems to be the inference bottleneck, as no prior indexing is used).

In [1]:
!cp /kaggle/input/datasets-wheel/datasets-2.14.4-py3-none-any.whl /kaggle/working
!pip install  /kaggle/working/datasets-2.14.4-py3-none-any.whl
!cp /kaggle/input/backup-806/util_openbook.py .

Processing ./datasets-2.14.4-py3-none-any.whl
Installing collected packages: datasets
  Attempting uninstall: datasets
    Found existing installation: datasets 2.1.0
    Uninstalling datasets-2.1.0:
      Successfully uninstalled datasets-2.1.0
Successfully installed datasets-2.14.4


In [2]:
# installing offline dependencies
!pip install -U /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf /kaggle/input/sentence-transformers-222/sentence-transformers /kaggle/working/sentence-transformers
!pip install -U /kaggle/working/sentence-transformers
!pip install -U /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --no-index --no-deps /kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --no-index --no-deps /kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

Processing /kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Processing ./sentence-transformers
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=126134 sha256=429095dc5e2cb2f4491d09535c1d91d5bc4c8c99d2ea69c1eea61ad7f6052262
  Stored in directory: /root/.cache/pip/wheels/6c/ea/76/d9a930b223b1d3d5d6aff69458725316b0fe205b854faf1812
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.2.2
Processing /kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl
Installing collected packages: blingfir

In [3]:
# from util_openbook import get_contexts, generate_openbook_output
import pickle

# get_contexts()
# generate_openbook_output()

import gc
import ctypes
import torch

def clean_memory():
    gc.collect()
    ctypes.CDLL("libc.so.6").malloc_trim(0)
    torch.cuda.empty_cache()
    

In [4]:
clean_memory()

In [5]:
import pandas as pd
# backup_model_predictions = pd.read_csv("submission_backup.csv")

In [6]:
import os
import pickle
import unicodedata

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import load_dataset, load_from_disk
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from transformers import (
    AutoModelForMultipleChoice,
    AutoTokenizer,
    LongformerForMultipleChoice,
    LongformerTokenizer,
    Trainer,
    TrainingArguments,
)



In [7]:
!cp -r /kaggle/input/stem-wiki-cohere-no-emb /kaggle/working
!cp -r /kaggle/input/1006-wikitfidfv1 /kaggle/working/

In [8]:
def SplitList(mylist, chunk_size):
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]



def get_relevant_documents(df_valid):
    df_chunk_size=800
    
    cohere_dataset_filtered = load_from_disk("/kaggle/working/stem-wiki-cohere-no-emb")
    modified_texts = cohere_dataset_filtered.map(lambda example:
                                             {'temp_text':
                                              unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         cohere_dataset_filtered[idx.item()]["title"],
                         unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles


def get_relevant_tfidf_parse1(df_valid):
    df_chunk_size=600
    paraphs_parsed_dataset = load_from_disk("/kaggle/working/1006-wikitfidfv1")
    modified_texts = paraphs_parsed_dataset.map(lambda example:
                                             {'temp_text':
                                              f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         paraphs_parsed_dataset[idx.item()]["title"],
                         paraphs_parsed_dataset[idx.item()]["text"],
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles



def retrieval(df_valid, modified_texts):
    
    corpus_df_valid = df_valid.apply(lambda row:
                                     f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["A"]}\n{row["B"]}\n{row["C"]}\n{row["D"]}\n{row["E"]}',
                                     axis=1).values
    vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words)
    vectorizer1.fit(corpus_df_valid)
    vocab_df_valid = vectorizer1.get_feature_names_out()
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words,
                                 vocabulary=vocab_df_valid)
    vectorizer.fit(modified_texts[:500000])
    corpus_tf_idf = vectorizer.transform(corpus_df_valid)
    
    print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names_out())}")

    chunk_size = 100000
    top_per_chunk = 10
    top_per_query = 10

    all_chunk_top_indices = []
    all_chunk_top_values = []

    for idx in tqdm(range(0, len(modified_texts), chunk_size)):
        wiki_vectors = vectorizer.transform(modified_texts[idx: idx+chunk_size])
        temp_scores = (corpus_tf_idf * wiki_vectors.T).toarray()
        chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
        chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]

        all_chunk_top_indices.append(chunk_top_indices + idx)
        all_chunk_top_values.append(chunk_top_values)

    top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
    top_values_array = np.concatenate(all_chunk_top_values, axis=1)
    
    merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
    merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
    articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
    
    return articles_indices, merged_top_scores


def prepare_answering_input(
        tokenizer,
        question,
        options,
        context,
        max_seq_length=4096,
    ):
    """Encode one question as a (1, num_options, seq_len) batch on the model's device.

    Relies on the global `model` to determine the target device.
    """
    # Every option is paired with the same "context <bos> question" prefix.
    c_plus_q = context + ' ' + tokenizer.bos_token + ' ' + question
    c_plus_q_n = [c_plus_q] * len(options)
    tokenized_examples = tokenizer(
        c_plus_q_n, options,
        max_length=max_seq_length,  # not enforced here because truncation=False
        padding="longest",
        truncation=False,
        return_tensors="pt",
    )
    # Add a leading batch dimension of size 1.
    input_ids = tokenized_examples['input_ids'].unsqueeze(0)
    attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
    example_encoded = {
        "input_ids": input_ids.to(model.device),
        "attention_mask": attention_mask.to(model.device),
    }
    return example_encoded
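
The chunked scoring inside `retrieval` keeps only the best candidates from each corpus chunk and merges them at the end, so the full question-by-article score matrix never has to be held in memory. A minimal NumPy sketch of that pattern is below; the helper name `chunked_top_k`, the dense random vectors, and the small `k` and `chunk_size` are assumptions chosen to keep the example short (the notebook works on sparse TF-IDF matrices with chunks of 100,000 texts).

```python
import numpy as np

def chunked_top_k(query_vecs, doc_vecs, k=2, chunk_size=3):
    """Return the indices and scores of the k best documents per query, best last."""
    cand_idx, cand_val = [], []
    for start in range(0, doc_vecs.shape[0], chunk_size):
        scores = query_vecs @ doc_vecs[start:start + chunk_size].T
        kk = min(k, scores.shape[1])  # the last chunk may hold fewer than k documents
        top = np.argpartition(scores, -kk, axis=1)[:, -kk:]
        cand_idx.append(top + start)  # shift back to global document indices
        cand_val.append(np.take_along_axis(scores, top, axis=1))
    cand_idx = np.concatenate(cand_idx, axis=1)
    cand_val = np.concatenate(cand_val, axis=1)
    # Merge step: sort the per-chunk candidates and keep the overall top k.
    order = np.argsort(cand_val, axis=1)[:, -k:]
    return (np.take_along_axis(cand_idx, order, axis=1),
            np.take_along_axis(cand_val, order, axis=1))

rng = np.random.default_rng(0)
docs = rng.random((10, 4))
queries = rng.random((2, 4))

idx, val = chunked_top_k(queries, docs, k=2, chunk_size=3)

# Sanity check against scoring the whole corpus at once.
brute = np.argsort(queries @ docs.T, axis=1)[:, -2:]
assert (idx == brute).all()
```

As in the notebook, results are ordered ascending by score, so the best match for each question sits at position `-1`.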


In [9]:
stop_words = ['each', 'you', 'the', 'use', 'used',
                  'where', 'themselves', 'nor', "it's", 'how', "don't", 'just', 'your',
                  'about', 'himself', 'with', "weren't", 'hers', "wouldn't", 'more', 'its', 'were',
                  'his', 'their', 'then', 'been', 'myself', 're', 'not',
                  'ours', 'will', 'needn', 'which', 'here', 'hadn', 'it', 'our', 'there', 'than',
                  'most', "couldn't", 'both', 'some', 'for', 'up', 'couldn', "that'll",
                  "she's", 'over', 'this', 'now', 'until', 'these', 'few', 'haven',
                  'of', 'wouldn', 'into', 'too', 'to', 'very', 'shan', 'before', 'the', 'they',
                  'between', "doesn't", 'are', 'was', 'out', 'we', 'me',
                  'after', 'has', "isn't", 'have', 'such', 'should', 'yourselves', 'or', 'during', 'herself',
                  'doing', 'in', "shouldn't", "won't", 'when', 'do', 'through', 'she',
                  'having', 'him', "haven't", 'against', 'itself', 'that',
                  'did', 'theirs', 'can', 'those',
                  'own', 'so', 'and', 'who', "you've", 'yourself', 'her', 'he', 'only',
                  'what', 'ourselves', 'again', 'had', "you'd", 'is', 'other',
                  'why', 'while', 'from', 'them', 'if', 'above', 'does', 'whom',
                  'yours', 'but', 'being', "wasn't", 'be']

In [10]:
df_valid = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/test.csv")

In [11]:
retrieved_articles = get_relevant_documents(df_valid)
gc.collect()

Map (num_proc=2):   0%|          | 0/2781652 [00:00<?, ? examples/s]



length of vectorizer vocab is 11222



 68%|██████▊   | 19/28 [02:39<01:10,  7.84s/it]

18

In [12]:
retrieved_get_relevant_tfidf_parse1 = get_relevant_tfidf_parse1(df_valid)
gc.collect()

Map (num_proc=2):   0%|          | 0/4520652 [00:00<?, ? examples/s]



length of vectorizer vocab is 11222



 41%|████▏     | 19/46 [04:04<05:43, 12.71s/it]

18

In [13]:
tokenizer = LongformerTokenizer.from_pretrained("/kaggle/input/longformer-race-model/longformer_qa_model")
model = LongformerForMultipleChoice.from_pretrained("/kaggle/input/longformer-race-model/longformer_qa_model").cuda()

In [14]:
df_valid['context'] = 'A'

In [15]:
articles_predictions = []
submit_ids = []

for index in tqdm(range(df_valid.shape[0])):
    columns = df_valid.iloc[index].values
    submit_ids.append(columns[0])
    question = columns[1]
    options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
    context1 = f"{retrieved_articles[index][-1][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-4][2]}"
    
    context2 = "\n".join([retrieved_get_relevant_tfidf_parse1[index][-i][2] for i in range(1, 3)])
    context2 = context2[:2750]
    
    
    inputs1 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context1,
        )
    inputs2 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context2,
        )
    
    # add context
    df_valid.loc[index, 'context'] = context1 + ' ' + context2
        
    
    with torch.no_grad():
        outputs1 = model(**inputs1)    
        losses1 = -outputs1.logits[0].detach().cpu().numpy()
        probability1 = torch.softmax(torch.tensor(-losses1), dim=-1)
        
    with torch.no_grad():
        outputs2 = model(**inputs2)
        losses2 = -outputs2.logits[0].detach().cpu().numpy()
        probability2 = torch.softmax(torch.tensor(-losses2), dim=-1)
        
    probability_ = (probability1 + probability2)/2


    articles_predictions.append(probability_)

del model, tokenizer, inputs1, inputs2
clean_memory()

# Convert each tensor to a NumPy array and stack them into a single 2-D array
articles_predictions = np.stack([tensor.numpy() for tensor in articles_predictions])

100%|██████████| 200/200 [04:13<00:00,  1.27s/it]


In [16]:
articles_predictions.shape

(200, 5)

In [17]:
df_valid[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)

# model_cp11100

In [18]:
model_dir = "/kaggle/input/0924-v2-lr1e-5-epochs3/checkpoint-11100"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()

DebertaV2ForMultipleChoice(
  (deberta): DebertaV2Model(
    (embeddings): DebertaV2Embeddings(
      (word_embeddings): Embedding(128100, 1024, padding_idx=0)
      (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
      (dropout): StableDropout()
    )
    (encoder): DebertaV2Encoder(
      (layer): ModuleList(
        (0-23): 24 x DebertaV2Layer(
          (attention): DebertaV2Attention(
            (self): DisentangledSelfAttention(
              (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
              (pos_dropout): StableDropout()
              (dropout): StableDropout()
            )
            (output): DebertaV2SelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_aff

In [19]:
v2_epochs3_predictions  = []
submit_ids = []

for index in tqdm(range(df_valid.shape[0])):
    columns = df_valid.iloc[index].values
    submit_ids.append(columns[0])
    question = columns[1]
    options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
    context1 = f"{retrieved_articles[index][-1][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-4][2]}"
    
    context2 = "\n".join([retrieved_get_relevant_tfidf_parse1[index][-i][2] for i in range(1, 3)])
    context2 = context2[:2750]
    
    inputs1 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context1,
        )
    inputs2 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context2,
        )
    
    with torch.no_grad():
        outputs1 = model(**inputs1)    
        losses1 = -outputs1.logits[0].detach().cpu().numpy()
        probability1 = torch.softmax(torch.tensor(-losses1), dim=-1)
        
    with torch.no_grad():
        outputs2 = model(**inputs2)
        losses2 = -outputs2.logits[0].detach().cpu().numpy()
        probability2 = torch.softmax(torch.tensor(-losses2), dim=-1)
        
    probability_ = (probability1 + probability2)/2
    
    v2_epochs3_predictions.append(probability_)
    



del model, tokenizer, inputs1, inputs2
clean_memory()

# Convert each tensor to a NumPy array and stack them into a single 2-D array
v2_epochs3_predictions = np.stack([tensor.numpy() for tensor in v2_epochs3_predictions])

100%|██████████| 200/200 [02:04<00:00,  1.61it/s]


In [20]:
v2_epochs3_predictions.shape

(200, 5)

In [21]:
# pd.DataFrame({'id':submit_ids,'prediction':predictions}).to_csv('submission.csv', index=False)
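
The commented-out line above expects one prediction string per question. As a rough sketch of how several `(n_questions, 5)` probability arrays such as `articles_predictions` and `v2_epochs3_predictions` could be blended and turned into such strings, assuming the competition format of up to three space-separated option letters per row: the helper `top3_predictions`, the equal weights, and the toy arrays are illustrative assumptions, not the weights used for the reported LB score.

```python
import numpy as np

def top3_predictions(prob_arrays, weights):
    """Blend per-model probabilities and return the three best letters per question."""
    weights = np.asarray(weights, dtype=float)
    stacked = np.stack(prob_arrays)  # (n_models, n_questions, 5)
    # Weighted average over the model axis -> (n_questions, 5).
    blended = np.tensordot(weights / weights.sum(), stacked, axes=1)
    top3 = np.argsort(-blended, axis=1)[:, :3]
    letters = np.array(list("ABCDE"))
    return [" ".join(letters[row]) for row in top3]

# Toy probabilities from two hypothetical models for a single question.
p1 = np.array([[0.60, 0.20, 0.10, 0.05, 0.05]])
p2 = np.array([[0.10, 0.60, 0.20, 0.05, 0.05]])

predictions = top3_predictions([p1, p2], weights=[1.0, 1.0])
print(predictions)  # ['B A C']
```

With equal weights the blended scores are 0.35 for A and 0.40 for B, so B is ranked first even though the first model preferred A.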

# LB 0.865 (Optimise ensemble weight)

In [22]:
from __future__ import annotations

import ctypes
import gc
import os
import re
import time
from collections.abc import Iterable
from dataclasses import dataclass
from typing import Optional, Union

import blingfire as bf
import faiss
import numpy as np
import pandas as pd
import torch
from datasets import Dataset
from faiss import read_index, write_index
from scipy.special import softmax
from sentence_transformers import SentenceTransformer
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import AutoModelForMultipleChoice, AutoTokenizer, Trainer, TrainingArguments
from transformers.tokenization_utils_base import PaddingStrategy, PreTrainedTokenizerBase

libc = ctypes.CDLL("libc.so.6")

In [23]:
DEBUG = False
# DEBUG = False if len(trn)!=200 else True # Uncomment this line to save GPU quota, but accurate weights cannot then be obtained when saving the notebook
FILTER_LEN = 1 if DEBUG else 10
IND_SEARCH = 1 if DEBUG else 7
NUM_SENTENCES_INCLUDE = 1 if DEBUG else 25
CONTEXT_LEN = 1000 if DEBUG else 2750
VAL_SIZE = 200 if DEBUG else 1500


In [24]:
# trn[["prompt", "context", "A", "B", "C", "D", "E"]].to_csv("./test_context.csv", index=False)

In [25]:
test_df = pd.read_csv("test_context.csv")
test_df.index = list(range(len(test_df)))
test_df['id'] = list(range(len(test_df)))
test_df["prompt"] = test_df["context"].apply(lambda x: x[:CONTEXT_LEN]) + " #### " +  test_df["prompt"]
test_df['answer'] = 'A'

In [26]:
model_dir = "/kaggle/input/llm-science-run-context-2"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()


In [27]:
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
  
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation='only_first')
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

In [28]:
@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = "label" if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch
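
The collator above flattens the per-choice features, pads them to a common length, and reshapes the result back to `(batch, choices, seq_len)`. A NumPy-only sketch of that reshaping is shown below. The helper `pad_and_reshape`, right-side padding, and `pad_id=0` are assumptions for illustration; the real collator delegates padding to `tokenizer.pad` and handles the attention mask and labels as well.

```python
import numpy as np

def pad_and_reshape(features, pad_id=0):
    """features: list of examples, each a list of per-choice token-id lists."""
    batch_size = len(features)
    num_choices = len(features[0])
    # Flatten to one sequence per (example, choice) pair.
    flat = [ids for example in features for ids in example]
    max_len = max(len(ids) for ids in flat)
    # Right-pad every sequence to the longest one in the batch.
    padded = np.full((len(flat), max_len), pad_id, dtype=np.int64)
    for row, ids in enumerate(flat):
        padded[row, :len(ids)] = ids
    # Restore the (batch, choices, seq_len) layout the model expects.
    return padded.reshape(batch_size, num_choices, max_len)

# Two examples with two choices each (the notebook uses five choices).
batch = pad_and_reshape([
    [[1, 2, 3], [1, 2]],
    [[4], [4, 5, 6, 7]],
])
print(batch.shape)           # (2, 2, 4)
print(batch[0, 1].tolist())  # [1, 2, 0, 0]
```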

In [29]:
tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [30]:
test_predictions = []


for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictions.append(outputs.logits.cpu().detach())

test_predictions = torch.cat(test_predictions)

  0%|          | 0/200 [00:00<?, ?it/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [31]:
test_predictions = softmax(test_predictions, axis=1).numpy()

In [32]:
ob_preds = test_predictions

del test_predictions

In [33]:
# model_dir = "/kaggle/input/how-to-train-open-book-model-part-1/model_v2"
model_dir = "/kaggle/input/0903-v2-256-v3-384/model_v3"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()


In [34]:
test_predictionsc = []


for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictionsc.append(outputs.logits.cpu().detach())
    

test_predictionsc = torch.cat(test_predictionsc)


  0%|          | 0/200 [00:00<?, ?it/s]

In [35]:
test_predictionsc = softmax(test_predictionsc, axis=1).numpy()

In [36]:
gc.collect()

55

In [37]:
model_dir = "/kaggle/input/using-deepspeed-with-hf-trainer/checkpoints_1"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()


In [38]:
test_predictionsi = []

for batch in tqdm(test_dataloader):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    test_predictionsi.append(outputs.logits.cpu().detach())
    
test_predictionsi = torch.cat(test_predictionsi)

  0%|          | 0/200 [00:00<?, ?it/s]

In [39]:
test_predictionsi = softmax(test_predictionsi, axis=1).numpy()

# model by hyc (model_v26)

In [40]:
model_dir = '/kaggle/input/0917-model-v26/model_v26'
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMultipleChoice.from_pretrained(model_dir).cuda()
model.eval()


In [41]:
# We'll create a dictionary to convert option names (A, B, C, D, E) into indices and back again
options = 'ABCDE'
indices = list(range(5))

option_to_index = {option: index for option, index in zip(options, indices)}
index_to_option = {index: option for option, index in zip(options, indices)}

def preprocess(example):
    # The AutoModelForMultipleChoice class expects a set of question/answer pairs
    # so we'll copy our question 5 times before tokenizing
    first_sentence = [example['prompt']] * 5
    second_sentence = []
    for option in options:
        second_sentence.append(example[option])
    # Our tokenizer will turn our text into token IDs the DeBERTa model can understand
    tokenized_example = tokenizer(first_sentence, second_sentence, truncation=True)
    tokenized_example['label'] = option_to_index[example['answer']]
    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    # padding: Union[bool, str, PaddingStrategy] = True
    padding: Union[bool, str, PaddingStrategy] = 'max_length'
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
    
        individual_predictions = []

        max_seq_length = max(max(len(feature['input_ids'][i]) for feature in features) for i in range(num_choices))

        for i in range(num_choices):
            choice_features = [
                {k: v[i] for k, v in feature.items()} for feature in features
            ]
            
            batch = self.tokenizer.pad(
                choice_features,
                padding=self.padding,
                max_length=max_seq_length,
                pad_to_multiple_of=self.pad_to_multiple_of,
                return_tensors='pt',
            )
    
            batch = {k: v.view(batch_size, -1) for k, v in batch.items()}
            individual_predictions.append(batch['input_ids'])
    
        labels = torch.tensor(labels, dtype=torch.int64)
    
        batch = {'input_ids': torch.stack(individual_predictions, dim=1), 'labels': labels}
        return batch

In [42]:
tokenized_test_dataset = Dataset.from_pandas(test_df[['id', 'prompt', 'A', 'B', 'C', 'D', 'E', 'answer']].drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
tokenized_test_dataset = tokenized_test_dataset.remove_columns(["__index_level_0__"])
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
test_dataloader = DataLoader(tokenized_test_dataset, batch_size=1, shuffle=False, collate_fn=data_collator)

Map:   0%|          | 0/200 [00:00<?, ? examples/s]

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [43]:


preds = []


for batch in tqdm(test_dataloader, total=len(test_dataloader)):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    preds.append(outputs.logits.cpu().detach())
    


hyc_preds_2 = torch.cat(preds)

del model
torch.cuda.empty_cache()

  0%|          | 0/200 [00:00<?, ?it/s]

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


In [44]:
hyc_preds_2 = softmax(hyc_preds_2, axis=1).numpy()

### In order to increase diversity, we also use some weights that do not use openbook

In [45]:
from typing import Optional, Union
import pandas as pd
import numpy as np
import torch
from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel
from torch.utils.data import DataLoader
deberta_v3_large = '/kaggle/input/deberta-v3-large-hf-weights'
import os
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

In [46]:
option_to_index = {option: idx for idx, option in enumerate('ABCDE')}
index_to_option = {v: k for k,v in option_to_index.items()}

def preprocess(example):
    first_sentence = [example['prompt']] * 5
    second_sentences = [example[option] for option in 'ABCDE']
    tokenized_example = tokenizer(first_sentence, second_sentences, truncation=False)
    tokenized_example['label'] = option_to_index[example['answer']]
    
    return tokenized_example

@dataclass
class DataCollatorForMultipleChoice:
    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    
    def __call__(self, features):
        label_name = 'label' if 'label' in features[0].keys() else 'labels'
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]['input_ids'])
        flattened_features = [
            [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
        ]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors='pt',
        )
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        batch['labels'] = torch.tensor(labels, dtype=torch.int64)
        return batch 

In [47]:
tokenizer = AutoTokenizer.from_pretrained(deberta_v3_large)

# test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
# test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set


# tokenized_test_dataset = Dataset.from_pandas(test_df.drop(columns=['id'])).map(preprocess, remove_columns=['prompt', 'A', 'B', 'C', 'D', 'E', 'answer'])
# data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
# test_dataloader = DataLoader(tokenized_test_dataset, 1, shuffle=False, collate_fn=data_collator, num_workers=0, pin_memory=True,)



In [48]:
model = AutoModelForMultipleChoice.from_pretrained('/kaggle/input/0907-v13-cv08967/model_v13').cuda()
model.eval()

preds = []

for batch in tqdm(test_dataloader, total=len(test_dataloader)):
    for k in batch.keys():
        batch[k] = batch[k].cuda()
    with torch.no_grad():
        outputs = model(**batch)
    preds.append(outputs.logits.cpu().detach())
    

hyc_preds = torch.cat(preds)

del model
torch.cuda.empty_cache()

  0%|          | 0/200 [00:00<?, ?it/s]

In [49]:
hyc_preds = softmax(hyc_preds, axis=1).numpy()

In [50]:
gc.collect()

116

In [51]:
import os, glob
from typing import Optional, Union
import pandas as pd
import numpy as np
from tqdm import tqdm

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

from datasets import Dataset
from dataclasses import dataclass
from transformers import AutoTokenizer, AutoConfig
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer, AutoModel

In [52]:
MODEL_DIR = '/kaggle/input/llm-kaggle-awp'
CONF_PATH = MODEL_DIR + '/deberta-v3-large_config.pth'
MODEL_PATH = MODEL_DIR + '/best_model_public.pt'

In [53]:
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
device

device(type='cuda')

In [54]:
test_df = pd.read_csv('/kaggle/input/kaggle-llm-science-exam/test.csv')
test_df['answer'] = 'A' # dummy answer that allows us to preprocess the test dataset using functionality that works for the train set
test_df = test_df.replace(np.nan, '')  # np.NaN was removed in NumPy 2.0; np.nan works in all versions

In [55]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR+'/tokenizer')
tokenizer

DebertaV2TokenizerFast(name_or_path='/kaggle/input/llm-kaggle-awp/tokenizer', vocab_size=128000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True)

In [56]:
class LlmseDataset(torch.utils.data.Dataset):
    def __init__(self, df):
        self.df = df
        self.a2i = {alp: idx for idx, alp in enumerate('ABCDE')}
        self.i2a = {v: k for k,v in self.a2i.items()}
        self.perm_dict = {0: [1,2,3,4],
                     1: [2,3,4,0], 
                     2: [3,4,0,1],
                     3: [4,0,1,2],
                     4: [0,1,2,3]}
  
    def __len__(self):
        return len(self.df)
        
    def __getitem__(self, idx):
        example = self.df.iloc[idx]

        first_sentence = [example['prompt']] * 5
        second_sentences = [example[option] for option in 'ABCDE']

        # Append the other four options, in cyclic order, after each candidate option
        for i in range(5):
            for v in self.perm_dict[i]:
                second_sentences[i] += ' ' + example[self.i2a[v]]

        # 'only_first' restricts any truncation to the prompt, never the options
        tokenized_example = tokenizer(first_sentence,
                                      second_sentences,
                                      truncation='only_first')
        tokenized_example['label'] = self.a2i[example['answer']]
        return tokenized_example
            
test_ds = LlmseDataset(test_df)
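
To make the input format of this model easier to see, here is a minimal sketch of how the second sentence is built for each choice: the candidate option comes first, followed by the other four options in cyclic order. The option texts and the helper name `build_second_sentences` are hypothetical and only serve as an illustration; the offsets reproduce `perm_dict` from the dataset class above.

```python
# Hypothetical option texts, used only to illustrate the string construction.
example = {'A': 'alpha', 'B': 'beta', 'C': 'gamma', 'D': 'delta', 'E': 'epsilon'}
letters = 'ABCDE'


def build_second_sentences(example):
    """Return each candidate option followed by the other four in cyclic order."""
    sentences = []
    for i in range(5):
        # Offsets 1..4 (mod 5) give the same indices as perm_dict[i].
        others = [example[letters[(i + k) % 5]] for k in range(1, 5)]
        sentences.append(' '.join([example[letters[i]]] + others))
    return sentences


second_sentences = build_second_sentences(example)
for sentence in second_sentences:
    print(sentence)
```

Each of the five inputs therefore contains all five options, and only the leading option changes, so the model can compare a candidate against its alternatives even without retrieved context.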

In [57]:
data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)

test_dl = DataLoader(
    test_ds, 
    batch_size=1, 
    shuffle=False, 
    collate_fn=data_collator,
    num_workers=0,
    pin_memory=True,
    drop_last=False
)

In [58]:
class CustomModel(nn.Module):
    def __init__(self, model_conf, *, dropout=0.2, pretrained=True):
        super().__init__()

        # Transformer
        #self.config = AutoConfig.from_pretrained(model_conf)

        self.transformer = AutoModelForMultipleChoice.from_config(model_conf)

        #self._init_weights(self.fc, self.config)

    def _init_weights(self, module, config):
        module.weight.data.normal_(mean=0.0, std=config.initializer_range)
        if module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        out = self.transformer(input_ids, attention_mask, token_type_ids=token_type_ids)
        x = out['logits'] 

        return x

In [59]:
config = torch.load(CONF_PATH)
model = CustomModel(model_conf=config)
model.load_state_dict(torch.load(MODEL_PATH, map_location=device))
model.to(device)
model.eval()

CustomModel(
  (transformer): DebertaV2ForMultipleChoice(
    (deberta): DebertaV2Model(
      (embeddings): DebertaV2Embeddings(
        (word_embeddings): Embedding(128100, 1024, padding_idx=0)
        (LayerNorm): LayerNorm((1024,), eps=1e-07, elementwise_affine=True)
        (dropout): StableDropout()
      )
      (encoder): DebertaV2Encoder(
        (layer): ModuleList(
          (0-23): 24 x DebertaV2Layer(
            (attention): DebertaV2Attention(
              (self): DisentangledSelfAttention(
                (query_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (key_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (value_proj): Linear(in_features=1024, out_features=1024, bias=True)
                (pos_dropout): StableDropout()
                (dropout): StableDropout()
              )
              (output): DebertaV2SelfOutput(
                (dense): Linear(in_features=1024, out_features=1024, bias=True)
    

In [60]:
y_preds = []


with tqdm(test_dl, leave=True) as pbar:
    with torch.no_grad():
        for idx, batch in enumerate(pbar):
            inp_ids = batch['input_ids'].to(device)
            att_mask = batch['attention_mask'].to(device)
            token_type_ids = batch['token_type_ids'].to(device)

            y_pred = model(input_ids=inp_ids, 
                           attention_mask=att_mask, 
                           token_type_ids=token_type_ids)

            y_pred = y_pred.to(torch.float)

            y_preds.append(y_pred.cpu())
            

            
        
itk_preds = torch.cat(y_preds)

del model, y_preds
torch.cuda.empty_cache()

  0%|          | 0/200 [00:00<?, ?it/s]
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 200/200 [00:27<00:00,  7.33it/s]


In [61]:
itk_preds = softmax(itk_preds, axis=1).numpy()

In [62]:
gc.collect()

9

# Optimise model weights

In [63]:
# ref: LB 0.865 Weight
# ws = [4.71728275e-01, 1.82735672e-17, 2.48815095e-02, 2.35219500e-01, 2.68170715e-01]

# LB 0.865
ws = [4.29619529e-01, 6.94350755e-26, 3.07758321e-02, 2.10548029e-01, 2.71740013e-01, 5.73165973e-02]

In [64]:
# LB 0.865
predictions_overall = test_predictionsc * ws[0] + ob_preds * ws[1] + test_predictionsi * ws[2] + hyc_preds * ws[3] + itk_preds * ws[4] + hyc_preds_2 * ws[5]
predictions_overall.shape


(200, 5)
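
The weights above are hard-coded, and the search that produced them is not part of this notebook. As a rough sketch of how non-negative weights that sum to 1 could be found on a labelled validation set, the block below samples candidates from a Dirichlet distribution and keeps the blend with the lowest log loss. This random search is a swapped-in illustration, not necessarily the method behind `ws`; the synthetic predictions and the names `blend`, `log_loss`, `random_search_weights` and `fake_model` are all assumptions.

```python
import numpy as np


def blend(pred_list, weights):
    """Weighted sum of per-model probability arrays, each of shape (n, 5)."""
    return sum(w * p for w, p in zip(weights, pred_list))


def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-probability assigned to the correct option."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(np.clip(picked, eps, None))))


def random_search_weights(pred_list, labels, n_trials=2000, seed=0):
    """Sample weights on the simplex and keep the lowest-loss candidate."""
    rng = np.random.default_rng(seed)
    n_models = len(pred_list)
    best_w = np.full(n_models, 1.0 / n_models)  # start from a uniform blend
    best_loss = log_loss(blend(pred_list, best_w), labels)
    for _ in range(n_trials):
        w = rng.dirichlet(np.ones(n_models))  # non-negative, sums to 1
        loss = log_loss(blend(pred_list, w), labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss


# Synthetic stand-ins for validation predictions (not real model outputs).
rng = np.random.default_rng(42)
labels = rng.integers(0, 5, size=300)


def fake_model(noise):
    """Softmax over noisy logits that favour the true label."""
    logits = np.eye(5)[labels] * 2.0 + rng.normal(scale=noise, size=(300, 5))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)


pred_list = [fake_model(0.5), fake_model(1.5), fake_model(3.0)]
best_w, best_loss = random_search_weights(pred_list, labels)
uniform_loss = log_loss(blend(pred_list, np.full(3, 1.0 / 3.0)), labels)
print(best_w, best_loss, uniform_loss)
```

Weights tuned this way are only as reliable as the validation set they were fitted on, so a small set can easily lead to overfitting to noise.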

# Ensemble
* LB 0.836 + LB 0.865

In [65]:
predictions_overall = (articles_predictions + predictions_overall + v2_epochs3_predictions) / 3

In [66]:
predictions_overall = np.argsort(-predictions_overall)[:,:3]
predictions_overall[:5]

array([[3, 0, 2],
       [0, 3, 1],
       [0, 3, 1],
       [2, 0, 1],
       [3, 1, 0]])
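
Keeping the three highest-scoring options per question matches a MAP@3-style evaluation, where credit depends on the rank of the correct answer. Assuming that metric, and assuming a labelled validation set is available (the test set here only carries dummy answers), a top-3 array like the one above could be scored with the sketch below. The function name `map_at_3` and the toy probabilities are hypothetical.

```python
import numpy as np


def map_at_3(top3, labels):
    """Mean average precision at 3 with exactly one correct label per row."""
    score = 0.0
    for ranked, label in zip(top3, labels):
        for rank, pred in enumerate(ranked[:3]):
            if pred == label:
                score += 1.0 / (rank + 1)  # 1, 1/2 or 1/3 depending on rank
                break
    return score / len(labels)


# Toy probabilities for three questions (illustrative values only).
probs = np.array([
    [0.10, 0.60, 0.20, 0.05, 0.05],
    [0.50, 0.10, 0.30, 0.05, 0.05],
    [0.25, 0.15, 0.10, 0.10, 0.40],
])
labels = [1, 2, 3]

top3 = np.argsort(-probs)[:, :3]  # same top-3 selection as in the cell above
score = map_at_3(top3, labels)
print(top3.tolist(), score)
```

In this toy case the first answer is ranked first, the second is ranked second and the third is missed, so the individual scores are 1, 1/2 and 0.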

In [67]:
predictions_as_answer_letters = np.array(list('ABCDE'))[predictions_overall]
predictions_as_answer_letters[:3]

array([['D', 'A', 'C'],
       ['A', 'D', 'B'],
       ['A', 'D', 'B']], dtype='<U1')

In [68]:
predictions_as_string = test_df['prediction'] = [
    ' '.join(row) for row in predictions_as_answer_letters[:, :3]
]
predictions_as_string[:3]

['D A C', 'A D B', 'A D B']

In [69]:
submission = test_df[['id', 'prediction']]
submission.to_csv('submission.csv', index=False)

pd.read_csv('submission.csv').head(10)

   id prediction
0   0      D A C
1   1      A D B
2   2      A D B
3   3      C A B
4   4      D B A
5   5      B E C
6   6      A C B
7   7      D E B
8   8      C B A
9   9      A B C


In [70]:
submission.tail(20)

      id prediction
180  180      B A E
181  181      A E B
182  182      A D E
183  183      C A B
184  184      A B D
185  185      A D B
186  186      C E D
187  187      A E D
188  188      D C B
189  189      B A C


# Folder cleanup

In [71]:
import shutil
import os

# List of folder paths to delete
folders_to_delete = [
    "/kaggle/working/__pycache__",
    "/kaggle/working/1006-wikitfidfv1",
    "/kaggle/working/sentence-transformers",
    "/kaggle/working/stem-wiki-cohere-no-emb",
]

# Delete every folder in the list
for folder in folders_to_delete:
    # Check whether the folder exists
    if os.path.exists(folder):
        # Delete the folder
        shutil.rmtree(folder)
        print(f"{folder} has been deleted.")
    else:
        print(f"{folder} does not exist.")


/kaggle/working/__pycache__ does not exist.
/kaggle/working/1006-wikitfidfv1 has been deleted.
/kaggle/working/sentence-transformers has been deleted.
/kaggle/working/stem-wiki-cohere-no-emb has been deleted.
