### Introduction

This notebook presents the results of a project to improve model performance in the Kaggle LLM Science Exam competition, which consists of STEM-related multiple-choice questions. The first step was to identify a cluster of relevant Wikipedia STEM articles, which yielded a collection of approximately 270K articles. This dataset is available for download [here](https://www.kaggle.com/datasets/mbanaei/stem-wiki-cohere-no-emb).

During data preparation, WikiExtractor turned out to drop numbers and paragraphs from the parsed content. To correct this, the Wiki API was used to fetch the complete text of the same set of articles; the refined dataset is available [here](https://www.kaggle.com/datasets/mbanaei/all-paraphs-parsed-expanded). The process is described in more detail in [this discussion](https://www.kaggle.com/competitions/kaggle-llm-science-exam/discussion/442483).

To validate the coverage of the identified articles, this notebook pairs a simple retrieval step with a classifier trained exclusively on the RACE dataset. The goal is to show that the selected articles cover not only the articles behind the training dataset but also a majority of the leaderboard (LB) gold articles.

Key design choices for this notebook include:
- **Context Retrieval**: A basic TF-IDF method retrieves contexts from both datasets for each question.
- **Model Utilization**: The LongFormer Large model is used for inference, so long contexts can be passed in whole rather than being split into sentence-level chunks. This choice avoids out-of-memory (OOM) issues, and Longformer's sparse attention keeps inference faster than it would be with a full-attention model.
- **Fallback Mechanism**: When the primary model has low confidence in its top choice, the prediction is taken instead from a fallback model based on a public notebook that uses an open-book approach.

While the performance of this model is competitive with other public notebooks, there remain opportunities for improvement in both inference time and overall accuracy, particularly in the context retrieval process, which currently lacks prior indexing.
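
As a rough illustration of the context-retrieval idea described above, the sketch below ranks a toy corpus against a question using TF-IDF weights and cosine similarity. It is a dependency-free simplification, not the notebook's implementation: the notebook relies on scikit-learn's `TfidfVectorizer`, and the names `tokenize`, `tfidf_vectors`, `embed_query`, `retrieve`, and `toy_corpus` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization only; the notebook uses a richer token pattern.
    return text.lower().split()


def normalize(vec):
    # Scale a {term: weight} dict to unit L2 length.
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    # Build one normalized TF-IDF vector per document plus smoothed IDF weights.
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = [
        normalize({t: c * idf[t] for t, c in Counter(tokens).items()})
        for tokens in tokenized
    ]
    return vectors, idf


def embed_query(query, idf):
    # Terms unseen in the corpus carry no weight and are dropped.
    counts = Counter(t for t in tokenize(query) if t in idf)
    return normalize({t: c * idf[t] for t, c in counts.items()})


def retrieve(query, docs, top_k=2):
    # Return (doc_index, cosine_score) pairs for the top_k best matches.
    vectors, idf = tfidf_vectors(docs)
    q = embed_query(query, idf)
    scores = [sum(q.get(t, 0.0) * w for t, w in vec.items()) for vec in vectors]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:top_k]]


toy_corpus = [
    "the mitochondrion produces energy for the cell",
    "newton formulated the laws of motion and gravity",
    "photosynthesis converts light energy into chemical energy",
]
top_hits = retrieve("which organelle produces energy in the cell", toy_corpus, top_k=2)
print(top_hits)
```

On this toy corpus the first document ranks highest because it shares the most weighted terms with the question. The same scoring is what makes the real retrieval expensive without an index: every question is compared against every passage.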

In [1]:
!ls ./kaggle/

input  working


In [2]:
!cp ./kaggle/input/datasets-wheel/datasets-2.14.4-py3-none-any.whl ./kaggle/working
!pip install  ./kaggle/working/datasets-2.14.4-py3-none-any.whl
!cp ./kaggle/input/backup-806/util_openbook.py .

Processing ./kaggle/working/datasets-2.14.4-py3-none-any.whl
datasets is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.


In [3]:
# installing offline dependencies
!pip install -U ./kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
!cp -rf ./kaggle/input/sentence-transformers-222/sentence-transformers ./kaggle/working/sentence-transformers
!pip install -U ./kaggle/working/sentence-transformers
!pip install -U ./kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

!pip install --user --no-index --no-deps ./kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
!pip install --user --no-index --no-deps ./kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
!pip install --user --no-index --no-deps ./kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

Processing ./kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
faiss-gpu is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.
Processing ./kaggle/working/sentence-transformers
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=126125 sha256=dd4136440edd464c5a9baead8fa70699e6fc6867551a149b45733c6cc1728a9a
  Stored in directory: /home/jovyan/.cache/pip/wheels/cf/29/94/952edff7a57baedcc598dd3582cf671d803cd3205aa09632b4
Successfully built sentence-transformers
Installing collected packages: sentence-transformers
  Attempting uninstall: sentence-transformers
    Found existing installation: sentence-transformers 2.2.2
    Unin

In [4]:
from util_openbook import get_contexts, generate_openbook_output
import pickle

get_contexts()
generate_openbook_output()

import gc
gc.collect()

  from .autonotebook import tqdm as notebook_tqdm
Batches: 100%|██████████| 13/13 [00:00<00:00, 13.08it/s]
100%|██████████| 200/200 [00:00<00:00, 3285.57it/s]
100%|██████████| 28/28 [01:33<00:00,  3.33s/it]
100%|██████████| 3546/3546 [00:00<00:00, 1120883.41it/s]
100%|██████████| 3546/3546 [00:07<00:00, 486.14it/s]
Batches: 100%|██████████| 10459/10459 [00:47<00:00, 222.50it/s]
Batches: 100%|██████████| 13/13 [00:00<00:00, 127.60it/s]
100%|██████████| 200/200 [00:08<00:00, 23.50it/s]
Map:   0%|          | 0/200 [00:00<?, ? examples/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Map: 100%|██████████| 200/200 [00:00<00:00, 482.12 examples/s]
You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


4

In [5]:
import pandas as pd
backup_model_predictions = pd.read_csv("submission_backup.csv")

In [8]:
import os
import pickle
import unicodedata

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
import transformers
from datasets import load_dataset, load_from_disk
from sklearn.feature_extraction.text import TfidfVectorizer
from tqdm import tqdm
from transformers import LongformerTokenizer, LongformerForMultipleChoice

In [9]:
!cp -r ./kaggle/input/stem-wiki-cohere-no-emb ./kaggle/working
!cp -r ./kaggle/input/all-paraphs-parsed-expanded ./kaggle/working/

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [10]:
def SplitList(mylist, chunk_size):
    return [mylist[offs:offs+chunk_size] for offs in range(0, len(mylist), chunk_size)]

def get_relevant_documents_parsed(df_valid):
    df_chunk_size=600
    paraphs_parsed_dataset = load_from_disk("./kaggle/working/all-paraphs-parsed-expanded")
    modified_texts = paraphs_parsed_dataset.map(lambda example:
                                             {'temp_text':
                                              f"{example['title']} {example['section']} {example['text']}".replace('\n'," ").replace("'","")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         paraphs_parsed_dataset[idx.item()]["title"],
                         paraphs_parsed_dataset[idx.item()]["text"],
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles



def get_relevant_documents(df_valid):
    df_chunk_size=800
    
    cohere_dataset_filtered = load_from_disk("./kaggle/working/stem-wiki-cohere-no-emb")
    modified_texts = cohere_dataset_filtered.map(lambda example:
                                             {'temp_text':
                                              unicodedata.normalize("NFKD", f"{example['title']} {example['text']}").replace('"',"")},
                                             num_proc=2)["temp_text"]
    
    all_articles_indices = []
    all_articles_values = []
    for idx in tqdm(range(0, df_valid.shape[0], df_chunk_size)):
        df_valid_ = df_valid.iloc[idx: idx+df_chunk_size]
    
        articles_indices, merged_top_scores = retrieval(df_valid_, modified_texts)
        all_articles_indices.append(articles_indices)
        all_articles_values.append(merged_top_scores)
        
    article_indices_array =  np.concatenate(all_articles_indices, axis=0)
    articles_values_array = np.concatenate(all_articles_values, axis=0).reshape(-1)
    
    top_per_query = article_indices_array.shape[1]
    articles_flatten = [(
                         articles_values_array[index],
                         cohere_dataset_filtered[idx.item()]["title"],
                         unicodedata.normalize("NFKD", cohere_dataset_filtered[idx.item()]["text"]),
                        )
                        for index,idx in enumerate(article_indices_array.reshape(-1))]
    retrieved_articles = SplitList(articles_flatten, top_per_query)
    return retrieved_articles



def retrieval(df_valid, modified_texts):
    
    corpus_df_valid = df_valid.apply(lambda row:
                                     f'{row["prompt"]}\n{row["prompt"]}\n{row["prompt"]}\n{row["A"]}\n{row["B"]}\n{row["C"]}\n{row["D"]}\n{row["E"]}',
                                     axis=1).values
    vectorizer1 = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words)
    vectorizer1.fit(corpus_df_valid)
    vocab_df_valid = vectorizer1.get_feature_names_out()
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 token_pattern=r"(?u)\b[\w/.-]+\b|!|/|\?|\"|\'",
                                 stop_words=stop_words,
                                 vocabulary=vocab_df_valid)
    vectorizer.fit(modified_texts[:500000])
    corpus_tf_idf = vectorizer.transform(corpus_df_valid)
    
    print(f"length of vectorizer vocab is {len(vectorizer.get_feature_names_out())}")

    chunk_size = 100000
    top_per_chunk = 10
    top_per_query = 10

    all_chunk_top_indices = []
    all_chunk_top_values = []

    for idx in tqdm(range(0, len(modified_texts), chunk_size)):
        wiki_vectors = vectorizer.transform(modified_texts[idx: idx+chunk_size])
        # (n_queries, vocab) @ (vocab, chunk_size) -> (n_queries, chunk_size) similarity scores
        temp_scores = (corpus_tf_idf @ wiki_vectors.T).toarray()
        chunk_top_indices = temp_scores.argpartition(-top_per_chunk, axis=1)[:, -top_per_chunk:]
        chunk_top_values = temp_scores[np.arange(temp_scores.shape[0])[:, np.newaxis], chunk_top_indices]

        all_chunk_top_indices.append(chunk_top_indices + idx)
        all_chunk_top_values.append(chunk_top_values)

    top_indices_array = np.concatenate(all_chunk_top_indices, axis=1)
    top_values_array = np.concatenate(all_chunk_top_values, axis=1)
    
    merged_top_scores = np.sort(top_values_array, axis=1)[:,-top_per_query:]
    merged_top_indices = top_values_array.argsort(axis=1)[:,-top_per_query:]
    articles_indices = top_indices_array[np.arange(top_indices_array.shape[0])[:, np.newaxis], merged_top_indices]
    
    return articles_indices, merged_top_scores


def prepare_answering_input(
        tokenizer, 
        question,  
        options,   
        context,   
        max_seq_length=4096,
    ):
    # Pair the shared "context <bos> question" prefix with every answer option.
    c_plus_q = context + ' ' + tokenizer.bos_token + ' ' + question
    c_plus_q_per_option = [c_plus_q] * len(options)
    tokenized_examples = tokenizer(
        c_plus_q_per_option, options,
        max_length=max_seq_length,
        padding="longest",
        truncation=False,  # with truncation off, max_length does not cap the input
        return_tensors="pt",
    )
    # Add a batch dimension: (1, num_options, seq_len).
    input_ids = tokenized_examples['input_ids'].unsqueeze(0)
    attention_mask = tokenized_examples['attention_mask'].unsqueeze(0)
    example_encoded = {
        "input_ids": input_ids.to(model.device),  # relies on the global `model`
        "attention_mask": attention_mask.to(model.device),
    }
    return example_encoded
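
The `retrieval` function above never scores the full corpus at once. It keeps the best candidates from each chunk and then merges them. The sketch below isolates that step on a small dense score matrix so the merge can be checked against a plain full sort. The function name `chunked_top_k` and the random `toy_scores` are illustrative assumptions, and `np.take_along_axis` stands in for the fancy indexing used in the notebook.

```python
import numpy as np


def chunked_top_k(scores, chunk_size, top_k):
    # Return (indices, values) of the top_k columns per row, scanning columns in chunks.
    n_docs = scores.shape[1]
    chunk_indices, chunk_values = [], []
    for start in range(0, n_docs, chunk_size):
        chunk = scores[:, start:start + chunk_size]
        k = min(top_k, chunk.shape[1])  # guard against a short final chunk
        # argpartition finds the k largest per row without a full sort.
        local = np.argpartition(chunk, -k, axis=1)[:, -k:]
        chunk_indices.append(local + start)  # shift to global column indices
        chunk_values.append(np.take_along_axis(chunk, local, axis=1))
    cand_idx = np.concatenate(chunk_indices, axis=1)
    cand_val = np.concatenate(chunk_values, axis=1)
    # Merge: sort the per-chunk candidates and keep the best top_k (ascending order).
    order = np.argsort(cand_val, axis=1)[:, -top_k:]
    return (
        np.take_along_axis(cand_idx, order, axis=1),
        np.take_along_axis(cand_val, order, axis=1),
    )


rng = np.random.default_rng(0)
toy_scores = rng.random((3, 25))  # 3 queries scored against 25 documents
top_idx, top_val = chunked_top_k(toy_scores, chunk_size=10, top_k=4)
print(top_idx)
```

Because each chunk contributes at least as many candidates as the final `top_k`, the merged result matches what a full sort over all documents would return, while only one chunk's score matrix is held in memory at a time.
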


In [11]:
stop_words = ['each', 'you', 'the', 'use', 'used',
                  'where', 'themselves', 'nor', "it's", 'how', "don't", 'just', 'your',
                  'about', 'himself', 'with', "weren't", 'hers', "wouldn't", 'more', 'its', 'were',
                  'his', 'their', 'then', 'been', 'myself', 're', 'not',
                  'ours', 'will', 'needn', 'which', 'here', 'hadn', 'it', 'our', 'there', 'than',
                  'most', "couldn't", 'both', 'some', 'for', 'up', 'couldn', "that'll",
                  "she's", 'over', 'this', 'now', 'until', 'these', 'few', 'haven',
                  'of', 'wouldn', 'into', 'too', 'to', 'very', 'shan', 'before', 'the', 'they',
                  'between', "doesn't", 'are', 'was', 'out', 'we', 'me',
                  'after', 'has', "isn't", 'have', 'such', 'should', 'yourselves', 'or', 'during', 'herself',
                  'doing', 'in', "shouldn't", "won't", 'when', 'do', 'through', 'she',
                  'having', 'him', "haven't", 'against', 'itself', 'that',
                  'did', 'theirs', 'can', 'those',
                  'own', 'so', 'and', 'who', "you've", 'yourself', 'her', 'he', 'only',
                  'what', 'ourselves', 'again', 'had', "you'd", 'is', 'other',
                  'why', 'while', 'from', 'them', 'if', 'above', 'does', 'whom',
                  'yours', 'but', 'being', "wasn't", 'be']

In [12]:
df_valid = pd.read_csv("./kaggle/input/kaggle-llm-science-exam/test.csv")

In [13]:
retrieved_articles_parsed = get_relevant_documents_parsed(df_valid)
gc.collect()

Map (num_proc=2): 100%|██████████| 2101279/2101279 [00:42<00:00, 49008.47 examples/s]


length of vectorizer vocab is 11222



  0%|          | 0/22 [00:00<?, ?it/s]
  5%|▍         | 1/22 [00:08<02:59,  8.53s/it]
  9%|▉         | 2/22 [00:16<02:47,  8.38s/it]
 14%|█▎        | 3/22 [00:25<02:41,  8.48s/it]
 18%|█▊        | 4/22 [00:33<02:31,  8.44s/it]
 23%|██▎       | 5/22 [00:42<02:22,  8.37s/it]
 27%|██▋       | 6/22 [00:50<02:14,  8.39s/it]
 32%|███▏      | 7/22 [00:59<02:07,  8.47s/it]
 36%|███▋      | 8/22 [01:07<01:57,  8.42s/it]
 41%|████      | 9/22 [01:15<01:49,  8.45s/it]
 45%|████▌     | 10/22 [01:24<01:41,  8.49s/it]
 50%|█████     | 11/22 [01:33<01:33,  8.51s/it]
 55%|█████▍    | 12/22 [01:41<01:25,  8.52s/it]
 59%|█████▉    | 13/22 [01:50<01:16,  8.51s/it]
 64%|██████▎   | 14/22 [01:58<01:07,  8.49s/it]
 68%|██████▊   | 15/22 [02:07<00:59,  8.51s/it]
 73%|███████▎  | 16/22 [02:15<00:51,  8.51s/it]
 77%|███████▋  | 17/22 [02:24<00:42,  8.53s/it]
 82%|████████▏ | 18/22 [02:32<00:34,  8.53s/it]
 86%|████████▋ | 19/22 [02:41<00:25,  8.54s/it]

289

In [14]:
retrieved_articles = get_relevant_documents(df_valid)
gc.collect()

Map (num_proc=2): 100%|██████████| 2781652/2781652 [01:47<00:00, 25791.61 examples/s]
  0%|          | 0/1 [00:00<?, ?it/s]

length of vectorizer vocab is 11222



  0%|          | 0/28 [00:00<?, ?it/s]
  4%|▎         | 1/28 [00:06<02:48,  6.23s/it]
  7%|▋         | 2/28 [00:12<02:36,  6.03s/it]
 11%|█         | 3/28 [00:18<02:29,  5.96s/it]
 14%|█▍        | 4/28 [00:23<02:22,  5.94s/it]
 18%|█▊        | 5/28 [00:29<02:14,  5.84s/it]
 21%|██▏       | 6/28 [00:35<02:08,  5.85s/it]
 25%|██▌       | 7/28 [00:41<02:01,  5.79s/it]
 29%|██▊       | 8/28 [00:46<01:54,  5.74s/it]
 32%|███▏      | 9/28 [00:52<01:48,  5.71s/it]
 36%|███▌      | 10/28 [00:57<01:41,  5.65s/it]
 39%|███▉      | 11/28 [01:03<01:36,  5.69s/it]
 43%|████▎     | 12/28 [01:09<01:29,  5.60s/it]
 46%|████▋     | 13/28 [01:14<01:22,  5.51s/it]
 50%|█████     | 14/28 [01:19<01:16,  5.50s/it]
 54%|█████▎    | 15/28 [01:25<01:11,  5.49s/it]
 57%|█████▋    | 16/28 [01:30<01:05,  5.44s/it]
 61%|██████    | 17/28 [01:36<01:00,  5.47s/it]
 64%|██████▍   | 18/28 [01:41<00:54,  5.42s/it]
 68%|██████▊   | 19/28 [01:46<00:48,  5.38s/it]

0

In [15]:
tokenizer = LongformerTokenizer.from_pretrained("./kaggle/input/longformer-race-model/longformer_qa_model")
model = LongformerForMultipleChoice.from_pretrained("./kaggle/input/longformer-race-model/longformer_qa_model").cuda()

In [16]:
predictions = []
submit_ids = []

for index in tqdm(range(df_valid.shape[0])):
    columns = df_valid.iloc[index].values
    submit_ids.append(columns[0])
    question = columns[1]
    options = [columns[2], columns[3], columns[4], columns[5], columns[6]]
    context1 = f"{retrieved_articles[index][-4][2]}\n{retrieved_articles[index][-3][2]}\n{retrieved_articles[index][-2][2]}\n{retrieved_articles[index][-1][2]}"
    context2 = f"{retrieved_articles_parsed[index][-3][2]}\n{retrieved_articles_parsed[index][-2][2]}\n{retrieved_articles_parsed[index][-1][2]}"
    inputs1 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context1,
        )
    inputs2 = prepare_answering_input(
        tokenizer=tokenizer, question=question,
        options=options, context=context2,
        )
    
    # Score the five options against each context; softmax turns logits into probabilities.
    with torch.no_grad():
        outputs1 = model(**inputs1)
        probability1 = torch.softmax(outputs1.logits[0].detach().cpu(), dim=-1)

    with torch.no_grad():
        outputs2 = model(**inputs2)
        probability2 = torch.softmax(outputs2.logits[0].detach().cpu(), dim=-1)

        
    probability_ = (probability1 + probability2)/2

    if probability_.max() > 0.4:
        predict = np.array(list("ABCDE"))[np.argsort(probability_)][-3:].tolist()[::-1]
    else:
        predict = backup_model_predictions.iloc[index].prediction.replace(" ","")
    predictions.append(predict)

predictions = [" ".join(i) for i in predictions]

100%|██████████| 200/200 [03:06<00:00,  1.07it/s]
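
The decision rule in the loop above (average the two softmax distributions, keep the top three options if the best one clears 0.4, otherwise defer to the backup model) can be sketched in isolation. This is a minimal pure-Python rendering under assumptions: `predict_top3` and the example logits are hypothetical, and the fallback is taken to be a space-separated string such as `"E D C"`, as the handling of `submission_backup.csv` suggests.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_top3(logits_a, logits_b, fallback, threshold=0.4, labels="ABCDE"):
    # Average the per-context probabilities, then apply the confidence gate.
    probs_a = softmax(logits_a)
    probs_b = softmax(logits_b)
    mean_probs = [(a + b) / 2 for a, b in zip(probs_a, probs_b)]
    if max(mean_probs) > threshold:
        ranked = sorted(range(len(labels)), key=lambda i: mean_probs[i], reverse=True)
        return [labels[i] for i in ranked[:3]]
    # Low confidence: reuse the backup model's answer string.
    return list(fallback.replace(" ", ""))


confident = predict_top3(
    [4.0, 1.0, 0.5, 0.0, -1.0], [3.0, 2.0, 0.0, 0.0, 0.0], fallback="E D C"
)
uncertain = predict_top3([0.0] * 5, [0.1, 0.0, 0.0, 0.0, 0.0], fallback="E D C")
print(confident, uncertain)
```

In the first call both contexts agree strongly on option A, so the averaged ranking is used. In the second, the averaged distribution is close to uniform, so the backup prediction is returned unchanged.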


In [17]:
pd.DataFrame({'id':submit_ids,'prediction':predictions}).to_csv('submission.csv', index=False)

In [19]:
from functions import mapk
df = pd.read_csv('submission.csv')
answer_df = pd.read_csv('datasets/train.csv')
answer = answer_df['answer'].tolist()
df['prediction'] = df['prediction'].str.split()
prediction= df['prediction'].tolist()
res = mapk(answer, prediction, 3)
print(res)

0.9658333333333334
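
The `mapk` helper imported from `functions` is not shown in this notebook. Assuming it computes mean average precision at 3 with a single correct answer per question, the metric reduces to the reciprocal rank of the correct option within the top three guesses, as in the sketch below. The names `average_precision_at_k`, `map_at_k`, and the toy data are hypothetical.

```python
def average_precision_at_k(actual, predicted, k=3):
    # With exactly one correct answer, AP@k is 1/rank of the first hit, or 0 if missed.
    for rank, guess in enumerate(predicted[:k], start=1):
        if guess == actual:
            return 1.0 / rank
    return 0.0


def map_at_k(actuals, predictions, k=3):
    # Mean of the per-question AP@k scores.
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actuals, predictions)]
    return sum(scores) / len(scores) if scores else 0.0


toy_answers = ["A", "B", "C"]
toy_predictions = [["A", "B", "C"], ["A", "B", "C"], ["A", "B", "D"]]
toy_score = map_at_k(toy_answers, toy_predictions)
print(toy_score)
```

Here the first question scores 1, the second scores 1/2 (correct answer in second place), and the third scores 0, giving a mean of 0.5.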
