# OpenBook DeBERTaV3-Large Baseline

Hi! This notebook is a merge of the following approaches:

**OpenBook Approach**
- https://www.kaggle.com/code/quangbk/open-book-llm-science-exam-reduced-ram-usage
- https://www.kaggle.com/code/jjinho/open-book-llm-science-exam

**DeBERTaV3-Large with extra data**
- https://www.kaggle.com/code/radek1/new-dataset-deberta-v3-large-training

This notebook combines the multiple-choice implementation from the HuggingFace library with the context retrieval of the OpenBook approach. It can be extended with billion-parameter LLMs and better retrieval methods.
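
To make the combination concrete, here is a minimal sketch of how a retrieved context could be paired with a question and its five options before tokenization for a multiple-choice model. The helper name `build_choice_pairs` and the `[SEP]` separator string are assumptions for illustration and are not taken from the linked notebooks.

```python
# Illustrative sketch: one shared context, paired with "question + option" five times.
# `build_choice_pairs` and the "[SEP]" separator are hypothetical, not the linked notebooks' code.
OPTIONS = "ABCDE"

def build_choice_pairs(row):
    """Return (first_sentences, second_sentences, label_index) for one question."""
    first_sentences = [row["context"]] * len(OPTIONS)
    second_sentences = [f"{row['prompt']} [SEP] {row[opt]}" for opt in OPTIONS]
    label = OPTIONS.index(row["answer"])
    return first_sentences, second_sentences, label

example = {
    "context": "Retrieved Wikipedia sentences go here.",
    "prompt": "Which option is correct?",
    "A": "first", "B": "second", "C": "third", "D": "fourth", "E": "fifth",
    "answer": "E",
}
first, second, label = build_choice_pairs(example)
```

A tokenizer would then encode each `(first, second)` pair, and the model scores the five pairs against each other.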

In [1]:
!nvidia-smi

Wed Sep 20 09:47:12 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  A100-SXM-80GB       On   | 00000000:81:00.0 Off |                    0 |
| N/A   45C    P0    63W / 275W |      0MiB / 81252MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "1"

In [3]:
# # installing offline dependencies
# !pip install -U ./kaggle/input/faiss-gpu-173-python310/faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
# !cp -rf ./kaggle/input/sentence-transformers-222/sentence-transformers ./kaggle/working/sentence-transformers
# !pip install -U ./kaggle/working/sentence-transformers
# !pip install -U ./kaggle/input/blingfire-018/blingfire-0.1.8-py3-none-any.whl

# !pip install --no-index --no-deps ./kaggle/input/llm-whls/transformers-4.31.0-py3-none-any.whl
# !pip install --no-index --no-deps ./kaggle/input/llm-whls/peft-0.4.0-py3-none-any.whl
# !pip install --no-index --no-deps ./kaggle/input/llm-whls/datasets-2.14.3-py3-none-any.whl
# !pip install --no-index --no-deps ./kaggle/input/llm-whls/trl-0.5.0-py3-none-any.whl

In [4]:
from __future__ import annotations  # must be the first statement in the cell

import os
import gc
import pandas as pd
import numpy as np
import re
from tqdm.auto import tqdm
import blingfire as bf

from collections.abc import Iterable

import faiss
from faiss import write_index, read_index

from sentence_transformers import SentenceTransformer

import torch
import ctypes
libc = ctypes.CDLL("libc.so.6")

In [5]:
def process_documents(documents: Iterable[str],
                      document_ids: Iterable,
                      split_sentences: bool = True,
                      filter_len: int = 3,
                      disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Main helper function to process documents (here, Wikipedia articles).

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param split_sentences: Flag to determine whether to further split documents into sentences
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    
    df = sectionize_documents(documents, document_ids, disable_progress_bar)

    if split_sentences:
        df = sentencize(df.text.values, 
                        df.document_id.values,
                        df.offset.values, 
                        filter_len, 
                        disable_progress_bar)
    return df


def sectionize_documents(documents: Iterable[str],
                         document_ids: Iterable,
                         disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Wraps each document as a single section spanning the whole text.
    No splitting happens here; the offset is always `(0, len(document))`.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """
    processed_documents = []
    for document_id, document in tqdm(zip(document_ids, documents), total=len(documents), disable=disable_progress_bar):
        row = {}
        text, start, end = (document, 0, len(document))
        row['document_id'] = document_id
        row['text'] = text
        row['offset'] = (start, end)

        processed_documents.append(row)

    _df = pd.DataFrame(processed_documents)
    if _df.shape[0] > 0:
        return _df.sort_values(['document_id', 'offset']).reset_index(drop=True)
    else:
        return _df


def sentencize(documents: Iterable[str],
               document_ids: Iterable,
               offsets: Iterable[tuple[int, int]],
               filter_len: int = 3,
               disable_progress_bar: bool = False) -> pd.DataFrame:
    """
    Split a document into sentences. Can be used with `sectionize_documents`
    to further split documents into more manageable pieces. Takes in offsets
    to ensure that after splitting, the sentences can be matched to the
    location in the original documents.

    :param documents: Iterable containing documents which are strings
    :param document_ids: Iterable containing document unique identifiers
    :param offsets: Iterable of (start, end) tuples locating each document in the original text
    :param filter_len: Minimum character length of a sentence (otherwise filter out)
    :param disable_progress_bar: Flag to disable tqdm progress bar
    :return: Pandas DataFrame containing the columns `document_id`, `text`, `offset`
    """

    document_sentences = []
    for document, document_id, offset in tqdm(zip(documents, document_ids, offsets), total=len(documents), disable=disable_progress_bar):
        try:
            _, sentence_offsets = bf.text_to_sentences_and_offsets(document)
            for o in sentence_offsets:
                if o[1]-o[0] > filter_len:
                    sentence = document[o[0]:o[1]]
                    abs_offsets = (o[0]+offset[0], o[1]+offset[0])
                    row = {}
                    row['document_id'] = document_id
                    row['text'] = sentence
                    row['offset'] = abs_offsets
                    document_sentences.append(row)
        except Exception:
            # Skip documents that fail sentence splitting
            continue
    return pd.DataFrame(document_sentences)
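
The offset arithmetic in `sentencize` can be seen in isolation with a small sketch. It uses a naive regex splitter as a stand-in for blingfire (an assumption made so the example has no dependencies; blingfire's actual segmentation differs), and the name `split_with_offsets` is hypothetical.

```python
import re

def split_with_offsets(text, base_offset=0, filter_len=3):
    """Naive regex splitter (NOT blingfire) that returns sentences with absolute offsets."""
    rows = []
    # A "sentence" here is a run of characters ending in ., ! or ? (or at end of text).
    for m in re.finditer(r"[^.!?]+[.!?]?", text):
        raw = m.group()
        stripped = raw.strip()
        if len(stripped) <= filter_len:
            continue  # same rule as above: keep only sentences longer than filter_len
        start = m.start() + (len(raw) - len(raw.lstrip()))
        end = start + len(stripped)
        # Shift by the section's start so offsets refer to the original document.
        rows.append({"text": stripped, "offset": (start + base_offset, end + base_offset)})
    return rows

text = "FAISS is fast. Ok. It indexes vectors."
rows = split_with_offsets(text, base_offset=100)
```

The short sentence `Ok.` is dropped by the length filter, and every remaining offset points back into the source text once `base_offset` is subtracted.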

In [6]:
SIM_MODEL = './kaggle/input/sentencetransformers-allminilml6v2/sentence-transformers_all-MiniLM-L6-v2'
# SIM_MODEL = './kaggle/input/all-mpnet-base-v2'
# SIM_MODEL = './kaggle/input/bge-large-zh'
# SIM_MODEL = './kaggle/input/bge-small-en'

DEVICE = 0
MAX_LENGTH = 384
BATCH_SIZE = 32

WIKI_PATH = "./kaggle/input/wikipedia-20230701"
wiki_files = os.listdir(WIKI_PATH)

# Relevant Title Retrieval

In [7]:
# add data path to generate context
# data_paths= './kaggle/input/60k-data-with-context-v2/110k_after_46k.csv'

In [8]:
## Import training data
# trn = pd.concat([
#     # pd.read_csv(data_paths),
#     pd.read_csv("./kaggle/input/kaggle-llm-science-exam/train.csv"),
#     pd.read_csv('./kaggle/input/additional-train-data-for-llm-science-exam/6000_train_examples.csv'), # 500
#     pd.read_csv('./kaggle/input/additional-train-data-for-llm-science-exam/extra_train_set.csv'), # 6k
#     pd.read_csv("./kaggle/input/wikipedia-stem-1k/stem_1k_v1.csv"), # 1k
#     pd.read_csv("./kaggle/input/eduqg_llm_formatted_34k/eduqg_llm_formatted.csv"), # 3k # Not working
#     pd.read_csv("./kaggle/input/15k_gpt3.5-turbo/15k_gpt3.5-turbo.csv"), # 1k
#     pd.read_csv("./kaggle/input/15k_gpt3.5-turbo/5900_examples.csv"), # 5.9k
#     pd.read_csv("./kaggle/input/110k_MMLU_dataset/110k_MMLU_dataset.csv"), # 140k # Not working
#     pd.read_csv("./kaggle/input/llm-science-3k-data/llm-science-3k-data.csv"), # 3k    
# ])
trn = pd.read_csv('./kaggle/input/my-dataset-collect-processed/ALL_merged_dataset_60k.csv')
# trn.drop(columns=['id'], inplace=True)
trn.reset_index(inplace=True,drop=True)
# trn.drop(columns=['source', 'context'],inplace=True)
trn.shape

(59396, 7)

In [9]:
# max_length = trn['wikipedia_excerpt'].str.len().max()
# print(max_length)

In [10]:
trn.head(5)

Unnamed: 0,prompt,A,B,C,D,E,answer
0,Where did William John Strang work for most of...,William John Strang worked at the British Aero...,William John Strang worked at the Aircraft Res...,William John Strang worked at the Civil Aviati...,William John Strang worked at the Royal Academ...,William John Strang worked at the Bristol Aero...,E
1,What was William John Strang's role at the Bri...,William John Strang worked in the Aerodynamics...,William John Strang worked as the Chief Design...,William John Strang worked in the Stress and P...,William John Strang worked on the development ...,William John Strang worked as a Technical Dire...,C
2,What was the major contribution of William Joh...,William John Strang made substantial aerodynam...,William John Strang worked on the development ...,William John Strang was involved in the design...,William John Strang worked on the feasibility ...,William John Strang worked on the development ...,B
3,What was the joint project between the British...,The joint project between the British and Fren...,The joint project between the British and Fren...,The joint project between the British and Fren...,The joint project between the British and Fren...,The joint project between the British and Fren...,B
4,What was William John Strang's role in the joi...,William John Strang was the technical director...,William John Strang was the director and chief...,William John Strang was the chief designer of ...,William John Strang was the chairman of the Ci...,William John Strang was the technical director...,B


In [11]:
trn['answer'].value_counts()

answer
A    13822
B    12268
C    12040
D    11225
E    10041
Name: count, dtype: int64

In [12]:
trn.isnull().sum()

prompt       0
A         2799
B         2675
C         2684
D         2793
E         2742
answer       0
dtype: int64

In [13]:
model = SentenceTransformer(SIM_MODEL, device='cuda')
model.max_seq_length = MAX_LENGTH
model = model.half()

In [14]:
sentence_index = read_index("./kaggle/input/wikipedia-2023-07-faiss-index/wikipedia_202307.index")
# sentence_index = read_index("./kaggle/input/all-mp-net-base-v2-embedings/wikipedia_embs_768_all-mp-net-base-v2_faiss.index")
# sentence_index = read_index("./kaggle/input/bge-small-en-wiki-embedding/wikipedia_embeddings.index")

# cpus to all gpus
# sentence_index = faiss.index_cpu_to_all_gpus(sentence_index)

In [15]:
print(sentence_index.d)

384


In [16]:
prompt_embeddings = model.encode(trn.prompt.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
prompt_embeddings = prompt_embeddings.detach().cpu().numpy()
_ = gc.collect()

Batches:   0%|          | 0/1857 [00:00<?, ?it/s]

In [17]:
prompt_embeddings.shape

(59396, 384)
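
Because `normalize_embeddings=True` yields unit-length vectors, ranking by inner product and ranking by L2 distance give the same order, since ||a - b||^2 = 2 - 2 a·b. The sketch below checks this with plain Python lists and made-up 3-dimensional vectors; which metric the prebuilt FAISS index uses is not stated in this notebook.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up vectors, for illustration only.
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [-1.0, 0.3, 2.0], [0.2, 1.0, 1.0])]

# Highest inner product first vs. smallest squared L2 distance first.
by_ip = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: sq_l2(query, docs[i]))
```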

In [18]:
# prompt_embeddings = prompt_embeddings.astype(np.float32)

In [19]:
## Get the top 20 pages that are likely to contain the topic of interest
search_score, search_index = sentence_index.search(prompt_embeddings, 20)

In [20]:
## Save memory - delete sentence_index since it is no longer necessary
del sentence_index
del prompt_embeddings
_ = gc.collect()
libc.malloc_trim(0)

1
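
The `1` printed above is the return value of `malloc_trim`, which reports whether glibc handed any memory back to the operating system. A guarded version of this cleanup step might look like the sketch below; the helper name `release_memory` is hypothetical, and the guards assume that `malloc_trim` is only available on glibc-based Linux systems.

```python
import ctypes
import gc
import sys

def release_memory():
    """Run the garbage collector, then ask glibc to return freed heap pages to the OS."""
    gc.collect()
    if not sys.platform.startswith("linux"):
        return None  # malloc_trim is a glibc extension
    try:
        libc = ctypes.CDLL("libc.so.6")
        return libc.malloc_trim(0)  # 1 if memory was released, 0 otherwise
    except (OSError, AttributeError):
        return None  # e.g. a Linux system whose C library is not glibc

result = release_memory()
```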

# Getting Sentences from the Relevant Titles

In [21]:
df = pd.read_parquet("./kaggle/input/wikipedia-20230701/wiki_2023_index.parquet",
                     columns=['id', 'file'])

In [22]:
## Get the article and associated file location using the index
wikipedia_file_data = []

for i, idx in tqdm(enumerate(search_index), total=len(search_index)):
    _df = df.loc[idx].copy()
    _df['prompt_id'] = i
    wikipedia_file_data.append(_df)
wikipedia_file_data = pd.concat(wikipedia_file_data).reset_index(drop=True)
wikipedia_file_data = wikipedia_file_data[['id', 'prompt_id', 'file']].drop_duplicates().sort_values(['file', 'id']).reset_index(drop=True)

## Save memory - delete df since it is no longer necessary
del df
_ = gc.collect()
libc.malloc_trim(0)

  0%|          | 0/59396 [00:00<?, ?it/s]

1
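
The point of deduplicating and sorting by `file` is that each Wikipedia parquet file then needs to be opened only once. The idea can be sketched with plain Python containers; the lookup table, file names, and search results below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical lookup: index row -> (article id, parquet file).
article_table = {0: ("101", "a.parquet"), 1: ("102", "a.parquet"), 2: ("201", "b.parquet")}
# Hypothetical search result: one list of index rows per prompt.
search_index = [[0, 2], [2, 1], [0, 0]]

# Collect unique (article id, prompt id, file) triples, like drop_duplicates().
triples = set()
for prompt_id, rows in enumerate(search_index):
    for row in rows:
        article_id, file = article_table[row]
        triples.add((article_id, prompt_id, file))

# Group article ids by file so every file is read a single time.
ids_by_file = defaultdict(set)
for article_id, prompt_id, file in sorted(triples, key=lambda t: (t[2], t[0])):
    ids_by_file[file].add(article_id)
```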

In [23]:
## Get the full text data
wiki_text_data = []

for file in tqdm(wikipedia_file_data.file.unique(), total=len(wikipedia_file_data.file.unique())):
    _id = [str(i) for i in wikipedia_file_data[wikipedia_file_data['file']==file]['id'].tolist()]
    _df = pd.read_parquet(f"{WIKI_PATH}/{file}", columns=['id', 'text'])

    _df_temp = _df[_df['id'].isin(_id)].copy()
    del _df
    _ = gc.collect()
    libc.malloc_trim(0)
    wiki_text_data.append(_df_temp)
wiki_text_data = pd.concat(wiki_text_data).drop_duplicates().reset_index(drop=True)
_ = gc.collect()

  0%|          | 0/28 [00:00<?, ?it/s]

In [24]:
## Parse documents into sentences
processed_wiki_text_data = process_documents(wiki_text_data.text.values, wiki_text_data.id.values)

  0%|          | 0/657409 [00:00<?, ?it/s]

  0%|          | 0/657409 [00:00<?, ?it/s]

In [25]:
## Get embeddings of the wiki text data
wiki_data_embeddings = model.encode(processed_wiki_text_data.text,
                                    batch_size=BATCH_SIZE,
                                    device=DEVICE,
                                    show_progress_bar=True,
                                    convert_to_tensor=True,
                                    normalize_embeddings=True)#.half()
wiki_data_embeddings = wiki_data_embeddings.detach().cpu().numpy()

Batches:   0%|          | 0/602408 [00:00<?, ?it/s]

In [26]:
# wiki_data_embeddings = wiki_data_embeddings.astype(np.float32)

In [27]:
_ = gc.collect()

In [28]:
trn.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59396 entries, 0 to 59395
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   prompt  59396 non-null  object
 1   A       56597 non-null  object
 2   B       56721 non-null  object
 3   C       56712 non-null  object
 4   D       56603 non-null  object
 5   E       56654 non-null  object
 6   answer  59396 non-null  object
dtypes: object(7)
memory usage: 3.2+ MB


In [29]:
## Combine all answers
# trn['answer_all'] = trn.apply(lambda x: " ".join([x['A'], x['B'], x['C'], x['D'], x['E']]), axis=1)
trn['answer_all'] = trn.apply(lambda x: " ".join([str(x['A']), str(x['B']), str(x['C']), str(x['D']), str(x['E'])]), axis=1)


## Search using the prompt and answers to guide the search
trn['prompt_answer_stem'] = trn['prompt'] + " " + trn['answer_all']
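
Note that `str()` turns a missing option into the literal text `nan`, which then becomes part of the search query (the `isnull()` output above shows roughly 2,700 missing values per option column). One possible alternative, sketched here on plain dictionaries with hypothetical helper names, is to skip missing options when building the query.

```python
import math

def is_missing(value):
    """True for None or a float NaN, which is how pandas represents empty CSV cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def build_query(row):
    """Join the prompt with the options that are actually present."""
    options = [str(row[key]) for key in "ABCDE" if not is_missing(row[key])]
    return " ".join([row["prompt"]] + options)

row = {"prompt": "Q?", "A": "a", "B": float("nan"), "C": "c", "D": "d", "E": "e"}
query = build_query(row)
naive = " ".join([row["prompt"]] + [str(row[key]) for key in "ABCDE"])
```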

In [30]:
question_embeddings = model.encode(trn.prompt_answer_stem.values, batch_size=BATCH_SIZE, device=DEVICE, show_progress_bar=True, convert_to_tensor=True, normalize_embeddings=True)
question_embeddings = question_embeddings.detach().cpu().numpy()

Batches:   0%|          | 0/1857 [00:00<?, ?it/s]

In [31]:
# question_embeddings = question_embeddings.astype(np.float32)

In [32]:
# wiki_data_embeddings = wiki_data_embeddings.astype(np.float32)

# Extracting Matching Prompt-Sentence Pairs

In [34]:
## Parameter to determine how many relevant sentences to include
NUM_SENTENCES_INCLUDE = 50

## List containing just Context
contexts = []

for r in tqdm(trn.itertuples(), total=len(trn)):
    prompt_id = r.Index

    prompt_indices = processed_wiki_text_data[processed_wiki_text_data['document_id'].isin(wikipedia_file_data[wikipedia_file_data['prompt_id']==prompt_id]['id'].values)].index.values

    ## Reset for every prompt so a prompt without candidates gets an empty context
    context = ""

    if prompt_indices.shape[0] > 0:
        prompt_index = faiss.index_factory(wiki_data_embeddings.shape[1], "Flat")
        prompt_index.add(wiki_data_embeddings[prompt_indices])

        ## Get the top matches for this prompt only (not for every question)
        ss, ii = prompt_index.search(question_embeddings[prompt_id:prompt_id + 1], NUM_SENTENCES_INCLUDE)
        candidate_texts = processed_wiki_text_data.loc[prompt_indices, 'text']
        for _s, _i in zip(ss[0], ii[0]):
            ## FAISS pads with -1 when fewer than NUM_SENTENCES_INCLUDE candidates exist
            if _i >= 0:
                context += candidate_texts.iloc[_i] + " "

    contexts.append(context)

  0%|          | 0/59396 [00:00<?, ?it/s]
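
What the per-prompt step computes can be shown without FAISS: score every candidate sentence against a single query vector and keep the best `k`. The sketch below is a brute-force stand-in using inner products on made-up 2-dimensional vectors; a `"Flat"` index also searches exhaustively, and for unit-length vectors the ordering matches L2 distance. The function name `top_k_sentences` is hypothetical.

```python
def top_k_sentences(query, sentences, embeddings, k):
    """Return up to k sentences ranked by inner product with the query vector."""
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in embeddings]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Slicing never pads: with fewer than k candidates, all of them are returned.
    return [sentences[i] for i in order[:k]]

# Made-up unit vectors, for illustration only.
sentences = ["s0", "s1", "s2"]
embeddings = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
query = [1.0, 0.0]

best_two = top_k_sentences(query, sentences, embeddings, k=2)
all_ranked = top_k_sentences(query, sentences, embeddings, k=50)
```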

IOPub message rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--ServerApp.iopub_msg_rate_limit`.

Current values:
ServerApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
ServerApp.rate_limit_window=3.0 (secs)

In [35]:
trn['context'] = contexts

In [36]:
max_length = trn['context'].str.len().max()
print(max_length)

1110910
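
A longest context of more than a million characters suggests capping the context size at some point. One option, sketched here under the assumption that the ranked sentences are still available as a list, is to stop adding sentences once a character budget is reached; the name `cap_context` and the budget values are hypothetical.

```python
def cap_context(sentences, max_chars):
    """Join ranked sentences in order, stopping before the character budget is exceeded."""
    kept, used = [], 0
    for sentence in sentences:
        extra = len(sentence) + (1 if kept else 0)  # +1 for the joining space
        if used + extra > max_chars:
            break
        kept.append(sentence)
        used += extra
    return " ".join(kept)

capped = cap_context(["aaaa", "bbb", "cc"], max_chars=8)
too_small = cap_context(["aaaa"], max_chars=3)
```

Since the sentences arrive best match first, this keeps the most relevant ones and drops the tail.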


In [37]:
len(trn)

59396

In [38]:
# trn = trn[trn['context'].str.len() <= 5000].reset_index(drop=True)

In [39]:
len(trn)

59396

In [40]:
# data_paths.split('.csv')[0]

In [41]:
trn[["prompt", "context", "A", "B", "C", "D", "E", "answer"]].to_csv('ALL_merged_dataset_99k_all_mini_sentence20_include50.csv', index=False)

In [42]:
trn[["prompt", "context", "A", "B", "C", "D", "E"]]['prompt'][0]

'Where did William John Strang work for most of his professional career?'

In [None]:
!nvidia-smi