<a href="https://www.kaggle.com/code/marcellemmer/context-creation?scriptVersionId=153621632" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Context creation using Wikipedia

We downloaded the 13GB Wikipedia Plaintext (2023-07-01) dataset from Kaggle. It contains all Wikipedia articles up to that date, stored in parquet files. We use only the wiki_2023_index.parquet file, which contains the first sentences of the articles, as context for our model. From this file we assign a context column to each question of the Q&A dataframe. The Sentence Transformers library embeds the Wikipedia sentences, and Faiss runs a similarity search between each question and the first sentences of the articles. For each question we retrieve the most similar Wikipedia article.

The notebook runs on the 2xT4 GPUs provided by Kaggle.
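
The retrieval idea described above can be sketched without any GPU or external library. The snippet below is only a toy illustration: `embed` is a hypothetical stand-in for the sentence embedding model (it uses normalized word counts instead of a neural network), and `cosine`, `retrieve` and the three passages are made up for this example. The real notebook uses Sentence Transformers for the embedding and Faiss for the search.

```python
import math
from collections import Counter

def embed(text):
    """Hypothetical stand-in for a sentence embedding: unit-length word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}

def cosine(a, b):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())

# Made-up passages playing the role of the Wikipedia first sentences
passages = [
    "a galaxy is a system of stars gas and dark matter",
    "the mitochondrion is an organelle that produces energy in cells",
    "photosynthesis converts light energy into chemical energy in plants",
]
passage_vectors = [embed(p) for p in passages]

def retrieve(question, k=1):
    """Return the k passages most similar to the question."""
    q = embed(question)
    scores = [cosine(q, v) for v in passage_vectors]
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [passages[i] for i in ranked[:k]]

print(retrieve("which organelle produces energy in cells")[0])
```

Because the vectors have unit length, ranking by cosine similarity and ranking by L2 distance give the same order, since the squared distance between two unit vectors equals 2 minus twice their dot product.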

## Sources

* https://www.kaggle.com/datasets/jjinho/wikipedia-20230701/data?select=h.parquet

* https://github.com/facebookresearch/faiss/wiki

In [None]:
!pip install datasets
!pip install faiss-gpu sentence-transformers

In [None]:
# Importing the libraries
import os
import numpy as np
import pandas as pd
import pyarrow.parquet as pq
from datasets import load_dataset, Dataset
import faiss
from sentence_transformers import SentenceTransformer
import torch
from tqdm import tqdm

# Set CUDA visible devices to GPU 0 and 1 (The 2xT4 GPUs that Kaggle provides)
os.environ["CUDA_VISIBLE_DEVICES"]="0,1"

# Key parameters of the notebook
# Sentence Transformers model used to embed the texts
SIM_MODEL = 'all-MiniLM-L6-v2'
# Set device
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
# Number of contexts embedded and added to the index per chunk
BATCH_SIZE = 500_000

# Load the questions
qna_df = pd.read_csv("/kaggle/input/kaggle-llm-science-exam/train.csv",index_col=0)

qna_df.head()

One possible source of context is the 'wiki_2023_index.parquet' file alone, which has the first sentences of each article. This is what we considered first, but most of the sentences were cut off in the middle, providing little to no useful information about the topic.

In [None]:
# Load Parquet files into a Hugging Face dataset
# Source: https://www.kaggle.com/datasets/jjinho/wikipedia-20230701/data?select=wiki_2023_index.parquet
# You can use split='train[:1000]' to test whether the files load properly
wiki_dataset = load_dataset('parquet',
                            data_files={'train': "/kaggle/input/wikipedia-20230701/wiki_2023_index.parquet"},
                            split='train[:4000]') # 1.76GB file

Another possibility is to use all the other files, which contain the full articles, but to load only the first two sentences of each article.

In [None]:
# Specify the folder containing the Parquet files
folder_path = "/kaggle/input/wikipedia-20230701/"

# Get a list of all Parquet files in the folder
parquet_files = [file for file in os.listdir(folder_path)
                 if file.endswith(".parquet")
                 and file != "wiki_2023_index.parquet"
                 and file != "number.parquet"
                 and file != "other.parquet"]

# Initialize an empty list to store the selected text from each Parquet file
selected_text = []

# Function to extract the first two sentences from a text
def extract_first_two_sentences(text):
    sentences = text.split('. ')[:2]
    return '. '.join(sentences)

# Iterate over each Parquet file, read only the 'text' column, extract the first two sentences, and append to the list
for parquet_file in tqdm(parquet_files, desc="Loading and Processing Parquet Files", unit="file"):
    file_path = os.path.join(folder_path, parquet_file)
    df = pq.read_table(file_path, columns=['text']).to_pandas()
    df['context'] = df['text'].apply(extract_first_two_sentences)
    selected_text.append(df['context'])

# Concatenate all selected texts into a single-column DataFrame
combined_selected_text = pd.DataFrame(pd.concat(selected_text, ignore_index=True))

# Display the first rows of the combined DataFrame
print(combined_selected_text.head())
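
Splitting on `'. '` is a simple heuristic, and it helps to see what it returns on a few inputs. The sketch below repeats the function from the cell above and applies it to made-up example strings (the strings are assumptions for illustration, not taken from the dataset).

```python
def extract_first_two_sentences(text):
    """Same rule as in the notebook: split on '. ' and keep the first two pieces."""
    sentences = text.split('. ')[:2]
    return '. '.join(sentences)

# Regular case: two sentences are kept, but the closing period of the second is lost
article = "Faiss is a library. It does similarity search. It is written in C++."
print(extract_first_two_sentences(article))
# Faiss is a library. It does similarity search

# Abbreviations also contain '. ', so the cut can land in the middle of a sentence
abbreviated = "Dr. Smith was born in the U.S. in 1950. He studied physics."
print(extract_first_two_sentences(abbreviated))
# Dr. Smith was born in the U.S

# Texts with fewer than two sentences are returned unchanged
print(extract_first_two_sentences("Single sentence only."))
# Single sentence only.
```

If cuts like the second one turn out to be frequent, a dedicated sentence tokenizer would be a possible replacement, at the cost of slower preprocessing.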

In [None]:
# convert the pandas dataframe into a huggingface dataset
wiki_dataset = Dataset.from_pandas(combined_selected_text)
# Free up memory
del selected_text, combined_selected_text

In [None]:
wiki_dataset

In [None]:
# Load pre-trained sentence transformer model
model = SentenceTransformer(SIM_MODEL)

# Path where the Faiss index is saved (the index itself is built in memory)
index_path = "/kaggle/working/faiss_index"
dimension = model.get_sentence_embedding_dimension()

# Define parameters for IVF index
nlist = 100  # Number of clusters (adjust as needed)
quantizer = faiss.IndexFlatL2(dimension)
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_L2)

# Iterate over the dataset in batches
for i in range(0, len(wiki_dataset), BATCH_SIZE):
    # Encode the context sentences using the SentenceTransformer model.
    # The embeddings are L2-normalized, so L2 distance ranks like cosine similarity.
    # Faiss expects float32 arrays.
    context_embeddings_np = model.encode(wiki_dataset[i:i+BATCH_SIZE]['context'],
                                         device=DEVICE,
                                         show_progress_bar=True,
                                         convert_to_numpy=True,
                                         normalize_embeddings=True).astype('float32')

    # Train the IVF clusters once, on the first batch
    if not index.is_trained:
        index.train(context_embeddings_np)

    # Add the embeddings with their global row positions as IDs,
    # so that search results index directly into wiki_dataset
    ids = np.arange(i, i + context_embeddings_np.shape[0], dtype='int64')
    index.add_with_ids(context_embeddings_np, ids)

    # Free up memory
    del context_embeddings_np

# Save the index once all batches have been added
faiss.write_index(index, index_path)
    
    
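# Function to retrieve most similar documents
def retrieve_most_similar(query, k=20):
    """Return the IDs of the k indexed contexts closest to the query string."""
    # The progress bar is disabled because the function is called once per question
    query_embedding = model.encode(query, device=DEVICE, show_progress_bar=False, convert_to_tensor=True, normalize_embeddings=True)
    query_embedding = query_embedding.reshape(1, -1)  # Reshape for Faiss
    query_embedding = query_embedding.detach().cpu().numpy()
    # An IVF index scans only index.nprobe clusters per query (Faiss default: 1);
    # raising nprobe trades speed for recall
    _, idx = index.search(query_embedding, k)
    return idx[0]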

# Example usage
query_text = qna_df['prompt'][0]
print(f'example prompt {query_text}')
similar_documents_indices = retrieve_most_similar(query_text)

# Print similar documents
for idx in similar_documents_indices:
    print(wiki_dataset[int(idx)]['context'])
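
To give some intuition for the `nlist` parameter, the sketch below mimics the idea behind an IVF index in plain NumPy: vectors are grouped by their nearest centroid, and a query is compared only against the vectors in the closest groups. This is a toy illustration under simplifying assumptions, not how Faiss is implemented: the data is random, the centroids are fixed by hand instead of being learned with k-means, and `squared_l2`, `ivf_search` and `brute_force_search` are hypothetical helper names. The `nprobe` argument plays the role of the `nprobe` attribute of Faiss IVF indexes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two well-separated groups of 4-d vectors standing in for embeddings
vectors = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 4)),
    rng.normal(5.0, 0.1, size=(50, 4)),
]).astype("float32")

# Pretend these centroids came out of the training step (here nlist = 2)
centroids = np.array([[0, 0, 0, 0], [5, 5, 5, 5]], dtype="float32")

def squared_l2(points, query):
    """Squared Euclidean distance from every row of `points` to `query`."""
    return ((points - query) ** 2).sum(axis=1)

# Inverted lists: for each centroid, the row ids of the vectors closest to it
assignments = np.array([squared_l2(centroids, v).argmin() for v in vectors])
inverted_lists = [np.flatnonzero(assignments == c) for c in range(len(centroids))]

def ivf_search(query, k=3, nprobe=1):
    """Search only the `nprobe` clusters whose centroids are nearest to the query."""
    probed = np.argsort(squared_l2(centroids, query))[:nprobe]
    candidates = np.concatenate([inverted_lists[c] for c in probed])
    order = np.argsort(squared_l2(vectors[candidates], query))[:k]
    return candidates[order]

def brute_force_search(query, k=3):
    """Exhaustive search over all vectors, for comparison."""
    return np.argsort(squared_l2(vectors, query))[:k]

query = np.full(4, 5.0, dtype="float32")
print(ivf_search(query, nprobe=1))  # compares the query with 50 candidates
print(brute_force_search(query))    # compares the query with all 100 vectors
```

In this toy setup the clusters are cleanly separated, so probing one cluster finds the same neighbours as the exhaustive search. With real embeddings the nearest neighbour can sit in a neighbouring cluster, which is why probing more clusters usually improves recall at the cost of speed.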

In [None]:
# Create the context column from the Wikipedia articles
# Create an empty list to store the context for each prompt
context_list = []

# Loop through each prompt in the qna_df dataframe
for query_text in qna_df['prompt']:
    # Only the best match is used, so a single neighbour is enough
    similar_documents_indices = retrieve_most_similar(query_text, k=1)

    # Take the context of the most similar Wikipedia entry
    context = wiki_dataset[int(similar_documents_indices[0])]['context']
    context_list.append(context)

# Add the context_list as a new column "context" to the qna_df DataFrame
qna_df['context'] = context_list

# Save the Q&A DataFrame to a CSV file
qna_df.to_csv('openbook-qna-data.csv', index=False)

# Display the modified DataFrame
qna_df.head()
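
The lookup `wiki_dataset[int(idx)]` relies on one assumption: the ID stored in the Faiss index for a context must equal its row position in `wiki_dataset`. When embeddings are added in batches, this means the IDs of each batch have to continue where the previous batch stopped. The sketch below illustrates this with toy sizes; `global_ids`, `n_rows` and `batch_size` are hypothetical names and values chosen for the example.

```python
import numpy as np

def global_ids(batch_start, batch_len):
    """IDs for one batch: row positions in the full dataset, as int64 (the ID type Faiss uses)."""
    return np.arange(batch_start, batch_start + batch_len, dtype=np.int64)

n_rows, batch_size = 10, 4  # toy sizes, assumptions for illustration

ids_per_batch = [global_ids(start, min(batch_size, n_rows - start))
                 for start in range(0, n_rows, batch_size)]
all_ids = np.concatenate(ids_per_batch)
print(all_ids)  # [0 1 2 3 4 5 6 7 8 9]

# Restarting at 0 in every batch would give several rows the same ID,
# so a search result could no longer be mapped back to a unique row
restarted = np.concatenate([np.arange(len(batch)) for batch in ids_per_batch])
print(len(np.unique(restarted)), "unique IDs for", n_rows, "rows")  # 4 unique IDs for 10 rows
```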

In [None]:
# Save Faiss index to a file
index_file_path = "wiki_faiss_index.index"  # Specify the path where you want to save the index
faiss.write_index(index, index_file_path)