## Fine-tune Embedding Models

Fine-tuning the embedding model is a critical step in enhancing the performance of RAG systems. These systems rely on retrieving relevant information from a corpus to augment the language model's generation capabilities. However, pre-trained embedding models are often trained on general-purpose datasets, which may not accurately capture the nuances and semantics specific to a particular domain or use case. Fine-tuning the embedding model on domain-specific data allows the RAG system to adapt to the target domain, improving the relevance and accuracy of retrieved information. 

This notebook prepares the dataset for the lab by generating synthetic questions about Amazon's and Google's 10-K filings. **A copy of the output dataset is already available in the `data` directory, so you don't have to run the notebook to start on the lab.** If you want to see how the data is prepared, execute each cell below in order.

In [None]:
!pip install -Uq PyPDF2==3.0.1
!pip install -Uq langchain==0.1.5

In [None]:
import langchain
from PyPDF2 import PdfReader 
import uuid
import json
import re

train_file = "google_10k.pdf"
val_file = "amazon_10k.pdf"

train_corpus_location = './data/train_corpus.json'
val_corpus_location = './data/val_corpus.json'

In [None]:
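def pdf_to_chunks(pdf_file, chunk_size=5000):
    """Split the text of a PDF into fixed-size character chunks keyed by UUID."""

    # Load the PDF
    pdf = PdfReader(pdf_file)

    # Extract the text of every page
    text = ""
    for page in pdf.pages:
        text += page.extract_text()

    # Collapse each run of non-alphanumeric characters
    # (punctuation, whitespace, underscores) into a single space
    text = re.sub(r'[\W_]+', ' ', text)

    # Initialize the chunks dict
    chunks = {}

    # Split the text into chunks of at most chunk_size characters
    for i in range(0, len(text), chunk_size):
        chunk = text[i:i+chunk_size]
        chunk_id = str(uuid.uuid4())
        chunks[chunk_id] = chunk

    # Return a dict mapping chunk ID to chunk text
    return chunks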

In [None]:
train_corpus = pdf_to_chunks(train_file, chunk_size=750)
val_corpus = pdf_to_chunks(val_file, chunk_size=750)
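
Because the text is cut at fixed character offsets, a chunk boundary can land in the middle of a word. The sketch below applies the same cleaning and slicing steps to a short sample string so the effect is easy to see. `text_to_chunks` and the sample sentence are hypothetical, introduced only for illustration; they are not part of the lab code.

```python
import re
import uuid

def text_to_chunks(text, chunk_size):
    """Hypothetical helper: same cleaning and slicing as pdf_to_chunks, without the PDF loading."""
    # Collapse runs of non-alphanumeric characters into a single space
    text = re.sub(r'[\W_]+', ' ', text)
    # Slice at fixed character offsets and key each chunk by a fresh UUID
    return {str(uuid.uuid4()): text[i:i + chunk_size]
            for i in range(0, len(text), chunk_size)}

# Sample sentence, for illustration only
sample = ("Securities registered pursuant to Section 12(b) of the Act: "
          "Class A Common Stock, $0.001 par value.")

# Dicts preserve insertion order, so the values come back in document order
chunks = list(text_to_chunks(sample, chunk_size=25).values())
for chunk in chunks:
    print(repr(chunk))
# The first chunk is 'Securities registered pur': the cut falls inside "pursuant"
```

If mid-word cuts turn out to hurt question quality, splitting on word boundaries or adding overlap between chunks are possible variations, though neither is used in this notebook.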

Save the corpus in the data folder.

In [None]:
with open(train_corpus_location, 'w+') as f:
    json.dump(train_corpus, f)

with open(val_corpus_location, 'w+') as f:
    json.dump(val_corpus, f)

### Generate synthetic queries

Next, we use Claude v2 (model ID `anthropic.claude-v2:1`) on Amazon Bedrock to generate questions, passing each text chunk in the corpus as context.

Each (generated question, source text chunk) pair becomes one data point in the fine-tuning dataset, used for either training or evaluation.
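
To make the pairing concrete, here is a minimal sketch of the three dictionaries the notebook builds and how (question, chunk) pairs can be read back out of them. The IDs and questions are hypothetical placeholders; the notebook itself uses UUIDs and model-generated questions.

```python
# Hypothetical IDs and questions, for illustration only
corpus = {
    "chunk-1": "Alphabet Inc 1600 Amphitheatre Parkway Mountain View CA 94043",
    "chunk-2": "Class A Common Stock 0 001 par value GOOGL Nasdaq Stock Market LLC",
}
queries = {
    "q-1": "Where are Alphabet's principal executive offices?",
    "q-2": "On which exchange is the Class A common stock listed?",
}
# Each question points to the list of chunk IDs it was generated from
relevant_docs = {"q-1": ["chunk-1"], "q-2": ["chunk-2"]}

# Recover the (question, chunk) pairs that make up the dataset
pairs = [
    (queries[q_id], corpus[doc_id])
    for q_id, doc_ids in relevant_docs.items()
    for doc_id in doc_ids
]
for question, chunk in pairs:
    print(f"{question!r} -> {chunk!r}")
```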

### > initialize boto3

In [None]:
import boto3
from botocore.config import Config
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from tqdm.notebook import tqdm

boto_config = Config(
        connect_timeout=1, read_timeout=300,
        retries={'max_attempts': 1})

boto_session = boto3.Session()

bedrock_runtime = boto_session.client(
    service_name = "bedrock-runtime", config=boto_config
)

### > initialize the model

In [None]:
# initialize the LLM model
model_id = 'anthropic.claude-v2:1'

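# Pricing of the LLM, per token (per-1,000-token rates divided by 1,000)
claudev2_input = 0.008 / 1000
claudev2_output = 0.024 / 1000

# Inference parameters passed to the model
inference_modifier = {
    "max_tokens_to_sample": 4096,
    "temperature": 0,
    "top_k": 250,
    "top_p": 1,
    "stop_sequences": ["\n\nHuman"],
}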
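
The per-token prices can be used for a rough cost estimate before running generation over the whole corpus. The sketch below is only an illustration: `estimate_cost` is a hypothetical helper, and the chunk count and token counts are assumed values, not measurements from this lab.

```python
# Per-token prices, copied from the cell above
claudev2_input = 0.008 / 1000
claudev2_output = 0.024 / 1000

def estimate_cost(num_chunks, input_tokens_per_call, output_tokens_per_call):
    """Hypothetical helper: estimated total cost of one model call per chunk."""
    input_cost = num_chunks * input_tokens_per_call * claudev2_input
    output_cost = num_chunks * output_tokens_per_call * claudev2_output
    return input_cost + output_cost

# Assumed workload: 1,000 chunks, about 300 prompt tokens
# and about 60 generated tokens per call
cost = estimate_cost(1000, 300, 60)
print(f"Estimated cost: {cost:.2f}")  # prints "Estimated cost: 3.84"
```

The result is in whatever currency the per-token rates are quoted in. Actual token counts depend on the tokenizer and on how long each generated question is.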

In [None]:
prompt_template = PromptTemplate(
    input_variables=["context", "number_of_qs"], 
    template="""
    Human: You are a teacher. Your task is to set up questions for an upcoming exam. The questions should be diverse in nature across the document. Only generate the questions within the <context> provided.
    
    <context>
    {context}
    </context>
    
    Now generate {number_of_qs} questions in <questions>. No explanation, no bullet list, one carriage return between questions, just the questions only.
    
    Assistant:<questions>"""
)
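
To see what the model actually receives, it can help to render the template for a single chunk. The sketch below uses plain `str.format` on a shortened stand-in template rather than LangChain's `PromptTemplate`, on the assumption that the default template style substitutes `{placeholders}` the same way. The context string is a made-up sample.

```python
# Shortened stand-in for the template above, for illustration only
template = (
    "Human: Only generate the questions within the <context> provided.\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Now generate {number_of_qs} questions in <questions>.\n\n"
    "Assistant:<questions>"
)

# Fill in one chunk of context and the number of questions to request
prompt = template.format(
    context="Alphabet Inc 1600 Amphitheatre Parkway Mountain View CA 94043",
    number_of_qs=2,
)
print(prompt)
```

Ending the prompt with an opening `<questions>` tag means the model's reply continues inside that tag, so the reply is expected to end with a closing `</questions>` that has to be stripped during parsing.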

In [None]:
def generate_queries(corpus, prompt_template, number_of_qs=2):
    """Generate questions for each chunk; return (queries, relevant_docs) keyed by question ID."""
    model = Bedrock(model_id=model_id,
                    client=bedrock_runtime,
                    model_kwargs=inference_modifier)

    queries = {}        # question ID -> question text
    relevant_docs = {}  # question ID -> list of IDs of the chunks it was generated from
    for node_id, chunk in tqdm(corpus.items()):
        prompt = prompt_template.format(context=chunk, number_of_qs=number_of_qs)
        response = model.invoke(prompt)

        # Drop the closing tag, then split into one candidate question per line
        result = str(response).strip().replace("</questions>", "").split("\n")
        # Remove leading list numbering such as "1." or "2)"
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        # Discard empty lines
        questions = [question for question in questions if len(question) > 0]

        for q in questions:
            q_id = str(uuid.uuid4())
            queries[q_id] = q
            relevant_docs[q_id] = [node_id]
    return queries, relevant_docs
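
The response parsing can be tried out without calling Bedrock. The sketch below applies the same clean-up steps to a made-up response string; `parse_questions` and `sample_response` are hypothetical names used only for this illustration.

```python
import re

def parse_questions(response):
    """Hypothetical helper applying the same clean-up steps as generate_queries."""
    # Drop the closing tag and split into one candidate question per line
    lines = str(response).strip().replace("</questions>", "").split("\n")
    # Remove leading numbering such as "1." or "2)" and surrounding whitespace
    questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    # Keep only non-empty lines
    return [q for q in questions if q]

# Made-up model output with mixed numbering styles and a blank line
sample_response = (
    " 1. What fiscal year does the report cover?\n"
    "2) Where is the registrant incorporated?\n"
    "\n"
    "Which exchange lists GOOGL?\n"
    "</questions>"
)

for question in parse_questions(sample_response):
    print(question)
```

One caveat of this pattern: a question that legitimately begins with a number followed by a space, a period, or a closing parenthesis would lose that leading number.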

### > Test corpus

In [None]:
test_corps = {'a302c0e5-44c7-4518-9022-7f5adcf357fa': 'UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington D C 20549 FORM 10 K Mark One ANNUAL REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended December 31 2022 OR TRANSITION REPORT PURSUANT TO SECTION 13 OR 15 d OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition period from to Commission file number 001 37580 Alphabet Inc Exact name of registrant as specified in its charter Delaware 61 1767919 State or other jurisdiction of incorporation or organization I R S Employer Identification No 1600 Amphitheatre Parkway Mountain View CA 94043 Address of principal executive offices including zip code 650 253 0000 Registrant s telephone number including area code Securities registered purs',
 'ebc9dab5-73db-4da3-a8da-674ecb128187': 'uant to Section 12 b of the Act Title of each class Trading Symbol s Name of each exchange on which registered Class A Common Stock 0 001 par value GOOGL Nasdaq Stock Market LLC Nasdaq Global Select Market Class C Capital Stock 0 001 par value GOOG Nasdaq Stock Market LLC Nasdaq Global Select Market Securities registered pursuant to Section 12 g of the Act Title of each class None Indicate by check mark if the registrant is a well known seasoned issuer as defined in Rule 405 of the Securities Act Yes No Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15 d of the Act Yes No Indicate by check mark whether the registrant 1 has filed all reports required to be filed by Section 13 or 15 ',    
}
test_corps

In [None]:
train_queries, train_relevant_docs = generate_queries(train_corpus, prompt_template)

In [None]:
val_queries, val_relevant_docs = generate_queries(val_corpus, prompt_template)

### > create the final training and validation datasets

In [None]:
train_data_path = './data/train_dataset.json'
val_data_path = './data/val_dataset.json'

In [None]:
train_dataset = {
    'queries': train_queries,
    'corpus': train_corpus,
    'relevant_docs': train_relevant_docs,
}

val_dataset = {
    'queries': val_queries,
    'corpus': val_corpus,
    'relevant_docs': val_relevant_docs,
}

In [None]:
with open(train_data_path, 'w+') as f:
    json.dump(train_dataset, f)

with open(val_data_path, 'w+') as f:
    json.dump(val_dataset, f)
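
Before using the saved files for fine-tuning, a quick consistency check can catch questions whose source chunk is missing. The sketch below is one possible check, run on a tiny made-up dataset written to a temporary file; `check_dataset` and the sample IDs are hypothetical and not part of the lab.

```python
import json
import os
import tempfile

def check_dataset(dataset):
    """Hypothetical helper: return IDs of queries with no relevant chunk present in the corpus."""
    bad = []
    for q_id in dataset["queries"]:
        doc_ids = dataset["relevant_docs"].get(q_id, [])
        # Flag queries with no linked chunk, or with a link to an unknown chunk ID
        if not doc_ids or any(d not in dataset["corpus"] for d in doc_ids):
            bad.append(q_id)
    return bad

# Tiny made-up dataset in the same layout as train_dataset / val_dataset;
# "q-2" deliberately points to a chunk ID that is not in the corpus
dataset = {
    "queries": {"q-1": "Where is the registrant incorporated?",
                "q-2": "Which exchange lists GOOGL?"},
    "corpus": {"chunk-1": "Alphabet Inc Delaware"},
    "relevant_docs": {"q-1": ["chunk-1"], "q-2": ["chunk-missing"]},
}

# Round-trip through JSON, as the notebook does when saving the dataset
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataset.json")
    with open(path, "w") as f:
        json.dump(dataset, f)
    with open(path) as f:
        loaded = json.load(f)

print(check_dataset(loaded))  # prints ['q-2']
```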