# RAG-Pipeline

Load different chunking strategies and their embeddings, created with the models `text-embedding-ada-002` and `text-embedding-3-large`.

## Imports

In [22]:
import pandas as pd
import pickle
import time
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from langchain.text_splitter import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer
from openai import AzureOpenAI
import openai

import credentials

from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import math

from pinecone import Pinecone, Index, ServerlessSpec

In [3]:
deployment_name='gpt-4' #This will correspond to the custom name you chose for your deployment when you deployed a model. 

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),  
    api_version="2023-12-01-preview",
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
)

## Load Embeddings

In [5]:
# load pickled data
with open('../embeddings/recursive_1000_chunksize_100_overlap_ada_002_embeddings.pkl', 'rb') as f:
    embeddings = pickle.load(f)

Some chunks have no embedding, for example because the text chunk was empty or the API call failed. Filter these rows out before storing anything:

In [14]:
# Filter out rows with None embeddings
valid_embeddings = embeddings[embeddings['embeddings'].notna()]

# Debug: Check how many rows remain
print(f"Total valid embeddings: {len(valid_embeddings)}")
print(f"Total invalid embeddings (None): {len(embeddings) - len(valid_embeddings)}")

Total valid embeddings: 44545
Total invalid embeddings (None): 250


## Store Embeddings in a Vector Database

**Initialize Pinecone Client**

In [18]:
# Initialize Pinecone with the new class-based approach
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

existing_indexes = [index.name for index in pc.list_indexes()]

# Define the index name and check if it exists
index_name = "npr-mc1-new"
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=len(valid_embeddings['embeddings'].iloc[0]),  # Must match the embedding model's output dimension; read it from a non-missing row
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    )

# Connect to the index
pinecone_index = pc.Index(index_name)

**Store Embeddings in Pinecone**

In [19]:
records = []
for idx, row in valid_embeddings.iterrows():
    doc_id = str(row['index'] if 'index' in valid_embeddings.columns else idx)
    embedding = row['embeddings']
    original_text = row['content_chunks']

    # Validate embedding is a list of floats
    if isinstance(embedding, list) and all(isinstance(value, float) for value in embedding):
        records.append({
            "id": doc_id,
            "values": embedding,
            "metadata": {"text": original_text}
        })

# Upsert valid records in batches
BATCH_SIZE = 100
for i in range(0, len(records), BATCH_SIZE):
    batch = records[i:i + BATCH_SIZE]
    try:
        pinecone_index.upsert(vectors=batch)
        
        # Print progress every 50 batches
        if (i // BATCH_SIZE + 1) % 50 == 0:
            print(f"Upserted batch {i // BATCH_SIZE + 1}/{(len(records) + BATCH_SIZE - 1) // BATCH_SIZE}")
    except Exception as e:
        print(f"Error upserting batch {i // BATCH_SIZE + 1}: {e}")
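
The batch arithmetic used in the upsert loop can be looked at in isolation. The following is a minimal sketch in plain Python; `iter_batches` and `batch_count` are hypothetical helper names introduced here for illustration and are not part of the Pinecone client, and the records are toy data.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def batch_count(n_items: int, batch_size: int) -> int:
    """Ceiling division, the same formula as in the progress message."""
    return (n_items + batch_size - 1) // batch_size


# Toy records standing in for the Pinecone vectors (hypothetical data).
toy_records = [{"id": str(i)} for i in range(250)]
batches = list(iter_batches(toy_records, 100))
print(len(batches), [len(batch) for batch in batches])
```

The last batch is simply shorter than the others, so no padding or special casing is needed before calling `upsert`.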

## Query Processing (Retriever Module)

Define a retrieval function that uses Pinecone for fetching relevant chunks.

In [23]:
def generate_embeddings(text, embedding_model):
    # Generate embeddings for a given text using the specified embedding model.
    response = client.embeddings.create(input=[text], model=embedding_model)
    return response.data[0].embedding

def retrieve_relevant_chunks(query, embedding_model, top_k=5):
    # Embed the query using the embedding function
    query_embedding = generate_embeddings(query, embedding_model)

    # Retrieve similar documents from Pinecone
    results = pinecone_index.query(
        vector=query_embedding,
        top_k=top_k,
        include_values=True,
        include_metadata=True
    )
    return results
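
To make clear what the vector database does for us, here is a brute-force sketch of top-k retrieval in plain Python. It assumes the index ranks by cosine similarity (the index creation above does not set a metric explicitly, so this is an assumption), and the two-dimensional vectors are toy data. Pinecone uses an approximate index instead of scanning every vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the best top_k."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional "embeddings" (hypothetical data).
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.7, 0.7],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
matches = top_k_by_cosine([1.0, 0.1], toy_vectors, top_k=2)
print(matches)
```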

**Example Query and Retrieval**

In [28]:
query = "What did Qatar Petroleum mention what will happen in Phase 1 of the LNG expansion?"
embedding_model = "text-embedding-ada-002"

relevant_chunks = retrieve_relevant_chunks(query, embedding_model, top_k=5)

# Loop through each retrieved chunk and print its details
for match in relevant_chunks.matches:
    chunk_id = match.id
    text_content = match.metadata.get('text', '')[:300]
    score = match.score
    
    print(f"Chunk ID: {chunk_id}")
    print(f"{text_content}...")
    print(f"Similarity Score: {score:.2f}")
    print("-" * 80)

Chunk ID: 1
eyeing a phased expansion to 126 million tons/yr. QP says it should be able to eliminate routine gas flaring by 2030, with methane emissions limited by setting a methane intensity target of 02% across all facilities by 2025. The company also plans to build some 16 gigawatts of solar energy capacity ...
Similarity Score: 0.89
--------------------------------------------------------------------------------
Chunk ID: 0
Qatar Petroleum (QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. In its latest Sustainability Report published on Wednesday, QP said its goals include reducing the emissions intensity of Qatar's L...
Similarity Score: 0.87
--------------------------------------------------------------------------------
Chunk ID: 24499
buyers have emerged, pressure is growing to reduce the environmental footprint of the global gas trade. With plans to remain dominant in the sec

## Generate Answers (Generator Module)

We can now implement the Generator Module to generate answers based on the retrieved chunks using Azure OpenAI's GPT-4. In this step, we will create a function that takes a user query and the retrieved chunks, composes a relevant context from those chunks, and then uses GPT-4 to generate an answer based on this context.

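- **`model`**: The model that generates the response; with Azure OpenAI this is the deployment name (here `gpt-4`).
- **`max_tokens`**: Upper limit on the number of tokens in the response.
- **`temperature`**: Controls randomness; lower values give more deterministic output, higher values give more varied output.
- **`top_p`**: Nucleus sampling; restricts sampling to the most likely tokens whose cumulative probability reaches `top_p`.
- **`top_k`**: Not a parameter of the Azure OpenAI chat completions API; in this notebook `top_k` is the number of chunks the retriever returns.
- **`stop`**: Sequences at which the model stops generating.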
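
The effect of `temperature` and `top_p` can be illustrated on a toy next-token distribution. This is a conceptual sketch only: the token names and logits are made up, and it does not claim to reproduce how the Azure OpenAI service implements sampling internally.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by the temperature."""
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    max_scaled = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - max_scaled) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def nucleus(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up logits for four candidate tokens.
toy_logits = {"gas": 2.0, "oil": 1.0, "solar": 0.5, "wind": 0.0}
sharp = apply_temperature(toy_logits, 0.1)  # low temperature: mass concentrates on "gas"
flat = apply_temperature(toy_logits, 1.0)   # temperature 1: plain softmax
print(round(sharp["gas"], 3), round(flat["gas"], 3))
print(sorted(nucleus(flat, 0.8)))
```

With a low temperature almost all probability mass sits on the most likely token, which is why the generator below uses `temperature=0.1` for reproducible answers.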

In [29]:
prompt = """
    You are a highly knowledgeable AI assistant specializing in providing accurate and contextually relevant answers. 
    Use the context provided below to answer the user's query as thoroughly and concisely as possible. 
    If the context does not contain sufficient information to answer the query, say so explicitly. 
    Do not include any information not supported by the context.
"""

In [50]:
def generate_answer(query, chunks):
    """
    Generate an answer to the given query using retrieved chunks and Azure OpenAI GPT-4.

    Args:
        query (str): The user query.
        chunks (list): Retrieved chunks containing context.

    Returns:
        str: The generated answer.
    """
    # Compose the context from the retrieved chunks, handling potential missing metadata
    context = " ".join(chunk.get('metadata', {}).get('text', '') for chunk in chunks['matches'])

    # Ensure the context isn't empty
    if not context.strip():
        return "The provided context does not contain sufficient information to answer the query."
   
    # Generate the answer using Azure OpenAI GPT-4
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": prompt},
            {"role": "system", "content": context},
            {"role": "user", "content": query}
        ],
        model = deployment_name,
        max_tokens=150,
        temperature=0.1,            # Low temperature for more deterministic answers
        stop=["End of answer"],     # Optional stop sequence for clean output
    )
    answer = response.choices[0].message.content.strip()
    return answer
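
The context-composition step can be tried out without calling Pinecone or Azure OpenAI. The sketch below uses plain dictionaries as stand-ins for the Pinecone matches; `build_context`, its `separator` and its `max_chars` budget are hypothetical additions for illustration and are not what `generate_answer` above does (it joins all texts with a single space and applies no limit).

```python
from typing import Any, Dict, List


def build_context(
    matches: List[Dict[str, Any]],
    separator: str = "\n\n",
    max_chars: int = 4000,
) -> str:
    """Join the non-empty chunk texts and cut the result at max_chars."""
    texts = []
    for match in matches:
        metadata = match.get("metadata") or {}  # tolerate missing or None metadata
        text = str(metadata.get("text", "")).strip()
        if text:
            texts.append(text)
    return separator.join(texts)[:max_chars]


# Toy matches (hypothetical data); two of them carry no usable text.
toy_matches = [
    {"id": "0", "metadata": {"text": "first toy chunk"}},
    {"id": "1"},
    {"id": "2", "metadata": {"text": "   "}},
    {"id": "3", "metadata": {"text": "second toy chunk"}},
]
context = build_context(toy_matches)
print(repr(context))
```

Separating chunks with a blank line rather than a space keeps chunk boundaries visible to the model; whether that improves answers for this dataset has not been measured here.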

In [31]:
# Generate answer based on retrieved chunks
generated_answer = generate_answer(query, relevant_chunks)
print("Generated Answer:\n", generated_answer)

Generated Answer:
 Qatar Petroleum (QP) mentioned that Phase 1 of the LNG expansion, also known as the North Field East project, will contribute to the company's carbon capture goal by capturing about 22 million tons/yr of carbon. This phase will increase Qatar's LNG production capacity by 32 million tons/yr. However, bids for the construction of all four trains for Phase 1 were deemed too expensive and none met QP's targeted 50-week construction schedule. Contractors were asked to submit new bids with cost savings. Once the construction contract is awarded, QP is expected to select foreign investment partners to take stakes of up to 30% in the Phase 1 trains. Exxon Mobil, Royal Dutch Shell, Total, Chevron, ConocoPhillips, and Eni have been shortlisted for this.


## Evaluation

- BLEU Score: Measures how many n-grams of the generated answer also appear in the reference answer, so it is precision-oriented. It is useful for checking content overlap, especially for factual answers.
- ROUGE Score: Commonly used for summarization tasks. It is also based on n-gram overlap (ROUGE-1, ROUGE-2) and on the longest common subsequence (ROUGE-L). ROUGE was originally recall-oriented; the code below reports the F-measure.
- Cosine Similarity: Measures the semantic similarity between the generated answer and the reference answer in embedding space. It can stay high even when the wording differs, as long as the meaning is preserved.
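
To build intuition for the three metrics, here is a simplified worked example in plain Python. It is a sketch under clear simplifications: only unigrams are counted (real BLEU adds higher-order n-grams and a brevity penalty), the sentences are toy data, and the cosine is computed on made-up two-dimensional vectors, whereas the notebook uses embeddings from the embedding model.

```python
import math
from collections import Counter
from typing import Sequence


def unigram_overlap(reference: str, hypothesis: str) -> int:
    """Number of words shared by both texts, counting repeats at most
    as often as they occur in each text (clipped counts)."""
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    return sum((ref_counts & hyp_counts).values())


def unigram_precision(reference: str, hypothesis: str) -> float:
    """BLEU-1-like: share of hypothesis words found in the reference."""
    hyp_len = len(hypothesis.split())
    return unigram_overlap(reference, hypothesis) / hyp_len if hyp_len else 0.0


def unigram_recall(reference: str, hypothesis: str) -> float:
    """ROUGE-1-recall-like: share of reference words found in the hypothesis."""
    ref_len = len(reference.split())
    return unigram_overlap(reference, hypothesis) / ref_len if ref_len else 0.0


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy sentences (hypothetical data).
reference = "cooperation started in july 2013"
hypothesis = "the cooperation between both companies started in 2013"

precision = unigram_precision(reference, hypothesis)  # 4 shared words / 8 hypothesis words
recall = unigram_recall(reference, hypothesis)        # 4 shared words / 5 reference words
similarity = cosine([1.0, 0.0], [1.0, 1.0])
print(precision, recall, round(similarity, 4))
```

A long answer that contains the reference is punished by precision but not by recall, which is why the two families of metrics are reported side by side.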

**Setup**:

We have `data_eval`, which contains the fields `example_id`, `question_id`, `question`, `relevant_text`, `answer`, and `article_url`. Each row is one evaluation example: the `question` is sent through our pipeline, `relevant_text` provides the context for manual verification, and `answer` is the ground-truth answer to compare against.

In [33]:
data_eval = pd.read_csv('../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv', delimiter=';')
data_eval.head()

| | example_id | question_id | question | relevant_text | answer | article_url |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | What is the innovation behind Leclanché's new ... | Leclanché said it has developed an environment... | Leclanché's innovation is using a water-based ... | https://www.sgvoice.net/strategy/technology/23... |
| 1 | 2 | 2 | What is the EU’s Green Deal Industrial Plan? | The Green Deal Industrial Plan is a bid by the... | The EU’s Green Deal Industrial Plan aims to en... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 2 | 3 | 2 | What is the EU’s Green Deal Industrial Plan? | The European counterpart to the US Inflation R... | The EU’s Green Deal Industrial Plan aims to en... | https://www.pv-magazine.com/2023/02/02/europea... |
| 3 | 4 | 3 | What are the four focus areas of the EU's Gree... | The new plan is fundamentally focused on four ... | The four focus areas of the EU's Green Deal In... | https://www.sgvoice.net/policy/25396/eu-seeks-... |
| 4 | 5 | 4 | When did the cooperation between GM and Honda ... | What caught our eye was a new hookup between G... | July 2013 | https://cleantechnica.com/2023/05/08/general-m... |


In [51]:
class RAGPipeline:
    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    def retrieve_relevant_chunks(self, query, embedding_model, top_k=5):
        return self.retriever(query, embedding_model, top_k)

    def generate_answer(self, query, retrieved_chunks):
        return self.generator(query, retrieved_chunks)

# Instantiate the pipeline
pipeline = RAGPipeline(
    retriever=retrieve_relevant_chunks,
    generator=generate_answer
)

In [72]:
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

def calculate_bleu_scores(reference, hypothesis):
    smoothing_function = SmoothingFunction().method1

    scores = {}
    for n in range(1, 5):  # BLEU-1 to BLEU-4
        weights = tuple((1 / n) for _ in range(n)) + (0,) * (4 - n)
        scores[f"bleu_{n}"] = sentence_bleu(
            [reference.split()],
            hypothesis.split(),
            weights=weights,
            smoothing_function=smoothing_function
        )
    return scores

def calculate_rouge_scores(reference, hypothesis):
    rouge = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    rouge_scores = rouge.score(reference, hypothesis)
    rouge1 = rouge_scores['rouge1'].fmeasure
    rouge2 = rouge_scores['rouge2'].fmeasure
    rougeL = rouge_scores['rougeL'].fmeasure
    return {"rouge1": rouge1, "rouge2": rouge2, "rougeL": rougeL}
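
The `weights` expression in `calculate_bleu_scores` is easier to read when evaluated on its own. The sketch below extracts it into a hypothetical helper, `cumulative_bleu_weights`, to show that BLEU-n here weights the 1-gram to n-gram precisions uniformly and ignores higher orders.

```python
from typing import Tuple


def cumulative_bleu_weights(n: int, max_n: int = 4) -> Tuple[float, ...]:
    """Uniform weights over the first n n-gram orders, zero for the rest."""
    if not 1 <= n <= max_n:
        raise ValueError("n must be between 1 and max_n")
    return tuple(1 / n for _ in range(n)) + (0,) * (max_n - n)


for n in range(1, 5):
    print(f"bleu_{n}", cumulative_bleu_weights(n))
```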

In [81]:
import numpy as np

# Function to calculate evaluation metrics for each example
def evaluate_example(query, true_answer, pipeline, embedding_model):

    relevant_chunks = pipeline.retrieve_relevant_chunks(query, embedding_model, top_k=5)
    generated_answer = pipeline.generate_answer(query, relevant_chunks)

    # Cosine Similarity
    true_embedding = np.array(generate_embeddings(true_answer, embedding_model)).reshape(1, -1)
    generated_embedding = np.array(generate_embeddings(generated_answer, embedding_model)).reshape(1, -1)
    cosine_sim = cosine_similarity(true_embedding, generated_embedding)[0][0]

    # BLEU and ROUGE Scores
    bleu_scores = calculate_bleu_scores(true_answer, generated_answer)
    rouge_scores = calculate_rouge_scores(true_answer, generated_answer)

    # relevant chunks to list
    relevant_chunks_list = [chunk.get('metadata', {}).get('text', '') for chunk in relevant_chunks['matches']]

    return {
        "question": query,
        "ground_truth": true_answer,
        "answer": generated_answer,
        "contexts": relevant_chunks_list,
        **bleu_scores,
        **rouge_scores,
        "cosine_similarity": cosine_sim
    }

# Function to evaluate the entire dataset
def evaluate_pipeline(data_eval, pipeline, embedding_model):
    results = []
    for _, row in tqdm(data_eval.iterrows(), total=len(data_eval), desc="Evaluating RAG Pipeline"):
        query = row['question']
        true_answer = row['answer']
        
        # Evaluate this example
        example_results = evaluate_example(query, true_answer, pipeline, embedding_model)
        results.append(example_results)
    
    return pd.DataFrame(results)

evaluation_results = evaluate_pipeline(data_eval, pipeline, embedding_model)

Evaluating RAG Pipeline: 100%|██████████| 23/23 [01:31<00:00,  3.97s/it]
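
Per-example scores are usually summarized by averaging each metric over the evaluation set. The sketch below does this in plain Python on made-up rows so that it runs without the notebook's state; `summarize` is a hypothetical helper, and the numbers are not results from this pipeline. In the notebook, selecting the metric columns of `evaluation_results` and calling `.mean()` should give the same kind of summary.

```python
from statistics import mean
from typing import Dict, List

# Metric names as produced by evaluate_example above.
METRIC_KEYS = ["bleu_1", "rougeL", "cosine_similarity"]


def summarize(rows: List[Dict[str, float]], metric_keys: List[str]) -> Dict[str, float]:
    """Average each metric over all evaluated examples."""
    return {key: mean(row[key] for row in rows) for key in metric_keys}


# Made-up per-example scores (hypothetical data, not measured results).
toy_rows = [
    {"bleu_1": 0.2, "rougeL": 0.4, "cosine_similarity": 0.9},
    {"bleu_1": 0.4, "rougeL": 0.6, "cosine_similarity": 0.8},
]
summary = summarize(toy_rows, METRIC_KEYS)
print(summary)
```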


## Evaluation with RAGAS

In [93]:
dataset = evaluation_results[["question", "answer", "contexts", "ground_truth"]]
dataset.head()

| | question | answer | contexts | ground_truth |
|---|---|---|---|---|
| 0 | What is the innovation behind Leclanché's new ... | Leclanché has developed an environmentally fri... | [The new battery leverages highly conductive b... | Leclanché's innovation is using a water-based ... |
| 1 | What is the EU’s Green Deal Industrial Plan? | The EU's Green Deal Industrial Plan is an init... | ['The EU has presented its Green Deal Industri... | The EU’s Green Deal Industrial Plan aims to en... |
| 2 | What is the EU’s Green Deal Industrial Plan? | The EU's Green Deal Industrial Plan is a strat... | ['The EU has presented its Green Deal Industri... | The EU’s Green Deal Industrial Plan aims to en... |
| 3 | What are the four focus areas of the EU's Gree... | The four focus areas, or pillars, of the EU's ... | ['The EU has presented its Green Deal Industri... | The four focus areas of the EU's Green Deal In... |
| 4 | When did the cooperation between GM and Honda ... | The cooperation between GM and Honda on fuel c... | [collaboration launched in July of 2013, provi... | July 2013 |


In [99]:
from ragas import EvaluationDataset, evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

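# RAGAS calls OpenAI models by default. NOTE: an Azure OpenAI key is not valid
# for the public OpenAI endpoint, so an Azure-backed LLM and embedding model
# may need to be configured for evaluate() explicitly.
os.environ["OPENAI_API_KEY"] = os.getenv("AZURE_OPENAI_API_KEY")

# Passing data_eval directly fails with:
#   ValueError: The metric [context_precision] that is used requires the following additional columns ['reference', 'retrieved_contexts', 'user_input'] to be present in the dataset.
# Use the pipeline outputs in `dataset` instead and rename them to the RAGAS schema.
ragas_df = dataset.rename(columns={
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
})
ragas_dataset = EvaluationDataset.from_pandas(ragas_df)

result = evaluate(
    dataset=ragas_dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

df = result.to_pandas()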

In [None]:
df
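
RAGAS names the columns it needs (`user_input`, `retrieved_contexts`, `reference` for `context_precision`), so a schema mismatch can be caught before any model call is made. The following is a minimal sketch in plain Python; `missing_columns` is a hypothetical helper, and the rename mapping assumes the column names of the `dataset` DataFrame shown earlier, with `response` as the RAGAS field for the generated answer.

```python
from typing import Dict, Iterable, List

# Columns named by RAGAS for context_precision.
REQUIRED_COLUMNS = {"user_input", "retrieved_contexts", "reference"}

# Assumed mapping from this notebook's column names to the RAGAS schema.
RAGAS_RENAME: Dict[str, str] = {
    "question": "user_input",
    "answer": "response",
    "contexts": "retrieved_contexts",
    "ground_truth": "reference",
}


def missing_columns(columns: Iterable[str], required: Iterable[str] = REQUIRED_COLUMNS) -> List[str]:
    """Return the required column names that are absent, sorted for stable output."""
    return sorted(set(required) - set(columns))


original_columns = ["question", "answer", "contexts", "ground_truth"]
renamed_columns = [RAGAS_RENAME.get(col, col) for col in original_columns]

print(missing_columns(original_columns))
print(missing_columns(renamed_columns))
```

Running such a check on the DataFrame's columns right before building the `EvaluationDataset` turns the late `ValueError` into an early, readable message.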