# RAG-Pipeline

In this notebook, we demonstrate how to use a RAG (retrieval-augmented generation) pipeline to generate answers for the questions in the cleantech evaluation dataset. The pipeline consists of the following steps:
1. Data Preparation and Ingestion
2. Embedding
3. Store Embeddings in Vector Database
4. Retrieve Contexts
5. Generate Answers
6. Evaluation

## Imports

In [138]:
import pandas as pd
import openai
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import AzureOpenAI
from sentence_transformers import SentenceTransformer


In [139]:
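import credentials  # local module, presumably responsible for making the API keys available

# Name of the Azure OpenAI chat deployment used for answer generation
deployment_name = "gpt-4"

# Client used for both the embedding and the chat completion calls
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2024-02-15-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
)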

## Data Preparation and Ingestion

We will load and prepare the data, create embeddings for each chunk, and store the embeddings in a vector database. Since the data is already cleaned, we'll move directly to chunking and embedding.

In [140]:
data = pd.read_csv('../data/processed/cleantech_processed.csv')
data_eval = pd.read_csv('../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv', delimiter=";")

data.head()

Unnamed: 0,title,date,content,domain,url
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
1,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
2,New Chapter for US-China Energy Trade,2021-01-20,New US President Joe Biden took office this we...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
3,Japan: Slow Restarts Cast Doubt on 2030 Energy...,2021-01-22,The slow pace of Japanese reactor restarts con...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...
4,NYC Pension Funds to Divest Fossil Fuel Shares,2021-01-25,Two of New York City's largest pension funds s...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...


In [141]:
print("Number of articles in the dataset: ", len(data))

Number of articles in the dataset:  9593


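We split each article into chunks with LangChain's `RecursiveCharacterTextSplitter`, which is controlled by two parameters:

- `chunk_size`: Sets the maximum character length of each chunk (1000 characters here).
- `chunk_overlap`: Adds an n-character overlap between consecutive chunks to preserve context across chunk boundaries (100 characters here).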
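
To make the two parameters concrete, the following minimal sketch splits a string with a plain sliding window. It is only an illustration: `sliding_window_chunks` is a hypothetical helper, not part of LangChain, and `RecursiveCharacterTextSplitter` additionally tries to break at separators such as paragraph and line breaks, so its chunks will generally differ from these.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk starts with the last three characters of the previous one, which is the role `chunk_overlap` plays at a larger scale in the cell below.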

In [142]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Define a function to perform chunking with adjustable parameters
def chunk_text(dataframe, text_column, chunk_size=1000, chunk_overlap=100):
    # Initialize RecursiveCharacterTextSplitter with dynamic parameters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap
    )
    # Split text in the specified column
    dataframe['content_chunks'] = dataframe[text_column].apply(lambda text: text_splitter.split_text(text))
    
    # Flatten the DataFrame for individual chunk rows
    chunked_df = dataframe.explode('content_chunks').reset_index(drop=True)

    return chunked_df

# Apply the chunking with adjustable parameters
chunked_data = chunk_text(data, text_column='content', chunk_size=1000, chunk_overlap=100)
chunked_data.head()

Unnamed: 0,title,date,content,domain,url,content_chunks
0,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Qatar Petroleum (QP) is targeting aggressive c...
1,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,million tons/yr. Qatar currently has an LNG pr...
2,Qatar to Slash Emissions as LNG Expansion Adva...,2021-01-13,Qatar Petroleum (QP) is targeting aggressive c...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,be too expensive and none met its targeted 50 ...
3,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,Nuclear Power Corp of India Ltd (NPCIL) synchr...
4,India Launches Its First 700 MW PHWR,2021-01-15,Nuclear Power Corp of India Ltd (NPCIL) synchr...,energyintel,https://www.energyintel.com/0000017b-a7dc-de4c...,will likely not be met. India's nuclear suppli...


In [143]:
len(chunked_data)

48039

Now that we have chunked the texts into smaller segments, the next step is to pass these chunks through an embedding model to obtain their vector representations. The embedding model maps the textual information into high-dimensional vector spaces, where semantic similarities and relationships are preserved.

## Embedding Creation

In [144]:
def generate_embeddings(text, model):
    # model is the name of the Azure OpenAI embedding deployment
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Embed every chunk (one API request per chunk)
chunked_data['embeddings'] = chunked_data["content_chunks"].apply(lambda x: generate_embeddings(x, "text-embedding-3-large"))

In [None]:
print("Dimension of the embeddings: ", len(chunked_data['embeddings'][0]))

Dimension of the embeddings:  3072
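
The embedding cell above issues one API request per chunk, which is slow for tens of thousands of chunks. The embeddings endpoint also accepts a list of inputs, so several chunks can be sent per request. The sketch below shows only the batching logic. `embed_in_batches`, `embed_batch`, and the batch size are illustrative assumptions, and a stand-in function replaces the real API call so that the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=64):
    """Embed texts in groups of batch_size, preserving the input order.

    embed_batch is any callable that maps a list of strings to a list of
    vectors of the same length (a hypothetical interface for this sketch).
    """
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = embed_batch(batch)
        if len(vectors) != len(batch):
            raise ValueError("embed_batch returned a different number of vectors")
        embeddings.extend(vectors)
    return embeddings

# Stand-in for the real call so the sketch runs without credentials.
# With the notebook's client, embed_batch could look like this
# (assuming the response lists the embeddings in input order):
#   lambda batch: [d.embedding for d in
#                  client.embeddings.create(input=batch, model="text-embedding-3-large").data]
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_vectors)
```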


The `chunked_data` DataFrame is structured so that each row represents an individual **chunk** of text derived from the original documents, with a corresponding **embedding vector** stored in the `embeddings` column. Each embedding vector is a dense list of floating-point numbers (3072 dimensions in our case, given the use of the `text-embedding-3-large` model) that encapsulates the semantic meaning of the text chunk.

These embeddings are critical for the RAG system, as they allow efficient similarity searches and retrieval tasks by representing the content of each text chunk in a high-dimensional vector space. When a query is embedded, the RAG system can quickly locate the most relevant text chunks by identifying embedding vectors in the database that are closest in meaning. This approach enables the system to retrieve contextually relevant information by comparing semantic relationships, streamlining the entire retrieval process.
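
The retrieval idea can be sketched with toy vectors and a brute-force scan. This is an illustration under simplifying assumptions, not how Pinecone works internally: a vector database uses an index to avoid comparing the query against every stored vector. The names `cosine_similarity_score` and `top_k_chunks` and the three-dimensional vectors are made up for the example.

```python
import math

def cosine_similarity_score(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as no match
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, chunk_vectors, k=2):
    """Return (chunk_id, score) pairs for the k most similar chunks."""
    scored = [(chunk_id, cosine_similarity_score(query_vector, vector))
              for chunk_id, vector in chunk_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-dimensional "embeddings"; the real ones have 3072 dimensions.
toy_chunks = {
    "lng": [0.9, 0.1, 0.0],
    "nuclear": [0.1, 0.9, 0.0],
    "pensions": [0.0, 0.1, 0.9],
}
toy_query = [1.0, 0.2, 0.0]
nearest = top_k_chunks(toy_query, toy_chunks, k=2)
print(nearest)
```

The query vector points in almost the same direction as the `lng` vector, so that chunk is ranked first even though the two vectors are not identical.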

## Store Embeddings in a Vector Database

Pinecone is a hosted vector database. We will store the embeddings in Pinecone.

**Initialize Pinecone Client**

In [None]:
from pinecone import Pinecone, ServerlessSpec

# Initialize Pinecone with the new class-based approach
pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))

existing_indexes = [index.name for index in pc.list_indexes()]

# Define the index name and check if it exists
index_name = "npr-mc1"
if index_name not in existing_indexes:
    pc.create_index(
        name=index_name,
        dimension=3072,  # Dimension should match embedding model's output dimension
        spec=ServerlessSpec(
            cloud='aws', 
            region='us-east-1'
        ) 
    )

# Connect to the index
pinecone_index = pc.Index(index_name)

**Store Embeddings in Pinecone**

When upserting larger amounts of data, it is recommended to upsert records in large batches. A batch of upserts should be as large as possible (up to 1000 records) without exceeding the maximum request size of 2MB.
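
A batching helper along these lines could be used to keep each request small. It is a minimal sketch: `iter_batches` is a hypothetical helper, and the batch size of 100 is an assumption to tune. With 3072-dimensional vectors, the 2MB request limit is likely to be reached well before 1000 records.

```python
def iter_batches(records, batch_size=100):
    """Yield consecutive slices of records with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Toy records standing in for the id/values/metadata dicts used for upserting.
toy_records = [{"id": str(i), "values": [0.0], "metadata": {"text": ""}} for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(toy_records, batch_size=100)]
print(batch_sizes)

# With the notebook's index, each batch could then be sent separately:
#   for batch in iter_batches(records, batch_size=100):
#       pinecone_index.upsert(vectors=batch)
```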

In [None]:
records = []
for idx, row in chunked_data.iterrows():
    doc_id = str(row['index'] if 'index' in chunked_data.columns else idx)
    embedding = row['embeddings']
    original_text = row['content_chunks']

    records.append({
        "id": doc_id,
        "values": embedding,
        "metadata": {"text": original_text}
    })

# Upsert all records at once
pinecone_index.upsert(vectors=records)

{'upserted_count': 56}

## Query Processing (Retriever Module)

Define a retrieval function that uses Pinecone for fetching relevant chunks.

In [None]:
def retrieve_relevant_chunks(query, top_k=5):
    query_embedding = generate_embeddings(query, "text-embedding-3-large")

    # Retrieve similar documents from Pinecone
    results = pinecone_index.query(
        vector=query_embedding,
        top_k=top_k,
        include_values=True,
        include_metadata=True
    )
    return results

In [None]:
# Example question
question = "What did Qatar Petroleum mention what will happen in Phase 1 of the LNG expansion?"

# Call the retrieve_relevant_chunks function with the question
relevant_chunks = retrieve_relevant_chunks(question, top_k=5)

print("Most relevant chunks for the following question:")
print(question, "\n")

# Loop through each retrieved chunk and print its details
for match in relevant_chunks.matches:
    chunk_id = match.id
    text_content = match.metadata.get('text', '')[:300]
    score = match.score
    
    print(f"Chunk ID: {chunk_id}")
    print(f"{text_content}...")
    print(f"Similarity Score: {score:.2f}")
    print("-" * 80)

Most relevant chunks for the following question:
What did Qatar Petroleum mention what will happen in Phase 1 of the LNG expansion? 

Chunk ID: 0
Qatar Petroleum (QP) is targeting aggressive cuts in its greenhouse gas emissions as it prepares to launch Phase 2 of its planned 48 million ton per year LNG expansion. In its latest Sustainability Report published on Wednesday, QP said its goals include reducing the emissions intensity of Qatar's L...
Similarity Score: 0.72
--------------------------------------------------------------------------------
Chunk ID: 1
of Qatar's massive LNG expansion. But McDermott International said last week that it had been awarded the front end engineering and design contract for five offshore wellhead platforms (LNGI Jan12'21). Bids for construction of all four trains for Phase 1 of the LNG expansion were submitted in Septem...
Similarity Score: 0.66
--------------------------------------------------------------------------------
Chunk ID: 6
could hinder C

## Generate Answers (Generator Module)

We can now implement the Generator Module to generate answers based on the retrieved chunks using Azure OpenAI's GPT-4. In this step, we will create a function that takes a user query and the retrieved chunks, composes a relevant context from those chunks, and then uses GPT-4 to generate an answer based on this context.

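- `model`: Name of the Azure OpenAI deployment that generates the answer (`deployment_name`, here `gpt-4`).
- `max_tokens`: Upper limit on the number of tokens in the generated answer.
- `temperature`: Controls the randomness of sampling; lower values make the output more deterministic.
- `top_p`: Nucleus sampling threshold; the model samples only from the most likely tokens whose cumulative probability reaches `top_p`. It is not set in the function below.
- `top_k`: Number of chunks the retriever returns for a query. This is a parameter of `retrieve_relevant_chunks`, not of the chat completion call.
- `stop`: Sequences at which the model stops generating further tokens.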
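
Composing the context from the retrieved chunks can be sketched independently of the API call. The helper below is an illustration only: `build_messages` and its `max_context_chars` cap are assumptions that do not appear in the notebook, and the chunk texts are placeholders.

```python
def build_messages(query, chunk_texts, max_context_chars=6000):
    """Compose chat messages from a query and the retrieved chunk texts."""
    # Join the non-empty chunks and cap the context length (the cap is an assumption).
    context = "\n\n".join(text for text in chunk_texts if text)[:max_context_chars]
    return [
        {"role": "system",
         "content": "You are a helpful AI assistant. Please answer the query with the provided context."},
        {"role": "system", "content": context},
        {"role": "user", "content": query},
    ]

demo_messages = build_messages(
    "What is planned for Phase 1?",
    ["First retrieved chunk.", "", "Second retrieved chunk."],
)
for message in demo_messages:
    print(message["role"], "->", message["content"])
```

Keeping the prompt construction in its own function makes it easy to inspect exactly what the model receives before any tokens are spent.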

In [128]:
def generate_answer(query, chunks):
    # Compose the context from the retrieved chunks, handling potential missing metadata
    context = " ".join(chunk.get('metadata', {}).get('text', '') for chunk in chunks['matches'])
    
    # Generate the answer using Azure OpenAI GPT-4
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a helpful AI assistant. Please answer the query with the provided context."},
            {"role": "system", "content": context},
            {"role": "user", "content": query}
        ],
        model=deployment_name,
        max_tokens=250,
        temperature=0.3,            # Low temperature for more deterministic answers
        stop=["End of answer"],     # Optional stop sequence
    )
    answer = response.choices[0].message.content.strip()
    return answer

In [129]:
# Generate answer based on retrieved chunks
generated_answer = generate_answer(question, relevant_chunks)

print("Generated Answer:\n", generated_answer)

Generated Answer:
 Qatar Petroleum (QP) mentioned that about 22 million tons/yr of the carbon capture goal will come from the 32 million ton/yr Phase 1 of the LNG expansion, also known as the North Field East project. Bids for construction of all four trains for Phase 1 of the LNG expansion were submitted, but QP judged them to be too expensive and none met its targeted 50 week construction schedule. Contractors were asked to look for cost savings and submit new bids. The contract, estimated to be worth around $35 billion, is expected to be awarded by Mar 31. After the construction contract is awarded, QP is expected to select foreign investments partners to take stakes of up to 30% in the Phase 1 trains. Exxon Mobil, Royal Dutch Shell, Total, Chevron, ConocoPhillips and Eni have been shortlisted.


## Evaluation

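We compare the generated answers with the reference answers using the following metrics:

- BLEU Score: Measures the n-gram precision of the generated answer against the reference answer. This metric is valuable for measuring content similarity, especially for factual information.
- ROUGE Score: Commonly used for summarization tasks, it also evaluates the overlap of n-grams but is recall-oriented, which is beneficial for checking whether generated responses capture the core content.
- Cosine Similarity: Measures the semantic similarity between the generated answer and the reference answer in embedding space. This ensures that even if the wording differs, the underlying meaning is still preserved. This metric is not part of the evaluation code below.
- Retrieval Recall: Checks whether the reference answer appears verbatim in the retrieved chunks.
- Exact Match: Checks whether the generated answer is identical to the reference answer.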
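
The difference between a precision-oriented and a recall-oriented overlap score can be illustrated with single words. This is a simplified sketch, not the full metrics: BLEU combines several n-gram orders and a brevity penalty, and the `rouge_score` package reports precision, recall, and F-measure. The helper `unigram_overlap` and the example sentences are made up for illustration.

```python
from collections import Counter

def unigram_overlap(reference, candidate):
    """Return (precision, recall) of candidate unigrams against the reference."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall

reference_answer = "the plan has four focus areas"
generated = "the plan has four focus areas according to the european commission"
precision, recall = unigram_overlap(reference_answer, generated)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

The longer generated sentence covers every reference word, so recall is perfect, while the extra words lower precision. Verbose answers are therefore penalized more by precision-oriented scores than by recall-oriented ones.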

**Setup**:

We have `data_eval`, which contains fields such as `question`, `relevant_text`, `answer`, and `article_url`. Each row represents an evaluation example, with the `question` to be queried in our pipeline, the `relevant_text` providing context for manual verification, and `answer` as the ground-truth answer to compare against.

In [131]:
data_eval.head()

Unnamed: 0,example_id,question_id,question,relevant_text,answer,article_url
0,1,1,What is the innovation behind Leclanché's new ...,Leclanché said it has developed an environment...,Leclanché's innovation is using a water-based ...,https://www.sgvoice.net/strategy/technology/23...
1,2,2,What is the EU’s Green Deal Industrial Plan?,The Green Deal Industrial Plan is a bid by the...,The EU’s Green Deal Industrial Plan aims to en...,https://www.sgvoice.net/policy/25396/eu-seeks-...
2,3,2,What is the EU’s Green Deal Industrial Plan?,The European counterpart to the US Inflation R...,The EU’s Green Deal Industrial Plan aims to en...,https://www.pv-magazine.com/2023/02/02/europea...
3,4,3,What are the four focus areas of the EU's Gree...,The new plan is fundamentally focused on four ...,The four focus areas of the EU's Green Deal In...,https://www.sgvoice.net/policy/25396/eu-seeks-...
4,5,4,When did the cooperation between GM and Honda ...,What caught our eye was a new hookup between G...,July 2013,https://cleantechnica.com/2023/05/08/general-m...


In [133]:
class RAGPipeline:
    def __init__(self, retriever, generator):
        self.retriever = retriever
        self.generator = generator

    def retrieve_relevant_chunks(self, query, top_k=5):
        return self.retriever(query, top_k=top_k)

    def generate_answer(self, query, retrieved_chunks):
        return self.generator(query, retrieved_chunks)

# Instantiate the pipeline
pipeline = RAGPipeline(
    retriever=retrieve_relevant_chunks,   # Your retrieval function
    generator=generate_answer             # Your generation function
)

In [136]:
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

# Initialize ROUGE scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

# Function to calculate evaluation metrics for each example
def evaluate_example(query, true_answer, pipeline):
    # Retrieve relevant chunks
    relevant_chunks = pipeline.retrieve_relevant_chunks(query)
    
    # Generate answer based on retrieved chunks
    generated_answer = pipeline.generate_answer(query, relevant_chunks)
    
    # 1. Retrieval recall: 1 if the reference answer appears verbatim in the retrieved text, else 0
    retrieved_texts = [chunk.metadata.get('text', '') for chunk in relevant_chunks['matches']]
    recall = int(true_answer in " ".join(retrieved_texts))

    # 2. Calculate BLEU Score
    bleu_score = sentence_bleu([true_answer.split()], generated_answer.split())
    
    # 3. Calculate ROUGE Scores
    rouge_scores = scorer.score(true_answer, generated_answer)
    rouge1 = rouge_scores['rouge1'].fmeasure
    rouge2 = rouge_scores['rouge2'].fmeasure
    rougeL = rouge_scores['rougeL'].fmeasure
    
    # 4. Exact Match (EM)
    exact_match = int(generated_answer.strip() == true_answer.strip())
    
    return {
        "recall": recall,
        "bleu": bleu_score,
        "rouge1": rouge1,
        "rouge2": rouge2,
        "rougeL": rougeL,
        "exact_match": exact_match,
        "generated_answer": generated_answer
    }

# Function to evaluate the entire dataset
def evaluate_pipeline(data_eval, pipeline):
    results = []
    for _, row in data_eval.iterrows():
        query = row['question']
        true_answer = row['answer']
        
        # Evaluate this example
        example_results = evaluate_example(query, true_answer, pipeline)
        example_results['example_id'] = row['example_id']
        results.append(example_results)
    
    return pd.DataFrame(results)

# Run evaluation
evaluation_results = evaluate_pipeline(data_eval, pipeline)
evaluation_results.head()

# Display aggregated scores
average_scores = evaluation_results[['recall', 'bleu', 'rouge1', 'rouge2', 'rougeL', 'exact_match']].mean()
print("Average Evaluation Metrics:\n", average_scores)


The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()


Average Evaluation Metrics:
 recall         0.043478
bleu           0.029453
rouge1         0.157721
rouge2         0.087592
rougeL         0.141349
exact_match    0.000000
dtype: float64
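
The warnings above explain why many BLEU values are zero or vanishingly small: BLEU is a geometric mean over n-gram orders, so a single order without any match pulls the whole score to zero. The sketch below reproduces that effect in plain Python and shows how replacing zero match counts with a small epsilon avoids it. It is a simplified illustration in the spirit of the `SmoothingFunction()` that the warning mentions, not NLTK's implementation, and it omits the brevity penalty. The helper names and the candidate sentence are made up.

```python
import math
from collections import Counter

def ngram_precisions(reference_tokens, candidate_tokens, max_n=4):
    """Clipped n-gram matches as (matches, total) pairs for n = 1..max_n."""
    stats = []
    for n in range(1, max_n + 1):
        ref = Counter(tuple(reference_tokens[i:i + n])
                      for i in range(len(reference_tokens) - n + 1))
        cand = Counter(tuple(candidate_tokens[i:i + n])
                       for i in range(len(candidate_tokens) - n + 1))
        stats.append((sum((ref & cand).values()), sum(cand.values())))
    return stats

def geometric_mean_precision(stats, epsilon=0.0):
    """Geometric mean of the precisions; epsilon replaces zero match counts."""
    log_sum = 0.0
    for matches, total in stats:
        matches = matches if matches > 0 else epsilon
        if matches == 0 or total == 0:
            return 0.0  # one n-gram order without matches zeroes the whole score
        log_sum += math.log(matches / total)
    return math.exp(log_sum / len(stats))

reference_tokens = "July 2013".split()
candidate_tokens = "The cooperation began in July 2013".split()
stats = ngram_precisions(reference_tokens, candidate_tokens)
unsmoothed = geometric_mean_precision(stats)
smoothed = geometric_mean_precision(stats, epsilon=0.1)
print(stats, unsmoothed, smoothed)
```

A two-word reference cannot share any 3-gram or 4-gram with the candidate, so the unsmoothed score is zero even though the candidate contains the correct answer. Short reference answers are therefore a poor fit for unsmoothed 4-gram BLEU.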


In [137]:
evaluation_results

Unnamed: 0,recall,bleu,rouge1,rouge2,rougeL,exact_match,generated_answer,example_id
0,0,4.992897e-155,0.325581,0.195122,0.325581,0,The provided text does not contain information...,1
1,0,0.03071619,0.285714,0.222222,0.25,0,The provided context does not contain informat...,2
2,0,0.03071619,0.285714,0.222222,0.25,0,The provided context does not contain informat...,3
3,0,0.3255657,0.565217,0.5,0.521739,0,The provided context does not contain informat...,4
4,0,0.0,0.0,0.0,0.0,0,The text does not provide information on when ...,5
5,0,0.0,0.0,0.0,0.0,0,The provided text does not contain any informa...,6
6,0,0.1669782,0.380952,0.3,0.333333,0,The provided text does not contain information...,7
7,0,6.559657e-156,0.196078,0.122449,0.156863,0,The text does not provide information on what ...,8
8,0,0.0,0.0,0.0,0.0,0,The text does not provide information on wheth...,9
9,0,0.0,0.020833,0.0,0.020833,0,"Yes, you can hang solar panels on garden fence...",10


## Testing

In [7]:
# Define a function to test the RAG pipeline on evaluation data
def test_pipeline_on_eval_data(data_eval, top_k=5):
    results = []

    for _, row in data_eval.iterrows():
        question = row['question']
        expected_answer = row['answer']
        example_id = row['example_id']
        
        # Retrieve relevant chunks for each question
        retrieved_chunks = retrieve_relevant_chunks(question, top_k=top_k)
        
        # Generate answer based on retrieved chunks
        generated_answer = generate_answer(question, retrieved_chunks)
        
        # Store results for later analysis
        results.append({
            'example_id': example_id,
            'question': question,
            'expected_answer': expected_answer,
            'generated_answer': generated_answer,
            'retrieved_chunks': retrieved_chunks
        })
    
    # Convert results to DataFrame for easier analysis
    results_df = pd.DataFrame(results)
    return results_df

In [None]:
# Run the pipeline on evaluation data
results_df = test_pipeline_on_eval_data(data_eval)
results_df[['example_id', 'question', 'expected_answer', 'generated_answer', 'retrieved_chunks']].head()