### 1. Data Ingestion & Preprocessing

Loading the cleaned text data and splitting it into manageable chunks of a fixed number of words for efficient retrieval.

In [1]:
import pandas as pd

# Load data
data = pd.read_csv("../data/processed/cleantech_processed_new.csv")

# Function to split text data into manageable sections
def chunk_text(text, chunk_size=200):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Apply chunking to each content entry
data['chunks'] = data['content_cleaned'].apply(chunk_text)
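
To make the chunking behaviour concrete, here is a small sketch on a toy sentence. `chunk_text` is copied from the cell above; `chunk_text_overlap` is a hypothetical variant (not part of this notebook) that lets neighbouring chunks share a few words, on the assumption that sentences cut at a chunk boundary may otherwise lose context.

```python
def chunk_text(text, chunk_size=200):
    """Split text into consecutive chunks of at most `chunk_size` words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def chunk_text_overlap(text, chunk_size=200, overlap=50):
    """Hypothetical variant: consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(' '.join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

sample = "one two three four five six seven"
print(chunk_text(sample, chunk_size=3))
# ['one two three', 'four five six', 'seven']
print(chunk_text_overlap(sample, chunk_size=3, overlap=1))
# ['one two three', 'three four five', 'five six seven']
```

The overlap size is an illustrative choice; whether it helps retrieval on this dataset would have to be measured.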

### 2. Flatten the Chunks List and Prepare the Vector Database

Flattening the per-document chunk lists into a single list for easier indexing, then generating an embedding for each chunk and storing the embeddings in a vector index for fast retrieval.

In [2]:
# 6 min

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Flatten the chunks for easier indexing
flattened_chunks = [chunk for chunks in data['chunks'] for chunk in chunks]

# Initialize the embedding model and encode all chunks in one batched call
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(flattened_chunks, show_progress_bar=True)

# Initialize FAISS index and add the embeddings
dimension = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))
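
Since a flat FAISS index stores only vectors, numbered in the order they were added, position `i` in the index corresponds to `flattened_chunks[i]`. The sketch below shows one possible way to also remember which document each chunk came from. `chunks_per_doc` is a hypothetical stand-in for `data['chunks']`, and `chunk_to_doc` is an assumed helper list that does not exist in this notebook.

```python
# Hypothetical stand-in for data['chunks']: one list of chunks per document
chunks_per_doc = [
    ["doc0 chunk0", "doc0 chunk1"],
    [],                # a document with no text yields no chunks
    ["doc2 chunk0"],
]

flattened_chunks = []
chunk_to_doc = []  # chunk_to_doc[i] = position of the document that produced flattened_chunks[i]

for doc_pos, chunks in enumerate(chunks_per_doc):
    for chunk in chunks:
        flattened_chunks.append(chunk)
        chunk_to_doc.append(doc_pos)

print(flattened_chunks)
print(chunk_to_doc)
```

With such a mapping, a retrieved index can be traced back to its source row, for example to display the article title next to the answer.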



### 3. Implement the Retriever Module

Retrieving the most relevant chunks based on a user query. The retriever uses the FAISS index to return the top-k matching chunks.

In [3]:
def retrieve(query, k=5):
    # Encode the query into an embedding
    query_embedding = model.encode([query])

    # Search the index for the top-k most similar embeddings
    distances, indices = index.search(query_embedding, k)

    # Retrieve corresponding chunks; FAISS pads results with -1 when fewer than k matches exist
    retrieved_chunks = [flattened_chunks[idx] for idx in indices[0] if idx != -1]
    return retrieved_chunks

# Example retrieval
print(retrieve("Sample query"))

['data processed', 'participants', 'represent 15 percent regions projected growth obtains necessary permits commissioning would take place 2026 content protected copyright may reused want cooperate us would like reuse content please contact editors pvmagazinecom please mindful community standards email address published required fields marked save name email website browser next time comment submitting form agree pv magazine using data purposes publishing comment personal data disclosed otherwise transmitted third parties purposes spam filtering necessary technical maintenance website transfer third parties take place unless justified basis applicable data protection regulations pv magazine legally obliged may revoke consent time effect future case personal data deleted immediately otherwise data deleted pv magazine processed request purpose data storage fulfilled information data privacy found data protection policy website uses cookies anonymously count visitor numbers view privacy p
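
To illustrate what the index search in `retrieve` computes, here is a pure-Python sketch of exhaustive nearest-neighbour search with squared Euclidean distance, which is the metric `IndexFlatL2` reports. The function `l2_search` and the toy vectors are illustrative assumptions; the real index works on the 384-dimensional embeddings and is far faster.

```python
def l2_search(query, vectors, k):
    """Return (distances, indices) of the k vectors closest to `query`,
    comparing against every vector with squared Euclidean distance."""
    scored = []
    for idx, vec in enumerate(vectors):
        dist = sum((q - v) ** 2 for q, v in zip(query, vec))
        scored.append((dist, idx))
    scored.sort()  # ascending distance; ties broken by lower index
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]

# Toy 2-dimensional "embeddings" and their chunks (hypothetical data)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
chunks = ["chunk a", "chunk b", "chunk c", "chunk d"]

distances, indices = l2_search([0.9, 0.1], vectors, k=2)
retrieved = [chunks[i] for i in indices]
print(indices, retrieved)
```

Smaller distances mean closer matches, so the first returned chunk is the best candidate.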

### 4. Implement the Generator Module

Generating responses based on the retrieved chunks using an LLM (GPT-4 deployed on Azure OpenAI).

In [4]:
import os
from openai import AzureOpenAI
import credentials

# Define the deployment name based on the model deployed in Azure
deployment_name = 'gpt-4'

# Initialize the AzureOpenAI client
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

def generate_answer(context, query):
    # Create the prompt using the query and context
    prompt = f"Question: {query}\nContext: {context}"
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a knowledgeable assistant."},
            {"role": "user", "content": prompt}
        ],
        model=deployment_name,
        max_tokens=100
    )
    generated_response = response.choices[0].message.content.strip()
    return generated_response

# Example usage: retrieve relevant chunks and generate a response
retrieved_texts = retrieve("Sample query")
response = generate_answer(" ".join(retrieved_texts), "Sample query")
print(f"Response:\n{response}")

Response:
I'm sorry, but your question was not clear. Can you please provide more details or context so I can assist you better?
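
The prompt in `generate_answer` concatenates the question and all retrieved text. As a sketch of one alternative, the hypothetical helper below numbers the chunks, caps the context at a word budget, and tells the model to rely on the context. The name `build_prompt`, the instruction wording, and the default of 600 words are assumptions for illustration, not settings used in this notebook.

```python
def build_prompt(query, chunks, max_context_words=600):
    """Assemble a prompt from retrieved chunks, keeping whole chunks
    until the word budget `max_context_words` would be exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        n_words = len(chunk.split())
        if used + n_words > max_context_words:
            break  # stop before exceeding the budget
        selected.append(chunk)
        used += n_words
    context = "\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(selected))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

prompt = build_prompt("What is X?", ["a b c", "d e", "f g h i"], max_context_words=5)
print(prompt)
```

The returned string could be passed as the user message in place of the f-string built inside `generate_answer`.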


### 5. Implement the Evaluator Module

Evaluating generated answers against the gold-standard responses from the evaluation dataset.
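
The lexical metric in the evaluation cell takes element-wise minima and maxima of word-count vectors, which makes it a count-based (weighted) Jaccard score and not a pure set overlap. The sketch below contrasts the two on a toy pair of sentences. The whitespace tokenizer is a simplifying assumption and does not reproduce `CountVectorizer`'s default tokenization, and the function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Simplified tokenizer: lowercase and split on whitespace
    return text.lower().split()

def set_jaccard(a, b):
    """Jaccard similarity over the sets of distinct words."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def count_jaccard(a, b):
    """Weighted Jaccard: sum of minimum counts over sum of maximum counts."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    intersection = sum((ca & cb).values())  # element-wise minimum of counts
    union = sum((ca | cb).values())         # element-wise maximum of counts
    return intersection / union if union else 0.0

generated = "the water based process replaces the toxic solvents"
gold = "water based process instead of toxic solvents"

print(f"{set_jaccard(generated, gold):.2f}")    # 0.56 (5 shared words / 9 distinct words)
print(f"{count_jaccard(generated, gold):.2f}")  # 0.50 (the repeated "the" enlarges the union)
```

Repeated words lower the count-based score, so long generated answers tend to score lower against short gold answers even when the key terms match.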

In [15]:
# 2 min

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from csv import QUOTE_ALL

# Load the evaluation data
evaluation_data = pd.read_csv(
    "../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv",
    delimiter=';',
    on_bad_lines="skip",
    quoting=QUOTE_ALL
)

# Jaccard similarity over word counts (weighted variant: sum of minima / sum of maxima)
def jaccard_similarity(generated_answer, gold_standard):
    vectorizer = CountVectorizer().fit_transform([generated_answer, gold_standard])
    vectors = vectorizer.toarray()
    intersection = np.minimum(vectors[0], vectors[1]).sum()
    union = np.maximum(vectors[0], vectors[1]).sum()
    return intersection / union if union != 0 else 0

# Semantic Similarity using Sentence-BERT
semantic_model = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_similarity(generated_answer, gold_standard):
    embeddings = semantic_model.encode([generated_answer, gold_standard])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

# Evaluate each query against the gold standard answer
evaluation_scores = []

for query, gold_standard in zip(evaluation_data['question'], evaluation_data['answer']):
    # Retrieve chunks and generate an answer
    retrieved_texts = retrieve(query)
    generated_answer = generate_answer(" ".join(retrieved_texts), query)

    # Calculate Jaccard and Semantic Similarity Scores directly
    jaccard_score_value = jaccard_similarity(generated_answer, gold_standard)
    semantic_score = semantic_similarity(generated_answer, gold_standard)

    # Combine scores for a final evaluation score
    combined_score = (jaccard_score_value + semantic_score) / 2
    evaluation_scores.append(combined_score)

    print(f"Query: {query}\nGenerated Answer: {generated_answer}\nGold Standard: {gold_standard}")
    print(f"Jaccard Score: {jaccard_score_value:.2f}, Semantic Score: {semantic_score:.2f}, Combined Score: {combined_score:.2f}\n")

# Calculate average score across all evaluations
average_score = sum(evaluation_scores) / len(evaluation_scores)
print(f"Average Evaluation Score: {average_score:.2f}")



Query: What is the innovation behind Leclanché's new method to produce lithium-ion batteries?
Generated Answer: The innovation behind Leclanché's new method to produce lithium-ion batteries lies in the production process and the materials used. Leclanché has developed an environmentally friendly way to produce lithium-ion batteries, replacing the highly toxic organic solvents typically used in the production process with a water-based process. This approach results in a lower carbon footprint, uses 30 percent less energy, and eliminates the risk of explosion, making the production process safer.

The new method makes nickelmanganesecobalt
Gold Standard: Leclanché's innovation is using a water-based process instead of highly toxic organic solvents to produce nickel-manganese-cobalt-aluminium cathodes for lithium-ion batteries.
Jaccard Score: 0.18, Semantic Score: 0.85, Combined Score: 0.51

Query: What is the EU’s Green Deal Industrial Plan?
Generated Answer: The Green Deal Industrial P