### 1. Data Ingestion & Preprocessing

Load the cleaned text data and split it into manageable chunks for efficient retrieval.

In [1]:
import pandas as pd

# Load data
data = pd.read_csv("../data/processed/cleantech_processed_new.csv")

# Function to split text data into manageable sections
def chunk_text(text, chunk_size=200):
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Apply chunking to each content entry
data['chunks'] = data['content_cleaned'].apply(chunk_text)

The chunking variant used is fixed-length word chunking.

Fixed-Length Word Count: The function chunk_text splits the text into segments of at most 200 words. Every chunk holds exactly 200 words except the last one, which holds whatever words remain.
Sequential Chunking: It moves sequentially through the text without overlap, so each chunk is a contiguous run of words and no word appears in two chunks.
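
To make the chunk boundaries concrete, here is a minimal sketch that applies the same chunk_text logic to a toy sentence. The sample text and the chunk size of 5 are illustrative assumptions, not values from the notebook. It shows that the final chunk contains only the leftover words.

```python
def chunk_text(text, chunk_size=200):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words."""
    words = text.split()
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Toy input (assumption): 12 words, so a chunk size of 5 leaves a remainder of 2
sample = "solar and wind power are growing faster than any other energy source"
chunks = chunk_text(sample, chunk_size=5)

for number, chunk in enumerate(chunks):
    print(number, len(chunk.split()), repr(chunk))
```

With the default of 200 words the behaviour is the same: only the last chunk of each document can be shorter than the requested size.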

### 2. Flatten the Chunks List and Prepare the Vector Database

Flatten the list of text chunks for easier indexing, then generate an embedding for each chunk and store the embeddings in a vector index for fast retrieval.

In [2]:
# 6 min

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Flatten the chunks for easier indexing
flattened_chunks = [chunk for chunks in data['chunks'] for chunk in chunks]

# Initialize the embedding model and create embeddings for each chunk
model = SentenceTransformer('all-MiniLM-L6-v2')
# encode() accepts the whole list, batches it internally, and returns one 2-D array
embeddings = model.encode(flattened_chunks)

# Initialize FAISS index and add the embeddings
dimension = embeddings[0].shape[0]
index = faiss.IndexFlatL2(dimension)
index.add(np.array(embeddings))

The embedding model used here is all-MiniLM-L6-v2 from the Sentence Transformers (Sentence-BERT) library. It maps sentences and short paragraphs to 384-dimensional dense vectors (embeddings) and is designed for tasks such as semantic similarity, clustering, and information retrieval.

### 3. Implement the Retriever Module

Retrieve the most relevant chunks for a user query. The retriever uses the FAISS index to return the top-k matching chunks.

In [5]:
def retrieve(query, k=5):
    # Encode the query into an embedding
    query_embedding = model.encode([query])

    # Search the index for the top-k most similar embeddings
    distances, indices = index.search(query_embedding, k)

    # Retrieve corresponding chunks from the flattened list
    retrieved_chunks = [flattened_chunks[idx] for idx in indices[0]]
    return retrieved_chunks

We encode the query into an embedding, then search the vector index for the k nearest embeddings. Because the index is an IndexFlatL2, the search is exhaustive and ranks by L2 (Euclidean) distance, so a smaller distance means a closer match. Finally, we map the returned indices back to the flattened chunk list and return the corresponding text chunks.
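
The search step can be illustrated without FAISS. The following is a minimal sketch in plain Python of what an exhaustive L2 search does: compute the squared Euclidean distance from the query to every stored vector and keep the k smallest. The helper name l2_top_k, the 2-D toy vectors, and the chunk labels are all hypothetical; real embeddings from the notebook's model are much larger.

```python
def l2_top_k(query, vectors, k):
    """Return (squared_distance, position) pairs for the k vectors closest to query."""
    scored = []
    for position, vector in enumerate(vectors):
        squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((squared_distance, position))
    # Smallest distance first, which mirrors the ordering of an L2 index search
    return sorted(scored)[:k]

# Toy data (assumption): three "chunks" with made-up 2-D embeddings
toy_chunks = ["wind turbines", "battery storage", "solar panels"]
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

nearest = l2_top_k([1.0, 0.2], toy_vectors, k=2)
top_chunks = [toy_chunks[position] for _, position in nearest]
print(top_chunks)
```

As in retrieve, the positions returned by the search are only useful because the chunk list and the vector list share the same order.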

### 4. Implement the Generator Module

Generate responses from the retrieved chunks using an LLM (GPT-4 deployed on Azure OpenAI).

In [8]:
import os
from openai import AzureOpenAI
import credentials

# Define the deployment name based on the model deployed in Azure
deployment_name = 'gpt-4'

# Initialize the AzureOpenAI client
client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version="2023-12-01-preview",
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT")
)

def generate_answer(context, query):
    # Create a prompt with refined instructions
    prompt = f"Using only the information provided in the context, please answer the following question as accurately as possible.\n\nContext:\n{context}\n\nQuestion:\n{query}"
    response = client.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a knowledgeable assistant that provides accurate answers based on the provided context."},
            {"role": "user", "content": prompt}
        ],
        model=deployment_name,
        max_tokens=150
    )
    generated_response = response.choices[0].message.content.strip()
    return generated_response

# Example query related to energy
energy_query = "What are the main sources of renewable energy and their benefits?"

# Retrieve relevant chunks from the database based on the energy-related query
retrieved_texts = retrieve(energy_query, k=10)

# Generate an answer based on the retrieved chunks and query
response = generate_answer(" ".join(retrieved_texts), energy_query)
print(f"Response:\n{response}")

Response:
The main sources of renewable energy mentioned in the context are hydrogen, wind, solar, hydro, geothermal, and biomass. 

Hydrogen renewable energy can contribute to many aspects of the energy industry, particularly in periods of reduced wind and sunlight, thus compensating for intermittencies in other renewable sources.

Wind and solar energy, which account for 55 percent of energy generation in Denmark, for instance, are increasingly competitive, particularly when paired with energy storage technologies. They benefit from expansive geographic potential, as weather conditions can vary across regions. The reduction in costs of solar and wind technologies has also been significant over the past decade, making them increasingly cost-competitive. 

Hydro power, as shown by the example of Norway, is particularly useful in


We initialize an Azure OpenAI client and define a generate_answer function that builds a prompt from the given context and query and instructs the model to answer using only that context. We then retrieve the ten most relevant chunks for an energy-related query and generate a response about renewable energy sources and their benefits. The response above stops mid-sentence because max_tokens=150 caps the length of the completion.
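
The prompt assembly can be inspected without calling the API. Below is a minimal sketch, assuming a hypothetical helper build_prompt that reproduces the prompt string used in generate_answer. The two sample chunks and the question are invented for illustration.

```python
def build_prompt(retrieved_chunks, query):
    """Join the retrieved chunks into one context string and wrap it in the instruction text."""
    context = " ".join(retrieved_chunks)
    return (
        "Using only the information provided in the context, please answer the "
        "following question as accurately as possible.\n\n"
        f"Context:\n{context}\n\nQuestion:\n{query}"
    )

# Made-up chunks and question (assumption), standing in for the output of retrieve()
prompt = build_prompt(
    ["Wind power is intermittent.", "Hydrogen can store surplus energy."],
    "How can intermittency be handled?",
)
print(prompt)
```

Printing the prompt this way is a cheap check that the context actually contains the passages needed to answer the question before any tokens are spent on generation.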

### 5. Implement the Evaluator Module

Evaluate the generated answers against the gold-standard responses from the evaluation dataset.

In [9]:
# 2 min

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from csv import QUOTE_ALL

# Load the evaluation data
evaluation_data = pd.read_csv(
    "../data/evaluation/cleantech_rag_evaluation_data_2024-09-20.csv",
    delimiter=';',
    on_bad_lines="skip",
    quoting=QUOTE_ALL
)

# Jaccard similarity on word counts: sum of per-word minimum counts
# divided by sum of per-word maximum counts
def jaccard_similarity(generated_answer, gold_standard):
    vectorizer = CountVectorizer().fit_transform([generated_answer, gold_standard])
    vectors = vectorizer.toarray()
    intersection = np.minimum(vectors[0], vectors[1]).sum()
    union = np.maximum(vectors[0], vectors[1]).sum()
    return intersection / union if union != 0 else 0

# Semantic Similarity using Sentence-BERT
semantic_model = SentenceTransformer('all-MiniLM-L6-v2')
def semantic_similarity(generated_answer, gold_standard):
    embeddings = semantic_model.encode([generated_answer, gold_standard])
    return cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]

# Evaluate each query against the gold standard answer
evaluation_scores = []

for query, gold_standard in zip(evaluation_data['question'], evaluation_data['answer']):
    # Retrieve chunks and generate an answer
    retrieved_texts = retrieve(query)
    generated_answer = generate_answer(" ".join(retrieved_texts), query)

    # Calculate Jaccard and Semantic Similarity Scores directly
    jaccard_score_value = jaccard_similarity(generated_answer, gold_standard)
    semantic_score = semantic_similarity(generated_answer, gold_standard)

    # Combine scores for a final evaluation score
    combined_score = (jaccard_score_value + semantic_score) / 2
    evaluation_scores.append(combined_score)

    print(f"Query: {query}\nGenerated Answer: {generated_answer}\nGold Standard: {gold_standard}")
    print(f"Jaccard Score: {jaccard_score_value:.2f}, Semantic Score: {semantic_score:.2f}, Combined Score: {combined_score:.2f}\n")

# Calculate average score across all evaluations
average_score = sum(evaluation_scores) / len(evaluation_scores)
print(f"Average Evaluation Score: {average_score:.2f}")



Query: What is the innovation behind Leclanché's new method to produce lithium-ion batteries?
Generated Answer: Leclanché's new method to produce lithium-ion batteries involves a more environmentally friendly approach. The company replaced the highly toxic organic solvents typically used in the production process with a water-based process to make nickel-manganese-cobalt-aluminium (NMCA) cathodes. This process not only eliminates the risk of explosion, making the production safer for employees, but also lessens the impact on the environment. It also has a lower carbon footprint, using 30 percent less energy than would be needed to dry, evaporate, or recycle solvents. In addition to this, Leclanché increased the nickel content in the cathode to about 90 percent, which allows the cobalt content to be reduced from 20 percent to
Gold Standard: Leclanché's innovation is using a water-based process instead of highly toxic organic solvents to produce nickel-manganese-cobalt-aluminium cathodes

We load the evaluation data and define functions that compute the Jaccard and semantic similarity between a generated answer and its gold-standard answer. For each query, we retrieve relevant chunks, generate a response, and compute both similarity scores, averaging them into a combined evaluation score. The overall average evaluation score across all queries is 0.33.
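
The count-based Jaccard score is sensitive to answer length, which a small worked example makes visible. This is a sketch under stated assumptions: count_jaccard is a hypothetical stand-in that tokenizes by lowercasing and splitting on whitespace, which is simpler than CountVectorizer's tokenizer, and the texts and the semantic score of 0.8 are made up.

```python
from collections import Counter

def count_jaccard(text_a, text_b):
    """Sum of per-word minimum counts divided by sum of per-word maximum counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocabulary = set(counts_a) | set(counts_b)
    intersection = sum(min(counts_a[word], counts_b[word]) for word in vocabulary)
    union = sum(max(counts_a[word], counts_b[word]) for word in vocabulary)
    return intersection / union if union else 0.0

# Invented texts (assumption): both answers contain the full gold standard
gold = "a water based process"
short_answer = "a water based process"
long_answer = "a water based process that also lowers energy use and cost"

short_score = count_jaccard(short_answer, gold)
long_score = count_jaccard(long_answer, gold)

# Combined score as in the evaluation loop; 0.8 is a made-up semantic score
combined = (long_score + 0.8) / 2
print(short_score, long_score, combined)
```

Both answers cover every word of the gold standard, yet the longer one scores 4/11 because its seven extra words enlarge the union. The same effect applies when a verbose generated answer is compared with a one-sentence gold standard.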

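As a rough rule of thumb, combined scores of about 0.7 or higher suggest good answers, while scores below 0.5 indicate room for improvement in retrieval or response quality. These thresholds are informal rather than standard. The Jaccard component also penalizes answers that are much longer than the gold standard, as in the Leclanché example above, so a low score does not necessarily mean the answer is wrong.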