# <center> **Electrochemistry Assistant** </center>

---

**<center> Author: Aravindan Natarajan </center>**
**<center>Version: 2.0</center>**

---

## Attribution
<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><a property="dct:title" rel="cc:attributionURL" href="https://github.com/anatarajank/Electrochemistry-Assistant">Electrochemistry Assistant</a> &#169; 2025 by <a rel="cc:attributionURL dct:creator" property="cc:attributionName" href="https://www.linkedin.com/in/anatarajank/">Aravindan Natarajan</a> is licensed under <a href="https://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">CC BY 4.0<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" alt=""><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" alt=""></a></p>

---

## Citation
If you use this repository in your work, please cite using the following:

[![DOI](https://img.shields.io/badge/DOI-https%3A%2F%2Fdoi.org%2F10.20944%2Fpreprints202505.1674.v1-yellow)](https://doi.org/10.20944/preprints202505.1674.v1)

Natarajan, A.; Sankaranarayanan, P. Electrochemistry Assistant: A Retrieval-Augmented Question Answering System for Electrochemical Information Retrieval. Preprints 2025, 2025051674. [doi:10.20944/preprints202505.1674.v1](https://doi.org/10.20944/preprints202505.1674.v1)

And please cite this repository using the following:
[Electrochemistry Assistant, Aravindan Natarajan, https://github.com/anatarajank/Electrochemistry-Assistant](https://github.com/anatarajank/Electrochemistry-Assistant)

## 🧪 Project Overview

### 📝 Project Summary

This project presents the Electrochemistry Assistant, a question-answering system that pairs Google's Gemini models with a ChromaDB vector database. It uses retrieval-augmented generation (RAG): passages relevant to a query are retrieved from a set of electrochemistry documents and passed to Gemini to ground the answer. It is intended for researchers, students, and anyone interested in electrochemistry.

### ⚙️ Workflow

1. **Document Processing:** 📄 Extracts text from PDF documents related to electrochemistry using `pdfminer`.
2. **Embedding Generation:** 🧬 Creates embeddings for the extracted text using the Gemini API and stores them in a ChromaDB vector database.
3. **Query Processing:** 🔎 Generates embeddings for user queries using the Gemini API.
4. **Retrieval:** 🧲 Searches the vector database for the documents whose embeddings are closest to the query embedding (Gemini-based retrieval), and ranks the same documents with TF-IDF and BM25 as keyword-based baselines.
5. **Evaluation:** 📊 Evaluates the performance of the retrieval using metrics like Accuracy, Precision, Recall, F1-score, MRR, P@K, and NDCG@K.
6. **Answer Generation:** 💬 Uses the Gemini model with a prompt enriched by the retrieved documents to answer the user's question in a comprehensive and user-friendly way.
7. **Semantic Similarity Evaluation:** 📏 Assesses the quality and relevance of the generated answer by comparing it to the retrieved passages using a pre-trained Sentence Transformer model and calculating semantic similarity scores.
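
Step 6 of the workflow can be pictured with a minimal sketch. The function name `build_prompt`, the instruction wording, and the passage layout are assumptions made for illustration; the prompt actually used by the notebook may differ.

```python
def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble a RAG prompt: instructions, the question, then each passage.

    Hypothetical helper: the wording and layout are illustrative assumptions.
    """
    # Collapse newlines and repeated spaces so each passage fits on one line.
    cleaned = [" ".join(p.split()) for p in passages]
    lines = [
        "You are a helpful assistant that answers electrochemistry questions",
        "using only the reference passages below.",
        "",
        f"QUESTION: {query}",
        "",
    ]
    lines += [f"PASSAGE {i}: {text}" for i, text in enumerate(cleaned, start=1)]
    return "\n".join(lines)


prompt = build_prompt(
    "What is cyclic voltammetry?",
    ["In cyclic voltammetry (CV),\nthe potential scan is reversed."],
)
print(prompt)
```

The resulting string would then be sent to the Gemini model as the content of a generation request.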

### ✨ Gen AI Features

* **Embeddings:** 🧬 Used to represent documents and queries as vectors, enabling semantic search for relevant information.
* **Retrieval Augmented Generation (RAG):** 🧲 Retrieves relevant context from the document database to enhance the answer quality and accuracy.
* **Document understanding:** 📄 Allows the system to process and understand information from PDF documents containing electrochemical knowledge.
* **Large Language Models:** 🧠 Utilizes the powerful Gemini model for accurate comprehension, generation, and contextualization of natural language.

### 🌟 Benefits

* **Efficient and accurate information retrieval:** Quickly finds relevant answers to questions about electrochemical concepts.
* **User-friendly responses:** Provides clear, comprehensive, and informative answers in a conversational style.
* **Advanced Language Understanding:** Gemini interprets free-form questions, so users do not need to know the exact keywords used in the source documents.
* **Comprehensive Coverage:** Covers various electrochemistry techniques and concepts, making it a versatile tool.

### 🚀 Potential Applications

* **Educational tool:** 📚 Aids in learning and understanding electrochemical principles.
* **Research assistant:** 🔬 Assists in exploring and analyzing electrochemical literature.
* **Customer support chatbot:** 🗣️ Answers questions about electrochemical products or services.
* **Decision Support System:** 📊 Helps in making informed decisions related to electrochemistry experiments and research.

## 🛠️ Setup

This section prepares the environment and installs the libraries the project needs. It involves these key steps:

1. **Installing Libraries:** 📦 `google-genai`, `chromadb`, `pdfminer.six`, `hf_xet`, and `rank_bm25` are installed with `pip`. The notebook also uses `kaggle_secrets`, `nltk`, and `sentence_transformers`, which are assumed to be available in the Kaggle image. Together, these libraries cover access to the Gemini API, the vector database, PDF text extraction, secure API key retrieval, text preprocessing, BM25 ranking, and semantic similarity calculations.
2. **API Key Authentication:** 🔑 The `kaggle_secrets` library retrieves the Google API key required by the Gemini API, so the key never appears in the notebook itself. Create a secret named "GOOGLE_API_KEY" containing your API key in the Kaggle Secrets interface.
3. **Initializing Gemini Client:** A `genai.Client` is initialized using the API key, establishing a connection to the Gemini service for generating embeddings and content. This client is used throughout the project to interact with the Gemini API.
4. **Downloading Data:** ⬇️ The electrochemistry PDF documents used for this project are downloaded from Kaggle Datasets using the `kagglehub` library. These documents form the knowledge base for the question-answering system. Ensure that you have an internet connection for this step.
5. **Initializing ChromaDB:** 💾 A ChromaDB client and a collection named "electroanalchem" are initialized, using a custom embedding function based on the Gemini API. This sets up the vector database to store and search document embeddings.
6. **Sentence Transformer Model:** A pre-trained Sentence Transformer model (`all-mpnet-base-v2`) is loaded using the `sentence_transformers` library. This model is used for calculating semantic similarity during evaluation.

By completing these setup steps, you'll have a fully functional environment ready to run the Electrochemistry Assistant and explore its capabilities.
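
The key lookup in step 2 can be sketched so that the same code also runs outside Kaggle. The helper name `load_api_key` and the environment-variable fallback are assumptions for illustration and are not part of the notebook.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from Kaggle Secrets, else from an environment variable.

    Hypothetical helper; the environment-variable fallback is an assumption
    for running outside Kaggle.
    """
    try:
        # Only importable inside a Kaggle notebook session.
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Covers a missing module as well as a missing or inaccessible secret.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"{name} is not set in Kaggle Secrets or the environment")
        return key
```

With this in place, `genai.Client(api_key=load_api_key())` would work in either environment, provided the key is defined in one of the two places.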

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/electroanal-chem-wp/PDF-16-WorkingElec.pdf
/kaggle/input/electroanal-chem-wp/PDF-10-CV-ImpParam.pdf
/kaggle/input/electroanal-chem-wp/PDF-6-Chronoamp.pdf
/kaggle/input/electroanal-chem-wp/PDF-9-CVintro.pdf
/kaggle/input/electroanal-chem-wp/PDF-12-CV-Additional.pdf
/kaggle/input/electroanal-chem-wp/PDF-11-CV-Coupled.pdf
/kaggle/input/electroanal-chem-wp/PDF-8-LSV.pdf
/kaggle/input/electroanal-chem-wp/PDF-7-Chronocoul.pdf
/kaggle/input/electroanal-chem-wp/PDF-27-CareandFeeding.pdf
/kaggle/input/electroanal-chem-wp/PDF-13-ASV.pdf


Let us first install ChromaDB and the Gemini API Python SDK.

In [2]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3" "pdfminer.six" "hf_xet" "rank_bm25"

In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

### API Key Setup

In [4]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")

In [5]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


### Text Processing Libraries

In [6]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /usr/share/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 📊 Data

### 📖 Data Summary

The Electrochemistry Assistant relies on a collection of 10 PDF documents containing information about various electrochemical techniques and concepts. These documents serve as the knowledge base for the question-answering system.

The dataset used in this project is named "electroanal-chem-wp" and is available on Kaggle Datasets. These documents provide a comprehensive overview of essential electrochemistry concepts, enabling the assistant to answer a wide range of user queries.

### 🌐 Data Source

The documents were originally sourced from [Analytical Electrochemistry: The Basic Concepts](https://www.asdlib.org/onlineArticles/ecourseware/Kelly_Potentiometry/EC_CONCEPTS1.HTM) and compiled into a dataset on Kaggle.

### 📜 Data Attribution

The documents are licensed under [CC-BY-NC-SA 2.5](https://creativecommons.org/licenses/by-nc-sa/2.5/). This license permits non-commercial use, sharing, and adaptation with attribution, provided that adaptations are distributed under the same license.

### 🙏 Author Attribution

The notebook author expresses sincere gratitude to Prof. Richard S. Kelly, Department of Chemistry, East Stroudsburg University, for creating these valuable resources and sharing them with the research community.

### 🗃️ Data Processing

Before being used by the Electrochemistry Assistant, the PDF documents undergo the following processing steps:

1. **Text Extraction:** 📄 The `pdfminer` library is used to extract the textual content from each PDF document.
2. **Document Mapping:** 🗺️ The extracted text is organized into a dictionary, where each document is assigned a unique ID and a descriptive title.
3. **Embedding Generation:** 🧬 The Gemini API is used to generate embeddings for each document, capturing the semantic meaning of the text in vector form.

These processing steps prepare the data for efficient storage and retrieval by the ChromaDB vector database.
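
The document-mapping step can be sketched as follows. The helper `build_document_map` and the `titles` lookup are hypothetical; the two file names come from the dataset listing, and the titles follow the notebook's document table.

```python
def build_document_map(pdf_files, titles):
    """Map each collection ID ("0", "1", ...) to its source file and title.

    Hypothetical helper. `pdf_files` must be in the same order in which the
    texts were extracted, because the IDs are positional.
    """
    return {
        str(i): {"file": name, "title": titles.get(name, name)}
        for i, name in enumerate(pdf_files)
    }


# Illustrative subset of the dataset (first two files in the listing).
titles = {
    "PDF-16-WorkingElec.pdf": "Working Electrodes",
    "PDF-10-CV-ImpParam.pdf": "Cyclic Voltammetry Important Parameters",
}
doc_map = build_document_map(
    ["PDF-16-WorkingElec.pdf", "PDF-10-CV-ImpParam.pdf"], titles
)
print(doc_map["0"]["title"])
```

Because `os.listdir` does not guarantee any particular order, keeping the file name next to each ID makes it possible to check that the ground-truth IDs used later still point at the intended documents.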

In [7]:
import kagglehub
data_pdf_path = kagglehub.dataset_download('aravindannatarajan/electroanal-chem-wp')

In [8]:
from pdfminer.high_level import extract_text
import os

def extract_text_from_pdf(pdf_path):
    try:
        text = extract_text(pdf_path)
        return text
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return None

def extract_text_from_folder(folder_path):
    """
    Extracts text from all PDF files in a given folder.

    Args:
        folder_path (str): The path to the folder containing the PDF files.

    Returns:
        dict: A dictionary where keys are "DOCUMENT1", "DOCUMENT2", etc., 
              and values are the extracted text strings.
    """
    extracted_texts = {}  # Use a dictionary to store extracted texts

    pdf_files = [f for f in os.listdir(folder_path) if f.endswith(".pdf")]

    if not pdf_files:
        print(f"Error: No PDF files found in {folder_path}")
    else:
        for i, pdf_file in enumerate(pdf_files):
            pdf_path = os.path.join(folder_path, pdf_file)
            extracted_text = extract_text_from_pdf(pdf_path)
            if extracted_text:
                document_key = f"DOCUMENT{i + 1}"  # Dynamically create key
                extracted_texts[document_key] = extracted_text

    return extracted_texts

if __name__ == "__main__":
    folder_path = data_pdf_path 

    extracted_texts = extract_text_from_folder(folder_path)

    print("\nExtracted Text:")
    for document_key, document_text in extracted_texts.items():
        print(f"{document_key}:\n", document_text) # Comment this out to avoid displaying the document contents
        # You can access individual documents using extracted_texts["DOCUMENT1"], etc.
    
    # Create a list of documents for further processing
    documents = list(extracted_texts.values())


Extracted Text:
DOCUMENT1:
 Analytical Electrochemistry:  The Basic Concepts

C.  Working Electrodes 

1.  Electrode types.  The working electrode (WE) represents the most important component of an 
electrochemical cell.  It is at the interface between the WE and the solution that electron transfers 
of  greatest  interest  occur.    The  selection  of  a  working  electrode  material  is  critical  to 
experimental  success.    Several  important  factors  should  be  considered.    Firstly,  the  material 
should  exhibit  favorable  redox  behavior  with  the  analyte,  ideally  fast,  reproducible  electron 
transfer  without  electrode  fouling.    Secondly,  the  potential  window  over  which  the  electrode 
performs in a given electrolyte solution should be as wide as possible to allow for the greatest 
degree of analyte characterization.  Additional considerations include the cost of the material, its 
ability to be machined or formed into useful geometries, the ease of surf

| Document # | Index (ChromaDB ID) | Document Name |
|---|---|---|
| 1 | 0 | Working Electrodes |
| 2 | 1 | Cyclic Voltammetry Important Parameters |
| 3 | 2 | Chronoamperometry |
| 4 | 3 | Cyclic Voltammetry Intro |
| 5 | 4 | Cyclic Voltammetry Additional |
| 6 | 5 | Cyclic Voltammetry Coupled |
| 7 | 6 | Linear Sweep Voltammetry (LSV) |
| 8 | 7 | Chronocoulometry (CC) |
| 9 | 8 | Care and Feeding |
| 10 | 9 | Anodic Stripping Voltammetry (ASV) |

## 🗄️ Creating the embedding database with ChromaDB

This section focuses on building a vector database using ChromaDB to store and efficiently search electrochemical document information. This enables the Electrochemistry Assistant to quickly retrieve relevant context when answering user queries. 

### 🧬 Embedding Function

A custom class `GeminiEmbeddingFunction` is created, utilizing the Gemini API to generate embeddings for both documents (`retrieval_document` mode) and queries (`retrieval_query` mode). This function ensures consistent and relevant embeddings for effective search and retrieval. It leverages the advanced language understanding capabilities of the Gemini model to capture the semantic meaning of the text.

### 💾 Initializing ChromaDB

A ChromaDB client and a collection named "electroanalchem" are initialized, specifying the `GeminiEmbeddingFunction` created earlier. This sets up the database structure and embedding method for storing and accessing document representations. The "electroanalchem" collection will hold the embeddings of all the electrochemical documents, enabling efficient similarity search.

### 📄 Adding Documents

The extracted text from the PDF documents is added to the ChromaDB collection with unique IDs. This populates the database with the relevant electrochemical information and its corresponding embeddings, ready for retrieval. Each document is assigned a unique ID to facilitate efficient identification and retrieval.

### 🔎 Querying and Retrieving

The `GeminiEmbeddingFunction` is switched to `retrieval_query` mode for generating query embeddings. This allows the system to create embeddings for user queries that are comparable to the document embeddings. When a user asks a question, the system generates an embedding for the query and searches the ChromaDB collection for documents with similar embeddings, indicating semantic relevance. The most relevant documents are then retrieved to provide context for answering the user's question.

In [9]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]

In [10]:
import chromadb

DB_NAME = "electroanalchem"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn,
    metadata={"hnsw:space": "cosine"})

db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

In [11]:
# Verify the number of objects in the database
db.count()

10

## 🧲 Retrieval and Baselining

This section focuses on how the Electrochemistry Assistant efficiently retrieves relevant documents from the ChromaDB database to answer user queries. It also describes the baselining process used to compare different retrieval approaches. This retrieval process is a critical step in providing accurate and informative responses.

### 🔄 Switching to Query Mode

The `GeminiEmbeddingFunction` is configured to `retrieval_query` mode, ensuring the embedding generation aligns with the purpose of searching for relevant information. This mode optimizes the embeddings for finding documents that are semantically similar to the user's query.
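
Since the mode is a plain attribute, it is easy to forget to switch it back before adding more documents. One possible safeguard, sketched here with a hypothetical `query_mode` context manager and a stand-in class, restores the previous mode automatically. It assumes only that the object exposes a boolean `document_mode` attribute, as `GeminiEmbeddingFunction` does.

```python
from contextlib import contextmanager


@contextmanager
def query_mode(embed_fn):
    """Temporarily set `document_mode = False`, then restore the old value.

    Hypothetical helper; it only assumes the object has a boolean
    `document_mode` attribute.
    """
    previous = embed_fn.document_mode
    embed_fn.document_mode = False
    try:
        yield embed_fn
    finally:
        embed_fn.document_mode = previous


class DummyEmbedder:
    """Stand-in for the real embedding function, for demonstration only."""
    document_mode = True


embedder = DummyEmbedder()
with query_mode(embedder):
    mode_inside = embedder.document_mode   # False while querying
mode_after = embedder.document_mode        # restored to True afterwards
```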

### 🗣️ Formulating a User Query

A user provides a query, representing their information need, for example: "What is Cyclic Voltammetry?". This query is the starting point for the retrieval process, as the system needs to identify relevant documents that can provide an answer.

### 🧬 Query Embedding

The Gemini API generates an embedding for the user's query, capturing its semantic meaning in vector form. This embedding represents the query's essence in a way that can be compared to the document embeddings stored in ChromaDB.

### 🔍 Database Search

ChromaDB uses the query embedding to search the collection for the nearest document embeddings. The collection is configured with the cosine metric (`"hnsw:space": "cosine"`), so results are ranked by cosine distance: the smaller the distance, the more semantically similar the document is to the query.
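
The ranking idea can be illustrated with a brute-force sketch on toy vectors. This is a simplification: the function names are hypothetical, the vectors are made up, and ChromaDB itself uses an approximate nearest-neighbour index rather than comparing the query against every document.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_cosine_distance(query_vec, doc_vecs):
    """Return (doc_id, distance) pairs, nearest first; distance = 1 - similarity."""
    distances = {
        doc_id: 1.0 - cosine_similarity(query_vec, vec)
        for doc_id, vec in doc_vecs.items()
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
doc_vecs = {"0": [1.0, 0.0, 0.0], "1": [0.6, 0.8, 0.0], "2": [0.0, 0.0, 1.0]}
ranking = rank_by_cosine_distance([0.7, 0.7, 0.0], doc_vecs)
print([doc_id for doc_id, _ in ranking])
```

Document "1" points in nearly the same direction as the query, so it has the smallest distance and is ranked first; document "2" is orthogonal to the query and is ranked last.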

### 📄 Retrieving Results

The system identifies and retrieves the most relevant documents from the database, providing the necessary context for answering the user's query. These documents are then used to generate a comprehensive and informative answer.

### ➕ Baselining with TF-IDF and BM25

To establish a baseline for retrieval performance, the system incorporates two well-established retrieval models: TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 (Best Matching 25). These models offer different perspectives on document relevance and serve as benchmarks for comparing the performance of the Gemini-based retrieval approach.
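
To make the BM25 baseline less of a black box, here is a minimal pure-Python sketch of a textbook BM25 scoring formula. It is not the `rank_bm25` implementation: the function name `bm25_scores` is hypothetical, the non-negative IDF variant is an assumption, and library implementations may compute IDF differently. The `k1` and `b` defaults mirror the values passed to `BM25Okapi` in the notebook.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.2, b=0.75):
    """Score every document against the query with a textbook BM25 formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalised through this factor.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue
            # Non-negative IDF variant (assumption; libraries differ here).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "cyclic voltammetry reverses the potential scan".split(),
    "chronoamperometry applies a potential step".split(),
    "anodic stripping voltammetry detects trace metals".split(),
]
scores = bm25_scores("cyclic voltammetry".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best)
```

The first toy document contains both query terms and scores highest; the second contains neither and scores zero. Unlike embedding-based retrieval, this ranking depends entirely on exact token overlap.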

### 🎯 Identifying Relevant Documents with `get_relevant_document_ids()`

The `get_relevant_document_ids()` function supplies the ground-truth labels used for evaluation. It relies on a hand-built dictionary, `relevant_docs_per_query`, whose keys are comma-separated topic terms and synonyms and whose values are the IDs of the documents that cover that topic. The function removes stop words from the query and returns the IDs attached to every key that shares a term with the query (by substring match). The rankings produced by each retrieval method are then scored against these IDs.
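
Because the IDs of every matching key are appended to one list, an ID can appear more than once when the document lists of two matching keys overlap, which would inflate the count of relevant documents in the metrics. A small, hypothetical helper (not part of the notebook) shows one way to remove repeats while keeping the original order:

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    return list(dict.fromkeys(ids))


# Example: one matching key lists ["6", "1"] and another lists ["6"].
matched = ["6", "1", "6"]
print(unique_in_order(matched))
```

`dict.fromkeys` preserves insertion order, so the first occurrence of each ID determines its position in the result.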

### 📊 Evaluating Retrieval Performance

To assess the effectiveness of the different retrieval approaches, various metrics are employed, including:

* Accuracy
* Precision
* Recall
* F1-score
* MRR (Mean Reciprocal Rank)
* P@K (Precision at K)
* NDCG@K (Normalized Discounted Cumulative Gain at K)

These metrics provide insights into the quality of the retrieved documents for each model and allow for a comprehensive comparison of their performance.
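
The rank-aware metrics in the list can be sketched in a few lines. This is an illustrative version that assumes binary relevance (a document is either relevant or not); the function names are hypothetical, and the example ranking and ground truth are those shown for the first query in the notebook outputs.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k with binary relevance (1 if relevant, else 0)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: all relevant documents placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["3", "4", "1", "5", "9"]   # Gemini-based ranking for the first query
relevant = {"3", "1", "4", "5"}         # ground-truth IDs for the same query
print(precision_at_k(retrieved, relevant, 5))
print(reciprocal_rank(retrieved, relevant))
print(ndcg_at_k(retrieved, relevant, 5))
```

MRR is the mean of `reciprocal_rank` over all evaluation queries. In this example the four relevant documents occupy the first four positions, so the ranking is already ideal under binary relevance.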

In [12]:
queries = [
    "What is cyclic Voltammetry?",
    "What is chronoamperometry?",
    "Explain the principles of chronocoulometry?",
    "What is the formula for Randles-Sevcik Equation?",
    "What is Linear Sweep Voltammetry?",
    "What is Anodic Stripping Voltammetry?",
    "What are the different types of working electrodes?",
    "How to polish the glassy carbon electrode?",
    "How to keep the electrodes active?",
    "What are coupled reactions?"
]

In [13]:
complex_queries = [
    "Discuss the advantages and limitations of using chronoamperometry versus chronocoulometry for studying adsorption processes at an electrode surface.",
    "Explain how the peak current in cyclic voltammetry is affected by the scan rate and the concentration of the analyte, and derive the Randles-Sevcik equation.",
    "Describe the different types of working electrodes commonly used in electroanalytical chemistry, highlighting their specific properties and applications, and explain the significance of electrode polishing. ",
    "In anodic stripping voltammetry, explain the role of the deposition step and how it influences the sensitivity and selectivity of the technique for trace metal analysis.",
    "Compare and contrast linear sweep voltammetry and cyclic voltammetry, emphasizing the information that can be obtained from each technique and their respective advantages and limitations.",
    "Explain how coupled chemical reactions can influence the shape and characteristics of cyclic voltammograms, providing examples of different types of coupled reactions (e.g., EC, CE).",
    "Discuss the factors that affect the reversibility of an electrochemical reaction, and how they can be determined using cyclic voltammetry.",
    "Describe the principles of electrochemical impedance spectroscopy (EIS) and its applications in studying electrode processes and material properties.",
    "Explain the concept of electrocatalysis and how it can be used to enhance the rate of electrochemical reactions, providing examples of electrocatalysts and their applications.",
    "Discuss the challenges and opportunities in developing new electrochemical sensors for environmental monitoring and biomedical applications."
]

In [14]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = queries[0]

result = db.query(query_texts=[query], n_results=5)
[all_passages] = result["documents"]

Markdown(all_passages[0])

Analytical Electrochemistry:  The Basic Concepts 

b)  Cyclic Voltammetry 

i)    Introduction.    In  linear  scan  voltammetry  (LSV),  the  potential  applied  to  a  working 
electrode is scanned over some potential range of interest as a function of time, while the current 
passing at the electrode is measured.  We saw in a previous section that the current – potential 
profile  for  LSV  was  peak  shaped  for  an  electroactive  species  undergoing  an  electron  transfer 
reaction.    In  cyclic  voltammetry  (CV),  the  potential  scan  is  reversed  at  some  point  beyond  the 
peak, and scanned back in the direction of the initial potential.  In a simple CV experiment, the 
potential  waveform  applied  to  the  electrode  is  a  triangle,  with  the  initial  and  final  potentials 
being the same.  This is illustrated in Figure 17 for a scan between an initial potential of 0.00 V 
and a switching potential of -0.50 V, with a final potential of 0.00 V.  For a scan rate of 0.10 V/s, 
the cycle takes 10.0 seconds to complete. 

Figure 17 (potential waveform; axes: Potential, V versus Time, seconds)

Remember  that  the  reduction  of  an  oxidized  form  (Ox)  of  a  solution  species  at  a  solid 

electrode can be represented by 

$$\mathrm{Ox} + n\,e^- \rightleftharpoons \mathrm{Red}$$

where n represents the stoichiometric number of electrons for the process, and Red represents the 
reduced form of the solution species.  Under the appropriate conditions, this is the reaction that 
occurs on the forward scan in cyclic voltammetry.  Now remember that the double-sided arrow 
indicates that the  oxidized  form of the solution  species can  be regenerated  in  a  gross chemical 
sense from the  reduced form (chemical reversibility), meaning that the product is stable on the 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

time scale of the voltammetric experiment.  Upon reversal of the potential scan, the electrode in 
our  example  becomes  more  oxidizing  and  the  oxidized  form  of  the  solution  species  can  be 
regenerated from the reduced form now in  the  vicinity of  the  electrode  as  the potential  moves 
through the region of the $E^{0'}$ for the redox couple.  This oxidation reaction can be expressed by

$$\mathrm{Red} \rightarrow \mathrm{Ox} + n\,e^-$$

The current – potential trace for a chemically and electrochemically reversible redox couple with
an $E^{0'}$ = -0.30 recorded between 0.00 and -0.50 V is shown in Figure 18.  In essence, the cyclic
voltammogram produced is the result of two separate linear scan voltammograms, one beginning
at 0.00 V and ending at the switching potential of -0.50 V, and a second beginning at -0.50 V and
ending at 0.00 V. (DigiElch simulation parameters: v = 0.100 V/s; A = 0.050 cm²; Ci = 1.0
mol/cm³; D = 1.0 × 10⁻⁵ cm²/s; α = 0.50; ks = 100 cm/s)

Figure 18 (cyclic voltammogram; curve labels: Ox + n e- → Red and Red → Ox + n e-)


In [15]:
retrieved_document_ids = result["ids"][0]  # Get the retrieved document IDs
retrieved_document_ids

['3', '4', '1', '5', '9']

In [16]:
relevant_docs_per_query  = {
"Cyclic Voltammetry, Introduction to Cyclic Voltammetry, CV": ["3","1","4","5"],
"Chronoamperometry, CA, chronoamperometric method, amperometry": ["2"],
"Chronocoulometry, CC, Chronocoulometric method, coulometry":["7"],
"Randles-Sevcik Equation":["6","1"],
"Linear Sweep Voltammetry, LSV":["6"],
"Anodic Stripping Voltammetry, ASV, Stripping Voltammetry":["9"],
"working electrodes, electrode preparation, electrode types":["0"],
"electrode polishing, electrode activation, electrode maintenance, electrode care":["8","0"],
"Chemical reactions coupled to electron transfer, ErCi, ErCr, CrEr":["5"]
}

In [17]:
import re

def get_relevant_document_ids(query, relevant_docs_per_query):
    """
    Gets relevant document IDs based on query and relevant_docs_per_query dictionary.

    Args:
        query: The user's query string.
        relevant_docs_per_query: The dictionary mapping related terms to document IDs.

    Returns:
        A list of relevant document IDs.
    """
    relevant_document_ids = []
    query_terms = query.lower().split()  # Split the query into lowercase terms

    # Use NLTK's stopwords for filtering
    stop_words = set(stopwords.words('english'))
    query_terms = [term for term in query_terms if term not in stop_words]

    for key, value in relevant_docs_per_query.items():
        key_terms = [term.strip().lower() for term in re.split(r',\s*', key)]  # Split key into lowercase terms with space handling

        # Check if any query term is a substring of any key term (case-insensitive) or vice-versa
        if any(any(query_term in key_term or key_term in query_term for key_term in key_terms) for query_term in query_terms):  
            relevant_document_ids.extend(value)  # Add relevant document IDs to the list

    return relevant_document_ids

# Retrieve relevant document ids
relevant_document_ids = get_relevant_document_ids(query, relevant_docs_per_query)
print(relevant_document_ids[:5])  # Output: ['3', '1', '4', '5']

['3', '1', '4', '5']


In [18]:
# Get relevant document IDs from the dictionary
#relevant_document_ids = relevant_docs_per_query.get(query, [])
#relevant_document_ids

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_retrieval(query, documents):
    """
    Retrieves documents using TF-IDF.

    Args:
        query: The user's query.
        documents: A list of documents to search.

    Returns:
        A list of document IDs ranked by relevance.
    """
    # Initialize TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit vectorizer to documents
    vectorizer.fit(documents)

    # Transform query and documents into vectors
    query_vector = vectorizer.transform([query])
    document_vectors = vectorizer.transform(documents)

    # Calculate cosine similarity between query and documents
    similarities = (query_vector * document_vectors.T).toarray()

    # Rank documents based on similarity
    ranked_document_ids = [str(i) for i, _ in sorted(enumerate(similarities[0]), key=lambda item: item[1], reverse=True)][:5]

    return ranked_document_ids

In [20]:
ranked_document_ids_tfidf = tfidf_retrieval(query, documents)
print(ranked_document_ids_tfidf)

['3', '9', '1', '4', '8']


In [21]:
from rank_bm25 import BM25Okapi

def bm25_retrieval(query, documents):
    # Preprocessing
    stop_words = set(stopwords.words('english'))

    def preprocess_text(text):
        tokens = word_tokenize(text.lower())
        tokens = [token for token in tokens if token.isalnum() and token not in stop_words]
        return tokens

    tokenized_documents = [preprocess_text(document) for document in documents]
    tokenized_query = preprocess_text(query)

    # BM25 model with adjusted parameters
    bm25 = BM25Okapi(tokenized_documents, k1=1.2, b=0.75)

    # Scoring and ranking
    scores = bm25.get_scores(tokenized_query)
    ranked_document_ids = [str(i) for i, _ in sorted(enumerate(scores), key=lambda item: item[1], reverse=True)][:5]

    return ranked_document_ids

In [22]:
ranked_document_ids_bm25 = bm25_retrieval(query, documents)

print(ranked_document_ids_bm25)

['3', '1', '4', '5', '0']
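
To make the roles of `k1` and `b` concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring on a toy corpus. It is an illustration under stated assumptions, not the `rank_bm25` implementation: the IDF term is one common non-negative variant and may differ from the one the library uses internally, and the corpus, query, and function name are invented for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.2, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(corpus_tokens)
    avgdl = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a larger weight (non-negative IDF variant).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation, b controls length normalization.
            length_norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy tokenized corpus and query (hypothetical, already lowercased).
corpus = [
    ["cyclic", "voltammetry", "scan"],
    ["battery", "anode", "cathode", "cell"],
    ["voltammetry", "peak", "current", "scan", "rate"],
]
query_tokens = ["cyclic", "voltammetry"]

scores = bm25_scores(query_tokens, corpus)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)  # document indices, best match first
```

Raising `b` toward 1 penalizes long documents more strongly, while a larger `k1` lets repeated occurrences of a term keep adding to the score for longer.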


In [23]:
def calculate_metrics(retrieved_documents, relevant_documents):
    """
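    Calculates accuracy, precision, recall, and F1-score.

    Note: "accuracy" is defined here as the fraction of relevant documents
    that were retrieved, so it always equals recall.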

    Args:
        retrieved_documents: A list of document IDs retrieved by the system.
        relevant_documents: A list of document IDs that are actually relevant to the query.

    Returns:
        A tuple containing accuracy, precision, recall, and F1-score (all floats between 0 and 1).
    """
    num_relevant_retrieved = len(set(retrieved_documents) & set(relevant_documents))
    num_retrieved = len(retrieved_documents)
    num_relevant = len(relevant_documents)

    if num_relevant == 0:
        accuracy = 0.0
        recall = 0.0
    else:
        accuracy = num_relevant_retrieved / num_relevant
        recall = num_relevant_retrieved / num_relevant

    if num_retrieved == 0:
        precision = 0.0
    else:
        precision = num_relevant_retrieved / num_retrieved
    
    # Calculate F1-score
    if precision + recall == 0:
        f1_score = 0.0
    else:
        f1_score = 2 * (precision * recall) / (precision + recall)

    return accuracy, precision, recall, f1_score

# Calculate metrics
accuracy, precision, recall, f1_score = calculate_metrics(retrieved_document_ids, relevant_document_ids)

print(f"Retrieval Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-score: {f1_score}")

Retrieval Accuracy: 1.0
Precision: 0.8
Recall: 1.0
F1-score: 0.888888888888889


In [24]:
# Calculate metrics for TF-IDF
accuracy_tfidf, precision_tfidf, recall_tfidf, f1_score_tfidf = calculate_metrics(ranked_document_ids_tfidf, relevant_document_ids)

print(f"TF-IDF Retrieval Accuracy: {accuracy_tfidf}")
print(f"Precision: {precision_tfidf}")
print(f"Recall: {recall_tfidf}")
print(f"F1-score: {f1_score_tfidf}")

TF-IDF Retrieval Accuracy: 0.75
Precision: 0.6
Recall: 0.75
F1-score: 0.6666666666666665


In [25]:
# Calculate metrics for BM25
accuracy_bm25, precision_bm25, recall_bm25, f1_score_bm25 = calculate_metrics(ranked_document_ids_bm25, relevant_document_ids)

print(f"BM25 Retrieval Accuracy: {accuracy_bm25}")
print(f"Precision: {precision_bm25}")
print(f"Recall: {recall_bm25}")
print(f"F1-score: {f1_score_bm25}")

BM25 Retrieval Accuracy: 1.0
Precision: 0.8
Recall: 1.0
F1-score: 0.888888888888889
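
The numbers above can be checked by hand with set arithmetic. The sketch below recomputes precision, recall, and F1 from the ID lists printed earlier for this query; the helper name `precision_recall_f1` is illustrative and not part of the notebook.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# ID lists copied from the cell outputs above.
relevant_ids = ["3", "1", "4", "5"]
tfidf_ids = ["3", "9", "1", "4", "8"]
bm25_ids = ["3", "1", "4", "5", "0"]

tfidf_metrics = precision_recall_f1(tfidf_ids, relevant_ids)
bm25_metrics = precision_recall_f1(bm25_ids, relevant_ids)

# TF-IDF finds 3 of the 4 relevant IDs among 5 results: P = 3/5, R = 3/4.
print(tuple(round(m, 3) for m in tfidf_metrics))  # (0.6, 0.75, 0.667)
# BM25 finds all 4 relevant IDs among 5 results: P = 4/5, R = 4/4.
print(tuple(round(m, 3) for m in bm25_metrics))   # (0.8, 1.0, 0.889)
```

With four relevant documents and five retrieved, precision cannot exceed 0.8 for this query, so an F1 of about 0.889 is the best achievable score at this cutoff.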


In [26]:
def calculate_mrr(relevant_docs_per_query, db):
    """
    Calculates Mean Reciprocal Rank (MRR) for a set of queries.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries and values are lists of relevant document IDs.
        db: The ChromaDB database object.

    Returns:
        The MRR score (a float between 0 and 1).
    """
    reciprocal_ranks = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        result = db.query(query_texts=[query], n_results=len(db.get()['ids']))  # Get all documents for ranking
        retrieved_doc_ids = result["ids"][0]
        #print(f"Query:{query}, Retrieved Documents in order:{retrieved_doc_ids}")
        # Find the rank of the first relevant document
        rank = 0
        for i, doc_id in enumerate(retrieved_doc_ids, 1):
            if doc_id in relevant_doc_ids:
                rank = i
                break

        # Calculate reciprocal rank and add to the list
        if rank > 0:
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)  # If no relevant document is found

    # Calculate MRR
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return mrr

mrr_score = calculate_mrr(relevant_docs_per_query, db)
print(f"MRR: {mrr_score}")

MRR: 0.9444444444444444


In [27]:
def calculate_mrr_tfidf(relevant_docs_per_query, documents):
    """
    Calculates Mean Reciprocal Rank (MRR) for TF-IDF retrieval.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries 
                                 and values are lists of relevant document IDs.
        documents: A list of documents to search.

    Returns:
        The MRR score (a float between 0 and 1).
    """
    reciprocal_ranks = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        retrieved_doc_ids = tfidf_retrieval(query, documents)  
        #print(retrieved_doc_ids)
        # Find the rank of the first relevant document
        rank = 0
        for i, doc_id in enumerate(retrieved_doc_ids, 1):
            if doc_id in relevant_doc_ids:
                rank = i
                break

        # Calculate reciprocal rank and add to the list
        if rank > 0:
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)  # If no relevant document is found

    # Calculate MRR
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return mrr

# Example usage:
# Assuming you have 'relevant_docs_per_query' and 'documents' defined
mrr_score_tfidf = calculate_mrr_tfidf(relevant_docs_per_query, documents)
print(f"MRR for TF-IDF: {mrr_score_tfidf}")

MRR for TF-IDF: 0.9259259259259258


In [28]:
def calculate_mrr_bm25(relevant_docs_per_query, documents):
    """
    Calculates Mean Reciprocal Rank (MRR) for BM25 retrieval.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries 
                                 and values are lists of relevant document IDs.
        documents: A list of documents to search.

    Returns:
        The MRR score (a float between 0 and 1).
    """
    reciprocal_ranks = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        retrieved_doc_ids = bm25_retrieval(query, documents)  
        
        # Find the rank of the first relevant document
        rank = 0
        for i, doc_id in enumerate(retrieved_doc_ids, 1):
            if doc_id in relevant_doc_ids:
                rank = i
                break

        # Calculate reciprocal rank and add to the list
        if rank > 0:
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0)  # If no relevant document is found

    # Calculate MRR
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks)
    return mrr

# Example usage:
# Assuming you have 'relevant_docs_per_query' and 'documents' defined
mrr_score_bm25 = calculate_mrr_bm25(relevant_docs_per_query, documents)
print(f"MRR for BM25: {mrr_score_bm25}")

MRR for BM25: 0.8055555555555556
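
The three MRR functions above differ only in how they obtain a ranked list of IDs. One possible refactoring, sketched here as an assumption rather than as the notebook's method, is a single helper that accepts any retrieval callable. The name `mean_reciprocal_rank` and the toy data are illustrative.

```python
def mean_reciprocal_rank(relevant_docs_per_query, retrieve):
    """Average 1/rank of the first relevant document over all queries.

    `retrieve` is any callable that maps a query string to a ranked list of IDs.
    """
    reciprocal_ranks = []
    for query, relevant_ids in relevant_docs_per_query.items():
        relevant = set(relevant_ids)
        rr = 0.0  # stays 0 if no relevant document appears in the ranking
        for rank, doc_id in enumerate(retrieve(query), start=1):
            if doc_id in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Toy data (hypothetical queries and rankings, not taken from the notebook).
toy_rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"], "q3": ["c", "b"]}
toy_relevant = {"q1": ["a"], "q2": ["a"], "q3": ["a"]}

mrr = mean_reciprocal_rank(toy_relevant, toy_rankings.get)
print(mrr)  # (1 + 1/2 + 0) / 3 = 0.5
```

With the notebook's objects, the callable could be `lambda q: tfidf_retrieval(q, documents)` or `lambda q: db.query(query_texts=[q], n_results=5)["ids"][0]`. Note that `tfidf_retrieval` and `bm25_retrieval` return only the top five IDs, whereas `calculate_mrr` ranks every document in the collection, so a first relevant hit beyond rank five counts as 0 for the two baselines but not for the ChromaDB retriever.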


In [29]:
def calculate_metrics_at_k(relevant_docs_per_query, db, k=10):
    """
    Calculates Precision at K (P@K) and NDCG at K (NDCG@K) for a set of queries.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries and values are lists of relevant document IDs.
        db: The ChromaDB database object.
        k: The value of K (number of top results to consider).

    Returns:
        A tuple containing the average P@K and average NDCG@K scores (both floats between 0 and 1).
    """
    precisions = []
    ndcg_scores = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        result = db.query(query_texts=[query], n_results=k)  # Get top k results
        retrieved_doc_ids = result["ids"][0]

        # Calculate P@K
        num_relevant_retrieved = len(set(retrieved_doc_ids) & set(relevant_doc_ids))
        precision_at_k = num_relevant_retrieved / k if k > 0 else 0
        precisions.append(precision_at_k)

        # Calculate NDCG@K
        dcg = 0
        for i, doc_id in enumerate(retrieved_doc_ids, 1):
            if doc_id in relevant_doc_ids:
                dcg += 1 / np.log2(i + 1)

        # Ideal DCG for binary relevance: all relevant documents ranked first.
        # The order of the relevant IDs does not change the sum, so no sort is needed.
        # Note: this sums over every relevant document, not only the top k.
        idcg = sum(1 / np.log2(i + 1) for i in range(1, len(relevant_doc_ids) + 1))

        ndcg_at_k = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg_at_k)

    # Calculate average scores
    average_precision_at_k = sum(precisions) / len(precisions)
    average_ndcg_at_k = sum(ndcg_scores) / len(ndcg_scores)

    return average_precision_at_k, average_ndcg_at_k

k_value = 5  # Set the desired value of K
p_at_k, ndcg_at_k = calculate_metrics_at_k(relevant_docs_per_query, db, k=k_value)
print(f"P@{k_value}: {p_at_k}")
print(f"NDCG@{k_value}: {ndcg_at_k}")

P@5: 0.3111111111111111
NDCG@5: 0.9570913150586828


In [30]:
import numpy as np

def calculate_metrics_at_k_tfidf(relevant_docs_per_query, documents, k=10):
    """
    Calculates Precision at K (P@K) and NDCG at K (NDCG@K) for TF-IDF retrieval.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries and values 
                                 are lists of relevant document IDs.
        documents: A list of documents to search.
        k: The value of K (number of top results to consider).

    Returns:
        A tuple containing the average P@K and average NDCG@K scores 
        (both floats between 0 and 1).
    """
    precisions = []
    ndcg_scores = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        retrieved_doc_ids = tfidf_retrieval(query, documents)  # Get top k results (up to 5)

        # If k is larger than the number of retrieved documents, use the number of retrieved documents
        num_retrieved = len(retrieved_doc_ids)
        k_actual = min(k, num_retrieved)
        
        # Calculate P@K
        num_relevant_retrieved = len(set(retrieved_doc_ids[:k_actual]) & set(relevant_doc_ids))
        precision_at_k = num_relevant_retrieved / k_actual if k_actual > 0 else 0
        precisions.append(precision_at_k)

        # Calculate NDCG@K
        dcg = 0
        for i, doc_id in enumerate(retrieved_doc_ids[:k_actual], 1):
            if doc_id in relevant_doc_ids:
                dcg += 1 / np.log2(i + 1)

        # Ideal DCG for binary relevance: relevant documents fill the top k_actual ranks.
        # The order of the relevant IDs does not change the sum, so no sort is needed.
        ideal_hits = min(len(relevant_doc_ids), k_actual)
        idcg = sum(1 / np.log2(i + 1) for i in range(1, ideal_hits + 1))

        ndcg_at_k = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg_at_k)

    # Calculate average scores
    average_precision_at_k = sum(precisions) / len(precisions)
    average_ndcg_at_k = sum(ndcg_scores) / len(ndcg_scores)

    return average_precision_at_k, average_ndcg_at_k

# Example usage:
k_value = 5  # Set the desired value of K
average_precision, average_ndcg = calculate_metrics_at_k_tfidf(relevant_docs_per_query, documents, k=k_value)
print(f"Average P@{k_value} for TF-IDF: {average_precision}")
print(f"Average NDCG@{k_value} for TF-IDF: {average_ndcg}")

Average P@5 for TF-IDF: 0.2888888888888889
Average NDCG@5 for TF-IDF: 0.9257636070809868


In [31]:
def calculate_metrics_at_k_bm25(relevant_docs_per_query, documents, k=10):
    """
    Calculates Precision at K (P@K) and NDCG at K (NDCG@K) for BM25 retrieval.

    Args:
        relevant_docs_per_query: A dictionary where keys are queries and values 
                                 are lists of relevant document IDs.
        documents: A list of documents to search.
        k: The value of K (number of top results to consider).

    Returns:
        A tuple containing the average P@K and average NDCG@K scores 
        (both floats between 0 and 1).
    """
    precisions = []
    ndcg_scores = []

    for query, relevant_doc_ids in relevant_docs_per_query.items():
        retrieved_doc_ids = bm25_retrieval(query, documents)  # Get top k results (up to 5)

        # If k is larger than the number of retrieved documents, use the number of retrieved documents
        num_retrieved = len(retrieved_doc_ids)
        k_actual = min(k, num_retrieved)
        
        # Calculate P@K
        num_relevant_retrieved = len(set(retrieved_doc_ids[:k_actual]) & set(relevant_doc_ids))
        precision_at_k = num_relevant_retrieved / k_actual if k_actual > 0 else 0
        precisions.append(precision_at_k)

        # Calculate NDCG@K
        dcg = 0
        for i, doc_id in enumerate(retrieved_doc_ids[:k_actual], 1):
            if doc_id in relevant_doc_ids:
                dcg += 1 / np.log2(i + 1)

        # Ideal DCG for binary relevance: relevant documents fill the top k_actual ranks.
        # The order of the relevant IDs does not change the sum, so no sort is needed.
        ideal_hits = min(len(relevant_doc_ids), k_actual)
        idcg = sum(1 / np.log2(i + 1) for i in range(1, ideal_hits + 1))

        ndcg_at_k = dcg / idcg if idcg > 0 else 0
        ndcg_scores.append(ndcg_at_k)

    # Calculate average scores
    average_precision_at_k = sum(precisions) / len(precisions)
    average_ndcg_at_k = sum(ndcg_scores) / len(ndcg_scores)

    return average_precision_at_k, average_ndcg_at_k

# Example usage:
k_value = 5  # Set the desired value of K
average_precision, average_ndcg = calculate_metrics_at_k_bm25(relevant_docs_per_query, documents, k=k_value)
print(f"Average P@{k_value} for BM25: {average_precision}")
print(f"Average NDCG@{k_value} for BM25: {average_ndcg}")

Average P@5 for BM25: 0.2888888888888889
Average NDCG@5 for BM25: 0.8186866564475088
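
For reference, here is a minimal sketch of P@K and binary-relevance NDCG@K for a single query, written with the standard library only. It assumes the common convention that the ideal ranking places `min(k, number of relevant documents)` relevant documents at the top. The ChromaDB cell above instead sums the ideal DCG over all relevant documents, while the TF-IDF and BM25 cells truncate at K, so the three NDCG values are not computed in exactly the same way. Function names and toy data are illustrative.

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k > 0 else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k with a log2 position discount."""
    relevant = set(relevant)
    dcg = sum(
        1 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: as many relevant documents as fit into the top k.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy ranking (hypothetical): relevant documents sit at ranks 1 and 3.
toy_retrieved = ["a", "x", "b", "y"]
toy_relevant = ["a", "b"]

p3 = precision_at_k(toy_retrieved, toy_relevant, k=3)
n3 = ndcg_at_k(toy_retrieved, toy_relevant, k=3)
print(round(p3, 3), round(n3, 3))
```

P@K is bounded by the number of relevant documents divided by K, so queries with only one or two relevant documents pull the average P@5 down even when those documents are ranked first. This is one plausible reason the P@5 values above are low while NDCG@5 remains high.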


## 💬 Augmented Generation: Answer the Question

This section uses the retrieved documents and the Gemini model to generate a comprehensive, user-friendly answer to the user's query. This process, known as Retrieval Augmented Generation (RAG), combines information retrieval with a large language model so that the answer is grounded in the retrieved context.

### ✍️ Prompt Engineering

A detailed prompt is constructed, guiding the Gemini model to provide relevant and informative answers based on the retrieved documents. The prompt emphasizes clarity, comprehensiveness, and user-friendliness. It also provides instructions on how to incorporate the retrieved context, format equations, and address potential inaccuracies in the source material.

### 🧲 Contextualization

The retrieved documents are integrated into the prompt, providing essential context for the Gemini model to understand the user's question and generate accurate responses. This contextualization enables the model to leverage the relevant information from the retrieved documents and formulate a more informed and comprehensive answer.

### 🧠 Answer Generation with Gemini

The Gemini model (here `gemini-2.0-flash`), a pre-trained large language model developed by Google, processes the prompt and the context from the retrieved documents to generate a natural language answer to the user's query. Its language understanding capabilities enable it to synthesize information from multiple passages and present it in a coherent and informative manner.

### 📄 Output Presentation

The generated answer is presented in a clear and readable format, typically using markdown for enhanced readability and clarity. This ensures that the answer is easily understandable and accessible to the user.

### ✨ Benefits of RAG

Retrieval Augmented Generation offers several benefits for question answering:

* **Improved Accuracy:** By grounding the answer generation in relevant context, RAG reduces the likelihood of hallucinations or factual errors.
* **Enhanced Comprehensiveness:** The retrieved documents provide a broader knowledge base for the model to draw upon, leading to more comprehensive answers.
* **Increased Transparency:** The use of retrieved documents makes the reasoning process more transparent, as the user can see the source of information used to generate the answer.

### 📏 Semantic Similarity Evaluation

To assess the quality and relevance of the generated answer, a semantic similarity evaluation is performed. This evaluation compares the generated answer to the retrieved passages using a pre-trained Sentence Transformer model ('all-mpnet-base-v2'). The model generates embeddings for both the answer and the passages, capturing their semantic meaning in vector form. Cosine similarity is then calculated between the answer embedding and each passage embedding to determine the degree of semantic overlap.

### 🧮 Calculating F1-Score with `calculate_f1_score()`

The `calculate_f1_score()` function quantifies the semantic similarity between the generated answer and the retrieved passages. A passage counts as relevant when its cosine similarity with the answer is at least 0.70. From the relevant and retrieved passages, the function calculates precision, recall, and the F1-score. Because the relevant passages are selected from the retrieved set itself, recall is 1 whenever at least one passage passes the threshold, so the score mainly reflects the share of retrieved passages that are semantically close to the answer.
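
A minimal sketch of this threshold-based score, assuming the cosine similarities have already been computed and the passages are distinct. The similarity values and the name `threshold_f1` are invented for illustration. With precision $P$ and recall fixed at 1, the score reduces to $F_1 = 2P / (1 + P)$.

```python
def threshold_f1(similarities, threshold=0.70):
    """F1 when the 'relevant' set is the retrieved passages at or above the threshold."""
    num_retrieved = len(similarities)
    num_relevant = sum(1 for s in similarities if s >= threshold)
    precision = num_relevant / num_retrieved if num_retrieved else 0.0
    # Every "relevant" passage was retrieved by construction, so recall is 0 or 1.
    recall = 1.0 if num_relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical cosine similarities for five retrieved passages.
example_similarities = [0.82, 0.77, 0.74, 0.71, 0.55]
example_f1 = threshold_f1(example_similarities)
print(round(example_f1, 3))  # 4 of 5 passages pass: P = 0.8, F1 = 1.6 / 1.8
```

A score of about 0.889 would result, for example, from four of five passages clearing the threshold. Since the result depends directly on the chosen cutoff, it is worth reporting the threshold alongside the score.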

In [32]:
query_oneline = query.replace("\n", " ")

# This prompt is where you can specify any guidance on tone, or what topics the model should stick to, or avoid.
prompt = f"""Your role is that of a leading expert in electrochemistry. Given a user's question and relevant context from internal documents and research, provide a concise, authoritative answer that seamlessly incorporates the provided information.

Instructions:

Prioritize Context: Address the user's question primarily using the supplied context, adding your expertise as needed.

Focus Relevance: If only parts of the context apply, concentrate on those and enrich your explanation with your knowledge.

Ensure Precision: Present facts and concepts with accuracy and clarity.

Format Equations: Use $...$ for inline and $$...$$ for display math, ensuring scientific correctness and proper formatting.

Correct Errors: Confidently rectify any inaccuracies or inconsistencies in the context, providing the correct information with explanation.

Integrate, Don't Cite: Incorporate context naturally without explicit source references.

Be Direct & Professional: Deliver focused, informative, and direct answers, avoiding unnecessary content.

QUESTION: {query_oneline}
"""

# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

Your role is that of a leading expert in electrochemistry. Given a user's question and relevant context from internal documents and research, provide a concise, authoritative answer that seamlessly incorporates the provided information.

Instructions:

Prioritize Context: Address the user's question primarily using the supplied context, adding your expertise as needed.

Focus Relevance: If only parts of the context apply, concentrate on those and enrich your explanation with your knowledge.

Ensure Precision: Present facts and concepts with accuracy and clarity.

Format Equations: Use $...$ for inline and $$...$$ for display math, ensuring scientific correctness and proper formatting.

Correct Errors: Confidently rectify any inaccuracies or inconsistencies in the context, providing the correct information with explanation.

Integrate, Don't Cite: Incorporate context naturally without explicit source references.

Be Direct & Professional: Deliver focused, informative, and direct answers

Now let us use the `generate_content` method to generate an answer to the question.

In [33]:
temp_config = types.GenerateContentConfig(temperature=0.5)

answer = client.models.generate_content(
    model="gemini-2.0-flash",
    config=temp_config,
    contents=prompt)

Markdown(answer.text)

Cyclic voltammetry (CV) is an electrochemical technique where the potential applied to a working electrode is scanned linearly versus time, like in linear sweep voltammetry (LSV), but with a crucial difference: the potential scan is reversed at a specific point and scanned back to the initial potential, creating a triangular waveform. During the forward scan, a species (Ox) is reduced ($Ox + ne^- \rightleftharpoons Red$), and if the product (Red) is stable, the reverse scan re-oxidizes it ($Red \rightleftharpoons Ox + ne^-$), yielding a current-potential trace.

Several parameters characterize a cyclic voltammogram:

*   **Peak Locations:** The forward ($E_{f,p}$) and reverse ($E_{r,p}$) peak potentials, which allow the calculation of the formal potential ($E^0 = \frac{E_{f,p} + E_{r,p}}{2}$). The peak separation ($\Delta E_p = E_{f,p} - E_{r,p}$) indicates electrochemical reversibility, with a theoretical value of $\Delta E_p = \frac{0.058}{n}$ V for a reversible process at 25°C, where *n* is the number of electrons transferred. Experimentally, a known reversible standard is often used to determine practical $\Delta E_p$ values.

*   **Current Ratios:** The ratio of reverse to forward peak currents ($i_{p,r}/i_{p,f}$) indicates the stability of the electrochemically generated product. For a reversible reaction, this ratio is approximately 1.

*   **Scan Rate Dependence:** For reversible electron transfer, peak currents ($i_p$) increase linearly with the square root of the scan rate ($\nu^{1/2}$), as described by the Randles-Sevcik equation: $i_p = (2.69 \times 10^5) n^{3/2} A D^{1/2} C_i \nu^{1/2}$. Deviations from this relationship suggest kinetic complications or chemical reactions coupled to the electron transfer. The current function, $i_p / \nu^{1/2}$, should remain constant for a reversible system.

It's important to note that capacitive current, due to the rearrangement of ions in the double layer, can contribute to the CV signal. This contribution is influenced by the electrode area and scan rate. While often negligible, it can be significant under certain conditions (low analyte concentration, high scan rates) and can be addressed by background subtraction.

Solution resistance also affects CV measurements, causing a potential drop (Ohmic drop, $E_{iR} = i_{cell} R_s$) that distorts the voltammograms. This uncompensated resistance leads to drawn-out curves, increased peak separations, and reduced peak currents.

CV is also used to study chemical reactions coupled to electron transfer, such as the $E_rC_i$ (electrochemical-reversible, chemical-irreversible), $E_rC_r$ (electrochemical-reversible, chemical-reversible), $E_rC_i'$ (catalytic regeneration), $C_rE_r$ and $E_rC_iE_r$ mechanisms. The changes in the voltammograms can be used as diagnostic criteria for electrode reaction mechanisms of varying types.


In [34]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('all-mpnet-base-v2')

def calculate_f1_score(generated_answer, retrieved_passages):
    """
    Calculates F1-score based on semantic similarity between generated answer and retrieved passages.

    Args:
        generated_answer: The generated answer from your system.
        retrieved_passages: A list of passages retrieved by the system.

    Returns:
        The F1-score (a float between 0 and 1).
    """
    # Encode the generated answer and retrieved passages
    answer_embedding = model.encode(generated_answer, convert_to_tensor=True)
    passage_embeddings = model.encode(retrieved_passages, convert_to_tensor=True)

    # Calculate cosine similarity between generated answer and each passage
    similarities = util.cos_sim(answer_embedding, passage_embeddings).cpu().numpy()[0]

    # Define a threshold for considering a passage as relevant (e.g., 0.7)
    relevance_threshold = 0.70

    # Identify relevant and retrieved passages based on similarity
    relevant_passages = [passage for passage, similarity in zip(retrieved_passages, similarities) 
                             if similarity >= relevance_threshold]
    retrieved_passages_set = set(retrieved_passages)  
    relevant_passages_set = set(relevant_passages)

    # Calculate precision, recall, and F1-score
    num_relevant_retrieved = len(retrieved_passages_set & relevant_passages_set)
    num_retrieved = len(retrieved_passages_set)  
    num_relevant = len(relevant_passages_set)

    if num_retrieved == 0:
        precision = 0.0
    else:
        precision = num_relevant_retrieved / num_retrieved

    if num_relevant == 0:
        recall = 0.0
    else:
        recall = num_relevant_retrieved / num_relevant

    if precision + recall == 0:
        f1_score = 0.0
    else:
        f1_score = 2 * (precision * recall) / (precision + recall)

    return f1_score

# Calculate F1-score
f1 = calculate_f1_score(answer.text, all_passages) 
print(f"F1-score: {f1}")

F1-score: 0.888888888888889


## 🎉 Conclusions

This notebook demonstrates the successful implementation of a question-answering system for electrochemistry, leveraging the power of Google's Gemini model, ChromaDB, and Retrieval Augmented Generation (RAG). By combining these technologies, the system efficiently extracts relevant information from electrochemical documents and generates comprehensive, user-friendly answers to user queries.

### Key Findings

* **Effective Retrieval:** Retrieval with Gemini embeddings stored in ChromaDB was compared against TF-IDF and BM25 baselines. Over the evaluated queries, the embedding-based retriever achieved the highest MRR (0.94, versus 0.93 for TF-IDF and 0.81 for BM25) and the highest NDCG@5 (0.96, versus 0.93 and 0.82).
* **High-Quality Answers:** The Gemini model, guided by carefully engineered prompts and enriched with retrieved context, generates answers that are accurate, informative, and semantically aligned with the source material.
* **Semantic Similarity:** The semantic similarity evaluation, based on Sentence Transformer embeddings and the `calculate_f1_score()` function, yields an F1-score of 0.89 for the example query, indicating that most retrieved passages are semantically close to the generated answer.
* **Versatile Applications:** The Electrochemistry Assistant has the potential to be a valuable tool for education, research, customer support, and decision-making in the field of electrochemistry.

### Future Directions

* **Expanding the Knowledge Base:** Incorporating more diverse and comprehensive electrochemical documents to enhance the system's knowledge and coverage.
* **Fine-tuning the Gemini Model:** Fine-tuning the Gemini model on a specialized electrochemistry dataset to further improve its understanding and response generation capabilities.
* **Exploring Advanced Retrieval Techniques:** Investigating and implementing more advanced retrieval techniques, such as hybrid approaches or neural retrieval models, to further enhance retrieval performance.
* **User Interface Development:** Developing a user-friendly interface for interacting with the system, making it more accessible and intuitive for users.

### Overall Impact

The Electrochemistry Assistant represents a significant step towards building intelligent systems that can effectively access and utilize domain-specific knowledge. By automating the process of information retrieval and answer generation, this system has the potential to revolutionize how researchers, students, and professionals interact with electrochemical information, empowering them with valuable insights and accelerating scientific discovery.