# **Capstone Project - Electrochemistry Assistant**

---

**<center> Author: Aravindan Natarajan </center>**
**<center>Version: 1.0</center>**

---

## 📝 Project Summary

This project showcases a question-answering system that uses Google's Gemini model for embeddings and text generation, and ChromaDB for vector storage and retrieval. The system applies Retrieval Augmented Generation (RAG) to answer user queries about electrochemistry.

### ⚙️ Workflow

1. **Document Processing:** 📄 Extracts text from PDF documents related to electrochemistry using `pdfminer`.
2. **Embedding Generation:** 🧬 Creates embeddings for the extracted text using the Gemini API and stores them in a ChromaDB vector database.
3. **Query Processing:** 🔎 Generates embeddings for user queries using the Gemini API.
4. **Retrieval:** 🧲 Searches the vector database for relevant documents based on the query embeddings.
5. **Answer Generation:** 💬 Uses the Gemini model with a prompt enriched by the retrieved documents to answer the user's question in a comprehensive and user-friendly way.

### ✨ Gen AI Features

* **Embeddings:** 🧬 Used to represent documents and queries as vectors, enabling semantic search for relevant information.
* **Retrieval Augmented Generation (RAG):** 🧲 Retrieves relevant context from the document database to enhance the answer quality and accuracy.
* **Document understanding:** 📄 Allows the system to process and understand information from PDF documents containing electrochemical knowledge.

### 🌟 Benefits

* **Efficient and accurate information retrieval:** Quickly finds relevant answers to questions about electrochemical concepts.
* **User-friendly responses:** Provides clear, comprehensive, and informative answers in a conversational style.
* **Advanced Language Understanding:** Utilizes the powerful Gemini model for accurate comprehension and generation of natural language.

### 🚀 Potential Applications

* **Educational tool:** 📚 Aids in learning and understanding electrochemical principles.
* **Research assistant:** 🔬 Assists in exploring and analyzing electrochemical literature.
* **Customer support chatbot:** 🗣️ Answers questions about electrochemical products or services.

## 🛠️ Setup

This section focuses on preparing the environment and installing necessary libraries for the project. Key steps include:

1. **Installing Libraries:** 📦 The libraries `google-genai`, `chromadb`, and `pdfminer.six` are installed using `pip`; `kaggle_secrets` is already available in the Kaggle environment. Together they provide functionality for interacting with the Gemini API, managing the vector database, extracting text from PDFs, and accessing API keys securely.

2. **API Key Authentication:** 🔑 The `kaggle_secrets` library is used to securely retrieve the Google API key required for accessing the Gemini API. This ensures that sensitive credentials are handled safely.

3. **Initializing Gemini Client:** A `genai.Client` is initialized using the API key, establishing a connection to the Gemini service for generating embeddings and content.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/capstone-data-pdf/PDF-10-CV-ImpParam.pdf
/kaggle/input/capstone-data-pdf/PDF-6-Chronoamp.pdf
/kaggle/input/capstone-data-pdf/PDF-7-Chronocoul.pdf


In [2]:
!pip uninstall -qqy jupyterlab kfp  # Remove unused conflicting packages
!pip install -qU "google-genai==1.7.0" "chromadb==0.6.3"
!pip install pdfminer.six


In [3]:
from google import genai
from google.genai import types

from IPython.display import Markdown

genai.__version__

'1.7.0'

### API Key Setup

In [4]:
from kaggle_secrets import UserSecretsClient

GOOGLE_API_KEY = UserSecretsClient().get_secret("GOOGLE_API_KEY")
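
The cell above only works inside Kaggle. As a minimal sketch for running the notebook elsewhere, the hypothetical helper `load_api_key` below tries `kaggle_secrets` first and otherwise reads an environment variable. The fallback behaviour and the helper name are assumptions for illustration, not part of the original notebook.

```python
import os


def load_api_key(name: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from Kaggle secrets, else from an environment variable."""
    try:
        # kaggle_secrets is only importable inside Kaggle notebooks.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        # Assumed fallback for local runs: read the key from the environment.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"Set the {name} environment variable.")
        return key
    return UserSecretsClient().get_secret(name)
```

Outside Kaggle, `GOOGLE_API_KEY = load_api_key()` would then stand in for the direct `UserSecretsClient` call.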

In [5]:
client = genai.Client(api_key=GOOGLE_API_KEY)

for m in client.models.list():
    if "embedContent" in m.supported_actions:
        print(m.name)

models/embedding-001
models/text-embedding-004
models/gemini-embedding-exp-03-07
models/gemini-embedding-exp


## 📊 Data

### 📖 Data Summary

The documents used for this project are PDF files related to electrochemistry techniques. They are stored in a dataset folder named `capstone-data-pdf`. This folder contains 3 key documents:

* `PDF-10-CV-ImpParam.pdf`: Focuses on Cyclic Voltammetry.
* `PDF-6-Chronoamp.pdf`: Covers Chronoamperometry.
* `PDF-7-Chronocoul.pdf`: Explains Chronocoulometry.

These documents serve as the knowledge base for the question-answering system. 

### 🌐 Data Source

The documents were originally sourced from [Analytical Electrochemistry: The Basic Concepts](https://www.asdlib.org/onlineArticles/ecourseware/Kelly_Potentiometry/EC_CONCEPTS1.HTM) 

### 📜 Data Attribution

The documents are licensed under [CC-BY-NC-SA 2.5](https://creativecommons.org/licenses/by-nc-sa/2.5/). This license permits non-commercial use, sharing, and adaptation with attribution. 

### 🙏 Author Attribution

The notebook author expresses sincere gratitude to Prof. Richard S. Kelly, Department of Chemistry, East Stroudsburg University, for creating these valuable resources and sharing them with the research community.

In [6]:
import kagglehub
capstone_data_pdf_path = kagglehub.dataset_download('aravindannatarajan/capstone-data-pdf')

In [7]:
from pdfminer.high_level import extract_text

def extract_text_from_pdf(pdf_path):
    try:
        text = extract_text(pdf_path)
        return text
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
        return None

def extract_text_from_folder(folder_path):
    """
    Extracts text from all PDF files in a given folder.

    Args:
        folder_path (str): The path to the folder containing the PDF files.

    Returns:
        list: A list of extracted text strings, one for each PDF file.
    """
    extracted_text_list = []

    # Get a sorted list of all PDF files so the document order is deterministic
    # (os.listdir returns entries in arbitrary order)
    pdf_files = sorted(f for f in os.listdir(folder_path) if f.endswith(".pdf"))

    if not pdf_files:
        print(f"Error: No PDF files found in {folder_path}")
    else:
        for pdf_file in pdf_files:
            pdf_path = os.path.join(folder_path, pdf_file)  # Create the full file path
            extracted_text = extract_text_from_pdf(pdf_path)
            if extracted_text:
                extracted_text_list.append(extracted_text)

    return extracted_text_list

if __name__ == "__main__":
    folder_path = capstone_data_pdf_path  # Replace with the path to your folder containing the 3 PDFs

    extracted_text_list = extract_text_from_folder(folder_path)

    if len(extracted_text_list) == 3:
        DOCUMENT1 = extracted_text_list[0]
        DOCUMENT2 = extracted_text_list[1]
        DOCUMENT3 = extracted_text_list[2]

        print("\nExtracted Text:")
        print("DOCUMENT1:\n", DOCUMENT1)
        print("\nDOCUMENT2:\n", DOCUMENT2)
        print("\nDOCUMENT3:\n", DOCUMENT3)
    else:
        print("Error: The folder does not contain exactly 3 PDF files.")


Extracted Text:
DOCUMENT1:
 Analytical Electrochemistry:  The Basic Concepts 

b)  

Cyclic Voltammetry 

ii)  Important parameters in CV.  Cyclic voltammograms are most often characterized by:  
a) the location of the forward and reverse peaks on the potential axis (Ep and ∆Ep, allowing the 
calculation of E0’); b) the ratio of currents observed on the reverse and forward scans (ip,rev / ip, 
fwd);  and  c)  the  dependence  of  peak  currents  on  the  scan  rate  (ip  vs.  ν1/2
).    As  before,  an 
uncomplicated, chemically and electrochemically reversible redox system will first be described.  
Next,  deviations  from  the  reversible  case  will  be  presented,  and  the  resulting  changes  in  the 
appearance  of  the  voltammograms  obtained  will  be  used  as  diagnostic  criteria  for  electrode 
reaction mechanisms of varying types. 

Peak Location.  In the forward scan of a cyclic voltammogram, the conditions that exist 
at the surface of the electrode are the same as t

In [8]:
# Create the list of documents
documents = [DOCUMENT1,DOCUMENT2,DOCUMENT3]
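
The extracted text shown above contains runs of spaces and stray blank lines left over from the PDF layout. The notebook embeds the text as-is; the sketch below shows one optional way to tidy it first. `normalize_whitespace` is a hypothetical helper, and the assumption that page breaks appear as form-feed characters reflects typical `pdfminer` output rather than anything verified on these files.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces and blank lines left behind by PDF extraction."""
    text = text.replace("\x0c", "\n")       # form feeds (assumed page-break markers)
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces/tabs -> single space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # 3+ newlines -> one blank line
    return text.strip()


sample = "Peak  currents   increase \n\n\n\n with  scan  rate. "
print(normalize_whitespace(sample))
```

If used, it could be applied as `documents = [normalize_whitespace(d) for d in documents]` before the texts are added to the database. It does not repair scattered equation fragments, only spacing.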

## 🗄️ Creating the embedding database with ChromaDB

This section focuses on building a vector database using ChromaDB to store and efficiently search electrochemical document information.  It involves these key steps:

1. **Define an Embedding Function:** 🧬 A custom class `GeminiEmbeddingFunction` is created, utilizing the Gemini API to generate embeddings for both documents (`retrieval_document` mode) and queries (`retrieval_query` mode). This function ensures consistent and relevant embeddings for effective search and retrieval.

2. **Initialize ChromaDB:** 💾 A ChromaDB client and a collection named "electroanalchem" are initialized, specifying the embedding function created earlier. This sets up the database structure and embedding method for storing and accessing document representations.

3. **Add Documents:** 📄 The extracted text from PDF documents is added to the ChromaDB collection with unique IDs. This populates the database with the relevant electrochemical information and its corresponding embeddings, ready for retrieval.

4. **Query and Retrieve:** 🔎 The embedding function is switched to `retrieval_query` mode for generating query embeddings. A user query is then used to search the database, retrieving the most relevant documents. This demonstrates how ChromaDB enables efficient information retrieval based on semantic similarity.

In [9]:
from chromadb import Documents, EmbeddingFunction, Embeddings
from google.api_core import retry

from google.genai import types


# Define a helper to retry when per-minute quota is reached.
is_retriable = lambda e: (isinstance(e, genai.errors.APIError) and e.code in {429, 503})


class GeminiEmbeddingFunction(EmbeddingFunction):
    # Specify whether to generate embeddings for documents, or queries
    document_mode = True

    @retry.Retry(predicate=is_retriable)
    def __call__(self, input: Documents) -> Embeddings:
        if self.document_mode:
            embedding_task = "retrieval_document"
        else:
            embedding_task = "retrieval_query"

        response = client.models.embed_content(
            model="models/text-embedding-004",
            contents=input,
            config=types.EmbedContentConfig(
                task_type=embedding_task,
            ),
        )
        return [e.values for e in response.embeddings]
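
The `@retry.Retry(predicate=is_retriable)` decorator re-invokes the embedding call whenever the predicate returns true for the raised error. The pure-Python sketch below illustrates that idea with exponential backoff. It is a simplified stand-in, not the `google.api_core` implementation (which also applies jitter and an overall deadline), and `retry_call`, `flaky`, and the delay values are hypothetical.

```python
import time


def retry_call(func, is_retriable, max_attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call func(), retrying with exponential backoff while is_retriable(error) is true."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            # Give up on non-retriable errors or when attempts are exhausted.
            if not is_retriable(error) or attempt == max_attempts:
                raise
            sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, ...


# Usage with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated quota error")
    return "ok"


waits = []  # record the requested delays instead of actually sleeping
outcome = retry_call(flaky, lambda e: isinstance(e, TimeoutError), sleep=waits.append)
print(outcome, waits)
```

Passing `sleep=waits.append` keeps the example instant while still showing which delays would have been used.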

In [10]:
import chromadb

DB_NAME = "electroanalchem"

embed_fn = GeminiEmbeddingFunction()
embed_fn.document_mode = True

chroma_client = chromadb.Client()
db = chroma_client.get_or_create_collection(name=DB_NAME, embedding_function=embed_fn)

db.add(documents=documents, ids=[str(i) for i in range(len(documents))])

In [11]:
# Verify the number of objects in the database
db.count()

3

## 🧲 Retrieval: Find Relevant Documents

This section focuses on how the system efficiently retrieves relevant documents from the ChromaDB database to answer user queries. It involves:

1. **Switching to Query Mode:** 🔄 The `GeminiEmbeddingFunction` is configured to `retrieval_query` mode, ensuring the embedding generation aligns with the purpose of searching for relevant information.

2. **Formulating a User Query:** 🗣️ A user provides a query, representing their information need, for example: "What is the formula for formal potential?".

3. **Query Embedding:** 🧬 The Gemini API generates an embedding for the user's query, capturing its semantic meaning in vector form.

4. **Database Search:** 🔍 ChromaDB uses the query embedding to efficiently search the database for documents with similar embeddings, indicating semantic relevance.

5. **Retrieving Results:** 📄 The system identifies and retrieves the most relevant documents from the database, providing the necessary context for answering the user's query.
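
The database search in steps 3 to 5 amounts to ranking stored vectors by their similarity to the query vector. The sketch below shows that ranking with cosine similarity on made-up 3-dimensional vectors. The vectors, the document IDs, and the `top_k` helper are hypothetical, and ChromaDB's actual index and distance metric may differ from this brute-force version.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return (doc_id, score) pairs for the k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy embeddings; real embedding vectors have hundreds of dimensions.
toy_docs = {
    "cv": [0.9, 0.1, 0.0],
    "chronoamp": [0.1, 0.9, 0.1],
    "chronocoul": [0.0, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]
print(top_k(toy_query, toy_docs, k=1))
```

Here the query vector points mostly along the first axis, so the `"cv"` entry ranks first.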

In [12]:
# Switch to query mode when generating embeddings.
embed_fn.document_mode = False

# Search the Chroma DB using the specified query.
query = "What is the formula for formal potential?"

result = db.query(query_texts=[query], n_results=1)
[all_passages] = result["documents"]

Markdown(all_passages[0])

Analytical Electrochemistry:  The Basic Concepts 

b)  

Cyclic Voltammetry 

ii) Important parameters in CV. Cyclic voltammograms are most often characterized by: a) the location of the forward and reverse peaks on the potential axis ($E_p$ and $\Delta E_p$, allowing the calculation of $E^{0'}$); b) the ratio of currents observed on the reverse and forward scans ($i_{p,rev}/i_{p,fwd}$); and c) the dependence of peak currents on the scan rate ($i_p$ vs. $\nu^{1/2}$). As before, an uncomplicated, chemically and electrochemically reversible redox system will first be described. Next, deviations from the reversible case will be presented, and the resulting changes in the appearance of the voltammograms obtained will be used as diagnostic criteria for electrode reaction mechanisms of varying types.

Peak Location. In the forward scan of a cyclic voltammogram, the conditions that exist at the surface of the electrode are the same as those we have previously described for linear scan voltammetry. For conditions that include an electrode product that is stable on the time scale of the potential scan, the presence of a return peak in cyclic voltammetry allows us to now determine the formal potential for a redox couple, $E^{0'}$, by averaging the two peak potentials

$$E^{0'} = \frac{E_{p,f} + E_{p,r}}{2}$$

where $E_{p,f}$ and $E_{p,r}$ represent the forward and reverse peak potentials, respectively. The position of the $E^{0'}$ is characteristic of a redox species in much the same way that the wavelength of maximum absorbance is characteristic of a species in spectroscopic experiments.

The separation between the two peak potentials, $\Delta E_p$, can be used to determine the electrochemical reversibility for a redox couple, with

$$\Delta E_p = \frac{0.058}{n}$$

for the reversible case ($\Delta E_p$ in volts). This value is independent of the scan rate for fast electron transfer. Increasing values of $\Delta E_p$ as a function of increasing scan rate indicate the presence of electrochemical irreversibility. The value of $\Delta E_p$ can be used in the calculation of the heterogeneous electron transfer rate constant ($k_s$) for the redox reaction.

In practice, the theoretical value of (58/n) mV for $\Delta E_p$ is seldom observed. A known reversible standard compound (like ferrocene or ferricyanide) is often helpful in the measurement of practical values for $\Delta E_p$ under the specific experimental conditions being used in the characterization of an unknown compound. The $\Delta E_p$ is also useful in the determination of n-values, as a two-electron transfer (n = 2) will give about 29 mV for the reversible case.

Current ratios. In cyclic voltammetry, we are given the opportunity to directly observe the stability of the electrochemically generated product. If the product is stable on the time scale of the experiment, the peak current observed for the return potential scan ($i_{p,r}$) should be equal to that seen for the forward potential scan ($i_{p,f}$) given reversible electron transfer reactions in both directions.

$$\frac{i_{p,r}}{i_{p,f}} = 1.0$$

The measurement of peak current is sometimes complicated by the difficulty in locating a proper “baseline” from which to define the magnitude of a peak. If no other peaks are present, the baseline for the forward wave can generally be defined by extrapolating the zero current portion of the scan to a position under the forward peak. The reverse peak current is more often problematic, with baselines affected by the location of the switching potential, among other things. Proper extrapolation of baselines for peak current measurement in the absence of complications is shown at left in the Figure 19. Many computer-controlled potentiostats have data systems that can accurately and quickly make these types of measurements. For situations where baseline assignment is more complicated, the current ratio can sometimes be conveniently calculated using the empirical method of Nicholson⁷, which requires the measurement of the forward ($i_{p,f}$) and reverse peak currents (designated $(i_{p,r})_0$ to differentiate it from that used previously) from the same baseline, along with a third current, $i_\lambda$, which is the current measured at the switching potential. This method is illustrated at right in Figure 19. (DigiElch simulation conditions same as in Figure 18.)

[Figure 19: peak current baselines, with labels $i_{p,f}$, $i_{p,r}$, $(i_{p,r})_0$, and $i_\lambda$]

The currents measured using the method of Nicholson can then be applied to the calculation of ($i_{p,r}/i_{p,f}$) by

$$\frac{i_{p,r}}{i_{p,f}} = \frac{(i_{p,r})_0}{i_{p,f}} + \frac{0.48\, i_\lambda}{i_{p,f}} + 0.086$$
 
Scan Rate Dependence of Peak Current. We learned in our discussion of LSV that the observed peak current on the forward potential scan is given for the case of reversible electron transfer by the Randles-Sevcik equation (at 25 °C)

$$i_p = (2.69 \times 10^5)\, n^{3/2} A D^{1/2} C_i\, \nu^{1/2}$$

Thus, peak currents will increase linearly as a function of the square root of the scan rate for reversible electron transfer. Plots of $i_p$ vs. $\nu^{1/2}$ are useful in the characterization of electrochemically reversible redox systems. Deviations from linearity are indicative either of complications in the kinetics of the observed electron transfer, or the result of chemical changes which occur as a result of the electron transfer (homogeneous reactions).

A similar indicator of reversible electron transfer is called the current function, whose value is given by ($i_p/\nu^{1/2}$). The current function should be constant for all scan rates for which the electron transfer is fast enough to maintain the equilibrium ratio between the reduced and the oxidized forms of the redox couple predicted by the Nernst equation.

At this point, it is instructive to note that when describing electrochemical reversibility it is important to consider not only the value of $k_s$, the heterogeneous electron transfer rate constant, but the scan rate as well. It should be clear that even for large values of $k_s$ there exists some values of scan rate for which Nernstian equilibrium cannot be maintained at the electrode surface. At these scan rates, the observed voltammetry will display characteristics of quasi-reversible or irreversible behavior, such as the drawing out of voltammetric peaks over wider potential ranges, decreased peak currents, and increased values for $\Delta E_p$.


## 💬 Augmented Generation: Answer the Question

This section focuses on leveraging the retrieved documents and the Gemini model to generate comprehensive and user-friendly answers to the user's query. It involves:

1. **Prompt Engineering:** ✍️ A detailed prompt is constructed, guiding the Gemini model to provide relevant and informative answers based on the retrieved documents. The prompt emphasizes clarity, comprehensiveness, and user-friendliness.

2. **Contextualization:** 🧲 The retrieved documents are integrated into the prompt, providing essential context for the Gemini model to understand the user's question and generate accurate responses.

3. **Answer Generation:** 🧠 The Gemini model processes the prompt and the context from the retrieved documents to generate a natural language answer to the user's query.

4. **Output Presentation:** 📄 The generated answer is presented in a clear and readable format, typically using markdown for enhanced readability and clarity.
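
Steps 1 and 2 can be wrapped in a small reusable function. The sketch below is one possible shape for it; `build_prompt` and its parameters are hypothetical names, and unlike the notebook's `replace("\n", " ")` it collapses all whitespace runs, which is a deliberate simplification.

```python
def build_prompt(question: str, passages: list[str], instructions: str) -> str:
    """Combine instructions, the question, and retrieved passages into one prompt."""
    lines = [instructions.strip(), "", f"QUESTION: {' '.join(question.split())}"]
    for passage in passages:
        # Flatten each passage to one line so the PASSAGE markers stay unambiguous.
        lines.append(f"PASSAGE: {' '.join(passage.split())}")
    return "\n".join(lines) + "\n"


demo = build_prompt(
    "What is the formula\nfor formal potential?",
    ["E0' is found by averaging\nthe two peak potentials."],
    "Answer using the reference passage below.",
)
print(demo)
```

Keeping the instructions as a parameter makes it easy to try different tones or constraints without touching the retrieval code.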

In [13]:
query_oneline = query.replace("\n", " ")

# This prompt is where you can specify any guidance on tone, or what topics the model should stick to, or avoid.
prompt = f"""You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: {query_oneline}
"""

# Add the retrieved documents to the prompt.
for passage in all_passages:
    passage_oneline = passage.replace("\n", " ")
    prompt += f"PASSAGE: {passage_oneline}\n"

print(prompt)

You are a helpful and informative bot that answers questions using text from the reference passage included below. 
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. 
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and 
strike a friendly and conversational tone. If the passage is irrelevant to the answer, you may ignore it.

QUESTION: What is the formula for formal potential?
PASSAGE: Analytical Electrochemistry:  The Basic Concepts   b)    Cyclic Voltammetry   ii)  Important parameters in CV.  Cyclic voltammograms are most often characterized by:   a) the location of the forward and reverse peaks on the potential axis (Ep and ∆Ep, allowing the  calculation of E0’); b) the ratio of currents observed on the reverse and forward scans (ip,rev / ip,  fwd);  and  c)  the  dependence  of  peak  currents  on  the  scan  rate  (ip  vs.  ν1/2 ).    As  before,  an  uncomplicated, chem

In [14]:
answer = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=prompt)

Markdown(answer.text)

Okay, so when we're talking about cyclic voltammetry, which is a technique used to study redox reactions (where electrons are transferred), one important parameter is the formal potential, denoted as E0'. Essentially, this value tells us about the characteristic potential of a redox species, much like how the wavelength of maximum absorbance tells us about a species in spectroscopy.

To find the formal potential (E0'), you can use the following formula:

E0' = (Ef,p + Er,p) / 2

Where:

Ef,p is the potential at the forward peak

Er,p is the potential at the reverse peak

So, you just take the average of the forward and reverse peak potentials to get the formal potential!
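
The averaging formula in the answer can be checked numerically. The sketch below uses hypothetical peak potentials chosen only for illustration, and also applies the retrieved passage's reversible-case relation ΔEp = 0.058/n (in volts) to estimate the number of electrons transferred. The function names are assumptions, not part of the notebook.

```python
def formal_potential(e_forward: float, e_reverse: float) -> float:
    """E0' as the average of the forward and reverse peak potentials (volts)."""
    return (e_forward + e_reverse) / 2


def peak_separation(e_forward: float, e_reverse: float) -> float:
    """Delta Ep, the absolute separation between the two peak potentials (volts)."""
    return abs(e_forward - e_reverse)


# Hypothetical peak potentials in volts, chosen for illustration only.
e_pf, e_pr = 0.171, 0.229
e0 = formal_potential(e_pf, e_pr)
delta_ep = peak_separation(e_pf, e_pr)

# Reversible case from the passage: delta_ep = 0.058 / n, so n is about 0.058 / delta_ep.
n_estimate = 0.058 / delta_ep
print(f"E0' = {e0:.3f} V, dEp = {delta_ep * 1000:.0f} mV, n ~ {n_estimate:.1f}")
```

With these inputs the formal potential comes out at 0.200 V and the 58 mV separation corresponds to a one-electron transfer. As the passage notes, measured separations rarely match the theoretical value exactly.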


## 🎉 Conclusions

This notebook demonstrates the successful implementation of a question-answering system for electrochemistry using Google's Gemini model, ChromaDB, and Retrieval Augmented Generation (RAG). By leveraging these technologies, the system efficiently extracts relevant information from PDF documents to provide comprehensive and user-friendly answers to user queries.