# Climate Change RAG Chatbot

This project uses a ChromaDB vector database, the OpenAI API, and an IPCC (Intergovernmental Panel on Climate Change) PDF report to build a retrieval-augmented generation (RAG) chatbot that answers questions about climate change from the content of that report.

In [51]:
import os
from dotenv import load_dotenv
import re #regular expressions
from pprint import pprint
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import chromadb
import openai
from openai import OpenAI
import streamlit as st

load_dotenv()  # call the function so variables in a local .env file are loaded into the environment
openai.api_key = os.environ["OPENAI_API_KEY"]


## Data Preprocessing

We first preprocess the PDF so that we are left with just the raw text, which can then be tokenized.

In [24]:
# import the pdf
ipcc_report_file = "IPCC_AR6_WGII_TechnicalSummary.pdf"
reader = PdfReader(ipcc_report_file)
ipcc_texts = [page.extract_text().strip() for page in reader.pages]

In [25]:
pprint(ipcc_texts[5])

('40\n'
 'Technical Summary\n'
 'TSTS.A Introduction\n'
 'TS.A.1 Background\n'
 'This technical summary complements and expands the key findings of \n'
 'the Working Group (WG) II contribution to the Sixth Assessment Report \n'
 '(AR6) presented in the Summary for Policymakers and covers literature \n'
 'accepted for publication by 1 September 2021. It provides technical \n'
 'understanding and is developed from the key findings of chapters and \n'
 'cross-chapter papers (CCPs) as presented in their executive summaries \n'
 'and integrates across them. The report builds on the WGII contribution \n'
 'to the Fifth Assessment Report (AR5) of the IPCC and three special \n'
 'reports of the AR6 cycle providing new knowledge and updates. The \n'
 'three special reports are the Special Report on Global Warming of \n'
 '1.5°C (2018), an IPCC special report on the impacts of global warming \n'
 'of 1.5°C above pre-industrial levels and related global greenhouse \n'
 'gas emission pathways in t

In [26]:
# Filter out the beginning and end of the document
ipcc_texts_filt = ipcc_texts[5: -5]
print(f"Number of pages: {len(ipcc_texts_filt)}")
ipcc_texts_filt

Number of pages: 74


['40\nTechnical Summary\nTSTS.A Introduction\nTS.A.1 Background\nThis technical summary complements and expands the key findings of \nthe Working Group (WG) II contribution to the Sixth Assessment Report \n(AR6) presented in the Summary for Policymakers and covers literature \naccepted for publication by 1 September 2021. It provides technical \nunderstanding and is developed from the key findings of chapters and \ncross-chapter papers (CCPs) as presented in their executive summaries \nand integrates across them. The report builds on the WGII contribution \nto the Fifth Assessment Report (AR5) of the IPCC and three special \nreports of the AR6 cycle providing new knowledge and updates. The \nthree special reports are the Special Report on Global Warming of \n1.5°C (2018), an IPCC special report on the impacts of global warming \nof 1.5°C above pre-industrial levels and related global greenhouse \ngas emission pathways in the context of strengthening the global \nresponse to the threat 

In [27]:
# Remove all header and footer texts
#   (regular expression with substitution)
ipcc_wo_header_footer = [re.sub(r'\d+\nTechnical Summary', '', s) for s in ipcc_texts_filt]
# remove \nTS (note: this also strips the "TS" prefix from section labels such as "\nTS.A.1")
ipcc_wo_header_footer = [re.sub(r'\nTS', '', s) for s in ipcc_wo_header_footer]

# remove TS\n
ipcc_wo_header_footer = [re.sub(r'TS\n', '', s) for s in ipcc_wo_header_footer]
ipcc_wo_header_footer

['TS.A Introduction.A.1 Background\nThis technical summary complements and expands the key findings of \nthe Working Group (WG) II contribution to the Sixth Assessment Report \n(AR6) presented in the Summary for Policymakers and covers literature \naccepted for publication by 1 September 2021. It provides technical \nunderstanding and is developed from the key findings of chapters and \ncross-chapter papers (CCPs) as presented in their executive summaries \nand integrates across them. The report builds on the WGII contribution \nto the Fifth Assessment Report (AR5) of the IPCC and three special \nreports of the AR6 cycle providing new knowledge and updates. The \nthree special reports are the Special Report on Global Warming of \n1.5°C (2018), an IPCC special report on the impacts of global warming \nof 1.5°C above pre-industrial levels and related global greenhouse \ngas emission pathways in the context of strengthening the global \nresponse to the threat of climate change, sustainabl

In [28]:
# Split the text
char_splitter = RecursiveCharacterTextSplitter(
    separators= ["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000, # in characters; the embedding model accepts ~256 tokens per input, and at roughly 4 characters per token 256*4 ~= 1000
    chunk_overlap=2
    )

texts_char_splitted = char_splitter.split_text('\n\n'.join(ipcc_wo_header_footer)) # join all pages into one string before splitting
print(f"Number of chunks: {len(texts_char_splitted)}")

Number of chunks: 461
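
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is a deliberately simplified illustration, not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on the listed separators); the function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


example_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(example_chunks)
```

Each chunk repeats the last character of the previous one, which is the job `chunk_overlap` does at a larger scale: it keeps a little shared context across chunk boundaries.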


In [29]:
# Token split
# Model not specified, using default
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=2, # must be int, changed from 0.2
    tokens_per_chunk=256
    )

texts_token_splitted = []
for text in texts_char_splitted:
    try:
        texts_token_splitted.extend(token_splitter.split_text(text))
    except Exception as e:
        print(f"Error in text: {text[:100]}...") # Print the first 100 characters to avoid overly long output
        print(f"Error type: {type(e).__name__}, Error message: {str(e)}\n") # Print the type and message of the exception
        continue
    
print(f"Number of chunks: {len(texts_token_splitted)}")

Number of chunks: 560


In [34]:
# Show first chunk
pprint(texts_token_splitted[0])

('ts. a introduction. a. 1 background this technical summary complements and '
 'expands the key findings of the working group ( wg ) ii contribution to the '
 'sixth assessment report ( ar6 ) presented in the summary for policymakers '
 'and covers literature accepted for publication by 1 september 2021. it '
 'provides technical understanding and is developed from the key findings of '
 'chapters and cross - chapter papers ( ccps ) as presented in their executive '
 'summaries and integrates across them. the report builds on the wgii '
 'contribution to the fifth assessment report ( ar5 ) of the ipcc and three '
 'special reports of the ar6 cycle providing new knowledge and updates. the '
 'three special reports are the special report on global warming of 1. 5°c ( '
 '2018 ), an ipcc special report on the impacts of global warming of 1. 5°c '
 'above pre - industrial levels and related global greenhouse gas emission '
 'pathways in the context of strengthening the global response to 

## Vector Database

A vector database is a collection of data stored as mathematical representations (vectors), which makes it well suited to machine learning models. Entries can be retrieved by similarity metrics over those vectors (much as GloVe embeddings are used to find similar words), so a search can match on meaning rather than on exact wording.
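
As a rough illustration of similarity-based lookup, the sketch below ranks a few toy vectors against a query vector using cosine similarity. The vectors, the function names, and the choice of cosine similarity are all assumptions made for the example; the metric a ChromaDB collection actually uses depends on how it is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional "embeddings"; real embedding vectors have hundreds of dimensions
toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(toy_query, toy_docs)
print([index for index, _ in ranking])
```

The document whose vector points in nearly the same direction as the query ranks first, which is the intuition behind retrieving text by meaning.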

In [36]:
# Database setup
chroma_client = chromadb.PersistentClient(path="db")
chroma_collection = chroma_client.get_or_create_collection("ipcc")

In [37]:
# Add documents to the db
ids = [str(i) for i in range(len(texts_token_splitted))]
chroma_collection.add(
    ids=ids,
    documents=texts_token_splitted
    )

/Users/dylanwardlow/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:15<00:00, 5.41MiB/s]


In [38]:
# Query the db
query = "What is the impact of climate change on the ocean?"
result = chroma_collection.query(query_texts=[query], n_results=5)
result

{'ids': [['170', '138', '164', '31', '87']],
 'distances': [[0.7571058869361877,
   0.7732738256454468,
   0.7798511385917664,
   0.7818472385406494,
   0.8270434141159058]],
 'metadatas': [[None, None, None, None, None]],
 'embeddings': None,
 'documents': [['14. 5. 4, 16. 5. 2, 16. 6. 3, ccb moving plate }. c. 3. 2 climate change will significantly alter aquatic food provisioning services, with direct impacts on food - insecure people ( high confidence ). global ocean animal biomass will',
   'distribution at all scales. { 16. 5. 2, table 16. a. 4, smts. 2 } ecosystems and biodiversity. c. 1 without urgent and ambitious emissions reductions, more terrestrial, marine and freshwater species and ecosystems will face conditions that approach or exceed the limits of their historical experience ( very high confidence ). threats to species and ecosystems in oceans, coastal regions and on land, particularly in biodiversity hotspots, present a global risk that will increase with every additio
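
The query result is a dictionary of nested lists, with one inner list per query text. A small helper can pair each id with its distance and document so the hits are easier to inspect. This is only a sketch: `flatten_query_result` and `example_result` are hypothetical, and the example data merely mimics the nested shape of the output above.

```python
def flatten_query_result(result):
    """Pair each returned id with its distance and document for the first query text."""
    ids = result["ids"][0]
    distances = result["distances"][0]
    documents = result["documents"][0]
    return [
        {"id": doc_id, "distance": distance, "document": document}
        for doc_id, distance, document in zip(ids, distances, documents)
    ]


# Hypothetical data with the same nested shape as a single-query result
example_result = {
    "ids": [["170", "138"]],
    "distances": [[0.757, 0.773]],
    "documents": [["first chunk text", "second chunk text"]],
}
hits = flatten_query_result(example_result)
for hit in hits:
    print(hit["id"], hit["distance"], hit["document"][:40])
```

In the output above the distances appear in ascending order, so the first hit is the closest match to the query.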

## RAG Development

Retrieval-augmented generation (RAG) combines the capabilities of a retrieval model and a generative model. The user's query is converted to a vector and compared against the vectors in the database to find the most relevant chunks, and the query is then augmented with that retrieved text. With each query, the LLM can therefore base its response largely on the data in the vector database rather than solely on its general training data, because the prompt it receives contains the retrieved text alongside the question.

In [56]:
def rag(query, n_results=5):
    result = chroma_collection.query(query_texts=[query], n_results=n_results)
    docs = result["documents"][0]
    joined_information = ';'.join(docs)  # docs is already a list of strings
    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so the first one ends with a space
            "content": "You are a helpful expert on climate change. Your users are asking questions about information contained in attached information. "
            "You will be shown the user's question, and the relevant information. Answer the user's question using only this information."
        },
        {   "role": "user", 
            "content": f"Question: {query}. \n Information: {joined_information}"
        }
    ]
    openai_client = OpenAI()
    model = "gpt-3.5-turbo"
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [57]:
query = "What is the impact of climate change on the ocean?"

pprint(rag(query=query, n_results=5))

('Climate change will significantly alter aquatic food provisioning services, '
 'impacting food-insecure people. Global ocean animal biomass distribution '
 'will be affected, impacting ecosystems and biodiversity. Without urgent '
 'emissions reductions, more species and ecosystems face dangerous conditions '
 'exceeding historical limits. Threats to species and ecosystems in oceans, '
 'coastal regions, and biodiversity hotspots are increasing with each '
 'additional tenth of a degree of warming. The transformation of ecosystems '
 'and loss of biodiversity, exacerbated by pollution and habitat changes, '
 'threaten livelihoods and food security. Additionally, climate change has '
 'altered marine, terrestrial, and freshwater ecosystems globally, with '
 'biological responses struggling to cope with recent changes. Climate-induced '
 'changes in the hydrological cycle have negatively impacted freshwater and '
 'terrestrial ecosystems.')
