# Goal
Lots of great information in climate/earth science is locked up in major intergovernmental reports like those from the IPCC. The summarized version linked below is only a small fraction of the ~12000 pages that will eventually be released for the sixth assessment. It would be great for policymakers and hobbyists to be able to converse with this and other documents.

Some folks have already tried to do this with ClimateBERT at chatclimate.ai, but to the best of my knowledge this is a fine-tuned approach and about a year behind relevant docs (2022). I'm curious if we can get decent, grounded QA without the overhead and cost of fine-tuning.

# Approach

Chunk the document and turn those chunks into embeddings using a model of choice (in this case OpenAI, for ease of execution). Those embeddings are stored in a database that supports vector similarity search (Pinecone). When a user writes a query, we embed it into the same embedding space and retrieve the k=5 most relevant chunks of the document. These chunks are appended as context to the model prompt using Langchain, which also handles the heavy lifting of document loading and parsing.
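
As a rough illustration of the retrieval step, here is a minimal sketch of top-k selection by cosine similarity. The 3-dimensional vectors and chunk strings are made-up toy values (real OpenAI embeddings have 1536 dimensions), and in this notebook Pinecone performs the search itself; the helper names `cosine_similarity` and `top_k_chunks` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# Toy data (assumption): stand-ins for embedded document chunks.
toy_chunks = [
    "chunk about warming levels",
    "chunk about solar costs",
    "chunk about sea level rise",
]
toy_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.2, 0.0, 0.9]]
toy_query = [1.0, 0.2, 0.0]  # stand-in for the embedded user query

top_two = top_k_chunks(toy_query, toy_vectors, toy_chunks, k=2)
print(top_two)
```

The retrieved chunks are what gets appended to the prompt as context; everything else in the pipeline is plumbing around this ranking.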



# TODOs


*   Actually finalize the list of documents to include all IPCC reports from the sixth assessment cycle once they are released to the public in totality (~12k pages or so?)
*   Langchain claims the tokenizing/parsing works out of the box, but double check this, especially for weird symbols.
*   On the same note, these papers are filled with charts/infographics/tables. Can't do much about the infographics w/o multi-modal, but should dig really deep into the tables/figures to see if we're extracting correctly.
*   Smarter document chunking than a fixed size (e.g. 1000 characters). Maybe crawl the DOM to get paragraphs and such.
*   Try different ways of doing vector similarity search and of appending context with Langchain.
*   Overall the model is a little *too* grounded and has a tendency to pull only the minimum correct info from the document without the normal elaboration that makes LLMs useful. See if I can tune temperature higher to support more creative/synthesizing queries without sacrificing the objectivity we try to cultivate, or maybe just feed less context (k <= 3 for retrieval? Or assign this dynamically on document size?).
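
One possible direction for the chunking item above, sketched under assumptions: split on blank lines and greedily pack whole paragraphs into chunks of at most `max_chars` characters. The function name `chunk_by_paragraph` and the sample text are hypothetical, and this is not how the splitter used in this notebook works.

```python
def chunk_by_paragraph(text, max_chars=1000):
    """Greedily pack blank-line-separated paragraphs into chunks.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or it starts a fresh (possibly oversized) chunk.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

# Tiny made-up example: three 4-character paragraphs with a 10-character budget.
sample_text = "aaaa\n\nbbbb\n\ncccc"
sample_chunks = chunk_by_paragraph(sample_text, max_chars=10)
print(sample_chunks)
```

Keeping paragraphs intact means a retrieved chunk is less likely to start or end mid-sentence, at the cost of uneven chunk sizes.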




**Install Necessary Packages and Set Up API Keys**

In [None]:
!pip install langchain --upgrade
!pip install unstructured
!pip install pinecone-client
# getpass ships with the Python standard library, so no extra install is needed.
!pip install openai
!pip install tiktoken

In [None]:
from getpass import getpass
from langchain.chains.question_answering import load_qa_chain
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Pinecone
import pinecone

In [3]:
# Prompt for the API keys at runtime so they never end up in the notebook.
openai_api_key = getpass('Enter OpenAI API key: ')
pinecone_api_key = getpass('Enter Pinecone API key: ')
pinecone_environment = 'us-west1-gcp-free' # Shown to the left of the API key
# Name of the index to use in Pinecone Vector Database. For this I created the
# index with dimension size 1536 (this matches the size of OpenAI's embeddings
# but you could also use HuggingFace, etc. so long as it matches) and cosine
# similarity as the default notion of vector similarity.
pinecone_index = 'climatequery'

Enter OpenAI API key: ··········
Enter Pinecone API key: ··········


**Extract and Load Climate Information**

In [None]:
# Summarized version of the report as a placeholder for the proof of concept,
# to be replaced by the full 12k pages when released.
urls = [
    "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_SPM.pdf",
]
loader = UnstructuredURLLoader(urls=urls)
data = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
text_chunks = splitter.split_documents(data)

In [6]:
# Sanity check the text. len(page_content) counts characters, not tokens.
print(f'{len(data)} doc(s) processed. First doc has {len(data[0].page_content)} characters.')
print(f'{len(text_chunks)} chunks created.')

1 doc(s) processed. First doc has 149761 characters.
207 chunks created.
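
The retrieved chunks printed later in this notebook contain typographic ligatures left over from PDF extraction (e.g. 'ﬂooding', 'efﬁciency'). A minimal sketch of one way to fold them back into plain letters is Unicode NFKC normalization from the standard library. The helper name `normalize_extracted_text` is hypothetical, and note that NFKC also rewrites other compatibility characters (e.g. subscript digits), which may or may not be desirable here.

```python
import unicodedata

def normalize_extracted_text(text):
    """Fold compatibility characters (such as the 'fl' and 'fi' ligatures) into plain equivalents."""
    return unicodedata.normalize("NFKC", text)

# \ufb02 is the 'fl' ligature and \ufb01 is the 'fi' ligature.
sample = "Increase in compound \ufb02ooding; improve water use ef\ufb01ciency"
cleaned = normalize_extracted_text(sample)
print(cleaned)
```

If this were adopted, it could be applied to each document's `page_content` before splitting, so the embeddings and the prompt both see the cleaned text.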


**Perform Semantic Search by Embedding Document Text**

In [53]:
pinecone.init(api_key=pinecone_api_key, environment=pinecone_environment)
openai_embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)

In [57]:
content = [text.page_content for text in text_chunks]
docsearch = Pinecone.from_texts(content, openai_embeddings,
                                index_name=pinecone_index)

In [58]:
model = OpenAI(temperature=0.2, openai_api_key=openai_api_key)

In [59]:
# 'stuff' places every retrieved chunk into a single prompt without filtering.
# For larger documents, 'map_reduce' (answer per chunk, then combine) or
# 'map_rerank' (answer per chunk, keep the highest-scoring answer) may work better.
chain = load_qa_chain(model, chain_type="stuff")
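
To make the 'stuff' strategy concrete, here is a minimal sketch of the idea: join every retrieved chunk into one context block and put the question after it. The prompt wording and the function name `build_stuff_prompt` are assumptions for illustration, not Langchain's actual template; the two example strings are excerpts of chunks retrieved later in this notebook.

```python
def build_stuff_prompt(question, docs):
    """Concatenate every retrieved chunk into a single prompt (hypothetical template)."""
    context = "\n\n".join(docs)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Excerpts of retrieved chunks, used here as stand-ins for full documents.
example_docs = [
    "2011-2020 was around 1.1°C warmer than 1850-1900",
    "From 2010-2019 there have been sustained decreases in the unit costs of solar energy (85%)",
]
example_prompt = build_stuff_prompt("How much has the planet warmed?", example_docs)
print(example_prompt)
```

Because the whole context lands in one prompt, this approach is simple and cheap for a handful of chunks but runs into the model's context limit as the number or size of chunks grows.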

**Query Model w/ Document Context**

In [74]:
question = "If we do nothing, how much will the planet warm by 2100"
similar_docs = docsearch.similarity_search(question)
print(similar_docs)

[Document(page_content='Increase in compound ﬂooding\n\nc) The extent to which current and future generations will experience a hotter and different world depends on choices now and in the near-term\n\n2020\n\n2011-2020 was around 1.1°C warmer than 1850-1900\n\nfuture experiences depend on how we address climate change\n\nFuture emissions scenarios:\n\n1900\n\n1940\n\n1980\n\n2060\n\n2100\n\nvery high\n\nwarming continues beyond 2100\n\nhigh\n\nintermediate\n\nlow\n\nvery low\n\n°C\n\nGlobal temperature change above 1850-1900 levels\n\n70 years old in 2090\n\nborn in 2020\n\n0\n\n0.5\n\n1\n\n1.5\n\n2\n\n2.5\n\n3\n\n3.5\n\n4\n\nborn in 1980\n\n70 years old in 2050\n\n70 years old in 2020\n\nborn in 1950', metadata={}), Document(page_content='B.7.1 Only a small number of the most ambitious global modelled pathways limit global warming to 1.5°C (>50%) by 2100 without exceeding this level temporarily. Achieving and sustaining net negative global CO2 emissions, with annual rates of CDR grea

In [75]:
# Correct, see section B.1.1
chain.run(input_documents=similar_docs, question=question)

' If we do nothing, the planet is likely to warm by 4.4°C by 2100.'

In [76]:
# Model without any documents.
# Not wrong but not as up-to-date / precise.
model(question)

'?\n\nIf no action is taken to reduce emissions, the planet is projected to warm by an average of 3.2 to 5.4 degrees Celsius (5.8 to 9.7 degrees Fahrenheit) by 2100.'

In [60]:
# The paper gives a lot of ways to mitigate... In particular I'm looking for
# the best results from the (difficult to parse) table on page 31.
question = "What are the five best mitigation options to reduce net emissions?"
similar_docs = docsearch.similarity_search(question)
print(similar_docs)

[Document(page_content='y t i l i\n\ns e i g r e n y S\n\nClimate responses and adaptation options\n\nl\n\na i t n e t o P\n\nMitigation options\n\nPotential contribution to net emission reduction, 2030\n\nb i s a e f\n\nh t i\n\nGtCO2-eq/yr 4\n\nm\n\nw\n\n0\n\n1\n\n2\n\n3\n\n5\n\nY L P P U S Y G R E N E\n\nSolar\n\nWind Reduce methane from coal, oil and gas Bioelectricity (includes BECCS) Geothermal and hydropower Nuclear\n\nEnergy reliability (e.g. diversiﬁcation, access, stability) Resilient power systems\n\nImprove water use efﬁciency\n\nFossil Carbon Capture and Storage (CCS)\n\nD O O F\n\nEfﬁcient livestock systems\n\nImproved cropland management Water use efﬁciency and water resource management Biodiversity management and ecosystem connectivity Agroforestry\n\nReduce conversion of natural ecosystems\n\n,\n\nR E T A W\n\nCarbon sequestration in agriculture Ecosystem restoration, afforestation, reforestation\n\n,\n\nD N A L\n\nShift to sustainable healthy diets\n\nSustainable aqua

In [62]:
# It's terse but this is exactly the correct order.
chain.run(input_documents=similar_docs, question=question)

' Solar, wind, reduce methane from coal, oil and gas, bioelectricity (includes BECCS), and geothermal and hydropower.'

In [61]:
# More expository and not wrong, just not necessarily grounded in the results
# of the paper.
model(question)

'\n\n1. Increase energy efficiency: Investing in energy efficiency measures such as insulation, LED lighting, and energy efficient appliances can reduce energy consumption and emissions.\n\n2. Renewable energy: Investing in renewable energy sources such as solar, wind, and geothermal can reduce emissions and provide clean energy.\n\n3. Carbon capture and storage: Capturing and storing carbon dioxide emissions from power plants and other sources can reduce net emissions.\n\n4. Reforestation: Planting trees and other vegetation can help absorb carbon dioxide from the atmosphere and reduce net emissions.\n\n5. Sustainable agriculture: Implementing sustainable agricultural practices such as crop rotation, cover crops, and no-till farming can reduce emissions from agricultural activities.'

In [79]:
# There is a little ambiguity in this question (e.g. reduce emissions
# compared to what baseline? What is the time horizon?)
question = "How many gigatons of CO2 emissions do we have to cut by 2035 to limit warming to 1.5 degrees Celcius?"
similar_docs = docsearch.similarity_search(question)
print(similar_docs)

[Document(page_content='Limiting human-caused global warming requires net zero CO2 emissions. Cumulative carbon emissions until the time of reaching net zero CO2 emissions and the level of greenhouse gas emission reductions this decade largely determine whether warming can be limited to 1.5°C or 2°C (high confidence). Projected CO2 emissions from existing fossil fuel infrastructure without additional abatement would exceed the remaining carbon budget for 1.5°C (50%) (high confidence). {2.3, 3.1, 3.3, Table 3.1}', metadata={}), Document(page_content='B.5.4 Based on central estimates only, historical cumulative net CO2 emissions between 1850 and 2019 amount to about four-fifths45 of the total carbon budget for a 50% probability of limiting global warming to 1.5°C (central estimate about 2900 GtCO2), and to about two thirds46 of the total carbon budget for a 67% probability to limit global warming to 2°C (central estimate about 3550 GtCO2). {3.3.1, Figure 3.5}\n\nMitigation Pathways\n\nB.

In [80]:
# 8.5 is orders of magnitude off the mark...
model(question)

'\n\nAccording to the Intergovernmental Panel on Climate Change (IPCC), we need to cut global net CO2 emissions by 45% from 2010 levels by 2030 and reach net zero by 2050 in order to limit global warming to 1.5 degrees Celsius. This would require reducing global CO2 emissions by an estimated 8.5 gigatons by 2035.'

In [81]:
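# The 500 GtCO2 figure matches B.5.2, but note that it is the remaining carbon
# budget (what can still be emitted), not an amount of emissions to cut.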
chain.run(input_documents=similar_docs, question=question)

' According to the context, limiting warming to 1.5°C with no or limited overshoot involves rapid and deep and, in most cases, immediate greenhouse gas emissions reductions in all sectors this decade. The best estimates of the remaining carbon budgets from the beginning of 2020 are 500 GtCO2 for a 50% likelihood of limiting global warming to 1.5°C. Therefore, we need to cut at least 500 GtCO2 emissions by 2035 to limit warming to 1.5 degrees Celcius.'

In [83]:
# Test queries with less explicit connections to the source.
# See if the chain actually gives a creative answer.

# Both answers work, but the one without grounding is probably a little
# better and more informative, subjectively.
question = "What are some creative ways to combat climate change?"
similar_docs = docsearch.similarity_search(question)
print(similar_docs)



In [84]:
chain.run(input_documents=similar_docs, question=question)

' Some creative ways to combat climate change include using carbon capture and storage, improving water use efficiency, shifting to sustainable healthy diets, restoring ecosystems, agroforestry, and using renewable energy sources like solar and wind.'

In [85]:
model(question)

'\n\n1. Plant trees: Planting trees is one of the most effective ways to combat climate change. Trees absorb carbon dioxide from the atmosphere, helping to reduce the amount of greenhouse gases in the air.\n\n2. Reduce energy consumption: Reducing energy consumption is one of the most important things we can do to combat climate change. This can be done by using energy-efficient appliances, turning off lights when not in use, and unplugging electronics when not in use.\n\n3. Use renewable energy sources: Renewable energy sources such as solar, wind, and geothermal can help reduce our reliance on fossil fuels and reduce greenhouse gas emissions.\n\n4. Eat less meat: Eating less meat can help reduce greenhouse gas emissions, as the production of meat requires a lot of energy and resources.\n\n5. Support sustainable agriculture: Supporting sustainable agriculture practices such as crop rotation, composting, and using natural fertilizers can help reduce the amount of greenhouse gases relea

In [71]:
question = "How do I install solar on my roof?"
similar_docs = docsearch.similarity_search(question)
print(similar_docs)

[Document(page_content='Summary for Policymakers\n\nSummary for Policymakers\n\npublic. From 2010-2019 there have been sustained decreases in the unit costs of solar energy (85%), wind energy (55%), and lithium-ion batteries (85%), and large increases in their deployment, e.g., >10x for solar and >100x for electric vehicles (EVs), varying widely across regions. The mix of policy instruments that reduced costs and stimulated adoption includes public R&D, funding for demonstration and pilot projects, and demand-pull instruments such as deployment subsidies to attain scale. Maintaining emission-intensive systems may, in some regions and sectors, be more expensive than transitioning to low emission systems. (high confidence) {2.2.2, Figure 2.4}', metadata={}), Document(page_content='management (e.g., storage and energy efficiency improvements) can increase energy reliability and reduce vulnerabilities to climate change (high confidence). Climate responsive energy markets, updated design st

In [72]:
# Failure case: queries about info that is not in the source doc should ideally
# fall back to the base model. This might also be a Langchain issue.
chain.run(input_documents=similar_docs, question=question)

" I'm sorry, I don't know."

In [73]:
model(question)

'\n\n1. Research local solar installers: Start by researching local solar installers in your area. Ask for referrals from friends and family, and read online reviews.\n\n2. Get a free solar quote: Contact the solar installers you’ve researched and ask for a free quote. Make sure to ask questions about the type of equipment they use, the installation process, and any warranties or guarantees they offer.\n\n3. Choose the right installer: Once you’ve received quotes from several installers, compare them and choose the one that best meets your needs.\n\n4. Sign the contract: Once you’ve chosen an installer, sign the contract and make sure you understand all of the terms and conditions.\n\n5. Schedule the installation: Schedule a date for the installation and make sure you’re available to answer any questions the installer may have.\n\n6. Monitor your system: Once the installation is complete, monitor your system to make sure it’s working properly and producing the expected amount of energy
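
For the failure case above, one possible workaround is to detect a refusal from the grounded chain and fall back to the base model. This is a minimal sketch under assumptions: the refusal phrases, the function name `answer_with_fallback`, and the stub functions are all hypothetical, and simple string matching may miss other ways the model declines to answer.

```python
REFUSAL_MARKERS = ("i don't know", "i do not know")

def answer_with_fallback(question, grounded_fn, base_fn, refusal_markers=REFUSAL_MARKERS):
    """Use the grounded answer unless it looks like a refusal, then ask the base model."""
    grounded = grounded_fn(question).strip()
    if any(marker in grounded.lower() for marker in refusal_markers):
        return {"source": "base_model", "answer": base_fn(question).strip()}
    return {"source": "documents", "answer": grounded}

# Stubs standing in for the grounded chain and the base model.
def fake_grounded(question):
    return " I'm sorry, I don't know."

def fake_base(question):
    return "\n\n1. Research local solar installers."

result = answer_with_fallback("How do I install solar on my roof?", fake_grounded, fake_base)
print(result)
```

In this notebook, `grounded_fn` could wrap the `docsearch.similarity_search` and `chain.run` calls, and `base_fn` could be `model`. Returning the source alongside the answer lets the caller tell the user when a reply is not grounded in the report.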