# Environmental Issues QA System

Prototype of a Question-Answering (QA) system for learning about environmental issues.  

The system queries a set of environment-related Wikipedia articles for relevant passages, then uses a fine-tuned language model to produce a list of possible answers. The articles were downloaded in a separate notebook, based on the index page linked below. This notebook is adapted from the Haystack Basic QA Pipeline tutorial; see the original, linked below, for more detailed comments. Author: ekohrt.

[Original Tutorial Source](https://colab.research.google.com/github/deepset-ai/haystack/blob/main/tutorials/Tutorial1_Basic_QA_Pipeline.ipynb)

[Wikipedia Index of Environmental Articles](https://en.wikipedia.org/wiki/Index_of_environmental_articles)

### Prepare environment

#### Colab: Enable the GPU runtime
Make sure you enable the GPU runtime to experience decent speed in this tutorial.
**Runtime -> Change Runtime type -> Hardware accelerator -> GPU**

<img src="https://raw.githubusercontent.com/deepset-ai/haystack/main/docs/img/colab_gpu_runtime.jpg">

In [1]:
# Install the latest release of Haystack
! pip install farm-haystack

# Install the latest main of Haystack
!pip install --upgrade pip
!pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab]


In [3]:
from haystack.utils import clean_wiki_text, convert_files_to_docs, print_answers
from haystack.nodes import FARMReader, TransformersReader

## Document Store

Haystack finds answers to queries within the documents stored in a `DocumentStore`. The current implementations of `DocumentStore` include `ElasticsearchDocumentStore`, `FAISSDocumentStore`, `SQLDocumentStore`, and `InMemoryDocumentStore`.

### Start an Elasticsearch server
Download the Elasticsearch release archive, unpack it, and run the server as a background process.

In [44]:
# In Colab / no-Docker environments: start Elasticsearch from the downloaded archive
! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
! chown -R daemon:daemon elasticsearch-7.9.2

import os
from subprocess import Popen, PIPE, STDOUT

es_server = Popen(
    ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
)
# wait until ES has started
! sleep 30
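
A fixed 30-second sleep can be too short on a slow machine and wastes time on a fast one. As a minimal sketch, the wait can instead poll the server until it responds. The helper name `wait_for_http` is hypothetical, and the URL in the usage note assumes Elasticsearch is listening on its default port 9200 on localhost.

```python
import time
import urllib.request


def wait_for_http(url, timeout=60.0, interval=1.0):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers refused connections, timeouts, and HTTP error responses:
            # the server is not ready yet, so keep polling.
            pass
        time.sleep(interval)
    return False
```

In this notebook the call would look like `wait_for_http("http://localhost:9200")`, placed right after the `Popen` call; it returns `False` if the server never becomes reachable within the timeout.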

In [45]:
# Connect to Elasticsearch

from haystack.document_stores import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")

## Loading Documents and Preprocessing

In [46]:
# Fetch documents to query (collection of 1397 Wikipedia articles about environmental issues)
from google.colab import drive
drive.mount('/content/drive')
doc_dir = "/content/drive/MyDrive/Colab Notebooks/Wiki_Env_Articles/"

# Convert files to Haystack Document objects
docs = convert_files_to_docs(dir_path=doc_dir, clean_func=clean_wiki_text, split_paragraphs=True)

# Look at the first 3 entries:
print(docs[:3])

# Now, let's write the documents to our DB.
document_store.write_documents(docs)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[<Document: {'content': 'Intelligent design (ID) is a pseudoscientific argument for the existence of God, presented by its proponents as "an evidence-based scientific theory about life\'s origins". Proponents claim that "certain features of the universe and of living things are best explained by an intelligent cause, not an undirected process such as natural selection." ID is a form of creationism that lacks empirical support and offers no testable or tenable hypotheses, and is therefore not science. The leading proponents of ID are associated with the Discovery Institute, a Christian, politically conservative think tank based in the United States.Although the phrase intelligent design had featured previously in theological discussions of the argument from design, its first publication in its present use as an alternative term for creationism was in Of Pandas

## Initialize Retriever, Reader & Pipeline

### Retriever

Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered.  
We use Elasticsearch's default BM25 algorithm.
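
To give a feel for what BM25 ranking does, here is a rough sketch of the textbook Okapi BM25 formula in plain Python. It is not Elasticsearch's implementation: the whitespace tokenizer and the function name `bm25_scores` are simplifications made up for illustration, while `k1=1.2` and `b=0.75` are the usual defaults.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates via k1; long documents are penalized via b
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "cattle ranching drives deforestation in the amazon",
    "carbon tax in denmark",
    "the amazon river",
]
scores = bm25_scores("deforestation amazon", docs)
```

The first document matches both query terms and scores highest, the third matches only one, and the second matches none and scores zero. The retriever keeps the `top_k` highest-scoring documents and hands them to the Reader.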


In [47]:
from haystack.nodes import BM25Retriever

retriever = BM25Retriever(document_store=document_store)

### Reader

A Reader scans the texts returned by retrievers in detail and extracts the k best answers. 
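
The "k best answers" step can be pictured as a selection over scored candidate spans. The sketch below uses made-up scores and a hypothetical helper `best_answers`; in the real Reader the scores come from the language model, and this is not how `FARMReader` is implemented internally.

```python
import heapq


def best_answers(candidates, k=3):
    """Return the k highest-scoring (answer, score) pairs, best first."""
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Illustrative candidates only; the scores are invented for this example.
candidates = [
    ("cattle ranching", 0.91),
    ("soy", 0.40),
    ("Agricultural expansion", 0.62),
    ("logging", 0.10),
]
top_two = best_answers(candidates, k=2)
```

Here `top_two` holds the two strongest candidates in descending score order, which mirrors what the Reader's `top_k` parameter controls.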

#### FARMReader

In [10]:
# Load a local model or any of the QA models on Hugging Face's model hub (https://huggingface.co/models)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)


### Pipeline

With a Haystack `Pipeline`, you can combine your building blocks into a search pipeline.
Under the hood, `Pipelines` are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases.
To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the `ExtractiveQAPipeline` that combines a retriever and a reader to answer our questions.
You can learn more about `Pipelines` in the [docs](https://haystack.deepset.ai/docs/latest/pipelinesmd).
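
The idea of a pipeline as a graph of named nodes can be sketched in a few lines of plain Python. Everything here (`ToyPipeline`, `toy_retriever`, `toy_reader`, `CORPUS`) is a hypothetical stand-in rather than Haystack's API; the sketch only shows how per-node parameters keyed by node name are routed, and how requiring inputs to exist before a node is added keeps the graph acyclic.

```python
CORPUS = [
    "The Montreal Protocol phased out ozone-depleting substances.",
    "Cattle ranching drives deforestation in the Amazon.",
    "Hydropower emits little greenhouse gas.",
]


def toy_retriever(state, top_k=2):
    """Rank documents by how many query words they share with the query."""
    query_terms = set(state["query"].lower().split())

    def overlap(doc):
        return len(query_terms & set(doc.lower().rstrip(".").split()))

    ranked = sorted(CORPUS, key=overlap, reverse=True)
    return {"documents": ranked[:top_k]}


def toy_reader(state, top_k=1):
    """Stand-in reader: use the first two words of each document as its answer."""
    return {"answers": [" ".join(doc.split()[:2]) for doc in state["documents"][:top_k]]}


class ToyPipeline:
    def __init__(self):
        self.nodes = {}  # node name -> (component, names of input nodes)

    def add_node(self, component, name, inputs):
        # A node may only depend on "Query" or on nodes added earlier,
        # so the graph is acyclic by construction.
        for parent in inputs:
            if parent != "Query" and parent not in self.nodes:
                raise ValueError(f"unknown input node: {parent}")
        self.nodes[name] = (component, inputs)

    def run(self, query, params=None):
        params = params or {}
        state = {"query": query}
        # Insertion order is a valid topological order of the graph
        for name, (component, _inputs) in self.nodes.items():
            state.update(component(state, **params.get(name, {})))
        return state


toy_pipe = ToyPipeline()
toy_pipe.add_node(toy_retriever, name="Retriever", inputs=["Query"])
toy_pipe.add_node(toy_reader, name="Reader", inputs=["Retriever"])
result = toy_pipe.run(
    "what drives deforestation in the amazon",
    params={"Retriever": {"top_k": 1}, "Reader": {"top_k": 1}},
)
```

The `params` dictionary has the same shape as the one passed to `pipe.run` in this notebook: each top-level key names a node, and its value holds that node's keyword arguments.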

In [48]:
from haystack.pipelines import ExtractiveQAPipeline

pipe = ExtractiveQAPipeline(reader, retriever)

## Example Questions

In [93]:
from IPython.utils import io

# You can configure how many candidates the Retriever and Reader return.
# A higher Retriever top_k tends to improve answers but slows the query down.
def ask(query):
    with io.capture_output() as captured:  # suppress console output
        prediction = pipe.run(
            query=query, params={"Retriever": {"top_k": 50}, "Reader": {"top_k": 3}}
        )
    return prediction

In [54]:
# Correct: cattle ranching
prediction = ask("What is the main cause of deforestation in the amazon?")
print_answers(prediction, details="minimum")


Query: What is the main cause of deforestation in the amazon?
Answers:
[   {   'answer': 'cattle ranching',
        'context': 's. Some 80% of the deforestation of the Amazon can be '
                   'attributed to cattle ranching, as Brazil is the largest '
                   'exporter of beef in the world. The Amazo'},
    {   'answer': 'Consumption and production of beef',
        'context': 'lion hectares of virgin tropical forest was lost in 2018. '
                   'Consumption and production of beef is the primary driver '
                   'of deforestation in the Amazon, wit'},
    {   'answer': 'Agricultural expansion',
        'context': 'lobally due to increasing stocks in temperate and boreal '
                   'forest.Agricultural expansion continues to be the main '
                   'driver of deforestation and forest fra'}]


In [68]:
# Incorrect (expected: Qatar)
prediction = ask("What country has the highest per capita carbon emissions?")
print_answers(prediction, details="minimum")


Query: What country has the highest per capita carbon emissions?
Answers:
[   {   'answer': 'U.S.',
        'context': 'mitter at 1.8 Tonnes of CO2e respectively, compared with '
                   'for example the U.S. at position of the 14th largest per '
                   'capita CO2e emitter at 22.9 Tonnes o'},
    {   'answer': 'Denmark',
        'context': '\n'
                   '=== Denmark ===\n'
                   'As of 2002, the standard carbon tax rate since 1996 '
                   'amounted to 100 kr. per tonne of CO2, equivalent to '
                   'approximately €13 or US$18. T'},
    {   'answer': 'Japan ===\nAlthough Japan',
        'context': '\n'
                   '=== Japan ===\n'
                   'Although Japan does not tax carbon emissions directly, '
                   'since 2012 the country has levied a "Tax for Climate '
                   'Change Mitigation" on petro'}]


In [77]:
# Incorrect (expected: U.S.)
prediction = ask("What country has produced the highest cumulative carbon emissions?")
print_answers(prediction, details="minimum")


Query: What country has produced the highest cumulative carbon emissions?
Answers:
[   {   'answer': 'Japan',
        'context': '\n'
                   '=== Japan ===\n'
                   'Although Japan does not tax carbon emissions directly, '
                   'since 2012 the country has levied a "Tax for Climate '
                   'Change Mitigation" on petro'},
    {   'answer': 'Norway',
        'context': '(about US$1.3 billion in 2010 dollars).\n'
                   "According to IEA's 2005 Review, Norway's CO2 tax is its "
                   'most important climate policy instrument, and covers a'},
    {   'answer': 'Tokyo',
        'context': 'n trading scheme launched in April 2010 covers the top '
                   '1,400 emitters in Tokyo, and is enforced and overseen by '
                   'the Tokyo Metropolitan Government. Pha'}]


In [96]:
# Incorrect (expected: China), which is odd because the context of the first answer is actually about China
prediction = ask("What country currently has the highest total carbon emissions?")
print_answers(prediction, details="minimum")


Query: What country currently has the highest total carbon emissions?
Answers:
[   {   'answer': 'India',
        'context': ' carbon dioxide than the next two biggest countries '
                   'combined (U.S.A. and India). Total carbon dioxide '
                   'emissions were projected to increase until 2030.'},
    {   'answer': 'United States',
        'context': 'he UK, and accounts for about 27% of total emissions, and '
                   '33% in the United States.  Of the total greenhouse gas '
                   'emissions from transport, over 85% ar'},
    {   'answer': 'Japan',
        'context': '\n'
                   '=== Japan ===\n'
                   'Although Japan does not tax carbon emissions directly, '
                   'since 2012 the country has levied a "Tax for Climate '
                   'Change Mitigation" on petro'}]


In [65]:
# Correct: 1978
prediction = ask("when were CFCs banned in the US?")
print_answers(prediction, details="minimum")


Query: when were CFCs banned in the US?
Answers:
[   {   'answer': '1978',
        'context': '\n'
                   '==== Regulation and DuPont ====\n'
                   'In 1978 the United States banned the use of CFCs such as '
                   'Freon in aerosol cans, the beginning of a long series of '
                   'reg'},
    {   'answer': '1978',
        'context': 'tates, Canada and Norway banned the use of CFCs in aerosol '
                   'spray cans in 1978. Early estimates were that, if CFC '
                   'production continued at 1977 levels, '},
    {   'answer': '1978',
        'context': 'ire in 1979. The United States banned the use of CFCs in '
                   'aerosol cans in 1978. The European Community rejected '
                   'proposals to ban CFCs in aerosol sprays'}]


In [81]:
# Correct: Montreal protocol
prediction = ask("what treaty fixed the ozone layer?")
print_answers(prediction, details="minimum")


Query: what treaty fixed the ozone layer?
Answers:
[   {   'answer': 'Montreal Protocol',
        'context': 'y for the Preservation of the Ozone Layer, or "World Ozone '
                   'Day". The designation commerates the signing of the '
                   'Montreal Protocol on that date in 1987.'},
    {   'answer': 'Montreal Protocol',
        'context': 'was the Chief U.S. Negotiator at the meetings that '
                   'resulted in the Montreal Protocol.)\n'
                   'Chasek, P. S.; Downie, David L.; Brown, J. W. (2013). '
                   'Global En'},
    {   'answer': 'Montreal Protocol',
        'context': 'was the Chief U.S. Negotiator at the meetings that '
                   'resulted in the Montreal Protocol.)\n'
                   'Chasek, Pamela S., David L. Downie, and Janet Welsh Brown '
                   '(2013'}]


In [64]:
# 2nd guess is correct: 194
prediction = ask("how many countries signed the paris agreement?")
print_answers(prediction, details="minimum")


Query: how many countries signed the paris agreement?
Answers:
[   {   'answer': '82',
        'context': 'es, Samoa, St. Lucia and Switzerland. At the end of the '
                   'signature period, 82 countries and the European Community '
                   'had signed. Ratification (which is r'},
    {   'answer': '194',
        'context': 't developing countries must be financially supported. As '
                   'of October 2021, 194 states and the European Union have '
                   'signed the treaty and 191 states and '},
    {   'answer': '175',
        'context': 'ificant climate accord in the history of the climate '
                   'movement. On Earth Day 2016, world leaders from 175 '
                   'nations broke a record by doing exactly that.'}]


In [69]:
# Correct, kind of: an international treaty on climate change
prediction = ask("What is the paris agreement?")
print_answers(prediction, details="minimum")


Query: What is the paris agreement?
Answers:
[   {   'answer': 'climate change mitigation responsibilities',
        'context': 'such as CO2 emissions reductions of 55% by 2030 by the EU, '
                   'climate change mitigation responsibilities of the Paris '
                   'Agreement and EU air quality rules.'},
    {   'answer': 'to reduce emission',
        'context': ' of climate change, not only did Nigeria sign the Paris '
                   'agreement to reduce emission, in its national climate '
                   'pledge, the Nigerian government has prom'},
    {   'answer': 'climate change goals – such as the Paris Agreement on '
                  'Climate Change',
        'context': 'States. This means that in order to meet climate change '
                   'goals – such as the Paris Agreement on Climate Change – '
                   'and reduce greenhouse gas emissions, e'}]


In [73]:
# Correct, but would prefer a more detailed answer: human activity
prediction = ask("What causes global warming?")
print_answers(prediction, details="minimum")


Query: What causes global warming?
Answers:
[   {   'answer': 'human activity',
        'context': 'entific consensus that global warming is happening and is '
                   'caused by human activity. Disputes over the key scientific '
                   'facts of global warming are more '},
    {   'answer': 'human activity',
        'context': 'the view that the current warming trend exists and is '
                   'ongoing, that human activity is the cause, and that it is '
                   'without precedent in at least 2000 yea'},
    {   'answer': 'human activities',
        'context': '\n'
                   '== Background ==\n'
                   'The view that human activities are likely responsible for '
                   'most of the observed increase in global mean temperature '
                   '("global warming"'}]


In [89]:
# 2nd guess is correct: nuclear power
prediction = ask("What energy source has the lowest carbon emissions?")
print_answers(prediction, details="minimum")


Query: What energy source has the lowest carbon emissions?
Answers:
[   {   'answer': 'Hydropower',
        'context': 'ding security during drought for drinking water supply and '
                   'irrigation.Hydropower ranks among the energy sources with '
                   'the lowest levels of greenhouse g'},
    {   'answer': 'nuclear power',
        'context': 'nt from energy consumption can be reduced through the '
                   'development of nuclear power (a zero carbon emissions '
                   'energy source) and alternative energy proj'},
    {   'answer': 'wind',
        'context': 'and reduce reliance on fossil fuels. One study claimed '
                   'that, as of 2009, wind had the "lowest relative greenhouse '
                   'gas emissions, the least water consu'}]


In [90]:
# Correct? Doesn't do well with definitions or long-form answers
prediction = ask("What is the great pacific garbage patch?")
print_answers(prediction, details="minimum")


Query: What is the great pacific garbage patch?
Answers:
[   {   'answer': 'fishing related plastics',
        'context': 'n estimated 46% of the Great Pacific garbage patch '
                   'consists of fishing related plastics. Fishing nets account '
                   'for about 1% of the total mass of all ma'},
    {   'answer': 'Donora',
        'context': '. Extreme smog events were experienced by the cities of '
                   'Los Angeles and Donora, Pennsylvania, in the late 1940s, '
                   'serving as another public reminder.Ai'},
    {   'answer': 'a huge concentration of plastics, chemical sludge and other '
                  'debris',
        'context': 'oblem is the Great Pacific Garbage Patch, a huge '
                   'concentration of plastics, chemical sludge and other '
                   'debris which has been collected into a large are'}]


In [91]:
# Incorrect - possibly due to limited corpus
prediction = ask("Why is cryptocurrency bad for the environment?")
print_answers(prediction, details="minimum")


Query: Why is cryptocurrency bad for the environment?
Answers:
[   {   'answer': "it's an inconvenient truth",
        'context': 'so hard for people to grasp." To which Gore replied, '
                   '"Because it\'s an inconvenient truth, ya know." "[...] In '
                   "the back of my head, I go, that's the ti"},
    {   'answer': 'stochastic variation',
        'context': 'survival of a small population. Some detrimental effects '
                   'include stochastic variation in the environment, (year to '
                   'year variation in rainfall, tempera'},
    {   'answer': 'in a world where political choices are not made '
                  'democratically at a global level, but by a small number of '
                  'rich countries and corporations, the poor and the '
                  'environment are never going to be a priority',
        'context': ' in a world where political choices are not made '
                   'democratically at a global level

In [92]:
# Correct: glacial melt, melt of the ice sheets in Greenland and Antarctica, and thermal expansion
prediction = ask("Why does climate change make the ocean level rise?")
print_answers(prediction, details="minimum")


Query: Why does climate change make the ocean level rise?
Answers:
[   {   'answer': 'glacial melt, melt of the ice sheets in Greenland and '
                  'Antarctica, and thermal expansion',
        'context': 'l is rising as a consequence of glacial melt, melt of the '
                   'ice sheets in Greenland and Antarctica, and thermal '
                   'expansion. Between 1993 and 2020, the ri'},
    {   'answer': 'heats (and therefore expands) the ocean',
        'context': 'his acceleration is due mostly to climate change, which '
                   'heats (and therefore expands) the ocean and which melts '
                   'the land-based ice sheets and glaciers'},
    {   'answer': 'melting of the West Antarctic ice-sheet alone',
        'context': 'ld cause a sea level rise of around 5 m (15 ft) from '
                   'melting of the West Antarctic ice-sheet alone.In 2019, a '
                   'study projected that in low emission sce'}]
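
The comments on the examples above record by hand whether the expected answer came back and at which position. A small helper could make that check repeatable. This is a minimal sketch: `answer_rank` is a hypothetical name, it works on plain answer strings rather than on Haystack's prediction object, and the substring match is deliberately crude (for instance, "194" would also match "1940").

```python
def answer_rank(expected, answers):
    """Return the 1-based rank of the first answer containing `expected`, else None."""
    needle = expected.lower()
    for rank, answer in enumerate(answers, start=1):
        if needle in answer.lower():
            return rank
    return None


# Answer strings copied from two of the outputs above
paris_rank = answer_rank("194", ["82", "194", "175"])
qatar_rank = answer_rank("Qatar", ["U.S.", "Denmark", "Japan ===\nAlthough Japan"])
```

For the Paris Agreement question the expected answer sits at rank 2, while for the per capita emissions question it is missing from the top three, so the helper returns `None`.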
