# Climate Dictionary Creation


This notebook builds a dictionary of climate terms, based on the IPCC Sixth Assessment Report Glossary, that can be used to filter Hansard and the Congressional Record into a climate corpus.


## Setup


In [1]:
import fitz
import re
import pandas as pd

data_path = 'data/'
dist_path = 'dist/'


## Extracting climate terms and their definitions from the IPCC Sixth Assessment Report Glossary


### Extracting the text from the IPCC Sixth Assessment Report Glossary PDF


In [2]:
glossary_path = data_path + 'IPCC Sixth Assessment Report Glossary.pdf'
doc = fitz.open(glossary_path)

### Glossary cleaning


In [3]:
# Extracting the text of each page
glossary_text = [page.get_text() for page in doc]

# Removing the first page and the final three pages of the glossary
glossary_text = glossary_text[1:-3]

# Concatenating the glossary text
glossary_text_string = ' '.join(glossary_text)

# Removing superfluous text
pattern = r"(Approval Session|Glossary|IPCC SR1\.5|Do Not Cite, Quote or Distribute|Total pages: \d+|See [A-Za-z]+\.|1-\d+)"
cleaned_glossary_text = re.sub(pattern, '', glossary_text_string).strip()
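
To see what the cleaning pattern does, it can be applied to a short string. The sketch below reuses the pattern from the cell above; the sample text and the `clean` helper are invented for illustration and are not taken from the glossary PDF. It also shows two side effects that may matter during the later manual cleaning: `1-\d+` matches inside year ranges, and `See [A-Za-z]+\.` only removes single-word cross-references.

```python
import re

# Same pattern as in the cleaning cell
pattern = r"(Approval Session|Glossary|IPCC SR1\.5|Do Not Cite, Quote or Distribute|Total pages: \d+|See [A-Za-z]+\.|1-\d+)"

def clean(text):
    """Remove header, footer and cross-reference fragments, then trim whitespace."""
    return re.sub(pattern, '', text).strip()

# Hypothetical page text, invented for illustration (not taken from the PDF)
sample_page = (
    "Glossary IPCC SR1.5 Do Not Cite, Quote or Distribute 1-12 Total pages: 24\n"
    "Adaptation  The process of adjustment. See Resilience."
)
print(repr(clean(sample_page)))

# Side effects worth checking during manual cleaning:
print(repr(clean("warming over 2011-2020")))  # the page-number alternative also matches inside year ranges
print(repr(clean("See Climate change.")))     # multi-word cross-references are left untouched
```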

### Creating a rough dictionary of climate terms and their definitions


In [5]:
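# Splitting the text into entries: runs of five or more whitespace characters separate glossary entries
chunks = re.split(r'\s{5,}', cleaned_glossary_text)

terms = []
definitions = []

for chunk in chunks:
    # Within an entry, the term precedes the first run of two or more whitespace characters
    split_chunk = re.split(r'\s{2,}', chunk)
    term = split_chunk[0]
    definition = ' '.join(split_chunk[1:])

    terms.append(term)
    definitions.append(definition)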

climate_dictionary = pd.DataFrame({'term': terms, 'definition': definitions})
climate_dictionary = climate_dictionary.drop_duplicates(
    subset='term', keep='first')

climate_dictionary.to_csv(
    dist_path + 'raw_climate_dictionary.csv', index=False)
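
The whitespace-based splitting above can be tried on a small string to see where it succeeds and where it leaves gaps. This is a minimal sketch that uses the same two regular expressions; the `split_glossary` helper and the sample text are hypothetical and use plain tuples instead of a DataFrame to stay dependency-free.

```python
import re

def split_glossary(text):
    """Split cleaned glossary text into (term, definition) pairs."""
    entries = []
    # Five or more whitespace characters separate entries
    for chunk in re.split(r'\s{5,}', text):
        # Two or more whitespace characters separate the term from its definition
        parts = re.split(r'\s{2,}', chunk)
        entries.append((parts[0], ' '.join(parts[1:])))
    return entries

# Hypothetical cleaned text, invented for illustration
sample_text = (
    "Adaptation  The process of adjustment to climate."
    + " " * 6
    + "Albedo  The fraction of solar radiation  reflected by a surface."
    + " " * 6
    + "Anomaly"
)
for term, definition in split_glossary(sample_text):
    print(repr(term), repr(definition))
```

An entry with no double space after the term, such as `Anomaly` here, ends up with an empty definition. Rows like this are one reason the raw dictionary needs a manual pass.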

The dictionary is manually cleaned at this point to create the final `cleaned_climate_dictionary.csv` file: https://docs.google.com/spreadsheets/d/1a1rvYR6gQWmUY9fYlmm2lxK9xqiRB0I6LqsUNNZ6eUw/edit#gid=1267656563
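
As a rough illustration of how the cleaned dictionary might then be used for filtering, the sketch below matches terms against a list of speeches. It assumes the cleaned file keeps the `term` and `definition` columns of the raw file; the CSV excerpt, the speeches, and all variable names are hypothetical, and the matching strategy (case-insensitive whole-phrase matching) is only one possible choice.

```python
import csv
import io
import re

# Hypothetical excerpt of cleaned_climate_dictionary.csv (column names assumed)
csv_text = (
    "term,definition\n"
    "Adaptation,The process of adjustment.\n"
    "Carbon sink,A reservoir that absorbs carbon.\n"
)
terms = [row["term"] for row in csv.DictReader(io.StringIO(csv_text))]

# One case-insensitive pattern that matches any term as a whole word or phrase
term_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
    re.IGNORECASE,
)

# Hypothetical speeches standing in for Hansard or Congressional Record text
speeches = [
    "Forests act as a carbon sink for the region.",
    "The committee will meet again on Tuesday.",
]
climate_speeches = [s for s in speeches if term_pattern.search(s)]
print(climate_speeches)
```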
