# Scrape Text from Corpus "Studies on Water"

In this notebook, we extract the text from the OECD report series "Studies on Water" ([link](https://www.oecd-ilibrary.org/environment/oecd-studies-on-water_22245081)). The 55 reports were published between 2009 and 2022 and form the corpus for the project "Markets for Resilience or Disaster Capitalism". The PDF versions of the reports are not openly available and must be purchased from the OECD library, so they are not included in this repository.

This notebook contains the code to extract the text from the PDF versions, which must be placed in the directory `pdf/`. For text extraction, we use the package `pymupdf` (imported as `fitz`). The extracted text is stored as one JSON file per report in the directory `studies_on_water_scraped/`, plus one large JSON file containing all reports, `studies_on_water_scaped.json`.

Each individual JSON file contains an object with one key per top-level chapter (including preface, table of contents, etc.) plus a key `meta` for the metadata of the report (e.g., author, date). Each top-level chapter maps to an array in which every element is a line from the chapter (subheaders appear as separate lines). The JSON file containing all reports is an array in which each element is a report object as just described.

**NOTE**: The extracted text may still be quite messy and will probably need further cleaning in many cases.
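To make the output layout concrete, here is a minimal sketch of how one report object could be read back and split into metadata and chapters. The helper name `split_report` and the sample data are hypothetical and only mimic the structure described above; real files will contain the chapter titles found in each PDF.

```python
import json


def split_report(report):
    """Separate the metadata from the chapter arrays of one report object."""
    meta = report.get("meta", {})
    chapters = {title: lines for title, lines in report.items() if title != "meta"}
    return meta, chapters


# Hypothetical miniature report, shaped like one per-report JSON file
sample_report = {
    "meta": {"title": "Example report", "author": "OECD"},
    "Foreword": ["Foreword", "An example line."],
    "Chapter 1": ["Chapter 1", "First line", "Second line"],
}

# Round-trip through JSON to mimic writing and reading a file
meta, chapters = split_report(json.loads(json.dumps(sample_report)))

for title, lines in chapters.items():
    print(f"{title}: {len(lines)} lines")
```

With a real file, `json.load` on the opened file would take the place of the round-trip shown here.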

In [1]:
""" Extract text from pdf files of corpus 'Studies on Water' """

import fitz
import os
import json


In [2]:
PDF_DIR = "pdf/"
OUT_DIR = "studies_on_water_scraped/"


In [3]:
def check_out_dir(data_dir):
    """ Check if directory for saving scraped text exists, make directory if not """
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
        print(f"Created saving directory at {data_dir}")


In [4]:
check_out_dir(OUT_DIR)


In [5]:
def extract_chapters_from_pdf(filename):
    """ Extract the top-level chapters of a document using its table of contents """
    with fitz.open(filename) as doc:
        chapters = {"meta": doc.metadata}

        # Only take chapters on the highest level; TOC entries are [level, title, page]
        toc = [ch for ch in doc.get_toc() if ch[0] == 1]

        for i, ch in enumerate(toc):
            start = ch[2] - 1  # TOC pages are 1-based, page indices are 0-based

            if i == len(toc) - 1:  # If last chapter
                # doc.pages() excludes `stop` (like range), so use page_count
                # to include the final page of the document
                stop = doc.page_count
            else:
                stop = toc[i + 1][2] - 1  # Starting page of next chapter (excluded)
            chapter = doc.pages(start, stop, 1)

            chapter_text = []

            for page in chapter:
                lines = page.get_text().split("\n")
                # Skip the first three lines (page header)
                for line in lines[3:]:
                    chapter_text.append(line)

            chapters[ch[1]] = chapter_text
            print(f"Extracted chapter {ch[1]}")

    print(f"... from {filename}")

    return chapters
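
The page-range logic can be checked without any PDF at hand. The following is a minimal sketch, assuming TOC entries of the form `[level, title, page]` with 1-based page numbers and assuming that the last chapter runs to the end of the document. The function name `chapter_page_ranges` and the sample TOC are hypothetical.

```python
def chapter_page_ranges(toc, page_count):
    """Map each level-1 title to a (start, stop) range of 0-based page indices.

    `stop` is exclusive, mirroring the semantics of range().
    """
    top_level = [entry for entry in toc if entry[0] == 1]
    ranges = {}
    for i, (_, title, page) in enumerate(top_level):
        start = page - 1
        if i + 1 < len(top_level):
            stop = top_level[i + 1][2] - 1  # next chapter's first page, excluded
        else:
            stop = page_count  # last chapter runs to the end of the document
        ranges[title] = (start, stop)
    return ranges


# Hypothetical table of contents for a 20-page document
sample_toc = [
    [1, "Foreword", 3],
    [1, "Chapter 1", 5],
    [2, "Section 1.1", 6],  # lower-level entries are ignored
    [1, "Chapter 2", 12],
]

ranges = chapter_page_ranges(sample_toc, page_count=20)
for title, (start, stop) in ranges.items():
    print(f"{title}: page indices {start} to {stop - 1}")
```

Two edge cases follow from this scheme: a chapter that starts on the same page as the next one gets an empty range, and two chapters with identical titles end up under the same key.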


In [6]:
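docs = []

# Read every PDF in PDF_DIR (defined above) and save one JSON file per report
with os.scandir(PDF_DIR) as entries:  # `entries` avoids shadowing the built-in `iter`
    for entry in entries:
        if not entry.name.lower().endswith(".pdf"):
            continue  # Skip non-PDF files such as hidden system files
        new_doc = extract_chapters_from_pdf(entry.path)
        docs.append(new_doc)
        # splitext keeps dots inside a report title, unlike split(".")[0]
        outfile_name = os.path.splitext(entry.name)[0]
        with open(os.path.join(OUT_DIR, f"{outfile_name}.json"), "w") as outfile:
            json.dump(new_doc, outfile)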


Extracted chapter Foreword
Extracted chapter Acknowledgments
Extracted chapter Table of contents
Extracted chapter Acronyms
Extracted chapter Executive Summary
Extracted chapter Chapter 1 Why is financing water resources management an issue?
Extracted chapter Chapter 2 Four principles for WRM financing
Extracted chapter Chapter 3 The value added of economic instruments
Extracted chapter Chapter 4 Issues related to the implementation of the four principles
Extracted chapter Annex A Cost-recovery strategies in selected OECD countries and BRICS
Extracted chapter Annex B An OECD survey on investment needs for water supply and sanitation
... from pdfs/A Frameworok for Financing Water Resources Management.pdf
Extracted chapter Foreword
Extracted chapter Acknowledgements
Extracted chapter Executive summary
Extracted chapter Chapter 1.  Managing floods today – Setting the scene
Extracted chapter Chapter 2.  Effectiveness of flood governance
Extracted chapter Chapter 3.  Efficiency of flood gov

In [7]:
with open("studies_on_water_scaped.json", "w") as outfile:
    json.dump(docs, outfile)
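
As noted at the top of this notebook, the extracted lines usually need further cleaning. The following is only a rough sketch of a first pass, not part of the extraction pipeline: the helper `clean_lines` is hypothetical, and it assumes that lines consisting solely of digits are page numbers, which can wrongly drop numeric table cells.

```python
import re


def clean_lines(lines):
    """Collapse whitespace and drop empty lines and digit-only lines."""
    cleaned = []
    for line in lines:
        text = " ".join(line.split())  # collapse runs of whitespace
        if not text or re.fullmatch(r"\d+", text):
            continue  # skip blanks and presumed page numbers
        cleaned.append(text)
    return cleaned


# Hypothetical raw lines, as they might appear in one chapter array
sample_lines = ["  Chapter 1.  Managing floods today ", "", "12", "Floods are   costly."]
cleaned = clean_lines(sample_lines)
print(cleaned)
```

Applied to a report object, this would be called on each chapter array while leaving the `meta` entry untouched.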
