# Scrape Text from Corpus "Studies on Water"

In this notebook, we extract the text from the OECD report series "Studies on Water" ([link](https://www.oecd-ilibrary.org/environment/oecd-studies-on-water_22245081)). The 55 reports were published between 2009 and 2022 and form the corpus for the project "Markets for Resilience or Disaster Capitalism". The PDF versions of the reports are not openly available and must be purchased from the OECD library, so they are not included in this repository.

This notebook contains the code to extract the text from the PDF files, which must be placed in the directory `pdfs/`. For text extraction, we use the package `pymupdf` (imported as `fitz`).

The extracted text is stored as one JSON file per report in the directory `studies_on_water_scraped/`, plus one combined JSON file, `studies_on_water_scaped.json`, containing all reports. Each per-report file contains an object with one key per top-level chapter (including preface, table of contents, etc.) plus a key `meta` for the metadata of the report (e.g., author, date). The value for each top-level chapter is an array in which each element is one line of the chapter (subheaders appear as separate lines). The combined file is an array in which each element is a report object as just described.

**NOTE**: The extracted text may still be quite messy and will probably need further cleaning in many cases.
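
To make the output layout concrete, the sketch below builds a tiny stand-in for one report object and counts the lines per chapter. The chapter names, metadata values, and text lines are invented for illustration, and `chapter_lengths` is a hypothetical helper; a real file from `studies_on_water_scraped/` would be read with `json.load` instead.

```python
import json

# Hypothetical stand-in for one scraped report (all values are made up).
sample_report = {
    "meta": {"title": "Example Report", "author": "OECD", "id": 0},
    "Foreword": ["Foreword", "Water is essential ..."],
    "Chapter 1. Example": ["Chapter 1", "First line", "Second line"],
}

# Round-trip through JSON to mimic reading a file written by this notebook.
report = json.loads(json.dumps(sample_report))


def chapter_lengths(report):
    """Map each chapter title to its number of extracted lines (skips 'meta')."""
    return {name: len(lines) for name, lines in report.items() if name != "meta"}


lengths = chapter_lengths(report)
print(lengths)
```

Every key other than `meta` is treated as a chapter here, which matches the layout described above.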

In [1]:
""" Extract text from pdf files of corpus 'Studies on Water' """

import fitz  # PyMuPDF; install with "pip install pymupdf"
import os
import json



In [None]:
PDF_DIR = "pdfs/"
OUT_DIR = "studies_on_water_scraped/"


In [None]:
def check_out_dir(data_dir):
    """ Check if directory for saving scraped text exists, make directory if not """
    if not os.path.isdir(data_dir):
        os.makedirs(data_dir)
        print(f"Created saving directory at {data_dir}")


In [None]:
check_out_dir(OUT_DIR)


In [None]:
def extract_chapters_from_pdf(filename):
    """ Extracts the chapters of document by looking at the table of contents """
    # Use the file name without directory and extension as the title
    title = os.path.splitext(os.path.basename(filename))[0]

    with fitz.open(filename) as doc:
        chapters = {"meta": doc.metadata}
        chapters["meta"]["title"] = title

        # Only take chapters on highest level
        toc = [ch for ch in doc.get_toc() if ch[0] == 1]

        if not toc:  # No table of contents embedded in the PDF
            print(f"Could not extract toc from {filename}")
            return None

        for i, ch in enumerate(toc):
            start = ch[2]-1  # Look at starting page of chapter

            if i == len(toc) - 1:  # If last chapter
                # doc.pages() excludes the stop index (like range()),
                # so use page_count to include the final page
                stop = doc.page_count
            else:
                stop = toc[i+1][2]-1  # Look at starting page of next chapter
            chapter = doc.pages(start, stop, 1)

            chapter_text = []

            for page in chapter:
                lines = page.get_text().split("\n")
                # Skip the first three lines (page header)
                for line in lines[3:]:
                    chapter_text.append(line)

            chapters[ch[1]] = chapter_text
            print(f"Extracted chapter {ch[1]}")

    print(f"... from {filename}")

    return chapters
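
The page arithmetic in `extract_chapters_from_pdf` is easy to get wrong by one, so here is a minimal sketch of it in isolation, without PyMuPDF. It assumes TOC entries in the `[level, title, page]` shape that `doc.get_toc()` returns by default (1-based pages) and treats the stop index as exclusive, as `range()` does, so the last chapter runs up to `page_count`. The helper name `chapter_page_ranges` and the sample TOC are hypothetical.

```python
def chapter_page_ranges(toc, page_count):
    """Return (title, start, stop) per top-level entry; stop is exclusive."""
    top_level = [entry for entry in toc if entry[0] == 1]
    ranges = []
    for i, (_, title, page) in enumerate(top_level):
        start = page - 1  # Convert the 1-based TOC page to a 0-based index
        if i == len(top_level) - 1:
            stop = page_count  # Last chapter runs to the end of the document
        else:
            stop = top_level[i + 1][2] - 1  # 0-based start of the next chapter
        ranges.append((title, start, stop))
    return ranges


# Made-up TOC: the level-2 entry is ignored, as in the extraction function.
sample_toc = [[1, "Foreword", 3], [2, "Note", 4], [1, "Chapter 1", 5], [1, "Annex", 9]]
ranges = chapter_page_ranges(sample_toc, page_count=12)
print(ranges)
```

Checking the ranges on a small made-up TOC like this may be quicker than inspecting the extracted text of a full report.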


In [None]:
def extract_chapters_from_pdf_custom_toc(filename, custom_toc):
    """ Extracts the chapters of document from a custom table of contents """
    # Use the file name without directory and extension as the title
    title = os.path.splitext(os.path.basename(filename))[0]

    with fitz.open(filename) as doc:
        chapters = {"meta": doc.metadata}
        chapters["meta"]["title"] = title

        # Copy the custom toc; every entry is treated as a top-level chapter
        toc = list(custom_toc)

        for i, ch in enumerate(toc):
            start = ch[1]-1  # Look at starting page of chapter

            if i == len(toc) - 1:  # If last chapter
                # doc.pages() excludes the stop index (like range()),
                # so use page_count to include the final page
                stop = doc.page_count
            else:
                stop = toc[i+1][1]-1  # Look at starting page of next chapter
            chapter = doc.pages(start, stop, 1)

            chapter_text = []

            for page in chapter:
                lines = page.get_text().split("\n")
                # Skip the first three lines (page header)
                for line in lines[3:]:
                    chapter_text.append(line)

            chapters[ch[0]] = chapter_text
            print(f"Extracted chapter {ch[0]}")

    print(f"... from {filename} with custom toc")

    return chapters

In [None]:
docs = []

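with os.scandir(PDF_DIR) as entries:  # Avoid shadowing the built-in iter()
    for i, entry in enumerate(entries):
        # Skip anything that is not a PDF file (e.g., hidden system files)
        if not entry.name.lower().endswith(".pdf"):
            continue
        new_doc = extract_chapters_from_pdf(f"{PDF_DIR}{entry.name}")
        if new_doc is not None:
            outfile_name = os.path.splitext(entry.name)[0]
            new_doc["meta"]["id"] = i
            docs.append(new_doc)
            with open(f"{OUT_DIR}{outfile_name}.json", "w") as outfile:
                json.dump(new_doc, outfile)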


In [None]:
""" Define custom toc for report 'Managing Water for All.pdf' because it has no toc available """

custom_toc = [
    ["Foreword", 5],
    ["Acknowledgements", 7],
    ["Background Materials", 8],
    ["Table of Contents", 9],
    ["List of Abbreviations and Acronyms", 13],
    ["Executive Summary", 15],
    ["Introduction", 23],
    ["Chapter 1. Setting the Stage", 25],
    ["Chapter 2. Financing Water and Sanitation Services: Key Challenges and the Way Forward", 49],
    ["Chapter 3. Water Services: The Central Role of Tariffs", 75],
    ["Chapter 4. Beyond Money: The Roles of Governments and Private Actors in Water Services", 107],
    ["Chapter 5. Managing Water Resources in the Agricultural Sector", 129]
]
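
Because a hand-written table of contents is easy to mistype, a small sanity check before extraction may help. The sketch below assumes the `[title, page]` layout used above, with 1-based page numbers; `validate_custom_toc` and the sample entries are hypothetical and not part of the original notebook.

```python
def validate_custom_toc(toc, page_count):
    """Return a list of problems found in a [title, page] table of contents."""
    problems = []
    previous_page = 0
    seen_titles = set()
    for title, page in toc:
        if not 1 <= page <= page_count:
            problems.append(f"{title!r}: page {page} outside 1..{page_count}")
        if page < previous_page:
            problems.append(f"{title!r}: page {page} precedes previous chapter")
        if title in seen_titles:
            # Chapter titles become dictionary keys, so duplicates overwrite
            problems.append(f"{title!r}: duplicate title would overwrite a chapter")
        seen_titles.add(title)
        previous_page = page
    return problems


good = [["Foreword", 5], ["Executive Summary", 15], ["Introduction", 23]]
bad = [["Foreword", 5], ["Foreword", 3], ["Annex", 400]]
print(validate_custom_toc(good, page_count=150))
print(validate_custom_toc(bad, page_count=150))
```

An empty list means no problems were found under these particular checks; it does not guarantee that the page numbers match the actual chapter starts.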


In [None]:
custom_doc = extract_chapters_from_pdf_custom_toc(f"{PDF_DIR}Managing Water for All.pdf", custom_toc)
custom_doc["meta"]["id"] = 20
docs.append(custom_doc)  # Append the extracted report, not the toc definition

with open(f"{OUT_DIR}Managing Water for All.json", "w") as outfile:
    json.dump(custom_doc, outfile)

In [None]:
with open("studies_on_water_scaped.json", "w") as outfile:
    json.dump(docs, outfile)
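
The note at the top warns that the extracted text is still messy. As one possible starting point, the sketch below collapses whitespace, drops empty lines and bare page numbers, and rejoins words hyphenated across line breaks. The function `clean_lines` and these particular rules are assumptions for illustration; the right rules will likely differ between reports.

```python
import re


def clean_lines(lines):
    """Light-touch cleanup of a chapter's list of extracted lines."""
    cleaned = []
    for raw in lines:
        # Collapse runs of whitespace and trim both ends
        line = re.sub(r"\s+", " ", raw).strip()
        # Drop empty lines and lines that are only digits (likely page numbers)
        if not line or line.isdigit():
            continue
        if cleaned and cleaned[-1].endswith("-"):
            # Rejoin a word that was hyphenated across a line break
            cleaned[-1] = cleaned[-1][:-1] + line
        else:
            cleaned.append(line)
    return cleaned


sample = ["  Water  gover-", "nance matters.", "", "42", "Next  line "]
print(clean_lines(sample))
```

Note that the hyphen rule also merges genuine compounds that happen to end a line, so it trades some accuracy for simplicity.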
