## Notebook for extracting OECD corpus texts

This notebook generates a single `data.json` file that gathers all the relevant texts from the scraped corpus of OECD documents (`studies_on_water_scraped.json`). Downstream analyses that use Named Entity Recognition (NER) and Semantic Role Labelling (SRL) take this `data.json` file as input. The input for this notebook is **not** the raw PDF files of the OECD corpus but the scraped, structured JSON data generated by [this notebook](https://github.com/disaster-capitalism/scrape-corpus/blob/main/scrape_studies_on_water_pdf.ipynb).


### 1. Import JSON structured data

In [30]:
import json
import os
from pathlib import Path

# 'data-files/' is expected in the parent of the current working directory
path = Path(os.getcwd())
data_dir = os.path.join(path.parents[0], "data-files")
with open(os.path.join(data_dir, "studies_on_water_scraped.json"), encoding="utf-8") as f:
    documents = json.load(f)
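
Judging from how the later cells iterate over it, `documents` appears to be a list with one dictionary per report, each mapping a section title to a list of text lines. The sketch below inspects that shape on a toy example. The titles and text in `sample_documents` are invented for illustration and do not come from the corpus, and `describe` is a hypothetical helper, not part of the notebook.

```python
# Toy stand-in for the loaded JSON; titles and lines are invented.
sample_documents = [
    {
        "Foreword": ["Water is a shared re-", "source."],
        "Chapter 1": ["First line", "second line"],
    },
    {"Executive summary": ["Key findings follow."]},
]


def describe(docs):
    """Return (document index, section title, number of lines) triples."""
    return [
        (index, title, len(lines))
        for index, doc in enumerate(docs)
        for title, lines in doc.items()
    ]


for row in describe(sample_documents):
    print(row)
```

Running `describe(documents)` on the real data gives a quick overview of which section titles occur before any text is extracted.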

### 2. Generate documents

This step extracts the sections from each document and groups them into five categories: 1) the forewords, 2) the acknowledgements, 3) the (executive) summaries, 4) the main body text, and 5) the annexes. Each call to `generate_docs` handles one category and writes one text file per document. The directory `unprocessed/` inside `data-files/` holds the resulting files, one subdirectory per category; the code creates these directories if they do not exist yet. The text files are named according to the ID of the document, which in the code is its position in the JSON input file (counting from 0).

Output directory structure under '/path/to/unprocessed/':

foreword/ <br/>
--- 1.txt <br/>
--- 2.txt <br/>
--- 3.txt <br/>
acknowledgements/ <br/>
--- 1.txt <br/>
--- 2.txt <br/>
--- 3.txt <br/>
summary/ <br/>
--- 1.txt <br/>
--- 2.txt <br/>
--- 3.txt <br/>
body/ <br/>
--- 1.txt <br/>
--- 2.txt <br/>
--- 3.txt <br/>
annexes/ <br/>
--- 1.txt <br/>
--- 2.txt <br/>
--- 3.txt <br/>
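
The cell below concatenates the scraped lines of each selected section into one string, adding a space after a line unless that line already ends with a space or a hyphen. A minimal sketch of that joining rule on its own follows; `join_lines` is a hypothetical helper name used only for this illustration, and the sample lines are invented.

```python
def join_lines(lines):
    """Concatenate lines, adding a space unless a line ends in ' ' or '-'."""
    text = ""
    for line in lines:
        if line.endswith((" ", "-")):
            # The line already carries its own separator (or a hyphen)
            text += line
        else:
            text += line + " "
    return text


print(repr(join_lines(["Water gover-", "nance matters", "in cities."])))
```

Note that this rule keeps the hyphen itself, so a word split across two lines comes out as `gover-nance` rather than `governance`.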

In [31]:
ignored_sections = ["abbreviations","acronyms","glossary","table of contents",
                    "acronyms and abbreviations","abbreviations and acronyms",
                    "bibliography","further reading","acronyms and local terms",
                    "acknowledgements and disclaimers","tables","figures","boxes",
                    "references and websites"]

# Assumptions or choices made:
# ----------------------------
# "advisors and special experts" and "Blank Page" sections
# are actually an extension of the "Acknowledgements" section
# in those documents.
#
# We merge the Preface and Foreword sections together into one section because they contain similar content
    
def generate_docs(section):
    """
    Given the OECD corpus, for each file in the corpus, this function generates
    a document that contains only the text from the specified section in the file.

    :param str section: the section of the document to extract. Options to choose from:
    - foreword (preface and foreword sections),
    - acknowledgements,
    - body (all the chapters and main text in the document; in practice,
      every section that is not listed in ignored_sections),
    - summary (executive summary),
    - annexes (all the text from the appendices)

    """
    wanted = section.lower()

    # Create 'unprocessed/<section>/' if it does not exist yet
    section_dir = os.path.join(data_dir, "unprocessed", section)
    os.makedirs(section_dir, exist_ok=True)

    for index, doc in enumerate(documents):
        # Only the first 55 entries of the input are processed
        if index >= 55:
            break

        section_text = ""
        for key in doc:
            title = key.lower()
            if title in ignored_sections:
                continue

            # Decide whether this section title belongs to the requested category
            if wanted == "foreword":
                selected = title in ("foreword", "preface")
            elif wanted == "acknowledgements":
                selected = (title.startswith("acknowledg")
                            or title in ("advisors and special experts", "blank page"))
            elif wanted == "annexes":
                selected = title.startswith("annex")
            elif wanted == "summary":
                selected = title.startswith("executive summary")
            else:
                # Any other value (e.g. "body") keeps every non-ignored section
                selected = True
            if not selected:
                continue

            for line in doc[key]:
                # Add a separating space unless the line already ends
                # with a space or a hyphen
                if line.endswith((" ", "-")):
                    section_text += line
                else:
                    section_text += line + " "

        # Write the file once per document, named after the document's index
        with open(os.path.join(section_dir, f"{index}.txt"), "w", encoding="utf-8") as f:
            f.write(section_text)
    
# generate_docs("foreword")
# generate_docs("summary")
# generate_docs("acknowledgements")
generate_docs("body")

### 3. Gather the plain text data for each document into a JSON file

The JSON file which is generated at the end of this notebook will be the raw input for the NER notebook and other downstream analyses in the study.

In [32]:
# collect data
ack_dir = os.path.join(data_dir, "unprocessed", "acknowledgements")
body_dir = os.path.join(data_dir, "unprocessed", "body")
summary_dir = os.path.join(data_dir, "unprocessed", "summary")
foreword_dir = os.path.join(data_dir, "unprocessed", "foreword")

def gather_data():
    """Collect the body text files into a single 'data.json' file in data_dir."""
    result = {}
#     for filename in os.listdir(foreword_dir):
#         f = os.path.join(foreword_dir, filename)

#         if (os.path.isfile(f) and (".DS_Store" not in f)):
#             with open(f, 'r', encoding="utf8") as fp:
#                 currentfiledata = fp.read()
#             result[f.replace(foreword_dir, "").replace(".txt", "")] = currentfiledata.lower()
            
#     for filename in os.listdir(ack_dir):
#         f = os.path.join(ack_dir, filename)

#         if (os.path.isfile(f) and (".DS_Store" not in f)):
#             with open(f, 'r', encoding="utf8") as fp:
#                 currentfiledata = fp.read()
#             result[f.replace(ack_dir, "").replace(".txt", "")] = result[f.replace("texts/acknowledgements/", "").replace(".txt", "")] + " " + currentfiledata.lower()
            
#     for filename in os.listdir(summary_dir):
#         f = os.path.join(summary_dir, filename)

#         if (os.path.isfile(f) and (".DS_Store" not in f)):
#             with open(f, 'r', encoding="utf8") as fp:
#                 currentfiledata = fp.read()
#             result[f.replace(summary_dir, "").replace(".txt", "")] = result[f.replace("texts/summary/", "").replace(".txt", "")] + " " + currentfiledata.lower()

    for filename in os.listdir(body_dir):
        f = os.path.join(body_dir, filename)

        # Skip subdirectories and macOS '.DS_Store' files
        if (os.path.isfile(f) and (".DS_Store" not in f)):
            with open(f, 'r', encoding="utf8") as fp:
                currentfiledata = fp.read()

            # Key each text by its file name without the '.txt' extension;
            # using the bare file name avoids depending on the path separator
            result[filename.replace(".txt", "")] = currentfiledata

    with open(os.path.join(data_dir, "data.json"), "w") as outfile:
        json.dump(result, outfile)

# Call the function above to generate a 'data.json' file containing the plain text documents.
# The keys of the JSON file are the document IDs, taken from the text file names.
gather_data()
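
As a hedged illustration of how a downstream notebook might consume the result: `data.json` is a single JSON object whose keys are the text file names without `.txt` and whose values are the body texts. The sketch below writes a toy file to a temporary directory rather than touching `data-files/`; the sample texts are invented, and the names `sample`, `texts`, and `ordered_ids` are assumptions made for this example.

```python
import json
import os
import tempfile

# Invented sample with the same shape as data.json: {document ID: body text}
sample = {"1": "Second report body text. ", "0": "First report body text. "}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "data.json")
    with open(json_path, "w") as outfile:
        json.dump(sample, outfile)

    # Read the file back the way a downstream notebook could
    with open(json_path, encoding="utf-8") as infile:
        texts = json.load(infile)

# Keys are strings, so sort them numerically to restore document order
ordered_ids = sorted(texts, key=int)
for doc_id in ordered_ids:
    print(doc_id, len(texts[doc_id]))
```

Because `os.listdir` returns entries in arbitrary order, sorting the keys numerically is one way to get a stable document order when iterating over the texts.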