## Document Scraping + Synthetic Data Generation

By: Glenn Parham, Defense Digital Service

In this notebook, we will:
1. Scrape unclassified DoD issuances from the WHS website.
2. Generate a synthetic Q&A dataset from these documents with GPT-4 Turbo.

In [8]:
!pip install requests -q
!pip install bs4 -q
!pip install PyPDF2 -q
!pip install openai -q
!pip install python-dotenv -q

In [3]:
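# This is the schema of documents_schema.json
# Note: the cells below store the question list under "generated_questions"
# and do not produce a "source_text" field.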
# [
#   {
#     "id": "test",
#     "url": "...",
#     "raw_questions_output": "...",
#     "questions": [
#       {
#         "question": "insert question",
#         "answer": "insert answer",
#         "source_text": "insert snippet source text from the original document"
#       }
#     ]
#   }
# ]

In [85]:
import json
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Define the URL of the website
url = "https://www.esd.whs.mil/Directives/issuances/dodd"
base_url = "https://www.esd.whs.mil"
# Send a GET request to the website
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find all the <a> tags with href attribute
links = soup.find_all("a", href=True)

# Keep links that end in .pdf, excluding any that point to milsuite.mil
pdf_links = [link["href"] for link in links if link["href"].endswith(".pdf") and "milsuite.mil" not in link["href"]]

# Create the /directives and /documents directories if they don't exist
os.makedirs("directives", exist_ok=True)
os.makedirs("documents", exist_ok=True)

# Download and save the PDFs
for pdf_link in pdf_links:
    # Extract the filename from the URL and remove pdf extension
    filename = pdf_link.split("/")[-1].split(".")[0]
    # urljoin resolves relative hrefs and leaves absolute ones unchanged
    full_link = urljoin(base_url, pdf_link)
    print(f"Downloading {full_link}")

    try:
        # Download the PDF file and fail on HTTP errors instead of saving an error page
        pdf_response = requests.get(full_link)
        pdf_response.raise_for_status()

        # Save the PDF as directives/<filename>.pdf, the path process_pdf reads later
        with open(f"directives/{filename}.pdf", "wb") as file:
            file.write(pdf_response.content)

        # Write the initial metadata to documents/<filename>.json
        document = {
            "id": filename,
            "url": full_link,
            "raw_questions_output": "",
            "generated_questions": [],
            "document_type": "pdf",
            "analyzed": False
        }
        with open(f"documents/{filename}.json", "w") as f:
            json.dump(document, f)
    except Exception as e:
        print(e)
        continue
    



Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/iss_process/AO_Focal_Point_Checklist.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/775007p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/100020p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/110004p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/120016p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/120017p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/130421p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/132208p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/132212p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/133227p.pdf
Downloading https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/133805p.pdf
Downloading https://www.esd.w
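
The ID and URL handling in the download loop can also be sketched as a standalone helper. This is a minimal sketch, not the notebook's own code: `resolve_pdf_link` and `BASE_URL` are hypothetical names, and it assumes hrefs may be either site-relative (as in the output above) or absolute. Unlike `split(".")[0]`, `posixpath.splitext` drops only the final extension, so a name such as `a.b.pdf` would yield `a.b` rather than `a`.

```python
import posixpath
from typing import Tuple
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.esd.whs.mil"  # same base URL as the scraping cell


def resolve_pdf_link(href: str, base_url: str = BASE_URL) -> Tuple[str, str]:
    """Return (document_id, absolute_url) for a scraped href.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # urljoin resolves relative hrefs and leaves absolute URLs untouched
    absolute_url = urljoin(base_url, href)
    # Use only the path component so query strings do not leak into the ID
    basename = posixpath.basename(urlparse(absolute_url).path)
    document_id, _extension = posixpath.splitext(basename)
    return document_id, absolute_url


doc_id, full_link = resolve_pdf_link("/Portals/54/Documents/DD/issuances/dodd/100020p.pdf")
print(doc_id, full_link)
```

Keeping this logic in one function makes it easy to check that the ID used for `documents/<id>.json` always matches the name of the saved PDF.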

## Iterate through documents and generate synthetic questions


In [73]:
import PyPDF2

def extract_text_from_pdf(pdf_path):
    # Open the PDF file in read-binary mode
    with open(pdf_path, "rb") as file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Extract text from each page of the PDF
        text = ""
        for page in pdf_reader.pages:
            text += page.extract_text()
        
        # Remove any leading or trailing whitespace
        text = text.strip()
        
        return text


In [74]:
extract_text_from_pdf("directives/551509p.pdf")

'Department of Defense \n \nDIRECTIVE \n  \nNUMBER 5515.9 \n April 19, 2004 \n Certified Current as  of October 31, 2006 \n \nGC, DoD \n \nSUBJECT: Settlement of Tort Claims  References: (a) DoD Directive 5515.9, "Settlem ent of Tort Claims," September 12, 1990  \n  (hereby canceled)  (b) Sections 1346(b), 2671-2680 of  title 28, United States Code \n (c) Section 113 of title 10, United States Code \n  1.  REISSUANCE AND PURPOSE\n \n This Directive: \n 1.1.  Reissues reference (a).  1.2.  Establishes policy for the administrati ve processing and, where appropriate, the \nsettlement of claims filed under reference (b ) against the DoD Com ponents other than the \nMilitary Departments. \n 1.3.  Delegates authority, pursuant to reference (c ), to the Secretary of the Army to process \nand, where appropriate, settle claims under this Directive. \n  2.  APPLICABILITY\n \nThis Directive applies to the Office of the Secretary of Defense (OSD), the Military \nDepartments, the Chairman of the J
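
The extracted text above contains many stray spaces and newlines. One possible cleanup step, sketched here under the assumption that paragraph breaks do not matter for question generation, is to collapse every run of whitespace into a single space. `normalize_whitespace` is a hypothetical helper, not part of the notebook, and it does not repair words that the PDF extraction split in the middle (for example "Settlem ent").

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Sample taken from the start of the extraction output shown above
sample = "Department of Defense \n \nDIRECTIVE \n  \nNUMBER 5515.9 \n April 19, 2004"
print(normalize_whitespace(sample))
# prints: Department of Defense DIRECTIVE NUMBER 5515.9 April 19, 2004
```

Fewer whitespace characters also means slightly fewer tokens sent to the model for each document.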

In [94]:
import json
from openai import OpenAI

def generate_questions(pdf_text:str) -> tuple:
  """Return (parsed JSON completion, raw completion string) for one document."""
  # "**" is a redacted placeholder; supply your own API key
  client = OpenAI(api_key="**")

  response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
      {
        "role": "system",
        "content": "You must respond in JSON format, in an array of questions.  Generate up to 10, and at least 2, questions across the various key topics mentioned in the source text.\n\n{\"questions\":[ {\"question\": \"insert question\", \"answer\":\"insert answer\"},...]}"
      },
      {
        "role": "user",
        "content": "DOCUMENT SOURCE TEXT START \n\n"+pdf_text+"\n\nDOCUMENT SOURCE TEXT END"
      }
    ],
    temperature=1,
    max_tokens=2000,
    top_p=1,
    frequency_penalty=0,
    presence_penalty=0,
    # JSON mode: the model is constrained to emit a single JSON object
    response_format={ "type": "json_object" },
  )
  raw_completion = response.choices[0].message.content

  # Parse the raw string so callers can index into completion["questions"]
  completion = json.loads(raw_completion)
  return completion, raw_completion
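
The system prompt asks for between 2 and 10 question/answer pairs under a `questions` key, but nothing in the notebook checks that the model complied. The following is a minimal sketch of such a check. `validate_questions` is a hypothetical helper, and the rules it enforces (a `questions` list of 2 to 10 items, each with non-empty `question` and `answer` strings) are assumptions derived from the prompt text rather than from any guarantee of the API.

```python
import json


def validate_questions(raw_completion: str, minimum: int = 2, maximum: int = 10) -> list:
    """Parse a raw completion and return its question list, or raise ValueError."""
    try:
        payload = json.loads(raw_completion)
    except json.JSONDecodeError as error:
        raise ValueError(f"completion is not valid JSON: {error}") from error

    questions = payload.get("questions") if isinstance(payload, dict) else None
    if not isinstance(questions, list):
        raise ValueError('completion has no "questions" array')
    if not minimum <= len(questions) <= maximum:
        raise ValueError(f"expected {minimum}-{maximum} questions, got {len(questions)}")

    for index, item in enumerate(questions):
        for key in ("question", "answer"):
            # Each entry must be an object with a non-empty string under both keys
            if not isinstance(item, dict) or not isinstance(item.get(key), str) or not item[key].strip():
                raise ValueError(f"question {index} has a missing or empty {key!r}")
    return questions


example = '{"questions": [{"question": "Q1?", "answer": "A1"}, {"question": "Q2?", "answer": "A2"}]}'
print(len(validate_questions(example)))
```

A check like this could run on `raw_completion` before the result is written to disk, so that a malformed response is reported instead of being stored.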




In [95]:
def process_pdf(file_id:str):
    import json
    print(f"Processing {file_id}")

    pdf_text = extract_text_from_pdf(f"directives/{file_id}.pdf")
    # generate_questions returns (parsed completion, raw completion string)
    completion, raw_completion = generate_questions(pdf_text)

    # Load the JSON file
    with open(f"documents/{file_id}.json", "r") as f:
        document = json.load(f)

    # Update the question fields and mark the document as analyzed
    document["generated_questions"] = completion["questions"]
    document["raw_questions_output"] = raw_completion
    document["analyzed"] = True
    # Save the updated JSON file
    with open(f"documents/{file_id}.json", "w") as f:
        json.dump(document, f)
    
    print(f"Finished processing {file_id}. View the updated JSON file at ./documents/{file_id}.json")
    return document


In [96]:
process_pdf("826005p")

Processing 826005p
Finished processing 826005p. View the updated JSON file at ./documents/826005p.json


{'id': '826005p',
 'url': 'https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodd/826005p.pdf',
 'raw_questions_output': '{\n  "questions":[\n    {"question": "What is the primary purpose of DoD Directive 8260.05?", "answer":"The primary purpose of DoD Directive 8260.05 is to establish policy and assign responsibilities for providing Support for Strategic Analysis (SSA) activities in accordance with various references, as well as to supersede and cancel previous instructions on the same subject."},\n    {"question": "What are some of the contents that SSA products should include according to the policy?", "answer":"SSA products should include current baselines reflecting selected Combatant Commander plans and approved force management decisions, and near- to long-term scenarios, concepts of operation, forces, and baselines based on plausible challenges requiring DoD resources and capabilities."},\n    {"question": "Who are required to collaborate and lead the SSA product develo

In [97]:
import os

# Get a list of all JSON files in the /documents directory
json_files = [file for file in os.listdir("documents") if file.endswith(".json")]
# Iterate through each JSON file, tracking progress with a 1-based counter
for counter, json_file in enumerate(json_files, start=1):
    print(f"Processing {counter}/{len(json_files)}")
    # Extract the file ID from the JSON file name
    file_id = os.path.splitext(json_file)[0]
    # Pass the file ID to the process_pdf function
    try:
        process_pdf(file_id)
    except Exception as e:
        print(e)
        break


Processing 1/80
Processing 510604p
Processing 510604p
Finished processing 510604p. View the updated JSON file at ./documents/510604p.json
Processing 2/80
Processing 540007p
Processing 540007p
Finished processing 540007p. View the updated JSON file at ./documents/540007p.json
Processing 3/80
Processing 510009p
Processing 510009p
Finished processing 510009p. View the updated JSON file at ./documents/510009p.json
Processing 4/80
Processing 510114p
Processing 510114p
Finished processing 510114p. View the updated JSON file at ./documents/510114p.json
Processing 5/80
Processing responsible_ai_toolkit
Processing responsible_ai_toolkit
Finished processing responsible_ai_toolkit. View the updated JSON file at ./documents/responsible_ai_toolkit.json
Processing 6/80
Processing 510055p
Processing 510055p
Finished processing 510055p. View the updated JSON file at ./documents/510055p.json
Processing 7/80
Processing 134410p
Processing 134410p
Finished processing 134410p. View the updated JSON file at
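
Each metadata file carries an `analyzed` flag, yet the loop above passes every JSON file to `process_pdf`, including ones that were already processed. If the run is interrupted, a resumable variant could select only the unfinished documents. The following is a minimal sketch; `pending_file_ids` is a hypothetical helper, and it assumes that a missing `analyzed` key means the document has not been processed.

```python
import json
import os


def pending_file_ids(documents_dir: str = "documents") -> list:
    """Return the IDs of documents whose metadata is not marked as analyzed."""
    pending = []
    for name in sorted(os.listdir(documents_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(documents_dir, name), "r") as f:
            document = json.load(f)
        # A missing flag is treated as "not yet analyzed"
        if not document.get("analyzed", False):
            pending.append(os.path.splitext(name)[0])
    return pending
```

The batch loop could then iterate over `pending_file_ids()` instead of every file in `documents`, which avoids paying for a second completion on documents that already have questions.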

## Transform documents into documents.json

In [100]:
# document_schema =[
#   {
#     "id": "test",
#     "url": "...",
#     "raw_questions_output": "...",
#     "generated_questions": [
#       {
#         "question": "insert question",
#         "answer": "insert answer"
#       }
#     ],
#     "document_type": "pdf",
#     "analyzed": true
#   }
# ]


In [102]:
import os
import json

# Get a list of all JSON files in the /documents directory
json_files = [file for file in os.listdir("documents") if file.endswith(".json")]

documents_json_schema = [
    {
        "doc_id": "file_id",
        "doc_url": "url",
        "question": "question",
        "answer": "answer",
    }
]
documents_json = []
# Iterate through each JSON file
for json_file in json_files:
    # Extract the file ID from the JSON file name
    file_id = os.path.splitext(json_file)[0]
    
    # Read the JSON file
    with open(f"documents/{json_file}", "r") as f:
        document = json.load(f)
    
    # Iterate over generated questions and append to documents_json
    for question in document["generated_questions"]:
        documents_json.append({
            "doc_id": file_id,
            "doc_url": document["url"],
            "question": question["question"],
            "answer": question["answer"]
        })
    

# Save documents_json to a JSON file once, after every document has been read
with open("documents.json", "w") as f:
    json.dump(documents_json, f)
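
Once the flattened list exists, a quick sanity check is to count how many questions each document contributed, since the prompt asks for between 2 and 10. The following is a minimal sketch; `questions_per_document` is a hypothetical helper, and the rows below are placeholders with the same keys as `documents_json`, not real model output.

```python
from collections import Counter


def questions_per_document(rows: list) -> Counter:
    """Count flattened question rows per doc_id."""
    return Counter(row["doc_id"] for row in rows)


# Placeholder rows with the same keys as the entries in documents_json
rows = [
    {"doc_id": "826005p", "doc_url": "...", "question": "...", "answer": "..."},
    {"doc_id": "826005p", "doc_url": "...", "question": "...", "answer": "..."},
    {"doc_id": "510604p", "doc_url": "...", "question": "...", "answer": "..."},
]
counts = questions_per_document(rows)
print(counts.most_common())
# prints: [('826005p', 2), ('510604p', 1)]
```

Documents whose count falls outside the requested range, or that are missing from the counter entirely, are candidates for reprocessing.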
    