# STORM with local documents

This notebook downloads documentation from the US Federal Emergency Management Agency (FEMA) to use as part of STORM analysis on local documents.

The notebook:

1. Downloads a set of FEMA PDF documents
2. Parses and chunks them
3. Embeds the chunks with a local `BAAI/bge-m3` model
4. Stores the embeddings in a local, file-based Qdrant vector store
5. Runs STORM using this store

# Setup

1. See [README](./README) to set up a conda environment and `.env` file
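
The cells below read several settings from the environment after `load_dotenv()` runs. As a minimal sketch of a pre-flight check, assuming only the variable names that appear later in this notebook, the helper below reports which ones are missing. `REQUIRED` and `missing_settings` are hypothetical names for illustration, not part of any library.

```python
import os

# Variable names taken from the cells in this notebook; which ones you need
# depends on whether you use OpenAI or Azure OpenAI.
REQUIRED = {
    "openai": ["OPENAI_API_KEY", "OPENAI_API_TYPE"],
    "azure": ["OPENAI_API_KEY", "OPENAI_API_TYPE", "AZURE_API_BASE", "AZURE_API_VERSION"],
}


def missing_settings(api_type, env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED[api_type] if not env.get(name)]


# Example with an explicit dict, so the check does not depend on your shell.
example_env = {"OPENAI_API_KEY": "placeholder", "OPENAI_API_TYPE": "openai"}
print(missing_settings("openai", example_env))
print(missing_settings("azure", example_env))
```

Calling `missing_settings("openai")` with no second argument checks the real environment instead of the example dict.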

In [7]:
import json
import os
from uuid import uuid4

import openai
import pandas as pd
import requests
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.embeddings import HuggingFaceBgeEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.rm import VectorRM
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
from knowledge_storm.utils import load_api_key, QdrantVectorStoreManager
from knowledge_storm.lm import OllamaClient

from langchain_openai import ChatOpenAI

from dspy import Example

pd.set_option("display.max_colwidth", None)

# Load environment variables from .env file
load_dotenv()

# Initialize the OpenAI API client
openai.api_key = os.getenv("OPENAI_API_KEY")

DATA_DIR = "./data"
DB_DIR = f"{DATA_DIR}/db"
PDF_DIR = f"{DATA_DIR}/pdfs"
STORM_OUTPUT_DIR = f"{DATA_DIR}/storm_output"
DB_COLLECTION_NAME = "fema_docs_demo"
EMBEDDING_MODEL = "BAAI/bge-m3"

# Create the working directories if they do not exist yet
for directory in [DATA_DIR, PDF_DIR, DB_DIR, STORM_OUTPUT_DIR]:
    os.makedirs(directory, exist_ok=True)

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)


llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)


sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: BAAI/bge-m3


In [2]:
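# Embed a sample query to find the embedding dimension.
# Note: despite its name, num_vectors is the length of a single embedding
# vector; it is used later as the Qdrant collection's vector size.
vectors = embeddings.embed_query("Bagels are the best!")
num_vectors = len(vectors)

print(f"Number of vectors: {num_vectors}")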

Number of vectors: 1024
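
The setup cell asks for normalized embeddings (`normalize_embeddings: True`), and the Qdrant collection created later uses cosine distance. The sketch below uses toy 3-dimensional vectors in place of the 1024-dimensional ones to illustrate why the two fit together: for unit-length vectors, cosine similarity reduces to a plain dot product. The helper names are illustrative only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors standing in for real embeddings.
a, b = [3.0, 4.0, 0.0], [4.0, 3.0, 0.0]
print(cosine(a, b))
print(dot(l2_normalize(a), l2_normalize(b)))
```

Both printed values should agree up to floating-point rounding.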


# Analysis

## Indexing FEMA Disaster preparedness documents

### Get FEMA PDF documents

In [2]:
df = pd.read_csv(f"{DATA_DIR}/fema_docs.csv")
display(df)

Unnamed: 0,Source,URL,Extra instructions,Document
0,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1-active_shooter-01102020.pdf
1,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1_active_shooter_TTX_answer_key-01102020.pdf
2,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_coastal-erosion.pdf
3,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_earthquakes.pdf
4,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf
5,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_severe-wind.pdf
6,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/documents/fema_protect-your-property-storm-surge.pdf
7,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_wildfire.pdf
8,FEMA,https://www.fema.gov/emergency-managers/individuals-communities/what-would-you-do-scenarios,,https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2_tornado-01102020.pdf
9,FEMA,https://www.fema.gov/emergency-managers/individuals-communities/what-would-you-do-scenarios,,https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2-tornado_TTX_answer_key-01102020.pdf


### Build Vector Database

First we will build a FEMA RAG chain for answering questions about preparing for disasters, using FEMA PDFs.

In [6]:
# Download every document listed in the 'Document' column
for doc_url in df["Document"]:
    print(f"Downloading {doc_url}")
    response = requests.get(doc_url, timeout=60)
    response.raise_for_status()  # fail loudly instead of saving an error page as a PDF
    with open(f"{PDF_DIR}/{doc_url.split('/')[-1]}", "wb") as f:
        f.write(response.content)


Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1-active_shooter-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1_active_shooter_TTX_answer_key-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_coastal-erosion.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_earthquakes.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_severe-wind.pdf
Downloading https://www.fema.gov/sites/default/files/documents/fema_protect-your-property-storm-surge.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_wildfire.pdf
Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2_tornado-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020
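
The download cell derives each local file name from the last path segment of the URL. That works for the URLs listed here, but it would keep any query string or fragment in the file name. A slightly more defensive sketch, using only the standard library, is shown below. `local_pdf_path` is a hypothetical helper and the `?download=1#page=2` suffix is made up for the example.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlparse


def local_pdf_path(doc_url, pdf_dir):
    """Map a document URL to a local file path, ignoring query string and fragment."""
    name = posixpath.basename(urlparse(doc_url).path)
    if not name:
        raise ValueError(f"URL has no file name: {doc_url}")
    return Path(pdf_dir) / name


url = "https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf"
print(local_pdf_path(url, "./data/pdfs"))
# The same file name is recovered when the URL carries extra parameters.
print(local_pdf_path(url + "?download=1#page=2", "./data/pdfs"))
```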

### Index documents

We will use a very simple parser and chunking methodology to ingest documents for this demo.

In [33]:
# Load the PDFs
docs = []
for pdf_file in os.listdir(PDF_DIR):
    if not pdf_file.endswith(".pdf"):
        continue
    print(f"Loading PDF: {pdf_file}")
    file_path = f"{PDF_DIR}/{pdf_file}"
    loader = PyPDFLoader(file_path)
    docs.extend(loader.load())  # PyPDFLoader returns one Document per page
    print(f"Loaded {len(docs)} documents")  # running total across all PDFs

print(len(docs))

Loading PDF: fema_scenario_10_power_outage_answer_key_01102020.pdf
Loaded 2 documents
Loading PDF: fema_scenario_7-shelter_in_place_TTX_answer_key_01102020.pdf
Loaded 5 documents
Loading PDF: ready_12-ways-to-prepare_postcard.pdf
Loaded 7 documents
Loading PDF: fema_safeguard-critical-documents-and-valuables.pdf
Loaded 10 documents
Loading PDF: ready_document-and-insure-your-property.pdf
Loaded 16 documents
Loading PDF: fema_scenario_1-active_shooter-01102020.pdf
Loaded 18 documents
Loading PDF: fema_protect-your-property_wildfire.pdf
Loaded 26 documents
Loading PDF: fema_scenario_4-hurricane-01102020.pdf
Loaded 27 documents
Loading PDF: fema_scenario_10_power_outage_01102020.pdf
Loaded 28 documents
Loading PDF: fema_scenario_4_hurricane_flood_TTX_answer_key-01102020.pdf
Loaded 30 documents
Loading PDF: fema_scenario_11_winter_storm_01102020.pdf
Loaded 32 documents
Loading PDF: fema_protect-your-property_severe-wind.pdf
Loaded 44 documents
Loading PDF: fema_protect-your-property-storm-

In [34]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
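
To give a feel for what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and newlines); `sliding_chunks` is a hypothetical helper that only illustrates the size and overlap arithmetic.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With the values used in this notebook (1000 and 200), each new window starts 800 characters after the previous one, so a 2600-character page would yield three windows under this simplified scheme.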

This is a very basic population of metadata; a real-world use case would need something more comprehensive.

In [36]:

def summarize_text(text, prompt):
    messages = [
        (
            "system",
            "You are an assistant that gives a very brief, single-sentence description of text.",
        ),
        ("human", f"{prompt} :: \n\n {text}"),
    ]
    ai_msg = llm.invoke(messages)
    summary = ai_msg.content
    return summary

new_splits = []
for doc in splits:

    # The PDF file name is the last part of doc.metadata['source']
    pdf_name = os.path.basename(doc.metadata['source'])

    # Find the row in df whose 'Document' URL contains pdf_name
    # (regex=False so that '.' in the file name is matched literally)
    row = df[df['Document'].str.contains(pdf_name, regex=False)]
    page = doc.metadata["page"] + 1
    url = f"{row['Document'].values[0]}?id={str(uuid4())}#page={page}"

    # We'll use an LLM to generate a summary and title of the text, used by STORM
    # This is just for the demo, proper application would have better metadata
    summary = summarize_text(doc.page_content, prompt="Please describe this text:")
    title = summarize_text(doc.page_content, prompt="Please generate a 5 word title for this text:")

    doc.metadata['description'] = summary
    doc.metadata['title'] = title
    doc.metadata['url'] = url
    doc.metadata['content'] = doc.page_content

    #print(json.dumps(doc.metadata, indent=2))
    new_splits.append(doc)

splits = new_splits
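
The loop above gives every chunk its own URL by adding a random `id` query parameter and a `#page=` anchor. Several chunks can come from the same PDF page, so without the `id` they would share one URL. STORM's output is keyed by URL (see `url_to_info.json` later in this notebook), which is presumably why each chunk needs a distinct one. A minimal sketch of that URL scheme, with `chunk_url` as a hypothetical helper and a placeholder document URL:

```python
from uuid import uuid4


def chunk_url(document_url, page_index):
    """Build a per-chunk URL: a unique query id plus a 1-based #page anchor."""
    page = page_index + 1  # the notebook adds 1 to the loader's page value
    return f"{document_url}?id={uuid4()}#page={page}"


base = "https://example.org/docs/sample.pdf"  # placeholder URL for illustration
first = chunk_url(base, page_index=2)
second = chunk_url(base, page_index=2)
print(first)
print(first != second)  # two chunks from the same page still get different URLs
```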

In [3]:
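client = QdrantClient(path=DB_DIR)

# The collection's vector size must equal the embedding dimension.
# Recompute it here so this cell does not depend on an earlier cell having been run.
num_vectors = len(embeddings.embed_query("dimension check"))

client.create_collection(
    collection_name=DB_COLLECTION_NAME,
    vectors_config=VectorParams(size=num_vectors, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name=DB_COLLECTION_NAME,
    embedding=embeddings,
)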

In [38]:
uuids = [str(uuid4()) for _ in range(len(splits))]

vector_store.add_documents(documents=splits, ids=uuids)

['d820f321-b534-4b6a-9e76-1a7c1e3a94e5',
 'e2b6ce35-cf23-4c12-80cc-99dbecd08c3c',
 '8f91c9a1-6423-40e1-82c2-81a7bb426477',
 '2688d57e-9263-4f17-b93a-993ed1e98804',
 '2a6ae41d-ac60-4741-abf3-d62f38f019b4',
 '08703467-a54d-4f84-a903-778f29356b9c',
 '52c0a41c-2ff5-48ff-952b-0c3e3fde3e66',
 '8fa065f4-3dd1-4822-8632-be914dea9607',
 'f007c30b-39b9-4f56-891d-8b96ad63b6dd',
 'bfd199c5-5396-4f09-9ab6-76443551d43e',
 '2eef279f-1bbc-4e7e-a38a-5601922832a3',
 '907c656e-80cf-43cf-8ce3-48c38c62b676',
 '35a840a4-04f8-4d8b-9e21-4f42f24186fb',
 '72408d3d-d9ae-4707-931d-b87b121ab80d',
 '5f0750e9-399c-4a0d-b2e3-c33da960fc0a',
 '410694ce-bef4-4bbe-a230-79bc7ced0f39',
 '4d970a6f-2896-4cdc-9c80-947de9cf8e3c',
 '43ab2dcc-1662-4cb0-92c1-a3c9d57d406a',
 '044a0817-b567-42b7-bb5b-e0f47a082caa',
 'c00696f0-c2fe-4332-b2bb-d80c111860c5',
 '3b0b588e-0968-4973-8d4a-74418c79cbf5',
 '566a6c13-2056-40e8-867e-e303c45614d7',
 '1c66dcab-3e47-4318-bb80-1190addbf3d9',
 'eb2880b1-5715-4473-8948-d2a6de9721a3',
 'd08d400c-9b71-

### Base retriever

As a quick check, let's reload the vector store from disk and run a test query against it.

In [4]:
# Remove DB_DIR/.lock left by the earlier client so the local store can be reopened
if os.path.exists(f"{DB_DIR}/.lock"):
    os.remove(f"{DB_DIR}/.lock")

print("Loading vector_store from disk")
client = QdrantClient(path=DB_DIR)
vector_store = QdrantVectorStore(
    client=client,
    collection_name=DB_COLLECTION_NAME,
    embedding=embeddings,
)

retriever = vector_store.as_retriever(search_kwargs={"k": 15})

Loading vector_store from disk
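
The retriever above is configured with `k=15`, meaning it returns the 15 stored chunks whose embeddings score highest against the query embedding. As a rough illustration of that idea (a brute-force sketch, not how Qdrant implements search), the snippet below ranks a few toy unit vectors by dot product. `top_k` and `toy_vectors` are hypothetical names.

```python
import heapq


def top_k(query_vec, doc_vecs, k):
    """Return (score, doc_id) pairs for the k highest dot-product scores, best first."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), doc_id)
        for doc_id, vec in doc_vecs.items()
    )
    return heapq.nlargest(k, scored)


# Toy 2-dimensional unit vectors standing in for chunk embeddings.
toy_vectors = {
    "flood": [1.0, 0.0],
    "wildfire": [0.0, 1.0],
    "storm-surge": [0.8, 0.6],
}
print(top_k([1.0, 0.0], toy_vectors, k=2))
```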


In [5]:
results = retriever.invoke("How can I prepare my house for a flood?")
for doc in results[0:3]:
    print("=====================================")
    print(json.dumps(doc.metadata))
    print(doc.page_content)

for Alerts
Plan with
NeighborsMake a Plan
Make Your 
Home
SaferDocument and
Insure PropertySafeguard
Documents
Know 
Evacuation
Routes Practice 
Emergency 
DrillsEXIT
Save for a
Rainy DayTest Family
Communication
Plan12 WAYS TO PREPARE
Make a Plan Save for a 
Rainy Day 
Plan with 
Neighbors Document and 
Insure PropertySafeguard 
Documents Sign up 
for Alerts 
Communication 
Plan EXIT
Practice 
Emergency 
Drills 
Get Involved in 
Your Community Assemble or 
Update 
Supplies Know 
Evacuation 
Routes Make Your 
Home 
Safer 12 WAYS TO PREPARE
{"source": "./data/pdfs/fema_protect-your-property-storm-surge.pdf", "page": 3, "description": "The text provides guidelines for preparing for potential flooding by documenting home contents, storing valuables safely, and elevating appliances and utilities.", "title": "\"Essential Flood Preparedness and Documentation\"", "url": "https://www.fema.gov/sites/default/files/documents/fema_protect-your-property-storm-surge.pdf?id=e02e84a3-bc2d-428a-b7f9-d1

## Run STORM Using our local document vectors

The code below is adapted from the STORM [examples](https://github.com/stanford-oval/storm/blob/main/examples/storm_examples/README.md).

In [16]:
def run_storm(topic, model_type="openai"):

    # Clear any leftover lock file so the local vector store can be read
    if os.path.exists(f"{DB_DIR}/.lock"):
        os.remove(f"{DB_DIR}/.lock")
        
    engine_lm_configs = STORMWikiLMConfigs()

    if model_type == "openai":

        print("Using OpenAI models")

        # Initialize the language model configurations
        openai_kwargs = {
            'api_key': os.getenv("OPENAI_API_KEY"),
            'temperature': 1.0,
            'top_p': 0.9,
        }

        ModelClass = OpenAIModel if os.getenv('OPENAI_API_TYPE') == 'openai' else AzureOpenAIModel
        # If you are using Azure service, make sure the model name matches your own deployed model name.
        # The default name here is only used for demonstration and may not match your case.
        gpt_35_model_name = 'gpt-4o-mini' if os.getenv('OPENAI_API_TYPE') == 'openai' else 'gpt-35-turbo'
        gpt_4_model_name = 'gpt-4o'
        if os.getenv('OPENAI_API_TYPE') == 'azure':
            openai_kwargs['api_base'] = os.getenv('AZURE_API_BASE')
            openai_kwargs['api_version'] = os.getenv('AZURE_API_VERSION')

        # STORM is a LM system so different components can be powered by different models.
        # For a good balance between cost and quality, you can choose a cheaper/faster model for conv_simulator_lm 
        # which is used to split queries, synthesize answers in the conversation. We recommend using stronger models
        # for outline_gen_lm which is responsible for organizing the collected information, and article_gen_lm
        # which is responsible for generating sections with citations.
        conv_simulator_lm = ModelClass(model=gpt_35_model_name, max_tokens=10000, **openai_kwargs)
        question_asker_lm = ModelClass(model=gpt_35_model_name, max_tokens=10000, **openai_kwargs)
        outline_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)
        article_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)
        article_polish_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)

    elif model_type == "ollama":

        print("Using Ollama models")

        ollama_kwargs = {
            #"model": "llama3.2:3b",
            "model": "llama3.1:latest",
            #"model": "qwen2.5:14b",
            "port": "11434",
            "url": "http://localhost",
            "stop": ('\n\n---',)  # dspy uses "\n\n---" to separate examples. Open models sometimes generate this.
        }

        conv_simulator_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
        question_asker_lm = OllamaClient(max_tokens=500, **ollama_kwargs)
        outline_gen_lm = OllamaClient(max_tokens=400, **ollama_kwargs)
        article_gen_lm = OllamaClient(max_tokens=700, **ollama_kwargs)
        article_polish_lm = OllamaClient(max_tokens=4000, **ollama_kwargs)

    else:
        # Without this, an unknown model_type fails later with a confusing NameError
        raise ValueError(f"Unsupported model_type: {model_type!r}")

    engine_lm_configs.set_conv_simulator_lm(conv_simulator_lm)
    engine_lm_configs.set_question_asker_lm(question_asker_lm)
    engine_lm_configs.set_outline_gen_lm(outline_gen_lm)
    engine_lm_configs.set_article_gen_lm(article_gen_lm)
    engine_lm_configs.set_article_polish_lm(article_polish_lm)

    max_conv_turn=4
    max_perspective=3
    search_top_k=10
    max_thread_num=1
    device='cpu'
    vector_db_mode='offline'

    do_research=True
    do_generate_outline=True
    do_generate_article=True
    do_polish_article=True

    # Initialize the engine arguments
    engine_args = STORMWikiRunnerArguments(
        output_dir=STORM_OUTPUT_DIR,
        max_conv_turn=max_conv_turn,
        max_perspective=max_perspective,
        search_top_k=search_top_k,
        max_thread_num=max_thread_num,
    )

    # Set up VectorRM to retrieve information from your own data
    rm = VectorRM(
        collection_name=DB_COLLECTION_NAME,
        embedding_model=EMBEDDING_MODEL,
        device=device,
        k=search_top_k,
    )

    # initialize the vector store, either online (store the db on Qdrant server) or offline (store the db locally):
    if vector_db_mode == 'offline':
        rm.init_offline_vector_db(vector_store_path=DB_DIR)

    # Initialize the STORM Wiki Runner
    runner = STORMWikiRunner(engine_args, engine_lm_configs, rm)

    if model_type == "ollama":

        print("Using Ollama models prompting")

        # Open LMs are generally weaker in following output format.
        # One way for mitigation is to add one-shot example to the prompt to exemplify the desired output format.
        # For example, we can add the following examples to the two prompts used in StormPersonaGenerator.
        # Note that the example should be an object of dspy.Example with fields matching the InputField
        # and OutputField in the prompt (i.e., dspy.Signature).
        find_related_topic_example = Example(
            topic="Knowledge Curation",
            related_topics="https://en.wikipedia.org/wiki/Knowledge_management\n"
                        "https://en.wikipedia.org/wiki/Information_science\n"
                        "https://en.wikipedia.org/wiki/Library_science\n"
        )
        gen_persona_example = Example(
            topic="Knowledge Curation",
            examples="Title: Knowledge management\n"
                    "Table of Contents: History\nResearch\n  Dimensions\n  Strategies\n  Motivations\nKM technologies"
                    "\nKnowledge barriers\nKnowledge retention\nKnowledge audit\nKnowledge protection\n"
                    "  Knowledge protection methods\n    Formal methods\n    Informal methods\n"
                    "  Balancing knowledge protection and knowledge sharing\n  Knowledge protection risks",
            personas="1. Historian of Knowledge Systems: This editor will focus on the history and evolution of knowledge curation. They will provide context on how knowledge curation has changed over time and its impact on modern practices.\n"
                    "2. Information Science Professional: With insights from 'Information science', this editor will explore the foundational theories, definitions, and philosophy that underpin knowledge curation\n"
                    "3. Digital Librarian: This editor will delve into the specifics of how digital libraries operate, including software, metadata, digital preservation.\n"
                    "4. Technical expert: This editor will focus on the technical aspects of knowledge curation, such as common features of content management systems.\n"
                    "5. Museum Curator: The museum curator will contribute expertise on the curation of physical items and the transition of these practices into the digital realm."
        )
        runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.find_related_topic.demos = [
            find_related_topic_example]
        runner.storm_knowledge_curation_module.persona_generator.create_writer_with_persona.gen_persona.demos = [
            gen_persona_example]

        # A trade-off of adding one-shot example is that it will increase the input length of the prompt. Also, some
        # examples may be very long (e.g., an example for writing a section based on the given information), which may
        # confuse the model. For these cases, you can create a pseudo-example that is short and easy to understand to steer
        # the model's output format.
        # For example, we can add the following pseudo-examples to the prompt used in WritePageOutlineFromConv and
        # ConvToSection.
        write_page_outline_example = Example(
            topic="Example Topic",
            conv="Wikipedia Writer: ...\nExpert: ...\nWikipedia Writer: ...\nExpert: ...",
            old_outline="# Section 1\n## Subsection 1\n## Subsection 2\n"
                        "# Section 2\n## Subsection 1\n## Subsection 2\n"
                        "# Section 3",
            outline="# New Section 1\n## New Subsection 1\n## New Subsection 2\n"
                    "# New Section 2\n"
                    "# New Section 3\n## New Subsection 1\n## New Subsection 2\n## New Subsection 3"
        )
        runner.storm_outline_generation_module.write_outline.write_page_outline.demos = [write_page_outline_example]
        write_section_example = Example(
            info="[1]\nInformation in document 1\n[2]\nInformation in document 2\n[3]\nInformation in document 3",
            topic="Example Topic",
            section="Example Section",
            output="# Example Topic\n## Subsection 1\n"
                "This is an example sentence [1]. This is another example sentence [2][3].\n"
                "## Subsection 2\nThis is one more example sentence [1]."
        )
        runner.storm_article_generation.section_gen.write_section.demos = [write_section_example]

    # run the pipeline
    runner.run(
        topic=topic,
        do_research=do_research,
        do_generate_outline=do_generate_outline,
        do_generate_article=do_generate_article,
        do_polish_article=do_polish_article,
    )
    runner.post_run()
    runner.summary()
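
In the OpenAI branch of `run_storm`, the conversation simulator and question asker use the cheaper model, while outline generation, article generation and polishing use the stronger one. One way to keep that choice in a single place is a small lookup table. The sketch below is illustrative only: `STAGE_MODELS` and `model_for` are hypothetical names, and the model names are the ones used in the function above.

```python
CHEAP_MODEL = "gpt-4o-mini"
STRONG_MODEL = "gpt-4o"

# Which model backs each STORM stage, mirroring the OpenAI branch of run_storm.
STAGE_MODELS = {
    "conv_simulator": CHEAP_MODEL,
    "question_asker": CHEAP_MODEL,
    "outline_gen": STRONG_MODEL,
    "article_gen": STRONG_MODEL,
    "article_polish": STRONG_MODEL,
}


def model_for(stage):
    """Look up the model name for a stage, rejecting unknown stage names."""
    try:
        return STAGE_MODELS[stage]
    except KeyError:
        raise ValueError(f"Unknown STORM stage: {stage!r}") from None


for stage in STAGE_MODELS:
    print(f"{stage}: {model_for(stage)}")
```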

In [12]:
query = "Write a detailed and comprehensive report on how should people prepare their homes and respond in the event of extreme flood events?"
run_storm(query, model_type="openai")

sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: BAAI/bge-m3


Using OpenAI models




Collection fema_docs_demo exists. Loading the collection...


knowledge_storm.interface : INFO     : run_knowledge_curation_module executed in 12.5988 seconds
knowledge_storm.interface : INFO     : run_outline_generation_module executed in 6.9678 seconds
sentence_transformers.SentenceTransformer : INFO     : Use pytorch device_name: mps
sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: paraphrase-MiniLM-L6-v2
knowledge_storm.interface : INFO     : run_article_generation_module executed in 50.9987 seconds
knowledge_storm.interface : INFO     : run_article_polishing_module executed in 4.7054 seconds


***** Execution time *****
run_knowledge_curation_module: 12.5988 seconds
run_outline_generation_module: 6.9678 seconds
run_article_generation_module: 50.9987 seconds
run_article_polishing_module: 4.7054 seconds
***** Token usage of language models: *****
run_knowledge_curation_module
    gpt-4o-mini: {'prompt_tokens': 47811, 'completion_tokens': 12029}
    gpt-4o: {'prompt_tokens': 0, 'completion_tokens': 0}
run_outline_generation_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 6770, 'completion_tokens': 934}
run_article_generation_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 16091, 'completion_tokens': 2869}
run_article_polishing_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 3028, 'completion_tokens': 460}
***** Number of queries of retrieval models: *****
run_knowledge_curation_module: {'VectorRM': 48}
run_outline_generation_mo
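
The summary reports token usage per module and per model. To estimate cost it is often more convenient to total the counts per model. The sketch below does that for the numbers printed above, copied into a plain dictionary. `total_by_model` is a hypothetical helper, not part of `knowledge_storm`.

```python
from collections import defaultdict

# Token usage copied from the run summary above.
usage = {
    "run_knowledge_curation_module": {
        "gpt-4o-mini": {"prompt_tokens": 47811, "completion_tokens": 12029},
        "gpt-4o": {"prompt_tokens": 0, "completion_tokens": 0},
    },
    "run_outline_generation_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 6770, "completion_tokens": 934},
    },
    "run_article_generation_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 16091, "completion_tokens": 2869},
    },
    "run_article_polishing_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 3028, "completion_tokens": 460},
    },
}


def total_by_model(usage):
    """Sum prompt and completion tokens per model across all modules."""
    totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})
    for per_model in usage.values():
        for model, counts in per_model.items():
            for key, value in counts.items():
                totals[model][key] += value
    return dict(totals)


print(total_by_model(usage))
```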

In [43]:
def generate_footnotes():

    # Find the most recent folder (by modified date) in STORM_OUTPUT_DIR
    # TODO: find out how exactly STORM passes back its output directory to avoid this hack
    folders = [f.path for f in os.scandir(STORM_OUTPUT_DIR) if f.is_dir()]
    folder = max(folders, key=os.path.getmtime)

    file = f"{folder}/url_to_info.json"

    with open(file) as f:
        data = json.load(f)

    # Map each reference number to a markdown link with the source title
    refs = {}
    for url, index in data['url_to_unified_index'].items():
        title = data['url_to_info'][url]['title'].replace('"', '')
        refs[index] = f"{index} [{title}]({url})"

    # One footnote per line, ordered by reference number
    footer = ""
    for key in sorted(refs):
        footer += f"{refs[key]}\n"

    return footer


print(generate_footnotes())




1 [Hurricane and Flood Risk Awareness](https://www.fema.gov/sites/default/files/2020-10/fema_scenario_4_hurricane_flood_TTX_answer_key-01102020.pdf?id=a8ed703b-a8c4-4224-8e50-b7872ebde8b6#page=1)
2 [Protecting Your Home from Flooding](https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf?id=6051c8dc-003e-4715-ae02-8dd5f1fd2111#page=2)
3 [Understanding Flood Risks and Insurance](https://www.fema.gov/sites/default/files/2020-10/fema_scenario_4_hurricane_flood_TTX_answer_key-01102020.pdf?id=2050cc38-3ce9-485d-af00-4c017fd0eebe#page=2)
4 [Hurricane Safety: Prepare and Protect](https://www.fema.gov/sites/default/files/2020-10/fema_scenario_4_hurricane_flood_TTX_answer_key-01102020.pdf?id=e64c39a1-b8da-4279-b177-e110660448c5#page=2)
5 [Coastal Flooding Alerts and Terminology](https://www.fema.gov/sites/default/files/2020-10/fema_scenario_12_small_business_answer_key_01102020.pdf?id=9cec0963-b30a-45b9-abd1-12f89bcf6512#page=2)
6 [Protecting Your Home from Storms

In [15]:
#run_storm(query, model_type="ollama")

Here we generate footnotes by parsing `url_to_info.json`:

1. Loop through `url_to_unified_index` to get the URLs and reference numbers
2. Look up each URL in `url_to_info` to get its title
3. Add the footnotes to the end of the article
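
The three steps above can be sketched on a small in-memory example. The dictionary below is made-up data that only mirrors the two keys the notebook reads from `url_to_info.json`. `append_footnotes` is a hypothetical helper that also covers the last step, adding the references to the end of the article text.

```python
def append_footnotes(article, data):
    """Append a numbered reference list built from a url_to_info-style dict."""
    lines = []
    by_index = sorted(data["url_to_unified_index"].items(), key=lambda item: item[1])
    for url, index in by_index:
        title = data["url_to_info"][url]["title"].replace('"', "")
        lines.append(f"{index} [{title}]({url})")
    return article.rstrip() + "\n\n" + "\n".join(lines) + "\n"


# Hypothetical data for illustration; real files contain more fields per URL.
sample = {
    "url_to_unified_index": {
        "https://example.org/b.pdf": 2,
        "https://example.org/a.pdf": 1,
    },
    "url_to_info": {
        "https://example.org/a.pdf": {"title": '"Flood Basics"'},
        "https://example.org/b.pdf": {"title": "Storm Surge"},
    },
}
print(append_footnotes("# Report\n\nBody text [1][2].", sample))
```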