# STORM with local documents

This notebook downloads documentation from the US Federal Emergency Management Agency (FEMA) to use as part of STORM analysis on local documents.

The notebook:

1. Downloads some FEMA documents
2. Parses and chunks them
3. Embeds the chunks with a local `BAAI/bge-m3` model
4. Creates a local filesystem Qdrant vector store
5. Runs STORM using this store

# Setup

1. See [README](./README) to set up a conda environment and `.env` file

In [21]:
import json
import os
from uuid import uuid4

import openai
import pandas as pd
import requests
from dotenv import load_dotenv

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

from knowledge_storm import STORMWikiRunnerArguments, STORMWikiRunner, STORMWikiLMConfigs
from knowledge_storm.rm import VectorRM
from knowledge_storm.lm import OpenAIModel, AzureOpenAIModel
from knowledge_storm.utils import load_api_key, QdrantVectorStoreManager

from langchain_openai import ChatOpenAI

pd.set_option("display.max_colwidth", None)

# Load environment variables from .env file
load_dotenv()

# Initialize the OpenAI API client
openai.api_key = os.getenv("OPENAI_API_KEY")

DATA_DIR = "./data"
DB_DIR = f"{DATA_DIR}/db"
PDF_DIR = f"{DATA_DIR}/pdfs"
STORM_OUTPUT_DIR = f"{DATA_DIR}/storm_output"
DB_COLLECTION_NAME = "fema_docs_demo"
EMBEDDING_MODEL = "BAAI/bge-m3"

# Create the working directories if they do not exist yet
for directory in [DATA_DIR, PDF_DIR, DB_DIR, STORM_OUTPUT_DIR]:
    os.makedirs(directory, exist_ok=True)

model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
embeddings = HuggingFaceBgeEmbeddings(
    model_name=EMBEDDING_MODEL, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)


llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
)


sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: BAAI/bge-m3


In [2]:
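# embed_query returns a single vector; its length is the embedding dimension
# (reused later as the Qdrant vector size), not a count of vectors.
vectors = embeddings.embed_query("Bagels are the best!")
num_vectors = len(vectors)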

print(f"Number of vectors: {num_vectors}")

Number of vectors: 1024
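
The setup cell passes `normalize_embeddings: True`, and the Qdrant collection in this notebook is configured with cosine distance. As a minimal sketch of why the two fit together: once vectors are scaled to unit length, cosine similarity reduces to a plain dot product. This uses plain Python and made-up three-dimensional vectors, not real `bge-m3` output. The helper names are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (what normalize_embeddings=True is assumed to do)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors for illustration only
query = [3.0, 4.0, 0.0]
chunk = [4.0, 3.0, 0.0]

unit_query = l2_normalize(query)
unit_chunk = l2_normalize(chunk)

# Both lines print the same value, up to floating-point rounding
print(cosine_similarity(query, chunk))
print(dot(unit_query, unit_chunk))
```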


# Analysis

## Indexing FEMA Disaster preparedness documents

### Get FEMA PDF documents

In [3]:
df = pd.read_csv(f"{DATA_DIR}/fema_docs.csv")
display(df)

Unnamed: 0,Source,URL,Extra instructions,Document
0,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1-active_shooter-01102020.pdf
1,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1_active_shooter_TTX_answer_key-01102020.pdf
2,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_coastal-erosion.pdf
3,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_earthquakes.pdf
4,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf
5,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_severe-wind.pdf
6,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/documents/fema_protect-your-property-storm-surge.pdf
7,FEMA,https://www.fema.gov/emergency-managers/risk-management/hazard-mitigation-planning/risk-reduction-activities,"Selected ""Protect my home from natural hazards""",https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_wildfire.pdf
8,FEMA,https://www.fema.gov/emergency-managers/individuals-communities/what-would-you-do-scenarios,,https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2_tornado-01102020.pdf
9,FEMA,https://www.fema.gov/emergency-managers/individuals-communities/what-would-you-do-scenarios,,https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2-tornado_TTX_answer_key-01102020.pdf


### Build Vector Database

First we will build a FEMA RAG chain for answering questions about preparing for disasters, using FEMA PDFs.

In [6]:
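# Download every document listed in the 'Document' column
for doc_url in df["Document"]:
    print(f"Downloading {doc_url}")
    response = requests.get(doc_url, timeout=60)
    response.raise_for_status()  # fail loudly rather than saving an error page as a PDF
    # The local file name is the last path segment of the URL
    with open(f"{PDF_DIR}/{doc_url.split('/')[-1]}", "wb") as f:
        f.write(response.content)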


Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1-active_shooter-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_1_active_shooter_TTX_answer_key-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_coastal-erosion.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_earthquakes.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_severe-wind.pdf
Downloading https://www.fema.gov/sites/default/files/documents/fema_protect-your-property-storm-surge.pdf
Downloading https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-property_wildfire.pdf
Downloading https://www.fema.gov/sites/default/files/2020-10/fema_scenario_2_tornado-01102020.pdf
Downloading https://www.fema.gov/sites/default/files/2020
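
The download loop takes the local file name from the last `/`-separated segment of each URL. That works for the plain FEMA links in the table, but would keep any query string or fragment in the file name. A slightly more defensive sketch using only the standard library is shown here; `filename_from_url` is a hypothetical helper, not part of the notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse

def filename_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    path = urlparse(url).path
    return PurePosixPath(unquote(path)).name

url = "https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf"
print(filename_from_url(url))                     # fema_protect-your-home_flooding.pdf
print(filename_from_url(url + "?id=123#page=2"))  # fema_protect-your-home_flooding.pdf
```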

### Index documents

We will use a very simple parser and chunking methodology to ingest documents for this demo.

In [33]:
# Load the PDFs
docs = []
for pdf_file in os.listdir(PDF_DIR):
    if not pdf_file.endswith(".pdf"):
        continue
    print(f"Loading PDF: {pdf_file}")
    file_path = f"{PDF_DIR}/{pdf_file}"
    loader = PyPDFLoader(file_path)
    docs.extend(loader.load())  # PyPDFLoader returns one Document per page
    print(f"Loaded {len(docs)} documents")  # running total across all PDFs so far

print(len(docs))

Loading PDF: fema_scenario_10_power_outage_answer_key_01102020.pdf
Loaded 2 documents
Loading PDF: fema_scenario_7-shelter_in_place_TTX_answer_key_01102020.pdf
Loaded 5 documents
Loading PDF: ready_12-ways-to-prepare_postcard.pdf
Loaded 7 documents
Loading PDF: fema_safeguard-critical-documents-and-valuables.pdf
Loaded 10 documents
Loading PDF: ready_document-and-insure-your-property.pdf
Loaded 16 documents
Loading PDF: fema_scenario_1-active_shooter-01102020.pdf
Loaded 18 documents
Loading PDF: fema_protect-your-property_wildfire.pdf
Loaded 26 documents
Loading PDF: fema_scenario_4-hurricane-01102020.pdf
Loaded 27 documents
Loading PDF: fema_scenario_10_power_outage_01102020.pdf
Loaded 28 documents
Loading PDF: fema_scenario_4_hurricane_flood_TTX_answer_key-01102020.pdf
Loaded 30 documents
Loading PDF: fema_scenario_11_winter_storm_01102020.pdf
Loaded 32 documents
Loading PDF: fema_protect-your-property_severe-wind.pdf
Loaded 44 documents
Loading PDF: fema_protect-your-property-storm-

In [34]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
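
To illustrate what `chunk_size=1000` and `chunk_overlap=200` mean, here is a deliberately simplified sketch of fixed-width chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter breaks on separators such as paragraphs and lines and treats the sizes as targets. `chunk_with_overlap` is a hypothetical helper for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Naive chunking: each chunk starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(2500))  # 2,500 characters of dummy text
chunks = chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200)

print([len(c) for c in chunks])             # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.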

This is a very basic way of populating metadata; a real-world use case would need something more comprehensive.

In [36]:

def summarize_text(text, prompt):
    messages = [
        (
            "system",
            "You are an assistant that gives very brief single sentence description of text.",
        ),
        ("human", f"{prompt} :: \n\n {text}"),
    ]
    ai_msg = llm.invoke(messages)
    summary = ai_msg.content
    return summary

new_splits = []
for doc in splits:

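    # The PDF file name is the last part of doc.metadata['source']
    pdf_name = doc.metadata['source'].split('/')[-1]

    # Find the row in df whose 'Document' URL contains the PDF file name
    # (regex=False so dots in the file name are matched literally)
    row = df[df['Document'].str.contains(pdf_name, regex=False)]
    page = doc.metadata["page"] + 1  # PyPDFLoader page numbers are 0-based
    url = f"{row['Document'].values[0]}?id={uuid4()}#page={page}"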

    # We'll use an LLM to generate a summary and title of the text, used by STORM
    # This is just for the demo, proper application would have better metadata
    summary = summarize_text(doc.page_content, prompt="Please describe this text:")
    title = summarize_text(doc.page_content, prompt="Please generate a 5 word title for this text:")

    doc.metadata['description'] = summary
    doc.metadata['title'] = title
    doc.metadata['url'] = url
    doc.metadata['content'] = doc.page_content

    #print(json.dumps(doc.metadata, indent=2))
    new_splits.append(doc)

splits = new_splits
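
The loop above gives every chunk its own URL by appending a random `id` query parameter and a `#page=` fragment to the source PDF link. Presumably this is so that several chunks from the same page do not share one identifier. The following standalone sketch shows that construction and how the parts can be read back; `chunk_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import parse_qs, urlparse
from uuid import uuid4

def chunk_url(document_url: str, zero_based_page: int) -> str:
    """Build a per-chunk URL: unique `id` query parameter plus a 1-based page anchor."""
    page = zero_based_page + 1
    return f"{document_url}?id={uuid4()}#page={page}"

base = "https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf"
first = chunk_url(base, zero_based_page=0)
second = chunk_url(base, zero_based_page=0)

parsed = urlparse(first)
print(parsed.fragment)                       # page=1
print(len(parse_qs(parsed.query)["id"][0]))  # 36 (length of a UUID string)
print(first != second)                       # True: same page, different URLs
```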

In [26]:
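client = QdrantClient(path=DB_DIR)

# Note: create_collection raises a ValueError if the collection already exists,
# for example when this cell is re-run against an existing DB_DIR.
client.create_collection(
    collection_name=DB_COLLECTION_NAME,
    vectors_config=VectorParams(size=num_vectors, distance=Distance.COSINE),
)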

vector_store = QdrantVectorStore(
    client=client,
    collection_name=DB_COLLECTION_NAME,
    embedding=embeddings,
)

ValueError: Collection fema_docs_demo already exists

In [27]:
uuids = [str(uuid4()) for _ in range(len(splits))]

vector_store.add_documents(documents=splits, ids=uuids)

['c034c463-8b01-4afc-9434-e9ebb5dfd895',
 'dc3b10b4-e284-46d9-935f-5e532fe897c0',
 '300c388e-c1e3-46aa-8ea0-9a7773825df4',
 'adfcb6cc-0715-4d6b-8a77-88d99f8ba60d',
 '8772e462-1041-4123-b6a2-32ac9b7a3931',
 '76a2e78d-edf0-4f25-9c28-88d09e2422f5',
 'd70d1bb7-b419-4b89-b21f-7ac5546fe6f2',
 '5ed6d86c-14de-469f-8398-ad276aa95eee',
 '7909e6c1-0554-429d-b6b6-93b764e13aaf',
 'ef9f2ba0-3aeb-447d-acca-ee30ae224d8e',
 '6543e232-c272-4a8c-8dbb-1bbc94dfe1d4',
 '31709c17-2298-4b57-958a-c85627703c9f',
 'bae193b0-e87e-480b-aa5d-5a4d1327c94b',
 'cb593c8d-bc2f-4086-937e-d463f38fed0f',
 'b5f313fb-8e4a-436b-beb8-b7c649d8aa74',
 '8642d0cf-e29f-48c4-929c-97783f75823e',
 '9495703c-4769-4a2e-821b-43d0a9ddb32a',
 'dfac56e6-cac0-4fc5-8d1a-c3fae05c6740',
 '8dcccaff-89f2-4a8b-b0d7-59805f0b32fb',
 'c3e79921-dbbf-438f-a6b2-8f7a7ae67ba0',
 '38249ef2-4d11-41f3-b9ab-112cc343f2c5',
 '40557054-f023-496a-95eb-c55ead055508',
 '3ac8949d-22a1-436a-8d4c-fe11cdf798c3',
 '68d1690b-fbcf-494c-b5f2-2be0f9635464',
 '9b107e5b-4679-

### Base retriever

Let's do a quick check that also reads the vectors back from disk.

In [28]:
# Remove the local Qdrant lock file (DB_DIR/.lock) so the store can be reopened
if os.path.exists(f"{DB_DIR}/.lock"):
    os.remove(f"{DB_DIR}/.lock")

print("Loading vector_store from disk")
client = QdrantClient(path=DB_DIR)
vector_store = QdrantVectorStore(
    client=client,
    collection_name=DB_COLLECTION_NAME,
    embedding=embeddings,
)

retriever = vector_store.as_retriever(search_kwargs={"k": 15})

Loading vector_store from disk


In [29]:
results = retriever.invoke("How can I prepare my house for a flood?")
for doc in results[0:3]:
    print("=====================================")
    print(json.dumps(doc.metadata))
    print(doc.page_content)

for Alerts
Plan with
NeighborsMake a Plan
Make Your 
Home
SaferDocument and
Insure PropertySafeguard
Documents
Know 
Evacuation
Routes Practice 
Emergency 
DrillsEXIT
Save for a
Rainy DayTest Family
Communication
Plan12 WAYS TO PREPARE
for Alerts
Plan with
NeighborsMake a Plan
Make Your 
Home
SaferDocument and
Insure PropertySafeguard
Documents
Know 
Evacuation
Routes Practice 
Emergency 
DrillsEXIT
Save for a
Rainy DayTest Family
Communication
Plan12 WAYS TO PREPARE
Make a Plan Save for a 
Rainy Day 
Plan with 
Neighbors Document and 
Insure PropertySafeguard 
Documents Sign up 
for Alerts 
Communication 
Plan EXIT
Practice 
Emergency 
Drills 
Get Involved in 
Your Community Assemble or 
Update 
Supplies Know 
Evacuation 
Routes Make Your 
Home 
Safer 12 WAYS TO PREPARE
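
The sample output above contains the same postcard text more than once. One possible cause is that the indexing cell was run more than once against the same collection, though this is an assumption that the notebook does not confirm. Below is a minimal sketch of de-duplicating retrieved chunks by their text. It uses a hypothetical `Chunk` stand-in rather than LangChain's `Document`.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Hypothetical stand-in for a retrieved document (text plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def dedupe_by_content(chunks):
    """Keep the first occurrence of each distinct page_content, preserving rank order."""
    seen = set()
    unique = []
    for chunk in chunks:
        key = " ".join(chunk.page_content.split())  # ignore whitespace-only differences
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

results = [
    Chunk("12 WAYS TO PREPARE\nMake a Plan", {"page": 0}),
    Chunk("12 WAYS TO PREPARE\nMake a Plan", {"page": 0}),
    Chunk("Know Evacuation Routes", {"page": 1}),
]
print(len(dedupe_by_content(results)))  # 2
```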


## Run STORM using our local document vectors

The function below is adapted from the STORM [examples](https://github.com/stanford-oval/storm/blob/main/examples/storm_examples/README.md).

In [30]:
def run_storm(topic):

    # Clear the lock file so the local vector store can be opened for reading
    if os.path.exists(f"{DB_DIR}/.lock"):
        os.remove(f"{DB_DIR}/.lock")

    # Initialize the language model configurations
    engine_lm_configs = STORMWikiLMConfigs()
    openai_kwargs = {
        'api_key': os.getenv("OPENAI_API_KEY"),
        'temperature': 1.0,
        'top_p': 0.9,
    }

    ModelClass = OpenAIModel if os.getenv('OPENAI_API_TYPE') == 'openai' else AzureOpenAIModel
    # If you are using Azure service, make sure the model name matches your own deployed model name.
    # The default name here is only used for demonstration and may not match your case.
    gpt_35_model_name = 'gpt-4o-mini' if os.getenv('OPENAI_API_TYPE') == 'openai' else 'gpt-35-turbo'
    gpt_4_model_name = 'gpt-4o'
    if os.getenv('OPENAI_API_TYPE') == 'azure':
        openai_kwargs['api_base'] = os.getenv('AZURE_API_BASE')
        openai_kwargs['api_version'] = os.getenv('AZURE_API_VERSION')

    # STORM is an LM system, so different components can be powered by different models.
    # For a good balance between cost and quality, you can choose a cheaper/faster model for
    # conv_simulator_lm, which is used to split queries and synthesize answers in the conversation.
    # We recommend stronger models for outline_gen_lm, which organizes the collected information,
    # and article_gen_lm, which generates sections with citations.
    conv_simulator_lm = ModelClass(model=gpt_35_model_name, max_tokens=10000, **openai_kwargs)
    question_asker_lm = ModelClass(model=gpt_35_model_name, max_tokens=10000, **openai_kwargs)
    outline_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)
    article_gen_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)
    article_polish_lm = ModelClass(model=gpt_4_model_name, max_tokens=10000, **openai_kwargs)

    engine_lm_configs.set_conv_simulator_lm(conv_simulator_lm)
    engine_lm_configs.set_question_asker_lm(question_asker_lm)
    engine_lm_configs.set_outline_gen_lm(outline_gen_lm)
    engine_lm_configs.set_article_gen_lm(article_gen_lm)
    engine_lm_configs.set_article_polish_lm(article_polish_lm)

    max_conv_turn=4
    max_perspective=3
    search_top_k=10
    max_thread_num=1
    device='cpu'
    vector_db_mode='offline'

    do_research=True
    do_generate_outline=True
    do_generate_article=True
    do_polish_article=True

    # Initialize the engine arguments
    engine_args = STORMWikiRunnerArguments(
        output_dir=STORM_OUTPUT_DIR,
        max_conv_turn=max_conv_turn,
        max_perspective=max_perspective,
        search_top_k=search_top_k,
        max_thread_num=max_thread_num,
    )

    # Set up VectorRM to retrieve information from your own data
    rm = VectorRM(
        collection_name=DB_COLLECTION_NAME,
        embedding_model=EMBEDDING_MODEL,
        device=device,
        k=search_top_k,
    )

    # Initialize the vector store. Only the offline mode (database stored on the
    # local filesystem) is handled here; the online mode would use a Qdrant server.
    if vector_db_mode == 'offline':
        rm.init_offline_vector_db(vector_store_path=DB_DIR)

    # Initialize the STORM Wiki Runner
    runner = STORMWikiRunner(engine_args, engine_lm_configs, rm)

    # run the pipeline
    runner.run(
        topic=topic,
        do_research=do_research,
        do_generate_outline=do_generate_outline,
        do_generate_article=do_generate_article,
        do_polish_article=do_polish_article,
    )
    runner.post_run()
    runner.summary()




In [31]:
run_storm("Write a detailed and comprehensive report on how should people prepare their homes and respond in the event of extreme flood events?")

sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: BAAI/bge-m3


Collection fema_docs_demo exists. Loading the collection...


knowledge_storm.interface : INFO     : run_knowledge_curation_module executed in 155.2907 seconds
knowledge_storm.interface : INFO     : run_outline_generation_module executed in 6.3131 seconds
sentence_transformers.SentenceTransformer : INFO     : Use pytorch device_name: mps
sentence_transformers.SentenceTransformer : INFO     : Load pretrained SentenceTransformer: paraphrase-MiniLM-L6-v2
knowledge_storm.interface : INFO     : run_article_generation_module executed in 36.4234 seconds
knowledge_storm.interface : INFO     : run_article_polishing_module executed in 5.3633 seconds


***** Execution time *****
run_knowledge_curation_module: 155.2907 seconds
run_outline_generation_module: 6.3131 seconds
run_article_generation_module: 36.4234 seconds
run_article_polishing_module: 5.3633 seconds
***** Token usage of language models: *****
run_knowledge_curation_module
    gpt-4o-mini: {'prompt_tokens': 46233, 'completion_tokens': 11853}
    gpt-4o: {'prompt_tokens': 0, 'completion_tokens': 0}
run_outline_generation_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 6880, 'completion_tokens': 879}
run_article_generation_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 15564, 'completion_tokens': 2507}
run_article_polishing_module
    gpt-4o-mini: {'prompt_tokens': 0, 'completion_tokens': 0}
    gpt-4o: {'prompt_tokens': 2680, 'completion_tokens': 442}
***** Number of queries of retrieval models: *****
run_knowledge_curation_module: {'VectorRM': 48}
run_outline_generation_m
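
The per-module token counts in the summary above can be totalled per model to get a feel for cost. This small sketch uses numbers copied by hand from that output. The `usage` dictionary layout is an assumption for illustration, not a structure returned by `knowledge_storm`.

```python
from collections import defaultdict

# Token counts copied from the summary output above
usage = {
    "run_knowledge_curation_module": {
        "gpt-4o-mini": {"prompt_tokens": 46233, "completion_tokens": 11853},
        "gpt-4o": {"prompt_tokens": 0, "completion_tokens": 0},
    },
    "run_outline_generation_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 6880, "completion_tokens": 879},
    },
    "run_article_generation_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 15564, "completion_tokens": 2507},
    },
    "run_article_polishing_module": {
        "gpt-4o-mini": {"prompt_tokens": 0, "completion_tokens": 0},
        "gpt-4o": {"prompt_tokens": 2680, "completion_tokens": 442},
    },
}

# Sum prompt and completion tokens per model across all modules
totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})
for module_usage in usage.values():
    for model, counts in module_usage.items():
        for kind, value in counts.items():
            totals[model][kind] += value

for model, counts in totals.items():
    print(model, counts)
# gpt-4o-mini {'prompt_tokens': 46233, 'completion_tokens': 11853}
# gpt-4o {'prompt_tokens': 25124, 'completion_tokens': 3828}
```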

In [32]:
# https://github.com/stanford-oval/storm/issues/117 Citations