# Introduction

This notebook processes PDF files downloaded from FEMA with OpenAI's GPT-4 to create a bot for answering questions about preparing for various disasters.

# Setup
## Environment
This notebook runs on Python 3.9 with the package versions listed in `environment.yml`. A miniconda environment is supplied, which you can set up as follows:

1. Install [miniconda](https://docs.conda.io/en/latest/miniconda.html) by selecting the installer that fits your OS version. Once it is installed, you may have to restart your terminal (close it and open it again).
2. Open a terminal in this directory.
3. `conda env create -f environment.yml`
4. `conda activate stay_safe_bot`

## Credentials

1. `cp .env.example .env`
2. Edit `.env` and set your API keys.
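
Before running the notebook, it can help to confirm that the keys are actually visible to Python. The sketch below is a minimal check, assuming the key is named `OPENAI_API_KEY` (the environment variable that LangChain's OpenAI wrappers read by default). `missing_keys` and `REQUIRED_KEYS` are hypothetical names for illustration and are not part of this repository.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]

# Assumption: the only key needed is OPENAI_API_KEY; adjust to match your .env file.
REQUIRED_KEYS = ["OPENAI_API_KEY"]

if __name__ == "__main__":
    missing = missing_keys(REQUIRED_KEYS)
    if missing:
        print(f"Missing keys: {', '.join(missing)}")
    else:
        print("All required keys are set.")
```

Run the check after `load_dotenv()` so that values from `.env` are included.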

## Data

PDF files were downloaded from FEMA, as noted in the table below, and saved into the folder `./docs_data`.

**Note:** The data is intentionally not downloaded automatically.

# Analysis

In [62]:
import os
import sys
import shutil
from langchain.document_loaders import PyPDFDirectoryLoader
from langchain.document_loaders import PyPDFLoader 
from langchain.embeddings import OpenAIEmbeddings 
from langchain.vectorstores import Chroma 
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
import json
import re

from dotenv import load_dotenv

# A little mod to enable using memory *and* getting docs. See: https://github.com/langchain-ai/langchain/issues/2256#issuecomment-1665188576
import langchain
from typing import Dict, Any, Tuple
from langchain.memory.utils import get_prompt_input_key
def _get_input_output(
    self, inputs: Dict[str, Any], outputs: Dict[str, str]
) -> Tuple[str, str]:
    if self.input_key is None:
        prompt_input_key = get_prompt_input_key(inputs, self.memory_variables)
    else:
        prompt_input_key = self.input_key
    if self.output_key is None:
        output_key = list(outputs.keys())[0]
    else:
        output_key = self.output_key
    return inputs[prompt_input_key], outputs[output_key]
  
langchain.memory.chat_memory.BaseChatMemory._get_input_output = _get_input_output

def setup_model(vecs_dir, docs_sublist, all_docs, prefix_file_name_to_chunks=False, temperature=0.0,
                extra_prefix=''):
    """Build a conversational retrieval chain over the PDFs named in docs_sublist."""

    # Subset for docs we are interested in
    docs = []
    for d in all_docs:
        source = d.metadata['source']
        if os.path.basename(source) in docs_sublist:
            # Skip near-empty pages (20 characters or fewer)
            if len(d.page_content) > 20:
                # Add file name to content for more context
                if prefix_file_name_to_chunks:
                    file_clean = re.sub(r'\.pdf$', '', os.path.basename(source))
                    file_clean = re.sub(r'[-_]', ' ', file_clean)
                    prefix = f"{extra_prefix} {file_clean}: "
                    # all_docs is modified in place, so avoid prefixing twice on repeated calls
                    if not d.page_content.startswith(prefix):
                        d.page_content = prefix + d.page_content
                docs.append(d)

    # Create vector DB directory
    if os.path.exists(vecs_dir):
        shutil.rmtree(vecs_dir)
    os.makedirs(vecs_dir)

    embedding_model = OpenAIEmbeddings()
    #chat_model = OpenAI(temperature=temperature)
    chat_model = ChatOpenAI(temperature=temperature, model_name="gpt-4")

    # Calculate embeddings and persist them to the vector DB directory
    vectordb = Chroma.from_documents(docs, embedding=embedding_model, persist_directory=vecs_dir)
    vectordb.persist()

    # Set up chat with memory; input_key/output_key tell the memory which fields to store
    memory = ConversationBufferMemory(memory_key="chat_history", input_key='question', output_key='answer',
                                      return_messages=True)
    pdf_qa = ConversationalRetrievalChain.from_llm(chat_model, vectordb.as_retriever(), memory=memory,
                                                   return_source_documents=True)
    return pdf_qa


def ask_question(query, qa):
    result = qa({"question": query})
    print(f"Question: \n{query}")
    print(f"\nAnswer:\n{result['answer']}")
    for doc in result['source_documents']:
        print('\n')
        print(json.dumps(vars(doc), indent=4))

# This will load API keys as defined in .env file
load_dotenv()

pdf_folder_path = './docs_data'

## Read All Our PDF Documents

In [68]:

files = os.listdir(pdf_folder_path)
files.sort()
all_docs_list = []
for file in files:
    if file.endswith('.pdf'):
        print(file)
        all_docs_list.append(file)
loader = PyPDFDirectoryLoader(pdf_folder_path)
all_docs = loader.load()

print(pdf_folder_path)
print(len(all_docs))

cfpb_adult-fin-edyour-disaster-checklist.pdf
fema_protect-your-home_flooding.pdf
fema_protect-your-property-storm-surge.pdf
fema_protect-your-property_coastal-erosion.pdf
fema_protect-your-property_earthquakes.pdf
fema_protect-your-property_severe-wind.pdf
fema_protect-your-property_wildfire.pdf
fema_proteja-su-propiedad-erosion-costera_2023.pdf
fema_proteja-su-propiedad-incendios-forestales_2023.pdf
fema_proteja-su-propiedad-inundaciones_2023.pdf
fema_proteja-su-propiedad-marejada-ciclonica_2023.pdf
fema_proteja-su-propiedad-terremotos_2023.pdf
fema_proteja-su-propiedad-vientos-fuertes_2023.pdf
fema_safeguard-critical-documents-and-valuables.pdf
fema_scenario_1-active_shooter-01102020.pdf
fema_scenario_10_power_outage_01102020.pdf
fema_scenario_10_power_outage_answer_key_01102020.pdf
fema_scenario_11_winter_storm_01102020.pdf
fema_scenario_11_winter_storm_answer_key_01102020.pdf
fema_scenario_12_small_business_01102020.pdf
fema_scenario_12_small_business_answer_key_01102020.pdf
fema_s

## Test using only a single document

Before indexing all the documents, it's useful to run a controlled test on a small body of content: a single document. The embedding process still splits (chunks) it into parts and the LangChain pattern works the same way, but with only one document it is easier to judge how well retrieval performs.
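
To make the retrieval step more concrete, here is a toy sketch of what a retriever does with the chunks: score each chunk's embedding against the query embedding and keep only the top `k`. This is not the notebook's actual code path. The 2-D vectors are made-up stand-ins for real embeddings, cosine similarity is used purely for illustration (the real metric depends on how the vector store is configured), and `k=4` reflects what I understand to be LangChain's default number of returned chunks.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vec, chunk_vecs, k=4):
    """Return the indices of the k chunks most similar to the query, best first."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-D "embeddings" (made-up numbers) standing in for real embedding vectors.
chunks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [0.2, 0.9]]
query = [1.0, 0.0]

print(top_k_chunks(query, chunks, k=2))  # prints [0, 1]
```

Because only `k` chunks are passed to the model, any chunk that ranks lower never reaches the answer, however relevant it is.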

In [64]:
vecs_dir = './vector_dbs/one_flood_doc'
docs = all_docs

# Subset to one document
docs_sublist = ['fema_protect-your-home_flooding.pdf']

pdf_qa = setup_model(vecs_dir, docs_sublist, docs, prefix_file_name_to_chunks=False)

## Ask 'How do I prepare my home for floods?' using one PDF document as the source

In [66]:
ask_question("How do I prepare my home for floods?", pdf_qa)

Question: 
How do I prepare my home for floods?

Answer:
There are several steps you can take to prepare your home for floods:

1. Determine the Base Flood Elevation (BFE) for your home. This is how high the water is expected to rise during flooding in high risk areas. Your local floodplain manager can help you find this information.

2. Direct water away from structures. Make sure your yard slopes away from buildings on your property and that water has a place to drain. Clear your gutters, assess drainage issues, or collect water in rain barrels.

3. Anchor fuel tanks to prevent them from tipping over or floating in a flood. 

4. Floodproof walls by adding water-resistant exterior sheathing and sealing them to prevent shallow flooding from damaging your home.

5. Secure manufactured homes to a permanent foundation so that the wheels and axles do not support its weight and resist flotation, collapse, or side-to-side movement.

6. Document all of your belongings to help with the insuran

The answer *looks* great and brilliantly summarizes the chunks that were returned by the retrieval. However, compare it with the single source document [https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf](https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf) and notice that page 4 is missing from the retrieved documents above. This PDF is a short document where *everything* in it is relevant to flood preparation, so losing a page is significant.
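
Spotting a missing page by eye gets tedious, so a small helper can do the comparison. The sketch below assumes each retrieved document's metadata carries a 0-based `page` entry, which is what the PDF loader appears to record; `missing_pages` is a hypothetical helper and the example metadata is made up.

```python
def missing_pages(retrieved_metadata, total_pages):
    """Return the 1-based page numbers that do not appear in the retrieved chunks.

    Assumes each metadata dict has a 0-based 'page' entry.
    """
    seen = {meta["page"] + 1 for meta in retrieved_metadata if "page" in meta}
    return [page for page in range(1, total_pages + 1) if page not in seen]

# Made-up metadata for a hypothetical 6-page PDF, for illustration only.
example = [
    {"source": "docs_data/example.pdf", "page": 0},
    {"source": "docs_data/example.pdf", "page": 2},
    {"source": "docs_data/example.pdf", "page": 3},
    {"source": "docs_data/example.pdf", "page": 5},
]

print(missing_pages(example, total_pages=6))  # prints [2, 5]
```

In the notebook, the metadata would come from each entry in `result['source_documents']`.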

It goes to show that blindly adopting LLM patterns found on the web may give amazing-looking results, but work is needed to make them truly useful.

## Adding filename context to text chunks

What if we give our text chunks a little more context? A really crude approach is to prefix each chunk with 'This snippet relates to ' and the filename, so that every chunk in the above document has at least something to indicate it relates to flood protection. This concept could be extended to include other metadata about the documents, but let's try the crude approach first ...
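
As a standalone illustration of the idea, the sketch below turns a file path into a readable prefix and prepends it to a chunk. `prefix_chunk` is a hypothetical helper that mirrors the spirit of the `prefix_file_name_to_chunks` option rather than reproducing `setup_model` exactly, and the chunk text is a placeholder.

```python
import os
import re

def prefix_chunk(source_path, text, extra_prefix="This snippet relates to "):
    """Prefix a chunk with a readable version of its file name."""
    name = os.path.basename(source_path)   # drop the folder
    name = re.sub(r"\.pdf$", "", name)     # drop the extension
    name = re.sub(r"[-_]", " ", name)      # hyphens and underscores become spaces
    return f"{extra_prefix}{name}: {text}"

# Placeholder chunk text, for illustration only.
print(prefix_chunk("docs_data/fema_protect-your-home_flooding.pdf", "Example chunk text."))
# prints: This snippet relates to fema protect your home flooding: Example chunk text.
```

The prefix becomes part of the text that is embedded, so every chunk from the flooding PDF now shares some wording with a question about floods.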

In [67]:
docs = all_docs
# Note the argument prefix_file_name_to_chunks=True
pdf_qa = setup_model(vecs_dir, docs_sublist, docs, prefix_file_name_to_chunks=True, extra_prefix="This snippet relates to ")
ask_question("How do I prepare my home for floods?", pdf_qa)

Question: 
How do I prepare my home for floods?

Answer:
There are several steps you can take to prepare your home for floods:

1. Direct Water Away from Structures: Make sure your yard slopes away from buildings on your property and that water has a place to drain. Clear your gutters, assess drainage issues, or collect water in rain barrels.

2. Anchor Fuel Tanks: Anchor any fuel tanks to the pad to prevent them from tipping over or floating in a flood. 

3. Floodproof Walls: Add water-resistant exterior sheathing on walls and seal them to prevent shallow flooding from damaging your home. 

4. Secure Manufactured Homes: If you have a manufactured home, it must be affixed to a permanent foundation so that the wheels and axles do not support its weight and resist flotation, collapse, or side-to-side movement.

5. Elevate Your Home: Elevating your home prepares your property against floods and lowers flood insurance premiums. 

6. Secure Yard Items: Unsecure items can be swept away or da

That does now seem to have captured the important pages from this PDF, and summarized them nicely.

## Ask 'How do I prepare my home for floods?' using all PDF documents as the source

Now that we know we can surface appropriate content from one document, what happens if we add more documents to our library?
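
With more documents in the index, chunks from different files compete for the same few retrieval slots. One way to see which files contributed to an answer is to tally the retrieved chunks by source file. The sketch below assumes each chunk's metadata has a `source` path; `chunks_per_source` is a hypothetical helper and the example metadata is made up.

```python
from collections import Counter
import os

def chunks_per_source(retrieved_metadata):
    """Count retrieved chunks per source file name."""
    return Counter(os.path.basename(meta["source"]) for meta in retrieved_metadata)

# Made-up metadata, for illustration only.
example = [
    {"source": "docs_data/a.pdf", "page": 1},
    {"source": "docs_data/a.pdf", "page": 2},
    {"source": "docs_data/b.pdf", "page": 0},
]

for name, count in chunks_per_source(example).most_common():
    print(f"{name}: {count}")
```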

In [69]:
vecs_dir = './vector_dbs/all_docs'
docs = all_docs
# Note the argument prefix_file_name_to_chunks=True
pdf_qa = setup_model(vecs_dir, all_docs_list, docs, prefix_file_name_to_chunks=True, extra_prefix="This snippet relates to ")
ask_question("How do I prepare my home for floods?", pdf_qa)

Question: 
How do I prepare my home for floods?

Answer:
There are several steps you can take to prepare your home for floods:

1. Create an emergency plan for your family and practice it regularly. When a storm is approaching, evacuate and move your car to higher ground.

2. Purchase flood insurance for your home and its contents, even if you do not live in a high-risk flood zone.

3. Document your belongings. This will help with the insurance process if you need to file a claim.

4. Store valuables and important documents above the Base Flood Elevation (BFE) in waterproof or water-resistant containers. 

5. Elevate appliances and utilities such as water heaters, washers, dryers, and electric panels on higher floors to prevent them from getting damaged by flood water.

6. Use flood-resistant materials for insulation, drywall, and floor coverings like tile to minimize damage.

7. Make sure your yard slopes away from buildings on your property and that water has a place to drain. 

8. A

That's done a great job: it surfaced our three key articles from [https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf](https://www.fema.gov/sites/default/files/2020-11/fema_protect-your-home_flooding.pdf), plus an article from 