<a href="https://colab.research.google.com/github/garrett-vangilder/CS5262/blob/week3-task-three/fema_declarations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Future Frequency of Disasters by State

## Background

Extreme weather events are becoming more likely as the effects of climate change are realized. This can be seen in the plethora of wildfires throughout the West. The Federal Emergency Management Agency (FEMA) does not declare every extreme weather event a disaster, but through a workflow specified by the Stafford Act, state and federal leaders can work together to determine whether federal aid is necessary. The aid available is broad. It can include financial assistance to individuals or small businesses, intended to pay for temporary housing or repairs.

Is it possible to use [historical FEMA declarations](https://www.kaggle.com/datasets/headsortails/us-natural-disaster-declarations) to determine the change in frequency of catastrophic weather events for a given state? Are catastrophic weather events becoming more prevalent?

## Project Description

Using historical FEMA disaster records, researchers can estimate the likelihood of future disasters and disaster types when given a date and a state. This could be useful in analyzing changes to local and federal funding for extreme weather events. This research can potentially be used as a proxy for the financial repercussions of climate change. When paired with a municipality's budget and other financial records, this data could help decide whether funds are over- or under-provisioned for disaster relief. The federal government could extrapolate from this data to better inform federal funding. There may even be a case for private real estate development firms to use this model when assessing the viability of significant projects at a given location. If developers know that volatile weather is highly likely during a given range of months, they may decide to start their project in a different month to avoid delays to their timeline.

The overall goal of this model is to serve as a predictor of severe weather volume for each state.

## Performance Metrics

We will evaluate this regression model with the mean squared error metric. The model will be considered a success if its prediction is within +/- 2 events for most states. To validate the model, we will train on all events before 2022. The test dataset will be the complete dataset, which means it also includes the 2022 disasters.
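The two criteria above can be sketched in a few lines. This is only an illustration: the `actual` and `predicted` arrays are hypothetical per-state counts, not values from the dataset, and reading "most states" as a strict majority is an assumption.

```python
import numpy as np

# Hypothetical per-state event counts, for illustration only
# (these numbers are NOT taken from the FEMA dataset).
actual = np.array([12, 3, 7, 0, 5])
predicted = np.array([10, 4, 7, 3, 5])

errors = predicted - actual

# Mean squared error: the average of the squared per-state errors.
mse = float(np.mean(errors ** 2))

# Share of states whose prediction lands within +/- 2 events.
within_two = np.abs(errors) <= 2
share_within_two = float(within_two.mean())

# Assumption: "most states" is interpreted as more than half of them.
model_succeeds = share_within_two > 0.5

print(f"MSE: {mse:.2f}, share within +/- 2 events: {share_within_two:.0%}")
```

Note that a single large miss can dominate the mean squared error even when most states fall inside the +/- 2 band, so the two numbers are worth reporting together.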

## Exploratory Data Analysis

Through my initial exploratory data analysis, I am looking to answer the following questions to understand my dataset better.

1. Are all states represented, and if so which state is most prone to disasters?

2. Are disasters becoming more or less common? How many disasters are there each year? Are there any outliers?

3. What are the most common disaster types?

4. Are there any correlated variables in my dataset? 

In [1]:
# Load common libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt

### Load Data

We will ingest our dataset and the data dictionary.

The data was initially sourced from FEMA; it has since been made public via [Kaggle](https://www.kaggle.com/datasets/headsortails/us-natural-disaster-declarations).

We will source our data from a clone located in the corresponding [GitHub repo](https://github.com/garrett-vangilder/CS5262).

In [2]:
data_dictionary = pd.read_csv('https://raw.githubusercontent.com/garrett-vangilder/CS5262/main/data_dictionary.csv')

disasters = pd.read_csv('https://raw.githubusercontent.com/garrett-vangilder/CS5262/main/us_disaster_declarations.csv')
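After loading, it may help to confirm that the date column is parsed as a datetime and that no values are missing before doing any year-based analysis. The sketch below uses a small in-memory CSV so it runs without network access. The rows and identifier values are hypothetical, and the plain `YYYY-MM-DD` date format is an assumption. Only the column names come from the data dictionary.

```python
import io

import pandas as pd

# Hypothetical rows standing in for us_disaster_declarations.csv.
# Only the column names are taken from the data dictionary; the values are made up.
sample_csv = io.StringIO(
    "fema_declaration_string,state,declaration_date,incident_type\n"
    "DR-EXAMPLE-1,TX,2021-02-19,Severe Ice Storm\n"
    "DR-EXAMPLE-2,CA,2020-08-22,Fire\n"
    "EM-EXAMPLE-3,FL,2022-09-24,Hurricane\n"
)
sample = pd.read_csv(sample_csv)

# Convert the declaration date from text to a proper datetime column.
sample["declaration_date"] = pd.to_datetime(sample["declaration_date"])

# Derive the calendar year, which is useful for per-year counts later on.
sample["declaration_year"] = sample["declaration_date"].dt.year

# Basic sanity checks: column types and missing values per column.
print(sample.dtypes)
print(sample.isna().sum())
```

The same three steps could be applied to the `disasters` frame once its actual date format has been confirmed.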


#### Data Dictionary

In [3]:
display(data_dictionary)

| Unnamed: 0 | feature | description | type |
|---|---|---|---|
| 0 | fema_declaration_string | codified identifier for disaster - declaration... | string |
| 1 | disaster_number | an incremented value used to designate an event | integer |
| 2 | state | US state or territory - formatted as XX | string |
| 3 | declaration_type | DR("major disaster") or EM("emergency manageme... | string |
| 4 | declaration_date | date of disaster declaration formatted YYYY-MM... | datetime |
| 5 | fy_declared | fiscal year of declaration formatted YYYY | integer |
| 6 | incident_type | classification of incident type example: "Floo... | integer |
| 7 | declaration_title | generic identifier for incident typically huma... | string |
| 8 | ih_program_declared | boolean value denotes if the "Individual and H... | boolean |
| 9 | ia_program_declared | boolean value denotes if the "Individual Assis... | boolean |


#### 1. Are all states represented, and if so which state is most prone to disasters?

In [43]:
disasters.groupby('state').fema_declaration_string.nunique().sort_values(ascending=False)

state
TX    371
CA    357
OK    218
WA    191
FL    168
OR    140
NM    111
NY    107
AZ    106
LA    101
CO    101
NV    100
AL     99
MT     98
MS     90
SD     87
KY     86
TN     83
KS     82
AR     79
MO     77
MN     77
AK     77
NE     76
WV     75
NC     73
IA     73
GA     72
VA     72
ND     67
IL     64
PA     63
ME     63
HI     61
OH     59
NJ     57
NH     57
MA     56
WI     54
ID     53
VT     52
UT     51
IN     51
PR     46
MI     43
CT     39
SC     38
WY     38
MD     37
VI     30
RI     27
DE     25
DC     23
MP     23
FM     21
GU     19
AS     16
MH      7
PW      1
Name: fema_declaration_string, dtype: int64
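The cell above counts unique declaration identifiers per state rather than rows. A minimal sketch of why that distinction can matter, assuming the file may hold more than one row for the same declaration (for example, one row per affected area), which this notebook has not verified. The `toy` frame and its identifiers are hypothetical.

```python
import pandas as pd

# Hypothetical frame in which one declaration appears on several rows.
toy = pd.DataFrame(
    {
        "state": ["TX", "TX", "TX", "CA", "CA"],
        "fema_declaration_string": ["DR-A", "DR-A", "DR-B", "DR-C", "DR-C"],
    }
)

# Counting rows treats every repeated row as a separate event.
rows_per_state = toy.groupby("state").size()

# Counting unique identifiers treats each declaration as one event.
declarations_per_state = toy.groupby("state")["fema_declaration_string"].nunique()

print(rows_per_state)
print(declarations_per_state)
```

If the two counts differ on the real data, the unique-identifier count is the one that matches the notion of "number of declarations" used in this section.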

In [46]:
disasters['state'].nunique()

59
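The 59 groups are more than the 50 states, so it is worth checking which codes are states and which are not. The sketch below compares the codes printed in the output above against the 50 state postal abbreviations. The `observed` set is copied from that output so the example is self-contained. On the live data it could instead be built with `set(disasters['state'].unique())`.

```python
# Postal abbreviations of the 50 US states.
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY".split()
)

# Codes copied from the groupby output above.
observed = set(
    "TX CA OK WA FL OR NM NY AZ LA CO NV AL MT MS SD KY TN KS AR "
    "MO MN AK NE WV NC IA GA VA ND IL PA ME HI OH NJ NH MA WI ID "
    "VT UT IN PR MI CT SC WY MD VI RI DE DC MP FM GU AS MH PW".split()
)

# States with no declarations at all (an empty set means all are represented).
missing_states = US_STATES - observed

# Codes that are not one of the 50 states.
non_state_codes = sorted(observed - US_STATES)

print(f"Missing states: {sorted(missing_states)}")
print(f"Non-state codes: {non_state_codes}")
```

With the codes shown above, all 50 states are represented, and the remaining 9 codes are the District of Columbia plus territories and freely associated states. Whether to keep those in a per-state model is a modeling choice this section leaves open.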