This notebook will perform our stratified train/test split of the dataset, after merging the two existing datasets into one and filtering out entries in the Disaster Declaration Summaries (DDS) that do not exist in the Mission Assignments (MA).

Our initial approach used a temporal split, but a few incidentTypes are rare events with only a few occurrences (or just one), so we specifically wanted to include those in the training set. We also felt that a stratified approach would let some of the more recent entries into the training set, which would help us account for any changes in how entries are recorded over time.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split


Adjust Pandas display options.

In [2]:
pd.set_option('display.max_columns', 80)

Assign filepaths and random state to variables.

In [3]:
ma_filepath = 'mission_assignments.parquet'
dds_filepath = 'disaster_declaration_summaries.parquet'
train_filepath = 'combined_training_set_nontime.parquet'
test_filepath = 'combined_test_set_nontime.parquet'
random_state = 42

Load initial datasets.

In [4]:
df_dds = pd.read_parquet(dds_filepath)
df_ma = pd.read_parquet(ma_filepath)
print(df_dds.shape, df_ma.shape)

(68485, 28) (40340, 39)


In [5]:
df_dds['designatedIncidentTypes'].head(10)

0       R
1       R
2       R
3    None
4    None
5    None
6    None
7    None
8    None
9       R
Name: designatedIncidentTypes, dtype: object

Add conversion dictionaries for later use.

In [6]:
#dictionary to convert disaster codes to strings representing each type of disaster
disaster_dict = {'0':'Not applicable','1':'Explosion','2':'Straight-Line Winds','3':'Tidal Wave','4':'Tropical Storm',
                '5':'Winter Storm','A':'Tsunami','B':'Biological','C':'Coastal Storm','D':'Drought','E':'Earthquake',
                'F':'Flood','G':'Freezing','H':'Hurricane','I':'Terrorist','J':'Typhoon','K':'Dam/Levee Break','L':'Chemical',
                'M':'Mud/Landslide','N':'Nuclear','O':'Severe Ice Storm','P':'Fishing Losses','Q':'Crop Losses','R':'Fire',
                'S':'Snowstorm','T':'Tornado','U':'Civil Unrest', 'V':'Volcanic Eruption','W':'Severe Storm','X':'Toxic Substances',
                'Y':'Human Cause','Z':'Other', '8':'Tropical Depression'}

agencyid_dict = {'CISA':'DHS-CISA','DHSMGMT':'DHS-MGMT','USDANRCS':'USDA-NRCS','GSA-':'GSA','VA-':'VA','EPA-':'EPA','DOT-':'DOT',
                'CNCS-':'CNCS','FCC-':'FCC','DOED':'DOE','DHUD':'HUD','DOD-':'DOD','VA -':'VA','USDAOCIO':'USDA-OCIO','FPS':'DHS-FPS',
                'TSA':'DHS-TSA','ICE':'DHS-ICE','USCIS':'DHS-CIS','DLA':'DOD-DLA','CBP':'DHS-CBP','NPS':'DOI-NPS','NPPD':'DHS-CISA',
                'CDC':'HHS-CDC','USAF':'DOD-USAF','OSHA':'DOL-OSHA','DHS-MGT':'DHS-MGMT','USGS':'DOI-USGS','USCG':'DHS-USCG',
                'USDJ':'DOJ','DHS-MGA':'DHS-IA','FLETC':'DHS-FLETC','DHS-FLET':'DHS-FLETC','USFS':'USDA-FS','HHS -PSC':'HHS-PSC'}

Clean our Mission Assignments dataset before merging.

In [7]:
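# Keep original (un-amended) assignments with supportFunction <= 15 and drop 'SU' declarations.
# Rows with a missing supportFunction also fail the <= 15 comparison and are dropped here.
# .copy() gives an independent frame so later column assignments do not touch a view.
df_ma = df_ma[(df_ma['declarationType'] != 'SU') & (df_ma['maAmendNumber'] == 0) & (df_ma['supportFunction'] <= 15)].copy()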
df_ma.shape

(7044, 39)

In [8]:

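# Assign results back to the column; calling these methods with inplace=True on df[col] is deprecated
df_ma['supportFunction'] = df_ma['supportFunction'].fillna(0)

df_ma['agencyId'] = df_ma['agencyId'].replace(agencyid_dict)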

column_list_ma = ['incidentId','stt','incidentType','region','maType','maPriority','supportFunction','agencyId', 'maId',
              'declarationType', 'assistanceRequested', 'statementOfWork']

df_ma = df_ma.reindex(columns=column_list_ma)

df_ma.rename(columns={'incidentType': 'incidentTypeMA'},
             inplace=True)

df_ma.shape

(7044, 12)
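
The pandas warning about `df[col].method(value, inplace=True)` can be avoided by assigning the result back to the column. The following is a minimal sketch on a toy frame; `toy_ma` and `toy_agency_map` are hypothetical stand-ins for `df_ma` and `agencyid_dict`, and the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_ma; the values are illustrative, not taken from the dataset.
toy_ma = pd.DataFrame({
    'supportFunction': [5.0, np.nan, 8.0],
    'agencyId': ['GSA-', 'DOD-', 'HHS'],
})
toy_agency_map = {'GSA-': 'GSA', 'DOD-': 'DOD'}  # small subset in the style of agencyid_dict

# Assign the result back instead of calling the method with inplace=True
# on the intermediate column object.
toy_ma['supportFunction'] = toy_ma['supportFunction'].fillna(0)
toy_ma['agencyId'] = toy_ma['agencyId'].replace(toy_agency_map)

print(toy_ma)
```

Values that are not keys of the mapping (here `'HHS'`) pass through `replace` unchanged.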

In [9]:
df_ma['agencyId'].value_counts()

agencyId
DOD          743
GSA          485
HHS          411
EPA          360
COE-SAD      348
            ... 
DC-CSOSA       1
DOC-NTIA       1
DOC-BIS        1
USDA-OCIO      1
USDA-OCP       1
Name: count, Length: 109, dtype: int64

Data cleaning for DDS consists of keeping only the columns we need, filtering by year and declaration type, and dropping duplicate rows.

In [10]:
# Select the columns needed for analysis (reindex would fill any column missing from the source with 0)

column_list_dds = ['femaDeclarationString','state','incidentType','incidentBeginDate','fipsStateCode','region',
               'designatedIncidentTypes','declarationTitle', 'incidentId','declarationType']

df_dds = df_dds.reindex(
    columns=column_list_dds,
    fill_value=0)

# Add time information to DDS

df_dds['incidentBeginDate']=pd.to_datetime(df_dds['incidentBeginDate'])
df_dds['year'] = df_dds['incidentBeginDate'].dt.year
df_dds['month'] = df_dds['incidentBeginDate'].dt.month
df_dds['day'] = df_dds['incidentBeginDate'].dt.day

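# Keep incidents from 2012 onward (earlier ones will not have matching Mission Assignments anyway)
# and drop 'FM' declarations; .copy() avoids modifying a view of the original frame
df_dds = df_dds[(df_dds['year'] >= 2012) & (df_dds['declarationType'] != 'FM')].copy()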

print(df_dds.shape)

# Fall back to incidentType wherever designatedIncidentTypes is missing
df_dds['designatedIncidentTypes'] = df_dds['designatedIncidentTypes'].fillna(df_dds['incidentType'])

df_dds.drop_duplicates(inplace=True)
df_dds.reset_index(inplace = True,
                   drop=True)

df_dds.shape


(26041, 13)


(1123, 13)

Check for NaNs in both datasets.

In [11]:
df_dds.isna().sum(), df_ma.isna().sum()

(femaDeclarationString      0
 state                      0
 incidentType               0
 incidentBeginDate          0
 fipsStateCode              0
 region                     0
 designatedIncidentTypes    0
 declarationTitle           0
 incidentId                 0
 declarationType            0
 year                       0
 month                      0
 day                        0
 dtype: int64,
 incidentId             0
 stt                    0
 incidentTypeMA         0
 region                 0
 maType                 0
 maPriority             0
 supportFunction        0
 agencyId               0
 maId                   0
 declarationType        0
 assistanceRequested    0
 statementOfWork        0
 dtype: int64)

Set lists of column names to variables.

In [12]:
dds_column_list = df_dds.columns.to_list()
dds_column_list

['femaDeclarationString',
 'state',
 'incidentType',
 'incidentBeginDate',
 'fipsStateCode',
 'region',
 'designatedIncidentTypes',
 'declarationTitle',
 'incidentId',
 'declarationType',
 'year',
 'month',
 'day']

In [13]:
df_ma.rename(columns={'stt':'state'},inplace=True)
ma_column_list = df_ma.columns.to_list()
ma_column_list

['incidentId',
 'state',
 'incidentTypeMA',
 'region',
 'maType',
 'maPriority',
 'supportFunction',
 'agencyId',
 'maId',
 'declarationType',
 'assistanceRequested',
 'statementOfWork']

Merge our two datasets on matching columns.

In [14]:
print(df_dds['incidentId'].nunique(), df_ma['incidentId'].nunique())

662 326


In [15]:
overlapping_columns = list(set(ma_column_list).intersection(set(dds_column_list)))
overlapping_columns

['region', 'incidentId', 'declarationType', 'state']
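
Because Python sets are unordered, the order of `overlapping_columns` can differ between runs. That does not change the merge keys, but if a stable order is preferred for readability, a list comprehension is one option. This is a minimal sketch using the column lists shown above; `ma_cols`, `dds_cols`, and `overlap_ordered` are hypothetical names.

```python
# Column lists copied from the outputs above.
ma_cols = ['incidentId', 'state', 'incidentTypeMA', 'region', 'maType', 'maPriority',
           'supportFunction', 'agencyId', 'maId', 'declarationType',
           'assistanceRequested', 'statementOfWork']
dds_cols = ['femaDeclarationString', 'state', 'incidentType', 'incidentBeginDate',
            'fipsStateCode', 'region', 'designatedIncidentTypes', 'declarationTitle',
            'incidentId', 'declarationType', 'year', 'month', 'day']

# Keep the MA column order while testing membership against a set for speed.
dds_set = set(dds_cols)
overlap_ordered = [col for col in ma_cols if col in dds_set]

print(overlap_ordered)
```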

In [16]:
MA_disaster_combined=df_ma.merge(
    df_dds, 
    how='left',
    on=overlapping_columns,
    validate='m:m')

In [17]:
MA_disaster_combined.shape

(7699, 21)

In [18]:
MA_disaster_combined.drop_duplicates(inplace=True)
MA_disaster_combined.shape


(7699, 21)

Check for missing values after merging.

In [19]:
MA_disaster_combined.isna().sum()

incidentId                   0
state                        0
incidentTypeMA               0
region                       0
maType                       0
maPriority                   0
supportFunction              0
agencyId                     0
maId                         0
declarationType              0
assistanceRequested          0
statementOfWork              0
femaDeclarationString      156
incidentType               156
incidentBeginDate          156
fipsStateCode              156
designatedIncidentTypes    156
declarationTitle           156
year                       156
month                      156
day                        156
dtype: int64

We drop rows with NaNs, as these indicate a Mission Assignment that has no corresponding Disaster Declaration Summary matching on the overlapping columns.

In [20]:
MA_disaster_combined.dropna(inplace=True)
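
An alternative way to see which Mission Assignments found no matching declaration is the `indicator` option of `DataFrame.merge`, which labels each row by where its keys were found. The sketch below uses toy frames (`toy_ma`, `toy_dds`) with made-up values; it is an illustration of the idea, not the method used in this notebook.

```python
import pandas as pd

# Toy stand-ins for the two datasets; values are illustrative only.
toy_ma = pd.DataFrame({
    'incidentId': [1, 1, 2, 3],
    'state': ['FL', 'FL', 'TX', 'CA'],
    'maId': ['a', 'b', 'c', 'd'],
})
toy_dds = pd.DataFrame({
    'incidentId': [1, 2],
    'state': ['FL', 'TX'],
    'incidentType': ['Hurricane', 'Flood'],
})

# indicator=True adds a '_merge' column with 'both', 'left_only', or 'right_only'.
merged = toy_ma.merge(toy_dds, how='left', on=['incidentId', 'state'], indicator=True)

# Rows marked 'left_only' are assignments with no matching declaration.
unmatched = merged[merged['_merge'] == 'left_only']
n_unmatched = len(unmatched)

# Keep only the matched rows and drop the helper column.
matched = merged[merged['_merge'] == 'both'].drop(columns='_merge')

print(n_unmatched)
print(matched)
```

Compared with `dropna`, this removes only rows that failed to match and leaves alone rows that matched but have a genuinely missing value in some DDS column.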

Convert the codes in designatedIncidentTypes to their labels, then add each row's incidentType to the list if it is not already there.

In [21]:
MA_disaster_combined['designatedIncidentTypes'] = MA_disaster_combined['designatedIncidentTypes'].str.split(',').apply(
    lambda lst: [s.strip() for s in lst] if isinstance(lst, list) else lst).apply(
    lambda lst: [disaster_dict.get(s, s) for s in lst] if isinstance(lst, list) else lst)

MA_disaster_combined['designatedIncidentTypes'] = MA_disaster_combined.apply(
    lambda row: list(set([str(row['incidentType'])] + row['designatedIncidentTypes'])), axis=1
    )

MA_disaster_combined['designatedIncidentTypes'] = MA_disaster_combined['designatedIncidentTypes'].apply(
    lambda lst: ','.join(lst) if isinstance(lst, list) else str(lst))
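
The same transformation can be sketched as a single helper, which may be easier to test in isolation. Note one deliberate difference: `list(set(...))` returns labels in arbitrary order, whereas this variant uses `dict.fromkeys` to remove duplicates while keeping insertion order. `toy_codes` and `merge_incident_types` are hypothetical names, and the data is illustrative.

```python
import pandas as pd

# Small subset in the style of disaster_dict.
toy_codes = {'H': 'Hurricane', '4': 'Tropical Storm', 'F': 'Flood'}

def merge_incident_types(incident_type, designated):
    """Map comma-separated codes to labels and prepend incident_type, without duplicates."""
    # Values that are not known codes (for example full labels) are kept as-is.
    labels = [toy_codes.get(code.strip(), code.strip()) for code in str(designated).split(',')]
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return ','.join(dict.fromkeys([str(incident_type)] + labels))

toy = pd.DataFrame({
    'incidentType': ['Tropical Storm', 'Flood'],
    'designatedIncidentTypes': ['H, 4', 'Flood'],
})
toy['designatedIncidentTypes'] = [
    merge_incident_types(itype, designated)
    for itype, designated in zip(toy['incidentType'], toy['designatedIncidentTypes'])
]

print(toy)
```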

In [22]:
MA_disaster_combined[['incidentTypeMA','designatedIncidentTypes']].sample(30)

Unnamed: 0,incidentTypeMA,designatedIncidentTypes
365,Fire,Fire
5504,Hurricane,Hurricane
3749,Biological,Biological
4027,Earthquake,Earthquake
2424,Biological,Biological
7549,Fire,"Fire,Straight-Line Winds"
6172,Flood,Flood
6561,Hurricane,Hurricane
6645,Hurricane,Hurricane
6826,Tropical Storm,Hurricane


In [23]:
MA_disaster_combined['incidentId'].nunique()

321

Add cluster information for assistanceRequested topics (as generated in a separate notebook) by merging it with our new DataFrame.

In [24]:
df_clusters = pd.read_csv('AR_topics.csv')
df_clusters.head()

Unnamed: 0.1,Unnamed: 0,maId,assistanceRequested,AR_topic
0,0,3612EMCTDOI-USGS01,USGS Field measurements of flood-water heights...,28
1,49,4806DRFLVA02,Activate VHA OEM to NRCC ESF-8 PHMS. This is a...,23
2,50,4806DRFLVA01,Activate VA to NRCC. This is a re-issuance of ...,23
3,51,4806DRFLUSDA-FS01,Activate USFS to the NRCC. This is a re-issuan...,30
4,52,4806DRFLUSDA-APH01,"USDA liaison(s) to the NRCC, FEMA teams, or ot...",32


In [25]:
df_clusters.drop(columns=['Unnamed: 0', 'assistanceRequested'], inplace=True)

In [26]:
print(MA_disaster_combined.shape)
MA_disaster_combined = MA_disaster_combined.merge(
    df_clusters,
    how='left',
    on='maId'
)
MA_disaster_combined.shape

(7543, 21)


(7543, 22)
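
The unchanged row count above suggests that each maId appears at most once in the cluster file. That assumption can also be enforced at merge time with the `validate` argument, which raises `pandas.errors.MergeError` when the right-hand keys are not unique. A minimal sketch on toy frames (`toy_ma` and `toy_clusters` are hypothetical, with made-up values):

```python
import pandas as pd
from pandas.errors import MergeError

toy_ma = pd.DataFrame({'maId': ['a', 'b', 'c'], 'agencyId': ['GSA', 'DOD', 'EPA']})
# 'b' appears twice, so a left merge would silently duplicate that assignment.
toy_clusters = pd.DataFrame({'maId': ['a', 'b', 'b'], 'AR_topic': [28, 23, 30]})

try:
    toy_ma.merge(toy_clusters, how='left', on='maId', validate='many_to_one')
    duplicate_keys_detected = False
except MergeError:
    duplicate_keys_detected = True

# After de-duplicating the right-hand keys the same merge passes validation.
deduped = toy_clusters.drop_duplicates(subset='maId')
safe = toy_ma.merge(deduped, how='left', on='maId', validate='many_to_one')

print(duplicate_keys_detected)
print(safe)
```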

In [27]:
MA_disaster_combined.head()

Unnamed: 0,incidentId,state,incidentTypeMA,region,maType,maPriority,supportFunction,agencyId,maId,declarationType,assistanceRequested,statementOfWork,femaDeclarationString,incidentType,incidentBeginDate,fipsStateCode,designatedIncidentTypes,declarationTitle,year,month,day,AR_topic
0,2024081901,CT,Severe Storm,1,FOS,High,5.0,DOI-USGS,3612EMCTDOI-USGS01,EM,USGS Field measurements of flood-water heights...,"As directed by and in coordination with FEMA, ...",EM-3612-CT,Severe Storm,2024-08-18 00:00:00+00:00,9.0,"Severe Storm,Flood,Mud/Landslide","SEVERE STORMS, FLOODING, LANDSLIDES, AND MUDSL...",2024.0,8.0,18.0,28
1,2024072801,FL,Tropical Storm,4,FOS,Normal,8.0,VA,4806DRFLVA02,DR,Activate VHA OEM to NRCC ESF-8 PHMS. This is a...,As directed by and in coordination with FEMA a...,DR-4806-FL,Tropical Storm,2024-08-01 00:00:00+00:00,12.0,"Hurricane,Tropical Storm",HURRICANE DEBBY,2024.0,8.0,1.0,23
2,2024072801,FL,Tropical Storm,4,FOS,Normal,0.0,VA,4806DRFLVA01,DR,Activate VA to NRCC. This is a re-issuance of ...,"As directed by and in coordination with FEMA, ...",DR-4806-FL,Tropical Storm,2024-08-01 00:00:00+00:00,12.0,"Hurricane,Tropical Storm",HURRICANE DEBBY,2024.0,8.0,1.0,23
3,2024072801,FL,Tropical Storm,4,FOS,Normal,4.0,USDA-FS,4806DRFLUSDA-FS01,DR,Activate USFS to the NRCC. This is a re-issuan...,"As directed by and in coordination with FEMA, ...",DR-4806-FL,Tropical Storm,2024-08-01 00:00:00+00:00,12.0,"Hurricane,Tropical Storm",HURRICANE DEBBY,2024.0,8.0,1.0,30
4,2024072801,FL,Tropical Storm,4,FOS,Normal,11.0,USDA-APH,4806DRFLUSDA-APH01,DR,"USDA liaison(s) to the NRCC, FEMA teams, or ot...","As directed by and in coordination with FEMA, ...",DR-4806-FL,Tropical Storm,2024-08-01 00:00:00+00:00,12.0,"Hurricane,Tropical Storm",HURRICANE DEBBY,2024.0,8.0,1.0,32


In [28]:
MA_disaster_combined.isna().sum()

incidentId                 0
state                      0
incidentTypeMA             0
region                     0
maType                     0
maPriority                 0
supportFunction            0
agencyId                   0
maId                       0
declarationType            0
assistanceRequested        0
statementOfWork            0
femaDeclarationString      0
incidentType               0
incidentBeginDate          0
fipsStateCode              0
designatedIncidentTypes    0
declarationTitle           0
year                       0
month                      0
day                        0
AR_topic                   0
dtype: int64

We use a stratified shuffle split rather than a temporal split for two reasons. First, it lets us keep all rows that share an incidentId together, since incidentId is our grouping column. This reduces the risk of data leakage, because some incident IDs may have been used across multiple years. Given how we handled years this should not be an issue, but it remains possible. Second, it lets us force rare events (incident types that occur for only one incident) into the training set. We cannot test on these events, but they could come up again in the future, and having them in the training set minimizes the chance of 'incident not seen so it is being ignored' errors moving forward.

In [29]:
split_info = StratifiedShuffleSplit(n_splits=1,
                                    test_size=.2,
                                    random_state=random_state)

grouping_col = 'incidentId'
stratifying_col = 'incidentType'

temp_df = MA_disaster_combined.drop_duplicates(subset=[grouping_col]).copy()
group_stratify_map = temp_df[stratifying_col]

group_counts = group_stratify_map.value_counts()
rare_stratify_values = group_counts[group_counts == 1].index.tolist()

forced_train_group_ids = temp_df[temp_df[stratifying_col].isin(rare_stratify_values)][grouping_col].values
safe_groups_df = temp_df[~temp_df[stratifying_col].isin(rare_stratify_values)]
safe_group_ids = safe_groups_df[grouping_col].values
safe_stratify_map = safe_groups_df[stratifying_col].values

train_safe_group_ids, test_group_ids, _, _ = train_test_split(
    safe_group_ids,
    safe_stratify_map,
    test_size=0.2,
    # Reuse the notebook-level seed defined earlier instead of a hard-coded value
    random_state=random_state,
    # This split is now safe because all classes in safe_stratify_map have count >= 2
    stratify=safe_stratify_map
)

# Combine the forced rare groups with the safely split training groups
final_train_group_ids = np.concatenate([train_safe_group_ids, forced_train_group_ids])

# Apply the final masks to the original full DataFrame
train_mask = MA_disaster_combined[grouping_col].isin(final_train_group_ids)
test_mask = MA_disaster_combined[grouping_col].isin(test_group_ids)

df_train = MA_disaster_combined[train_mask]
df_test = MA_disaster_combined[test_mask]

print(f"Total groups in Train set: {len(final_train_group_ids)}")
print(f"Total groups in Test set: {len(test_group_ids)}")
print(f"Number of forced rare groups: {len(forced_train_group_ids)}")


Total groups in Train set: 257
Total groups in Test set: 64
Number of forced rare groups: 4
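
The split logic can be wrapped in a function and checked on toy data to confirm that no group lands in both sets and that single-incident types end up in training. This is a sketch under the same assumptions as the cell above; `grouped_stratified_split` is a hypothetical helper name and the toy frame is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def grouped_stratified_split(df, group_col, strat_col, test_size=0.2, seed=42):
    """Split whole groups, stratified by strat_col, forcing single-group classes into train."""
    # One row per group; the first row's strat_col value represents the group.
    groups = df.drop_duplicates(subset=[group_col])
    counts = groups[strat_col].value_counts()
    rare = counts[counts == 1].index
    is_rare = groups[strat_col].isin(rare)
    forced_train = groups.loc[is_rare, group_col].to_numpy()
    safe = groups[~is_rare]
    train_ids, test_ids = train_test_split(
        safe[group_col].to_numpy(),
        test_size=test_size,
        random_state=seed,
        stratify=safe[strat_col].to_numpy(),
    )
    train_ids = np.concatenate([train_ids, forced_train])
    return df[df[group_col].isin(train_ids)], df[df[group_col].isin(test_ids)]

# Toy data: 10 hurricane incidents, 10 flood incidents, 1 volcanic incident, 3 rows each.
toy = pd.DataFrame({
    'incidentId': np.repeat(np.arange(21), 3),
    'incidentType': np.repeat(['Hurricane'] * 10 + ['Flood'] * 10 + ['Volcanic Eruption'], 3),
})
toy_train, toy_test = grouped_stratified_split(toy, 'incidentId', 'incidentType')

# No incidentId should appear on both sides of the split.
shared_ids = set(toy_train['incidentId']) & set(toy_test['incidentId'])
print(len(shared_ids), toy_train.shape, toy_test.shape)
```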


In [30]:
print(df_train.shape, df_test.shape)

(6765, 22) (778, 22)


In [31]:
print(df_train['incidentType'].value_counts(), df_test['incidentType'].value_counts())

incidentType
Hurricane            2702
Biological           1930
Tropical Storm        465
Flood                 462
Fire                  437
Severe Storm          350
Typhoon               115
Tornado                74
Other                  50
Severe Ice Storm       40
Coastal Storm          36
Mud/Landslide          25
Volcanic Eruption      25
Earthquake             22
Dam/Levee Break        12
Chemical                7
Snowstorm               5
Winter Storm            4
Terrorist               4
Name: count, dtype: int64 incidentType
Hurricane           485
Flood                92
Earthquake           57
Fire                 56
Severe Storm         50
Tornado              12
Winter Storm          6
Other                 5
Severe Ice Storm      5
Snowstorm             4
Tropical Storm        4
Typhoon               2
Name: count, dtype: int64


Write the train and test datasets to parquet files for later use.

In [32]:
df_train.to_parquet(train_filepath)
df_test.to_parquet(test_filepath)