# Checkpoint Three: Cleaning Data

Now you are ready to clean your data. Before starting coding, provide the link to your dataset below.

My dataset: https://www.kaggle.com/datasets/headsortails/us-natural-disaster-declarations

Import the necessary libraries and create your dataframe(s).

In [1]:
import numpy as np
import pandas as pd

In [4]:
df = pd.read_csv("./US Natural Disaster Declarations/us_disaster_declarations.csv")

## Missing Data

Test your dataset for missing data and handle it as needed. Make notes in the form of code comments as to your thought process.

In [5]:
print("Amount of missing data:")
for col in df.columns:
    print(f"[{col}]: " + str(int(100*df[col].isna().sum()/len(df))) + "%")

Amount of missing data:
[fema_declaration_string]: 0%
[disaster_number]: 0%
[state]: 0%
[declaration_type]: 0%
[declaration_date]: 0%
[fy_declared]: 0%
[incident_type]: 0%
[declaration_title]: 0%
[ih_program_declared]: 0%
[ia_program_declared]: 0%
[pa_program_declared]: 0%
[hm_program_declared]: 0%
[incident_begin_date]: 0%
[incident_end_date]: 0%
[disaster_closeout_date]: 23%
[tribal_request]: 0%
[fips]: 0%
[place_code]: 0%
[designated_area]: 0%
[declaration_request_number]: 0%
[last_ia_filing_date]: 71%
[incident_id]: 0%
[region]: 0%
[designated_incident_types]: 69%
[last_refresh]: 0%
[hash]: 0%
[id]: 0%


In [None]:
# Fortunately this data is mostly clean. The fields with missing values
#  (disaster_closeout_date, last_ia_filing_date, designated_incident_types)
#  mostly capture administrative and disaster-response details, which are
#  not relevant to this analysis.
# The one field that might have been useful, 'designated_incident_types',
#  is largely redundant given the other classification columns. A quick
#  check of the source confirms this, so it can be dropped:
# https://www.kaggle.com/datasets/headsortails/us-natural-disaster-declarations
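The per-column check above can also be written without a loop. The following is a minimal sketch on a toy frame (the values and the 50% threshold are assumptions for illustration, not part of the original analysis):

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "incident_type": ["Flood", "Fire", "Hurricane", "Flood"],
    "designated_incident_types": [None, "R", None, None],
    "last_ia_filing_date": [None, None, None, "2004-07-09T00:00:00Z"],
})

# isna().mean() gives the fraction of missing values per column.
missing_pct = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)

# Hypothetical threshold: treat columns that are more than half empty as drop candidates.
mostly_empty = missing_pct[missing_pct > 50].index.tolist()
toy_clean = toy.drop(columns=mostly_empty)
```

A threshold like this only produces candidates; whether a sparse column is actually redundant still needs the kind of source check described in the comments above.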

## Irregular Data

Detect outliers in your dataset and handle them as needed. Use code comments to make notes about your thought process.

In [None]:
# Drop Hurricane Katrina. It is too extreme an outlier: the circumstances were
#  unusually disastrous, and it ends up skewing the rest of the data.
df = df[
    ~df['declaration_title'].str.contains('Hurricane Katrina')
    ]
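Before dropping rows by title, it may help to count how many rows the filter matches. This is a small sketch on toy titles (the rows are made up for illustration); `na=False` is added here so that a missing title would count as a non-match rather than producing a non-boolean mask:

```python
import pandas as pd

toy = pd.DataFrame({
    "declaration_title": [
        "Hurricane Katrina",
        "Hurricane Katrina Evacuation",
        "Severe Storms And Flooding",
        None,  # a missing title, to show why na=False matters
    ],
})

# na=False treats missing titles as non-matches, so the mask stays boolean.
is_katrina = toy["declaration_title"].str.contains("Hurricane Katrina", na=False)

print(f"Rows matching: {is_katrina.sum()} of {len(toy)}")
toy_no_katrina = toy[~is_katrina]
```

Note that a substring match also removes any title that merely contains the phrase, such as the second toy row.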

## Unnecessary Data

Look for the different types of unnecessary data in your dataset and address it as needed. Make sure to use code comments to illustrate your thought process.

In [14]:
unnecessary_fields = (
    'fema_declaration_string', # Administrative. Drop it.
    'disaster_number', # Organizational/Administrative. Drop it.
    'declaration_date', # Would only be useful if we were looking at administrative-level disaster responses.
    'fy_declared', # Similar to above. Primarily administrative, and sometimes does not match the actual start date.
    'ia_program_declared', # Individual Assistance Program - Response type. Dropped.
    'pa_program_declared', # Public Assistance Program - Response type. Dropped.
    'hm_program_declared', # Hazard Mitigation Program - Response type. Dropped.
    'ih_program_declared', # Individuals and Households Program - Response type. Dropped.
    'tribal_request', # Marks if request was made by Tribal Nation. Dropped.
    'fips', # Administrative. Dropped.
    'place_code', # Administrative, would require geospatial processing to make sense of.
    'declaration_request_number', # Administrative. Dropped.
    'last_ia_filing_date', # Administrative. Dropped.
    'incident_id', # Administrative. Dropped.
    'last_refresh', # Administrative. Dropped. 
    'hash', # Organizational. Dropped.
    'id', # Organizational. Dropped.
    'designated_area', # Too variable of a scope (state/county/reservation/parish/municipio/area/etc.), highly inconsistent formatting, and matching variable dates to geographic coordinates while filtering out similarly named counties in different areas requires a database beyond the resources of this project. Dropped.
    'disaster_closeout_date', # Represents administrative response, not actual disaster.
    'designated_incident_types', # Seems useful but is redundant to other more specific and filled-out fields
)

# Main ethos: we're not interested in the response and administrative end of things, the project is more concerned
#  with the incident, location, and types of the incidents occurring.

for field in unnecessary_fields:
    try:
        df.drop(field, inplace=True, axis=1)
    except KeyError:  # raised when the column is not in the dataframe
        print(f"[{field}] was not present or likely already removed.")

df.sample(5)

[fema_declaration_string] was not present or likely already removed.
[disaster_number] was not present or likely already removed.
[declaration_date] was not present or likely already removed.
[fy_declared] was not present or likely already removed.
[ia_program_declared] was not present or likely already removed.
[pa_program_declared] was not present or likely already removed.
[hm_program_declared] was not present or likely already removed.
[ih_program_declared] was not present or likely already removed.
[tribal_request] was not present or likely already removed.
[fips] was not present or likely already removed.
[place_code] was not present or likely already removed.
[declaration_request_number] was not present or likely already removed.
[last_ia_filing_date] was not present or likely already removed.
[incident_id] was not present or likely already removed.
[last_refresh] was not present or likely already removed.
[hash] was not present or likely already removed.
[id] was not present or

| | state | declaration_type | incident_type | declaration_title | incident_begin_date | incident_end_date | region |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 14196 | KY | DR | Snowstorm | Blizzard Of 96 | 1996-01-05T00:00:00Z | 1996-01-12T00:00:00Z | 4 |
| 3990 | MN | DR | Flood | Severe Storms & Flooding | 1974-07-13T00:00:00Z | 1974-07-13T00:00:00Z | 5 |
| 25147 | AR | DR | Severe Storm | Severe Storms And Flooding | 2004-05-30T00:00:00Z | 2004-07-09T00:00:00Z | 6 |
| 57349 | AR | DR | Biological | Covid-19 Pandemic | 2020-01-20T00:00:00Z | 2023-05-11T00:00:00Z | 6 |
| 24460 | PR | DR | Severe Storm | Severe Storms, Flooding, Mudslides, And Landsl... | 2003-11-10T00:00:00Z | 2003-11-23T00:00:00Z | 2 |
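The repeated "was not present" messages appear when the cell is re-run after the columns are already gone. One way to make the step safe to re-run is sketched below on a toy frame (the frame and the `to_drop` list are illustrative, not the full field list):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["KY", "MN"],
    "incident_type": ["Snowstorm", "Flood"],
    "hash": ["abc", "def"],
    "id": [1, 2],
})

to_drop = ["hash", "id", "fips"]  # "fips" is absent from the toy frame on purpose

# Report which requested columns are actually present before dropping.
present = [c for c in to_drop if c in toy.columns]
absent = [c for c in to_drop if c not in toy.columns]
print(f"Dropping {present}; already absent: {absent}")

# errors="ignore" skips missing labels, so the drop can be repeated safely.
trimmed = toy.drop(columns=to_drop, errors="ignore")
trimmed_again = trimmed.drop(columns=to_drop, errors="ignore")
```

Returning a new frame instead of using `inplace=True` also keeps the original available if a column turns out to be needed later.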


## Inconsistent Data

Check for inconsistent data and address any that arises. As always, use code comments to illustrate your thought process.

In [16]:
df['declaration_title'].value_counts()

declaration_title
Covid-19 Pandemic               4165
Severe Storms And Flooding      3955
Covid-19                        3692
Severe Storms & Flooding        3387
Severe Winter Storm             2466
                                ... 
Typhoon Karen                      1
Abnormally High Tides              1
Co-Wiley Ridge Fire-06/23/02       1
Louisiana Fire                     1
Tornado & Heavy Rainfall           1
Name: count, Length: 2418, dtype: int64

In [17]:
df['incident_type'].value_counts()

incident_type
Severe Storm           19267
Flood                  11204
Hurricane              10823
Biological              7857
Fire                    3843
Snowstorm               3707
Severe Ice Storm        2956
Tornado                 1623
Drought                 1292
Tropical Storm          1059
Coastal Storm            350
Other                    313
Freezing                 301
Earthquake               228
Winter Storm             149
Typhoon                  130
Volcanic Eruption         51
Mud/Landslide             44
Fishing Losses            42
Dam/Levee Break           13
Toxic Substances           9
Chemical                   9
Tsunami                    9
Human Cause                7
Tropical Depression        7
Terrorist                  5
Straight-Line Winds        2
Name: count, dtype: int64

In [None]:
# Incident type is clearly the more practical field for categorization, so we
#  definitely keep it. The declaration title may still be worth keeping for
#  labeling, description, and specificity in visualizations.
# At this point, we can filter and sort the data.
# Note: 'Coastal Storm' and 'Tropical Depression' appear in `coastal` below
#  but not in `weather`.
weather = (
    'Severe Storm',
    'Flood',
    'Hurricane',
    'Fire', 
    'Snowstorm', 
    'Severe Ice Storm', 
    'Tornado', 
    'Drought', 
    'Tropical Storm', 
    'Freezing', 
    'Winter Storm', 
    'Typhoon'
)
dry = (
    'Fire',
    'Drought'
)
cold = (
    'Severe Ice Storm',
    'Winter Storm',
    'Snowstorm',
    'Freezing'
)
coastal = (
    'Hurricane',
    'Tropical Storm',
    'Coastal Storm',
    'Typhoon',
    'Tropical Depression'
)
climate = df[df['incident_type'].isin(weather)]
dry_cli = df[df['incident_type'].isin(dry)]
cold_cli = df[df['incident_type'].isin(cold)]
coastal_cli = df[df['incident_type'].isin(coastal)]
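As an alternative to separate filtered frames, the three disjoint groups could be recorded in a single column. This is a sketch only: the group contents are copied from the tuples above, while the column name `incident_group` and the `"other"` fallback are hypothetical choices.

```python
import pandas as pd

# Group definitions copied from the tuples above.
groups = {
    "dry": ("Fire", "Drought"),
    "cold": ("Severe Ice Storm", "Winter Storm", "Snowstorm", "Freezing"),
    "coastal": ("Hurricane", "Tropical Storm", "Coastal Storm",
                "Typhoon", "Tropical Depression"),
}

# Invert to an incident_type -> group lookup.
type_to_group = {t: g for g, types in groups.items() for t in types}

toy = pd.DataFrame({"incident_type": ["Fire", "Snowstorm", "Typhoon", "Flood"]})

# Types outside the three groups map to NaN and are labeled "other".
toy["incident_group"] = toy["incident_type"].map(type_to_group).fillna("other")
print(toy)
```

A single column like this can be used directly as a color or facet variable in a visualization.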

In [20]:
climate['declaration_title'].value_counts()

declaration_title
Severe Storms And Flooding                3948
Severe Storms & Flooding                  3387
Severe Winter Storm                       2466
Severe Storms, Tornadoes, And Flooding    2055
Flooding                                  1540
                                          ... 
Or - Winter Fire - 07/15/2002                1
Hurricane Cindy                              1
Drought & Impending Freeze                   1
Or - Eyerly Fire - 07/13/2002                1
Munger Shaw Fire                             1
Name: count, Length: 2336, dtype: int64

Over 2,000 unique values in the 'declaration_title' field suggest that the data would be best cleaned in Excel or a similar tool. Because each entry varies so much, the field is nearly useless for categorization and is better understood as a label for individual data points. Even if we did clean it, it would be mostly redundant with incident_type.
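Some of the variation is purely cosmetic, for example "Severe Storms And Flooding" versus "Severe Storms & Flooding". A light normalization pass in pandas might collapse those cases; the sketch below uses a hypothetical helper, `normalize_title`, on a handful of titles taken from the outputs above:

```python
import pandas as pd

titles = pd.Series([
    "Severe Storms And Flooding",
    "Severe Storms & Flooding",
    "Severe Winter Storm",
    "Severe Winter Storms",
    "Covid-19 Pandemic",
])

def normalize_title(s: pd.Series) -> pd.Series:
    """Lowercase, spell out '&', and collapse repeated whitespace."""
    return (
        s.str.lower()
         .str.replace("&", "and", regex=False)
         .str.replace(r"\s+", " ", regex=True)
         .str.strip()
    )

normalized = normalize_title(titles)
print(titles.nunique(), "->", normalized.nunique())
```

This only merges spelling variants. Singular and plural forms ("Storm" versus "Storms") and one-off event names would remain distinct, so the overall conclusion about the field still holds.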

As an experiment, we can filter for flooding to see how much of the data it represents. If that one value accounts for a large share, the remaining entries might be easier to work with, and it might make more sense to create a flag field for overlapping incident types.

In [30]:
pct = int(100*(len(climate[climate['declaration_title'].str.contains('Flooding')])/len(climate)))
print(f"{pct}% of climate events had flooding.")

45% of climate events had flooding.


Let's filter for those that didn't.

In [39]:
climate[~climate['declaration_title'].str.contains('Flooding')].sample(10)

| | state | declaration_type | incident_type | declaration_title | incident_begin_date | incident_end_date | region |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 60003 | AR | EM | Hurricane | Hurricane Laura | 2020-08-26T00:00:00Z | 2020-08-28T00:00:00Z | 6 |
| 31225 | OK | DR | Fire | Extreme Wildfire Threat | 2005-11-27T00:00:00Z | 2006-03-31T00:00:00Z | 6 |
| 66836 | FL | DR | Hurricane | Hurricane Helene | 2024-09-23T00:00:00Z | 2024-10-07T00:00:00Z | 4 |
| 5453 | OH | EM | Snowstorm | Snowstorms | 1977-02-02T00:00:00Z | 1977-02-02T00:00:00Z | 5 |
| 25947 | PA | DR | Hurricane | Tropical Depression Ivan | 2004-09-17T00:00:00Z | 2004-10-01T00:00:00Z | 3 |
| 2342 | VA | DR | Flood | Tropical Storm Agnes | 1972-06-23T00:00:00Z | 1972-06-23T00:00:00Z | 3 |
| 23687 | OK | DR | Severe Storm | Severe Storms And Tornadoes | 2003-05-08T00:00:00Z | 2003-05-30T00:00:00Z | 6 |
| 25469 | FL | DR | Hurricane | Hurricane Frances | 2004-09-03T00:00:00Z | 2004-10-08T00:00:00Z | 4 |
| 66609 | SC | EM | Tropical Storm | Hurricane Helene | 2024-09-25T00:00:00Z | 2024-10-07T00:00:00Z | 4 |
| 59552 | MT | FM | Fire | Falling Star Fire | 2020-08-02T00:00:00Z | 2020-08-04T00:00:00Z | 8 |


In [46]:
climate[~climate['declaration_title'].str.contains('Flood')]['declaration_title'].value_counts()

declaration_title
Severe Winter Storm           2466
Severe Winter Storms          1375
Drought                       1014
Wildfires                      700
Hurricane Irma                 663
                              ... 
Hurricane Cindy                  1
Drought & Impending Freeze       1
Hurricane Cleo                   1
Hurricane Hilda                  1
Typhoon Louise                   1
Name: count, Length: 1971, dtype: int64

Even after filtering out one of the most common terms, 1,971 unique values remain, which is still entirely impractical to hand-filter and sort case by case in Python. If we wanted to follow through with this (a largely redundant process), it would make the most sense either to take the data into Excel or to create calculated fields in the upcoming visualization, with boolean flags marking whether an event matches one or more criteria. This would let us be more specific about the types of events occurring. For example, flooding doesn't occur on its own: it requires a cause such as rain or a dam break. Likewise, fire requires a dry climate, a trigger event, and so on. All this to say, we could keep the field and use it to split and visualize shifts in the nature of disasters if we wanted.
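The boolean-flag idea could also be prototyped in pandas before moving to a visualization tool. The sketch below is illustrative only: the keyword patterns and the `has_*` and `n_flags` column names are assumptions, and the titles are a small sample from the outputs above.

```python
import pandas as pd

toy = pd.DataFrame({
    "declaration_title": [
        "Severe Storms, Tornadoes, And Flooding",
        "Hurricane Irma",
        "Drought & Impending Freeze",
        "Wildfires",
    ],
})

# Hypothetical keyword patterns; each becomes one boolean column.
keywords = {
    "has_flood": r"flood",
    "has_tornado": r"tornado",
    "has_fire": r"fire",
    "has_freeze": r"freez",
}
for flag, pattern in keywords.items():
    toy[flag] = toy["declaration_title"].str.contains(pattern, case=False, na=False)

# Number of flagged hazard keywords per title.
toy["n_flags"] = toy[list(keywords)].sum(axis=1)
print(toy)
```

Because the flags are independent columns, a single declaration can count toward several hazards at once, which a single category column cannot express.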

## Summarize Your Results

Make note of your answers to the following questions.

1. Did you find all four types of dirty data in your dataset?

    No. The data was remarkably clean overall, and the few fields that could have contained dirty results (such as 'designated_area') were dropped because their scope was too variable, ranging from state-wide to county level and across a range of other area types. While I didn't see any typos on a cursory glance at the column, matching those entries to coordinates and scopes in a geospatial database is a task beyond the scope of the project and would offer fairly minimal returns, even though many entries could be matched by simply concatenating the state of declaration with the designated area.

    No duplicates were found in the data, as shown in Checkpoint Two.

    There are a few fields (such as declaration_title) where the values get messy and inconsistent: unnecessary quotation marks, vague entries, and entries that are either redundant with incident_type or so specific that they are useless beyond perhaps labeling outlier events. Some of the entries also seem to come from messy, possibly automated, input and contain examples like:

    "Severe Storms, Tornadoes, Straight-Line Winds, And"

    The designated_area field would require case-by-case analysis on many of the entries - e.g. a reservation could be labeled something like:

    Alamo Navajo (Indian Reservation)

    Jicharilla Apache Indian Reservation

    San Felipe/Santa Ana Joint Area

    etc.
    
    The geographic specificity is largely beyond the scope of the project, especially when we are looking at regional patterns down to roughly the state level at finest resolution.


2. Did the process of cleaning your data give you new insights into your dataset?

    No and yes. No, because I did very similar work in the previous checkpoint to visualize and organize the data and better understand it, including breaking it down into subsections. Additionally, much of the data is as expected, almost alarmingly so, including the increase in declarations and the contrast between eventful and uneventful years. Yes, because that work, especially the visualizations and value counts, revealed remarkable outliers like Katrina and also gave a better understanding of how the data was entered (it was remarkably clean, save for a few fields).

3. Is there anything you would like to make note of when it comes to manipulating the data and making visualizations?

    For dates and times, the field of interest is generally 'incident_begin_date', though we can keep 'incident_end_date' for visualization later if we want.

    Incident type is the most useful field to sort on.

    The other fields (state/region) can help give clarity to where incidents are occurring.

    Due to the nature of the dataset, the information is generally limited to types of events, dates, and locations. The finest scale we can analyze or visualize is state-level, though regional results are also interesting and may prove more practical for actual analysis. We could, for example, visualize drought-prone states in the West, colder states in the North versus plains states more likely to face windy weather, low-lying states in the South more likely to face tropical storms and flooding, and so on. FEMA already defines regions in the dataset, though it may be good to give them a once-over and exclude or regroup certain areas so they don't skew the analysis (e.g. DC is too specific, and Hawaii is grouped with California but will have different climate patterns).
    
    If we wanted, we could use the program declaration fields as a barometer of event severity or response, though how these programs are applied in practice seems to fluctuate.

    Since we are working on Type/Location/Date, it may be helpful to bring in another dataset to add an extra dimension to the analysis, or to use calculated fields to help develop more nuanced visualizations and analysis.

    As for added datasets, the new data should not be about the events themselves. Such a dataset would be largely redundant, would create an unreasonable time sink late in the project, and would inevitably have thousands of missing and unmatched fields, even if it could offer an extra dimension (event severity, casualties, etc.). Therefore, the only additional datasets we should rely on are geographic and chronological ones: changes in state or regional population over the years (a simple census pull), yearly temperatures, El Niño/La Niña years, and so on. If the visualizations turn out less informative or interesting than desired, it may be practical to add these supplemental datasets.
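The date handling and the supplemental-dataset idea from the notes above can be sketched together. In this hedged example, the declaration rows are copied from the sample output earlier in the notebook, while the `population` table, its column names, and its values are placeholders invented for illustration (not real census data):

```python
import pandas as pd

toy = pd.DataFrame({
    "state": ["KY", "AR", "AR"],
    "incident_type": ["Snowstorm", "Severe Storm", "Biological"],
    "incident_begin_date": ["1996-01-05T00:00:00Z", "2004-05-30T00:00:00Z",
                            "2020-01-20T00:00:00Z"],
    "incident_end_date": ["1996-01-12T00:00:00Z", "2004-07-09T00:00:00Z",
                          "2023-05-11T00:00:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
for col in ("incident_begin_date", "incident_end_date"):
    toy[col] = pd.to_datetime(toy[col], utc=True)

toy["year"] = toy["incident_begin_date"].dt.year
toy["duration_days"] = (toy["incident_end_date"] - toy["incident_begin_date"]).dt.days

# Hypothetical supplemental table keyed by state and year.
population = pd.DataFrame({
    "state": ["KY", "AR"],
    "year": [1996, 2004],
    "population": [1000, 2000],  # placeholder values, not real census data
})

# A left join keeps every declaration; indicator=True exposes unmatched rows.
merged = toy.merge(population, on=["state", "year"], how="left", indicator=True)
print(merged[["state", "year", "duration_days", "population", "_merge"]])
```

Rows marked `left_only` in the `_merge` column are declarations with no matching supplemental record, which gives a quick way to measure how many unmatched rows a new dataset would introduce.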



    