### NOAA Storm Events Database ###

https://www.ncei.noaa.gov/stormevents/ftp.jsp

The database currently contains data from January 1950 to May 2025, as entered by NOAA's National Weather Service (NWS). Bulk data are available in comma-separated files (CSV). These files can be viewed in Excel and other spreadsheet applications.

Access: ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/

Detailed information about the fields/columns: ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/Storm-Data-Bulk-csv-Format.pdf

Documentation on the file naming convention: ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/README

In [1]:
# import libraries
# NOTE: global_vars should be edited to include local paths and credentials before use.
# If global_vars.py is created in the root dir remove the ignore/ prefix in the import statement below.
import ignore.global_vars as gv
import db_tools as dbt
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import urllib.request
import re

In [2]:
dbt.browse_ftp("ftp://ftp.ncei.noaa.gov", "/pub/data/swdi/stormevents/csvfiles/");


Contents of /pub/data/swdi/stormevents/csvfiles/:
drwxrwxr-x   2 ftp      ftp          4096 May 14  2014 legacy
-rw----r-x   1 ftp      ftp          2020 May 14  2014 README
-rw-r-xr-x   1 ftp      ftp        147087 Jul 30  2024 Storm-Data-Bulk-csv-Format.pdf
-rw----r-x   1 ftp      ftp        150527 Jul 30  2024 Storm-Data-Export-Format.pdf
-rw-rw-r--   1 ftp      ftp         10597 Jul  2 14:22 StormEvents_details-ftp_v1.0_d1950_c20250520.csv.gz
-rw-rw-r--   1 ftp      ftp         12020 May 20 12:35 StormEvents_details-ftp_v1.0_d1951_c20250520.csv.gz
-rw-rw-r--   1 ftp      ftp         12634 May 20 12:35 StormEvents_details-ftp_v1.0_d1952_c20250520.csv.gz
-rw-rw-r--   1 ftp      ftp         21804 May 20 12:35 StormEvents_details-ftp_v1.0_d1953_c20250520.csv.gz
-rw-rw-r--   1 ftp      ftp         26220 May 20 12:35 StormEvents_details-ftp_v1.0_d1954_c20250520.csv.gz
-rw-rw-r--   1 ftp      ftp         53699 May 20 12:35 StormEvents_details-ftp_v1.0_d1955_c20250520.csv.gz
-rw-rw-r--   1

In [3]:
# Download and print the README file from the NOAA Storm Events database
readme_url = "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/README"
with urllib.request.urlopen(readme_url) as response:
    readme_content = response.read().decode("utf-8")

print(readme_content)


---------------------------------------------------------------
-- README:                                                   --
-- Storm Events Database, Bulk Download                      --
---------------------------------------------------------------

This directory contains CSV (Comma-Separated Values) text files
which represent a dump or export of the Storm Events Database.


Update: 5/14/2014
Data from 1950 to 1996 has been added to the database and 
exported to CSV files in this directory.  Data from 1996 to 
present is available in the legacy CSV format but will be
reprocessed to the new data format by the end of May 2014.
The file naming convention has changed and the data are now 
compressed.  However, the contents of the files are similar.

Example file name:
StormEvents_details-ftp_v1.0_d1972_c20140508.csv.gz

The file is compressed with GZIP compression.  This compression
type is widely supported but custom software, such as 
'7-zip' (http://www.7-zip.org/), may be neede

In [4]:
# Download a sample CSV file from the NOAA Storm Events database and load it into a DataFrame
df_sample = dbt.ftp_to_df(
    "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2024_c20250818.csv.gz",
    compression="gzip",)

Streamed StormEvents_details-ftp_v1.0_d2024_c20250818.csv.gz: 69493 rows, 51 columns
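
`dbt.ftp_to_df` is a project helper whose source is not shown in this notebook. The following is a minimal sketch of what such a helper might do, assuming it downloads the bytes with `urllib` and passes them to `pandas.read_csv`. The names `gz_bytes_to_df` and `ftp_to_df_sketch` are hypothetical and are not part of `db_tools`.

```python
import gzip
import io
import urllib.request

import pandas as pd


def gz_bytes_to_df(raw: bytes, **kwargs) -> pd.DataFrame:
    """Parse gzip-compressed CSV bytes into a DataFrame."""
    bio = io.BytesIO(raw)
    return pd.read_csv(bio, compression="gzip", **kwargs)


def ftp_to_df_sketch(url: str, **kwargs) -> pd.DataFrame:
    """Download a .csv.gz file (urllib accepts ftp:// URLs) and parse it."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
    df = gz_bytes_to_df(raw, **kwargs)
    name = url.rsplit("/", 1)[-1]
    print(f"Streamed {name}: {len(df)} rows, {len(df.columns)} columns")
    return df


# Offline check with a tiny in-memory file instead of a network call
sample = gzip.compress(b"EVENT_ID,STATE\n1,OKLAHOMA\n2,TEXAS\n")
df_demo = gz_bytes_to_df(sample)
print(df_demo.shape)
```

Passing `low_memory=False` or an explicit `dtype` mapping through `**kwargs` is one way to avoid the mixed-type warning that `pd.read_csv` can raise on large files.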


In [5]:
df_sample.head()

Unnamed: 0,BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,EPISODE_ID,EVENT_ID,STATE,STATE_FIPS,...,END_RANGE,END_AZIMUTH,END_LOCATION,BEGIN_LAT,BEGIN_LON,END_LAT,END_LON,EPISODE_NARRATIVE,EVENT_NARRATIVE,DATA_SOURCE
0,202404,30,2033,202404,30,2033,189851,1174463,OKLAHOMA,40,...,0.0,SSW,FREDERICK ARPT,34.3444,-98.983,34.3444,-98.983,A rather nebulous upper air pattern existed ac...,Frederick Municipal Airport (KFDR) observation.,CSV
1,202407,1,0,202407,5,900,193486,1195301,LOUISIANA,22,...,,,,,,,,An upper ridge of high pressure built in acros...,,CSV
2,202411,16,230,202411,18,1421,197838,1223377,OREGON,41,...,,,,,,,,A series of cold fronts the weekend of Nov. 16...,The Hog Pass SNOTEL reported an estimated 12 i...,CSV
3,202405,22,1230,202405,22,1615,191723,1184135,TEXAS,48,...,,,,,,,,A strong upper-level subtropical ridge/heat do...,Harlingen Valley International Airport (KHRL) ...,CSV
4,202405,21,1200,202405,21,1530,191723,1184133,TEXAS,48,...,,,,,,,,A strong upper-level subtropical ridge/heat do...,"By proxy, between locations in northern Kenedy...",CSV


Cleaning strategy

Columns needed, erring on the side of keeping too many.

#### StormEvents_details ####

- BEGIN_YEARMONTH
- BEGIN_DAY
  - Combine these two into a single datetime column; the time of day is not needed.
- EPISODE_ID
- EVENT_ID
- EVENT_TYPE
- STATE_FIPS
- CZ_FIPS
  - Combine the state and CZ FIPS codes into a single county FIPS code.
- INJURIES_DIRECT
- INJURIES_INDIRECT
- DEATHS_DIRECT
- DEATHS_INDIRECT
- DAMAGE_PROPERTY

#### StormEvents_locations ####
Not needed; the county code is built by concatenating the state and CZ FIPS codes from the details file.

In [6]:
# Get file list where type is StormEvents_details and year is 1999-2025
files = dbt.get_ftp_filenames("ftp://ftp.ncei.noaa.gov", "/pub/data/swdi/stormevents/csvfiles/")

# Filter for StormEvents_details files from 1999-2025
pattern = r'StormEvents_details-ftp_v1\.0_d(\d{4})_c.*\.csv\.gz'
selected_files = []
    
for file in files:
    match = re.match(pattern, file)
    if match:
        year = int(match.group(1))
        if 1999 <= year <= 2025:
            selected_files.append(file)


# Print the selected file names
print(f"Selected {len(selected_files)} StormEvents_details files:")
for filename in selected_files:
    print(filename)

Selected 27 StormEvents_details files:
StormEvents_details-ftp_v1.0_d2006_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2013_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2020_c20250702.csv.gz
StormEvents_details-ftp_v1.0_d2016_c20250818.csv.gz
StormEvents_details-ftp_v1.0_d2018_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2024_c20250818.csv.gz
StormEvents_details-ftp_v1.0_d2015_c20250818.csv.gz
StormEvents_details-ftp_v1.0_d2017_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2021_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2025_c20250818.csv.gz
StormEvents_details-ftp_v1.0_d2019_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d1999_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2014_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2000_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2012_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2001_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2011_c20250520.csv.gz
StormEvents_details-ftp_v1.0_d2002_c20250520.csv.gz
StormEvents_details-ftp_v
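
The listing above comes back unsorted, and file names carry both a data year (`d....`) and a second date token (`c........`). A minimal sketch of a stricter selection follows, assuming the `c` token is the file's creation date in YYYYMMDD form and that only the newest file per year is wanted. The helpers `parse_storm_filename` and `latest_per_year` are hypothetical and not part of `db_tools`.

```python
import re
from datetime import datetime

FILENAME_RE = re.compile(
    r"StormEvents_(?P<kind>\w+?)-ftp_v(?P<version>[\d.]+)"
    r"_d(?P<year>\d{4})_c(?P<created>\d{8})\.csv\.gz$"
)


def parse_storm_filename(name):
    """Return kind, data year, and creation date, or None if the name does not match."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return {
        "kind": m["kind"],
        "year": int(m["year"]),
        "created": datetime.strptime(m["created"], "%Y%m%d").date(),
    }


def latest_per_year(names, kind="details", first_year=1999, last_year=2025):
    """Keep the newest file for each data year, returned in year order."""
    best = {}
    for name in names:
        info = parse_storm_filename(name)
        if info is None or info["kind"] != kind:
            continue
        if not first_year <= info["year"] <= last_year:
            continue
        current = best.get(info["year"])
        if current is None or info["created"] > current[0]:
            best[info["year"]] = (info["created"], name)
    return [best[year][1] for year in sorted(best)]


names = [
    "StormEvents_details-ftp_v1.0_d2024_c20250520.csv.gz",
    "StormEvents_details-ftp_v1.0_d2024_c20250818.csv.gz",
    "StormEvents_fatalities-ftp_v1.0_d2024_c20250818.csv.gz",
    "StormEvents_details-ftp_v1.0_d1999_c20250520.csv.gz",
    "README",
]
selected = latest_per_year(names)
print(selected)
```

Sorting by year makes the download log easier to scan, and keeping one file per year guards against double counting if the server ever holds two exports for the same year.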

In [7]:
# Download every file in `selected_files` and collect the DataFrames for cleaning

base_url = "ftp://ftp.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
all_storm_data = []

print(f"Processing {len(selected_files)} Storm Events files...")

for i, filename in enumerate(selected_files, 1):
    try:
        # Construct full URL
        full_url = base_url + filename

        # Stream file to DataFrame
        df = dbt.ftp_to_df(full_url, compression="gzip")

        if not df.empty:
            # Add year column for reference
            year = re.search(r"d(\d{4})", filename).group(1)
            df["FILE_YEAR"] = int(year)

            all_storm_data.append(df)
            print(f"{i:2d}/{len(selected_files)}: {year} - {len(df)} rows")
        else:
            print(f"{i:2d}/{len(selected_files)}: {filename} - No data")

    except Exception as e:
        print(f"Error processing {filename}: {e}")

# Concatenate all DataFrames
if all_storm_data:
    df_all_storms = pd.concat(all_storm_data, ignore_index=True)
    print(
        f"\nCombined DataFrame: {len(df_all_storms)} total rows, {len(df_all_storms.columns)} columns"
    )
    print(
        f"Years covered: {df_all_storms['FILE_YEAR'].min()} - {df_all_storms['FILE_YEAR'].max()}"
    )

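    # Show the first rows; display() is needed because a bare expression
    # inside an if-block is not rendered by the notebook
    display(df_all_storms.head())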
else:
    print("No data was successfully loaded")

Processing 27 Storm Events files...


  df = pd.read_csv(bio, **kwargs)


Streamed StormEvents_details-ftp_v1.0_d2006_c20250520.csv.gz: 56400 rows, 51 columns
 1/27: 2006 - 56400 rows
Streamed StormEvents_details-ftp_v1.0_d2013_c20250520.csv.gz: 59986 rows, 51 columns
 2/27: 2013 - 59986 rows
Streamed StormEvents_details-ftp_v1.0_d2020_c20250702.csv.gz: 61278 rows, 51 columns
 3/27: 2020 - 61278 rows
Streamed StormEvents_details-ftp_v1.0_d2016_c20250818.csv.gz: 56005 rows, 51 columns
 4/27: 2016 - 56005 rows
Streamed StormEvents_details-ftp_v1.0_d2018_c20250520.csv.gz: 62697 rows, 51 columns
 5/27: 2018 - 62697 rows
Streamed StormEvents_details-ftp_v1.0_d2024_c20250818.csv.gz: 69493 rows, 51 columns
 6/27: 2024 - 69493 rows
Streamed StormEvents_details-ftp_v1.0_d2015_c20250818.csv.gz: 57907 rows, 51 columns
 7/27: 2015 - 57907 rows
Streamed StormEvents_details-ftp_v1.0_d2017_c20250520.csv.gz: 57029 rows, 51 columns
 8/27: 2017 - 57029 rows
Streamed StormEvents_details-ftp_v1.0_d2021_c20250520.csv.gz: 61389 rows, 51 columns
 9/27: 2021 - 61389 rows
Streamed S

In [8]:
# Keep only the columns named in the cleaning strategy to reduce memory usage
keep_cols = [
    'BEGIN_YEARMONTH', 'BEGIN_DAY', 'EPISODE_ID', 'EVENT_ID', 'EVENT_TYPE',
    'CZ_FIPS', 'STATE_FIPS', 'INJURIES_DIRECT', 'INJURIES_INDIRECT',
    'DEATHS_DIRECT', 'DEATHS_INDIRECT', 'DAMAGE_PROPERTY',
]
df_all_storms_drop = df_all_storms[keep_cols]
df_all_storms_drop.head()

Unnamed: 0,BEGIN_YEARMONTH,BEGIN_DAY,EPISODE_ID,EVENT_ID,EVENT_TYPE,CZ_FIPS,STATE_FIPS,INJURIES_DIRECT,INJURIES_INDIRECT,DEATHS_DIRECT,DEATHS_INDIRECT,DAMAGE_PROPERTY
0,200604,7,1207534,5501658,Thunderstorm Wind,51,18.0,0,0,0,0,60K
1,200601,1,1202408,5482463,Drought,2,8.0,0,0,0,0,
2,200601,1,1202408,5482464,Drought,7,8.0,0,0,0,0,
3,200601,1,1202408,5482465,Drought,4,8.0,0,0,0,0,
4,200601,1,1202408,5482466,Drought,13,8.0,0,0,0,0,
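
`DAMAGE_PROPERTY` arrives as text such as `60K` rather than as a number. The notebook does not convert it, but if dollar amounts are needed later, a minimal sketch could look like the following. It assumes the suffixes `K`, `M`, and `B` mean thousands, millions, and billions, and treats anything unparseable as missing. `parse_damage` is a hypothetical helper.

```python
import pandas as pd

# Assumed meaning of the trailing letter in DAMAGE_PROPERTY strings
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}


def parse_damage(value):
    """Convert strings like '60K' or '1.5M' to dollars; return NaN if unparseable."""
    if pd.isna(value):
        return float("nan")
    text = str(value).strip().upper()
    if not text:
        return float("nan")
    try:
        if text[-1] in MULTIPLIERS:
            return float(text[:-1]) * MULTIPLIERS[text[-1]]
        return float(text)
    except ValueError:
        # e.g. a bare suffix with no number in front of it
        return float("nan")


raw = pd.Series(["60K", None, "1.5M", "0.00K", "250", "K"])
damage = raw.map(parse_damage)
print(damage.tolist())
```

Returning NaN for a bare suffix or other odd value is a conservative choice: it keeps ambiguous entries out of any later sums rather than guessing what they mean.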


In [9]:
# combine BEGIN_YEARMONTH and BEGIN_DAY into a single DATE column and convert to datetime, drop original columns

df_all_storms_comb = df_all_storms_drop.copy()

df_all_storms_comb['BEGIN_YEARMONTH'] = df_all_storms_comb['BEGIN_YEARMONTH'].astype(str)
df_all_storms_comb['BEGIN_DAY'] = df_all_storms_comb['BEGIN_DAY'].astype(str).str.zfill(2)
df_all_storms_comb['DATE'] = df_all_storms_comb['BEGIN_YEARMONTH'] + df_all_storms_comb['BEGIN_DAY']
df_all_storms_comb['DATE'] = pd.to_datetime(df_all_storms_comb['DATE'], format='%Y%m%d')
df_all_storms_comb.drop(columns=['BEGIN_YEARMONTH', 'BEGIN_DAY'], inplace=True)
df_all_storms_comb.head()

Unnamed: 0,EPISODE_ID,EVENT_ID,EVENT_TYPE,CZ_FIPS,STATE_FIPS,INJURIES_DIRECT,INJURIES_INDIRECT,DEATHS_DIRECT,DEATHS_INDIRECT,DAMAGE_PROPERTY,DATE
0,1207534,5501658,Thunderstorm Wind,51,18.0,0,0,0,0,60K,2006-04-07
1,1202408,5482463,Drought,2,8.0,0,0,0,0,,2006-01-01
2,1202408,5482464,Drought,7,8.0,0,0,0,0,,2006-01-01
3,1202408,5482465,Drought,4,8.0,0,0,0,0,,2006-01-01
4,1202408,5482466,Drought,13,8.0,0,0,0,0,,2006-01-01


In [10]:
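# Build a 5-character county FIPS code (CO_FIPS) from the 2-digit state code and the 3-digit CZ code.
# Missing values become 99 (state) and 999 (CZ), so an unknown county reads 99999.
# The zero-padded STATE_FIPS and CZ_FIPS columns are kept in case they are needed.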
df_all_storms_comb["STATE_FIPS"] = (
    pd.to_numeric(df_all_storms_comb["STATE_FIPS"], errors="coerce")
    .fillna(99)
    .astype(int)
    .astype(str)
    .str.zfill(2)
)
df_all_storms_comb["CZ_FIPS"] = (
    pd.to_numeric(df_all_storms_comb["CZ_FIPS"], errors="coerce")
    .fillna(999)
    .astype(int)
    .astype(str)
    .str.zfill(3)
)
df_all_storms_comb["CO_FIPS"] = (
    df_all_storms_comb["STATE_FIPS"] + df_all_storms_comb["CZ_FIPS"]
)
df_all_storms_comb.head()

Unnamed: 0,EPISODE_ID,EVENT_ID,EVENT_TYPE,CZ_FIPS,STATE_FIPS,INJURIES_DIRECT,INJURIES_INDIRECT,DEATHS_DIRECT,DEATHS_INDIRECT,DAMAGE_PROPERTY,DATE,CO_FIPS
0,1207534,5501658,Thunderstorm Wind,051,18,0,0,0,0,60K,2006-04-07,18051
1,1202408,5482463,Drought,002,08,0,0,0,0,,2006-01-01,08002
2,1202408,5482464,Drought,007,08,0,0,0,0,,2006-01-01,08007
3,1202408,5482465,Drought,004,08,0,0,0,0,,2006-01-01,08004
4,1202408,5482466,Drought,013,08,0,0,0,0,,2006-01-01,08013


In [11]:
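# Keep only CO_FIPS codes between 01001 and 56045 (string comparison on the zero-padded codes).
# This drops unknown (99-prefixed) codes and areas outside that range (marine, unincorporated, etc.).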

df_clean = df_all_storms_comb.copy()
df_clean = df_clean[
    (df_clean['CO_FIPS'] >= '01001') & 
    (df_clean['CO_FIPS'] <= '56045') &
    (~df_clean['CO_FIPS'].str.startswith('99'))
].copy()
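
One caveat for this step: in the Storm Events details files, `CZ_FIPS` is a county FIPS code only when `CZ_TYPE` is `C`. Rows with `CZ_TYPE` of `Z` or `M` carry NWS forecast-zone or marine-zone numbers, and those can fall inside the range kept above. The following is a hedged sketch of an additional filter, assuming `CZ_TYPE` is retained in the column selection (it is not in the cells above). The toy rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; CZ_TYPE is C (county), Z (forecast zone), or M (marine)
toy = pd.DataFrame(
    {
        "EVENT_ID": [1, 2, 3, 4],
        "STATE_FIPS": [18, 8, 12, 12],
        "CZ_TYPE": ["C", "Z", "C", "Z"],
        "CZ_FIPS": [51, 2, 86, 202],
    }
)

# Same zero-padded concatenation as in the notebook
toy["CO_FIPS"] = (
    toy["STATE_FIPS"].astype(str).str.zfill(2)
    + toy["CZ_FIPS"].astype(str).str.zfill(3)
)

# The range filter alone keeps zone-based rows too
in_range = toy[(toy["CO_FIPS"] >= "01001") & (toy["CO_FIPS"] <= "56045")]

# Restricting to CZ_TYPE == "C" leaves only true county codes
county_only = in_range[in_range["CZ_TYPE"] == "C"]
print(county_only["CO_FIPS"].tolist())
```

Dropping zone rows discards real events, so an alternative is to map forecast zones to counties with a zone-county lookup table; which approach fits depends on how the counts will be used.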

In [27]:
# Filter data to only include events with direct deaths or injuries
severe_events = df_clean[
    (df_clean["DEATHS_DIRECT"] > 0) | (df_clean["INJURIES_DIRECT"] > 0)
].copy()

# Add year column
severe_events['YEAR'] = severe_events['DATE'].dt.year

# Aggregate to one row per county, episode, and year
county_episodes = (severe_events.groupby(["CO_FIPS", "EPISODE_ID", "YEAR"]).agg(
        {
            "DEATHS_DIRECT": "sum",  # Total deaths in this episode for this county
            "INJURIES_DIRECT": "sum",  # Total injuries in this episode for this county
            "EVENT_TYPE": lambda x: ", ".join(sorted(set(x))),  # Combined event types
            "DATE": "first",  # Representative date
        }
    )
    .reset_index()
)

# count unique episodes per county-year for Poisson λ parameter
annual_episodes = county_episodes.groupby(["CO_FIPS", "YEAR"]).size().reset_index(name='event_count')
annual_episodes.columns = ['county_fips', 'year', 'event_count']
annual_episodes.sample(10, random_state=36)

Unnamed: 0,county_fips,year,event_count
562,02204,2008,1
1536,06061,2012,1
9251,37107,2016,1
3114,12202,2018,3
5251,21035,2000,1
322,01077,2000,1
3491,13191,2008,1
3115,12202,2020,3
1517,06059,2002,4
3864,17021,2007,1
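
Because `annual_episodes` only contains county-years with at least one qualifying episode, averaging `event_count` directly would overstate a Poisson rate. The following is a minimal sketch, assuming the rate of interest is the mean count per county over every year in the study window (the maximum-likelihood estimate of λ for independent Poisson counts). The toy data and the names `toy`, `full`, and `lam` are illustrative only.

```python
import pandas as pd

# Toy county-year counts; years with no episodes are absent, as in annual_episodes
toy = pd.DataFrame(
    {
        "county_fips": ["06059", "06059", "12202"],
        "year": [2002, 2004, 2003],
        "event_count": [4, 1, 3],
    }
)
years = range(2002, 2006)  # 4-year window for the toy example

# Build every county-year combination and fill the missing ones with zero
full_index = pd.MultiIndex.from_product(
    [toy["county_fips"].unique(), years], names=["county_fips", "year"]
)
full = (
    toy.set_index(["county_fips", "year"])["event_count"]
    .reindex(full_index, fill_value=0)
    .reset_index()
)

# Poisson MLE of the annual rate: mean count per county, zeros included
lam = full.groupby("county_fips")["event_count"].mean()
print(lam)
```

Counties with no qualifying episode in any year never appear in the table at all, so a complete county list would be needed to assign them a rate of zero.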


In [32]:
annual_episodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13791 entries, 0 to 13790
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   county_fips  13791 non-null  object
 1   year         13791 non-null  int32 
 2   event_count  13791 non-null  int64 
dtypes: int32(1), int64(1), object(1)
memory usage: 269.5+ KB


In [34]:
annual_episodes.describe()

Unnamed: 0,year,event_count
count,13791.0,13791.0
mean,2010.548619,1.319049
std,7.648862,1.053403
min,1999.0,1.0
25%,2004.0,1.0
50%,2010.0,1.0
75%,2017.0,1.0
max,2025.0,21.0


In [None]:

# Load the county-year episode counts into the database
dbt.load_data(annual_episodes, "NOAA_STORM_EVENTS", if_exists="replace")

Created SQLAlchemy engine for disaster_db
Data loaded successfully into NOAA_STORM_EVENTS
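
`dbt.load_data` is a project helper that, judging by its log line, wraps a SQLAlchemy engine. The following is a minimal stand-in, assuming the helper ultimately calls `DataFrame.to_sql`. It uses an in-memory SQLite connection only to stay self-contained; the real target (`disaster_db`) and its credentials are not shown here.

```python
import sqlite3

import pandas as pd

# Toy version of annual_episodes; county_fips stays text so leading zeros survive
toy = pd.DataFrame(
    {
        "county_fips": ["06059", "12202"],
        "year": [2002, 2018],
        "event_count": [4, 3],
    }
)

conn = sqlite3.connect(":memory:")

# if_exists="replace" drops and recreates the table, matching the call above
toy.to_sql("NOAA_STORM_EVENTS", conn, if_exists="replace", index=False)

# Read the table back to confirm the load
round_trip = pd.read_sql_query("SELECT * FROM NOAA_STORM_EVENTS", conn)
conn.close()
print(round_trip)
```

Storing `county_fips` as text rather than an integer preserves codes such as `06059`, which would otherwise lose the leading zero and no longer join cleanly against other county tables.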
