# Weather Text

This notebook is an experiment in exploring the historical weather text in NOAA's [Storm Events Database](https://www.ncdc.noaa.gov/stormevents/).

## Download Data

First, let's download the CSV data from NOAA's [index page](https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/). They make several types of CSV files available. We are interested (for the moment) in the storm event details files, which go back to 1950 and have names that start with `StormEvents_details`.
[requests_html](https://requests-html.kennethreitz.org/) makes it pretty easy to scrape the CSV URLs we want out of the page. We can then load each file by its URL and concatenate them all into one huge pandas DataFrame. Yes, pandas can load gzipped CSV data straight from a URL...
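
As a quick offline check of the gzip handling, here is a minimal sketch that writes a tiny gzipped CSV to a temporary file and reads it back. The file name `toy_events.csv.gz` and its two rows are made up for illustration; the point is only that `pandas.read_csv` infers the compression from the `.gz` extension by default, so the same call should work for a local path or a URL.

```python
import gzip
import os
import tempfile

import pandas

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name and contents, used only for this demonstration
    path = os.path.join(tmp, "toy_events.csv.gz")
    with gzip.open(path, "wt", newline="") as f:
        f.write("EVENT_ID,EVENT_TYPE\n1,Tornado\n2,Flood\n")

    # No compression argument: pandas infers gzip from the .gz extension
    toy = pandas.read_csv(path)

print(toy)
```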

In [None]:
!pip install --quiet requests-html pandas

Get the URLs for the CSVs and load each one straight into a DataFrame.


In [None]:
import pandas
import requests_html

http = requests_html.HTMLSession()
url = 'https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/'
resp = http.get(url)

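frames = []
for link in resp.html.find('a'):
  href = link.attrs.get('href', '')  # some anchors may lack an href
  if 'StormEvents_details-ftp_v1.0' in href:
    csv_url = url + href
    # pandas infers gzip compression from the .csv.gz extension
    frames.append(pandas.read_csv(csv_url))
    print(csv_url)

# concatenate once instead of growing the DataFrame on every iteration
df = pandas.concat(frames)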


https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1951_c20160223.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1952_c20170619.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1953_c20160223.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1954_c20160223.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1955_c20160223.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1956_c20170717.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1957_c20160223.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1



https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1996_c20170717.csv.gz




https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1997_c20190920.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1998_c20170717.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2000_c20200707.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2001_c20200518.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2002_c20200518.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2003_c20200518.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2004_c20200518.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2005_c20200518.csv.gz




https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2006_c20200518.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2007_c20170717.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2008_c20180718.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2009_c20180718.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2010_c20200922.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2011_c20180718.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2012_c20200317.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d2013_c20170519.csv.gz
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1
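
The file names printed above look like they encode two things: a `d` part with the year the data covers and a `c` part with the date the file was created. That reading is an assumption based on the names themselves, not something confirmed here. Under that assumption, a small helper (the name `parse_details_name` is hypothetical) could pull both values out of a URL:

```python
import re
from datetime import date, datetime

# Assumed layout: ..._d<data year>_c<creation date as YYYYMMDD>.csv.gz
NAME_RE = re.compile(r"StormEvents_details-ftp_v1\.0_d(\d{4})_c(\d{8})\.csv\.gz$")

def parse_details_name(url):
    """Return (data_year, creation_date), or None if the URL does not match."""
    m = NAME_RE.search(url)
    if m is None:
        return None
    created = datetime.strptime(m.group(2), "%Y%m%d").date()
    return int(m.group(1)), created

example = ("https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/"
           "StormEvents_details-ftp_v1.0_d1950_c20170120.csv.gz")
year, created = parse_details_name(example)
print(year, created)
```

Collecting the parsed years for every scraped URL would be one way to check whether any year is missing from the listing.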

In [None]:
df

Unnamed: 0,BEGIN_YEARMONTH,BEGIN_DAY,BEGIN_TIME,END_YEARMONTH,END_DAY,END_TIME,EPISODE_ID,EVENT_ID,STATE,STATE_FIPS,YEAR,MONTH_NAME,EVENT_TYPE,CZ_TYPE,CZ_FIPS,CZ_NAME,WFO,BEGIN_DATE_TIME,CZ_TIMEZONE,END_DATE_TIME,INJURIES_DIRECT,INJURIES_INDIRECT,DEATHS_DIRECT,DEATHS_INDIRECT,DAMAGE_PROPERTY,DAMAGE_CROPS,SOURCE,MAGNITUDE,MAGNITUDE_TYPE,FLOOD_CAUSE,CATEGORY,TOR_F_SCALE,TOR_LENGTH,TOR_WIDTH,TOR_OTHER_WFO,TOR_OTHER_CZ_STATE,TOR_OTHER_CZ_FIPS,TOR_OTHER_CZ_NAME,BEGIN_RANGE,BEGIN_AZIMUTH,BEGIN_LOCATION,END_RANGE,END_AZIMUTH,END_LOCATION,BEGIN_LAT,BEGIN_LON,END_LAT,END_LON,EPISODE_NARRATIVE,EVENT_NARRATIVE,DATA_SOURCE
0,195004,28,1445,195004,28,1445,,10096222,OKLAHOMA,40.00,1950,April,Tornado,C,149,WASHITA,,28-APR-50 14:45:00,CST,28-APR-50 14:45:00,0,0,0,0,250K,0,,0.00,,,,F3,3.40,400.00,,,,,0.00,,,0.00,,,35.12,-99.20,35.17,-99.20,,,PUB
1,195004,29,1530,195004,29,1530,,10120412,TEXAS,48.00,1950,April,Tornado,C,93,COMANCHE,,29-APR-50 15:30:00,CST,29-APR-50 15:30:00,0,0,0,0,25K,0,,0.00,,,,F1,11.50,200.00,,,,,0.00,,,0.00,,,31.90,-98.60,31.73,-98.60,,,PUB
2,195007,5,1800,195007,5,1800,,10104927,PENNSYLVANIA,42.00,1950,July,Tornado,C,77,LEHIGH,,05-JUL-50 18:00:00,CST,05-JUL-50 18:00:00,2,0,0,0,25K,0,,0.00,,,,F2,12.90,33.00,,,,,0.00,,,0.00,,,40.58,-75.70,40.65,-75.47,,,PUB
3,195007,5,1830,195007,5,1830,,10104928,PENNSYLVANIA,42.00,1950,July,Tornado,C,43,DAUPHIN,,05-JUL-50 18:30:00,CST,05-JUL-50 18:30:00,0,0,0,0,2.5K,0,,0.00,,,,F2,0.00,13.00,,,,,0.00,,,0.00,,,40.60,-76.75,,,,,PUB
4,195007,24,1440,195007,24,1440,,10104929,PENNSYLVANIA,42.00,1950,July,Tornado,C,39,CRAWFORD,,24-JUL-50 14:40:00,CST,24-JUL-50 14:40:00,0,0,0,0,2.5K,0,,0.00,,,,F0,0.00,33.00,,,,,0.00,,,0.00,,,41.63,-79.68,,,,,PUB
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40032,202006,18,600,202006,19,600,147668.00,899187,IOWA,19.00,2020,June,Heavy Rain,C,69,FRANKLIN,DMX,18-JUN-20 06:00:00,CST-6,19-JUN-20 06:00:00,0,0,0,0,0.00K,0.00K,COOP Observer,,,,,,,,,,,,1.00,S,HAMPTON,1.00,S,HAMPTON,42.74,-93.20,42.74,-93.20,An upper-level trough with associated surface ...,Coop observer reported a 24 hour rainfall tota...,CSV
40033,202006,18,600,202006,19,600,147668.00,899188,IOWA,19.00,2020,June,Heavy Rain,C,69,FRANKLIN,DMX,18-JUN-20 06:00:00,CST-6,19-JUN-20 06:00:00,0,0,0,0,0.00K,0.00K,COOP Observer,,,,,,,,,,,,1.00,N,HAMPTON,1.00,N,HAMPTON,42.76,-93.20,42.76,-93.20,An upper-level trough with associated surface ...,Coop observer reported a 24 hour rainfall tota...,CSV
40034,202006,22,2320,202006,24,915,148769.00,899497,IOWA,19.00,2020,June,Flood,C,171,TAMA,DMX,22-JUN-20 23:20:00,CST-6,24-JUN-20 09:15:00,0,0,0,0,0.00K,0.00K,Department of Highways,,,Heavy Rain,,,,,,,,,0.00,N,TRAER,0.00,E,TRAER,42.20,-92.47,42.20,-92.46,A surface low to the northwest of Iowa allowed...,The department of transportation relayed a rep...,CSV
40035,202006,23,38,202006,24,915,148769.00,899498,IOWA,19.00,2020,June,Flood,C,171,TAMA,DMX,23-JUN-20 00:38:00,CST-6,24-JUN-20 09:15:00,0,0,0,0,0.00K,0.00K,Department of Highways,,,Heavy Rain,,,,,,,,,0.00,N,TRAER,0.00,E,TRAER,42.20,-92.47,42.20,-92.46,A surface low to the northwest of Iowa allowed...,Iowa Department of Transportation and the Tama...,CSV


Since we're interested in the textual content of these weather events, the EVENT_NARRATIVE and EPISODE_NARRATIVE columns could be interesting. Let's just try to load all the CSVs into one big DataFrame. Will Colab blow up? We'll see, I guess...

Wow, so it didn't blow up. It looks like it's just using 1.5GB of RAM for the 1.5 million rows?
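
To get pandas' own estimate of a DataFrame's memory footprint rather than reading it off the runtime's RAM indicator, `DataFrame.memory_usage` can help. The sketch below uses a toy DataFrame with made-up rows as a stand-in for `df`. For object-dtype text columns, `deep=True` also counts the strings themselves, so it is usually the more honest number for narrative-heavy data.

```python
import pandas

# Toy stand-in for the real DataFrame; the values are made up for illustration
toy = pandas.DataFrame({
    "EVENT_ID": [1, 2, 3],
    "EVENT_NARRATIVE": [
        "Hail the size of quarters.",
        None,
        "Trees down across the county.",
    ],
})

# Shallow: fixed-size column storage only; deep: also inspects stored strings
shallow_bytes = int(toy.memory_usage(deep=False).sum())
deep_bytes = int(toy.memory_usage(deep=True).sum())
print(shallow_bytes, deep_bytes)
```

On the real data the same two calls on `df` should give a figure to compare against the 1.5GB reading.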


## Narrative Text

Let's take a look at how many rows have an **EPISODE_NARRATIVE**.

In [None]:
episodes = df[df['EPISODE_NARRATIVE'].notnull()]
len(episodes) / len(df)

0.7109169441003286

So about 71% of the events have an episode narrative. How about the EVENT_NARRATIVE?



In [None]:
events = df[df['EVENT_NARRATIVE'].notnull()]
len(events) / len(df)

0.4982583244197391

Far fewer: just under 50% have an event narrative. About how long are these descriptions?


In [None]:
df['EPISODE_NARRATIVE'].str.len().describe()

count   1134536.00
mean        615.10
std         932.01
min           1.00
25%         196.00
50%         351.00
75%         681.00
max       29050.00
Name: EPISODE_NARRATIVE, dtype: float64

Without a float format set, pandas shows these numbers in scientific notation, which is kinda hard to read. Let's adjust the default format for floats.

In [None]:
pandas.set_option('display.float_format', lambda x: '%.2f' % x)
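
To see what that formatter does to a single value, here is a plain-Python sketch with no pandas involved. It applies the same `'%.2f'` pattern to the EPISODE_NARRATIVE count from the summary above and compares it with scientific notation. The thousands-separator variant is an optional extra, not something the notebook uses.

```python
value = 1134536.0  # the EPISODE_NARRATIVE count from the summary above

fixed = '%.2f' % value             # what the lambda passed to set_option returns
scientific = '%e' % value          # scientific notation, for comparison
grouped = '{:,.2f}'.format(value)  # optional variant with thousands separators

print(fixed, scientific, grouped)
```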

In [45]:
episodes['EVENT_NARRATIVE'].str.len().describe()

count   643153.00
mean       146.72
std        198.60
min          1.00
25%         58.00
50%         87.00
75%        154.00
max       8333.00
Name: EVENT_NARRATIVE, dtype: float64

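So the longest episode narrative is 29,050 characters long! The shortest is one character, and on average they are about 615 characters long. Event narratives (for rows that also have an episode narrative) are much shorter: about 147 characters on average, with a maximum of 8,333.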
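
The coverage fractions and length statistics computed above can also be gathered into one small table. This is a sketch on a toy DataFrame with made-up narratives, not the real data. It relies on `notna().mean()` giving the fraction of non-null rows and on `str.len()` returning NaN for missing text, which `mean()` and `max()` then skip.

```python
import pandas

# Toy stand-in for df; the narratives are made up for illustration
toy = pandas.DataFrame({
    "EPISODE_NARRATIVE": [
        "Storms crossed the state.",
        "Storms crossed the state.",
        None,
        "A cold front.",
    ],
    "EVENT_NARRATIVE": ["Hail.", None, None, "Trees down."],
})

cols = ["EPISODE_NARRATIVE", "EVENT_NARRATIVE"]
summary = pandas.DataFrame({
    # notna() yields booleans, so their mean is the fraction of non-null rows
    "coverage": toy[cols].notna().mean(),
    # str.len() is NaN where the text is missing; mean() and max() skip NaN
    "mean_len": toy[cols].apply(lambda s: s.str.len().mean()),
    "max_len": toy[cols].apply(lambda s: s.str.len().max()),
})
print(summary)
```

Swapping `toy` for `df` should reproduce the 71% and 50% figures along with the per-column lengths in a single cell.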

In [48]:
events_text_len = events['EVENT_NARRATIVE'].str.len().sum()
events_text_len

114256773

In [53]:
ulysses_len = len(http.get('https://www.gutenberg.org/files/4300/4300-0.txt').text)
ulysses_len

1586488

In [54]:
events_text_len / ulysses_len

72.01868088507446

So there are about 72 Ulysses' worth of event narrative text :-)
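
For anyone who wants to check the arithmetic without re-downloading anything, here is a small sketch using the two totals printed above. One caveat: the Project Gutenberg file includes its own header and license text, so `ulysses_len` is somewhat longer than the novel itself and the ratio is, if anything, a slight underestimate.

```python
events_text_len = 114256773  # total characters of EVENT_NARRATIVE text, from above
ulysses_len = 1586488        # characters in the Project Gutenberg text of Ulysses

# Whole copies of Ulysses that fit, plus the characters left over
whole_copies, leftover_chars = divmod(events_text_len, ulysses_len)
ratio = events_text_len / ulysses_len
print(whole_copies, leftover_chars, round(ratio, 2))
```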