# 01: Web Scraping

---

The storm event data I want to work with is available from the [National Oceanic and Atmospheric Administration](https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/) website. The data is grouped by year, with one CSV file per year. However, the data goes back to 1950 and the later files in particular are quite large, so it is easier to scrape the file links and download them programmatically than to download each file individually.

## 1. Imports

In [1]:
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

import pandas as pd
import os

---
## 2. Connecting to URL

In [2]:
url = 'https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/'
response = requests.get(url)
print(response)
html = response.text
print(html[:1250])

<Response [200]>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<html>
 <head>
  <title>Index of /pub/data/swdi/stormevents/csvfiles</title>
 </head>
 <body>
<h1>Index of /pub/data/swdi/stormevents/csvfiles</h1>
  <table>
   <tr><th><a href="?C=N;O=D">Name</a></th><th><a href="?C=M;O=A">Last modified</a></th><th><a href="?C=S;O=A">Size</a></th><th><a href="?C=D;O=A">Description</a></th></tr>
   <tr><th colspan="4"><hr></th></tr>
<tr><td><a href="/pub/data/swdi/stormevents/">Parent Directory</a></td><td>&nbsp;</td><td align="right">  - </td><td>&nbsp;</td></tr>
<tr><td><a href="Storm-Data-Bulk-csv-Format.pdf">Storm-Data-Bulk-csv-Format.pdf</a></td><td align="right">2020-07-17 13:10  </td><td align="right">161K</td><td>&nbsp;</td></tr>
<tr><td><a href="Storm-Data-Export-Format.pdf">Storm-Data-Export-Format.pdf</a></td><td align="right">2020-07-17 09:17  </td><td align="right">163K</td><td>&nbsp;</td></tr>
<tr><td><a href="StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz">Stor

---
## 3. Creating Soup Object

In [3]:
soup = BeautifulSoup(html, 'html.parser')

---
## 4. Scraping Relevant Links & Downloading

Relevant links are those that begin with 'StormEvents_details' for years 1950-2022.
- Downloaded in a .csv.gz format
- 73 files total
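
The file names in the directory listing carry the year after `_d` and a second date stamp after `_c`. As a minimal sketch, assuming that naming pattern holds for every details file, the year can be parsed out of each link to confirm that all 73 years are covered. The pattern is inferred from the names shown above rather than from an official specification, and `parse_storm_filename` and `missing_years` are hypothetical helper names.

```python
import re

# Pattern inferred from names such as
# 'StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz' (an assumption,
# not an official description of the naming scheme).
DETAILS_PATTERN = re.compile(
    r'^StormEvents_details-ftp_v1\.0_d(?P<year>\d{4})_c(?P<stamp>\d{8})\.csv\.gz$'
)


def parse_storm_filename(name):
    """Return (year, stamp) for a details file name, or None if it does not match."""
    match = DETAILS_PATTERN.match(name)
    if match is None:
        return None
    return int(match.group('year')), match.group('stamp')


def missing_years(names, first=1950, last=2022):
    """Return the years in [first, last] that have no matching details file."""
    found = {parsed[0] for parsed in map(parse_storm_filename, names) if parsed}
    return [year for year in range(first, last + 1) if year not in found]


example = [
    'StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz',
    'Storm-Data-Export-Format.pdf',
]
print(parse_storm_filename(example[0]))               # (1950, '20210803')
print(missing_years(example, first=1950, last=1951))  # [1951]
```

An empty result from `missing_years` over the scraped links would indicate that every year from 1950 to 2022 has a file.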

In [9]:
%%time

# Help from here: https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460

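# Keep only the yearly "details" files (73 links, 1950-2022)
for a_tag in soup.find_all('a', href=True):
    link = a_tag['href']
    if link.startswith('StormEvents_details'):
        download_url = 'https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/' + link
        # Save under a shortened name, e.g. 'v1.0_d1950_c20210803.csv.gz'
        urllib.request.urlretrieve(download_url, '../data/webscrape_files/' + link[link.find('_v1.0_d') + 1:])
        time.sleep(3)  # pause between requests to avoid overloading the server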

CPU times: user 4.73 s, sys: 4.37 s, total: 9.1 s
Wall time: 5min 27s
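
Since the full download takes several minutes, re-running the cell would fetch every file again. One way to avoid that, sketched here under the assumption that a file already present on disk is complete, is to work out which links still need downloading before entering the loop. `local_name` and `pending_links` are hypothetical helpers; `local_name` reuses the same slicing as the download cell.

```python
import os
import tempfile


def local_name(link):
    """Shorten a link to the name used on disk, e.g. 'v1.0_d1950_c20210803.csv.gz'."""
    return link[link.find('_v1.0_d') + 1:]


def pending_links(links, dest_dir):
    """Return the links whose shortened name is not yet present in dest_dir."""
    existing = set(os.listdir(dest_dir)) if os.path.isdir(dest_dir) else set()
    return [link for link in links if local_name(link) not in existing]


# Demonstration in a temporary directory, with an empty placeholder file
# standing in for a completed download.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'v1.0_d1950_c20210803.csv.gz'), 'wb').close()
    links = [
        'StormEvents_details-ftp_v1.0_d1950_c20210803.csv.gz',
        'StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz',
    ]
    print(pending_links(links, tmp))
    # ['StormEvents_details-ftp_v1.0_d1951_c20210803.csv.gz']
```

This check only looks at file names, so a partially downloaded file would be treated as finished.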


---
## 5. Creating a DataFrame

In [12]:
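# Getting a sorted list of the .csv.gz files in the data folder
# (filtering by extension skips '.ipynb_checkpoints' and any other non-data files)

files = sorted(f for f in os.listdir('../data') if f.endswith('.csv.gz'))
files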

['v1.0_d1950_c20210803.csv.gz',
 'v1.0_d1951_c20210803.csv.gz',
 'v1.0_d1952_c20210803.csv.gz',
 'v1.0_d1953_c20210803.csv.gz',
 'v1.0_d1954_c20210803.csv.gz',
 'v1.0_d1955_c20210803.csv.gz',
 'v1.0_d1956_c20210803.csv.gz',
 'v1.0_d1957_c20210803.csv.gz',
 'v1.0_d1958_c20210803.csv.gz',
 'v1.0_d1959_c20210803.csv.gz',
 'v1.0_d1960_c20210803.csv.gz',
 'v1.0_d1961_c20210803.csv.gz',
 'v1.0_d1962_c20210803.csv.gz',
 'v1.0_d1963_c20210803.csv.gz',
 'v1.0_d1964_c20210803.csv.gz',
 'v1.0_d1965_c20210803.csv.gz',
 'v1.0_d1966_c20210803.csv.gz',
 'v1.0_d1967_c20210803.csv.gz',
 'v1.0_d1968_c20210803.csv.gz',
 'v1.0_d1969_c20210803.csv.gz',
 'v1.0_d1970_c20210803.csv.gz',
 'v1.0_d1971_c20210803.csv.gz',
 'v1.0_d1972_c20220425.csv.gz',
 'v1.0_d1973_c20220425.csv.gz',
 'v1.0_d1974_c20220425.csv.gz',
 'v1.0_d1975_c20220425.csv.gz',
 'v1.0_d1976_c20220425.csv.gz',
 'v1.0_d1977_c20220425.csv.gz',
 'v1.0_d1978_c20220425.csv.gz',
 'v1.0_d1979_c20220425.csv.gz',
 'v1.0_d1980_c20220425.csv.gz',
 'v1.0_d

In [13]:
# Read each gzipped CSV as strings and concatenate them all into one DataFrame
# (on_bad_lines='skip' silently drops malformed rows)

all_storms = pd.concat(
    [
        pd.read_csv('../data/' + file, compression='gzip', header=0, sep=',',
                    quotechar='"', on_bad_lines='skip', dtype=str)
        for file in files
    ]
)
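
Because the read above passes `on_bad_lines='skip'`, malformed rows are dropped without any warning. A rough way to see whether anything was lost is to count the records in each file independently and compare the total with `len(all_storms)`. The sketch below uses only the standard library. `count_data_rows` is a hypothetical helper, the UTF-8 decoding with `errors='replace'` is an assumption about the files, and the sample rows are made up for the demonstration.

```python
import csv
import gzip
import os
import tempfile


def count_data_rows(path):
    """Count the records after the header row in a gzipped CSV file."""
    # Assumption: decoding as UTF-8 with replacement is good enough for counting.
    with gzip.open(path, mode='rt', newline='', encoding='utf-8', errors='replace') as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Demonstration on a tiny made-up file, including a quoted comma and a
# quoted line break, which csv.reader treats as part of a single record.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample.csv.gz')
    with gzip.open(sample, mode='wt', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['EVENT_ID', 'STATE', 'EVENT_NARRATIVE'])
        writer.writerow(['1', 'TEXAS', 'Hail, then wind'])
        writer.writerow(['2', 'OKLAHOMA', 'Line one\nline two'])
    print(count_data_rows(sample))  # 2
```

Summing `count_data_rows('../data/' + file)` over `files` gives a total to set against the DataFrame length; a difference would suggest that some rows were skipped.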

In [14]:
# Resetting the index

all_storms.reset_index(drop=True, inplace=True)

In [15]:
all_storms

| | BEGIN_YEARMONTH | BEGIN_DAY | BEGIN_TIME | END_YEARMONTH | END_DAY | END_TIME | EPISODE_ID | EVENT_ID | STATE | STATE_FIPS | ... | END_RANGE | END_AZIMUTH | END_LOCATION | BEGIN_LAT | BEGIN_LON | END_LAT | END_LON | EPISODE_NARRATIVE | EVENT_NARRATIVE | DATA_SOURCE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 195004 | 28 | 1445 | 195004 | 28 | 1445 | | 10096222 | OKLAHOMA | 40 | ... | 0 | | | 35.12 | -99.20 | 35.17 | -99.20 | | | PUB |
| 1 | 195004 | 29 | 1530 | 195004 | 29 | 1530 | | 10120412 | TEXAS | 48 | ... | 0 | | | 31.90 | -98.60 | 31.73 | -98.60 | | | PUB |
| 2 | 195007 | 5 | 1800 | 195007 | 5 | 1800 | | 10104927 | PENNSYLVANIA | 42 | ... | 0 | | | 40.58 | -75.70 | 40.65 | -75.47 | | | PUB |
| 3 | 195007 | 5 | 1830 | 195007 | 5 | 1830 | | 10104928 | PENNSYLVANIA | 42 | ... | 0 | | | 40.60 | -76.75 | | | | | PUB |
| 4 | 195007 | 24 | 1440 | 195007 | 24 | 1440 | | 10104929 | PENNSYLVANIA | 42 | ... | 0 | | | 41.63 | -79.68 | | | | | PUB |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1740592 | 202203 | 7 | 1641 | 202203 | 7 | 1641 | 167199 | 1011982 | MISSOURI | 29 | ... | | | | | | | | A late season system brought widespread light ... | On Walnut Street near Sheridan, Missouri a veh... | CSV |
| 1740593 | 202203 | 18 | 1625 | 202203 | 18 | 1645 | 167148 | 1011610 | FLORIDA | 12 | ... | 1 | SE | EGGLESTON HGTS | 30.34 | -81.59 | 30.34 | -81.59 | A pre-frontal squall line moved from northwest... | The Duval county 911 call center reported powe... | CSV |
| 1740594 | 202203 | 1 | 0 | 202203 | 31 | 2359 | 166470 | 1006917 | NEW MEXICO | 35 | ... | | | | | | | | As what's typically expected for New Mexico du... | Severe drought conditions from February 2022 c... | CSV |
| 1740595 | 202203 | 7 | 1506 | 202203 | 7 | 1506 | 167226 | 1012159 | MONTANA | 30 | ... | | | | | | | | A vigorous shortwave trough and attendant cold... | Snow Squall observed at KCTB. Visibility fell ... | CSV |


In [16]:
# Writing the dataframe to a csv

all_storms.to_csv('../data/all_storms_original.csv', index=False)