# Wildfire and Drought Data Wrangling

### Collect Data

  - &#x2611; Download [wildfire SQLite DB](https://www.kaggle.com/rtatman/188-million-us-wildfires) from Kaggle
  - &#x2611; Download [drought soil and weather CSVs](https://www.kaggle.com/cdminix/us-drought-meteorological-data) from Kaggle
  - &#x2611; Import soil and weather CSVs into SQLite
  - &#x2611; Remove non-California data to keep the dataset more focused
  - &#x2611; Remove wildfire and soil/weather data that does not overlap
  - &#x2611; Load in county FIPS codes and geospatial lat/long into SQLite
  - Add indexes/foreign keys to speed up SQLite
    - &#x2611; year
    - &#x2611; fips
    - &#x2611; long/lat on fires and soil
  - &#x2611; Round latitude and longitude to 1 decimal place (about 11 km)
  - &#x2611; Backfill FIPS_CODE for fire using long/lat (maybe?)
  - Weather by date and long/lat between 2000-01-01 and 2015-12-31 from the NASA POWER API
  - &#x2611; Drought score by date and FIPS county between 2000-01-01 and 2015-12-31
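
As a rough check on the "11 km" figure in the list above, the sketch below converts 0.1 degrees into kilometres. It assumes a spherical Earth with a mean radius of 6371 km, and `degrees_to_km` is a hypothetical helper that is not part of the notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean spherical radius

def degrees_to_km(step_deg, latitude_deg):
    """Return (north-south km, east-west km) spanned by step_deg at a latitude."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360
    north_south = step_deg * km_per_degree
    # Meridians converge toward the poles, so east-west distance shrinks by cos(latitude)
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west

ns_km, ew_km = degrees_to_km(0.1, 37.0)  # 37 N is roughly mid-California
print(f'{ns_km:.1f} km north-south, {ew_km:.1f} km east-west')
```

Under these assumptions a 0.1 degree cell is about 11 km tall but closer to 9 km wide at California's latitudes.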

In [None]:
!pip install -q pandas
!pip install -q pysqlite3
!pip install -q requests
!pip install -q shapely

In [2]:
import pandas as pd
import re
import time
import sqlite3
import shapely.wkt
from shapely.geometry import Point
import requests

In [4]:
county_df = pd.read_html('https://en.wikipedia.org/wiki/User:Michael_J/County_table')[0]
float_degrees = lambda x: float(x.replace('°','').replace('–','-'))
county_df['latitude'] = county_df['Latitude'].apply(float_degrees)
county_df['longitude'] = county_df['Longitude'].apply(float_degrees)
county_df['lat'] = round(county_df['latitude'], 1)
county_df['long'] = round(county_df['longitude'], 1)
county_df['name'] = county_df['County [2]']

county_df = county_df[county_df['State'] == 'CA']
county_df = county_df.loc[:, county_df.columns.intersection(['FIPS', 'name', 'latitude', 'longitude', 'lat', 'long'])]

county_geo_df = pd.read_csv('./county_geospatial.csv')
county_geo_df = county_geo_df.loc[:, county_geo_df.columns.intersection(['name', 'geo_multipolygon'])]

county_df = pd.merge(county_df, county_geo_df, left_on='name', right_on='name')
county_df = county_df.set_index('FIPS')

print(county_df.head())

       latitude   longitude   lat   long       name  \
FIPS                                                  
6001  37.648081 -121.913304  37.6 -121.9    Alameda   
6003  38.617610 -119.798999  38.6 -119.8     Alpine   
6005  38.443550 -120.653856  38.4 -120.7     Amador   
6007  39.665959 -121.601919  39.7 -121.6      Butte   
6009  38.187844 -120.555115  38.2 -120.6  Calaveras   

                                       geo_multipolygon  
FIPS                                                     
6001  MULTIPOLYGON (((-122.3110971410252 37.86340197...  
6003  MULTIPOLYGON (((-119.93538249202298 38.8084818...  
6005  MULTIPOLYGON (((-120.25874105290194 38.5799975...  
6007  MULTIPOLYGON (((-121.6354363647807 40.00088422...  
6009  MULTIPOLYGON (((-120.2108859831663 38.50000349...  


In [282]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')

cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS county')
cur.execute('''CREATE TABLE county (
	fips	  					INTEGER NOT NULL,
	name	  					TEXT NOT NULL,
	latitude 					REAL NOT NULL,
	longitude					REAL NOT NULL,
	lat	    					REAL NOT NULL,
	long	  					REAL NOT NULL,
	geo_multipolygon	TEXT NOT NULL,
	PRIMARY KEY(fips)
);''')

county_df.to_sql('county', conn, if_exists='append')

conn.commit()
conn.close()

In [283]:
ca_bounds = [-180, 90, 180, -90]

for i, county in county_df.iterrows():
  name = county['name']
  geo = shapely.wkt.loads(county['geo_multipolygon'])

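  # "East": geo.bounds is (minx, miny, maxx, maxy), so this keeps the largest
  # western edge (minx) of any county, not the state's eastern edge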
  if (geo.bounds[0] > ca_bounds[0]):
    ca_bounds[0] = geo.bounds[0]

  # South
  if (geo.bounds[1] < ca_bounds[1]):
    ca_bounds[1] = geo.bounds[1]

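  # "West": geo.bounds[2] is maxx, so this keeps the smallest eastern edge
  # of any county, not the state's western edge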
  if (geo.bounds[2] < ca_bounds[2]):
    ca_bounds[2] = geo.bounds[2]

  # North
  if (geo.bounds[3] > ca_bounds[3]):
    ca_bounds[3] = geo.bounds[3]

ca_bounds = tuple(ca_bounds)
print(f'California bounds (east-south, west-north): {ca_bounds}')

California bounds (east-south, west-north): (-116.10618166434291, 32.53402817678555, -123.51814169611895, 42.009834867689875)
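
shapely reports `bounds` as `(minx, miny, maxx, maxy)`, so a box that encloses every county takes the minimum of the first two values and the maximum of the last two. A minimal sketch with made-up bounds tuples (the numbers are illustrative, not real county data, and `enclosing_bounds` is a hypothetical helper):

```python
def enclosing_bounds(all_bounds):
    """Combine (minx, miny, maxx, maxy) tuples into one enclosing box."""
    west = min(b[0] for b in all_bounds)
    south = min(b[1] for b in all_bounds)
    east = max(b[2] for b in all_bounds)
    north = max(b[3] for b in all_bounds)
    return (west, south, east, north)

# Hypothetical bounds for three shapes
sample = [
    (-122.4, 37.5, -121.5, 37.9),
    (-120.1, 38.3, -119.5, 38.9),
    (-124.0, 40.0, -123.4, 41.5),
]
box = enclosing_bounds(sample)
print(f'west-south, east-north: {box}')
```

In the notebook this could be fed with `[shapely.wkt.loads(w).bounds for w in county_df['geo_multipolygon']]`.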


In [435]:
weather_params = [p.strip() for p in re.findall(
r"^\w+",
"""
WS10M_MIN      MERRA2 1/2x1/2 Minimum Wind Speed at 10 Meters (m/s) 
QV2M           MERRA2 1/2x1/2 Specific Humidity at 2 Meters (g/kg) 
T2M_RANGE      MERRA2 1/2x1/2 Temperature Range at 2 Meters (C) 
WS10M          MERRA2 1/2x1/2 Wind Speed at 10 Meters (m/s) 
T2M            MERRA2 1/2x1/2 Temperature at 2 Meters (C) 
WS50M_MIN      MERRA2 1/2x1/2 Minimum Wind Speed at 50 Meters (m/s) 
T2M_MAX        MERRA2 1/2x1/2 Maximum Temperature at 2 Meters (C) 
WS50M          MERRA2 1/2x1/2 Wind Speed at 50 Meters (m/s) 
TS             MERRA2 1/2x1/2 Earth Skin Temperature (C) 
WS50M_RANGE    MERRA2 1/2x1/2 Wind Speed Range at 50 Meters (m/s) 
WS50M_MAX      MERRA2 1/2x1/2 Maximum Wind Speed at 50 Meters (m/s) 
WS10M_MAX      MERRA2 1/2x1/2 Maximum Wind Speed at 10 Meters (m/s) 
WS10M_RANGE    MERRA2 1/2x1/2 Wind Speed Range at 10 Meters (m/s) 
PS             MERRA2 1/2x1/2 Surface Pressure (kPa) 
T2MDEW         MERRA2 1/2x1/2 Dew/Frost Point at 2 Meters (C) 
T2M_MIN        MERRA2 1/2x1/2 Minimum Temperature at 2 Meters (C) 
T2MWET         MERRA2 1/2x1/2 Wet Bulb Temperature at 2 Meters (C) 
PRECTOT        MERRA2 1/2x1/2 Precipitation (mm day-1) 
""",
re.MULTILINE
)]

print(weather_params)

['WS10M_MIN', 'QV2M', 'T2M_RANGE', 'WS10M', 'T2M', 'WS50M_MIN', 'T2M_MAX', 'WS50M', 'TS', 'WS50M_RANGE', 'WS50M_MAX', 'WS10M_MAX', 'WS10M_RANGE', 'PS', 'T2MDEW', 'T2M_MIN', 'T2MWET', 'PRECTOT']


In [9]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

cur.execute('UPDATE fires SET fips = fips_code WHERE fips IS NULL')
cur.execute('UPDATE fires SET LONG = round(LONGITUDE, 1), LAT = round(LATITUDE, 1)')
cur.execute("UPDATE fires SET date = strftime('%Y-%m-%d', discovery_date)")

cur.execute('DROP INDEX IF EXISTS idx_fires_date_long_lat')
cur.execute('CREATE INDEX idx_fires_date_long_lat ON fires(date, long, lat)')

conn.commit()
conn.close()
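
The `strftime` call above only yields a calendar date if `discovery_date` holds something SQLite recognises as a date, such as a Julian day number. Assuming that is how the Kaggle database stores it, the conversion and the 1-decimal rounding can be checked against an in-memory database:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# 2451544.5 is the Julian day number for midnight on 2000-01-01
cur.execute("SELECT strftime('%Y-%m-%d', 2451544.5)")
iso_date = cur.fetchone()[0]

# round() in SQLite mirrors the 1-decimal grid used for long/lat
cur.execute('SELECT round(-121.913304, 1), round(37.648081, 1)')
grid_long, grid_lat = cur.fetchone()

conn.close()
print(iso_date, grid_long, grid_lat)
```

Because the resulting dates are ISO formatted text, they sort and compare correctly as plain strings, which is what the `date` index and later range filters rely on.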

In [255]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')

cur = conn.cursor()

# Keep any rows already fetched; uncomment the DROP to start over
# cur.execute('DROP TABLE IF EXISTS weather_geo')
cur.execute('''CREATE TABLE IF NOT EXISTS weather_geo (
	date							TEXT NOT NULL,
	long						  REAL NOT NULL,
	lat							  REAL NOT NULL,
	fips							INTEGER NOT NULL,
	precipitation			REAL,
	pressure					REAL,
	humidity_2m				REAL,
	temp_2m						REAL,
	temp_dew_point_2m	REAL,
	temp_wet_bulb_2m	REAL,
	temp_max_2m				REAL,
	temp_min_2m				REAL,
	temp_range_2m			REAL,
	temp_0m						REAL,
	wind_10m					REAL,
	wind_max_10m			REAL,
	wind_min_10m			REAL,
	wind_range_10m		REAL,
	wind_50m					REAL,
	wind_max_50m			REAL,
	wind_min_50m			REAL,
	wind_range_50m		REAL,
	PRIMARY KEY(date, long, lat)
);''')

conn.commit()
conn.close()

In [437]:
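def fetch_weather(long, lat, start, end):
    # Daily values for one grid point from the NASA POWER API
    response = requests.get(
      'https://power.larc.nasa.gov/api/temporal/daily/point',
      params={
          'parameters': ','.join(weather_params),
          'community': 'SB',
          'longitude': long,
          'latitude': lat,
          'start': start,
          'end': end,
          'format': 'JSON',
      },
      timeout=120,  # some successful requests take over a minute
    )
    # Raise a clear HTTPError on 4xx/5xx instead of failing later while parsing an HTML error page
    response.raise_for_status()
    return response.json()['properties']['parameter']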

In [455]:
start_date = '20000101'
end_date = '20151231'

conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

cur.execute('SELECT fips FROM county WHERE fips NOT IN (SELECT DISTINCT fips FROM weather_geo)')

for row in cur.fetchall():
  fips = row[0]
  county = county_df.loc[fips]

  name = county['name']
  geo = shapely.wkt.loads(county['geo_multipolygon'])

  long_min = round(geo.bounds[0], 1)
  long_max = round(geo.bounds[2], 1)

  lat_min = round(geo.bounds[1], 1)
  lat_max = round(geo.bounds[3], 1)

  print(f'{name} southwest to northeast: ({lat_min}, {long_min}) to ({lat_max}, {long_max})')

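  # round() rather than int() so floating-point error in `x * 10` cannot shift the grid by one step
  for long in range(round(long_min * 10), round(long_max * 10)):
    for lat in range(round(lat_min * 10), round(lat_max * 10)):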
      point = Point(long / 10, lat / 10)
      start = time.time()

      if geo.contains(point):
        json = fetch_weather(point.x, point.y, start_date, end_date)

        for date in json['TS'].keys():
          cur.execute('''
            INSERT INTO weather_geo (
              date, long, lat, fips, precipitation, pressure, humidity_2m, temp_2m,
              temp_dew_point_2m, temp_wet_bulb_2m, temp_max_2m, temp_min_2m, temp_range_2m,
              temp_0m, wind_10m, wind_max_10m, wind_min_10m, wind_range_10m, wind_50m,
              wind_max_50m, wind_min_50m, wind_range_50m
            )
            VALUES (
              :date, :long, :lat, :fips, :precipitation, :pressure, :humidity_2m, :temp_2m,
              :temp_dew_point_2m, :temp_wet_bulb_2m, :temp_max_2m, :temp_min_2m, :temp_range_2m,
              :temp_0m, :wind_10m, :wind_max_10m, :wind_min_10m, :wind_range_10m, :wind_50m,
              :wind_max_50m, :wind_min_50m, :wind_range_50m
            )
            ''', {
              'date': f'{date[0:4]}-{date[4:6]}-{date[6:8]}',
              'long': point.x,
              'lat': point.y,
              'fips': fips,
              # weather_params requests PRECTOT, but the response keys it as PRECTOTCORR
              'precipitation': json['PRECTOTCORR'][date],
              'pressure': json['PS'][date],
              'humidity_2m': json['QV2M'][date],
              'temp_2m': json['T2M'][date],
              'temp_dew_point_2m': json['T2MDEW'][date],
              'temp_wet_bulb_2m': json['T2MWET'][date],
              'temp_max_2m': json['T2M_MAX'][date],
              'temp_min_2m': json['T2M_MIN'][date],
              'temp_range_2m': json['T2M_RANGE'][date],
              'temp_0m': json['TS'][date],
              'wind_10m': json['WS10M'][date],
              'wind_max_10m': json['WS10M_MAX'][date],
              'wind_min_10m': json['WS10M_MIN'][date],
              'wind_range_10m': json['WS10M_RANGE'][date],
              'wind_50m': json['WS50M'][date],
              'wind_max_50m': json['WS50M_MAX'][date],
              'wind_min_50m': json['WS50M_MIN'][date],
              'wind_range_50m': json['WS50M_RANGE'][date]
            })

        end = time.time()
        print(f'{name} at {point} took {round(end - start, 1)}s')

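  # Commit per county so a failed request loses only the county in progress,
  # which the NOT IN query above selects again on the next run
  conn.commit()

conn.close()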

Colusa southwest to northeast: (38.9, -122.8) to (39.4, -121.8)
Colusa at POINT (-122.7 39.3) took 6.0s
Colusa at POINT (-122.6 39.3) took 5.2s
Colusa at POINT (-122.5 39.2) took 5.7s
Colusa at POINT (-122.5 39.3) took 5.8s
Colusa at POINT (-122.4 39) took 5.5s
Colusa at POINT (-122.4 39.1) took 5.1s
Colusa at POINT (-122.4 39.2) took 11.0s
Colusa at POINT (-122.4 39.3) took 5.8s
Colusa at POINT (-122.3 39) took 6.9s
Colusa at POINT (-122.3 39.1) took 65.3s
Colusa at POINT (-122.3 39.2) took 65.9s
Colusa at POINT (-122.3 39.3) took 5.9s
Colusa at POINT (-122.2 39) took 13.4s
Colusa at POINT (-122.2 39.1) took 7.3s
Colusa at POINT (-122.2 39.2) took 65.5s
Colusa at POINT (-122.2 39.3) took 68.5s
Colusa at POINT (-122.1 39) took 5.5s
Colusa at POINT (-122.1 39.1) took 8.0s
Colusa at POINT (-122.1 39.2) took 5.2s
Colusa at POINT (-122.1 39.3) took 6.0s
Colusa at POINT (-122 39) took 5.8s
Colusa at POINT (-122 39.1) took 5.9s
Colusa at POINT (-122 39.2) took 7.6s
Colusa at POINT (-122 39.3

JSONDecodeError: [Errno Expecting value] <html><body><h1>504 Gateway Time-out</h1>
The server didn't respond in time.
</body></html>
: 0
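
The 504 above suggests the API occasionally times out, so each request may need more than one attempt. A minimal sketch of a retry wrapper with exponential backoff follows; `with_retries`, its parameters, and the `flaky` stand-in are hypothetical and not part of the notebook.

```python
import time

def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as error:  # in the notebook, narrow this to request/JSON errors
            if attempt == attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f'Attempt {attempt + 1} failed ({error}); retrying in {delay}s')
            sleep(delay)

# Stand-in for a request that fails twice before succeeding
calls = {'count': 0}

def flaky():
    calls['count'] += 1
    if calls['count'] < 3:
        raise RuntimeError('504 Gateway Time-out')
    return {'TS': {}}

waits = []
# Record the waits instead of sleeping so the demo runs instantly
result = with_retries(flaky, base_delay=0.5, sleep=waits.append)
```

In the fetch loop the call would become `with_retries(lambda: fetch_weather(point.x, point.y, start_date, end_date))`.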

In [351]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS soil')
cur.execute('''CREATE TABLE soil (
	long										REAL NOT NULL,
	lat											REAL NOT NULL,
	fips										INTEGER NOT NULL,
	latitude								REAL NOT NULL,
	longitude								REAL NOT NULL,
	elevation								INTEGER NOT NULL,
	slope_005								REAL NOT NULL,
	slope_005_02						REAL NOT NULL,
	slope_02_05							REAL NOT NULL,
	slope_05_10							REAL NOT NULL,
	slope_10_15							REAL NOT NULL,
	slope_15_30							REAL NOT NULL,
	slope_30_45							REAL NOT NULL,
	slope_45								REAL NOT NULL,
	aspect_north						REAL NOT NULL,
	aspect_east							REAL NOT NULL,
	aspect_south						REAL NOT NULL,
	aspect_west							REAL NOT NULL,
	aspect_unknown					REAL NOT NULL,
	water_land							REAL NOT NULL,
	barren_land							REAL NOT NULL,
	urban_land							REAL NOT NULL,
	grass_land							REAL NOT NULL,
	forest_land							REAL NOT NULL,
	partial_cultivated_land	REAL NOT NULL,
	irrigated_land					REAL NOT NULL,
	cultivated_land					REAL NOT NULL,
	nutrient								INTEGER NOT NULL,
	nutrient_retention			INTEGER NOT NULL,
	rooting									INTEGER NOT NULL,
	oxygen									INTEGER NOT NULL,
	excess_salts						INTEGER NOT NULL,
	toxicity								INTEGER NOT NULL,
	workablity							INTEGER NOT NULL
)''')

soil_df = pd.read_csv('./soil.csv')
soil_df['lat'] = round(soil_df['latitude'], 1)
soil_df['long'] = round(soil_df['longitude'], 1)

soil_df = soil_df[soil_df['fips'].isin(county_df.index)]

soil_df.to_sql('soil', conn, if_exists='append', index=False)

conn.commit()
conn.close()

In [419]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

cur.execute('DROP TABLE IF EXISTS drought')
cur.execute('''CREATE TABLE drought (
  date          TEXT NOT NULL,
  fips          INTEGER NOT NULL,
  drought_score REAL,
  PRIMARY KEY(date, fips)
)''')

conn.commit()
conn.close()
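
The update cell further down only fills `drought_score` on rows that already exist, so the table needs one row per date and county first. The notebook does not show that step. One way it could be done, assuming daily rows for 2000-01-01 through 2015-12-31 (the `seed_drought` helper is hypothetical):

```python
import sqlite3
from datetime import date, timedelta

def seed_drought(conn, fips_codes, start=date(2000, 1, 1), end=date(2015, 12, 31)):
    """Insert one (date, fips) row per day and county, leaving drought_score NULL."""
    days = (end - start).days + 1
    rows = [
        ((start + timedelta(days=offset)).isoformat(), fips)
        for fips in fips_codes
        for offset in range(days)
    ]
    # INSERT OR IGNORE makes the seeding safe to re-run against the primary key
    conn.executemany('INSERT OR IGNORE INTO drought (date, fips) VALUES (?, ?)', rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(':memory:')  # in-memory stand-in for fires.sqlite
conn.execute('''CREATE TABLE drought (
  date          TEXT NOT NULL,
  fips          INTEGER NOT NULL,
  drought_score REAL,
  PRIMARY KEY(date, fips)
)''')

seeded = seed_drought(conn, [6001, 6003])
pending = conn.execute('SELECT COUNT(*) FROM drought WHERE drought_score IS NULL').fetchone()[0]
conn.close()
print(f'{seeded} rows seeded, {pending} awaiting a drought score')
```

With rows seeded this way, `SELECT DISTINCT fips FROM drought WHERE drought_score IS NULL` returns every county that still needs a fetch.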

In [402]:
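def fetch_drought(fips):
    # Drought severity percentages for one county from the US Drought Monitor data service
    response = requests.get(
        'https://usdmdataservices.unl.edu/api/CountyStatistics/GetDroughtSeverityStatisticsByAreaPercent',
        params={
            'aoi': fips,
            'startdate': '10/1/1999',
            'enddate': '12/31/2015',
            'statisticsType': 1,
        },
        timeout=120,
    )
    # Raise a clear HTTPError on 4xx/5xx instead of a JSON parsing error
    response.raise_for_status()
    return response.json()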

In [422]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

cur.execute('SELECT DISTINCT fips FROM drought WHERE drought_score IS NULL')

for row in cur.fetchall():
  fips = row[0]
  # County FIPS codes are 5 digits; California's lose their leading zero when stored as integers
  fips_5_char = str(fips).zfill(5)

  print(f'Fetch drought score for {fips_5_char}')
  json = fetch_drought(fips_5_char)

  for item in json:
    # Sum the D0-D4 area percentages, each scaled to 0-1
    drought_score = sum(float(item[level]) / 100 for level in ('D0', 'D1', 'D2', 'D3', 'D4'))

    # Backfill Jan 4 score to Jan 1-3 of 2000 as it seems to be missing
    start = '2000-01-01' if item['ValidStart'] <= '2000-01-04' else item['ValidStart']

    drought_params = { 'fips': fips, 'drought_score': drought_score, 'start': start, 'end': item['ValidEnd'] }
    
    cur.execute('''
      UPDATE drought SET
        drought_score = :drought_score
      WHERE
        fips = :fips AND date >= :start AND date <= :end
    ''', drought_params)

  conn.commit()
  
conn.close()


Fetch drought score for 06023
Fetch drought score for 06025
Fetch drought score for 06027
Fetch drought score for 06029
Fetch drought score for 06031
Fetch drought score for 06033
Fetch drought score for 06035
Fetch drought score for 06037
Fetch drought score for 06039
Fetch drought score for 06041
Fetch drought score for 06043
Fetch drought score for 06045
Fetch drought score for 06047
Fetch drought score for 06049
Fetch drought score for 06051
Fetch drought score for 06053
Fetch drought score for 06055
Fetch drought score for 06057
Fetch drought score for 06059
Fetch drought score for 06061
Fetch drought score for 06063
Fetch drought score for 06065
Fetch drought score for 06067
Fetch drought score for 06069
Fetch drought score for 06071
Fetch drought score for 06073
Fetch drought score for 06075
Fetch drought score for 06077
Fetch drought score for 06079
Fetch drought score for 06081
Fetch drought score for 06083
Fetch drought score for 06085
Fetch drought score for 06087
Fetch drou

In [5]:
conn = sqlite3.connect('/Users/eerichmo/Documents/fires.sqlite')
cur = conn.cursor()

# Parse each county polygon once rather than once per fire
county_geos = {
  fips: shapely.wkt.loads(county['geo_multipolygon'])
  for fips, county in county_df.iterrows()
}

# DISTINCT because the UPDATE below already covers every fire at the same coordinates
cur.execute('SELECT DISTINCT longitude, latitude FROM fires WHERE fips IS NULL')

for long, lat in cur.fetchall():
  point = Point(long, lat)

  for fips, geo in county_geos.items():
    if geo.contains(point):
      print(f'{point} is in {fips}')
      cur.execute('''
        UPDATE fires SET fips_code = :fips
        WHERE longitude = :longitude AND latitude = :latitude
      ''', { 'fips': fips, 'longitude': long, 'latitude': lat })
      conn.commit()
      break
  
conn.close()

### Test-Train-Split

#### Splitting for 2000-2015

- Train 2000-13 (13 years)
- Validation 2013-14 (1 year)
- Test 2014-15 (1 year)
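
A chronological split like the one listed above can be expressed as a lookup on the date. The sketch below reads each range as half-open (start year included, end year excluded), which is an assumption since the list does not say how the boundary years are treated. `SPLITS` and `assign_split` are hypothetical names.

```python
# Assumed boundaries: each range includes its start date and excludes its end date
SPLITS = [
    ('train', '2000-01-01', '2013-01-01'),
    ('validation', '2013-01-01', '2014-01-01'),
    ('test', '2014-01-01', '2015-01-01'),
]

def assign_split(iso_date):
    """Return the split name for a 'YYYY-MM-DD' date, or None if it falls outside every range."""
    for name, start, end in SPLITS:
        # ISO dates sort correctly as plain strings
        if start <= iso_date < end:
            return name
    return None

for sample in ('2000-01-01', '2012-12-31', '2013-06-15', '2014-12-31', '2015-07-04'):
    print(sample, assign_split(sample))
```

Under this reading, dates in 2015 are left unassigned. Shifting the boundaries only requires editing `SPLITS`.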

#### How to Process 13 years of data?

- NN might allow batching
- Regression models for 0% - 100% per long/lat grid cell (about 11 km per side)
- Reduce long/lat grid cell size over time (about 5 km per side)
- Visualize with heatmap
- Focus on origin long/lat

#### Models

1.  Linear
2.  Random Forest Regression
3.  ...any regression model