# COGS 108 - Data Checkpoint

# Names

- Kairi Sageshima
- Brandon Wang
- Marisol Jimenez
- Ashley Chu
- Daniel Milton

<a id='research_question'></a>
# Research Question

*Which factor is most influential in predicting the likelihood of a forest fire in the Northern Region (Department of Forestry and Fire Protection) of California (humidity, amount of combustible materials like wood, precipitation, temperature, month, etc.)? Can we use such a variable to predict and serve as an early warning system for wildfires in California?*

# Dataset(s)

Dataset Name: Wildfire Incident Database:
- Link to the dataset: https://gis.data.ca.gov/datasets/e3802d2abf8741a187e73a9db49d68fe_0/explore?showTable=true
- Number of observations: 21,318 observations
This database logs wildfire incidents by county/forest in California, and also records the cause, timestamp recorded, and the timestamp it was contained. 


Dataset Name: Weather Database
- Link to the dataset: https://www.ncdc.noaa.gov/cag/county/mapping/4/tavg/201802/1/value
- Number of observations: 88,219 observations
This database holds the average temperature for each county of California in each month from 2018-2020.

We plan to use the wildfire incident database to gather observations of wildfires in our selected counties by timestamp and location. We then plan to match the timestamp and location from the fire to the environmental variables from the weather database, so we can see the temperature at the given location and time. 

# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
## Import the wildfire data csv
df = pd.read_csv('fire.csv')
## Import the weather data csv
dfw = pd.read_csv('temp.csv')

df

Unnamed: 0,OBJECTID,YEAR_,STATE,AGENCY,UNIT_ID,FIRE_NAME,INC_NUM,ALARM_DATE,CONT_DATE,CAUSE,COMMENTS,REPORT_AC,GIS_ACRES,C_METHOD,OBJECTIVE,FIRE_NUM,SHAPE_Length,SHAPE_Area
0,21440,2020.0,CA,CDF,NEU,NELSON,00013212,2020/06/18 00:00:00+00,2020/06/23 00:00:00+00,11.0,,110.0,109.602500,1.0,1.0,,4179.743142,-7.331347e+05
1,21441,2020.0,CA,CDF,NEU,AMORUSO,00011799,2020/06/01 00:00:00+00,2020/06/04 00:00:00+00,2.0,,670.0,685.585020,1.0,1.0,,12399.375391,-4.578172e+06
2,21442,2020.0,CA,CDF,NEU,ATHENS,00018493,2020/08/10 00:00:00+00,2020/03/01 00:00:00+00,14.0,,26.0,27.300480,1.0,1.0,,2119.194120,-1.823876e+05
3,21443,2020.0,CA,CDF,NEU,FLEMING,00007619,2020/03/31 00:00:00+00,2020/04/01 00:00:00+00,9.0,,13.0,12.931550,1.0,1.0,,2029.524881,-8.667942e+04
4,21444,2020.0,CA,CDF,NEU,MELANESE,00008471,2020/04/14 00:00:00+00,2020/04/19 00:00:00+00,18.0,,10.3,10.315960,1.0,1.0,,1342.742903,-7.017912e+04
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21313,42760,2019.0,CA,CCO,LAC,MUREAU,,2019/10/30 00:00:00+00,2019/10/30 00:00:00+00,,,,6.297734,,1.0,,897.323534,-3.730642e+04
21314,42761,2019.0,CA,LRA,,OAK,,2019/10/28 00:00:00+00,2019/10/28 00:00:00+00,14.0,NPS#596 Rapid ROS in light grassy 1yr old fuels,,8.482671,8.0,1.0,,1215.382952,-5.024716e+04
21315,42762,2019.0,CA,LRA,LDF,BARHAM,00000845,2019/11/09 00:00:00+00,2019/11/10 00:00:00+00,14.0,LACFD 0845,,64.888229,8.0,1.0,,4093.657796,-3.843117e+05
21316,42763,2019.0,CA,NPS,MNP,STAR,00013598,,,14.0,,,66.587181,8.0,1.0,,4777.042672,-4.051741e+05


# Data Cleaning

### Overview of cleaning steps

**Pre-processing steps:**

1. Decide which counties and which dates to look at: 20 counties in Northern California, from 2018 to 2020 -- what was the average temperature like in each county when each wildfire happened?

2. Make a dictionary of the 20 counties linking their full names to the region codes used in the fire dataset so we can link the datasets later.

**Clean fire dataset**

3. Drop unused columns.

4. Rename/convert columns to lowercase for better readability and filter dataframe to only contain the 20 counties we want to look at.

5. Convert dates to ints then filter dataframe to only contain entries from years 2018-2020.

6. Drop any rows with missing values.

**Clean weather dataset**

7. Drop unused columns.

8. Rename/convert columns to lowercase for better readability.

9. Convert the county full name to a county_id to represent the region to match what is used in the wildfire dataset. 

10. Filter dataframe to only contain entries from years 2018-2020.

11. Dropped any rows with missing values.

In [3]:
# County Dictionary
county_dict = {
    'Yuba County': 'NEU',
    'Nevada County': 'NEU',
    'Placer County': 'NEU',
    'Butte County' : 'BTU',
    'San Mateo County' : 'CZU',
    'Santa Cruz County' : 'CZU',
    'Mendocino County' : 'MEU',
    'Humboldt County' : 'HUU',
    'Del Norte County' : 'HUU',
    'Tehama County' : 'TGU',
    'Glenn County' : 'TGU',
    'Lassen County' : 'LMU',
    'Modoc County' : 'LMU',
    'Siskiyou County' : 'SKU',
    'Shasta County' : 'SHU',
    'Trinity County' : 'SHU',
    'Santa Clara County' : 'SCU',
    'Sonoma County' : 'LNU',
    'Lake County' : 'LNU',
    'Napa County' : 'LNU',
    'Marin County' : 'MRN',
    'El Dorado County' : 'AEU',
    'Amador County' : 'AEU'
}

# Counties we are interested in
# counties = ['SKU', 'HUU', 'SHU', 'LMU', 'TGU', 'MEU', 'BTU', 'NEU', 'AEU', 'LNU', 'SHF', 'TNF', 'PNF', 'HIA', 'LNF',
#            'KNF', 'MNF', 'SRF', 'BNP', 'RNP']
counties = ['SKU', 'HUU', 'SHU', 'LMU', 'TGU', 'MEU', 'BTU', 'NEU', 'AEU', 'LNU']

## Cleaning fire dataset

In [4]:
## Cleaning fire data set
# Rename columns
df = df.rename(columns = {'UNIT_ID' : 'county'})
# Drop unused columns
df = df.drop(columns = ['OBJECTID', 'AGENCY', 'INC_NUM', 'COMMENTS', 'REPORT_AC', 'SHAPE_Length', 'SHAPE_Area', 'FIRE_NUM', 'OBJECTIVE', 'C_METHOD'])
df = df.reset_index()
# Change to lowercase
df.columns = df.columns.str.lower()
# Only want the counties interested in
df = df[df['county'].isin(counties)]
# Convert dates into ints
df['alarm_date'] = pd.to_datetime(df['alarm_date'])
df['date_conv'] = (df['alarm_date'].dt.year.fillna(0).astype(int)).astype(str) + (df['alarm_date'].dt.month.fillna(0).astype(int)).astype(str)
df['date_conv'] = df['date_conv'].astype(int)
df = df.drop(columns = ['year_', 'alarm_date', 'cont_date'])
# Only want 2018 - 2020
df = df[df.date_conv > 201800]
df = df[df.date_conv < 202100]
# Drop NA values
df = df.dropna()

## Cleaning weather data set 

In [None]:
# Drop unused columns
dfw = dfw.drop(columns = ['Location ID', 'Rank', 'Anomaly (1901-2000 base period)', '1901-2000 Mean']) 
# Rename columns
dfw = dfw.rename(columns = {'Location':'county', 'Value':'temperature'})
# Create ‘region’ column
dfw.assign(region = '')
# Change to lowercase
dfw.columns = dfw.columns.str.lower()

# Function to change County full name to county ID in temp.csv
def County_toID(county):
       if county in county_dict:
            return county.replace(county, county_dict[county])

# Convert counties to regions
new_county = dfw['county'].apply(County_toID)
dfw['region'] = new_county

# Filter out dates
dfw = dfw[dfw.date > 201800]
dfw = dfw[dfw.date < 202100]

# Drop NA values
dfw = dfw.dropna()


In [5]:
df

Unnamed: 0,index,state,county,fire_name,cause,gis_acres,date_conv
12,12,CA,NEU,FIELDS,5.0,55.32843,202010
28,28,CA,HUU,REDWOOD,2.0,101.4003,202010
34,34,CA,HUU,MINE,5.0,11.07413,202012
37,37,CA,BTU,GRAND,14.0,31.06109,202010
45,45,CA,NEU,SIMPSON,14.0,28.57456,202010
47,47,CA,NEU,LOCUST,5.0,15.85951,202012
120,120,CA,SHU,POINT,9.0,48.14772,202010
127,127,CA,SHU,DERSCH,7.0,133.2995,202010
147,147,CA,AEU,CAMERON FIRE,10.0,14.42601,202010
149,149,CA,AEU,LAMBERT FIRE,9.0,21.816271,202010


In [6]:
dfw['region'].unique()

array(['AEU', 'BTU', 'HUU', 'TGU', 'LNU', 'LMU', 'MRN', 'MEU', 'NEU',
       'CZU', 'SCU', 'SHU', 'SKU'], dtype=object)

In [7]:
## Create a temp column for wildfires and grab data from temp dataset based on region

# Create ‘temp’ column
df.assign(temp = '')

temp_region = dfw.groupby(['region', 'date']).mean()
temp_region.loc['BTU', 201805]

# Function to get averages
# @param region
# @param date int
def getTempForRegion(region, date):
    return temp_region.loc[region, date]
        

# Apply function to temp column
df['temp'] = df.apply(lambda x: getTempForRegion(x['county'], x['date_conv']), axis=1)


In [8]:
temp_region.loc['AEU', 201805]

temperature    59.3
Name: (AEU, 201805), dtype: float64

In [9]:
df

Unnamed: 0,index,state,county,fire_name,cause,gis_acres,date_conv,temp
12,12,CA,NEU,FIELDS,5.0,55.32843,202010,64.6
28,28,CA,HUU,REDWOOD,2.0,101.4003,202010,60.3
34,34,CA,HUU,MINE,5.0,11.07413,202012,43.45
37,37,CA,BTU,GRAND,14.0,31.06109,202010,68.4
45,45,CA,NEU,SIMPSON,14.0,28.57456,202010,64.6
47,47,CA,NEU,LOCUST,5.0,15.85951,202012,43.8
120,120,CA,SHU,POINT,9.0,48.14772,202010,61.8
127,127,CA,SHU,DERSCH,7.0,133.2995,202010,61.8
147,147,CA,AEU,CAMERON FIRE,10.0,14.42601,202010,66.2
149,149,CA,AEU,LAMBERT FIRE,9.0,21.816271,202010,66.2


The data is now pretty clean, as each row and column has a value, and we were able to combine the average temperature dataset with the wildfire dataset to find the month's average daily temperature when each fire happened.