# DATA 512 - Wildfire Analysis - Project Part 1 - Common Analysis

## Data Cleaning and Manipulation

### Baisakhi Sarkar, University of Washington, MSDS 2023-2025


More and more frequently summers in the western US have been characterized by wildfires with smoke billowing across multiple western states. There are many proposed causes for this: climate change, US Forestry policy, growing awareness, just to name a few. Regardless of the cause, the impact of wildland fires is widespread as wildfire smoke reduces the air quality of many cities. There is a growing body of work pointing to the negative impacts of smoke on health, tourism, property, and other aspects of society.

The course project requires us to analyze wildfire impacts on a specific city in the US. The end goal is to inform policy makers, city managers, city councils, or other civic institutions, so that they can decide whether, and how, to plan for mitigating future impacts from wildfires.

The common analysis research question is based on one specific dataset. You should get the [Combined wildland fire datasets for the United States](https://www.sciencebase.gov/catalog/item/61aa537dd34eb622f699df81) and certain territories, 1800s-Present (combined wildland fire polygons) dataset. This dataset was collected and aggregated by the US Geological Survey. The dataset is relatively well documented. The dataset provides fire polygons in ArcGIS and GeoJSON formats. 
You have been assigned a specific US city as the focus of your analysis. You are NOT analyzing the entire dataset. You have been assigned one US city that will form the basis for your individual analysis. You can find your individual US city assignment from a Google spreadsheet.

Wildland fires within 650 miles of Palmdale, California are analyzed for roughly the last 60 years (1964-2021). A smoke estimate is then created to quantify the wildfire smoke impact, which is later modeled to make predictions for roughly the next 30 years (until 2050).

This section of the analysis cleans and preprocesses the wildland fires data that we generated in the Wildfire_Analysis_Data_Acquisition notebook. We examine the columns present in the dataset and state the assumptions under which unnecessary columns are removed.


### Preliminaries

In [2]:
# Modules used in this notebook (pandas is third-party; json and warnings are standard library)
import pandas as pd
import json
import warnings

In [3]:
# Suppress the warning statements
warnings.filterwarnings("ignore")

### Loading and analyzing the Wildfires Dataset

We load the CSV data that we generated in Wildfire_Analysis_Data_Acquisition.ipynb into a pandas dataframe and check which columns are unnecessary for our analysis.

In [5]:
wf_df = pd.read_csv("../Intermediate_files/Past_60_years_wildfires_with_distances_from_Palmdale.csv")
wf_df.head()

Unnamed: 0.1,Unnamed: 0,OBJECTID,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,Fire_Polygon_Tier,Fire_Attribute_Tiers,GIS_Acres,GIS_Hectares,Source_Datasets,...,Listed_Rx_Reported_Acres,Listed_Map_Digitize_Methods,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Closest_Distance_Miles
0,0,14600,14600,Wildfire,1964,1,"1 (1), 3 (3)",65338.87764,26441.705659,Comb_National_NIFC_Interagency_Fire_Perimeter_...,...,,,,,0.263753,,No,112240.801495,264417100.0,78.287919
1,1,14601,14601,Wildfire,1964,1,"1 (2), 3 (3), 5 (1)",55960.694158,22646.489455,Comb_National_NIFC_Interagency_Fire_Perimeter_...,...,,,,,0.224592,,No,112566.141954,226464900.0,367.502205
2,2,14602,14602,Wildfire,1964,1,"1 (2), 3 (3)",19218.105903,7777.29153,Comb_National_NIFC_Interagency_Fire_Perimeter_...,...,,,,,0.138493,,No,84004.974692,77772920.0,37.284539
3,3,14603,14603,Wildfire,1964,1,"1 (2), 3 (3)",18712.49475,7572.677954,Comb_National_NIFC_Interagency_Fire_Perimeter_...,...,,,,,0.39196,,No,49273.004457,75726780.0,490.145821
4,4,14604,14604,Wildfire,1964,1,"1 (4), 3 (6)",16887.001024,6833.926855,Comb_National_NIFC_Interagency_Fire_Perimeter_...,...,,,,,0.392989,,No,46746.577459,68339270.0,513.197125


In [6]:
wf_df.shape

(117163, 28)

Our dataset has 117163 rows and 28 features before processing. List of features below:

In [7]:
wf_df.columns

Index(['Unnamed: 0', 'OBJECTID', 'USGS_Assigned_ID', 'Assigned_Fire_Type',
       'Fire_Year', 'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Acres',
       'GIS_Hectares', 'Source_Datasets', 'Listed_Fire_Types',
       'Listed_Fire_Names', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
       'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes',
       'Listed_Fire_Cause_Class', 'Listed_Rx_Reported_Acres',
       'Listed_Map_Digitize_Methods', 'Wildfire_and_Rx_Flag',
       'Overlap_Within_1_or_2_Flag', 'Circleness_Scale', 'Circle_Flag',
       'Exclude_From_Summary_Rasters', 'Shape_Length', 'Shape_Area',
       'Closest_Distance_Miles'],
      dtype='object')

In this step, several columns that are not useful for the analysis are identified and dropped. The reason for dropping each of these columns is given below:

**OBJECTID**: It is a unique identification for the fire polygon and its attributes. The dataset also has another column named 'USGS_Assigned_ID' which is also a unique identification that provides further consistency. Thus, the OBJECTID column is dropped.

**Fire_Polygon_Tier**: This refers to the tier from which the fire polygon is generated. One or more polygons within the tier can be combined to create the fire polygon. Although this feature is numerical, we do not expect it to add any value to the creation of the smoke estimate or its modeling. Thus, it is dropped.

**Fire_Attribute_Tiers**: The dataset being used is created by combining 40 different data sources. This feature has a list of Polygon Tiers consolidated from all the data sources for each fire. This is irrelevant to the analysis at hand, and is hence dropped.

**GIS_Hectares**: This encapsulates the hectares of the fire polygon calculated by using the Calculate Geometry tool in ArcGIS Pro. Since there is another column representing the same value in the units of acres, this column is dropped.

**Source_Datasets**: This column contains all the original source datasets that contributed to either the polygon or the attributes. This is irrelevant for our analysis.

**Listed_Fire_Types**: This includes each fire type listed in the fires from the merged dataset, where the number of features that contributed to a specific fire type is given in parentheses after the fire type. Since we have kept the 'Assigned_Fire_Type' column in our dataset for now, this column is not needed.

**Listed_Fire_Codes**: This includes each fire code listed in the fires from the merged dataset. Any feature that holds a 'list' of values from the merged dataset is ignored, with the exception of 'Listed_Fire_Names', which is retained.

**Listed_Fire_IDs**: This includes each fire ID listed in the fires from the merged dataset. Since it is a 'list', it is dropped.

**Listed_Fire_IRWIN_IDs**: This includes each fire IRWIN ID listed in the fires from the merged dataset. This is dropped since it is a 'list'.

**Listed_Fire_Dates**: This includes each fire date listed in the fires from the merged dataset. Since we are considering wildfires on a yearly basis, the fire dates are not important for our analysis.

**Listed_Fire_Causes**: This includes each fire cause listed in the fires from the merged dataset. It is a 'list' and is hence dropped.

**Listed_Fire_Cause_Class**: This includes each fire cause class listed in the fires from the merged dataset. While fire cause may seem important for analysis, it cannot be quantified in any manner. Thus, it is dropped.

**Listed_Rx_Reported_Acres**: This contains each prescribed fire's reported acres listed in the fires from the merged dataset. For the area of the fire, we are relying on the column 'GIS_Acres' and hence, this column is dropped.

**Listed_Map_Digitize_Methods**: This includes each fire digitization method listed in the fires from the merged dataset. This does not add any value to our analysis and is thus dropped.
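
Two of the justifications above (that 'USGS_Assigned_ID' is a sufficient unique identifier, and that 'GIS_Hectares' duplicates 'GIS_Acres' in different units) can be checked directly before dropping anything. The following is a minimal sketch, assuming a small stand-in dataframe built from the first three rows shown in the `head()` output above; the names `sample`, `ACRES_TO_HECTARES`, and the 0.1% tolerance are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Stand-in for wf_df: the first three rows from the head() output above.
sample = pd.DataFrame({
    "OBJECTID": [14600, 14601, 14602],
    "USGS_Assigned_ID": [14600, 14601, 14602],
    "GIS_Acres": [65338.87764, 55960.694158, 19218.105903],
    "GIS_Hectares": [26441.705659, 22646.489455, 7777.29153],
})

ACRES_TO_HECTARES = 0.404686  # 1 acre is approximately 0.404686 hectares

# If the retained identifier is unique, OBJECTID carries no extra information.
id_is_unique = sample["USGS_Assigned_ID"].is_unique

# Compare the converted acres against the hectares column, row by row.
converted = sample["GIS_Acres"] * ACRES_TO_HECTARES
relative_gap = (converted - sample["GIS_Hectares"]).abs() / sample["GIS_Hectares"]
hectares_redundant = bool((relative_gap < 1e-3).all())  # assumed 0.1% tolerance

print(id_is_unique, hectares_redundant)
```

Running the same two checks on the full `wf_df` would confirm whether these assumptions hold for every record, not only the sampled rows.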


In [8]:
# Drop all irrelevant columns
wf_df = wf_df.drop(['OBJECTID', 'Fire_Polygon_Tier', 'Fire_Attribute_Tiers', 'GIS_Hectares',
                    'Source_Datasets', 'Listed_Fire_Types', 'Listed_Fire_Codes', 'Listed_Fire_IDs',
                    'Listed_Fire_IRWIN_IDs', 'Listed_Fire_Dates', 'Listed_Fire_Causes', 'Listed_Fire_Cause_Class',
                    'Listed_Rx_Reported_Acres', 'Listed_Map_Digitize_Methods'], axis=1)
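
As an alternative to typing every name by hand, the drop list can be derived from the column names, since most of the dropped features share the 'Listed_' prefix. This is only a sketch under that assumption: it uses an empty dataframe with the 28 column names listed earlier, and the names `keep_listed`, `listed_to_drop`, and `other_to_drop` are hypothetical helpers rather than the notebook's own code.

```python
import pandas as pd

# Column names copied from the wf_df.columns output; no rows are needed here.
columns = [
    "Unnamed: 0", "OBJECTID", "USGS_Assigned_ID", "Assigned_Fire_Type",
    "Fire_Year", "Fire_Polygon_Tier", "Fire_Attribute_Tiers", "GIS_Acres",
    "GIS_Hectares", "Source_Datasets", "Listed_Fire_Types",
    "Listed_Fire_Names", "Listed_Fire_Codes", "Listed_Fire_IDs",
    "Listed_Fire_IRWIN_IDs", "Listed_Fire_Dates", "Listed_Fire_Causes",
    "Listed_Fire_Cause_Class", "Listed_Rx_Reported_Acres",
    "Listed_Map_Digitize_Methods", "Wildfire_and_Rx_Flag",
    "Overlap_Within_1_or_2_Flag", "Circleness_Scale", "Circle_Flag",
    "Exclude_From_Summary_Rasters", "Shape_Length", "Shape_Area",
    "Closest_Distance_Miles",
]
empty_df = pd.DataFrame(columns=columns)

# 'Listed_Fire_Names' is the one 'Listed_' column that is retained.
keep_listed = {"Listed_Fire_Names"}
listed_to_drop = [
    col for col in empty_df.columns
    if col.startswith("Listed_") and col not in keep_listed
]
other_to_drop = ["OBJECTID", "Fire_Polygon_Tier", "Fire_Attribute_Tiers",
                 "GIS_Hectares", "Source_Datasets"]

# Fail early with a clear message if a name is misspelled or already removed.
missing = set(other_to_drop) - set(empty_df.columns)
if missing:
    raise KeyError(f"Columns not found: {sorted(missing)}")

cleaned = empty_df.drop(columns=listed_to_drop + other_to_drop)
print(len(listed_to_drop), cleaned.shape[1])
```

The prefix-based approach is shorter, but the explicit list used in the notebook makes each dropped column visible at a glance, which may be preferable for review.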

In [10]:
wf_df.head()

Unnamed: 0.1,Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Closest_Distance_Miles
0,0,14600,Wildfire,1964,65338.87764,COYOTE (4),,,0.263753,,No,112240.801495,264417100.0,78.287919
1,1,14601,Wildfire,1964,55960.694158,"C. HANLY (5), Hanley (1)",,,0.224592,,No,112566.141954,226464900.0,367.502205
2,2,14602,Wildfire,1964,19218.105903,COZY DELL (5),,,0.138493,,No,84004.974692,77772920.0,37.284539
3,3,14603,Wildfire,1964,18712.49475,HAYFORK HWY. #2 (5),,,0.39196,,No,49273.004457,75726780.0,490.145821
4,4,14604,Wildfire,1964,16887.001024,"MATTOLE (5), ROBERTS COOP. ESCAPE (5)",,,0.392989,,No,46746.577459,68339270.0,513.197125


In [11]:
wf_df.shape

(117163, 14)

We apply a filter to keep only the fires that occurred within 650 miles of our city of interest (Palmdale, CA).

In [13]:
wf_df = wf_df[wf_df['Closest_Distance_Miles'] <= 650]

In [14]:
wf_df.shape

(45891, 14)

After the distance filter is applied, 45891 of the original 117163 records remain (about 39%).
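
The distance filter can be illustrated on a few rows. This is a minimal sketch, assuming a named constant `MAX_DISTANCE_MILES` in place of the literal 650; the first two rows reuse values from the `head()` output, while the third row (ID 99999 at 812.4 miles) is made up purely to show a record being excluded.

```python
import pandas as pd

MAX_DISTANCE_MILES = 650  # radius around Palmdale, CA used in this analysis

sample = pd.DataFrame({
    "USGS_Assigned_ID": [14600, 14601, 99999],
    # The last distance is a hypothetical value beyond the cutoff.
    "Closest_Distance_Miles": [78.287919, 367.502205, 812.4],
})

# Boolean mask: True where the fire lies within the cutoff (inclusive).
within_range = sample["Closest_Distance_Miles"] <= MAX_DISTANCE_MILES

# .copy() gives an independent dataframe, so later column edits do not
# trigger pandas' SettingWithCopyWarning.
nearby = sample.loc[within_range].copy()
rows_removed = len(sample) - len(nearby)

print(len(nearby), rows_removed)
```

Keeping the cutoff in a named constant makes it easier to rerun the cleaning step with a different radius.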

In [15]:
wf_df.head()

Unnamed: 0.1,Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Wildfire_and_Rx_Flag,Overlap_Within_1_or_2_Flag,Circleness_Scale,Circle_Flag,Exclude_From_Summary_Rasters,Shape_Length,Shape_Area,Closest_Distance_Miles
0,0,14600,Wildfire,1964,65338.87764,COYOTE (4),,,0.263753,,No,112240.801495,264417100.0,78.287919
1,1,14601,Wildfire,1964,55960.694158,"C. HANLY (5), Hanley (1)",,,0.224592,,No,112566.141954,226464900.0,367.502205
2,2,14602,Wildfire,1964,19218.105903,COZY DELL (5),,,0.138493,,No,84004.974692,77772920.0,37.284539
3,3,14603,Wildfire,1964,18712.49475,HAYFORK HWY. #2 (5),,,0.39196,,No,49273.004457,75726780.0,490.145821
4,4,14604,Wildfire,1964,16887.001024,"MATTOLE (5), ROBERTS COOP. ESCAPE (5)",,,0.392989,,No,46746.577459,68339270.0,513.197125


A few more columns are dropped at this step because they are not needed for the analysis:

**Circle_Flag**: Indicates if the polygon is circular, but it doesn't contribute to smoke impact estimation, so it's dropped.

**Exclude_From_Summary_Rasters**: Marks polygons excluded from raster summaries, irrelevant for our analysis, hence dropped.

**Overlap_Within_1_or_2_Flag**: Shows polygon overlaps within data tiers, but overlaps don't impact smoke estimates, so it's removed.

**Wildfire_and_Rx_Flag**: Identifies fires as both wildfire and prescribed, but this is redundant with 'Assigned_Fire_Type', so it's dropped.

In [16]:
wf_df = wf_df.drop(['Circle_Flag', 'Exclude_From_Summary_Rasters', 'Overlap_Within_1_or_2_Flag', 'Wildfire_and_Rx_Flag'], axis=1)
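
Before discarding flag columns, it can be useful to see how sparsely they are populated. The sketch below computes the share of missing values per flag column on a stand-in dataframe; the flag values here are invented for illustration (the real distribution in `wf_df` is not shown in this notebook), and `null_share` is a hypothetical name.

```python
import pandas as pd

flag_columns = ["Circle_Flag", "Exclude_From_Summary_Rasters",
                "Overlap_Within_1_or_2_Flag", "Wildfire_and_Rx_Flag"]

# Made-up values: only the column names come from the dataset.
sample = pd.DataFrame({
    "Circle_Flag": [None, None, 1.0, None],
    "Exclude_From_Summary_Rasters": ["No", "No", "Yes", "No"],
    "Overlap_Within_1_or_2_Flag": [None, None, None, None],
    "Wildfire_and_Rx_Flag": [None, None, None, None],
})

# isna() marks missing cells; the column-wise mean is the missing fraction.
null_share = sample[flag_columns].isna().mean()

print(null_share)
```

A column that is almost entirely missing carries little information, which would support the decision to drop it.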

In [17]:
wf_df.head()

Unnamed: 0.1,Unnamed: 0,USGS_Assigned_ID,Assigned_Fire_Type,Fire_Year,GIS_Acres,Listed_Fire_Names,Circleness_Scale,Shape_Length,Shape_Area,Closest_Distance_Miles
0,0,14600,Wildfire,1964,65338.87764,COYOTE (4),0.263753,112240.801495,264417100.0,78.287919
1,1,14601,Wildfire,1964,55960.694158,"C. HANLY (5), Hanley (1)",0.224592,112566.141954,226464900.0,367.502205
2,2,14602,Wildfire,1964,19218.105903,COZY DELL (5),0.138493,84004.974692,77772920.0,37.284539
3,3,14603,Wildfire,1964,18712.49475,HAYFORK HWY. #2 (5),0.39196,49273.004457,75726780.0,490.145821
4,4,14604,Wildfire,1964,16887.001024,"MATTOLE (5), ROBERTS COOP. ESCAPE (5)",0.392989,46746.577459,68339270.0,513.197125


In [18]:
wf_df.shape

(45891, 10)

The final cleaned and processed dataset is exported to CSV format for use in the next notebook, where we will perform smoke estimation and predictive modeling.

In [20]:
# Export the final processed dataframe to a CSV file
wf_df.to_csv('../Intermediate_files/Final_Wildfire_Data_Cleaned.csv', index=False)
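
Passing `index=False` matters here: when a dataframe is written with its index and read back without specifying an index column, pandas adds an 'Unnamed: 0' column, which is likely how the 'Unnamed: 0' column seen in the outputs above originated. The sketch below demonstrates the difference using in-memory buffers instead of files; the two-row dataframe is a stand-in, not the real data.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "USGS_Assigned_ID": [14600, 14601],
    "Fire_Year": [1964, 1964],
})

# Default behaviour: the index is written as an unnamed first column.
buffer_with_index = io.StringIO()
sample.to_csv(buffer_with_index)
buffer_with_index.seek(0)
reloaded_with_index = pd.read_csv(buffer_with_index)

# With index=False the round trip preserves the original columns.
buffer_clean = io.StringIO()
sample.to_csv(buffer_clean, index=False)
buffer_clean.seek(0)
reloaded_clean = pd.read_csv(buffer_clean)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```

If an upstream file already contains such a column, reading it with `pd.read_csv(path, index_col=0)` is one way to absorb it back into the index.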