# Predicting Algerian Forest Fires with Machine Learning

# Introduction

Starting in June 2012, the Bejaia and Sidi-Bel Abbes regions of Algeria experienced numerous wildfires that likely resulted from a heat wave. The fires burned through Algeria's pine and cork oak forests, damaging a total area of 295 square kilometers.

### Purpose of Analysis

Do weather metrics in the Algerian regions determine the occurrences of forest fires? How distinct are the weather metrics on days with fires from those on days without? How can we detect fires before they occur?

Our team uses machine learning to process features and predict whether or not a fire will occur based on significant attributes. Our model is based on data from Bejaia and Sidi-Bel Abbes, but we aim to build more robust and generalizable models that can help policy makers in different regions implement precautionary measures against fires before they strike. 

### Assumptions

In our analysis, we assume that the response variable is binary: the class is either 'fire' or 'not fire.' We do not consider intermediate situations, such as fires that were close to ignition but ultimately faded. We assume that our observations are unique instances with no extreme outliers. We also assume that the samples are representative and large enough to make justifiable predictions.

# The Data 

### Data Acquisition

The multivariate dataset is acquired from the UCI Machine Learning repository. It can be found at https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++.

The data contain a total of 244 instances for the two regions with 122 instances per region. There are 12 attributes, including date variables and weather metrics, that are measured from June 2012 to September 2012. Descriptions of the attributes can be found at the provided link. There are no missing values.

There are 2 response classes, 'fire' and 'not fire.'
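As a minimal sketch of how the two classes can be turned into a binary target, the snippet below maps the labels to integers. The 0/1 coding, the name `LABEL_TO_INT`, and the four sample labels are assumptions for illustration only; they are not taken from the dataset or from our modeling code.

```python
import pandas as pd

# Made-up labels for illustration; the real column holds 244 entries.
labels = pd.Series(["fire", "not fire", "fire", "fire"], name="Classes")

# Assumed coding: 1 for 'fire', 0 for 'not fire'.
LABEL_TO_INT = {"not fire": 0, "fire": 1}

y = labels.map(LABEL_TO_INT)

# Any label outside the two expected classes becomes NaN; fail loudly if so.
if y.isna().any():
    raise ValueError("Unexpected class label found")

y = y.astype(int)

# Class counts, useful for checking class balance.
counts = y.value_counts().to_dict()
```

Checking `counts` before modeling shows whether one class dominates the other.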

### Data Cleaning

The data are originally formatted as two stacked datasets, where the top corresponds to Bejaia and the bottom corresponds to Sidi-Bel Abbes. We combine them into one dataframe and create a 'Region' variable to distinguish the regions. We reset the index and clean up the column names for easy manipulation.
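The cleaning steps above can be sketched in pandas roughly as follows. The helper name `combine_regions`, the two tiny sample tables, and the stray whitespace in their column names and labels are hypothetical, chosen only to illustrate the idea; the real tables have more columns and 122 rows each.

```python
import pandas as pd


def combine_regions(bejaia: pd.DataFrame, sidi: pd.DataFrame) -> pd.DataFrame:
    """Stack two regional tables into one and tag each row with its region."""
    bejaia = bejaia.assign(Region="Bejaia")
    sidi = sidi.assign(Region="Sidi-Bel Abbes")

    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    combined = pd.concat([bejaia, sidi], ignore_index=True)

    # Tidy the column names: trim whitespace, replace inner spaces.
    combined.columns = combined.columns.str.strip().str.replace(" ", "_")

    # Trim whitespace from the class labels as well.
    combined["Classes"] = combined["Classes"].str.strip()
    return combined


# Hypothetical miniature inputs with messy names and labels.
bejaia_raw = pd.DataFrame(
    {"day": [1, 2], " RH": [57, 61], "Classes  ": ["not fire   ", "fire"]}
)
sidi_raw = pd.DataFrame(
    {"day": [1, 2], " RH": [71, 73], "Classes  ": ["not fire", "fire   "]}
)

df = combine_regions(bejaia_raw, sidi_raw)
```

Adding the 'Region' column before concatenating keeps the origin of each row explicit, so it does not depend on row position after the index is reset.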

# Exploratory Data Analysis

We first examine the correlation between all pairs of attributes except 'date' and 'time'. We find that some variables are highly correlated with one another, such as BUI and DC. We use these findings later, during feature selection for modeling, to avoid multicollinearity.
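One way to flag highly correlated pairs is sketched below. The function name `high_corr_pairs`, the 0.9 threshold, and the five-row sample values are assumptions for illustration; they are not our actual data or the cutoff used in modeling.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return attribute pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()

    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()

    # The comparison also discards any masked (NaN) entries.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic values, not taken from the dataset.
sample = pd.DataFrame(
    {
        "DC": [10, 20, 30, 40, 50],
        "BUI": [1.1, 2.0, 3.2, 3.9, 5.1],
        "RH": [60, 40, 70, 45, 65],
    }
)

flagged = high_corr_pairs(sample, threshold=0.9)
```

For each flagged pair, one of the two attributes is a candidate to drop during feature selection.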

### Pairwise Correlations
<img src="figures/pairplot_all_quant.png" width="600" height="600">

Let us focus on the relationships among the fire behavior indices and the fuel moisture codes. We expect to see correlations. For example, one fuel moisture code should be strongly related to another, as they capture similar information.

### Fire Behavior Indices
<img src="figures/firebehaviour_indices.png" width="400" height="400">

As expected, the variables roughly follow linear relationships and show positive correlation. It is no surprise that, for example, higher FWI accompanies higher BUI, since larger fires are likely to have large buildups.

### Fuel Moisture Codes
<img src="figures/fuel_moisture_codes.png" width="400" height="400">

DMC and DC roughly follow a linear relationship. The remaining relationships seem to be logarithmic, though the curves are quite sharp. As FFMC increases, DC stays consistent until around 80 FFMC, where the spread increases greatly. The FFMC is an inverse measure of moisture content for easily ignited surface litter and other cured fine fuels. Thus, it makes sense that FFMC increases as DC increases.