# Forest Fires in Northeastern Portugal

In this project we analyze forest fire burn area, location, and meteorological data in order to predict the burned area of forest fires in the Montesinho Natural Park in the northeast region of Portugal.

The dataset used in this project is from UC Irvine's Machine Learning Repository. As required, here is the citation for the dataset:

P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data. In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence, Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December, Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9. Available at: http://www.dsi.uminho.pt/~pcortez/fires.pdf

## Framing the Problem

Forest fires are a natural occurrence and are in fact necessary for the propagation of many plant species. However, the size and severity of forest fires have increased in recent years, driven by factors that climate change intensifies, such as a decrease in the amount of precipitation received in forested areas and an increase in the number and intensity of heatwaves during the summer months.

The increasing frequency and severity of forest fires also affect the human population: lives are lost, housing is destroyed, families are displaced, and the fires cause costly damage. Additionally, millions of dollars are spent supplying firefighters with the equipment they need to contain the fires.

By combining data about past fires with machine learning techniques, we can train models to predict quantities such as the size and number of wildfires, which helps us prepare for them. Better preparation could in turn reduce both the financial cost and the loss of life.

The dataset, provided by UC Irvine, compiles 517 instances of wildfires with associated location and meteorological variables. Our aim is to train a machine learning model to predict the size of the burned area. Because we are predicting a number and we have labeled data to train with, this is a supervised regression task. Performance of the model will be measured with the RMSE or MSE.
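
To make the performance measure concrete, here is a minimal sketch of how MSE and RMSE could be computed. The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from the dataset or from any model trained in this project.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Made-up burned areas (ha) and made-up predictions, for illustration only
actual = [0.0, 0.0, 4.0, 10.0]
predicted = [1.0, 0.0, 2.0, 6.0]

print("MSE: ", mse(actual, predicted))
print("RMSE:", rmse(actual, predicted))
```

RMSE is expressed in the same unit as the target (hectares here), which makes it easier to interpret than MSE. Both penalize large errors more heavily than small ones.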

First, let's import the dataset.

In [2]:
import os
import pandas as pd

FOREST_FIRES_FILENAME = 'forestfires.csv'

def load_forest_fire_data(fires_path=FOREST_FIRES_FILENAME):
    """Load the forest fires CSV from the local 'datasets' directory."""
    csv_path = os.path.join('datasets', fires_path)
    return pd.read_csv(csv_path)

forest_fires = load_forest_fire_data()

In [3]:
forest_fires.head()

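|   | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|-------|-----|------|-----|----|-----|------|----|------|------|------|
| 0 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | 8.2 | 51 | 6.7 | 0.0 | 0.0 |
| 1 | 7 | 4 | oct | tue | 90.6 | 35.4 | 669.1 | 6.7 | 18.0 | 33 | 0.9 | 0.0 | 0.0 |
| 2 | 7 | 4 | oct | sat | 90.6 | 43.7 | 686.9 | 6.7 | 14.6 | 33 | 1.3 | 0.0 | 0.0 |
| 3 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97 | 4.0 | 0.2 | 0.0 |
| 4 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99 | 1.8 | 0.0 | 0.0 |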


The columns in the dataset represent the following data:

    1. X - x-axis spatial coordinate within the Montesinho park map: 1 to 9
    2. Y - y-axis spatial coordinate within the Montesinho park map: 2 to 9
    3. month - month of the year: "jan" to "dec" 
    4. day - day of the week: "mon" to "sun"
    5. FFMC - FFMC index from the FWI system: 18.7 to 96.20
    6. DMC - DMC index from the FWI system: 1.1 to 291.3 
    7. DC - DC index from the FWI system: 7.9 to 860.6 
    8. ISI - ISI index from the FWI system: 0.0 to 56.10
    9. temp - temperature in Celsius degrees: 2.2 to 33.30
    10. RH - relative humidity in %: 15.0 to 100
    11. wind - wind speed in km/h: 0.40 to 9.40 
    12. rain - outside rain in mm/m2: 0.0 to 6.4
    13. area - the burned area of the forest (in ha): 0.00 to 1090.84
        (this output variable is very skewed towards 0.0, so it may make
        sense to model it with a logarithm transform).
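
The note on `area` suggests a logarithm transform. Because the range starts at 0.0, a plain logarithm is undefined for the smallest values. One common workaround, sketched below, is log(1 + x), which maps 0 to 0. The helper names are hypothetical, and the sample values are made up for illustration (only 0.0 and 1090.84 come from the range stated above).

```python
import math

# Illustrative burned areas in ha; the intermediate values are made up
sample_areas = [0.0, 0.0, 0.52, 10.0, 1090.84]

def to_log_area(area):
    """Compress the skewed target with log(1 + area); 0 ha stays at 0."""
    return math.log1p(area)

def from_log_area(log_area):
    """Invert the transform so predictions can be reported in ha."""
    return math.expm1(log_area)

log_areas = [to_log_area(a) for a in sample_areas]
recovered = [from_log_area(v) for v in log_areas]

for area, log_area in zip(sample_areas, log_areas):
    print(f"{area:>8.2f} ha -> {log_area:.3f}")
```

On the loaded DataFrame, the vectorized equivalent would be `np.log1p(forest_fires['area'])`, assuming NumPy is imported as `np`. If a model is trained on the transformed target, its predictions need to be passed through the inverse (`expm1`) before computing an error in hectares.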