In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
# from sklearn.tree import DecisionTreeClassifier
# from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# from sklearn.tree import plot_tree
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Load dataset
fires_df = pd.read_csv('fires.csv')

# Clean column names: remove leading/trailing whitespace
# fires_df.columns = fires_df.columns.str.strip()

# Display dataset overview
display(fires_df.head())
print("="*55)
print(" FOREST FIRES DATASET OVERVIEW ")
print("="*55, "\n")
fires_df.info()  # info() prints its report directly and returns None
print()

print("="*55)
print(" DATA SUMMARY ")
print("="*55, "\n")
print(fires_df.describe().transpose(), "\n")

print("="*55)
print(" MISSING VALUES ")
print("="*55, "\n")
print(fires_df.isnull().sum(), "\n")

print("="*55)
print(" DUPLICATE ROWS ")
print("="*55, "\n")
print(f"Number of duplicate rows: {fires_df.duplicated().sum()}\n")

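|   | Unnamed: 0 | X | Y | month | day | FFMC | DMC | DC | ISI | temp | RH | wind | rain | area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 7 | 5 | mar | fri | 86.2 | 26.2 | 94.3 | 5.1 | NaN | 51.0 | 6.7 | 0.0 | 0.0 |
| 1 | 2 | 7 | 4 | oct | tue | 90.6 | NaN | 669.1 | 6.7 | 18.0 | 33.0 | 0.9 | 0.0 | 0.0 |
| 2 | 3 | 7 | 4 | oct | sat | 90.6 | 43.7 | NaN | 6.7 | 14.6 | 33.0 | 1.3 | 0.0 | 0.0 |
| 3 | 4 | 8 | 6 | mar | fri | 91.7 | 33.3 | 77.5 | 9.0 | 8.3 | 97.0 | 4.0 | 0.2 | 0.0 |
| 4 | 5 | 8 | 6 | mar | sun | 89.3 | 51.3 | 102.2 | 9.6 | 11.4 | 99.0 | NaN | 0.0 | 0.0 |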


 FIRES DATASET OVERVIEW 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 517 entries, 0 to 516
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  517 non-null    int64  
 1   X           517 non-null    int64  
 2   Y           517 non-null    int64  
 3   month       517 non-null    object 
 4   day         517 non-null    object 
 5   FFMC        469 non-null    float64
 6   DMC         496 non-null    float64
 7   DC          474 non-null    float64
 8   ISI         515 non-null    float64
 9   temp        496 non-null    float64
 10  RH          487 non-null    float64
 11  wind        482 non-null    float64
 12  rain        485 non-null    float64
 13  area        517 non-null    float64
dtypes: float64(9), int64(3), object(2)
memory usage: 56.7+ KB
None 

 DATA SUMMARY 

            count        mean         std   min      25%     50%      75%  \
Unnamed: 0  517.0  259.000000  149.389312   1.0  1

## Forest Fires Dataset

The dataset used in this project comes from the [UCI Machine Learning Repository – Forest Fires Dataset](https://archive.ics.uci.edu/dataset/162/forest+fires).  
It contains information about forest fires in the Montesinho Natural Park (Portugal) and the meteorological conditions under which they occurred.  
The goal is to predict the **burned area of the forest (in hectares)** given environmental and weather-related features.

---

### Dataset Overview
- **Rows (observations):** 517  
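- **Columns (features):** 12 features + 1 target variable (`area`); the CSV also carries an extra index column, `Unnamed: 0`, which is why `info()` reports 14 columns  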
- **Target Variable:** `area` – burned forest area (in hectares)  
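
The `info()` output lists a leftover `Unnamed: 0` column alongside the 12 features and the target. A minimal sketch of separating features from target follows, assuming that column is only a saved row index (an assumption based on its name and values). The small `toy_df` frame is a hypothetical stand-in for `fires_df`:

```python
import pandas as pd

# Hypothetical stand-in for fires_df, including the leftover index column.
toy_df = pd.DataFrame({
    "Unnamed: 0": [2, 3, 4],
    "X": [7, 7, 8],
    "Y": [4, 4, 6],
    "temp": [18.0, 14.6, 8.3],
    "area": [0.0, 0.0, 0.0],
})

# Drop the index-like column (assumed to carry no predictive information).
clean_df = toy_df.drop(columns=["Unnamed: 0"])

# Split into a feature matrix and a target vector.
features = clean_df.drop(columns=["area"])
target = clean_df["area"]

print(features.shape, target.shape)
```

Applied to the real data, the same two `drop` calls should leave 12 feature columns, matching the count stated above.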

---

### Column Descriptions

1. **X** – x-axis spatial coordinate on the Montesinho park map grid (range: 1–9).
2. **Y** – y-axis spatial coordinate on the Montesinho park map grid (range: 2–9).
3. **month** – Month of the year (`jan`–`dec`).
4. **day** – Day of the week (`mon`–`sun`).
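
Because `month` and `day` are stored as text, most models need them encoded first. One possible approach, sketched here on hypothetical toy rows rather than the notebook's actual preprocessing, is to use ordered categoricals so that codes follow calendar order instead of alphabetical order:

```python
import pandas as pd

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Toy rows standing in for the month/day columns of fires_df.
toy_df = pd.DataFrame({"month": ["mar", "oct", "oct"],
                       "day": ["fri", "tue", "sat"]})

# Ordered categoricals preserve calendar order.
toy_df["month"] = pd.Categorical(toy_df["month"], categories=MONTHS, ordered=True)
toy_df["day"] = pd.Categorical(toy_df["day"], categories=DAYS, ordered=True)

# cat.codes is 0-based, so add 1 to get 1-12 for months and 1-7 for days.
toy_df["month_num"] = toy_df["month"].cat.codes + 1
toy_df["day_num"] = toy_df["day"].cat.codes + 1

print(toy_df)
```

Integer codes suit tree-based models; for linear models, one-hot encoding with `pd.get_dummies` may be a better fit.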

- **Fire Weather Index (FWI) System Variables**  
These indices come from the **Canadian Forest Fire Weather Index (FWI) System**, widely used in fire danger assessment:  
5. **FFMC (Fine Fuel Moisture Code)** – Indicates the moisture content of surface litter and small vegetation; affects fire ignition probability (range: 18.7–96.2).  
6. **DMC (Duff Moisture Code)** – Reflects the moisture of decomposed organic material (duff) beneath the surface; represents medium-term fire potential (range: 1.1–291.3).  
7. **DC (Drought Code)** – Measures long-term dryness in deep, compact organic soil layers; affects fire sustainability (range: 7.9–860.6).  
8. **ISI (Initial Spread Index)** – Combines wind and FFMC to represent the expected fire spread rate at ignition (range: 0.0–56.1).

- **Weather Conditions**  
9. **temp** – Temperature in °C (range: 2.2–33.3).  
10. **RH (Relative Humidity)** – Relative humidity in % (range: 15–100).  
11. **wind** – Wind speed in km/h (range: 0.4–9.4).  
12. **rain** – Rainfall in mm/m² (range: 0.0–6.4).  

- **Target Variable**  
13. **area** – Burned forest area in hectares (range: 0.00–1090.84). Heavily skewed toward zero: most fires are small, while a few extreme fires burn hundreds of hectares.  
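
Since `area` is heavily skewed toward zero, a common option (not necessarily the one this project ends up using) is to model `log(1 + area)` and invert the transform after prediction. A minimal sketch on toy values:

```python
import numpy as np
import pandas as pd

# Toy values imitating the skew of `area`: mostly zeros, one extreme fire.
area = pd.Series([0.0, 0.0, 0.5, 6.4, 1090.84], name="area")

# log1p(x) = log(1 + x) is defined at zero and compresses large values.
log_area = np.log1p(area)

# expm1 is the exact inverse, used to map predictions back to hectares.
recovered = np.expm1(log_area)

print("skew before:", area.skew())
print("skew after: ", log_area.skew())
```

`log1p` is used rather than a plain logarithm because many fires have a burned area of exactly 0, where `log` is undefined.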

---

### Missing Values
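According to the `info()` output, eight features contain missing entries: `FFMC` (48), `DC` (43), `wind` (35), `rain` (32), `RH` (30), `DMC` (21), `temp` (21), and `ISI` (2). These need to be handled during preprocessing; `X`, `Y`, `month`, `day`, and `area` are complete.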
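
One simple way to handle these gaps is median imputation of the numeric columns. The sketch below runs on a hypothetical toy frame and shows one option under that assumption, not necessarily the strategy this project adopts:

```python
import numpy as np
import pandas as pd

# Toy frame with gaps, standing in for fires_df.
toy_df = pd.DataFrame({
    "FFMC": [86.2, 90.6, np.nan, 91.7],
    "wind": [6.7, 0.9, 1.3, np.nan],
    "area": [0.0, 0.0, 0.0, 0.0],
})

# Count missing entries per column before imputing.
missing_counts = toy_df.isnull().sum()

# Fill each numeric column with its own median (robust to outliers).
numeric_cols = toy_df.select_dtypes(include="number").columns
imputed_df = toy_df.copy()
imputed_df[numeric_cols] = imputed_df[numeric_cols].fillna(
    imputed_df[numeric_cols].median()
)

print(missing_counts)
print(imputed_df)
```

In a modeling pipeline, the medians would normally be computed on the training split only and then reused on the test split, to avoid leaking test-set information.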
