# Algerian Forest Fires

---

## Dataset
1. Combines data recorded in two regions of Algeria-
   - Bejaia region, NE-Algeria
   - Sidi Bel-abbes region, NW-Algeria
2. Size- 244 instances (122 instances per region)
3. Contains 11 attributes/features and 1 target variable/class
4. The 244 instances are split into two classes-
   - Fire (138 instances)
   - Not Fire (106 instances)
   
   
### Attributes/Features
|Attribute|Description|
|:---:|:---:|
|`Date`|Split into 3 columns- `day`, `month`, `year`|
|`Temperature`|Maximum noon temperature ($°C$)|
|`RH`|Relative humidity in $\%$|
|`Ws`|Wind speed in $km/h$|
|`Rain`|Total rainfall in $mm$|
|`FWI`|Fire Weather Index (FWI)|
|`FFMC`|Fine Fuel Moisture Code (FFMC) index from the FWI system|
|`DMC`|Duff Moisture Code (DMC) index from the FWI system|
|`DC`|Drought Code (DC) index from the FWI system|
|`ISI`|Initial Spread Index (ISI) from the FWI system|
|`BUI`|Buildup Index (BUI) from the FWI system|
|`Classes`|Target variable. Two possible values- `fire`, `not fire`|

### Data Source

[Abid, Faroudja. (2019). Algerian Forest Fires. UCI Machine Learning Repository.](https://doi.org/10.24432/C5KW4N)

---

# Dependencies
---

In [1]:
# for data wrangling with dataframes
import pandas as pd
# for numerical computations
import numpy as np
# for data visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# for working with file paths
from pathlib import Path

### Loading the Dataset

In [2]:
# local path
path = Path(r"D:\End-to-End ML Project\dataset\raw\Algerian_forest_fires_dataset_UPDATE.csv")

In [3]:
# load as dataframe
dataset = pd.read_csv(path, header=1) ## using second row as header
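
The effect of `header=1` can be illustrated on a small in-memory file. This is a minimal sketch: the title line and the three columns below are made-up stand-ins, not the contents of the actual CSV.

```python
import io

import pandas as pd

# Hypothetical miniature file: a title line followed by the real header row.
raw = io.StringIO(
    "Some Region Dataset\n"
    "day,month,year\n"
    "1,6,2012\n"
    "2,6,2012\n"
)

# header=1 uses the second line (0-based index 1) as column names
# and discards everything above it.
toy = pd.read_csv(raw, header=1)
print(toy.columns.tolist())
```

Any further title or header lines that appear lower down in a file are not affected by `header` and are read as ordinary data rows.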

In [4]:
# check if dataframe loaded properly
dataset.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire


In [5]:
# attributes and datatypes in dataset
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246 entries, 0 to 245
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   day          246 non-null    object
 1   month        245 non-null    object
 2   year         245 non-null    object
 3   Temperature  245 non-null    object
 4    RH          245 non-null    object
 5    Ws          245 non-null    object
 6   Rain         245 non-null    object
 7   FFMC         245 non-null    object
 8   DMC          245 non-null    object
 9   DC           245 non-null    object
 10  ISI          245 non-null    object
 11  BUI          245 non-null    object
 12  FWI          245 non-null    object
 13  Classes      244 non-null    object
dtypes: object(14)
memory usage: 27.0+ KB


In [6]:
dataset.dtypes

day            object
month          object
year           object
Temperature    object
 RH            object
 Ws            object
Rain           object
FFMC           object
DMC            object
DC             object
ISI            object
BUI            object
FWI            object
Classes        object
dtype: object

---
## Data Cleaning
1. The dataset will be probed for issues that could hinder later exploration and modelling.
2. The data will be cleaned and transformed as needed into the required structure and format.


In [7]:
# check for missing records
nulls = np.sum(dataset.isnull().sum(axis=0))
print(f"Total null values in dataset: {nulls}")

Total null values in dataset: 14


In [8]:
for column in dataset:
    if dataset[column].isnull().sum() > 0:
        print(f"'{column}' column has {dataset[column].isnull().sum()} missing values.")

'month' column has 1 missing values.
'year' column has 1 missing values.
'Temperature' column has 1 missing values.
' RH' column has 1 missing values.
' Ws' column has 1 missing values.
'Rain ' column has 1 missing values.
'FFMC' column has 1 missing values.
'DMC' column has 1 missing values.
'DC' column has 1 missing values.
'ISI' column has 1 missing values.
'BUI' column has 1 missing values.
'FWI' column has 1 missing values.
'Classes  ' column has 2 missing values.


In [9]:
# check rows which contain null values
dataset[dataset.isnull().any(axis=1)]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
122,Sidi-Bel Abbes Region Dataset,,,,,,,,,,,,,
167,14,7.0,2012.0,37.0,37.0,18.0,0.2,88.9,12.9,14.6 9,12.5,10.4,fire,


#### Observations
1. Row no. 122 seems to indicate that all rows from no. 123 onwards are records of the Sidi-Bel Abbes region (NW-Algeria)
2. Records from row no. 0 to 121 are from the other region (Bejaia, NE-Algeria)

### Adding new column for Region
1. This will signify the region from where data of that row was obtained
2. Two possible values-
    - `0`: Bejaia region, NE-Algeria
    - `1`: Sidi-Bel Abbes Region, NW-Algeria
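
As an alternative to hard-coding the row number, the separator row could be located by its text. The following is a minimal sketch on a toy frame; the frame, the variable names, and the choice to search the `day` column for the word "Region" are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the raw data: row 2 is a region label, not a record.
toy = pd.DataFrame({
    "day": ["1", "2", "Sidi-Bel Abbes Region Dataset", "1", "2"],
    "month": ["6", "6", None, "6", "6"],
})

# Locate the separator row by its text instead of a fixed index.
is_separator = toy["day"].astype(str).str.contains("Region", regex=False)
sep_idx = int(is_separator.idxmax())

# Rows before the separator get region 0, the separator and later rows get 1.
toy["Region"] = np.where(toy.index < sep_idx, 0, 1)
print(toy)
```

This keeps the labelling correct even if rows are added or removed before the separator.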

In [10]:
# add region column (`.loc` slicing is inclusive on both ends)
dataset.loc[:121, "Region"] = 0 ## Bejaia region data
dataset.loc[122:, "Region"] = 1 ## Sidi-Bel Abbes region data

In [11]:
# check modified expanded dataset
dataset.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0.0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0.0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0.0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0.0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0.0


In [12]:
# store new dataset in a separate dataframe (explicit copy, not a second reference)
df = dataset.copy()

In [13]:
# convert Region column into `int` dtype
df["Region"] = df["Region"].astype(dtype=int)

# check dtypes
df.dtypes

day            object
month          object
year           object
Temperature    object
 RH            object
 Ws            object
Rain           object
FFMC           object
DMC            object
DC             object
ISI            object
BUI            object
FWI            object
Classes        object
Region          int32
dtype: object

In [14]:
# check for rows with null values
df[df.isnull().any(axis=1)]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
122,Sidi-Bel Abbes Region Dataset,,,,,,,,,,,,,,1
167,14,7.0,2012.0,37.0,37.0,18.0,0.2,88.9,12.9,14.6 9,12.5,10.4,fire,,1


#### Dealing with Null records
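1. Only 2 records have missing values: row 122 is a region label rather than an observation, and row 167 has a malformed `DC` entry (`14.6 9`) that leaves `fire` in the `FWI` column and `Classes` empty.
2. Since 2 records are negligible compared to the dataset size, they will be removed.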
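
The removal step can be sketched on a toy frame before applying it to the real data. The values below are invented; only the pattern (inspect incomplete rows, drop them, renumber the index) mirrors the approach described here.

```python
import numpy as np
import pandas as pd

# Invented values: rows 1 and 3 each contain at least one missing entry.
toy = pd.DataFrame({
    "Temperature": [29.0, np.nan, 26.0, 25.0],
    "Classes": ["not fire", None, "fire", None],
})

# Rows that contain at least one missing value.
incomplete = toy[toy.isnull().any(axis=1)]

# Drop them, then renumber so the index has no gaps.
cleaned = toy.dropna(axis=0).reset_index(drop=True)
print(f"Dropped {len(incomplete)} of {len(toy)} rows")
```

Without `reset_index(drop=True)` the surviving rows keep their old labels, so positional and label-based lookups would no longer agree.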

In [15]:
df

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,01,06,2012,29,57,18,0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,02,06,2012,29,61,13,1.3,64.4,4.1,7.6,1,3.9,0.4,not fire,0
2,03,06,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,04,06,2012,25,89,13,2.5,28.6,1.3,6.9,0,1.7,0,not fire,0
4,05,06,2012,27,77,16,0,64.8,3,14.2,1.2,3.9,0.5,not fire,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
241,26,09,2012,30,65,14,0,85.4,16,44.5,4.5,16.9,6.5,fire,1
242,27,09,2012,28,87,15,4.4,41.1,6.5,8,0.1,6.2,0,not fire,1
243,28,09,2012,27,87,29,0.5,45.9,3.5,7.9,0.4,3.4,0.2,not fire,1
244,29,09,2012,24,54,18,0.1,79.7,4.3,15.2,1.7,5.1,0.7,not fire,1


In [16]:
# drop instances with NaN values
df = df.dropna(axis=0)

# check dataframe
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0


In [18]:
# reset index
df = df.reset_index(drop=True)

# check dataframe
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0


In [19]:
print(f"Dataframe has {np.sum(df.isnull().sum(axis=0))} missing values.")

Dataframe has 0 missing values.


In [20]:
# check for any remaining rows with NaN
df[df.isnull().any(axis=1)]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region


In [21]:
# check row no. 122
df.iloc[121:123, :]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
121,30,09,2012,25,78,14,1.4,45,1.9,7.5,0.2,2.4,0.1,not fire,0
122,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,1


In [22]:
# drop row no. 122
df = df.drop(122).reset_index(drop=True)

# check
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0


In [23]:
# check row no. 122 after cleaning
df.iloc[[122], :]

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
122,1,6,2012,32,71,12,0.7,57.1,2.5,8.2,0.6,2.8,0.2,not fire,1


In [24]:
# count null records by columns
df.isnull().sum(axis=0)

day            0
month          0
year           0
Temperature    0
 RH            0
 Ws            0
Rain           0
FFMC           0
DMC            0
DC             0
ISI            0
BUI            0
FWI            0
Classes        0
Region         0
dtype: int64
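
The repeated header row dropped earlier was found by position. A possible alternative, sketched here on a toy frame with invented values, is to flag any row whose `day` cell holds the literal column name; checking a single column is an assumption that would need verifying against the full data.

```python
import pandas as pd

# Toy frame: row 2 is a stray copy of the header.
toy = pd.DataFrame({
    "day": ["1", "2", "day", "1"],
    "month": ["6", "6", "month", "6"],
})

# Flag rows where the cell repeats the column name.
is_header = toy["day"].eq("day")

# Keep everything else and renumber the index.
cleaned = toy[~is_header].reset_index(drop=True)
print(cleaned)
```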

### Cleaning the Column names
1. Some column names have unnecessary leading/trailing whitespace

In [25]:
df.columns

Index(['day', 'month', 'year', 'Temperature', ' RH', ' Ws', 'Rain ', 'FFMC',
       'DMC', 'DC', 'ISI', 'BUI', 'FWI', 'Classes  ', 'Region'],
      dtype='object')

In [26]:
# remove whitespaces and format column names
col_names = list((df.columns).map(lambda name: str(name).strip()))

# set column names
df.columns = col_names
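
The same cleanup can also be written with the vectorised string accessor on the column index. A minimal sketch on an empty toy frame whose column names copy the padded ones shown above:

```python
import pandas as pd

# Empty toy frame with padded column names.
toy = pd.DataFrame(columns=["day", " RH", " Ws", "Rain ", "Classes  "])

# Index.str.strip() removes leading and trailing whitespace from every name.
toy.columns = toy.columns.str.strip()
print(toy.columns.tolist())
```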

### Transforming Datatype of Attributes
1. The following attributes will be converted into `int` datatype-
    - `month`, `day`, `year`, `Temperature`, `RH`, `Ws`
2. The following attributes will be converted into `float` datatype-
    - `Rain`, `FFMC`, `DMC`, `DC`, `ISI`, `BUI`, `FWI`
3. The following attribute will remain `object` datatype-
    - `Classes`
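
Before casting, it can help to confirm that every value actually parses as a number, since `astype` raises on the first bad string. The sketch below uses `pd.to_numeric` with `errors="coerce"` on a toy frame; the values are invented and the check is a suggestion, not part of the original workflow.

```python
import pandas as pd

# Toy frame: the last row holds text that cannot be parsed as numbers.
toy = pd.DataFrame({
    "Temperature": ["29", "26", "Temperature"],
    "Rain": ["0.0", "13.1", "Rain "],
})

# errors="coerce" turns unparseable strings into NaN instead of raising.
as_numbers = toy.apply(pd.to_numeric, errors="coerce")

# Rows where at least one value could not be parsed.
bad_rows = toy[as_numbers.isnull().any(axis=1)]
print(bad_rows)
```

An empty `bad_rows` suggests the subsequent `astype(int)` and `astype(float)` calls should succeed.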

In [27]:
# columns to be converted to int
col_to_int = ['month','day','year','Temperature','RH','Ws']

# convert into int
df[col_to_int] = df[col_to_int].astype(int)

In [28]:
# columns to be converted to float
col_to_float = ['Rain', 'FFMC','DMC','DC','ISI','BUI','FWI']

# convert into float
df[col_to_float] = df[col_to_float].astype(float)

In [29]:
# check corrected datatypes
df.dtypes

day              int32
month            int32
year             int32
Temperature      int32
RH               int32
Ws               int32
Rain           float64
FFMC           float64
DMC            float64
DC             float64
ISI            float64
BUI            float64
FWI            float64
Classes         object
Region           int32
dtype: object

In [30]:
df.head()

Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,Region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0


### Saving the Cleaned Dataset as `.csv` file

In [31]:
# local storage path
path_cleaned = Path(r"D:\End-to-End ML Project\dataset\processed\Algerian_forest_fires_dataset_CLEAN.csv")

# save as csv
df.to_csv(path_cleaned, index=False)
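
A quick round-trip check can confirm that the saved file reloads with the expected shape and columns. This is a minimal sketch that writes a toy frame to a temporary directory; the file name and values are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "day": [1, 2],
    "Rain": [0.5, 2.5],
    "Classes": ["not fire", "fire"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_clean.csv"  # hypothetical file name
    # index=False keeps the row index out of the file.
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_columns = reloaded.columns.tolist() == toy.columns.tolist()
same_shape = reloaded.shape == toy.shape
print(same_columns, same_shape)
```

Saving with `index=False` avoids an extra `Unnamed: 0` column appearing when the file is read back.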

---