## Data Loading & Cleaning

In this section, we load the dataset into a pandas DataFrame and perform an initial inspection to understand its structure and quality. The goal is to ensure that the data is reliable, consistent, and suitable for analysis and model training.

### Steps Performed

**Initial Data Inspection**  
   The dataset is imported and basic exploratory checks are performed, including examining the shape of the data, column names, and data types. This helps identify whether features are numerical, categorical, or temporal, and whether any type conversions are required. Additionally, we assess the presence of missing or null values across all features.
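
The checks described above can be sketched on a small stand-in frame. The toy data below is an assumption made purely for illustration; it is not taken from the actual dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative assumptions).
toy = pd.DataFrame({
    "day": ["1", "2", None],
    "Temperature": ["29", "26", None],
    "Classes": ["not fire", "fire", None],
})

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # data type of each column

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A non-zero count in any column signals that rows need to be dropped or imputed before analysis.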

---

### Importing the Libraries & Basic Exploration

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('Algerian_forest_fires_dataset.csv')

Checking the dataset

In [3]:
df.head()

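|   | day | month | year | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI | Classes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 6 | 2012 | 29 | 57 | 18 | 0.0 | 65.7 | 3.4 | 7.6 | 1.3 | 3.4 | 0.5 | not fire |
| 1 | 2 | 6 | 2012 | 29 | 61 | 13 | 1.3 | 64.4 | 4.1 | 7.6 | 1.0 | 3.9 | 0.4 | not fire |
| 2 | 3 | 6 | 2012 | 26 | 82 | 22 | 13.1 | 47.1 | 2.5 | 7.1 | 0.3 | 2.7 | 0.1 | not fire |
| 3 | 4 | 6 | 2012 | 25 | 89 | 13 | 2.5 | 28.6 | 1.3 | 6.9 | 0.0 | 1.7 | 0.0 | not fire |
| 4 | 5 | 6 | 2012 | 27 | 77 | 16 | 0.0 | 64.8 | 3.0 | 14.2 | 1.2 | 3.9 | 0.5 | not fire |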


Checking dataset shape

In [4]:
df.shape

(247, 14)

Checking datatypes

In [5]:
df.dtypes

day            object
month          object
year           object
Temperature    object
 RH            object
 Ws            object
Rain           object
FFMC           object
DMC            object
DC             object
ISI            object
BUI            object
FWI            object
Classes        object
dtype: object

Note that the data covers two regions of Algeria: Bejaia and Sidi Bel-abbes. Every column is currently stored as type object, so we cannot run numerical analysis or produce visualizations until the data has been cleaned.

In [6]:
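df.columns = df.columns.str.strip()  # Remove stray whitespace from the column names
df.dropna(how='all', inplace=True)  # Remove the completely empty row
df = df[df['day'] != 'day']  # Remove the repeated header row, whose 'day' value is the string 'day'
df.iloc[120:125]  # Inspect rows 120-124 by position; only displayed if this is the last expression in the cell
df = df.drop(123)  # Drop the row with index label 123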
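
To make the effect of these steps concrete, here is a minimal sketch on a hypothetical miniature file. The file contents are an assumption for illustration (padded column names, one empty row, and one repeated header row) and are much smaller than the real CSV.

```python
import io
import pandas as pd

# Hypothetical miniature file: a padded column name, an empty row,
# and a repeated header row in the middle of the data.
raw = io.StringIO(
    "day,month, RH ,Classes\n"
    "1,6,57,not fire\n"
    ",,,\n"
    "day,month, RH ,Classes\n"
    "2,6,61,fire\n"
)
toy = pd.read_csv(raw)

toy.columns = toy.columns.str.strip()   # ' RH ' becomes 'RH'
toy = toy.dropna(how="all")             # drop the row where every field is missing
toy = toy[toy["day"] != "day"]          # drop the repeated header row
toy = toy.reset_index(drop=True)        # renumber the remaining rows

print(toy)
```

Because the repeated header row contains text, pandas reads every affected column as strings, which is why the real dataset shows no numeric dtypes before cleaning.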

Removing all whitespace from string values

In [7]:
for col in df:
    # Only string values are modified; anything else (e.g. NaN) is returned unchanged.
    # Removing inner spaces also turns 'not fire' into 'notfire'.
    df[col] = df[col].map(lambda elt: elt.replace(" ", "").strip() if isinstance(elt, str) else elt)

Converting all numerical columns to float

In [8]:
for col in df:
    if col != 'Classes':
        df[col] = df[col].astype('float')
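
Note that `astype('float')` raises a `ValueError` on the first value it cannot parse. As a hedged alternative (not the approach used in this notebook), `pd.to_numeric` with `errors="coerce"` can be used to locate such values first. The sample series below is an illustrative assumption.

```python
import pandas as pd

# Illustrative column with one entry that is not a valid number
# (an assumption for this sketch; not taken from the dataset).
values = pd.Series(["29", "26.5", "fire", "31"])

# errors="coerce" turns unparseable entries into NaN instead of raising.
converted = pd.to_numeric(values, errors="coerce")

# Entries that were present but failed to parse.
bad_entries = values[converted.isna() & values.notna()]
print(bad_entries)
```

Inspecting `bad_entries` before converting makes it easier to decide whether a row should be repaired or dropped.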
        

---

Checking Classes Values

In [9]:
df['Classes'].value_counts()

Classes
fire       138
notfire    106
Name: count, dtype: int64

Encoding the Classes Feature
- notfire = 0
- fire = 1

In [10]:
df['encode'] = df['Classes'].map(lambda item: 0 if item == 'notfire' else 1)

In [11]:
df['encode'].value_counts()

encode
1    138
0    106
Name: count, dtype: int64
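
The lambda used for encoding maps every value other than 'notfire' to 1, so a misspelled or missing label would silently count as a fire. A minimal sketch of a stricter alternative, assuming an explicit dictionary mapping, is shown below. The sample labels, including the stray 'Fire', are hypothetical.

```python
import pandas as pd

# Hypothetical labels; "Fire" is a deliberately stray value.
labels = pd.Series(["notfire", "fire", "fire", "Fire"])

mapping = {"notfire": 0, "fire": 1}
encoded = labels.map(mapping)   # labels missing from the mapping become NaN

# Labels that were not covered by the mapping.
unexpected = labels[encoded.isna()]
print(unexpected)
```

With the two clean labels shown in the value counts above, both approaches give the same result; the dictionary version only differs when an unexpected label is present.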

---

Checking the cleaned data

In [12]:
df.head()

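|   | day | month | year | Temperature | RH | Ws | Rain | FFMC | DMC | DC | ISI | BUI | FWI | Classes | encode |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 6.0 | 2012.0 | 29.0 | 57.0 | 18.0 | 0.0 | 65.7 | 3.4 | 7.6 | 1.3 | 3.4 | 0.5 | notfire | 0 |
| 1 | 2.0 | 6.0 | 2012.0 | 29.0 | 61.0 | 13.0 | 1.3 | 64.4 | 4.1 | 7.6 | 1.0 | 3.9 | 0.4 | notfire | 0 |
| 2 | 3.0 | 6.0 | 2012.0 | 26.0 | 82.0 | 22.0 | 13.1 | 47.1 | 2.5 | 7.1 | 0.3 | 2.7 | 0.1 | notfire | 0 |
| 3 | 4.0 | 6.0 | 2012.0 | 25.0 | 89.0 | 13.0 | 2.5 | 28.6 | 1.3 | 6.9 | 0.0 | 1.7 | 0.0 | notfire | 0 |
| 4 | 5.0 | 6.0 | 2012.0 | 27.0 | 77.0 | 16.0 | 0.0 | 64.8 | 3.0 | 14.2 | 1.2 | 3.9 | 0.5 | notfire | 0 |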


---

The data has been cleaned, so we save it to a new file.

In [13]:
df.to_csv('alg_ff_clean.csv', index = False)
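
As a quick sanity check, a saved file can be read back to confirm that the columns and types survive the round trip. The sketch below uses a small hypothetical frame and an in-memory buffer instead of the real file, so the names and values are assumptions.

```python
import io
import pandas as pd

# Small hypothetical frame standing in for the cleaned data.
cleaned = pd.DataFrame({
    "Temperature": [29.0, 26.0],
    "Classes": ["notfire", "fire"],
    "encode": [0, 1],
})

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```

Passing `index=False` keeps the row index out of the file; without it, the index is written as an extra unnamed column that shows up as `Unnamed: 0` when the file is read again.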