# Data Cleaning and Preprocessing

In this notebook, we will perform the necessary steps to clean and preprocess the raw dataset before applying any machine learning models. The dataset contains weather-related features and a classification target indicating whether a forest fire occurred.

The raw dataset covers two regions of Algeria, **Bejaia** and **Sidi-Bel Abbes**, stored as separate sheets in one Excel workbook. Our goal is to combine the sheets, label each row with its region, handle duplicates and missing values, and save the result for further analysis and modeling.

## Step 1: Loading the Data
In this first step, we load each regional sheet of the raw Excel workbook into its own pandas DataFrame, concatenate the two, and inspect the first few rows to understand the structure of the dataset.


In [5]:
# Import necessary libraries
import pandas as pd

# Define the path to the raw dataset
data_path = r'C:\Users\Administrator\Desktop\Data Analytics\Portfolio project\New folder\Algerian_Forest_Fires_Classification\data\raw\Algerian_forest_fires_dataset_UPDATE.xlsx'

# Load the Excel file with multiple sheets
df_bejaia = pd.read_excel(data_path, sheet_name='Bejaia Region Dataset')
df_sidi_bel_abbes = pd.read_excel(data_path, sheet_name='Sidi-Bel Abbes Region Dataset')

# Combine the two datasets into one
df_combined = pd.concat([df_bejaia, df_sidi_bel_abbes], ignore_index=True)

# Display the first few rows of the combined dataset
df_combined.head()


Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire
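
To illustrate why `ignore_index=True` is passed to `pd.concat` above, here is a minimal sketch on two toy frames. The names `bejaia_demo` and `sidi_demo` and their values are hypothetical stand-ins, not rows from the real workbook.

```python
import pandas as pd

# Hypothetical stand-ins for the two regional sheets (illustrative values only)
bejaia_demo = pd.DataFrame({"day": [1, 2], "Temperature": [29, 29]})
sidi_demo = pd.DataFrame({"day": [1, 2], "Temperature": [32, 30]})

# Without ignore_index, each frame keeps its own row labels, so labels repeat
stacked_keep = pd.concat([bejaia_demo, sidi_demo])

# With ignore_index=True, the result gets a fresh, unique integer index
stacked_fresh = pd.concat([bejaia_demo, sidi_demo], ignore_index=True)

print(stacked_keep.index.tolist())
print(stacked_fresh.index.tolist())
```

A unique index matters later on, because label-based selection on a repeated index would return rows from both regions at once.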


## Step 2: Inspecting the Data
Now that we have loaded the dataset, we will check its basic information. We’ll verify the shape of the dataset, check for missing values, and display summary statistics for numerical features.


In [6]:
# Check the shape of the dataset (number of rows and columns)
print(f"Dataset Shape: {df_combined.shape}")

# Check for missing values in the dataset
missing_values = df_combined.isnull().sum()
print(f"Missing Values:\n{missing_values}")

# Display summary statistics for numerical columns
summary_stats = df_combined.describe()
print(f"Summary Statistics:\n{summary_stats}")


Dataset Shape: (244, 14)
Missing Values:
day            0
month          0
year           0
Temperature    0
RH             0
Ws             0
Rain           0
FFMC           0
DMC            0
DC             0
ISI            0
BUI            0
FWI            0
Classes        1
dtype: int64
Summary Statistics:
              day       month    year  Temperature          RH          Ws  \
count  244.000000  244.000000   244.0   244.000000  244.000000  244.000000   
mean    15.754098    7.500000  2012.0    32.172131   61.938525   15.504098   
std      8.825059    1.112961     0.0     3.633843   14.884200    2.810178   
min      1.000000    6.000000  2012.0    22.000000   21.000000    6.000000   
25%      8.000000    7.000000  2012.0    30.000000   52.000000   14.000000   
50%     16.000000    7.500000  2012.0    32.000000   63.000000   15.000000   
75%     23.000000    8.000000  2012.0    35.000000   73.250000   17.000000   
max     31.000000    9.000000  2012.0    42.000000   90.000000  

## Step 3: Checking for Duplicate Entries
In this step, we will check if there are any duplicate rows in the dataset. Duplicates can skew our analysis and model performance, so it's important to identify and remove them if present.


In [9]:
# Check for duplicate rows in the combined dataset
duplicates = df_combined.duplicated().sum()

# Remove duplicates if any
df_cleaned = df_combined.drop_duplicates()

# Output the number of duplicates and the shape of the dataset after removing duplicates
duplicates, df_cleaned.shape


(0, (244, 14))
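
The real dataset has no duplicates, so the cell above does not show what the calls do when some exist. The following sketch uses a small hypothetical frame (`demo`, with made-up values) that contains one repeated row.

```python
import pandas as pd

# Hypothetical frame in which the second and third rows are identical
demo = pd.DataFrame({
    "day": [1, 2, 2, 3],
    "Temperature": [29, 26, 26, 25],
})

# duplicated() marks only the repeats; the first occurrence counts as original
n_repeats = int(demo.duplicated().sum())

# keep=False flags every member of a duplicate group, which helps inspection
all_copies = demo[demo.duplicated(keep=False)]

# drop_duplicates() keeps the first occurrence of each row by default
deduped = demo.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(all_copies), deduped.shape)
```

Viewing the full duplicate group with `keep=False` before dropping anything makes it easier to judge whether the repeats are real errors or legitimate identical observations.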

## Step 4: Loading and Combining Both Sheets
Both sheets (Bejaia and Sidi-Bel Abbes) were already loaded and concatenated into `df_combined` in Step 1, so no new code is needed here. Working from one combined DataFrame lets us analyze both regions together.


## Step 5: Adding the Region Column
Since the dataset doesn't contain an explicit `region` column, we will create one to indicate whether a row belongs to the Bejaia or Sidi-Bel Abbes region. This new column will help us label the rows accordingly, enabling us to encode it later for use in machine learning models.



In [13]:
# Add the 'region' column to indicate which dataset each row belongs to
df_bejaia['region'] = 'Bejaia'
df_sidi_bel_abbes['region'] = 'Sidi-Bel Abbes'

# Combine the datasets again
df_combined = pd.concat([df_bejaia, df_sidi_bel_abbes], ignore_index=True)

# Display the first few rows of the combined dataset with the new 'region' column
df_combined.head()


Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,Bejaia
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,Bejaia
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,Bejaia
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,Bejaia
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,Bejaia
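
The cell above adds the column by modifying `df_bejaia` and `df_sidi_bel_abbes` in place. As an alternative sketch, `DataFrame.assign` returns a labelled copy and leaves the source frames untouched. The frames `bejaia_demo` and `sidi_demo` are hypothetical stand-ins for the real sheets.

```python
import pandas as pd

# Hypothetical stand-ins for the two regional sheets
bejaia_demo = pd.DataFrame({"day": [1, 2]})
sidi_demo = pd.DataFrame({"day": [1, 2]})

# assign() returns new frames, so the originals are not modified
labelled = pd.concat(
    [
        bejaia_demo.assign(region="Bejaia"),
        sidi_demo.assign(region="Sidi-Bel Abbes"),
    ],
    ignore_index=True,
)

print(labelled["region"].value_counts().to_dict())
```

Keeping the per-region frames unmodified means the cell can be re-run without side effects, which is convenient in a notebook where cells are often executed out of order.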


## Step 6: Encoding the Region Column
The `region` column is categorical, and most machine learning models require numeric inputs, so we encode it with scikit-learn's `LabelEncoder`. The encoder assigns integer codes in sorted order of the labels, so `Bejaia` becomes 0 and `Sidi-Bel Abbes` becomes 1.


In [14]:
from sklearn.preprocessing import LabelEncoder

# Initialize the LabelEncoder
le = LabelEncoder()

# Encode the 'region' column
df_combined['region'] = le.fit_transform(df_combined['region'])

# Display the first few rows after encoding
df_combined.head()


Unnamed: 0,day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes,region
0,1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,not fire,0
1,2,6,2012,29,61,13,1.3,64.4,4.1,7.6,1.0,3.9,0.4,not fire,0
2,3,6,2012,26,82,22,13.1,47.1,2.5,7.1,0.3,2.7,0.1,not fire,0
3,4,6,2012,25,89,13,2.5,28.6,1.3,6.9,0.0,1.7,0.0,not fire,0
4,5,6,2012,27,77,16,0.0,64.8,3.0,14.2,1.2,3.9,0.5,not fire,0
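
With only two categories, the same encoding can also be written as an explicit dictionary, which keeps the code-to-name relation visible in the notebook. This is a pandas-only sketch on a hypothetical frame. The dictionary `region_codes` is an assumption chosen to match the codes shown in the output above, not part of the original notebook.

```python
import pandas as pd

# Hypothetical frame with the two region labels
demo = pd.DataFrame({"region": ["Bejaia", "Sidi-Bel Abbes", "Bejaia"]})

# Assumed explicit mapping, chosen to match the encoded output shown above
region_codes = {"Bejaia": 0, "Sidi-Bel Abbes": 1}
demo["region_code"] = demo["region"].map(region_codes)

# Invert the mapping to recover the names from the codes when reporting results
code_to_region = {code: name for name, code in region_codes.items()}
demo["region_back"] = demo["region_code"].map(code_to_region)

print(demo)
```

Any label missing from the dictionary would map to `NaN`, so a misspelled region name shows up immediately instead of silently receiving its own code.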


## Step 7: Checking for Missing Values
Because `df_combined` was rebuilt in Step 5, we check for missing values again, this time listing only the columns that contain any. Missing data can be problematic for machine learning models, so we need to handle it before moving forward. If any missing values are found, we will decide how to address them (e.g., imputation or removal).


In [15]:
# Check for missing values in the combined dataset
missing_values = df_combined.isnull().sum()

# Display columns with missing values, if any
missing_values[missing_values > 0]


Classes    1
dtype: int64
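
The count above says how many values are missing but not which row they are in. A minimal sketch for pulling out the affected rows, using a hypothetical frame `demo` with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing target value
demo = pd.DataFrame({
    "FWI": [0.5, 0.4, 6.5],
    "Classes": ["not fire", np.nan, "fire"],
})

# isnull().any(axis=1) is True for every row with at least one missing value
rows_with_gaps = demo[demo.isnull().any(axis=1)]

print(rows_with_gaps)
```

Looking at the full row before dropping it helps confirm whether it is a real observation with a lost label or a stray non-data line in the sheet.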

## Step 8: Dropping Rows with Missing `FWI` and `Classes`
The check in Step 7 found one missing value, in the `Classes` column. Because `Classes` is the target variable, a row without it cannot be used for training, so we drop it rather than impute it. We also include `FWI` (Fire Weather Index) in the `subset` as a safeguard, although it has no missing values here, so that every remaining row has both the target and this feature.


In [16]:
# Drop rows where 'FWI' or 'Classes' are missing
df_cleaned = df_combined.dropna(subset=['FWI', 'Classes'])

# Verify if missing values are resolved
missing_values_after = df_cleaned.isnull().sum()
missing_values_after[missing_values_after > 0]


Series([], dtype: int64)
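
To make the effect of the `subset` argument concrete, the sketch below runs the same call on a hypothetical frame where one row has a gap outside the subset and another has a gap inside it. The frame `demo` and its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 lacks Rain, row 2 lacks the target
demo = pd.DataFrame({
    "Rain": [0.0, np.nan, 1.3],
    "FWI": [0.5, 0.4, 6.5],
    "Classes": ["not fire", "fire", np.nan],
})

# Only gaps in the listed columns cause a row to be dropped
kept = demo.dropna(subset=["FWI", "Classes"])

print(kept.index.tolist())
```

Row 1 survives even though `Rain` is missing, because `Rain` is not in the subset. A plain `dropna()` with no arguments would remove it as well.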

## Exporting the Processed Data
Now that we’ve completed the data cleaning and preprocessing steps, it’s important to save the cleaned data to avoid repeating this process in the future. We’ll export the dataset into a CSV file for easy access and reuse.


In [18]:
# Define the path to save the cleaned dataset
processed_data_path = r'C:\Users\Administrator\Desktop\Data Analytics\Portfolio project\New folder\Algerian_Forest_Fires_Classification\data\processed\cleaned_algerian_forest_fires.csv'

# Save the cleaned dataset to a CSV file
df_cleaned.to_csv(processed_data_path, index=False)

# Verify if the file has been saved successfully
import os
os.path.exists(processed_data_path)

True
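
Checking that the file exists confirms the write but not the content. As a hedged sketch, the round trip below writes a hypothetical frame to a temporary directory and reads it back. The directory and file names (`processed`, `demo_cleaned.csv`) are placeholders, not the project's real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical cleaned frame (illustrative values only)
demo = pd.DataFrame({"FWI": [0.5, 6.5], "Classes": ["not fire", "fire"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder output location; create the folder if it does not exist yet
    out_path = Path(tmp_dir) / "processed" / "demo_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # index=False keeps the row index out of the file, as in the cell above
    demo.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```

Building the path with `pathlib` relative to a base folder, rather than as one absolute string, also makes the notebook easier to run on another machine.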