# INTRODUCTION
- This notebook studies forest fires in two regions of Algeria: the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest.
- The dataset is the Algerian Forest Fires dataset from the UCI Machine Learning Repository:
--- https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++#

- The dataset records forest fire occurrences in summer 2012, from June 2012 to September 2012.
- The project explores whether machine learning algorithms can predict forest fires in these regions. Machine learning can analyse large amounts of data, such as weather patterns, temperature changes and other environmental factors, and the patterns it detects can then be used to estimate when and where a future forest fire might occur.

## Data Set Information:

The dataset includes 244 instances covering two regions of Algeria, namely the Bejaia region in the northeast and the Sidi Bel-abbes region in the northwest.

122 instances for each region.

The period covered is June 2012 to September 2012.
The dataset includes 11 attributes and 1 output attribute (class).
The 244 instances are classified into fire (138 instances) and not fire (106 instances).


## Attribute Information:

1. Date: (DD/MM/YYYY) Day, month ('june' to 'september'), year (2012)

Weather data observations

2. Temp: temperature at noon (maximum temperature) in degrees Celsius: 22 to 42
3. RH: relative humidity in %: 21 to 90
4. Ws: wind speed in km/h: 6 to 29
5. Rain: total daily rainfall in mm: 0 to 16.8

FWI components

6. Fine Fuel Moisture Code (FFMC) index from the FWI system: 28.6 to 92.5
7. Duff Moisture Code (DMC) index from the FWI system: 1.1 to 65.9
8. Drought Code (DC) index from the FWI system: 7 to 220.4
9. Initial Spread Index (ISI) index from the FWI system: 0 to 18.5
10. Buildup Index (BUI) index from the FWI system: 1.1 to 68
11. Fire Weather Index (FWI) index: 0 to 31.1
12. Classes: two classes, fire and not fire

## Steps
- Data gathering
- Exploratory Data Analysis (EDA)
- Feature Selection
- Model Building & Selection
- Hyperparameter Tuning
- Model deployment

In [1]:
import warnings
warnings.filterwarnings("ignore")

##### Importing python libraries

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
#import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
#from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import RandomizedSearchCV
import bz2,pickle

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
#from sklearn.tree import export_graphviz
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

#### Loading CSV fire
- The pandas library is used to load the forest fire dataset

###### Variable declaration:
- ff_df is forestFire_dataframe

In [3]:
ff_df = pd.read_csv('Algerian_forest_fires_dataset_UPDATE.csv')
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
ff_df

FileNotFoundError: [Errno 2] No such file or directory: 'Algerian_forest_fires_dataset_UPDATE.csv'

# EDA
- Exploratory Data Analysis (EDA) is applied to extract insights from the dataset, using pandas for data analysis and Matplotlib and Seaborn for visualisation, to find out which features contribute most to predicting forest fires. It is good practice to study and understand the data first and to gather as many insights as possible.

In [None]:
ff_df.head()

In [None]:
ff_df.tail()

In [None]:
ff_df.info()

To allow regression analysis, the feature datatypes need to be converted from the object datatype to numeric (integer or float) datatypes.

In [None]:
ff_df.shape

#### cleaning dataset

In [None]:
ff_df.isnull().sum()

In [None]:
# Show the rows that have a missing value
ff_df[ff_df.isnull().any(axis=1)]

#### Making new column based on the region
- As seen above, the row with missing values at index 122 separates the data of the two regions.

1 : Bejaia Region Dataset

2 : Sidi Bel-Abbes Region Dataset

In [None]:
ff_df.loc[:122, 'Region']=1
ff_df.loc[122:, 'Region']=2
ff_df[['Region']] = ff_df[['Region']].astype(int)

In [None]:
ff_df.head(10)

In [None]:
ff_df.tail(10)

In [None]:
ff_df.isnull().sum()

In [None]:
ff_df =ff_df.dropna().reset_index(drop=True)
ff_df.shape

In [None]:
# Row 122 contains strings instead of numeric values
ff_df.iloc[[122]]

In [None]:
ff_df[ff_df.duplicated()]

There are no duplicated data in the data set

In [None]:
# Remove the row at index 122
ff_df1 = ff_df.drop(122).reset_index(drop=True)
pd.set_option('display.max_rows', None)
ff_df1

In [None]:
ff_df1.head(10)

In [None]:
ff_df1.shape

In [None]:
ff_df1[ff_df1.isnull().any(axis=1)]

No missing data

In [None]:
# Check for column names
ff_df1.columns

In [None]:
ff_df1.columns = ff_df1.columns.str.strip()
ff_df1.columns

##### Changing the data types into the required data types for the respective features for the analysis

In [None]:
ff_df1[['month', 'day', 'year', 'Temperature', 'RH', 'Ws']] = ff_df1[['month', 'day', 'year', 'Temperature','RH', 'Ws']].astype(int)

In [None]:
objects = [features for features in ff_df1.columns if ff_df1[features].dtypes=='O']
for i in objects:
    if i != 'Classes':
        ff_df1[i] = ff_df1[i].astype(float)

In [None]:
ff_df1.info()

In [None]:
ff_df1.describe()

In [None]:
ff_df1.describe(include = 'all')

In [None]:
ff_df1["Classes"].value_counts()

The dependent feature (Classes) contains only two categories. However, stray whitespace in the labels makes it appear to have several categories, so the labels need to be stripped to leave two.

In [None]:
ff_df1.Classes = ff_df1.Classes.str.strip()

In [None]:
ff_df1["Classes"].value_counts()

In [None]:
# Bejaia Region dataset only
ff_df1[:122]

In [None]:
# Sidi Bel-Abbes region dataset only
ff_df1[122:]

In [None]:
ff_df1.shape

In [None]:
# Encoding Not fire as 0 and fire as 1
ff_df1['Classes']= np.where(ff_df1['Classes']=='not fire',0,1)
ff_df1.head(10)

In [None]:
ff_df1.Classes.value_counts()

In [None]:
plt.figure(figsize=(20,15))
sns.heatmap(ff_df1.corr(), annot=True, linewidths=1,
           linecolor="black", cbar=True, cmap="Paired",
           xticklabels="auto", yticklabels="auto")

## Distribution visualisation

In [None]:
# Plotting histograms for all features.
ff_df1.hist(bins=50, figsize=(20,15), ec ='b')
plt.show()

In [None]:
# Calculating the percentage of each class
percent =ff_df1.Classes.value_counts(normalize=True)*100
percent

In [None]:
clabels =["Fire", "Not Fire"]
plt.figure(figsize=(12,7))
plt.pie(percent, labels = clabels, autopct='%1.1f%%')
plt.title("Classes Pie Chart", fontsize=12)
plt.show()

In [None]:
sns.countplot(x ='Classes', data=ff_df1, palette="tab10")
plt.title('Class Distributions \n 0: No Fire || 1: Fire', fontsize =14)


### Month-wise Fire Analysis

In [None]:
temp= ff_df1.loc[ff_df1['Region']== 1]
plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
sns.countplot(x='month',hue='Classes',data= temp,ec = 'black', palette= 'Set2')  # plot the Region 1 subset only
plt.title('Fire Analysis Month wise for Bejaia Region', fontsize=18, weight='bold')
plt.ylabel('Count', weight = 'bold')
plt.xlabel('Months', weight= 'bold')
plt.legend(loc='upper right')
plt.xticks(np.arange(4), ['June','July', 'August', 'September',])
plt.grid(alpha = 0.5,axis = 'y')
plt.show()

In [None]:
temp= ff_df1.loc[ff_df1['Region']== 2]
plt.subplots(figsize=(13,6))
sns.set_style('whitegrid')
sns.countplot(x='month',hue='Classes',data= temp,ec = 'black', palette= 'Set2')  # plot the Region 2 subset only
plt.title('Fire Analysis Month wise for Sidi Bel-Abbes', fontsize=18, weight='bold')
plt.ylabel('Count', weight = 'bold')
plt.xlabel('Months', weight= 'bold')
plt.legend(loc='upper right')
plt.xticks(np.arange(4), ['June','July', 'August', 'September',])
plt.grid(alpha = 0.5,axis = 'y')
plt.show()

- As observed from the count plots above, July and August appear to have the highest number of forest fires in both regions.
- Most of the fires happened in August.
- Fewer fires occurred in September.

### Weather System EDA

In [None]:
def barchart(features, xlabel):
    plt.figure(figsize=[14,8])
    by_ft= ff_df1.groupby([features], as_index=False)['Classes'].sum()
    ax =sns.barplot(x=features, y="Classes", data=by_ft[[features, 'Classes']], estimator=sum)
    ax.set(xlabel=xlabel, ylabel='Fire Count')

In [None]:
barchart('Temperature', 'Temperature max in Celsius degree')

In [None]:
barchart('Rain', 'Rain in mm')

In [None]:
barchart('Ws', 'Wind Speed in km/hr')

In [None]:
barchart('RH', 'Relative Humidity in %')

### FWI System Components EDA

In [None]:
temp = ff_df1.drop(['Region','Temperature','Rain','Ws','RH'], axis=1)
for feature in temp:
    sns.histplot(data = temp,x=feature, hue = 'Classes')
    plt.legend(labels=['Fire','Not Fire'])
    plt.title(feature)
    plt.show()

## Report from EDA
### Weather system report: highest fire counts
- Temperature: 30 to 37 degrees Celsius.
- Rain: none to very little, 0.0 to 0.3 mm.
- Wind Speed: 13 to 19 km/h.
- Relative Humidity: 50% to 80%.

### FWI system component report: index ranges that indicate a higher or lower chance of fire
- Fine Fuel Moisture Code (FFMC): ranges from 28.6 to 92.5; above 75 indicates a higher chance of fire
- Duff Moisture Code (DMC): ranges from 1.1 to 65.9; 1.1 to 10 indicates a lower chance of fire
- Initial Spread Index (ISI): ranges from 0 to 18.5; 0 to 3 indicates a lower chance of fire
- Buildup Index (BUI): ranges from 1.1 to 68; 1.1 to 10 indicates a lower chance of fire
- Fire Weather Index (FWI): ranges from 0 to 31.1; 0 to 3 indicates a lower chance of fire

### Multicollinearity
- Multicollinearity occurs when independent variables in a model are correlated with each other. Two variables are perfectly collinear when their correlation coefficient is +/-1.0.
- When it is present, statistical inferences drawn from the model are less reliable.
- It can be detected with various techniques.
- Regression analysis assumes that the independent features are not multicollinear, so the independent variables should be as weakly correlated with each other as possible.
- One such technique is the Variance Inflation Factor (VIF).
    - VIF value greater than 10 ---> Multicollinearity
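To make the VIF rule concrete, here is a minimal sketch that computes it by hand: each column is regressed on the others and VIF = 1 / (1 - R²). The helper `vif` and the synthetic columns `a`, `b`, `c` are illustrative assumptions, not part of the notebook. The sketch adds an intercept to each auxiliary regression, whereas statsmodels' `variance_inflation_factor` regresses on exactly the columns it is given, so the two can give different values.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        target = X[:, i]
        # Regress column i on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)              # independent of a
c = a + 0.1 * rng.normal(size=500)    # nearly a copy of a (synthetic)
vifs = vif(np.column_stack([a, b, c]))
print(dict(zip("abc", vifs.round(2))))
```

In this toy data the near-duplicate pair `a` and `c` lands far above the threshold of 10, while the independent column `b` stays close to 1.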


In [None]:
x =ff_df1.iloc[:, 0:13]
y =ff_df1['Classes']

In [None]:
x.head(10)

In [None]:
y.head(10)

In [None]:
#from statsmodels.stats.outliers_influence import variance_inflation_factor

vif_value = pd.DataFrame()
vif_value["feature"] =x.columns
vif_value["VIF"]= [variance_inflation_factor(x.values, i)
                                            for i in range(len(x.columns))]
print(vif_value)

## Defining classes for the algorithms

In [None]:
ff_df2 =ff_df1.drop(['day','month','year'], axis=1)
ff_df2.head(10)

# Algorithm Analysis

### Correlation

In [None]:
def corrlt(dataset, threshold):
    col_corr =set()
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j])>threshold:
                colname =corr_matrix.columns[i]
                col_corr.add(colname)
    return col_corr
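
The pairwise check in `corrlt` can be tried on a toy frame. The sketch below is a vectorised equivalent written for illustration; `correlated_columns` and the synthetic columns `a`, `b`, `c` are assumptions, not part of the notebook. It flags a column when its absolute correlation with any earlier column exceeds the threshold.

```python
import numpy as np
import pandas as pd

def correlated_columns(df, threshold):
    """Columns whose |correlation| with an earlier column exceeds threshold."""
    corr = df.corr().abs().to_numpy()
    # Keep only the strict lower triangle: row i against earlier columns j < i.
    lower = np.tril(corr, k=-1)
    flagged = (lower > threshold).any(axis=1)
    return set(df.columns[flagged])

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.05 * rng.normal(size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                  # unrelated
})
flagged = correlated_columns(toy, 0.8)
print(flagged)
```

Only the later member of a correlated pair is flagged, so the result depends on column order: here `b` is dropped and `a` is kept.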

### Scaling

In [None]:
def scaler_standard(xtrain, xtest):
    scaler = StandardScaler()
    xtrain_scale = scaler.fit_transform(xtrain)
    xtest_scale = scaler.transform(xtest)

    return xtrain_scale, xtest_scale
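
The reason `scaler_standard` fits on the training set only and then reuses those statistics for the test set can be shown with plain NumPy. This is a sketch on synthetic data; the arrays `train` and `test` are assumptions for illustration. Like `StandardScaler`, it uses the population standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=30.0, scale=4.0, size=(150, 2))  # synthetic training features
test = rng.normal(loc=30.0, scale=4.0, size=(50, 2))    # synthetic test features

# Statistics are learned from the training data only.
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # reuse training statistics: no test-set leakage

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3), test_scaled.std(axis=0).round(3))
```

The scaled training columns have mean 0 and standard deviation 1 by construction; the test columns are only approximately standardised, which is expected because nothing about the test set was used to choose the scaling.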

#### Splitting the dataset into train and test

In [None]:
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)
xtrain.shape, xtest.shape

# Regression Problem algorithm:
* Prediction of the feature [FWI] (Fire Weather Index), which correlates with the Classes feature by 90%+

## Chosen model
### Random Forest Regressor
For this task I have chosen the Random Forest Regressor because it is versatile and performs well in a wide range of situations. It can also identify the important features in the dataset by calculating feature importances, which is useful for feature selection and for understanding the underlying relationships in the data.

## Regression Analysis

In [None]:
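x = ff_df2.iloc[:, 0:10]  # still includes FWI itself; the correlation filter below removes it
y = ff_df2['FWI']
# Re-split so that ytrain/ytest hold FWI rather than the earlier Classes target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.25, random_state=0)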

In [None]:
x.head()

In [None]:
y.head()

In [None]:
xtrain.columns

In [None]:
xtrain.corr()

In [None]:
#Pearson correlation
plt.figure(figsize=(12,10))
correlate= xtrain.corr()
sns.heatmap(correlate, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()

### Use a correlation threshold of 0.8


### Remove independent features whose correlation with another feature exceeds 0.8
using the corrlt function, then scale the remaining features with scaler_standard.

In [None]:
corrlt_features = corrlt(xtrain, 0.8)
corrlt_features

The features that are above the 0.8 threshold are 'BUI', 'DC', 'FWI'

In [None]:
xtrain.drop(corrlt_features, axis=1, inplace=True)
xtest.drop(corrlt_features, axis=1, inplace=True)
xtrain.shape, xtest.shape

In [None]:
xtrain_scale, xtest_scale = scaler_standard(xtrain, xtest)

In [None]:
plt.subplots(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.boxplot(data= xtrain)
plt.title('X_train Before Scaling')
plt.subplot(1, 2, 2)
sns.boxplot(data= xtrain_scale)
plt.title('X_train After Scaling')

## Model building for regression analysis

#### Linear Regression

In [None]:
LiReg = LinearRegression()
LiReg.fit(xtrain_scale, ytrain)
LiReg_pred = LiReg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, LiReg_pred)
MSE = metrics.mean_squared_error(ytest, LiReg_pred)
r2 =r2_score(ytest, LiReg_pred)#Coefficient of determination

print("Linear Regressor")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))
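
The three metrics printed for each regressor can be computed by hand, which also shows how R² can go negative. This is a sketch on made-up numbers; `regression_metrics` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.abs(errors).mean()
    mse = (errors ** 2).mean()
    ss_res = (errors ** 2).sum()                      # unexplained variation
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()    # total variation
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

y_true = [3, 5, 7, 9]
good = regression_metrics(y_true, [2, 5, 8, 9])
flat = regression_metrics(y_true, [6.5, 6.5, 6.5, 6.5])  # constant guess
print(good)
print(flat)
```

For the first set of predictions MAE and MSE are both 0.5 and R² is 0.9. The constant guess of 6.5 gives R² = -0.05: a negative R² means the predictions are worse than always predicting the mean of the test targets.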

In [None]:
actual_pred = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': LiReg_pred})
actual_pred

##### Lasso Regression

In [None]:
#from sklearn.linear_model import Lasso
Lass_Reg = Lasso()
Lass_Reg.fit(xtrain_scale, ytrain)
LassReg_pred = Lass_Reg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, LassReg_pred)
MSE = metrics.mean_squared_error(ytest, LassReg_pred)
r2= r2_score(ytest, LassReg_pred)

print("Lasso Regression")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
Actual_pred = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': LassReg_pred})
Actual_pred

##### Ridge Regression

In [None]:
#from sklearn.linear_model import Ridge

RReg = Ridge()
RReg.fit(xtrain_scale, ytrain)
RReg_Pred = RReg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, RReg_Pred)
MSE = metrics.mean_squared_error(ytest, RReg_Pred)
r2 =  r2_score(ytest, RReg_Pred)

print("Ridge Regression")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
Actual_pred = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': RReg_Pred})
Actual_pred

##### Support Vector Regressor

In [None]:
from sklearn.svm import SVR

SVector_Reg = SVR()
SVector_Reg.fit(xtrain_scale, ytrain)
SVector_Reg_Pred = SVector_Reg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, SVector_Reg_Pred)
MSE = metrics.mean_squared_error(ytest, SVector_Reg_Pred)
r2 =  r2_score(ytest, SVector_Reg_Pred)

print("Support Vector Regressor")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
Actual_predict = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': SVector_Reg_Pred})
Actual_predict

##### Random Forest Regressor

In [None]:
RforestReg = RandomForestRegressor()
RforestReg.fit(xtrain_scale, ytrain)
forestReg_Pred = RforestReg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, forestReg_Pred)
MSE = metrics.mean_squared_error(ytest, forestReg_Pred)
r2 =  r2_score(ytest, forestReg_Pred)

print("Random Forest Regressor")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
Actual_predict = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': forestReg_Pred})
Actual_predict

##### K-Neighbors Regressor

In [None]:
#from sklearn.neighbors import KNeighborsRegressor

K_NReg = KNeighborsRegressor()
K_NReg.fit(xtrain_scale, ytrain)
K_NReg_Pred = K_NReg.predict(xtest_scale)
MAE = metrics.mean_absolute_error(ytest, K_NReg_Pred)
MSE = metrics.mean_squared_error(ytest, K_NReg_Pred)
r2 =  r2_score(ytest, K_NReg_Pred)

print("K-Neighbor Regressor")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
Actual_pred = pd.DataFrame({'Actual FWI': ytest, 'Predicted FWI': K_NReg_Pred})
Actual_pred

### r2 Score Results Summary

In [None]:
print("      Models                 Score  ")
print("Random Forest Regressor      95.15% ")
print("Support Vector regressor     72.33% ")
print("K-Neighbors Regressor        69.10% ")
print("Linear Regressor             64.53% ")
print("Ridge Regressor              64.50% ")
print("Lasso Regressor              -0.43% ")

The Random Forest Regressor performed best of all the models.

## Hyperparameter Tuning
#### Tuning Random Forest Regressor

In [None]:
param_grid =[{'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110,120],
'max_features': [1.0, 'sqrt'],  # 'auto' was removed in recent scikit-learn; 1.0 uses all features
'min_samples_leaf': [1, 3, 4],
'min_samples_split': [2, 6, 10],
'n_estimators': [5, 20, 50, 100]}]

forestReg = RandomForestRegressor()
Rand_rf = RandomizedSearchCV(forestReg, param_grid, cv = 10, verbose=2,n_jobs = -1)
Rand_rf.fit(xtrain_scale, ytrain)
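
It helps to see how small a share of the grid a randomized search actually tries. The sketch below enumerates every combination of a grid like the one above and draws 10 at random, which matches the default `n_iter=10` of `RandomizedSearchCV`. It uses only the standard library and illustrates the sampling idea; it is not scikit-learn's implementation.

```python
import itertools
import random

# Same shape as the grid above (max_features written as 1.0 / 'sqrt').
param_grid = {
    "bootstrap": [True, False],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
    "max_features": [1.0, "sqrt"],
    "min_samples_leaf": [1, 3, 4],
    "min_samples_split": [2, 6, 10],
    "n_estimators": [5, 20, 50, 100],
}

keys = list(param_grid)
all_combos = [dict(zip(keys, values))
              for values in itertools.product(*param_grid.values())]

random.seed(0)
sampled = random.sample(all_combos, 10)  # 10 candidates, drawn without replacement

print(len(all_combos), "combinations in the full grid")
print(len(sampled), "combinations evaluated by a 10-iteration random search")
```

The full grid holds 1,728 combinations, so ten draws cover well under 1% of it. Raising `n_iter` trades run time for a better chance of finding a strong configuration.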

In [None]:
BestRand_grid = Rand_rf.best_estimator_

bestref_pred = BestRand_grid.predict(xtest_scale)
bestref_pred
MAE = metrics.mean_absolute_error(ytest, bestref_pred)
MSE = metrics.mean_squared_error(ytest, bestref_pred)
r2 = r2_score(ytest, bestref_pred)  # score the tuned model's own predictions

print("Random Forest Tuned")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

## Selecting Features
- Only the 5 most important features are selected to make the prediction:
    - ISI, FFMC, DMC, RH, and Ws, as seen in the output below.

In [None]:
important_features = Rand_rf.best_estimator_.feature_importances_
important_df = pd.DataFrame({
    'feature': xtrain.columns,
    'importance': important_features
}).sort_values('importance', ascending=False)
important_df

In [None]:
plt.figure(figsize=(12,6))
sns.set_style('ticks')
ax = sns.barplot(data=important_df, x='importance', y='feature',ec = 'black')
ax.set_title('Top 7 Important Features', weight='bold',fontsize = 15)
ax.set_xlabel('Feature Importance %',weight='bold')
ax.set_ylabel('Features',weight='bold')

## Deployment Model

In [None]:
xtrain_new = xtrain.drop(['Rain', 'RH'], axis=1)
xtest_new = xtest.drop(['Rain', 'RH'], axis=1)

In [None]:
xtrain_new.columns

In [None]:
xtest_new.columns

In [None]:
xtrain_new_scale, xtest_new_scale = scaler_standard(xtrain_new, xtest_new)


In [None]:
BestRand_grid.fit(xtrain_new_scale, ytrain)
bestref_pred = BestRand_grid.predict(xtest_new_scale)
MAE = metrics.mean_absolute_error(ytest, bestref_pred)
MSE = metrics.mean_squared_error(ytest, bestref_pred)
r2 = r2_score(ytest, bestref_pred)  # score the refitted model's own predictions

print("Random Forest Tuned")
print("Mean Absolute Error: {:.4f}".format(MAE))
print("Mean Squared Error: {:.4f}".format(MSE))
print("R-Square: {:.4f}".format(r2))

In [None]:
#import bz2,pickle
file = bz2.BZ2File('Regression.pkl','wb')
pickle.dump(BestRand_grid,file)
file.close()
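
For completeness, a model saved this way can be read back with the mirror-image calls. The sketch below round-trips a placeholder object through a temporary file; `model_stand_in` and the file name `Example.pkl` are hypothetical stand-ins for the fitted estimator and its file.

```python
import bz2
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model; any picklable object works the same way.
model_stand_in = {"name": "placeholder", "params": [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), "Example.pkl")

# Save: compress with bz2 while pickling.
with bz2.BZ2File(path, "wb") as f:
    pickle.dump(model_stand_in, f)

# Load: open the same file in read mode and unpickle.
with bz2.BZ2File(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)
```

Using `with` closes the file even if pickling fails. Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.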


# Classification Algorithm:
* Binary classification (fire, not fire): predicting the "Classes" feature of the dataset

## XGboost Classifier.
* For this task I chose the XGBoost Classifier because of its high performance, scalability and ability to handle a variety of data types. It implements gradient boosting, which builds an ensemble of weak models, each one added to reduce the remaining prediction error. Its built-in regularisation helps to prevent over-fitting and gives a more general solution. It also offers parallel training and built-in cross-validation methods, which help model performance.
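
The idea behind gradient boosting can be sketched in a few lines of NumPy: start from a constant prediction, then repeatedly fit a weak learner to the current residuals and add a shrunken copy of it. The example below uses one-split "stumps" and squared loss on synthetic data. It is a simplified illustration under those assumptions, not XGBoost's actual algorithm, which adds regularisation, second-order gradients and a logistic loss for classification.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single split on x: returns (threshold, left_value, right_value)."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, t, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)                    # synthetic feature
y = np.sin(x) + rng.normal(0, 0.1, 200)        # synthetic target

learning_rate = 0.3
pred = np.full_like(y, y.mean())               # round 0: constant prediction
mse_history = [np.mean((y - pred) ** 2)]
for _ in range(50):
    # For squared loss the negative gradient is simply the residual.
    stump = fit_stump(x, y - pred)
    pred += learning_rate * predict_stump(stump, x)
    mse_history.append(np.mean((y - pred) ** 2))

print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

Each round can only lower the training error, so the learning rate and the number of rounds control how closely the ensemble fits the training data.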

### Classification Analysis

In [None]:
ff_df2.head()

#### Splitting the dataset into input and output features

In [None]:
x= ff_df2.iloc[:, 0:10]
y= ff_df2['Classes']

In [None]:
x.head(10)

In [None]:
y.head(10)

In [None]:
# separate dataset into train and test
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.4, random_state=0)
xtrain.shape, xtest.shape

In [None]:
xtrain.columns

In [None]:
corrlt_features = corrlt(xtrain, 0.8)
corrlt_features

In [None]:
xtrain.drop(corrlt_features,axis=1, inplace=True)
xtest.drop(corrlt_features,axis=1, inplace=True)
xtrain.shape, xtest.shape

In [None]:
xtrain_scale, xtest_scale = scaler_standard(xtrain, xtest)

#### Logistic Regression

In [None]:
LReg =LogisticRegression()
LReg.fit(xtrain_scale, ytrain)
LReg_Pred = LReg.predict(xtest_scale)
Score = accuracy_score(ytest, LReg_Pred)
CReport = classification_report(ytest, LReg_Pred)

print("Logistic Regression")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': LReg_Pred})
Actual_pred

##### Metrics
- Precision: ratio of true positives to the sum of true positives and false positives.

- Recall: ratio of true positives to the sum of true positives and false negatives.

- F1 Score: harmonic mean of precision and recall.
The closer the value is to 1.0, the better the expected performance of the model.
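
These definitions can be checked on a small hand-made example. The sketch below counts true positives, false positives and false negatives directly; `binary_metrics` and the label lists are illustrative assumptions, with 1 standing for fire and 0 for not fire.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]   # made-up labels
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]   # one missed fire, one false alarm
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

With three true positives, one false positive and one false negative, precision, recall and F1 all come out at 0.75.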

In [None]:
LReg_ConfMatrix = ConfusionMatrixDisplay.from_estimator(LReg, xtest_scale, ytest)
LReg_ConfMatrix

##### Decision Tree

In [None]:
DT_Classifier = DecisionTreeClassifier()
DT_Classifier.fit(xtrain_scale, ytrain)
DT_Classifier_pred = DT_Classifier.predict(xtest_scale)
Score = accuracy_score(ytest, DT_Classifier_pred)
CReport = classification_report(ytest, DT_Classifier_pred)

print("Decision Tree")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': DT_Classifier_pred})
Actual_pred

In [None]:
DT_Classifier_ConfMatrix = ConfusionMatrixDisplay.from_estimator(DT_Classifier, xtest_scale, ytest)
DT_Classifier_ConfMatrix

In [None]:
plt.figure(figsize = (10,5))
tree.plot_tree(DT_Classifier,filled = True)
plt.show()

### Random Forest Classifier

In [None]:
#from sklearn.ensemble import RandomForestClassifier

RF_Classifier = RandomForestClassifier()
RF_Classifier.fit(xtrain_scale, ytrain)
RF_Classifier_pred = RF_Classifier.predict(xtest_scale)
RF_Classifier_pred
Score = accuracy_score(ytest, RF_Classifier_pred)
CReport = classification_report(ytest, RF_Classifier_pred)

print("Random Forest")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': RF_Classifier_pred})
Actual_pred

In [None]:
RF_Classifier_ConfMatrix = ConfusionMatrixDisplay.from_estimator(RF_Classifier, xtest_scale, ytest)
RF_Classifier_ConfMatrix

### K_Neighbors Classifier

In [None]:
# Train a KNeighborsClassifier from scikit-learn
K_NClassifier = KNeighborsClassifier()
K_NClassifier.fit(xtrain_scale, ytrain)
K_NClassifier_pred = K_NClassifier.predict(xtest_scale)
K_NClassifier_pred
Score = accuracy_score(ytest, K_NClassifier_pred)
CReport = classification_report(ytest, K_NClassifier_pred)

print("K-Neighbors Classifier")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': K_NClassifier_pred})
Actual_pred

In [None]:
K_NClassifier_ConfMatrix = ConfusionMatrixDisplay.from_estimator(K_NClassifier, xtest_scale, ytest)
K_NClassifier_ConfMatrix

## XGb Model

In [None]:
xgb = XGBClassifier()
xgb.fit(xtrain_scale, ytrain)
xgb_pred = xgb.predict(xtest_scale)
xgb_pred
Score = accuracy_score(ytest, xgb_pred)
CReport = classification_report(ytest, xgb_pred)

print("XGBoost Classifier")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': xgb_pred})
Actual_pred

In [None]:
xgb_cm = ConfusionMatrixDisplay.from_estimator(xgb, xtest_scale, ytest)

# HyperParameter Tuning

#### XGb classifier tuning

In [None]:
params={
 "learning_rate"    : (np.linspace(0.01, 1, 100)) ,  # XGBoost's learning rate (eta) is expected to lie in [0, 1]
 "max_depth"        : (np.linspace(1,50, 25,dtype=int)),
 "min_child_weight" : [1, 3, 5, 7],
 "gamma"            : [0.0, 0.1, 0.2 , 0.3, 0.4],
 "colsample_bytree" : [0.3, 0.4, 0.5 , 0.7]}
Rand_xgb = RandomizedSearchCV(xgb, params, cv = 10, n_jobs = -1)
Rand_xgb.fit(xtrain_scale, ytrain).best_estimator_

In [None]:
Bst_xgb = Rand_xgb.best_estimator_
Bst_xgb.score(xtest_scale, ytest)
Bstxgb_pred = Bst_xgb.predict(xtest_scale)
Score = accuracy_score(ytest, Bstxgb_pred)
CReport = classification_report(ytest, Bstxgb_pred)
print("FINAL XGB")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': Bstxgb_pred})
Actual_pred

#### Forest Classifier Tuning

In [None]:
params = {
    "n_estimators" : [90,100,115,130],
    'criterion': ['gini', 'entropy'],
    'max_depth' : range(2,20,1),
    'min_samples_leaf' : range(1,10,1),
    'min_samples_split': range(2,10,1),
    'max_features' : ['sqrt','log2']  # 'auto' (an alias of 'sqrt' for classifiers) was removed in recent scikit-learn
}
rf = RandomizedSearchCV(RF_Classifier, params, cv = 10,n_jobs = -1)
rf.fit(xtrain_scale, ytrain).best_estimator_

In [None]:
Bst_rf = rf.best_estimator_
Bst_rf.score(xtest_scale, ytest)

In [None]:
Bstrf_pred = Bst_rf.predict(xtest_scale)
Bstrf_pred

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': Bstrf_pred})
Actual_pred

# Model Selection


## Stratified K-fold Cross-validation(CV)
- This ensures that each training and test fold has the same proportion of the target classes (fire / not fire) as the original dataset.
- This gives a more reliable accuracy estimate than a single train/test split.
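
To illustrate what stratification guarantees, the sketch below deals the indices of each class round-robin into 10 folds and checks the share of fire instances in every fold. `stratified_folds` is a hand-rolled illustration, not scikit-learn's `StratifiedKFold`; the label counts (138 fire, 106 not fire) are the ones given in the dataset description.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_splits, seed=0):
    """Assign indices to folds so each fold keeps the overall class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal this class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

labels = [1] * 138 + [0] * 106          # 1 = fire, 0 = not fire
folds = stratified_folds(labels, 10)
overall_share = sum(labels) / len(labels)
fire_share = [sum(labels[i] for i in fold) / len(fold) for fold in folds]

print(round(overall_share, 3))
print([round(share, 3) for share in fire_share])
```

Every fold's fire share stays within a few percentage points of the overall share of about 56.6%, which is the property that makes fold-to-fold accuracy scores comparable.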

In [None]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
skfold = StratifiedKFold(n_splits= 10,shuffle= True,random_state= 0)

In [None]:
cv_xgb= cross_val_score(Bst_xgb, x, y, cv=skfold, scoring='accuracy').mean()
print('CV Score XGB Tuned {:.4f}'.format(cv_xgb))

In [None]:
cv_rf =cross_val_score(RF_Classifier, x, y, cv=skfold, scoring="accuracy").mean()
print('CV Score Random Forest {:.4f}'.format(cv_rf))

In [None]:
cv_dt =cross_val_score(DT_Classifier, x, y, cv= skfold, scoring="accuracy").mean()
print('CV Score Decision Tree {:.4f}'.format(cv_dt))

In [None]:
cv_knn =cross_val_score(K_NClassifier, x, y, cv=skfold, scoring="accuracy").mean()
print('CV Score KNN Classifier {:.4f}'.format(cv_knn))

In [None]:
cv_lg =cross_val_score(LReg, x, y, cv=skfold, scoring="accuracy").mean()
print('CV Score Logistic Regression {:.4f}'.format(cv_lg))

The XGBoost Classifier has the best result.

###  Model Deployment Feature Selection

In [None]:
important_features =    Rand_xgb.best_estimator_.feature_importances_
important_df = pd.DataFrame({
    'feature': xtrain.columns,
    'importance': important_features
}).sort_values('importance', ascending=False)
important_df

In [None]:
plt.figure(figsize=(12,6))
sns.set_style('ticks')
ax = sns.barplot(data=important_df, x='importance', y='feature',ec = 'black')
ax.set_title('Top 7 Important Features', weight='bold',fontsize = 15)
ax.set_xlabel('Feature Importance %',weight='bold')
ax.set_ylabel('Features',weight='bold')

# Model Deployment

In [None]:
xtrain.columns

In [None]:
xtrain_new = xtrain.drop(['Rain', 'RH'], axis=1)
xtest_new = xtest.drop(['Rain', 'RH'], axis=1)

In [None]:
xtrain_new.columns

In [None]:
xtest_new.columns

In [None]:
xtrain_new_scale, xtest_new_scale = scaler_standard(xtrain_new, xtest_new)

In [None]:
xgb_model =Rand_xgb.fit(xtrain_new_scale, ytrain).best_estimator_
xgb_model.score(xtest_new_scale, ytest)
xgb_model_pred = xgb_model.predict(xtest_new_scale)
Score = accuracy_score(ytest, xgb_model_pred)
CReport = classification_report(ytest, xgb_model_pred)
print("Final Model XGB")
print ("Accuracy Score value: {:.4f}".format(Score))
print (CReport)

In [None]:
Actual_pred = pd.DataFrame({'Actual Class': ytest, 'Predicted Class': xgb_model_pred})
Actual_pred

In [None]:
#import bz2,pickle
file = bz2.BZ2File('Classification.pkl','wb')
pickle.dump(xgb_model, file)
file.close()

# Conclusion
- Based on the results and the implementation of the algorithms, the predictions can only indicate whether a fire could possibly occur at a location, given the inputs collected.
- Both the classification and the regression algorithms were completed and deployed.

# References
1. Faroudja ABID et al., "Predicting Forest Fire in Algeria using Data Mining Techniques: Case Study of the Decision Tree Algorithm", International Conference on Advanced Intelligent Systems for Sustainable Development (AI2SD 2019), 08 - 11 July, 2019, Marrakech, Morocco.
2. https://github.com/ashishrana1501/Forest-Fire-Prediction