<a href="https://www.kaggle.com/code/amirulmahmud/svc-acoustic-extinguisher-fire-prediction?scriptVersionId=124935625" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **The Objective**

The goal of this work is to build a classification model, using a Support Vector Classifier (SVC), that predicts flame extinction from features recorded in fire-extinguishing experiments.

# **The Dataset**

The dataset used in this work is taken from https://www.kaggle.com/datasets/muratkokludataset/acoustic-extinguisher-fire-dataset.

Import the libraries and the dataset that will be used.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_excel('/kaggle/input/acoustic-extinguisher-fire-dataset/Acoustic_Extinguisher_Fire_Dataset/Acoustic_Extinguisher_Fire_Dataset.xlsx')

In [None]:
df.head()

In [None]:
df.info()

# **Data Cleaning**

**Let's check whether there are any missing values in the data.**

In [None]:
df.isna().sum()

**Check the unique values for each column**

In [None]:
for col in df.columns:
    print(col)
    print(df[col].value_counts())

**Check for duplicates**

In [None]:
df.duplicated().sum()

Fortunately, the data is complete and does not have any missing values or duplicates.

# **Exploratory Data Analysis**

**Create a statistical summary**

In [None]:
df.describe().transpose()

**Create a pairplot**

In [None]:
sns.pairplot(df,hue='STATUS')

**Create a heatmap that displays the correlation between features**

In [None]:
plt.figure(figsize=(6,4),dpi=180)
# numeric_only=True skips the categorical FUEL column (required in pandas >= 2.0)
sns.heatmap(df.corr(numeric_only=True),cmap='viridis',annot=True)

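**Find the three features most correlated with the target.**

In [None]:
# tail(4) also returns STATUS itself (correlation 1), leaving the top three features
np.abs(df.corr(numeric_only=True)['STATUS']).sort_values().tail(4)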

As we can see, DISTANCE and AIRFLOW each correlate well with STATUS.

**Create boxplots to check for outliers**

In [None]:
plt.figure(figsize=(3,3),dpi=180)
sns.boxplot(data=df,y='DISTANCE',hue='STATUS');

In [None]:
plt.figure(figsize=(3,3),dpi=180)
sns.boxplot(data=df,y='DESIBEL',hue='STATUS');

In [None]:
plt.figure(figsize=(3,3),dpi=180)
sns.boxplot(data=df,y='AIRFLOW',hue='STATUS');

In [None]:
plt.figure(figsize=(3,3),dpi=180)
sns.boxplot(data=df,y='FREQUENCY',hue='STATUS');

From the boxplots above, we can see that there are no outliers.
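The visual check can be backed by a numeric one. The sketch below applies the 1.5 × IQR rule, which is also the default whisker length of seaborn boxplots, to a small made-up frame. The helper `iqr_outlier_counts` and the toy values are assumptions for illustration, and the rule is applied to each whole column rather than per STATUS group.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - whis * iqr, q3 + whis * iqr
        outside = (frame[col] < lower) | (frame[col] > upper)
        counts[col] = int(outside.sum())
    return counts

# Toy data (not the real dataset): AIRFLOW contains one extreme value
toy = pd.DataFrame({
    'DISTANCE': [10, 20, 30, 40, 50, 60, 70, 80],
    'AIRFLOW': [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 60.0],
})
outlier_counts = iqr_outlier_counts(toy, ['DISTANCE', 'AIRFLOW'])
print(outlier_counts)
```

On the real data, the same helper could be called as `iqr_outlier_counts(df, ['DISTANCE','DESIBEL','AIRFLOW','FREQUENCY'])`; a result of all zeros would agree with the boxplots.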

# **Train Test Split**

The approach here will use Cross Validation on 90% of the dataset, and then judge the results on a final test set of 10% to evaluate the model.
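As a rough picture of this protocol, the sketch below runs it end to end on synthetic data: hold out 10%, cross-validate on the remaining 90%, then score once on the held-out part. The data from `make_classification`, the default `SVC`, and the variable names are assumptions for illustration only; the notebook itself uses a halving grid search for the cross-validation step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

# 90% for cross-validation, 10% kept aside as the final test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

# 5-fold cross-validation uses the training part only
model = SVC()
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring='accuracy')
print('CV accuracy per fold:', cv_scores)

# The held-out 10% is touched once, after model selection is finished
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)
print('Test accuracy:', test_accuracy)
```

Keeping the test set out of the cross-validation loop means the final score is not influenced by the hyperparameter choices.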

**Split the dataset into features (X) and label/target (y)**

In [None]:
X = df.drop('STATUS', axis=1)
y = df['STATUS']

In [None]:
X.head()

In [None]:
y.head()

**Split the dataset into training set and testing set with ratio 90% : 10%.**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

**Check the balance of the label data in training set.**

In [None]:
y_train.value_counts()

In [None]:
sns.countplot(x=y_train)

In [None]:
7791/(7791+7906)

As we can see, the label classes in the training data are slightly imbalanced (49.6% : 50.4%). To balance them exactly, I will use random under-sampling, which drops randomly chosen rows from the majority class until both classes have the same count.
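The 49.6% : 50.4% ratio can also be read directly from the labels instead of typing the counts by hand. The sketch below rebuilds a label series from the two counts used above; which class holds which count is not stated, so the 0/1 assignment here is arbitrary, and `y_demo` is a hypothetical stand-in for `y_train`.

```python
import pandas as pd

# Stand-in for y_train, built from the two class counts quoted above
# (the assignment of counts to labels 0 and 1 is an assumption)
y_demo = pd.Series([0] * 7791 + [1] * 7906, name='STATUS')

# normalize=True returns class shares instead of raw counts
shares = y_demo.value_counts(normalize=True).round(3)
print(shares)
```

In the notebook, `y_train.value_counts(normalize=True)` would give the same shares without hard-coded numbers.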

In [None]:
from imblearn.under_sampling import RandomUnderSampler

In [None]:
rus = RandomUnderSampler(random_state=42)

In [None]:
X_train, y_train = rus.fit_resample(X_train, y_train)

In [None]:
y_train.value_counts()

Now, the label of training data is balanced and ready to use for next steps.
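For intuition, random under-sampling can be sketched in plain pandas. This is only an illustration on toy data: `undersample` is a hypothetical helper, and it is not a description of how imblearn's `RandomUnderSampler` works internally.

```python
import pandas as pd

def undersample(X, y, random_state=42):
    """Keep a random subset of each class, sized to the smallest class."""
    n_min = y.value_counts().min()
    # Sample n_min labels per class and remember which rows were kept
    keep = y.groupby(y).sample(n=n_min, random_state=random_state).index
    return X.loc[keep], y.loc[keep]

# Toy data (not the real dataset): three rows of class 0, four of class 1
X_toy = pd.DataFrame({'AIRFLOW': [0.0, 1.5, 2.0, 3.5, 4.0, 5.5, 6.0]})
y_toy = pd.Series([0, 0, 0, 1, 1, 1, 1], name='STATUS')

X_bal, y_bal = undersample(X_toy, y_toy)
print(y_bal.value_counts())
```

Only the training set is resampled; the test set keeps its original class distribution so that the final evaluation reflects the data as collected.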

# **Feature Engineering**

ColumnTransformer is particularly handy for the case of datasets that contain heterogeneous data types, since we may want to scale the numeric features and one-hot encode the categorical ones.

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

In [None]:
col_numeric = ['SIZE','DISTANCE','DESIBEL','AIRFLOW','FREQUENCY']
col_categoric = ['FUEL']

In [None]:
scaler = StandardScaler()
encoder = OneHotEncoder(drop='first')

In [None]:
preprocessor = ColumnTransformer([
    ('num',scaler,col_numeric),
    ('cat',encoder,col_categoric)
])

# **Pipeline and Grid Search**

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

In [None]:
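def grid_search(model, parameters):
    """Tune `model` inside the preprocessing pipeline and return the best fitted pipeline."""
    pipe = Pipeline([('preprocessor',preprocessor),('model',model)])
    # Successive halving: every candidate starts on few samples; with factor=2,
    # only the best half of the candidates moves on to each following round
    grid = HalvingGridSearchCV(estimator=pipe, param_grid=parameters,factor=2, cv=5, scoring='accuracy',random_state=42)
    grid.fit(X_train, y_train)
    print()
    print('Best Score : ',grid.best_score_)
    print('Best parameters : ', grid.best_params_)
    optimal_model = grid.best_estimator_
    return optimal_model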

# **Modeling with SVC**

**Create a base model of SVC**

In [None]:
from sklearn.svm import SVC

In [None]:
svc_model = SVC(class_weight='balanced')

**Perform a grid search to find the best estimator for final model.**

In [None]:
svc_param = {'model__C':[0.001, 0.01, 0.1, 1, 10, 100],
              'model__kernel':['poly','rbf','sigmoid'],
              'model__gamma':['scale','auto',0.001, 0.01, 0.1, 1, 10, 100]}

In [None]:
svc_optimal = grid_search(svc_model,svc_param)

# **Modeling with LinearSVC**

**Create a base model of LinearSVC**

In [None]:
from sklearn.svm import LinearSVC

In [None]:
linear_svc = LinearSVC(class_weight='balanced',random_state=42,dual=False)

**Perform a grid search to find the best estimator for final model.**

In [None]:
linsvc_param = {'model__C':[0.001, 0.01, 0.1, 1, 10, 100],
              'model__penalty':['l1','l2'],
              'model__fit_intercept':[True,False]}

In [None]:
linearsvc_optimal = grid_search(linear_svc,linsvc_param)

# **Modeling with NuSVC**

**Create a base model of NuSVC**

In [None]:
from sklearn.svm import NuSVC

In [None]:
# Use a separate name so the NuSVC class is not overwritten by its instance
nusvc_model = NuSVC(class_weight='balanced')

**Perform a grid search to find the best estimator for final model.**

In [None]:
nusvc_param = {'model__nu':[0.05, 0.1, 0.2, 0.3, 0.4, 0.5],
              'model__kernel':['poly','rbf','sigmoid','linear'],
              'model__gamma':['scale','auto',0.001, 0.01, 0.1, 1, 10, 100]}

In [None]:
nusvc_optimal = grid_search(nusvc_model, nusvc_param)

# **Final Model Evaluation**

In [None]:
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay,classification_report

In [None]:
def final_eval(model):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test,y_pred)
    print(model)
    print()
    print(classification_report(y_test,y_pred,labels=model.classes_))
    print()
    ConfusionMatrixDisplay.from_predictions(y_test,y_pred,labels=model.classes_)
    print()
    return accuracy

**Evaluation : SVC Optimal Model**

In [None]:
svc_accuracy = final_eval(svc_optimal)

**Evaluation : LinearSVC Optimal Model**

In [None]:
linearsvc_accuracy = final_eval(linearsvc_optimal)

**Evaluation : NuSVC Optimal Model**

In [None]:
nusvc_accuracy = final_eval(nusvc_optimal)

**Evaluation Results**

In [None]:
models = ['SVC','Linear SVC','NuSVC']
score = [svc_accuracy,linearsvc_accuracy,nusvc_accuracy]

In [None]:
plt.figure(figsize=(3,3),dpi=180)
plt.plot(models,score,marker='o')
plt.ylim(0.5,1)
plt.xlabel('Model')
plt.ylabel('Accuracy Score')
plt.title('Final Model Comparison')
plt.xticks(rotation=90);

**Find the best accuracy**

In [None]:
max(score)

# **Conclusion**

1. The best model is SVC with C = 1, kernel = 'rbf', and gamma = 1, reaching an accuracy of 97.3% on the test set.
2. All three final models perform well on the unseen data (X_test), each with an accuracy above 92%.

**Thank you for reading this notebook. Feel free to offer constructive advice or suggestions; I would really appreciate it.**