# Smoke Detection

In this example, I want to show you how to do three things:

1. Pull data from Kaggle
2. Set up a RandomForestClassifier model
3. Define how we calculate errors in classification models

In [1]:
#Set up the environment
%pip install -r requirements.txt

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.


I will be using the smoke detection dataset from [Kaggle](https://www.kaggle.com/code/narminhumbatli/smoke-detection-classification/data). If you have the time, take a minute to pull down the latest version of this data. Otherwise, I have provided the CSV (`smoke_detection_iot.csv`) here as well.

## Get Data

We have 14 descriptive columns and one label column ("Fire Alarm"). Let's take a look at the descriptors here.

**UTC** - The time when the experiment was performed.

**Temperature** - Temperature of the surroundings. Measured in degrees Celsius.

**Humidity** - The air humidity during the experiment.

**TVOC** - Total Volatile Organic Compounds. Measured in ppb (parts per billion).

**eCO2** - CO2 equivalent concentration. Measured in ppm (parts per million).

**Raw H2** - The amount of Raw Hydrogen present in the surroundings.

**Raw Ethanol** - The amount of Raw Ethanol present in the surroundings.

**Pressure** - Air pressure. Measured in hPa.

**PM1.0** - Particulate matter of diameter less than 1.0 micrometers.

**PM2.5** - Particulate matter of diameter less than 2.5 micrometers.

**NC0.5** - Concentration of particulate matter of diameter less than 0.5 micrometers.

**NC1.0** - Concentration of particulate matter of diameter less than 1.0 micrometers.

**NC2.5** - Concentration of particulate matter of diameter less than 2.5 micrometers.

**CNT** - Simple Count.

**Fire Alarm** - The ground-truth label: 1 if a fire was present, otherwise 0.
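The UTC values in this dataset are large integers such as 1654733331. A minimal sketch of how they could be read as timestamps, assuming they are Unix epoch seconds (my interpretation, not something the dataset description states):

```python
import pandas as pd

# The first five UTC values from this dataset.
# Assumption: they are seconds since 1970-01-01 00:00:00 UTC.
utc_seconds = pd.Series([1654733331, 1654733332, 1654733333, 1654733334, 1654733335])

# unit="s" tells pandas to interpret the integers as epoch seconds.
timestamps = pd.to_datetime(utc_seconds, unit="s")

# Time elapsed between consecutive readings.
gaps = timestamps.diff().dropna()

print(timestamps.iloc[0])
print(gaps.unique())
```

Under that assumption, the first reading falls in June 2022 and these five rows are one second apart.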

Now, let's take a look at what we have. 

In [2]:
import pandas as pd
df = pd.read_csv('smoke_detection_iot.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62630 entries, 0 to 62629
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      62630 non-null  int64  
 1   UTC             62630 non-null  int64  
 2   Temperature[C]  62630 non-null  float64
 3   Humidity[%]     62630 non-null  float64
 4   TVOC[ppb]       62630 non-null  int64  
 5   eCO2[ppm]       62630 non-null  int64  
 6   Raw H2          62630 non-null  int64  
 7   Raw Ethanol     62630 non-null  int64  
 8   Pressure[hPa]   62630 non-null  float64
 9   PM1.0           62630 non-null  float64
 10  PM2.5           62630 non-null  float64
 11  NC0.5           62630 non-null  float64
 12  NC1.0           62630 non-null  float64
 13  NC2.5           62630 non-null  float64
 14  CNT             62630 non-null  int64  
 15  Fire Alarm      62630 non-null  int64  
dtypes: float64(8), int64(8)
memory usage: 7.6 MB
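The `Unnamed: 0` column in the listing above is what pandas typically produces when a CSV was saved with its row index and no header for that column. Here is a small sketch of that behavior on an in-memory CSV (the text below is a made-up stand-in, not the real file), along with the `index_col=0` option that absorbs the column at load time:

```python
import io
import pandas as pd

# Hypothetical two-row CSV whose first header cell is empty,
# mimicking a file written with DataFrame.to_csv() and its default index.
csv_text = ",UTC,Fire Alarm\n0,1654733331,0\n1,1654733332,0\n"

# Default read: the nameless first column becomes "Unnamed: 0".
default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default.columns.tolist())  # ['Unnamed: 0', 'UTC', 'Fire Alarm']
print(indexed.columns.tolist())  # ['UTC', 'Fire Alarm']
```

Either approach works; this notebook keeps the default read and drops the column explicitly.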


In [4]:
import json
from dataprofiler import Data, Profiler

data = Data('smoke_detection_iot.csv') # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text
print(data.data.head(5)) # Access data directly via a compatible Pandas DataFrame

profile = Profiler(data) # Calculate Statistics, Entity Recognition, etc
readable_report = profile.report(report_options={"output_format":"pretty"})
print(json.dumps(readable_report, indent=4))

  Unnamed: 0         UTC Temperature[C] Humidity[%] TVOC[ppb] eCO2[ppm]  \
0          0  1654733331           20.0       57.36         0       400   
1          1  1654733332         20.015       56.67         0       400   
2          2  1654733333         20.029       55.96         0       400   
3          3  1654733334         20.044       55.28         0       400   
4          4  1654733335         20.059       54.69         0       400   

  Raw H2 Raw Ethanol Pressure[hPa] PM1.0 PM2.5 NC0.5 NC1.0 NC2.5 CNT  \
0  12306       18520       939.735   0.0   0.0   0.0   0.0   0.0   0   
1  12345       18651       939.744   0.0   0.0   0.0   0.0   0.0   1   
2  12374       18764       939.738   0.0   0.0   0.0   0.0   0.0   2   
3  12390       18849       939.736   0.0   0.0   0.0   0.0   0.0   3   
4  12403       18921       939.744   0.0   0.0   0.0   0.0   0.0   4   

  Fire Alarm  
0          0  
1          0  
2          0  
3          0  
4          0  


2022-12-04 13:25:56.919088: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns...  (with 7 processes)


  df_series = df_series.loc[true_sample_list]
100%|██████████| 16/16 [00:05<00:00,  2.90it/s]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 16/16 [00:06<00:00,  2.59it/s]


{
    "global_stats": {
        "samples_used": 12526,
        "column_count": 16,
        "row_count": 62630,
        "row_has_null_ratio": 0.0,
        "row_is_null_ratio": 0.0,
        "unique_row_ratio": 1.0,
        "duplicate_row_count": 0,
        "file_type": "csv",
        "encoding": "utf-8",
        "correlation_matrix": null,
        "chi2_matrix": "[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n  nan, nan], ... , [nan, nan, nan, nan,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0., nan,  0.,\n  nan,  1.]]",
        "profile_schema": {
            "Unnamed: 0": [
                0
            ],
            "UTC": [
                1
            ],
            "Temperature[C]": [
                2
            ],
            "Humidity[%]": [
                3
            ],
            "TVOC[ppb]": [
                4
            ],
            "eCO2[ppm]": [
                5
            ],
            "Raw H2": [
                6
            ],
          

In [5]:
df.nunique()

Unnamed: 0        62630
UTC               62630
Temperature[C]    21672
Humidity[%]        3890
TVOC[ppb]          1966
eCO2[ppm]          1713
Raw H2             1830
Raw Ethanol        2659
Pressure[hPa]      2213
PM1.0              1337
PM2.5              1351
NC0.5              3093
NC1.0              4113
NC2.5              1161
CNT               24994
Fire Alarm            2
dtype: int64

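Hmm. `Unnamed: 0`, `UTC`, and `CNT` are a row index, a timestamp, and a running count. They record when or in what order a sample was taken rather than what the sensors measured, so we will remove them before training.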

In [6]:
df.drop(['Unnamed: 0','CNT','UTC'],axis=1,inplace=True)
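One quick way to spot identifier-like columns is to compare each column's unique count with the number of rows. The sketch below uses a hypothetical four-row frame (the `row_id` name and all values are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature frame, not the real data.
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "UTC": [1654733331, 1654733332, 1654733333, 1654733334],
    "Humidity[%]": [57.36, 56.67, 56.67, 55.28],
    "Fire Alarm": [0, 0, 1, 1],
})

# A column with one distinct value per row behaves like an identifier.
unique_per_row = toy.nunique() == len(toy)
id_like = unique_per_row[unique_per_row].index.tolist()

print(id_like)  # ['row_id', 'UTC']
```

This is only a heuristic. A continuous measurement can also be unique in every row, and a counter that repeats, like `CNT` here with 24994 distinct values across 62630 rows, would not be flagged, so the column descriptions still matter.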

In [7]:
#checking the columns again
df.columns

Index(['Temperature[C]', 'Humidity[%]', 'TVOC[ppb]', 'eCO2[ppm]', 'Raw H2',
       'Raw Ethanol', 'Pressure[hPa]', 'PM1.0', 'PM2.5', 'NC0.5', 'NC1.0',
       'NC2.5', 'Fire Alarm'],
      dtype='object')

In [8]:
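# Split the rows by label to see how balanced the two classes are
fire_alarm = df[df['Fire Alarm'] == 1]
no_fire_alarm = df[df['Fire Alarm'] == 0]

# Share of each class in the full dataset
fire_share = len(fire_alarm)/len(df)
no_fire_share = len(no_fire_alarm)/len(df)

print('Fire Alarm:',fire_share*100,'%')
print('No Fire Alarm:',no_fire_share*100,'%')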

Fire Alarm: 71.46255787961042 %
No Fire Alarm: 28.53744212038959 %
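The same class shares can be computed in one call with `value_counts(normalize=True)`. A small sketch on a made-up label column (ten hypothetical labels, not the real data):

```python
import pandas as pd

# Hypothetical labels: seven alarms and three non-alarms.
labels = pd.Series([1, 1, 1, 0, 1, 0, 1, 1, 0, 1], name="Fire Alarm")

# normalize=True returns fractions instead of raw counts.
shares = labels.value_counts(normalize=True) * 100

print(shares)
```

With roughly 71% of the real rows labeled 1, the classes are imbalanced, which is worth keeping in mind when reading accuracy numbers later.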


In [9]:
# Separate the label from the features without modifying df in place
y = df['Fire Alarm']
X = df.drop('Fire Alarm',axis = 1)

In [10]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=0)
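Because the classes are imbalanced, one option worth considering is a stratified split, which keeps the class ratio the same in the train and test sets. This is a sketch of that option on synthetic data, not what the cell above does (it uses a plain random split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 70% positive and 30% negative labels.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy asks for the same label proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print("train positive share:", y_tr.mean())
print("test positive share:", y_te.mean())
```

With 62630 rows a plain random split will usually land close to the overall ratio anyway, so stratifying matters most on smaller or more skewed datasets.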

Now, we are ready to train the model.

## Train Model

In [11]:

from sklearn.preprocessing import StandardScaler
std = StandardScaler()
X_train = std.fit_transform(X_train)
X_test = std.transform(X_test)
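To make the scaling step concrete, here is a sketch showing that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the test set is transformed with those same training statistics. The numbers are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical readings: two features, four training rows, one test row.
train = np.array([[20.0, 57.0], [22.0, 55.0], [24.0, 50.0], [26.0, 46.0]])
test = np.array([[30.0, 40.0]])

# Fit on the training data only, so no test information leaks into the scaler.
scaler = StandardScaler().fit(train)

# Manual z-score using the training mean and population std (ddof=0).
mean = train.mean(axis=0)
std_dev = train.std(axis=0)
manual_test = (test - mean) / std_dev

print(scaler.transform(test))
print(manual_test)
```

Tree-based models such as random forests split on thresholds, so they are generally insensitive to this kind of rescaling. The step does no harm here, and it keeps the data ready for models that do depend on feature scale.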

In [12]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier()
classifier.fit(X_train,y_train)
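Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the splits. The sketch below uses synthetic data from `make_classification` and fixes `random_state` for repeatability; the feature names it prints are placeholders, and the settings are illustrative rather than the ones used in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 5 features, of which 2 carry signal.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

# random_state makes the fitted forest reproducible between runs.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_syn, y_syn)

# Impurity-based importances, normalized to sum to 1.
importances = forest.feature_importances_
for i in importances.argsort()[::-1]:
    print(f"feature_{i}: {importances[i]:.3f}")
```

On the real data, the same attribute on `classifier` lines up with the column order of `X`, which could help explain why the model separates the classes so cleanly.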

In [13]:
from sklearn.metrics import classification_report, confusion_matrix
# Predict y data with classifier: 
y_predict = classifier.predict(X_test)

# Print results: 
print(confusion_matrix(y_test, y_predict))
print(classification_report(y_test, y_predict)) 


[[3605    0]
 [   0 8921]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3605
           1       1.00      1.00      1.00      8921

    accuracy                           1.00     12526
   macro avg       1.00      1.00      1.00     12526
weighted avg       1.00      1.00      1.00     12526
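To tie this back to how errors are calculated, here is a sketch that derives the report's numbers for class 1 directly from a confusion matrix. scikit-learn lays the binary matrix out as `[[TN, FP], [FN, TP]]`. The helper name `binary_metrics` and the second, imperfect matrix are made up for illustration:

```python
import numpy as np

def binary_metrics(cm):
    """Precision, recall, F1, and accuracy for class 1 from a 2x2 confusion matrix."""
    # Rows are true labels, columns are predicted labels, both ordered 0 then 1.
    tn, fp, fn, tp = (int(v) for v in np.asarray(cm).ravel())
    precision = tp / (tp + fp)   # of the predicted alarms, how many were real
    recall = tp / (tp + fn)      # of the real alarms, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# The matrix printed above: no false positives and no false negatives.
reported = binary_metrics([[3605, 0], [0, 8921]])

# A hypothetical imperfect matrix, to show how the metrics diverge.
hypothetical = binary_metrics([[90, 10], [20, 80]])

print(reported)
print(hypothetical)
```

For a smoke detector, a false negative (a missed fire) is usually the costlier error, so recall on class 1 is the number I would watch most closely. A perfect score on every metric is also worth double-checking, for example with cross-validation, before trusting it.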

