
PART B: DATA MODELLING
For this part, I chose to work with the Support Vector Machine (SVM) model because:

1. It can model time series data once lagged values are supplied as features, which suits the data provided to me for the task.

2. Its one-class variant is designed for one-class classification, that is, anomaly detection.

3. During one of our discussions in the interview, Rob Spencer highlighted the efficiency of SVM models in data modelling for Acutro, so I applied SVM.


In [23]:
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

In [24]:
testdata2 = pd.read_csv('boiler-flow-temperature.csv')
testdata2.head(100)

Unnamed: 0,timestamp,value
0,1543622414000,35.5000
1,1543622715000,35.2500
2,1543623014000,35.0000
3,1543623314000,34.7500
4,1543623614000,34.5000
...,...,...
95,1543651813000,27.6250
96,1543652112000,27.6250
97,1543652413000,27.5625
98,1543652712000,27.5625


In [25]:
# Convert the timestamp column from epoch milliseconds to datetime
testdata2[' timestamp'] = pd.to_datetime(testdata2[' timestamp'], unit='ms')
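
The column names in this file carry a leading space (' timestamp', ' value'), which is easy to mistype. As a minimal sketch, assuming a small hand-made frame in place of the real CSV, the names can be stripped before the conversion. The frame `sample` is hypothetical, and the rest of this notebook keeps the original, unstripped names.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the CSV, including the
# leading spaces in the column names.
sample = pd.DataFrame({
    " timestamp": [1543622414000, 1543622715000, 1543623014000],
    " value": [35.5, 35.25, 35.0],
})

# Strip stray whitespace so the columns can be addressed as 'timestamp' and 'value'.
sample.columns = sample.columns.str.strip()

# Interpret the integers as milliseconds since the Unix epoch.
sample["timestamp"] = pd.to_datetime(sample["timestamp"], unit="ms")

print(sample)
```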

FEATURE ENGINEERING

I extracted the date and time from the timestamp attribute to create additional features, with the aim of improving model performance.

In [26]:
# Extract date and time components separately
testdata2['date'] = testdata2[' timestamp'].dt.date
testdata2['time'] = testdata2[' timestamp'].dt.time

# Display what the extracted features look like
print(testdata2['date'])
print(testdata2['time'])

0       2018-12-01
1       2018-12-01
2       2018-12-01
3       2018-12-01
4       2018-12-01
           ...    
8479    2018-12-31
8480    2018-12-31
8481    2018-12-31
8482    2018-12-31
8483    2018-12-31
Name: date, Length: 8484, dtype: object
0       00:00:14
1       00:05:15
2       00:10:14
3       00:15:14
4       00:20:14
          ...   
8479    23:30:50
8480    23:35:50
8481    23:40:51
8482    23:50:50
8483    23:55:51
Name: time, Length: 8484, dtype: object
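
The extracted date and time columns hold Python objects (dtype: object), which an SVM cannot use directly. One possible way to turn them into model-ready features is to pull numeric components from the timestamp instead. The sketch below assumes three hand-picked epoch values and hypothetical column names (`hour`, `minute`, `dayofweek`); it is not part of the original pipeline.

```python
import pandas as pd

# Three illustrative epoch-millisecond timestamps from December 2018.
ts = pd.Series(pd.to_datetime([1543622414000, 1543651813000, 1546300551000], unit="ms"))

# Numeric components that can be fed to a model directly.
features = pd.DataFrame({
    "hour": ts.dt.hour,
    "minute": ts.dt.minute,
    "dayofweek": ts.dt.dayofweek,  # Monday = 0, Sunday = 6
})

print(features)
```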


In [33]:
column = testdata2[' value']
print(column)

0       35.5000
1       35.2500
2       35.0000
3       34.7500
4       34.5000
         ...   
8479    36.7500
8480    36.2500
8481    35.8125
8482    35.0000
8483    34.6250
Name:  value, Length: 8484, dtype: float64


MEMORY PARAMETER

The task requires me to specify how much history the algorithm uses during training, so I set a memory parameter. With a memory window of m = 5, each sample includes the five preceding temperature values alongside the current one.

In [51]:
# Set the memory parameter m and build the input data with a loop
m = 5
input_data = []

for i in range(m, len(testdata2)):
    data_point = {
        'date': testdata2['date'].iloc[i],
        'time': testdata2['time'].iloc[i],
        ' value': testdata2[' value'].iloc[i],
    }
    # Add previous m temperature values to the data point
    for j in range(1, m + 1):
        data_point[f' value_{j}'] = testdata2[' value'].iloc[i - j]

    input_data.append(data_point)

input_df = pd.DataFrame(input_data)
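
The same lag features can also be built without an explicit Python loop over rows. The sketch below is an alternative, not the method used above: it assumes a short hypothetical series and uses `Series.shift` to create the m previous values, then drops the first m rows, which lack a full history (the loop achieves the same by starting at index m).

```python
import pandas as pd

m = 5
# Hypothetical temperature readings standing in for the ' value' column.
values = pd.Series([35.5, 35.25, 35.0, 34.75, 34.5, 34.25, 34.0, 33.75])

lagged = pd.DataFrame({"value": values})
for j in range(1, m + 1):
    # shift(j) moves each reading down j rows, so row i holds the value at i - j.
    lagged[f"value_{j}"] = values.shift(j)

# Discard the first m rows, whose lag columns are incomplete.
lagged = lagged.iloc[m:].reset_index(drop=True)

print(lagged)
```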

NORMALIZATION: I used StandardScaler to standardize the temperature values (zero mean, unit variance) so that all features share a common scale, which matters for the distance-based RBF kernel.

In [52]:
# Normalize the temperature values using StandardScaler
scaler = StandardScaler()
input_df[[' value', ' value_1', ' value_2', ' value_3', ' value_4', ' value_5']] = scaler.fit_transform(input_df[[' value', ' value_1', ' value_2', ' value_3', ' value_4', ' value_5']])
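
The cell above fits the scaler on every row, including rows that are later held out. A common precaution is to learn the mean and standard deviation from the training rows only and reuse them for the rest. The sketch below illustrates that pattern on a hypothetical random feature matrix; the 80/20 split is an assumption for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical feature matrix: 100 samples, 6 temperature columns.
X = rng.normal(loc=35.0, scale=5.0, size=(100, 6))
X_train, X_test = X[:80], X[80:]

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean and standard deviation

print(X_train_scaled.mean(axis=0).round(3))
```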

TRAINING THE DATA

In [62]:
# Define training data as every row except the last 7. Each row is one
# 5-minute reading, so this holds out about 35 minutes, not the last 7 days.
training_data = input_df.iloc[:len(input_df) - 7]
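
Because the slice above removes the last 7 rows rather than the last 7 days, a split on calendar dates may be closer to the stated intent of training on the first three weeks. The following is a minimal sketch under assumptions: the frame, its 6-hour sampling, and the names `cutoff`, `train` and `test` are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in: one reading every 6 hours across December 2018.
stamps = pd.date_range("2018-12-01", "2018-12-31 18:00", freq=pd.Timedelta(hours=6))
frame = pd.DataFrame({"timestamp": stamps, "value": 35.0})

# Midnight at the start of the final 7 calendar days.
cutoff = frame["timestamp"].max().normalize() - pd.Timedelta(days=6)

train = frame[frame["timestamp"] < cutoff]   # 1 to 24 December
test = frame[frame["timestamp"] >= cutoff]   # 25 to 31 December

print(len(train), len(test))
```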



SETTING THE MODEL PARAMETERS

In [63]:
parameters = {
    'nu': 0.05,  # Upper bound on the fraction of training samples treated as outliers
    'kernel': 'rbf',
    'gamma': 0.01
}
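
To make the role of nu concrete, the sketch below fits a one-class SVM with several nu values and reports the share of training points flagged as outliers, which typically lands close to nu. The data are hypothetical random points, and gamma=0.1 is an illustrative choice rather than the value used in this notebook.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # hypothetical, already standardized features

flagged = {}
for nu in (0.05, 0.2, 0.4):
    model = OneClassSVM(nu=nu, kernel="rbf", gamma=0.1).fit(X)
    # predict returns -1 for points outside the learned boundary.
    flagged[nu] = float(np.mean(model.predict(X) == -1))
    print(f"nu={nu:.2f} -> fraction flagged on training data: {flagged[nu]:.3f}")
```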

BUILDING THE SVM MODEL

In [64]:
# Build a one-class SVM model from the parameters above
svm_model = svm.OneClassSVM(**parameters)



TRAINING THE MODEL AND MAKING PREDICTIONS

In [66]:
# Fit the model on the training rows only
svm_model.fit(training_data[[' value', ' value_1', ' value_2', ' value_3', ' value_4', ' value_5']])
# Make predictions for every row (+1 = inlier, -1 = outlier)
predictions = svm_model.predict(input_df[[' value', ' value_1', ' value_2', ' value_3', ' value_4', ' value_5']])
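
The values returned by `predict` are +1 for inliers and -1 for outliers, and `decision_function` gives a signed score whose negative values fall outside the learned boundary. The sketch below shows how these might be turned into an anomaly mask. It assumes hypothetical random training data and two hand-picked query points, with gamma=0.1 chosen only for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 2))          # hypothetical normal behaviour
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])   # a typical point and one far from the training cloud

model = OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1).fit(X_train)

labels = model.predict(X_new)             # +1 = inlier, -1 = outlier
scores = model.decision_function(X_new)   # negative = outside the learned boundary
is_anomaly = labels == -1                 # boolean mask for flagged points

print(labels, scores, is_anomaly)
```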


CLASSIFICATION REPORT


In [68]:
from sklearn.metrics import classification_report

In [72]:
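# Print the classification report. There are no ground-truth labels, so every
# sample is assumed normal (+1). predict returns -1 for outliers and +1 for
# inliers, so the target names follow that sorted label order.
report = classification_report([1] * len(input_df), predictions, labels=[-1, 1], target_names=['Anomaly', 'Normal'])
print(report)

              precision    recall  f1-score   support

     Anomaly       0.00      0.00      0.00         0
      Normal       1.00      0.95      0.97      8479

    accuracy                           0.95      8479
   macro avg       0.50      0.47      0.49      8479
weighted avg       1.00      0.95      0.97      8479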





ANALYSIS

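The classification report above should be read with caution. The dataset has no ground-truth anomaly labels, so I compared the predictions against a reference in which every sample is assumed normal (+1). The scores therefore describe how often the model agrees with that assumption, not how well it detects real anomalies. The model may also be overfitting, probably because of the limited features available in the dataset, although this cannot be confirmed without labelled anomalies.
From the classification report, the following can be deduced.
1. The target names are display labels only. OneClassSVM.predict returns +1 for inliers (normal) and -1 for outliers (anomalies), and classification_report lists classes in sorted label order, so the first row (support 0) is class -1 and the second row (support 8479) is class +1.

2. Precision: the class -1 row scores 0.00 because the reference contains no samples of that class, so every sample the model flags as an outlier counts as a false positive. The class +1 row scores 1.00 because every sample predicted as an inlier is also an inlier in the all-normal reference.

3. Recall: the class +1 row scores 0.95, meaning the model labelled about 95% of the samples as inliers and flagged the remaining 5% as outliers. This is consistent with nu = 0.05. Recall for class -1 is undefined because its support is 0, and it is reported as 0.00.

4. F-measure: the class +1 row scores 0.97, the harmonic mean of its precision (1.00) and recall (0.95). The class -1 row scores 0.00 for the same reason as its precision and recall.

The accuracy of 0.95 likewise shows that the model agreed with the all-normal reference on about 95% of the samples.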

In summary, the model performance can be attributed to:
1. the small amount of historical data provided

2. the limited number of features in the dataset

3. data splitting, which further reduced the amount of data used to develop the model

4. the use of a single algorithm; trying other algorithms and comparing their performance could help.
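
One way to act on the last point is to compare algorithms on data where the anomalies are known. The sketch below is an illustration under assumptions, not a result from this dataset: it generates hypothetical normal points, injects five clearly abnormal ones, and prints a classification report for a one-class SVM and for an Isolation Forest (a different detector, named here only as an example). All parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 2))               # hypothetical normal behaviour
X_normal = rng.normal(size=(95, 2))               # held-out normal points
X_anomalous = rng.normal(loc=10.0, size=(5, 2))   # injected, clearly abnormal points
X_eval = np.vstack([X_normal, X_anomalous])
y_true = np.array([1] * 95 + [-1] * 5)            # +1 = normal, -1 = anomaly

models = {
    "OneClassSVM": OneClassSVM(nu=0.05, kernel="rbf", gamma=0.1),
    "IsolationForest": IsolationForest(contamination=0.05, random_state=0),
}

predictions_by_model = {}
for name, model in models.items():
    y_pred = model.fit(X_train).predict(X_eval)
    predictions_by_model[name] = y_pred
    print(name)
    # labels=[-1, 1] fixes the row order so the target names line up.
    print(classification_report(
        y_true, y_pred, labels=[-1, 1],
        target_names=["Anomaly", "Normal"], zero_division=0,
    ))
```

With true labels available, the precision and recall of the "Anomaly" row become meaningful and can be compared across models.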
