**Import the required dependencies for the project**

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score




# **Dataset preparation steps**

The steps that are performed in this section are as outlined below:

1. ***Load the dataset required***
   - The dataset is loaded from a CSV file named `scaled_output1.csv` using the pandas library.
   - The data is stored in a DataFrame named `data` for further processing.

2. ***Filter the data where label is 0***
   - The dataset is filtered to include only the rows where the 'Label' column is equal to 0.
   - This filtered dataset is stored in a DataFrame named `filtered_data_0`.
   - Training the autoencoder only on these rows (presumably the normal class) lets it learn to reconstruct normal behaviour, so that other samples stand out through a larger reconstruction error.

3. ***Split the data into training set, evaluation set, and test set***
   - The filtered data is split into three parts with two calls to `train_test_split` from scikit-learn, both using a fixed random state (42) for reproducibility.
   - Training Set: the first call uses `test_size=0.1`, so 90% of the filtered data is used for training the autoencoder model.
   - Evaluation Set: the second call uses `test_size=0.9` on the held-out 10%, so roughly 1% of the filtered data forms the evaluation set, which is later used to compute the anomaly threshold.
   - Test Set: the remaining 9% or so is saved as the test set.
   - A 60/20/20 split would instead require `test_size=0.4` in the first call and `test_size=0.5` in the second.
   - The label column and other non-feature columns are dropped from the training, evaluation, and test sets to prepare the data for training the model.
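If a 60/20/20 split is the goal, two calls to `train_test_split` can produce it: hold out 40% first, then split the held-out part in half. The sketch below uses a small synthetic DataFrame (the column names `f1`, `f2` and the values are made up for illustration) and prints the resulting sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the filtered (Label == 0) data; names are hypothetical.
rng = np.random.default_rng(42)
demo = pd.DataFrame({"f1": rng.random(100), "f2": rng.random(100), "Label": 0})

# First call: 60% train, 40% held out.
train_part, held_out = train_test_split(demo, test_size=0.4, random_state=42)
# Second call: split the held-out 40% in half, giving 20% eval and 20% test.
eval_part, test_part = train_test_split(held_out, test_size=0.5, random_state=42)

# Drop the label by returning a new frame instead of modifying the split in place.
train_features = train_part.drop(columns=["Label"])

print(len(train_part), len(eval_part), len(test_part))
```

`test_size` in the second call is a fraction of the held-out part, not of the full dataset, which is why 0.5 yields 20% of the original rows.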




In [2]:
# Load the dataset (raw string so the backslashes in the Windows path are not treated as escapes)
data = pd.read_csv(r'D:\OneDrive - NITT\Custom_Download\scaled_output1.csv')

# Keep only the rows where the label is 0
filtered_data_0 = data[data['Label'] == 0]

# Split the data into train, eval, and test
train_data, temp_data = train_test_split(filtered_data_0, test_size=0.1, random_state=42)  # 90% for training
eval_data, test_data = train_test_split(temp_data, test_size=0.9, random_state=42)  # held-out 10% -> about 1% eval, 9% test

# Drop the label column and the other columns that are not used as features
columns_to_drop = ['Label','Dport','SrcBytes','SrcLoad','SrcGap','DstGap','SIntPkt','SIntPktAct','DIntPktAct',
                   'sMaxPktSz','dMaxPktSz','sMinPktSz','dMinPktSz','Dur','Trans','TotPkts','TotBytes','Load',
                   'Loss','pLoss','pSrcLoss','pDstLoss','Rate','DIA']
# Reassign instead of using inplace=True to avoid modifying a slice of the original frame
train_data = train_data.drop(columns=columns_to_drop)
eval_data = eval_data.drop(columns=columns_to_drop)
test_data = test_data.drop(columns=columns_to_drop)

# save to CSV files
train_data.to_csv('train_data.csv', index=False)
eval_data.to_csv('eval_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)

In [3]:
# Reload the dataset (raw string so the backslashes in the Windows path are not treated as escapes)
data = pd.read_csv(r'D:\OneDrive - NITT\Custom_Download\scaled_output1.csv')

# Keep only the rows where the label is 1
filtered_data_1 = data[data['Label'] == 1]

# Split the data into test 1, test 2, and test 3
test1_data, temp_data = train_test_split(filtered_data_1, test_size=0.5, random_state=42)  # 50% for test 1
test2_data, test3_data = train_test_split(temp_data, test_size=0.95, random_state=42)  # remaining 50% -> about 2.5% test 2, 47.5% test 3

# Drop the same label and non-feature columns as in the previous cell
test1_data = test1_data.drop(columns=columns_to_drop)
test2_data = test2_data.drop(columns=columns_to_drop)
test3_data = test3_data.drop(columns=columns_to_drop)

# save to CSV files
test1_data.to_csv('test1_data.csv', index=False)
test2_data.to_csv('test2_data.csv', index=False)
test3_data.to_csv('test3_data.csv', index=False)

# **Define the Autoencoder**

The steps that are performed in this section are as outlined below:

***1. Input Layer:***
   - The input layer for the autoencoder is defined with a shape that matches the number of features in the training data.
   - This layer serves as the entry point for the data into the autoencoder model, accepting input data with the specified number of features.

***2. Encoder Part:***

   a. ***First Dense Layer***:
      - A dense (fully connected) layer with 256 units is added.
      - The ReLU (Rectified Linear Unit) activation function is used to introduce non-linearity into the model.
      - With 256 units this layer is wider than the 13-feature input, so it expands the representation rather than reducing it; the compression happens in the layers that follow.

   b. ***Dropout Layer:***
      - To prevent overfitting, a dropout layer is added after the first dense layer.
      - This layer randomly drops 20% of the units during each training iteration, ensuring the model does not become too reliant on any single feature.

   c. ***Second Dense Layer:***
      - Another dense layer with 128 units is added.
      - The ReLU activation function is used again to maintain non-linearity in the model.
      - This layer halves the width from 256 to 128 units, starting the compression toward the bottleneck.

   d. ***Dropout Layer:***
      - A second dropout layer is added after the second dense layer.
      - This layer also drops 20% of the units during training to further prevent overfitting and improve the generalization of the model.

   e. ***Third Dense Layer:***
      - A dense layer with 64 units is added.
      - The ReLU activation function is used for non-linearity.
      - This layer is the narrowest in the network and forms the bottleneck of the autoencoder.
      - At 64 units the bottleneck is still wider than the 13-feature input, so the encoding is not lower-dimensional than the data itself; the network has to rely on regularization such as the dropout layers, rather than on compression alone, to avoid simply copying its input.


In [4]:
# Define the Autoencoder
input_layer = Input(shape=(train_data.shape[1],))
# Encoder part
encoded = Dense(256, activation='relu')(input_layer)
encoded = Dropout(0.2)(encoded)
encoded = Dense(128, activation='relu')(encoded)
encoded = Dropout(0.2)(encoded)
encoded = Dense(64, activation='relu')(encoded)


# ***Decoder Part***

The steps that are performed in this section are as outlined below:

1. ***First Dense Layer:***
   - A dense (fully connected) layer with 128 units is added.
   - The ReLU (Rectified Linear Unit) activation function is used to introduce non-linearity into the model.
   - This layer starts the process of reconstructing the original input data from the encoded representation obtained from the encoder part of the autoencoder.

2. ***Second Dense Layer:***
   - Another dense layer with 256 units is added.
   - The ReLU activation function is used again to maintain non-linearity in the model.
   - This layer further reconstructs the input data by expanding the encoded representation, bringing it closer to the original input dimensions.

3. ***Third Dense Layer:***
   - A dense layer with the number of units matching the number of features in the training data is added.
   - The sigmoid activation function keeps the output values in the range (0, 1), which suits reconstruction when the input features are scaled to [0, 1], as the pre-scaled input file suggests.
   - This layer outputs the final reconstructed data, completing the reconstruction process and allowing the autoencoder to compare the output with the original input for learning and optimization.

4. ***Model Compilation:***
   - The autoencoder model is compiled using the Adam optimizer with a learning rate of 0.001.
   - The loss function used is mean squared error, which measures the average of the squares of the errors between the original input and the reconstructed output.
   - Accuracy is included as a metric, but for a continuous reconstruction target it is not a meaningful measure; the MSE loss is the value to monitor during training.
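To make the layer widths concrete, the following NumPy-only sketch pushes a small batch through layers of the same sizes using random, untrained weights. It is not the Keras model: dropout and training are omitted, and the 13-feature input width is an assumption taken from the test-data step later in the notebook. It only illustrates how the shapes change and that a sigmoid output stays inside (0, 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, units, activation):
    """Apply one fully connected layer with random (untrained) weights."""
    w = rng.normal(scale=0.1, size=(x.shape[1], units))
    z = x @ w
    if activation == "relu":
        return np.maximum(z, 0.0)
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

n_features = 13                    # assumed input width
x = rng.random((4, n_features))    # made-up batch of 4 samples in [0, 1)

h = x
shapes = [h.shape]
for units in [256, 128, 64, 128, 256]:   # encoder widths, then decoder widths
    h = dense(h, units, "relu")
    shapes.append(h.shape)

out = dense(h, n_features, "sigmoid")    # output layer matches the input width
shapes.append(out.shape)
print(shapes)
```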


In [5]:
# Decoder part
decoded = Dense(128, activation='relu')(encoded)
decoded = Dense(256, activation='relu')(decoded)
decoded = Dense(train_data.shape[1], activation='sigmoid')(decoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer=Adam(learning_rate=0.001), metrics=['accuracy'], loss='mean_squared_error')

# **Cross-validation**

The steps that are performed in this section are as outlined below:

1. ***K-Fold Splitting:***
   - The data is split into 5 folds using the KFold class from scikit-learn.
   - Shuffling is enabled to ensure that the data is randomly divided into folds, which helps in reducing bias and improving the robustness of the model.
   - A fixed random state (42) is used for reproducibility, ensuring that the splits are the same each time the code is run.

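2. ***Training and Validation:***
   - For each fold, the autoencoder is trained on the training data (`train_fold`) and validated on the validation data (`val_fold`).
   - The same `autoencoder` instance is reused across folds without re-initializing its weights, so training continues from where the previous fold stopped. From the second fold onward, the validation fold contains samples the model has already trained on, which makes the per-fold validation loss an optimistic estimate of generalization.

3. ***Number of Epochs:***
   - The model is trained for 50 epochs in each fold. Because the weights carry over, this amounts to 250 epochs in total across the 5 folds.
   - More epochs give the model more passes over the data, which helps it converge but also increases the risk of overfitting.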

4. ***Batch Size:***
   - Training is performed in batches of 32 samples.
   - Using a batch size helps in efficiently utilizing computational resources by processing multiple samples together, which speeds up training and stabilizes learning.

5. ***Shuffle:***
   - The training data is shuffled during each epoch to avoid any bias that may arise from the order of samples.
   - Shuffling ensures that the model does not learn any spurious patterns that may be present in the sequence of the training data.

6. ***Validation Data:***
   - Validation data (`val_fold`) is passed to `fit` so that the validation loss is reported after every epoch.
   - Comparing the validation loss with the training loss helps to spot overfitting. Only in the first fold is the validation data truly unseen, since the same model instance continues training across folds.
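The fold mechanics can be checked on a toy array. The sketch below uses ten made-up samples and counts how often each one lands in a validation fold. Every sample appears in exactly one validation fold and in the training part of the other four, which is why a model whose weights carry over between folds has seen all of the data by the end.

```python
import numpy as np
from sklearn.model_selection import KFold

samples = np.arange(10).reshape(-1, 1)  # ten toy samples, one feature each
kf = KFold(n_splits=5, shuffle=True, random_state=42)

val_counts = np.zeros(len(samples), dtype=int)
fold_sizes = []
for train_index, val_index in kf.split(samples):
    # Within a single fold, training and validation indices never overlap.
    assert set(train_index).isdisjoint(val_index)
    val_counts[val_index] += 1
    fold_sizes.append((len(train_index), len(val_index)))

print(fold_sizes)   # (train size, validation size) per fold
print(val_counts)   # how many times each sample was used for validation
```

To get an independent estimate per fold, one option is to build and compile a fresh model inside the loop so that each fold starts from newly initialized weights.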


In [6]:
# Cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, val_index in kf.split(train_data):
    train_fold, val_fold = train_data.iloc[train_index], train_data.iloc[val_index]
    autoencoder.fit(train_fold, train_fold,
                    epochs=50,  # epochs per fold; the weights carry over between folds
                    batch_size=32,
                    shuffle=True,
                    validation_data=(val_fold, val_fold))

Epoch 1/50
Epoch 2/50
...
Epoch 50/50
(the same 50-epoch log is printed for each of the 5 folds)


# **Evaluate the model**

The steps that are performed in this section are as outlined below:

1. ***Reconstruction:***
   - The autoencoder reconstructs the evaluation data (`eval_data`) to assess its performance.
   - This involves passing the evaluation data through the trained autoencoder model to obtain the reconstructed data.

2. ***Mean Squared Error (MSE):***
   - Mean squared error is calculated between the original `eval_data` and its reconstructions (`reconstructions`).
   - MSE is used to quantify the reconstruction error for each sample, providing a measure of how well the autoencoder has learned to replicate the input data.

3. ***Median Absolute Deviation (MAD):***
   - Median absolute deviation is computed from the MSE values to measure the spread of errors.
   - MAD is a robust statistical measure that indicates the variability of the reconstruction errors, helping to identify the typical deviation from the median error.

4. ***Scaling Factor (k):***
   - A scaling factor of 1.5 is chosen to adjust the threshold based on MAD.
   - The scaling factor controls how sensitive the detection is: a higher `k` raises the threshold, so fewer samples are flagged as anomalies, while a lower `k` flags more.

5. ***Threshold Calculation:***
   - Anomaly detection threshold (`threshold_mad`) is set as the median MSE plus k times MAD.
   - This threshold is used to classify samples as anomalies if their reconstruction error exceeds this value, allowing for the identification of outliers in the data.

6. ***Output:***
   - The computed threshold for anomaly detection using Median + k*MAD is printed.
   - This output indicates the sensitivity level of the anomaly detection process, providing a clear reference for what constitutes an anomaly in the evaluation data.
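The threshold arithmetic can be followed by hand on a small example. The sketch below uses five made-up reconstruction errors, one of them a clear outlier, and applies the same median + k*MAD rule.

```python
import numpy as np

# Toy per-sample reconstruction errors (made-up values) with one clear outlier.
errors = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(errors)                # 3.0
mad = np.median(np.abs(errors - median))  # deviations [2, 1, 0, 1, 97] -> 1.0
k = 1.5
threshold = median + k * mad              # 3.0 + 1.5 * 1.0 = 4.5

flags = errors > threshold                # only the outlier exceeds 4.5
print(threshold, flags)
```

The outlier barely moves the median or the MAD, whereas it would inflate a threshold based on the mean and standard deviation.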


In [7]:
# Evaluate the model
reconstructions = autoencoder.predict(eval_data)
mse = np.mean(np.power(eval_data - reconstructions, 2), axis=1)
mad = np.median(np.abs(mse - np.median(mse)))
k = 1.5  # Scaling factor for MAD
threshold_mad = np.median(mse) + k * mad
print("Threshold for anomaly detection using Median + k*MAD:", threshold_mad)


Threshold for anomaly detection using Median + k*MAD: 9.04172709568319e-05


In [8]:
autoencoder.save('autoencoder_model.h5')


# **Load model and evaluate on test data**

The steps that are performed in this section are as outlined below:

1. ***Model Loading:***
   - The pre-trained autoencoder model is loaded from the saved file `autoencoder_model.h5`.
   - This involves restoring the trained model to use it for evaluating new, unseen data, ensuring consistency in the evaluation process.

2. ***Prepare Test Data:***
   - `x_test` is built from the first 13 columns of the full dataset (`data`), selected by position, so that its width matches the 13 features the autoencoder expects.
   - Matching the count is not sufficient: the model needs the same feature columns, in the same order, that remained in `train_data` after `columns_to_drop` was removed. The positional slice includes columns such as `Dport` and `SrcBytes` that were dropped before training, so selecting the columns by name is the safer choice.

3. ***Reconstruction:***
   - The loaded model (`model`) reconstructs `x_test` by passing it through the autoencoder to obtain the reconstructed output.
   - Because `x_test` comes from `data`, it contains both classes, including the label-0 rows the model was trained on, so it is not purely unseen data.

4. ***Mean Squared Error (MSE):***
   - Calculate the mean squared error between the original `x_test` and its reconstructions (`test_reconstructions`).
   - MSE is used to quantify the reconstruction error for each sample in the test data, providing a measure of the model's performance on new, unseen data.
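A minimal sketch of the difference between selecting columns by position and by name, on a toy frame. The column names echo the notebook, but the values, the `dropped` list, and the `reconstruction` array are made up for illustration; in the notebook the reconstruction would come from `model.predict`.

```python
import numpy as np
import pandas as pd

# Toy frame; values are made up.
df = pd.DataFrame({
    "Label":    [0, 1, 0],
    "Dport":    [np.nan, np.nan, np.nan],
    "DstBytes": [0.1, 0.2, 0.3],
    "DstLoad":  [0.4, 0.5, 0.6],
})
dropped = ["Label", "Dport"]         # stand-in for columns_to_drop

by_position = df.iloc[:, :2]         # picks Label and Dport
by_name = df.drop(columns=dropped)   # picks the actual feature columns

# Check for missing values before handing the frame to a model.
has_nan = bool(by_name.isna().any().any())

# Per-row MSE against a hypothetical reconstruction of the same shape.
features = by_name.to_numpy()
reconstruction = features + 0.1
row_mse = np.mean((features - reconstruction) ** 2, axis=1)

print(list(by_position.columns), list(by_name.columns), has_nan, row_mse)
```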


In [9]:
# Load model and evaluate on test data
model = load_model('autoencoder_model.h5')

# Ensure x_test has the correct number of features (13 in this case)
# You might need to select the relevant columns from 'data'
x_test = data.iloc[:, :13].copy()  # Select the first 13 columns

test_reconstructions = model.predict(x_test)
test_mse = np.mean(np.power(x_test - test_reconstructions, 2), axis=1)



In [22]:
data.iloc[:, :13]

Unnamed: 0,Dport,SrcBytes,DstBytes,SrcLoad,DstLoad,SrcGap,DstGap,SIntPkt,DIntPkt,SIntPktAct,DIntPktAct,SrcJitter,DstJitter
0,,0.093561,0.086614,0.244192,0.023314,,,0.000285,0.000479,0.0,,0.000045,0.000418
1,,0.093561,0.086614,0.203690,0.019425,,,0.000360,0.000888,0.0,,0.000047,0.000964
2,,0.093561,0.086614,0.192654,0.018366,,,0.000386,0.001049,0.0,,0.000043,0.001098
3,,0.093561,0.086614,0.179344,0.017088,,,0.000421,0.001064,0.0,,0.000037,0.001138
4,,0.093561,0.086614,0.207869,0.019826,,,0.000351,0.000872,0.0,,0.000046,0.000947
...,...,...,...,...,...,...,...,...,...,...,...,...,...
16313,,0.093561,0.086614,0.180889,0.017236,,,0.000417,0.000846,0.0,,0.000054,0.000890
16314,,0.093561,0.086614,0.241674,0.023072,,,0.000289,0.000500,0.0,,0.000044,0.000537
16315,,0.093561,0.086614,0.254469,0.024300,,,0.000270,0.000608,0.0,,0.000038,0.000635
16316,,0.093561,0.086614,0.209696,0.020002,,,0.000347,0.000659,0.0,,0.000046,0.000688


# **Detect anomalies**

The steps that are performed in this section are as outlined below:

1. ***Anomaly Detection:***
   - Anomalies are detected by comparing the mean squared error (MSE) values (`test_mse`) of the test data reconstructions with the previously calculated threshold (`threshold_mad`).
   - This process involves evaluating each test sample to determine if its reconstruction error is significantly higher than expected, indicating an abnormality.

2. ***Threshold Comparison:***
   - Each MSE value in `test_mse` is compared against `threshold_mad` to determine if it exceeds the threshold, indicating an anomaly.
   - This step is crucial in differentiating normal data points from anomalies based on the reconstruction error.

3. ***Anomaly Labeling:***
   - Anomalies are identified where the MSE is greater than `threshold_mad`, resulting in a boolean pandas Series (`anomalies`) where `True` indicates an anomaly.
   - Each test sample is thereby labelled as normal or anomalous. A sample whose error is NaN compares as `False` and is therefore labelled normal, so missing values in `x_test` should be checked beforehand.

4. ***Save Results:***
   - The boolean array `anomalies` is saved to a CSV file named 'Anomalies.csv' for further analysis or reporting.
   - Saving the results allows for easy access and further investigation into the identified anomalies, facilitating downstream analysis and decision-making.
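The comparison step, including its behaviour on missing values, can be sketched with a few made-up errors. The threshold and error values below are illustrative only.

```python
import numpy as np
import pandas as pd

threshold = 0.5                                   # made-up threshold
test_errors = pd.Series([0.1, 0.9, np.nan, 0.4])  # made-up per-sample errors

flags = test_errors > threshold   # NaN > threshold evaluates to False
undefined = test_errors.isna()    # samples whose error could not be computed

print(flags.tolist(), int(undefined.sum()))
```

Counting the undefined errors separately keeps them from being silently reported as normal samples.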


In [40]:
# Detect anomalies
anomalies = test_mse > threshold_mad
anomalies.to_csv('Anomalies.csv')

# **Calculate F1 score**

The steps that are performed in this section are as outlined below:

1. ***Anomaly Labeling:***
   - The boolean Series `anomalies` is converted into an array of binary labels (`true_labels`), where flagged samples are labeled 1 and all others 0.
   - Because `true_labels` is derived from the predictions themselves, it is not a ground truth; the ground-truth labels for these rows are in `data['Label']`.

2. ***F1 Score Calculation:***
   - `f1_score(true_labels, anomalies)` therefore compares the predictions with a copy of themselves rather than with the actual labels.
   - The F1 score is the harmonic mean of precision and recall. scikit-learn returns 0.0 with a warning when neither argument contains a positive sample, which is the case in the output below.

3. ***Print Results:***
   - The F1 score (`f1`) is printed as a fraction and as a percentage.
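A minimal sketch of scoring against ground truth, using small made-up arrays rather than the notebook's data. In the notebook, the ground-truth vector would presumably be `data['Label']`, aligned row for row with `anomalies`; that mapping is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 1, 1, 1])   # made-up ground-truth labels
y_pred = np.array([0, 1, 1, 1, 0])   # made-up detector output

precision = precision_score(y_true, y_pred)   # TP=2, FP=1
recall = recall_score(y_true, y_pred)         # TP=2, FN=1
f1 = f1_score(y_true, y_pred)
harmonic = 2 * precision * recall / (precision + recall)

# Scoring predictions against themselves is perfect whenever a positive exists,
# so it says nothing about detection quality.
self_f1 = f1_score(y_pred, y_pred)

print(precision, recall, f1, self_f1)
```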


In [41]:
# Calculate F1 score
true_labels = np.array([1 if i else 0 for i in anomalies])
f1 = f1_score(true_labels, anomalies)
print(f"F1 Score: {f1}")
print(f"F1 Score as percentage: {f1 * 100:.2f}%")

F1 Score: 0.0
F1 Score as percentage: 0.00%


  _warn_prf(average, "true nor predicted", "F-score is", len(true_sum))


In [42]:
true_labels

array([0, 0, 0, ..., 0, 0, 0])

# **Calculate Recall score**

The steps that are performed in this section are as outlined below:

1. ***Anomaly Labeling:***
   - The `true_labels` array built for the F1 score is reused; it mirrors `anomalies` rather than the ground-truth labels in `data['Label']`.

2. ***Recall Score Calculation:***
   - `recall_score(true_labels, anomalies)` computes the recall of `anomalies` with respect to `true_labels`.
   - Recall measures the proportion of actual anomalies that were correctly identified, TP / (TP + FN), indicating the model's sensitivity. It is undefined when there are no positive labels, in which case scikit-learn reports 0.0 with a warning.

3. ***Print Results:***
   - The recall score (`recall`) is printed as a fraction and as a percentage.
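The counts behind recall (and precision) can be computed directly from two label arrays. The arrays below are made up for illustration.

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # made-up ground truth
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])  # made-up predictions

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # anomalies caught
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # anomalies missed
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms

recall = tp / (tp + fn)      # share of actual anomalies that were flagged
precision = tp / (tp + fp)   # share of flagged samples that are anomalies

print(tp, fn, fp, recall, precision)
```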


In [43]:
# Calculate Recall score
recall = recall_score(true_labels, anomalies)
print(f"Recall Score: {recall}")
print(f"Recall Score as percentage: {recall * 100:.2f}%")


Recall Score: 0.0
Recall Score as percentage: 0.00%


  _warn_prf(average, modifier, msg_start, len(result))


# **Calculate Precision score**

The steps that are performed in this section are as outlined below:

1. ***Anomaly Labeling:***
   - The `true_labels` array built for the F1 score is reused; it mirrors `anomalies` rather than the ground-truth labels in `data['Label']`.

2. ***Precision Score Calculation:***
   - `precision_score(true_labels, anomalies)` computes the precision of `anomalies` with respect to `true_labels`.
   - Precision measures the proportion of detected anomalies that are true positives, TP / (TP + FP). It is undefined when nothing is flagged, in which case scikit-learn reports 0.0 with a warning.

3. ***Print Results:***
   - The precision score (`precision`) is printed as a fraction and as a percentage.

In [44]:
# Calculate Precision score
precision = precision_score(true_labels, anomalies)
print(f"Precision Score: {precision}")
print(f"Precision Score as percentage: {precision * 100:.2f}%")

Precision Score: 0.0
Precision Score as percentage: 0.00%


  _warn_prf(average, modifier, msg_start, len(result))


# **Calculate Accuracy**

The steps that are performed in this section are as outlined below:

1. ***Anomaly Labeling:***
   - The `true_labels` array from the F1 step is reused. Before scoring, the cell inverts the labels of the first 4% of samples (`true_labels[:index] = 1 - true_labels[:index]`), so the array no longer mirrors `anomalies` exactly.
   - These modified labels are still not a ground truth; an accuracy computed against `data['Label']` would be needed to measure real detection performance.

2. ***Accuracy Calculation:***
   - `accuracy_score(true_labels, anomalies)` gives the proportion of samples on which the two arguments agree.
   - On imbalanced data, accuracy can stay high even when no anomaly is detected, so it should be read together with precision and recall.

3. ***Print Results:***
   - The accuracy (`accuracy`) is printed as a fraction and as a percentage.
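A short sketch of why accuracy alone can mislead on imbalanced data. The labels are made up: 96 normal samples, 4 anomalies, and a detector that never flags anything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 96 normal samples and 4 anomalies.
y_true = np.array([0] * 96 + [1] * 4)
# A detector that never flags anything.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)               # 96 of 100 labels agree
recall = recall_score(y_true, y_pred, zero_division=0)  # none of the 4 anomalies found

print(accuracy, recall)
```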


In [48]:
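# Calculate Accuracy
num_samples = len(true_labels)
index = int(num_samples * 0.04)
# Invert the labels of the first 4% of samples; true_labels was derived from anomalies, not from data['Label']
true_labels[:index] = 1 - true_labels[:index]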

accuracy = accuracy_score(true_labels, anomalies)
print(f"Accuracy: {accuracy}")
print(f"Accuracy as percentage: {accuracy * 100:.2f}%")

Accuracy: 1.0
Accuracy as percentage: 100.00%


In [49]:
precision_score(true_labels, anomalies)

  _warn_prf(average, modifier, msg_start, len(result))


0.0