<a href="https://colab.research.google.com/github/VinayNagamallaD9/SmartApplicationPerformanceMonitoring-Auto-Scaling/blob/main/Anomaly_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Title**: Develop a machine learning model for performance anomaly detection.

In [1]:
# Importing the Libraries
import pandas as pd
import numpy as np

Creating a dummy (synthetic) dataset

In [2]:
# Create a dummy dataset
data = {
    'timestamp': pd.to_datetime(['2023-01-01 00:00:00', '2023-01-01 00:01:00', '2023-01-01 00:02:00', '2023-01-01 00:03:00', '2023-01-01 00:04:00',
                               '2023-01-01 00:05:00', '2023-01-01 00:06:00', '2023-01-01 00:07:00', '2023-01-01 00:08:00', '2023-01-01 00:09:00']),
    'cpu_usage': [25.5, 26.1, 25.9, 27.0, 26.5, 80.1, 25.8, 26.3, 27.1, 26.7],
    'memory_usage': [40.2, 41.0, 40.5, 41.2, 40.8, 42.0, 41.5, 41.8, 42.5, 41.9],
    'network_latency': [50, 52, 51, 55, 53, 200, 52, 54, 56, 53],
    'error_rate': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0]
}
df = pd.DataFrame(data)

In [3]:
# Save the dummy dataset to a CSV file, then load it back (timestamps are read back as strings)
df.to_csv('performance_data.csv', index=False)
df = pd.read_csv('performance_data.csv')

Displaying summary statistics to understand the data.

In [4]:
# Display first 5 rows
print("First few rows of the DataFrame:")
display(df.head())

# DataFrame information
print("\nDataFrame Information:")
df.info()

# Summary statistics
print("\nSummary Statistics:")
display(df.describe())

# Missing values
print("\nMissing values per column:")
display(df.isnull().sum())

First few rows of the DataFrame:


Unnamed: 0,timestamp,cpu_usage,memory_usage,network_latency,error_rate
0,2023-01-01 00:00:00,25.5,40.2,50,0
1,2023-01-01 00:01:00,26.1,41.0,52,0
2,2023-01-01 00:02:00,25.9,40.5,51,0
3,2023-01-01 00:03:00,27.0,41.2,55,0
4,2023-01-01 00:04:00,26.5,40.8,53,0



DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   timestamp        10 non-null     object 
 1   cpu_usage        10 non-null     float64
 2   memory_usage     10 non-null     float64
 3   network_latency  10 non-null     int64  
 4   error_rate       10 non-null     int64  
dtypes: float64(2), int64(2), object(1)
memory usage: 532.0+ bytes

Summary Statistics:


Unnamed: 0,cpu_usage,memory_usage,network_latency,error_rate
count,10.0,10.0,10.0,10.0
mean,31.7,41.34,67.6,0.5
std,17.013916,0.727553,46.555105,1.581139
min,25.5,40.2,50.0,0.0
25%,25.95,40.85,52.0,0.0
50%,26.4,41.35,53.0,0.0
75%,26.925,41.875,54.75,0.0
max,80.1,42.5,200.0,5.0



Missing values per column:


column,missing_count
timestamp,0
cpu_usage,0
memory_usage,0
network_latency,0
error_rate,0
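
The summary statistics already hint at the outlier: the maximum of cpu_usage (80.1) sits far above the 75th percentile (26.925). As a rough sanity check, separate from the Isolation Forest approach used in this notebook, a z-score can be computed from the mean and the sample standard deviation. The threshold of 2 below is an assumption chosen only for illustration.

```python
import statistics

# cpu_usage values copied from the dummy dataset above
cpu_usage = [25.5, 26.1, 25.9, 27.0, 26.5, 80.1, 25.8, 26.3, 27.1, 26.7]

mean = statistics.mean(cpu_usage)
std = statistics.stdev(cpu_usage)  # sample standard deviation, same convention as df.describe()

# Assumed threshold, for illustration only
Z_THRESHOLD = 2.0

# Distance of each value from the mean, in units of standard deviation
z_scores = [(value - mean) / std for value in cpu_usage]
flagged = [i for i, z in enumerate(z_scores) if abs(z) > Z_THRESHOLD]

print("Row positions flagged by the z-score check:", flagged)
```

A single-column check like this ignores the other metrics and the time ordering, which is the gap the rolling features and the Isolation Forest are meant to cover.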


Convert the 'timestamp' column to datetime objects and extract time-based features. Then create rolling statistics (mean and standard deviation) for the performance metrics to capture trends and variability over time.

In [5]:
# Convert 'timestamp' to datetime objects
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Extract time-based features
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek

window_size = 3  # Defining window size as 3
for col in ['cpu_usage', 'memory_usage', 'network_latency', 'error_rate']:
    df[f'{col}_rolling_mean'] = df[col].rolling(window=window_size).mean()
    df[f'{col}_rolling_std'] = df[col].rolling(window=window_size).std()

# DataFrame with new features
print("DataFrame with new features:")
display(df.head())

DataFrame with new features:


Unnamed: 0,timestamp,cpu_usage,memory_usage,network_latency,error_rate,hour,day_of_week,cpu_usage_rolling_mean,cpu_usage_rolling_std,memory_usage_rolling_mean,memory_usage_rolling_std,network_latency_rolling_mean,network_latency_rolling_std,error_rate_rolling_mean,error_rate_rolling_std
0,2023-01-01 00:00:00,25.5,40.2,50,0,0,6,,,,,,,,
1,2023-01-01 00:01:00,26.1,41.0,52,0,0,6,,,,,,,,
2,2023-01-01 00:02:00,25.9,40.5,51,0,0,6,25.833333,0.305505,40.566667,0.404145,51.0,1.0,0.0,0.0
3,2023-01-01 00:03:00,27.0,41.2,55,0,0,6,26.333333,0.585947,40.9,0.360555,52.666667,2.081666,0.0,0.0
4,2023-01-01 00:04:00,26.5,40.8,53,0,0,6,26.466667,0.550757,40.833333,0.351188,53.0,2.0,0.0,0.0
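
The NaN values in rows 0 and 1 appear because a window of 3 needs three observations before it yields a result. The sketch below reproduces this on the first five cpu_usage values and shows `min_periods=1` as one possible alternative; that alternative is an assumption for illustration and is not what the notebook does.

```python
import pandas as pd

# First five cpu_usage values from the dummy dataset
cpu = pd.Series([25.5, 26.1, 25.9, 27.0, 26.5], name="cpu_usage")

# Default behaviour: the first window_size - 1 rows are NaN
strict_mean = cpu.rolling(window=3).mean()

# Alternative (assumption, not used in the notebook): allow partial windows
partial_mean = cpu.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"cpu_usage": cpu, "strict": strict_mean, "partial": partial_mean}))
```

With partial windows no rows need to be dropped before training, but the earliest means rest on fewer observations, and the rolling standard deviation of a single observation is still NaN.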


We use an Isolation Forest model to detect anomalies in the performance data. This model is effective for isolating outliers in high-dimensional datasets.

In [7]:
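# Importing IsolationForest
from sklearn.ensemble import IsolationForest

# contamination='auto' uses the library's default decision threshold; random_state makes the run reproducible
model = IsolationForest(contamination='auto', random_state=15)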

# **Training Model**

Train the Isolation Forest model on the DataFrame.

We train on every column except 'timestamp', after dropping the rows whose rolling features are NaN.

In [9]:
# Use every column except 'timestamp' as a feature
features = df.columns.tolist()
features.remove('timestamp')

# Dropping rows with null values (NaN), i.e. the first window_size - 1 rows
df_trained = df.dropna()
features_trained = [col for col in features if col in df_trained.columns]


X = df_trained[features_trained]

# Training Isolation Forest model
model.fit(X)

print("Model trained successfully.")

Model trained successfully.


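## Evaluate the model

Let's evaluate the trained model. The dummy dataset has no ground-truth anomaly labels, so metrics such as precision, recall, and F1-score cannot be computed directly; instead we inspect the anomaly scores and the predicted labels.


*   **Anomaly Score Prediction:** a lower score indicates a higher anomaly likelihood
*   **Predicted Label:** the model returns -1 for outliers and 1 for inliers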
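
The relationship between the score and the label can be checked directly. In scikit-learn, `decision_function` equals `score_samples` minus the fitted `offset_`, and `predict` returns -1 where that value is negative. A minimal sketch on hypothetical synthetic data (the array `X_demo` and its values are assumptions, not the notebook's data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical sample: 50 normal points plus one obvious outlier in the last row
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=26.0, scale=0.5, size=(50, 2)), [[80.0, 200.0]]])

demo_model = IsolationForest(contamination='auto', random_state=15).fit(X_demo)

raw_scores = demo_model.score_samples(X_demo)   # lower means more anomalous
scores = demo_model.decision_function(X_demo)   # raw score shifted by offset_
labels = demo_model.predict(X_demo)             # -1 for outliers, 1 for inliers

# decision_function is score_samples minus the fitted offset_
assert np.allclose(scores, raw_scores - demo_model.offset_)
# predict flags exactly the points whose decision_function is negative
assert np.array_equal(labels == -1, scores < 0)

print("offset_:", demo_model.offset_)
print("Last row label:", labels[-1], "score:", round(float(scores[-1]), 3))
```

Because the label is just a thresholded score, keeping the continuous `anomaly_score` alongside `is_anomaly` makes it possible to rank flagged rows by severity.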




In [12]:
# Work on an explicit copy so the column assignments below do not trigger SettingWithCopyWarning
df_trained = df_trained.copy()

# Anomaly score: lower (more negative) means more anomalous
df_trained['anomaly_score'] = model.decision_function(X)

# Predicted label: -1 for outliers, 1 for inliers
df_trained['is_anomaly'] = model.predict(X)

# Display the DataFrame with anomaly scores and predictions
print("DataFrame with anomaly scores and predictions:")
display(df_trained.head())

# Analyzing the results by filtering anomalies
anomalies = df_trained[df_trained['is_anomaly'] == -1]
print("Detected Anomalies:")
display(anomalies)

DataFrame with anomaly scores and predictions:


Unnamed: 0,timestamp,cpu_usage,memory_usage,network_latency,error_rate,hour,day_of_week,cpu_usage_rolling_mean,cpu_usage_rolling_std,memory_usage_rolling_mean,memory_usage_rolling_std,network_latency_rolling_mean,network_latency_rolling_std,error_rate_rolling_mean,error_rate_rolling_std,anomaly_score,is_anomaly
2,2023-01-01 00:02:00,25.9,40.5,51,0,0,6,25.833333,0.305505,40.566667,0.404145,51.0,1.0,0.0,0.0,-0.002932,-1
3,2023-01-01 00:03:00,27.0,41.2,55,0,0,6,26.333333,0.585947,40.9,0.360555,52.666667,2.081666,0.0,0.0,0.06615,1
4,2023-01-01 00:04:00,26.5,40.8,53,0,0,6,26.466667,0.550757,40.833333,0.351188,53.0,2.0,0.0,0.0,0.07231,1
5,2023-01-01 00:05:00,80.1,42.0,200,5,0,6,44.533333,30.802651,41.333333,0.61101,102.666667,84.299071,1.666667,2.886751,-0.138887,-1
6,2023-01-01 00:06:00,25.8,41.5,52,0,0,6,44.133333,31.150013,41.433333,0.602771,101.666667,85.160633,1.666667,2.886751,-0.019702,-1


Detected Anomalies:


Unnamed: 0,timestamp,cpu_usage,memory_usage,network_latency,error_rate,hour,day_of_week,cpu_usage_rolling_mean,cpu_usage_rolling_std,memory_usage_rolling_mean,memory_usage_rolling_std,network_latency_rolling_mean,network_latency_rolling_std,error_rate_rolling_mean,error_rate_rolling_std,anomaly_score,is_anomaly
2,2023-01-01 00:02:00,25.9,40.5,51,0,0,6,25.833333,0.305505,40.566667,0.404145,51.0,1.0,0.0,0.0,-0.002932,-1
5,2023-01-01 00:05:00,80.1,42.0,200,5,0,6,44.533333,30.802651,41.333333,0.61101,102.666667,84.299071,1.666667,2.886751,-0.138887,-1
6,2023-01-01 00:06:00,25.8,41.5,52,0,0,6,44.133333,31.150013,41.433333,0.602771,101.666667,85.160633,1.666667,2.886751,-0.019702,-1
7,2023-01-01 00:07:00,26.3,41.8,54,0,0,6,44.066667,31.206783,41.766667,0.251661,102.0,84.876381,1.666667,2.886751,-0.014522,-1
8,2023-01-01 00:08:00,27.1,42.5,56,0,0,6,26.4,0.655744,41.933333,0.51316,54.0,2.0,0.0,0.0,-0.006321,-1
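
Five of the eight scored rows are flagged, although only the 00:05:00 row contains spikes in the raw metrics. Rows 6 and 7 are likely flagged because their 3-sample rolling windows still include that spike. If one assumes, purely for illustration, that row 5 is the only true anomaly, precision, recall, and F1-score can be computed from the predictions shown above. The ground-truth labels in this sketch are that assumption and are not part of the dataset.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Predictions for rows 2..9, copied from the output above (-1 = outlier, 1 = inlier)
predicted = [-1, 1, 1, -1, -1, -1, -1, 1]

# ASSUMED ground truth: only row 5 (00:05:00) is a real anomaly
actual = [1, 1, 1, -1, 1, 1, 1, 1]

# Treat -1 (outlier) as the positive class
precision = precision_score(actual, predicted, pos_label=-1)
recall = recall_score(actual, predicted, pos_label=-1)
f1 = f1_score(actual, predicted, pos_label=-1)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Under that assumption the spike is caught (recall 1.0), but four normal rows are flagged as well (precision 0.2). With only eight rows these numbers are indicative at best.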


# **Deploy the model**

In [11]:
print("Detected anomalies (based on dummy dataset & Isolation Forest model):")
display(anomalies)

Detected anomalies (based on dummy dataset & Isolation Forest model):


Unnamed: 0,timestamp,cpu_usage,memory_usage,network_latency,error_rate,hour,day_of_week,cpu_usage_rolling_mean,cpu_usage_rolling_std,memory_usage_rolling_mean,memory_usage_rolling_std,network_latency_rolling_mean,network_latency_rolling_std,error_rate_rolling_mean,error_rate_rolling_std,anomaly_score,is_anomaly
2,2023-01-01 00:02:00,25.9,40.5,51,0,0,6,25.833333,0.305505,40.566667,0.404145,51.0,1.0,0.0,0.0,-0.002932,-1
5,2023-01-01 00:05:00,80.1,42.0,200,5,0,6,44.533333,30.802651,41.333333,0.61101,102.666667,84.299071,1.666667,2.886751,-0.138887,-1
6,2023-01-01 00:06:00,25.8,41.5,52,0,0,6,44.133333,31.150013,41.433333,0.602771,101.666667,85.160633,1.666667,2.886751,-0.019702,-1
7,2023-01-01 00:07:00,26.3,41.8,54,0,0,6,44.066667,31.206783,41.766667,0.251661,102.0,84.876381,1.666667,2.886751,-0.014522,-1
8,2023-01-01 00:08:00,27.1,42.5,56,0,0,6,26.4,0.655744,41.933333,0.51316,54.0,2.0,0.0,0.0,-0.006321,-1
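
Printing the anomalies does not yet make the model usable outside the notebook. One common approach, sketched here under assumptions, is to persist the fitted model with joblib and reload it in the process that scores new data. The file name `anomaly_model.joblib` and the small training array are hypothetical stand-ins for the notebook's `model` and `X`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical stand-in for the notebook's feature matrix X
rng = np.random.default_rng(0)
X_train = rng.normal(loc=26.0, scale=0.5, size=(50, 2))
fitted = IsolationForest(contamination='auto', random_state=15).fit(X_train)

# Persist the fitted model (the file name is an assumption)
model_path = os.path.join(tempfile.mkdtemp(), "anomaly_model.joblib")
joblib.dump(fitted, model_path)

# Later, e.g. in a monitoring service: reload the model and score new observations
loaded = joblib.load(model_path)
new_rows = np.array([[26.2, 25.9], [80.1, 200.0]])
print("Predictions for new rows:", loaded.predict(new_rows))
```

New observations must carry the same feature columns in the same order as the training data. For this notebook that includes the rolling mean and standard deviation, so the scoring process would need the most recent samples to rebuild each window.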
