#Q1
Anomaly detection identifies data points or patterns that do not conform to expected behavior. Its purpose is to flag outliers that deviate from the norm, which may represent errors, fraud, or other unexpected events.

In [1]:
#1
from sklearn.ensemble import IsolationForest
import numpy as np

# Create a sample dataset with anomalies
np.random.seed(42)
normal_data = np.random.normal(loc=0, scale=1, size=(1000, 2))
anomalies = np.random.normal(loc=5, scale=1, size=(50, 2))
data = np.vstack([normal_data, anomalies])

# Fit an Isolation Forest model
model = IsolationForest(contamination=0.05)  # 5% contamination (expected proportion of anomalies)
model.fit(data)

# Predict anomalies (1 for normal, -1 for anomalies)
predictions = model.predict(data)

# Print the predicted labels
print(predictions)

[ 1  1  1 ... -1 -1 -1]
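
Because this dataset is synthetic, the true anomalies are known (the last 50 rows), so the predictions can be checked against them. The following is a minimal sketch under stated assumptions: it fixes `random_state`, which the cell above does not do, and the names `true_labels`, `n_flagged`, and `n_true_found` are introduced here for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Rebuild the same kind of dataset: 1000 normal points, 50 shifted anomalies
rng = np.random.RandomState(42)
normal_data = rng.normal(loc=0, scale=1, size=(1000, 2))
anomalies = rng.normal(loc=5, scale=1, size=(50, 2))
data = np.vstack([normal_data, anomalies])

# Ground truth is available only because the data is synthetic (1 normal, -1 anomaly)
true_labels = np.concatenate([np.ones(1000, dtype=int), -np.ones(50, dtype=int)])

# random_state is an added assumption so that repeated runs give the same labels
model = IsolationForest(contamination=0.05, random_state=42)
predictions = model.fit_predict(data)

# Count flagged points and how many of them are planted anomalies
n_flagged = int(np.sum(predictions == -1))
n_true_found = int(np.sum((predictions == -1) & (true_labels == -1)))
print(f"flagged: {n_flagged}, planted anomalies among them: {n_true_found}")
```

On real data no such ground truth exists, so this kind of check is only possible on synthetic or hand-labeled samples.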


#Q2
Key challenges in anomaly detection include dealing with imbalanced datasets, adapting to evolving patterns, and selecting appropriate features. Here's an example using an imbalanced dataset in which anomalies make up 10% of the points:

In [2]:
#2
from sklearn.ensemble import IsolationForest
import numpy as np

# Create an imbalanced dataset
np.random.seed(42)
normal_data = np.random.normal(loc=0, scale=1, size=(900, 2))
anomalies = np.random.normal(loc=5, scale=1, size=(100, 2))
data = np.vstack([normal_data, anomalies])

# Fit an Isolation Forest model
model = IsolationForest(contamination=0.1)  # 10% contamination (expected proportion of anomalies)
model.fit(data)

# Predict anomalies (1 for normal, -1 for anomalies)
predictions = model.predict(data)

# Print the predicted labels
print(predictions)

[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1 -1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1
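
One reason imbalance is a challenge is that plain accuracy becomes misleading. The sketch below assumes the same 900/100 split as the cell above and compares accuracy with anomaly recall for a hypothetical baseline (`always_normal`, a name introduced here) that never flags anything.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Same class balance as the cell above: 900 normal (1) and 100 anomalies (-1)
true_labels = np.concatenate([np.ones(900, dtype=int), -np.ones(100, dtype=int)])

# Hypothetical baseline that labels every point as normal
always_normal = np.ones(1000, dtype=int)

# Accuracy looks good, but no anomaly is ever found
accuracy = accuracy_score(true_labels, always_normal)
anomaly_recall = recall_score(true_labels, always_normal, pos_label=-1)

print(f"accuracy: {accuracy:.2f}")              # prints "accuracy: 0.90"
print(f"anomaly recall: {anomaly_recall:.2f}")  # prints "anomaly recall: 0.00"
```

Metrics that focus on the anomaly class, such as recall or precision with `pos_label=-1`, are usually a better fit for this setting.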

#Q3
Unsupervised anomaly detection doesn't require labeled data, while supervised anomaly detection relies on labeled examples of normal and anomalous points. Example using Isolation Forest for unsupervised detection:

In [3]:
#3
from sklearn.ensemble import IsolationForest
import numpy as np

# Create a dataset (no labels)
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit an Isolation Forest model
model = IsolationForest(contamination=0.05)  # 5% contamination (expected proportion of anomalies)
model.fit(data)

# Predict anomalies (1 for normal, -1 for anomalies)
predictions = model.predict(data)

# Print the predicted labels
print(predictions)

[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1 -1  1  1
  1  1 -1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
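
For contrast, a supervised setup trains a classifier on labeled examples. The following is only an illustrative sketch: the synthetic labels, the choice of `LogisticRegression`, and the `class_weight="balanced"` setting are assumptions made here, not part of the original answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic labeled data: 950 normal points (label 0) and 50 anomalies (label 1)
rng = np.random.RandomState(0)
normal_data = rng.normal(loc=0, scale=1, size=(950, 2))
anomalies = rng.normal(loc=5, scale=1, size=(50, 2))
X = np.vstack([normal_data, anomalies])
y = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

# Stratify so that both splits keep the same anomaly proportion
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# class_weight="balanced" reweights the rare anomaly class during training
clf = LogisticRegression(class_weight="balanced")
clf.fit(X_train, y_train)

# Recall on the anomaly class (label 1) for held-out data
test_recall = recall_score(y_test, clf.predict(X_test))
print(f"anomaly recall on test split: {test_recall:.2f}")
```

The supervised route needs the labels `y`, which the unsupervised Isolation Forest cell never uses.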

#Q4
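Main categories include statistical methods, machine learning-based methods, and proximity-based methods (distance- and density-based). Example using a machine learning-based method (Isolation Forest, a tree ensemble that isolates points with random splits rather than measuring distances between them):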

In [4]:
#4
from sklearn.ensemble import IsolationForest
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit an Isolation Forest model
model = IsolationForest(contamination=0.05)  # 5% contamination (expected proportion of anomalies)
model.fit(data)

# Predict anomalies (1 for normal, -1 for anomalies)
predictions = model.predict(data)

# Print the predicted labels
print(predictions)

[ 1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1 -1  1  1  1  1  1 -1  1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1 -1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1
  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
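
For the statistical category, a simple z-score rule can serve as an illustration. This is a sketch under assumptions made here: roughly Gaussian features, a threshold of 3 standard deviations, and one planted outlier at (8, 8) added to make the effect visible.

```python
import numpy as np

# 1000 normal points plus one planted outlier at (8, 8)
rng = np.random.RandomState(0)
data = rng.normal(loc=0, scale=1, size=(1000, 2))
data = np.vstack([data, [[8.0, 8.0]]])

# Standardize each feature, then flag points that exceed 3 standard deviations
# on at least one feature
z_scores = (data - data.mean(axis=0)) / data.std(axis=0)
is_anomaly = np.any(np.abs(z_scores) > 3, axis=1)

print(f"flagged points: {int(is_anomaly.sum())}")
print(f"planted outlier flagged: {bool(is_anomaly[-1])}")
```

The threshold of 3 is a common convention rather than a fixed rule, and the approach degrades when the data is far from Gaussian.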

#Q5
Distance-based anomaly detection methods assume that anomalies are far from normal instances. Example using K-Nearest Neighbors (KNN):

In [5]:
#5
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit a KNN model
knn = NearestNeighbors(n_neighbors=5)
knn.fit(data)

# Calculate distances and indices of neighbors
distances, indices = knn.kneighbors(data)

# Print distances to the 5 nearest neighbors of each point (column 0 is the point itself, hence 0)
print(distances)

[[0.         0.05866551 0.06227732 0.06818919 0.07515834]
 [0.         0.1647693  0.17702975 0.21068721 0.21174465]
 [0.         0.01413078 0.02859833 0.03819789 0.09033897]
 ...
 [0.         0.01994344 0.10868608 0.14387276 0.15376172]
 [0.         0.03694488 0.0499535  0.06750045 0.07681045]
 [0.         0.04356882 0.0817513  0.0971224  0.09716311]]
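
The distances above can be turned into an anomaly score. One common choice, sketched below, is the mean distance to the k nearest neighbors, excluding the point itself. The planted outlier at (8, 8) and the names `knn_scores` and `most_anomalous` are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 1000 normal points plus one planted outlier at index 1000
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(loc=0, scale=1, size=(1000, 2)), [[8.0, 8.0]]])

# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
k = 5
knn = NearestNeighbors(n_neighbors=k + 1).fit(data)
distances, _ = knn.kneighbors(data)

# Drop column 0 (the self-match) and average the remaining k distances
knn_scores = distances[:, 1:].mean(axis=1)

# A larger score means the point sits farther from its neighbors
most_anomalous = int(np.argmax(knn_scores))
print(f"most anomalous index: {most_anomalous}")
```

Using the distance to the k-th neighbor alone instead of the mean is an equally valid variant.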


#Q6
The Local Outlier Factor (LOF) algorithm computes anomaly scores based on the local density deviation of a data point compared to its neighbors. Example using LOF:

In [6]:
#6
from sklearn.neighbors import LocalOutlierFactor
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit a LOF model
lof = LocalOutlierFactor(contamination=0.05)  # 5% contamination (expected proportion of anomalies)
labels = lof.fit_predict(data)  # fit_predict returns labels (1 for normal, -1 for anomalies), not scores

# Print predicted labels
print(labels)

[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1 -1 -1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1
  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
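
The output above contains labels only. The underlying LOF values are exposed by scikit-learn through the fitted attribute `negative_outlier_factor_`. The sketch below negates it so that larger values mean more anomalous; the explicit `n_neighbors=20` (the library default) and the name `lof_scores` are choices made here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
data = rng.normal(loc=0, scale=1, size=(1000, 2))

# n_neighbors controls the size of the neighborhood used to estimate local density
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(data)

# negative_outlier_factor_ stores -LOF; flip the sign so that values
# near 1 are inliers and clearly larger values are outliers
lof_scores = -lof.negative_outlier_factor_

# Show the five highest LOF scores and the labels of those points
top = np.argsort(lof_scores)[-5:]
print(lof_scores[top])
print(labels[top])
```

The `contamination` parameter only sets the threshold on these scores; the scores themselves do not depend on it.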

#Q7
Key parameters of the Isolation Forest algorithm include the number of trees (`n_estimators`) and the expected proportion of anomalies (`contamination`). Example using Isolation Forest:

In [7]:
#7
from sklearn.ensemble import IsolationForest
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit an Isolation Forest model with 50 trees and 10% contamination
model = IsolationForest(n_estimators=50, contamination=0.1)
predictions = model.fit_predict(data)

# Print predicted labels
print(predictions)

[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1 -1  1  1  1  1
  1  1 -1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1
  1 -1  1  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1 -1  1  1  1  1  1  1  1  1
  1  1 -1  1 -1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1  1  1
  1 -1  1  1 -1  1  1  1  1 -1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1
  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1  1
  1  1  1 -1  1  1  1  1  1 -1  1  1  1  1  1  1  1  1 -1  1  1  1 -1  1
  1 -1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1 -1  1  1  1
  1 -1  1  1  1  1 -1  1  1  1  1  1  1  1  1  1  1
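
To see what `contamination` does, the sketch below refits the model with a few values and records the fraction of points flagged. The specific values tried and the fixed `random_state` are assumptions made here.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
data = rng.normal(loc=0, scale=1, size=(1000, 2))

# Fit one model per contamination value and record the fraction flagged as -1
flagged_fraction = {}
for contamination in (0.01, 0.05, 0.1):
    model = IsolationForest(
        n_estimators=50, contamination=contamination, random_state=0
    )
    predictions = model.fit_predict(data)
    flagged_fraction[contamination] = float(np.mean(predictions == -1))

for contamination, fraction in flagged_fraction.items():
    print(f"contamination={contamination}: flagged fraction={fraction:.3f}")
```

On the training data the flagged fraction tracks `contamination` closely, because the parameter sets the score threshold rather than changing how the trees are built.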

In [8]:
#Q8
from sklearn.neighbors import NearestNeighbors
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(1000, 2))

# Fit a KNN model with K=10
knn = NearestNeighbors(n_neighbors=10)
knn.fit(data)

# Calculate distances and indices of neighbors
distances, indices = knn.kneighbors(data)

# Count how many of the 10 nearest neighbors lie within a radius of 0.5 (a lower count means a more isolated point)
anomaly_scores = np.sum(distances < 0.5, axis=1) - 2  # distances includes the point itself (distance 0), so a point with no other neighbor in range gets 1 - 2 = -1
print(anomaly_scores)

[ 8  8  8  8  8  8  8  8  8  8  8  8  3  8  8  8  8  8  1  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  1  8  8  8  8  8  8  8  8  6  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8 -1  8
  8  8  8  8  8  8  8  8  8  8  8  2  8  8  8  8  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  3  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  1  8  8  8  8  8  8  8  8  7  8  8  8
  8  8  8  8  8  0  5  0  8  8  7  8  8  8  8  8  8  8  8  8  8  8  7  8
  8  8  8  4  8  0  8  8  8  8  8  8  8  8  8  8  8  8  8  8  0  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  0  8  8  8  0
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  7  8  8  8  8
  8  8  8  8  8  8  8  5  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8
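
The count above is capped by K because only the 10 nearest neighbors are inspected. An alternative sketch, assuming the same radius of 0.5, uses `radius_neighbors` to count every neighbor inside the radius. The planted outlier at (8, 8) and the name `neighbor_counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# 1000 normal points plus one planted outlier at index 1000
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(loc=0, scale=1, size=(1000, 2)), [[8.0, 8.0]]])

# radius_neighbors returns, for each query point, the indices of all points within the radius
nn = NearestNeighbors(radius=0.5).fit(data)
neighbor_indices = nn.radius_neighbors(data, return_distance=False)

# Subtract 1 because each queried point is returned as its own neighbor
neighbor_counts = np.array([len(idx) - 1 for idx in neighbor_indices])

print(f"neighbors of planted outlier: {neighbor_counts[-1]}")
print(f"median neighbor count: {np.median(neighbor_counts)}")
```

A low count relative to the rest of the data suggests an isolated point; where to draw the line depends on the radius and the data density.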

In [11]:
#Q9
from sklearn.ensemble import IsolationForest
import numpy as np

# Create a sample dataset
data = np.random.normal(loc=0, scale=1, size=(3000, 2))

# Fit an Isolation Forest model with 100 trees
model = IsolationForest(n_estimators=100)
model.fit(data)

# decision_function returns a shifted score (lower means more anomalous), not the average path length
decision_scores = model.decision_function(data)
anomaly_score = 2 ** (-decision_scores / np.mean(decision_scores))  # Heuristic rescaling by the mean decision score

# Print the rescaled score for one sample data point
sample_data_point_index = 2
print(f"Anomaly score for data point {sample_data_point_index}: {anomaly_score[sample_data_point_index]}")


Anomaly score for data point 2: 0.6315866598955324
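
For reference, the Isolation Forest score is usually written as s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average path length of x and c(n) = 2H(n-1) - 2(n-1)/n normalizes by the expected path length for n points. The sketch below evaluates this for an average path length of 5.0. The helper names are hypothetical, H(i) is approximated by ln(i) + 0.5772156649, and the second value of n (256) reflects scikit-learn's default `max_samples="auto"`, which subsamples min(256, n_samples) points per tree.

```python
import math
import numpy as np
from sklearn.ensemble import IsolationForest

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant used to approximate H(i)

def average_path_length_c(n):
    """Normalizing term c(n): expected path length of an unsuccessful search among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def paper_anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); values near 1 indicate anomalies."""
    return 2.0 ** (-mean_path_length / average_path_length_c(n))

# Average path length of 5.0, normalized by the full dataset size and by the subsample size
score_full = paper_anomaly_score(5.0, 3000)
score_subsample = paper_anomaly_score(5.0, 256)
print(f"n=3000: {score_full:.4f}")
print(f"n=256:  {score_subsample:.4f}")

# scikit-learn's score_samples returns the opposite of this score, so negating it
# gives values in the same orientation (higher means more anomalous)
rng = np.random.RandomState(0)
data = rng.normal(loc=0, scale=1, size=(3000, 2))
model = IsolationForest(n_estimators=100, random_state=0).fit(data)
sklearn_scores = -model.score_samples(data)
print(f"score range on data: {sklearn_scores.min():.3f} to {sklearn_scores.max():.3f}")
```

Scores well below 0.5 indicate normal points, while scores approaching 1 indicate points that are isolated after very few splits.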
