# Experiment 2: 

# iForest ASD (Dynamic Sliding Window)


For iForest ASD, we use the NumPy and pandas libraries for data processing. To collect and count results we use defaultdict and Counter from the collections module. The model is IsolationForest from the scikit-learn ensemble module.

In [13]:
# Importing required libraries.
import numpy as np
import pandas as pd
from collections import defaultdict, Counter
from sklearn.ensemble import  IsolationForest

In [14]:
# Loading the labelled data file; missing values were filled with a KNN imputer using 3 neighbours.
df_knn3= pd.read_csv("682200107170331_fill_knn_3.csv")

# Feature Outlier Function

#### Identify the outlier for each feature for particular data point.

This function receives the rows whose anomaly label is 1, together with the list of features. For each feature it finds the indices of the outlying rows, collects them in a single dictionary keyed by feature name, and returns that dictionary.

In [15]:
# The function accepts two arguments: the data and the list of features.
def feature_outliers(data, features):
    # Create an empty dictionary called outliers.
    outliers = defaultdict(list)
    # Iterating through each feature.
    for feature in features:
        # Getting the median value for each feature.
        median = data[feature].median()
        # Using the interquartile range (IQR) to set the upper and lower bounds for outlier detection.
        iqr = data[feature].quantile(0.75) - data[feature].quantile(0.25)
        # Lower bound: median minus 1.5 * IQR.
        lower_bound = median - 1.5 * iqr
        # Upper bound: median plus 1.5 * IQR.
        upper_bound = median + 1.5 * iqr
        # Listing the index values of all rows outside the bounds for this feature.
        outlier_indices = data[(data[feature] < lower_bound) | (data[feature] > upper_bound)].index.tolist()
        # Adding the outlier index values to the outlier dictionary under this feature.
        for idx in outlier_indices:
            # Adding the index to the dictionary.
            outliers[feature].append(idx)
    # Returning the outlier dictionary.
    return outliers
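
To make the bounds concrete, here is a minimal sketch of the same rule applied to a single made-up column. The column name `ALT` and its values are hypothetical and are not taken from the dataset. Note that the bounds are centred on the median, which differs from the more common Tukey fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR).

```python
import pandas as pd

# Hypothetical toy data: one feature with a single extreme value.
toy = pd.DataFrame({"ALT": [10.0, 11.0, 12.0, 13.0, 14.0, 100.0]})

# Same rule as feature_outliers: median +/- 1.5 * IQR.
median = toy["ALT"].median()
iqr = toy["ALT"].quantile(0.75) - toy["ALT"].quantile(0.25)
lower_bound = median - 1.5 * iqr
upper_bound = median + 1.5 * iqr

# Index labels of the rows that fall outside the bounds.
outlier_idx = toy[(toy["ALT"] < lower_bound) | (toy["ALT"] > upper_bound)].index.tolist()

print(lower_bound, upper_bound, outlier_idx)
```

Only the row holding the value 100.0 falls outside the bounds in this toy case.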

# iForest Anomaly Detection

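This is the baseline model with the optimized values. Here, 80% of the data is used for training and 20% for testing, with a random state of 42. The function below does no splitting itself: it fits and scores whatever rows it receives, and flags the rows whose anomaly score falls below the percentile set by threshold_factor (default 0.09, i.e. the lowest 9% of scores).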

In [16]:
# Defining the function.
def apply_isolation_forest(df, threshold_factor=0.09):
    # Columns that are not used for anomaly detection.
    excluded_columns = ['FRMC', 'DATE_YEAR', 'DATE_MONTH', 'DATE_DAY']
    # Getting all the feature names except the excluded ones.
    features = [col for col in df.columns if col not in excluded_columns]
    # Defining the Isolation Forest model using 100 estimators and a contamination rate of 0.01.
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    # Fitting the model with the data.
    model.fit(df[features])
    
    # Calculating the Anomaly scores.
    anomaly_scores = model.decision_function(df[features])
    
    # Choosing the threshold as the (threshold_factor * 100)th percentile of the scores (the 9th percentile for the default 0.09).
    threshold = np.percentile(anomaly_scores, threshold_factor * 100)
    
    # Labelling the instances as normal (1) or anomalous (-1) based on the defined threshold
    labels = np.where(anomaly_scores < threshold, -1, 1)
    
    # Defining the label column and passing the labels.
    df['labels'] = labels
    # Mapping 1 for Anomaly and 0 for Normal
    df['labels'] = df['labels'].apply(lambda x: 1 if x == -1 else 0)
    # Storing the Anomaly Score in anomaly_scores column.
    df['anomaly_scores'] = anomaly_scores
    # Returning the dataframe containing the results with the features.
    return df
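
The percentile step can be checked in isolation. The sketch below uses synthetic scores drawn from a normal distribution as a stand-in for the output of `decision_function`; the scores and the seed are assumptions made for illustration only. It shows that `threshold_factor=0.09` flags roughly the lowest 9% of rows.

```python
import numpy as np

# Synthetic stand-in for decision_function output: lower means more anomalous.
rng = np.random.default_rng(42)
scores = rng.normal(loc=0.1, scale=0.05, size=1000)

threshold_factor = 0.09
# 0.09 * 100 = 9, so this is the 9th percentile of the scores.
threshold = np.percentile(scores, threshold_factor * 100)

# 1 for anomaly, 0 for normal, matching the final 'labels' column.
labels = np.where(scores < threshold, 1, 0)

print("flagged fraction:", labels.mean())
```

Because the labels come from a percentile of the scores rather than from `model.predict`, the `contamination` setting should not change which rows are flagged: in scikit-learn it only shifts `decision_function` by a constant offset, and a percentile cut is unaffected by such a shift.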

# Dynamic Sliding Window Using iForest 

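The dynamic sliding window function runs the isolation forest on consecutive windows of the data. The initial window size is 500, which is also the minimum window size; the maximum window size is 10000. Upper and lower thresholds on the standard deviation of the anomaly scores control the window size: when the standard deviation falls below the lower threshold the window grows by 500 rows, and when it rises above the upper threshold the window shrinks by 500 rows. Consecutive windows do not overlap; each new window starts where the previous one ended.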

In [17]:
# Initializing the function with the required parameters
def dynamic_window_isolation_forest(df, initial_window_size=500, 
                                    min_window_size=500, max_window_size=10000, 
                                    lower_sd_threshold=0.01, upper_sd_threshold=0.3, 
                                    threshold_factor=0.09):
    # Total number of rows, so the windows never run past the end of the data.
    n = len(df)
    # Initializing the current window size.
    current_window_size = initial_window_size
    # Dictionary mapping each FRMC value to the features flagged as anomalous for it.
    anomalies = defaultdict(list)
    
    # Getting the numeric feature names, excluding the columns that do not contribute to anomaly detection.
    features = [col for col in df.columns if col not in ['DATE_YEAR', 'DATE_MONTH', 'DATE_DAY', 'timestamp'] and df[col].dtype != 'object']

    # Starting the process of identifying anomalies.
    start = 0
    # Looping while more than min_window_size rows remain.
    while start + min_window_size < n:
        # Defining the end of the current window, capped at the end of the data.
        end = min(start + current_window_size, n)
        # Copying the window's rows so the original dataframe is not modified.
        data_window = df.iloc[start:end].copy()

        # Applying isolation forest on the current window.
        data_window = apply_isolation_forest(data_window, threshold_factor=threshold_factor)

        # Adjusting the window size based on anomaly score variability.
        # Computing the standard deviation of the anomaly scores.
        sd_anomaly_scores = np.std(data_window['anomaly_scores'])
        # If the scores vary little, increase the window size to cover more data.
        if sd_anomaly_scores < lower_sd_threshold and current_window_size < max_window_size:
            # Increasing the current window size.
            current_window_size += 500
        # If the standard deviation exceeds the upper threshold, more anomalies are expected in the window, so decrease the window size.
        elif sd_anomaly_scores > upper_sd_threshold and current_window_size > min_window_size:
            # Decreasing the window size.
            current_window_size -= 500
        # Clamping the window size to the range [min_window_size, max_window_size].
        current_window_size = max(min(current_window_size, max_window_size), min_window_size)

        # Identifying feature outliers for anomalies.
        if data_window[data_window['labels'] == 1].shape[0] > 0:
            # Passing all the data into feature outlier function to get the feature wise anomaly.
            feature_outlier_indices = feature_outliers(data_window[data_window['labels'] == 1], features)
            # Iterating over each row labelled as an anomaly.
            for index, row in data_window[data_window['labels'] == 1].iterrows():
                # Collecting the features for which this row's index is an outlier.
                anomaly_features = [feature for feature, indices in feature_outlier_indices.items() if index in indices]
                # Recording the row only if at least one feature was flagged.
                if anomaly_features:
                    frmc = row['FRMC']
                    # Adding the flagged features under this FRMC value.
                    anomalies[frmc].extend(anomaly_features)
        # Finished with this window; the next window starts where this one ended.
        start = end
    # Creating an empty list called detailed_anomalies.
    detailed_anomalies = []
    # Iterating over each FRMC value and its flagged features.
    for frmc, feature_names in anomalies.items():
        # Removing duplicate feature names.
        unique_features = set(feature_names)
        # Appending one entry per FRMC value with its flagged features.
        detailed_anomalies.append({'FRMC': frmc, 'features': list(unique_features)})
    # Returning the detailed anomalies and the final window size.
    return detailed_anomalies, current_window_size
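
The window-size rule inside the loop can be followed more easily when separated from the model. The sketch below is a minimal restatement of that rule. The helper name `next_window_size` and the sequence of standard deviations are hypothetical; only the thresholds, step and limits are taken from the function's defaults.

```python
def next_window_size(score_sd, current, step=500, min_size=500, max_size=10000,
                     lower_sd=0.01, upper_sd=0.3):
    """Return the window size to use after a window whose scores have SD score_sd."""
    if score_sd < lower_sd and current < max_size:
        # Scores barely vary: grow the window.
        current += step
    elif score_sd > upper_sd and current > min_size:
        # Scores vary a lot: shrink the window.
        current -= step
    # Clamp to the allowed range.
    return max(min(current, max_size), min_size)


# Hypothetical standard deviations observed over six consecutive windows.
observed_sds = [0.005, 0.005, 0.05, 0.4, 0.4, 0.4]

size = 500
trace = []
for sd in observed_sds:
    size = next_window_size(sd, size)
    trace.append(size)

print(trace)  # [1000, 1500, 1500, 1000, 500, 500]
```

The window grows while the scores are nearly constant, stays put when the standard deviation lies between the two thresholds, and shrinks back towards the minimum when the scores become highly variable.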


## Performing the Point-Wise Anomaly Detection Using iForest ASD (Dynamic Sliding Window)

In [18]:
detailed_anomalies,final_window_size = dynamic_window_isolation_forest(df_knn3)

### Printing The Final Window Size 

In [19]:
print("\nFinal Window Size:", final_window_size)


Final Window Size: 1000


### Printing The FRMC Wise Anomalies For Features

In [20]:
print("\nDetected Anomalies:", detailed_anomalies)


Detected Anomalies: [{'FRMC': 1997.0, 'features': ['BAL1']}, {'FRMC': 2001.0, 'features': ['FQTY_2', 'PSA', 'LONG']}, {'FRMC': 2005.0, 'features': ['LOC']}, {'FRMC': 2013.0, 'features': ['ALTR', 'PSA']}, {'FRMC': 2017.0, 'features': ['ALTR', 'VRTG', 'FRMC', 'RALT', 'AOA2', 'AOA1', 'LATG', 'OIP_4', 'OIPL', 'PS', 'DA']}, {'FRMC': 2029.0, 'features': ['ALTR']}, {'FRMC': 2053.0, 'features': ['ALTR', 'GLS']}, {'FRMC': 2061.0, 'features': ['LGUP']}, {'FRMC': 2065.0, 'features': ['ABRK']}, {'FRMC': 2077.0, 'features': ['LOC', 'FQTY_2', 'RALT', 'DWPT', 'N1T', 'LONG', 'GLS', 'ABRK', 'LATG', 'GMT_MINUTE', 'LATP']}, {'FRMC': 2081.0, 'features': ['FQTY_2', 'ALTR', 'VRTG']}, {'FRMC': 2209.0, 'features': ['GMT_SEC', 'EAI', 'HYDY', 'BPYR_1', 'MRK', 'ABRK']}, {'FRMC': 2221.0, 'features': ['GMT_SEC', 'LGUP', 'EAI', 'HYDY', 'BPYR_1']}, {'FRMC': 2233.0, 'features': ['GMT_SEC', 'LGUP', 'EAI', 'HYDY', 'BPYR_1', 'LONG']}, {'FRMC': 2249.0, 'features': ['GMT_SEC', 'EAI', 'VHF2', 'PUSH', 'BPYR_1', 'EVNT', 'VH