**Kaggle Dataset Link:** https://www.kaggle.com/datasets/zhonglifr/thyroid-disease-unsupervised-anomaly-detection

# Multivariate Anomaly Detection (Machine Learning Methods)

### Using the UCI Thyroid Disease Data Set

This project demonstrates a comprehensive approach to multivariate anomaly detection using Python within a Jupyter Notebook. It guides users through loading, preprocessing, training, and evaluating models on the UCI Thyroid Disease dataset.

---

## Overview

Multivariate anomaly detection is crucial in fields like healthcare, finance, and fraud detection. This tutorial covers three prominent machine learning techniques for detecting anomalies in multivariate data:

- **Isolation Forest**
- **One-Class SVM**
- **Kernel Density Estimation (KDE)**

# Step 1: Import Libraries

In [17]:
# import pandas for reading the data
import pandas as pd

# sklearn for standardising and normalizing numerical data
from sklearn.preprocessing import StandardScaler

# importing ensemble.IsolationForest, neighbors.KernelDensity,svm.OneClassSVM
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import KernelDensity
from sklearn.svm import OneClassSVM

# sklearn to import classification_report
from sklearn.metrics import classification_report

# numpy to import quantile
from numpy import quantile

# Step 2: Read and Explore Dataset

In [18]:
CSV_PATH = "annthyroid_unsupervised_anomaly_detection.csv"

df = pd.read_csv(CSV_PATH, sep=";")
df.head()

Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH,T3_measured,TT4_measured,T4U_measured,FTI_measured,Outlier_label,Unnamed: 22,Unnamed: 23
0,0.45,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.0,6.0,23.0,87.0,26.0,o,,
1,0.61,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.0,15.0,61.0,96.0,64.0,o,,
2,0.16,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,29.0,19.0,58.0,103.0,56.0,o,,
3,0.85,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,114.0,3.0,24.0,61.0,39.0,o,,
4,0.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,49.0,3.0,5.0,116.0,4.0,o,,


In [19]:
df.columns

Index(['Age', 'Sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery',
       'I131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium',
       'goitre', 'tumor', 'hypopituitary', 'psych', 'TSH', 'T3_measured',
       'TT4_measured', 'T4U_measured', 'FTI_measured', 'Outlier_label ',
       'Unnamed: 22', 'Unnamed: 23'],
      dtype='object')

# Step 3: Data Proprocessing

In [20]:
# Removing unwanted columns and make the columns names in similar format
df.drop(["Unnamed: 22", "Unnamed: 23"], axis=1, inplace=True)

In [21]:
df.head(2)

Unnamed: 0,Age,Sex,on_thyroxine,query_on_thyroxine,on_antithyroid_medication,sick,pregnant,thyroid_surgery,I131_treatment,query_hypothyroid,query_hyperthyroid,lithium,goitre,tumor,hypopituitary,psych,TSH,T3_measured,TT4_measured,T4U_measured,FTI_measured,Outlier_label
0,0.45,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,61.0,6.0,23.0,87.0,26.0,o
1,0.61,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.0,15.0,61.0,96.0,64.0,o


In [22]:
# Removing duplicate rows
df.drop_duplicates(inplace=True)

In [23]:
# Checking for missing values
df.isna().sum()

Unnamed: 0,0
Age,0
Sex,0
on_thyroxine,0
query_on_thyroxine,0
on_antithyroid_medication,0
sick,0
pregnant,0
thyroid_surgery,0
I131_treatment,0
query_hypothyroid,0


In [24]:
# convert all column names to lowercase
df.columns = [item.strip().lower() for item in df.columns]

In [25]:
df.columns

Index(['age', 'sex', 'on_thyroxine', 'query_on_thyroxine',
       'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery',
       'i131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium',
       'goitre', 'tumor', 'hypopituitary', 'psych', 'tsh', 't3_measured',
       'tt4_measured', 't4u_measured', 'fti_measured', 'outlier_label'],
      dtype='object')

In [26]:
# Create feature and target dataframes
target_name = 'outlier_label'
features_name = [col for col in df.columns if col != target_name]

# Create feature and target dataframes
target = df[target_name].copy()
features = df[features_name].copy()

In [28]:
target = target.map(lambda label: -1 if label == "o" else 1)

In [29]:
target

Unnamed: 0,outlier_label
0,-1
1,-1
2,-1
3,-1
4,-1
...,...
6911,1
6912,1
6913,1
6914,1


In [None]:
# Check impurity as target column is already given

# If target column is not given, we assume impurity and adjust the impurity as hyperparameter to get the optimum results.

# impurity is used in IsolationForest as contamination, OneClassSVM as nu and quantile threshold in Kernel Density.

In [30]:
target.value_counts()

Unnamed: 0_level_0,count
outlier_label,Unnamed: 1_level_1
1,6595
-1,250


In [31]:
impurity = (250/(250 + 6595))

### On reading the database information on Kaggle and analysing the dataset, we form list of categorical and numerical columns

* **Categorical Columns:** 'sex', 'on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery', 'i131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium', 'goitre', 'tumor', 'hypopituitary', 'psych'
<br><br>
* **Numerical Columns:** 'age', 'tsh', 't3_measured', 'tt4_measured', 't4u_measured', 'fti_measured'

In [32]:
# create categorical and numerical columns
cat_cols = ['sex', 'on_thyroxine', 'query_on_thyroxine', 'on_antithyroid_medication', 'sick', 'pregnant', 'thyroid_surgery', 'i131_treatment', 'query_hypothyroid', 'query_hyperthyroid', 'lithium', 'goitre', 'tumor', 'hypopituitary', 'psych']
num_cols = ['age', 'tsh', 't3_measured', 'tt4_measured', 't4u_measured', 'fti_measured']

In [33]:
# scale & normalize the numerical columnss
scaler = StandardScaler()
features[num_cols] = scaler.fit_transform(features[num_cols])

In [36]:
# check data types of columns
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6845 entries, 0 to 6915
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   age                        6845 non-null   float64
 1   sex                        6845 non-null   float64
 2   on_thyroxine               6845 non-null   float64
 3   query_on_thyroxine         6845 non-null   float64
 4   on_antithyroid_medication  6845 non-null   float64
 5   sick                       6845 non-null   float64
 6   pregnant                   6845 non-null   float64
 7   thyroid_surgery            6845 non-null   float64
 8   i131_treatment             6845 non-null   float64
 9   query_hypothyroid          6845 non-null   float64
 10  query_hyperthyroid         6845 non-null   float64
 11  lithium                    6845 non-null   float64
 12  goitre                     6845 non-null   float64
 13  tumor                      6845 non-null   float64
 1

In [37]:
# convert column datatypes
for col in cat_cols:
  features[col] = features[col].astype("category")

In [38]:
features.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6845 entries, 0 to 6915
Data columns (total 21 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   age                        6845 non-null   float64 
 1   sex                        6845 non-null   category
 2   on_thyroxine               6845 non-null   category
 3   query_on_thyroxine         6845 non-null   category
 4   on_antithyroid_medication  6845 non-null   category
 5   sick                       6845 non-null   category
 6   pregnant                   6845 non-null   category
 7   thyroid_surgery            6845 non-null   category
 8   i131_treatment             6845 non-null   category
 9   query_hypothyroid          6845 non-null   category
 10  query_hyperthyroid         6845 non-null   category
 11  lithium                    6845 non-null   category
 12  goitre                     6845 non-null   category
 13  tumor                      6845 non-nu

# Step 4: Build Machine Learning Model

In [42]:
# Isolation Forest
IF = IsolationForest(n_estimators=300, contamination=impurity)
IF.fit(features)
IF_pred = IF.predict(features)

# OneClassSVM Forest
OC = OneClassSVM(nu=impurity)
OC.fit(features)
OC_pred = OC.predict(features)

# Kernel Density
KD = KernelDensity()
KD.fit(features)
scores = KD.score_samples(features)
threshold = quantile(scores, impurity)
KD_pred = [-1 if score < threshold else 1 for score in scores]

In [43]:
print(f"For Isolation Forest:\n{classification_report(IF_pred, target)}\n\n")
print(f"For OneClassSVM Forest:\n{classification_report(OC_pred, target)}\n\n")
print(f"For Kernel Density:\n{classification_report(KD_pred, target)}\n\n")

For Isolation Forest:
              precision    recall  f1-score   support

          -1       0.06      0.06      0.06       250
           1       0.96      0.96      0.96      6595

    accuracy                           0.93      6845
   macro avg       0.51      0.51      0.51      6845
weighted avg       0.93      0.93      0.93      6845



For OneClassSVM Forest:
              precision    recall  f1-score   support

          -1       0.17      0.17      0.17       252
           1       0.97      0.97      0.97      6593

    accuracy                           0.94      6845
   macro avg       0.57      0.57      0.57      6845
weighted avg       0.94      0.94      0.94      6845



For Kernel Density:
              precision    recall  f1-score   support

          -1       0.22      0.22      0.22       250
           1       0.97      0.97      0.97      6595

    accuracy                           0.94      6845
   macro avg       0.60      0.60      0.60      6845
weig

Kernel Density gives the best outlier estimates after evaluting using F1-scores
