# Anomaly Detection Benchmark Data

In this notebook, we will demonstrate how to download anomaly detection benchmark datasets. These datasets are used for evaluating and testing various **anomaly detection algorithms**.

We will explore two methods to obtain the datasets:

1. Using the `adbench` library to download data.
2. Cloning the repository directly.


## Method 1: Using the `adbench` Library

1. **Check installation**: Make sure the `adbench` library is installed. If it is not, see [anomaly detection environment setup](00-setup.ipynb) for detailed installation instructions.
2. **Download Datasets Using adbench**: Use the following Python script to download the datasets:

In [7]:
import logging
from adbench.myutils import Utils

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize the utility class
utils = Utils()

# Attempt to download datasets with error handling
try:
    # Download datasets from the specified repository
    utils.download_datasets(repo='jihulab')                          # Use 'jihulab' if you're in China
    logger.info("Datasets downloaded successfully using adbench.")
except Exception as e:
    if "already exists" in str(e).lower():
        logger.info("Datasets are already downloaded.")
    else:
        # Log unexpected exceptions
        logger.error(f"An unexpected error occurred: {e}")

if there is any question while downloading datasets, we suggest you to download it from the website:
https://github.com/Minqi824/ADBench/tree/main/adbench/datasets
If you are in mainland China, please use this link:
https://jihulab.com/BraudoCC/ADBench_datasets/
Downloading datasets from jihulab...


100%|██████████| 3/3 [00:00<00:00, 805.77it/s]
INFO:__main__:Datasets downloaded successfully using adbench.


CIFAR10_0.npz already exists. Skipping download...
CIFAR10_1.npz already exists. Skipping download...
CIFAR10_2.npz already exists. Skipping download...
CIFAR10_3.npz already exists. Skipping download...
CIFAR10_4.npz already exists. Skipping download...
CIFAR10_5.npz already exists. Skipping download...
CIFAR10_6.npz already exists. Skipping download...
CIFAR10_7.npz already exists. Skipping download...
CIFAR10_8.npz already exists. Skipping download...
CIFAR10_9.npz already exists. Skipping download...
FashionMNIST_0.npz already exists. Skipping download...
FashionMNIST_1.npz already exists. Skipping download...
FashionMNIST_2.npz already exists. Skipping download...
FashionMNIST_3.npz already exists. Skipping download...
FashionMNIST_4.npz already exists. Skipping download...
FashionMNIST_5.npz already exists. Skipping download...
FashionMNIST_6.npz already exists. Skipping download...
FashionMNIST_7.npz already exists. Skipping download...
FashionMNIST_8.npz already exists. Skippin

After running the previous script, I had to read the source code of the `download_datasets()` method to find the directory where the data was stored.

I also inspected the installed package with the following steps:

In [8]:
# 1. import datasets from adbench
from adbench import datasets

# 2. Check attributes of this module
print(dir(datasets))



In [35]:
# 3. Check the __path__ attribute
print(datasets.__path__)

# 4. List the content of directory
import os
print(os.listdir(datasets.__path__[0]))

# Path to the datasets directory
datasets_path = datasets.__path__[0]

['/opt/homebrew/Caskroom/mambaforge/base/envs/anom-detect-env/lib/python3.9/site-packages/adbench/datasets']
['CV_by_ResNet18', '__init__.py', '__pycache__', 'Classical', 'data_generator.py', 'NLP_by_BERT']


In [36]:
# List the file names in each dataset directory
for item in os.listdir(datasets_path):
    # Skip dunder entries such as __init__.py and __pycache__
    if not str(item).startswith('__'): 
        item_path = os.path.join(datasets_path, item)
        # Check if the item is a directory
        if os.path.isdir(item_path):
            print(f"\n{item} directory:")
            
            for ind, sub_item in enumerate(os.listdir(item_path), start=1):
                sub_item_path = os.path.join(item_path, sub_item)
                if os.path.isfile(sub_item_path):
                    print(f" {ind} : {sub_item}")
                elif os.path.isdir(sub_item_path):
                    print(f"\t\t{item}: {sub_item}")


CV_by_ResNet18 directory:
 1 : MNIST-C_spatter.npz
 2 : MNIST-C_glass_blur.npz
 3 : MVTec-AD_pill.npz
 4 : MNIST-C_rotate.npz
 5 : MVTec-AD_screw.npz
 6 : SVHN_8.npz
 7 : SVHN_9.npz
 8 : MNIST-C_dotted_line.npz
 9 : MNIST-C_shear.npz
 10 : FashionMNIST_3.npz
 11 : MVTec-AD_toothbrush.npz
 12 : FashionMNIST_2.npz
 13 : FashionMNIST_0.npz
 14 : FashionMNIST_1.npz
 15 : FashionMNIST_5.npz
 16 : MNIST-C_identity.npz
 17 : MVTec-AD_leather.npz
 18 : FashionMNIST_4.npz
 19 : FashionMNIST_6.npz
 20 : CIFAR10_8.npz
 21 : CIFAR10_9.npz
 22 : MVTec-AD_metal_nut.npz
 23 : FashionMNIST_7.npz
 24 : MNIST-C_motion_blur.npz
 25 : MNIST-C_stripe.npz
 26 : MNIST-C_impulse_noise.npz
 27 : MNIST-C_translate.npz
 28 : CIFAR10_4.npz
 29 : CIFAR10_5.npz
 30 : FashionMNIST_9.npz
 31 : MNIST-C_brightness.npz
 32 : CIFAR10_7.npz
 33 : CIFAR10_6.npz
 34 : FashionMNIST_8.npz
 35 : CIFAR10_2.npz
 36 : MVTec-AD_tile.npz
 37 : CIFAR10_3.npz
 38 : MVTec-AD_carpet.npz
 39 : CIFAR10_1.npz
 40 : MVTec-AD_capsule.npz
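
The listing above is long and unordered, so a per-family count can be easier to scan. The following is a minimal sketch, not part of `adbench`: the helper name `count_by_family` is hypothetical, and it assumes the family name is the text before the first underscore, which fits the `CV_by_ResNet18` file names shown here but not the numbered files in `Classical`.

```python
from collections import Counter

def count_by_family(file_names):
    """Count .npz files per dataset family (the text before the first underscore)."""
    families = Counter()
    for name in file_names:
        if name.endswith(".npz"):
            families[name.split("_", 1)[0]] += 1
    return families

# A few names taken from the listing above, plus one non-dataset file
sample = ["CIFAR10_8.npz", "CIFAR10_9.npz", "MNIST-C_glass_blur.npz",
          "MVTec-AD_pill.npz", "__init__.py"]
for family, n in sorted(count_by_family(sample).items()):
    print(f"{family}: {n}")
```

In the notebook, the same helper could be called on `os.listdir(os.path.join(datasets_path, 'CV_by_ResNet18'))`.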
 

In [45]:
# Check a few datasets: donors, cardio, and fraud
import numpy as np
dataset_dir = datasets.__path__[0]
donors_path = os.path.join(dataset_dir, "Classical/11_donors.npz")
cardio_path = os.path.join(dataset_dir, "Classical/6_cardio.npz")
fraud_path = os.path.join(dataset_dir, "Classical/13_fraud.npz")

donors = np.load(donors_path, allow_pickle=True)
cardio = np.load(cardio_path, allow_pickle=True)
fraud = np.load(fraud_path, allow_pickle=True)

print("Donors dataset: ---------------------------------: ")
X, y = donors['X'], donors['y']
print(X.shape)
print(y.shape)

print("Cardio dataset: ---------------------------------: ")
X, y = cardio['X'], cardio['y']
print(X.shape)
print(y.shape)

print("Fraud dataset: ---------------------------------: ")
X, y = fraud['X'], fraud['y']
print(X.shape)
print(y.shape)


Donors dataset: ---------------------------------: 
(619326, 10)
(619326,)
Cardio dataset: ---------------------------------: 
(1831, 21)
(1831,)
Fraud dataset: ---------------------------------: 
(284807, 29)
(284807,)
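
Beyond the shapes, it is usually worth checking how imbalanced each dataset is. The sketch below is an illustration under one assumption: that `y` holds binary labels with `1` marking an anomaly. Confirm this for each file before relying on it. The helper name `label_summary` is hypothetical, and it uses toy labels so it runs without the downloaded files.

```python
from collections import Counter

def label_summary(y):
    """Return (n_samples, n_anomalies, anomaly_ratio), assuming label 1 marks an anomaly."""
    counts = Counter(int(label) for label in y)
    n_samples = sum(counts.values())
    n_anomalies = counts.get(1, 0)
    ratio = n_anomalies / n_samples if n_samples else 0.0
    return n_samples, n_anomalies, ratio

# Toy labels standing in for an array such as cardio['y']
toy_y = [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]
n, k, r = label_summary(toy_y)
print(f"{n} samples, {k} anomalies ({r:.1%})")
```

Because the function only iterates over the labels, it should also accept the NumPy arrays loaded above, for example `label_summary(cardio['y'])`.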


The current directory structure for saving datasets is not optimal for this project. I prefer a more organized and accessible location where specific datasets are easy to reach. The existing setup may work for some, but I intend to write a script that downloads the datasets to a directory of my choice, which gives more control over data management and accessibility.


In [None]:
import os
from adbench.myutils import Utils

# Read the project root from the ANOMALY_DETECTION_PATH environment variable.
# If the variable is not set, assign the path directly instead, for example:
# project_root = "Your/path/to/anomaly-detection-project"
project_root = os.getenv('ANOMALY_DETECTION_PATH')

if project_root is None:
    raise EnvironmentError("The ANOMALY_DETECTION_PATH environment variable is not set.")

print(f"Project Root: {project_root}")

# Define the dataset directory 
dataset_dir = os.path.join(project_root, 'datasets')
print(dataset_dir)

# Ensure the dataset directory exists before listing its contents
os.makedirs(dataset_dir, exist_ok=True)
print(os.listdir(dataset_dir))

# Initialize the utility class
utils = Utils()

# List of folders expected to contain datasets
expected_folders = ['CV_by_ResNet18', 'NLP_by_BERT', 'Classical']


# Define the function to check if all datasets are already downloaded
def all_datasets_exist(base_dir, folders):
    """
    Checks that every expected dataset folder exists and is not empty.

    :param base_dir: The base directory where datasets are stored.
    :param folders: A list of folder names expected to be present in the base directory.
    :return: True if every folder exists and contains at least one entry, False otherwise.
    """
    for folder in folders:
        folder_path = os.path.join(base_dir, folder)
        print(folder_path)
        # Check if folder exists
        if not os.path.exists(folder_path):
            print(f"Missing folder: {folder_path}")
            return False
        # Check if folder is not empty
        if not os.listdir(folder_path):
            print(f"Folder is empty: {folder_path}")
            return False
    return True

# Check if all datasets are already downloaded
if all_datasets_exist(dataset_dir, expected_folders):
    print("All datasets are already downloaded.")
else:
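    # Attempt to download datasets.
    # Note: as seen earlier, the files are stored inside the adbench package
    # directory, not in dataset_dir.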
    try:
        utils.download_datasets(repo='jihulab')                    # Use 'github' if not in China
        print("Datasets downloaded successfully using adbench.")
    except Exception as e:
        print(f"An unexpected error occurred during download: {e}")
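
The script above checks the project directory, but the download itself still lands in the package directory found earlier. One way to close that gap is to copy the dataset folders across after downloading. This is a minimal sketch using only the standard library; `copy_dataset_folders` is a hypothetical helper, not part of `adbench`, and it assumes the source folders are laid out as in the listing above.

```python
import shutil
from pathlib import Path

def copy_dataset_folders(src_root, dst_root, folders):
    """Copy each named folder from src_root into dst_root; return the folders copied."""
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for folder in folders:
        src = src_root / folder
        if not src.is_dir():
            print(f"Skipping missing folder: {src}")
            continue
        # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run safely
        shutil.copytree(src, dst_root / folder, dirs_exist_ok=True)
        copied.append(folder)
    return copied

# Example call, reusing names from the cells above (uncomment in the notebook):
# from adbench import datasets
# copy_dataset_folders(datasets.__path__[0], dataset_dir, expected_folders)
```

Copying duplicates the data on disk. If space is a concern, moving the folders or symlinking them would be alternatives, at the cost of `adbench` no longer finding the files in its default location.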

## Method 2: Cloning the `adbench` repository

Cloning the `adbench` repository is an alternative way to access the anomaly detection benchmark datasets. This approach is simpler and gives you all datasets and supplementary materials directly from the source, with full control over where the data is stored. Here’s how to clone the repository and access the datasets:

### Steps to Clone the Repository

1. **Open a Terminal**: Open a terminal on your machine. Make sure Git or the GitHub CLI (`gh`) is installed, since one of them is used to clone the repository.
2. **Navigate to the Desired Directory**: Change to the directory where you created the anomaly detection project, which is where the `adbench` repository will be cloned. Use the `cd` command:
```bash
cd /path/to/your/project/location
```

3. **Clone the Repository**: Use the following Git command to clone the `adbench` repository:
```bash
git clone https://github.com/Minqi824/ADBench.git
```

or, with the GitHub CLI:

```bash
gh repo clone Minqi824/ADBench
```

Either command downloads the entire repository, including all datasets, scripts, and documentation, to your local machine.

4. **Navigate to the Cloned Repository**: Once the cloning process is complete, navigate into the cloned repository:

```bash
cd ADBench
```

5. **Access the Datasets**: The datasets are located in the `adbench/datasets` directory within the cloned repository. List the contents of this directory to verify the datasets:

```bash
cd adbench/datasets
ls -la
```

This will display all the available datasets organized in their respective directories.
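
The same check can be done from Python, which is convenient when continuing in a notebook. The sketch below assumes the clone sits at `ADBench/adbench/datasets` relative to the working directory and that the datasets are `.npz` files grouped in subfolders, as in Method 1. `list_npz_by_folder` is a hypothetical helper.

```python
from pathlib import Path

def list_npz_by_folder(datasets_root):
    """Map each subfolder of datasets_root to the sorted names of its .npz files."""
    root = Path(datasets_root)
    listing = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        names = sorted(f.name for f in folder.glob("*.npz"))
        if names:
            listing[folder.name] = names
    return listing

# Assumed location of the clone; adjust to wherever you ran `git clone`
clone_datasets = Path("ADBench") / "adbench" / "datasets"
if clone_datasets.is_dir():
    for folder, names in list_npz_by_folder(clone_datasets).items():
        print(f"{folder}: {len(names)} files")
```

Folders without any `.npz` files, such as `__pycache__`, are left out of the result.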