# **Post-Hoc Concept Bottleneck Models Replication**

This notebook assumes that you have already followed the main README instructions up to and including activating the environment. If not, you can view the README [here](../README.md).

The instructions in this notebook parallel the later steps of the README, so you can follow either one for guidance.

Before starting, all the necessary files need to be prepared. When run, this notebook will set up all the necessary installations in the environment. We first need to move out of the `\notebook` directory via the code block below, which should automatically set the working directory depending on whether this notebook is being run locally or on Google Colab.

In [1]:
import os

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import userdata
    url = f"https://{userdata.get('gh_pat')}@github.com/dgcnz/FACT.git" # Note: may need to be scrubbed
    !git clone -b "greg" --single-branch {url} # Change this back
    !pip install git+https://github.com/openai/CLIP.git # for the CLIP library
    print("\nCurrent Directory:")
    %cd FACT

else: # automatically checks if the current directory is 'FACT'
    # basename handles the platform's path separator instead of assuming '\\'
    curdir = os.path.basename(os.getcwd())
    if curdir != "FACT":
        print("Current Directory:")
        %cd ..

Current Directory:
c:\Users\Gregory Go\.github\FACT


You should now be in the **FACT** main folder. This is important for running the files to ensure that they save/search in the correct locations!
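
If you want to double-check the working directory before continuing, a minimal sketch along the following lines may help. `missing_entries` is a hypothetical helper (not part of the repository), and the expected names are simply taken from paths used by later cells in this notebook.

```python
import os

def missing_entries(root, expected):
    """Return the names from `expected` that do not exist under `root`."""
    return [name for name in expected
            if not os.path.exists(os.path.join(root, name))]

# Names referenced by later cells (assumed to sit in the repository root).
EXPECTED = ["scripts", "train_pcbm.py", "learn_concepts_dataset.py"]

missing = missing_entries(os.getcwd(), EXPECTED)
if missing:
    print("This may not be the FACT root. Missing:", missing)
else:
    print("Working directory looks correct.")
```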

# **Downloading the Datasets**

## Broden Concepts Dataset

_Note_: A permission error may arise when downloading the files via this notebook. If it does, download them manually.

In [None]:
# Get the BRODEN concepts dataset
!bash ./scripts/download_broden

## COCO-Stuff Dataset

*Note:* The dataset is around 20 GB in total. Ensure you have enough space on your device before attempting to download.
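
As a rough pre-flight check, the free space on the current drive can be queried with the standard library. This is only a sketch: `has_free_space` is a hypothetical helper, and the 20 GB threshold is just the approximate figure quoted above, without any headroom for extraction.

```python
import shutil

def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb

# 20 GB is the approximate dataset size mentioned in the note above.
if not has_free_space(".", 20):
    print("Less than 20 GB free; the COCO-Stuff download may not fit.")
```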

In [None]:
# Get the COCO-stuff dataset (bash is needed to run the command below)
!bash ./scripts/download_cocostuff

## CUB Dataset

In [None]:
# Get the CUB dataset (bash is needed here to run the command below)
!bash ./scripts/download_cub

## Derm7pt Dataset

_Note:_ Because of how the original Derm7pt authors have set up the download, you must register on their site to access the dataset, so some manual processing is required. Please perform the following steps:

1. Go to the Derm7pt site [here](https://derm.cs.sfu.ca/Download.html).
2. Fill in the form with the necessary details.
3. The email received should contain the download link alongside the needed login credentials below it. Click the link and then fill in the details in the prompt given, which should automatically trigger the download afterwards.
4. Extract the .zip file and rename the folder extracted to `derm7pt`.
5. Move this folder to `./FACT/artifacts/data`.
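
Steps 4 and 5 can also be scripted. The sketch below is an assumption-laden illustration, not part of the repository: `install_derm7pt` is a hypothetical helper, and it assumes the downloaded archive unpacks into exactly one top-level folder (it raises an error otherwise).

```python
import os
import shutil
import tempfile
import zipfile

def install_derm7pt(zip_path, data_dir="artifacts/data"):
    """Extract `zip_path` and place its single top-level folder at <data_dir>/derm7pt."""
    target = os.path.join(data_dir, "derm7pt")
    if os.path.exists(target):
        raise FileExistsError(f"{target} already exists")
    os.makedirs(data_dir, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        entries = os.listdir(tmp)
        # Assumption: the archive unpacks into exactly one folder.
        if len(entries) != 1 or not os.path.isdir(os.path.join(tmp, entries[0])):
            raise ValueError(f"Expected one top-level folder, found: {entries}")
        # Moving onto a non-existent path renames the folder to 'derm7pt'.
        shutil.move(os.path.join(tmp, entries[0]), target)
    return target
```

Run from the **FACT** folder, a call such as `install_derm7pt("path/to/downloaded.zip")` (with your own path to the archive) would then leave the data in `artifacts/data/derm7pt`.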

_Note:_ If desired, for Google Colab you can upload the dataset to Google Drive and copy it to the current session using the following command:

In [None]:
# set the location variable to the specific directory in your Google Drive
location = "path/to/your/directory/in/drive"

# construct the source and destination paths
source_path = f"/content/drive/MyDrive/{location}"
destination_path = "/content/FACT/artifacts/data"

# copy to the destination path
!cp -r "{source_path}" "{destination_path}"

## HAM10000 Dataset

_Note:_ The HAM10000 dataset is available as a public Kaggle dataset. To download it through this script, make sure you have a Kaggle API token ready and place it (a `.json` file) in the following directory: `C:\Users\<your username>\.kaggle`.

To create a Kaggle API token, please do the following steps:

1. Go to your [account settings](https://www.kaggle.com/account). You will need to create a Kaggle account if you do not have one already.
2. Click on your profile icon > "Settings" > Scroll down to "API" > click "Create New Token"
3. This will download a file named `kaggle.json`. Again remember to move it to the scripts folder in the **FACT** directory.

If on Google Colab, upload your generated API token (`kaggle.json`) to any folder in your Google Drive and paste the path of that folder into the `folder_containing_api` variable below.
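
Before running the download, it can be worth confirming that the token is where the Kaggle client will look for it. The sketch below uses a hypothetical helper, `kaggle_token_ok`, and assumes the usual lookup order: the `KAGGLE_CONFIG_DIR` environment variable if set, otherwise the `.kaggle` folder in your home directory.

```python
import json
import os

def kaggle_token_ok(config_dir):
    """Return True if <config_dir>/kaggle.json exists and holds the expected fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as f:
            token = json.load(f)
    except (OSError, json.JSONDecodeError):
        return False
    # A Kaggle API token is a JSON object with a "username" and a "key".
    return isinstance(token, dict) and all(k in token for k in ("username", "key"))

default_dir = os.environ.get("KAGGLE_CONFIG_DIR", os.path.expanduser("~/.kaggle"))
print("Kaggle token found:", kaggle_token_ok(default_dir))
```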

Afterwards, just run the following codeblock:

In [None]:
# Get the HAM10K dataset (bash is needed here to run the command below)

folder_containing_api = "" # Add your folder here if on Colab

if IN_COLAB:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.environ['KAGGLE_CONFIG_DIR'] = f'/content/drive/MyDrive/{folder_containing_api}'

!bash ./scripts/download_ham

## SIIM-ISIC Dataset

_Note_: The original dataset is around 23 GB in total. The version downloaded by this script is a trimmed-down version which replicates what the original authors did.

In [None]:
# Get the SIIM-ISIC dataset
!python ./scripts/download_siim

## Metashift Dataset

_Note_: The original dataset is around __ GB in total. The version downloaded by this script is a trimmed-down version which replicates what the original authors did.

In [None]:
# Get the Metashift dataset
!bash ./scripts/download_metashift

...and those are all the datasets needed to reproduce the main results! If you'd like, you can also download the datasets and dependencies for the extension experiments (~7 GB in total).

## ESC-50 Dataset

In [None]:
# Get the ESC-50 Dataset
!bash ./scripts/download_esc

## UrbanSound8K Dataset

As with the HAM10000 dataset, you need a Kaggle API token. Follow the instructions [there](#ham10000-dataset) if you don't have one yet and would like guidance on creating it.

In [None]:
# Get the UrbanSound8K Dataset

folder_containing_api = "" # Add your folder here if on Colab

if IN_COLAB:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.environ['KAGGLE_CONFIG_DIR'] = f'/content/drive/MyDrive/{folder_containing_api}'

!bash ./scripts/download_us8k

## AudioCLIP Dependencies

**Please Note:** Due to how everything is set up, running the script below is necessary to run the experiments.

This downloader only includes the fully pretrained AudioCLIP model and the vocabulary, in case they are needed. The main AudioCLIP repository is not designed to be installed as a Python package: as of writing, it provides no `setup.py` or equivalent, so it cannot be installed directly.

Thus, a copy of it has been integrated here, with the assets separated to prevent bottlenecking this repo. You can find the citation to the original authors [here](../models/AudioCLIP/README.md) and their original repository [here](https://github.com/AndreyGuzhov/AudioCLIP).

In [None]:
# Get the AudioCLIP Dependencies
!bash ./scripts/download_audioclip

# **Training and Evaluating PCBMs**

Now that all the necessary files are prepared, we can begin replicating the results.

Do note, however, that some details needed for replication are missing, so the results may differ somewhat from those in the original paper (hardware differences between experiments also play a role).

# Learning Concepts

In total, there are three concept datasets needed for these experiments:
1. BRODEN
2. CUB
3. Derm7pt

Here we prepare each of these concepts for later use alongside the corresponding models, starting with the BRODEN ones.

_Note:_ If you are on Colab, make sure to install PyTorch Ignite and Visdom first.

In [None]:
# To install PyTorch Ignite and Visdom
if IN_COLAB:
    !pip install pytorch-ignite
    !pip install visdom

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="broden" \
  --backbone-name="clip:RN50" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

Then we move on to the CUB concepts,

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="cub" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

... and finally the Derm7pt concepts.

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="derm7pt" \
  --backbone-name="ham10000_inception" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir
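
After these runs, the learned concept banks should be in `artifacts/outdir`. The `--concept-bank` paths used later in this notebook (e.g. `broden_clip:RN50_10.0_50.pkl`) suggest a `<dataset>_<backbone>_<C>_<n-samples>.pkl` naming scheme, but that is inferred rather than confirmed. A small sketch for listing what was actually written, using a hypothetical helper `list_concept_banks`:

```python
import glob
import os

def list_concept_banks(out_dir="artifacts/outdir"):
    """Return the sorted .pkl files found directly inside `out_dir`."""
    return sorted(glob.glob(os.path.join(out_dir, "*.pkl")))

for path in list_concept_banks():
    print(path)
```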

## Learning Multimodal Concepts

One part of the original paper involves learning concepts automatically using CLIP embeddings. The authors already implemented this in `learn_concepts_multimodal.py` (we made some adjustments for readability and added extra concept banks for the extension studies).

Because of this, we can run the following snippets directly for each dataset, starting with CIFAR10/100 below.

_Note:_ Make sure to change the device to match your hardware (by default it assumes _cuda_).
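
If you are unsure whether a GPU is available, the device string can be chosen programmatically. This is a minimal sketch with a hypothetical helper `pick_device`; it assumes the scripts accept `"cpu"` as a value for `--device`, which is not confirmed here.

```python
def pick_device():
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is no way to query CUDA, so fall back to CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
print(f"--device={device}")
```

In a notebook cell, the value can then be interpolated into a shell command as `--device={device}`.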

In [None]:
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cifar10" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

In [None]:
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cifar100" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

Now we can learn the concepts for COCO-Stuff

In [None]:
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cub" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

Below are the concept learner snippets for the extension experiments, which you can run if desired.

In [None]:
# Note: 'audio' here refers to AudioCLIP
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="esc50" \
  --backbone-name="audio" \
  --device="cuda" \
  --recurse=1

In [None]:
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="us8k" \
  --backbone-name="audio" \
  --device="cuda" \
  --recurse=1

# **Training PCBMs**

_Note:_ For the COCO-Stuff experiments, please set the `out-dir` to `artifacts/outdir/coco-stuff`; otherwise the output folder becomes much harder to navigate.

## Main Experiments

In [None]:
!python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clip:RN50_10.0_50.pkl" \
  --dataset="cifar10" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir \
  --lam=2e-4

In [13]:
!python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="coco_stuff" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir/coco-stuff \
  --targets 3 6 31 35 36 37 40 41 43 46 47 50 53 64 75 76 78 80 85 89 \
  --lam=2e-4

Bank path: artifacts/outdir/broden_clipRN50_10.0_50.pkl. 170 concepts will be used.
Concept Bank is initialized.
Training 20 Model Instances for COCO-Stuff datasets. This may take a while...

{1: 'car', 0: 'other'}
Evaluating COCO-Stuff Binary Classification for Class 'car'
...
Class : other
	 1 - doorframe: 8.018
	 2 - fence: 7.010
	 3 - ground: 6.852
	 4 - pedestal: 6.120
	 5 - manhole: 5.942
Class : tennis racket
	 1 - doorframe: 8.018
	 2 - fence: 7.010
	 3 - ground: 6.852
	 4 - pedestal: 6.120
	 5 - manhole: 5.942
Model saved to : artifacts/outdir/coco-stuff\pcbm_coco_stuff__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42_target_43.ckpt
{'train_acc': 73.8, 'test_acc': 48.8, 'cls_acc': {'train': {0: 0.624, 1: 0.852}, 'test': {0: 0.376, 1: 0.6}}, 'test_auc': 0.504192, 'train_auc': 0.8284}
{0: 'other', 1: 'wine glass'}
Evaluating COCO-Stuff Binary Classification for Class 'wine glass'
...
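
Each COCO-Stuff target prints a metrics dictionary like the one above. If you copy those lines out of the cell output, they can be parsed and averaged with the standard library. This is only a sketch: `parse_metrics` and `mean_metric` are hypothetical helpers, and the example line is the one printed above for target 43.

```python
import ast
from statistics import mean

def parse_metrics(line):
    """Parse one printed metrics line (a Python dict literal) into a dict."""
    metrics = ast.literal_eval(line.strip())
    if not isinstance(metrics, dict):
        raise ValueError("line is not a dict literal")
    return metrics

def mean_metric(lines, key):
    """Average one top-level metric (e.g. "test_auc") over several metrics lines."""
    return mean(parse_metrics(line)[key] for line in lines)

# Line copied from the output above.
line = ("{'train_acc': 73.8, 'test_acc': 48.8, 'cls_acc': {'train': {0: 0.624, "
        "1: 0.852}, 'test': {0: 0.376, 1: 0.6}}, 'test_auc': 0.504192, "
        "'train_auc': 0.8284}")
metrics = parse_metrics(line)
print(metrics["test_acc"], metrics["test_auc"])
```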

## Extension Experiments

In [None]:
!python train_pcbm.py \
  --concept-bank="artifacts/multimodal/mmc_audio_esc50_recurse_1.pkl" \
  --dataset="esc50" \
  --backbone-name="audio" \
  --out-dir=artifacts/outdir \
  --lam=2e-4

In [None]:
!python train_pcbm.py \
  --concept-bank="artifacts/multimodal/mmc_audio_us8k_recurse_1.pkl" \
  --dataset="us8k" \
  --backbone-name="audio" \
  --out-dir=artifacts/outdir \
  --lam=2e-4

# **Training PCBM-h's**

## Main Experiments

In [None]:
!python train_pcbm_h.py \
  --concept-bank="artifacts/outdir/broden_clip:RN50_10.0_50.pkl" \
  --pcbm-path="artifacts/outdir/pcbm_cifar10__clipRN50__broden_clipRN50_10__lam:0.0002__alpha:0.99__seed:42.ckpt" \
  --out-dir=artifacts/outdir \
  --dataset="cifar10" \
  --num-workers=4

## Extension Experiments

In [None]:
!python train_pcbm_h.py \
  --concept-bank="artifacts/multimodal/mmc_audio_esc50_recurse_1.pkl" \
  --pcbm-path="artifacts/outdir/" \
  --out-dir=artifacts/outdir \
  --dataset="esc50" \
  --num-workers=4

In [None]:
!python train_pcbm_h.py \
  --concept-bank="artifacts/multimodal/mmc_audio_us8k_recurse_1.pkl" \
  --pcbm-path="artifacts/outdir/" \
  --out-dir=artifacts/outdir \
  --dataset="us8k" \
  --num-workers=4
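
`--pcbm-path` expects the checkpoint written by `train_pcbm.py`. Because the exact file name depends on the run (and on the platform, judging by the two spellings that appear in this notebook), one option is to look it up rather than type it. The sketch below uses a hypothetical helper `find_checkpoints`; the `pcbm_<dataset>__*.ckpt` pattern is inferred from the checkpoint names shown in this notebook and may need adjusting.

```python
import glob
import os

def find_checkpoints(dataset, out_dir="artifacts/outdir"):
    """Return sorted .ckpt files in `out_dir` whose name starts with pcbm_<dataset>__."""
    pattern = os.path.join(out_dir, f"pcbm_{dataset}__*.ckpt")
    return sorted(glob.glob(pattern))

print(find_checkpoints("esc50"))
```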