# **Post-Hoc Concept Bottleneck Models Replication**

This notebook contains essentially all the scripts needed to replicate the results we obtained when reproducing the following claims:
1. *PCBMs achieve comparable performance to the original model.*
2. *PCBMs do not require labelled concept datasets.*
3. *PCBMs allow for global model editing.*

In addition, the extension experiments evaluating the performance of PCBMs on audio data can also be performed here. The experiments related to the user study are present [in another notebook](user_study.ipynb).

This notebook assumes that the main README instructions have already been followed up to the point where the environment is activated. If not, you can view the README [here](../README.md). The README instructions after environment activation parallel those given here, so you can follow either for guidance. Here, all that is generally needed is to run the cells, with some file setup and, if desired, some edits to the commands.

Before starting, all the necessary files need to be prepared. When run, this notebook sets up all the necessary installations in the environment. We first need to move out of the `notebooks` directory via the code block below, which should automatically set the working directory depending on whether this notebook is being run locally or on Google Colab.

**Extra Note:** Many cells start with the prefix `!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1` because the scripts they run import AudioCLIP. Currently, the only way to avoid this prefix is to download the AudioCLIP dependencies. This is a consequence of how the repository is set up; the prefix is a workaround adopted due to time constraints.
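If you prefer to launch the scripts from Python rather than with the shell prefix, the same two environment variables can be set explicitly. The sketch below is only an illustration: `audioclip_free_env` is a hypothetical helper that is not part of the repository, and it merely reproduces what the prefix does (prepend `models` and `.` to `PYTHONPATH` and set `NO_AUDIOCLIP=1`).

```python
import os


def audioclip_free_env(base_env=None):
    """Return a copy of the environment with the prefix's two variables applied."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("PYTHONPATH", "")
    # Same value the shell prefix produces: models:.:$PYTHONPATH
    env["PYTHONPATH"] = f"models:.:{existing}"
    env["NO_AUDIOCLIP"] = "1"
    return env


# Example use, mirroring the CIFAR-10 multimodal cell of this notebook
# (commented out so that nothing is launched by accident):
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "learn_concepts_multimodal.py",
#      "--out-dir", "artifacts/multimodal", "--classes", "cifar10",
#      "--backbone-name", "clip:RN50", "--device", "cuda", "--recurse", "1"],
#     env=audioclip_free_env(), check=True)
```

The helper copies the environment instead of modifying `os.environ`, so other cells in the session are unaffected.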

In [None]:
import os
from pathlib import Path

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import userdata
    repo_name = 'write the repository name here'
    url = f"https://{userdata.get('gh_pat')}@github.com/{repo_name}.git"
    !git clone {url}
    !pip install git+https://github.com/openai/CLIP.git # for the CLIP library
    print("\nCurrent Directory:")
    %cd 'put repo name here'
    !pip install -r requirements.txt

else: # automatically checks if the current directory is 'repo name'
    curdir = Path.cwd()
    print("Current Directory", curdir)
    repo_name = "put repo name here"
    assert curdir.name == repo_name or curdir.parent.name == repo_name, "Notebook cwd has to be on the project root"
    if curdir.name == "notebooks":
        %cd ..
        print("New Current Directory:", Path.cwd())

You should now be in the **Anonymous** main folder. This is important for running the files to ensure that they save/search in the correct locations!

# **Downloading the Datasets**

## BRODEN Concepts Dataset

_Note_: There is a potential permission error which may arise when trying to download the files via this notebook. Manual downloading may be needed.

In [None]:
# Get the BRODEN concepts dataset
!bash ./scripts/download_broden

## COCO-Stuff Dataset

*Note:* The dataset is around 20 GB in total. Ensure you have enough space on your device before attempting to download.
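A quick way to check the available space before starting the download is Python's standard `shutil.disk_usage`. This is a small sketch: `has_free_space` is a hypothetical helper, and the 20 GB threshold in the usage comment simply mirrors the estimate above.

```python
import shutil


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


# Example: check the current directory before downloading COCO-Stuff
# if not has_free_space(".", 20):
#     print("Less than 20 GB free; the COCO-Stuff download may not fit.")
```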

In [None]:
# Get the COCO-stuff dataset (bash is needed to run the command below)
!bash ./scripts/download_cocostuff

## CUB Dataset

In [None]:
# Get the CUB dataset (bash is needed here to run the command below)
!bash ./scripts/download_cub

## Derm7pt Dataset

_Note:_ Because of how the original Derm7pt authors have set up the download, you must register on their site to access the dataset, so some manual processing is required. Please perform the following steps:

1. Go to the Derm7pt site [here](https://derm.cs.sfu.ca/Download.html).
2. Fill in the form with the necessary details.
3. The email received should contain the download link alongside the needed login credentials below it. Click the link and then fill in the details in the prompt given, which should automatically trigger the download afterwards.
4. Extract the .zip file and rename the extracted folder to `derm7pt`.
5. Move this folder to `./Anonymous/artifacts/data`.
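Steps 4 and 5 can also be scripted. The following is a minimal sketch under stated assumptions: `install_derm7pt` is a hypothetical helper, the archive name in the usage comment is a placeholder for whatever file you downloaded, and the layout inside the archive is not known here, so the code handles both a single top-level folder and loose files.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def install_derm7pt(zip_path, data_dir="artifacts/data"):
    """Extract the archive and place its contents at <data_dir>/derm7pt."""
    data_dir = Path(data_dir)
    target = data_dir / "derm7pt"
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    data_dir.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        extracted = Path(tmp) / "extracted"
        extracted.mkdir()
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(extracted)
        entries = list(extracted.iterdir())
        # A single top-level folder is renamed to derm7pt; otherwise the
        # extracted files are wrapped in a new derm7pt folder.
        single_folder = len(entries) == 1 and entries[0].is_dir()
        source = entries[0] if single_folder else extracted
        shutil.move(str(source), str(target))
    return target


# Example (placeholder archive name):
# install_derm7pt("path/to/downloaded_archive.zip")
```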

_Note:_ If desired, for Google Colab you can upload the dataset to Google Drive and copy it to the current session using the following command:

In [None]:
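# mount Google Drive first (Colab only) so that the source path below exists
from google.colab import drive
drive.mount('/content/drive')

# set the location variable to the specific directory in your Google Drive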
location = "path/to/your/directory/in/drive"

# construct the source and destination paths
source_path = f"/content/drive/MyDrive/{location}"
destination_path = "/content/Anonymous/artifacts/data"

# copy to the destination path
!cp -r "{source_path}" "{destination_path}"

## HAM10000 Dataset

_Note:_ The HAM10000 dataset is available as a public Kaggle dataset. To download it through this script, make sure you have a Kaggle API token (a `kaggle.json` file) ready and place it in the following directory: `C:\Users\<your username>\.kaggle` (or `~/.kaggle` on Linux and macOS).

To create a Kaggle API token, please do the following steps:

1. Go to your [account settings](https://www.kaggle.com/account). You will need to create a Kaggle account if you do not have one already.
2. Click on your profile icon > "Settings" > Scroll down to "API" > click "Create New Token"
3. This will download a file named `kaggle.json`. Again remember to move it to the scripts folder in the **Anonymous** directory.

If on Google Colab, upload your generated API token (`kaggle.json`) to any folder in your Google Drive and set the `folder_containing_api` variable below to that folder's path relative to `MyDrive`.
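Before running the download, it can help to confirm that the token is where the Kaggle client will look for it. The sketch below assumes the standard `kaggle.json` layout with `username` and `key` fields; `kaggle_token_ok` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def kaggle_token_ok(config_dir):
    """Return True if <config_dir>/kaggle.json exists and holds a username and key."""
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        return False
    try:
        data = json.loads(token.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(data.get(k) for k in ("username", "key"))


# Example: check the default location in the home directory
# print(kaggle_token_ok(Path.home() / ".kaggle"))
```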

Afterwards, just run the following code block:

In [None]:
# Get the HAM10K dataset (bash is needed here to run the command below)

folder_containing_api = "" # Add your folder here if on Colab

if IN_COLAB:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.environ['KAGGLE_CONFIG_DIR'] = f'/content/drive/MyDrive/{folder_containing_api}'

!bash ./scripts/download_ham

## SIIM-ISIC Dataset

_Note_: The original dataset is around 23 GB in total. The version downloaded by this script is a trimmed-down version which replicates what the original authors did (totalling less than 2 GB).

In [None]:
# Get the SIIM-ISIC dataset
!bash ./scripts/download_siim

## Metashift Dataset

In [None]:
# Get the Metashift dataset
!bash ./scripts/download_metashift

...and that would be every dataset needed for reproducing the main results! If you'd like, you can also download the datasets and dependencies for the extension experiments (which totals ~7 GB).

## ESC-50 Dataset

In [None]:
# Get the ESC-50 Dataset
!bash ./scripts/download_esc

## UrbanSound8K Dataset

As with the HAM10000 dataset, you need a Kaggle API token. If you do not have one yet, follow the instructions [there](#ham10000-dataset).

In [None]:
# Get the UrbanSound8K Dataset

folder_containing_api = "" # Add your folder here if on Colab

if IN_COLAB:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.environ['KAGGLE_CONFIG_DIR'] = f'/content/drive/MyDrive/{folder_containing_api}'

!bash ./scripts/download_us8k

## AudioCLIP Dependencies

**Please Note:** Because of how the repository is set up, running the script below is necessary to run the experiments.

This downloader only includes the fully pretrained AudioCLIP model and, in case it is needed, the vocabulary. The main AudioCLIP repository is not designed to be installed as a Python package: as of writing, it provides no `setup.py` or equivalent, so it cannot be installed directly.

Thus, a copy of it has been integrated here, with the assets separated to prevent bottlenecking this repo. You can find the citation to the original authors [here](../models/AudioCLIP/README.md) and their original repository [here](https://github.com/AndreyGuzhov/AudioCLIP).

In [None]:
# Get the AudioCLIP Dependencies
!bash ./scripts/download_audioclip

# **Training and Evaluating PCBMs**

Now that we have prepared all the necessary files, we can begin replicating the results.

Note, however, that some details needed for replication are missing, so the results may differ somewhat from those in the original paper (hardware differences between experiments also play a role).

As a reference for how you should specify the `dataset-name` and `backbone-name`, we provide the following tables:

### **Datasets**
| Dataset | Codename for Parameter |
| :-: | :-: |
| CIFAR-10 | 'cifar10' |
| CIFAR-100 | 'cifar100' |
| COCO-Stuff | 'coco_stuff' |
| CUB | 'cub' |
| HAM10000 | 'ham10000' |
| SIIM-ISIC | 'siim_isic' |

### **Backbone Models**
| Backbone | Codename for Parameter |
| :-: | :-: |
| ResNet18 | 'resnet18_cub' |
| CLIP | 'clip:RN50' |
| Inception | 'ham10000_inception' |
| ImageNet ResNet18 | 'resnet18_imagenet1k_v1' |
| AudioCLIP | 'audio' |


# Learning Concepts

In total, there are three concept datasets needed for these experiments:
1. BRODEN
2. CUB
3. Derm7pt

Here we prepare each of these concepts for later use alongside the corresponding models, starting with the BRODEN ones.

_Note:_ If you are on Colab, make sure to install PyTorch Ignite, Visdom, and PyTorch Lightning first.

**Extra Note:** If the concept bank was generated using CLIP, the colon (`:`) is dropped from its filename (e.g. `clipRN50` rather than `clip:RN50`), because the colon would otherwise cause the filename to be split. Keep this in mind when specifying the concept bank path.
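To make the naming rule concrete, here is a small sketch that builds a concept bank path from the script arguments. The pattern `{dataset}_{backbone}_{C}_{n_samples}.pkl` is inferred from the example paths used later in this notebook (such as `broden_clipRN50_10.0_50.pkl`), not taken from the script itself, and `concept_bank_path` is a hypothetical helper.

```python
from pathlib import Path


def concept_bank_path(dataset_name, backbone_name, C, n_samples, out_dir="artifacts/outdir"):
    """Build the assumed concept bank path, dropping ':' from the backbone name."""
    backbone = backbone_name.replace(":", "")
    filename = f"{dataset_name}_{backbone}_{C}_{n_samples}.pkl"
    return (Path(out_dir) / filename).as_posix()


print(concept_bank_path("broden", "clip:RN50", 10.0, 50))
```

Checking the generated name against the files actually present in `artifacts/outdir` is still advisable, since the pattern is an assumption.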

In [None]:
# To install PyTorch Ignite, Visdom and PyTorch Lightning
if IN_COLAB:
    !pip install pytorch-ignite
    !pip install visdom
    !pip install pytorch-lightning

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="broden" \
  --backbone-name="clip:RN50" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

Then we move on to the CUB concepts...

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="cub" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

... and finally the Derm7pt concepts.

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="derm7pt" \
  --backbone-name="ham10000_inception" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

## Learning Multimodal Concepts

One part of the original paper involves learning concepts automatically using CLIP embeddings. The authors have already implemented this in `learn_concepts_multimodal.py` (we have made some adjustments for readability and added extra concept banks for the extension studies).

Because of this, we can run the following snippets directly for each dataset, starting with CIFAR10/100 below.

_Note:_ Make sure to change the device to match what you would like/have (by default it assumes _cuda_). Also, the `recurse` parameter defines how much we recurse through ConceptNet, meaning that a higher recurse value results in more (but less directly related) concepts.
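The effect of `recurse` can be illustrated with a toy example. The sketch below is not the script's implementation: it replaces ConceptNet with a small hand-written dictionary (`TOY_RELATIONS`, made up for illustration) and only shows how each extra level of recursion adds concepts that are further removed from the class name.

```python
def expand_concepts(class_names, related, recurse=1):
    """Collect concepts reachable from the class names within `recurse` hops."""
    concepts = set()
    frontier = set(class_names)
    for _ in range(recurse):
        # Follow one more hop from the current frontier, keeping only new concepts
        frontier = {c for term in frontier for c in related.get(term, ())} - concepts
        concepts |= frontier
    return concepts


# Hypothetical stand-in for ConceptNet relations
TOY_RELATIONS = {
    "bicycle": ["wheel", "handlebar"],
    "wheel": ["rubber", "spoke"],
    "handlebar": ["grip"],
}

print(sorted(expand_concepts(["bicycle"], TOY_RELATIONS, recurse=1)))
print(sorted(expand_concepts(["bicycle"], TOY_RELATIONS, recurse=2)))
```

With `recurse=1` only the direct neighbours of `bicycle` are returned; with `recurse=2` the neighbours of those neighbours are included as well.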

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cifar10" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cifar100" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

Now we can learn the concepts for COCO-Stuff.

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="task" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

Below is the concept learner snippet for the extension experiments, which you can run if desired.

In [None]:
# Note: 'audio' here refers to AudioCLIP
!python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="audioset+us8k+esc50" \
  --backbone-name="audio" \
  --device="cuda" \
  --recurse=1

# **Reproducing the Original Results**

We provide code to replicate the original results obtained.
The `datasets` parameter takes the list of datasets we want to evaluate on.

_Note:_ To evaluate on all datasets at once, use the `eval-all` option.

In [None]:
# Add "PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1" if you do not want to evaluate AudioCLIP
!python eval_original_model.py \
  --datasets "cifar10" "cifar100" \
  --out-dir=artifacts/outdir \
  --num-workers=4 \
  --seeds 42 \
  --eval-all

# **Training PCBMs**

_Note:_ For the COCO-Stuff experiments, please set `out-dir` to `artifacts/outdir/coco-stuff`; otherwise the output folder will be much less organized.

## Main Experiments

The following cells are sample experiments for training the initial PCBMs. The base script for each experiment is included and should reproduce the results in our report. For other experiments, feel free to vary the parameters and try the combinations specified by the authors.

The `concept-bank` parameter should be the path to the desired concept bank, in the format used in the cells below. To obtain the CLIP concept results, change the concept bank accordingly. The dataset and backbone names are the same as those specified in the [tables above](#training-and-evaluating-pcbms).

In [None]:
# Get the correct lambda (regularization strength) value for each dataset
cifar10_broden_lam = 2/(10 * 175)
cifar100_broden_lam = 2/(100 * 175)
cub_lam = 0.01/(200 * 112)
ham_lam = 2.0/(2*8)
isic_lam = 0.001/(2*8)
coco_stuff_lam = 0.001

cifar10_clip_lam = 2/(10 * 170)
cifar100_clip_lam = 2/(100 * 440)
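Most of the values above share the form `strength / (a * b)`. Our reading, which is an assumption rather than something stated in this notebook, is that the first factor is the number of classes and the second the number of concepts in the bank. Under that assumption, the computation can be wrapped in a small hypothetical helper:

```python
def scaled_lam(strength, n_classes, n_concepts):
    """Scale a regularization strength by the (assumed) classes x concepts product."""
    return strength / (n_classes * n_concepts)


# Reproduces two of the values defined in the cell above
print(scaled_lam(2, 10, 175))   # same expression as cifar10_broden_lam
print(scaled_lam(2.0, 2, 8))    # same expression as ham_lam
```

This makes it easier to derive a value for a new concept bank, provided the interpretation of the two factors holds.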

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="cifar10" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir \
  --lam={cifar10_broden_lam}

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="cifar100" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir \
  --lam={cifar100_broden_lam}

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="coco_stuff" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir/coco-stuff \
  --lam={coco_stuff_lam} 

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/cub_resnet18_cub_10.0_50.pkl" \
  --dataset="cub" \
  --backbone-name="resnet18_cub" \
  --out-dir=artifacts/outdir \
  --lam={cub_lam}

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/derm7pt_ham10000_inception_10.0_50.pkl" \
  --dataset="ham10000" \
  --backbone-name="ham10000_inception" \
  --out-dir=artifacts/outdir \
  --lam={ham_lam}

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm.py \
  --concept-bank="artifacts/outdir/derm7pt_ham10000_inception_10.0_50.pkl" \
  --dataset="siim_isic" \
  --backbone-name="ham10000_inception" \
  --out-dir=artifacts/outdir \
  --lam={isic_lam}

### Saliency Maps
The following cell creates saliency maps for different concepts. To obtain saliency maps for a different class, change the `targetclass` argument to another class from the CIFAR-100 dataset. The concept names can also be changed to other concepts, taken either from the BRODEN dataset or from the 440 multimodal concepts for CIFAR-100.

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python get_saliency_maps.py \
  --concept-bank1="artifacts/multimodal/mmc_clipRN50_cifar100_recurse_1.pkl" \
  --concept-bank2="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="cifar100" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir \
  --concept-ix=0 \
  --method "smoothgrad" \
  --targetclass="bicycle" \
  --concept-names1 'bicycle wheel' 'coaster brake' 'two wheels' 'bicycle seat' 'green' \
  --concept-names2 'bicycle' 'handle_bar' 'chain_wheel' 'book' 'greenness' 

### Model Editing Experiments

Running the following script performs the model editing experiments for the 6 scenarios with one seed. Feel free to add seeds or change the base model by swapping the `base_config`. The results are written to a .csv file at `logs/base_clip_resnet50/0/{timestamp}` (or, more generally, at `logs/{model_name}/{seed0}-{...}-{seedn}/{timestamp}`). Make sure the BRODEN CAV concepts with C=0.01 are downloaded.
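Once a run has finished, the result files can be located programmatically. The sketch below follows the directory layout described above; `run_dir` and `latest_results` are hypothetical helpers, and picking the "latest" run assumes that the timestamp folder names sort chronologically, which is not confirmed here.

```python
from pathlib import Path


def run_dir(model_name, seeds, root="logs"):
    """Build logs/{model_name}/{seed0}-{...}-{seedn} from the layout described above."""
    return Path(root) / model_name / "-".join(str(seed) for seed in seeds)


def latest_results(model_name, seeds, root="logs"):
    """Return the .csv files of the most recent run, or [] if there is none."""
    base = run_dir(model_name, seeds, root)
    if not base.is_dir():
        return []
    # Assumption: timestamp folder names sort in chronological order
    runs = sorted(p for p in base.iterdir() if p.is_dir())
    return sorted(runs[-1].glob("*.csv")) if runs else []


# Example: results of the default configuration with seed 0
# print(latest_results("base_clip_resnet50", [0]))
```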

In [None]:
%%capture 
# Suppress output with capture magic
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python -m experiments.model_editing.make_table_sk \
    --seed 0 \
    --base_config configs/model_editing/classifier/sk_base_clip_resnet50.yaml

### Audio Experiments

In [None]:
!python train_pcbm.py \
  --concept-bank="artifacts/multimodal/mmc_audio_us8k_recurse_1.pkl" \
  --dataset="us8k" \
  --backbone-name="audio" \
  --out-dir=artifacts/outdir \
  --lam=2e-4

### Concept Feature Value Experiment
For this experiment we also need the multimodal concept bank for CLIP, which the `learn_concepts_multimodal.py` cells below generate. To get the three different results in the table, switch the concept bank between CAVs and multimodal concepts, and use the `--random_proj` parameter.

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="broden" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python test_cav_activation.py \
    --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
    --concept-dataset="broden" \
    --backbone-name="clip:RN50" \
    --out-dir="artifacts" \
    --alpha=0.99 \
    --num-workers=2 \
    --seeds '42' 

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python learn_concepts_multimodal.py \
  --out-dir="artifacts/multimodal" \
  --classes="cub" \
  --backbone-name="clip:RN50" \
  --device="cuda" \
  --recurse=1

In [None]:
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python test_cav_activation.py \
    --concept-bank="artifacts/multimodal/mmc_clipRN50_cub_recurse_1.pkl" \
    --concept-dataset="cub" \
    --dataset="cub" \
    --backbone-name="clip:RN50" \
    --out-dir="artifacts" \
    --alpha=0.99 \
    --num-workers=2 \
    --seeds '42' 

### Random Projection Experiment
The code below is a template that produces the results for one dataset. To get the results for the other datasets in the report, change the parameters to those used in the main experiments. Make sure the `--random_proj` and `--seeds` parameters are kept.

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python verify_dataset_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --dataset="cifar10" \
  --backbone-name="clip:RN50" \
  --out-dir='artifacts' \
  --lam={cifar10_broden_lam} \
  --seeds '42' \
  --random_proj

### Complexity-Accuracy Trade-off Results and Plots
The results are saved as `sum.png` and `sparsities.png` in the `artifacts` folder.

In [None]:
!PYTHONPATH=models:.:$$PYTHONPATH NO_AUDIOCLIP=1 python test_sparsity_vs_accuracy.py \
    --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
    --dataset="cifar100" \
    --backbone-name="clip:RN50" \
    --out-dir="artifacts" \
    --alpha=0.99 \
    --strengths 10.0 1.0 0.1 0.01 0.001 \
    --num-workers=2 \
    --seed 42

# **Training PCBM-h's**

## Main Experiments

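For this section, set `pcbm-path` to the path where the desired PCBM checkpoint was saved. Judging by the example paths in the cells below, the file name encodes the dataset, backbone, concept bank, `lam`, `alpha`, and seed, so adjust it to match the values used when training the PCBM.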
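To check which settings a given checkpoint corresponds to, the file name can be parsed. This is a minimal sketch that assumes the `__lam_..__alpha_..__seed_..` pattern seen in the example paths of this notebook; `parse_pcbm_checkpoint` is a hypothetical helper, and the naming scheme of the training script may differ from what is assumed here.

```python
import re

# Pattern inferred from paths such as
# pcbm_cifar10__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42.ckpt
_CKPT_PATTERN = re.compile(
    r"__lam_(?P<lam>[0-9.eE+-]+)__alpha_(?P<alpha>[0-9.eE+-]+)__seed_(?P<seed>\d+)"
)


def parse_pcbm_checkpoint(path):
    """Extract lam, alpha and seed from a PCBM checkpoint file name."""
    match = _CKPT_PATTERN.search(str(path))
    if match is None:
        raise ValueError(f"Unrecognised checkpoint name: {path}")
    return {
        "lam": float(match["lam"]),
        "alpha": float(match["alpha"]),
        "seed": int(match["seed"]),
    }


example = "artifacts/outdir/pcbm_cifar10__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42.ckpt"
print(parse_pcbm_checkpoint(example))
```

Comparing the parsed `lam` with the value passed to `train_pcbm.py` is a quick way to confirm that the path points at the intended checkpoint.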

In [None]:
# You can adjust the 'seed' part in the pcbm-path variable
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm_h.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --pcbm-path="artifacts/outdir/pcbm_cifar10__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42.ckpt" \
  --out-dir=artifacts/outdir \
  --dataset="cifar10" \
  --num-workers=4

In [None]:
# You can adjust the 'seed' part in the pcbm-path variable
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm_h.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --pcbm-path="artifacts/outdir/pcbm_cifar100__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42.ckpt" \
  --out-dir=artifacts/outdir \
  --dataset="cifar100" \
  --num-workers=4

In [None]:
# You can adjust the 'seed' part in the pcbm-path variable
!PYTHONPATH=models:.:$PYTHONPATH NO_AUDIOCLIP=1 python train_pcbm_h.py \
  --concept-bank="artifacts/outdir/broden_clipRN50_10.0_50.pkl" \
  --pcbm-path="artifacts/outdir/coco-stuff/pcbm_coco_stuff__clipRN50__broden_clipRN50_10__lam_0.0002__alpha_0.99__seed_42_target_3.ckpt" \
  --out-dir=artifacts/outdir/coco-stuff \
  --dataset="coco_stuff" \
  --num-workers=4 \
  --no-print-out

## Extension Experiments

In [None]:
!python train_pcbm_h.py \
  --concept-bank="artifacts/multimodal/mmc_audio_audioset+us8k+esc50_recurse_1.pkl" \
  --pcbm-path="artifacts/outdir/" \
  --out-dir=artifacts/outdir \
  --dataset="us8k" \
  --num-workers=4