# **Post-Hoc Concept Bottleneck Models Replication**

This notebook assumes that the README instructions have already been followed. If you have not done so, see the [README](../README.md) first.

Before starting, all the necessary files need to be prepared. When run, this notebook sets up the required installations in the environment. First, we need to move out of the "notebooks" directory using the code block below. It should set the working directory automatically, depending on whether the notebook is running locally or on Google Colab.

In [1]:
import os

try:
    import google.colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import userdata
    url = f"https://{userdata.get('gh_pat')}@github.com/dgcnz/FACT.git" # Note: may need to be scrubbed
    !git clone {url}
    !pip install git+https://github.com/openai/CLIP.git # for the CLIP library
    %cd FACT

else: # move up one level unless the current directory is already 'FACT'
    # os.path.basename handles both Windows and POSIX path separators
    curdir = os.path.basename(os.getcwd())
    if curdir != "FACT":
        %cd ..

c:\Users\Gregory Go\.github\FACT


You should now be in the **FACT** main folder.

# **Downloading the Datasets**

## Broden Concepts Dataset

_Note_: Downloading the files from this notebook may fail with a permission error. If it does, download them manually.

In [None]:
# Get the BRODEN concepts dataset
!python ./scripts/download_broden

## COCO-Stuff Dataset

*Note:* The dataset is around 20 GB in total. Ensure you have enough space on your device before attempting to download.
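
The free-space check mentioned in the note can be done before starting the download. The snippet below is a minimal sketch using only the standard library; the helper name `has_free_space` and the 20 GB threshold (taken from the approximate size quoted above) are illustrative assumptions, not part of the repository.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem containing `path` has at least `required_gb` GB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


# 20 GB is the approximate dataset size quoted above (hypothetical threshold; add a margin if you like).
if not has_free_space(".", 20):
    print("Warning: less than 20 GB free; the COCO-Stuff download may not fit.")
```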

In [None]:
# Get the COCO-stuff dataset (bash is needed to run the command below)
!bash ./scripts/download_cocostuff

## CUB Dataset

In [None]:
# Get the CUB dataset (bash is needed here to run the command below)
!bash ./scripts/download_cub

## Derm7pt Dataset

Because of how the original Derm7pt authors have set up the download, you must register on their site to access the dataset, so some manual processing is needed. Please perform the following steps:

1. Go to the Derm7pt site [here](https://derm.cs.sfu.ca/Download.html).
2. Fill in the form with the necessary details.
3. The email you receive should contain the download link, with the required login credentials below it. Click the link and enter the credentials in the prompt; the download should then start automatically.
4. Extract the .zip file and rename the extracted folder to "derm7pt".
5. Move this folder to "./FACT/artifacts/data".

_Note:_ If desired, for Google Colab you can upload the dataset to Google Drive and copy it to the current session using the following command:
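
The exact command is not reproduced in this notebook. As a minimal sketch of one way to do the copy, the helper below uses `shutil.copytree`; the function name `copy_dataset` and the Drive path in the commented example are assumptions and should be adjusted to wherever the folder was uploaded.

```python
import shutil
from pathlib import Path


def copy_dataset(src: str, dst: str) -> Path:
    """Copy the folder `src` to `dst`, creating parent directories as needed."""
    src_path, dst_path = Path(src), Path(dst)
    if not src_path.is_dir():
        raise FileNotFoundError(f"Source folder not found: {src_path}")
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    # dirs_exist_ok=True (Python 3.8+) allows the copy to be re-run safely
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return dst_path


# On Colab, Google Drive would first be mounted with:
#   from google.colab import drive; drive.mount("/content/drive")
# Hypothetical Drive location, copied into the folder named in step 5:
#   copy_dataset("/content/drive/MyDrive/derm7pt", "artifacts/data/derm7pt")
```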

## HAM10000 Dataset

The HAM10000 Dataset is made available as a public Kaggle dataset. In order to download it through this script, make sure you have a Kaggle API token ready and place it in the "scripts" folder.

To create a Kaggle API token, follow these steps:

1. Go to your [account settings](https://www.kaggle.com/account). You will need to create a Kaggle account if you do not have one already.
2. Click on your profile icon > "Settings" > Scroll down to "API" > click "Create New Token"
3. This will download a file named kaggle.json. Again, remember to move it to the "scripts" folder in the **FACT** directory.
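
Before running the download, it can help to confirm that the token is in place. The snippet below is a minimal sketch, assuming the notebook is running from the **FACT** folder and that the token follows the usual Kaggle layout with `username` and `key` fields; the helper name `check_kaggle_token` is hypothetical and not part of the repository.

```python
import json
from pathlib import Path


def check_kaggle_token(path: str = "scripts/kaggle.json") -> bool:
    """Return True if `path` is a JSON file with non-empty "username" and "key" fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    return (
        isinstance(token, dict)
        and bool(token.get("username"))
        and bool(token.get("key"))
    )


if not check_kaggle_token():
    print("kaggle.json is missing or incomplete in the scripts folder.")
```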

Afterwards, run the following code block:

In [None]:
# Get the HAM10K dataset (bash is needed here to run the command below)
!bash ./scripts/download_ham

## SIIM-ISIC Dataset

_Note_: The original dataset is around 23 GB in total. The version downloaded by this script is a trimmed-down version which replicates what the original authors did.

In [None]:
# Get the SIIM-ISIC dataset
!python ./scripts/download_siim

# **Training and Evaluating PCBMs**

Now that all the necessary files are prepared, we can begin replicating the results.

Note, however, that some details needed for replication are missing, so the results may differ somewhat from those in the original paper. Hardware differences between experiments also contribute to this.

## Learning Concepts

In total, three concept datasets are needed for these experiments:
1. BRODEN
2. CUB
3. Derm7pt

Here we prepare each of these concepts for later use alongside the corresponding models, starting with the BRODEN ones.

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="broden" \
  --backbone-name="clip:RN50" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

Then the CUB concepts:

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="cub" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir

... and finally the Derm7pt concepts.

In [None]:
!python learn_concepts_dataset.py \
  --dataset-name="derm7pt" \
  --backbone-name="ham10000_inception" \
  --C 0.001 0.01 0.1 1.0 10.0 \
  --n-samples=50 \
  --out-dir=artifacts/outdir
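
The training commands in this notebook pass a `--concept-bank` path of the form `artifacts/outdir/broden_clip:RN50_10.0_50.pkl`, which suggests that each run writes one concept bank per `--C` value, named after the dataset, backbone, `C`, and `--n-samples`. The sketch below builds the expected paths under that assumption; the naming pattern is inferred from that single path rather than confirmed from the script, and `concept_bank_path` is a hypothetical helper.

```python
def concept_bank_path(dataset: str, backbone: str, C: float, n_samples: int,
                      out_dir: str = "artifacts/outdir") -> str:
    """Build a concept bank path following the (inferred) naming pattern."""
    return f"{out_dir}/{dataset}_{backbone}_{C}_{n_samples}.pkl"


# The C values and sample count used in the commands above
C_VALUES = [0.001, 0.01, 0.1, 1.0, 10.0]

for C in C_VALUES:
    print(concept_bank_path("broden", "clip:RN50", C, 50))
```

Checking that these files exist before training can catch a failed or interrupted concept-learning run early. The CUB command above does not set `--backbone-name`, so its file name depends on the script's default and is not covered here.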

## Training PCBM

In [None]:
!python train_pcbm.py \
  --concept-bank="artifacts/outdir/broden_clip:RN50_10.0_50.pkl" \
  --dataset="cifar10" \
  --backbone-name="clip:RN50" \
  --out-dir=artifacts/outdir \
  --lam=2e-4

## Training PCBM-h

In [None]:
!python train_pcbm_h.py \
  --concept-bank="artifacts/outdir/broden_clip:RN50_10.0_50.pkl" \
  --pcbm-path="artifacts/outdir/pcbm_cifar10__clip:RN50__broden_clip:RN50_10__lam:0.0002__alpha:0.99__seed:42.ckpt" \
  --out-dir=artifacts/outdir \
  --dataset="cifar10" \
  --num-workers=4
