# Interpretable Machine Learning
## Exercise Sheet 6: LIME
## This exercise sheet covers lecture 6 on LIME
Sophie Langbein (langbein@leibniz-bips.de)<br>
Pegah Golchian (golchian@leibniz-bips.de)
<hr style="border:1.5px solid gray"> </hr>

# Counterfactual Explanations

Counterfactual explanations are a valuable tool to explain predictions of machine learning models. They provide an understanding of how a machine learning model's prediction for a specific instance would change if the input features were altered in a meaningful way. These explanations are generated by identifying a "counterfactual" instance, which is a data point that is as similar as possible to the original instance but with some feature values modified. The goal is to explain to end-users how they could change their input to receive a different model prediction.

One of the simplest ways to generate counterfactuals is to find, for a given observation x (`x_interest`), the closest data point whose prediction equals the desired outcome. In the following exercise, you will implement this so-called **WhatIf** approach for a binary classifier.

**a)** Implement the following steps in `generate_whatif()`:

- Create a dataset containing only the observations whose prediction differs from that of `x_interest` (for a binary classifier, this differing class is the desired prediction).
- Calculate the pairwise Gower’s distances between `x_interest` and all other points with differing predictions.
- Return the nearest data point as a counterfactual for `x_interest` with the shape (1, num_features).

**Solution:**

import gower
import numpy as np

def generate_whatif(x_interest, model, dataset): 
  
    """
    Computes whatif counterfactuals for binary classification models, 
    i.e., the closest data point with a different prediction.
  
    Parameters: 
    x_interest (np.array with shape (1, num_features)): Datapoint of interest.
    model: Binary classifier which has a predict method.
    dataset (np.array with shape (?, num_features)): Input data from which a counterfactual is selected.

    Returns:
    counterfactual (np.array with shape (1, num_features)): the closest observation/row to x_interest of the input dataset with a different prediction than x_interest. 
    """
    
    # fill in 
    
    return
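
One way the skeleton above could be filled in is sketched below. To stay self-contained it does not call the `gower` package; instead, `gower_distance_numeric()` is a hypothetical helper that computes Gower's distance for purely numeric features (range-normalized absolute differences, averaged over features), with ranges taken over `x_interest` and the candidate rows. `ThresholdModel` is an illustrative stand-in for a fitted classifier, not part of the exercise materials.

```python
import numpy as np

def gower_distance_numeric(x, data):
    """Gower distance between one row x (1, p) and every row of data (n, p).

    Assumption: all features are numeric, so each feature contributes
    |x_j - d_j| / range_j and the contributions are averaged.
    """
    stacked = np.vstack([x, data])
    ranges = stacked.max(axis=0) - stacked.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant features contribute a distance of zero
    return (np.abs(data - x) / ranges).mean(axis=1)

def generate_whatif_sketch(x_interest, model, dataset):
    pred_interest = model.predict(x_interest)[0]
    preds = model.predict(dataset)
    # Keep only observations whose prediction differs from that of x_interest.
    candidates = dataset[preds != pred_interest]
    if len(candidates) == 0:
        raise ValueError("No observation with a different prediction found.")
    distances = gower_distance_numeric(x_interest, candidates)
    # Indexing with a list keeps the result two-dimensional: (1, num_features).
    return candidates[[np.argmin(distances)]]

class ThresholdModel:
    """Hypothetical stand-in classifier: predicts 1 if the first feature exceeds 0.5."""
    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0.5).astype(int)

dataset = np.array([[0.1, 1.0], [0.4, 2.0], [0.6, 2.5], [0.9, 9.0]])
x_interest = np.array([[0.4, 2.0]])
counterfactual = generate_whatif_sketch(x_interest, ThresholdModel(), dataset)
print(counterfactual)
```

For mixed numeric and categorical data, the `gower` package is the more suitable choice, since the helper here ignores categorical features entirely.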

**b)** Write a function `evaluate_counterfactual()` that checks whether a counterfactual is minimal. For every feature, the function should set the counterfactual's value back to its value in `x_interest` and check whether the prediction still differs from the prediction for `x_interest`. If it does, the function should record the name of that feature. It should return the list of recorded feature names.

**Solution:**

def evaluate_counterfactual(counterfactual, x_interest, model, labels):
    """
    Evaluates if counterfactuals are minimal, i.e., if setting one feature to the value of x_interest still results in a different prediction than for x_interest.
   
    Parameters: 
    counterfactual (np.array with shape (1, num_features)): Counterfactual of `x_interest`. 
    x_interest (np.array with shape (1, num_features)): Datapoint of interest. 
    model: Binary classifier which has a predict method.
    labels (list): A list of the labels of the feature matrix. 
  
    Returns: 
    List with names of features that, if set in the counterfactual to the value of `x_interest`, still lead to a different prediction than for x_interest. 
    """

    # fill in 
    
    return feature_names
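
A possible way to fill in the skeleton above is sketched here. It reverts one feature at a time and records the feature name whenever the prediction still differs from that of `x_interest`. Skipping features that the counterfactual did not change is a design assumption of this sketch, not something the exercise prescribes. `ThresholdModel` is a hypothetical stand-in classifier, and the two labels are used purely for illustration.

```python
import numpy as np

def evaluate_counterfactual_sketch(counterfactual, x_interest, model, labels):
    pred_interest = model.predict(x_interest)[0]
    feature_names = []
    for j, name in enumerate(labels):
        # Assumption: unchanged features are skipped, since reverting them changes nothing.
        if counterfactual[0, j] == x_interest[0, j]:
            continue
        candidate = counterfactual.copy()
        candidate[0, j] = x_interest[0, j]  # revert a single feature
        if model.predict(candidate)[0] != pred_interest:
            # The prediction still differs, so changing this feature was not necessary.
            feature_names.append(name)
    return feature_names

class ThresholdModel:
    """Hypothetical stand-in classifier: predicts 1 if the first feature exceeds 0.5."""
    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0.5).astype(int)

x_interest = np.array([[0.4, 2.0]])
counterfactual = np.array([[0.6, 2.5]])
not_needed = evaluate_counterfactual_sketch(
    counterfactual, x_interest, ThresholdModel(), labels=["Area", "Perimeter"]
)
print(not_needed)
```

In this toy setup the stand-in model only looks at the first feature, so the change in the second feature is reported as unnecessary.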

**c)** Now, to test the `generate_whatif()` and `evaluate_counterfactual()` functions, import the [wheat seeds dataset](https://archive.ics.uci.edu/dataset/236/seeds) from `wheat_seeds.csv`, which was already used in Exercise Session 6. 

- Drop all rows that contain `NA` values. 
- Convert the dataset into one suitable for binary classification by dropping all observations of type `Canadian` (`Type` = `2`). We only want to classify whether a kernel is of type `Kama` (`Type` = `0`) or `Rosa` (`Type` = `1`).
- Fit a random forest classifier to the full dataset (no training and test set required).
- Choose a suitable point for `x_interest`, check whether the prediction matches the true value. 
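
The preparation steps above might look roughly as follows. Because the CSV file is not available here, a small synthetic DataFrame stands in for `pd.read_csv("wheat_seeds.csv")`; only the column names `Area`, `Perimeter`, and `Type` are taken from the exercise text, and the values are random placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the wheat seeds data (assumption: the real file has more columns).
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "Area": rng.normal(15, 3, n),
    "Perimeter": rng.normal(14, 1, n),
    "Type": np.repeat([0, 1, 2], n // 3),
})
df.loc[0, "Area"] = np.nan  # inject one missing value to illustrate dropna()

df = df.dropna()             # drop rows with NA values
df = df[df["Type"] != 2]     # remove the Canadian kernels to get a binary problem

X = df.drop(columns="Type").to_numpy()
y = df["Type"].to_numpy()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pick one observation and compare its prediction with the true label.
x_interest = X[[0]]
print("predicted:", model.predict(x_interest)[0], "true:", y[0])
```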

**d)** Compute the WhatIf counterfactual of the selected observation and interpret it. Evaluate whether it is minimal.

**e)** Which attributes from the lecture (valid, proximal/similar, sparse, plausible/likely, diverse) does this approach fulfill? Based on this, derive the advantages and disadvantages of the approach.

**f)** Now we want to use the [DiCE](https://github.com/interpretml/DiCE) (Diverse Counterfactual Explanations) package to generate counterfactual explanations for the same data. First, prepare the data and the model:

- Split the data into training and test set. 
- Fit a random forest classifier to the training data.
- Compute the accuracy on the test set.
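
A minimal sketch of these three steps with scikit-learn is shown below. The feature matrix is synthetic and only serves as a placeholder for the wheat seeds data; the split ratio, the stratification, and the random seeds are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 140 observations, 7 numeric features, binary label.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 7))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Hold out 20% of the data for testing, keeping the class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.3f}")
```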

**g)** Now we generate the counterfactuals using the `DiCE` package. For that purpose: 
    
- Transform the data into an appropriate format using `dice_ml.Data`.
- Select two instances of interest from the test set for which counterfactuals should be generated (ideally they should have different outcomes).
- Generate counterfactuals using the `exp_random.generate_counterfactuals` function. Three counterfactuals should be generated for every instance of interest. Only the features `Area` and `Perimeter` should be varied, while all other features should remain constant.
- Visualize the counterfactuals as dataframes. 

Hint: This [notebook](https://github.com/interpretml/DiCE/blob/main/docs/source/notebooks/DiCE_model_agnostic_CFs.ipynb) contains a tutorial on how to generate DiCE counterfactuals using three different algorithms. You can choose the algorithm you prefer; random feature sampling is recommended.

**h)** Which attributes from the lecture (valid, proximal/similar, sparse, plausible/likely, diverse) do the counterfactuals generated above fulfill?

<hr style="border:1.5px solid gray"> </hr>

# Adversarial Examples

Adversarial examples are inputs to machine learning models that are intentionally crafted to cause the model to make mistakes. These inputs are carefully designed to be very similar to genuine data instances but are subtly perturbed in ways that are imperceptible to humans. Adversarial examples have gained significant attention in the field of artificial intelligence and machine learning because they highlight vulnerabilities in the robustness of machine learning models.

In this exercise, your task is to generate adversarial examples using the Fast-Gradient-Sign-Method (FGSM):

$$
\mathbf{a}_{\mathbf{x}} \leftarrow \mathbf{x} + \delta
$$

where 

$$
\delta = \epsilon \cdot \text{sign}(\nabla_{\mathbf{x}}J(\mathbf{x}, y_{\mathbf{x}}; \theta)) 
$$

with a cost function $J$ (here, the cross-entropy loss), the label $y_{\mathbf{x}}$ of $\mathbf{x}$ (here, the label predicted by the model), the model parameters $\theta$, and the perturbation magnitude $\epsilon$. This is an untargeted method, which means the resulting class label is not specified in advance. The goal is to create an adversarial input that is indistinguishable (to the human eye) from the original input but confuses the model. In particular, we will create adversarial examples to fool an image classifier trained on the ImageNet dataset.
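
To make the update rule concrete, here is a small worked example on a two-feature logistic regression instead of an image classifier. The weights, the input, and the value of `eps` are made up for illustration. For this model, the gradient of the cross-entropy loss with respect to the input has the closed form $(p - y)\,\mathbf{w}$, so no automatic differentiation is needed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical logistic regression "model": p(y=1 | x) = sigmoid(w @ x + b)
w = np.array([2.0, -1.0])
b = 0.0
x = np.array([0.5, 0.2])

p = sigmoid(w @ x + b)   # probability of class 1 for the original input
y = int(p > 0.5)         # use the model's own prediction as the label

# Gradient of the cross-entropy loss with respect to the input x.
grad = (p - y) * w

eps = 0.5
delta = eps * np.sign(grad)   # FGSM perturbation
a_x = x + delta               # adversarial example

p_adv = sigmoid(w @ a_x + b)
print("original:", x, "p =", round(float(p), 3))
print("adversarial:", a_x, "p =", round(float(p_adv), 3))
```

Each feature moves by exactly `eps` in the direction that increases the loss, which in this toy setup is enough to push the predicted probability to the other side of 0.5.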

**a)** First import all relevant modules including the custom functions from the `utils.dataset` module. Implement the `get_gradient()` function. The function should take the original `image` and the fitted `model` as an input and return the gradient of the loss with respect to the image. It should perform the following steps:

- Obtain a prediction for the input `image` tensor
- Set the `requires_grad` attribute of the `image` tensor to `True`
- Pass the image tensor through the model to obtain its output
- Compute the cross-entropy loss between the model outputs of the image tensor and the prediction obtained for the input image using the `torch.nn.functional.cross_entropy` function
- Perform backpropagation to compute the gradients of the loss with respect to the input image tensor
- Return the gradient of the loss with respect to the image

**Solution:**

# All relevant imports
import sys
import os  # noqa
sys.path.insert(0, "")  # noqa

import torch

# These are custom modules included in the exercises session
from utils.styled_plot import plt
from utils.dataset import (
    load_test_image,
    preprocess_image,
    normalize_image,
    unnormalize_image,
    convert_idx_to_label
)
from classifiers.cnn_classifier import ImageNetClassifier

torch.manual_seed(0)


def get_gradient(model, image):
    """
    Propagates the cross entropy loss between the model's output (logits) and the label (here the label
    that is predicted by the model is used) back to the input to get the input gradient.

    Parameters:
        model (ImageNetClassifier, torch.nn.Module):
            The image classification model. This is a torch.nn.Module, so you can call its forward method using
            `model()`. The outputs are logits (unnormalized class scores, not probabilities). Also has a `.predict` method that returns the
            index of the predicted label.

        image (torch.tensor): The input for which to compute the gradient.

    Returns:
        gradient (torch.tensor): The input gradient. Same shape as the input image.
    """

    # fill in 
    
    return image.grad # return the gradient of the loss with respect to the image
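
The listed steps could be implemented along the following lines. `TinyClassifier` is a hypothetical stand-in for `ImageNetClassifier` (a single linear layer on a small random "image") so that the sketch runs without the course's custom modules. Cloning the input before enabling gradients is a choice made here to avoid modifying the caller's tensor.

```python
import torch
import torch.nn.functional as F

class TinyClassifier(torch.nn.Module):
    """Hypothetical stand-in for ImageNetClassifier: flatten, then one linear layer."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.linear = torch.nn.Linear(3 * 8 * 8, num_classes)

    def forward(self, x):
        return self.linear(x.flatten(start_dim=1))  # returns logits

    def predict(self, x):
        with torch.no_grad():
            return self(x).argmax(dim=1)  # index of the predicted class

def get_gradient_sketch(model, image):
    image = image.clone().detach()         # work on a copy of the input
    label = model.predict(image)           # predicted label serves as the target
    image.requires_grad_(True)             # track gradients with respect to the input
    logits = model(image)
    loss = F.cross_entropy(logits, label)  # loss between logits and predicted label
    model.zero_grad()
    loss.backward()                        # backpropagate down to the input
    return image.grad

torch.manual_seed(0)
model = TinyClassifier()
image = torch.rand(1, 3, 8, 8)
grad = get_gradient_sketch(model, image)
print(grad.shape)
```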

**b)** Complete the `perturb_image()` function. It should perform the following steps: 

- Calculate the sign of each element in the input `grad` tensor as the perturbation
- Scale the sign of the gradient using `eps`
- Add the scaled perturbation to the original input `image` to create the perturbed image
- Return the perturbed image

**Solution:**

def perturb_image(image, grad, eps):
    """
    Applies a perturbation to an image based on the Fast-Gradient-Sign-Method.

    Parameters:
        image (torch.tensor): The image to perturb.

        grad (torch.tensor): The input gradient corresponding to the image.

        eps (float): The epsilon value for the perturbation, specifying the magnitude of the perturbation.

    Returns:
        image (torch.tensor): The perturbed image.
    """

    return perturbed_image
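
The perturbation step itself is short. A possible version is sketched below on a tiny hand-made gradient, so the effect of `sign` and `eps` is easy to read off. The sketch does not clamp the result to a valid pixel range, since the appropriate range depends on how the images were normalized.

```python
import torch

def perturb_image_sketch(image, grad, eps):
    # Move every pixel by eps in the direction of the gradient's sign.
    return image + eps * grad.sign()

# Toy input: an all-zero "image" and a gradient with one value per channel.
image = torch.zeros(1, 3, 2, 2)
grad = torch.tensor([-2.0, 0.0, 3.0]).view(1, 3, 1, 1).expand(1, 3, 2, 2)

perturbed = perturb_image_sketch(image, grad, eps=0.1)
print(perturbed[0, :, 0, 0])
```

Only the sign of the gradient matters: the channel with gradient -2.0 moves by -0.1, the channel with gradient 3.0 moves by +0.1, and a zero gradient leaves the pixel unchanged.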

**c)** Complete the `create_adversarials()` function. It should perform the following steps: 

- Calculate the gradient of the loss with respect to the input image using the `get_gradient()` function.
- Create a list of adversarial examples by perturbing the input image for each epsilon value in `eps_values` using the `perturb_image()` function.
- Return the list of adversarial examples. 

**Solution:**

def create_adversarials(model, image, eps_values):
    """
    Creates adversarial examples for the given image and model using the Fast-Gradient-Sign-Method.

    Parameters:
        model (ImageNetClassifier, torch.nn.Module):
            The image classification model. This is a torch.nn.Module, so you can call its forward method using `model()`.
            Also has a `.predict` method that returns the index of the predicted label.

        image (torch.tensor): The image to generate adversarial examples from.

        eps_values (List[float]): The list of epsilon values for which to generate adversarial examples.

    Returns:
        adversarials (List[torch.tensor]): A list containing one adversarial example for each eps value in eps_values.
    """

    
    return adversarials # return a list of adversarial examples

**d)** Familiarize yourself with the `plot_adversarials()` function, which is used to plot the original image and the adversarial images in a single row.

**Solution:**

def plot_adversarials(model, image, adv_images, eps_values):
    """
    Plots the original image and the adversarial images in a single row.
    Uses the eps value and the predicted label as axis titles.

    Parameters:
        model (ImageNetClassifier, torch.nn.Module):
            The image classification model. This is a torch.nn.Module, so you can call its forward method using `model()`.
            Also has a `.predict` method that returns the index of the predicted label.

        image (torch.tensor): The original image corresponding to the adversarial examples.

        adv_images (List[torch.tensor]): A list containing the adversarial examples to visualize.

        eps_values (List[float]): The list of epsilon values corresponding to each adversarial example in adv_images.

    Hint: 
        - matplotlib expects a channels last format
        - The model works with normalized images. Before visualizing the images, you have to invert the normalization
        using `unnormalize_image()`
    """

    # Create one row of subplots: the original image plus one axis per adversarial image.
    fig, axes = plt.subplots(1, len(adv_images) + 1)
    # Predict the original and all adversarial images in a single batch.
    predictions = model.predict(torch.cat([image] + adv_images))

    # Undo the normalization, drop the batch dimension, move channels last, and convert to NumPy.
    axes[0].imshow(unnormalize_image(image).squeeze().permute(1, 2, 0).detach().numpy())
    axes[0].axis('off')  # hide ticks and axis labels
    axes[0].set_title(f'eps: {0.0}\npred: {convert_idx_to_label(predictions[0].item())}')
    for i, ax in enumerate(axes[1:]):  # the remaining axes hold the adversarial images
        ax.imshow(unnormalize_image(adv_images[i]).squeeze().permute(1, 2, 0).detach().numpy())
        ax.axis('off')
        ax.set_title(f'eps: {eps_values[i]}\npred: {convert_idx_to_label(predictions[i+1].item())}')
    plt.tight_layout()  # adjust the spacing once all images are drawn

**e)** Now we load and preprocess a test image for which adversarial examples will be created. The `ImageNetClassifier()` from the custom `cnn_classifier` module is initialized as the prediction model. Perform the following steps:

- Make a prediction for the preprocessed and normalized test input image using the `model.predict()` method. Set `return_probs=True` and return both the predicted class (`y_pred`) and the predicted class probability (`y_prob`)
- Print the predicted class label by passing `y_pred.item()` to the `convert_idx_to_label()` function
- Print the predicted class probability `y_prob.item()`
- Check if the predicted class index (`y_pred`) matches the expected class index of 13 (`torch.tensor([13])`), the true label of the input image

**Solution:**

# Load and preprocess the test image using the custom modules
image = load_test_image()
image_preprocessed = preprocess_image(image)
image_preprocessed_norm = normalize_image(image_preprocessed).unsqueeze(0)

# Initialize the model
model = ImageNetClassifier()

**f)** Generate five adversarial images by calling the `create_adversarials()` function with the preprocessed and normalized image and the `ImageNetClassifier()` as the input model. Then plot the adversarial examples using the `plot_adversarials()` function.

**g)** What is the difference between counterfactuals and adversarial examples? 

<hr style="border:1.5px solid gray"> </hr>