# Task Description
Welcome! Your task is to develop a machine learning model to predict house prices based on a dataset provided to you. The dataset contains several features, and your objective is to optimize the model's prediction performance for this task. This challenge is part of a broader effort to identify innovative solutions to data-driven problems.

Your model will be evaluated on a holdout dataset.
Performance is measured with the F1 score.
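
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for binary labels. The helper name `f1_binary` and the toy labels are illustrative, and it assumes that class `1` is treated as the positive class; the exact scoring setup used for the holdout evaluation is not specified here.

```python
def f1_binary(y_true, y_pred, positive=1):
    """Return the F1 score for binary labels (harmonic mean of precision and recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: precision and recall are both zero
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: one positive example is missed, no false alarms
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```

Here precision is 1 and recall is 2/3, so the score comes out at 0.8. In practice a library routine such as scikit-learn's `f1_score` would typically be used instead of a hand-written helper.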

# **Predetermined explanation function**
To support you in model development, we provide a predetermined model explanation function called Model_Explainer(). This function provides you with tools to investigate the influence of variables on model predictions. You can visualize the ranking of the most important variables as well as the learned patterns.

The function allows you to create three different visualizations:

1.   The **bar plot** shows the ranking of the most important variables, i.e. which variables contribute the most to the model's predictions.

2.   The **beeswarm plot** visualizes the influence of each variable in greater detail. Each point represents an observation from the test data set. The x-axis shows the (positive or negative) influence of the variable on the prediction. The color shows the value of the respective variable. Thus, red points on the left side of the x-axis represent test data points in which high values of the variable had a negative influence on the prediction. Further explanations are available [here](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/beeswarm.html).

3.   The **interaction plot** shows how the influence of a variable depends on the value of another variable. A detailed explanation of this plot can be found [here](https://shap.readthedocs.io/en/latest/example_notebooks/api_examples/plots/scatter.html).
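
The ranking in the bar plot corresponds to the mean absolute SHAP value per variable, which is also how `Model_Explainer()` builds its summary table. A minimal sketch with made-up SHAP values (the numbers and the three-feature subset are purely illustrative):

```python
import numpy as np

feature_names = ["construction year", "elevator", "balcony"]

# Toy SHAP matrix: one row per observation, one column per feature (made-up values)
shap_values = np.array([
    [0.30, -0.05, 0.10],
    [-0.20, 0.02, 0.15],
    [0.40, -0.01, -0.05],
])

# Mean absolute contribution per feature; the sign is ignored for the ranking
mean_abs = np.abs(shap_values).mean(axis=0)

# Sort features from most to least influential
ranking = sorted(zip(feature_names, mean_abs), key=lambda pair: pair[1], reverse=True)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```

Because the absolute value is taken before averaging, a variable that pushes predictions strongly in both directions still ranks high, even if its positive and negative effects would cancel out in a plain average.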



You will find the function in the **next cell**. You only have to execute the cell, then you can use the function as you wish.

The Model_Explainer() function takes three arguments:

- *model* is the trained machine learning model
- *X_train* is the training data set (excluding the target variable)
- *X_test* is the test data set (excluding the target variable), i.e. the data set with which you evaluate the performance of your model

Example application of the function:

`Model_Explainer(model = ML_classifier, X_train = X_train, X_test = X_test)`
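
The arguments above assume that the data has already been split into a training part and a separate test part, with the target column removed from both. A minimal sketch of such a split using pandas only; the toy rows, the 80/20 ratio, and the variable names are illustrative assumptions, and `price` is taken to be the target column:

```python
import pandas as pd

# Toy data with the same kind of columns as the task dataset (values are made up)
df = pd.DataFrame({
    "balcony": [False, True, True, False, True, False, True, False, True, False],
    "nmbr of rooms": [3.0, 4.0, 2.0, 2.0, 3.0, 1.0, 5.0, 2.0, 3.0, 4.0],
    "price": [0, 0, 0, 0, 1, 0, 1, 0, 1, 1],
})

# Hold out 20% of the rows for evaluation; fix the seed for reproducibility
train_df = df.sample(frac=0.8, random_state=42)
test_df = df.drop(train_df.index)

# Separate the features from the target variable
X_train, y_train = train_df.drop(columns="price"), train_df["price"]
X_test, y_test = test_df.drop(columns="price"), test_df["price"]

print(X_train.shape, X_test.shape)
```

`X_train` and `X_test` can then be passed to `Model_Explainer()` together with a model fitted on `X_train` and `y_train`.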

In [None]:
#@title Import necessary libraries
import numpy as np
import pandas as pd
import shap

import matplotlib.pyplot as plt
from matplotlib.widgets import Button
import ipywidgets as widgets
from IPython.display import display
from IPython.display import Markdown
import PIL
import io

import pathlib
import textwrap
import google.generativeai as genai
import os

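# Read the API key from the environment rather than hardcoding it in the notebook.
# In Colab, `userdata.get('GOOGLE_API_KEY')` from `google.colab` can be used instead.
GOOGLE_API_KEY = os.environ.get("GOOGLE_API_KEY")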

In [None]:
#@title Model explanation function classification

### Classification tasks

import shap
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, HTML
import IPython


def to_markdown(text):
  text = text.replace('•', '  *')
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))


genai.configure(api_key=GOOGLE_API_KEY)
llm = genai.GenerativeModel('gemini-2.0-flash')


class SHAPstory():
  """
  A class to generate SHAPstories, narratives that explain AI predictions based on SHAP values.

  Attributes:
  -----------
  feature_desc_df : DataFrame
      A DataFrame containing descriptions for each feature.
  dataset_description : str
      A brief description of the dataset.
  shap_feature_df : DataFrame
      A DataFrame containing the average SHAP values per features
  """

  def __init__(self, feature_desc,
               dataset_description,
               shap_feature_df,
               llm=None):

    """Initializes the SHAPstory class with necessary parameters."""
    self.feature_desc = feature_desc
    self.dataset_description = dataset_description
    self.shap_feature_df = shap_feature_df

    # Always set the attribute so that later access does not raise AttributeError
    self.llm = llm
    if llm is None:
      print("No language model provided. You will only be able to generate prompts.")

  def generate_prompt(self):
    """
    Generates the prompt for the provided LLM to generate a narrative.

    """

    feature_names = self.shap_feature_df["Variable Name"].tolist()

    #row_values = self.shap_feature_df.iloc[iloc_pos]
    #feature_values = row_values[[name for name in feature_names]]
    #shap_values = row_values[[name + " SHAP Value" for name in feature_names]]

    prompt_string = f"""
    An AI model has been utilized to predict apartment prices based on a specific dataset described as follows: {self.dataset_description}. The target variable represents the price category of the apartments, categorized as '1' for prices above the median and '0' for prices below.

    The primary objective of utilizing SHAP (SHapley Additive exPlanations) in this analysis is to elucidate the model's decision-making process. SHAP values, rooted in coalitional game theory, offer a quantitative measure of each feature's contribution to the model’s predictions, revealing which features significantly sway the outcome, either positively or negatively.

    Attached to this prompt are the SHAP barplot image (which displays the mean absolute SHAP value per feature) and the beeplot image (which shows the dispersion of SHAP values for each feature).

    Could you construct a coherent narrative that explains how the model arrives at its predictions, focusing on the features with the highest absolute SHAP values? Incorporate these features naturally into your explanation, highlighting how they interact and influence the model’s decision-making process. Your narrative should include:

    Overview: A brief overview of the three most significant features.
    Deeper insights: Using the beeplot, explain the pattern and effect of these key features on the predictions.
    Summary: Conclude with your insights on the most compelling aspects of how these features drive the model's behavior.
    Please structure your response with clear paragraph breaks and titles to enhance readability, and limit your explanation to approximately eight sentences.

    Table containing feature values and SHAP values:
    {self.shap_feature_df}

    """

    return prompt_string


  def generate_response(self, prompt):

    response = self.llm.generate_content(prompt)
    markdown_response = to_markdown(response.text)
    display(markdown_response)

    return markdown_response

# Configure the Jupyter display environment for interactivity
def configure_jupyter_display():
    display(HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))

# Main function to display SHAP plots with controlled replacement of old plots
def Model_Explainer(model, X_train, X_test):
    configure_jupyter_display()

    # Limit the dataset for SHAP calculation to enhance performance
    if len(X_test) > 200:
        X_test = X_test.sample(n=200, random_state=42)

    # Initialize SHAP explainer and calculate SHAP values
    try:
        # Use TreeExplainer for tree-based models for better performance
        explainer = shap.TreeExplainer(model)
        shap_values = explainer.shap_values(X_test)
    except Exception as e:
        print(f"Warning: TreeExplainer failed with error: {e}")
        print("Using LinearExplainer instead.")
        # Fall back to LinearExplainer with the training data as background
        explainer = shap.LinearExplainer(model, X_train)
        shap_values = explainer.shap_values(X_test)

    # Reduce per-class SHAP output to the positive class (usually index 1)
    if isinstance(shap_values, list):
        # Older SHAP versions return one (n_samples, n_features) array per class
        shap_values = shap_values[1]
    elif shap_values.ndim > 2:
        # Newer SHAP versions return an array of shape (n_samples, n_features, n_classes)
        shap_values = shap_values[:, :, 1]

    #print("Shape of SHAP values:", shap_values.shape)

    # convert the shap values np array into dataframe with avg values
    feature_names = X_test.columns.tolist()
    shap_df = pd.DataFrame(shap_values, columns=feature_names)
    average_shap_values = np.mean(np.abs(shap_values), axis=0)

    shap_feature_df = pd.DataFrame({
        'Variable Name': feature_names,
        'Average SHAP Value': average_shap_values
    })

    shap_feature_df = shap_feature_df.sort_values(by='Average SHAP Value', ascending=False)

    # Print or return the DataFrame
    #print(shap_feature_df)


    ##########################################################################
    ##### Create, store, and display the plots

    # Set a default style (like 'ggplot') for grid-like visuals
    plt.style.use('ggplot')  # You can also use 'bmh' or 'classic'

    # Set global plot size and other parameters
    plt.rcParams.update({
        "figure.figsize": (10, 6),  # Set default figure size
        "figure.dpi": 100,          # Set default DPI for clarity
        "axes.titlesize": 16,
        "axes.labelsize": 14,
        "xtick.labelsize": 12,
        "ytick.labelsize": 12,
        "legend.fontsize": 12,
    })

    # Pre-generate plots and save them
    plt.figure()
    shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
    plt.savefig('/content/bar_plot.png')
    plt.close()

    plt.figure()
    shap.summary_plot(shap_values, X_test, show=False)
    plt.savefig('/content/summary_plot.png')
    plt.close()

    # Assume first feature for initial interaction plot
    plt.figure()
    shap.dependence_plot(feature_names[0], shap_values, X_test, show=False)
    plt.savefig('/content/interaction_plot.png')
    plt.close()

    # Output widgets
    output_bar = widgets.Output()
    output_summary = widgets.Output()
    output_interaction = widgets.Output()

    # Helper functions to display each plot type
    def show_bar_plot(_=None):
        with output_bar:
            output_bar.clear_output()
            display(IPython.display.Image(filename='/content/bar_plot.png'))

    def show_summary_plot(_=None):
        with output_summary:
            output_summary.clear_output()
            display(IPython.display.Image(filename='/content/summary_plot.png'))

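    def show_interaction_plot(_=None):
        with output_interaction:
            output_interaction.clear_output()
            # Redraw the plot for the feature currently selected in the dropdown
            shap.dependence_plot(feature_dropdown.value, shap_values, X_test, show=False)
            plt.savefig('/content/interaction_plot.png')
            plt.close()
            display(IPython.display.Image(filename='/content/interaction_plot.png'))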


    # Set up widgets for interactivity
    button_bar = widgets.Button(description="Show Bar Plot")
    button_summary = widgets.Button(description="Show Summary Plot")
    button_interaction = widgets.Button(description="Show Interaction Plot")
    feature_dropdown = widgets.Dropdown(options=feature_names, description="Feature:")

    button_bar.on_click(show_bar_plot)
    button_summary.on_click(show_summary_plot)
    button_interaction.on_click(lambda _: show_interaction_plot())

    # Arrange buttons and outputs
    button_box = widgets.HBox([button_bar, button_summary, button_interaction, feature_dropdown])
    plot_box = widgets.HBox([output_bar, output_summary, output_interaction])

    # Arrange layout with VBox for a structured display
    interaction_box = widgets.VBox([
        widgets.HTML("<h3>SHAP Plot Selector</h3>"),
        button_box,
        plot_box
    ])

    # Display the organized layout
    display(interaction_box)

    ################################# Preparation of the LLM Narrative

    feature_desc_df = {
        "Variable Name": [
            "construction year", "elevator", "Anteil Gruenenwaehler", "floor (storey)",
            "nmbr of rooms", "unemployment", "basement", "garden", "balcony"
        ],
        "Description": [
            "The year in which the building was constructed.",
            "Indicates whether the building has an elevator (yes/no).",
            "The proportion of Green Party voters in the area, possibly a proxy for neighborhood characteristics.",
            "The floor number of the apartment in the building.",
            "The number of rooms in the apartment.",
            "Unemployment rate in the neighborhood, possibly affecting property demand and pricing.",
            "Indicates whether the building has a basement (yes/no).",
            "Indicates whether the apartment has access to a garden (yes/no).",
            "Indicates whether the apartment has a balcony (yes/no)."
        ]
    }


    ############# dataset_description
    dataset_description ="""
    The task involves predicting apartment prices based on a proprietary dataset containing various features such as the number of rooms or construction year of the apartment.
    We compiled the dataset by scraping 5090 apartment listings from a large online platform, focusing on the seven largest cities in Germany during 2022.
    The dataset contains different features, such as the listing price per square meter, construction year, and the presence of balconies and basements.
    We further augmented the data with third-party information, namely the percentage of Green Party voters and unemployment rates.
    """

    ######### load the saved plot images for the LLM prompt
    barplot_image = PIL.Image.open('/content/bar_plot.png')
    beeplot_image = PIL.Image.open('/content/summary_plot.png')

    #### generate the narrative
    narrative = SHAPstory(feature_desc_df,
                          dataset_description,
                          shap_feature_df,
                          llm)
    prompt = narrative.generate_prompt()

    # Send the prompt together with both plots it refers to
    narrative.generate_response([prompt, barplot_image, beeplot_image])

In [None]:
#@title Load Data
url = "https://raw.githubusercontent.com/caradamm/XAI_HousePricePrediction/main/data/data.csv"
df = pd.read_csv(url)
df.head()


Unnamed: 0,garden,basement,elevator,balcony,floor (storey),nmbr of rooms,construction year,unemployment,Anteil Gruenenwaehler,price
0,False,False,False,False,2.0,3.0,1910.0,2,1,0
1,False,True,False,True,1.0,4.0,1900.0,1,3,0
2,False,True,False,True,2.0,2.0,1971.0,3,2,0
3,False,False,False,False,4.0,2.0,1910.0,2,1,0
4,False,False,False,False,3.0,3.0,1910.0,2,1,0


#Your code goes here

In [None]:
#@title Call model explanation function

Model_Explainer(model = "...",
                X_train = "...",
                X_test = "...")

In [None]:
# Save classifier as pickle object
import pickle

with open('model.pkl', 'wb') as file:
    pickle.dump("...", file) # replace "..." with classifier object