In this assignment, you will use the `LIME` library to produce local explanations, based on surrogate models, for the predictions of a Random Forest classifier.

In [None]:
# Install LIME library (uncomment plotly installation if needed)
#pip install plotly
!pip install lime


In [None]:
# Import libraries for data handling, modeling, and visualization
import pandas as pd
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import warnings
# Set random seed for reproducible results
np.random.seed(0)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import matplotlib
# Configure default figure size
matplotlib.rcParams['figure.figsize'] = [10, 7]

# warnings.filterwarnings('ignore')
import lime
import lime.lime_tabular

# Load the dataset of songs
df_data = pd.read_csv('./music.xls')
df_data.head()



Question 1.1: Create a target that labels an artist as popular when `artist.familiarity` is greater than 0.8 and `artist.hotttnesss` is greater than 0.6.

In [None]:
# Create target label based on artist familiarity and hotttness thresholds
df_data['class'] = np.where((df_data['artist.familiarity'] > 0.8) & (df_data['artist.hotttnesss'] > 0.6),
                           'popular', 'not_popular')

# Use GroupBy on class and count artist.id
df_data.groupby('class')['artist.id'].count()
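The thresholds in Question 1.1 are strict inequalities, so a row sitting exactly on a boundary is labeled `not_popular`. A minimal sketch on a made-up four-row frame (the `toy` values are illustrative assumptions, not taken from the dataset) shows how the combined mask behaves:

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, chosen to sit on and around the two thresholds
toy = pd.DataFrame({
    'artist.familiarity': [0.90, 0.85, 0.80, 0.95],
    'artist.hotttnesss':  [0.70, 0.55, 0.90, 0.60],
})

# Both conditions must hold; each comparison is strict (>), not >=
is_popular = (toy['artist.familiarity'] > 0.8) & (toy['artist.hotttnesss'] > 0.6)
toy['class'] = np.where(is_popular, 'popular', 'not_popular')

# Class counts reveal how imbalanced the resulting target is
print(toy['class'].value_counts())
```

On the real data, the same `value_counts()` call is a quick way to see how rare the `popular` class is before training.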


Question 1.2: Train a Random Forest Classifier with 100 estimators considering these variables:
* vars_keep = ['song.bars_confidence', 'song.bars_start', 'song.beats_confidence', 'song.beats_start', 'song.duration', 'song.end_of_fade_in', 'song.hotttnesss', 'song.key_confidence', 'song.loudness', 'song.mode', 'song.mode_confidence', 'song.start_of_fade_out', 'song.tatums_confidence', 'song.tatums_start', 'song.tempo', 'song.time_signature', 'song.time_signature_confidence']


In [None]:
# Features we will use to train the Random Forest
vars_keep = ['song.bars_confidence', 'song.bars_start', 'song.beats_confidence', 'song.beats_start',
             'song.duration', 'song.end_of_fade_in', 'song.hotttnesss', 'song.key_confidence',
             'song.loudness', 'song.mode', 'song.mode_confidence', 'song.start_of_fade_out',
             'song.tatums_confidence', 'song.tatums_start', 'song.tempo', 'song.time_signature',
             'song.time_signature_confidence']

# Separate features (X) and target (y)
X = df_data[vars_keep]
y = df_data['class']
# Split the dataset into training and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print('Training set size - X_train: {} '.format(X_train.shape))
print('Test set size - X_test: {} '.format(X_test.shape))

# Fit a Random Forest with 100 trees; oob_score=True also computes the out-of-bag accuracy
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, Y_train)
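The cell above requests `oob_score=True` but never reads the result. As a hedged sketch, the snippet below fits the same kind of model on synthetic data (the `make_classification` data and all `toy_*` names are assumptions for illustration) and compares the out-of-bag estimate with held-out accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the song features (not the assignment data)
X_toy, y_toy = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

toy_rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
toy_rf.fit(X_tr, y_tr)

# oob_score_ is the accuracy on samples each tree did not see during bootstrapping
oob_accuracy = toy_rf.oob_score_
# score() returns mean accuracy on the held-out split
test_accuracy = toy_rf.score(X_te, y_te)
print('OOB accuracy: {:.3f}, test accuracy: {:.3f}'.format(oob_accuracy, test_accuracy))
```

In the notebook, the equivalent check would be `rf.oob_score_` and `rf.score(X_test, Y_test)`. With an imbalanced target, accuracy alone can look high even when the minority class is predicted poorly.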


Question 2.1: Initialize the LIME explainer. Set the following arguments: `feature_names`, `class_names`, `verbose`, `discretize_continuous`, and `mode`. Note that the order of `class_names` matters: it must match the order of the classes in the model's probability output (`rf.classes_`).

In [None]:
# Initialize the LIME explainer
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train.values,
    feature_names=vars_keep,
    class_names=['not_popular', 'popular'],
    verbose=True,
    discretize_continuous=True,
    mode='classification'
)
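To see why `['not_popular', 'popular']` is the right order, the sketch below fits a small classifier on made-up data with the same two string labels and reads back `classes_` (the data and the `demo_*` names are hypothetical). scikit-learn stores class labels in sorted order, and the columns of `predict_proba` follow that order:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up data with the same two string labels used in the assignment
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = np.where(X_demo[:, 0] > 0, 'popular', 'not_popular')

demo_rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# classes_ gives the column order of predict_proba; pass the same order to class_names
class_order = [str(c) for c in demo_rf.classes_]
print(class_order)
```

Passing `class_names=list(rf.classes_)` to the explainer would avoid hard-coding the order.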


For this assignment, you need to visit the [documentation](https://github.com/marcotcr/lime) for `LIME` and find out how you can pass an instance to get a local explanation and produce some visualizations. Since we are using `LimeTabularExplainer`, you can focus on that in the documentation (example notebooks are provided).

Question 2.2: Choose an instance from the test data, and obtain explanations for it. The explanations should include no more than 5 features (the top 5).

In [None]:
# Choose an instance from the test data
instance_num = 0

# Obtain a local explanation for the instance using up to 5 features
local_exp = explainer.explain_instance(X_test.iloc[instance_num].values, rf.predict_proba, num_features=5)
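For intuition about what a local surrogate explanation computes, here is a simplified sketch of the idea: perturb the instance, query the black-box model, weight the samples by proximity, and fit a weighted linear model. This is not LIME's actual implementation (LIME also discretizes features, scales the data, and selects a subset of features); `black_box_proba`, `local_surrogate`, and the kernel width are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

def black_box_proba(X):
    """Hypothetical stand-in for model.predict_proba(X)[:, 1]."""
    z = 3.0 * X[:, 0] - 2.0 * X[:, 1]  # feature 2 is deliberately ignored
    return 1.0 / (1.0 + np.exp(-z))

def local_surrogate(instance, predict_fn, num_samples=2000, kernel_width=0.75, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Perturb: sample points in a neighborhood of the instance
    samples = instance + rng.normal(0.0, 1.0, size=(num_samples, instance.size))
    # 2. Query the black-box model on the perturbed points
    preds = predict_fn(samples)
    # 3. Weight each sample by its proximity to the instance
    dist = np.linalg.norm(samples - instance, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)
    # 4. Fit an interpretable (linear) model on the weighted samples
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(samples, preds, sample_weight=weights)
    return surrogate.coef_

instance = np.array([0.2, -0.1, 0.5])
coefs = local_surrogate(instance, black_box_proba)
print(coefs)  # one local weight per feature
```

The signs of the coefficients indicate whether a feature pushes the prediction up or down near this instance, which is the same reading applied to the bars in a LIME plot.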


Question 2.3: Produce a feature importance plot for the explanation. HINT: `LIME` has a method for this. You only need to call it. <span style="color:red" float:right>; # you need the semicolon, otherwise two duplicate plots are produced</span>

In [None]:
# Produce a feature importance plot for the explanation
local_exp.as_pyplot_figure();


In [None]:
# Display the explanation as a list of feature contributions
local_exp.as_list()


Question 3: Call the `show_in_notebook` method to show a summary of the explanation. Set `show_table = True` and `show_all = True`.

In [None]:
# Show a summary of the explanation inside the notebook
local_exp.show_in_notebook(show_table = True, show_all = True)


Question 4: Interpret the results shown by calling `show_in_notebook`. Confirm that the predicted probability shown on the left matches the predicted probability we get by calling the model directly on the instance.

In [None]:
# Confirm that the predicted probability matches the model output
# Double brackets keep a one-row DataFrame, so the feature names seen during fit are preserved
rf.predict_proba(X_test.iloc[[instance_num]])
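When comparing against the probabilities shown by `show_in_notebook`, it helps to pair each probability with its class label rather than rely on column position. A minimal sketch on made-up data (the `demo_*` names and feature columns are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up feature frame and labels, only to illustrate the check
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(60, 3)), columns=['f0', 'f1', 'f2'])
demo_y = np.where(demo_X['f0'] + demo_X['f1'] > 0, 'popular', 'not_popular')
demo_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(demo_X, demo_y)

# Select one row as a DataFrame so the column names match those used in fit
row = demo_X.iloc[[0]]
proba = demo_model.predict_proba(row)[0]

# Pair each probability with its label using the model's own class order
proba_by_class = {str(c): float(p) for c, p in zip(demo_model.classes_, proba)}
print(proba_by_class)
```

The two probabilities should sum to 1, and the value for `popular` is the one to compare with the figure LIME reports for that class.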


[Bonus] Question 5: Repeat the above steps with a Support Vector Machine classifier. What conclusions do you draw about model explainability?

In [None]:
# Repeat the explanation using a Support Vector Machine classifier
from sklearn.svm import SVC
svm = SVC(probability=True, random_state=123456)
svm.fit(X_train, Y_train)
svm_exp = explainer.explain_instance(
    X_test.iloc[instance_num].values,
    svm.predict_proba,
    num_features=5
)
svm_exp.as_pyplot_figure();
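One caveat when comparing the two models: an RBF-kernel SVM is sensitive to feature scale, while a Random Forest is not. A possible variation, sketched here on synthetic data (the data and the `scaled_svm` name are assumptions, not part of the assignment), wraps the scaler and the SVM in a pipeline so that a single `predict_proba` accepts unscaled rows:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with one feature on a much larger scale than the others
X_svm, y_svm = make_classification(n_samples=200, n_features=5, random_state=0)
X_svm[:, 0] *= 1000.0

# The pipeline standardizes inputs before they reach the SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
scaled_svm.fit(X_svm, y_svm)

# The pipeline's predict_proba takes raw (unscaled) rows
svm_proba = scaled_svm.predict_proba(X_svm[:3])
print(svm_proba)
```

Because the pipeline's `predict_proba` takes raw feature values, it could be passed to `explainer.explain_instance` in place of `svm.predict_proba`.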


Question 6: Create a new text cell in your notebook and write a 50-100 word summary of your experience with this assignment (or a short description of how you applied this week's learning to the solution). Include: what your incoming experience with this model was, if any; what steps you took; what obstacles you encountered; and how you link this exercise to real-world machine learning problem-solving (what steps were missing? what else do you need to learn?). This summary lets your instructor know how you are doing and allot points for your effort in thinking, planning, and making connections to real-world work.

This assignment introduced LIME for explaining model predictions. I built a random forest to classify popular songs and explored feature contributions for a specific example. Comparing explanations with a support vector machine highlighted how different models focus on different attributes. Working through the steps clarified how local explanations can increase trust in machine learning results.