# Interpreting Existing Models: Examples

### These are Python imports. Imports are great because they let us easily bring in code that other people have already written

In this case, we're bringing in pandas and numpy, libraries for working with tabular data and number crunching, plus the scikit-learn library (sklearn) to help us build a model.

We're also bringing in the mondobrain Python package, along with shap and lime, to help us explain the data and the model.

In [None]:
# Importing necessary libraries
import pandas as pd
from pandas.api.types import is_numeric_dtype
import numpy as np

import mondobrain as mb

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, accuracy_score, confusion_matrix

import shap
import lime.lime_tabular

import warnings
warnings.filterwarnings('ignore')

#### We'll start by importing our dataset below. Unlike last time, we'll load it from an external data file

The GitHub repository that contains all the code for this exercise includes both a CSV file and an Excel file: https://github.com/datawhys/demo-xavier-ai-summit


In [None]:
df = pd.read_csv('https://github.com/datawhys/demo-xavier-ai-summit/blob/main/asthma_data.csv?raw=true')

#### One thing we almost always have to do before using a machine learning library is data cleaning.

Many machine learning libraries can't handle missing values, so we need to treat them first. There are several strategies to choose from. To start, we have to find every column that has a null value.

In [None]:
column_has_nulls = np.any(df.isna(), axis=0).values
df.columns[column_has_nulls]

# Now we find all of the numeric columns, and then categorical columns
# We can't use the same approach to treat missing values in both the categorical and numeric columns
column_is_numeric = np.array([is_numeric_dtype(v) for v in df.dtypes])
categorical_cols = df.columns[~column_is_numeric]; categorical_cols

# Let's split those into numeric columns with missing values, and categorical columns with missing values
numeric_cols_with_nulls = df.columns[column_has_nulls & column_is_numeric]
non_numeric_cols_with_nulls = df.columns[column_has_nulls & ~column_is_numeric]

#### Now we are going to fill those missing values

In [None]:
# first we'll make a copy of our original dataframe:
df_imputed_nums = df.copy()

# then, for each numeric column that has nulls, we'll fill those nulls with that column's mean
df_imputed_nums[numeric_cols_with_nulls] = df_imputed_nums[numeric_cols_with_nulls].fillna(value=df[numeric_cols_with_nulls].mean())

### Now we have a new dataframe in which every missing value in a numeric column has been replaced with that column's mean.

Note: It turns out that this dataset doesn't actually have any missing values in categorical columns. If it did, we could handle them another way, perhaps by eliminating the column, by randomly assigning values drawn from the column's observed distribution, or by introducing an extra class, like 'missing'.
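
As a rough illustration of the last option, here is a minimal sketch that treats missingness as its own category. The `Smoker` column and its values are hypothetical and are not part of the asthma dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the asthma dataset itself has no categorical nulls
toy = pd.DataFrame({"Smoker": ["Yes", np.nan, "No", "Yes", np.nan]})

# Treat missingness as its own class instead of guessing a value
toy_filled = toy.fillna({"Smoker": "missing"})

print(toy_filled["Smoker"].value_counts())
```

After one-hot encoding, 'missing' would simply become one more indicator column, so the model can learn whether the absence of a value is itself informative.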

In [None]:
# let's take a look at our dataframe:

df_imputed_nums.head()

In [None]:
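# We also need to represent the categorical variables as numbers. To do that, we'll one-hot encode them

# scikit-learn's OneHotEncoder is one way to do this (its result is not used further below)
ohe = OneHotEncoder(categories='auto')
cat_feats_encoded = ohe.fit_transform(df_imputed_nums[categorical_cols])

# pandas' get_dummies does the same job and gives us a dataframe with readable column names
df_prepared = pd.get_dummies(df_imputed_nums, columns=categorical_cols)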
df_prepared

The get_dummies method creates one-hot encoded variables for us, but we don't need both 'Asthma_1. No Asthma' and 'Asthma_2. Asthma'. In fact, keeping both would be a big problem: each column is the exact opposite of the other, so the model would just learn the rule that someone has asthma whenever 'Asthma_1. No Asthma' is false. To avoid that, we're going to delete 'Asthma_1. No Asthma'.

In [None]:
del df_prepared['Asthma_1. No Asthma']
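
As an aside, pandas can drop the redundant column at encoding time. The sketch below uses a hypothetical three-row frame rather than the real data. Note that `drop_first=True` drops the first level of every encoded column, not just the outcome, which is a different choice from the manual `del` used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame that mimics the outcome column's labels
toy = pd.DataFrame({
    "Asthma": ["1. No Asthma", "2. Asthma", "2. Asthma"],
    "Age": [34, 51, 47],
})

# drop_first=True removes the first level of each encoded column,
# so only 'Asthma_2. Asthma' remains for the outcome
encoded = pd.get_dummies(toy, columns=["Asthma"], drop_first=True)

print(encoded.columns.tolist())
```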

In [None]:
df_prepared

In [None]:
# The convention for when we train a model is to use y for the dependent variable, and X for the independent variables
# We're going to prepare that now

y = df_prepared[['Asthma_2. Asthma']]
X = df_prepared[[c for c in df_prepared.columns if c != 'Asthma_2. Asthma']]

Remember from yesterday that we said we should always test on a different set of data than we train on? We're going to use the train_test_split function to do that split now:

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 21)
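
For reference, `train_test_split` holds out 25% of the rows by default. The sketch below, on hypothetical toy arrays, also passes the optional `stratify` argument, which keeps the class balance the same in both halves. That option is not used in this notebook; it is shown only as something to consider when one class is rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, 12 negatives and 8 positives
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)

# The default test_size is 0.25; stratify preserves the 60/40 class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=21, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```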

In [None]:
rfc = RandomForestClassifier(max_depth = 10,
                             min_samples_leaf = 2,
                             min_samples_split = 2,
                             n_estimators = 10)

In [None]:
rfc.fit(X_train, y_train)

In [None]:
pipeline_preds = rfc.predict(X_test)

test_accuracy = accuracy_score(y_test, pipeline_preds)
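# ROC AUC is a ranking metric, so score it on the predicted probability of the positive class
test_roc_auc = roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])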
test_confusion_matrix = confusion_matrix(y_test, pipeline_preds)

print(f'Accuracy Score: {test_accuracy}')
print(f'ROC AUC Score: {test_roc_auc}')
print(f'Confusion Matrix: \n{test_confusion_matrix}')

In [None]:
df_test = pd.concat([X_test, y_test], axis = 1)

In [None]:
df_test.head()

In [None]:
person_1 = X_test.loc[224]

In [None]:
X_test.loc[61]

In [None]:
# Defining a quick function that returns the model's class probabilities; LIME calls it to explain an instance
predict_rfc_prob = lambda x: rfc.predict_proba(x).astype(float)

In [None]:
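# class_names must follow the column order of predict_proba: class 0 (no asthma) first, then class 1 (asthma)
lime_explainer = lime.lime_tabular.LimeTabularExplainer(X_train.values,
                                                        mode = 'classification',
                                                        feature_names = X_train.columns.tolist(),
                                                        class_names = ['No_Asthma', 'Asthma'],
                                                        random_state=528491,
                                                        )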

In [None]:
# Viewing LIME explainability for person 1
person_1_lime = lime_explainer.explain_instance(person_1.values,
                                                predict_rfc_prob,
                                                num_features = 10)
person_1_lime.show_in_notebook()

In [None]:
shap_explainer = shap.TreeExplainer(rfc)

In [None]:
shap.initjs()

In [None]:
person_1_shap_values = shap_explainer.shap_values(person_1)

In [None]:
shap.force_plot(shap_explainer.expected_value[1], person_1_shap_values[1], person_1)

In [None]:
df

In [None]:
mdf = mb.MondoDataFrame(df)

mb.api_key = 'bDSAWfXXEw0h0WggDoDFg1ghBp5o4Myy'
mb.api_secret = 'umXDHYSUz1oaZJ43XIIBe6ck0XofzTxDgCqmMatt52cmQroghEDI-AMVQn4py2_n'

# Select a column as your outcome column & specify a target class
outcome = mdf["Asthma"]

# Check the classes of your outcome variable
outcome.classes

In [None]:
outcome.target_class = "2. Asthma"

In [None]:
explorable = mdf[[c for c in mdf.columns if c != "Asthma"]]

In [None]:
solver = mb.Solver()

In [None]:
solver.fit(explorable, outcome)

In [None]:
solver.rule

In [None]:
solver.size

In [None]:
solver.score

In [None]:
solver.rule_data

You can also check this out in our interactive dashboard!

To check it out, go to https://demo.mondobrain.com, and log in with:

username: xavier_user
password: xavier_ai_conf