# Project 2: mushroom classification
<br>
Use the SHAP analysis to answer the following questions:
<ol>
<li> For the first prediction, which feature has the most significant contribution?
<li> Overall, which feature has the most significant contributions? 
<li> Which odors are associated with poisonous mushrooms? 
</ol>

<b>Dataset:</b> https://www.kaggle.com/datasets/uciml/mushroom-classification

In [1]:
#imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from catboost import CatBoostClassifier

import shap

from sklearn.metrics import accuracy_score,confusion_matrix



In [2]:
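# load data
data = pd.read_csv("../data/mushrooms.csv")

# target: cat.codes follows sorted category order, so 'e' (edible) -> 0 and 'p' (poisonous) -> 1
y = data['class']
y = y.astype('category').cat.codes

# features: every column except the target
X = data.drop('class', axis=1)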


print(len(data))
data.head()

8124


Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [3]:
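# small, shallow model: 20 trees of depth 3
model = CatBoostClassifier(iterations=20,
                           learning_rate=0.01,
                           depth=3)

# train model; every column is categorical, so pass all column indices as cat_features
cat_features = list(range(len(X.columns)))
model.fit(X, y, cat_features=cat_features)

# get predictions on the training data (no held-out split is used here)
y_pred = model.predict(X)

print(confusion_matrix(y, y_pred))
accuracy_score(y, y_pred)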

0:	learn: 0.6660194	total: 62.6ms	remaining: 1.19s
1:	learn: 0.6373576	total: 71.3ms	remaining: 642ms
2:	learn: 0.6125475	total: 77.9ms	remaining: 441ms
3:	learn: 0.5880232	total: 85.7ms	remaining: 343ms
4:	learn: 0.5652512	total: 89.8ms	remaining: 269ms
5:	learn: 0.5428184	total: 94ms	remaining: 219ms
6:	learn: 0.5222711	total: 98.2ms	remaining: 182ms
7:	learn: 0.5016943	total: 103ms	remaining: 155ms
8:	learn: 0.4779907	total: 108ms	remaining: 131ms
9:	learn: 0.4601680	total: 112ms	remaining: 112ms
10:	learn: 0.4447001	total: 116ms	remaining: 94.7ms
11:	learn: 0.4281286	total: 122ms	remaining: 81.2ms
12:	learn: 0.4125468	total: 127ms	remaining: 68.3ms
13:	learn: 0.3990776	total: 131ms	remaining: 56.1ms
14:	learn: 0.3861614	total: 136ms	remaining: 45.2ms
15:	learn: 0.3724813	total: 140ms	remaining: 35.1ms
16:	learn: 0.3560532	total: 145ms	remaining: 25.5ms
17:	learn: 0.3448969	total: 149ms	remaining: 16.6ms
18:	learn: 0.3331749	total: 155ms	remaining: 8.13ms
19:	learn: 0.3218024	total:

0.9852289512555391

# Standard SHAP values

In [4]:
# get shap values
explainer = shap.Explainer(model)
shap_values = explainer(X)
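
The `shap_values` object is a `shap.Explanation`: `.values` holds the per-feature contributions, `.base_values` the expected model output, and `.data` the feature values. SHAP values are additive, so one row's model output is its base value plus the sum of its contributions. The sketch below shows that arithmetic, assuming the output is in log-odds (the usual case for a tree-based binary classifier, but an assumption here). The helper name `probability_from_shap` and the numbers are made up for illustration and are not results from this model.

```python
import numpy as np

def probability_from_shap(base_value, row_shap_values):
    """Recover a probability from one row of log-odds SHAP values."""
    # Additivity: model output = base value + sum of per-feature contributions
    log_odds = base_value + np.sum(row_shap_values)
    # Sigmoid maps log-odds to a probability for class 1
    return 1.0 / (1.0 + np.exp(-log_odds))

# Made-up numbers for illustration only (not taken from the mushroom model)
example_base = 0.0
example_row = np.array([0.5, -0.2, 0.1])
example_prob = probability_from_shap(example_base, example_row)

# In this notebook the call might look like (assuming log-odds output):
# probability_from_shap(shap_values.base_values[0], shap_values.values[0])
```

With the encoding used in this notebook, class 1 is `p`, so a positive SHAP value pushes a prediction towards poisonous and a negative one towards edible.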

In [None]:
#For the first prediction, which feature has the most significant contribution?
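
One way to approach this question is to look at the first row of SHAP values and pick the feature with the largest absolute contribution. A waterfall plot shows the same information visually. The sketch below uses a hypothetical helper, `top_feature_for_row`, and toy values with placeholder feature names; neither comes from the mushroom model.

```python
import numpy as np

def top_feature_for_row(row_shap_values, feature_names):
    """Return (name, SHAP value) of the feature with the largest absolute contribution."""
    row = np.asarray(row_shap_values, dtype=float)
    idx = int(np.argmax(np.abs(row)))
    return feature_names[idx], float(row[idx])

# Toy values with placeholder names, for illustration only
toy_names = ["f0", "f1", "f2"]
toy_row = [0.05, -0.40, 0.10]
toy_top = top_feature_for_row(toy_row, toy_names)

# In this notebook the equivalent calls might be:
# shap.plots.waterfall(shap_values[0])
# top_feature_for_row(shap_values.values[0], list(X.columns))
```

The sign of the returned value matters as much as its size: it says whether the feature pushed the first prediction towards class 1 or class 0.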

In [None]:
#Overall, which feature has the most significant contributions?
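
A common way to summarise overall importance is the mean absolute SHAP value per feature, which is what `shap.plots.bar` displays by default. The sketch below computes that ranking by hand. The helper `rank_features_by_mean_abs_shap` and the toy matrix are illustrative assumptions, not output from this notebook.

```python
import numpy as np

def rank_features_by_mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean |SHAP| over all rows, largest first."""
    matrix = np.asarray(shap_matrix, dtype=float)
    importance = np.abs(matrix).mean(axis=0)   # one score per feature (column)
    order = np.argsort(importance)[::-1]       # indices sorted by descending score
    return [(feature_names[i], float(importance[i])) for i in order]

# Toy matrix: 2 rows (predictions) x 3 placeholder features, for illustration only
toy_matrix = [[0.1, -0.5, 0.0],
              [-0.2, 0.3, 0.1]]
toy_ranking = rank_features_by_mean_abs_shap(toy_matrix, ["f0", "f1", "f2"])

# In this notebook the equivalent calls might be:
# shap.plots.bar(shap_values)
# shap.plots.beeswarm(shap_values)
# rank_features_by_mean_abs_shap(shap_values.values, list(X.columns))
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling each other out.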

In [None]:
#Which odors are associated with poisonous mushrooms?
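
Because `odor` is categorical, one option is to group the SHAP values of that column by odor code and compare the group means. With this notebook's encoding (`e` -> 0, `p` -> 1), a positive mean would indicate an odor that pushes predictions towards poisonous. The sketch below uses a hypothetical helper, `mean_shap_by_category`, and toy data with placeholder category labels that are not real odor codes.

```python
import numpy as np

def mean_shap_by_category(categories, shap_column):
    """Average the SHAP values of one feature within each category of that feature."""
    categories = np.asarray(categories)
    shap_column = np.asarray(shap_column, dtype=float)
    return {
        str(c): float(shap_column[categories == c].mean())
        for c in np.unique(categories)
    }

# Toy data with placeholder labels, for illustration only
toy_categories = ["u", "v", "u", "w", "v"]
toy_shap = [0.2, -0.1, 0.4, 0.0, -0.3]
toy_means = mean_shap_by_category(toy_categories, toy_shap)

# In this notebook the equivalent call might be:
# odor_idx = list(X.columns).index("odor")
# mean_shap_by_category(X["odor"], shap_values.values[:, odor_idx])
```

Plotting the odor SHAP values against the odor codes, for example with a box plot per code, would show the spread within each group as well as the mean.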