**In this notebook we apply classification techniques to the mushroom data.**

**Besides classifying whether a mushroom is edible or not, the notebook briefly interprets the most relevant features.**

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

In [3]:
data = pd.read_csv("../input/mushrooms.csv")

Let's explore the data:

In [4]:
data.head()

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g


In [None]:
data.describe()

Exploring the data shows that:
1. All variables are categorical.
2. The response variable "class" has only two categories (edible and poisonous).
3. "veil-type" has only one category, so it provides no useful information for classification.
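
One way to confirm the third point is to count the distinct values in each column. The sketch below uses a small made-up frame (`toy` is hypothetical, not the real data) and lists the columns that hold a single value:

```python
import pandas as pd

# Hypothetical toy frame standing in for the mushroom data.
toy = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "veil-type": ["p", "p", "p", "p"],  # one category only
})

# Columns with a single distinct value carry no information for a classifier.
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

Running the same list comprehension on `data` should flag "veil-type", assuming the CSV matches what is described above.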


In [None]:
# Discard "veil-type", which has a single category
data = data.drop(columns=['veil-type'])

In [None]:
# Count missing values (NaN) in each column
data.isnull().sum()

`isnull()` reports no missing values, and since all the variables are categorical there is no need to take outliers into consideration.
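
A caveat: `isnull()` only detects NaN. In the UCI version of this dataset, missing "stalk-root" values are recorded as the string "?". Assuming this CSV follows the same convention, a check along these lines would reveal them. The frame below is a made-up stand-in, not the real data:

```python
import pandas as pd

# Hypothetical toy frame; "?" plays the role of a missing-value placeholder.
toy = pd.DataFrame({
    "stalk-root": ["e", "?", "c", "?"],
    "odor": ["p", "a", "l", "n"],
})

nan_counts = toy.isnull().sum()           # NaN-based check: finds nothing
placeholder_counts = (toy == "?").sum()   # placeholder-based check

print(nan_counts)
print(placeholder_counts)
```

If the placeholder shows up in `data`, `get_dummies` will simply treat "?" as one more category.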

In [None]:
print(data.shape)
data.groupby('class').size()

The data contains 8124 observations, roughly evenly distributed across the two classes of mushrooms, and 21 predictors (this will change once the variables are converted to dummies).

In [None]:
data = pd.get_dummies(data, drop_first=True)
data.head()
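
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up two-column frame (`toy` is hypothetical). The alphabetically first category of each column is dropped and becomes the base level:

```python
import pandas as pd

# Hypothetical toy frame with the same single-letter coding style.
toy = pd.DataFrame({
    "class": ["p", "e", "e"],
    "odor": ["p", "a", "l"],
})

dummies = pd.get_dummies(toy, drop_first=True)

# "class_e" and "odor_a" are dropped, so edible and almond are the base levels.
print(list(dummies.columns))
print(dummies["class_p"].astype(int).tolist())
```

Depending on the pandas version, the dummy columns may be boolean or integer; casting with `astype(int)` gives 0/1 either way.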

After converting all variables to dummies, the response variable (`class_p`) is 1 if the mushroom is poisonous and 0 if it is edible. A "success" therefore means that the mushroom is poisonous.
Now let's separate the response variable from the predictors and create our train and test subsets:

In [None]:
# The first column is the dummy-encoded response; the remaining columns are predictors
y = data.iloc[:, 0]
X = data.iloc[:, 1:]

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

First we'll be using XGBoost to define a classification model and identify important features:

In [None]:
import xgboost
from sklearn import metrics
from xgboost import XGBClassifier

In [None]:
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

In [None]:
# AUC and confusion matrix computed from the predicted class labels
auc_roc_xgb = metrics.roc_auc_score(y_test, y_pred_xgb)
print("auc score = {}".format(auc_roc_xgb))
confusion_matrix_xgb = metrics.confusion_matrix(y_test, y_pred_xgb)
confusion_matrix_xgb
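
As a side note, the AUC above is computed from hard 0/1 predictions. `roc_auc_score` is more commonly given scores or probabilities (for a fitted classifier, the second column of `predict_proba`), and the two can differ. A minimal sketch with made-up labels and scores, not the notebook's data:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true labels and predicted probabilities of the positive class.
y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.2, 0.6, 0.55, 0.7, 0.9])

# Hard labels obtained by thresholding at 0.5, as predict() would typically return.
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = metrics.roc_auc_score(y_true, y_score)
auc_from_labels = metrics.roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)
```

When the classifier separates the classes perfectly, as reported for this dataset, both versions give 1.0, so the distinction matters mainly for weaker models.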

The XGBoost model is able to perfectly classify whether a mushroom is edible or not. Let's look at the most important features in this model:

In [None]:
# Plot the feature importances
xgboost.plot_importance(xgb)

In [None]:
# List the feature importances, from most to least important
df_xgb = pd.DataFrame({'x': np.array(X.columns), 'y': xgb.feature_importances_})
df_xgb.sort_values('y', ascending=False)

The most important features when classifying if a mushroom is edible or not are:
* spore-print-color
* odor
* bruises
* gill-size
* gill-spacing
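
The importances are reported per dummy column, while the list above names the original features. One way to bridge the two is to sum the importances over the dummies of each feature. The sketch below assumes column names of the form `feature_code` (as produced by `get_dummies`) and uses made-up importance values:

```python
import pandas as pd

# Hypothetical importances for a few dummy columns (not the notebook's actual values).
imp = pd.DataFrame({
    "x": ["odor_f", "odor_n", "spore-print-color_r", "gill-size_n"],
    "y": [0.30, 0.25, 0.35, 0.10],
})

# Strip the trailing "_code" to recover the original feature name.
imp["feature"] = imp["x"].str.rsplit("_", n=1).str[0]

# Sum the importances of all dummies that belong to the same feature.
by_feature = imp.groupby("feature")["y"].sum().sort_values(ascending=False)
print(by_feature)
```

Summing is only one possible aggregation; taking the maximum per feature is another reasonable choice.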

**To interpret these features, let's fit a logistic regression to the data. Logit is a linear model, so its coefficients are easier to interpret:**

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logit = LogisticRegression()
logit.fit(X_train, y_train)
y_pred_log = logit.predict(X_test)

In [None]:
# AUC and confusion matrix computed from the predicted class labels
auc_roc_log = metrics.roc_auc_score(y_test, y_pred_log)
print("auc score = {}".format(auc_roc_log))
confusion_matrix_log = metrics.confusion_matrix(y_test, y_pred_log)
confusion_matrix_log

Logit performs almost as well as XGBoost, so its coefficients are a reasonable basis for interpretation.

**The coefficients of the logistic regression can be used to calculate the Odds Ratio of each variable (the exponential of its coefficient). Since all features are categorical, every Odds Ratio is calculated against the base category (the dummy variable that was left out).**
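
As a worked example of what an Odds Ratio means, the sketch below uses a made-up coefficient and a made-up baseline probability (both hypothetical) and shows how the odds and the probability change when the dummy switches from 0 to 1:

```python
import math

# Hypothetical logistic regression coefficient for one dummy variable.
beta = math.log(2.0)

# The odds ratio is the exponential of the coefficient.
odds_ratio = math.exp(beta)

# Hypothetical probability of "poisonous" for the base category.
baseline_prob = 0.2
baseline_odds = baseline_prob / (1 - baseline_prob)

# Switching the dummy from 0 to 1 multiplies the odds by the odds ratio.
new_odds = baseline_odds * odds_ratio
new_prob = new_odds / (1 + new_odds)

print(odds_ratio, baseline_odds, new_odds, new_prob)
```

Here an OR of 2 doubles the odds from 0.25 to 0.5, which moves the probability from 0.2 to about 0.33, holding the other variables fixed.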

In [None]:
df_log = pd.DataFrame({'x': np.array(X.columns), 'beta':logit.coef_[0]})
df_log['OR'] = np.exp(df_log['beta'])
df_log

**Let's check the OR (Odds Ratio) for the most important features found in the XGBoost model.**

**Note that a larger OR means larger odds of a "success", which here means the mushroom being poisonous. An OR above 1 indicates higher odds of being poisonous than the base category, so a bigger OR is an indication of danger.**

In [None]:
df_log[df_log['x'].str.contains("spore-print-color")]

***In this case the base corresponds to spore-print-color = buff. Given the ORs of the remaining colors, we can conclude that the most dangerous spore print colors are (in order of magnitude):***

* Green
* White
* Chocolate
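
A ranking like this can be read off more easily by sorting the rows of a feature by OR. The helper below is a hypothetical convenience function (`odds_ratios_for` is not part of the notebook), demonstrated on made-up coefficients with the same column layout as `df_log`:

```python
import numpy as np
import pandas as pd

def odds_ratios_for(df, feature):
    """Return the rows of one feature's dummies, sorted by OR (largest first)."""
    mask = df["x"].str.startswith(feature + "_")
    return df[mask].sort_values("OR", ascending=False)

# Hypothetical coefficients, not the values fitted in this notebook.
toy = pd.DataFrame({
    "x": ["odor_c", "odor_n", "odor_l", "bruises_t"],
    "beta": [2.0, -3.0, -1.0, -0.5],
})
toy["OR"] = np.exp(toy["beta"])

ranked = odds_ratios_for(toy, "odor")
print(ranked)
```

Matching on `feature + "_"` rather than using `str.contains` avoids picking up other columns whose names merely contain the same substring.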

In [None]:
df_log[df_log['x'].str.contains("odor")]

***In this case the base corresponds to odor = almond. Given the ORs of the remaining odors, we can conclude that the most dangerous odors are (in order of magnitude):***

* Creosote
* Pungent
* Foul
* Fishy
* Spicy
* Musty
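
The names in these lists come from decoding the single-letter suffix of each dummy column. A small sketch of that decoding for odor, using the letter codes from the UCI dataset description (the example column names are illustrative):

```python
import pandas as pd

# Odor codes as documented for the UCI mushroom dataset.
ODOR_NAMES = {
    "a": "almond", "l": "anise", "c": "creosote", "y": "fishy", "f": "foul",
    "m": "musty", "n": "none", "p": "pungent", "s": "spicy",
}

# Illustrative dummy column names of the form produced by get_dummies.
cols = pd.Series(["odor_c", "odor_f", "odor_p"])

# Take the code after the last underscore and look up its readable name.
labels = cols.str.rsplit("_", n=1).str[-1].map(ODOR_NAMES)
print(labels.tolist())
```

The same pattern works for the other features once their code dictionaries are written out.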

In [None]:
df_log[df_log['x'].str.contains("bruises")]

***In this case the base corresponds to bruises = False. Given the OR for bruises = True, a mushroom that presents bruises is safer than one that doesn't.***


In [None]:
df_log[df_log['x'].str.contains("gill-size")]

***In this case the base corresponds to gill size = broad. Given the OR for gill size = narrow, a mushroom with broad gills is safer than one with narrow gills.***

In [None]:
df_log[df_log['x'].str.contains("gill-spacing")]

***In this case the base corresponds to gill spacing = close. Given the OR for gill spacing = crowded, a mushroom with crowded gills is safer than one with close gills.***
