Here I used a Decision Tree Classifier to determine which features matter most when predicting whether a mushroom is poisonous. The dataset has over 8,000 instances and 22 categorical attributes describing visual, olfactory, and ecological properties of the mushrooms.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split


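# Load the data and one-hot encode every categorical column.
mush_df = pd.read_csv('/Users/Shared/mushrooms.csv')
mush_df2 = pd.get_dummies(mush_df)

# Assuming 'class' is the first column of the CSV, columns 0 and 1 are now
# class_e (edible) and class_p (poisonous). Use class_p as the target.
X_mush = mush_df2.iloc[:,2:]
y_mush = mush_df2.iloc[:,1]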

X_train2, X_test2, y_train2, y_test2 = train_test_split(X_mush, y_mush, random_state=0)

# For performance reasons, create a smaller version of the entire
# mushroom dataset. For simplicity, just re-use the 25% test split
# created above as the representative subset.

X_subset = X_test2
y_subset = y_test2
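
To make the positional slicing above less opaque, here is a minimal sketch on a tiny hand-made frame. The `toy_df` values are invented for illustration and are not rows from mushrooms.csv. It assumes, as the cell above does, that `class` is the first column, so that `get_dummies` places `class_e` and `class_p` in positions 0 and 1.

```python
import pandas as pd

# Hypothetical toy data: 'class' is 'p' (poisonous) or 'e' (edible).
toy_df = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'odor':  ['p', 'a', 'n', 'f'],
})

# One indicator column per (column, category) pair.
toy_encoded = pd.get_dummies(toy_df)
print(list(toy_encoded.columns))

# Positional selection, mirroring the cell above.
X_toy = toy_encoded.iloc[:, 2:]
y_toy = toy_encoded.iloc[:, 1]

# Name-based selection yields the same data without relying on column order.
y_by_name = toy_encoded['class_p']
X_by_name = toy_encoded.drop(columns=['class_e', 'class_p'])
```

Selecting by name is a bit more verbose, but it keeps working if the column order of the CSV ever changes.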

Next, I trained a Decision Tree Classifier to classify the mushrooms as poisonous or not, then extracted the 5 most important features found by the decision tree.

In [2]:
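from sklearn.tree import DecisionTreeClassifier

# Fit the tree on the training split and read off the impurity-based importances.
clf = DecisionTreeClassifier(random_state = 0).fit(X_train2, y_train2)
features = clf.feature_importances_

# Take the indices of the 5 largest importances, then order them from
# most to least important.
num_features = 5
idx = features.argsort()[-num_features:]
feature_names = X_train2.columns[idx].tolist()
feature_names.reverse()
feature_names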

['odor_n', 'stalk-root_c', 'stalk-root_r', 'spore-print-color_r', 'odor_l']
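
The ranking step can also be written as a small helper that returns each name together with its importance, which makes it easier to see how quickly the importances fall off. This is only a sketch: `top_k_features`, `demo_names`, and `demo_importances` are hypothetical names and values, not results from the mushroom model.

```python
import numpy as np

def top_k_features(importances, names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    importances = np.asarray(importances, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices.
    order = np.argsort(importances)[::-1][:k]
    return [(names[i], float(importances[i])) for i in order]

# Hypothetical values for illustration only.
demo_names = ['f0', 'f1', 'f2', 'f3']
demo_importances = [0.1, 0.6, 0.0, 0.3]
demo_top = top_k_features(demo_importances, demo_names, k=2)
print(demo_top)
```

With the fitted tree above, the call would be `top_k_features(clf.feature_importances_, X_train2.columns, k=5)`.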

I used the validation_curve function in sklearn.model_selection to compute training and cross-validated test scores for a Support Vector Classifier (SVC) with a radial basis function (RBF) kernel across a range of parameter values. For performance reasons, I used a subset of the original mushroom dataset.

The first step is to create an SVC object with the default parameters (i.e. kernel='rbf', C=1) and random_state=0.

In [3]:
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve

gamma_range = np.logspace(-4,1,6)
clf = SVC(kernel = 'rbf', C = 1, random_state = 0)
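
For intuition about what gamma controls, the RBF kernel scores the similarity of two points as exp(-gamma * ||x - z||^2). The sketch below evaluates that formula directly with NumPy for two made-up one-hot style vectors. `rbf_kernel_value`, `x_demo`, and `z_demo` are hypothetical names used only for illustration.

```python
import numpy as np

def rbf_kernel_value(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical binary vectors that differ in two positions
# (squared distance = 2).
x_demo = [1, 0, 1, 0]
z_demo = [0, 1, 1, 0]

for gamma in np.logspace(-4, 1, 6):
    print(f"gamma={gamma:g}  similarity={rbf_kernel_value(x_demo, z_demo, gamma):.6f}")
```

With a very small gamma, every pair of points looks almost identical to the kernel, so the model has little to separate the classes with. With a very large gamma, each point is similar only to itself, so the model can memorize the training data.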

With this classifier and the dataset in X_subset, y_subset, I explored the effect of gamma on classifier accuracy by using the validation_curve function with 3-fold cross-validation to find the training and test scores for 6 values of gamma from 0.0001 to 10 (i.e. np.logspace(-4,1,6)).

In [5]:
train_validation, test_validation = validation_curve(
    clf, X_subset, y_subset, scoring = 'accuracy',
    param_name = 'gamma', param_range = gamma_range, cv = 3)
(train_validation, test_validation)

(array([[0.58936484, 0.55686854, 0.55317578],
        [0.93205318, 0.9239291 , 0.93722304],
        [0.99113737, 0.99039882, 0.99039882],
        [1.        , 1.        , 1.        ],
        [1.        , 1.        , 1.        ],
        [1.        , 1.        , 1.        ]]),
 array([[0.58345643, 0.56277696, 0.55539143],
        [0.91580502, 0.94977843, 0.92466765],
        [0.98966027, 0.99113737, 0.98818316],
        [1.        , 1.        , 1.        ],
        [0.98818316, 0.9985229 , 0.99704579],
        [0.52141802, 0.52289513, 0.52289513]]))
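
The layout of the two arrays above (one row per gamma value, one column per cross-validation fold) can be checked on synthetic data. This is a sketch under stated assumptions: the `make_classification` data is a stand-in for the mushroom subset, so the scores it produces will not match the ones above. Only the array shapes carry over.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the mushroom dataset).
X_demo, y_demo = make_classification(n_samples=150, n_features=10, random_state=0)

gamma_demo = np.logspace(-4, 1, 6)
train_demo, cv_demo = validation_curve(
    SVC(kernel='rbf', C=1), X_demo, y_demo,
    param_name='gamma', param_range=gamma_demo,
    scoring='accuracy', cv=3)

# Rows index the gamma values, columns index the CV folds.
print(train_demo.shape, cv_demo.shape)
```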

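In this case, I used accuracy as the scoring metric. Each row of the two arrays corresponds to one value of gamma and each column to one of the three cross-validation folds, so I averaged across the folds (axis 1) to get one mean score per gamma for both arrays.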

In [6]:
training_scores = np.mean(train_validation, axis = 1)
test_scores = np.mean(test_validation, axis = 1)
(training_scores, test_scores)

(array([0.56646972, 0.93106844, 0.990645  , 1.        , 1.        ,
        1.        ]),
 array([0.56720827, 0.9300837 , 0.98966027, 1.        , 0.99458395,
        0.52240276]))
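
Alongside the mean, the spread across folds gives a rough sense of how stable each score is. The sketch below recomputes the mean test score and adds the standard deviation, using the per-fold test scores copied from the validation_curve output above. `fold_scores`, `fold_mean`, and `fold_std` are names introduced here for illustration.

```python
import numpy as np

gammas = np.logspace(-4, 1, 6)

# Per-fold test scores copied from the validation_curve output above.
fold_scores = np.array([
    [0.58345643, 0.56277696, 0.55539143],
    [0.91580502, 0.94977843, 0.92466765],
    [0.98966027, 0.99113737, 0.98818316],
    [1.0,        1.0,        1.0],
    [0.98818316, 0.9985229,  0.99704579],
    [0.52141802, 0.52289513, 0.52289513],
])

fold_mean = fold_scores.mean(axis=1)  # average over the 3 folds
fold_std = fold_scores.std(axis=1)    # spread over the 3 folds

for g, m, s in zip(gammas, fold_mean, fold_std):
    print(f"gamma={g:g}  mean={m:.4f}  std={s:.4f}")
```

With only three folds the standard deviation is a coarse estimate, so I would read it as a sanity check rather than a confidence interval.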

Finally, I determined which gamma value corresponds to a model that is underfitting (the worst training set accuracy), a model that is overfitting (the worst test set accuracy, found here as the largest gap between training and test accuracy), and a model with good generalization performance (high accuracy on both the training and test sets).

In [9]:
gamma_range = np.logspace(-4,1,6)
underfit = training_scores.argmin()                  # lowest mean training accuracy
overfit = (training_scores - test_scores).argmax()   # largest train/test gap
good = (training_scores + test_scores).argmax()      # highest combined accuracy
(gamma_range[underfit], gamma_range[overfit], gamma_range[good])

(0.0001, 10.0, 0.1)
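
As a cross-check on the three picks, it can help to tabulate the train/test gap for every gamma rather than only looking at the argmax. The sketch below does that with the mean scores copied from the averaging cell above. `train_mean`, `test_mean`, and `gap` are names introduced here for illustration.

```python
import numpy as np

gammas = np.logspace(-4, 1, 6)

# Mean scores copied from the output of the averaging cell above.
train_mean = np.array([0.56646972, 0.93106844, 0.990645, 1.0, 1.0, 1.0])
test_mean = np.array([0.56720827, 0.9300837, 0.98966027, 1.0, 0.99458395, 0.52240276])

# A large positive gap means the model fits the training folds
# far better than the held-out folds.
gap = train_mean - test_mean

for g, tr, te, d in zip(gammas, train_mean, test_mean, gap):
    print(f"gamma={g:<7g} train={tr:.4f}  test={te:.4f}  gap={d:+.4f}")
```

The table shows that gamma=10 is the only setting with a large gap, while gamma=0.0001 scores poorly on both sets, which is consistent with the overfitting and underfitting labels above.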