# Mushroom Classification

The objective of this activity is to use grid search and randomized search to find an optimal model for predicting whether a particular mushroom species is poisonous, given attributes relating to its appearance.

### 1. Load the data into Python and call the object mushrooms.

Import pandas.

In [None]:
import pandas as pd

Read the data. Note that the file has no header row, so `header=None` is passed.

In [None]:
mushrooms = pd.read_csv('./agaricus-lepiota.data', header=None)

View the data.

In [None]:
mushrooms

### 2. Separate the target y and features X from the dataset.

In [None]:
y_raw = mushrooms.iloc[:,0]

X_raw = mushrooms.iloc[:,1:]

### 3. Recode the target y such that poisonous mushrooms are represented as 1, edible mushrooms as 0.

In [None]:
y = (y_raw == 'p') * 1

In [None]:
y

### 4. The feature set X will need to have its columns transformed into a NumPy array with a binary representation. This is known as 'one-hot encoding'.

Import preprocessing.

In [None]:
from sklearn import preprocessing

Create and fit the encoder then transform the data.

In [None]:
encoder = preprocessing.OneHotEncoder()

encoder.fit(X_raw)

X = encoder.transform(X_raw).toarray()

View the data.

In [None]:
X
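As a small illustration of what the encoder does, the sketch below applies `OneHotEncoder` to a made-up two-column frame. The column names and values are hypothetical and are not taken from the mushroom data. Each distinct category in each column becomes its own 0/1 column.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'colour': ['red', 'brown', 'red'],
    'shape': ['flat', 'flat', 'bell'],
})

toy_encoder = preprocessing.OneHotEncoder()
toy_encoded = toy_encoder.fit_transform(toy).toarray()

# One array of sorted category labels per input column
print(toy_encoder.categories_)

# One binary column per category: 2 colours + 2 shapes = 4 columns
print(toy_encoded.shape)
print(toy_encoded)
```

The same logic explains why `X` has many more columns than `X_raw`: every category of every original column gets a column of its own.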

### 5. Conduct both a grid search and a randomized search to find an optimal hyperparameterization for a random forest classifier. Use accuracy as your method of model evaluation. Which method of tuning is more effective?

Import ensemble.

In [None]:
from sklearn import ensemble

Create the random forest classifier.

In [None]:
rfc = ensemble.RandomForestClassifier(n_estimators=100, random_state=150)

Import model_selection, define the grid, set up the grid search, run it, and visualise the results.

In [None]:
from sklearn import model_selection

grid = {
    'criterion': ['gini', 'entropy'],
    'max_features': [2, 4, 6, 8, 10, 12, 14]
}

gscv = model_selection.GridSearchCV(estimator=rfc, param_grid=grid, cv=5, scoring='accuracy')

gscv.fit(X,y)

results = pd.DataFrame(gscv.cv_results_)

results.sort_values('rank_test_score', ascending=True).head(10)
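Besides the `cv_results_` table, a fitted search object exposes the winning configuration directly. The sketch below shows this on a small synthetic dataset from `make_classification` with a deliberately tiny grid. The data, the `demo_` names, and the grid values are assumptions made so the example runs on its own; they are not the mushroom data or the grid used above.

```python
from sklearn import datasets, ensemble, model_selection

# Synthetic stand-in data so the example is self-contained
X_demo, y_demo = datasets.make_classification(
    n_samples=200, n_features=10, random_state=0
)

demo_rfc = ensemble.RandomForestClassifier(n_estimators=20, random_state=0)
demo_grid = {'criterion': ['gini', 'entropy'], 'max_features': [2, 4]}

demo_gscv = model_selection.GridSearchCV(
    estimator=demo_rfc, param_grid=demo_grid, cv=3, scoring='accuracy'
)
demo_gscv.fit(X_demo, y_demo)

# Best hyperparameter combination and its mean cross-validated accuracy
print(demo_gscv.best_params_)
print(demo_gscv.best_score_)

# With the default refit=True, the best model is refit on all of the data
print(demo_gscv.best_estimator_)
```

On the notebook's own search, the equivalent attributes would be `gscv.best_params_` and `gscv.best_score_`.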

### 6. Plot mean test score vs hyperparameterization for the top 10 models found. Can you spot any obvious patterns?

In [None]:
(
    results
    # ascending=True puts rank 1 (the best model) first, so head(10) keeps the top 10
    .sort_values('rank_test_score', ascending=True)
    .loc[:,['params','mean_test_score']]
    .head(10)
    .plot.barh(x='params', xlim=(0.8, 1.0))
)

Import the stats module from SciPy, define the parameter dictionary and any distributions, conduct a randomized search and visualise the results.

In [None]:
from scipy import stats

max_features = X.shape[1]

param_dist = {
    'criterion': ['gini', 'entropy'],
    'max_features': stats.randint(low=1, high=max_features)
}

rscv = model_selection.RandomizedSearchCV(estimator=rfc, param_distributions=param_dist, n_iter=50, cv=5, scoring='accuracy', random_state=100)

rscv.fit(X,y)

results = pd.DataFrame(rscv.cv_results_)

results.sort_values('rank_test_score', ascending=True).head(10)
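One detail worth knowing about `stats.randint` is that its `high` bound is exclusive, so the distribution draws integers from `low` up to `high - 1`. A quick check with small, arbitrary bounds (chosen only for illustration):

```python
from scipy import stats

# Arbitrary small bounds chosen for illustration
dist = stats.randint(low=1, high=5)
samples = dist.rvs(size=1000, random_state=0)

# Every draw lies in 1..4; 5 is never drawn because `high` is exclusive
print(samples.min(), samples.max())
```

For the search above, this means `max_features` is sampled from 1 to `X.shape[1] - 1`.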

In [None]:
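# Dictionaries are unhashable, so convert params to strings before drop_duplicates
results['params'] = results['params'].astype(str)

(
    # ascending=True puts rank 1 (the best model) first, so head(10) keeps the top 10
    results.sort_values('rank_test_score', ascending=True)
    .loc[:,['params','mean_test_score']]
    .drop_duplicates()
    .head(10)
    .plot.barh(x='params', xlim=(0.8, 1.0))
)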
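To answer the question in step 5, one option is to put the best scores of the two searches side by side, along with the number of configurations each one tried. The sketch below does this on synthetic data from `make_classification` with small, arbitrary search settings; all of these are assumptions for illustration, and the result says nothing about the mushroom data itself.

```python
import pandas as pd
from scipy import stats
from sklearn import datasets, ensemble, model_selection

# Synthetic stand-in data so the example is self-contained
X_demo, y_demo = datasets.make_classification(
    n_samples=200, n_features=10, random_state=0
)
demo_rfc = ensemble.RandomForestClassifier(n_estimators=20, random_state=0)

grid_search = model_selection.GridSearchCV(
    demo_rfc,
    param_grid={'criterion': ['gini', 'entropy'], 'max_features': [2, 4, 6]},
    cv=3, scoring='accuracy',
)
random_search = model_selection.RandomizedSearchCV(
    demo_rfc,
    param_distributions={
        'criterion': ['gini', 'entropy'],
        'max_features': stats.randint(low=1, high=11),  # draws 1..10
    },
    n_iter=6, cv=3, scoring='accuracy', random_state=0,
)

# fit() returns the fitted search object itself
searches = {
    'grid': grid_search.fit(X_demo, y_demo),
    'random': random_search.fit(X_demo, y_demo),
}

# One row per search: best mean CV accuracy, candidates tried, best parameters
comparison = pd.DataFrame({
    name: {
        'best_score': search.best_score_,
        'n_candidates': len(search.cv_results_['params']),
        'best_params': str(search.best_params_),
    }
    for name, search in searches.items()
}).T
print(comparison)
```

In the notebook, the same comparison can be made with `gscv` and `rscv`. Keep in mind that the grid search there evaluates 14 candidates while the randomized search evaluates 50, so the two do not spend the same compute budget.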