# Separating Mushrooms

The [mushrooms dataset](http://archive.ics.uci.edu/ml/datasets/Mushroom) includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family, drawn from The Audubon Society Field Guide to North American Mushrooms (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this last class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom, nothing like _leaflets three, let it be_ for poison oak and ivy.

General information about the dataset can be found on the page linked above. In particular, we are interested in
- the [training set](http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data), and
- the [dataset description](http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names), or data dictionary.

Let's download the dataset into a pandas dataframe.

In [1]:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
data = pd.read_csv(url, header=None)

In [2]:
print(len(data), "rows")
data.head()

8124 rows


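| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |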


You'll need the [data dictionary](http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names) to understand what all these columns and letters mean.

We'd like to predict the first column (column # 0), which contains an `e` for mushrooms that are definitely edible, and a `p` for definitely poisonous ones, or ones of unknown edibility and not recommended. This will be our **_y_** target variable.

In [3]:
y = data[0]
y.value_counts()

e    4208
p    3916
dtype: int64

We'll try to predict edibility using SVMs. Note that SVMs require numerical features, not strings, so we again need tools such as `patsy`'s `dmatrix` and `dmatrices`, or `sklearn`'s `CountVectorizer` or `TfidfVectorizer`. Please refer to the earlier notebooks for details on how to use these. Also make sure that your features are scaled properly, using `MinMaxScaler` or `StandardScaler`, since SVMs rely on the notion of distance. However, if you converted your data into a binary feature matrix, scaling is not necessary.
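
To make the idea of a binary feature matrix concrete, here is a minimal sketch on a made-up four-row frame. It uses `pandas.get_dummies`, which is a swapped-in alternative to the tools named above, not the approach from the earlier notebooks. The column names `label`, `cap_shape`, and `cap_surface` are hypothetical stand-ins for the integer column names in `data`.

```python
import pandas as pd

# Toy stand-in for the mushroom frame (values are made up for illustration).
# The column names are hypothetical; the real frame uses integers 0..22.
toy = pd.DataFrame({
    "label": ["p", "e", "e", "p"],
    "cap_shape": ["x", "x", "b", "x"],
    "cap_surface": ["s", "s", "s", "y"],
})

# One-hot encode every feature column, leaving the label column out.
# Each (column, letter) pair becomes its own 0/1 column.
X_toy = pd.get_dummies(toy.drop(columns="label")).astype(int)

print(X_toy.columns.tolist())
print(X_toy.shape)
```

Every entry of the resulting matrix is 0 or 1, so all features already share the same range and no additional scaling step is required.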

In [4]:
from patsy import dmatrix, dmatrices
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

## Exercise
1. Convert your dataset, excluding the first column (!), into a numerical matrix that you could feed into your model
1. Make sure you scale your features properly, if necessary, using `MinMaxScaler` or `StandardScaler`
1. Fit your model, `LinearSVC` or `SVC`, and compute cross-validated accuracy (and optionally also AUC score)
1. Try different options for your numerical representation and see how that impacts your model
1. Try different options for model (inverse-regularization parameter `C=`, kernels `kernel=`, etc.)
1. Compare your results with another classifier you feel comfortable with (e.g., Logistic Regression)
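
As a starting point for the fitting and tuning steps, the sketch below shows the general shape of a cross-validated accuracy loop over several values of `C`. It runs on synthetic data, not on the mushroom features: the random binary matrix, the label rule, and the names `X_demo`, `y_demo`, and `results` are all assumptions made for illustration. It imports `cross_val_score` from `sklearn.model_selection`, which is where current scikit-learn releases keep it.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)

# Synthetic 0/1 feature matrix standing in for a one-hot encoded dataset.
X_demo = rng.randint(0, 2, size=(200, 6))
# Synthetic label that depends only on the first feature (an assumption
# chosen so that a linear model can separate the classes).
y_demo = np.where(X_demo[:, 0] == 1, "p", "e")

results = {}
for C in (0.01, 1.0, 100.0):
    # 5-fold cross-validated accuracy for one regularization setting
    scores = cross_val_score(LinearSVC(C=C), X_demo, y_demo, cv=5, scoring="accuracy")
    results[C] = scores.mean()
    print("C =", C, "mean accuracy =", results[C])
```

Swapping `LinearSVC(C=C)` for `SVC(kernel=..., C=C)` or another classifier keeps the rest of the loop unchanged, which makes side-by-side comparisons straightforward.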

In [5]:
# your code here