# DAT210x - Programming with Python for DS

## Module6 - Lab5

Mycology is the branch of biology that deals with the study of fungi, including mushrooms, and particularly their genetic and biochemical make-up and their use to humans. Throughout history, fungi have been used for tinder, medicine, and food. For hundreds of years, specific mushrooms have been used as folk medicine in Russia, China, and Japan. Scientists elsewhere have also documented many medicinal uses of mushrooms, but not all mushrooms are beneficial; some are quite deadly.



In this lab, you're going to use decision trees to peruse The Mushroom Data Set, drawn from the Audubon Society Field Guide to North American Mushrooms (1981). The data set describes mushrooms in terms of many physical characteristics, such as cap size and stalk length, along with a classification of poisonous or edible.

As a standard disclaimer, if you eat a random mushroom you find, you are doing so at your own risk. While every effort has been made to ensure that the information contained within the data set is correct, please understand that no one associated with this course accepts any responsibility or liability for errors, omissions or representations, expressed or implied, contained therein, or that might arise from you mistakenly identifying a mushroom. Exercise due caution and treat this lab as being for informational purposes only.

1. First, visit the data set's page and read through it carefully. Understand what it says about missing value representations and header names, and where the classification column is located. Peek through the data values in a spreadsheet program or text editor and get comfortable with them.
2. Load up the starter code in Module6/Module6 - Lab5.ipynb.
3. A copy of the dataset is included in Module6/Datasets/agaricus-lepiota.data.
4. Review the decision tree code in the SciKit-Learn part of the Decision Tree section. It contains a few calls that are necessary for completing the assignment. If you're unable to install graphviz, use webgraphviz, or alternatively complete the assignment by examining the attributes of your classifier.
5. Answer the following questions:

In [1]:
import pandas as pd

Useful information about the dataset used in this assignment can be [found here](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.names).

Load up the mushroom dataset into dataframe `X` and verify you did it properly, and that you have not included any features that clearly shouldn't be part of the dataset.

You should not have any doubled indices. You can check out information about the headers present in the dataset using the link we provided above. Also make sure you've properly captured any NA values.
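
The loading requirements above can be sketched as follows. This is a minimal illustration rather than the lab's reference solution: the column names are taken from the `agaricus-lepiota.names` description linked above, the two in-memory rows stand in for the real file, and `COLUMN_NAMES` and `demo` are names chosen here for illustration.

```python
import io

import pandas as pd

# Attribute names as listed in agaricus-lepiota.names (the class comes first).
COLUMN_NAMES = [
    'class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
    'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
    'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
    'stalk-surface-below-ring', 'stalk-color-above-ring',
    'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number',
    'ring-type', 'spore-print-color', 'population', 'habitat',
]

# Two illustrative rows in the file's format; the second one uses '?'
# for stalk-root to mimic the dataset's missing-value marker.
sample = io.StringIO(
    "p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u\n"
    "e,x,s,y,t,a,f,c,b,k,e,?,s,s,w,w,p,w,o,p,n,n,g\n"
)

# header=None keeps the first row as data; na_values='?' turns '?' into NaN.
demo = pd.read_csv(sample, header=None, names=COLUMN_NAMES, na_values='?')

print(demo.shape)
print(demo['stalk-root'].isnull().sum())
```

For the real file, the same keyword arguments would be passed to `pd.read_csv` with the dataset path in place of `sample`.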

In [37]:
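# Caution: without header=None, pandas uses the first data row as the header,
# and '?' (the dataset's missing-value marker) is not converted to NaN.
X = pd.read_csv('Datasets/agaricus-lepiota.data')
X.head(5)
X.info()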

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8123 entries, 0 to 8122
Data columns (total 23 columns):
p      8123 non-null object
x      8123 non-null object
s      8123 non-null object
n      8123 non-null object
t      8123 non-null object
p.1    8123 non-null object
f      8123 non-null object
c      8123 non-null object
n.1    8123 non-null object
k      8123 non-null object
e      8123 non-null object
e.1    8123 non-null object
s.1    8123 non-null object
s.2    8123 non-null object
w      8123 non-null object
w.1    8123 non-null object
p.2    8123 non-null object
w.2    8123 non-null object
o      8123 non-null object
p.3    8123 non-null object
k.1    8123 non-null object
s.3    8123 non-null object
u      8123 non-null object
dtypes: object(23)
memory usage: 1.4+ MB


In [3]:
# No double indices
X['p.2'].value_counts()

p    8123
Name: p.2, dtype: int64

In [4]:
# An easy way to show which rows have NaNs in them.
# None are found here, because '?' was not declared as a missing value on load.
X[pd.isnull(X).any(axis=1)]

Empty DataFrame (0 rows); columns: p, x, s, n, t, p.1, f, c, n.1, k, ..., s.2, w, w.1, p.2, w.2, o, p.3, k.1, s.3, u


For this simple assignment, just drop any row with a nan in it, and then print out your dataset's shape:

In [5]:
X = X.dropna(axis=0)
# Print the shape, as the instructions above ask.
print(X.shape)

Copy the labels out of the dataframe into variable `y`, then remove them from `X`.

Encode the labels, using the `.map()` trick we presented to you in Module 5, using `p:0` (Poisonous), `e:1` (Edible).

In [6]:
y = X['p']
y = y.map({'p':0, 'e':1})
y[:5]

0    1
1    1
2    0
3    1
4    1
Name: p, dtype: int64
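
One thing worth checking after a `.map()` encoding is whether any label fell outside the mapping, since pandas returns NaN for values that are missing from the dictionary. The sketch below uses a small made-up series (`raw_labels`, including a deliberately invalid `'x'`) purely for illustration.

```python
import pandas as pd

# Hypothetical label column; 'x' is deliberately not a valid class.
raw_labels = pd.Series(['p', 'e', 'e', 'x'])

# Values absent from the dictionary become NaN instead of raising an error.
encoded_labels = raw_labels.map({'p': 0, 'e': 1})

# Collect the original values that could not be encoded.
unmapped = raw_labels[encoded_labels.isnull()]
print(list(unmapped))
```

An empty `unmapped` result would suggest that every label was one of the expected classes.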

Encode the entire dataframe using dummies:

In [7]:
X = X.drop(labels=['p'], axis=1)

In [8]:
X = pd.get_dummies(X)
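
To make the effect of `pd.get_dummies` concrete, here is a small sketch on a toy frame. The column names and values are invented for illustration; the point is that each distinct category becomes its own indicator column, and a column with a single value yields a single, uninformative dummy.

```python
import pandas as pd

# Toy categorical data; 'veil-type' has only one distinct value.
toy = pd.DataFrame({
    'odor': ['a', 'n', 'a'],
    'veil-type': ['p', 'p', 'p'],
})

# One indicator column is created per (column, category) pair.
toy_encoded = pd.get_dummies(toy)
print(list(toy_encoded.columns))
```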

Split your data into `test` and `train` sets. Your `test` size should be 30% with `random_state` 7.

Please use variable names: `X_train`, `X_test`, `y_train`, and `y_test`:

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)
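
As a quick sanity check on the split, the sketch below runs `train_test_split` on a ten-row stand-in data set (the `demo_X` and `demo_y` names and values are assumptions for illustration) and confirms both the 70/30 sizes and that a fixed `random_state` reproduces the same partition.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in features and labels with ten rows.
demo_X = pd.DataFrame({'feature': range(10)})
demo_y = pd.Series([0, 1] * 5)

# Same arguments as in the lab: 30% test size, random_state 7.
first = train_test_split(demo_X, demo_y, test_size=0.3, random_state=7)
second = train_test_split(demo_X, demo_y, test_size=0.3, random_state=7)

demo_X_train, demo_X_test, demo_y_train, demo_y_test = first
print(len(demo_X_train), len(demo_X_test))

# A fixed random_state should select the same test rows every time.
same_split = demo_X_test.index.equals(second[1].index)
print(same_split)
```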

Create a DT classifier. No need to set any parameters:

In [20]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()

Train the classifier on the `training` data and labels; then, score the classifier on the `testing` data and labels:

In [21]:
dtree.fit(X_train, y_train)
score = dtree.score(X_test, y_test)

In [22]:
print("High-Dimensionality Score: ", round((score*100), 3))

High-Dimensionality Score:  100.0


Use the code on the course's SciKit-Learn page to output a .DOT file, then render the .DOT to .PNGs.

You will need graphviz installed to do this. On macOS, you can `brew install graphviz`. On Windows 10, graphviz installs via a .msi installer that you can download from the graphviz website. The graph editor gvedit.exe can also be used to view the tree directly from the exported tree.dot file without having to issue a call. On other systems, use analogous commands.

If you encounter issues installing graphviz, or don't have the rights to install it, you can always visualize your .dot file on the website: http://webgraphviz.com/.
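
If neither graphviz nor the website is an option, one possible fallback is scikit-learn's `export_text`, which prints the fitted tree's rules as plain text. The sketch below is an assumption-laden illustration on invented toy data (`toy`, `toy_labels`, and `toy_clf` are not part of the lab); with the lab's objects, the fitted classifier and `list(X.columns)` would take their place.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented toy data: two categorical features and a binary label.
toy = pd.DataFrame({
    'odor': ['n', 'n', 'f', 'f', 'a', 'p'],
    'cap': ['x', 'b', 'x', 'b', 'x', 'x'],
})
toy_labels = pd.Series([1, 1, 0, 0, 1, 0])

# One-hot encode, then fit a small tree.
toy_X = pd.get_dummies(toy)
toy_clf = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_labels)

# export_text renders the split rules; feature_names must be a list.
rules = export_text(toy_clf, feature_names=list(toy_X.columns))
print(rules)
```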

In [36]:
pd.DataFrame(list(zip(X.columns, dtree.feature_importances_))).sort_values(by=1)

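|    | 0     | 1        |
|----|-------|----------|
| 0  | x_b   | 0.000000 |
| 83 | w.2_n | 0.000000 |
| 82 | p.2_p | 0.000000 |
| 80 | w.1_w | 0.000000 |
| 79 | w.1_p | 0.000000 |
| 78 | w.1_o | 0.000000 |
| 77 | w.1_n | 0.000000 |
| 76 | w.1_g | 0.000000 |
| 75 | w.1_e | 0.000000 |
| 74 | w.1_c | 0.000000 |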
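
Sorting in ascending order puts the zero-importance features first, so the informative ones end up at the bottom. A minimal sketch of ranking them the other way round is shown below; the feature names and importance values are made up for illustration and are not the lab's results.

```python
import pandas as pd

# Hypothetical column names and importances (not taken from the lab output).
feature_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']
importances = [0.0, 0.7, 0.0, 0.3]

# Label each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)

# Keep only the features the tree actually used for splitting.
used = ranked[ranked > 0]
print(used)
```

With the lab's objects, `X.columns` and `dtree.feature_importances_` would replace the two hypothetical lists.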


In [30]:
dtree.feature_importances_

array([0.00000000e+00, 7.02958708e-04, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.40414077e-03,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.38785809e-03, 1.17644770e-02, 0.00000000e+00,
       0.00000000e+00, 1.05067225e-02, 0.00000000e+00, 6.14690148e-01,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.98397208e-05,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 1.80145526e-01, 0.00000000e+00, 8.34470315e-02,
      

In [23]:
from sklearn import tree
# Pass the fitted estimator itself. Passing dtree.tree_ raises
# "TypeError: <sklearn.tree._tree.Tree object ...> is not an estimator instance."
tree.export_graphviz(dtree, out_file='tree.dot', feature_names=X.columns)

In [None]:
from subprocess import call


In [None]:
call(['dot', '-T', 'png', 'tree.dot', '-o', 'tree.png'])
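
The `call` above raises `FileNotFoundError` when the `dot` executable is not on the PATH, and otherwise returns its exit status without checking it. A slightly more defensive variant might look like the sketch below; `render_dot` is a hypothetical helper introduced here for illustration, not part of the course code.

```python
import shutil
import subprocess


def render_dot(dot_path, png_path):
    """Render a .dot file to .png; return True only if Graphviz succeeded."""
    # shutil.which returns None when the 'dot' executable cannot be found.
    if shutil.which('dot') is None:
        return False
    # Same Graphviz invocation as above, but capture the exit status.
    result = subprocess.run(['dot', '-Tpng', dot_path, '-o', png_path])
    return result.returncode == 0


# A missing input file (or a missing Graphviz install) reports failure
# instead of raising an exception.
rendered = render_dot('does-not-exist.dot', 'does-not-exist.png')
print(rendered)
```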