# Classifying brains with Machine Learning
In this notebook you will learn how to use machine learning to predict whether or not a brain belongs to a modern bird or a non-avian dinosaur. 

First import pandas, numpy, and matplotlib.pyplot:

We will also need the tree module of the sklearn library:

In [None]:
from sklearn import tree

Read the bird_dino_data.csv file into a dataframe and add two new columns:
- Brain vs body mass (use total endocranium / body mass*1000)
- Cerebrum vs total brain (use cerebrum / endocranium)
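As a rough sketch of the two derived columns, the example below uses a tiny made-up dataframe in place of bird_dino_data.csv. The column names (`Endocranium`, `Cerebrum`, `Body mass`) and all numbers are assumptions for illustration, so substitute the names used in your file. The first formula is read here as (endocranium / body mass) * 1000.

```python
import pandas as pd

# Made-up stand-in for pd.read_csv("bird_dino_data.csv"); column names are hypothetical
df = pd.DataFrame({
    "Bird or Dino": ["Bird", "Dino"],
    "Endocranium": [2.0, 150.0],
    "Cerebrum": [1.4, 60.0],
    "Body mass": [10.0, 5000.0],
})

# Brain vs body mass, read as (endocranium / body mass) * 1000
df["Brain vs body"] = df["Endocranium"] / df["Body mass"] * 1000
# Cerebrum vs total brain
df["Cerebrum vs brain"] = df["Cerebrum"] / df["Endocranium"]

print(df.head())
```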

Find the head of your dataframe to check that your changes are correct:

For the steps that follow, our class labels need to be integers rather than strings.

Change the values of the "Bird or Dino" column from "Bird" to 0 and from "Dino" to 1:

Hints: 
- use .loc indexes
- you can reassign the value in a dataframe column using =

*You can ignore the warning. Pandas discourages this kind of change to the original dataframe, but it is fine here.*
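One possible way to follow the hints, shown on a made-up three-row column rather than the real data. The column is created with object dtype here (an assumption for this sketch) so that it can hold strings and integers while the labels are being swapped.

```python
import pandas as pd

# Made-up labels standing in for the real "Bird or Dino" column
df = pd.DataFrame({"Bird or Dino": ["Bird", "Dino", "Bird"]}, dtype=object)

# .loc[row_mask, column] selects the matching rows; = reassigns their value
df.loc[df["Bird or Dino"] == "Bird", "Bird or Dino"] = 0
df.loc[df["Bird or Dino"] == "Dino", "Bird or Dino"] = 1

# Store the labels as real integers once no strings are left
df["Bird or Dino"] = df["Bird or Dino"].astype(int)

print(df["Bird or Dino"].tolist())
```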

Find the head of your dataframe to check that your changes are correct:

Our machine learning algorithm requires a numpy array instead of a dataframe.

PAUSE: When you get to this point, let your Helen Fellow know and we will review numpy matrices before we continue with machine learning.

We can convert the dataframe to a numpy array using the .to_numpy() method. Assign your array to a variable:

Print out the data type of the array and the first value in the array (note: this is a two-dimensional array):
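A small sketch of the conversion and the two checks, using a made-up all-numeric dataframe (the column names and values are assumptions). Because the array is two-dimensional, a single value needs both a row index and a column index.

```python
import pandas as pd

# Made-up dataframe: label first, then the two ratios
df = pd.DataFrame({
    "Bird or Dino": [0, 1],
    "Brain vs body": [0.22, 0.03],
    "Cerebrum vs brain": [0.71, 0.40],
})

data = df.to_numpy()

print(type(data))   # the Python type of the array object
print(data.dtype)   # the type of the values stored inside it
print(data[0])      # the first row
print(data[0, 0])   # the first value: row 0, column 0
```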

Now we will create our classifier. Just as it is common to call a dataframe "df" it is common to call a classifier "clf":

In [None]:
clf = tree.DecisionTreeClassifier()

Next, we will select the x and y data for our algorithm. x should be the two columns we will use to train the algorithm (brain to body ratio and cerebrum to whole brain ratio). y should be the first column, which contains our "class labels".

Hint: You can use slicing to select a particular value from every row of a numpy array. For example, using the index [:,1] will select the second column.
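The slicing idea can be sketched on a made-up array. Here the two ratios are assumed to sit in columns 1 and 2; in your own array they may be the last two columns, in which case a slice such as `[:, -2:]` would select them. The `.astype(int)` cast is an assumption that may help if your labels come out of a mixed-type array.

```python
import numpy as np

# Made-up array: column 0 = class label, columns 1 and 2 = the two ratios
data = np.array([
    [0, 0.22, 0.71],
    [1, 0.03, 0.40],
    [0, 0.18, 0.65],
])

x = data[:, 1:3]             # every row, columns 1 and 2 (the features)
y = data[:, 0].astype(int)   # every row, column 0 (the labels) as integers

print(x.shape, y.shape)
```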

Next, we will use the .fit() method to fit the classifier to our data:
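A minimal sketch of fitting, using four made-up training rows instead of the real dataset. After fitting, `clf.classes_` lists the labels in the order that `predict_proba` uses for its columns.

```python
import numpy as np
from sklearn import tree

# Made-up training data: [brain to body ratio, cerebrum to whole brain ratio]
x = np.array([[0.22, 0.71], [0.18, 0.65], [0.03, 0.40], [0.02, 0.35]])
y = np.array([0, 0, 1, 1])  # 0 = bird, 1 = dinosaur

clf = tree.DecisionTreeClassifier()
clf.fit(x, y)

print(clf.classes_)
print(clf.predict_proba([[0.20, 0.70]]))
```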

We can visualize the decisions the tree makes using the tree.plot_tree function and matplotlib.pyplot's plt.show function:
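As an illustration, the sketch below fits a tree on made-up data and draws it. The feature and class names passed in are labels chosen for this example and only affect the text shown in each box.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree

# Made-up training data: [brain to body ratio, cerebrum to whole brain ratio]
x = np.array([[0.22, 0.71], [0.18, 0.65], [0.03, 0.40], [0.02, 0.35]])
y = np.array([0, 0, 1, 1])  # 0 = bird, 1 = dinosaur

clf = tree.DecisionTreeClassifier().fit(x, y)

# Draw one box per node of the fitted tree
annotations = tree.plot_tree(
    clf,
    feature_names=["brain/body", "cerebrum/brain"],
    class_names=["Bird", "Dino"],
    filled=True,
)
plt.show()
```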

Now let's test out our decision tree with some data from one of the brains we studied! We can use the .predict_proba method. 

The columns of the result follow the class order (0 = bird, 1 = dinosaur). A result of array([[1., 0.]]) means the algorithm is certain it's a bird, and a result of array([[0., 1.]]) means the algorithm is certain it's a dinosaur.

For example:

In [None]:
# This is the brain to body mass ratio and cerebrum to whole brain ratio for the woodpecker:
clf.predict_proba([[0.22,0.71]])

Try it with the data from your brain specimen!

## Beautifying our graph

We can use `graphviz` to create a nicer visualization of our decision tree. You will need to import `export_graphviz` from `sklearn.tree`, import `pydotplus`, and import `graphviz`.

Look up the documentation for the `export_graphviz` function. You will need to supply your fitted decision tree (`clf`), the names of your features (the columns of the "x" variable), and the names of your classes. You can play around with the other arguments to change the appearance of the tree.
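A hedged sketch of the export step, again on made-up data. With `out_file=None`, `export_graphviz` returns the tree as a string of DOT source; turning that string into an image is left out here and would typically go through `pydotplus` or `graphviz`. The feature and class names are labels chosen for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Made-up training data: [brain to body ratio, cerebrum to whole brain ratio]
x = np.array([[0.22, 0.71], [0.18, 0.65], [0.03, 0.40], [0.02, 0.35]])
y = np.array([0, 0, 1, 1])  # 0 = bird, 1 = dinosaur

clf = DecisionTreeClassifier().fit(x, y)

dot_data = export_graphviz(
    clf,
    out_file=None,  # return the DOT source as a string instead of writing a file
    feature_names=["brain/body", "cerebrum/brain"],
    class_names=["Bird", "Dino"],
    filled=True,    # color each node by its majority class
    rounded=True,   # draw boxes with rounded corners
)

print(dot_data[:200])
```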

## Plotting the decision surface

Plot the decision surface for your decision tree to visualize the values of each feature that the algorithm classifies as bird vs. dinosaur.

First, find the minimum and maximum of each feature. In this case, we want the minimum and maximum of brain to body ratio and cerebrum to whole brain ratio:

In [None]:
# Calculate the minimum and maximum of brain to body ratio and assign it to a variable:

# Calculate the minimum and maximum of cerebrum to whole brain ratio and assign it to a variable:
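One way to get these values, sketched on a made-up feature array. The variable names are placeholders for this example, and the column order is an assumption.

```python
import numpy as np

# Made-up features: column 0 = brain to body ratio, column 1 = cerebrum to whole brain ratio
x = np.array([[0.22, 0.71], [0.03, 0.40], [0.18, 0.65]])

# Slice out one column, then take its smallest and largest value
brain_body_min, brain_body_max = x[:, 0].min(), x[:, 0].max()
cerebrum_min, cerebrum_max = x[:, 1].min(), x[:, 1].max()

print(brain_body_min, brain_body_max)
print(cerebrum_min, cerebrum_max)
```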


Now we want to use those values to create simulated brains covering a range of values for each ratio. We will create all possible combinations within these ranges, extended by 0.05 on each side, with a step value of 0.02.

In [None]:
# Make coordinate matrices
# Fill in the minimum and maximum values for both ratios below
xx, yy = np.meshgrid(np.arange( - 0.05,  + 0.05, 0.02),
                         np.arange( - 0.05,  + 0.05, 0.02))
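To see what `np.meshgrid` produces, here is a small sketch with deliberately tiny, made-up ranges. Each position in `xx` and `yy` together describes one simulated brain, and stacking the flattened grids with `np.c_` gives one row per brain, which is the shape a classifier expects.

```python
import numpy as np

# Tiny made-up ranges so the grids are easy to read
xx, yy = np.meshgrid(np.arange(0.0, 0.05, 0.02),   # three values for the first ratio
                     np.arange(0.5, 0.53, 0.02))   # two values for the second ratio

print(xx)
print(yy)

# Flatten both grids and pair them up: one row per (first ratio, second ratio) combination
grid = np.c_[xx.ravel(), yy.ravel()]
print(grid.shape)
```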

We can use our decision tree to classify these simulated brains:

In [None]:
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

Now we can plot the simulated values to see the decision surface and overlay the training points to see where the brains in our dataset fall.

In [None]:
# Plot the contour plot
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu)
# label the axes


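# Plot the training points (red = bird, blue = dinosaur, matching the RdBu colormap)
for i, color in zip(range(2), "rb"):
    idx = np.where(y == i)
    # c is a fixed color per class, so no cmap is needed here
    plt.scatter(x[idx, 0], x[idx, 1], c=color, label=["Bird", "Dino"][i],
                edgecolor='black', s=15)

# Show which color belongs to which class
plt.legend()
plt.show()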