# <i class="fas fa-laptop"></i> Practice: Coding Machine Learning
```{jupyter-info}
{rel-data-download}`mushrooms.csv`
```

In this notebook, we will practice predicting whether a mushroom is edible based on its features, using the data in `mushrooms.csv`. This dataset has many columns that we will not attempt to understand, but you can look at them [here](https://www.kaggle.com/uciml/mushroom-classification#mushrooms.csv) if you are interested. The column of interest is our target, `class`, which takes the value `e` for edible or `p` for poisonous. All the features we will use (explained below) are categorical. Some values are missing, so we will need to handle that.


For this task, we will use only a subset of the columns for prediction. You should use the columns `cap-shape`, `cap-surface`, `cap-color` as features and `class` as the target. All other columns should be ignored for this analysis.

In this problem, you should follow the machine learning pipeline to do the following steps:
* Drop all columns that are not relevant to the analysis.
* Remove all rows that have missing values for the columns of interest. There is no need to throw out rows that have missing values outside of these 4 columns (since those missing values will not be included in the model).
* Separate the data into *usable* features and labels.
* Split the dataset into 70% training data and 30% test data.
* Train a decision tree model on the data.
* Evaluate the model's training and test accuracy.

Remember to import anything you need from `pandas` or `sklearn`!

## Problem 0: Load the Data
Load the dataset into a `DataFrame` named `data`. Don't preprocess the data in any way for this problem.

In [6]:
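# Write your code here!
import pandas as pd

# Colab-only step: upload mushrooms.csv from your machine into the session
from google.colab import files
uploaded = files.upload()

# Read the uploaded file and preview the first five rows
data = pd.read_csv('mushrooms.csv')
data.head()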



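| Unnamed: 0 | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w |  | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b |  | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |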

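The introduction notes that some values are missing and that only the four columns of interest matter when dropping rows. The sketch below shows one way to check this on a small made-up frame. The name `toy` and every value in it are illustrative assumptions and are not taken from `mushrooms.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a few rows of the real dataset
toy = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'cap-shape': ['x', 'x', np.nan, 'b'],
    'cap-surface': ['s', 's', 's', np.nan],
    'cap-color': ['n', 'y', 'w', 'w'],
    'odor': [np.nan, 'a', 'l', 'p'],  # not a column of interest
})

columns_of_interest = ['cap-shape', 'cap-surface', 'cap-color', 'class']

# Count the missing entries in each column of interest
missing_counts = toy[columns_of_interest].isna().sum()
print(missing_counts)

# Drop a row only when one of the columns of interest is missing;
# a missing value in `odor` alone does not remove the row
cleaned = toy.dropna(subset=columns_of_interest)
print(len(cleaned))
```

Selecting the four columns first and then calling `dropna()` with no arguments should give the same set of rows; the `subset` argument is just a way to express the same rule without dropping columns first.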

## Problem 1: Process the Data
Do the data processing parts of the ML pipeline. Namely, do the following steps:
* Drop all columns that are not relevant to the analysis.
* Remove all rows that have missing values for the columns of interest. There is no need to throw out rows that have missing values outside of these 4 columns (since those missing values will not be included in the model).
* Separate the data into *usable* features and labels.
* Split the dataset into 70% training data and 30% test data.

Save the variables for the train and test features into `features_train`, `features_test`, `labels_train`, `labels_test` as we did before.

In [23]:
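# Write your code here!
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Keep only the three feature columns and the target
data = data[['cap-shape', 'cap-surface', 'cap-color', 'class']]

# Remove rows with a missing value in any of the remaining columns
data = data.dropna()

# Separate the features from the labels
features = data.loc[:, data.columns != 'class']
labels = data['class']

# One-hot encode the categorical features so the model receives numeric input
features_encoded = pd.get_dummies(features)

# Hold out 30% of the rows as test data
features_train, features_test, labels_train, labels_test = train_test_split(
    features_encoded, labels, test_size=0.3
)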

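The step that makes the features *usable* is the one-hot encoding done by `pd.get_dummies`. As a rough illustration, the sketch below applies it to a tiny made-up frame; `toy_features` and its values are assumptions for demonstration only.

```python
import pandas as pd

# Hypothetical categorical features (values are made up)
toy_features = pd.DataFrame({
    'cap-shape': ['x', 'b', 'x'],
    'cap-color': ['n', 'w', 'w'],
})

# Each (column, category) pair becomes its own indicator column
toy_encoded = pd.get_dummies(toy_features)
print(list(toy_encoded.columns))
print(toy_encoded)
```

Every row ends up with exactly one active indicator per original column, which is the numeric form that `DecisionTreeClassifier` can work with.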

## Problem 2: Train the Model
Write code to create a decision tree model and train it. Make sure you save it in a variable called `model`.

In [30]:
# Write your code here!

# Create a model
model = DecisionTreeClassifier()

# Fit the model to training data
model.fit(features_train, labels_train)

# Make predictions
test_predictions = model.predict(features_test)
train_predictions = model.predict(features_train)

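To get a feel for what the fitted model has learned, it can help to print a small tree as text. The sketch below trains on a made-up, already one-hot-encoded frame; `toy_features`, `toy_labels`, and `toy_model` are hypothetical names, and `random_state=0` is only there to make the run repeatable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-hot features where color alone determines the label
toy_features = pd.DataFrame({
    'cap-color_n': [1, 0, 0, 1],
    'cap-color_w': [0, 1, 1, 0],
})
toy_labels = ['p', 'e', 'e', 'p']

# Create and fit a decision tree on the toy data
toy_model = DecisionTreeClassifier(random_state=0)
toy_model.fit(toy_features, toy_labels)

# Show the learned splits and the depth of the tree
print(export_text(toy_model, feature_names=list(toy_features.columns)))
print(toy_model.get_depth())
```

The same two calls, `export_text` and `get_depth`, can be applied to `model` once it has been fit, although the tree for the real data will likely be much larger.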
## Problem 3: Assess the Model
Write code to compute the training and test accuracy of the model in the cell below. Save the train accuracy in a variable called `train_acc` and the test accuracy in a variable called `test_acc`.

For reference, each of your train and test accuracies should be around 70%.

**Check your Understanding**: If both the train and test accuracy are near 70%, would we say that the model is overfit? Why or why not?

In [35]:
# Write your code here!
train_acc = accuracy_score(labels_train, train_predictions)
test_acc = accuracy_score(labels_test, test_predictions)

print(train_acc)
print(test_acc)

0.7155095484561842
0.7127393838467944

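One way to approach the Check your Understanding question is to look at the gap between train and test accuracy. For contrast, the sketch below fits an unrestricted tree to noisy synthetic data, where a large gap is expected. The dataset comes from `make_classification`, not from `mushrooms.csv`, and all names and parameter values here are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic numeric data with a share of randomly assigned labels (flip_y)
X, y = make_classification(
    n_samples=500, n_features=10, flip_y=0.3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree can keep splitting until it memorizes the training set
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

deep_train_acc = accuracy_score(y_train, deep_tree.predict(X_train))
deep_test_acc = accuracy_score(y_test, deep_tree.predict(X_test))
gap = deep_train_acc - deep_test_acc

print(deep_train_acc, deep_test_acc, gap)
```

Comparing this gap with `train_acc - test_acc` from the mushroom model gives a concrete basis for answering the question.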