This notebook is an introduction to supervised learning. It works through a classification problem using **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0). 

# **Classification Problem**
We will follow these steps of solving a machine learning problem.


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and improve


# Problem description
Enter in the text cell below what you will be predicting in this classification problem (y) and which columns will be used in the prediction (X)

I am using a body fat data table from Kaggle. I will be predicting a person's body fat (y) from the other columns (X), such as density, weight, and height.

In [24]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [52]:
bodyfat_data = pd.read_csv('bodyfat.csv')

2. Display columns and describe the data set

In [53]:
bodyfat_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Density  252 non-null    float64
 1   BodyFat  252 non-null    float64
 2   Age      252 non-null    int64  
 3   Weight   252 non-null    float64
 4   Height   252 non-null    float64
 5   Neck     252 non-null    float64
 6   Chest    252 non-null    float64
 7   Abdomen  252 non-null    float64
 8   Hip      252 non-null    float64
 9   Thigh    252 non-null    float64
 10  Knee     252 non-null    float64
 11  Ankle    252 non-null    float64
 12  Biceps   252 non-null    float64
 13  Forearm  252 non-null    float64
 14  Wrist    252 non-null    float64
dtypes: float64(14), int64(1)
memory usage: 29.7 KB


In [54]:
bodyfat_data.describe()

| | Density | BodyFat | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 | 252.0 |
| mean | 1.055574 | 19.150794 | 44.884921 | 178.924405 | 70.14881 | 37.992063 | 100.824206 | 92.555952 | 99.904762 | 59.405952 | 38.590476 | 23.102381 | 32.273413 | 28.663889 | 18.229762 |
| std | 0.019031 | 8.36874 | 12.60204 | 29.38916 | 3.662856 | 2.430913 | 8.430476 | 10.783077 | 7.164058 | 5.249952 | 2.411805 | 1.694893 | 3.021274 | 2.020691 | 0.933585 |
| min | 0.995 | 0.0 | 22.0 | 118.5 | 29.5 | 31.1 | 79.3 | 69.4 | 85.0 | 47.2 | 33.0 | 19.1 | 24.8 | 21.0 | 15.8 |
| 25% | 1.0414 | 12.475 | 35.75 | 159.0 | 68.25 | 36.4 | 94.35 | 84.575 | 95.5 | 56.0 | 36.975 | 22.0 | 30.2 | 27.3 | 17.6 |
| 50% | 1.0549 | 19.2 | 43.0 | 176.5 | 70.0 | 38.0 | 99.65 | 90.95 | 99.3 | 59.0 | 38.5 | 22.8 | 32.05 | 28.7 | 18.3 |
| 75% | 1.0704 | 25.3 | 54.0 | 197.0 | 72.25 | 39.425 | 105.375 | 99.325 | 103.525 | 62.35 | 39.925 | 24.0 | 34.325 | 30.0 | 18.8 |
| max | 1.1089 | 47.5 | 81.0 | 363.15 | 77.75 | 51.2 | 136.2 | 148.1 | 147.7 | 87.3 | 49.1 | 33.9 | 45.0 | 34.9 | 21.4 |
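A few values in this summary look more like data-entry problems than real measurements: the minimum `BodyFat` is 0.0, and the minimum `Height` is 29.5 against a median of 70.0. The "Clean the Data" step could flag such rows for review before training. This is a minimal sketch, assuming hypothetical plausibility thresholds (`MIN_BODYFAT`, `MIN_HEIGHT`) and a tiny stand-in frame instead of `bodyfat.csv`:

```python
import pandas as pd

# Tiny stand-in for bodyfat_data; the second and third rows mimic the
# suspicious minimums reported by describe() (Height 29.5, BodyFat 0.0).
sample = pd.DataFrame({
    "BodyFat": [12.3, 19.2, 0.0, 25.3],
    "Height": [67.75, 29.5, 70.0, 66.25],
    "Weight": [154.25, 176.5, 118.5, 154.0],
})

# Hypothetical plausibility limits, chosen only for illustration.
MIN_BODYFAT = 1.0
MIN_HEIGHT = 50.0

# Boolean mask: True for rows that fail either plausibility check.
suspicious = (sample["BodyFat"] < MIN_BODYFAT) | (sample["Height"] < MIN_HEIGHT)
cleaned = sample[~suspicious].reset_index(drop=True)

print(sample[suspicious])  # rows to review by hand
print(len(cleaned))        # rows kept for training
```

Whether to drop, correct, or keep flagged rows is a judgment call. The sketch only shows how to find them.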


3. Prepare Data

In [56]:
# Run this section to inspect X
X = bodyfat_data.drop(columns = ['BodyFat'])
X

| | Density | Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0708 | 23 | 154.25 | 67.75 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 |
| 1 | 1.0853 | 22 | 173.25 | 72.25 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 |
| 2 | 1.0414 | 22 | 154.00 | 66.25 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 |
| 3 | 1.0751 | 26 | 184.75 | 72.25 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 |
| 4 | 1.0340 | 24 | 184.25 | 71.25 | 34.4 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 247 | 1.0736 | 70 | 134.25 | 67.00 | 34.9 | 89.2 | 83.6 | 88.8 | 49.6 | 34.8 | 21.5 | 25.6 | 25.7 | 18.5 |
| 248 | 1.0236 | 72 | 201.00 | 69.75 | 40.9 | 108.5 | 105.0 | 104.5 | 59.6 | 40.8 | 23.2 | 35.2 | 28.6 | 20.1 |
| 249 | 1.0328 | 72 | 186.75 | 66.00 | 38.9 | 111.1 | 111.5 | 101.7 | 60.3 | 37.3 | 21.5 | 31.3 | 27.2 | 18.0 |
| 250 | 1.0399 | 72 | 190.75 | 70.50 | 38.9 | 108.3 | 101.3 | 97.8 | 56.0 | 41.6 | 22.7 | 30.5 | 29.4 | 19.8 |
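One thing to keep in mind about the columns in X: `BodyFat` in this dataset appears to be derived from `Density`. As far as I know this is done with Siri's equation, BodyFat = 495 / Density - 450, which is an assumption about how the dataset was built and not something the tutorial states. If it holds, a model that sees `Density` already has most of the answer. The sketch below checks the assumption against the first five rows, pairing the `Density` values shown above with the `BodyFat` values for the same rows (12.3, 6.1, 25.3, 10.4, 28.7):

```python
# First five rows of the dataset as (Density, BodyFat) pairs.
rows = [
    (1.0708, 12.3),
    (1.0853, 6.1),
    (1.0414, 25.3),
    (1.0751, 10.4),
    (1.0340, 28.7),
]


def siri_body_fat(density):
    """Siri's equation (assumed here to be how BodyFat was computed)."""
    return 495.0 / density - 450.0


# Print the reported value next to the value recomputed from Density.
for density, body_fat in rows:
    estimate = siri_body_fat(density)
    print(f"{density:.4f}  reported={body_fat:5.1f}  siri={estimate:5.1f}")
```

If the goal is to predict body fat from measurements that are easy to take, it may be worth also dropping `Density` from X and comparing the results.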


In [57]:
# Run this section to inspect y
y = bodyfat_data['BodyFat']
y

0      12.3
1       6.1
2      25.3
3      10.4
4      28.7
       ... 
247    11.0
248    33.6
249    29.3
250    26.0
251    31.9
Name: BodyFat, Length: 252, dtype: float64
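The output above shows that `BodyFat` is a continuous `float64` column, while a classifier needs a discrete set of class labels. One common way to bridge that gap is to bin the values into categories. This is a minimal sketch, assuming hypothetical cut-offs and label names (`bin_edges` and `bin_labels` are illustrative, not from the dataset or the tutorial):

```python
import pandas as pd

# A few BodyFat values copied from the output above.
y_sample = pd.Series([12.3, 6.1, 25.3, 10.4, 28.7, 11.0, 33.6], name="BodyFat")

# Hypothetical bin edges and labels, chosen only for illustration.
bin_edges = [0, 15, 25, 100]
bin_labels = ["low", "medium", "high"]

# Intervals are right-closed: (0, 15], (15, 25], (25, 100].
# include_lowest=True keeps a value of exactly 0 in the first bin.
y_class = pd.cut(y_sample, bins=bin_edges, labels=bin_labels, include_lowest=True)

print(y_class.value_counts())
```

Fewer, wider bins give the tree an easier problem but a coarser answer, so the choice of edges depends on what the prediction is for.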

4. Calculate accuracy

In [64]:
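# DecisionTreeClassifier needs discrete class labels, but BodyFat is a
# continuous float column, which scikit-learn rejects for classifiers.
# Rounding to whole percentages is an assumption made here to keep this
# a classification problem.
y_class = y.round().astype(int)

# Train on 80% of the data set and use the remaining 20% to test
X_train, X_test, y_train, y_test = train_test_split(X, y_class, test_size=0.2)

model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy (fraction of test rows whose class is predicted exactly)
score = accuracy_score(y_test, predictions)
score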
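Because `train_test_split` shuffles the rows randomly, the accuracy changes every time the cell above is run. Fixing `random_state` makes a run repeatable, and trying several seeds shows how much the score depends on the split. This is a minimal sketch on synthetic stand-in data (not the body fat dataset); `split_and_score` is a hypothetical helper name:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with the same shape as X (252 rows, 14 features).
X_demo, y_demo = make_classification(n_samples=252, n_features=14, random_state=0)


def split_and_score(seed):
    """Split, train and score once, with every source of randomness seeded."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))


print(split_and_score(1), split_and_score(1))        # same seed, same score
print([split_and_score(seed) for seed in range(5)])  # spread across splits
```

With only 252 rows, a 20% test set is about 50 samples, so some spread between seeds is to be expected.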

5. Persisting Models

In [59]:
# Save the model to file
joblib.dump(model, 'bodyfat-predictor.joblib')


['bodyfat-predictor.joblib']
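`joblib.dump` returns the list of files it wrote, which is the output shown above. A quick way to check that a saved model survives the round trip is to reload it and compare predictions. This is a minimal sketch on synthetic stand-in data, writing to a temporary directory; the file name `demo-model.joblib` is made up for the example:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and model (not the body fat model).
X_demo, y_demo = make_classification(n_samples=100, n_features=14, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo-model.joblib")
    joblib.dump(clf, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = (restored.predict(X_demo) == clf.predict(X_demo)).all()
print(same_predictions)
```

Since joblib files are pickle-based, they should only be loaded from sources you trust.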

5.b. Import the model and make predictions

In [63]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('bodyfat-predictor.joblib')
predictions = model.predict([[1.0708, 23, 154.25, 67.75, 36.2, 93.1, 85.2, 94.5, 59.0, 37.3, 21.9, 32.0, 27.4, 17.1]])
predictions
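Passing a bare nested list works, but the values are matched to features purely by position, so a swapped pair of numbers goes unnoticed. Recent scikit-learn versions also warn when a model fitted on a DataFrame receives input without feature names. An alternative is to build a one-row DataFrame that reuses the training column names. This is a minimal sketch with a two-column stand-in training set and made-up class labels:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny stand-in training set; the labels "a" and "b" are made up.
train = pd.DataFrame({
    "Weight": [154.25, 173.25, 154.00, 184.75],
    "Abdomen": [85.2, 83.0, 87.9, 86.4],
})
labels = ["a", "b", "a", "b"]

clf = DecisionTreeClassifier(random_state=0).fit(train, labels)

# Build the new sample with the same column names and order as the training data.
new_person = pd.DataFrame([[160.0, 86.0]], columns=train.columns)
prediction = clf.predict(new_person)
print(prediction)
```

In the notebook the same idea would use `X.columns` for the column names.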

6. (Optional) Visualize decision trees

In [60]:
tree.export_graphviz(model, out_file = 'bodyfat-predictor.dot',
                    feature_names = ['density', 'age', 'weight', 'height', 'neck', 'chest', 'abdomen', 'hip', 'thigh', 'knee', 'ankle', 'biceps', 'forearm', 'wrist'], 
                    class_names = [str(c) for c in model.classes_],
                    label = 'all',
                    rounded = True, 
                    filled = True)

# Download the file bodyfat-predictor.dot and open it in VS Code.
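If Graphviz or a `.dot` viewer is not available, `sklearn.tree.export_text` prints the same tree as indented plain-text rules. This is a minimal sketch, using the iris dataset that ships with scikit-learn as a stand-in and limiting the depth so the output stays short:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data: the iris dataset bundled with scikit-learn.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# One line per split or leaf, indented by depth.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

For the body fat model, the equivalent call would pass `model` and the list of column names from `X`.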
