This notebook introduces supervised learning by solving a classification problem with **decision trees**.
It follows [this tutorial](https://youtu.be/7eh4d6sabA0).

# **Classification Problem**
We will follow these steps to solve a machine learning problem:


1. Import the Data
2. Clean the Data
3. Split the Data into Training/Test Sets
4. Create a Model
5. Train the Model
6. Make Predictions
7. Evaluate and Improve


# Problem description
In the text cell below, state what you will predict in this classification problem (y) and which columns you will use to make the prediction (X).

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import joblib
from sklearn import tree

1. Import the Data.

In [2]:
df = pd.read_csv('cleanedfile.csv')
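
The `Unnamed: 0` column that appears in the outputs below usually means the file was written with `to_csv` while the row index was still included. The sketch below reproduces that effect on a small in-memory CSV and shows that `index_col=0` reads the column back as the index instead. The two-row frame is hypothetical and only stands in for `cleanedfile.csv`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for the fish data.
original = pd.DataFrame({"Species": ["Bream", "Smelt"], "Weight": [242.0, 12.2]})

# By default, to_csv writes the row index as an unnamed first column.
csv_text = original.to_csv()

# Read back naively: the saved index becomes a column named 'Unnamed: 0'.
naive = pd.read_csv(io.StringIO(csv_text))

# Read back with index_col=0: the first column is used as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

Writing the file with `index=False` in the first place is another way to avoid the extra column.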

2. Display columns and describe the data set

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  159 non-null    int64  
 1   Species     159 non-null    object 
 2   Weight      159 non-null    float64
 3   Length1     159 non-null    float64
 4   Length2     159 non-null    float64
 5   Length3     159 non-null    float64
 6   Height      159 non-null    float64
 7   Width       159 non-null    float64
dtypes: float64(6), int64(1), object(1)
memory usage: 10.1+ KB


In [4]:
df.describe()

| | Unnamed: 0 | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| count | 159.0 | 159.0 | 159.0 | 159.0 | 159.0 | 159.0 | 159.0 |
| mean | 79.0 | 398.326415 | 26.24717 | 28.415723 | 31.227044 | 8.970994 | 4.417486 |
| std | 46.043458 | 357.978317 | 9.996441 | 10.716328 | 11.610246 | 4.286208 | 1.685804 |
| min | 0.0 | 0.0 | 7.5 | 8.4 | 8.8 | 1.7284 | 1.0476 |
| 25% | 39.5 | 120.0 | 19.05 | 21.0 | 23.15 | 5.9448 | 3.38565 |
| 50% | 79.0 | 273.0 | 25.2 | 27.3 | 29.4 | 7.786 | 4.2485 |
| 75% | 118.5 | 650.0 | 32.7 | 35.5 | 39.65 | 12.3659 | 5.5845 |
| max | 158.0 | 1650.0 | 59.0 | 63.4 | 68.0 | 18.957 | 8.142 |
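
The summary above reports a minimum `Weight` of 0.0, which is worth a second look before training. The sketch below shows one way to list such rows and, as one possible treatment, drop them. The sample frame is hypothetical, and dropping is an assumption rather than something this notebook does.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the notebook.
sample = pd.DataFrame({
    "Species": ["Bream", "Roach", "Smelt"],
    "Weight": [242.0, 0.0, 12.2],
})

# Flag rows whose weight is not strictly positive.
suspect = sample[sample["Weight"] <= 0]
print(suspect)

# One possible treatment (an assumption, not the notebook's method): drop them.
cleaned = sample[sample["Weight"] > 0].reset_index(drop=True)
print(len(cleaned))
```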


3. Prepare Data

In [5]:
# Run this section to inspect X
X = df.drop(columns = ['Species'])
X

| | Unnamed: 0 | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 242.0 | 23.2 | 25.4 | 30.0 | 11.5200 | 4.0200 |
| 1 | 1 | 290.0 | 24.0 | 26.3 | 31.2 | 12.4800 | 4.3056 |
| 2 | 2 | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | 3 | 363.0 | 26.3 | 29.0 | 33.5 | 12.7300 | 4.4555 |
| 4 | 4 | 430.0 | 26.5 | 29.0 | 34.0 | 12.4440 | 5.1340 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | 154 | 12.2 | 11.5 | 12.2 | 13.4 | 2.0904 | 1.3936 |
| 155 | 155 | 13.4 | 11.7 | 12.4 | 13.5 | 2.4300 | 1.2690 |
| 156 | 156 | 12.2 | 12.1 | 13.0 | 13.8 | 2.2770 | 1.2558 |
| 157 | 157 | 19.7 | 13.2 | 14.3 | 15.2 | 2.8728 | 2.0672 |
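
Note that `X` still contains `Unnamed: 0`, which is just the row number. The `y` output further down suggests the rows are grouped by species (the first rows are Bream, the last are Smelt), so the row number on its own may reveal the label and inflate the accuracy. The sketch below illustrates this on a hypothetical eight-row frame modeled on the table above: a tree that sees only the row number still classifies held-out rows correctly. It then shows how the column could be dropped from the features.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical frame, sorted by species, with a saved row-number column.
demo = pd.DataFrame({
    "Unnamed: 0": range(8),
    "Species": ["Bream"] * 4 + ["Smelt"] * 4,
    "Weight": [242.0, 290.0, 340.0, 363.0, 12.2, 13.4, 12.2, 19.7],
})

# Train on even rows and test on odd rows, using ONLY the row number.
train, test = demo.iloc[::2], demo.iloc[1::2]
leaky = DecisionTreeClassifier(random_state=0)
leaky.fit(train[["Unnamed: 0"]], train["Species"])
leak_accuracy = leaky.score(test[["Unnamed: 0"]], test["Species"])
print(leak_accuracy)

# Dropping the row number leaves only real measurements as features.
X_demo = demo.drop(columns=["Species", "Unnamed: 0"])
print(list(X_demo.columns))
```

Applied to this notebook, the equivalent change would be `X = df.drop(columns=['Species', 'Unnamed: 0'])`; the accuracy reported later was computed with the column still present.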


In [6]:
# Run this section to inspect y
y = df['Species']
y

0      Bream
1      Bream
2      Bream
3      Bream
4      Bream
       ...  
154    Smelt
155    Smelt
156    Smelt
157    Smelt
158    Smelt
Name: Species, Length: 159, dtype: object
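
Before splitting the data, it can help to check how many rows each species has, since a rare class may end up with very few rows in the test set. The sketch below uses a hypothetical label series; with the notebook's data the same calls would be made on `y`.

```python
import pandas as pd

# Hypothetical labels standing in for the Species column.
y_demo = pd.Series(
    ["Bream", "Bream", "Bream", "Perch", "Perch", "Smelt"], name="Species"
)

# Absolute number of rows per class, most frequent first.
counts = y_demo.value_counts()

# Share of each class in the whole data set.
shares = y_demo.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```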

4. Split the data, train the model, and calculate accuracy

In [7]:
# Train on 80% of the data set and hold out the remaining 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the model, fit it on the training rows, and predict the test rows
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Compute model accuracy
score = accuracy_score(y_test, predictions)
score

0.9375
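
The score above comes from a single random split with 32 test rows, so 0.9375 corresponds to 30 correct predictions and will likely change from run to run. The sketch below shows one way to make the split repeatable (`random_state`), keep class proportions similar in both parts (`stratify`), and average accuracy over several splits with cross-validation. It uses the iris data bundled with scikit-learn as a stand-in, and the seed value 42 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (iris ships with scikit-learn); swap in the notebook's X and y.
X_demo, y_demo = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps class proportions in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model_demo = DecisionTreeClassifier(random_state=42)
model_demo.fit(X_tr, y_tr)
single_split_score = model_demo.score(X_te, y_te)

# 5-fold cross-validation reports accuracy on five different held-out folds.
cv_scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X_demo, y_demo, cv=5
)
print(single_split_score, cv_scores.mean(), cv_scores.std())
```

Stratified splitting needs at least two rows in every class, so it is worth checking the class counts first.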

5. Persisting Models

In [8]:
# Save the model to file
joblib.dump(model, 'MODELNAME.joblib')


['MODELNAME.joblib']
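
A quick way to confirm that a saved model survives the round trip is to reload it and compare its predictions with the original's. The sketch below does this in a temporary directory, using the bundled iris data as a stand-in; the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data and model; in the notebook this would be the fitted `model`.
X_demo, y_demo = load_iris(return_X_y=True)
fitted = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory and load the model back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original predicts.
same_predictions = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Files written by joblib are pickle-based, so only load model files from sources you trust.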

5.b. Load the model and make predictions

In [9]:
# Load saved model. Make sure that you have run the previous
# section at least once, and that the file exists.

model = joblib.load('MODELNAME.joblib')
predictions = model.predict(X_test)
predictions

array(['Roach', 'Smelt', 'Perch', 'Bream', 'Smelt', 'Parkki', 'Parkki',
       'Smelt', 'Roach', 'Perch', 'Perch', 'Perch', 'Perch', 'Parkki',
       'Perch', 'Perch', 'Perch', 'Perch', 'Smelt', 'Bream', 'Perch',
       'Perch', 'Bream', 'Perch', 'Pike', 'Roach', 'Smelt', 'Bream',
       'Perch', 'Roach', 'Perch', 'Bream'], dtype=object)
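
A list of predictions is easier to judge next to the true labels. The sketch below builds a confusion matrix, where each row is a true species and each column a predicted one, so off-diagonal cells show which species get mixed up. The labels are hypothetical; with the notebook's data the inputs would be `y_test` and `predictions`.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = ["Bream", "Bream", "Perch", "Perch", "Roach", "Smelt"]
y_pred = ["Bream", "Bream", "Perch", "Roach", "Roach", "Smelt"]
labels = ["Bream", "Perch", "Roach", "Smelt"]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_table = pd.DataFrame(cm, index=labels, columns=labels)
print(cm_table)

# Accuracy is the diagonal sum divided by the number of samples.
acc = accuracy_score(y_true, y_pred)
print(acc)
```

Here the single off-diagonal entry shows one Perch predicted as Roach.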

6. (Optional) Visualize decision trees

In [10]:
# Export the fitted tree in Graphviz DOT format
tree.export_graphviz(model, out_file='MODELNAME.dot',
                     feature_names=list(X.columns),
                     # Pass a list of names, not a string, in the model's class order
                     class_names=list(model.classes_),
                     label='all',
                     rounded=True,
                     filled=True)

# Download the file MODELNAME.dot and open it in VS Code.
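
If a DOT viewer is not at hand, the tree's rules can also be printed as plain text with `export_text` from `sklearn.tree`. The sketch below uses the bundled iris data as a stand-in and limits the depth to 2 only to keep the printout short; both choices are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data; in the notebook this would be the fitted `model` and X.columns.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the decision rules as indented text, one line per split or leaf.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```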
