# Decision Trees and Random Forests in Python

References: http://net-informations.com/ds/mla/dtree.htm, https://www.kaggle.com/code

Image credit: https://www.osmosis.org/learn/Lordosis,_kyphosis,_and_scoliosis

![](https://d16qt3wv6xm098.cloudfront.net/D8vzGbPOSmitZdUZkrleQYi-SZ6ZFOpZ/_.jpg)

# Install these packages before starting this lab

In [1]:
!pip install --upgrade scikit-learn==1.0.2
!pip install --upgrade numpy==1.21.5

Collecting scikit-learn==1.0.2
  Downloading scikit-learn-1.0.2.tar.gz (6.7 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [1817 lines of output]
      Partial import of sklearn during the build process.

      `numpy.distutils` is deprecated since NumPy 1.23.0, as a result
      of the deprecation of `distutils` itself. It will be removed for
      Python >= 3.12. For older Python versions it will remain p

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

ModuleNotFoundError: No module named 'seaborn'

## Get the Data

In [None]:
!wget https://github.com/davidjohnnn/all_datasets/raw/master/bay/kyphosis.csv

In [None]:
df = pd.read_csv('kyphosis.csv')

In [None]:
df.head()

## EDA

We'll just check out a simple pairplot for this small dataset.

In [None]:
sns.pairplot(df,hue='Kyphosis',palette='Set1')

## Train Test Split

Let's split up the data into a training set and a test set!

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('Kyphosis',axis=1)
y = df['Kyphosis']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.30, random_state=30)
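
To see what `stratify=y` is for, here is a minimal pure-Python sketch of a stratified split. It is not scikit-learn's implementation: the helper `stratified_split` and the toy label counts are made up for illustration. The idea is to sample each class separately so that the train and test sets keep roughly the same class proportions as the full dataset.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.30, seed=30):
    """Hypothetical helper: return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)                       # shuffle within one class only
        n_test = round(len(idx) * test_size)   # same test fraction for every class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with an 80/20 class imbalance (made-up numbers, not the real data).
labels = ['absent'] * 80 + ['present'] * 20
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

Both subsets keep the 80/20 ratio of the toy labels. Without stratification, a small imbalanced dataset can easily end up with very few minority-class rows in the test set.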

## Decision Trees

We'll start by training a single decision tree.

http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier

In [None]:
from sklearn.tree import DecisionTreeClassifier

In [None]:
dtree = DecisionTreeClassifier(min_samples_leaf=10, criterion='entropy')
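
With `criterion='entropy'`, candidate splits are scored by how much they reduce the entropy of the class labels (the information gain), and `min_samples_leaf=10` means a split is only considered if each resulting branch keeps at least 10 training samples. The sketch below computes entropy and information gain by hand on toy labels. The function names and the data are illustrative assumptions, not part of scikit-learn.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy node with 10 samples of each class (made-up data).
parent = ['absent'] * 10 + ['present'] * 10

# A split that separates the classes perfectly.
perfect_gain = information_gain(parent, ['absent'] * 10, ['present'] * 10)

# A split that leaves both children as mixed as the parent.
mixed_child = ['absent'] * 5 + ['present'] * 5
useless_gain = information_gain(parent, mixed_child, mixed_child)

print(entropy(parent), perfect_gain, useless_gain)
```

A 50/50 node has the maximum entropy of 1 bit for two classes. The perfect split recovers all of it, while the uninformative split gains nothing.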

In [None]:
dtree.fit(X_train,y_train)

In [None]:
import pickle
filename = 'model.sav'
# Use a context manager so the file is flushed and closed after writing.
with open(filename, 'wb') as f:
    pickle.dump(dtree, f)

## Prediction and Evaluation 

Let's evaluate our decision tree.

In [None]:
# Reload the saved model; the context manager closes the file handle.
with open(filename, 'rb') as f:
    dtree = pickle.load(f)
dtree

In [None]:
predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
print(classification_report(y_test,predictions,digits=4))

In [None]:
print(confusion_matrix(y_test,predictions,labels=['absent','present']))
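
To read the two reports together, it helps to see how precision and recall come out of the confusion matrix. With `labels=['absent','present']`, rows are the true classes and columns are the predicted classes, in that order. The following is a minimal pure-Python sketch on made-up labels. The helpers `confusion_counts` and `precision_recall` are hypothetical names, not scikit-learn functions.

```python
def confusion_counts(y_true, y_pred, labels):
    """Build a matrix where matrix[i][j] counts true class i predicted as class j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def precision_recall(matrix, k):
    """Precision and recall for the class at row/column k."""
    tp = matrix[k][k]
    predicted_k = sum(row[k] for row in matrix)  # column sum: everything predicted as k
    actual_k = sum(matrix[k])                    # row sum: everything that truly is k
    precision = tp / predicted_k if predicted_k else 0.0
    recall = tp / actual_k if actual_k else 0.0
    return precision, recall

# Made-up labels for illustration only.
y_true = ['absent', 'absent', 'absent', 'absent', 'present', 'present']
y_pred = ['absent', 'absent', 'absent', 'present', 'present', 'absent']

cm = confusion_counts(y_true, y_pred, ['absent', 'present'])
print(cm)
print('absent :', precision_recall(cm, 0))
print('present:', precision_recall(cm, 1))
```

Precision for a class divides the diagonal entry by its column sum, and recall divides it by its row sum. These are the same quantities that `classification_report` prints per class.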

## Tree Visualization

Scikit-learn has built-in visualization for decision trees. `tree.plot_tree` draws the fitted tree with matplotlib and needs no extra libraries; only the older `export_graphviz` route requires Graphviz and a helper library such as pydot. You won't use this often, but here is an example of the code and what it produces:

In [None]:
from sklearn import tree
tree.plot_tree(dtree)

In [None]:
print(X.columns) # feature names
print(y.unique().tolist()) # class names

In [None]:
fn = X.columns.tolist()       # feature names as a plain list
cn = dtree.classes_.tolist()  # class names in the order the model uses
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 6), dpi=100)
tree.plot_tree(dtree,
               feature_names=fn,
               class_names=cn,
               filled=True,
               ax=axes)
fig.savefig('imagename.png')