<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/DecisionTrees3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

**Decision Tree Classifier on the Diabetes Dataset**

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn import tree

In [None]:
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
diabetes= pd.read_csv("pima_indians_diabetes.csv", header=None, names=col_names)

In [None]:
# Row 0 holds the CSV's own header line (read in as data because header=None), so drop it
diabetes = diabetes.drop([0])
diabetes.head()

In [None]:
# Split the dataset into features and target variable
feature_cols = ['pregnancies', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = diabetes[feature_cols] # Features
y = diabetes.label # Target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
# Create Decision Tree classifier object
diabetes_clf = DecisionTreeClassifier()

# Train Decision Tree classifier
diabetes_clf = diabetes_clf.fit(X_train,y_train)

# Predict the response for the test dataset
y_pred = diabetes_clf.predict(X_test)

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
import pydotplus

In [None]:
!pip install six

In the decision tree chart, each internal node has a decision rule that splits the data. The gini value shown in each node is the Gini index (also called Gini impurity), which measures the impurity of the node. A node is pure when all of its records belong to the same class; a pure node is not split further and becomes a leaf node.

The tree produced here is unpruned, which makes it large, hard to explain, and hard to understand. In the next section, let's optimize it by pruning.
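
As a rough illustration of what the gini value in each node means, the sketch below computes Gini impurity as 1 minus the sum of squared class proportions. The helper name `gini_impurity` and the three toy label lists are made up for this example; scikit-learn computes the same quantity internally.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a collection of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())

# Hypothetical nodes, described by the labels of the records they contain
pure_node = [1, 1, 1, 1]      # all records in one class
mixed_node = [0, 0, 1, 1]     # evenly split between two classes
skewed_node = [0, 1, 1, 1]    # mostly one class

print(gini_impurity(pure_node))    # 0.0
print(gini_impurity(mixed_node))   # 0.5
print(gini_impurity(skewed_node))  # 0.375
```

A value of 0 marks a pure node, and 0.5 is the maximum for a two-class problem.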

In [None]:
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image  
import pydotplus

dot_data = StringIO()
export_graphviz(diabetes_clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('diabetes.png')
Image(graph.create_png())

**Can we improve the accuracy?**

**Optimizing Decision Tree Performance**

criterion : string, optional (default=”gini”), the attribute selection measure. This parameter selects the measure used to evaluate the quality of a split. Supported criteria are “gini” for the Gini index and “entropy” for information gain.

splitter : string, optional (default=”best”), the split strategy. Supported strategies are “best” to choose the best split and “random” to choose the best among randomly drawn candidate splits.

max_depth : int or None, optional (default=None), the maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples. A large maximum depth tends to cause overfitting, and a small one tends to cause underfitting.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
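
To make the effect of max_depth concrete, here is a small sketch that fits trees of several depths and records training and test accuracy for each. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the example runs on its own; it is not the diabetes dataset), and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     flip_y=0.1, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

results = {}
for depth in [1, 3, 5, None]:
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=1)
    clf.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy, depth actually reached)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te), clf.get_depth())
    print("max_depth =", depth, "train/test accuracy and depth:", results[depth])
```

A large gap between training and test accuracy for the unrestricted tree would suggest overfitting, while low accuracy on both for a very shallow tree would suggest underfitting. The exact numbers depend on the data.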


Limiting the tree to a maximum depth of 3 gives a pruned model that is less complex, more explainable, and easier to understand than the unpruned tree.

In [None]:
# Create Decision Tree classifier object
better_diabetes_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree classifier
better_diabetes_clf = better_diabetes_clf.fit(X_train,y_train)

# Predict the response for the test dataset
y_pred_better = better_diabetes_clf.predict(X_test)

# Model accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_better))

In [None]:
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(better_diabetes_clf, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_png('better_diabetes.png')
Image(graph.create_png())
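
A shallow tree is small enough to read as plain text as well as a chart. The sketch below is one possible way to do that with scikit-learn's `export_text`. It uses the bundled iris dataset as a stand-in so that it runs on its own (an assumption; it is not the diabetes data), and `small_tree` is an illustrative name.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in dataset so the example is self-contained
iris = load_iris()
small_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Render the fitted decision rules as indented text
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```

In this notebook the same call would presumably be `export_text(better_diabetes_clf, feature_names=feature_cols)`.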


**Pros**<br>
- Decision trees are easy to interpret and visualize.
- They can easily capture non-linear patterns.
- They require less data preprocessing from the user; for example, there is no need to normalize columns.
- They can be used for feature engineering, such as predicting missing values, and are suitable for variable selection.
- Decision trees make no assumptions about the data distribution because the algorithm is non-parametric.
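
The point about normalization can be checked with a quick experiment: fit one tree on raw features and another on standardized features, then compare their predictions. This is a sketch on synthetic data (an assumption, chosen so the columns can be put on very different scales); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data whose columns are deliberately put on very different scales
X_raw, y_raw = make_classification(n_samples=500, n_features=6, n_informative=4,
                                   random_state=0)
X_raw = X_raw * np.array([1, 10, 100, 1000, 1, 10])
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, test_size=0.3, random_state=0)

# Same tree settings, once on raw features and once on standardized features
scaler = StandardScaler().fit(X_tr)
tree_raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(scaler.transform(X_tr), y_tr)

pred_raw = tree_raw.predict(X_te)
pred_scaled = tree_scaled.predict(scaler.transform(X_te))
agreement = np.mean(pred_raw == pred_scaled)
print("Fraction of identical predictions:", agreement)
```

Because each split only compares a feature against a threshold, rescaling a column moves the threshold but not the resulting partition, so the two trees should agree on essentially every prediction.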

**Cons**<br>
- Decision trees are sensitive to noisy data and can overfit it.
- Small variations in the data can result in a very different decision tree. This variance can be reduced with bagging and boosting algorithms.
- Decision trees are biased toward the majority class on imbalanced datasets, so it is recommended to balance the dataset before building the tree.
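
Two of the mitigations mentioned above can be sketched with standard scikit-learn tools: `BaggingClassifier` to reduce variance, and the `class_weight="balanced"` option as one way to compensate for class imbalance without resampling. The data here is synthetic and imbalanced by construction (an assumption, not the diabetes dataset), and which model scores best will vary from dataset to dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced stand-in data (roughly 80% / 20% classes)
X_demo, y_demo = make_classification(n_samples=600, n_features=8, n_informative=4,
                                     flip_y=0.1, weights=[0.8, 0.2], random_state=2)

models = {
    "single tree": DecisionTreeClassifier(random_state=2),
    "bagged trees": BaggingClassifier(DecisionTreeClassifier(random_state=2),
                                      n_estimators=50, random_state=2),
    "class-weighted tree": DecisionTreeClassifier(class_weight="balanced", random_state=2),
}

scores = {}
for name, model in models.items():
    # Balanced accuracy averages recall over classes, so the minority class counts equally
    cv_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="balanced_accuracy")
    scores[name] = cv_scores.mean()
    print(name, "mean balanced accuracy:", round(scores[name], 3))
```

Boosting could be tried the same way, for example with `sklearn.ensemble.GradientBoostingClassifier`.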