<a href="https://colab.research.google.com/github/cagBRT/Machine-Learning/blob/master/DecisionTrees.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from IPython.display import Image

In [None]:
!git clone -l -s https://github.com/cagBRT/Machine-Learning.git cloned-repo
%cd cloned-repo

A decision tree is made up of nodes connected by branches; its terminal nodes are called leaves. <br>

The root node is the first node. It represents the entire sample or population and is divided further into child nodes, each holding a more homogeneous subset. <br>

A decision node tests an attribute and splits into two or more branches, one for each outcome of the test.

**A leaf/terminal node does not split into further nodes, and it represents a decision**
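
As a rough illustration of these terms, the sketch below fits a small tree and counts its decision nodes and leaves. The bundled iris data and `max_depth=2` are assumptions chosen only to keep the tree small; in scikit-learn's `tree_` structure a leaf is marked by a left-child index of `-1`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data and depth limit (assumptions, not part of the text above)
demo = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_clf.fit(demo.data, demo.target)

# node 0 is the root; a node whose left child is -1 is a leaf
n_nodes = demo_clf.tree_.node_count
is_leaf = demo_clf.tree_.children_left == -1
n_leaves = int(is_leaf.sum())
n_decision = n_nodes - n_leaves  # the root plus any other decision nodes

print(f"nodes={n_nodes}, decision nodes={n_decision}, leaves={n_leaves}")

# Text rendering: indented lines are tests (decision nodes), "class:" lines are leaves
print(export_text(demo_clf, feature_names=list(demo.feature_names)))
```

Each `class:` line in the printed text is a leaf, and every other line is the test applied at a decision node.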

In [None]:
Image("images/Decision Tree.png" , width=640)

In [None]:
Image("images/Decision Tree 2.png" , width=640)

In [None]:
from sklearn import tree
X = [[0,0,0], [1,1,1],[1,0,1],[0,1,1],[0,0,1],[0,1,0],[0,1,0]]
Y = [0, 1, 0,1,0,1,1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

In [None]:
tree.plot_tree(clf)

In [None]:
clf.predict([[1,0,0]])

`predict_proba` is a method that predicts the class probabilities of the input samples X.

The predicted class probability is the fraction of samples of the same
class in a leaf.
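
The "fraction of samples in a leaf" rule can be checked by hand. The sketch below uses a hypothetical one-feature dataset (`X_toy`, `y_toy`) in which the leaves cannot become pure, finds the leaf a query falls into with `apply`, and compares the class fractions among the training samples in that leaf with what `predict_proba` returns.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one binary feature, so each leaf stays mixed
X_toy = np.array([[0], [0], [0], [1], [1], [1], [1]])
y_toy = np.array([0, 0, 1, 1, 1, 1, 0])
toy_clf = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

query = np.array([[0]])

# apply() returns the index of the leaf each sample ends up in
leaf_of_query = toy_clf.apply(query)[0]
train_leaves = toy_clf.apply(X_toy)

# Fraction of each class among the training samples in the same leaf
labels_in_leaf = y_toy[train_leaves == leaf_of_query]
manual_proba = np.array([np.mean(labels_in_leaf == c) for c in toy_clf.classes_])

model_proba = toy_clf.predict_proba(query)[0]
print("manual:", manual_proba)
print("model: ", model_proba)
```

The leaf for `x = 0` holds two samples of class 0 and one of class 1, so both lines should show roughly 0.667 and 0.333.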



In [None]:
# A notebook cell only displays its last expression, so print both results
print(clf.predict_proba([[1,0,0]]))
print(clf.predict_proba(X))

**Advantages of Decision Tree Algorithm:**<br>

- The results are easier to understand than those of most other models. A fitted tree reduces to a set of explicit rules, so a technical team can program it directly, and it is fast to apply to new instances: classifying an instance only requires a sequence of tests on its variables, which can be qualitative or quantitative.<br>

- It is non-parametric, so the independent variables don’t have to follow any specific probability distribution. Collinear variables are acceptable, and variables that do not discriminate between classes have no impact on the tree because it doesn’t have to choose them for a split.<br>

- Some decision tree algorithms can work with missing values. CHAID, for example, puts all the missing values in one category, which you can merge with another category or keep separate.<br>

- Extreme individual values (such as outliers) don’t have much effect on the decision trees. You can isolate them in small nodes so that they don’t affect the entire classification.<br>

- It gives you a great visual representation of a decision-making process. Every branch of a decision tree stands for the factors that can affect your decisions, and you get to see a bigger picture. You can use decision trees to improve communication in your team. <br>

**Disadvantages of Decision Tree Algorithm**<br>
- It doesn’t analyze all the independent variables simultaneously. Instead, it evaluates them sequentially, one split at a time. Because of this, the tree never revises the division of a node once it is made, which can bias the tree’s choices. <br>

- The tree is unstable: modifying even a single variable can change the entire tree if that variable is used close to the top. One remedy is resampling: construct trees on multiple samples of the data and aggregate their predictions by a mean (or a vote). However, this makes the model more complex and less readable, so resampling gives up one of the best qualities of decision trees. Why is the instability a problem? Suppose an instance has all the qualities of a particular group, but it also has the one quality on which the tree splits. In this case, the tree would put it in the wrong class just because it has that one quality. <br>

- All the nodes at a given level of a decision tree depend on the nodes at the previous levels. In other words, how the nodes at level ‘n + 1’ are defined depends entirely on how the nodes at level ‘n’ are defined. If a split at level ‘n’ is poor, all the subsequent levels and the nodes in them are affected as well.<br>
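
The resampling remedy mentioned in the list above can be sketched with scikit-learn's `BaggingClassifier`, which fits each tree on a bootstrap sample and aggregates the trees' predictions. The iris data, the 25 trees, and the 5-fold cross-validation are assumptions chosen only for illustration; this is one possible way to aggregate trees, not the only one.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data (an assumption; any labelled dataset would do)
X_iris, y_iris = load_iris(return_X_y=True)

# A single tree versus 25 trees, each fitted on a bootstrap sample
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0),  # base model passed positionally
    n_estimators=25,
    random_state=0,
)

# 5-fold cross-validated accuracy for both models
single_scores = cross_val_score(single_tree, X_iris, y_iris, cv=5)
bagged_scores = cross_val_score(bagged_trees, X_iris, y_iris, cv=5)

print("single tree :", single_scores.mean())
print("bagged trees:", bagged_scores.mean())
```

The bagged model is no longer a single readable tree, which is the loss of interpretability described above.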


**Decision Tree Classifier on the Iris Dataset**

In [None]:
from sklearn.datasets import load_iris
from sklearn import tree
iris = load_iris()
X, y = iris.data, iris.target
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, y)

In [None]:
X.shape

In [None]:
tree.plot_tree(clf) 

In [None]:
import graphviz 
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") 

In [None]:
dot_data = tree.export_graphviz(clf, out_file=None, 
                     feature_names=iris.feature_names,  
                     class_names=iris.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 

**Decision Tree Classifier on the Diabetes Dataset**

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation

In [None]:
col_names = ['pregnancies', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
diabetes= pd.read_csv("pima_indians_diabetes.csv", header=None, names=col_names)

In [None]:
diabetes.head()

In [None]:
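# The file's own header row was read as data (header=None above), so drop it
# and convert the remaining string values to numbers
diabetes = diabetes.drop([0]).apply(pd.to_numeric)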

In [None]:
#split dataset in features and target variable
feature_cols = ['pregnancies', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = diabetes[feature_cols] # Features
y = diabetes.label # Target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [None]:
# Create Decision Tree classifier object
diabetes_clf = DecisionTreeClassifier()

# Train Decision Tree Classifier (fit the new object, not the iris `clf` from earlier)
diabetes_clf = diabetes_clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred = diabetes_clf.predict(X_test)

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

In [None]:
import pydotplus

In [None]:
!pip install six

In [None]:
from io import StringIO  # sklearn.externals.six was removed from scikit-learn; io.StringIO is the standard-library equivalent
from sklearn.tree import export_graphviz
from IPython.display import Image
import pydotplus

# Export the tree fitted above (diabetes_clf) to DOT format and render it as a PNG
dot_data = StringIO()
export_graphviz(diabetes_clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

**Can we improve the accuracy?**

In [None]:
# Create Decision Tree classifier object (entropy criterion, depth limited to 3)
better_diabetes_clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train Decision Tree Classifier
better_diabetes_clf = better_diabetes_clf.fit(X_train,y_train)

#Predict the response for test dataset
y_pred_better = better_diabetes_clf.predict(X_test)

# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_better))
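
To see why limiting the depth can help, it is useful to compare training and test accuracy over several depths. The sketch below is a minimal version of that comparison. It assumes scikit-learn's bundled breast cancer data as a stand-in (the diabetes CSV is not used here), and the depth values are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (an assumption; swap in the diabetes X and y to reuse this)
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Fit one tree per depth and record (train accuracy, test accuracy)
depth_scores = {}
for depth in [1, 2, 3, 5, None]:  # None lets the tree grow until leaves are pure
    model = DecisionTreeClassifier(
        criterion="entropy", max_depth=depth, random_state=1
    )
    model.fit(X_tr, y_tr)
    depth_scores[depth] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for depth, (train_acc, test_acc) in depth_scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy at greater depths is a sign of overfitting, which a smaller `max_depth` can reduce.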

In [None]:
from io import StringIO  # sklearn.externals.six was removed from scikit-learn; io.StringIO is the standard-library equivalent
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

# Export the depth-limited tree to DOT format and render it as a PNG
dot_data = StringIO()
export_graphviz(better_diabetes_clf, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('better_diabetes.png')
Image(graph.create_png())
