##### What is the difference between Pre-pruning and Post-pruning of a decision tree?

**_Post-pruning_** is also known as backward pruning. Here we first generate the (complete) decision tree and then remove its non-significant branches, with the aim of improving classification accuracy on unseen instances. There are two principal methods of doing this. One widely used method begins by converting the tree to an equivalent set of rules. Another commonly used approach retains the decision tree but replaces some of its subtrees with leaf nodes, converting the complete tree into a smaller pruned one that predicts the classification of unseen instances at least as accurately.

**_Pre-pruning_** is also called forward pruning or online pruning. Pre-pruning prevents the generation of non-significant branches. It uses a ‘termination condition’ to decide when to stop growing some branches early, while the tree is still being generated. During construction, a measure such as statistical significance, information gain or the Gini index can be used to assess the goodness of a split. If partitioning the tuples at a node would result in a split whose goodness falls below a prespecified threshold, further partitioning of that subset is halted; otherwise the node is expanded. A high threshold results in oversimplified trees, whereas a low threshold results in very little simplification.
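
The two strategies can be sketched with scikit-learn. This is a minimal illustration, not a recommended configuration: the synthetic data and the threshold values are assumptions. `max_depth` and `min_impurity_decrease` act as pre-pruning termination conditions, while `ccp_alpha` enables cost-complexity pruning, which is scikit-learn's form of post-pruning (it collapses subtrees into leaves after the tree is grown).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (an assumption for illustration; any labelled dataset works).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fully grown tree: no pruning of either kind.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: termination conditions are checked while the tree grows.
pre_pruned = DecisionTreeClassifier(
    max_depth=5,                  # never grow deeper than 5 levels
    min_impurity_decrease=0.001,  # skip splits whose impurity gain is below this threshold
    random_state=0,
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then collapse weak subtrees into leaves.
# ccp_alpha is the cost-complexity penalty; 0.005 is an arbitrary illustrative value.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={tree.get_n_leaves():4d} "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

Both pruned trees end up with fewer leaves than the full tree; whether they are also more accurate depends on the data and on the thresholds chosen.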

##### What is over-fitting in a decision tree?

Over-fitting is the phenomenon in which the learning system fits the given training data so tightly that it becomes inaccurate when predicting outcomes for unseen data.

In decision trees, over-fitting occurs when the tree is grown to fit all samples in the training data set perfectly. It ends up with branches that encode strict rules based on sparse data, which reduces accuracy when predicting samples that are not part of the training set.

One of the methods used to address over-fitting in decision trees is pruning, which can be done after the initial training is complete (post-pruning) or while the tree is being grown (pre-pruning).
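
A quick way to see over-fitting is to compare training and test accuracy for an unrestricted tree. The sketch below assumes synthetic data in which 10% of the samples have randomly reassigned labels, so a perfect training fit necessarily memorises noise; the dataset and seeds are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y); an assumption made for illustration.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# No depth limit and no pruning: the tree keeps splitting until every leaf is pure.
deep_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
train_acc = deep_tree.score(X_train, y_train)
test_acc = deep_tree.score(X_test, y_test)

print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```

The gap between the two numbers is the symptom described above: rules learned from sparse training data do not carry over to new samples.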

##### Are tree-based models better than linear models?

It depends on the type of problem we are solving. Let’s look at some key factors that help decide which algorithm to use:

1. If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will outperform a tree-based model.
2. If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will outperform a classical regression method.
3. If we need a model that is easy to explain to people, a decision tree will often do better than a linear model: a small decision tree can be even simpler to interpret than linear regression.
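
The first two points can be checked on toy data. The sketch below assumes two synthetic targets built on a single feature, one linear and one sinusoidal; the data and noise levels are illustrative. It fits both model types to each target and compares them on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(2000, 1))

# Two synthetic targets on the same feature: one linear, one strongly non-linear.
targets = {
    "linear": 3 * X[:, 0] + rng.normal(0, 0.5, size=2000),
    "non-linear": np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, size=2000),
}

scores = {}
for name, y in targets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    for model in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
        model.fit(X_train, y_train)
        r2 = r2_score(y_test, model.predict(X_test))
        scores[(name, type(model).__name__)] = r2
        print(f"{name:10s} {type(model).__name__:22s} R^2 = {r2:.3f}")
```

On the linear target the regression line wins because the fully grown tree also fits the noise; on the sinusoidal target a straight line cannot follow the curve, so the tree scores far higher.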

##### What are ensemble methods in tree-based modeling?

The literal meaning of the word ‘ensemble’ is group. Ensemble methods combine a group of predictive models to achieve better accuracy and model stability. They are known to give a substantial boost to tree-based models.

Like every other model, a tree-based model suffers from bias and variance. Bias means ‘how much, on average, the predicted values differ from the actual value’. Variance means ‘how different the predictions of the model will be at the same point if different samples are taken from the same population’.

If we build a small tree, we get a model with low variance and high bias. Normally, as we increase the complexity of the model, prediction error falls because bias decreases. As we keep making the model more complex, it eventually over-fits and starts to suffer from high variance.
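
For squared-error loss, this trade-off is usually written as a decomposition of the expected prediction error at a point $x$. The notation here is an assumption introduced for illustration: $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\hat{f}$ is the model fitted on a random training sample.

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
$$

A small tree keeps the variance term low at the cost of a large bias term; a very deep tree does the opposite.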

A champion model should maintain a balance between these two types of error. This is known as the bias-variance trade-off. Ensemble learning is one way to manage this trade-off.

<img src="https://www.dataquest.io/wp-content/uploads/2019/01/biasvariance.png" alt="Drawing" style="height: 200px;"/>

Some of the commonly used ensemble methods include Bagging, Boosting and Stacking.
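
As a rough sketch, each of the three methods has a ready-made scikit-learn estimator. The dataset, the base models and the settings below are assumptions chosen to keep the example small, and the scores will differ on other data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.05, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each fit on a bootstrap sample, predictions combined.
    "bagging": BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0),
    # Boosting: shallow trees fit sequentially, each correcting its predecessors.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: a final model learns how to combine the base models' predictions.
    "stacking": StackingClassifier(
        estimators=[("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
                    ("logreg", LogisticRegression(max_iter=1000))],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,
    ),
}

cv_scores = {}
for name, model in models.items():
    fold_scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    cv_scores[name] = fold_scores.mean()
    print(f"{name:12s} accuracy = {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

The standard deviation across folds gives a rough sense of each model's stability alongside its mean accuracy.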

##### Which is better, Bagging or Boosting?

It depends on the data, the simulation and the circumstances.
Both Bagging and Boosting decrease the variance of a single estimate because they combine several estimates from different models, so the result may be a model with higher stability.

If the problem is that the single model performs poorly, Bagging will rarely reduce the bias. Boosting, however, can produce a combined model with lower error, because it builds on the strengths of the single model and reduces its pitfalls.

By contrast, if the difficulty of the single model is over-fitting, then Bagging is the best option. Boosting, for its part, doesn’t help to avoid over-fitting; in fact, the technique is prone to this problem itself. For this reason, Bagging is effective more often than Boosting.

##### How do Bagging and Boosting get N learners?

Bagging and Boosting get N learners by generating additional data in the training stage. N new training data sets are produced by random sampling with replacement from the original set. Because sampling is done with replacement, some observations may be repeated in each new training data set.

In the case of Bagging, every element has the same probability of appearing in a new data set. For Boosting, however, the observations are weighted, so some of them will take part in the new sets more often.

These multiple sets are used to train the same learner algorithm and therefore different classifiers are produced.
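
The two sampling schemes can be sketched with NumPy. The weights below are made up for illustration: the first 100 observations are assumed to have been misclassified by a previous learner and are given five times the weight of the rest. Some boosting implementations pass such weights directly to the learner instead of resampling; the resampling view shown here follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# Bagging: every observation is equally likely to be drawn (with replacement).
bagging_sample = rng.choice(indices, size=n, replace=True)
unique_fraction = np.unique(bagging_sample).size / n
print(f"distinct observations in a bagging sample: {unique_fraction:.1%}")

# Boosting-style resampling: hypothetical weights, with the first 100
# observations treated as "hard" and weighted 5x (illustrative values).
weights = np.ones(n)
weights[:100] *= 5
weights /= weights.sum()  # probabilities must sum to 1
boosting_sample = rng.choice(indices, size=n, replace=True, p=weights)

hard_in_bagging = np.sum(bagging_sample < 100)
hard_in_boosting = np.sum(boosting_sample < 100)
print(f"draws of the 100 hard observations: "
      f"bagging={hard_in_bagging}, boosting={hard_in_boosting}")
```

A bootstrap sample of size n contains roughly 63% of the distinct original observations, which is why each learner sees a slightly different data set.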

##### Does a Random Forest overfit?

A Random Forest with few trees is quite prone to overfitting noise. This is easy to see because a Random Forest with just one tree is essentially a single decision tree. As more trees are added, the tendency to overfit generally decreases. It never, however, reaches zero: no number of trees will remove overfitting entirely.
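
One way to observe this is to track training and test accuracy as the number of trees grows. The sketch below assumes synthetic data with label noise; the exact numbers depend on the data and the seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy data (illustrative assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

test_scores = {}
for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest.fit(X_train, y_train)
    train_acc = forest.score(X_train, y_train)
    test_scores[n_trees] = forest.score(X_test, y_test)
    print(f"n_estimators={n_trees:3d}  train={train_acc:.3f}  test={test_scores[n_trees]:.3f}")
```

Test accuracy typically improves as trees are added, yet it stays below training accuracy, which is consistent with the point that overfitting shrinks but does not vanish.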

##### Show the effect of Random Forest hyperparameter tuning on performance in Python

```python
# Load libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()

# Split dataset into features and target variable
X = pima.iloc[:, 0:8].values  # feature variables
y = pima.iloc[:, 8].values    # target variable

# Split dataset into training set (70%) and test set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Feature scaling: fit on the training set only, then apply to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
```


**_No Hyperparameter tuning_**

```python
# Baseline: 20 trees, all other hyperparameters at their defaults
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
0.7575757575757576
[[137  20]
 [ 36  38]]
```


**_n_estimators_**

```python
# Increase the number of trees from 20 to 100
classifier = RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
0.7792207792207793
[[140  17]
 [ 34  40]]
```

Increasing `n_estimators` from 20 to 100 increases accuracy from 75.76% to 77.92%.

**_criterion_**

```python
# Use entropy instead of the default gini split criterion
classifier = RandomForestClassifier(n_estimators=20, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
0.7792207792207793
[[139  18]
 [ 33  41]]
```

Changing `criterion` from gini to entropy raises accuracy from 75.76% to 77.92%.

**_max_depth_**

```python
# Limit each tree to a maximum depth of 5
classifier = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
0.7748917748917749
[[142  15]
 [ 37  37]]
```

Restricting the tree depth to 5 raises accuracy from 75.76% to 77.49%.

**_min_samples_split_**

```python
# Require at least 5 samples in a node before it can be split
classifier = RandomForestClassifier(n_estimators=20, min_samples_split=5, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```

```
0.8095238095238095
[[142  15]
 [ 29  45]]
```

Setting `min_samples_split=5` increases accuracy from 75.76% to 80.95%.
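
The experiments above change one hyperparameter at a time and compare scores on the test set. A common alternative is to search over combinations with cross-validation on the training data, for example with scikit-learn's `GridSearchCV`. The sketch below uses a synthetic stand-in dataset with 8 features so that it runs without the CSV file; that dataset and the grid values are assumptions, and the Pima `X` and `y` from above could be substituted.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in with 8 features (an assumption; not the Pima data).
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_redundant=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The same four hyperparameters explored above, now searched jointly.
param_grid = {
    "n_estimators": [20, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [5, None],
    "min_samples_split": [2, 5],
}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X_train, y_train)  # cross-validation uses the training split only

print("best parameters:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {search.score(X_test, y_test):.3f}")
```

Keeping the test set out of the search means the final test accuracy remains an unbiased check on the chosen combination.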