### ***Decision tree exercise***

#### 1) What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?
***ANS :*** It is generally O(log₂(m)), since an unrestricted tree usually ends up roughly balanced with about one leaf per instance, so for m = 1 million = 10⁶ <br>
depth ≈ log₂(10⁶) ≈ 20
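
A quick check of that arithmetic, as a minimal sketch that assumes a perfectly balanced binary tree with one leaf per training instance (real trees are only approximately balanced, so the actual depth is usually a bit larger):

```python
import math

m = 1_000_000            # number of training instances
depth = math.log2(m)     # depth of a balanced binary tree with m leaves

print(f"log2({m}) = {depth:.2f}, rounded up to {math.ceil(depth)}")
```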

#### 2) Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
***ANS :*** A node's Gini impurity is **generally** lower than its parent's, but not **always**. The training algorithm minimizes the weighted sum of the two children's impurities, so one child may end up with a higher impurity than its parent as long as the other child's lower impurity more than compensates for it.
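
A small worked example makes this concrete. The sketch below uses a made-up node with four instances of class A and one of class B, plus a hypothetical split; `gini` is a helper written for this illustration, not a library function:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

parent = ["A", "B", "A", "A", "A"]
left, right = ["A", "B"], ["A", "A", "A"]   # hypothetical split of the parent

# impurity of the split = size-weighted average of the children's impurities
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"parent:   {gini(parent):.2f}")   # 0.32
print(f"left:     {gini(left):.2f}")     # 0.50, higher than the parent
print(f"right:    {gini(right):.2f}")    # 0.00
print(f"weighted: {weighted:.2f}")       # 0.20, lower than the parent
```

The left child is more impure than its parent, yet the split is still worthwhile because the weighted impurity drops from 0.32 to 0.20.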

#### 3) If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
***ANS :*** Yes, we can regularize the tree by decreasing *max_depth*. A shallower tree cannot keep adding nodes to fit the training data very closely, which is what leads to overfitting.

#### 4) If a Decision Tree is underfitting the training set, is it a good idea to scale the input features?
***ANS :*** No, it would make no difference, because a decision tree is not a scale-sensitive model. It just finds a threshold for each feature; scaling only changes the threshold values, not the resulting splits or predictions.
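
To illustrate, here is a minimal sketch of a single-feature split search (a toy stand-in for one tree node, not scikit-learn's implementation). The data and the `best_split` helper are made up for this example:

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return (threshold, left-side mask) minimising the weighted Gini impurity."""
    best = None
    values = sorted(set(xs))
    # candidate thresholds: midpoints between consecutive distinct values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        mask = [x <= t for x in xs]
        left = [y for y, m in zip(ys, mask) if m]
        right = [y for y, m in zip(ys, mask) if not m]
        cost = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or cost < best[0]:
            best = (cost, t, mask)
    return best[1], best[2]

xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # made-up feature values
ys = [0, 0, 0, 1, 1, 1]
scaled = [(x - min(xs)) / (max(xs) - min(xs)) for x in xs]   # min-max scaling

t_raw, mask_raw = best_split(xs, ys)
t_scaled, mask_scaled = best_split(scaled, ys)

print(t_raw, t_scaled)            # the thresholds differ
print(mask_raw == mask_scaled)    # the instances are partitioned identically
```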

#### 5) If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
***ANS :*** time taken to train on 1M instances = 1H <br>
*we know the computational complexity of training a decision tree is n × m × log₂(m), where n is the number of features and m the number of instances* <br>
=> time taken to perform n × 10⁶ × log₂(10⁶) calculations = 1H<br>
=> time taken to perform 1 calculation = 1 / (n × 10⁶ × log₂(10⁶)) H<br>
time taken to train on 10M instances = (n × 10⁷ × log₂(10⁷)) / (n × 10⁶ × log₂(10⁶)) H<br>
= 10 × log₂(10⁷) / log₂(10⁶) = 10 × 7/6<br>
= 11.7 ≈ ***12H***<br>
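
The same estimate can be computed directly. This is a rough sketch that assumes training time scales exactly with n × m × log₂(m) and that the number of features n is the same for both training sets:

```python
import math

m_small, m_large = 10**6, 10**7
base_hours = 1.0   # measured training time on the smaller set

# n (the number of features) cancels out in the ratio
ratio = (m_large * math.log2(m_large)) / (m_small * math.log2(m_small))

print(f"estimated time: {base_hours * ratio:.1f} hours")
```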

#### 6) If your training set contains 100,000 instances, will setting presort=True speed up training?
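***ANS :*** No. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training. (The `presort` parameter was deprecated in scikit-learn 0.22 and later removed, so this only applies to older versions.)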

#### 7) Train and fine-tune a Decision Tree for the ***moons dataset***

In [59]:
from sklearn.datasets import make_moons
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

X,y = make_moons(n_samples=10000,noise=0.4)

In [60]:
from sklearn.model_selection import train_test_split
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.2)

In [61]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

tree_clf = DecisionTreeClassifier()
grid = {'criterion' : ["gini", "entropy"],
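        # note: with max_leaf_nodes <= 10 a tree has at most 9 splits (depth <= 9),
        # so these max_depth values never actually constrain the tree
        'max_depth' : [10,20,30,40,50,60,70,80,90],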
        'max_leaf_nodes' : [2,3,4,5,6,7,8,9,10]   
       }
rscv = RandomizedSearchCV(tree_clf,grid,verbose=2,n_iter=50,n_jobs=-1,cv=3)
rscv.fit(X_train,y_train)

tuned_model = rscv.best_estimator_

Fitting 3 folds for each of 50 candidates, totalling 150 fits


In [62]:
from sklearn.metrics import accuracy_score

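# best_estimator_ was already refit on the whole training set (refit=True is the default),
# so this extra fit is redundant
tuned_model.fit(X_train,y_train)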
y_pred = tuned_model.predict(X_test)

print(f'accuracy of tuned Decision tree is {accuracy_score(y_test,y_pred)*100}%')

accuracy of tuned Decision tree is 85.15%


#### 8) Grow a ***Random-Forest-Classifier*** with Decision-Trees

In [63]:
from sklearn.model_selection import ShuffleSplit

X,y = make_moons(10000,noise=0.4)
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.2,random_state=42)

rs = ShuffleSplit(n_splits=1000,train_size=0.0125,random_state=42,test_size=None)
#train_size=0.0125 to create samples with 100 instances

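from sklearn.base import clone

# clone() returns an independent, unfitted copy with the same hyperparameters.
# Reusing rscv.best_estimator_ directly would put the SAME object in every slot,
# so each fit would overwrite the previous one and all 1000 "trees" would be identical.
random_forest = [clone(rscv.best_estimator_) for _ in range(1000)]  # list of 1000 decision trees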

# rs.split() yields (train_indices, test_indices) pairs of shuffled indices;
# keep the 100 training indices from each of the 1000 splits
train_indices = [train_idx for train_idx, _ in rs.split(X_train)]
X_train_subset = [X_train[idx] for idx in train_indices]  # 1000 subsets of 100 instances
y_train_subset = [y_train[idx] for idx in train_indices]

In [64]:
trained_random_forest = [random_forest[i].fit(X_train_subset[i],y_train_subset[i]) for i in range(1000)]

In [65]:
from statistics import mode

predictions = [tree.predict(X_test) for tree in trained_random_forest]
final_pred = []
for i in range(len(X_test)):  # 2000 test instances
    # predictions[j][i] is the prediction of the j-th tree for the i-th instance;
    # the most frequent prediction across all trees is the majority vote
    final_pred.append(mode([predictions[j][i] for j in range(len(predictions))]))

In [66]:
print(f'Accuracy of Random forest : {accuracy_score(y_test,final_pred)*100}%')

Accuracy of Random forest : 83.8%
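
The majority vote can also be written more compactly. The sketch below uses a tiny made-up prediction table (3 trees, 4 test instances) instead of the real 1000 × 2000 one; `toy_predictions` and `majority` are names chosen for this illustration:

```python
from statistics import mode

# hypothetical predictions: one row per tree, one column per test instance
toy_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]

# zip(*rows) regroups the votes per instance: (0, 0, 1), (1, 1, 1), ...
majority = [mode(votes) for votes in zip(*toy_predictions)]
print(majority)  # [0, 1, 1, 0]
```

With NumPy arrays the same regrouping works unchanged, since `zip(*predictions)` iterates over the per-tree prediction arrays in parallel.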
