1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?
+ The depth of a well-balanced binary tree containing m leaves is equal to log2(m), rounded up.
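+ Trained without restrictions, a Decision Tree ends up with roughly one leaf per training instance. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log2(10^6) ≈ 20 (actually a bit more, since the tree will generally not be perfectly well balanced).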
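
As a quick sanity check of this estimate, the sketch below computes the depth of a perfectly balanced binary tree with one leaf per training instance. The helper name `balanced_tree_depth` is illustrative, and the one-leaf-per-instance assumption is an idealization of an unrestricted tree.

```python
import math

def balanced_tree_depth(n_leaves: int) -> int:
    """Depth of a perfectly balanced binary tree with n_leaves leaves."""
    return math.ceil(math.log2(n_leaves))

depth = balanced_tree_depth(1_000_000)
print(depth)  # 20
```

A real tree is rarely perfectly balanced, so the value printed here should be read as a lower bound on the depth.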

2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?
+ A node's Gini impurity is generally lower than its parent's. This is ensured by the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, if one child is smaller than the other, it is possible for it to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease of the other child's impurity.
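
To make the exception concrete, here is a minimal sketch with a hypothetical node holding four instances of class A and one of class B, split into a small mixed child and a larger pure child. The labels, the split, and the helper `gini` are assumptions chosen for illustration only.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

parent = ["A", "B", "A", "A", "A"]
left, right = ["A", "B"], ["A", "A", "A"]

# Weighted sum of the children's impurities, the quantity CART minimizes.
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"parent:   {gini(parent):.2f}")   # 0.32
print(f"left:     {gini(left):.2f}")     # 0.50
print(f"right:    {gini(right):.2f}")    # 0.00
print(f"weighted: {weighted:.2f}")       # 0.20
```

The small left child is more impure than its parent (0.50 versus 0.32), yet the weighted impurity of the two children (0.20) is still lower than the parent's.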

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?
+ If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
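
The sketch below compares an unrestricted tree with one limited to `max_depth=3` on a small moons dataset. The dataset size, the depth value, and every name ending in `_demo` are assumptions made for this illustration; whether the test score actually improves depends on the data.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Small illustrative dataset; names ending in _demo are specific to this sketch.
X_demo, y_demo = make_moons(n_samples=1000, noise=0.3, random_state=42)
Xtr_demo, Xte_demo, ytr_demo, yte_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

results = {}
for max_depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    tree.fit(Xtr_demo, ytr_demo)
    # Record the actual depth plus training and test accuracy.
    results[max_depth] = {
        "depth": tree.get_depth(),
        "train": tree.score(Xtr_demo, ytr_demo),
        "test": tree.score(Xte_demo, yte_demo),
    }
    print(max_depth, results[max_depth])
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the constrained tree trades some training accuracy for a smaller gap.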

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?
+ Decision Trees don't care whether or not the training data is scaled or centered; scaling the input features will just be a waste of time.
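
The reason is that a split only compares feature values with a threshold, so any order-preserving shift or rescaling leaves the resulting partition unchanged. The sketch below is a simplified single-feature split search written for illustration (it is not scikit-learn's implementation); the helper names and the toy data are assumptions.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_partition(values, labels):
    """Return the indices sent to the left child by the best single-feature split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        # Only split between distinct feature values.
        if values[order[k - 1]] == values[order[k]]:
            continue
        left, right = order[:k], order[k:]
        cost = (len(left) * gini([labels[i] for i in left])
                + len(right) * gini([labels[i] for i in right])) / len(order)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

x = [2.0, 7.5, 3.1, 9.0, 4.4, 8.2]
y = [0, 1, 0, 1, 0, 1]
x_scaled = [(v - 5.0) / 100.0 for v in x]  # arbitrary shift and rescale

print(best_partition(x, y) == best_partition(x_scaled, y))  # True
```

Both searches send the same instances to the left child; only the numeric threshold differs, which is why scaling cannot fix an underfitting tree.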

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?
+ The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
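
The arithmetic can be checked with a short sketch, where m is the number of instances and the number of features n is held fixed so that it cancels out. The function name `training_time_ratio` is illustrative, and the result is only as good as the O(n × m log(m)) model itself.

```python
import math

def training_time_ratio(m_old: int, m_new: int) -> float:
    """Ratio of training times under the O(n * m * log(m)) model, with n fixed."""
    return (m_new * math.log(m_new)) / (m_old * math.log(m_old))

K = training_time_ratio(1_000_000, 10_000_000)
print(f"{K:.1f}")  # 11.7
```

The base of the logarithm does not matter here, since it cancels in the ratio.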

6. If your training set contains 100,000 instances, will setting presort=True speed up training?
+ Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training.

## **7. Train and fine-tune a Decision Tree for the moons dataset.**

In [11]:
from sklearn.datasets import make_moons
X,y = make_moons(n_samples=10000, noise=0.3, random_state=42)

In [13]:
# Split the data into training and test sets for fitting and evaluating the model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, train_size=0.8, random_state=42)

In [15]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

params = {'max_leaf_nodes':list(range(2,100)), 'min_samples_split':[2,3,4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, verbose=1, cv=3)

grid_search_cv.fit(X_train, y_train)

Fitting 3 folds for each of 294 candidates, totalling 882 fits


In [16]:
grid_search_cv.best_estimator_

In [18]:
from sklearn.metrics import classification_report
y_pred = grid_search_cv.predict(X_test)
score = classification_report(y_test, y_pred)
print(score)

              precision    recall  f1-score   support

           0       0.93      0.93      0.93      1013
           1       0.93      0.92      0.93       987

    accuracy                           0.93      2000
   macro avg       0.93      0.93      0.93      2000
weighted avg       0.93      0.93      0.93      2000



## **8. Grow a forest.**

In [22]:
print(len(X_train))

8000


In [24]:
from sklearn.model_selection import ShuffleSplit

n_trees = 100
n_instance = 100

mini_sets = []
# test_size is the absolute number of held-out samples, so each mini training set keeps n_instance samples
sp = ShuffleSplit(n_splits=n_trees, test_size=len(X_train) - n_instance, random_state=42)

for mini_train_index, mini_test_index in sp.split(X_train):
    X_mini_train = X_train[mini_train_index]
    y_mini_train = y_train[mini_train_index]
    mini_sets.append((X_mini_train, y_mini_train))


b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy. (This notebook sets `n_trees = 100`, so only 100 trees are trained and evaluated below.)

In [26]:
from sklearn.base import clone
import numpy as np
from sklearn.metrics import accuracy_score

forest = [clone(grid_search_cv.best_estimator_) for _ in range(n_trees)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, mini_sets):
    tree.fit(X_mini_train, y_mini_train)
    
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

0.8675200000000001

c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's `mode()` function for this). This gives you _majority-vote predictions_ over the test set.

In [30]:
forest

[DecisionTreeClassifier(max_leaf_nodes=18, random_state=42),
 DecisionTreeClassifier(max_leaf_nodes=18, random_state=42),
 ...

In [27]:
Y_pred = np.empty([n_trees, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

In [28]:
from scipy.stats import mode

y_pred_majority_votes, n_votes = mode(Y_pred, axis=0)

In [29]:
accuracy_score(y_test, y_pred_majority_votes.reshape([-1]))

0.9285
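
For binary 0/1 labels, the same majority vote can also be sketched with NumPy alone, which avoids depending on the return shape of `mode()`. The toy array below stands in for `Y_pred` and is purely hypothetical; in this sketch a tie between the two classes resolves to class 0.

```python
import numpy as np

# Toy predictions: 5 hypothetical trees (rows) voting on 4 instances (columns).
toy_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
], dtype=np.uint8)

# Class 1 wins when more than half of the trees vote for it.
majority = (toy_preds.mean(axis=0) > 0.5).astype(np.uint8)
print(majority)  # [0 1 1 0]
```

This shortcut only works for two classes encoded as 0 and 1; with more classes, a per-column count of votes is needed instead.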