# 1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?

The depth of a well-balanced binary tree containing m leaves is equal to log2(m), rounded up. A binary Decision Tree (scikit-learn only builds binary trees) ends up more or less balanced at the end of training, with one leaf per training instance if it is trained without restrictions. If the training set contains 1 million instances, the Decision Tree will therefore have a depth of about log2(10^6) ≈ 20 (in practice a bit more, since the tree will generally not be perfectly balanced).
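The arithmetic can be checked with a couple of lines. This is only a sanity check of the log2 estimate for a perfectly balanced tree; the variable names are illustrative.

```python
import math

m = 1_000_000  # number of training instances, one leaf each

# Depth of a perfectly balanced binary tree with m leaves.
min_depth = math.ceil(math.log2(m))
print(min_depth)  # 20
```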

# 2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease in the other child's impurity.
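A small worked example may help. The sketch below uses a made-up node holding four instances of class A and one of class B; the `gini` helper and the chosen split are illustrative, not taken from scikit-learn. One child ends up more impure than the parent, yet the weighted sum of the children's impurities is still lower.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

# Hypothetical node and one possible split of it.
parent = ["A", "B", "A", "A", "A"]
left, right = ["A", "B"], ["A", "A", "A"]

# CART's cost for this split: children's impurities weighted by their size.
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"parent={gini(parent):.2f} left={gini(left):.2f} "
      f"right={gini(right):.2f} weighted={weighted:.2f}")
```

Here the left child (impurity 0.5) is worse than the parent (0.32), but the pure right child brings the weighted impurity down to 0.2.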

# 3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

Yes. Decreasing max_depth constrains the model (it regularizes it), which reduces overfitting and should help the tree generalize better.
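As a rough illustration, the sketch below compares an unrestricted tree with one limited to max_depth=3. The dataset size, the seed, and the value 3 are assumptions chosen for the example, not values from the exercise.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: a small noisy moons dataset with a fixed seed for reproducibility.
X, y = make_moons(n_samples=1000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Compare the actual depth and the train/test accuracy of both trees.
for name, tree in [("unrestricted", deep_tree), ("max_depth=3", shallow_tree)]:
    print(name, tree.get_depth(),
          tree.score(X_train, y_train), tree.score(X_test, y_test))
```

A large gap between training and test accuracy for the unrestricted tree is the usual sign of overfitting; the constrained tree trades some training accuracy for a smaller gap.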

# 4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

No. Decision Trees do not care whether the input features are scaled or centered, so scaling them will not change anything, whether the tree is underfitting or overfitting.
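The reason is that a split only depends on the ordering of the feature values, not on their magnitude. The following pure-Python sketch (a simplified single-feature split search written for illustration, not scikit-learn's implementation) shows that the best split sends the same instances to the left child before and after a linear rescaling.

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Return the indices sent to the left child by the lowest-cost split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_cost, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between two equal values
        left, right = order[:k], order[k:]
        cost = (len(left) * gini([ys[i] for i in left])
                + len(right) * gini([ys[i] for i in right])) / len(xs)
        if cost < best_cost:
            best_cost, best_left = cost, frozenset(left)
    return best_left

# Made-up feature values and labels.
xs = [2.0, 3.5, 1.0, 8.0, 9.5, 7.0]
ys = [0, 0, 0, 1, 1, 0]
scaled = [1000 * x + 5 for x in xs]  # arbitrary linear rescaling

print(sorted(best_split(xs, ys)), sorted(best_split(scaled, ys)))
```

Only the threshold value changes with the scale; the partition of the instances, and therefore the tree's predictions, stays the same.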

# 5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

The computational complexity of training a Decision Tree is O(n × m log(m)), where m is the number of training instances and n the number of features. If the training set size is multiplied by 10, the training time is multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). With m = 10^6, K ≈ 11.7, so the expected training time is roughly 11.7 hours.
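The ratio is easy to verify numerically. The snippet below only evaluates the formula; the base of the logarithm does not matter because it cancels out in the ratio.

```python
import math

m = 1_000_000          # original number of training instances
hours_for_m = 1.0      # measured training time for m instances

# Ratio of n * 10m * log(10m) to n * m * log(m); n cancels out.
K = 10 * math.log(10 * m) / math.log(m)
print(round(K * hours_for_m, 2))  # 11.67
```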

# 6. If your training set contains 100,000 instances, will setting presort=True speed up training?

Presorting the training set (presort=True) speeds up training only for small datasets (fewer than a few thousand instances). For larger datasets, presorting slows training down considerably. With 100,000 instances, setting presort=True will therefore slow down training. Note that the scikit-learn version used in this notebook reports presort='deprecated' in the estimator output shown in exercise 7.

# 7. Train and fine-tune a Decision Tree for the moons dataset.

## a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).

In [11]:
import numpy as np

In [1]:
from sklearn.datasets import make_moons

In [3]:
X, y = make_moons(n_samples=10000, noise=0.4)

## b. Split it into a training set and a test set using train_test_split().

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [10]:
dt_clf = DecisionTreeClassifier(random_state=42)

In [33]:
params = {
    "max_depth": np.arange(1,30,10),
    "max_leaf_nodes":np.arange(10,50,10),
}

In [34]:
gridsearch_cv = GridSearchCV(dt_clf,param_grid=params)

In [73]:
gridsearch_cv.fit(X_train,y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': array([ 1, 11, 21]),
                  

In [74]:
print(f"Best estimators:{gridsearch_cv.best_estimator_}")
print(f"Best parameters:{gridsearch_cv.best_params_}")
print(f"Best score:{gridsearch_cv.best_score_}")

Best estimators:DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=11, max_features=None, max_leaf_nodes=30,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')
Best parameters:{'max_depth': 11, 'max_leaf_nodes': 30}
Best score:0.8591044776119403


## d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

In [75]:
gridsearch_cv.fit(X_train,y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features=None,
                                              max_leaf_nodes=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              presort='deprecated',
                                              random_state=42,
                                              splitter='best'),
             iid='deprecated', n_jobs=None,
             param_grid={'max_depth': array([ 1, 11, 21]),
                  

In [76]:
predicted_y = gridsearch_cv.predict(X_test)

In [77]:
from sklearn.metrics import f1_score,confusion_matrix,classification_report,accuracy_score

In [78]:
print(accuracy_score(y_test,predicted_y))

0.843939393939394


In [79]:
print(f1_score(y_test,predicted_y))

0.8483956432146011


In [80]:
print(confusion_matrix(y_test, predicted_y))  # y_true first: rows are actual classes

[[1344  279]
 [ 236 1441]]


In [54]:
print(classification_report(predicted_y,y_test))

              precision    recall  f1-score   support

           0       0.85      0.87      0.86      1611
           1       0.88      0.85      0.87      1689

    accuracy                           0.86      3300
   macro avg       0.86      0.86      0.86      3300
weighted avg       0.86      0.86      0.86      3300



# 8. Grow a forest.

## a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use ScikitLearn’s ShuffleSplit class for this.

In [55]:
from sklearn.model_selection import ShuffleSplit

In [56]:
shufflesplit = ShuffleSplit(n_splits=1000,random_state=42,test_size=0.33)

## b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

In [141]:
predicted_values = []
target_test = []
accuracy_scores = []
for train_index, test_index in shufflesplit.split(X,y):
    gridsearch_cv.fit(X[train_index],y[train_index])
    predicted_y = gridsearch_cv.predict(X[test_index])
    target_test.append(y[test_index])
    predicted_values.append(predicted_y)
    accuracy_scores.append(accuracy_score(y[test_index],predicted_y))          

In [142]:
np.mean(accuracy_scores)

0.8554606060606061

## c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This gives you majority-vote predictions over the test set.

In [143]:
from scipy import stats

In [144]:
m = stats.mode(predicted_values)

## d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!

In [145]:
print(accuracy_score(stats.mode(predicted_values)[0][0],y_test))

0.5057575757575757
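
The 0.5057 accuracy above is close to chance level for this balanced two-class problem. A likely cause is that the loop splits the full dataset X, y and predicts on a different X[test_index] for every split, so the rows of predicted_values do not line up with y_test. The ShuffleSplit also uses test_size=0.33 rather than 100-instance training subsets, and each iteration refits the whole grid search instead of a single tree.

The sketch below shows one way to follow the exercise as stated, under several assumptions that are not taken from the notebook: random_state=42 is added to make_moons for reproducibility, the names n_trees, n_instances, base_tree, Y_pred and y_vote are illustrative, and the majority vote is computed with NumPy (valid for 0/1 labels) instead of scipy.stats.mode, whose return shape differs between SciPy versions. The hyperparameters max_depth=11 and max_leaf_nodes=30 are the best_params_ reported in exercise 7.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: fixed seed so the sketch is reproducible.
X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

n_trees, n_instances = 1000, 100

# Each split keeps n_instances training-set rows; the rest is ignored.
splitter = ShuffleSplit(
    n_splits=n_trees, test_size=len(X_train) - n_instances, random_state=42
)

base_tree = DecisionTreeClassifier(max_depth=11, max_leaf_nodes=30, random_state=42)

# One row of test-set predictions per tree; every row covers the SAME X_test.
Y_pred = np.empty((n_trees, len(X_test)), dtype=y_test.dtype)
for i, (subset_index, _) in enumerate(splitter.split(X_train)):
    tree = clone(base_tree).fit(X_train[subset_index], y_train[subset_index])
    Y_pred[i] = tree.predict(X_test)

# Accuracy of each individual tree on the test set.
tree_accuracies = (Y_pred == y_test).mean(axis=1)

# Majority vote per test instance (labels are 0/1, ties go to class 0).
y_vote = (Y_pred.mean(axis=0) > 0.5).astype(y_test.dtype)
vote_accuracy = accuracy_score(y_test, y_vote)

print("mean single-tree accuracy:", tree_accuracies.mean())
print("majority-vote accuracy:", vote_accuracy)
```

Because every tree predicts on the same X_test, column j of Y_pred holds the 1,000 votes for test instance j, which is what makes the comparison with y_test meaningful.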
