### Exercise 1
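A well-balanced binary tree with m leaves has a depth of about log2(m). A DT trained without restrictions on a training set of a million instances ends up with roughly one leaf per instance, so its approximate depth is
log2(1,000,000) ≈ 20 (slightly more in practice, since the tree is rarely perfectly balanced).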
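
The estimate is easy to check numerically. This is a minimal sketch that assumes a roughly balanced tree with about one leaf per training instance; the variable names are only illustrative.

```python
import math

m = 1_000_000  # training instances (assumption: about one leaf per instance)

# A balanced binary tree with m leaves has depth log2(m).
depth = math.log2(m)
approx_depth = math.ceil(depth)

print(f"log2({m}) = {depth:.2f}, so about {approx_depth} levels")
```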

### Exercise 2
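A node's Gini impurity is generally lower than its parent's, but not always. For example, a node holding {A, B, A, A, A} has G = 1 - (1/25 + 16/25) = 0.32. If that node is split into {A, B} and {A, A, A}, the children have Gs of 0.5 and 0 respectively, so the first child is more impure than its parent. What the split does reduce is the weighted sum of the children's impurities: 2/5 × 0.5 + 3/5 × 0 = 0.2, which is below 0.32.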
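
The numbers in this example can be reproduced with a small helper. The `gini` function below is a hypothetical illustration written for this note, not a Scikit-Learn API; it also computes the size-weighted impurity of the two children, which is the quantity a split is scored on.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

parent = ["A", "B", "A", "A", "A"]
left, right = ["A", "B"], ["A", "A", "A"]

# Size-weighted average of the children's impurities.
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)

print(f"parent={gini(parent):.2f} left={gini(left):.2f} "
      f"right={gini(right):.2f} weighted={weighted:.2f}")
```

One child (0.5) is more impure than the parent (0.32), while the weighted value (0.2) is lower.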

### Exercise 3
If a DT is overfitting a training set, it is a good idea to reduce max_depth or to increase min_samples_leaf, since both constrain (regularize) the model.

### Exercise 4
If a DT is underfitting a training set, do not bother scaling the input features. DTs do not care whether features are scaled or centered, so it will not change anything. Try increasing max_depth or decreasing min_samples_leaf instead.
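
To see why scaling makes no difference, here is a toy one-feature split finder. It is a simplified sketch written for this note, not Scikit-Learn's implementation, and `best_split` and the sample data are hypothetical. It picks the threshold with the lowest weighted Gini impurity, once on the raw feature and once on a shifted and rescaled copy.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, left_indices) minimizing the weighted Gini impurity."""
    best = None
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        left = [i for i, v in enumerate(values) if v <= threshold]
        right = [i for i, v in enumerate(values) if v > threshold]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right])) / len(values)
        if best is None or score < best[0]:
            best = (score, threshold, left)
    return best[1], best[2]

x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y = [0, 0, 0, 1, 1, 1]
x_scaled = [(v - 6.5) / 100.0 for v in x]  # arbitrary shift and rescale

thr_raw, left_raw = best_split(x, y)
thr_scaled, left_scaled = best_split(x_scaled, y)

print("raw:   ", thr_raw, left_raw)
print("scaled:", thr_scaled, left_scaled)
```

The threshold value changes with the scale, but the same instances land on each side, so the resulting tree makes the same predictions.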

### Exercise 5
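Training complexity is roughly O(n × m × log(m)), with n features and m instances. So if a training set of one million instances takes an hour to train, a set with 10 million instances would take about K times as long, where

K = (n × 10m × log(10m)) / (n × m × log(m))

  = 10 × log(10m) / log(m)

  = 10 × log(10 × 1,000,000) / log(1,000,000)

  ≈ 11.67

that is, roughly 11.67 hours.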
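
The same ratio can be computed directly. This sketch assumes the n × m × log(m) cost model used above; `training_time_ratio` is a hypothetical helper name, and the number of features cancels out, so it does not appear.

```python
import math

def training_time_ratio(m_old, m_new):
    """Ratio of n*m*log(m) training costs; the feature count n cancels out."""
    return (m_new * math.log(m_new)) / (m_old * math.log(m_old))

ratio = training_time_ratio(1_000_000, 10_000_000)
hours = 1.0 * ratio  # the one-million-instance run took one hour

print(f"{hours:.2f} hours")
```

The base of the logarithm does not matter here, because it cancels in the ratio.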
  
### Exercise 6
Presorting a training set with 100,000 instances would make training slower. Sorting every feature up front costs at least O(n × m × log(m)), and presorting (presort=True) only speeds up training for small training sets, up to a few thousand instances. Note that recent Scikit-Learn releases no longer have the presort parameter.

In [34]:
# Exercise 7
## Train and fine tune a DT for the moons dataset.
## 1) Generate a moons dataset using make_moons(n_samples=10000, 
##    noise = 0.4)
## 2) Split it into a training and test set using train_test_split.
## 3) Use grid search with cross-validation (with the help of the
##    GridSearchCV class) to find good hyperparameter values for a DTC.
## 4) Train it on the full training set using these hyper parameters
##    and measure your model's performance on the test set. ~86%

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_moons(n_samples=10000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

parameters_test = {
    'max_leaf_nodes': [10, 50, 90, 100],
    'max_depth': [5, 10, 20],
    'min_samples_leaf': range(1, 10),
    'min_samples_split': range(2, 10),
    'criterion': ['gini', 'entropy'],
}

parameters_good = {
    'max_leaf_nodes': [23, 24, 25, 26, 27],
    'max_depth': [5],
    'min_samples_leaf': [1],
    'min_samples_split': [2],
    'criterion': ['gini'],
}
dt_clf = DecisionTreeClassifier(random_state=42)
clf = GridSearchCV(dt_clf, parameters_good)
clf.fit(X_train, y_train)

print(clf.best_estimator_)

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
            max_features=None, max_leaf_nodes=24,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')


0.86280000000000001

In [48]:
# Exercise 8
## Grow a fucking forest!
## 1) Generate 1,000 random subsets of the training set, each with
##    100 instances (ShuffleSplit)
## 2) Train one DT on each subset. Evaluate the DTs on the test set.
##    Individually they will perform worse than the single tree above.
## 3) Collect the predictions of the 1,000 models and use the mode
##    (majority vote) as the final prediction.
## 4) Evaluate the predictions!

from sklearn.model_selection import ShuffleSplit
from scipy.stats import mode

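# test_size=len(X_train) - 100 leaves exactly 100 training instances per
# split. With test_size=0 every tree is fit on the whole training set, so
# the vote just reproduces the single tree (the 0.8628 recorded below
# comes from such a run).
rs = ShuffleSplit(n_splits=1000, test_size=len(X_train) - 100, random_state=42)

trees = []
for train_index, _ in rs.split(X_train):
    # Same hyperparameters the grid search selected above. Arguments left
    # at their defaults are omitted, as are presort and min_impurity_split,
    # which recent Scikit-Learn releases no longer accept.
    tree = DecisionTreeClassifier(criterion='gini', max_depth=5,
                                  max_leaf_nodes=24)
    tree.fit(X_train[train_index], y_train[train_index])
    trees.append(tree)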

preds = []
for tree in trees:
    preds.append(tree.predict(X_test))

mode_pred, _ = mode(preds)

print(mode_pred.T.shape)
print(y_test.shape)

accuracy_score(y_test, mode_pred.T)

(2500, 1)
(2500,)


0.86280000000000001

In [62]:
from sklearn.ensemble import RandomForestClassifier

# max_features=None makes every tree consider all features at each split.
# Arguments left at their defaults are omitted, as is min_impurity_split,
# which recent Scikit-Learn releases no longer accept.
rf_clf = RandomForestClassifier(n_estimators=1000, criterion='gini',
            max_depth=5, max_features=None, max_leaf_nodes=24)
rf_clf.fit(X_train, y_train)
y_preds = rf_clf.predict(X_test)
accuracy_score(y_test, y_preds)

0.86519999999999997