## Build a single Decision Tree model

In [None]:
# # Import model.
# from sklearn.tree import DecisionTreeClassifier

In [None]:
# # Instantiate model with random_state = 42.
# dt = DecisionTreeClassifier(random_state = 42)

In [None]:
# # Fit model.
# dt.fit(X_train, y_train)

In [None]:
# # Evaluate model.
# print(f'Score on training set: {dt.score(X_train, y_train)}')
# print(f'Score on testing set: {dt.score(X_test, y_test)}')

Decision trees tend to overfit. To address this problem, we can:
- Gather more data (as with all models).
- Remove some features (as with all models).
- Stop the tree from growing too large by tuning its hyperparameters.
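
To make the overfitting concrete, here is a minimal sketch comparing an unrestricted tree with a depth-limited one. It uses synthetic data from `make_classification` as a stand-in for this notebook's `X` and `y` (an assumption), and `max_depth=3` is an arbitrary illustrative value.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data standing in for the notebook's own X and y (assumption).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# An unrestricted tree keeps splitting until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A depth-limited tree is stopped early.
shallow_tree = DecisionTreeClassifier(max_depth=3,
                                      random_state=42).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: depth={model.get_depth()}, "
          f"train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")
```

A large gap between the training and testing scores of the unrestricted tree is the usual sign of overfitting.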

### Tuning Hyperparameters of Decision Trees
There are four hyperparameters of decision trees that we may commonly tune in order to prevent overfitting.

- `max_depth`: The maximum depth of the tree.
    - By default, the nodes are expanded until all leaves are pure (or some other argument limits the growth of the tree).
    - In the 20 questions analogy, this is like asking, "How many questions can we ask?"
    
    
- `min_samples_split`: The minimum number of samples required to split an internal node.
    - By default, the minimum number of samples required to split is 2. That is, if there are two or more observations in a node and if we haven't already achieved maximum purity, we can split it!
    
    
- `min_samples_leaf`: The minimum number of samples required to be in a leaf node (a terminal node at the end of the tree).
    - By default, the minimum number of samples required in a leaf node is 1. (This should ring alarm bells - it's very possible that we'll overfit our model to the data!)


- `ccp_alpha`: A [complexity parameter](https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning) used for minimal cost-complexity pruning, similar to $\alpha$ in regularization. As `ccp_alpha` increases, we regularize more (more of the tree is pruned away).
    - By default, this value is 0.

[Source: Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
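
As a rough illustration of how these four hyperparameters restrain a tree, the sketch below fits a default tree and a constrained tree, then reports each one's depth, number of leaves, and smallest leaf. The synthetic dataset and the helper `describe` are hypothetical names introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the notebook's training set (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=42)


def describe(model):
    """Return (depth, number of leaves, smallest leaf size) for a fitted tree."""
    tree = model.tree_
    is_leaf = tree.children_left == -1  # scikit-learn marks leaf nodes with -1
    smallest_leaf = int(tree.n_node_samples[is_leaf].min())
    return model.get_depth(), model.get_n_leaves(), smallest_leaf


default_tree = DecisionTreeClassifier(random_state=42).fit(X, y)
constrained_tree = DecisionTreeClassifier(max_depth=5,
                                          min_samples_split=7,
                                          min_samples_leaf=3,
                                          ccp_alpha=0.01,
                                          random_state=42).fit(X, y)

print("default     (depth, leaves, smallest leaf):", describe(default_tree))
print("constrained (depth, leaves, smallest leaf):", describe(constrained_tree))
```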

In [None]:
# # Instantiate model with:
# # - a maximum depth of 5.
# # - at least 7 samples required in order to split an internal node.
# # - at least 3 samples in each leaf node.
# # - a cost-complexity pruning parameter (ccp_alpha) of 0.01.
# # - random state of 42.

# dt = DecisionTreeClassifier(max_depth = 5,
#                             min_samples_split = 7,
#                             min_samples_leaf = 3,
#                             ccp_alpha = 0.01,
#                             random_state = 42)

## Use GridSearch to tune hyperparameters and find a better model

In [2]:
# from sklearn.model_selection import GridSearchCV

In [3]:
# grid = GridSearchCV(estimator = DecisionTreeClassifier(random_state = 42),
#                     param_grid = {'max_depth': [2, 3, 5, 7],
#                                   'min_samples_split': [5, 10, 15, 20],
#                                   'min_samples_leaf': [2, 3, 4, 5, 6],
#                                   'ccp_alpha': [0, 0.001, 0.01, 0.1, 1, 10]},
#                     cv = 5,
#                     verbose = 1)

In [4]:
# import time

# # Start our timer.
# t0 = time.time()

# # Let's GridSearch over the above parameters on our training data.
# grid.fit(X_train, y_train)

# # Stop our timer and print the result.
# print(time.time() - t0)

In [None]:
# # What is our best decision tree?
# print(grid.best_estimator_)

# # What was the cross-validated score of the above decision tree?
# print(grid.best_score_)

# # Evaluate model.
# print(f'Score on training set: {grid.score(X_train, y_train)}')
# print(f'Score on testing set: {grid.score(X_test, y_test)}')

# # Generate predictions on test set.
# preds = grid.predict(X_test)

# # Import confusion_matrix.
# from sklearn.metrics import confusion_matrix

# # Generate confusion matrix.
# tn, fp, fn, tp = confusion_matrix(y_test,
#                                   preds).ravel()

# print(confusion_matrix(y_test, preds))

# # Calculate sensitivity.

# sens = tp / (tp + fn)
# print(f'Sensitivity: {round(sens, 4)}')

# # Calculate specificity.

# spec = tn / (tn + fp)
# print(f'Specificity: {round(spec, 4)}')
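
The grid search and metric calculations above can be tried end to end on toy data. This is a sketch under stated assumptions: the dataset is synthetic, and the grid is deliberately smaller than the one above so that it runs quickly. It also cross-checks the hand-computed sensitivity and specificity against `recall_score`.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the notebook's own X and y (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# A deliberately small grid so the search finishes quickly.
grid = GridSearchCV(estimator=DecisionTreeClassifier(random_state=42),
                    param_grid={"max_depth": [2, 3, 5],
                                "min_samples_leaf": [2, 4],
                                "ccp_alpha": [0.0, 0.001, 0.01]},
                    cv=5)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validated score: {grid.best_score_:.4f}")

# Sensitivity and specificity from the confusion matrix.
preds = grid.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
sens = tp / (tp + fn)
spec = tn / (tn + fp)

# Recall on the positive class is sensitivity;
# recall on the negative class is specificity.
print(f"Sensitivity: {sens:.4f} vs {recall_score(y_test, preds):.4f}")
print(f"Specificity: {spec:.4f} vs {recall_score(y_test, preds, pos_label=0):.4f}")
```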

### Why use a decision tree?

1. We don't have to scale our data.
2. Decision trees don't make assumptions about how our data is distributed.
3. Decision trees are easy to interpret (for example, through feature importances).
4. Decision trees are fast (they fit very quickly).
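
As one way to see the interpretability point, the sketch below fits a small tree on scikit-learn's bundled iris dataset (chosen only for illustration, not this notebook's data), prints its decision rules with `export_text`, and lists its feature importances.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
feature_names = list(iris.feature_names)

# Note that the features are used as-is, with no scaling step.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# Human-readable if/else rules for the fitted tree.
rules = export_text(tree, feature_names=feature_names)
print(rules)

# Impurity-based feature importances, which sum to 1.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```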

### Why not use a decision tree?

1. Decision trees can very easily overfit.
2. Decision trees are locally optimal: Because we're making the best decision at each node (greedy), we might end up with a worse solution in the long run.
3. Decision trees don't work well with unbalanced data. (Check out the `class_weight` parameter if you're interested.)
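
For the `class_weight` parameter mentioned above, here is a minimal sketch on a synthetic dataset with roughly a 95/5 class split (an assumption made for illustration). It shows the weights that `class_weight='balanced'` implies and compares recall on the minority class with and without weighting. Results will vary by dataset, so treat the comparison as exploratory.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Synthetic unbalanced data: about 95% class 0 and 5% class 1 (assumption).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# 'balanced' weights are n_samples / (n_classes * class_count),
# so the rarer class receives the larger weight.
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))

plain = DecisionTreeClassifier(max_depth=3, random_state=42)
plain.fit(X_train, y_train)

weighted = DecisionTreeClassifier(max_depth=3, class_weight="balanced",
                                  random_state=42)
weighted.fit(X_train, y_train)

plain_recall = recall_score(y_test, plain.predict(X_test))
weighted_recall = recall_score(y_test, weighted.predict(X_test))
print(f"Minority-class recall without weights: {plain_recall:.3f}")
print(f"Minority-class recall with weights:    {weighted_recall:.3f}")
```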

## Decision trees vs logistic regression

- **Interpretability**: The coefficients in a logistic regression model are interpretable. (Each coefficient represents the change in log-odds associated with a one-unit increase in the corresponding input variable, holding the others constant.) However, this is complicated and not easy to explain to non-technical audiences. Decision trees are interpretable in a more accessible way; it is easy to show a picture of a decision tree to a client or boss and get them to understand how predictions are made.
<br>

- **Performance**: Decision trees have a tendency to easily overfit, while logistic regression models usually do not overfit as easily.
<br>

- **Assumptions**: Decision trees make no assumptions about how data are distributed; logistic regression does make assumptions (for example, that the log-odds of the outcome are a linear function of the inputs).
<br>

- **Frequency**: Logistic regression is more commonly used than decision trees.
<br>

- **Y variable**: Decision trees can handle regression and classification problems; logistic regression is only really used for classification problems.
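
The comparison above can be explored with a short sketch. It fits both models on the same synthetic data (a stand-in for real data, so the scores are illustrative only), converts the logistic regression coefficients to odds ratios, and then fits a `DecisionTreeRegressor` to show that trees also handle a continuous target.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Synthetic classification data (assumption, for illustration only).
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Performance: compare the gap between training and testing scores.
for name, model in [("logistic regression", logreg), ("decision tree", tree)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")

# Interpretability: exponentiating a log-odds coefficient gives an odds ratio.
odds_ratios = np.exp(logreg.coef_[0])
print("Odds ratios:", np.round(odds_ratios, 3))

# Y variable: a tree can also predict a continuous target.
X_reg, y_reg = make_regression(n_samples=200, n_features=3, noise=10.0,
                               random_state=42)
reg_tree = DecisionTreeRegressor(max_depth=3, random_state=42).fit(X_reg, y_reg)
reg_preds = reg_tree.predict(X_reg[:3])
print("Regression tree predictions:", np.round(reg_preds, 2))
```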