# Decision trees

This exploratory exercise introduces decision trees and shows how they can be used in scikit-learn.

## Instructions:

* Go through the notebook and complete the tasks. 
* Make sure you understand the examples given. If you need help, refer to the documentation links provided or go to the discussion forum. 
* When a question allows a free-form answer (e.g. what do you observe?), create a new markdown cell below and answer the question in the notebook. 
* Save your notebooks when you are done.

Before you do the tasks below, go through the scikit-learn decision tree tutorial <a href="https://scikit-learn.org/stable/modules/tree.html">here</a>, with the classifier described <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a>. The tutorial contains instructions on how to use decision trees for both classification and regression in Python. 

**Task 1:**
Using what you learnt in the scikit-learn decision tree tutorial, apply decision trees to classification on the iris dataset and to regression on the diabetes dataset (both included in `sklearn.datasets`). Your code should print the accuracy and the confusion matrix for the classification problem, and the mean squared error for the regression problem. Compare the results for different maximum tree depths.

Note: Split your data into 80% for training and 20% for testing.



In [8]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the Iris dataset
iris = datasets.load_iris()

In [7]:

X = iris.data
y = iris.target

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a list of maximum tree depths to experiment with
max_depths = [2, 3, 4, 5]

for max_depth in max_depths:
    # Create a Decision Tree classifier
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    
    # Train the classifier
    clf.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Print results
    print(f"Max Depth: {max_depth}")
    print(f"Accuracy: {accuracy}")
    print(f"Confusion Matrix:\n{cm}\n")


Max Depth: 2
Accuracy: 0.9666666666666667
Confusion Matrix:
[[10  0  0]
 [ 0  8  1]
 [ 0  0 11]]

Max Depth: 3
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Max Depth: 4
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]

Max Depth: 5
Accuracy: 1.0
Confusion Matrix:
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]



In [9]:
# Now evaluate a regression on the diabetes dataset 
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

diabetes = datasets.load_diabetes()



In [10]:
X = diabetes.data
y = diabetes.target

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a list of maximum tree depths to experiment with
max_depths = [2, 3, 4, 5]

for max_depth in max_depths:
    # Create a Decision Tree regressor
    reg = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    
    # Train the regressor
    reg.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = reg.predict(X_test)
    
    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)
    
    # Print results
    print(f"Max Depth: {max_depth}")
    print(f"Mean Squared Error: {mse}\n")

Max Depth: 2
Mean Squared Error: 3866.038156768628

Max Depth: 3
Mean Squared Error: 3656.186930948001

Max Depth: 4
Mean Squared Error: 3594.089844855363

Max Depth: 5
Mean Squared Error: 3545.4104698436895



**Task 2:**
How would you avoid overfitting in decision trees? Read the decision tree classifier documentation <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a> for help. 




To avoid overfitting in decision trees, you can implement the following strategies:

1. **Limit the maximum depth of the tree:** One of the simplest ways to prevent overfitting is to cap the depth with the `max_depth` parameter. This stops the tree from becoming complex enough to fit the training data too closely.

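2. **Set a minimum number of samples per leaf node:** The `min_samples_leaf` parameter requires every leaf to contain at least that many training samples, which prevents the tree from creating leaves that describe only a few, possibly noisy, samples. The related `min_samples_split` parameter sets the minimum number of samples a node needs before it can be split.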

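3. **Prune the tree:** After the tree has been built, you can prune it by removing nodes that do not contribute significantly to the model's performance. In scikit-learn this is done with minimal cost-complexity pruning, controlled by the `ccp_alpha` parameter: the default of 0.0 performs no pruning, and larger values prune more nodes.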

4. **Use feature selection or dimensionality reduction techniques:** Before training the tree, you can perform feature selection or apply dimensionality reduction techniques to reduce the number of features used in the training process.

5. **Use ensemble methods:** Ensemble methods like Random Forests or Gradient Boosting combine multiple decision trees to improve generalization and reduce overfitting.

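6. **Cross-validation:** Use techniques like k-fold cross-validation to evaluate the model's performance on different subsets of the data. Cross-validation does not reduce overfitting by itself, but it gives a more robust estimate of performance on unseen data and is the usual way to choose values for parameters such as `max_depth` or `ccp_alpha`.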

7. **Collect more data:** Having more data can help the model generalize better, as it has more examples to learn from.

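8. **Regularization techniques:** Some tree-based methods, such as Regularized Greedy Forest, include explicit regularization terms that help prevent overfitting. Within `DecisionTreeClassifier`, parameters such as `min_impurity_decrease` and `max_leaf_nodes` play a similar constraining role.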

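9. **Early stopping:** A single scikit-learn tree is built in one call to `fit`, so early stopping here means stopping growth early (pre-pruning) once a criterion such as `max_depth`, `min_samples_split`, or `min_impurity_decrease` is met. For models trained iteratively, such as gradient boosting, monitor the performance on a validation set and stop adding trees once the performance plateaus or starts to degrade.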
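
Several of these strategies can be combined in one experiment. The following is a minimal sketch, assuming the iris data and the 80/20 split from Task 1: it uses `GridSearchCV` with 5-fold cross-validation to pick a depth limit, a minimum leaf size, and a pruning strength. The candidate values in `param_grid` are illustrative assumptions, not recommendations.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Illustrative grid covering strategies 1 to 3; the values are assumptions.
param_grid = {
    "max_depth": [2, 3, 4, None],
    "min_samples_leaf": [1, 5, 10],
    "ccp_alpha": [0.0, 0.01, 0.05],
}

# 5-fold cross-validation on the training set only (strategy 6);
# the test set is kept aside for a final check.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Mean cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_test, y_test):.3f}")
```

Keeping the test set out of the search means the final accuracy is measured on data that played no part in choosing the parameters.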

Remember that the choice of method(s) depends on the specific problem, the nature of the data, and the behavior of the decision tree during training. It is often good practice to try multiple approaches and compare their performance.
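
One way to compare approaches is to score each candidate with the same cross-validation procedure. The following is a minimal sketch, assuming the diabetes dataset from Task 1; the three candidate models and their settings are chosen here for illustration only.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Illustrative candidates: no constraint, a depth limit, and an ensemble.
candidates = {
    "Unconstrained tree": DecisionTreeRegressor(random_state=42),
    "Tree with max_depth=3": DecisionTreeRegressor(max_depth=3, random_state=42),
    "Random forest (100 trees)": RandomForestRegressor(n_estimators=100, random_state=42),
}

cv_mse = {}
for name, model in candidates.items():
    # scikit-learn scorers are maximised, so MSE is returned negated; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[name] = -scores.mean()
    print(f"{name}: mean CV MSE = {cv_mse[name]:.1f}")
```

Because every model is scored on the same five folds, differences in the mean error reflect the models rather than a particular train/test split.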