# Module 2: Decision Trees Practice

## Introduction
In this notebook, you'll learn how to implement a Decision Tree classifier using scikit-learn on a multiclass dataset.

## Initial Knowledge Check
1. What is the main idea behind a decision tree?
2. How does tree depth affect underfitting vs. overfitting?
3. Explain how the splitting criteria (Gini impurity vs. entropy) differ.
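
As a warm-up for question 3, here is a minimal sketch of the two impurity measures computed from per-class sample counts in a node. The helper names `gini` and `entropy` and the example counts are made up for illustration; they are not part of scikit-learn's API.

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical class counts inside three different nodes
nodes = {
    "pure node": [10, 0, 0],
    "two classes, evenly mixed": [5, 5, 0],
    "three classes, evenly mixed": [4, 4, 4],
}

for name, counts in nodes.items():
    print(f"{name}: gini={gini(counts):.3f}, entropy={entropy(counts):.3f}")
```

Both measures are zero for a pure node and largest when the classes are evenly mixed. They differ in scale and in how sharply they penalize mixing, which is why they often, but not always, pick the same split.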


## 1. Load the Data
Load the dataset into a DataFrame and separate the features from the target.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('./data/decision_trees.csv')
X = df[['feature1', 'feature2', 'feature3']]
y = df['target']
df.head()

## 2. Exploratory Data Analysis
Visualize the relationships between features and the target. We'll start with pairplots.


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, hue='target', palette='tab10')
plt.suptitle('Pairplot of Decision Tree Demo Data', y=1.02)
plt.show()
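
A numeric summary can complement the pairplot. The sketch below uses a tiny hand-made DataFrame (the values are invented purely for illustration, not taken from `decision_trees.csv`) to show how class counts and per-class feature means might be inspected. The same calls should work on `df` if it has the columns used above.

```python
import pandas as pd

# Toy stand-in for the course dataset; all values are made up
toy = pd.DataFrame({
    "feature1": [0.1, 0.4, 0.35, 0.8, 0.9, 0.75],
    "feature2": [1.0, 1.2, 0.9, 0.3, 0.2, 0.4],
    "feature3": [5, 6, 5, 7, 8, 7],
    "target":   [0, 0, 0, 1, 1, 2],
})

# How many samples does each class have? Strong imbalance affects accuracy.
class_counts = toy["target"].value_counts().sort_index()
print(class_counts)

# Mean of each feature within each class: large gaps hint at useful splits.
class_means = toy.groupby("target").mean()
print(class_means)
```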

## 3. Train a Decision Tree Classifier
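We'll train a `DecisionTreeClassifier` with default settings and evaluate it on the same data it was trained on. Because an unconstrained tree can keep splitting until it memorizes the training set, this training accuracy is an optimistic estimate and says little about performance on unseen data.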


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Instantiate and train
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X, y)

# Predict and evaluate
y_pred = dt.predict(X)
acc = accuracy_score(y, y_pred)
print(f"Training Accuracy: {acc:.2f}")
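
To see why training accuracy alone can mislead, the following sketch fits an unconstrained tree on synthetic data and compares its accuracy on the data it was fitted on with its accuracy on a held-out portion. The dataset comes from `make_classification` with some label noise (`flip_y=0.1`); it is an assumption standing in for the course data, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-feature, 3-class data with 10% noisy labels (not the course dataset)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)

# Hold out 20% of the rows, keeping class proportions similar in both parts
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# Default settings: the tree grows until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)

train_score = full_tree.score(X_fit, y_fit)
hold_score = full_tree.score(X_hold, y_hold)
print(f"Accuracy on fitted data: {train_score:.2f}")
print(f"Accuracy on held-out data: {hold_score:.2f}")
```

The gap between the two numbers, not the training accuracy itself, is what indicates overfitting.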

## 4. Exercise for the Student
**Task:**  
1. Split the data into training and test sets, then vary the `max_depth` parameter from 1 to 10 and record the training and test accuracy.  
2. Plot accuracy vs. `max_depth`.  
3. Which `max_depth` provides the best test accuracy?  
4. **Bonus:** Visualize the tree with `export_text` or `plot_tree`.
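
As a hint for the bonus, here is a small sketch of `export_text` on a shallow tree. It uses synthetic data from `make_classification` rather than the course dataset, and the feature names are simply reused as labels, so the printed rules will not match what you get on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data: 3 features, 3 classes
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# A depth-2 tree keeps the printed rules short enough to read
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# export_text returns the learned rules as an indented string
rules = export_text(demo_tree, feature_names=["feature1", "feature2", "feature3"])
print(rules)
```

Text output is handy for shallow trees and for logs. For deeper trees, `plot_tree` is usually easier to read.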


## 5. Solution
Below is one possible solution, including splitting into train/test sets and plotting.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import plot_tree

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

train_acc = []
test_acc = []
depths = range(1, 11)

for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    train_acc.append(accuracy_score(y_train, clf.predict(X_train)))
    test_acc.append(accuracy_score(y_test, clf.predict(X_test)))

# Plot results
plt.plot(depths, train_acc, label='Train Accuracy')
plt.plot(depths, test_acc,  label='Test Accuracy')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: Train vs Test Accuracy')
plt.legend()
plt.show()

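# Best depth (ties go to the shallowest tree, since index() returns the first maximum)
best_depth = depths[test_acc.index(max(test_acc))]
print(f"Best max_depth on the test set: {best_depth}")

# Bonus: visualize the tree at the best depth.
# After the loop, `clf` holds the max_depth=10 model, so refit at best_depth first.
best_clf = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
best_clf.fit(X_train, y_train)

plt.figure(figsize=(12, 8))
plot_tree(
    best_clf,
    feature_names=['feature1', 'feature2', 'feature3'],
    class_names=[str(c) for c in best_clf.classes_],
    filled=True,
)
plt.show()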
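
Choosing `max_depth` by looking at test accuracy means the test set is no longer a fully independent check. One common alternative is to pick the depth with cross-validation on the training portion only. The sketch below shows the idea with `cross_val_score` on synthetic data; the dataset and the variable names are assumptions for illustration, not part of the solution above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training portion of the data
X_cv, y_cv = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)

depths_cv = range(1, 11)
mean_scores = []
for d in depths_cv:
    model = DecisionTreeClassifier(max_depth=d, random_state=42)
    # 5-fold cross-validation: each fold serves once as the validation set
    scores = cross_val_score(model, X_cv, y_cv, cv=5, scoring="accuracy")
    mean_scores.append(scores.mean())

# argmax returns the first maximum, so ties favor the shallower tree
best_cv_depth = depths_cv[int(np.argmax(mean_scores))]
print(f"max_depth with the best mean CV accuracy: {best_cv_depth}")
```

With this approach the test set is used only once, at the end, to report the accuracy of the chosen model.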

---
### Next Steps
- Experiment with `criterion='entropy'` and compare results.
- Consider how pre-pruning constraints (like `min_samples_leaf`) and cost-complexity pruning (`ccp_alpha`) can regularize the tree.
- Prepare for Linear Regression by reviewing how continuous targets differ from classification.
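
A possible starting point for the first two experiments is sketched below. It compares `criterion='gini'` with `criterion='entropy'`, and an unconstrained tree with one limited by `min_samples_leaf=10`, on synthetic data. The dataset, the value 10, and the variable names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 10% label noise
X_ns, y_ns = make_classification(
    n_samples=400, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, flip_y=0.1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_ns, y_ns, test_size=0.2, random_state=42
)

settings = {
    "gini, unconstrained": dict(criterion="gini"),
    "entropy, unconstrained": dict(criterion="entropy"),
    "gini, min_samples_leaf=10": dict(criterion="gini", min_samples_leaf=10),
}

results = {}
for name, params in settings.items():
    model = DecisionTreeClassifier(random_state=42, **params).fit(X_tr, y_tr)
    # Record tree size (number of leaves) alongside test accuracy
    results[name] = (model.get_n_leaves(), model.score(X_te, y_te))
    print(f"{name}: leaves={results[name][0]}, test accuracy={results[name][1]:.2f}")
```

Requiring at least 10 samples per leaf caps the number of leaves at one tenth of the training set size, which is one way to see how the constraint limits model complexity.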
