# Decision Tree Classifier Exercise

This notebook walks through building a Decision Tree classifier with Python's `scikit-learn` library. It reuses the synthetic two-class dataset generated previously. Because the two classes overlap, the dataset is a useful test case for seeing how a decision tree places its splits.

## Objectives
- Understand how to build and visualize a Decision Tree classifier.
- Learn how to interpret the decision tree and its structure.
- Apply model evaluation techniques to assess performance.

## Step 1: Import Libraries

In [None]:
# Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score


## Step 2: Prepare the Dataset

We regenerate the same synthetic dataset: 20 red points and 20 green points, drawn from uniform ranges that overlap on both features.


In [None]:
# Generate the data again (Red and Green points with overlap)
np.random.seed(42)

# Red points (class 0) - overlapping with green
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)

# Green points (class 1) - overlapping with red
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)

# Create a DataFrame for the data
data_dict = {
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * len(x_red) + [1] * len(x_green)  # 0: Red, 1: Green
}
data_df = pd.DataFrame(data_dict)

# Show the first few rows
data_df.head()
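
To make the overlap concrete, the per-class minimum and maximum of each feature can be tabulated. The following is a minimal sketch that repeats the generation recipe above so it runs on its own; the name `ranges` is introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same generation recipe as the cell above, repeated so this runs standalone
np.random.seed(42)
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)
data_df = pd.DataFrame({
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * 20 + [1] * 20,  # 0: Red, 1: Green
})

# Minimum and maximum of each feature, per class
ranges = data_df.groupby('Class')[['Feature 1', 'Feature 2']].agg(['min', 'max'])
print(ranges.round(2))
```

Wherever the red maximum exceeds the green minimum on a feature, a single threshold on that feature cannot separate the two classes.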


## Step 3: Data Preprocessing

### 3.1 Features and Target Variables
We need to separate the features and the target variable before training the model.



In [None]:

# Features (X) and target variable (y)
X = data_df[['Feature 1', 'Feature 2']]
y = data_df['Class']



### 3.2 Split the Data into Training and Test Sets
We will split the dataset into training and test sets with a 70/30 ratio. With 40 points, this gives 28 training points and 12 test points.




In [None]:

# Split the data: 70% for training and 30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
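
With only 12 test points, a purely random split can leave the two classes unevenly represented. One optional variation, not used in the rest of this notebook, is to pass `stratify=y` so both subsets keep the original class ratio. The sketch below rebuilds the data so it runs standalone; the `_s` suffixed names are introduced here only to avoid clashing with the variables above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild the 40-point dataset from Step 2 so this cell runs on its own
np.random.seed(42)
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)
data_df = pd.DataFrame({
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * 20 + [1] * 20,  # 0: Red, 1: Green
})
X = data_df[['Feature 1', 'Feature 2']]
y = data_df['Class']

# stratify=y keeps the Red/Green ratio the same in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Training set size:", len(X_train_s))
print("Test set size:", len(X_test_s))
print("Test class counts:", y_test_s.value_counts().to_dict())
```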




## Step 4: Train the Decision Tree Model
We will train a Decision Tree classifier using `scikit-learn`.



In [None]:

# Initialize the decision tree model
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)  # Limiting depth to avoid overfitting

# Train the model with the training data
tree_clf.fit(X_train, y_train)
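
By default, `DecisionTreeClassifier` scores candidate splits with Gini impurity (`criterion='gini'`). The following is a hand-rolled sketch of that measure, not scikit-learn's internal code; the helper names `gini_impurity` and `weighted_split_impurity` and the toy label lists are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the class proportions in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_split_impurity(left, right):
    """Average the impurity of two child nodes, weighted by their sizes."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy node with 5 points of each class (illustrative labels, not the dataset above)
parent = [0] * 5 + [1] * 5
left = [0, 0, 0, 0, 1]   # mostly class 0
right = [0, 1, 1, 1, 1]  # mostly class 1

print(f"Parent impurity: {gini_impurity(parent):.2f}")
print(f"Impurity after split: {weighted_split_impurity(left, right):.2f}")
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent; a pure node has impurity 0.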



## Step 5: Visualize the Decision Tree
Plotting the decision tree helps us understand its structure and how it splits the data.



In [None]:

# Plotting the decision tree
plt.figure(figsize=(15, 5))
plot_tree(tree_clf, feature_names=['Feature 1', 'Feature 2'], class_names=['Red', 'Green'], filled=True)
plt.title('Decision Tree Visualization')
plt.show()
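
The same tree can also be read as indented text rules, which is sometimes easier to follow than the plot. The sketch below uses `export_text` from `sklearn.tree` and rebuilds the data and model with the same settings as above so it runs standalone; the variable `rules` is introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Rebuild the dataset and the 70/30 split from the earlier steps
np.random.seed(42)
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)
data_df = pd.DataFrame({
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * 20 + [1] * 20,  # 0: Red, 1: Green
})
X = data_df[['Feature 1', 'Feature 2']]
y = data_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Same model settings as Step 4
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(X_train, y_train)

# One line per node: a threshold test for internal nodes, a class for leaves
rules = export_text(tree_clf, feature_names=['Feature 1', 'Feature 2'])
print(rules)
print("Depth:", tree_clf.get_depth(), "| Leaves:", tree_clf.get_n_leaves())
```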




## Step 6: Make Predictions
We will now make predictions on the test set to evaluate the performance of the model.


In [None]:
# Make predictions on the test set
y_pred = tree_clf.predict(X_test)


## Step 7: Evaluate the Model
### 7.1 Confusion Matrix


In [None]:
# Calculate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print(conf_matrix)

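# Plot the confusion matrix with class names on both axes
class_labels = ['Red', 'Green']  # same order as classes 0 and 1
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()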


### 7.2 Classification Report

In [None]:
# Classification report
print(classification_report(y_test, y_pred))



### 7.3 Accuracy Score


In [None]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")



## Step 8: Interpret the Results
- The **decision tree visualization** shows the sequence of threshold splits on `Feature 1` and `Feature 2` that routes each point to a leaf, where it receives that leaf's majority class.
- The **confusion matrix** counts test points by actual class (rows) and predicted class (columns). The diagonal holds correct predictions and the off-diagonal cells hold errors.
- The **classification report** provides precision, recall, and F1-score for each class.
- The **accuracy score** is the fraction of test points classified correctly (for example, 0.75 means 9 of 12).
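
The metrics above all derive from the four cells of the confusion matrix. The following sketch works this out by hand on a small set of illustrative labels (`y_true_demo` and `y_pred_demo` are made up for this example and are not output from the model above) and compares the results with scikit-learn's own functions.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Illustrative labels only; 0 = Red, 1 = Green
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all points predicted correctly
precision = tp / (tp + fp)                  # of points predicted Green, share truly Green
recall = tp / (tp + fn)                     # of truly Green points, share found

print(f"By hand: accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
print(f"sklearn: accuracy={accuracy_score(y_true_demo, y_pred_demo):.2f}, "
      f"precision={precision_score(y_true_demo, y_pred_demo):.2f}, "
      f"recall={recall_score(y_true_demo, y_pred_demo):.2f}")
```

Here precision and recall are computed for class 1 (Green), which is what `precision_score` and `recall_score` report by default for binary labels.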

## Key Points
- Decision Trees are intuitive and easy to visualize.
- **Depth of the tree** is crucial for controlling overfitting. Limiting depth usually helps the model generalize better.
- Decision Trees are powerful but overfit easily, especially when neither the depth (`max_depth`) nor the minimum number of samples per split or leaf (`min_samples_split`, `min_samples_leaf`) is constrained.
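
One way to see the overfitting risk is to compare training and test accuracy for an unconstrained tree and a depth-limited one. This is a sketch under the same data and split as above, rebuilt so it runs standalone; the names `deep_tree` and `shallow_tree` are introduced here. With only 12 test points the test scores are noisy, so the size of the gap should not be over-interpreted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the dataset and the 70/30 split from the earlier steps
np.random.seed(42)
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)
data_df = pd.DataFrame({
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * 20 + [1] * 20,  # 0: Red, 1: Green
})
X = data_df[['Feature 1', 'Feature 2']]
y = data_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# No depth limit: the tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Depth limit of 3, as in Step 4
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, model in [("unconstrained", deep_tree), ("max_depth=3", shallow_tree)]:
    print(f"{name}: depth={model.get_depth()}, "
          f"train accuracy={model.score(X_train, y_train):.2f}, "
          f"test accuracy={model.score(X_test, y_test):.2f}")
```

A training accuracy well above the test accuracy is the usual sign that a tree has memorized the training points rather than learned a general boundary.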

## Step 9: Improving the Model (Optional)
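- **Increase or Decrease Tree Depth**: Adjusting the `max_depth` parameter can improve performance. A tree that is too shallow underfits, and one that is too deep overfits.
- **Feature Engineering**: Adding informative features, or transformations of the existing ones, could lead to better splits.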
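
A common way to choose `max_depth` without touching the test set is cross-validation on the training data. The sketch below assumes 5-fold cross-validation and a candidate range of depths 1 to 5; both are illustrative choices, not values prescribed by this exercise, and the names `depths`, `mean_scores`, and `best_depth` are introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Rebuild the dataset and the 70/30 split from the earlier steps
np.random.seed(42)
x_red = np.random.uniform(2, 7, 20)
y_red = np.random.uniform(2, 8, 20)
x_green = np.random.uniform(4, 9, 20)
y_green = np.random.uniform(4, 10, 20)
data_df = pd.DataFrame({
    'Feature 1': np.concatenate([x_red, x_green]),
    'Feature 2': np.concatenate([y_red, y_green]),
    'Class': [0] * 20 + [1] * 20,  # 0: Red, 1: Green
})
X = data_df[['Feature 1', 'Feature 2']]
y = data_df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Candidate depths (an assumed range for illustration)
depths = [1, 2, 3, 4, 5]
mean_scores = {}
for depth in depths:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Mean accuracy over 5 folds of the training data only
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    mean_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy={mean_scores[depth]:.2f}")

# Depth with the highest mean cross-validated accuracy
best_depth = max(mean_scores, key=mean_scores.get)
print("Best depth by cross-validation:", best_depth)
```

With so few training points the fold scores vary considerably, so differences between neighbouring depths may not be meaningful.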



