# Phase 3: Classification with Decision Trees (Inspired by Lab Guide)

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import joblib

# Read File:

This step initiates the classification and clustering processes by loading the pre-cleaned dataset cleaned_data.csv. The dataset was preprocessed in a prior phase, which covered handling missing values, normalizing inputs, and encoding categorical attributes where needed. The pandas function pd.read_csv() reads the file into a DataFrame (df), and df.head() prints the first five rows as a quick check before further analysis and model training.

In [None]:
df = pd.read_csv('cleaned_data.csv')
print(df.head())
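
Because the later steps rely on the file being fully cleaned, a few quick checks after loading can catch problems early. The sketch below is a minimal example, not part of the original workflow: it uses a tiny in-memory CSV as a stand-in for the real file, and every column name other than Accident_Severity, as well as the helper check_cleaned(), is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned CSV. Only Accident_Severity comes
# from the notebook; the other column names are invented for illustration.
csv_text = """Speed_limit,Number_of_Vehicles,Accident_Severity
30,2,3
60,1,2
30,3,3
"""
sample_df = pd.read_csv(io.StringIO(csv_text))


def check_cleaned(frame, target="Accident_Severity"):
    """Return basic checks for a frame that is expected to be pre-cleaned."""
    return {
        # The target column must exist for supervised training.
        "has_target": target in frame.columns,
        # Total count of missing cells across the whole frame.
        "missing_values": int(frame.isna().sum().sum()),
        # Columns that still hold non-numeric data and may need encoding.
        "non_numeric_columns": frame.select_dtypes(exclude="number").columns.tolist(),
    }


report = check_cleaned(sample_df)
print(report)
```

The same helper could be called on df from the cell above; any non-zero missing-value count or non-numeric column would suggest revisiting the cleaning phase.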

# Feature Selection:

Here, the dataset is split into two parts: features (x) and the target variable (y). The features include all columns except the last one, Accident_Severity, which is the target we want the model to predict. A list of feature names (fn) is extracted from the dataset's columns, and these are used to define the input matrix x. The class labels (accident severity levels) are isolated into the variable y. This separation is essential to train supervised learning models like the Decision Tree classifier.

In [None]:
# Use every column except the target as a feature
fn = [col for col in df.columns if col != 'Accident_Severity']
x = df[fn]
y = df['Accident_Severity']
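
Before training, it can help to confirm that the target is not among the features and to look at how the severity classes are distributed. The following is a minimal sketch on a toy frame; the feature columns and values are made up for illustration and do not come from the real dataset.

```python
import pandas as pd

# Toy data: only the Accident_Severity column name comes from the notebook.
toy = pd.DataFrame({
    "Speed_limit": [30, 60, 30, 40, 70, 30],
    "Number_of_Vehicles": [2, 1, 3, 2, 1, 2],
    "Accident_Severity": [3, 2, 3, 3, 1, 3],
})

target = "Accident_Severity"
feature_cols = [c for c in toy.columns if c != target]

X_toy = toy[feature_cols]  # input matrix
y_toy = toy[target]        # class labels

# Share of each class; a heavily skewed distribution means that plain
# accuracy can look good even when minority classes are predicted poorly.
class_share = y_toy.value_counts(normalize=True).sort_index()
print(class_share)
```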

# Model Evaluation Function:

This section defines a reusable function, evaluate_model(), that encapsulates the full training and evaluation process for a Decision Tree classifier. The function accepts training and testing sets, along with a criterion (gini or entropy), then:

- Instantiates a DecisionTreeClassifier using the specified criterion
- Trains the model on the training data
- Predicts labels on the test set
- Calculates the model's accuracy using accuracy_score()
- Generates a confusion matrix and visualizes it with ConfusionMatrixDisplay

The function returns the trained model and its accuracy score, providing a concise way to evaluate different configurations.

In [None]:

def evaluate_model(X_train, X_test, y_train, y_test, criterion):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=42)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)
    print(f"Criterion: {criterion}, Accuracy: {acc:.4f}")
    ConfusionMatrixDisplay(cm, display_labels=clf.classes_).plot()
    plt.title(f"Confusion Matrix ({criterion})")
    plt.show()
    return clf, acc

# Model Evaluation Across Multiple Splits and Criteria:

This section evaluates the performance of Decision Tree classifiers across different train-test splits and classification criteria. To ensure a robust assessment of model behavior, the dataset is partitioned into three different ratios: 90% training - 10% testing, 80% training - 20% testing, and 70% training - 30% testing. For each partition, two Decision Tree classifiers are built using different splitting criteria—Gini index and Entropy—to compare their effectiveness under varying amounts of training data.

The train_test_split() function is used to create training and testing datasets for each ratio, ensuring that results are consistent by setting random_state=42. A loop iterates over the chosen split ratios and criteria, training the model on the corresponding subset and evaluating it on the test data. The evaluate_model() function encapsulates the training and prediction steps and returns both the trained model and its accuracy.

All results—including the train-test ratio, criterion used, and resulting accuracy—are stored in a list for comparison. This systematic evaluation allows us to determine how data quantity and splitting strategy influence model performance and whether the Gini or Entropy criterion is more effective for our dataset.
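
The effect of fixing random_state can be checked directly. This small sketch, on made-up arrays, shows that two calls to train_test_split() with the same seed return the same partition, which is what makes the comparison between criteria fair within each split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 columns, alternating labels.
data = np.arange(40).reshape(20, 2)
labels = np.array([0, 1] * 10)

# Two independent calls with the same seed.
a_train, a_test, _, _ = train_test_split(data, labels, test_size=0.25, random_state=42)
b_train, b_test, _, _ = train_test_split(data, labels, test_size=0.25, random_state=42)

# Identical seeds give identical partitions.
same_seed_matches = np.array_equal(a_test, b_test) and np.array_equal(a_train, b_train)
print(same_seed_matches, a_train.shape, a_test.shape)
```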




In [None]:
splits = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3)]
results = []
for train_size, test_size in splits:
    print(f"\n--- Train/Test Split: {int(train_size*100)}/{int(test_size*100)} ---")
    X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=train_size, test_size=test_size, random_state=42)
    for criterion in ["gini", "entropy"]:
        clf, acc = evaluate_model(X_train, X_test, y_train, y_test, criterion)
        results.append({"Train Size": train_size, "Test Size": test_size, "Criterion": criterion, "Accuracy": acc})


In [None]:
results_df = pd.DataFrame(results)
print("\nSummary of Results:")
print(results_df)
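
A results list of this shape is easier to compare once it is pivoted so that each criterion gets its own column. The sketch below rebuilds a comparable table on synthetic data from make_classification(), so the numbers it prints say nothing about the accident dataset; names such as demo_results are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, used only to make the example runnable.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)

rows = []
for test_size in (0.1, 0.2, 0.3):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=test_size, random_state=42
    )
    for criterion in ("gini", "entropy"):
        tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
        tree.fit(X_tr, y_tr)
        rows.append({
            "Train Size": round(1 - test_size, 1),
            "Criterion": criterion,
            "Accuracy": accuracy_score(y_te, tree.predict(X_te)),
        })

demo_results = pd.DataFrame(rows)

# One row per split, one column per criterion.
table = demo_results.pivot(index="Train Size", columns="Criterion", values="Accuracy")
print(table)

# Row with the highest accuracy (the first one if several tie).
best = demo_results.loc[demo_results["Accuracy"].idxmax()]
print(best)
```

Picking the single highest accuracy is only one possible selection rule; with differences this small between configurations, it is worth checking the confusion matrices as well.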


# Training the Final Model:



This step involves selecting the best-performing configuration from our earlier evaluation and retraining the Decision Tree classifier on the training portion of that split using those parameters. Based on our previous results, we choose the `70/30` train-test split and the `entropy` splitting criterion, which gave the most reliable accuracy and confusion matrix results.

Training the final model with this configuration aims for the best balance between generalization and accuracy. This model will then be used for making predictions on new, unseen data.


In [None]:
# Train final model (on 70/30 split with 'entropy') for saving/loading
X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.7, test_size=0.3, random_state=42)
final_model = DecisionTreeClassifier(criterion="entropy", random_state=42)
final_model.fit(X_train, y_train)

# Saving the Model for New Predictions:

To enable reuse of the trained model without needing to retrain it each time, we save it to disk using `joblib`. This is a common best practice in machine learning deployment workflows.

Saving the model allows for fast and consistent predictions in production environments or future experiments. It also supports reproducibility, as the exact fitted tree structure and split thresholds are preserved.


In [None]:
# Save the model
joblib.dump(final_model, "decision_tree_model.pkl")

# Use the same columns, in the same order, as in training
feature_names = list(X_train.columns)

# Placeholder values for illustration only; replace with a real observation.
# The row is padded with zeros (or truncated) to match the number of features.
placeholder = [5.1, 3.5, 1.4, 0.2]
row = (placeholder + [0] * len(feature_names))[:len(feature_names)]
new_sample = pd.DataFrame([row], columns=feature_names)

# Reload the model from disk and predict
loaded_model = joblib.load("decision_tree_model.pkl")
prediction = loaded_model.predict(new_sample)
print("Predicted class for new sample:", prediction[0])
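
One way to make reloading less error-prone is to store the feature names next to the model, so that new samples can be built with the right columns without access to the training data. This is a sketch of that idea under stated assumptions, not the approach used above: the bundle dictionary, the file name, and the toy data are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data with made-up feature names.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.random((40, 3)), columns=["f1", "f2", "f3"])
y_demo = (X_demo["f1"] > 0.5).astype(int)

tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X_demo, y_demo)

# Bundle the model with the column order it was trained on.
bundle = {"model": tree, "feature_names": list(X_demo.columns)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tree_bundle.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# Build a new sample using the stored column names.
new_row = pd.DataFrame([[0.9, 0.1, 0.4]], columns=restored["feature_names"])
print("Prediction:", restored["model"].predict(new_row)[0])

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(
    tree.predict(X_demo), restored["model"].predict(X_demo)
)
print("Round trip preserved predictions:", same_predictions)
```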


# Displaying the 70/30 Split Partition:

This step takes the final model trained on the 70% training / 30% testing split and displays a visual representation of its decision tree. The 70/30 split is especially useful for evaluating how well the model generalizes when trained with less data compared to other splits (like 90/10 or 80/20).

By focusing on this partition, we can observe whether the model maintains strong predictive power despite having less training data and a larger test set. This is often a good stress test for model robustness.


In [None]:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Make sure class names are strings
class_names_str = [str(cls) for cls in final_model.classes_]

# Plot the tree
plt.figure(figsize=(12, 8))
plot_tree(
    final_model,
    feature_names=list(x.columns),  # Ensure it's a list of strings
    class_names=class_names_str,
    filled=True
)
plt.title("Final Decision Tree (Entropy, 70/30)")
plt.show()
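
A fully grown tree can be too large to read as a figure. As a complementary view, scikit-learn's export_text() prints the top levels of a tree as plain text, and get_depth() and get_n_leaves() report its size. The sketch below uses synthetic data and hypothetical feature names, so it only illustrates the calls, not the actual final model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class data with placeholder feature names.
X_syn, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
names = [f"feature_{i}" for i in range(X_syn.shape[1])]

tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X_syn, y_syn)

# Show only the first levels of the tree; deeper branches are truncated.
summary = export_text(tree, feature_names=names, max_depth=2)
print(summary)
print("depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

plot_tree() accepts a max_depth argument as well, which limits how many levels are drawn in the figure.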



# Optional: Visualization by Split Size and Criterion:

To better understand how the model's behavior changes with different train-test splits and criteria, we print the test accuracy and plot the fitted tree for each combination. These visualizations make it easier to detect patterns and trade-offs, for example whether a larger training set consistently improves accuracy or whether one criterion consistently outperforms the other.

Optional visualizations help communicate results more clearly and support better decision-making regarding model deployment.


In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Try different test sizes and both criteria
split_sizes = [0.3, 0.2, 0.1]
criteria = ['gini', 'entropy']

for split in split_sizes:
    for criterion in criteria:
        print(f"\n🔹 Criterion: {criterion.upper()}, Split: {int((1-split)*100)}/{int(split*100)}")

        # Split the data
        X_train, X_test, y_train, y_test = train_test_split(
            x, y, test_size=split, random_state=42
        )

        # Train the model
        model = DecisionTreeClassifier(criterion=criterion, random_state=42)
        model.fit(X_train, y_train)

        # Evaluate
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {acc:.2f}")

        # Plot the tree
        plt.figure(figsize=(13, 2))
        plot_tree(
            model,
            feature_names=x.columns.astype(str),
            class_names=[str(cls) for cls in model.classes_],
            filled=True
        )
        plt.title(f"Decision Tree ({criterion.capitalize()}) — Split {int((1-split)*100)}/{int(split*100)}")
        plt.show()
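
Alongside the plots, a compact numeric summary can make the comparison easier to scan. The following sketch, again on synthetic data, records the depth, leaf count, and training and test accuracy for each combination; comparing the two accuracies is a simple way to spot overfitting in unpruned trees. The summary structure is a hypothetical choice.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; results here do not reflect the accident dataset.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=3, random_state=42
)

summary = []
for test_size in (0.3, 0.2, 0.1):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_syn, y_syn, test_size=test_size, random_state=42
    )
    for criterion in ("gini", "entropy"):
        tree = DecisionTreeClassifier(criterion=criterion, random_state=42)
        tree.fit(X_tr, y_tr)
        summary.append({
            "test_size": test_size,
            "criterion": criterion,
            "depth": tree.get_depth(),         # longest root-to-leaf path
            "leaves": tree.get_n_leaves(),     # number of terminal nodes
            "train_acc": tree.score(X_tr, y_tr),
            "test_acc": tree.score(X_te, y_te),
        })

for row in summary:
    print(row)
```

A large gap between train_acc and test_acc would suggest that limiting the tree, for instance through max_depth or min_samples_leaf, is worth trying.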


# Final Justification and Conclusion:

After evaluating the Decision Tree classifier using both the Gini index and Entropy across three different train-test splits (90/10, 80/20, 70/30), we observe the following patterns:

- **Accuracy Stability:** Both criteria performed similarly across splits, with minor differences. This indicates that the model is not overly sensitive to the choice of criterion on this dataset.
- **Train-Test Tradeoff:** As expected, models trained on larger training sets (e.g., 90%) tended to perform slightly better due to more information being available during training. However, the difference was not dramatic, suggesting the dataset is large and diverse enough to allow flexible partitioning.
- **Confusion Matrices:** These visualizations helped identify where the model struggled (e.g., if certain classes were consistently misclassified). If significant misclassification was seen in a particular class, it might be worth exploring further class-specific metrics or balancing the dataset.

Based on overall performance, a **train-test split of 80/20 with the Gini index** appears to offer a good balance of performance and generalization. However, both criteria are viable, and further tuning could improve performance.
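
The class-specific metrics mentioned in the confusion matrix point can be obtained with classification_report(). This is a minimal sketch on synthetic data, so the printed values are not results for the accident dataset; per_class_recall is a hypothetical helper variable.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the real features and labels.
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

tree = DecisionTreeClassifier(criterion="gini", random_state=42).fit(X_tr, y_tr)
y_hat = tree.predict(X_te)

# Precision, recall, and F1 for every class, returned as a nested dict.
report = classification_report(y_te, y_hat, output_dict=True, zero_division=0)

# Keep only the per-class entries (skip "accuracy" and the averaged rows).
per_class_recall = {
    label: round(stats["recall"], 2)
    for label, stats in report.items()
    if isinstance(stats, dict) and not label.endswith("avg")
}
print(per_class_recall)
```

A class with markedly lower recall than the others is a candidate for the rebalancing or further investigation described above.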
