1. What is a Decision Tree, and how does it work in the context of
classification?
- A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks.
- In classification, it is used to predict a categorical output (e.g., Yes/No, Spam/Not Spam, Disease/No Disease).

How Decision Tree Works in Classification
- Step 1: Select the Best Feature (Splitting)
  - The algorithm chooses the feature that best separates the data into classes.
- Step 2: Split the Dataset
  - The dataset is divided into subsets based on the selected feature.
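- Step 3: Repeat Recursively
  - For each subset, choose the best feature again and split the data further, until a stopping condition is met (for example, the node is pure or a depth limit is reached).
- Step 4: Final Prediction
  - A new sample follows the decision path from the root to a leaf, and the majority class of that leaf is the predicted label.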
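
The four steps above can be sketched in plain Python. This is a minimal illustration and not how scikit-learn implements trees: it assumes numeric features, uses Gini impurity to score splits, and grows until every leaf is pure. The toy data and the helper names (`best_split`, `build_tree`, `predict`) are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Step 1: find the (feature, threshold) with the lowest weighted Gini."""
    best = None  # (weighted impurity, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels):
    """Steps 2-3: split the data and recurse until nodes are pure."""
    if gini(labels) == 0.0:
        return labels[0]  # pure leaf
    split = best_split(rows, labels)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }

def predict(tree, row):
    """Step 4: follow the decision path until a leaf is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical toy data: [age, income in thousands] -> buys the product?
rows = [[25, 30], [30, 40], [45, 80], [50, 90], [23, 85], [52, 35]]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

tree = build_tree(rows, labels)
prediction = predict(tree, [40, 75])
print(tree)
print(prediction)
```

On this toy data a single split on income separates the classes, so the tree has one decision node and two leaves.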

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
- Both Gini Impurity and Entropy measure how “impure” (mixed) a node is.
They help the tree decide where to split the data.
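- Entropy measures disorder using logarithms: H = -Σ p_i log2(p_i), where p_i is the proportion of class i in the node.
- Gini measures the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution: G = 1 - Σ p_i².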
- Both guide the tree to create pure child nodes.
- The best split = one that reduces impurity the most.
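
As a rough illustration of how the two measures behave, the sketch below computes both for a pure node, an evenly mixed node, and a skewed node. It uses the standard definitions (Gini = 1 - Σ p², Entropy = Σ p · log2(1/p)); the example nodes are made up.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Proportion of each class among the labels in a node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p^2)."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: sum(p * log2(1/p))."""
    return sum(p * math.log2(1 / p) for p in class_probabilities(labels))

# Hypothetical nodes with different class mixes
pure = ["Yes"] * 10
mixed = ["Yes"] * 5 + ["No"] * 5
skewed = ["Yes"] * 9 + ["No"] * 1

for name, node in [("pure", pure), ("mixed", mixed), ("skewed", skewed)]:
    print(f"{name:>6}: gini={gini(node):.3f}  entropy={entropy(node):.3f}")
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1.0 for entropy) for the evenly mixed node, which is why they usually rank candidate splits similarly.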

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
| Feature      | Pre-Pruning          | Post-Pruning              |
| ------------ | -------------------- | ------------------------- |
| When applied | During tree building | After full tree is built  |
| Risk         | May underfit         | Less risk of underfitting |
| Speed        | Faster               | Slower                    |
| Computation  | Low                  | Higher                    |
| Accuracy     | Sometimes lower      | Often better              |


- Pre-Pruning stops the tree from growing too complex by applying limits (such as a maximum depth) during training.
  Advantage: Faster and computationally efficient.
- Post-Pruning trims the tree after full growth.
  Advantage: Better generalization and accuracy in many cases.
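
A possible way to try both approaches in scikit-learn is sketched below on the Iris data. Pre-pruning is expressed through growth limits such as `max_depth` and `min_samples_leaf`; post-pruning uses cost-complexity pruning via `ccp_alpha`. The mid-range alpha chosen here is arbitrary and only for illustration; in practice it would normally be selected with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline: fully grown tree with no pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruning: restrict growth while the tree is being built
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit
# with one of the candidate alphas (picked arbitrarily for this sketch)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [
    ("full", full_tree),
    ("pre-pruned", pre_pruned),
    ("post-pruned", post_pruned),
]:
    print(
        f"{name:>11}: nodes={model.tree_.node_count}, "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```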

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
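- Information Gain (IG) measures how much uncertainty (entropy) decreases after splitting a dataset on a feature: IG = H(parent) - Σ (n_k / n) · H(child_k), where n_k is the number of samples in child node k and n is the number of samples in the parent.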

In simple words:
- Information Gain tells us how good a feature is at separating the data into classes.

Why Is It Important?
- At every node, the Decision Tree algorithm:
  - Tries splitting on each feature
  - Computes the Information Gain of each split
  - Chooses the feature with the highest Information Gain

This ensures the tree reduces impurity as much as possible at each step.
- Information Gain measures reduction in entropy.
- It is used to select the best feature at each node.
- The feature with the highest Information Gain becomes the next decision node.
- It helps the tree grow in the most informative direction.
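
The selection procedure can be sketched as follows, assuming categorical features and the usual definition of Information Gain (parent entropy minus the size-weighted entropy of the child nodes). The weather-style toy data is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(labels)
    children = {}
    for value, label in zip(feature_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(c) / n * entropy(c) for c in children.values())
    return entropy(labels) - weighted

# Hypothetical toy data: should we play outside?
features = {
    "outlook": ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Overcast", "Sunny", "Rain"],
    "windy": [True, False, True, False, True, False, True, False],
}
play = ["No", "No", "Yes", "Yes", "Yes", "Yes", "No", "Yes"]

# Try each feature, compute its gain, and keep the best one
gains = {name: information_gain(values, play) for name, values in features.items()}
best = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("Best feature:", best)
```

Here `outlook` separates the classes perfectly, so its gain equals the full parent entropy, while `windy` removes very little uncertainty.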

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
- Decision Trees are widely used in real-world systems because they are simple yet powerful.
- Common Real-World Applications of Decision Trees
  - Healthcare (Disease Diagnosis)
  - Finance & Banking (Credit Risk / Loan Approval)
  - Marketing & Customer Analytics
  - Manufacturing & Quality Control
  - Education & HR

Main Advantages of Decision Trees
- Easy to Understand & Interpret
  - Visual structure
  - Human-readable rules
  - No “black-box” behavior

- Works with Different Data Types
  - Numerical
  - Categorical
  - No need for feature scaling

- Little Data Preparation Needed
  - No normalization required
  - Handles missing values (in some implementations)

- Fast Inference
  - Predictions are quick (just follow decision path)

Main Limitations of Decision Trees
- Overfitting
  - Deep trees memorize training data
  - Need pruning to control complexity

- High Variance
  - Small changes in data can create a different tree

- Greedy Algorithm
  - Chooses best split locally
  - May not find global optimal tree

- Bias Toward Features with Many Categories
  - Information Gain may favor such features, because splitting on many distinct values tends to produce small subsets that look pure


In [1]:
# 6. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier using the Gini criterion
# ● Print the model’s accuracy and feature importances

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


In [2]:
# 7. Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
# a fully-grown tree.

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a fully-grown Decision Tree
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

# 4. Train a Decision Tree with max_depth = 3
limited_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_tree.fit(X_train, y_train)

# 5. Make predictions
y_pred_full = full_tree.predict(X_test)
y_pred_limited = limited_tree.predict(X_test)

# 6. Calculate accuracies
accuracy_full = accuracy_score(y_test, y_pred_full)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# 7. Print results
print("Fully-Grown Tree Accuracy:", accuracy_full)
print("Max Depth = 3 Tree Accuracy:", accuracy_limited)

Fully-Grown Tree Accuracy: 1.0
Max Depth = 3 Tree Accuracy: 1.0


In [3]:
# 8. Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances


# Import required libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# 1. Load the Boston Housing dataset from OpenML
#    (load_boston was removed from scikit-learn in version 1.2)
boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data
y = boston.target.astype(float)

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# 4. Make predictions
y_pred = regressor.predict(X_test)

# 5. Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# 6. Print feature importances
print("\nFeature Importances:")
for feature_name, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature_name}: {importance:.4f}")

Mean Squared Error (MSE): 10.416078431372549

Feature Importances:
CRIM: 0.0513
ZN: 0.0034
INDUS: 0.0058
CHAS: 0.0000
NOX: 0.0271
RM: 0.6003
AGE: 0.0136
DIS: 0.0707
RAD: 0.0019
TAX: 0.0125
PTRATIO: 0.0110
B: 0.0090
LSTAT: 0.1933


In [4]:
# 9. Write a Python program to:
# ● Load the Iris Dataset
# ● Tune the Decision Tree’s max_depth and min_samples_split using
# GridSearchCV
# ● Print the best parameters and the resulting model accuracy

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1. Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# 2. Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Define the Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# 4. Define the parameter grid
param_grid = {
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 5, 10]
}

# 5. Apply GridSearchCV
grid_search = GridSearchCV(
    estimator=dt,
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

# 6. Get the best model
best_model = grid_search.best_estimator_

# 7. Make predictions using the best model
y_pred = best_model.predict(X_test)

# 8. Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# 9. Print results
print("Best Parameters:", grid_search.best_params_)
print("Model Accuracy with Best Parameters:", accuracy)

Best Parameters: {'max_depth': None, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


Below is a structured, real-world workflow I would follow as a data scientist in a healthcare setting.

1. Understand the Problem and Data

Before modeling:

* Define the target variable (Disease: Yes/No).
* Identify feature types:

  * Numerical (age, blood pressure, cholesterol)
  * Categorical (gender, symptoms, medical history)
* Check class imbalance.
* Explore missing values and data distribution.

This ensures the modeling approach aligns with the medical objective.

---

2. Handle Missing Values

Step 1: Analyze Missingness

* Identify percentage of missing values per column.
* Check if missingness is random or systematic.

Step 2: Apply Appropriate Strategy

For Numerical Features:

* If small percentage missing → Impute using median (robust to outliers).
* If clinically meaningful → Use domain-specific imputation.
* Optionally add a “missing indicator” column.

For Categorical Features:

* Replace missing with:

  * Most frequent category, or
  * A new category like "Unknown".

If a feature has extremely high missing rate (e.g., >50%), consider dropping it after consulting domain experts.

Why this matters:
Healthcare data often has missing lab tests; improper handling may bias predictions.
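
A small sketch of these strategies with scikit-learn's `SimpleImputer` is shown below. The lab values and blood types are made up for illustration; `add_indicator=True` appends one "was missing" column per feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical lab values: columns are [glucose, cholesterol];
# np.nan marks a test that was not recorded
X_num = np.array([
    [90.0, np.nan],
    [110.0, 180.0],
    [np.nan, 200.0],
    [130.0, 220.0],
])

# Numerical features: median imputation plus missing-indicator columns
num_imputer = SimpleImputer(strategy="median", add_indicator=True)
X_num_imputed = num_imputer.fit_transform(X_num)
print(X_num_imputed)

# Hypothetical categorical feature (blood type) with one missing entry
X_cat = np.array([["A"], [np.nan], ["B"], ["A"]], dtype=object)

# Categorical features: replace missing entries with a new "Unknown" category
cat_imputer = SimpleImputer(strategy="constant", fill_value="Unknown")
X_cat_imputed = cat_imputer.fit_transform(X_cat)
print(X_cat_imputed.ravel())
```

Fitting the imputers on the training split only, and reusing them on the test split, avoids leaking test-set statistics into training.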

---

3. Encode Categorical Features

Decision Trees do not require feature scaling, but scikit-learn's implementation needs numeric input, so raw text categories must be encoded first.

If using scikit-learn:

* Use One-Hot Encoding for nominal categories (e.g., blood type).
* Use Ordinal Encoding if categories have meaningful order (e.g., severity: mild, moderate, severe).

Pipeline example:

* ColumnTransformer:

  * Numerical → Imputer
  * Categorical → Imputer + OneHotEncoder

This ensures consistent preprocessing during training and prediction.
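
One way such a pipeline might look is sketched below. The patient records, column names, and category order are hypothetical, and the model is fitted on the whole toy table only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient records; names and values are illustrative only
df = pd.DataFrame({
    "age": [34, 51, np.nan, 62, 45, 29],
    "blood_type": ["A", "O", np.nan, "B", "O", "A"],
    "severity": ["mild", "severe", "moderate", "severe", np.nan, "mild"],
    "disease": [0, 1, 0, 1, 1, 0],
})
X, y = df.drop(columns="disease"), df["disease"]

preprocess = ColumnTransformer([
    # Numerical -> median imputation
    ("num", SimpleImputer(strategy="median"), ["age"]),
    # Nominal categorical -> impute, then one-hot encode
    ("nominal", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["blood_type"]),
    # Ordered categorical -> impute, then encode with an explicit order
    ("ordinal", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["mild", "moderate", "severe"]])),
    ]), ["severity"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
model.fit(X, y)

transformed = model.named_steps["preprocess"].transform(X)
predictions = model.predict(X)
print(transformed.shape)
print(predictions)
```

Because imputation and encoding live inside the pipeline, calling `model.predict` on new records applies exactly the same transformations that were learned during training.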

---

4. Train the Decision Tree Model

Steps:

1. Split data into:

   * Training set (80%)
   * Test set (20%)

2. Build preprocessing + model pipeline:

   * Imputation
   * Encoding
   * DecisionTreeClassifier

3. Fit the model on training data.

Decision Trees are suitable here because:

* They handle mixed numerical and categorical data once the categories are encoded.
* They are interpretable.
* No scaling required.

---

5. Tune Hyperparameters

To prevent overfitting, tune:

* max_depth
* min_samples_split
* min_samples_leaf
* max_features
* criterion (gini or entropy)

Use GridSearchCV or RandomizedSearchCV with cross-validation (e.g., 5-fold CV).

Example approach:

* Define parameter grid
* Use cross-validation
* Select best model based on accuracy, F1-score, or ROC-AUC

In healthcare, F1-score or Recall is often more important than accuracy.
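
As an illustration of tuning for recall instead of accuracy, the sketch below runs `GridSearchCV` with `scoring="recall"` on a synthetic, imbalanced dataset from `make_classification`. The dataset and the grid values are assumptions chosen to keep the example small, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: about 20% positive (disease) cases
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "criterion": ["gini", "entropy"],
}

# Stratified folds keep the class ratio similar in every fold
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring="recall",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", round(search.best_score_, 3))
```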

---

6. Evaluate Model Performance

Use multiple evaluation metrics:

For Classification:

* Accuracy
* Precision
* Recall (very important for disease detection)
* F1-score
* ROC-AUC
* Confusion Matrix

Why Recall matters:

* False negatives (missed disease cases) can be dangerous.
* High recall ensures fewer sick patients are missed.

Also check:

* Overfitting (train vs test performance)
* Feature importance for interpretability
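
A minimal evaluation sketch along these lines is shown below, again on a synthetic imbalanced dataset used as a stand-in for real patient data. It derives recall from the confusion matrix and compares train and test accuracy as a simple overfitting check.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data: about 20% positive (disease) cases
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probability of the positive class

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
recall = recall_score(y_test, y_pred)

print("Confusion matrix (TN, FP, FN, TP):", tn, fp, fn, tp)
print("Precision:", round(precision_score(y_test, y_pred, zero_division=0), 3))
print("Recall:", round(recall, 3))
print("F1-score:", round(f1_score(y_test, y_pred), 3))
print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))

# A large gap between these two numbers suggests overfitting
print("Train accuracy:", round(clf.score(X_train, y_train), 3))
print("Test accuracy:", round(clf.score(X_test, y_test), 3))
```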

---

7. Model Interpretation

Decision Trees are interpretable:

* Extract decision rules
* Visualize tree
* Analyze feature importances

Doctors can understand rules like:

"If glucose > X and age > Y, then high risk."

This transparency is crucial in healthcare applications.
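
Rules of this kind can be read directly from a fitted tree. The sketch below uses scikit-learn's `export_text` on a shallow Iris tree as a stand-in, since no patient dataset is available here; with clinical data the same call would print thresholds on features such as glucose or age.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short and readable
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules
rules = export_text(clf, feature_names=iris.feature_names)
print(rules)
```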

---

8. Business Value in Real-World Healthcare

* Early Disease Detection: Helps identify high-risk patients before symptoms worsen.
* Improved Clinical Decision Support: Assists doctors with data-driven recommendations.
* Cost Reduction: Early diagnosis reduces hospitalization costs.
* Resource Optimization: Hospitals can prioritize high-risk patients.
* Risk Stratification: Enables preventive care programs.
* Better Patient Outcomes: Faster intervention improves survival rates.

---

Final Summary

The complete workflow would be:

1. Understand data and objective.
2. Handle missing values carefully.
3. Encode categorical variables properly.
4. Train Decision Tree using a pipeline.
5. Tune hyperparameters with cross-validation.
6. Evaluate using clinically relevant metrics.
7. Interpret model and deploy responsibly.

In a healthcare setting, the true value of this model is not just accuracy but its ability to provide reliable, interpretable, and actionable insights that support better medical decisions.
