**THEORY**

Question 1
What is a Decision Tree, and how does it work in the context of classification?

Answer:

A Decision Tree is a supervised machine learning algorithm used for classification and regression. In classification, it works by recursively splitting the dataset based on feature values so that the resulting subsets are as pure as possible.

Each internal node represents a decision on a feature, each branch represents an outcome of that decision, and each leaf node represents a class label. The tree selects the best splits using impurity measures such as Gini Impurity or Entropy.
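To make the node, branch, and leaf vocabulary concrete, here is a minimal sketch of a hand-built tree and the traversal that classifies a sample. The tree, its feature names, and its thresholds are invented for illustration; a real tree would learn them from data.

```python
# A toy tree written by hand. Feature names, thresholds, and labels are
# illustrative assumptions, not values learned from a dataset.
tree = {
    "feature": "petal_length", "threshold": 2.5,       # internal node (root)
    "left": {"label": "setosa"},                       # leaf
    "right": {
        "feature": "petal_width", "threshold": 1.7,    # internal node
        "left": {"label": "versicolor"},               # leaf
        "right": {"label": "virginica"},               # leaf
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following one branch per decision."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 4.5, "petal_width": 1.4}))  # versicolor
print(predict(tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```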

Question 2
Explain Gini Impurity and Entropy as impurity measures. How do they impact splits?

Answer:

Gini Impurity measures the probability that a randomly chosen sample from a node would be misclassified if it were labeled at random according to the class distribution in that node.

Formula:
Gini = 1 − Σ(pi²), where pi is the proportion of samples in the node that belong to class i

Entropy measures the uncertainty in a node's class distribution: it is 0 for a pure node and highest when the classes are evenly mixed.

Formula:
Entropy = − Σ(pi log₂ pi)

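Lower impurity values indicate purer nodes. During tree construction, the algorithm chooses the split that minimizes the weighted impurity of the child nodes (equivalently, maximizes information gain when entropy is used), leading to better class separation. In practice the two measures usually select similar splits; Gini is slightly cheaper to compute because it needs no logarithm.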
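The two formulas can be checked by hand on a few nodes. The following is a small sketch in plain Python; the function names and the example class counts are assumptions chosen for illustration.

```python
from math import log2

def class_proportions(counts):
    """Turn per-class sample counts into proportions, skipping empty classes."""
    total = sum(counts)
    return [c / total for c in counts if c > 0]

def gini(counts):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in class_proportions(counts))

def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)), in bits."""
    return sum(-p * log2(p) for p in class_proportions(counts))

# Class counts for three hypothetical two-class nodes.
for name, counts in [("pure", [10, 0]), ("skewed", [9, 1]), ("balanced", [5, 5])]:
    print(f"{name:9s} gini={gini(counts):.3f} entropy={entropy(counts):.3f}")
# pure      gini=0.000 entropy=0.000
# skewed    gini=0.180 entropy=0.469
# balanced  gini=0.500 entropy=1.000
```

Both measures are zero for the pure node and peak for the balanced one, which is why either can be used to rank candidate splits.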

Question 3
Difference between Pre-Pruning and Post-Pruning with one advantage each.

Answer:

Pre-Pruning stops tree growth early by applying constraints such as maximum depth or minimum samples per split.
Advantage: Cheap to apply, because branches that would later be pruned are never built.

Post-Pruning grows a complete tree and then removes branches that contribute little to predictive performance.
Advantage: Pruning decisions are based on the full tree, so a useful split that lies below a weak one is not discarded prematurely.
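The contrast can be sketched with scikit-learn. Pre-pruning is expressed through constructor limits such as `max_depth`, while post-pruning is shown here with cost-complexity pruning (`ccp_alpha`), which is one post-pruning method among several. The alpha below is picked arbitrarily from the middle of the pruning path; in practice it would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: growth is constrained up front.
pre = DecisionTreeClassifier(max_depth=3, min_samples_split=10, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute the candidate alphas
# at which subtrees would be collapsed.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Arbitrary mid-path alpha, for illustration only (an assumption).
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```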

Question 4
What is Information Gain and why is it important?

Answer:

Information Gain measures the reduction in entropy after splitting a dataset on a feature.

Information Gain = Entropy(parent) − Weighted Entropy(children)

It is important because it helps select the feature that best separates the classes, leading to more accurate and efficient decision trees.
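As a worked example, the sketch below computes information gain for two candidate splits of the same parent node. The class counts are made up for illustration, and the function names are assumptions.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a node described by its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Entropy(parent) minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [5, 5]                      # 5 samples of each class
perfect_split = [[5, 0], [0, 5]]     # each child is pure
weak_split = [[3, 2], [2, 3]]        # children are almost as mixed as the parent

print(f"perfect split: {information_gain(parent, perfect_split):.3f}")  # 1.000
print(f"weak split:    {information_gain(parent, weak_split):.3f}")     # 0.029
```

The tree would choose the first split, since it removes all of the parent's uncertainty.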

Question 5
Real-world applications, advantages, and limitations of Decision Trees.

Answer:

Applications:

Medical diagnosis

Credit risk assessment

Fraud detection

Customer segmentation

Advantages:

Easy to understand and interpret

Handles numerical and categorical data (scikit-learn's implementation still requires categorical features to be encoded as numbers)

Requires minimal data preprocessing

Limitations:

Prone to overfitting

Sensitive to noisy data

Small changes in data can lead to different trees

**PRACTICAL**

In [1]:
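# Question 6: train a Decision Tree classifier (Gini criterion) on the Iris
# dataset and report test accuracy and feature importances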
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion='gini', random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", model.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


In [2]:
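# Question 7: compare a fully grown tree with one limited to max_depth=3
# (reuses the Iris train/test split from the previous cell)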
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)

pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)

print("Fully Grown Tree Accuracy:", full_tree.score(X_test, y_test))
print("Pruned Tree Accuracy:", pruned_tree.score(X_test, y_test))


Fully Grown Tree Accuracy: 1.0
Pruned Tree Accuracy: 1.0


In [3]:
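# Question 8: train a Decision Tree regressor on the California Housing
# dataset and report mean squared error and feature importances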
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

y_pred = reg.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)


Mean Squared Error: 0.495235205629094
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


In [5]:
# Question 9: tune max_depth and min_samples_split with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X = iris.data
y = iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Parameter grid
param_grid = {
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10]
}

# Decision Tree model
dt = DecisionTreeClassifier(random_state=42)

# GridSearchCV
grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)

# Best model evaluation
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)
print("Test Accuracy:", accuracy_score(y_test, y_pred))


Best Parameters: {'max_depth': 7, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9416666666666668
Test Accuracy: 1.0


Question 10
Healthcare use case: step-by-step process and business value

Answer:

Handling missing values:

Use mean or median for numerical features

Use mode for categorical features

Encoding categorical features:

Apply one-hot encoding to nominal features and ordinal (label) encoding to features with a natural order

Training the model:

Train a Decision Tree classifier on the cleaned dataset

Hyperparameter tuning:

Tune max_depth, min_samples_split, and min_samples_leaf using GridSearchCV

Model evaluation:

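Use accuracy, precision, recall, F1-score, and a confusion matrix. Recall deserves particular attention, because a missed diagnosis (false negative) is usually costlier than a false alarm.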

Business value:
This model can support early disease detection, reduce diagnostic errors, assist doctors in decision-making, improve patient outcomes, and lower healthcare costs.
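The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names, the synthetic label rule, and the injected missing values are invented so the example runs on its own, and the `f1` scoring choice is one option among the metrics listed above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical patient table; columns and values are synthetic.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": pd.Series(rng.choice(["yes", "no"], size=n), dtype=object),
    "sex": pd.Series(rng.choice(["F", "M"], size=n), dtype=object),
})
# Synthetic label rule (an assumption, not medical knowledge).
y = ((df["age"] > 60) & (df["blood_pressure"] > 115)).astype(int)

# Simulate missing values in one numerical and one categorical column.
df.loc[rng.choice(n, size=40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, size=40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker", "sex"]

# Median imputation for numbers; mode imputation plus one-hot for categories.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Tune the three hyperparameters named in the answer.
param_grid = {
    "tree__max_depth": [3, 5, None],
    "tree__min_samples_split": [2, 10],
    "tree__min_samples_leaf": [1, 5],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
print("Best Parameters:", grid.best_params_)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Keeping imputation and encoding inside the pipeline means they are fitted on the training folds only during cross-validation, which avoids leaking test information into preprocessing.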