Question 1: What is a Decision Tree, and how does it work in the context of
classification?

Answer:


A Decision Tree is a supervised learning algorithm used for both classification and regression. For classification, it models decisions as a tree in which internal nodes test a feature, branches represent the outcomes of that test, and leaf nodes assign class labels.

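Working: The algorithm recursively splits the data, choosing at each node the feature and threshold that best separate the classes according to a measure such as Information Gain or the Gini Index. Splitting continues until a stopping condition is met (for example, a node is pure or a depth limit is reached), and each leaf then assigns a class.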
It is easy to interpret, handles both categorical and numerical data, and shows the decision-making process clearly.
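
As a rough illustration of the interpretability point, the sketch below fits a shallow tree on the Iris data and prints its learned rules as text. The max_depth=2 setting and the use of the whole dataset for fitting are assumptions made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A deliberately shallow tree so the printed rules stay readable.
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# export_text renders each internal node as a feature test
# and each leaf as a predicted class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printout is one decision rule, so a prediction can be traced from the root to a leaf by hand.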

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
Answer:

In Decision Trees, impurity measures quantify how mixed the class labels in a node are, and the algorithm uses them to decide where to split the data.

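Gini Impurity: Measures the probability of incorrectly classifying a randomly chosen element if it were labeled at random according to the class distribution in the node. It is computed as 1 - sum(p_k^2), where p_k is the proportion of class k. A value of 0 means perfect purity, and the maximum for two classes is 0.5.

Entropy: Based on information theory, it measures the level of uncertainty or disorder in the data. It is computed as -sum(p_k * log2(p_k)). A value of 0 means no randomness (pure), and the maximum for two classes is 1 bit.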

Impact on Splits: Both are used to select the best feature for splitting. The algorithm chooses the split that reduces impurity the most, creating purer child nodes and improving classification accuracy.
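
The two measures can be sketched in a few lines of plain Python. This is only an illustrative implementation; the function names and the toy label lists are assumptions, not part of scikit-learn.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

pure = ["a", "a", "a", "a"]    # a single class
mixed = ["a", "a", "b", "b"]   # an even two-class mix

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures are zero for the pure node and reach their two-class maximum for the even mix, which is why they tend to rank candidate splits similarly.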

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer:

Pre-pruning stops the growth of a Decision Tree early by imposing limits such as a maximum depth, a minimum number of samples per node, or a minimum information gain, which helps avoid overfitting. Post-pruning instead lets the tree grow fully and then removes branches that add little value, for example through cost-complexity pruning. The practical advantage of pre-pruning is that it reduces training time and keeps the model simpler. The practical advantage of post-pruning is that it often generalizes better, because it evaluates the full tree before trimming, whereas pre-pruning can stop before a useful later split is found.
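
A minimal sketch of both approaches in scikit-learn follows. Pre-pruning is expressed through constructor limits, and post-pruning through cost-complexity pruning (ccp_alpha). The specific limits and the choice of the middle alpha from the pruning path are assumptions for illustration; in practice alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth up front (values chosen only for illustration).
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Assumption: take the middle alpha of the path as an example value.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", tree.get_n_leaves(), "test accuracy:", tree.score(X_test, y_test))
```

Larger alpha values prune more aggressively, so the number of leaves shrinks as alpha grows.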

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:

Information Gain is a metric used in Decision Trees to measure how well a feature separates the data into classes. It is calculated as the entropy of the parent node minus the size-weighted average entropy of the child nodes produced by a split. A higher Information Gain means the split yields purer child nodes. It is important because the algorithm greedily selects the split with the highest Information Gain at each node, so every step of tree construction makes the largest available reduction in uncertainty.
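
The calculation can be sketched in plain Python as below. The helper names and the toy "yes"/"no" labels are hypothetical and serve only to show the arithmetic.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n)
               for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "yes", "no", "no", "no"]
perfect_split = [["yes", "yes", "yes"], ["no", "no", "no"]]
weak_split = [["yes", "yes", "no"], ["yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, weak_split))     # roughly 0.08
```

The perfect split removes all uncertainty, while the weak split barely changes the class mix, so the tree would prefer the first one.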

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer:

Decision Trees are widely used in real-world applications such as medical diagnosis, credit risk assessment, fraud detection, customer segmentation, and recommendation systems. Their main advantages are that they are simple to understand, easy to visualize, require little data preprocessing, and can handle both categorical and numerical features. However, their limitations include a tendency to overfit, sensitivity to small changes in data, and lower accuracy compared to more advanced ensemble methods like Random Forests or Gradient Boosting.

Dataset Info:
● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston() or provided CSV).

Question 6: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier using the Gini criterion
● Print the model’s accuracy and feature importances
(Include your Python code and output in the code box below.)

Answer:


In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train_i, y_train_i)

y_pred_i = clf.predict(X_test_i)
print("Iris Dataset - Decision Tree Classifier")
print("Accuracy:", accuracy_score(y_test_i, y_pred_i))
print("Feature Importances:", clf.feature_importances_)
print("\n")

# Note: Boston Housing (`load_boston`) was deprecated in scikit-learn 1.0 and removed in 1.2 due to ethical concerns.
# California Housing is used here as a replacement for the regression task.

housing = fetch_california_housing()
X_h, y_h = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_h, y_h, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train_h, y_train_h)

y_pred_h = regressor.predict(X_test_h)
print("California Housing Dataset - Decision Tree Regressor")
print("Mean Squared Error:", mean_squared_error(y_test_h, y_pred_h))
print("R^2 Score:", r2_score(y_test_h, y_pred_h))
print("Feature Importances:", regressor.feature_importances_)


Iris Dataset - Decision Tree Classifier
Accuracy: 1.0
Feature Importances: [0.         0.01667014 0.90614339 0.07718647]


California Housing Dataset - Decision Tree Regressor
Mean Squared Error: 0.495235205629094
R^2 Score: 0.622075845135081
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


Question 7: Write a Python program to:
● Load the Iris Dataset
● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
(Include your Python code and output in the code box below.)

Answer:

In [5]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

pruned_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
pruned_tree.fit(X_train, y_train)
y_pred_pruned = pruned_tree.predict(X_test)
accuracy_pruned = accuracy_score(y_test, y_pred_pruned)

print("Fully-grown tree Accuracy:", accuracy_full)
print("Max_depth=3 tree Accuracy:", accuracy_pruned)


Fully-grown tree Accuracy: 1.0
Max_depth=3 tree Accuracy: 1.0


Question 8: Write a Python program to:
● Load the California Housing dataset from sklearn
● Train a Decision Tree Regressor
● Print the Mean Squared Error (MSE) and feature importances
(Include your Python code and output in the code box below.)

Answer:

In [6]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, r2_score

housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R^2 Score:", r2_score(y_test, y_pred))
print("Feature Importances:", regressor.feature_importances_)


Mean Squared Error: 0.495235205629094
R^2 Score: 0.622075845135081
Feature Importances: [0.52850909 0.05188354 0.05297497 0.02866046 0.03051568 0.13083768
 0.09371656 0.08290203]


Question 9: Write a Python program to:
● Load the Iris Dataset
● Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
● Print the best parameters and the resulting model accuracy
(Include your Python code and output in the code box below.)

Answer:

In [7]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dt = DecisionTreeClassifier(random_state=42)

param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 3, 4, 5]
}

grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Answer:

To build a Decision Tree model for predicting disease in a healthcare dataset with mixed data types and missing values, the following steps can be taken:

Handle Missing Values: First, identify missing data in the dataset. For numerical features, missing values can be imputed using mean or median values, while categorical features can use mode or a separate category for missing data. Advanced techniques like K-Nearest Neighbors imputation can also be applied.

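Encode Categorical Features: Convert categorical variables into numerical form so that the Decision Tree can process them. One-hot encoding suits nominal features with no natural order, while ordinal (label) encoding suits ordered categories, since it implies a ranking among the values.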

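Train the Decision Tree Model: Split the dataset into training and testing sets, ideally before fitting the imputers and encoders so that no information from the test set leaks into preprocessing, and train a Decision Tree Classifier. Start with default hyperparameters to establish a baseline performance.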

Hyperparameter Tuning: Use techniques like GridSearchCV or RandomizedSearchCV to tune parameters such as max_depth, min_samples_split, and criterion to improve model accuracy and prevent overfitting.

Evaluate Performance: Evaluate the model using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC, depending on the business focus (e.g., minimizing false negatives in healthcare). Cross-validation can provide more reliable performance estimates.

Business Value: This model can help healthcare providers identify high-risk patients early, enabling timely interventions and treatment. It can also assist in resource allocation, reduce diagnostic costs, and improve patient outcomes, provided its performance has been validated on held-out data. Because the tree's rules are easy to read, clinicians can also review why a given patient was flagged.
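
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, the column names (age, blood_pressure, smoker) are hypothetical, and the small parameter grid and the choice of recall as the tuning metric are only examples.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns are hypothetical.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Blank out roughly 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "age"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Impute and encode inside the pipeline so they are fit on training folds only.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Tune on recall, since false negatives are costly in this setting.
param_grid = {
    "tree__max_depth": [2, 3, 5],
    "tree__min_samples_split": [2, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

proba = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
print("ROC-AUC:", auc)
```

Keeping imputation and encoding inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking validation data into the preprocessing statistics.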