# Theory and Practical Questions

Question 1: What is a Decision Tree, and how does it work in the context of classification?
* A Decision Tree is a supervised learning algorithm used for both classification and regression tasks. In classification, it recursively splits the dataset into smaller subsets based on feature values. Each split uses the condition that best separates the classes, and the process continues until a stopping condition is met (for example, a node contains a single class or a depth limit is reached). The resulting leaf nodes hold the predicted class labels. You can think of it as a sequence of yes/no questions that lead step by step to a predicted category.
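
The sequence of yes/no questions can be made visible by printing the rules of a fitted tree. The following is a minimal sketch, assuming scikit-learn's `export_text` helper and a shallow tree on the Iris data; `max_depth=2` and `random_state=0` are illustrative choices, not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is one threshold test; lines containing "class:" are leaf nodes
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output from top to bottom follows one path of questions from the root to a leaf.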

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
* Gini Impurity and Entropy measure how mixed the classes are in a node. Gini Impurity is the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution: Gini = 1 - Σ p_k^2, where p_k is the fraction of samples in class k. Entropy quantifies the disorder in the node: Entropy = -Σ p_k log2(p_k). Both are zero for a pure node and largest when the classes are evenly mixed. When building a Decision Tree, each candidate split is scored by how much it reduces impurity, and the split with the greatest reduction is chosen.
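
To make the two measures concrete, here is a small sketch in plain Python that computes both for a list of class labels. The helper names (`class_probabilities`, `gini_impurity`, `entropy`) and the toy labels are made up for illustration.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]

def gini_impurity(labels):
    """Gini = 1 - sum of squared class probabilities."""
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes."""
    return sum(p * math.log2(1 / p) for p in class_probabilities(labels))

pure = ["a", "a", "a", "a"]    # one class only
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini_impurity(pure), entropy(pure))    # 0.0 0.0
print(gini_impurity(mixed), entropy(mixed))  # 0.5 1.0
```

Both measures agree on the ranking here: the pure node scores zero and the evenly mixed node scores the two-class maximum.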

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
* Pre-pruning restricts the growth of a Decision Tree during training by setting limits such as maximum depth, minimum samples per split, or minimum samples per leaf. Its practical advantage is speed: the full tree is never built, so training is cheaper and overfitting is curbed early. Post-pruning lets the tree grow fully and then removes branches that contribute little, for example with cost-complexity pruning. Its practical advantage is that pruning decisions are based on the complete tree, so it often yields a simpler model that generalizes better.
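
Both approaches can be tried side by side in scikit-learn. The sketch below assumes the Iris data and uses cost-complexity pruning (`ccp_alpha`) as the post-pruning method. The specific limits and the choice of the middle alpha are illustrative only; in practice the alpha would normally be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Reference: a fully grown tree with no limits
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: growth stops early because of limits set before training
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute candidate alphas from the full tree, then refit with one
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)
# Illustrative pick of the middle alpha; clamp at 0 to guard against rounding noise
alpha = max(path.ccp_alphas[len(path.ccp_alphas) // 2], 0.0)
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```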

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?
* Information Gain measures the reduction in uncertainty (impurity) achieved by splitting the data on a particular feature. It is the entropy before the split minus the weighted average entropy of the child nodes after the split: IG = Entropy(parent) - Σ (n_child / n_parent) × Entropy(child). A higher Information Gain means the split separates the classes more effectively. This makes it a natural criterion for choosing the most informative feature at each node, which tends to yield a smaller and more accurate Decision Tree.
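
As a worked example of this calculation, the sketch below computes Information Gain for two candidate splits of the same parent node. The function names and the toy "yes"/"no" labels are hypothetical and chosen only to show the two extremes.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 4 + ["no"] * 4

# A split that separates the classes completely
perfect_split = [["yes"] * 4, ["no"] * 4]
# A split whose children are just as mixed as the parent
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0
print(information_gain(parent, useless_split))  # 0.0
```

A tree builder using this criterion would pick the first split, since it removes all of the parent's uncertainty.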

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
* Decision Trees are widely applied in fields such as medical diagnosis, credit risk assessment, fraud detection, and customer segmentation. Their key advantage is that they are easy to understand, interpret, and visualize. Their main limitations are that they are prone to overfitting, especially on noisy data, and that they are unstable: a small change in the training data can produce a very different tree. Pruning and ensemble methods such as Random Forests help mitigate both problems.

Dataset Info:
* Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
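* Boston Housing Dataset for regression tasks (sklearn.datasets.load_boston() or provided CSV). Note that load_boston() was removed in scikit-learn 1.2, so Question 8 uses the California Housing dataset (sklearn.datasets.fetch_california_housing()) instead.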

Question 6: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier using the Gini criterion
* Print the model’s accuracy and feature importances
* (Include your Python code and output in the code box below.)



In [1]:
# Answer:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data into a DataFrame, with the class label in a 'target' column
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Separate features and labels
X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Train a Decision Tree with the Gini criterion (the scikit-learn default, set explicitly here)
classifier = DecisionTreeClassifier(criterion='gini')
classifier.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

# Importances are listed in the same order as data.feature_names
print("Feature Importances:", classifier.feature_importances_)

Accuracy: 0.9555555555555556
Feature Importances: [0.02146947 0.02146947 0.57196476 0.38509631]


Question 7: Write a Python program to:
* Load the Iris Dataset
* Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to fully-grown tree.
* (Include your Python code and output in the code box below.)

In [2]:
# Answer:
import pandas as pd  # needed for the DataFrame below
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data and separate features from labels
data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Tree limited to a depth of 3
classifier_limited = DecisionTreeClassifier(max_depth=3)
classifier_limited.fit(X_train, y_train)
y_pred_limited = classifier_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

# Fully grown tree (no depth limit)
classifier_full = DecisionTreeClassifier()
classifier_full.fit(X_train, y_train)
y_pred_full = classifier_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print(f"Accuracy with max_depth=3: {acc_limited}")
print(f"Accuracy with fully grown tree: {acc_full}")

Accuracy with max_depth=3: 0.9555555555555556
Accuracy with fully grown tree: 0.9555555555555556


Question 8: Write a Python program to:
* Load the California Housing dataset from sklearn
* Train a Decision Tree Regressor
* Print the Mean Squared Error (MSE) and feature importances
* (Include your Python code and output in the code box below.)

In [3]:
# Answer:
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Load the California Housing data; the target is the median house value
housing = fetch_california_housing()
df = pd.DataFrame(housing.data, columns=housing.feature_names)
df['target'] = housing.target

X = df.drop('target', axis=1)
y = df['target']

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Fully grown regression tree; no random_state is set, so results can vary slightly between runs
regressor = DecisionTreeRegressor()
regressor.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"Mean Squared Error (MSE): {mse}")
# Importances are listed in the same order as housing.feature_names
print("Feature Importances:", regressor.feature_importances_)

Mean Squared Error (MSE): 0.49275203360544245
Feature Importances: [0.50916108 0.05132538 0.02996523 0.02690529 0.02699413 0.14046386
 0.10803194 0.10715309]


Question 9: Write a Python program to:
* Load the Iris Dataset
* Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
* Print the best parameters and the resulting model accuracy
* (Include your Python code and output in the code box below.)

In [4]:
# Answer:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Iris data and separate features from labels
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target

X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Candidate values for the two hyperparameters being tuned
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4, 5]
}

dt = DecisionTreeClassifier()

# 5-fold cross-validation on the training set only; the test set stays untouched
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best Parameters:", grid_search.best_params_)

# best_estimator_ is refit on the full training set with the best parameters
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy with Best Parameters: {accuracy}")

Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy with Best Parameters: 0.9555555555555556


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:
* Handle the missing values
* Encode the categorical features
* Train a Decision Tree model
* Tune its hyperparameters
* Evaluate its performance
And describe what business value this model could provide in the real-world setting.

Answer:
* Handle Missing Values: For numerical features, use mean or median imputation (median is more robust to outliers). For categorical features, fill missing values with the most frequent category. Fit the imputers on the training data only, so that no information from the test set leaks into the model.
* Encode Categorical Features: scikit-learn's Decision Trees require numeric input, so apply One-Hot Encoding to nominal features and ordinal (label-style) encoding to features whose categories have a natural order.
* Train the Decision Tree: Split the dataset into training and testing sets, preferably stratified by the disease label so both sets keep the same class balance, then train a Decision Tree model on the training data.
* Tune Hyperparameters: Use GridSearchCV or RandomizedSearchCV to find the best values for parameters like max_depth, min_samples_split, and criterion.
* Evaluate the Model: Check performance using accuracy, precision, recall, F1-score, and ROC-AUC. Accuracy alone can be misleading when few patients have the disease, and recall matters most when missing a sick patient is costly.
* Business Value: This model can help healthcare professionals predict diseases early, identify high-risk patients, and support better decision-making. It reduces manual workload, improves efficiency, and can ultimately lead to better patient outcomes and cost savings for the company.
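
The steps above can be sketched as a single scikit-learn pipeline. This is only a minimal illustration under stated assumptions: the patient data is synthetic, every column name (`age`, `blood_pressure`, `smoker`, `disease`) is hypothetical, and recall is used as the tuning metric on the assumption that missed cases are the costliest error.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all column names are hypothetical
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Steps 1 and 2: impute and encode; fitted on training data only inside the pipeline
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the Decision Tree sits at the end of the pipeline
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Step 4: tune hyperparameters with cross-validation
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_split": [2, 10, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set
test_predictions = search.predict(X_test)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, test_predictions))
```

Wrapping preprocessing and the model in one pipeline means each cross-validation fold fits its own imputers and encoder, which keeps the evaluation free of leakage.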