1Q) What is a Decision Tree, and how does it work in the context of classification?

Ans) A Decision Tree is a supervised machine learning algorithm used for both classification and regression tasks. In classification, it works by recursively splitting the dataset into subsets based on the values of the input features. Each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label. The tree is built greedily: at each node the algorithm selects the feature (and, for numerical features, the threshold) whose split most increases class purity, as measured by a criterion such as Gini Impurity or Entropy. To classify a new sample, the tree is followed from the root to a leaf, and the leaf's class label is the prediction.
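
To make the node, branch, and leaf structure concrete, here is a minimal sketch that fits a shallow tree on the Iris dataset and prints its rules with scikit-learn's `export_text`. The depth limit of 2 and the seed are choices made only for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each "|---" line with a comparison is a test on a feature;
# lines containing "class:" are leaves holding the predicted label.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printed rules from top to bottom follows the same root-to-leaf path that the model takes when it classifies a sample.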

2Q) Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Ans)
Gini Impurity and Entropy are impurity measures used in Decision Trees to evaluate how well a feature separates the data into classes.

Gini Impurity measures the probability that a randomly chosen element of the subset would be misclassified if it were labeled at random according to the distribution of labels in that subset.

Formula:
Gini = 1 − Σ (pᵢ)²
Where pᵢ is the probability of class i in the node.

Entropy measures the amount of uncertainty or randomness in the data.

Formula:
Entropy = − Σ pᵢ × log₂(pᵢ)
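Where pᵢ is the probability of class i in the node.

Impact on Splits:

Lower Gini or Entropy values indicate purer nodes; both are 0 when all samples in a node belong to a single class.
The Decision Tree algorithm chooses the split that results in the greatest reduction in impurity, i.e. the largest drop from the parent's impurity to the weighted average impurity of its children (called Information Gain when Entropy is the measure).
Gini is slightly cheaper to compute because it needs no logarithm. In practice the two criteria usually produce very similar trees, so the choice rarely changes the results much.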
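
As a worked example of the two formulas, the sketch below computes both measures from the class counts in a node. The helper names `gini` and `entropy` and the example counts are made up for illustration; they are not part of scikit-learn.

```python
import numpy as np

def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) for the class counts in a node."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return float(1.0 - np.sum(p ** 2))

def entropy(counts):
    """Entropy in bits: -sum(p_i * log2(p_i)), written as sum(p_i * log2(1/p_i))."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # classes with zero samples contribute nothing
    return float(np.sum(p * np.log2(1.0 / p)))

# Illustrative two-class nodes: pure, mostly one class, evenly mixed.
for counts in ([10, 0], [8, 2], [5, 5]):
    print(counts, "gini:", round(gini(counts), 3), "entropy:", round(entropy(counts), 3))
```

A pure node scores 0 on both measures, while an evenly mixed two-class node reaches the maximum of each: 0.5 for Gini and 1 bit for Entropy.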

3Q) What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Ans)
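Pre-Pruning stops the tree from growing once a stopping condition is met (e.g., a maximum depth or a minimum number of samples per split or leaf).
Advantage: Limits overfitting from the start and shortens training, because the full tree is never built.
Post-Pruning allows the tree to grow fully and then removes branches that do not improve performance on held-out data.
Advantage: Often results in better generalization, because each pruning decision is based on measured performance rather than on a limit fixed in advance.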
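
The difference can be sketched with scikit-learn, assuming cost-complexity pruning (`ccp_alpha`) as the post-pruning method, which is one form of post-pruning among several. The limits used for pre-pruning and the way `alpha` is picked here are illustrative; in practice `alpha` would normally be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Pre-pruning: growth stops early because of limits set before training.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, compute its pruning path,
# then refit with one of the candidate alphas (middle value, illustrative pick).
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(name, "leaves:", model.get_n_leaves(), "test accuracy:", model.score(X_test, y_test))
```

Comparing the number of leaves shows how much each strategy simplifies the tree relative to the fully grown one.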

4Q) What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Ans)
Information Gain (IG) is a metric used to measure how well a feature separates the training examples according to their target classes. It is based on the concept of Entropy, which quantifies the impurity or disorder in a dataset.

Formula:
Information Gain = Entropy(Parent) − Weighted Average Entropy(Children)

Mathematically:
IG(D, A) = Entropy(D) − Σ (|Dᵥ| / |D|) × Entropy(Dᵥ)

Where:

D is the dataset
A is the attribute
Dᵥ is the subset of D where attribute A has value v
|Dᵥ| / |D| is the proportion of samples in subset Dᵥ

Importance:

Information Gain helps in selecting the attribute that results in the largest reduction in impurity.
A higher Information Gain means the split leaves purer child nodes, so fewer further splits are needed to separate the classes.
Because the algorithm picks the highest-gain attribute at every node, the most informative features end up near the root, which tends to give shallower and more interpretable trees.
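
The IG(D, A) formula above can be checked by hand on a small example. The sketch below uses a made-up dataset of 8 samples with two hypothetical attributes, one that separates the classes perfectly and one that does not help at all; the function names are illustrative.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(labels, attribute):
    """IG(D, A) = Entropy(D) - sum over v of |D_v|/|D| * Entropy(D_v)."""
    labels = np.asarray(labels)
    attribute = np.asarray(attribute)
    weighted = 0.0
    for v in np.unique(attribute):
        subset = labels[attribute == v]  # D_v: samples where A == v
        weighted += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted

# Hypothetical toy data: a binary target and two candidate attributes.
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # separates the classes exactly
useless = np.array(["a", "b", "a", "b", "a", "b", "a", "b"])  # each value keeps a 50/50 mix

print("IG(perfect):", information_gain(y, perfect))
print("IG(useless):", information_gain(y, useless))
```

The perfectly separating attribute recovers the full 1 bit of parent entropy, while the uninformative one has a gain of 0, so the tree would split on the first attribute.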

5Q) What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Ans)
Applications: Medical diagnosis, credit scoring, customer segmentation, fraud detection.
Advantages: Easy to interpret, handle both numerical and categorical data, require little data preprocessing.
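Limitations: Prone to overfitting when grown without limits, unstable (small changes in the training data can produce a very different tree), and can be biased toward the majority class when the classes are imbalanced.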

# 6Q) Write a Python program to:

# Load the Iris Dataset
# Train a Decision Tree Classifier using the Gini criterion
# Print the model’s accuracy and feature importances

# Ans)

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

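# Load the data and hold out 30% of it for testing
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Train a tree with Gini impurity as the split criterion
# (no random_state is set on the classifier, so feature importances can vary between runs)
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X_train, y_train)

# Evaluate on the held-out data; importances follow the order of iris.feature_names
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)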


Accuracy: 1.0
Feature Importances: [0.03822004 0.         0.40445656 0.5573234 ]


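# 7Q) Write a Python program to:

# Load the Iris Dataset
# Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
# Ans)

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the data and use the same 70/30 split as in 6Q
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Fully grown tree (no depth limit) versus a tree capped at depth 3
clf_full = DecisionTreeClassifier()
clf_limited = DecisionTreeClassifier(max_depth=3)

clf_full.fit(X_train, y_train)
clf_limited.fit(X_train, y_train)

print("Full Tree Accuracy:", accuracy_score(y_test, clf_full.predict(X_test)))
print("Limited Tree Accuracy:", accuracy_score(y_test, clf_limited.predict(X_test)))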


Full Tree Accuracy: 1.0
Limited Tree Accuracy: 1.0


# 8Q) Write a Python program to:

# Load the California Housing dataset from sklearn
# Train a Decision Tree Regressor
# Print the Mean Squared Error (MSE) and feature importances

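# Ans)

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the data (downloaded on first use) and hold out 30% of it for testing
data = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Fit an unconstrained regression tree
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)

# Evaluate on the held-out data; importances follow the order of data.feature_names
y_pred = reg.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)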


MSE: 0.5305526897620639
Feature Importances: [0.52368989 0.05217861 0.04925763 0.02639407 0.03200267 0.13861829
 0.0898093  0.08804955]


# 9Q) Write a Python program to:

# Load the Iris Dataset
# Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
# Print the best parameters and the resulting model accuracy
# Ans)
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 3, 4]
}

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Perform GridSearchCV
grid = GridSearchCV(clf, param_grid, cv=5)
grid.fit(X_train, y_train)

# Evaluate the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Output the results
print("Best Parameters:", grid.best_params_)
print("Test Accuracy:", accuracy)


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Test Accuracy: 1.0


10Q) Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:

Handle the missing values
Encode the categorical features
Train a Decision Tree model
Tune its hyperparameters
Evaluate its performance
And describe what business value this model could provide in the real-world setting.

Ans)
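Handle Missing Values: Use imputation (mean or median for numerical features, mode for categorical ones) or drop rows/columns with excessive missing data. Fit the imputer on the training split only, so that no information from the test data leaks into the model.
Encode Categorical Features: Use One-Hot Encoding for nominal features with few categories and Ordinal Encoding for ordered or high-cardinality features. In scikit-learn, LabelEncoder is intended for the target, not for input features.
Train Model: Split the data into training and test sets, then fit DecisionTreeClassifier from sklearn with default or initial parameters.
Tune Hyperparameters: Use GridSearchCV with cross-validation to optimize max_depth, min_samples_split, min_samples_leaf, etc.
Evaluate Performance: Use metrics like accuracy, precision, recall, F1-score, and the confusion matrix on the held-out test set. For disease prediction, recall deserves particular attention, because a missed case (false negative) is usually costlier than a false alarm, and accuracy alone can be misleading when few patients have the disease.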
Business Value: Helps in early disease detection, reduces manual diagnosis time, improves patient outcomes, and optimizes resource allocation.
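
The steps above can be sketched as a single scikit-learn pipeline. Everything specific here is an assumption made for illustration: the data is synthetic, the column names (`age`, `blood_pressure`, `smoker`, `disease`) are hypothetical, and the parameter grid and the F1 scoring choice would need to be adapted to the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data; real columns and values would differ.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["disease"] = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Blank out some values to imitate missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

X = df.drop(columns="disease")
y = df["disease"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Steps 1 and 2: impute and encode, fitted on training folds only via the pipeline.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="mean"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: the model itself.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune hyperparameters with cross-validation (grid values are illustrative).
grid = GridSearchCV(
    model,
    {"tree__max_depth": [2, 3, 5], "tree__min_samples_split": [2, 10]},
    cv=5,
    scoring="f1",
)
grid.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```

Keeping imputation and encoding inside the pipeline means they are refitted on each cross-validation fold, so the tuning scores are not inflated by information from the validation data.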