In [1]:
'''

## Question 1: What is a Decision Tree, and how does it work in the context of classification?

A decision tree is a supervised learning model resembling a tree (or flowchart) structure, used for classification (and regression) tasks. In the classification context, it partitions the feature space into regions associated with distinct class labels.

Structure of a Decision Tree

Root node: the topmost node, which represents the entire training dataset.

Internal (decision) nodes: nodes where a test is applied on one of the features (e.g. “Is feature X > threshold?”).

Branches / edges: outcomes of the test, which lead to child nodes.

Leaf (terminal) nodes: nodes that assign a class label (final decision).

The path from the root to a leaf corresponds to a rule: a conjunction of feature‑tests that leads to a classification decision.

How It Works for Classification

Recursive splitting (“divide and conquer”)
Starting at the root, the algorithm selects a feature and a split (e.g. threshold) that best partitions the data into purer subsets (i.e. subsets where one class predominates). This process is repeated on each subset, recursively creating child nodes.

Impurity / splitting criterion
To decide which split is “best,” decision tree algorithms use impurity or information measures such as:

Entropy / Information Gain: how much uncertainty is reduced by the split


Gini impurity: probability of misclassifying a randomly chosen instance if labeled according to class proportions in the node


The algorithm evaluates candidate splits and picks the one that leads to the highest gain (or the greatest impurity reduction).


Stopping / leaf creation
The splitting continues until a stopping criterion is met, for example:

all instances in that node belong to the same class,

no further features remain for splitting,

or a maximum tree depth / minimum samples per node constraint is reached (to prevent overfitting).
At that point, the node becomes a leaf and is assigned a class (often the majority class among the training instances in that node).


Prediction on new instances
Given a new data sample, we start at the root and evaluate the feature test there. Depending on the outcome, we follow the appropriate branch. We continue until a leaf node is reached, and we output the class label stored in that leaf as the prediction.
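
The traversal described above can be sketched in a few lines of Python. The tree here is a hypothetical hand-written structure (nested dicts with the assumed keys `feature`, `threshold`, `left`, `right`, and `label`), not scikit-learn's internal representation, and the thresholds are illustrative rather than learned.

```python
# Hypothetical hand-built tree: internal nodes test "x[feature] <= threshold",
# leaves carry a class label. Thresholds are illustrative, not learned.
tree = {
    "feature": 2, "threshold": 2.45,          # petal length (cm)
    "left": {"label": "setosa"},
    "right": {
        "feature": 3, "threshold": 1.75,      # petal width (cm)
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict_one(node, x):
    """Follow the feature tests from the root until a leaf is reached."""
    while "label" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


sample = [5.1, 3.5, 1.4, 0.2]   # sepal length, sepal width, petal length, petal width
prediction = predict_one(tree, sample)
print(prediction)
```

Each loop iteration corresponds to one feature test on the path from the root to a leaf, so prediction cost grows with tree depth rather than with the size of the training set.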

Strengths and Limitations (brief)

Strengths

Highly interpretable — one can trace the path of decisions easily.

Handles both categorical and numerical features.

Little preprocessing (no requirement for scaling).

Can model non‑linear relationships.

Limitations

Prone to overfitting when grown deep.

Sensitive to small changes in data (unstable).

Biased towards features with many levels/categories.

When decision boundaries are complex or not axis-aligned, a tree may need many splits, reducing generalization.

To mitigate overfitting, one often uses pruning (cutting back unnecessary branches) or constraints like maximum depth, minimum samples per split, etc.

## Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

What are they (in simple terms)

Gini impurity is a measure of how “mixed” the classes are in a node. It gives the probability that a randomly chosen example from that node would be mislabeled if we randomly labeled it according to the class proportions in that node. The more mixed the node (i.e. classes are evenly distributed), the higher the impurity. If all examples in the node belong to one class, Gini impurity is zero (i.e. the node is “pure”).

Entropy (from information theory) measures the amount of uncertainty or disorder in the class distribution of a node. A node with a balanced mix of classes has high entropy (high uncertainty), while a node dominated by a single class has low entropy (low surprise). When all examples in a node are of the same class, entropy is zero — no uncertainty remains.
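
As a minimal sketch of the two definitions above, the following functions compute Gini impurity and entropy (in bits) from a list of class labels. The function names are chosen for illustration only and do not correspond to any library API.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


mixed = ["a", "a", "b", "b"]   # evenly mixed node
pure = ["a", "a", "a", "a"]    # pure node

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

For two classes, both measures peak at a 50/50 mix and drop to zero for a pure node; they differ only in how they score the distributions in between.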

How they impact splits in a decision tree

Evaluating a candidate split
At each node, the tree algorithm considers possible ways to divide the data (by features and thresholds). For each candidate split, it looks at how “impure” the two resulting child nodes would be (using Gini or entropy), and also accounts for how many examples go into each child (i.e. weighted by size).

Choosing the best split
The algorithm picks the split that yields the largest reduction in impurity (i.e. the greatest “purity gain”), computed as the parent's impurity minus the size-weighted average impurity of the children. Because of the weighting, a split that carves off a tiny pure node while leaving a large mixed one scores poorly. For entropy, this reduction is called information gain (how much uncertainty the split removes).
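
The evaluation of candidate splits can be illustrated with a small sketch for a single numeric feature. It tries midpoints between sorted distinct values and keeps the threshold with the largest weighted Gini reduction. This is a simplified illustration under those assumptions, not how any particular library implements its search; all function names are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; return (threshold, reduction)."""
    best = (None, 0.0)
    distinct = sorted(set(values))
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        reduction = impurity_reduction(labels, left, right)
        if reduction > best[1]:
            best = (t, reduction)
    return best


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, reduction = best_threshold(values, labels)
print(threshold, reduction)
```

A full tree learner would repeat this search over every feature at every node and recurse into the two children.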

Differences in behavior / biases

Computational cost: Entropy involves logarithmic operations, which are relatively more expensive; Gini impurity uses simpler operations, making it slightly faster in practice.

Split preference: Gini is sometimes described as favoring splits that isolate the largest class, while entropy penalizes mixed nodes slightly more heavily and can occasionally choose a different split. In practice, the two criteria usually produce very similar trees.

Because the differences are usually small, CART-style implementations, including scikit-learn's DecisionTreeClassifier, use Gini by default.


## Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Pre‑Pruning (Early Stopping)

What it is:
Pre‑pruning means you stop the growth of the decision tree during its construction, rather than growing it fully. You impose constraints or criteria so that splits are only performed if they satisfy certain thresholds. Common constraints include: maximum tree depth, minimum number of samples required to split a node, minimum impurity improvement (or gain) needed for a split, minimum samples per leaf, etc.

Practical advantage:
Because you limit the size of the tree from the start, training is faster and less resource‑intensive. You avoid growing branches that mostly fit noise, which is especially valuable with large datasets or limited computational resources. The resulting smaller tree is also easier to interpret.

Post‑Pruning

What it is:
Post‑pruning (sometimes called “prune after full growth”) means you let the tree grow completely (or at least without strong early constraints), potentially overfitting the training data, and then afterwards remove ("prune") branches or subtrees that don’t help generalization. Pruning is based on evaluating performance (often on validation data) or using cost‑complexity criteria that trade off tree complexity vs. error. Methods include reduced error pruning, cost‑complexity pruning, etc.

Practical advantage:
Post‑pruning tends to yield trees that generalize better because you can first see the full complexity and only then trim away those parts that are truly unnecessary. This helps in capturing subtle patterns in the data that pre‑pruning might block prematurely. It often gives a better balance between bias and variance.

Key Differences & Trade‑Offs

When the decision to prune or stop comes: pre‑pruning acts during growth, post‑pruning acts after.

Risk: pre‑pruning risks underfitting (missing useful structure because splits are cut off too early). Post‑pruning costs more computation, since you build a large tree first and then evaluate many candidate prunings.
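
To make the contrast concrete, here is a hedged sketch using scikit-learn on the Iris data. Pre-pruning is expressed through growth constraints (`max_depth`, `min_samples_leaf`), and post-pruning through cost-complexity pruning (`ccp_alpha`). The specific constraint values and the choice of alpha from the middle of the pruning path are illustrative assumptions, not tuned settings; in practice alpha would be chosen by validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: constrain growth up front (values are illustrative).
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path of a fully grown
# tree, then refit with one of the candidate alphas.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)
# Illustrative pick from the middle of the path; clamp at zero in case of
# tiny negative values caused by floating-point error.
alpha = max(0.0, path.ccp_alphas[len(path.ccp_alphas) // 2])
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42).fit(X_train, y_train)

# Unconstrained tree for comparison.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

for name, model in [("pre-pruned", pre), ("post-pruned", post), ("full", full)]:
    print(name, model.get_n_leaves(), model.score(X_test, y_test))
```

Comparing leaf counts alongside test accuracy shows the trade-off directly: both pruned trees are no larger than the full tree, and whether they generalize better depends on the data.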


## Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Information Gain in decision trees is a measure of how much “useful information” a feature gives us about the target class. In other words, it quantifies the reduction in uncertainty (or impurity) about the class labels when we split a node using that feature. It is the difference between the impurity before the split and the weighted impurity after the split.
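
The definition above (parent impurity minus weighted child impurity, with entropy as the impurity measure) can be sketched as follows. The labels and the two candidate splits are made up for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes perfectly.
good_split = [["yes"] * 5, ["no"] * 5]
# A split that leaves both children almost as mixed as the parent.
weak_split = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

ig_good = information_gain(parent, good_split)
ig_weak = information_gain(parent, weak_split)
print(ig_good, ig_weak)
```

The perfect split recovers the full one bit of uncertainty in the parent, while the weak split recovers only a small fraction of it, which is why a tree learner would prefer the first.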

Why Information Gain matters for choosing splits

Guides the best split choice
At each internal node, the decision tree algorithm considers multiple candidate features (and possible split points). Information Gain provides a quantitative criterion: the split that yields the highest information gain is chosen, because it leads to the greatest reduction in class uncertainty (i.e. the most “pure” children).

Encourages purer child nodes
A high information gain indicates that after splitting, the resulting child nodes have more homogeneous class distributions compared to the parent. In effect, the split separates classes well. Thus, features that produce splits with strong separation will tend to have higher gain.

Helps in building efficient, interpretable trees
Because information gain favors splits that maximize purity early, the tree tends to place more informative features higher up (closer to the root). This leads to more compact and interpretable trees (since early splits already separate many classes).

Recognizing limits and biases
One downside is that information gain is biased toward features with many distinct values (e.g. unique identifiers), because splitting on them produces many small, trivially pure children. To counteract this, variants such as the gain ratio (used in C4.5) divide the gain by the entropy of the split itself, which penalizes features that fragment the data into many branches.
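
The bias can be seen in a toy example. The sketch below computes information gain and a simplified gain ratio (gain divided by the entropy of the partition sizes) for a multiway split on each distinct feature value. The data and the names `patient_id` and `test_result` are hypothetical, and this is a simplification of the idea rather than a full C4.5 implementation.

```python
import math
from collections import Counter, defaultdict


def entropy(items):
    """Shannon entropy (bits) of the value distribution in a list."""
    n = len(items)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_and_ratio(feature_values, labels):
    """Multiway split on each distinct value; return (gain, gain ratio)."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)   # entropy of the partition sizes
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio


labels = ["yes", "yes", "no", "no"]
patient_id = [101, 102, 103, 104]            # unique per row, no real signal
test_result = ["pos", "pos", "neg", "neg"]   # genuinely informative

id_gain, id_ratio = gain_and_ratio(patient_id, labels)
test_gain, test_ratio = gain_and_ratio(test_result, labels)
print(id_gain, id_ratio)
print(test_gain, test_ratio)
```

Both features reach the maximum information gain on this data, but the identifier's gain ratio is halved because its split is spread over four branches instead of two.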


## Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Real‑World Applications

Healthcare / Medical Diagnosis
Decision trees are used to help diagnose diseases or predict patient outcomes using features such as symptoms, medical history, lab test results. They’re valued especially where interpretability is critical (doctors need to see “why” a prediction is made).

Finance & Credit Scoring
Banks and financial institutions use decision trees to assess credit risk of loan applicants, to detect fraud, or to decide whether a transaction is suspicious.

Marketing & Customer Segmentation
In marketing, decision trees help categorize customers based on behavior, demographics, likelihood to respond, churn risk, etc. They support targeted campaigns and retention strategies.

Predictive Maintenance in Manufacturing
Models trained on machinery sensor and workflow data predict failures so that maintenance can be scheduled before breakdowns. This reduces downtime and cost.

Fraud Detection / Anomaly Detection
Detecting unusual or fraudulent behavior (in transactions, insurance claims, etc.) is a common use case because trees can learn rules that separate normal vs abnormal patterns.

Main Advantages

Interpretability / Transparency: The “if‑then” rules are easy to follow; stakeholders can understand how a decision is made. Useful in domains where explanations are necessary (healthcare, finance).

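Handles different data kinds: Trees can work with numerical features without scaling or normalization, and the algorithm itself can split on categorical features (although some libraries, including scikit-learn, require categories to be encoded numerically first). Some implementations can also handle missing values directly.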

Non‑linear relationships: They can capture nonlinear decision boundaries and variable interactions without having to manually specify interaction terms.

Main Limitations

Overfitting: If allowed to grow too deep, a tree may fit noise in training data, reducing its performance on new/unseen data. Pruning, limiting depth, etc., are mitigation methods.

Instability / High Variance: Small changes in the training data (e.g. one or few examples added/removed) can lead to very different tree structures and predictions. This reduces reliability in some settings.

Bias with Imbalanced Data / Features with Many Levels: Trees may favor features with many distinct values, and may underperform when one class is heavily underrepresented (since splits tend to favor majority class).

Limited Smoothness / Granularity: Predictions are piecewise constant: every sample that falls in the same leaf gets the same class (or, in regression, the same value). In some applications, smoother or more continuous models are needed. Also, very deep trees are hard to interpret.

'''


In [2]:
#Dataset Info:
'''● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or
provided CSV).
● Boston Housing Dataset for regression tasks
(sklearn.datasets.load_boston(), removed in scikit-learn 1.2, or provided CSV). '''
#Question 6:   Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load iris dataset
iris = load_iris()
X = iris.data       # feature matrix (shape: n_samples × n_features)
y = iris.target     # class labels (0, 1, 2 for three Iris species)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Instantiate Decision Tree Classifier with Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Predict on test set
y_pred = clf.predict(X_test)

# Compute accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.3f}")

# Feature importances
importances = clf.feature_importances_
for feat_name, imp in zip(iris.feature_names, importances):
    print(f"{feat_name}: {imp:.3f}")



Test accuracy: 1.000
sepal length (cm): 0.000
sepal width (cm): 0.019
petal length (cm): 0.893
petal width (cm): 0.088


In [3]:
# Question 7:  Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to
#a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 1. Decision Tree with max_depth = 3
clf_limited = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)
print(f"Accuracy with max_depth=3: {acc_limited:.3f}")

# 2. Fully grown Decision Tree (default settings)
clf_full = DecisionTreeClassifier(criterion='gini', random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)
print(f"Accuracy of fully grown tree: {acc_full:.3f}")

# (Optional) Also print feature importances for both
print("\nFeature importances (max_depth=3):")
for name, imp in zip(iris.feature_names, clf_limited.feature_importances_):
    print(f"  {name}: {imp:.3f}")

print("\nFeature importances (full tree):")
for name, imp in zip(iris.feature_names, clf_full.feature_importances_):
    print(f"  {name}: {imp:.3f}")

Accuracy with max_depth=3: 1.000
Accuracy of fully grown tree: 1.000

Feature importances (max_depth=3):
  sepal length (cm): 0.000
  sepal width (cm): 0.000
  petal length (cm): 0.925
  petal width (cm): 0.075

Feature importances (full tree):
  sepal length (cm): 0.000
  sepal width (cm): 0.019
  petal length (cm): 0.893
  petal width (cm): 0.088


In [4]:
#Question 8: Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X = housing.data
y = housing.target
feature_names = housing.feature_names

# Split into train / test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Instantiate and train the regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predict on the test set
y_pred = reg.predict(X_test)

# Compute Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on test data: {mse:.4f}")

# Print feature importances
importances = reg.feature_importances_
print("Feature importances:")
for name, imp in zip(feature_names, importances):
    print(f"  {name}: {imp:.4f}")

Mean Squared Error on test data: 0.5280
Feature importances:
  MedInc: 0.5235
  HouseAge: 0.0521
  AveRooms: 0.0494
  AveBedrms: 0.0250
  Population: 0.0322
  AveOccup: 0.1390
  Latitude: 0.0900
  Longitude: 0.0888


In [15]:
# Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using
#GridSearchCV
#● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Initialize the DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_split': [2, 5, 10, 20]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5, scoring='accuracy', verbose=1)

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameters
print(f"Best parameters: {grid_search.best_params_}")

# Evaluate the model on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test set accuracy: {accuracy:.3f}")


Fitting 5 folds for each of 20 candidates, totalling 100 fits
Best parameters: {'max_depth': 3, 'min_samples_split': 2}
Test set accuracy: 0.980


In [16]:
'''

## Question 10: Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
● Handle the missing values
● Encode the categorical features
● Train a Decision Tree model
● Tune its hyperparameters
● Evaluate its performance
And describe what business value this model could provide in the real-world
setting.



To develop a predictive model for disease diagnosis using a healthcare dataset with mixed data types and missing values, follow these steps:

1. Handle Missing Values

One option is MissForest, a nonparametric imputation method for mixed-type datasets. MissForest iteratively imputes missing values using a random forest, so it can capture nonlinear relationships and interactions between variables, and it has been reported to perform well against simpler imputation methods in such settings. Whatever imputer is used, fit it on the training split only and then apply it to the validation and test data, to avoid leakage.

2. Encode Categorical Features

Apply one-hot encoding to transform categorical variables into binary columns. This method ensures that the model interprets categorical data appropriately, preventing any ordinal relationships from being inferred where none exist. For example, convert a 'Gender' column with values 'Male' and 'Female' into two separate columns: 'Gender_Male' and 'Gender_Female'.

3. Train a Decision Tree Model

Use a DecisionTreeClassifier from scikit-learn to train the model. Decision trees suit healthcare data because they are interpretable and, once categorical features are encoded, work with mixed feature types. Set parameters such as max_depth and min_samples_split to limit overfitting and improve generalization.

4. Tune Hyperparameters

Use GridSearchCV to perform an exhaustive search over specified parameter values, such as max_depth, min_samples_split, and min_samples_leaf. It evaluates every combination with cross-validation and selects the best-scoring configuration within the grid. Alternatively, RandomizedSearchCV gives a cheaper search when the parameter space is large.

5. Evaluate Model Performance

Assess the model's performance using metrics like accuracy, precision, recall, and F1-score. In healthcare applications, precision and recall are particularly important: high precision means few false positives, and high recall means few false negatives (missed cases). Additionally, use a confusion matrix and the ROC-AUC score to evaluate the model's ability to discriminate between classes.
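
Steps 1 to 5 can be tied together in one scikit-learn pipeline. The sketch below is only an illustration under several assumptions: the patient data is synthetic and the column names (`age`, `blood_pressure`, `smoker`) are hypothetical; `SimpleImputer` is swapped in for MissForest, which is not part of scikit-learn; and the parameter grid and the choice of recall as the tuning metric are examples rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic, hypothetical patient data (not real records).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)

# Simulate missing values in one numeric and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: impute, then one-hot encode the categorical columns.
# SimpleImputer stands in for MissForest here (an assumption of this sketch).
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the decision tree sits at the end of the pipeline, so imputation
# and encoding are fitted on training folds only.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune tree hyperparameters with cross-validation.
param_grid = {"tree__max_depth": [3, 5, None], "tree__min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Step 5: evaluate on the held-out test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(search.best_params_)
print(classification_report(y_test, y_pred))
print(auc)
```

Keeping the preprocessing inside the pipeline means that `GridSearchCV` refits the imputer and encoder within each fold, which avoids leaking information from validation data into training.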

Business Value

Implementing this predictive model can significantly enhance patient care by enabling early detection of diseases, leading to timely interventions. It can also optimize resource allocation, reduce healthcare costs, and improve patient outcomes by identifying high-risk individuals who may benefit from preventive measures or closer monitoring.

'''