#Question 1: What is a Decision Tree, and how does it work in the context of classification?
Answer:
A Decision Tree is a supervised machine learning algorithm used for both classification and regression. It works like a flowchart, where:

Root Node: Represents the entire dataset.
Decision Nodes: Represent feature-based splitting rules.
Leaf Nodes: Represent the final class or predicted value.

In classification, the tree splits the dataset into subsets based on the feature that provides the maximum class separation (using criteria like Gini or Entropy). At prediction time, a data point follows the decision rules until it reaches a leaf node, which assigns a class label.

Example:
Suppose we classify whether a flower is Setosa or Versicolor.
Root Node checks: Petal Length ≤ 2.5 cm?

If Yes → Setosa
If No → Versicolor
Thus, the Decision Tree mimics a series of “if–else” rules for classification.
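
The single-split example above can be written directly as an if-else rule. This is only a hand-written sketch of that one rule: the function name `classify_flower` is hypothetical, and the 2.5 cm threshold is taken from the example rather than learned from data.

```python
def classify_flower(petal_length_cm: float) -> str:
    """Toy one-split tree: the root test routes short petals to the Setosa leaf."""
    if petal_length_cm <= 2.5:   # root node test
        return "Setosa"          # leaf reached when the answer is Yes
    return "Versicolor"          # leaf reached when the answer is No


print(classify_flower(1.4))  # Setosa
print(classify_flower(4.7))  # Versicolor
```

A fitted tree applies the same kind of rule at every decision node, only with thresholds chosen by the impurity criterion.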

#Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
Answer:
1. Gini Impurity:
- Formula: Gini = 1 - Σ (p_i)^2
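- Measures how often a randomly chosen sample from the node would be misclassified if it were labeled at random according to the node's class distribution.
- Gini = 0 means a pure node (all samples belong to one class).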

2. Entropy:
- Formula: Entropy = - Σ p_i * log2(p_i)
- Measures the uncertainty or disorder in data.
- Entropy = 0 means a pure node (all samples belong to one class).

Impact on Splits:
- Decision Trees choose the split that reduces impurity the most.
- Lower impurity in the child nodes → more homogeneous subsets → more confident predictions at the leaves.
Example:
If a node has 50 samples, with 25 positive and 25 negative:
Gini = 1 - (0.5^2 + 0.5^2) = 0.5 (the maximum for two classes)
Entropy = -(0.5 * log2(0.5) + 0.5 * log2(0.5)) = 1 (the maximum for two classes)
The tree will try to split this node to reduce impurity.
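
The 25/25 example can be checked numerically. The sketch below implements the two formulas from this answer on raw class counts; the helper names `gini` and `entropy` are illustrative and are not scikit-learn functions.

```python
import math


def gini(counts):
    """Gini = 1 - sum(p_i^2), computed from per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy = -sum(p_i * log2(p_i)); empty classes contribute nothing."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


print(gini([25, 25]))     # 0.5
print(entropy([25, 25]))  # 1.0
```

Passing the counts of a pure node, for example `[50, 0]`, gives 0 for both measures.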

#Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.
Answer:
- Pre-Pruning:
  - Stops tree growth early by setting constraints (max_depth, min_samples_split).
  - Advantage: Cheaper to train, because branches that would later be discarded are never grown; it also limits overfitting from the start.
  - Example: Limit tree to depth=3.

- Post-Pruning:
  - Allows full tree growth, then trims unimportant branches.
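  - Advantage: Often generalizes better, because each branch is judged after the full tree is visible; pre-pruning can stop before a useful later split is found.
  - Example: CART cost-complexity pruning (ccp_alpha in scikit-learn) removes branches whose impurity reduction does not justify the extra leaves.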

Difference Summary:
Pre-pruning = “Don't let it grow too big.”
Post-pruning = “Grow fully, then cut extra.”
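
As a rough illustration of the two approaches in scikit-learn, the sketch below pre-prunes with `max_depth` and post-prunes with cost-complexity pruning via `ccp_alpha`. Picking the middle value of the pruning path is an arbitrary choice made only for this example; in practice the value would be selected by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: growth stops as soon as the depth limit is reached
pre_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity pruning path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Assumption: take the middle alpha purely for illustration
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(name, "| depth:", model.get_depth(), "| leaves:", model.get_n_leaves(),
          "| test accuracy:", model.score(X_test, y_test))
```

Larger `ccp_alpha` values prune more aggressively; a value of 0 leaves the tree unpruned.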

#Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer:
- Information Gain (IG) measures reduction in impurity after a split.
- Formula: IG = Entropy(Parent) - Σ (n_child / n_parent) * Entropy(Child), i.e. each child's entropy is weighted by its share of the parent's samples.

Importance:
- High IG means a feature separates the data well.
- With the entropy criterion, the tree picks the feature and threshold with the highest IG at each node.

Example:
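If the parent node has Entropy = 1.0 and the weighted average Entropy of the children after splitting on Petal Length is 0.3, then IG = 1.0 - 0.3 = 0.7.
→ This feature is chosen for the split, provided no other candidate split has a higher IG.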
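
The formula can be sketched in a few lines of Python. The class counts below are made up for illustration (they are not taken from the Iris data), and `information_gain` is a hypothetical helper, not a scikit-learn function.

```python
import math


def entropy(counts):
    """Entropy of a node given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, children_counts):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n_parent = sum(parent_counts)
    weighted_children = sum(
        (sum(child) / n_parent) * entropy(child) for child in children_counts
    )
    return entropy(parent_counts) - weighted_children


# Illustrative split: a 25/25 parent divided into two mostly pure children
ig = information_gain([25, 25], [[20, 5], [5, 20]])
print(round(ig, 3))  # 0.278

# A perfect split removes all uncertainty
print(information_gain([25, 25], [[25, 0], [0, 25]]))  # 1.0
```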

#Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?
Answer:

Applications:

Healthcare: Predicting diseases.
Finance: Credit risk scoring.
Retail: Customer segmentation.
Manufacturing: Defect detection.

Advantages:

Easy to interpret and visualize.
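Works with both categorical and numerical data (scikit-learn's implementation needs categorical features encoded as numbers first).
Requires minimal preprocessing (no feature scaling or normalization).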

Limitations:

Prone to overfitting.
Biased toward features with many levels.
Sensitive to small changes in data.

In [1]:
#Dataset Info:
#● Iris Dataset for classification tasks (sklearn.datasets.load_iris() or provided CSV).
#● California Housing Dataset for regression tasks (sklearn.datasets.fetch_california_housing()); used here in place of Boston Housing because load_boston() was removed in scikit-learn 1.2.
#Question 6: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Accuracy and feature importance
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)


Accuracy: 1.0
Feature Importances: [0.         0.01911002 0.89326355 0.08762643]


In [2]:
#Question 7: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.
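# Reuses the imports from the previous cell; reload and re-split Iris so the data is explicit
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fully grown tree vs. a tree limited to depth 3
clf_full = DecisionTreeClassifier(random_state=42)
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)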

clf_full.fit(X_train, y_train)
clf_pruned.fit(X_train, y_train)

print("Full Tree Accuracy:", accuracy_score(y_test, clf_full.predict(X_test)))
print("Pruned Tree Accuracy (max_depth=3):", accuracy_score(y_test, clf_pruned.predict(X_test)))

Full Tree Accuracy: 1.0
Pruned Tree Accuracy (max_depth=3): 1.0


In [3]:
#Question 8: Write a Python program to:
#● Load the California Housing dataset from sklearn
#● Train a Decision Tree Regressor
#● Print the Mean Squared Error (MSE) and feature importances
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

# Predictions
y_pred = reg.predict(X_test)

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)


Mean Squared Error: 0.5280096503174904
Feature Importances: [0.52345628 0.05213495 0.04941775 0.02497426 0.03220553 0.13901245
 0.08999238 0.08880639]


In [5]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Iris dataset (classification target)
iris = load_iris()
X, y = iris.data, iris.target   # y is discrete classes (0,1,2)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Define parameter grid
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10]
}

# GridSearchCV with accuracy scoring
grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the model
grid.fit(X_train, y_train)

# Best parameters and accuracy
print("Best Parameters:", grid.best_params_)
print("Best Cross-Validation Accuracy:", grid.best_score_)

# Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print("Test Accuracy with Best Parameters:", accuracy_score(y_test, y_pred))

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Best Cross-Validation Accuracy: 0.9428571428571428
Test Accuracy with Best Parameters: 1.0


#Question 10: Imagine you're working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
#● Handle the missing values
#● Encode the categorical features
#● Train a Decision Tree model
#● Tune its hyperparameters
#● Evaluate its performance
#And describe what business value this model could provide in the real-world setting.
Answer:

Step 1 - Handle Missing Values

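Impute numeric features with the mean or median (the median is more robust to outliers).
Impute categorical features with the mode, or add an explicit “Unknown” category.
Fit the imputers on the training split only, so no information leaks from the test data.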

Step 2 - Encode Categorical Features

OneHotEncoding for nominal data.
OrdinalEncoder for ordinal data (LabelEncoder is meant for target labels, not input features).

Step 3 - Train Decision Tree

Use DecisionTreeClassifier with Gini/Entropy.

Step 4 - Tune Hyperparameters

Use GridSearchCV to optimize max_depth, min_samples_split.

Step 5 - Evaluate Model

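Report Precision, Recall, F1-score, and ROC-AUC alongside Accuracy; when the disease is rare, accuracy alone can look high while most positive cases are missed.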
Perform Cross-Validation for robustness.
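
Steps 1 to 5 can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `smoker`), the labeling rule, and the missing-value pattern are synthetic stand-ins for the real patient dataset, which the question does not describe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (hypothetical columns and label rule)
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 60) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan

numeric = ["age", "blood_pressure"]
categorical = ["smoker"]

# Steps 1 and 2: impute, then encode, separately per column type
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

# Step 3: the tree sits at the end of the pipeline, so preprocessing is refit per CV fold
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

# Step 4: tune max_depth and min_samples_split
param_grid = {"tree__max_depth": [2, 3, 4, None], "tree__min_samples_split": [2, 10]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="f1")

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)
grid.fit(X_train, y_train)

# Step 5: evaluate on held-out data
auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:, 1])
print("Best Parameters:", grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
print("ROC-AUC:", auc)
```

Keeping imputation and encoding inside the pipeline means they are fitted only on the training portion of each cross-validation fold, which avoids leaking test information into preprocessing.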

Business Value:

Early disease prediction → preventive treatment.
Helps doctors prioritize high-risk patients.
Reduces costs and improves healthcare efficiency.