Question 1: What is a Decision Tree, and how does it work in the context of classification?

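Ans:- A Decision Tree is a flowchart-like model: each internal node tests a feature value (for example, whether a measurement is below a threshold), each branch corresponds to an outcome of that test, and each leaf assigns a class label. The tree is built by recursively choosing the split that most reduces impurity in the child nodes; a new sample is classified by following the tests from the root down to a leaf.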
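
To make the flowchart idea concrete, here is a minimal sketch of how prediction walks a tree from root to leaf. The tree structure, feature names, and thresholds are hand-written and hypothetical (loosely modelled on iris petal measurements), not learned from data.

```python
# A hand-written (hypothetical) tree: internal nodes test one feature against
# a threshold, leaves carry a class label.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}


def predict(node, sample):
    """Follow the branches from the root until a leaf is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


sample = {"petal_length": 4.8, "petal_width": 1.4}
print(predict(tree, sample))  # petal_length > 2.5, petal_width <= 1.7 -> versicolor
```

A trained model does the same traversal; the difference is that the features and thresholds are chosen from the training data rather than written by hand.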

Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

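Ans:- Gini Impurity and Entropy both measure how mixed the classes are in a node. With p_k the fraction of samples in class k, Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)); both are 0 for a pure node and largest when the classes are evenly mixed. A Decision Tree evaluates candidate splits and picks the one that gives the lowest sample-weighted impurity in the child nodes, i.e., the purest children.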
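
As a rough illustration, both measures can be computed directly from a node's class labels. The helper names below (`class_proportions`, `gini`, `entropy`) are made up for this sketch and are not part of any library.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Fraction of samples belonging to each class in a node."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: -sum of p * log2(p) over the classes present."""
    return sum(-p * log2(p) for p in class_proportions(labels))


pure_node = ["a", "a", "a", "a"]
mixed_node = ["a", "a", "b", "b"]

print(gini(pure_node), entropy(pure_node))    # both 0.0 for a pure node
print(gini(mixed_node), entropy(mixed_node))  # 0.5 and 1.0 for a 50/50 node
```

The two measures rank most splits the same way; Gini avoids the logarithm, which makes it slightly cheaper to compute.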

Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Ans:- Pre-Pruning: Stops tree growth early (e.g., by limiting the maximum depth or requiring a minimum number of samples per split). Advantage: Reduces the risk of overfitting and cuts training time, since the full tree is never built.

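Post-Pruning: Grows the full tree first, then trims back branches that add little predictive value (e.g., cost-complexity pruning). Advantage: Produces a simpler, more generalizable model without discarding splits that only prove useful deeper in the tree.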
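
The two strategies might look like this in scikit-learn. This is a sketch, not a recommended recipe: the specific limits (`max_depth=3`, `min_samples_leaf=5`) and the choice to pick `ccp_alpha` by 5-fold cross-validation are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Pre-pruning: constrain growth up front with depth and leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow the full tree, then use cost-complexity pruning.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
# Guard against tiny negative alphas caused by floating-point error.
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]

# Pick the alpha with the best cross-validated accuracy on the training set.
cv_scores = [
    cross_val_score(
        DecisionTreeClassifier(ccp_alpha=a, random_state=42), X_train, y_train, cv=5
    ).mean()
    for a in alphas
]
best_alpha = alphas[cv_scores.index(max(cv_scores))]
post = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
post.fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(
        f"{name}: nodes={model.tree_.node_count}, "
        f"depth={model.get_depth()}, test accuracy={model.score(X_test, y_test):.4f}"
    )
```

On a dataset as small and clean as Iris the accuracy differences are likely to be minor; the node counts show the effect of pruning more clearly.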

Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

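Ans:- Information Gain measures the reduction in impurity (entropy) achieved by a split: the entropy of the parent node minus the sample-weighted average entropy of the child nodes. At each node, the tree picks the feature and threshold with the highest Information Gain, since that split produces the purest, most informative child nodes.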
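
A small worked example may help. The sketch below computes Information Gain for two candidate splits of the same parent node; the function names and the toy labels are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of the class labels in a node."""
    total = len(labels)
    return sum(-(c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the sample-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]
# A split whose children are just as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes", "yes", "yes", "no", "no", "no"]]

print(information_gain(parent, perfect_split))  # 1.0 bit: all uncertainty removed
print(information_gain(parent, useless_split))  # 0.0 bits: nothing learned
```

The tree-building algorithm runs this comparison over every candidate feature and threshold and keeps the split with the largest gain.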

Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Ans:- Decision Trees are used in areas like medical diagnosis, credit risk assessment, customer segmentation, and fraud detection.

Advantages: Easy to understand, handle both numerical and categorical data, require little data preprocessing.

Limitations: Prone to overfitting, unstable (a small change in the data can produce a very different tree), and usually less accurate on complex patterns than ensemble methods such as Random Forests or gradient boosting.

Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance and describe what business value this model could provide in the real-world setting.

Ans:-
- **Handle missing values:**  
  - **Assess:** Quantify missingness per feature.  
  - **Impute:** Use median for numeric, most frequent for categorical; consider domain-informed imputation.  
  - **Flag:** Add **missingness indicators** if patterns may be informative.

- **Encode categorical features:**  
  - **Low-cardinality:** **One-hot encoding**.  
  - **High-cardinality:** **Target encoding** (with CV to reduce leakage) or **ordinal encoding** if natural order exists.

- **Train a Decision Tree model:**  
  - **Split data:** Train/validation/test with stratification.  
  - **Fit:** DecisionTreeClassifier on preprocessed features, with imputers and encoders fitted on the training split only to avoid leakage; set **class_weight='balanced'** if classes are imbalanced.

- **Tune hyperparameters:**  
  - **Parameters:** **max_depth**, **min_samples_split**, **min_samples_leaf**, **max_features**, **criterion** (gini/entropy).  
  - **Search:** Use **GridSearchCV** or **RandomizedSearchCV** with stratified CV; optimize for **ROC-AUC** or **F1**.

- **Evaluate performance and business value:**  
  - **Metrics:** ROC-AUC, precision/recall, F1, confusion matrix; calibrate probabilities if needed.  
  - **Business value:** Earlier risk identification, prioritized screenings, reduced costs via targeted interventions, transparent decisions to support clinicians, and measurable lift in detection rates with acceptable false-positive trade-offs.
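
The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, the column names (`age`, `bmi`, `smoker`, `region`) are hypothetical, and only the one-hot route is shown (target encoding for high-cardinality features is left out). Treat it as a starting template rather than a finished solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and labels).
rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "bmi": rng.normal(27, 5, n),
    "smoker": rng.choice(["yes", "no"], n),
    "region": rng.choice(["north", "south", "east", "west"], n),
})
risk = (
    0.04 * (df["age"] - 50)
    + 0.1 * (df["bmi"] - 27)
    + (df["smoker"] == "yes").astype(float)
)
y = (risk + rng.normal(0, 0.5, n) > 1.5).astype(int)  # minority positive class

# Knock out roughly 10% of two columns to mimic missing values.
df.loc[rng.random(n) < 0.1, "bmi"] = np.nan
df.loc[rng.random(n) < 0.1, "smoker"] = np.nan

numeric = ["age", "bmi"]
categorical = ["smoker", "region"]

# Median imputation plus missingness flags for numeric columns;
# most-frequent imputation plus one-hot encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

# Split first so that imputers and encoders are fitted on training data only.
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "tree__max_depth": [3, 5, 8, None],
    "tree__min_samples_leaf": [1, 5, 20],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    model,
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_train, y_train)

# Evaluate once on the held-out test set.
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print("Best params:", search.best_params_)
print(f"Test ROC-AUC: {test_auc:.3f}")
print(classification_report(y_test, search.predict(X_test)))
```

Wrapping preprocessing and the model in one `Pipeline` means each cross-validation fold refits the imputers and encoders on its own training portion, which keeps information from the validation fold out of the preprocessing step.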

In [1]:
#Question 6: Write a Python program to:
#● Load the Iris Dataset
#● Train a Decision Tree Classifier using the Gini criterion
#● Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
dtc = DecisionTreeClassifier(criterion='gini', random_state=42)
dtc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtc.predict(X_test)

# Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, dtc.feature_importances_):
    print(f"  {feature}: {importance:.4f}")

Model Accuracy: 1.0000

Feature Importances:
  sepal length (cm): 0.0000
  sepal width (cm): 0.0191
  petal length (cm): 0.8933
  petal width (cm): 0.0876


In [2]:
# Question 7: Write a Python program to:
# ● Load the Iris Dataset
# ● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Classifier with max_depth=3
dtc_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dtc_limited.fit(X_train, y_train)
y_pred_limited = dtc_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)
print(f"Accuracy with max_depth=3: {accuracy_limited:.4f}")

# Train a fully-grown Decision Tree Classifier (no max_depth limit)
dtc_full = DecisionTreeClassifier(random_state=42)
dtc_full.fit(X_train, y_train)
y_pred_full = dtc_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)
print(f"Accuracy of fully-grown tree: {accuracy_full:.4f}")

Accuracy with max_depth=3: 1.0000
Accuracy of fully-grown tree: 1.0000


In [5]:
#Question 8: Write a Python program to:
# ● Load the Boston Housing Dataset
# ● Train a Decision Tree Regressor
# ● Print the Mean Squared Error (MSE) and feature importances

from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing Dataset as an alternative to Boston Housing
# (load_boston was removed from scikit-learn in version 1.2)
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Decision Tree Regressor
dtr = DecisionTreeRegressor(random_state=42)
dtr.fit(X_train, y_train)

# Make predictions on the test set
y_pred = dtr.predict(X_test)

# Print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(housing.feature_names, dtr.feature_importances_):
    print(f"  {feature}: {importance:.4f}")

Mean Squared Error (MSE): 0.5280

Feature Importances:
  MedInc: 0.5235
  HouseAge: 0.0521
  AveRooms: 0.0494
  AveBedrms: 0.0250
  Population: 0.0322
  AveOccup: 0.1390
  Latitude: 0.0900
  Longitude: 0.0888


In [6]:
#Question 9: Write a Python program to:
#● Load the Iris Dataset
#● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#● Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris Dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define the parameter grid for GridSearchCV
param_grid = {
    'max_depth': [2, 3, 4, 5, None], # None means no limit on depth
    'min_samples_split': [2, 5, 10]
}

# Initialize a Decision Tree Classifier
dtc = DecisionTreeClassifier(random_state=42)

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=dtc, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters
print(f"Best Parameters: {grid_search.best_params_}")

# Get the best estimator (model) from GridSearchCV
best_dtc = grid_search.best_estimator_

# Make predictions on the test set using the best model
y_pred_best = best_dtc.predict(X_test)

# Print the accuracy of the best model
accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy of the best model: {accuracy_best:.4f}")

Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Accuracy of the best model: 1.0000
