# Decision Tree Assignment 

1.  What is a Decision Tree, and how does it work in the context of classification?
-  A Decision Tree is a popular supervised machine learning algorithm used for both classification and regression tasks. In the context of classification, it is a predictive model that uses a tree-like structure of decisions to determine the class label of a given data point based on its features.


A decision tree is similar to a flowchart where:
-  Each internal node represents a decision based on a feature (e.g., "Temperature > 30°C?")
-  Each branch represents the outcome of that decision (Yes/No)
-  Each leaf node represents a final class label or output
-  The model makes predictions by traversing the tree from the root node to a leaf node, following the decisions at each branch
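
The traversal described above can be sketched in a few lines of Python. This is only an illustration: the tree is written by hand, and the feature names, thresholds, and labels are hypothetical rather than learned from data.

```python
# A hand-written toy tree (hypothetical features and thresholds, for illustration only).
# Internal nodes hold a feature name and a threshold; leaves hold a class label.
toy_tree = {
    "feature": "temperature",
    "threshold": 30.0,
    "yes": {"label": "stay_inside"},          # taken when temperature > 30
    "no": {                                    # taken when temperature <= 30
        "feature": "humidity",
        "threshold": 80.0,
        "yes": {"label": "stay_inside"},
        "no": {"label": "play_outside"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, following one branch per decision."""
    while "label" not in node:
        branch = "yes" if sample[node["feature"]] > node["threshold"] else "no"
        node = node[branch]
    return node["label"]


print(predict(toy_tree, {"temperature": 25.0, "humidity": 60.0}))  # play_outside
print(predict(toy_tree, {"temperature": 35.0, "humidity": 60.0}))  # stay_inside
```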


Building a decision tree involves the following steps:

1.  Select the Best Feature to Split:
-  The algorithm chooses the feature that best separates the data into distinct classes.


Selection is based on criteria such as:
-  Gini Impurity
-  Entropy / Information Gain

2.  Split the Dataset: Based on the selected feature, the dataset is split into subsets.

3.  Repeat the Process: The splitting process continues recursively for each subset until one of the following holds:
-  All samples in a node belong to the same class.
-  No more features are left to split.
-  A stopping condition (e.g., maximum depth) is reached.

4.  Label the Leaf Nodes: Each leaf node is assigned the class that occurs most frequently in that subset.

---

2.  Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


-  When constructing a Decision Tree, the algorithm must decide which feature and threshold to split on at each node. The goal is to choose the split that results in the purest possible child nodes — i.e., nodes where samples mostly belong to a single class.
-  To measure how pure or impure a node is, we use impurity measures. Two commonly used measures are Gini Impurity and Entropy; Information Gain is the reduction in entropy that a split achieves.
-  Gini impurity measures the probability of incorrectly classifying a randomly chosen element from the dataset if it was randomly labeled according to the class distribution in that node.
-  Formula:
$$
Gini = 1 - \sum_{i=1}^{C} p_i^2
$$

Where:

- \( C \) = number of classes  
- \( p_i \) = probability of an instance belonging to class \( i \)

Interpretation:

- \( Gini = 0 \): The node is **pure** (all samples belong to one class).  
- Higher \( Gini \) → Higher impurity.
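
For comparison, entropy is commonly defined over the same class probabilities. The formula below is the standard information-theoretic definition, added here for completeness (terms with \( p_i = 0 \) are taken as 0):

$$
Entropy = - \sum_{i=1}^{C} p_i \log_2 p_i
$$

- \( Entropy = 0 \): The node is **pure**.
- Entropy is largest, \( \log_2 C \), when all \( C \) classes are equally likely. For a two-class node split 50/50, \( Gini = 0.5 \) and \( Entropy = 1 \).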


- Both Gini and Entropy are used to evaluate how good a split is.
- The decision tree algorithm tries different features and thresholds and chooses the split whose child nodes have the lowest weighted impurity.
- Lower impurity means more homogeneous child nodes, which generally makes the class assigned to each node more reliable.
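
As a rough sketch of how a split is chosen, the code below computes both impurity measures and searches one numeric feature for the threshold with the lowest weighted impurity. The helper names and the toy data are made up for illustration, and trying midpoints between sorted values is one common strategy, not necessarily what any particular library does internally.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_i^2)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * measure(left) + len(right) / n * measure(right)


def best_threshold(values, labels, measure=gini):
    """Try midpoints between sorted distinct values; return (threshold, impurity)."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = weighted_impurity(left, right, measure)
        if score < best[1]:
            best = (threshold, score)
    return best


# Toy data (made up): one numeric feature and a binary label.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["a", "a", "a", "b", "b", "b"]

print(gini(species), entropy(species))         # 0.5 1.0
print(best_threshold(petal_length, species))   # (3.0, 0.0)
```

Passing `measure=entropy` to `best_threshold` selects the same threshold on this toy data, which reflects the observation that the two measures usually agree.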


Both measures generally yield similar trees, with minor practical differences:
- Gini Impurity is slightly cheaper to compute because it needs no logarithm; it is the default criterion in scikit-learn's `DecisionTreeClassifier`.
- Entropy is grounded in information theory but requires a logarithm, so it is slightly slower to compute.

---

3.  What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.


Introduction:
Decision trees are powerful machine learning models, but they are prone to overfitting if allowed to grow without constraints. Overfitting occurs when the tree becomes too complex and starts to memorize the training data rather than generalizing to unseen data. To prevent this, pruning techniques are used. Pruning reduces the size of the decision tree by removing branches or nodes that provide little or no predictive power. There are two main types of pruning: pre-pruning and post-pruning.

Pre-Pruning (Early Stopping):
Pre-pruning, also known as early stopping, is a technique where the growth of the decision tree is stopped early before it becomes too deep or complex. Instead of allowing the tree to fully expand, the algorithm uses certain stopping criteria during the training process to decide whether to continue splitting a node. Common conditions include:
- Maximum depth: limiting how deep the tree can grow.
- Minimum samples per node: stopping splits if a node contains fewer samples than a defined threshold.
- Minimum impurity decrease: stopping if further splits do not significantly reduce impurity.

Practical Advantage:
Pre-pruning helps reduce training time and computational cost because the tree does not grow unnecessarily large. It also reduces the risk of overfitting by controlling the model's complexity from the start.
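
A minimal sketch of pre-pruning with scikit-learn follows. The three stopping criteria listed above map onto the `max_depth`, `min_samples_split`, and `min_impurity_decrease` parameters; the specific values are illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree for reference.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: a node is not split once any limit is hit.
# The specific values are illustrative, not tuned.
pre_pruned = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples needed to split a node
    min_impurity_decrease=0.01,  # minimum impurity reduction for a split
    random_state=42,
).fit(X_train, y_train)

print("Full tree:  depth", full_tree.get_depth(), "leaves", full_tree.get_n_leaves())
print("Pre-pruned: depth", pre_pruned.get_depth(), "leaves", pre_pruned.get_n_leaves())
```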

Post-Pruning (Pruning After Tree Construction):
Post-pruning is a technique where the decision tree is first allowed to grow fully without any constraints. After the complete tree is built, branches that have little importance or do not improve model accuracy are removed. This is typically done by evaluating the performance of subtrees on a validation dataset and pruning those that lead to overfitting. Post-pruning can be done using techniques like cost complexity pruning or reduced error pruning.

Practical Advantage:
Post-pruning often results in a more accurate and generalized model. Since the pruning decisions are based on the actual performance of the tree, it allows the algorithm to retain useful splits while removing only those that do not contribute to predictive power.
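
Cost complexity pruning can be sketched with scikit-learn's `cost_complexity_pruning_path` and the `ccp_alpha` parameter. In this sketch the held-out 30% of Iris plays the role of the validation set, and breaking ties in favor of the larger alpha (the smaller tree) is a design choice made here, not a rule of the method.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Compute the pruning path of the fully grown tree:
# each alpha corresponds to a progressively smaller subtree.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(
    X_train, y_train
)

best_alpha, best_score = 0.0, -1.0
for alpha in path.ccp_alphas:
    # Clamp at zero in case floating-point rounding yields a tiny negative alpha.
    alpha = max(float(alpha), 0.0)
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)
    # ">=" prefers the larger alpha (smaller tree) when scores tie.
    if score >= best_score:
        best_alpha, best_score = alpha, score

print("Chosen ccp_alpha:", best_alpha, "validation accuracy:", best_score)
```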

Conclusion:
Both pre-pruning and post-pruning aim to prevent overfitting and improve the generalization ability of decision trees. Pre-pruning stops the tree from becoming too complex during training, saving time and computational resources. Post-pruning, on the other hand, refines the fully grown tree and often results in higher accuracy by removing unnecessary branches. The choice between them depends on the problem, dataset size, and computational constraints.

---

4.  What is Information Gain in Decision Trees, and why is it important for choosing the best split?


Introduction:
Information Gain is a key concept used in decision tree algorithms to decide which feature to split on at each node. It is based on the idea of entropy from information theory, which measures the amount of randomness or impurity in a dataset. The goal of a decision tree is to create nodes that are as pure as possible, meaning they contain samples from only one class. Information Gain helps in selecting the feature that results in the highest reduction in impurity after a split.

Definition:
Information Gain measures the reduction in entropy (or impurity) after splitting the dataset based on a particular feature. In other words, it quantifies how much information about the class label is gained by knowing the value of a specific feature.

Formula:
The formula for Information Gain is:

$$
IG(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \times Entropy(S_v)
$$

Where:
- \( S \): the original dataset
- \( A \): the attribute (feature) on which we split
- \( S_v \): the subset of \( S \) for which attribute \( A \) has value \( v \)
- \( |S_v| \): number of samples in subset \( S_v \)
- \( |S| \): total number of samples
- \( Entropy(S) \): impurity of the original dataset

Interpretation:
- A higher information gain means the split has produced purer child nodes.
- A lower information gain means the split has not significantly improved purity.
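
A small worked example may help make the formula concrete. The data below is made up: 14 samples (9 "yes", 5 "no") are split by a hypothetical two-valued attribute, and the function names are chosen for this sketch only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, subsets):
    """IG(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    n = len(parent)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - remainder


# Made-up example: 14 samples split by an attribute with values "weak" and "strong".
weak = ["yes"] * 6 + ["no"] * 2      # S_v for value "weak"
strong = ["yes"] * 3 + ["no"] * 3    # S_v for value "strong"
parent = weak + strong               # S: 9 "yes", 5 "no"

print(f"Entropy(S) = {entropy(parent):.3f}")                            # 0.940
print(f"IG(S, A)   = {information_gain(parent, [weak, strong]):.3f}")   # 0.048
```

The gain here is small because the "strong" subset is still an even mix of classes, so the split removes little uncertainty.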

Importance in Choosing the Best Split:
1. Feature Selection: Information Gain helps the decision tree algorithm choose the most informative feature for splitting at each node.
2. Improved Classification: By selecting the split with the highest information gain, the tree quickly reduces uncertainty and improves prediction accuracy.
3. Efficient Tree Growth: It ensures the tree grows in a way that maximizes information at each step, leading to smaller, more efficient, and more interpretable trees.
4. Focus on Informative Features: Splitting on the highest information gain favors attributes that actually separate the classes over irrelevant ones. It does not prevent overfitting by itself, so it is usually combined with pruning or depth limits.

Conclusion:
Information Gain is a crucial criterion for building decision trees because it measures how well a feature separates the data into different classes. By selecting splits that maximize information gain, the algorithm ensures that each decision made by the tree significantly reduces uncertainty, resulting in a more accurate and efficient classification model.


---

5.   What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?


Introduction:
Decision trees are widely used supervised machine learning algorithms that classify data by splitting it based on feature values. Due to their simplicity, interpretability, and ability to handle different types of data, they are used in many real-world applications across various industries.

Common Real-World Applications:

1. Medical Diagnosis:
   - Decision trees are used in healthcare to diagnose diseases based on patient symptoms, medical history, and test results.
   - Example: Predicting whether a patient has diabetes or heart disease.

2. Credit Scoring and Risk Assessment:
   - Financial institutions use decision trees to evaluate the creditworthiness of applicants by analyzing factors like income, age, debt, and payment history.
   - Example: Deciding whether to approve a loan application.

3. Fraud Detection:
   - Decision trees help identify fraudulent transactions by learning patterns from historical data.
   - Example: Detecting credit card fraud based on transaction behavior.

4. Customer Segmentation and Marketing:
   - Businesses use decision trees to segment customers and target marketing campaigns more effectively.
   - Example: Predicting which customers are likely to buy a product.

5. Manufacturing and Quality Control:
   - In manufacturing, decision trees help in identifying defective products and optimizing production processes.
   - Example: Predicting whether a product meets quality standards based on sensor data.

6. Recommendation Systems:
   - Decision trees can be used in recommendation engines to suggest products or services based on user preferences and behavior.
   - Example: Suggesting movies or products to customers on e-commerce platforms.

Main Advantages of Decision Trees:
1. Easy to Understand and Interpret:
   - The tree structure is simple and visually intuitive, making it easy to explain results to non-technical stakeholders.
2. Handles Different Data Types:
   - Can process both numerical and categorical features without requiring scaling.
3. Requires Little Data Preprocessing:
   - No need for normalization or standardization of data.
4. Works Well with Non-linear Relationships:
   - Can model complex decision boundaries without requiring a mathematical equation.
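
To illustrate the interpretability point, scikit-learn's `export_text` prints a fitted tree as readable rules. This is a minimal sketch; the shallow `max_depth=2` is chosen only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42).fit(iris.data, iris.target)

# export_text renders the learned rules as indented if/else-style text.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```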

Main Limitations of Decision Trees:
1. Overfitting:
   - Decision trees can become too complex and fit noise in the training data, reducing their performance on new data.
2. Instability:
   - Small changes in the data can lead to significantly different tree structures.
3. Bias Toward Features with Many Levels:
   - Features with many categories might dominate splits even if they are not the most informative.
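4. Limitations in Regression:
   - A regression tree predicts a constant value per leaf, so its output is piecewise constant and cannot extrapolate beyond the range of the training targets. Ensemble methods such as Random Forests reduce the coarseness of these predictions but still cannot extrapolate.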

Conclusion:
Decision trees are powerful and versatile tools with wide-ranging real-world applications, from healthcare and finance to marketing and manufacturing. Their ease of use and interpretability make them especially valuable for decision-making tasks. However, they must be used carefully to avoid overfitting and instability, and they are often improved when used as part of ensemble methods.


---

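Dataset Info:
-  Iris Dataset for classification tasks (`sklearn.datasets.load_iris()` or provided CSV).
-  Boston Housing Dataset for regression tasks (`sklearn.datasets.load_boston()` or provided CSV). Note that `load_boston` was removed in scikit-learn 1.2; question 8 below uses the California Housing dataset instead.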

---

6. Write a Python program to:
-  Load the Iris Dataset
-  Train a Decision Tree Classifier using the Gini criterion
-  Print the model’s accuracy and feature importances


(Include your Python code and output in the code box below)

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data        # Features
y = iris.target      # Labels

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = clf.predict(X_test)

# Step 5: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
feature_importances = clf.feature_importances_

# Output results
print("Model Accuracy:", accuracy)
print("Feature Importances:")
for feature, importance in zip(iris.feature_names, feature_importances):
    print(f"{feature}: {importance:.4f}")
```

Output:

```text
Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876
```


---

7.  Write a Python program to:
-  Load the Iris Dataset
-  Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


(Include your Python code and output in the code box below.)

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Decision Tree Classifier with max_depth=3
clf_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_limited.fit(X_train, y_train)
y_pred_limited = clf_limited.predict(X_test)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

# Step 4: Train a fully-grown Decision Tree Classifier (no max_depth)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Step 5: Print and compare accuracies
print(f"Accuracy of Decision Tree with max_depth=3: {accuracy_limited:.4f}")
print(f"Accuracy of Fully-Grown Decision Tree: {accuracy_full:.4f}")
```

Output:

```text
Accuracy of Decision Tree with max_depth=3: 1.0000
Accuracy of Fully-Grown Decision Tree: 1.0000
```
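
Both trees score 1.0 on this 45-sample test set, so a single split cannot tell them apart. One possible follow-up, sketched below as an optional extra rather than part of the assignment, is to compare the two settings with 5-fold cross-validation on the full dataset, which gives an estimate that depends less on one particular split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, model in models.items():
    # 5-fold cross-validated accuracy on the whole dataset.
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```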


---

8.  Write a Python program to:
-  Load the California Housing dataset from sklearn
-  Train a Decision Tree Regressor
-  Print the Mean Squared Error (MSE) and feature importances


(Include your Python code and output in the code box below)

```python
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 1: Load the California Housing dataset
california = fetch_california_housing()
X = california.data
y = california.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Step 4: Make predictions
y_pred = regressor.predict(X_test)

# Step 5: Evaluate the model
mse = mean_squared_error(y_test, y_pred)
feature_importances = regressor.feature_importances_

# Step 6: Print results
print("Mean Squared Error (MSE):", mse)
print("Feature Importances:")
for feature, importance in zip(california.feature_names, feature_importances):
    print(f"{feature}: {importance:.4f}")
```

Output:

```text
Mean Squared Error (MSE): 0.5280096503174904
Feature Importances:
MedInc: 0.5235
HouseAge: 0.0521
AveRooms: 0.0494
AveBedrms: 0.0250
Population: 0.0322
AveOccup: 0.1390
Latitude: 0.0900
Longitude: 0.0888
```


---

9.  Write a Python program to:
-  Load the Iris Dataset
-  Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
-  Print the best parameters and the resulting model accuracy


(Include your Python code and output in the code box below.)

```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 3: Define the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=42)

# Step 4: Define the parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 5, 10, 15]
}

# Step 5: Use GridSearchCV to find the best parameters
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Step 6: Get the best parameters and evaluate the model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Step 7: Print results
print("Best Parameters:", best_params)
print("Model Accuracy with Best Parameters:", accuracy)
```

Output:

```text
Best Parameters: {'max_depth': 4, 'min_samples_split': 10}
Model Accuracy with Best Parameters: 1.0
```


---

10.   Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

Also describe what business value this model could provide in a real-world setting.

Step 1: Handle Missing Values
- Identify missing data: Check each column for missing values using methods like `.isnull().sum()`.
- Decide on strategy:
  - Numerical features: Fill missing values using mean, median, or predictive imputation (e.g., KNN Imputer).
  - Categorical features: Fill missing values with the mode or create a special category like `"Unknown"`.
- Rationale: Ensures the model can learn effectively without bias from missing values.
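
The imputation strategies above might look like the following sketch. The patient table, its column names, and its values are invented for illustration; in a real project the imputer would normally be fitted on the training split only, so that statistics from the test data do not leak into training.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical patient records; column names and values are invented for illustration.
patients = pd.DataFrame({
    "age": [54, np.nan, 61, 47],
    "blood_pressure": [130.0, 118.0, np.nan, 125.0],
    "smoker": ["yes", "no", np.nan, "no"],
})

# Count missing values per column.
print(patients.isnull().sum())

numeric_cols = ["age", "blood_pressure"]

# Numerical features: fill with the column median.
num_imputer = SimpleImputer(strategy="median")
patients[numeric_cols] = num_imputer.fit_transform(patients[numeric_cols])

# Categorical feature: fill with an explicit "Unknown" category.
patients["smoker"] = patients["smoker"].fillna("Unknown")

print(patients)
```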


Step 2: Encode Categorical Features
- Identify categorical variables: For example, gender, blood type, patient region.
- Encoding methods:
  - Label (Ordinal) Encoding: Converts ordinal categories into integer codes that preserve their order.
  - One-Hot Encoding: Converts nominal categories into binary indicator columns.
- Rationale: scikit-learn's Decision Trees require numerical inputs to create splits on features.


Step 3: Train a Decision Tree Model
- Split data: Divide dataset into training and testing sets (e.g., 70%-30%).
- Initialize the model: Use `DecisionTreeClassifier()` from scikit-learn.
- Train the model: Fit the model using `.fit(X_train, y_train)`.
- Rationale: Decision Trees handle mixed data types and non-linear relationships effectively.


Step 4: Tune Hyperparameters
- Key hyperparameters:
  - `max_depth`: Maximum depth to prevent overfitting.
  - `min_samples_split` / `min_samples_leaf`: Minimum samples required for splitting or forming a leaf.
  - `criterion`: Splitting measure (`gini` or `entropy`).
- Tuning method: Use `GridSearchCV` or `RandomizedSearchCV` to test combinations of hyperparameters and select the best.
- Rationale: Optimized parameters improve generalization and interpretability.
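
The tuning step could be sketched as follows. Since the company's data is not available, scikit-learn's built-in breast cancer dataset is used as a stand-in (an assumption), and the grid values and the choice of ROC-AUC as the selection metric are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Illustrative grid over the hyperparameters listed above.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

# 5-fold cross-validation on the training split only; the test split stays untouched.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated ROC-AUC:", round(search.best_score_, 4))
```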


Step 5: Evaluate Model Performance
- Metrics to consider:
  - Accuracy
  - Precision and Recall (important to minimize false positives/negatives)
  - F1-Score (balance between precision and recall)
  - ROC-AUC (discrimination capability for binary outcomes)
- Validation: Evaluate on a separate test set or using cross-validation.
- Rationale: Checks whether the model is reliable enough to support clinical decisions, where missed cases (false negatives) are typically the most costly errors.
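
The metrics above can be computed with `sklearn.metrics`, as in this sketch. It again uses the built-in breast cancer dataset as a stand-in for real patient data (an assumption), and `max_depth=4` is an arbitrary illustrative setting.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): scikit-learn's breast cancer dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_score = clf.predict_proba(X_test)[:, 1]  # predicted probability of class 1

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1-score, plus overall accuracy.
print(classification_report(y_test, y_pred))

# ROC-AUC is computed from scores, not hard predictions.
auc = roc_auc_score(y_test, y_score)
print("ROC-AUC:", round(auc, 4))
```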


Step 6: Business Value
- Early disease detection: Identify high-risk patients proactively.
- Resource allocation: Prioritize tests and treatments for likely cases.
- Decision support: Assist doctors with an interpretable, data-driven model.
- Cost reduction: Reduce unnecessary tests and hospitalizations.
- Improved patient outcomes: Enable timely interventions and better care.


Summary
1. Handle missing values to clean the dataset.  
2. Encode categorical features for model compatibility.  
3. Train a Decision Tree on the training dataset.  
4. Tune hyperparameters using `GridSearchCV` for optimal performance.  
5. Evaluate the model with metrics like accuracy, precision, recall, and ROC-AUC.  
6. Deliver business value by supporting better clinical decisions, improving efficiency, and reducing costs.

This approach aims to produce a model that is accurate, interpretable, and clinically useful, all of which are critical in healthcare settings.

---