## **DECISION TREE ASSIGNMENT**

### Question 1: What is a Decision Tree, and how does it work in the context of classification?

A **Decision Tree** is a supervised machine learning model that is structured like a flowchart.  
- **Nodes** represent conditions or questions on features.  
- **Branches** represent the outcomes of those conditions.  
- **Leaf nodes** represent the final prediction or class label.  

In the context of **classification**:  
- The dataset is split step by step based on the feature values that best separate the classes.  
- At each node, the tree asks a simple question (e.g., *Is age > 30?*).  
- Depending on the answer, the data follows the corresponding branch.  
- This continues until a leaf node is reached, which gives the predicted class.  

Decision trees are popular because they are easy to interpret and closely resemble human decision-making.  
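
The traversal described above can be sketched in a few lines of plain Python. This is only an illustration: the tree, its features (`age`, `income`), and its thresholds are hypothetical and hand-written, not learned from data.

```python
# A hand-written two-level tree; features and thresholds are hypothetical.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": "no"},                   # taken when age <= 30
    "right": {                                 # taken when age > 30
        "feature": "income", "threshold": 50_000,
        "left": {"label": "no"},               # income <= 50,000
        "right": {"label": "yes"},             # income > 50,000
    },
}

def predict(node, sample):
    """Follow branches from the root until a leaf (a node with a 'label') is reached."""
    while "label" not in node:
        goes_right = sample[node["feature"]] > node["threshold"]
        node = node["right"] if goes_right else node["left"]
    return node["label"]

print(predict(tree, {"age": 42, "income": 65_000}))  # yes
print(predict(tree, {"age": 25, "income": 90_000}))  # no
```

Each internal node asks one question about one feature, and the answer decides which branch the sample follows.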




### Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures.  
### How do they impact the splits in a Decision Tree?

**Gini Impurity** and **Entropy** are two popular measures used to check how "pure" or "impure" a node is in a decision tree.

- **Gini Impurity**  
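  - Measures the probability that a randomly chosen sample would be misclassified if it were labeled at random according to the class distribution in the node.  
  - Formula: `Gini = 1 - Σ(pᵢ²)`, where `pᵢ` is the proportion of samples in the node that belong to class *i*.  
  - Value is **0** when all samples belong to one class (pure) and highest when the classes are evenly mixed (0.5 for two classes).  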

- **Entropy**  
  - Measures the amount of uncertainty or disorder in the data.  
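  - Formula: `Entropy = - Σ(pᵢ * log₂(pᵢ))`, where `pᵢ` is the proportion of samples in the node that belong to class *i*.  
  - Value is **0** for a pure node and highest when the classes are evenly mixed (1 for two classes).  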

**Impact on splits:**  
- Both Gini and Entropy guide the tree to choose the **best split**.  
- The algorithm calculates impurity for each candidate split and selects the one that reduces impurity the most, that is, the split with the highest Information Gain (for Entropy) or the largest decrease in weighted Gini impurity.  
- This ensures that the tree keeps dividing data into groups that are as pure as possible.  

In short, these measures help the tree grow in a way that improves classification accuracy.  
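
As a minimal sketch, both formulas can be computed directly from a list of class labels. The helper names (`class_proportions`, `gini`, `entropy`) are chosen here for illustration and are not part of any library.

```python
import math
from collections import Counter

def class_proportions(labels):
    """Return the proportion p_i of each class present in the node."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels):
    """Gini = 1 - sum(p_i^2)."""
    return 1.0 - sum(p ** 2 for p in class_proportions(labels))

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)); only classes that occur are counted."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))

pure = ["A"] * 10
mixed = ["A"] * 5 + ["B"] * 5

print(gini(pure), entropy(pure))    # both 0 for a pure node
print(gini(mixed), entropy(mixed))  # 0.5 and 1.0 for an even two-class mix
```

Both measures agree on which nodes are pure; they differ only in how strongly they penalize mixed nodes.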


### Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees?  
### Give one practical advantage of using each.

**Pre-Pruning (Early Stopping):**  
- In pre-pruning, the growth of the tree is restricted before it becomes too complex.  
- This can be done by setting limits such as maximum depth, minimum samples per leaf, or minimum information gain.  
- By stopping early, the tree avoids creating branches that do not add much value.  
- **Practical Advantage:** It reduces training time and computational cost, which is very helpful when dealing with large datasets.  
- **Example:** In a loan approval system, if splitting further on a feature like “favorite color” does not improve accuracy, the tree will stop there, saving time and avoiding useless rules.  

**Post-Pruning (Pruning after Full Growth):**  
- In post-pruning, the tree is allowed to grow fully first and then unnecessary branches are removed.  
- This is typically done with a validation set or cross-validation to identify branches that hurt performance on unseen data, for example through cost-complexity pruning.  
- The final pruned tree becomes simpler and more general.  
- **Practical Advantage:** It improves prediction accuracy on unseen data by reducing overfitting, making the model more reliable in real-world scenarios.  
- **Example:** In a medical diagnosis model, a fully grown tree might create a very specific rule like *“If age = 37 and blood pressure = 122, then Disease X”*. Post-pruning can remove such overfitted branches, keeping only the general patterns that work for most patients.  

Overall, pre-pruning saves time and resources during training, while post-pruning improves generalization by removing unnecessary complexity.  
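
The two approaches might look like this in scikit-learn. This is a sketch under assumptions: the Iris data and the specific limits (`max_depth=3`, `min_samples_leaf=5`) are arbitrary choices for illustration, and the held-out split is used as a validation set for picking `ccp_alpha`, so a separate test set would still be needed for a final estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Pre-pruning: limit growth before training starts.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre.fit(X_train, y_train)

# Post-pruning: grow a full tree, then try the cost-complexity alphas it suggests.
full = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)
alphas = [max(0.0, float(a)) for a in path.ccp_alphas]  # guard against tiny negatives
candidates = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=42).fit(X_train, y_train)
    for a in alphas
]
# Iterate from the most pruned tree so ties go to the simpler model.
post = max(reversed(candidates), key=lambda tree: tree.score(X_val, y_val))

for name, model in [("pre-pruned", pre), ("full", full), ("post-pruned", post)]:
    print(f"{name}: leaves={model.get_n_leaves()}, "
          f"validation accuracy={model.score(X_val, y_val):.3f}")
```

Pre-pruning fits a single small tree, while post-pruning fits one tree per candidate alpha, which is where the extra training cost comes from.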


### Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

**Information Gain (IG):**  
- Information Gain is a measure that tells us how much "uncertainty" or impurity is reduced after splitting the dataset on a feature.  
- It is calculated as the difference between the impurity of the parent node and the weighted impurity of the child nodes.  
- Formula:  
  `IG = Impurity(parent) - [weighted average of Impurity(children)]`  

**Why it is important:**  
- At each step, a decision tree has to decide **which feature to split on**.  
- Information Gain helps select the feature that creates the **purest child nodes**, meaning the data points inside each node are more similar (belong to the same class).  
- The higher the Information Gain, the better the feature is for splitting.  

**Example:**  
Suppose we have 10 samples:  
- 6 belong to Class A, and 4 belong to Class B.  
- Parent Entropy = `-[(6/10) log₂(6/10) + (4/10) log₂(4/10)] ≈ 0.97`  

Now, if we split based on a feature:  
- Left child: 4 samples (all Class A) → Entropy = 0 (pure).  
- Right child: 6 samples (2 Class A, 4 Class B) → Entropy ≈ 0.92.  
- Weighted Child Entropy = `(4/10 * 0) + (6/10 * 0.92) ≈ 0.55`  

So,  
`IG = 0.97 - 0.55 = 0.42`  

This shows that the split reduces impurity by **0.42**, meaning it’s a good choice for the tree.
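
The arithmetic in this example can be checked with a short script. It is a sketch that works from class counts; the `entropy` helper is defined here for illustration only.

```python
import math

def entropy(counts):
    """Entropy of a node given its class counts; empty classes contribute 0."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [6, 4]   # 6 Class A, 4 Class B
left = [4, 0]     # 4 Class A
right = [2, 4]    # 2 Class A, 4 Class B

n = sum(parent)
weighted_children = (sum(left) / n) * entropy(left) + (sum(right) / n) * entropy(right)
info_gain = entropy(parent) - weighted_children

print(round(entropy(parent), 2))     # 0.97
print(round(weighted_children, 2))   # 0.55
print(round(info_gain, 2))           # 0.42
```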


### Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

**Real-world applications:**  
- **Medical Diagnosis:** Used to predict diseases based on symptoms and patient history.  
- **Finance:** Helps in credit scoring and loan approval decisions.  
- **Marketing:** Used for customer segmentation and predicting buying behavior.  
- **Fraud Detection:** Identifies unusual patterns in transactions.  
- **Manufacturing:** Assists in quality control and defect detection.  

**Main Advantages:**  
- Easy to understand and interpret, even for non-technical users.  
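- Handles both numerical and categorical data in principle, although some implementations (including scikit-learn's) require categorical features to be encoded as numbers first.  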
- Requires little data preprocessing compared to other algorithms.  
- Works well for small to medium datasets.  

**Main Limitations:**  
- Prone to overfitting, especially with deep trees.  
- Can be unstable, as small changes in data may lead to a completely different tree.  
- Greedy splitting may not always result in the most optimal tree.  
- A single tree is often less accurate than ensemble methods such as Random Forests, particularly on large or noisy datasets.  


### Question 6: Write a Python program to:
- Load the Iris Dataset  
- Train a Decision Tree Classifier using the Gini criterion  
- Print the model’s accuracy and feature importances  


In [3]:
# Importing required libraries
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Decision Tree Classifier with Gini criterion
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion="gini", random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Model accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Feature importances
print("\nFeature Importances:")
for feature, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


### Question 7: Write a Python program to:
- Load the Iris Dataset  
- Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree  


In [4]:
# Load dataset
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree with max_depth=3
from sklearn.tree import DecisionTreeClassifier
clf_depth3 = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_depth3.fit(X_train, y_train)

# Predictions and accuracy for depth=3
y_pred_depth3 = clf_depth3.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

# Train a fully-grown Decision Tree
clf_full = DecisionTreeClassifier(random_state=42)
clf_full.fit(X_train, y_train)

# Predictions and accuracy for full tree
y_pred_full = clf_full.predict(X_test)
accuracy_full = accuracy_score(y_test, y_pred_full)

# Print results
print("Accuracy with max_depth=3:", accuracy_depth3)
print("Accuracy with fully-grown tree:", accuracy_full)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0
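
Both trees score 1.0 on this particular 45-sample test split, so a single split cannot tell them apart. One way to get a more stable comparison is cross-validation; the sketch below assumes 5 folds, which is an arbitrary but common choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "max_depth=3": DecisionTreeClassifier(max_depth=3, random_state=42),
    "fully grown": DecisionTreeClassifier(random_state=42),
}

cv_means = {}
for name, model in models.items():
    # Accuracy on each of 5 held-out folds
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the shallower tree is usually the safer choice because it is simpler and easier to interpret.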


### Question 8: Write a Python program to:
- Load the Boston Housing Dataset  
- Train a Decision Tree Regressor  
- Print the Mean Squared Error (MSE) and feature importances  


In [7]:
# Load Boston Housing dataset from OpenML
# (sklearn.datasets.load_boston was removed in scikit-learn 1.2)
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=True)

# Features and target
X = boston.data
y = boston.target

# Split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train Decision Tree Regressor
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predictions
y_pred = regressor.predict(X_test)

# Mean Squared Error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Feature Importances
print("\nFeature Importances:")
for feature, importance in zip(X.columns, regressor.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Mean Squared Error (MSE): 11.588026315789474

Feature Importances:
CRIM: 0.0585
ZN: 0.0010
INDUS: 0.0099
CHAS: 0.0003
NOX: 0.0071
RM: 0.5758
AGE: 0.0072
DIS: 0.1096
RAD: 0.0016
TAX: 0.0022
PTRATIO: 0.0250
B: 0.0119
LSTAT: 0.1900


### Question 9: Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
- Print the best parameters and the resulting model accuracy


In [8]:
# Load Iris dataset
from sklearn.datasets import load_iris
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Decision Tree Classifier + GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 4, 6, 8, 10]
}

# Grid search with cross-validation
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy"
)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

# Model Accuracy
from sklearn.metrics import accuracy_score
y_pred = grid_search.best_estimator_.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 6}
Model Accuracy: 1.0


### **Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to:**
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance

Also describe what business value this model could provide in a real-world setting.


### Answer: As a data scientist for a healthcare company, I would approach this problem step by step:

**Step 1: Handle Missing Values**

- First, I’d explore the dataset to see where and how much data is missing.

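- For numerical features, I could use mean or median imputation; if the missing values appear related to other features rather than occurring at random, I’d try a more advanced method such as KNN imputation. Imputation statistics should be learned from the training data only, to avoid leaking information from the test set.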

- For categorical features, I’d use the mode (most frequent category) or a separate category like "Unknown".

- If a feature has too many missing values (say >50%), I’d consider dropping it after checking its importance.
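
A minimal sketch of this step with pandas, using a tiny made-up table (the column names and values are hypothetical). For brevity it fills values on the whole frame; in a real project the median would be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with gaps in both column types
df = pd.DataFrame({
    "age": [34, np.nan, 58, 45],
    "blood_type": ["A", "O", np.nan, "O"],
})

# Explore: share of missing values per column
missing_share = df.isna().mean()
print(missing_share)

# Numerical feature: median imputation
df["age"] = df["age"].fillna(df["age"].median())

# Categorical feature: a separate "Unknown" category
df["blood_type"] = df["blood_type"].fillna("Unknown")

print(df)
```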

**Step 2: Encode Categorical Features**

- Decision Trees don’t need feature scaling, but scikit-learn’s implementation requires numeric input, so categorical features must be encoded first.

- I’d use One-Hot Encoding for nominal categories (like blood type) and Ordinal Encoding if categories have an order (like severity levels: mild → moderate → severe).

- Tools such as `pandas.get_dummies()` or `sklearn.preprocessing.OneHotEncoder` would be handy here.
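
The two encodings could be sketched as follows. The data is made up, and the ordinal step uses a plain dictionary with `Series.map` as a simple stand-in for an ordinal encoder, so the order of the severity levels is stated explicitly.

```python
import pandas as pd

# Hypothetical records: one nominal and one ordered categorical feature
df = pd.DataFrame({
    "blood_type": ["A", "O", "B", "O"],
    "severity": ["mild", "severe", "moderate", "mild"],
})

# Ordinal encoding: the mapping preserves mild < moderate < severe
severity_order = {"mild": 0, "moderate": 1, "severe": 2}
df["severity"] = df["severity"].map(severity_order)

# One-hot encoding: one indicator column per blood type
encoded = pd.get_dummies(df, columns=["blood_type"])
print(encoded)
```

One-hot encoding avoids implying an order between blood types, which an integer code would do.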

**Step 3: Train a Decision Tree Model**

- Split the dataset into train and test sets (e.g., 70:30).

- Train a DecisionTreeClassifier from scikit-learn with basic parameters first.

- Since trees can easily overfit, I’d keep an eye on settings such as `max_depth` and `min_samples_split`.

**Step 4: Tune Hyperparameters**

- Use GridSearchCV or RandomizedSearchCV to find the best combination of:

  - `max_depth` (to prevent overfitting)

  - `min_samples_split` and `min_samples_leaf` (to control branching)

  - `criterion` (Gini vs Entropy)

- Perform cross-validation to make sure results are stable across folds.
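
Steps 1 to 4 can be combined into one scikit-learn pipeline so that imputation and encoding are re-fitted inside each cross-validation fold. The sketch below is an illustration on synthetic data: the columns (`age`, `sex`), the label rule, the grid values, and the choice of recall as the tuning metric are all assumptions made here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data (hypothetical columns and label rule)
rng = np.random.default_rng(0)
n = 300
age = rng.integers(20, 80, n).astype(float)
sex = rng.choice(["F", "M"], n)
y = (age + rng.normal(0, 10, n) > 55).astype(int)
age[rng.random(n) < 0.1] = np.nan          # knock out about 10% of ages
X = pd.DataFrame({"age": age, "sex": sex})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Preprocessing is part of the model, so it only ever sees training folds
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
    "tree__criterion": ["gini", "entropy"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated recall:", round(search.best_score_, 3))
```

Keeping preprocessing inside the pipeline is a design choice that prevents test-fold statistics from leaking into the imputer.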

**Step 5: Evaluate Performance**

- Since it’s a disease prediction problem, accuracy alone isn’t enough.

- I’d look at precision, recall, F1-score, and ROC-AUC to evaluate.

- Recall is especially important here, since missing a true disease case is costlier than a false alarm.
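
These metrics can be computed with `sklearn.metrics`. The labels and predicted probabilities below are made up for illustration; the second threshold shows how recall can be raised at the cost of precision.

```python
from sklearn.metrics import (
    confusion_matrix, f1_score, precision_score, recall_score, roc_auc_score
)

# Hypothetical ground truth (1 = disease) and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1]

# Default decision threshold of 0.5
y_pred = [int(p >= 0.5) for p in y_prob]
print(confusion_matrix(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))  # 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 0.75
print("F1:       ", f1_score(y_true, y_pred))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))    # 0.875

# Lower threshold: catches every true case but raises more false alarms
y_pred_low = [int(p >= 0.3) for p in y_prob]
recall_low = recall_score(y_true, y_pred_low)
precision_low = precision_score(y_true, y_pred_low)
print("recall at 0.3:", recall_low, "precision at 0.3:", round(precision_low, 3))
```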

**Business Value**

- The model can help doctors screen patients faster, flagging high-risk individuals who may need urgent tests.

- It can reduce misdiagnosis by acting as a decision support tool.

- Saves costs for the hospital by focusing resources on the most at-risk patients.

- For patients, it improves early detection, which often means better treatment outcomes and reduced severity of illness.
