# Decision Tree | Assignment

1. What is a Decision Tree, and how does it work in the context of
   classification?
   - A Decision Tree is a supervised learning algorithm used for classification (and also regression). It works by splitting the data into branches based on feature values, forming a tree-like structure.

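   - In classification, the tree starts at a root node and applies a simple yes/no test to one feature (for example, whether a value is at or below a threshold). Based on the answer, the data point moves down a branch to the next node. This continues until it reaches a leaf node, which gives the final class label, typically the majority class of the training samples that ended up in that leaf.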

   - In simple words, a decision tree makes decisions step by step, just like a flowchart, to classify data into different categories.
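To make the flowchart idea concrete, here is a minimal sketch that trains a shallow tree on the Iris dataset and prints its rules as text. The depth limit of 2 is an assumption chosen only to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2 is an illustrative choice) keeps the rules readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data, iris.target)

# export_text renders each internal node as a threshold test on one feature
# and each leaf as the predicted class.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the printed output is one yes/no test, and every path from the top to a `class:` line corresponds to one route through the flowchart.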

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
   How do they impact the splits in a Decision Tree?
   - Gini Impurity and Entropy are measures used in decision trees to check how impure a node is (how mixed the classes are).

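   - Gini Impurity measures the probability of misclassifying a randomly chosen data point from the node if it were labeled at random according to the node's class distribution. It equals 1 minus the sum of the squared class proportions. A Gini value of 0 means the node is pure (only one class), and lower values are better.

   - Entropy measures the level of disorder or uncertainty in a node. It is 0 when all data belongs to one class and reaches its maximum when the classes are evenly mixed (1 for two classes when using log base 2).

   - Impact on splits: Decision Trees choose the split that reduces Gini Impurity or Entropy the most, with each child node weighted by its share of the samples. This makes the child nodes purer, and better splits create clearer class separation.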
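The two measures can be computed by hand in a few lines. This is a small sketch using NumPy; the helper names `gini` and `entropy` and the toy label lists are illustrative, not part of any library.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Entropy in bits: sum of p * log2(1 / p) over the classes present."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

pure = ["a", "a", "a", "a"]    # only one class
mixed = ["a", "a", "b", "b"]   # two classes, evenly mixed

print(gini(pure), entropy(pure))     # 0.0 0.0
print(gini(mixed), entropy(mixed))   # 0.5 1.0
```

Both measures are 0 for the pure node and reach their two-class maximum for the evenly mixed node, which is why a split that produces purer children scores better under either criterion.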

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
   Trees? Give one practical advantage of using each.
   - Pre-Pruning stops the tree from growing before it is fully developed by setting limits such as maximum depth or minimum samples per node.
   - Advantage: It reduces overfitting and saves training time, because the extra branches are never built.

   - Post-Pruning allows the tree to grow fully and then cuts back branches that contribute little.
   - Advantage: It often gives better accuracy on unseen data, because each pruning decision is made after seeing the full tree rather than by guessing a limit in advance.
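Both strategies can be sketched with scikit-learn. Pre-pruning uses growth limits such as `max_depth` and `min_samples_leaf`. For post-pruning, this sketch uses scikit-learn's cost-complexity pruning (`ccp_alpha`), which is one form of post-pruning and not necessarily the only one meant above. The specific limit values and the choice of the middle alpha are assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree first, then prune it back.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# The pruning path lists the alpha values at which subtrees get removed.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: take the middle alpha from the path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

print("Leaves (full):       ", full_tree.get_n_leaves())
print("Leaves (pre-pruned): ", pre_pruned.get_n_leaves())
print("Leaves (post-pruned):", post_pruned.get_n_leaves())
```

Comparing leaf counts shows how much each strategy simplifies the tree relative to the fully grown one.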

4. What is Information Gain in Decision Trees, and why is it important for
   choosing the best split?
   - Information Gain is a measure used in decision trees to decide the best feature for splitting the data.

   - It shows how much entropy (uncertainty) is reduced after a split. A higher information gain means the split creates purer child nodes.

   - Importance: At each node, the Decision Tree compares the candidate splits and chooses the one with the highest Information Gain, because it separates the classes most clearly and leads to better classification.
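Information Gain can be worked out by hand as the parent's entropy minus the size-weighted entropy of the children. The sketch below compares two candidate splits of the same toy node; the helper names and the label arrays are made up for illustration.

```python
import numpy as np

def entropy(labels):
    """Entropy in bits of the class labels in one node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_child_entropy = (
        len(left) / n * entropy(left) + len(right) / n * entropy(right)
    )
    return entropy(parent) - weighted_child_entropy

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Split A separates the classes perfectly.
gain_a = information_gain(parent, parent[:4], parent[4:])

# Split B leaves one sample of the other class in each child.
left_b = np.array([0, 0, 0, 1])
right_b = np.array([0, 1, 1, 1])
gain_b = information_gain(parent, left_b, right_b)

print("Gain of split A:", gain_a)  # 1.0
print("Gain of split B:", gain_b)
```

Split A removes all uncertainty, so its gain equals the parent's full entropy, while split B leaves mixed children and gains much less. A tree using the entropy criterion would pick split A.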

5. What are some common real-world applications of Decision Trees, and
   what are their main advantages and limitations?
   - Real-world applications of Decision Trees:
     - Decision Trees are used in credit risk analysis (loan approval), medical diagnosis (disease detection), customer segmentation in marketing, fraud detection, and spam email classification.

   - Advantages:
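     - They are easy to understand and interpret, can handle both numerical and categorical data (although some libraries, such as scikit-learn, require categorical features to be encoded as numbers first), and need little other preprocessing such as feature scaling.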

   - Limitations:
     - Decision Trees can overfit the data, are sensitive to small changes in data, and may give lower accuracy compared to ensemble models.


In [11]:
#6 Write a Python program to:
# Load the Iris Dataset
# Train a Decision Tree Classifier using the Gini criterion
# Print the model’s accuracy and feature importances

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(criterion="gini", random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

print("Feature Importances:")
for feature, importance in zip(iris.feature_names, model.feature_importances_):
    print(feature, ":", importance)


Model Accuracy: 1.0
Feature Importances:
sepal length (cm) : 0.0
sepal width (cm) : 0.016670139612419255
petal length (cm) : 0.9061433868879218
petal width (cm) : 0.07718647349965893


In [12]:
#7 Write a Python program to:
#  Load the Iris Dataset
#  Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

dt_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_limited.fit(X_train, y_train)
y_pred_limited = dt_limited.predict(X_test)
acc_limited = accuracy_score(y_test, y_pred_limited)

dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)
y_pred_full = dt_full.predict(X_test)
acc_full = accuracy_score(y_test, y_pred_full)

print("Accuracy with max_depth=3:", acc_limited)
print("Accuracy with fully-grown tree:", acc_full)


Accuracy with max_depth=3: 1.0
Accuracy with fully-grown tree: 1.0


In [13]:
#8 Write a Python program to:
# Load the Boston Housing Dataset
# Train a Decision Tree Regressor
# Print the Mean Squared Error (MSE) and feature importances

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

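# Load the dataset from a local CSV file (assumes "boston.csv" is in the
# working directory and contains the "MEDV" target column).
df = pd.read_csv("boston.csv")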

X = df.drop("MEDV", axis=1)
y = df["MEDV"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

print("\nFeature Importances:")
for feature, importance in zip(X.columns, model.feature_importances_):
    print(feature, ":", importance)



Mean Squared Error (MSE): 10.416078431372549

Feature Importances:
CRIM : 0.05129567385985361
ZN : 0.003352705854613196
INDUS : 0.005816191711420081
CHAS : 2.279406506977855e-06
NOX : 0.027148378971777364
RM : 0.6003262563803439
AGE : 0.01361706300057042
DIS : 0.07068816216312718
RAD : 0.0019406229702577159
TAX : 0.012463865318667847
PTRATIO : 0.011011608907585924
B : 0.00900872741695864
LSTAT : 0.1933284640383171


In [14]:
#9 Write a Python program to:
#  Load the Iris Dataset
#  Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV
#  Print the best parameters and the resulting model accuracy

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dt = DecisionTreeClassifier(random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_split": [2, 5, 10]
}

grid = GridSearchCV(dt, param_grid, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_

y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


10. Imagine you’re working as a data scientist for a healthcare company that
     wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.
     Explain the step-by-step process you would follow to:
      - Handle the missing values
      - Encode the categorical features
      - Train a Decision Tree model
      - Tune its hyperparameters
      - Evaluate its performance
      - And describe what business value this model could provide in the
        real-world setting.

   - As a data scientist, I would follow these steps:

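     - First, I would handle missing values by filling numerical columns with the mean or median and categorical columns with the most frequent value. These fill values should be computed from the training data only, so that no information from the test set leaks into the model. If a column has too many missing values and is not useful, I may remove it.

     - Next, I would encode categorical features so the model can use them. For nominal categories (like gender or city), I would use One-Hot Encoding. For ordinal categories (like disease stage), I would use Ordinal Encoding with the category order specified explicitly, so that the numbers reflect the real ordering rather than an arbitrary one.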

     - Then, I would train a Decision Tree model by splitting the data into training and testing sets and fitting a Decision Tree classifier on the training data.

     - After that, I would tune hyperparameters such as max_depth, min_samples_split, and min_samples_leaf using GridSearchCV to reduce overfitting and improve accuracy.

     - To evaluate the model, I would check accuracy, precision, recall, F1-score, and the confusion matrix. I would pay particular attention to recall, because in healthcare a false negative (a sick patient predicted as healthy) is very costly.

     - Business value:
       - This model can help doctors identify high-risk patients early, support faster diagnosis, reduce human error, and improve patient outcomes. It also helps hospitals optimize resources and lower treatment costs by enabling early intervention.
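The steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the column names (`age`, `blood_pressure`, `city`, `stage`), the synthetic values, and the rule that generates the target are invented so the example runs on its own. Scoring the grid search by recall is also an illustrative choice that follows from the concern about false negatives.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (hypothetical columns and values).
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "city": pd.Series(rng.choice(["A", "B", "C"], n), dtype=object),
    "stage": pd.Series(rng.choice(["early", "middle", "late"], n), dtype=object),
})
# Made-up rule for the target, used only so the model has something to learn.
y = ((df["age"] > 60) | (df["stage"] == "late")).astype(int)

# Inject missing values into one numerical and one categorical column.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 40, replace=False), "city"] = np.nan

numeric = ["age", "blood_pressure"]
nominal = ["city"]
ordinal = ["stage"]

# Imputation and encoding live inside the pipeline, so they are fitted on
# the training folds only and never see the held-out data.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("nom", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), nominal),
    ("ord", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder(categories=[["early", "middle", "late"]])),
    ]), ordinal),
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "tree__max_depth": [2, 3, 4, 5, None],
    "tree__min_samples_split": [2, 5, 10],
    "tree__min_samples_leaf": [1, 5, 10],
}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

y_pred = grid.predict(X_test)
test_recall = recall_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Best Parameters:", grid.best_params_)
print("Test recall:", test_recall)
print(cm)
print(classification_report(y_test, y_pred))
```

Keeping the preprocessing inside the pipeline means the same fitted imputers and encoders are applied automatically at prediction time, which makes the model easier to deploy on new patient records.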