Q1. What is a Decision Tree, and how does it work in the context of
classification?


A Decision Tree in Machine Learning is a supervised learning algorithm that makes predictions by splitting data into branches based on decision rules derived from the input features. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

How it works in the context of classification:

* Root Node: The process begins with the entire dataset at the root node.

* Attribute Selection: The algorithm selects the "best" attribute to split the data based on a criterion like Gini impurity or information gain. This attribute is chosen because it best separates the data into distinct classes.

* Splitting: The root node is then split into sub-nodes (branches) based on the different values or thresholds of the selected attribute. Each branch represents a possible outcome of the test on that attribute.

* Recursive Partitioning: This splitting process is recursively applied to each sub-node, creating further branches and internal nodes. The goal is to create increasingly "pure" subsets of data, meaning subsets where most instances belong to the same class.

* Leaf Nodes: The process continues until a stopping criterion is met, such as reaching a maximum tree depth, having a minimum number of samples in a node, or when a node becomes sufficiently pure (contains instances of predominantly one class). The final nodes in the tree are called leaf nodes, and each leaf node represents a class label.

* Classification: To classify a new, unseen data point, it traverses the tree from the root node, following the branches corresponding to its attribute values until it reaches a leaf node. The class label associated with that leaf node is the predicted class for the data point.
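
The traversal described above can be made concrete with a small sketch. It assumes scikit-learn is available and uses the Iris data purely as an illustration; `max_depth=2` and all variable names are choices made for this example. `export_text` prints the learned rules, and `decision_path` lists the nodes a single sample visits on its way to a leaf.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X, y = iris.data, iris.target

# A shallow tree keeps the printed rules short (max_depth=2 is an illustrative choice).
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Human-readable if/else rules, one line per node.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

# Follow the first sample from the root down to its leaf.
sample = X[:1]
node_ids = tree.decision_path(sample).indices  # ids of the nodes visited, root first
leaf_id = tree.apply(sample)[0]                # id of the leaf that was reached
predicted = iris.target_names[tree.predict(sample)[0]]
print("visited nodes:", list(node_ids), "-> leaf", leaf_id, "->", predicted)
```

The last line shows the path as a list of node ids, which is the same route you would follow by reading the printed rules from top to bottom.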

Q2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?


Gini Impurity and Entropy both measure the "impurity" (class mixing) of a node in a decision tree. Both are 0 for a pure node and largest when the classes are evenly mixed. Gini Impurity is the probability of misclassifying a randomly chosen sample if it were labeled according to the node's class distribution, while Entropy measures the uncertainty of that distribution in bits. Gini is slightly cheaper to compute because it avoids logarithms. Decision trees evaluate candidate splits with the chosen criterion and select the split that gives the greatest reduction in weighted impurity (called information gain when the criterion is entropy).

* Gini Impurity

Concept: The probability that a randomly selected element would be incorrectly classified if it were labeled at random according to the class distribution in the node.

Calculation: Gini = 1 - ∑ pᵢ², where pᵢ is the proportion of samples in the node that belong to class i. A node is pure (impurity 0) when all its data points belong to the same class.

Effect on Splits: Gini impurity tends to favor splits that isolate the most frequent class in its own branch, and it is slightly faster to evaluate than entropy.

* Entropy

Concept: A measure of the uncertainty or disorder in a node, reflecting the variety of class labels. 

Calculation: Entropy = -∑ pᵢ log₂(pᵢ), where pᵢ is the proportion of samples in the node that belong to class i. A pure node has entropy 0.

Effect on Splits: Entropy penalizes mixed nodes somewhat more heavily than Gini and can produce slightly more balanced splits. In practice the two criteria often yield very similar trees.
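
To make the two measures concrete, here is a minimal sketch in plain Python. The function names `gini` and `entropy` and the example distributions are chosen for illustration only; each function takes the class proportions of one node.

```python
import math

def gini(probs):
    """Gini impurity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits: sum(-p_i * log2(p_i)), skipping p_i = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# A pure node, a mostly pure node, and an evenly mixed node (two classes).
for probs in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(probs, "gini:", round(gini(probs), 3), "entropy:", round(entropy(probs), 3))
```

Both measures are 0 for the pure node and reach their two-class maximum (0.5 for Gini, 1 bit for entropy) for the evenly mixed node.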

* Impact on Decision Tree Splits

Objective: At each node, the decision tree evaluates different features and split points to find the one that minimizes impurity. 

Selection Process: The tree calculates the impurity of the resulting child nodes for each potential split. 

Information Gain: The best split is the one that leads to the largest reduction in weighted impurity from parent to children. When entropy is the criterion, this reduction is called information gain.

Criteria Choice: You can choose to use either "gini" or "entropy" as the splitting criterion in your decision tree model, with Gini often preferred for its faster computation. 
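
The selection process can be sketched for a single numeric feature. This is a simplified illustration, not how any particular library implements it: `best_threshold`, `gini_of`, and the toy data are hypothetical, and real implementations repeat this search over every feature and use faster bookkeeping.

```python
from collections import Counter

def gini_of(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted child impurity) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Child impurities are weighted by the share of samples in each child.
        weighted = (len(left) * gini_of(left) + len(right) * gini_of(right)) / n
        if weighted < best[1]:
            best = ((lo + hi) / 2, weighted)
    return best

# Hypothetical toy data: one numeric feature, two classes.
values = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5]
labels = ["a", "a", "a", "b", "b", "b"]
threshold, impurity = best_threshold(values, labels)
print("best threshold:", threshold, "weighted impurity:", impurity)
```

On this toy data the midpoint between 2.0 and 3.5 separates the classes perfectly, so both children are pure.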


Q3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.

Pre-pruning halts a decision tree's growth during training by setting stopping criteria, while post-pruning removes unnecessary branches from a fully grown tree after it has been built.

* Pre-Pruning (Early Stopping)

What it is: This method prevents the decision tree from growing too large by setting criteria (such as maximum depth or minimum samples per leaf) that stop the splitting process early, during tree construction.

Practical Advantage: It's faster and more efficient, especially for large datasets, because it doesn't require building a massive, fully grown tree only to trim it later. 
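
A minimal sketch of pre-pruning with scikit-learn, assuming the Iris data as a stand-in. The limits `max_depth=3` and `min_samples_leaf=5` are illustrative values, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# No limits: the tree grows until every leaf is pure.
unrestricted = DecisionTreeClassifier(random_state=0).fit(X, y)

# Pre-pruned: growth stops at depth 3 or when a leaf would hold fewer than 5 samples.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

print("unrestricted:", unrestricted.get_depth(), "levels,", unrestricted.get_n_leaves(), "leaves")
print("pre-pruned:  ", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
```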

* Post-Pruning (Backward Pruning)

What it is: After the decision tree is fully grown, unnecessary subtrees or branches are removed to simplify the model. 

Practical Advantage: It often generalizes better, because each branch is judged after the full tree has been grown. Techniques such as cost-complexity pruning combined with cross-validation check whether removing a branch improves performance on unseen data, whereas pre-pruning may stop before a useful later split is found.
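
One way to post-prune in scikit-learn is minimal cost-complexity pruning through the `ccp_alpha` parameter. The sketch below assumes the Iris data and picks the alpha with the best 5-fold cross-validation score; that selection rule is a simple choice for illustration, not the only option.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Effective alphas at which subtrees of the fully grown tree would be pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# Drop the last alpha (it prunes down to the root) and guard against tiny negatives.
alphas = [max(float(a), 0.0) for a in path.ccp_alphas[:-1]]

# Score each candidate alpha with 5-fold cross-validation.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in alphas
]
best_alpha = alphas[scores.index(max(scores))]

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print("chosen ccp_alpha:", best_alpha, "leaves:", pruned.get_n_leaves())
```

Larger `ccp_alpha` values remove more of the tree, so the list of alphas effectively enumerates a sequence of progressively smaller subtrees to choose from.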

Q4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?

Information Gain measures the reduction in entropy (uncertainty) of the target variable after a feature is used to split the data, indicating how much information that feature provides about the class labels. It's important because it guides decision tree algorithms to select the best feature for splitting at each node, leading to more homogeneous child nodes and a more accurate, efficient, and predictive model. 

* Why it's Important for Choosing the Best Split

Maximize Purity: The goal of a decision tree is to create nodes that are as pure as possible, meaning they contain instances of only one class. Information Gain helps achieve this by quantifying which feature, when used for splitting, results in the most significant reduction in impurity. 

Feature Importance: Features with higher Information Gain are more informative and thus more useful for classification. By choosing the feature with the highest Information Gain at each step, the algorithm prioritizes the most relevant features for making decisions, leading to a more effective tree structure. 

Build Predictive Accuracy: Choosing high-gain splits produces clearer, better separated groups of data, which supports accurate predictions. Accuracy on unseen data still depends on controlling tree size, because Information Gain by itself does not prevent overfitting.
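
A small worked example may help here. It computes Information Gain as the parent's entropy minus the size-weighted entropy of the children. The labels and the two candidate splits are hypothetical, and the helper names are chosen for this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no" labels, split in two different ways.
parent = ["yes"] * 5 + ["no"] * 5
split_a = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]          # fairly clean split
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]  # barely changes the mix

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print("gain A:", round(gain_a, 3))
print("gain B:", round(gain_b, 3))
```

Split A has the higher gain because its children are much closer to pure, so a tree using this criterion would choose A.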

Q5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?

Decision Trees are used in various fields, including healthcare for diagnosis, finance for risk assessment and fraud detection, marketing for customer segmentation and churn prediction, and education to identify at-risk students. 

* Real-World Applications

Healthcare: Used for disease diagnosis by analyzing patient symptoms and data, and for determining personalized treatment plans. 

Finance: Applied for loan application approval, credit scoring, and detecting fraudulent transactions. 

Marketing: Employed to segment customers, predict customer behavior like churn, and personalize marketing campaigns. 

Business: Aids in strategic planning, resource allocation, and inventory management by forecasting sales. 

Education: Helps predict student performance (pass/fail) based on factors like attendance and grades to provide targeted support. 

Manufacturing: Optimizes production processes by analyzing efficiency factors and predicting trends. 

* Advantages

Interpretability: Decision trees are easy to understand, visualize, and explain to non-technical audiences due to their tree-like structure. 

Minimal Data Preparation: They require less data preprocessing and cleaning compared to other algorithms, making them quicker to implement. 

Handle Mixed Data Types: Decision trees can in principle handle both numerical and categorical data. Some implementations, including scikit-learn's, still require categorical features to be encoded as numbers first.

Versatility: They can be used for both classification (e.g., diagnosing a disease) and regression (e.g., predicting house prices). 

Feature Importance: The structure of the tree can indicate which features are most important in making a prediction, aiding in feature selection. 

* Limitations 

Prone to Overfitting: Decision trees can become too complex, leading to overfitting where they perform well on training data but poorly on new data.

Unstable to Data Changes: Small changes in the data or the introduction of noise can lead to very different tree structures and predictions.

Non-Continuous Predictions: For regression tasks, decision trees produce piecewise constant predictions, not smooth, continuous ones.

Computational Complexity: For large datasets, building and interpreting decision trees can be computationally expensive and complex.

Greedy Approach: The algorithm makes locally optimal decisions at each step, which may not lead to a globally optimal solution.
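
The piecewise constant behavior mentioned above is easy to observe. The sketch below fits a shallow regression tree to a synthetic smooth target (y = sin(x), an assumption made only for this illustration) and counts how many distinct values the model can output.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic smooth target, used only for illustration.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel()

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Predict on a dense grid: the model can only output one value per leaf.
grid = np.linspace(0, 6, 1000).reshape(-1, 1)
distinct = np.unique(reg.predict(grid))
print("leaves:", reg.get_n_leaves(), "distinct predictions:", len(distinct))
```

Even though the target is a smooth curve, the number of distinct predictions cannot exceed the number of leaves, which is why the fitted function looks like a staircase.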


Q6. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier using the Gini criterion

● Print the model’s accuracy and feature importances

In [6]:
import pandas as pd  # used below to build the feature table

In [5]:
from sklearn.datasets import load_iris

data = load_iris()
data

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        ...
(output truncated)

In [8]:
df = pd.DataFrame(data.data, columns=data.feature_names)
df

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
...                ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3


In [13]:
X = df
y = data.target

In [14]:
# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

In [19]:
# Train a Decision Tree Classifier using Gini criterion
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='gini')
clf

clf.fit(X_train, y_train)

In [20]:
# Prediction on test data set
y_pred = clf.predict(X_test)

In [26]:
# Accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("The Model Accuracy is", accuracy)

The Model Accuracy is 0.9555555555555556
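
The question also asks for the feature importances, which the cells above do not print. A minimal self-contained sketch follows. It repeats the same split as above (`test_size=0.3`, `random_state=1`); setting `random_state=1` on the classifier is an added assumption so that reruns build the same tree, which means its numbers may differ slightly from the run shown above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# random_state on the classifier is an assumption added for reproducibility.
clf = DecisionTreeClassifier(criterion="gini", random_state=1).fit(X_train, y_train)

# Importances are normalized impurity reductions and sum to 1.
print("Feature Importances:")
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```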


Q7. Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [27]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [46]:
data = load_iris()
X = data.data
y = data.target

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [48]:
# Train a fully grown Decision Tree (no depth limit)
clf_full = DecisionTreeClassifier(random_state=42)
clf_full

clf_full.fit(X_train, y_train)

y_pred_full = clf_full.predict(X_test)

accuracy_full = accuracy_score(y_test, y_pred_full)
accuracy_full

0.9333333333333333

In [49]:
# Train a Decision Tree with max_depth=3
clf_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
clf_pruned.fit(X_train, y_train)

y_pred_pruned = clf_pruned.predict(X_test)

accuracy_pruned = accuracy_score(y_test, y_pred_pruned)
accuracy_pruned

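0.9777777777777777

On this split, the depth-limited tree scores higher on the test set (about 0.978 versus 0.933 for the fully grown tree), which suggests that the fully grown tree overfits slightly.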

Q8. Write a Python program to:

● Load the California Housing dataset from sklearn

● Train a Decision Tree Regressor

● Print the Mean Squared Error (MSE) and feature importances

In [50]:
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [None]:
# Load the California Housing dataset
data = fetch_california_housing(as_frame=True)

X = data.data
y = data.target

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# Train a Decision Tree Regressor
reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

In [57]:
# Predict on test set
y_pred = reg.predict(X_test)

In [None]:
# Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
mse

0.5280096503174904
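
An MSE is easier to interpret after taking its square root, which puts the error back in the units of the target. In this dataset the target is the median house value expressed in units of $100,000, so the short sketch below converts the MSE reported above into an approximate dollar figure.

```python
import math

mse = 0.5280096503174904  # value reported by the cell above
rmse = math.sqrt(mse)

# The target is the median house value in units of $100,000.
print(f"RMSE: {rmse:.3f} (about ${rmse * 100_000:,.0f})")
```

The root mean squared error is roughly 0.73, so a typical prediction is off by something on the order of $73,000.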

In [60]:
# Print feature importances
print("\nFeature Importances:")
for feature, importance in zip(data.feature_names, reg.feature_importances_):
    print(f"{feature}: {importance:.3f}")


Feature Importances:
MedInc: 0.523
HouseAge: 0.052
AveRooms: 0.049
AveBedrms: 0.025
Population: 0.032
AveOccup: 0.139
Latitude: 0.090
Longitude: 0.089


Q9. Write a Python program to:

● Load the Iris Dataset

● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

● Print the best parameters and the resulting model accuracy

In [62]:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# Define the Decision Tree and hyperparameter grid
dt = DecisionTreeClassifier(random_state=42)

param_grid = {
    "max_depth": [2, 3, 4, 5, None],         # Possible depths
    "min_samples_split": [2, 3, 4, 5, 10]    # Minimum samples required to split a node
}

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid,
                           cv=5, scoring="accuracy", n_jobs=-1)

grid_search.fit(X_train, y_train)

# Best parameters and model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print(f"Model Accuracy on Test Set: {accuracy:.3f}")


Best Parameters: {'max_depth': 3, 'min_samples_split': 2}
Model Accuracy on Test Set: 0.978


Q10. Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.

Explain the step-by-step process you would follow to:

● Handle the missing values

● Encode the categorical features

● Train a Decision Tree model

● Tune its hyperparameters

● Evaluate its performance

And describe what business value this model could provide in the real-world
setting.


Step-by-Step Process

Handle Missing Values:

Identify Missing Data: Understand the extent and nature of missing data by examining the dataset.

Choose an Imputation Strategy:

For numerical features: Use methods like mean imputation, median imputation, or predictive imputation (e.g., using another machine learning model to predict the missing value). 

For categorical features: Impute with the most frequent category (mode) or create a new category for missing values. 

Consider Deletion: If a feature is missing for a large share of patients, consider dropping that column. If only a few rows have missing values and they appear to be missing at random, dropping those rows is an option. Either way, be cautious, as deletion can discard valuable information.

Encode Categorical Features:

Understand Categorical Features: Identify features with text or non-numeric values (e.g., 'gender', 'symptoms'). 

Choose an Encoding Method:

One-Hot Encoding: For nominal categorical features (no inherent order), create binary columns for each category. 

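Label (Ordinal) Encoding: For ordinal categorical features (with a clear order), assign integers that follow that order. In scikit-learn this is done with OrdinalEncoder; LabelEncoder is intended for the target column.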

Binary Encoding: For categories with many unique values, this method is more efficient than one-hot encoding. 
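
The imputation and encoding steps above could be combined along these lines. This is a sketch under stated assumptions: the patient table, its column names, and the category order are all made up, and median / most-frequent imputation is only one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical patient records; every column name and value is invented.
df = pd.DataFrame({
    "age": [34, np.nan, 51, 47],
    "blood_type": ["A", "O", np.nan, "B"],
    "pain_level": ["low", "high", "medium", np.nan],
})
# Object dtype keeps np.nan as the missing-value marker for the text columns.
df = df.astype({"blood_type": object, "pain_level": object})

preprocess = ColumnTransformer(
    [
        # Numeric column: fill gaps with the median.
        ("num", SimpleImputer(strategy="median"), ["age"]),
        # Nominal column: fill with the mode, then one-hot encode.
        ("nominal", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OneHotEncoder(handle_unknown="ignore")),
        ]), ["blood_type"]),
        # Ordinal column: fill with the mode, then map to ordered integers.
        ("ordinal", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("encode", OrdinalEncoder(categories=[["low", "medium", "high"]])),
        ]), ["pain_level"]),
    ],
    sparse_threshold=0,  # always return a dense array for easy printing
)

encoded = preprocess.fit_transform(df)
print(encoded.shape)
print(encoded)
```

In a real project the transformer would be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the imputation values.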

Train a Decision Tree Model:

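Split the Data: Divide the dataset into training and testing sets so that generalization to new data can be measured. To avoid data leakage, fit the imputers and encoders on the training set only and then apply them to the test set.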

Instantiate the Model: Create a Decision Tree classifier using a library like scikit-learn. 

Fit the Model: Train the Decision Tree on the training data to learn the patterns that differentiate diseased from healthy patients. 

Tune its Hyperparameters:

Define Hyperparameters: Identify key Decision Tree parameters that can be adjusted, such as max_depth, min_samples_split, and min_samples_leaf. 

Use Cross-Validation: Employ k-fold cross-validation on the training data to compare hyperparameter combinations, so that the test set stays untouched until the final evaluation.

Apply Tuning Techniques:

GridSearchCV: Systematically try every combination of hyperparameters from a specified grid. 

RandomizedSearchCV: Randomly sample hyperparameter combinations, which can be more efficient for a large search space. 

Evaluate its Performance:

Use Appropriate Metrics: Evaluate the tuned model's performance on the unseen test data. 

Common Metrics for Classification:

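Accuracy: Overall percentage of correct predictions. It can be misleading when the disease is rare, because always predicting "healthy" already scores high.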

Precision: Of all patients predicted to have the disease, what proportion actually have it. 

Recall (Sensitivity): Of all patients who actually have the disease, what proportion were correctly identified. 

F1-Score: The harmonic mean of precision and recall, balancing both. 

ROC-AUC: The area under the ROC curve, which measures how well the model ranks diseased patients above healthy ones across all decision thresholds.
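
The whole workflow above can be sketched end to end as follows. Everything about the data is an assumption: the table is synthetic, the column names and the rule that generates the label are invented, and the parameter grid and the `f1` scoring choice are illustrative. Putting preprocessing inside a `Pipeline` means the imputers and encoder are refitted on each training fold, which avoids leakage during tuning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient data; columns and the label rule are invented.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "bmi": rng.normal(27, 4, n),
    "smoker": rng.choice(["yes", "no"], n).astype(object),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int).to_numpy()
df.loc[rng.choice(n, 40, replace=False), "bmi"] = np.nan  # inject missing values

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, random_state=0, stratify=y
)

# Impute and encode inside the pipeline so each CV fold is preprocessed separately.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "bmi"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Tune tree size with 5-fold cross-validation on the training data only.
param_grid = {
    "tree__max_depth": [2, 3, 5, None],
    "tree__min_samples_leaf": [1, 5, 10],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# Final evaluation on the held-out test set.
y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("best params:", search.best_params_)
print(classification_report(y_test, y_pred))
print("ROC-AUC:", round(auc, 3))
```

Because the synthetic label follows a simple rule, the scores here will look better than anything to expect on real clinical data; the point of the sketch is the structure of the workflow, not the numbers.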

Business Value

This predictive model can provide significant business value to a healthcare company by: 

Early Diagnosis: Enabling earlier identification of diseases, leading to more timely interventions and better patient outcomes.

Personalized Treatment: Assisting in tailoring treatment plans based on a patient's specific risk factors and disease presentation.