# Assignment | Decision Tree

Question 1:  What is a Decision Tree, and how does it work in the context of classification?

Answer-1. A Decision Tree is a supervised learning algorithm used for classification and regression tasks. It works by recursively partitioning the data into smaller subsets based on feature values.

How it Works
1. Root Node: The algorithm starts with a root node representing the entire dataset.
2. Splitting: The algorithm splits the data into subsets based on feature values, creating child nodes.
3. Recursive Partitioning: The process is repeated recursively for each child node until a stopping criterion is met.
4. Leaf Nodes: The final nodes, called leaf nodes, represent the predicted class labels.

Classification
1. Feature Selection: The algorithm selects the most informative feature to split the data.
2. Splitting Criteria: The algorithm uses a splitting criterion, such as Gini impurity or entropy, to determine the best split.
3. Class Prediction: The predicted class label is determined by the majority class in the leaf node.
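
The traversal from root node to leaf node described above can be sketched in a few lines of plain Python. This is only an illustration: the tree below is hand-built, and its feature names, thresholds, and stored labels are hypothetical rather than learned from data.

```python
from collections import Counter


def majority_class(labels):
    """Return the most common label among the samples stored in a leaf."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical hand-built tree: internal nodes compare one feature with a
# threshold; leaves keep the training labels that ended up in them.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"labels": ["setosa", "setosa", "setosa"]},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"labels": ["versicolor", "versicolor", "virginica"]},
        "right": {"labels": ["virginica", "virginica"]},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, then predict the leaf's majority class."""
    while "labels" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return majority_class(node["labels"])


print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(tree, {"petal_length": 4.5, "petal_width": 1.4}))  # versicolor
```

The second sample lands in a leaf holding two versicolor labels and one virginica label, so the majority vote returns versicolor.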


Question 2: Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?

Answer-2. Gini Impurity
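1. Definition: Gini Impurity measures the probability of misclassifying a randomly chosen instance from a node if it were labeled at random according to the node's class distribution.
2. Calculation: Gini Impurity is calculated as 1 - ∑(p_i^2), where p_i is the proportion of class i in the node.
3. Interpretation: Lower Gini Impurity values indicate a purer node. The value is 0 for a node containing a single class and reaches its maximum of 1 - 1/k when k classes are equally represented.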

Entropy
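1. Definition: Entropy measures the amount of uncertainty or randomness in the class labels of a node.
2. Calculation: Entropy is calculated as -∑(p_i * log2(p_i)), where p_i is the proportion of class i in the node (classes with p_i = 0 contribute nothing).
3. Interpretation: Lower Entropy values indicate a purer node. The value is 0 for a node containing a single class and reaches its maximum of log2(k) when k classes are equally represented.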
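
Both formulas are easy to check by hand. The sketch below implements them directly from the definitions above; the three class distributions are made-up examples chosen to show a pure node, a perfectly mixed node, and a skewed node.

```python
from math import log2


def gini(proportions):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)


def entropy(proportions):
    """Entropy in bits; classes with zero proportion are skipped."""
    return sum(-p * log2(p) for p in proportions if p > 0)


# Hypothetical two-class distributions.
pure = [1.0, 0.0]
balanced = [0.5, 0.5]
skewed = [0.9, 0.1]

for name, dist in [("pure", pure), ("balanced", balanced), ("skewed", skewed)]:
    print(f"{name}: gini={gini(dist):.3f}, entropy={entropy(dist):.3f}")
```

Both measures are zero for the pure node and largest for the balanced node, which is why either can serve as a splitting criterion.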

Impact on Decision Tree Splits
1. Splitting criterion: Both Gini Impurity and Entropy are used as splitting criteria in Decision Trees.
2. Node purity: The algorithm chooses the split that results in the largest reduction in impurity (Gini Impurity or Entropy).
3. Tree structure: The choice of impurity measure can affect the structure of the Decision Tree.
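
To make the idea of "largest reduction in impurity" concrete, here is a minimal sketch of a split search on a single numeric feature using Gini impurity. The function name `best_threshold` and the toy data are assumptions made for illustration; real implementations such as scikit-learn's are considerably more optimized.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values.

    Returns (threshold, impurity_reduction) for the best split found,
    or (None, 0.0) if no split reduces impurity.
    """
    parent = gini(labels)
    n = len(labels)
    distinct = sorted(set(values))
    best = (None, 0.0)
    for low, high in zip(distinct, distinct[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Child impurities are weighted by the share of samples they receive.
        weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
        reduction = parent - weighted
        if reduction > best[1]:
            best = (t, reduction)
    return best


# Hypothetical toy data: one feature, two classes.
petal_length = [1.4, 1.3, 1.5, 4.7, 4.5, 5.1]
species = ["setosa", "setosa", "setosa",
           "versicolor", "versicolor", "versicolor"]

threshold, reduction = best_threshold(petal_length, species)
print(threshold, reduction)
```

Here the midpoint between 1.5 and 4.5 separates the two classes perfectly, so it removes all of the parent's impurity.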

Comparison
1. Similar results: In practice, Gini Impurity and Entropy usually select the same or very similar splits.
2. Computational efficiency: Gini Impurity is generally slightly faster to compute than Entropy because it does not require evaluating logarithms.

Importance
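1. Decision Tree performance: The choice of impurity measure can affect the performance of the Decision Tree, although the effect is usually small, so it is reasonable to treat the criterion as a hyperparameter to tune.
2. Handling complex data: Understanding what each measure rewards helps explain why the tree chose a particular split, which is useful when diagnosing models built on complex datasets.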



Question 3: What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.

Answer-3. Pre-Pruning
1. Definition: Pre-pruning involves stopping the growth of a Decision Tree before it is fully grown, based on certain criteria.
2. Criteria: Examples include maximum depth, minimum number of samples per node, or minimum impurity decrease.
3. Practical advantage: Reduces computational cost and prevents overfitting by avoiding unnecessary splits.

Post-Pruning
1. Definition: Post-pruning involves growing a full Decision Tree and then removing branches that do not contribute significantly to the model's accuracy.
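2. Process: Branches are pruned back as long as doing so does not increase the error on held-out (validation) data, or according to a complexity penalty as in cost-complexity pruning.
3. Practical advantage: Pruning decisions are made with the whole tree in view, so splits that only pay off deeper in the tree are not discarded prematurely.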

Key differences
1. Timing: Pre-pruning occurs during tree growth, while post-pruning occurs after the tree is fully grown.
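2. Approach: Pre-pruning is greedy and may stop before a useful later split is found, while post-pruning evaluates each branch after the full tree exists, at the cost of growing that tree first.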
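
The two strategies can be compared with scikit-learn. The sketch below is an illustration under stated assumptions: pre-pruning is expressed through `max_depth` and `min_samples_leaf`, and post-pruning uses scikit-learn's minimal cost-complexity pruning (`ccp_alpha`), which is one specific post-pruning technique. The alpha value is picked from the middle of the pruning path purely for demonstration; in practice it would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pre-pruning: limit growth while the tree is being built.
pre_pruned = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune with a complexity penalty.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]  # illustrative choice only
post_pruned = DecisionTreeClassifier(
    ccp_alpha=alpha, random_state=42
).fit(X_train, y_train)

for name, model in [("full", full_tree),
                    ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    print(name, "nodes:", model.tree_.node_count,
          "test accuracy:", model.score(X_test, y_test))
```

Comparing `node_count` across the three models shows how much each strategy simplifies the tree.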


Question 4: What is Information Gain in Decision Trees, and why is it important for choosing the best split?

Answer-4. Information Gain measures the reduction in uncertainty or entropy in a dataset after splitting it based on a particular feature.

Calculation
Information Gain is calculated as:

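IG = Entropy(parent) - ∑ (n_k / n) * Entropy(child_k)

where n is the number of samples in the parent node and n_k is the number of samples in child k, so each child's entropy is weighted by its share of the parent's samples.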
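
As a worked example, the sketch below computes Information Gain as the parent entropy minus the size-weighted average of the child entropies. The labels and the two candidate splits are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 4 positive and 4 negative samples.
parent = ["yes"] * 4 + ["no"] * 4
clean_split = [["yes"] * 4, ["no"] * 4]
noisy_split = [["yes"] * 3 + ["no"], ["yes"] + ["no"] * 3]

print(round(information_gain(parent, clean_split), 4))  # 1.0
print(round(information_gain(parent, noisy_split), 4))  # 0.1887
```

The clean split removes all uncertainty (gain of 1 bit), while the noisy split removes only about 0.19 bits, so the algorithm would prefer the clean one.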

Importance
1. Choosing the best split: Information Gain helps determine the most informative feature to split the data.
2. Reducing uncertainty: By maximizing Information Gain, the algorithm reduces uncertainty and improves the purity of the nodes.

Role in Decision Trees
1. Feature selection: Information Gain is used to select the most relevant features for splitting.
2. Tree construction: At each node the algorithm greedily chooses the split with the highest Information Gain. This usually yields a good tree, but not necessarily a globally optimal one.

Benefits
1. Improved accuracy: Information Gain helps improve the accuracy of the Decision Tree model.
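2. Efficient splitting: Selecting the most informative feature at each node tends to produce purer nodes with fewer splits. Information Gain does not by itself prevent overfitting, and it tends to favor features with many distinct values, so pruning is still needed.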


Question 5: What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?

Answer-5. Real-World Applications
1. Credit Risk Assessment: Decision Trees are used to evaluate creditworthiness and predict loan defaults.
2. Medical Diagnosis: Decision Trees help diagnose diseases and identify high-risk patients.
3. Customer Segmentation: Decision Trees are used to segment customers based on demographic and behavioral characteristics.
4. Marketing and Sales: Decision Trees help predict customer responses to marketing campaigns and identify potential sales opportunities.

Advantages
1. Interpretability: Decision Trees are easy to understand and interpret.
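2. Handling categorical features: In principle, Decision Trees can split on categorical features directly, although some implementations (including scikit-learn's) require them to be encoded as numbers first.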
3. Fast training: Decision Trees are relatively fast to train.

Limitations
1. Overfitting: Decision Trees can overfit the training data, especially if they are too complex.
2. Instability: Small changes in the data can result in significantly different Decision Trees.
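3. Limited handling of complex relationships: Because each split tests a single feature against a threshold, Decision Trees must approximate smooth or diagonal decision boundaries with many axis-aligned steps, and they can struggle with such relationships.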

Mitigating Limitations
1. Pruning: Pruning techniques can help prevent overfitting.
2. Ensemble methods: Combining multiple Decision Trees (e.g., Random Forests) can improve stability and accuracy.
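
A quick way to see the effect of these mitigations is to compare cross-validated accuracy for a single tree, a depth-limited tree, and a Random Forest. The sketch below uses the Iris dataset and 5-fold cross-validation as assumptions for illustration; on such a small, easy dataset the differences may be minor.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=42),
    "pruned tree (max_depth=3)": DecisionTreeClassifier(max_depth=3, random_state=42),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

results = {}
for name, model in models.items():
    # Mean and spread of accuracy across the 5 folds.
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

The standard deviation across folds gives a rough indication of how sensitive each model is to the particular training sample.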

Dataset Information
Iris Dataset
- Classification task: Predict the species of an iris flower based on its features.
- Features:
    - Sepal length
    - Sepal width
    - Petal length
    - Petal width
- Target variable: Species (Setosa, Versicolor, or Virginica)
- Number of samples: 150
- Dataset source: sklearn.datasets.load_iris() or provided CSV

Boston Housing Dataset
- Regression task: Predict the median house price based on features of the neighborhood.
- Features:
    - CRIM (crime rate)
    - ZN (proportion of residential land zoned for large lots)
    - INDUS (proportion of non-retail business acres)
    - CHAS (Charles River dummy variable)
    - NOX (nitrogen oxides concentration)
    - RM (average number of rooms per dwelling)
    - AGE (proportion of owner-occupied units built prior to 1940)
    - DIS (weighted distances to five Boston employment centers)
    - RAD (index of accessibility to radial highways)
    - TAX (full-value property tax rate)
    - PTRATIO (pupil-teacher ratio)
    - B (1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town)
    - LSTAT (percentage of lower status population)
- Target variable: Median house price (MEDV)
- Number of samples: 506
- Dataset source: sklearn.datasets.fetch_openml(name="boston", version=1) or provided CSV (sklearn.datasets.load_boston() was removed in scikit-learn 1.2)


Question 6:   Write a Python program to:

 ● Load the Iris Dataset

 ● Train a Decision Tree Classifier using the Gini criterion

 ● Print the model’s accuracy and feature importances

 (Include your Python code and output in the code box below.)

 Answer-6.

In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Classifier using the Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate and print the model's accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)

# Print feature importances
feature_importances = clf.feature_importances_
print("Feature Importances:")
for i, feature in enumerate(iris.feature_names):
    print(f"{feature}: {feature_importances[i]:.3f}")


Model Accuracy: 1.0
Feature Importances:
sepal length (cm): 0.000
sepal width (cm): 0.017
petal length (cm): 0.906
petal width (cm): 0.077


Question 7:  Write a Python program to:

● Load the Iris Dataset

● Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree. (Include your Python code and output in the code box below.)

Answer-7.

In [3]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a fully-grown Decision Tree Classifier
full_tree = DecisionTreeClassifier(random_state=42)
full_tree.fit(X_train, y_train)
y_pred_full = full_tree.predict(X_test)
full_tree_accuracy = accuracy_score(y_test, y_pred_full)
print("Fully-Grown Tree Accuracy:", full_tree_accuracy)

# Train a Decision Tree Classifier with max_depth=3
limited_depth_tree = DecisionTreeClassifier(max_depth=3, random_state=42)
limited_depth_tree.fit(X_train, y_train)
y_pred_limited = limited_depth_tree.predict(X_test)
limited_depth_accuracy = accuracy_score(y_test, y_pred_limited)
print("Limited Depth Tree (max_depth=3) Accuracy:", limited_depth_accuracy)


Fully-Grown Tree Accuracy: 1.0
Limited Depth Tree (max_depth=3) Accuracy: 1.0


Question 8: Write a Python program to:

● Load the Boston Housing Dataset

● Train a Decision Tree Regressor

 ● Print the Mean Squared Error (MSE) and feature importances (Include your Python code and output in the code box below.)

 Answer-8.

In [5]:
# Import necessary libraries
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Load the Boston Housing Dataset
boston = fetch_openml(name="boston", version=1)
X = boston.data
y = boston.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Make predictions on the test set
y_pred = regressor.predict(X_test)

# Calculate and print the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)

# Print feature importances
feature_importances = regressor.feature_importances_
print("Feature Importances:")
for i, feature in enumerate(boston.feature_names):
    print(f"{feature}: {feature_importances[i]:.3f}")

Mean Squared Error (MSE): 10.416078431372549
Feature Importances:
CRIM: 0.051
ZN: 0.003
INDUS: 0.006
CHAS: 0.000
NOX: 0.027
RM: 0.600
AGE: 0.014
DIS: 0.071
RAD: 0.002
TAX: 0.012
PTRATIO: 0.011
B: 0.009
LSTAT: 0.193


Question 9: Write a Python program to:

 ● Load the Iris Dataset

 ● Tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV

 ● Print the best parameters and the resulting model accuracy (Include your Python code and output in the code box below.)

 Answer-9.

In [6]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter tuning space
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'min_samples_split': [2, 5, 10]
}

# Perform GridSearchCV
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best Parameters:", grid_search.best_params_)

# GridSearchCV already refits the best parameter combination on the full
# training set (refit=True by default), so the fitted model can be reused
best_model = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_model.predict(X_test)

# Calculate and print the model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy: 1.0


Question 10: Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values.

 Explain the step-by-step process you would follow to:

 ● Handle the missing values

 ● Encode the categorical features

 ● Train a Decision Tree model

 ● Tune its hyperparameters

 ● Evaluate its performance

 And describe what business value this model could provide in the real-world setting.

  Answer-10. Step-by-Step Process
Handling Missing Values
1. Identify missing values: Determine the extent of missing values in the dataset.
2. Imputation methods: Choose suitable imputation methods, such as the mean or median for numeric features, the most frequent value for categorical features, or model-based imputation using regression.
3. Impute missing values: Apply the chosen imputation method to replace missing values.

Encoding Categorical Features
1. Identify categorical features: Determine which features are categorical.
2. Encoding methods: Choose suitable encoding methods, such as one-hot encoding for nominal features (no natural order) or ordinal encoding for features whose categories have a meaningful order.
3. Encode categorical features: Apply the chosen encoding method to transform categorical features into numerical features.

Training a Decision Tree Model
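1. Split data: Split the dataset into training and testing sets, stratified on the target. Fit the imputation and encoding steps on the training set only, so that no information from the test set leaks into the model.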
2. Train Decision Tree: Train a Decision Tree model on the training data.

Tuning Hyperparameters
1. Define hyperparameter space: Define the hyperparameter space for tuning, including parameters like max_depth, min_samples_split, and min_samples_leaf.
2. Hyperparameter tuning method: Choose a hyperparameter tuning method, such as GridSearchCV or RandomizedSearchCV.
3. Tune hyperparameters: Perform hyperparameter tuning using the chosen method.
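
The preprocessing, training, and tuning steps above can be chained into a single scikit-learn pipeline, so that imputation and encoding are fitted only on the training folds. The sketch below is a minimal illustration under loud assumptions: the data is synthetic, and the column names (`age`, `blood_pressure`, `smoker`), the labelling rule, and the parameter grid are all invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data; names, values, and the labelling rule are invented.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(20, 80, size=n).astype(float),
    "blood_pressure": rng.normal(120, 15, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)

# Knock out some values to mimic missing data.
df.loc[rng.choice(n, size=20, replace=False), "age"] = np.nan
df.loc[rng.choice(n, size=20, replace=False), "smoker"] = np.nan

numeric_cols = ["age", "blood_pressure"]
categorical_cols = ["smoker"]

# Median imputation for numeric columns; mode imputation plus one-hot
# encoding for categorical columns.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", DecisionTreeClassifier(random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

param_grid = {
    "classifier__max_depth": [2, 3, 4, 5],
    "classifier__min_samples_split": [2, 10, 20],
    "classifier__min_samples_leaf": [1, 5, 10],
}

# Recall is used for model selection here as an assumption: missing a
# sick patient is treated as the costlier error.
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out recall:", search.score(X_test, y_test))
```

Wrapping the preprocessing inside the pipeline means each cross-validation fold re-fits the imputers and encoder on its own training portion, which avoids leaking information from validation data.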

Evaluating Model Performance
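1. Metrics: Choose suitable evaluation metrics, such as accuracy, precision, recall, F1-score, and AUC-ROC. For disease prediction, recall (sensitivity) often deserves the most weight, because accuracy alone can look good on imbalanced data while many true cases are missed.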
2. Evaluate model: Evaluate the model's performance on the testing data using the chosen metrics.
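
To show why accuracy alone can mislead in this setting, the sketch below computes accuracy, precision, recall, and F1-score from scratch. The label vectors are hypothetical (1 = disease present, 0 = absent), and the helper name `binary_metrics` is an assumption for this example; scikit-learn's `sklearn.metrics` module provides equivalent functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical predictions for 10 patients (4 sick, 6 healthy).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

metrics = binary_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the accuracy is 0.7, yet the recall is only 0.5: half of the sick patients are missed, which is the kind of failure a healthcare model most needs to surface.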

Business Value
1. Disease prediction: The model can predict the likelihood of a patient having a certain disease, enabling early intervention and treatment.
2. Improved patient outcomes: Accurate predictions can lead to better patient outcomes, including reduced morbidity and mortality.
3. Resource optimization: The model can help optimize resource allocation, reducing unnecessary tests and procedures.
4. Enhanced patient care: The model can provide valuable insights for healthcare professionals, enabling more informed decision-making and personalized patient care.

By following this step-by-step process, the healthcare company can develop a robust Decision Tree model that provides valuable predictions and insights, ultimately improving patient outcomes and resource allocation.
