1. What is a Decision Tree, and how does it work in the context of
classification?
  - A Decision Tree is a supervised machine learning algorithm that can be used for both classification and regression tasks. It models decisions and their possible consequences in a tree-like structure, similar to a flowchart: each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf assigns a prediction.
    In classification, the tree is built by recursively partitioning the training data into subsets based on feature values, so that the subsets become increasingly homogeneous with respect to the target class. A new sample is classified by following the tests from the root down to a leaf and taking the majority class of that leaf.
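
To make the flowchart analogy concrete, the sketch below fits a deliberately shallow tree on the Iris data and prints its learned rules as text. The depth limit of 2 and the `random_state` value are arbitrary choices made only to keep the output short; they are not part of the answer above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set short enough to read.
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# export_text renders the fitted tree as nested threshold tests ending in a class.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the output is one test on a feature threshold, and each `class:` line is a leaf.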

2. Explain the concepts of Gini Impurity and Entropy as impurity measures.
How do they impact the splits in a Decision Tree?
  - Gini Impurity and Entropy are two measures of how mixed the class labels in a node are. A pure node contains data points from only one class, while an impure node contains a mix of classes. With p_k denoting the proportion of samples in the node that belong to class k, Gini Impurity is 1 - Σ p_k² and Entropy is -Σ p_k log2(p_k). Both equal 0 for a pure node and reach their maximum when the classes are evenly mixed. The goal at each node is to find the split that minimizes the impurity of the resulting subsets.
      - In a decision tree algorithm, Gini Impurity and Entropy serve as splitting criteria for finding the best feature and threshold at each node. The search proceeds as follows:

          - Calculate parent node impurity: The algorithm first computes the impurity (Gini or Entropy) of the current node.
          - Evaluate potential splits: For every feature and its possible values, the algorithm considers splitting the data and creating child nodes.
          - Calculate weighted child impurity: For each potential split, it calculates the weighted average impurity of the resulting child nodes. The weighted average is determined by the number of instances in each child node relative to the parent.
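          - Determine the best split: The split with the lowest weighted child impurity is chosen. This is equivalent to choosing the split with the largest impurity reduction (parent impurity minus weighted child impurity); when the impurity measure is Entropy, that reduction is called Information Gain. Gini and Entropy often select similar splits, and Gini is slightly cheaper to compute because it avoids the logarithm.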
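
The four steps above can be sketched for a single numerical feature as follows. This is a minimal illustration rather than how any particular library implements the search; the helper names (`gini`, `entropy`, `weighted_child_impurity`, `best_threshold`) and the toy data are made up for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def weighted_child_impurity(left, right, impurity):
    """Average the child impurities, weighted by each child's share of samples."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

def best_threshold(x, y, impurity=gini):
    """Try the midpoints between consecutive distinct values of one feature."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2
    best = None
    for t in candidates:
        score = weighted_child_impurity(y[x <= t], y[x > t], impurity)
        if best is None or score < best[1]:
            best = (float(t), score)
    return best

# Toy data: one feature, two classes that separate cleanly at 3.5.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])

parent_gini = gini(y)        # 0.5 for a 50/50 node
parent_entropy = entropy(y)  # 1.0 bit for a 50/50 node
threshold, child_gini = best_threshold(x, y)

print(f"parent Gini={parent_gini:.3f}, parent entropy={parent_entropy:.3f}")
print(f"best threshold={threshold:.2f}, weighted child Gini={child_gini:.3f}")
```

A real implementation repeats this search over every feature and keeps the best feature and threshold pair.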

3. What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
  - Pre-pruning stops a decision tree from growing by applying stopping criteria during training, such as a maximum depth or a minimum number of samples per split. Post-pruning first grows the full tree and then removes branches that contribute little, as in cost-complexity pruning. A practical advantage of pre-pruning is efficiency: it avoids the computational cost of growing an oversized tree. A practical advantage of post-pruning is that it can generalize better, because pruning decisions are made with the whole tree in view, so a weak-looking early split that enables useful later splits is not cut off prematurely.
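
As a rough sketch of the two approaches in scikit-learn: pre-pruning is expressed through constructor limits such as `max_depth` and `min_samples_leaf`, while post-pruning uses `ccp_alpha` together with `cost_complexity_pruning_path`. The specific limits and the choice of an alpha from the middle of the path are arbitrary assumptions for illustration; in practice alpha would be chosen by cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pre-pruning: growth stops early because of limits set before training.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=42)
pre_pruned.fit(X_train, y_train)

# Post-pruning: grow the full tree, then compute its cost-complexity path.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Illustrative choice only: pick an alpha from the middle of the path.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
post_pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
post_pruned.fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pre_pruned), ("post-pruned", post_pruned)]:
    print(f"{name:12s} leaves={model.get_n_leaves():2d} test accuracy={model.score(X_test, y_test):.4f}")
```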

4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
  - Information Gain measures the reduction in entropy (uncertainty) achieved by splitting a dataset on a feature: it is the entropy of the parent node minus the weighted average entropy of the child nodes. It matters for choosing the best split because the candidate with the highest Information Gain is the one that separates the data into the purest, most homogeneous subsets. Choosing high-gain splits first reduces uncertainty fastest, which tends to produce smaller trees that reach confident predictions in fewer tests.
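
A small worked example may help here. The sketch below computes Information Gain for two candidate features on a made-up dataset of eight labels; the function names and the data are hypothetical and exist only for this illustration.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy in bits of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, feature):
    """Parent entropy minus the weighted average entropy of the children,
    where each distinct feature value defines one child node."""
    n = len(labels)
    children = [labels[feature == v] for v in np.unique(feature)]
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(labels) - weighted

# Made-up data: four positive and four negative labels.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
feature_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # separates the classes perfectly
feature_b = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # leaves both children 50/50

gain_a = information_gain(labels, feature_a)
gain_b = information_gain(labels, feature_b)
print(f"gain for feature_a: {gain_a:.3f}")
print(f"gain for feature_b: {gain_b:.3f}")
```

Here `feature_a` removes all uncertainty (a gain of 1 bit), while `feature_b` removes none, so a tree using the entropy criterion would split on `feature_a`.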

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
  - Common applications of decision trees include disease diagnosis in healthcare, risk assessment in finance, and customer segmentation in marketing. Their main advantages are that they are easy to understand and require minimal data preparation. Their major limitation is a tendency to overfit: an unconstrained tree can become complex enough to fit noise in the training data and then perform poorly on new data. Trees are also unstable, in that small changes in the training data can produce a noticeably different tree. The advantages in more detail:
      - Simple to understand: Decision trees are easy to interpret and visualize, making them simple to explain to non-technical audiences.
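      - Minimal data preparation: They need less preprocessing than many other algorithms. In particular, feature scaling or normalization is unnecessary, because splits depend only on the ordering of feature values.
      - Handles mixed data types: Decision trees can work with both numerical and categorical data, although some implementations, including scikit-learn, require categorical features to be encoded as numbers first.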
      - Transparent logic: They provide a clear, step-by-step representation of the decision-making process.
      - Versatile: They can be used for both classification (predicting a category) and regression (predicting a continuous value).
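
The point about normalization can be checked directly. The sketch below fits the same tree on the raw Iris features and on a rescaled copy and compares the results. The scale factors are arbitrary; powers of two are used only so that the floating-point arithmetic stays exact in this demonstration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Rescale each column by a different (arbitrary) factor.
scale = np.array([1.0, 8.0, 1024.0, 0.5])
X_scaled = X * scale

tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Both trees should choose the same features and make the same predictions,
# because rescaling does not change the ordering of values within a column.
same_features = np.array_equal(tree_raw.tree_.feature, tree_scaled.tree_.feature)
same_predictions = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print(f"same split features: {same_features}, same predictions: {same_predictions}")
```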

6. Write a Python program to:
      - Load the Iris Dataset
      - Train a Decision Tree Classifier using the Gini criterion
      - Print the model’s accuracy and feature importances

In [1]:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def train_and_evaluate_iris_decision_tree():
    """Train a Gini-based decision tree on Iris and print its accuracy and feature importances."""

    print("Loading Iris Dataset...")
    iris = load_iris()
    X = iris.data
    y = iris.target
    feature_names = iris.feature_names

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}\n")

    print("Training Decision Tree Classifier (Criterion: Gini)...")
    dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
    dt_classifier.fit(X_train, y_train)
    print("Training complete.\n")

    y_pred = dt_classifier.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print("-" * 40)
    print(f"Model Accuracy on Test Set: {accuracy:.4f}")
    print("-" * 40)

    importances = dt_classifier.feature_importances_
    feature_importances_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False).reset_index(drop=True)

    print("\nFeature Importances (Contribution to Model Prediction):")
    print(feature_importances_df.to_string(index=False))
    print("\nInterpretation: Higher importance means the feature was more crucial in splitting the nodes of the decision tree.")


if __name__ == "__main__":
    train_and_evaluate_iris_decision_tree()


Loading Iris Dataset...
Training samples: 120, Testing samples: 30

Training Decision Tree Classifier (Criterion: Gini)...
Training complete.

----------------------------------------
Model Accuracy on Test Set: 0.9333
----------------------------------------

Feature Importances (Contribution to Model Prediction):
          Feature  Importance
petal length (cm)    0.558568
 petal width (cm)    0.406015
 sepal width (cm)    0.029167
sepal length (cm)    0.006250

Interpretation: Higher importance means the feature was more crucial in splitting the nodes of the decision tree.


7. Write a Python program to:
  -  Load the Iris Dataset
  -  Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.


In [2]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def compare_decision_tree_depths():
    """Compare test accuracy of a fully grown tree and a tree limited to max_depth=3 on Iris."""

    print("Loading Iris Dataset...")
    iris = load_iris()
    X = iris.data
    y = iris.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}\n")

    print("Training Fully-Grown Decision Tree (max_depth=None)...")
    dt_full = DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=42)
    dt_full.fit(X_train, y_train)

    y_pred_full = dt_full.predict(X_test)
    accuracy_full = accuracy_score(y_test, y_pred_full)

    print("Training Decision Tree with max_depth=3...")
    dt_depth3 = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
    dt_depth3.fit(X_train, y_train)

    y_pred_depth3 = dt_depth3.predict(X_test)
    accuracy_depth3 = accuracy_score(y_test, y_pred_depth3)

    print("\n" + "=" * 55)
    print("Decision Tree Classifier Accuracy Comparison")
    print("=" * 55)
    print(f"1. Fully-Grown Tree (max_depth=None): {accuracy_full:.4f}")
    print(f"2. Max Depth 3 Tree (max_depth=3):    {accuracy_depth3:.4f}")
    print("=" * 55)

    if accuracy_full > accuracy_depth3:
        print("\nNote: The fully-grown tree achieved higher accuracy on this test set.")
    elif accuracy_full < accuracy_depth3:
        print("\nNote: The max_depth=3 tree performed slightly better on this test set.")
    else:
        print("\nNote: Both models achieved the same accuracy on this test set.")
    print("Using a constrained depth (like max_depth=3) is a form of pre-pruning to mitigate overfitting.")


if __name__ == "__main__":
    compare_decision_tree_depths()


Loading Iris Dataset...
Training samples: 120, Testing samples: 30

Training Fully-Grown Decision Tree (max_depth=None)...
Training Decision Tree with max_depth=3...

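=======================================================
Decision Tree Classifier Accuracy Comparison
=======================================================
1. Fully-Grown Tree (max_depth=None): 0.9333
2. Max Depth 3 Tree (max_depth=3):    0.9667
=======================================================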

Note: The max_depth=3 tree performed slightly better on this test set.
Using a constrained depth (like max_depth=3) is a form of pre-pruning to mitigate overfitting.
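
With only 30 test samples, the gap between 0.9333 and 0.9667 corresponds to a single sample (28 versus 29 correct), so the comparison above is sensitive to the particular split. One way to get a steadier estimate, sketched below, is to compare the two settings with cross-validation over the whole dataset. The use of 5 shuffled stratified folds is an assumption made for this example, not part of the original exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The same folds are reused for both models so the comparison is like for like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for label, depth in [("fully grown", None), ("max_depth=3", 3)]:
    model = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[label] = scores
    print(f"{label:12s} mean accuracy={scores.mean():.4f} (std={scores.std():.4f})")
```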


8. Write a Python program to:
  -  Load the Boston Housing Dataset
  -  Train a Decision Tree Regressor
  -  Print the Mean Squared Error (MSE) and feature importances

In [3]:
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

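def train_and_evaluate_boston_decision_tree():
    """Train a decision tree regressor on Boston Housing and print its test MSE and feature importances."""
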
    print("Loading Boston Housing Dataset...")

    try:
        # The Boston data is no longer bundled with scikit-learn, so it is fetched from OpenML.
        boston = fetch_openml(name='boston', version=1, as_frame=True, parser='auto')
    except Exception as e:
        print(f"Error loading Boston dataset via fetch_openml: {e}")
        print("Please ensure you have an internet connection and compatible scikit-learn version.")
        return

    X = boston.data
    y = boston.target

    feature_names = X.columns.tolist()

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}\n")

    print("Training Decision Tree Regressor...")

    dt_regressor = DecisionTreeRegressor(random_state=42)

    dt_regressor.fit(X_train, y_train)
    print("Training complete.\n")

    y_pred = dt_regressor.predict(X_test)

    mse = mean_squared_error(y_test, y_pred)
    print("=" * 45)
    print(f"Mean Squared Error (MSE) on Test Set: {mse:.4f}")
    print("=" * 45)

    importances = dt_regressor.feature_importances_

    feature_importances_df = pd.DataFrame({
        'Feature': feature_names,
        'Importance': importances
    }).sort_values(by='Importance', ascending=False).reset_index(drop=True)

    print("\nFeature Importances (Contribution to Price Prediction):")
    print(feature_importances_df.to_string(index=False))
    print("\nInterpretation: Features with higher importance are more influential in determining the predicted house price.")


if __name__ == "__main__":
    train_and_evaluate_boston_decision_tree()


Loading Boston Housing Dataset...
Training samples: 404, Testing samples: 102

Training Decision Tree Regressor...
Training complete.

=============================================
Mean Squared Error (MSE) on Test Set: 10.4161
=============================================

Feature Importances (Contribution to Price Prediction):
Feature  Importance
     RM    0.600326
  LSTAT    0.193328
    DIS    0.070688
   CRIM    0.051296
    NOX    0.027148
    AGE    0.013617
    TAX    0.012464
PTRATIO    0.011012
      B    0.009009
  INDUS    0.005816
     ZN    0.003353
    RAD    0.001941
   CHAS    0.000002

Interpretation: Features with higher importance are more influential in determining the predicted house price.
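
The program above needs network access to reach OpenML. For readers who cannot download the Boston data, the same workflow can be tried offline on a dataset bundled with scikit-learn. The sketch below swaps in `load_diabetes` purely as a stand-in, so its numbers are not comparable to the Boston results above. It also reports the root of the MSE, which is in the same units as the target.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in dataset (not Boston Housing): bundled with scikit-learn, no download needed.
data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

reg = DecisionTreeRegressor(random_state=42)
reg.fit(X_train, y_train)

mse = mean_squared_error(y_test, reg.predict(X_test))
rmse = np.sqrt(mse)  # same units as the target, often easier to interpret
print(f"MSE: {mse:.4f}  RMSE: {rmse:.4f}")

# Feature importances sum to 1; list them from most to least important.
ranked = sorted(zip(data.feature_names, reg.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:>4s}  {importance:.6f}")
```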


9. Write a Python program to:
  -  Load the Iris Dataset
  -  Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
  -  Print the best parameters and the resulting model accuracy

In [4]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def tune_decision_tree_with_gridsearch():
    """Tune max_depth and min_samples_split with GridSearchCV on Iris and print the best result."""

    print("Loading Iris Dataset...")
    iris = load_iris()
    X = iris.data
    y = iris.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    print(f"Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}")
    print("Preparing Grid Search for Hyperparameter Tuning...\n")

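    param_grid = {
        # Maximum depth of the tree: 2 through 10
        'max_depth': list(range(2, 11)),
        # Minimum number of samples required to split an internal node
        'min_samples_split': [2, 3, 5, 7, 10]
    }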

    dt = DecisionTreeClassifier(random_state=42)

    grid_search = GridSearchCV(
        estimator=dt,
        param_grid=param_grid,
        cv=5,
        scoring='accuracy',
        n_jobs=-1,
        verbose=1
    )

    print("Starting Grid Search (testing parameter combinations)...")
    grid_search.fit(X_train, y_train)
    print("Grid Search complete.\n")

    best_params = grid_search.best_params_

    best_cv_score = grid_search.best_score_

    best_dt_model = grid_search.best_estimator_
    y_pred_test = best_dt_model.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_pred_test)


    print("=" * 55)
    print("GridSearchCV Results for Decision Tree Tuning")
    print("=" * 55)
    print(f"Best Hyperparameters Found: {best_params}")
    print(f"Best Cross-Validation Accuracy: {best_cv_score:.4f}")
    print("-" * 55)
    print(f"Accuracy on Independent Test Set: {test_accuracy:.4f}")
    print("=" * 55)

if __name__ == "__main__":
    tune_decision_tree_with_gridsearch()


Loading Iris Dataset...
Training samples: 120, Testing samples: 30
Preparing Grid Search for Hyperparameter Tuning...

Starting Grid Search (testing parameter combinations)...
Fitting 5 folds for each of 45 candidates, totalling 225 fits
Grid Search complete.

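=======================================================
GridSearchCV Results for Decision Tree Tuning
=======================================================
Best Hyperparameters Found: {'max_depth': 4, 'min_samples_split': 2}
Best Cross-Validation Accuracy: 0.9417
-------------------------------------------------------
Accuracy on Independent Test Set: 0.9333
=======================================================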
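
On a dataset this small, several parameter combinations can reach the same or nearly the same cross-validation score, and `best_params_` reports only one of them. A possible way to see how close the alternatives are, sketched below, is to load `cv_results_` into a DataFrame and look at the top-ranked rows. The grid and split mirror the program above; running single-threaded and showing five rows are choices made for this example.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

param_grid = {
    "max_depth": list(range(2, 11)),
    "min_samples_split": [2, 3, 5, 7, 10],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# One row per parameter combination; rank 1 marks the best mean CV score.
results = pd.DataFrame(search.cv_results_)
columns = ["param_max_depth", "param_min_samples_split",
           "mean_test_score", "std_test_score", "rank_test_score"]
top = results.sort_values("rank_test_score")[columns].head(5)
print(top.to_string(index=False))
```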


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
  -  Handle the missing values
  -  Encode the categorical features
  -  Train a Decision Tree model
  -  Tune its hyperparameters
  -  Evaluate its performance
And describe what business value this model could provide in the real-world
setting.


        - As a data scientist predicting disease for a healthcare company, the following steps would be used to build and evaluate a Decision Tree model. Given the mixed data types and missing values, preprocessing techniques must be carefully selected to avoid biasing the results.
        - Handle the missing values
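            - Begin with an exploratory data analysis (EDA) to understand the extent and patterns of missing data. Visualizing it with a library such as missingno helps judge whether values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). Then choose a strategy per column: drop columns that are almost entirely empty, impute numerical features with the median and categorical features with the most frequent value (or a dedicated "missing" category), and consider adding a missing-indicator column when missingness itself may be informative. Fit the imputers on the training data only, so that no information from the test set leaks into the model.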
        - Encode the categorical features
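            - Decision Trees in scikit-learn require numerical input, so categorical features must be converted before training. Use one-hot encoding for nominal features that have no natural order, and ordinal encoding for features whose categories have a meaningful order. Feature scaling is not needed, because tree splits depend only on the ordering of values.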
        - Train a Decision Tree model
            - Data splitting: First, split the dataset into a training set and a testing set. A common split is 80% for training and 20% for testing. Stratified splitting should be used to ensure the proportion of the target disease class is maintained in both sets, especially in cases of class imbalance.
            - Training: Initialize and train the Decision Tree classifier on the training data. The model learns a series of if-then-else decision rules from the features to predict the target variable.
            - Handling large datasets: If the dataset is too large to fit into memory, consider using libraries like Dask or PySpark to process the data in chunks or parallelize computations. Alternatively, training can be done on a representative, stratified sample of the data.
        - Tune its hyperparameters
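            - Hyperparameters are set before training and control the learning process. Tuning is critical to prevent overfitting and improve performance on unseen data. For a Decision Tree the key hyperparameters are max_depth, min_samples_split, min_samples_leaf, and ccp_alpha. Search over them with cross-validation on the training set (for example with GridSearchCV), and pick a scoring metric that reflects the clinical cost of errors, such as recall when missed cases are the bigger risk.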
        - Evaluate its performance
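            - After tuning, the model's performance must be rigorously evaluated on the held-out test set to ensure its reliability. In a healthcare setting a single metric such as accuracy is not sufficient, because false negatives (missed cases) and false positives (unnecessary follow-ups) carry different and often high costs. Report the confusion matrix together with precision, recall (sensitivity), specificity, F1 score, and ROC AUC.
        - Business value
            - A validated model can flag high-risk patients for earlier follow-up testing, helping clinicians prioritize limited diagnostic resources. Because the rules of a Decision Tree are transparent, clinicians can inspect why a patient was flagged, which supports review and trust. The model should support clinical judgment rather than replace it.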
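
The preprocessing, training, tuning, and evaluation steps above can be sketched as a single scikit-learn pipeline. Everything about the data here is an assumption: the patient records are synthetic, the column names (`age`, `blood_pressure`, `smoker`, `sex`) are hypothetical, and the parameter grid and the choice of recall as the tuning metric are illustrative rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records; all columns and values are made up.
rng = np.random.default_rng(42)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 90, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
    "sex": rng.choice(["F", "M"], n),
})
numeric = ["age", "blood_pressure"]
categorical = ["smoker", "sex"]
df[categorical] = df[categorical].astype(object)

# Made-up disease label loosely driven by age, blood pressure and smoking.
risk = (0.04 * (df["age"] - 50) + 0.03 * (df["blood_pressure"] - 120)
        + (df["smoker"] == "yes") * 1.0)
y = (risk + rng.normal(0, 1, n) > 0.5).astype(int)

# Knock out some values to imitate missing data.
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

# Median imputation for numbers; most-frequent imputation plus one-hot for categories.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=42)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# Imputers and encoder are refit inside each CV fold, which avoids leakage.
param_grid = {"tree__max_depth": [3, 5, 7], "tree__min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(model, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

y_pred = search.predict(X_test)
y_prob = search.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print("Best parameters:", search.best_params_)
print(classification_report(y_test, y_pred))
print(f"ROC AUC: {auc:.3f}")
```

Keeping the imputers and encoder inside the pipeline means they are fit only on training folds, which is the leakage precaution described in the missing-values step.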