# Decision Tree - Assignment

1. What is a Decision Tree, and how does it work in the context of
classification?
   - A Decision Tree is a supervised machine learning algorithm used for classification and regression tasks. It mimics human decision-making by breaking down a dataset into smaller and smaller subsets while simultaneously developing an associated decision tree structure.
   - How it works for Classification:
       - Select the best attribute using a metric such as the Gini Index, Information Gain (based on Entropy), or Gain Ratio.

       - Split the dataset into subsets based on the selected attribute.

       - Repeat the process recursively for each child node: Continue splitting until a stopping condition is met (e.g., all samples at a node have the same label, or a maximum depth is reached).

       - Assign a class label to each leaf node based on the majority class of samples in that node.
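The steps above can be seen on a small example. The following is a minimal sketch, assuming scikit-learn is installed; the toy data (age and income predicting a purchase) is hypothetical and chosen so that one feature separates the classes cleanly.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: [age, income in thousands] -> bought (1) or not (0)
X = [[22, 25], [25, 32], [47, 60], [52, 75], [46, 30], [56, 90], [23, 20], [60, 40]]
y = [0, 0, 1, 1, 0, 1, 0, 0]

# The tree picks the split with the lowest weighted Gini impurity at each node
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

# Show the learned rules: each leaf carries the majority class of its samples
print(export_text(tree, feature_names=["age", "income"]))

# Classify a new sample by following the rules from the root to a leaf
print(tree.predict([[50, 80]]))
```

Because income alone separates the two classes in this toy data, the tree stops after a single split: both child nodes are already pure, which is one of the stopping conditions listed above.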



2. Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?
     - In decision trees, impurity measures are used to determine how mixed the classes are within a node. Two commonly used impurity measures are Gini Impurity and Entropy.
     - Gini Impurity: Gini Impurity measures the probability of misclassifying a randomly chosen element from the dataset if it were labeled randomly based on the distribution of labels in the subset.
     - Entropy: Entropy measures the level of disorder or impurity in the node. It's rooted in information theory and quantifies the expected amount of information (bits) needed to classify a sample.
     - Impact on Decision Tree Splits:
          - During tree construction, the algorithm evaluates each feature's potential to split the data.
          - It chooses the split that results in the largest reduction in impurity. For Gini, it picks the split whose child nodes have the lowest weighted Gini Impurity. For Entropy, it picks the split with the highest Information Gain (i.e., the biggest reduction in Entropy from the parent node to its children).
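Both measures can be computed directly from the class proportions p_k in a node: Gini = 1 - sum(p_k^2) and Entropy = -sum(p_k * log2(p_k)). The sketch below is illustrative only; the helper names `gini` and `entropy` are not from any library.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over class proportions."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

pure = ["yes", "yes", "yes", "yes"]    # one class only
mixed = ["yes", "yes", "no", "no"]     # evenly mixed, two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

A pure node scores 0 on both measures. For two classes, an evenly mixed node is the worst case: Gini reaches 0.5 and Entropy reaches 1 bit.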
         

3.  What is the difference between Pre-Pruning and Post-Pruning in Decision
Trees? Give one practical advantage of using each.
    - Pre-Pruning and Post-Pruning are techniques used to prevent overfitting in decision trees by controlling the growth of the tree.
    - Pre-pruning, also known as early stopping, halts the growth of the tree during construction based on conditions such as a maximum depth, a minimum number of samples required to split a node, or a minimum decrease in impurity. Because the tree is never fully grown, this yields a simpler model and shorter training time, which is its practical advantage.
    - Post-pruning allows the tree to grow fully and then removes branches that do not improve the model’s performance on a validation dataset. A key advantage of post-pruning is that it often improves generalization by eliminating branches that lead to overfitting, making the model more robust on unseen data.
    - The main difference between the two lies in the timing: pre-pruning stops tree growth during training, whereas post-pruning simplifies the tree after it is fully grown.
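The two approaches can be compared side by side. This is a minimal sketch, assuming scikit-learn, whose built-in form of post-pruning is minimal cost-complexity pruning (the `ccp_alpha` parameter); other post-pruning schemes exist. The 70/30 split and the specific limits are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Pre-pruning: constrain growth while the tree is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_split=4, random_state=0)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the pruning path of the fully grown tree, fit one
# tree per candidate alpha, and keep the one that scores best on validation
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas.clip(min=0.0)  # guard against tiny negative round-off
candidates = [
    DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    for alpha in alphas
]
post_pruned = max(candidates, key=lambda model: model.score(X_val, y_val))

print("pre-pruned : depth", pre_pruned.get_depth(), "val acc", pre_pruned.score(X_val, y_val))
print("post-pruned: depth", post_pruned.get_depth(), "val acc", post_pruned.score(X_val, y_val))
```

Note the cost difference: pre-pruning fits a single small tree, while this post-pruning loop fits one tree per candidate alpha.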




4. What is Information Gain in Decision Trees, and why is it important for
choosing the best split?
     - Information Gain is a metric used in decision trees to measure the effectiveness of an attribute in classifying the training data.
     - It quantifies the reduction in entropy, or uncertainty, after a dataset is split on a particular feature. In simpler terms, it tells us how much "information" a feature gives us about the class labels. The higher the information gain, the more effectively that feature separates the data into distinct classes. During the tree-building process, the algorithm evaluates all possible splits and selects the one with the highest information gain as the best split.
     - This is important because it ensures that each decision in the tree contributes maximally to reducing disorder and helps in building a more accurate and efficient model.
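Information Gain is the parent node's entropy minus the size-weighted entropy of its child nodes. The sketch below works this out by hand for two hypothetical candidate splits; the function names and the data are illustrative, not taken from a library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5

# Hypothetical feature A separates the classes perfectly
split_a = [["yes"] * 5, ["no"] * 5]
# Hypothetical feature B leaves both children as mixed as the parent
split_b = [["yes", "yes", "yes", "no", "no", "no"], ["yes", "yes", "no", "no"]]

print("gain for A:", round(information_gain(parent, split_a), 3))
print("gain for B:", round(information_gain(parent, split_b), 3))
```

Feature A removes all of the parent's uncertainty (a gain of 1 bit), while feature B removes none, so the algorithm would split on A.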

5. What are some common real-world applications of Decision Trees, and
what are their main advantages and limitations?
    - Decision Trees are widely used in real-world applications due to their simplicity and interpretability. Common applications include medical diagnosis, where trees help predict diseases based on symptoms; credit scoring and loan approval, where financial institutions use them to assess the risk level of applicants; fraud detection, where patterns of fraudulent activity are identified; customer segmentation in marketing; and predictive maintenance in industrial systems.
    - The main advantages of decision trees are that they are easy to understand and visualize, require little data preprocessing (no feature scaling is needed), and can handle both numerical and categorical data, although some implementations, such as scikit-learn's, require categorical features to be encoded as numbers first. However, they also have limitations. Decision trees are prone to overfitting, especially on complex datasets, and small changes in the data can produce a completely different tree, which makes them unstable. They may also struggle with imbalanced datasets, and their split criteria tend to favor features with many distinct values. Despite these limitations, decision trees remain a powerful and popular tool in machine learning and data analysis.



6. Write a Python program to:
     - Load the Iris Dataset
     - Train a Decision Tree Classifier using the Gini criterion
     - Print the model’s accuracy and feature importances


In [3]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the Decision Tree Classifier using Gini criterion
clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

# Predict on test data
y_pred = clf.predict(X_test)

# Print the model’s accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(feature_names, clf.feature_importances_):
    print(f"{feature}: {importance:.4f}")


Model Accuracy: 1.00
Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0167
petal length (cm): 0.9061
petal width (cm): 0.0772


7. Write a Python program to:
     - Load the Iris Dataset
     - Train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.

In [4]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree with max_depth=3
tree_limited = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_limited.fit(X_train, y_train)
pred_limited = tree_limited.predict(X_test)
acc_limited = accuracy_score(y_test, pred_limited)

# Train fully-grown Decision Tree (no depth limit)
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train, y_train)
pred_full = tree_full.predict(X_test)
acc_full = accuracy_score(y_test, pred_full)

# Print accuracies
print(f"Accuracy with max_depth=3: {acc_limited:.2f}")
print(f"Accuracy with fully-grown tree: {acc_full:.2f}")

Accuracy with max_depth=3: 1.00
Accuracy with fully-grown tree: 1.00
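
Both trees score 1.00 here, but the test set holds only 30 samples, so a single split cannot really separate the two models. As a rough cross-check, the sketch below (an addition to what the question asks for, assuming scikit-learn) compares them with 5-fold cross-validation on the full dataset.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Evaluate each tree on 5 different train/test partitions instead of one
results = {}
for label, depth in [("max_depth=3", 3), ("fully grown", None)]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X, y, cv=5)
    results[label] = scores
    print(f"{label}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

If the two means are close, the shallower tree is usually the better choice because it is simpler and easier to interpret.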


8. Write a Python program to:
- Load the Boston Housing Dataset
- Train a Decision Tree Regressor.
- Print the Mean Squared Error (MSE) and feature importances


In [5]:
# Import necessary libraries
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

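# Load the California Housing dataset as a substitute for Boston Housing
# (load_boston was removed from scikit-learn in version 1.2)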
data = fetch_california_housing()
X = data.data
y = data.target
feature_names = data.feature_names

# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Decision Tree Regressor
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)

# Predict and calculate Mean Squared Error
y_pred = regressor.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Print results
print(f"Mean Squared Error (MSE): {mse:.2f}")
print("\nFeature Importances:")
for name, importance in zip(feature_names, regressor.feature_importances_):
    print(f"{name}: {importance:.4f}")


Mean Squared Error (MSE): 0.50

Feature Importances:
MedInc: 0.5285
HouseAge: 0.0519
AveRooms: 0.0530
AveBedrms: 0.0287
Population: 0.0305
AveOccup: 0.1308
Latitude: 0.0937
Longitude: 0.0829


9. Write a Python program to:
- Load the Iris Dataset
- Tune the Decision Tree’s max_depth and min_samples_split using
GridSearchCV
- Print the best parameters and the resulting model accuracy

In [6]:
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define parameter grid for tuning
param_grid = {
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 4, 6, 8]
}

# Initialize Decision Tree Classifier
dt = DecisionTreeClassifier(random_state=42)

# Use GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best model and predict
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", grid_search.best_params_)
print(f"Model Accuracy on Test Set: {accuracy:.2f}")


Best Parameters: {'max_depth': 4, 'min_samples_split': 2}
Model Accuracy on Test Set: 1.00


10. Imagine you’re working as a data scientist for a healthcare company that
wants to predict whether a patient has a certain disease. You have a large dataset with
mixed data types and some missing values.
Explain the step-by-step process you would follow to:
- Handle the missing values
- Encode the categorical features
- Train a Decision Tree model
- Tune its hyperparameters
- Evaluate its performance
And describe what business value this model could provide in the real-world
setting.

Ans:
- To build a predictive model for disease detection, the first step is to handle missing values. For numerical features, I would use imputation techniques such as filling with the mean or median, while for categorical features, I would fill missing values with the mode or use a placeholder like "Unknown".
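- The next step is to encode categorical variables so that they can be used by the machine learning model. For Decision Trees, integer (ordinal) encoding is often acceptable because splits are threshold-based, but it imposes an artificial order on the categories, so one-hot encoding is the safer choice for nominal categories.
- After defining the preprocessing, I would split the dataset into training and testing sets and train a Decision Tree Classifier on the processed data. The imputation values and encoders should be fitted on the training set only and then applied to the test set, so that no information leaks from the test data.
- To optimize performance, I would apply GridSearchCV or RandomizedSearchCV to tune hyperparameters such as max_depth, min_samples_split, and criterion.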
- Once the model is tuned, I would evaluate its performance using metrics like accuracy, precision, recall, F1-score, and ROC-AUC, especially considering the medical context where false negatives can be critical.
- Finally, the business value of this model lies in its ability to assist doctors in early disease detection, support faster decision-making, reduce manual errors, and ultimately improve patient outcomes while optimizing resource allocation in the healthcare system.
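
The workflow above can be sketched end to end with a scikit-learn pipeline. This is only an illustration under stated assumptions: the patient data is synthetic, every column name is hypothetical, and tuning for recall with `class_weight="balanced"` is one possible way to reflect the cost of false negatives, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for patient records (all column names are hypothetical)
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
y = ((df["age"] > 55) & (df["smoker"] == "yes")).astype(int)
df.loc[rng.choice(n, 40, replace=False), "blood_pressure"] = np.nan  # missing numeric
df.loc[rng.choice(n, 40, replace=False), "smoker"] = np.nan          # missing categorical

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.25, stratify=y, random_state=0
)

# Impute and encode inside the pipeline so both are fitted on training folds only
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(class_weight="balanced", random_state=0)),
])

# Tune the tree's hyperparameters, scoring on recall (assumed priority)
search = GridSearchCV(
    model,
    param_grid={
        "tree__max_depth": [2, 4, 6, None],
        "tree__min_samples_split": [2, 10, 20],
        "tree__criterion": ["gini", "entropy"],
    },
    scoring="recall",
    cv=5,
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test)))
```

Keeping the imputers and encoder inside the pipeline means GridSearchCV refits them on each training fold, which avoids leaking information from the validation folds into preprocessing.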

