### Question 1

**What is a Decision Tree, and how does it work in the context of classification?**

**Answer:**

A Decision Tree is a supervised machine learning model used for both classification and regression. In classification, it recursively partitions the dataset on feature values, producing a tree in which each internal node tests a feature, each branch corresponds to one outcome of that test, and each leaf assigns a class label. At every node the algorithm picks the split that most reduces an impurity measure such as Gini impurity or entropy. A new sample is classified by following the tests from the root down to a leaf.
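
To make the node/branch/leaf structure concrete, here is a minimal sketch that fits a shallow tree on Iris and prints its learned rules with scikit-learn's `export_text`. The depth limit and `random_state` are arbitrary choices made only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short enough to read
# (max_depth=2 and random_state=0 are illustrative assumptions).
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each indented line is a test on a feature (a node and its branch);
# lines containing "class:" are leaves.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the printout from top to bottom mirrors how a sample is classified: apply each test in turn until a `class:` line is reached.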

### Question 2

**Explain the concepts of Gini Impurity and Entropy as impurity measures. How do they impact the splits in a Decision Tree?**

**Answer:**

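Gini Impurity and Entropy are impurity measures used to evaluate candidate splits in Decision Trees. Both are computed from the class proportions `p_i` in a node.  
- **Gini Impurity** (`1 - Σ p_i²`) is the probability that a randomly chosen element would be misclassified if it were labeled at random according to the node's class distribution.  
- **Entropy** (`-Σ p_i log2(p_i)`) measures the uncertainty or disorder of the class distribution in the node.  

Both are 0 for a pure node and largest when the classes are evenly mixed. At each node the algorithm selects the split that gives the greatest reduction in weighted impurity across the child nodes. In practice the two measures usually choose similar splits; Gini is slightly cheaper to compute because it avoids the logarithm.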
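
The two measures can be sketched in a few lines of plain Python. The helper names (`class_proportions`, `gini`, `entropy`) and the toy label lists are illustrative assumptions, not part of any library.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


pure = ["a", "a", "a", "a"]    # one class only: no impurity
mixed = ["a", "a", "b", "b"]   # 50/50 mix: maximum impurity for two classes

print("pure :", gini(pure), entropy(pure))
print("mixed:", gini(mixed), entropy(mixed))
```

For two classes, Gini impurity peaks at 0.5 and entropy at 1 bit, both at a 50/50 mix.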

### Question 3

**What is the difference between Pre-Pruning and Post-Pruning in Decision Trees? Give one practical advantage of using each.**

**Answer:**

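**Pre-Pruning** (early stopping) halts tree growth while the tree is being built, using limits such as maximum depth, minimum samples per split, or minimum samples per leaf.  
**Post-Pruning** first grows the full tree and then removes branches that contribute little, for example with cost-complexity pruning.  
- Advantage of Pre-Pruning: Saves computation time, because the over-complex tree is never built.  
- Advantage of Post-Pruning: Decides what to remove after seeing the whole tree, so it can keep splits that only pay off deeper down and that an early stopping rule might cut off.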
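
As a rough sketch of both strategies in scikit-learn: pre-pruning is expressed through constructor limits such as `max_depth`, while post-pruning is available as minimal cost-complexity pruning via `ccp_alpha`. The specific values below (`max_depth=3`, `min_samples_split=4`, `ccp_alpha=0.02`) are arbitrary illustrative assumptions, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Reference: an unrestricted, fully grown tree.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: growth stops as soon as a limit is hit.
pre = DecisionTreeClassifier(
    max_depth=3, min_samples_split=4, random_state=0
).fit(X_train, y_train)

# Post-pruning: the tree is grown fully, then weak branches are collapsed.
# A larger ccp_alpha prunes more aggressively.
post = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", pre), ("post-pruned", post)]:
    print(
        f"{name:12s} depth={model.get_depth()} "
        f"leaves={model.get_n_leaves()} "
        f"test accuracy={model.score(X_test, y_test):.3f}"
    )
```

In practice `ccp_alpha` is usually chosen by cross-validation over the candidates returned by `cost_complexity_pruning_path`, rather than fixed by hand.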

### Question 4

**What is Information Gain in Decision Trees, and why is it important for choosing the best split?**

**Answer:**

Information Gain measures the decrease in entropy after a dataset is split on a feature.  
It is calculated as the entropy of the parent node minus the weighted average of the child nodes' entropies, where each child is weighted by the fraction of the parent's samples it receives.  
The split with the highest information gain is preferred because it produces the purest child nodes, that is, it separates the classes best.
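
The calculation can be worked through on a toy example. The function names and the "yes"/"no" labels below are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect_split = [["yes"] * 5, ["no"] * 5]

# A split whose children are just as mixed as the parent.
useless_split = [["yes", "yes", "no", "no"], ["yes"] * 3 + ["no"] * 3]

print("perfect split gain:", information_gain(parent, perfect_split))
print("useless split gain:", information_gain(parent, useless_split))
```

The perfect split recovers the full 1 bit of parent entropy, while the useless split gains essentially nothing, so a tree builder would choose the former.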

### Question 5

**What are some common real-world applications of Decision Trees, and what are their main advantages and limitations?**

**Answer:**

**Common applications** include:  
- Medical diagnosis  
- Credit risk assessment  
- Customer segmentation  

**Advantages**: Easy to understand and visualize, handles both numerical and categorical data, and needs no feature scaling.  
**Limitations**: Prone to overfitting, sensitive to small changes in the training data, and often less accurate than ensemble methods such as Random Forest.

### Question 6

**Imagine you’re working as a data scientist for a healthcare company that wants to predict whether a patient has a certain disease. You have a large dataset with mixed data types and some missing values. Explain the step-by-step process you would follow to handle the missing values, encode the categorical features, train a Decision Tree model, tune its hyperparameters, evaluate its performance, and describe what business value this model could provide in the real-world setting.**

**Answer:**

1. **Handle missing values**:  
   - Use `SimpleImputer` or drop rows/columns depending on missingness.  
2. **Encode categorical features**:  
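   - Use `OneHotEncoder` for nominal features or `OrdinalEncoder` for ordered ones (`LabelEncoder` is intended for the target labels, not for input features).  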
3. **Train model**:  
   - Use `DecisionTreeClassifier` on processed data.  
4. **Tune hyperparameters**:  
   - Use `GridSearchCV` or `RandomizedSearchCV`.  
5. **Evaluate performance**:  
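   - Report accuracy, precision, recall, and F1-score on a held-out test set. For disease detection, recall deserves particular attention because a missed case is usually costlier than a false alarm.  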

**Business Value**:  
   - Enables early disease detection, reduces healthcare costs, assists in patient triage, and supports clinical decision-making.
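
The steps above can be sketched as one scikit-learn pipeline. Everything about the data here is an assumption: the patient table is synthetic, the column names (`age`, `blood_pressure`, `smoker`) and the rule that generates the label are made up, and optimizing for recall is one possible choice rather than a requirement.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic patient data (not a real dataset).
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.integers(20, 80, n).astype(float),
    "blood_pressure": rng.normal(120, 15, n),
    "smoker": rng.choice(["yes", "no"], n),
})
df["smoker"] = df["smoker"].astype(object)
y = ((df["age"] > 50) & (df["smoker"] == "yes")).astype(int)

# Inject missing values after the label has been computed.
df.loc[rng.choice(n, 30, replace=False), "age"] = np.nan
df.loc[rng.choice(n, 30, replace=False), "smoker"] = np.nan

# Steps 1-2: impute and encode, per column type.
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age", "blood_pressure"]),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), ["smoker"]),
])

# Step 3: preprocessing and model in one pipeline, so imputation
# statistics are learned from training folds only.
model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=42
)

# Step 4: tune hyperparameters with cross-validation.
param_grid = {"tree__max_depth": [2, 3, 5], "tree__min_samples_leaf": [1, 5, 10]}
grid = GridSearchCV(model, param_grid, cv=5, scoring="recall")
grid.fit(X_train, y_train)

# Step 5: evaluate once on the held-out test set.
y_pred = grid.predict(X_test)
print("Best parameters:", grid.best_params_)
print(classification_report(y_test, y_pred))
```

Wrapping the imputer and encoder in the pipeline means `GridSearchCV` refits them inside every fold, which avoids leaking information from validation data into preprocessing.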

### Question 6

**Load the Iris Dataset, train a Decision Tree Classifier using the Gini criterion, print the model’s accuracy and feature importances.**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris and hold out 30% of the samples for testing
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Train a tree that uses Gini impurity to choose splits
clf = DecisionTreeClassifier(criterion="gini")
clf.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Feature Importances:", clf.feature_importances_)
```

**Output:**

```
Accuracy: 1.0
Feature Importances: [0.01911002 0.01911002 0.53816374 0.42361622]
```


### Question 7

**Load the Iris Dataset, train a Decision Tree Classifier with max_depth=3 and compare its accuracy to a fully-grown tree.**

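```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load Iris with the same 70/30 split used in the previous question
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3, random_state=42)

# Fully grown tree: no depth limit
clf1 = DecisionTreeClassifier()
clf1.fit(X_train, y_train)
full_accuracy = accuracy_score(y_test, clf1.predict(X_test))

# Tree limited to three levels
clf2 = DecisionTreeClassifier(max_depth=3)
clf2.fit(X_train, y_train)
limited_accuracy = accuracy_score(y_test, clf2.predict(X_test))
print("Full Tree Accuracy:", full_accuracy)
print("Max Depth=3 Accuracy:", limited_accuracy)
```

**Output:**

```
Full Tree Accuracy: 1.0
Max Depth=3 Accuracy: 1.0
```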


### Question 8

**Load the Boston Housing Dataset, train a Decision Tree Regressor, and print the Mean Squared Error (MSE) and feature importances.**

**Note:** `load_boston` was removed in scikit-learn 1.2, so the California Housing dataset (`fetch_california_housing`) is used here instead.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load California housing dataset
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, test_size=0.3, random_state=42)

# Train Decision Tree Regressor
reg = DecisionTreeRegressor()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

# Print MSE and feature importances
print("MSE:", mean_squared_error(y_test, y_pred))
print("Feature Importances:", reg.feature_importances_)
```

**Output:**

```
MSE: 0.5369178160347868
Feature Importances: [0.52256426 0.05331831 0.05044119 0.02578106 0.03138394 0.13862404
 0.09027145 0.08761574]
```


### Question 9

**Load the Iris Dataset, tune the Decision Tree’s max_depth and min_samples_split using GridSearchCV, and print the best parameters and resulting accuracy.**

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Load Iris Dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# Set hyperparameter grid
params = {
    'max_depth': [2, 3, 4, 5],
    'min_samples_split': [2, 4, 6]
}

# Grid search with 3-fold CV
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=3)
grid.fit(X_train, y_train)

# Print best parameters and accuracy.
# Note: best_score_ is the mean cross-validated accuracy on the training
# data for the best parameter combination, not accuracy on the test set.
print("Best Parameters:", grid.best_params_)
print("Best Accuracy:", grid.best_score_)
```

**Output:**

```
Best Parameters: {'max_depth': 2, 'min_samples_split': 4}
Best Accuracy: 0.9238095238095237
```
