
### Assignment Questions:

#### Q1. Preprocess the dataset by handling missing values, encoding categorical variables, and scaling the numerical features if necessary.

**Answer:**
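- Handle missing values: Impute them (mean or median for numerical features, mode for categorical ones), or drop rows/columns in which too many values are missing.
- Encode categorical variables: Use one-hot encoding for nominal variables such as `chest pain type`; a binary variable such as `sex` can simply be label-encoded as 0/1.
- Scale numerical features: Use standardization (Z-score normalization) or min-max scaling for features like `age`, `resting blood pressure`, `serum cholesterol`, and `maximum heart rate`. Random forests split on thresholds and are insensitive to feature scale, so this step is optional for the model trained here; it matters mainly if scale-sensitive models are compared on the same data.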
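
The three steps above can be combined into a single scikit-learn transformer. The following is a minimal sketch, not the assignment's required solution: the toy `DataFrame` and its column names (`age`, `chol`, `sex`, `cp`) are hypothetical stand-ins for the real heart disease data, and median/mode imputation is just one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the dataset; values and column names are made up for illustration.
df = pd.DataFrame({
    "age": [63, 37, np.nan, 56, 57, 44],
    "chol": [233, 250, 204, np.nan, 354, 263],
    "sex": ["M", "M", "F", "M", np.nan, "F"],
    "cp": ["typical", "atypical", "non-anginal", "typical", "atypical", "typical"],
})
numeric_cols = ["age", "chol"]
categorical_cols = ["sex", "cp"]

# Numerical features: fill gaps with the median, then standardize.
numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical features: fill gaps with the most frequent value, then one-hot encode.
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X_prepared = preprocessor.fit_transform(df)
print(X_prepared.shape)  # 2 scaled columns + 2 (sex) + 3 (cp) one-hot columns
```

On real data, it is usually safer to call `fit` on the training split only and then `transform` the test split, so that imputation and scaling statistics do not leak from the test set.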

#### Q2. Split the dataset into a training set (70%) and a test set (30%).

**Answer:**
- Use `train_test_split` from scikit-learn to divide the dataset, with 70% data for training and 30% for testing.

```python
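from sklearn.model_selection import train_test_split
# X = feature matrix, y = target labels; stratify=y preserves the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)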
```

#### Q3. Train a random forest classifier on the training set using 100 trees and a maximum depth of 10 for each tree. Use the default values for other hyperparameters.

**Answer:**
- Use the `RandomForestClassifier` from scikit-learn to train the model with 100 trees and a maximum depth of 10.

```python
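from sklearn.ensemble import RandomForestClassifier
# 100 trees, each limited to depth 10; all other hyperparameters stay at their defaults.
# random_state is fixed only to make the run reproducible.
clf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf.fit(X_train, y_train)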
```

#### Q4. Evaluate the performance of the model on the test set using accuracy, precision, recall, and F1 score.

**Answer:**
- Use metrics from scikit-learn to evaluate the model.

```python
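from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predict on the held-out test set
y_pred = clf.predict(X_test)

# The default settings assume a binary target whose positive class is labelled 1
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}  Precision: {precision:.3f}  Recall: {recall:.3f}  F1: {f1:.3f}")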
```
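
To make the four metrics concrete, here is a small hand-checkable sketch. The labels are invented for illustration and have nothing to do with the heart disease data. With TP, FP, FN, and TN taken from the confusion matrix, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels: 1 = disease, 0 = no disease
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 1]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)                    # 2 2 1 3

print(accuracy_score(y_true, y_hat))     # (TP + TN) / total = 5 / 8 = 0.625
print(precision_score(y_true, y_hat))    # TP / (TP + FP)   = 3 / 5 = 0.6
print(recall_score(y_true, y_hat))       # TP / (TP + FN)   = 3 / 4 = 0.75
print(round(f1_score(y_true, y_hat), 3)) # 2PR / (P + R)    = 0.667
```

In a screening setting, recall is often the metric to watch, since a false negative corresponds to a missed case.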

#### Q5. Use the feature importance scores to identify the top 5 most important features in predicting heart disease risk. Visualize the feature importances using a bar chart.

**Answer:**
- Extract feature importances from the random forest model and visualize them using matplotlib.

```python
import matplotlib.pyplot as plt
import numpy as np

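feature_importances = clf.feature_importances_
# argsort is ascending, so the last five indices are the top 5 (largest last)
indices = np.argsort(feature_importances)[-5:]

# barh draws the first entry at the bottom, so the most important feature ends up on top
plt.barh(range(len(indices)), feature_importances[indices], color='b', align='center')
# X.columns assumes X is a pandas DataFrame
plt.yticks(range(len(indices)), [X.columns[i] for i in indices])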
plt.xlabel('Feature Importance')
plt.title('Top 5 Important Features')
plt.show()
```

#### Q6. Tune the hyperparameters of the random forest classifier using grid search or random search. Try different values of the number of trees, maximum depth, minimum samples split, and minimum samples leaf. Use 5-fold cross-validation to evaluate the performance of each set of hyperparameters.

**Answer:**
- Perform hyperparameter tuning using `GridSearchCV` or `RandomizedSearchCV` with 5-fold cross-validation.

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

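# 3 * 3 * 3 * 3 = 81 combinations, each scored with 5-fold CV (405 fits plus a final refit)
# n_jobs=-1 runs the fits in parallel on all available cores
grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)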
```
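
Since the question also allows random search, the sketch below shows how `RandomizedSearchCV` could be used instead. It samples a fixed number of configurations rather than trying every combination. The synthetic data from `make_classification`, the parameter ranges, `n_iter=5`, and `scoring="f1"` are all assumptions chosen to keep the example small and self-contained; in the assignment, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training data (an assumption, not the real dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

# Distributions to sample from; randint(a, b) draws integers in [a, b)
param_distributions = {
    "n_estimators": randint(50, 201),
    "max_depth": [5, 10, 20, None],
    "min_samples_split": randint(2, 11),
    "min_samples_leaf": randint(1, 5),
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled configurations
    cv=5,              # 5-fold cross-validation, as the question requires
    scoring="f1",
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

Random search costs `n_iter * cv` fits regardless of how many hyperparameters are varied, which makes it a reasonable first pass when the full grid is too expensive.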

#### Q7. Report the best set of hyperparameters found by the search and the corresponding performance metrics. Compare the performance of the tuned model with the default model.

**Answer:**
- Retrieve the best hyperparameters and compare the tuned model's performance with the default one.

```python
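from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

best_params = grid_search.best_params_
tuned_model = grid_search.best_estimator_  # already refit on the full training set
print("Best hyperparameters:", best_params)

# Compare the default model from Q3 (clf) and the tuned model on the same test set
for name, model in [("default", clf), ("tuned", tuned_model)]:
    pred = model.predict(X_test)
    print(name, accuracy_score(y_test, pred), precision_score(y_test, pred),
          recall_score(y_test, pred), f1_score(y_test, pred))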
```

#### Q8. Interpret the model by analyzing the decision boundaries of the random forest classifier. Plot the decision boundaries on a scatter plot of two of the most important features. Discuss the insights and limitations of the model for predicting heart disease risk.

**Answer:**
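- Use two of the most important features to plot the decision boundaries. The model from Q3 was trained on all features, so it cannot predict on a two-column grid; a separate random forest is fitted on just the two selected features for the plot. Because a random forest aggregates many trees, its decision boundary is harder to interpret than that of a single decision tree.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.ensemble import RandomForestClassifier

# Pick the two most important features (assumes X_train/X_test are DataFrames)
top2 = X_train.columns[np.argsort(clf.feature_importances_)[-2:]]

# Fit a separate 2-feature model for plotting, with the same settings as in Q3
clf_2d = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
clf_2d.fit(X_train[top2].to_numpy(), y_train)

X1, X2 = X_test[top2[0]].to_numpy(), X_test[top2[1]].to_numpy()

# Evaluate the model on a 300 x 300 grid spanning both features
# (a fixed 0.01 step would create an enormous grid for unscaled features)
xx, yy = np.meshgrid(np.linspace(X1.min() - 1, X1.max() + 1, 300),
                     np.linspace(X2.min() - 1, X2.max() + 1, 300))
Z = clf_2d.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Decision boundary plot (assumes numeric 0/1 class labels)
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap=ListedColormap(('red', 'green')))
plt.scatter(X1, X2, c=y_test, s=20, edgecolor='k')
plt.xlabel(top2[0])
plt.ylabel(top2[1])
plt.title("Decision Boundaries of Random Forest Classifier")
plt.show()
```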

### Insights:
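- **Model Interpretation:** Random forests report feature importances, but their decision boundary is the aggregate of many trees' axis-aligned splits, so it cannot be read off as a single set of rules. A two-feature plot shows only a slice of the model's behavior and ignores the remaining features.
- **Limitations:** Test-set performance may not carry over to patients from other populations, and on a small dataset both the metrics and the feature importances can vary noticeably from one split to another. Feature importances also indicate association with the target, not a causal effect on heart disease risk.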
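
One way to get a feel for the generalizability concern is to look at how much the score varies across cross-validation folds. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for the real dataset; the sample size, feature count, and `scoring="f1"` are likewise illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Small synthetic stand-in for the heart disease data (an assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

# One F1 score per fold; the spread hints at how stable the estimate is
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")

print("F1 per fold:", scores.round(3))
print("mean = %.3f, std = %.3f" % (scores.mean(), scores.std()))
```

A large standard deviation relative to the gap between the default and tuned models would suggest that the difference between them is within the noise.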