
# Assignment Questions: Decision Tree-2

### Q1. Import the dataset and examine the variables. Use descriptive statistics and visualizations to understand the distribution and relationships between the variables.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Descriptive statistics
print(data.describe())

# Visualizations
plt.figure(figsize=(10, 6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation between Variables')
plt.show()

# Pairplot to visualize the relationships
sns.pairplot(data, hue='Outcome')
plt.show()
```

---

### Q2. Preprocess the data by cleaning missing values, removing outliers, and transforming categorical variables into dummy variables if necessary.

```python
# Check for missing values
print(data.isnull().sum())

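# If this is the Pima Indians diabetes data, missing values are coded as 0 rather
# than NaN, so isnull() reports none. Assuming that holds here, treat 0 as missing
# in 'SkinThickness' and 'Insulin' and impute with the column median.
for col in ['SkinThickness', 'Insulin']:
    data[col] = data[col].mask(data[col] == 0)
    data[col] = data[col].fillna(data[col].median())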

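# Remove outliers using the IQR method (features only; the binary target is excluded)
features = data.columns.drop('Outcome')
Q1 = data[features].quantile(0.25)
Q3 = data[features].quantile(0.75)
IQR = Q3 - Q1
outlier_rows = ((data[features] < (Q1 - 1.5 * IQR)) | (data[features] > (Q3 + 1.5 * IQR))).any(axis=1)
data = data[~outlier_rows]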

# No categorical variables, so no need to create dummy variables
```

---

### Q3. Split the dataset into a training set and a test set. Use a random seed to ensure reproducibility.

```python
from sklearn.model_selection import train_test_split

# Split data into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Split into train and test sets (stratified so both sets keep the same class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
```

---

### Q4. Use a decision tree algorithm, such as ID3 or C4.5, to train a decision tree model on the training set. Use cross-validation to optimize the hyperparameters and avoid overfitting.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Initialize the model
dt = DecisionTreeClassifier(random_state=42)

# Set up hyperparameters for cross-validation
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10]
}

# Cross-validation
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best model
best_dt = grid_search.best_estimator_
print("Best Parameters: ", grid_search.best_params_)
```
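
scikit-learn's `DecisionTreeClassifier` implements an optimized version of CART rather than ID3 or C4.5, so `criterion='entropy'` is the closest available match to their information-gain splits. As a minimal sketch of what that criterion measures, the helper functions below (hypothetical names, not part of scikit-learn) compute entropy and the information gain of a single threshold split on made-up toy labels.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probs = counts / counts.sum()
    return float(-np.sum(probs * np.log2(probs)))

def information_gain(feature, labels, threshold):
    """Entropy reduction from splitting on feature <= threshold."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

# Toy data (made up for illustration): a glucose-like feature and a binary outcome
toy_feature = np.array([85, 90, 100, 140, 150, 160])
toy_labels = np.array([0, 0, 0, 1, 1, 1])

gain_good = information_gain(toy_feature, toy_labels, threshold=120)  # separates the two classes
gain_poor = information_gain(toy_feature, toy_labels, threshold=87)   # isolates a single sample
print(f"Gain at 120: {gain_good:.3f}, gain at 87: {gain_poor:.3f}")
```

A tree grown with the entropy criterion picks, at each node, the feature and threshold with the largest such gain among the candidates it evaluates.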

---

### Q5. Evaluate the performance of the decision tree model on the test set using metrics such as accuracy, precision, recall, and F1 score. Use confusion matrices and ROC curves to visualize the results.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, roc_curve, auc,
)
import matplotlib.pyplot as plt

# Predictions
y_pred = best_dt.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, best_dt.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
```
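
To make the metric definitions concrete, the sketch below recomputes accuracy, precision, recall, and F1 by hand from a confusion matrix. The labels are made up for illustration and are not results from the diabetes model. For a binary problem, scikit-learn's `confusion_matrix` puts true classes in rows and predicted classes in columns, so `ravel()` yields TN, FP, FN, TP in that order.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical labels, chosen only to illustrate the arithmetic
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Unpack the 2x2 matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are correct
precision_manual = tp / (tp + fp)                   # share of predicted positives that are real
recall_manual = tp / (tp + fn)                      # share of real positives that were found
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy={accuracy_manual:.2f}, Precision={precision_manual:.2f}, "
      f"Recall={recall_manual:.2f}, F1={f1_manual:.2f}")
```

In a screening setting such as diabetes prediction, recall is often the metric to watch, since a false negative (a missed case) is usually costlier than a false positive.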

---

### Q6. Interpret the decision tree by examining the splits, branches, and leaves. Identify the most important variables and their thresholds. Use domain knowledge and common sense to explain the patterns and trends.

```python
from sklearn.tree import plot_tree

# Plot the tree
plt.figure(figsize=(12, 8))
# Pass feature names as a list; some scikit-learn versions reject a pandas Index here
plot_tree(best_dt, feature_names=list(X.columns), class_names=['Non-Diabetic', 'Diabetic'], filled=True)
plt.show()

# Feature importance
importances = best_dt.feature_importances_
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': importances})
print(feature_importance.sort_values(by='Importance', ascending=False))

# Interpretation:
# Confirm each claim against the plotted tree and the importances printed above.
# If the output matches expectations, glucose level is the most important predictor
# of diabetes, followed by age and BMI. The root split is then likely on glucose,
# since that split gives the largest information gain (impurity reduction).
```
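
The plotted tree can be hard to read once it grows past a few levels. As a complementary sketch, `export_text` from `sklearn.tree` prints the same splits and thresholds as indented text rules. The example runs on synthetic stand-in data with placeholder feature names (`f0` to `f3`), so its thresholds say nothing about the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data; replace with X_train, y_train from the steps above
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=42
)
demo_names = ['f0', 'f1', 'f2', 'f3']  # placeholder feature names

# A shallow tree keeps the printed rules short
demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Each line shows a split condition (feature <= threshold) or a leaf's predicted class
rules = export_text(demo_tree, feature_names=demo_names)
print(rules)
```

For the fitted model in this assignment, the equivalent call would be `export_text(best_dt, feature_names=list(X.columns))`.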

---

### Q7. Validate the decision tree model by applying it to new data or testing its robustness to changes in the dataset or the environment. Use sensitivity analysis and scenario testing to explore the uncertainty and risks.

```python
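# Sensitivity analysis: perturb the test set with synthetic noise and see how the model reacts.
import numpy as np

# Add Gaussian noise scaled to each feature's standard deviation, since a fixed
# absolute noise level is negligible for large-valued features such as Glucose.
# The 5% level is an arbitrary choice; the generator is seeded for reproducibility.
rng = np.random.default_rng(42)
X_test_noisy = X_test + rng.normal(0, 1, X_test.shape) * 0.05 * X_test.std().values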

# Predict and evaluate on noisy data
y_pred_noisy = best_dt.predict(X_test_noisy)
accuracy_noisy = accuracy_score(y_test, y_pred_noisy)

print(f"Accuracy with noisy data: {accuracy_noisy}")
```
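
A single noisy test set gives only one number. A complementary check on robustness to changes in the dataset is to refit the model on many different train/test splits and look at the spread of the scores. The sketch below does this on synthetic stand-in data; the dataset, the `max_depth=5` setting, and the 20 splits are illustrative assumptions, not values taken from the analysis above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with X and y from the steps above
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Placeholder hyperparameters; in practice reuse grid_search.best_params_
model = DecisionTreeClassifier(max_depth=5, random_state=42)

# 20 random stratified 70/30 splits, each refitting the tree from scratch
splitter = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)
split_scores = cross_val_score(model, X_demo, y_demo, cv=splitter, scoring='accuracy')

print(f"Accuracy over {len(split_scores)} splits: "
      f"mean={split_scores.mean():.3f}, std={split_scores.std():.3f}")
```

A large standard deviation relative to the mean would suggest that the tree's performance depends heavily on which rows land in the training set, which is a common weakness of unpruned decision trees.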

---
