# Lab 5: Decision Trees & Ensembles
**Introduction**

>Diabetes mellitus, often referred to as diabetes, is a prevalent chronic metabolic disorder that affects millions of individuals worldwide. It is characterized by elevated blood glucose levels, which can lead to various complications if not effectively managed. Given the growing incidence of diabetes and its associated health risks, accurate prediction and diagnosis are crucial for timely intervention and treatment.

**Objective:**

>- The primary objective of this lab is to develop and evaluate tree-based machine learning models that predict whether an individual has diabetes. The dataset contains eight clinical and demographic features (such as glucose level, BMI, and age) and a binary `Outcome` label.

## **Methods & Procedures:**

**Data Preprocessing:**
>- Handling missing values
>- Dealing with data imbalance
>- Feature scaling and encoding

**Model Training:**

>- Utilizing various tree-based machine learning algorithms
>- Fine-tuning model hyperparameters for optimal performance

**Evaluation Measures:**
>- Using accuracy, precision, recall, F1-score, and the confusion matrix
>- Assessing model generalization and robustness

**Feature Selection:**
>- Identifying the most important features for classification

**Results:**
>- The study provides a comprehensive analysis of the models' performance, including comparisons of different tree-based machine learning techniques.

**Model Performance with All Features:**
>- Evaluation of models using the entire feature set to understand their baseline performance.

**Model Performance with Feature Selection:**
>- Assessment of models after feature selection to identify the most influential attributes in diabetes prediction.

**Conclusion:**
>- In conclusion, the research aims to contribute to the field of healthcare and predictive modeling by exploring the potential of tree-based machine learning algorithms for diabetes prediction. By addressing data preprocessing, model training, evaluation, and feature selection, we seek to enhance our understanding of the effectiveness of these techniques in medical decision support. Ultimately, the findings from this study may pave the way for more accurate and efficient diabetes risk assessment, offering valuable insights for both medical professionals and individuals at risk of this condition.

# Data Preprocessing

##### Load the Dataset

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv("/Users/alexanderdelriscomorales/Downloads/AI_ML_Files/diabetes.csv")


>##### Exploratory Data Analysis
>> It's a good practice to perform some initial exploratory data analysis to understand the dataset. Check for missing values, data types, and the class distribution.

In [2]:
# Check for missing values
missing_values = data.isnull().sum()

# Data types
data_types = data.dtypes

# Class distribution
class_distribution = data['Outcome'].value_counts()

class_distribution


Outcome
0    500
1    268
Name: count, dtype: int64

In [3]:
print(missing_values)

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
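
>>A count of zero nulls does not rule out missing data encoded in another way. The column names match the Pima Indians Diabetes dataset, where a zero in columns such as Glucose, BloodPressure, or BMI is generally treated as a placeholder for a missing measurement. The sketch below shows one way to check for this on a small made-up frame. The frame `toy`, the list `suspect_columns`, and the median imputation are assumptions for illustration, not part of the original lab.

```python
import pandas as pd

# Toy stand-in for the lab's `data` frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
    "Outcome": [1, 0, 1, 0],
})

# Assumption: a zero in these columns is a placeholder, not a real measurement.
suspect_columns = ["Glucose", "BloodPressure", "BMI"]

# Count zeros per column.
zero_counts = (toy[suspect_columns] == 0).sum()
print(zero_counts)

# Mask zeros as NaN, then fill each column with its median of the remaining values.
cleaned = toy.copy()
masked = toy[suspect_columns].mask(toy[suspect_columns] == 0)
cleaned[suspect_columns] = masked.fillna(masked.median())
print(cleaned)
```

>>On the real data, the same check would be `(data[suspect_columns] == 0).sum()`. Tree-based models tolerate such placeholder values better than distance-based models, but they still distort the splits.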


>##### Handle Imbalanced Data

In [4]:
from imblearn.over_sampling import RandomOverSampler

# Separate features and target variable
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Initialize the RandomOverSampler
oversampler = RandomOverSampler(sampling_strategy='minority')

# Oversample the minority class until both classes have the same number of rows
X_resampled, y_resampled = oversampler.fit_resample(X, y)


>##### Split Data into Train and Test Sets

In [5]:
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
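
>>Because the oversampling above runs before the split, copies of the same minority row can land in both the training and the test set, which tends to make test scores look better than they would on unseen data. A common alternative is to split first and oversample only the training portion. The sketch below illustrates that order on synthetic data. It uses `sklearn.utils.resample` in place of `RandomOverSampler` to stay within scikit-learn, and the `demo` frame and `row_id` column are hypothetical helpers.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the lab data: 100 rows with a 70/30 class split (made-up values).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=100),
    "BMI": rng.normal(32, 6, size=100),
    "Outcome": [0] * 70 + [1] * 30,
})
demo["row_id"] = range(len(demo))  # hypothetical helper column used to track duplicates

# 1) Split first, stratified so both sets keep the original class ratio.
train_df, test_df = train_test_split(
    demo, test_size=0.2, stratify=demo["Outcome"], random_state=42
)

# 2) Oversample the minority class inside the training set only.
majority = train_df[train_df["Outcome"] == 0]
minority = train_df[train_df["Outcome"] == 1]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
train_balanced = pd.concat([majority, minority_upsampled])

# The test set keeps its natural class ratio and shares no rows with the training set.
overlap = set(train_balanced["row_id"]) & set(test_df["row_id"])
print(train_balanced["Outcome"].value_counts().to_dict(), len(overlap))
```

>>With imbalanced-learn, the equivalent step would be calling `fit_resample` on `X_train` and `y_train` after `train_test_split`, leaving `X_test` untouched.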


>>>Now that the data is preprocessed, let's proceed with building and evaluating the models.

# Model Building

>#### Decision Trees
>>Decision Trees are a simple yet effective classification algorithm. They are easy to interpret and can serve as a baseline model.

In [6]:
from sklearn.tree import DecisionTreeClassifier

# Initialize the Decision Tree Classifier
decision_tree_model = DecisionTreeClassifier(random_state=42)

# Train the Decision Tree Model
decision_tree_model.fit(X_train, y_train)


>#### Random Forests
>>Random Forests are an ensemble learning method that trains many Decision Trees on bootstrap samples of the data and combines their votes. Averaging over many trees usually reduces variance, so a forest tends to generalize better than a single tree.

In [7]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest Classifier
random_forest_model = RandomForestClassifier(random_state=42)

# Train the Random Forest Model
random_forest_model.fit(X_train, y_train)


>#### AdaBoost
>>AdaBoost is another ensemble technique that combines multiple "weak" classifiers to create a strong classifier. It focuses on the samples that are misclassified by the previous classifiers.



In [8]:
from sklearn.ensemble import AdaBoostClassifier

# Initialize the AdaBoost Classifier
adaboost_model = AdaBoostClassifier(random_state=42)

# Train the AdaBoost Model
adaboost_model.fit(X_train, y_train)
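
>>The reweighting idea behind AdaBoost can be shown in a few lines. The sketch below performs one round of the classic discrete AdaBoost update on made-up labels. It is an illustration of the principle only; scikit-learn's `AdaBoostClassifier` uses a multi-class variant and differs in detail. All variable names are hypothetical.

```python
import numpy as np

# Labels and one weak learner's predictions in {-1, +1}; made-up values.
y_true = np.array([1, 1, -1, -1, 1])
y_weak = np.array([1, -1, -1, -1, 1])  # the second sample is misclassified

# Start from uniform sample weights.
weights = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner and its vote weight (alpha).
error = weights[y_true != y_weak].sum()
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weights of misclassified samples, decrease the rest, then renormalize.
weights = weights * np.exp(-alpha * y_true * y_weak)
weights = weights / weights.sum()
print(weights.round(3))
```

>>After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to classify it correctly.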


>#### Gradient Boosting

>>Gradient Boosting builds an ensemble of decision trees sequentially. Each tree corrects the errors of the previous ones, leading to strong predictive performance.

In [9]:
from sklearn.ensemble import GradientBoostingClassifier

# Initialize the Gradient Boosting Classifier
gradient_boosting_model = GradientBoostingClassifier(random_state=42)

# Train the Gradient Boosting Model
gradient_boosting_model.fit(X_train, y_train)
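
>>To make "each tree corrects the errors of the previous ones" concrete, the sketch below hand-rolls boosting for a regression problem with squared error, where the errors are simply the residuals. This is a simplification: `GradientBoostingClassifier` optimizes a classification loss and fits trees to its gradient. The synthetic data, `learning_rate`, tree depth, and number of rounds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional regression problem (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1

# Start from a constant prediction: the mean of the target.
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    # Each new shallow tree is fitted to what the ensemble still gets wrong.
    residuals = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)
    # Add a damped version of the tree's correction to the running prediction.
    prediction += learning_rate * tree.predict(X_demo)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"training MSE: {initial_mse:.4f} -> {final_mse:.4f}")
```

>>The small learning rate means no single tree dominates; the training error shrinks gradually as trees are added.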


# Model Evaluation

>#### Decision Trees Evaluation
>>(a) Confusion Matrix and Classification Report for Decision Trees:

In [10]:
from sklearn.metrics import confusion_matrix, classification_report

# Predict on the test set
y_pred_decision_tree = decision_tree_model.predict(X_test)

# Confusion Matrix
conf_matrix_decision_tree = confusion_matrix(y_test, y_pred_decision_tree)
print("Confusion Matrix (Decision Trees):\n", conf_matrix_decision_tree)

# Classification Report
class_report_decision_tree = classification_report(y_test, y_pred_decision_tree)
print("Classification Report (Decision Trees):\n", class_report_decision_tree)


Confusion Matrix (Decision Trees):
 [[72 27]
 [16 85]]
Classification Report (Decision Trees):
               precision    recall  f1-score   support

           0       0.82      0.73      0.77        99
           1       0.76      0.84      0.80       101

    accuracy                           0.79       200
   macro avg       0.79      0.78      0.78       200
weighted avg       0.79      0.79      0.78       200
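
>>The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes the positive-class metrics from the Decision Tree matrix printed above, assuming scikit-learn's layout of rows as actual classes and columns as predicted classes.

```python
# Decision Tree confusion matrix from the output above: rows = actual, columns = predicted.
conf_matrix = [[72, 27],
               [16, 85]]

tn, fp = conf_matrix[0]  # actual class 0: true negatives, false positives
fn, tp = conf_matrix[1]  # actual class 1: false negatives, true positives

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

>>These values agree with the class 1 row of the report once rounded to two decimals (precision 0.76, recall 0.84, F1 0.80).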



>>(b) Feature Importance for Decision Trees

In [11]:
# Feature Importance
feature_importance_decision_tree = decision_tree_model.feature_importances_
print("Feature Importance (Decision Trees):\n", feature_importance_decision_tree)


Feature Importance (Decision Trees):
 [0.04304415 0.31557356 0.09382122 0.04494198 0.0322795  0.21075117
 0.1122695  0.14731891]
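
>>The raw importance array is easier to read when paired with the feature names. The sketch below labels the Decision Tree values printed above with the dataset's column order (every column except `Outcome`) and sorts them. The names `feature_names`, `importances`, and `ranked` are illustrative.

```python
import pandas as pd

# Column order of X: all dataset columns except Outcome.
feature_names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
                 "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Importances printed above for the Decision Tree.
importances = [0.04304415, 0.31557356, 0.09382122, 0.04494198,
               0.0322795, 0.21075117, 0.1122695, 0.14731891]

# Label each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked.round(3))
```

>>In the notebook itself the same table can be built with `pd.Series(decision_tree_model.feature_importances_, index=X.columns)`. Glucose, BMI, and Age come out on top.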


>#### Random Forests Evaluation
>>(a) Confusion Matrix and Classification Report for Random Forests

In [12]:
# Predict on the test set
y_pred_random_forest = random_forest_model.predict(X_test)

# Confusion Matrix
conf_matrix_random_forest = confusion_matrix(y_test, y_pred_random_forest)
print("Confusion Matrix (Random Forests):\n", conf_matrix_random_forest)

# Classification Report
class_report_random_forest = classification_report(y_test, y_pred_random_forest)
print("Classification Report (Random Forests):\n", class_report_random_forest)


Confusion Matrix (Random Forests):
 [[76 23]
 [12 89]]
Classification Report (Random Forests):
               precision    recall  f1-score   support

           0       0.86      0.77      0.81        99
           1       0.79      0.88      0.84       101

    accuracy                           0.82       200
   macro avg       0.83      0.82      0.82       200
weighted avg       0.83      0.82      0.82       200



>>(b) Feature Importance for Random Forests

In [13]:
# Feature Importance
feature_importance_random_forest = random_forest_model.feature_importances_
print("Feature Importance (Random Forests):\n", feature_importance_random_forest)


Feature Importance (Random Forests):
 [0.07281355 0.26952516 0.08493405 0.06627424 0.0736287  0.17027066
 0.11574126 0.14681238]
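
>>Importances like these can drive the feature selection step listed in the methods. The sketch below shows one possible approach on synthetic data: keep the features whose Random Forest importance is at least the mean importance, then retrain on the reduced set. The use of `SelectFromModel`, the `"mean"` threshold, and the synthetic data are assumptions, not the procedure used in this lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Synthetic stand-in with 8 features, only 3 of which carry signal.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the selector on the training data only, so the test set stays unseen.
selector = SelectFromModel(RandomForestClassifier(random_state=42), threshold="mean")
selector.fit(X_tr, y_tr)
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)

# Retrain on the reduced feature set and score on the held-out data.
reduced_model = RandomForestClassifier(random_state=42).fit(X_tr_sel, y_tr)
reduced_accuracy = reduced_model.score(X_te_sel, y_te)
print(selector.get_support(), round(reduced_accuracy, 3))
```

>>Comparing `reduced_accuracy` with the score of a model trained on all features indicates whether the dropped features were contributing anything.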


>#### AdaBoost Evaluation
>>(a) Confusion Matrix and Classification Report for AdaBoost:

In [14]:
# Predict on the test set
y_pred_adaboost = adaboost_model.predict(X_test)

# Confusion Matrix
conf_matrix_adaboost = confusion_matrix(y_test, y_pred_adaboost)
print("Confusion Matrix (AdaBoost):\n", conf_matrix_adaboost)

# Classification Report
class_report_adaboost = classification_report(y_test, y_pred_adaboost)
print("Classification Report (AdaBoost):\n", class_report_adaboost)


Confusion Matrix (AdaBoost):
 [[72 27]
 [23 78]]
Classification Report (AdaBoost):
               precision    recall  f1-score   support

           0       0.76      0.73      0.74        99
           1       0.74      0.77      0.76       101

    accuracy                           0.75       200
   macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200



>#### Gradient Boosting Evaluation
>>(a) Confusion Matrix and Classification Report for Gradient Boosting:

In [15]:
# Predict on the test set
y_pred_gradient_boosting = gradient_boosting_model.predict(X_test)

# Confusion Matrix
conf_matrix_gradient_boosting = confusion_matrix(y_test, y_pred_gradient_boosting)
print("Confusion Matrix (Gradient Boosting):\n", conf_matrix_gradient_boosting)

# Classification Report
class_report_gradient_boosting = classification_report(y_test, y_pred_gradient_boosting)
print("Classification Report (Gradient Boosting):\n", class_report_gradient_boosting)


Confusion Matrix (Gradient Boosting):
 [[73 26]
 [12 89]]
Classification Report (Gradient Boosting):
               precision    recall  f1-score   support

           0       0.86      0.74      0.79        99
           1       0.77      0.88      0.82       101

    accuracy                           0.81       200
   macro avg       0.82      0.81      0.81       200
weighted avg       0.82      0.81      0.81       200
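
>>A single train/test split gives one estimate per model, which can shift noticeably with a different split. To assess generalization more robustly, the four classifiers could be compared with stratified k-fold cross-validation. The sketch below does this on synthetic data; the dataset, the choice of five folds, and F1 as the score are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the diabetes features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Five stratified folds: every fold keeps the overall class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")
    cv_scores[name] = scores
    print(f"{name}: mean F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

>>The standard deviation across folds shows how much of the gap between two models could be due to the split alone.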



# Model Comparison

>## Decision Tree
>- Accuracy: The Decision Tree model achieved an accuracy of 79% on the test dataset.
>- Precision: For the positive (diabetic) class, precision was 76%, meaning that 76% of the predicted positive cases were correct.
>- Recall: The recall for the positive class was 84%, showing that the model correctly identified 84% of the actual positive cases.
>- F1-Score: The F1-Score, which balances precision and recall, was 80% for the positive class.
>- Confusion Matrix: The confusion matrix shows 85 true positives, 72 true negatives, 27 false positives, and 16 false negatives.
>- Feature Importance: The Decision Tree ranked Glucose (0.32), BMI (0.21), and Age (0.15) as the most important features. The full importance vector, in dataset column order, is [0.04304415 0.31557356 0.09382122 0.04494198 0.0322795 0.21075117 0.1122695 0.14731891].
 
>#### **Pros:**
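>- Interpretable, and they provide insight into feature importance.
>- Can handle both numerical and categorical data in principle, although scikit-learn's implementation requires categorical features to be numerically encoded.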
>#### **Cons:**
>- Prone to overfitting, which might result in lower generalization to new data.

>## Random Forest
>- Accuracy: The Random Forest model achieved an accuracy of 82% on the test dataset.
>- Precision: For the positive class, precision was 79%, indicating that 79% of the predicted positive cases were correct.
>- Recall: The recall for the positive class was 88%, showing that the model correctly identified 88% of the actual positive cases.
>- F1-Score: The F1-Score for the positive class was 84%.
>- Confusion Matrix: It had 89 true positives, 76 true negatives, 23 false positives, and 12 false negatives.
>- Feature Importance: Random Forest ranked Glucose (0.27), BMI (0.17), and Age (0.15) as the most important features. The full importance vector, in dataset column order, is [0.07281355 0.26952516 0.08493405 0.06627424 0.0736287 0.17027066 0.11574126 0.14681238].

>#### **Pros:**
>- Better generalization than a single Decision Tree due to ensemble techniques.
>- Handles noisy data well.
>#### **Cons:**
>- Can be computationally expensive.

>## AdaBoost

>- Accuracy: The AdaBoost model achieved an accuracy of 75% on the test dataset.
>- Precision: For the positive class, precision was 74%, indicating that 74% of the predicted positive cases were correct.
>- Recall: The recall for the positive class was 77%, showing that the model correctly identified 77% of the actual positive cases.
>- F1-Score: The F1-Score for the positive class was 76%.
>- Confusion Matrix: It had 78 true positives, 72 true negatives, 27 false positives, and 23 false negatives.

>#### **Pros:**
>- Effective in boosting the performance of weak learners.
>- Handles imbalanced datasets well.

>#### **Cons:**
>- Sensitive to noisy data and outliers.

>## Gradient Boosting

>- Accuracy: The Gradient Boosting model achieved an accuracy of 81% on the test dataset.
>- Precision: For the positive class, precision was 77%, indicating that 77% of the predicted positive cases were correct.
>- Recall: The recall for the positive class was 88%, showing that the model correctly identified 88% of the actual positive cases.
>- F1-Score: The F1-Score for the positive class was 82%.
>- Confusion Matrix: It had 89 true positives, 73 true negatives, 26 false positives, and 12 false negatives.

>#### **Pros:**
>- High predictive power and can capture complex relationships in the data.
>- Robust to outliers.

>#### **Cons:**
>- Can be computationally intensive.

>## Overall Comparison and Conclusion
>- The Decision Tree model provided an interpretable baseline with 79% accuracy, but single trees are prone to overfitting.
>- The Random Forest achieved the highest test accuracy (82%), closely followed by Gradient Boosting (81%). Both reached 88% recall on the positive class.
>- AdaBoost and Gradient Boosting are both designed to combine weak learners into a strong one. With default settings, Gradient Boosting clearly outperformed AdaBoost, which had the lowest accuracy of the four models (75%).
>- The Random Forest algorithm is often a reasonable choice when high accuracy is needed and some interpretability, through feature importances, should be retained.
