
# ***Random Forest: Diabetes Prediction***




---






### **Algorithm Theory**


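The Random Forest algorithm is an ensemble machine learning method used for classification, regression, and other tasks. It builds many decision trees and combines their outputs to produce a more robust and accurate prediction than any single tree. Each tree is trained on a bootstrap sample of the training data (rows drawn at random, with replacement) and considers only a random subset of the features at each split. For classification, the final prediction is the majority vote of the trees; for regression, it is the average of their outputs.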
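
To make the idea concrete, here is a minimal hand-built sketch of bagging with a majority vote. It is an illustration only, not how the notebook trains its model: the synthetic data from `make_classification`, the tree count of 25, and the names `X_demo`, `y_demo`, `trees`, and `forest_pred` are all assumptions chosen for this example. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative, not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):  # an odd number of trees avoids tied votes
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # max_features="sqrt" makes every split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Every tree votes on every row; the majority class wins
votes = np.stack([t.predict(X_demo) for t in trees])  # shape (25, 300)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the hand-built forest:", (forest_pred == y_demo).mean())
```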

### **Problem Description**
We will use Random Forest to classify patients into two categories: diabetic and non-diabetic.



### **Libraries Used**



* **pandas** - A Python library for data manipulation and analysis. Provides fast, flexible, and expressive data structures designed to make working with relational or labeled data easy and intuitive. Supports operations like filtering, cleaning, exploring, and analyzing data.

* **scikit-learn** - One of the most popular machine learning libraries in Python. Provides tools for statistical modeling, including classification, regression, clustering, and dimensionality reduction. Includes preprocessing methods, ML algorithms, and model evaluation tools.








In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

### **Dataset**

The dataset **diabetes_prediction_dataset.csv** includes information about individuals with the following features:
* **gender** - The gender of the individual
* **age** - Age of the individual
* **hypertension** - Whether the individual has hypertension
* **heart_disease** - Presence or absence of heart disease
* **smoking_history** - Smoking history of the individual
* **bmi** - Body Mass Index, which relates weight to height
* **HbA1c_level** - Glycated hemoglobin level in the blood
* **blood_glucose_level** - Blood sugar level

The variable **diabetes** is the dependent variable, indicating whether an individual has diabetes.






### **Algorithm Steps**


1.   Loading the data – The dataset is loaded from a CSV file into a pandas DataFrame for processing.


In [None]:
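import io
import pandas as pd
from google.colab import files  # assumes the notebook runs in Google Colab

# Opens a file picker; returns a dict mapping each file name to its bytes
uploaded = files.upload()

df = pd.read_csv(io.BytesIO(uploaded['diabetes_prediction_dataset.csv']))
print(df)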

       gender   age  hypertension  heart_disease smoking_history    bmi  \
0      Female  80.0             0              1           never  25.19   
1      Female  54.0             0              0         No Info  27.32   
2        Male  28.0             0              0           never  27.32   
3      Female  36.0             0              0         current  23.45   
4        Male  76.0             1              1         current  20.14   
...       ...   ...           ...            ...             ...    ...   
99995  Female  80.0             0              0         No Info  27.32   
99996  Female   2.0             0              0         No Info  17.37   
99997    Male  66.0             0              0          former  27.83   
99998  Female  24.0             0              0           never  35.42   
99999  Female  57.0             0              0         current  22.43   

       HbA1c_level  blood_glucose_level  diabetes  
0              6.6                  140        

2. Data preprocessing – In this step, the dataset is transformed to prepare it for the Random Forest model. Categorical features are converted into a numeric format using one-hot encoding, a preprocessing technique that turns each unique value in a categorical column into its own binary column. The remaining numerical features are standardized with `StandardScaler`.

In [None]:
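# Features only; the target column 'diabetes' is excluded from preprocessing
X = df.drop(columns='diabetes')

# One-hot encode the categorical columns
categorical_features = ['gender', 'smoking_history']
one_hot_encoder = OneHotEncoder()

# Standardize every remaining (numerical) column
numerical_features = X.drop(columns=categorical_features).columns
scaler = StandardScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', one_hot_encoder, categorical_features),
        ('num', scaler, numerical_features)
    ]
)

# Note: fitting on the full dataset lets test-set statistics influence the
# scaling; for a stricter evaluation, fit on the training split only.
X_processed = preprocessor.fit_transform(X)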


3. Preparing training and test sets – These lines of code separate the features (X) and the target variable (y), and then split the data into training and test sets.




In [None]:
X = df.drop('diabetes', axis=1)
y = df['diabetes']


# Split the preprocessed features so the model receives numeric input only
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=42)
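
When one class is much rarer than the other, it can help to keep the class ratio identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on made-up labels; the notebook itself does not use stratification, and `X_toy`, `y_toy`, and the 10% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1000 rows, 10% of them positive
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

# stratify=y_toy preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Rows in train / test:", len(X_tr), len(X_te))
print("Positive rate in train / test:", y_tr.mean(), y_te.mean())
```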

4. Training the model – Here, the Random Forest model is trained on the training dataset.

In [None]:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train, y_train)
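
One possible way to keep preprocessing and training together is a scikit-learn `Pipeline`, which fits the encoder and scaler on the training split only. This is a sketch under stated assumptions, not the notebook's method: the `toy` frame is made up (it borrows a few column names from the dataset), and `handle_unknown="ignore"` is an addition that lets the encoder tolerate categories unseen during fitting.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame reusing some of the dataset's column names
toy = pd.DataFrame({
    "gender": ["Female", "Male"] * 20,
    "age": [20 + i for i in range(40)],
    "bmi": [18.0 + 0.5 * i for i in range(40)],
    "diabetes": [0] * 25 + [1] * 15,
})
X_toy = toy.drop(columns="diabetes")
y_toy = toy["diabetes"]

# Split the raw data first, so preprocessing never sees the test rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

pipeline = Pipeline(steps=[
    ("preprocess", ColumnTransformer(transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        ("num", StandardScaler(), ["age", "bmi"]),
    ])),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# fit() learns the encoding, the scaling, and the forest from the training rows
pipeline.fit(X_tr, y_tr)
toy_accuracy = pipeline.score(X_te, y_te)
print("Accuracy on the toy test split:", toy_accuracy)
```

Calling `pipeline.predict` on new raw rows then applies the same fitted transformations before the forest makes its prediction.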

5. Predictions and model evaluation – After training, the model makes predictions on the test set. The accuracy and classification report are then calculated and displayed.

In [None]:

y_pred_rf = rf_classifier.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f'Random Forest model accuracy: {accuracy_rf}')

report_rf = classification_report(y_test, y_pred_rf)
print(report_rf)
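
The precision and recall values in the classification report are derived from the confusion matrix. The sketch below recomputes them by hand for the positive class; the labels in `y_true_toy` and `y_pred_toy` are made up for illustration and are not results from the diabetes model.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print("Manual precision / recall:", precision, recall)
print("scikit-learn precision / recall:",
      precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy))
```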

6. Visualizing results – Finally, a table is created to compare the actual data with the model's predicted values.

In [None]:
test_results_rf = pd.DataFrame({'Real Data': y_test, 'Predicted Data': y_pred_rf})
print(test_results_rf.head(51))
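
A row-by-row table becomes hard to read for large test sets. As a hedged alternative, the sketch below summarizes such a table with `pd.crosstab` and filters out the misclassified rows. The `toy_results` values are made up; only the column names match the table above.

```python
import pandas as pd

# Made-up results using the same column names as the comparison table
toy_results = pd.DataFrame({
    "Real Data": [0, 0, 1, 1, 0, 1],
    "Predicted Data": [0, 0, 1, 0, 0, 1],
})

# Count how often each (real, predicted) combination occurs
summary = pd.crosstab(toy_results["Real Data"], toy_results["Predicted Data"])
print(summary)

# Keep only the rows where the prediction differs from the real label
mismatches = toy_results[toy_results["Real Data"] != toy_results["Predicted Data"]]
print(mismatches)
```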

### **Data Splitting:**
The dataset will be split into 80% for training and 20% for testing.



### **Conclusions**

**High Accuracy**: The model achieves an overall accuracy of approximately 97%, indicating strong performance in correctly classifying cases as diabetic or non-diabetic.

***Class 0 (Non-diabetic) Classification:***

1. Precision: 97% precision for class 0 means that 97% of the cases the model predicted as non-diabetic were actually non-diabetic.
2. Recall: A recall of 100% (at the report's rounding) indicates that the model correctly identified essentially all actual non-diabetic cases in the test set.
3. F1-Score: An F1-score of 98% for class 0 is very high, reflecting a good balance between precision and recall.

***Class 1 (Diabetic) Classification:***
1. Precision: 95% precision for class 1 is very good: 95% of the cases the model predicted as diabetic were actually diabetic.
2. Recall: However, a recall of only 69% for class 1 means the model missed about 31% of the actual diabetic cases in the test set.
3. F1-Score: An F1-score of 80% for class 1 is significantly lower than that for class 0 because of the lower recall.
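
The F1-scores quoted above follow from the reported precision and recall, since F1 is their harmonic mean. The short check below uses the rounded figures from this section; the helper name `f1_from` is made up for the example.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall values quoted in the conclusions
f1_class0 = f1_from(0.97, 1.00)
f1_class1 = f1_from(0.95, 0.69)

print(round(f1_class0, 2), round(f1_class1, 2))  # 0.98 0.8
```

Because the harmonic mean is pulled toward the smaller of the two values, the 69% recall drags the class 1 F1-score down to about 80% despite the 95% precision.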