<a href="https://colab.research.google.com/github/dlwub/Diabetic-Retinopathy-Classification/blob/master/Diabetic_Retinopathy_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### **Project Description:** This project classifies **Diabetic Retinopathy (DR)** using the UCI Diabetic Retinopathy Debrecen dataset. The dataset consists of features extracted from retinal images that capture key indicators of diabetic retinopathy. Each instance contains 19 numerical features followed by a final attribute that serves as the target variable, where:

* ### 0 represents the absence of diabetic retinopathy (Non-DR case).
* ### 1 indicates the presence of diabetic retinopathy (DR case).
### To perform the classification, we employ three machine learning models:

* **Logistic Regression**
* **XGBoost Classifier**
* **Random Forest Classifier**
### Each model is evaluated based on key metrics such as accuracy, precision, recall, and F1-score, ensuring a robust comparison of their effectiveness in detecting diabetic retinopathy.
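As a reminder of how the four metrics relate, the sketch below computes them from confusion-matrix counts for the positive (DR) class. The helper name `binary_metrics` and the counts are hypothetical and are not taken from the experiments in this notebook.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=80, fp=20, fn=10, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision asks how many predicted DR cases are truly DR, while recall asks how many true DR cases were found; F1 is their harmonic mean.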



#### Import necessary libraries

In [None]:
from scipy.io import arff
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier

### Step 1. Load and read the data

In [None]:
file_path = "/content/gdrive/MyDrive/Diabetic_Retinipathy_Debrecen/messidor_features.arff"
data, meta = arff.loadarff(file_path)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,Class
0,1.0,1.0,22.0,22.0,22.0,19.0,18.0,14.0,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1.0,b'0'
1,1.0,1.0,24.0,24.0,22.0,18.0,16.0,13.0,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0.0,b'0'
2,1.0,1.0,62.0,60.0,59.0,54.0,47.0,33.0,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0.0,b'1'
3,1.0,1.0,55.0,53.0,53.0,50.0,43.0,31.0,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0.0,b'0'
4,1.0,1.0,44.0,44.0,44.0,41.0,39.0,27.0,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0.0,b'1'


### Step 2. Data Preprocessing

In [None]:
print(df.isnull().sum())

0        0
1        0
2        0
3        0
4        0
5        0
6        0
7        0
8        0
9        0
10       0
11       0
12       0
13       0
14       0
15       0
16       0
17       0
18       0
Class    0
dtype: int64


#### There are no missing values in any column.

In [None]:
print(meta)

Dataset: dr
	0's type is numeric
	1's type is numeric
	2's type is numeric
	3's type is numeric
	4's type is numeric
	5's type is numeric
	6's type is numeric
	7's type is numeric
	8's type is numeric
	9's type is numeric
	10's type is numeric
	11's type is numeric
	12's type is numeric
	13's type is numeric
	14's type is numeric
	15's type is numeric
	16's type is numeric
	17's type is numeric
	18's type is numeric
	Class's type is nominal, range is ('0', '1')



In [None]:
print(meta.names())

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', 'Class']


#### Since the column names are not informative, we rename them.

In [None]:
# Rename feature names
df.columns = [f'feature_{i}' for i in range(19)] + ['Class']

In [None]:
df.head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,Class
0,1.0,1.0,22.0,22.0,22.0,19.0,18.0,14.0,49.895756,17.775994,5.27092,0.771761,0.018632,0.006864,0.003923,0.003923,0.486903,0.100025,1.0,b'0'
1,1.0,1.0,24.0,24.0,22.0,18.0,16.0,13.0,57.709936,23.799994,3.325423,0.234185,0.003903,0.003903,0.003903,0.003903,0.520908,0.144414,0.0,b'0'
2,1.0,1.0,62.0,60.0,59.0,54.0,47.0,33.0,55.831441,27.993933,12.687485,4.852282,1.393889,0.373252,0.041817,0.007744,0.530904,0.128548,0.0,b'1'
3,1.0,1.0,55.0,53.0,53.0,50.0,43.0,31.0,40.467228,18.445954,9.118901,3.079428,0.840261,0.272434,0.007653,0.001531,0.483284,0.11479,0.0,b'0'
4,1.0,1.0,44.0,44.0,44.0,41.0,39.0,27.0,18.026254,8.570709,0.410381,0.0,0.0,0.0,0.0,0.0,0.475935,0.123572,0.0,b'1'


#### Scaling the features

In [None]:
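# We use StandardScaler to standardize the 19 numeric feature columns.
# Note: the scaler is fitted on the full dataset here, before the train/test split,
# so the test rows also contribute to the scaling statistics.
scaler = StandardScaler()
df.iloc[:, :-1] = scaler.fit_transform(df.iloc[:, :-1])

# The ARFF loader returns the nominal target as bytes (b'0', b'1');
# we convert it to the integers 0 and 1.
df['Class'] = df['Class'].apply(int)
df.head()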

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,Class
0,0.059054,0.298213,-0.641486,-0.618782,-0.576463,-0.630029,-0.551116,-0.473745,-0.242917,-0.246003,-0.296966,-0.271509,-0.218324,-0.194409,-0.205124,-0.186169,-1.294763,-0.468656,1.405048,0
1,0.059054,0.298213,-0.563391,-0.535778,-0.576463,-0.67741,-0.653676,-0.539992,-0.10925,0.032972,-0.465224,-0.408593,-0.224256,-0.197212,-0.205175,-0.186281,-0.082168,2.006054,-0.711719,0
2,0.059054,0.298213,0.920417,0.958299,1.046665,1.028299,0.936006,0.784951,-0.141383,0.227196,0.344463,0.769037,0.335538,0.15233,-0.110043,-0.164808,0.274283,1.121516,-0.711719,1
3,0.059054,0.298213,0.647084,0.667784,0.783456,0.838776,0.730886,0.652456,-0.404199,-0.214977,0.03583,0.316953,0.112573,0.056919,-0.195765,-0.199541,-1.423814,0.354501,-0.711719,0
4,0.059054,0.298213,0.217561,0.294265,0.388641,0.412349,0.525766,0.387468,-0.788069,-0.672306,-0.717335,-0.468311,-0.225828,-0.200905,-0.214968,-0.2081,-1.685874,0.844102,-0.711719,1
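For reference, standardization replaces each value with its z-score, (x - mean) / std, where `StandardScaler` uses the population standard deviation. A minimal sketch on a toy column follows; the helper `standardize` is hypothetical, and the results differ from the table above because the notebook's mean and standard deviation come from the whole column, not from five rows.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Scale values to zero mean and unit (population) standard deviation.

    Assumes the column is not constant (std > 0).
    """
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0), matching StandardScaler
    return [(x - mean) / std for x in column]

# Toy values copied from the first five rows of one count feature.
raw = [22.0, 24.0, 62.0, 55.0, 44.0]
scaled = standardize(raw)
print([round(z, 3) for z in scaled])
```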


### Step 3. Split data

In [None]:
X = df.drop(columns=['Class'])
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
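In the cells above, the scaler is fitted before the split. One common alternative, sketched here on synthetic data rather than the real dataset, is to split first and put the scaler inside a scikit-learn pipeline so that it is fitted on the training rows only. The variable names and the use of `stratify` are assumptions for illustration, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 19 features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=19,
                                     random_state=42)

# stratify keeps the class ratio similar in the train and test splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Fitting the pipeline fits the scaler on the training rows only;
# the same fitted scaler is then applied to the test rows when scoring.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The same pipeline object can be passed to `GridSearchCV` or `RandomizedSearchCV`, in which case the scaler is refitted inside each cross-validation fold.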

### Step 4. Train models
#### We start with Logistic Regression

In [None]:
# Define and train model
lr = LogisticRegression()
lr.fit(X_train, y_train)

# Predict
y_pred = lr.predict(X_test)

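# Evaluate model
# Note: scikit-learn expects (y_true, y_pred). With the order used here, accuracy and
# per-class F1 are unchanged, but the report's precision and recall columns are
# swapped and "support" counts predicted labels rather than true labels.
print('Accuracy Score:', accuracy_score(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))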

Accuracy Score: 0.70995670995671
Classification Report:               precision    recall  f1-score   support

           0       0.81      0.64      0.71       130
           1       0.63      0.80      0.71       101

    accuracy                           0.71       231
   macro avg       0.72      0.72      0.71       231
weighted avg       0.73      0.71      0.71       231
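scikit-learn's metric functions take their arguments as `(y_true, y_pred)`, while the cell above passes `(y_pred, y_test)`. Accuracy is symmetric in its arguments, but precision and recall trade places when the order is reversed. The sketch below shows this with a small hand-written helper, `precision_recall`, and made-up labels; both are hypothetical.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for `positive`, arguments in (y_true, y_pred) order."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    actual_pos = sum(t == positive for t in y_true)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels for illustration only.
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

correct_order = precision_recall(y_true_demo, y_pred_demo)
swapped_order = precision_recall(y_pred_demo, y_true_demo)
print("(y_true, y_pred):", correct_order)
print("(y_pred, y_true):", swapped_order)
```

Under this reading, the "precision" column in the reports printed in this notebook corresponds to recall in the usual sense, and vice versa.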



### XGBoost

In [None]:
# Define and train model
xgb_model = XGBClassifier(use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(X_train, y_train)
# Predict
y_pred = xgb_model.predict(X_test)

# Evaluate model
print('Accuracy Score:', accuracy_score(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))

Parameters: { "use_label_encoder" } are not used.



Accuracy Score: 0.6753246753246753
Classification Report:               precision    recall  f1-score   support

           0       0.70      0.62      0.66       116
           1       0.66      0.73      0.69       115

    accuracy                           0.68       231
   macro avg       0.68      0.68      0.67       231
weighted avg       0.68      0.68      0.67       231



#### Let's tune hyperparameters

In [None]:
# Let's start with Random Search CV
# Define parameter distribution
param_dist = {
    'max_depth': np.arange(3, 10, 2),
    'learning_rate': np.linspace(0.01, 0.3, 5),
    'n_estimators': np.arange(50, 500, 50),
    'subsample': np.linspace(0.5, 1.0, 5),
    'colsample_bytree': np.linspace(0.5, 1.0, 5)
}

# Perform randomized search (40 sampled settings, 5-fold cross-validation)
random_search = RandomizedSearchCV(xgb_model, param_dist, n_iter=40, scoring='accuracy', cv=5, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)

xgb_best_model = random_search.best_estimator_

# Best parameters and mean cross-validated accuracy
print("Best parameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

Parameters: { "use_label_encoder" } are not used.



Best parameters: {'subsample': 0.75, 'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.155, 'colsample_bytree': 0.625}
Best Score: 0.7076086956521739
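To put the randomized search in perspective, the arithmetic below counts how many settings the full grid defined by `param_dist` would contain and how many model fits the search performs. The counts are derived from the ranges in the cell above; the dictionary `n_values` is only a bookkeeping aid for this sketch.

```python
import math

# Number of candidate values per hyperparameter, mirroring param_dist.
n_values = {
    "max_depth": len(range(3, 10, 2)),        # 3, 5, 7, 9
    "learning_rate": 5,                       # np.linspace(0.01, 0.3, 5)
    "n_estimators": len(range(50, 500, 50)),  # 50, 100, ..., 450
    "subsample": 5,                           # np.linspace(0.5, 1.0, 5)
    "colsample_bytree": 5,                    # np.linspace(0.5, 1.0, 5)
}

total_combinations = math.prod(n_values.values())
n_iter, cv_folds = 40, 5
cv_fits = n_iter * cv_folds  # excludes the final refit of the best setting

print("Combinations in the full grid:", total_combinations)
print(f"Sampled fraction: {n_iter / total_combinations:.2%}")
print("Cross-validation fits:", cv_fits)
```

Because only a small fraction of the grid is sampled, a different `random_state` or a larger `n_iter` may surface a different best setting.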


In [None]:
# Predict using the best XGB model
y_pred = xgb_best_model.predict(X_test)

# Evaluate model
print('Accuracy Score:', accuracy_score(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))

Accuracy Score: 0.6883116883116883
Classification Report:               precision    recall  f1-score   support

           0       0.73      0.63      0.68       119
           1       0.66      0.75      0.70       112

    accuracy                           0.69       231
   macro avg       0.69      0.69      0.69       231
weighted avg       0.69      0.69      0.69       231



### Random Forest Classifier

In [None]:
# Define and train model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Predict
y_pred = rf.predict(X_test)

# Evaluate model
print('Accuracy Score:', accuracy_score(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))

Accuracy Score: 0.696969696969697
Classification Report:               precision    recall  f1-score   support

           0       0.76      0.63      0.69       123
           1       0.65      0.77      0.70       108

    accuracy                           0.70       231
   macro avg       0.70      0.70      0.70       231
weighted avg       0.71      0.70      0.70       231



#### Tune Hyperparameters

In [None]:
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
best_model_rf = grid_search.best_estimator_

Best parameters: {'max_depth': 10, 'min_samples_split': 5, 'n_estimators': 200}
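Unlike the randomized search, a grid search evaluates every combination in `param_grid`. The sketch below enumerates those combinations with `itertools.product` to show the size of the search; it only mirrors the idea of an exhaustive grid and is not how `GridSearchCV` is implemented internally. The name `param_grid_demo` is introduced here for illustration.

```python
from itertools import product

param_grid_demo = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

# Cartesian product of all value lists, one dict per candidate setting.
names = list(param_grid_demo)
candidates = [dict(zip(names, values))
              for values in product(*param_grid_demo.values())]

cv_folds = 5
print("Candidates:", len(candidates))
print("Cross-validation fits:", len(candidates) * cv_folds)
print("First candidate:", candidates[0])
```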


In [None]:
# Predict using the best rf model
y_pred = best_model_rf.predict(X_test)

# Evaluate model
print('Accuracy Score:', accuracy_score(y_pred, y_test))
print('Classification Report:', classification_report(y_pred, y_test))

Accuracy Score: 0.7142857142857143
Classification Report:               precision    recall  f1-score   support

           0       0.80      0.65      0.71       127
           1       0.65      0.80      0.72       104

    accuracy                           0.71       231
   macro avg       0.72      0.72      0.71       231
weighted avg       0.73      0.71      0.71       231



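#### The tuned Random Forest achieved the highest overall accuracy (71.4%), narrowly ahead of Logistic Regression (71.0%) and the tuned XGBoost model (68.8%). In the reports as printed, Random Forest has slightly higher recall for class 0 (0.65 vs. 0.64) and a slightly higher F1-score for class 1 (0.72 vs. 0.71), while Logistic Regression has slightly higher precision for class 0 (0.81 vs. 0.80). Because the reports were generated with `classification_report(y_pred, y_test)` rather than scikit-learn's `(y_true, y_pred)` order, the precision and recall columns are interchanged relative to their usual definitions; accuracy and the per-class F1-scores are unaffected.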
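The gap between the two best models is small relative to the test-set size. As a rough check, the sketch below estimates a binomial standard error for each accuracy, treating the 231 test predictions as independent trials (a simplifying assumption). The correct-prediction counts are recovered from the reported accuracies, for example 165 / 231 = 0.714; the helper `accuracy_with_se` is hypothetical.

```python
import math

n_test = 231  # test-set size shown in the classification reports

def accuracy_with_se(n_correct, n_total):
    """Accuracy and its binomial standard error (normal approximation)."""
    acc = n_correct / n_total
    se = math.sqrt(acc * (1 - acc) / n_total)
    return acc, se

results = {}
for model, n_correct in [("Logistic Regression", 164),
                         ("Tuned Random Forest", 165)]:
    results[model] = accuracy_with_se(n_correct, n_test)
    acc, se = results[model]
    print(f"{model}: accuracy={acc:.3f}, standard error={se:.3f}")
```

The two accuracies differ by a single test sample, which is well inside one standard error, so this split alone does not establish that one model is better than the other.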