# FDS - Final Project: Prediction of Diabetes patients

### Description:

- We want to train a model that predicts whether a patient with given medical properties has diabetes.
- To that end, we compare several techniques and keep the model that performs best on the evaluation metrics.

## Motivation:
- Such a model can help healthcare professionals identify patients who may be at risk of developing diabetes and develop personalized treatment plans.
- Additionally, the dataset can be used by researchers to explore the relationships between various medical and demographic factors and the likelihood of developing diabetes.

## Dataset:
- We obtained the 'Diabetes prediction' dataset from the following URL:
- https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/data
- The 'Diabetes prediction dataset' is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative).
- The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level.

## Related Works
- URL: Diabetes dataset 1
- URL: Diabetes dataset 2
- URL: Pima Indians Diabetes Database
- URL: Predict Diabetes dataset



## Import modules:

In [22]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split


## Data Preprocessing:

In [23]:
file_path = './diabetes_prediction_dataset-100000.csv'

data = pd.read_csv(file_path)

# Replace 'Female' with 0 and 'Male' with 1 in the 'gender' column
data['gender'] = data['gender'].replace({'Female': 0, 'Male': 1})

data.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,0,80.0,0,1,never,25.19,6.6,140,0
1,0,54.0,0,0,No Info,27.32,6.6,80,0
2,1,28.0,0,0,never,27.32,5.7,158,0
3,0,36.0,0,0,current,23.45,5.0,155,0
4,1,76.0,1,1,current,20.14,4.8,155,0


Add two more columns, "is_smoker" and "been_smoker", based on the value of the 'smoking_history' column. Iterating through each row, apply the following conditions:

- If the value of 'smoking_history' is "never" or "ever", set both "is_smoker" and "been_smoker" to 0.
- If the value of 'smoking_history' is "current", set both "is_smoker" and "been_smoker" to 1.
- If the value of 'smoking_history' is "former", set "is_smoker" to 0 and "been_smoker" to 1.
- Any other value (e.g., "No Info") keeps the default of 0 in both columns.

In [24]:
# Create 'is_smoker' and 'been_smoker' columns, initializing them with 0
data['is_smoker'] = 0
data['been_smoker'] = 0

# Apply conditions to populate 'is_smoker' and 'been_smoker' columns based on 'smoking_history'
for index, row in data.iterrows():
    if row['smoking_history'] in ['never', 'ever']:
        data.at[index, 'is_smoker'] = 0
        data.at[index, 'been_smoker'] = 0
    elif row['smoking_history'] == 'current':
        data.at[index, 'is_smoker'] = 1
        data.at[index, 'been_smoker'] = 1
    elif row['smoking_history'] == 'former':
        data.at[index, 'is_smoker'] = 0
        data.at[index, 'been_smoker'] = 1

data.head()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes,is_smoker,been_smoker
0,0,80.0,0,1,never,25.19,6.6,140,0,0,0
1,0,54.0,0,0,No Info,27.32,6.6,80,0,0,0
2,1,28.0,0,0,never,27.32,5.7,158,0,0,0
3,0,36.0,0,0,current,23.45,5.0,155,0,1,1
4,1,76.0,1,1,current,20.14,4.8,155,0,1,1
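
The row-by-row loop above can also be written with vectorized pandas operations, which tends to be much faster on 100,000 rows. The following is a minimal sketch on a small hypothetical frame named `demo`; it assumes the same mapping as the conditions listed above and uses only category values that appear in this notebook.

```python
import pandas as pd

# Small illustrative frame (hypothetical); values mirror categories seen in 'smoking_history'
demo = pd.DataFrame(
    {"smoking_history": ["never", "ever", "current", "former", "No Info"]}
)

# 1 only for current smokers, 0 for everything else
demo["is_smoker"] = (demo["smoking_history"] == "current").astype(int)

# 1 for current or former smokers, 0 for everything else
demo["been_smoker"] = demo["smoking_history"].isin(["current", "former"]).astype(int)

print(demo)
```

Both columns default to 0 for any category that is not matched, which reproduces the behaviour of the loop for "No Info".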


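**Min-Max Scaling:** Rescales each feature to a fixed range, usually [0, 1]. Use it when the data is not normally distributed or when the algorithm expects input features on a similar, bounded scale (e.g., neural networks, distance-based algorithms like KNN). Because it depends on the minimum and maximum, it is sensitive to outliers.

**Standardization (Z-score normalization):** Rescales each feature to zero mean and unit variance. Use it when the data is roughly normally distributed or when the chosen algorithm is scale-sensitive (e.g., regularized linear or logistic regression, KNN). Linear and logistic regression do not themselves require normally distributed features, but standardized inputs help regularization and gradient-based solvers. Algorithms that are not scale-sensitive, such as tree-based models, gain little from either kind of scaling.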
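
To make the two definitions concrete, the sketch below computes both transformations by hand on a toy column and compares them with scikit-learn's scalers. The array `x` and its values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature column (illustrative values, not from the dataset)
x = np.array([[20.0], [30.0], [40.0], [80.0]])

# Min-Max scaling: (x - min) / (max - min), mapping values into [0, 1]
manual_min_max = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, giving zero mean and unit variance.
# StandardScaler uses the population standard deviation (ddof=0), as np.std does by default.
manual_standard = (x - x.mean()) / x.std()

sk_min_max = MinMaxScaler().fit_transform(x)
sk_standard = StandardScaler().fit_transform(x)

print("Min-Max matches sklearn:", np.allclose(manual_min_max, sk_min_max))
print("Standardization matches sklearn:", np.allclose(manual_standard, sk_standard))
```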

In [32]:
# Select columns to be normalized
columns_to_normalize = ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']

# Min-Max scaling
min_max_scaler = MinMaxScaler()
data_norm_min_max = min_max_scaler.fit_transform(data[columns_to_normalize])
data_norm_min_max = pd.DataFrame(data_norm_min_max, columns=[f'{col}_norm_MinMax' for col in columns_to_normalize])

# Standardization (Z-score scaling)
standard_scaler = StandardScaler()
data_norm_standardized = standard_scaler.fit_transform(data[columns_to_normalize])
data_norm_standardized = pd.DataFrame(data_norm_standardized, columns=[f'{col}_norm_Standardization' for col in columns_to_normalize])

# Add the new normalized columns to the original DataFrame
data = pd.concat([data, data_norm_min_max, data_norm_standardized], axis=1)
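
Column names such as `age_norm_MinMax.1` in the table printed further down suggest that the `pd.concat` cell was probably executed more than once, since each run appends another copy of the scaled columns. One possible way to make the step safe to re-run is to assign the scaled values by column name, so a second run overwrites instead of duplicating. This is a sketch on a small hypothetical frame; the helper name `add_min_max_columns` is an assumption, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def add_min_max_columns(df, columns):
    """Add '<col>_norm_MinMax' columns in place; re-running overwrites them."""
    scaled = MinMaxScaler().fit_transform(df[columns])
    for i, col in enumerate(columns):
        df[f"{col}_norm_MinMax"] = scaled[:, i]
    return df

# Hypothetical three-row frame using values shown in data.head()
demo = pd.DataFrame({"age": [80.0, 54.0, 28.0], "bmi": [25.19, 27.32, 27.32]})

# Simulate executing the notebook cell twice
for _ in range(2):
    demo = add_min_max_columns(demo, ["age", "bmi"])

print(demo.columns.tolist())
```

The same pattern would apply to the standardized columns with `StandardScaler`.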

In [33]:
# Re-order the columns of the data frame
new_column_order = ['gender', 'age', 'age_norm_MinMax', 'age_norm_Standardization', 'bmi', 'bmi_norm_MinMax', 'bmi_norm_Standardization', 'HbA1c_level', 'HbA1c_level_norm_MinMax', 'HbA1c_level_norm_Standardization', 'blood_glucose_level', 'blood_glucose_level_norm_MinMax', 'blood_glucose_level_norm_Standardization', 'smoking_history', 'is_smoker', 'been_smoker', 'hypertension', 'heart_disease', 'diabetes']

# Create a new DataFrame with columns arranged in the new order
data = data[new_column_order]
data

Unnamed: 0,gender,age,age_norm_MinMax,age_norm_MinMax.1,age_norm_Standardization,age_norm_Standardization.1,bmi,bmi_norm_MinMax,bmi_norm_MinMax.1,bmi_norm_Standardization,...,blood_glucose_level_norm_MinMax,blood_glucose_level_norm_MinMax.1,blood_glucose_level_norm_Standardization,blood_glucose_level_norm_Standardization.1,smoking_history,is_smoker,been_smoker,hypertension,heart_disease,diabetes
0,0,80.0,1.000000,1.000000,1.692704,1.692704,25.19,0.177171,0.177171,-0.321056,...,0.272727,0.272727,0.047704,0.047704,never,0,0,0,1,0
1,0,54.0,0.674675,0.674675,0.538006,0.538006,27.32,0.202031,0.202031,-0.000116,...,0.000000,0.000000,-1.426210,-1.426210,No Info,0,0,0,0,0
2,1,28.0,0.349349,0.349349,-0.616691,-0.616691,27.32,0.202031,0.202031,-0.000116,...,0.354545,0.354545,0.489878,0.489878,never,0,0,0,0,0
3,0,36.0,0.449449,0.449449,-0.261399,-0.261399,23.45,0.156863,0.156863,-0.583232,...,0.340909,0.340909,0.416183,0.416183,current,1,1,0,0,0
4,1,76.0,0.949950,0.949950,1.515058,1.515058,20.14,0.118231,0.118231,-1.081970,...,0.340909,0.340909,0.416183,0.416183,current,1,1,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99995,0,80.0,1.000000,1.000000,1.692704,1.692704,27.32,0.202031,0.202031,-0.000116,...,0.045455,0.045455,-1.180558,-1.180558,No Info,0,0,0,0,0
99996,0,2.0,0.024024,0.024024,-1.771388,-1.771388,17.37,0.085901,0.085901,-1.499343,...,0.090909,0.090909,-0.934905,-0.934905,No Info,0,0,0,0,0
99997,1,66.0,0.824825,0.824825,1.070944,1.070944,27.83,0.207983,0.207983,0.076729,...,0.340909,0.340909,0.416183,0.416183,former,0,1,0,0,0
99998,0,24.0,0.299299,0.299299,-0.794336,-0.794336,35.42,0.296569,0.296569,1.220361,...,0.090909,0.090909,-0.934905,-0.934905,never,0,0,0,0,0


In [37]:
# Save the modified DataFrame to a new CSV file named 'dataset.csv'
data.to_csv('dataset.csv', index=False)

## Splitting the Data:
- First, we divide the dataset into features (X) and the target variable (y), which represents diabetes status.
- Then we split the data into training (80%) and testing (20%) sets.

In [35]:
# Load the dataset.csv file into a DataFrame
file_path = 'dataset.csv'  # Replace with the path to your dataset.csv file
data = pd.read_csv(file_path)

# Specify the features and target variable
features = ['gender', 'age', 'age_norm_MinMax', 'age_norm_Standardization', 'bmi', 'bmi_norm_MinMax', 'bmi_norm_Standardization', 'HbA1c_level', 'HbA1c_level_norm_MinMax', 'HbA1c_level_norm_Standardization', 'blood_glucose_level', 'blood_glucose_level_norm_MinMax', 'blood_glucose_level_norm_Standardization', 'smoking_history', 'is_smoker', 'been_smoker', 'hypertension', 'heart_disease']
target = 'diabetes'

# Split the data into features (X) and target variable (y)
X = data[features]
y = data[target]

# Split the data into 80% training set and 20% testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the shapes of the resulting sets
print("Training set shape - Features:", X_train.shape, " Target:", y_train.shape)
print("Testing set shape - Features:", X_test.shape, " Target:", y_test.shape)


Training set shape - Features: (80000, 18)  Target: (80000,)
Testing set shape - Features: (20000, 18)  Target: (20000,)
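
Two details of the split above are worth a second look. The feature list contains the text column `smoking_history`, which scikit-learn estimators such as `LogisticRegression` cannot fit on directly, and the scalers were fitted on the full dataset before splitting, so test-set statistics influence the training features. The sketch below shows one possible alternative: select numeric features only, stratify on the target, and fit the scaler on the training portion alone. It runs on synthetic data; the frame `demo`, its distributions, and the 10% positive rate are arbitrary assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the dataset (values are arbitrary assumptions)
rng = np.random.default_rng(0)
n = 1000
demo = pd.DataFrame({
    "age": rng.uniform(1, 80, n),
    "bmi": rng.normal(27, 6, n),
    "hypertension": rng.integers(0, 2, n),
    "smoking_history": rng.choice(["never", "current", "former"], n),
    "diabetes": (rng.random(n) < 0.1).astype(int),
})

# Keep numeric features only; the text column is left out
numeric_features = ["age", "bmi", "hypertension"]
X = demo[numeric_features]
y = demo["diabetes"]

# stratify=y keeps the class ratio similar in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training set only, then apply it to both sets
to_scale = ["age", "bmi"]
scaler = StandardScaler().fit(X_train[to_scale])
X_train, X_test = X_train.copy(), X_test.copy()
X_train[to_scale] = scaler.transform(X_train[to_scale])
X_test[to_scale] = scaler.transform(X_test[to_scale])

print("Train:", X_train.shape, "Test:", X_test.shape)
print("Positive rate train/test:", y_train.mean(), y_test.mean())
```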




## Analyzing Model function:
This function reports accuracy, precision, recall, and F1-score. We call it for each model and present the results.

In [38]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Calculate evaluation metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    return accuracy, precision, recall, f1

# Assume X_train, X_test, y_train, y_test are already defined from previous splits
# model must be a classifier instance (e.g., LogisticRegression(), KNeighborsClassifier()).
# LinearRegression() predicts continuous values, which the classification metrics above reject.

# Example usage:
# model = LogisticRegression(max_iter=1000)  # Replace with the specific classifier you want to evaluate

# accuracy, precision, recall, f1 = evaluate_model(model, X_train, X_test, y_train, y_test)
# print(f"Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}")
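
As a hedged illustration of how the function might be called for several classifiers, the sketch below repeats `evaluate_model` so it runs on its own and applies it to synthetic, imbalanced data from `make_classification`. The data, the `models` dictionary, and the hyperparameters (`max_iter=1000`, `n_neighbors=5`) are assumptions for demonstration, not results from the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Same logic as the function above, repeated so this sketch is self-contained."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return (
        accuracy_score(y_test, y_pred),
        precision_score(y_test, y_pred, average="weighted"),
        recall_score(y_test, y_pred, average="weighted"),
        f1_score(y_test, y_pred, average="weighted"),
    )

# Synthetic binary data with roughly 10% positives (arbitrary assumption)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Hypothetical set of classifiers to compare
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    results[name] = evaluate_model(model, X_train, X_test, y_train, y_test)
    accuracy, precision, recall, f1 = results[name]
    print(f"{name}: Accuracy={accuracy:.3f}, Precision={precision:.3f}, "
          f"Recall={recall:.3f}, F1={f1:.3f}")
```

With `average='weighted'`, recall is mathematically equal to accuracy, so on an imbalanced target it may be worth also reporting `recall_score(y_test, y_pred)` for the positive class alone.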
