
In [1]:
# Import dependencies
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
pd.set_option('display.max_columns', None)

In [2]:
# Mount Google Drive so the dataset can be read from it
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Read CSV file directly without changing directory
file_path = '/content/drive/My Drive/predicting_diabetes/Resources/diabetes_2_classes.csv'
diabetes_df = pd.read_csv(file_path)
diabetes_df.head()

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


The cells below inspect each non-binary feature to determine how it should be scaled to produce the most accurate model.
<br>-BMI: continuous variable
<br>-Age: binned into categories from 1 - 13
<br>-General Health (GenHlth): categorical variable from 1 - 5
<br>-Mental Health (MentHlth): number of days, ranging from 0 - 30
<br>-Physical Health (PhysHlth): number of days, ranging from 0 - 30
<br>-Education: categorical variable from 1 - 6
<br>-Income: categorical variable from 1 - 8
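
The ranges listed above can also be checked in a single table rather than one column at a time. The sketch below is illustrative only: `summarize_features` is a hypothetical helper that is not part of this notebook, and the sample frame reuses three columns from the five rows shown by `head()` instead of the full dataset.

```python
import pandas as pd


def summarize_features(df, columns):
    """Return the min, max, and number of distinct values for each column."""
    return pd.DataFrame({
        "min": df[columns].min(),
        "max": df[columns].max(),
        "n_unique": df[columns].nunique(),
    })


# Values copied from the five rows displayed by diabetes_df.head()
sample = pd.DataFrame({
    "BMI": [40.0, 25.0, 28.0, 27.0, 24.0],
    "GenHlth": [5.0, 3.0, 5.0, 2.0, 2.0],
    "MentHlth": [18.0, 0.0, 30.0, 0.0, 3.0],
})

feature_summary = summarize_features(sample, ["BMI", "GenHlth", "MentHlth"])
print(feature_summary)
```

On the real data, passing `diabetes_df` and the seven column names above would give the same overview that the `value_counts()` cells provide, minus the per-value frequencies.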


In [None]:
diabetes_df['BMI'].value_counts()

In [None]:
diabetes_df['Age'].value_counts()

In [None]:
diabetes_df['GenHlth'].value_counts()

In [None]:
diabetes_df['MentHlth'].value_counts()

In [None]:
diabetes_df['PhysHlth'].value_counts()

In [None]:
diabetes_df['Education'].value_counts()

In [None]:
diabetes_df['Income'].value_counts()

## Scaling continuous variables

In [4]:
# Copy the original df for scaling
scaled_df = diabetes_df.copy()

# Define the continuous columns to be scaled
scaled_cols = ['BMI', 'MentHlth', 'PhysHlth']

# Scale the columns in the copied dataframe
diabetes_scaled = StandardScaler().fit_transform(scaled_df[scaled_cols])

# Build a dataframe from the scaled columns, keeping the original row index
diabetes_scaled = pd.DataFrame(diabetes_scaled, columns=['BMI_scaled', 'MentHlth_scaled', 'PhysHlth_scaled'], index=scaled_df.index)


In [5]:
# Concatenate the original dataframe and the new scaled columns
diabetes_scaled = pd.concat([scaled_df, diabetes_scaled], axis=1)

In [6]:
# Drop the original, unscaled versions of the columns that were scaled
scaled_df = diabetes_scaled.drop(columns = scaled_cols)

## Logistic Regression Analysis

In [7]:
# Set our target and feature variables for the ML model
y = scaled_df['Diabetes_binary']
X = scaled_df.drop(columns='Diabetes_binary')

In [8]:
# Use scikit-learn to split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

## Note: only the numerical data should be scaled here

In [9]:
# Create the StandardScaler instance
scaler = StandardScaler()

# Fit the Standard Scaler with the training data
X_scaler = scaler.fit(X_train)

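# Scale the training and testing data
# Note: the model below is fit on X_train and scored on X_test, so these fully scaled arrays are not used
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)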
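
One way to act on the note about scaling only the numerical data is scikit-learn's `ColumnTransformer`, which applies a scaler to chosen columns and passes the rest through unchanged. Wrapping it in a `Pipeline` also means the scaler is fit on the training split only. The following is a minimal sketch under stated assumptions, not the approach used in this notebook: the data is synthetic, and every `demo_*` name, the `HighBP` stand-in column, and the rule that generates `demo_y` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values, not the diabetes dataset)
rng = np.random.default_rng(1)
n_rows = 400
demo_X = pd.DataFrame({
    "BMI": rng.normal(28, 6, n_rows),
    "MentHlth": rng.integers(0, 31, n_rows).astype(float),
    "PhysHlth": rng.integers(0, 31, n_rows).astype(float),
    "HighBP": rng.integers(0, 2, n_rows).astype(float),
})
# Hypothetical target rule, used only so the classifier has something to learn
demo_y = (demo_X["BMI"] + 5 * demo_X["HighBP"] + rng.normal(0, 3, n_rows) > 31).astype(int)

continuous_cols = ["BMI", "MentHlth", "PhysHlth"]

# Scale only the continuous columns; binary columns pass through untouched
preprocess = ColumnTransformer(
    transformers=[("scale_continuous", StandardScaler(), continuous_cols)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(solver="lbfgs", max_iter=500)),
])

demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=1
)

# Fitting the pipeline fits the scaler on the training split only
model.fit(demo_X_train, demo_y_train)
demo_accuracy = model.score(demo_X_test, demo_y_test)
print(f"Demo accuracy: {demo_accuracy:.3f}")
```

With this layout the test split never influences the scaler's mean and standard deviation, which is the main design difference from scaling the full dataframe before splitting.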

In [10]:
# Define the logistic regression model
log_classifier = LogisticRegression(solver="lbfgs", max_iter=500)

# Train the model
log_classifier.fit(X_train,y_train)

In [11]:
# Score the model
print(f"Training Data Score: {log_classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {log_classifier.score(X_test, y_test)}")

Training Data Score: 0.8639493324923788
Testing Data Score: 0.8625985493535162


In [15]:
# Predict outcomes for test data set
predictions = log_classifier.predict(X_test)
preds_v_actual = pd.DataFrame({"Prediction": predictions, "Actual": y_test})
preds_v_actual.head()

Unnamed: 0,Prediction,Actual
235899,0.0,0.0
74852,0.0,1.0
8205,0.0,0.0
127632,0.0,1.0
32021,0.0,0.0


In [16]:
# Create and save the testing classification report
testing_report = classification_report(y_test, predictions)

# Print the testing classification report
print(testing_report)

              precision    recall  f1-score   support

         0.0       0.88      0.98      0.92     54551
         1.0       0.53      0.15      0.23      8869

    accuracy                           0.86     63420
   macro avg       0.70      0.56      0.58     63420
weighted avg       0.83      0.86      0.83     63420
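
The support column in the report shows that the two classes are heavily imbalanced, so it can help to compare the model's accuracy with that of a trivial classifier that always predicts class 0.0. The arithmetic below is a small sketch using only numbers printed in this notebook; the variable names are illustrative.

```python
# Support counts copied from the classification report
support_no_diabetes = 54551  # class 0.0
support_diabetes = 8869      # class 1.0
total = support_no_diabetes + support_diabetes

# Accuracy of a trivial model that always predicts class 0.0
baseline_accuracy = support_no_diabetes / total

# Testing score printed by the scoring cell
model_accuracy = 0.8625985493535162

print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
print(f"Logistic regression accuracy:     {model_accuracy:.4f}")
print(f"Improvement over baseline:        {model_accuracy - baseline_accuracy:.4f}")
```

A small gap between the two figures, together with the recall of 0.15 for class 1.0, suggests that accuracy alone is not a sufficient measure of how well this model identifies diabetic cases.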



In [17]:
# Accuracy score for logistic regression
y_pred = log_classifier.predict(X_test)
print(f"Logistic regression model accuracy: {accuracy_score(y_test, y_pred):.3f}")

Logistic regression model accuracy: 0.863


Scaling only the continuous variables, rather than all of the variables, makes no difference to the model's accuracy (0.863 on the testing data).
