<a href="https://colab.research.google.com/github/cierrak18/Bio_Projects/blob/main/diabetes_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Diabetes Prediction using Apache Spark MLlib


Name: Cierra Britt

Date: 26Jan26

Summary:

Implemented an ML pipeline in PySpark to predict diabetes using Logistic Regression. The project involved:

*   Setting up Apache Spark in Google Colab and installing dependencies
*   Importing and exploring the diabetes dataset
*   Performing data cleaning, correlation analysis, and feature selection
*   Building and training a Logistic Regression model using Spark MLlib
*   Evaluating model performance with appropriate metrics and testing on unseen data
*   Saving and loading the trained model for future use

Tech Stack: PySpark, Apache Spark MLlib, Google Colab, Python
Key Skills: Big Data Analytics, ML, Data Preprocessing, Model Evaluation





Installation of PySpark & Exploration of Data

In [2]:
# install dependencies
! pip install pyspark



In [3]:
# Start Spark Session by building one
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark").getOrCreate()

In [4]:
# Clone dataset from github repository
! git clone https://github.com/education454/diabetes_dataset

Cloning into 'diabetes_dataset'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 6 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (6/6), 13.02 KiB | 13.02 MiB/s, done.


In [5]:
# Check folder contents
! ls diabetes_dataset

diabetes.csv  new_test.csv


In [6]:
# Read data; stored as 'df'
df = spark.read.csv('/content/diabetes_dataset/diabetes.csv',header=True,inferSchema=True)

In [7]:
# Show data
df.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          2|    138|           62|           35|      0|33.6|                   0.127| 47|      1|
|          0|     84|           82|           31|    125|38.2|                   0.233| 23|      0|
|          0|    145|            0|            0|      0|44.2|                    0.63| 31|      1|
|          0|    135|           68|           42|    250|42.3|                   0.365| 24|      1|
|          1|    139|           62|           41|    480|40.7|                   0.536| 21|      0|
|          0|    173|           78|           32|    265|46.5|                   1.159| 58|      0|
|          4|     99|           72|           17|      0|25.6|                   0.294| 28|      0|


In [8]:
# Display the variables for each column
df.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)



In [9]:
# Rows, Columns
print(df.count(),len(df.columns))

2000 9


In [10]:
# Patient has diabetes (1), no diabetes (0)
df.groupby('Outcome').count().show()

+-------+-----+
|Outcome|count|
+-------+-----+
|      1|  684|
|      0| 1316|
+-------+-----+
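
The class counts above give a useful reference point: a model that always predicts the majority class already reaches a certain accuracy, so any reported score should be read against that baseline. A minimal sketch using the counts shown in the table (the variable and function names are illustrative and not part of the notebook):

```python
# Class counts copied from the groupby output above.
class_counts = {1: 684, 0: 1316}

def majority_baseline(counts):
    """Return (majority_label, accuracy of always predicting that label)."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total

label, baseline_accuracy = majority_baseline(class_counts)
print(f"Always predicting {label} is correct {baseline_accuracy:.1%} of the time")
```

With 1316 of 2000 rows in class 0, the baseline sits at about 65.8%.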



In [11]:
# Summary of count, mean, std, min, and max
df.describe().show()

+-------+-----------------+------------------+------------------+-----------------+-----------------+------------------+------------------------+------------------+------------------+
|summary|      Pregnancies|           Glucose|     BloodPressure|    SkinThickness|          Insulin|               BMI|DiabetesPedigreeFunction|               Age|           Outcome|
+-------+-----------------+------------------+------------------+-----------------+-----------------+------------------+------------------------+------------------+------------------+
|  count|             2000|              2000|              2000|             2000|             2000|              2000|                    2000|              2000|              2000|
|   mean|           3.7035|          121.1825|           69.1455|           20.935|           80.254|32.192999999999984|     0.47092999999999974|           33.0905|             0.342|
| stddev|3.306063032730656|32.068635649902916|19.188314815604098|16.103242909926

Cleaning Data

In [12]:
# Count null values in each column
for col in df.columns:
  print(col+":",df[df[col].isNull()].count())

Pregnancies: 0
Glucose: 0
BloodPressure: 0
SkinThickness: 0
Insulin: 0
BMI: 0
DiabetesPedigreeFunction: 0
Age: 0
Outcome: 0


In [13]:
# Find total number of 0 values in columns: Glucose, BP, SkinThickness, Insulin, and BMI
def count_zero():
  columns_list = ['Glucose','BloodPressure','SkinThickness','Insulin', 'BMI']
  for i in columns_list:
    print(i+":",df[df[i]==0].count())

In [14]:
count_zero()

Glucose: 13
BloodPressure: 90
SkinThickness: 573
Insulin: 956
BMI: 28


In [17]:
# Import only `when` (a wildcard import would shadow built-ins such as sum, min, and max)
from pyspark.sql.functions import when
# For Glucose through BMI: compute the column mean, then replace zeros with it
for i in df.columns[1:6]:
  data = df.agg({i:'mean'}).first()[0]
  print("Mean value for {} is {}".format(i,int(data)))
  # The mean is truncated to an integer before it replaces the zero values
  df = df.withColumn(i,when(df[i]==0,int(data)).otherwise(df[i]))

Mean value for Glucose is 121
Mean value for BloodPressure is 69
Mean value for SkinThickness is 20
Mean value for Insulin is 80
Mean value for BMI is 32
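
One caveat about the means printed above: they are computed over every row, including the zero entries that are about to be replaced, which pulls each mean downward. Because a zero contributes nothing to the column sum, the mean of the non-zero rows can be recovered from the overall mean and the zero count. A rough sketch using the Insulin figures reported earlier (mean 80.254 over 2000 rows, 956 zeros); the helper name is hypothetical:

```python
def mean_excluding_zeros(overall_mean, n_rows, n_zeros):
    """Mean of the non-zero entries, given the mean over all rows.

    Zeros add nothing to the sum, so sum = overall_mean * n_rows and the
    non-zero mean is that sum divided by the number of non-zero rows.
    """
    n_nonzero = n_rows - n_zeros
    if n_nonzero <= 0:
        raise ValueError("column has no non-zero entries")
    return overall_mean * n_rows / n_nonzero

# Figures taken from the describe() and count_zero() outputs above.
insulin_nonzero_mean = mean_excluding_zeros(80.254, 2000, 956)
print(round(insulin_nonzero_mean, 1))
```

For Insulin this comes to roughly 154, nearly double the value of 80 used for imputation, so imputing with the non-zero mean (or median) may be worth considering.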


In [18]:
df.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          2|    138|           62|           35|     80|33.6|                   0.127| 47|      1|
|          0|     84|           82|           31|    125|38.2|                   0.233| 23|      0|
|          0|    145|           69|           20|     80|44.2|                    0.63| 31|      1|
|          0|    135|           68|           42|    250|42.3|                   0.365| 24|      1|
|          1|    139|           62|           41|    480|40.7|                   0.536| 21|      0|
|          0|    173|           78|           32|    265|46.5|                   1.159| 58|      0|
|          4|     99|           72|           17|     80|25.6|                   0.294| 28|      0|


Build & Train ML Model

In [19]:
# Pearson correlation of each column with Outcome
for col in df.columns:
  print("correlation to outcome for {} is {}".format(col,df.stat.corr('Outcome',col)))

correlation to outcome for Pregnancies is 0.22443699263363961
correlation to outcome for Glucose is 0.48796646527321064
correlation to outcome for BloodPressure is 0.17171333286446713
correlation to outcome for SkinThickness is 0.1659010662889893
correlation to outcome for Insulin is 0.1711763270226193
correlation to outcome for BMI is 0.2827927569760082
correlation to outcome for DiabetesPedigreeFunction is 0.1554590791569403
correlation to outcome for Age is 0.23650924717620253
correlation to outcome for Outcome is 1.0
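
`df.stat.corr` computes the Pearson correlation coefficient by default. As a reminder of what those numbers measure, here is a small pure-Python sketch of the same formula; the sample values are made up for illustration and are not taken from the dataset:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

glucose = [84, 99, 135, 145, 173]   # hypothetical sample
outcome = [0, 0, 1, 1, 1]           # hypothetical labels
print(pearson(glucose, outcome))
```

The value of 1.0 for Outcome against itself is expected and says nothing about its usefulness as a feature.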


In [23]:
# Every column correlates positively with Outcome, so all of them are kept as features
# Caution: 'Outcome' is also the label; listing it in inputCols leaks the target into the feature vector
# Assemble the selected columns into a single vector column
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Pregnancies','Glucose','BloodPressure','SkinThickness', 'Insulin', 'BMI','DiabetesPedigreeFunction','Age','Outcome'],outputCol='features')
output_data = assembler.transform(df)
output_data.printSchema()


root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)
 |-- features: vector (nullable = true)
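
Note that the feature list passed to `VectorAssembler` above includes `Outcome`, which is also the label. A common way to avoid that kind of target leakage is to derive the feature list from `df.columns` and drop the label explicitly. The sketch below shows one way to do it; `feature_columns` and `assemble_features` are hypothetical helper names, and the Spark import is deferred so the column-selection logic can be checked on its own:

```python
def feature_columns(columns, label="Outcome"):
    """Return every column name except the label, preserving order."""
    return [c for c in columns if c != label]

def assemble_features(df, label="Outcome", output_col="features"):
    """Add a feature vector built from all non-label columns of a Spark DataFrame."""
    from pyspark.ml.feature import VectorAssembler  # imported lazily; needs pyspark
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label),
        outputCol=output_col,
    )
    return assembler.transform(df)

# Column names as shown by df.printSchema() earlier in the notebook.
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
print(feature_columns(columns))
```

In the notebook this would be called as `output_data = assemble_features(df)`. Scores obtained that way would differ from the ones recorded here.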



In [24]:
output_data.show()

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|            features|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+--------------------+
|          2|    138|           62|           35|     80|33.6|                   0.127| 47|      1|[2.0,138.0,62.0,3...|
|          0|     84|           82|           31|    125|38.2|                   0.233| 23|      0|[0.0,84.0,82.0,31...|
|          0|    145|           69|           20|     80|44.2|                    0.63| 31|      1|[0.0,145.0,69.0,2...|
|          0|    135|           68|           42|    250|42.3|                   0.365| 24|      1|[0.0,135.0,68.0,4...|
|          1|    139|           62|           41|    480|40.7|                   0.536| 21|      0|[1.0,139.0,62.0,4...|
|          0|    173|           

In [25]:
# Logistic regression needs two columns: the feature vector and the label (Outcome)
from pyspark.ml.classification import LogisticRegression
final_data = output_data.select('features','Outcome')

In [26]:
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- Outcome: integer (nullable = true)



In [28]:
# Split data into training (70%) and test (30%) sets, then fit the model on the training set
train, test = final_data.randomSplit([0.7,0.3])
models = LogisticRegression(labelCol='Outcome')
model = models.fit(train)

In [29]:
# Training summary
summary = model.summary

In [30]:
summary.predictions.describe().show()

+-------+------------------+------------------+
|summary|           Outcome|        prediction|
+-------+------------------+------------------+
|  count|              1405|              1405|
|   mean|0.3366548042704626|0.3366548042704626|
| stddev|0.4727339692510063|0.4727339692510063|
|    min|               0.0|               0.0|
|    max|               1.0|               1.0|
+-------+------------------+------------------+
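
The training count of 1405 above, rather than exactly 1400, is consistent with how `randomSplit` works: the weights act as probabilities for assigning rows, so split sizes are approximate, and without a `seed` argument the split changes from run to run. The pure-Python sketch below imitates that behaviour on row indices only; it illustrates the idea and is not Spark's actual sampling code, and `bernoulli_split` is a made-up name:

```python
import random

def bernoulli_split(n_rows, train_fraction, seed):
    """Assign each row index to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for i in range(n_rows):
        (train if rng.random() < train_fraction else test).append(i)
    return train, test

train_a, test_a = bernoulli_split(2000, 0.7, seed=42)
train_b, test_b = bernoulli_split(2000, 0.7, seed=42)
print(len(train_a), len(test_a))   # roughly 1400 / 600
print(train_a == train_b)          # True: same seed, same split
```

In PySpark the equivalent is to pass a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`; the value 42 is arbitrary.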



Evaluation & Test Model

In [31]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# Score the held-out test set with the fitted model
predictions = model.evaluate(test)

In [33]:
predictions.predictions.show(20)

+--------------------+-------+--------------------+--------------------+----------+
|            features|Outcome|       rawPrediction|         probability|prediction|
+--------------------+-------+--------------------+--------------------+----------+
|[0.0,57.0,60.0,20...|      0|[20.1497915282250...|[0.99999999822557...|       0.0|
|[0.0,74.0,52.0,10...|      0|[20.3059238277408...|[0.99999999848207...|       0.0|
|[0.0,78.0,88.0,29...|      0|[19.6300702965373...|[0.99999999701620...|       0.0|
|[0.0,84.0,64.0,22...|      0|[19.7647004821325...|[0.99999999739204...|       0.0|
|[0.0,84.0,64.0,22...|      0|[19.7647004821325...|[0.99999999739204...|       0.0|
|[0.0,91.0,68.0,32...|      0|[19.4748633751877...|[0.99999999651522...|       0.0|
|[0.0,93.0,60.0,20...|      0|[19.7350485068844...|[0.99999999731355...|       0.0|
|[0.0,93.0,60.0,25...|      0|[19.7943582078265...|[0.99999999746825...|       0.0|
|[0.0,93.0,60.0,25...|      0|[19.7943582078265...|[0.99999999746825...|    

In [34]:
# Area under the ROC curve on the test set (the evaluator's default metric)
evaluator = BinaryClassificationEvaluator(rawPredictionCol='rawPrediction',labelCol='Outcome')
evaluator.evaluate(model.transform(test))

1.0
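
`BinaryClassificationEvaluator` reports the area under the ROC curve by default, which can be read as the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative one. A compact pure-Python sketch of that pairwise definition (quadratic in the number of rows, so suitable only for small illustrations; the scores below are invented):

```python
def auc_pairwise(labels, scores):
    """Area under ROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical model scores
print(auc_pairwise(labels, scores))   # 0.75
```

A score of exactly 1.0, as recorded above, means every positive row outranks every negative row. With `Outcome` included in the feature vector, that is the expected result rather than evidence of a good model.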

In [35]:
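# Save the trained model, overwriting any earlier copy so re-running the cell does not fail
model.write().overwrite().save("model")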

In [36]:
# Reload Model
from pyspark.ml.classification import LogisticRegressionModel
model = LogisticRegressionModel.load('model')