In [1]:
# part 1
"""
This script uses PySpark to process a diabetes dataset stored in a CSV file. It first creates a
SparkSession named "Diabetes" and loads the dataset into a DataFrame called df. It then computes
the mean BMI (body mass index) over the whole column and builds an updated DataFrame, d, in which
every zero in the BMI column is replaced by that mean. Two further DataFrames are derived from the
original df: df_rs keeps the rows with an Age of at least 35, and df_fltd keeps the rows whose
Diabetes Pedigree Function value is greater than or equal to 0.51. Finally, the script prints the
updated-BMI DataFrame, the Age >= 35 DataFrame, and the Diabetes Pedigree Function >= 0.51
DataFrame, then stops the Spark session.
"""


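The zero-replacement step described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's PySpark code: the BMI values and the helper `impute_zeros` are hypothetical and exist only for illustration. It contrasts the mean taken over every row (zeros included, which is what an unfiltered column mean computes) with the mean taken over non-zero rows only, which may be preferable if zeros stand for missing measurements.

```python
# Hypothetical BMI values for illustration only; not taken from diabetes.csv.
bmi = [33.6, 0.0, 26.6, 0.0, 23.3]


def impute_zeros(values, fill):
    """Return a copy of `values` with every zero replaced by `fill`."""
    return [fill if v == 0 else v for v in values]


# Mean over all rows, zeros included.
mean_all = sum(bmi) / len(bmi)

# Mean over non-zero rows only (an alternative, assuming zeros mark missing data).
nonzero = [v for v in bmi if v != 0]
mean_nonzero = sum(nonzero) / len(nonzero)

# Replace the zeros using each candidate fill value.
imputed_all = impute_zeros(bmi, mean_all)
imputed_nonzero = impute_zeros(bmi, mean_nonzero)

print("mean (zeros included):", round(mean_all, 2))
print("mean (zeros excluded):", round(mean_nonzero, 2))
```

Because the zeros pull the all-rows mean down, the two fill values differ; which one is appropriate depends on whether a zero BMI is treated as a real value or as a placeholder for a missing one.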
In [2]:
!pip install pyspark --quiet



In [3]:

from pyspark.sql import SparkSession
from pyspark.sql.functions import mean, when

spark = SparkSession.builder \
    .appName("Diabetes") \
    .getOrCreate()


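# Load the CSV; the first row supplies column names and column types are inferred
df = spark.read.csv("diabetes.csv", header=True, inferSchema=True)

# Mean BMI over all rows (zero values included)
mn_bmi = df.select(mean("BMI")).collect()[0][0]
# Updated DataFrame: replace zero BMI values with the mean, keep all other values
d = df.withColumn("BMI", when(df["BMI"] == 0, mn_bmi).otherwise(df["BMI"]))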

# new DataFrame for rows with age >= 35
df_rs = df.filter(df["Age"] >= 35)

# new DataFrame for rows with DiabetesPedigreeFunction >= 0.51
df_fltd = df.filter(df["DiabetesPedigreeFunction"] >= 0.51)


print("Updated BMI:")
d.show()

print("Age >= 35:")
df_rs.show()

print("Diabetes Pedigree Function >= 0.51:")
df_fltd.show()

# Stop the Spark session once all results have been shown
spark.stop()


Updated BMI:
+-----------+-------+-------------+-------------+-------+------------------+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin|               BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+------------------+------------------------+---+-------+
|          6|    148|           72|           35|      0|              33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|              26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|              23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|              28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|              43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|              25.6