# Uncovering Health Trends Through Machine Learning Analysis of BRFSS Data

**Introduction**

This project applies machine learning to the 2020 Behavioral Risk Factor Surveillance System (BRFSS) data to study how BMI, weight, and height relate to health outcomes. It has two goals: to identify population-level trends and correlations between these body measurements and health risks, and to test how well an individual's current BMI, weight, and height predict their health outcomes. Ultimately, the project seeks to clarify how far these measurements can be used to assess health risks and outcomes.

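The Illness subset of the 2020 BRFSS data used in this analysis draws on the following columns. All values are stored as numeric codes:

- STATE: The numeric code of the respondent's state or territory of residence (1.0 for Alabama in the sample rows).

- STATENAME: The full name of the respondent's state or territory of residence. This column is in the full dataset but is not selected into the Illness subset.

- AGE: The reported age of the respondent, stored as a code rather than in years (the sample rows show values from 5.0 to 14.0).

- SEXVAR: The reported sex of the respondent.

- ASTHMA: Whether or not the respondent has asthma.

- ARTHRITIS: Whether or not the respondent has arthritis.

- DEPRESSION: Whether or not the respondent has depression.

- DIABETE: Whether or not the respondent has diabetes.

- HEARTDISEASE: Whether or not the respondent has heart disease.

- STROKE: Whether or not the respondent has had a stroke.

- BMI: The body mass index (BMI) of the respondent, stored as a category code from 1.0 to 4.0 rather than as the raw index.

- HIV: Whether or not the respondent has HIV/AIDS.

- HEARTATT: Whether or not the respondent has had a heart attack.

- CONFUSSION: Whether or not the respondent has confusion or disorientation.

- GENHLTH: The respondent's self-reported general health.

- WEIGHT: The reported weight of the respondent. Values such as 7777.0 and 9999.0 in the sample rows appear to be non-response codes rather than weights.

- HEIGHT: The reported height of the respondent, stored as a code (for example 507.0 in the sample rows) rather than in a single unit.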

- **Importing Libraries**

In [0]:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
# Use the F alias instead of a wildcard import, which would shadow Python built-ins such as sum and max
import pyspark.sql.functions as F
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

- **Loading the dataset**

In [0]:
filePath = "/tmp/sf-brfss2020.parquet"
brfss2020_df = spark.read.parquet(filePath)
display(brfss2020_df)

STATE,AGE,SEXVAR,MARITAL,EDUCA,RENTHOME,CELLPHONES,VETERAN,EMPLOYE,INCOME,CHILDREN,WEIGHT,HEIGHT,BMI,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH,STATENAME
1.0,8.0,2.0,2.0,6.0,1.0,1.0,2.0,4.0,1.0,88.0,106.0,507.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0,Alabama
1.0,10.0,2.0,3.0,6.0,1.0,1.0,2.0,7.0,99.0,88.0,170.0,504.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0,Alabama
1.0,10.0,2.0,1.0,5.0,1.0,1.0,2.0,7.0,7.0,88.0,7777.0,508.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0,Alabama
1.0,13.0,2.0,3.0,4.0,1.0,9.0,2.0,5.0,99.0,88.0,9999.0,9999.0,,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,Alabama
1.0,13.0,2.0,3.0,6.0,2.0,8.0,2.0,7.0,77.0,88.0,126.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,Alabama
1.0,10.0,1.0,4.0,4.0,3.0,1.0,2.0,8.0,5.0,88.0,180.0,509.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0,Alabama
1.0,12.0,2.0,1.0,4.0,1.0,2.0,2.0,7.0,6.0,88.0,150.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0,Alabama
1.0,10.0,2.0,1.0,4.0,1.0,1.0,2.0,7.0,5.0,88.0,150.0,503.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0,Alabama
1.0,5.0,2.0,2.0,6.0,1.0,2.0,2.0,1.0,6.0,2.0,170.0,511.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0,Alabama
1.0,12.0,2.0,3.0,2.0,1.0,8.0,2.0,7.0,99.0,88.0,163.0,503.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0,Alabama
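
The sample rows above contain WEIGHT and HEIGHT values such as 7777.0, 9999.0, and 507.0, which are unlikely to be plain measurements. The sketch below shows one way such codes could be decoded before analysis. It assumes the public BRFSS codebook conventions (7777 and 9999 for "don't know" and "refused", a leading 9 for metric responses, and height stored as feet times 100 plus inches). These conventions are an assumption about this file and should be checked against its codebook; the function names are hypothetical.

```python
from typing import Optional

CM_PER_INCH = 2.54
KG_PER_LB = 0.45359237
NON_RESPONSE_CODES = (7777, 9999)  # assumed "don't know" / "refused" codes


def decode_height_inches(code: Optional[float]) -> Optional[float]:
    """Convert an assumed BRFSS-style height code to inches."""
    if code is None or int(code) in NON_RESPONSE_CODES:
        return None
    value = int(code)
    if value >= 9000:
        # Assumed metric response: 9 followed by centimeters, e.g. 9155 -> 155 cm
        return (value - 9000) / CM_PER_INCH
    # Assumed imperial response: feet * 100 + inches, e.g. 507 -> 5 ft 7 in
    feet, inches = divmod(value, 100)
    return float(feet * 12 + inches)


def decode_weight_lb(code: Optional[float]) -> Optional[float]:
    """Convert an assumed BRFSS-style weight code to pounds."""
    if code is None or int(code) in NON_RESPONSE_CODES:
        return None
    value = int(code)
    if value >= 9000:
        # Assumed metric response: 9 followed by kilograms
        return (value - 9000) / KG_PER_LB
    return float(value)


print(decode_height_inches(507.0))   # 5 ft 7 in expressed in inches
print(decode_height_inches(9999.0))  # non-response becomes None
print(decode_weight_lb(7777.0))      # non-response becomes None
```

In the notebook, the same logic could be expressed with `F.when` conditions or wrapped in a UDF so that it runs on the Spark DataFrame.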


- **Illness Dataset**

In [0]:
Illness = brfss2020_df["STATE","AGE","SEXVAR","ASTHMA","ARTHRITIS","DEPRESSION","DIABETE","HEARTDISEASE","STROKE","HIV","HEARTATT","CONFUSSION","GENHLTH","WEIGHT","HEIGHT","BMI"]
display(Illness)

STATE,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,GENHLTH,WEIGHT,HEIGHT,BMI
1.0,8.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,2.0,106.0,507.0,1.0
1.0,10.0,2.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,3.0,170.0,504.0,3.0
1.0,10.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,3.0,7777.0,508.0,
1.0,13.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,1.0,9999.0,9999.0,
1.0,13.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,2.0,126.0,506.0,2.0
1.0,10.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,4.0,180.0,509.0,3.0
1.0,12.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,3.0,150.0,506.0,2.0
1.0,10.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,4.0,150.0,503.0,3.0
1.0,5.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,170.0,511.0,2.0
1.0,12.0,2.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,4.0,163.0,503.0,3.0


In [0]:
Illness.columns

Out[178]: ['STATE',
 'AGE',
 'SEXVAR',
 'ASTHMA',
 'ARTHRITIS',
 'DEPRESSION',
 'DIABETE',
 'HEARTDISEASE',
 'STROKE',
 'HIV',
 'HEARTATT',
 'CONFUSSION',
 'GENHLTH',
 'WEIGHT',
 'HEIGHT',
 'BMI']

In [0]:
#print the schema
Illness.printSchema()

root
 |-- STATE: double (nullable = true)
 |-- AGE: double (nullable = true)
 |-- SEXVAR: double (nullable = true)
 |-- ASTHMA: double (nullable = true)
 |-- ARTHRITIS: double (nullable = true)
 |-- DEPRESSION: double (nullable = true)
 |-- DIABETE: double (nullable = true)
 |-- HEARTDISEASE: double (nullable = true)
 |-- STROKE: double (nullable = true)
 |-- HIV: double (nullable = true)
 |-- HEARTATT: double (nullable = true)
 |-- CONFUSSION: double (nullable = true)
 |-- GENHLTH: double (nullable = true)
 |-- WEIGHT: double (nullable = true)
 |-- HEIGHT: double (nullable = true)
 |-- BMI: double (nullable = true)



In [0]:
#Count the total number of rows and columns
print((Illness.count(), len(Illness.columns)))

(401958, 16)


- **Eliminating the Null Values**

In [0]:
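# Count the null values in each column (the loop variable is not named `col`,
# which would shadow pyspark.sql.functions.col)
for column_name in Illness.columns:
    print(column_name+":",Illness[Illness[column_name].isNull()].count())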

STATE: 0
AGE: 0
SEXVAR: 0
ASTHMA: 3
ARTHRITIS: 5
DEPRESSION: 6
DIABETE: 6
HEARTDISEASE: 3
STROKE: 3
HIV: 34037
HEARTATT: 6
CONFUSSION: 334120
GENHLTH: 8
WEIGHT: 9852
HEIGHT: 10824
BMI: 41357
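
Raw null counts are easier to judge as a share of the 401958 rows reported earlier. The short sketch below recomputes those shares from the counts printed above; it uses plain Python rather than Spark so it can be run anywhere. It shows that a single column, CONFUSSION, is null in most rows, so dropping every row with any null value discards most of the dataset.

```python
# Null counts copied from the output above for the columns with the most nulls
total_rows = 401958
null_counts = {
    "HIV": 34037,
    "CONFUSSION": 334120,
    "WEIGHT": 9852,
    "HEIGHT": 10824,
    "BMI": 41357,
}

# Share of rows that are null in each column
null_share = {name: count / total_rows for name, count in null_counts.items()}

for name, share in sorted(null_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.1%} null")
```

One alternative, depending on the research question, would be to leave a mostly empty column out of the subset instead of dropping the rows where it is null.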


In [0]:
# Drop every row that has a null value in any of the selected columns
Illness = Illness.na.drop(subset=Illness.columns)
display(Illness)

STATE,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,GENHLTH,WEIGHT,HEIGHT,BMI
2.0,8.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,7.0,2.0,4.0,183.0,501.0,4.0
2.0,11.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,3.0,172.0,505.0,3.0
2.0,8.0,1.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,180.0,600.0,2.0
2.0,10.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,4.0,185.0,506.0,3.0
2.0,12.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,3.0,155.0,503.0,3.0
2.0,8.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,169.0,511.0,2.0
2.0,9.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,165.0,511.0,2.0
2.0,10.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,3.0,200.0,504.0,4.0
2.0,10.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,1.0,2.0,2.0,3.0,255.0,510.0,4.0
2.0,14.0,1.0,9.0,9.0,9.0,7.0,9.0,9.0,2.0,9.0,2.0,3.0,140.0,505.0,2.0


- **Correlations Between BMI, Weight, and Height and Health Outcomes**

In [0]:
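# Note: only the correlation with BMI is computed here, despite the printed label;
# WEIGHT and HEIGHT would each need their own stat.corr call
for i in Illness.columns:
    print("Correlation to BMI, WEIGHT, HEIGHT for {} is {}".format(i,Illness.stat.corr('BMI',i)))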

Correlation to BMI, WEIGHT, HEIGHT for STATE is 0.015256600601681682
Correlation to BMI, WEIGHT, HEIGHT for AGE is -0.1265098436882831
Correlation to BMI, WEIGHT, HEIGHT for SEXVAR is -0.09400423227042755
Correlation to BMI, WEIGHT, HEIGHT for ASTHMA is -0.058003414597922624
Correlation to BMI, WEIGHT, HEIGHT for ARTHRITIS is -0.08366692351506289
Correlation to BMI, WEIGHT, HEIGHT for DEPRESSION is -0.05569453613225473
Correlation to BMI, WEIGHT, HEIGHT for DIABETE is -0.17952315031371444
Correlation to BMI, WEIGHT, HEIGHT for HEARTDISEASE is -0.015121605108229885
Correlation to BMI, WEIGHT, HEIGHT for STROKE is -0.013335731228753575
Correlation to BMI, WEIGHT, HEIGHT for HIV is -0.014508816648644122
Correlation to BMI, WEIGHT, HEIGHT for HEARTATT is -0.02296934732695927
Correlation to BMI, WEIGHT, HEIGHT for CONFUSSION is -0.01474256184818663
Correlation to BMI, WEIGHT, HEIGHT for GENHLTH is 0.1744142614898635
Correlation to BMI, WEIGHT, HEIGHT for WEIGHT is 0.0955801555923166
Correla
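
For reference, `stat.corr` computes the Pearson correlation coefficient by default. The sketch below is a plain-Python version of that formula, included only to make the numbers above easier to interpret; the function name `pearson` is hypothetical. The sign depends on how the columns are coded: if 1.0 means "yes" and higher codes mean "no" (an assumption about this dataset), a negative coefficient means that a higher BMI code goes with reporting the condition.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(var_x * var_y)


# Made-up illustration: BMI category codes against a yes(1)/no(2) condition code
bmi_codes = [2.0, 3.0, 4.0, 4.0, 2.0, 3.0]
condition_codes = [2.0, 2.0, 1.0, 1.0, 2.0, 2.0]
print(pearson(bmi_codes, condition_codes))
```

Because the coefficient is computed on the raw codes, non-response codes such as 7.0 or 9.0 are treated as large numeric values and can distort it.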

- **Importing Libraries**

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator

- **Preparing the Data**

In [0]:
# Combine the three predictor columns into a single feature vector
assembler = VectorAssembler(inputCols=["WEIGHT","HEIGHT","BMI"], outputCol='features')
Illness_vectorized = assembler.transform(Illness)

In [0]:
final_data = Illness_vectorized.select('features','HEARTDISEASE')
final_data.show()

+------------------+------------+
|          features|HEARTDISEASE|
+------------------+------------+
| [183.0,501.0,4.0]|         1.0|
| [172.0,505.0,3.0]|         2.0|
| [180.0,600.0,2.0]|         2.0|
| [185.0,506.0,3.0]|         1.0|
| [155.0,503.0,3.0]|         2.0|
| [169.0,511.0,2.0]|         2.0|
| [165.0,511.0,2.0]|         2.0|
| [200.0,504.0,4.0]|         2.0|
| [255.0,510.0,4.0]|         2.0|
| [140.0,505.0,2.0]|         9.0|
| [140.0,508.0,2.0]|         2.0|
| [275.0,506.0,4.0]|         2.0|
| [169.0,501.0,4.0]|         2.0|
| [160.0,509.0,2.0]|         2.0|
| [140.0,506.0,2.0]|         2.0|
| [158.0,505.0,3.0]|         2.0|
| [150.0,509.0,2.0]|         2.0|
|[200.0,9155.0,4.0]|         2.0|
| [112.0,503.0,2.0]|         2.0|
| [215.0,510.0,4.0]|         2.0|
+------------------+------------+
only showing top 20 rows
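
The three features are closely related: BMI is derived from weight and height. The sketch below recomputes a BMI category for the first three feature vectors shown above. It assumes that WEIGHT is in pounds, that a HEIGHT value such as 501.0 means 5 ft 1 in (61 inches), and that the BMI column uses the categories 1 = underweight, 2 = normal, 3 = overweight, 4 = obese with the usual cutoffs of 18.5, 25, and 30. All of these are assumptions about the dataset, and the function names are hypothetical.

```python
def bmi_from_imperial(weight_lb: float, height_in: float) -> float:
    """BMI in kg/m^2 from pounds and inches (703 is the usual conversion factor)."""
    return 703.0 * weight_lb / (height_in ** 2)


def bmi_category(bmi: float) -> int:
    """Assumed category codes: 1 underweight, 2 normal, 3 overweight, 4 obese."""
    if bmi < 18.5:
        return 1
    if bmi < 25.0:
        return 2
    if bmi < 30.0:
        return 3
    return 4


# (weight in lb, height in inches, BMI code from the table above)
rows = [
    (183.0, 61.0, 4.0),  # HEIGHT 501.0 read as 5 ft 1 in
    (172.0, 65.0, 3.0),  # HEIGHT 505.0 read as 5 ft 5 in
    (180.0, 72.0, 2.0),  # HEIGHT 600.0 read as 6 ft 0 in
]

for weight_lb, height_in, reported_code in rows:
    bmi = bmi_from_imperial(weight_lb, height_in)
    print(f"BMI {bmi:.1f} -> category {bmi_category(bmi)} (reported {reported_code:.0f})")
```

If the assumptions hold, the BMI column adds little information beyond WEIGHT and HEIGHT, and the raw HEIGHT code is not a linear scale (511.0 is followed by 600.0), which matters for a linear model.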



In [0]:
#print schema of final data
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- HEARTDISEASE: double (nullable = true)



- **Train/Test Split**

In [0]:
#splitting the data into training and testing datasets
(training_data, testing_data) = Illness_vectorized.randomSplit([0.7, 0.3])


- **Creating Model**

In [0]:
lr = LinearRegression(labelCol='HEARTDISEASE')
model = lr.fit(training_data)
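
Linear regression treats the HEARTDISEASE codes as quantities, so a value of 9.0 (visible in the feature table above) counts as a very large outcome rather than as a missing answer. The sketch below shows one way the label could be recoded into a binary target first. It assumes 1.0 means "yes", 2.0 means "no", and any other code is a non-response; this coding is an assumption about the dataset, and `to_binary_label` is a hypothetical helper.

```python
from typing import Optional


def to_binary_label(code: Optional[float]) -> Optional[int]:
    """Map an assumed yes/no survey code to 1/0; anything else becomes None."""
    if code == 1.0:
        return 1  # assumed "yes"
    if code == 2.0:
        return 0  # assumed "no"
    return None   # assumed non-response (e.g. 7.0, 9.0) or missing


# HEARTDISEASE codes taken from the first rows of the feature table above
codes = [1.0, 2.0, 2.0, 1.0, 2.0, 9.0]
labels = [to_binary_label(code) for code in codes]
print(labels)

# Rows without a usable label would be dropped before fitting a model
usable = [label for label in labels if label is not None]
print(f"{len(usable)} of {len(labels)} rows have a usable label")
```

A binary label of this kind is better suited to a classifier, such as `pyspark.ml.classification.LogisticRegression`, than to linear regression.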

In [0]:
summary = model.summary
model.summary.predictions.show()

+-----+---+------+------+---------+----------+-------+------------+------+---+--------+----------+-------+------+------+---+-----------------+------------------+
|STATE|AGE|SEXVAR|ASTHMA|ARTHRITIS|DEPRESSION|DIABETE|HEARTDISEASE|STROKE|HIV|HEARTATT|CONFUSSION|GENHLTH|WEIGHT|HEIGHT|BMI|         features|        prediction|
+-----+---+------+------+---------+----------+-------+------------+------+---+--------+----------+-------+------+------+---+-----------------+------------------+
|  2.0|6.0|   1.0|   1.0|      1.0|       2.0|    3.0|         2.0|   2.0|1.0|     2.0|       2.0|    3.0| 265.0| 510.0|4.0|[265.0,510.0,4.0]|1.9556326865796583|
|  2.0|6.0|   1.0|   1.0|      1.0|       2.0|    3.0|         2.0|   2.0|2.0|     2.0|       2.0|    2.0| 320.0| 508.0|4.0|[320.0,508.0,4.0]|1.9551457781637698|
|  2.0|6.0|   1.0|   1.0|      2.0|       1.0|    1.0|         1.0|   2.0|1.0|     1.0|       1.0|    4.0| 240.0| 601.0|4.0|[240.0,601.0,4.0]|1.9558617605358457|
|  2.0|6.0|   1.0|   1.0|   

In [0]:
display(model.summary.predictions)

STATE,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,GENHLTH,WEIGHT,HEIGHT,BMI,features,prediction
2.0,6.0,1.0,1.0,1.0,2.0,3.0,2.0,2.0,1.0,2.0,2.0,3.0,265.0,510.0,4.0,"Map(vectorType -> dense, length -> 3, values -> List(265.0, 510.0, 4.0))",1.9556326865796585
2.0,6.0,1.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,320.0,508.0,4.0,"Map(vectorType -> dense, length -> 3, values -> List(320.0, 508.0, 4.0))",1.9551457781637696
2.0,6.0,1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,4.0,240.0,601.0,4.0,"Map(vectorType -> dense, length -> 3, values -> List(240.0, 601.0, 4.0))",1.9558617605358457
2.0,6.0,1.0,1.0,2.0,1.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,170.0,511.0,2.0,"Map(vectorType -> dense, length -> 3, values -> List(170.0, 511.0, 2.0))",1.9814379065737124
2.0,6.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,1.0,2.0,2.0,5.0,190.0,509.0,3.0,"Map(vectorType -> dense, length -> 3, values -> List(190.0, 509.0, 3.0))",1.9687785356700012
2.0,6.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,1.0,2.0,2.0,2.0,155.0,510.0,2.0,"Map(vectorType -> dense, length -> 3, values -> List(155.0, 510.0, 2.0))",1.9815705667980936
2.0,6.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,1.0,2.0,2.0,5.0,200.0,602.0,3.0,"Map(vectorType -> dense, length -> 3, values -> List(200.0, 602.0, 3.0))",1.9686980404206824
2.0,6.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,1.0,168.0,509.0,2.0,"Map(vectorType -> dense, length -> 3, values -> List(168.0, 509.0, 2.0))",1.9814554339847077
2.0,6.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,165.0,510.0,2.0,"Map(vectorType -> dense, length -> 3, values -> List(165.0, 510.0, 2.0))",1.9814820692846051
2.0,6.0,1.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,3.0,250.0,603.0,4.0,"Map(vectorType -> dense, length -> 3, values -> List(250.0, 603.0, 4.0))",1.95577343511406


In [0]:
#summary of the model
summary = model.summary
display(summary.predictions.describe())

summary,STATE,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,GENHLTH,WEIGHT,HEIGHT,BMI,prediction
count,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0,44163.0
mean,29.08615809614383,9.68360392183502,1.550641940085592,1.881393927043,1.6199986413966443,1.8361977220750405,2.716663270158277,1.9691370604352056,1.9662839933881304,2.112854652084324,1.95208658832054,1.9557548173810653,2.5433054819645404,189.287933337862,532.6262029300545,2.9719448407037565,1.9691370604351852
stddev,19.850267802799465,2.1688808511642232,0.4974344145786261,0.4527575386867858,0.632599036269616,0.5216011048174449,0.7874175745714884,0.5562212997217422,0.3681292622304223,1.742319201976565,0.4575305028776762,0.5581032260537717,1.1006781638200498,311.63934507161974,367.0716658070564,0.8224775711519756,0.0108993286678681
min,2.0,6.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,66.0,300.0,1.0,1.8762057379946084
max,72.0,14.0,2.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9.0,9240.0,9193.0,4.0,1.9951959943742088


In [0]:
lrPredictions = model.transform(testing_data)
lrPredictions

Out[193]: DataFrame[STATE: double, AGE: double, SEXVAR: double, ASTHMA: double, ARTHRITIS: double, DEPRESSION: double, DIABETE: double, HEARTDISEASE: double, STROKE: double, HIV: double, HEARTATT: double, CONFUSSION: double, GENHLTH: double, WEIGHT: double, HEIGHT: double, BMI: double, features: vector, prediction: double]

In [0]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# Accuracy counts exact matches between prediction and label, so continuous
# regression predictions (e.g. 1.9556) will not match labels such as 2.0
eval_accuracy = MulticlassClassificationEvaluator(labelCol="HEARTDISEASE",  metricName="accuracy")
accuracy = eval_accuracy.evaluate(lrPredictions)

In [0]:
print("Accuracy: %f" % accuracy)

Accuracy: 0.000000
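
An accuracy of exactly zero is most likely a consequence of the metric rather than of the data: accuracy counts predictions that equal the label exactly, and a regression model outputs continuous values. The sketch below illustrates this with three prediction and label pairs copied from the prediction table shown earlier; the `accuracy` helper is a hypothetical stand-in for the evaluator, not its implementation.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that are exactly equal to their label."""
    matches = sum(1 for p, y in zip(predictions, labels) if p == y)
    return matches / len(labels)


# Values copied from the first three rows of the prediction table
predictions = [1.9556326865796583, 1.9551457781637698, 1.9558617605358457]
labels = [2.0, 2.0, 1.0]

exact = accuracy(predictions, labels)
rounded = accuracy([float(round(p)) for p in predictions], labels)

print(f"exact-match accuracy: {exact}")
print(f"accuracy after rounding predictions: {rounded:.3f}")
```

Rounding makes the metric computable, but a model that always rounds to 2.0 is only reproducing the most common code, so accuracy alone says little here.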


In [0]:
print("numIterations: %d" % summary.totalIterations)
print("objectiveHistory: %s" % str(summary.objectiveHistory))
summary.residuals.show()

numIterations: 0
objectiveHistory: [0.0]
+--------------------+
|           residuals|
+--------------------+
|0.044367313420341725|
|0.044854221836230224|
| -0.9558617605358457|
|0.018562093426287563|
|0.031221464329998616|
|0.018429433201906376|
|0.031301959579317806|
| 0.01854456601529253|
|0.018517930715394648|
| 0.04422656488594012|
|0.018305966912278926|
| 0.04410173483402535|
| 0.04423473924181165|
| 0.01829720320678141|
| 0.03122129223829595|
| 0.03130187353346647|
|0.018642932859011863|
|0.031221378284147283|
| 0.04392491189875125|
| 0.04418248822089832|
+--------------------+
only showing top 20 rows



In [0]:
print("RMSE: %f" % summary.rootMeanSquaredError)
print("r2: %f" % summary.r2)

RMSE: 0.556108
r2: 0.000384
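
To help read these two numbers, the sketch below computes RMSE and R2 from their standard definitions on a small made-up example. It shows that a model which always predicts the mean of the labels has an R2 of zero, which is essentially the situation reported above. The function names and the toy data are illustrative only.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))


def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Toy labels and a "model" that always predicts their mean
toy_labels = [1.0, 2.0, 2.0, 2.0]
mean_only = [sum(toy_labels) / len(toy_labels)] * len(toy_labels)

print(f"RMSE: {rmse(toy_labels, mean_only):.4f}")
print(f"R2:   {r_squared(toy_labels, mean_only):.4f}")
```

In the summary table above, the prediction column has a standard deviation of about 0.011 against about 0.556 for HEARTDISEASE, which is consistent with a model that predicts close to the mean for every row.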


- **Conclusion**

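This analysis applied linear regression to the 2020 Behavioral Risk Factor Surveillance System (BRFSS) data to test whether BMI, weight, and height can be used to assess health risks and outcomes. The correlations between BMI and the individual health variables are weak (the largest in magnitude are about -0.18 for DIABETE and 0.17 for GENHLTH), and the model for HEARTDISEASE has very low predictive power: R2 is 0.000384, RMSE is 0.556108, and the reported accuracy is 0.0. As modeled here, these three measurements explain almost none of the variation in the heart disease variable. Additional research is needed to further understand the relationship between BMI, weight, and height and health outcomes.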