# BRFSS 2020 ILLNESS DATA

- **Dataset**

The BRFSS 2020 dataset contains survey responses on the health of the nation, covering asthma, arthritis, depression, diabetes, heart disease, stroke, BMI, HIV, heart attack, confusion, and self-rated general health. It can be used to estimate how common these conditions are and how they relate to one another, to identify health disparities and areas of need, and to inform public health policy and interventions.

Data Source - https://www.kaggle.com/datasets/aemreusta/brfss-2020-survey-data

columns - 279

size - 323.288 MB

The illness subset of BRFSS 2020 used in this notebook contains the following columns. Apart from STATENAME, the values are stored as numeric response codes:

- STATE: The numeric code of the respondent's state or territory of residence (for example, 1.0 for Alabama and 2.0 for Alaska in the rows shown below).

- STATENAME: The full name of the respondent's state or territory of residence.

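- AGE: The reported age of the respondent, stored as a coded age group (values such as 5.0 to 14.0 appear in the rows shown below) rather than as age in years.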

- SEXVAR: The reported sex of the respondent.

- ASTHMA: Whether or not the respondent has asthma.

- ARTHRITIS: Whether or not the respondent has arthritis.

- DEPRESSION: Whether or not the respondent has depression.

- DIABETE: Whether or not the respondent has diabetes.

- HEARTDISEASE: Whether or not the respondent has heart disease.

- STROKE: Whether or not the respondent has had a stroke.

- BMI: The body mass index (BMI) category of the respondent: 1 = underweight, 2 = normal weight, 3 = overweight, 4 = obese.

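- HIV: The respondent's answer to the HIV question (1 = yes, 2 = no, 9 = refused). Roughly 28% of the retained rows are coded yes later in this notebook, which suggests the column may record HIV testing rather than HIV/AIDS status; this should be checked against the source codebook.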

- HEARTATT: Whether or not the respondent has had a heart attack.

- CONFUSSION: Whether or not the respondent has confusion or disorientation.

- GENHLTH: The reported general health of the respondent, coded from 1 (excellent) to 5 (poor).

- **Importing Libraries**

In [0]:
# Import only what the notebook uses; wildcard imports shadow Python builtins such as sum and max.
from pyspark.sql.functions import when
import pyspark.sql.functions as F
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline



- **Loading the dataset**

In [0]:
filePath = "/tmp/sf-brfss2020.parquet"
brfss2020_df = spark.read.parquet(filePath)
display(brfss2020_df)

STATE,AGE,SEXVAR,MARITAL,EDUCA,RENTHOME,CELLPHONES,VETERAN,EMPLOYE,INCOME,CHILDREN,WEIGHT,HEIGHT,BMI,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH,STATENAME
1.0,8.0,2.0,2.0,6.0,1.0,1.0,2.0,4.0,1.0,88.0,106.0,507.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0,Alabama
1.0,10.0,2.0,3.0,6.0,1.0,1.0,2.0,7.0,99.0,88.0,170.0,504.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0,Alabama
1.0,10.0,2.0,1.0,5.0,1.0,1.0,2.0,7.0,7.0,88.0,7777.0,508.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0,Alabama
1.0,13.0,2.0,3.0,4.0,1.0,9.0,2.0,5.0,99.0,88.0,9999.0,9999.0,,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,Alabama
1.0,13.0,2.0,3.0,6.0,2.0,8.0,2.0,7.0,77.0,88.0,126.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,Alabama
1.0,10.0,1.0,4.0,4.0,3.0,1.0,2.0,8.0,5.0,88.0,180.0,509.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0,Alabama
1.0,12.0,2.0,1.0,4.0,1.0,2.0,2.0,7.0,6.0,88.0,150.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0,Alabama
1.0,10.0,2.0,1.0,4.0,1.0,1.0,2.0,7.0,5.0,88.0,150.0,503.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0,Alabama
1.0,5.0,2.0,2.0,6.0,1.0,2.0,2.0,1.0,6.0,2.0,170.0,511.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0,Alabama
1.0,12.0,2.0,3.0,2.0,1.0,8.0,2.0,7.0,99.0,88.0,163.0,503.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0,Alabama


- **Illness Dataset**

In [0]:
# Keep only the location, demographic, and illness columns
Illness = brfss2020_df.select("STATE","STATENAME","AGE","SEXVAR","ASTHMA","ARTHRITIS","DEPRESSION","DIABETE","HEARTDISEASE","STROKE","BMI","HIV","HEARTATT","CONFUSSION","GENHLTH")
display(Illness)

STATE,STATENAME,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,BMI,HIV,HEARTATT,CONFUSSION,GENHLTH
1.0,Alabama,8.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,2.0,,2.0
1.0,Alabama,10.0,2.0,1.0,1.0,1.0,3.0,2.0,2.0,3.0,,2.0,,3.0
1.0,Alabama,10.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,,2.0,2.0,,3.0
1.0,Alabama,13.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,,2.0,2.0,,1.0
1.0,Alabama,13.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,2.0,9.0,2.0,,2.0
1.0,Alabama,10.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,3.0,1.0,2.0,,4.0
1.0,Alabama,12.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,,3.0
1.0,Alabama,10.0,2.0,2.0,1.0,1.0,1.0,2.0,2.0,3.0,,2.0,,4.0
1.0,Alabama,5.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,,2.0
1.0,Alabama,12.0,2.0,2.0,1.0,2.0,3.0,1.0,2.0,3.0,2.0,2.0,,4.0


In [0]:
Illness.columns

Out[4]: ['STATE',
 'STATENAME',
 'AGE',
 'SEXVAR',
 'ASTHMA',
 'ARTHRITIS',
 'DEPRESSION',
 'DIABETE',
 'HEARTDISEASE',
 'STROKE',
 'BMI',
 'HIV',
 'HEARTATT',
 'CONFUSSION',
 'GENHLTH']

In [0]:
#print the schema
Illness.printSchema()

root
 |-- STATE: double (nullable = true)
 |-- STATENAME: string (nullable = true)
 |-- AGE: double (nullable = true)
 |-- SEXVAR: double (nullable = true)
 |-- ASTHMA: double (nullable = true)
 |-- ARTHRITIS: double (nullable = true)
 |-- DEPRESSION: double (nullable = true)
 |-- DIABETE: double (nullable = true)
 |-- HEARTDISEASE: double (nullable = true)
 |-- STROKE: double (nullable = true)
 |-- BMI: double (nullable = true)
 |-- HIV: double (nullable = true)
 |-- HEARTATT: double (nullable = true)
 |-- CONFUSSION: double (nullable = true)
 |-- GENHLTH: double (nullable = true)



In [0]:
#Count the total number of rows and columns
print((Illness.count(), len(Illness.columns)))

(401958, 15)


- **Eliminating the Null Values**

In [0]:
# Count the null values in each column (one Spark job per column)
for name in Illness.columns:
    print(name + ":", Illness.filter(Illness[name].isNull()).count())

STATE: 0
STATENAME: 0
AGE: 0
SEXVAR: 0
ASTHMA: 3
ARTHRITIS: 5
DEPRESSION: 6
DIABETE: 6
HEARTDISEASE: 3
STROKE: 3
BMI: 41357
HIV: 34037
HEARTATT: 6
CONFUSSION: 334120
GENHLTH: 8
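
The null counts above are easier to judge as a share of all rows. The following is a minimal plain-Python sketch (no Spark needed) that uses only the numbers printed above and the row count of 401,958 reported earlier; the names `TOTAL_ROWS`, `null_counts`, and `null_share` are illustrative and do not appear in the notebook.

```python
# Null counts copied from the output above; total rows from Illness.count().
TOTAL_ROWS = 401958
null_counts = {
    "ASTHMA": 3, "ARTHRITIS": 5, "DEPRESSION": 6, "DIABETE": 6,
    "HEARTDISEASE": 3, "STROKE": 3, "BMI": 41357, "HIV": 34037,
    "HEARTATT": 6, "CONFUSSION": 334120, "GENHLTH": 8,
}


def null_share(count, total=TOTAL_ROWS):
    """Return the percentage of rows in which a column is null."""
    return 100.0 * count / total


# List the columns from most to least affected.
for name, count in sorted(null_counts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{count:>8}  {null_share(count):6.2f}%")

# Dropping rows with a null in any column removes at least every row that is
# null in the worst column, so this is an upper bound on the rows that remain.
max_rows_after_drop = TOTAL_ROWS - max(null_counts.values())
print("rows left at most:", max_rows_after_drop)
```

CONFUSSION dominates the result, so a drop over all columns mostly keeps the respondents who have a CONFUSSION value.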


In [0]:
# Drop every row that has a null in any of the selected columns.
# CONFUSSION is null in 334,120 of the 401,958 rows, so this step removes most of the data.
Illness = Illness.na.drop(subset=Illness.columns)
display(Illness)

STATE,STATENAME,AGE,SEXVAR,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,BMI,HIV,HEARTATT,CONFUSSION,GENHLTH
2.0,Alaska,8.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,4.0,2.0,7.0,2.0,4.0
2.0,Alaska,11.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0
2.0,Alaska,8.0,1.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
2.0,Alaska,10.0,2.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,2.0,1.0,2.0,4.0
2.0,Alaska,12.0,2.0,2.0,1.0,2.0,1.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0
2.0,Alaska,8.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0
2.0,Alaska,9.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0
2.0,Alaska,10.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,4.0,2.0,2.0,2.0,3.0
2.0,Alaska,10.0,1.0,2.0,1.0,2.0,3.0,2.0,2.0,4.0,1.0,2.0,2.0,3.0
2.0,Alaska,14.0,1.0,9.0,9.0,9.0,7.0,9.0,9.0,2.0,2.0,9.0,2.0,3.0



- **About the Illness data**

In [0]:
Illness = Illness.withColumn("GENHLTHSTAT", Illness["GENHLTH"])

In [0]:
# Translate the numeric GENHLTH codes into readable labels
Illness = Illness.withColumn("GENHLTHSTAT",
                             when(Illness["GENHLTHSTAT"] == 1, "Excellent")
                             .when(Illness["GENHLTHSTAT"] == 2, "Very good")
                             .when(Illness["GENHLTHSTAT"] == 3, "Good")
                             .when(Illness["GENHLTHSTAT"] == 4, "Fair")
                             .when(Illness["GENHLTHSTAT"] == 5, "Poor")
                             .when(Illness["GENHLTHSTAT"] == 7, "Don’t know/Not Sure")
                             .when(Illness["GENHLTHSTAT"] == 9, "Refused")
                             # Nulls were dropped earlier; any other code gets this label
                             .otherwise("Not asked or Missing"))

**How many respondents suffer from ASTHMA?**

In [0]:
Illness = Illness.withColumn("ASTHMASTATUS", Illness["ASTHMA"])

In [0]:
Illness = Illness.withColumn("ASTHMASTATUS",
                             when(Illness["ASTHMASTATUS"] == 1, "Yes")
                             .when(Illness["ASTHMASTATUS"] == 2, "No")
                             .when(Illness["ASTHMASTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["ASTHMASTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))

In [0]:
df1 = Illness.groupBy("ASTHMASTATUS").count()

In [0]:
display(df1.sort("count"))

ASTHMASTATUS,count
Refused,23
Don’t know/Not Sure,168
Yes,8483
No,54472


The 2020 Behavioral Risk Factor Surveillance System (BRFSS) survey asked respondents whether they had ever been diagnosed with asthma. Of the 63,146 respondents remaining after rows with null values were dropped, 8,483 (13.4%) reported an asthma diagnosis, while 54,472 (86.3%) did not. A further 168 (0.3%) were not sure, and 23 (0.04%) refused to answer the question.
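
The shares can be recomputed directly from the ASTHMASTATUS counts in the table above. This is a small plain-Python sketch rather than part of the Spark pipeline; `asthma_counts` and `shares` are illustrative names.

```python
# Counts copied from the ASTHMASTATUS table above.
asthma_counts = {
    "Yes": 8483,
    "No": 54472,
    "Don't know/Not sure": 168,
    "Refused": 23,
}

# The denominator is the number of rows left after the null drop.
total = sum(asthma_counts.values())

# Percentage of respondents per answer, rounded to two decimals.
shares = {label: round(100.0 * n / total, 2) for label, n in asthma_counts.items()}

print("total respondents:", total)
for label, pct in shares.items():
    print(f"{label:<20}{pct:6.2f}%")
```

The same calculation applies to every status table in this notebook, since each one sums to the same number of retained rows.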

In [0]:
multiple = Illness\
.groupby(["ASTHMA","GENHLTHSTAT"])\
.agg({'ASTHMA':'count'})
multiple

Out[96]: DataFrame[ASTHMA: double, GENHLTHSTAT: string, count(ASTHMA): bigint]

In [0]:
display(multiple)

ASTHMA,GENHLTHSTAT,count(ASTHMA)
1.0,Poor,827
9.0,Refused,1
7.0,Fair,36
1.0,Excellent,784
7.0,Good,52
7.0,Poor,26
7.0,Excellent,18
2.0,Very good,18896
9.0,Good,7
2.0,Refused,32
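
The two-key grouping above counts how often each combination of asthma code and general health rating occurs. As a rough illustration of the same idea without Spark, the sketch below counts (ASTHMA, GENHLTH) pairs with `collections.Counter`, using the ten Alaska rows displayed after the null drop; it uses the numeric GENHLTH code instead of the GENHLTHSTAT label, and `rows` and `pair_counts` are illustrative names.

```python
from collections import Counter

# (ASTHMA, GENHLTH) pairs read off the ten Alaska rows displayed earlier.
rows = [
    (2, 4), (2, 3), (2, 1), (2, 4), (2, 3),
    (2, 2), (2, 1), (2, 3), (2, 3), (9, 3),
]

# Counting tuples is the in-memory analogue of grouping by two columns.
pair_counts = Counter(rows)

for (asthma, genhlth), n in sorted(pair_counts.items()):
    print(f"ASTHMA={asthma} GENHLTH={genhlth} count={n}")
```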


In [0]:
# Keep only respondents coded 1 (Yes) for asthma
res1 = multiple.filter(multiple["ASTHMA"] == 1)

In [0]:
display(res1)

ASTHMA,GENHLTHSTAT,count(ASTHMA)
1.0,Poor,827
1.0,Excellent,784
1.0,Good,2802
1.0,Don’t know/Not Sure,19
1.0,Fair,1789
1.0,Very good,2256
1.0,Refused,6


Among the 8,483 respondents who reported asthma, the most common general health status was "Good" (2,802), followed by "Very good" (2,256), "Fair" (1,789), "Poor" (827), and "Excellent" (784). A further 19 did not know or were not sure, and 6 refused to answer.

**How many respondents suffer from ARTHRITIS?**

In [0]:
Illness = Illness.withColumn("ARTHRITISATATUS", Illness["ARTHRITIS"])

In [0]:
Illness = Illness.withColumn("ARTHRITISATATUS",
                             when(Illness["ARTHRITISATATUS"] == 1, "Yes")
                             .when(Illness["ARTHRITISATATUS"] == 2, "No")
                             .when(Illness["ARTHRITISATATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["ARTHRITISATATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))

In [0]:
df2 = Illness.groupBy("ARTHRITISATATUS").count()

In [0]:
display(df2.sort("count"))

ARTHRITISATATUS,count
Refused,20
Don’t know/Not Sure,309
Yes,25666
No,37151


In [0]:
multiple2 = Illness\
.groupby(["ARTHRITIS","GENHLTHSTAT"])\
.agg({'ARTHRITIS':'count'})
multiple2


Out[124]: DataFrame[ARTHRITIS: double, GENHLTHSTAT: string, count(ARTHRITIS): bigint]

In [0]:
display(multiple2)

ARTHRITIS,GENHLTHSTAT,count(ARTHRITIS)
1.0,Poor,2007
7.0,Fair,54
1.0,Excellent,2375
7.0,Good,107
7.0,Poor,24
7.0,Excellent,25
7.0,Don’t know/Not Sure,3
2.0,Very good,13669
9.0,Good,9
2.0,Refused,18


In [0]:
# Keep only respondents coded 1 (Yes) for arthritis
res2 = multiple2.filter(multiple2["ARTHRITIS"] == 1)

In [0]:
display(res2)

ARTHRITIS,GENHLTHSTAT,count(ARTHRITIS)
1.0,Poor,2007
1.0,Excellent,2375
1.0,Good,8687
1.0,Don’t know/Not Sure,51
1.0,Fair,5101
1.0,Very good,7424
1.0,Refused,21


Among the 25,666 respondents who reported arthritis, most rated their health as "Good" (8,687) or "Very good" (7,424). A further 5,101 reported "Fair", 2,375 reported "Excellent", and 2,007 reported "Poor", while 51 did not know or were not sure and 21 refused to answer.

**How many respondents suffer from DEPRESSION?**

In [0]:
Illness = Illness.withColumn("DEPRESSIONSTATUS", Illness["DEPRESSION"])

In [0]:
Illness = Illness.withColumn("DEPRESSIONSTATUS",
                             when(Illness["DEPRESSIONSTATUS"] == 1, "Yes")
                             .when(Illness["DEPRESSIONSTATUS"] == 2, "No")
                             .when(Illness["DEPRESSIONSTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["DEPRESSIONSTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))

In [0]:
df3 = Illness.groupBy("DEPRESSIONSTATUS").count()


In [0]:
display(df3.sort("count"))

DEPRESSIONSTATUS,count
Refused,50
Don’t know/Not Sure,187
Yes,11747
No,51162


In [0]:
multiple3 = Illness\
.groupby(["DEPRESSION","GENHLTHSTAT"])\
.agg({'DEPRESSION':'count'})
multiple3

Out[128]: DataFrame[DEPRESSION: double, GENHLTHSTAT: string, count(DEPRESSION): bigint]

In [0]:
display(multiple3)

DEPRESSION,GENHLTHSTAT,count(DEPRESSION)
1.0,Poor,1289
7.0,Fair,50
1.0,Excellent,1060
7.0,Good,51
7.0,Refused,2
7.0,Poor,20
7.0,Excellent,14
7.0,Don’t know/Not Sure,2
2.0,Very good,18157
9.0,Good,16


In [0]:
# Keep only respondents coded 1 (Yes) for depression
res3 = multiple3.filter(multiple3["DEPRESSION"] == 1)

In [0]:
display(res3)

DEPRESSION,GENHLTHSTAT,count(DEPRESSION)
1.0,Poor,1289
1.0,Excellent,1060
1.0,Good,3704
1.0,Don’t know/Not Sure,30
1.0,Fair,2678
1.0,Very good,2977
1.0,Refused,9


Among the 11,747 respondents who reported depression, most rated their health as "Good" (3,704) or "Very good" (2,977). "Fair" (2,678) and "Poor" (1,289) were the next most common responses, and "Excellent" (1,060) was the least common of the five ratings. A further 30 did not know or were not sure, and 9 refused to answer.

**How many respondents suffer from diabetes (DIABETE)?**

In [0]:
Illness = Illness.withColumn("DIABETESTATUS", Illness["DIABETE"])

In [0]:
# DIABETE uses codes 1-4 rather than a plain yes/no. The labels for 2-4 follow the
# BRFSS codebook as recalled here and should be verified against the source.
Illness = Illness.withColumn("DIABETESTATUS",
                             when(Illness["DIABETESTATUS"] == 1, "Yes")
                             .when(Illness["DIABETESTATUS"] == 2, "Yes (only during pregnancy)")
                             .when(Illness["DIABETESTATUS"].isin(3, 4), "No (incl. pre-diabetes or borderline)")
                             .when(Illness["DIABETESTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["DIABETESTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))

In [0]:
df4 = Illness.groupBy("DIABETESTATUS").count()

In [0]:
display(df4.sort("count"))

DIABETESTATUS,count
Refused,20
Don’t know/Not Sure,83
Yes (only during pregnancy),387
Yes,10063
No (incl. pre-diabetes or borderline),52593
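
DIABETE takes more values than the yes/no conditions (codes 3.0 and 4.0 appear in the grouped output below), so a mapping that names only some codes sends the rest to its fallback value. The sketch below shows one way to make such gaps visible in plain Python. The labels for codes 2 to 4 are an assumption based on the BRFSS codebook and should be verified; `DIABETE_LABELS` and `label_for` are hypothetical names, not part of the notebook.

```python
# Assumed DIABETE code labels (to be checked against the BRFSS 2020 codebook).
DIABETE_LABELS = {
    1: "Yes",
    2: "Yes (only during pregnancy)",
    3: "No",
    4: "No, pre-diabetes or borderline",
    7: "Don't know/Not sure",
    9: "Refused",
}


def label_for(code, labels=DIABETE_LABELS):
    """Translate a numeric response code, flagging anything unmapped."""
    if code is None:
        return "Not asked or Missing"
    # Codes arrive as doubles (1.0, 2.0, ...), so normalise to int first.
    return labels.get(int(code), f"Unmapped code {int(code)}")


for code in (1.0, 2.0, 3.0, 4.0, 7.0, 9.0, 5.0, None):
    print(code, "->", label_for(code))
```

Returning an explicit "Unmapped code" label keeps unexpected values from being silently merged into one catch-all row.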

In [0]:
multiple4 = Illness\
.groupby(["DIABETE","GENHLTHSTAT"])\
.agg({'DIABETE':'count'})
multiple4

Out[132]: DataFrame[DIABETE: double, GENHLTHSTAT: string, count(DIABETE): bigint]

In [0]:
display(multiple4)

DIABETE,GENHLTHSTAT,count(DIABETE)
1.0,Poor,1111
9.0,Refused,1
4.0,Very good,624
7.0,Fair,26
1.0,Excellent,442
7.0,Good,31
3.0,Very good,18362
3.0,Good,14382
7.0,Poor,13
4.0,Excellent,235


In [0]:
# Keep only respondents coded 1 (Yes) for diabetes
res4 = multiple4.filter(multiple4["DIABETE"] == 1)

In [0]:
display(res4)

DIABETE,GENHLTHSTAT,count(DIABETE)
1.0,Poor,1111
1.0,Excellent,442
1.0,Good,3724
1.0,Don’t know/Not Sure,24
1.0,Fair,2687
1.0,Very good,2069
1.0,Refused,6


Among the 10,063 respondents coded 1 (Yes) for diabetes, the largest group reported "Good" health (3,724), followed by "Fair" (2,687), "Very good" (2,069), and "Poor" (1,111). A small group (442) reported "Excellent" health, while 24 did not know or were not sure and 6 refused to answer.

**How many respondents suffer from HEARTDISEASE?**

In [0]:
Illness = Illness.withColumn("HEARTDISEASESTATUS", Illness["HEARTDISEASE"])


In [0]:
Illness = Illness.withColumn("HEARTDISEASESTATUS",
                             when(Illness["HEARTDISEASESTATUS"] == 1, "Yes")
                             .when(Illness["HEARTDISEASESTATUS"] == 2, "No")
                             .when(Illness["HEARTDISEASESTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["HEARTDISEASESTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))


In [0]:
df5 = Illness.groupBy("HEARTDISEASESTATUS").count()

In [0]:
display(df5.sort("count"))

HEARTDISEASESTATUS,count
Refused,18
Don’t know/Not Sure,567
Yes,4904
No,57657


In [0]:
multiple5 = Illness\
.groupby(["HEARTDISEASE","GENHLTHSTAT"])\
.agg({'HEARTDISEASE':'count'})
multiple5

Out[136]: DataFrame[HEARTDISEASE: double, GENHLTHSTAT: string, count(HEARTDISEASE): bigint]

In [0]:
display(multiple5)

HEARTDISEASE,GENHLTHSTAT,count(HEARTDISEASE)
1.0,Poor,764
9.0,Refused,1
7.0,Fair,161
1.0,Excellent,219
7.0,Good,203
7.0,Refused,2
7.0,Poor,92
7.0,Excellent,16
7.0,Don’t know/Not Sure,1
2.0,Very good,20204


In [0]:
# Keep only respondents coded 1 (Yes) for heart disease
res5 = multiple5.filter(multiple5["HEARTDISEASE"] == 1)

In [0]:
display(res5)

HEARTDISEASE,GENHLTHSTAT,count(HEARTDISEASE)
1.0,Poor,764
1.0,Excellent,219
1.0,Good,1660
1.0,Don’t know/Not Sure,16
1.0,Fair,1350
1.0,Very good,893
1.0,Refused,2


Among the 4,904 respondents who reported heart disease, most rated their health as "Good" (1,660) or "Fair" (1,350), while 893 reported "Very good" and 764 reported "Poor". Only 219 reported "Excellent"; 16 did not know or were not sure, and 2 refused to answer.

**How many respondents have had a STROKE?**

In [0]:
Illness = Illness.withColumn("STROKESTATUS", Illness["STROKE"])

In [0]:
Illness = Illness.withColumn("STROKESTATUS",
                             when(Illness["STROKESTATUS"] == 1, "Yes")
                             .when(Illness["STROKESTATUS"] == 2, "No")
                             .when(Illness["STROKESTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["STROKESTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))


In [0]:
df6 = Illness.groupBy("STROKESTATUS").count()

In [0]:
display(df6.sort("count"))

STROKESTATUS,count
Refused,17
Don’t know/Not Sure,171
Yes,3205
No,59753


In [0]:
multiple6 = Illness\
.groupby(["STROKE","GENHLTHSTAT"])\
.agg({'STROKE':'count'})
multiple6

Out[140]: DataFrame[STROKE: double, GENHLTHSTAT: string, count(STROKE): bigint]

In [0]:
display(multiple6)

STROKE,GENHLTHSTAT,count(STROKE)
1.0,Poor,543
9.0,Refused,1
7.0,Fair,55
1.0,Excellent,172
7.0,Good,48
7.0,Poor,30
7.0,Excellent,8
7.0,Don’t know/Not Sure,2
2.0,Very good,20588
9.0,Good,6


In [0]:
# Keep only respondents coded 1 (Yes) for stroke
res6 = multiple6.filter(multiple6["STROKE"] == 1)

In [0]:
display(res6)

STROKE,GENHLTHSTAT,count(STROKE)
1.0,Poor,543
1.0,Excellent,172
1.0,Good,1051
1.0,Don’t know/Not Sure,8
1.0,Fair,852
1.0,Very good,574
1.0,Refused,5


Among the 3,205 respondents who reported a stroke, the most common rating was "Good" (1,051), followed by "Fair" (852), "Very good" (574), "Poor" (543), and "Excellent" (172). A further 8 did not know or were not sure, and 5 refused to answer.

**How many respondents fall into each BMI category?**

In [0]:
Illness = Illness.withColumn("BMISTATUS", Illness["BMI"])

In [0]:
Illness = Illness.withColumn("BMISTATUS",
                             when(Illness["BMISTATUS"] == 1, "Underweight")
                             .when(Illness["BMISTATUS"] == 2, "Normal Weight")
                             .when(Illness["BMISTATUS"] == 3, "Overweight")
                             .when(Illness["BMISTATUS"] == 4, "Obese")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))


In [0]:
df7 = Illness.groupBy("BMISTATUS").count()

In [0]:
display(df7.sort("count"))

BMISTATUS,count
Underweight,982
Normal Weight,19356
Obese,19556
Overweight,23252


**How did respondents answer the HIV question?**

In [0]:
Illness = Illness.withColumn("HIVSTATUS", Illness["HIV"])

In [0]:
Illness = Illness.withColumn("HIVSTATUS",
                             when(Illness["HIVSTATUS"] == 1, "Yes")
                             .when(Illness["HIVSTATUS"] == 2, "No")
                             .when(Illness["HIVSTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["HIVSTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))



In [0]:
df8 = Illness.groupBy("HIVSTATUS").count()

In [0]:
display(df8.sort("count"))

HIVSTATUS,count
Refused,3528
Yes,17867
No,41751


**How many respondents have had a HEART ATTACK?**

In [0]:
Illness = Illness.withColumn("HEARTATTSTATUS", Illness["HEARTATT"])

In [0]:
Illness = Illness.withColumn("HEARTATTSTATUS",
                             when(Illness["HEARTATTSTATUS"] == 1, "Yes")
                             .when(Illness["HEARTATTSTATUS"] == 2, "No")
                             .when(Illness["HEARTATTSTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["HEARTATTSTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))


In [0]:
df9 = Illness.groupBy("HEARTATTSTATUS").count()

In [0]:
display(df9.sort("count"))

HEARTATTSTATUS,count
Refused,22
Don’t know/Not Sure,304
Yes,4705
No,58115


In [0]:
multiple7 = Illness\
.groupby(["HEARTATT","GENHLTHSTAT"])\
.agg({'HEARTATT':'count'})
multiple7

Out[144]: DataFrame[HEARTATT: double, GENHLTHSTAT: string, count(HEARTATT): bigint]

In [0]:
display(multiple7)

HEARTATT,GENHLTHSTAT,count(HEARTATT)
1.0,Poor,720
7.0,Fair,88
1.0,Excellent,226
7.0,Good,97
7.0,Poor,46
7.0,Excellent,32
7.0,Don’t know/Not Sure,3
2.0,Very good,20253
9.0,Good,9
2.0,Refused,37


In [0]:
# Keep only respondents coded 1 (Yes) for heart attack
res7 = multiple7.filter(multiple7["HEARTATT"] == 1)

In [0]:
display(res7)


HEARTATT,GENHLTHSTAT,count(HEARTATT)
1.0,Poor,720
1.0,Excellent,226
1.0,Good,1608
1.0,Don’t know/Not Sure,14
1.0,Fair,1237
1.0,Very good,898
1.0,Refused,2


Among the 4,705 respondents who reported a heart attack, most rated their health as "Good" (1,608) or "Fair" (1,237). A smaller group reported "Very good" (898), and the least common of the five ratings were "Poor" (720) and "Excellent" (226). A further 14 did not know or were not sure, and 2 refused to answer.

**How many respondents suffer from confusion (CONFUSSION)?**

In [0]:
Illness = Illness.withColumn("CONFUSSIONSTATUS", Illness["CONFUSSION"])

In [0]:
Illness = Illness.withColumn("CONFUSSIONSTATUS",
                             when(Illness["CONFUSSIONSTATUS"] == 1, "Yes")
                             .when(Illness["CONFUSSIONSTATUS"] == 2, "No")
                             .when(Illness["CONFUSSIONSTATUS"] == 7, "Don’t know/Not Sure")
                             .when(Illness["CONFUSSIONSTATUS"] == 9, "Refused")
                             # Fall back to a label, not the literal column name
                             .otherwise("Not asked or Missing"))


In [0]:
df10 = Illness.groupBy("CONFUSSIONSTATUS").count()

In [0]:
display(df10.sort("count"))

CONFUSSIONSTATUS,count
Refused,98
Don’t know/Not Sure,371
Yes,5418
No,57259


In [0]:
multiple8 = Illness\
.groupby(["CONFUSSION","GENHLTHSTAT"])\
.agg({'CONFUSSION':'count'})
multiple8

Out[148]: DataFrame[CONFUSSION: double, GENHLTHSTAT: string, count(CONFUSSION): bigint]

In [0]:
display(multiple8)

CONFUSSION,GENHLTHSTAT,count(CONFUSSION)
1.0,Poor,929
7.0,Fair,73
1.0,Excellent,347
7.0,Good,137
7.0,Poor,41
7.0,Excellent,32
7.0,Don’t know/Not Sure,5
2.0,Very good,20053
9.0,Good,31
2.0,Refused,33


In [0]:
# Keep only respondents coded 1 (Yes) for confusion
res8 = multiple8.filter(multiple8["CONFUSSION"] == 1)

In [0]:
display(res8)

CONFUSSION,GENHLTHSTAT,count(CONFUSSION)
1.0,Poor,929
1.0,Excellent,347
1.0,Good,1658
1.0,Don’t know/Not Sure,15
1.0,Fair,1429
1.0,Very good,1034
1.0,Refused,6


Among the 5,418 respondents who reported confusion, most rated their health as "Good" (1,658) or "Fair" (1,429). A total of 1,034 reported "Very good", while 929 reported "Poor". Far fewer reported "Excellent" (347), and a few answered "Don’t know/Not Sure" (15) or "Refused" (6).

- **The Relationship Between Illness**

In [0]:
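# `mul` is not defined anywhere in this notebook; every grouped column exists on
# `Illness`, so it is assumed here that `mul` referred to that DataFrame.
mul_r4 = Illness.groupby(["ASTHMA","ARTHRITIS","DEPRESSION","DIABETE","HEARTDISEASE","STROKE","HEARTATT","CONFUSSION","GENHLTHSTAT","GENHLTH"]).agg({'ASTHMA':'count','ARTHRITIS':'count','DEPRESSION':'count','DIABETE':'count','HEARTDISEASE':'count','STROKE':'count','HEARTATT':'count','CONFUSSION':'count'})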

In [0]:
display(mul_r4)

ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HEARTATT,CONFUSSION,GENHLTHSTAT,GENHLTH,count(DEPRESSION),count(ASTHMA),count(STROKE),count(ARTHRITIS),count(DIABETE),count(CONFUSSION),count(HEARTDISEASE),count(HEARTATT)
1.0,2.0,1.0,3.0,2.0,2.0,2.0,2.0,Excellent,1.0,1,1,1,1,1,1,1,1
2.0,7.0,2.0,3.0,1.0,2.0,1.0,2.0,Good,3.0,1,1,1,1,1,1,1,1
1.0,1.0,2.0,4.0,2.0,2.0,2.0,2.0,Fair,4.0,1,1,1,1,1,1,1,1
7.0,2.0,1.0,1.0,2.0,2.0,2.0,2.0,Poor,5.0,1,1,1,1,1,1,1,1
7.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,Excellent,1.0,1,1,1,1,1,1,1,1
2.0,1.0,1.0,3.0,2.0,2.0,1.0,1.0,Fair,4.0,1,1,1,1,1,1,1,1
2.0,1.0,2.0,3.0,2.0,2.0,1.0,2.0,Very good,2.0,1,1,1,1,1,1,1,1
1.0,2.0,2.0,3.0,2.0,2.0,7.0,2.0,Very good,2.0,1,1,1,1,1,1,1,1
2.0,2.0,2.0,3.0,1.0,1.0,1.0,2.0,Good,3.0,1,1,1,1,1,1,1,1
2.0,7.0,2.0,3.0,2.0,2.0,2.0,2.0,Fair,4.0,1,1,1,1,1,1,1,1


Output can only be rendered in Databricks

The graph shows the percentage of respondents who reported their general health as good, fair, or poor for eight conditions: asthma, arthritis, depression, diabetes, heart disease, stroke, heart attack, and confusion or memory loss. For every condition, "Good" is the most common of these three ratings, followed by "Fair" and then "Poor".
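
Since the chart itself is not reproduced here, one way to compare the conditions is the share of respondents with each condition who rated their health "Fair" or "Poor". The sketch below is plain Python using only the counts from the eight filtered tables shown earlier (respondents coded 1 for each condition); `fair_poor_total` and `fair_or_poor_share` are illustrative names, and the choice of "Fair or Poor" as the summary measure is an assumption made for this example.

```python
# (Fair, Poor, total respondents coded 1) per condition, copied from the
# filtered tables res1..res8 shown earlier in this notebook.
fair_poor_total = {
    "Asthma": (1789, 827, 8483),
    "Arthritis": (5101, 2007, 25666),
    "Depression": (2678, 1289, 11747),
    "Diabetes": (2687, 1111, 10063),
    "Heart disease": (1350, 764, 4904),
    "Stroke": (852, 543, 3205),
    "Heart attack": (1237, 720, 4705),
    "Confusion": (1429, 929, 5418),
}


def fair_or_poor_share(fair, poor, total):
    """Percentage of respondents who rated their health Fair or Poor."""
    return 100.0 * (fair + poor) / total


shares = {
    name: round(fair_or_poor_share(*counts), 1)
    for name, counts in fair_poor_total.items()
}

# Print from the lowest to the highest share.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1]):
    print(f"{name:<14}{pct:5.1f}%")
```

These are unweighted shares within the rows that survived the null drop, so they describe this subset rather than the population as a whole.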