# BRFSS 2020 HEART ATTACK PREDICTION DATA

- **Dataset**

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey conducted by the Centers for Disease Control and Prevention (CDC). The BRFSS collects data on health-related risk behaviors, chronic health conditions, and use of preventive services for adults aged 18 and over in the United States.

The 2020 BRFSS data set loaded in this notebook contains 401,958 respondent records (the row count printed further down), all adults aged 18 and over. Demographic information collected includes sex, age, race/ethnicity, education level, marital status, employment status, annual household income, and health insurance status.

Data source: https://www.kaggle.com/datasets/aemreusta/brfss-2020-survey-data

Columns: 279

Size: 323.288 MB

Selected columns:

- STATE: The two-digit code for the state

- STATENAME: The name of the state 

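- AGE: The age group of the respondent, coded 1 to 13 in five-year bands (1 = 18 to 24, 13 = 80 or older); 14 marks a missing answer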

- SEXVAR: A gender indicator (1=Male, 2=Female)

- MARITAL: The marital status of the respondent (1=Married, 2=Never Married, 3=Divorced, 4=Widowed)

- EDUCA: The highest level of education obtained by the respondent (1=Less than High School, 2=High School/GED, 3=Some College, 4=College Graduate, 5=Post Graduate Degree)

- RENTHOME: Indicator of whether respondent rents or owns their home (1=Rents, 2=Owns)

- CELLPHONES: Indicator of whether respondent has a cell phone (1=Yes, 2=No)

- VETERAN: Indicator of whether respondent is a veteran (1=Yes, 2=No)

- EMPLOYE: Indicator of whether respondent is employed (1=Employed, 2=Unemployed)

- INCOME: The annual household income of the respondent

- CHILDREN: Indicator of whether respondent has children (1=Yes, 2=No)

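- WEIGHT: The reported body weight of the respondent as a raw survey code; values such as 7777 and 9999 in the sample rows are non-response codes, not weights

- HEIGHT: The reported body height of the respondent as a raw survey code (for example 507 in the sample rows), not a value in centimeters

- BMI: The body mass index category of the respondent (coded 1 to 4 in the sample rows)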

- GENHLTH: The self-reported general health of the respondent (1=Excellent, 2=Very Good, 3=Good, 4=Fair, 5=Poor)

- **Importing Libraries**

In [0]:
# Import only the names the notebook uses; wildcard imports shadow Python built-ins such as sum and max
from pyspark.sql.functions import when
import pyspark.sql.functions as F
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

- **Loading the dataset**

In [0]:
filePath = "/tmp/sf-brfss2020.parquet"
brfss2020_df = spark.read.parquet(filePath)
display(brfss2020_df)

STATE,AGE,SEXVAR,MARITAL,EDUCA,RENTHOME,CELLPHONES,VETERAN,EMPLOYE,INCOME,CHILDREN,WEIGHT,HEIGHT,BMI,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,SMOKE100,USENOW3,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,RACE,ECIGARET,GENHLTH,STATENAME
1.0,8.0,2.0,2.0,6.0,1.0,1.0,2.0,4.0,1.0,88.0,106.0,507.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,1.0,3.0,2.0,888.0,2.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0,Alabama
1.0,10.0,2.0,3.0,6.0,1.0,1.0,2.0,7.0,99.0,88.0,170.0,504.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,,,9.0,,9.0,99900.0,9.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,2.0,,3.0,Alabama
1.0,10.0,2.0,1.0,5.0,1.0,1.0,2.0,7.0,7.0,88.0,7777.0,508.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,Alabama
1.0,13.0,2.0,3.0,4.0,1.0,9.0,2.0,5.0,99.0,88.0,9999.0,9999.0,,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,Alabama
1.0,13.0,2.0,3.0,6.0,2.0,8.0,2.0,7.0,77.0,88.0,126.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,Alabama
1.0,10.0,1.0,4.0,4.0,3.0,1.0,2.0,8.0,5.0,88.0,180.0,509.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,1.0,1.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,1.0,1.0,4.0,Alabama
1.0,12.0,2.0,1.0,4.0,1.0,2.0,2.0,7.0,6.0,88.0,150.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,Alabama
1.0,10.0,2.0,1.0,4.0,1.0,1.0,2.0,7.0,5.0,88.0,150.0,503.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,1.0,3.0,2.0,,9.0,99900.0,9.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,2.0,,4.0,Alabama
1.0,5.0,2.0,2.0,6.0,1.0,2.0,2.0,1.0,6.0,2.0,170.0,511.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,1.0,2.0,2.0,Alabama
1.0,12.0,2.0,3.0,2.0,1.0,8.0,2.0,7.0,99.0,88.0,163.0,503.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,1.0,3.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,2.0,2.0,4.0,Alabama


- **Demographic Dataset**

In [0]:
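# Select the columns used for modelling. Note: "SEXVAR" is listed twice, so it appears twice in the outputs that follow.
Heart = brfss2020_df["SEXVAR","RACE","AGE","SEXVAR","GENHLTH","BMI", "PHYEXERCISE","CURRENTSMOKE","DRUNKDAY","DIABETE","STROKE","HEARTDISEASE","HEARTATT"]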
display(Heart)

SEXVAR,RACE,AGE,SEXVAR.1,GENHLTH,BMI,PHYEXERCISE,CURRENTSMOKE,DRUNKDAY,DIABETE,STROKE,HEARTDISEASE,HEARTATT
2.0,1.0,8.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0
2.0,2.0,10.0,2.0,3.0,3.0,1.0,9.0,9.0,3.0,2.0,2.0,2.0
2.0,2.0,10.0,2.0,3.0,,1.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,1.0,13.0,2.0,1.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,1.0,13.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,2.0
1.0,1.0,10.0,1.0,4.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0
2.0,1.0,12.0,2.0,3.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,2.0,10.0,2.0,4.0,3.0,1.0,2.0,9.0,1.0,2.0,2.0,2.0
2.0,1.0,5.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,2.0,12.0,2.0,4.0,3.0,2.0,1.0,2.0,3.0,2.0,1.0,2.0


In [0]:
Heart.columns

Out[49]: ['SEXVAR',
 'RACE',
 'AGE',
 'SEXVAR',
 'GENHLTH',
 'BMI',
 'PHYEXERCISE',
 'CURRENTSMOKE',
 'DRUNKDAY',
 'DIABETE',
 'STROKE',
 'HEARTDISEASE',
 'HEARTATT']
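
The column list above names `SEXVAR` twice, so the selection carries a duplicated column. One way to guard against this, sketched here as an assumption rather than as the notebook's own method, is to de-duplicate the list while keeping its order before passing it to the selection. The names `selected_columns` and `unique_columns` are illustrative.

```python
# Column names as listed in the selection above (SEXVAR appears twice).
selected_columns = [
    "SEXVAR", "RACE", "AGE", "SEXVAR", "GENHLTH", "BMI", "PHYEXERCISE",
    "CURRENTSMOKE", "DRUNKDAY", "DIABETE", "STROKE", "HEARTDISEASE", "HEARTATT",
]

# dict.fromkeys keeps the first occurrence of each name and preserves order.
unique_columns = list(dict.fromkeys(selected_columns))

print(len(selected_columns), "names listed,", len(unique_columns), "unique")
print(unique_columns)
```

The resulting list could then be used for the selection, for example with `brfss2020_df.select(*unique_columns)`.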

In [0]:
#print the schema
Heart.printSchema()

root
 |-- SEXVAR: double (nullable = true)
 |-- RACE: double (nullable = true)
 |-- AGE: double (nullable = true)
 |-- SEXVAR: double (nullable = true)
 |-- GENHLTH: double (nullable = true)
 |-- BMI: double (nullable = true)
 |-- PHYEXERCISE: double (nullable = true)
 |-- CURRENTSMOKE: double (nullable = true)
 |-- DRUNKDAY: double (nullable = true)
 |-- DIABETE: double (nullable = true)
 |-- STROKE: double (nullable = true)
 |-- HEARTDISEASE: double (nullable = true)
 |-- HEARTATT: double (nullable = true)



In [0]:
#Count the total number of rows and columns
print((Heart.count(), len(Heart.columns)))

(401958, 13)


- **Eliminating the Null Values**

In [0]:
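# Count the null values in each column (the loop variable is not called `col`, to avoid shadowing pyspark's col function)
for column_name in Heart.columns:
    print(column_name + ":", Heart.filter(Heart[column_name].isNull()).count())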

SEXVAR: 0
RACE: 0
AGE: 0
SEXVAR: 0
GENHLTH: 8
BMI: 41357
PHYEXERCISE: 0
CURRENTSMOKE: 0
DRUNKDAY: 0
DIABETE: 6
STROKE: 3
HEARTDISEASE: 3
HEARTATT: 6
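
To put the null counts above in proportion, the short sketch below expresses each count as a share of the 401,958 rows reported earlier. The numbers are copied from the outputs in this notebook; the variable names are illustrative.

```python
# Null counts per column, copied from the output above (zero-count columns omitted).
null_counts = {
    "GENHLTH": 8,
    "BMI": 41357,
    "DIABETE": 6,
    "STROKE": 3,
    "HEARTDISEASE": 3,
    "HEARTATT": 6,
}
total_rows = 401958  # row count printed earlier in the notebook

# Share of rows with a null value, in percent, for each column.
null_share = {name: 100 * count / total_rows for name, count in null_counts.items()}

for name, share in sorted(null_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {share:.3f}%")
```

Almost all of the missing values sit in `BMI` (roughly one row in ten), so dropping nulls mainly removes respondents without a BMI value.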


In [0]:
# Drop every row that has a null value in any of the selected columns
Heart = Heart.na.drop(subset=["SEXVAR","RACE","AGE","SEXVAR","GENHLTH","BMI", "PHYEXERCISE","CURRENTSMOKE","DRUNKDAY","DIABETE","STROKE","HEARTDISEASE","HEARTATT"])
display(Heart)

SEXVAR,RACE,AGE,SEXVAR.1,GENHLTH,BMI,PHYEXERCISE,CURRENTSMOKE,DRUNKDAY,DIABETE,STROKE,HEARTDISEASE,HEARTATT
2.0,1.0,8.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0
2.0,2.0,10.0,2.0,3.0,3.0,1.0,9.0,9.0,3.0,2.0,2.0,2.0
2.0,1.0,13.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,2.0
1.0,1.0,10.0,1.0,4.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0
2.0,1.0,12.0,2.0,3.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,2.0,10.0,2.0,4.0,3.0,1.0,2.0,9.0,1.0,2.0,2.0,2.0
2.0,1.0,5.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,2.0,12.0,2.0,4.0,3.0,2.0,1.0,2.0,3.0,2.0,1.0,2.0
2.0,1.0,11.0,2.0,4.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0
2.0,1.0,13.0,2.0,3.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0


- **About the Heart data**

- **How many respondents are in each age group?**

In [0]:
Heart = Heart.withColumn("AgeGroup", Heart["AGE"])

In [0]:
# Translate the numeric age-band code into a readable label.
# The column is a double, so the comparisons use numbers rather than strings.
Heart = Heart.withColumn("AgeGroup",
                         when(Heart["AgeGroup"] == 1, 'Age 18 to 24')
                         .when(Heart["AgeGroup"] == 2, 'Age 25 to 29')
                         .when(Heart["AgeGroup"] == 3, 'Age 30 to 34')
                         .when(Heart["AgeGroup"] == 4, 'Age 35 to 39')
                         .when(Heart["AgeGroup"] == 5, 'Age 40 to 44')
                         .when(Heart["AgeGroup"] == 6, 'Age 45 to 49')
                         .when(Heart["AgeGroup"] == 7, 'Age 50 to 54')
                         .when(Heart["AgeGroup"] == 8, 'Age 55 to 59')
                         .when(Heart["AgeGroup"] == 9, 'Age 60 to 64')
                         .when(Heart["AgeGroup"] == 10, 'Age 65 to 69')
                         .when(Heart["AgeGroup"] == 11, 'Age 70 to 74')
                         .when(Heart["AgeGroup"] == 12, 'Age 75 to 79')
                         .when(Heart["AgeGroup"] == 13, 'Age 80 or older')
                         .when(Heart["AgeGroup"] == 14, 'Blank')
                         # Any unexpected code gets an explicit label instead of the literal string 'AgeGroup'
                         .otherwise('Unknown'))
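
The chain of conditions above follows a regular pattern: codes 2 to 12 are five-year bands starting at age 25. As a sketch of that pattern in plain Python (the helper names `build_age_labels` and `label_for` are hypothetical, and the "Unknown" fallback is an assumption), the same lookup table can be generated instead of typed out:

```python
def build_age_labels():
    """Build the code-to-label table used for the AGE column."""
    labels = {1: "Age 18 to 24"}
    for code in range(2, 13):
        low = 25 + 5 * (code - 2)      # code 2 -> 25, code 3 -> 30, ...
        labels[code] = f"Age {low} to {low + 4}"
    labels[13] = "Age 80 or older"
    labels[14] = "Blank"               # missing answer, labelled as in the notebook
    return labels


AGE_LABELS = build_age_labels()


def label_for(code):
    """Return the label for a raw AGE code such as 8.0; unknown codes get 'Unknown'."""
    return AGE_LABELS.get(int(code), "Unknown")


for raw_code in (1.0, 8.0, 13.0, 14.0):
    print(raw_code, "->", label_for(raw_code))
```

A table like this is easy to check against the codebook and can be reused for plots or for building the Spark expression in a loop.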

In [0]:
df2 = Heart.groupBy("AgeGroup").count()

In [0]:
display(df2.sort("count"))

AgeGroup,count
Blank,4109
Age 25 to 29,18615
Age 30 to 34,20656
Age 35 to 39,22621
Age 40 to 44,23031
Age 18 to 24,23154
Age 45 to 49,23827
Age 75 to 79,24597
Age 50 to 54,27989
Age 80 or older,29163


In [0]:
display(df2.sort("count"))

(Chart of the age-group counts listed above; the output can only be rendered in Databricks.)

Respondents aged 65 to 69 form the largest group (37,879), followed by those aged 60 to 64 (37,347), 70 to 74 (34,771), 55 to 59 (32,834), 80 or older (29,163), and 50 to 54 (27,989). A further 4,109 respondents did not provide an age.

In [0]:
display(Heart)

SEXVAR,RACE,AGE,SEXVAR.1,GENHLTH,BMI,PHYEXERCISE,CURRENTSMOKE,DRUNKDAY,DIABETE,STROKE,HEARTDISEASE,HEARTATT,AgeGroup
2.0,1.0,8.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,Age 55 to 59
2.0,2.0,10.0,2.0,3.0,3.0,1.0,9.0,9.0,3.0,2.0,2.0,2.0,Age 65 to 69
2.0,1.0,13.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,2.0,Age 80 or older
1.0,1.0,10.0,1.0,4.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,Age 65 to 69
2.0,1.0,12.0,2.0,3.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 75 to 79
2.0,2.0,10.0,2.0,4.0,3.0,1.0,2.0,9.0,1.0,2.0,2.0,2.0,Age 65 to 69
2.0,1.0,5.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 40 to 44
2.0,2.0,12.0,2.0,4.0,3.0,2.0,1.0,2.0,3.0,2.0,1.0,2.0,Age 75 to 79
2.0,1.0,11.0,2.0,4.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 70 to 74
2.0,1.0,13.0,2.0,3.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,Age 80 or older


Output can only be rendered in Databricks

Heart disease becomes gradually more common with age, and respondents aged 60 to 69 appear most prone to heart attacks and strokes. Surprisingly, those aged 18 to 24 also show a relatively high risk of these conditions.

- **How many male and female respondents are there?**

In [0]:
Heart = Heart.withColumn("SEX", Heart["SEXVAR"])

In [0]:
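# Translate the numeric sex code into a readable label (the column is a double, so compare with numbers)
Heart = Heart.withColumn("SEX",
                         when(Heart["SEX"] == 1, 'Male')
                         .when(Heart["SEX"] == 2, 'Female')
                         # Any unexpected code gets an explicit label instead of the literal string 'SEX'
                         .otherwise('Unknown'))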

In [0]:
df3 = Heart.groupBy("SEX").count()

In [0]:
display(df3.sort("count"))

SEX,count
Male,172325
Female,188270


In [0]:
display(df3.sort("count"))

(Chart of the counts by sex listed above; the output can only be rendered in Databricks.)

There were more female respondents (188,270) than male respondents (172,325), a difference of 15,945.
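
As a quick arithmetic check, the sketch below recomputes the total, the difference, and the percentage shares from the two counts in the SEX table above. The variable names are illustrative.

```python
# Counts copied from the SEX table above.
sex_counts = {"Male": 172325, "Female": 188270}

total = sum(sex_counts.values())
difference = sex_counts["Female"] - sex_counts["Male"]
shares = {sex: 100 * count / total for sex, count in sex_counts.items()}

print("total respondents:", total)
print("female minus male:", difference)
for sex, share in shares.items():
    print(f"{sex}: {share:.1f}%")
```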

- **Importing Libraries**

In [0]:
# The model built below is a logistic regression classifier, so the linear regression imports are not needed
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

- **Preparing the Data**

In [0]:
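# Combine the predictor columns into a single vector column called 'features'.
# Note: "SEXVAR" is listed twice, so the vector has 12 entries with the sex code repeated.
assembler = VectorAssembler(inputCols=["SEXVAR","RACE","AGE","SEXVAR","GENHLTH","BMI", "PHYEXERCISE","CURRENTSMOKE","DRUNKDAY","DIABETE","STROKE","HEARTDISEASE"], outputCol='features')
output_data = assembler.transform(Heart)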

In [0]:
#print the schema
output_data.printSchema()

root
 |-- SEXVAR: double (nullable = true)
 |-- RACE: double (nullable = true)
 |-- AGE: double (nullable = true)
 |-- SEXVAR: double (nullable = true)
 |-- GENHLTH: double (nullable = true)
 |-- BMI: double (nullable = true)
 |-- PHYEXERCISE: double (nullable = true)
 |-- CURRENTSMOKE: double (nullable = true)
 |-- DRUNKDAY: double (nullable = true)
 |-- DIABETE: double (nullable = true)
 |-- STROKE: double (nullable = true)
 |-- HEARTDISEASE: double (nullable = true)
 |-- HEARTATT: double (nullable = true)
 |-- AgeGroup: string (nullable = false)
 |-- SEX: string (nullable = false)
 |-- features: vector (nullable = true)



In [0]:
#display the dataframe
display(output_data)

SEXVAR,RACE,AGE,SEXVAR.1,GENHLTH,BMI,PHYEXERCISE,CURRENTSMOKE,DRUNKDAY,DIABETE,STROKE,HEARTDISEASE,HEARTATT,AgeGroup,SEX,features
2.0,1.0,8.0,2.0,2.0,1.0,1.0,2.0,2.0,1.0,2.0,2.0,2.0,Age 55 to 59,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 8.0, 2.0, 2.0, 1.0, 1.0, 2.0, 2.0, 1.0, 2.0, 2.0))"
2.0,2.0,10.0,2.0,3.0,3.0,1.0,9.0,9.0,3.0,2.0,2.0,2.0,Age 65 to 69,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 2.0, 10.0, 2.0, 3.0, 3.0, 1.0, 9.0, 9.0, 3.0, 2.0, 2.0))"
2.0,1.0,13.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,1.0,2.0,2.0,Age 80 or older,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 13.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 3.0, 1.0, 2.0))"
1.0,1.0,10.0,1.0,4.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,2.0,Age 65 to 69,Male,"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 10.0, 1.0, 4.0, 3.0, 1.0, 1.0, 2.0, 1.0, 2.0, 2.0))"
2.0,1.0,12.0,2.0,3.0,2.0,2.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 75 to 79,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 12.0, 2.0, 3.0, 2.0, 2.0, 1.0, 2.0, 3.0, 2.0, 2.0))"
2.0,2.0,10.0,2.0,4.0,3.0,1.0,2.0,9.0,1.0,2.0,2.0,2.0,Age 65 to 69,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 2.0, 10.0, 2.0, 4.0, 3.0, 1.0, 2.0, 9.0, 1.0, 2.0, 2.0))"
2.0,1.0,5.0,2.0,2.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 40 to 44,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 5.0, 2.0, 2.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0))"
2.0,2.0,12.0,2.0,4.0,3.0,2.0,1.0,2.0,3.0,2.0,1.0,2.0,Age 75 to 79,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 2.0, 12.0, 2.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0, 1.0))"
2.0,1.0,11.0,2.0,4.0,2.0,1.0,1.0,2.0,3.0,2.0,2.0,2.0,Age 70 to 74,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 11.0, 2.0, 4.0, 2.0, 1.0, 1.0, 2.0, 3.0, 2.0, 2.0))"
2.0,1.0,13.0,2.0,3.0,4.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,Age 80 or older,Female,"Map(vectorType -> dense, length -> 12, values -> List(2.0, 1.0, 13.0, 2.0, 3.0, 4.0, 2.0, 1.0, 2.0, 1.0, 2.0, 2.0))"


- **Split Dataset**

In [0]:
#Create final data
from pyspark.ml.classification import LogisticRegression
final_data = output_data.select('features','HEARTATT')
final_data.show()

+--------------------+--------+
|            features|HEARTATT|
+--------------------+--------+
|[2.0,1.0,8.0,2.0,...|     2.0|
|[2.0,2.0,10.0,2.0...|     2.0|
|[2.0,1.0,13.0,2.0...|     2.0|
|[1.0,1.0,10.0,1.0...|     2.0|
|[2.0,1.0,12.0,2.0...|     2.0|
|[2.0,2.0,10.0,2.0...|     2.0|
|[2.0,1.0,5.0,2.0,...|     2.0|
|[2.0,2.0,12.0,2.0...|     2.0|
|[2.0,1.0,11.0,2.0...|     2.0|
|[2.0,1.0,13.0,2.0...|     2.0|
|[2.0,1.0,13.0,2.0...|     2.0|
|[1.0,1.0,10.0,1.0...|     2.0|
|[1.0,1.0,9.0,1.0,...|     2.0|
|[2.0,1.0,8.0,2.0,...|     2.0|
|[1.0,1.0,12.0,1.0...|     2.0|
|[2.0,1.0,13.0,2.0...|     2.0|
|[2.0,1.0,9.0,2.0,...|     2.0|
|[2.0,1.0,7.0,2.0,...|     2.0|
|[2.0,1.0,11.0,2.0...|     2.0|
|[1.0,1.0,11.0,1.0...|     2.0|
+--------------------+--------+
only showing top 20 rows



In [0]:
#print schema of final data
final_data.printSchema()

root
 |-- features: vector (nullable = true)
 |-- HEARTATT: double (nullable = true)



In [0]:
# Split the data into 70% training and 30% test rows (no seed is set, so the split differs between runs)
train, test = final_data.randomSplit([0.7,0.3])

- **Build the Model**

In [0]:
#build the model
models = LogisticRegression(labelCol='HEARTATT')
model = models.fit(train)
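
The fit above uses `HEARTATT` as-is. The raw column holds the codes 1, 2, 7 and 9, and the `rawPrediction` vectors in the prediction output have length 10, which suggests the model was trained as a multi-class problem rather than a yes/no one. A minimal sketch of a binary recoding follows. It assumes, based on the 1=Yes, 2=No convention used for other columns and on the usual BRFSS coding, that 7 and 9 mean "don't know" and "refused"; the function name `recode_heartatt` and the sample values are hypothetical.

```python
from typing import Optional


def recode_heartatt(code: float) -> Optional[int]:
    """Map a raw HEARTATT code to a binary label, or None when the answer is unusable."""
    if code == 1:
        return 1     # assumed: 1 = yes
    if code == 2:
        return 0     # assumed: 2 = no
    return None      # 7, 9 or anything else: treat as missing


raw_labels = [2.0, 1.0, 7.0, 2.0, 9.0]   # toy values, not taken from the data set
binary_labels = [recode_heartatt(code) for code in raw_labels]
usable_labels = [label for label in binary_labels if label is not None]

print(binary_labels)
print(usable_labels)
```

In Spark, the same mapping could be written with `when(...)` on the label column followed by a null drop, before the train/test split.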

In [0]:
summary = model.summary
model.summary.predictions.show()

+--------------------+--------+--------------------+--------------------+----------+
|            features|HEARTATT|       rawPrediction|         probability|prediction|
+--------------------+--------+--------------------+--------------------+----------+
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
|[1.0,1.0,1.0,1.0,...|     2.0|[-3.3453315856961...|[7.29895002325103...|       2.0|
(output truncated)

In [0]:
display(model.summary.predictions)

features,HEARTATT,rawPrediction,probability,prediction
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0
"Map(vectorType -> dense, length -> 12, values -> List(1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 2.0, 2.0))",2.0,"Map(vectorType -> dense, length -> 10, values -> List(-3.3453315856961865, 5.2421251486230105, 10.780120399317422, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, -3.3453315856961865, 3.8321418295018104, -3.3453315856961865, 0.21760213673487216))","Map(vectorType -> dense, length -> 10, values -> List(7.298950023251037E-7, 0.0039151234743068225, 0.995098889264441, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 7.298950023251037E-7, 9.558670435376759E-4, 7.298950023251037E-7, 2.5740847700295722E-5))",2.0


In [0]:
#summary of the model
summary = model.summary
display(summary.predictions.describe())

summary,HEARTATT,prediction
count,252391.0,252391.0
mean,1.9670867820167917,1.9957367734982625
stddev,0.4223048847209235,0.1380128850275615
min,1.0,1.0
max,9.0,9.0


In [0]:
lrPredictions = model.transform(test)
lrPredictions

Out[77]: DataFrame[features: vector, HEARTATT: double, rawPrediction: vector, probability: vector, prediction: double]

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
eval_accuracy = MulticlassClassificationEvaluator(labelCol="HEARTATT",  metricName="accuracy")
accuracy = eval_accuracy.evaluate(lrPredictions)


In [0]:
print("Accuracy: %f" % accuracy)

Accuracy: 0.940926


- **Evaluate the Model**

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [0]:
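# Note: rawPredictionCol points at the hard `prediction` column and HEARTATT is not a 0/1 label,
# so the value this evaluator reports should not be read as a true ROC AUC.
evaluator = BinaryClassificationEvaluator(labelCol="HEARTATT", rawPredictionCol="prediction", metricName='areaUnderROC')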

In [0]:
pred= model.transform(test)

In [0]:
auc = evaluator.evaluate(pred)

In [0]:
print("AUC: %f" % auc)

AUC: 1.000000
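
An AUC of exactly 1.0 is worth a second look. The area under the ROC curve is defined for a binary label and a continuous score: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The plain-Python sketch below illustrates that definition on toy values (not taken from this data set); `roc_auc` is a hypothetical helper, not part of PySpark.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly, ties counting half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


toy_labels = [1, 0, 1, 0]            # binary labels, 1 = positive
toy_scores = [0.9, 0.1, 0.4, 0.6]    # model scores, higher = more likely positive

toy_auc = roc_auc(toy_labels, toy_scores)
print("toy AUC:", toy_auc)
```

This quadratic-time version is only meant to show the definition; it needs labels in {0, 1} and real-valued scores, which is why a multi-valued label and a hard class prediction do not give a meaningful AUC.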


In [0]:
# calculate AUC
auc = evaluator.evaluate(pred, {evaluator.metricName: 'areaUnderROC'})

In [0]:
# compute TN, TP, FN, and FP
pred.groupBy('HEARTATT', 'prediction').count().show()

+--------+----------+------+
|HEARTATT|prediction| count|
+--------+----------+------+
|     7.0|       7.0|     3|
|     1.0|       1.0|   433|
|     7.0|       1.0|     6|
|     9.0|       2.0|    18|
|     2.0|       7.0|     6|
|     9.0|       7.0|     4|
|     7.0|       2.0|   428|
|     2.0|       2.0|101355|
|     9.0|       1.0|     1|
|     9.0|       9.0|    21|
|     2.0|       1.0|   270|
|     1.0|       2.0|  5651|
|     2.0|       9.0|     1|
|     1.0|       7.0|     5|
|     7.0|       9.0|     2|
+--------+----------+------+
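
The counts in the table above are enough to recompute the usual metrics by hand. The sketch below does so in plain Python, treating `HEARTATT = 1` as the positive class (an assumption; the notebook does not state which code is positive). The variable names are illustrative.

```python
# (HEARTATT, prediction) -> count, copied from the table above.
counts = {
    (7.0, 7.0): 3, (1.0, 1.0): 433, (7.0, 1.0): 6, (9.0, 2.0): 18,
    (2.0, 7.0): 6, (9.0, 7.0): 4, (7.0, 2.0): 428, (2.0, 2.0): 101355,
    (9.0, 1.0): 1, (9.0, 9.0): 21, (2.0, 1.0): 270, (1.0, 2.0): 5651,
    (2.0, 9.0): 1, (1.0, 7.0): 5, (7.0, 9.0): 2,
}
positive = 1.0  # assumed positive class

total = sum(counts.values())
correct = sum(n for (label, pred), n in counts.items() if label == pred)

tp = counts.get((positive, positive), 0)
fp = sum(n for (label, pred), n in counts.items() if pred == positive and label != positive)
fn = sum(n for (label, pred), n in counts.items() if label == positive and pred != positive)

accuracy = correct / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.6f}")
print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"F1:        {f1:.3f}")
```

The accuracy reproduces the value printed by the evaluator, while the recall for class 1 shows that most rows with `HEARTATT = 1` are predicted as 2.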




In [0]:
# Calculate the elements of the confusion matrix for the two main classes.
# pred has no GENHLTH column; the label is HEARTATT, coded 1 (treated here as the positive class) and 2.
TP = pred.filter('prediction = 1 AND HEARTATT = 1').count()
TN = pred.filter('prediction = 2 AND HEARTATT = 2').count()
FP = pred.filter('prediction = 1 AND HEARTATT = 2').count()
FN = pred.filter('prediction = 2 AND HEARTATT = 1').count()

**Conclusion**


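The logistic regression model reaches an accuracy of 0.940926 on the test split, but this figure is dominated by the majority class: 101,632 of the 108,204 test rows (about 93.9%) have HEARTATT = 2, so always predicting 2 would score almost as well. Of the 6,089 test rows with HEARTATT = 1, only 433 are predicted as 1 (a recall of roughly 7%), and of the 710 rows predicted as 1, 433 are correct (a precision of roughly 61%). The reported AUC of 1.000 is not reliable, because the evaluator was given the hard `prediction` column rather than a score and the label still contains the codes 7 and 9. Taken together, the model is not yet performing well at identifying the HEARTATT = 1 cases.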

This model can be used for further analysis to determine what characteristics are associated with the target variable, and what additional features may be used to improve the results of the model. Additionally, this model can be used to evaluate the potential effects of different features on the target variable, and to develop strategies to reduce the risk of false positives and false negatives.