# Exploring Relationships Between Smoking, Drinking, and General Health

**Introduction**

The Behavioral Risk Factor Surveillance System (BRFSS) is a nationwide survey conducted by the Centers for Disease Control and Prevention (CDC) that collects data on health-related risk behaviors, chronic health conditions, and the use of preventive services. This dataset comes from the 2020 BRFSS; the file loaded below contains responses from 401,958 adults across the United States.

This notebook explores relationships between smoking, drinking, and self-reported general health in the BRFSS 2020 data. It first summarizes how smoking and drinking behaviors are distributed across sex, age group, and state, then checks how strongly each behavior variable correlates with general health, and finally trains a logistic regression model that predicts the general health rating from those variables. The analysis uses the following columns:

- SMOKE100 - Smoked at Least 100 Cigarettes
- USENOW3 - Use of Smokeless Tobacco Products
- SMOKESTATUS - Computed Smoking Status
- CURRENTSMOKE - Adults who are current smokers
- ALCDAY5 - Days in past 30 had alcoholic beverage
- DRUNKDAY - Drink any alcoholic beverages in past 30 days
- DRUNKWEEK - Computed number of drinks of alcoholic beverages per week
- DRUNKHEAVY - Heavy Alcohol Consumption Calculated
- GENHLTH - General Health

- **Importing Libraries**

In [0]:
from pyspark.sql.types import *
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
import pyspark.sql.functions as F
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline



- **Loading the dataset**

In [0]:
filePath = "/tmp/sf-brfss2020.parquet"
brfss2020_df = spark.read.parquet(filePath)
display(brfss2020_df)

STATE,AGE,SEXVAR,MARITAL,EDUCA,RENTHOME,CELLPHONES,VETERAN,EMPLOYE,INCOME,CHILDREN,WEIGHT,HEIGHT,BMI,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,SMOKE100,USENOW3,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,RACE,ECIGARET,GENHLTH,STATENAME
1.0,8.0,2.0,2.0,6.0,1.0,1.0,2.0,4.0,1.0,88.0,106.0,507.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,1.0,3.0,2.0,888.0,2.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0,Alabama
1.0,10.0,2.0,3.0,6.0,1.0,1.0,2.0,7.0,99.0,88.0,170.0,504.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,,,9.0,,9.0,99900.0,9.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,2.0,,3.0,Alabama
1.0,10.0,2.0,1.0,5.0,1.0,1.0,2.0,7.0,7.0,88.0,7777.0,508.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,Alabama
1.0,13.0,2.0,3.0,4.0,1.0,9.0,2.0,5.0,99.0,88.0,9999.0,9999.0,,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,Alabama
1.0,13.0,2.0,3.0,6.0,2.0,8.0,2.0,7.0,77.0,88.0,126.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,Alabama
1.0,10.0,1.0,4.0,4.0,3.0,1.0,2.0,8.0,5.0,88.0,180.0,509.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,1.0,1.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,1.0,1.0,4.0,Alabama
1.0,12.0,2.0,1.0,4.0,1.0,2.0,2.0,7.0,6.0,88.0,150.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,Alabama
1.0,10.0,2.0,1.0,4.0,1.0,1.0,2.0,7.0,5.0,88.0,150.0,503.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,1.0,3.0,2.0,,9.0,99900.0,9.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,2.0,,4.0,Alabama
1.0,5.0,2.0,2.0,6.0,1.0,2.0,2.0,1.0,6.0,2.0,170.0,511.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,1.0,2.0,2.0,Alabama
1.0,12.0,2.0,3.0,2.0,1.0,8.0,2.0,7.0,99.0,88.0,163.0,503.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,1.0,3.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,2.0,2.0,4.0,Alabama


- **Substance Dataset**

In [0]:
# Select the columns related to smoking, alcohol, and e-cigarette use, plus basic demographics
Smoke = brfss2020_df.select("STATE","STATENAME","AGE","SEXVAR","SMOKE100","USENOW3","SMOKESTATUS","CURRENTSMOKE", "ALCDAY5", "DRUNKDAY", "DRUNKWEEK", "DRUNKHEAVY","GENHLTH","ECIGARET")
display(Smoke)

STATE,STATENAME,AGE,SEXVAR,SMOKE100,USENOW3,SMOKESTATUS,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,GENHLTH,ECIGARET
1.0,Alabama,8.0,2.0,1.0,3.0,1.0,2.0,888.0,2.0,0.0,1.0,2.0,1.0
1.0,Alabama,10.0,2.0,,,9.0,9.0,,9.0,99900.0,9.0,3.0,
1.0,Alabama,10.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0,2.0
1.0,Alabama,13.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,1.0,2.0
1.0,Alabama,13.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0,2.0
1.0,Alabama,10.0,1.0,1.0,1.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0
1.0,Alabama,12.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0,2.0
1.0,Alabama,10.0,2.0,1.0,3.0,1.0,2.0,,9.0,99900.0,9.0,4.0,
1.0,Alabama,5.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0,2.0
1.0,Alabama,12.0,2.0,1.0,3.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,2.0


In [0]:
Smoke.columns

Out[7]: ['STATE',
 'STATENAME',
 'AGE',
 'SEXVAR',
 'SMOKE100',
 'USENOW3',
 'SMOKESTATUS',
 'CURRENTSMOKE',
 'ALCDAY5',
 'DRUNKDAY',
 'DRUNKWEEK',
 'DRUNKHEAVY',
 'GENHLTH',
 'ECIGARET']

In [0]:
#print the schema
Smoke.printSchema()

root
 |-- STATE: double (nullable = true)
 |-- STATENAME: string (nullable = true)
 |-- AGE: double (nullable = true)
 |-- SEXVAR: double (nullable = true)
 |-- SMOKE100: double (nullable = true)
 |-- USENOW3: double (nullable = true)
 |-- SMOKESTATUS: double (nullable = true)
 |-- CURRENTSMOKE: double (nullable = true)
 |-- ALCDAY5: double (nullable = true)
 |-- DRUNKDAY: double (nullable = true)
 |-- DRUNKWEEK: double (nullable = true)
 |-- DRUNKHEAVY: double (nullable = true)
 |-- GENHLTH: double (nullable = true)
 |-- ECIGARET: double (nullable = true)



In [0]:
#Count the total number of rows and columns
print((Smoke.count(), len(Smoke.columns)))

(401958, 14)


- **Eliminating the Null Values**

In [0]:
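# Count nulls per column; the loop variable is named c so it does not shadow pyspark.sql.functions.col
for c in Smoke.columns:
    print(c + ":", Smoke.filter(Smoke[c].isNull()).count())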

STATE: 0
STATENAME: 0
AGE: 0
SEXVAR: 0
SMOKE100: 17860
USENOW3: 18493
SMOKESTATUS: 0
CURRENTSMOKE: 0
ALCDAY5: 20927
DRUNKDAY: 0
DRUNKWEEK: 0
DRUNKHEAVY: 0
GENHLTH: 8
ECIGARET: 137397
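
To put these counts in context, the short sketch below expresses each non-zero null count as a share of the 401,958 rows reported above. The figures are copied from the outputs; the dictionary and variable names are illustrative only and are not part of the original notebook.

```python
# Null counts copied from the output above; total rows from Smoke.count().
total_rows = 401958
null_counts = {
    "SMOKE100": 17860,
    "USENOW3": 18493,
    "ALCDAY5": 20927,
    "GENHLTH": 8,
    "ECIGARET": 137397,
}

# Share of rows that are null in each column.
null_share = {name: count / total_rows for name, count in null_counts.items()}

# Print the columns from most to least affected.
for name, share in sorted(null_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1%}")
```

ECIGARET is null for roughly a third of respondents, so dropping every row that has any null removes at least that share of the data. Leaving ECIGARET out of the `subset` would likely keep many more rows.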


In [0]:
Smoke =Smoke.na.drop(subset=["STATE","STATENAME","AGE","SEXVAR","SMOKE100","USENOW3","SMOKESTATUS","CURRENTSMOKE", "ALCDAY5", "DRUNKDAY", "DRUNKWEEK", "DRUNKHEAVY","GENHLTH","ECIGARET"])
display(Smoke)

STATE,STATENAME,AGE,SEXVAR,SMOKE100,USENOW3,SMOKESTATUS,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,GENHLTH,ECIGARET
1.0,Alabama,8.0,2.0,1.0,3.0,1.0,2.0,888.0,2.0,0.0,1.0,2.0,1.0
1.0,Alabama,10.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0,2.0
1.0,Alabama,13.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,1.0,2.0
1.0,Alabama,13.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0,2.0
1.0,Alabama,10.0,1.0,1.0,1.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0
1.0,Alabama,12.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0,2.0
1.0,Alabama,5.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0,2.0
1.0,Alabama,12.0,2.0,1.0,3.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,2.0
1.0,Alabama,11.0,2.0,2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,4.0,2.0
1.0,Alabama,13.0,2.0,1.0,3.0,3.0,1.0,888.0,2.0,0.0,1.0,3.0,2.0
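
Dropping nulls does not remove the survey's special codes: the rows above still contain ALCDAY5 values of 888.0. Assuming the coding described in the CDC BRFSS codebook (1xx = days per week, 2xx = days in the past 30, 888 = no drinks in the past 30 days, 777 = don't know, 999 = refused), the hypothetical helper below converts a raw code into days per 30 days. It is plain Python for illustration and not part of the original notebook.

```python
def alcday5_to_days_per_30(code):
    """Convert a raw ALCDAY5 code to drinking days per 30 days (assumed coding)."""
    if code is None:
        return None                      # not asked or missing
    code = int(code)
    if code == 888:
        return 0.0                       # no drinks in the past 30 days
    if 101 <= code <= 107:
        days_per_week = code - 100
        return days_per_week * 30 / 7    # scale a weekly count to 30 days
    if 201 <= code <= 230:
        return float(code - 200)         # already days in the past 30
    return None                          # 777, 999, or any unexpected code

for raw in (888.0, 203.0, 107.0, 777.0, None):
    print(raw, "->", alcday5_to_days_per_30(raw))
```

In Spark the same mapping could be expressed with chained `when` conditions, as is done for the SEX and AgeGroup columns later in the notebook.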


- **About Smoke Data**

In [0]:
Smoke = Smoke.withColumn("SEX", Smoke["SEXVAR"])

In [0]:
# SEXVAR is stored as a double, so compare against numbers: 1 = Male, 2 = Female
Smoke = Smoke.withColumn("SEX", 
                                   when(Smoke["SEX"] == 1, 'Male')
                                   .when(Smoke["SEX"] == 2, 'Female')
                                   .otherwise('Unknown'))

In [0]:
multiple8 = Smoke\
.groupby(["SEXVAR","SMOKE100"])\
.agg({'SEXVAR':'count','SMOKE100':'count'})
multiple8

Out[14]: DataFrame[SEXVAR: double, SMOKE100: double, count(SMOKE100): bigint, count(SEXVAR): bigint]

In [0]:
display(multiple8)

SEXVAR,SMOKE100,count(SMOKE100),count(SEXVAR)
1.0,1.0,55419,55419
2.0,7.0,675,675
2.0,2.0,89881,89881
2.0,1.0,53924,53924
1.0,2.0,63018,63018
2.0,9.0,182,182
1.0,7.0,564,564
1.0,9.0,209,209


In [0]:
res8 = multiple8[multiple8['SMOKE100'] == '1']

In [0]:
display(res8)

SEXVAR,SMOKE100,count(SMOKE100),count(SEXVAR)
1.0,1.0,55419,55419
2.0,1.0,53924,53924
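
The grouped counts can be turned into proportions. The sketch below copies the counts from the `multiple8` output and computes, for each sex, the unweighted share of respondents who answered yes to SMOKE100. It assumes the usual BRFSS coding (SEXVAR 1 = male, 2 = female; SMOKE100 1 = yes, 2 = no, 7 = don't know, 9 = refused); the variable and function names are illustrative.

```python
# (SEXVAR, SMOKE100) -> count, copied from the multiple8 output above.
counts = {
    (1, 1): 55419, (1, 2): 63018, (1, 7): 564, (1, 9): 209,
    (2, 1): 53924, (2, 2): 89881, (2, 7): 675, (2, 9): 182,
}

def share_smoked_100(sex):
    """Share of respondents of the given sex who answered yes (code 1)."""
    total = sum(n for (s, _), n in counts.items() if s == sex)
    return counts[(sex, 1)] / total

male_share = share_smoked_100(1)
female_share = share_smoked_100(2)
print(f"Male: {male_share:.3f}, Female: {female_share:.3f}")
```

Although the raw counts in `res8` are close, the proportion is higher among men because the sample contains fewer men than women.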


In [0]:
Smoke = Smoke.withColumn("AgeGroup", Smoke["AGE"])

In [0]:
Smoke = Smoke.withColumn("AgeGroup", 
                                   when(Smoke["AgeGroup"] == '1', 'Age 18 to 24')
                                   .when(Smoke["AgeGroup"] == '2', 'Age 25 to 29')
                                   .when(Smoke["AgeGroup"] == '3', 'Age 30 to 34')
                                   .when(Smoke["AgeGroup"] == '4', 'Age 35 to 39')
                                   .when(Smoke["AgeGroup"] == '5', 'Age 40 to 44')
                                   .when(Smoke["AgeGroup"] == '6', 'Age 45 to 49')
                                   .when(Smoke["AgeGroup"] == '7', 'Age 50 to 54')
                                   .when(Smoke["AgeGroup"] == '8', 'Age 55 to 59')
                                   .when(Smoke["AgeGroup"] == '9', 'Age 60 to 64')
                                   .when(Smoke["AgeGroup"] == '10', 'Age 65 to 69')
                                   .when(Smoke["AgeGroup"] == '11', 'Age 70 to 74')
                                   .when(Smoke["AgeGroup"] == '12', 'Age 75 to 79')
                                   .when(Smoke["AgeGroup"] == '13', 'Age 80 or older')
                                   .when(Smoke["AgeGroup"] == '14', 'Blank')
                                   .otherwise('Unknown'))  # fallback label for any unmapped code

In [0]:
multiple = Smoke\
.groupby(["STATENAME","AGE","SEXVAR","SMOKE100","ECIGARET","DRUNKHEAVY","AgeGroup","SEX"])\
.agg({'STATENAME':'count','AGE':'count','SEXVAR':'count','SMOKE100':'count','ECIGARET':'count','DRUNKHEAVY':'count','AgeGroup':'count','SEX':'count'},)

multiple

Out[18]: DataFrame[STATENAME: string, AGE: double, SEXVAR: double, SMOKE100: double, ECIGARET: double, DRUNKHEAVY: double, AgeGroup: string, SEX: string, count(ECIGARET): bigint, count(AgeGroup): bigint, count(SMOKE100): bigint, count(AGE): bigint, count(SEX): bigint, count(SEXVAR): bigint, count(DRUNKHEAVY): bigint, count(STATENAME): bigint]

In [0]:
display(multiple)

STATENAME,AGE,SEXVAR,SMOKE100,ECIGARET,DRUNKHEAVY,AgeGroup,SEX,count(ECIGARET),count(AgeGroup),count(SMOKE100),count(AGE),count(SEX),count(SEXVAR),count(DRUNKHEAVY),count(STATENAME)
Nebraska,9.0,2.0,2.0,2.0,1.0,Age 60 to 64,Female,403,403,403,403,403,403,403,403
New Hampshire,12.0,2.0,2.0,2.0,1.0,Age 75 to 79,Female,135,135,135,135,135,135,135,135
Wyoming,10.0,1.0,2.0,2.0,9.0,Age 65 to 69,Male,1,1,1,1,1,1,1,1
Alabama,3.0,1.0,1.0,1.0,2.0,Age 30 to 34,Male,4,4,4,4,4,4,4,4
Arkansas,13.0,1.0,2.0,1.0,1.0,Age 80 or older,Male,3,3,3,3,3,3,3,3
Connecticut,13.0,2.0,2.0,9.0,1.0,Age 80 or older,Female,1,1,1,1,1,1,1,1
Connecticut,7.0,2.0,7.0,1.0,1.0,Age 50 to 54,Female,1,1,1,1,1,1,1,1
Florida,13.0,1.0,1.0,2.0,9.0,Age 80 or older,Male,8,8,8,8,8,8,8,8
Florida,1.0,1.0,2.0,1.0,9.0,Age 18 to 24,Male,3,3,3,3,3,3,3,3
Idaho,7.0,2.0,2.0,2.0,1.0,Age 50 to 54,Female,122,122,122,122,122,122,122,122


In [0]:
# Select the smoking and alcohol columns plus the GENHLTH outcome
Substance = brfss2020_df.select("SMOKE100","USENOW3","SMOKESTATUS","CURRENTSMOKE", "ALCDAY5", "DRUNKDAY", "DRUNKWEEK", "DRUNKHEAVY","GENHLTH")
display(Substance)

SMOKE100,USENOW3,SMOKESTATUS,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,GENHLTH
1.0,3.0,1.0,2.0,888.0,2.0,0.0,1.0,2.0
,,9.0,9.0,,9.0,99900.0,9.0,3.0
2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0
2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,1.0
2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0
1.0,1.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0
2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,3.0
1.0,3.0,1.0,2.0,,9.0,99900.0,9.0,4.0
2.0,3.0,4.0,1.0,888.0,2.0,0.0,1.0,2.0
1.0,3.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0


- **Correlation Analysis & Feature Selection**

In [0]:
#find the correlation among the set of input & output variables
for i in Substance.columns:
    print("Correlation to GENHLTH for {} is {}".format(i,Substance.stat.corr('GENHLTH',i)))

Correlation to GENHLTH for SMOKE100 is -0.09501980526458813
Correlation to GENHLTH for USENOW3 is 0.0027188641360756686
Correlation to GENHLTH for SMOKESTATUS is -0.10300572735119758
Correlation to GENHLTH for CURRENTSMOKE is 0.02426429971047382
Correlation to GENHLTH for ALCDAY5 is 0.16469498075117966
Correlation to GENHLTH for DRUNKDAY is 0.039621361610181705
Correlation to GENHLTH for DRUNKWEEK is -0.0025811006167783305
Correlation to GENHLTH for DRUNKHEAVY is -0.004739665359770283
Correlation to GENHLTH for GENHLTH is 1.0
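
`DataFrame.stat.corr` computes the Pearson coefficient. The sketch below implements the same formula in plain Python on made-up, simplified numbers (not BRFSS data) to show how a special code such as 888 can flip the sign of the coefficient when it is treated as a real quantity. The function name and sample values are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

health = [1, 2, 3, 4]            # made-up health ratings
drinking_days = [0, 5, 10, 20]   # made-up drinking days, with 0 meaning "none"
raw_codes = [888, 5, 10, 20]     # the same data with "none" stored as 888

r_clean = pearson(drinking_days, health)
r_raw = pearson(raw_codes, health)
print(f"recoded: {r_clean:.3f}, raw codes: {r_raw:.3f}")
```

This is one reason the coefficients above are hard to interpret until codes such as 7, 9, 888, and 99900 are recoded or filtered out.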


- **Importing Libraries**

In [0]:
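# The model below is a classifier, so import the classification estimator and evaluators
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator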

- **Preparing the Data**

In [0]:
feature_cols = ["SMOKE100","USENOW3","SMOKESTATUS","CURRENTSMOKE", "ALCDAY5", "DRUNKDAY", "DRUNKWEEK", "DRUNKHEAVY"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
# transform() must be called on the assembler instance, not on the VectorAssembler class.
# Rows with nulls are dropped first because VectorAssembler raises an error on null inputs by default.
output_data = assembler.transform(Substance.na.drop())

In [0]:
#print the schema
output_data.printSchema()



- **Split Dataset**

In [0]:
#Create final data
from pyspark.ml.classification import LogisticRegression
final_data = output_data.select('features','GENHLTH')



In [0]:
#print schema of final data
final_data.printSchema()



In [0]:
#split the dataset (a fixed seed makes the split reproducible)
train, test = final_data.randomSplit([0.7,0.3], seed=42)



- **Build the Model**

In [0]:
#build the model
models = LogisticRegression(labelCol='GENHLTH')
model = models.fit(train)
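
When the label has more than two distinct values, as GENHLTH does, Spark's `LogisticRegression` with its default `family="auto"` fits a multinomial (softmax) model. The sketch below shows, in plain Python with made-up scores, how softmax turns one score per class into probabilities and how the predicted class is the one with the highest probability. The function name and numbers are illustrative assumptions, not values from the fitted model.

```python
import math

def softmax(scores):
    """Convert per-class scores into probabilities that sum to 1."""
    shift = max(scores)                          # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.2, 1.5, 0.9, -0.3, -1.0]             # made-up scores, one per class
probs = softmax(scores)
predicted = probs.index(max(probs))              # index of the most probable class
print(predicted, [round(p, 3) for p in probs])
```

In the fitted Spark model, the per-class probabilities appear in the `probability` column and the chosen class in the `prediction` column of the transformed DataFrame.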



In [0]:
summary = model.summary
model.summary.predictions.show()



In [0]:
display(model.summary.predictions)



In [0]:
#summary of the model
summary = model.summary
display(summary.predictions.describe())



In [0]:
lrPredictions = model.transform(test)
lrPredictions



In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
eval_accuracy = MulticlassClassificationEvaluator(labelCol="GENHLTH",  metricName="accuracy")
accuracy = eval_accuracy.evaluate(lrPredictions)




In [0]:
print("Accuracy: %f" % accuracy)
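
An accuracy figure is easier to judge next to a majority-class baseline, that is, the accuracy obtained by always predicting the most common label. The sketch below computes that baseline in plain Python from a list of labels. The labels shown are made up; on the real data the counts could be taken from `test.groupBy('GENHLTH').count()`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    most_common_label, n = counts.most_common(1)[0]
    return most_common_label, n / len(labels)

labels = [2, 3, 2, 1, 2, 4, 3, 2, 5, 3]          # made-up health ratings
label, baseline = majority_baseline(labels)
print(f"always predicting {label} gives accuracy {baseline:.2f}")
```

A model is only informative if its accuracy clearly exceeds this baseline.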



- **Evaluate the Model**

In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
 
#bcEvaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
#print(f"Area under ROC curve: {bcEvaluator.evaluate(predDF)}")
#eval_accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
#mcEvaluator = MulticlassClassificationEvaluator(metricName="accuracy")
#print(f"Accuracy: {mcEvaluator.evaluate(predDF)}")
#evaluator = eval_accuracy.evaluate(predDf)
#display(mcEvaluator)



In [0]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator



In [0]:
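# Note: this evaluator expects a 0/1 label and raw scores; GENHLTH is multiclass and
# "prediction" holds hard class labels, so the resulting AUC should be read with caution.
evaluator = BinaryClassificationEvaluator(labelCol="GENHLTH", rawPredictionCol="prediction", metricName='areaUnderROC')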



In [0]:
pred= model.transform(test)



In [0]:
auc = evaluator.evaluate(pred)




In [0]:
print("AUC: %f" % auc)



In [0]:
display(auc)



In [0]:
# calculate AUC
auc = evaluator.evaluate(pred, {evaluator.metricName: 'areaUnderROC'})



In [0]:
print('AUC: %0.3f' % auc)



In [0]:
# compute TN, TP, FN, and FP
pred.groupBy('GENHLTH', 'prediction').count().show()



In [0]:
display(pred.groupBy('GENHLTH', 'prediction').count())



In [0]:
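# Count the confusion-matrix cells for predictions 0 and 1 only.
# GENHLTH codes start at 1, so rows predicted as any other class are not counted here.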
TN = pred.filter('prediction = 0 AND GENHLTH = prediction').count()
TP = pred.filter('prediction = 1 AND GENHLTH = prediction').count()
FN = pred.filter('prediction = 0 AND GENHLTH <> prediction').count()
FP = pred.filter('prediction = 1 AND GENHLTH <> prediction').count()



In [0]:
# calculate accuracy, precision, recall, and F1-score
accuracy = (TN + TP) / (TN + TP + FN + FP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F =  2 * (precision*recall) / (precision + recall)
print('precision: %0.3f' % precision)
print('recall: %0.3f' % recall)
print('accuracy: %0.3f' % accuracy)
print('F1 score: %0.3f' % F)
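
Because GENHLTH has more than two categories, the four cells above describe only part of the confusion matrix. A one-vs-rest sketch in plain Python, using made-up (label, prediction) pairs, computes precision and recall for each class separately. On the real data, the pair counts could come from `pred.groupBy('GENHLTH', 'prediction').count()`; all names here are illustrative.

```python
from collections import Counter

def per_class_metrics(pairs):
    """pairs: iterable of (label, prediction). Returns {class: (precision, recall)}."""
    pairs = list(pairs)
    counts = Counter(pairs)
    classes = sorted({l for l, _ in pairs} | {p for _, p in pairs})
    metrics = {}
    for c in classes:
        tp = counts[(c, c)]                                          # correctly predicted as c
        predicted_c = sum(n for (_, p), n in counts.items() if p == c)
        actual_c = sum(n for (l, _), n in counts.items() if l == c)
        precision = tp / predicted_c if predicted_c else 0.0         # guard against empty classes
        recall = tp / actual_c if actual_c else 0.0
        metrics[c] = (precision, recall)
    return metrics

# Made-up (label, prediction) pairs for illustration.
pairs = [(1, 2), (2, 2), (2, 2), (3, 2), (3, 3), (4, 3), (2, 3), (1, 1)]
for cls, (p, r) in per_class_metrics(pairs).items():
    print(f"class {cls}: precision={p:.3f} recall={r:.3f}")
```

Reporting one precision and recall value per class avoids treating a five-level rating as if it were a yes/no outcome.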



**Conclusion**


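The logistic regression model performs poorly. Its multiclass accuracy on the test set is 0.362697, so it assigns the correct general health rating to roughly 36% of respondents. The remaining figures (precision 0.265, recall 1.000, F1 score 0.419, AUC 1.000) come from binary metrics applied to a label that has more than two categories, so they should not be read as evidence of good performance. Because GENHLTH codes start at 1, every row predicted as 0 would count as a false negative; a recall of 1.000 therefore means only that no test row was predicted as 0, and the precision of 0.265 is the share of rows predicted as 1 whose actual rating is 1. The AUC of 1.000 is likewise not a reliable measure here, because the evaluator expects a binary label and raw scores but received a multiclass label and hard class predictions.

The weak correlations between the selected smoking and drinking variables and GENHLTH (all below 0.17 in absolute value) suggest that these features alone carry limited information about general health. Useful next steps are to recode the survey's "don't know" and "refused" codes before modeling, to add demographic and chronic-condition features from the full dataset, and either to evaluate with multiclass metrics or to recode GENHLTH to a binary outcome before relying on AUC, precision, and recall.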