# Predicting Health Behaviors Through Machine Learning Algorithms

**Introduction**

The Behavioral Risk Factor Surveillance System (BRFSS) is a large, random-digit-dialed telephone survey of non-institutionalized adults in the United States conducted by the Centers for Disease Control and Prevention (CDC). The 2020 BRFSS survey collects data on a variety of health-related topics including adult health and lifestyle behaviors, chronic conditions, and health access. This dataset includes information on age, sex, smoking status, binge drinking, sleeping time, physical exercise and general health of the respondents. The data from the BRFSS can be used to monitor and evaluate health behaviors, assess emerging health issues, and plan, implement and evaluate health promotion activities. This dataset can be used to gain insights into the health and lifestyles of U.S. adults and to better understand health disparities among populations.

The goal of this project is to use machine learning algorithms to predict whether a respondent is taking measures to improve their health: getting proper sleep, engaging in physical exercise, and reducing alcohol and cigarette consumption.

The data for this project consist of survey responses describing each respondent's health habits: how many hours of sleep they get, whether they exercise, how much they drink, and how much they smoke. These responses are used to build a model that predicts whether a respondent is taking steps to improve their health. The model is then evaluated on a held-out test set to measure the quality of its predictions.

Once the model is built and tested, it can be used to identify which respondents are taking steps to improve their health, and which respondents need more help to do so. This information can then be used to help health professionals and organizations better target their resources to those most in need.

The BRFSS Illness dataset for 2020 includes the following variables, among others:

- STATE: The two-letter postal code of the respondent's state or territory of residence.

- STATENAME: The full name of the respondent's state or territory of residence.

- AGE: The reported age of the respondent.

- SEXVAR: The reported sex of the respondent.

- SMOKESTATUS: The computed smoking status of the respondent.

- DRINKS: The calculated binge-drinking variable.

- SLEPTIME: The reported hours of sleep.

- PHYEXERCISE: Whether the respondent reported doing physical activity or exercise.

- GENHLTH: The reported general health of the respondent.

- **Importing Libraries**

In [0]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import pyspark.sql.functions as F
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

- **Loading the dataset**

In [0]:
filePath = "/tmp/sf-brfss2020.parquet"
brfss2020_df = spark.read.parquet(filePath)
display(brfss2020_df)

STATE,AGE,SEXVAR,MARITAL,EDUCA,RENTHOME,CELLPHONES,VETERAN,EMPLOYE,INCOME,CHILDREN,WEIGHT,HEIGHT,BMI,ASTHMA,ARTHRITIS,DEPRESSION,DIABETE,HEARTDISEASE,STROKE,HIV,HEARTATT,CONFUSSION,SMOKE100,USENOW3,CURRENTSMOKE,ALCDAY5,DRUNKDAY,DRUNKWEEK,DRUNKHEAVY,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,RACE,ECIGARET,GENHLTH,STATENAME
1.0,8.0,2.0,2.0,6.0,1.0,1.0,2.0,4.0,1.0,88.0,106.0,507.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,1.0,2.0,,1.0,3.0,2.0,888.0,2.0,0.0,1.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,1.0,1.0,2.0,Alabama
1.0,10.0,2.0,3.0,6.0,1.0,1.0,2.0,7.0,99.0,88.0,170.0,504.0,3.0,1.0,1.0,1.0,3.0,2.0,2.0,,2.0,,,,9.0,,9.0,99900.0,9.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,2.0,,3.0,Alabama
1.0,10.0,2.0,1.0,5.0,1.0,1.0,2.0,7.0,7.0,88.0,7777.0,508.0,,2.0,1.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,2.0,3.0,Alabama
1.0,13.0,2.0,3.0,4.0,1.0,9.0,2.0,5.0,99.0,88.0,9999.0,9999.0,,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,Alabama
1.0,13.0,2.0,3.0,6.0,2.0,8.0,2.0,7.0,77.0,88.0,126.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,1.0,9.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,1.0,2.0,2.0,Alabama
1.0,10.0,1.0,4.0,4.0,3.0,1.0,2.0,8.0,5.0,88.0,180.0,509.0,3.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,2.0,,1.0,1.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,1.0,1.0,4.0,Alabama
1.0,12.0,2.0,1.0,4.0,1.0,2.0,2.0,7.0,6.0,88.0,150.0,506.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,2.0,3.0,Alabama
1.0,10.0,2.0,1.0,4.0,1.0,1.0,2.0,7.0,5.0,88.0,150.0,503.0,3.0,2.0,1.0,1.0,1.0,2.0,2.0,,2.0,,1.0,3.0,2.0,,9.0,99900.0,9.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,2.0,,4.0,Alabama
1.0,5.0,2.0,2.0,6.0,1.0,2.0,2.0,1.0,6.0,2.0,170.0,511.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,2.0,,2.0,3.0,1.0,888.0,2.0,0.0,1.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,1.0,2.0,2.0,Alabama
1.0,12.0,2.0,3.0,2.0,1.0,8.0,2.0,7.0,99.0,88.0,163.0,503.0,3.0,2.0,1.0,2.0,3.0,1.0,2.0,2.0,2.0,,1.0,3.0,1.0,888.0,2.0,0.0,1.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,2.0,2.0,4.0,Alabama


- **Treatment Dataset**

In [0]:
# Keep only the lifestyle and health columns used in this analysis
Treatment = brfss2020_df.select("AGE","SMOKESTATUS","DRINKS","SLEPTIME","PHYEXERCISE","HEALTH","PHYHLTH","MENTHLTH","GENHLTH")
display(Treatment)

AGE,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH
8.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0
10.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0
10.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0
13.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0
13.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0
10.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0
12.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0
10.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0
5.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0
12.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0


In [0]:
# Count nulls per column; the loop variable is `c` so it does not shadow pyspark's col()
for c in Treatment.columns:
    print(c + ":", Treatment.filter(Treatment[c].isNull()).count())

AGE: 0
SMOKESTATUS: 0
DRINKS: 0
SLEPTIME: 3
PHYEXERCISE: 0
HEALTH: 0
PHYHLTH: 0
MENTHLTH: 0
GENHLTH: 8


In [0]:
Treatment = Treatment.na.drop(subset=["AGE","SMOKESTATUS","DRINKS","SLEPTIME","PHYEXERCISE","HEALTH","PHYHLTH","MENTHLTH","GENHLTH"])
display(Treatment)

AGE,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH
8.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0
10.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0
10.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0
13.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0
13.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0
10.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0
12.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0
10.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0
5.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0
12.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0


In [0]:
Treatment = Treatment.withColumn("GENHLTHSTAT", Treatment["GENHLTH"])

In [0]:
# Map GENHLTH codes to readable labels; any other value is treated as missing
Treatment = Treatment.withColumn("GENHLTHSTAT",
                                   F.when(F.col("GENHLTH") == 1, "Excellent")
                                   .when(F.col("GENHLTH") == 2, "Very good")
                                   .when(F.col("GENHLTH") == 3, "Good")
                                   .when(F.col("GENHLTH") == 4, "Fair")
                                   .when(F.col("GENHLTH") == 5, "Poor")
                                   .when(F.col("GENHLTH") == 7, "Don’t know/Not Sure")
                                   .when(F.col("GENHLTH") == 9, "Refused")
                                   .otherwise("Not asked or Missing"))
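
The code-to-label table in the cell above can also be kept as a plain Python dictionary, which makes it easy to reuse outside Spark, for example in plot legends. The sketch below is illustrative only: `GENHLTH_LABELS` and `label_genhlth` are hypothetical names, and the labels are copied from the mapping above.

```python
# Hypothetical lookup table; labels are taken from the when() chain above.
GENHLTH_LABELS = {
    1: "Excellent",
    2: "Very good",
    3: "Good",
    4: "Fair",
    5: "Poor",
    7: "Don’t know/Not Sure",
    9: "Refused",
}


def label_genhlth(code):
    """Return the text label for a GENHLTH code; None, NaN, or unknown -> missing."""
    if code is None or code != code:  # `code != code` is True only for NaN
        return "Not asked or Missing"
    return GENHLTH_LABELS.get(int(code), "Not asked or Missing")


print(label_genhlth(2.0))   # Very good
print(label_genhlth(None))  # Not asked or Missing
```

Keeping the labels in one place avoids retyping them wherever the codes need to be displayed.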

In [0]:
# Drop rows that have a null in any of the selected columns
Treatment = Treatment.na.drop(subset=["AGE","SMOKESTATUS","DRINKS","SLEPTIME","PHYEXERCISE","HEALTH","PHYHLTH","MENTHLTH","GENHLTH"])

In [0]:
display(Treatment)

AGE,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH,GENHLTHSTAT
8.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0,Very good
10.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0,Good
10.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0,Good
13.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,Excellent
13.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,Very good
10.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0,Fair
12.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0,Good
10.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0,Fair
5.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0,Very good
12.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0,Fair



In [0]:
display(Treatment)

AGE,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH,GENHLTHSTAT
8.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0,Very good
10.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0,Good
10.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0,Good
13.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,Excellent
13.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,Very good
10.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0,Fair
12.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0,Good
10.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0,Fair
5.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0,Very good
12.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0,Fair


Output can only be rendered in Databricks

In [0]:
multiple = Treatment\
.groupby(["GENHLTH","SMOKESTATUS"])\
.agg({'GENHLTH':'count'})

multiple

Out[57]: DataFrame[GENHLTH: double, SMOKESTATUS: double, count(GENHLTH): bigint]

In [0]:
display(multiple)

GENHLTH,SMOKESTATUS,count(GENHLTH)
7.0,3.0,169
3.0,9.0,6478
5.0,1.0,2862
5.0,4.0,5476
5.0,2.0,974
1.0,1.0,4279
7.0,1.0,103
3.0,2.0,4558
4.0,2.0,2348
9.0,2.0,22


In [0]:
# Keep only the rows for SMOKESTATUS code 1 (numeric comparison; the column is a double)
r1 = multiple.filter(F.col("SMOKESTATUS") == 1)

In [0]:
display(r1)

GENHLTH,SMOKESTATUS,count(GENHLTH)
5.0,1.0,2862
1.0,1.0,4279
7.0,1.0,103
3.0,1.0,13664
9.0,1.0,25
2.0,1.0,10015
4.0,1.0,7203



In [0]:
multiple1 = Treatment\
.groupby(["GENHLTH","DRINKS"])\
.agg({'GENHLTH':'count'})

multiple1

Out[61]: DataFrame[GENHLTH: double, DRINKS: double, count(GENHLTH): bigint]

In [0]:
display(multiple1)

GENHLTH,DRINKS,count(GENHLTH)
3.0,9.0,9563
5.0,1.0,13306
5.0,2.0,997
1.0,1.0,63254
7.0,1.0,506
3.0,2.0,13821
4.0,2.0,4071
9.0,2.0,13
4.0,9.0,3438
7.0,2.0,44


In [0]:
# Keep only the rows for DRINKS code 1 (numeric comparison; the column is a double)
r2 = multiple1.filter(F.col("DRINKS") == 1)

In [0]:
display(r2)

GENHLTH,DRINKS,count(GENHLTH)
5.0,1.0,13306
1.0,1.0,63254
7.0,1.0,506
3.0,1.0,96118
9.0,1.0,213
2.0,1.0,109246
4.0,1.0,38730



In [0]:
multiple3 = Treatment\
.groupby(["SLEPTIME","GENHLTH"])\
.agg({'SLEPTIME':'count'})

multiple3

Out[65]: DataFrame[SLEPTIME: double, GENHLTH: double, count(SLEPTIME): bigint]

In [0]:
display(multiple3)

SLEPTIME,GENHLTH,count(SLEPTIME)
8.0,3.0,35471
19.0,5.0,1
10.0,1.0,1406
99.0,9.0,33
12.0,5.0,445
24.0,4.0,8
77.0,9.0,14
22.0,5.0,5
7.0,3.0,32782
24.0,1.0,15



In [0]:
# Keep only the rows for GENHLTH code 1, Excellent (numeric comparison; the column is a double)
r3 = multiple3.filter(F.col("GENHLTH") == 1)


In [0]:
display(r3)

SLEPTIME,GENHLTH,count(SLEPTIME)
10.0,1.0,1406
24.0,1.0,15
6.0,1.0,14436
11.0,1.0,81
23.0,1.0,2
22.0,1.0,4
5.0,1.0,3223
1.0,1.0,144
7.0,1.0,27271
17.0,1.0,5



In [0]:
multiple4 = Treatment\
.groupby(["PHYEXERCISE","GENHLTH"])\
.agg({'PHYEXERCISE':'count'})

multiple4


Out[69]: DataFrame[PHYEXERCISE: double, GENHLTH: double, count(PHYEXERCISE): bigint]

In [0]:
display(multiple4)

PHYEXERCISE,GENHLTH,count(PHYEXERCISE)
9.0,5.0,67
1.0,1.0,71670
2.0,7.0,266
9.0,2.0,140
9.0,7.0,10
2.0,2.0,21783
9.0,1.0,81
2.0,3.0,33525
1.0,4.0,25696
9.0,3.0,230


In [0]:
# Keep only the rows for GENHLTH code 1, Excellent (numeric comparison; the column is a double)
r4 = multiple4.filter(F.col("GENHLTH") == 1)

In [0]:
display(r4)

PHYEXERCISE,GENHLTH,count(PHYEXERCISE)
1.0,1.0,71670
9.0,1.0,81
2.0,1.0,9909
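
Raw counts are hard to compare across groups of different sizes, so it can help to convert them to shares. The sketch below does this for the three counts shown above (respondents with GENHLTH code 1). It assumes that PHYEXERCISE code 1 means the respondent exercised, 2 means they did not, and 9 is a non-answer. `to_shares` is a hypothetical helper.

```python
def to_shares(counts):
    """Convert a {category: count} mapping into {category: fraction of total}."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


# PHYEXERCISE counts for GENHLTH == 1, copied from the output above
excellent_health = {1: 71670, 2: 9909, 9: 81}
shares = to_shares(excellent_health)
print(f"exercised: {shares[1]:.1%}")  # exercised: 87.8%
```

The same helper can be applied to the counts for any other GENHLTH code to compare exercise rates between groups.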



In [0]:
import pyspark
import numpy as np
import pandas as pd

- **VectorAssembler**

In [0]:
from pyspark.ml.feature import VectorAssembler

In [0]:
# Creating an instance of the VectorAssembler
assembler = VectorAssembler(inputCols=["SMOKESTATUS","DRINKS","SLEPTIME","PHYEXERCISE"],outputCol='features')

In [0]:
# transforming our spark dataframe
Treatment_df = assembler.transform(Treatment)
# Viewing the first 5 rows
Treatment_df.show(5)

+----+-----------+------+--------+-----------+------+-------+--------+-------+-----------+-----------------+
| AGE|SMOKESTATUS|DRINKS|SLEPTIME|PHYEXERCISE|HEALTH|PHYHLTH|MENTHLTH|GENHLTH|GENHLTHSTAT|         features|
+----+-----------+------+--------+-----------+------+-------+--------+-------+-----------+-----------------+
| 8.0|        1.0|   1.0|     5.0|        1.0|   1.0|    2.0|     3.0|    2.0|  Very good|[1.0,1.0,5.0,1.0]|
|10.0|        9.0|   9.0|     7.0|        1.0|   1.0|    1.0|     1.0|    3.0|      Good |[9.0,9.0,7.0,1.0]|
|10.0|        4.0|   1.0|     7.0|        1.0|   1.0|    1.0|     1.0|    3.0|      Good |[4.0,1.0,7.0,1.0]|
|13.0|        4.0|   1.0|     6.0|        2.0|   1.0|    1.0|     1.0|    1.0| Excellent |[4.0,1.0,6.0,2.0]|
|13.0|        4.0|   1.0|     7.0|        1.0|   1.0|    1.0|     1.0|    2.0|  Very good|[4.0,1.0,7.0,1.0]|
+----+-----------+------+--------+-----------+------+-------+--------+-------+-----------+-----------------+
only showing top 5 rows
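
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. The plain-Python sketch below mimics that for a single row to make the `features` column easier to read. It is not how Spark implements the transformer, and `assemble` is a hypothetical helper.

```python
INPUT_COLS = ["SMOKESTATUS", "DRINKS", "SLEPTIME", "PHYEXERCISE"]


def assemble(row, input_cols=INPUT_COLS):
    """Return the row's values for input_cols, in order, as a list of floats."""
    return [float(row[name]) for name in input_cols]


# First row of the table above
first_row = {"AGE": 8.0, "SMOKESTATUS": 1.0, "DRINKS": 1.0, "SLEPTIME": 5.0,
             "PHYEXERCISE": 1.0, "GENHLTH": 2.0}
print(assemble(first_row))  # [1.0, 1.0, 5.0, 1.0]
```

Because the codes are passed through as plain numbers, a model trained on this vector treats categorical codes such as SMOKESTATUS as if they were ordered quantities.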

In [0]:
display(Treatment_df)

AGE,SMOKESTATUS,DRINKS,SLEPTIME,PHYEXERCISE,HEALTH,PHYHLTH,MENTHLTH,GENHLTH,GENHLTHSTAT,features
8.0,1.0,1.0,5.0,1.0,1.0,2.0,3.0,2.0,Very good,"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 1.0))"
10.0,9.0,9.0,7.0,1.0,1.0,1.0,1.0,3.0,Good,"Map(vectorType -> dense, length -> 4, values -> List(9.0, 9.0, 7.0, 1.0))"
10.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,3.0,Good,"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 7.0, 1.0))"
13.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,1.0,Excellent,"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 6.0, 2.0))"
13.0,4.0,1.0,7.0,1.0,1.0,1.0,1.0,2.0,Very good,"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 7.0, 1.0))"
10.0,3.0,1.0,8.0,1.0,2.0,3.0,3.0,4.0,Fair,"Map(vectorType -> dense, length -> 4, values -> List(3.0, 1.0, 8.0, 1.0))"
12.0,4.0,1.0,6.0,2.0,1.0,1.0,1.0,3.0,Good,"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 6.0, 2.0))"
10.0,1.0,9.0,6.0,1.0,2.0,3.0,2.0,4.0,Fair,"Map(vectorType -> dense, length -> 4, values -> List(1.0, 9.0, 6.0, 1.0))"
5.0,4.0,1.0,8.0,1.0,1.0,3.0,1.0,2.0,Very good,"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 8.0, 1.0))"
12.0,3.0,1.0,12.0,2.0,2.0,2.0,1.0,4.0,Fair,"Map(vectorType -> dense, length -> 4, values -> List(3.0, 1.0, 12.0, 2.0))"



In [0]:
# Keep only the feature vector and the label column for modeling
clean_df = Treatment_df.select(['features', 'GENHLTH'])

clean_df.show(5)

+-----------------+-------+
|         features|GENHLTH|
+-----------------+-------+
|[1.0,1.0,5.0,1.0]|    2.0|
|[9.0,9.0,7.0,1.0]|    3.0|
|[4.0,1.0,7.0,1.0]|    3.0|
|[4.0,1.0,6.0,2.0]|    1.0|
|[4.0,1.0,7.0,1.0]|    2.0|
+-----------------+-------+
only showing top 5 rows



In [0]:
display(clean_df)

features,GENHLTH
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 5.0, 1.0))",2.0
"Map(vectorType -> dense, length -> 4, values -> List(9.0, 9.0, 7.0, 1.0))",3.0
"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 7.0, 1.0))",3.0
"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 6.0, 2.0))",1.0
"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 7.0, 1.0))",2.0
"Map(vectorType -> dense, length -> 4, values -> List(3.0, 1.0, 8.0, 1.0))",4.0
"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 6.0, 2.0))",3.0
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 9.0, 6.0, 1.0))",4.0
"Map(vectorType -> dense, length -> 4, values -> List(4.0, 1.0, 8.0, 1.0))",2.0
"Map(vectorType -> dense, length -> 4, values -> List(3.0, 1.0, 12.0, 2.0))",4.0



In [0]:
# Creating our train and test sets
train, test = clean_df.randomSplit([0.7, 0.3], seed=42)
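
A weighted random split assigns each row to a partition independently, so the resulting sizes are only approximately 70% and 30% of the data. The sketch below illustrates the idea in plain Python with a seeded generator. It is an analogy rather than Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


train_rows, test_rows = random_split(range(10_000))
print(len(train_rows) + len(test_rows))  # 10000
```

Fixing the seed makes the split repeatable, which matters when training and test metrics are compared across runs.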

- **Creating a Linear Regression Model**

In [0]:
from pyspark.ml.regression import LinearRegression

In [0]:
# creating an instance of a linear regression model
lr = LinearRegression(featuresCol='features',labelCol='GENHLTH')

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="GENHLTH", metricName="r2")

In [0]:
# fitting the model to the train set
model = lr.fit(train)

In [0]:
# Print the coefficients and intercept for linear regression
print("Coefficients: %s" % str(model.coefficients))
print("Intercept: %s" % str(model.intercept))

Coefficients: [-0.08331585929444325,0.02481871682955524,0.0067560549630756575,0.4703089020075944]
Intercept: 2.070810427187491
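
With the fitted coefficients and intercept, a prediction is the intercept plus the dot product of the coefficients with the feature vector. The sketch below recomputes one prediction by hand from the values printed above. `predict` is a hypothetical helper, and the feature order is assumed to match the `inputCols` given to the assembler.

```python
# Values copied from the printed model output
COEFFICIENTS = [-0.08331585929444325, 0.02481871682955524,
                0.0067560549630756575, 0.4703089020075944]
INTERCEPT = 2.070810427187491


def predict(features):
    """Linear prediction: intercept + sum(coefficient * feature)."""
    return INTERCEPT + sum(c * x for c, x in zip(COEFFICIENTS, features))


# SMOKESTATUS=1, DRINKS=1, SLEPTIME=1, PHYEXERCISE=1
print(round(predict([1.0, 1.0, 1.0, 1.0]), 6))  # 2.489378
```

This agrees with the value of about 2.4894 that the model assigns to test rows whose feature vector is all ones.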


In [0]:
# Summarize the model over the training set and print out some metrics
trainingSummary = model.summary
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show()
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)


numIterations: 0
objectiveHistory: [0.0]
+-------------------+
|          residuals|
+-------------------+
|-1.4893782416932728|
|-1.4893782416932728|
|-0.4893782416932728|
|-0.4893782416932728|
|-0.4893782416932728|
|-0.4893782416932728|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 0.5106217583067272|
| 1.5106217583067272|
| 1.5106217583067272|
| 1.5106217583067272|
|  2.510621758306727|
|  2.510621758306727|
|-0.9596871437008674|
+-------------------+
only showing top 20 rows

RMSE: 1.048460
r2: 0.069889


In [0]:
# Make predictions on the test data
test_results = model.transform(test)

In [0]:
display(test_results)

features,GENHLTH,prediction
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",2.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",2.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273
"Map(vectorType -> dense, length -> 4, values -> List(1.0, 1.0, 1.0, 1.0))",3.0,2.489378241693273


In [0]:
# Evaluate the predictions
test_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="GENHLTH", metricName="r2")
test_r2 = test_evaluator.evaluate(test_results)
print("Test r2: %f" % test_r2)

Test r2: 0.070707
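
For reference, both metrics reported above can be computed directly from labels and predictions. The sketch below uses the standard definitions: RMSE is the square root of the mean squared error, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function names and the toy data are illustrative and are not taken from the notebook.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))


def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0, 4.0]
constant = [2.5] * 4  # always predicting the mean label
print(r2(labels, constant))  # 0.0
```

Predicting the mean label for every row gives an R² of 0, so a test R² of about 0.07 indicates that the four features explain only around 7% of the variance in GENHLTH.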
