### Heart Disease Prediction


Heart disease is a general term that includes many types of heart problems. It's also called cardiovascular disease, which means heart and blood vessel disease. 
The causes of heart disease depend on the type of disease. Some possible causes include lifestyle, genetics, infections, medicines, and other diseases.



This notebook builds a classification model that predicts whether a patient has heart disease, based on clinical measurements and symptoms.

In [0]:
%scala 

// File location and type
val file_location = "/FileStore/tables/heart.csv"
val file_type = "csv"

// options
val infer_schema = "true"
val first_row_is_header = "true"
val delimiter = ","
// create a df
val heartDF = spark.read.format(file_type) 
  .option("inferSchema", infer_schema) 
  .option("header", first_row_is_header) 
  .option("sep", delimiter) 
  .load(file_location)

heartDF.show()

* **age:** The person's age in years

* **sex:** The person's sex (1 = male, 0 = female)

* **cp:** The chest pain experienced (coded 0 to 3 in this file: 0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)

* **trestbps:** The person's resting blood pressure (mm Hg on admission to the hospital)

* **chol:** The person's cholesterol measurement in mg/dl

* **fbs:** The person's fasting blood sugar (> 120 mg/dl, 1 = true; 0 = false)

* **restecg:** Resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)

* **thalach:** The person's maximum heart rate achieved

* **exang:** Exercise induced angina (1 = yes; 0 = no)

* **oldpeak:** ST depression induced by exercise relative to rest ("ST" refers to positions on the ECG plot)

* **slope:** The slope of the peak exercise ST segment (coded 0 to 2 in this file: 0 = upsloping, 1 = flat, 2 = downsloping)

* **ca:** The number of major vessels (0-3; a few rows in this file are coded 4)

* **thal:** A blood disorder called thalassemia

* **target:** Heart disease (0 = no, 1 = yes)
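
The categorical codes listed above can be kept in one lookup table, so that labels stay consistent across queries. The sketch below is a minimal illustration in plain Scala; the object `HeartCodes` and its `decode` helper are hypothetical names, and the mappings mirror the `CASE` expressions used in the SQL cells of this notebook rather than an authoritative codebook.

```scala
// Hypothetical lookup table for the categorical codes, mirroring the
// CASE expressions used in the SQL cells of this notebook.
object HeartCodes {
  val labels: Map[String, Map[Int, String]] = Map(
    "sex"     -> Map(0 -> "Female", 1 -> "Male"),
    "cp"      -> Map(0 -> "Typical angina", 1 -> "Atypical angina",
                     2 -> "Non-anginal pain", 3 -> "Asymptomatic"),
    "fbs"     -> Map(0 -> "False", 1 -> "True"),
    "restecg" -> Map(0 -> "Normal", 1 -> "Abnormality",
                     2 -> "Left Ventricular Hypertrophy"),
    "exang"   -> Map(0 -> "Not Induced", 1 -> "Induced"),
    "slope"   -> Map(0 -> "Upsloping", 1 -> "Flat", 2 -> "Downsloping"),
    "target"  -> Map(0 -> "No Disease", 1 -> "Has Disease")
  )

  // Returns the label for a code, or None when the column or code is unknown.
  def decode(column: String, code: Int): Option[String] =
    labels.get(column).flatMap(_.get(code))
}
```

For example, `HeartCodes.decode("cp", 2)` returns `Some("Non-anginal pain")`, while an unmapped column such as `"ca"` returns `None`.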

#### Diagnosis: 
The diagnosis of heart disease is based on a combination of clinical signs and test results.
<br>
Examples:
* Tests ranging from electrocardiograms and cardiac computerized tomography (CT) scans to blood tests and exercise stress tests.
* Risk factors such as high cholesterol, high blood pressure, diabetes, being overweight, physical inactivity, smoking and family history.
* Hereditary variables such as thalassemia.
* Stress, alcohol and poor diet/nutrition.

<br>
Hypothesis: if the model built below turns out to be predictive, some of these factors should stand out compared to the others.

In [0]:
%scala

heartDF.count()

In [0]:
%scala

heartDF.select("age", "sex", "cp", "trestbps", "chol", "fbs","restecg", "thalach","exang","oldpeak","slope","ca","thal","target").describe().show()


In [0]:
%scala

heartDF.select("age", "sex", "cp", "chol").describe().show()


In [0]:
%scala

heartDF.printSchema()

In [0]:
%scala

heartDF.createOrReplaceTempView("HeartData");

In [0]:
%sql

select * from HeartData;

age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
57,1,0,140,192,0,1,148,0,0.4,1,0,1,1
56,0,1,140,294,0,0,153,0,1.3,1,0,2,1
44,1,1,120,263,0,1,173,0,0.0,2,0,3,1
52,1,2,172,199,1,1,162,0,0.5,2,0,3,1
57,1,2,150,168,0,1,174,0,1.6,2,0,2,1


# Exploratory Data Analysis
(mostly graphical)

In [0]:
%sql

SELECT count(sex),
CASE
    WHEN sex == 1 THEN "Male"
    ELSE "Female"
END AS sex
FROM HeartData group by sex;

count(sex),sex
207,Male
96,Female


In [0]:
%sql

SELECT count(cp),
CASE
    WHEN cp == 0 THEN "Typical angina"
    WHEN cp == 1 THEN "Atypical angina"
    WHEN cp == 2 THEN "Non-anginal pain"
    ELSE "Asymptomatic"
END AS ChestPain
FROM HeartData group by cp;

count(cp),ChestPain
50,Atypical angina
23,Asymptomatic
87,Non-anginal pain
143,Typical angina


In [0]:
%sql

SELECT count(fbs),
CASE
    WHEN fbs == 1 THEN "True"
    ELSE "False"
END AS FastingBloodSugar
FROM HeartData group by fbs;

count(fbs),FastingBloodSugar
45,True
258,False


In [0]:
%sql

SELECT count(restecg),
CASE
    WHEN restecg == 0 THEN "Normal"
    WHEN restecg == 1 THEN "Abnormality"
    ELSE "Left Ventricular Hypertrophy"
END AS RestingElectrocardioGraphic
FROM HeartData group by restecg;

count(restecg),RestingElectrocardioGraphic
152,Abnormality
4,Left Ventricular Hypertrophy
147,Normal


In [0]:
%sql

SELECT count(exang),
CASE
    WHEN exang == 0 THEN "Not Induced"
    ELSE "Induced"
END AS ExerciseInducedAngina
FROM HeartData group by exang;


count(exang),ExerciseInducedAngina
99,Induced
204,Not Induced


In [0]:
%sql

SELECT count(slope),
CASE
    WHEN slope == 0 THEN "Upsloping"
    WHEN slope == 1 THEN "Flat"
    ELSE "Downsloping"
END AS Slope
FROM HeartData group by slope;

count(slope),Slope
140,Flat
142,Downsloping
21,Upsloping


In [0]:
%sql

SELECT count(ca), ca FROM HeartData group by ca order by ca;

count(ca),ca
175,0
65,1
38,2
20,3
5,4


In [0]:
%sql

select thal, count(thal) FROM HeartData group by thal order by thal;

thal,count(thal)
0,2
1,18
2,166
3,117


In [0]:
%sql

select count(sex),
CASE
    WHEN sex == 0 THEN "Female"
    ELSE "Male"
END AS Sex,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by sex, target;

count(sex),Sex,count(target),DiseaseStatus
114,Male,114,No Disease
93,Male,93,Has Disease
24,Female,24,No Disease
72,Female,72,Has Disease


In [0]:
%sql

select count(cp),
CASE
    WHEN cp == 0 THEN "Typical angina"
    WHEN cp == 1 THEN "Atypical angina"
    WHEN cp == 2 THEN "Non-anginal pain"
    ELSE "Asymptomatic"
END AS ChestPain,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by cp, target order by cp;

count(cp),ChestPain,count(target),DiseaseStatus
104,Typical angina,104,No Disease
39,Typical angina,39,Has Disease
9,Atypical angina,9,No Disease
41,Atypical angina,41,Has Disease
69,Non-anginal pain,69,Has Disease
18,Non-anginal pain,18,No Disease
16,Asymptomatic,16,Has Disease
7,Asymptomatic,7,No Disease


In [0]:
%sql

select count(fbs),
CASE
    WHEN fbs == 1 THEN "True"
    ELSE "False"
END AS FastingBloodSugar,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by fbs, target order by fbs;

count(fbs),FastingBloodSugar,count(target),DiseaseStatus
116,False,116,No Disease
142,False,142,Has Disease
22,True,22,No Disease
23,True,23,Has Disease


In [0]:
%sql

SELECT count(restecg),
CASE
    WHEN restecg == 0 THEN "Normal"
    WHEN restecg == 1 THEN "Abnormality"
    ELSE "Left Ventricular Hypertrophy"
END AS RestingElectrocardioGraphic,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by restecg, target order by restecg;

count(restecg),RestingElectrocardioGraphic,count(target),DiseaseStatus
79,Normal,79,No Disease
68,Normal,68,Has Disease
56,Abnormality,56,No Disease
96,Abnormality,96,Has Disease
1,Left Ventricular Hypertrophy,1,Has Disease
3,Left Ventricular Hypertrophy,3,No Disease


In [0]:
%sql

SELECT count(exang),
CASE
    WHEN exang == 0 THEN "Not Induced"
    ELSE "Induced"
END AS ExerciseInducedAngina,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by exang, target order by exang;

count(exang),ExerciseInducedAngina,count(target),DiseaseStatus
62,Not Induced,62,No Disease
142,Not Induced,142,Has Disease
76,Induced,76,No Disease
23,Induced,23,Has Disease


In [0]:
%sql

SELECT count(slope),
CASE
    WHEN slope == 0 THEN "Upsloping"
    WHEN slope == 1 THEN "Flat"
    ELSE "Downsloping"
END AS Slope,
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by slope, target order by slope;

count(slope),Slope,count(target),DiseaseStatus
107,Downsloping,107,Has Disease
35,Downsloping,35,No Disease
91,Flat,91,No Disease
49,Flat,49,Has Disease
12,Upsloping,12,No Disease
9,Upsloping,9,Has Disease


In [0]:
%sql

SELECT ca, count(ca),
count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by ca, target order by ca;

ca,count(ca),count(target),DiseaseStatus
0,45,45,No Disease
0,130,130,Has Disease
1,44,44,No Disease
1,21,21,Has Disease
2,7,7,Has Disease
2,31,31,No Disease
3,3,3,Has Disease
3,17,17,No Disease
4,1,1,No Disease
4,4,4,Has Disease


In [0]:
%sql

SELECT thal, count(thal),count(target), 
CASE
    WHEN target == 0 THEN "No Disease"
    ELSE "Has Disease"
  END AS DiseaseStatus
from HeartData group by thal, target order by thal;

thal,count(thal),count(target),DiseaseStatus
0,1,1,No Disease
0,1,1,Has Disease
1,12,12,No Disease
1,6,6,Has Disease
2,130,130,Has Disease
2,36,36,No Disease
3,28,28,Has Disease
3,89,89,No Disease


In [0]:
%sql

select age from HeartData;

age
63
37
41
56
57
57
56
44
52
57


In [0]:
%sql

select trestbps from HeartData;

trestbps
145
130
130
120
120
140
140
120
172
150


In [0]:
%sql

select chol from HeartData;

chol
233
250
204
236
354
192
294
263
199
168


In [0]:
%sql

select thalach from HeartData;

thalach
150
187
172
178
163
148
153
173
162
174


In [0]:
%sql

select oldpeak from HeartData;

oldpeak
2.3
3.5
1.4
0.8
0.6
0.4
1.3
0.0
0.5
1.6


In [0]:
%sql

select target from HeartData;

target
1
1
1
1
1
1
1
1
1
1


In [0]:
%sql

select age,count(age) as AgeCounter from HeartData group by age order by age;

age,AgeCounter
29,1
34,2
35,4
37,2
38,3
39,4
40,3
41,10
42,8
43,8


In [0]:
%sql

SELECT 
CASE
    WHEN age >=29 and age <40 THEN "Young Ages"
    WHEN age >=40 and age <55 THEN "Middle Ages"
    ELSE "Elderly Ages"
  END AS Age,
  count(Age)
from HeartData group by Age;

Age,count(Age)
Elderly Ages,8
Middle Ages,8
Young Ages,2
Elderly Ages,1
Middle Ages,11
Middle Ages,5
Middle Ages,13
Middle Ages,3
Elderly Ages,17
Middle Ages,16
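
The output above repeats the same label on several rows, which suggests that `group by Age` was resolved against the raw `age` column (one group per distinct age) rather than against the bucket label. Grouping on the derived label, for instance by repeating the `CASE` expression in the `GROUP BY` clause, gives one row per bucket. The plain-Scala sketch below illustrates the idea with the ten sample ages displayed earlier; `AgeBuckets`, `bucket` and `countByBucket` are hypothetical names, and the counts cover only those ten rows, not the full dataset.

```scala
object AgeBuckets {
  // Same thresholds as the CASE expression in the query above.
  def bucket(age: Int): String =
    if (age >= 29 && age < 40) "Young Ages"
    else if (age >= 40 && age < 55) "Middle Ages"
    else "Elderly Ages"

  // Group on the derived label, so each bucket appears exactly once.
  def countByBucket(ages: Seq[Int]): Map[String, Int] =
    ages.groupBy(bucket).map { case (label, xs) => label -> xs.size }

  def main(args: Array[String]): Unit = {
    // The ten sample ages displayed earlier in this notebook.
    val sample = Seq(63, 37, 41, 56, 57, 57, 56, 44, 52, 57)
    countByBucket(sample).toSeq.sortBy(_._1).foreach(println)
  }
}
```

For these ten ages the buckets hold 1 young, 3 middle and 6 elderly patients.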


In [0]:
%sql

SELECT 
CASE
    WHEN age >=29 and age <40 THEN "Young Ages"
    WHEN age >=40 and age <55 THEN "Middle Ages"
    ELSE "Elderly Ages"
  END AS Age,
CASE
    WHEN sex == 1 THEN "Male"
    ELSE "Female"
END AS sex
from HeartData;

Age,sex
Elderly Ages,Male
Young Ages,Male
Middle Ages,Female
Elderly Ages,Male
Elderly Ages,Female
Elderly Ages,Male
Elderly Ages,Female
Middle Ages,Male
Middle Ages,Male
Elderly Ages,Male


In [0]:
%sql

select trestbps, age from HeartData;

trestbps,age
145,63
130,37
130,41
120,56
120,57
140,57
140,56
120,44
172,52
150,57


In [0]:
%sql

select age, thalach from HeartData order by age asc; 

age,thalach
29,202
34,174
34,192
35,182
35,174
35,130
35,156
37,187
37,170
38,173


### Classification Model

A decision tree classifier that uses patient features to predict the class label (heart disease: yes or no).

In [0]:
%scala

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.VectorAssembler

### Prepare the Training Data
We'll use the **VectorAssembler** class to transform the feature columns into a vector, and then rename the **target** column to **label**.

### VectorAssembler()

* `VectorAssembler` is a transformer that combines a given list of columns into a single vector column.
* It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
* It accepts the following input column types: all numeric types, boolean types, and vector types.
* In each row, the values of the input columns are concatenated into a vector in the specified order.
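
To make the concatenation step concrete, here is a minimal plain-Scala sketch of what assembling one row amounts to. It is a conceptual illustration only, not Spark's implementation: `AssembleSketch` and `assemble` are hypothetical names, and the result is a standard Scala `Vector[Double]`, not a Spark ML vector.

```scala
object AssembleSketch {
  // Concatenate the named columns of one row into a single feature vector,
  // keeping the order given in inputCols. Throws if a column is missing.
  def assemble(row: Map[String, Double], inputCols: Seq[String]): Vector[Double] =
    inputCols.map(row).toVector

  def main(args: Array[String]): Unit = {
    // First row of the dataset as displayed earlier (subset of columns).
    val row = Map(
      "age" -> 63.0, "sex" -> 1.0, "cp" -> 3.0,
      "trestbps" -> 145.0, "chol" -> 233.0, "target" -> 1.0
    )
    // The target column is left out of the feature vector.
    println(assemble(row, Seq("age", "sex", "cp", "trestbps", "chol")))
    // prints Vector(63.0, 1.0, 3.0, 145.0, 233.0)
  }
}
```

Changing the order of `inputCols` changes the order of the values in the vector, which is why the same column list has to be used for training and testing data.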

### Split the Data
It is common practice when building machine learning models to split the source data, using some of it to train the model and reserving some to test the trained model. Here, 70% of the data is used for training and 30% for testing.

In [0]:
%scala

val splits = heartDF.randomSplit(Array(0.7, 0.3))
val train = splits(0)
val test = splits(1)
val train_rows = train.count()
val test_rows = test.count()
println(s"Training rows: $train_rows, testing rows: $test_rows")
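
Because the split is random, each run can produce different training and testing sets. Spark's `randomSplit` also has an overload that takes a seed as a second argument, which makes the split repeatable. The plain-Scala sketch below illustrates the effect of a fixed seed on a 70/30 split; it is not how Spark implements `randomSplit`, and `SplitSketch`, `split` and the seed value 42 are hypothetical choices.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle with a fixed seed, then cut at the requested fraction.
  def split[A](rows: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(rows)
    shuffled.splitAt(math.round(rows.size * trainFraction).toInt)
  }

  def main(args: Array[String]): Unit = {
    // 303 stand-in row ids (the sex counts shown earlier sum to 303 rows).
    val (train, test) = split(1 to 303, trainFraction = 0.7, seed = 42L)
    println(s"Training rows: ${train.size}, testing rows: ${test.size}")
  }
}
```

Calling `split` twice with the same seed returns the same two sets, so metrics computed later can be compared between runs.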

In [0]:
%scala

import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler().setInputCols(Array("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal")).setOutputCol("features")
// all the columns except target

val training = assembler.transform(train).select($"features", $"target".alias("label"))

training.show()

### Train a Classification Model (Decision tree classifier)
Next, train a classification model on the training data: create an instance of the algorithm and call its **fit** method on the training DataFrame. This project uses the *decision tree classifier* algorithm.

In [0]:
%scala

import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val dt = new DecisionTreeClassifier().setLabelCol("label").setFeaturesCol("features")

val model = dt.fit(training)

println("Model Trained.")
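
To check which factors stand out, the trained tree's feature importances can be paired with the input column names and sorted. The sketch below keeps the ranking logic in plain Scala so it runs on its own; `FeatureRanking` and `rank` are hypothetical names, and the scores in `main` are made-up placeholders, not results from this notebook. In the notebook, the inputs could come from `assembler.getInputCols` and `model.featureImportances.toArray`.

```scala
object FeatureRanking {
  // Pair each feature name with its importance and sort, highest first.
  def rank(names: Seq[String], importances: Seq[Double]): Seq[(String, Double)] =
    names.zip(importances).sortBy { case (_, score) => -score }

  def main(args: Array[String]): Unit = {
    // Placeholder scores for illustration only (not taken from the model).
    val names = Seq("age", "cp", "thal")
    val scores = Seq(0.1, 0.6, 0.3)
    rank(names, scores).foreach { case (name, score) => println(f"$name%-6s $score%.2f") }
    // In the notebook (assumes the `assembler` and `model` values defined above):
    //   rank(assembler.getInputCols.toSeq, model.featureImportances.toArray.toSeq)
  }
}
```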

### Prepare the Testing Data
Assemble the test features with the same assembler and rename the **target** column to **newLabel**.

In [0]:
%scala

val testing = assembler.transform(test).select($"features", $"target".alias("newLabel"))
testing.show()

### Test the Model
Time to make some predictions

In [0]:
%scala

val prediction = model.transform(testing)
val predicted = prediction.select("features", "prediction", "newLabel")
predicted.show(100)

The prediction column contains the predicted value for the label, and the newLabel column contains the actual known value from the testing data. Not every prediction matches the actual value; these mismatches are misclassifications.

### Classification Model Evaluation

Spark MLlib comes with a number of machine learning algorithms that can be used to learn from and make predictions on data. When these algorithms are applied to build machine learning models, the models need to be evaluated against criteria that depend on the application and its requirements. MLlib also provides a suite of metrics for this purpose; the cell below uses `MulticlassClassificationEvaluator` from the DataFrame-based `spark.ml` package.

In [0]:
%scala

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("newLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(prediction)



The accuracy depends on how the data is split. For this run, it is 78%.
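
Accuracy is the fraction of test rows whose prediction equals the known label. As a minimal plain-Scala sketch of that definition, assuming each row is reduced to a `(prediction, label)` pair; `AccuracySketch` is a hypothetical name and the pairs in `main` are illustrative, not rows from this dataset.

```scala
object AccuracySketch {
  // Each pair is (prediction, actual label).
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    if (pairs.isEmpty) Double.NaN
    else pairs.count { case (predicted, actual) => predicted == actual }.toDouble / pairs.size

  def main(args: Array[String]): Unit = {
    // Illustrative pairs only: three of the four predictions match the label.
    val pairs = Seq((1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 1.0))
    println(accuracy(pairs)) // prints 0.75
  }
}
```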

### Confusion Matrix Metrics

In [0]:
%scala

// Count each cell of the confusion matrix (1 = has disease, 0 = no disease)
val tp = predicted.filter("prediction == 1 AND newLabel == 1").count().toFloat
val fp = predicted.filter("prediction == 1 AND newLabel == 0").count().toFloat
val tn = predicted.filter("prediction == 0 AND newLabel == 0").count().toFloat
val fn = predicted.filter("prediction == 0 AND newLabel == 1").count().toFloat
val metrics = spark.createDataFrame(Seq(
 ("TP", tp),
 ("FP", fp),
 ("TN", tn),
 ("FN", fn),
 ("Precision", tp / (tp + fp)),
 ("Recall", tp / (tp + fn)))).toDF("metric", "value")
metrics.show()
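
Precision is TP / (TP + FP) and recall is TP / (TP + FN). If the model predicts no positives, TP + FP is zero and the floating-point division above yields NaN. The plain-Scala sketch below shows one way to guard against that and adds the F1 score, the harmonic mean of precision and recall; `BinaryMetrics` and its methods are hypothetical names, and the counts in `main` are illustrative, not results from this notebook.

```scala
object BinaryMetrics {
  // Returns None instead of NaN when the denominator is zero.
  def ratio(num: Long, den: Long): Option[Double] =
    if (den == 0) None else Some(num.toDouble / den)

  def precision(tp: Long, fp: Long): Option[Double] = ratio(tp, tp + fp)

  def recall(tp: Long, fn: Long): Option[Double] = ratio(tp, tp + fn)

  // F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
  def f1(tp: Long, fp: Long, fn: Long): Option[Double] =
    ratio(2 * tp, 2 * tp + fp + fn)

  def main(args: Array[String]): Unit = {
    // Illustrative counts only.
    val (tp, fp, fn) = (40L, 10L, 10L)
    println(s"Precision: ${precision(tp, fp)}")
    println(s"Recall:    ${recall(tp, fn)}")
    println(s"F1:        ${f1(tp, fp, fn)}")
  }
}
```

Returning an `Option` makes the undefined case explicit, so a caller has to decide how to report a model that never predicts the positive class.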

This is how we can predict whether a person has heart disease or not.