## Predict Diabetes 
The data is [here](/data/diabetes/) (from the UCI repository).

Sample Data:
```
+---+---+---+---+---+----+-----+---+-------+
|  a|  b|  c|  d|  e|   f|    g|  h|outcome|
+---+---+---+---+---+----+-----+---+-------+
|  6|148| 72| 35|  0|33.6|0.627| 50|      1|
|  1| 85| 66| 29|  0|26.6|0.351| 31|      0|
|  8|183| 64|  0|  0|23.3|0.672| 32|      1|
|  1| 89| 66| 23| 94|28.1|0.167| 21|      0|
|  0|137| 40| 35|168|43.1|2.288| 33|      1|
|  5|116| 74|  0|  0|25.6|0.201| 30|      0|
```

- attributes : 8
- target variable : 1
- number of observations : 768

Inputs:
- a :  Number of times pregnant
- b :  Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- c :  Diastolic blood pressure (mm Hg)
- d :  Triceps skin fold thickness (mm)
- e :  2-Hour serum insulin (mu U/ml)
- f :  Body mass index (weight in kg/(height in m)^2)
- g :  Diabetes pedigree function
- h :  Age (years)

Output:
- outcome : 0 or 1 (1 = tested positive for diabetes)


## Read Data

In [29]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
spark

Initializing Spark...
Spark found in :  /Users/sujee/spark
Spark config:
	 executor.memory=2g
	some_property=some_value
	spark.app.name=TestApp
	spark.master=local[*]
	spark.sql.warehouse.dir=/var/folders/lp/qm_skljd2hl4xtps5vw0tdgm0000gn/T/tmpvsry5p6_
	spark.submit.deployMode=client
	spark.ui.showConsoleProgress=true
Spark UI running on port 4041


In [30]:
## Reading data
data = spark.read.csv("/data/diabetes/pima-indians-diabetes-data.csv", header=True, inferSchema=True)
print("record count ", data.count())
data.printSchema()
data.show()

record count  768
root
 |-- a: integer (nullable = true)
 |-- b: integer (nullable = true)
 |-- c: integer (nullable = true)
 |-- d: integer (nullable = true)
 |-- e: integer (nullable = true)
 |-- f: double (nullable = true)
 |-- g: double (nullable = true)
 |-- h: integer (nullable = true)
 |-- outcome: integer (nullable = true)

+---+---+---+---+---+----+-----+---+-------+
|  a|  b|  c|  d|  e|   f|    g|  h|outcome|
+---+---+---+---+---+----+-----+---+-------+
|  6|148| 72| 35|  0|33.6|0.627| 50|      1|
|  1| 85| 66| 29|  0|26.6|0.351| 31|      0|
|  8|183| 64|  0|  0|23.3|0.672| 32|      1|
|  1| 89| 66| 23| 94|28.1|0.167| 21|      0|
|  0|137| 40| 35|168|43.1|2.288| 33|      1|
|  5|116| 74|  0|  0|25.6|0.201| 30|      0|
|  3| 78| 50| 32| 88|31.0|0.248| 26|      1|
| 10|115|  0|  0|  0|35.3|0.134| 29|      0|
|  2|197| 70| 45|543|30.5|0.158| 53|      1|
|  8|125| 96|  0|  0| 0.0|0.232| 54|      1|
|  4|110| 92|  0|  0|37.6|0.191| 30|      0|
| 10|168| 74|  0|  0|38.0|0.537| 34|

## Perform some data exploration
Slice and dice the data to see how it is distributed.  
For example, print the distribution of `outcome`.

In [31]:
data.describe().toPandas().T

| column | count | mean | stddev | min | max |
|---|---|---|---|---|---|
| a | 768 | 3.8450520833333335 | 3.36957806269887 | 0 | 17 |
| b | 768 | 120.89453125 | 31.97261819513622 | 0 | 199 |
| c | 768 | 69.10546875 | 19.355807170644777 | 0 | 122 |
| d | 768 | 20.536458333333332 | 15.952217567727642 | 0 | 99 |
| e | 768 | 79.79947916666667 | 115.24400235133803 | 0 | 846 |
| f | 768 | 31.992578124999977 | 7.884160320375441 | 0.0 | 67.1 |
| g | 768 | 0.4718763020833327 | 0.331328595012775 | 0.078 | 2.42 |
| h | 768 | 33.240885416666664 | 11.760231540678689 | 21 | 81 |
| outcome | 768 | 0.3489583333333333 | 0.476951377242799 | 0 | 1 |
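
The minimum of 0 for `b`, `c`, `d`, `e`, and `f` in the summary is worth a second look. A glucose level, blood pressure, skin fold thickness, insulin level, or BMI of zero is not physiologically plausible, so these zeros most likely stand for missing measurements. As a rough sketch, the plain-Python snippet below counts zeros per column over the first eleven rows printed by `data.show()` above. The `rows` list is copied by hand from that output, and `suspect_cols` is an illustrative name, not part of the notebook.

```python
# Feature columns a..h for the first eleven rows of the data.show() output.
columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33],
    [5, 116, 74, 0, 0, 25.6, 0.201, 30],
    [3, 78, 50, 32, 88, 31.0, 0.248, 26],
    [10, 115, 0, 0, 0, 35.3, 0.134, 29],
    [2, 197, 70, 45, 543, 30.5, 0.158, 53],
    [8, 125, 96, 0, 0, 0.0, 0.232, 54],
    [4, 110, 92, 0, 0, 37.6, 0.191, 30],
]

# Columns where a zero is suspicious (assumption: zero encodes "not measured").
suspect_cols = ["b", "c", "d", "e", "f"]

# Count how many of the sampled rows hold a zero in each suspect column.
zero_counts = {
    col: sum(1 for row in rows if row[columns.index(col)] == 0)
    for col in suspect_cols
}
print(zero_counts)
```

On the full DataFrame, the same check could be done one column at a time, for example with `data.filter(data['e'] == 0).count()`.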


In [32]:
## distribution of 'outcome'
data.groupBy('outcome').count().show()

+-------+-----+
|outcome|count|
+-------+-----+
|      1|  268|
|      0|  500|
+-------+-----+
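
The counts show a moderately imbalanced dataset. A quick calculation, using only the counts printed above, gives the class proportions and the accuracy of a trivial model that always predicts the majority class; a trained classifier should beat that baseline. The variable names are illustrative.

```python
# Class counts copied from the groupBy('outcome') output above.
counts = {0: 500, 1: 268}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of always predicting the most frequent class.
baseline_accuracy = max(counts.values()) / total

print("total", total)
print("positive rate", round(proportions[1], 4))
print("majority baseline", round(baseline_accuracy, 4))
```

The positive rate, 268/768, is about 0.349, which matches the mean of `outcome` in the summary table.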



In [33]:
from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=['a','b','c','d','e','f','g','h'], outputCol="features")
featureVector = assembler.transform(data)
featureVector = featureVector.withColumn("label", featureVector["outcome"])
featureVector.show()

+---+---+---+---+---+----+-----+---+-------+--------------------+-----+
|  a|  b|  c|  d|  e|   f|    g|  h|outcome|            features|label|
+---+---+---+---+---+----+-----+---+-------+--------------------+-----+
|  6|148| 72| 35|  0|33.6|0.627| 50|      1|[6.0,148.0,72.0,3...|    1|
|  1| 85| 66| 29|  0|26.6|0.351| 31|      0|[1.0,85.0,66.0,29...|    0|
|  8|183| 64|  0|  0|23.3|0.672| 32|      1|[8.0,183.0,64.0,0...|    1|
|  1| 89| 66| 23| 94|28.1|0.167| 21|      0|[1.0,89.0,66.0,23...|    0|
|  0|137| 40| 35|168|43.1|2.288| 33|      1|[0.0,137.0,40.0,3...|    1|
|  5|116| 74|  0|  0|25.6|0.201| 30|      0|[5.0,116.0,74.0,0...|    0|
|  3| 78| 50| 32| 88|31.0|0.248| 26|      1|[3.0,78.0,50.0,32...|    1|
| 10|115|  0|  0|  0|35.3|0.134| 29|      0|[10.0,115.0,0.0,0...|    0|
|  2|197| 70| 45|543|30.5|0.158| 53|      1|[2.0,197.0,70.0,4...|    1|
|  8|125| 96|  0|  0| 0.0|0.232| 54|      1|[8.0,125.0,96.0,0...|    1|
|  4|110| 92|  0|  0|37.6|0.191| 30|      0|[4.0,110.0,92.0,0...

In [34]:
# split into training and testing sets
(train, test) = featureVector.randomSplit([0.8,0.2], seed=1)
print("total data count ", featureVector.count())
print("training data count ", train.count())
print("testing data count ", test.count())

total data count  768
training data count  606
testing data count  162
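
Note that 606 of 768 rows is about 79%, not exactly 80%. `randomSplit` assigns each row to a split at random according to the weights, so the split sizes are only approximately proportional. The sketch below imitates that behaviour in plain Python with the standard `random` module. It does not reproduce Spark's sampler or the exact counts above, and `split_rows` is a hypothetical helper.

```python
import random

def split_rows(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)

    # Cumulative boundaries in [0, 1], e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)

    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point rounding at the top end.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = split_rows(range(768), [0.8, 0.2], seed=1)
print(len(train_rows), len(test_rows))
```

Every row lands in exactly one split, but the sizes vary from seed to seed around the 80/20 target.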


In [35]:
## Choose algorithm
from pyspark.ml.classification import LinearSVC
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import NaiveBayes

# SVM
algo = LinearSVC(maxIter=500, regParam=0.2)

# logistic
# algo = LogisticRegression(maxIter=500, regParam=0.1, elasticNetParam=0.8)

# Naive Bayes
# algo = NaiveBayes(smoothing=1.0)

In [36]:
%%time

## train the model
model = algo.fit(train)

CPU times: user 8.52 ms, sys: 3.34 ms, total: 11.9 ms
Wall time: 6.89 s


In [37]:
if hasattr(model, 'coefficients'):
    print("Coefficients: " + str(model.coefficients))

if hasattr(model, 'intercept'):
    print("Intercept: " + str(model.intercept))

Coefficients: [0.05797149057814717,0.017594326116102347,-0.005134748496412873,-0.00030425726906866817,-0.0004355871135062273,0.030156273276020124,0.4532573530091375,0.00787931143346063]
Intercept: -4.058328100780903
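
For a linear SVM, the prediction is the sign of a linear score: the dot product of the coefficients with the feature vector, plus the intercept. As a sketch, the snippet below applies the coefficients and intercept printed above to the first two rows of the sample data by hand. It assumes the default decision threshold of 0 and is only meant to show how the numbers combine; `margin` is an illustrative helper, and `model.transform` remains the way to score real data.

```python
# Coefficients and intercept copied from the output above (order: a..h).
coefficients = [
    0.05797149057814717, 0.017594326116102347, -0.005134748496412873,
    -0.00030425726906866817, -0.0004355871135062273, 0.030156273276020124,
    0.4532573530091375, 0.00787931143346063,
]
intercept = -4.058328100780903

def margin(features):
    """Linear score w . x + b for one feature vector."""
    return sum(w * x for w, x in zip(coefficients, features)) + intercept

# First two rows of the sample data, paired with their recorded outcome.
samples = [
    ([6, 148, 72, 35, 0, 33.6, 0.627, 50], 1),
    ([1, 85, 66, 29, 0, 26.6, 0.351, 31], 0),
]

for features, outcome in samples:
    score = margin(features)
    predicted = 1 if score > 0 else 0  # assumed threshold of 0
    print(f"score={score:+.3f} predicted={predicted} outcome={outcome}")
```

The largest coefficient belongs to `g` (diabetes pedigree function), but the features are on very different scales, so raw coefficient sizes are not directly comparable as a measure of importance.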


In [38]:
# predict
predictions = model.transform(test)
predictions.select(['features', 'label', 'prediction']).show()

+--------------------+-----+----------+
|            features|label|prediction|
+--------------------+-----+----------+
|[0.0,78.0,88.0,29...|    0|       0.0|
|[0.0,93.0,60.0,0....|    0|       0.0|
|[0.0,97.0,64.0,36...|    0|       0.0|
|(8,[1,5,6,7],[99....|    0|       0.0|
|[0.0,101.0,64.0,1...|    0|       0.0|
|[0.0,104.0,64.0,2...|    0|       0.0|
|[0.0,105.0,68.0,2...|    0|       0.0|
|[0.0,105.0,90.0,0...|    0|       0.0|
|[0.0,107.0,60.0,2...|    0|       0.0|
|[0.0,108.0,68.0,2...|    0|       0.0|
|[0.0,114.0,80.0,3...|    0|       0.0|
|[0.0,119.0,66.0,2...|    0|       0.0|
|[0.0,137.0,70.0,3...|    0|       0.0|
|[0.0,146.0,70.0,0...|    1|       0.0|
|[0.0,162.0,76.0,5...|    1|       1.0|
|[0.0,177.0,60.0,2...|    1|       1.0|
|[0.0,179.0,90.0,2...|    1|       1.0|
|[0.0,180.0,78.0,6...|    1|       1.0|
|[1.0,0.0,74.0,20....|    0|       0.0|
|[1.0,73.0,50.0,10...|    0|       0.0|
+--------------------+-----+----------+
only showing top 20 rows
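
One row in the output is printed as `(8,[1,5,6,7],[99....` rather than as a bracketed list. This is Spark's sparse vector notation: the vector size, the indices of the non-zero entries, and their values. `VectorAssembler` emits whichever of the dense or sparse form is more compact for each row. The sketch below expands such a triple into a dense list in plain Python. The values are made up for illustration, since the real ones are truncated in the output, and `to_dense` is a hypothetical helper.

```python
def to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a dense list."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

# Illustrative values only; the notebook output truncates the real ones.
dense = to_dense(8, [1, 5, 6, 7], [99.0, 25.0, 0.25, 22.0])
print(dense)
```

Positions 0, 2, 3, and 4 (columns `a`, `c`, `d`, `e`) are zero for that row, which is why the sparse form is shorter.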



In [39]:
## Evaluation
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction", metricName="areaUnderROC")
print("AUC ", evaluator.evaluate(predictions))  # area under the ROC curve

AUC  0.8832772166105496
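
AUC (area under the ROC curve) can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one; 0.5 is chance level and 1.0 is a perfect ranking. It is computed from the raw scores, not the 0/1 predictions, which is why the evaluator reads `rawPrediction`. A small plain-Python sketch of that pairwise definition follows. The scores and labels are invented toy data, and `pairwise_auc` is a hypothetical helper, not how Spark computes the metric internally.

```python
def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy data (not from the notebook): three positives, three negatives.
toy_scores = [0.9, 0.4, -0.2, -1.1, -0.4, 0.2]
toy_labels = [1, 1, 0, 0, 1, 0]

auc = pairwise_auc(toy_scores, toy_labels)
print("toy AUC", round(auc, 4))
```

In the toy data, one positive example scores below two of the negatives, so 7 of the 9 pairs are ranked correctly.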


In [40]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("accuracy",  accuracy)

accuracy 0.7777777777777778


In [41]:
# confusion matrix: rows are actual labels, columns are predicted labels
predictions.groupBy('label').pivot('prediction', [0,1]).count().na.fill(0).orderBy('label').show()

+-----+---+---+
|label|  0|  1|
+-----+---+---+
|    0| 93|  6|
|    1| 30| 33|
+-----+---+---+
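
The table is a confusion matrix: rows are actual labels and columns are predicted labels. Accuracy alone hides how many positive cases the model misses. The sketch below derives accuracy, precision, recall, and F1 from the four counts shown above; the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
tn, fp = 93, 6    # actual 0: predicted 0, predicted 1
fn, tp = 30, 33   # actual 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print("accuracy ", round(accuracy, 4))
print("precision", round(precision, 4))
print("recall   ", round(recall, 4))
print("f1       ", round(f1, 4))
```

The accuracy agrees with the value reported by the evaluator, while a recall of 33/63 means that nearly half of the diabetic cases in the test set were missed.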



In [42]:
## Training internals
if hasattr(model, 'summary'):
    print("total iterations ", model.summary.totalIterations)
    print()
    print("objective history : ", model.summary.objectiveHistory)