# DS 5559 Group 1: Secondary Cause of Death Model

This notebook fits multinomial logistic regression models to predict the secondary cause of death (`mcd_2_R`) listed in the Multiple Cause Mortality dataset.

## Setup

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
        .master("local[*]") \
        .appName("secondary_cause_final") \
        .getOrCreate()

In [2]:
df_train = spark.read.option("header",True).option("inferSchema",True).csv("train_data")
df_test = spark.read.option("header",True).option("inferSchema",True).csv("test_data")

In [3]:
df_train.count()

2200579

## Preprocessing

In [4]:
# drop records without a secondary cause of death listed

df_train = df_train.filter(df_train.mcd_2_R.isNotNull())
df_test = df_test.filter(df_test.mcd_2_R.isNotNull())

In [5]:
df_train.count()

1690328

In [6]:
# change age value coded as unknown (999) to null

from pyspark.sql.functions import when, lit, col, isnan, count

def replace(column, value):
    return when(column != value, column)

df_train = df_train.withColumn("age", replace(col("age"), 999))
df_test = df_test.withColumn("age", replace(col("age"), 999))
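
The `replace` helper works because `when` without an `otherwise` clause returns null for rows that do not match the condition. A plain-Python sketch of the same idea follows; `null_out` is a hypothetical helper and the ages are made up for illustration.

```python
# Plain-Python analogue of when(column != value, column):
# keep the value unless it equals the sentinel, otherwise return None.
def null_out(values, sentinel):
    return [v if v != sentinel else None for v in values]

ages = [94, 70, 999, 88]  # illustrative values, not taken from the dataset
cleaned = null_out(ages, 999)
print(cleaned)
```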

In [7]:
# keeping arbitrary variables to build simple model

vars_to_keep = ["mcd_2_R",
               "education",
               "age",
               "mcd_1_R",
               "race"]

df_train = df_train.select(vars_to_keep)
df_test = df_test.select(vars_to_keep)
#df_train.show(5)

In [8]:
# simplify ICD codes to their first letter (substr positions are 1-based in Spark)
df_train = df_train.withColumn("mcd_2_R", df_train["mcd_2_R"].substr(1,1))
df_test = df_test.withColumn("mcd_2_R", df_test["mcd_2_R"].substr(1,1))

In [9]:
# same simplification for the primary cause of death
df_train = df_train.withColumn("mcd_1_R", df_train["mcd_1_R"].substr(1,1))
df_test = df_test.withColumn("mcd_1_R", df_test["mcd_1_R"].substr(1,1))

In [10]:
df_train.show(2)

+-------+---------+---+-------+----+
|mcd_2_R|education|age|mcd_1_R|race|
+-------+---------+---+-------+----+
|      R|       14| 94|      F|   1|
|      F|       12| 70|      J|   1|
+-------+---------+---+-------+----+
only showing top 2 rows



In [11]:
# remove mcd_1_R == 'U' (only 2 observations)
df_train = df_train.filter(df_train.mcd_1_R != 'U')

## Exploratory Data Analysis

In [12]:
df_train.groupby("mcd_2_R").count().orderBy('count', ascending=False).show(25)

+-------+------+
|mcd_2_R| count|
+-------+------+
|      I|470805|
|      F|280709|
|      E|220250|
|      A|137436|
|      J|113066|
|      C|109459|
|      R| 69204|
|      T| 66595|
|      G| 53671|
|      S| 47935|
|      D| 46776|
|      K| 24097|
|      N| 20413|
|      B| 17183|
|      P|  5091|
|      M|  3350|
|      L|  1555|
|      Q|  1056|
|      H|   672|
|      O|   671|
|      Y|   175|
|      W|   153|
|      V|     2|
|      X|     2|
+-------+------+



In [13]:
df_train.groupby("mcd_1_R").count().orderBy('count', ascending=False).show(25)

+-------+------+
|mcd_1_R| count|
+-------+------+
|      I|532864|
|      C|297260|
|      J|182532|
|      G|104498|
|      X|102089|
|      E| 92562|
|      F| 83157|
|      K| 70352|
|      N| 45083|
|      W| 38347|
|      A| 37100|
|      V| 32804|
|      D| 16730|
|      B| 11719|
|      Y| 11194|
|      M| 10096|
|      Q|  6812|
|      R|  6657|
|      P|  4286|
|      L|  3732|
|      O|   348|
|      H|   104|
+-------+------+



In [14]:
df_test.groupby("mcd_1_R").count().orderBy('count', ascending=False).show(25)

+-------+------+
|mcd_1_R| count|
+-------+------+
|      I|133376|
|      C| 74758|
|      J| 46038|
|      G| 26091|
|      X| 25570|
|      E| 23056|
|      F| 20530|
|      K| 17441|
|      N| 11476|
|      W|  9472|
|      A|  9444|
|      V|  8271|
|      D|  4202|
|      B|  2907|
|      Y|  2777|
|      M|  2588|
|      Q|  1625|
|      R|  1624|
|      P|  1064|
|      L|   940|
|      O|    77|
|      H|    21|
+-------+------+



### One Hot Encoding

Code to apply StringIndexer to several columns in a PySpark DataFrame: https://stackoverflow.com/questions/36942233/apply-stringindexer-to-several-columns-in-a-pyspark-dataframe

In [15]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer

In [16]:
train_indexer = [StringIndexer(inputCol=column, outputCol=column+"_index", handleInvalid = "skip").fit(df_train) for column in list(set(df_train.columns)-set(['age']))]
# reuse the indexers fitted on the training data so that train and test share the same label-to-index mapping
test_indexer = train_indexer

In [17]:
train_pipeline = Pipeline(stages=train_indexer)
test_pipeline = Pipeline(stages=test_indexer)
df_train_indexed = train_pipeline.fit(df_train).transform(df_train)
df_test_indexed = test_pipeline.fit(df_test).transform(df_test)
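
By default, `StringIndexer` orders labels by descending frequency, so the most common label receives index 0.0. The sketch below imitates that rule in plain Python. `fit_label_index` is a hypothetical helper, the first set of counts is the four largest `mcd_2_R` classes reported in this notebook, the second set is made up, and tie-breaking between equally frequent labels is ignored.

```python
def fit_label_index(counts):
    """Map each label to a float index, most frequent label first."""
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {label: float(i) for i, label in enumerate(ordered)}

# four largest mcd_2_R classes in the training data
train_counts = {"I": 470805, "F": 280709, "E": 220250, "A": 137436}
train_map = fit_label_index(train_counts)
print(train_map)

# a dataset with a different class balance (made-up counts) gets a different mapping
other_counts = {"I": 100, "F": 300, "E": 200, "A": 50}
other_map = fit_label_index(other_counts)
print(other_map)
```

Because the mapping depends on the data the indexer was fitted on, indices are only comparable across datasets when the same fitted indexer is applied to all of them.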

In [18]:
# df_test_indexed.show()

In [19]:
# applying a similar method using OneHotEncoder

train_encoder = [OneHotEncoder(inputCol=column, outputCol=column+"_Vec") for column in df_train_indexed.columns[5:]]
test_encoder = [OneHotEncoder(inputCol=column, outputCol=column+"_Vec") for column in df_test_indexed.columns[5:]]

In [20]:
train_pipeline2 = Pipeline(stages=train_encoder)
test_pipeline2 = Pipeline(stages=test_encoder)
df_train_encoded = train_pipeline2.fit(df_train_indexed).transform(df_train_indexed)
df_test_encoded = test_pipeline2.fit(df_test_indexed).transform(df_test_indexed)
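
Spark prints the encoder output as sparse vectors in the form `(size,[indices],[values])`. With its default `dropLast=True` setting, `OneHotEncoder` drops the last category, so the 22 distinct `mcd_1_R` values produce vectors of size 21. The following plain-Python sketch reproduces that representation; `one_hot_sparse` is a hypothetical helper and assumes the index lies in the valid range.

```python
def one_hot_sparse(index, num_categories, drop_last=True):
    """Return (size, indices, values) for a one-hot encoded category index.

    Assumes 0 <= index < num_categories.
    """
    size = num_categories - 1 if drop_last else num_categories
    if index < size:
        return (size, [index], [1.0])
    # the dropped last category is encoded as an all-zero vector
    return (size, [], [])

print(one_hot_sparse(6, 22))    # a category in the middle of the range
print(one_hot_sparse(21, 22))   # the last category, dropped by default
```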

In [21]:
df_train_encoded.show(5)

+-------+---------+---+-------+----+-------------+-------------+---------------+----------+-----------------+-----------------+-------------------+--------------+
|mcd_2_R|education|age|mcd_1_R|race|mcd_1_R_index|mcd_2_R_index|education_index|race_index|mcd_1_R_index_Vec|mcd_2_R_index_Vec|education_index_Vec|race_index_Vec|
+-------+---------+---+-------+----+-------------+-------------+---------------+----------+-----------------+-----------------+-------------------+--------------+
|      R|       14| 94|      F|   1|          6.0|          6.0|            9.0|       0.0|   (21,[6],[1.0])|   (23,[6],[1.0])|     (16,[9],[1.0])|(13,[0],[1.0])|
|      F|       12| 70|      J|   1|          2.0|          1.0|            8.0|       0.0|   (21,[2],[1.0])|   (23,[1],[1.0])|     (16,[8],[1.0])|(13,[0],[1.0])|
|      I|       16| 88|      I|   1|          0.0|          0.0|           10.0|       0.0|   (21,[0],[1.0])|   (23,[0],[1.0])|    (16,[10],[1.0])|(13,[0],[1.0])|
|      C|       12| 77

In [22]:
df_test_encoded.show(5)

+-------+---------+---+-------+----+-------------+-------------+---------------+----------+-----------------+-----------------+-------------------+---------------+
|mcd_2_R|education|age|mcd_1_R|race|mcd_1_R_index|mcd_2_R_index|education_index|race_index|mcd_1_R_index_Vec|mcd_2_R_index_Vec|education_index_Vec| race_index_Vec|
+-------+---------+---+-------+----+-------------+-------------+---------------+----------+-----------------+-----------------+-------------------+---------------+
|      C|        1| 82|      C|   2|          1.0|          5.0|            3.0|       1.0|   (21,[1],[1.0])|   (23,[5],[1.0])|     (16,[3],[1.0])| (13,[1],[1.0])|
|      A|        1| 74|      E|   1|          5.0|          3.0|            3.0|       0.0|   (21,[5],[1.0])|   (23,[3],[1.0])|     (16,[3],[1.0])| (13,[0],[1.0])|
|      I|        2| 70|      I|  38|          0.0|          0.0|            2.0|      12.0|   (21,[0],[1.0])|   (23,[0],[1.0])|     (16,[2],[1.0])|(13,[12],[1.0])|
|      F|       

In [23]:
# create map of icd codes and corresponding index values

codemap = df_train_encoded.select("mcd_2_R", "mcd_2_R_index").distinct()

In [24]:
codemap.show(25)

+-------+-------------+
|mcd_2_R|mcd_2_R_index|
+-------+-------------+
|      W|         21.0|
|      X|         22.0|
|      N|         12.0|
|      C|          5.0|
|      A|          3.0|
|      B|         13.0|
|      O|         19.0|
|      T|          7.0|
|      F|          1.0|
|      D|         10.0|
|      E|          2.0|
|      Y|         20.0|
|      V|         23.0|
|      R|          6.0|
|      K|         11.0|
|      Q|         17.0|
|      P|         14.0|
|      J|          4.0|
|      I|          0.0|
|      S|          9.0|
|      H|         18.0|
|      L|         16.0|
|      M|         15.0|
|      G|          8.0|
+-------+-------------+



In [25]:
# test models with age, education not encoded

# vars_to_keep2 = ["mcd_2_R_index",
#                "education",
#                "age",
#                "mcd_1_R_index_Vec",
#                "race"]

# df_train = df_train_encoded.select(vars_to_keep2)
# df_test = df_test_encoded.select(vars_to_keep2)
# df_test.show(5)

In [26]:
# test models with age, education encoded

vars_to_keep2 = ["mcd_2_R_index",
               "education_index_Vec",
               "age",
               "mcd_1_R_index_Vec",
               "race_index_Vec"]

df_train = df_train_encoded.select(vars_to_keep2)
df_test = df_test_encoded.select(vars_to_keep2)
df_test.show(5)

+-------------+-------------------+---+-----------------+---------------+
|mcd_2_R_index|education_index_Vec|age|mcd_1_R_index_Vec| race_index_Vec|
+-------------+-------------------+---+-----------------+---------------+
|          5.0|     (16,[3],[1.0])| 82|   (21,[1],[1.0])| (13,[1],[1.0])|
|          3.0|     (16,[3],[1.0])| 74|   (21,[5],[1.0])| (13,[0],[1.0])|
|          0.0|     (16,[2],[1.0])| 70|   (21,[0],[1.0])|(13,[12],[1.0])|
|          1.0|     (16,[0],[1.0])| 89|   (21,[1],[1.0])| (13,[0],[1.0])|
|          2.0|     (16,[0],[1.0])| 67|   (21,[0],[1.0])| (13,[1],[1.0])|
+-------------+-------------------+---+-----------------+---------------+
only showing top 5 rows



In [27]:
# df_train_model1.select([count(when(col(c).isNull(), c)).alias(c) for c in df_train_model1.columns]).show()

In [28]:
# df_train.count()

In [29]:
from pyspark.ml.feature import VectorAssembler

In [30]:
assembler1 = VectorAssembler(inputCols = [cols for cols in df_train.columns[1:]], outputCol = "features")
assembler2 = VectorAssembler(inputCols = [cols for cols in df_test.columns[1:]], outputCol = "features")

In [31]:
df_train = assembler1.setHandleInvalid("skip").transform(df_train)
df_test = assembler2.setHandleInvalid("skip").transform(df_test)
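
`VectorAssembler` concatenates its input columns, in order, into a single feature vector. With the encoded columns used here (sizes 16, 1, 21 and 13) the result has 51 slots. The sketch below shows the offset arithmetic in plain Python; `assemble` is a hypothetical helper, not Spark's implementation, and the row values are illustrative.

```python
def assemble(parts):
    """Concatenate scalars and (size, indices, values) sparse vectors into one sparse vector."""
    offset, indices, values = 0, [], []
    for part in parts:
        if isinstance(part, tuple):
            size, idx, vals = part
            indices.extend(offset + i for i in idx)
            values.extend(vals)
            offset += size
        else:
            # a scalar column occupies exactly one slot; zeros are left implicit
            if part != 0:
                indices.append(offset)
                values.append(float(part))
            offset += 1
    return (offset, indices, values)

# education_index_Vec, age, mcd_1_R_index_Vec, race_index_Vec for one illustrative row
row = [(16, [3], [1.0]), 82, (21, [1], [1.0]), (13, [1], [1.0])]
features = assemble(row)
print(features)
```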

In [32]:
# df3 = df3.na.drop()

In [33]:
# df_test.select("features").show(truncate = False)

In [34]:
# df3 = df3.withColumn("mcd_2_R_index", df3["mcd_2_R_index"].cast('integer'))

In [35]:
cleanup_train = df_train.select("mcd_2_R_index", "features")
cleanup_test = df_test.select("mcd_2_R_index", "features")

### Additional Split for training and testing

In [36]:
train_data, test_data = cleanup_train.randomSplit(weights = [0.8, 0.2], seed = 314)

### Downsampling

In [37]:
train_data.groupby("mcd_2_R_index").count().sort('count').show(22)

+-------------+------+
|mcd_2_R_index| count|
+-------------+------+
|         22.0|     1|
|         23.0|     2|
|         21.0|   115|
|         20.0|   136|
|         18.0|   525|
|         19.0|   529|
|         17.0|   765|
|         16.0|  1229|
|         15.0|  2646|
|         14.0|  3583|
|         13.0| 13213|
|         12.0| 15882|
|         11.0| 18703|
|         10.0| 36549|
|          9.0| 37601|
|          8.0| 42044|
|          7.0| 51856|
|          6.0| 53925|
|          5.0| 86044|
|          4.0| 88341|
|          3.0|106920|
|          2.0|171791|
+-------------+------+
only showing top 22 rows



In [38]:
# drop the four rarest classes (indices 20-23), each with fewer than 150 training rows
train_data = train_data.filter(train_data.mcd_2_R_index < 20)

In [39]:
train_data.groupby("mcd_2_R_index").count().sort('count').show(25)

+-------------+------+
|mcd_2_R_index| count|
+-------------+------+
|         18.0|   525|
|         19.0|   529|
|         17.0|   765|
|         16.0|  1229|
|         15.0|  2646|
|         14.0|  3583|
|         13.0| 13213|
|         12.0| 15882|
|         11.0| 18703|
|         10.0| 36549|
|          9.0| 37601|
|          8.0| 42044|
|          7.0| 51856|
|          6.0| 53925|
|          5.0| 86044|
|          4.0| 88341|
|          3.0|106920|
|          2.0|171791|
|          1.0|218297|
|          0.0|366725|
+-------------+------+



In [40]:
# cleanup_test.groupby("mcd_2_R_index").count().sort('count').show(25)

In [41]:
# cleanup_test = cleanup_test.filter(cleanup_test.mcd_2_R_index != 23)
# cleanup_test = cleanup_test.filter(cleanup_test.mcd_2_R_index != 22)
# cleanup_test = cleanup_test.filter(cleanup_test.mcd_2_R_index != 21)
# cleanup_test = cleanup_test.filter(cleanup_test.mcd_2_R_index != 20)

In [42]:
# cleanup_test.groupby("mcd_2_R_index").count().sort('count').show(25)

In [43]:
# downsample every class to roughly the size of the smallest class

def downSample(df, target, seed):

    # gather the count of each class once, on the driver
    class_counts = df.groupby(target).count().collect()

    # size of the smallest class
    smallest_class_size = min(row['count'] for row in class_counts)

    # sample each class down to about smallest_class_size rows and union the pieces
    adjusted_df = None
    for row in class_counts:
        ratio = smallest_class_size / row['count']
        subset = df.filter(df[target] == row[target])

        # the smallest class (ratio 1.0) is kept whole; sample() is approximate,
        # so the other classes end up near, not exactly at, the target size
        if ratio < 1.0:
            subset = subset.sample(False, ratio, seed = seed)

        adjusted_df = subset if adjusted_df is None else adjusted_df.union(subset)

    return adjusted_df

In [44]:
adj_train = downSample(df = train_data, target = 'mcd_2_R_index', seed = 4)
adj_train.groupby("mcd_2_R_index").count().sort('count').show()

+-------------+-----+
|mcd_2_R_index|count|
+-------------+-----+
|          0.0|  504|
|         17.0|  514|
|          1.0|  519|
|          2.0|  522|
|         19.0|  523|
|         18.0|  525|
|         16.0|  528|
|          3.0|  532|
|         12.0|  539|
|         11.0|  552|
|         13.0|  555|
|          8.0|  556|
|         15.0|  559|
|         14.0|  565|
|          4.0|  566|
|         10.0|  573|
|          7.0|  575|
|          5.0|  578|
|          6.0|  580|
|          9.0|  581|
+-------------+-----+
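
The downsampled class sizes are close to, but not exactly, the size of the smallest class (525 rows). Each class is sampled with fraction `smallest / class_count`, and `sample` keeps or drops each row independently, so the resulting sizes fluctuate around the target. The sketch below illustrates this with three class counts from the filtered training split; Python's `random.Random` stands in for Spark's sampler, which is an assumption made for illustration only.

```python
import random

# three class counts from the filtered training split
class_counts = {0.0: 366725, 1.0: 218297, 18.0: 525}
smallest = min(class_counts.values())

# sampling fraction per class: smallest class size / class size
ratios = {label: smallest / n for label, n in class_counts.items()}

def bernoulli_sample_size(n, fraction, seed):
    """Count how many of n rows survive independent keep/drop draws."""
    rng = random.Random(seed)
    return sum(rng.random() < fraction for _ in range(n))

for label, n in class_counts.items():
    kept = bernoulli_sample_size(n, ratios[label], seed=4)
    print(label, round(ratios[label], 6), kept)
```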



In [45]:
# cache to improve performance
adj_train.cache()

DataFrame[mcd_2_R_index: double, features: vector]

### Model Building

In [46]:
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder, CrossValidatorModel
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression

In [47]:
lr = LogisticRegression(featuresCol = "features", labelCol = "mcd_2_R_index", maxIter=10, family = "multinomial")

In [48]:
lrModel = lr.fit(train_data)

In [49]:
# Model summary
trainingSummary = lrModel.summary

In [50]:
accuracy = trainingSummary.accuracy
falsePositiveRate = trainingSummary.weightedFalsePositiveRate
truePositiveRate = trainingSummary.weightedTruePositiveRate
fMeasure = trainingSummary.weightedFMeasure()
precision = trainingSummary.weightedPrecision
recall = trainingSummary.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy, falsePositiveRate, truePositiveRate, fMeasure, precision, recall))

Accuracy: 0.3384898509529536
FPR: 0.16721900382266383
TPR: 0.3384898509529536
F-measure: 0.25061332765857075
Precision: 0.266103250946215
Recall: 0.3384898509529536
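
Note that the weighted TPR and weighted recall equal the accuracy. This is expected: weighting each class's recall by its share of the data reduces to total correct predictions divided by total rows. A small sketch with a made-up three-class confusion matrix shows the identity.

```python
# rows = true class, columns = predicted class (made-up counts)
confusion = [
    [50, 5, 5],
    [10, 20, 10],
    [2, 3, 15],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / total

# per-class recall weighted by each class's share of the data
weighted_recall = sum(
    (sum(row) / total) * (row[i] / sum(row))
    for i, row in enumerate(confusion)
)

print(accuracy, weighted_recall)
```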


In [51]:
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

True positive rate by label:
label 0: 0.7638475697048197
label 1: 0.17281043715671768
label 2: 0.0007741965527879808
label 3: 0.2874392068836513
label 4: 2.2639544492364815e-05
label 5: 0.5552275579935847
label 6: 0.010199350950394067
label 7: 0.7796397716754089
label 8: 0.0
label 9: 0.0980559027685434
label 10: 0.0
label 11: 0.005614072608672405
label 12: 0.0
label 13: 0.09498221448573374
label 14: 0.8827797934691599
label 15: 0.0
label 16: 0.0
label 17: 0.0
label 18: 0.0
label 19: 0.32325141776937616


In [52]:
print("F-measure by label:")
for i, f in enumerate(trainingSummary.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

F-measure by label:
label 0: 0.5000691758611057
label 1: 0.22164837908899954
label 2: 0.0015386751197390036
label 3: 0.28369926935876194
label 4: 4.527396407510951e-05
label 5: 0.32790643403297326
label 6: 0.019841985641617665
label 7: 0.5533481608212147
label 8: 0.0
label 9: 0.1399294090857338
label 10: 0.0
label 11: 0.010781947938594239
label 12: 0.0
label 13: 0.11944986436967592
label 14: 0.6925013683634373
label 15: 0.0
label 16: 0.0
label 17: 0.0
label 18: 0.0
label 19: 0.42326732673267325


In [53]:
lrModel2 = lr.fit(adj_train)

In [54]:
trainingSummary2 = lrModel2.summary

In [55]:
accuracy2 = trainingSummary2.accuracy
falsePositiveRate2 = trainingSummary2.weightedFalsePositiveRate
truePositiveRate2 = trainingSummary2.weightedTruePositiveRate
fMeasure2 = trainingSummary2.weightedFMeasure()
precision2 = trainingSummary2.weightedPrecision
recall2 = trainingSummary2.weightedRecall
print("Accuracy: %s\nFPR: %s\nTPR: %s\nF-measure: %s\nPrecision: %s\nRecall: %s"
      % (accuracy2, falsePositiveRate2, truePositiveRate2, fMeasure2, precision2, recall2))

Accuracy: 0.2721542115841403
FPR: 0.03834224234066022
TPR: 0.2721542115841403
F-measure: 0.25482274832463336
Precision: 0.2737076565856912
Recall: 0.2721542115841403


In [56]:
print("True positive rate by label:")
for i, rate in enumerate(trainingSummary2.truePositiveRateByLabel):
    print("label %d: %s" % (i, rate))

True positive rate by label:
label 0: 0.23214285714285715
label 1: 0.046242774566473986
label 2: 0.10727969348659004
label 3: 0.16353383458646617
label 4: 0.19434628975265017
label 5: 0.49480968858131485
label 6: 0.08620689655172414
label 7: 0.49217391304347824
label 8: 0.23201438848920863
label 9: 0.40619621342512907
label 10: 0.05410122164048865
label 11: 0.29528985507246375
label 12: 0.011131725417439703
label 13: 0.22522522522522523
label 14: 0.647787610619469
label 15: 0.12701252236135957
label 16: 0.32007575757575757
label 17: 0.4785992217898833
label 18: 0.08761904761904762
label 19: 0.722753346080306


In [57]:
print("F-measure by label:")
for i, f in enumerate(trainingSummary2.fMeasureByLabel()):
    print("label %d: %s" % (i, f))

F-measure by label:
label 0: 0.16956521739130434
label 1: 0.06876790830945559
label 2: 0.1122244488977956
label 3: 0.18974918211559436
label 4: 0.18363939899833057
label 5: 0.2921348314606741
label 6: 0.1272264631043257
label 7: 0.4380804953560371
label 8: 0.17211474316210806
label 9: 0.38405207485760784
label 10: 0.08458390177353341
label 11: 0.2895204262877442
label 12: 0.02123893805309734
label 13: 0.24509803921568626
label 14: 0.7031700288184438
label 15: 0.16118047673098754
label 16: 0.22295514511873352
label 17: 0.45054945054945056
label 18: 0.11601513240857501
label 19: 0.6461538461538462


In [58]:
## Cross-validation model

# set up parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
    
# wrap into CrossValidator
crossval = CrossValidator(estimator=lr,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricName="accuracy"),
                          numFolds=2)

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(adj_train)
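
`CrossValidator` fits one model per combination of parameter map and fold, then refits the best parameter map on the full input. With two `regParam` values and `numFolds=2`, that is four fits plus one final refit. The sketch below simply enumerates those combinations in plain Python; it does not run Spark.

```python
from itertools import product

reg_params = [0.1, 0.01]   # grid values used for regParam
num_folds = 2

# one fit per (parameter value, held-out fold) pair
fits = list(product(reg_params, range(num_folds)))
for reg_param, fold in fits:
    print("fit with regParam=%s, validating on fold %d" % (reg_param, fold))

# plus one final refit of the best parameter value on the full training set
total_fits = len(fits) + 1
print(total_fits)
```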

In [59]:
# cvSummary2 = cvModel.summary

In [60]:
# print("True positive rate by label:")
# for i, rate in enumerate(cvSummary2.truePositiveRateByLabel):
#     print("label %d: %s" % (i, rate))

### Model Evaluation

In [61]:
evaluator = MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricName="accuracy")
evaluator2 = MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricName="f1")
evaluator3 = MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricName="weightedRecall")
# evaluator4 = MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricName="weightedFMeasure")
# evaluator4 = MulticlassClassificationEvaluator(labelCol="mcd_2_R_index", predictionCol="prediction", metricLabel = 14.0, metricName="recallByLabel")

In [62]:
model_1_preds = lrModel.transform(test_data)

In [63]:
accuracy1 = evaluator.evaluate(model_1_preds)
f1_score1 = evaluator2.evaluate(model_1_preds)
recall1 = evaluator3.evaluate(model_1_preds)
print("Accuracy: %s\nF-measure: %s\nRecall: %s" % (accuracy1, f1_score1, recall1))

Accuracy: 0.3368531534291703
F-measure: 0.24923548670131376
Recall: 0.3368531534291704


In [64]:
model_2_preds = lrModel2.transform(test_data)

In [65]:
accuracy2 = evaluator.evaluate(model_2_preds)
f1_score2 = evaluator2.evaluate(model_2_preds)
recall2 = evaluator3.evaluate(model_2_preds)
print("Accuracy: %s\nF-measure: %s\nRecall: %s" % (accuracy2, f1_score2, recall2))

Accuracy: 0.18518945449361002
F-measure: 0.19854521598260602
Recall: 0.18518945449361005


In [66]:
model_3_preds = cvModel.transform(test_data)

In [67]:
accuracy3 = evaluator.evaluate(model_3_preds)
f1_score3 = evaluator2.evaluate(model_3_preds)
recall3 = evaluator3.evaluate(model_3_preds)
print("Accuracy: %s\nF-measure: %s\nRecall: %s" % (accuracy3, f1_score3, recall3))

Accuracy: 0.1761497556551203
F-measure: 0.16818300839115424
Recall: 0.1761497556551203


In [68]:
model_1_preds.groupby("prediction").count().orderBy('count', ascending=False).show()

+----------+------+
|prediction| count|
+----------+------+
|       0.0|188093|
|       5.0| 51693|
|       1.0| 30613|
|       3.0| 27493|
|       7.0| 23724|
|       9.0|  3799|
|      13.0|  1971|
|      14.0|  1341|
|       6.0|   377|
|       2.0|   287|
|      11.0|   200|
|      19.0|    65|
|       4.0|     1|
+----------+------+



In [69]:
model_2_preds.groupby("prediction").count().orderBy('count', ascending=False).show()

+----------+-----+
|prediction|count|
+----------+-----+
|       5.0|48260|
|       8.0|43199|
|       0.0|39945|
|      16.0|28840|
|       4.0|26441|
|       2.0|23037|
|       9.0|14790|
|       3.0|14718|
|       7.0|14577|
|      11.0|13752|
|      13.0|12279|
|       1.0| 9042|
|      18.0| 8853|
|      15.0| 7813|
|      19.0| 6169|
|       6.0| 6050|
|      10.0| 5353|
|      17.0| 4550|
|      12.0| 1026|
|      14.0|  963|
+----------+-----+



In [70]:
model_3_preds.groupby("prediction").count().orderBy('count', ascending=False).show()

+----------+-----+
|prediction|count|
+----------+-----+
|       8.0|63480|
|       5.0|53990|
|       4.0|39189|
|       0.0|32285|
|       7.0|19173|
|      15.0|18854|
|       9.0|16478|
|       6.0|15637|
|      11.0|13730|
|       3.0|10899|
|       2.0|10128|
|      16.0| 8626|
|      13.0| 8297|
|      10.0| 7340|
|      17.0| 4983|
|      18.0| 4631|
|      14.0|  866|
|      12.0|  616|
|      19.0|  298|
|       1.0|  157|
+----------+-----+



### Top 5 predicted secondary causes of death

- 0.0 -> I -> circulatory system disease
- 5.0 -> C -> neoplasm (cancer)
- 1.0 -> F -> mental/behavioral disorder
- 3.0 -> A -> infectious/parasitic diseases
- 7.0 -> T -> injury/poisoning/external causes

### Top 5 predicted secondary causes of death after downsampling

- 5.0 -> C -> neoplasm (cancer)
- 8.0 -> G -> nervous system disease
- 0.0 -> I -> circulatory system disease
- 16.0 -> L -> skin/subcutaneous disease
- 4.0 -> J -> respiratory disease

### Top 5 predicted secondary causes of death after downsampling and cross-validation (k=2)

- 8.0 -> G -> nervous system disease
- 5.0 -> C -> neoplasm (cancer)
- 4.0 -> J -> respiratory disease
- 0.0 -> I -> circulatory system disease
- 7.0 -> T -> injury/poisoning/external causes
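
The lists above can be derived by ranking the prediction counts and translating each index back to its ICD letter through the code map. A plain-Python sketch follows; `top_codes` is a hypothetical helper, and the dictionaries hold the six largest prediction counts of the first model together with the matching code map entries.

```python
def top_codes(pred_counts, index_to_code, k=5):
    """Return the ICD chapter letters of the k most frequently predicted indices."""
    ranked = sorted(pred_counts, key=pred_counts.get, reverse=True)
    return [index_to_code[i] for i in ranked[:k]]

# prediction counts for the first model and the relevant code map entries
pred_counts = {0.0: 188093, 5.0: 51693, 1.0: 30613, 3.0: 27493, 7.0: 23724, 9.0: 3799}
index_to_code = {0.0: "I", 5.0: "C", 1.0: "F", 3.0: "A", 7.0: "T", 9.0: "S"}

top5 = top_codes(pred_counts, index_to_code)
print(top5)
```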

In [71]:
codemap.show(25)

+-------+-------------+
|mcd_2_R|mcd_2_R_index|
+-------+-------------+
|      W|         21.0|
|      X|         22.0|
|      N|         12.0|
|      C|          5.0|
|      A|          3.0|
|      B|         13.0|
|      O|         19.0|
|      T|          7.0|
|      F|          1.0|
|      D|         10.0|
|      E|          2.0|
|      Y|         20.0|
|      V|         23.0|
|      R|          6.0|
|      K|         11.0|
|      Q|         17.0|
|      P|         14.0|
|      J|          4.0|
|      I|          0.0|
|      S|          9.0|
|      H|         18.0|
|      L|         16.0|
|      M|         15.0|
|      G|          8.0|
+-------+-------------+



### Testing Holdout Data

In [72]:
test_pred = cvModel.transform(cleanup_test)

In [76]:
accuracy4 = evaluator.evaluate(test_pred)
f1_score4 = evaluator2.evaluate(test_pred)
recall4 = evaluator3.evaluate(test_pred)
print("Accuracy: %s\nF-measure: %s\nRecall: %s" % (accuracy4, f1_score4, recall4))

Accuracy: 0.17812172850994534
F-measure: 0.17032141701280623
Recall: 0.17812172850994534


In [74]:
# test_pred.show()