In the previous part, we undersampled the majority class (non-fraud cases) to achieve class balance. We trained and tested logistic regression on balanced training and test sets, and we were happy with the results.

In this part, we'll tackle a more realistic scenario: the test set is unbalanced, with far more non-fraud cases than fraud cases. So we need to be careful about how we construct the training set and how we train the model.

Here is what we'll do:
1. Construct a balanced training set as in the previous part, and apply the model trained on it to the unbalanced test set.
2. Use an unbalanced training set with the same level of imbalance as the test set, and apply the model trained on it to the unbalanced test set.
3. Combine (average) the models trained in **1** and **2**, and apply the averaged model to the test set.
4. Beyond logistic regression, is there a more sophisticated algorithm (e.g. boosting) that we can use?

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import GBTClassifier

from pyspark.ml.feature import StandardScaler
from pyspark.ml.feature import Normalizer
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

from pyspark.ml.linalg import DenseVector
from pyspark.sql import Row
from pyspark.sql.types import IntegerType, DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("creditcard.csv", header=True)

As in the previous parts, cast the columns to appropriate types:

In [2]:
colnames = [col.name for col in df.schema.fields]
for col in colnames[:-1]:
    df = df.withColumn(col, df[col].cast("float"))

df = df.withColumn("Class", df["Class"].cast("int"))        

Define functions for evaluating binary classification performance

In [3]:
def binary_classification_eval(df, predictionCol="prediction", labelCol="label"):
    acc = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1. if x[0] == x[1] else 0.).mean()
    
    precision = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1. if (x[0] == 1) and (x[1] == 1) else 0.).sum()/\
    df.select([predictionCol, labelCol]).rdd.map(lambda x: 1. if x[0] == 1 else 0.).sum()
    
    recall = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1. if (x[0] == 1) and (x[1] == 1) else 0.).sum()/\
    df.select([predictionCol, labelCol]).rdd.map(lambda x: 1. if x[1] == 1 else 0.).sum()
    
    return {"accuracy":acc, "precision": precision, "recall": recall}

def confusion_matrix(df, predictionCol="prediction", labelCol="label"):
    cm = np.zeros((2, 2))
    cm[0, 0] = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1 if x[0] == 1 and x[1] == 1 else 0).sum()
    cm[0, 1] = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1 if x[0] == 0 and x[1] == 1 else 0).sum()
    cm[1, 1] = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1 if x[0] == 0 and x[1] == 0 else 0).sum()
    cm[1, 0] = df.select([predictionCol, labelCol]).rdd.map(lambda x: 1 if x[0] == 1 and x[1] == 0 else 0).sum()
    cm = pd.DataFrame(cm)
    cm.index = ["T", "F"]
    cm.index.name = "Target"
    cm.columns = ["T", "F"]
    cm.columns.name = "Predicted"
    cm = cm.applymap(int)
    return cm

1. Split the entire data into **train** and **test** partitions with approximately the same positive-to-negative ratio (the split is random, not stratified, so the two ratios differ slightly).
2. Sample a **balanced** subset from **train**.

In [4]:
train, test = df.randomSplit([0.6, 0.4], seed=201)

train = train.withColumnRenamed("Class", "label")
train.persist()

test = test.withColumnRenamed("Class", "label")
test.persist()

pos = train.filter(train["label"] == 1)
neg = train.filter(train["label"] == 0)
ratio = pos.count() / float(neg.count())
neg = neg.sample(False, ratio, seed=123)
balanced = pos.union(neg)
balanced.persist()

DataFrame[Time: float, V1: float, V2: float, V3: float, V4: float, V5: float, V6: float, V7: float, V8: float, V9: float, V10: float, V11: float, V12: float, V13: float, V14: float, V15: float, V16: float, V17: float, V18: float, V19: float, V20: float, V21: float, V22: float, V23: float, V24: float, V25: float, V26: float, V27: float, V28: float, Amount: float, label: int]

Show the share of the minority class in **train**, **test** and **balanced** (the value labelled `Percentage` in the output is a fraction between 0 and 1):

In [5]:
print "Training partition (unbalanced)\nNum of fraud cases: %d, Total: %d, Percentage: %f\n" % (
    train.select("label").rdd.map(lambda x: int(x[0])).sum(),
    train.count(),
    train.select("label").rdd.map(lambda x: x[0]).mean())

print "Training partition (balanced)\nNum of fraud cases: %d, Total: %d, Percentage: %f\n" % (
    balanced.select("label").rdd.map(lambda x: int(x[0])).sum(),
    balanced.count(),
    balanced.select("label").rdd.map(lambda x: x[0]).mean())

print "Test partition\nNum of fraud cases: %d, Total: %d, Percentage: %f\n" % (
    test.select("label").rdd.map(lambda x: int(x[0])).sum(),
    test.count(),
    test.select("label").rdd.map(lambda x: x[0]).mean())

Training partition (unbalanced)
Num of fraud cases: 287, Total: 170683, Percentage: 0.001681

Training partition (balanced)
Num of fraud cases: 287, Total: 580, Percentage: 0.494828

Test partition
Num of fraud cases: 205, Total: 114124, Percentage: 0.001796
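
The balanced set has 580 rows (287 fraud, 293 non-fraud) rather than exactly 574 because `sample(False, ratio, seed)` keeps each negative row independently with probability `ratio`, so the number of negatives kept only approximates the number of positives. Below is a minimal pure-Python sketch of that Bernoulli sampling, using the counts printed above. The function name `bernoulli_undersample` and the seed are illustrative assumptions, and the result will not reproduce Spark's exact sample.

```python
import random

def bernoulli_undersample(n_neg, n_pos, seed=0):
    """Keep each of n_neg negatives independently with probability n_pos / n_neg."""
    rng = random.Random(seed)
    fraction = n_pos / float(n_neg)
    # Count how many negatives survive; the expected value is n_pos.
    return sum(1 for _ in range(n_neg) if rng.random() < fraction)

n_pos = 287              # fraud cases in train (from the output above)
n_neg = 170683 - n_pos   # non-fraud cases in train
kept = bernoulli_undersample(n_neg, n_pos)
print("negatives kept: %d (expected about %d)" % (kept, n_pos))
```

If an exactly balanced set were required, one option would be to sample slightly more negatives than needed and truncate, but for training purposes the small mismatch is harmless.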



Define a function that returns a Spark ML **Model** (the result of calling **fit** on an **Estimator**; here, a fitted **Pipeline**):

In [6]:
def trainUsing(data, suffix):
    assembler = VectorAssembler(inputCols=colnames[:-1],
                                outputCol="features_%s" % suffix)
    scaler = StandardScaler(inputCol="features_%s" % suffix, outputCol="scaledFeatures_%s" % suffix,
                            withStd=True, withMean=True)
    lr = LogisticRegression(maxIter=10, featuresCol="scaledFeatures_%s" % suffix,
                            predictionCol="prediction_%s" % suffix,
                            probabilityCol="probability_%s" % suffix,
                            rawPredictionCol="rawPrediction_%s" % suffix,
                            regParam=0.01, elasticNetParam=0.8)

    pipeline = Pipeline(stages=[assembler, scaler, lr])
    plModel = pipeline.fit(data)
    return plModel

Train logistic regression on balanced and unbalanced (i.e. **train**) data, and apply on the test set:

In [7]:
balancedModel = trainUsing(balanced, "balanced")
unbalancedModel = trainUsing(train, "unbalanced")

test = balancedModel.transform(test)
test = unbalancedModel.transform(test)
test

DataFrame[Time: float, V1: float, V2: float, V3: float, V4: float, V5: float, V6: float, V7: float, V8: float, V9: float, V10: float, V11: float, V12: float, V13: float, V14: float, V15: float, V16: float, V17: float, V18: float, V19: float, V20: float, V21: float, V22: float, V23: float, V24: float, V25: float, V26: float, V27: float, V28: float, Amount: float, label: int, features_balanced: vector, scaledFeatures_balanced: vector, rawPrediction_balanced: vector, probability_balanced: vector, prediction_balanced: double, features_unbalanced: vector, scaledFeatures_unbalanced: vector, rawPrediction_unbalanced: vector, probability_unbalanced: vector, prediction_unbalanced: double]

Performance of applying the **balancedModel** on the unbalanced test set:

In [8]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction_balanced", labelCol="label")
print "Area under PR:", evaluator.evaluate(test,
                                           {evaluator.labelCol: "label", evaluator.metricName: "areaUnderPR"})
print "Area under ROC:", evaluator.evaluate(test,
                                            {evaluator.labelCol: "label", evaluator.metricName: "areaUnderROC"})
print binary_classification_eval(test, predictionCol="prediction_balanced")
confusion_matrix(test, predictionCol="prediction_balanced")

Area under PR: 0.707464048935
Area under ROC: 0.982435187689
{'recall': 0.8975609756097561, 'precision': 0.12813370473537605, 'accuracy': 0.9888454663348645}


Predicted     T       F
Target
T           184      21
F          1252  112667


Performance of applying the **unbalancedModel** on the unbalanced test set:

In [9]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction_unbalanced", labelCol="label")
print "Area under PR:", evaluator.evaluate(test,
                                           {evaluator.labelCol: "label", evaluator.metricName: "areaUnderPR"})
print "Area under ROC:", evaluator.evaluate(test,
                                            {evaluator.labelCol: "label", evaluator.metricName: "areaUnderROC"})
print binary_classification_eval(test, predictionCol="prediction_unbalanced")
confusion_matrix(test, predictionCol="prediction_unbalanced")

Area under PR: 0.732983126256
Area under ROC: 0.950600844117
{'recall': 0.0975609756097561, 'precision': 0.9523809523809523, 'accuracy': 0.99837019382426}


Predicted   T       F
Target
T          20     185
F           1  113918


**balancedModel** achieved much better recall than **unbalancedModel** (0.90 vs. 0.10), while **unbalancedModel** achieved much better precision than **balancedModel** (0.95 vs. 0.13).
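
The precision, recall and accuracy figures above follow directly from the confusion matrices. The sketch below recomputes them in plain Python and adds a hypothetical baseline that labels every transaction as non-fraud, which is not part of the notebook but shows why accuracy alone says little on data this unbalanced. The helper name `metrics_from_counts` is illustrative.

```python
def metrics_from_counts(tp, fn, fp, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    total = float(tp + fn + fp + tn)
    # Precision and recall are undefined when their denominators are zero.
    precision = tp / float(tp + fp) if (tp + fp) > 0 else float("nan")
    recall = tp / float(tp + fn) if (tp + fn) > 0 else float("nan")
    accuracy = (tp + tn) / total
    return precision, recall, accuracy

# Counts copied from the two confusion matrices above.
m_balanced = metrics_from_counts(tp=184, fn=21, fp=1252, tn=112667)
m_unbalanced = metrics_from_counts(tp=20, fn=185, fp=1, tn=113918)
# Hypothetical baseline: predict non-fraud for all 114124 test rows.
m_all_negative = metrics_from_counts(tp=0, fn=205, fp=0, tn=113919)

for name, m in [("balancedModel", m_balanced),
                ("unbalancedModel", m_unbalanced),
                ("all-negative baseline", m_all_negative)]:
    print("%-22s precision=%.4f recall=%.4f accuracy=%.4f" % ((name,) + m))
```

The baseline catches no fraud at all, yet its accuracy is within a fraction of a percentage point of **unbalancedModel**, which is why the comparison here rests on precision and recall.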

This dataset is about detecting fraudulent credit card transactions, so misclassifying a fraud case as non-fraud (**False Negative**) should be much worse than misclassifying a non-fraud case as fraud (**False Positive**). In this sense, the **balancedModel** should be preferred over the **unbalancedModel**.

That said, we still want a model that achieves a better trade-off between precision and recall. Next, we'll simply try **averaging the predictions of balancedModel and unbalancedModel**.

The raw predictions of the two models:

In [10]:
test.select(["rawPrediction_balanced", "rawPrediction_unbalanced"]).show()

+----------------------+------------------------+
|rawPrediction_balanced|rawPrediction_unbalanced|
+----------------------+------------------------+
|  [1.81230462492876...|    [6.41437391569042...|
|  [2.04060271859517...|    [6.09943569050005...|
|  [2.57045867472748...|    [6.51967462963196...|
|  [2.35456939292213...|    [6.48793908689729...|
|  [1.09375508677486...|    [6.22885124543104...|
|  [1.06607125457434...|    [6.21531543670417...|
|  [2.46473647821187...|    [6.54057286342395...|
|  [1.24788587210638...|    [6.74475650339504...|
|  [2.09052369514980...|    [6.39325253222303...|
|  [2.03978568965274...|    [6.49363196521270...|
|  [2.10786595514564...|    [6.41143763658613...|
|  [1.85737776236148...|    [6.45719962582102...|
|  [1.97607746667212...|    [6.82237954177852...|
|  [2.00668814388170...|    [6.38470369209450...|
|  [3.44888347026610...|    [6.32755726452673...|
|  [1.58522714368824...|    [6.44621683407795...|
|  [3.00999300340899...|    [6.47138789825092...|


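Assemble the two raw-prediction vectors and the label into a single vector column. Despite its name, **rawPredictionAverage** only holds the concatenated values at this point; the average itself is computed in the next step: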

In [11]:
assembler = VectorAssembler(inputCols=["rawPrediction_balanced", "rawPrediction_unbalanced", "label"],
                                outputCol="rawPredictionAverage")

test = assembler.transform(test)
test.persist()

DataFrame[Time: float, V1: float, V2: float, V3: float, V4: float, V5: float, V6: float, V7: float, V8: float, V9: float, V10: float, V11: float, V12: float, V13: float, V14: float, V15: float, V16: float, V17: float, V18: float, V19: float, V20: float, V21: float, V22: float, V23: float, V24: float, V25: float, V26: float, V27: float, V28: float, Amount: float, label: int, features_balanced: vector, scaledFeatures_balanced: vector, rawPrediction_balanced: vector, probability_balanced: vector, prediction_balanced: double, features_unbalanced: vector, scaledFeatures_unbalanced: vector, rawPrediction_unbalanced: vector, probability_unbalanced: vector, prediction_unbalanced: double, rawPredictionAverage: vector]

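Define a function that computes the average of the model predictions:

In [12]:
def aveargeRawPrediction(row):
    # Entries 0, 2 and 4 of the assembled vector: the class-0 raw score of
    # balancedModel, the class-0 raw score of unbalancedModel, and the label.
    array = row["rawPredictionAverage"].toArray()[::2]
    label = float(array[-1])
    array = array[:-1]
    # Unweighted mean of the two class-0 raw scores (logits).
    rawPrediction = array.mean()
    # The sigmoid of the class-0 logit is the probability of class 0 (non-fraud).
    probability = 1. / (1 + np.exp(-rawPrediction))
    prediction = 0.0 if probability >= 0.5 else 1.0

    # Rebuild two-element vectors in the [class 0, class 1] layout.
    rawPrediction = DenseVector([rawPrediction, -rawPrediction])
    probability = DenseVector([probability, 1.-probability])
    
    return Row(rawPredictionAverage=rawPrediction,
               probabilityAverage=probability,
               predictionAverage=prediction,
               label=label)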

Layout of the assembled **rawPredictionAverage** vector:
* Entries 1 and 2: **balancedModel** raw prediction
* Entries 3 and 4: **unbalancedModel** raw prediction
* Entry 5: class label

In [13]:
test.select(["rawPredictionAverage"]).show(truncate=False)

+---------------------------------------------------------------------------------+
|rawPredictionAverage                                                             |
+---------------------------------------------------------------------------------+
|[1.812304624928764,-1.812304624928764,6.4143739156904225,-6.4143739156904225,0.0]|
|[2.040602718595173,-2.040602718595173,6.0994356905000595,-6.0994356905000595,0.0]|
|[2.5704586747274862,-2.5704586747274862,6.519674629631963,-6.519674629631963,0.0]|
|[2.3545693929221394,-2.3545693929221394,6.48793908689729,-6.48793908689729,0.0]  |
|[1.093755086774863,-1.093755086774863,6.228851245431046,-6.228851245431046,0.0]  |
|[1.0660712545743416,-1.0660712545743416,6.21531543670417,-6.21531543670417,0.0]  |
|[2.464736478211878,-2.464736478211878,6.540572863423953,-6.540572863423953,0.0]  |
|[1.2478858721063864,-1.2478858721063864,6.744756503395048,-6.744756503395048,0.0]|
|[2.090523695149805,-2.090523695149805,6.393252532223033,-6.393252532223033,

Construct the dataframe **pred** that holds all the relevant columns for evaluation

In [14]:
pred = spark.createDataFrame(test.rdd.map(lambda row: aveargeRawPrediction(row)))
pred.persist()

DataFrame[label: double, predictionAverage: double, probabilityAverage: vector, rawPredictionAverage: vector]

In [15]:
pred.show(truncate=False)

+-----+-----------------+-----------------------------------------+----------------------------------------+
|label|predictionAverage|probabilityAverage                       |rawPredictionAverage                    |
+-----+-----------------+-----------------------------------------+----------------------------------------+
|0.0  |0.0              |[0.983910044356488,0.01608995564351201]  |[4.113339270309593,-4.113339270309593]  |
|0.0  |0.0              |[0.9832096689527863,0.016790331047213725]|[4.0700192045476165,-4.0700192045476165]|
|0.0  |0.0              |[0.9894921231470449,0.010507876852955067]|[4.545066652179725,-4.545066652179725]  |
|0.0  |0.0              |[0.9881235963207355,0.011876403679264458]|[4.421254239909715,-4.421254239909715]  |
|0.0  |0.0              |[0.9749448904524635,0.025055109547536536]|[3.6613031661029547,-3.6613031661029547]|
|0.0  |0.0              |[0.974436488623058,0.025563511376942016] |[3.6406933456392556,-3.6406933456392556]|
|0.0  |0.0         
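
As a sanity check, the first row of **pred** can be reproduced by hand from the first row of raw predictions shown earlier. The sketch below is plain Python and assumes, as the notebook's function does, that the first entry of each raw-prediction vector is the class-0 (non-fraud) score. The function name `average_raw_prediction` is illustrative.

```python
import math

def average_raw_prediction(raw_balanced, raw_unbalanced):
    """Average two class-0 logits and convert the result to a class-0 probability."""
    raw = (raw_balanced + raw_unbalanced) / 2.0
    prob_class0 = 1.0 / (1.0 + math.exp(-raw))
    # Class 0 wins when its probability is at least 0.5.
    prediction = 0.0 if prob_class0 >= 0.5 else 1.0
    return raw, prob_class0, prediction

# Class-0 raw scores of the first test row, copied from the output above.
raw, prob, predicted = average_raw_prediction(1.812304624928764, 6.4143739156904225)
print("raw=%.6f prob=%.6f prediction=%.1f" % (raw, prob, predicted))
```

The averaged raw score and probability agree with the first row of the table (about 4.1133 and 0.9839).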

Evaluate the **averaged model**:

In [16]:
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPredictionAverage", labelCol="label")
print "Area under PR:", evaluator.evaluate(pred,
                                           {evaluator.labelCol: "label", evaluator.metricName: "areaUnderPR"})
print "Area under ROC:", evaluator.evaluate(pred,
                                            {evaluator.labelCol: "label", evaluator.metricName: "areaUnderROC"})
print binary_classification_eval(pred, predictionCol="predictionAverage")
confusion_matrix(pred, predictionCol="predictionAverage")

Area under PR: 0.74502021644
Area under ROC: 0.981824998036
{'recall': 0.6, 'precision': 0.8424657534246576, 'accuracy': 0.9990799481266001}


Predicted    T       F
Target
T          123      82
F           23  113896


Compared with the **unbalancedModel**, recall improved substantially (from 0.10 to 0.60) at the cost of a moderate drop in precision (from 0.95 to 0.84).

Note that we used a simple unweighted average of the raw predictions. We could also try non-uniform weights (e.g. $\lambda$ and $1 - \lambda$).
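
A weighted variant could look like the following plain-Python sketch. It assumes that $\lambda$ is applied to the **balancedModel** score and $1 - \lambda$ to the **unbalancedModel** score (the choice of which model gets $\lambda$ is arbitrary), and the function name `weighted_raw_prediction` is illustrative rather than part of the notebook.

```python
import math

def weighted_raw_prediction(raw_balanced, raw_unbalanced, lam):
    """Blend two class-0 logits with weights lam and 1 - lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    raw = lam * raw_balanced + (1.0 - lam) * raw_unbalanced
    prob_class0 = 1.0 / (1.0 + math.exp(-raw))
    prediction = 0.0 if prob_class0 >= 0.5 else 1.0
    return raw, prob_class0, prediction

# lam = 0.5 reproduces the unweighted average; larger lam leans on balancedModel.
for lam in (0.0, 0.5, 0.8, 1.0):
    raw, prob, predicted = weighted_raw_prediction(
        1.812304624928764, 6.4143739156904225, lam)
    print("lam=%.1f raw=%.4f prob_class0=%.4f prediction=%.1f"
          % (lam, raw, prob, predicted))
```

Since **balancedModel** has the higher recall, increasing `lam` should push the blend toward higher recall and lower precision. A suitable value would have to be chosen on held-out data rather than on the test set.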

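Finally, to address question 4, we train a gradient-boosted trees classifier (**GBTClassifier**) on the unbalanced **train** set and apply it to the test set:

In [17]: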
assembler = VectorAssembler(inputCols=colnames[:-1],
                            outputCol="featuresGBT")
scaler = StandardScaler(inputCol="featuresGBT", outputCol="scaledFeaturesGBT",
                        withStd=True, withMean=True)
gbtc = GBTClassifier(maxIter=10, featuresCol="scaledFeaturesGBT", maxDepth=5, labelCol="label", predictionCol="predictionGBT", seed=123)

pipeline = Pipeline(stages=[assembler, scaler, gbtc])

    
plModel = pipeline.fit(train)
test = plModel.transform(test)

In [18]:
print binary_classification_eval(test, predictionCol="predictionGBT")
confusion_matrix(test, predictionCol="predictionGBT")

{'recall': 0.751219512195122, 'precision': 0.927710843373494, 'accuracy': 0.9994479688759593}


Predicted    T       F
Target
T          154      51
F           12  113907


As we can see, gradient-boosted trees achieved the best balance of precision (0.93) and recall (0.75) among the models tried, although **balancedModel** still has the highest recall (0.90). Even so, 51 of the 205 fraud cases were misclassified. A grid search over hyperparameters might improve recall while maintaining similar precision.
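
One way to put the four models on a single scale is the F1 score, the harmonic mean of precision and recall. F1 is not used elsewhere in this notebook, so treat it as an illustrative assumption about how to summarize the trade-off. It weights both error types equally, which understates the cost of missed fraud. The sketch below computes it from the confusion matrices reported above.

```python
def f1_from_counts(tp, fn, fp):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

# (TP, FN, FP) copied from the confusion matrices in this notebook.
confusion_counts = {
    "balancedModel": (184, 21, 1252),
    "unbalancedModel": (20, 185, 1),
    "averaged model": (123, 82, 23),
    "GBT": (154, 51, 12),
}
f1_scores = {name: f1_from_counts(*counts)
             for name, counts in confusion_counts.items()}

# Print the models from best to worst F1.
for name in sorted(f1_scores, key=f1_scores.get, reverse=True):
    print("%-16s F1 = %.3f" % (name, f1_scores[name]))
```

By this measure GBT ranks first, followed by the averaged model, with the two individual logistic regression models far behind.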