# Random Forest
A thorough explanation of random forests can be found [here](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm).

In [1]:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier as RF
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.evaluation import BinaryClassificationMetrics as metric
import pandas as pd
import subprocess
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import lit, udf


Create a small function to extract the probabilities from a DenseVector

In [2]:
def ith_(v, i):
    # Return element i of a vector as a float, or None if it cannot be read.
    try:
        return float(v[i])
    except (ValueError, IndexError, TypeError):
        return None


ith = udf(ith_, DoubleType())

## Get the Spark Context
Get the number of executors and nodes.

In [3]:
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

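# Each key of the executor memory status map is a "host:port" string.
executor_count = sc._jsc.sc().getExecutorMemoryStatus().keySet().size()
print("Executor:", executor_count)

executor_keys = sc._jsc.sc().getExecutorMemoryStatus().keys()
addresses = str(executor_keys).replace("Set(", "").replace(")", "").split(", ")

# Distinct host names give the node count.
nodes = {address.split(":")[0] for address in addresses}

print("Node Count:", len(nodes))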

Executor: 6
Node Count: 6
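
The node count above comes from parsing the string form of the executor keys. A minimal sketch of that parsing step in plain Python, assuming the keys print as `Set(host:port, ...)`; the function name `count_nodes` and the addresses below are made up for illustration:

```python
def count_nodes(executor_keys: str) -> int:
    """Count distinct hosts in a string such as 'Set(host1:port, host2:port)'."""
    addresses = executor_keys.replace("Set(", "").replace(")", "").split(", ")
    # Several executors can share a host, so deduplicate on the host part.
    hosts = {address.split(":")[0] for address in addresses}
    return len(hosts)


# Hypothetical addresses: three executors running on two hosts.
sample_keys = "Set(10.0.0.1:40001, 10.0.0.1:40002, 10.0.0.2:40001)"
node_count = count_nodes(sample_keys)
print("Node Count:", node_count)
```

Deduplicating on the host part is what separates the node count from the executor count when more than one executor runs on the same machine.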


# Import Data
* Load data
* Remove unneeded columns and records (fraud occurs only in CASH_OUT and TRANSFER transactions).
* Remap the transaction type to a 0/1 flag (`isTransfer`).

Note: This is NOT the right way to load data into Spark. Usually, the data lives in S3, HDFS, or another distributed file system. For the purposes of this example, it is easiest to do it this way.
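
For comparison, reading the file with Spark's own CSV reader avoids routing the data through pandas on the driver. A minimal sketch, assuming an active `SparkSession` is passed in; the helper name `load_transactions` is hypothetical and not part of this notebook:

```python
def load_transactions(spark, path):
    """Read a CSV file from a distributed store into a Spark DataFrame.

    `spark` is assumed to be an active SparkSession; `path` is a
    placeholder such as an s3a:// or hdfs:// URI.
    """
    # header=True uses the first row as column names;
    # inferSchema=True asks Spark to detect column types from the data.
    return spark.read.csv(path, header=True, inferSchema=True)
```

A call might look like `df = load_transactions(spark, "hdfs:///data/simulated_transactions.csv")`, where the path is only a placeholder.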

In [4]:
subprocess.call('cp ~/data-science/jupyter_home/simulated_transactions.csv.xz  /tmp/', shell=True)
subprocess.call('xz -d /tmp/simulated_transactions.csv.xz', shell=True)

df = spark.createDataFrame(pd.read_csv('/tmp/simulated_transactions.csv'))
print("Record Count:", df.count())

df.head(10)

Record Count: 1305514


[Row(step=2, type='PAYMENT', amount=18211.33, nameOrig='C1099717276', oldbalanceOrg=88.0, newbalanceOrig=0.0, nameDest='M417557780', oldbalanceDest=0.0, newbalanceDest=0.0, isFraud=0, isFlaggedFraud=0), Row(step=2, type='CASH_IN', amount=93240.07, nameOrig='C1350751778', oldbalanceOrg=47.0, newbalanceOrig=93287.07, nameDest='C665576141', oldbalanceDest=12.0, newbalanceDest=8650239.39, isFraud=0, isFlaggedFraud=0), Row(step=2, type='CASH_IN', amount=78314.86, nameOrig='C332699949', oldbalanceOrg=93287.07, newbalanceOrig=171601.93, nameDest='C1359044626', oldbalanceDest=178957.0, newbalanceDest=16435074.66, isFraud=0, isFlaggedFraud=0), Row(step=2, type='CASH_IN', amount=101282.39, nameOrig='C808417649', oldbalanceOrg=171601.93, newbalanceOrig=171601.93, nameDest='C1599771323', oldbalanceDest=171601.93, newbalanceDest=3771328.56, isFraud=0, isFlaggedFraud=0), Row(step=2, type='CASH_IN', amount=24227.29, nameOrig='C858204589', oldbalanceOrg=171601.93, newbalanceOrig=195829.22, nameDest='C

#### Description
From the above we can see that we read in a total of 1,305,514 records with 11 columns.
<br>The 11 columns are described below:

|Variable|Description|Keep|
| :------| :---------| :--|
|step|Maps a unit of time in the real world. In this case 1 step is 1 hour of time.|Drop|
|type|CASH_IN, CASH_OUT, DEBIT, PAYMENT and TRANSFER|Keep (TRANSFER and CASH_OUT)|
|amount|The amount of the transaction.|Keep|
|nameOrig|The customer ID for the initiator of the transaction.|Drop|
|oldbalanceOrg|The initial balance before the transaction.|Keep|
|newbalanceOrig|The customer's balance after the transaction.|Keep|
|nameDest|The customer ID for the recipient of the transaction.|Drop|
|oldbalanceDest|The initial recipient balance before the transaction.|Keep|
|newbalanceDest|The recipient's balance after the transaction.|Keep|
|isFraud|Identifies a fraudulent transaction (1) or a non-fraudulent transaction (0).|Keep|
|isFlaggedFraud|A rule-based flag for illegal attempts to transfer more than 200,000 in a single transaction.|Drop|

### Filtering
Keep only the TRANSFER and CASH_OUT transaction types.<br>
Remove the variables 'step', 'nameOrig', 'nameDest', and 'isFlaggedFraud'.

In [5]:
df.createOrReplaceTempView("table")

df = spark.sql('''
    SELECT isFraud,
    CAST(type = 'TRANSFER' AS INTEGER) isTransfer,
    amount,
    oldbalanceOrg, newbalanceOrig,
    oldbalanceDest, newbalanceDest
    FROM table
    WHERE type IN ('CASH_OUT', 'TRANSFER')
''')

print("Record Count:", df.count())

df.head(10)

Record Count: 586965


[Row(isFraud=0, isTransfer=0, amount=85351.19, oldbalanceOrg=32935.0, newbalanceOrig=0.0, oldbalanceDest=3.0, newbalanceDest=1030012.31), Row(isFraud=0, isTransfer=0, amount=158572.35, oldbalanceOrg=57.0, newbalanceOrig=0.0, oldbalanceDest=77.0, newbalanceDest=3259761.6), Row(isFraud=0, isTransfer=0, amount=34139.45, oldbalanceOrg=30.0, newbalanceOrig=0.0, oldbalanceDest=72.0, newbalanceDest=1660399.89), Row(isFraud=0, isTransfer=0, amount=185430.62, oldbalanceOrg=7.0, newbalanceOrig=0.0, oldbalanceDest=62.0, newbalanceDest=788131.95), Row(isFraud=0, isTransfer=0, amount=288980.51, oldbalanceOrg=78.0, newbalanceOrig=0.0, oldbalanceDest=71.0, newbalanceDest=3042241.45), Row(isFraud=0, isTransfer=0, amount=85855.0, oldbalanceOrg=10.0, newbalanceOrig=0.0, oldbalanceDest=90.0, newbalanceDest=2774338.5), Row(isFraud=0, isTransfer=0, amount=127327.5, oldbalanceOrg=0.0, newbalanceOrig=0.0, oldbalanceDest=289051.51, newbalanceDest=3042241.45), Row(isFraud=0, isTransfer=0, amount=13100.27, oldb

## Random Forest Classifier
Prepare the data and pipeline for the model.

### Training Set
Create a PySpark pipeline. Define the features and the dependent variable. <br>
Partition the data with an 80/20 training/testing split.

In [6]:
featureCols = ['isTransfer', 'amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']
assembler_features = VectorAssembler(inputCols=featureCols, outputCol='features')
labelIndexer = StringIndexer(inputCol='isFraud', outputCol="label")

dfX = [assembler_features, labelIndexer]
pipeline = Pipeline(stages=dfX)

In [7]:
allData = pipeline.fit(df).transform(df)

trainingData, testData = allData.randomSplit([0.8, 0.2], seed=0)
rf = RF(labelCol='label', featuresCol='features', numTrees=200)
fit = rf.fit(trainingData)
transformed = fit.transform(testData)
results = transformed.select(['probability', 'label'])

In [8]:
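results_collect = results.collect()
# The score is the probability of class 0 and the label is flipped to match,
# so these metrics treat the non-fraud class as the positive class.
results_list = [(float(i[0][0]), 1.0-float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)

metrics = metric(scoreAndLabels)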
validation = results.select(['label', (ith("probability", lit(1)) > 0.5).cast('integer').alias('prediction') ])

truth_table = validation.groupBy(['label', 'prediction']).count().orderBy(['label', 'prediction'])

tt = truth_table.toPandas()
tp = tt[((tt.label == 1) & (tt.prediction == 1))]['count'].values[0]
fp = tt[((tt.label == 0) & (tt.prediction == 1))]['count'].values[0]
fn = tt[((tt.label == 1) & (tt.prediction == 0))]['count'].values[0]
tn = tt[((tt.label == 0) & (tt.prediction == 0))]['count'].values[0]

tt

   label  prediction   count
0    0.0           0  115831
1    0.0           1       6
2    1.0           0     156
3    1.0           1    1512
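
The truth table groups the test rows by actual label and thresholded prediction. The same counting logic can be sketched in plain Python; the function name `truth_table` and the `(label, probability)` pairs below are made up for illustration and are not rows from the dataset:

```python
from collections import Counter


def truth_table(rows, threshold=0.5):
    """Count (label, prediction) pairs, predicting 1 when probability > threshold."""
    counts = Counter()
    for label, probability_of_fraud in rows:
        prediction = int(probability_of_fraud > threshold)
        counts[(int(label), prediction)] += 1
    return counts


# Hypothetical (label, probability of class 1) pairs.
sample_rows = [(0.0, 0.02), (0.0, 0.71), (1.0, 0.93), (1.0, 0.40), (1.0, 0.88)]
table = truth_table(sample_rows)

tp = table[(1, 1)]  # fraud predicted as fraud
fp = table[(0, 1)]  # non-fraud predicted as fraud
fn = table[(1, 0)]  # fraud predicted as non-fraud
tn = table[(0, 0)]  # non-fraud predicted as non-fraud
print(tp, fp, fn, tn)
```

With a `Counter`, a label/prediction combination that never occurs simply counts as 0.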

In [9]:
accuracy = (tp + tn)/(tp + tn + fp + fn)
precision = tp/(tp + fp)
recall = tp/(tp + fn)
f1 = (2.0 * precision*recall)/(precision+recall)

print("Out of sample accuracy =", accuracy)
print("Out of sample precision =", precision)
print("Out of sample recall =", recall)
print("Out of sample F1 =", f1)

Out of sample accuracy = 0.9986213352623292
Out of sample precision = 0.9960474308300395
Out of sample recall = 0.9064748201438849
Out of sample F1 = 0.9491525423728814


* Accuracy - Proportion of all predictions that are correct. $\frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$
* Precision - True positives over all predicted positive cases. $\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$
* Recall - True positives over all actual positive cases. $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
* F1 - The harmonic mean of precision and recall, balancing the two. $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
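
As a check, the formulas above can be applied directly to the counts in the truth table (1512 true positives, 6 false positives, 156 false negatives, 115831 true negatives). A small sketch; the function name `classification_metrics` is illustrative:

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2.0 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts taken from the truth table printed earlier in this notebook.
accuracy, precision, recall, f1 = classification_metrics(tp=1512, fp=6, fn=156, tn=115831)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Recall is the weakest of the four here: 156 of the 1668 fraudulent test transactions were missed.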