## Credit Card Fraud Detection

**Data**: The dataset covers credit card transactions made by European cardholders in September 2013.
It contains 492 frauds out of 284,807 transactions that occurred over two days. The dataset is heavily skewed: the positive class (frauds) accounts for just 0.172 percent of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features not transformed with PCA are 'Time' and 'Amount'. 'Time' stores the seconds elapsed between each transaction and the first transaction in the dataset. 'Amount' is the transaction amount. 'Class' takes the value 1 for a fraud and 0 otherwise. The dataset is from [Kaggle: Credit Card Fraud Detection](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud).
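
As a quick sanity check on the imbalance, the figures quoted above can be turned into a fraud share and a legitimate-to-fraud ratio. This is plain arithmetic on the two published counts; the variable names are only illustrative.

```python
# Class imbalance implied by the published counts (492 frauds in 284,807 transactions).
total_transactions = 284_807
fraud_transactions = 492

# Share of frauds, in percent of all transactions
fraud_pct = 100 * fraud_transactions / total_transactions

# Number of legitimate transactions for every fraudulent one
legit_per_fraud = (total_transactions - fraud_transactions) / fraud_transactions

print(f"Fraud share: {fraud_pct:.4f}% of all transactions")
print(f"Legitimate transactions per fraud: {legit_per_fraud:.0f}")
```

In other words, there are roughly 578 legitimate transactions for every fraud, which is why plain accuracy is a poor guide for this problem.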
             
**Goal**: Maintain or improve classifier performance on large, highly imbalanced datasets using Big Data techniques.

**Approach**: This global approach first clusters the data with k-means, then applies SMOTE + ENN to each cluster, merges the resampled clusters with a union, and evaluates the result with a Random Forest classifier.

### Import required packages

In [0]:
# Import all the required libraries (duplicates removed, StringIndexer added for the pre-SMOTE function)
import random
from functools import reduce

import numpy as np
from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import BinaryClassificationEvaluator, ClusteringEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.feature import BucketedRandomProjectionLSH, StringIndexer, VectorAssembler, VectorSlicer
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import DataFrame, Row
from pyspark.sql import functions as F
from pyspark.sql.window import Window


### Load and understand the dataset

In [0]:
# Load the dataset which is in Comma-Separated Value (CSV) format.
df = spark.read.option("header",True).csv("/mnt/sjjhkkdf/creditcard.csv")

# Cache DataFrame so that we only read it from disk once
df.cache()

#Check the name, datatype and values of columns
df.take(1)

#### Data description
**Attribute information**:
```
The features V1, V2,... V28 are derived from PCA transformation.

'Time'- The seconds elapsed between each transaction and the first transaction in the dataset 

'Amount'- The transaction Amount 

'Class'- a value of 1 when there is a fraud and 0 when there isn't.

```
**The target variable is the class of the transaction i.e. fraud (1) or not (0).**

## Data preparation

Because the CSV is read without schema inference, every column is loaded as a string. Cast all features to float, except 'Time' and 'Class', which are cast to integer.

#### Casting

In [0]:
# Get the names of all columns except 'Time' and 'Class'
targets = [i for i in df.schema.names if i not in ["Time", "Class"]]
# `c` is used as the loop variable to avoid shadowing Spark's `col` function
for c in targets: 
  df = df.withColumn(c, df['`{}`'.format(c)].cast('float'))

# Cast 'Time' and 'Class' to integer
for c in ["Time", "Class"]:
  df = df.withColumn(c, df['`{}`'.format(c)].cast('integer'))

#Check the datatype of columns
df.printSchema()

#Visualize the data
# display(df)

#### Check whether the dataset is balanced or not
Total fraud transactions in the dataset

In [0]:
df.agg(F.sum("Class")).collect()[0][0]

### Split data into training and test sets

Split the dataframe df into 70% for training and 30% for test.

In [0]:

df2 = df
df2_train, df2_test = df2.randomSplit([0.7, 0.3], seed=1)
feature_cols = df2.columns
feature_cols.remove("Class") # Remove output feature 

# Assemble V1..V28 and Amount into a single vector column (Time is left out of this preview)
assembler = VectorAssembler(
  inputCols=[f"V{i}" for i in range(1, 29)] + ["Amount"],
  outputCol="features"
)

output = assembler.transform(df2_train)
print("Assembled predictor columns to vector column 'features'")
output.select("features", "Class").show(truncate=True)



# Functions

Functions used:
  - Pre-SMOTE
  - SMOTE
  - ENN

## SMOTE (Synthetic Minority Over-sampling Technique)  and pre-SMOTE function



The SMOTE function uses locality-sensitive hashing (LSH) with Bucketed Random Projection to find approximate nearest neighbours. The SMOTE used is a global approach.

The code was taken from: https://gist.github.com/hwang018/420e288021e9bdacd133076600a9ea8c
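
To give a feel for what the `bucketLength` parameter controls, here is a minimal pure-Python sketch of the hash used by bucketed random projection as described in the Spark documentation, h(x) = floor(x · v / bucketLength) for a random unit vector v. The helper names `brp_hash` and `random_unit_vector` and the toy points are assumptions made for illustration; this is not Spark's implementation.

```python
import math
import random

def brp_hash(x, v, bucket_length):
    """Bucket index of point x for one random unit vector v: floor((x . v) / bucket_length)."""
    dot = sum(xi * vi for xi, vi in zip(x, v))
    return math.floor(dot / bucket_length)

def random_unit_vector(dim, rng):
    """Draw a random direction and normalise it to length 1."""
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(r * r for r in raw))
    return [r / norm for r in raw]

rng = random.Random(1)
v = random_unit_vector(3, rng)

# Toy points (hypothetical): two close together, one far away
points = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5], [40.0, -30.0, 25.0]]

narrow = [brp_hash(p, v, bucket_length=1.0) for p in points]
wide = [brp_hash(p, v, bucket_length=1000.0) for p in points]

print("bucketLength=1:   ", narrow)
print("bucketLength=1000:", wide)
```

A wider bucket never splits points that a narrower one keeps together, so a large `bucketLength` makes more pairs candidates for the similarity join, at a higher computational cost.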

### Pre-SMOTE
- The pre-SMOTE function takes the lists of categorical and numerical features, string-indexes the categorical ones, and assembles the numerical ones into a `features` vector.

In [0]:
# Taken from: https://gist.github.com/hwang018/420e288021e9bdacd133076600a9ea8c
#for categorical columns, must take its stringIndexed form (smote should be after string indexing, default by frequency)

def pre_smote_df_process(df,num_cols,cat_cols,target_col,index_suffix="_index"):
    '''
    string indexer and vector assembler
    inputs:
    * df: spark df, original
    * num_cols: numerical cols to be assembled
    * cat_cols: categorical cols to be stringindexed
    * target_col: prediction target
    * index_suffix: will be the suffix after string indexing
    output:
    * vectorized: spark df, after stringindex and vector assemble, ready for smote
    '''
    if(df.select(target_col).distinct().count() != 2):
        raise ValueError("Target col must have exactly 2 classes")
        
    # exclude the target from the feature columns without mutating the caller's list
    num_cols = [c for c in num_cols if c != target_col]

    # only assembled numeric columns into features
    assembler = VectorAssembler(inputCols = num_cols, outputCol = 'features')
    # index the string cols, except possibly for the label col
    assemble_stages = [StringIndexer(inputCol=column, outputCol=column+index_suffix).fit(df) for column in list(set(cat_cols)-set([target_col]))]
    # add the stage of numerical vector assembler
    assemble_stages.append(assembler.setHandleInvalid("skip"))
    pipeline = Pipeline(stages=assemble_stages)
    pos_vectorized = pipeline.fit(df).transform(df)
    
    # drop original num cols and cat cols
    drop_cols = num_cols+cat_cols
    
    keep_cols = [a for a in pos_vectorized.columns if a not in drop_cols]
    
    vectorized = pos_vectorized.select(*keep_cols).withColumn('label',pos_vectorized[target_col]).drop(target_col)
    
    return vectorized

### SMOTE (Synthetic Minority Over-sampling Technique) 

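- The `smote` function finds, for each minority sample, its `k` approximate nearest minority neighbours using LSH with Bucketed Random Projection, then builds synthetic samples from a sample and one randomly chosen neighbour. The SMOTE used is a global approach.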
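
The interpolation step at the heart of SMOTE, as it is usually defined, can be sketched in a few lines of plain Python. The function name `smote_sample` and the toy points are assumptions for illustration only; the sketch ignores the LSH neighbour search and categorical columns.

```python
import random

def smote_sample(x, neighbours, rng):
    """Return one synthetic point on the segment between x and a randomly chosen neighbour."""
    neighbour = rng.choice(neighbours)
    gap = rng.uniform(0.0, 1.0)  # how far to move from x towards the neighbour
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]

rng = random.Random(1)
x = [1.0, 2.0]                          # a minority sample (toy data)
neighbours = [[2.0, 4.0], [0.0, 2.0]]   # its nearest minority neighbours (toy data)

synthetic = [smote_sample(x, neighbours, rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lies between the original sample and one of its neighbours, so the minority region is filled in rather than extended.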

In [0]:
# From https://gist.github.com/hwang018/420e288021e9bdacd133076600a9ea8c



  
 ############################## spark smote oversampling ##########################
def smote(vectorized_sdf,smote_config):
    '''
    contains logic to perform smote oversampling, given a spark df with 2 classes
    inputs:
    * vectorized_sdf: cat cols are already stringindexed, num cols are assembled into 'features' vector
      df target col should be 'label'
    * smote_config: config obj containing smote parameters
    output:
    * oversampled_df: spark df after smote oversampling
    '''
    dataInput_min = vectorized_sdf[vectorized_sdf['label'] == 1]
    dataInput_maj = vectorized_sdf[vectorized_sdf['label'] == 0]
    
    # LSH, bucketed random projection
    brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",seed=smote_config.seed, bucketLength=smote_config.bucketLength)
    # smote only applies on existing minority instances    
    model = brp.fit(dataInput_min)
    model.transform(dataInput_min)

    # here distance is calculated from brp's param inputCol
    self_join_w_distance = model.approxSimilarityJoin(dataInput_min, dataInput_min, float("inf"), distCol="EuclideanDistance")

    # remove self-comparison (distance 0)
    self_join_w_distance = self_join_w_distance.filter(self_join_w_distance.EuclideanDistance > 0)

    over_original_rows = Window.partitionBy("datasetA").orderBy("EuclideanDistance")

    self_similarity_df = self_join_w_distance.withColumn("r_num", F.row_number().over(over_original_rows))

    self_similarity_df_selected = self_similarity_df.filter(self_similarity_df.r_num <= smote_config.k)

    over_original_rows_no_order = Window.partitionBy('datasetA')

    # list to store batches of synthetic data
    res = []
    
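    # two udfs for vector subtract and add; the difference (neighbour - original) is scaled by a random factor in [0,1]
    # so that original + difference lies on the segment between the sample and its neighbour
    subtract_vector_udf = F.udf(lambda arr: random.uniform(0, 1)*(arr[1]-arr[0]), VectorUDT())
    add_vector_udf = F.udf(lambda arr: arr[0]+arr[1], VectorUDT())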
    
    # retain original columns
    original_cols = dataInput_min.columns
    
    for i in range(smote_config.multiplier):
        print("generating batch %s of synthetic instances"%i)
        # logic to randomly select neighbour: pick the largest random number generated row as the neighbour
        df_random_sel = self_similarity_df_selected.withColumn("rand", F.rand()).withColumn('max_rand', F.max('rand').over(over_original_rows_no_order))\
                            .where(F.col('rand') == F.col('max_rand')).drop(*['max_rand','rand','r_num'])
        # create synthetic feature numerical part
        df_vec_diff = df_random_sel.select('*', subtract_vector_udf(F.array('datasetA.features', 'datasetB.features')).alias('vec_diff'))
        df_vec_modified = df_vec_diff.select('*', add_vector_udf(F.array('datasetA.features', 'vec_diff')).alias('features'))
        
        # for categorical cols, either pick original or the neighbour's cat values
        for c in original_cols:
            # randomly select neighbour or original data
            col_sub = random.choice(['datasetA','datasetB'])
            val = "{0}.{1}".format(col_sub,c)
            if c != 'features':
                # do not unpack original numerical features
                df_vec_modified = df_vec_modified.withColumn(c,F.col(val))
        
        # this df_vec_modified is the synthetic minority instances,
        df_vec_modified = df_vec_modified.drop(*['datasetA','datasetB','vec_diff','EuclideanDistance'])
        
        res.append(df_vec_modified)
    
    dfunion = reduce(DataFrame.unionAll, res)
    # union synthetic instances with original full (both minority and majority) df
    oversampled_df = dfunion.union(vectorized_sdf.select(dfunion.columns))
    
    return oversampled_df


## ENN
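- The ENN (Edited Nearest Neighbours) function uses LSH with Bucketed Random Projection and keeps a sample only when its nearest neighbours all share its label. The ENN used is a global approach. This ENN code is adapted from the SMOTE code above.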
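
The editing rule itself can be illustrated without Spark. The sketch below uses exact brute-force neighbours on toy one-dimensional data; `enn_filter` and the data are hypothetical and only meant to show which samples the rule discards.

```python
import math

def enn_filter(points, labels, k=3):
    """Return the indices of samples whose k nearest neighbours all share their label."""
    kept = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: math.dist(p, points[j]))  # nearest first
        if all(labels[j] == labels[i] for j in others[:k]):
            kept.append(i)
    return kept

# Toy data: a class-0 group, one stray class-1 point next to it, and a class-1 group
points = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0], [5.4], [20.0], [21.0], [22.0], [23.0]]
labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

kept = enn_filter(points, labels, k=3)
print(kept)
```

Note that the stray class-1 point is removed, and so are the class-0 points that have it among their three nearest neighbours, which is how ENN cleans up the boundary between classes.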

In [0]:
#ENN
df_enn=[]
def enn(vectorized_sdf,smote_config):
    dataInput_min = vectorized_sdf[vectorized_sdf['label'] == 1]
    dataInput_maj = vectorized_sdf[vectorized_sdf['label'] == 0]
    
    # LSH, bucketed random projection
    brp = BucketedRandomProjectionLSH(inputCol="features",outputCol="hashes",seed=smote_config.seed,bucketLength=smote_config.bucketLength)
    # ENN compares every instance with its neighbours, so the model is fitted on the full DataFrame
    model = brp.fit(vectorized_sdf)
    model.transform(vectorized_sdf)

    # here distance is calculated from brp's param inputCol
    self_join_w_distance = model.approxSimilarityJoin(vectorized_sdf, vectorized_sdf, float("inf"), distCol="EuclideanDistance")

    # remove self-comparison (distance 0)
    self_join_w_distance = self_join_w_distance.filter(self_join_w_distance.EuclideanDistance > 0)

    over_original_rows = Window.partitionBy("datasetA").orderBy("EuclideanDistance")

    self_similarity_df = self_join_w_distance.withColumn("r_num", F.row_number().over(over_original_rows))
    
    # finding the k nearest neighbours
    self_similarity_df_selected = self_similarity_df.filter(self_similarity_df.r_num <= smote_config.k)
    
    self_similarity_df_selected1=self_similarity_df_selected.withColumn("label",F.col("datasetA.label"))
    
    self_similarity_df_selected1=self_similarity_df_selected1.withColumn("labelB",F.col("datasetB.label"))
    
    # flag = 1 when the neighbour's class differs from the sample's class
    self_similarity_df_selected1=self_similarity_df_selected1.withColumn('flag', F.when((F.col("label") == F.col("labelB")), 0).otherwise(1))

    # count the mismatching neighbours of each sample
    win = Window.partitionBy("datasetA")
    a = self_similarity_df_selected1.withColumn("mismatches", F.sum("flag").over(win))

    # keep one row per sample (the row of its k-th neighbour), and only when all k neighbours share its class
    a=a.filter(F.col("r_num") == smote_config.k)
    a=a.filter(F.col("mismatches") == 0)

    a=a.withColumn("features",F.col("datasetA.features"))
    a=a.drop(*['datasetA','datasetB','r_num','EuclideanDistance','labelB','flag','mismatches'])
    return a


# Pre-Processing

### Vectorising the features using the pre smote function

In [0]:

a = list(df2_train.columns)

df2_train = pre_smote_df_process(df2_train, a, cat_cols=[], target_col = "Class" )
df2_train.cache()

df2_test = pre_smote_df_process(df2_test, a, cat_cols=[], target_col = "Class" )
df2_test.cache()


display(df2_train)


features,label
"Map(vectorType -> dense, length -> 30, values -> List(0.0, -1.3598071336746216, -0.07278117537498474, 2.536346673965454, 1.37815523147583, -0.3383207619190216, 0.4623877704143524, 0.23959855735301971, 0.09869790077209473, 0.3637869656085968, 0.0907941684126854, -0.5515995621681213, -0.6178008317947388, -0.9913898706436157, -0.3111693561077118, 1.4681769609451294, -0.47040051221847534, 0.2079712450504303, 0.025790579617023468, 0.4039929509162903, 0.2514120936393738, -0.018306778743863106, 0.2778375744819641, -0.11047390848398209, 0.06692807376384735, 0.12853935360908508, -0.18911483883857727, 0.13355837762355804, -0.021053053438663483, 149.6199951171875))",0
"Map(vectorType -> dense, length -> 30, values -> List(0.0, 1.191857099533081, 0.26615071296691895, 0.16648010909557343, 0.448154091835022, 0.060017649084329605, -0.08236081153154373, -0.07880298048257828, 0.0851016566157341, -0.2554251253604889, -0.16697441041469574, 1.6127266883850098, 1.0652352571487427, 0.489095002412796, -0.14377228915691376, 0.6355580687522888, 0.4639170467853546, -0.11480466276407242, -0.18336127698421478, -0.14578303694725037, -0.06908313184976578, -0.22577524185180664, -0.6386719346046448, 0.10128802061080933, -0.3398464620113373, 0.16717040538787842, 0.12589453160762787, -0.008983098901808262, 0.014724168926477432, 2.690000057220459))",0
"Map(vectorType -> dense, length -> 30, values -> List(1.0, -1.358354091644287, -1.3401631116867065, 1.7732093334197998, 0.3797796070575714, -0.5031981468200684, 1.800499439239502, 0.7914609313011169, 0.2476757913827896, -1.514654278755188, 0.20764286816120148, 0.6245014667510986, 0.06608368456363678, 0.7172927260398865, -0.16594591736793518, 2.34586501121521, -2.8900833129882812, 1.1099693775177002, -0.12135931104421616, -2.261857032775879, 0.5249797105789185, 0.24799814820289612, 0.7716794013977051, 0.9094122648239136, -0.6892809271812439, -0.3276418447494507, -0.13909657299518585, -0.055352795869112015, -0.05975184217095375, 378.6600036621094))",0
"Map(vectorType -> dense, length -> 30, values -> List(1.0, -0.966271698474884, -0.18522600829601288, 1.7929933071136475, -0.8632912635803223, -0.010308879427611828, 1.2472031116485596, 0.23760893940925598, 0.3774358630180359, -1.3870240449905396, -0.05495192110538483, -0.22648726403713226, 0.1782282292842865, 0.5077568888664246, -0.28792375326156616, -0.6314181089401245, -1.0596472024917603, -0.6840927600860596, 1.9657750129699707, -1.2326220273971558, -0.20803777873516083, -0.10830045491456985, 0.005273596849292517, -0.19032052159309387, -1.1755753755569458, 0.6473760604858398, -0.22192884981632233, 0.06272284686565399, 0.06145763024687767, 123.5))",0
"Map(vectorType -> dense, length -> 30, values -> List(4.0, 1.2296576499938965, 0.14100350439548492, 0.04537077248096466, 1.2026127576828003, 0.1918809860944748, 0.27270811796188354, -0.0051590027287602425, 0.08121293783187866, 0.4649600088596344, -0.0992543175816536, -1.4169071912765503, -0.1538258194923401, -0.7510626912117004, 0.1673719584941864, 0.05014359578490257, -0.4435867965221405, 0.0028205125126987696, -0.6119873523712158, -0.04557504504919052, -0.21963255107402802, -0.16771626472473145, -0.27070972323417664, -0.15410378575325012, -0.7800554037094116, 0.7501369118690491, -0.2572368383407593, 0.03450743108987808, 0.005167768802493811, 4.989999771118164))",0
"Map(vectorType -> dense, length -> 30, values -> List(7.0, -0.8942860960960388, 0.28615719079971313, -0.11319221556186676, -0.27152612805366516, 2.6695985794067383, 3.721817970275879, 0.37014514207839966, 0.8510844707489014, -0.39204758405685425, -0.4104304313659668, -0.7051165699958801, -0.11045226454734802, -0.2862536311149597, 0.0743553638458252, -0.3287830650806427, -0.21007727086544037, -0.4997679591178894, 0.11876486241817474, 0.5703281760215759, 0.052735667675733566, -0.07342509925365448, -0.26809161901474, -0.20423266291618347, 1.0115917921066284, 0.37320467829704285, -0.38415729999542236, 0.011747356504201889, 0.14240433275699615, 93.19999694824219))",0
"Map(vectorType -> dense, length -> 30, values -> List(7.0, -0.6442694664001465, 1.4179635047912598, 1.0743803977966309, -0.49219900369644165, 0.9489340782165527, 0.4281184673309326, 1.1206313371658325, -3.807864189147949, 0.615374743938446, 1.2493761777877808, -0.6194677948951721, 0.2914743423461914, 1.7579642534255981, -1.3238651752471924, 0.6861324906349182, -0.07612700015306473, -1.2221273183822632, -0.35822156071662903, 0.3245047330856323, -0.15674185752868652, 1.9434653520584106, -1.0154547691345215, 0.05750352889299393, -0.6497089862823486, -0.41526657342910767, -0.0516342967748642, -1.206921100616455, -1.0853391885757446, 40.79999923706055))",0
"Map(vectorType -> dense, length -> 30, values -> List(9.0, -0.33826175332069397, 1.1195933818817139, 1.0443665981292725, -0.22218728065490723, 0.4993607997894287, -0.24676109850406647, 0.651583194732666, 0.06953858584165573, -0.7367272973060608, -0.36684563755989075, 1.017614483833313, 0.8363895416259766, 1.0068435668945312, -0.4435228109359741, 0.15021909773349762, 0.7394527792930603, -0.5409799218177795, 0.476677268743515, 0.4517729580402374, 0.20371145009994507, -0.24691393971443176, -0.6337526440620422, -0.12079408764839172, -0.3850499391555786, -0.06973304599523544, 0.09419883042573929, 0.24621930718421936, 0.08307565003633499, 3.680000066757202))",0
"Map(vectorType -> dense, length -> 30, values -> List(10.0, 0.38497820496559143, 0.6161094307899475, -0.8742997050285339, -0.09401862323284149, 2.92458438873291, 3.3170270919799805, 0.4704546630382538, 0.5382472276687622, -0.5588946342468262, 0.3097553849220276, -0.2591155767440796, -0.3261432349681854, -0.090046726167202, 0.36283236742019653, 0.9289036393165588, -0.1294868141412735, -0.8099789023399353, 0.3599853813648224, 0.7076638340950012, 0.1259915828704834, 0.04992368444800377, 0.2384215146303177, 0.009129868820309639, 0.9967101812362671, -0.7673148512840271, -0.4922083020210266, 0.042472440749406815, -0.05433738976716995, 9.989999771118164))",0
"Map(vectorType -> dense, length -> 30, values -> List(10.0, 1.249998688697815, -1.2216367721557617, 0.38393014669418335, -1.2348986864089966, -1.485419511795044, -0.7532301545143127, -0.6894049644470215, -0.2274872213602066, -2.094010591506958, 1.3237292766571045, 0.2276662290096283, -0.24268199503421783, 1.2054167985916138, -0.3176305294036865, 0.7256749868392944, -0.8156121969223022, 0.8739364743232727, -0.8477885723114014, -0.6831926107406616, -0.10275594145059586, -0.23180924355983734, -0.4832853376865387, 0.08466769009828568, 0.39283087849617004, 0.16113455593585968, -0.35499003529548645, 0.026415549218654633, 0.0424220897257328, 121.5))",0


## Clustering

### Apply K-Means clustering

Reference: [ML Clustering](https://spark.apache.org/docs/latest/ml-clustering.html).

##### Step 1: Apply `KMeans` Clustering

In [0]:
import time
# record the start time of the pre-processing
start_time=time.time()

In [0]:
# Apply KMeans clustering using same seed value
kmeans = KMeans().setK(6).setSeed(1)\
        .setFeaturesCol("features")\
        .setPredictionCol("Cluster_no")

##### Step 2: `fit` the model and store it in the `kmeansModel1` variable.

In [0]:
# Fit k-means on the vectorised training data and store the model in kmeansModel1
kmeansModel1 = kmeans.fit(df2_train)
# Assign each training sample to a cluster (adds the Cluster_no column)
df_clusters1=kmeansModel1.transform(df2_train)

### Number of majority and minority points in each cluster

Get the number of fraudulent and nonfraudulent transactions in each cluster

In [0]:
# define a function with input as 
# 1. df after kmeans clustering that contains Cluster_no column (Dataframe)
# 2. number of clusters (integer); kept for compatibility, the grouping does not need it
# returns dataframe with columns - label, Cluster_no, count
def count_classes_in_cluster(df, total_clusters):
  # a single groupBy already covers every cluster, so no loop is needed
  return df.groupBy("label", "Cluster_no").count()

#count_classes_in_cluster(df_clusters1,5).sort(F.asc("Cluster_no")).collect() # 6 clusters

### Ratio of minority to majority points in each cluster

Define a function that computes, for each cluster, the ratio of fraudulent (minority) to non-fraudulent (majority) transactions.

In [0]:
def cluster_ratio(df):
  
  '''
    gives a list with the ratio of fraudulent to non-fraudulent samples in each cluster, given a spark df after clustering
    input:
    * df: df should contain columns 'label' and 'Cluster_no'
    output:
    * list_of_ratio: list of values - the ratio if both classes are present, -1 if only class 1 is present (or the cluster is empty) and 0 if only class 0 is present
  '''
  
  list_of_ratio=[]
  for cluster in range(1,7):
    # rows are sorted so that label 1 (fraud) comes first; collect once instead of running several actions
    rows = df.groupBy("label", "Cluster_no").count().filter(F.col("Cluster_no")==cluster-1).sort(F.desc("label")).collect()
    
    # Both fraudulent and non fraudulent classes are present in the cluster
    if len(rows)==2:
      list_of_ratio.append(rows[0]["count"]/rows[1]["count"])
    
    # Only the non fraudulent class is present in the cluster
    elif len(rows)==1 and rows[0]["label"]==0:
      list_of_ratio.append(0)
    
    # Only the fraudulent class is present (or the cluster is empty)
    else:
      list_of_ratio.append(-1)
      
  return list_of_ratio

cluster_ratio_list=cluster_ratio(df_clusters1)

In [0]:
# Split the clustered data into one cached DataFrame per cluster.
# Filter on Cluster_no first, then drop the column so it is not carried into SMOTE and ENN.
df_cluster1 = df_clusters1.filter("Cluster_no == 0").drop("Cluster_no").cache()
df_cluster2 = df_clusters1.filter("Cluster_no == 1").drop("Cluster_no").cache()
df_cluster3 = df_clusters1.filter("Cluster_no == 2").drop("Cluster_no").cache()
df_cluster4 = df_clusters1.filter("Cluster_no == 3").drop("Cluster_no").cache()
df_cluster5 = df_clusters1.filter("Cluster_no == 4").drop("Cluster_no").cache()
df_cluster6 = df_clusters1.filter("Cluster_no == 5").drop("Cluster_no").cache()

## SMOTE + ENN with clusters 

- Depending on its class ratio, each cluster has SMOTE applied first and then ENN.

### Application of SMOTE on each cluster 
- SMOTE is applied only to clusters whose ratio is strictly between 0 and 1, i.e. clusters that contain both classes and in which frauds are outnumbered.
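
The selection rule can be written as a small predicate. The ratios below are made-up values for six clusters, using the same sentinels as `cluster_ratio` (0 for a cluster with only class 0, -1 for a cluster with only class 1); `needs_resampling` is a hypothetical helper name.

```python
def needs_resampling(ratio):
    """True when a cluster holds both classes and the minority is still outnumbered."""
    return 0 < ratio < 1

# Hypothetical minority/majority ratios for clusters 1 to 6
example_ratios = [0.002, 0, 0.15, -1, 1.3, 0.0007]

selected = [i + 1 for i, r in enumerate(example_ratios) if needs_resampling(r)]
print("Clusters to resample:", selected)
```

A loop over such a predicate is one way to avoid repeating the same `if` for every cluster.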

In [0]:
#configure SMOTE configuration
class conf:
  def __init__(self):
    self.seed = 1
    self.bucketLength = 1000
    self.k = 5
    self.multiplier = 30

In [0]:

conf_settings = conf()
# SMOTE applied to each cluster dataframe
if(0<cluster_ratio_list[0] and cluster_ratio_list[0]<1):
  oversampled_1=smote(df_cluster1,conf_settings)
if(0<cluster_ratio_list[1] and cluster_ratio_list[1]<1):
  oversampled_2=smote(df_cluster2,conf_settings)
if(0<cluster_ratio_list[2] and cluster_ratio_list[2]<1):
  oversampled_3=smote(df_cluster3,conf_settings)
if(0<cluster_ratio_list[3] and cluster_ratio_list[3]<1):
  oversampled_4=smote(df_cluster4,conf_settings)
if(0<cluster_ratio_list[5] and cluster_ratio_list[5]<1):
  oversampled_6=smote(df_cluster6,conf_settings)  
if(0<cluster_ratio_list[4] and cluster_ratio_list[4]<1):
  oversampled_5=smote(df_cluster5,conf_settings)

### Application of ENN to each cluster

- ENN is applied only to clusters whose ratio is between 0 and 1; every other cluster is passed through unchanged as its original cluster dataframe.

In [0]:
# configure ENN configuration
class conf1:
  def __init__(self):
    self.seed = 2
    self.bucketLength = 10
    self.k = 3

conf_settings = conf1()

In [0]:

conf_settings = conf1()
# ENN applied to each cluster dataframe
if(0<cluster_ratio_list[0] and cluster_ratio_list[0]<1):
  df_enn_1=enn(oversampled_1,conf_settings)
  #df_enn_1.show(10)
else:
  df_cluster1 = df_cluster1.select(
    "label",
    "features")
  df_enn_1=df_cluster1
if(0<cluster_ratio_list[1] and cluster_ratio_list[1]<1):
  df_enn_2=enn(oversampled_2,conf_settings)
  #df_enn_2.show(10)
else:
  df_cluster2 = df_cluster2.select(
    "label",
    "features")
  df_enn_2=df_cluster2
if(0<cluster_ratio_list[2] and cluster_ratio_list[2]<1):
  df_enn_3=enn(oversampled_3,conf_settings)
  #df_enn_3.show(10)
else:
  df_cluster3 = df_cluster3.select(
    "label",
    "features")
  df_enn_3=df_cluster3
if(0<cluster_ratio_list[3] and cluster_ratio_list[3]<1):
  df_enn_4=enn(oversampled_4,conf_settings)
  #df_enn_4.show(10)
else:
  df_cluster4 = df_cluster4.select(
    "label",
    "features")
  df_enn_4=df_cluster4
if(0<cluster_ratio_list[4] and cluster_ratio_list[4]<1):
  df_enn_5=enn(oversampled_5,conf_settings)
  #df_enn_5.show(10)
else:
  df_cluster5 = df_cluster5.select(
    "label",
    "features")
  df_enn_5=df_cluster5
if(0<cluster_ratio_list[5] and cluster_ratio_list[5]<1):
  df_enn_6=enn(oversampled_6,conf_settings)
else:
  df_cluster6 = df_cluster6.select(
    "label",
    "features")
  df_enn_6=df_cluster6

#df_enn_6.show(10)

In [0]:
# Combining all dataframe to one dataframe
df_smote_enn_append = df_enn_1.union(df_enn_2)\
                                   .union(df_enn_3)\
                                   .union(df_enn_4)\
                                   .union(df_enn_5)\
                                   .union(df_enn_6)

# Evaluation

- Evaluating our pre-processed data

In [0]:
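# show() is an action, so it forces Spark to run the lazy pre-processing (clustering through SMOTE+ENN) before the end time is recorded;
# Spark may only compute what these 10 rows require, so treat the measured time as approximate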
df_smote_enn_append.show(10)
end_time=time.time()

## Time taken for pre-processing

- End time minus start time, in seconds

In [0]:
end_time - start_time

## Ratio of pre-processed data

- Ratio  of minority and majority classes after pre-processing

In [0]:
# calculating the ratio  of minority samples to majority samples
df_smote_enn_append.filter(df_smote_enn_append.label==1).count()/df_smote_enn_append.filter(df_smote_enn_append.label==0).count()

## Evaluation of the Random Forest model using the pre-processed data

- Creating a random forest model and fitting it with the pre-processed training data
- Running different evaluation metrics on the test data

### Creating a model and predicting on the test data

- Training a Random Forest Classifier model with the pre-processed data and predicting the values for the test data

In [0]:
# train a Random Forest on the data pre-processed with SMOTE+ENN, then predict on the test set
from pyspark.ml.classification import RandomForestClassifier
classifier = RandomForestClassifier(featuresCol = 'features', labelCol = 'label')
model = classifier.fit(df_smote_enn_append)
predictions = model.transform(df2_test)

### Evaluating the model

The metrics used:
- Area Under ROC
- Area under the precision-recall curve
- Accuracy of the model
- Precision of the model
- Recall of the model
- F1 score of the model
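
To see why accuracy alone is not enough on imbalanced data, the threshold-based metrics can be computed by hand from confusion-matrix counts. The counts below are invented for illustration and are not results from this notebook; `fraud_metrics` is a hypothetical helper.

```python
def fraud_metrics(tp, fp, fn, tn):
    """Accuracy, plus precision, recall and F1 for the positive (fraud) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set with 100 frauds among 60,000 transactions.
# A classifier that never predicts fraud:
always_legit = fraud_metrics(tp=0, fp=0, fn=100, tn=59_900)
# A classifier that catches 80 frauds and raises 40 false alarms:
detector = fraud_metrics(tp=80, fp=40, fn=20, tn=59_860)

print("always legit:", always_legit)
print("detector:    ", detector)
```

Both classifiers score above 99% accuracy, but only the second has non-zero recall and F1 for the fraud class, which is the class of interest here.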

In [0]:
# Evaluating the model test data
binary_evaluator = BinaryClassificationEvaluator(labelCol ="label")

# Area Under ROC
print("Test Area Under ROC: " + str(binary_evaluator.evaluate(predictions, {binary_evaluator.metricName: "areaUnderROC"})))

# Area Under Precision and Recall
print("Test Area Under PR: " + str(binary_evaluator.evaluate(predictions, {binary_evaluator.metricName: "areaUnderPR"})))

#Accuracy of the model
evaluator_accuracy = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
print("Accuracy: " + str(evaluator_accuracy.evaluate(predictions)))

#Precision of the model for the fraud class (metricLabel=1.0; the default would report class 0)
evaluator_precision = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="precisionByLabel", metricLabel=1.0)
print("Precision: " + str(evaluator_precision.evaluate(predictions)))

#Recall of the model for the fraud class (metricLabel=1.0; the default would report class 0)
evaluator_recall = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="recallByLabel", metricLabel=1.0)
print("Recall: " + str(evaluator_recall.evaluate(predictions)))

#F1 score of the model (weighted average over both classes)
evaluator_f1 = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="f1")
print("F1: " + str(evaluator_f1.evaluate(predictions)))

