# Credit Card Fraud
* Resources
  * From https://datascience.ibm.com/exchange/public/entry/view/d80de77f784fed7915c14353512ef14d

## Individual Project in Spark

**By:**

- Ashish Devrani
- Panther# (002329273)

## Objective
* The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

* It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning. Feature 'Class' is the response variable: it takes the value 1 in case of fraud and 0 otherwise.

* Use Logistic Regression
  * Train on a portion of the dataset
  * Test the trained model against the remainder of the dataset
    * Accuracy can be determined because the dataset is labeled (i.e., this uses supervised learning)

* Result:
  * In this kind of problem, instead of achieving a high overall accuracy, we would like to know how many fraud transactions were correctly identified and how many were missed.
  * Using Logistic Regression to predict fraud transactions, we correctly identified 63% of the fraud transactions in the test set (a recall of 63%).
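
The share of fraud transactions that were correctly identified is usually called recall (or sensitivity) rather than accuracy. A minimal sketch of that calculation follows; the function name `fraud_recall` and the counts passed to it are hypothetical, chosen only to illustrate the arithmetic, and are not the notebook's actual output.

```python
def fraud_recall(true_positives, false_negatives):
    """Share of actual fraud transactions that the model flagged as fraud."""
    total_fraud = true_positives + false_negatives
    if total_fraud == 0:
        raise ValueError("no fraud transactions to evaluate")
    return true_positives / float(total_fraud)


# Hypothetical counts, for illustration only (not the notebook's real results):
# 63 fraud transactions caught, 37 missed.
example_recall = fraud_recall(true_positives=63, false_negatives=37)
print("Recall: {:.1%}".format(example_recall))
```

Recall ignores the good transactions entirely, which is why it is more informative than overall accuracy when the positive class is this rare.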

## Download Data

In [5]:
# Using the below command we are importing our dataset, which is a csv file from our shared folder on Onedrive

In [6]:
%sh
wget --no-check-certificate 'https://onedrive.live.com/download?cid=77DD390E20E2AD35&resid=77DD390E20E2AD35%212201&authkey=AEz_A1GDynO8vq0' -O CreditCardData.csv
ls /databricks/driver

In [7]:
# Listing all the files at the location /databricks/driver to check that our file CreditCardData.csv was saved

In [8]:
%sh
ls

In [9]:
# Using the below command we are trying to read the imported file as a dataframe into the Spark memory

In [10]:
creditcard_df = spark.read\
  .format('csv')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load("file:/databricks/driver/CreditCardData.csv")
creditcard_df.show(10)

## Data Explored and Explained

In [12]:
creditcard_df.printSchema()

In [13]:
creditcard_df.count()

## Data Cleaning

In [15]:
# Here we remove rows that contain null values.
# Observation: the data was already clean and no column contained nulls. We verified this by comparing the number of rows before and after the null filter.

In [16]:
# Keep only the rows in which every feature column (all columns except Class) is non-null
from functools import reduce
from pyspark.sql.functions import col

feature_cols = ['Amount'] + ['V%d' % i for i in range(1, 29)] + ['Time']
cleanedcc_df = creditcard_df.filter(reduce(lambda a, b: a & b, [col(c).isNotNull() for c in feature_cols]))

In [17]:
# Row count after the null filter, to compare with the count before cleaning
cleanedcc_df.count()

In [18]:
# Here we check the values of the target column 'Class'. If any values other than 0 and 1 exist, they need to be cleaned or normalized.

In [19]:
# The transaction to be predicted - Good transaction or Fraud transaction
cleanedcc_df.select('Class').distinct().show()

In [20]:
# Count the transactions per class: 1 = fraud transactions, 0 = good transactions.
cleanedcc_df.select('Class').groupBy('Class').count().show()

## Data Transformation

In [22]:
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import *
from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [23]:
# Convert the Class values to MLlib input, which requires labels as floats
def labelForResults(s):
     if s == 0:
         return 0.0
     elif s == 1:
         return 1.0
     else:
         return -1.0
label = UserDefinedFunction(labelForResults, DoubleType())

labeledData = cleanedcc_df.select(label(cleanedcc_df.Class).alias('label'),cleanedcc_df.V1, cleanedcc_df.V2, cleanedcc_df.V3, cleanedcc_df.V4, cleanedcc_df.V5, cleanedcc_df.V6, cleanedcc_df.V7, cleanedcc_df.V8, cleanedcc_df.V9, cleanedcc_df.V10, cleanedcc_df.V11, cleanedcc_df.V12, cleanedcc_df.V13, cleanedcc_df.V14, cleanedcc_df.V15, cleanedcc_df.V16, cleanedcc_df.V17, cleanedcc_df.V18, cleanedcc_df.V19, cleanedcc_df.V20, cleanedcc_df.V21, cleanedcc_df.V22, cleanedcc_df.V23, cleanedcc_df.V24, cleanedcc_df.V25, cleanedcc_df.V26, cleanedcc_df.V27, cleanedcc_df.V28, cleanedcc_df.Time, cleanedcc_df.Amount).where('label >= 0')
labeledData.show(10)

In [24]:
# Split into training and testing data
creditcard_train, creditcard_test = labeledData.randomSplit([0.8, 0.2], seed=12345)
display(creditcard_train)
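
With so few fraud rows, it helps to estimate how many of them each split is likely to receive. The sketch below is plain arithmetic using the totals from the Objective section and the 0.8/0.2 weights above; the helper name `expected_counts` is hypothetical, and because `randomSplit` assigns rows at random, the actual counts will differ somewhat from these expectations.

```python
TOTAL_TRANSACTIONS = 284807  # from the dataset description
FRAUD_TRANSACTIONS = 492     # from the dataset description
SPLIT_WEIGHTS = {"train": 0.8, "test": 0.2}  # same weights as the randomSplit call


def expected_counts(total, fraud, weight):
    """Expected (all rows, fraud rows) in a split that receives `weight` of the data."""
    return total * weight, fraud * weight


for name in ("train", "test"):
    rows, fraud = expected_counts(
        TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS, SPLIT_WEIGHTS[name]
    )
    print("{}: about {:.0f} rows, about {:.0f} fraud rows".format(name, rows, fraud))
```

The test split is expected to hold only on the order of a hundred fraud rows, so a handful of misclassified transactions can move the reported fraud detection rate by several percentage points.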

## Data Modeling

In [26]:
# Configure an ML pipeline into stages:
vectorAssembler_features = VectorAssembler(inputCols = [x for x in labeledData.columns if x != "label"], outputCol="features")
lr = LogisticRegression(maxIter=15, regParam=0.0005)
pipeline = Pipeline(stages=[vectorAssembler_features, lr])

In [27]:
# In this step we fit the ML pipeline created above on our training data

In [28]:
model = pipeline.fit(creditcard_train)

## Prediction

In [30]:
predictionsDf = model.transform(creditcard_test)
predictionsDf.createOrReplaceTempView('Predictions')
predictionsDf.show(3)

## Model Evaluation

In [32]:
numSuccesses = predictionsDf.where("(label = 1 AND prediction = 1)").count()
numFailure= predictionsDf.where("(label = 1 AND prediction = 0)").count()
numData = predictionsDf.count()

numFraud = numSuccesses + numFailure
print("There were {} fraud transactions in a total pool of {} transactions available for prediction: {} fraud transactions were successfully predicted and {} fraud transactions were missed (predicted as good transactions).".format(numFraud, numData, numSuccesses, numFailure))
print("This is a {:.2f}% success rate (recall) in identifying the fraud transactions.".format(100.0 * numSuccesses / numFraud))

In [33]:
# The chart below shows how many fraud transactions were correctly predicted and how many were predicted as good transactions. The model identified 63% of the fraud transactions (recall) in a transaction database where fraud transactions were just 0.172% of the total data. Hence we found the needle in a haystack.

In [34]:
truePositive = int(predictionsDf.where("(label = 1 AND prediction = 1)").count())
trueNegative = int(predictionsDf.where("(label = 0 AND prediction = 0)").count())
falsePositive = int(predictionsDf.where("(label = 0 AND prediction = 1)").count())
falseNegative = int(predictionsDf.where("(label = 1 AND prediction = 0)").count())

resultTDF = sqlContext.createDataFrame([['TP', truePositive], ['FN', falseNegative] ], ['metric', 'value'])
display(resultTDF)

In [36]:
# The number of good transactions predicted correctly is high, as expected: good transactions already make up about 99.83% of the data, so even predicting every label as 0 would have achieved high accuracy in this domain.
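
The point about accuracy can be checked with the totals quoted in the Objective section. The sketch below is a rough illustration in plain Python; the function name `majority_baseline_accuracy` is hypothetical, and it evaluates a trivial model that labels every transaction as good on the full dataset rather than on the test split.

```python
TOTAL_TRANSACTIONS = 284807  # from the dataset description
FRAUD_TRANSACTIONS = 492     # from the dataset description


def majority_baseline_accuracy(total, positives):
    """Accuracy of a trivial model that labels every transaction as good (0)."""
    return (total - positives) / float(total)


baseline = majority_baseline_accuracy(TOTAL_TRANSACTIONS, FRAUD_TRANSACTIONS)
print("Always-predict-0 accuracy: {:.3%}".format(baseline))
print("Fraud share of the data:   {:.3%}".format(1 - baseline))
```

Such a model reaches an accuracy above 99.8% while catching no fraud at all, which is why the evaluation here focuses on the fraud rows (TP and FN) instead.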

In [37]:
resultNDF = sqlContext.createDataFrame([['TN', trueNegative], ['FP', falsePositive] ], ['metric', 'value'])
display(resultNDF)

In [39]:
# --------------------- Visualisations using R ------------------------------

In [40]:
resultTDF.createOrReplaceTempView("Tresult")

In [41]:
%r
library(SparkR)
sparkdf <- sql("FROM Tresult SELECT *")
tdf <- collect(sparkdf)
print( tdf)
vals <- (t(tdf[2]))
labels <- (t(tdf[1]))
# Simple Pie Chart
pie(vals,labels)

In [42]:
resultNDF.createOrReplaceTempView("Fresult")

In [43]:
%r
library(SparkR)
sparkdf2 <- sql("FROM Fresult SELECT *")
fdf <- collect(sparkdf2)
print( fdf)
vals <- (t(fdf[2]))
labels <- (t(fdf[1]))
# Simple Pie Chart
pie(vals,labels)