
# Load Spark and Libraries

In [7]:
# Environment setup, disabled by wrapping it in a string literal; remove the quotes to run it.
'''!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install pandas_ml
!pip install -q findspark
!pip install catboost
!pip install -U imbalanced-learn
!pip install pyod seaborn catboost plotly_express==0.4.0
!pip install --upgrade pyod
!pip install shap
!pip install --user --upgrade ipywidgets
!jupyter nbextension enable --py widgetsnbextension
#!pip install -r requirements.txt'''

import os
#os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
#os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('xente').getOrCreate()
spark

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pyspark.sql.functions as F
import shap
import catboost
from catboost import Pool, CatBoostClassifier, cv

from pyspark.sql.types import IntegerType, DoubleType
from pyspark.sql.functions import mean, udf, array, col
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.linalg import Vectors
from imblearn.over_sampling import RandomOverSampler, SMOTE, SMOTENC, ADASYN
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split, learning_curve, ShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                              f1_score, precision_score, recall_score, roc_auc_score)

# Load Training Data

In [8]:
def read_data_from_web(url):
  """Read a CSV file from a URL with pandas and convert it to a Spark DataFrame."""
  data = pd.read_csv(url)
  spark_data = spark.createDataFrame(data)
  return spark_data

fraud_data = read_data_from_web("https://drive.google.com/uc?export=download&id=1NrtVkKv8n_g27w5elq9HWZA1i8aFBW0G")
df_backup = fraud_data

In [21]:
fraud_data.show(5)

+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+
|      TransactionId|  ProviderId|   ProductId|   ProductCategory|  ChannelId| Amount|Value|PricingStrategy|FraudResult|
+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+
|TransactionId_76871|ProviderId_6|ProductId_10|           airtime|ChannelId_3| 1000.0| 1000|              2|          0|
|TransactionId_73770|ProviderId_4| ProductId_6|financial_services|ChannelId_2|  -20.0|   20|              2|          0|
|TransactionId_26203|ProviderId_6| ProductId_1|           airtime|ChannelId_3|  500.0|  500|              2|          0|
|  TransactionId_380|ProviderId_1|ProductId_21|      utility_bill|ChannelId_3|20000.0|21800|              2|          0|
|TransactionId_28195|ProviderId_4| ProductId_6|financial_services|ChannelId_2| -644.0|  644|              2|          0|
+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+

## Data Dictionary

*   **TransactionId:** Unique transaction identifier on the platform.
*   **BatchId:** Unique number assigned to a batch of transactions for processing.
*   **AccountId:** Unique number identifying the customer on the platform.
*   **SubscriptionId:** Unique number identifying the customer subscription.
*   **CustomerId:** Unique identifier attached to the account.
*   **CurrencyCode:** Country currency.
*   **CountryCode:** Numerical geographical code of the country.
*   **ProviderId:** Source provider of the item bought.
*   **ProductId:** Name of the item being bought.
*   **ProductCategory:** ProductIds are organized into these broader product categories.
*   **ChannelId:** Identifies whether the customer used web, Android, iOS, pay later or checkout.
*   **Amount:** Value of the transaction. Positive for debits from the customer account and negative for credits into the customer account.
*   **Value:** Absolute value of the amount.
*   **TransactionStartTime:** Transaction start time.
*   **PricingStrategy:** Category of Xente's pricing structure for merchants.
*   **FraudResult:** Fraud status of the transaction: 1 = fraud, 0 = not fraud.

In [9]:
fraud_data.printSchema()

root
 |-- TransactionId: string (nullable = true)
 |-- BatchId: string (nullable = true)
 |-- AccountId: string (nullable = true)
 |-- SubscriptionId: string (nullable = true)
 |-- CustomerId: string (nullable = true)
 |-- CurrencyCode: string (nullable = true)
 |-- CountryCode: long (nullable = true)
 |-- ProviderId: string (nullable = true)
 |-- ProductId: string (nullable = true)
 |-- ProductCategory: string (nullable = true)
 |-- ChannelId: string (nullable = true)
 |-- Amount: double (nullable = true)
 |-- Value: long (nullable = true)
 |-- TransactionStartTime: string (nullable = true)
 |-- PricingStrategy: long (nullable = true)
 |-- FraudResult: long (nullable = true)



# Data Preprocessing

## Missing Data Analysis

In [10]:
def there_is_missing_data(data):
  return data.count() != data.na.drop(how='any').count()

print('There is missing data? {0}.'.format(there_is_missing_data(fraud_data)))

There is missing data? False.
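
The check above only answers yes or no. If it ever returns True, a per-column count helps locate the gaps. The following is a minimal sketch in pandas, assuming the data is still available as the pandas frame produced by `pd.read_csv`; the `sample` frame and the `missing_per_column` helper are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def missing_per_column(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have gaps."""
    counts = frame.isna().sum()
    return counts[counts > 0]

# Hypothetical toy data: one gap in Amount and one in ChannelId.
sample = pd.DataFrame({
    "TransactionId": ["TransactionId_1", "TransactionId_2", "TransactionId_3"],
    "Amount": [1000.0, np.nan, -20.0],
    "ChannelId": ["ChannelId_3", "ChannelId_2", None],
})

missing = missing_per_column(sample)
print(missing)
```

An empty result means that no column has missing values, which matches the `False` printed above for the real data.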


## Duplicate Rows

In [12]:
def there_is_duplicate_lines(data):
  return data.count() != data.distinct().count()

print('There are duplicate rows? {0}.'.format(there_is_duplicate_lines(fraud_data)))

There are duplicate rows? False.
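
Comparing `count()` with `distinct().count()` tells us whether duplicates exist, but not which rows they are. As a hedged sketch, again in pandas and on hypothetical toy data, the helper below (`duplicated_rows`, a name introduced here for illustration) returns every row that occurs more than once.

```python
import pandas as pd

def duplicated_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Return all rows that appear more than once, including the first occurrence."""
    return frame[frame.duplicated(keep=False)]

# Hypothetical toy data: the first and third rows are identical.
sample = pd.DataFrame({
    "TransactionId": ["TransactionId_1", "TransactionId_2", "TransactionId_1"],
    "Amount": [1000.0, -20.0, 1000.0],
})

dupes = duplicated_rows(sample)
print(dupes)
```

With `keep=False` pandas marks every member of a duplicate group, so the result shows the repeated rows side by side rather than only the later copies.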


## Data Treatment

In [13]:
fraud_data.withColumn('diff', F.abs(fraud_data['Amount'])-F.col('Value')).select('diff').show()

+-------+
|   diff|
+-------+
|    0.0|
|    0.0|
|    0.0|
|-1800.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
|    0.0|
+-------+
only showing top 20 rows



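For most rows `Value` equals the absolute `Amount`, but not always: the fourth row differs by 1800, so `Value` appears to show the real transaction value. To be precise, we need to categorize the transactions, but first let us remove the features that are not useful.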
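
As a rough illustration of what such a categorization could look like, the sketch below labels a transaction by the sign of `Amount`, following the data dictionary (positive for debits, negative for credits), and recomputes the `diff` quantity in plain Python. The function names, the labels and the handling of a zero amount are assumptions made for this example, not the notebook's method.

```python
def transaction_direction(amount: float) -> str:
    """Label a transaction by the sign of Amount (labels are illustrative)."""
    if amount > 0:
        return "debit"   # money leaves the customer account
    if amount < 0:
        return "credit"  # money enters the customer account
    return "zero"        # assumption: the data dictionary does not cover 0

def amount_value_diff(amount: float, value: int) -> float:
    """Same quantity as the `diff` column: |Amount| - Value."""
    return abs(amount) - value

# (Amount, Value) pairs taken from rows displayed by fraud_data.show(5).
rows = [(1000.0, 1000), (-20.0, 20), (20000.0, 21800)]
summary = [(transaction_direction(a), amount_value_diff(a, v)) for a, v in rows]
print(summary)
# [('debit', 0.0), ('credit', 0.0), ('debit', -1800.0)]
```

On the Spark DataFrame, the same labeling could be expressed with `F.when(...).otherwise(...)` or by wrapping the function with `udf`, both of which are already imported in this notebook.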

## Removing Uninformative Features

In [28]:
print('Different CurrencyCode Values: \t\t {0}'.format(fraud_data.select('CurrencyCode').distinct().count()))
print('Different CountryCode Values: \t {0}'.format(fraud_data.select('CountryCode').distinct().count()))
print('Different TransactionId Values: \t {0}'.format(fraud_data.select('TransactionId').distinct().count()))

Different CurrencyCode Values: 		 1
Different CountryCode Values: 	 1
Different TransactionId Values: 	 95662


The CurrencyCode and CountryCode columns hold a single value across the whole dataset, for genuine and fraudulent transactions alike, so they cannot help separate the two classes.
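
The per-column counts above can be generalized into a helper that lists every column with a single distinct value. This is a minimal pandas sketch on hypothetical toy data (the values `"CUR_A"` and `1` are placeholders, not the real codes), and `constant_columns` is a name introduced here for illustration.

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    distinct_counts = frame.nunique(dropna=False)
    return distinct_counts[distinct_counts <= 1].index.tolist()

# Hypothetical toy data: two constant columns and two informative ones.
sample = pd.DataFrame({
    "TransactionId": ["TransactionId_1", "TransactionId_2", "TransactionId_3"],
    "CurrencyCode": ["CUR_A", "CUR_A", "CUR_A"],
    "CountryCode": [1, 1, 1],
    "Amount": [1000.0, -20.0, 500.0],
})

to_drop = constant_columns(sample)
print(to_drop)
# ['CurrencyCode', 'CountryCode']
```

Passing `dropna=False` counts missing values as a distinct value, so a column that mixes one constant with gaps is not reported as constant.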

In [14]:
# data: dataframe - columns: column list to remove
def remove_feature(data, columns_in):
  return data.drop(*columns_in)

In [16]:
def clean_data(fraud_data, columns_to_remove):
  fraud_data = remove_feature(fraud_data, columns_to_remove)
  return fraud_data

In [17]:
columns_to_remove = ['CurrencyCode','CountryCode','BatchId','AccountId','SubscriptionId','CustomerId', 'TransactionStartTime']
fraud_data = clean_data(fraud_data, columns_to_remove)

In [24]:
fraud_data.show(5)

+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+
|      TransactionId|  ProviderId|   ProductId|   ProductCategory|  ChannelId| Amount|Value|PricingStrategy|FraudResult|
+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+
|TransactionId_76871|ProviderId_6|ProductId_10|           airtime|ChannelId_3| 1000.0| 1000|              2|          0|
|TransactionId_73770|ProviderId_4| ProductId_6|financial_services|ChannelId_2|  -20.0|   20|              2|          0|
|TransactionId_26203|ProviderId_6| ProductId_1|           airtime|ChannelId_3|  500.0|  500|              2|          0|
|  TransactionId_380|ProviderId_1|ProductId_21|      utility_bill|ChannelId_3|20000.0|21800|              2|          0|
|TransactionId_28195|ProviderId_4| ProductId_6|financial_services|ChannelId_2| -644.0|  644|              2|          0|
+-------------------+------------+------------+------------------+-----------+-------+-----+---------------+-----------+