<a href="https://colab.research.google.com/github/Yufanzh/02_study/blob/master/ML_for_credit_card_fraud_detection_baseline_feature_transformation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning in Fraud Detection
## Baseline methodology -- Supervised Learning
Most of the proposed approaches follow a common baseline ML methodology based on supervised learning:
* Feature engineering
* Model training
* Model validation
* Production
ML for credit card fraud detection (CCFD) is a notoriously difficult problem because of the following challenges:
1. **Class imbalance**: The percentage of fraudulent transactions in a real-world dataset is typically well under 1%. Learning from imbalanced data is difficult because most learning algorithms do not handle large differences between class sizes well. Dealing with class imbalance requires additional learning strategies such as sampling or loss weighting, a topic known as `imbalanced learning`.
2. **Concept drift**: Transaction and fraud patterns change over time. These time-dependent changes in the distributions of transactions and frauds require learning strategies that can cope with temporal changes in statistical distributions, a topic known as online learning. In practice, the concept drift problem is accentuated by delayed feedback.
3. **Near real-time requirements**: Ideally, fraudulent transactions are detected quickly. Classification times as low as tens of milliseconds may be required. This challenge is closely related to the `parallelization and scalability` of fraud detection systems.
4. **Categorical features**: Transactional data typically contain numerous categorical features, such as the ID of a customer, a terminal, the card type, and so on. Most ML algorithms do not handle categorical features well, so they must be transformed into numerical features. Common strategies include feature aggregation, graph-based transformation, or DL approaches such as feature embeddings.
5. **Sequential modeling**: Each customer and each terminal generates a stream of transactions over time. An important challenge of fraud detection is to model these streams so as to characterize their expected behavior and detect when abnormal behavior occurs. Modeling may be done by aggregating features over time (e.g. keeping track of the mean frequency or transaction amounts of a customer), or by relying on sequential prediction models (e.g. hidden Markov models, RNNs).
6. **Class overlap**: With only the raw information about a transaction, detecting fraud is almost impossible. This is usually addressed with feature engineering techniques that add contextual information to the raw payment information.
7. **Performance measures**: Standard measures such as the mean misclassification error or the AUC ROC are not well suited to fraud detection. It is often necessary to consider multiple measures to assess the overall performance of a fraud detection system. There is currently no consensus on which set of performance measures should be used.
8. **Lack of public datasets**

## Transaction data simulator

### Design choices
#### Transaction features
A payment card transaction consists of any amount paid to a merchant by a customer at a certain time. It is described by the following features:
1. Transaction ID
2. Date and time
3. Customer ID
4. Terminal ID
5. Transaction amount
6. Fraud label

In [1]:
# The simulated data used here come from the dataset at
# https://github.com/Fraud-Detection-Handbook/simulated-data-raw

### Baseline Feature Transformation
We will implement three types of feature transformation that are known to be relevant for payment card fraud detection.
1. Binary encoding or one-hot encoding
2. RFM (Recency, Frequency, Monetary value)
3. Frequency encoding or risk encoding


In [2]:
# Initialization: Load shared functions and simulated data

# Load shared functions
!curl -O https://raw.githubusercontent.com/Fraud-Detection-Handbook/fraud-detection-handbook/main/Chapter_References/shared_functions.py
%run shared_functions.py

# Get simulated data from Github repository
if not os.path.exists("simulated-data-raw"):
    !git clone https://github.com/Fraud-Detection-Handbook/simulated-data-raw


Cloning into 'simulated-data-raw'...
remote: Enumerating objects: 189, done.
remote: Counting objects: 100% (189/189), done.
remote: Compressing objects: 100% (187/187), done.
remote: Total 189 (delta 0), reused 186 (delta 0), pack-reused 0
Receiving objects: 100% (189/189), 28.04 MiB | 19.04 MiB/s, done.


#### Loading of dataset
We first load the transactions from April 1 to September 30, 2018.

In [3]:
DIR_INPUT = './simulated-data-raw/data/'
BEGIN_DATE = '2018-04-01'
END_DATE = '2018-09-30'

print('Load files')
%time transactions_df = read_from_files(DIR_INPUT, BEGIN_DATE, END_DATE)
print(f'{len(transactions_df)} transactions loaded, containing {transactions_df.TX_FRAUD.sum()} fraudulent transactions')

Load files
CPU times: user 4.01 s, sys: 1.32 s, total: 5.34 s
Wall time: 5.6 s
1754155 transactions loaded, containing 14681 fraudulent transactions
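
The counts printed above make the class imbalance concrete. The following is only a back-of-the-envelope sketch: the two counts are copied from the output, and the weight uses the common "balanced" heuristic `n_samples / (n_classes * n_class_samples)`, which is one possible loss-weighting choice and not something this notebook applies.

```python
# Counts copied from the loading output above
n_transactions = 1_754_155
n_frauds = 14_681

# Fraction of transactions labeled as fraudulent
fraud_rate = n_frauds / n_transactions

# "Balanced" class weight heuristic for the fraud class (assumption: 2 classes)
n_classes = 2
fraud_weight = n_transactions / (n_classes * n_frauds)

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Balanced weight for the fraud class: {fraud_weight:.1f}")
```

With under 1% of positive labels, a classifier that always predicts "genuine" would already reach more than 99% accuracy, which is why accuracy alone says little here.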


In [4]:
transactions_df.head()

index,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO
0,0,2018-04-01 00:00:31,596,3156,57.16,31,0,0,0
1,1,2018-04-01 00:02:10,4961,3412,81.51,130,0,0,0
2,2,2018-04-01 00:07:56,2,1365,146.0,476,0,0,0
3,3,2018-04-01 00:09:29,4128,8737,64.49,569,0,0,0
4,4,2018-04-01 00:10:34,927,9906,50.99,634,0,0,0


In [7]:
transactions_df.TX_DATETIME.dtype

dtype('<M8[ns]')

#### Date and Time transformations
We will create two new binary features from the transaction dates and times:
* The first, called TX_DURING_WEEKEND, characterizes whether a transaction occurs during a weekday (0) or during the weekend (1).
* The second, called TX_DURING_NIGHT, characterizes whether a transaction occurs during the day (0) or during the night (1). Night is defined as the hours from 0 to 6 inclusive.

In [8]:
# Define a function is_weekend that takes a pandas Timestamp and returns 1 (weekend) or 0 (weekday)
def is_weekend(tx_datetime):
  # Transform the date into a weekday number (0 is Monday, 6 is Sunday)
  weekday = tx_datetime.weekday()

  return int(weekday >= 5)

In [9]:
# compute the feature
%time transactions_df['TX_DURING_WEEKEND'] = transactions_df['TX_DATETIME'].apply(is_weekend)

CPU times: user 4.95 s, sys: 352 ms, total: 5.3 s
Wall time: 5.38 s


In [11]:
transactions_df.tail()

index,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND
1754150,1754150,2018-09-30 23:56:36,161,655,54.24,15810996,182,0,0,1
1754151,1754151,2018-09-30 23:57:38,4342,6181,1.23,15811058,182,0,0,1
1754152,1754152,2018-09-30 23:58:21,618,1502,6.62,15811101,182,0,0,1
1754153,1754153,2018-09-30 23:59:52,4056,3067,55.4,15811192,182,0,0,1
1754154,1754154,2018-09-30 23:59:57,3542,9849,23.59,15811197,182,0,0,1


In [12]:
# Follow the same logic for the night feature: 1 if the hour is between 0 and 6 inclusive, else 0
def is_night(tx_datetime):
  tx_hour = tx_datetime.hour

  return int(tx_hour <= 6)

In [13]:
%time transactions_df['TX_DURING_NIGHT'] = transactions_df['TX_DATETIME'].apply(is_night)

CPU times: user 4.03 s, sys: 154 ms, total: 4.19 s
Wall time: 4.2 s
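
The two `apply` calls above run a Python function once per row. As a hedged alternative sketch, the same two features can be computed with the vectorized `.dt` accessor of pandas. The small `demo` frame below is made up for illustration; on the real data the same expressions would be applied to `transactions_df['TX_DATETIME']`.

```python
import pandas as pd

# Toy frame standing in for transactions_df (illustrative values only)
demo = pd.DataFrame({"TX_DATETIME": pd.to_datetime([
    "2018-04-01 00:00:31",  # Sunday, hour 0
    "2018-04-02 08:51:06",  # Monday, hour 8
    "2018-09-30 23:59:57",  # Sunday, hour 23
])})

# dayofweek: 0 is Monday, 6 is Sunday, so >= 5 means weekend
demo["TX_DURING_WEEKEND"] = (demo["TX_DATETIME"].dt.dayofweek >= 5).astype(int)

# Same night definition as is_night: hours 0 through 6 inclusive
demo["TX_DURING_NIGHT"] = (demo["TX_DATETIME"].dt.hour <= 6).astype(int)

print(demo)
```

Both versions are meant to produce identical columns; the vectorized form simply avoids the per-row function call.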


In [15]:
# Check that these features are correctly implemented
transactions_df[transactions_df.TX_TIME_DAYS >= 30]

index,TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,TX_DURING_NIGHT
288062,288062,2018-05-01 00:01:21,3546,2944,18.71,2592081,30,0,0,0,1
288063,288063,2018-05-01 00:01:48,206,3521,18.60,2592108,30,0,0,0,1
288064,288064,2018-05-01 00:02:22,2610,4470,66.67,2592142,30,0,0,0,1
288065,288065,2018-05-01 00:03:15,4578,1520,79.41,2592195,30,0,0,0,1
288066,288066,2018-05-01 00:03:51,1246,7809,52.08,2592231,30,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...
1754150,1754150,2018-09-30 23:56:36,161,655,54.24,15810996,182,0,0,1,0
1754151,1754151,2018-09-30 23:57:38,4342,6181,1.23,15811058,182,0,0,1,0
1754152,1754152,2018-09-30 23:58:21,618,1502,6.62,15811101,182,0,0,1,0
1754153,1754153,2018-09-30 23:59:52,4056,3067,55.40,15811192,182,0,0,1,0


#### Customer ID transformations
We will take inspiration from the RFM (Recency, Frequency, Monetary value) framework and compute two of these features over three time windows.
1. The first feature is the number of transactions that occur within a time window (Frequency).
2. The second feature is the average amount spent in these transactions (Monetary value).

The time windows are set to one, seven, and 30 days. This generates six new features in total. Note that these time windows could later be optimized along with the models using a model selection procedure.
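
To see how a time-based rolling window behaves, here is a minimal sketch on three made-up transactions of a single customer. With a `DatetimeIndex`, `rolling("1D")` looks back one day from each row and, by default, includes the current row while excluding a row exactly one day earlier. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Three illustrative transactions of one customer, indexed by time
toy = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime([
        "2018-04-01 07:00", "2018-04-01 18:00", "2018-04-02 08:00",
    ]),
    "TX_AMOUNT": [10.0, 20.0, 30.0],
}).set_index("TX_DATETIME")

amounts = toy["TX_AMOUNT"]

# Frequency: number of transactions in the preceding day (current row included)
nb_tx_1d = amounts.rolling("1D").count()

# Monetary value: average amount over the same window
avg_amount_1d = amounts.rolling("1D").mean()

print(pd.DataFrame({"NB_TX_1DAY": nb_tx_1d, "AVG_AMOUNT_1DAY": avg_amount_1d}))
```

The third transaction (08:00 on April 2) counts only two transactions, because the one at 07:00 on April 1 is more than one day old.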


In [16]:
# Write a function that takes the set of transactions of a customer and a set of window sizes as inputs,
# and returns a dataframe with the six new features. We will use the pandas rolling function.

def get_customer_spending_behavior_feat(customer_transactions, windows_size_in_days=(1, 7, 30)):

  # Let's first order transactions chronologically
  customer_transactions = customer_transactions.sort_values('TX_DATETIME')

  # Set date and time as index, to use the rolling functions
  customer_transactions.index = customer_transactions.TX_DATETIME

  # For each window size
  for window_size in windows_size_in_days:

    # Compute the sum of the transaction amounts and the number of transactions
    SUM_AMOUNT_TX_WINDOW = customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'D').sum()
    NB_TX_WINDOW = customer_transactions['TX_AMOUNT'].rolling(str(window_size)+'D').count()

    # Compute the average amount
    AVG_AMOUNT_TX_WINDOW = SUM_AMOUNT_TX_WINDOW / NB_TX_WINDOW

    # Save feature values
    customer_transactions['CUSTOMER_ID_NB_TX_' + str(window_size) + 'DAY_WINDOW'] = list(NB_TX_WINDOW)
    customer_transactions['CUSTOMER_ID_AVG_AMOUNT_' + str(window_size) + 'DAY_WINDOW'] = list(AVG_AMOUNT_TX_WINDOW)

  # Reindex according to transaction IDs
  customer_transactions.index = customer_transactions.TRANSACTION_ID

  # Return the dataframe with the new features
  return customer_transactions


In [17]:
# Let's compute the aggregates for the first customer
spending_behavior_customer_0 = get_customer_spending_behavior_feat(transactions_df[transactions_df.CUSTOMER_ID==0])

spending_behavior_customer_0

TRANSACTION_ID (index),TRANSACTION_ID,TX_DATETIME,CUSTOMER_ID,TERMINAL_ID,TX_AMOUNT,TX_TIME_SECONDS,TX_TIME_DAYS,TX_FRAUD,TX_FRAUD_SCENARIO,TX_DURING_WEEKEND,TX_DURING_NIGHT,CUSTOMER_ID_NB_TX_1DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_1DAY_WINDOW,CUSTOMER_ID_NB_TX_7DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_7DAY_WINDOW,CUSTOMER_ID_NB_TX_30DAY_WINDOW,CUSTOMER_ID_AVG_AMOUNT_30DAY_WINDOW
1758,1758,2018-04-01 07:19:05,0,6076,123.59,26345,0,0,0,1,0,1.0,123.590000,1.0,123.590000,1.0,123.590000
8275,8275,2018-04-01 18:00:16,0,858,77.34,64816,0,0,0,1,0,2.0,100.465000,2.0,100.465000,2.0,100.465000
8640,8640,2018-04-01 19:02:02,0,6698,46.51,68522,0,0,0,1,0,3.0,82.480000,3.0,82.480000,3.0,82.480000
12169,12169,2018-04-02 08:51:06,0,6569,54.72,118266,1,0,0,0,0,3.0,59.523333,4.0,75.540000,4.0,75.540000
15764,15764,2018-04-02 14:05:38,0,7707,63.30,137138,1,0,0,0,0,4.0,60.467500,5.0,73.092000,5.0,73.092000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1750390,1750390,2018-09-30 13:38:41,0,3096,38.23,15773921,182,0,0,1,0,5.0,64.388000,28.0,57.306429,89.0,63.097640
1750758,1750758,2018-09-30 14:10:21,0,9441,43.60,15775821,182,0,0,1,0,6.0,60.923333,29.0,56.833793,89.0,62.433933
1751039,1751039,2018-09-30 14:34:30,0,1138,69.69,15777270,182,0,0,1,0,7.0,62.175714,29.0,57.872414,90.0,62.514556
1751272,1751272,2018-09-30 14:54:59,0,9441,91.26,15778499,182,0,0,1,0,8.0,65.811250,30.0,58.985333,90.0,61.882333


In [18]:
# Let's then generate the features for all customers. This is easy with the pandas groupby and apply methods
%time transactions_df = transactions_df.groupby('CUSTOMER_ID').apply(lambda x: get_customer_spending_behavior_feat(x))
transactions_df = transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)

CPU times: user 41.1 s, sys: 863 ms, total: 42 s
Wall time: 43.2 s


#### Terminal ID transformations
The main goal is to extract a risk score that assesses the exposure of a given terminal ID to fraudulent transactions.
The risk score is defined as the average number of fraudulent transactions that occurred on a terminal ID over a time window, that is, the fraction of the terminal's transactions in that window that were fraudulent.
Unlike the customer ID transformations, the time windows will not directly precede a given transaction. Instead, they will be shifted back by a `delay period`. This accounts for the fact that, in practice, fraudulent transactions are only discovered after a fraud investigation or a customer complaint. Hence, the fraud labels, which are needed to compute the risk score, are only available after this delay period. As a first approximation, this delay period will be set to one week.
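
The shifted window can be obtained by subtracting two rolling windows that both end at the current transaction: one covering the delay period plus the window, and one covering the delay period only. The sketch below checks this on five made-up transactions of one terminal, with a 7-day delay and a 7-day window. The frame `tx` and its values are illustrative assumptions.

```python
import pandas as pd

# Five illustrative transactions of one terminal (1 = fraud)
tx = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime([
        "2018-04-01 12:00", "2018-04-03 12:00", "2018-04-06 12:00",
        "2018-04-10 12:00", "2018-04-11 12:00",
    ]),
    "TX_FRAUD": [1, 0, 1, 0, 0],
}).set_index("TX_DATETIME")

delay_period, window_size = 7, 7
fraud = tx["TX_FRAUD"]

# Counts over the delay period only (labels not yet available in practice)
nb_fraud_delay = fraud.rolling(f"{delay_period}D").sum()
nb_tx_delay = fraud.rolling(f"{delay_period}D").count()

# Counts over the delay period plus the window
nb_fraud_total = fraud.rolling(f"{delay_period + window_size}D").sum()
nb_tx_total = fraud.rolling(f"{delay_period + window_size}D").count()

# The difference keeps only the window that ends delay_period days ago
nb_fraud_window = nb_fraud_total - nb_fraud_delay
nb_tx_window = nb_tx_total - nb_tx_delay

# Risk score; 0/0 gives NaN when the window is empty, replaced by 0
risk = (nb_fraud_window / nb_tx_window).fillna(0)
print(risk)
```

For the last transaction (April 11), the shifted window contains the transactions of April 1 (fraud) and April 3 (genuine), so the risk score is 1/2. For the first transaction the window is empty and the score falls back to 0.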

In [19]:
def get_count_risk_rolling_window(terminal_transactions, delay_period=7, windows_size_in_days=(1, 7, 30), feature='TERMINAL_ID'):

  # Sort transactions chronologically and index them by date and time
  terminal_transactions = terminal_transactions.sort_values('TX_DATETIME')
  terminal_transactions.index = terminal_transactions.TX_DATETIME

  # Number of frauds and number of transactions within the delay period
  NB_FRAUD_DELAY = terminal_transactions['TX_FRAUD'].rolling(str(delay_period) + 'D').sum()
  NB_TX_DELAY = terminal_transactions['TX_FRAUD'].rolling(str(delay_period) + 'D').count()

  for window_size in windows_size_in_days:

    # Same counts over the delay period plus the window
    NB_FRAUD_DELAY_WINDOW = terminal_transactions['TX_FRAUD'].rolling(str(delay_period + window_size) + 'D').sum()
    NB_TX_DELAY_WINDOW = terminal_transactions['TX_FRAUD'].rolling(str(delay_period + window_size) + 'D').count()

    # Subtract the delay-period counts to keep only the shifted window
    NB_FRAUD_WINDOW = NB_FRAUD_DELAY_WINDOW - NB_FRAUD_DELAY
    NB_TX_WINDOW = NB_TX_DELAY_WINDOW - NB_TX_DELAY

    # Risk score: fraction of fraudulent transactions in the shifted window
    RISK_WINDOW = NB_FRAUD_WINDOW / NB_TX_WINDOW

    terminal_transactions[feature+'_NB_TX_' + str(window_size)+'DAY_WINDOW'] = list(NB_TX_WINDOW)
    terminal_transactions[feature+'_RISK_' + str(window_size) + 'DAY_WINDOW'] = list(RISK_WINDOW)

  terminal_transactions.index = terminal_transactions.TRANSACTION_ID
  # Replace NA values with 0 (all undefined risk scores where NB_TX_WINDOW = 0)
  terminal_transactions.fillna(0, inplace=True)

  return terminal_transactions



In [20]:
# let's get the features for all terminals
%time transactions_df = transactions_df.groupby('TERMINAL_ID').apply(lambda x: get_count_risk_rolling_window(x))
transactions_df = transactions_df.sort_values('TX_DATETIME').reset_index(drop=True)

CPU times: user 1min 36s, sys: 420 ms, total: 1min 36s
Wall time: 1min 38s


In [22]:
import os
import datetime
# Now we can save our dataset
DIR_OUTPUT = './simulated-data-transformed/'

if not os.path.exists(DIR_OUTPUT):
  os.makedirs(DIR_OUTPUT)

start_date = datetime.datetime.strptime('2018-04-01', '%Y-%m-%d')
max_days = (transactions_df.TX_DATETIME.max() - start_date).days

for day in range(max_days + 1):

  transactions_day = transactions_df[transactions_df.TX_TIME_DAYS==day].sort_values('TX_TIME_SECONDS')

  date = start_date + datetime.timedelta(days=day)
  filename_output = date.strftime('%Y-%m-%d') + '.pkl'

  # Protocol = 4 required for Google Colab
  transactions_day.to_pickle(DIR_OUTPUT + filename_output, protocol=4)
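
Since the transformed data are saved as one pickle file per day, named `YYYY-MM-DD.pkl`, they can be read back by filtering on the file names. The helper below is a hypothetical sketch (`load_transformed` is not part of the notebook or of `shared_functions.py`), demonstrated on a temporary directory with toy files so that it does not depend on the real output.

```python
import os
import tempfile

import pandas as pd


def load_transformed(dir_input, begin_date, end_date):
    """Concatenate the daily .pkl files dated within [begin_date, end_date].

    Hypothetical helper: assumes files are named 'YYYY-MM-DD.pkl', so that
    comparing names as strings matches chronological order.
    """
    files = sorted(
        f for f in os.listdir(dir_input)
        if f.endswith(".pkl") and begin_date <= f[:-4] <= end_date
    )
    frames = [pd.read_pickle(os.path.join(dir_input, f)) for f in files]
    return pd.concat(frames, ignore_index=True)


# Round trip on toy data in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    for day, tx_id in [("2018-04-01", 0), ("2018-04-02", 1), ("2018-04-03", 2)]:
        toy_day = pd.DataFrame({"TRANSACTION_ID": [tx_id]})
        toy_day.to_pickle(os.path.join(tmp_dir, day + ".pkl"), protocol=4)

    reloaded = load_transformed(tmp_dir, "2018-04-01", "2018-04-02")

print(reloaded)
```

On the real output, the call would presumably be `load_transformed(DIR_OUTPUT, '2018-04-01', '2018-09-30')`.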