# **Feature Engineering**

## **Why transform the features?**

The simulated dataset generated previously is not yet suitable for training a machine learning model for classification and prediction. Most machine learning algorithms expect numerical or categorical features, and in the current dataset only the fraud label and the transaction amount meet that requirement. The goal of this section is to derive new features that can be used for predictive modeling. Drawing on several reference sources, the following feature transformations will be implemented:

- **One-hot encoding**: This transformation converts the date and time variable into binary features that flag significant time periods. One feature indicates whether a transaction occurred during the day or at night, and another whether it took place on a weekday or on a weekend. These features matter in real-world scenarios because fraud patterns often differ between day and night and between weekdays and weekends.

- **RFM (Recency, Frequency, Monetary value)**: This transformation uses the customer ID to create features that highlight customer spending behaviors, following the RFM framework proposed in [this study](https://www.sciencedirect.com/science/article/abs/pii/S0167923615000846).

- **Risk Encoding**: This transformation creates features that capture the risk associated with a terminal. The risk is the fraction of a terminal's transactions that were fraudulent, computed over three time windows.

## **One-hot Encoding**

In [1]:
import pandas as pd

from generator_modules.load_dataset import load_ds

PATH: str = './data/raw/'

BEGIN: str = '2024-07-02'
END: str = '2024-12-31'

print("Reading from transactions...")
transactions_df: pd.DataFrame = load_ds(PATH, BEGIN, END)
print("Transactions read from files: ", len(transactions_df))
print("Fraudulent transactions: ", transactions_df.IS_FRAUD.sum())

Reading from transactions...


Transactions read from files:  1772188
Fraudulent transactions:  14045


In [2]:
transactions_df.head()

| | TRX_ID | TRX_DATETIME | CLIENT_ID | TERMINAL_ID | TRX_AMOUNT | TRX_SECONDS | TRX_DAYS | IS_FRAUD | FRAUD_SCENARIO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2024-07-02 00:02:29 | 834 | 3470 | 121.71 | 149 | 0 | 0 | 0 |
| 1 | 1 | 2024-07-02 00:03:12 | 53 | 8823 | 38.52 | 192 | 0 | 0 | 0 |
| 2 | 2 | 2024-07-02 00:04:18 | 1615 | 4188 | 33.51 | 258 | 0 | 0 | 0 |
| 3 | 3 | 2024-07-02 00:07:08 | 815 | 215 | 72.42 | 428 | 0 | 0 | 0 |
| 4 | 4 | 2024-07-02 00:07:56 | 2 | 660 | 76.37 | 476 | 0 | 0 | 0 |


Now that the dataset has been loaded, the first method, **weekend_indicator**, can be applied. It creates a new feature, ON_WEEKEND, that states for every transaction whether it occurred on a weekend (0 = weekday, 1 = weekend):

In [3]:
from generator_modules.transformationClass import Transform

transform: Transform = Transform()

%time transactions_df['ON_WEEKEND'] = transactions_df.TRX_DATETIME.apply(transform.weekend_indicator)

CPU times: total: 1.41 s
Wall time: 1.82 s


Similarly, the method **night_indicator** creates a new feature, ON_NIGHT, that states whether a transaction happened at night (0 = day, 1 = night):

In [4]:
%time transactions_df['ON_NIGHT'] = transactions_df.TRX_DATETIME.apply(transform.night_indicator)

CPU times: total: 6.94 s
Wall time: 8.24 s


In [5]:
transactions_df[transactions_df.ON_WEEKEND == 1]

| | TRX_ID | TRX_DATETIME | CLIENT_ID | TERMINAL_ID | TRX_AMOUNT | TRX_SECONDS | TRX_DAYS | IS_FRAUD | FRAUD_SCENARIO | ON_WEEKEND | ON_NIGHT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 38563 | 38563 | 2024-07-06 00:00:53 | 680 | 7069 | 29.16 | 345653 | 4 | 0 | 0 | 1 | 1 |
| 38564 | 38564 | 2024-07-06 00:06:19 | 2587 | 7499 | 108.00 | 345979 | 4 | 0 | 0 | 1 | 1 |
| 38565 | 38565 | 2024-07-06 00:07:21 | 3996 | 4476 | 84.33 | 346041 | 4 | 0 | 0 | 1 | 1 |
| 38566 | 38566 | 2024-07-06 00:08:15 | 4211 | 589 | 92.03 | 346095 | 4 | 0 | 0 | 1 | 1 |
| 38567 | 38567 | 2024-07-06 00:08:21 | 2587 | 4255 | 59.36 | 346101 | 4 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1752833 | 1752833 | 2024-12-29 23:56:59 | 1052 | 3598 | 51.62 | 15638219 | 180 | 0 | 0 | 1 | 0 |
| 1752834 | 1752834 | 2024-12-29 23:57:08 | 2688 | 4388 | 25.30 | 15638228 | 180 | 0 | 0 | 1 | 0 |
| 1752835 | 1752835 | 2024-12-29 23:58:49 | 1926 | 7482 | 53.34 | 15638329 | 180 | 0 | 0 | 1 | 0 |
| 1752836 | 1752836 | 2024-12-29 23:58:54 | 3873 | 3279 | 61.70 | 15638334 | 180 | 0 | 0 | 1 | 0 |


The methods have been successfully applied.
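The `Transform` class itself is not shown in this notebook, so the following is only a minimal sketch of how the two indicators might be written. It assumes that "night" means the hours 0 to 6 inclusive (a threshold chosen to agree with the sample rows shown in this notebook, not a confirmed setting) and that Saturday and Sunday count as the weekend. The `night_end_hour` parameter and the `sample` DataFrame are hypothetical.

```python
import pandas as pd


def weekend_indicator(trx_datetime: pd.Timestamp) -> int:
    """Return 1 if the transaction falls on a Saturday or Sunday, else 0."""
    # pandas numbers weekdays from 0 (Monday) to 6 (Sunday)
    return int(trx_datetime.weekday() >= 5)


def night_indicator(trx_datetime: pd.Timestamp, night_end_hour: int = 6) -> int:
    """Return 1 if the hour lies between 0 and night_end_hour inclusive, else 0."""
    # night_end_hour is an assumed threshold, not taken from the Transform class
    return int(trx_datetime.hour <= night_end_hour)


# Hypothetical sample built from timestamps that appear in the outputs
sample = pd.DataFrame({
    "TRX_DATETIME": pd.to_datetime([
        "2024-07-03 06:39:44",  # Wednesday, early morning
        "2024-07-06 00:00:53",  # Saturday, just after midnight
        "2024-12-29 23:56:59",  # Sunday, late evening
    ])
})

# Row-wise version, mirroring the .apply calls used in the cells above
sample["ON_WEEKEND"] = sample.TRX_DATETIME.apply(weekend_indicator)
sample["ON_NIGHT"] = sample.TRX_DATETIME.apply(night_indicator)

# Vectorised equivalent through the .dt accessor
weekend_fast = (sample.TRX_DATETIME.dt.dayofweek >= 5).astype(int)
night_fast = (sample.TRX_DATETIME.dt.hour <= 6).astype(int)
```

On a dataset of this size, the vectorised `.dt` form would likely be noticeably faster than the row-wise `.apply`, since it avoids one Python function call per transaction.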

## **RFM (Recency, Frequency, Monetary value)**

Moving on to the CLIENT_ID transformations, the approach follows the RFM framework introduced earlier. Two features are computed over three time windows. The first counts the transactions within each window (frequency). The second is the average amount spent in those transactions (monetary value). The windows are set to one, seven, and thirty days, which yields six new features. The function **analyse_customer_spending** creates these features for each customer; it is first tested on a single client:

In [6]:
client_behaviour: pd.DataFrame = transform.analyse_customer_spending(transactions_df[transactions_df.CLIENT_ID==0])
client_behaviour

| | TRX_DATETIME | TRX_ID | CLIENT_ID | TERMINAL_ID | TRX_AMOUNT | TRX_SECONDS | TRX_DAYS | IS_FRAUD | FRAUD_SCENARIO | ON_WEEKEND | ON_NIGHT | CLIENT_TX_1DAY_WINDOW | CLIENT_MEAN_1DAY_WINDOW | CLIENT_TX_7DAY_WINDOW | CLIENT_MEAN_7DAY_WINDOW | CLIENT_TX_30DAY_WINDOW | CLIENT_MEAN_30DAY_WINDOW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-07-02 11:25:35 | 4333 | 0 | 782 | 91.71 | 41135 | 0 | 0 | 0 | 0 | 0 | 1.0 | 91.710000 | 1.0 | 91.710000 | 1.0 | 91.710000 |
| 1 | 2024-07-02 17:16:41 | 7998 | 0 | 7878 | 70.33 | 62201 | 0 | 0 | 0 | 0 | 0 | 2.0 | 81.020000 | 2.0 | 81.020000 | 2.0 | 81.020000 |
| 2 | 2024-07-02 22:22:31 | 9367 | 0 | 7578 | 38.91 | 80551 | 0 | 0 | 0 | 0 | 0 | 3.0 | 66.983333 | 3.0 | 66.983333 | 3.0 | 66.983333 |
| 3 | 2024-07-03 06:39:44 | 11052 | 0 | 7829 | 90.42 | 110384 | 1 | 0 | 0 | 0 | 1 | 4.0 | 72.842500 | 4.0 | 72.842500 | 4.0 | 72.842500 |
| 4 | 2024-07-03 09:55:10 | 12877 | 0 | 9190 | 86.55 | 122110 | 1 | 0 | 0 | 0 | 0 | 5.0 | 75.584000 | 5.0 | 75.584000 | 5.0 | 75.584000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 249 | 2024-12-29 11:56:17 | 1747964 | 0 | 5840 | 76.52 | 15594977 | 180 | 0 | 0 | 1 | 0 | 6.0 | 69.056667 | 14.0 | 66.605000 | 41.0 | 78.734634 |
| 250 | 2024-12-30 14:45:46 | 1759509 | 0 | 782 | 56.87 | 15691546 | 181 | 0 | 0 | 0 | 0 | 1.0 | 56.870000 | 13.0 | 62.437692 | 40.0 | 76.380250 |
| 251 | 2024-12-31 13:24:41 | 1768288 | 0 | 5695 | 18.30 | 15773081 | 182 | 0 | 0 | 0 | 0 | 2.0 | 37.585000 | 11.0 | 53.445455 | 41.0 | 74.963659 |
| 252 | 2024-12-31 14:02:20 | 1768757 | 0 | 6425 | 93.98 | 15775340 | 182 | 0 | 0 | 0 | 0 | 3.0 | 56.383333 | 12.0 | 56.823333 | 42.0 | 75.416429 |


Now that the method has been checked on a single client, it is applied to all transactions in the dataset:

In [7]:
%time transactions_df = transactions_df.groupby('CLIENT_ID').apply(lambda x: transform.analyse_customer_spending(x, [1, 7, 30]))
transactions_df = transactions_df.sort_values('TRX_DATETIME').reset_index(drop=True)



CPU times: total: 9.72 s
Wall time: 12.3 s
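For reference, the per-client windows described in this section can be sketched with pandas time-based rolling windows. This is an illustration under stated assumptions, not the actual body of `analyse_customer_spending`, which is not shown here: the function name `customer_spending_sketch` and the `demo` DataFrame are hypothetical, and each window is assumed to cover the given number of days up to and including the current transaction.

```python
import pandas as pd


def customer_spending_sketch(client_df: pd.DataFrame,
                             windows_days=(1, 7, 30)) -> pd.DataFrame:
    """Add transaction-count and mean-amount features for each time window."""
    client_df = client_df.sort_values("TRX_DATETIME").copy()
    # Time-based rolling windows need a sorted datetime index
    amounts = client_df.set_index("TRX_DATETIME")["TRX_AMOUNT"]
    for days in windows_days:
        rolling = amounts.rolling(f"{days}D")
        tx_count = rolling.count()               # frequency
        mean_amount = rolling.sum() / tx_count   # monetary value
        # to_numpy() writes the values back by position, ignoring the index
        client_df[f"CLIENT_TX_{days}DAY_WINDOW"] = tx_count.to_numpy()
        client_df[f"CLIENT_MEAN_{days}DAY_WINDOW"] = mean_amount.to_numpy()
    return client_df


# Hypothetical demo using the first five transactions shown for client 0
demo = pd.DataFrame({
    "TRX_DATETIME": pd.to_datetime([
        "2024-07-02 11:25:35", "2024-07-02 17:16:41", "2024-07-02 22:22:31",
        "2024-07-03 06:39:44", "2024-07-03 09:55:10",
    ]),
    "TRX_AMOUNT": [91.71, 70.33, 38.91, 90.42, 86.55],
})
demo_features = customer_spending_sketch(demo)
```

On these five rows the sketch yields one-day counts of 1 through 5, in line with the sample output for client 0, because every transaction lies within 24 hours of the fifth one.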


## **Risk Encoding**

Moving on to the TERMINAL_ID transformations, the objective is to extract a risk score that evaluates a terminal's exposure to fraudulent transactions. The risk score is defined as the fraction of fraudulent transactions among all transactions at a terminal within a given time window (equivalently, the mean of IS_FRAUD over that window). As with the CLIENT_ID transformations, three window sizes are used: 1, 7, and 30 days.

Unlike the CLIENT_ID windows, the TERMINAL_ID windows do not directly precede the transaction; they are shifted back by a delay period. The delay reflects the fact that fraudulent transactions are typically identified only after an investigation or a customer complaint, so the fraud labels needed to compute the risk score become available only after that delay. As a first approximation, the delay period is set to one week.

The risk scores are computed by a function called **analyse_terminal_risk**, which takes as inputs the transaction DataFrame for a given terminal ID, the delay period, and a list of window sizes. In the first stage, it counts the transactions and the fraudulent transactions within the delay period. In the second stage, it computes the same counts for each window size plus the delay period (`total_tx_window` and `risk_window`). The counts for a window shifted back by the delay are then obtained as a difference: the counts over the window plus the delay, minus the counts over the delay alone. In addition to the risk score, the method returns the number of transactions per time window:

In [8]:
%time transactions_df = transactions_df.groupby('TERMINAL_ID').apply(lambda x: transform.analyse_terminal_risk(x, 7, [1, 7, 30], 'TERMINAL_ID'))
transactions_df=transactions_df.sort_values('TRX_DATETIME').reset_index(drop=True)



CPU times: total: 25.3 s
Wall time: 30.8 s
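The two-stage computation described in this section can be sketched as follows. This is a hedged illustration rather than the actual `analyse_terminal_risk` implementation, which is not shown in this notebook: the function name `terminal_risk_sketch`, the `terminal_demo` DataFrame, and the choice to report a risk of 0 when a window contains no transactions are all assumptions.

```python
import pandas as pd


def terminal_risk_sketch(terminal_df: pd.DataFrame, delay_days: int = 7,
                         windows_days=(1, 7, 30),
                         feature: str = "TERMINAL_ID") -> pd.DataFrame:
    """Add delayed transaction counts and risk scores for each time window."""
    terminal_df = terminal_df.sort_values("TRX_DATETIME").copy()
    is_fraud = terminal_df.set_index("TRX_DATETIME")["IS_FRAUD"]

    # Stage 1: counts inside the delay period, where labels are not yet known
    fraud_delay = is_fraud.rolling(f"{delay_days}D").sum()
    tx_delay = is_fraud.rolling(f"{delay_days}D").count()

    for days in windows_days:
        # Stage 2: counts over delay + window, then subtract the delay part
        span = f"{delay_days + days}D"
        fraud_window = is_fraud.rolling(span).sum() - fraud_delay
        tx_window = is_fraud.rolling(span).count() - tx_delay
        # An empty window gives 0/0 (NaN); assumed here to mean a risk of 0
        risk = (fraud_window / tx_window).fillna(0)
        terminal_df[f"{feature}_TX_{days}DAY_WINDOW"] = tx_window.to_numpy()
        terminal_df[f"{feature}_RISK_SCORE_{days}DAY_WINDOW"] = risk.to_numpy()
    return terminal_df


# Hypothetical terminal with four transactions, the first one fraudulent
terminal_demo = pd.DataFrame({
    "TRX_DATETIME": pd.to_datetime([
        "2024-07-01 10:00:00", "2024-07-02 10:00:00",
        "2024-07-05 10:00:00", "2024-07-09 12:00:00",
    ]),
    "IS_FRAUD": [1, 0, 0, 0],
})
terminal_features = terminal_risk_sketch(terminal_demo)
```

For the last transaction in this demo, the 7-day delay period contains two transactions, while the 14-day span (delay plus a 7-day window) contains all four, one of them fraudulent. The delayed 7-day window therefore holds two transactions and has a risk score of 0.5.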


In [9]:
transactions_df = transactions_df.set_index('TRX_ID').reset_index(drop=False)
transactions_df = transactions_df.sort_values('TRX_ID')
transactions_df = transactions_df.sort_index()
transactions_df[transactions_df.TERMINAL_ID_RISK_SCORE_30DAY_WINDOW>0]

| | TRX_ID | TRX_DATETIME | CLIENT_ID | TERMINAL_ID | TRX_AMOUNT | TRX_SECONDS | TRX_DAYS | IS_FRAUD | FRAUD_SCENARIO | ON_WEEKEND | ... | CLIENT_TX_7DAY_WINDOW | CLIENT_MEAN_7DAY_WINDOW | CLIENT_TX_30DAY_WINDOW | CLIENT_MEAN_30DAY_WINDOW | TERMINAL_ID_TX_1DAY_WINDOW | TERMINAL_ID_RISK_SCORE_1DAY_WINDOW | TERMINAL_ID_TX_7DAY_WINDOW | TERMINAL_ID_RISK_SCORE_7DAY_WINDOW | TERMINAL_ID_TX_30DAY_WINDOW | TERMINAL_ID_RISK_SCORE_30DAY_WINDOW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82082 | 82082 | 2024-07-10 11:18:23 | 1660 | 9953 | 24.00 | 731903 | 8 | 1 | 2 | 0 | ... | 24.0 | 50.355000 | 30.0 | 51.327333 | 2.0 | 1.0 | 3.0 | 0.666667 | 3.0 | 0.666667 |
| 83454 | 83454 | 2024-07-10 13:10:51 | 4229 | 5636 | 68.13 | 738651 | 8 | 0 | 0 | 0 | ... | 21.0 | 46.891429 | 30.0 | 47.307667 | 0.0 | 0.0 | 2.0 | 0.500000 | 2.0 | 0.500000 |
| 87127 | 87127 | 2024-07-10 21:44:14 | 4255 | 6148 | 7.80 | 769454 | 8 | 0 | 0 | 0 | ... | 20.0 | 7.101000 | 25.0 | 6.483200 | 0.0 | 0.0 | 2.0 | 0.500000 | 2.0 | 0.500000 |
| 88124 | 88124 | 2024-07-11 04:40:39 | 2011 | 9394 | 173.55 | 794439 | 9 | 1 | 2 | 0 | ... | 26.0 | 103.807308 | 32.0 | 95.088437 | 1.0 | 1.0 | 1.0 | 1.000000 | 1.0 | 1.000000 |
| 89971 | 89970 | 2024-07-11 08:44:25 | 445 | 5636 | 90.42 | 809065 | 9 | 0 | 0 | 0 | ... | 36.0 | 94.827778 | 43.0 | 93.797209 | 0.0 | 0.0 | 2.0 | 0.500000 | 2.0 | 0.500000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1772139 | 1772139 | 2024-12-31 23:24:57 | 78 | 569 | 9.86 | 15809097 | 182 | 0 | 0 | 0 | ... | 21.0 | 21.990476 | 80.0 | 21.635125 | 1.0 | 0.0 | 8.0 | 0.000000 | 31.0 | 0.032258 |
| 1772142 | 1772142 | 2024-12-31 23:26:21 | 4480 | 1014 | 3.91 | 15809181 | 182 | 0 | 0 | 0 | ... | 27.0 | 8.138889 | 97.0 | 8.555155 | 1.0 | 0.0 | 12.0 | 0.000000 | 42.0 | 0.023810 |
| 1772156 | 1772156 | 2024-12-31 23:41:12 | 2419 | 7769 | 18.14 | 15810072 | 182 | 0 | 0 | 0 | ... | 27.0 | 70.581481 | 84.0 | 89.157857 | 1.0 | 0.0 | 9.0 | 0.000000 | 36.0 | 0.027778 |
| 1772163 | 1772163 | 2024-12-31 23:46:20 | 1080 | 1294 | 110.36 | 15810380 | 182 | 0 | 0 | 0 | ... | 34.0 | 80.012647 | 123.0 | 99.027236 | 1.0 | 1.0 | 8.0 | 0.125000 | 23.0 | 0.043478 |


In [10]:
from generator_modules.save_dataset import save_ds

save_ds(transactions_df=transactions_df, path='./data/csv_transformed/', path1='./data/transformed/')