# Ibis Flink Backend Demo
## Project Scope
Build ML features with the Ibis Flink backend for transaction-level credit card fraud detection.
These features will be used to train a binary classification model for transaction fraud detection.
## Data Description
This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions from 1st Jan 2019 to 31st Dec 2020. It covers the credit cards of 1000 customers transacting with a pool of 800 merchants.

### Schema
```
trans_date_trans_time: str, MM/DD/YY HH:MM:SS format
cc_num: int
merchant: str, Name of merchant. Prepended with “fraud_” for some reason
category: str, Purchase category, e.g. entertainment, kids_pets, home, food_dining, etc.
amt: float, Transaction amount
first: str, first name
last: str, last name
gender: str, (M/F)
street: str,
city: str,
state: str, Two-letter representation of US state
zipcode: int, 4-5 digit zip code
latitude: float
longitude: float
city_pop: int, Population of city of buyer
job: str, Job of buyer
dob: str, MM/DD/YY format
trans_num: str, MD5 hash
unix_time: int, Event timestamp
merch_lat: float
merch_long: float
is_fraud: int, Event label 
```

## Features
This is not an exhaustive feature list; the aim is to showcase how to use Ibis Flink for feature engineering. Below are some sample features. Feel free to explore additional ideas:
- Transaction level features
    - Amt
    - Month of the year of this transaction
    - Day of week
    - Hour of the day
- Credit card level features
    - cc_num_{total, max, min, median}_amt_in_last_x(min, hour, day)
- User (use first name, last name, and dob as user identifier) level features
    - user_{total, max, min, median}_amt_in_last_x(min, hour, day)
    - user age
- {merchant, category, and region (zipcode here)} level features
    - {merchant, category, and region}_{total, max, min, median}_amt_in_last_x(min, hour, day)
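
As a minimal sketch of the transaction-level features listed above (amount, month, day of week, hour), the snippet below derives them with plain pandas rather than Ibis. The two-row `toy` frame and the output column names `month`, `day_of_week`, and `hour` are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical two-row sample; only `amt` and `ts` are needed here.
toy = pd.DataFrame(
    {
        "amt": [4.97, 107.23],
        "ts": pd.to_datetime(["2012-01-01 00:00:18", "2012-01-02 13:00:18"]),
    }
)

# Transaction-level features derived from the event timestamp.
transaction_features = pd.DataFrame(
    {
        "amt": toy["amt"],
        "month": toy["ts"].dt.month,            # 1 = January
        "day_of_week": toy["ts"].dt.dayofweek,  # Monday = 0, Sunday = 6
        "hour": toy["ts"].dt.hour,              # 0-23
    }
)

print(transaction_features)
```

These columns need no window or grouping, so they can be computed row by row before any of the aggregate features.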


## Ibis Flink Transformation

In [1]:

import ibis
import pandas as pd
from io import StringIO

In [2]:
import requests

# Specify S3 bucket and object key
bucket_name = "claypot-fraud-detection"
object_key = "FraudTransactions.csv"

# Generate the S3 URL
s3_url = f"https://{bucket_name}.s3.amazonaws.com/{object_key}"

# Send an HTTP GET request to download the file.
# raise_for_status() raises on any non-2xx response, so later cells never
# run against an undefined csv_data.
response = requests.get(s3_url, timeout=60)
response.raise_for_status()

# response.text now holds the content of the CSV file in memory
csv_data = response.text

In [3]:
csv_file = StringIO(csv_data)
df = pd.read_csv(csv_file)

In [4]:
len(df)

1604294

In [5]:

df['ts'] = pd.to_datetime(df['unix_time'], unit='ms')
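
A quick sanity check of the `unit='ms'` choice, using the first `unix_time` value that appears in the `df.head()` output of this notebook. This is only a sketch; the variable names are hypothetical. Note that the value decodes to 2012 while `trans_date_trans_time` for the same row reads 1/1/19, so the two columns appear to disagree on the year in this sample.

```python
import pandas as pd

# First unix_time value from the df.head() output in this notebook
raw_unix_time = 1325376018000

# Interpreted as milliseconds since the Unix epoch
ts_from_ms = pd.to_datetime(raw_unix_time, unit="ms")

# The same instant expressed in whole seconds decodes identically
ts_from_s = pd.to_datetime(raw_unix_time // 1000, unit="s")

print(ts_from_ms)  # 2012-01-01 00:00:18
```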

In [6]:
df.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,longitude,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud,ts
0,1/1/19 0:00,2703190000000000,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,-81.1781,3495,"Psychologist, counselling",3/9/88,0b242abb623afc578575680df30655b9,1325376018000,36.011293,-82.048315,0,2012-01-01 00:00:18
1,1/1/19 0:00,630423000000,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,-118.2105,149,Special educational needs teacher,6/21/78,1f76529f8574734946361c461b024d99,1325376044000,49.159047,-118.186462,0,2012-01-01 00:00:44
2,1/1/19 0:00,38859500000000,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,-112.262,4154,Nature conservation officer,1/19/62,a1a22d70485983eac12b5b88dad1cf95,1325376051000,43.150704,-112.154481,0,2012-01-01 00:00:51
3,1/1/19 0:01,3534090000000000,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,-112.1138,1939,Patent attorney,1/12/67,6b849c168bdad6f867558c3793159a81,1325376076000,47.034331,-112.561071,0,2012-01-01 00:01:16
4,1/1/19 0:03,375534000000000,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,-79.4629,99,Dance movement psychotherapist,3/28/86,a41d7549acf90789359a9aa5346dcb46,1325376186000,38.674999,-78.632459,0,2012-01-01 00:03:06


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1604294 entries, 0 to 1604293
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype         
---  ------                 --------------    -----         
 0   trans_date_trans_time  1604294 non-null  object        
 1   cc_num                 1604294 non-null  int64         
 2   merchant               1604294 non-null  object        
 3   category               1604294 non-null  object        
 4   amt                    1604294 non-null  float64       
 5   first                  1604294 non-null  object        
 6   last                   1604294 non-null  object        
 7   gender                 1604294 non-null  object        
 8   street                 1604294 non-null  object        
 9   city                   1604294 non-null  object        
 10  state                  1604294 non-null  object        
 11  zipcode                1604294 non-null  int64         
 12  latitude               16042

In [8]:
con = ibis.pandas.connect()
num_samples = 100000
tm = con.create_table("transaction", df[:num_samples])

In [9]:
def create_window_spec(group_by, order_by, interval_in_minutes):
    return ibis.window(
        group_by=group_by,
        order_by=order_by,
        range=(-ibis.interval(minutes=interval_in_minutes), 0),
    )

time_windows_in_minutes = [5, 60, 60*24]
ts_field = tm.ts

user_window_specs = [
    {
        "agg_level": "user", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.first, tm.last, tm.dob], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "credict_card", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.cc_num], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
]

context_window_specs = [
    {
        "agg_level": "zipcode", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.zipcode], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "merchant", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.merchant], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "category", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.category], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
]


user_context_window_specs = [
    {
        "agg_level": "user_category", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.first, tm.last, tm.dob, tm.category], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "user_merchant", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.first, tm.last, tm.dob, tm.merchant], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "credit_card_category", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.cc_num, tm.category], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
    {
        "agg_level": "credit_card_merchant", 
        "agg_col": "amt", 
        "agg_stats": ["sum", "min", "max", "median"],
        "group_by": [tm.cc_num, tm.merchant], 
        "order_by": ts_field, 
        "windows": time_windows_in_minutes # units: minutes
    },
]

def generate_dataset(base_list, window_specs):
    """Return `tm` projected onto base_list plus one windowed aggregate
    per (spec, window size, statistic) combination."""
    # Copy so the caller's list is not mutated across repeated calls
    columns = list(base_list)

    for spec in window_specs:
        agg_level = spec["agg_level"]
        agg_col = spec["agg_col"]

        for window_size in spec["windows"]:
            window_spec = create_window_spec(spec["group_by"], spec["order_by"], window_size)
            for stat in spec["agg_stats"]:
                # e.g. tm.amt.sum().over(window_spec), named "user_amt_sum_last_5min"
                feature = getattr(tm[agg_col], stat)().over(window_spec)
                columns.append(feature.name(f"{agg_level}_{agg_col}_{stat}_last_{window_size}min"))

    # Project the base columns together with all windowed features
    return tm[columns]
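
The naming scheme used by `generate_dataset` can be previewed without running any backend. The sketch below only enumerates the feature names in the same loop order (spec, then window, then statistic), using the `agg_level` values and window sizes from the specs above; `feature_names` is a hypothetical helper variable.

```python
from itertools import product

# agg_level values exactly as spelled in the window specs above
agg_levels = [
    "user", "credict_card",
    "zipcode", "merchant", "category",
    "user_category", "user_merchant",
    "credit_card_category", "credit_card_merchant",
]
windows_in_minutes = [5, 60, 60 * 24]
stats = ["sum", "min", "max", "median"]

# Same nesting as generate_dataset: spec -> window -> statistic
feature_names = [
    f"{level}_amt_{stat}_last_{window}min"
    for level, window, stat in product(agg_levels, windows_in_minutes, stats)
]

print(len(feature_names))  # 108
print(feature_names[0])    # user_amt_sum_last_5min
```

With 9 aggregation levels, 3 windows, and 4 statistics, the dataset gains 108 windowed columns on top of the base columns.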



In [10]:
base_list = [tm.is_fraud, tm.amt]
dataset = generate_dataset(base_list, user_window_specs + context_window_specs + user_context_window_specs)
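
To build intuition for what one of these trailing-window aggregates computes, here is a rough pandas equivalent of a 5-minute per-card sum on a hypothetical four-row sample. It is a cross-check sketch, not the Ibis implementation: `closed="both"` is an assumption made to mirror an inclusive range frame from 5 minutes before the current row up to the current row.

```python
import pandas as pd

# Hypothetical toy transactions; column names follow the dataset schema.
toy = pd.DataFrame(
    {
        "cc_num": [1, 1, 1, 2],
        "amt": [10.0, 20.0, 30.0, 5.0],
        "ts": pd.to_datetime(
            [
                "2019-01-01 00:00:00",
                "2019-01-01 00:03:00",
                "2019-01-01 00:10:00",
                "2019-01-01 00:04:00",
            ]
        ),
    }
)

# Per-card sum of `amt` over the trailing 5 minutes, current row included.
rolled_sum = (
    toy.sort_values("ts")          # time-based rolling needs a sorted index
    .set_index("ts")
    .groupby("cc_num")["amt"]
    .rolling("5min", closed="both")
    .sum()
)

print(rolled_sum)
```

For card 1 the second transaction sees both of the first two amounts, while the third one, seven minutes later, sees only itself. A row whose window contains no earlier transaction therefore has every aggregate equal to its own `amt`.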

## Feature importance

In [11]:
data = dataset.to_pandas()

In [12]:
data.head()

Unnamed: 0,is_fraud,amt,user_amt_sum_last_5min,user_amt_min_last_5min,user_amt_max_last_5min,user_amt_median_last_5min,user_amt_sum_last_60min,user_amt_min_last_60min,user_amt_max_last_60min,user_amt_median_last_60min,...,credit_card_merchant_amt_max_last_5min,credit_card_merchant_amt_median_last_5min,credit_card_merchant_amt_sum_last_60min,credit_card_merchant_amt_min_last_60min,credit_card_merchant_amt_max_last_60min,credit_card_merchant_amt_median_last_60min,credit_card_merchant_amt_sum_last_1440min,credit_card_merchant_amt_min_last_1440min,credit_card_merchant_amt_max_last_1440min,credit_card_merchant_amt_median_last_1440min
0,0,4.97,4.97,4.97,4.97,4.97,4.97,4.97,4.97,4.97,...,4.97,4.97,4.97,4.97,4.97,4.97,4.97,4.97,4.97,4.97
1,0,107.23,107.23,107.23,107.23,107.23,107.23,107.23,107.23,107.23,...,107.23,107.23,107.23,107.23,107.23,107.23,107.23,107.23,107.23,107.23
2,0,220.11,220.11,220.11,220.11,220.11,220.11,220.11,220.11,220.11,...,220.11,220.11,220.11,220.11,220.11,220.11,220.11,220.11,220.11,220.11
3,0,45.0,45.0,45.0,45.0,45.0,45.0,45.0,45.0,45.0,...,45.0,45.0,45.0,45.0,45.0,45.0,45.0,45.0,45.0,45.0
4,0,41.96,41.96,41.96,41.96,41.96,41.96,41.96,41.96,41.96,...,41.96,41.96,41.96,41.96,41.96,41.96,41.96,41.96,41.96,41.96


In [13]:
data.describe(percentiles = [x/10.0 for x in range(1, 10)])

Unnamed: 0,is_fraud,amt,user_amt_sum_last_5min,user_amt_min_last_5min,user_amt_max_last_5min,user_amt_median_last_5min,user_amt_sum_last_60min,user_amt_min_last_60min,user_amt_max_last_60min,user_amt_median_last_60min,...,credit_card_merchant_amt_max_last_5min,credit_card_merchant_amt_median_last_5min,credit_card_merchant_amt_sum_last_60min,credit_card_merchant_amt_min_last_60min,credit_card_merchant_amt_max_last_60min,credit_card_merchant_amt_median_last_60min,credit_card_merchant_amt_sum_last_1440min,credit_card_merchant_amt_min_last_1440min,credit_card_merchant_amt_max_last_1440min,credit_card_merchant_amt_median_last_1440min
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,...,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,0.0099,71.908232,73.126872,71.453025,72.590748,71.967009,84.105318,66.854844,77.463159,71.866361,...,71.908232,71.907974,71.954454,71.903333,71.915808,71.909571,72.37074,71.78637,72.054156,71.9243
std,0.099005,145.8954,156.663356,145.215753,154.468697,146.592976,195.75712,134.629571,165.313193,140.184658,...,145.8954,145.89544,146.115697,145.894803,145.914429,145.903801,147.39233,145.714094,146.155656,145.846657
min,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
10%,0.0,4.15,4.19,4.11,4.19,4.18,4.62,3.79,4.57,4.48,...,4.15,4.15,4.15,4.15,4.15,4.15,4.16,4.14,4.16,4.16
20%,0.0,7.78,7.87,7.7,7.86,7.84,8.78,7.09,8.58,8.36,...,7.78,7.78,7.78,7.77,7.78,7.78,7.8,7.76,7.79,7.78
30%,0.0,16.16,16.57,15.837,16.5,16.42,20.64,12.82,20.03,18.67,...,16.16,16.16,16.17,16.16,16.17,16.16,16.26,16.08,16.23,16.21
40%,0.0,32.766,33.236,32.37,33.14,32.96,37.83,28.46,36.99,34.45,...,32.766,32.766,32.78,32.76,32.78,32.776,32.89,32.69,32.86,32.82
50%,0.0,48.15,48.59,47.8,48.48,48.23,52.76,44.54,51.59,48.84,...,48.15,48.15,48.16,48.14,48.16,48.15,48.27,48.06,48.24,48.19
60%,0.0,61.554,62.05,61.19,61.91,61.58,66.73,58.02,65.04,61.822,...,61.554,61.554,61.57,61.544,61.564,61.55,61.76,61.45,61.67,61.61


### Estimate mutual information for a discrete target variable.

Mutual information (MI) [1] between two random variables is a non-negative
value that measures the dependency between the variables. It is equal
to zero if and only if the two random variables are independent, and higher
values mean higher dependency.

The function used in the next cell, `mutual_info_classif` from
`sklearn.feature_selection`, relies on nonparametric methods based on entropy
estimation from k-nearest neighbors distances as described in [2] and [3].
Both methods are based on the idea originally proposed in [4].

It can be used for univariate feature selection.

Relevant parameters:
- `discrete_features`: 'auto', bool or array-like, default='auto'
    - If bool, it determines whether to consider all features discrete or continuous.
    - If array, it should be either a boolean mask with shape (n_features,) or an array with indices of discrete features.
    - If 'auto', it is assigned to False for dense `X` and to True for sparse `X`.
- `n_neighbors`: int, default=3
    - Number of neighbors to use for MI estimation for continuous variables, see [2] and [3]. Higher values reduce the variance of the estimation, but could introduce a bias.

References:
- [1] Mutual Information on Wikipedia, https://en.wikipedia.org/wiki/Mutual_information
- [2] A. Kraskov, H. Stogbauer and P. Grassberger, "Estimating mutual information". Phys. Rev. E 69, 2004.
- [3] B. C. Ross, "Mutual Information between Discrete and Continuous Data Sets". PLoS ONE 9(2), 2014.
- [4] L. F. Kozachenko, N. N. Leonenko, "Sample Estimate of the Entropy of a Random Vector", Probl. Peredachi Inf., 23:2 (1987), 9-16.
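
To make the definition concrete, the sketch below computes MI for two discrete variables directly from the plug-in formula I(X;Y) = Σ p(x,y) · log(p(x,y) / (p(x) · p(y))), in nats. It is not the k-nearest-neighbors estimator described above, which is needed for continuous features; the function name `discrete_mutual_information` is hypothetical.

```python
from collections import Counter
from math import log


def discrete_mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) for two equally long discrete sequences."""
    n = len(xs)
    joint_counts = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    mi = 0.0
    for (x, y), count in joint_counts.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts
        ratio = count * n / (x_counts[x] * y_counts[y])
        mi += p_xy * log(ratio)
    return mi


# Independent variables: every (x, y) pair is equally likely, so MI is 0.
independent = discrete_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])

# Identical variables: MI equals the entropy of a fair coin, log(2).
identical = discrete_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])

print(independent, identical)
```

The two toy cases bracket the behavior described above: zero for independent variables and a positive value that grows with the strength of the dependency.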


In [14]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Name of the label column in the feature dataset
target_column = 'is_fraud'

# Split into features (X) and target (y)
X = data.drop(columns=[target_column])
y = data[target_column]

# Estimate the mutual information between each feature and the target
mutual_info = mutual_info_classif(X, y, n_neighbors=3, copy=True, random_state=42)

# Collect the scores in a DataFrame, one row per feature
info_gain_df = pd.DataFrame({'Feature': X.columns, 'Information_Gain': mutual_info})

# Sort by information gain, highest first
info_gain_df = info_gain_df.sort_values(by='Information_Gain', ascending=False)



In [15]:

info_gain_df.head(40)

| | Feature | Information_Gain |
|---|---|---|
| 11 | user_amt_max_last_1440min | 0.043177 |
| 35 | zipcode_amt_max_last_1440min | 0.043149 |
| 23 | credict_card_amt_max_last_1440min | 0.04306 |
| 31 | zipcode_amt_max_last_60min | 0.032973 |
| 19 | credict_card_amt_max_last_60min | 0.032822 |
| 7 | user_amt_max_last_60min | 0.032807 |
| 12 | user_amt_median_last_1440min | 0.032526 |
| 36 | zipcode_amt_median_last_1440min | 0.032522 |
| 24 | credict_card_amt_median_last_1440min | 0.031575 |
| 33 | zipcode_amt_sum_last_1440min | 0.030868 |
