# Backfill notebook

In [None]:
!pip install -U hopsworks --quiet

#### Data: The data comes in three Parquet files:
- `credit_cards.parquet`: Credit card information such as expiration date and provider.
- `transactions.parquet`: Transaction information such as the timestamp, location, and amount of each transaction. Importantly, a binary `fraud_label` variable tells us whether a transaction was fraudulent or not.
- `profiles.parquet`: Credit card user information such as birthdate and city of residence.

All of these files have a `cc_num` column, which serves as a natural join key.
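
As a minimal sketch of how `cc_num` ties the three sources together, the toy frames below stand in for the real files. All values and the `provider` and `city` column contents are made up for illustration; only the `cc_num` key is taken from the description above.

```python
import pandas as pd

# Toy stand-ins for the three files; the values are invented for illustration.
transactions = pd.DataFrame({
    "tid": ["t1", "t2", "t3"],
    "cc_num": [1111, 2222, 1111],
    "amount": [12.5, 80.0, 7.25],
})
profiles = pd.DataFrame({
    "cc_num": [1111, 2222],
    "city": ["Stockholm", "Lisbon"],
})
credit_cards = pd.DataFrame({
    "cc_num": [1111, 2222],
    "provider": ["visa", "mastercard"],
})

# Left joins keep every transaction and attach the matching profile and card rows.
joined = (
    transactions
    .merge(profiles, on="cc_num", how="left")
    .merge(credit_cards, on="cc_num", how="left")
)
print(joined)
```

A left join is used here so that a transaction is never dropped when its card or profile row is missing; the missing columns would simply be `NaN`.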

#### Data Source:
Because real-time sources of credit card fraud data are not easily accessible (owing to both their financial value and the associated security concerns), we treat these three files as if they originated from separate data sources.

### Imports

In [None]:
import pandas as pd
from datetime import datetime
import hopsworks

In [None]:
url = "https://repo.hops.works/master/hopsworks-tutorials/data/card_fraud_data"
credit_cards_df = pd.read_parquet(url + "/credit_cards.parquet")
credit_cards_df.head(5)

In [None]:
profiles_df = pd.read_parquet(url + "/profiles.parquet")
profiles_df.head(5)

In [None]:
trans_df = pd.read_parquet(url + "/transactions.parquet")
trans_df.head(3)

## Feature Engineering
Fraudulent transactions can differ from regular ones in many ways. A typical red flag is an unusually large transaction volume or frequency within the span of a few hours. It could also be the case that elderly people in particular are targeted by fraudsters. To facilitate model learning, we will create additional features based on these patterns. In particular, we will create two types of features:
1. **Features that aggregate data from different data sources**. This could for instance be the age of a customer at the time of a transaction, which combines the `birthdate` feature from `profiles.parquet` with the `datetime` feature from `transactions.parquet`.
2. **Features that aggregate data from multiple time steps**. An example of this could be the transaction frequency of a credit card in the span of a few hours, which is computed using a window function.
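
To make the first category concrete, here is a minimal sketch of deriving an age-at-transaction feature from two toy sources. It is not the helper module's implementation: the data is invented, and dividing elapsed days by 365.25 is an assumed approximation.

```python
import pandas as pd

# Toy stand-ins for the profile and transaction sources (values are invented).
profiles = pd.DataFrame({
    "cc_num": [1111, 2222],
    "birthdate": pd.to_datetime(["1980-01-01", "1950-06-15"]),
})
transactions = pd.DataFrame({
    "cc_num": [1111, 2222],
    "datetime": pd.to_datetime(["2022-01-01 10:00:00", "2022-06-15 12:00:00"]),
})

# Bring the birthdate next to each transaction via the shared cc_num key.
merged = transactions.merge(profiles, on="cc_num", how="left")

# Approximate age in years: whole elapsed days divided by the mean year length.
merged["age_at_transaction"] = (
    (merged["datetime"] - merged["birthdate"]).dt.days / 365.25
)
print(merged[["cc_num", "age_at_transaction"]])
```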

Let's start with the first category, aggregating from different sources.

#### Here we create additional features with three helper functions: `card_owner_age`, `expiry_days`, and `activity_level`.

In [None]:
from helper import features
import warnings
warnings.filterwarnings('ignore')

# Keep the labels in their own DataFrame; .copy() avoids assigning into a view of trans_df.
fraud_labels = trans_df[["tid", "cc_num", "datetime", "fraud_label"]].copy()
fraud_labels["datetime"] = fraud_labels["datetime"].map(features.date_to_timestamp)

trans_df = trans_df.drop(['fraud_label'], axis=1)
trans_df = features.card_owner_age(trans_df, profiles_df)
trans_df = features.expiry_days(trans_df, credit_cards_df)
trans_df = features.activity_level(trans_df, 1)

trans_df

We also need to create feature groups in our feature store. A feature group can be seen as a collection of conceptually related features.
Let's create our first feature group.

In [None]:
project = hopsworks.login()
fs = project.get_feature_store()

In [None]:
trans_fg = fs.get_or_create_feature_group(
    name="cc_trans_fraud",
    version=2,
    description="Credit Card transactions",
    primary_key=["cc_num"],
    event_time="datetime"
)

In [None]:
trans_fg.insert(trans_df, write_options={"wait_for_job" : False})

In [None]:
feature_descriptions = [
    {"name": "tid", "description": "Transaction id"},
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "category", "description": "Expense category"},
    {"name": "amount", "description": "Dollar amount of the transaction"},
    {"name": "city", "description": "City in which the transaction was made"},
    {"name": "country", "description": "Country in which the transaction was made"},
    {"name": "age_at_transaction", "description": "Age of the card holder when the transaction was made"},
    {"name": "days_until_card_expires", "description": "Card validity days left when the transaction was made"},
    {"name": "loc_delta_t_minus_1", "description": "Haversine distance between this transaction location and the previous transaction location from the same card"},
    {"name": "time_delta_t_minus_1", "description": "Time in days between this transaction and the previous transaction from the same card"},
]

for desc in feature_descriptions: 
    trans_fg.update_feature_description(desc["name"], desc["description"])
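
The `loc_delta_t_minus_1` description above refers to the Haversine distance. As a rough sketch of that formula (the helper module's own implementation and units are not shown here, so the kilometre output and the `haversine_km` name are assumptions):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees.

    radius_km defaults to a commonly used mean Earth radius; this is an
    illustrative choice, not necessarily what the helper module uses.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))

# Two points on the equator, half the globe apart: about pi * radius.
print(haversine_km(0.0, 0.0, 0.0, 180.0))
```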

#### Now for the second part: we create features for each credit card, aggregated over multiple time steps.
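
Before calling the helper, here is a minimal sketch of what a per-card time window aggregation can look like in plain pandas. It is an illustration on invented data, not the helper's actual logic; the output column names merely echo the feature names described further down.

```python
import pandas as pd

# Toy transactions for two cards (values are invented).
toy = pd.DataFrame({
    "cc_num": [1111, 1111, 1111, 2222],
    "datetime": pd.to_datetime([
        "2022-01-01 00:00", "2022-01-01 01:00",
        "2022-01-01 06:00", "2022-01-01 00:30",
    ]),
    "amount": [10.0, 20.0, 30.0, 5.0],
})

# Time-based windows need a sorted DatetimeIndex.
toy = toy.sort_values("datetime").set_index("datetime")

# For each card, look back over the 4 hours ending at each transaction.
windows = toy.groupby("cc_num")["amount"].rolling("4h")
window_aggs = pd.DataFrame({
    "trans_freq": windows.count(),        # number of transactions in the window
    "trans_volume_mavg": windows.mean(),  # mean amount in the window
})
print(window_aggs)
```

In this toy data, the 06:00 transaction of card 1111 falls outside the window of the earlier two, so its count resets to one.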


In [None]:
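window_len = 4  # window length in hours
# Use the `features` module imported earlier; `cc_features` was never defined in this notebook.
window_aggs_df = features.aggregate_activity_by_hour(trans_df, window_len)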
window_aggs_df.tail()

In [None]:
window_aggs_fg = fs.get_or_create_feature_group(
    name=f"cc_trans_fraud_{window_len}h",
    version=2,
    description=f"Counts of the number of credit card transactions over {window_len} hour windows.",
    primary_key=["cc_num"],
    event_time="datetime"
)

In [None]:
window_aggs_fg.insert(window_aggs_df, write_options={"wait_for_job" : False})

In [None]:
feature_descriptions = [
    {"name": "datetime", "description": "Transaction time"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "loc_delta_mavg", "description": "Moving average of location difference between consecutive transactions from the same card"},
    {"name": "trans_freq", "description": "Moving average of transaction frequency from the same card"},
    {"name": "trans_volume_mavg", "description": "Moving average of transaction volume from the same card"},
    {"name": "trans_volume_mstd", "description": "Moving standard deviation of transaction volume from the same card"},
]

for desc in feature_descriptions: 
    window_aggs_fg.update_feature_description(desc["name"], desc["description"])

#### Creating another feature group for the fraud labels

In [None]:
trans_label_fg = fs.get_or_create_feature_group(
    name="transactions_fraud_label",
    version=2,
    description="CC transactions that have been flagged as fraud",
    primary_key=['cc_num'],
    event_time='datetime'
)

trans_label_fg.insert(fraud_labels, write_options={"wait_for_job" : False})

In [None]:
feature_descriptions = [
    {"name": "tid", "description": "Transaction id"},
    {"name": "cc_num", "description": "Number of the credit card performing the transaction"},
    {"name": "datetime", "description": "Transaction time"},
    {"name": "fraud_label", "description": "Whether the transaction was fraudulent or not"},
]
for desc in feature_descriptions: 
    trans_label_fg.update_feature_description(desc["name"], desc["description"])