<center><img src=https://raw.githubusercontent.com/feast-dev/feast/master/docs/assets/feast_logo.png width=400/></center>

# Credit Risk Data Preparation

Predicting credit risk is an important task for financial institutions. If a bank can accurately estimate the probability that a borrower will repay a future loan, it can make better decisions on loan terms and approvals. Getting credit risk right is critical to offering good financial services, and getting it wrong could mean going out of business.

AI models have played a central role in modern credit risk assessment systems. In this example, we develop a credit risk model to predict whether a future loan will be good or bad, given some context data (presumably supplied from the loan application). We use the modeling process to demonstrate how Feast can be used to facilitate the serving of data for training and inference use-cases.

In this notebook, we prepare the data.

### Setup

*The following code assumes that you have read the example README.md file and have set up an environment where the code can be run. Please make sure you have addressed the prerequisites.*

In [1]:
# Import Python libraries
import os
import warnings
import datetime as dt
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

In [2]:
# suppress warning messages for example flow (don't run if you want to see warnings)
warnings.filterwarnings('ignore')

In [3]:
# Seed for reproducibility
SEED = 142

### Pull the Data

The data we will use to train the model is the [OpenML](https://www.openml.org/) dataset [credit-g](https://www.openml.org/search?type=data&sort=runs&status=active&id=31), obtained from a 1994 German study. More details on the data can be found in the `DESCR` attribute and `details` map (see below).

In [4]:
data = fetch_openml(name="credit-g", version=1, parser='auto')

In [5]:
print(data.DESCR)

**Author**: Dr. Hans Hofmann  
**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) - 1994    
**Please cite**: [UCI](https://archive.ics.uci.edu/ml/citation_policy.html)

**German Credit dataset**  
This dataset classifies people described by a set of attributes as good or bad credit risks.

This dataset comes with a cost matrix: 
``` 
Good  Bad (predicted)  
Good   0    1   (actual)  
Bad    5    0  
```

It is worse to class a customer as good when they are bad (5), than it is to class a customer as bad when they are good (1).  

### Attribute description  

1. Status of existing checking account, in Deutsche Mark.  
2. Duration in months  
3. Credit history (credits taken, paid back duly, delays, critical accounts)  
4. Purpose of the credit (car, television,...)  
5. Credit amount  
6. Status of savings account/bonds, in Deutsche Mark.  
7. Present employment, in number of years.  
8. Installment rate in percentage of disposable income  
9. Perso

In [6]:
print("Original data url: ".ljust(20), data.details["original_data_url"])
print("Paper url: ".ljust(20), data.details["paper_url"])

Original data url:   https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Paper url:           https://dl.acm.org/doi/abs/10.1145/967900.968104


### High-Level Data Inspection

Let's inspect the data to see high-level details such as data types and size. We also want to make sure there are no glaring issues (like a large number of null values).

In [7]:
df = data.frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   checking_status         1000 non-null   category
 1   duration                1000 non-null   int64   
 2   credit_history          1000 non-null   category
 3   purpose                 1000 non-null   category
 4   credit_amount           1000 non-null   int64   
 5   savings_status          1000 non-null   category
 6   employment              1000 non-null   category
 7   installment_commitment  1000 non-null   int64   
 8   personal_status         1000 non-null   category
 9   other_parties           1000 non-null   category
 10  residence_since         1000 non-null   int64   
 11  property_magnitude      1000 non-null   category
 12  age                     1000 non-null   int64   
 13  other_payment_plans     1000 non-null   category
 14  housing                 1

We see that there are 21 columns, each with 1000 non-null values. The first 20 columns are contextual fields with `Dtype` of `category` or `int64`, while the last field is actually the target variable, `class`, which we wish to predict. 
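The same checks can be made programmatically instead of being read off the `df.info()` printout. The following is a minimal sketch on a small stand-in frame: the `demo` DataFrame and its values are hypothetical (not rows from credit-g), and with the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the credit-g frame (values invented for illustration)
demo = pd.DataFrame({
    "duration": [6, 48, 12, 24],
    "purpose": pd.Categorical(["radio/tv", "radio/tv", "education", "new car"]),
    "class": pd.Categorical(["good", "bad", "good", "good"]),
})

# Null count per column; all zeros means nothing is missing
null_counts = demo.isna().sum()

# Group columns by dtype family, mirroring the category/int64 split shown by info()
categorical_cols = demo.select_dtypes(include="category").columns.tolist()
numeric_cols = demo.select_dtypes(include="number").columns.tolist()

# Share of each target label, to spot class imbalance early
class_share = demo["class"].value_counts(normalize=True)

print(null_counts)
print(categorical_cols, numeric_cols)
print(class_share)
```

Checking the label distribution at this stage is a design choice of this sketch; the notebook itself only inspects types and null counts.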

From the description (above), the `class` tells us whether a loan to a customer was "good" or "bad". We are anticipating that patterns in the contextual data, as well as their relationship to the class outcomes, can give insight into loan classification. In the following notebooks, we will build a loan classification model that seeks to encode these patterns and relationships in its weights, such that given a new loan application (context data), the model can predict whether the loan (if approved) will be good or bad in the future.
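The dataset description also supplies a cost matrix in which approving a loan that turns out bad costs 5, while rejecting a loan that would have been good costs 1. As a rough illustration of how that asymmetry could be applied when scoring predictions, here is a minimal sketch. The `total_cost` helper and the example labels are hypothetical and are not part of the notebook or of any library.

```python
import numpy as np

# Cost matrix from the dataset description:
# rows = actual class, columns = predicted class, in the order (good, bad)
COST = np.array([
    [0, 1],  # actual good: predicted good costs 0, predicted bad costs 1
    [5, 0],  # actual bad:  predicted good costs 5, predicted bad costs 0
])
LABEL_INDEX = {"good": 0, "bad": 1}


def total_cost(actual, predicted):
    """Sum the misclassification cost over paired actual/predicted labels."""
    return int(sum(
        COST[LABEL_INDEX[a], LABEL_INDEX[p]] for a, p in zip(actual, predicted)
    ))


# Hypothetical labels, for illustration only
actual = ["good", "bad", "bad", "good"]
predicted = ["good", "good", "bad", "bad"]

# Pairs: (good, good)=0, (bad, good)=5, (bad, bad)=0, (good, bad)=1
print(total_cost(actual, predicted))  # 6
```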

### Data Preparation For Demonstrating Feast

It's important to note that Feast was developed primarily to work with production data. Feast requires datasets to have entities (in our case, IDs) and timestamps, which it uses in joins. Feast can support joining data on multiple entities (like primary keys in SQL), as well as "created" timestamps and "event" timestamps. However, in this example, we'll keep things simpler.

In a real loan application scenario, the application fields (in a database) would be associated with a timestamp, while the actual loan outcome (label) would be determined much later and recorded separately with a different timestamp.
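To make the role of entities and timestamps concrete, the idea behind a point-in-time join can be illustrated with plain pandas. This is only a conceptual sketch using `pandas.merge_asof`, not Feast's API; the frames, the `event_timestamp` column name, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical feature rows: entity 1 has two versions of the same feature
features = pd.DataFrame({
    "ID": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-09-25", "2023-10-05", "2023-09-28"]),
    "credit_amount": [1000, 1500, 700],
})

# Hypothetical requests: "what were the feature values as of this time?"
requests = pd.DataFrame({
    "ID": [1, 2],
    "event_timestamp": pd.to_datetime(["2023-10-01", "2023-10-01"]),
})

# For each request, take the latest feature row for the same ID whose
# timestamp is at or before the request timestamp (no peeking into the future)
joined = pd.merge_asof(
    requests.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="ID",
    direction="backward",
)
print(joined)
```

For entity 1, the row dated 2023-10-05 is ignored because it lies after the requested time, so the value from 2023-09-25 is returned.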

To demonstrate Feast capabilities such as point-in-time joins, we will mock IDs and timestamps for this data. For IDs, we will use the original dataframe index values. For the timestamps, we will generate random values between 12:00:00 on September 24, 2023 and 12:00:00 on October 9, 2023.

In [8]:
# Make index into "ID" column
df = df.reset_index(names=["ID"])

In [9]:
# Add mock timestamps
time_format = "%a %b %d %H:%M:%S %Y"
date = dt.datetime.strptime("Mon Oct  9 12:00:00 2023", time_format)
end = int(date.timestamp())
start = int((date - dt.timedelta(days=15)).timestamp())  # 'Sun Sep 24 12:00:00 2023'

def make_tstamp(epoch_seconds):
    # Convert epoch seconds to a local-time string, e.g. 'Mon Oct  9 12:00:00 2023'
    return dt.datetime.fromtimestamp(epoch_seconds).ctime()

# (seed set for reproducibility)
np.random.seed(SEED)
df["application_timestamp"] = pd.to_datetime([
    make_tstamp(d) for d in np.random.randint(start, end, len(df))
])

Verify that the newly created "ID" and "application_timestamp" fields were added to the data as expected.

In [10]:
# Check data (first few records, transposed for readability)
df.head(3).T

| | 0 | 1 | 2 |
|---|---|---|---|
| ID | 0 | 1 | 2 |
| checking_status | <0 | 0<=X<200 | no checking |
| duration | 6 | 48 | 12 |
| credit_history | critical/other existing credit | existing paid | critical/other existing credit |
| purpose | radio/tv | radio/tv | education |
| credit_amount | 1169 | 5951 | 2096 |
| savings_status | no known savings | <100 | <100 |
| employment | >=7 | 1<=X<4 | 4<=X<7 |
| installment_commitment | 4 | 2 | 2 |
| personal_status | male single | female div/dep/mar | male single |
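Beyond eyeballing the preview, the same verification can be expressed as explicit checks. The sketch below rebuilds a tiny stand-in frame (the `demo` name, its five rows, and the use of `np.random.default_rng` are assumptions made for illustration, and differ from the notebook's own timestamp code) and confirms that IDs are unique and that every timestamp lies inside the intended window.

```python
import numpy as np
import pandas as pd

# Intended window for the mock application timestamps
start = pd.Timestamp("2023-09-24 12:00:00")
end = pd.Timestamp("2023-10-09 12:00:00")

# Hypothetical stand-in frame with random offsets (in seconds) from the start
rng = np.random.default_rng(142)
window_seconds = int((end - start).total_seconds())
offsets = rng.integers(0, window_seconds, size=5)

demo = pd.DataFrame({"ID": range(5)})
demo["application_timestamp"] = start + pd.to_timedelta(offsets, unit="s")

# Checks: IDs are unique, timestamps are datetimes, and all fall in [start, end)
ids_unique = demo["ID"].is_unique
is_datetime = pd.api.types.is_datetime64_any_dtype(demo["application_timestamp"])
in_window = demo["application_timestamp"].between(start, end, inclusive="left")

print(ids_unique, is_datetime, bool(in_window.all()))
```

With the real data, the three checks would be run against `df` directly.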


We'll also generate counterpart IDs and timestamps for the label data. In a real-life scenario, the label data would arrive separately from, and later than, the loan application data. To mimic this, let's create a labels dataset with an "outcome_timestamp" column that lags the application timestamp by a variable 30 to 90 days.

In [11]:
# Add (lagged) label timestamps (30 to 90 days)
def lag_delta(data, seed):
    np.random.seed(seed)
    delta_days = np.random.randint(30, 90, len(data))
    delta_hours = np.random.randint(0, 24, len(data))
    delta = np.array([dt.timedelta(days=int(delta_days[i]), hours=int(delta_hours[i])) for i in range(len(data))])
    return delta

labels = df[["ID", "class"]].copy()  # copy so the assignment below does not write to a view of df
labels["outcome_timestamp"] = pd.to_datetime(df.application_timestamp + lag_delta(df, SEED))

In [12]:
# Check labels
labels.head(3)

| | ID | class | outcome_timestamp |
|---|---|---|---|
| 0 | 0 | good | 2023-11-24 22:50:13 |
| 1 | 1 | bad | 2023-11-03 12:10:13 |
| 2 | 2 | good | 2023-11-30 22:06:03 |


You can verify that each `outcome_timestamp` falls 30 to 90 days after the corresponding `application_timestamp` (above).
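One way to run that check is to join the two frames on `ID` and inspect the difference between the timestamps. The following is a minimal sketch with hypothetical stand-in frames (`apps`, `demo_labels`, and all timestamp values are invented for illustration); with the real data, `df` and `labels` would take their place.

```python
import pandas as pd

# Hypothetical application rows
apps = pd.DataFrame({
    "ID": [0, 1, 2],
    "application_timestamp": pd.to_datetime([
        "2023-09-25 08:00:00", "2023-10-01 14:30:00", "2023-10-07 19:45:00",
    ]),
})

# Hypothetical label rows, recorded later than the applications
demo_labels = pd.DataFrame({
    "ID": [0, 1, 2],
    "class": ["good", "bad", "good"],
    "outcome_timestamp": pd.to_datetime([
        "2023-11-01 10:00:00", "2023-11-15 09:00:00", "2023-12-20 23:00:00",
    ]),
})

# Join on the entity key; validate guards against duplicated IDs on either side
merged = apps.merge(demo_labels, on="ID", validate="one_to_one")

# Lag between application and outcome, checked against the 30 to 90 day range
lag = merged["outcome_timestamp"] - merged["application_timestamp"]
lag_ok = lag.between(pd.Timedelta(days=30), pd.Timedelta(days=90))

print(lag.dt.days.tolist())
print(bool(lag_ok.all()))
```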

### Save Data

Now that we have our data prepared, let's save it to local parquet files in the `data` directory (parquet is one of the file formats supported by Feast).

One more step we will add is splitting the context data column-wise and saving it in two files. This step is contrived--we don't usually split data when we don't need to--but it will allow us to demonstrate later how Feast can easily join datasets (a common need in Data Science projects).

In [13]:
# Create the data directory if it doesn't exist
os.makedirs("Feature_Store/data", exist_ok=True)

# Split columns and save context data
a_cols = [
    'ID', 'checking_status', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_status', 'employment', 'application_timestamp',
    'installment_commitment', 'personal_status', 'other_parties',
]
b_cols = [
    'ID', 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
    'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
    'foreign_worker', 'application_timestamp'
]

df[a_cols].to_parquet("Feature_Store/data/data_a.parquet", engine="pyarrow")
df[b_cols].to_parquet("Feature_Store/data/data_b.parquet", engine="pyarrow")

# Save label data
labels.to_parquet("Feature_Store/data/labels.parquet", engine="pyarrow")

We have saved the following files to the `Feature_Store/data` directory: 
- `data_a.parquet` (training data, a columns)
- `data_b.parquet` (training data, b columns)
- `labels.parquet` (label outcomes)
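Because both context files keep `ID` and `application_timestamp`, the column-wise split loses no information: the halves can be rejoined on those keys. The sketch below shows this in memory with plain pandas on a hypothetical three-row frame (the `demo` frame and its values are invented, and this is not how Feast performs the join).

```python
import pandas as pd

# Hypothetical stand-in for the prepared context data
demo = pd.DataFrame({
    "ID": [0, 1, 2],
    "duration": [6, 48, 12],
    "age": [35, 41, 29],
    "application_timestamp": pd.to_datetime([
        "2023-09-25 08:00:00", "2023-10-01 14:30:00", "2023-10-07 19:45:00",
    ]),
})

# Both halves carry the join keys, as a_cols and b_cols do in the notebook
keys = ["ID", "application_timestamp"]
part_a = demo[keys + ["duration"]]
part_b = demo[keys + ["age"]]

# Rejoin on the shared keys; validate guards against duplicated key pairs
rejoined = part_a.merge(part_b, on=keys, how="inner", validate="one_to_one")

# After restoring the original column order, the frames should match
lossless = rejoined[demo.columns].equals(demo)
print(lossless)
```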

With the feature data prepared, we are ready to set up and deploy the feature store.

Continue with the [02_Deploying_the_Feature_Store.ipynb](02_Deploying_the_Feature_Store.ipynb) notebook.