<center><img src=https://raw.githubusercontent.com/feast-dev/feast/master/docs/assets/feast_logo.png width=400/></center>

# Credit Risk Data Preparation

Predicting credit risk is an important task for financial institutions. If a bank can accurately determine the probability that a borrower will pay back a future loan, then it can make better decisions on loan terms and approvals. Getting credit risk right is critical to offering good financial services, and getting credit risk wrong could mean going out of business.

AI models play a central role in modern credit risk assessment systems. In this example, we develop a credit risk model to predict whether a future loan will be good or bad, given some context data (presumably supplied from the loan application). We use the modeling process to demonstrate how Feast can be used to facilitate the serving of data for training and inference use cases.

In this notebook, we prepare the data.

### Setup

*The following code assumes that you have read the example README.md file, and that you have set up an environment where the code can be run. Please make sure you have addressed the prerequisites.*

In [1]:
# Import Python libraries
import os
import warnings
import datetime as dt
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_openml

In [2]:
# suppress warning messages for example flow (don't run if you want to see warnings)
warnings.filterwarnings('ignore')

In [3]:
# Seed for reproducibility
SEED = 142

### Pull the Data

The data we will use to train the model is from the [OpenML](https://www.openml.org/) dataset [credit-g](https://www.openml.org/search?type=data&sort=runs&status=active&id=31), obtained from a 1994 German study. More details on the data can be found in the `DESCR` attribute and `details` map (see below).

In [11]:
import tempfile
import shutil

# Create a temporary directory for data_home
temp_data_home = tempfile.mkdtemp()
print(f"Using temporary data_home: {temp_data_home}")

# Set the SCIKIT_LEARN_DATA environment variable to the temporary directory
os.environ['SCIKIT_LEARN_DATA'] = temp_data_home

try:
    print("Retrying data fetch with temporary data_home...")
    data = fetch_openml(name="credit-g", version=1, parser='auto')
    print("Data fetched successfully with temporary data_home!")
except ValueError as e:
    print(f"Error fetching data with temporary data_home: {e}")
finally:
    # Clean up the temporary directory
    print(f"Cleaning up temporary data_home: {temp_data_home}")
    shutil.rmtree(temp_data_home)
    # Unset the SCIKIT_LEARN_DATA environment variable to revert to default behavior
    if 'SCIKIT_LEARN_DATA' in os.environ:
        del os.environ['SCIKIT_LEARN_DATA']

Using temporary data_home: /tmp/tmp4dxm95cf
Retrying data fetch with temporary data_home...
Error fetching data with temporary data_home: md5 checksum of local file for data/v1/download/31 does not match description: expected: 9a475053fed0c26ee95cd4525e50074c but got 4faec5c39a1f821270a5878880fbfc8c. Downloaded file could have been modified / corrupted, clean cache and retry...
Cleaning up temporary data_home: /tmp/tmp4dxm95cf


In [5]:
import hashlib
import os
from sklearn.datasets import get_data_home

# Get the scikit-learn data home directory
data_home = get_data_home()

# Construct the path to the downloaded file for 'credit-g'
# The path is usually data_home/openml/openml.org/data/v1/download/{data_id}
# From the error message, the data_id is 31.
file_path = os.path.join(data_home, "openml", "openml.org", "data", "v1", "download", "31")

print(f"Checking MD5 for file: {file_path}")

# Check if the file exists
if not os.path.exists(file_path):
    print("Error: File not found at the specified path. It might not have been downloaded yet or is in a different location.")
else:
    # Calculate the MD5 checksum
    hasher = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hasher.update(chunk)
    calculated_checksum = hasher.hexdigest()

    # The expected checksum was from the previous error message
    expected_checksum = "9a475053fed0c26ee95cd4525e50074c"

    print(f"Calculated MD5 checksum: {calculated_checksum}")
    print(f"Expected MD5 checksum:   {expected_checksum}")

    if calculated_checksum == expected_checksum:
        print("MD5 checksums MATCH. The file is correct.")
    else:
        print("MD5 checksums DO NOT MATCH. The file might be corrupted or outdated.")

Checking MD5 for file: /root/scikit_learn_data/openml/openml.org/data/v1/download/31
Error: File not found at the specified path. It might not have been downloaded yet or is in a different location.


In [10]:
import os
import shutil
from sklearn.datasets import get_data_home

# Get the scikit-learn data home directory
data_home = get_data_home()

# Construct the path to the OpenML cache directory
openml_cache_dir = os.path.join(data_home, "openml")

# Remove the entire OpenML cache directory if it exists
if os.path.exists(openml_cache_dir):
    print(f"Removing OpenML cache directory: {openml_cache_dir}")
    shutil.rmtree(openml_cache_dir)
else:
    print(f"OpenML cache directory not found: {openml_cache_dir}")

print("Retrying data fetch...")
# Now, retry fetching the data, which will force a fresh download
data = fetch_openml(name="credit-g", version=1, parser='auto')

Removing OpenML cache directory: /root/scikit_learn_data/openml
Retrying data fetch...


ValueError: md5 checksum of local file for data/v1/download/31 does not match description: expected: 9a475053fed0c26ee95cd4525e50074c but got 4faec5c39a1f821270a5878880fbfc8c. Downloaded file could have been modified / corrupted, clean cache and retry...

In [9]:
import os
from sklearn.datasets import get_data_home

# Get the scikit-learn data home directory
data_home = get_data_home()

# Construct the path to the downloaded file for 'credit-g'
# The path is usually data_home/openml/openml.org/data/v1/download/{data_id}
# From the error message, the data_id is 31.
file_to_remove = os.path.join(data_home, "openml", "openml.org", "data", "v1", "download", "31")

if os.path.exists(file_to_remove):
    print(f"Removing corrupted file: {file_to_remove}")
    os.remove(file_to_remove)
else:
    print(f"File not found, no need to remove: {file_to_remove}")

# Now, retry fetching the data
data = fetch_openml(name="credit-g", version=1, parser='auto')

File not found, no need to remove: /root/scikit_learn_data/openml/openml.org/data/v1/download/31


ValueError: md5 checksum of local file for data/v1/download/31 does not match description: expected: 9a475053fed0c26ee95cd4525e50074c but got 4faec5c39a1f821270a5878880fbfc8c. Downloaded file could have been modified / corrupted, clean cache and retry...

In [15]:
print(data.DESCR)

Manually downloaded credit-g dataset from OpenML


In [16]:
print("Original data url: ".ljust(20), data.details["original_data_url"])
print("Paper url: ".ljust(20), data.details["paper_url"])

Original data url:   https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)
Paper url:           https://dl.acm.org/doi/abs/10.1145/967900.968104


### High-Level Data Inspection

Let's inspect the data to see high-level details like data types and size. We also want to make sure there are no glaring issues (like a large number of null values).

In [17]:
df = data.frame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   checking_status         1000 non-null   category
 1   duration                1000 non-null   float64 
 2   credit_history          1000 non-null   category
 3   purpose                 1000 non-null   category
 4   credit_amount           1000 non-null   float64 
 5   savings_status          1000 non-null   category
 6   employment              1000 non-null   category
 7   installment_commitment  1000 non-null   float64 
 8   personal_status         1000 non-null   category
 9   other_parties           1000 non-null   category
 10  residence_since         1000 non-null   float64 
 11  property_magnitude      1000 non-null   category
 12  age                     1000 non-null   float64 
 13  other_payment_plans     1000 non-null   category
 14  housing                 1

We see that there are 21 columns, each with 1000 non-null values. The first 20 columns are contextual fields with `Dtype` of `category` or `float64`, while the last field is actually the target variable, `class`, which we wish to predict.

From the description (above), the `class` tells us whether a loan to a customer was "good" or "bad". We are anticipating that patterns in the contextual data, as well as their relationship to the class outcomes, can give insight into loan classification. In the following notebooks, we will build a loan classification model that seeks to encode these patterns and relationships in its weights, such that given a new loan application (context data), the model can predict whether the loan (if approved) will be good or bad in the future.
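The checks described above (null values, target balance) can also be done programmatically. The following is a minimal sketch, not part of the original example: the helper name `summarize` and the toy frame `toy` are hypothetical, and the values are made up so the block runs on its own.

```python
import pandas as pd

def summarize(frame: pd.DataFrame, target: str) -> dict:
    """Return the row count, the worst per-column null share, and target class shares."""
    return {
        "rows": len(frame),
        "max_null_share": float(frame.isna().mean().max()),
        "class_share": frame[target].value_counts(normalize=True).to_dict(),
    }

# Hypothetical toy data standing in for the credit-g frame
toy = pd.DataFrame({
    "duration": [6.0, 48.0, 12.0, None],
    "class": pd.Categorical(["good", "bad", "good", "good"]),
})

summary = summarize(toy, "class")
print(summary)
```

With the real data, the same helper could be called as `summarize(df, "class")`.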

### Data Preparation For Demonstrating Feast

At this point, it's important to note that Feast was developed primarily to work with production data. Feast requires datasets to have entities (in our case, IDs) and timestamps, which it uses in joins. Feast can support joining data on multiple entities (like primary keys in SQL), as well as "created" timestamps and "event" timestamps. However, in this example, we'll keep things simpler.

In a real loan application scenario, the application fields (in a database) would be associated with a timestamp, while the actual loan outcome (label) would be determined much later and recorded separately with a different timestamp.
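This timing gap is why a point-in-time join matters: each label should only be matched with feature values that were known at or before a given moment. The sketch below illustrates the semantics with `pandas.merge_asof`. It is not how Feast implements the join, and the frames `features` and `events` and all their values are hypothetical.

```python
import pandas as pd

# Hypothetical feature rows: entity 1 has a feature value that changes over time
features = pd.DataFrame({
    "ID": [1, 1, 2],
    "event_timestamp": pd.to_datetime(["2023-09-25", "2023-10-05", "2023-09-30"]),
    "credit_amount": [1000.0, 1500.0, 700.0],
})

# Hypothetical events that need the feature values known at that time
events = pd.DataFrame({
    "ID": [1, 2],
    "event_timestamp": pd.to_datetime(["2023-10-01", "2023-10-01"]),
})

# merge_asof requires both frames to be sorted by the time key
joined = pd.merge_asof(
    events.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="ID",
    direction="backward",  # only use feature rows at or before each event
)
print(joined)
```

For entity 1, the join picks the 2023-09-25 value rather than the later 2023-10-05 one, because the later row was not yet available at the event time.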

In order to demonstrate Feast capabilities, such as point-in-time joins, we will mock IDs and timestamps for this data. For IDs, we will use the original dataframe index values. For the timestamps, we will generate random values between "Sun Sep 24 12:00:00 2023" and "Mon Oct  9 12:00:00 2023".

In [18]:
# Make index into "ID" column
df = df.reset_index(names=["ID"])

In [19]:
# Add mock timestamps
time_format = "%a %b %d %H:%M:%S %Y"
date = dt.datetime.strptime("Mon Oct  9 12:00:00 2023", time_format)
end = int(date.timestamp())
start = int((date - dt.timedelta(days=15)).timestamp())  # 'Sun Sep 24 12:00:00 2023'

def make_tstamp(date):
    dtime = dt.datetime.fromtimestamp(date).ctime()
    return dtime

# (seed set for reproducibility)
np.random.seed(SEED)
df["application_timestamp"] = pd.to_datetime([
    make_tstamp(d) for d in np.random.randint(start, end, len(df))
])

Verify that the newly created "ID" and "application_timestamp" fields were added to the data as expected.

In [20]:
# Check data (first few records, transposed for readability)
df.head(3).T

Unnamed: 0,0,1,2
ID,0,1,2
checking_status,<0,0<=X<200,no checking
duration,6.0,48.0,12.0
credit_history,critical/other existing credit,existing paid,critical/other existing credit
purpose,radio/tv,radio/tv,education
credit_amount,1169.0,5951.0,2096.0
savings_status,no known savings,<100,<100
employment,>=7,1<=X<4,4<=X<7
installment_commitment,4.0,2.0,2.0
personal_status,male single,female div/dep/mar,male single


We'll also generate counterpart IDs and timestamps on the label data. In a real-life scenario, the label data would arrive separately from, and later than, the loan application data. To mimic this, let's create a labels dataset with an "outcome_timestamp" column that lags the application timestamp by a variable 30 to 90 days.

In [21]:
# Add (lagged) label timestamps (30 to 90 days)
def lag_delta(data, seed):
    np.random.seed(seed)
    delta_days = np.random.randint(30, 90, len(data))
    delta_hours = np.random.randint(0, 24, len(data))
    delta = np.array([dt.timedelta(days=int(delta_days[i]), hours=int(delta_hours[i])) for i in range(len(data))])
    return delta

# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
labels = df[["ID", "class"]].copy()
labels["outcome_timestamp"] = pd.to_datetime(df.application_timestamp + lag_delta(df, SEED))

In [22]:
# Check labels
labels.head(3)

Unnamed: 0,ID,class,outcome_timestamp
0,0,good,2023-11-24 22:50:13
1,1,bad,2023-11-03 12:10:13
2,2,good,2023-11-30 22:06:03


You can verify that each `outcome_timestamp` is 30 to 90 days later than the corresponding `application_timestamp` (above).
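One way to perform that check is sketched below. It uses small hypothetical frames (`apps` and `outcomes`, standing in for `df` and `labels`) so that it runs on its own; with the real data, `df[["ID", "application_timestamp"]]` could be merged with `labels` in the same way.

```python
import pandas as pd

# Hypothetical stand-ins for df (applications) and labels (outcomes)
apps = pd.DataFrame({
    "ID": [0, 1, 2],
    "application_timestamp": pd.to_datetime(
        ["2023-10-01 08:00:00", "2023-09-28 12:30:00", "2023-10-05 17:45:00"]
    ),
})
outcomes = pd.DataFrame({
    "ID": [0, 1, 2],
    "outcome_timestamp": pd.to_datetime(
        ["2023-11-24 22:00:00", "2023-11-03 12:30:00", "2023-12-30 01:45:00"]
    ),
})

# Join on ID and compute the lag between application and outcome
check = apps.merge(outcomes, on="ID", validate="one_to_one")
check["lag"] = check["outcome_timestamp"] - check["application_timestamp"]

# The lag helper draws 30-89 whole days plus 0-23 hours, so the lag lies in [30, 90) days
lag_ok = check["lag"].between(
    pd.Timedelta(days=30), pd.Timedelta(days=90), inclusive="left"
).all()
print(check["lag"].min(), check["lag"].max(), lag_ok)
```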

### Save Data

Now that we have our data prepared, let's save it to local parquet files in the `data` directory (parquet is one of the file formats supported by Feast).

One more step we will add is splitting the context data column-wise and saving it in two files. This step is contrived--we don't usually split data when we don't need to--but it will allow us to demonstrate later how Feast can easily join datasets (a common need in Data Science projects).

In [23]:
# Create the data directory if it doesn't exist
os.makedirs("Feature_Store/data", exist_ok=True)

# Split columns and save context data
a_cols = [
    'ID', 'checking_status', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_status', 'employment', 'application_timestamp',
    'installment_commitment', 'personal_status', 'other_parties',
]
b_cols = [
    'ID', 'residence_since', 'property_magnitude', 'age', 'other_payment_plans',
    'housing', 'existing_credits', 'job', 'num_dependents', 'own_telephone',
    'foreign_worker', 'application_timestamp'
]

df[a_cols].to_parquet("Feature_Store/data/data_a.parquet", engine="pyarrow")
df[b_cols].to_parquet("Feature_Store/data/data_b.parquet", engine="pyarrow")

# Save label data
labels.to_parquet("Feature_Store/data/labels.parquet", engine="pyarrow")

We have saved the following files to the `Feature_Store/data` directory:
- `data_a.parquet` (training data, a columns)
- `data_b.parquet` (training data, b columns)
- `labels.parquet` (label outcomes)
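Because both splits keep `ID` and `application_timestamp`, they can be stitched back together on those keys. The sketch below shows the idea in plain pandas on hypothetical toy frames (`part_a`, `part_b`, with made-up values); it is only an illustration of what the join has to achieve, since the later notebooks demonstrate the join with Feast.

```python
import pandas as pd

# Hypothetical miniature versions of the two column-wise splits
ts = pd.to_datetime(["2023-10-01 08:00:00", "2023-09-28 12:30:00"])
part_a = pd.DataFrame({"ID": [0, 1], "duration": [6.0, 48.0], "application_timestamp": ts})
part_b = pd.DataFrame({"ID": [0, 1], "age": [67.0, 22.0], "application_timestamp": ts})

# Rejoin on the shared keys; validate guards against duplicated IDs in either split
rejoined = part_a.merge(
    part_b, on=["ID", "application_timestamp"], validate="one_to_one"
)
print(rejoined.columns.tolist())
```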

With the feature data prepared, we are ready to set up and deploy the feature store.

Continue with the [02_Deploying_the_Feature_Store.ipynb](02_Deploying_the_Feature_Store.ipynb) notebook.

# Task
Manually download the ARFF file for the "credit-g" dataset from OpenML and load it into a pandas DataFrame, ensuring the resulting `data` object mimics the output structure of `fetch_openml`, and install 'requests' and 'scipy' for this purpose.

## Modify_fetch_openml_cell

### Subtask:
Replace the existing fetch_openml call with code to manually download the ARFF file and load it into a pandas DataFrame, mimicking the structure of fetch_openml output.


**Reasoning**:
The previous attempts to fetch data using `fetch_openml` failed due to checksum mismatches. To resolve this and proceed with the subtask, I will manually download the ARFF file, process it, and encapsulate it in a `Bunch` object, mimicking the output structure of `fetch_openml`, as per the instructions.



In [13]:
import requests
from scipy.io import arff
import pandas as pd
import os
import tempfile

# 1. Import necessary libraries (already done above, but kept for context)

# 2. Define the URL for the 'credit-g' ARFF file
arff_url = "https://www.openml.org/data/v1/download/31"

# Define a simple Bunch class to mimic sklearn's return type
class Bunch(dict):
    """Container object for datasets"""

    def __init__(self, **kwargs):
        super().__init__(kwargs)

    def __setattr__(self, key, value):
        self[key] = value

    def __dir__(self):
        return self.keys()

    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

# 3. Create a temporary file to store the downloaded ARFF content
# 4. Download the ARFF file content
# 5. Load the ARFF file from the temporary path
with tempfile.NamedTemporaryFile(delete=True, suffix=".arff") as temp_arff_file:
    print(f"Downloading ARFF file from {arff_url} to {temp_arff_file.name}...")
    response = requests.get(arff_url, timeout=60)  # timeout (seconds) avoids hanging on a stalled connection
    response.raise_for_status() # Raise an exception for bad status codes
    temp_arff_file.write(response.content)
    temp_arff_file.flush()

    print("Loading ARFF data...")
    arff_data, meta = arff.loadarff(temp_arff_file.name)
    print("ARFF data loaded.")

# 6. Convert the loaded data into a pandas DataFrame.
frame = pd.DataFrame(arff_data)

# 7. Iterate through the DataFrame's columns, decode byte strings, and convert to category
for col in frame.columns:
    if frame[col].dtype == object:
        try:
            # Check if the first non-null value is bytes type
            first_val = frame[col].dropna().iloc[0]
            if isinstance(first_val, bytes):
                frame[col] = frame[col].apply(lambda x: x.decode('utf-8') if isinstance(x, bytes) else x)
                frame[col] = frame[col].astype('category')
        except IndexError: # Column might be entirely null
            pass
        except Exception as e:
            print(f"Could not process column '{col}': {e}")

# 8. Define a simple class named `Bunch` (already defined above)

# 9. Create an instance of this `Bunch` class.
data = Bunch()

# 10. Populate the `data` object with the required attributes
data.frame = frame
data.data = frame.drop(columns=['class'])
data.target = frame['class']
data.DESCR = 'Manually downloaded credit-g dataset from OpenML'
data.details = {
    'original_data_url': 'https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)',
    'paper_url': 'https://dl.acm.org/doi/abs/10.1145/967900.968104'
}

print("Data successfully downloaded, processed, and structured in 'data' object.")

Downloading ARFF file from https://www.openml.org/data/v1/download/31 to /tmp/tmp863hq5g7.arff...
Loading ARFF data...
ARFF data loaded.
Data successfully downloaded, processed, and structured in 'data' object.


## Install_dependencies

### Subtask:
Ensure 'requests' and 'scipy' libraries are installed, as they are used for manual download and ARFF parsing.
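
A small check like the one below can confirm whether the packages are importable before running the download cell. This is a sketch using only the standard library; the helper name `missing_packages` is hypothetical. Any package it reports can then be installed with `pip install <name>`.

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["requests", "scipy"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```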


## Summary:

### Data Analysis Key Findings

*   The "credit-g" ARFF dataset was successfully downloaded from `https://www.openml.org/data/v1/download/31` using the `requests` library.
*   The downloaded ARFF data was loaded and parsed into a pandas DataFrame using `scipy.io.arff.loadarff`.
*   All byte-string columns within the DataFrame were correctly decoded to UTF-8 and converted into the 'category' data type to ensure proper data handling.
*   A custom `Bunch` class was implemented to replicate the output structure of `fetch_openml`, allowing the processed data to be accessed via attributes like `data.frame`, `data.data`, `data.target`, `data.DESCR`, and `data.details`.
*   The resulting `data` object successfully mimicked the `fetch_openml` output, with `data.data` containing the features (all columns except 'class') and `data.target` holding the 'class' column.

### Insights or Next Steps

*   The manual download provides a fallback for obtaining the dataset when `fetch_openml` fails, as it did here with an MD5 checksum mismatch. Note that this path performs no checksum verification of its own.
*   Keeping the `fetch_openml`-like `Bunch` structure means that the notebook cells reading `data.frame`, `data.DESCR`, and `data.details` work unchanged.
