# 01. Feature Engineering & Matrix Construction



**Objective:** Transform raw data from the "German Credit Dataset" into numerical structures (Matrices and Vectors) suitable for Linear Algebra analysis.

### 1. Environment Configuration
We clone the project's GitHub repository into the runtime environment and work from its root, so that the directory structure (`data/`, `src/`) is consistent and reproducible.

In [4]:
import pandas as pd
import numpy as np
import os
import shutil

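# Clone the repository only when we are not already inside it
# (needed in ephemeral environments such as Google Colab).
# Without the outer check, re-running this cell from the project root
# would clone a second, nested copy of the repository.
repo_name = 'Credit-Risk-Algebraic-ML'
if os.path.basename(os.getcwd()) != repo_name:
    if not os.path.exists(repo_name):
        !git clone https://github.com/adriangonz-afk/Credit-Risk-Algebraic-ML.git

    # Change working directory to the project root
    os.chdir(repo_name)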

print(f"Environment configured. Working directory: {os.getcwd()}")

Cloning into 'Credit-Risk-Algebraic-ML'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 23 (delta 4), reused 16 (delta 1), pack-reused 0 (from 0)
Receiving objects: 100% (23/23), 58.85 KiB | 2.45 MiB/s, done.
Resolving deltas: 100% (4/4), done.
Environment configured. Working directory: /content/Credit-Risk-Algebraic-ML/Credit-Risk-Algebraic-ML


### 2. Data Ingestion (Official UCI Source)
To avoid corrupt or incomplete copies, which can occur on third-party mirrors such as Kaggle, we download the data directly from the **UCI Machine Learning Repository**.

* **Sanitization:** Any previous version of the `data` directory is removed to ensure a "clean slate."
* **Source:** `german.data` (Original raw file).

In [5]:
print("Starting Data Engineering Protocol...")

# Cleanup: Remove old data to avoid conflicts
if os.path.exists('data'):
    shutil.rmtree('data')

# Create standard directory structure
os.makedirs('data/raw', exist_ok=True)
os.makedirs('data/processed', exist_ok=True)

# Download directly from the UCI source with wget (-q: quiet, -O: output path)
raw_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
raw_path = "data/raw/german.data"

print(f"Downloading from UCI Repository: {raw_url}")
!wget -q {raw_url} -O {raw_path}
print("Download complete.")

Starting Data Engineering Protocol...
Downloading from UCI Repository: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data
Download complete.
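
Because `wget -q` prints nothing on failure, a quick structural check on the downloaded file can catch a truncated or malformed download before parsing. The following is a minimal sketch, not part of the original notebook: the helper name `validate_raw_file` is hypothetical, and its defaults (1000 rows, 21 fields) are taken from the dataset dimensions reported in the next section. The demo runs on a tiny stand-in file rather than the real data.

```python
import tempfile
from pathlib import Path


def validate_raw_file(path, expected_rows=1000, expected_fields=21):
    """Return a list of problems found in a space-delimited file.

    An empty list means the file has the expected number of rows and
    fields. The name and defaults are illustrative assumptions.
    """
    problems = []
    lines = [ln for ln in Path(path).read_text().splitlines() if ln.strip()]
    if len(lines) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(lines)}")
    for i, ln in enumerate(lines, start=1):
        n_fields = len(ln.split())
        if n_fields != expected_fields:
            problems.append(
                f"line {i}: expected {expected_fields} fields, found {n_fields}"
            )
            break  # one example is enough to flag a malformed file
    return problems


# Demo on a stand-in file with 2 rows and 3 fields (not the real dataset)
with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.data"
    toy_path.write_text("A11 6 1\nA12 48 2\n")
    demo_ok = validate_raw_file(toy_path, expected_rows=2, expected_fields=3)
    demo_bad = validate_raw_file(toy_path, expected_rows=3, expected_fields=3)

print(demo_ok)
print(demo_bad)
```

In this notebook the call would presumably be `validate_raw_file(raw_path)`, with a non-empty result treated as a reason to stop and re-download.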


### 3. Parsing and Schema Definition
The original `german.data` file lacks headers and uses spaces as delimiters. We manually define the column schema based on the official dataset documentation.

In [6]:
# Manual Schema Definition
columns = [
    'checking_account', 'duration', 'credit_history', 'purpose',
    'credit_amount', 'savings_account', 'employment_since', 'installment_rate',
    'personal_status_sex', 'other_debtors', 'residence_since', 'property',
    'age', 'other_installment_plans', 'housing', 'existing_credits',
    'job', 'people_liable', 'telephone', 'foreign_worker', 'target'
]

# Read CSV with space separator
df = pd.read_csv(raw_path, names=columns, sep=' ', index_col=False)

print(f"Raw Dataset Dimensions: {df.shape}")

Raw Dataset Dimensions: (1000, 21)
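
Once the schema is attached, it can help to confirm which columns were parsed as numbers and which as categorical codes, since the encoding step depends on that split. The sketch below is an assumption-laden illustration: `split_by_kind` is a hypothetical helper, and the toy frame only mimics the dataset's style of string codes (such as `A11`) with a handful of columns.

```python
import pandas as pd

# Toy rows in the style of the raw file; only a few columns are shown
toy = pd.DataFrame({
    "checking_account": ["A11", "A12", "A14"],
    "duration": [6, 48, 12],
    "credit_amount": [1000, 5000, 2000],
    "target": [1, 2, 1],
})


def split_by_kind(frame, target="target"):
    """Separate feature columns into numeric and categorical names."""
    numeric, categorical = [], []
    for name in frame.columns:
        if name == target:
            continue  # the label is handled separately
        if pd.api.types.is_numeric_dtype(frame[name]):
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical


numeric_cols, categorical_cols = split_by_kind(toy)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Applied to the full `df`, a numeric attribute showing up in the categorical list would suggest a parsing problem, for example a stray delimiter.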


### 4. Transformation and Encoding
To enable algebraic operations, we must convert all categorical variables into numerical representations.

1.  **Target Standardization:** Convert the target variable to a standard binary format.
    * `1` (Good) -> `0` (Negative Class / No Default)
    * `2` (Bad) -> `1` (Positive Class / Default Risk)
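2.  **One-Hot Encoding:** Expand each categorical variable (e.g., `property`) into binary indicator columns, one per category; the indicators of a single variable are mutually orthogonal. We use `drop_first=True` to drop one indicator per variable. Otherwise each variable's indicators would sum to 1 in every row and be perfectly collinear with an intercept column (the dummy-variable trap).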
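
The effect of `drop_first=True` can be seen with a rank check on a toy example. This is a minimal sketch with made-up category labels (`rent`, `own`, `free`), not the dataset's actual codes, and it assumes the downstream model adds an intercept column of ones.

```python
import numpy as np
import pandas as pd

# Toy categorical column with three levels (illustrative labels)
toy = pd.DataFrame({"housing": ["rent", "own", "free", "own", "rent"]})

full = pd.get_dummies(toy, drop_first=False).astype(float)
reduced = pd.get_dummies(toy, drop_first=True).astype(float)


def with_intercept(frame):
    """Prepend a column of ones to the indicator matrix."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])


A_full = with_intercept(full)        # 4 columns: ones + 3 indicators
A_reduced = with_intercept(reduced)  # 3 columns: ones + 2 indicators

rank_full = np.linalg.matrix_rank(A_full)
rank_reduced = np.linalg.matrix_rank(A_reduced)

print(A_full.shape[1], rank_full)        # more columns than rank: deficient
print(A_reduced.shape[1], rank_reduced)  # columns equal rank: full column rank
```

With all three indicators the matrix has four columns but rank three, because the indicators add up to the intercept column. Dropping one indicator removes that dependency without losing information, since the dropped category is implied when the remaining indicators are all zero.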

In [7]:
# 1. Target Standardization
df['target'] = df['target'].map({1: 0, 2: 1})

# 2. Vectorization (One-Hot Encoding)
# Pandas automatically detects object columns and generates dummy variables
df_encoded = pd.get_dummies(df, drop_first=True)

# 3. Separation of Feature Matrix (X) and Target Vector (y)
X = df_encoded.drop('target', axis=1).astype(float)
y = df_encoded['target']

print("Transformation complete.")

Transformation complete.
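
One caveat of `Series.map` with a dictionary is that any value missing from the dictionary silently becomes `NaN`. A guarded version of the target standardization might look like the sketch below; `standardize_target` is a hypothetical helper and the label series is toy data.

```python
import pandas as pd


def standardize_target(labels):
    """Map 1 (Good) -> 0 and 2 (Bad) -> 1, failing loudly on other values."""
    mapped = labels.map({1: 0, 2: 1})
    n_unmapped = int(mapped.isna().sum())
    if n_unmapped:
        raise ValueError(f"{n_unmapped} target value(s) outside {{1, 2}}")
    return mapped.astype(int)


# Toy labels: three good credits and two bad ones
clean = standardize_target(pd.Series([1, 2, 1, 1, 2]))
default_rate = float(clean.mean())

print(list(clean))
print(default_rate)
```

The mean of the standardized target is the share of the positive (default) class, which is a convenient one-line check of class balance.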


### 5. Artifact Serialization
We save the results in NumPy's binary `.npy` format. It preserves each array's dtype and shape and loads directly into memory, so the subsequent notebook can start its linear algebra operations without re-parsing text.

* `X_matrix.npy`: Design Matrix.
* `y_vector.npy`: Label Vector.
* `feature_names.json`: Metadata for interpreting matrix columns.

In [8]:
import json

# Save NumPy Arrays (Optimized for Algebra)
np.save('data/processed/X_matrix.npy', X.values)
np.save('data/processed/y_vector.npy', y.values)

# Save human-readable CSV version (optional, for inspection)
df_encoded.to_csv('data/processed/german_credit_clean.csv', index=False)

# Save column names for future reference
with open('data/processed/feature_names.json', 'w') as f:
    json.dump(list(X.columns), f)

print("="*30)
print("PROCESS COMPLETED SUCCESSFULLY")
print(f"Matrix X saved: {X.shape}")
print(f"Vector y saved: {y.shape}")
print("="*30)

==============================
PROCESS COMPLETED SUCCESSFULLY
Matrix X saved: (1000, 48)
Vector y saved: (1000,)
==============================
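
To illustrate how these artifacts might be read back, here is a minimal round-trip sketch. It writes a small stand-in matrix and name list to a temporary directory rather than to `data/processed/`, so the values and the three feature names are illustrative assumptions only.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

# Stand-in for the design matrix and its column names
X_demo = np.array([[6.0, 1.0, 0.0],
                   [48.0, 0.0, 1.0]])
names_demo = ["duration", "housing_own", "housing_rent"]

with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)

    # Save: binary array plus JSON metadata, mirroring the cell above
    np.save(out / "X_matrix.npy", X_demo)
    (out / "feature_names.json").write_text(json.dumps(names_demo))

    # Load: dtype and shape come back exactly as saved
    X_loaded = np.load(out / "X_matrix.npy")
    names_loaded = json.loads((out / "feature_names.json").read_text())

# Each matrix column should have exactly one name
columns_match = X_loaded.shape[1] == len(names_loaded)
print(X_loaded.dtype, X_loaded.shape, columns_match)
```

Checking that the number of columns equals the number of stored names is a cheap guard against the matrix and its metadata drifting out of sync.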
