# 01 — Data Quality & Cleaning Pipeline 

This notebook loads the NovaCred credit application JSON dataset, performs a data quality audit, and demonstrates remediation steps in code.

**Input:** `data/raw/raw_credit_applications.json` (kept local / not committed due to PII fields)  
**Outputs:**  
- `data/processed/applications_clean.csv` (cleaned, may still contain PII)  
- `data/processed/applications_analysis_ready.csv` (PII removed for analysis)

**Quality checks covered (with counts and %):** duplicate records, missing/incomplete values, inconsistent data types, inconsistent categorical coding, inconsistent date formats, invalid/impossible values, and internal consistency checks (accuracy).
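
Every check in this notebook reports a count together with a percentage. As a minimal sketch, that formatting could be factored into one helper; the name `report_count` is hypothetical and does not appear in the notebook's cells, which build the same string inline.

```python
def report_count(label: str, count: int, total: int) -> str:
    """Format a quality-check result as 'label: count (pct%)'."""
    # Guard against an empty table so the percentage never divides by zero
    pct = count / max(total, 1) * 100
    return f"{label}: {count} ({pct:.2f}%)"


print(report_count("Example check", 1, 4))
```

The `max(total, 1)` guard mirrors the inline expressions used in the cells below.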

In [4]:
# Imports
import json
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

## 0. Load & schema sanity check

Goal: confirm the dataset loads correctly, has the expected shape, and contains the key columns required for downstream quality checks.

- **Input file:** `data/raw/raw_credit_applications.json`
- **Method:** load JSON array → flatten nested objects using `pd.json_normalize(sep=".")`
- **Sanity checks:** print dataset shape, preview 3 rows, and verify presence of key expected columns (schema check).

In [5]:
from pathlib import Path
import json
import pandas as pd

# Raw file lives in data/raw/ and is not committed to the repo (it contains PII).
# The path is relative to the notebook's own directory.
data_path = Path("../data/raw/raw_credit_applications.json")

# Read JSON
with data_path.open("r", encoding="utf-8") as f:
    apps = json.load(f)

# Flatten nested JSON into a tabular DataFrame
df_raw = pd.json_normalize(apps, sep=".")

# Basic sanity checks
print("rows, cols:", df_raw.shape)
display(df_raw.head(3))

# Check expected columns
expected = {
    "_id",
    "spending_behavior",
    "processing_timestamp",
    "applicant_info.full_name",
    "applicant_info.ssn",
    "financials.annual_income",
    "financials.credit_history_months",
    "financials.debt_to_income",
    "financials.savings_balance",
    "decision.loan_approved",
}

missing = expected - set(df_raw.columns)
print("Missing expected columns:", missing)

rows, cols: (502, 21)


| | _id | spending_behavior | processing_timestamp | applicant_info.full_name | applicant_info.email | applicant_info.ssn | applicant_info.ip_address | applicant_info.gender | applicant_info.date_of_birth | applicant_info.zip_code | ... | financials.credit_history_months | financials.debt_to_income | financials.savings_balance | decision.loan_approved | decision.rejection_reason | loan_purpose | decision.interest_rate | decision.approved_amount | financials.annual_salary | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | app_200 | [{'category': 'Shopping', 'amount': 480}, {'ca... | 2024-01-15T00:00:00Z | Jerry Smith | jerry.smith17@hotmail.com | 596-64-4340 | 192.168.48.155 | Male | 2001-03-09 | 10036 | ... | 23 | 0.2 | 31212 | False | algorithm_risk_score | | | | | |
| 1 | app_037 | [{'category': 'Rent', 'amount': 608}, {'catego... | | Brandon Walker | brandon.walker2@yahoo.com | 425-69-4784 | 10.1.102.112 | M | 1992-03-31 | 10032 | ... | 51 | 0.18 | 17915 | False | algorithm_risk_score | | | | | |
| 2 | app_215 | [{'category': 'Rent', 'amount': 109}] | | Scott Moore | scott.moore94@mail.com | 370-78-5178 | 10.240.193.250 | Male | 1989-10-24 | 10075 | ... | 41 | 0.21 | 37909 | True | | vacation | 3.7 | 59000.0 | | |


Missing expected columns: set()


Dataset loaded successfully (502 rows, 21 columns) and all key expected columns are present.
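
To illustrate what the flattening step does, here is a small sketch on a made-up record (the values and the `toy_` names are illustrative and are not taken from the dataset). It shows that `pd.json_normalize` turns nested objects into dotted column names but leaves a list of objects as a single list-valued column, which is why `spending_behavior` needs separate handling.

```python
import pandas as pd

# Made-up record that mimics the nested layout; values are illustrative only
toy_apps = [
    {
        "_id": "app_x",
        "applicant_info": {"full_name": "Jane Doe", "zip_code": "10001"},
        "financials": {"annual_income": 50000},
        "spending_behavior": [{"category": "Rent", "amount": 900}],
    }
]

toy_df = pd.json_normalize(toy_apps, sep=".")

# Nested dicts become dotted columns; the list column is left untouched
print(sorted(toy_df.columns))
print(type(toy_df.loc[0, "spending_behavior"]))
```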

## 1. `spending_behavior` (array of objects) — Data quality checks

`spending_behavior` is a nested array (list of `{category, amount}` objects).  
We validate completeness, type consistency, validity (e.g., negative amounts), formatting consistency (category casing), and within-application duplicates.

### 1A. Create an analysis table

To run data quality checks at the entry level, we **explode** the `spending_behavior` array so that each `{category, amount}` object becomes one row in a new table (`spend`).

In [6]:
# Build an exploded table: one row per spending entry
spend = (
    df_raw[["_id", "spending_behavior"]]
    .dropna(subset=["spending_behavior"])
    .explode("spending_behavior", ignore_index=True)
)

# Extract fields safely
spend["category"] = spend["spending_behavior"].map(lambda x: x.get("category") if isinstance(x, dict) else None)
spend["amount"] = pd.to_numeric(
    spend["spending_behavior"].map(lambda x: x.get("amount") if isinstance(x, dict) else None),
    errors="coerce"
)

print("Total spending entries:", len(spend))
display(spend[["_id", "category", "amount"]].head(10))

Total spending entries: 827


| | _id | category | amount |
|---|---|---|---|
| 0 | app_200 | Shopping | 480 |
| 1 | app_200 | Rent | 790 |
| 2 | app_200 | Alcohol | 247 |
| 3 | app_037 | Rent | 608 |
| 4 | app_037 | Dining | 96 |
| 5 | app_037 | Healthcare | 243 |
| 6 | app_215 | Rent | 109 |
| 7 | app_024 | Fitness | 575 |
| 8 | app_184 | Entertainment | 463 |
| 9 | app_275 | Entertainment | 571 |
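
One detail of the explode step is worth seeing on toy data: an empty list still produces one row (with a missing value), and a malformed entry that is not a dict ends up with a missing `category` after the safe extraction. The sketch below uses made-up values and hypothetical `toy_` names.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "_id": ["a", "b", "c"],
        "spending_behavior": [
            [{"category": "Rent", "amount": 100}, {"category": "Dining", "amount": 20}],
            [],        # present but empty
            ["oops"],  # malformed entry (not a dict)
        ],
    }
)

# One row per list element; the empty list yields a single NaN row
toy_spend = toy.explode("spending_behavior", ignore_index=True)

# Same safe extraction pattern as above: non-dict values map to None
toy_spend["category"] = toy_spend["spending_behavior"].map(
    lambda x: x.get("category") if isinstance(x, dict) else None
)
print(toy_spend[["_id", "category"]])
```

Under these assumptions, empty arrays inflate the entry count by one row each and surface later as missing `category` and `amount` values.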


### 1B. Missing / incomplete

We check completeness at two levels:
1) application-level: is `spending_behavior` missing for any applications?
2) entry-level: after exploding the array, are any entries missing `category` or `amount`?

In [7]:
# Total number of rows in the main table
n_apps = len(df_raw)

# Count applications where spending_behavior is missing (NaN)
missing_sb_apps = df_raw["spending_behavior"].isna().sum()
print("Applications missing spending_behavior:", missing_sb_apps, f"({missing_sb_apps/n_apps*100:.2f}%)")

# Total number of spending entries (rows) after exploding the array
n_entries = len(spend)

# Count entries where category is missing
missing_cat = spend["category"].isna().sum()

# Count entries where amount is missing
missing_amt = spend["amount"].isna().sum()

print("Missing category entries:", missing_cat, f"({missing_cat/max(n_entries,1)*100:.2f}%)")
print("Missing amount entries:", missing_amt, f"({missing_amt/max(n_entries,1)*100:.2f}%)")

Applications missing spending_behavior: 0 (0.00%)
Missing category entries: 0 (0.00%)
Missing amount entries: 0 (0.00%)
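
The application-level check above counts only `NaN` values. An application whose `spending_behavior` is present but empty (`[]`) would not be flagged by `isna()`. A possible complementary check, sketched on a toy frame that stands in for `df_raw` (all values made up):

```python
import pandas as pd

# Toy frame standing in for df_raw; values are made up
toy = pd.DataFrame(
    {
        "_id": ["a", "b", "c"],
        "spending_behavior": [[{"category": "Rent", "amount": 100}], [], None],
    }
)

sb = toy["spending_behavior"]

# Missing: the field is absent or null
is_missing = sb.isna()

# Empty: the field exists but holds a zero-length list
is_empty_list = sb.map(lambda x: isinstance(x, list) and len(x) == 0)

print("Missing:", int(is_missing.sum()))
print("Empty list:", int(is_empty_list.sum()))
```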


### 1C. Inconsistent data types

Validate that each spending entry has:
- `category` as text (string)
- `amount` as a numeric value 

In [8]:
# Count values that aren't Python strings
non_string_cat = (~spend["category"].dropna().map(lambda x: isinstance(x, str))).sum()
print("Non-string category entries:", non_string_cat,
      f"({non_string_cat/max(n_entries, 1)*100:.2f}%)")

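# Count amounts that are missing or could not be parsed as numbers (NaN after pd.to_numeric).
# Note: numeric strings such as "480" are coerced successfully, so they are not flagged here.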
non_numeric_amount = spend["amount"].isna().sum()
print("Non-numeric/missing amount entries:", non_numeric_amount,
      f"({non_numeric_amount/max(n_entries, 1)*100:.2f}%)")


Non-string category entries: 0 (0.00%)
Non-numeric/missing amount entries: 0 (0.00%)
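
Because `amount` was already passed through `pd.to_numeric(..., errors="coerce")`, an amount stored as a numeric string would be converted silently and would not show up in the count above. A stricter variant, sketched here on made-up entries, inspects the raw Python type before coercion; `is_real_number` is a hypothetical helper, not part of the notebook.

```python
import numbers

import pandas as pd

# Made-up raw entries: one amount is stored as a string
toy_entries = pd.Series(
    [
        {"category": "Rent", "amount": 100},
        {"category": "Dining", "amount": "20"},
        {"category": "Gym", "amount": 15.5},
    ]
)
raw_amount = toy_entries.map(lambda x: x.get("amount") if isinstance(x, dict) else None)


def is_real_number(x) -> bool:
    # bool is a subclass of int, so exclude it explicitly
    return isinstance(x, numbers.Real) and not isinstance(x, bool)


# Type check on the raw values versus the NaN count after coercion
not_numeric_type = ~raw_amount.map(is_real_number)
coerced = pd.to_numeric(raw_amount, errors="coerce")

print("Raw values that are not numbers:", int(not_numeric_type.sum()))
print("NaN after coercion:", int(coerced.isna().sum()))
```

In this toy case the raw type check flags the string `"20"`, while the coercion-based count reports no problem.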


### 1D. Invalid values
Check for impossible values such as negative spending amounts.

In [9]:
# Count negative amounts (should be 0)
neg_amount = int((spend["amount"] < 0).sum())

print("Negative amount entries:", neg_amount, f"({neg_amount/max(n_entries,1)*100:.2f}%)")

Negative amount entries: 0 (0.00%)


### 1E. Category formatting consistency
Compare raw vs normalized category values (lowercase + trimmed) to detect casing/spacing inconsistencies.

In [10]:
cat_raw = spend["category"].astype("string")
cat_norm = cat_raw.str.strip().str.lower()

# Count unique categories before vs after normalization
n_raw = cat_raw.dropna().nunique()
n_norm = cat_norm.dropna().nunique()

print("Distinct raw categories:", n_raw)
print("Distinct normalized categories:", n_norm)

# Show normalized categories that have multiple raw variants (case/spacing differences)
variants = (
    pd.DataFrame({"raw": cat_raw, "norm": cat_norm})
    .dropna()
    .drop_duplicates()
    .groupby("norm")["raw"]
    .unique()
)

variants = variants[variants.map(len) > 1]

if len(variants) == 0:
    print("No formatting variants found (case/spacing is consistent).")
else:
    display(variants.head(20))

Distinct raw categories: 15
Distinct normalized categories: 15
No formatting variants found (case/spacing is consistent).
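
Since the dataset contains no formatting variants, the `else` branch above never runs. To show what the check would surface, here is the same grouping logic applied to made-up categories that do differ in casing and spacing (the `toy_` names and values are illustrative only):

```python
import pandas as pd

# Made-up categories with deliberate case/spacing differences
toy_cat = pd.Series(["Rent", "rent ", "Dining", "DINING", "Fitness"])
toy_norm = toy_cat.str.strip().str.lower()

# Group raw spellings under their normalized form
toy_variants = (
    pd.DataFrame({"raw": toy_cat, "norm": toy_norm})
    .drop_duplicates()
    .groupby("norm")["raw"]
    .unique()
)

# Keep only normalized categories that have more than one raw spelling
toy_variants = toy_variants[toy_variants.map(len) > 1]
print(toy_variants)
```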
