# 01 — Data Quality & Cleaning Pipeline 

This notebook loads the NovaCred credit application JSON dataset, performs a data quality audit, and demonstrates remediation steps in code.

**Input:** `data/raw/raw_credit_applications.json` (kept local / not committed due to PII fields)  
**Outputs:**  
- `data/processed/applications_clean.csv` (cleaned, may still contain PII)  
- `data/processed/applications_analysis_ready.csv` (PII removed for analysis)

**Quality checks covered (with counts and %):** duplicate records, missing/incomplete values, inconsistent data types, inconsistent categorical coding, inconsistent date formats, invalid/impossible values, and internal consistency checks (accuracy).
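Each check reports a count and a percentage of affected rows. A minimal sketch of such a report follows, assuming a hypothetical helper `summarize_issue` and a toy DataFrame. Neither is part of the original notebook, and the column names in the toy data are illustrative only.

```python
import pandas as pd


def summarize_issue(mask: pd.Series, label: str) -> dict:
    """Return the count and percentage of rows flagged by a boolean mask."""
    n_total = len(mask)
    n_flagged = int(mask.sum())
    pct = round(100 * n_flagged / n_total, 2) if n_total else 0.0
    return {"check": label, "count": n_flagged, "pct": pct}


# Toy data for illustration only (not the NovaCred dataset)
toy = pd.DataFrame(
    {"_id": ["a", "b", "b", "c"], "income": [50000, None, 42000, -1]}
)

report = pd.DataFrame(
    [
        # keep=False flags every row that shares an _id with another row
        summarize_issue(toy.duplicated(subset="_id", keep=False), "duplicate _id"),
        summarize_issue(toy["income"].isna(), "missing income"),
        summarize_issue(toy["income"] < 0, "negative income"),
    ]
)
print(report)
```

Expressing every check as a boolean mask keeps the counting logic in one place, so percentages are always computed against the same denominator.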

In [4]:
# Imports
import json
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

## 0. Load & schema sanity check

Goal: confirm the dataset loads correctly, has the expected shape, and contains the key columns required for downstream quality checks.

- **Input file:** `data/raw/raw_credit_applications.json`
- **Method:** load JSON array → flatten nested objects using `pd.json_normalize(sep=".")`
- **Sanity checks:** print dataset shape, preview 3 rows, and verify presence of key expected columns (schema check).
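To illustrate what the flattening step does, here is a small sketch on a single made-up record shaped like the application JSON. The record and its values are hypothetical. Nested dictionaries become dotted column names, while list-valued fields such as `spending_behavior` are left as one object column.

```python
import pandas as pd

# Hypothetical record shaped like the application JSON (values are made up)
sample = [
    {
        "_id": "app_000",
        "spending_behavior": [{"category": "Rent", "amount": 100}],
        "applicant_info": {"full_name": "Jane Doe", "gender": "F"},
        "financials": {"annual_income": 50000},
        "decision": {"loan_approved": True},
    }
]

flat = pd.json_normalize(sample, sep=".")

# Nested dicts are expanded into dotted columns
print(sorted(flat.columns))

# The list under spending_behavior is not expanded; the cell holds the
# original Python list and needs separate handling if it is analysed.
print(type(flat.loc[0, "spending_behavior"]))
```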

In [5]:
from pathlib import Path
import json
import pandas as pd

# Raw file lives in data/raw/ (kept local, not committed to the repo)
data_path = Path("../data/raw/raw_credit_applications.json")

# Read JSON
with data_path.open("r", encoding="utf-8") as f:
    apps = json.load(f)

# Flatten nested JSON into a tabular DataFrame
df_raw = pd.json_normalize(apps, sep=".")

# Basic sanity checks
print("rows, cols:", df_raw.shape)
display(df_raw.head(3))

# Check expected columns
expected = {
    "_id",
    "spending_behavior",
    "processing_timestamp",
    "applicant_info.full_name",
    "applicant_info.ssn",
    "financials.annual_income",
    "financials.credit_history_months",
    "financials.debt_to_income",
    "financials.savings_balance",
    "decision.loan_approved",
}

missing = expected - set(df_raw.columns)
print("Missing expected columns:", missing)

rows, cols: (502, 21)


| | _id | spending_behavior | processing_timestamp | applicant_info.full_name | applicant_info.email | applicant_info.ssn | applicant_info.ip_address | applicant_info.gender | applicant_info.date_of_birth | applicant_info.zip_code | ... | financials.credit_history_months | financials.debt_to_income | financials.savings_balance | decision.loan_approved | decision.rejection_reason | loan_purpose | decision.interest_rate | decision.approved_amount | financials.annual_salary | notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | app_200 | [{'category': 'Shopping', 'amount': 480}, {'ca... | 2024-01-15T00:00:00Z | Jerry Smith | jerry.smith17@hotmail.com | 596-64-4340 | 192.168.48.155 | Male | 2001-03-09 | 10036 | ... | 23 | 0.2 | 31212 | False | algorithm_risk_score | | | | | |
| 1 | app_037 | [{'category': 'Rent', 'amount': 608}, {'catego... | | Brandon Walker | brandon.walker2@yahoo.com | 425-69-4784 | 10.1.102.112 | M | 1992-03-31 | 10032 | ... | 51 | 0.18 | 17915 | False | algorithm_risk_score | | | | | |
| 2 | app_215 | [{'category': 'Rent', 'amount': 109}] | | Scott Moore | scott.moore94@mail.com | 370-78-5178 | 10.240.193.250 | Male | 1989-10-24 | 10075 | ... | 41 | 0.21 | 37909 | True | | vacation | 3.7 | 59000.0 | | |


Missing expected columns: set()


Dataset loaded successfully (502 rows, 21 columns) and all key expected columns are present.
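The schema check above prints the set of missing columns but lets execution continue. If a fail-fast variant is preferred, one possible sketch is shown below. The helper name `assert_columns_present` and the toy frame are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd


def assert_columns_present(df: pd.DataFrame, expected: set) -> None:
    """Raise a ValueError listing any expected columns missing from df."""
    missing = sorted(set(expected) - set(df.columns))
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")


# Toy frame for illustration only
toy = pd.DataFrame({"_id": ["app_000"], "decision.loan_approved": [True]})

# All expected columns exist, so this call returns without raising
assert_columns_present(toy, {"_id", "decision.loan_approved"})

# A missing column triggers a ValueError that names it
try:
    assert_columns_present(toy, {"_id", "financials.annual_income"})
except ValueError as err:
    error_message = str(err)
    print(error_message)
```

Raising stops downstream cells from running against an unexpected schema, at the cost of interrupting exploratory work when a column is legitimately absent.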