# 01 — Data Quality & Cleaning Pipeline 

This notebook loads the NovaCred credit application JSON dataset, performs a data quality audit, and demonstrates remediation steps in code.

**Input:** `data/raw/raw_credit_applications.json` (kept local / not committed due to PII fields)  
**Outputs:**  
- `data/processed/applications_clean.csv` (cleaned, may still contain PII)  
- `data/processed/applications_analysis_ready.csv` (PII removed for analysis)

**Quality checks covered (with counts and %):** duplicate records, missing/incomplete values, inconsistent data types, inconsistent categorical coding, inconsistent date formats, invalid/impossible values, and internal consistency checks (accuracy).
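The second output differs from the first only in that PII fields are removed. A minimal sketch of that step on a toy frame follows; the `PII_COLS` list is a hypothetical choice for illustration, since which fields count as PII is not specified here.

```python
import pandas as pd

# Hypothetical PII column list (an assumption, not the notebook's actual list)
PII_COLS = [
    "applicant_info.full_name",
    "applicant_info.email",
    "applicant_info.ssn",
    "applicant_info.ip_address",
]

# Toy one-row frame with made-up values
toy = pd.DataFrame({
    "_id": ["app_X"],
    "applicant_info.full_name": ["Jane Doe"],
    "applicant_info.ssn": ["000-00-0000"],
    "financials.debt_to_income": [0.2],
})

# errors="ignore" skips listed columns that are absent from the frame
analysis_ready = toy.drop(columns=PII_COLS, errors="ignore")
print(list(analysis_ready.columns))
```

Keeping the list in one named constant makes it easy to review which fields are excluded from the analysis-ready file.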

In [4]:
# Imports
import json
import warnings
from pathlib import Path

import numpy as np
import pandas as pd

## 0. Load & schema sanity check

Goal: confirm the dataset loads correctly, has the expected shape, and contains the key columns required for downstream quality checks.

- **Input file:** `data/raw/raw_credit_applications.json`
- **Method:** load JSON array → flatten nested objects using `pd.json_normalize(sep=".")`
- **Sanity checks:** print dataset shape, preview 3 rows, and verify presence of key expected columns (schema check).

In [5]:
from pathlib import Path
import json
import pandas as pd

# Raw file lives in data/raw/ (kept local, not committed to the repo)
data_path = Path("../data/raw/raw_credit_applications.json")

# Read JSON
with data_path.open("r", encoding="utf-8") as f:
    apps = json.load(f)

# Flatten nested JSON into a tabular DataFrame
df_raw = pd.json_normalize(apps, sep=".")

# Basic sanity checks
print("rows, cols:", df_raw.shape)
display(df_raw.head(3))

# Check expected columns
expected = {
    "_id",
    "spending_behavior",
    "processing_timestamp",
    "applicant_info.full_name",
    "applicant_info.ssn",
    "financials.annual_income",
    "financials.credit_history_months",
    "financials.debt_to_income",
    "financials.savings_balance",
    "decision.loan_approved",
}

missing = expected - set(df_raw.columns)
print("Missing expected columns:", missing)

rows, cols: (502, 21)


Unnamed: 0,_id,spending_behavior,processing_timestamp,applicant_info.full_name,applicant_info.email,applicant_info.ssn,applicant_info.ip_address,applicant_info.gender,applicant_info.date_of_birth,applicant_info.zip_code,...,financials.credit_history_months,financials.debt_to_income,financials.savings_balance,decision.loan_approved,decision.rejection_reason,loan_purpose,decision.interest_rate,decision.approved_amount,financials.annual_salary,notes
0,app_200,"[{'category': 'Shopping', 'amount': 480}, {'ca...",2024-01-15T00:00:00Z,Jerry Smith,jerry.smith17@hotmail.com,596-64-4340,192.168.48.155,Male,2001-03-09,10036,...,23,0.2,31212,False,algorithm_risk_score,,,,,
1,app_037,"[{'category': 'Rent', 'amount': 608}, {'catego...",,Brandon Walker,brandon.walker2@yahoo.com,425-69-4784,10.1.102.112,M,1992-03-31,10032,...,51,0.18,17915,False,algorithm_risk_score,,,,,
2,app_215,"[{'category': 'Rent', 'amount': 109}]",,Scott Moore,scott.moore94@mail.com,370-78-5178,10.240.193.250,Male,1989-10-24,10075,...,41,0.21,37909,True,,vacation,3.7,59000.0,,


Missing expected columns: set()


Dataset loaded successfully (502 rows, 21 columns) and all key expected columns are present.

## 1. `spending_behavior` (array of objects) — Data quality checks

`spending_behavior` is a nested array (list of `{category, amount}` objects).  
We validate completeness, type consistency, validity (e.g., negative amounts), formatting consistency (category casing), and within-application duplicates.

### 1A. Create an analysis table

To run data quality checks at the entry level, we **explode** the array so that each `{category, amount}` object becomes one row in a new table (`spend`).

In [6]:
# Build an exploded table: one row per spending entry
spend = (
    df_raw[["_id", "spending_behavior"]]
    .dropna(subset=["spending_behavior"])
    .explode("spending_behavior", ignore_index=True)
)

# Extract fields safely
spend["category"] = spend["spending_behavior"].map(lambda x: x.get("category") if isinstance(x, dict) else None)
spend["amount"] = pd.to_numeric(
    spend["spending_behavior"].map(lambda x: x.get("amount") if isinstance(x, dict) else None),
    errors="coerce"
)

print("Total spending entries:", len(spend))
display(spend[["_id", "category", "amount"]].head(10))

Total spending entries: 827


Unnamed: 0,_id,category,amount
0,app_200,Shopping,480
1,app_200,Rent,790
2,app_200,Alcohol,247
3,app_037,Rent,608
4,app_037,Dining,96
5,app_037,Healthcare,243
6,app_215,Rent,109
7,app_024,Fitness,575
8,app_184,Entertainment,463
9,app_275,Entertainment,571


### 1B. Missing / incomplete

We check completeness at two levels:
1) application-level: is `spending_behavior` missing for any applications?
2) entry-level: after exploding the array, are any entries missing `category` or `amount`?

In [7]:
# Total number of rows in the main table
n_apps = len(df_raw)

# Count applications where spending_behavior is missing (NaN)
missing_sb_apps = df_raw["spending_behavior"].isna().sum()
print("Applications missing spending_behavior:", missing_sb_apps, f"({missing_sb_apps/n_apps*100:.2f}%)")

# Total number of spending entries (rows) after exploding the array
n_entries = len(spend)

# Count entries where category is missing
missing_cat = spend["category"].isna().sum()

# Count entries where amount is missing or non-numeric (amounts were coerced with pd.to_numeric)
missing_amt = spend["amount"].isna().sum()

print("Missing category entries:", missing_cat, f"({missing_cat/max(n_entries,1)*100:.2f}%)")
print("Missing amount entries:", missing_amt, f"({missing_amt/max(n_entries,1)*100:.2f}%)")

Applications missing spending_behavior: 0 (0.00%)
Missing category entries: 0 (0.00%)
Missing amount entries: 0 (0.00%)


### 1C. Inconsistent data types

Validate that each spending entry has:
- `category` as text (string)
- `amount` as a numeric value 

In [8]:
# Count values that aren't Python strings
non_string_cat = (~spend["category"].dropna().map(lambda x: isinstance(x, str))).sum()
print("Non-string category entries:", non_string_cat,
      f"({non_string_cat/max(n_entries, 1)*100:.2f}%)")

# Count amounts that are missing or not numeric (NaN after pd.to_numeric)
non_numeric_amount = spend["amount"].isna().sum()
print("Non-numeric/missing amount entries:", non_numeric_amount,
      f"({non_numeric_amount/max(n_entries, 1)*100:.2f}%)")


Non-string category entries: 0 (0.00%)
Non-numeric/missing amount entries: 0 (0.00%)


### 1D. Invalid values
Check for impossible values such as negative spending amounts.

In [9]:
# Count negative amounts (should be 0)
neg_amount = int((spend["amount"] < 0).sum())

print("Negative amount entries:", neg_amount, f"({neg_amount/max(n_entries,1)*100:.2f}%)")

Negative amount entries: 0 (0.00%)


### 1E. Category formatting consistency
Compare raw vs normalized category values (lowercase + trimmed) to detect casing/spacing inconsistencies.

In [10]:
cat_raw = spend["category"].astype("string")
cat_norm = cat_raw.str.strip().str.lower()

# Count unique categories before vs after normalization
n_raw = cat_raw.dropna().nunique()
n_norm = cat_norm.dropna().nunique()

print("Distinct raw categories:", n_raw)
print("Distinct normalized categories:", n_norm)

# Show normalized categories that have multiple raw variants (case/spacing differences)
variants = (
    pd.DataFrame({"raw": cat_raw, "norm": cat_norm})
    .dropna()
    .drop_duplicates()
    .groupby("norm")["raw"]
    .unique()
)

variants = variants[variants.map(len) > 1]

if len(variants) == 0:
    print("No formatting variants found (case/spacing is consistent).")
else:
    display(variants.head(20))

Distinct raw categories: 15
Distinct normalized categories: 15
No formatting variants found (case/spacing is consistent).
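The introduction to this section also lists within-application duplicates among the checks. A minimal sketch of such a check on a toy table is given here. It assumes a duplicate means the same normalized category appearing more than once within one `_id`; that definition and the toy values are illustrative only.

```python
import pandas as pd

# Toy exploded table (made-up values) with the same columns as `spend`
toy_spend = pd.DataFrame({
    "_id": ["app_A", "app_A", "app_A", "app_B"],
    "category": ["Rent", "rent ", "Dining", "Rent"],
    "amount": [608, 120, 96, 109],
})

# Normalize category the same way as in 1E before comparing
norm_cat = toy_spend["category"].str.strip().str.lower()

# Flag every entry whose (_id, normalized category) pair occurs more than once
dup_mask = (
    toy_spend.assign(norm_cat=norm_cat)
    .duplicated(subset=["_id", "norm_cat"], keep=False)
)

n_dup_entries = int(dup_mask.sum())
n_dup_apps = toy_spend.loc[dup_mask, "_id"].nunique()
print("Duplicate entries:", n_dup_entries, "in", n_dup_apps, "application(s)")
```

Whether repeated categories are an error or legitimate separate expenses is a business decision, so a check like this is better used to flag rows for review than to drop them.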


## 2. Main table — Data quality checks

### 2A. Duplicate records (Uniqueness)

We check for duplicate records using key identifiers:
- duplicate application IDs (`_id`)
- duplicate applicant identifiers (`applicant_info.ssn`, excluding missing values)

We report counts and percentages, and later apply a remediation rule during the cleaning step.

In [11]:
# Total number of applications
n = len(df_raw)

# 1) Duplicate application IDs (_id)
# _id should uniquely identify an application
dup_id = df_raw["_id"].duplicated().sum()

# 2) SSN duplicates (excluding missing)
# SSN identifies an applicant, so we check duplicates only among non-missing SSNs
ssn = df_raw["applicant_info.ssn"]
missing_ssn = ssn.isna().sum()
dup_ssn_excl_missing = ssn.dropna().duplicated().sum()
n_non_missing_ssn = ssn.notna().sum()

print("Duplicate _id:", dup_id, f"({dup_id/n*100:.2f}%)")
print("Missing SSN:", missing_ssn, f"({missing_ssn/n*100:.2f}%)")
print("Duplicate SSN (excluding missing):", dup_ssn_excl_missing,
      f"({dup_ssn_excl_missing/max(n_non_missing_ssn,1)*100:.2f}% of non-missing SSNs)")

Duplicate _id: 2 (0.40%)
Missing SSN: 5 (1.00%)
Duplicate SSN (excluding missing): 3 (0.60% of non-missing SSNs)


Results show 2 duplicate application IDs (0.40%), and 3 duplicate SSNs among non-missing values (0.60%). These indicate uniqueness issues and will be addressed in the cleaning pipeline.

In [12]:
# Show duplicated _id rows
dup_id_rows = df_raw[df_raw["_id"].duplicated(keep=False)].sort_values("_id")
display(dup_id_rows[["_id", "applicant_info.full_name", "applicant_info.ssn", "processing_timestamp"]])

# Show duplicated SSN rows (excluding missing SSNs)
dup_ssn_rows = df_raw[df_raw["applicant_info.ssn"].notna() & df_raw["applicant_info.ssn"].duplicated(keep=False)] \
    .sort_values("applicant_info.ssn")
display(dup_ssn_rows[["_id", "applicant_info.full_name", "applicant_info.ssn"]].head(20))

Unnamed: 0,_id,applicant_info.full_name,applicant_info.ssn,processing_timestamp
383,app_001,Stephanie Nguyen,427-90-1892,
455,app_001,Stephanie Nguyen,,
8,app_042,Joseph Lopez,652-70-5530,
354,app_042,Joseph Lopez,652-70-5530,


Unnamed: 0,_id,applicant_info.full_name,applicant_info.ssn
8,app_042,Joseph Lopez,652-70-5530
354,app_042,Joseph Lopez,652-70-5530
92,app_088,Susan Martinez,780-24-9300
122,app_016,Gary Wilson,780-24-9300
16,app_101,Sandra Smith,937-72-8731
499,app_234,Samuel Hill,937-72-8731


These tables show example duplicate cases:
- `_id` duplicates (e.g., `app_001`) appear as repeated application records.
- SSN duplicates appear either within the same `_id` (e.g., `app_042`) or across different `_id`s (e.g., `app_088` and `app_016` share one SSN).

These will be handled during remediation.
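The remediation rule itself is not stated at this point. As one possible approach, the sketch below keeps the most complete row per `_id` (the row with the fewest missing values). Both the rule and the toy data are assumptions for illustration.

```python
import pandas as pd

# Toy rows mimicking the duplicate patterns above (all values are made up)
toy = pd.DataFrame({
    "_id": ["app_A", "app_A", "app_B", "app_B", "app_C"],
    "applicant_info.ssn": ["111-11-1111", None, "222-22-2222", "222-22-2222", "333-33-3333"],
    "applicant_info.full_name": ["Ann Lee", "Ann Lee", "Bob Ray", "Bob Ray", "Cy Dow"],
})

# Assumed rule: within each _id, keep the row with the fewest missing values
n_missing = toy.isna().sum(axis=1)
deduped = (
    toy.assign(_n_missing=n_missing)
    .sort_values(["_id", "_n_missing"])
    .drop_duplicates(subset="_id", keep="first")
    .drop(columns="_n_missing")
    .sort_index()
)

print(len(toy), "->", len(deduped), "rows")
```

SSNs shared across different `_id`s and different names are a separate case: they may indicate data entry errors or something worth investigating, so flagging them for manual review is likely safer than dropping rows automatically.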

### 2B. Missing / incomplete records (Completeness)

We quantify missingness for each field (count and %) to identify incomplete columns and prioritize remediation.

In [13]:
# Count missing (NaN) values per column
missing_count = df_raw.isna().sum().sort_values(ascending=False)

# Percent missing per column
missing_pct = (df_raw.isna().mean() * 100).round(2).sort_values(ascending=False)

# Combine into one table for reporting
missing_table = pd.DataFrame({
    "missing_count": missing_count,
    "missing_%": missing_pct
})

display(missing_table)

Unnamed: 0,missing_count,missing_%
notes,500,99.6
financials.annual_salary,497,99.0
loan_purpose,452,90.04
processing_timestamp,440,87.65
decision.rejection_reason,292,58.17
decision.approved_amount,210,41.83
decision.interest_rate,210,41.83
financials.annual_income,5,1.0
applicant_info.ip_address,5,1.0
applicant_info.ssn,5,1.0


Results show that the most incomplete fields are `notes` (99.6%), `financials.annual_salary` (99.0%), `loan_purpose` (90.0%), and `processing_timestamp` (87.7%).  
Decision-related fields are conditionally missing: `decision.approved_amount` and `decision.interest_rate` are missing for ~41.8% of applications, and `decision.rejection_reason` is missing for ~58.2%, consistent with approval/denial outcomes.

The table also shows that the dataset contains both `financials.annual_income` and `financials.annual_salary`.  
`annual_salary` is populated for only 5 records (missing in 99.0%), suggesting inconsistent field usage that can cause confusion in downstream analysis.  
We address this in remediation by merging `annual_salary` into `annual_income` where `annual_income` is missing, then dropping `annual_salary`.
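A minimal sketch of that merge on toy data, assuming both columns are first coerced to numeric (the values below are made up):

```python
import pandas as pd

# Toy frame: income is sometimes missing or stored as a string
toy = pd.DataFrame({
    "financials.annual_income": [52000, None, "61000", None],
    "financials.annual_salary": [None, 48000.0, None, None],
})

# Coerce both columns to numeric; unparseable values become NaN
income = pd.to_numeric(toy["financials.annual_income"], errors="coerce")
salary = pd.to_numeric(toy["financials.annual_salary"], errors="coerce")

# Use salary only where income is missing, then drop the redundant column
toy["financials.annual_income"] = income.fillna(salary)
toy = toy.drop(columns="financials.annual_salary")

print(toy)
```

Rows where both fields are missing stay missing; the merge never overwrites an existing `annual_income` value.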

### 2C. Inconsistent data types (Consistency)

We verify that numeric fields are consistently stored as numeric values (not strings).  
This is important before any modeling or aggregation.

In [14]:
numeric_cols = [
    "financials.annual_income",
    "financials.annual_salary",
    "financials.credit_history_months",
    "financials.debt_to_income",
    "financials.savings_balance",
    "decision.interest_rate",
    "decision.approved_amount",
]

# For each numeric column, count how many values are int/float/str (excluding NaN)
type_counts = {}
for col in numeric_cols:
    s = df_raw[col]
    counts = s.dropna().map(lambda x: type(x).__name__).value_counts()
    type_counts[col] = counts.to_dict()

# Convert the dictionary into a table
type_counts_df = pd.DataFrame(type_counts).fillna(0).astype(int).T

print("Type counts by column (non-missing values):")
display(type_counts_df)

# Percentages (out of non-missing values in each column)
non_missing = df_raw[numeric_cols].notna().sum()
type_pct_df = (type_counts_df.div(non_missing, axis=0) * 100).round(2)

print("Type percentages by column (non-missing values):")
display(type_pct_df)

Type counts by column (non-missing values):


Unnamed: 0,int,str,float
financials.annual_income,488,8,1
financials.annual_salary,0,0,5
financials.credit_history_months,502,0,0
financials.debt_to_income,0,0,502
financials.savings_balance,502,0,0
decision.interest_rate,0,0,292
decision.approved_amount,0,0,292


Type percentages by column (non-missing values):


Unnamed: 0,int,str,float
financials.annual_income,98.19,1.61,0.2
financials.annual_salary,0.0,0.0,100.0
financials.credit_history_months,100.0,0.0,0.0
financials.debt_to_income,0.0,0.0,100.0
financials.savings_balance,100.0,0.0,0.0
decision.interest_rate,0.0,0.0,100.0
decision.approved_amount,0.0,0.0,100.0


**Findings (type consistency):**
- `financials.annual_income` has mixed types: of 497 non-missing values, 488 are integers, 8 are strings, and 1 is a float, indicating inconsistent typing across records.
- `financials.annual_salary` is rarely populated and appears separately from `annual_income`, suggesting inconsistent schema usage for income.
- Other numeric fields (`credit_history_months`, `debt_to_income`, `savings_balance`, and decision terms) are consistently numeric when present.
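Before coercing a mixed-type column, it can help to separate strings that are numeric in disguise from strings that carry no usable number. The sketch below shows one way to do that on toy values; the example strings are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy column with mixed Python types (made-up values)
toy = pd.Series([52000, "61000", 47500.5, "n/a", None], dtype="object")

# Which entries are stored as strings?
is_str = toy.map(lambda x: isinstance(x, str))

# Coerce everything; unparseable entries become NaN
coerced = pd.to_numeric(toy, errors="coerce")

# Strings that parse cleanly vs. strings that are lost by coercion
str_ok = is_str & coerced.notna()
str_bad = is_str & coerced.isna()

print("Numeric-looking strings:", int(str_ok.sum()))
print("Non-numeric strings:", int(str_bad.sum()))
```

Reporting the second count separately keeps coercion from silently turning bad strings into missing values.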

As a quick schema validation step, we verify that fields expected to be strings contain only string values (for non-missing entries), and that `decision.loan_approved` behaves as a boolean. This helps catch unexpected structures (e.g., numbers/lists in text fields) before remediation.

In [15]:
expected_str = [
    "_id",
    "applicant_info.full_name",
    "applicant_info.email",
    "applicant_info.ssn",
    "applicant_info.ip_address",
    "applicant_info.gender",
    "applicant_info.date_of_birth",
    "applicant_info.zip_code",
    "decision.rejection_reason",
    "loan_purpose",
    "notes",
]

# For these, check if any non-missing values are NOT strings
non_string_counts = {}
for col in expected_str:
    if col in df_raw.columns:
        s = df_raw[col]
        non_string_counts[col] = int(s.dropna().map(lambda x: not isinstance(x, str)).sum())

display(pd.Series(non_string_counts).sort_values(ascending=False))

# Boolean check for loan_approved
import numpy as np

vals = df_raw["decision.loan_approved"].dropna().unique()
print("Unique loan_approved values:", vals)
print("Types:", [type(v) for v in vals])
print("All boolean (Python bool or numpy.bool_)?", all(isinstance(v, (bool, np.bool_)) for v in vals))

_id                             0
applicant_info.full_name        0
applicant_info.email            0
applicant_info.ssn              0
applicant_info.ip_address       0
applicant_info.gender           0
applicant_info.date_of_birth    0
applicant_info.zip_code         0
decision.rejection_reason       0
loan_purpose                    0
notes                           0
dtype: int64

Unique loan_approved values: [False  True]
Types: [<class 'numpy.bool'>, <class 'numpy.bool'>]
All boolean (Python bool or numpy.bool_)? True


**Result:** All fields expected to be strings contain only string values for non-missing entries (0 non-string values found in each).  
`decision.loan_approved` contains only boolean values (`True`/`False`, stored as `numpy.bool_`), so it is type-consistent with the schema.

### 2D. Inconsistent coding/formatting of categorical fields (Consistency)

We check whether categorical fields use consistent representations (e.g., `Male` vs `M`, casing/spacing issues, blanks).
We focus on `applicant_info.gender` because it is used downstream for fairness/bias analysis and must be standardized.

In [16]:
n = len(df_raw)
g = df_raw["applicant_info.gender"].astype("string")

# Full raw value distribution (counts + %)
vc = g.value_counts(dropna=False)
vc_pct = (vc / n * 100).round(2)

gender_table = pd.DataFrame({
    "count": vc,
    "pct_%": vc_pct
})

print("Raw gender distribution (count and %):")
display(gender_table)

# Summary metric for shorthand + blanks (optional, still derived from the same table)
n_M = vc.get("M", 0)
n_F = vc.get("F", 0)
n_blank = vc.get("", 0) # blank string after astype("string")
n_na = vc.get(pd.NA, 0) # sometimes shows as <NA>

print("Total shorthand (M+F):", n_M + n_F, f"({(n_M+n_F)/n*100:.2f}%)")
print("Blank + NA:", n_blank + n_na, f"({(n_blank+n_na)/n*100:.2f}%)")

Raw gender distribution (count and %):


Unnamed: 0_level_0,count,pct_%
applicant_info.gender,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,195,38.84
Female,193,38.45
F,58,11.55
M,53,10.56
,2,0.4
<NA>,1,0.2


Total shorthand (M+F): 111 (22.11%)
Blank + NA: 3 (0.60%)


**Finding:** `applicant_info.gender` is inconsistently coded.  
22.11% of records use shorthand (`M`/`F`) instead of full labels (`Male`/`Female`). There are also 2 blank values (0.40%) and 1 missing value (0.20%).  
This will be remediated by standardizing `M→Male`, `F→Female`, and converting blanks to missing.
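A minimal sketch of that standardization on toy values. The mapping dictionary is an illustrative assumption; note that with this approach any value outside the mapping also becomes missing, which should be reviewed before use.

```python
import pandas as pd

# Toy gender values (made up), including shorthand, blanks, and stray spacing
toy = pd.Series(["Male", "M", "F", "Female", "", None, " f "])

# Assumed mapping from normalized (trimmed, lowercase) values to full labels
gender_map = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Values not in the mapping (including blanks) become NaN
cleaned = toy.str.strip().str.lower().map(gender_map)

print(cleaned.value_counts(dropna=False))
```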

### 2E. Inconsistent date formats (Consistency)

We test whether date fields are consistently parseable:
- `applicant_info.date_of_birth`
- `processing_timestamp`

In [17]:
dob = df_raw["applicant_info.date_of_birth"]
ts  = df_raw["processing_timestamp"]

# Try to parse the date fields (invalid formats become NaT)
dob_parsed = pd.to_datetime(dob, errors="coerce", utc=False)
ts_parsed  = pd.to_datetime(ts, errors="coerce", utc=True)

# Invalid = present but cannot be parsed
invalid_dob = dob.notna() & dob_parsed.isna()
invalid_ts  = ts.notna()  & ts_parsed.isna()

print("DOB missing:", dob.isna().sum(), f"({dob.isna().mean()*100:.2f}%)")
print("DOB invalid format:", invalid_dob.sum(), f"({invalid_dob.mean()*100:.2f}%)")

print("processing_timestamp missing:", ts.isna().sum(), f"({ts.isna().mean()*100:.2f}%)")
print("processing_timestamp invalid format:", invalid_ts.sum(), f"({invalid_ts.mean()*100:.2f}%)")

DOB missing: 1 (0.20%)
DOB invalid format: 161 (32.07%)
processing_timestamp missing: 440 (87.65%)
processing_timestamp invalid format: 0 (0.00%)


**Findings (date format consistency):**
- `applicant_info.date_of_birth` has a high rate of unparseable values (161 rows, 32.07%), indicating inconsistent date formats across records.
- `processing_timestamp` is largely missing (440 rows, 87.65%), but when present it is consistently parseable (0 invalid formats).

These issues will be handled in remediation by standardizing DOB parsing (multiple formats) and keeping `processing_timestamp` as missing when not available.

In [18]:
# Show a few examples where DOB is present but could not be parsed
invalid_dob_rows = df_raw[invalid_dob][["_id", "applicant_info.date_of_birth"]].head(15)
display(invalid_dob_rows)

Unnamed: 0,_id,applicant_info.date_of_birth
5,app_275,14/02/1982
6,app_099,28/01/1990
11,app_320,01/12/1978
14,app_307,1990/07/26
21,app_173,18/07/1979
23,app_289,20/04/1979
26,app_075,
32,app_274,1986/11/20
35,app_276,1995/05/07
36,app_386,03/20/1968


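DOB values are stored in mixed formats (DD/MM/YYYY, MM/DD/YYYY, YYYY/MM/DD), which causes parsing failures under a single default parser. The invalid count also includes blank strings (e.g., `app_075`), which are missing values rather than alternate formats.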
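One way to standardize such a column is to try a list of explicit formats in order and keep the first successful parse. The sketch below uses toy values. The priority order is an assumption, and it matters: a string like `01/12/1978` is valid as both DD/MM/YYYY and MM/DD/YYYY, so it is read as whichever format comes first in the list.

```python
import pandas as pd

# Toy DOB strings (made up) covering the formats seen above, plus blank and missing
toy = pd.Series(["2001-03-09", "14/02/1982", "1990/07/26", "03/20/1968", "", None])

# Assumed priority order; DD/MM/YYYY is tried before MM/DD/YYYY
formats = ["%Y-%m-%d", "%Y/%m/%d", "%d/%m/%Y", "%m/%d/%Y"]

# Parse with the first format, then fill remaining gaps with each later format
parsed = pd.to_datetime(toy, format=formats[0], errors="coerce")
for fmt in formats[1:]:
    parsed = parsed.fillna(pd.to_datetime(toy, format=fmt, errors="coerce"))

print(parsed)
```

Values that match none of the formats, including blank strings, remain `NaT` and can be counted as missing.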

### 2F. Invalid / impossible values (Validity)

We check numeric fields for values outside valid ranges (e.g., negative months, ratios outside [0,1], negative balances).

In [19]:
# Convert numeric-like columns to numbers (strings become numeric; bad values become NaN)
income = pd.to_numeric(df_raw["financials.annual_income"], errors="coerce")
chm    = pd.to_numeric(df_raw["financials.credit_history_months"], errors="coerce")
dti    = pd.to_numeric(df_raw["financials.debt_to_income"], errors="coerce")
sav    = pd.to_numeric(df_raw["financials.savings_balance"], errors="coerce")
rate   = pd.to_numeric(df_raw["decision.interest_rate"], errors="coerce")
amt    = pd.to_numeric(df_raw["decision.approved_amount"], errors="coerce")

# Count values that violate basic business/validity rules
invalid = {
    "annual_income < 0": (income < 0).sum(),
    "credit_history_months < 0": (chm < 0).sum(),
    "debt_to_income < 0": (dti < 0).sum(),
    "debt_to_income > 1": (dti > 1).sum(),
    "savings_balance < 0": (sav < 0).sum(),
    "interest_rate < 0": (rate < 0).sum(),
    "interest_rate > 1": (rate > 1).sum(),  # only invalid if rates are fractions, not percent (e.g. 3.7)
    "approved_amount < 0": (amt < 0).sum(),
}

# Build a reporting table with counts + percentages (of all rows)
invalid_table = pd.DataFrame({
    "count": pd.Series(invalid),
    "pct_%": (pd.Series(invalid) / n * 100).round(2)
})

display(invalid_table)


Unnamed: 0,count,pct_%
annual_income < 0,0,0.0
credit_history_months < 0,2,0.4
debt_to_income < 0,0,0.0
debt_to_income > 1,1,0.2
savings_balance < 0,1,0.2
interest_rate < 0,0,0.0
interest_rate > 1,292,58.17
approved_amount < 0,0,0.0


`decision.interest_rate` appears to be recorded in percentage units (e.g., 3.7 meaning 3.7%) rather than as a fraction (0.037): all 292 non-missing rates exceed 1. The `> 1` threshold is therefore not treated as invalid; instead, we validate interest rates against a realistic percent range.

In [20]:
rate = pd.to_numeric(df_raw["decision.interest_rate"], errors="coerce")

# Count rates outside a realistic percent range
invalid_rate_pct = ((rate < 0) | (rate > 100)).sum()
print("interest_rate outside [0, 100] percent:", invalid_rate_pct, f"({invalid_rate_pct/len(df_raw)*100:.2f}%)")

# Show min/max to justify treating this as percent points (not fractions)
print("interest_rate min/max (excluding NaN):", rate.min(), rate.max())

interest_rate outside [0, 100] percent: 0 (0.00%)
interest_rate min/max (excluding NaN): 2.5 6.5


**Validity (interest rate units):** `decision.interest_rate` is stored in percentage points (min 2.5, max 6.5). Using a realistic percent range of [0, 100], we found 0 invalid values.

We validate identifier-like fields against basic format rules:
- SSN: `###-##-####`
- ZIP: 5 digits
- IP address: valid IPv4

In [21]:
import ipaddress
import pandas as pd

# Read the three fields as strings
ssn = df_raw["applicant_info.ssn"].astype("string")
zipc = df_raw["applicant_info.zip_code"].astype("string")
ip   = df_raw["applicant_info.ip_address"].astype("string")

# SSN pattern ###-##-####
ssn_invalid = ~ssn.dropna().str.match(r"^\d{3}-\d{2}-\d{4}$")
print("Invalid SSN format:", ssn_invalid.sum(), f"({ssn_invalid.sum()/len(df_raw)*100:.2f}%)")

# ZIP 5 digits (string)
zip_invalid = ~zipc.dropna().str.match(r"^\d{5}$")
print("Invalid ZIP format:", zip_invalid.sum(), f"({zip_invalid.sum()/len(df_raw)*100:.2f}%)")

# IP valid IPv4 (IPv4Address rejects IPv6 addresses and malformed strings)
def is_valid_ip(x):
    if pd.isna(x):
        return True
    try:
        ipaddress.IPv4Address(str(x))
        return True
    except ValueError:  # AddressValueError is a subclass of ValueError
        return False

ip_invalid = ip.dropna().map(lambda x: not is_valid_ip(x))
print("Invalid IP address:", ip_invalid.sum(), f"({ip_invalid.sum()/len(df_raw)*100:.2f}%)")

Invalid SSN format: 0 (0.00%)
Invalid ZIP format: 1 (0.20%)
Invalid IP address: 0 (0.00%)


In [22]:
# Show rows where ZIP is present but not 5 digits
zipc = df_raw["applicant_info.zip_code"].astype("string")
bad_zip_mask = zipc.notna() & (zipc.str.match(r"^\d{5}$") == False)

display(df_raw.loc[bad_zip_mask, ["_id", "applicant_info.zip_code", "applicant_info.ssn", "applicant_info.full_name"]])

Unnamed: 0,_id,applicant_info.zip_code,applicant_info.ssn,applicant_info.full_name
26,app_075,,,Margaret Williams


**Validity/Completeness:** We found 1 invalid ZIP format (0.20%), caused by a blank ZIP code string (example: `app_075`). We treat blank strings as missing and set them to NA during cleaning.
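A minimal sketch of the blank-to-NA conversion on a toy frame. The choice of text columns and the decision to also treat whitespace-only strings as blank are assumptions for illustration.

```python
import pandas as pd

# Toy frame (made-up values) with blank and whitespace-only strings
toy = pd.DataFrame({
    "applicant_info.zip_code": ["10036", "", "   "],
    "applicant_info.gender": ["Male", "", "F"],
    "financials.savings_balance": [100, 200, 300],
})

# Assumed list of text columns to clean; numeric columns are left untouched
text_cols = ["applicant_info.zip_code", "applicant_info.gender"]

for col in text_cols:
    # Trim whitespace (this also trims non-blank values), then blank -> NA
    s = toy[col].astype("string").str.strip()
    toy[col] = s.mask(s == "", pd.NA)

print(toy.isna().sum())
```

After this step, blank values are counted by `isna()` together with truly missing ones, so the completeness table in 2B would reflect them.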

### 2G. Cross-field consistency checks (Accuracy)

We verify that related decision fields are internally consistent:
- Approved loans should have `approved_amount` and `interest_rate`, and should not have `rejection_reason`.
- Denied loans should have `rejection_reason`, and should not have approval terms.

In [24]:
approved = df_raw["decision.loan_approved"] == True
denied   = df_raw["decision.loan_approved"] == False

# Approved but missing terms
approved_missing_terms = approved & (
    df_raw["decision.approved_amount"].isna() | df_raw["decision.interest_rate"].isna()
)

# Approved but has rejection reason
approved_has_rejection = approved & df_raw["decision.rejection_reason"].notna()

# Denied but missing rejection reason
denied_missing_rejection = denied & df_raw["decision.rejection_reason"].isna()

# Denied but has terms
denied_has_terms = denied & (
    df_raw["decision.approved_amount"].notna() | df_raw["decision.interest_rate"].notna()
)

print("Approved but missing approved_amount or interest_rate:", int(approved_missing_terms.sum()))
print("Approved but has rejection_reason:", int(approved_has_rejection.sum()))
print("Denied but missing rejection_reason:", int(denied_missing_rejection.sum()))
print("Denied but has approved_amount or interest_rate:", int(denied_has_terms.sum()))

Approved but missing approved_amount or interest_rate: 0
Approved but has rejection_reason: 0
Denied but missing rejection_reason: 0
Denied but has approved_amount or interest_rate: 0
