In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("credit_risk_eda").getOrCreate()
spark

In [0]:
application_df=spark.table("workspace.credit_risk_data_delta.application_train")
application_df.printSchema()

In [0]:
display(application_df.limit(10))

In [0]:
display(application_df.describe())

#### 1. Target Imbalance:
The `TARGET` variable has a mean of approximately **0.08**, confirming a **strong class imbalance** — only ~8% of applicants default.  
This requires:
- **Stratified sampling** during train-test split  
- Evaluation metrics such as **AUC**, **F1**, or **log loss** rather than accuracy, since always predicting "no default" would already score ~92% accuracy
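
The stratified split mentioned above can be sketched as follows. This is a minimal sketch, not the project's actual split: it assumes `TARGET` takes only the values 0 and 1 and that `SK_ID_CURR` is a unique row identifier (that column name is an assumption, since it has not appeared in this notebook). `stratified_fractions` and `stratified_split` are hypothetical helper names. `sampleBy` samples each class independently, so the requested ratio holds only approximately within each class.

```python
def stratified_fractions(labels, train_fraction):
    """Build the {label: fraction} dict that DataFrame.sampleBy expects."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    return {label: train_fraction for label in labels}


def stratified_split(df, label_col="TARGET", id_col="SK_ID_CURR",
                     train_fraction=0.8, seed=42):
    """Return (train_df, test_df) with roughly the same class ratio in both."""
    # Assumption: the label column only contains 0 and 1.
    fractions = stratified_fractions([0, 1], train_fraction)
    # Sample every class at the same rate.
    train_df = df.sampleBy(label_col, fractions=fractions, seed=seed)
    # Rows not drawn into train become the test set (anti-join on the assumed ID column).
    test_df = df.join(train_df.select(id_col), on=id_col, how="left_anti")
    return train_df, test_df


# Example, using the DataFrame loaded earlier in this notebook:
# train_df, test_df = stratified_split(application_df)
# train_df.groupBy("TARGET").count().show()
```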

---

#### 2. Missing Values:
Many columns, especially **apartment-level features** and **housing metadata**, show a non-null `count` in the `describe()` output well below the full 307,511 entries, which indicates heavy missingness.  
Action Plan:
- Use null percentage summaries to decide on **imputation or removal**  
- Prioritize important columns for retention during feature engineering

---
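
The null-percentage summary from the missing-values action plan could look like the sketch below. It reuses the non-null `count` statistic that `describe()` already reports, so it is one possible approach rather than the only one. `summary_counts` and `null_percentages` are hypothetical helper names, and the sketch assumes every column is numeric or string, since `summary()` only covers those types.

```python
def null_percentages(counts, total_rows):
    """Map each column to its percentage of missing values, highest first."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    pct = {
        name: round(100.0 * (total_rows - int(count)) / total_rows, 2)
        for name, count in counts.items()
    }
    return dict(sorted(pct.items(), key=lambda item: item[1], reverse=True))


def summary_counts(df):
    """Collect the 'count' row of DataFrame.summary() as a {column: count} dict."""
    row = df.summary("count").first().asDict()
    row.pop("summary", None)  # drop the label column that names the statistic
    return row  # values are strings; null_percentages converts them with int()


# Example, using the DataFrame loaded earlier in this notebook:
# pct = null_percentages(summary_counts(application_df), application_df.count())
# for name, value in list(pct.items())[:20]:
#     print(f"{name}: {value}%")
```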

#### 3. Applicant Demographics & Behavior:

- **Age (`DAYS_BIRTH`)**: The column is a day count; its mean corresponds to ~44 years (16037 / 365 ≈ 43.9)
- **Employment (`DAYS_EMPLOYED`)**: Contains unrealistic values (e.g., 365243 days, roughly 1,000 years), which likely indicate retirees or placeholder codes and need cleaning
- **Income & Credit**:  
  - `AMT_INCOME_TOTAL` and `AMT_CREDIT` show **high variance** and **extreme outliers**  
  - These features are ideal candidates for **log transformation** or **binning**
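
The cleaning steps above can be expressed as Spark SQL expressions passed to `selectExpr`. This is a minimal sketch: the new column names (`AGE_YEARS`, `DAYS_EMPLOYED_IS_PLACEHOLDER`, `DAYS_EMPLOYED_CLEAN`, and the `_LOG` suffix) and the helper names are hypothetical. `abs()` is used so the sketch does not depend on the sign convention of `DAYS_BIRTH`, and `log1p` assumes the amount columns are non-negative.

```python
DAYS_PER_YEAR = 365            # same divisor as the estimate above; ignores leap years
EMPLOYED_PLACEHOLDER = 365243  # implausible value observed in DAYS_EMPLOYED


def days_to_years(days):
    """Convert a day count to years, ignoring its sign."""
    return abs(days) / DAYS_PER_YEAR


def day_feature_exprs():
    """SQL expressions for age in years and a cleaned DAYS_EMPLOYED."""
    return [
        f"abs(DAYS_BIRTH) / {DAYS_PER_YEAR} AS AGE_YEARS",
        # Keep the information that the placeholder was present ...
        f"CAST(DAYS_EMPLOYED = {EMPLOYED_PLACEHOLDER} AS INT) AS DAYS_EMPLOYED_IS_PLACEHOLDER",
        # ... then turn the placeholder itself into NULL.
        f"NULLIF(DAYS_EMPLOYED, {EMPLOYED_PLACEHOLDER}) AS DAYS_EMPLOYED_CLEAN",
    ]


def log1p_exprs(columns):
    """SQL expressions applying log(1 + x) to each skewed amount column."""
    return [f"log1p({c}) AS {c}_LOG" for c in columns]


def add_cleaned_features(df, amount_cols=("AMT_INCOME_TOTAL", "AMT_CREDIT")):
    """Return df with the derived columns appended; original columns are kept."""
    return df.selectExpr("*", *day_feature_exprs(), *log1p_exprs(amount_cols))


# Example, using the DataFrame loaded earlier in this notebook:
# cleaned_df = add_cleaned_features(application_df)
```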

---

#### 4. Data Consistency & Encoding Clues:
Some categorical features such as `NAME_CONTRACT_TYPE`, `CODE_GENDER`, `OCCUPATION_TYPE`, etc., contain **ambiguous entries** like `'XNA'`.  
These will require:
- **Cleaning and consolidation**
- Proper encoding during feature engineering (e.g., one-hot or target encoding)
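
Before consolidating, it helps to confirm which columns actually contain the `'XNA'` token and how often. The sketch below counts occurrences per column and then replaces the token with null so it is handled like any other missing value. Treating `'XNA'` as missing is an assumption about its meaning, and both helper names are hypothetical. The counting runs one Spark job per column.

```python
def count_token(df, columns, token="XNA"):
    """Return {column: rows equal to token}, keeping only columns where it occurs."""
    counts = {c: df.filter(df[c] == token).count() for c in columns}
    return {c: n for c, n in counts.items() if n > 0}


def replace_token_with_null(df, columns, token="XNA"):
    """Replace the token with null in the given string columns."""
    return df.replace(token, None, subset=list(columns))


# Example, using the DataFrame loaded earlier in this notebook:
# string_cols = [name for name, dtype in application_df.dtypes if dtype == "string"]
# print(count_token(application_df, string_cols))
# cleaned_df = replace_token_with_null(application_df, string_cols)
```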

---

#### 5. Engineered External Scores:
Features `EXT_SOURCE_1`, `EXT_SOURCE_2`, and `EXT_SOURCE_3`:
- Appear to be **pre-normalized** (values in 0–1 range)
- Typically **strong predictors** in credit risk modeling

Next step: visualize their correlation with `TARGET` in `01_eda.ipynb`.
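
As a numeric complement to that visualization, the Pearson correlation of each score with `TARGET` can be computed directly in Spark. This is a minimal sketch with a hypothetical helper name. Rows where either value is null are dropped explicitly, so each correlation is computed on a different subset of applicants.

```python
EXT_SOURCE_COLS = ["EXT_SOURCE_1", "EXT_SOURCE_2", "EXT_SOURCE_3"]


def target_correlations(df, feature_cols=EXT_SOURCE_COLS, target_col="TARGET"):
    """Pearson correlation of each feature with the target on complete pairs."""
    result = {}
    for c in feature_cols:
        # Keep only rows where both the feature and the target are present.
        pair = df.select(c, target_col).dropna()
        result[c] = pair.stat.corr(c, target_col)
    return result


# Example, using the DataFrame loaded earlier in this notebook:
# for name, value in target_correlations(application_df).items():
#     print(f"{name}: {value:.3f}")
```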

---

#### 6. Document Flags and Count Features:
Flags like `FLAG_DOCUMENT_3`, `FLAG_PHONE`, etc.:
- Are mostly **binary indicators** or **low-range counts**
- Can be used directly or **combined into composite features** (e.g., total documents submitted)
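
The "total documents submitted" composite could be built by summing every `FLAG_DOCUMENT_*` column. The sketch below is one way to do it, assuming all document flags share that prefix and hold 0/1 values. `DOCUMENT_COUNT` and the helper names are hypothetical, and nulls are treated as 0 via `coalesce`.

```python
def document_flag_columns(columns, prefix="FLAG_DOCUMENT_"):
    """Select the document flag columns by their shared prefix."""
    return [c for c in columns if c.startswith(prefix)]


def document_count_expr(doc_cols, output_col="DOCUMENT_COUNT"):
    """Build a SQL expression that sums the flags, treating nulls as 0."""
    if not doc_cols:
        raise ValueError("no document flag columns found")
    total = " + ".join(f"coalesce({c}, 0)" for c in doc_cols)
    return f"{total} AS {output_col}"


def add_document_count(df, output_col="DOCUMENT_COUNT"):
    """Return df with the composite count appended as a new column."""
    doc_cols = document_flag_columns(df.columns)
    return df.selectExpr("*", document_count_expr(doc_cols, output_col))


# Example, using the DataFrame loaded earlier in this notebook:
# with_docs_df = add_document_count(application_df)
# with_docs_df.groupBy("DOCUMENT_COUNT").count().orderBy("DOCUMENT_COUNT").show()
```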


In [0]:
# Calculating the number of Rows and Columns
num_rows = application_df.count()
num_cols = len(application_df.columns)
print(f"({num_rows}, {num_cols})")

In [0]:
from collections import Counter
Counter(dict(application_df.dtypes).values())

In [0]:
from pyspark.sql.functions import col

# Filter only string (categorical) columns
categorical_cols = [field.name for field in application_df.schema.fields if field.dataType.simpleString() == 'string']

# Show number of unique values for each categorical column
for col_name in categorical_cols:
    count = application_df.select(col_name).distinct().count()
    print(f"{col_name}: {count}")