{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 – Data Preparation\n",
    "\n",
    "This notebook loads and explores the German Credit Risk dataset, performs basic cleaning, derives a proxy target variable (`default`) for risk modeling, and examines key categorical and numerical features.\n",
    "\n",
    "The prepared data is the foundation for the modeling, security testing, and governance analysis carried out in the subsequent notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from src.data_loader import load_and_preprocess_data\n",
    "\n",
    "# set_theme is the preferred interface; sns.set is kept only as an alias\n",
    "sns.set_theme(style=\"whitegrid\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load dataset\n",
    "df = pd.read_csv(\"data/credit.csv\")\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Basic structure and missing values\n",
    "print(\"Columns:\", df.columns.tolist())\n",
    "print(\"\\nMissing values:\\n\", df.isna().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create proxy target variable: 'default' (high credit amount + long duration → risky)\n",
    "df[\"default\"] = ((df[\"Credit amount\"] > 5000) & (df[\"Duration\"] > 24)).astype(int)\n",
    "\n",
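    "# value_counts() sorts by frequency, so reindex to [0, 1] to keep the bars aligned with the tick labels below\n",
    "df[\"default\"].value_counts().reindex([0, 1], fill_value=0).plot(kind=\"bar\", title=\"Distribution of Target Variable (default)\")\n",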
    "plt.xticks([0, 1], [\"No Default\", \"Default\"], rotation=0)\n",
    "plt.ylabel(\"Count\")\n",
    "plt.show()"
   ]
  },
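  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The proxy rule above is a heuristic, not an observed repayment outcome, so it is worth checking how it labels individual rows and how balanced the resulting classes are. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the dataset); only the two column names and the thresholds come from the cell above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy rows; column names mirror the dataset, values are made up\n",
    "toy_rule = pd.DataFrame({\n",
    "    \"Credit amount\": [1200, 6500, 8000, 5000],\n",
    "    \"Duration\": [12, 36, 18, 48],\n",
    "})\n",
    "\n",
    "# Same rule as above: a row is flagged only when BOTH conditions hold\n",
    "toy_rule[\"default\"] = (\n",
    "    (toy_rule[\"Credit amount\"] > 5000) & (toy_rule[\"Duration\"] > 24)\n",
    ").astype(int)\n",
    "print(toy_rule)\n",
    "\n",
    "# Share of each class; a strongly skewed split signals class imbalance\n",
    "print(toy_rule[\"default\"].value_counts(normalize=True).sort_index())"
   ]
  },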
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore categorical features\n",
    "categorical_cols = [\"Sex\", \"Housing\", \"Saving accounts\", \"Checking account\", \"Purpose\"]\n",
    "\n",
    "for col in categorical_cols:\n",
    "    plt.figure(figsize=(6, 3))\n",
    "    sns.countplot(data=df, x=col, order=df[col].value_counts().index)\n",
    "    plt.title(f\"Distribution of {col}\")\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Final preprocessing\n",
    "\n",
    "We now use the centralized `load_and_preprocess_data()` function from the `src/` module to encode, scale, and split the dataset into training and test sets. Keeping this logic in one place lets all subsequent notebooks reuse the same preprocessing."
   ]
  },
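  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The internals of `load_and_preprocess_data()` are not shown in this notebook. As a rough sketch of what an encode, scale, and split pipeline can look like, the cell below uses scikit-learn on a tiny made-up frame. The use of `get_dummies`, `StandardScaler`, and `train_test_split`, the split ratio, and all values are assumptions for illustration; the actual function in `src/` may work differently.\n",
    "\n",
    "Splitting before scaling means the scaler is fitted on the training rows only, which avoids leaking test-set statistics into training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# Hypothetical toy data; column names mirror the dataset, values are made up\n",
    "toy_prep = pd.DataFrame({\n",
    "    \"Sex\": [\"male\", \"female\", \"male\", \"female\", \"male\", \"female\", \"male\", \"female\"],\n",
    "    \"Credit amount\": [1200, 6500, 8000, 3000, 900, 7000, 2500, 5200],\n",
    "    \"Duration\": [12, 36, 18, 48, 6, 30, 24, 36],\n",
    "    \"default\": [0, 1, 0, 0, 0, 1, 0, 1],\n",
    "})\n",
    "\n",
    "# Encode: one-hot encode the categorical column\n",
    "toy_X = pd.get_dummies(toy_prep.drop(columns=\"default\"), columns=[\"Sex\"], dtype=float)\n",
    "toy_y = toy_prep[\"default\"]\n",
    "\n",
    "# Split: stratify so both splits keep a similar class ratio\n",
    "toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(\n",
    "    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=42\n",
    ")\n",
    "\n",
    "# Scale: fit on the training split only, then apply to both splits\n",
    "scaler = StandardScaler()\n",
    "toy_X_tr_scaled = scaler.fit_transform(toy_X_tr)\n",
    "toy_X_te_scaled = scaler.transform(toy_X_te)\n",
    "\n",
    "print(\"Train:\", toy_X_tr_scaled.shape, \"Test:\", toy_X_te_scaled.shape)"
   ]
  },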
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode, scale, and split the data via the shared helper in src/\n",
    "X_train, X_test, y_train, y_test = load_and_preprocess_data()\n",
    "\n",
    "print(\"Training data:\", X_train.shape)\n",
    "print(\"Test data:\", X_test.shape)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}