<p align="center">
  <img alt="University Paris Cité" src="https://img.shields.io/badge/University-Paris%20Cité-6f42c1?style=for-the-badge&logo=academia&logoColor=white">
  <img alt="Module Machine Learning" src="https://img.shields.io/badge/Course-Machine%20Learning-1976D2?style=for-the-badge&logo=scikitlearn&logoColor=white">
  <img alt="Project Type" src="https://img.shields.io/badge/Type-Exploratory%20Data%20Analysis-FF9800?style=for-the-badge&logo=jupyter&logoColor=white">
  <img alt="Dataset" src="https://img.shields.io/badge/Dataset-Census%20Income-009688?style=for-the-badge&logo=databricks&logoColor=white">
</p>

---

<p align="center">
  <strong>🎓 Master 2 — Machine Learning for Data Science (MLSD)</strong><br>
  <strong>University Paris Cité</strong>
</p>

---

<p align="center">

### Project Information  

| **Category**        | **Details**                                 |
|---------------------|---------------------------------------------|
| **University**      | University Paris Cité                       |
| **Master Program**  | Machine Learning for Data Science (MLSD)    |
| **Project Theme**   | Census Income Classification — EDA          |
| **Instructor**      | *Amine Ferddjaoui*                          |
| **Student**         | *Hamady GACKOU*                             |
| **Dataset Source**  | U.S. Census Bureau — CPS / ASEC 2019        |
| **Objective**       | Exploratory Data Analysis (EDA) of socio-economic indicators influencing income levels |
| **Session**         | Practical Project (Machine Learning Module)  |

</p>

---

## Overview

This notebook presents an **Exploratory Data Analysis (EDA)** of the **Census Income** dataset, a standard benchmark for socio-economic classification.  
It looks for key patterns, correlations, and class imbalance before any predictive modeling is applied.  
The analysis focuses on how **demographic, professional, and educational factors** relate to the probability of earning more than **$50K** per year.

---

## Key Steps

1. **Dataset inspection and metadata**  
2. **Missing-value and outlier detection**  
3. **Distribution analysis of categorical and numerical features**  
4. **Correlation and feature relationships**  
5. **Class imbalance visualization**  
6. **Insights summary for model preparation**  

---

<p align="center">
  <em>Université Paris Cité — Master MLSD — Exploratory Data Analysis</em>
</p>


# Step 1 — Label raw CPS/ASEC files

Run these two commands to create **labeled train/test CSVs** with clean headers and a numeric target.

```bash
python scripts/label_columns.py --input data/raw/census_income_learn.csv --output data/processed/train_labeled.csv
python scripts/label_columns.py --input data/raw/census_income_test.csv  --output data/processed/test_labeled.csv
```

In [None]:
!python scripts/label_columns.py --input data/raw/census_income_learn.csv --output data/processed/train_labeled.csv
!python scripts/label_columns.py --input data/raw/census_income_test.csv  --output data/processed/test_labeled.csv
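
The source of `scripts/label_columns.py` is not shown in this notebook, so the following is only a minimal sketch of what such a labeling step could look like. It assumes a header-less raw file, a hypothetical three-column excerpt, and raw target spellings of the form `- 50000.` / `50000+.`; the function name `label_frame` and the column list are illustrative, not the script's actual API.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt; the real raw files have many more columns.
RAW = io.StringIO(
    "38, Private, - 50000.\n"
    "52, Self-employed, 50000+.\n"
)
COLUMNS = ["age", "class_of_worker", "income_raw"]  # assumed names and order

# Assumed raw spellings of the target (after trimming the trailing dot).
TARGET_MAP = {"- 50000": 0, "50000+": 1}

def label_frame(raw, columns):
    """Attach headers, trim whitespace, and derive a 0/1 target column."""
    df = pd.read_csv(raw, header=None, names=columns, skipinitialspace=True)
    label = df["income_raw"].str.strip().str.rstrip(".")
    # Unknown spellings become NaN instead of being silently assigned a class.
    df["income_binary"] = label.map(TARGET_MAP)
    return df

labeled = label_frame(RAW, COLUMNS)
print(labeled)
```

Mapping through an explicit dictionary leaves unrecognised labels as `NaN`, which makes a failed target mapping visible in a later missing-value scan.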

# Step 2 — Missing values & outliers (IQR rule)

**Goal.** Scan the labeled data for **missing values** and **numeric outliers**, flagged with Tukey fences: values below *Q1 − 1.5·IQR* or above *Q3 + 1.5·IQR*.  
The script saves a compact report to `reports/metrics/data_quality_summary.csv`.
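
The Tukey-fence rule can be sketched on toy data as follows. This is an illustration of the rule, not the contents of `scripts/eda_quality_checks.py` (which is not shown here); the helper `quality_row` and the toy values are assumptions, while the output field names mirror those of the summary report.

```python
import pandas as pd

def quality_row(s: pd.Series) -> dict:
    """Missing-value and Tukey-fence outlier counts for one column."""
    row = {"n_missing": int(s.isna().sum()),
           "pct_missing": round(s.isna().mean() * 100, 4)}
    if pd.api.types.is_numeric_dtype(s):
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        outliers = (s < low) | (s > high)   # NaN compares as False
        row.update(q1=q1, q3=q3, iqr_low=low, iqr_high=high,
                   n_outliers=int(outliers.sum()))
    return row

# Toy data (not from the census files): a zero-inflated column and an age-like column.
toy = pd.DataFrame({
    "capital_gains": [0, 0, 0, 0, 0, 0, 0, 5000],
    "age": [15, 22, 30, 35, 41, 50, 63, 90],
})
report = pd.DataFrame({c: quality_row(toy[c]) for c in toy}).T
print(report)
```

When more than 75% of a column is zero, Q1 and Q3 are both 0, the fences collapse to 0, and every non-zero value is flagged. For such zero-inflated columns the outlier share is better read as "share of non-zero values" than as a sign of bad data.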

**Run**
```bash
python scripts/eda_quality_checks.py
```

In [2]:
import pandas as pd

dq = pd.read_csv("reports/metrics/data_quality_summary.csv")
dq.sort_values(["pct_missing","pct_outliers"], ascending=False).head(20)


| Unnamed: 0 | column | dtype | n_missing | pct_missing | n_outliers | pct_outliers | q1 | q3 | iqr_low | iqr_high |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | income_binary | float64 | 199523 | 100.0 | 0 | 0.0 | | | | |
| 1 | hispanic_origin | object | 874 | 0.438 | 0 | 0.0 | | | | |
| 2 | dividends_from_stocks | int64 | 0 | 0.0 | 21141 | 10.5958 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | wage_per_hour | int64 | 0 | 0.0 | 11304 | 5.6655 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | capital_gains | int64 | 0 | 0.0 | 7379 | 3.6983 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | instance_weight | float64 | 0 | 0.0 | 6238 | 3.1265 | 1061.615 | 2188.61 | -628.8775 | 3879.1025 |
| 6 | capital_losses | int64 | 0 | 0.0 | 3906 | 1.9577 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | age | int64 | 0 | 0.0 | 0 | 0.0 | 15.0 | 50.0 | -37.5 | 102.5 |
| 8 | class_of_worker | object | 0 | 0.0 | 0 | 0.0 | | | | |
| 9 | detailed_industry_recode | int64 | 0 | 0.0 | 0 | 0.0 | | | | |
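
The `income_binary` row reports 100% missing values, which suggests that the target mapping did not recognise the raw label spelling. A quick way to check this is to list the distinct raw labels and measure how many a parser fails to map. The sketch below uses toy labels with assumed spellings (not values read from the census files), and `to_binary` is a hypothetical helper; `income_raw` is the column name that `scripts/eda_profiles.py` reads.

```python
import pandas as pd

# Toy labels (assumed spellings, not read from the census files).
raw = pd.Series(
    [" - 50000.", " 50000+.", " - 50000.", "<=50K", ">50K", "unknown"],
    name="income_raw",
)

def to_binary(label):
    """Map a raw income label to 0/1; return None when the spelling is unrecognised."""
    s = str(label).strip().rstrip(".").replace(" ", "").lower()
    if s in {"50000+", ">50k", "50k+"}:
        return 1
    if s in {"-50000", "<=50k", "-50k"}:
        return 0
    return None

mapped = raw.map(to_binary)
unmapped_pct = mapped.isna().mean() * 100   # share of labels the parser could not map

print(raw.str.strip().value_counts())       # distinct raw spellings and their counts
print(f"unmapped: {unmapped_pct:.1f}%")
```

If the unmapped share on the real data is close to 100%, the parser's expected spellings and the file's actual spellings do not match, and the label list printed by `value_counts()` shows which spellings to add.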


# Step 3 — Quick profiles (class balance, categorical rates, numeric summary)

Run one command to save compact reports and two small plots.

```bash
python scripts/eda_profiles.py
```

Outputs:

- `reports/metrics/class_balance.csv`
- `reports/metrics/cat_profiles.csv` (count and positive rate by category)
- `reports/metrics/numeric_summary.csv`
- `reports/figures/class_balance.png`, `reports/figures/age_hist.png`
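
To make the "count and positive rate by category" layout concrete, here is a small sketch on toy rows (not census data). It uses pandas named aggregation, which is one possible way to compute the profile; the column names `feature`, `category`, `count`, and `pos_rate` follow the report's layout.

```python
import pandas as pd

# Toy rows (not census data): one categorical feature and a 0/1 target.
toy = pd.DataFrame({
    "education": ["Bachelors", "Bachelors", "High school", "High school", "High school"],
    "income_binary": [1, 0, 0, 0, 1],
})

profile = (
    toy.groupby("education")["income_binary"]
       .agg(count="count", pos_rate="mean")   # mean of a 0/1 column = share of positives
       .reset_index()
       .rename(columns={"education": "category"})
)
profile.insert(0, "feature", "education")
print(profile)
```

Reading `count` next to `pos_rate` matters: a high positive rate in a category with only a handful of rows is weak evidence.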

Minimal notebook peek:

In [3]:
import pandas as pd
print(pd.read_csv("reports/metrics/class_balance.csv"))
pd.read_csv("reports/metrics/cat_profiles.csv").query("feature=='education'").head(12)


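FileNotFoundError: [Errno 2] No such file or directory: 'reports/metrics/class_balance.csv'

This error means the report does not exist at that relative path: either `scripts/eda_profiles.py` has not been run yet, or the notebook is not running from the project root.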
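
One way to make the peek cell fail more gracefully is to check for the file before reading it. This is a small sketch; `read_report` is a hypothetical helper, not part of the project scripts.

```python
from pathlib import Path
import pandas as pd

def read_report(path):
    """Return the report as a DataFrame, or None with a hint when it is missing."""
    path = Path(path)
    if not path.exists():
        print(f"{path} not found; run `python scripts/eda_profiles.py` "
              "from the project root first.")
        return None
    return pd.read_csv(path)

class_balance = read_report("reports/metrics/class_balance.csv")
if class_balance is not None:
    print(class_balance)
```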

Source of `scripts/eda_profiles.py`:

```python
# scripts/eda_profiles.py  — minimal, does the job
from pathlib import Path
import re
import pandas as pd

# -------- locate data --------
candidates = [
    Path("data/processed/train_labeled.csv"),
    Path("data/processed/census_income_labeled.csv"),
]
src = next((p for p in candidates if p.exists()), None)
if src is None:
    raise SystemExit("No labeled file found. Run scripts/label_columns.py first.")

out_metrics = Path("reports/metrics"); out_metrics.mkdir(parents=True, exist_ok=True)
out_fig = Path("reports/figures"); out_fig.mkdir(parents=True, exist_ok=True)

# -------- load --------
df = pd.read_csv(src)

# -------- robust target mapping (in case it's missing) --------
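def map_income(x: str):
    # keep "=" so that "<=" survives the cleanup (it was stripped before)
    s = re.sub(r"[^\w<>=+\-]", "", str(x)).lower()
    # accept "50k" and "50000" (assumption: raw labels may read "- 50000." / "50000+.")
    if not re.search(r"50(k|000)", s): return None
    if ">" in s or "+" in s: return 1
    if "<=" in s or "-" in s: return 0
    return None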

if "income_binary" not in df or df["income_binary"].isna().all():
    df["income_binary"] = df["income_raw"].map(map_income)

# keep only valid targets
df = df[df["income_binary"].isin([0,1])].copy()

# -------- class balance --------
cb = df["income_binary"].value_counts(dropna=False).rename_axis("income_binary").to_frame("count")
cb["pct"] = (cb["count"] / cb["count"].sum() * 100).round(3)
cb.to_csv(out_metrics / "class_balance.csv", index=True)

# -------- categorical profiles (count + positive rate) --------
cat_cols = [c for c in [
    "education","marital_status","class_of_worker","major_industry_code",
    "major_occupation_code","sex","race","citizenship","tax_filer_status",
    "family_members_under_18","hispanic_origin"
] if c in df.columns]

rows = []
for col in cat_cols:
    g = df.groupby(col, dropna=False)["income_binary"]
    tmp = pd.DataFrame({
        "feature": col,
        "category": g.count().index,
        "count": g.count().values,
        "pos_rate": g.mean().values,     # mean of 0/1 = P(income>50K)
    }).sort_values("count", ascending=False)
    # keep top 20 per feature for compactness
    rows.append(tmp.head(20))

cat_prof = pd.concat(rows, ignore_index=True)
cat_prof["pos_rate"] = cat_prof["pos_rate"].round(4)
cat_prof.to_csv(out_metrics / "cat_profiles.csv", index=False)

# -------- numeric summary --------
num_cols = [c for c in [
    "age","wage_per_hour","capital_gains","capital_losses","dividends_from_stocks",
    "num_persons_worked_for_employer","weeks_worked_in_year","instance_weight","year"
] if c in df.columns]

for c in num_cols:
    df[c] = pd.to_numeric(df[c], errors="coerce")

def q(dfcol, p): 
    return dfcol.quantile(p) if dfcol.notna().sum() else None

rows = []
for c in num_cols:
    s = df[c]
    rows.append({
        "feature": c,
        "count": int(s.notna().sum()),
        "missing": int(s.isna().sum()),
        "mean": s.mean(),
        "std": s.std(),
        "p01": q(s, 0.01),
        "p50": q(s, 0.50),
        "p99": q(s, 0.99),
    })
num_sum = pd.DataFrame(rows)
num_sum.to_csv(out_metrics / "numeric_summary.csv", index=False)

# -------- tiny plots (saved) --------
try:
    import matplotlib.pyplot as plt

    # class balance
    cb_plot = cb.copy()
    ax = cb_plot["count"].plot(kind="bar")
    ax.set_title("Class balance (income_binary)")
    ax.set_xlabel("Class"); ax.set_ylabel("Count")
    plt.tight_layout(); plt.savefig(out_fig / "class_balance.png"); plt.close()

    # age histogram
    if "age" in df.columns:
        ax = df["age"].dropna().plot(kind="hist", bins=30)
        ax.set_title("Age distribution"); ax.set_xlabel("age")
        plt.tight_layout(); plt.savefig(out_fig / "age_hist.png"); plt.close()
except Exception as e:
    # plotting is optional: report the problem instead of hiding it
    print(f"Plots skipped: {e}")

print("✅ saved:",
      out_metrics / "class_balance.csv",
      out_metrics / "cat_profiles.csv",
      out_metrics / "numeric_summary.csv")
```
