# Phase 4: Advanced Analytics (Project-Driven)  
## EO3 / IRS Nonprofit Dataset

### What Phase 4 Is (and Is NOT)

**Phase 4 is optional but powerful.**  
It moves beyond descriptive EDA into **analytical structure discovery**, without forcing prediction where it doesn't belong.

Phase 4 focuses on:

1. **Segmentation / Clustering** of nonprofits  
2. **Typology creation** (small / medium / large org profiles)  
3. **Outlier detection** (extreme orgs)  
4. **Decision-oriented insights** (how stakeholders might act)

🚫 We are **not** doing:
- causal inference
- forecasting
- ML for ML's sake

This keeps the analysis responsible and realistic for nonprofit data.


In [None]:
# 0) Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

pd.set_option("display.max_columns", 200)


## 1) Load Data
We continue using the same cleaned EO3 dataset.


In [None]:
DATA_PATH = "dataset_cleaned_merged.csv"
df = pd.read_csv(DATA_PATH)

print("Dataset shape:", df.shape)
df.head()


## 2) Feature Selection for Phase 4

### Why feature selection matters
Clustering works best on:
- numeric variables
- comparable scales
- variables representing *organizational capacity*

### Chosen features
- Revenue
- Assets
- Income
- Organization age (derived if available; not part of the feature list in the next cell)

We avoid:
- IDs
- text fields
- geography (used later for interpretation, not clustering)


In [None]:
features = ["REVENUE_AMT", "ASSET_AMT", "INCOME_AMT"]
cluster_df = df[features].copy()

# Drop rows with missing clustering features
cluster_df = cluster_df.dropna()

print("Rows used for clustering:", cluster_df.shape[0])
cluster_df.describe()
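
The feature list mentions organization age, but the cell above uses only the three financial columns. Below is a minimal sketch of how age could be derived, assuming the dataset still carries the IRS ruling date in a column named `RULING` in `YYYYMM` form. The column name, the helper `add_org_age`, and the reference year are assumptions for illustration; the cleaned file is not confirmed to contain them. The ruling year marks when tax-exempt status was recognized, so it approximates the founding year rather than equals it.

```python
import pandas as pd

def add_org_age(frame, ruling_col="RULING", as_of_year=2024):
    """Return a copy of frame with an ORG_AGE column (hypothetical helper).

    Assumes ruling_col holds YYYYMM values; anything outside a plausible
    year range (for example 0 for "unknown") becomes missing.
    """
    ruling = pd.to_numeric(frame[ruling_col], errors="coerce")
    year = ruling // 100  # YYYYMM -> YYYY
    year = year.where((year >= 1800) & (year <= as_of_year))
    out = frame.copy()
    out["ORG_AGE"] = as_of_year - year
    return out

# Toy data standing in for the real dataset
demo = pd.DataFrame({"RULING": [199405, 201011, 0, None]})
aged = add_org_age(demo)
print(aged)
```

If the column exists, `"ORG_AGE"` could be appended to `features` before the `dropna()` step; rows with an unknown ruling date would then be dropped from clustering.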


## 3) Scaling Features

### Why scaling is required
Revenue and assets can differ by orders of magnitude.
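Because k-means measures similarity with Euclidean distance, the variable with the largest numeric range would dominate the clustering if the features were left unscaled.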


In [None]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(cluster_df)
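
Standardizing fixes the units, but it does not remove skew: a handful of very large organizations can still stretch the scaled axes. One optional variation, not part of the original workflow, is to compress the amounts with a signed log transform before scaling. The sketch below uses toy data; `signed_log1p` is a hypothetical helper, and the negative income value is included on the assumption that amounts can be negative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def signed_log1p(values):
    """log1p applied to the magnitude, keeping the sign (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))

# Toy data standing in for cluster_df
demo = pd.DataFrame({
    "REVENUE_AMT": [1e4, 5e4, 2e5, 3e6, 9e8],
    "ASSET_AMT":   [2e3, 8e4, 1e6, 7e6, 4e9],
    "INCOME_AMT":  [-5e3, 6e4, 2e5, 4e6, 1e9],
})

X_log = signed_log1p(demo)                       # compress the long right tail
X_log_scaled = StandardScaler().fit_transform(X_log)
print(X_log_scaled.round(2))
```

Whether to cluster on raw or log-compressed amounts is a modeling choice: the raw version tends to isolate the very largest organizations, while the log version spreads out the small and medium ones.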


## 4) Choosing Number of Clusters (Elbow Method)

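We fit k-means for several values of `k` and plot the inertia (the within-cluster sum of squared distances). The "elbow", where adding clusters stops reducing inertia by much, suggests a reasonable number of organizational types.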


In [None]:
inertia = []
k_range = range(2, 9)

for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertia.append(km.inertia_)

plt.figure()
plt.plot(list(k_range), inertia, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow Method for K Selection")
plt.show()
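
The elbow is often ambiguous, so a second criterion can help. The sketch below uses the silhouette score from scikit-learn as a cross-check; it is an addition to the workflow above, and it runs on synthetic points (three well-separated groups) rather than on the EO3 data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled: three tight, well-separated groups
rng = np.random.default_rng(0)
centers = [(0, 0), (10, 10), (-10, 10)]
X_demo = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in centers]
)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # in [-1, 1], higher is better

best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

On a large dataset the full silhouette computation can be slow; `silhouette_score` accepts `sample_size` and `random_state` arguments to score a random subsample instead.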


## 5) Final Clustering (Choose k)

Based on the elbow curve, choose a reasonable `k` (usually 3 to 5).


In [None]:
k = 4  # adjust if needed
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
clusters = kmeans.fit_predict(X_scaled)

cluster_df["cluster"] = clusters
cluster_df.head()


## 6) Cluster Profiling

### Goal
Turn clusters into **interpretable nonprofit types**.


In [None]:
cluster_profiles = (
    cluster_df
    .groupby("cluster")
    .agg(
        count=("REVENUE_AMT", "count"),
        median_revenue=("REVENUE_AMT", "median"),
        median_assets=("ASSET_AMT", "median"),
        median_income=("INCOME_AMT", "median"),
        mean_revenue=("REVENUE_AMT", "mean"),
    )
    .reset_index()
)

cluster_profiles
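
Medians describe a typical member of each cluster, but they do not show how much of the total each cluster holds. A small sketch, using toy data and a hypothetical helper `cluster_resource_shares`, compares each cluster's share of organizations with its share of total revenue:

```python
import pandas as pd

def cluster_resource_shares(frame, value_col="REVENUE_AMT", cluster_col="cluster"):
    """Share of organizations vs. share of total value per cluster."""
    grouped = frame.groupby(cluster_col)[value_col].agg(["count", "sum"])
    return pd.DataFrame({
        "org_share": grouped["count"] / grouped["count"].sum(),
        "revenue_share": grouped["sum"] / grouped["sum"].sum(),
    })

# Toy data standing in for cluster_df
demo = pd.DataFrame({
    "cluster": [0, 0, 0, 1],
    "REVENUE_AMT": [10.0, 20.0, 30.0, 940.0],
})
shares = cluster_resource_shares(demo)
print(shares)
```

A cluster whose `revenue_share` is far larger than its `org_share` is one where few organizations hold most of the money.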


### How to interpret clusters

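Typical outcome (k-means label numbers are arbitrary, so match each label to its profile in the table above rather than assuming this order):
- Cluster 0 → Small, low-resource organizations
- Cluster 1 → Medium-sized operating nonprofits
- Cluster 2 → Large, asset-heavy institutions
- Cluster 3 → Very large / national organizations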

These are **organizational typologies**, not value judgments.
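
Since k-means numbers its clusters arbitrarily, the names should follow the profiles, not the label ids. One possible way to do this is to rank clusters by a profile column and assign names in that order. The helper `label_clusters_by_size` and the name list are illustrative assumptions; the profile values are made up.

```python
import pandas as pd

def label_clusters_by_size(profiles, names, value_col="median_revenue"):
    """Map cluster ids to names, smallest value_col first (hypothetical helper)."""
    ordered_ids = profiles.sort_values(value_col)["cluster"].tolist()
    if len(names) != len(ordered_ids):
        raise ValueError("need exactly one name per cluster")
    return dict(zip(ordered_ids, names))

# Toy profile table standing in for cluster_profiles
demo_profiles = pd.DataFrame({
    "cluster": [0, 1, 2, 3],
    "median_revenue": [5_000_000, 40_000, 900_000_000, 600_000],
})
names = ["Small", "Medium", "Large", "Very large"]

cluster_names = label_clusters_by_size(demo_profiles, names)
print(cluster_names)  # {1: 'Small', 3: 'Medium', 0: 'Large', 2: 'Very large'}
```

The resulting dictionary can be applied with `cluster_df["cluster"].map(cluster_names)` to get a readable label column.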


## 7) Attach Cluster Labels Back to Main Dataset
This allows sector/state analysis by cluster later.


In [None]:
cluster_map = cluster_df["cluster"]
df_clustered = df.loc[cluster_df.index].copy()
df_clustered["cluster"] = cluster_map.values

df_clustered.head()


## 8) Cluster Distribution by Sector (NTEE)

### Question
Do some sectors consist mainly of small orgs while others skew large?


In [None]:
sector_cluster = (
    df_clustered
    .groupby(["NTEE_CD", "cluster"])
    .size()
    .reset_index(name="org_count")
)

sector_cluster.head()
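
Full NTEE codes such as `B82` are fine-grained, so raw counts per code are hard to compare across sectors. The first letter of an NTEE code is its major group, and grouping on that letter with row-normalized shares answers the sector question more directly. The sketch below runs on toy data; the column name `NTEE_MAJOR` is an assumption introduced for illustration.

```python
import pandas as pd

# Toy data standing in for df_clustered
demo = pd.DataFrame({
    "NTEE_CD": ["B82", "B20", "E22", "E21", None, "A50"],
    "cluster": [0, 0, 2, 0, 1, 1],
})

# First letter = NTEE major group; missing codes stay missing
major = demo["NTEE_CD"].str[0].rename("NTEE_MAJOR")

# Each row sums to 1: the cluster mix within a sector
shares = pd.crosstab(major, demo["cluster"], normalize="index")
print(shares)
```

Rows with a missing NTEE code are dropped by `pd.crosstab` by default, so it is worth reporting how many organizations that excludes.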


## 9) Cluster Distribution by State

### Question
Are large nonprofits geographically concentrated?


In [None]:
state_cluster = (
    df_clustered
    .groupby(["STATE", "cluster"])
    .size()
    .reset_index(name="org_count")
)

state_cluster.head()
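
Raw counts will mostly reflect how many nonprofits a state has overall. To judge concentration, one option (not used in the original notebook) is a location quotient: a state's share of the organizations in one cluster divided by its share of all organizations. Values above 1 mean the state is over-represented in that cluster. The helper name and toy data are assumptions.

```python
import pandas as pd

def location_quotient(frame, cluster_id, state_col="STATE", cluster_col="cluster"):
    """State share within one cluster divided by state share overall."""
    all_share = frame[state_col].value_counts(normalize=True)
    in_cluster = frame.loc[frame[cluster_col] == cluster_id, state_col]
    cluster_share = in_cluster.value_counts(normalize=True)
    # States absent from the cluster get a share of 0
    cluster_share = cluster_share.reindex(all_share.index, fill_value=0.0)
    return (cluster_share / all_share).sort_values(ascending=False)

# Toy data standing in for df_clustered; cluster 3 plays the "large" cluster
demo = pd.DataFrame({
    "STATE": ["NY"] * 4 + ["OH"] * 6,
    "cluster": [3, 3, 0, 0, 3, 0, 0, 0, 0, 0],
})
lq = location_quotient(demo, cluster_id=3)
print(lq)
```

Here NY holds 40% of all organizations but two thirds of cluster 3, so its quotient is above 1, while OH falls below 1.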


## 10) Outlier Detection (Extreme Organizations)

We flag organizations at or above the 99th percentile of revenue, i.e. the top 1% by `REVENUE_AMT`.


In [None]:
q99 = df["REVENUE_AMT"].quantile(0.99)

outliers = df[df["REVENUE_AMT"] >= q99]

print("99th percentile revenue:", q99)
outliers[["EIN", "STATE", "NTEE_CD", "REVENUE_AMT"]].head()
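
A percentile cutoff always flags about 1% of rows, whether or not they are unusual. As a complementary check, the sketch below applies Tukey's IQR rule on log10 revenue, which flags only values that sit far above the bulk of the distribution. This is an alternative technique, not the method used above; the helper name, the multiplier `k=1.5`, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

def log_iqr_outliers(revenue, k=1.5):
    """Return positive revenues above Q3 + k*IQR on the log10 scale."""
    positive = revenue[revenue > 0]          # drops zeros, negatives, and NaN
    log_rev = np.log10(positive)
    q1, q3 = log_rev.quantile([0.25, 0.75])
    upper = q3 + k * (q3 - q1)
    return positive[log_rev > upper]

# Toy data standing in for df["REVENUE_AMT"]
revenue = pd.Series([1e4, 2e4, 5e4, 1e5, 2e5, 5e5, 1e6, 1e12, 0, np.nan])
flagged = log_iqr_outliers(revenue)
print(flagged)
```

Comparing the two lists shows which organizations are extreme under both definitions and which are merely in the top 1%.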


## 11) Save Phase 4 Outputs


In [None]:
import os

OUT_DIR = "phase4_outputs"
os.makedirs(OUT_DIR, exist_ok=True)

cluster_profiles.to_csv(f"{OUT_DIR}/phase4_cluster_profiles.csv", index=False)
sector_cluster.to_csv(f"{OUT_DIR}/phase4_sector_cluster_distribution.csv", index=False)
state_cluster.to_csv(f"{OUT_DIR}/phase4_state_cluster_distribution.csv", index=False)
outliers.to_csv(f"{OUT_DIR}/phase4_extreme_outliers.csv", index=False)

print("Phase 4 outputs saved")


## 12) Phase 4 Findings (Write Here)

Suggested bullets:
- Nonprofits naturally segment into ___ organizational types based on financial capacity.
- Most organizations fall into the small-organization clusters, while one cluster with few members holds an outsized share of total resources.
- Certain sectors and states are disproportionately represented in large-org clusters.
- Extreme outliers represent unique institutions that heavily influence aggregate statistics.

---

## 🎯 Final Outcome
With Phase 4, you now have:
- Descriptive understanding (Phase 1)
- Structural patterns (Phase 2)
- Inequality insights (Phase 3)
- Organizational typologies & segmentation (Phase 4)

This is **end-to-end professional data analysis**.
