# Phase 3 EDA (Project-Driven, Explained): EO3 / IRS Nonprofit Dataset

## Phase 3 goal (Deep Insight Layer)
Phase 1 = Structure & Trust  
Phase 2 = Sector + Geography + Basic Concentration  
Phase 3 = **Advanced Insight + Storytelling Metrics**

In Phase 3 we answer:

### 1️⃣ Inequality & Distribution
How uneven is nonprofit funding overall?

### 2️⃣ Sector × Geography Interaction
Which states dominate within each sector?

### 3️⃣ Advanced Concentration Metrics
Lorenz Curve + Gini Coefficient (optional but portfolio-level)

---

## Why Phase 3 Matters
This is where your analysis moves from:  
👉 Describing data  
To:  
👉 Explaining structural patterns in resource distribution

This is **portfolio-level analytics** and **industry-level EDA**.


In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.max_columns", 200)


## 1️⃣ Load Data
Use the same dataset from Phase 1 and Phase 2.


In [None]:
DATA_PATH = "dataset_cleaned_merged.csv"
df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
df.head()


# 2️⃣ Phase 3A: Global Financial Inequality

## Why this matters
In many nonprofit ecosystems:
- A few organizations control most of the funding
- Many organizations operate on small budgets

We measure:
- Top 1%, 5%, 10% share
- Lorenz Curve
- Gini Coefficient
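
To make the top-share idea concrete, here is a minimal sketch on a made-up list of five revenues. The values and the helper name `toy_top_share` are illustrative assumptions, not part of the dataset or the notebook. With five organizations, the top 20% is a single organization, and in this toy list it holds half of the total.

```python
import numpy as np

def toy_top_share(values, frac):
    """Fraction of the total held by the largest `frac` of values."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]  # largest first
    n_top = max(1, int(len(v) * frac))                  # at least one org
    return v[:n_top].sum() / v.sum()

# Hypothetical revenues for five organizations (total = 200)
toy_revenue = [100, 50, 25, 15, 10]

share_top_20 = toy_top_share(toy_revenue, 0.20)  # top 1 of 5 orgs
share_top_40 = toy_top_share(toy_revenue, 0.40)  # top 2 of 5 orgs
print(share_top_20, share_top_40)
```

A hand-checkable case like this is a quick way to confirm the share logic before trusting the numbers on the full dataset.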


In [None]:
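def top_share(series, frac):
    """Share of the total held by the top `frac` of organizations."""
    s = series.dropna().sort_values(ascending=False)
    total = s.sum()
    if s.empty or total == 0:
        return np.nan  # share is undefined without a nonzero total
    n = max(1, int(len(s) * frac))  # always include at least one org
    return s.head(n).sum() / total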

inequality = []

for col in ["REVENUE_AMT", "ASSET_AMT"]:
    if col in df.columns:
        inequality.append({
            "metric": col,
            "top_1_pct": round(top_share(df[col], 0.01) * 100, 2),
            "top_5_pct": round(top_share(df[col], 0.05) * 100, 2),
            "top_10_pct": round(top_share(df[col], 0.10) * 100, 2),
        })

inequality_df = pd.DataFrame(inequality)
inequality_df


## Lorenz Curve + Gini Coefficient
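A portfolio-level way to quantify inequality. The Lorenz curve plots the cumulative share of revenue against the cumulative share of organizations, sorted from smallest to largest; the further it bows below the diagonal, the more unequal the distribution. The Gini coefficient condenses this into one number, where 0 means every organization has the same revenue and values near 1 mean a few organizations hold almost all of it.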


In [None]:
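def gini(array):
    """Gini coefficient via the rank formula on ascending-sorted values."""
    values = np.sort(array.dropna().to_numpy(dtype=float))
    n = len(values)
    total = values.sum()
    if n == 0 or total == 0:
        return np.nan  # undefined for empty or zero-sum input
    # Note: negative amounts, if present, can push the result outside [0, 1].
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * values) / (n * total)

def lorenz_curve(series):
    """Cumulative share of the total for ascending-sorted values, starting at 0."""
    s = np.sort(series.dropna().to_numpy(dtype=float))
    cum = np.insert(np.cumsum(s), 0, 0)  # prepend the origin (0, 0)
    return cum / cum[-1]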

rev_lorenz = lorenz_curve(df["REVENUE_AMT"])

plt.figure()
plt.plot(np.linspace(0, 1, len(rev_lorenz)), rev_lorenz, label="Lorenz curve")
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfect equality")
plt.title("Lorenz Curve: Revenue")
plt.xlabel("Cumulative Share of Organizations")
plt.ylabel("Cumulative Revenue Share")
plt.legend()
plt.show()

print("Revenue Gini:", round(gini(df["REVENUE_AMT"]), 3))
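
Before reading too much into the Gini value, it can help to check the formula on inputs whose answer is known. The sketch below is self-contained and uses hypothetical helper names (`gini_from_values`, `gini_from_lorenz`) and toy inputs; it compares the rank formula with the area-based definition, 1 minus twice the trapezoid area under the Lorenz curve.

```python
import numpy as np

def gini_from_values(values):
    """Gini via the rank formula on ascending-sorted values."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * x) / (n * np.sum(x))

def gini_from_lorenz(values):
    """Gini as 1 - 2 * (trapezoid area under the Lorenz curve)."""
    x = np.sort(np.asarray(values, dtype=float))
    lorenz = np.insert(np.cumsum(x), 0, 0.0) / x.sum()
    area = np.sum((lorenz[1:] + lorenz[:-1]) / 2) / len(x)
    return 1 - 2 * area

equal = [5, 5, 5, 5]        # everyone holds the same amount
one_holder = [0, 0, 0, 1]   # a single org holds everything

g_equal = gini_from_values(equal)
g_one = gini_from_values(one_holder)
g_one_lorenz = gini_from_lorenz(one_holder)
print(g_equal, g_one, g_one_lorenz)
```

With n organizations and a single holder, this formula gives (n - 1) / n rather than exactly 1, so the maximum only approaches 1 as the number of organizations grows.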


# 3️⃣ Phase 3B: Sector × Geography Interaction

## Why this matters
Stakeholders ask:  
👉 Which states dominate each sector?  
👉 Are certain sectors regionally concentrated?


In [None]:
sector_state = (
    df.groupby(["STATE", "NTEE_CD"])
      .agg(
          org_count=("EIN", "count"),
          total_revenue=("REVENUE_AMT", "sum"),
          median_revenue=("REVENUE_AMT", "median")
      )
      .reset_index()
)

sector_state.head()
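
In the IRS data, `NTEE_CD` usually holds detailed codes such as `B20`, where the first letter identifies the NTEE major group. Grouping on the full code can therefore split one sector into many small rows. If an earlier phase did not already add a sector column, a sketch like the following groups by that first letter instead. The column name `NTEE_MAJOR` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows; the real columns come from df. NTEE_MAJOR is a hypothetical name.
toy = pd.DataFrame({
    "EIN": [1, 2, 3, 4, 5],
    "STATE": ["CA", "CA", "NY", "NY", "NY"],
    "NTEE_CD": ["B20", "B43", "E22", "b20", None],
    "REVENUE_AMT": [100.0, 50.0, 300.0, 25.0, 10.0],
})

# First character of the code, upper-cased; missing codes stay missing.
toy["NTEE_MAJOR"] = toy["NTEE_CD"].str.strip().str[:1].str.upper()

# Rows with a missing group key are dropped by groupby by default.
major_state = (
    toy.groupby(["STATE", "NTEE_MAJOR"])
       .agg(org_count=("EIN", "count"), total_revenue=("REVENUE_AMT", "sum"))
       .reset_index()
)
print(major_state)
```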


## Top States Within Each Sector


In [None]:
top_sector_state = (
    sector_state.sort_values(["NTEE_CD", "total_revenue"], ascending=[True, False])
    .groupby("NTEE_CD")
    .head(5)
)

top_sector_state.head(20)
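
Raw totals show which states rank first, but not by how much. One way to express dominance is each state's share of its sector's revenue. The sketch below uses a made-up stand-in for `sector_state`; the `state_share` column name and all values are assumptions.

```python
import pandas as pd

# Toy stand-in for sector_state; values are made up for illustration.
toy_sector_state = pd.DataFrame({
    "NTEE_CD": ["B20", "B20", "B20", "E22", "E22"],
    "STATE": ["CA", "NY", "TX", "CA", "NY"],
    "total_revenue": [600.0, 300.0, 100.0, 50.0, 150.0],
})

# Each state's share of its sector's total revenue.
sector_total = toy_sector_state.groupby("NTEE_CD")["total_revenue"].transform("sum")
toy_sector_state["state_share"] = toy_sector_state["total_revenue"] / sector_total

# Leading state per sector, with its share.
leaders = (
    toy_sector_state.sort_values("state_share", ascending=False)
                    .drop_duplicates("NTEE_CD")
                    .sort_values("NTEE_CD")
                    .reset_index(drop=True)
)
print(leaders)
```

On real data, sectors whose total revenue is zero would need to be filtered out first, since the share is undefined there.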


# 4️⃣ Phase 3C: Sector Financial Inequality

## Question
Do some sectors have more internal inequality than others?


In [None]:
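sector_gini = []

for sector in df["NTEE_CD"].dropna().unique():
    # Count only organizations in this sector that actually report revenue
    revenue = df.loc[df["NTEE_CD"] == sector, "REVENUE_AMT"].dropna()
    if len(revenue) > 20:  # skip sectors too small for a stable estimate
        sector_gini.append({
            "NTEE_CD": sector,
            "gini_revenue": gini(revenue)
        })

# Explicit columns keep the sort from failing if no sector passes the filter
sector_gini_df = pd.DataFrame(sector_gini, columns=["NTEE_CD", "gini_revenue"])
sector_gini_df = sector_gini_df.sort_values("gini_revenue", ascending=False)
sector_gini_df.head(15)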
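
The same per-sector comparison can also be written with `groupby`, which avoids filtering the full frame once per sector. This is an alternative sketch on toy data, not a replacement for the loop; the helper `gini_values`, the `n_orgs` column, and the `MIN_ORGS` threshold are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

def gini_values(series):
    """Rank-formula Gini on non-missing values; NaN if undefined."""
    x = np.sort(series.dropna().to_numpy(dtype=float))
    n, total = len(x), x.sum()
    if n == 0 or total == 0:
        return np.nan
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * x) / (n * total)

toy = pd.DataFrame({
    "NTEE_CD": ["A"] * 3 + ["B"] * 4 + ["C"] * 2,
    "REVENUE_AMT": [10, 10, 10, 0, 0, 0, 40, 5, np.nan],
})

MIN_ORGS = 3  # hypothetical threshold; the notebook requires more than 20

by_sector = (
    toy.groupby("NTEE_CD")["REVENUE_AMT"]
       .agg(n_orgs="count", gini_revenue=gini_values)  # count skips NaN
       .reset_index()
)
by_sector = by_sector[by_sector["n_orgs"] >= MIN_ORGS]
print(by_sector)
```

Keeping the organization count next to each Gini value makes it easier to judge how much weight a given sector's estimate deserves.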


# 5️⃣ Save Phase 3 Outputs


In [None]:
import os

OUT_DIR = "phase3_outputs"
os.makedirs(OUT_DIR, exist_ok=True)  # create the folder if it is missing

inequality_df.to_csv(f"{OUT_DIR}/phase3_global_inequality.csv", index=False)
sector_state.to_csv(f"{OUT_DIR}/phase3_sector_state.csv", index=False)
sector_gini_df.to_csv(f"{OUT_DIR}/phase3_sector_gini.csv", index=False)

print("Saved Phase 3 outputs")


# 6️⃣ Phase 3 Findings (Write Here)

Suggested bullets:

- Nonprofit funding shows strong inequality, with the top ___% controlling ___% of total revenue.
- Some sectors show much higher internal inequality than others.
- Certain states dominate revenue within specific sectors.
- These patterns suggest structural concentration of nonprofit resources.

---

## 🚀 You Have Now Completed Full Professional EDA
Phase 1 → Data Understanding  
Phase 2 → Sector + Geography + Concentration  
Phase 3 → Inequality + Structural Patterns + Interaction Effects
