### 📚 Data Collection

In this project, I collected publicly available datasets that track **CO₂ emissions**, **economic indicators**, and **human development metrics** across countries. All data were obtained from established public sources, **Kaggle** and the **World Bank**, to support accuracy and global comparability.

The study uses three main datasets:

- **CO₂ Emissions Dataset (Kaggle):** Annual country-level CO₂ totals and related pollution indicators.  https://www.kaggle.com/datasets/shreyanshdangi/co-emissions-across-countries-regions-and-sectors
- **GDP Dataset (World Bank):** Current GDP values in USD for all reporting countries.  https://www.kaggle.com/datasets/iamsouravbanerjee/human-development-index-dataset
- **Human Development Dataset (Kaggle):** HDI, Life Expectancy, and Gender Inequality Index (GII). https://www.kaggle.com/datasets/iamsouravbanerjee/human-development-index-dataset

These datasets were collected to investigate whether **economic growth**, **environmental impact**, and **social development** move together or diverge across countries.  
All files were cleaned, standardized using ISO3 country codes, and aligned for the period **2010–2019** to construct a unified master dataset suitable for cross-sectional analysis.


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import pearsonr

CO2_FILE = "Dataco2 emission.csv"
GDP_FILE = "gdp.csv"
HDI_FILE = "Human Development Index - Full.csv"

YEAR_START = 2010
YEAR_END   = 2019
YEARS = list(range(YEAR_START, YEAR_END + 1))

FIG_DIR = "figures"
os.makedirs(FIG_DIR, exist_ok=True)

sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (7, 5)

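def safe_read_worldbank(path: str) -> pd.DataFrame:
    """Read the GDP CSV whether or not it has a preamble before the header.

    skiprows=4 handles files whose header row follows four preamble lines;
    if that read does not yield a "Country Code" column, fall back to a
    plain read.
    """
    try:
        df = pd.read_csv(path, skiprows=4)
        if "Country Code" in df.columns:
            return df
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        pass  # file is not in the preamble layout; fall through
    return pd.read_csv(path)

def pearson_test(df, x, y, label):
    """Print and return Pearson's r and its p-value for columns x and y."""
    sub = df[[x, y]].dropna()  # use complete pairs only
    if len(sub) < 5:
        print(f"{label}: not enough observations (n={len(sub)})")
        return np.nan, np.nan
    r, p = pearsonr(sub[x], sub[y])
    print(f"{label}: r = {r:.3f}, p-value = {p:.4g}, n = {len(sub)}")
    if p < 0.05:
        print("   --> Statistically significant at the 5% level (H0 is rejected)")
    else:
        print("   --> Not statistically significant (H0 cannot be rejected)")
    return r, p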


### Step 1: Importing Libraries and Preparing Environment

In this step, we:

- Import all required libraries (`pandas`, `numpy`, `matplotlib`, `seaborn`, `scipy`)
- Define file paths for:
  - CO₂ dataset  
  - GDP dataset  
  - HDI dataset  
- Set the analysis period **2010–2019**
- Create a `figures/` folder
- Define two helper functions:
  - `safe_read_worldbank` → loads World Bank GDP files, with or without the preamble lines above the header  
  - `pearson_test` → computes the Pearson correlation and reports whether it is statistically significant at the 5% level

This cell does **not** load any data yet; it only prepares the workspace.
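To make the behaviour of `pearson_test` concrete, here is a minimal sketch of the same steps (drop incomplete pairs, then call `scipy.stats.pearsonr`) on a toy frame. The column names `x` and `y` and all values are made up for illustration and are not taken from the project data.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

# Toy data (invented): y is roughly 2*x + 1, and one x value is missing.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, np.nan],
    "y": [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.0],
})

# Same steps as pearson_test: keep complete pairs, then run the test.
pairs = toy[["x", "y"]].dropna()
r, p = pearsonr(pairs["x"], pairs["y"])
n = len(pairs)

print(f"toy example: r = {r:.3f}, p-value = {p:.4g}, n = {n}")
print("significant at 5%" if p < 0.05 else "not significant at 5%")
```

The row with the missing `x` is excluded before the test, so `n` is one less than the number of rows in the frame.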

#### ✔ Example of expected GDP file structure (visual table):

| Country Name | Country Code | 2010       | 2011       | 2012       | ... |
|--------------|--------------|------------|------------|------------|-----|
| Afghanistan  | AFG          | 9.53E+09   | 1.04E+10   | 1.06E+10   | ... |
| Albania      | ALB          | 1.19E+10   | 1.29E+10   | 1.22E+10   | ... |
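The sketch below illustrates why `safe_read_worldbank` tries `skiprows=4` first. It assumes the raw download carries four lines (two metadata lines, each followed by a blank line) above the header row; that layout, the preamble text, and the helper name `read_gdp_text` are assumptions made for this example. It reads from in-memory text rather than from `gdp.csv`.

```python
import io
import pandas as pd

def read_gdp_text(text: str) -> pd.DataFrame:
    """Hypothetical in-memory variant of the skiprows-then-fallback strategy."""
    try:
        df = pd.read_csv(io.StringIO(text), skiprows=4)
        if "Country Code" in df.columns:
            return df
    except (pd.errors.ParserError, pd.errors.EmptyDataError):
        pass  # not in the preamble layout; try a plain read
    return pd.read_csv(io.StringIO(text))

# Assumed layout: four lines precede the real header row.
with_preamble = (
    "Data Source,Example\n"
    "\n"
    "Last Updated Date,2000-01-01\n"
    "\n"
    "Country Name,Country Code,2010,2011\n"
    "Afghanistan,AFG,9.53E+09,1.04E+10\n"
    "Albania,ALB,1.19E+10,1.29E+10\n"
)

# Already-trimmed file: the header is the first line.
plain = (
    "Country Name,Country Code,2010,2011\n"
    "Afghanistan,AFG,9.53E+09,1.04E+10\n"
    "Albania,ALB,1.19E+10,1.29E+10\n"
)

df_a = read_gdp_text(with_preamble)
df_b = read_gdp_text(plain)
print(list(df_a.columns))
print(list(df_b.columns))
```

Both inputs end up with the same columns, which is the property the loader relies on before the year columns are selected.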

#### ✔ Example of expected CO₂ raw structure:

| Name        | year | co2  | co2_per_capita | Description |
|-------------|------|------|----------------|-------------|
| Afghanistan | 2010 | 8.36 | 0.296          | Country     |
| Afghanistan | 2011 | 11.8 | 0.403          | Country     |


In [None]:
# ===== CO2 DATA =====
co2 = pd.read_csv(CO2_FILE)
co2 = co2[co2["Description"] == "Country"].copy()
co2 = co2.rename(columns={
    "Name": "Country",
    "year": "Year",
    "co2": "CO2_total",
    "co2_per_capita": "CO2_per_capita",
    "co2_per_gdp": "CO2_per_GDP"
})
co2 = co2[[
    "iso_code", "Country", "Year",
    "population", "CO2_total", "CO2_per_capita", "CO2_per_GDP"
]]
co2 = co2[(co2["Year"] >= YEAR_START) & (co2["Year"] <= YEAR_END)]

# ===== GDP DATA =====
gdp_raw = safe_read_worldbank(GDP_FILE)
year_cols = [str(y) for y in YEARS if str(y) in gdp_raw.columns]
if not year_cols:
    raise ValueError(
        f"{GDP_FILE}: no year columns found for {YEAR_START}-{YEAR_END}"
    )

gdp = gdp_raw[["Country Code"] + year_cols].rename(
    columns={"Country Code": "iso_code"}
)
gdp = gdp.melt(
    id_vars="iso_code",
    value_vars=year_cols,
    var_name="Year",
    value_name="GDP"
)
gdp["Year"] = gdp["Year"].astype(int)

# ===== HDI / LIFE / GII DATA =====
hdi_full = pd.read_csv(HDI_FILE)
rows = []
for _, r in hdi_full.iterrows():
    iso = r["ISO3"]
    cname = r["Country"]
    for y in YEARS:
        rows.append({
            "iso_code": iso,
            "Country_hdi": cname,
            "Year": y,
            "HDI": r.get(f"Human Development Index ({y})", np.nan),
            "LifeExpectancy": r.get(f"Life Expectancy at Birth ({y})", np.nan),
            "GII": r.get(f"Gender Inequality Index ({y})", np.nan),
        })
hdi = pd.DataFrame(rows)

### Step 2: Cleaning CO₂, GDP, and HDI Datasets

#### ✔ CO₂ Cleaning Steps:
- Keep only rows where `Description == "Country"`
- Rename columns (`Name→Country`, `year→Year`, `co2→CO2_total`, etc.)
- Keep only relevant columns
- Restrict to **2010–2019**

**CO₂ cleaned sample (conceptual):**

| iso_code | Country     | Year | population | CO2_total | CO2_per_capita |
|----------|-------------|------|------------|-----------|----------------|
| AFG      | Afghanistan | 2010 | 28M        | 8.36      | 0.296          |
| AFG      | Afghanistan | 2011 | 29M        | 11.83     | 0.403          |
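As a small worked example of the CO₂ cleaning steps, the sketch below applies the same filter, rename, and year restriction to a toy frame. The 2010 and 2011 Afghanistan values follow the sample tables; the 2009 row and the `"Region"` row are invented placeholders that exist only so the filters have something to remove.

```python
import pandas as pd

raw = pd.DataFrame({
    "Name": ["Afghanistan", "Afghanistan", "Afghanistan", "Some region"],
    "iso_code": ["AFG", "AFG", "AFG", None],
    "Description": ["Country", "Country", "Country", "Region"],  # "Region" is hypothetical
    "year": [2009, 2010, 2011, 2010],
    "co2": [0.0, 8.36, 11.8, 0.0],  # 0.0 entries are placeholders
})

cleaned = (
    raw[raw["Description"] == "Country"]            # drop aggregates
    .rename(columns={"Name": "Country", "year": "Year", "co2": "CO2_total"})
    .loc[lambda d: d["Year"].between(2010, 2019),    # inclusive on both ends
         ["iso_code", "Country", "Year", "CO2_total"]]
    .reset_index(drop=True)
)
print(cleaned)
```

`Series.between` is inclusive by default, so it matches the `>= YEAR_START` and `<= YEAR_END` conditions used in the notebook cell.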

---

#### ✔ GDP Cleaning Steps:
- Load using `safe_read_worldbank`
- Keep `"Country Code"` and every year column available for 2010–2019
- Convert GDP from **wide** to **long** format using `melt()`

**GDP long-format sample:**

| iso_code | Year | GDP        |
|----------|------|------------|
| AFG      | 2010 | 9.53E+09   |
| AFG      | 2011 | 1.04E+10   |
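The wide-to-long conversion can be checked on the two sample rows from the GDP table in Step 1. This is a minimal sketch on an in-memory frame limited to 2010 and 2011, not a replacement for the notebook cell.

```python
import pandas as pd

wide = pd.DataFrame({
    "Country Name": ["Afghanistan", "Albania"],
    "Country Code": ["AFG", "ALB"],
    "2010": [9.53e9, 1.19e10],
    "2011": [1.04e10, 1.29e10],
})
year_cols = ["2010", "2011"]

long = (
    wide[["Country Code"] + year_cols]
    .rename(columns={"Country Code": "iso_code"})
    .melt(id_vars="iso_code", value_vars=year_cols,
          var_name="Year", value_name="GDP")
)
long["Year"] = long["Year"].astype(int)  # column labels are strings after melt
long = long.sort_values(["iso_code", "Year"]).reset_index(drop=True)
print(long)
```

Two countries times two year columns give four rows, one per country and year, which is the shape the later merge on `iso_code` and `Year` expects.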

---

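#### ✔ HDI / Life Expectancy / GII Cleaning:
- Loop over all countries and all years (2010–2019)
- Extract HDI, Life Expectancy, and GII from the year-suffixed columns (e.g. `Human Development Index (2010)`), producing one row per country and year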
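The row-by-row loop is easy to follow but can be slow on wide files. As one possible alternative (a sketch, not the notebook's method), the same reshape can be done with `melt`, a regular expression on the column names, and `pivot`. The toy values follow the final preview table; the column name pattern is taken from the loop in the code cell. Unlike the loop, this version only produces rows for years whose columns actually exist in the file.

```python
import pandas as pd

hdi_wide = pd.DataFrame({
    "ISO3": ["AFG"],
    "Country": ["Afghanistan"],
    "Human Development Index (2010)": [0.46],
    "Human Development Index (2011)": [0.47],
    "Life Expectancy at Birth (2010)": [53.8],
    "Life Expectancy at Birth (2011)": [54.2],
    "Gender Inequality Index (2010)": [0.70],
    "Gender Inequality Index (2011)": [0.69],
})

names = {
    "Human Development Index": "HDI",
    "Life Expectancy at Birth": "LifeExpectancy",
    "Gender Inequality Index": "GII",
}

# One row per (country, original column), then split "Indicator (YYYY)".
long = hdi_wide.melt(id_vars=["ISO3", "Country"],
                     var_name="column", value_name="value")
parts = long["column"].str.extract(r"^(?P<indicator>.+) \((?P<Year>\d{4})\)$")
long = pd.concat([long.drop(columns="column"), parts], axis=1)

# Keep only the three indicators of interest and tidy the types.
long = long[long["indicator"].isin(list(names))].copy()
long["value"] = pd.to_numeric(long["value"], errors="coerce")
long["Year"] = long["Year"].astype(int)
long["indicator"] = long["indicator"].map(names)

hdi_long = (
    long.pivot(index=["ISO3", "Country", "Year"],
               columns="indicator", values="value")
    .reset_index()
    .rename(columns={"ISO3": "iso_code", "Country": "Country_hdi"})
)
hdi_long.columns.name = None
print(hdi_long)
```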


In [None]:
# ===== MERGE CO2 + GDP =====
panel = co2.merge(
    gdp, on=["iso_code", "Year"], how="inner"
)
# ===== ADD HDI / LIFE / GII =====
panel = panel.merge(
    hdi, on=["iso_code", "Year"], how="left"
)

panel["Country"] = panel["Country"].fillna(panel["Country_hdi"])
panel = panel.drop(columns=["Country_hdi"])

# ===== DERIVED VARIABLE =====
panel["GDP_per_capita"] = panel["GDP"] / panel["population"]

# ===== DROP CRITICAL MISSING VALUES =====
panel = panel.dropna(
    subset=["CO2_total", "CO2_per_capita", "GDP", "HDI", "LifeExpectancy"]
)

print(f">> Panel data size: {panel.shape[0]} observations, {panel.shape[1]} variables")

# ===== SAVE FINAL DATASET =====
panel.to_csv("master_cross_section.csv", index=False)

>> Panel data size: 1871 observations, 12 variables


### Step 3: Merging All Datasets & Generating the Final Master Dataset

#### ✔ Merge Process:
1. **CO₂ + GDP** merged using `iso_code` + `Year`
2. **HDI/LifeExpectancy/GII** added using a left merge
3. Missing country names replaced using `Country_hdi`
4. Temporary `Country_hdi` column removed
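When merging on `iso_code` + `Year`, it can help to confirm that keys are unique and to see how many rows find no HDI match. The sketch below uses two real options of `DataFrame.merge`, `validate` and `indicator`, on toy frames. The `XXX` country code and its CO₂ value are placeholders, and the 2011 HDI row is left out on purpose to produce an unmatched row.

```python
import pandas as pd

keys = ["iso_code", "Year"]

co2 = pd.DataFrame({
    "iso_code": ["AFG", "AFG", "XXX"],   # "XXX" is a made-up code with no GDP row
    "Year": [2010, 2011, 2010],
    "CO2_total": [8.36, 11.83, 1.0],     # 1.0 is a placeholder
})
gdp = pd.DataFrame({
    "iso_code": ["AFG", "AFG"],
    "Year": [2010, 2011],
    "GDP": [9.53e9, 1.04e10],
})
hdi = pd.DataFrame({
    "iso_code": ["AFG"],
    "Year": [2010],                       # 2011 deliberately missing
    "HDI": [0.46],
})

# Inner merge: rows without GDP drop out. validate raises if keys repeat.
step1 = co2.merge(gdp, on=keys, how="inner", validate="one_to_one")

# Left merge: all CO2+GDP rows are kept; _merge records whether HDI matched.
step2 = step1.merge(hdi, on=keys, how="left",
                    validate="one_to_one", indicator=True)

unmatched = int((step2["_merge"] == "left_only").sum())
print(f"rows after inner merge: {len(step1)}")
print(f"rows without an HDI match: {unmatched}")
```

Rows flagged `left_only` are the ones that the later `dropna` on `HDI` removes, so this count gives an early view of how much data the HDI file fails to cover.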

---

#### ✔ Create Derived Metric:
`GDP_per_capita = GDP / population`

Dividing by population puts economies of different sizes on a comparable per-person scale.
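A short sketch of the division with a guard for unusable population values. The guard is an addition for illustration (the notebook divides directly), and all numbers below are placeholders chosen to make the arithmetic easy to check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GDP": [1.0e10, 5.0e9, 2.0e9],          # placeholder values
    "population": [2.0e7, 0.0, np.nan],     # zero and missing populations
})

# Treat zero or negative population as missing so the ratio becomes NaN
# instead of inf.
valid_pop = df["population"].where(df["population"] > 0)
df["GDP_per_capita"] = df["GDP"] / valid_pop
print(df)
```

For the first row, 1.0e10 / 2.0e7 = 500. The other two rows come out as missing and would be visible in any later missing-value check.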

---

#### ✔ Drop rows with critical missing values:
To ensure analysis quality, rows missing any of the essential indicators are removed:

- `CO2_total`
- `CO2_per_capita`
- `GDP`
- `HDI`
- `LifeExpectancy`
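Before dropping rows, it can be useful to count how many values each critical column is missing, so the size of the loss is visible. A minimal sketch on an invented three-row panel (country codes `AAA`, `BBB`, `CCC` and all values are placeholders):

```python
import numpy as np
import pandas as pd

critical = ["CO2_total", "CO2_per_capita", "GDP", "HDI", "LifeExpectancy"]

panel = pd.DataFrame({
    "iso_code": ["AAA", "BBB", "CCC"],
    "CO2_total": [1.0, 2.0, np.nan],
    "CO2_per_capita": [0.1, 0.2, 0.3],
    "GDP": [1e9, 2e9, 3e9],
    "HDI": [0.5, np.nan, 0.7],
    "LifeExpectancy": [60.0, 65.0, 70.0],
    "GII": [np.nan, 0.4, 0.5],   # not in the critical list
})

missing_per_column = panel[critical].isna().sum()
cleaned = panel.dropna(subset=critical)

print(missing_per_column)
print(f"dropped {len(panel) - len(cleaned)} of {len(panel)} rows")
```

Because `GII` is not in the `subset`, the row for `AAA` survives even though its GII is missing. The same holds in the notebook: the final dataset can still contain missing GII values.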

---

#### ✔ Final dataset preview (selected columns):

| iso_code | Country     | Year | CO2_total | GDP        | HDI  | LifeExpectancy | GII  | GDP_per_capita |
|----------|-------------|------|-----------|------------|------|----------------|------|----------------|
| AFG      | Afghanistan | 2010 | 8.36      | 9.53E+09   | 0.46 | 53.8           | 0.70 | 330            |
| AFG      | Afghanistan | 2011 | 11.83     | 1.04E+10   | 0.47 | 54.2           | 0.69 | 354            |

---

### The cleaned master dataset is now saved as:
`master_cross_section.csv`

This file will be used in the **EDA + Hypothesis Testing Notebook**.
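A downstream notebook might start by reloading the file and checking that each country and year appears only once. The sketch below writes a two-row stand-in to a temporary directory so it runs on its own. In practice the read would point at the real `master_cross_section.csv`, and the checks shown are suggestions rather than part of the original workflow.

```python
import os
import tempfile
import pandas as pd

# Stand-in for the saved file, using two rows from the preview table.
toy = pd.DataFrame({
    "iso_code": ["AFG", "AFG"],
    "Year": [2010, 2011],
    "HDI": [0.46, 0.47],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "master_cross_section.csv")
    toy.to_csv(path, index=False)
    loaded = pd.read_csv(path)

# Basic sanity checks: unique keys and years inside the study period.
duplicate_keys = int(loaded.duplicated(subset=["iso_code", "Year"]).sum())
years_in_range = bool(loaded["Year"].between(2010, 2019).all())
print(f"duplicate (iso_code, Year) pairs: {duplicate_keys}")
print(f"all years within 2010-2019: {years_in_range}")
```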
