# 02 — WGI Governance Data Cleaning

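This notebook prepares the Worldwide Governance Indicators (WGI)
for cross-country analysis: it loads the raw workbook, keeps three
governance dimensions, and reshapes them into a country-year panel.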

## Indicators Used
- Control of Corruption
- Government Effectiveness
- Rule of Law

## Output
The cleaned country-year panel is saved as `wgi_governance_clean.csv` in `data/interim/` under the project folder.

In [1]:
# =========================
# SETUP
# =========================

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# =========================
# PROJECT PATHS
# =========================

BASE_PATH = "/content/drive/MyDrive/thesis_project/"

RAW_DATA = BASE_PATH + "data/raw/"
INTERIM_DATA = BASE_PATH + "data/interim/"

In [3]:
# =========================
# IMPORTS
# =========================

import pandas as pd
import numpy as np

## Load WGI Dataset
The WGI workbook stores each governance dimension on its own sheet, named by a two-letter code.

In [4]:
file_path = RAW_DATA + "wgidataset_with_sourcedata.xlsx"

xls = pd.ExcelFile(file_path)

print("Sheets found:", xls.sheet_names)

Sheets found: ['va', 'pv', 'ge', 'rq', 'rl', 'cc']


## Combine All Sheets
Each sheet is read and stacked into one long table; the `Governance dimension` column records which dimension a row belongs to.

In [5]:
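# Read every sheet into its own DataFrame.
dfs = []

for sheet in xls.sheet_names:
    df = pd.read_excel(xls, sheet_name=sheet)
    dfs.append(df)

# Stack the sheets row-wise and rebuild a clean 0..n-1 index.
wgi = pd.concat(dfs, ignore_index=True)

wgi.head()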

Unnamed: 0,ID variable (economy code/ gov. dimension/ year),Economy (name),Economy (code),Region,Income classification,Year,Governance dimension,Number of sources,Governance estimate (approx. -2.5 to +2.5),Standard error (estimate),...,OBI mean,PIA mean,PRS mean,RSF mean,VAB mean,VDM mean,WBS mean,WCY mean,WJP mean,WMO mean
0,ADOva1996,Andorra,ADO,,,1996,va,3,1.541954,0.301021,...,..,..,..,..,..,..,..,..,..,1.0
1,AFGva1996,Afghanistan,AFG,South Asia,Low income,1996,va,4,-2.235444,0.24508,...,..,..,..,..,..,0.246567,..,..,..,0.0625
2,AGOva1996,Angola,AGO,Sub-Saharan Africa,Lower middle income,1996,va,6,-1.746207,0.193985,...,..,..,0.25,..,..,0.320267,..,..,..,0.125
3,ALBva1996,Albania,ALB,Europe & Central Asia,Upper middle income,1996,va,5,-0.826077,0.221397,...,..,..,0.5,..,..,0.448356,..,..,..,0.25
4,AREva1996,United Arab Emirates,ARE,Middle East & North Africa,High income,1996,va,6,-0.848031,0.193985,...,..,..,0.583333,..,..,0.332567,..,..,..,0.6875
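
The source columns in the preview use `..` as a missing-value placeholder. If the estimate column carries the same placeholder anywhere, pandas reads it as text rather than as a number, so it is worth checking before averaging. A minimal sketch on made-up rows (the economy codes and values are illustrative assumptions, not taken from the WGI file):

```python
import pandas as pd

col = "Governance estimate (approx. -2.5 to +2.5)"

# Toy frame mimicking a column that mixes numbers with the ".." placeholder.
toy = pd.DataFrame({
    "Economy (code)": ["AAA", "BBB", "CCC"],
    col: [1.25, "..", -0.5],
})

# errors="coerce" turns anything non-numeric (here "..") into NaN.
toy[col] = pd.to_numeric(toy[col], errors="coerce")

n_missing = int(toy[col].isna().sum())
print(toy[col].dtype, "| missing values:", n_missing)
```

Running the same `isna().sum()` count on the real column after coercion shows how many estimates were placeholders.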


## Select Required Columns

In [6]:
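# Keep the identifier columns and the governance estimate, with shorter names.
rename_map = {
    "Economy (name)": "Country Name",
    "Economy (code)": "Country Code",
    "Year": "Year",
    "Governance dimension": "Indicator",
    "Governance estimate (approx. -2.5 to +2.5)": "Governance",
}

wgi_clean = wgi[list(rename_map)].rename(columns=rename_map)

# The source columns use ".." for missing values. Coerce the estimate to
# numeric so that any such placeholder becomes NaN instead of text.
wgi_clean["Governance"] = pd.to_numeric(
    wgi_clean["Governance"], errors="coerce"
)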

## Standardize Indicator Labels

In [7]:
wgi_clean["Indicator"] = (
    wgi_clean["Indicator"]
    .astype(str)
    .str.strip()
    .str.lower()
)

indicator_map = {
    "cc": "Control of Corruption",
    "ge": "Government Effectiveness",
    "rl": "Rule of Law",
    "rq": "Regulatory Quality",
    "pv": "Political Stability",
    "va": "Voice and Accountability"
}

wgi_clean["Indicator"] = wgi_clean["Indicator"].map(indicator_map)
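
`Series.map` returns NaN for any label that is missing from the dictionary, so an unexpected sheet code would disappear silently in the later filter. A small sketch of a check for this, using hypothetical codes (`"xx"` is an invented label standing in for an unmapped one):

```python
import pandas as pd

# Hypothetical codes; "xx" is not a WGI code and has no entry in the map.
codes = pd.Series(["cc", "ge", "xx", "rl"])

indicator_map = {
    "cc": "Control of Corruption",
    "ge": "Government Effectiveness",
    "rl": "Rule of Law",
}

mapped = codes.map(indicator_map)

# Codes whose mapped value is NaN were not found in the dictionary.
unmapped = sorted(codes[mapped.isna()].unique())
print("Unmapped codes:", unmapped)
```

In the notebook, the same check could run on the raw `Indicator` column before it is overwritten with the mapped labels.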

## Keep Selected Governance Indicators

In [8]:
keep_indicators = [
    "Control of Corruption",
    "Government Effectiveness",
    "Rule of Law"
]

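# .copy() makes the filtered frame independent of the original,
# which avoids SettingWithCopyWarning on later assignments.
wgi_clean = wgi_clean[
    wgi_clean["Indicator"].isin(keep_indicators)
].copy()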

## Convert to Panel Format
Reshape from long format to one row per country-year, with each governance indicator as its own column.

In [9]:
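# Long -> wide. aggfunc="mean" averages any duplicate
# country-year-indicator rows instead of raising an error.
wgi_panel = wgi_clean.pivot_table(
    index=["Country Name", "Country Code", "Year"],
    columns="Indicator",
    values="Governance",
    aggfunc="mean"
).reset_index()

# Fix the column order explicitly.
wgi_panel = wgi_panel[
    [
        "Country Name",
        "Country Code",
        "Year",
        "Control of Corruption",
        "Government Effectiveness",
        "Rule of Law",
    ]
]

# Drop the leftover "Indicator" label on the column axis.
wgi_panel.columns.name = None

wgi_panel = wgi_panel.sort_values(
    ["Country Code", "Year"]
).reset_index(drop=True)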
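
Because `pivot_table` with `aggfunc="mean"` averages repeated keys without warning, it can help to look for duplicate country-year-indicator rows before reshaping. A minimal sketch on invented data (the codes `AAA`/`BBB` and all values are assumptions for illustration):

```python
import pandas as pd

# Toy long-format data with one duplicated key: AAA / 2000 / Rule of Law.
long_df = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "AAA", "BBB"],
    "Year": [2000, 2000, 2000, 2000],
    "Indicator": ["Rule of Law", "Rule of Law",
                  "Control of Corruption", "Rule of Law"],
    "Governance": [0.2, 0.4, 1.0, -1.0],
})

keys = ["Country Code", "Year", "Indicator"]

# keep=False flags every row that shares its key with another row.
dupes = long_df[long_df.duplicated(subset=keys, keep=False)]
print("Rows with a repeated key:", len(dupes))

# The pivot still succeeds, but the duplicated pair is averaged.
panel = long_df.pivot_table(
    index=["Country Code", "Year"],
    columns="Indicator",
    values="Governance",
    aggfunc="mean",
).reset_index()

aaa_rule_of_law = panel.loc[panel["Country Code"] == "AAA", "Rule of Law"].iloc[0]
print(panel)
```

If the duplicate count on the real data is zero, the mean is a no-op and the pivot is a pure reshape.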

## Save Clean Dataset

In [10]:
wgi_panel.to_csv(
    INTERIM_DATA + "wgi_governance_clean.csv",
    index=False,
    float_format="%.10f"
)
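
A quick round-trip read can confirm that the CSV reloads with the same shape and that `float_format="%.10f"` keeps the values within rounding distance of the originals. The sketch below uses a temporary folder and a made-up two-row panel rather than the real output file; the file name `panel_check.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy panel with one missing estimate.
panel = pd.DataFrame({
    "Country Code": ["AAA", "BBB"],
    "Year": [2000, 2000],
    "Rule of Law": [0.123456789012, None],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "panel_check.csv")
    panel.to_csv(out_path, index=False, float_format="%.10f")
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == panel.shape

# Largest absolute difference introduced by writing 10 decimal places
# (NaN entries are skipped by max()).
max_diff = (reloaded["Rule of Law"] - panel["Rule of Law"]).abs().max()

print("Same shape:", same_shape)
print("Max rounding difference below 1e-9:", max_diff < 1e-9)
```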

## Pipeline Completion

In [11]:
print("✅ WGI cleaning completed successfully.")
print("Saved to:", INTERIM_DATA)

✅ WGI cleaning completed successfully.
Saved to: /content/drive/MyDrive/thesis_project/data/interim/
