# Clean OG Solar — Full Pipeline (with flexible building-type mapping)
This notebook keeps everything the same as your previous workflow **except** the building-type mapping.
We now:
1) **Load & standardize** all original files (no encoding),
2) **Discover all unique building-type strings** across datasets and write a CSV for you to edit,
3) **Apply your CSV mapping** and encode to integers deterministically,
4) **Audit nulls**, then drop rows with any null in the common columns,
5) **Save per-city and combined outputs**.

**Note:** We drop `Estimated_building_height` and `Estimated_capacity_factor` as before.
Edit the folder paths in Cell 1A if needed to match your machine.


## Cell 1A — Setup & discovery (NO encoding here)

In [1]:

import os, glob, re
from pathlib import Path
import pandas as pd
import numpy as np

# === Adjust if needed ===
DATA_DIR = Path(r"C:/Users/User/Desktop/ML/Project/solar-potential-analysis-github-setup/original_datasets")
FILE_GLOB = "*rooftop*solar*"

# Outputs
OUT_DIR = Path(r"C:/Users/User/Desktop/ML/Project/solar-potential-analysis-github-setup/cleaned_datasets")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Discover files
files = sorted([p for p in DATA_DIR.glob(FILE_GLOB) if p.suffix.lower() in {".csv",".xlsx",".xls"}])
print(f"Found {len(files)} files:")
for f in files:
    print(" -", f.name)

# Common columns (same as before)
common_columns = [
    "City",
    "Surface_area",
    "Potential_installable_area",
    "Peak_installable_capacity",
    "Energy_potential_per_year",
    "Assumed_building_type",
    "Estimated_tilt",
    "Estimated_building_height",
    "Estimated_capacity_factor"
]

def read_standardize(path: Path, common_cols):
    # read
    if path.suffix.lower() in {".xlsx",".xls"}:
        df = pd.read_excel(path)
    else:
        df = pd.read_csv(path, low_memory=False)
    # cols
    df.columns = (df.columns
                    .str.strip()
                    .str.replace(r"\s+","_", regex=True))
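    # city: fall back to the file stem minus its "_rooftop_solarpotential" suffix
    # (the optional underscore also covers a "_rooftop_solar_potential" spelling)
    if "City" not in df.columns:
        df["City"] = re.sub(r"_rooftop_solar_?potential$", "", path.stem)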
    # ensure common cols
    for c in common_cols:
        if c not in df.columns:
            df[c] = pd.NA
    df = df[common_cols].copy()
    # numeric coercion
    num_cols = [c for c in common_cols if c not in {"City","Assumed_building_type"}]
    for c in num_cols:
        df[c] = pd.to_numeric(df[c], errors="coerce")
    # city string
    df["City"] = df["City"].astype("string").str.strip()
    # keep building type as normalized TEXT (lowercased) — still strings here
    df["Assumed_building_type"] = (
        df["Assumed_building_type"].astype("string").str.strip().str.lower().str.replace(r"\s+"," ", regex=True)
    )
    return df

dfs = {f.stem: read_standardize(f, common_columns) for f in files}
print("Loaded and standardized (no label encoding yet).")


Found 25 files:
 - accra_rooftop_solarpotential.csv
 - almaty_rooftop_solarpotential.csv
 - antigua_rooftop_solarpotential.csv
 - beirut_rooftop_solarpotential.csv
 - colombo_rooftop_solarpotential.csv
 - daressalaam_rooftop_solarpotential.csv
 - dhaka_rooftop_solarpotential.csv
 - dominica_rooftop_solarpotential.csv
 - grenada_rooftop_solarpotential.csv
 - izmir_rooftop_solarpotential.csv
 - johannesburg_rooftop_solarpotential.csv
 - karachi_rooftop_solarpotential.csv
 - lagosstate_rooftop_solarpotential.csv
 - maldives_rooftop_solarpotential.csv
 - manila_rooftop_solarpotential.csv
 - mexicocity_rooftop_solarpotential.csv
 - nairobi_rooftop_solarpotential.csv
 - panama8cities_rooftop_solarpotential.csv
 - rustavi_rooftop_solarpotential.csv
 - samarkand_rooftop_solarpotential.csv
 - sanpedrosula_rooftop_solarpotential.csv
 - sintmaarten_rooftop_solarpotential.csv
 - stlucia_rooftop_solarpotential.csv
 - svg_rooftop_solarpotential.csv
 - tegucigalpa_rooftop_solarpotential.csv
Loaded and standardized (no label encoding yet).
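
To make the standardization step concrete, here is a minimal sketch that applies the same header cleanup, numeric coercion, and label normalization as `read_standardize` to a tiny in-memory frame. The `toy` frame and its values are made up for illustration and do not come from the datasets.

```python
import pandas as pd

# Toy frame with messy headers and labels (made-up values, not from the datasets)
toy = pd.DataFrame({
    " Surface area ": ["12.5", "n/a"],
    "Assumed  building type": ["  Single-Family   Residential", None],
})

# Same header cleanup as read_standardize: trim, then whitespace runs -> "_"
toy.columns = toy.columns.str.strip().str.replace(r"\s+", "_", regex=True)

# Non-numeric entries become NaN instead of raising
toy["Surface_area"] = pd.to_numeric(toy["Surface_area"], errors="coerce")

# Labels are trimmed, lowercased, and inner whitespace is collapsed
toy["Assumed_building_type"] = (
    toy["Assumed_building_type"].astype("string")
    .str.strip().str.lower().str.replace(r"\s+", " ", regex=True)
)
print(toy)
```

Note that `errors="coerce"` silently turns unparseable values into nulls, which is why the null audit in the later cells matters.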

## Cell 1B — Discover unique building-type strings & write a mapping template (edit the CSV)

In [2]:

import pandas as pd
from collections import Counter
from pathlib import Path

def norm_bt(s: pd.Series) -> pd.Series:
    return (s.astype("string").str.strip().str.lower().str.replace(r"\s+"," ", regex=True))

# collect uniques across all datasets
global_counter = Counter()
per_ds = []
for name, df in dfs.items():
    s = norm_bt(df["Assumed_building_type"])
    u = s.dropna().unique().tolist()
    per_ds.append({"dataset": name, "n_rows": len(df), "n_unique_bt": len(u), "sample": u[:8]})
    global_counter.update(s.dropna().tolist())

per_ds_df = pd.DataFrame(per_ds).sort_values("n_unique_bt", ascending=False).reset_index(drop=True)
display(per_ds_df)

# Write mapping template CSV
map_df = pd.DataFrame(sorted(global_counter.items(), key=lambda kv: (-kv[1], kv[0])),
                      columns=["raw_label","global_count"])
map_df["canonical_label"] = ""   # <-- YOU fill this column

tmpl_path = Path("building_type_mapping_template.csv")
tmpl_path.parent.mkdir(parents=True, exist_ok=True)
map_df.to_csv(tmpl_path, index=False)
print(f"✍️  Edit canonical_label in: {tmpl_path.resolve()}")
display(map_df.head(30))


| | dataset | n_rows | n_unique_bt | sample |
|---|---|---|---|---|
| 0 | antigua_rooftop_solarpotential | 48904 | 8 | [single-family residential, public, industrial... |
| 1 | dominica_rooftop_solarpotential | 37841 | 8 | [single-family residential, public health faci... |
| 2 | grenada_rooftop_solarpotential | 51115 | 8 | [single-family residential, schools, commercia... |
| 3 | sanpedrosula_rooftop_solarpotential | 61932 | 8 | [commercial, schools, single-family residentia... |
| 4 | sintmaarten_rooftop_solarpotential | 15279 | 8 | [single-family residential, multi-family resid... |
| 5 | svg_rooftop_solarpotential | 46707 | 8 | [single-family residential, commercial, public... |
| 6 | maldives_rooftop_solarpotential | 92074 | 8 | [single-family residential, commercial, public... |
| 7 | lagosstate_rooftop_solarpotential | 1542057 | 8 | [single-family residential, public, multi-fami... |
| 8 | johannesburg_rooftop_solarpotential | 414834 | 8 | [single-family residential, commercial, hotels... |
| 9 | tegucigalpa_rooftop_solarpotential | 165629 | 8 | [single-family residential, public, commercial... |


✍️  Edit canonical_label in: C:\Users\User\Desktop\ML\Project\solar-potential-analysis-github-setup\data_standardization\building_type_mapping_template.csv


| | raw_label | global_count | canonical_label |
|---|---|---|---|
| 0 | single-family residential | 2306463 | |
| 1 | single family residential | 1845113 | |
| 2 | multi-family residential | 1290607 | |
| 3 | commercial | 367820 | |
| 4 | multi family residential | 196225 | |
| 5 | industrial | 154164 | |
| 6 | public | 136983 | |
| 7 | peri-urban settlement | 86643 | |
| 8 | multifamily residential | 58155 | |
| 9 | schools | 35564 | |
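
The template above contains spelling variants of the same class ("single-family" vs "single family", "multi-family" vs "multifamily"). As a starting point for filling `canonical_label`, the following sketch groups such variants under one key. `suggest_canonical` is a hypothetical helper, not part of the notebook, and its output is only a suggestion; the final labels remain your choice in the CSV.

```python
import re
import pandas as pd

def suggest_canonical(raw_label: str) -> str:
    """Hypothetical helper: collapse hyphen/space spelling variants into one key."""
    s = raw_label.strip().lower()
    s = re.sub(r"[-\s]+", " ", s)                      # "single-family" -> "single family"
    s = re.sub(r"\bmultifamily\b", "multi family", s)  # handle the fused spelling
    return s

# Raw labels taken from the template shown above
labels = pd.Series([
    "single-family residential",
    "single family residential",
    "multi-family residential",
    "multi family residential",
    "multifamily residential",
])
suggestions = labels.map(suggest_canonical)
print(suggestions.unique().tolist())
```

The five raw spellings collapse to two suggested keys, which you could then rename to whatever canonical labels you prefer.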


## Cell 2 — Apply your mapping CSV and encode deterministically

In [4]:

import pandas as pd
from pathlib import Path

mapping_path = Path("building_type_mapping_template.csv")
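# Read every column as text so numeric canonical labels are not parsed as floats
# (e.g. "3" turning into "3.0") when some canonical_label cells are still blank
m = pd.read_csv(mapping_path, dtype="string")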

# Normalize keys same as discovery
def norm_text(x):
    return (pd.Series(x, dtype="string")
              .str.strip().str.lower().str.replace(r"\s+"," ", regex=True))

if "raw_label" not in m.columns or "canonical_label" not in m.columns:
    raise ValueError("Mapping CSV must have columns: raw_label, canonical_label")

m["raw_label"] = norm_text(m["raw_label"])
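# Blank cells load as missing values; turn them into "" so the blank checks below are explicit
m["canonical_label"] = m["canonical_label"].astype("string").str.strip().fillna("")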

# Build raw -> canonical dict (skip blanks)
raw2canon = dict(m[m["canonical_label"]!=""][["raw_label","canonical_label"]].values)

# Derive deterministic integer codes from the canonical set actually used
canonical_classes = sorted(m.loc[m["canonical_label"]!="","canonical_label"].unique().tolist())
canon2int = {c:i for i,c in enumerate(canonical_classes)}
int2canon = {i:c for c,i in canon2int.items()}
print("Encoding used (int → canonical):", int2canon)

def map_and_encode_bt(series: pd.Series) -> pd.Series:
    s = norm_text(series)
    canon = s.map(raw2canon)      # NaN if not in mapping or left blank
    codes = canon.map(canon2int)  # NaN if canon missing
    return codes.astype("Int64")

reports = []
for name, df in dfs.items():
    before = int(df["Assumed_building_type"].isna().sum())
    dfs[name]["Assumed_building_type"] = map_and_encode_bt(df["Assumed_building_type"])
    after  = int(df["Assumed_building_type"].isna().sum())
    reports.append({"dataset": name, "rows": len(df), "null_before": before, "null_after": after})

rep = pd.DataFrame(reports).sort_values("null_after", ascending=False).reset_index(drop=True)
display(rep)

# Fail fast if anything remains unmapped
total_unmapped = int(rep["null_after"].sum())
if total_unmapped > 0:
    raise RuntimeError(f"{total_unmapped} rows still unmapped. Fill canonical_label in the CSV and re-run this cell.")


Encoding used (int → canonical): {0: '0', 1: '1', 2: '2', 3: '3', 4: '4', 5: '5', 6: '6', 7: '7', 8: '8', 9: '9'}
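
The printed encoding shows that the canonical labels here are the strings '0' to '9'. Because `sorted` compares strings lexicographically, the integer codes line up with the numeric labels only while there are at most ten of them. The sketch below illustrates the effect with twelve hypothetical labels and one possible numeric-aware sort key; it is an illustration of the caveat, not a change the notebook requires.

```python
# Hypothetical: twelve numeric canonical labels "0".."11"
labels = [str(i) for i in range(12)]

# Plain string sort, as used for canonical_classes in Cell 2
lexicographic = sorted(labels)

# Numeric-aware sort: digit-only labels first in numeric order, other labels after
numeric_aware = sorted(
    labels,
    key=lambda s: (0, int(s), "") if s.isdigit() else (1, 0, s),
)

print(lexicographic[:4])
print(numeric_aware[:4])
```

With the plain sort, label "10" would receive code 2, so codes and labels would no longer agree.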


| | dataset | rows | null_before | null_after |
|---|---|---|---|---|
| 0 | accra_rooftop_solarpotential | 268947 | 0 | 0 |
| 1 | almaty_rooftop_solarpotential | 138339 | 0 | 0 |
| 2 | antigua_rooftop_solarpotential | 48904 | 0 | 0 |
| 3 | beirut_rooftop_solarpotential | 69859 | 0 | 0 |
| 4 | colombo_rooftop_solarpotential | 267989 | 0 | 0 |
| 5 | daressalaam_rooftop_solarpotential | 533855 | 0 | 0 |
| 6 | dhaka_rooftop_solarpotential | 632745 | 0 | 0 |
| 7 | dominica_rooftop_solarpotential | 37841 | 0 | 0 |
| 8 | grenada_rooftop_solarpotential | 51115 | 0 | 0 |
| 9 | izmir_rooftop_solarpotential | 287695 | 0 | 0 |
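
The fail-fast check in Cell 2 reports only how many rows are unmapped. If it ever fires, a listing of which labels are missing from the CSV is more useful. The sketch below shows one way to get that; `unmapped_label_counts` is a hypothetical helper, and it would need to run on the text labels (before they are overwritten with integer codes). The toy mapping and series are made up.

```python
import pandas as pd

def unmapped_label_counts(series: pd.Series, raw2canon: dict) -> pd.Series:
    """Count normalized labels that have no entry in the raw -> canonical dict."""
    s = (series.astype("string")
               .str.strip().str.lower().str.replace(r"\s+", " ", regex=True))
    # Keep non-null labels that are absent from the mapping keys
    missing = s[s.notna() & ~s.isin(list(raw2canon))]
    return missing.value_counts()

# Toy inputs (hypothetical labels and mapping)
toy_mapping = {"commercial": "commercial", "schools": "public"}
toy_series = pd.Series(["Commercial", "schools", "hotels", "hotels", None])

counts = unmapped_label_counts(toy_series, toy_mapping)
print(counts)
```

Rows that were already null in the source are excluded on purpose, since no mapping entry could fix them.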


## Cell 3 — Drop calculated/unused columns (same as your previous workflow)

In [5]:

# Drop Estimated_building_height and Estimated_capacity_factor from workflow
for col in ["Estimated_building_height","Estimated_capacity_factor"]:
    if col in common_columns:
        common_columns.remove(col)

for name in list(dfs):
    to_drop = [c for c in ["Estimated_building_height","Estimated_capacity_factor"] if c in dfs[name].columns]
    if to_drop:
        dfs[name] = dfs[name].drop(columns=to_drop)

common_columns


['City',
 'Surface_area',
 'Potential_installable_area',
 'Peak_installable_capacity',
 'Energy_potential_per_year',
 'Assumed_building_type',
 'Estimated_tilt']

## Cell 4 — Null % audit per dataset (before dropping)

In [6]:

from pandas import DataFrame

summary_rows = []
for name, df in dfs.items():
    n = len(df)
    any_null = df[common_columns].isna().any(axis=1).sum()
    pct = (any_null / n * 100) if n else 0.0
    summary_rows.append({"dataset": name, "rows": n, "rows_with_any_null": any_null, "pct_with_any_null": round(pct, 2)})

null_summary = DataFrame(summary_rows).sort_values("pct_with_any_null", ascending=False).reset_index(drop=True)
null_summary


| | dataset | rows | rows_with_any_null | pct_with_any_null |
|---|---|---|---|---|
| 0 | accra_rooftop_solarpotential | 268947 | 0 | 0.0 |
| 1 | almaty_rooftop_solarpotential | 138339 | 0 | 0.0 |
| 2 | antigua_rooftop_solarpotential | 48904 | 0 | 0.0 |
| 3 | beirut_rooftop_solarpotential | 69859 | 0 | 0.0 |
| 4 | colombo_rooftop_solarpotential | 267989 | 0 | 0.0 |
| 5 | daressalaam_rooftop_solarpotential | 533855 | 0 | 0.0 |
| 6 | dhaka_rooftop_solarpotential | 632745 | 0 | 0.0 |
| 7 | dominica_rooftop_solarpotential | 37841 | 0 | 0.0 |
| 8 | grenada_rooftop_solarpotential | 51115 | 0 | 0.0 |
| 9 | izmir_rooftop_solarpotential | 287695 | 0 | 0.0 |
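
The audit above counts rows with any null but does not say which column is responsible. If a dataset does show nulls, a per-column breakdown can help. The following is a minimal sketch; `null_pct_by_column` is a hypothetical helper and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def null_pct_by_column(frames: dict, columns: list) -> pd.DataFrame:
    """Rows = dataset names, columns = percent of null values in each column."""
    return pd.DataFrame(
        {name: df[columns].isna().mean() * 100 for name, df in frames.items()}
    ).T.round(2)

# Toy stand-ins for the dfs dictionary (made-up values)
toy_frames = {
    "city_a": pd.DataFrame({"Surface_area": [1.0, np.nan], "Estimated_tilt": [10.0, 12.0]}),
    "city_b": pd.DataFrame({"Surface_area": [3.0, 4.0], "Estimated_tilt": [np.nan, np.nan]}),
}
out = null_pct_by_column(toy_frames, ["Surface_area", "Estimated_tilt"])
print(out)
```

In the notebook this could be called as `null_pct_by_column(dfs, common_columns)`.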


## Cell 5 — Drop rows with any null across `common_columns`

In [7]:

drop_report = []
for name, df in dfs.items():
    n0 = len(df)
    mask_keep = ~df[common_columns].isna().any(axis=1)
    dropped = int((~mask_keep).sum())
    kept = int(mask_keep.sum())
    drop_report.append({"dataset": name, "before": n0, "drop_null_rows": dropped, "after": kept, "pct_dropped": round((dropped/n0*100) if n0 else 0.0, 2)})
    dfs[name] = df.loc[mask_keep].reset_index(drop=True)

drop_null_df = pd.DataFrame(drop_report).sort_values("pct_dropped", ascending=False).reset_index(drop=True)
drop_null_df


| | dataset | before | drop_null_rows | after | pct_dropped |
|---|---|---|---|---|---|
| 0 | accra_rooftop_solarpotential | 268947 | 0 | 268947 | 0.0 |
| 1 | almaty_rooftop_solarpotential | 138339 | 0 | 138339 | 0.0 |
| 2 | antigua_rooftop_solarpotential | 48904 | 0 | 48904 | 0.0 |
| 3 | beirut_rooftop_solarpotential | 69859 | 0 | 69859 | 0.0 |
| 4 | colombo_rooftop_solarpotential | 267989 | 0 | 267989 | 0.0 |
| 5 | daressalaam_rooftop_solarpotential | 533855 | 0 | 533855 | 0.0 |
| 6 | dhaka_rooftop_solarpotential | 632745 | 0 | 632745 | 0.0 |
| 7 | dominica_rooftop_solarpotential | 37841 | 0 | 37841 | 0.0 |
| 8 | grenada_rooftop_solarpotential | 51115 | 0 | 51115 | 0.0 |
| 9 | izmir_rooftop_solarpotential | 287695 | 0 | 287695 | 0.0 |
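
The keep-mask in Cell 5 is equivalent to pandas' built-in `dropna(subset=...)`. The sketch below checks that on a toy frame (made-up values), including a column outside the subset whose nulls should not trigger a drop.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "City": ["A", "B", "C"],
    "Surface_area": [10.0, np.nan, 30.0],
    "Extra": [np.nan, 1.0, 2.0],   # not in the checked subset
})
cols = ["City", "Surface_area"]

# Mask-based version, as in Cell 5
via_mask = toy.loc[~toy[cols].isna().any(axis=1)].reset_index(drop=True)

# Built-in equivalent
via_dropna = toy.dropna(subset=cols).reset_index(drop=True)

print(via_mask.equals(via_dropna))
```

The mask form is still convenient in the notebook because it yields the dropped and kept counts for the report without a second pass.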


## Cell 6 — Save cleaned per-city + combined outputs (Parquet & CSV)

In [8]:

import pandas as pd, json
from pathlib import Path

OUT_DIR.mkdir(parents=True, exist_ok=True)

# Save per dataset
manifest = []
for name, df in dfs.items():
    p_parquet = OUT_DIR / f"{name}.parquet"
    p_csv     = OUT_DIR / f"{name}.csv"  # keep if you want CSV too
    df.to_parquet(p_parquet, index=False)
    df.to_csv(p_csv, index=False)
    manifest.append({"dataset": name, "rows": int(len(df)), "parquet": str(p_parquet), "csv": str(p_csv)})

# Save combined
combined_df = pd.concat(dfs.values(), ignore_index=True)
combined_parquet = OUT_DIR / "all_cities_clean.parquet"
combined_csv     = OUT_DIR / "all_cities_clean.csv"
combined_df.to_parquet(combined_parquet, index=False)
combined_df.to_csv(combined_csv, index=False)

# Save encoding dictionary we used
encoding_json = OUT_DIR / "building_type_encoding.json"
encoding_json.write_text(json.dumps({"int2canon": int2canon, "canon2int": canon2int}, indent=2))

# Save manifest
manifest_path = OUT_DIR / "_manifest.json"
with open(manifest_path, "w", encoding="utf-8") as f:
    json.dump({
        "columns": list(combined_df.columns),
        "total_rows": int(len(combined_df)),
        "per_dataset": manifest,
        "combined": {"parquet": str(combined_parquet), "csv": str(combined_csv)},
        "encoding": str(encoding_json)
    }, f, indent=2)

# Quick confirmation table
pd.DataFrame(manifest + [{
    "dataset": "__ALL__",
    "rows": int(len(combined_df)),
    "parquet": str(combined_parquet),
    "csv": str(combined_csv)
}]).sort_values("dataset").reset_index(drop=True)


| | dataset | rows | parquet | csv |
|---|---|---|---|---|
| 0 | __ALL__ | 6530761 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 1 | accra_rooftop_solarpotential | 268947 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 2 | almaty_rooftop_solarpotential | 138339 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 3 | antigua_rooftop_solarpotential | 48904 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 4 | beirut_rooftop_solarpotential | 69859 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 5 | colombo_rooftop_solarpotential | 267989 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 6 | daressalaam_rooftop_solarpotential | 533855 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 7 | dhaka_rooftop_solarpotential | 632745 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 8 | dominica_rooftop_solarpotential | 37841 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
| 9 | grenada_rooftop_solarpotential | 51115 | C:\Users\User\Desktop\ML\Project\solar-potenti... | C:\Users\User\Desktop\ML\Project\solar-potenti... |
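
One detail to keep in mind when reading `building_type_encoding.json` back: JSON object keys are always strings, so the integer keys of `int2canon` come back as text. The sketch below shows the round trip in memory and one way to restore the integer keys. The two labels are hypothetical placeholders.

```python
import json

# Hypothetical labels standing in for the real encoding
int2canon = {0: "commercial", 1: "industrial"}
canon2int = {c: i for i, c in int2canon.items()}

# Same payload shape as written in Cell 6
payload = json.dumps({"int2canon": int2canon, "canon2int": canon2int}, indent=2)
loaded = json.loads(payload)

# JSON object keys are always strings, so 0 comes back as "0"
print(list(loaded["int2canon"].keys()))

# Restore integer keys before using the dict to decode codes
int2canon_restored = {int(k): v for k, v in loaded["int2canon"].items()}
print(int2canon_restored)
```

The `canon2int` direction is unaffected, since its keys are strings and its values are JSON numbers.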


## (Optional) Cell 7 — Quick EDA checks (schema, counts)

In [9]:

df = pd.read_parquet(OUT_DIR / "all_cities_clean.parquet")
print("Shape:", df.shape)
print(df.dtypes)

city_counts = df["City"].value_counts().rename_axis("City").reset_index(name="rows")
bt_counts   = df["Assumed_building_type"].value_counts(dropna=False).sort_index().rename("rows").to_frame()
display(city_counts.head(10))
display(bt_counts.head(10))


Shape: (6530761, 7)
City                          string[python]
Surface_area                         float64
Potential_installable_area           float64
Peak_installable_capacity            float64
Energy_potential_per_year            float64
Assumed_building_type                  Int64
Estimated_tilt                       float64
dtype: object


| | City | rows |
|---|---|---|
| 0 | LagosState | 1329525 |
| 1 | GreatDhakaRegion | 632745 |
| 2 | Mexico City | 589629 |
| 3 | DarEsSalaam | 533855 |
| 4 | SouthAfrica | 414834 |
| 5 | Manila | 301381 |
| 6 | Karachi | 296688 |
| 7 | Izmir | 287695 |
| 8 | Nairobi | 272751 |
| 9 | Accra | 268947 |


| Assumed_building_type | rows |
|---|---|
| 0 | 4151576 |
| 1 | 1544987 |
| 2 | 367820 |
| 3 | 8644 |
| 4 | 154164 |
| 5 | 150032 |
| 6 | 86643 |
| 7 | 35564 |
| 8 | 17537 |
| 9 | 13794 |
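
Beyond eyeballing the schema and counts, the same checks can be expressed as assertions. The sketch below is one possible form: `check_clean_frame` is a hypothetical helper, the expected column list is copied from the Cell 3 output, and the toy rows are made up.

```python
import pandas as pd

# Column order as printed at the end of Cell 3
EXPECTED_COLUMNS = [
    "City", "Surface_area", "Potential_installable_area",
    "Peak_installable_capacity", "Energy_potential_per_year",
    "Assumed_building_type", "Estimated_tilt",
]

def check_clean_frame(df: pd.DataFrame, n_classes: int) -> list:
    """Hypothetical post-save check; returns a list of problem descriptions."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append("unexpected columns or column order")
        return problems
    if df.isna().any().any():
        problems.append("null values present")
    codes = df["Assumed_building_type"].dropna()
    if not codes.between(0, n_classes - 1).all():
        problems.append("building-type code out of range")
    return problems

# Toy rows (made-up values)
good = pd.DataFrame({
    "City": ["A"], "Surface_area": [1.0], "Potential_installable_area": [0.5],
    "Peak_installable_capacity": [0.1], "Energy_potential_per_year": [100.0],
    "Assumed_building_type": pd.array([3], dtype="Int64"), "Estimated_tilt": [15.0],
})
bad = good.assign(Assumed_building_type=pd.array([12], dtype="Int64"))

print(check_clean_frame(good, n_classes=10))
print(check_clean_frame(bad, n_classes=10))
```

On the real output this could be called as `check_clean_frame(df, n_classes=len(int2canon))`.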
