Overview of Parquet File Width (Schema Only)

Here I use Dask/PyArrow only to read the schema (the column count), without loading any data.
The result shows why I needed big-data tooling.

File: sample_wide_2k.parquet – 2000 columns
File: sample_big.parquet – 100 columns
File: merged_clean_data.parquet – 4 columns

Explanation:
• sample_wide_2k.parquet is a demo with 2000 columns that illustrates the wide-table problem.
• sample_big.parquet is a dataset with 100 columns, an example of moderate size.
• merged_clean_data.parquet is the final cleaned dataset with 4 columns.

In [1]:
from pathlib import Path
import dask.dataframe as dd

ROOTS = [
    Path("."), Path(".."), Path("../.."),
    Path("data"), Path("data/raw_data/processed"),
    Path("final_features"),
    Path("model_input"), Path("model_input_parquet"),
    Path("outputs"), Path("subset"), Path("notebooks"),
    Path("case_studies/big_data_project/data/raw_data/processed"),
]

def all_parquet_paths(roots):
    out = []
    for r in roots:
        if r.exists():
            out += list(r.rglob("*.parquet"))
    # deduplicate by resolved (absolute) path, keeping the first occurrence
    seen, uniq = set(), []
    for p in out:
        rp = p.resolve()
        if rp not in seen:
            uniq.append(p)
            seen.add(rp)
    return uniq

rows = []
for p in all_parquet_paths(ROOTS):
    try:
        df = dd.read_parquet(str(p))   # lazy: reads only the metadata/schema
        ncols = len(df.columns)
        rows.append((ncols, str(p)))
    except Exception:
        # skip anything that cannot be opened as Parquet
        pass

rows.sort(reverse=True)  # descending by column count
print(f"{'Cols':>6}  Path")
for ncols, path in rows[:10]:
    print(f"{ncols:6d}  {path}")




  Cols  Path
   100  data/raw_data/processed/sample_big.parquet/part.0.parquet
   100  data/raw_data/processed/sample_big.parquet
     4  data/raw_data/processed/merged_clean_data.parquet/part.0.parquet
     4  data/raw_data/processed/merged_clean_data.parquet
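
Each dataset appears twice in the listing above: once as the `.parquet` directory and once as the part file inside it, because `rglob("*.parquet")` matches both. One way to collapse these is to drop any path that sits inside another `.parquet` directory. This is a minimal sketch under that assumption; the helper name `drop_nested_parts` is hypothetical and is not part of the cell above.

```python
from pathlib import Path


def drop_nested_parts(paths):
    """Keep only paths that are not located inside a *.parquet directory.

    Hypothetical helper for illustration: a part file such as
    `x.parquet/part.0.parquet` is dropped, while `x.parquet` is kept.
    """
    kept = []
    for p in paths:
        p = Path(p)
        # A path is "nested" if any of its parent directories ends in .parquet.
        nested = any(parent.suffix == ".parquet" for parent in p.parents)
        if not nested:
            kept.append(p)
    return kept
```

It could be applied to the result of `all_parquet_paths(ROOTS)` before the loop, so that each dataset is counted once.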
