# Build Dataset: MagangHub Vacancies (Full Pipeline)

This notebook runs the full pipeline:
1. **Fetch** vacancy data from the MagangHub API
2. **Prepare** (flatten, parse JSON, enrich skills)
3. **Score** each vacancy by opportunity
4. Save an analysis-ready dataset (`vacancies_scored.parquet`)

> Output directories:  
> - Raw JSON → `data/raw/run_*`  
> - Clean parquet → `data/clean/vacancies.parquet`  
> - Scored parquet → `data/clean/vacancies_scored.parquet`


In [None]:
import os
import sys
from pathlib import Path

import pandas as pd
from IPython.display import display, Markdown

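# Make sure the project root and src/ are importable.
# `__file__` is not defined inside a notebook, so derive the root from the
# working directory (assumes the notebook runs from a subfolder of the
# project root, e.g. notebooks/).
ROOT = Path.cwd().resolve().parent
for p in (ROOT, ROOT / "src"):  # ROOT for `from src import ...`, src/ for `from utils import ...`
    if str(p) not in sys.path:
        sys.path.append(str(p))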

from utils import load_yaml, ensure_dir, preview, Timer


In [None]:
cfg = load_yaml(ROOT / "config" / "params.yaml")

display(Markdown("### ✅ Config Loaded"))
print(f"Source URL   : {cfg['source']['url']}")
print(f"Page limit    : {cfg['source']['params'].get('limit')}")
print(f"Run mode      : {cfg['run'].get('pages')} pages")
print(f"Output folder : {cfg['output']['raw_dir']}")


## 🔹 Step 1: Fetch Data
Run `src/fetch.py` to pull all MagangHub vacancies and save the raw JSON under `data/raw/run_*`.
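
The pagination logic inside `src/fetch.py` is not shown in this notebook. As a rough sketch of how such a loop might work, assume a page-numbered API that returns an empty list once the pages run out. The names `fetch_all_pages` and `get_page` and the `page_XXXX.json` file naming are hypothetical, and the HTTP call is passed in as a function so the sketch runs without network access.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def fetch_all_pages(get_page: Callable[[int], list], out_dir: Path, max_pages: int = 0) -> int:
    """Save each page of results as raw JSON and return the number of pages written.

    max_pages=0 means "keep going until the API returns an empty page".
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    page = 1
    written = 0
    while max_pages == 0 or page <= max_pages:
        items = get_page(page)  # hypothetical: one HTTP request per page
        if not items:
            break  # an empty page signals the end of the listing
        path = out_dir / f"page_{page:04d}.json"
        path.write_text(json.dumps(items, ensure_ascii=False), encoding="utf-8")
        written += 1
        page += 1
    return written


# Stand-in for the real API: three pages of fake vacancies, then nothing.
fake_api = {1: [{"id": 1}, {"id": 2}], 2: [{"id": 3}], 3: [{"id": 4}]}

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "run_demo"
    n_pages = fetch_all_pages(lambda p: fake_api.get(p, []), run_dir)
    saved_files = sorted(f.name for f in run_dir.glob("*.json"))
    print(n_pages, saved_files)
```

Writing one file per page keeps a failed run inspectable: whatever was fetched before the failure is already on disk.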

In [None]:
from src import fetch
timer = Timer("Fetch Data")

fetch.main()  # fetches all pages automatically (pages=0/"all" in params.yaml)
timer.stop()


## 🔹 Step 2: Prepare Dataset
Combine all JSON files from `data/raw/`, flatten the important columns, parse schedule and education level,
compute the derived columns (`competition_ratio`, `days_to_deadline`),
and extract skills using `config/skills.yaml`.
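
The exact formulas live in `src/prepare.py` and are not shown here. The sketch below illustrates one plausible reading of the two derived columns and of keyword-based skill extraction. It is kept dependency-free and works per record. The field names `applicants`, `quota`, and `deadline`, the sample description text, and the shape of the skill dictionary are assumptions, not the real schema or the contents of `config/skills.yaml`.

```python
import re
from datetime import date
from typing import Optional


def competition_ratio(applicants: int, quota: int) -> Optional[float]:
    """Applicants per open slot; None when the quota is missing or zero."""
    if quota <= 0:
        return None
    return applicants / quota


def days_to_deadline(deadline: date, today: date) -> int:
    """Whole days remaining until the deadline (negative once it has passed)."""
    return (deadline - today).days


def extract_skills(text: str, skill_keywords: dict) -> list:
    """Return every skill whose keywords appear as whole words in the text."""
    lowered = text.lower()
    found = []
    for skill, keywords in skill_keywords.items():
        if any(re.search(rf"\b{re.escape(k.lower())}\b", lowered) for k in keywords):
            found.append(skill)
    return found


# Hypothetical records and skill dictionary, for illustration only.
rows = [
    {"posisi": "Data Analyst", "applicants": 120, "quota": 4, "deadline": date(2025, 11, 30)},
    {"posisi": "Admin", "applicants": 10, "quota": 0, "deadline": date(2025, 11, 1)},
]
skill_keywords = {"sql": ["sql"], "python": ["python", "pandas"], "excel": ["excel", "spreadsheet"]}
today = date(2025, 11, 10)

for row in rows:
    row["competition_ratio"] = competition_ratio(row["applicants"], row["quota"])
    row["days_to_deadline"] = days_to_deadline(row["deadline"], today)

skills = extract_skills("Menguasai SQL dan Python untuk analisis data", skill_keywords)
print(rows, skills)
```

Returning `None` for a zero quota avoids a division error and leaves the decision on how to treat such vacancies to the scoring step.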

In [None]:
from src import prepare

timer = Timer("Prepare Data")
prepare.main()
timer.stop()

clean_path = ROOT / "data" / "clean" / "vacancies.parquet"
df_clean = pd.read_parquet(clean_path)
preview(df_clean)

## 🔹 Step 3: Compute Scores & Ranking
Use the weights from `config/weights.yaml` to produce the columns
`freshness_score`, `quota_score`, `competition_score`, and `priority_score`.
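
How `src/score.py` combines these columns is not shown in this notebook. A common approach, sketched here purely as an assumption, is to min-max normalize each component to [0, 1], invert the ones where lower is better, and take a weighted sum. The weight values, the sample numbers, and the choice to treat more days remaining as "fresher" are all hypothetical.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to 0.0 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical weights; the real ones are read from config/weights.yaml.
weights = {"freshness": 0.2, "quota": 0.2, "competition": 0.6}

# Three made-up vacancies.
days_to_deadline = [20, 5, 10]
quota = [4, 1, 3]
competition_ratio = [30.0, 2.0, 10.0]

freshness_score = min_max(days_to_deadline)  # assumption: more days left scores higher
quota_score = min_max(quota)                 # more open slots scores higher
competition_score = [1.0 - s for s in min_max(competition_ratio)]  # fewer applicants per slot scores higher

priority_score = [
    weights["freshness"] * f + weights["quota"] * q + weights["competition"] * c
    for f, q, c in zip(freshness_score, quota_score, competition_score)
]

# Indices of the vacancies from best to worst priority.
ranking = sorted(range(len(priority_score)), key=lambda i: priority_score[i], reverse=True)
print(priority_score, ranking)
```

Because every component is on the same 0 to 1 scale before weighting, each weight can be read directly as that component's share of the final score.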

In [None]:
from src import score

timer = Timer("Score Data")
score.main()
timer.stop()

scored_path = ROOT / "data" / "clean" / "vacancies_scored.parquet"
df = pd.read_parquet(scored_path)
display(Markdown("### 🎯 Top 10 Vacancies (by priority_score)"))
df.head(10)[["rank","posisi","nama_perusahaan","nama_provinsi","priority_score"]]

## 🔹 Step 4: Quick Summary Check
Check the total row count, the percentage of *data/analytics*-related vacancies, and a summary of the scores.

In [None]:
n_total = len(df)
n_data_related = df["is_data_related"].sum()
avg_score = df["priority_score"].mean()

display(Markdown(f"""
**Total vacancies:** {n_total:,}  
**Data-related vacancies:** {n_data_related:,} ({n_data_related/n_total*100:.1f}%)  
**Average opportunity score:** {avg_score:.2f}
"""))

## 🔹 Step 5: Save Daily Snapshot
This dataset will be used for trend analysis and visualization.

In [None]:
import datetime as dt

today = dt.date.today().strftime("%Y-%m-%d")
snap_path = ROOT / "output" / "tables" / f"vacancies_snapshot_{today}.csv"
ensure_dir(snap_path.parent)
df.to_csv(snap_path, index=False, encoding="utf-8-sig")
print(f"[SAVED] {snap_path}")


# ✅ Pipeline Complete!
The full dataset has been built 🎉

📂 Main outputs:
- `data/clean/vacancies.parquet`
- `data/clean/vacancies_scored.parquet`
- `output/tables/top_recommendations.xlsx`

Continue to **`02_analysis.ipynb`** for visualizations:
- Most crowded professions  
- Provinces with the most companies  
- Most frequently mentioned skills  
- Best opportunities (lowest applicant-to-quota ratio)
