# 03_text_topics (Project, with LDA per proposal)
Build a corpus of **project notices/news** (English + Nepali), derive daily **keyword features**, and run **topic modelling (LDA)** as specified in the proposal. Finally, merge the daily text features into the engineered daily feature table.

**Outputs**
- `project_notices.csv` (created here)
- `text_corpus.csv` (corpus used downstream)
- `topics_daily.csv` (daily keyword counts + flags)
- `lda_topics_daily.csv` (daily LDA topic proportions)
- `master_with_topics.csv` (merged table for modelling)


### Cell 1 — Setup & configuration

In [1]:
from datetime import date
from pathlib import Path

import numpy as np
import pandas as pd

# Project window
START_DATE = pd.Timestamp('2019-01-01')
END_DATE   = pd.Timestamp('2023-12-31')

# Files
PROJECT_NOTICES = 'project_notices.csv'   # created in Cell 2
TEXT_CORPUS_CSV = 'text_corpus.csv'       # used downstream
TOPICS_DAILY_CSV = 'topics_daily.csv'
LDA_TOPICS_DAILY_CSV = 'lda_topics_daily.csv'
MASTER_PATH = 'features_daily.csv'        # your engineered daily features
MASTER_WITH_TOPICS = 'master_with_topics.csv'

print('Outputs will be saved next to this notebook.')

Outputs will be saved next to this notebook.
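
The file names above are relative, so they resolve against the kernel's current working directory, which is the notebook's folder only when the kernel was started there. A minimal sketch of making the output location explicit with `pathlib`; `OUT_DIR` and `out_path` are hypothetical names introduced here, not part of the notebook:

```python
from pathlib import Path

# Hypothetical: pick the output folder explicitly instead of relying on the cwd
OUT_DIR = Path.cwd()

def out_path(name: str) -> Path:
    """Return the full path of an output file inside OUT_DIR."""
    return OUT_DIR / name

topics_daily_path = out_path("topics_daily.csv")
print(topics_daily_path.name)           # topics_daily.csv
print(topics_daily_path.is_absolute())  # True, because Path.cwd() is absolute
```

Passing such a path to `to_csv` or `read_csv` works the same way as passing the bare file name.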


### Cell 2 — Build **project notices/news** corpus (2019–2023)

In [2]:
rows = []

def add(d, title, text):
    rows.append({'date': d.isoformat(), 'title': title, 'text': text})

# Phrase banks for varied but relevant content
flood_lines = [
    "DHM alert: Heavy rain (मुसलधारे वर्षा) in Syangja/Kaski; high flow/flood (बाढी) risk along Kali Gandaki; warning (चेतावनी) issued.",
    "Flood advisory: Tribeni rainfall > 80 mm/24h; downstream inundation (डुबान) risk; maintain spillway readiness.",
    "River discharge rising rapidly; landslide (पहिरो) reported upstream; public सूचना to avoid river banks."
]
maint_lines = [
    "Scheduled maintenance (सम्भार/मर्मत) of Kaligandaki A Unit {u}; partial shutdown (बन्द) expected; generation impact 40–70 MW.",
    "Gate inspection and lubrication; short outage (विद्युत अवरोध) in afternoon window; SCADA checks.",
    "Sediment flushing and desilting at intake; debris removal near trash rack; safety protocol active."
]
outage_lines = [
    "Unplanned outage due to transformer trip; restoration in 2 hours; load shedding avoided via import/dispatch.",
    "Breaker failure at switchyard; protection operated; controlled restart procedure underway.",
    "Transient fault cleared; unit re-synchronized; monitoring vibration and temperature."
]
policy_lines = [
    "NEA policy (नीति) update on seasonal tariff and curtailment; provisions for import (आयात) in dry months, export (निर्यात) in surplus.",
    "Directive (निर्देशन) on load management during peak hours; dispatch priority for run-of-river vs reservoir plants.",
    "Regulation (विनियमन) note: dam safety inspection schedule and reporting requirements."
]
weather_lines = [
    "Weather watch: thunderstorm (मेघगर्जन) and strong winds (हावाहुरी) possible; short-duration heavy rain may spike inflow.",
    "Monsoon onset declared; persistent precipitation (पानी) expected for 3–5 days; monitor reservoir level (जलस्तर).",
    "Monsoon withdrawal likely next week; decreasing rainfall trend; plan reservoir refill strategy."
]
ops_lines = [
    "Dry season operation: low river discharge; reservoir near minimum level; prioritize peak-hour dispatch.",
    "Controlled spilling initiated due to high inflow; downstream चेतावनी जारी; avoid river banks.",
    "Turbine efficiency recovery plan after overhaul; performance test scheduled."
]

for y in range(2019, 2024):
    # Dry-season ops
    add(date(y,1,21), "Dry season operation notice – low discharge", ops_lines[0])
    add(date(y,2,10), "Reservoir refill & sediment management", "Reservoir refill plan; sediment (सिल्ट) monitoring; intake trash rack cleaning.")

    # Pre-monsoon maintenance (two windows)
    add(date(y,3,12), "Planned maintenance outage at Kaligandaki A – Unit 1", maint_lines[0].format(u=1))
    add(date(y,4,18), "Planned maintenance outage at Kaligandaki A – Unit 2", maint_lines[0].format(u=2))
    add(date(y,5,9),  "Gate inspection & lubrication", maint_lines[1])

    # Monsoon flood & weather alerts
    add(date(y,6,25), "DHM flood/weather bulletin – Gandaki basin", flood_lines[0])
    add(date(y,7,10), "Flood advisory – Tribeni gauge rising", flood_lines[1])
    add(date(y,7,25), "High silt load expected during storm", maint_lines[2])
    add(date(y,8,1),  "Weather watch – thunderstorm activity", weather_lines[0])
    add(date(y,8,18), "Flood advisory – controlled spilling at Kaligandaki A", ops_lines[1])

    # Unplanned outages
    add(date(y,9,5),  "Unplanned outage – transformer trip", outage_lines[0])
    add(date(y,10,7), "Breaker failure at switchyard", outage_lines[1])

    # Policy/Directive updates
    add(date(y,11,2), "Policy update: seasonal generation & import management", policy_lines[0])
    add(date(y,12,15),"Directive on peak load management", policy_lines[1])

# Extra varied events
extras = [
    (date(2020,7,20), "Landslide blocking headrace – debris removal",
     "Emergency: landslide (पहिरो) induced debris near intake; partial shutdown (बन्द) for 6 hours; rapid response team deployed."),
    (date(2021,10,3), "Gandaki basin weather watch – thunderstorm", weather_lines[0]),
    (date(2022,5,15), "Planned turbine overhaul – Unit 2",
     "Major overhaul (मर्मत) for efficiency recovery; expected completion 7 days; generation reduced by 70 MW."),
    (date(2023,6,12), "Monsoon onset – rainfall increase expected", weather_lines[1]),
]
for d0, t, txt in extras:
    add(d0, t, txt)

project_df = pd.DataFrame(rows).sort_values('date').reset_index(drop=True)
project_df.to_csv(PROJECT_NOTICES, index=False)
print('Saved', PROJECT_NOTICES, '| rows =', len(project_df))
project_df.head(8)

Saved project_notices.csv | rows = 74


| | date | title | text |
|---|---|---|---|
| 0 | 2019-01-21 | Dry season operation notice – low discharge | Dry season operation: low river discharge; res... |
| 1 | 2019-02-10 | Reservoir refill & sediment management | Reservoir refill plan; sediment (सिल्ट) monito... |
| 2 | 2019-03-12 | Planned maintenance outage at Kaligandaki A – ... | Scheduled maintenance (सम्भार/मर्मत) of Kaliga... |
| 3 | 2019-04-18 | Planned maintenance outage at Kaligandaki A – ... | Scheduled maintenance (सम्भार/मर्मत) of Kaliga... |
| 4 | 2019-05-09 | Gate inspection & lubrication | Gate inspection and lubrication; short outage ... |
| 5 | 2019-06-25 | DHM flood/weather bulletin – Gandaki basin | DHM alert: Heavy rain (मुसलधारे वर्षा) in Syan... |
| 6 | 2019-07-10 | Flood advisory – Tribeni gauge rising | Flood advisory: Tribeni rainfall > 80 mm/24h; ... |
| 7 | 2019-07-25 | High silt load expected during storm | Sediment flushing and desilting at intake; deb... |
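
The row count of 74 can be checked by hand: the loop adds 14 notices for each of the 5 years, and `extras` adds 4 more. A small sketch of that check, using only the month/day pairs and extra dates from the cell above (`yearly_slots` and `extra_dates` are names introduced here for illustration):

```python
from datetime import date

# (month, day) of the 14 notices added for every year in the loop
yearly_slots = [(1, 21), (2, 10), (3, 12), (4, 18), (5, 9), (6, 25), (7, 10),
                (7, 25), (8, 1), (8, 18), (9, 5), (10, 7), (11, 2), (12, 15)]
# Dates of the four one-off events in `extras`
extra_dates = [date(2020, 7, 20), date(2021, 10, 3), date(2022, 5, 15), date(2023, 6, 12)]

all_dates = [date(y, m, d) for y in range(2019, 2024) for m, d in yearly_slots]
all_dates += extra_dates

print(len(all_dates))       # 74
print(len(set(all_dates)))  # 74, so every notice falls on its own day
```

Because every date is unique, the later per-day aggregations keep one row per notice, which is why the daily files also report 74 rows.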


### Cell 3 — Select project window and save `text_corpus.csv`

In [3]:
corpus_df = pd.read_csv(PROJECT_NOTICES, parse_dates=['date'])
corpus_df = corpus_df[(corpus_df['date']>=START_DATE) & (corpus_df['date']<=END_DATE)]
corpus_df = corpus_df.sort_values('date').reset_index(drop=True)
corpus_df.to_csv(TEXT_CORPUS_CSV, index=False)
print('Saved', TEXT_CORPUS_CSV, '| rows =', len(corpus_df))
corpus_df.head(5)

Saved text_corpus.csv | rows = 74


| | date | title | text |
|---|---|---|---|
| 0 | 2019-01-21 | Dry season operation notice – low discharge | Dry season operation: low river discharge; res... |
| 1 | 2019-02-10 | Reservoir refill & sediment management | Reservoir refill plan; sediment (सिल्ट) monito... |
| 2 | 2019-03-12 | Planned maintenance outage at Kaligandaki A – ... | Scheduled maintenance (सम्भार/मर्मत) of Kaliga... |
| 3 | 2019-04-18 | Planned maintenance outage at Kaligandaki A – ... | Scheduled maintenance (सम्भार/मर्मत) of Kaliga... |
| 4 | 2019-05-09 | Gate inspection & lubrication | Gate inspection and lubrication; short outage ... |


### Cell 4 — Build daily keyword features → `topics_daily.csv`

In [4]:
KEYWORDS = {
    "maintenance": ["maintenance","overhaul","shutdown","servicing","repair","gate inspection","desilting","trash rack",
                    "सम्भार","मर्मत","सम्भाल","बन्द","नियमित जाँच","सिल्ट","सिल्टेशन"],
    "outage":      ["outage","blackout","interruption","trip","fault","breaker","protection","resynchroniz",
                    "विद्युत अवरोध","लोडसेडिङ","बत्ती बन्द","लोड व्यवस्थापन"],
    "flood":       ["flood","high flow","inundation","landslide","debris","spill","spilling","downstream",
                    "बाढी","पहिरो","डुबान","जोखिम","चेतावनी","सूचना"],
    "policy":      ["policy","tariff","regulation","directive","order","curtail","import","export","dispatch",
                    "नीति","दर","विनियमन","निर्देशन","आदेश","आयात","निर्यात"],
    "weather":     ["heavy rain","thunder","storm","wind","monsoon","precipitation","rainfall","forecast",
                    "मुसलधारे वर्षा","मेघगर्जन","हावाहुरी","मौसम","पानी"],
}

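def kw_counts(text: str):
    """Count keyword hits per category in `text` (case-insensitive).

    Matching is by plain substring, so 'wind' also matches 'window', and
    'thunderstorm' is counted under both 'thunder' and 'storm'.
    """
    t = (text or "").lower()
    out = {k: 0 for k in KEYWORDS}
    for k, words in KEYWORDS.items():
        # Category count = total substring hits over all of its keywords
        out[k] = sum(t.count(w.lower()) for w in words)
    return out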

rows = []
for _, r in corpus_df.iterrows():
    c = kw_counts(f"{r.get('title','')} {r.get('text','')}")
    c["date"] = r["date"].date()
    rows.append(c)

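daily_kw = pd.DataFrame(rows)
daily_kw["date"] = pd.to_datetime(daily_kw["date"])

# Sum counts per day first, then derive the 0/1 flags, so a day with several
# notices still gets a flag of 1 rather than a sum of per-notice flags
daily_kw = daily_kw.groupby("date", as_index=False).sum().sort_values("date")
for k in KEYWORDS:
    daily_kw[f"{k}_flag"] = (daily_kw[k] > 0).astype(int)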
daily_kw.to_csv(TOPICS_DAILY_CSV, index=False)
print("Saved", TOPICS_DAILY_CSV, "| rows =", len(daily_kw))
daily_kw.head(10)

Saved topics_daily.csv | rows = 74


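| | date | maintenance | outage | flood | policy | weather | maintenance_flag | outage_flag | flood_flag | policy_flag | weather_flag |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-21 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 2019-02-10 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2019-03-12 | 6 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | 2019-04-18 | 6 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
| 4 | 2019-05-09 | 2 | 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 5 | 2019-06-25 | 0 | 0 | 5 | 0 | 2 | 0 | 0 | 1 | 0 | 1 |
| 6 | 2019-07-10 | 0 | 0 | 6 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 7 | 2019-07-25 | 2 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 8 | 2019-08-01 | 0 | 0 | 0 | 0 | 8 | 0 | 0 | 0 | 0 | 1 |
| 9 | 2019-08-18 | 0 | 0 | 7 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |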
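
Some counts in this table come from substring matching rather than from the intended word. The 2019-05-09 gate-inspection notice gets `weather = 1` because "wind" matches "window", and the 2019-08-01 notice reaches `weather = 8` partly because "thunderstorm" is counted under both "thunder" and "storm". A minimal sketch of a stricter matcher, assuming whole-word matching with an optional plural "s" is acceptable for the English keywords; `count_keyword` is a hypothetical helper, not part of the notebook:

```python
import re

def count_keyword(text: str, keyword: str) -> int:
    """Count whole-word hits for ASCII keywords; plain substring hits otherwise."""
    t, kw = text.lower(), keyword.lower()
    if kw.isascii():
        # \b anchors both ends; the optional "s" keeps "winds" but drops "window"
        return len(re.findall(rf"\b{re.escape(kw)}s?\b", t))
    # Devanagari keywords: \b is unreliable around combining vowel signs,
    # so this sketch keeps the original substring behaviour for them
    return t.count(kw)

text = "Gate inspection; short outage in afternoon window; strong winds (हावाहुरी) possible."
print(count_keyword(text, "wind"))      # 1 ("winds" only, not "window")
print(count_keyword(text, "हावाहुरी"))   # 1
print(text.lower().count("wind"))       # 2 (the original substring count)
```

The trade-off is that stem-style entries such as "resynchroniz" or "curtail" would stop matching their inflected forms, so they would need to be listed in full or handled by a separate prefix rule.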


### Cell 5 — Topic modelling (LDA) per proposal → `lda_topics_daily.csv`

In [5]:
# LDA on (title + text). We use English stopwords + a small Nepali stoplist.
# If scikit-learn is missing, install it in your environment before running this cell.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Basic Nepali stoplist (extend as needed)
nep_stop = [
    "छ","देखि","र","को","मा","का","को लागि","यस","भएको","गरिएका","गरिने","सम्बन्धी",
    "सूचना","चेतावनी","प्रेस","विज्ञप्ति","लगायत","तथा","प्रति","नि"
]

def build_corpus_text(df):
    return (df["title"].fillna("") + " " + df["text"].fillna("")).tolist()

def make_vectorizer():
    # Include bigrams to capture phrases like "heavy rain", "load shedding"
    return CountVectorizer(
        max_df=0.95,
        min_df=1,
        ngram_range=(1,2),
        stop_words='english'  # English stopwords
    )

texts = build_corpus_text(corpus_df)
vec = make_vectorizer()
X = vec.fit_transform(texts)

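# Drop Nepali stop tokens post-hoc by removing their vocabulary columns.
# Only exact unigram matches are removed; bigrams containing a stop token stay, and the
# default tokenizer splits Devanagari words at combining signs, so most nep_stop entries never match.
# (Optional refinement: custom tokenizer that drops Nepali stopwords before vectorization)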
vocab = np.array(vec.get_feature_names_out())
mask_keep = ~np.isin(vocab, nep_stop)
X = X[:, mask_keep]
vocab = vocab[mask_keep]

# Fit LDA (choose topics K=5; adjust if you want finer granularity)
K = 5
lda = LatentDirichletAllocation(n_components=K, random_state=42, learning_method='online')
W = lda.fit_transform(X)   # doc-topic weights (rows = documents)

# Attach doc-topic weights back to corpus
corpus_lda = corpus_df.copy()
for k in range(K):
    corpus_lda[f"topic{k}"] = W[:, k]

# Aggregate to daily mean topic proportions
topic_cols = [f"topic{k}" for k in range(K)]
topics_daily = corpus_lda.groupby("date")[topic_cols].mean().reset_index().sort_values("date")
topics_daily.to_csv(LDA_TOPICS_DAILY_CSV, index=False)
print("Saved", LDA_TOPICS_DAILY_CSV, "| rows =", len(topics_daily))

# Show top words per topic for interpretation
def top_words(component, vocab, n=10):
    idx = component.argsort()[-n:][::-1]
    return [vocab[i] for i in idx]

topic_words = {f"topic{k}": top_words(lda.components_[k], vocab, n=12) for k in range(K)}
print("Top words per topic:")
for k, words in topic_words.items():
    print(k, "→", ", ".join(words))

Saved lda_topics_daily.csv | rows = 74
Top words per topic:
topic0 → reservoir refill, reservoir, sediment, refill, intake, plan, management, rack cleaning, trash, intake trash, management reservoir, trash rack
topic1 → low, operation, dry season, discharge, dry, season operation, season, flood, gandaki, dhm, near, near minimum
topic2 → kaligandaki, unit, maintenance, kaligandaki unit, expected, controlled, 70 mw, मत, planned, 70, मर, mw
topic3 → management, gate, advisory tribeni, policy, flood, gate inspection, lubrication, import, tribeni, peak, update, update seasonal
topic4 → thunderstorm, watch, weather, watch thunderstorm, weather watch, unplanned outage, transformer, outage, trip, unplanned, transformer trip, outage transformer
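
The `topic2` list contains the fragments "मत" and "मर", which are the two halves of "मर्मत". `CountVectorizer`'s default `token_pattern` is `(?u)\b\w\w+\b`, and in Python's `re` module `\w` does not match Devanagari combining signs such as the virama, so words are split at those characters. A minimal sketch of a pattern that keeps Devanagari words whole, shown with `re` only; `DEVANAGARI_AWARE` is an assumption introduced here, not the notebook's setting:

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn's documented default
# Assumption: treat any run of characters from the Devanagari block (U+0900 to U+097F)
# as one token, and keep ASCII words of two or more characters as before
DEVANAGARI_AWARE = r"[\u0900-\u097F]+|[A-Za-z0-9_]{2,}"

text = "Scheduled maintenance (सम्भार/मर्मत) of Unit 1"

# Default pattern: Devanagari words come out as fragments such as 'मर' and 'मत'
print(re.findall(DEFAULT_PATTERN, text))
# Block-based pattern: whole words survive
print(re.findall(DEVANAGARI_AWARE, text))
# ['Scheduled', 'maintenance', 'सम्भार', 'मर्मत', 'of', 'Unit']
```

A pattern like this could be passed to `CountVectorizer` through its `token_pattern` parameter, after which single-word entries of `nep_stop` would line up with real vocabulary terms. Note that the Devanagari block also contains punctuation such as the danda, which this sketch does not filter out.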


### Cell 6 — Merge daily keywords + LDA with engineered features → `master_with_topics.csv`

In [6]:
master = pd.read_csv(MASTER_PATH, parse_dates=["date"])
kw = pd.read_csv(TOPICS_DAILY_CSV, parse_dates=["date"])
lda_daily = pd.read_csv(LDA_TOPICS_DAILY_CSV, parse_dates=["date"]) if Path(LDA_TOPICS_DAILY_CSV).exists() else pd.DataFrame(columns=["date"])

out = master.merge(kw, on="date", how="left")
if not lda_daily.empty:
    out = out.merge(lda_daily, on="date", how="left")

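# Fill new cols: 0 for keyword counts/flags; for topic columns, the column mean taken over
# the days that had a notice (note: this mean is computed from the full 2019–2023 series)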
for c in out.columns:
    if c not in master.columns and c != "date":
        if c.startswith("topic"):
            out[c] = out[c].fillna(out[c].mean())
        else:
            out[c] = out[c].fillna(0)

out.to_csv(MASTER_WITH_TOPICS, index=False)
print("Saved", MASTER_WITH_TOPICS, "| rows:", len(out), "| cols:", len(out.columns))
out.head(5)

Saved master_with_topics.csv | rows: 1796 | cols: 81


| | date | discharge_m3s | reservoir_m | rainfall_mm | load_MW | avg_load_mw | energy_mwh | year | doy | dow | ... | maintenance_flag | outage_flag | flood_flag | policy_flag | weather_flag | topic0 | topic1 | topic2 | topic3 | topic4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019-01-31 | 120.0 | 0.64 | 0.0 | 1428.76 | 946.092065 | 28941.248 | 2019.0 | 31 | 3 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071865 | 0.13978 | 0.341538 | 0.28139 | 0.165426 |
| 1 | 2019-02-01 | 132.0 | 0.72 | 0.0 | 1437.155 | 970.446968 | 29686.272 | 2019.0 | 32 | 4 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071865 | 0.13978 | 0.341538 | 0.28139 | 0.165426 |
| 2 | 2019-02-02 | 131.0 | 0.71 | 0.0 | 1395.755 | 967.499629 | 29596.112 | 2019.0 | 33 | 5 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071865 | 0.13978 | 0.341538 | 0.28139 | 0.165426 |
| 3 | 2019-02-03 | 131.0 | 0.71 | 0.0 | 1352.055 | 926.9545 | 28355.824 | 2019.0 | 34 | 6 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071865 | 0.13978 | 0.341538 | 0.28139 | 0.165426 |
| 4 | 2019-02-04 | 134.0 | 0.73 | 0.0 | 1373.905 | 918.874032 | 28108.64 | 2019.0 | 35 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.071865 | 0.13978 | 0.341538 | 0.28139 | 0.165426 |
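
Every row in this preview carries the same topic values because days without a notice receive the column mean, and that mean is computed over the whole series, including later dates. One possible alternative, sketched here on toy data, is to fill no-notice days with 0 and add an indicator column so a model can tell real topic proportions from filled ones. The column `has_notice` and the toy values are assumptions for illustration only:

```python
import pandas as pd

# Toy stand-ins for the engineered features and the daily LDA table
master = pd.DataFrame({"date": pd.date_range("2019-01-31", periods=4, freq="D"),
                       "load_MW": [1428.8, 1437.2, 1395.8, 1352.1]})
lda_daily = pd.DataFrame({"date": pd.to_datetime(["2019-02-01"]),
                          "topic0": [0.7], "topic1": [0.3]})

out = master.merge(lda_daily, on="date", how="left")
topic_cols = [c for c in out.columns if c.startswith("topic")]

# Hypothetical indicator: 1 on days that had at least one notice, else 0
out["has_notice"] = out[topic_cols].notna().any(axis=1).astype(int)
# No notice means no topic signal, so fill with 0 instead of a full-series mean
out[topic_cols] = out[topic_cols].fillna(0.0)

print(out[["date", "topic0", "topic1", "has_notice"]])
```

Whether a zero fill or a mean fill suits the downstream model is a modelling choice; the zero fill with an indicator has the advantage of not using information from future dates.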
