# Dry Spell Warning: Starter Notebook (Baseline)

This notebook shows how to:

1. Load a few provided feature files
2. Load the training labels (`solution_train.csv`)
3. Train a simple Logistic Regression baseline
4. Generate `submission.csv` and `submission.zip`


In [33]:
from pathlib import Path
import pandas as pd
import numpy as np

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

DATA_DIR = Path("../public_data/input_data")
TRAIN_DIR = DATA_DIR / "train"
TEST_DIR  = DATA_DIR / "test"

LABEL_COL = "dryspell_warn_7d"   # target label column in solution_train.csv
DATE_COL  = "date"              # merge key


## 1) Load feature files + labels (train)

We will load:
- `sea level pressure era5_train.csv`
- `temp_trimmed_train.csv`
- `vapour pressure deficit_train.csv`
- `solution_train.csv`

Then merge them on `date`.

After downloading the data from the link in the `data` tab, you can run the following:


In [None]:
def load_csv(path: Path) -> pd.DataFrame:
    df = pd.read_csv(path)
    df[DATE_COL] = pd.to_datetime(df[DATE_COL])
    return df

slp = load_csv(TRAIN_DIR / "sea level pressure era5_train.csv")
tmp = load_csv(TRAIN_DIR / "temp_trimmed_train.csv")
vpd = load_csv(TRAIN_DIR / "vapour pressure deficit_train.csv")
y   = load_csv(TRAIN_DIR / "solution_train.csv")

print("SLP columns:", slp.columns.tolist())
print("TEMP columns:", tmp.columns.tolist())
print("VPD columns:", vpd.columns.tolist())
print("Y columns:", y.columns.tolist())

df_train = y.merge(slp, on=DATE_COL, how="inner") \
            .merge(tmp, on=DATE_COL, how="inner") \
            .merge(vpd, on=DATE_COL, how="inner")

df_train = df_train.sort_values(DATE_COL).reset_index(drop=True)
df_train.head()
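
An inner merge keeps only the dates present in every file, so a gap in one feature file silently drops training rows. A minimal sketch of how one might check for this, using tiny made-up frames (the `slp` column and all values are illustrative assumptions, not the real file schema):

```python
import pandas as pd

# Toy frames standing in for the label file and one feature file (values are made up).
labels = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "dryspell_warn_7d": [0, 1, 0],
})
feats = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04"]),
    "slp": [1012.0, 1008.5, 1010.2],
})

# validate="one_to_one" raises a MergeError if either side has duplicate dates.
merged = labels.merge(feats, on="date", how="inner", validate="one_to_one")

# Dates that have a label but no feature row are dropped by the inner join.
dropped = labels.loc[~labels["date"].isin(feats["date"]), "date"]

print("Rows before:", len(labels), "after:", len(merged))
print("Label dates without features:", dropped.dt.strftime("%Y-%m-%d").tolist())
```

Comparing row counts before and after each merge is a cheap way to notice when a feature file covers fewer dates than the labels.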


## 2) Build X/y

We drop the `date` and label columns and keep the remaining columns as numeric features.


In [None]:
y_train = df_train[LABEL_COL].astype(int).values

drop_cols = {DATE_COL, LABEL_COL}
feature_cols = [c for c in df_train.columns if c not in drop_cols]

X_train = df_train[feature_cols].copy()

print("Train rows:", len(X_train))
print("Num features:", len(feature_cols))
print("Positive rate:", float(y_train.mean()))
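
The step above assumes every remaining column is numeric. A small sketch of how that assumption could be checked before training; the frame and the stray `station` text column are hypothetical and only serve the illustration:

```python
import pandas as pd

# Made-up merged frame; "station" is a hypothetical stray text column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "dryspell_warn_7d": [0, 1],
    "slp": [1012.0, 1008.5],
    "station": ["A", "B"],
})

drop_cols = {"date", "dryspell_warn_7d"}
feature_cols = [c for c in df.columns if c not in drop_cols]

# List the feature columns whose dtype is not numeric.
non_numeric = df[feature_cols].select_dtypes(exclude="number").columns.tolist()
print("Non-numeric feature columns:", non_numeric)
```

If the list is non-empty, those columns would need to be dropped or encoded before fitting a linear model.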


## 3) Train a simple Logistic Regression baseline

- Impute missing values with median
- Standardize features
- Fit Logistic Regression

This is intentionally simple and meant only as a baseline.


In [None]:
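
# Median imputation -> standardization -> logistic regression, as listed above
clf = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=2000)),
])

clf.fit(X_train, y_train)
print("Done training.")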
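
To see what the imputation and standardization steps listed above do to the data before the classifier sees it, here is a small sketch on made-up numbers (the array is purely illustrative and unrelated to the competition files):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix with one missing value in each column.
X = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [np.nan, 300.0],
    [4.0, 400.0],
])

# Median imputation: each NaN is replaced by the median of its column.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# Standardization: each column is rescaled to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

print(X_imputed)
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

Scaling matters here because features such as pressure and vapour pressure deficit live on very different numeric ranges, which can slow down or distort a regularized linear fit.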


## 4) Load test features and generate predictions

We load the matching test feature files, merge on `date`, and predict.

We output `dryspell_warn_7d` as **0/1** by thresholding probabilities at 0.5.
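
As a quick illustration of the thresholding rule, with made-up probabilities:

```python
import numpy as np

# Made-up predicted probabilities for the positive class.
proba = np.array([0.10, 0.50, 0.73, 0.49])

# ">=" means a probability of exactly 0.5 is labelled 1.
pred = (proba >= 0.5).astype(int)
print(pred.tolist())  # [0, 1, 1, 0]
```

The cutoff of 0.5 is a default rather than a tuned value; a different threshold may suit the data better if the positive class is rare.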


In [37]:
slp_t = load_csv(TEST_DIR / "sea level pressure era5_test.csv")
tmp_t = load_csv(TEST_DIR / "temp_trimmed_test.csv")
vpd_t = load_csv(TEST_DIR / "vapour pressure deficit_test.csv")

dfs = [slp_t, tmp_t, vpd_t]

In [None]:
from functools import reduce
dfs = [d.groupby(DATE_COL, as_index=False).mean(numeric_only=True) for d in dfs]

# Union of all dates across files
df_test = reduce(lambda left, right: left.merge(right, on=DATE_COL, how="outer"), dfs)

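# Fill gaps with training medians (0 is far outside the range of features such as pressure)
df_test[feature_cols] = df_test[feature_cols].fillna(X_train.median(numeric_only=True))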

# Sort + clean
df_test[DATE_COL] = pd.to_datetime(df_test[DATE_COL])
df_test = df_test.sort_values(DATE_COL).reset_index(drop=True)

X_test = df_test[feature_cols].copy()

proba = clf.predict_proba(X_test)[:, 1]

pred = (proba >= 0.5).astype(int)

submission = pd.DataFrame({DATE_COL: df_test[DATE_COL], LABEL_COL: pred})
submission.head()
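
The test cell averages duplicate dates and then takes the union of dates across files. A minimal sketch of what those two steps do, on made-up frames (the `slp` and `vpd` column names and values are assumptions for illustration):

```python
import pandas as pd

# Made-up test files: `a` has a duplicated date, `b` is missing one date.
a = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-01-02"]),
    "slp": [1000.0, 1010.0, 1005.0],
})
b = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01"]),
    "vpd": [0.8],
})

# Collapse duplicate dates to one row each by averaging.
a = a.groupby("date", as_index=False).mean(numeric_only=True)

# Outer merge keeps every date; features absent from a file become NaN.
merged = a.merge(b, on="date", how="outer").sort_values("date").reset_index(drop=True)

print(merged)
print("Missing values per column:", merged.isna().sum().to_dict())
```

The outer join guarantees one prediction row per test date, at the cost of missing values that have to be filled before calling the model.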


## 5) Save `submission.csv` and `submission.zip`

The zip file is what you upload to Codabench (result-upload submission).


In [None]:
out_path = Path("submission.csv")
submission.to_csv(out_path, index=False)
print("Saved:", out_path.resolve())


In [None]:
import zipfile
from pathlib import Path
out_path = Path("submission.csv")
zip_path = Path("submission.zip")
with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    # Put submission.csv at the ZIP root (no folders)
    zf.write(out_path, arcname="submission.csv")

print("Zipped:", zip_path.resolve())
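
Before uploading, it can help to read the archive back and confirm its layout. The sketch below uses a temporary directory and a stand-in CSV so that no real file is overwritten; it assumes, as the comment in the cell above suggests, that the CSV should sit at the ZIP root:

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary directory with a stand-in CSV (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "submission.csv"
    csv_path.write_text("date,dryspell_warn_7d\n2021-01-01,0\n")

    zip_path = Path(tmp) / "submission.zip"
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(csv_path, arcname="submission.csv")

    # Read the archive back: it should hold exactly one entry, at the root.
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
        header = zf.read("submission.csv").decode().splitlines()[0]

print(names, header)
```

The same read-back check can be pointed at the real `submission.zip` by replacing the temporary paths.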
