In [None]:
import pathlib

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import compose, dummy, impute, metrics, pipeline, preprocessing


DATA_DIR = pathlib.Path("/kaggle/input/rainfall-probability-cs-209-spring-2026")
RANDOM_STATE = np.random.RandomState(42)


# Load the Data

In [None]:
%%bash

ls /kaggle/input/rainfall-probability-cs-209-spring-2026

In [None]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/train.csv | head -n 5

In [None]:
label_name = "rainfall"

train_df = pd.read_csv(
    DATA_DIR / "train.csv",
    index_col="id",
)
train_features_df = train_df.drop(label_name, axis="columns")
train_labels = train_df.loc[:, label_name]

In [None]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/test.csv | head -n 5

In [None]:
test_features_df = pd.read_csv(
    DATA_DIR / "test.csv",
    index_col="id",
)

# Prepare the Data for ML

## Create Data Preparation Pipelines

The `pipeline.Pipeline` class in **Scikit-Learn** is a **tool for chaining multiple data processing and modeling steps together** into a single object. Its main purpose is to **streamline preprocessing and model training**, while **preventing data leakage** during cross-validation or testing.

### Key points:

1. **Sequence of steps:**
   Each step has a name and a transformer or estimator:

   ```python
   from sklearn import impute, linear_model, pipeline, preprocessing

   ml_pipeline = pipeline.Pipeline([
       ("simple_imputer", impute.SimpleImputer()),              # preprocessing step 1
       ("standard_scaler", preprocessing.StandardScaler()),     # preprocessing step 2
       ("linear_regression", linear_model.LinearRegression()),  # final estimator
   ])
   ```

2. **Fit and transform automatically:**

   * `ml_pipeline.fit(X_train, y_train)` applies all transformations in order using each transformer's `fit_transform` method and then trains the final estimator using the `fit` method.
   * `ml_pipeline.predict(X_test)` applies the same transformations in order to new data using each transformer's `transform` method and then uses the final estimator's `predict` method to make predictions.

3. **Prevents leakage during cross-validation:**
   When used with cross-validation routines such as `model_selection.cross_val_score` or `model_selection.GridSearchCV`, the entire `pipeline.Pipeline` is refit on the training folds of each split, so preprocessing statistics (means, scales, most frequent values) are never computed from the validation fold.

4. **Hyperparameter tuning parameter naming convention:**
   You can tune parameters of any step in the pipeline with `model_selection.GridSearchCV` (or similar routine) using the syntax `"step_name__parameter"`.
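
The points above can be sketched on a small synthetic regression problem. The data, the random seed, and the parameter grid below are assumptions made purely for illustration; they are not part of the rainfall dataset.

```python
import numpy as np
from sklearn import impute, linear_model, model_selection, pipeline, preprocessing

# Synthetic data (an assumption for illustration): 60 rows, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)
X[rng.integers(0, 60, size=5), 0] = np.nan  # knock out a few values in column 0

ml_pipeline = pipeline.Pipeline([
    ("simple_imputer", impute.SimpleImputer()),
    ("standard_scaler", preprocessing.StandardScaler()),
    ("linear_regression", linear_model.LinearRegression()),
])

# Each of the 5 splits refits the whole pipeline on the training folds only.
cv_scores = model_selection.cross_val_score(ml_pipeline, X, y, cv=5)

# "step_name__parameter" addresses a parameter of a named step.
grid_search = model_selection.GridSearchCV(
    ml_pipeline,
    param_grid={"simple_imputer__strategy": ["mean", "median"]},
    cv=5,
)
grid_search.fit(X, y)

print(cv_scores.mean())
print(grid_search.best_params_)
```

Because the imputer and scaler live inside the pipeline, there is no separate preprocessing step that could accidentally be fit on the full dataset.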


In [None]:
pipeline.Pipeline?

## Categorical Features


### Handling Missing Values

`impute.SimpleImputer` is a preprocessing tool that **fills in missing values** in a dataset.

* During `.fit()`, it **learns a replacement value** from the training data (e.g., the **mean**, **median**, or **most frequent** value in each column).
* During `.transform()`, it **replaces missing entries** (like `NaN`) using those learned values.

It’s commonly used inside a `pipeline.Pipeline` to avoid data leakage and ensure consistent preprocessing during cross-validation.
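
A minimal sketch of the fit/transform split, using a made-up single-column array (the values are assumptions chosen so the mean is easy to check by hand):

```python
import numpy as np
from sklearn import impute

X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_new = np.array([[np.nan], [10.0]])

imputer = impute.SimpleImputer(strategy="mean")
imputer.fit(X_train)  # learns the mean of the observed values: (1 + 3 + 5) / 3 = 3

# The missing entry in the new data is filled with the value learned from X_train.
X_new_filled = imputer.transform(X_new)

print(imputer.statistics_)
print(X_new_filled)
```

The replacement value comes from the training data only; the non-missing `10.0` in `X_new` has no influence on it.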


In [None]:
impute.SimpleImputer?

### Encoding Ordered Categories

`preprocessing.OrdinalEncoder` converts **categorical features** (strings or numbers) into **integer codes**, which are stored as floats by default.

For example:

* `"red", "green", "blue"`
  → `2, 1, 0` (by default the categories are sorted, so `blue` → 0, `green` → 1, `red` → 2; pass `categories=` to set the order explicitly)

It assigns each category in each feature an **integer index** based on what it sees during `fit()`.

By mapping categories to integers, `preprocessing.OrdinalEncoder` implies a numerical ordering on the original categories (e.g., `blue < green < red`). It’s often best for:

* **tree-based models**, or
* categorical variables that are **truly ordered**.

For unordered categories, `preprocessing.OneHotEncoder` is usually safer.
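
The following sketch contrasts the default (sorted) ordering with an explicitly supplied one. The color column is a toy example, not a feature of the rainfall data.

```python
from sklearn import preprocessing

colors = [["red"], ["green"], ["blue"], ["green"]]

# Default: categories are sorted, so blue -> 0, green -> 1, red -> 2.
default_encoder = preprocessing.OrdinalEncoder()
default_codes = default_encoder.fit_transform(colors)

# Explicit order: red -> 0, green -> 1, blue -> 2.
explicit_encoder = preprocessing.OrdinalEncoder(
    categories=[["red", "green", "blue"]],
)
explicit_codes = explicit_encoder.fit_transform(colors)

print(default_encoder.categories_)
print(default_codes.ravel())
print(explicit_codes.ravel())
```

Supplying `categories` is what makes the integer codes meaningful for truly ordered variables; the default sorted order is only meaningful by coincidence.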


In [None]:
preprocessing.OrdinalEncoder?

In [None]:
categorical_features_preprocessing = pipeline.Pipeline(
    steps=[
        (
            "simple_imputer",
            impute.SimpleImputer(
                strategy="most_frequent",
            ),
        ),
        (
            "ordinal_encoder",
            preprocessing.OrdinalEncoder(
                categories=[
                    range(1, 365 + 1)
                ],
                handle_unknown="error",
            )
        )
    ],
    memory=None,
    verbose=False,
)


In [None]:
categorical_features_preprocessing

## Numerical Features

### Standardizing Features

`preprocessing.StandardScaler` **standardizes numerical features** by transforming each feature (column) to have:

* **mean = 0**
* **standard deviation = 1**

It does this by **learning** the mean $\mu$ and standard deviation $\sigma$ of each feature from the training data during `fit()`, then applying:

$$ x' = \frac{x - \mu}{\sigma} $$

during `transform()`.

This is especially useful for models that are sensitive to feature scale (e.g., SGD, SVMs, k-NN, neural nets).
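
As a quick check of the formula, the sketch below compares `StandardScaler` against a manual computation on a toy array (the numbers are arbitrary). It assumes the population standard deviation (`ddof=0`), which is what NumPy's `std` computes by default.

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

scaler = preprocessing.StandardScaler().fit(X_train)
scaled = scaler.transform(X_train)

# Manual version of x' = (x - mu) / sigma, column by column.
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

New data passed to `scaler.transform` is shifted and scaled with the training statistics stored in `scaler.mean_` and `scaler.scale_`, not with its own.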


In [None]:
preprocessing.StandardScaler?

In [None]:
numerical_features_preprocessing = pipeline.Pipeline(
    steps=[
        (
            "simple_imputer",
            impute.SimpleImputer(
                strategy="mean",
            )
        ),
        (
            "standard_scaler",
            preprocessing.StandardScaler(
                with_mean=True,
                with_std=True,
            )
        )
    ],
    memory=None,
    verbose=False,
)

In [None]:
numerical_features_preprocessing

## Combine Feature Preprocessing Pipelines

### Column-based Transformations

`compose.ColumnTransformer` lets you apply **different preprocessing pipelines to different columns** of your dataset in a single, clean step.

For example, you can:

* **impute + scale** numeric columns, and
* **impute + one-hot encode** categorical columns,

It then **combines all transformed outputs into one final feature matrix** that you can feed into a model (often inside a `pipeline.Pipeline`).


In [None]:
compose.ColumnTransformer?

In [None]:
features_preprocessing = compose.ColumnTransformer(
    transformers=[
        (
            "categorical_features",
            categorical_features_preprocessing,
            [
                "day",
            ]
        ),
        (
            "numerical_features",
            numerical_features_preprocessing,
            [
                "pressure",
                "maxtemp",
                "temparature",
                "mintemp",
                "dewpoint",
                "humidity",
                "cloud",
                "sunshine",
                "winddirection",
                "windspeed",
            ]
        ),
    ],
    force_int_remainder_cols=False,
    remainder="drop",
    n_jobs=2,
    verbose=False,
    verbose_feature_names_out=False,
).set_output(transform="pandas")


In [None]:
features_preprocessing

### Manually Preprocessing Features

In [None]:
processed_train_features_df = features_preprocessing.fit_transform(train_features_df)

In [None]:
processed_train_features_df.info()

In [None]:
processed_train_features_df.head()

# Create Benchmark Model

## Dummy Classifiers

`dummy.DummyClassifier` is a **baseline classifier** that makes predictions using **simple, non-learning rules** instead of actually training on patterns in the data.

Common strategies include:

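* **`most_frequent`**: always predicts the most common class; `predict_proba` returns a one-hot vector for that class
* **`prior`**: also always predicts the most common class, but `predict_proba` returns the empirical class proportions
* **`stratified`**: predicts randomly, sampling classes according to their proportions in the training labels
* **`uniform`**: predicts each class with equal probability, completely at random
* **`constant`**: always predicts a user-specified class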

It’s mainly used to check whether your real model is doing **better than a trivial baseline**.
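
The difference between `most_frequent` and `prior` only shows up in `predict_proba`, which matters when the task asks for probabilities. A small sketch with made-up labels (75% positive) illustrates this:

```python
import numpy as np
from sklearn import dummy

X = np.zeros((4, 1))          # features are ignored by DummyClassifier
y = np.array([1, 1, 1, 0])    # toy labels: 75% class 1, 25% class 0

prior_clf = dummy.DummyClassifier(strategy="prior").fit(X, y)
most_frequent_clf = dummy.DummyClassifier(strategy="most_frequent").fit(X, y)

# Both strategies predict the majority class...
prior_prediction = prior_clf.predict(X[:1])
most_frequent_prediction = most_frequent_clf.predict(X[:1])

# ...but only "prior" reports the class proportions as probabilities.
prior_probas = prior_clf.predict_proba(X[:1])                  # [[0.25, 0.75]]
most_frequent_probas = most_frequent_clf.predict_proba(X[:1])  # [[0.0, 1.0]]

print(prior_prediction, most_frequent_prediction)
print(prior_probas, most_frequent_probas)
```

For a probability-based benchmark, `prior` is therefore usually the more informative of the two.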


In [None]:
dummy.DummyClassifier?

## Using Manually Preprocessed Features

In [None]:
dummy_classifier = dummy.DummyClassifier(
    strategy="prior",
    random_state=RANDOM_STATE,
)

_ = dummy_classifier.fit(
    processed_train_features_df,
    train_labels
)

## Combine Feature Preprocessing with a Model

In [None]:
classifier_pipeline = pipeline.Pipeline(
    steps=[
        ("features_preprocessing", features_preprocessing),
        ("dummy_classifier", dummy_classifier)
    ]
)


In [None]:
classifier_pipeline

In [None]:
_ = classifier_pipeline.fit(
    train_features_df,
    train_labels
)

### Save a Trained Pipeline

In [None]:
_ = joblib.dump(classifier_pipeline, "dummy-classifier-pipeline.pkl")

In [None]:
%%bash

ls -lh

# Submit Predictions

In [None]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/sample_submission.csv | head -n 5

## Generate Model Predictions

In [None]:
loaded_classifier_pipeline = joblib.load("dummy-classifier-pipeline.pkl")

In [None]:
loaded_classifier_pipeline

In [None]:
predicted_rainfall_probas = loaded_classifier_pipeline.predict_proba(
    test_features_df
)


## Create a Submission File

In [None]:
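# Column 1 of predict_proba corresponds to classes_[1] of the fitted classifier,
# assumed here to be the "rain" label. Rows are assumed to be in the same order
# as sample_submission.csv.
_ = (
    pd.read_csv(
        DATA_DIR / "sample_submission.csv",
        index_col="id"
    ).assign(
        rainfall=predicted_rainfall_probas[:, 1]
    ).to_csv(
        "submission.csv",
        index=True
    )
)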

In [None]:
%%bash

cat submission.csv | head -n 5