In [1]:
import pathlib

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sklearn import compose, dummy, impute, metrics, pipeline, preprocessing


DATA_DIR = pathlib.Path("/kaggle/input/rainfall-probability-cs-209-spring-2026")
RANDOM_STATE = np.random.RandomState(42)


# Load the Data

In [2]:
%%bash

ls /kaggle/input/rainfall-probability-cs-209-spring-2026

sample_submission.csv
test.csv
train.csv


In [3]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/train.csv | head -n 5

id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed,rainfall
0,1,1017.4,21.2,20.6,19.9,19.4,87.0,88.0,1.1,60.0,17.2,1
1,2,1019.5,16.2,16.9,15.8,15.4,95.0,91.0,0.0,50.0,21.9,1
2,3,1024.1,19.4,16.1,14.6,9.3,75.0,47.0,8.3,70.0,18.1,1
3,4,1013.4,18.1,17.8,16.9,16.8,95.0,95.0,0.0,60.0,35.6,1


In [4]:
label_name = "rainfall"

train_df = pd.read_csv(
    DATA_DIR / "train.csv",
    index_col="id",
)
train_features_df = train_df.drop(label_name, axis="columns")
train_labels = train_df.loc[:, label_name]

In [5]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/test.csv | head -n 5

id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed
2190,1,1019.5,17.5,15.8,12.7,14.9,96.0,99.0,0.0,50.0,24.3
2191,2,1016.5,17.5,16.5,15.8,15.1,97.0,99.0,0.0,50.0,35.3
2192,3,1023.9,11.2,10.4,9.4,8.9,86.0,96.0,0.0,40.0,16.9
2193,4,1022.9,20.6,17.3,15.2,9.5,75.0,45.0,7.1,20.0,50.6


In [6]:
test_features_df = pd.read_csv(
    DATA_DIR / "test.csv",
    index_col="id",
)

# Prepare the Data for ML

## Create Data Preparation Pipelines

The `pipeline.Pipeline` class in **Scikit-Learn** chains multiple data processing and modeling steps into a single object. This **streamlines preprocessing and model training** and helps **prevent data leakage** during cross-validation or testing.

### Key points:

1. **Sequence of steps:**
   Each step has a name and a transformer or estimator:

   ```python
   from sklearn import linear_model  # not among the imports at the top of this notebook

   ml_pipeline = pipeline.Pipeline([
       ("simple_imputer", impute.SimpleImputer()),              # preprocessing step 1
       ("standard_scaler", preprocessing.StandardScaler()),     # preprocessing step 2
       ("linear_regression", linear_model.LinearRegression()),  # final estimator
   ])
   ```

2. **Fit and transform automatically:**

   * `ml_pipeline.fit(X_train, y_train)` applies all transformations in order using each transformer's `fit_transform` method and then trains the final estimator using the `fit` method.
   * `ml_pipeline.predict(X_test)` applies the same transformations in order to new data using each transformer's `transform` method and then uses the final estimator's `predict` method to make predictions.

3. **Prevents leakage during cross-validation:**
   When used with cross-validation routines such as `model_selection.cross_val_score` or `model_selection.GridSearchCV`, a `pipeline.Pipeline` is refit on the training folds of each split, so the preprocessing steps never see the validation fold.

4. **Hyperparameter tuning parameter naming convention:**
   You can tune parameters of any step in the pipeline with `model_selection.GridSearchCV` (or similar routine) using the syntax `"step_name__parameter"`.
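
The key points above can be sketched end to end on synthetic data. This is a minimal illustration, not part of the rainfall workflow: the data, the variable names, and the swap from `LinearRegression` to `linear_model.Ridge` (chosen only so that the final estimator has a hyperparameter to tune) are all assumptions made for the example.

```python
import numpy as np
from sklearn import impute, linear_model, model_selection, pipeline, preprocessing

# Synthetic regression data (illustrative only) with a few missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.05] = np.nan

ml_pipeline = pipeline.Pipeline([
    ("simple_imputer", impute.SimpleImputer()),
    ("standard_scaler", preprocessing.StandardScaler()),
    ("ridge", linear_model.Ridge()),
])

# Each split refits the imputer and scaler on the training folds only.
cv_scores = model_selection.cross_val_score(ml_pipeline, X, y, cv=5)

# Parameters of any step are addressed as "<step_name>__<parameter>".
grid_search = model_selection.GridSearchCV(
    ml_pipeline,
    param_grid={
        "simple_imputer__strategy": ["mean", "median"],
        "ridge__alpha": [0.1, 1.0, 10.0],
    },
    cv=5,
)
grid_search.fit(X, y)

print(cv_scores.mean())
print(grid_search.best_params_)
```

Because the whole pipeline is passed to the cross-validation routine, there is no need to preprocess `X` beforehand.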


In [7]:
pipeline.Pipeline?

## Categorical Features


### Handling Missing Values

`impute.SimpleImputer` is a preprocessing tool that **fills in missing values** in a dataset.

* During `.fit()`, it **learns a replacement value** from the training data (e.g., the **mean**, **median**, or **most frequent** value in each column).
* During `.transform()`, it **replaces missing entries** (like `NaN`) using those learned values.

It’s commonly used inside a `pipeline.Pipeline` to avoid data leakage and ensure consistent preprocessing during cross-validation.
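
As a small illustration of the fit/transform split, the sketch below uses a made-up two-column array (not the rainfall data). The replacement values are learned from `X_train` and then reused unchanged on `X_new`.

```python
import numpy as np
from sklearn import impute

# Toy data: one missing value per column in the training array.
X_train = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])
X_new = np.array([[np.nan, np.nan]])

imputer = impute.SimpleImputer(strategy="mean")
imputer.fit(X_train)

# Learned per-column means, ignoring missing entries: 2.0 and 15.0.
print(imputer.statistics_)

# Missing entries in new data are filled with the values learned above.
X_new_filled = imputer.transform(X_new)
print(X_new_filled)
```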


In [8]:
impute.SimpleImputer?

### Encoding Ordered Categories

`preprocessing.OrdinalEncoder` converts **categorical (string) features** into **integer-coded categories**.

For example:

* `"red", "green", "blue"`
  → `2, 1, 0` (by default the categories are sorted, so `blue = 0`, `green = 1`, `red = 2`; pass `categories=` to set the order yourself)

It assigns each category in each feature an **integer index** based on what it sees during `fit()`.

By mapping categories to integers, `preprocessing.OrdinalEncoder` implies a numerical ordering of the original categories (e.g., `blue < green < red`). It’s often best for:

* **tree-based models**, or
* categorical variables that are **truly ordered**.

For unordered categories, `preprocessing.OneHotEncoder` is usually safer.
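
The difference between the default ordering and an explicit one can be seen on toy data. The `colors` and `sizes` arrays below are invented for this sketch.

```python
import numpy as np
from sklearn import preprocessing

colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Default: categories are discovered during fit() and sorted.
default_encoder = preprocessing.OrdinalEncoder()
default_codes = default_encoder.fit_transform(colors)
print(default_encoder.categories_)  # sorted: blue, green, red
print(default_codes.ravel())        # red -> 2, green -> 1, blue -> 0

# Explicit: the order of the list defines the integer codes.
sizes = np.array([["medium"], ["small"], ["large"]])
ordered_encoder = preprocessing.OrdinalEncoder(
    categories=[["small", "medium", "large"]],
)
ordered_codes = ordered_encoder.fit_transform(sizes)
print(ordered_codes.ravel())        # medium -> 1, small -> 0, large -> 2
```

Passing `categories=` is how the notebook pins `day` to the fixed range 1 to 365 rather than to whatever values happen to appear in the training data.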


In [9]:
preprocessing.OrdinalEncoder?

In [10]:
categorical_features_preprocessing = pipeline.Pipeline(
    steps=[
        (
            "simple_imputer",
            impute.SimpleImputer(
                strategy="most_frequent",
            ),
        ),
        (
            "ordinal_encoder",
            preprocessing.OrdinalEncoder(
                categories=[
                    range(1, 365 + 1)
                ],
                handle_unknown="error",
            )
        )
    ],
    memory=None,
    verbose=False,
)


In [11]:
categorical_features_preprocessing

## Numerical Features

### Standardizing Features

`preprocessing.StandardScaler` **standardizes numerical features** by transforming each feature (column) to have:

* **mean = 0**
* **standard deviation = 1**

It does this by **learning** the mean $\mu$ and standard deviation $\sigma$ of each feature from the training data during `fit()`, then applying:

$$ x' = \frac{x - \mu}{\sigma} $$

during `transform()`.

This is especially useful for models that are sensitive to feature scale (e.g., SGD, SVMs, k-NN, neural nets).
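
A quick numerical check of the formula, using a made-up single-column array. Note that `StandardScaler` uses the population standard deviation (dividing by $n$, not $n - 1$).

```python
import numpy as np
from sklearn import preprocessing

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_new = np.array([[5.0]])

scaler = preprocessing.StandardScaler().fit(X_train)

# Learned statistics: mean = 2.5, std = sqrt(1.25) (population std).
print(scaler.mean_, scaler.scale_)

# Training data ends up with mean 0 and standard deviation 1.
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(), X_train_scaled.std())

# New data is scaled with the training statistics, not its own.
X_new_scaled = scaler.transform(X_new)
manual = (X_new - scaler.mean_) / scaler.scale_
print(X_new_scaled, manual)
```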


In [12]:
preprocessing.StandardScaler?

In [13]:
numerical_features_preprocessing = pipeline.Pipeline(
    steps=[
        (
            "simple_imputer",
            impute.SimpleImputer(
                strategy="mean",
            )
        ),
        (
            "standard_scaler",
            preprocessing.StandardScaler(
                with_mean=True,
                with_std=True,
            )
        )
    ],
    memory=None,
    verbose=False,
)

In [14]:
numerical_features_preprocessing

## Combine Feature Preprocessing Pipelines

### Column-based Transformations

`compose.ColumnTransformer` lets you apply **different preprocessing pipelines to different columns** of your dataset in a single, clean step.

For example, you can:

* **impute + scale** numeric columns, and
* **impute + one-hot encode** categorical columns,

It then **combines all transformed outputs into one final feature matrix** that you can feed into a model (often inside a `pipeline.Pipeline`).


In [15]:
compose.ColumnTransformer?

In [16]:
features_preprocessing = compose.ColumnTransformer(
    transformers=[
        (
            "categorical_features",
            categorical_features_preprocessing,
            [
                "day",
            ]
        ),
        (
            "numerical_features",
            numerical_features_preprocessing,
            [
                "pressure",
                "maxtemp",
                "temparature",
                "mintemp",
                "dewpoint",
                "humidity",
                "cloud",
                "sunshine",
                "winddirection",
                "windspeed",
            ]
        ),
    ],
    force_int_remainder_cols=False,
    remainder="drop",
    n_jobs=2,
    verbose=False,
    verbose_feature_names_out=False,
).set_output(transform="pandas")


In [17]:
features_preprocessing

### Manually Preprocessing Features

In [18]:
processed_train_features_df = features_preprocessing.fit_transform(train_features_df)

In [19]:
processed_train_features_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2190 entries, 0 to 2189
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   day            2190 non-null   float64
 1   pressure       2190 non-null   float64
 2   maxtemp        2190 non-null   float64
 3   temparature    2190 non-null   float64
 4   mintemp        2190 non-null   float64
 5   dewpoint       2190 non-null   float64
 6   humidity       2190 non-null   float64
 7   cloud          2190 non-null   float64
 8   sunshine       2190 non-null   float64
 9   winddirection  2190 non-null   float64
 10  windspeed      2190 non-null   float64
dtypes: float64(11)
memory usage: 205.3 KB


In [20]:
processed_train_features_df.head()

id,day,pressure,maxtemp,temparature,mintemp,dewpoint,humidity,cloud,sunshine,winddirection,windspeed
0,0.0,0.671702,-0.913809,-0.642199,-0.448815,-0.199457,0.636434,0.681269,-0.729397,-0.560901,-0.465291
1,1.0,1.043116,-1.798289,-1.350846,-1.259418,-0.956001,1.662224,0.847728,-1.032804,-0.685925,0.009629
2,2.0,1.856688,-1.232222,-1.504067,-1.496667,-2.109731,-0.90225,-1.59368,1.256536,-0.435876,-0.374349
3,3.0,-0.035752,-1.462187,-1.178472,-1.041939,-0.69121,1.662224,1.069675,-1.032804,-0.560901,1.393971
4,4.0,1.449902,-0.89612,-1.063556,-1.378043,-2.05299,-3.851394,-1.704654,-0.039837,-0.81095,0.302665


# Create Benchmark Model

## Dummy Classifiers

`dummy.DummyClassifier` is a **baseline classifier** that makes predictions using **simple, non-learning rules** instead of actually training on patterns in the data.

Common strategies include:

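* **`most_frequent`**: always predicts the most common class; `predict_proba` returns a one-hot vector for that class
* **`prior`**: also always predicts the most common class, but `predict_proba` returns the empirical class proportions
* **`stratified`**: predicts randomly, sampling classes according to their proportions in the training labels
* **`uniform`**: predicts each class with equal probability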
* **`constant`**: always predicts a user-specified class

It’s mainly used to check whether your real model is doing **better than a trivial baseline**.
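
The difference between `prior` and `most_frequent` only shows up in `predict_proba`, which matters when the submission expects probabilities. A minimal sketch on made-up labels (three positives, one negative):

```python
import numpy as np
from sklearn import dummy

X = np.zeros((4, 1))        # features are ignored by DummyClassifier
y = np.array([1, 1, 1, 0])  # 75% of the labels are class 1

prior_clf = dummy.DummyClassifier(strategy="prior").fit(X, y)
most_frequent_clf = dummy.DummyClassifier(strategy="most_frequent").fit(X, y)

# Both strategies predict the majority class.
prior_label = prior_clf.predict(X[:1])
most_frequent_label = most_frequent_clf.predict(X[:1])

# Only "prior" reports the class proportions as probabilities.
prior_probas = prior_clf.predict_proba(X[:1])                  # [[0.25, 0.75]]
most_frequent_probas = most_frequent_clf.predict_proba(X[:1])  # [[0.0, 1.0]]

print(prior_label, most_frequent_label)
print(prior_probas, most_frequent_probas)
```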


In [21]:
dummy.DummyClassifier?

## Using Manually Preprocessed Features

In [22]:
dummy_classifier = dummy.DummyClassifier(
    strategy="prior",
    random_state=RANDOM_STATE,
)

_ = dummy_classifier.fit(
    processed_train_features_df,
    train_labels
)

## Combine Feature Preprocessing with a Model

In [23]:
classifier_pipeline = pipeline.Pipeline(
    steps=[
        ("features_preprocessing", features_preprocessing),
        ("dummy_classifier", dummy_classifier)
    ]
)


In [24]:
classifier_pipeline

In [25]:
_ = classifier_pipeline.fit(
    train_features_df,
    train_labels
)

### Save a Trained Pipeline

In [26]:
_ = joblib.dump(classifier_pipeline, "dummy-classifier-pipeline.pkl")

In [27]:
%%bash

ls -lh

total 156K
-rw-r--r-- 1 root root  14K Feb 19 07:43 dummy-classifier-pipeline.pkl
---------- 1 root root 138K Feb 19 07:43 __notebook__.ipynb


# Submit Predictions

In [28]:
%%bash

cat /kaggle/input/rainfall-probability-cs-209-spring-2026/sample_submission.csv | head -n 5

id,rainfall
2190,0
2191,0
2192,0
2193,0


## Generate Model Predictions

In [29]:
loaded_classifier_pipeline = joblib.load("dummy-classifier-pipeline.pkl")

In [30]:
loaded_classifier_pipeline

In [31]:
predicted_rainfall_probas = loaded_classifier_pipeline.predict_proba(
    test_features_df
)


## Create a Submission File

In [32]:
_ = (
    pd.read_csv(
        DATA_DIR / "sample_submission.csv",
        index_col="id"
    ).assign(
        rainfall=predicted_rainfall_probas[:, 1]
    ).to_csv(
        "submission.csv",
        index=True
    )
)

In [33]:
%%bash

cat submission.csv | head -n 5

id,rainfall
2190,0.7534246575342466
2191,0.7534246575342466
2192,0.7534246575342466
2193,0.7534246575342466
