# Rainfall Prediction in Melbourne Using Machine Learning


## Project Overview

This project builds a machine learning classifier to predict whether it will rain
on a given day in the Melbourne region using historical weather data.

The project demonstrates an end-to-end ML workflow including data preprocessing,
feature engineering, model pipelines, hyperparameter tuning, and model evaluation.


## Dataset

The dataset contains daily weather observations from Australia (2008–2017),
sourced from the Australian Bureau of Meteorology and made publicly available
via Kaggle.

To reduce geographical variability, the analysis focuses on the following locations:

- Melbourne
- Melbourne Airport
- Watsonia

The target variable indicates whether measurable rainfall occurred on a given day.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay


## Data Loading and Cleaning


In [None]:
# Replace the "data_source_url" placeholder with the actual URL (or local path) of the dataset CSV

url = "data_source_url"

df = pd.read_csv(url)

# Drop rows with missing values
df = df.dropna()
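Dropping every row that has a missing value can discard a large share of the data, so it may be worth checking how much is lost first. The following is a minimal sketch on a made-up toy frame; the column names `MinTemp` and `Humidity3pm` are illustrative assumptions, not confirmed columns of the dataset used here.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the weather data (values and column names are made up).
toy = pd.DataFrame({
    "MinTemp": [12.1, np.nan, 9.8, 15.0],
    "Humidity3pm": [55.0, 60.0, np.nan, 48.0],
    "RainToday": ["No", "Yes", "No", "No"],
})

# Share of missing values per column, highest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Compare row counts before and after dropping incomplete rows.
rows_before = len(toy)
toy_clean = toy.dropna()
rows_after = len(toy_clean)
print(f"dropna() kept {rows_after} of {rows_before} rows")
```

Running the same check on `df` before the `dropna()` call would show whether a few sparse columns are responsible for most of the dropped rows.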

## Target Redefinition and Data Leakage Prevention

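Some features depend on full-day observations, so they are only known once the
day is over. Using them to predict rainfall for a day that is still in progress
would leak information the model could not have at prediction time.

To avoid this, the prediction task is reframed to predict **today's rainfall**
using historical weather data: the original `RainToday` column becomes
`RainYesterday`, and the original `RainTomorrow` column becomes the new target,
`RainToday`.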


In [None]:
df = df.rename(columns={
    'RainToday': 'RainYesterday',
    'RainTomorrow': 'RainToday'
})
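To make the relabeling concrete, here is a small sketch on a made-up three-day frame. It assumes the usual layout in which each row's `RainTomorrow` value equals the next row's `RainToday` value. The rename changes no values; it only changes how each row is read.

```python
import pandas as pd

# Made-up three-day example; each row's RainTomorrow matches the next row's RainToday.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2017-06-01", "2017-06-02", "2017-06-03"]),
    "RainToday": ["No", "Yes", "No"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# Same rename as in the notebook: shift the labels one day forward.
renamed = toy.rename(columns={
    "RainToday": "RainYesterday",
    "RainTomorrow": "RainToday",
})

print(renamed)
```

After the rename, the row's existing measurements are treated as history, and the column now called `RainToday` is the value to predict.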

## Location Selection

The model focuses on geographically close locations to reduce variability
in weather patterns.


In [None]:
df = df[df.Location.isin(['Melbourne', 'MelbourneAirport', 'Watsonia'])]

## Feature Engineering: Seasonality

Seasonal patterns play a significant role in rainfall behavior.
A new categorical feature representing seasons is engineered from the date column.


In [None]:
def date_to_season(date):
    month = date.month
    if month in [12, 1, 2]:
        return 'Summer'
    elif month in [3, 4, 5]:
        return 'Autumn'
    elif month in [6, 7, 8]:
        return 'Winter'
    else:
        return 'Spring'

df['Date'] = pd.to_datetime(df['Date'])
df['Season'] = df['Date'].apply(date_to_season)
df = df.drop(columns=['Date'])
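As an alternative to applying a Python function row by row, the same Southern Hemisphere season mapping can be written as a dictionary lookup on the month number. This is a sketch of an equivalent approach, not the notebook's own method; `MONTH_TO_SEASON` is a hypothetical name introduced here.

```python
import pandas as pd

# Southern Hemisphere meteorological seasons, keyed by month number.
MONTH_TO_SEASON = {
    12: "Summer", 1: "Summer", 2: "Summer",
    3: "Autumn", 4: "Autumn", 5: "Autumn",
    6: "Winter", 7: "Winter", 8: "Winter",
    9: "Spring", 10: "Spring", 11: "Spring",
}

# Made-up dates, one per season.
dates = pd.Series(pd.to_datetime(
    ["2016-01-15", "2016-04-01", "2016-07-30", "2016-11-05"]
))

# Vectorized lookup: extract the month, then map it to a season label.
seasons = dates.dt.month.map(MONTH_TO_SEASON)
print(seasons.tolist())
```

On a frame of this size the speed difference is unlikely to matter; the dictionary form mainly keeps the month-to-season rule in one visible place.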


In [None]:
X = df.drop(columns=['RainToday'])
y = df['RainToday']

## Train-Test Split

Stratifying the split on the target keeps the proportion of rainy and dry days
the same in the training and test sets.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
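The effect of `stratify` can be checked directly by comparing class shares in the two splits. The sketch below uses a made-up target with 20% rainy days rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy target: 80 dry days, 20 rainy days (made-up data).
y_toy = np.array(["No"] * 80 + ["Yes"] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Share of rainy days in each split; stratification keeps them aligned.
train_rain_share = np.mean(y_tr == "Yes")
test_rain_share = np.mean(y_te == "Yes")
print(f"train: {train_rain_share:.2f}, test: {test_rain_share:.2f}")
```

The same two lines applied to `y_train` and `y_test` would confirm the class shares in the actual split.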

In [None]:
numeric_features = X_train.select_dtypes(include=['number']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
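To see what this kind of preprocessor produces, the sketch below fits a similar `ColumnTransformer` on a tiny made-up frame. The column names are illustrative assumptions, and `sparse_threshold=0` is set only so the output prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "MinTemp": [10.0, 12.0, 14.0, 16.0],
    "Season": ["Summer", "Winter", "Summer", "Spring"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["MinTemp"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Season"]),
    ],
    sparse_threshold=0,  # force a dense array for readability
)

# One scaled numeric column plus one column per season seen during fit.
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)

# A season not seen during fit is encoded as all zeros instead of raising an error.
unseen = pd.DataFrame({"MinTemp": [16.0], "Season": ["Autumn"]})
unseen_encoded = toy_preprocessor.transform(unseen)
print(unseen_encoded)
```

The `handle_unknown='ignore'` setting matters when a category appears in the test set or in a cross-validation fold but not in the data the encoder was fitted on.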

## Model 1: Random Forest Classifier


In [None]:
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [None, 10, 20],
    'classifier__min_samples_split': [2, 5]
}

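# Fix the seed so the shuffled folds, and therefore the search results, are reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)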

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=cv,
    scoring='accuracy',
    verbose=2
)

grid_search.fit(X_train, y_train)
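Once a grid search has been fitted, the winning configuration and the per-candidate scores can be read from the search object. The sketch below runs a much smaller search on synthetic data from `make_classification`, purely to show which attributes to inspect; the grid and data are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring="accuracy",
)
toy_search.fit(X_toy, y_toy)

# Best hyperparameters and their mean cross-validated accuracy.
print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))

# One row per candidate, ranked by mean validation score.
results = pd.DataFrame(toy_search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)
```

Comparing `mean_test_score` against `std_test_score` gives a rough sense of whether the gap between candidates is larger than the fold-to-fold noise.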


In [None]:
y_pred = grid_search.predict(X_test)

print(classification_report(y_test, y_pred))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Random Forest Confusion Matrix")
plt.show()
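When rainy days are the minority class, accuracy alone can hide poor detection of rain. The worked example below, on made-up labels, shows how accuracy and rainy-day recall are derived from the four cells of a confusion matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 6 dry days and 4 rainy days.
y_true = np.array(["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"])
y_hat = np.array(["No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "No", "No"])

# With labels=["No", "Yes"], ravel() yields tn, fp, fn, tp in that order.
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # share of all days predicted correctly
rain_recall = tp / (tp + fn)         # share of rainy days that were caught

print(f"accuracy={accuracy:.2f}, rainy-day recall={rain_recall:.2f}")

# Cross-check against scikit-learn's own recall computation.
assert np.isclose(rain_recall, recall_score(y_true, y_hat, pos_label="Yes"))
```

Here the model is right on 7 of 10 days yet finds only half of the rainy ones, which is the kind of gap the classification report makes visible.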


## Feature Importance Analysis


In [None]:
feature_names = numeric_features + list(
    grid_search.best_estimator_['preprocessor']
    .named_transformers_['cat']
    .named_steps['onehot']
    .get_feature_names_out(categorical_features)
)

importances = grid_search.best_estimator_['classifier'].feature_importances_

importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10,6))
plt.barh(importance_df.head(15)['Feature'], importance_df.head(15)['Importance'])
plt.gca().invert_yaxis()
plt.title("Top Features Influencing Rainfall Prediction")
plt.show()


## Model Comparison: Logistic Regression


In [None]:
pipeline.set_params(classifier=LogisticRegression(random_state=42))

param_grid = {
    'classifier__solver': ['liblinear'],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__class_weight': [None, 'balanced']
}

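# Reuse the existing GridSearchCV object with the new estimator and grid.
# Refitting overwrites the Random Forest search results stored on it.
grid_search.estimator = pipeline
grid_search.param_grid = param_grid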

grid_search.fit(X_train, y_train)

y_pred_lr = grid_search.predict(X_test)

print(classification_report(y_test, y_pred_lr))
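Because refitting a shared search object replaces its earlier results, one option for side-by-side comparison is to keep a separate search per model. The sketch below does this on synthetic, imbalanced data; the `candidates` dictionary, the reduced grids, and the data are assumptions for illustration rather than the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic imbalanced data: roughly 25% positives (standing in for rainy days).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Hypothetical candidates: each model paired with a small grid.
candidates = {
    "random_forest": (
        RandomForestClassifier(n_estimators=20, random_state=0),
        {"max_depth": [3, None]},
    ),
    "logistic_regression": (
        LogisticRegression(solver="liblinear"),
        {"class_weight": [None, "balanced"]},
    ),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

# Fit one independent search per model so no results are overwritten.
searches = {}
for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=cv, scoring="accuracy")
    searches[name] = search.fit(X_tr, y_tr)

# Report test-set recall for the positive class for each tuned model.
recalls = {}
for name, search in searches.items():
    recalls[name] = recall_score(y_te, search.predict(X_te), pos_label=1)
    print(name, search.best_params_, round(recalls[name], 3))
```

Keeping both fitted searches around also makes it possible to plot the two confusion matrices next to each other after the fact.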


## Conclusion

Both models performed well on the held-out test set, with Logistic Regression
showing slightly higher recall for rainy days.

This project demonstrates practical machine learning skills including
feature engineering, pipeline construction, hyperparameter tuning, and
model evaluation on real-world data.
