## Problem Statement



> **QUESTION**: The [Rain in Australia dataset](https://kaggle.com/jsphyg/weather-dataset-rattle-package) contains about 10 years of daily weather observations from numerous Australian weather stations.
> As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully-automated system that can use today's weather data for a given location to predict whether it will rain at the location tomorrow.


**EXERCISE**: Before proceeding further, take a moment to think about how you can approach this problem. List five or more ideas that come to your mind below:

1. Do basic EDA: check missing values, class balance (how many “RainTomorrow = Yes/No”), look for obvious predictors like RainToday, Humidity, Cloud, Pressure.
2. Build a simple baseline model: predict “No” for every day (or always use today’s rain → tomorrow’s rain), to have something to beat.
3. Create features from today’s weather: e.g. Humidity3pm, RainToday, Temp3pm, WindGustSpeed, Pressure3pm — and train a classification model (Logistic Regression / RandomForest) to predict RainTomorrow.
4. Handle missing data properly: impute numeric cols (median/mean), fill/categorize missing categorical values, maybe drop columns with too many NaNs.
5. Encode categorical columns (Location, WindDir9am, WindDir3pm, WindGustDir) using one-hot encoding.
6. Split into train/test by **date** (older → train, newer → test) to simulate real forecasting.
7. Evaluate with accuracy **and** recall/precision for the “RainTomorrow = Yes” class (it’s usually imbalanced).
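
Ideas 2 and 7 can be illustrated together. The following is a minimal sketch on made-up labels (not the real dataset), assuming 8 dry days and 2 rainy days: a baseline that always predicts "No" looks accurate, yet it never catches a rainy day, which is why recall for the "Yes" class matters.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration only: 8 dry days and 2 rainy days
y_true = np.array(['No'] * 8 + ['Yes'] * 2)

# Baseline: always predict the majority class
y_pred = np.full(y_true.shape, 'No')

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, pos_label='Yes')

print(baseline_accuracy)  # 0.8
print(baseline_recall)    # 0.0
```

Any trained model should beat this baseline on accuracy while also achieving a non-zero recall for rainy days.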

## Linear Regression vs. Logistic Regression

When predicting **continuous values** (like medical charges), we use **Linear Regression**.
When predicting **categories or classes** (like rain vs no rain), we use **Logistic Regression**.

---

| Aspect | Linear Regression | Logistic Regression |
|:--------|:------------------|:--------------------|
| **Goal** | Predict a continuous **numeric value** | Predict a **category/class** (e.g., Rain/No Rain) |
| **Output** | Any real number (−∞ to +∞) | A probability between 0 and 1 |
| **Typical Use** | Price, temperature, salary, medical costs | Spam detection, disease diagnosis, rainfall prediction |
| **Decision Boundary** | Continuous value, no threshold | Converts probability to class label using threshold (e.g., 0.5) |
| **Loss Function** | Mean Squared Error (MSE) | Binary Cross-Entropy (Log Loss) |
| **Assumption** | Linear relationship between X and y | Linear relationship between X and the log-odds of the positive class |
| **Interpretation** | Predicts *how much* | Predicts *which class* |
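
The two loss functions named in the table can be written out explicitly. As a sketch, assume \( n \) training examples, where \( y_i \) is the true value, \( \hat{y}_i \) is the predicted value, and \( \hat{p}_i \) is the predicted probability of the positive class (these index symbols are introduced here for illustration):

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

$$
\text{Log Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right]
$$

For the log loss, \( y_i \) is 1 for the positive class (rain) and 0 otherwise, so only one of the two terms is active for each example.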

---

### Mathematical Form

**Linear Regression:**

$$
\hat{y} = w^T x + b
$$

**Logistic Regression:**

$$
\hat{p} = \sigma(w^T x + b)
$$

where the **sigmoid (logistic)** function is:

$$
\sigma(z) = \frac{1}{1 + e^{-z}}
$$

This ensures that the output probability \( \hat{p} \) always lies between **0 and 1**.
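
To see the squashing effect numerically, here is a small sketch using NumPy. The scores in `z` are hypothetical values of \( w^T x + b \), chosen only for illustration, and the 0.5 threshold matches the one mentioned in the comparison table.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores w^T x + b for five observations
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)

# Convert probabilities to class labels using a 0.5 threshold
labels = np.where(probabilities >= 0.5, 'Yes', 'No')

print(probabilities.round(3))
print(labels)
```

Large negative scores map to probabilities near 0, large positive scores map to probabilities near 1, and a score of exactly 0 maps to 0.5.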

---

### Summary

- **Linear Regression** → Best for predicting continuous outcomes
- **Logistic Regression** → Best for predicting binary (yes/no) or categorical outcomes



## Downloading the Data


In [None]:
import pandas as pd

df = pd.read_csv("weatherAUS.csv")
df.head()

The dataset contains over 145,000 rows and 23 columns, including date, numeric and categorical columns. Our objective is to create a model to predict the value in the column `RainTomorrow`.

Let's check the data types and missing values in the various columns.

In [None]:
df.info()

In [None]:
# Drop rows where RainToday or the target RainTomorrow is missing
df.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)

## Exploratory Data Analysis and Visualization

Before training a machine learning model, it's always a good idea to explore the distributions of various columns and see how they are related to the target column. Let's explore and visualize the data using the Plotly, Matplotlib and Seaborn libraries.

In [None]:
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

In [None]:
px.histogram(df, x='Location', title='Location vs. Rainy Days', color='RainToday')

In [None]:
px.histogram(df,
             x='Temp3pm',
             title='Temp at 3pm vs. Rainy Days',
             color='RainTomorrow')

In [None]:
px.histogram(df,
             x='RainTomorrow',
             color='RainToday',
             title='Rain Tomorrow vs. Rain Today')

In [None]:
px.scatter(df.sample(2000),
           title='Min Temp. vs Max Temp.',
           x='MinTemp',
           y='MaxTemp',
           color='RainToday')

In [None]:
px.scatter(df.sample(2000),
           title='Temp at 3 pm vs. Humidity (3 pm)',
           x='Temp3pm',
           y='Humidity3pm',
           color='RainTomorrow')

> **EXERCISE**: Visualize all the other columns of the dataset and study their relationship with the `RainToday` and `RainTomorrow` columns.

In [None]:
# --- NUMERIC FEATURES ---
from plotly.subplots import make_subplots

num_features = [
    'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
    'WindGustSpeed', 'WindSpeed9am', 'WindSpeed3pm',
    'Humidity9am', 'Humidity3pm', 'Pressure9am', 'Pressure3pm',
    'Cloud9am', 'Cloud3pm', 'Temp9am', 'Temp3pm'
]

# Create small box plots for numeric columns vs RainTomorrow
for col in num_features:
    fig = px.box(
        df, x='RainTomorrow', y=col, color='RainTomorrow',
        title=f"{col} vs RainTomorrow", height=300, width=400
    )
    fig.update_layout(showlegend=False)
    fig.show()

# --- CATEGORICAL FEATURES ---
cat_features = ['Location', 'WindGustDir', 'WindDir9am', 'WindDir3pm', 'RainToday']

for col in cat_features:
    fig = px.histogram(
        df, x=col, color='RainTomorrow', barmode='group',
        title=f"{col} distribution by RainTomorrow", height=300, width=500
    )
    fig.update_xaxes(categoryorder='total descending')
    fig.update_layout(showlegend=True)
    fig.show()

# --- OPTIONAL COMPACT GRID FOR KEY FEATURES ---
fig = make_subplots(rows=2, cols=2, subplot_titles=[
    "Humidity3pm vs RainTomorrow",
    "Pressure3pm vs RainTomorrow",
    "WindSpeed3pm vs RainTomorrow",
    "Temp3pm vs RainTomorrow"
])

features = ['Humidity3pm', 'Pressure3pm', 'WindSpeed3pm', 'Temp3pm']
r, c = 1, 1

for f in features:
    box = px.box(df, x='RainTomorrow', y=f, color='RainTomorrow')
    for trace in box.data:
        fig.add_trace(trace, row=r, col=c)
    c += 1
    if c == 3:
        c = 1
        r += 1

fig.update_layout(height=700, width=850, title_text="Key Weather Features vs RainTomorrow", showlegend=False)
fig.show()


In [None]:
# Set use_sample to True to work with a 10% random sample and speed up experimentation
use_sample = False

sample_fraction = 0.1

if use_sample:
    df = df.sample(frac=sample_fraction).copy()

## Training, Validation and Test Sets

While building real-world machine learning models, it is quite common to split the dataset into three parts:

1. **Training set** - used to train the model, i.e., compute the loss and adjust the model's weights using an optimization technique.


2. **Validation set** - used to evaluate the model during training, tune model hyperparameters (optimization technique, regularization etc.), and pick the best version of the model. Picking a good validation set is essential for training models that generalize well.

3. **Test set** - used to compare different models or approaches and report the model's final accuracy. For many datasets, test sets are provided separately. The test set should reflect the kind of data the model will encounter in the real world as closely as feasible.


As a general rule of thumb you can use around 60% of the data for the training set, 20% for the validation set and 20% for the test set. If a separate test set is already provided, you can use a 75%-25% training-validation split.
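
A 60/20/20 split can be produced with two calls to `train_test_split`. The sketch below uses 1,000 placeholder row numbers instead of a real dataset. The key point is that the second call takes 25% of the remaining 80%, which is 20% of the total.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1,000 placeholder rows standing in for a real dataset
rows = np.arange(1000)

# Step 1: hold out 20% of all rows as the test set
train_val_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

# Step 2: hold out 25% of the remaining 80% (0.25 * 0.8 = 0.2 of the total) for validation
train_rows, val_rows = train_test_split(train_val_rows, test_size=0.25, random_state=42)

print(len(train_rows), len(val_rows), len(test_rows))  # 600 200 200
```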


In [None]:
from sklearn.model_selection import train_test_split

In [None]:
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

In [None]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)

However, when working with dates, it is often better to split the data by time, so that the model is trained on data from the past and evaluated on data from the future. Let's look at the number of rows per year, then replace the random split with a time-based one: data before 2015 for training, 2015 for validation, and data after 2015 for testing.

In [None]:
plt.title('No. of Rows per Year')
sns.countplot(x=pd.to_datetime(df.Date).dt.year);

In [None]:
year = pd.to_datetime(df.Date).dt.year

train_df = df[year < 2015]
val_df = df[year == 2015]
test_df = df[year > 2015]

In [None]:
print('train_df.shape :', train_df.shape)
print('val_df.shape :', val_df.shape)
print('test_df.shape :', test_df.shape)
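
The year-based filtering can be checked on a tiny made-up table. This is a sketch with invented dates and labels, and the `toy_` names are hypothetical. It confirms that every training date precedes every validation date, which in turn precedes every test date.

```python
import pandas as pd

# Invented records for illustration only
toy_df = pd.DataFrame({
    'Date': ['2013-03-01', '2014-07-15', '2015-01-20',
             '2015-11-02', '2016-05-09', '2017-06-25'],
    'RainTomorrow': ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
})

# Extract the year from the date strings
toy_year = pd.to_datetime(toy_df['Date']).dt.year

toy_train = toy_df[toy_year < 2015]
toy_val = toy_df[toy_year == 2015]
toy_test = toy_df[toy_year > 2015]

print(len(toy_train), len(toy_val), len(toy_test))  # 2 2 2
```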

## Identifying Input and Target Columns

Often, not all the columns in a dataset are useful for training a model. In the current dataset, we can ignore the `Date` column, since we only want to use weather conditions to make a prediction about whether it will rain the next day.

Let's create a list of input columns, and also identify the target column.

In [None]:
input_cols = list(train_df.columns)[1:-1]
target_col = 'RainTomorrow'
print(input_cols)

We can now create inputs and targets for the training, validation and test sets for further processing and model training.

In [None]:
train_inputs = train_df[input_cols].copy()
train_targets = train_df[target_col].copy()

val_inputs = val_df[input_cols].copy()
val_targets = val_df[target_col].copy()

test_inputs = test_df[input_cols].copy()
test_targets = test_df[target_col].copy()

Let's also identify which of the columns are numerical and which ones are categorical. This will be useful later, as we'll need to convert the categorical data to numbers for training a logistic regression model.

In [None]:
import numpy as np

numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

In [None]:
train_inputs[numeric_cols].describe()

In [None]:
train_inputs[categorical_cols].nunique()

## Imputing Missing Numeric Data

Machine learning models can't work with missing numerical data. The process of filling missing values is called imputation.


There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.
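
To make the mean strategy concrete, here is a minimal sketch on a made-up table with two missing values. The column names are borrowed from the dataset, but the numbers are invented. After fitting, the per-column averages are stored in `statistics_` and used to fill the gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented values for illustration only
toy = pd.DataFrame({
    'MinTemp': [10.0, np.nan, 14.0],
    'Humidity3pm': [40.0, 60.0, np.nan],
})

toy_imputer = SimpleImputer(strategy='mean')
toy_imputer.fit(toy)

# Column means computed from the non-missing values
print(toy_imputer.statistics_)  # [12. 50.]

# Replace each missing value with its column mean
filled = pd.DataFrame(toy_imputer.transform(toy), columns=toy.columns)
print(filled)
```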

In [23]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')

In [24]:
df[numeric_cols].isna().sum()

MinTemp            468
MaxTemp            307
Rainfall             0
Evaporation      59694
Sunshine         66805
WindGustSpeed     9105
WindSpeed9am      1055
WindSpeed3pm      2531
Humidity9am       1517
Humidity3pm       3501
Pressure9am      13743
Pressure3pm      13769
Cloud9am         52625
Cloud3pm         56094
Temp9am            656
Temp3pm           2624
dtype: int64

In [25]:
imputer.fit(df[numeric_cols])

SimpleImputer(strategy='mean', copy=True, add_indicator=False, keep_empty_features=False)


In [26]:
train_inputs[numeric_cols] = imputer.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = imputer.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = imputer.transform(test_inputs[numeric_cols])

In [27]:
train_inputs[numeric_cols].isna().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
dtype: int64