# Project Proposal - Will it Rain Tomorrow?

## Introduction

To predict the weather, government agencies divide Earth's atmosphere into thousands of 3D cells: for each cell, they feed millions of current weather observations into powerful supercomputers, generating forecasts which must be refreshed several times a day<sup>1</sup>. But what if the goal was not to predict "the weather", but simply to predict *whether or not it will rain tomorrow*? Can this less ambitious question be answered accurately using simpler variables, fewer data points, and less computational horsepower?  

Our goal is to answer the question, ***which broad weather variable, when measured today, is most predictive of rain tomorrow?*** By "broad", we mean ideas of "temperature" and "windiness" rather than "temperature at 3 pm" and "maximum wind gust speed". These broad variables are things the average person would have an intuitive sense of or could look up quickly: the hope is to find a simple variable that can serve as an effective heuristic for predicting rain in our daily lives.

We will use the **"Rain in Australia"** dataset, publicly available on Kaggle<sup>2</sup>. It contains more than 140,000 weather observations gathered from locations across Australia over a span of 10 years. Each row contains temperature, rainfall, wind, humidity, pressure, and cloud cover measurements for the day. The target variable is `RainTomorrow`, a Boolean: `Yes` if it rained at least 1 mm the day after, and `No` otherwise.

## Preliminary EDA

### Loading libraries

In [1]:
suppressMessages({
    library(tidyverse)
    library(tidymodels)
    library(repr)
    library(forcats)
})
options(repr.matrix.max.rows = 6)


### Reading and cleaning data

The dataset was downloaded from Kaggle and then uploaded to GitHub. We load in the data from the raw URL and preview the columns using `glimpse`:

In [2]:
weather <- read_csv("https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv")
glimpse(weather)

Parsed with column specification:
cols(
  .default = col_double(),
  Date = col_date(format = ""),
  Location = col_character(),
  Evaporation = col_logical(),
  Sunshine = col_logical(),
  WindGustDir = col_character(),
  WindDir9am = col_character(),
  WindDir3pm = col_character(),
  RainToday = col_character(),
  RainTomorrow = col_character()
)

See spec(...) for full column specifications.

“153782 parsing failures.
 row         col           expected actual                                                                              file
6050 Evaporation 1/0/T/F/TRUE/FALSE   12   'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6050 Sunshine    1/0/T/F/TRUE/FALSE   12.3 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6051 Evaporation 1/0/T/F/TRUE/FALSE   14.8 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/w

Rows: 145,460
Columns: 23
$ Date          <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008-1…
$ Location      <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Albu…
$ MinTemp       <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1,…
$ MaxTemp       <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9, 3…
$ Rainfall      <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, …
$ Evaporation   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Sunshine      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ WindGustDir   <chr> "W", "WNW", "WSW", "NE", "W", "WNW", "W", "W", "NNW", "…
$ WindGustSpeed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 44,…
$ WindDir9am    <chr> "W", "NNW", "W", "SE", "ENE", "W", "SW", "SSE", "SE

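We encounter many parsing failures. Upon closer inspection, `readr` guessed the column types from the first 1,000 rows (its default), where `Evaporation` and `Sunshine` are missing, and so parsed both columns as logical vectors when they are actually doubles. We coerce both variables to numeric using `as.numeric`; note that this only changes the column type, since every value that failed to parse has already been replaced with NA. Furthermore, we convert the class labels `RainToday` and `RainTomorrow` from characters to factors.  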

At this point we want to see if the missing values will become problematic. We use `colSums` to print the list of how many NAs are in each column.

In [3]:
weather_not_tidy <- weather %>%
    # Fixing initial parsing error
    mutate(Evaporation = as.numeric(Evaporation), Sunshine = as.numeric(Sunshine)) %>%
    # Converting class labels to factors
    mutate(RainToday = as_factor(RainToday), RainTomorrow = as_factor(RainTomorrow))

# How many NAs are there?
colSums(is.na(weather_not_tidy))
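
One caveat on the coercion above: because the failed values were already turned into NA at read time, `as.numeric` cannot bring them back. A possible alternative, sketched here on a small temporary file rather than the real dataset, is to declare the two column types when reading. The file contents and the name `demo` are made up for illustration; the same `col_types` argument could be passed to the `read_csv` call on `weatherAUS.csv`.

```r
library(readr)

# Tiny stand-in for weatherAUS.csv (hypothetical values)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
    "Date,Evaporation,Sunshine",
    "2008-12-01,NA,NA",
    "2008-12-02,12,12.3",
    "2008-12-03,14.8,9.1"
), tmp)

# Declare the two problem columns as doubles; let readr guess the rest
demo <- read_csv(
    tmp,
    col_types = cols(
        .default = col_guess(),
        Evaporation = col_double(),
        Sunshine = col_double()
    )
)

# Only the genuinely missing value is NA
sum(is.na(demo$Evaporation))
```

Read this way, the NA counts would reflect only values that are truly missing in the file, which may change how usable these two columns look.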

As loaded, the `Evaporation` and `Sunshine` columns have NAs in more than 95% of rows (the parsing failures above account for many of these), so we drop them. On the other hand, we feel comfortable keeping all the other numerical variables, even `Cloud9am` and `Cloud3pm`, and instead dropping the *rows* that have NA values for them. Although we will lose data, we will still end up with more than enough for training and testing. We accomplish this using `drop_na`. 

At this point, we split our data 75:25 into training and testing sets, then check that our manipulation was successful:

In [4]:
weather_tidy <- weather_not_tidy %>%
    # Drop Evaporation and Sunshine
    select(-Evaporation, -Sunshine) %>%
    # Get rid of all rows with NAs in variables we might use
    drop_na(MinTemp, MaxTemp, Rainfall, WindGustSpeed, WindSpeed9am:RainTomorrow)
    
set.seed(1)
weather_split <- initial_split(weather_tidy, prop = 0.75, strata = RainTomorrow)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)

# Are our class variables factors?
class(weather_train$RainToday)
class(weather_train$RainTomorrow)

# Do we have enough training and testing data, in the right proportions?
nrow(weather_train)
nrow(weather_test)

# How bad is the class imbalance?
summary(weather_train$RainTomorrow)

# How many NAs are there?
colSums(is.na(weather_train))

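The class variable `RainTomorrow` has been successfully turned into a factor. There is some class imbalance, but at a ratio of about 3.3:1 it should not be too problematic for KNN analysis. It does mean that always predicting `No` would already be right about 77% of the time (3.3 / 4.3), which is a useful reference point when judging our classifiers. Most NAs have been eliminated: we now have around 55,000 rows in our training set and 18,000 in the testing set.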

### Tables

- **Using only training data, summarize the data in at least one table**
- **An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data.**
- *Maybe we could make a chart where each row is a different variable, and there are four columns: `variable`, `mean(no rain tomorrow)`, `mean(rain tomorrow)`, and `difference`: the rows with the biggest `difference` would be the most useful predictors*
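
A rough sketch of the table idea in the last bullet, assuming a small made-up tibble `toy_train` in place of `weather_train` (the values and the two chosen columns are illustrative only):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for weather_train
toy_train <- tibble(
    RainTomorrow = factor(c("No", "No", "Yes", "Yes")),
    Humidity3pm = c(30, 40, 70, 80),
    Pressure3pm = c(1020, 1018, 1008, 1010)
)

class_means <- toy_train %>%
    # One row per (observation, variable) pair
    pivot_longer(-RainTomorrow, names_to = "variable", values_to = "value") %>%
    # Mean of each variable within each class
    group_by(variable, RainTomorrow) %>%
    summarize(mean = mean(value), .groups = "drop") %>%
    # One column per class: mean_No and mean_Yes
    pivot_wider(names_from = RainTomorrow, values_from = mean, names_prefix = "mean_") %>%
    mutate(difference = mean_Yes - mean_No)

class_means
```

Raw differences are in each variable's own units, so comparing them across variables would probably require standardizing the columns first.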

### Visualizations

- **Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do**
- **An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.**
- We should do ***frequency polygons*** of the 10 relevant variables (from the Methods section) like we saw in tutorial activity 4: e.g. for `Temp9am`, have one line representing the distribution of `Temp9am` among rows where it rained the next day and another line representing the distribution of `Temp9am` among rows where it did not
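
The frequency polygon idea could look roughly like this for a single variable. The data frame `toy_train` is simulated purely for illustration, and plotting density rather than counts is our assumption, chosen so the smaller `Yes` class is not visually flattened by the class imbalance.

```r
library(ggplot2)

# Simulated stand-in for weather_train (values are not real observations)
set.seed(1)
toy_train <- data.frame(
    Temp9am = c(rnorm(100, mean = 18, sd = 5), rnorm(100, mean = 15, sd = 4)),
    RainTomorrow = factor(rep(c("No", "Yes"), each = 100))
)

# One line per class; density puts both classes on a comparable scale
temp_plot <- ggplot(toy_train, aes(x = Temp9am, colour = RainTomorrow)) +
    geom_freqpoly(aes(y = after_stat(density)), bins = 30) +
    labs(x = "Temperature at 9 AM (°C)", y = "Density", colour = "Rain tomorrow?")

temp_plot
```

The same pattern would be repeated for each predictor, or the data could be reshaped to long form and faceted by variable.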

## Methods

It seems that pressure, humidity, temperature, wind speed, and cloud cover all have some relationship with `RainTomorrow`. They are all numeric variables and each comes as a pair of measurements (taken at 9 AM and 3 PM). Thus we plan to build five separate KNN classifiers, one for each of 5 broad weather variables:
1. **Temperature** (`Temp9am` and `Temp3pm`)
2. **Humidity** (`Humidity9am` and `Humidity3pm`)
3. **Pressure** (`Pressure9am` and `Pressure3pm`)
4. **Windiness** (`WindSpeed9am` and `WindSpeed3pm`)
5. **Cloudiness** (`Cloud9am` and `Cloud3pm`)  

We will not use `Date`, `Location`, or any of the wind direction variables since we lack the tools to incorporate categorical predictors. Although `MinTemp`, `MaxTemp`, and `WindGustSpeed` are numerical, we decided not to use them, since the 9 AM and 3 PM measurements are sufficient to capture "temperature" and "windiness" respectively and we want to avoid unnecessary complexity.  
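
As a sketch of what one of the five classifiers might look like, here is a humidity-only KNN model tuned with 10-fold cross-validation. Everything beyond the proposal's own plan is an assumption: the data are simulated, the `kknn` engine must be installed, the predictors are standardized, and the grid of K values is arbitrary.

```r
library(tidymodels)

# Simulated stand-in for weather_train (not real observations)
set.seed(1)
toy_train <- tibble(
    Humidity9am = c(rnorm(100, 60, 10), rnorm(100, 80, 10)),
    Humidity3pm = c(rnorm(100, 40, 10), rnorm(100, 70, 10)),
    RainTomorrow = factor(rep(c("No", "Yes"), each = 100))
)

# Standardize predictors so neither dominates the distance calculation
humidity_recipe <- recipe(RainTomorrow ~ Humidity9am + Humidity3pm, data = toy_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# KNN specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) %>%
    set_engine("kknn") %>%
    set_mode("classification")

# 10-fold cross-validation, stratified by the class label
folds <- vfold_cv(toy_train, v = 10, strata = RainTomorrow)
k_grid <- tibble(neighbors = seq(1, 21, by = 4))

knn_results <- workflow() %>%
    add_recipe(humidity_recipe) %>%
    add_model(knn_spec) %>%
    tune_grid(resamples = folds, grid = k_grid) %>%
    collect_metrics() %>%
    filter(.metric == "accuracy")

knn_results
```

Swapping the two predictor names in the recipe formula would give the other four classifiers, and the `mean` column of `knn_results` against `neighbors` is what a K versus accuracy plot would show.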

After building and tuning each model for the best performance using 10-fold cross-validation, we will identify the best-performing classifier (most predictive variable) for `RainTomorrow`. Next, we will build a classifier that integrates all the numeric variables in the dataset as predictors. This will let us compare the effectiveness of our chosen "heuristic" variable with the "best accuracy" we can achieve from our data. Since KNN gets slow with many predictors, we might choose a different, more efficient algorithm. Finally, we will compare these two results to the "naive approach" of predicting it will rain tomorrow if it rained today, and vice versa, to see how much value our models actually have in a real-world setting.  
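
The naive approach needs no model at all, so its accuracy can be computed directly. A minimal sketch, assuming a made-up tibble `toy_test` in place of `weather_test`:

```r
library(dplyr)

# Hypothetical stand-in for weather_test
toy_test <- tibble(
    RainToday = factor(c("No", "No", "Yes", "Yes", "No")),
    RainTomorrow = factor(c("No", "Yes", "Yes", "No", "No"))
)

naive_accuracy <- toy_test %>%
    # Compare as character so differing factor level orders cannot cause an error
    mutate(correct = as.character(RainToday) == as.character(RainTomorrow)) %>%
    summarize(accuracy = mean(correct)) %>%
    pull(accuracy)

naive_accuracy
```

On this toy data three of the five days are predicted correctly, giving an accuracy of 0.6.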

**Possible visualizations include:**
- K versus accuracy plot for all 5 KNN classifiers
- Bar graph comparing the accuracy of all 7 classifiers
- Confusion matrix for the best single classifier and for the complete classifier

## Expected outcomes and significance

We expect to find that our best single variable classifier will be able to predict `RainTomorrow` with a high degree of accuracy, perhaps around 80%. We expect that the classifier incorporating all the predictors will perform better, but not *that* much better considering the additional computational expense. We expect both classifiers to be better than the naive approach.

When we want to know if it will rain tomorrow, most of us just check the forecast on our phones. Still, finding variables highly associated with rain tomorrow could be useful whenever we forget to or are unable to check the forecast. Plus, most forecasts put a percentage on the chance of rain, so having a sense of what variables are actually *behind* those percentages could be useful. But the most significant outcome will be predicting rain accurately despite using *significantly* less computing power and data than contemporary weather models.  

This would raise several questions for future exploration. If we were able to be accurate with limited computing power, are there more computationally efficient methods available to meteorologists if they *only* had to forecast rain? If rain can be predicted with just a few variables, how many of the sensors installed in weather stations are truly *necessary*? Finally, how effective would our approach be at regression, predicting the *amount* of rainfall tomorrow?

### References

[1] https://www.nationalgeographic.com/environment/article/weather-forecasting  
[2] https://www.kaggle.com/jsphyg/weather-dataset-rattle-package?select=weatherAUS.csv