# Project Proposal - Will it Rain Tomorrow?

## Introduction

- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question

To predict the weather, government agencies divide Earth's entire atmosphere into thousands of rectangular cells: for each cell, they feed millions of current weather observations into powerful supercomputers to generate forecasts, repeating this process four times a day<sup>1</sup>. But what if the goal were not to predict "the weather", but simply to predict *whether or not it will rain tomorrow*? Can this simple yet valuable question be answered accurately using easily understood variables, fewer data points, and minimal computational expense?  

Our goal is to answer the question, ***which broad weather variable, when measured today, is most predictive of rain tomorrow?*** By "broad", we mean ideas like "temperature" and "windiness" rather than "temperature at 3 pm" and "maximum wind gust speed". These broad variables are things the average person could get a sense of as they go about their lives: if we find one that is strongly associated with rainfall, it could serve as a useful heuristic.

We will use the "Rain in Australia" dataset, publicly available on Kaggle<sup>2</sup>. It contains more than 140,000 weather observations gathered from locations across Australia over a span of 10 years. Each row contains temperature, rainfall, wind, humidity, pressure, and cloud cover measurements for the day. Although there are thousands of missing values (NAs), all rows have a binary value for the target variable, `RainTomorrow`: `Yes` if it rained at least 1 mm the day after, and `No` otherwise.

## Preliminary EDA

- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do

### Loading libraries

In [1]:
suppressPackageStartupMessages({
    library(tidyverse)
    library(tidymodels)
    library(repr)
    library(forcats)
})
options(repr.matrix.max.rows = 6)



### Reading and cleaning data

The dataset was downloaded from Kaggle and then uploaded to GitHub. We load in the data and preview the columns using `glimpse`:

In [2]:
weather <- read_csv("https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv")
glimpse(weather)

Parsed with column specification:
cols(
  .default = col_double(),
  Date = col_date(format = ""),
  Location = col_character(),
  Evaporation = col_logical(),
  Sunshine = col_logical(),
  WindGustDir = col_character(),
  WindDir9am = col_character(),
  WindDir3pm = col_character(),
  RainToday = col_character(),
  RainTomorrow = col_character()
)

See spec(...) for full column specifications.

“153782 parsing failures.
 row         col           expected actual                                                                              file
6050 Evaporation 1/0/T/F/TRUE/FALSE   12   'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6050 Sunshine    1/0/T/F/TRUE/FALSE   12.3 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6051 Evaporation 1/0/T/F/TRUE/FALSE   14.8 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/w

Rows: 145,460
Columns: 23
$ Date          <date> 2008-12-01, 2008-12-02, 2008-12-03, 2008-12-04, 2008-1…
$ Location      <chr> "Albury", "Albury", "Albury", "Albury", "Albury", "Albu…
$ MinTemp       <dbl> 13.4, 7.4, 12.9, 9.2, 17.5, 14.6, 14.3, 7.7, 9.7, 13.1,…
$ MaxTemp       <dbl> 22.9, 25.1, 25.7, 28.0, 32.3, 29.7, 25.0, 26.7, 31.9, 3…
$ Rainfall      <dbl> 0.6, 0.0, 0.0, 0.0, 1.0, 0.2, 0.0, 0.0, 0.0, 1.4, 0.0, …
$ Evaporation   <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Sunshine      <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ WindGustDir   <chr> "W", "WNW", "WSW", "NE", "W", "WNW", "W", "W", "NNW", "…
$ WindGustSpeed <dbl> 44, 44, 46, 24, 41, 56, 50, 35, 80, 28, 30, 31, 61, 44,…
$ WindDir9am    <chr> "W", "NNW", "W", "SE", "ENE", "W", "SW", "SSE", "SE

We encounter many parsing failures. On closer inspection, `readr` guessed that the `Evaporation` and `Sunshine` columns are logical vectors, most likely because their first rows are all NA, when they actually hold doubles. We coerce them to numeric using `as.numeric`; note that this only changes the column type, and any value that failed to parse has already been replaced by NA. Next, we convert the class labels `RainToday` and `RainTomorrow` from characters to factors.  
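
One way to avoid the logical-column guess altogether is to declare the column types when reading. The sketch below is illustrative only: the inline CSV is a hypothetical four-row stand-in for `weatherAUS.csv`, and only four of the real columns are included. With `col_types` supplied, `readr` does not need to guess, so numeric values such as `12.3` are kept rather than turned into NA.

```r
library(readr)

# Hypothetical miniature of weatherAUS.csv (made-up rows, not real data).
# As in the real file, Evaporation and Sunshine start with NA values.
csv_text <- "Date,Evaporation,Sunshine,RainTomorrow
2008-12-01,NA,NA,No
2008-12-02,NA,NA,No
2008-12-03,12,12.3,Yes
2008-12-04,14.8,11,No
"

# Declare the types explicitly instead of relying on readr's guess
weather_demo <- read_csv(
    csv_text,
    col_types = cols(
        Date = col_date(),
        Evaporation = col_double(),
        Sunshine = col_double(),
        RainTomorrow = col_character()
    )
)

weather_demo
```

Raising the `guess_max` argument of `read_csv` is another option, since it makes `readr` inspect more rows before settling on a type.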

At this point we want to see if the missing values will become problematic. We `map` over the data frame to count the number of missing values in each column.

In [3]:
weather_not_tidy <- weather %>%
    # Fixing initial parsing error
    mutate(Evaporation = as.numeric(Evaporation), Sunshine = as.numeric(Sunshine)) %>%
    # Converting class labels to factors
    mutate(RainToday = as_factor(RainToday), RainTomorrow = as_factor(RainTomorrow))

map(weather_not_tidy, ~sum(is.na(.)))

After this step, the `Evaporation` and `Sunshine` columns have NAs in more than 95% of rows, likely in large part because of the parsing failures above, so we treat them as unusable and drop them. On the other hand, we feel comfortable keeping the other numerical variables and instead dropping the *rows* that have NA values, because even after doing so, we will still have plenty of data. We accomplish this using `drop_na`.  
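
To compare columns on the same scale, the NA counts can be expressed as proportions. The following is a minimal sketch on a made-up tibble (`toy` and its values are hypothetical, not taken from the dataset); the same pipeline could be applied to `weather_not_tidy`.

```r
library(dplyr)
library(tidyr)

# Made-up data standing in for weather_not_tidy
toy <- tibble(
    MinTemp = c(13.4, 7.4, NA, 9.2),
    Evaporation = c(NA, NA, NA, 12),
    RainTomorrow = c("No", "No", "Yes", "No")
)

# Proportion of missing values per column, largest first
na_share <- toy %>%
    summarise(across(everything(), ~ mean(is.na(.x)))) %>%
    pivot_longer(everything(), names_to = "column", values_to = "prop_na") %>%
    arrange(desc(prop_na))

na_share
```

Sorting by `prop_na` puts the columns that are candidates for dropping at the top of the table.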

At this point we split our data 75:25 into training and testing sets, and use `summary` to check the results of our data wrangling on the training set.

In [4]:
weather_tidy <- weather_not_tidy %>%
    select(-Evaporation, -Sunshine) %>%
    drop_na(MinTemp, MaxTemp, Rainfall, WindGustSpeed, WindSpeed9am:RainTomorrow)
    
set.seed(1)
weather_split <- initial_split(weather_tidy, prop = 0.75, strata = RainTomorrow)
weather_train <- training(weather_split)
weather_test <- testing(weather_split)

summary(weather_train)

      Date              Location            MinTemp         MaxTemp     
 Min.   :2007-11-02   Length:55059       Min.   :-6.70   Min.   : 4.10  
 1st Qu.:2010-09-30   Class :character   1st Qu.: 8.10   1st Qu.:18.00  
 Median :2012-11-18   Mode  :character   Median :12.80   Median :23.30  
 Mean   :2012-12-19                      Mean   :13.03   Mean   :23.69  
 3rd Qu.:2015-02-04                      3rd Qu.:18.00   3rd Qu.:29.20  
 Max.   :2017-06-25                      Max.   :31.40   Max.   :48.10  
    Rainfall       WindGustDir        WindGustSpeed     WindDir9am       
 Min.   :  0.000   Length:55059       Min.   :  9.00   Length:55059      
 1st Qu.:  0.000   Class :character   1st Qu.: 31.00   Class :character  
 Median :  0.000   Mode  :character   Median : 39.00   Mode  :character  
 Mean   :  2.656                      Mean   : 40.67                     
 3rd Qu.:  1.000                      3rd Qu.: 48.00                     
 Max.   :367.600                      Max.   

Our data is now neat and tidy, ready for EDA!
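
Because the split is stratified on `RainTomorrow`, the share of rainy days should be nearly the same in the training and testing sets. Here is a rough sketch of how that could be checked, using synthetic data (`toy`, its 300/100 class balance, and the helper `class_share` are all hypothetical):

```r
library(dplyr)
library(rsample)

set.seed(1)

# Synthetic stand-in for weather_tidy: 300 dry days and 100 rainy days
toy <- tibble(
    Humidity3pm = runif(400, 0, 100),
    RainTomorrow = factor(rep(c("No", "Yes"), times = c(300, 100)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = RainTomorrow)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Hypothetical helper: class counts and proportions for one data frame
class_share <- function(df) {
    df %>%
        count(RainTomorrow) %>%
        mutate(prop = n / sum(n))
}

class_share(toy_train)
class_share(toy_test)
```

The same helper applied to `weather_train` would also give the class balance that any classifier has to beat by always predicting the majority class.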

### Tables

### Visualizations

We plan to include the following plots, built from the training data only:

- Histograms of the distributions of the relevant variables
- A scatterplot of two variables, coloured by whether or not it rained the next day
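
As a rough sketch of what these plots could look like, the code below draws both on synthetic data. The data frame `toy_train`, the choice of `Humidity3pm` and `Pressure3pm`, and the relationship between humidity and rain are assumptions made for illustration, not findings from the dataset.

```r
library(ggplot2)

set.seed(1)
n <- 300

# Synthetic stand-in for weather_train (values are made up)
toy_train <- data.frame(
    Humidity3pm = runif(n, 10, 100),
    Pressure3pm = rnorm(n, mean = 1015, sd = 7)
)
toy_train$RainTomorrow <- factor(
    ifelse(toy_train$Humidity3pm + rnorm(n, sd = 15) > 70, "Yes", "No"),
    levels = c("No", "Yes")
)

# Histogram of one variable, split by the class label
humidity_hist <- ggplot(toy_train, aes(x = Humidity3pm, fill = RainTomorrow)) +
    geom_histogram(binwidth = 5, alpha = 0.6, position = "identity") +
    labs(x = "Humidity at 3 pm", y = "Number of days", fill = "Rain tomorrow?")

# Scatterplot of two variables, coloured by the class label
pressure_humidity_scatter <- ggplot(
    toy_train,
    aes(x = Pressure3pm, y = Humidity3pm, colour = RainTomorrow)
) +
    geom_point(alpha = 0.5) +
    labs(x = "Pressure at 3 pm", y = "Humidity at 3 pm", colour = "Rain tomorrow?")

humidity_hist
pressure_humidity_scatter
```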

## Methods

- Explain how you will conduct your data analysis and which variables/columns you will use
- Describe at least one way that you will visualize the results

We will build K-nearest neighbours (KNN) classifiers, one for each of the following pairs of variables, to find which pair classifies `RainTomorrow` most accurately:

- `MinTemp` & `MaxTemp`
- `Sunshine` & `Cloud3pm`
- `WindSpeed9am` & `WindSpeed3pm`
- `Humidity9am` & `Humidity3pm`
- `Pressure9am` & `Pressure3pm`
- `Cloud9am` & `Cloud3pm`
- `Temp9am` & `Temp3pm`

We will then compare the best pair's accuracy against two reference points: a KNN model that incorporates all the variables, and the "dumb approach" of predicting that it will rain tomorrow if and only if it rained today. We will visualize the results using scatterplots.
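
The modelling plan in this section could be sketched roughly as follows for a single pair of variables. Everything here is an assumption made for illustration: the data is synthetic, `neighbors = 5` is an arbitrary placeholder rather than a tuned value, and the `kknn` package must be installed for the engine to work. The last step computes the accuracy of the "dumb approach" by treating `RainToday` as the prediction.

```r
library(tidymodels)

set.seed(1)
n <- 600

# Synthetic stand-in for weather_tidy (made-up values, not real weather)
toy <- tibble(
    Humidity9am = runif(n, 20, 100),
    Humidity3pm = runif(n, 10, 100),
    RainToday = factor(
        sample(c("No", "Yes"), n, replace = TRUE, prob = c(0.78, 0.22)),
        levels = c("No", "Yes")
    )
) %>%
    mutate(RainTomorrow = factor(
        if_else(Humidity3pm + rnorm(n, sd = 15) > 70, "Yes", "No"),
        levels = c("No", "Yes")
    ))

toy_split <- initial_split(toy, prop = 0.75, strata = RainTomorrow)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Standardize the predictors so neither dominates the distance calculation
knn_recipe <- recipe(RainTomorrow ~ Humidity9am + Humidity3pm, data = toy_train) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

# K = 5 is a placeholder; in practice K would be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) %>%
    set_engine("kknn") %>%
    set_mode("classification")

knn_fit <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    fit(data = toy_train)

# Accuracy of the KNN model on the held-out test set
knn_accuracy <- predict(knn_fit, toy_test) %>%
    bind_cols(toy_test) %>%
    accuracy(truth = RainTomorrow, estimate = .pred_class)

# Accuracy of the baseline: predict tomorrow's label from today's
baseline_accuracy <- toy_test %>%
    accuracy(truth = RainTomorrow, estimate = RainToday)

knn_accuracy
baseline_accuracy
```

Repeating the recipe and fit for each pair of variables, with the same split, would give one accuracy per pair to compare against the all-variables model and the baseline.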

## Expected outcomes and significance

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

We expect to find one or two variables whose models classify `RainTomorrow` almost as accurately as a model incorporating all the relevant variables. Such a variable could serve as a useful heuristic in our daily lives: the one thing we should look at when wondering whether it will rain tomorrow.

These findings could lead to further questions:

- If we achieve decent accuracy with limited computing power, are there less computationally expensive ways for meteorologists to predict rain when they are not required to forecast other weather variables?
- How effective would our variables be at regression?

### References

[1] https://www.nationalgeographic.com/environment/article/weather-forecasting  
[2] https://www.kaggle.com/jsphyg/weather-dataset-rattle-package?select=weatherAUS.csv