# Project Proposal - Will it Rain Tomorrow?

## Introduction

- Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal
- Clearly state the question you will try to answer with your project
- Identify and describe the dataset that will be used to answer the question

Modern weather forecasting feeds millions of current weather observations from across the globe into complex computer models built on decades of atmospheric physics research<sup>1</sup>.  
While we can't hope to model the weather in full, answering the simple yet important question of whether it will rain tomorrow might be within our reach.  
The dataset is from Australia ... 140,000 observations

## Preliminary EDA

- Demonstrate that the dataset can be read from the web into R 
- Clean and wrangle your data into a tidy format
- Using only training data, summarize the data in at least one table 
- Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do

### Loading libraries

```r
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
suppressMessages(library(repr))
suppressMessages(library(forcats))
options(repr.matrix.max.rows = 6)
```

### Reading and cleaning data

The dataset from Kaggle<sup>2</sup> was uploaded to GitHub, from which we can read the raw file:

```r
weather <- read_csv("https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv")
```

```
Parsed with column specification:
cols(
  .default = col_double(),
  Date = col_date(format = ""),
  Location = col_character(),
  Evaporation = col_logical(),
  Sunshine = col_logical(),
  WindGustDir = col_character(),
  WindDir9am = col_character(),
  WindDir3pm = col_character(),
  RainToday = col_character(),
  RainTomorrow = col_character()
)

See spec(...) for full column specifications.

153782 parsing failures.
 row         col           expected actual                                                                              file
6050 Evaporation 1/0/T/F/TRUE/FALSE   12   'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6050 Sunshine    1/0/T/F/TRUE/FALSE   12.3 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/weatherAUS.csv'
6051 Evaporation 1/0/T/F/TRUE/FALSE   14.8 'https://github.com/geoffreyyang/dsci100-002-group7/raw/main/data/w
```

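The parsing failures appear to arise because `read_csv` guessed `Evaporation` and `Sunshine` to be logical columns, presumably because their leading rows are all missing, so the numeric values that show up later (from row 6050 onward) fail to parse. These two columns should therefore be read as numeric.  
We still need to remove all rows with NAs, turn the class variables into factors, and split the data into training and testing datasets.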
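
The steps listed above could be sketched as follows. This is a minimal sketch, not a final implementation: the helper names `read_weather`, `clean_weather`, and `split_weather` are hypothetical, treating only `RainToday` and `RainTomorrow` as the class variables is an assumption, and the 75/25 split and the seed are arbitrary choices.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: read the CSV, forcing the two mis-guessed columns
# to numeric. (Passing a larger guess_max to read_csv is an alternative.)
read_weather <- function(src) {
  read_csv(
    src,
    col_types = cols(Evaporation = col_double(), Sunshine = col_double())
  )
}

# Hypothetical helper: drop rows containing any NA and convert the rain
# labels to factors.
# Assumption: "class variables" means RainToday and RainTomorrow.
clean_weather <- function(df) {
  df %>%
    drop_na() %>%
    mutate(
      RainToday = as_factor(RainToday),
      RainTomorrow = as_factor(RainTomorrow)
    )
}

# Hypothetical helper: split into training and testing sets, stratified
# on the label. The proportion and seed are arbitrary assumptions.
split_weather <- function(df, prop = 0.75, seed = 2021) {
  set.seed(seed)
  initial_split(df, prop = prop, strata = RainTomorrow)
}
```

Calling `split_weather(clean_weather(read_weather(url)))` on the GitHub URL used earlier, followed by `training()` and `testing()` on the result, would then yield the two datasets.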

### Tables
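
A summary table of the training data might look like the sketch below. The helper name `summarise_by_rain` is hypothetical, and the four variables summarized are an arbitrary selection from the columns named in this proposal; `train` is assumed to be the training data frame with `RainTomorrow` already converted to a factor.

```r
library(tidyverse)

# Hypothetical helper: number of observations and class-wise means of a
# few predictors, grouped by whether it rained the next day.
# Assumption: `train` holds only training data.
summarise_by_rain <- function(train) {
  train %>%
    group_by(RainTomorrow) %>%
    summarise(
      n = n(),
      mean_MinTemp = mean(MinTemp, na.rm = TRUE),
      mean_MaxTemp = mean(MaxTemp, na.rm = TRUE),
      mean_Humidity3pm = mean(Humidity3pm, na.rm = TRUE),
      mean_Pressure3pm = mean(Pressure3pm, na.rm = TRUE)
    )
}
```

The `n` column also shows how balanced the two classes are, which matters when interpreting accuracy later.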

### Visualizations

Histograms of the distributions of the relevant variables  
Scatterplot of two variables, coloured by whether or not it rained
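
The two plots described above could be sketched with ggplot2 as follows. The helper names `plot_histogram` and `plot_scatter` are hypothetical, colouring by `RainTomorrow` is an assumption about which rain variable is meant, and the bin count and transparency values are arbitrary.

```r
library(tidyverse)

# Hypothetical helper: histogram of one variable, with overlaid
# distributions for the two RainTomorrow classes.
plot_histogram <- function(train, var) {
  ggplot(train, aes(x = {{ var }}, fill = RainTomorrow)) +
    geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
    labs(y = "Count", fill = "Rain tomorrow?")
}

# Hypothetical helper: scatterplot of two variables, coloured by the label.
plot_scatter <- function(train, x, y) {
  ggplot(train, aes(x = {{ x }}, y = {{ y }}, colour = RainTomorrow)) +
    geom_point(alpha = 0.3) +
    labs(colour = "Rain tomorrow?")
}
```

For example, `plot_scatter(weather_train, Humidity9am, Humidity3pm)` would draw one of the variable pairs considered in the Methods section, assuming the training data is stored in a data frame called `weather_train`.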

## Methods

- Explain how you will conduct your data analysis and which variables/columns you will use
- Describe at least one way that you will visualize the results

We'll build KNN models for each of the following pairs of variables: MinTemp & MaxTemp, Sunshine & Cloud3pm, WindSpeed9am & WindSpeed3pm, Humidity9am & Humidity3pm, Pressure9am & Pressure3pm, Cloud9am & Cloud3pm, and Temp9am & Temp3pm, to find which pair is the most effective (accurate) at classifying RainTomorrow.  
We'll compare that to the accuracy of a KNN model that incorporates all the variables.  
Finally, we'll compare both to the "dumb approach": if it rained today it will rain tomorrow, and vice versa.  
We'll visualize the results using scatterplots.
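
The comparison described above could be sketched with tidymodels as follows. The helper names `knn_accuracy`, `baseline_accuracy`, and `compare_pairs` are hypothetical, the fixed `k = 5` is an arbitrary assumption (in practice it would likely be tuned by cross-validation), and the `kknn` package is assumed to be installed as the engine.

```r
library(tidyverse)
library(tidymodels)

# Hypothetical helper: test-set accuracy of a KNN classifier for
# RainTomorrow using the given predictor columns (standardized first).
knn_accuracy <- function(train, test, predictors, k = 5) {
  knn_recipe <- recipe(
    reformulate(predictors, response = "RainTomorrow"),
    data = train
  ) %>%
    step_scale(all_predictors()) %>%
    step_center(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = k) %>%
    set_engine("kknn") %>%
    set_mode("classification")

  knn_fit <- workflow() %>%
    add_recipe(knn_recipe) %>%
    add_model(knn_spec) %>%
    fit(data = train)

  predict(knn_fit, test) %>%
    bind_cols(test) %>%
    accuracy(truth = RainTomorrow, estimate = .pred_class) %>%
    pull(.estimate)
}

# Hypothetical helper: the "dumb approach" predicts that tomorrow's rain
# equals today's, so its accuracy is the share of matching labels.
baseline_accuracy <- function(test) {
  mean(as.character(test$RainToday) == as.character(test$RainTomorrow))
}

# Hypothetical helper: accuracy for each pair of predictors, where
# `pairs` is a list of length-two character vectors of column names.
compare_pairs <- function(train, test, pairs, k = 5) {
  tibble(
    predictors = map_chr(pairs, paste, collapse = " & "),
    accuracy = map_dbl(pairs, ~ knn_accuracy(train, test, .x, k))
  )
}
```

Passing a longer vector of column names to `knn_accuracy` would give the all-variables model, so the three approaches can be compared on the same testing data.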

## Expected outcomes and significance

- What do you expect to find?
- What impact could such findings have?
- What future questions could this lead to?

We expect to find one or two variables that classify RainTomorrow well,  
almost as well as a model incorporating all the relevant variables.  
This could serve as a useful heuristic in our daily lives: the one thing we should look at.  
Question: if we achieve decent accuracy with limited computing power, are there less computationally expensive ways to predict rain that meteorologists could use if they weren't required to forecast other weather variables?  
Question: how effective would our variables be at regression?

### References

[1] https://www.nationalgeographic.com/environment/article/weather-forecasting  
[2] https://www.kaggle.com/jsphyg/weather-dataset-rattle-package?select=weatherAUS.csv