## Rain EDA Notebook
### Dylan M Jones
### 2/11/2021

This notebook examines the structure of a weather dataset from Kaggle. The dataset contains weather observations from various stations in Australia, and the objective is to build a model that predicts precipitation the following day.

This dataset looks fun, because we can start with initial assumptions and check whether they hold true as we explore the variables. It also lays a foundation for asking new questions of the data, or for adding external data to the set as we dive deeper and wish to enhance the performance of the model.

## Table of Contents

1. [Inspiration](#1)

## 1. Inspiration <a class="anchor" id = "1"></a>

In [None]:
## Initial library setup

import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.model_selection import train_test_split 
%matplotlib inline


## Load data
rawWeather = pd.read_csv("weatherAUS.csv")


In [None]:
## Check table
rawWeather['Date'] = pd.to_datetime(rawWeather['Date'])
rawWeather['Month'] = rawWeather['Date'].dt.month
rawWeather['Day'] = rawWeather['Date'].dt.day
rawWeather['Year'] = rawWeather['Date'].dt.year
rawWeather['Quarter'] = rawWeather['Date'].dt.quarter
rawWeather['Weekday'] = rawWeather['Date'].dt.strftime('%A')
rawWeather['WeekNumber'] = rawWeather['Date'].dt.strftime('%U')

print(rawWeather.shape)
print(rawWeather.columns)
# print(rawWeather.info())

This looks promising. The raw file has 145,460 rows and 23 columns; the shape printed above also counts the six date-derived columns added in this cell. Of the 23 original columns, 22 are fields we can use to try to predict the last one ('RainTomorrow').
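A shape check does not show how complete each column is. The sketch below is one way to summarise dtypes and missing-value fractions per column. It runs on a tiny made-up frame whose column names mirror the dataset; `missing_summary` is a hypothetical helper, not part of pandas.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame: column names mirror weatherAUS.csv, values are made up
demo = pd.DataFrame({
    "Location": ["Melbourne", "Ballarat", "Wollongong", "Melbourne"],
    "Rainfall": [0.0, np.nan, 12.4, 1.2],
    "Sunshine": [np.nan, np.nan, 6.1, 8.3],
    "RainTomorrow": ["No", "Yes", None, "No"],
})

def missing_summary(df):
    """Return the dtype and fraction of missing values per column, worst first."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_frac": df.isna().mean(),
    })
    return summary.sort_values("missing_frac", ascending=False)

summary = missing_summary(demo)
print(summary)
```

On the real data the same helper could be called as `missing_summary(rawWeather)`.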


Before we move much further in the EDA, we should split the dataset so that a held-out set is reserved for validation and stays untouched during exploration.

In [None]:
# Separate the predictors from the target, then hold out 30% of the rows for testing
X = rawWeather.drop(columns='RainTomorrow')
y = rawWeather['RainTomorrow']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print('X_train is: ' + str(X_train.shape))
print('y_train is: ' + str(y_train.shape))
print('X_test is: ' + str(X_test.shape))
print('y_test is: ' + str(y_test.shape))
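The split above is purely random. If 'RainTomorrow' turns out to be imbalanced or to contain missing labels (both are assumptions to verify on the real data), a stratified split on the labelled rows keeps the class proportions the same in both partitions. A minimal sketch on synthetic data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for rawWeather: 100 rows, about 20% "Yes"
demo = pd.DataFrame({
    "Rainfall": range(100),
    "RainTomorrow": (["Yes"] * 2 + ["No"] * 8) * 10,
})
demo.loc[[0, 1, 2, 3, 4], "RainTomorrow"] = None  # pretend a few labels are missing

# Rows without a label cannot be used for supervised training or scoring
labeled = demo.dropna(subset=["RainTomorrow"])
X = labeled.drop(columns="RainTomorrow")
y = labeled["RainTomorrow"]

# stratify=y preserves the Yes/No ratio in both the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

train_rate = (y_tr == "Yes").mean()
test_rate = (y_te == "Yes").mean()
print(train_rate, test_rate)
```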

# Data Exploration

## Univariate analysis

## Bivariate Analysis

## Multivariate Analysis

# Feature Engineering

## Numerical Feature Engineering: Missing Values and Outliers

## Categorical Feature Engineering: Missing Values and One-hot encoding

# Model Training and Output

# Model Performance Assessment

## Accuracy vs Recall vs Specificity

## Confusion Matrix

## ROC Curve and AUC

# Enhancing the Dataset with Geocoding and topology data

In [None]:
## Geocoding the Locations

In [None]:
## Gathering topology features

In [None]:
## New Feature analysis

In [None]:
## New model training and output

In [None]:
## Conclusions and Future Directions

trainSet = pd.concat([X_train, y_train], axis = 1)
# print(trainSet.info())

# plt.figure(figsize = (14,5) )
# sns.barplot(data = trainSet, x = 'Quarter', y = 'Rainfall', hue = 'Location', alpha = 0.7)
# plt.show()

Interesting. Just looking at a handful of locations, we can see that rainfall is seasonal, with more rain in the late Spring/early Summer (Nov and Dec), and some occasional deluges in March and June for Wollongong. Ballarat and Melbourne are relatively far South in the state of Victoria, in an arid, temperate region of Australia. Wollongong is in New South Wales, which is slightly more tropical, but not as much as the Northern coast of the country.
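The seasonal pattern described above can also be checked numerically by averaging rainfall per location and month. The sketch below uses a small made-up frame; on the real data, `trainSet` would take the place of `demo`.

```python
import pandas as pd

# Made-up rainfall readings (mm) for two locations in March and November
demo = pd.DataFrame({
    "Location": ["Melbourne"] * 4 + ["Wollongong"] * 4,
    "Month": [3, 3, 11, 11, 3, 3, 11, 11],
    "Rainfall": [0.0, 2.0, 4.0, 6.0, 1.0, 3.0, 10.0, 20.0],
})

# One row per month, one column per location, mean rainfall in each cell
seasonal = demo.pivot_table(
    index="Month", columns="Location", values="Rainfall", aggfunc="mean"
)
print(seasonal)
```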

Let's run a pairplot on this smaller subset to see which fields interact with each other, and with the target variable.

In [None]:
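# .copy() avoids pandas' SettingWithCopyWarning when the new column is added below
smallW = trainSet[trainSet['Location'].isin(['Melbourne','Wollongong','Ballarat'])].copy()

# log1p (log(1 + x)) keeps dry days at 0; np.log would return -inf where Rainfall == 0
smallW['LogRainfall'] = np.log1p(smallW['Rainfall'])

feature_list = ['Location','LogRainfall','Sunshine','MaxTemp','Month','RainTomorrow']

# pairplot creates its own figure and plots only the numeric columns, so size it via `height`
sns.pairplot(smallW[feature_list], hue = "RainTomorrow", height = 3.5)
plt.show()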

## Basic Modeling and Results
Now that we've taken a look at the core features, and applied some transformations to achieve roughly normal distributions, we can run a decision tree-based classifier on the data to see how well these features predict rain the next day.

We won't focus too much on tuning hyperparameters, or aggregating across models, as our goal is to identify external data that might improve the dataset with better features.

Several people on Kaggle have achieved an accuracy of about 86% using Random Forests on this dataset, so let's start there.
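As a starting point, a Random Forest baseline could look roughly like the sketch below. It runs on synthetic data with a made-up labelling rule, so its accuracy says nothing about the real dataset; the feature names are borrowed from the pairplot above, and the hyperparameters are untuned assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 400

# Synthetic predictors; distributions are arbitrary and only for illustration
demo = pd.DataFrame({
    "Sunshine": rng.uniform(0, 12, n),
    "MaxTemp": rng.normal(22, 5, n),
    "Rainfall": rng.exponential(2.0, n),
})

# Made-up rule: more rain and less sunshine today make rain tomorrow more likely
score = demo["Rainfall"] - 0.5 * demo["Sunshine"] + rng.normal(0, 1, n)
demo["RainTomorrow"] = np.where(score > 0, "Yes", "No")

X = demo.drop(columns="RainTomorrow")
y = demo["RainTomorrow"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Untuned forest; n_estimators=200 is an arbitrary choice
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_tr, y_tr)

preds = model.predict(X_te)
acc = accuracy_score(y_te, preds)
print(f"Accuracy on synthetic hold-out: {acc:.3f}")
```

On the real data the forest would need the missing values and categorical columns handled first, since it expects numeric, complete inputs.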

## Providing context to locations
Despite the wealth of atmospheric data gathered from these stations, there is nothing in here that describes the physical location of these places. Our exploration reveals large differences in rainfall based on region, with the tropical stations collecting far more rain than the arid stations. While a basic mapping of location to State or 