## Section Two - To Rain Or Not To Rain  
**Author: Zak Hussain**    
**Date: 10/18/2019 - 11/01/2019**  
**Course:  ML 6140** 

**Purpose:**  
    Using the Rain in Australia Data Set, build a system to predict whether it is going to rain tomorrow. 
    
**Note:**  
* Exclude the variable Risk-MM when training a binary classification model. 
* Exclude RainTomorrow when training a regression model.

In [1]:
import numpy as np 
import pandas as pd

**Preprocessing**

In [2]:
# read in the csv file .
df = pd.read_csv('../Data/weatherAUS.csv') 

# Remove 'Risk-MM' as I am building a binary classification model. 
df.drop(columns='RISK_MM', inplace=True) 

# save all remaining feature names to a list. 
features = df.columns.values.tolist()

In [3]:
# check the type of missing information. 
df.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

In [4]:
# get the position of missing labels.
nan_label_indxs = df[df['RainTomorrow'].isnull()].index

# drop rows with missing labels, y, where y = 'RainTomorrow'.
df.drop(nan_label_indxs, inplace=True)

In [5]:
# TODO: split the date into three columns, preferably just key the month or year. not sure if the day is useful. 
df['Date'] = pd.to_datetime(df['Date'])

# split the date into a year, month, and day columns. 
df['Year'] = df["Date"].dt.year
df['Month'] = df["Date"].dt.month
df['Day'] = df["Date"].dt.day

#Then drop the date. This may result in  better classification. 
df.drop(columns="Date", inplace=True)

In [6]:
# check the types in the df, the next part in preprocessing will involve encoding non-categorical labels.
df.dtypes

Location          object
MinTemp          float64
MaxTemp          float64
Rainfall         float64
Evaporation      float64
Sunshine         float64
WindGustDir       object
WindGustSpeed    float64
WindDir9am        object
WindDir3pm        object
WindSpeed9am     float64
WindSpeed3pm     float64
Humidity9am      float64
Humidity3pm      float64
Pressure9am      float64
Pressure3pm      float64
Cloud9am         float64
Cloud3pm         float64
Temp9am          float64
Temp3pm          float64
RainToday         object
RainTomorrow      object
Year               int64
Month              int64
Day                int64
dtype: object

In [7]:
# seperate the names of the features based on categorical vs non-categorical 
types = pd.Series(df.dtypes)

non_categorical_features = []
categorical_features = [] 

for i in range(len(types.index)): 
    if types.values[i] == 'float64' or types.values[i] == 'int64': 
        non_categorical_features.append(types.index[i])
    else: 
        categorical_features.append(types.index[i])  

# drop the label from the collection of feature names. 
categorical_features.remove('RainTomorrow') 

In [8]:
# check column-wise distribution of null values. 
df[categorical_features].isnull().sum()

Location           0
WindGustDir     9330
WindDir9am     10013
WindDir3pm      3778
RainToday       1406
dtype: int64

In [9]:
# perform mode imputation on columns with null values.
for i in range(1, len(categorical_features)):
    df[categorical_features[i]] = df[categorical_features[i]].value_counts().index[0]

In [10]:
# ensure column-wise distribution of null values are zeroes across all columns. 
df[categorical_features].isnull().sum()

Location       0
WindGustDir    0
WindDir9am     0
WindDir3pm     0
RainToday      0
dtype: int64

In [11]:
# encode the categorical features, converting them to R-space using one-hot encoding
from sklearn.preprocessing import OneHotEncoder
df = pd.get_dummies(df, columns=categorical_features)

In [12]:
# check column-wise distribution of null values on non_categorical features 
df[non_categorical_features].isnull().sum()

MinTemp            637
MaxTemp            322
Rainfall          1406
Evaporation      60843
Sunshine         67816
WindGustSpeed     9270
WindSpeed9am      1348
WindSpeed3pm      2630
Humidity9am       1774
Humidity3pm       3610
Pressure9am      14014
Pressure3pm      13981
Cloud9am         53657
Cloud3pm         57094
Temp9am            904
Temp3pm           2726
Year                 0
Month                0
Day                  0
dtype: int64

In [13]:
# impute missing non-categorical values based on the mode. 
for i in range(len(non_categorical_features)-3):
    df[non_categorical_features[i]] = df[non_categorical_features[i]].value_counts().index[0]

In [14]:
# ensure column-wise distribution of null values on non_categorical features is zero 
df[non_categorical_features].isnull().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
Year             0
Month            0
Day              0
dtype: int64

In [15]:
# seperate the features from the labels. 
y = df['RainTomorrow']
X = df.loc[:, df.columns != 'RainTomorrow']

In [16]:
X[non_categorical_features].isnull().sum()

MinTemp          0
MaxTemp          0
Rainfall         0
Evaporation      0
Sunshine         0
WindGustSpeed    0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Cloud9am         0
Cloud3pm         0
Temp9am          0
Temp3pm          0
Year             0
Month            0
Day              0
dtype: int64

In [17]:
# convert the labels to binary, neg_label is 0, pos is 1 (it will rain). 
from sklearn import preprocessing
le = preprocessing.LabelBinarizer()
y = le.fit_transform(y)

In [18]:
# check a description of the numerical data 
X[non_categorical_features].describe()

Unnamed: 0,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustSpeed,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,Year,Month,Day
count,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0,142193.0
mean,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2012.758926,6.402544,15.715084
std,8.967081e-12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.704278e-09,0.0,0.0,0.0,0.0,0.0,2.541256,3.426506,8.79815
min,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2007.0,1.0,1.0
25%,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2011.0,3.0,8.0
50%,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2013.0,6.0,16.0
75%,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2015.0,9.0,23.0
max,9.6,20.0,0.0,4.0,0.0,35.0,9.0,13.0,99.0,52.0,1016.4,1015.5,7.0,7.0,17.0,20.0,2017.0,12.0,31.0


In [19]:
# split the data
from sklearn.model_selection import train_test_split

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3)

In [50]:
y_train = y_train.flatten()
y_test = y_test.flatten()

**Train a LogisticRegressionCV Model**   
Here I implement a LogisticRegressionCV model the incorporates cross-validation into training. Since the class labels are a binary output (it will rain or not rain), Logistic Regression techniques are cleary a good model to use because 'it works'. 

In [51]:
from sklearn.linear_model import LogisticRegressionCV

In [58]:
# train the logist regression model. 
clf = LogisticRegressionCV(cv=5)
clf.fit(X_train, y_train) 

LogisticRegressionCV(Cs=10, class_weight=None, cv=5, dual=False,
                     fit_intercept=True, intercept_scaling=1.0, l1_ratios=None,
                     max_iter=100, multi_class='warn', n_jobs=None,
                     penalty='l2', random_state=None, refit=True, scoring=None,
                     solver='lbfgs', tol=0.0001, verbose=0)

In [59]:
print("On the test data, the model has an accuracy of:", clf.score(X_test, y_test))

On the test data, the model has an accuracy of: 0.7736649631956491
