DataSet:
* Rain Prediction

In [None]:
!python --version

# **1. Import Libraries/Dataset**

## 1.1. Download the dataset
Upload the required data (CSV) file to the Colab file system.

In [None]:
from google.colab import files
uploaded = files.upload()

## 1.2. Import the required libraries

In [None]:
#!pip install category_encoders
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from keras.layers import Dense, Dropout
from keras.models import Sequential
from sklearn.metrics import confusion_matrix, classification_report
from keras import callbacks
import warnings
warnings.filterwarnings('ignore')


# **2. Data Visualization and Exploration**

Read the uploaded 'weatherAUS.csv' as a dataframe

In [None]:
df = pd.read_csv('weatherAUS.csv')

# 2.1. Print 10 rows
Sanity check to identify all the features present in the dataset and to confirm that the target column is among them.

In [None]:
df.head(10)

`df.shape` returns the number of rows and columns, respectively.

In [None]:
df.shape

There are 23 columns and about 145K rows

In [None]:
df.info()

There are 16 numeric type columns and 7 object (categorical) type columns

The "Date" column should be of datetime type

In [None]:
df['Date'] = pd.to_datetime(df['Date'])

Checking for null values in the dataset

In [None]:
(df.isnull().sum()/len(df) *100).sort_values(ascending=False)

Filling the null values of the categorical columns with the mode

In [None]:
objDF = (df.dtypes == "object")
categoricalIdx = list(objDF[objDF].index)
# Filling missing values with the mode of each column
for idx in categoricalIdx:
    # Assign back instead of inplace=True on a column selection, which newer pandas may not propagate to df
    df[idx] = df[idx].fillna(df[idx].mode()[0])

Filling the null values of the numerical columns with the median, because we do not know yet whether the features are skewed and the median is less sensitive to skew than the mean

In [None]:
numDF = (df.dtypes == "float64")
numericalIdx = list(numDF[numDF].index)
# Filling missing values with the median of each column
for idx in numericalIdx:
    # Assign back instead of inplace=True on a column selection, which newer pandas may not propagate to df
    df[idx] = df[idx].fillna(df[idx].median())


Checking again for null values in the dataset

In [None]:
df.isnull().sum()

There are no null values left after all of them have been replaced with appropriate values.
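
The imputation strategy above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the column names and values are made up for illustration, and it assumes pandas is installed.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (hypothetical values)
toy = pd.DataFrame({
    "WindDir": ["N", "W", None, "W"],
    "Rainfall": [0.0, None, 1.2, 30.0],
})

# Categorical column: fill with the most frequent value
toy["WindDir"] = toy["WindDir"].fillna(toy["WindDir"].mode()[0])
# Numeric column: fill with the median, which is not pulled up by the 30.0 value
toy["Rainfall"] = toy["Rainfall"].fillna(toy["Rainfall"].median())

print(toy)
print(toy.isnull().sum())
```

The missing direction becomes "W" (the mode) and the missing rainfall becomes 1.2 (the median of 0.0, 1.2 and 30.0), whereas the mean would have been about 10.4.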

# 2.2. Apply univariate analysis with target variables

Creating new dataframe for numerical and categorical variables

In [None]:
numericalDF = df.select_dtypes(np.number)
categoricalDF = df.select_dtypes('object')

Apply univariate analysis for numerical columns

In [None]:
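plt.rcParams['figure.figsize'] = 15,5
for col in numericalDF.columns:
    fig, ax = plt.subplots(1, 3)
    print(col, ':')
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(numericalDF[col], kde=True, stat='density', ax=ax[0], color='gray')
    # Pass the column explicitly as x so that each plot is drawn horizontally
    sns.boxplot(x=numericalDF[col], ax=ax[1], color='green')
    sns.violinplot(x=numericalDF[col], ax=ax[2], color='blue')
    plt.show()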

**Inferences from the univariate analysis of the numerical columns:**
* *mintemp* follows a normal distribution with outliers on both sides
* *maxtemp* is skewed more to the left than to the right
* *Rainfall* is right skewed; whenever there is rain, it takes high values
* *Evaporation* is highly right skewed, similar to *Rainfall*
* *Sunshine* has outliers on both sides: only a few days had moderate sunshine, while many days were cloudy and about as many were sunny
* *Windgustspeed* is right skewed, since the *windgustspeed* was very high only on rainy days
* *Windspeed9am* and *Windspeed3pm* are right skewed too
* *Humidity9am* is left skewed
* *Humidity3pm* is normally distributed
* *Pressure9am* has outliers on both sides and the same applies to *Pressure3pm*; the pressure is either too low or too high on rainy days
* *Cloud9am* and *Cloud3pm* follow an almost normal distribution
* *Temp9am* and *Temp3pm* follow similar distributions with outliers on both sides

Apply Univariate analysis for categorical columns

In [None]:
plt.rcParams['figure.figsize'] = 12,5
for col in categoricalDF:
    fig ,ax = plt.subplots(1,2)
    print(col,':')
    categoricalDF[col].value_counts().plot(kind='bar',rot=0, ax=ax[0],cmap='Purples_r')
    categoricalDF[col].value_counts().plot(kind='pie',autopct='%.1f%%',ax=ax[1],cmap='crest')
    plt.show()

**Inferences from the univariate analysis of the categorical columns**
* *Location*: almost all the locations contribute to the dataset equally
* *Windgustdir* is more often from the West than from any other direction
* *Winddir9am* is more often from the North than from any other direction
* *Winddir3pm* is more often from the South-East, followed almost equally by the West and South directions
* On most days there was no rain; rain occurred on only about 22% of the days.

## 2.3. Print each class label count and create charts for each class (% of data distribution)

In [None]:
plt.rcParams['figure.figsize'] = 20,12
for col in categoricalDF:      
    print(col,':')
    #categoricalDF[col].value_counts().plot(kind='bar',rot=0, cmap='crest')
    categoricalDF[col].value_counts().plot(kind='pie',autopct='%.1f%%', cmap='crest')
    #plt.xticks(rotation=90)
    plt.show()

**Data balancing analysis of the categorical columns**
* *Location*: almost all the locations contribute to the dataset equally. Hence the data is balanced
* *Windgustdir* is more often from the West than from any other direction. Hence the data is partially balanced
* *Winddir9am* is more often from the North than from any other direction. Hence the data is partially balanced
* *Winddir3pm* is more often from the South-East, followed almost equally by the West and South directions. Hence the data is partially balanced
* On most days there was no rain; rain occurred on only about 22% of the days. Hence the data in *RainToday* and *RainTomorrow* is imbalanced

**Exploring the date components to observe how the data is distributed over the "Date" column**

In [None]:
# Creating different columns based on the date feature, for further usage
df['Year'] = df['Date'].dt.year
df['Month'] = df.Date.dt.month
df['Day'] = df.Date.dt.day

df.head(5)

**Data distribution Of Days Over Year**

In [None]:
section = df[:365] 
tm = section["Day"].plot(color="Blue")
tm.set_title("Distribution Of Days Over Year")
tm.set_ylabel("Days In month")
tm.set_xlabel("Days In Year")

As per the above distribution, the date components are cyclical continuous features: the "Day" values restart with every "Month", and the months repeat in the same way with every "Year".
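
One common way to make such a cyclical feature usable by a model is to map it onto a circle with sine and cosine. The sketch below is only an illustration of that idea; this notebook does not apply it (the "Month" and "Day" columns are dropped before modelling), and the helper names `encode_month` and `distance` are hypothetical.

```python
import math

def encode_month(month):
    """Map a month in 1..12 to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12
    return math.sin(angle), math.cos(angle)

def distance(month_a, month_b):
    """Euclidean distance between two encoded months."""
    return math.dist(encode_month(month_a), encode_month(month_b))

# With the raw integers, December (12) and January (1) are 11 apart.
# On the circle they are as close as any two neighbouring months.
print(round(distance(12, 1), 4), round(distance(1, 2), 4))
# Months half a year apart sit on opposite sides of the circle.
print(round(distance(1, 7), 4))
```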

# **3. Data Pre-processing and cleaning**

# 3.1. Label encoding columns with categorical data

Apply label encoder to each column with categorical data

In [None]:
label_encoder = LabelEncoder()
# Get list of categorical variables
s = (df.dtypes == "object")
categoricalCols = list(s[s].index)
for i in categoricalCols:
    df[i] = label_encoder.fit_transform(df[i])
    
df.info()

In [None]:
df.head()

Label encoding converts the categorical features into integer codes, so that every column is numeric.
The encoded data can then be scaled in the next step, where the scaling parameters are learned from it.
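
To make the effect of label encoding concrete, here is a plain-Python sketch of the mapping it produces. It assumes, as scikit-learn's `LabelEncoder` does, that the distinct labels are sorted before codes are assigned; the helper `label_encode` is hypothetical and not part of the notebook.

```python
def label_encode(values):
    """Assign each distinct label an integer code, in sorted label order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

codes, classes = label_encode(["No", "Yes", "No", "No"])
print(classes)  # ['No', 'Yes']
print(codes)    # [0, 1, 0, 0]

wind_codes, wind_classes = label_encode(["W", "N", "SE", "N"])
print(wind_classes, wind_codes)
```

Under this assumption "No" maps to 0 and "Yes" to 1, so the positive class of the target is 1. For nominal features such as wind direction the integer order is arbitrary, which is a known limitation of label encoding.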

# 3.2. Perform the scaling of the features

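Standardizing the features so that the model will not be biased towards features with larger values. Note that the scaler in the cell below is fit on the full dataset, before the train/test split.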

In [None]:
#Identifying feature columns
features = df.drop(['RainTomorrow', 'Date', 'Day', 'Month'], axis=1)

#Identifying the target column
target = df['RainTomorrow']

#Set up a standard scaler for the features
col_names = list(features.columns)
s_scaler = preprocessing.StandardScaler()
features = s_scaler.fit_transform(features)
features = pd.DataFrame(features, columns=col_names) 
outliers = features.describe().T
outliers

Removes the mean and scales each feature/variable to unit variance
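
The transformation is z = (x - mean) / std, applied per column. A small worked example in plain Python, assuming the population standard deviation (dividing by n), which is what `StandardScaler` uses:

```python
import statistics

def standardize(values):
    """Return the z-scores of a list of numbers."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (divides by n)
    return [(v - mean) / std for v in values]

z = standardize([10.0, 12.0, 14.0, 16.0, 18.0])
print([round(v, 3) for v in z])
print(round(statistics.fmean(z), 6), round(statistics.pstdev(z), 6))
```

After scaling, the values have mean 0 and unit variance, so thresholds such as 2.0 in the outlier filter further down are expressed in standard deviations from the mean.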

# 3.3. Detecting outliers

Box plots are a visual method to identify outliers

In [None]:
features.plot(kind="box",subplots=True,layout=(8, 3), figsize=(15,25));
plt.show()

**Finding the Boundary Values**

In [None]:
def outlier_treatment(outliersDF, col):
  Q1 = outliersDF.loc[col, '25%']
  Q3 = outliersDF.loc[col, '75%']
  IQR = Q3 - Q1
  lower_range = Q1 - (1.5 * IQR)
  upper_range = Q3 + (1.5 * IQR)
  return lower_range,upper_range

t = (df.dtypes == "float64")
numericCols = list(t[t].index)
for column in numericCols:
  lowerbound, upperbound = outlier_treatment(outliers, column)
  # The lower bound is the lowest allowed value, the upper bound the highest
  print(f"Lowest allowed {column}:", lowerbound)
  print(f"Highest allowed {column}:", upperbound)
    

* The IQR method is used by the box plot to highlight outliers; the IQR is the difference between q3 (75th percentile) and q1 (25th percentile)
* The IQR method computes a lower bound and an upper bound to identify outliers.

* Lower Bound = `q1 - 1.5 * IQR`
* Upper Bound = `q3 + 1.5 * IQR`

* *Rainfall*, *Evaporation*, and *Sunshine* show little difference between the lower and upper bounds, hence taking the mean value
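
The bounds can be checked by hand on a small sample. The sketch below uses only the standard library; `iqr_bounds` is a hypothetical helper, and the `inclusive` quantile method is chosen because it interpolates linearly between data points, which should correspond to the default used by pandas `describe()`.

```python
import statistics

def iqr_bounds(values):
    """Return (lower, upper) outlier bounds using the 1.5 * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]
low, high = iqr_bounds(data)
outliers = [v for v in data if v < low or v > high]
print(low, high, outliers)
```

Here q1 = 3 and q3 = 7, so IQR = 4 and the bounds are -3 and 13; only the value 100 falls outside them.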

# 3.4. Dropping the outliers based on data analysis

In [None]:
features["RainTomorrow"] = target

In [None]:
features = features[(features["MinTemp"]<2.0)&(features["MinTemp"]>-2.0)]
features = features[(features["MaxTemp"]<2.0)&(features["MaxTemp"]>-1.8)]
features = features[(features["Rainfall"]<2.5)]
features = features[(features["Evaporation"]<2.3)]
features = features[(features["Sunshine"]<2.1)]
features = features[(features["WindGustSpeed"]<2.3)&(features["WindGustSpeed"]>-2.4)]
features = features[(features["WindSpeed9am"]<2.3)&(features["WindSpeed9am"]>-2.2)]
features = features[(features["WindSpeed3pm"]<2.5)&(features["WindSpeed3pm"]>-2.6)]
features = features[(features["Humidity9am"]<2.8)&(features["Humidity9am"]>-2.2)]
features = features[(features["Humidity3pm"]<2.2)&(features["Humidity3pm"]>-2.0)]
features = features[(features["Pressure9am"]< 2.0)&(features["Pressure9am"]>-2.2)]
features = features[(features["Pressure3pm"]< 2.0)&(features["Pressure3pm"]>-2.2)]
features = features[(features["Cloud9am"]<1.8)&(features["Cloud9am"]>-1.7)]
features = features[(features["Cloud3pm"]<2)&(features["Cloud3pm"]>-2.0)]
features = features[(features["Temp9am"]<2.0)&(features["Temp9am"]>-2.0)]
features = features[(features["Temp3pm"]<2.0)&(features["Temp3pm"]>-1.7)]


features.shape

Dropped the outliers based on above data analysis using IQR

In [None]:
features.plot(kind="box",subplots=True,layout=(8, 3), figsize=(15,25));
plt.show()

In [None]:
#looking at the scaled features without outliers

plt.figure(figsize=(20,12))
sns.boxenplot(data = features,palette = "Spectral")
plt.xticks(rotation=90)
plt.show()

After removing all outliers, the scaled features are looking good

# **4. Model Building**

# 4.1. Split the dataset into training and test sets.



**Get X and Y feature variables**

Assigning x and y variable in which the x feature variable has *independent* variables and the y feature variable has a *dependent* variable

In [None]:
x = features.drop(["RainTomorrow"], axis=1)
y = features["RainTomorrow"]

x.shape

In [None]:
y.shape

The shape of the above x dataframe is (104721, 22). The features columns are taken in the X variable and the outcome column is taken in the y variable. X and y variables are passed in the train_test_split() method to split the data frame into train and test sets.

We are going to split the dataset with two different ratios: 20% and 30% test data
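
Both splits below pass `stratify=y`, which keeps the share of each class the same in the train and test parts. The following is a simplified plain-Python illustration of that idea on made-up labels; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) that keep class proportions in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # same fraction from each class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [0] * 80 + [1] * 20  # 20% positives, an imbalanced target
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=42)
print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both parts keep the 80/20 class ratio, which matters for an imbalanced target such as *RainTomorrow*: a purely random split could leave the test set with too few rainy days.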

**Case 1** : Train = 80% Test = 20% [ x_train,y_train] = 80% ; [ x_test,y_test] = 20% ;

In [None]:
# Splitting test and training sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42, stratify=y)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

**Case 2** : Train = 70 % Test = 30% [ x_train1,y_train1] = 70% ; [ x_test1,y_test1] = 30% ;

In [None]:
# Splitting test and training sets
x_train1, x_test1, y_train1, y_test1 = train_test_split(x, y, test_size = 0.3, random_state = 64, stratify=y)

print(x_train1.shape, x_test1.shape, y_train1.shape, y_test1.shape)

# 4.2.a. Develop ANN Model

Create a Sequential model for a plain stack of layers where each layer has exactly one input tensor and one output tensor.

We have created a network with 22 input features, four hidden layers with three dropout layers, and one output layer to train our data.

* The first hidden layer takes the 22 input dimensions and has 64 neurons with the relu activation function
* The next hidden layer has 32 neurons with the relu activation function and takes the previous 64 dimensions as input
* To reduce overfitting, *Dropout* of 25% of the neurons regularizes the ANN
* The next hidden layer has 16 neurons with the relu activation function and takes the previous 32 dimensions as input
* To reduce overfitting, *Dropout* of 35% of the neurons regularizes the ANN
* The next hidden layer has 8 neurons with the relu activation function and takes the previous 16 dimensions as input
* To reduce overfitting, *Dropout* of 50% of the neurons regularizes the ANN
* The output layer has one node and uses the sigmoid activation function.
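
The number of trainable parameters can be worked out by hand and compared with the output of `model.summary()`. This sketch assumes the layer sizes from the `Sequential` definition in the next cell (`input_dim = 22`, then 64, 32, 16, 8 and 1 units); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units

layer_sizes = [22, 64, 32, 16, 8, 1]  # input width followed by the Dense units
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts)
print(sum(counts))
```

Dropout layers add no trainable parameters, so they do not appear in this count.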



In [None]:
model = Sequential()
model.add(Dense(units = 64, kernel_initializer = 'uniform', activation = 'relu', input_dim = 22))
model.add(Dense(units = 32, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.25))
model.add(Dense(units = 16, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.35))
model.add(Dense(units = 8, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.50))
model.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

model.summary()

model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

* The *Dense* function in Keras constructs a fully connected neural network layer, automatically initializing the weights and biases
* The output layer has the form of a logistic regression (LR) model, y = f(hW + b), where f is the sigmoid function and h is the output of the last hidden layer
* Its single sigmoid output can be read as the predicted probability of rain tomorrow
* The *compile* function configures the model for training by specifying the optimizer, the loss, and the metrics. The model hasn't been trained yet
* *Adam* optimizer computes individual learning rates for different parameters
* *Adam* uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network
* Since we are building a binary 0/1 classifier, the loss function to minimize is *binary_crossentropy*
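
The sigmoid output and the binary cross-entropy loss can be computed by hand for a single example. This is a minimal sketch of the formulas only, with made-up inputs, and not of how Keras evaluates them internally.

```python
import math

def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p):
    """Loss for one example with label y_true in {0, 1} and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p = sigmoid(2.0)  # a fairly confident "rain" prediction
print(round(p, 4))
print(round(binary_crossentropy(1, p), 4))  # small loss: prediction agrees with the label
print(round(binary_crossentropy(0, p), 4))  # large loss: prediction contradicts the label
```

The loss grows quickly when the model is confident and wrong, which is what pushes the predicted probabilities toward the true labels during training.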

# 4.2.b. Train the model and print the training accuracy and loss values

Create a callback to specify the performance measure to monitor, the trigger, and once triggered, it will stop the training process.

In [None]:
earlyStoppingCallbacks  = callbacks.EarlyStopping(
    # The monitored quantity defaults to the validation loss ('val_loss')
    # "no longer improving" being defined as "improving by less than 1e-4"
    min_delta = 0.0001, 
    # "no longer improving" being further defined as "for at least 5 epochs"
    patience=5,
    #Restore model weights from the epoch with the best value of the monitored quantity 
    restore_best_weights=True,
)

Using fit(), train the model by slicing the data into "*batches*" of size batch_size and repeatedly iterating over the entire dataset for a given number of *epochs*.
* Using *EarlyStopping*, stop training when the monitored validation metric is no longer improving
* With the default monitor, this callback stops the training when the validation loss has not improved by at least min_delta for 5 consecutive epochs
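
The stopping rule can be sketched in a few lines of plain Python. This is a simplified illustration of the `min_delta` and `patience` logic, not the Keras implementation; `early_stop_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stop_epoch(val_losses, min_delta=0.0001, patience=5):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:  # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

short_run = [0.50, 0.45, 0.44, 0.44, 0.45]
long_run = short_run + [0.44, 0.46, 0.44, 0.43]
print(early_stop_epoch(short_run))
print(early_stop_epoch(long_run))
```

In this sketch a 5-epoch run can never be stopped early with `patience=5`, because the first epoch always counts as an improvement and at most four non-improving epochs can follow. The same reasoning suggests that with `epochs = 5` in the `fit()` calls below the callback will not trigger, so a larger `epochs` value is needed for early stopping to have any effect.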

In [None]:
history = model.fit(x_train, y_train, batch_size = 64, epochs = 5, callbacks=[earlyStoppingCallbacks], validation_split=0.2)

In [None]:
print(history.history)

In [None]:
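# Note: this call continues training the same model instance on the 70/30 split;
# the weights learned in the previous fit() call are not reset
history1 = model.fit(x_train1, y_train1, batch_size = 64, epochs = 5, callbacks=[earlyStoppingCallbacks], validation_split=0.3)
print(history1.history)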

**A plot of accuracy on the training and validation datasets over training epochs**

In [None]:
plt.figure(figsize=(10,5))

plt.plot(history.history['accuracy'], "#BDE2E2", label='Training accuracy')
plt.plot(history.history['val_accuracy'], "#C2C4E2", label='Validation accuracy')

plt.title('Model Training and validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc="best")
plt.show()

From the above plot of the accuracy, we can see that the model could probably be trained a little more, as the trend for accuracy on both datasets is still rising for the last few epochs. We can also see that the model has not yet over-learned the training dataset, showing comparable skill on both datasets.

In [None]:
plt.figure(figsize=(10,5))

plt.plot(history1.history['accuracy'], "#BDE2E2", label='Training accuracy')
plt.plot(history1.history['val_accuracy'], "#C2C4E2", label='Validation accuracy')

plt.title('Model Training and validation Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend(loc="best")
plt.show()

From the above plot of the accuracy, the trend for accuracy on both datasets is still falling for the last few epochs.

**A plot of loss on the training and validation datasets over training epochs**

In [None]:
plt.figure(figsize=(10,5))

plt.plot(history.history['loss'], "#BDE2E2", label='Training loss')
plt.plot(history.history['val_loss'],"#C2C4E2", label='Validation loss')

plt.title('Model Training and validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc="best")
plt.show()

From the above plot of the loss, you can see that the model has comparable performance on both train and validation datasets. If these parallel plots start to depart consistently, it might be a sign to stop training at an earlier epoch.

In [None]:
plt.figure(figsize=(10,5))

plt.plot(history1.history['loss'], "#BDE2E2", label='Training loss')
plt.plot(history1.history['val_loss'],"#C2C4E2", label='Validation loss')

plt.title('Model Training and validation Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend(loc="best")
plt.show()

From the above plot of the loss, you can see that the model has comparable performance on both train and validation datasets. If these parallel plots start to depart consistently, it might be a sign to stop training at an earlier epoch.

# **5. Performance Evaluation**

**Evaluating the performance of the model on the two test splits**

In [None]:
#Evaluate model on test data
scores = model.evaluate(x_test, y_test, batch_size=128)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

scores1 = model.evaluate(x_test1, y_test1, batch_size=128)
print("%s: %.2f%%" % (model.metrics_names[1], scores1[1]*100))

Both splits (20% and 30% test data) give the same accuracy. Hence we will be choosing the first split (20% test data) for further performance evaluation

# 5.1. Confusion matrix

**A prediction model is trained with a set of training examples. Once trained, the model is used to make predictions on unseen data.**

* Given the weather observations for a day, predict whether it will rain tomorrow.

In [None]:
x_test = np.nan_to_num(x_test)
y_test = np.nan_to_num(y_test)

# Predicting the test set results
predicted = model.predict(x_test)
y_pred = (predicted > 0.5)
# Show the first test sample together with its own prediction, not the whole array
print("x=%s, Predicted=%s" % (x_test[0], y_pred[0]))

Confusion matrix measures the quality of predictions from a classification model by looking at how many predictions are True and how many are False.

In [None]:
# confusion matrix
cfm = confusion_matrix(y_test, y_pred)
cfm


In [None]:
plt.subplots(figsize=(12,8))
sns.heatmap(cfm/np.sum(cfm),cmap="crest", annot = True, annot_kws = {'size':15})

* Top left quadrant = True Negatives = 76% (no rain tomorrow, correctly predicted) as TN
* Bottom right quadrant = True Positives = 3.6% (rain tomorrow, correctly predicted) as TP
* Top right quadrant = False Positives = 12% (rain predicted, but no rain) as FP
* Bottom left quadrant = False Negatives = 8.9% (no rain predicted, but rain occurred) as FN
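
The quadrant layout follows the scikit-learn convention for `confusion_matrix`: rows are the actual classes, columns are the predicted classes, and labels are sorted, so for 0/1 labels the matrix reads `[[TN, FP], [FN, TP]]`. A plain-Python sketch on made-up labels, with a hypothetical helper `confusion_counts`:

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels (rows: actual, columns: predicted)."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # [[4, 1], [1, 2]]
```

With class 1 meaning "rain tomorrow", the top left cell therefore counts correctly predicted dry days and the bottom right cell counts correctly predicted rainy days.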

With data from the confusion matrix, we can interpret the results by looking at the classification report.

In [None]:
print(classification_report(y_test, y_pred))

**Precision**: The fraction of positive predictions that are correct, i.e. the *accuracy* of the positive predictions.
As per the above report, the precision is more than 83%

**Recall**: The fraction of actual positives that are correctly identified.

**f1-score**: Measures precision and recall at the same time by finding the harmonic mean of the two values.

**Support**: The support is the number of occurrences of each class in your y_test
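
These definitions can be checked with a short calculation. The counts below are hypothetical and are not taken from the notebook's output; `precision_recall_f1` is an illustrative helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp)  # correct among predicted positives
    recall = tp / (tp + fn)     # found among actual positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts for illustration only
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

On an imbalanced target such as *RainTomorrow*, the per-class precision and recall in the report are more informative than overall accuracy, since a model that always predicts "no rain" would still score a high accuracy.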

