

# Machine Learning Classifiers for Rain Prediction - AMM '24

# Introduction


In this notebook, we will compare the performance of some basic machine learning classification algorithms in the context of predicting if it will rain tomorrow (in Australia) from a meteorological dataset using pandas, NumPy, and scikit-learn.


We will use these classification algorithms to build models from our training data and then evaluate them on our testing data using some common evaluation metrics.

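We will use the following algorithms (Linear Regression is a regression model rather than a classifier, so it is scored with the regression metrics MAE, MSE, and R2):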

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

We will evaluate our models using:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score
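To make the classification metrics in this list concrete, the following sketch computes accuracy, the Jaccard index, and the F1-score by hand from confusion-matrix counts. The toy labels and the helper name `binary_counts` are illustrative assumptions and are not taken from the dataset or the notebook.

```python
def binary_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Toy labels (hypothetical): 1 = rain tomorrow, 0 = no rain
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = binary_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
jaccard = tp / (tp + fp + fn)               # overlap of predicted and actual positives
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall
print(accuracy, jaccard, f1)
```

Unlike accuracy, the Jaccard index and F1-score ignore true negatives, so they are less flattered by a model that mostly predicts the majority "no rain" class.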



# About The Dataset


The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow', which were gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | Hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml).



## **Importing the required libraries**


In [37]:
import pandas as pd
import numpy as np
import sklearn.metrics as metrics
import sklearn.linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import accuracy_score


In [38]:
# We suppress some expected warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
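Replacing `warnings.warn` silences every warning for the rest of the session. As a narrower alternative, the standard library's `warnings.catch_warnings` context manager can limit suppression to a single block. The sketch below is only an illustration; the function `noisy` is a hypothetical stand-in for a library call that emits a warning.

```python
import warnings

def noisy():
    """Hypothetical function that emits a warning, standing in for a library call."""
    warnings.warn("this is only a demonstration", UserWarning)
    return 42

# Suppress warnings only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    result = noisy()

print(result, len(caught))
```

With this approach, warnings raised elsewhere in the notebook remain visible, which can help when a library changes its behavior.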

### Importing the Dataset


In [39]:
# Here we are reading the data in externally. A copy of the data has been included in the folder/repository to load locally 
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv"
df = pd.read_csv(filepath)

In [40]:
df.head()

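|     | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0   | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1   | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2   | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3   | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4   | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |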


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one-hot encoding to convert the categorical variables into binary indicator variables.


In [41]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [42]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
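The effect of these two steps is easier to see on a small frame. The sketch below uses made-up values and only two of the categorical columns. It also maps the target with `Series.map` on that one column, which is a narrower alternative to the frame-wide `replace` call used in the notebook.

```python
import pandas as pd

# Toy frame (hypothetical values) with two categorical features and the target
toy = pd.DataFrame({
    "MinTemp": [19.5, 21.6, 20.2],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# One-hot encode the features only; the target column is left untouched
encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])

# Map the target to a single 0/1 column instead of two dummy columns
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

# Cast everything to float so indicator and numeric columns share one dtype
encoded = encoded.astype(float)

print(encoded.columns.tolist())
```

Each categorical feature becomes one indicator column per category, while the target stays a single column that the classifiers can consume directly.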

### Training Data and Test Data


Now we define our features (the x values) and our target variable, Y.


In [43]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [44]:
df_sydney_processed = df_sydney_processed.astype(float)

In [45]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


#### Here we use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [46]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
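As a small illustration of what `test_size` and `random_state` control, the sketch below splits a toy array of 10 samples (made-up data, not the weather features). It checks that 20% of the rows go to the test set and that repeating the call with the same seed reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Same arguments as in the notebook: 20% test rows, fixed seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# A second call with the same seed yields the same partition
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.2, random_state=10)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Changing `random_state` selects different rows for the test set, so metrics computed under different seeds are not measured on the same examples.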

#### Next, we create and train a Linear Regression model called LinearReg using the training data (`x_train`, `y_train`).


In [49]:
LinearReg = sklearn.linear_model.LinearRegression()
LinearReg.fit(x_train, y_train)

#### Now we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [50]:
predictions = LinearReg.predict(x_test)
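Because `LinearRegression` is a regression model, `predictions` holds continuous values rather than 0/1 labels, and those values can fall below 0 or above 1. The notebook scores them with regression metrics. If class labels were wanted instead, one option would be to threshold the values. The sketch below assumes a cutoff of 0.5 and uses made-up numbers; neither comes from the notebook.

```python
import numpy as np

# Hypothetical continuous outputs of a regression model fitted on a 0/1 target
continuous = np.array([-0.12, 0.08, 0.47, 0.55, 0.93, 1.10])

# Assumed cutoff of 0.5: values at or above it are read as "rain tomorrow"
labels = (continuous >= 0.5).astype(int)

print(labels.tolist())
```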

#### Using the `predictions` and the `y_test` dataframe we can calculate the value for some chosen evaluation metrics.


In [51]:
LinearRegression_MAE = np.mean(np.absolute(predictions - y_test))
LinearRegression_MSE = np.mean((predictions - y_test) ** 2)
LinearRegression_R2 = 1 - np.sum((predictions- y_test) ** 2) / np.sum((y_test - np.mean(y_test)) ** 2)
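For reference, the three quantities computed in the cell above correspond to the standard definitions below, where $y_i$ are the observed values in `y_test`, $\hat{y}_i$ are the entries of `predictions`, $\bar{y}$ is the mean of the observed values, and $n$ is the number of test samples.

$$
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

An $R^2$ of 1 corresponds to perfect predictions, and an $R^2$ of 0 to a model that does no better than always predicting $\bar{y}$.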

#### We can show the MAE, MSE, and R2 for the linear model in a tabular format using a DataFrame.


In [52]:
Report = {
    "Error Measure": ["Mean Absolute Error [MAE]", "Mean Squared Error [MSE]", "R-Squared [R^2]"],
    "Score": [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
}
Report_df = pd.DataFrame(Report)
print(Report_df)

               Error Measure     Score
0  Mean Absolute Error [MAE]  0.256319
1   Mean Squared Error [MSE]  0.115723
2            R-Squared [R^2]  0.427121


### KNN


#### Here we create and train a K Nearest Neighbors model using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [53]:
# Here we ensure that the features are contiguous NumPy arrays to avoid a c_contiguous attribute error
x_train = np.ascontiguousarray(x_train)
x_test = np.ascontiguousarray(x_test)
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)
# Use KNeighborsRegressor() if y is a continuous variable instead of categorical.
# Here our y, RainTomorrow, is Yes/No, so we use KNeighborsClassifier.
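To illustrate the idea behind K Nearest Neighbors, the sketch below classifies a query point by majority vote among its `k` closest training points, using Euclidean distance (also scikit-learn's default metric). It is a simplified stand-in written for this explanation, not scikit-learn's implementation. The helper `knn_predict` and the toy points are hypothetical.

```python
import numpy as np

def knn_predict(x_train, y_train, x_query, k=4):
    """Classify one query point by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(x_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = np.bincount(y_train[nearest])                  # votes per integer label
    return int(np.argmax(votes))                           # in this sketch, ties go to the lower label

# Toy training set (hypothetical): two well-separated clusters
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3],
                    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2], [5.2, 5.1]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])

near_first_cluster = knn_predict(x_train, y_train, np.array([0.2, 0.2]))
near_second_cluster = knn_predict(x_train, y_train, np.array([5.0, 5.1]))
print(near_first_cluster, near_second_cluster)
```

With an even `k` such as 4, a 2-2 vote is possible in binary problems, which is one reason odd values of `k` are often tried as well.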

#### We now use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [54]:
predictions = KNN.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe we can calculate the value for chosen evaluation metrics.

In [55]:
KNN_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

### Decision Tree


#### Next we create and train a Decision Tree model using the training data (`x_train`, `y_train`).


In [56]:
Tree = DecisionTreeClassifier(criterion="entropy")
Tree.fit(x_train, y_train)
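With `criterion="entropy"`, candidate splits are judged by how much they reduce the entropy of the class labels (the information gain). The sketch below works this out by hand for a tiny made-up node. The helper names `entropy` and `information_gain` are hypothetical and are not part of scikit-learn.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels (hypothetical): 1 = rain tomorrow, 0 = no rain
parent = [0, 0, 1, 1]

parent_entropy = entropy(parent)                         # evenly mixed node
clean_gain = information_gain(parent, [0, 0], [1, 1])    # split that separates the classes
useless_gain = information_gain(parent, [0, 1], [0, 1])  # split that separates nothing
print(parent_entropy, clean_gain, useless_gain)
```

A split that leaves both children as mixed as the parent gains nothing, while a split that isolates each class recovers the parent's full entropy.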

#### Again we use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [57]:
predictions = Tree.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe we can again calculate the value for each evaluation metric.


In [58]:
Tree_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

### Logistic Regression


#### Here we use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `1` in preparation for our model.


In [59]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)
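Because this split uses a different `random_state` than the one used for KNN and the Decision Tree, the models end up scored on different test rows. One way to keep a comparison like-for-like is to fit and score every model on a single shared split. The sketch below shows that pattern on synthetic data generated with `make_classification`. It is an illustrative alternative under those assumptions, not the procedure this notebook follows.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the weather dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=4),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Logistic Regression": LogisticRegression(solver="liblinear"),
}

# Every model is fitted and scored on the same train/test partition
scores = {
    name: accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
    for name, model in models.items()
}
print(scores)
```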

#### We now create and train a LogisticRegression model called LR using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [60]:
LR = LogisticRegression(solver='liblinear').fit(x_train,y_train)

#### Now, we use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save the results as two arrays, `predictions` and `predict_proba`.


In [61]:
predictions = LR.predict(x_test)

In [62]:
predict_proba = LR.predict_proba(x_test)

#### Using the `predictions`, `predict_proba` and the `y_test` dataframe we can calculate the value for each metric as before, with the addition of log loss.


In [63]:
LR_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
# We only use log loss for Logistic Regression Model
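Log loss rewards confident correct probabilities and penalizes confident wrong ones. The sketch below computes it by hand with NumPy for three made-up samples, with the probability columns ordered as `[P(class 0), P(class 1)]`, the same layout `predict_proba` returns for classes 0 and 1. All values are hypothetical.

```python
import numpy as np

# Toy values (hypothetical): true labels and predicted class probabilities
y_true = np.array([1, 0, 1])
proba = np.array([[0.1, 0.9],
                  [0.8, 0.2],
                  [0.4, 0.6]])

# Probability the model assigned to the class that actually occurred
p_correct = proba[np.arange(len(y_true)), y_true]

# Log loss is the mean negative log of those probabilities
manual_log_loss = -np.mean(np.log(p_correct))

# Taking the more probable class gives the corresponding hard labels
hard_predictions = np.argmax(proba, axis=1)

print(manual_log_loss, hard_predictions.tolist())
```

All three hard labels are correct here, yet the loss is not zero: the third sample was predicted with only 0.6 confidence, and log loss captures that where accuracy does not.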

### SVM


#### Lastly we will create and train a Support Vector Machine model using the training data (`x_train`, `y_train`).


In [64]:
# We use a balanced class weight to counter potential overweighting of days with no rain.
# We use the Radial Basis Function (RBF) kernel, which is also the default; linear, polynomial, and sigmoid kernels are possible too.
SVM = svm.SVC(kernel='rbf', class_weight='balanced')
SVM.fit(x_train, y_train)
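scikit-learn documents the `'balanced'` setting as weighting each class by `n_samples / (n_classes * np.bincount(y))`, so rarer classes receive larger weights. The sketch below evaluates that formula on a made-up target in which dry days outnumber rainy days three to one. The helper `balanced_class_weights` is hypothetical and only mirrors the documented formula.

```python
import numpy as np

def balanced_class_weights(y):
    """Weights inversely proportional to class frequency:
    n_samples / (n_classes * count_of_each_class).
    Assumes integer labels 0..k-1 that all occur at least once."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)

# Toy target (hypothetical): three dry days (0) for every rainy day (1)
y = np.array([0, 0, 0, 1, 0, 0, 0, 1])
weights = balanced_class_weights(y)
print(weights)
```

Here the rainy class receives three times the weight of the dry class, so misclassifying a rainy day costs the model correspondingly more during training.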

#### We use the `predict` method on the testing data (`x_test`) and save it to the array `predictions`.


In [65]:
predictions = SVM.predict(x_test)

#### Using the `predictions` and the `y_test` dataframe we calculate the value for each evaluation metric.


In [66]:
SVM_Accuracy_Score = metrics.accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

### Report


#### Here we report the Accuracy Score, Jaccard Index, F1-Score, and Log Loss for all of the above models in a tabular format using a DataFrame.

Log loss is reported only for the Logistic Regression model.


In [67]:
Report = {
    "Model": ["KNN", "Decision Tree", "Logistic Regression", "SVM"],
    "Accuracy Score": [KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
    "Jaccard Index Score": [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
    "F-1 Score": [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
    "Log Loss": [None, None, LR_Log_Loss, None]  # Log Loss only reported for Logistic Regression
}

Report_df = pd.DataFrame(Report)

print(Report_df)
# See Linear Regression section for tabular representation of linear regression model errors and accuracy measures

                 Model  Accuracy Score  Jaccard Index Score  F-1 Score  \
0                  KNN        0.818321             0.425121   0.596610   
1        Decision Tree        0.748092             0.388889   0.560000   
2  Logistic Regression        0.836641             0.509174   0.674772   
3                  SVM        0.754198             0.456081   0.626450   

   Log Loss  
0       NaN  
1       NaN  
2  0.381064  
3       NaN  
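To read the ranking off the report programmatically rather than by eye, one possible approach is sketched below. It rebuilds the table from the scores printed above, then uses `idxmax` and `sort_values`. The variable names are illustrative.

```python
import pandas as pd

# Scores copied from the report printed above
report = pd.DataFrame({
    "Model": ["KNN", "Decision Tree", "Logistic Regression", "SVM"],
    "Accuracy Score": [0.818321, 0.748092, 0.836641, 0.754198],
    "Jaccard Index Score": [0.425121, 0.388889, 0.509174, 0.456081],
    "F-1 Score": [0.596610, 0.560000, 0.674772, 0.626450],
}).set_index("Model")

# Best model for each metric, and the models ranked by F1-score
best_per_metric = report.idxmax()
ranked_by_f1 = report.sort_values("F-1 Score", ascending=False)

print(best_per_metric)
print(ranked_by_f1.index.tolist())
```

Note that the ordering depends on the metric: KNN ranks second on accuracy, while SVM ranks second on the Jaccard index and F1-score.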


Here we see that the best-performing model across our evaluation metrics is the Logistic Regression model, which is plausible since logistic regression directly models the probability of a binary outcome such as RainTomorrow. Note that KNN and the Decision Tree were scored on a different train/test split (`random_state=10`) than Logistic Regression and SVM (`random_state=1`), so the comparison between those two groups is only approximate. Though the models are basic and could improve with further tuning, I hope this notebook has demonstrated how straightforward using and comparing basic machine learning classifiers can be!