# Rainfall Prediction Using Classification Algorithms: A Machine Learning Project

## Project Description:

In this project, we use several classification algorithms to predict whether it will rain tomorrow. The dataset contains daily weather observations from 2008 to 2017. We fit each model on the training data and evaluate it on the testing data using the evaluation metrics learned in the course.

## Project Goal:

The goal of this project is to evaluate different classification algorithms and determine which one performs best for predicting whether it will rain tomorrow or not. We will evaluate the performance of each algorithm using various evaluation metrics such as Accuracy Score, Jaccard Index, F1-Score, and LogLoss.

## Introduction:

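In this project, we will be practicing the supervised learning algorithms that we have learned in this course. Note that Linear Regression is a regression algorithm rather than a classifier; it is fitted to the 0/1 target and evaluated with regression metrics (Mean Absolute Error, Mean Squared Error, R2-Score), while the other models are evaluated with classification metrics. We will be using the following algorithms: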

- Linear Regression
- KNN
- Decision Trees
- Logistic Regression
- SVM

We will also be evaluating our models using the following metrics:

- Accuracy Score
- Jaccard Index
- F1-Score
- LogLoss
- Mean Absolute Error
- Mean Squared Error
- R2-Score
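
To make the classification metrics above concrete, here is a minimal sketch that computes them by hand and checks the results against scikit-learn. The labels and probabilities are hypothetical toy values, not taken from the weather dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score, log_loss

# Hypothetical labels: 1 = rain, 0 = no rain (not taken from the dataset)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# Hypothetical predicted probabilities of rain (they give y_pred at a 0.5 threshold)
y_prob = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3])

# Confusion-matrix counts for the positive class (rain)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)        # share of correct predictions -> 0.75
jaccard = tp / (tp + fp + fn)             # overlap of predicted and actual rain days -> 0.6
f1 = 2 * tp / (2 * tp + fp + fn)          # harmonic mean of precision and recall -> 0.75
# Log loss uses probabilities, not hard labels
manual_log_loss = -np.mean(
    y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)
)

print("Accuracy:", accuracy, accuracy_score(y_true, y_pred))
print("Jaccard :", jaccard, jaccard_score(y_true, y_pred))
print("F1      :", f1, f1_score(y_true, y_pred))
print("LogLoss :", manual_log_loss, log_loss(y_true, y_prob))
```

Accuracy counts both classes, whereas the Jaccard Index and F1-Score as computed here only reward correct predictions of the positive class (rain).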

The dataset we will be using contains observations of weather metrics for each day from 2008 to 2017, and we will be performing one hot encoding to convert categorical variables to binary variables.

## Phases
### Data Understanding
- Import the required libraries
- Importing the Dataset
- Data Preprocessing

### Linear Regression
- Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.
- Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).
- Now use the predict method on the testing data (x_test) and save it to the array predictions.
- Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.
- Show the MAE, MSE, and R2 in a tabular format using data frame for the linear model.
    
### KNN
- Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.
- Now use the predict method on the testing data (x_test) and save it to the array predictions.
- Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.
    
### Decision Tree
- Create and train a Decision Tree model called Tree using the training data (x_train, y_train).
- Now use the predict method on the testing data (x_test) and save it to the array predictions.
- Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.
    
### Logistic Regression
- Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.
- Create and train a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.
- Now, use the predict method on the testing data (x_test) and save it to the array predictions.
- Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.
    
### SVM
- Create and train an SVM model called SVM using the training data (x_train, y_train).
- Now use the predict method on the testing data (x_test) and save it to the array predictions.
- Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.
    
### Report
- Show the Accuracy, Jaccard Index, F1-Score, and LogLoss in a tabular format using data frame for all of the above models.

## About the Dataset

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be gathered from http://www.bom.gov.au/climate/dwo/.

The dataset to be used has extra columns such as 'RainToday', and our target is 'RainTomorrow'. It was gathered from Rattle at https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData

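This dataset contains observations of weather metrics for each day from 2008 to 2017. The dataset includes the following fields (as given by the column names of the loaded data): Date, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday, and RainTomorrow.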

## Data Understanding

### Import the required libraries

In [3]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

### Importing the Dataset

In [7]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [11]:
!pip install requests



In [46]:
import requests

def download(url, filename):
    # Fetch the file and raise on an HTTP error instead of silently skipping the write
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)

path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
filename = "Weather_Data.csv"

download(path, filename)

df = pd.read_csv("Weather_Data.csv")
df.head()


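| | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |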


### Data Processing
#### One Hot Encoding

First, we need to perform one hot encoding to convert categorical variables to binary variables.
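
As a minimal sketch of what one-hot encoding does, the example below applies `pd.get_dummies` to a hypothetical three-column toy frame (the values are made up for illustration). Each listed categorical column is replaced by one indicator column per category, and the numeric column is left untouched.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are illustrative
toy = pd.DataFrame({
    "WindGustDir": ["W", "E", "W", "N"],
    "RainToday": ["Yes", "No", "No", "Yes"],
    "MaxTemp": [22.4, 25.6, 24.5, 22.8],
})

# Replace each listed categorical column with one indicator column per category
encoded = pd.get_dummies(data=toy, columns=["WindGustDir", "RainToday"])

print(list(encoded.columns))
print(encoded)
```

The indicator columns are named `<column>_<category>`, for example `WindGustDir_W` and `RainToday_Yes`.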

In [47]:
df.info()

df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3271 entries, 0 to 3270
Data columns (total 22 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           3271 non-null   object 
 1   MinTemp        3271 non-null   float64
 2   MaxTemp        3271 non-null   float64
 3   Rainfall       3271 non-null   float64
 4   Evaporation    3271 non-null   float64
 5   Sunshine       3271 non-null   float64
 6   WindGustDir    3271 non-null   object 
 7   WindGustSpeed  3271 non-null   int64  
 8   WindDir9am     3271 non-null   object 
 9   WindDir3pm     3271 non-null   object 
 10  WindSpeed9am   3271 non-null   int64  
 11  WindSpeed3pm   3271 non-null   int64  
 12  Humidity9am    3271 non-null   int64  
 13  Humidity3pm    3271 non-null   int64  
 14  Pressure9am    3271 non-null   float64
 15  Pressure3pm    3271 non-null   float64
 16  Cloud9am       3271 non-null   int64  
 17  Cloud3pm       3271 non-null   int64  
 18  Temp9am 

Next, we replace the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column. We do not use the get_dummies method because we would end up with two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.

In [48]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)

#### Missing Values
Now, we need to handle the missing values. We will fill the missing values with the mean value of the respective columns.

In [49]:
# numeric_only=True skips text columns such as 'Date', which have no mean
df_sydney_processed.fillna(df_sydney_processed.mean(numeric_only=True), inplace=True)
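
The effect of mean imputation can be seen on a small example. This is a sketch on a hypothetical toy frame with deliberately missing values; the column names mirror the dataset, but the values are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per numeric column
toy = pd.DataFrame({
    "Date": ["2/1/2008", "2/2/2008", "2/3/2008"],
    "MinTemp": [19.5, np.nan, 21.5],
    "Humidity9am": [92.0, 83.0, np.nan],
})

# Column means over the numeric columns only; the text column 'Date' is skipped
col_means = toy.mean(numeric_only=True)

# Each NaN is replaced by the mean of its own column
filled = toy.fillna(col_means)

print(col_means)
print(filled)
```

Here the missing MinTemp becomes 20.5 and the missing Humidity9am becomes 87.5, the means of the observed values in each column.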


#### Training Data and Test Data
Now, we drop the 'Date' column, convert the remaining columns to float, and separate the predictors ('features', our x values) from the target variable ('Y').

In [53]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [54]:
df_sydney_processed = df_sydney_processed.astype(float)

In [55]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

## Linear Regression

In [57]:
# Split the data into training and testing sets

x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

# Create and train the Linear Regression model

LinearReg = LinearRegression().fit(x_train, y_train)

In [58]:
# Make predictions using the testing data

predictions = LinearReg.predict(x_test)

# Calculate evaluation metrics

LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)


# Display the metrics

print("Mean Absolute Error:", LinearRegression_MAE)
print("Mean Squared Error:", LinearRegression_MSE)
print("R2 Score:", LinearRegression_R2)

Mean Absolute Error: 0.25631853059957954
Mean Squared Error: 0.11572181723808837
R2 Score: 0.42712599648561245


In [59]:
Report = pd.DataFrame([LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2], 
                      index=['Mean Absolute Error', 'Mean Squared Error', 'R2'], columns=['Values'])
Report

| | Values |
|---|---|
| Mean Absolute Error | 0.256319 |
| Mean Squared Error | 0.115722 |
| R2 | 0.427126 |
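
For reference, the three regression metrics in this table can be computed by hand. The sketch below uses hypothetical toy values (a 0/1 target and continuous predictions, as produced when Linear Regression is fitted to a binary label) and checks the results against `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical 0/1 targets and continuous predictions (illustrative values only)
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_hat = np.array([0.25, 0.75, 0.5, 1.0])

errors = y_true - y_hat

mae = np.mean(np.abs(errors))                     # mean absolute error -> 0.25
mse = np.mean(errors ** 2)                        # mean squared error -> 0.09375
ss_res = np.sum(errors ** 2)                      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # coefficient of determination -> 0.625

print("MAE:", mae, metrics.mean_absolute_error(y_true, y_hat))
print("MSE:", mse, metrics.mean_squared_error(y_true, y_hat))
print("R2 :", r2, metrics.r2_score(y_true, y_hat))
```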


## KNN

In [60]:
# Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train, y_train)

In [61]:
# Now use the predict method on the testing data (x_test) and save it to the array predictions.
predictions = KNN.predict(x_test)

In [62]:
#  Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

print("Accuracy Score: %f" % KNN_Accuracy_Score )
print("Jaccard Index: %f" % KNN_JaccardIndex )
print("F1 Score: %f" % KNN_F1_Score )

Accuracy Score: 0.818321
Jaccard Index: 0.425121
F1 Score: 0.596610


## Decision Tree

In [63]:
#  Creating and training a Decision Tree model called Tree using the training data (x_train, y_train).

Tree = DecisionTreeClassifier().fit(x_train, y_train)

In [64]:
# Using the predict method on the testing data (x_test) and saving it to the array predictions.

predictions = Tree.predict(x_test)

In [65]:
# Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

print("Accuracy Score: %f" % Tree_Accuracy_Score )
print("Jaccard Index: %f" % Tree_JaccardIndex )
print("F1 Score: %f" % Tree_F1_Score )

Accuracy Score: 0.755725
Jaccard Index: 0.396226
F1 Score: 0.567568


## Logistic Regression

In [66]:
# Using the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 1.

x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)
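
Note that this call re-splits the data with a different seed, so the models trained from here on are evaluated on a different test set than KNN and the Decision Tree (which used random_state=10). The following sketch, on hypothetical row ids rather than the weather data, illustrates that the same seed reproduces a split while a different seed selects different test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows identified by their row id
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

_, test_a, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)
_, test_b, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
_, test_a_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=10)

# The same seed gives the same test rows in the same order
same_seed_match = np.array_equal(test_a, test_a_again)

# Different seeds generally select different test rows
shared_rows = len(set(test_a.ravel()) & set(test_b.ravel()))

print("Same seed reproduces the split:", same_seed_match)
print("Test rows shared between the two seeds:", shared_rows, "of", len(test_a))
```

Metrics computed on different test sets are not strictly comparable; reusing a single split for all models would make the final report a like-for-like comparison.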

In [67]:
# Creating and training a LogisticRegression model called LR using the training data (x_train, y_train) with the solver parameter set to liblinear.

LR = LogisticRegression(solver='liblinear').fit(x_train, y_train)

In [68]:
# Using the predict method on the testing data (x_test) and save it to the array predictions.

predictions = LR.predict(x_test)

In [69]:
# Using the predictions and the y_test dataframe to calculate the value for each metric using the appropriate function.

LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
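# Note: log_loss is normally given predicted probabilities (LR.predict_proba(x_test));
# here it is applied to the hard 0/1 predictions, which produced the value printed below.
LR_Log_Loss = log_loss(y_test, predictions)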

print("Accuracy Score: %f" % LR_Accuracy_Score )
print("Jaccard Index: %f" % LR_JaccardIndex )
print("F1 Score: %f" % LR_F1_Score )
print("Log Loss: %f" % LR_Log_Loss )

Accuracy Score: 0.836641
Jaccard Index: 0.509174
F1 Score: 0.674772
Log Loss: 5.642256
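
The Log Loss above was computed from hard 0/1 predictions. In that case every misclassified sample receives the maximum clipped penalty, and the reported value is consistent with the error rate times that penalty: (1 - 0.836641) × 34.54 ≈ 5.64, where 34.54 ≈ -ln(10⁻¹⁵). Log loss is more informative when computed from predicted probabilities. The sketch below shows this with `predict_proba` on synthetic data generated by `make_classification`; the data, sizes, and seeds are assumptions for illustration, not the weather dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the weather dataset)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

model = LogisticRegression(solver="liblinear").fit(x_tr, y_tr)

# One row per sample: [P(class 0), P(class 1)]
probabilities = model.predict_proba(x_te)

# Log loss from probabilities, via scikit-learn and by hand
loss_from_probs = log_loss(y_te, probabilities)
p_rain = probabilities[:, 1]
manual_loss = -np.mean(y_te * np.log(p_rain) + (1 - y_te) * np.log(1 - p_rain))

print("Log loss from probabilities:", loss_from_probs)
print("Manual computation         :", manual_loss)
```

Applied to this project, the equivalent call would be `log_loss(y_test, LR.predict_proba(x_test))`; the resulting value is not reported here because it was not part of the original run.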


## SVM

In [72]:
# Creating and training a SVM model called SVM using the training data (x_train, y_train).

SVM = svm.SVC().fit(x_train, y_train)

In [74]:
# Using the predict method on the testing data (x_test) and save it to the array predictions.

predictions = SVM.predict(x_test)

In [75]:
#  Using the predictions and the y_test dataframe calculate the value for each metric using the appropriate function.

# Pass the true labels first and the predictions second, as in the other models
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

## Report

In [76]:
# Showing the Accuracy,Jaccard Index,F1-Score and LogLoss in a tabular format using data frame for all of the above models.

# Use a name other than dict so the built-in type is not shadowed
report_data = {'Model':['KNN', 'Decision Tree', 'Logistic Regression', 'SVM'], 
        'Accuracy Score':[KNN_Accuracy_Score,Tree_Accuracy_Score,LR_Accuracy_Score,SVM_Accuracy_Score], 
        'Jaccard Index':[KNN_JaccardIndex,Tree_JaccardIndex,LR_JaccardIndex,SVM_JaccardIndex],
        'F1-Score':[KNN_F1_Score,Tree_F1_Score,LR_F1_Score,SVM_F1_Score], 
        'Log-Loss':[np.nan, np.nan, LR_Log_Loss, np.nan]}
Report = pd.DataFrame(report_data)
Report

| | Model | Accuracy Score | Jaccard Index | F1-Score | Log-Loss |
|---|---|---|---|---|---|
| 0 | KNN | 0.818321 | 0.425121 | 0.59661 | NaN |
| 1 | Decision Tree | 0.755725 | 0.396226 | 0.567568 | NaN |
| 2 | Logistic Regression | 0.836641 | 0.509174 | 0.674772 | 5.642256 |
| 3 | SVM | 0.722137 | 0.0 | 0.0 | NaN |
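
The SVM row combines a moderate accuracy with a Jaccard Index and F1-Score of exactly 0. That pattern is what a classifier produces when it never correctly predicts the positive class, for example by predicting "no rain" for every day. This is one plausible reading, not verified against the actual SVM predictions. The sketch below reproduces the pattern on hypothetical labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Hypothetical test labels: 7 dry days (0) and 3 rainy days (1)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# A degenerate classifier that predicts "no rain" for every day
y_pred = np.zeros_like(y_true)

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)          # equals the share of dry days
jaccard = jaccard_score(y_true, y_pred)            # tp / (tp + fp + fn) with tp = 0
f1 = f1_score(y_true, y_pred, zero_division=0)     # no true positives, so F1 is 0

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy, "Jaccard:", jaccard, "F1:", f1)
```

Inspecting `confusion_matrix(y_test, predictions)` for the SVM would confirm or rule out this explanation.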


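In this project, we explored various classification algorithms for predicting whether it will rain tomorrow based on the weather conditions of the current day. We used a dataset that contained observations of weather metrics for each day from 2008 to 2017. We performed one-hot encoding to convert categorical variables to binary variables and replaced the values of the 'RainTomorrow' column, changing it from a categorical column to a binary column.

The report shows that Logistic Regression had the highest Accuracy Score (0.836641), Jaccard Index (0.509174), and F1-Score (0.674772), followed by KNN (accuracy 0.818321) and the Decision Tree (accuracy 0.755725). The SVM had the lowest accuracy (0.722137), and its Jaccard Index and F1-Score of 0.0 indicate that it did not correctly identify any rainy day in the test set. Log-Loss was computed only for Logistic Regression, so it cannot be used to rank the models. Note also that KNN and the Decision Tree were evaluated on the split made with random_state=10, while Logistic Regression and SVM were evaluated on the split made with random_state=1, so the comparison between these two groups is approximate.

In conclusion, based on our evaluation metrics, Logistic Regression is the most effective of the models tested for predicting whether it will rain tomorrow based on the weather conditions of the current day, and the SVM with default settings is the least effective.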