# Project: Rainfall Classification

## About the Dataset

The original source of the data is the Australian Government's Bureau of Meteorology, and the latest data can be obtained from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

Link to dataset: [Weather_Data.csv](https://github.com/collinbashore/IBM-Data-Science-Professional-Certification/blob/main/09%20-%20Machine%20Learning%20with%20Python/Weather_Data.csv)

The dataset contains daily observations of weather metrics from 2008 to 2017 in Sydney, Australia. It includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RISK_MM       | Amount of rain tomorrow                               | Millimeters     | float  |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |


The column definitions can be found here: [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)

## Importing dataset and libraries

In [1]:
# Libraries used for this project
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.metrics import jaccard_score, f1_score, log_loss
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

In [2]:
# Importing dataset
rainfall = pd.read_csv('Weather_Data.csv')

rainfall.head()

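|     | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| --- | -------- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0   | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | W           | 41            | S          | SSW        | ... | 92          | 84          | 1017.6      | 1017.4      | 8        | 8        | 20.7    | 20.9    | Yes       | Yes          |
| 1   | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | W           | 41            | W          | E          | ... | 83          | 73          | 1017.9      | 1016.4      | 7        | 7        | 22.4    | 24.8    | Yes       | Yes          |
| 2   | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | W           | 41            | ESE        | ESE        | ... | 88          | 86          | 1016.7      | 1015.6      | 7        | 8        | 23.5    | 23.0    | Yes       | Yes          |
| 3   | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | W           | 41            | NNE        | E          | ... | 83          | 90          | 1014.2      | 1011.8      | 8        | 8        | 21.4    | 20.9    | Yes       | Yes          |
| 4   | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | W           | 41            | NNE        | W          | ... | 88          | 74          | 1008.3      | 1004.8      | 8        | 8        | 22.5    | 25.5    | Yes       | Yes          |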


## Data Preprocessing
### Transforming Categorical Variables

In [3]:
# Transforming categorical variables to binary variables
df_sydney_processed = pd.get_dummies(data=rainfall, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

# Replacing the remaining No/Yes values with 0/1
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
df_sydney_processed.head()

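|     | Date     | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | ... | WindDir3pm_NNW | WindDir3pm_NW | WindDir3pm_S | WindDir3pm_SE | WindDir3pm_SSE | WindDir3pm_SSW | WindDir3pm_SW | WindDir3pm_W | WindDir3pm_WNW | WindDir3pm_WSW |
| --- | -------- | ------- | ------- | -------- | ----------- | -------- | ------------- | ------------ | ------------ | ----------- | --- | -------------- | ------------- | ------------ | ------------- | -------------- | -------------- | ------------- | ------------ | -------------- | -------------- |
| 0   | 2/1/2008 | 19.5    | 22.4    | 15.6     | 6.2         | 0.0      | 41            | 17           | 20           | 92          | ... | 0              | 0             | 0            | 0             | 0              | 1              | 0             | 0            | 0              | 0              |
| 1   | 2/2/2008 | 19.5    | 25.6    | 6.0      | 3.4         | 2.7      | 41            | 9            | 13           | 83          | ... | 0              | 0             | 0            | 0             | 0              | 0              | 0             | 0            | 0              | 0              |
| 2   | 2/3/2008 | 21.6    | 24.5    | 6.6      | 2.4         | 0.1      | 41            | 17           | 2            | 88          | ... | 0              | 0             | 0            | 0             | 0              | 0              | 0             | 0            | 0              | 0              |
| 3   | 2/4/2008 | 20.2    | 22.8    | 18.8     | 2.2         | 0.0      | 41            | 22           | 20           | 83          | ... | 0              | 0             | 0            | 0             | 0              | 0              | 0             | 0            | 0              | 0              |
| 4   | 2/5/2008 | 19.7    | 25.7    | 77.4     | 4.8         | 0.0      | 41            | 11           | 6            | 88          | ... | 0              | 0             | 0            | 0             | 0              | 0              | 0             | 1            | 0              | 0              |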
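
To see what `pd.get_dummies` does in isolation, here is a minimal sketch on a three-row toy frame. The values are made up for illustration and are not taken from `Weather_Data.csv`. The `dtype=int` argument is an optional choice (not used in the cell above) that makes the indicator columns 0/1 integers instead of booleans.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "MinTemp": [19.5, 21.6, 20.2],
    "RainToday": ["Yes", "No", "Yes"],
    "WindGustDir": ["W", "E", "W"],
})

# Each listed column is replaced by one indicator column per category
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Columns that are not listed in `columns` (here `MinTemp`) are passed through unchanged, which is why the numeric weather fields survive the transformation.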


## Training/Testing Data

In [4]:
# Dropping the 'Date' column
df_sydney_processed.drop('Date',axis=1,inplace=True)

# Converting the data types of the columns to float data type
df_sydney_processed = df_sydney_processed.astype(float)

Below are the features and the target variable chosen for the classification machine learning models.

**Target variable:** RainTomorrow<br>
**Features:** All other columns except the RainTomorrow column.

In [5]:
features = df_sydney_processed.drop(columns='RainTomorrow')
y = df_sydney_processed['RainTomorrow']

In [6]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(features, y, test_size=0.2, random_state=10)
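
As an optional variation (not what the cell above does), `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in the train and test sets. This can matter when rainy days are the minority class. The sketch below uses a made-up array with 75 negatives and 25 positives; the names `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features, 25% positive labels (illustrative only)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 75 + [1] * 25)

# Same test_size and random_state as above, plus stratification on the label
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```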

## K-Nearest Neighbours (KNN)

In [7]:
# Instantiating and fitting the model with k = 4
KNN = KNeighborsClassifier(n_neighbors = 4)
KNN.fit(X_train, y_train)

# Model prediction
predictions = KNN.predict(X_test)

# Model metrics
KNN_Accuracy_Score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)

# Report of metrics
KNN_Report = pd.DataFrame({'Accuracy Score': KNN_Accuracy_Score , 
                           'Jaccard Index': KNN_JaccardIndex, 
                           'F1 Score': KNN_F1_Score}, index=['Results'])

KNN_Report


|         | Accuracy Score | Jaccard Index | F1 Score |
| ------- | -------------- | ------------- | -------- |
| Results | 0.818321       | 0.425121      | 0.59661  |
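
For reference, the three metrics reported here can all be written in terms of confusion-matrix counts for the positive class. The sketch below is a plain-Python illustration with made-up counts; `binary_metrics` is a hypothetical helper, not part of scikit-learn. It also checks the identity Jaccard = F1 / (2 - F1), which the KNN figures above satisfy.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, Jaccard index and F1 for the positive class.

    Assumes at least one of tp, fp, fn is non-zero.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, jaccard, f1


# Made-up counts: 30 true positives, 10 false positives,
# 20 false negatives, 140 true negatives
acc, jac, f1 = binary_metrics(tp=30, fp=10, fn=20, tn=140)
print(acc, jac, f1)

# Jaccard and F1 are monotonically related
assert abs(jac - f1 / (2 - f1)) < 1e-12

# Same identity applied to the reported KNN F1 Score
knn_f1 = 0.59661
print(round(knn_f1 / (2 - knn_f1), 4))  # 0.4251
```

Because of this identity, Jaccard Index and F1 Score always rank models in the same order; only accuracy adds independent information.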


## Decision Tree

In [8]:
# Instantiating and fitting the model
Tree = DecisionTreeClassifier()
Tree.fit(X_train, y_train)

# Predicting on the test set
predictions = Tree.predict(X_test)

# Calculating model metrics
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

# Report of metrics
Tree_Report = pd.DataFrame({'Accuracy Score': Tree_Accuracy_Score , 
                           'Jaccard Index': Tree_JaccardIndex, 
                           'F1 Score': Tree_F1_Score}, index = ['Results'])

Tree_Report

|         | Accuracy Score | Jaccard Index | F1 Score |
| ------- | -------------- | ------------- | -------- |
| Results | 0.749618       | 0.392593      | 0.56383  |


## Logistic Regression

In [9]:
# Instantiating and fitting the model
LR = LogisticRegression(solver='liblinear')
LR.fit(X_train, y_train)

# Predicting on the test set
predictions = LR.predict(X_test)

# Calculating model metrics
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predictions)

# Report of metrics
LR_Report = pd.DataFrame({'Accuracy Score': LR_Accuracy_Score , 
                           'Jaccard Index': LR_JaccardIndex, 
                           'F1 Score': LR_F1_Score,
                           'Log Loss': LR_Log_Loss}, index = ['Results'])

LR_Report

|         | Accuracy Score | Jaccard Index | F1 Score | Log Loss |
| ------- | -------------- | ------------- | -------- | -------- |
| Results | 0.841221       | 0.527273      | 0.690476 | 5.484063 |
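
The Log Loss of 5.484063 was computed from hard 0/1 predictions. `log_loss` is designed for predicted probabilities. With hard labels, each misclassified row contributes a clipped maximum penalty, so the value mostly restates the error rate: (1 - 0.841221) × -ln(10⁻¹⁵) ≈ 5.484, assuming the library clips at 10⁻¹⁵. The sketch below contrasts the two calls on synthetic data from `make_classification`; the data and the variable names are assumptions for illustration, and it does not reproduce the weather results.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic, noisy binary problem (illustrative only, not the weather data)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, flip_y=0.2, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=10
)

model = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# Hard labels: every error is penalised as a fully confident mistake
loss_from_labels = log_loss(y_te, model.predict(X_te).astype(float))

# Probabilities: the usual input for log loss
loss_from_proba = log_loss(y_te, model.predict_proba(X_te))

print("log loss from hard labels:  ", loss_from_labels)
print("log loss from probabilities:", loss_from_proba)
```

Applied to this project, the equivalent call would be `log_loss(y_test, LR.predict_proba(X_test))`; the resulting value is not shown here because it has not been computed.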


## Support Vector Machines (SVM)

In [10]:
# Instantiating and fitting the model
SVM = svm.SVC()
SVM.fit(X_train, y_train)

# Predicting on the test set
predictions = SVM.predict(X_test)

# Calculating model metrics
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)

# Report of metrics
SVM_Report = pd.DataFrame({'Accuracy Score': SVM_Accuracy_Score , 
                           'Jaccard Index': SVM_JaccardIndex, 
                           'F1 Score': SVM_F1_Score}, index = ['Results'])

SVM_Report

|         | Accuracy Score | Jaccard Index | F1 Score |
| ------- | -------------- | ------------- | -------- |
| Results | 0.719084       | 0.0           | 0.0      |
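
A Jaccard Index and F1 Score of 0 mean that the SVM did not correctly identify a single rainy day in the test set. One possible explanation, not verified here, is that `SVC` with its default RBF kernel is sensitive to feature scale, and the features mix values around 1000 (pressure) with 0/1 indicators. A common remedy is to standardise the features inside a pipeline. The sketch below shows the pattern on synthetic data; the data, the pressure-like rescaling, and the name `scaled_svm` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic imbalanced problem (illustrative only, not the weather data)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=6, weights=[0.7, 0.3], random_state=0
)
# Put one feature on a pressure-like scale to mimic mixed units
X_syn[:, 0] = 1015 + 5 * X_syn[:, 0]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=10
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)
pred = scaled_svm.predict(X_te)

print("F1 with scaling:", f1_score(y_te, pred))
```

Wrapping the scaler and the classifier in one pipeline avoids leaking test-set statistics into the scaling step.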


## Reporting the Results

In [11]:
Report = pd.DataFrame({'LR': [LR_Log_Loss, LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score], 
                       'KNN': [np.nan, KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score], 
                       'Tree': [np.nan, Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score], 
                       'SVM': [np.nan, SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score]}, 
                      index=['Log Loss', 'Accuracy Score', 'Jaccard Index', 'F1 Score'])

Report.transpose()

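|      | Log Loss | Accuracy Score | Jaccard Index | F1 Score |
| ---- | -------- | -------------- | ------------- | -------- |
| LR   | 5.484063 | 0.841221       | 0.527273      | 0.690476 |
| KNN  | NaN      | 0.818321       | 0.425121      | 0.59661  |
| Tree | NaN      | 0.749618       | 0.392593      | 0.56383  |
| SVM  | NaN      | 0.719084       | 0.0           | 0.0      |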


### Notes from the above results
From the results above, the **Logistic Regression** model performed best: it has the highest Accuracy, Jaccard Index, and F1 Score of the four models.
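
The same comparison can be made programmatically. The sketch below re-enters the scores reported in the table above into a small DataFrame (the name `reported` is hypothetical) and asks pandas which model wins each metric.

```python
import pandas as pd

# Scores copied from the results table above
reported = pd.DataFrame(
    {
        "Accuracy Score": [0.841221, 0.818321, 0.749618, 0.719084],
        "Jaccard Index": [0.527273, 0.425121, 0.392593, 0.0],
        "F1 Score": [0.690476, 0.59661, 0.56383, 0.0],
    },
    index=["LR", "KNN", "Tree", "SVM"],
)

# Model with the highest value for each metric
best_per_metric = reported.idxmax()
print(best_per_metric)

# Full ranking by F1 Score, best first
ranked = reported.sort_values("F1 Score", ascending=False)
print(ranked)
```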