
# Final Project: Classification with Python

Date: 4/10/2024

## About the Dataset

The original source of the data is the Australian Government's Bureau of Meteorology. The latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here adds extra columns such as 'RainToday' and our target, 'RainTomorrow'. It was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).


This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)

## Import the required libraries

In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
  pass

import warnings
warnings.warn = warn

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn import preprocessing, svm
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import jaccard_score, f1_score, log_loss, confusion_matrix, accuracy_score
import sklearn.metrics as metrics

## Import the Dataset

In [3]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv")
df.head()

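|   | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | ---- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |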


## Data Preprocessing

### One Hot Encoding

First, we perform one-hot encoding to convert each categorical variable into a set of binary indicator variables.

In [4]:
df_sydney_processed = pd.get_dummies(data=df, columns=["RainToday", "WindGustDir", "WindDir9am", "WindDir3pm"])
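
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a made-up three-row frame. The `toy` frame and its values are invented for this example and are not taken from the weather data.

```python
import pandas as pd

# Hypothetical toy frame (not from the weather dataset).
toy = pd.DataFrame({
    "WindGustDir": ["W", "E", "W"],
    "RainToday": ["Yes", "No", "Yes"],
    "MinTemp": [19.5, 21.6, 20.2],
})

# Each listed categorical column becomes one indicator column per category,
# named <column>_<category>; unlisted columns (MinTemp) pass through unchanged.
encoded = pd.get_dummies(data=toy, columns=["WindGustDir", "RainToday"])
print(sorted(encoded.columns))
```

Applied to the four categorical columns above, the same call yields one indicator column for every distinct wind direction and every 'RainToday' value.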

Next, we replace the values of the 'RainTomorrow' column, converting it from categorical ("Yes"/"No") to binary (1/0). We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', and the target must be a single column.


In [5]:
df_sydney_processed.replace(["No", "Yes"], [0, 1], inplace=True)
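
The conversion can also be scoped to the target column alone. The sketch below uses `Series.map` on a made-up frame (`toy` and its values are hypothetical); it is an alternative to the frame-wide `replace` call, not the notebook's method, and it also shows the two-column result that `get_dummies` would give for the target.

```python
import pandas as pd

# Hypothetical toy frame (not from the weather dataset).
toy = pd.DataFrame({
    "Humidity3pm": [84, 73, 40],
    "RainTomorrow": ["Yes", "Yes", "No"],
})

# get_dummies would split the target into two complementary columns.
two_columns = pd.get_dummies(toy, columns=["RainTomorrow"])
print(list(two_columns.columns))

# Mapping keeps a single 0/1 target column and touches no other column.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy["RainTomorrow"].tolist())
```

With a dictionary mapping, any value missing from the dictionary becomes `NaN`, which makes unexpected labels easy to spot.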

## Train Data and Test Data

In [6]:
df_sydney_processed.drop("Date", axis=1, inplace=True)

In [7]:
df_sydney_processed = df_sydney_processed.astype(float)

In [8]:
features = df_sydney_processed.drop(columns="RainTomorrow")
Y = df_sydney_processed["RainTomorrow"]

### Linear Regression

In [9]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)
print(f"Train set: {x_train.shape} {y_train.shape}")
print(f"Test set: {x_test.shape} {y_test.shape}")

Train set: (2616, 66) (2616,)
Test set: (655, 66) (655,)


In [10]:
LinearReg = LinearRegression()
LinearReg.fit(x_train, y_train)

In [11]:
predictions = LinearReg.predict(x_test)

In [12]:
LinearRegression_MAE = metrics.mean_absolute_error(y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(y_test, predictions)
LinearRegression_R2 = metrics.r2_score(y_test, predictions)

In [13]:
Report = pd.DataFrame({
    "Metric": ["Mean Absolute Error (MAE)", "Mean Squared Error (MSE)", "R-squared (R2)"],
    "Value": [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]
})
Report

|   | Metric                    | Value    |
| - | ------------------------- | -------- |
| 0 | Mean Absolute Error (MAE) | 0.256309 |
| 1 | Mean Squared Error (MSE)  | 0.115719 |
| 2 | R-squared (R2)            | 0.427138 |
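
For reference, the three metrics in this table follow the standard definitions. The sketch below computes them in plain Python on made-up numbers; `y_true` and `y_pred` are illustrative values, not the notebook's data.

```python
# Hypothetical 0/1 targets and continuous predictions (not the notebook's data).
y_true = [0.0, 1.0, 0.0, 1.0]
y_pred = [0.1, 0.7, 0.4, 0.8]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# Mean absolute error: average magnitude of the errors.
mae = sum(abs(e) for e in errors) / n

# Mean squared error: average of the squared errors.
mse = sum(e * e for e in errors) / n

# R-squared: 1 minus residual sum of squares over total sum of squares,
# where the total is measured around the mean of the true values.
mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)
```

Because the linear model outputs continuous scores for a 0/1 target, these regression metrics are reported for it in place of the classification metrics used for the other models.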


### KNN

In [14]:
KNN = KNeighborsClassifier(n_neighbors=4)
KNN.fit(x_train, y_train)

In [15]:
predictions = KNN.predict(x_test)

In [16]:
KNN_accuracy_score = accuracy_score(y_test, predictions)
KNN_JaccardIndex = jaccard_score(y_test, predictions)
KNN_F1_Score = f1_score(y_test, predictions)
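
To make the three classification metrics concrete, the sketch below derives them from confusion-matrix counts for the positive class, which is what `jaccard_score` and `f1_score` report by default for binary labels. The labels are made up for illustration.

```python
# Hypothetical labels and predictions (not the notebook's data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
jaccard = tp / (tp + fp + fn)      # intersection over union of the positive class
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall

# For a single positive class the two are linked: F1 = 2J / (1 + J).
f1_from_jaccard = 2 * jaccard / (1 + jaccard)
print(accuracy, jaccard, f1, f1_from_jaccard)
```

The KNN, decision tree, and logistic regression values in the Report section satisfy the same relation; for example, 2 × 0.425121 / 1.425121 ≈ 0.59661 for KNN.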

### Decision Tree

In [17]:
Tree = DecisionTreeClassifier()
Tree.fit(x_train, y_train)

In [18]:
predictions = Tree.predict(x_test)

In [19]:
Tree_Accuracy_Score = accuracy_score(y_test, predictions)
Tree_JaccardIndex = jaccard_score(y_test, predictions)
Tree_F1_Score = f1_score(y_test, predictions)

### Logistic Regression

In [20]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=1)

In [21]:
LR = LogisticRegression(solver="liblinear")
LR.fit(x_train, y_train)

In [22]:
predictions = LR.predict(x_test)

In [23]:
predict_proba = LR.predict_proba(x_test)

In [24]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
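
Log loss is the only metric here that uses the predicted probabilities rather than the predicted labels. The sketch below computes binary log loss by hand on made-up values; `p_yes` stands in for the class-1 column of `predict_proba` and is an assumption of this example.

```python
import math

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [1, 0, 1]
p_yes = [0.9, 0.2, 0.6]

# Per-sample loss: negative log of the probability assigned to the true class.
losses = [
    -(t * math.log(p) + (1 - t) * math.log(1 - p))
    for t, p in zip(y_true, p_yes)
]

# Log loss is the mean of the per-sample losses; lower is better.
log_loss_value = sum(losses) / len(losses)
print(round(log_loss_value, 6))
```

A confident wrong prediction (for instance a probability of 0.99 for the wrong class) contributes a large penalty, so log loss rewards well-calibrated probabilities in a way accuracy does not.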

### SVM

In [25]:
SVM = svm.SVC()
SVM.fit(x_train, y_train)

In [26]:
predictions = SVM.predict(x_test)

In [27]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
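# Note: unlike the other models, Jaccard is computed here for the negative class
# (pos_label=0.0) and F1 is weighted, so these values are not directly comparable.
SVM_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0.0)
SVM_F1_Score = f1_score(y_test, predictions, average="weighted")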

## Report

In [28]:
Report = pd.DataFrame({
    "Metrics": ["KNN_Accuracy_Score", "KNN_JaccardIndex", "KNN_F1_Score"],
    "Value": [KNN_accuracy_score, KNN_JaccardIndex, KNN_F1_Score]
})
print("KNN Metrics")
Report

KNN Metrics


|   | Metrics            | Value    |
| - | ------------------ | -------- |
| 0 | KNN_Accuracy_Score | 0.818321 |
| 1 | KNN_JaccardIndex   | 0.425121 |
| 2 | KNN_F1_Score       | 0.59661  |


In [29]:
Report = pd.DataFrame({
    "Metrics": ["Tree_Accuracy_Score", "Tree_JaccardIndex", "Tree_F1_Score"],
    "Value": [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score]
})
print("Tree Metrics")
Report

Tree Metrics


|   | Metrics             | Value    |
| - | ------------------- | -------- |
| 0 | Tree_Accuracy_Score | 0.757252 |
| 1 | Tree_JaccardIndex   | 0.402256 |
| 2 | Tree_F1_Score       | 0.573727 |


In [30]:
Report = pd.DataFrame({
    "Metrics": ["LR_Accuracy_Score", "LR_JaccardIndex", "LR_F1_Score", "LR_Log_Loss"],
    "Value": [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss]
})
print("Logistic Regression Metrics")
Report

Logistic Regression Metrics


|   | Metrics           | Value    |
| - | ----------------- | -------- |
| 0 | LR_Accuracy_Score | 0.836641 |
| 1 | LR_JaccardIndex   | 0.509174 |
| 2 | LR_F1_Score       | 0.674772 |
| 3 | LR_Log_Loss       | 0.381259 |


In [31]:
Report = pd.DataFrame({
    "Metrics": ["SVM_Accuracy_Score", "SVM_JaccardIndex", "SVM_F1_Score"],
    "Value": [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score]
})
print("SVM Metrics")
Report

SVM Metrics


|   | Metrics            | Value    |
| - | ------------------ | -------- |
| 0 | SVM_Accuracy_Score | 0.722137 |
| 1 | SVM_JaccardIndex   | 0.722137 |
| 2 | SVM_F1_Score       | 0.605622 |
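
The identical accuracy and Jaccard values for the SVM are what one would expect if the model never correctly predicts the positive class. With `pos_label=0.0`, Jaccard is TN / (TN + FP + FN), which equals accuracy only when TP = 0 or the classifier is perfect. The sketch below illustrates this on made-up labels; it is an interpretation of the reported numbers, not something the notebook verifies.

```python
# Hypothetical labels; the classifier below never predicts class 1.
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_pred = [0] * len(y_true)

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)

accuracy = (tp + tn) / len(pairs)

# Jaccard with class 0 treated as the positive label.
jaccard_class0 = tn / (tn + fp + fn)

print(accuracy, jaccard_class0)
```

Running `confusion_matrix(y_test, predictions)` (already imported) on the real SVM predictions would confirm or refute this reading.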
