<h1 align="center"><font size="5">Project: Weather Prediction with Python</font></h1>

<h2 align="center"><font size="3">Australian Government's Bureau of Meteorology Dataset</font></h2>

<h2> Contents </h2>
<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ul>
    <li><a href="#summary">Summary</a></li>
    <li><a href="#about_the_dataset">About the Dataset</a></li>
    <li><a href="#importing_data">Importing Data</a></li>
    <li><a href="#data_preprocessing">Data Preprocessing</a> </li>
    <li><a href="#linear_regression">Linear Regression</a></li>
    <li><a href="#knn">KNN</a></li>
    <li><a href="#decision_tree">Decision Tree</a></li>
    <li><a href="#logistic_regression">Logistic Regression</a></li>
    <li><a href="#svm">SVM</a></li>
    <li><a href="#report">Final Results</a></li>
    </ul>
</div>

<h2><a id="summary"> Summary </a></h2> 


In this project, we will build a classifier to predict whether there will be rain the following day.
We will use a rainfall dataset from the Australian Government's Bureau of Meteorology, clean the data, and apply different classification algorithms to the data.

Algorithms used:

1. Linear Regression
2. KNN
3. Decision Trees
4. Logistic Regression
5. SVM

Evaluation methods:

1.  Accuracy Score
2.  Jaccard Index
3.  F1-Score
4.  LogLoss
5.  Mean Absolute Error
6.  Mean Squared Error
7.  R2-Score

<h2><a id="about_the_dataset">About the Dataset</a></h2>

The original source of the data is Australian Government's Bureau of Meteorology and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset to be used has extra columns like 'RainToday' and our target is 'RainTomorrow', which was gathered from the Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | object |
| WindDir9am    | Wind direction averaged of 10 minutes prior to 9am    | Compass Points  | object |
| WindDir3pm    | Wind direction averaged of 10 minutes prior to 3pm    | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged of 10 minutes prior to 9am        | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged of 10 minutes prior to 3pm        | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eights          | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eights          | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | float  |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)

<h2><a id="importing_data"> Importing Data </a></h2>

In [6]:
import pandas as pd
import numpy as np

from sklearn import preprocessing, svm
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import jaccard_score, f1_score, log_loss, confusion_matrix, accuracy_score
import sklearn.metrics as metrics

In [7]:
path='https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


<h2><a id="data_preprocessing"> Data Preprocessing </a></h2>

#### One Hot Encoding

In [9]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

#### Turning the target column (RainTomorrow) into numerical

In [11]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
df_sydney_processed.drop('Date',axis=1,inplace=True)
df_sydney_processed = df_sydney_processed.astype(float)

features = df_sydney_processed.drop(columns='RainTomorrow', axis=1)
Y = df_sydney_processed['RainTomorrow']

<h2><a id="linear_regression">Linear Regression</a></h2>

In [31]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)

print(features.shape)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)

(3271, 66)
(2616, 66) (655, 66)
(2616,) (655,)


In [32]:
LinearReg = LinearRegression()
x = np.asanyarray(x_train)
y = np.asanyarray(y_train)

LinearReg.fit(x,y)

In [33]:
x = np.asanyarray(x_test)
predictions = LinearReg.predict(x)

In [34]:
LinearRegression_MAE = metrics.mean_squared_error(predictions, y_test)
LinearRegression_MSE = metrics.mean_absolute_error(predictions, y_test)
LinearRegression_R2 = metrics.r2_score(predictions, y_test)

Report = pd.DataFrame({"MAE":LinearRegression_MAE, 
                       "MSE":LinearRegression_MSE, 
                       "R2":LinearRegression_R2}, index=[0])
Report

Unnamed: 0,MAE,MSE,R2
0,0.115721,0.256318,-0.38476


<h2><a id="knn">KNN</a></h2>

In [35]:
features_norm = preprocessing.StandardScaler().fit(features).transform(features.astype(float))
x_train_norm, x_test_norm, y_train, y_test = train_test_split(features_norm, Y, test_size=0.2, random_state=10)

print(features.shape)
print(x_train_norm.shape, x_test_norm.shape)
print(y_train.shape, y_test.shape)

(3271, 66)
(2616, 66) (655, 66)
(2616,) (655,)


In [36]:
KNN = KNeighborsClassifier(n_neighbors=4).fit(x_train_norm, y_train)
KNN

In [37]:
predictions = KNN.predict(x_test_norm)

In [38]:
KNN_Accuracy_Score = accuracy_score(predictions, y_test)
KNN_JaccardIndex = jaccard_score(predictions, y_test)
KNN_F1_Score = f1_score(predictions, y_test)

report = pd.DataFrame({"Accuracy Score":KNN_Accuracy_Score,
                       "Jaccard Index":KNN_JaccardIndex,
                       "F1 Score":KNN_F1_Score}, index=[0])
report

Unnamed: 0,Accuracy Score,Jaccard Index,F1 Score
0,0.760305,0.241546,0.389105


<h2><a id="decision_tree">Decision Tree</a></h2>

In [39]:
Tree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
Tree.fit(x_train,y_train)

In [40]:
predictions = Tree.predict(x_test)

In [41]:
Tree_Accuracy_Score = accuracy_score(predictions, y_test)
Tree_JaccardIndex = jaccard_score(predictions, y_test)
Tree_F1_Score = f1_score(predictions, y_test)

report = pd.DataFrame({"Accuracy Score":Tree_Accuracy_Score,
                       "Jaccard Index":Tree_JaccardIndex,
                       "F1 Score":Tree_F1_Score}, index=[0])
report

Unnamed: 0,Accuracy Score,Jaccard Index,F1 Score
0,0.818321,0.480349,0.648968


<h2><a id="logistic_regression">Logistic Regression</a></h2>

In [42]:
features_norm = preprocessing.StandardScaler().fit(features).transform(features)

In [43]:
x_train_norm, x_test_norm, y_train, y_test = train_test_split(features_norm, Y, test_size=0.2, random_state=1)

print(features.shape, x_train_norm.shape, x_test_norm.shape)

(3271, 66) (2616, 66) (655, 66)


In [44]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(x_train_norm,y_train)

In [45]:
predictions = LR.predict(x_test_norm)
predict_proba = LR.predict_proba(x_test_norm)

In [46]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)

report = pd.DataFrame({"Accuracy Score":LR_Accuracy_Score,
                       "Jaccard Index":LR_JaccardIndex,
                       "F1 Score":LR_F1_Score,
                       "Log Loss": LR_Log_Loss}, index=[0])
report

Unnamed: 0,Accuracy Score,Jaccard Index,F1 Score,Log Loss
0,0.825954,0.502183,0.668605,0.388603


<h2><a id="svm">SVM</a></h2>

In [47]:
SVM = svm.SVC(kernel='rbf')
SVM.fit(x_train, y_train) 

In [48]:
predictions = SVM.predict(x_test)

In [49]:
SVM_Accuracy_Score = accuracy_score(predictions, y_test)
SVM_JaccardIndex = jaccard_score(y_test, predictions, pos_label=0)
SVM_F1_Score = f1_score(predictions, y_test, average='weighted')

report = pd.DataFrame({"Accuracy Score":SVM_Accuracy_Score,
                       "Jaccard Index":SVM_JaccardIndex,
                       "F1 Score":SVM_F1_Score}, index=[0])
report

Unnamed: 0,Accuracy Score,Jaccard Index,F1 Score
0,0.722137,0.722137,0.838652


<h2><a id="report">Final Results</a></h2>

In [50]:
index = ["Accuracy","Jaccard Index","F1-Score","LogLoss"]
columns = ["KNN","Decision Tree","Logistic Regression","SVM"]
values = [[KNN_Accuracy_Score, Tree_Accuracy_Score, LR_Accuracy_Score, SVM_Accuracy_Score],
          [KNN_JaccardIndex, Tree_JaccardIndex, LR_JaccardIndex, SVM_JaccardIndex],
          [KNN_F1_Score, Tree_F1_Score, LR_F1_Score, SVM_F1_Score],
          [np.nan, np.nan, LR_Log_Loss, np.nan]]


Report = pd.DataFrame(values, index=index, columns=columns)
Report

Unnamed: 0,KNN,Decision Tree,Logistic Regression,SVM
Accuracy,0.760305,0.818321,0.825954,0.722137
Jaccard Index,0.241546,0.480349,0.502183,0.722137
F1-Score,0.389105,0.648968,0.668605,0.838652
LogLoss,,,0.388603,


<br>
<br>
<br>
<i>This project is based on studies from the IBM course on Coursera "Machine Learning With Python"</i>