<a href="https://colab.research.google.com/github/ashasmalik/Machine_Learning/blob/main/Rain%20Prediction%20in%20Australia%20Using%20Machine%20Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About The Dataset


The data originally comes from the Australian Government's Bureau of Meteorology, and the latest data can be gathered from [http://www.bom.gov.au/climate/dwo/](http://www.bom.gov.au/climate/dwo/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).

The dataset used here has extra columns such as 'RainToday' and our target, 'RainTomorrow'. This version was gathered from Rattle at [https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData](https://bitbucket.org/kayontoga/rattle/src/master/data/weatherAUS.RData?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01).




This dataset contains observations of weather metrics for each day from 2008 to 2017. The **weatherAUS.csv** dataset includes the following fields:

| Field         | Description                                           | Unit            | Type   |
| ------------- | ----------------------------------------------------- | --------------- | ------ |
| Date          | Date of the Observation in YYYY-MM-DD                 | Date            | object |
| Location      | Location of the Observation                           | Location        | object |
| MinTemp       | Minimum temperature                                   | Celsius         | float  |
| MaxTemp       | Maximum temperature                                   | Celsius         | float  |
| Rainfall      | Amount of rainfall                                    | Millimeters     | float  |
| Evaporation   | Amount of evaporation                                 | Millimeters     | float  |
| Sunshine      | Amount of bright sunshine                             | hours           | float  |
| WindGustDir   | Direction of the strongest gust                       | Compass Points  | object |
| WindGustSpeed | Speed of the strongest gust                           | Kilometers/Hour | float  |
| WindDir9am    | Wind direction averaged over 10 minutes prior to 9am  | Compass Points  | object |
| WindDir3pm    | Wind direction averaged over 10 minutes prior to 3pm  | Compass Points  | object |
| WindSpeed9am  | Wind speed averaged over 10 minutes prior to 9am      | Kilometers/Hour | float  |
| WindSpeed3pm  | Wind speed averaged over 10 minutes prior to 3pm      | Kilometers/Hour | float  |
| Humidity9am   | Humidity at 9am                                       | Percent         | float  |
| Humidity3pm   | Humidity at 3pm                                       | Percent         | float  |
| Pressure9am   | Atmospheric pressure reduced to mean sea level at 9am | Hectopascal     | float  |
| Pressure3pm   | Atmospheric pressure reduced to mean sea level at 3pm | Hectopascal     | float  |
| Cloud9am      | Fraction of the sky obscured by cloud at 9am          | Eighths         | float  |
| Cloud3pm      | Fraction of the sky obscured by cloud at 3pm          | Eighths         | float  |
| Temp9am       | Temperature at 9am                                    | Celsius         | float  |
| Temp3pm       | Temperature at 3pm                                    | Celsius         | float  |
| RainToday     | If there was rain today                               | Yes/No          | object |
| RainTomorrow  | If there is rain tomorrow                             | Yes/No          | object |

Column definitions were gathered from [http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml](http://www.bom.gov.au/climate/dwo/IDCJDW0000.shtml?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMDeveloperSkillsNetworkML0101ENSkillsNetwork20718538-2022-01-01)



## **Import the required libraries**


In [1]:
# All libraries required for this lab are listed below.
# `pip install` has no `-y` option, so only `-q` (quiet) is passed here.
# The pins come from the original lab; `jaccard_score` (used below) needs
# scikit-learn 0.21 or later, so the scikit-learn pin may need to be raised.
!pip install -q pandas==1.3.4 numpy==1.21.4 seaborn==0.9.0 matplotlib==3.5.0 scikit-learn==0.20.1


In [1]:
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn import svm
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score, log_loss
import sklearn.metrics as metrics

### Importing the Dataset


In [14]:
!wget -O Weather_Data.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv

--2024-02-06 05:25:49--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 284201 (278K) [text/csv]
Saving to: ‘Weather_Data.csv’


2024-02-06 05:25:50 (536 KB/s) - ‘Weather_Data.csv’ saved [284201/284201]



In [28]:
df = pd.read_csv("Weather_Data.csv")
df.head()

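|   | Date | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustDir | WindGustSpeed | WindDir9am | WindDir3pm | ... | Humidity9am | Humidity3pm | Pressure9am | Pressure3pm | Cloud9am | Cloud3pm | Temp9am | Temp3pm | RainToday | RainTomorrow |
| - | ---- | ------- | ------- | -------- | ----------- | -------- | ----------- | ------------- | ---------- | ---------- | --- | ----------- | ----------- | ----------- | ----------- | -------- | -------- | ------- | ------- | --------- | ------------ |
| 0 | 2/1/2008 | 19.5 | 22.4 | 15.6 | 6.2 | 0.0 | W | 41 | S | SSW | ... | 92 | 84 | 1017.6 | 1017.4 | 8 | 8 | 20.7 | 20.9 | Yes | Yes |
| 1 | 2/2/2008 | 19.5 | 25.6 | 6.0 | 3.4 | 2.7 | W | 41 | W | E | ... | 83 | 73 | 1017.9 | 1016.4 | 7 | 7 | 22.4 | 24.8 | Yes | Yes |
| 2 | 2/3/2008 | 21.6 | 24.5 | 6.6 | 2.4 | 0.1 | W | 41 | ESE | ESE | ... | 88 | 86 | 1016.7 | 1015.6 | 7 | 8 | 23.5 | 23.0 | Yes | Yes |
| 3 | 2/4/2008 | 20.2 | 22.8 | 18.8 | 2.2 | 0.0 | W | 41 | NNE | E | ... | 83 | 90 | 1014.2 | 1011.8 | 8 | 8 | 21.4 | 20.9 | Yes | Yes |
| 4 | 2/5/2008 | 19.7 | 25.7 | 77.4 | 4.8 | 0.0 | W | 41 | NNE | W | ... | 88 | 74 | 1008.3 | 1004.8 | 8 | 8 | 22.5 | 25.5 | Yes | Yes |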


### Data Preprocessing


#### One Hot Encoding


First, we need to perform one hot encoding to convert categorical variables to binary variables.


In [31]:
df_sydney_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])
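To see what `get_dummies` does to a categorical column, here is a minimal sketch on a made-up three-row frame. The `toy` frame and its values are illustrative assumptions, not rows from the weather data. Each distinct category becomes its own indicator column named `<column>_<value>`, while the numeric column is left untouched.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Rainfall": [0.0, 6.0, 15.6],
    "RainToday": ["No", "Yes", "Yes"],
    "WindGustDir": ["W", "E", "W"],
})

# One indicator column is created per distinct category in the listed columns.
toy_encoded = pd.get_dummies(data=toy, columns=["RainToday", "WindGustDir"])
print(list(toy_encoded.columns))
```

With 16 compass points in each of the three wind-direction columns, this is why the processed frame ends up much wider than the raw one.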

Next, we replace the values of the 'RainTomorrow' column, converting it from a categorical column to a binary one. We do not use the `get_dummies` method here because it would produce two columns for 'RainTomorrow', which we do not want since 'RainTomorrow' is our target.


In [33]:
df_sydney_processed.replace(['No', 'Yes'], [0,1], inplace=True)
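`DataFrame.replace` as called above looks for 'No' and 'Yes' in every column of the frame. An alternative sketch, which is not the notebook's method, maps only the target column so that no other column can be changed by accident. The `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical target column with the same labels as 'RainTomorrow'.
toy = pd.DataFrame({"RainTomorrow": ["Yes", "No", "No", "Yes"]})

# Map the two labels to 0/1 in this one column only.
toy["RainTomorrow"] = toy["RainTomorrow"].map({"No": 0, "Yes": 1})
print(toy["RainTomorrow"].tolist())
```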

### Training Data and Test Data


Now we define our features (the x values) and our target variable `Y`. We first drop the 'Date' column and cast the remaining columns to float.


In [34]:
df_sydney_processed.drop('Date',axis=1,inplace=True)

In [37]:
df_sydney_processed = df_sydney_processed.astype(float)

In [38]:
features = df_sydney_processed.drop(columns='RainTomorrow')
Y = df_sydney_processed['RainTomorrow']

### Linear Regression


We use the `train_test_split` function to split the `features` and `Y` dataframes with a `test_size` of `0.2` and the `random_state` set to `10`.


In [46]:
x_train, x_test, y_train, y_test = train_test_split(features, Y, test_size=0.2, random_state=10)


In [50]:
print(x_train.shape)
print(y_train.shape)

(2616, 66)
(2616,)


In [51]:
print(x_test.shape)
print(y_test.shape)

(655, 66)
(655,)
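The shapes above follow from the split arithmetic: 2616 + 655 = 3271 rows in total, and `test_size=0.2` assigns ceil(0.2 × 3271) = 655 of them to the test set. The sketch below reproduces the same sizes on a placeholder array of that length. The zero-filled arrays are stand-ins, not the weather data.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2616 + 655              # total rows implied by the shapes printed above
X_dummy = np.zeros((n_rows, 3))  # placeholder features
y_dummy = np.zeros(n_rows)       # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=10
)

# The test set size is the fraction rounded up; the rest goes to training.
expected_test = math.ceil(0.2 * n_rows)
print(X_tr.shape, X_te.shape, expected_test)
```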


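Create and train a Linear Regression model called `LinearReg` using the training data (`x_train`, `y_train`). Because the target is 0 or 1, the model's outputs are continuous scores rather than class labels, which is why regression metrics are used to evaluate it.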


In [52]:
LinearReg = LinearRegression()
LinearReg.fit(x_train,y_train)

Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.


In [53]:
predictions = LinearReg.predict(x_test)

Using `predictions` and the `y_test` series, calculate the value of each metric with the appropriate function.


In [84]:
from sklearn.metrics import r2_score
LinearRegression_MAE = np.mean(np.absolute(y_test - predictions))
LinearRegression_MSE = np.mean((y_test - predictions)**2)
LinearRegression_R2 = r2_score(y_test,predictions)

print(type(LinearRegression_MAE))

<class 'numpy.float64'>
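As a cross-check of the hand-written formulas, the sketch below computes MAE and MSE both manually and with scikit-learn's `mean_absolute_error` and `mean_squared_error` on four made-up values. The numbers are illustrative and are not taken from the weather data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up binary targets and continuous predictions.
y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.8, 0.6, 0.3])

# Manual versions, written the same way as in the cell above.
mae_manual = np.mean(np.absolute(y_true - y_pred))
mse_manual = np.mean((y_true - y_pred) ** 2)

# Library versions.
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(mae, mse, r2)
```

Here the absolute errors are 0.1, 0.2, 0.4, and 0.3, so MAE is 0.25 and MSE is 0.075. The residual sum of squares is 0.3 against a total sum of squares of 1.0, which gives R2 = 0.7.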




Show the MAE, MSE, and R2 of the linear model in tabular form using a data frame.


In [86]:
data  = {"Metric" : ['MAE','MSE','R2'],"Values":[LinearRegression_MAE,LinearRegression_MSE,LinearRegression_R2]}
Report = pd.DataFrame(data)
Report

|   | Metric | Values   |
| - | ------ | -------- |
| 0 | MAE    | 0.256311 |
| 1 | MSE    | 0.115714 |
| 2 | R2     | 0.427163 |


### KNN


Train a KNN model called `KNN` using the training data (`x_train`, `y_train`) with the `n_neighbors` parameter set to `4`.


In [93]:
KNN = KNeighborsClassifier(n_neighbors  = 4)
KNN.fit(x_train,y_train)

Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.


In [94]:
predictions = KNN.predict(x_test)


Using `predictions` and the `y_test` series, calculate the value of each metric with the appropriate function.


In [99]:
from sklearn.metrics import accuracy_score,jaccard_score,f1_score
KNN_Accuracy_Score = accuracy_score(y_test,predictions)
KNN_JaccardIndex =  jaccard_score(y_test,predictions)
KNN_F1_Score = f1_score(y_test,predictions)
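All three scores can be derived from the confusion matrix. The sketch below does this by hand on eight made-up labels so that the formulas can be compared with the scikit-learn functions. The labels are illustrative and not from the weather data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, jaccard_score

# Made-up labels: 1 = rain, 0 = no rain.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
jaccard = tp / (tp + fp + fn)               # overlap of predicted and actual positives
f1 = 2 * tp / (2 * tp + fp + fn)            # harmonic mean of precision and recall

print(accuracy, jaccard, f1)
```

Accuracy counts the true negatives while the Jaccard index and F1-score do not, which is why a model can score high accuracy and low Jaccard on data where most days have no rain.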

In [101]:
dataKnn = {"Metrics" : ["Accuracy_Score","JaccardIndex","F1_Score"], "values": [KNN_Accuracy_Score,KNN_JaccardIndex,KNN_F1_Score]}
ReportKNN = pd.DataFrame(dataKnn)
ReportKNN

|   | Metrics        | values   |
| - | -------------- | -------- |
| 0 | Accuracy_Score | 0.818321 |
| 1 | JaccardIndex   | 0.425121 |
| 2 | F1_Score       | 0.59661  |
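The value `n_neighbors=4` is given by the exercise. If one wanted to see how the score depends on that choice, a simple sweep over k could look like the sketch below. It uses synthetic data from `make_classification` as a stand-in, and the range 1 to 10 is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the weather data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)

# Fit one model per k and record its accuracy on the held-out part.
scores = {}
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = accuracy_score(y_te, model.predict(X_te))

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Picking k on the same data used for the final score makes that score optimistic, so in practice a separate validation split or cross-validation would be the safer choice.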


### Decision Tree




Train a Decision Tree model called `Tree` using the training data (`x_train`, `y_train`).


In [106]:
Tree = DecisionTreeClassifier()
Tree.fit(x_train,y_train)



Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.


In [107]:
predictions = Tree.predict(x_test)

Using `predictions` and the `y_test` series, calculate the value of each metric with the appropriate function.


In [108]:
Tree_Accuracy_Score = accuracy_score(y_test,predictions)
Tree_JaccardIndex = jaccard_score(y_test,predictions)
Tree_F1_Score = f1_score(y_test,predictions)

In [109]:
dataTree = {"Metrics" : ["Accuracy_Score","JaccardIndex","F1_Score"], "values": [Tree_Accuracy_Score,Tree_JaccardIndex,Tree_F1_Score]}
ReportTree = pd.DataFrame(dataTree)
ReportTree

|   | Metrics        | values   |
| - | -------------- | -------- |
| 0 | Accuracy_Score | 0.757252 |
| 1 | JaccardIndex   | 0.404494 |
| 2 | F1_Score       | 0.576    |
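`DecisionTreeClassifier()` with default arguments grows an unrestricted tree and breaks ties between equally good splits at random, so the scores above can vary slightly between runs. A sketch of a reproducible, depth-limited variant follows. The synthetic data and the choice of `max_depth=4` are assumptions for illustration, not settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the weather data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Fixing random_state makes repeated fits identical; max_depth caps tree growth.
tree_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

same_predictions = bool((tree_a.predict(X) == tree_b.predict(X)).all())
print(tree_a.get_depth(), same_predictions)
```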


### Logistic Regression


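We use the `train_test_split` function again to split the `features` and `Y` dataframes with a `test_size` of `0.2`, this time with the `random_state` set to `1`. This is a different split from the one used above (`random_state` of `10`), so the scores of the models below are computed on a different test set.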


In [110]:
x_train, x_test, y_train, y_test = train_test_split(features,Y,test_size  = 0.2 , random_state= 1)

Train a LogisticRegression model called `LR` using the training data (`x_train`, `y_train`) with the `solver` parameter set to `liblinear`.


In [112]:
LR = LogisticRegression(solver = 'liblinear')
LR.fit(x_train,y_train)

Use the `predict` and `predict_proba` methods on the testing data (`x_test`) and save the results as two arrays, `predictions` and `predict_proba`.


In [113]:
predictions = LR.predict(x_test)

In [114]:
predict_proba = LR.predict_proba(x_test)

Using `predictions`, `predict_proba`, and the `y_test` series, calculate the value of each metric with the appropriate function.


In [119]:
LR_Accuracy_Score = accuracy_score(y_test, predictions)
LR_JaccardIndex = jaccard_score(y_test, predictions)
LR_F1_Score = f1_score(y_test, predictions)
LR_Log_Loss = log_loss(y_test, predict_proba)
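Log loss is the only metric here that uses the predicted probabilities rather than the hard labels. The sketch below works it out by hand for three made-up predictions and compares the result with `log_loss`. The probability values are illustrative assumptions laid out like the output of `predict_proba`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1])

# Same layout as predict_proba: column 0 is P(class 0), column 1 is P(class 1).
proba = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.4, 0.6],
])

# Probability the model assigned to the class that actually occurred.
p_correct = proba[np.arange(len(y_true)), y_true]

# Log loss is the mean negative log of those probabilities.
manual = -np.mean(np.log(p_correct))
library = log_loss(y_true, proba)

# Thresholding the class-1 probability at 0.5 gives hard labels.
labels = (proba[:, 1] >= 0.5).astype(int)
print(manual, library, labels)
```

A confident wrong prediction is penalized heavily by the logarithm, so a lower log loss indicates better calibrated probabilities.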

In [129]:
dataLR = {"Metrics" : ["Accuracy_Score","JaccardIndex","F1_Score","Log loss"] ,"LogisticRegression": [LR_Accuracy_Score, LR_JaccardIndex , LR_F1_Score ,LR_Log_Loss]}
ReportLR = pd.DataFrame(dataLR)
ReportLR

|   | Metrics        | LogisticRegression |
| - | -------------- | ------------------ |
| 0 | Accuracy_Score | 0.836641           |
| 1 | JaccardIndex   | 0.509174           |
| 2 | F1_Score       | 0.674772           |
| 3 | Log loss       | 0.380482           |


### SVM


Train an SVM model called `SVM` using the training data (`x_train`, `y_train`).


In [137]:
SVM = svm.SVC()
SVM.fit(x_train,y_train)

Use the `predict` method on the testing data (`x_test`) and save the result to the array `predictions`.


In [125]:
predictions = SVM.predict(x_test)

Using `predictions` and the `y_test` series, calculate the value of each metric with the appropriate function.


In [133]:
SVM_Accuracy_Score = accuracy_score(y_test, predictions)
SVM_JaccardIndex = jaccard_score(y_test, predictions)
SVM_F1_Score = f1_score(y_test, predictions)
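In the report at the end of this notebook the SVM's Jaccard index and F1-score are both 0, which means it produced no true positives on this test set. One common cause, offered here as an assumption rather than a diagnosis, is that the default RBF kernel of `SVC` is sensitive to feature scale, and the features here are unscaled (pressure is around 1000 while the encoded columns are 0 or 1). A sketch of standardizing the features inside a pipeline, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the weather data.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X[:, 0] = X[:, 0] * 1000 + 1000  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# The scaler is fitted on the training data only, then applied before the SVM.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(X_tr, y_tr)

preds = scaled_svm.predict(X_te)
print(f1_score(y_te, preds))
```

Whether scaling changes the scores on the weather data would need to be checked by running it there.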

### Report


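Show the Accuracy, Jaccard Index, F1-Score, and LogLoss for all of the above models in tabular form using a data frame.

\*LogLoss is only computed for the Logistic Regression model. The `LinearRegression` column is an exception: its first three rows hold MAE, MSE, and R2 rather than Accuracy, Jaccard Index, and F1-Score.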


In [135]:
datasummary = {
    "Metric": ["Accuracy", "Jaccard Index", "F1-Score", "LogLoss"],
    # For LinearRegression the first three rows hold MAE, MSE, and R2 instead.
    "LinearRegression": [LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2, None],
    "KNN": [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score, None],
    "Decision": [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, None],
    "LogisticRegression": [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss],
    "SVM": [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, None],
}
Report = pd.DataFrame(datasummary)
Report

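|   | Metric        | LinearRegression | KNN      | Decision | LogisticRegression | SVM      |
| - | ------------- | ---------------- | -------- | -------- | ------------------ | -------- |
| 0 | Accuracy      | 0.256311         | 0.818321 | 0.757252 | 0.836641           | 0.722137 |
| 1 | Jaccard Index | 0.115714         | 0.425121 | 0.404494 | 0.509174           | 0.0      |
| 2 | F1-Score      | 0.427163         | 0.59661  | 0.576    | 0.674772           | 0.0      |
| 3 | LogLoss       |                  |          |          | 0.380482           |          |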
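Regression and classification metrics do not share row labels, so one alternative layout is to build the summary from one dictionary per model and let pandas fill in the metrics a model does not have as NaN. The sketch below uses three of the models with the values reported in the tables above. The `results` and `summary` names are hypothetical.

```python
import pandas as pd

# Values copied from the tables above; each model lists only its own metrics.
results = {
    "KNN": {"Accuracy": 0.818321, "Jaccard Index": 0.425121, "F1-Score": 0.59661},
    "LogisticRegression": {
        "Accuracy": 0.836641,
        "Jaccard Index": 0.509174,
        "F1-Score": 0.674772,
        "LogLoss": 0.380482,
    },
    "LinearRegression": {"MAE": 0.256311, "MSE": 0.115714, "R2": 0.427163},
}

# Outer keys become columns; the union of the inner keys becomes the index.
summary = pd.DataFrame(results)
print(summary)
```

This keeps MAE, MSE, and R2 on their own rows instead of placing them under the Accuracy, Jaccard Index, and F1-Score labels.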
