<h1><b><center>Honors Project: Predicting Rainfall in Australia Using a Classifier</center></b></h1>

This project trains five models (**Linear Regression, KNN, Decision Trees, Logistic Regression, SVM**) on the training data to predict whether it will rain tomorrow, then evaluates them on the test data using **Accuracy Score, Jaccard Index, F1-Score, LogLoss, Mean Absolute Error, Mean Squared Error and R2-Score**. Linear Regression is a regression method rather than a classifier, so it is scored with the error metrics (MAE, MSE, R2) instead of the classification metrics.

In [75]:
#Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn

In [76]:
#Import libraries
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix, accuracy_score
import sklearn.metrics as metrics

In [77]:
#Download the dataset
import requests

req = requests.get('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillUp/labs/ML-FinalAssignment/Weather_Data.csv')
req.raise_for_status()  #Stop early if the download failed

#Save the CSV locally; the with-block closes the file so the data is flushed before it is read back
with open('Weather_Data.csv', 'wb') as file:
    file.write(req.content)

df = pd.read_csv('Weather_Data.csv')

df.head()


Unnamed: 0,Date,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,...,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2/1/2008,19.5,22.4,15.6,6.2,0.0,W,41,S,SSW,...,92,84,1017.6,1017.4,8,8,20.7,20.9,Yes,Yes
1,2/2/2008,19.5,25.6,6.0,3.4,2.7,W,41,W,E,...,83,73,1017.9,1016.4,7,7,22.4,24.8,Yes,Yes
2,2/3/2008,21.6,24.5,6.6,2.4,0.1,W,41,ESE,ESE,...,88,86,1016.7,1015.6,7,8,23.5,23.0,Yes,Yes
3,2/4/2008,20.2,22.8,18.8,2.2,0.0,W,41,NNE,E,...,83,90,1014.2,1011.8,8,8,21.4,20.9,Yes,Yes
4,2/5/2008,19.7,25.7,77.4,4.8,0.0,W,41,NNE,W,...,88,74,1008.3,1004.8,8,8,22.5,25.5,Yes,Yes


## **Data Preprocessing**

In [87]:
#Convert categorical variables to binary variables using one-hot encoding
df_processed = pd.get_dummies(data=df, columns=['RainToday', 'WindGustDir', 'WindDir9am', 'WindDir3pm'])

#Change the values of 'RainTomorrow' to 0 and 1 in place (don't make a new column)
df_processed.replace(['No', 'Yes'], [0,1], inplace=True)

In [88]:
#Drop the 'Date' column
df_processed.drop('Date',axis=1,inplace=True)

In [89]:
#Change to data type float
df_processed = df_processed.astype(float)
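
To make the encoding step concrete, here is a minimal sketch on a three-row toy frame. The values are made up for illustration; only the column names mirror the dataset. It maps the target by column with `map`, which is an alternative to the frame-wide `replace` in the cell above and not what the notebook itself does.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "RainToday": ["Yes", "No", "No"],
    "WindGustDir": ["W", "E", "W"],
    "RainTomorrow": ["Yes", "No", "Yes"],
})

# One indicator column per category value; the source columns are dropped.
encoded = pd.get_dummies(toy, columns=["RainToday", "WindGustDir"])

# Map the target to 0/1 for this column only.
encoded["RainTomorrow"] = encoded["RainTomorrow"].map({"No": 0, "Yes": 1})

# Cast everything to float, as the notebook does.
encoded = encoded.astype(float)

print(encoded.columns.tolist())
print(encoded)
```

Mapping a single column avoids touching any other column that happens to contain the strings 'Yes' or 'No'.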

## **Training/Testing Data**

In [90]:
#Set features
X = df_processed.drop(columns='RainTomorrow')
#Set target variable
Y = df_processed['RainTomorrow']

## **Question 1**
Use the train_test_split function to split the features and Y dataframes with a test_size of 0.2 and the random_state set to 10.

In [91]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=10)
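
The sketch below illustrates what `test_size` and `random_state` do, using a small synthetic array rather than the weather data. The `stratify` argument is not used in this notebook; it is shown only as an optional way to keep the class ratio the same in both parts, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 20 rows, 15 negatives and 5 positives.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0] * 15 + [1] * 5)

# The same random_state gives the same split on every run.
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10)

print(a_train.shape, a_test.shape)       # (16, 2) (4, 2)
print(np.array_equal(a_test, b_test))    # True

# Optional: stratify keeps the 75/25 class ratio in the test part.
s_train, s_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10, stratify=y_demo)
print(ys_test.sum(), len(ys_test))       # 1 4
```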

## **Question 2**
Create and train a Linear Regression model called LinearReg using the training data (x_train, y_train).

In [92]:
LinearReg= LinearRegression()
LinearReg.fit(X_train, Y_train)

LinearRegression()

## **Question 3**
Use the predict method on the testing data (x_test) and save it to the array **predictions**.

In [93]:
predictions = LinearReg.predict(X_test)
predictions[0:5]

array([0.13180923, 0.27615738, 0.97816086, 0.2874527 , 0.13239288])
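
Note that these outputs are continuous scores, not class labels. If class labels were needed for comparison with the classifiers, one option is to apply a cutoff. The sketch below uses the five values printed above; the 0.5 threshold is an assumption, not part of the assignment.

```python
import numpy as np

# First five LinearReg outputs, copied from the notebook output above.
scores = np.array([0.13180923, 0.27615738, 0.97816086, 0.2874527, 0.13239288])

# Assumed cutoff: label a day as "rain tomorrow" when the score reaches 0.5.
threshold = 0.5
labels = (scores >= threshold).astype(int)

print(labels)  # [0 0 1 0 0]
```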

## **Question 4**
Use the predictions and the y_test dataframe to calculate the value for each metric using the appropriate function.

In [94]:
LinearRegression_MAE = metrics.mean_absolute_error(Y_test, predictions)
LinearRegression_MSE = metrics.mean_squared_error(Y_test, predictions)
LinearRegression_R2 = metrics.r2_score (Y_test, predictions)
print('The mean absolute error is:', LinearRegression_MAE, '\nThe mean squared error is:', LinearRegression_MSE, '\nThe R2 score is:',LinearRegression_R2)

The mean absolute error is: 0.25631853059957954 
The mean squared error is: 0.11572181723808837 
The R2 score is: 0.42712599648561245
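
For reference, the three metrics can be reproduced by hand. The sketch below uses four made-up target/prediction pairs (not the notebook's data) and checks the manual formulas against the scikit-learn functions used above.

```python
import numpy as np
from sklearn import metrics

# Hypothetical targets and predictions, chosen so the arithmetic is easy to follow.
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.2, 0.7, 0.4, 0.9])

errors = y_true - y_pred                           # [-0.2, 0.3, -0.4, 0.1]

mae = np.mean(np.abs(errors))                      # mean of |error|
mse = np.mean(errors ** 2)                         # mean of error squared
ss_res = np.sum(errors ** 2)                       # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot                           # share of variance explained

print(mae, mse, r2)
print(metrics.mean_absolute_error(y_true, y_pred),
      metrics.mean_squared_error(y_true, y_pred),
      metrics.r2_score(y_true, y_pred))
```

Both lines should print the same three values, approximately 0.25, 0.075 and 0.70.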


## **Question 5**
Show the MAE, MSE, and R2 in a tabular format using a data frame for the linear model.

In [95]:
Report={'Metrics':['MAE', 'MSE', 'R2'], 'Results':[LinearRegression_MAE, LinearRegression_MSE, LinearRegression_R2]}
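#Styler.hide_index() was removed in pandas 2.0; hide(axis='index') is its replacement (pandas >= 1.4)
pd.DataFrame(Report).style.hide(axis='index')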

Metrics,Results
MAE,0.256319
MSE,0.115722
R2,0.427126


## **Question 6**
Create and train a KNN model called KNN using the training data (x_train, y_train) with the n_neighbors parameter set to 4.

In [37]:
KNN = KNeighborsClassifier(n_neighbors = 4).fit(X_train,Y_train)
KNN

KNeighborsClassifier(n_neighbors=4)

## **Question 7**
Use the *predict* method on the testing data (x_test) and save it to the array 'predictions'.

In [54]:
predictions_KNN = KNN.predict(X_test)
predictions_KNN[0:5]

array([0., 0., 1., 0., 0.])

## **Question 8**
Using the *predictions* and the *y_test* dataframe calculate the value for each metric.

In [40]:
KNN_Accuracy_Score = metrics.accuracy_score(Y_test, predictions_KNN)
KNN_JaccardIndex = metrics.jaccard_score(Y_test, predictions_KNN)
KNN_F1_Score = metrics.f1_score(Y_test, predictions_KNN)
print('The testing set accuracy score is:', KNN_Accuracy_Score, '\nThe Jaccard similarity coefficient score is:', KNN_JaccardIndex, '\nThe F1 score is:',KNN_F1_Score)

The testing set accuracy score is: 0.8183206106870229
The Jaccard similarity coefficient score is: 0.4251207729468599
The F1 score is: 0.5966101694915255
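
All three classification metrics follow from the confusion matrix. The sketch below works through ten made-up labels (not the notebook's data) and compares the hand calculation with scikit-learn. It also shows the identity F1 = 2J / (1 + J) for the positive class, which the KNN figures above satisfy (2 × 0.4251 / 1.4251 ≈ 0.5966).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, jaccard_score)

# Hypothetical labels: 6 dry days, 4 rainy days.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
jaccard = tp / (tp + fp + fn)                # overlap of predicted and actual positives
f1 = 2 * tp / (2 * tp + fp + fn)             # harmonic mean of precision and recall

print(accuracy, jaccard, f1)
print(accuracy_score(y_true, y_pred),
      jaccard_score(y_true, y_pred),
      f1_score(y_true, y_pred))
print(np.isclose(f1, 2 * jaccard / (1 + jaccard)))  # True
```

Because the true negatives appear only in the accuracy formula, a model can score well on accuracy while Jaccard and F1 stay low when rainy days are the minority class.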


## **Question 9**
Create and train a Decision Tree model called *Tree* using the training data (*x_train, y_train*).

In [43]:
Tree = DecisionTreeClassifier()
Tree.fit(X_train,Y_train)

DecisionTreeClassifier()

## **Question 10**
Use the *predict* method on the testing data (*x_test*) and save it to the array *predictions*.

In [55]:
predictions_DT = Tree.predict(X_test)
predictions_DT[0:5]

array([0., 0., 1., 0., 0.])

## **Question 11**
Use the *predictions* and the *y_test* dataframe to calculate the value for each metric.

In [45]:
Tree_Accuracy_Score = metrics.accuracy_score(Y_test, predictions_DT)
Tree_JaccardIndex = metrics.jaccard_score(Y_test, predictions_DT)
Tree_F1_Score = metrics.f1_score(Y_test, predictions_DT)
print('The testing set accuracy score is:', Tree_Accuracy_Score, '\nThe Jaccard similarity coefficient score is:', Tree_JaccardIndex, '\nThe F1 score is:',Tree_F1_Score)

The testing set accuracy score is: 0.7557251908396947
The Jaccard similarity coefficient score is: 0.40298507462686567
The F1 score is: 0.5744680851063831


## **Question 12**
Use the *train_test_split* function to split the X and Y dataframes with a test_size of 0.2 and the random_state set to 1.

In [46]:
X_train_LR, X_test_LR, Y_train_LR, Y_test_LR = train_test_split(X, Y, test_size=.2, random_state=1)

## **Question 13**
Create and train a LogisticRegression model called *LR* using the training data (*X_train, y_train*) with the solver parameter set to *liblinear*.

In [51]:
LR = LogisticRegression(solver='liblinear').fit(X_train_LR,Y_train_LR)
LR

LogisticRegression(solver='liblinear')

## **Question 14**
Now, use the *predict* and *predict_proba* methods on the testing data (*x_test*) and save the results as two arrays, *predictions* and *predict_proba*.

In [56]:
predictions_LR = LR.predict(X_test_LR)
predictions_LR[0:5]

array([0., 0., 0., 0., 0.])

In [57]:
predict_proba = LR.predict_proba(X_test_LR)
predict_proba[0:5]
#1st column is probability of class 0, P(Y=0|X), and 2nd column is probability of class 1, P(Y=1|X)

array([[0.74339483, 0.25660517],
       [0.97495683, 0.02504317],
       [0.50982014, 0.49017986],
       [0.84891209, 0.15108791],
       [0.9684643 , 0.0315357 ]])
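
The two outputs are linked: for a binary model, the predicted label is the class with the larger probability. The sketch below checks this on the five rows printed above, assuming the column order (class 0, then class 1) stated in the comment in the previous cell.

```python
import numpy as np

# The five predict_proba rows printed above: columns are P(Y=0|X), P(Y=1|X).
proba = np.array([
    [0.74339483, 0.25660517],
    [0.97495683, 0.02504317],
    [0.50982014, 0.49017986],
    [0.84891209, 0.15108791],
    [0.9684643, 0.0315357],
])

# Each row is a probability distribution over the two classes.
row_sums = proba.sum(axis=1)

# The predicted label is the index of the larger probability in each row.
labels = proba.argmax(axis=1)

print(np.allclose(row_sums, 1.0))  # True
print(labels)                      # [0 0 0 0 0]
```

The third row is a near tie (0.51 against 0.49) but still resolves to class 0, which matches the first five entries of `predictions_LR`.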

## **Question 15**
Use the *predictions, predict_proba* and the *y_test* dataframe to calculate the value for each metric.

In [58]:
LR_Accuracy_Score = metrics.accuracy_score(Y_test_LR, predictions_LR)
LR_JaccardIndex = metrics.jaccard_score(Y_test_LR, predictions_LR)
LR_F1_Score = metrics.f1_score(Y_test_LR, predictions_LR)
LR_Log_Loss = metrics.log_loss(Y_test_LR, predict_proba)
print('The testing set accuracy score is:', LR_Accuracy_Score, '\nThe Jaccard similarity coefficient score is:', LR_JaccardIndex, '\nThe F1 score is:',LR_F1_Score, '\nThe logistic loss is:',LR_Log_Loss)

The testing set accuracy score is: 0.8366412213740458
The Jaccard similarity coefficient score is: 0.5091743119266054
The F1 score is: 0.6747720364741641
The logistic loss is: 0.38106374371303714
The logistic loss is: 0.38106374371303714


## **Question 16**
Create and train an SVM model called *SVM* using the training data (*x_train, y_train*).

In [59]:
SVM=svm.SVC()
SVM.fit(X_train, Y_train)

SVC()

## **Question 17**
Use the *predict* method on the testing data (*x_test*) and save it to the array *predictions*.

In [61]:
predictions_SVM = SVM.predict(X_test)
predictions_SVM[0:5]

array([0., 0., 0., 0., 0.])

## **Question 18**
Use the *predictions* and the *y_test* dataframe to calculate the value for each metric.

In [62]:
SVM_Accuracy_Score = metrics.accuracy_score(Y_test, predictions_SVM)
SVM_JaccardIndex = metrics.jaccard_score(Y_test, predictions_SVM)
SVM_F1_Score = metrics.f1_score(Y_test, predictions_SVM)
print('The testing set accuracy score is:', SVM_Accuracy_Score, '\nThe Jaccard similarity coefficient score is:', SVM_JaccardIndex, '\nThe F1 score is:',SVM_F1_Score)

The testing set accuracy score is: 0.7190839694656489
The Jaccard similarity coefficient score is: 0.0
The F1 score is: 0.0
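
A Jaccard index and F1 score of exactly 0 mean the model produced no true positives. Together with the all-zero sample of predictions above, this is consistent with a model that predicts "no rain" for every test row, although the notebook does not confirm that directly. The sketch below shows what such an always-negative baseline scores on made-up labels with a 28% rain rate (an assumed figure chosen to resemble the accuracy above).

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, jaccard_score

# Hypothetical test labels: 28 rainy days out of 100.
y_true = np.array([1] * 28 + [0] * 72)

# Baseline that never predicts rain.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
jac = jaccard_score(y_true, y_pred)
# zero_division=0 silences the warning raised when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print(acc, jac, f1)  # 0.72 0.0 0.0
```

Accuracy alone therefore says little here: it simply equals the share of dry days. One possible cause in a real run is that `SVC` with default settings is sensitive to feature scale, and the features in this notebook are not standardized; that explanation is a hypothesis, not something tested here.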


## **Report**
Show the Accuracy, Jaccard Index, F1-Score and LogLoss in a tabular format using a data frame for all of the above models.

In [97]:
#One column per model; LogLoss is only available for Logistic Regression
Report1 = {
    'KNN': [KNN_Accuracy_Score, KNN_JaccardIndex, KNN_F1_Score, 'NA'],
    'Decision Tree': [Tree_Accuracy_Score, Tree_JaccardIndex, Tree_F1_Score, 'NA'],
    'Logistic Regression': [LR_Accuracy_Score, LR_JaccardIndex, LR_F1_Score, LR_Log_Loss],
    'SVM': [SVM_Accuracy_Score, SVM_JaccardIndex, SVM_F1_Score, 'NA'],
}

Report1=pd.DataFrame(Report1)

Report1.rename(index={0:'Accuracy', 1:'Jaccard Index', 2:'F1-Score', 3:'LogLoss'}, inplace=True)
Report1

Metric,KNN,Decision Tree,Logistic Regression,SVM
Accuracy,0.818321,0.755725,0.836641,0.719084
Jaccard Index,0.425121,0.402985,0.509174,0.0
F1-Score,0.59661,0.574468,0.674772,0.0
LogLoss,NA,NA,0.381064,NA


## **Resources**
https://www.freecodecamp.org/news/how-to-build-and-train-linear-and-logistic-regression-ml-models-in-python/

https://lifewithdata.com/2023/06/05/how-to-calculate-r-squared-in-python/

https://www.geeksforgeeks.org/how-to-print-dataframe-in-python-without-index/

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

https://www.datacamp.com/tutorial/k-nearest-neighbor-classification-scikit-learn

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html

https://scikit-learn.org/stable/modules/svm.html

*This project assignment was created by IBM for the completion of the Honors Project.*