<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Pink;
           font-size:210%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:center;"
          >
       WELCOME TO MY NOTEBOOK
</p>
</div>

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Purple;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">

<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       About the Dataset: Rain in Australia
</p>
</div>

![](https://media0.giphy.com/media/JjrDsvilNKgw0/giphy.gif)

> **Background:**
The goal is to forecast whether it will rain the following day using classification models. The main focus is on the target attribute "RainTomorrow."

> **Dataset Description:**
The dataset encompasses approximately a decade's worth of everyday weather records across various locations in Australia.

> **Target Variable:**
The target variable for prediction is "RainTomorrow." This signifies whether there will be rainfall the subsequent day. The value is labeled as "Yes" when the precipitation on that day amounts to 1mm or higher.
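
As a rough illustration of that labeling rule, the sketch below derives a next-day rain label from a daily rainfall column. The toy values and the derivation itself are assumptions made for illustration; the actual dataset already ships with the `RainTomorrow` label.

```python
import pandas as pd

# Toy daily rainfall in mm for a single location (made-up values)
toy = pd.DataFrame({"Rainfall": [0.0, 0.4, 1.0, 5.2, 0.0]})

# A day counts as rainy when it records 1 mm or more
toy["RainToday"] = (toy["Rainfall"] >= 1.0).map({True: "Yes", False: "No"})

# RainTomorrow is the next day's RainToday; the last day has no "tomorrow"
toy["RainTomorrow"] = toy["RainToday"].shift(-1)

print(toy)
```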

> In total, the dataset contains 145,460 rows and 23 columns.


"🔍📓Thanks for exploring the notebook! If you found it helpful or interesting, kindly consider upvoting. Your support means a lot to us and encourages more valuable content. Happy notebooking😊!"

# Import all the necessary libraries

In [None]:
conda install "numpy>=1.16.5,<1.23.0"

In [None]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler,OneHotEncoder, LabelEncoder
from sklearn.metrics import accuracy_score,roc_auc_score,precision_score, recall_score, f1_score,ConfusionMatrixDisplay,classification_report


import warnings 
warnings.filterwarnings("ignore")

# Read the Dataset

In [None]:
# Read the dataset
dataframe= pd.read_csv("/kaggle/input/weather-dataset-rattle-package/weatherAUS.csv")
dataframe.head()

In [None]:
# Check the shape of the dataset
dataframe.shape

In [None]:
# check the datatypes of dataset
dataframe.info()

In [None]:
dataframe.describe()

In [None]:
# Count the null values in each column
dataframe.isna().sum()

In [None]:
# Count the duplicate rows in the dataset
dataframe.duplicated().sum()

> There are no duplicate rows in the dataset.

In [None]:
# Drop the "Date" column: this notebook does not build any time-based features from it
dataframe.drop("Date", axis=1, inplace=True)

# Getting Numerical and Categorical columns

In [None]:
def get_num_cat_columns(dataframe):
    categorical_cols=dataframe.select_dtypes(include="object").columns
    numerical_cols=dataframe.select_dtypes(exclude="object").columns
    
    return categorical_cols, numerical_cols

In [None]:
categorical_cols, numerical_cols= get_num_cat_columns(dataframe)

In [None]:
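# Let's check the correlation between the numerical variables
# (restricting to numeric columns avoids an error on string columns in recent pandas versions)
dataframe[numerical_cols].corr()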

# Visualise the Correlation Matrix

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(dataframe[numerical_cols].corr(), annot=True, cmap="Greens", fmt=".2f")
plt.show()

> 1. Temp9am and MinTemp are highly correlated (0.90).
> 2. MaxTemp is highly correlated with Temp3pm (0.98) and with Temp9am (0.89).
> 3. Temp9am and Temp3pm are highly correlated (0.86).
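
Reading such pairs off a heatmap by eye gets tedious with many columns. A minimal sketch of listing them programmatically follows; the helper name `top_correlated_pairs`, the 0.85 threshold, and the synthetic columns are all assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.85):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic stand-in for the temperature columns
rng = np.random.default_rng(0)
base = rng.normal(20, 5, 500)
demo = pd.DataFrame({
    "MinTemp": base,
    "Temp9am": base + rng.normal(0, 1, 500),   # closely tracks MinTemp
    "Humidity": rng.normal(60, 10, 500),       # independent of the others
})
print(top_correlated_pairs(demo))
```

On the notebook's data the same helper could be called as `top_correlated_pairs(dataframe[numerical_cols])`.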
  

# Let's handle the Missing Values in Numerical Columns

In [None]:
numerical_cols

In [None]:
imputer= SimpleImputer(missing_values=np.nan, strategy="median", fill_value=None)
for col in numerical_cols:
    dataframe[col]= imputer.fit_transform(dataframe[[col]])
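
The cell above learns each median from the full dataset, before any train/test split. A leakage-aware variant would learn the median from the training rows only and reuse it on the test rows. The sketch below shows that pattern on made-up values; the `train` and `test` frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and test rows with missing rainfall values
train = pd.DataFrame({"Rainfall": [0.0, 2.0, np.nan, 10.0]})
test = pd.DataFrame({"Rainfall": [np.nan, 1.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)   # the median is learned from the training rows
test_filled = imputer.transform(test)         # the training median is reused, not re-estimated

print(train_filled.ravel(), test_filled.ravel())
```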

# Handle Missing Values in Categorical Columns

In [None]:
categorical_cols

In [None]:
#Filling the missing values for categorical variables with mode
dataframe['RainToday']=dataframe['RainToday'].fillna(dataframe['RainToday'].mode()[0])
dataframe['RainTomorrow']=dataframe['RainTomorrow'].fillna(dataframe['RainTomorrow'].mode()[0])

In [None]:
#Filling the missing values for categorical variables with mode
dataframe['WindDir9am'] = dataframe['WindDir9am'].fillna(dataframe['WindDir9am'].mode()[0])
dataframe['WindGustDir'] = dataframe['WindGustDir'].fillna(dataframe['WindGustDir'].mode()[0])
dataframe['WindDir3pm'] = dataframe['WindDir3pm'].fillna(dataframe['WindDir3pm'].mode()[0])

In [None]:
# Check again that no null values remain in any column
dataframe.isna().sum()

# Exploratory Data Analysis
> # Univariate Analysis

# Let's check the unique values in each categorical column

In [None]:
dataframe["Location"].unique()

In [None]:
dataframe['RainTomorrow'].unique()

In [None]:
dataframe['RainToday'].unique()

In [None]:
dataframe['WindDir9am'].unique()

In [None]:
dataframe['WindGustDir'].unique()

In [None]:
dataframe['WindDir3pm'].unique()

# Let's check whether our Dataset is balanced

In [None]:
dataframe['RainTomorrow'].value_counts()

In [None]:
plt.figure(figsize=(5,5))
sns.countplot(x=dataframe['RainTomorrow'], palette="muted")
plt.title("Data Distribution of RainTomorrow")
plt.show()

> Here we can see that our dataset is imbalanced: most of the rows belong to the "No" class.
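
To put a number on the imbalance rather than judging it from the bar chart, the class shares can be computed directly. This is a small sketch on made-up labels; the 78/22 split is an assumption for illustration, not the dataset's actual distribution.

```python
import pandas as pd

# Made-up labels standing in for dataframe['RainTomorrow']
labels = pd.Series(["No"] * 78 + ["Yes"] * 22, name="RainTomorrow")

share = labels.value_counts(normalize=True)    # fraction of rows per class
imbalance_ratio = share.max() / share.min()    # majority-to-minority ratio

print(share.round(2))
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```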

# Let's see the Data Distribution for Numerical Columns

In [None]:
colors_list=["red","green","blue","grey","pink", "purple","orange","violet","red","green","blue","grey","pink", "purple","orange","violet"]
for i in range(len(numerical_cols)):
    plt.figure(figsize=(7,7))
    sns.histplot(dataframe[numerical_cols[i]], color=colors_list[i], kde=True, bins=15)
    label=numerical_cols[i]
    plt.xlabel(numerical_cols[i])
    plt.ylabel("count")
    plt.title(label)

# Bivariate Analysis
> # Detecting Outliers in the Dataset

In [None]:
for i in range(len(numerical_cols)):
    plt.figure(figsize=(7,7))
    sns.violinplot(data=dataframe, x="RainTomorrow", y=numerical_cols[i], color=colors_list[i])
    plt.title(f"RainTomorrow vs {numerical_cols[i]}")

# Feature Selection
> Feature Selection is a process of selecting a subset of relevant features from the original set of features.


In [None]:
fs_dataframe=dataframe.copy()

In [None]:
# Convert categorical variables into numerical form
le=LabelEncoder()
for col in categorical_cols:
    fs_dataframe[col]= le.fit_transform(fs_dataframe[col])

In [None]:
# Scale the numerical columns to the [0, 1] range (min-max scaling also keeps them non-negative, as chi2 requires)
scaler= MinMaxScaler()
fs_dataframe[numerical_cols]=scaler.fit_transform(fs_dataframe[numerical_cols])

# Feature Selection Using Filter Method
> Filter methods are the simplest and most computationally efficient approach to feature selection. Features are selected based on their statistical properties, such as their correlation with the target variable or their variance. The cell below scores each feature with the chi-squared statistic and keeps the 12 highest-scoring ones.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2
X = fs_dataframe.drop("RainTomorrow", axis=1)
y = fs_dataframe['RainTomorrow']
selector = SelectKBest(chi2, k=12)
selector.fit(X, y)
X_new = selector.transform(X)
print(selector.get_support(indices=True))                     # Print the indices of the selected columns
print(X.columns[selector.get_support(indices=True)])          # Print the names of the top 12 most relevant features
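
Besides the selected column names, the fitted selector also exposes the per-feature chi-squared scores through `scores_`, which shows how far apart the features are. Below is a minimal sketch on synthetic, non-negative data; the `informative` and `noise` columns are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, 300)
X_demo = pd.DataFrame({
    "informative": y_demo * 0.6 + rng.random(300) * 0.4,   # tracks the label
    "noise": rng.random(300),                               # unrelated to the label
})

demo_selector = SelectKBest(chi2, k=1).fit(X_demo, y_demo)

# Pair each score with its column name and rank from most to least relevant
scores = pd.Series(demo_selector.scores_, index=X_demo.columns).sort_values(ascending=False)
print(scores)
```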

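# Feature Selection Using a Model-Based Method
> This approach trains a machine learning model and keeps the features the model considers most important. `SelectFromModel` is usually classed as an embedded method rather than a wrapper method: it fits the model once and thresholds the feature importances, instead of evaluating the performance of many different feature subsets.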

In [None]:
from sklearn.feature_selection import SelectFromModel

X = fs_dataframe.drop("RainTomorrow", axis=1)
y = fs_dataframe['RainTomorrow']
# Fit a random forest and keep the features whose importance is above the default threshold (the mean importance)
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
selector.fit(X, y)
support = selector.get_support()
features = X.loc[:,support].columns.tolist()
print(features)
# Reuse the forest fitted inside the selector instead of training a second, identical one
print(selector.estimator_.feature_importances_)
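
For comparison, a wrapper-style search in the strict sense refits the model on different feature subsets. One option in scikit-learn is recursive feature elimination (`RFE`). The sketch below runs it on synthetic data; the dataset, the logistic regression estimator, and the choice of 3 features are assumptions for illustration, not what this notebook uses.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic classification problem with 8 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# RFE refits the model repeatedly, dropping the weakest feature each round
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
rfe.fit(X_demo, y_demo)

print(rfe.support_)    # boolean mask of the kept features
print(rfe.ranking_)    # 1 = kept; larger numbers were eliminated earlier
```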

# Divide the Dataset into Training and Testing set

In [None]:
def train_test_split_data(dataframe,target,test_size, random_state):
    x_train,x_test, y_train, y_test= train_test_split(dataframe.drop([target], axis=1),
                                                      dataframe[target],
                                                      test_size=test_size,
                                                      random_state=random_state,
                                                      stratify=dataframe[target]
                                                      )
    
    return x_train,x_test, y_train, y_test

In [None]:
x_train, x_test, y_train, y_test= train_test_split_data(dataframe,target="RainTomorrow",test_size=0.3, random_state=42)

In [None]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

# Detecting and Removing Outliers from the Training Set using Percentile Bounds

In [None]:
def Winsorization_Method(columns, x_train, y_train , a, b):
    # Note: despite its name, this function drops the rows that fall outside the
    # [a, b] percentile range of any column; it does not cap them.
    outlier_mask = pd.Series(False, index=x_train.index)

    for col in columns:
        lower = np.percentile(x_train[col], a)
        upper = np.percentile(x_train[col], b)
        # Flag the rows below the lower bound or above the upper bound (vectorised)
        outlier_mask |= (x_train[col] < lower) | (x_train[col] > upper)

    ratio = round(outlier_mask.mean() * 100, 2)   # Percentage of rows flagged as outliers
    x_train = x_train[~outlier_mask]              # Remove the outliers from the training data
    y_train = y_train[~outlier_mask]

    return ratio, x_train, y_train

In [None]:
ratio, x_train,y_train= Winsorization_Method(numerical_cols, x_train, y_train , a=1, b=99.2)

In [None]:
print(f"Percentage of training rows removed as outliers: {ratio}%")
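
Strictly speaking, winsorization caps extreme values at the percentile bounds, whereas the function above removes the affected rows, which is closer to trimming. The sketch below contrasts the two on made-up values; the 10th and 90th percentile bounds are an assumption chosen to make the effect visible.

```python
import numpy as np
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
lower, upper = np.percentile(values, [10, 90])

# Trimming: rows outside the bounds are dropped
trimmed = values[(values >= lower) & (values <= upper)]

# Winsorizing: every row is kept, extremes are capped at the bounds
winsorized = values.clip(lower=lower, upper=upper)

print(len(trimmed), len(winsorized))
print(winsorized.min(), winsorized.max())
```

Trimming shrinks the training set, while capping keeps every row at the cost of altering the extreme values.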

# Data Preprocessing
> # Data Preprocessing for Training Data


In [None]:
categorical_cols, numerical_cols= get_num_cat_columns(x_train)

In [None]:
categorical_cols

In [None]:
# One-Hot encode non-numeric columns
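# Note: `sparse_output` replaced `sparse` in scikit-learn 1.2; on older versions use sparse=False
ohe= OneHotEncoder(handle_unknown="ignore", sparse_output=False)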
x_train_encoded=pd.DataFrame(ohe.fit_transform(x_train[categorical_cols]))
x_train_encoded.columns= ohe.get_feature_names_out(categorical_cols)

# Label Encode the target class
le= LabelEncoder()
y_train=le.fit_transform(y_train)

# Apply MinMaxScaler for feature scaling
scaler= MinMaxScaler()
x_train_scaled= pd.DataFrame(scaler.fit_transform(x_train[numerical_cols]))
x_train_scaled.columns=numerical_cols

# Concatenate the encoded and scaled features
x_train_processed=pd.concat([x_train_encoded,x_train_scaled], axis=1)
x_train_processed


# Data Preprocessing for Testing Data

In [None]:
# One-Hot encode non-numeric columns
x_test_encoded=pd.DataFrame(ohe.transform(x_test[categorical_cols]))
x_test_encoded.columns= ohe.get_feature_names_out(categorical_cols)

# Label Encode the target class
y_test=le.transform(y_test)

# Apply the MinMaxScaler fitted on the training data
x_test_scaled= pd.DataFrame(scaler.transform(x_test[numerical_cols]))
x_test_scaled.columns=numerical_cols

# Concatenate the encoded and scaled features
x_test_processed=pd.concat([x_test_encoded,x_test_scaled], axis=1)
x_test_processed

# Let's Balance the Training Set using SMOTE

In [None]:
smote= SMOTE(sampling_strategy='minority', random_state=43)
x_train_smote, y_train_smote= smote.fit_resample(x_train_processed, y_train)
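
SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority sample and one of its nearest minority-class neighbors. The sketch below illustrates only that interpolation step with NumPy. It is a simplified illustration, not imblearn's implementation: the points are made up, the helper `synthetic_point` is hypothetical, and the neighbor is picked by hand instead of by a nearest-neighbor search.

```python
import numpy as np

rng = np.random.default_rng(0)
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])   # made-up minority samples

def synthetic_point(points, i, j, rng):
    """Interpolate between sample i and its neighbor j at a random position."""
    lam = rng.random()   # drawn uniformly from [0, 1)
    return points[i] + lam * (points[j] - points[i])

# New sample on the line segment between minority[0] and minority[1]
new_point = synthetic_point(minority, 0, 1, rng)
print(new_point)
```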

# Modelling by Ensemble Learning
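> **Voting Classifier:** A Voting Classifier is an ensemble machine learning model that combines the predictions of multiple individual models, also known as base classifiers. With hard voting, as used here, each base classifier casts one vote per sample and the class with the most votes wins.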
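
The majority rule behind hard voting can be sketched with plain NumPy. The predictions below are made up, and the four rows simply stand in for four base classifiers. With an even number of voters a tie is possible; scikit-learn's `VotingClassifier` resolves it in favor of the class that comes first in ascending label order, which `argmax` reproduces here.

```python
import numpy as np

# Rows are base classifiers, columns are samples (made-up predictions)
base_predictions = np.array([
    [1, 0, 1, 0],   # model A
    [1, 0, 0, 1],   # model B
    [0, 0, 1, 1],   # model C
    [1, 1, 0, 0],   # model D
])

# Count the votes per class for every sample, then pick the most voted class.
# argmax returns the first maximum, so a tie goes to the lower class label.
votes = np.apply_along_axis(np.bincount, 0, base_predictions, minlength=2)
majority = votes.argmax(axis=0)

print(votes)
print(majority)
```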

In [None]:
accuracy_result = []
recall_scores = []
precision_scores = []
roc_auc_scores = []
f1_scores = []

# Create the Model
clf_rf= RandomForestClassifier(max_features=17,min_samples_leaf=6,min_samples_split= 2,n_estimators=200,random_state=42)
clf_lr= LogisticRegression(penalty='l1' , solver='liblinear')
clf_svm=SVC(kernel='linear', gamma='scale')
clf_gbc=GradientBoostingClassifier(min_samples_split=6, min_samples_leaf=10)

# Create Votingclassifier Model
voting_clf= VotingClassifier(estimators=[('rf', clf_rf), ('lr', clf_lr), ('svm', clf_svm),('gbc', clf_gbc)], voting='hard', n_jobs=-1)
voting_clf.fit(x_train_smote,y_train_smote)
y_pred=voting_clf.predict(x_test_processed)


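# Save the results into lists (scikit-learn metrics expect y_true first, then y_pred)
accuracy_result.append(accuracy_score(y_test, y_pred))
recall_scores.append(recall_score(y_test, y_pred))
precision_scores.append(precision_score(y_test, y_pred))
f1_scores.append(f1_score(y_test, y_pred))
# Hard voting yields class labels only, so this ROC AUC is computed from labels rather than scores
roc_auc_scores.append(roc_auc_score(y_test, y_pred))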

# Print the Results
print(f"Accuracy:{accuracy_result}")
print(f"ROC AUC:{roc_auc_scores}")
print(f"Recall:{recall_scores}")
print(f"Precision:{precision_scores}")
print(f"F1-Score:{f1_scores}")
print("Classification Report")
print("---------------------")
print(classification_report(y_test,y_pred,digits=3))
print("Confusion_Matrix")
print("---------------------")
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.show()
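
scikit-learn's metric functions take the true labels first and the predictions second. Accuracy is unaffected by the order, but precision and recall trade places when the arguments are swapped. The sketch below shows this on made-up labels.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 0, 0, 0, 0, 0, 0, 1]   # made-up predictions: 1 TP, 2 FN, 1 FP

recall = recall_score(y_true, y_hat)          # TP / (TP + FN)
precision = precision_score(y_true, y_hat)    # TP / (TP + FP)
swapped = recall_score(y_hat, y_true)         # arguments reversed: equals the precision

print(recall, precision, swapped)
print(accuracy_score(y_true, y_hat) == accuracy_score(y_hat, y_true))
```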

In [None]:
model_names = ['VotingClassifier']
result_df = pd.DataFrame({'Recall':recall_scores, 'Precision':precision_scores, 'F1_Score':f1_scores, 'Accuracy': accuracy_result, 'ROC_AUC_Score':roc_auc_scores},index=model_names)

In [None]:
result_df.T.sort_values(by="VotingClassifier", ascending=False)

In [None]:
result_df.T.sort_values(by="VotingClassifier", ascending=False).plot(kind="bar", figsize=(7, 7), color="green").legend(bbox_to_anchor=(1,1));

<div style="color:white;
           display:fill;
           border-radius:25px;
           background-color:Purple;
           font-size:160%;
           font-family:Verdana;
           letter-spacing:0.5px">
<p style="padding: 10px;
          color:white;
          text-align:left;"
          >
       Conclusion
</p>
</div>

> # We trained a Voting Classifier that combines four machine learning models (Random Forest, Logistic Regression, SVM and Gradient Boosting). It achieves an accuracy of 82.47% on the test set.

"🔍📓Thanks again for exploring the notebook! If you found it helpful or interesting, kindly consider upvoting. Your support means a lot to us and encourages more valuable content. Happy notebooking😊!"

![](https://i.pinimg.com/originals/da/26/ec/da26ec81abe5c6a31500de2b042d811f.gif)