# PROBLEM STATEMENT 

The dataset contains daily weather observations from Australian weather stations, and your goal is to predict whether it will rain tomorrow based on the data. The target variable you want to predict is "RainTomorrow," which is a binary variable with two possible values: "Yes" or "No."

In this context, "RainTomorrow" is set to "Yes" if the amount of rain recorded for the following day is 1mm or more, indicating that it rained on that day. If the recorded rainfall is less than 1mm, "RainTomorrow" is set to "No," indicating that it did not rain on that day.

- The aim is to predict whether or not it will rain the next day in Australia (Yes or No).
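
To make the target definition concrete, here is a minimal sketch of how such a label could be derived from daily rainfall. The toy values, the single station, and the use of `shift(-1)` per location are illustrative assumptions, not a description of how the dataset was actually built.

```python
import pandas as pd

# Toy data for one station; the rainfall values are made up for illustration.
toy = pd.DataFrame({
    "Location": ["Sydney"] * 4,
    "Rainfall": [0.0, 2.4, 0.6, 1.0],
})

# RainToday: "Yes" when at least 1 mm was recorded on that day (assumed threshold).
toy["RainToday"] = (toy["Rainfall"] >= 1.0).map({True: "Yes", False: "No"})

# RainTomorrow: the following day's RainToday within the same location.
# The last day of each location has no following day, so it stays missing.
toy["RainTomorrow"] = toy.groupby("Location")["RainToday"].shift(-1)

print(toy)
```

The last row has no label because there is no following day to look at.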

# OVERVIEW 

The dataset weatherAUS.csv contains 40000 rows and 23 columns of weather observations in Australia. Here's an explanation of each column:

- Date: The date of the observation.
- Location: The location where the weather data was recorded.
- MinTemp: The minimum temperature in degrees Celsius.
- MaxTemp: The maximum temperature in degrees Celsius.
- Rainfall: The amount of rainfall recorded for the day in millimeters.
- Evaporation: The so-called Class A pan evaporation (in millimeters) in the 24 hours to 9am.
- Sunshine: The number of hours of bright sunshine in the day.
- WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight.
- WindGustSpeed: The speed (in km/h) of the strongest wind gust in the 24 hours to midnight.
- WindDir9am: Direction of the wind at 9am.
- WindDir3pm: Direction of the wind at 3pm.
- WindSpeed9am: Wind speed (in km/h) averaged over 10 minutes prior to 9am.
- WindSpeed3pm: Wind speed (in km/h) averaged over 10 minutes prior to 3pm.
- Humidity9am: Humidity (percent) at 9am.
- Humidity3pm: Humidity (percent) at 3pm.
- Pressure9am: Atmospheric pressure (hPa) reduced to mean sea level at 9am.
- Pressure3pm: Atmospheric pressure (hPa) reduced to mean sea level at 3pm.
- Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas," which are a unit of eighths. It records how many eighths of the sky are obscured by cloud.
- Cloud3pm: Fraction of sky obscured by cloud at 3pm. Similar to Cloud9am.
- Temp9am: Temperature (degrees Celsius) at 9am.
- Temp3pm: Temperature (degrees Celsius) at 3pm.
- RainToday: Indicates if it has rained. Yes if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise No.
- RainTomorrow: The target variable. Indicates if it will rain tomorrow. Yes or No.

These columns provide comprehensive information about daily weather conditions, useful for various analyses including weather forecasting, climate studies, and understanding local weather patterns. 











- There is one date variable, given by the Date column.
- There are 6 categorical variables: Location, WindGustDir, WindDir9am, WindDir3pm, RainToday and RainTomorrow.
- Two of these are binary categorical variables: RainToday and RainTomorrow.
- RainTomorrow is the target variable.


## ------------------------------------------------------------------------------------

## Guidelines to follow in this notebook 
- The name of the main dataframe should be df 
- Keep the seed value 42
- Names of training and testing variables should be X_train, X_test, y_train, y_test
- Keep the name of model instance as "model", e.g. model = DecisionTreeClassifier()
- Keep the predictions on training and testing data in a variable named y_train_pred and y_test_pred respectively.

## ------------------------------------------------------------------------------------

## Import Libraries 
#### Let's begin by importing the necessary libraries

In [None]:
# import the relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.api.types import is_numeric_dtype, is_string_dtype
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

## Load the dataset

In [None]:
### load the dataset and print no. of rows and columns 
# your code here
raise NotImplementedError
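
As a hedged illustration of the loading step, the sketch below reads a CSV with pandas and reports its shape. It uses a tiny in-memory stand-in (`csv_text`, with made-up rows) so that it runs anywhere; in the notebook you would pass the file path instead, which is assumed here to be `weatherAUS.csv` in the working directory.

```python
import io
import pandas as pd

# In-memory stand-in for the real file; in the notebook you would call
# pd.read_csv("weatherAUS.csv") instead (path is an assumption).
csv_text = io.StringIO(
    "Date,Location,MinTemp,RainTomorrow\n"
    "2008-12-01,Albury,13.4,No\n"
    "2008-12-02,Albury,7.4,No\n"
)
toy_df = pd.read_csv(csv_text)

# .shape is a (rows, columns) tuple
n_rows, n_cols = toy_df.shape
print(f"Rows: {n_rows}, Columns: {n_cols}")
```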

In [None]:
# Display the first few rows of the dataset 
df.head()

In [None]:
# Checking the null values in the dataset
df.isna().sum()

In [None]:
# Storing the  numerical columns in a list which will be used later while training the model
numerical = [var for var in df.columns if df[var].dtype!='O']
print('There are {} numerical variables\n'.format(len(numerical)))
print('The numerical variables are :', numerical)

# FEATURE GENERATION 

- Now we will create new features using the features available to us in the dataset 

## First we will work on the column 'Date'

In [None]:
# change the datatype of Column "Date" from  object to datetime
# your code here
raise NotImplementedError


- Create three columns using the column 'Date' : day, month and year 

In [None]:
# extract year from date
# your code here
raise NotImplementedError

In [None]:
# extract month from date
# your code here
raise NotImplementedError

In [None]:
# extract day from date
df['Day'] = df['Date'].dt.day
df['Day'].min()

In [None]:
# Drop the column Date as we have already created 3 columns using the column Date
# your code here
raise NotImplementedError

- Using the 'Date' column, we created three new features. We then drop the parent column ('Date'), since keeping it alongside the derived features may introduce collinearity in the data.
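
The date steps above can be sketched on toy data as follows. The column names `Year` and `Month` are assumptions (only `Day` appears in the notebook), and the two dates are made up.

```python
import pandas as pd

toy = pd.DataFrame({"Date": ["2008-12-01", "2009-01-15"]})

# Convert the string column to datetime so the .dt accessor becomes available
toy["Date"] = pd.to_datetime(toy["Date"])

# Split the date into its parts
toy["Year"] = toy["Date"].dt.year
toy["Month"] = toy["Date"].dt.month
toy["Day"] = toy["Date"].dt.day

# Drop the parent column once the parts have been extracted
toy = toy.drop(columns="Date")
print(toy)
```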

### We can look at the correlation matrix and see if we can create other features  

In [None]:
# Using the corr function to check the correlation matrix
df[numerical].corr()

In [None]:
# create a correlation heat map for numerical columns 
# Plotting the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(df[numerical].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title("Correlation Matrix of Weather Features")
plt.show()

Based on the correlation plot, creating features such as TempRange, PressureChange, AverageHumidity, and
WindSpeedDiff seems justified: they capture the variation and averages of these
correlated measures over the course of the day.


In [None]:
df['TempRange'] = df['MaxTemp'] - df['MinTemp']
df['PressureChange'] = df['Pressure9am'] - df['Pressure3pm']
df['AverageHumidity'] = (df['Humidity9am'] + df['Humidity3pm']) / 2
df['WindSpeedDiff'] = df['WindGustSpeed'] - ((df['WindSpeed9am'] + df['WindSpeed3pm']) / 2)

In [None]:
# Find categorical variables
# List the features that are categorical (string) and need to be processed, either by encoding or mapping.

categorical = [col for col in df.columns if is_string_dtype(df[col].dtype)]

### ENCODING CATEGORICAL VARIABLES 

In [None]:
# Let's check the number of unique values in each categorical variable
for col in categorical:
    print(col, df[col].nunique())

## Remove 'Date' from categorical columns if present.

In [None]:
# Remove 'Date' from the list of categorical columns, if present
# your code here
raise NotImplementedError


In [None]:
categorical

In [None]:
#Encode the categorical variables except for RainTomorrow
# your code here
raise NotImplementedError

In [None]:
assert df_encoded.shape[1] == 123, "After encoding there will be 123 columns. "
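
One common way to encode categorical columns is one-hot encoding with `pd.get_dummies`. The sketch below is an illustration on toy data, not the notebook's reference solution: the exact column count asserted above depends on choices (which columns are encoded, whether a level is dropped) that this example does not try to reproduce.

```python
import pandas as pd

toy = pd.DataFrame({
    "WindGustDir": ["N", "S", "N"],
    "RainToday": ["No", "Yes", "No"],
    "RainTomorrow": ["Yes", "No", "No"],
})

# Encode every categorical feature except the target
features_to_encode = ["WindGustDir", "RainToday"]
toy_encoded = pd.get_dummies(toy, columns=features_to_encode, dtype=int)

# Each category becomes its own 0/1 column; the target is left untouched
print(toy_encoded.columns.tolist())
```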

### Feature Scaling 

In the next step, we will scale the numerical columns

In [None]:
#Scaling the Numerical variables 
# your code here
raise NotImplementedError

In [None]:
# Storing the scaled data points in a dataframe named df_scaled
# (pass the list itself; wrapping it as [numerical] would create tuple column names)
df_scaled = pd.DataFrame(df_scaled_array, columns=numerical)
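
As a sketch of the scaling step, the example below standardises two toy columns with scikit-learn's `StandardScaler`. Standardisation is an assumption here; the notebook does not say which scaler is expected, and `toy_numerical` is a stand-in for the `numerical` list.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({
    "MinTemp": [10.0, 14.0, 18.0],
    "Humidity9am": [40.0, 60.0, 80.0],
})
toy_numerical = ["MinTemp", "Humidity9am"]

# fit_transform learns each column's mean and standard deviation,
# then rescales to zero mean and unit variance; it returns a NumPy array
scaler = StandardScaler()
toy_scaled_array = scaler.fit_transform(toy[toy_numerical])

# Wrap the array back into a dataframe, keeping column names and row index
toy_scaled = pd.DataFrame(toy_scaled_array, columns=toy_numerical, index=toy.index)
print(toy_scaled)
```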

Create a combined dataframe with all the encoded columns, the scaled columns, and the target column

In [None]:
final_df= pd.concat([df_scaled,df_encoded] , axis =1 )

In [None]:
# Modify only the tuple column names
new_column_names = [('_'.join(col) if isinstance(col, tuple) else col) for col in final_df.columns]
final_df.columns = new_column_names

In [None]:
# Finally dropping all the null values from our dataset
final_df.dropna(axis=0, inplace=True)

After the null value treatment, we will build the model on the rest of the data

In [None]:
#splitting of df to training and testing with 0.25 as Test size 
# your code here
raise NotImplementedError

In [None]:
X_train.shape, X_test.shape, y_train.shape,y_test.shape
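
The split can be sketched on toy data as below, assuming the target sits in a column named `RainTomorrow` and using the seed of 42 from the guidelines. The variable names (`X_tr`, `X_te`, and so on) are deliberately different from the notebook's so that this example does not overwrite them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy_final = pd.DataFrame({
    "feature_a": range(8),
    "feature_b": range(8, 16),
    "RainTomorrow": ["No", "Yes"] * 4,
})

# Separate the features from the target
X = toy_final.drop(columns="RainTomorrow")
y = toy_final["RainTomorrow"]

# Hold out 25% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```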

In [None]:
# Train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression

# instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=42, C=10)

# fit the model
logreg.fit(X_train, y_train)

## Prediction

In [None]:
#prediction on training
y_pred_train = logreg.predict(X_train)
y_pred_train

In [None]:
#prediction on testing 
y_pred_test = logreg.predict(X_test)

y_pred_test

## Evaluation

In [None]:
# Training Accuracy
from sklearn.metrics import accuracy_score
print('Training accuracy score: {0:0.4f}'.format(accuracy_score(y_train, y_pred_train)))

In [None]:
# Testing Accuracy
from sklearn.metrics import accuracy_score
print('Testing accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred_test)))
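
Accuracy alone can look good when one class dominates, so it may help to look at the F1 score and the confusion matrix as well. The sketch below uses made-up labels and treats "Yes" as the positive class, which is an assumption about how you would want to score this problem.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up true labels and predictions
y_true = ["No", "No", "No", "Yes", "Yes", "No"]
y_hat = ["No", "No", "No", "No", "Yes", "No"]

acc = accuracy_score(y_true, y_hat)

# F1 for the "Yes" class combines its precision and recall
f1 = f1_score(y_true, y_hat, pos_label="Yes")

# Rows are true classes, columns are predicted classes, in the order given
cm = confusion_matrix(y_true, y_hat, labels=["No", "Yes"])

print(f"Accuracy: {acc:.4f}, F1 (Yes): {f1:.4f}")
print(cm)
```

Here one of the two rainy days is missed, which lowers the F1 score much more than it lowers accuracy.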

## Feature Selection
### We can try to reduce the number of variables and see how the model performs.

For logistic regression, we can read the coefficients of the variables from the fitted equation, select the top variables with the highest coefficient values, and train the model on those alone. If the model retains its accuracy and performance, we can use only those variables, which saves time and memory.
Note that this method does not generalise: selecting independent variables also depends on other things, such as each variable's relationship with the target variable and the noise in the variable.
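
Sorting by raw coefficient value favours large positive coefficients only. A common alternative, sketched below as an option rather than as this notebook's method, is to rank by absolute value so that strongly negative features are also kept. The feature names are taken from the dataset, but the coefficient values are made up.

```python
import numpy as np
import pandas as pd

# Made-up coefficients for four features
feature_names = ["Humidity3pm", "Pressure3pm", "WindGustSpeed", "Sunshine"]
coefs = np.array([1.2, -0.9, 0.4, -1.5])

coef_series = pd.Series(coefs, index=feature_names)

# Order features by the magnitude of their coefficient, largest first,
# while keeping the signed values for interpretation
order = coef_series.abs().sort_values(ascending=False).index
ranked = coef_series.reindex(order)

top_2 = ranked.index[:2].tolist()
print(ranked)
print(top_2)
```

This comparison is only meaningful when the features are on a common scale, as they are after the scaling step.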

In [None]:
# Get the coefficients of the model
coefficients = logreg.coef_[0]

# Map feature names to their coefficients. Use the training columns rather than
# final_df.columns, which also contains the target 'RainTomorrow'.
coefficients_dict = dict(zip(X_train.columns, coefficients))

# Sort the features by coefficient value, largest first
coefficients_dict_logr = sorted(coefficients_dict.items(), key=lambda x: x[1], reverse=True)
coefficients_dict_logr

In [None]:
# Let's pick the top 10 features with the largest coefficient values.

top_10_feature_names = [feature[0] for feature in coefficients_dict_logr[:10]]

# Displaying the top 10 feature names
top_10_feature_names

In [None]:
top_10_feature_names.append("RainTomorrow")

In [None]:
final_df[top_10_feature_names].head()

In [None]:
df_new = final_df[top_10_feature_names]

In [None]:
df_new.isna().sum()

## Evaluation

In [None]:
# Testing the performance of the model on the final dataset with selected features
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    df_new.drop('RainTomorrow', axis=1), df_new['RainTomorrow'],
    test_size=0.25, shuffle=True, random_state=42)

# instantiate the model
logreg = LogisticRegression(solver='liblinear', random_state=42, C=5)

# fit the model on the selected features
logreg.fit(X_train, y_train)

# predict with the refitted model and score those predictions
y_pred_test = logreg.predict(X_test)
print('Model accuracy score: {0:0.4f}'.format(accuracy_score(y_test, y_pred_test)))

# Hyper Parameter Tuning

Let's see what the best parameters are for training logistic regression on this data. Tuning can take a long time, depending on the number of parameter combinations, cross-validation folds, and data points, and on the CPU and memory of your computer.

In [None]:
from sklearn.model_selection import GridSearchCV


parameters = [{'penalty':['l1','l2']}, 
              {'C':[1, 10, 100]}]



grid_search = GridSearchCV(estimator = logreg,  
                           param_grid = parameters,
                           scoring = 'accuracy',
                           #cv = 5,
                           verbose=1)


grid_search.fit(X_train, y_train)

In [None]:
# examine the best model

# best score achieved during the GridSearchCV
print('GridSearch CV best score : {:.4f}\n\n'.format(grid_search.best_score_))

# print parameters that give the best results
print('Parameters that give the best results :','\n\n', (grid_search.best_params_))

# print estimator that was chosen by the GridSearch
print('\n\nEstimator that was chosen by the search :','\n\n', (grid_search.best_estimator_))

With the values returned by parameter tuning, we can set these parameters to fine-tune the model, and also try other parameter tuning methods to see whether there is any improvement.
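
When `param_grid` is a list of dictionaries, GridSearchCV searches each dictionary separately, so the penalty and C are not tried in combination. The sketch below shows one possible alternative on synthetic data from `make_classification` (a stand-in for the weather data): a single dictionary that searches every penalty and C combination, followed by scoring the refitted best estimator on held-out data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=42)

# One dictionary: 2 penalties x 3 values of C = 6 candidate settings
param_grid = {"penalty": ["l1", "l2"], "C": [1, 10, 100]}

search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

# best_estimator_ is already refitted on the whole training set
tuned_model = search.best_estimator_
test_accuracy = tuned_model.score(X_te, y_te)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {test_accuracy:.4f}")
```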

In [None]:
# Try your code using the optimised parameters here

Judging by accuracy, the results are almost the same, so we can drop the remaining variables and train on fewer features, which reduces training time and memory use.
Note: In this example the accuracy did not change much in either direction. In general, the effect of feature selection depends on various factors.
    
It's important to approach the interpretation of feature importance in logistic regression with caution. This includes being mindful of the scale at which the features are measured and acknowledging the possibility of multicollinearity between them. Additionally, one should not draw conclusions about causality from logistic regression models without conducting further validation studies.
    