In [284]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Data Encoding

Encoding is the process of changing original data into a form that can be used by the synthetic generator model. Each column can have different encoding such as numerical, categorical, datetime or geo. In an ideal case, the detection of the encoding types is done automatically by the synthetic data generator.

## Label Encoding

Label Encoding is a technique that is used to convert categorical columns into numerical ones so that they can be fitted by machine learning models which only take numerical data. It is an important pre-processing step in a machine-learning project.

In [66]:
df = pd.read_csv("weather_classification_data.csv")
df.head()

Unnamed: 0,Temperature,Humidity,Wind Speed,Precipitation (%),Cloud Cover,Atmospheric Pressure,UV Index,Season,Visibility (km),Location,Weather Type
0,14.0,73,9.5,82.0,partly cloudy,1010.82,2,Winter,3.5,inland,Rainy
1,39.0,96,8.5,71.0,partly cloudy,1011.43,7,Spring,10.0,inland,Cloudy
2,30.0,64,7.0,16.0,clear,1018.72,5,Spring,5.5,mountain,Sunny
3,38.0,83,1.5,82.0,clear,1026.25,7,Spring,1.0,coastal,Sunny
4,27.0,74,17.0,66.0,overcast,990.67,1,Winter,2.5,mountain,Rainy


In this dataset, we'll apply label encoding to the `Season` column.

In [69]:
df_Season = df['Season']
df_Season

0        Winter
1        Spring
2        Spring
3        Spring
4        Winter
          ...  
13195    Summer
13196    Winter
13197    Autumn
13198    Winter
13199    Autumn
Name: Season, Length: 13200, dtype: object

In [71]:
Label = LabelEncoder()
df_label = Label.fit_transform(df_Season)
df_label

array([3, 1, 1, ..., 0, 3, 0])

### **Limitation of Label Encoding**
Label encoding converts categorical data into numbers by assigning a unique integer (starting from 0) to each class. This can introduce an artificial ordering during model training: a label with a higher value may be treated as having a higher priority than a label with a lower value.

`Example of the Limitation of Label Encoding`
Consider an attribute whose classes are Mexico, Paris, and Dubai. Suppose label encoding replaces Mexico with 0, Paris with 1, and Dubai with 2.

This suggests to the model that Dubai ranks higher than Mexico and Paris, even though no such ordering exists between these cities.
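The ordering issue can be made concrete with a short sketch. It assumes the three city labels from the example, and the variable names are only illustrative. Note that scikit-learn's `LabelEncoder` assigns codes in sorted label order, so the integers differ from the hypothetical mapping in the example, but the artificial ordering is the same kind of problem.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical city column, used only for illustration
cities = ["Mexico", "Paris", "Dubai", "Paris"]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)

# classes_ lists the labels in the order of their integer codes
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

A model that reads this column as a number would see Paris (2) as "greater than" Dubai (0), although the cities have no natural order.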

**We can also convert the numerical labels back to the original categorical values with the `inverse_transform()` method.**

In [75]:
Original = Label.inverse_transform(df_label)
Original

array(['Winter', 'Spring', 'Spring', ..., 'Autumn', 'Winter', 'Autumn'],
      dtype=object)

## One-Hot Encoding

One-hot encoding creates a separate binary column for each category of a categorical feature. For a column with the labels Male and Female, for example, it produces a Male column and a Female column: wherever the original value is Male, the Male column holds 1 and the Female column holds 0, and vice versa. The figure below illustrates this for a table of fruits, their categorical values, and their prices.

![Local](images/onehot.png)
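As a minimal sketch of the idea, the snippet below one-hot encodes a small fruit column. The fruit names are made up for illustration and are not taken from the figure.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical fruit column, reshaped to (n_samples, 1) as the encoder expects 2D input
fruits = np.array(["apple", "mango", "apple", "orange"]).reshape(-1, 1)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(fruits).toarray()

# One output column per category, in the order given by categories_
print(encoder.categories_[0])
print(encoded)
```

Each row contains exactly one 1, placed in the column of the fruit that row holds.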

In [80]:
df_Season

0        Winter
1        Spring
2        Spring
3        Spring
4        Winter
          ...  
13195    Summer
13196    Winter
13197    Autumn
13198    Winter
13199    Autumn
Name: Season, Length: 13200, dtype: object

**One hot Encoding Using Sklearn**

As `OneHotEncoder` expects 2D input of shape (n_samples, n_features), we convert the pandas Series into a 2D array with a single column

In [83]:
array = df_Season.to_numpy()
# One row per sample and a single feature column
df_Season_2d = array.reshape(-1, 1)

one = OneHotEncoder()
df_one = one.fit(df_Season_2d)

In [52]:
df_one

**Let's see the categories**

In [87]:
one.categories_

[array(['Autumn', 'Spring', 'Summer', 'Winter'], dtype=object)]

**The Season names are now encoded with the one-hot encoder**

In [97]:
onetrans = one.transform(df_Season_2d).toarray()
onetrans

array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       ...,
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.]])

**Here we decode the one-hot vectors back into the categorical data**

In [102]:
oneinvers = one.inverse_transform(onetrans)
oneinvers

array([['Winter'],
       ['Spring'],
       ['Spring'],
       ...,
       ['Autumn'],
       ['Winter'],
       ['Autumn']], dtype=object)

# Feature Scaling

Feature scaling is a way to "normalise" the variables (features) of a dataset. In machine learning it can be needed for several reasons: for example, it can speed up training and help gradient descent converge more smoothly.

In this part, we'll examine two distinct feature scaling techniques from Scikit-Learn: StandardScaler and MinMaxScaler.

## StandardScaler

Scikit-Learn's `StandardScaler()` standardises each feature by subtracting its mean and dividing by its standard deviation, so that the scaled feature has a mean of 0 and a variance of 1. It does not change the shape of the distribution, so non-Gaussian data stays non-Gaussian. Standardisation matters for models that are sensitive to feature scale, such as SVMs, k-nearest neighbours, and linear or logistic regression when they are regularised or trained with gradient-based solvers.
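The standardisation formula, z = (x - mean) / std, can be sketched by hand in plain Python. The snippet uses only the five `Humidity` values visible in `df.head()`, so its numbers will not match the scaler fitted on the full column; it is meant to show the arithmetic, not to reproduce the notebook output.

```python
from statistics import mean, pstdev

# First five Humidity values shown in df.head()
humidity = [73, 96, 64, 83, 74]

mu = mean(humidity)
# Population standard deviation (ddof=0), which is what StandardScaler uses
sigma = pstdev(humidity)

scaled = [(x - mu) / sigma for x in humidity]
print(scaled)

# After scaling, the mean is (numerically) 0 and the standard deviation is 1
print(mean(scaled), pstdev(scaled))
```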

In [111]:
df_subset = df[['Humidity', 'Temperature']]
df_subset.head()

Unnamed: 0,Humidity,Temperature
0,73,14.0
1,96,39.0
2,64,30.0
3,83,38.0
4,74,27.0


In [127]:
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_subset)
pd.DataFrame(scaled_data, columns = ['Humidity', 'Temperature']).head()

Unnamed: 0,Humidity,Temperature
0,0.212404,-0.294931
1,1.351385,1.143035
2,-0.233285,0.625367
3,0.707613,1.085516
4,0.261924,0.452811


## MinMaxScaler()

`MinMaxScaler()` rescales each feature linearly to a given range, which is [0, 1] by default. The target range is set with `feature_range`, for example `feature_range=(-1, 1)`. Negative values need no special handling, because the scaling depends only on each feature's minimum and maximum. Here we scale the weather data to [0, 1].
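A hand-written sketch of min-max scaling may help to see what happens to negative values. The helper `min_max_scale` and the temperature values are hypothetical, and the sketch assumes the values are not all identical (otherwise the denominator would be zero).

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Linearly rescale values to feature_range using their min and max."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]

# Hypothetical temperatures, including a negative one
temperatures = [-5.0, 14.0, 39.0, 30.0, 27.0]

scaled_01 = min_max_scale(temperatures)
scaled_pm1 = min_max_scale(temperatures, feature_range=(-1.0, 1.0))
print(scaled_01)
print(scaled_pm1)
```

In both cases the smallest input maps to the lower bound of the range and the largest input to the upper bound, regardless of sign.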

In [139]:
minmax= MinMaxScaler(feature_range=(0, 1))
minmax_data = minmax.fit_transform(df_subset)
pd.DataFrame(minmax_data, columns = ['Humidity', 'Temperature']).head()

Unnamed: 0,Humidity,Temperature
0,0.595506,0.291045
1,0.853933,0.477612
2,0.494382,0.410448
3,0.707865,0.470149
4,0.606742,0.38806


# Now let's see how this fits into a model

**STEP 01** : Loading the Dataset

In [202]:
df = pd.read_csv('weather_classification_data.csv')
x = df.drop('Precipitation (%)', axis=1)
y = df['Precipitation (%)']

**STEP 02** : Splitting the Dataset

In [205]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

**STEP 03** : Preprocessing

### Numerical Features
- Impute missing values with the mean
- Scale features with StandardScaler
### Categorical Features
- Impute missing values with the most frequent value
- Encode categorical variables with OneHotEncoder

In [209]:
numerical_features = x.select_dtypes(include=['int64', 'float64']).columns
categorical_features = x.select_dtypes(include=['object']).columns

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)])
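To see what a preprocessor of this shape produces, here is a small sketch on a toy frame. The frame `toy`, its values, and the name `toy_preprocessor` are made up for illustration; only the transformer structure mirrors the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with one missing value in each column
toy = pd.DataFrame({
    "Humidity": [73.0, 96.0, np.nan, 83.0],
    "Season": ["Winter", "Spring", "Spring", np.nan],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ("num", Pipeline(steps=[("imputer", SimpleImputer(strategy="mean")),
                            ("scaler", StandardScaler())]), ["Humidity"]),
    ("cat", Pipeline(steps=[("imputer", SimpleImputer(strategy="most_frequent")),
                            ("onehot", OneHotEncoder(handle_unknown="ignore"))]), ["Season"]),
])

transformed = toy_preprocessor.fit_transform(toy)

# One scaled numeric column plus one column per season category
print(transformed.shape)
```

The missing humidity is filled with the column mean and the missing season with the most frequent season before scaling and encoding are applied.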

In `sklearn`, a `Pipeline` is used to chain together multiple steps in a machine learning workflow, ensuring that they are executed sequentially. This can include preprocessing steps (like scaling or encoding), feature selection, and the final estimator (model). Using a `Pipeline` provides several benefits:

1. **Simplifies Code:**
   - Single Object: Combine multiple steps into a single object, reducing the need to manually handle the intermediate steps.
   - Cleaner Code: Simplifies the code structure and makes it more readable and maintainable.
2. **Ensures Consistency:**
   - Sequential Execution: Ensures that the same transformations are applied consistently during both training and testing.
   - Avoids Data Leakage: By encapsulating the steps, it helps prevent data leakage from the test set into the training process.
3. **Easier Parameter Tuning:**
   - Grid Search: Integrates seamlessly with GridSearchCV or RandomizedSearchCV for hyperparameter tuning, allowing you to search over the parameters of all steps in the pipeline.
   - Single Parameter Space: Provides a unified parameter space, making it easier to tune parameters for the entire workflow.
4. **Reproducibility:**
   - Reusable Workflows: Encapsulates the entire workflow in a single object, making it easy to reuse and reproduce the exact same workflow on new data.
5. **Modularity:**
   - Independent Steps: Allows each step in the pipeline to be a separate, independent component that can be easily swapped out or modified.
   - Isolation of Concerns: Each step is responsible for a specific part of the process, making it easier to understand and debug.
6. **Efficient Cross-Validation:**
   - Integrated Cross-Validation: When performing cross-validation, ensures that all preprocessing steps are included in the cross-validation splits, preventing data leakage and ensuring valid evaluation metrics.
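The cross-validation point can be sketched as follows. The data comes from `make_classification` and is purely synthetic; the step names `scaler` and `model` are illustrative choices. Because the scaler is part of the pipeline, it is refitted on the training portion of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", LogisticRegression())])

# Each of the 5 folds fits the scaler and the model on its own training split
scores = cross_val_score(pipe, X, y, cv=5)
print(scores)
print(scores.mean())
```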

**STEP 04** : Creating the Pipeline

**Here we create one pipeline per model so that each can be fitted later**

**Linear Regression**

In [244]:
pipeline_lr = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])
pipeline_lr

**Logistic Regression**

In [252]:
pipeline_logreg = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', LogisticRegression())])
pipeline_logreg

**Random Forest**

In [266]:
pipeline_rf = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', RandomForestClassifier(random_state=42))])
pipeline_rf

**Support Vector Machine**

In [309]:
pipeline_svc = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', SVC())])
pipeline_svc

**K-Nearest Neighbour**

In [286]:
pipeline_knn = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', KNeighborsClassifier())])
pipeline_knn

**STEP 06** : Applying the Pipeline

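**Let's apply the pipelines to get the predictions**

Note that `Precipitation (%)` is a numeric percentage, so the classifiers below treat each distinct value as a separate class, which explains their very low accuracy scores. A categorical column such as `Weather Type` would be a more natural classification target.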

In [313]:
pipeline_lr.fit(x_train, y_train)
# Predict with the fitted linear regression pipeline
y_pred = pipeline_lr.predict(x_test)
print(f"Linear Regression Mean Squared Error: {mean_squared_error(y_test, y_pred)}")

Linear Regression Mean Squared Error: 413.6434855400367


In [315]:
pipeline_logreg.fit(x_train, y_train)

y_pred_logreg = pipeline_logreg.predict(x_test)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_logreg)}")

Logistic Regression Accuracy: 0.025757575757575757


In [317]:
pipeline_rf.fit(x_train, y_train)
y_pred_rf = pipeline_rf.predict(x_test)
print(f"Random Forest Accuracy: {accuracy_score(y_test, y_pred_rf)}")

Random Forest Accuracy: 0.026893939393939394


In [319]:
pipeline_svc.fit(x_train, y_train)
y_pred_svc = pipeline_svc.predict(x_test)
print(f"SVC Accuracy: {accuracy_score(y_test, y_pred_svc)}")

SVC Accuracy: 0.022727272727272728


In [320]:
pipeline_knn.fit(x_train, y_train)
y_pred_knn = pipeline_knn.predict(x_test)
print(f"KNN Accuracy: {accuracy_score(y_test, y_pred_knn)}")

KNN Accuracy: 0.026893939393939394
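When accuracy scores are this low, one thing worth checking is how scikit-learn interprets the target column. The sketch below uses `type_of_target` on two small hypothetical samples written in the style of the `Precipitation (%)` and `Weather Type` columns; the values are illustrative, not read from the CSV.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical samples in the style of the two columns
precipitation = [82.0, 71.0, 16.0, 82.0, 66.0, 16.0, 71.0, 82.0]
weather_type = ["Rainy", "Cloudy", "Sunny", "Sunny", "Rainy", "Sunny", "Cloudy", "Rainy"]

# Integer-valued floats are accepted as class labels, one class per distinct value
print(type_of_target(precipitation))
print(type_of_target(weather_type))

# Number of classes a classifier would have to distinguish in each case
print(len(set(precipitation)), len(set(weather_type)))
```

On the full dataset a percentage column can take many distinct values, which turns the task into a classification problem with a very large number of classes.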


**STEP 07** : Hyperparameter Tuning

**Now we'll tune the hyperparameters of each model with `GridSearchCV`**
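The parameter grids in this step use keys of the form `<step name>__<parameter>`, which is how a pipeline routes a value to one of its steps. A minimal sketch on synthetic data, assuming a pipeline whose final step is named `model`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data, used only for illustration
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("model", LogisticRegression())])

# "model__C" targets the C parameter of the step named "model"
param_grid = {"model__C": [0.1, 1, 10]}

search = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```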

**Logistic Regression**

In [296]:
param_grid_logreg = {
    'model__C': [0.1, 1, 10],
    'model__solver': ['liblinear', 'lbfgs', 'saga']
}

grid_search_logreg = GridSearchCV(estimator=pipeline_logreg, param_grid=param_grid_logreg, cv=5)
grid_search_logreg.fit(x_train, y_train)
print(f"Best Parameters: {grid_search_logreg.best_params_}")

best_model_logreg = grid_search_logreg.best_estimator_
y_pred_best_logreg = best_model_logreg.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best_logreg)}")



Best Parameters: {'model__C': 0.1, 'model__solver': 'saga'}
Accuracy: 0.025757575757575757


**Random Forest**

In [292]:
# 'auto' is left out of max_features: recent scikit-learn versions reject it,
# which made 45 of the 135 fits fail with the original grid
param_grid_rf = {
    'model__n_estimators': [100, 200, 500],
    'model__max_features': ['sqrt', 'log2'],
    'model__max_depth': [10, 20, 30]
}

grid_search_rf = GridSearchCV(estimator=pipeline_rf, param_grid=param_grid_rf, cv=5)
grid_search_rf.fit(x_train, y_train)
print(f"Best Parameters: {grid_search_rf.best_params_}")

best_model_rf = grid_search_rf.best_estimator_
y_pred_best_rf = best_model_rf.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best_rf)}")

Best Parameters: {'model__max_depth': 10, 'model__max_features': 'sqrt', 'model__n_estimators': 100}
Accuracy: 0.02196969696969697


**SVC**

In [293]:
param_grid_svc = {
    'model__C': [0.1, 1, 10],
    'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

grid_search_svc = GridSearchCV(estimator=pipeline_svc, param_grid=param_grid_svc, cv=5)
grid_search_svc.fit(x_train, y_train)
print(f"Best Parameters: {grid_search_svc.best_params_}")

best_model_svc = grid_search_svc.best_estimator_
y_pred_best_svc = best_model_svc.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best_svc)}")

Best Parameters: {'model__C': 1, 'model__kernel': 'rbf'}
Accuracy: 0.022727272727272728


**KNN**

In [294]:
param_grid_knn = {
    'model__n_neighbors': [3, 5, 7],
    'model__weights': ['uniform', 'distance']
}

grid_search_knn = GridSearchCV(estimator=pipeline_knn, param_grid=param_grid_knn, cv=5)
grid_search_knn.fit(x_train, y_train)
print(f"Best Parameters: {grid_search_knn.best_params_}")

best_model_knn = grid_search_knn.best_estimator_
y_pred_best_knn = best_model_knn.predict(x_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred_best_knn)}")

Best Parameters: {'model__n_neighbors': 5, 'model__weights': 'uniform'}
Accuracy: 0.026893939393939394
