# Machine Learning

In [13]:
import pandas as pd 
import numpy as np

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, learning_curve, train_test_split
from sklearn.pipeline import make_pipeline, Pipeline

from sklearn.metrics import r2_score, accuracy_score

import plotly.express as px 

### Importing dataframe

In [14]:
df_init = pd.read_parquet("data/base2.parquet", engine="pyarrow")

In [15]:
df_init['BASIN'].unique()

array(['SP', 'SI', 'WP', 'EP', 'NI'], dtype=object)

In [3]:
df_init = df_init.loc[~df_init['BASIN'].str.contains('SI|NI', na=False)]

In [4]:
df = df_init.copy()

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32430 entries, 0 to 67409
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SEASON        32430 non-null  object 
 1   BASIN         32430 non-null  object 
 2   NATURE        32430 non-null  object 
 3   LAT           32430 non-null  float64
 4   LON           32430 non-null  float64
 5   WIND          32430 non-null  float64
 6   DIST2LAND     32430 non-null  float64
 7   STORM_SPEED   32430 non-null  float64
 8   STORM_DIR     32430 non-null  float64
 9   TD9636_STAGE  32430 non-null  float64
dtypes: float64(7), object(3)
memory usage: 2.7+ MB



## First models 

We start by testing several machine learning models on a cleaned, simplified dataset.


### Choosing the pertinent models

The different models can be found here:
https://scikit-learn.org/stable/machine_learning_map.html

| Model | Person | Scores | Encoding |
| --- | --- | --- | --- |
| Knn | Vincent | -- | OneHotEncoder/LabelBinarizer |
| Knn | -- | -- | -- |
| SVM | Vincent | -- | OneHotEncoder/LabelBinarizer |
| SVM | -- | -- | -- |
| Randomforest | Audrey | -- | OneHotEncoder/LabelBinarizer |
| Randomforest | Audrey | -- | Categorical encoding |
| LinearSVC | Arnaud | -- | OneHotEncoder/LabelBinarizer |
| LinearSVC | Arnaud | -- | Categorical encoding |



## Model

### Encoding

The dataframe has been cleaned and only the relevant columns remain; however, we need to process the columns further.

Categorical => OneHotEncoder or one dimension with different values (1, 2, 3, 4, etc.)
- SEASON (4 classes)
- BASIN (7 classes)
- NATURE (6 classes)

Numeric => everything between 0 and 1
- LAT
- LON
- WIND 
- DIST2LAND
- STORM_SPEED
- STORM_DIR

In this notebook we will assign custom numeric values to the categorical columns where possible, to limit the number of dimensions added.

#### Categorical Columns

1. Seasons

In [6]:
def transform_seasons_1(row):
    # First axis of the season encoding: Winter and Summer are opposites
    match row["SEASON"]:
        case "Winter":
            return 1
        case "Spring":
            return 0
        case "Summer":
            return -1
        case "Fall":
            return 0

def transform_seasons_2(row):
    # Second axis of the season encoding: Fall and Spring are opposites
    match row["SEASON"]:
        case "Winter":
            return 0
        case "Spring":
            return -1
        case "Summer":
            return 0
        case "Fall":
            return 1


In [7]:
df["SEASON_1"] = df.apply(transform_seasons_1, axis=1)
df["SEASON_2"] = df.apply(transform_seasons_2, axis=1)
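Taken together, `SEASON_1` and `SEASON_2` place the four seasons on a circle, so that neighbouring seasons are all at the same distance from each other. As a minimal sketch, the same values can be produced with dictionary lookups through `Series.map`, which avoids a row-wise `apply`. The helper name `encode_seasons` and the toy frame are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Same values as the two functions above, written as lookup tables.
SEASON_1_MAP = {"Winter": 1, "Spring": 0, "Summer": -1, "Fall": 0}
SEASON_2_MAP = {"Winter": 0, "Spring": -1, "Summer": 0, "Fall": 1}


def encode_seasons(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the two season coordinates added (hypothetical helper)."""
    out = frame.copy()
    out["SEASON_1"] = out["SEASON"].map(SEASON_1_MAP)
    out["SEASON_2"] = out["SEASON"].map(SEASON_2_MAP)
    return out


# Toy data standing in for the real dataframe.
toy = pd.DataFrame({"SEASON": ["Winter", "Spring", "Summer", "Fall"]})
encoded = encode_seasons(toy)
print(encoded)
```

A season label missing from the lookup tables would come out as `NaN` rather than `None`, which makes unexpected values easy to spot with `isna()`.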

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32430 entries, 0 to 67409
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SEASON        32430 non-null  object 
 1   BASIN         32430 non-null  object 
 2   NATURE        32430 non-null  object 
 3   LAT           32430 non-null  float64
 4   LON           32430 non-null  float64
 5   WIND          32430 non-null  float64
 6   DIST2LAND     32430 non-null  float64
 7   STORM_SPEED   32430 non-null  float64
 8   STORM_DIR     32430 non-null  float64
 9   TD9636_STAGE  32430 non-null  float64
 10  SEASON_1      32430 non-null  int64  
 11  SEASON_2      32430 non-null  int64  
dtypes: float64(7), int64(2), object(3)
memory usage: 3.2+ MB


In [9]:
df = df.drop(columns=['SEASON'])

2. Basin

In [12]:
df['BASIN'].unique()

array(['SP', 'WP', 'EP'], dtype=object)

The BASIN categories cannot be ranked, which is why we use `pd.get_dummies()` to encode this string column.

In [314]:
df = pd.get_dummies(df, columns=["BASIN"], drop_first=True)

In [315]:
df.head(3)

Unnamed: 0,NATURE,LAT,LON,WIND,DIST2LAND,STORM_SPEED,STORM_DIR,TD9636_STAGE,SEASON_1,SEASON_2,BASIN_SP,BASIN_WP
0,TS,-12.5,172.5,25.0,647.0,6.0,350.0,1.0,-1,0,True,False
1,TS,-12.2,172.4,25.0,653.0,6.0,350.0,1.0,-1,0,True,False
2,TS,-11.9,172.4,25.0,670.0,5.0,360.0,1.0,-1,0,True,False
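With three basins, `drop_first=True` keeps only two indicator columns: the first category in sorted order (`EP`) is dropped and is represented by both remaining indicators being false. A small sketch on a toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame with one row per basin; WIND values are arbitrary.
toy = pd.DataFrame({"BASIN": ["SP", "WP", "EP"], "WIND": [25.0, 30.0, 35.0]})

dummies = pd.get_dummies(toy, columns=["BASIN"], drop_first=True)
print(dummies.columns.tolist())

# The EP row (index 2) has no indicator of its own: both remaining flags are off.
ep_flags = dummies.loc[2, ["BASIN_SP", "BASIN_WP"]].tolist()
print(ep_flags)
```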


3. Nature

We sort the NATURE categories to reflect a progression from the least informative or severe to the most complex or severe nature of cyclones.

0- NR - Not reported: This class indicates that the nature of the cyclone is not reported, so it can be considered the least informative.

1- DS - Disturbance: This class indicates a minor disturbance, which is typically the least severe form of a cyclone.

2- SS - Subtropical: Subtropical cyclones are more organized than disturbances but less severe than tropical cyclones.

3- TS - Tropical: Tropical cyclones are fully developed and more severe than subtropical cyclones.

4- ET - Extratropical: Extratropical cyclones are typically associated with frontal systems and can be very severe.

5- MX - Mixture: This class indicates contradicting nature reports from different agencies, suggesting a complex or uncertain nature, which can be considered the most severe or complex category.

In [233]:
def transform_nature(row):
    # Ordinal encoding following the ranking described above
    match row["NATURE"]:
        case "NR":
            return 0
        case "DS":
            return 1
        case "SS":
            return 2
        case "TS":
            return 3
        case "ET":
            return 4
        case "MX":
            return 5

In [234]:
df["NATURE_1"] = df.apply(transform_nature, axis=1)

In [235]:
df = df.drop(columns=['NATURE'])
df.rename(columns={"NATURE_1": "NATURE"}, inplace=True)

### Model Testing

Importing models to test

Logistic regression can be used as a baseline against which to compare the performance of the other models.

In [236]:
from sklearn.linear_model import LogisticRegression
from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.model_selection import GridSearchCV

#### Standardizing values

Standardizing the dataset for better performance and interpretability across the different models.
Using scikit-learn's built-in preprocessing tools to scale the feature data.

In [237]:
from sklearn import preprocessing

In [238]:
y = df["TD9636_STAGE"]
X = df.copy().drop(columns=['TD9636_STAGE'])

In [239]:
scaler = preprocessing.StandardScaler().fit(X)
scaler

In [240]:
X_scaled = scaler.transform(X)

#### Splitting dataset

Splitting the dataset into a training set and a test set, for both the features and the target.\
The test set is 20% of the initial dataset (`test_size=0.2`), leaving 80% for training. `random_state` is set to 17 for reproducibility.

In [241]:
df

Unnamed: 0,LAT,LON,WIND,DIST2LAND,STORM_SPEED,STORM_DIR,TD9636_STAGE,SEASON_1,SEASON_2,BASIN_SP,BASIN_WP,NATURE
0,-12.5,172.5,25.0,647.0,6.0,350.0,1.0,-1,0,True,False,3
1,-12.2,172.4,25.0,653.0,6.0,350.0,1.0,-1,0,True,False,3
2,-11.9,172.4,25.0,670.0,5.0,360.0,1.0,-1,0,True,False,3
3,-11.7,172.4,25.0,682.0,4.0,10.0,1.0,-1,0,True,False,3
4,-11.5,172.5,25.0,703.0,4.0,20.0,1.0,-1,0,True,False,3
...,...,...,...,...,...,...,...,...,...,...,...,...
67405,14.6,141.6,25.0,1760.0,9.0,250.0,1.0,1,0,False,True,3
67406,14.4,141.2,25.0,1713.0,9.0,250.0,1.0,1,0,False,True,3
67407,14.3,140.8,25.0,1669.0,8.0,250.0,1.0,1,0,False,True,3
67408,14.1,140.4,23.0,1622.0,8.0,250.0,1.0,1,0,False,True,3


In [242]:
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y,
    test_size = 0.2,
    random_state=17)
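In the cells above, the scaler is fitted on the full `X` before the split, so statistics from the test rows influence the scaling of the training rows. A common alternative, sketched here on synthetic data from `make_classification` (an assumption standing in for the cyclone dataset), is to put the scaler and the model in one pipeline so that the scaler only ever sees the training split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=17
)

# Split first, then let the pipeline fit the scaler on the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=17
)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

# The fitted scaler's mean matches the training split, not the full dataset.
scaler_mean = pipe.named_steps["standardscaler"].mean_
print(np.allclose(scaler_mean, X_tr.mean(axis=0)))

demo_accuracy = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {demo_accuracy:.3f}")
```

Passing such a pipeline to `GridSearchCV` also refits the scaler inside each cross-validation fold, which keeps the validation folds out of the scaling statistics.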

#### Logistic Regression

Main hyperparameters of logistic regression:
- `penalty`: {"l1", "l2", "elasticnet", None}, default="l2"\
    Specify the norm of the penalty:
    - None: no penalty is added;
    - 'l2': add a L2 penalty term and it is the default choice;
    - 'l1': add a L1 penalty term;
    - 'elasticnet': both L1 and L2 penalty terms are added.
- `tol`: float, default=1e-4\
    Tolerance for stopping criteria.
- `solver`: {'lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga'}, default='lbfgs'\
    Algorithm to use in the optimization problem. Set to `saga` here because not every solver handles every penalty, and `saga` is the one that supports `elasticnet`.

In [243]:
LogReg = LogisticRegression(random_state=17,
                            max_iter=3000, # Setting a large number of max_iter as Logistic regression is fast to train
                            verbose=0,
                            tol=1e-6,
                            solver="saga",
                            penalty="elasticnet"
                           ) 

In [244]:
parameters = {
    "l1_ratio": [0, 0.5, 1]
}
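With `penalty="elasticnet"`, `l1_ratio` interpolates between a pure L2 penalty (`l1_ratio=0`) and a pure L1 penalty (`l1_ratio=1`), so this grid effectively compares L2, a 50/50 mix, and L1. The sketch below, on synthetic data and with an arbitrarily chosen `C=0.05` (both assumptions for illustration), counts how many coefficients each setting drives exactly to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with only a few informative features.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=3, n_redundant=0, random_state=17
)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for ratio in (0, 0.5, 1):
    clf = LogisticRegression(
        penalty="elasticnet",
        solver="saga",
        l1_ratio=ratio,
        C=0.05,  # strong regularization, chosen only to make the effect visible
        max_iter=5000,
        random_state=17,
    )
    clf.fit(X_demo, y_demo)
    # Number of coefficients that are exactly zero for this mix of penalties.
    zero_counts[ratio] = int(np.sum(clf.coef_ == 0))

print(zero_counts)
```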

In [245]:
Grid = GridSearchCV(
    estimator=LogReg,
    param_grid=parameters,
    scoring=["r2", "accuracy"],
    refit="r2",
    verbose=4,
    cv=5
)

In [246]:
Grid.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END l1_ratio=0; accuracy: (test=0.865) r2: (test=0.777) total time=   3.5s
[CV 2/5] END l1_ratio=0; accuracy: (test=0.865) r2: (test=0.776) total time=   3.3s
[CV 3/5] END l1_ratio=0; accuracy: (test=0.863) r2: (test=0.775) total time=   3.3s
[CV 4/5] END l1_ratio=0; accuracy: (test=0.866) r2: (test=0.778) total time=   3.4s
[CV 5/5] END l1_ratio=0; accuracy: (test=0.862) r2: (test=0.774) total time=   3.5s
[CV 1/5] END l1_ratio=0.5; accuracy: (test=0.865) r2: (test=0.779) total time=   8.8s
[CV 2/5] END l1_ratio=0.5; accuracy: (test=0.865) r2: (test=0.775) total time=   7.9s
[CV 3/5] END l1_ratio=0.5; accuracy: (test=0.863) r2: (test=0.775) total time=   8.1s
[CV 4/5] END l1_ratio=0.5; accuracy: (test=0.866) r2: (test=0.782) total time=   7.9s
[CV 5/5] END l1_ratio=0.5; accuracy: (test=0.861) r2: (test=0.770) total time=   7.7s
[CV 1/5] END l1_ratio=1; accuracy: (test=0.866) r2: (test=0.784) total time=  15.6s



The max_iter was reached which means the coef_ did not converge



[CV 2/5] END l1_ratio=1; accuracy: (test=0.865) r2: (test=0.779) total time=  39.6s
[CV 3/5] END l1_ratio=1; accuracy: (test=0.864) r2: (test=0.775) total time=  26.0s
[CV 4/5] END l1_ratio=1; accuracy: (test=0.868) r2: (test=0.785) total time=  29.0s
[CV 5/5] END l1_ratio=1; accuracy: (test=0.862) r2: (test=0.779) total time=  25.6s


### Result of the grid search:
As expected, logistic regression is stable from one penalty to another, with an *acceptable* mean accuracy and r2 score.

In [247]:
print(Grid.best_score_,
     Grid.best_params_)

0.7805457003259004 {'l1_ratio': 1}


In [248]:
pd.DataFrame(Grid.cv_results_).sort_values(by='mean_test_r2', ascending=False).head(5).T

Unnamed: 0,2,1,0
mean_fit_time,27.272042,8.169616,3.486172
std_fit_time,7.707039,0.373395,0.094355
mean_score_time,0.001975,0.002019,0.001982
std_score_time,0.000106,0.000268,0.000111
param_l1_ratio,1.0,0.5,0.0
params,{'l1_ratio': 1},{'l1_ratio': 0.5},{'l1_ratio': 0}
split0_test_r2,0.784169,0.77931,0.777008
split1_test_r2,0.778543,0.774835,0.776241
split2_test_r2,0.775389,0.775006,0.775006
split3_test_r2,0.785372,0.782052,0.777584


In [249]:
y_pred = Grid.predict(X_test)

Interpretation of errors

In [250]:
from sklearn.metrics import accuracy_score, r2_score

In [251]:
from sklearn.metrics import confusion_matrix

First, let's take a look at the errors and the valid predictions.

In [252]:
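# Note: scikit-learn metrics expect (y_true, y_pred). Accuracy is symmetric, but
# r2_score is not, so the r2 printed here uses the predictions as the reference.
print(f"Accuracy: {accuracy_score(y_pred, y_test)}, r2: {r2_score(y_pred, y_test)}")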

Accuracy: 0.851218008017268, r2: 0.7695179389815562
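scikit-learn metrics take their arguments as `(y_true, y_pred)`. Swapping them does not change the accuracy, but it does change r2, because r2 is normalized by the variance of the first argument. A toy illustration (the label values are made up):

```python
from sklearn.metrics import accuracy_score, r2_score

# Made-up labels, only to show the effect of the argument order.
y_true_demo = [0, 1, 2, 2, 4]
y_pred_demo = [0, 1, 1, 2, 2]

acc_forward = accuracy_score(y_true_demo, y_pred_demo)
acc_swapped = accuracy_score(y_pred_demo, y_true_demo)

r2_forward = r2_score(y_true_demo, y_pred_demo)
r2_swapped = r2_score(y_pred_demo, y_true_demo)

print(f"accuracy: {acc_forward} vs {acc_swapped}")
print(f"r2: {r2_forward:.3f} vs {r2_swapped:.3f}")
```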


In [253]:
# With the arguments in this order, rows are predicted stages and columns are true stages.
cm = confusion_matrix(y_pred, y_test)
px.imshow(cm)

In [254]:
cm

array([[  21,    7,    0,    0,    0,    0,    1],
       [ 119, 1940,  278,    0,    0,    3,    5],
       [   8,  159, 2021,   19,  134,    9,    5],
       [   0,    0,    0,    0,    0,    0,    0],
       [   0,    8,   84,  108, 1527,    2,    0],
       [   0,    3,   11,    0,    2,    3,    0],
       [   0,    0,    0,    0,    0,    0,    9]])

In [255]:
cm = (cm.T / cm.sum(axis=1)).T
px.imshow(cm)


invalid value encountered in divide
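The warning appears because one row of `cm` sums to zero, so the normalization divides 0 by 0 and produces `NaN` for that row. A minimal sketch of a row normalization that leaves empty rows at zero instead; the helper name `normalize_rows` and the toy matrix are illustrative assumptions.

```python
import numpy as np


def normalize_rows(matrix) -> np.ndarray:
    """Divide each row by its sum; rows that sum to zero stay all-zero (hypothetical helper)."""
    matrix = np.asarray(matrix, dtype=float)
    row_sums = matrix.sum(axis=1, keepdims=True)
    normalized = np.zeros_like(matrix)
    # Only divide where the row sum is non-zero; other entries keep their initial 0.
    np.divide(matrix, row_sums, out=normalized, where=row_sums != 0)
    return normalized


# Toy confusion matrix whose middle row is empty.
toy_cm = np.array([[8, 2, 0], [0, 0, 0], [1, 1, 8]])
toy_normalized = normalize_rows(toy_cm)
print(toy_normalized)
```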



Logistic regression correctly detects stage 6 cyclones (although on a very small sample) and seems to 'group' the detection of stage 0 to 2 cyclones, as we would expect from a good model. However, the detection of stage 4 cyclones is too messy.

It could be interesting to look at which kinds of cyclones are misclassified in stage 4.
Furthermore, a custom score with weighted error values could help to select a better model.

#### Proposition of custom score (not useful for a logistic regression..)

In [256]:
from sklearn.metrics import make_scorer

In [257]:
def custom_loss_score(y_test, y_pred):

    # Turning y_test and y_pred into numpy ndarrays
    y_test = np.array(y_test)
    y_pred = np.array(y_pred)

    # initializing variable
    
    error = 0
    ## Creating a "error" weights table
    table = {
        0: 0.2,
        1: 0.5,
        2: 1,
        3: 2,
        4: 3,
        5: 5,
    }
    total_weight = sum(table.values())

    # Calculating the sum of each error value depending on our custom error weights table
    for pos in range(len(y_test)):
        min_ = int(min(y_test[pos], y_pred[pos]))
        max_ = int(max(y_test[pos], y_pred[pos]))
        error += sum([table[i] for i in range(min_, max_) if min_!=max_])/total_weight

    error = error / len(y_test)
    
    return 1 - error
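The score above charges, for each sample, the weights of every stage boundary crossed between the true and the predicted stage, divides by the total weight, and averages over the samples. As a sketch, the same quantity can be computed without a Python loop by using cumulative sums of the weights. The function name `weighted_stage_score` is a hypothetical stand-in, and stages are assumed to be integers from 0 to 6.

```python
import numpy as np

# Weight of crossing from stage i to stage i + 1, same values as the table above.
STEP_WEIGHTS = np.array([0.2, 0.5, 1, 2, 3, 5])


def weighted_stage_score(y_true, y_pred) -> float:
    """Vectorized variant of the custom score (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    # prefix[k] is the total weight of all boundaries below stage k.
    prefix = np.concatenate([[0.0], np.cumsum(STEP_WEIGHTS)])
    # Weight between two stages is the difference of their prefix sums.
    penalties = np.abs(prefix[y_true] - prefix[y_pred]) / STEP_WEIGHTS.sum()
    return float(1 - penalties.mean())


# Predicting stage 2 for a true stage 4 crosses the boundaries weighted 1 and 2.
demo_score = weighted_stage_score([1, 4], [1, 2])
print(demo_score)
```

Because the largest weights sit between the highest stages, confusing stage 5 with stage 6 costs far more than confusing stage 0 with stage 1.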

In [258]:
Grid = GridSearchCV(
    estimator=LogReg,
    param_grid=parameters,
    scoring=make_scorer(custom_loss_score, greater_is_better=True),
    refit=True,
    verbose=4,
    cv=5
)

In [259]:
Grid.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV 1/5] END ........................l1_ratio=0;, score=0.981 total time=   2.9s
[CV 2/5] END ........................l1_ratio=0;, score=0.981 total time=   3.1s
[CV 3/5] END ........................l1_ratio=0;, score=0.981 total time=   3.1s
[CV 4/5] END ........................l1_ratio=0;, score=0.981 total time=   3.1s
[CV 5/5] END ........................l1_ratio=0;, score=0.981 total time=   3.3s
[CV 1/5] END ......................l1_ratio=0.5;, score=0.981 total time=   8.4s
[CV 2/5] END ......................l1_ratio=0.5;, score=0.981 total time=   7.8s
[CV 3/5] END ......................l1_ratio=0.5;, score=0.981 total time=   8.0s
[CV 4/5] END ......................l1_ratio=0.5;, score=0.982 total time=   7.7s
[CV 5/5] END ......................l1_ratio=0.5;, score=0.981 total time=   7.7s
[CV 1/5] END ........................l1_ratio=1;, score=0.981 total time=  14.8s



The max_iter was reached which means the coef_ did not converge



[CV 2/5] END ........................l1_ratio=1;, score=0.981 total time=  37.7s
[CV 3/5] END ........................l1_ratio=1;, score=0.981 total time=  25.3s
[CV 4/5] END ........................l1_ratio=1;, score=0.982 total time=  28.9s
[CV 5/5] END ........................l1_ratio=1;, score=0.981 total time=  25.3s


### Learning Curve

In [261]:
# Calculate learning curve
learning_curve_ = learning_curve(
    Grid.best_estimator_,
    X_train,
    y_train,
    train_sizes=np.linspace(0.1, 1, 10),
    cv=2,
    verbose=5,
    scoring="r2",
)

[learning_curve] Training set sizes: [ 1297  2594  3891  5188  6486  7783  9080 10377 11674 12972]
[CV] END ..................., score=(train=0.787, test=0.747) total time=   1.4s
[CV] END ..................., score=(train=0.782, test=0.750) total time=   3.0s
[CV] END ..................., score=(train=0.788, test=0.781) total time=   4.2s
[CV] END ..................., score=(train=0.777, test=0.780) total time=   4.6s
[CV] END ..................., score=(train=0.760, test=0.773) total time=   4.0s
[CV] END ..................., score=(train=0.775, test=0.782) total time=   4.8s
[CV] END ..................., score=(train=0.780, test=0.785) total time=   5.5s
[CV] END ..................., score=(train=0.778, test=0.779) total time=   5.5s
[CV] END ..................., score=(train=0.785, test=0.780) total time=   6.6s
[CV] END ..................., score=(train=0.783, test=0.779) total time=   7.2s
[CV] END ..................., score=(train=0.812, test=0.759) total time=   1.6s
[CV] END .

[Parallel(n_jobs=1)]: Done  17 tasks      | elapsed:  1.3min


[CV] END ..................., score=(train=0.783, test=0.779) total time=   6.1s
[CV] END ..................., score=(train=0.777, test=0.775) total time=   5.0s
[CV] END ..................., score=(train=0.775, test=0.773) total time=   6.1s


In [262]:
train_sizes, train_scores, train_valids = learning_curve_

In [263]:
import plotly.graph_objects as go

def learning_curve_show(train_sizes, train_scores):
    # Mean, max and min score across the CV folds for each training size
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_max = np.max(train_scores, axis=1)
    train_scores_min = np.min(train_scores, axis=1)
    
    fig = go.Figure([
        go.Scatter(
            x=train_sizes,
            y=train_scores_mean,
            line=dict(color='rgb(0,100,80)'),
            mode='lines'
        ),
        go.Scatter(
            # NumPy arrays must be concatenated: `+` would add them element-wise
            x=np.concatenate([train_sizes, train_sizes[::-1]]), # x, then x reversed
            y=np.concatenate([train_scores_max, train_scores_min[::-1]]), # upper, then lower reversed
            fill='toself',
            fillcolor='rgba(0,100,80,0.2)',
            line=dict(color='rgba(255,255,255,0)'),
            hoverinfo="skip",
            showlegend=False
        )
    ])
    fig.show()

learning_curve_show(train_sizes, train_scores)



Testing the random forest (RFR) on the split dataset (Indian and non-Indian basins):

In [264]:
from sklearn.ensemble import RandomForestClassifier

In [265]:
parameters = {'memory': None, 'steps': [('rfr', RandomForestClassifier(max_depth=50, random_state=42))], 'transform_input': None, 'verbose': False, 'rfr': RandomForestClassifier(max_depth=50, random_state=42), 'rfr__bootstrap': True, 'rfr__ccp_alpha': 0.0, 'rfr__class_weight': None, 'rfr__criterion': 'gini', 'rfr__max_depth': 50, 'rfr__max_features': 'sqrt', 'rfr__max_leaf_nodes': None, 'rfr__max_samples': None, 'rfr__min_impurity_decrease': 0.0, 'rfr__min_samples_leaf': 1, 'rfr__min_samples_split': 2, 'rfr__min_weight_fraction_leaf': 0.0, 'rfr__monotonic_cst': None, 'rfr__n_estimators': 100, 'rfr__n_jobs': None, 'rfr__oob_score': False, 'rfr__random_state': 42, 'rfr__verbose': 0, 'rfr__warm_start': False}

In [266]:
rfr_params = {key.replace('rfr__', ''): value for key, value in parameters.items() if key.startswith('rfr__')}

In [267]:
model = RandomForestClassifier(**rfr_params)

In [268]:
model.fit(X_train, y_train)

In [269]:
y_pred = model.predict(X_test)

In [270]:
# Note: arguments are passed as (y_pred, y_test); r2_score is not symmetric in its arguments.
print(f"Accuracy: {accuracy_score(y_pred, y_test)}, r2: {r2_score(y_pred, y_test)}")

Accuracy: 0.9281529448041936, r2: 0.8806514966946659


In [271]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go


# Imports for preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    # LabelBinarizer,
    OneHotEncoder,
)

# Imports for KNN and SVC/SVM models with GridSearchCV
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    validation_curve,
    learning_curve,
)

# from sklearn import svm
# from sklearn.svm import SVC
from sklearn import neighbors

# from sklearn.neighbors import KNeighborsClassifier

# Imports for the metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# from sklearn.metrics import ConfusionMatrixDisplay

In [272]:
# From parquet to DataFrame : base dataset
df = pd.read_parquet("data/base.parquet", engine="pyarrow")

# Rename the columns for clarity
df.rename(
    {
        "SEASON": "season",
        "BASIN": "basin",
        "NATURE": "nature",
        "LAT": "latitude",
        "LON": "longitude",
        "WIND": "wind",
        "DIST2LAND": "distance_to_land",
        "STORM_SPEED": "storm_speed",
        "STORM_DIR": "storm_direction",
        "TD9636_STAGE": "storm_stage",
    },
    axis="columns",
    inplace=True,
)

# Backup DataFrame
df_backup = df.copy()

In [273]:
# DataFrame basic description
print(f"Shape rows/columns:\n{df.shape}\n")
print(f"Column names:\n{df.columns}\n")
print(f"Types:\n{df.dtypes}\n")

Shape rows/columns:
(45025, 10)

Column names:
Index(['season', 'basin', 'nature', 'latitude', 'longitude', 'wind',
       'distance_to_land', 'storm_speed', 'storm_direction', 'storm_stage'],
      dtype='object')

Types:
season               object
basin                object
nature               object
latitude            float64
longitude           float64
wind                float64
distance_to_land    float64
storm_speed         float64
storm_direction     float64
storm_stage         float64
dtype: object



In [274]:

# Target column: evaluating the amount of data for each stage of the storms
print(
    f"Distribution of the stages (our target):\n{df['storm_stage'].value_counts()}\n"
)

Distribution of the stages (our target):
storm_stage
2.0    17079
1.0    15409
4.0    10590
0.0      853
3.0      742
5.0      225
6.0      127
Name: count, dtype: int64



In [275]:
# Selecting the categorical columns only
categorical_columns = ["season", "basin", "nature"]

# Encoder
preprocessor = ColumnTransformer(
    transformers=[
        ("", OneHotEncoder(sparse_output=False), categorical_columns)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False,
)
encoded_data = preprocessor.fit_transform(df)
column_names = preprocessor.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_data, columns=column_names)

In [276]:
# Correlation matrix + heatmap using Plotly for the transformed data
# Checking the correlations to check our hypothesis before using the data with the ML Model
cm_df = encoded_df.corr()
cm_df = cm_df[((cm_df >= 0.2) | (cm_df <= -0.2)) & (cm_df != 1.00)]
cm_df = cm_df.dropna(how="all", axis=0).dropna(how="all", axis=1)

fig = px.imshow(
    cm_df,
    text_auto=".2f",
    zmin=-1,
    zmax=1,
    color_continuous_scale=px.colors.sequential.Turbo,
    title="Correlation Matrix Heatmap",
)

# Layout and show
fig.update_layout(
    title="Correlation Matrix",
    autosize=False,
    width=700,
    height=700,
)
fig.show()

In [277]:
display(encoded_df)

Unnamed: 0,season_Fall,season_Spring,season_Summer,season_Winter,basin_EP,basin_NI,basin_SI,basin_SP,basin_WP,nature_DS,...,nature_MX,nature_NR,nature_TS,latitude,longitude,wind,distance_to_land,storm_speed,storm_direction,storm_stage
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-12.5,172.5,25.0,647.0,6.0,350.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-12.2,172.4,25.0,653.0,6.0,350.0,1.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.9,172.4,25.0,670.0,5.0,360.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.7,172.4,25.0,682.0,4.0,10.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.5,172.5,25.0,703.0,4.0,20.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45020,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.6,141.6,25.0,1760.0,9.0,250.0,1.0
45021,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.4,141.2,25.0,1713.0,9.0,250.0,1.0
45022,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.3,140.8,25.0,1669.0,8.0,250.0,1.0
45023,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.1,140.4,23.0,1622.0,8.0,250.0,1.0


In [278]:
features, target = (
    encoded_df.loc[:, "season_Fall":"storm_direction"],
    encoded_df["storm_stage"],
)

# Splitting the dataframe in two
feat_train, feat_test, target_train, target_test = train_test_split(
    features, target
)

In [279]:
# Scalers
standard_scaler = StandardScaler()  # Less sensitive to outliers
min_max_scaler = MinMaxScaler()  # Better for KNN

# Normalizing
feat_train = min_max_scaler.fit_transform(feat_train)
feat_test = min_max_scaler.transform(feat_test)

In [280]:
knn_classifier = neighbors.KNeighborsClassifier(
    n_neighbors=3
)

In [281]:
knn_classifier.fit(feat_train, target_train)

In [282]:
target_pred = knn_classifier.predict(feat_test)

In [283]:
print(f"Accuracy: {accuracy_score(target_test, target_pred)}")

Accuracy: 0.9021053566669628
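The stage distribution printed earlier is very uneven (17079 rows for stage 2 against 127 for stage 6), so a single accuracy figure can hide poor results on the rare stages. `classification_report`, which is imported above, gives per-class precision and recall. A sketch on made-up labels, not on the notebook's predictions:

```python
from sklearn.metrics import classification_report

# Made-up labels: the rare stage 6 is never predicted, yet accuracy stays fairly high.
true_demo = [2, 2, 2, 2, 1, 1, 1, 6]
pred_demo = [2, 2, 2, 2, 1, 1, 2, 2]

report = classification_report(
    true_demo, pred_demo, output_dict=True, zero_division=0
)

# Per-class recall for the rare stage next to the overall accuracy.
rare_recall = report["6"]["recall"]
overall_accuracy = report["accuracy"]
print(f"recall for stage 6: {rare_recall}, accuracy: {overall_accuracy}")
```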
