# Machine Learning

## Classification

### Documentation

Scikit-learn, choosing the right estimator: https://scikit-learn.org/stable/machine_learning_map.html

### Model testing steps

1. Encoding: from the current dataset, encode/preprocess the different columns (and possibly augment the dataset?).
2. Splitting: split the dataset into a train_set and a test_set.
3. Train/Test: train and test the model (best practices: GridSearch and checkpoints to store the best weights of your model).
4. Evaluate the model: define which metric(s) to evaluate, learning curve, validation curve.

<img src="data/metrics.png">

The dataset is currently very unbalanced. 
- Either augment the dataset and look at the precision, recall metrics
- Or leave the dataset unbalanced and look at the accuracy metric
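
Whichever option is chosen, it helps to see how accuracy and per-class recall can diverge on unbalanced labels. The sketch below is only an illustration: it uses synthetic labels (not the storm dataset) and a hypothetical baseline that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    recall_score,
)

# Synthetic, deliberately unbalanced labels: 95 samples of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)

# Hypothetical baseline that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of the per-class recalls
balanced = balanced_accuracy_score(y_true, y_pred)
# Recall per class: fraction of each true class that was recovered
recall_per_class = recall_score(y_true, y_pred, average=None)

print(f"accuracy          = {accuracy:.2f}")
print(f"balanced accuracy = {balanced:.2f}")
print(f"recall per class  = {recall_per_class}")
```

The baseline never finds the rare class, yet its accuracy stays high, which is why the per-class figures of `classification_report` are worth reading alongside the accuracy.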

### Choosing the models

| model | person | scores | encoding |
| --- | --- | --- | --- | 
| KNN | Vincent | -- | OneHotEncoder/LabelBinarizer | 
| KNN | -- | -- | -- | 
| SVM | Vincent | -- | OneHotEncoder/LabelBinarizer | 
| SVM | -- | -- | -- | 
| Randomforest | Audrey | -- | OneHotEncoder/LabelBinarizer | 
| Randomforest | Audrey | -- | Categorical encoding | 
| LinearSVC | Arnaud | -- | OneHotEncoder/LabelBinarizer | 
| LinearSVC | Arnaud | -- | Categorical encoding | 
| -- | -- | -- | -- | 
| -- | -- | -- | -- | 



In [None]:
# Imports
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go


# Imports for preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    # LabelBinarizer,
    OneHotEncoder,
)

# Imports for KNN and SVC/SVM models with GridSearchCV
from sklearn.model_selection import (
    train_test_split,
    GridSearchCV,
    validation_curve,
    learning_curve,
)

# from sklearn import svm
# from sklearn.svm import SVC
from sklearn import neighbors

# from sklearn.neighbors import KNeighborsClassifier

# Imports for the metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# from sklearn.metrics import ConfusionMatrixDisplay

## Preprocessing the base dataset

We start by testing several Machine Learning models on a cleaned and reduced dataset derived from the data analysis phase.

In [73]:
# From parquet to DataFrame : base dataset
df = pd.read_parquet("data/base.parquet", engine="pyarrow")

# Rename the columns for clarity
df.rename(
    {
        "SEASON": "season",
        "BASIN": "basin",
        "NATURE": "nature",
        "LAT": "latitude",
        "LON": "longitude",
        "WIND": "wind",
        "DIST2LAND": "distance_to_land",
        "STORM_SPEED": "storm_speed",
        "STORM_DIR": "storm_direction",
        "TD9636_STAGE": "storm_stage",
    },
    axis="columns",
    inplace=True,
)

# Backup DataFrame
df_backup = df.copy()

In [75]:
# DataFrame basic description
print(f"Shape rows/columns:\n{df.shape}\n")
print(f"Column names:\n{df.columns}\n")
print(f"Types:\n{df.dtypes}\n")

Shape rows/columns:
(45025, 10)

Column names:
Index(['season', 'basin', 'nature', 'latitude', 'longitude', 'wind',
       'distance_to_land', 'storm_speed', 'storm_direction', 'storm_stage'],
      dtype='object')

Types:
season               object
basin                object
nature               object
latitude            float64
longitude           float64
wind                float64
distance_to_land      int64
storm_speed         float64
storm_direction     float64
storm_stage         float64
dtype: object



In [76]:
# Target column: evaluating the amount of data for each stage of the storms
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(
    f"Distribution of the stages (our target):\n{df['storm_stage'].value_counts()}\n"
)

Distribution of the stages (our target):
storm_stage
2.0    17079
1.0    15409
4.0    10590
0.0      853
3.0      742
5.0      225
6.0      127
Name: count, dtype: int64



Categorical -> needs to be encoded, either with OneHotEncoder (one binary column per class) or as a single column of integer codes (1, 2, 3, 4, etc.)
- season (4 classes)
- basin (7 classes)
- nature (6 classes)

Numerical -> needs to be brought to a common scale.
- latitude
- longitude
- wind 
- distance_to_land
- storm_speed
- storm_direction


In [None]:
# Selecting the categorical columns only
categorical_columns = ["season", "basin", "nature"]

# Encoder
preprocessor = ColumnTransformer(
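    transformers=[
        # A non-empty transformer name is easier to reference later; output
        # column names are unchanged since verbose_feature_names_out=False
        ("onehot", OneHotEncoder(sparse_output=False), categorical_columns)
    ],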
    remainder="passthrough",
    verbose_feature_names_out=False,
)
encoded_data = preprocessor.fit_transform(df)
column_names = preprocessor.get_feature_names_out()
encoded_df = pd.DataFrame(encoded_data, columns=column_names)

In [200]:
# Correlation matrix + heatmap using Plotly for the transformed data
# Checking the correlations to check our hypothesis before using the data with the ML Model
cm_df = encoded_df.corr()
cm_df = cm_df[((cm_df >= 0.2) | (cm_df <= -0.2)) & (cm_df != 1.00)]
cm_df = cm_df.dropna(how="all", axis=0).dropna(how="all", axis=1)

fig = px.imshow(
    cm_df,
    text_auto=".2f",
    zmin=-1,
    zmax=1,
    color_continuous_scale=px.colors.sequential.Turbo,
    title="Correlation Matrix Heatmap",
)

# Layout and show
fig.update_layout(
    title="Correlation Matrix",
    autosize=False,
    width=700,
    height=700,
)
fig.show()

In [129]:
display(encoded_df)

Unnamed: 0,season_fall,season_spring,season_summer,season_winter,basin_EP,basin_NI,basin_SI,basin_SP,basin_WP,nature_DS,...,nature_MX,nature_NR,nature_TS,latitude,longitude,wind,distance_to_land,storm_speed,storm_direction,storm_stage
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-12.5,172.5,25.0,647.0,6.0,350.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-12.2,172.4,25.0,653.0,6.0,350.0,1.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.9,172.4,25.0,670.0,5.0,360.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.7,172.4,25.0,682.0,4.0,10.0,1.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,-11.5,172.5,25.0,703.0,4.0,20.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45020,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.6,141.6,25.0,1760.0,9.0,250.0,1.0
45021,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.4,141.2,25.0,1713.0,9.0,250.0,1.0
45022,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.3,140.8,25.0,1669.0,8.0,250.0,1.0
45023,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,1.0,14.1,140.4,23.0,1622.0,8.0,250.0,1.0


#### Splitting the dataframe

In [None]:
features, target = (
    encoded_df.loc[:, "season_fall":"storm_direction"],
    encoded_df["storm_stage"],
)

# Splitting the dataframe in two
feat_train, feat_test, target_train, target_test = train_test_split(
    features, target
)
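
Since some stages are rare, a purely random split can leave very few of them in one of the subsets. The sketch below shows a stratified, reproducible split on synthetic data. `stratify` and `random_state` are existing `train_test_split` parameters, but the data, the 80/20 ratio and the seed are assumptions made for the illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and an unbalanced target (not the storm data)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(1000, 3))
toy_target = np.array([0] * 900 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features,
    toy_target,
    test_size=0.2,  # assumed ratio
    stratify=toy_target,  # keep the class proportions in both subsets
    random_state=42,  # assumed seed, for reproducibility
)

print("train share of class 1:", y_train.mean())
print("test share of class 1:", y_test.mean())
```

With stratification both subsets keep the 10% share of the rare class.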

#### Scaling and Normalizing

In [None]:
# Scalers
standard_scaler = StandardScaler()  # Less sensitive to outliers
min_max_scaler = MinMaxScaler()  # Better for KNN

# Normalizing
feat_train = min_max_scaler.fit_transform(feat_train)
feat_test = min_max_scaler.transform(feat_test)
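
One caveat when scaling before a cross-validated search: the scaler has already seen the whole training set, including the rows that later serve as validation folds. Wrapping the scaler and the classifier in a `Pipeline` lets every fold refit the scaler on its own training part. The sketch below runs on synthetic data from `make_classification`; the step names and the small grid are assumptions for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (not the storm dataset)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# "scaler" and "knn" are arbitrary step names chosen for this sketch
pipeline = Pipeline(
    [("scaler", MinMaxScaler()), ("knn", KNeighborsClassifier())]
)

# Step parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(
    pipeline,
    param_grid={"knn__n_neighbors": [3, 5, 7]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(f"held-out accuracy: {test_accuracy:.2f}")
```

The fitted search object can then be used like any estimator: `search.predict` applies the scaler fitted on the training data before calling the best KNN model.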

#### KNN GridSearchCV

In [140]:
# Hyperparameters
param_grid = {"n_neighbors": range(2, 51)}

# Score to optimize
score = "accuracy"

# KNN classifier with GridSearchCV
knn_classifier = GridSearchCV(
    neighbors.KNeighborsClassifier(),
    param_grid,
    scoring=score,
)

knn_classifier.fit(feat_train, target_train)

# Results
print("Cross-validation results:\n")
for mean, std, params in zip(
    knn_classifier.cv_results_["mean_test_score"],
    knn_classifier.cv_results_["std_test_score"],
    knn_classifier.cv_results_["params"],
):
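    # Note: "{:.0f}" rounds the +/- spread to a whole number, so it prints as 0;
    # "{:.3f}" would show the actual spread across folds
    print(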
        "{} = {:.2f} (+/-{:.0f}) for {}".format(score, mean, std * 2, params)
    )

# Evaluate on the held-out test set, which the grid search never saw
target_pred = knn_classifier.predict(feat_test)
target_true = target_test

print("\n", classification_report(target_true, target_pred))

Cross-validation results:

accuracy = 0.89 (+/-0) for {'n_neighbors': 2}
accuracy = 0.89 (+/-0) for {'n_neighbors': 3}
accuracy = 0.88 (+/-0) for {'n_neighbors': 4}
accuracy = 0.88 (+/-0) for {'n_neighbors': 5}
accuracy = 0.87 (+/-0) for {'n_neighbors': 6}
accuracy = 0.87 (+/-0) for {'n_neighbors': 7}
accuracy = 0.86 (+/-0) for {'n_neighbors': 8}
accuracy = 0.86 (+/-0) for {'n_neighbors': 9}
accuracy = 0.85 (+/-0) for {'n_neighbors': 10}
accuracy = 0.85 (+/-0) for {'n_neighbors': 11}
accuracy = 0.85 (+/-0) for {'n_neighbors': 12}
accuracy = 0.85 (+/-0) for {'n_neighbors': 13}
accuracy = 0.85 (+/-0) for {'n_neighbors': 14}
accuracy = 0.85 (+/-0) for {'n_neighbors': 15}
accuracy = 0.84 (+/-0) for {'n_neighbors': 16}
accuracy = 0.84 (+/-0) for {'n_neighbors': 17}
accuracy = 0.84 (+/-0) for {'n_neighbors': 18}
accuracy = 0.84 (+/-0) for {'n_neighbors': 19}
accuracy = 0.84 (+/-0) for {'n_neighbors': 20}
accuracy = 0.84 (+/-0) for {'n_neighbors': 21}
accuracy = 0.83 (+/-0) for {'n_neighbors'

In [None]:
cm = confusion_matrix(target_true, target_pred)
# Normalize each row so a cell is the share of a true stage predicted as each stage
cm_percent = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]

# Custom labels
labels = [f"Stage {i}" for i in range(7)]

# Create a heatmap using Plotly
fig = px.imshow(
    cm_percent,
    text_auto=".2f",
    aspect="auto",
    title="Confusion Matrix - KNN GridSearch CV",
    labels=dict(x="Predicted", y="True", color="Proportion"),
    x=labels,
    y=labels,
    color_continuous_scale=px.colors.sequential.tempo,
)

fig.show()

In [201]:
print(f"Best parameters : {knn_classifier.best_params_}\n")
print(f"Best accuracy score : {knn_classifier.best_score_}\n")
print(f"Best estimator : {knn_classifier.best_estimator_}")

Best parameters : {'n_neighbors': 3}

Best accuracy score : 0.8892443946539339

Best estimator : KNeighborsClassifier(n_neighbors=3)


In [204]:
# Calculate learning curve
train_sizes, train_scores, valid_scores = learning_curve(
    knn_classifier.best_estimator_,
    feat_train,
    target_train,
    train_sizes=np.linspace(0.1, 1, 50),
    cv=5,
    scoring="accuracy",
)

# Calculate mean scores
train_scores_mean = np.mean(train_scores, axis=1)
valid_scores_mean = np.mean(valid_scores, axis=1)

# Create a plotly figure
fig = go.Figure()

# Add train scores
fig.add_trace(
    go.Scatter(x=train_sizes, y=train_scores_mean, mode="lines", name="Train")
)

# Add validation scores
fig.add_trace(
    go.Scatter(
        x=train_sizes, y=valid_scores_mean, mode="lines", name="Validation"
    )
)

# Update layout and display
fig.update_layout(
    title="KNN Learning Curve",
    xaxis_title="Training Size",
    yaxis_title="Accuracy",
    legend=dict(x=0, y=1, traceorder="normal"),
)
fig.show()

In [None]:
# NEXT : Validation curve + improved hyperparameters
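
As a starting point for that next step, the sketch below computes a validation curve over `n_neighbors` with scikit-learn's `validation_curve`. It runs on synthetic data from `make_classification`, not on the storm dataset, and the range of neighbors is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (not the storm dataset)
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# Assumed range of values for the hyperparameter under study
neighbor_range = np.arange(1, 16, 2)

# One row per hyperparameter value, one column per cross-validation fold
train_scores, valid_scores = validation_curve(
    KNeighborsClassifier(),
    X,
    y,
    param_name="n_neighbors",
    param_range=neighbor_range,
    cv=5,
    scoring="accuracy",
)

train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
for k, tr, va in zip(neighbor_range, train_mean, valid_mean):
    print(f"n_neighbors={k:2d}  train={tr:.2f}  validation={va:.2f}")
```

A large gap between the train and validation means at small `n_neighbors` would suggest overfitting; the two mean arrays can be plotted with Plotly in the same way as the learning curve.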

## Improving the model

Once a first model is available, we might try to improve the results by adding other columns and further modifying some of the existing ones.
Columns considered:
- TRACK_TYPE
- IFLAG
- Try and use (merge or other methods) all the R30, R40, etc. content

Further modifications considered:
- Filling the empty TD9636_STAGE rows based on other columns that contain a classification of the storm (USA_SSHS, TOKYO_GRADE, CMA_CAT, HKO_CAT, KMA_CAT, NEWDELHI_GRADE, REUNION_TYPE, BOM_TYPE, NADI_CAT, DS824_STAGE, NEUMANN_CLASS, MLC_CLASS)
- Adding WIND rows by converting NEWDELHI_WIND, CMA_WIND (and maybe WMO_WIND), as was done for the others
- Merging BASIN and SUBBASIN to add more precise information

- Test MONTH / SEASON
- With/without the pressure columns (because of approximations)
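
Two of the ideas above, filling missing stages from another classification column and merging BASIN with SUBBASIN, could look like the pandas sketch below. Everything in it is hypothetical: the rows are made up, `OTHER_CLASS` stands in for any of the agency columns, and the mapping to stages is a placeholder that would have to be established from the dataset documentation.

```python
import numpy as np
import pandas as pd

# Made-up rows; OTHER_CLASS is a hypothetical stand-in for an agency column
toy = pd.DataFrame(
    {
        "BASIN": ["SP", "WP", "NI"],
        "SUBBASIN": ["MM", "MM", "BB"],
        "TD9636_STAGE": [1.0, np.nan, np.nan],
        "OTHER_CLASS": ["A", "B", None],
    }
)

# Placeholder correspondence between the other classification and the stages
hypothetical_mapping = {"A": 1.0, "B": 2.0}

# Combine both location columns into one more precise categorical feature
toy["basin_subbasin"] = toy["BASIN"] + "_" + toy["SUBBASIN"]

# Keep existing stages and fill the gaps only where the mapping gives a value
toy["stage_filled"] = toy["TD9636_STAGE"].fillna(
    toy["OTHER_CLASS"].map(hypothetical_mapping)
)

print(toy[["basin_subbasin", "TD9636_STAGE", "stage_filled"]])
```

Rows with no usable classification stay empty, so they can still be dropped or handled separately.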