# Scikit-Learn Models

This notebook demonstrates how to train one of the scikit-learn models exposed by the `SklearnModels` wrapper in the ThreeWToolkit, using the 3W dataset for multiclass classification of time series data.

In [23]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from typing import Dict, Any, List, Callable, Optional
from enum import Enum
from abc import ABC, abstractmethod
import sys
import os

# Adds the root directory to the sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../')))

from ThreeWToolkit.dataset import ParquetDataset
from ThreeWToolkit.core.base_dataset import ParquetDatasetConfig
from ThreeWToolkit.core.base_preprocessing import WindowingConfig
from ThreeWToolkit.core.base_assessment import ModelAssessmentConfig
from ThreeWToolkit.preprocessing import Windowing
from ThreeWToolkit.assessment.assessment_visualizations import AssessmentVisualization
from ThreeWToolkit.core.base_assessment_visualization import AssessmentVisualizationConfig
from ThreeWToolkit.models.sklearn_models import SklearnModels, SklearnModelsConfig
from ThreeWToolkit.core.enums import ModelTypeEnum
from ThreeWToolkit.core.base_models import BaseModels, ModelsConfig
from ThreeWToolkit.trainer.trainer import ModelTrainer, TrainerConfig

RANDOM_SEED = 2025

## Loading Dataset

The next step is to create a `ParquetDataset` instance to interact with the 3W dataset. To do so, we define the path where the dataset should be saved (or where it is already located).

In [24]:
ds_config = ParquetDatasetConfig(
    path="./dataset",
    clean_data=True,
    seed=RANDOM_SEED,
    target_class=[0, 1, 2]
)
ds = ParquetDataset(ds_config)
ds[19]

[ParquetDataset] Dataset found at ./dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


{'signal':                      ABER-CKGL  ABER-CKP  ESTADO-DHSV  ESTADO-M1  ESTADO-M2  \
 timestamp                                                                     
 2014-05-30 08:32:03        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 08:32:04        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 08:32:05        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 08:32:06        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 08:32:07        0.0       0.0          0.0        0.0        0.0   
 ...                        ...       ...          ...        ...        ...   
 2014-05-30 10:33:10        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 10:33:11        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 10:33:12        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 10:33:13        0.0       0.0          0.0        0.0        0.0   
 2014-05-30 10:33:14        0.

## Model Trainer Configurations

With the data ready, we can now define the `ModelTrainer` using the `SklearnModelsConfig` and `TrainerConfig` configuration classes.

Note that `TrainerConfig` receives the model configuration as a parameter (`config_model`), and this configuration is then used to instantiate the `ModelTrainer`.

In [28]:
# First, create the configuration for the specific scikit-learn model you want to use.
sklearn_config = SklearnModelsConfig(
    model_type=ModelTypeEnum.RANDOM_FOREST,
    model_params={"n_estimators": 100, "max_depth": 10}
)

# Next, create the TrainerConfig, passing the model's config to it.
# Note: Parameters like epochs, learning_rate, and criterion are not used by
# scikit-learn models, but are part of the standard TrainerConfig.
trainer_config = TrainerConfig(
    config_model=sklearn_config,
    epochs=1, # Not used by sklearn, can be set to 1
    batch_size=1024, # Not used by sklearn, can be any value
    seed=RANDOM_SEED,
    # The following are ignored by the SklearnModels wrapper but are required by TrainerConfig
    optimizer="adam",
    criterion="cross_entropy",
    learning_rate=0.001,
    shuffle_train=True,
    cross_validation=False,
)

# Instantiate the ModelTrainer
trainer = ModelTrainer(trainer_config)

print("ModelTrainer configured for RandomForestClassifier:")
print(trainer.model.model)

ModelTrainer configured for RandomForestClassifier:
RandomForestClassifier(max_depth=10, random_state=42)


## Prepare data for training without Pipeline Class

When working with time series data, it is important that every input to the model has a consistent shape. To achieve this, we use the sliding window method.

When using the `ModelTrainer` without the Pipeline class, we have to apply the `Windowing` class ourselves to prepare the dataset for training.

Because we are using the classes in isolation, we also need to process the dataset event by event, building a combined DataFrame with all windows and their labels.
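To make the sliding window idea concrete, the sketch below splits a one-dimensional sequence into overlapping windows, zero-pads the last one, and weights each window with a Hann taper. It is a dependency-free illustration only: the helper names `hann` and `sliding_windows` are hypothetical, and the toolkit's `Windowing` class may differ in details such as how the step size is rounded or how padding is applied.

```python
import math


def hann(n):
    """Symmetric Hann taper of length n (same formula as numpy.hanning)."""
    if n == 1:
        return [1.0]
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def sliding_windows(signal, window_size, overlap=0.5, pad_last_window=True):
    """Split a 1-D sequence into overlapping, Hann-weighted windows."""
    # Distance between consecutive window starts (assumed rounding: truncate).
    step = max(1, int(window_size * (1 - overlap)))
    taper = hann(window_size)
    windows = []
    for start in range(0, len(signal), step):
        chunk = list(signal[start:start + window_size])
        if len(chunk) < window_size:
            if not pad_last_window:
                break
            chunk += [0.0] * (window_size - len(chunk))  # zero-pad the tail
        windows.append([x * w for x, w in zip(chunk, taper)])
        if start + window_size >= len(signal):
            break  # the signal is fully covered
    return windows


demo = sliding_windows(list(range(10)), window_size=4, overlap=0.5)
print(len(demo), len(demo[0]))  # number of windows, samples per window
```

Every window has the same length regardless of how long the original event is, which is what allows all events to be stacked into a single table of fixed-width rows. The Hann taper is zero at both ends of each window, which is consistent with the leading zeros in the first column of the windowed output.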

In [26]:
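# Window length in samples; 1000 matches the var1_t0..var1_t999 columns shown in the output below.
window_size = 1000
windowing_config = WindowingConfig(window="hann", window_size=window_size, overlap=0.5, pad_last_window=True)
windowing = Windowing(windowing_config)
selected_col = "T-TPT"  # single signal column used as model input
dfs = []

for event in ds:
    # Split the selected signal into overlapping Hann-weighted windows (one row per window).
    windowed_signal = windowing(
        event["signal"][selected_col],
    )
    windowed_signal.drop(columns=["win"], inplace=True)
    # np.unique returns sorted values, so this takes the smallest class id present in the event.
    windowed_signal["label"] = np.unique(event["label"]["class"])[0]
    dfs.append(windowed_signal)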

dfs_final = pd.concat(dfs, ignore_index=True, axis=0)
dfs_final

Unnamed: 0,var1_t0,var1_t1,var1_t2,var1_t3,var1_t4,var1_t5,var1_t6,var1_t7,var1_t8,var1_t9,...,var1_t991,var1_t992,var1_t993,var1_t994,var1_t995,var1_t996,var1_t997,var1_t998,var1_t999,label
0,-0.0,-0.000030,-0.000120,-0.000271,-0.000481,-0.000752,-0.001082,-0.001473,-0.001924,-0.002435,...,-0.002435,-0.001924,-0.001473,-0.001082,-0.000752,-0.000481,-0.000271,-0.000120,-0.000030,2
1,-0.0,-0.000030,-0.000120,-0.000271,-0.000481,-0.000752,-0.001082,-0.001473,-0.001924,-0.002435,...,-0.002435,-0.001924,-0.001473,-0.001082,-0.000752,-0.000481,-0.000271,-0.000120,-0.000030,2
2,-0.0,-0.000030,-0.000120,-0.000271,-0.000481,-0.000752,-0.001082,-0.001473,-0.001924,-0.002435,...,-0.002435,-0.001924,-0.001473,-0.001082,-0.000752,-0.000481,-0.000271,-0.000120,-0.000030,2
3,-0.0,-0.000030,-0.000120,-0.000271,-0.000481,-0.000752,-0.001082,-0.001473,-0.001924,-0.002435,...,-0.002435,-0.001924,-0.001473,-0.001082,-0.000752,-0.000481,-0.000271,-0.000120,-0.000030,2
4,-0.0,-0.000030,-0.000120,-0.000271,-0.000481,-0.000752,-0.001082,-0.001473,-0.001924,-0.002435,...,-0.002434,-0.001923,-0.001473,-0.001082,-0.000751,-0.000481,-0.000270,-0.000120,-0.000030,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44390,0.0,0.000004,0.000017,0.000039,0.000070,0.000109,0.000156,0.000213,0.000278,0.000352,...,0.000354,0.000279,0.000214,0.000157,0.000109,0.000070,0.000039,0.000017,0.000004,0
44391,0.0,0.000004,0.000017,0.000039,0.000070,0.000109,0.000157,0.000214,0.000280,0.000354,...,0.000355,0.000281,0.000215,0.000158,0.000110,0.000070,0.000040,0.000018,0.000004,0
44392,0.0,0.000004,0.000017,0.000039,0.000070,0.000109,0.000158,0.000214,0.000280,0.000355,...,0.000356,0.000281,0.000215,0.000158,0.000110,0.000070,0.000040,0.000018,0.000004,0
44393,0.0,0.000004,0.000018,0.000039,0.000070,0.000110,0.000158,0.000215,0.000280,0.000355,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0


## Training the model

With the data and trainer ready, we can call `trainer.train()`, passing the `x_train` and `y_train` arguments.

In [27]:
# Train the sklearn model using the new ModelTrainer interface
trainer.train(x_train=dfs_final.iloc[:, :-1], y_train=dfs_final["label"].astype(int))

## Training Assessment and Results

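To gather the results, we pass a `ModelAssessmentConfig` to the `trainer.assess` method. Note that this example assesses the model on the same windows that were used for training, so the reported accuracy is an optimistic estimate rather than a true test-set score.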

In [11]:
# Evaluate model performance with ModelTrainer's assess method (here on the same windows used for training)
assessment_config = ModelAssessmentConfig(
    metrics=["accuracy"],
    batch_size=32,
)

results = trainer.assess(
    dfs_final.iloc[:, :-1],
    dfs_final["label"].astype(int),
    assessment_config=assessment_config
)

print(f"Test Metrics: {results['metrics']}")

Results exported to /home/gabriel.lisboa/Workspace/OP2396_3W/3WToolkit/output
Model Assessment Summary
Model: SklearnModels
Task Type: classification
Timestamp: 2025-10-10T00:57:52.935247

Metrics:
  accuracy: 0.9863
Test Metrics: {'accuracy': 0.9862822389908773}
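
The assessment above reuses the windows the model was trained on. For a less optimistic estimate, the windows can be split into training and test sets before training. Since windows from the same event overlap by 50%, splitting at the event level avoids placing near-duplicate windows on both sides. The sketch below shows one way to do this with the standard library only; `split_by_event` and the `event_ids` bookkeeping are hypothetical and not part of the ThreeWToolkit.

```python
import random


def split_by_event(event_ids, test_fraction=0.25, seed=2025):
    """Return (train_idx, test_idx) so that no event appears on both sides."""
    unique_events = sorted(set(event_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_events)
    # Hold out at least one event for testing.
    n_test = max(1, round(len(unique_events) * test_fraction))
    test_events = set(unique_events[:n_test])
    train_idx = [i for i, e in enumerate(event_ids) if e not in test_events]
    test_idx = [i for i, e in enumerate(event_ids) if e in test_events]
    return train_idx, test_idx


# Hypothetical bookkeeping: 8 events contributing 3 windows each.
event_ids = [event for event in range(8) for _ in range(3)]
train_idx, test_idx = split_by_event(event_ids)
print(len(train_idx), len(test_idx))  # window counts on each side
```

In this notebook, that would mean recording an event identifier for each row while building `dfs_final`, then passing `dfs_final.iloc[train_idx]` to `trainer.train` and `dfs_final.iloc[test_idx]` to `trainer.assess`.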
