# 1. Processing the initial CSV to extract useful features

# CSVDataProcessor

CSVDataProcessor is a utility class designed to process training and testing CSV files containing user session data. It extracts meaningful features from each session by tracking all unique actions, screens, configurations, and chains, and counting their occurrences. Additionally, it computes:
- The total number of actions in each session
- The session duration
- The user’s average speed during the session

The processed data is returned as a pandas DataFrame.
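
The sample output shown later in this notebook suggests that `avg_speed` is simply `total_actions` divided by `session_duration` (for example, 2514 / 2905 ≈ 0.865404). The internals of `CSVDataProcessor` are not shown here, so the sketch below only illustrates that relationship on the first three training rows; it is an assumption, not the class's actual implementation.

```python
import pandas as pd

# First three rows of the processed training output shown in this notebook.
sample = pd.DataFrame({
    "total_actions": [2514, 90, 608],
    "session_duration": [2905, 230, 750],
})

# Assumption: average speed = number of actions per unit of session duration.
sample["avg_speed"] = sample["total_actions"] / sample["session_duration"]
print(sample.round(6))
```

The computed values match the `avg_speed` column of the sample output to six decimal places.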




In [1]:
from utils import CSVDataProcessor
import pandas as pd

In [2]:
with CSVDataProcessor("train.csv") as processor:
    train_dataframe = processor.get_processed_train_data()
    test_dataframe = processor.get_processed_test_data(test_data_csv_path="test.csv")

In [3]:
train_dataframe.head(5)

Unnamed: 0,user,navigator,total_actions,session_duration,avg_speed,occurrence of action 'Erreur système grave',occurrence of action 'Action de table',occurrence of action 'Raccourci',occurrence of action 'Dissimulation d'une arborescence',occurrence of action 'Retour sur un écran',...,occurrence of chaine 'approweb',occurrence of chaine 'mobitour',occurrence of chaine 'selenium',occurrence of chaine 'ndf',occurrence of chaine 'valwf',occurrence of chaine 'maj',occurrence of chaine 'sv',occurrence of chaine 'qual',occurrence of chaine 'web',occurrence of chaine 'acti'
0,nuh,Firefox,2514,2905,0.865404,0,11,0,0,2,...,0,0,0,0,0,0,0,0,0,0
1,muz,Google Chrome,90,230,0.391304,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,zrx,Microsoft Edge,608,750,0.810667,0,6,39,0,4,...,0,0,0,0,0,0,0,0,0,0
3,pou,Firefox,886,1445,0.613149,0,7,0,0,12,...,0,0,0,0,0,0,0,0,0,0
4,ald,Google Chrome,173,275,0.629091,0,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [4]:
test_dataframe.head(5)

Unnamed: 0,navigator,total_actions,session_duration,avg_speed,occurrence of action 'Erreur système grave',occurrence of action 'Action de table',occurrence of action 'Raccourci',occurrence of action 'Dissimulation d'une arborescence',occurrence of action 'Retour sur un écran',occurrence of action 'Double-clic',...,occurrence of chaine 'approweb',occurrence of chaine 'mobitour',occurrence of chaine 'selenium',occurrence of chaine 'ndf',occurrence of chaine 'valwf',occurrence of chaine 'maj',occurrence of chaine 'sv',occurrence of chaine 'qual',occurrence of chaine 'web',occurrence of chaine 'acti'
0,Microsoft Edge,300,540,0.555556,0,0,23,0,0,37,...,0,0,0,0,0,0,0,0,0,0
1,Firefox,580,800,0.725,0,0,0,0,1,32,...,0,0,0,0,0,0,0,0,0,0
2,Google Chrome,714,1225,0.582857,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0
3,Google Chrome,1062,1225,0.866939,0,1,1,0,50,61,...,0,0,0,0,0,0,411,0,0,0
4,Firefox,211,280,0.753571,0,1,33,0,6,0,...,0,0,0,0,0,0,0,0,0,0


# 2. Dataset Preparation Workflow

This section defines a set of **helper functions** that perform the essential steps
to transform raw data into a model-ready format.  

These operations include:
- **1. One-hot encoding** — converting categorical feature columns into binary indicators
- **2. Feature and target split** — splitting the dataset into inputs (`X`) and target (`y`)
- **3. Label encoding** — converting categorical target labels into numeric values
- **4. Feature scaling** — standardizing numerical values for better model performance

All these helper functions will be combined into a single entry point, **`prepare_dataset()`**,  
which coordinates the entire preparation process and returns clean, ready-to-train data.


## 1. One-Hot Encoding

In [5]:
from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(data_frame: pd.DataFrame, column: str) -> pd.DataFrame:
    # Dense output so the result can be wrapped in a DataFrame (scikit-learn >= 1.2)
    encoder = OneHotEncoder(sparse_output=False)
    encoded = encoder.fit_transform(data_frame[[column]])
    # Reuse the original index so the concat below aligns rows correctly
    df_encoded = pd.DataFrame(
        encoded,
        columns=encoder.get_feature_names_out([column]),
        index=data_frame.index,
    )
    # Merge encoded columns with the original DataFrame
    df_encoded = pd.concat([data_frame.drop(column, axis=1), df_encoded], axis=1)
    return df_encoded

## 2. Feature Target Split

In [6]:
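def feature_target_split(data_frame: pd.DataFrame, target_column: str):
    # Everything except the target column becomes the feature matrix
    X_df = data_frame.drop(columns=[target_column])
    X = X_df.values
    # Target labels are still strings at this point
    y_str = data_frame[target_column].values

    return X, y_str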

## 3. Label Encoding

In [7]:
from sklearn.preprocessing import LabelEncoder

def encode_labels(y_str):
    label_encoder = LabelEncoder()
    y = label_encoder.fit_transform(y_str)

    return y
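
`encode_labels` returns only the numeric labels, so the mapping back to user IDs is lost once the function returns. The sketch below, which uses three user IDs from the sample output, shows one way to keep the fitted `LabelEncoder` and convert numeric predictions back to user IDs. Treating `y` as "predictions" here is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# User IDs taken from the sample output; the repeated "nuh" is illustrative.
y_str = np.array(["nuh", "muz", "zrx", "nuh"])

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y_str)  # classes are sorted alphabetically

# Pretend `y` holds model predictions and map them back to user IDs.
predicted_users = label_encoder.inverse_transform(y)

print(label_encoder.classes_)
print(y)
print(predicted_users)
```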

## 4. Standard Scaler

In [8]:
from sklearn.preprocessing import StandardScaler

def scale_features(X):
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    return X_scaled
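
`scale_features` fits a new scaler on whatever it receives, so the fitted mean and standard deviation cannot be reused on other data. A common alternative is to fit the scaler on the training rows only and apply the same transformation to any later data. The following is a minimal sketch on made-up numbers, not the notebook's actual pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature rows: (total_actions, session_duration).
X_train_raw = np.array([[2514.0, 2905.0], [90.0, 230.0], [608.0, 750.0]])
X_new_raw = np.array([[300.0, 540.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_raw)  # learns mean and std from training rows
X_new_scaled = scaler.transform(X_new_raw)          # reuses the same mean and std

print(X_train_scaled.mean(axis=0).round(6))
print(X_new_scaled)
```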
    

## 5. Entry Point Function (Prepare Dataset)
We will use the helper functions to prepare the training data by:

- **1. Converting categorical feature `navigator` into binary indicators**
- **2. Splitting the dataset into inputs (`X`) and target (`y`)**
- **3. Converting categorical target labels `user` into numeric values**
- **4. Standardizing numerical values**

In [9]:
def prepare_dataset(data_frame: pd.DataFrame):
    encoded_data_frame = one_hot_encode(data_frame, "navigator")
    X, y_str = feature_target_split(encoded_data_frame, "user")
    encoded_y = encode_labels(y_str)
    scaled_X = scale_features(X)

    return scaled_X, encoded_y
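
`prepare_dataset` converts the DataFrame to a plain array, so column order matters if the model is later applied to `test_dataframe`. Whether the train and test frames share exactly the same columns is not shown in this notebook. Assuming they might differ, the sketch below aligns a toy test frame to the training columns with `DataFrame.reindex`; the frames themselves are made up.

```python
import pandas as pd

# Made-up feature frames whose columns differ in order and content.
train_features = pd.DataFrame({
    "total_actions": [2514, 90],
    "occurrence of action 'Raccourci'": [0, 1],
})
test_features = pd.DataFrame({
    "occurrence of action 'Double-clic'": [37, 32],
    "total_actions": [300, 580],
})

# Keep only the training columns, in training order; missing ones are filled with 0.
aligned_test = test_features.reindex(columns=train_features.columns, fill_value=0)
print(aligned_test)
```

Columns that exist only in the test frame are dropped, since a model trained on `train_features` has no weights for them.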
    

# 3. Prepare Training Data

## 1. Prepare Dataset
Transform the training data into a model-ready format.

In [10]:
scaled_X, encoded_y = prepare_dataset(train_dataframe)

## 2. Train Test Split
Split the features and labels into a training set (80%) and a test set (20%) for model evaluation.
    

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    scaled_X, encoded_y, 
    test_size=0.2, 
    random_state=42
)
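
With many user classes, a purely random split can leave some users under-represented in the test set. `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses synthetic data and assumes every class has at least two samples; whether that holds for the real dataset is not shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 3 classes with 10 samples each, 4 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 4))
y = np.repeat([0, 1, 2], 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y,  # keep the class proportions identical in both subsets
)

# Number of test samples per class
print(np.bincount(y_te))
```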

# 4. Choosing the Best Classifier

## Model Comparison

We train and evaluate multiple classifiers:
- **1. Random Forest**
- **2. Decision Tree**
- **3. Logistic Regression**
- **4. Naive Bayes**
- **5. KNN**
  
The **weighted F1-score** is used to rank the models from best to worst.
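
The weighted F1-score is the per-class F1 averaged with weights proportional to each class's number of true samples (its support). The small example below, on made-up labels, computes it by hand and compares the result with scikit-learn's `average="weighted"` option.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels: class 0 has 4 true samples, class 1 has 2.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

per_class_f1 = f1_score(y_true, y_pred, average=None)  # one F1 value per class
support = np.bincount(y_true)                          # true samples per class

# Support-weighted mean of the per-class scores
manual_weighted = np.average(per_class_f1, weights=support)
weighted = f1_score(y_true, y_pred, average="weighted")
macro = f1_score(y_true, y_pred, average="macro")

print(per_class_f1, manual_weighted, weighted, macro)
```

Here the per-class scores are 0.75 and 0.5, so the weighted score is (4 × 0.75 + 2 × 0.5) / 6 ≈ 0.667, while the unweighted (macro) average is 0.625.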


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
import pandas as pd

# Dictionary of models to test
models = {
    "RandomForest": RandomForestClassifier(random_state=42, n_jobs=-1),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
    "NaiveBayes": GaussianNB(),
    "KNN": KNeighborsClassifier()
}

results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, average="weighted")
    results.append({"Model": name, "F1 Score": f1})

# Rank the models by F1-score
results_df = pd.DataFrame(results).sort_values(by="F1 Score", ascending=False).reset_index(drop=True)

results_df


|   | Model | F1 Score |
|---|---|---|
| 0 | RandomForest | 0.830259 |
| 1 | DecisionTree | 0.650319 |
| 2 | LogisticRegression | 0.617899 |
| 3 | NaiveBayes | 0.517644 |
| 4 | KNN | 0.388618 |


## Results
The **Random Forest** classifier achieved the highest weighted F1-score (0.830) of the five models tested, well ahead of the Decision Tree (0.650).

# 5. Hyperparameter Tuning

To further improve the Random Forest classifier, we now tune its hyperparameters.  
The goal is a higher weighted F1-score on held-out data.


## TODO: Grid Search
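
As a starting point for this step, here is a minimal sketch of a grid search over a Random Forest with scikit-learn's `GridSearchCV`, scored with the weighted F1. The synthetic data, the parameter grid, and the 3-fold setting are placeholder assumptions chosen so the example runs quickly; they are not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train, y_train); replace with the real arrays.
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=5, n_classes=3, random_state=42
)

# Placeholder grid: 2 x 2 = 4 candidate settings.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_weighted",  # same metric used to rank the models above
    cv=3,                   # 3-fold cross-validation on the training data
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a model refit on all the data passed to `fit` using the best parameters, and it can then be evaluated on the held-out test split.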