# Final use case: School Failure Prediction

This notebook builds a **predictive AI solution** that estimates the risk of school failure for a given student.

The system uses a "*Chain of Responsibility*" pattern to pipeline the process. Each link of the chain is responsible for one processing step and passes its result on to the next link.

This architecture makes it easy to change the orchestration or add new processing steps.
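
To make the pattern concrete, here is a minimal, self-contained sketch of a Chain of Responsibility. The names `Context`, `Handler`, `LoadStep` and `DropNegativeStep` are hypothetical and are not the classes of the `core` package; the sketch only illustrates how each link does one job and then forwards a shared context.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Context:
    """Hypothetical shared state passed along the chain."""
    values: List[float] = field(default_factory=list)
    log: List[str] = field(default_factory=list)


class Handler:
    """Base link: do one job, then hand the context to the next link."""

    def __init__(self) -> None:
        self._next: Optional["Handler"] = None

    def set_next(self, handler: "Handler") -> "Handler":
        self._next = handler
        return handler  # returning the next link allows chained calls

    def handle(self, context: Context) -> Context:
        context = self.process(context)
        context.log.append(type(self).__name__)  # record which link ran
        return self._next.handle(context) if self._next else context

    def process(self, context: Context) -> Context:
        raise NotImplementedError


class LoadStep(Handler):
    def process(self, context: Context) -> Context:
        context.values = [1.0, 2.0, -5.0]  # made-up data
        return context


class DropNegativeStep(Handler):
    def process(self, context: Context) -> Context:
        context.values = [v for v in context.values if v >= 0]
        return context


first = LoadStep()
first.set_next(DropNegativeStep())
result = first.handle(Context())
print(result.values, result.log)
```

Because each link only knows its successor, adding or swapping a step means changing one `set_next` call rather than the steps themselves.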

In [145]:
%load_ext autoreload
%autoreload 2
import pandas as pd
from loguru import logger

from core.pipeline_core.pipeline_core import DataHandler, PipelineContext, PipelineOrchestrator

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


*Refactoring*
Handlers have been moved out of the notebook:
- DataLoader moved to the core.handlers package (data_loader.py)
- SensitiveDataHandler moved to the core.handlers package (sensitive_data_handler.py)
- MergerHandler moved to the core.handlers package (merger_handler.py)
- OutlierHandler moved to the core.handlers package (outlier_handler.py)
- ImputationHandler moved to the core.handlers package (imputation_handler.py)
- DataExportHandler moved to the core.handlers package (data_export_handler.py)
- ModelHandler moved to the core.handlers package (model_handler.py)

**Notes**:

*Outlier handler:*
Handles *outliers* using one of two strategies:
- IQR
- Isolation Forest

Finally, the entire row is removed if any one of its columns holds an outlier.
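
As an illustration of the IQR strategy, the sketch below drops every row in which at least one target column falls outside `[Q1 - k*IQR, Q3 + k*IQR]`. The function name `drop_iqr_outliers`, the factor `k=1.5` and the sample values are assumptions made for this example; the actual logic lives in `outlier_handler.py` and may differ.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.DataFrame:
    """Drop rows where any of `columns` lies outside the IQR fences."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    # Compare each column against its own lower and upper fence
    within = df[columns].ge(q1 - k * iqr, axis="columns") & df[columns].le(q3 + k * iqr, axis="columns")
    # NaN cells are not treated as outliers here, so a later imputation step can still fill them
    keep = (within | df[columns].isna()).all(axis=1)
    return df[keep]


# Made-up sample: one student has an implausibly high number of absences
students = pd.DataFrame({
    "absences": [0, 2, 4, 6, 75],
    "age": [15, 16, 17, 18, 16],
})
cleaned = drop_iqr_outliers(students, ["absences", "age"])
print(cleaned)
```

The row with 75 absences is removed as a whole, even though its `age` value is unremarkable.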

*NaN imputation:*
Identifies NaN values in the dataframe and replaces them using one of these strategies:
- AIImputation: uses regression to estimate the missing values (useful for large dataframes)
- SimpleImputer: uses either a mean or a median replacement strategy
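
The two ideas can be sketched with pandas and NumPy alone. This is a stand-in written for illustration, not the code of `imputation_handler.py`: `impute_simple` and `impute_by_regression` are hypothetical helpers, the regression uses a single predictor, and the grades are made up.

```python
import numpy as np
import pandas as pd


def impute_simple(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Replace NaN in numeric columns by the column mean or median."""
    numeric = df.select_dtypes("number")
    fill = numeric.mean() if strategy == "mean" else numeric.median()
    return df.fillna(fill)


def impute_by_regression(df: pd.DataFrame, target: str, predictor: str) -> pd.DataFrame:
    """Fit target ~ predictor on complete rows, then predict the missing targets."""
    known = df.dropna(subset=[target, predictor])
    slope, intercept = np.polyfit(known[predictor], known[target], deg=1)
    missing = df[target].isna() & df[predictor].notna()
    out = df.copy()
    out.loc[missing, target] = slope * out.loc[missing, predictor] + intercept
    return out


# Made-up grades where G2 follows G1 + 1 and one G2 value is missing
grades = pd.DataFrame({
    "G1": [8.0, 10.0, 12.0, 14.0],
    "G2": [9.0, 11.0, np.nan, 15.0],
})
print(impute_simple(grades, strategy="median"))
print(impute_by_regression(grades, target="G2", predictor="G1"))
```

The median fill ignores the relationship between the two columns, while the regression fill uses it, which is the motivation for offering both strategies.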

*Model handler:*
ModelHandler uses *strategies*:
- Logistic Regression (LR)
- Random Forest Classifier (RF)

and uses 4 base hypotheses: full_dataframe, no sensitive data, no_g1, no_g1_g2

The pipeline scheme set in a YAML config can be tuned to:
- add other algorithms
- set different hypotheses

During training, MLflow is used to store parameters, artifacts and final models.

## Move orchestration logic to YAML configuration
With a configuration file, pipeline steps can be handled dynamically. The following class reads the configuration and builds the orchestration.

**PipelineBuilder** moved to the core.pipeline_core package (pipeline_builder.py)
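
The idea of configuration-driven assembly can be sketched as follows. A plain dictionary stands in for the parsed YAML file, and the key names (`steps`, `handler`, `params`), the `Scale` and `Offset` classes and the `REGISTRY` mapping are all assumptions for this example; they do not describe the real layout of the configuration file or the real `PipelineBuilder`.

```python
# Stand-in for the parsed YAML file (hypothetical layout)
config = {
    "steps": [
        {"handler": "Scale", "params": {"factor": 10}},
        {"handler": "Offset", "params": {"amount": 1}},
    ]
}


class Scale:
    def __init__(self, factor: float) -> None:
        self.factor = factor

    def handle(self, value: float) -> float:
        return value * self.factor


class Offset:
    def __init__(self, amount: float) -> None:
        self.amount = amount

    def handle(self, value: float) -> float:
        return value + self.amount


# Maps the names used in the configuration to the classes that implement them
REGISTRY = {"Scale": Scale, "Offset": Offset}


def build_steps(config: dict, registry: dict) -> list:
    """Instantiate one handler per configured step, in configuration order."""
    steps = []
    for spec in config["steps"]:
        name = spec["handler"]
        if name not in registry:
            raise KeyError(f"Unknown handler {name!r} in configuration")
        steps.append(registry[name](**spec.get("params", {})))
    return steps


def run(steps: list, value: float) -> float:
    """Pass the value through every step in turn."""
    for step in steps:
        value = step.handle(value)
    return value


steps = build_steps(config, REGISTRY)
print(run(steps, 2))  # (2 * 10) + 1
```

Reordering or adding steps then only requires editing the configuration, not the notebook.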

## Orchestrator settings (@Deprecated)
- Set sources
- Set sensitive data
- Initialize the orchestrator

**Deprecated**: after delegating to PipelineBuilder, there is no need to configure the orchestrator manually.

Kept here for documentation only.

In [146]:
from core.handlers.data_export_handler import DataExportHandler
from core.handlers.data_loader import DataLoader
from core.handlers.imputation_handler import ImputationHandler
from core.handlers.merger_handler import MergerHandler
from core.handlers.model_handler import ModelHandler
from core.handlers.outlier_handler import OutlierHandler
from core.handlers.sensitive_data_handler import SensitiveDataHandler
from core.strategy_core.outliers_strategies import IsolationForestStrategy
from core.strategy_core.imputation_strategies import AIImputationStrategy

files_to_load = {
    "maths": "datas/student-mat.csv",
    "por": "datas/student-por.csv"
}

sensitives = [
    "romantic", # No correlation
    "Dalc", # Discriminatory data, cannot be used
    "Walc", # Discriminatory data, cannot be used
]

# Create the chain instances:
# 1. Data processing chain
loader = DataLoader(files_to_load=files_to_load)
cleaner = SensitiveDataHandler(sensitive_columns=sensitives)
merger = MergerHandler()

# Select one of the outlier detection strategies (Isolation Forest)
outlier_strategy = IsolationForestStrategy(contamination=0.01)
outlier = OutlierHandler(strategy=outlier_strategy, target_columns=["studytime", "absences", "age"])

# Select one of the imputation strategies
imputer_strategy = AIImputationStrategy()
imputer = ImputationHandler(imputer_strategy)

exporter = DataExportHandler()

# Instantiate the pipeline
pipeline = (PipelineOrchestrator()
    .add_handler(loader)
    .add_handler(cleaner)
    .add_handler(merger)
    .add_handler(outlier)
    .add_handler(imputer)
    .add_handler(exporter)
)

# 2. Learning chain
from core.strategy_core.training_strategies import LogisticRegressionStrategy, RandomForestStrategy

# Each scenario: (id, label, columns to exclude)
scenarii = [
    (1, "Full_Features", []),
    (2, "No_Sensitive", ["romantic", "Dalc", "Walc"]),
    (3, "No_Sensitive_No_G2", ["romantic", "Dalc", "Walc", "G2"]),
    (4, "No_Sensitive_No_G1_G2", ["romantic", "Dalc", "Walc", "G1", "G2"])
]

# 2.1 Add one model handler per (scenario, strategy) pair
for s_id, s_name, s_exclusions in scenarii:
    for strategy_class in [LogisticRegressionStrategy, RandomForestStrategy]:
        strategy = strategy_class(scenario_id=s_name, exclusions=s_exclusions)
        model_handler = ModelHandler(strategy=strategy, scenario_label=s_name)
        pipeline.add_handler(model_handler)


## Run orchestrator

The orchestrator is a Chain of Responsibility. Once the end of the chain is reached, all processing steps are done.

**Major update**: chain assembly is delegated to a configuration file (see: pipeline_config.yaml)


In [147]:
# Initialize context and orchestrator
from core.pipeline_core.pipeline_builder import PipelineBuilder


orchestrator = PipelineOrchestrator()
context = PipelineContext()
final_context = None  # stays None if the run fails

# Build pipeline from Notebook classes and YAML config
pipeline = PipelineBuilder.build_from_yaml("pipeline_config.yaml", orchestrator)

# Run the pipeline
try:
    final_context = orchestrator.run(context)
except Exception as e:
    logger.error(f"❌ Pipeline failed: {e}")
finally:
    print("🏁 Pipeline execution ended.")

# Final report, only when the run produced a context
if final_context is not None:
    print("\n--- Merged data overview ---")
    display(final_context.final_df.head())

    print("\n--- Execution stats ---")
    for step, duration in final_context.execution_time.items():
        print(f"{step:25} : {duration:.4f}s")



2025-12-25 03:31:08.165 | INFO     | core.pipeline_core.pipeline_builder:build_from_yaml:6 - 🏗️ Building pipeline from Notebook classes...
2025-12-25 03:31:08.166 | DEBUG    | core.pipeline_core.pipeline_builder:_get_class_from_anywhere:8 - 🔍 Looking for class 'DataLoader' in module 'core.handlers.dataloader'


ModuleNotFoundError: No module named 'core.handlers.dataloader'
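
The log above looks for `DataLoader` in `core.handlers.dataloader`, whereas the handler is imported earlier from `core.handlers.data_loader`, so the module path used for the lookup is the likely cause of this error. The sketch below shows, under assumptions, how a dotted module path can be resolved to a class with `importlib`, with an error message that names the offending path. `resolve_class` is a hypothetical helper, not the builder's `_get_class_from_anywhere`, and standard-library modules are used so the example runs anywhere.

```python
import importlib


def resolve_class(module_path: str, class_name: str):
    """Import `module_path` and return the attribute `class_name` from it."""
    try:
        module = importlib.import_module(module_path)
    except ModuleNotFoundError as exc:
        # Re-raise with the requested class, so the faulty config entry is easy to find
        raise ModuleNotFoundError(
            f"Cannot import {module_path!r} for class {class_name!r}: {exc}"
        ) from exc
    try:
        return getattr(module, class_name)
    except AttributeError:
        raise ImportError(f"Module {module_path!r} has no class {class_name!r}") from None


# A correct path resolves to the class object
cls = resolve_class("collections", "OrderedDict")
print(cls.__name__)

# A misspelled module path fails with a message that names it
try:
    resolve_class("json.decoderr", "JSONDecoder")  # deliberately misspelled
except ModuleNotFoundError as exc:
    message = str(exc)
    print(message)
```

Checking every configured module path this way before running the pipeline would surface a typo such as `dataloader` versus `data_loader` at build time.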

## Store the best run

Prepare the API by exporting the best model (based on AUC)

In [None]:
import joblib
import mlflow
import mlflow.sklearn

# For each metric, get the best run from MLflow tracking, then load and save its model locally
for metric, label, suffix in [("auc_score", "AUC", "auc"), ("accuracy", "Accuracy", "accuracy")]:
    runs = mlflow.search_runs(order_by=[f"metrics.{metric} DESC"])
    best_run = runs.iloc[0]

    best_model = mlflow.sklearn.load_model(model_uri=f"runs:/{best_run['run_id']}/model")
    joblib.dump(best_model, f"backend/models/student_model_{suffix}_latest.joblib")
    print(f"✅ Best model based on {label} score was saved successfully : {best_run['tags.mlflow.runName']}")