## Import custom modules and check structure

We add the `../src` directory to `sys.path` so that the custom Python modules written for this project can be imported.  
These modules provide preprocessing functions, modeling utilities, and evaluation logic.

In [None]:
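import os
import sys

# Make the project's custom modules in ../src importable
sys.path.append('../src')

# List the directory contents to confirm the expected modules are present
print(os.listdir('../src'))

from preprocessing import create_aktywny_klient, encode_categorical, scale_data, scale_all_data
from modeling import train_models, get_models
from evaluation import evaluate_with_cv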

['data_loading.py', 'evaluation.py', '__pycache__', 'preprocessing.py', 'modeling.py']
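
Because `'../src'` is a relative path, it is resolved against the current working directory, which for a notebook is normally the folder that contains the notebook file. A slightly more defensive variant is sketched below: it turns the path into an absolute one and avoids appending it again when the cell is re-run. The name `src_dir` is illustrative and not part of the project.

```python
import sys
from pathlib import Path

# Assumption: the notebook runs from a folder that sits next to `src/`.
src_dir = (Path.cwd().parent / "src").resolve()

# Append only once, so re-running the cell does not keep growing sys.path.
if str(src_dir) not in sys.path:
    sys.path.append(str(src_dir))
```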


In [None]:
# Load the cleaned customer data from the `data/raw/` directory using the `load_data` function.
from data_loading import load_data
df = load_data('../data/raw/Klienci_DB.csv')

## Preprocess and split the data

- Add a new binary feature `aktywny_klient` (whether the client was active in the last 90 days).
- Encode the categorical variables (`plec`, `miasto`) using `LabelEncoder` and one-hot encoding.
- Split the data into features `X` and target `y` (`czy_churn`).
- Perform a stratified 80/20 train/test split.
- Scale the train/test split, and separately scale the full dataset for cross-validation.
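
To make the two encoding strategies in the list above concrete, here is a dependency-free sketch. It is not the project's `encode_categorical`: the category values are invented, and applying label encoding to `plec` and one-hot encoding to `miasto` is an assumption made only for illustration.

```python
# Toy rows; the category values are invented for illustration only.
rows = [
    {"plec": "K", "miasto": "Warszawa"},
    {"plec": "M", "miasto": "Lublin"},
    {"plec": "K", "miasto": "Gdynia"},
]

def label_encode(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values, prefix):
    """Create one 0/1 column per distinct category."""
    cats = sorted(set(values))
    return [{f"{prefix}_{c}": int(v == c) for c in cats} for v in values]

plec_codes, plec_mapping = label_encode([r["plec"] for r in rows])
miasto_dummies = one_hot_encode([r["miasto"] for r in rows], prefix="miasto")

# Combine both encodings into one fully numeric row per client.
encoded = [{"plec": code, **dummies}
           for code, dummies in zip(plec_codes, miasto_dummies)]
```

Label encoding keeps a single column but implies an ordering between categories, whereas one-hot encoding adds one column per category and implies none.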

In [None]:
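from sklearn.model_selection import train_test_split

# Feature engineering: add the `aktywny_klient` flag and encode categorical columns
df = create_aktywny_klient(df)
df = encode_categorical(df)

# Separate the features from the churn target
X = df.drop(columns=['czy_churn'])
y = df['czy_churn']

# A stratified 80/20 split keeps the churn ratio the same in both subsets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Scale the train/test split, and separately the full dataset used for cross-validation
X_train_scaled, X_test_scaled = scale_data(X_train, X_test)
X_scaled = scale_all_data(X)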
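
The sketch below illustrates the two ideas behind the cell above: a stratified split that preserves the class ratio, and standardization whose statistics come from the training rows only. It is a simplified, dependency-free illustration on invented data, not scikit-learn's algorithm and not the project's `scale_data`; the helper names `stratified_split`, `fit_scaler`, and `transform` are hypothetical.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)

def fit_scaler(column):
    """Learn mean and standard deviation from the training values only."""
    mu, sigma = mean(column), pstdev(column)
    return mu, sigma or 1.0  # guard against a constant column

def transform(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Toy data: one numeric feature and an imbalanced binary target (30% positives).
feature = [float(i) for i in range(100)]
target = [1 if i % 10 < 3 else 0 for i in range(100)]

train_idx, test_idx = stratified_split(target)
mu, sigma = fit_scaler([feature[i] for i in train_idx])
train_scaled = transform([feature[i] for i in train_idx], mu, sigma)
# The test rows reuse the training statistics; nothing is refit on them.
test_scaled = transform([feature[i] for i in test_idx], mu, sigma)
```

Reusing the training mean and standard deviation on the test rows keeps information about the held-out data out of the fitted preprocessing.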


## Train models and evaluate on test set

We train five models (`Random Forest`, `Gradient Boosting`, `Decision Tree`, `KNN`, `XGBoost`)  
on the scaled training data and evaluate each one on the test set using Accuracy, Precision, Recall, and F1 Score.

In [None]:
# Train all models and collect their test-set metrics
results_df = train_models(X_train_scaled, X_test_scaled, y_train, y_test)
display(results_df)


               Model  Accuracy  Precision    Recall  F1 Score
0      Random Forest   1.00000   1.000000  1.000000  1.000000
1  Gradient Boosting   1.00000   1.000000  1.000000  1.000000
2      Decision Tree   1.00000   1.000000  1.000000  1.000000
4            XGBoost   0.99715   0.997583  0.997972  0.997777
3                KNN   0.95030   0.965005  0.957176  0.961075
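
For reference, the four metrics in the table can be derived from confusion-matrix counts. The sketch below uses invented labels and assumes the positive class is `1` (for example, `czy_churn == 1`); the helper `classification_metrics` is hypothetical and not part of the project's `modeling` module.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented labels: 4 true positives, 4 true negatives,
# 1 false negative, and 1 false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```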


## Cross-Validation (5-fold)

To check that the results are stable and not an artifact of a single train/test split, we run 5-fold cross-validation  
on the full scaled dataset and compare the mean F1 score and its standard deviation across all models.

In [None]:
# Run 5-fold cross-validation for every model on the full scaled dataset
cv_df = evaluate_with_cv(get_models(), X_scaled, y)
print(cv_df)

               Model  Mean F1 (CV)   Std Dev
0      Random Forest      1.000000  0.000000
1  Gradient Boosting      1.000000  0.000000
2      Decision Tree      1.000000  0.000000
4            XGBoost      0.998659  0.000726
3                KNN      0.958667  0.001186
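
The mechanics behind a table like this can be sketched without any libraries: split the rows into five folds, refit a model on four of them, score it on the held-out fold, and summarize the five scores. Everything below is illustrative and is not the project's `evaluate_with_cv`: the data is invented, the "model" is a simple threshold, the folds are interleaved rather than stratified, and using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def kfold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx); every sample is held out exactly once."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def fit_threshold(xs, ys):
    """Toy model: the midpoint between the two class means."""
    pos = mean(x for x, y in zip(xs, ys) if y == 1)
    neg = mean(x for x, y in zip(xs, ys) if y == 0)
    return (pos + neg) / 2

# Invented one-feature dataset with a clean class boundary at 25.
xs = [float(i) for i in range(50)]
ys = [1 if x >= 25 else 0 for x in xs]

scores = []
for train_idx, test_idx in kfold_indices(len(xs), k=5):
    # The model is refit from scratch on each training fold.
    threshold = fit_threshold([xs[i] for i in train_idx],
                              [ys[i] for i in train_idx])
    preds = [1 if xs[i] >= threshold else 0 for i in test_idx]
    scores.append(f1_score([ys[i] for i in test_idx], preds))

mean_f1, std_f1 = mean(scores), pstdev(scores)
```

A small standard deviation across folds indicates that the score does not depend much on which rows happen to be held out.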
