## **1. Project Overview & Environment Setup**

In [1]:
# Environment Setup & Library Imports

# Data manipulation and numerical operations
import pandas as pd
import numpy as np

# Processing and feature engineering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Classification models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report

# Reproducibility
from numpy.random import default_rng

In this section, we initialize the environment with the necessary tools for data processing, ensemble modeling (Random Forest, Gradient Boosting), and hyperparameter tuning using GridSearchCV.

In [2]:
# Data Acquisition & Exploratory Analysis

# Load the dataset
try:
    df = pd.read_csv('../data/users_behavior.csv')
    print("Dataset loaded successfully. \n")
except FileNotFoundError:
    print("Error: Dataset file not found.")
    raise  # stop here: without the file, `df` is undefined in the cells below

# Displaying the first few records to understand the structure
df.head()

Dataset loaded successfully. 



| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


The dataset consists of user behavior data from Megaline, including calls, messages, and internet usage, with the target variable 'is_ultra' indicating the current tariff plan.

In [3]:
# Inspecting data types and non-null counts
df.info()

# Statistical summary of numerical features
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


**Initial Data Assessment**

* **Data Integrity:** The dataset contains no missing values across any of the columns. This allows us to proceed directly to modeling without the need for imputation.

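* **Feature Characteristics**: User activity varies widely. `calls` ranges from 0 to 244 (std ≈ 33.2), `minutes` from 0 to about 1,632 (std ≈ 234.6), `messages` from 0 to 224 (std ≈ 36.1), and `mb_used` from 0 to about 49,746 (std ≈ 7,571). The features also sit on very different scales.

* **Target Distribution**: The `is_ultra` column is our binary target. Its mean of 0.306 in the summary above shows that about 31% of users are on the "Ultra" plan and about 69% on "Smart", so the classes are moderately imbalanced. Accuracy alone is therefore an incomplete metric, and per-class precision and recall are reported alongside it.

* **Variable Types**: All features are numerical (float64 or int64), which is ideal for machine learning algorithms. Feature scaling is still worthwhile for Logistic Regression, whose regularization penalty and solver convergence depend on the scale of the inputs; the tree-based models are insensitive to it.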
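The class balance can also be checked directly. The sketch below is a minimal, self-contained illustration: because the CSV is not available here, it rebuilds a stand-in target Series from the summary statistics above (3,214 rows with a mean of 0.306472 implies 985 "Ultra" users), which is an assumption made for illustration only. In the notebook itself, the equivalent call would be `df['is_ultra'].value_counts(normalize=True)`.

```python
import pandas as pd

# Stand-in for df['is_ultra'], rebuilt from the describe() output above
# (assumption: 3214 rows * mean 0.306472 rounds to 985 users on the Ultra plan).
n_rows = 3214
n_ultra = round(n_rows * 0.306472)
is_ultra = pd.Series([1] * n_ultra + [0] * (n_rows - n_ultra), name="is_ultra")

# Absolute counts and relative shares per class
class_counts = is_ultra.value_counts()
class_shares = is_ultra.value_counts(normalize=True)

print(class_counts)
print(class_shares.round(3))
```

A split of roughly 69% to 31% is a moderate imbalance: a model that always predicts "Smart" would already be right about 69% of the time, which is the reference point for every accuracy figure that follows.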

## **2. Feature Engineering & Data Preprocessing**

In [4]:
# Defining features (X) and target variable (y)
X = df.drop(columns=['is_ultra'])
y = df['is_ultra']

# Feature Scaling
# Using StandardScaler to normalize numerical features for better model convergence
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Dataset Splitting (60% Train, 20% Validation, 20% Test)
# We use a two-step split and 'stratify' to maintain class balance across all sets
X_temp, X_test, y_temp, y_test = train_test_split(
    X_scaled, y, test_size=0.20, random_state=42, stratify=y
)

# 0.25 of the remaining 80% equals 20% of the original data
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

# Verification of set dimensions
print(f"Training set:   {X_train.shape[0]} samples")
print(f"Validation set: {X_valid.shape[0]} samples")
print(f"Test set:       {X_test.shape[0]} samples")

Training set:   1928 samples
Validation set: 643 samples
Test set:       643 samples


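**Methodology Note**: To ensure a robust evaluation, the data was partitioned into Training (60%), Validation (20%), and Testing (20%) sets.

* **Feature Scaling**: Logistic Regression's regularization penalty and solver convergence depend on the scale of the input features, so StandardScaler was applied to standardize them. Note that the scaler in the cell above is fitted on the full dataset before splitting, so the validation and test rows contribute to the scaling statistics; fitting it on the training set only would avoid this mild form of data leakage.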

* **Stratification**: Given the potential for class imbalance in plan adoption, the stratify parameter was used to ensure that the ratio of 'Smart' vs 'Ultra' users remains consistent across all three subsets, preventing bias during model training and evaluation.
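One way to keep the scaling statistics confined to the training data is to wrap the scaler and the model in a scikit-learn `Pipeline` and split the raw features first. The following is a sketch under stated assumptions, not the notebook's actual workflow: the data is synthetic (`make_classification` stands in for the four usage features), and all names ending in `_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four usage features (assumption, not the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42
)

# Same stratified 60/20/20 two-step split, applied to the unscaled features
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

# The scaler is fitted inside the pipeline, on the training split only
pipe_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])
pipe_demo.fit(X_tr, y_tr)

print(f"Validation accuracy: {pipe_demo.score(X_va, y_va):.3f}")
```

If such a pipeline is passed to `GridSearchCV` as the estimator (with grid keys prefixed by the step name, for example `clf__C`), the scaler is refitted within every cross-validation fold as well.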

## **3. Model Development**

In [5]:
# Model Development & Hyperparameter Tuning

def tune_and_evaluate(model, params, X_train, y_train, X_val, y_val, model_name="Model"):
    """
    Performs GridSearch cross-validation, tunes hyperparameters, 
    and evaluates the model on the validation set.
    """
    print(f"--- Tuning & Evaluating: {model_name} ---")
    
    # 1. Hyperparameter tuning with GridSearchCV
    grid_search = GridSearchCV(
        estimator=model, 
        param_grid=params, 
        cv=5, 
        scoring='accuracy', 
        n_jobs=-1
    )
    grid_search.fit(X_train, y_train)
    
    best_model = grid_search.best_estimator_
    
    # 2. Model Evaluation
    val_predictions = best_model.predict(X_val)
    val_accuracy = accuracy_score(y_val, val_predictions)
    
    # 3. Display Results
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Training CV Accuracy: {grid_search.best_score_:.4f}")
    print(f"Validation Accuracy:  {val_accuracy:.4f}")
    print("\nClassification Report (Validation Set):")
    print(classification_report(y_val, val_predictions))
    
    return best_model

### **3.1. 1st Model: Logistic Regression**

In [None]:
# Hyperparameter grid
log_reg_params = {
    'C': [0.01, 0.1, 1, 10], 
    'class_weight': [None, 'balanced']
}

# Running the master function
best_log_reg = tune_and_evaluate(
    LogisticRegression(max_iter=2000, solver='liblinear'),
    log_reg_params,
    X_train, y_train,
    X_valid, y_valid,
    model_name="Logistic Regression"
)

--- Tuning & Evaluating: Logistic Regression ---
Best Parameters: {'C': 0.01, 'class_weight': None}
Training CV Accuracy: 0.7510
Validation Accuracy:  0.7465

Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.74      0.98      0.84       446
           1       0.83      0.22      0.35       197

    accuracy                           0.75       643
   macro avg       0.78      0.60      0.59       643
weighted avg       0.77      0.75      0.69       643



**Interim Conclusions: Logistic Regression (Baseline)**
The Logistic Regression model was implemented as a performance baseline. Based on the hyperparameter tuning and evaluation results, we can draw the following insights:

* **Model Performance:** The model achieved a Validation Accuracy of 74.65%, which is consistent with the Training CV Accuracy (75.10%). This stability indicates that the model is not overfitting. The gain over the majority-class baseline is modest, however: always predicting "Smart" would already score 69.4% on this validation set (446 of 643 users).

* **Regularization:** The optimal hyperparameter `C: 0.01` is the smallest value in the grid and, since `C` is the inverse of the regularization strength, corresponds to the strongest regularization tested. This implies that the relationship between the features and the target variable is relatively simple or that the model benefits from heavily penalized coefficients.

* **Class Imbalance Sensitivity:** While the overall accuracy is acceptable, the Recall for the positive class (Plan Ultra) is only 22%. This means the model misses about 78% of the actual Ultra users. The high Precision (83%) for this class shows that when the model predicts "Ultra," it is usually correct, but it is far too conservative. Because the grid search optimizes accuracy, it also preferred `class_weight: None` over `'balanced'`, which would have traded some accuracy for recall.

* **Path Forward:** The linear nature of this model limits its ability to capture complex patterns in user behavior. These results justify the transition to Ensemble Methods (Random Forest, Gradient Boosting), which are better suited to handle non-linear relationships and potentially improve the detection of the minority class.
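To put the 74.65% in context, it helps to compare it with a model that ignores the features entirely. The sketch below is a minimal illustration using scikit-learn's `DummyClassifier`; the label vector is rebuilt from the support column of the classification report above (446 "Smart", 197 "Ultra"), and the names `y_val_demo` and `X_placeholder` are hypothetical. In the notebook, the dummy model would normally be fitted on `y_train` and scored on `y_valid`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Validation labels rebuilt from the support column of the report above
# (446 "Smart" users, 197 "Ultra" users); row order does not matter here.
y_val_demo = np.array([0] * 446 + [1] * 197)
X_placeholder = np.zeros((len(y_val_demo), 1))  # features are ignored by the dummy model

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_val_demo)
baseline_acc = accuracy_score(y_val_demo, baseline.predict(X_placeholder))

print(f"Majority-class accuracy: {baseline_acc:.4f}")
```

Any model in this notebook should be judged by how far it rises above this baseline, not by how far it rises above 50%.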

### **3.2. 2nd Model: Random Forest**

In [18]:
# Hyperparameter grid
rf_params = {
    'n_estimators': [200, 500], 
    'max_depth': [None, 10, 20], 
    'min_samples_leaf': [1, 2],
    'random_state': [42] # Ensuring reproducibility
}

# Running the master function
best_rf = tune_and_evaluate(
    RandomForestClassifier(),
    rf_params,
    X_train, y_train,
    X_valid, y_valid,
    model_name="Random Forest"
)

--- Tuning & Evaluating: Random Forest ---
Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'n_estimators': 200, 'random_state': 42}
Training CV Accuracy: 0.8154
Validation Accuracy:  0.7963

Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.82      0.91      0.86       446
           1       0.73      0.53      0.62       197

    accuracy                           0.80       643
   macro avg       0.77      0.72      0.74       643
weighted avg       0.79      0.80      0.79       643



**Interim Conclusions: Random Forest Classifier**
The transition to an ensemble-based approach yielded a significant performance boost. Key observations from the Random Forest model include:

* **Significant Predictive Gain:** The model reached a Validation Accuracy of 79.63%, a substantial improvement over the Logistic Regression baseline. This confirms that user behavior (calls, messages, and data usage) follows non-linear patterns that decision trees capture more effectively.

* **Drastic Improvement in Recall:** The Recall for the 'Ultra' plan (Class 1) jumped from 22% to 53%. While there is still room for improvement, the model is now identifying more than half of the target users, making it much more viable for a targeted marketing campaign.

* **Precision-Recall Trade-off:** Precision for the positive class is 73%, down from 83% for Logistic Regression, in exchange for the much higher recall. Roughly one in four users flagged for the "Ultra" plan is a false alarm, which is still a reasonable level of confidence for plan recommendations.

* **Hyperparameter Insights:** The optimal `max_depth: 10` and `min_samples_leaf: 2` indicate that a moderately constrained tree structure is ideal for this dataset, providing enough complexity to learn but enough restriction to avoid overfitting, as evidenced by the narrow gap between Training (81.54%) and Validation (79.63%) scores.
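If recall on the "Ultra" class matters more than precision, one option that requires no retraining is to lower the decision threshold applied to `predict_proba`. The sketch below illustrates the idea on synthetic data; it is an assumption-laden example rather than part of the original analysis, and the names `rf_demo`, `X_demo`, and `y_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the usage data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Same hyperparameters as the tuned forest above
rf_demo = RandomForestClassifier(
    n_estimators=200, max_depth=10, min_samples_leaf=2, random_state=42
)
rf_demo.fit(X_tr, y_tr)

# Estimated probability of the positive class for each validation row
proba_pos = rf_demo.predict_proba(X_va)[:, 1]

# Lower thresholds flag more rows as positive, so recall can only go up
recalls = {}
for threshold in (0.5, 0.4, 0.3):
    pred = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_va, pred)
    precision = precision_score(y_va, pred, zero_division=0)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recalls[threshold]:.2f}")
```

The threshold should be chosen on the validation set, never on the test set, so that the final test score remains an unbiased estimate.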

### **3.3. 3rd Model: Decision Tree**

#### **3.3.1. Hyperparameter Tuning**

In [19]:
# --- Application: Decision Tree Classifier ---

# Define the hyperparameter grid for Decision Tree
dt_params = {
    'max_depth': [None, 5, 10, 20],
    'min_samples_leaf': [1, 2],
    'min_samples_split': [2, 5, 10],
    'random_state': [42] # Consistency with previous experiments
}

# Running the master function
best_dt = tune_and_evaluate(
    DecisionTreeClassifier(),
    dt_params,
    X_train, y_train,
    X_valid, y_valid,
    model_name="Decision Tree"
)

--- Tuning & Evaluating: Decision Tree ---
Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 10, 'random_state': 42}
Training CV Accuracy: 0.7899
Validation Accuracy:  0.7932

Classification Report (Validation Set):
              precision    recall  f1-score   support

           0       0.80      0.94      0.86       446
           1       0.78      0.45      0.57       197

    accuracy                           0.79       643
   macro avg       0.79      0.70      0.72       643
weighted avg       0.79      0.79      0.77       643



**Interim Conclusions: Decision Tree**

* **Model Performance:** The Decision Tree reaches a Validation Accuracy of 79.32%, in line with its Training CV Accuracy (78.99%), so its performance is stable between cross-validation and validation.

* **Overfitting Control:** The limited depth (`max_depth: 5`) and `min_samples_split: 10` help to control overfitting.

* **Class Behavior:** The model shows high precision and recall for class 0 (0.80 and 0.94), but it struggles to capture class 1 (recall = 0.45).

* **Comparison:** It clearly improves on Logistic Regression (74.65%), but it sits slightly below the Random Forest in validation accuracy (79.63%) and well below it in class 1 recall (45% vs. 53%).

* **Role:** It works as an interpretable model, but not as the best final option in predictive terms.
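The interpretability mentioned above can be made concrete by printing a tree's decision rules with `sklearn.tree.export_text`. The following is a minimal sketch on synthetic data: the feature names are borrowed from the dataset, but the values (and therefore the printed thresholds) are artificial, and `tree_demo` is a hypothetical name. In the notebook, the same call could be applied to `best_dt`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Feature names from the dataset; the data itself is synthetic (assumption)
feature_names = ["calls", "minutes", "messages", "mb_used"]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=42
)

# A shallow tree keeps the printed rules short and readable
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_demo.fit(X_demo, y_demo)

# Text rendering of the learned if/else rules
rules = export_text(tree_demo, feature_names=feature_names)
print(rules)
```

Each printed branch is a threshold test on one feature, which is what makes a single tree easy to explain to non-technical stakeholders.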

### **3.4. 4th Model: Gradient Boosting**

#### **3.4.1. Hyperparameter Tuning**

In [13]:
# Base model
gb = GradientBoostingClassifier()

# Hyperparameter grid
gb_params = {
    'n_estimators': [300],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 5]
}

# Grid search with 5-fold cross-validation
gb_grid = GridSearchCV(
    gb,
    gb_params,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Fit the grid search on the training set
gb_grid.fit(X_train, y_train)

# Best estimator, refitted on the whole training set
best_gb = gb_grid.best_estimator_

# Cross-validation results
print('Best hyperparameters:', gb_grid.best_params_)
print('CV accuracy:', gb_grid.best_score_)

Best hyperparameters: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
CV accuracy: 0.8003176098512886


#### **3.4.2. Evaluation**

In [14]:
# Accuracy on the VALIDATION set
val_acc_gb = accuracy_score(y_valid, best_gb.predict(X_valid))
print(f'Validation accuracy: {val_acc_gb:.3f}')

Validation accuracy: 0.792


In [15]:
# Accuracy and per-class metrics on the TEST set
test_pred_gb = best_gb.predict(X_test)
test_acc_gb = accuracy_score(y_test, test_pred_gb)

print(f'Test accuracy: {test_acc_gb:.3f}')
print(classification_report(y_test, test_pred_gb))

Test accuracy: 0.798
              precision    recall  f1-score   support

           0       0.81      0.92      0.86       446
           1       0.75      0.51      0.61       197

    accuracy                           0.80       643
   macro avg       0.78      0.72      0.74       643
weighted avg       0.79      0.80      0.79       643



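**Interim Conclusions: Gradient Boosting**

* **Model Performance:** Gradient Boosting reaches an accuracy of 0.798 on the test set, consistent with its CV accuracy (0.800) and validation accuracy (0.792).

* **Hyperparameter Insights:** The grid search selected the lower learning rate (0.05) together with the deeper trees (`max_depth = 5`). The close agreement between the CV, validation, and test scores indicates that this configuration generalizes well.

* **Class Behavior:** Performance on class 0 is very good (test recall of 0.92), and the class 1 recall of 0.51 on the test set is a clear improvement over the validation recall of Logistic Regression (0.22) and the single Decision Tree (0.45).

* **Comparison:** Its performance is similar to the Random Forest, although slightly lower in validation accuracy (0.792 vs. 0.796) and in minority-class recall (0.51 on test vs. 0.53 for the Random Forest on validation).

* **Role:** It is a competitive and robust model, although not the best overall in this experiment.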
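Because the grid above fixes `n_estimators` at 300, it can be informative to see how accuracy evolves as trees are added. `GradientBoostingClassifier.staged_predict` yields the predictions after each boosting stage, which allows this without refitting. The sketch below runs on synthetic data and is only an illustration of the technique; `gb_demo` and the other `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the usage data (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# Same hyperparameters as the tuned model above, with a fixed seed
gb_demo = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.05, max_depth=5, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Validation accuracy after each boosting stage (1 tree, 2 trees, ...)
staged_acc = [accuracy_score(y_va, pred) for pred in gb_demo.staged_predict(X_va)]
best_n = int(np.argmax(staged_acc)) + 1

print(f"Best validation accuracy {max(staged_acc):.3f} reached with {best_n} trees")
print(f"Validation accuracy with all 300 trees: {staged_acc[-1]:.3f}")
```

If the best stage comes well before the last one, a smaller `n_estimators` (or a wider grid) may give the same accuracy at lower cost.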

## **4. Model Comparison**

In [16]:
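# Tuned estimators by name
models = {'Logistic Regression': best_log_reg, 'Random Forest': best_rf,
          'Decision Tree': best_dt, 'Gradient Boosting': best_gb}

# Accuracy of each tuned model on the validation and test sets
results = pd.DataFrame({
    'Modelo': list(models),
    'Validación': [accuracy_score(y_valid, m.predict(X_valid)) for m in models.values()],
    'Test': [accuracy_score(y_test, m.predict(X_test)) for m in models.values()]
})

results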

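**Interim Conclusions: Model Comparison**

* **Random Forest is the best model**: it obtains the highest accuracy on validation (79.63%) and on test, showing good generalization.

* **Gradient Boosting** is very close, with solid and stable results (79.2% on validation, 79.8% on test), although slightly below the Random Forest.

* **Decision Tree** clearly improves on Logistic Regression (79.32% vs. 74.65% on validation). Its accuracy is on par with the ensembles, but its recall on the minority class is lower (45% vs. 53% for the Random Forest).

* **Logistic Regression** works as a good baseline, but its performance is the lowest of the four.

* Overall, **the tree-based models outperform Logistic Regression** by about five percentage points of validation accuracy, which suggests that they capture non-linear relationships better. Among the tree-based models, validation accuracies differ by less than half a percentage point, and the Random Forest stands out mainly through its higher recall on the minority class.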

## **5. Sanity Check of the Best Model**

In [None]:
# Select the model with the best test accuracy
best_row = results.loc[results['Test'].idxmax()]
best_model_name = best_row['Modelo']

# Map each model name to its trained estimator
best_models_dict = {
    'Logistic Regression': best_log_reg,
    'Random Forest': best_rf,
    'Decision Tree': best_dt,
    'Gradient Boosting': best_gb
}

# Retrieve the best model
best_model = best_models_dict[best_model_name]

print(f'Best model by test accuracy: {best_model_name}')


Best model by test accuracy: Random Forest


In [None]:
# Sanity check
# Reproducible random generator
rng = default_rng(42)

# Permute the row indices of the test set
perm = rng.permutation(len(X_test))

# Predictions on the shuffled rows
predictions_permuted = best_model.predict(X_test[perm])

# Accuracy against the labels in their original order
sanity_acc = accuracy_score(y_test, predictions_permuted)

# Sanity check result
print(
    f'Best model: {best_model_name} | '
    f'Permuted accuracy: {sanity_acc:.3f}'
)


Best model: Random Forest | Permuted accuracy: 0.593


**Interim Conclusions: Sanity Check**

* After permuting the rows of the test set, accuracy drops from about 0.80 to 0.593. This is below the majority-class baseline of 0.694 (446 of 643 test users) and at the level expected when predictions and labels are unrelated.

* This indicates that the model depends on the real relationship between features and labels, and not on spurious patterns.

* The Random Forest does not rely on the order of the rows; it learns structure in the features.

* The sanity check supports the view that the earlier performance is genuine.

* The model clearly beats chance only when the features and labels are correctly aligned, as expected.
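With imbalanced classes, the chance level for a permutation test is not 0.5. If predictions are independent of the labels, the expected accuracy is $p \cdot q + (1 - p)(1 - q)$, where $p$ is the share of positive labels and $q$ the share of positive predictions. The sketch below checks this formula by simulation. The label counts come from the test-set support (197 of 643), while the number of rows flagged as "Ultra" (`n_flagged = 129`) is a purely hypothetical value chosen for illustration.

```python
import numpy as np

def chance_agreement(p_true: float, p_pred: float) -> float:
    """Expected accuracy when predictions are independent of the labels."""
    return p_true * p_pred + (1 - p_true) * (1 - p_pred)

n_test, n_ultra = 643, 197   # test-set size and Ultra users, from the support column
n_flagged = 129              # hypothetical number of rows predicted as Ultra (assumption)

labels = np.array([1] * n_ultra + [0] * (n_test - n_ultra))
preds = np.array([1] * n_flagged + [0] * (n_test - n_flagged))

# Analytic chance level for these class shares
expected = chance_agreement(n_ultra / n_test, n_flagged / n_test)

# Simulation: shuffle the predictions against the labels many times
rng = np.random.default_rng(42)
simulated = np.mean([np.mean(labels == rng.permutation(preds)) for _ in range(2000)])

print(f"Expected chance-level accuracy: {expected:.3f}")
print(f"Simulated permuted accuracy:    {simulated:.3f}")
```

A permuted accuracy close to this chance level, and clearly below the unpermuted score, is what a model that truly uses the features should produce.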

## **6. Final Conclusion**

**Four classification models** (Logistic Regression, Decision Tree, Random Forest, and Gradient Boosting) were evaluated to predict the target variable `is_ultra`, using a methodology that combines cross-validation, a validation set, and an independent test set.

* The **Random Forest** was the model with the best overall performance: it was selected as the best model on test accuracy, reached a Validation Accuracy of 79.63% (Training CV Accuracy of 81.54%), and showed the most solid balance between precision and recall across both classes.

* The **tree-based** models clearly outperformed **Logistic Regression**, which suggests the presence of **non-linear relationships** in the data.

* The **sanity check** confirmed that the performance of the best model **is not a product of chance**, supporting the **real predictive capacity** of the approach.

* Based on these results, the **Random Forest** is selected as the **recommended final model**, combining **good performance, stability, and generalization.**

The Random Forest model outperformed the other models with an accuracy of about 80% (about 82% in cross-validation), identifying roughly half of the 'Ultra' plan users (53% recall at 73% precision on the validation set). This model could be integrated into the CRM to automate personalized plan recommendations, potentially increasing the conversion rate by identifying high-usage users more accurately than manual segmentation.