# **Waze User Churn: Tree‑Based Machine Learning**

This notebook builds and evaluates two tree‑based models—Random Forest and XGBoost—to predict Waze user churn based on behavioral and usage features. The focus is on model performance (with recall as the primary metric), feature importance, and how these models complement the earlier logistic regression baseline.

The workflow includes:
- Feature engineering and encoding for tree‑based models  
- Model training with cross‑validated hyperparameter tuning  
- Model selection using a validation set and final evaluation on a held‑out test set  
- Interpretation of model performance and feature importance in a business context  

### **Tree‑based modeling overview**

Decision tree–based models split the feature space by asking a sequence of if/else questions on features (for example, “is `drives` > threshold?”) and assign churn probabilities to each resulting region. Random Forest builds many such trees on bootstrapped samples and averages their predictions to reduce variance, while XGBoost builds trees sequentially to correct previous errors, often achieving stronger performance at the cost of greater complexity and reduced interpretability.
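
As a minimal sketch of the if/else idea, the snippet below fits a depth‑1 tree on a few made‑up rows (the values are hypothetical, not taken from the Waze data) and prints the single rule it learns, along with the churn probability it assigns on each side of the split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): monthly drives and a churn flag
drives = np.array([[1], [2], [3], [10], [11], [12]])
churned = np.array([1, 1, 1, 0, 0, 0])

# A depth-1 tree asks exactly one if/else question
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(drives, churned)

# Print the learned rule as nested if/else text
print(export_text(stump, feature_names=['drives']))

# Churn probability assigned to each side of the split
low_usage_proba = stump.predict_proba([[2]])[0, 1]
high_usage_proba = stump.predict_proba([[11]])[0, 1]
```

Deeper trees repeat this step inside each region, and the ensembles discussed here combine many such trees.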


## **Imports and data loading**

Core libraries for data manipulation, visualization, and model training are imported, and the Waze churn dataset is loaded into a working DataFrame.

In [None]:
# Import packages for data manipulation
import numpy as np
import pandas as pd

# Import packages for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display all columns in DataFrame outputs
pd.set_option('display.max_columns', None)

# Import packages for model selection and evaluation
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay,
                             RocCurveDisplay, PrecisionRecallDisplay)

# Tree-based classifiers
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from xgboost import plot_importance

# Model persistence
import pickle

In [None]:
# Import dataset
df0 = pd.read_csv('waze_dataset.csv')

In [None]:
# Inspect the first five rows
df0.head()

The raw Waze dataset is first loaded into `df0` for inspection, then copied into `df` for feature engineering so the original data remains unchanged.

## **Feature engineering**

Previous EDA and statistical modeling identified usage‑based predictors and highlighted derived features related to intensity and recency. This section engineers those features for tree‑based models, then handles missing labels and encodes categorical variables.


In [None]:
# Work on a copy of the original data
df = df0.copy()

In [None]:
df.info()

#### **`km_per_driving_day`**

Average kilometers driven per driving day captures driving intensity for users who drove at least once in the month and helps distinguish casual drivers from very heavy users.

In [None]:
# Mean kilometers per driving day
df['km_per_driving_day'] = df['driven_km_drives']/df['driving_days']

# Inspect distribution
df['km_per_driving_day'].describe()

In [None]:
# Replace infinite values (from 0 driving days) with 0 to keep statistics valid
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day']=0
df['km_per_driving_day'].describe()
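
Ratio features can produce two different kinds of non‑finite values, and the equality check against `np.inf` above only catches one of them. The sketch below uses three made‑up rows (not the Waze data) to show the difference, and one possible way to recode both signs of infinity while leaving `NaN` visible for a separate decision.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values): regular driver, km but zero days, no activity
toy = pd.DataFrame({'driven_km_drives': [120.0, 45.0, 0.0],
                    'driving_days': [4, 0, 0]})

ratio = toy['driven_km_drives'] / toy['driving_days']
# x / 0 gives inf, while 0 / 0 gives NaN

# Recode +/-inf to 0 in one step; NaN is left for an explicit decision
cleaned = ratio.replace([np.inf, -np.inf], 0)
n_missing = int(cleaned.isna().sum())
print(cleaned.tolist(), n_missing)
```

Whether 0/0 rows should also become 0 is a modeling choice; checking `isna().sum()` after each ratio feature makes that choice explicit.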

#### **`percent_sessions_in_last_month`**
`percent_sessions_in_last_month` represents the fraction of a user’s lifetime sessions that occurred in the last observed month, highlighting recent engagement intensity and behavior changes leading up to churn.

In [None]:
# Share of lifetime sessions that occurred in the last month
df['percent_sessions_in_last_month'] = df['sessions']/df['total_sessions']

# Inspect distribution
df['percent_sessions_in_last_month'].describe()

#### **`professional_driver`**
A binary `professional_driver` flag is used to approximate heavy, likely professional users (≥60 drives and ≥15 driving days in the last month), based on domain‑informed thresholds and earlier EDA showing lower churn among high‑activity users.

In [None]:
# Flag high‑intensity "professional" drivers
df['professional_driver'] = np.where((df['drives']>=60) & (df['driving_days']>=15), 1, 0)

#### **`total_sesions_per_day`**
`total_sesions_per_day` (spelling kept as used in the code) estimates the average number of sessions per day since onboarding, capturing long‑term engagement intensity over a user’s lifetime.

In [None]:
# Average sessions per day since onboarding
df['total_sesions_per_day'] = df['total_sessions']/df['n_days_after_onboarding']
df['total_sesions_per_day'].describe()

#### **`km_per_hour`**
`km_per_hour` summarizes average driving speed over the month, combining distance and time into a single efficiency metric that may reflect different driving contexts (for example, highway vs. urban driving).

In [None]:
# Average kilometers per hour across all drives in the last month
df['km_per_hour'] = df['driven_km_drives']/(df['duration_minutes_drives']/60)
df['km_per_hour'].describe()

#### **`km_per_drive`**
`km_per_drive` represents average distance per trip, providing another view on driving intensity per drive. As with other ratio features, division by zero for users with no drives produces infinite values, which are recoded to 0 for stability.

In [None]:
# Average kilometers per drive in the last month
df['km_per_drive'] = df['driven_km_drives']/df['drives']
df['km_per_drive'].describe()

In [None]:
# Replace infinite values (from 0 drives) with 0
df.loc[df['km_per_drive']==np.inf, 'km_per_drive'] = 0
df['km_per_drive'].describe()

#### **`percent_of_sessions_to_favorite`**
`percent_of_sessions_to_favorite` approximates the share of sessions spent navigating to saved favorite locations, serving as a proxy for how often users travel to familiar vs. new places, which may relate to exploration and dependence on navigation.

In [None]:
# Share of sessions used to navigate to favorite places
df['percent_of_sessions_to_favorite'] = (df['total_navigations_fav1']+df['total_navigations_fav2'])/df['total_sessions']

# Inspect distribution
df['percent_of_sessions_to_favorite'].describe()

### **Handle missing churn labels**

As in earlier notebooks, rows with missing churn labels (`label`) are dropped because they represent <5% of observations and show no clear non‑random pattern.

In [None]:
# Drop rows with missing churn labels
df = df.dropna(subset=['label'])

### **Outliers and tree‑based models**

Many usage variables contain outliers, but tree‑based models are generally robust to extreme values, so no additional outlier imputation is applied for this stage of modeling.

### **Variable encoding**

#### **Device**

The `device` feature (Android vs. iPhone) is encoded as a binary numeric predictor `device2` to be usable in scikit‑learn models.

In [None]:
# Binary‑encode device: Android = 0, iPhone = 1
df['device2'] = np.where(df['device']=='Android',0,1)
df[['device','device2']].tail()

#### **Target variable**

The churn label is encoded as `label2`, where 0 represents retained users and 1 represents churned users, preserving the original string labels for reference.

In [None]:
# Create binary `label2` column
df['label2'] = np.where(df['label']=='retained',0,1)
df[['label','label2']].tail()

### **Feature set definition**

Tree‑based models can handle multicollinearity, so only the non‑informative identifier `ID` is dropped. The original `device` column is also excluded later in favor of the encoded `device2` feature.

In [None]:
# Drop non‑informative ID column
df = df.drop(['ID'], axis=1)

### **Evaluation metric and class balance**

The churn label is moderately imbalanced: about 18% of users churn and 82% are retained, which is manageable without explicit resampling. Because the main business risk is missing at‑risk users rather than incorrectly flagging some retained users, recall is used as the primary evaluation metric for model selection.


In [None]:
# Get class balance of 'label' column
df['label'].value_counts(normalize=True)
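
To make the case for recall concrete, the sketch below uses a made‑up sample of 100 users with the roughly 18% churn rate described above and scores a trivial "model" that predicts every user is retained. The labels are illustrative, not drawn from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical sample of 100 users with the ~18% churn rate noted above
y_true = np.array([1] * 18 + [0] * 82)

# A trivial "model" that predicts every user is retained
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)
print(f'accuracy={baseline_accuracy:.2f}, recall={baseline_recall:.2f}')
```

Accuracy looks respectable even though no churner is found, which is why accuracy alone is a weak selection criterion for this problem.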

## **Modeling workflow**

The final modeling dataset contains 14,299 samples, enough to support a standard train/validation/test workflow for model training and selection. The process is:

1. Split the data into train/validation/test sets (60/20/20).  
2. Fit models and tune hyperparameters on the training set using cross‑validation.  
3. Select a champion model based on validation recall.  
4. Evaluate the champion model on the held‑out test set to estimate performance on new data.  



![](https://raw.githubusercontent.com/adacert/tiktok/main/optimal_model_flow_numbered.svg)

### **Train/validation/test split**

Features and target are defined and then split into 60/20/20 train, validation, and test sets using stratified sampling to preserve class balance in each partition.

In [None]:
# Feature matrix (exclude label columns and original device)
X = df.drop(columns=['label', 'label2','device'])

# Binary target vector
y = df['label2']

# Initial 80/20 split into interim train and test sets
X_tr, X_test, y_tr, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

# Split interim train into final train (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_tr,y_tr, stratify=y_tr, test_size=0.25, random_state=42)

# Verify partition sizes
for x in [X_train, X_val, X_test]:
    print(len(x))
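
The second split uses `test_size=0.25` because 25% of the remaining 80% equals 20% of the full data. The sketch below checks that arithmetic, and the effect of `stratify`, on synthetic data; the `*_demo` names and the 1,000‑row size are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with an 18% positive rate
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 180 + [0] * 820)

# Stage 1: hold out 20% as the test set
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.2, random_state=42)

# Stage 2: 25% of the remaining 80% equals 20% of the full data
X_trn, X_va, y_trn, y_va = train_test_split(
    X_rest, y_rest, stratify=y_rest, test_size=0.25, random_state=42)

# Each partition should keep roughly the same positive rate
for name, part in [('train', y_trn), ('val', y_va), ('test', y_te)]:
    print(name, len(part), round(part.mean(), 3))
```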

### **Random forest**

A Random Forest classifier is tuned with a small hyperparameter grid using cross‑validated GridSearchCV, optimizing for recall while also tracking precision, F1, and accuracy. Random Forest builds an ensemble of decision trees on bootstrapped samples and averages their predictions, which reduces variance and typically improves generalization compared with a single decision tree.

In [None]:
# Instantiate the random forest classifier
rf = RandomForestClassifier(random_state=42)

# Hyperparameters to evaluate in cross‑validation
cv_params = {'max_depth':[None],
            'max_features':[1.0],
            'max_samples':[1.0],
            'min_samples_leaf': [2],
            'min_samples_split': [2],
            'n_estimators':[300]}

# Metrics to record during grid search
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Cross‑validated grid search, refitting on the model with best recall
rf_cv = GridSearchCV(rf, cv_params, scoring=scoring, cv=5, refit='recall')

The model is fit on the training data and the best cross‑validated recall score and hyperparameters are inspected.

In [None]:
%%time
rf_cv.fit(X_train, y_train)

In [None]:
# Examine best score
rf_cv.best_score_

In [None]:
# Examine best hyperparameter combo
rf_cv.best_params_
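
When several scorers are passed, `GridSearchCV` stores one `mean_test_<metric>` entry per scorer in `cv_results_`, and `refit='recall'` determines which row counts as best. The sketch below illustrates this on synthetic data with a small decision tree standing in for the random forest to keep it fast; the data and grid are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=400, n_features=6,
                                     weights=[0.82, 0.18], random_state=42)

search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      {'max_depth': [2, 4]},
                      scoring=['accuracy', 'precision', 'recall', 'f1'],
                      cv=5, refit='recall')
search.fit(X_demo, y_demo)

# One mean_test_<metric> entry per scorer
metric_cols = [k for k in search.cv_results_ if k.startswith('mean_test_')]

# best_index_ points at the row that won on the refit metric
best_recall = search.cv_results_['mean_test_recall'][search.best_index_]
print(sorted(metric_cols), best_recall)
```

In other words, `best_score_` is the mean cross‑validated recall of the winning parameter combination, not its accuracy or F1.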

The `make_results()` helper function summarizes cross‑validated accuracy, precision, recall, and F1 for the best model according to a chosen metric.

In [None]:
def make_results(model_name:str, model_object, metric:str):
    '''
    Arguments:
        model_name (string): what you want the model to be called in the output table
        model_object: a fit GridSearchCV object
        metric (string): precision, recall, f1, or accuracy

    Returns a pandas df with the F1, recall, precision, and accuracy scores
    for the model with the best mean 'metric' score across all validation folds.
    '''

    # Create dictionary that maps input metric to actual metric name in GridSearchCV
    metric_dict = {'precision': 'mean_test_precision',
                   'recall': 'mean_test_recall',
                   'f1': 'mean_test_f1',
                   'accuracy': 'mean_test_accuracy',
                   }

    # Get all the results from the CV and put them in a df
    cv_results = pd.DataFrame(model_object.cv_results_)

    # Isolate the row of the df with the max(metric) score
    best_estimator_results = cv_results.loc[cv_results[metric_dict[metric]].idxmax(), :]

    # Extract accuracy, precision, recall, and f1 score from that row
    f1 = best_estimator_results.mean_test_f1
    recall = best_estimator_results.mean_test_recall
    precision = best_estimator_results.mean_test_precision
    accuracy = best_estimator_results.mean_test_accuracy

    # Create table of results
    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy],
                          },
                         )

    return table

In [None]:
results = make_results('RF cv', rf_cv, 'recall')
results

Accuracy aside, recall and F1 are modest, but recall is already markedly higher than in the earlier logistic regression baseline while accuracy remains similar. Additional hyperparameter tuning could yield small gains, but the current configuration provides a reasonable benchmark.

### **XGBoost**

An XGBoost classifier is then tuned over a slightly richer hyperparameter grid, again optimizing for recall while tracking other metrics. XGBoost builds trees sequentially, where each new tree is trained to correct the residual errors of the previous ensemble, enabling the model to capture complex nonlinear relationships and interactions in the data.
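
The idea of each tree correcting the errors of the ensemble so far can be sketched by hand. The example below is a simplified illustration using squared error on synthetic regression data with scikit‑learn trees; it is not XGBoost's actual algorithm, which optimizes a regularized objective (log loss for `binary:logistic`) using gradient and second‑order information.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-feature regression problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant guess
mse_start = np.mean((y_demo - prediction) ** 2)

for _ in range(50):
    residuals = y_demo - prediction           # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)               # next tree targets those errors
    prediction += learning_rate * tree.predict(X_demo)

mse_end = np.mean((y_demo - prediction) ** 2)
print(f'training MSE: {mse_start:.3f} -> {mse_end:.3f}')
```

The `learning_rate` and `n_estimators` values in the grid below play the same roles as the step size and loop count in this sketch.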

In [None]:
# Instantiate the XGBoost classifier for binary classification
xgb = XGBClassifier(objective='binary:logistic', random_state=42)

# Hyperparameters to evaluate in cross‑validation
cv_params = {'max_depth': [6,12],
            'min_child_weight': [3,5],
            'learning_rate': [0.01,0.1],
            'n_estimators': [300]}

# Metrics to record during grid search
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Cross‑validated grid search, refitting on the model with best recall
xgb_cv = GridSearchCV(xgb, cv_params, scoring=scoring, cv=5, refit='recall')

In [None]:
%%time
xgb_cv.fit(X_train, y_train)

In [None]:
# Examine best score
xgb_cv.best_score_

In [None]:
# Examine best parameters
xgb_cv.best_params_

In [None]:
# Call 'make_results()' on the GridSearch object
xgb_cv_results = make_results('XGB cv', xgb_cv, 'recall')
results = pd.concat([results, xgb_cv_results], axis=0)
results

The tuned XGBoost model achieves higher recall than both the logistic regression baseline and the random forest, while maintaining similar accuracy and precision.

## **Model selection on validation data**

The best random forest and XGBoost models from cross‑validation are evaluated on the validation set, and the model with higher recall is selected as the champion.

#### **Random forest**

In [None]:
# Use random forest model to predict on validation data
rf_val_preds = rf_cv.best_estimator_.predict(X_val)

In [None]:
def get_test_scores(model_name:str, preds, y_test_data):
    '''
    Generate a table of test scores.

    In:
        model_name (string): Your choice: how the model will be named in the output table
        preds: numpy array of test predictions
        y_test_data: numpy array of y_test data

    Out:
        table: a pandas df of precision, recall, f1, and accuracy scores for your model
    '''
    accuracy = accuracy_score(y_test_data, preds)
    precision = precision_score(y_test_data, preds)
    recall = recall_score(y_test_data, preds)
    f1 = f1_score(y_test_data, preds)

    table = pd.DataFrame({'model': [model_name],
                          'precision': [precision],
                          'recall': [recall],
                          'F1': [f1],
                          'accuracy': [accuracy]
                          })

    return table

In [None]:
# Get validation scores for RF model
rf_val_scores = get_test_scores('RF val', rf_val_preds, y_val)

# Append to the results table
results = pd.concat([results, rf_val_scores], axis=0)
results

Validation scores for the random forest drop slightly relative to cross‑validation results across all metrics, which is expected and suggests limited overfitting.

#### **XGBoost**



In [None]:
# Use XGBoost model to predict on validation data
xgb_val_preds = xgb_cv.best_estimator_.predict(X_val)

# Get validation scores for XGBoost model
xgb_val_scores = get_test_scores('XGB val', xgb_val_preds, y_val)

# Append to the results table
results = pd.concat([results, xgb_val_scores], axis=0)
results

The XGBoost model shows a similar small drop from cross‑validation to validation scores but still outperforms the random forest on recall, confirming it as the champion model.

## **Champion model evaluation on test data**

The champion XGBoost model is applied to the held‑out test set to estimate performance on new, unseen users.

In [None]:
# Use XGBoost model to predict on test data
xgb_test_preds = xgb_cv.best_estimator_.predict(X_test)

# Get test scores for XGBoost model
xgb_test_scores = get_test_scores('XGB test', xgb_test_preds, y_test)

# Append to the results table
results = pd.concat([results,xgb_test_scores], axis=0)
results

On the held‑out test set, recall matches the validation result, while precision declines somewhat, leading to small drops in F1 and accuracy. This gap is within a reasonable range for generalization error and suggests that the validation scores are a stable estimate of the XGBoost model's performance on new data.

### **Confusion matrix**

A confusion matrix summarizes the champion model’s churn predictions on the test set and highlights the trade‑off between correctly identifying churners and misclassifying retained users.

In [None]:
# Generate array of values for confusion matrix
cm = confusion_matrix(y_test, xgb_test_preds, labels=xgb_cv.classes_)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix = cm, display_labels=['retained','churned'])
disp.plot();

The confusion matrix shows roughly three times as many false negatives as false positives, and the model correctly identifies about 16–17% of actual churners. This confirms that, even with improved recall over logistic regression, many at‑risk users remain undetected at the default decision threshold.
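
Because `predict()` applies a default cutoff of 0.5, one way to trade precision for recall is to threshold the predicted probabilities directly. The sketch below shows the pattern on synthetic data with a random forest as a stand‑in; applying it to the champion model would presumably use `xgb_cv.best_estimator_.predict_proba(X_val)`, which is an assumption about how the notebook could be extended rather than something it does.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     weights=[0.82, 0.18], random_state=42)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_fit, y_fit)
churn_proba = model.predict_proba(X_hold)[:, 1]  # probability of the positive class

# Lower thresholds flag more users as churners, so recall cannot decrease
recall_by_threshold = {}
for threshold in [0.5, 0.4, 0.3, 0.2]:
    preds = (churn_proba >= threshold).astype(int)
    recall_by_threshold[threshold] = recall_score(y_hold, preds)

print(recall_by_threshold)
```

Any threshold should be chosen on validation data rather than the test set, so that the final test estimate stays untouched.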


### **Feature importance**

Feature importance from the XGBoost model indicates which predictors contribute most strongly to churn predictions.

In [None]:
plot_importance(xgb_cv.best_estimator_);

The XGBoost model distributes importance across a broader set of predictors than the earlier logistic regression, which relied heavily on `activity_days`. Engineered features account for a majority of the top‑ranked predictors, underscoring how feature engineering can substantially improve model performance.

Differences in feature importance between models reflect complex interactions among predictors: a feature that appears weak in one model may still carry signal in combination with others in a more flexible algorithm. This reinforces the value of testing multiple model families before discarding features as uninformative.
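
For comparing models side by side, importances are often easier to read as a sorted table than as a plot. The sketch below uses a random forest on synthetic data as a stand‑in, with hypothetical feature names; fitted scikit‑learn tree ensembles expose `feature_importances_`, and the same pattern should carry over to other estimators that provide that attribute.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [f'feature_{i}' for i in range(6)]  # hypothetical names
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=42)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, normalized to sum to 1, sorted high to low
importances = (pd.Series(model.feature_importances_, index=feature_names)
               .sort_values(ascending=False))
print(importances.round(3))
```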


## **Model insights and recommendations**

Among the models compared, XGBoost achieved the highest recall on validation data and held that recall on the test set; random forest followed on validation recall, and the earlier logistic regression baseline performed worst on recall but best on interpretability. This pattern is consistent with expectations: tree‑based ensembles typically capture more complex patterns at the expense of transparency, while generalized linear models provide clearer explanations with more limited flexibility.

- **Suitability for deployment:**  
  The tuned XGBoost model improves recall relative to logistic regression and random forest but still misses many churners and generates a notable number of false positives. It is more suitable as a decision‑support tool and experimentation baseline than as a standalone system for high‑stakes, automated retention decisions.

- **Train/validation/test split trade‑off:**  
  Using separate validation and test sets reduces the data available for training but enables unbiased model selection on validation data and a cleaner final performance estimate on the untouched test set. This structure improves confidence that the selected model will generalize to new users.

- **Logistic regression vs. tree‑based ensembles:**  
  Logistic regression offers straightforward interpretability via coefficients and clear directionality of effects, making it valuable for explaining drivers of churn. Tree‑based ensembles like random forest and XGBoost typically deliver stronger predictive performance, handle nonlinearities and interactions, and require fewer distributional assumptions, at the cost of interpretability.

- **Paths to improvement:**  
  Further gains are likely to come from richer feature engineering (for example, temporal trends, volatility of behavior, or route diversity), targeted class‑weighting or threshold tuning to prioritize recall, and incorporating additional data on drive‑level behavior and in‑app interactions.

- **Additional data needs:**  
  Drive‑level details (timing, route type, context), more granular app‑interaction logs (searches, reports, confirmations), and geographic or contextual features would help distinguish structural churn from temporary disengagement and support more precise intervention strategies.
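
As a starting point for the class‑weighting idea mentioned above, the sketch below computes two commonly used weights for a made‑up sample with an 18% churn rate: the "balanced" class weights that scikit‑learn classifiers accept through `class_weight`, and the negative‑to‑positive ratio often used for XGBoost's `scale_pos_weight` parameter. The sample is illustrative, and neither weighting was tested in this notebook.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical sample: 18 churned (1) and 82 retained (0) users
y_demo = np.array([1] * 18 + [0] * 82)

# 'balanced' weights: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y_demo)
class_weight = dict(zip([0, 1], weights))

# Ratio of negatives to positives, a common choice for scale_pos_weight
n_neg, n_pos = np.bincount(y_demo)
scale_pos_weight = n_neg / n_pos

print(class_weight, round(scale_pos_weight, 2))
```

Either weight would be one more hyperparameter to evaluate with cross‑validation, since heavier weighting of churners usually raises recall at the expense of precision.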