# **Waze User Churn: Logistic Regression Modeling**
**This notebook develops and evaluates a binomial logistic regression model to predict user churn for Waze based on app usage and driver behavior variables. The focus is on translating exploratory data analysis into a predictive model and interpreting model performance and feature effects in a business context.**

The analysis proceeds in three stages:
- Exploratory data analysis (EDA) and assumption checks for logistic regression
- Model building and evaluation using a binomial logistic regression classifier
- Interpretation of model results and implications for churn-focused business decisions

The objective is to build a model that predicts whether a user churns and to understand which behavioral features are most associated with churn. 

### **Data and Libraries**
The Waze churn dataset is loaded into a pandas DataFrame, and standard Python and scikit-learn tools are used for visualization, feature engineering, and logistic regression modeling.

In [None]:
# Core libraries for data manipulation and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# tools for preprocessing, model training, and evaluation
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, precision_score, \
recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.linear_model import LogisticRegression

The churn dataset is read from CSV into a DataFrame for analysis. 

In [None]:
df = pd.read_csv('waze_dataset.csv')

## **Exploratory data analysis**

EDA is used to understand class balance, identify missing values, and detect potential outliers or data quality issues that may influence a logistic regression model. Visual inspection of distributions and summary statistics supports later decisions on feature engineering, outlier handling, and model assumptions.

### **Data structure, missing values, and outliers**

The initial EDA examines dataset dimensions, data types, missing labels, and summary statistics to understand overall structure and detect variables with extreme values.


In [None]:
print(df.shape)
df.info()

In [None]:
df.head()

In [None]:
# drop the unique identifier not needed for modeling
df = df.drop('ID',axis=1)

In [None]:
# check class balance of the churn label
df['label'].value_counts(normalize=True)

In [None]:
# summary statistics for numeric variables
df.describe()

The dataset contains 700 missing values in the `label` (target) column, representing less than 5% of observations. 

Several usage-related variables (`sessions`, `drives`, `total_sessions`, `total_navigations_fav1`, `total_navigations_fav2`, `driven_km_drives`, `duration_minutes_drives`) exhibit extreme values, with maxima several standard deviations above the upper quartile, indicating potential outliers.
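
One way to make the "extreme values" observation concrete is to express each column's maximum as a number of standard deviations above its 75th percentile. The sketch below runs on a small hypothetical DataFrame (`toy`) rather than the Waze data, and the cutoff of 2 is an arbitrary illustrative choice.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook but the values are made up
toy = pd.DataFrame({
    "sessions": [5, 8, 10, 12, 9, 7, 11, 6, 10, 400],
    "activity_days": [1, 5, 10, 15, 20, 25, 30, 12, 18, 22],
})

summary = toy.describe().T
# Distance from the upper quartile to the maximum, in standard-deviation units
summary["max_gap_in_std"] = (summary["max"] - summary["75%"]) / summary["std"]

# Columns whose maximum sits far above the upper quartile (cutoff is illustrative)
flagged = summary.index[summary["max_gap_in_std"] > 2].tolist()
print(summary[["75%", "max", "std", "max_gap_in_std"]])
print(flagged)
```

The same calculation could be applied to `df.describe().T` in this notebook to rank the usage variables by how far their maxima sit from the bulk of the data.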

### **Feature Engineering**

To better capture driving intensity, a new feature `km_per_driving_day` is created as the average distance driven per driving day for each user. This condenses multiple raw variables into a single measure of driving behavior over the last month.

In [None]:
# mean distance driven per driving day
df['km_per_driving_day'] = df['driven_km_drives']/df['driving_days']
df['km_per_driving_day'].describe()

In [None]:
# replace infinite values with zero
df.loc[df['km_per_driving_day']==np.inf, 'km_per_driving_day'] = 0
df['km_per_driving_day'].describe()
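
The infinite values arise when `driving_days` is 0: pandas returns `inf` for a positive number divided by zero. The sketch below reproduces this on hypothetical values. It also shows that `0 / 0` yields `NaN` rather than `inf`, a case the replacement above would not touch; whether such rows exist in the Waze data is not checked here.

```python
import numpy as np
import pandas as pd

# Hypothetical values: the second and third users have zero driving days
toy = pd.DataFrame({
    "driven_km_drives": [300.0, 120.0, 0.0],
    "driving_days": [10, 0, 0],
})

# 300/10 is finite, 120/0 gives inf, 0/0 gives NaN
ratio = toy["driven_km_drives"] / toy["driving_days"]

# Replace only the infinite entries with zero, mirroring the cell above
cleaned = ratio.mask(np.isinf(ratio), 0)
print(ratio.tolist())
print(cleaned.tolist())
```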

A binary `professional_driver` flag is introduced to distinguish heavy users (at least 60 drives and 15 or more driving days in the last month) from other drivers, using domain-informed thresholds.

In [None]:
# `professional_driver` column
df['professional_driver'] = np.where((df['drives']>=60) & (df['driving_days']>=15), 1, 0)

In [None]:
print(df['professional_driver'].value_counts())
df.groupby(['professional_driver'])['label'].value_counts(normalize=True)

Professional drivers show a churn rate of about 7.6%, compared with roughly 19.9% for non-professional users, indicating that high-activity drivers are substantially more likely to be retained and that this feature may add predictive signal to the model.

## **Model construction strategy**

Predictor selection is guided by the business objective (predicting churn) and prior EDA, with multicollinearity checks used to drop redundant variables while retaining features with stronger relationships to churn. Iterative model runs and performance metrics such as accuracy, precision, and recall help refine the feature set.

### **Handling missing labels and outliers**
The `label` column is inspected for type and missingness, and rows with missing labels are dropped because they are relatively few and appear randomly distributed. Extreme values in several high-variance usage variables are winsorized at the 95th percentile to reduce the influence of outliers while retaining all observations.

In [None]:
df.info()

In [None]:
# Drop rows with missing data in `label` column
df = df.dropna(subset=['label'])

In [None]:
# Cap values above the 95th percentile (one-sided winsorization)
for column in ['sessions', 'drives', 'total_sessions', 'total_navigations_fav1', 'total_navigations_fav2', 
               'driven_km_drives', 'duration_minutes_drives']:
    threshold = df[column].quantile(0.95)
    df.loc[df[column] > threshold,column]=threshold

In [None]:
df.describe()
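
The capping loop above is a one-sided winsorization. A slightly more compact way to express the same idea is `Series.clip`, sketched here on hypothetical data; it is offered as an alternative form, not as the notebook's method.

```python
import pandas as pd

# Hypothetical usage counts with one extreme value
toy = pd.DataFrame({"drives": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]})

threshold = toy["drives"].quantile(0.95)

# Cap values above the 95th percentile; no rows are dropped, only values change
capped = toy["drives"].clip(upper=threshold)

print(threshold, capped.max(), len(capped) == len(toy))
```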

#### **Encoding the churn label**
A binary target variable `label2` is created where `1` indicates a churned user and `0` indicates a retained user, preserving the original categorical `label` for reference.

In [None]:
# Create binary `label2` column
df['label2'] = np.where(df['label']=='churned', 1, 0)
df[['label', 'label2']].tail()

### **Logistic regression assumptions**

The logistic regression model assumes independent observations, a binary outcome, limited extreme outliers, low multicollinearity among predictors, and an approximately linear relationship between continuous predictors and the log-odds of churn. Independence is assumed from the data collection process, outliers have been mitigated by winsorization, and multicollinearity is assessed via the correlation matrix. 

In [None]:
# Generate a correlation matrix
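# numeric_only=True skips the string columns (`label`, `device`), which recent pandas versions would otherwise reject
df.corr(method='pearson', numeric_only=True)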

In [None]:
# Plot correlation heatmap
plt.figure(figsize=(15,10))
sns.heatmap(df.corr(method='pearson', numeric_only=True), vmin=-1, vmax=1, annot=True, cmap='coolwarm')
plt.title('Correlation heatmap indicates many weakly correlated variables', fontsize=18)
plt.show();

The correlation matrix and heatmap highlight strong multicollinearity between `sessions` and `drives` (correlation near 1.0), and between `activity_days` and `driving_days` (correlation around 0.95).

To reduce redundancy, only one variable from each highly correlated pair is retained in the final feature set. 
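
Reading pairs off a large heatmap is error-prone, so the highly correlated pairs can also be listed programmatically. The sketch below uses synthetic columns (`a`, `b`, `c`) and an illustrative cutoff of 0.7; neither comes from the Waze data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)

# Hypothetical features: `b` is nearly a copy of `a`, `c` is unrelated noise
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr(method="pearson").abs()
# Keep only the upper triangle so each pair appears once and the diagonal is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs whose absolute correlation exceeds the illustrative cutoff
high_pairs = upper.stack().loc[lambda s: s > 0.7]
print(high_pairs)
```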

### **Encoding device type**

The `device` variable is binary-encoded as `device2` (0 for Android, 1 for iPhone) so it can be used directly as a numeric predictor.

In [None]:
# Create new `device2` variable
df['device2'] = np.where(df['device']=='Android', 0, 1)
df[['device','device2']].tail()

### **Feature set and target**
The feature matrix `X` excludes the original label fields, the unencoded device column, and two highly collinear variables (`sessions`, `driving_days`) in favor of `drives` and `activity_days`, which show slightly stronger associations with churn.

In [None]:
# Isolate predictor variables
X = df.drop(columns=['label','label2','device','sessions','driving_days'])

In [None]:
# Isolate target variable
y=df['label2']

#### **Train-test split**

The data is split into training and test sets using stratified sampling on the target to preserve the original churn vs. retention ratio, which helps obtain reliable performance estimates on an imbalanced classification problem.

In [None]:
# Perform the train-test split, stratify=y preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
X_train.head()
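
To see what `stratify` does, the sketch below splits a hypothetical target with 20% positives and compares the positive rate in each split. The data and the 20% rate are illustrative assumptions, not values from the Waze dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 40 positives out of 200 rows
y_toy = pd.Series([1] * 40 + [0] * 160)
X_toy = pd.DataFrame({"feature": np.arange(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=42
)

# With stratification, both splits keep the original positive rate
print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

In the notebook, the analogous check would be comparing `y_train.mean()` and `y_test.mean()` against `y.mean()`.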

A binomial logistic regression model without regularization (`penalty=None`) is fitted to the training data to estimate the relationship between predictors and the log-odds of churn.

In [None]:
model = LogisticRegression(penalty=None, max_iter=400)

model.fit(X_train, y_train)

### **Model coefficients and intercept**

Model coefficients quantify how each predictor is associated with the log‑odds of churn, holding other variables constant. Positive coefficients increase the log‑odds (and thus the probability) of churn, while negative coefficients decrease it.

In [None]:
# coefficients with respective feature names
pd.Series(model.coef_[0], index=X.columns)

In [None]:
# intercept value
model.intercept_

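Larger-magnitude coefficients indicate a larger change in the predicted log‑odds of churn per one-unit change in the feature. Because the predictors were not standardized before fitting, magnitudes also reflect each feature's units and are not directly comparable across features. Statistical significance is not directly assessed in this scikit‑learn implementation.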
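
Because the coefficients live on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. The sketch below uses hypothetical coefficient values (`feature_a`, `feature_b`, `feature_c`) purely for illustration; they are not the fitted model's values.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration only
coefs = pd.Series({"feature_a": -0.10, "feature_b": 0.0, "feature_c": 0.25})

# exp(coefficient) is the multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(coefs)

# Percent change in the odds for a one-unit increase in each feature
pct_change = (odds_ratios - 1) * 100

print(pd.DataFrame({"coef": coefs, "odds_ratio": odds_ratios, "pct_change": pct_change}))
```

In this notebook the same transformation could be applied as `np.exp(pd.Series(model.coef_[0], index=X.columns))`.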


#### **Logit linearity check**

To assess the assumption of approximate linearity between continuous predictors and the log‑odds of churn, predicted probabilities on the training data are transformed to logits and plotted against a key predictor.


In [None]:
# Get the predicted probabilities of the training data
training_probabilities = model.predict_proba(X_train)
training_probabilities

In [None]:
# Copy the training predictors and add the logit of churn probability
logit_data = X_train.copy()
logit_data['logit'] = [np.log(prob[1]/prob[0]) for prob in training_probabilities]

In [None]:
# Plot regplot of `activity_days` vs log-odds
sns.regplot(x='activity_days', y='logit', data=logit_data, scatter_kws={'s': 2, 'alpha': 0.5})
plt.title('Log-odds: activity_days');

The regplot for `activity_days` suggests an approximately monotonic, near‑linear relationship between activity and the log‑odds of churn, which is reasonably consistent with the logistic regression linearity assumption for this predictor.
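
For a binary scikit-learn `LogisticRegression`, the logit can also be read directly from `decision_function`, which returns the linear predictor. The sketch below checks on synthetic data (an assumption, standing in for `X_train` and `y_train`) that both routes agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for the training set
X_toy, y_toy = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Logit computed from predicted probabilities, as in the cells above
proba = clf.predict_proba(X_toy)
logit_from_proba = np.log(proba[:, 1] / proba[:, 0])

# The same quantity taken directly from the linear predictor
logit_direct = clf.decision_function(X_toy)

print(np.allclose(logit_from_proba, logit_direct))
```

Note that this plot is derived from the model's own linear predictor, so it is a visual sanity check rather than an independent test of the linearity assumption.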


## **Model evaluation**

### **Classification performance**

Predictions are generated on the held‑out test set, and standard classification metrics are used to assess how well the model identifies churned users versus retained users.


In [None]:
# Generate predictions on X_test
y_preds = model.predict(X_test)

In [None]:
# accuracy on the test data
model.score(X_test,y_test)

Accuracy provides an overall proportion of correct predictions but can be misleading when classes are imbalanced, so additional metrics are examined.
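
A quick way to see why accuracy can mislead is to score a trivial classifier that labels every user as retained. The sketch below uses hypothetical labels with 18 churners out of 100; the split is illustrative and not taken from the test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test labels: 18 churned (1) and 82 retained (0)
y_true = np.array([1] * 18 + [0] * 82)

# A trivial "model" that predicts retained for everyone
y_majority = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_majority)  # share of retained users
baseline_recall = recall_score(y_true, y_majority)      # no churner is identified
print(baseline_accuracy, baseline_recall)
```

Any model's accuracy is best read against this kind of majority-class baseline.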


In [None]:
# confusion matrix display
cm = confusion_matrix(y_test, y_preds)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                             display_labels=['retained','churned'])
disp.plot();

In [None]:
# Calculate precision and recall manually
precision = cm[1, 1] / (cm[0, 1] + cm[1, 1])
recall = cm[1, 1] / (cm[1, 0] + cm[1, 1])
precision, recall

In [None]:
# Full classification report
target_labels = ['retained', 'churned']
print(classification_report(y_test,y_preds, target_names=target_labels))

The model achieves decent precision but relatively low recall for the churn class, indicating that it misses a substantial number of true churners (false negatives). For churn mitigation, this means many at‑risk users would not be flagged by the model.
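
`model.predict` uses a 0.5 probability cutoff. One common way to trade precision for recall is to lower that cutoff. The sketch below illustrates the mechanics on synthetic imbalanced data; the dataset, the 0.3 threshold, and the resulting numbers are all assumptions and say nothing about the Waze model's actual behavior.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 20% positives) standing in for the Waze features
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of the positive class
churn_proba = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    preds = (churn_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, preds, zero_division=0),
        recall_score(y_te, preds),
    )
print(results)
```

Lowering the threshold can only flag the same users or more, so recall never decreases; precision usually falls in exchange.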


### **Feature importance (coefficients)**

To visualize which features most strongly influence the model’s predictions, the fitted coefficients are sorted and plotted. The predictors were not standardized before fitting, so these are raw (unstandardized) coefficients.


In [None]:
# Create a list of (feature, coefficient) tuples
feature_importance = list(zip(X_train.columns, model.coef_[0]))

feature_importance = sorted(feature_importance, key=lambda x:x[1], reverse = True)
feature_importance

In [None]:
# Plot the feature importances
sns.barplot(x = [x[1] for x in feature_importance],
           y = [x[0] for x in feature_importance],
           orient = 'h')
plt.title('Feature importance');
plt.xlabel('Coefficient');
plt.ylabel('Feature');

Features with larger positive coefficients are associated with higher churn risk, while those with large negative coefficients are associated with retention, holding other variables constant.
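
Since the model above was fitted on unscaled predictors, coefficient magnitudes depend on each feature's units. A possible refinement is to standardize the predictors first so the coefficients are comparable. The sketch below shows this with a `StandardScaler` pipeline on synthetic data; the columns `days` and `km` and the data-generating process are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Hypothetical predictors on very different scales
toy = pd.DataFrame({
    "days": rng.integers(0, 31, size=n).astype(float),
    "km": rng.normal(4000, 1500, size=n),
})

# Synthetic target driven only by `days`; `km` is pure noise
p = 1 / (1 + np.exp(-(1.5 - 0.15 * toy["days"])))
y_toy = (rng.random(n) < p).astype(int)

# Scale the features, then fit the logistic regression
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(toy, y_toy)

# Coefficients are now per standard deviation of each feature
std_coefs = pd.Series(pipe[-1].coef_[0], index=toy.columns)
print(std_coefs.sort_values())
```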


## **Model insights and business implications**

- `activity_days` is by far the most influential variable in the model’s prediction, with higher activity strongly associated with retention and lower churn probability. This aligns with earlier EDA showing that more active users tend to stay on the platform.  
- In prior EDA, churn increased as `km_per_driving_day` rose, and the correlation heatmap indicated a strong positive association with churn. In the multivariate logistic model, however, this feature becomes relatively weak, suggesting that its apparent effect is largely explained by other, more informative usage variables.

In a multiple logistic regression model, predictors can interact and share variance, which can make some features look less important once others are included. This can improve predictive performance while making interpretation less intuitive.

From a business perspective, the current model’s low recall on churners limits its usefulness for high‑stakes retention campaigns, where missing at‑risk users is costly. It is more suitable as a baseline model to guide further feature engineering and model experimentation rather than as a deployment‑ready churn predictor.

Potential improvements include:
- Engineering additional behavioral and temporal features (for example, recent changes in usage, patterns of cancellation, or route diversity) to capture early signs of disengagement.  
- Exploring alternative model specifications (feature subsets, regularization, and class‑weighting) and comparing against more flexible machine learning models in the subsequent notebook.
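
As one example of the class-weighting idea, scikit-learn's `LogisticRegression` accepts `class_weight='balanced'`, which reweights observations inversely to class frequency. The sketch below compares recall with and without it on synthetic data; the dataset is an assumption, and the effect on the Waze data would need to be measured separately.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 20% positives), for illustration only
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=6, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=42)

# Same model with and without class weighting
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(recall_plain, recall_weighted)
```

Class weighting typically raises recall on the minority class at the cost of precision, so both metrics should be compared before adopting it.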

Additional data such as drive‑level details (route characteristics, duration, time of day) and richer in‑app interaction signals (reports, confirmations, search behavior) would likely improve both predictive power and actionability for churn prevention strategies.
