# **Waze User Churn: Data Overview**

### Identify data types and compile summary information

This notebook provides an initial overview of the Waze churn dataset, including data types, missing values, and basic behavior summaries for churned vs. retained users. The goal is to understand who is in the dataset and whether early patterns in driving behavior align with churn outcomes.

### **Importing and Loading Data**



In [1]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

#### **Data summary information**
The first few rows and basic DataFrame metadata help confirm schema, data types, and overall shape.

In [2]:
df.head(10)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


There are 14,999 rows and 13 columns, with a mix of float, integer, and object (string) variables. The dataset has 700 missing values in the `label` column, while all other fields are fully populated.


### **Inspecting missing values and summary statistics**

To check whether rows with missing churn labels differ systematically from complete cases, summary statistics are compared for the two groups.

In [3]:
# Isolate rows with null values
df_null = df[df['label'].isnull()]
# Display summary stats of rows with null values
df_null.describe()

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 7405.584286 | 80.837143 | 67.798571 | 198.483348 | 1709.295714 | 118.717143 | 30.371429 | 3935.967029 | 1795.123358 | 15.382857 | 12.125714 |
| std | 4306.900234 | 79.98744 | 65.271926 | 140.561715 | 1005.306562 | 156.30814 | 46.306984 | 2443.107121 | 1419.242246 | 8.772714 | 7.626373 |
| min | 77.0 | 0.0 | 0.0 | 5.582648 | 16.0 | 0.0 | 0.0 | 290.119811 | 66.588493 | 0.0 | 0.0 |
| 25% | 3744.5 | 23.0 | 20.0 | 94.05634 | 869.0 | 4.0 | 0.0 | 2119.344818 | 779.009271 | 8.0 | 6.0 |
| 50% | 7443.0 | 56.0 | 47.5 | 177.255925 | 1650.5 | 62.5 | 10.0 | 3421.156721 | 1414.966279 | 15.0 | 12.0 |
| 75% | 11007.0 | 112.25 | 94.0 | 266.058022 | 2508.75 | 169.25 | 43.0 | 5166.097373 | 2443.955404 | 23.0 | 18.0 |
| max | 14993.0 | 556.0 | 445.0 | 1076.879741 | 3498.0 | 1096.0 | 352.0 | 15135.39128 | 9746.253023 | 31.0 | 30.0 |


In [4]:
# Isolate rows without null values
df_nonnull = df[df['label'].notnull()]
# Display summary stats of rows without null values
df_nonnull.describe()

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 |
| mean | 7503.573117 | 80.62382 | 67.255822 | 189.547409 | 1751.822505 | 121.747395 | 29.638296 | 4044.401535 | 1864.199794 | 15.544653 | 12.18253 |
| std | 4331.207621 | 80.736502 | 65.947295 | 136.189764 | 1008.663834 | 147.713428 | 45.35089 | 2504.97797 | 1448.005047 | 9.016088 | 7.833835 |
| min | 0.0 | 0.0 | 0.0 | 0.220211 | 4.0 | 0.0 | 0.0 | 60.44125 | 18.282082 | 0.0 | 0.0 |
| 25% | 3749.5 | 23.0 | 20.0 | 90.457733 | 878.5 | 10.0 | 0.0 | 2217.319909 | 840.181344 | 8.0 | 5.0 |
| 50% | 7504.0 | 56.0 | 48.0 | 158.718571 | 1749.0 | 71.0 | 9.0 | 3496.545617 | 1479.394387 | 16.0 | 12.0 |
| 75% | 11257.5 | 111.0 | 93.0 | 253.54045 | 2627.5 | 178.0 | 43.0 | 5299.972162 | 2466.928876 | 23.0 | 19.0 |
| max | 14998.0 | 743.0 | 596.0 | 1216.154633 | 3500.0 | 1236.0 | 415.0 | 21183.40189 | 15851.72716 | 31.0 | 30.0 |


The summary statistics for rows with and without missing churn labels are very similar, suggesting no obvious pattern or systematic bias in where the label is missing.
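One way to put a number on "very similar" is the standardized mean difference (SMD): the gap between the two group means divided by a pooled standard deviation. The sketch below plugs in the means and standard deviations from the two `describe()` outputs above for a handful of columns. The helper name `standardized_mean_difference` and the rule of thumb that an absolute SMD below roughly 0.1 is negligible are assumptions brought in for illustration; they are not part of the original analysis.

```python
import math

# (mean, std) pairs copied from the two describe() outputs above:
# first pair = rows with a missing label, second pair = rows with a label.
summary = {
    "sessions": ((80.837143, 79.98744), (80.62382, 80.736502)),
    "total_sessions": ((198.483348, 140.561715), (189.547409, 136.189764)),
    "driven_km_drives": ((3935.967029, 2443.107121), (4044.401535, 2504.97797)),
    "duration_minutes_drives": ((1795.123358, 1419.242246), (1864.199794, 1448.005047)),
    "activity_days": ((15.382857, 8.772714), (15.544653, 9.016088)),
}


def standardized_mean_difference(null_stats, labeled_stats):
    """Difference in means divided by the pooled standard deviation (hypothetical helper)."""
    (mean_a, std_a), (mean_b, std_b) = null_stats, labeled_stats
    pooled_std = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_std


smd = {col: standardized_mean_difference(a, b) for col, (a, b) in summary.items()}
for col, value in smd.items():
    print(f"{col:<26}{value:+.3f}")
```

With these inputs every value lands well inside ±0.1, and the largest gap is in `total_sessions`, which is consistent with the visual impression that the two groups look alike.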

### **Missing churn label by device type**

Next, the distribution of missing churn labels is examined by device type to check for device‑specific bias.

In [5]:
# Get count of null values by device
print(df_null.groupby('device').size())
df_null['device'].value_counts() 

device
Android    253
iPhone     447
dtype: int64


device
iPhone     447
Android    253
Name: count, dtype: int64

253 Android users and 447 iPhone users have missing churn labels.

In [6]:
# Calculate the share of missing-label rows by device
df_null['device'].value_counts(normalize=True)

device
iPhone     0.638571
Android    0.361429
Name: proportion, dtype: float64

In [7]:
# Calculate % of iPhone users and Android users in full dataset
df['device'].value_counts(normalize=True)

device
iPhone     0.644843
Android    0.355157
Name: proportion, dtype: float64

The device split among rows with missing labels (about 63.9% iPhone, 36.1% Android) closely matches the split in the full dataset (about 64.5% iPhone, 35.5% Android). This supports the assumption that churn label missingness is approximately random with respect to device type.
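The comparison above can also be framed as a formal check. The sketch below runs a Pearson chi-square test of independence by hand on the 2×2 table of label missingness by device. The missing-label counts come from the output above. The labeled-row counts (9,225 iPhone, 5,074 Android) are derived by subtracting the missing rows from the full-dataset device totals, and they agree with the grouped counts shown later in this notebook. The function name `chi_square_2x2` is hypothetical, and no continuity correction is applied.

```python
import math

# Rows: label missing / label present. Columns: iPhone / Android.
# Missing-label counts come from the value_counts() output above; the
# labeled-row counts are derived (device totals minus the missing rows).
observed = [[447, 253],
            [9225, 5074]]


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the chi-square survival function reduces to erfc.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = chi_square_2x2(observed)
print(f"chi-square = {statistic:.3f}, p = {p_value:.3f}")
```

The resulting p-value is far above conventional significance thresholds, which is consistent with missingness being unrelated to device. It does not prove that labels are missing completely at random, since other variables could still be involved.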

### **Churn vs. retention distribution**

The churn label distribution indicates how imbalanced the classification problem is.

In [8]:
# Calculate counts of churned vs. retained
print(df['label'].value_counts())
print()
df['label'].value_counts(normalize=True)

label
retained    11763
churned      2536
Name: count, dtype: int64



label
retained    0.822645
churned     0.177355
Name: proportion, dtype: float64

Among users with known churn status, roughly 82% are retained and 18% are churned, indicating a moderately imbalanced classification task.
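Because of this imbalance, raw accuracy can look deceptively good. As a small illustration that is not part of the original notebook, the sketch below computes the accuracy of a trivial classifier that always predicts the majority class, using the label counts above.

```python
# Label counts taken from the value_counts() output above.
label_counts = {"retained": 11763, "churned": 2536}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = label_counts[majority_label] / total
# Recall on the churned class for that same trivial classifier:
# it never predicts "churned" unless churned happens to be the majority.
churn_recall = 1.0 if majority_label == "churned" else 0.0

print(f"Always predicting '{majority_label}': "
      f"accuracy = {baseline_accuracy:.3f}, churn recall = {churn_recall:.1f}")
```

Any model evaluated later should be judged against this roughly 82% baseline, and on metrics such as recall or precision for the churned class rather than on accuracy alone.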


### **Median behavior by churn status**

Median values are used to summarize typical churned vs. retained users because several variables contain extreme values and skewed distributions. The median is more robust to outliers than the mean and better reflects a “typical” user in this context.
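A quick illustration of that robustness, using made-up numbers rather than values from the dataset:

```python
from statistics import mean, median

# Hypothetical kilometers-per-day values: four ordinary users and one extreme outlier.
km_per_day = [40, 55, 60, 75, 2000]

print("mean:  ", mean(km_per_day))    # pulled far upward by the single outlier
print("median:", median(km_per_day))  # stays at the middle value of the sorted list
```

The mean here is several times larger than any of the four ordinary values, while the median still sits among them.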


In [9]:
# Calculate median values of all numeric columns for churned and retained users
df.groupby('label').median(numeric_only=True)

| label | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| churned | 7477.5 | 59.0 | 50.0 | 164.339042 | 1321.0 | 84.5 | 11.0 | 3652.655666 | 1607.183785 | 8.0 | 6.0 |
| retained | 7509.0 | 56.0 | 47.0 | 157.586756 | 1843.0 | 68.0 | 9.0 | 3464.684614 | 1458.046141 | 17.0 | 14.0 |


The median churned user completed slightly more drives in the last month than the median retained user (50 vs. 47), yet used the app on fewer than half as many days (8 vs. 17 activity days). Churned users also drove farther (about 3,653 km vs. 3,465 km) and for longer (about 1,607 vs. 1,458 minutes), which suggests their driving is concentrated in fewer, more intensive days.

### **Distance per drive and per driving day**

To better understand intensity of use, three ratios are derived and summarized by churn status: kilometers per drive, kilometers per driving day, and drives per driving day.

In [10]:
# Add a column to df called `km_per_drive`
# (users with 0 drives produce inf here; pandas does not raise on division by zero)
df['km_per_drive'] = df['driven_km_drives'] / df['drives']

# Group by `label` and calculate the median km per drive
median_km_per_drive = df.groupby('label')[['km_per_drive']].median()
median_km_per_drive

| label | km_per_drive |
|---|---|
| churned | 74.109416 |
| retained | 75.014702 |


The median retained user drives about 1 km farther per drive than the median churned user (75.0 km vs. 74.1 km). This gap is negligible, and far smaller than the difference in distance per driving day examined next.

In [11]:
# Add a column to df called `km_per_driving_day`
# (users with 0 driving days produce inf here; pandas does not raise on division by zero)
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']

# Group by `label`, calculate the median km per driving day
df.groupby('label')['km_per_driving_day'].median()

label
churned     697.541999
retained    289.549333
Name: km_per_driving_day, dtype: float64

In [12]:
# Add a column to df called `drives_per_driving_day`
df['drives_per_driving_day'] = df['drives'] / df['driving_days']

# Group by `label`, calculate the median drives per driving day
df.groupby('label')['drives_per_driving_day'].median()

label
churned     10.0000
retained     4.0625
Name: drives_per_driving_day, dtype: float64

The median churned user drives about 698 km per driving day, compared with about 290 km for the median retained user, and completes 10 drives per driving day compared with roughly 4. This points to a segment of very high-intensity drivers among churned users. These may be professional or long-haul drivers whose needs differ from those of typical commuters, although the data in this notebook cannot confirm that interpretation.
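One caveat about the two ratios above: `driving_days` has a minimum of 0 in this dataset, and pandas returns `inf` (or `NaN` for 0/0) rather than raising an error when dividing by zero. `median()` skips `NaN` but not `inf`, so users with no driving days can pull a group's median upward. The sketch below uses a small hypothetical DataFrame, with values made up for illustration, to show the effect and one possible guard. Whether to drop or otherwise handle such users is a choice the original analysis does not make.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two of the five users have zero driving days.
toy = pd.DataFrame({
    "driven_km_drives": [300.0, 450.0, 500.0, 800.0, 1200.0],
    "driving_days": [0, 0, 5, 10, 4],
})

# Same construction as in the notebook: division by zero yields inf, not an error.
toy["km_per_driving_day"] = toy["driven_km_drives"] / toy["driving_days"]
raw_median = toy["km_per_driving_day"].median()

# One possible guard: treat infinite ratios as missing so median() skips them.
cleaned = toy["km_per_driving_day"].replace([np.inf, -np.inf], np.nan)
guarded_median = cleaned.median()

print("median with inf values kept:", raw_median)
print("median with inf set to NaN: ", guarded_median)
```

In the toy data the two infinite ratios shift the median from 100 km to 300 km per driving day. Applying the same check to the real columns would show how much of the churned vs. retained gap is driven by users with zero driving days.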

### **Device mix and churn**

Device distribution is examined within churned and retained groups to assess whether churn is associated with iPhone vs. Android usage.

In [13]:
# For each label, calculate the number of Android users and iPhone users
grouped = df.groupby(['label', 'device']).size()
grouped

label     device 
churned   Android     891
          iPhone     1645
retained  Android    4183
          iPhone     7580
dtype: int64

In [14]:
# For each label, calculate the ratio of Android to iPhone users
print('churned', grouped.loc[('churned', 'Android')] / grouped.loc[('churned', 'iPhone')])
print('retained', grouped.loc[('retained', 'Android')] / grouped.loc[('retained', 'iPhone')])

churned 0.5416413373860183
retained 0.5518469656992084


The Android-to-iPhone ratio is nearly identical for churned and retained users (about 0.54 vs. 0.55) and closely matches the overall device mix (approximately 64% iPhone and 36% Android), suggesting no meaningful difference in churn rate between device types.
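The same counts can be read the other way around, as a churn rate per device, which is often easier to interpret than a device ratio per label. A minimal sketch using the grouped counts above:

```python
# Counts copied from the grouped output above.
counts = {
    ("churned", "Android"): 891,
    ("churned", "iPhone"): 1645,
    ("retained", "Android"): 4183,
    ("retained", "iPhone"): 7580,
}

churn_rate = {}
for device in ("Android", "iPhone"):
    churned = counts[("churned", device)]
    retained = counts[("retained", device)]
    # Share of labeled users on this device who churned.
    churn_rate[device] = churned / (churned + retained)
    print(f"{device}: churn rate = {churn_rate[device]:.3f}")
```

Both rates come out close to the overall churn share of about 18%, differing from each other by well under one percentage point.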

## **Initial insights and next steps**

- The dataset contains 14,999 rows, 700 of which are missing a churn label; `label` is the only column with missing values, and missingness appears roughly random with respect to device type.
- The churn label is moderately imbalanced (about 18% churn), which will need to be considered when training and evaluating predictive models. 
- Churned users tend to drive more total distance and complete more drives per driving day than retained users, pointing to a high‑intensity segment that may have distinct needs and churn drivers.
- Device type (iPhone vs. Android) does not show an appreciable difference in churn behavior in this dataset.

These findings motivate deeper exploratory analysis and feature engineering in subsequent notebooks, with particular attention to heavy‑usage “super‑drivers” and how their behavior evolves leading up to churn.