# **Waze User Churn: Data Overview**

### **Identify data types and compile summary information**

This notebook provides an initial overview of the Waze churn dataset, including data types, missing values, and basic behavior summaries for churned vs. retained users. The goal is to understand who is in the dataset and whether early patterns in driving behavior align with churn outcomes.

### **Importing and loading data**



In [None]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

#### **Data summary information**
The first few rows and basic DataFrame metadata help confirm schema, data types, and overall shape.

In [None]:
df.head(10)

df.info()

There are 14,999 rows and 13 columns, with a mix of float, integer, and object (string) variables. The dataset has 700 missing values in the `label` column, while all other fields are fully populated.
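
The missing-value count can also be read directly per column rather than inferred from the `df.info()` output. The sketch below uses a small made-up frame (`toy`, not the Waze data) purely to show the call; on the real data the equivalent would be `df.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: made-up values, for illustration only
toy = pd.DataFrame({
    'label': ['retained', np.nan, 'churned', 'retained'],
    'drives': [10, 25, 40, 5],
    'device': ['iPhone', 'Android', 'iPhone', 'iPhone'],
})

# Count missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column that is fully populated shows a count of zero, so any nonzero entry identifies where values are missing.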


### **Inspecting missing values and summary statistics**

To check whether rows with missing churn labels differ systematically from complete cases, summary statistics are compared for the two groups.

In [None]:
# Isolate rows with null values
df_null = df[df['label'].isnull()]
# Display summary stats of rows with null values
df_null.describe()

In [None]:
# Isolate rows without null values
df_nonnull = df[df['label'].notnull()]
# Display summary stats of rows without null values
df_nonnull.describe()

The summary statistics for rows with and without missing churn labels are very similar, suggesting no obvious pattern or systematic bias in where the label is missing.
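
Comparing two separate `describe()` tables by eye is error-prone, so it can help to put the two groups side by side. The following is a minimal sketch on made-up data; the frame `toy` and the derived column names `label_present`, `label_missing`, and `relative_diff` are illustrative assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: made-up values, for illustration only
toy = pd.DataFrame({
    'label': ['retained', np.nan, 'churned', np.nan, 'retained', 'churned'],
    'drives': [10, 20, 30, 40, 50, 60],
    'driving_days': [2, 4, 6, 8, 10, 12],
})

# Flag rows whose label is missing, then compare group means side by side
label_missing = toy['label'].isna()
comparison = toy.groupby(label_missing)[['drives', 'driving_days']].mean().T
comparison.columns = ['label_present', 'label_missing']

# Relative difference of the missing-label group versus the labeled group
comparison['relative_diff'] = (
    comparison['label_missing'] / comparison['label_present'] - 1
)
print(comparison)
```

Relative differences close to zero across all columns would support the reading that the two groups look alike.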

### **Missing churn label by device type**

Next, the distribution of missing churn labels is examined by device type to check for device‑specific bias.

In [None]:
# Get count of null values by device
df_null['device'].value_counts()

253 Android users and 447 iPhone users have missing churn labels.

In [None]:
# Calculate % of iPhone nulls and Android nulls
df_null['device'].value_counts(normalize=True)

In [None]:
# Calculate % of iPhone users and Android users in full dataset
df['device'].value_counts(normalize=True)

The percentage of missing labels for each device closely matches each device’s share in the full dataset, which supports the assumption that churn label missingness is approximately random with respect to device type.
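
The two sets of proportions can be lined up in a single table so the comparison is explicit. This is a sketch on made-up data; the frame `toy` and the column names `all_rows`, `missing_label`, and `difference` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the dataset: made-up values, for illustration only
toy = pd.DataFrame({
    'label': ['retained', np.nan, 'churned', np.nan,
              'retained', 'retained', np.nan, 'churned'],
    'device': ['iPhone', 'iPhone', 'Android', 'Android',
               'iPhone', 'Android', 'iPhone', 'iPhone'],
})

# Device share in the full data versus in the rows with a missing label
share = pd.DataFrame({
    'all_rows': toy['device'].value_counts(normalize=True),
    'missing_label': toy.loc[toy['label'].isna(), 'device'].value_counts(normalize=True),
})
share['difference'] = share['missing_label'] - share['all_rows']
print(share)
```

Differences near zero for every device indicate that missing labels are spread across devices in proportion to the overall device mix.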

### **Churn vs. retention distribution**

The churn label distribution indicates how imbalanced the classification problem is.

In [None]:
# Calculate counts and proportions of churned vs. retained users
print(df['label'].value_counts())
print()
df['label'].value_counts(normalize=True)

Among users with known churn status, roughly 82% are retained and 18% are churned, indicating a moderately imbalanced classification task.


### **Median behavior by churn status**

Median values are used to summarize typical churned vs. retained users because several variables contain extreme values and skewed distributions. The median is more robust to outliers than the mean and better reflects a “typical” user in this context.
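
A tiny example illustrates why the median is preferred here. The values below are invented for illustration and are not drawn from the Waze data: five ordinary values plus one extreme value.

```python
import pandas as pd

# Made-up kilometers driven: five ordinary users and one extreme outlier
km = pd.Series([40, 45, 50, 55, 60, 5000])

mean_km = km.mean()      # 875.0, pulled far upward by the single outlier
median_km = km.median()  # 52.5, still close to the ordinary users
print(mean_km, median_km)
```

The mean lands at a value no user in the sample actually resembles, while the median stays near the bulk of the distribution.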


In [None]:
# Calculate median values of all numeric columns for churned and retained users
df.groupby('label').median(numeric_only=True)

Users who churned completed slightly more drives in the last month than retained users, but retained users used the app on more than twice as many days. Churned users also drove more total distance and spent more time driving, suggesting fewer but longer and more intensive driving days.

### **Distance per drive and per driving day**

To better understand intensity of use, three ratio features are derived and compared by churn status: kilometers per drive, kilometers per driving day, and drives per driving day.

In [None]:
# Add a column to df called `km_per_drive`
df['km_per_drive'] = df['driven_km_drives'] / df['drives']

# Group by `label`, calculate the median km per drive
df.groupby('label')['km_per_drive'].median()

The median retained user drives slightly farther per drive than the median churned user, but the gap is small compared with the difference in distance per driving day computed in the next cell.

In [None]:
# Add a column to df called `km_per_driving_day`
df['km_per_driving_day'] = df['driven_km_drives'] / df['driving_days']

# Group by `label`, calculate the median km per driving day
df.groupby('label')['km_per_driving_day'].median()

In [None]:
# Add a column to df called `drives_per_driving_day`
df['drives_per_driving_day'] = df['drives'] / df['driving_days']

# Group by `label`, calculate the median drives per driving day
df.groupby('label')['drives_per_driving_day'].median()

Churned users drive substantially more kilometers and complete more drives per driving day than retained users, which points to a segment of very high‑intensity drivers. One plausible interpretation is that this group includes professional or long‑haul drivers whose needs differ from those of typical commuters and who may churn if the app does not fully support their workflows. This remains a hypothesis to test in later analysis.
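
One caveat with these ratio features: if any user has zero in the denominator (for example, zero driving days), pandas division yields `inf` (or `NaN` for 0/0) rather than raising an error. The sketch below, on made-up values, shows one way to handle this by treating such ratios as undefined; whether the real data contains such rows, and whether `NaN` is the right replacement, are assumptions to check against the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in: made-up values; the second user has zero driving days
toy = pd.DataFrame({
    'driven_km_drives': [100.0, 300.0, 250.0],
    'driving_days': [4, 0, 5],
})

# Plain division: the zero denominator produces inf in the second row
ratio = toy['driven_km_drives'] / toy['driving_days']

# Treat infinite ratios as undefined so they are skipped by median()
ratio_clean = ratio.replace([np.inf, -np.inf], np.nan)
print(ratio.median(), ratio_clean.median())
```

Because `inf` sorts above every finite value, leaving it in place can shift the median upward, while `NaN` values are skipped by default.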

### **Device mix and churn**

Device distribution is examined within churned and retained groups to assess whether churn is associated with iPhone vs. Android usage.

In [None]:
# For each label, calculate the number of Android users and iPhone users
grouped = df.groupby(['label', 'device']).size()
grouped

In [None]:
# For each label, calculate the ratio of Android to iPhone users
print('churned', grouped.loc[('churned', 'Android')] / grouped.loc[('churned', 'iPhone')])
print('retained', grouped.loc[('retained', 'Android')] / grouped.loc[('retained', 'iPhone')])

The Android‑to‑iPhone ratio is very similar for churned and retained users and closely matches the overall device mix (approximately 64% iPhone and 36% Android), suggesting no meaningful difference in churn rate between device types.
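
An alternative view of the same question is the churn rate within each device group, which can be read from a row-normalized cross-tabulation. This is a sketch on made-up data; the frame `toy` is illustrative, and on the real data the same call would be applied to the rows with a known label.

```python
import pandas as pd

# Toy stand-in for the labeled rows: made-up values, for illustration only
toy = pd.DataFrame({
    'device': ['iPhone', 'iPhone', 'iPhone', 'iPhone',
               'Android', 'Android', 'Android', 'Android'],
    'label': ['churned', 'retained', 'retained', 'retained',
              'churned', 'retained', 'retained', 'retained'],
})

# Each row sums to 1: share of churned and retained users per device
churn_by_device = pd.crosstab(toy['device'], toy['label'], normalize='index')
print(churn_by_device)
```

Similar values in the `churned` column across devices would match the conclusion drawn from the Android-to-iPhone ratios.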

## **Initial insights and next steps**

- The dataset contains 14,999 users, 700 of whom have no churn label; no other column has missing values. Missingness appears roughly random and is not concentrated in either device segment.
- The churn label is moderately imbalanced (about 18% churn), which will need to be considered when training and evaluating predictive models. 
- Churned users tend to drive more total distance and complete more drives per driving day than retained users, pointing to a high‑intensity segment that may have distinct needs and churn drivers.
- Device type (iPhone vs. Android) does not show an appreciable difference in churn behavior in this dataset.

These findings motivate deeper exploratory analysis and feature engineering in subsequent notebooks, with particular attention to heavy‑usage “super‑drivers” and how their behavior evolves leading up to churn.