### Notebook Overview  

This Jupyter notebook is the deliverable for **Course 6 – "The Nuts and Bolts of Machine Learning" (Part 1)** of the Waze churn prediction project. Its goals are to  

1. **Finalize the feature set** by dropping redundant categorical columns and creating new behavioral metrics (`percent_sessions_in_last_month`, `total_sessions_per_day`, `km_per_hour`, `ratio_of_favorite_navigations_to_drives`).  
2. **Validate feature distributions** to ensure interpretability and document anomalies (e.g., outliers in `km_per_hour`, ratios exceeding 1.0 due to logging differences).  
3. **Confirm the class imbalance** in the target variable (`label_2`), which remains skewed (~82% retained / ~18% churned).  
4. **Save the finalized dataset** as `waze_modeling_set.csv`, providing a clean handoff for downstream modeling.  

**Deliverables:**  
- Extended feature set capturing recency of use, normalized activity, driving efficiency, and navigation habits  
- Data quality notes documenting anomalies and modeling decisions 
- Final modeling dataset (`waze_modeling_set.csv`) containing all engineered predictors and numeric encodings

In [1]:
# In the previous stage, we built a baseline logistic regression model to predict user churn. 
# That model provided interpretability and strong recall, establishing a benchmark for comparison.

In [2]:
# Objective:
# In this notebook we extend the analysis by:
#   1. Engineering additional behavioral features to enrich the dataset.
#   2. Validating feature distributions and documenting anomalies.
#   3. Preparing and saving the final modeling dataset (waze_modeling_set.csv).

# Why:
# Waze leadership is interested in identifying users most at risk of churn 
# so that product and marketing teams can design targeted retention strategies. 
# Adding behavioral features improves the chance of capturing nonlinear 
# patterns in user activity that logistic regression may miss.

# Ethical Note:
# Predictive churn models should be applied in ways that benefit users.
# Prediction errors carry costs in both directions:
# - False negatives: Waze fails to take proactive measures to retain users who
#   are actually at risk of churning (missed opportunities to improve satisfaction).
# - False positives: Waze takes proactive measures with loyal users who are not
#   at risk, creating unnecessary notifications or surveys that could annoy them.
# These trade-offs should be considered when designing retention strategies.

In [3]:
import pandas as pd
import numpy as np

In [4]:
# Load the latest cleaned, feature-engineered CSV into a DataFrame
waze_df = pd.read_csv('../data/waze_features_v2.csv')
waze_df.head()

,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,km_per_drive,km_per_driving_day,drives_per_driving_day,professional_driver,label_2,device_2
0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,11.632058,138.360267,11.894737,1,0,0
1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,128.186173,1246.901868,9.727273,0,0,1
2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,32.201567,382.393602,11.875,0,0,0
3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,22.839778,304.530374,13.333333,0,0,1
4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,58.091206,219.455667,3.777778,1,0,0


In [5]:
waze_df.dtypes

label                       object
sessions                     int64
drives                       int64
total_sessions             float64
n_days_after_onboarding      int64
total_navigations_fav1       int64
total_navigations_fav2       int64
driven_km_drives           float64
duration_minutes_drives    float64
activity_days                int64
driving_days                 int64
device                      object
km_per_drive               float64
km_per_driving_day         float64
drives_per_driving_day     float64
professional_driver          int64
label_2                      int64
device_2                     int64
dtype: object

In [6]:
# Finalizing the modeling dataset:
# Drop the original 'label' and 'device' columns, since their numeric
# encodings ('label_2' and 'device_2') are already present.

waze_df = waze_df.drop(columns=['label', 'device'])
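Before dropping the text columns, it can be worth confirming that the numeric encodings really agree with them. The sketch below uses a small toy frame rather than the real data. The `retained -> 0`, `Android -> 0`, and `iPhone -> 1` mappings match the `head()` output above, while `churned -> 1` is an assumption.

```python
import pandas as pd

# Toy frame mirroring the notebook's column names; values are illustrative only.
toy_df = pd.DataFrame({
    "label": ["retained", "churned", "retained"],
    "label_2": [0, 1, 0],
    "device": ["Android", "iPhone", "iPhone"],
    "device_2": [0, 1, 1],
})

# Assumed mappings (churned -> 1 is not shown in the head() output).
label_map = {"retained": 0, "churned": 1}
device_map = {"Android": 0, "iPhone": 1}

# Compare each text column, mapped to integers, with its encoded counterpart.
label_ok = toy_df["label"].map(label_map).eq(toy_df["label_2"]).all()
device_ok = toy_df["device"].map(device_map).eq(toy_df["device_2"]).all()

# Only drop the originals once both encodings are confirmed consistent.
if label_ok and device_ok:
    toy_df = toy_df.drop(columns=["label", "device"])
```

If either check fails, the text columns stay in place so the mismatch can be investigated first.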

In [7]:
waze_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14299 entries, 0 to 14298
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   sessions                 14299 non-null  int64  
 1   drives                   14299 non-null  int64  
 2   total_sessions           14299 non-null  float64
 3   n_days_after_onboarding  14299 non-null  int64  
 4   total_navigations_fav1   14299 non-null  int64  
 5   total_navigations_fav2   14299 non-null  int64  
 6   driven_km_drives         14299 non-null  float64
 7   duration_minutes_drives  14299 non-null  float64
 8   activity_days            14299 non-null  int64  
 9   driving_days             14299 non-null  int64  
 10  km_per_drive             14299 non-null  float64
 11  km_per_driving_day       14299 non-null  float64
 12  drives_per_driving_day   14299 non-null  float64
 13  professional_driver      14299 non-null  int64  
 14  label_2               

In [8]:
# Feature Engineering: additional behavioral metrics
# These features capture recent engagement, driving efficiency, 
# and navigation habits that may be predictive of churn.

In [9]:
# Share of user sessions that occurred in the most recent month.
# Helps measure recency of engagement.
waze_df['percent_sessions_in_last_month'] = (waze_df['sessions'] / waze_df['total_sessions']) * 100


# Average number of sessions per day since onboarding.
# Normalizes activity by account age.
waze_df['total_sessions_per_day'] = waze_df['total_sessions'] / waze_df['n_days_after_onboarding']


# Driving efficiency: average speed in kilometers per hour.
# duration_minutes_drives is in minutes, so divide by 60 to convert to hours.
# Higher or lower values may indicate different driving patterns.
waze_df['km_per_hour'] = waze_df['driven_km_drives'] / (waze_df['duration_minutes_drives'] / 60)


# Average number of favorite navigations per drive.
# Values can exceed 1.0 whenever the combined favorite-navigation counts
# are larger than the number of drives recorded.
# Users with zero drives get a ratio of 0.0 instead of a division-by-zero result.
waze_df['ratio_of_favorite_navigations_to_drives'] = (
    (waze_df['total_navigations_fav1'] + waze_df['total_navigations_fav2'])
    .div(waze_df['drives'].replace(0, np.nan))
    .fillna(0.0)
)
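The zero-denominator guard used for the favorites ratio could also be applied to the other ratio features. Below is a minimal sketch of that idea. The helper name `safe_divide` and the toy values are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd


def safe_divide(numerator: pd.Series, denominator: pd.Series, fill: float = 0.0) -> pd.Series:
    """Divide two Series, returning `fill` where the denominator is zero.

    Note: missing values in either input are also replaced by `fill`.
    """
    return numerator.div(denominator.replace(0, np.nan)).fillna(fill)


# Toy data: the second row has a zero duration, which would otherwise yield inf.
toy_df = pd.DataFrame({
    "driven_km_drives": [100.0, 50.0, 0.0],
    "duration_minutes_drives": [60.0, 0.0, 30.0],
})

toy_df["km_per_hour"] = safe_divide(
    toy_df["driven_km_drives"],
    toy_df["duration_minutes_drives"] / 60,  # minutes -> hours
)
```

Filling with 0.0 treats "undefined" as "zero". That is a modeling choice, not a neutral default, and a separate indicator column may be preferable if many rows are affected.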

In [10]:
waze_df['percent_sessions_in_last_month'].describe()

count    14299.000000
mean        44.983669
std         28.686297
min          0.000000
25%         19.688992
50%         42.431025
75%         68.725073
max        153.063707
Name: percent_sessions_in_last_month, dtype: float64

In [11]:
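# percent_sessions_in_last_month:
#   Mostly within 0–100, but some values exceed 100% (max ~153%).
#   This happens whenever `sessions` is larger than `total_sessions`,
#   which suggests the two counts are not measured the same way.
#   Anomaly noted; values kept as-is.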
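To document this anomaly more precisely, the affected rows can be counted. The sketch below runs on made-up values, not the real column. The capped copy is shown only as an option, since the notebook keeps the values unchanged.

```python
import pandas as pd

# Illustrative values only; the real column lives in waze_df.
pct = pd.Series(
    [12.5, 98.0, 100.0, 153.1, 101.2],
    name="percent_sessions_in_last_month",
)

# Flag rows above the theoretical ceiling of 100%.
over_100 = pct > 100
n_over = int(over_100.sum())      # number of affected rows
share_over = over_100.mean()      # fraction of affected rows

# Optional: a capped copy, leaving the original Series untouched.
pct_capped = pct.clip(upper=100)
```

Recording `n_over` and `share_over` in the data quality notes makes the "keep as-is" decision easier to revisit later.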

In [12]:
waze_df['total_sessions_per_day'].describe()

count    14299.000000
mean         0.338207
std          1.319814
min          0.000298
25%          0.050818
50%          0.100457
75%          0.215210
max         39.763874
Name: total_sessions_per_day, dtype: float64

In [13]:
# total_sessions_per_day:
#   Highly skewed; most users <1/day, max ~40/day. 
#   Reasonable but uneven usage patterns.

In [14]:
waze_df['km_per_hour'].describe()

count    14299.000000
mean       190.883283
std        339.885059
min         72.013095
25%         91.035896
50%        122.342918
75%        193.238449
max      23642.920871
Name: km_per_hour, dtype: float64

In [15]:
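# km_per_hour:
#   Median is plausible (~122 km/h), but the maximum (~23,600 km/h) is
#   physically impossible; such values arise when the recorded drive
#   duration is very short relative to the distance.
#   Retained as-is: tree-based models split on value order, so they are
#   relatively insensitive to extreme feature values.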
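The robustness argument rests on the fact that threshold-based tree splits depend only on the ordering of a feature's values. The sketch below illustrates this with made-up speeds: a monotonic transform such as `log1p` compresses the tail but leaves the ranks unchanged. The 200 km/h plausibility threshold is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Illustrative speeds, loosely echoing the quartiles reported above.
speed = pd.Series([72.0, 91.0, 122.0, 193.0, 23642.9], name="km_per_hour")

# Hypothetical plausibility threshold, used only to count suspicious rows.
PLAUSIBLE_MAX_KMH = 200
implausible = speed > PLAUSIBLE_MAX_KMH
n_implausible = int(implausible.sum())

# log1p shrinks the extreme value but preserves the ordering of all rows,
# so a split such as "km_per_hour <= t" separates the same users either way.
log_speed = np.log1p(speed)
same_order = speed.rank().equals(log_speed.rank())
```

For models that are sensitive to scale, such as logistic regression, a transform or cap like this would matter more than it does for trees.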

In [16]:
waze_df['ratio_of_favorite_navigations_to_drives'].describe()

count    14299.000000
mean         8.664833
std         28.904597
min          0.000000
25%          0.581395
50%          2.083333
75%          6.160028
max        799.000000
Name: ratio_of_favorite_navigations_to_drives, dtype: float64

In [17]:
# ratio_of_favorite_navigations_to_drives:
#   Represents average trips to saved favorite places per drive.
#   Not bounded by 1.0 (median ~2.1, max 799); values >1 occur due to
#   logging differences between drives and favorite navigations.
#   Retained for modeling, as it may signal strong routine usage.

In [18]:
# Confirm the class imbalance (~82% retained / ~18% churned) observed earlier
waze_df['label_2'].value_counts(normalize=True)

label_2
0    0.822645
1    0.177355
Name: proportion, dtype: float64
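One reason to record this split is that downstream train/test partitions should preserve it. The following is a minimal pandas-only sketch of a stratified split on toy data. The 25% test fraction and the random seed are arbitrary assumptions, and a dedicated library routine may be used instead in the modeling notebook.

```python
import pandas as pd

# Toy data with an 80/20 class split; values are illustrative only.
toy_df = pd.DataFrame({
    "label_2": [0] * 80 + [1] * 20,
    "sessions": range(100),
})

# Sample the same fraction from each class so the test set keeps the ratio.
test_df = toy_df.groupby("label_2").sample(frac=0.25, random_state=42)

# Everything not drawn for the test set becomes the training set.
train_df = toy_df.drop(test_df.index)

# The mean of a 0/1 label is the share of the positive class.
train_share = train_df["label_2"].mean()
test_share = test_df["label_2"].mean()
```

Both partitions keep the same churn share here, which a purely random split does not guarantee for a minority class.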

In [19]:
# Confirm dtypes (all numeric) before saving modeling set
waze_df.dtypes

sessions                                     int64
drives                                       int64
total_sessions                             float64
n_days_after_onboarding                      int64
total_navigations_fav1                       int64
total_navigations_fav2                       int64
driven_km_drives                           float64
duration_minutes_drives                    float64
activity_days                                int64
driving_days                                 int64
km_per_drive                               float64
km_per_driving_day                         float64
drives_per_driving_day                     float64
professional_driver                          int64
label_2                                      int64
device_2                                     int64
percent_sessions_in_last_month             float64
total_sessions_per_day                     float64
km_per_hour                                float64
ratio_of_favorite_navigations_t

In [20]:
# Save finalized modeling dataset
waze_df.to_csv('../data/waze_modeling_set.csv', index=False)
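A quick round-trip check can confirm that the saved file reloads with the same columns, all-numeric dtypes, and no missing values. The sketch below uses a toy frame and a temporary directory so that it does not touch `../data/`. The file name `toy_modeling_set.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame standing in for the finalized modeling set.
toy_df = pd.DataFrame({
    "sessions": [283, 133],
    "km_per_hour": [79.4, 260.4],
    "label_2": [0, 0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_modeling_set.csv")
    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Checks a downstream notebook could rely on.
all_numeric = all(is_numeric_dtype(dtype) for dtype in reloaded.dtypes)
same_columns = list(reloaded.columns) == list(toy_df.columns)
no_missing = not reloaded.isna().any().any()
```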