
# **Waze User Churn Analysis**

### Introduction

Welcome to the Waze User Churn Analysis project! This analysis aims to explore user behavior within the Waze application to uncover insights into user retention and churn. Understanding why certain users discontinue using the app while others remain engaged is crucial for improving user experience and sustaining app growth.

### Project Overview

The primary objectives of this analysis include:

1. **Understanding User Behavior:** Investigating user engagement metrics and driving patterns among Waze users.
2. **Identifying Churn Factors:** Exploring potential correlations between user behaviors and churn rates.
3. **Providing Insights:** Presenting key findings and recommendations based on data-driven analysis.


## Step 1: Loading the Data

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/sample_data/waze_dataset.csv')


In [2]:
# Preview the first 10 rows of the dataframe
df.head(10)

,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android
5,5,retained,113,103,279.544437,2637,0,0,901.238699,439.101397,15,11,iPhone
6,6,retained,3,2,236.725314,360,185,18,5249.172828,726.577205,28,23,iPhone
7,7,retained,39,35,176.072845,2999,0,0,7892.052468,2466.981741,22,20,iPhone
8,8,retained,57,46,183.532018,424,0,26,2651.709764,1594.342984,25,20,Android
9,9,churned,84,68,244.802115,2997,72,0,6043.460295,2341.838528,7,3,iPhone


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


In [4]:
df.describe()

,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0,14999.0
mean,7499.0,80.633776,67.281152,189.964447,1749.837789,121.605974,29.672512,4039.340921,1860.976012,15.537102,12.179879
std,4329.982679,80.699065,65.913872,136.405128,1008.513876,148.121544,45.394651,2502.149334,1446.702288,9.004655,7.824036
min,0.0,0.0,0.0,0.220211,4.0,0.0,0.0,60.44125,18.282082,0.0,0.0
25%,3749.5,23.0,20.0,90.661156,878.0,9.0,0.0,2212.600607,835.99626,8.0,5.0
50%,7499.0,56.0,48.0,159.568115,1741.0,71.0,9.0,3493.858085,1478.249859,16.0,12.0
75%,11248.5,112.0,93.0,254.192341,2623.5,178.0,43.0,5289.861262,2464.362632,23.0,19.0
max,14998.0,743.0,596.0,1216.154633,3500.0,1236.0,415.0,21183.40189,15851.72716,31.0,30.0


## Step 2: Explore Missing Values

In [5]:
# Check for missing values and variables affected
missing_values = df.isnull().sum()
print(missing_values)


ID                           0
label                      700
sessions                     0
drives                       0
total_sessions               0
n_days_after_onboarding      0
total_navigations_fav1       0
total_navigations_fav2       0
driven_km_drives             0
duration_minutes_drives      0
activity_days                0
driving_days                 0
device                       0
dtype: int64


**Observations:**

1. None of the variables in the first 10 observations have missing values.
2. The dataset has 700 missing values in the `label` column.

In [6]:
# Investigate missing values in a specific column
null_df = df[df['label'].isnull()]
null_df.describe()

,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0,700.0
mean,7405.584286,80.837143,67.798571,198.483348,1709.295714,118.717143,30.371429,3935.967029,1795.123358,15.382857,12.125714
std,4306.900234,79.98744,65.271926,140.561715,1005.306562,156.30814,46.306984,2443.107121,1419.242246,8.772714,7.626373
min,77.0,0.0,0.0,5.582648,16.0,0.0,0.0,290.119811,66.588493,0.0,0.0
25%,3744.5,23.0,20.0,94.05634,869.0,4.0,0.0,2119.344818,779.009271,8.0,6.0
50%,7443.0,56.0,47.5,177.255925,1650.5,62.5,10.0,3421.156721,1414.966279,15.0,12.0
75%,11007.0,112.25,94.0,266.058022,2508.75,169.25,43.0,5166.097373,2443.955404,23.0,18.0
max,14993.0,556.0,445.0,1076.879741,3498.0,1096.0,352.0,15135.39128,9746.253023,31.0,30.0


In [7]:
not_null_df = df[~df['label'].isnull()]
not_null_df.describe()

,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
count,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0,14299.0
mean,7503.573117,80.62382,67.255822,189.547409,1751.822505,121.747395,29.638296,4044.401535,1864.199794,15.544653,12.18253
std,4331.207621,80.736502,65.947295,136.189764,1008.663834,147.713428,45.35089,2504.97797,1448.005047,9.016088,7.833835
min,0.0,0.0,0.0,0.220211,4.0,0.0,0.0,60.44125,18.282082,0.0,0.0
25%,3749.5,23.0,20.0,90.457733,878.5,10.0,0.0,2217.319909,840.181344,8.0,5.0
50%,7504.0,56.0,48.0,158.718571,1749.0,71.0,9.0,3496.545617,1479.394387,16.0,12.0
75%,11257.5,111.0,93.0,253.54045,2627.5,178.0,43.0,5299.972162,2466.928876,23.0,19.0
max,14998.0,743.0,596.0,1216.154633,3500.0,1236.0,415.0,21183.40189,15851.72716,31.0,30.0


**Observation:**

>Comparing summary statistics of the observations with missing retention labels with those that aren't missing any values reveals nothing remarkable. The means and standard deviations are fairly consistent between the two groups.
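
One way to put a number on "fairly consistent" is a standardized mean difference: the gap between the two group means divided by a pooled standard deviation. The sketch below is an illustrative addition, not part of the original notebook. It uses only the means and standard deviations printed in the two `describe()` outputs above, and the 0.1 cutoff for a negligible difference is a common rule of thumb assumed here, not something the dataset prescribes.

```python
import math

# Group sizes from the describe() outputs above
N_MISSING, N_LABELED = 700, 14299

# (mean, std) pairs copied from the outputs: (rows missing a label, labeled rows)
summary = {
    "total_sessions": ((198.483348, 140.561715), (189.547409, 136.189764)),
    "driven_km_drives": ((3935.967029, 2443.107121), (4044.401535, 2504.97797)),
    "duration_minutes_drives": ((1795.123358, 1419.242246), (1864.199794, 1448.005047)),
    "activity_days": ((15.382857, 8.772714), (15.544653, 9.016088)),
}


def standardized_mean_difference(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_var = ((n_a - 1) * std_a**2 + (n_b - 1) * std_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


smd = {
    col: standardized_mean_difference(m_a, s_a, N_MISSING, m_b, s_b, N_LABELED)
    for col, ((m_a, s_a), (m_b, s_b)) in summary.items()
}

for col, value in smd.items():
    print(f"{col:>25}: {value:+.3f}")
```

For these four columns every value stays below 0.1 in absolute terms, which is consistent with the observation that the two groups look alike.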

In [8]:
# Check missing values by device
print(null_df['device'].value_counts())
print()
print(null_df['device'].value_counts(normalize=True))

iPhone     447
Android    253
Name: device, dtype: int64

iPhone     0.638571
Android    0.361429
Name: device, dtype: float64


In [10]:
# Calculate the percentage of iPhone and Android users in the full dataset
df['device'].value_counts(normalize=True)

iPhone     0.644843
Android    0.355157
Name: device, dtype: float64

**Observation:**
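>Among rows with a missing label, 63.9% are iPhone users and 36.1% are Android users, which closely matches the device split in the full dataset (64.5% iPhone, 35.5% Android).

>There is no evidence that the missing labels are caused by anything other than chance.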
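
To check whether the small gap in device share could plausibly be chance, a two-proportion z-test is one option. This is a minimal sketch that the original notebook does not run. It compares the iPhone share among rows with a missing label (447 of 700) against the labeled rows, whose iPhone count (1,645 churned plus 7,580 retained) comes from the label-by-device counts in Step 3. The helper name `two_proportion_z_test` is made up for this example, and the normal approximation is an assumption.

```python
import math


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for H0: both groups share one proportion."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability under the standard normal approximation
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# iPhone users among rows missing a label vs. among labeled rows
z_stat, p_value = two_proportion_z_test(447, 700, 1645 + 7580, 14299)
print(f"z = {z_stat:.3f}, p = {p_value:.3f}")
```

A p-value well above 0.05 would mean the test finds no detectable difference in device share between the two groups, in line with the observation above. It does not prove the labels are missing completely at random.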

## Step 3: Summary Statistics

In [11]:
# Counts of churned vs. retained users
print(df['label'].value_counts())
print()
print(df['label'].value_counts(normalize=True))

retained    11763
churned      2536
Name: label, dtype: int64

retained    0.822645
churned     0.177355
Name: label, dtype: float64


**Observation:**

>Among the 14,299 users with a label, 82% were retained and 18% churned. The 700 unlabeled users are not included in these percentages.
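
Because `value_counts()` skips missing values by default, the shares above are computed over labeled rows only. The short sketch below, added for illustration, redoes the arithmetic from the printed counts and also shows each group as a share of all 14,999 rows. The dictionary and variable names are chosen for this example.

```python
# Counts from the value_counts() output above, plus the 700 rows with no label
label_counts = {"retained": 11763, "churned": 2536, "unlabeled": 700}

total_rows = sum(label_counts.values())                 # every row in the dataset
labeled_rows = total_rows - label_counts["unlabeled"]   # rows with a known label

# Shares among labeled rows (what value_counts(normalize=True) reports)
share_of_labeled = {
    label: label_counts[label] / labeled_rows for label in ("retained", "churned")
}

# Shares among all rows, with unlabeled users as their own group
share_of_all = {label: count / total_rows for label, count in label_counts.items()}

for label, share in share_of_all.items():
    print(f"{label:>10}: {share:.1%} of all rows")
```

The unlabeled group is under 5% of the data, so the choice of denominator shifts the churn share by less than one percentage point.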

In [15]:
df.groupby('label').median(numeric_only=True)

label,ID,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days
churned,7477.5,59.0,50.0,164.339042,1321.0,84.5,11.0,3652.655666,1607.183785,8.0,6.0
retained,7509.0,56.0,47.0,157.586756,1843.0,68.0,9.0,3464.684614,1458.046141,17.0,14.0


**Observation:**

>The median churned user drove farther (about 3,653 km vs. 3,465 km) and for longer (about 1,607 vs. 1,458 minutes) than the median retained user, but on far fewer driving days (6 vs. 14). This suggests that churned users may have used the app for less frequent, longer-distance trips.

In [16]:
# Ratio of group medians: median kilometers driven divided by median number of drives
medians_by_label = df.groupby('label').median(numeric_only=True)
print(medians_by_label['driven_km_drives'] / medians_by_label['drives'])

label
churned     73.053113
retained    73.716694
dtype: float64


**Observation:**

>Dividing each group's median distance by its median number of drives gives roughly 73 to 74 km per drive for both groups.
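
The figure above is a ratio of two medians, which is not the same as the median of each user's own km-per-drive ratio. The sketch below uses a tiny made-up sample (the three `(km, drives)` pairs are hypothetical, not taken from the Waze data) to show that the two calculations can disagree.

```python
from statistics import median

# Hypothetical (kilometers driven, number of drives) pairs, invented for illustration
users = [(100.0, 4), (200.0, 10), (900.0, 2)]

# Ratio of medians: the style of calculation used in the cell above
ratio_of_medians = median(km for km, _ in users) / median(d for _, d in users)

# Median of ratios: compute km per drive for each user first, then take the median
median_of_ratios = median(km / d for km, d in users)

print(ratio_of_medians)   # 200 / 4
print(median_of_ratios)   # median of 25, 20, 450
```

With real data, the per-user version would also need to exclude users with zero drives, since the dataset's minimum `drives` value is 0.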

In [17]:
# Ratio of group medians: median kilometers driven divided by median driving days
print(medians_by_label['driven_km_drives'] / medians_by_label['driving_days'])


label
churned     608.775944
retained    247.477472
dtype: float64


In [18]:
# Ratio of group medians: median number of drives divided by median driving days
print(medians_by_label['drives'] / medians_by_label['driving_days'])


label
churned     8.333333
retained    3.357143
dtype: float64


**Observation:**

>By this measure, churned users drove far **more** kilometers per driving day (about 609 km vs. 247 km) and took **more** drives per driving day (about 8.3 vs. 3.4) than retained users, suggesting that they may be **long-haul truckers**. Waze should gather more data on these users to better understand their needs and improve the app for them.
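
The per-driving-day figures can be reproduced from the group medians printed earlier, and the gap can be expressed as a multiple. This is a small arithmetic sketch added for illustration; the dictionary layout and variable names are choices made here, while the numbers are copied from the `groupby('label').median()` output.

```python
# Group medians copied from the groupby('label').median() output above
medians = {
    "churned": {"driven_km_drives": 3652.655666, "drives": 50.0, "driving_days": 6.0},
    "retained": {"driven_km_drives": 3464.684614, "drives": 47.0, "driving_days": 14.0},
}

# Ratios of medians, matching the two cells above
km_per_driving_day = {
    group: m["driven_km_drives"] / m["driving_days"] for group, m in medians.items()
}
drives_per_driving_day = {
    group: m["drives"] / m["driving_days"] for group, m in medians.items()
}

# How many times larger the churned figure is than the retained one
km_ratio = km_per_driving_day["churned"] / km_per_driving_day["retained"]
drives_ratio = drives_per_driving_day["churned"] / drives_per_driving_day["retained"]

print(f"km per driving day: churned is {km_ratio:.2f}x retained")
print(f"drives per driving day: churned is {drives_ratio:.2f}x retained")
```

Both multiples come out at roughly 2.5. Most of the gap is driven by the difference in median driving days (6 vs. 14), since the median distances and drive counts are close between the groups.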

In [23]:
# Check the ratio of iPhone and Android users in churned and retained groups
df.groupby(['label', 'device']).size()

label     device 
churned   Android     891
          iPhone     1645
retained  Android    4183
          iPhone     7580
dtype: int64

In [24]:
df.groupby('label')['device'].value_counts(normalize=True)

label     device 
churned   iPhone     0.648659
          Android    0.351341
retained  iPhone     0.644393
          Android    0.355607
Name: device, dtype: float64

**Observation:**

>The proportion of iPhone and Android users is similar among churned, retained, and overall users.
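
The same counts can be read the other way around, as a churn rate within each device type. The sketch below is an illustrative addition that redoes this arithmetic from the `groupby(['label', 'device']).size()` output; the variable names are chosen for this example.

```python
# Counts copied from the groupby(['label', 'device']).size() output above
label_device_counts = {
    ("churned", "Android"): 891,
    ("churned", "iPhone"): 1645,
    ("retained", "Android"): 4183,
    ("retained", "iPhone"): 7580,
}

# Churn rate per device = churned users / all labeled users on that device
churn_rate = {}
for device in ("Android", "iPhone"):
    churned = label_device_counts[("churned", device)]
    retained = label_device_counts[("retained", device)]
    churn_rate[device] = churned / (churned + retained)

for device, rate in churn_rate.items():
    print(f"{device}: {rate:.1%} churned")
```

The two rates differ by well under one percentage point, which matches the observation that device type shows little relationship with churn in this sample.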

## Step 4: Conclusions (Executive Summary)

Based on the observations from the analysis:

1. **Missing Values**: There were 700 missing values in the `label` column, with no discernible pattern.

2. **Benefit of Median vs. Mean**: Median values are less sensitive to outliers compared to mean values, providing a more robust estimation of central tendency.

3. **Further Questions**: The data showed that churned users tend to drive considerably more per driving day than retained users. It would be valuable to understand if this data represents a non-random sample of users and how it was collected.

4. **Device Distribution**: Approximately 36% of users were Android users, while 64% were iPhone users.

5. **User Characteristics**: Churned users generally drove farther and for longer, but on fewer days, than retained users. They were also active on fewer days over the same period (median `activity_days` of 8 vs. 17), even though their median session count was slightly higher (59 vs. 56).

6. **Churn Rate by Device Type**: Churn rates for iPhone and Android users were similar, suggesting little or no association between churn and device type.
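
Item 2 in the list above can be illustrated with a toy example. The values below are hypothetical and chosen only to show the effect of a single extreme observation; they are not drawn from the Waze data. In the real dataset, columns such as `driven_km_drives` show the same pattern in milder form, with a mean (about 4,039) above the median (about 3,494) and a maximum above 21,000.

```python
from statistics import mean, median

# Hypothetical values, invented for illustration
typical_values = [10, 11, 12, 13, 14]
with_outlier = typical_values + [500]   # add one extreme observation

mean_before, median_before = mean(typical_values), median(typical_values)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after:.1f}")
print(f"median: {median_before} -> {median_after}")
```

The single outlier pulls the mean far from the bulk of the data while the median barely moves, which is why the group comparisons in Step 3 rely on medians.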

