### Notebook Overview  

This Jupyter notebook is the deliverable for **Course 4 – "The Power of Statistics"** in the *Google Advanced Data Analytics Professional Certificate* program. Its goals are to  

1. **Formulate and test hypotheses** about Waze user behavior using inferential statistics.  
2. **Apply a two-sample hypothesis test (t-test)** to compare driving behavior across device types (iPhone vs Android).  
3. **Interpret statistical results** by evaluating p-values against a chosen significance level (α = 0.05).  
4. **Translate findings into business insights** that inform whether device type should be considered in future churn modeling.  

**Deliverables:**  
- Group-level descriptive statistics for drives by device type  
- Hypothesis test results with p-value and decision rule  
- Executive summary linking statistical outcomes to stakeholder implications  

In [1]:
import pandas as pd
from scipy import stats

In [2]:
# Next step: conduct a two-sample hypothesis test (t-test) to analyze 
# whether the mean number of drives differs significantly between 
# iPhone users and Android users in waze_df.

In [None]:
# Load the cleaned csv with added features into a dataframe
waze_df = pd.read_csv('../data/waze_features_v1.csv')
waze_df.head()

Unnamed: 0,ID,label,sessions,drives,total_sessions,n_days_after_onboarding,total_navigations_fav1,total_navigations_fav2,driven_km_drives,duration_minutes_drives,activity_days,driving_days,device,km_per_drive,km_per_driving_day,drives_per_driving_day
0,0,retained,283,226,296.748273,2276,208,0,2628.845068,1985.775061,28,19,Android,11.632058,138.360267,11.894737
1,1,retained,133,107,326.896596,1225,19,64,13715.92055,3160.472914,13,11,iPhone,128.186173,1246.901868,9.727273
2,2,retained,114,95,135.522926,2651,0,0,3059.148818,1610.735904,14,8,Android,32.201567,382.393602,11.875
3,3,retained,49,40,67.589221,15,322,7,913.591123,587.196542,7,3,iPhone,22.839778,304.530374,13.333333
4,4,retained,84,68,168.24702,1562,166,5,3950.202008,1219.555924,27,18,Android,58.091206,219.455667,3.777778


In [4]:
# Map device labels ('iPhone', 'Android') to numeric codes (1, 2) in new column device_type
map_dict = {'iPhone': 1,'Android': 2}
waze_df['device_type'] = waze_df['device'].map(map_dict)

In [5]:
waze_df[['device', 'device_type']].head(10)

Unnamed: 0,device,device_type
0,Android,2
1,iPhone,1
2,Android,2
3,iPhone,1
4,Android,2
5,iPhone,1
6,iPhone,1
7,iPhone,1
8,Android,2
9,iPhone,1


In [6]:
# Confirm no nulls
waze_df['device_type'].isna().sum()

np.int64(0)
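The null check above matters because `Series.map` returns `NaN` for any label that is missing from the mapping dictionary. The following is a minimal sketch on a made-up series (the lowercase `'android'` value is a hypothetical typo, not something observed in `waze_df`) showing how an unmapped label would surface:

```python
import pandas as pd

# Hypothetical toy data: 'android' (lowercase) is deliberately not in the mapping
devices = pd.Series(['iPhone', 'Android', 'android', 'iPhone'])
codes = devices.map({'iPhone': 1, 'Android': 2})

# Labels absent from the dictionary become NaN, so a null count flags them
n_unmapped = int(codes.isna().sum())
unmapped_labels = devices[codes.isna()].unique().tolist()

print(n_unmapped)
print(unmapped_labels)
```

A count of zero, as in the cell above, indicates that every `device` value matched one of the two dictionary keys.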

In [7]:
# Calculate average number of drives per device type
waze_df.groupby('device_type')['drives'].mean()

device_type
1    67.933225
2    66.024241
Name: drives, dtype: float64
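Group means alone do not show how large or how variable each group is, and both feed into the t-test. As a sketch, the same `groupby` can report count, mean, and standard deviation in one call. The small DataFrame below is invented for illustration and does not reproduce the Waze data:

```python
import pandas as pd

# Hypothetical toy data standing in for waze_df
toy_df = pd.DataFrame({
    'device': ['iPhone', 'Android', 'iPhone', 'Android', 'iPhone'],
    'drives': [10, 20, 30, 40, 50],
})

# Group size, mean, and sample standard deviation (ddof=1) per device
summary = toy_df.groupby('device')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

Applied to `waze_df`, the same `.agg(['count', 'mean', 'std'])` call would show the group sizes and spread behind the two means reported above.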

In [8]:
# Average drives: iPhone users ≈ 67.9 vs Android users ≈ 66.0
# Next: run a two-sample t-test to check whether this difference is statistically significant or could plausibly be due to chance

In [9]:
# H0: There is no difference in the avg number of drives between iPhone and Android users
# H1: There is a difference in the avg number of drives between iPhone and Android users

In [10]:
# Significance level = 5%

In [11]:
# Isolate the `drives` column for each device group for the t-test
iphone_drives = waze_df[waze_df['device_type'] == 1]['drives']
android_drives = waze_df[waze_df['device_type'] == 2]['drives']

In [12]:
# Perform the two-sample t-test; equal_var=False selects Welch's t-test, which does not assume equal variances
stats.ttest_ind(a=iphone_drives, b=android_drives, equal_var=False)

TtestResult(statistic=np.float64(1.676594122141587), pvalue=np.float64(0.09365074661708836), df=np.float64(10826.925404660755))
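To make the reported statistic and degrees of freedom less of a black box, the sketch below computes Welch's t statistic and the Welch–Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind(..., equal_var=False)`. The samples are synthetic draws from a seeded random generator, and `welch_t_test` is a hypothetical helper name; neither comes from the notebook's data.

```python
import numpy as np
from scipy import stats

def welch_t_test(a, b):
    """Hypothetical helper: two-sided Welch t-test computed from first principles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard error of each group mean (sample variance / n)
    se2_a = a.var(ddof=1) / a.size
    se2_b = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (a.size - 1) + se2_b ** 2 / (b.size - 1))
    # Two-sided p-value from the t distribution
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Synthetic samples (assumed parameters, for illustration only)
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=68, scale=65, size=500)
sample_b = rng.normal(loc=66, scale=64, size=300)

t_manual, df_manual, p_manual = welch_t_test(sample_a, sample_b)
scipy_result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)

print(np.isclose(t_manual, scipy_result.statistic))
print(np.isclose(p_manual, scipy_result.pvalue))
```

The non-integer `df` in the output above is expected: the Welch approximation generally yields a fractional value rather than `n1 + n2 - 2`.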

In [13]:
# t-test result: p-value ≈ 0.094 > 0.05

In [None]:
# Key Insight:
# We fail to reject H0: at the 5% significance level, there is no statistically significant difference in avg drives between iPhone and Android users.
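The decision rule used above (reject H0 only when the p-value falls below α) can be written out explicitly. This is a minimal sketch; `decide` is a hypothetical helper name, and the p-value is copied from the t-test output in this notebook:

```python
def decide(p_value, alpha=0.05):
    """Hypothetical helper: apply the decision rule for a given significance level."""
    if not 0 <= p_value <= 1:
        raise ValueError('p_value must be between 0 and 1')
    return 'reject H0' if p_value < alpha else 'fail to reject H0'

# p-value reported by stats.ttest_ind in this notebook
decision = decide(0.09365074661708836, alpha=0.05)
print(decision)
```

Failing to reject H0 means the data do not provide sufficient evidence of a difference; it does not show that the two group means are equal.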

### Executive Summary
[Executive Summary - Milestone 4](../reports/executive_summary_milestone_4.pdf)