# **Waze Project**

## Background

*Your team is nearing the midpoint of their user churn project. So far, you’ve completed a project proposal, and used Python to explore and analyze Waze’s user data. You’ve also used Python to create data visualizations. The next step is to use statistical methods to analyze and interpret your data.*

*You receive a new email from Sylvester Esperanza, your project manager. Sylvester tells your team about a new request from leadership: to analyze the relationship between the mean number of rides and device type. You also discover follow-up emails from three other team members: May Santner, Chidi Ga, and Harriet Hadzic. These emails discuss the details of the analysis. They would like a statistical analysis of ride data based on device type. In particular, leadership wants to know if there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. A final email from Chidi includes your specific assignment: to conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean number of rides between iPhone users and Android users.* $^1$


## Description
<br/>
**The purpose** of this project is to demonstrate knowledge of how to conduct a **two-sample hypothesis test**.

**The goal** is to apply **descriptive statistics** and **hypothesis testing** in Python.
<br/>

*This activity has three parts:*

**Part 1:** Imports and data loading

**Part 2:** Conduct hypothesis testing

**Part 3:** Communicate insights with stakeholders

* What key business insight(s) emerged from the hypothesis test?

* What business recommendations can be proposed based on the results?


## **Plan**

1. **What is the research question for this data project?** 

    - *Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?*

### **Task 1. Imports and data loading**

In [1]:
# Import any relevant packages or libraries
import pandas as pd
from scipy import stats

In [5]:
# Load dataset into dataframe
df = pd.read_csv('Data/waze_dataset.csv')

In [6]:
df.head()

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |


### **Task 2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

Computing descriptive statistics helps you quickly compare the average number of drives by device type.

In the dataset, `device` is a categorical variable with the labels `iPhone` and `Android`.

We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type:

In [4]:
df.groupby('device')['drives'].mean()

device
Android    66.231838
iPhone     67.859078
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone to interact with the application have a slightly higher number of drives on average. However, this difference might arise from sampling variability rather than from a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
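Group means alone do not show how many users are in each group or how spread out their drive counts are, and both affect the hypothesis test. The sketch below shows one way to view count, mean, and standard deviation side by side. It uses a small toy dataframe with made-up values (the name `toy` is hypothetical); on the real data the same `agg` call would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Waze dataset (values are made up).
toy = pd.DataFrame({
    "device": ["iPhone", "iPhone", "iPhone", "Android", "Android", "Android"],
    "drives": [40, 107, 60, 226, 95, 68],
})

# Count, mean, and sample standard deviation of `drives` per device type.
summary = toy.groupby("device")["drives"].agg(["count", "mean", "std"])
print(summary)
```

Large standard deviations relative to the gap between the means are an early hint that the gap may not be statistically significant.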


### **Task 3. Hypothesis testing**

Your goal is to conduct a **two-sample t-test**. The steps for conducting a hypothesis test are the following:

1.   State the null hypothesis and the alternative hypothesis
2.   Choose a significance level
3.   Find the p-value
4.   Reject or fail to reject the null hypothesis
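The four steps above can be sketched as a small helper. This is an illustrative outline, not part of the original assignment: the function name `two_sample_test` and the sample values are hypothetical, and the test used is the unequal-variances version of `stats.ttest_ind()`.

```python
from scipy import stats

def two_sample_test(sample_a, sample_b, alpha=0.05):
    """Run a two-sided, unequal-variances t-test and apply the decision rule.

    H0: the two population means are equal.
    HA: the two population means differ.
    Returns (statistic, p_value, reject_null).
    """
    # Step 3: find the p-value.
    result = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
    # Step 4: reject H0 only if the p-value is below the significance level.
    reject_null = bool(result.pvalue < alpha)
    return float(result.statistic), float(result.pvalue), reject_null

# Made-up samples for illustration only (not taken from the Waze data).
group_a = [12, 15, 11, 14, 13, 16]
group_b = [12, 14, 13, 15, 11, 17]

statistic, p_value, reject_null = two_sample_test(group_a, group_b, alpha=0.05)
print(f"t = {statistic:.3f}, p = {p_value:.3f}, reject H0: {reject_null}")
```

Fixing `alpha` before looking at the p-value keeps the decision rule from being tuned to the result.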


**Hypotheses:**

$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

We can set the **significance level** to **5%.**

We can use the `stats.ttest_ind()` function to perform the test.

**Technical note**: The default for the argument `equal_var` in `stats.ttest_ind()` is `True`, which assumes that the population variances are equal. This assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance). We can relax it by setting `equal_var` to `False`, in which case `stats.ttest_ind()` performs the unequal-variances $t$-test (known as Welch's $t$-test). Refer to the [scipy t-test documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html) for more information.
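To make the unequal-variances test less of a black box, the sketch below computes Welch's $t$ statistic and the Welch-Satterthwaite degrees of freedom by hand and compares them with `stats.ttest_ind(..., equal_var=False)`. The two samples are made up for illustration and are not drawn from the Waze data.

```python
import numpy as np
from scipy import stats

# Made-up samples with visibly different spreads (not from the Waze data).
a = np.array([10.0, 12.0, 9.0, 14.0, 11.0, 13.0])
b = np.array([8.0, 20.0, 5.0, 17.0, 11.0])

# Sample variance (ddof=1) of each group divided by its sample size.
va = a.var(ddof=1) / a.size
vb = b.var(ddof=1) / b.size

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation for the degrees of freedom.
dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

# The library result with equal_var=False should match the manual values.
result = stats.ttest_ind(a=a, b=b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Because the variances are not pooled, the degrees of freedom are generally not a whole number and shrink toward the size of the noisier group.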


In [8]:
# 1. Isolate the `drives` column for iPhone users.
iPhone = df[df['device'] == 'iPhone']['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device'] == 'Android']['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)

Ttest_indResult(statistic=1.4635232068852353, pvalue=0.14335197268020591)

Since the **p-value** is larger than the chosen **significance level** (5%), we **fail to reject** the **null hypothesis.** We conclude that there is **not** a **statistically significant** difference in the average number of drives between drivers who use iPhones and drivers who use Androids.
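As a quick check of the arithmetic behind this conclusion, the sketch below plugs in the group means and the p-value reported earlier in this notebook. The variable names are illustrative only.

```python
# Values reported earlier in this notebook.
mean_android = 66.231838
mean_iphone = 67.859078
p_value = 0.14335197268020591
alpha = 0.05  # chosen significance level (5%)

# Observed gap between the two group means.
observed_diff = mean_iphone - mean_android

# Decision rule: reject H0 only when the p-value is below alpha.
reject_null = p_value < alpha

print(f"Observed difference in mean drives: {observed_diff:.2f}")
print(f"Reject H0 at alpha = {alpha}: {reject_null}")
```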

### **Task 4. Communicate insights with stakeholders**

* What business insight(s) can be obtained from the result of the hypothesis test?

  - *The key business insight is that the test found no statistically significant difference in the average number of drives between drivers who use iPhone devices and those who use Androids.*

  - *One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.*
  
$^1$ The background and data used for this project were taken from the Coursera course **The Power of Statistics**.