# **Waze Project**
**Course 4 - The Power of Statistics**

Your team is nearing the midpoint of their user churn project. So far, you’ve completed a project proposal, and used Python to explore and analyze Waze’s user data. You’ve also used Python to create data visualizations. The next step is to use statistical methods to analyze and interpret your data.

You receive a new email from Sylvester Esperanza, your project manager. Sylvester tells your team about a new request from leadership: to analyze the relationship between the mean number of rides and device type. You also discover follow-up emails from three other team members: May Santner, Chidi Ga, and Harriet Hadzic. These emails discuss the details of the analysis. They would like a statistical analysis of ride data based on device type. In particular, leadership wants to know if there is a statistically significant difference in the mean number of rides between iPhone® users and Android™ users. A final email from Chidi includes your specific assignment: to conduct a two-sample hypothesis test (t-test) to analyze the difference in the mean number of rides between iPhone users and Android users.

A notebook was structured and prepared to help you in this project. Please complete the following questions and prepare an executive summary.

# **Course 4 End-of-course project: Data exploration and hypothesis testing**

In this activity, you will explore the data provided and conduct a hypothesis test.
<br/>

**The purpose** of this project is to demonstrate knowledge of how to conduct a two-sample hypothesis test.

**The goal** is to apply descriptive statistics and hypothesis testing in Python.
<br/>




Follow the instructions and answer the questions below to complete the activity. Then, you will complete an Executive Summary using the questions listed on the PACE Strategy Document.

Be sure to complete this activity before moving on. The next course item will provide you with a completed exemplar to compare to your own work.


# **Data exploration and hypothesis testing**

<img src="images/Pace.png" width="100" height="100" align=left>

# **PACE stages**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. What is your research question for this data project? Later on, you will need to formulate the null and alternative hypotheses as the first step of your hypothesis test. Consider your research question now, at the start of this task.


Is there a statistically significant difference in the mean number of rides between iPhone users and Android users?

*Complete the following tasks to perform statistical analysis of your data:*

### **Task 1. Imports and data loading**




Import packages and libraries needed to compute descriptive statistics and conduct a hypothesis test.

<details>
  <summary><h4><strong>Hint:</strong></h4></summary>

Before you begin, recall the following Python packages and functions:

*Main functions*: stats.ttest_ind(a, b, equal_var)

*Other functions*: mean()

*Packages*: pandas, scipy.stats

</details>

In [20]:
# Import any relevant packages or libraries
import pandas as pd
pd.set_option('display.max_columns', None)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

Import the dataset.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [21]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

In [22]:
print(f'df.shape: {df.shape}')
print(f'df.dtypes: {df.dtypes}')
print(f'df.isnull(): {df.isnull().sum()}') # 700 missing labels

df.shape: (14999, 13)
df.dtypes: ID                           int64
label                       object
sessions                     int64
drives                       int64
total_sessions             float64
n_days_after_onboarding      int64
total_navigations_fav1       int64
total_navigations_fav2       int64
driven_km_drives           float64
duration_minutes_drives    float64
activity_days                int64
driving_days                 int64
device                      object
dtype: object
df.isnull(): ID                           0
label                      700
sessions                     0
drives                       0
total_sessions               0
n_days_after_onboarding      0
total_navigations_fav1       0
total_navigations_fav2       0
driven_km_drives             0
duration_minutes_drives      0
activity_days                0
driving_days                 0
device                       0
dtype: int64


<img src="images/Analyze.png" width="100" height="100" align=left>

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Analyze and Construct**

Consider the questions in your PACE Strategy Document and those below to craft your response:
1. Data professionals use descriptive statistics for exploratory data analysis (EDA). How can computing descriptive statistics help you learn more about your data in this stage of your analysis?


Descriptive statistics give a quick, high-level overview of the data: where its center lies (mean, median), how spread out it is (standard deviation, range), and whether there are missing or extreme values. These summaries help us decide which relationships are worth exploring and how to structure the data for the analysis that follows.
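
As a minimal sketch of what such a summary can look like, the snippet below computes overall and per-group descriptive statistics with pandas. The `toy` DataFrame and its values are made up for illustration and are not taken from the Waze dataset; only the column names `device` and `drives` mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the Waze dataset (values are made up).
toy = pd.DataFrame({
    "device": ["iPhone", "Android", "iPhone", "Android", "iPhone", "Android"],
    "drives": [10, 20, 30, 40, 50, 60],
})

# Overall summary of a numeric column: count, mean, std, min, quartiles, max.
overall = toy["drives"].describe()

# Per-group summary: center (mean, median) and spread (std) for each device.
by_device = toy.groupby("device")["drives"].agg(["count", "mean", "median", "std"])

print(overall)
print(by_device)
```

The same two calls could be applied to `df` in this notebook to compare the center and spread of `drives` across device types before running any test.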


### **Task 2. Data exploration**

Use descriptive statistics to conduct exploratory data analysis (EDA).

In [23]:
# 1. Create `map_dictionary`
map_dictionary = {'iPhone': 1, 'Android': 2}

# 2. Create new `device_type` column as a copy of `device`
new_col = 'device_type'
df[new_col] = df['device']

# 3. Map the new column to the dictionary (iPhone -> 1, Android -> 2)
df[new_col] = df[new_col].map(map_dictionary)
df.head(3)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device | device_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android | 2 |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone | 1 |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android | 2 |
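
One detail of `Series.map` with a dictionary is worth keeping in mind: any value that has no key in the dictionary becomes `NaN`. The sketch below illustrates this on a small made-up Series; the `"Windows"` label is hypothetical and does not appear in the Waze data.

```python
import pandas as pd

# Hypothetical device labels; "Windows" is made up to show an unmapped value.
devices = pd.Series(["iPhone", "Android", "Windows"])
map_dictionary = {"iPhone": 1, "Android": 2}

# Series.map looks up each value in the dictionary; missing keys become NaN.
device_type = devices.map(map_dictionary)

print(device_type.tolist())
print(device_type.isna().sum())  # number of labels that were not mapped
```

Checking `isna().sum()` on the mapped column is a cheap way to confirm that every device label was covered by the dictionary.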


You are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [24]:
df.groupby('device')['drives'].mean().reset_index()

| | device | drives |
|---|---|---|
| 0 | Android | 66.231838 |
| 1 | iPhone | 67.859078 |


Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, you can conduct a hypothesis test.


### **Task 3. Hypothesis testing**

Recall the difference between the null hypothesis ($H_0$) and the alternative hypothesis ($H_A$).

**Question:** What are your hypotheses for this data project?

$H_0$: There is no difference in the mean number of drives between iPhone and Android users.

$H_A$: There is a difference in the mean number of drives between iPhone and Android users.

In [25]:
df['device'].value_counts()

iPhone     9672
Android    5327
Name: device, dtype: int64

In [31]:
# 1. Isolate the rows for iPhone users.
iphone_mask = df['device'] == 'iPhone'
iphone = df.loc[iphone_mask]

# 2. Isolate the rows for Android users (every row that is not iPhone).
android = df.loc[~iphone_mask]
print(f'iphone.shape: {iphone.shape}')
print(f'android.shape: {android.shape}')

# 3. Perform a two-sample t-test on `drives` without assuming equal variances
tstat, pvalue = stats.ttest_ind(a=iphone['drives'], b=android['drives'], equal_var=False)
print(f'tstat: {tstat}')
print(f'pvalue: {pvalue}')

iphone.shape: (9672, 14)
android.shape: (5327, 14)
tstat: 1.4635232068852353
pvalue: 0.1433519726802059
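
Passing `equal_var=False` to `stats.ttest_ind` requests Welch's t-test, which does not assume the two groups share a variance. As a rough sketch of what that computes, the snippet below derives the statistic and a two-sided p-value by hand and compares them with SciPy's result. The samples are synthetic (drawn from a seeded random generator, not the Waze data), and `welch_t` is a helper written for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic samples (not the Waze data) with unequal sizes, as in the notebook.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=68, scale=65, size=900)
sample_b = rng.normal(loc=66, scale=66, size=500)

def welch_t(a, b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    se2_a = np.var(a, ddof=1) / len(a)  # squared standard error of mean(a)
    se2_b = np.var(b, ddof=1) / len(b)  # squared standard error of mean(b)
    t = (np.mean(a) - np.mean(b)) / np.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof

t_manual, dof = welch_t(sample_a, sample_b)
p_manual = 2 * stats.t.sf(abs(t_manual), dof)  # two-sided p-value

t_scipy, p_scipy = stats.ttest_ind(a=sample_a, b=sample_b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_manual, p_scipy))
```

Because the two groups here differ considerably in size (9,672 iPhone users versus 5,327 Android users), not assuming equal variances is a cautious default.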


**Question:** Based on the p-value you got above, do you reject or fail to reject the null hypothesis?

The p-value of about 0.14 is greater than the 5% significance level, so we fail to reject the null hypothesis.
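
The decision rule can be written out as a short sketch, assuming the 5% significance level used in the executive summary. The `decide` function is a hypothetical helper for illustration and is not part of SciPy.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare a p-value with the significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# p-value reported by the t-test in this notebook
decision = decide(0.1433519726802059, alpha=0.05)
print(decision)
```

The significance level should be fixed before looking at the p-value, so that the threshold is not tuned to the result.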

<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document to reflect on the Execute stage.

### **Task 4. Communicate insights with stakeholders**

Now that you've completed your hypothesis test, the next step is to share your findings with the Waze leadership team. Consider the following question as you prepare to write your executive summary:

* What business insight(s) can you draw from the result of your hypothesis test?

There are different ways of saying this:

The p-value of 0.14 means that, if the null hypothesis were true, there would be about a 14% probability of observing a difference in the mean number of drives at least as extreme as the one observed.

Since we chose a 5% significance level and our p-value is greater than that threshold, we fail to reject the null hypothesis. This does not prove that the two means are equal. It means that, at the 5% significance level, there is insufficient statistical evidence to conclude that the mean number of drives differs between iPhone and Android users.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.