**Kernel:** newenv (Python 3.8.17) (miniconda environment)

The **goal** is to apply descriptive statistics and hypothesis testing in Python. 

Part 1: Imports and data loading
- What data packages will be necessary for hypothesis testing?

Part 2: Conduct hypothesis testing
- How did computing descriptive statistics help us analyze our data?
- How did we formulate our null hypothesis and alternative hypothesis?

Part 3: Communicate insights with stakeholders
- What key business insight(s) emerged from our hypothesis test?
- What business recommendations do we propose based on our results?

Research question:

"Do drivers who open the application using an iPhone have the same number of drives on average as drivers who use Android devices?"

# Imports and Data Loading

In [41]:
import sys
print(sys.path)


['/Users/dylanlam/Documents/GitHub/tiktok_data_science_project', '/Users/dylanlam/miniconda3/envs/newenv/lib/python38.zip', '/Users/dylanlam/miniconda3/envs/newenv/lib/python3.8', '/Users/dylanlam/miniconda3/envs/newenv/lib/python3.8/lib-dynload', '', '/Users/dylanlam/miniconda3/envs/newenv/lib/python3.8/site-packages']


In [42]:
# Import any relevant packages or libraries
import pandas as pd
import scipy.stats as stats

In [43]:
# Load dataset into dataframe
df = pd.read_csv('waze_dataset.csv')

In general, descriptive statistics are useful because they let us quickly explore and understand large amounts of data. In this case, computing descriptive statistics helps us quickly compare the average number of drives by device type.

# Data Exploration

Note: In the dataset, device is a categorical variable with the labels iPhone and Android.

For this analysis, we encode each label as an integer. The following code assigns 1 to iPhone users and 2 to Android users, and stores the result in a new column, device_type.

Note: Creating a new variable is ideal so that we don't overwrite original data.
1. Create a dictionary called map_dictionary that contains the class labels ('Android' and 'iPhone') for keys and the values we want to convert them to (2 and 1) as values.
2. Create a new column called device_type that is a copy of the device column.
3. Use the map() method on the device_type series. Pass map_dictionary as its argument. Reassign the result back to the device_type series. When we pass a dictionary to the Series.map() method, it replaces each value in the series that matches one of the dictionary's keys with the corresponding dictionary value.

In [44]:
# 1. Create `map_dictionary`
map_dictionary = {'Android': 2, 'iPhone': 1}

# 2. Create new `device_type` column
df['device_type'] = df['device']

# 3. Map the new column to the dictionary
df['device_type'] = df['device_type'].map(map_dictionary)

df['device_type'].head()

0    2
1    1
2    2
3    1
4    2
Name: device_type, dtype: int64
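One detail of the mapping step is worth checking: `Series.map()` returns NaN for any value that has no matching key in the dictionary. The sketch below illustrates this on a small made-up series (the `toy` data and the 'Windows' label are hypothetical and do not come from the Waze dataset).

```python
import pandas as pd

# Hypothetical toy data; 'Windows' is deliberately absent from the dictionary.
toy = pd.Series(['Android', 'iPhone', 'Windows'])
map_dictionary = {'Android': 2, 'iPhone': 1}

mapped = toy.map(map_dictionary)

# Labels without a matching key become NaN, so counting NaNs
# is a quick way to confirm that every label was encoded.
n_unmapped = int(mapped.isna().sum())
print(mapped)
print("Unmapped labels:", n_unmapped)
```

On the real dataframe, the same check would be `df['device_type'].isna().sum()`; a result of 0 would indicate that every row received a code.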

We are interested in the relationship between device type and the number of drives. One approach is to look at the average number of drives for each device type. Calculate these averages.

In [45]:
df.groupby('device_type')['drives'].mean()

device_type
1    67.859078
2    66.231838
Name: drives, dtype: float64

Based on the averages shown, it appears that drivers who use an iPhone device to interact with the application have a higher number of drives on average. However, this difference might arise from random sampling, rather than being a true difference in the number of drives. To assess whether the difference is statistically significant, we can conduct a hypothesis test.
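A mean on its own says little about how much the groups overlap, so it can help to view the group size and standard deviation next to it. The following is a minimal sketch on made-up data (the `toy` dataframe is hypothetical); it also groups directly on the text labels, which `groupby` accepts without integer encoding.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    'device': ['iPhone', 'Android', 'iPhone', 'Android', 'iPhone', 'Android'],
    'drives': [10, 20, 30, 40, 50, 60],
})

# Count, mean, and sample standard deviation of drives per device label.
summary = toy.groupby('device')['drives'].agg(['count', 'mean', 'std'])
print(summary)
```

Applied to the notebook's dataframe, the equivalent call would be `df.groupby('device')['drives'].agg(['count', 'mean', 'std'])`.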


# Hypothesis Testing

Our goal is to conduct a two-sample t-test. Recall the steps for conducting a hypothesis test:
1. State the null hypothesis and the alternative hypothesis
2. Choose a significance level
3. Find the p-value
4. Reject or fail to reject the null hypothesis

Note: This is a t-test for two independent samples. This is the appropriate test since the two groups are independent (Android users vs. iPhone users).


**Hypotheses**:

$H_0$: There is no difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

$H_A$: There is a difference in average number of drives between drivers who use iPhone devices and drivers who use Androids.

We choose 5% as the significance level and proceed with a two-sample t-test.

**Technical note:**

The default for the argument equal_var in stats.ttest_ind() is True, which assumes population variances are equal. This equal variance assumption might not hold in practice (that is, there is no strong reason to assume that the two groups have the same variance); we can relax this assumption by setting equal_var to False, and stats.ttest_ind() will perform the unequal variances t-test (known as Welch's t-test).
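To make the effect of `equal_var=False` concrete, the sketch below computes Welch's t statistic and its p-value by hand and compares them with `stats.ttest_ind()`. It uses synthetic samples (the means, spreads, sizes, and seed are arbitrary assumptions, not values from the Waze data), and the variable names are illustrative.

```python
import numpy as np
import scipy.stats as stats

# Synthetic samples with different spreads (hypothetical values, fixed seed).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=68, scale=60, size=200)
group_b = rng.normal(loc=66, scale=70, size=150)

# Per-group variance of the sample mean (sample variance divided by n).
vn_a = group_a.var(ddof=1) / group_a.size
vn_b = group_b.var(ddof=1) / group_b.size

# Welch's t statistic: difference in means over the unpooled standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(vn_a + vn_b)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_welch = (vn_a + vn_b) ** 2 / (
    vn_a ** 2 / (group_a.size - 1) + vn_b ** 2 / (group_b.size - 1)
)

# Two-sided p-value from the t distribution.
p_manual = 2 * stats.t.sf(abs(t_manual), df_welch)

result = stats.ttest_ind(a=group_a, b=group_b, equal_var=False)
print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

The manual values should agree with SciPy's result up to floating-point error. Unlike the pooled test, the standard error here is never built from a single shared variance estimate.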

1. Isolate the drives column for iPhone users.
2. Isolate the drives column for Android users.
3. Perform the t-test

In [46]:
# 1. Isolate the `drives` column for iPhone users.
iPhone = df[df['device_type'] == 1]['drives']

# 2. Isolate the `drives` column for Android users.
Android = df[df['device_type'] == 2]['drives']

# 3. Perform the t-test
stats.ttest_ind(a=iPhone, b=Android, equal_var=False)

Ttest_indResult(statistic=1.463523206885235, pvalue=0.143351972680206)

Since the p-value is larger than the chosen significance level (5%), we fail to reject the null hypothesis. We conclude that there is not a statistically significant difference in the average number of drives between drivers who use iPhones and drivers who use Androids.
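The decision rule in this step can also be written out explicitly. This is a small sketch that reuses the p-value reported in the t-test output above; `alpha` and `decision` are illustrative names.

```python
# p-value copied from the t-test output above; alpha is the chosen 5% level.
alpha = 0.05
p_value = 0.143351972680206

# Reject the null hypothesis only when the p-value falls below alpha.
decision = 'reject' if p_value < alpha else 'fail to reject'
print(f"p-value = {p_value:.4f}, alpha = {alpha}: {decision} the null hypothesis")
```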


# Communicating Insights With (Hypothetical) Stakeholders

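The key business insight is that, on average, drivers who use iPhones have a similar number of drives to those who use Androids: the observed difference (about 67.9 versus 66.2 drives) is not statistically significant at the 5% level.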

One potential next step is to explore what other factors influence the variation in the number of drives, and run additional hypothesis tests to learn more about user behavior. Further, temporary changes in marketing or user interface for the Waze app may provide more data to investigate churn.