# **Waze Project**

In this project, I am an analyst assisting the Waze data analytics team on a user churn project. To get clear insights, the provided user data must be inspected and prepared for the upcoming exploratory data analysis (EDA), which is where I come in.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, I will examine the data provided and prepare it for analysis using the PACE workflow. This activity will help ensure the information is:

1.   Ready to answer questions and yield insights

2.   Ready for visualizations

3.   Ready for future hypothesis testing and statistical methods
<br/>

**The purpose** of this project is to investigate and understand the data provided.

**The goal** is to use a dataframe constructed within Python, perform a cursory inspection of the provided dataset, and inform team members of my findings.
<br/>

*This activity has three parts:*

**Part 1:** Understand the situation
* How can I best prepare to understand and organize the provided information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning, future exploratory data analysis (EDA), and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from my examination of the summary data to guide deeper investigation into variables




# **Identify data types and compile summary information**


<img src="data/images/Pace.png" width="100" height="100" align=left>

# **PACE stages**

Throughout the project, you'll see references to the problem-solving framework, PACE. The following notebook components are labeled with the respective PACE stages: Plan, Analyze, Construct, and Execute.

<img src="data/images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**


### **Task 1. Understand the situation**

*   How can I best prepare to understand and organize the provided driver data?


*Begin by exploring the dataset and consider reviewing the Data Dictionary.*

A solid understanding of the columns and rows, of what the data contains, and of why it was collected will aid in doing a proper analysis.

<img src="data/images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**


### **Task 2a. Imports and data loading**


In [22]:
# Import packages for data manipulation
import pandas as pd
import numpy as np

In [5]:
# Load dataset into dataframe
df = pd.read_csv('Data/P1_waze_dataset.csv')

### **Task 2b. Summary information**

View and inspect summary information about the dataframe by **coding the following:**

1.   df.head(10)
2.   df.info()

*Consider the following questions:*

1. When reviewing the `df.head()` output, are there any variables that have missing values?

2. When reviewing the `df.info()` output, what are the data types? How many rows and columns are there?

3. Does the dataset have any missing values?

In [41]:
df.head(10)

| | ID | label | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days | device |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | retained | 283 | 226 | 296.748273 | 2276 | 208 | 0 | 2628.845068 | 1985.775061 | 28 | 19 | Android |
| 1 | 1 | retained | 133 | 107 | 326.896596 | 1225 | 19 | 64 | 13715.92055 | 3160.472914 | 13 | 11 | iPhone |
| 2 | 2 | retained | 114 | 95 | 135.522926 | 2651 | 0 | 0 | 3059.148818 | 1610.735904 | 14 | 8 | Android |
| 3 | 3 | retained | 49 | 40 | 67.589221 | 15 | 322 | 7 | 913.591123 | 587.196542 | 7 | 3 | iPhone |
| 4 | 4 | retained | 84 | 68 | 168.24702 | 1562 | 166 | 5 | 3950.202008 | 1219.555924 | 27 | 18 | Android |
| 5 | 5 | retained | 113 | 103 | 279.544437 | 2637 | 0 | 0 | 901.238699 | 439.101397 | 15 | 11 | iPhone |
| 6 | 6 | retained | 3 | 2 | 236.725314 | 360 | 185 | 18 | 5249.172828 | 726.577205 | 28 | 23 | iPhone |
| 7 | 7 | retained | 39 | 35 | 176.072845 | 2999 | 0 | 0 | 7892.052468 | 2466.981741 | 22 | 20 | iPhone |
| 8 | 8 | retained | 57 | 46 | 183.532018 | 424 | 0 | 26 | 2651.709764 | 1594.342984 | 25 | 20 | Android |
| 9 | 9 | churned | 84 | 68 | 244.802115 | 2997 | 72 | 0 | 6043.460295 | 2341.838528 | 7 | 3 | iPhone |


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14999 entries, 0 to 14998
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   ID                       14999 non-null  int64  
 1   label                    14299 non-null  object 
 2   sessions                 14999 non-null  int64  
 3   drives                   14999 non-null  int64  
 4   total_sessions           14999 non-null  float64
 5   n_days_after_onboarding  14999 non-null  int64  
 6   total_navigations_fav1   14999 non-null  int64  
 7   total_navigations_fav2   14999 non-null  int64  
 8   driven_km_drives         14999 non-null  float64
 9   duration_minutes_drives  14999 non-null  float64
 10  activity_days            14999 non-null  int64  
 11  driving_days             14999 non-null  int64  
 12  device                   14999 non-null  object 
dtypes: float64(3), int64(8), object(2)
memory usage: 1.5+ MB


When reviewing the `df.head()` output, are there any variables that have missing values?
- None of the variables in the first 10 rows have missing values. There are zeroes in the `total_navigations_fav1` and `total_navigations_fav2` columns, but those values are most likely valid.

When reviewing the `df.info()` output, what are the data types? How many rows and columns are there?
- The data types are integers (`int64`), floats (`float64`), and objects
- There are 14,999 rows and 13 columns

Does the dataset have any missing values?
- Yes. The `label` column has 14,299 non-null values while every other column has 14,999, so 700 labels are missing
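
The row, column, and missing-value counts can also be read directly from the dataframe instead of from the printed `df.info()` output. The sketch below uses a small made-up dataframe (`toy_df` is a hypothetical stand-in, not the Waze data); on the real data the same calls would presumably be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Waze dataframe (values are made up)
toy_df = pd.DataFrame({
    'ID': [0, 1, 2, 3],
    'label': ['retained', np.nan, 'churned', 'retained'],
    'device': ['Android', 'iPhone', 'iPhone', 'Android'],
})

# Number of rows and columns
n_rows, n_cols = toy_df.shape

# Number of missing values in each column
missing_per_column = toy_df.isnull().sum()

print(n_rows, n_cols)
print(missing_per_column)
```

Columns with a count of zero have no missing values, which makes a column like `label` stand out immediately.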


### **Task 2c. Null values and summary statistics**

Compare the summary statistics of the 700 rows that are missing labels with summary statistics of the rows that are not missing any values.

**Question:** Is there a discernible difference between the two populations?


In [25]:
# Isolate rows with null values
# `null_values` is a boolean mask, so it has to be applied as df[null_values];
# `df_null_values` is the dataframe already filtered by that mask, so both select the same rows.
null_values = df['label'].isnull()
df_null_values = df[null_values]

# Display summary stats of rows with null values
df_null_values.describe()


| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 | 700.0 |
| mean | 7405.584286 | 80.837143 | 67.798571 | 198.483348 | 1709.295714 | 118.717143 | 30.371429 | 3935.967029 | 1795.123358 | 15.382857 | 12.125714 |
| std | 4306.900234 | 79.98744 | 65.271926 | 140.561715 | 1005.306562 | 156.30814 | 46.306984 | 2443.107121 | 1419.242246 | 8.772714 | 7.626373 |
| min | 77.0 | 0.0 | 0.0 | 5.582648 | 16.0 | 0.0 | 0.0 | 290.119811 | 66.588493 | 0.0 | 0.0 |
| 25% | 3744.5 | 23.0 | 20.0 | 94.05634 | 869.0 | 4.0 | 0.0 | 2119.344818 | 779.009271 | 8.0 | 6.0 |
| 50% | 7443.0 | 56.0 | 47.5 | 177.255925 | 1650.5 | 62.5 | 10.0 | 3421.156721 | 1414.966279 | 15.0 | 12.0 |
| 75% | 11007.0 | 112.25 | 94.0 | 266.058022 | 2508.75 | 169.25 | 43.0 | 5166.097373 | 2443.955404 | 23.0 | 18.0 |
| max | 14993.0 | 556.0 | 445.0 | 1076.879741 | 3498.0 | 1096.0 | 352.0 | 15135.39128 | 9746.253023 | 31.0 | 30.0 |


In [9]:
# Isolate rows without null values
df_no_null_values = df[df['label'].notnull()]

# Display summary stats of rows without null values
df_no_null_values.describe()

| | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 | 14299.0 |
| mean | 7503.573117 | 80.62382 | 67.255822 | 189.547409 | 1751.822505 | 121.747395 | 29.638296 | 4044.401535 | 1864.199794 | 15.544653 | 12.18253 |
| std | 4331.207621 | 80.736502 | 65.947295 | 136.189764 | 1008.663834 | 147.713428 | 45.35089 | 2504.97797 | 1448.005047 | 9.016088 | 7.833835 |
| min | 0.0 | 0.0 | 0.0 | 0.220211 | 4.0 | 0.0 | 0.0 | 60.44125 | 18.282082 | 0.0 | 0.0 |
| 25% | 3749.5 | 23.0 | 20.0 | 90.457733 | 878.5 | 10.0 | 0.0 | 2217.319909 | 840.181344 | 8.0 | 5.0 |
| 50% | 7504.0 | 56.0 | 48.0 | 158.718571 | 1749.0 | 71.0 | 9.0 | 3496.545617 | 1479.394387 | 16.0 | 12.0 |
| 75% | 11257.5 | 111.0 | 93.0 | 253.54045 | 2627.5 | 178.0 | 43.0 | 5299.972162 | 2466.928876 | 23.0 | 19.0 |
| max | 14998.0 | 743.0 | 596.0 | 1216.154633 | 3500.0 | 1236.0 | 415.0 | 21183.40189 | 15851.72716 | 31.0 | 30.0 |


Is there a discernible difference between the two populations?
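- There are small differences between the rows with and without a label, but the means and standard deviations are close for every variable, so there is no discernible difference between the two populations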
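
One way to put a number on how close the two populations are is to compare their means variable by variable. The sketch below copies the `mean` rows from the two `describe()` outputs above (ID is left out because it is only an identifier); the column names `null_label`, `has_label`, and `relative_diff` are introduced here for illustration.

```python
import pandas as pd

# Means copied from the two describe() outputs above
means = pd.DataFrame(
    {
        'null_label': [80.837143, 67.798571, 198.483348, 1709.295714, 118.717143,
                       30.371429, 3935.967029, 1795.123358, 15.382857, 12.125714],
        'has_label': [80.62382, 67.255822, 189.547409, 1751.822505, 121.747395,
                      29.638296, 4044.401535, 1864.199794, 15.544653, 12.18253],
    },
    index=['sessions', 'drives', 'total_sessions', 'n_days_after_onboarding',
           'total_navigations_fav1', 'total_navigations_fav2', 'driven_km_drives',
           'duration_minutes_drives', 'activity_days', 'driving_days'],
)

# Relative difference of the null-label mean from the labeled mean
means['relative_diff'] = (means['null_label'] - means['has_label']) / means['has_label']

print(means['relative_diff'].round(3))
```

Every relative difference stays under 5% in absolute value. That is not a formal test, only a quick sanity check on the visual comparison.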

### **Task 2d. Null values - device counts**

Next, check the two populations with respect to the `device` variable.

**Question:** How many iPhone users had null values and how many Android users had null values?

In [10]:
# Get count of null values by device
df_null_values['device'].value_counts()

iPhone     447
Android    253
Name: device, dtype: int64

> 447 iPhone users and 253 Android users from a total of 700 users with null values

Now, of the rows with null values, calculate the percentage with each device&mdash;Android and iPhone.

In [11]:
# Calculate % of iPhone nulls and Android nulls
null_iphone_users = (df['device'] == 'iPhone') & (df['label'].isnull())
null_android_users = (df['device'] == 'Android') & (df['label'].isnull())

df_null_values['device'].value_counts(normalize=True)

iPhone     0.638571
Android    0.361429
Name: device, dtype: float64

In [12]:
df[null_iphone_users].value_counts(normalize=True)

Series([], dtype: float64)
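
The empty result above is most likely explained by how `DataFrame.value_counts()` works: it counts unique rows and, by default, skips any row that contains a missing value. Every row selected by `null_iphone_users` has a null `label`, so nothing is left to count. A minimal sketch of that behaviour on a made-up dataframe (`toy_df` is hypothetical, not the Waze data):

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows with a missing label, one without
toy_df = pd.DataFrame({
    'label': [np.nan, np.nan, 'retained'],
    'device': ['iPhone', 'Android', 'iPhone'],
})
null_rows = toy_df['label'].isnull()

# Counting whole rows: rows containing NaN are dropped by default, so nothing is left
row_counts = toy_df[null_rows].value_counts(normalize=True)

# Counting a single column that has no missing values works as expected
device_share = toy_df.loc[null_rows, 'device'].value_counts(normalize=True)

print(len(row_counts))
print(device_share)
```

Selecting the `device` column before calling `value_counts()` is the same approach the previous cell takes with `df_null_values['device']`.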

How does this compare to the device ratio in the full dataset?

In [13]:
# Calculate % of iPhone users and Android users in full dataset
df['device'].value_counts(normalize=True)

iPhone     0.644843
Android    0.355157
Name: device, dtype: float64

The percentage of missing values by each device is consistent with their representation in the data overall.

There is nothing to suggest a non-random cause of the missing data.
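
As a rough check on that claim, the device counts among the missing-label rows can be compared with the counts expected under the overall device split. The sketch below uses a chi-square goodness-of-fit statistic computed by hand from the numbers printed above. This test is not part of the original analysis, and using the full-dataset shares as the expected split is a simplifying assumption, since the full dataset includes the 700 rows being tested.

```python
# Device counts among the 700 rows with a missing label (from the output above)
observed = {'iPhone': 447, 'Android': 253}

# Device shares in the full dataset (from the output above)
overall_share = {'iPhone': 0.644843, 'Android': 0.355157}

n_missing = sum(observed.values())

# Expected counts if missing labels followed the overall device split
expected = {device: n_missing * share for device, share in overall_share.items()}

# Chi-square goodness-of-fit statistic (1 degree of freedom for two categories)
chi_square = sum(
    (observed[device] - expected[device]) ** 2 / expected[device]
    for device in observed
)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree of freedom
CRITICAL_VALUE_5_PERCENT = 3.841
print(round(chi_square, 3), chi_square < CRITICAL_VALUE_5_PERCENT)
```

The statistic comes out far below the critical value, which is consistent with the missing labels being spread across devices in the same proportions as the rest of the data.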

Examine the counts and percentages of users who churned vs. those who were retained. How many of each group are represented in the data?

In [14]:
# Calculate counts of churned vs. retained
print(df['label'].value_counts())
print()
print(df['label'].value_counts(normalize=True))

retained    11763
churned      2536
Name: label, dtype: int64

retained    0.822645
churned     0.177355
Name: label, dtype: float64


This dataset contains 82% retained users and 18% churned users.

Next, compare the medians of each variable for churned and retained users. The reason for calculating the median and not the mean is that you don't want outliers to unduly affect the portrayal of a typical user. Notice, for example, that the maximum value in the `driven_km_drives` column is 21,183 km. That's more than half the circumference of the earth!
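
To illustrate why the median is preferred here, the sketch below uses a handful of made-up monthly distances (not taken from the dataset, apart from borrowing the 21,183 km maximum as the outlier) and shows how a single extreme value moves the mean but not the median.

```python
import pandas as pd

# Made-up monthly distances in km; the last value is an extreme outlier
km = pd.Series([3000, 3500, 4000, 21183])

print(km.mean())    # pulled upward by the outlier
print(km.median())  # stays near the typical values
```

With these values the mean is 7920.75 km while the median is 3750.0 km, so the median is the better description of a typical user in this toy sample.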

In [30]:
# Calculate median values of all columns for churned and retained users
df.groupby('label').median(numeric_only=True)

| label | ID | sessions | drives | total_sessions | n_days_after_onboarding | total_navigations_fav1 | total_navigations_fav2 | driven_km_drives | duration_minutes_drives | activity_days | driving_days |
|---|---|---|---|---|---|---|---|---|---|---|---|
| churned | 7477.5 | 59.0 | 50.0 | 164.339042 | 1321.0 | 84.5 | 11.0 | 3652.655666 | 1607.183785 | 8.0 | 6.0 |
| retained | 7509.0 | 56.0 | 47.0 | 157.586756 | 1843.0 | 68.0 | 9.0 | 3464.684614 | 1458.046141 | 17.0 | 14.0 |


In [31]:
# `device` is non-numeric, so drop it before aggregating
df.drop(columns='device').groupby('label').agg(['median', 'mean'])


| label | ID median | ID mean | sessions median | sessions mean | drives median | drives mean | total_sessions median | total_sessions mean | n_days_after_onboarding median | n_days_after_onboarding mean | ... | total_navigations_fav2 median | total_navigations_fav2 mean | driven_km_drives median | driven_km_drives mean | duration_minutes_drives median | duration_minutes_drives mean | activity_days median | activity_days mean | driving_days median | driving_days mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| churned | 7477.5 | 7544.852918 | 59.0 | 87.238959 | 50.0 | 72.730678 | 164.339042 | 196.893424 | 1321.0 | 1471.027603 | ... | 11.0 | 31.596609 | 3652.655666 | 4147.171864 | 1607.183785 | 1975.45963 | 8.0 | 9.644716 | 6.0 | 7.21806 |
| retained | 7509.0 | 7494.673553 | 56.0 | 79.197654 | 47.0 | 66.075491 | 157.586756 | 187.963672 | 1843.0 | 1812.359432 | ... | 9.0 | 29.216101 | 3464.684614 | 4022.24515 | 1458.046141 | 1840.213146 | 17.0 | 16.816628 | 14.0 | 13.252827 |


This offers an interesting snapshot of the two groups, churned vs. retained:

The median churned user had ~3 more drives in the last month than the median retained user (50 vs. 47), but the median retained user used the app on over twice as many days (17 vs. 8) in the same time period.

The median churned user drove ~200 more kilometers and 2.5 more hours during the last month than the median retained user.

It seems that churned users had more drives in fewer days, and their trips were farther and longer in duration. Perhaps this is suggestive of a user profile.

Calculate the median kilometers per drive in the last month for both retained and churned users.

In [39]:
# Group data by `label` and calculate the medians
medians_by_label = df.groupby(['label']).median(numeric_only=True)

# Divide the median distance by median number of drives
medians_by_label['driven_km_drives'] / medians_by_label['drives']


label
churned     73.053113
retained    73.716694
dtype: float64

The median user from both groups drove ~73 km/drive. How many kilometers per driving day was this?

In [33]:
# Divide the median distance by median number of driving days
print ('Median number of km driven per median number of driving days:')
medians_by_label['driven_km_drives'] / medians_by_label['driving_days']

Median number of km driven per median number of driving days:


label
churned     608.775944
retained    247.477472
dtype: float64

Now, calculate the median number of drives per driving day for each group.

In [34]:
# Divide the median number of drives by median number of driving days
print ('Median number of drives per median number of driving days:')
medians_by_label['drives'] / medians_by_label['driving_days']

Median number of drives per median number of driving days:


label
churned     8.333333
retained    3.357143
dtype: float64

The median user who churned drove about 608 kilometers on each day they drove last month, which is almost 250% of the per-driving-day distance of retained users. The median churned user also had a similarly disproportionate number of drives per driving day compared to retained users.

It is clear from these figures that, regardless of whether a user churned or not, the users represented in this data are serious drivers! It would probably be safe to assume that this data does not represent typical drivers at large. Perhaps the data&mdash;and in particular the sample of churned users&mdash;contains a high proportion of long-haul truckers.

In consideration of how much these users drive, it would be worthwhile to recommend to Waze that they gather more data on these super-drivers. It's possible that the reason for their driving so much is also the reason why the Waze app does not meet their specific set of needs, which may differ from the needs of a more typical driver, such as a commuter.
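
The per-driving-day figures above divide one group median by another, which is not the same as the median of each user's own ratio. An alternative, sketched below on made-up data, computes the ratio per user first and then takes the median by group. `toy_df` and the column name `km_per_driving_day` are hypothetical, and treating users with zero driving days as missing is an assumption about how such rows should be handled.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Waze dataframe (values are made up)
toy_df = pd.DataFrame({
    'label': ['churned', 'churned', 'churned', 'retained', 'retained', 'retained'],
    'driven_km_drives': [3000.0, 4000.0, 500.0, 2000.0, 3600.0, 2800.0],
    'driving_days': [5, 8, 0, 10, 18, 14],
})

# Per-user ratio; users with zero driving days produce infinite values
toy_df['km_per_driving_day'] = toy_df['driven_km_drives'] / toy_df['driving_days']

# Treat the infinite values as missing so that median() skips them
toy_df['km_per_driving_day'] = toy_df['km_per_driving_day'].replace([np.inf, -np.inf], np.nan)

# Median of the per-user ratios for each group
median_ratio_by_label = toy_df.groupby('label')['km_per_driving_day'].median()
print(median_ratio_by_label)
```

On the real data the two approaches may give different numbers, so it would be worth stating which one is being reported.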

Finally, examine whether there is an imbalance in how many users churned by device type.

Begin by getting the overall counts of each device type for each group, churned and retained.

In [35]:
# For each label, calculate the number of Android users and iPhone users
df.groupby(['label', 'device']).size()

label     device 
churned   Android     891
          iPhone     1645
retained  Android    4183
          iPhone     7580
dtype: int64

Now, within each group, churned and retained, calculate what percent was Android and what percent was iPhone.

In [36]:
# For each label, calculate the percentage of Android users and iPhone users
df.groupby('label')['device'].value_counts(normalize=True)

label     device 
churned   iPhone     0.648659
          Android    0.351341
retained  iPhone     0.644393
          Android    0.355607
Name: device, dtype: float64

The ratio of iPhone users and Android users is consistent between the churned group and the retained group, and those ratios are both consistent with the ratio found in the overall dataset.
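
The percentages above describe the device mix within each label group. To read the same counts the other way, as a churn rate per device, the sketch below rebuilds the `groupby(['label', 'device']).size()` output as a small table; the names `counts` and `churn_rate` are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the groupby(['label', 'device']).size() output above
counts = pd.DataFrame(
    {'churned': [891, 1645], 'retained': [4183, 7580]},
    index=['Android', 'iPhone'],
)

# Share of each device's labeled users who churned
churn_rate = counts['churned'] / (counts['churned'] + counts['retained'])
print(churn_rate.round(3))
```

Both devices come out at roughly 18% churn among labeled users, which matches the conclusion that device type makes little difference.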

<img src="data/images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.



<img src="data/images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**


### **Task 3. Conclusion**

What findings would I share with the data team in an executive summary?

**Questions:**

1. Did the data contain any missing values? How many, and which variables were affected? Was there a pattern to the missing data?

2. What is a benefit of using the median value of a sample instead of the mean?

3. Did the investigation give rise to further questions that I would like to explore or ask the Waze team about?

4. What percentage of the users in the dataset were Android users and what percentage were iPhone users?

5. What were some distinguishing characteristics of users who churned vs. users who were retained?

6. Was there an appreciable difference in churn rate between iPhone users vs. Android users?





> 1. Yes, the data contained missing values. Only the `label` column was affected, with 700 missing values in total, and there is no obvious pattern to the missing values.
> 2. The mean is the sum of all values divided by the number of values, so outliers can skew it. The median is much less sensitive to outliers and gives a better idea of what a typical user looks like.
> 3. Yes, it would help to have more data to work with. From the analysis so far, it seems that many of the churned users may be long-distance drivers, e.g. truckers. The median churned user drove about 608 km per driving day, almost 250% of the figure for the median retained user.
> 4. About 64.5% were iPhone users and 35.5% were Android users.
> 5. Users who churned drove more kilometers than retained users, and did so on fewer driving days. Additionally, they used the app on about half as many days as retained users over the same period.
> 6. No, there was not. The iPhone/Android split was nearly identical in the churned and retained groups.