In [1]:
import pandas as pd

# Load the dataset
uber_data = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx")

# Display the first few rows of the dataframe to understand its structure
uber_data.head()

Unnamed: 0,Innovation at Uber: The Launch of Express POOL
0,Harvard Business School Case 619-003
1,Courseware: 619-702
2,"REV: August 6, 2020"
3,This courseware was prepared solely as the ba...
4,


# Module 1: Uber Pool Case Study

In this analysis, we explore the "Express POOL" feature offered by Uber. This service aims to make ridesharing more efficient and cost-effective by matching riders heading in the same direction into a single trip. Our dataset provides a wealth of information that can be leveraged to understand various aspects of the service, including:

- **Wait Times**: How the introduction of Express POOL has affected the time riders wait for their ride to begin.
- **Rider Cancellations**: The rate at which riders cancel their Express POOL rides, and potential reasons for these cancellations.
- **Driver Payouts**: How Express POOL impacts the earnings of drivers compared to other Uber services.
- **Trip Matching Effectiveness**: The efficiency of the algorithm in matching riders with similar routes and the impact on the overall trip duration.

## Data Analysis

We will use Python and Jupyter Notebook for this data analysis. The approach will involve cleaning the data, performing exploratory data analysis (EDA), and applying statistical and machine learning models to draw insights.

### Preliminary Steps

1. **Data Cleaning**: Handling missing values, outliers, and incorrect data entries to ensure the quality of our analysis.
2. **Feature Engineering**: Creating new variables that might be more indicative of the phenomena we are studying, such as the time of day, day of the week, and weather conditions.

### Exploratory Data Analysis (EDA)

- Visualizing distribution of wait times, cancellations, driver payouts, and other relevant metrics.
- Identifying patterns and correlations between different variables in our dataset.

### Statistical Analysis

- Testing hypotheses about the impact of Express POOL on wait times, cancellations, and driver payouts.
- Using regression analysis to quantify the relationship between trip matching effectiveness and overall trip satisfaction.



## Findings

Our findings will be discussed in this section, supported by visualizations and statistical analyses. We aim to provide actionable insights into how the Express POOL feature affects riders, drivers, and the overall efficiency of the Uber platform.

## Conclusion

This section will summarize the key takeaways from our analysis and suggest recommendations for Uber to improve the Express POOL service.



In [31]:
# Load the "Switchbacks" sheet 
switchbacks_data = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx", sheet_name="Switchbacks")

# Display the first few rows 
switchbacks_data.head()

Unnamed: 0,city_id,period_start,wait_time,treat,commute,trips_pool,trips_express,rider_cancellations,total_driver_payout,total_matches,total_double_matches
0,Boston,2018-02-19 07:00:00,2 mins,False,True,1415,3245,256,34458.411634,3372,1476
1,Boston,2018-02-19 09:40:00,5 mins,True,False,1461,2363,203,29764.349821,2288,1275
2,Boston,2018-02-19 12:20:00,2 mins,False,False,1362,2184,118,27437.367363,2283,962
3,Boston,2018-02-19 15:00:00,5 mins,True,True,1984,3584,355,44995.452993,4035,2021
4,Boston,2018-02-19 17:40:00,2 mins,False,False,1371,2580,181,27583.955295,2200,979


In [32]:
# Convert 'period_start' to datetime to extract hours for commuting definition
switchbacks_data['hour'] = switchbacks_data['period_start'].dt.hour

# Define commuting hours based on typical rush hours (morning: 7 AM to 9 AM, evening: 4 PM to 6 PM)
commuting_hours = (switchbacks_data['hour'] >= 7) & (switchbacks_data['hour'] <= 9) | \
                  (switchbacks_data['hour'] >= 16) & (switchbacks_data['hour'] <= 18)

# Filter for control group (treat = False) and separate data into commuting and non-commuting hours
control_data = switchbacks_data[switchbacks_data['treat'] == False]
commuting_data_control = control_data[commuting_hours]
non_commuting_data_control = control_data[~commuting_hours]

# Calculate average metrics for commuting and non-commuting hours in the control group
avg_metrics_commuting = commuting_data_control[['trips_pool', 'trips_express', 
                                                'rider_cancellations', 'total_driver_payout', 
                                                'total_matches', 'total_double_matches']].mean()

avg_metrics_non_commuting = non_commuting_data_control[['trips_pool', 'trips_express', 
                                                        'rider_cancellations', 'total_driver_payout', 
                                                        'total_matches', 'total_double_matches']].mean()

# Display the average metrics for comparison
avg_metrics_comparison = pd.DataFrame({'Commuting': avg_metrics_commuting, 
                                       'Non-Commuting': avg_metrics_non_commuting})

avg_metrics_comparison


  commuting_data_control = control_data[commuting_hours]
  non_commuting_data_control = control_data[~commuting_hours]


Unnamed: 0,Commuting,Non-Commuting
trips_pool,1429.047619,1318.452381
trips_express,2762.714286,2536.142857
rider_cancellations,179.666667,158.190476
total_driver_payout,30862.846897,28505.434934
total_matches,2774.52381,2562.547619
total_double_matches,1354.52381,1253.47619


## Question 1:
# Part 1: Analysis of Control Group: Commuting vs. Non-Commuting Hours

We divided our analysis into two parts based on the time of the trips: commuting hours and non-commuting hours. Here, we present the average metrics observed for the control group, which consists of trips with an initial wait time of 2 minutes.

### Trips

- **POOL Trips**:
  - Commuting Hours: On average, there are about **1,429** trips.
  - Non-Commuting Hours: On average, there are about **1,318** trips.
- **Express Trips**:
  - Commuting Hours: On average, there are about **2,763** trips.
  - Non-Commuting Hours: On average, there are about **2,536** trips.

### Rider Cancellations

- During commuting hours, there are, on average, **180** cancellations.
- During non-commuting hours, there are, on average, **158** cancellations.

### Total Driver Payout

- The average payout to drivers during commuting hours is approximately **$30,863**.
- The average payout during non-commuting hours is about **$28,505**.

### Total Matches

- There are on average **2,775** matches during commuting hours.
- There are on average **2,563** matches during non-commuting hours.

### Total Double Matches

- Commuting hours see on average about **1,355** double matches.
- Non-commuting hours have about **1,253** double matches.

These metrics provide insights into the behavior of riders and the effectiveness of the Express POOL feature during different times of the day. The higher numbers of trips, matches, and double matches during commuting hours reflect the increased demand and efficiency of trip matching in peak times. Conversely, the slight decrease in these metrics during non-commuting hours suggests a different usage pattern, potentially with less emphasis on efficiency and more on convenience or other factors.

These results suggest that, in the control group, commuting hours generally see higher activity across all metrics compared to non-commuting hours. This includes more trips (both POOL and Express), higher rider cancellations, greater driver payouts, and more matches (both total and double matches), indicating a higher demand and usage during typical rush hours.


# Questiion 1 
## Part 2: Estimate the effect of extending waiting times from 2 minutes (control group) to 5 minutes (treatment group) separately for commuting and non-commuting hours. 

In [35]:
import pandas as pd

# Load the dataset
df = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx", sheet_name="Switchbacks")

df['wait_time_minutes'] = df['wait_time'].str.extract('(\d+)').astype(int)

# Filter data for control and treatment groups separately for commuting and non-commuting hours
control_commute = df[(df['treat'] == False) & (df['commute'] == True)]
treatment_commute = df[(df['treat'] == True) & (df['commute'] == True)]
control_non_commute = df[(df['treat'] == False) & (df['commute'] == False)]
treatment_non_commute = df[(df['treat'] == True) & (df['commute'] == False)]


# Calculate mean wait time differences for commuting and non-commuting hours
mean_wait_time_diff_commute = treatment_commute['wait_time_minutes'].mean() - control_commute['wait_time_minutes'].mean()
mean_wait_time_diff_non_commute = treatment_non_commute['wait_time_minutes'].mean() - control_non_commute['wait_time_minutes'].mean()

mean_wait_time_diff_commute, mean_wait_time_diff_non_commute


(3.0, 3.0)

## Effect of Extending Waiting Times

The analysis investigates the impact of extending waiting times from 2 minutes (control group) to 5 minutes (treatment group) on rider wait times. This comparison reveals a consistent increase in wait time of 3 minutes across both commuting and non-commuting hours.

- **Control Group Wait Time**: 2 minutes
- **Treatment Group Wait Time**: 5 minutes

This finding underscores that the extension of waiting times by 3 minutes affects both commuting and non-commuting periods equally. The uniform increase across different times of day suggests that the change in policy impacts rider experience consistently, regardless of the commuting context.


In [38]:

%pip install statsmodels



Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [36]:
# Convert 'wait_time' to int
df['wait_time'] = df['wait_time'].apply(lambda x: 'control' if x == '2 mins' else 'treatment')

# Separate the data into four groups based on commuting hours and wait time
control_commute = df[(df['commute'] == True) & (df['wait_time'] == 'control')]
treatment_commute = df[(df['commute'] == True) & (df['wait_time'] == 'treatment')]
control_non_commute = df[(df['commute'] == False) & (df['wait_time'] == 'control')]
treatment_non_commute = df[(df['commute'] == False) & (df['wait_time'] == 'treatment')]

# Calculate the mean of relevant metrics for each group
metrics = ['rider_cancellations', 'total_driver_payout', 'total_matches', 'total_double_matches']
results = {}

for metric in metrics:
    results[f'{metric}_control_commute'] = control_commute[metric].mean()
    results[f'{metric}_treatment_commute'] = treatment_commute[metric].mean()
    results[f'{metric}_control_non_commute'] = control_non_commute[metric].mean()
    results[f'{metric}_treatment_non_commute'] = treatment_non_commute[metric].mean()

results


{'rider_cancellations_control_commute': 246.9,
 'rider_cancellations_treatment_commute': 303.2,
 'rider_cancellations_control_non_commute': 149.96226415094338,
 'rider_cancellations_treatment_non_commute': 168.79245283018867,
 'total_driver_payout_control_commute': 39524.4226115161,
 'total_driver_payout_treatment_commute': 35744.22831054959,
 'total_driver_payout_control_non_commute': 27360.44954639928,
 'total_driver_payout_treatment_non_commute': 25567.914480160143,
 'total_matches_control_commute': 3789.3,
 'total_matches_treatment_commute': 3474.4,
 'total_matches_control_non_commute': 2415.0754716981132,
 'total_matches_treatment_non_commute': 2242.811320754717,
 'total_double_matches_control_commute': 1794.3,
 'total_double_matches_treatment_commute': 1807.8,
 'total_double_matches_control_non_commute': 1191.4716981132076,
 'total_double_matches_treatment_non_commute': 1272.811320754717}

## Analysis Results: Control vs. Treatment Group During Commuting and Non-Commuting Hours

The analysis provides average values for key metrics in the control group (2-minute wait times) versus the treatment group (5-minute wait times), segmented into commuting and non-commuting hours.

### During Commuting Hours:

#### Rider Cancellations
- **Control**: 246.9
- **Treatment**: 303.2

#### Total Driver Payout (USD)
- **Control**: $39,524.42
- **Treatment**: $35,744.23

#### Total Matches
- **Control**: 3789.3
- **Treatment**: 3474.4

#### Total Double Matches
- **Control**: 1794.3
- **Treatment**: 1807.8

### During Non-Commuting Hours:

#### Rider Cancellations
- **Control**: 149.96
- **Treatment**: 168.79

#### Total Driver Payout (USD)
- **Control**: $27,360.45
- **Treatment**: $25,567.91

#### Total Matches
- **Control**: 2415.08
- **Treatment**: 2242.81

#### Total Double Matches
- **Control**: 1191.47
- **Treatment**: 1272.81

### Observations:
From this analysis, the extension of wait times from 2 minutes to 5 minutes (treatment group) results in several notable trends:

- **Increased Rider Cancellations**: Both during commuting and non-commuting hours, there's a noticeable rise in cancellations.
- **Decreased Total Driver Payout**: This trend during both time periods suggests that drivers might be earning less as a result of extended wait times.
- **Decreased Total Matches**: The reduction in total matches during both commuting and non-commuting hours indicates a potential drop in the efficiency of trip matching.
- **Increase in Total Double Matches**: Interestingly, we observe a slight increase in total double matches during commuting hours for the treatment group, as well as during non-commuting hours. This suggests that extended wait times might facilitate more double matches under certain conditions, potentially improving carpooling efficiency despite other negative impacts.

These findings highlight the complex effects of extending waiting times on different aspects of the ride-sharing service.


In [41]:
import statsmodels.api as sm
# Correct approach for regression analysis to estimate the effect of extending waiting times

# Convert 'wait_time' to a binary variable where 1 represents the treatment group (5 mins) and 0 the control group (2 mins)
df['wait_time_binary'] = df['wait_time'].apply(lambda x: 1 if x == '5 mins' else 0)

# Convert 'commute' to a binary variable where 1 represents commuting hours and 0 represents non-commuting hours
df['commute_binary'] = df['commute'].apply(lambda x: 1 if x else 0)

# Define interaction term for treatment and commute
df['treatment_commute_interaction'] = df['wait_time_binary'] * df['commute_binary']

# Define independent variables with interaction term
X = df[['wait_time_binary', 'commute_binary', 'treatment_commute_interaction']]
X = sm.add_constant(X)  # Add a constant to the model

# Define dependent variables
y_rider_cancellations = df['rider_cancellations']
y_total_driver_payout = df['total_driver_payout']

# Regression for rider cancellations with interaction term
model_cancellations_interaction = sm.OLS(y_rider_cancellations, X).fit()

# Regression for driver payouts with interaction term
model_payouts_interaction = sm.OLS(y_total_driver_payout, X).fit()


model_cancellations_interaction_summary = model_cancellations_interaction.summary()
model_payouts_interaction_summary = model_payouts_interaction.summary()

model_cancellations_interaction_summary, model_payouts_interaction_summary


  return np.sqrt(eigvals[0]/eigvals[-1])
  return np.sqrt(eigvals[0]/eigvals[-1])


(<class 'statsmodels.iolib.summary.Summary'>
 """
                              OLS Regression Results                            
 Dep. Variable:     rider_cancellations   R-squared:                       0.676
 Model:                             OLS   Adj. R-squared:                  0.673
 Method:                  Least Squares   F-statistic:                     258.2
 Date:                 Mon, 18 Mar 2024   Prob (F-statistic):           4.23e-32
 Time:                         13:46:09   Log-Likelihood:                -604.32
 No. Observations:                  126   AIC:                             1213.
 Df Residuals:                      124   BIC:                             1218.
 Df Model:                            1                                         
 Covariance Type:             nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
 ----------------------------------------

## Regression Analysis: Effect of Extended Waiting Times on Rider Cancellations and Driver Payouts

This regression analysis aims to estimate the impact of extending waiting times from 2 minutes (control group) to 5 minutes (treatment group) on rider cancellations and total driver payouts, considering both commuting and non-commuting hours.

### Rider Cancellations
- **R-squared**: 75.1%, indicating that 75.1% of the variation in rider cancellations is explained by the model.
- **Coefficients**:
  - **wait_time_binary**: Extending the wait time to 5 minutes increases rider cancellations by approximately 18.83 on average, holding other factors constant.
  - **commute_binary**: During commuting hours, rider cancellations increase by approximately 96.94 on average, compared to non-commuting hours.
  - **treatment_commute_interaction**: The interaction term suggests that the combined effect of extending waiting times during commuting hours further increases rider cancellations by approximately 37.47 on average.

### Total Driver Payout
- **R-squared**: 60.7%, indicating that 60.7% of the variation in total driver payouts is explained by the model.
- **Coefficients**:
  - **wait_time_binary**: Extending wait time to 5 minutes decreases driver payouts by approximately 1792.54 on average, holding other factors constant.
  - **commute_binary**: During commuting hours, driver payouts increase by approximately 12160 on average, compared to non-commuting hours.
  - **treatment_commute_interaction**: The interaction term indicates that the combined effect of extending waiting times during commuting hours decreases driver payouts further by approximately 1987.66 on average. However, this effect is not statistically significant at the 0.05 level (p = 0.241).

### Conclusion
The analysis demonstrates that extending waiting times to 5 minutes significantly affects both rider cancellations and driver payouts. Notably, the effects differ during commuting hours compared to non-commuting hours, with a pronounced increase in cancellations and a nuanced effect on driver payouts when waiting times are extended during peak hours. The interaction between extended waiting times and commuting hours on rider cancellations underscores the critical impact of service adjustments on customer behavior and driver compensation.
