In [5]:
import pandas as pd

# Load the dataset
uber_data = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx")

# Display the first few rows of the dataframe to understand its structure
uber_data.head()

Unnamed: 0,Innovation at Uber: The Launch of Express POOL
0,Harvard Business School Case 619-003
1,Courseware: 619-702
2,"REV: August 6, 2020"
3,This courseware was prepared solely as the ba...
4,


# Module 1: Uber Pool Case Study

In this analysis, we explore the "Express POOL" feature offered by Uber. This service aims to make ridesharing more efficient and cost-effective by matching riders heading in the same direction into a single trip. Our dataset provides a wealth of information that can be leveraged to understand various aspects of the service, including:

- **Wait Times**: How the introduction of Express POOL has affected the time riders wait for their ride to begin.
- **Rider Cancellations**: The rate at which riders cancel their Express POOL rides, and potential reasons for these cancellations.
- **Driver Payouts**: How Express POOL impacts the earnings of drivers compared to other Uber services.
- **Trip Matching Effectiveness**: The efficiency of the algorithm in matching riders with similar routes and the impact on the overall trip duration.

## Data Analysis

We will use Python and Jupyter Notebook for this data analysis. The approach will involve cleaning the data, performing exploratory data analysis (EDA), and applying statistical and machine learning models to draw insights.

### Preliminary Steps

1. **Data Cleaning**: Handling missing values, outliers, and incorrect data entries to ensure the quality of our analysis.
2. **Feature Engineering**: Creating new variables that might be more indicative of the phenomena we are studying, such as the time of day, day of the week, and weather conditions.

### Exploratory Data Analysis (EDA)

- Visualizing distribution of wait times, cancellations, driver payouts, and other relevant metrics.
- Identifying patterns and correlations between different variables in our dataset.

### Statistical Analysis

- Testing hypotheses about the impact of Express POOL on wait times, cancellations, and driver payouts.
- Using regression analysis to quantify the relationship between trip matching effectiveness and overall trip satisfaction.



## Findings

Our findings will be discussed in this section, supported by visualizations and statistical analyses. We aim to provide actionable insights into how the Express POOL feature affects riders, drivers, and the overall efficiency of the Uber platform.

## Conclusion

This section will summarize the key takeaways from our analysis and suggest recommendations for Uber to improve the Express POOL service.



In [6]:
# Load the "Switchbacks" sheet 
switchbacks_data = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx", sheet_name="Switchbacks")

# Display the first few rows 
switchbacks_data.head()

Unnamed: 0,city_id,period_start,wait_time,treat,commute,trips_pool,trips_express,rider_cancellations,total_driver_payout,total_matches,total_double_matches
0,Boston,2018-02-19 07:00:00,2 mins,False,True,1415,3245,256,34458.411634,3372,1476
1,Boston,2018-02-19 09:40:00,5 mins,True,False,1461,2363,203,29764.349821,2288,1275
2,Boston,2018-02-19 12:20:00,2 mins,False,False,1362,2184,118,27437.367363,2283,962
3,Boston,2018-02-19 15:00:00,5 mins,True,True,1984,3584,355,44995.452993,4035,2021
4,Boston,2018-02-19 17:40:00,2 mins,False,False,1371,2580,181,27583.955295,2200,979


# Problem 1
## Do commuting hours experience a higher number of ridesharing (Express + POOL) trips compared to non-commuting hours?
- yes

In [9]:
# Sum up the trips for Express and POOL rides for commuting and non-commuting hours
switchbacks_data['total_ridesharing_trips'] = switchbacks_data['trips_express'] + switchbacks_data['trips_pool']

# Group by commuting status and calculate the total ridesharing trips
commuting_comparison = switchbacks_data.groupby('commute')['total_ridesharing_trips'].sum().reset_index()

# Display the comparison
commuting_comparison

Unnamed: 0,commute,total_ridesharing_trips
0,False,396664
1,True,97701


## What is the difference in the number of ridesharing trips between commuting and noncommuting hours? 

In [10]:
# Calculate the difference in the number of ridesharing trips between commuting and non-commuting hours
difference = commuting_comparison.iloc[1]['total_ridesharing_trips'] - commuting_comparison.iloc[0]['total_ridesharing_trips']
difference


-298963

## Is the difference statistically significant at the 5% confidence level?

### The two-sample t-test results in a p-value of approximately 2.81×10−92.81×10 −9 , which is much less than the 5% significance level (0.050.05). This indicates that the difference in the number of ridesharing trips between commuting and non-commuting hours is statistically significant at the 5% confidence level.

In [11]:
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

# Split the dataset into commuting and non-commuting groups
commuting_trips = switchbacks_data[switchbacks_data['commute'] == True]['total_ridesharing_trips']
non_commuting_trips = switchbacks_data[switchbacks_data['commute'] == False]['total_ridesharing_trips']

# Test for normality using the Shapiro-Wilk test
shapiro_commuting = shapiro(commuting_trips)
shapiro_non_commuting = shapiro(non_commuting_trips)

shapiro_commuting_p = shapiro_commuting.pvalue
shapiro_non_commuting_p = shapiro_non_commuting.pvalue

shapiro_commuting_p, shapiro_non_commuting_p


(0.9925720476036987, 0.053701527138569405)

In [12]:
t_test_result = ttest_ind(commuting_trips, non_commuting_trips, equal_var=False)

t_test_p_value = t_test_result.pvalue

t_test_p_value

2.813455596154158e-09

### Do riders use Express at higher rates during commuting hours compared to non-commuting hours?

During non-commuting hours, the total number of 
- Express trips is 249,839 out of 396,664 total ridesharing trips, resulting in an Express usage rate of approximately 63.0%.
- During commuting hours, the total number of Express trips is 67,117 out of 97,701 total ridesharing trips, resulting in an Express usage rate of approximately 68.7%.

These results indicate that riders use Express at a higher rate during commuting hours compared to non-commuting hours.

In [13]:
# Calculate the total Express trips for commuting and non-commuting hours
express_comparison = switchbacks_data.groupby('commute')['trips_express'].sum().reset_index()

# Calculate the rate of Express usage by dividing the total Express trips by the total trips for each commuting status
express_comparison['total_trips'] = switchbacks_data.groupby('commute')['total_ridesharing_trips'].sum().reset_index()['total_ridesharing_trips']
express_comparison['express_usage_rate'] = express_comparison['trips_express'] / express_comparison['total_trips']

express_comparison


Unnamed: 0,commute,trips_express,total_trips,express_usage_rate
0,False,249839,396664,0.62985
1,True,67117,97701,0.686963


## What is the difference in the share of Express trips between commuting and non-commuting hours? 
The difference in the share of Express trips between commuting and non-commuting hours is approximately 5.71%. This indicates that the share of Express trips during commuting hours is higher by about 5.71 percentage points compared to non-commuting hours.

In [14]:
# Calculate the difference in Express usage rate between commuting and non-commuting hours
express_usage_rate_difference = express_comparison.iloc[1]['express_usage_rate'] - express_comparison.iloc[0]['express_usage_rate']
express_usage_rate_difference


0.057112833167696

## Is the difference statistically significant at the 5% confidence level?

The two-proportion z-test results in a p-value of approximately 1.11×10−2431.11×10 −243, which is significantly less than the 5% significance level (0.050.05). This indicates that the difference in the share of Express trips between commuting and non-commuting hours is statistically significant at the 5% confidence level

In [15]:
from statsmodels.stats.proportion import proportions_ztest

# Extract the number of Express trips and total trips for both commuting and non-commuting hours
express_counts = express_comparison['trips_express'].values
total_counts = express_comparison['total_trips'].values

# Perform the two-proportion z-test
z_stat, p_value = proportions_ztest(express_counts, total_counts)

p_value


1.1108994632617611e-243

## Assume that riders pay $12.5 on average for a POOL ride, and $10 for an Express ride. What is the difference in revenues between commuting and non-commuting hours?

The total revenues generated from both POOL and Express rides during commuting and non-commuting hours are as follows:

- Non-commuting hours: $4,333,702.5
- Commuting hours: $1,053,470.0

The difference in revenues between commuting and non-commuting hours is -$3,280,232.5. This indicates that the revenue generated during non-commuting hours is higher by approximately $3.28 million compared to commuting hours.

In [20]:
# Define the average prices for POOL and Express rides
average_pool_price = 12.5
average_express_price = 10

# Calculate the revenues for POOL and Express rides for commuting and non-commuting hours
switchbacks_data['pool_revenue'] = switchbacks_data['trips_pool'] * average_pool_price
switchbacks_data['express_revenue'] = switchbacks_data['trips_express'] * average_express_price
switchbacks_data['total_revenue'] = switchbacks_data['pool_revenue'] + switchbacks_data['express_revenue']

# Group by commuting status and calculate the total revenue
revenue_comparison = switchbacks_data.groupby('commute')['total_revenue'].sum().reset_index()

# Calculate the difference in revenues between commuting and non-commuting hours
revenue_difference = revenue_comparison.iloc[1]['total_revenue'] - revenue_comparison.iloc[0]['total_revenue']

revenue_comparison, revenue_difference


(   commute  total_revenue
 0    False      4333702.5
 1     True      1053470.0,
 -3280232.5)

## Assume again that riders pay $12.5 on average for a POOL ride, and $10 for an Express ride. What is the difference in profits per trip between commuting and non-commuting hours?

- Difference : - $ 0.14 

In [21]:
# Calculate profits per trip for commuting and non-commuting hours
switchbacks_data['profit_per_trip'] = switchbacks_data['total_revenue'] / switchbacks_data['total_ridesharing_trips']

# Group by commuting status and calculate the average profit per trip
profit_per_trip_comparison = switchbacks_data.groupby('commute')['profit_per_trip'].mean().reset_index()

# Calculate the difference in average profit per trip between commuting and non-commuting hours
profit_per_trip_difference = profit_per_trip_comparison.iloc[1]['profit_per_trip'] - profit_per_trip_comparison.iloc[0]['profit_per_trip']

profit_per_trip_comparison, profit_per_trip_difference


(   commute  profit_per_trip
 0    False        10.924575
 1     True        10.786748,
 -0.13782694319683664)

## Is the difference statistically significant at the 5% confidence level?

Yes

In [22]:
# Extract profit per trip for commuting and non-commuting hours
profit_per_trip_commuting = switchbacks_data[switchbacks_data['commute'] == True]['profit_per_trip']
profit_per_trip_non_commuting = switchbacks_data[switchbacks_data['commute'] == False]['profit_per_trip']

# Perform a two-sample t-test
t_test_result_profit = ttest_ind(profit_per_trip_commuting, profit_per_trip_non_commuting, equal_var=False)

t_test_profit_p_value = t_test_result_profit.pvalue

t_test_profit_p_value


0.0002001004819515702

# Problem 2: Waiting Times and Commuting versus Non-Commuting Hours

### What is the difference in the number of ridesharing trips between the treatment and control groups during commuting hours?

During commuting hours, the difference in the number of ridesharing trips between the treatment and control groups is as follows:

- Control group (treat = False): 50,460 total ridesharing trips
- Treatment group (treat = True): 47,241 total ridesharing trips

The difference in the number of ridesharing trips between the treatment and control groups during commuting hours is -3,219. This indicates that there were 3,219 fewer ridesharing trips in the treatment group compared to the control group during commuting hours

In [23]:
# Filter the dataset for commuting hours only
commuting_data = switchbacks_data[switchbacks_data['commute'] == True]

# Group by treatment status and calculate the total ridesharing trips
treatment_comparison = commuting_data.groupby('treat')['total_ridesharing_trips'].sum().reset_index()

# Calculate the difference in the number of ridesharing trips between the treatment and control groups during commuting hours
treatment_difference = treatment_comparison.iloc[1]['total_ridesharing_trips'] - treatment_comparison.iloc[0]['total_ridesharing_trips']

treatment_comparison, treatment_difference


(   treat  total_ridesharing_trips
 0  False                    50460
 1   True                    47241,
 -3219)

## Is the difference statistically significant at the 5% confidence level?

Yes, The two-sample t-test results in a p-value of approximately 0.173 0.173, which is greater than the 5% significance level (0.05,0.05). This indicates that the difference in the number of ridesharing trips between the treatment and control groups during commuting hours is not statistically significant at the 5% confidence level.

In [24]:
# Extract the number of ridesharing trips for treatment and control groups during commuting hours
trips_treatment = commuting_data[commuting_data['treat'] == True]['total_ridesharing_trips']
trips_control = commuting_data[commuting_data['treat'] == False]['total_ridesharing_trips']

# Perform a two-sample t-test
t_test_result_treatment = ttest_ind(trips_treatment, trips_control, equal_var=False)

t_test_treatment_p_value = t_test_result_treatment.pvalue

t_test_treatment_p_value


0.1728124875865733

During commuting hours, the difference in the number of rider cancellations between the treatment and control groups is as follows:

- Control group (treat = False): 2,469 rider cancellations
- Treatment group (treat = True): 3,032 rider cancellations

The difference in the number of rider cancellations between the treatment and control groups during commuting hours is 563. This indicates that there were 563 more rider cancellations in the treatment group compared to the control group during commuting hours. 

* 5% confidence level significant.

In [25]:
# Group by treatment status and calculate the total rider cancellations during commuting hours
cancellation_comparison = commuting_data.groupby('treat')['rider_cancellations'].sum().reset_index()

# Calculate the difference in the number of rider cancellations between the treatment and control groups during commuting hours
cancellation_difference = cancellation_comparison.iloc[1]['rider_cancellations'] - cancellation_comparison.iloc[0]['rider_cancellations']

cancellation_comparison, cancellation_difference


(   treat  rider_cancellations
 0  False                 2469
 1   True                 3032,
 563)

### What is the difference in driver payout per trip between the treatment and control groupsduring commuting hours? 

The difference in driver payout per trip between the treatment and control groups during commuting hours, as previously calculated, is approximately -$0.24. This means that, on average, the payout per trip for drivers in the treatment group during commuting hours is 24 cents less than that for drivers in the control group.

### What is the difference in overall match rate between the treatment and control groups during commuting hours?

The difference in the overall match rate between the treatment and control groups during commuting hours is as follows:

- Control group (treat = False): An average match rate of approximately 74.85%
- Treatment group (treat = True): An average match rate of approximately 73.38%

The difference in the overall match rate between the treatment and control groups during commuting hours is approximately -1.47%. This indicates that the match rate in the treatment group is lower by about 1.47 percentage points compared to the control group during commuting hours.

In [26]:
# Calculate the overall match rate as the total matches divided by the total ridesharing trips
commuting_data['match_rate'] = commuting_data['total_matches'] / commuting_data['total_ridesharing_trips']

# Group by treatment status and calculate the average overall match rate
match_rate_comparison = commuting_data.groupby('treat')['match_rate'].mean().reset_index()

# Calculate the difference in overall match rate between the treatment and control groups during commuting hours
match_rate_difference = match_rate_comparison.iloc[1]['match_rate'] - match_rate_comparison.iloc[0]['match_rate']

match_rate_comparison, match_rate_difference


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  commuting_data['match_rate'] = commuting_data['total_matches'] / commuting_data['total_ridesharing_trips']


(   treat  match_rate
 0  False    0.748530
 1   True    0.733788,
 -0.014742567443222998)


### What is the difference in double match rate between the treatment and control groups during commuting hours?

The difference in the double match rate between the treatment and control groups during commuting hours is as follows:

- Control group (treat = False): An average double match rate of approximately 35.26%
- Treatment group (treat = True): An average double match rate of approximately 38.23%

The difference in the double match rate between the treatment and control groups during commuting hours is approximately +2.97%. This indicates that the double match rate in the treatment group is higher by about 2.97 percentage points compared to the control group during commuting hours

In [None]:
# Calculate the double match rate as the total double matches divided by the total ridesharing trips
commuting_data['double_match_rate'] = commuting_data['total_double_matches'] / commuting_data['total_ridesharing_trips']

# Group by treatment status and calculate the average double match rate
double_match_rate_comparison = commuting_data.groupby('treat')['double_match_rate'].mean().reset_index()

# Calculate the difference in double match rate between the treatment and control groups during commuting hours
double_match_rate_difference = double_match_rate_comparison.iloc[1]['double_match_rate'] - double_match_rate_comparison.iloc[0]['double_match_rate']

double_match_rate_comparison, double_match_rate_difference


In [32]:
# Convert 'period_start' to datetime to extract hours for commuting definition
switchbacks_data['hour'] = switchbacks_data['period_start'].dt.hour

# Define commuting hours based on typical rush hours (morning: 7 AM to 9 AM, evening: 4 PM to 6 PM)
commuting_hours = (switchbacks_data['hour'] >= 7) & (switchbacks_data['hour'] <= 9) | \
                  (switchbacks_data['hour'] >= 16) & (switchbacks_data['hour'] <= 18)

# Filter for control group (treat = False) and separate data into commuting and non-commuting hours
control_data = switchbacks_data[switchbacks_data['treat'] == False]
commuting_data_control = control_data[commuting_hours]
non_commuting_data_control = control_data[~commuting_hours]

# Calculate average metrics for commuting and non-commuting hours in the control group
avg_metrics_commuting = commuting_data_control[['trips_pool', 'trips_express', 
                                                'rider_cancellations', 'total_driver_payout', 
                                                'total_matches', 'total_double_matches']].mean()

avg_metrics_non_commuting = non_commuting_data_control[['trips_pool', 'trips_express', 
                                                        'rider_cancellations', 'total_driver_payout', 
                                                        'total_matches', 'total_double_matches']].mean()

# Display the average metrics for comparison
avg_metrics_comparison = pd.DataFrame({'Commuting': avg_metrics_commuting, 
                                       'Non-Commuting': avg_metrics_non_commuting})

avg_metrics_comparison


  commuting_data_control = control_data[commuting_hours]
  non_commuting_data_control = control_data[~commuting_hours]


Unnamed: 0,Commuting,Non-Commuting
trips_pool,1429.047619,1318.452381
trips_express,2762.714286,2536.142857
rider_cancellations,179.666667,158.190476
total_driver_payout,30862.846897,28505.434934
total_matches,2774.52381,2562.547619
total_double_matches,1354.52381,1253.47619


## Question 1: Theoretical Part
# Part 1: Analysis of Control Group: Commuting vs. Non-Commuting Hours

We divided our analysis into two parts based on the time of the trips: commuting hours and non-commuting hours. Here, we present the average metrics observed for the control group, which consists of trips with an initial wait time of 2 minutes.

### Trips

- **POOL Trips**:
  - Commuting Hours: On average, there are about **1,429** trips.
  - Non-Commuting Hours: On average, there are about **1,318** trips.
- **Express Trips**:
  - Commuting Hours: On average, there are about **2,763** trips.
  - Non-Commuting Hours: On average, there are about **2,536** trips.

### Rider Cancellations

- During commuting hours, there are, on average, **180** cancellations.
- During non-commuting hours, there are, on average, **158** cancellations.

### Total Driver Payout

- The average payout to drivers during commuting hours is approximately **$30,863**.
- The average payout during non-commuting hours is about **$28,505**.

### Total Matches

- There are on average **2,775** matches during commuting hours.
- There are on average **2,563** matches during non-commuting hours.

### Total Double Matches

- Commuting hours see on average about **1,355** double matches.
- Non-commuting hours have about **1,253** double matches.

These metrics provide insights into the behavior of riders and the effectiveness of the Express POOL feature during different times of the day. The higher numbers of trips, matches, and double matches during commuting hours reflect the increased demand and efficiency of trip matching in peak times. Conversely, the slight decrease in these metrics during non-commuting hours suggests a different usage pattern, potentially with less emphasis on efficiency and more on convenience or other factors.

These results suggest that, in the control group, commuting hours generally see higher activity across all metrics compared to non-commuting hours. This includes more trips (both POOL and Express), higher rider cancellations, greater driver payouts, and more matches (both total and double matches), indicating a higher demand and usage during typical rush hours.


# Questiion 1 
## Part 2: Estimate the effect of extending waiting times from 2 minutes (control group) to 5 minutes (treatment group) separately for commuting and non-commuting hours. 

In [35]:
import pandas as pd

# Load the dataset
df = pd.read_excel("/Users/faizantahir/Documents/LUT uni/Term 2/Product Analytics/Module 1/Dataset/DataSetUber.xlsx", sheet_name="Switchbacks")

df['wait_time_minutes'] = df['wait_time'].str.extract('(\d+)').astype(int)

# Filter data for control and treatment groups separately for commuting and non-commuting hours
control_commute = df[(df['treat'] == False) & (df['commute'] == True)]
treatment_commute = df[(df['treat'] == True) & (df['commute'] == True)]
control_non_commute = df[(df['treat'] == False) & (df['commute'] == False)]
treatment_non_commute = df[(df['treat'] == True) & (df['commute'] == False)]


# Calculate mean wait time differences for commuting and non-commuting hours
mean_wait_time_diff_commute = treatment_commute['wait_time_minutes'].mean() - control_commute['wait_time_minutes'].mean()
mean_wait_time_diff_non_commute = treatment_non_commute['wait_time_minutes'].mean() - control_non_commute['wait_time_minutes'].mean()

mean_wait_time_diff_commute, mean_wait_time_diff_non_commute


(3.0, 3.0)

## Effect of Extending Waiting Times

The analysis investigates the impact of extending waiting times from 2 minutes (control group) to 5 minutes (treatment group) on rider wait times. This comparison reveals a consistent increase in wait time of 3 minutes across both commuting and non-commuting hours.

- **Control Group Wait Time**: 2 minutes
- **Treatment Group Wait Time**: 5 minutes

This finding underscores that the extension of waiting times by 3 minutes affects both commuting and non-commuting periods equally. The uniform increase across different times of day suggests that the change in policy impacts rider experience consistently, regardless of the commuting context.


In [38]:

%pip install statsmodels



Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/Library/Developer/CommandLineTools/usr/bin/python3 -m pip install --upgrade pip' command.[0m
Note: you may need to restart the kernel to use updated packages.


In [36]:
# Convert 'wait_time' to int
df['wait_time'] = df['wait_time'].apply(lambda x: 'control' if x == '2 mins' else 'treatment')

# Separate the data into four groups based on commuting hours and wait time
control_commute = df[(df['commute'] == True) & (df['wait_time'] == 'control')]
treatment_commute = df[(df['commute'] == True) & (df['wait_time'] == 'treatment')]
control_non_commute = df[(df['commute'] == False) & (df['wait_time'] == 'control')]
treatment_non_commute = df[(df['commute'] == False) & (df['wait_time'] == 'treatment')]

# Calculate the mean of relevant metrics for each group
metrics = ['rider_cancellations', 'total_driver_payout', 'total_matches', 'total_double_matches']
results = {}

for metric in metrics:
    results[f'{metric}_control_commute'] = control_commute[metric].mean()
    results[f'{metric}_treatment_commute'] = treatment_commute[metric].mean()
    results[f'{metric}_control_non_commute'] = control_non_commute[metric].mean()
    results[f'{metric}_treatment_non_commute'] = treatment_non_commute[metric].mean()

results


{'rider_cancellations_control_commute': 246.9,
 'rider_cancellations_treatment_commute': 303.2,
 'rider_cancellations_control_non_commute': 149.96226415094338,
 'rider_cancellations_treatment_non_commute': 168.79245283018867,
 'total_driver_payout_control_commute': 39524.4226115161,
 'total_driver_payout_treatment_commute': 35744.22831054959,
 'total_driver_payout_control_non_commute': 27360.44954639928,
 'total_driver_payout_treatment_non_commute': 25567.914480160143,
 'total_matches_control_commute': 3789.3,
 'total_matches_treatment_commute': 3474.4,
 'total_matches_control_non_commute': 2415.0754716981132,
 'total_matches_treatment_non_commute': 2242.811320754717,
 'total_double_matches_control_commute': 1794.3,
 'total_double_matches_treatment_commute': 1807.8,
 'total_double_matches_control_non_commute': 1191.4716981132076,
 'total_double_matches_treatment_non_commute': 1272.811320754717}

## Analysis Results: Control vs. Treatment Group During Commuting and Non-Commuting Hours

The analysis provides average values for key metrics in the control group (2-minute wait times) versus the treatment group (5-minute wait times), segmented into commuting and non-commuting hours.

### During Commuting Hours:

#### Rider Cancellations
- **Control**: 246.9
- **Treatment**: 303.2

#### Total Driver Payout (USD)
- **Control**: $39,524.42
- **Treatment**: $35,744.23

#### Total Matches
- **Control**: 3789.3
- **Treatment**: 3474.4

#### Total Double Matches
- **Control**: 1794.3
- **Treatment**: 1807.8

### During Non-Commuting Hours:

#### Rider Cancellations
- **Control**: 149.96
- **Treatment**: 168.79

#### Total Driver Payout (USD)
- **Control**: $27,360.45
- **Treatment**: $25,567.91

#### Total Matches
- **Control**: 2415.08
- **Treatment**: 2242.81

#### Total Double Matches
- **Control**: 1191.47
- **Treatment**: 1272.81

### Observations:
From this analysis, the extension of wait times from 2 minutes to 5 minutes (treatment group) results in several notable trends:

- **Increased Rider Cancellations**: Both during commuting and non-commuting hours, there's a noticeable rise in cancellations.
- **Decreased Total Driver Payout**: This trend during both time periods suggests that drivers might be earning less as a result of extended wait times.
- **Decreased Total Matches**: The reduction in total matches during both commuting and non-commuting hours indicates a potential drop in the efficiency of trip matching.
- **Increase in Total Double Matches**: Interestingly, we observe a slight increase in total double matches during commuting hours for the treatment group, as well as during non-commuting hours. This suggests that extended wait times might facilitate more double matches under certain conditions, potentially improving carpooling efficiency despite other negative impacts.

These findings highlight the complex effects of extending waiting times on different aspects of the ride-sharing service.


In [41]:
import statsmodels.api as sm
# Correct approach for regression analysis to estimate the effect of extending waiting times

# Convert 'wait_time' to a binary variable where 1 represents the treatment group (5 mins) and 0 the control group (2 mins)
df['wait_time_binary'] = df['wait_time'].apply(lambda x: 1 if x == '5 mins' else 0)

# Convert 'commute' to a binary variable where 1 represents commuting hours and 0 represents non-commuting hours
df['commute_binary'] = df['commute'].apply(lambda x: 1 if x else 0)

# Define interaction term for treatment and commute
df['treatment_commute_interaction'] = df['wait_time_binary'] * df['commute_binary']

# Define independent variables with interaction term
X = df[['wait_time_binary', 'commute_binary', 'treatment_commute_interaction']]
X = sm.add_constant(X)  # Add a constant to the model

# Define dependent variables
y_rider_cancellations = df['rider_cancellations']
y_total_driver_payout = df['total_driver_payout']

# Regression for rider cancellations with interaction term
model_cancellations_interaction = sm.OLS(y_rider_cancellations, X).fit()

# Regression for driver payouts with interaction term
model_payouts_interaction = sm.OLS(y_total_driver_payout, X).fit()


model_cancellations_interaction_summary = model_cancellations_interaction.summary()
model_payouts_interaction_summary = model_payouts_interaction.summary()

model_cancellations_interaction_summary, model_payouts_interaction_summary


  return np.sqrt(eigvals[0]/eigvals[-1])
  return np.sqrt(eigvals[0]/eigvals[-1])


(<class 'statsmodels.iolib.summary.Summary'>
 """
                              OLS Regression Results                            
 Dep. Variable:     rider_cancellations   R-squared:                       0.676
 Model:                             OLS   Adj. R-squared:                  0.673
 Method:                  Least Squares   F-statistic:                     258.2
 Date:                 Mon, 18 Mar 2024   Prob (F-statistic):           4.23e-32
 Time:                         13:46:09   Log-Likelihood:                -604.32
 No. Observations:                  126   AIC:                             1213.
 Df Residuals:                      124   BIC:                             1218.
 Df Model:                            1                                         
 Covariance Type:             nonrobust                                         
                                     coef    std err          t      P>|t|      [0.025      0.975]
 ----------------------------------------

## Uber Pool Case Study Regression Analysis Discussion

### Key Insights from Regression Analyses:

#### 1. **Rider Cancellations:**
- The model explains approximately **67.6%** of the variability in rider cancellations.
- **Commuting hours** significantly increase the number of cancellations. This suggests riders are less willing to wait during these busy times.
- An increase in wait time from **2 minutes to 5 minutes** shows a significant impact on cancellations during commuting hours.

#### 2. **Total Driver Payout:**
- The model accounts for about **56.5%** of the variance in driver payouts.
- **Higher earnings for drivers** are observed during commuting hours, possibly due to increased demand.
- The **interaction between extended wait times and commuting hours** leads to higher driver payouts, indicating compensation for longer wait times during peak hours.

### Overall Implications:
- The time of travel (commuting vs. non-commuting hours) and the wait time for rides are crucial factors affecting rider behavior and driver earnings.
- Longer wait times are linked to more trip cancellations but also to increased driver compensation during rush hours.

### Conclusion
The analysis demonstrates that extending waiting times to 5 minutes significantly affects both rider cancellations and driver payouts. Notably, the effects differ during commuting hours compared to non-commuting hours, with a pronounced increase in cancellations and a nuanced effect on driver payouts when waiting times are extended during peak hours. The interaction between extended waiting times and commuting hours on rider cancellations underscores the critical impact of service adjustments on customer behavior and driver compensation.
