# Exploratory Data Analysis (EDA) – Ola Booking Dataset


Ola operates as a ride-hailing platform where operational efficiency,
customer satisfaction, and revenue optimization are critical.
Understanding booking patterns, cancellations, payment behavior,
and ride performance metrics helps the business improve service quality,
reduce losses, and enhance user experience.


### The primary objectives of this EDA are:

1. To analyze booking completion, cancellation, and incomplete ride patterns.
2. To identify major reasons behind ride cancellations and incomplete trips.
3. To study customer payment behavior and revenue contribution.
4. To evaluate operational performance using Turn Around Time (TAT) metrics.
5. To understand ride distance distribution and rating trends.
6. To uncover business insights that can help reduce cancellations
   and improve overall platform efficiency.

In [1]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv(
    "outputs/ola_booking_cleaned.csv",
    # Nullable integer dtype, because TAT is missing for some bookings
    dtype={"avg_ctat": "Int64", "avg_vtat": "Int64"},
)
df.head()


| | date | time | booking_id | booking_status | customer_id | vehicle_type | pickup_location | drop_location | avg_vtat | avg_ctat | cancelled_rides_by_customer | cancelled_rides_by_driver | incomplete_rides | incomplete_rides_reason | booking_value | payment_method | ride_distance | driver_ratings | customer_rating | payment_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-01-29 | 02:18:00 | MUM0000001 | Incomplete | CUST73037 | Prime SUV | Ghatkopar West | Thane East | 10.0 | 30.0 | NO | NO | Yes | Driver unable to locate pickup | 360.22 | CASH | 0.0 | 0.0 | 0.0 | 180.11 |
| 1 | 2025-01-12 | 13:33:00 | MUM0000002 | Cancelled | CUST37992 | eBike | Malad East | Thane West | | | NO | Vehicle or personal issue | No | | 570.47 | Not Charged | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2025-01-23 | 16:05:00 | MUM0000003 | Incomplete | CUST68431 | eBike | Bandra West | Colaba | 10.0 | 30.0 | NO | NO | Yes | Payment issue | 147.72 | CASH | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2025-01-16 | 20:35:00 | MUM0000004 | Completed | CUST55906 | eBike | Mulund West | Goregaon East | 16.0 | 10.0 | NO | NO | No | | 863.22 | UPI | 15.76 | 3.5 | 4.2 | 863.22 |
| 4 | 2025-01-16 | 21:42:00 | MUM0000005 | Incomplete | CUST29698 | Mini | Mulund East | Cuffe Parade | 10.0 | 30.0 | NO | NO | Yes | Payment issue | 749.76 | UPI | 0.0 | 0.0 | 0.0 | 0.0 |


#### (a). Dataset Overview (Understanding)

In [8]:
df.shape
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 20 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   date                         100000 non-null  object 
 1   time                         100000 non-null  object 
 2   booking_id                   100000 non-null  object 
 3   booking_status               100000 non-null  object 
 4   customer_id                  100000 non-null  object 
 5   vehicle_type                 100000 non-null  object 
 6   pickup_location              100000 non-null  object 
 7   drop_location                100000 non-null  object 
 8   avg_vtat                     81045 non-null   Int64  
 9   avg_ctat                     81045 non-null   Int64  
 10  cancelled_rides_by_customer  100000 non-null  object 
 11  cancelled_rides_by_driver    100000 non-null  object 
 12  incomplete_rides             100000 non-null  object 
 13  

| | avg_vtat | avg_ctat | booking_value | ride_distance | driver_ratings | customer_rating | payment_value |
|---|---|---|---|---|---|---|---|
| count | 81045.0 | 81045.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 9.95964 | 29.980961 | 495.722178 | 12.690796 | 2.636901 | 2.638221 | 355.174845 |
| std | 4.281621 | 10.364694 | 233.745145 | 13.29329 | 2.090636 | 2.091766 | 290.789996 |
| min | 2.0 | 10.0 | 90.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 7.0 | 23.0 | 294.0175 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 10.0 | 30.0 | 495.46 | 8.6 | 3.8 | 3.8 | 331.78 |
| 75% | 13.0 | 37.0 | 697.6025 | 24.22 | 4.4 | 4.4 | 605.9 |
| max | 18.0 | 50.0 | 899.99 | 40.0 | 5.0 | 5.0 | 899.99 |


#### (b). Booking Status Analysis

In [10]:
df['booking_status'].value_counts()


booking_status
Completed     62045
Incomplete    19000
Cancelled     18955
Name: count, dtype: int64

In [11]:
df['booking_status'].value_counts(normalize=True) * 100


booking_status
Completed     62.045
Incomplete    19.000
Cancelled     18.955
Name: proportion, dtype: float64

### Insight:

#### Booking Status Analysis
- **Completed** bookings form the majority: 62,045 of 100,000 (about 62.0%).
- **Incomplete** (19,000; 19.0%) and **Cancelled** (18,955; about 19.0%) bookings together account for roughly 38% of all bookings, which may impact revenue and customer satisfaction.
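
The counts and percentage shares above can also be read side by side in one table. The following is a minimal sketch that rebuilds the reported counts as a Series; the names `status_counts` and `status_summary` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
status_counts = pd.Series(
    {"Completed": 62045, "Incomplete": 19000, "Cancelled": 18955},
    name="count",
)

# Absolute counts and percentage shares in a single table.
status_summary = pd.DataFrame(
    {
        "count": status_counts,
        "share_pct": (status_counts / status_counts.sum() * 100).round(3),
    }
)

# Share of bookings that did not complete (Incomplete + Cancelled).
not_completed_pct = status_summary.loc[["Incomplete", "Cancelled"], "share_pct"].sum()

print(status_summary)
print(f"Not completed: {not_completed_pct:.3f}%")
```

On the full dataset, `status_counts` would presumably come straight from `df['booking_status'].value_counts()`.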

#### (c). Cancellation Analysis

In [16]:
# Cancelled rides count
cancelled_df = df[df['booking_status'] == 'Cancelled']
cancelled_df.shape

(18955, 20)

In [17]:
# 'NO' marks the party that did not cancel, so neither column contains NaN here
cancelled_df[['cancelled_rides_by_customer',
              'cancelled_rides_by_driver']].notna().sum()

cancelled_rides_by_customer    18955
cancelled_rides_by_driver      18955
dtype: int64

cancelled_df['cancelled_rides_by_driver'].value_counts()


In [23]:
cancelled_df['cancelled_rides_by_customer'].value_counts()


cancelled_rides_by_customer
NO                                  9440
Change of travel plan               1645
Emergency or personal reason        1631
Fare too high                       1582
Found alternative transport         1569
Booked by mistake                   1565
Driver taking too long to arrive    1523
Name: count, dtype: int64

### Insight:

#### Cancellation Reasons Analysis (Driver vs Customer)


##### Customer-Initiated Cancellations

Customer-driven cancellations are primarily influenced by
personal decisions and service expectations.

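Key customer cancellation reasons, with their counts among the 18,955 cancelled bookings:
- Change of travel plan (1,645)
- Emergency or personal reasons (1,631)
- Fare perceived as too high (1,582)
- Availability of alternative transport (1,569)
- Booking made by mistake (1,565)
- Driver taking too long to arrive (1,523)

Together these cover 9,515 cancelled bookings; the remaining 9,440 are marked `NO` in the customer column.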

These reasons indicate that a large portion of customer cancellations
occur before ride execution, often due to pricing sensitivity,
waiting time, or sudden changes in travel requirements.

##### Driver-Initiated Cancellations

Driver-driven cancellations are mainly associated with
operational and economic constraints.

Major driver cancellation reasons include:
- Low fare or short trip
- Vehicle or personal issues
- Traffic or route-related problems
- Customer not reachable
- Pickup location being too far

These factors suggest that drivers are more likely to cancel rides
when trips are perceived as unprofitable, logistically difficult,
or when communication with the customer fails.
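
To quantify the split between the two sides, each cancelled booking can be labelled by which reason column is filled in. This is a minimal sketch, assuming (as the data above suggests) that `NO` is the sentinel for "this party did not cancel". The helper `cancellation_initiator` and the sample rows are hypothetical and used only for illustration.

```python
import pandas as pd


def cancellation_initiator(cancelled: pd.DataFrame) -> pd.Series:
    """Label each cancelled booking by which reason column holds a reason."""
    by_customer = cancelled["cancelled_rides_by_customer"].ne("NO")
    by_driver = cancelled["cancelled_rides_by_driver"].ne("NO")

    initiator = pd.Series("Unknown", index=cancelled.index, name="initiator")
    initiator[by_customer & ~by_driver] = "Customer"
    initiator[by_driver & ~by_customer] = "Driver"
    initiator[by_customer & by_driver] = "Both"
    return initiator


# Made-up sample rows; the reason labels are taken from the lists above.
sample = pd.DataFrame(
    {
        "cancelled_rides_by_customer": ["NO", "Fare too high", "Booked by mistake", "NO"],
        "cancelled_rides_by_driver": [
            "Vehicle or personal issue",
            "NO",
            "NO",
            "Customer not reachable",
        ],
    }
)

initiator_counts = cancellation_initiator(sample).value_counts()
print(initiator_counts)
```

Applied to `cancelled_df`, the same helper would also surface any rows labelled `Both` or `Unknown`, which would be worth inspecting before comparing the two sides.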

#### (d). Incomplete Ride Analysis

In [26]:

incomplete_df = df[df['booking_status'] == 'Incomplete']
incomplete_df.shape


(19000, 20)

In [27]:
# Reasons for incomplete rides
incomplete_reasons = incomplete_df['incomplete_rides_reason'].value_counts()
incomplete_reasons

incomplete_rides_reason
Customer no-show at pickup              3192
Payment issue                           3186
Driver unable to locate pickup          3176
Ride stopped midway due to emergency    3163
Vehicle breakdown during trip           3145
Route blocked or severe traffic         3138
Name: count, dtype: int64
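
The six reasons are close to evenly distributed, each accounting for roughly 16.5% to 16.8% of incomplete rides. A short sketch of that calculation, using the counts reported above (the variable names are illustrative):

```python
import pandas as pd

# Counts copied from the value_counts() output above.
incomplete_reason_counts = pd.Series(
    {
        "Customer no-show at pickup": 3192,
        "Payment issue": 3186,
        "Driver unable to locate pickup": 3176,
        "Ride stopped midway due to emergency": 3163,
        "Vehicle breakdown during trip": 3145,
        "Route blocked or severe traffic": 3138,
    }
)

# Percentage share of each reason among all incomplete rides.
reason_share_pct = (
    incomplete_reason_counts / incomplete_reason_counts.sum() * 100
).round(2)

# Gap between the most and least common reason, in percentage points.
share_spread = reason_share_pct.max() - reason_share_pct.min()

print(reason_share_pct)
print(f"Spread: {share_spread:.2f} percentage points")
```

Because no single reason stands out, reducing incomplete rides would likely need action on several fronts rather than one targeted fix.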

#### (e). Payment Method Analysis

In [28]:
df['payment_method'].value_counts()


payment_method
UPI            40549
CASH           40496
Not Charged    18955
Name: count, dtype: int64

#### Payment Method Analysis


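- UPI (40,549) and CASH (40,496) are used almost equally and together cover all 81,045 charged bookings.
- The Not Charged count (18,955) equals the number of cancelled bookings, so Not Charged is associated with cancellations.
- These counts alone do not show whether UPI or cash realizes more revenue; that requires comparing `payment_value` with `booking_value` for each method.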
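
Two checks would put the payment observations on firmer ground: a cross-tabulation of payment method against booking status, and the share of booking value actually collected per method. The sketch below runs both on the five rows shown by `df.head()` earlier. Five rows are far too few to draw conclusions, so treat the result as a demonstration of the calculation only.

```python
import pandas as pd

# The five rows shown by df.head(), reduced to the relevant columns.
head_rows = pd.DataFrame(
    {
        "booking_status": ["Incomplete", "Cancelled", "Incomplete", "Completed", "Incomplete"],
        "payment_method": ["CASH", "Not Charged", "CASH", "UPI", "UPI"],
        "booking_value": [360.22, 570.47, 147.72, 863.22, 749.76],
        "payment_value": [180.11, 0.0, 0.0, 863.22, 0.0],
    }
)

# Which payment methods occur under which booking status?
method_by_status = pd.crosstab(head_rows["payment_method"], head_rows["booking_status"])

# Share of booking value actually collected, per payment method.
totals = head_rows.groupby("payment_method")[["booking_value", "payment_value"]].sum()
collection_rate = totals["payment_value"] / totals["booking_value"]

print(method_by_status)
print(collection_rate)
```

Running the same two steps on the full `df` would show whether Not Charged appears only under Cancelled and whether the collection rate differs between UPI and cash.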

#### (f). Revenue & Booking Value Analysis

In [30]:
df['booking_value'].describe()


count    100000.000000
mean        495.722178
std         233.745145
min          90.000000
25%         294.017500
50%         495.460000
75%         697.602500
max         899.990000
Name: booking_value, dtype: float64

In [31]:
df.groupby('booking_status')['booking_value'].mean()


booking_status
Cancelled     493.370859
Completed     496.217815
Incomplete    496.449413
Name: booking_value, dtype: float64

In [32]:
df.groupby('booking_status')['payment_value'].sum()


booking_status
Cancelled            0.00
Completed     30787834.31
Incomplete     4729650.14
Name: payment_value, dtype: float64

### Insight:

#### Revenue Analysis

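- Mean booking value is nearly identical across statuses (about 493 to 496), so revenue differences come from how much is collected, not from ride price.
- Completed rides account for 30.79M of the 35.52M collected in total (about 86.7%), which corresponds to the full booking value per ride.
- Cancelled rides contribute no revenue.
- Incomplete rides generate partial revenue: 4.73M (about 13.3% of the total), roughly half of their booking value on average.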
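
The revenue picture can be cross-checked with plain arithmetic on the figures already reported. A small sketch, using only the totals, counts, and means printed above (the dictionary names are illustrative):

```python
# Figures copied from the outputs above.
payment_total = {"Cancelled": 0.00, "Completed": 30787834.31, "Incomplete": 4729650.14}
booking_count = {"Cancelled": 18955, "Completed": 62045, "Incomplete": 19000}
mean_booking_value = {"Cancelled": 493.370859, "Completed": 496.217815, "Incomplete": 496.449413}

total_collected = sum(payment_total.values())

revenue_summary = {}
for status, collected in payment_total.items():
    avg_collected = collected / booking_count[status]
    revenue_summary[status] = {
        # Share of all collected revenue coming from this status.
        "revenue_share": collected / total_collected,
        # Average amount collected per booking.
        "avg_collected": avg_collected,
        # Fraction of the average booking value that was collected.
        "collection_rate": avg_collected / mean_booking_value[status],
    }

for status, row in revenue_summary.items():
    print(
        f"{status:<10} share={row['revenue_share']:.1%} "
        f"avg={row['avg_collected']:.2f} rate={row['collection_rate']:.1%}"
    )
```

The collection rate separates "how much was billed" from "how much was received", which the mean booking values alone cannot show.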

#### (g). Ratings Analysis

In [36]:
# Driver rating
df['driver_ratings'].describe()


count    100000.000000
mean          2.636901
std           2.090636
min           0.000000
25%           0.000000
50%           3.800000
75%           4.400000
max           5.000000
Name: driver_ratings, dtype: float64

In [37]:
# Customer rating
df['customer_rating'].describe()


count    100000.000000
mean          2.638221
std           2.091766
min           0.000000
25%           0.000000
50%           3.800000
75%           4.400000
max           5.000000
Name: customer_rating, dtype: float64

### Insight:

#### Ratings Analysis

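- The overall means (about 2.64 for both driver and customer ratings) are pulled down by zero ratings; the first quartile is 0 in both columns.
- Zero ratings are associated with non-completed rides, so they act as placeholders rather than real scores.
- Even with the zeros included, the median is 3.8 and the upper quartile is 4.4, which points to satisfactory service quality on rides that were actually rated.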
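
Since zeros distort the summary statistics, ratings are easier to interpret with the zeros masked out. The sketch below assumes that a rating of 0 means "no rating given" and demonstrates the masking on the five rows shown by `df.head()` earlier.

```python
import numpy as np
import pandas as pd

rating_cols = ["driver_ratings", "customer_rating"]

# Rating columns of the five rows shown by df.head().
head_rows = pd.DataFrame(
    {
        "booking_status": ["Incomplete", "Cancelled", "Incomplete", "Completed", "Incomplete"],
        "driver_ratings": [0.0, 0.0, 0.0, 3.5, 0.0],
        "customer_rating": [0.0, 0.0, 0.0, 4.2, 0.0],
    }
)

# Assumption: 0 is a placeholder, not a real score, so treat it as missing.
rated_only = head_rows[rating_cols].replace(0, np.nan)

mean_with_zeros = head_rows[rating_cols].mean()
mean_rated_only = rated_only.mean()  # NaN values are skipped by default

print(mean_with_zeros)
print(mean_rated_only)
```

On the full dataset, `df[rating_cols].replace(0, np.nan).describe()` would give quartiles for rated rides only.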

#### (h). TAT (avg_vtat & avg_ctat) Analysis

In [39]:
df[['avg_vtat', 'avg_ctat']].describe()


| | avg_vtat | avg_ctat |
|---|---|---|
| count | 81045.0 | 81045.0 |
| mean | 9.95964 | 29.980961 |
| std | 4.281621 | 10.364694 |
| min | 2.0 | 10.0 |
| 25% | 7.0 | 23.0 |
| 50% | 10.0 | 30.0 |
| 75% | 13.0 | 37.0 |
| max | 18.0 | 50.0 |


In [40]:
df.groupby('booking_status')[['avg_vtat','avg_ctat']].mean()


| booking_status | avg_vtat | avg_ctat |
|---|---|---|
| Cancelled | | |
| Completed | 9.94728 | 29.975131 |
| Incomplete | 10.0 | 30.0 |


#### Turn Around Time (TAT) Analysis

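- TAT values are present for 81,045 bookings, which matches the completed (62,045) plus incomplete (19,000) rides.
- Cancelled rides correctly have missing TAT values.
- Incomplete rides average exactly 10.0 (`avg_vtat`) and 30.0 (`avg_ctat`), which equal the overall medians and are consistent with imputed rather than measured values.
- For completed rides, the average trip completion time (about 30.0) is roughly three times the average vehicle arrival time (about 9.9).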
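
One way to tell measured TAT values from filled-in ones is to count the distinct values per booking status: a group whose values were all filled with a single constant has exactly one distinct value. A minimal sketch on the five rows shown by `df.head()` earlier:

```python
import pandas as pd

tat_cols = ["avg_vtat", "avg_ctat"]

# TAT columns of the five rows shown by df.head(), as nullable integers.
head_rows = pd.DataFrame(
    {
        "booking_status": ["Incomplete", "Cancelled", "Incomplete", "Completed", "Incomplete"],
        "avg_vtat": pd.array([10, None, 10, 16, 10], dtype="Int64"),
        "avg_ctat": pd.array([30, None, 30, 10, 30], dtype="Int64"),
    }
)

# Number of distinct non-missing TAT values per booking status.
tat_distinct = head_rows.groupby("booking_status")[tat_cols].nunique()

print(tat_distinct)
```

This sample contains only one completed ride, so it cannot show the contrast on its own. On the full dataset, completed rides would be expected to show many distinct values while a constant-filled group shows one.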

#### (i). Missing Values Validation

In [41]:
df.isna().sum()


date                               0
time                               0
booking_id                         0
booking_status                     0
customer_id                        0
vehicle_type                       0
pickup_location                    0
drop_location                      0
avg_vtat                       18955
avg_ctat                       18955
cancelled_rides_by_customer        0
cancelled_rides_by_driver          0
incomplete_rides                   0
incomplete_rides_reason        81000
booking_value                      0
payment_method                     0
ride_distance                      0
driver_ratings                     0
customer_rating                    0
payment_value                      0
dtype: int64

### Insight:

#### Missing Values Check

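- `avg_vtat` and `avg_ctat` are missing for 18,955 rows, matching the number of cancelled bookings.
- `incomplete_rides_reason` is missing for 81,000 rows, matching the bookings that are not incomplete
  (100,000 minus 19,000).
- No unexpected data quality issues were found.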
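
Matching counts are a good sign, but they do not prove that the missing values sit on the expected rows. The following sketch checks this row by row; the helper `missingness_matches_status` is hypothetical, and it is demonstrated on the five rows shown by `df.head()` earlier.

```python
import pandas as pd


def missingness_matches_status(df: pd.DataFrame) -> dict:
    """Check that missing values line up with booking status row by row."""
    is_cancelled = df["booking_status"].eq("Cancelled")
    is_incomplete = df["booking_status"].eq("Incomplete")
    return {
        # TAT should be missing exactly on cancelled bookings.
        "tat_missing_only_when_cancelled": bool(
            df["avg_vtat"].isna().eq(is_cancelled).all()
            and df["avg_ctat"].isna().eq(is_cancelled).all()
        ),
        # A reason should be present exactly on incomplete bookings.
        "reason_present_only_when_incomplete": bool(
            df["incomplete_rides_reason"].notna().eq(is_incomplete).all()
        ),
    }


# Relevant columns of the five rows shown by df.head().
head_rows = pd.DataFrame(
    {
        "booking_status": ["Incomplete", "Cancelled", "Incomplete", "Completed", "Incomplete"],
        "avg_vtat": [10.0, None, 10.0, 16.0, 10.0],
        "avg_ctat": [30.0, None, 30.0, 10.0, 30.0],
        "incomplete_rides_reason": [
            "Driver unable to locate pickup",
            None,
            "Payment issue",
            None,
            "Payment issue",
        ],
    }
)

checks = missingness_matches_status(head_rows)
print(checks)
```

If either check returned `False` on the full `df`, the offending rows could be isolated by inspecting the rows where the two boolean masks disagree.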

### EDA Summary

- About 62% of bookings are completed successfully.
- Cancellations are split almost evenly between customers and drivers: 9,515 of the 18,955
  cancelled bookings (about 50.2%) carry a customer reason. Cancellations generally occur
  before ride execution, resulting in no revenue collection for such bookings.
- Cash and UPI are the only payment methods recorded for charged bookings and are used
  in nearly equal numbers.
- Revenue is mainly concentrated in completed rides, where 100% of the booking value
  is collected.
- For incomplete rides caused by driver-related issues, only 50% of the booking
  value is charged as per business rules, leading to partial revenue realization.
- Incomplete rides therefore contribute less revenue than completed rides,
  while still impacting operational costs and service quality.
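
The partial-charge rule can be checked by comparing `payment_value` with `booking_value` for each incomplete-ride reason. The sketch below uses the three incomplete rides shown by `df.head()` earlier. Three rows only illustrate the calculation and say nothing about how consistently the rule is applied across the dataset.

```python
import pandas as pd

# The three incomplete rides among the rows shown by df.head().
incomplete_head = pd.DataFrame(
    {
        "incomplete_rides_reason": [
            "Driver unable to locate pickup",
            "Payment issue",
            "Payment issue",
        ],
        "booking_value": [360.22, 147.72, 749.76],
        "payment_value": [180.11, 0.0, 0.0],
    }
)

# Fraction of the booking value that was charged, per reason.
totals = incomplete_head.groupby("incomplete_rides_reason")[
    ["booking_value", "payment_value"]
].sum()
charged_fraction = totals["payment_value"] / totals["booking_value"]

print(charged_fraction)
```

Applying the same grouping to `incomplete_df` would show which reasons are charged at 50%, which at 0%, and whether any reason is charged inconsistently.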


#### TAT (Turn Around Time) Handling

- For completed rides, `avg_vtat` (vehicle arrival time) and `avg_ctat`
  (trip completion time) are recorded accurately.
- In case of incomplete rides, TAT values are often missing due to early ride termination.
- To maintain analytical consistency, missing `avg_vtat` and `avg_ctat` values
  for incomplete rides have been imputed using the **median TAT values of completed rides**.
- Median imputation was chosen instead of mean to reduce the impact of extreme
  outliers and skewed ride durations.
- This approach keeps TAT-based analysis comparable across completed and incomplete
  ride categories. Because the imputed values are constants, the TAT of incomplete rides
  shows little or no variation and should not be read as measured performance.
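
The imputation described above can be sketched as follows. The cleaning code itself is not part of this notebook, so this is an assumption about how such a step might look, shown on made-up rows. Float columns are used here because a median can be fractional, which would not fit a nullable integer column without rounding.

```python
import numpy as np
import pandas as pd

tat_cols = ["avg_vtat", "avg_ctat"]

# Made-up rows: three completed rides, one incomplete and one cancelled ride without TAT.
rides = pd.DataFrame(
    {
        "booking_status": ["Completed", "Completed", "Completed", "Incomplete", "Cancelled"],
        "avg_vtat": [8.0, 10.0, 16.0, np.nan, np.nan],
        "avg_ctat": [20.0, 30.0, 40.0, np.nan, np.nan],
    }
)

# Median TAT of completed rides, one value per column.
completed_median = rides.loc[rides["booking_status"].eq("Completed"), tat_cols].median()

# Fill missing TAT only for incomplete rides; cancelled rides stay missing.
is_incomplete = rides["booking_status"].eq("Incomplete")
rides.loc[is_incomplete, tat_cols] = rides.loc[is_incomplete, tat_cols].fillna(completed_median)

print(rides)
```

Restricting the fill to incomplete rides keeps the missing TAT on cancelled bookings intact, which matches the missing-value pattern reported in the validation section.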


Overall, the analysis highlights that improving ride completion rates,
reducing driver-side failures, and minimizing customer-driven cancellations
can significantly enhance revenue efficiency and overall customer satisfaction.
