# Uber Data Analytics Dashboard

## Dataset Details

The dataset contains detailed ride-sharing data from Uber operations for the year 2024.

It includes the following columns:

<table>
  <thead>
    <tr>
      <th>Column Name</th>
      <th>Column Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Date</td>
      <td>Date of the booking</td>
    </tr>
    <tr>
      <td>Time</td>
      <td>Time of the booking</td>
    </tr>
    <tr>
      <td>Booking ID</td>
      <td>Unique identifier for each ride booking</td>
    </tr>
    <tr>
      <td>Booking Status</td>
      <td>Status of booking (Completed, Cancelled by Customer, Cancelled by Driver, etc.)</td>
    </tr>
    <tr>
      <td>Customer ID</td>
      <td>Unique identifier for customers</td>
    </tr>
    <tr>
      <td>Vehicle Type</td>
      <td>Type of vehicle (Go Mini, Go Sedan, Auto, eBike/Bike, UberXL, Premier Sedan)</td>
    </tr>
    <tr>
      <td>Pickup Location</td>
      <td>Starting location of the ride</td>
    </tr>
    <tr>
      <td>Drop Location</td>
      <td>Destination location of the ride</td>
    </tr>
    <tr>
      <td>Avg VTAT</td>
      <td>Average time for driver to reach pickup location (in minutes)</td>
    </tr>
    <tr>
      <td>Avg CTAT</td>
      <td>Average trip duration from pickup to destination (in minutes)</td>
    </tr>
    <tr>
      <td>Cancelled Rides by Customer</td>
      <td>Customer-initiated cancellation flag</td>
    </tr>
    <tr>
      <td>Reason for cancelling by Customer</td>
      <td>Reason for customer cancellation</td>
    </tr>
    <tr>
      <td>Cancelled Rides by Driver</td>
      <td>Driver-initiated cancellation flag</td>
    </tr>
    <tr>
      <td>Driver Cancellation Reason</td>
      <td>Reason for driver cancellation</td>
    </tr>
    <tr>
      <td>Incomplete Rides</td>
      <td>Incomplete ride flag</td>
    </tr>
    <tr>
      <td>Incomplete Rides Reason</td>
      <td>Reason for incomplete rides</td>
    </tr>
    <tr>
      <td>Booking Value</td>
      <td>Total fare amount for the ride</td>
    </tr>
    <tr>
      <td>Ride Distance</td>
      <td>Distance covered during the ride (in km)</td>
    </tr>
    <tr>
      <td>Driver Ratings</td>
      <td>Rating given to driver (1-5 scale)</td>
    </tr>
    <tr>
      <td>Customer Rating</td>
      <td>Rating given by customer (1-5 scale)</td>
    </tr>
    <tr>
      <td>Payment Method</td>
      <td>Method used for payment (UPI, Cash, Credit Card, Uber Wallet, Debit Card)</td>
    </tr>
  </tbody>
</table>

In [1]:
# Setting up the environment
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime

print("Setup Complete")

warnings.filterwarnings("ignore") # ignore all warnings

Setup Complete


In [2]:
# Load the Dataset
filepath =  "../input/uber-ride-analytics-dashboard/ncr_ride_bookings.csv"
uber_data = pd.read_csv(filepath)
uber_data.head(5)

Unnamed: 0,Date,Time,Booking ID,Booking Status,Customer ID,Vehicle Type,Pickup Location,Drop Location,Avg VTAT,Avg CTAT,...,Reason for cancelling by Customer,Cancelled Rides by Driver,Driver Cancellation Reason,Incomplete Rides,Incomplete Rides Reason,Booking Value,Ride Distance,Driver Ratings,Customer Rating,Payment Method
0,2024-03-23,12:29:38,"""CNR5884300""",No Driver Found,"""CID1982111""",eBike,Palam Vihar,Jhilmil,,,...,,,,,,,,,,
1,2024-11-29,18:01:39,"""CNR1326809""",Incomplete,"""CID4604802""",Go Sedan,Shastri Nagar,Gurgaon Sector 56,4.9,14.0,...,,,,1.0,Vehicle Breakdown,237.0,5.73,,,UPI
2,2024-08-23,08:56:10,"""CNR8494506""",Completed,"""CID9202816""",Auto,Khandsa,Malviya Nagar,13.4,25.8,...,,,,,,627.0,13.58,4.9,4.9,Debit Card
3,2024-10-21,17:17:25,"""CNR8906825""",Completed,"""CID2610914""",Premier Sedan,Central Secretariat,Inderlok,13.1,28.5,...,,,,,,416.0,34.02,4.6,5.0,UPI
4,2024-09-16,22:08:00,"""CNR1950162""",Completed,"""CID9933542""",Bike,Ghitorni Village,Khan Market,5.3,19.6,...,,,,,,737.0,48.21,4.1,4.3,UPI
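
In the preview above, `Booking ID` and `Customer ID` appear wrapped in extra quote characters, which suggests the stored values may include literal double quotes. If that is the case, a minimal sketch for stripping them could look like this. The `toy` frame is a hypothetical stand-in for `uber_data` and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for uber_data; IDs copied from the preview above.
toy = pd.DataFrame({
    "Booking ID": ['"CNR5884300"', '"CNR1326809"'],
    "Customer ID": ['"CID1982111"', '"CID4604802"'],
})

# Remove leading and trailing double quotes from each ID column.
for col in ["Booking ID", "Customer ID"]:
    toy[col] = toy[col].str.strip('"')

print(toy)
```

Cleaning the IDs early keeps later joins or duplicate checks from failing on stray quote characters.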


In [3]:
uber_data.shape

(150000, 21)

In [4]:
uber_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  object 
 3   Booking Status                     150000 non-null  object 
 4   Customer ID                        150000 non-null  object 
 5   Vehicle Type                       150000 non-null  object 
 6   Pickup Location                    150000 non-null  object 
 7   Drop Location                      150000 non-null  object 
 8   Avg VTAT                           139500 non-null  float64
 9   Avg CTAT                           102000 non-null  float64
 10  Cancelled Rides by Customer        10500 non-null   float64
 11  Reason for cancelling by Customer  1050

In [5]:
uber_data.describe()

Unnamed: 0,Avg VTAT,Avg CTAT,Cancelled Rides by Customer,Cancelled Rides by Driver,Incomplete Rides,Booking Value,Ride Distance,Driver Ratings,Customer Rating
count,139500.0,102000.0,10500.0,27000.0,9000.0,102000.0,102000.0,93000.0,93000.0
mean,8.456352,29.149636,1.0,1.0,1.0,508.295912,24.637012,4.230992,4.404584
std,3.773564,8.902577,0.0,0.0,0.0,395.805774,14.002138,0.436871,0.437819
min,2.0,10.0,1.0,1.0,1.0,50.0,1.0,3.0,3.0
25%,5.3,21.6,1.0,1.0,1.0,234.0,12.46,4.1,4.2
50%,8.3,28.8,1.0,1.0,1.0,414.0,23.72,4.3,4.5
75%,11.3,36.8,1.0,1.0,1.0,689.0,36.82,4.6,4.8
max,20.0,45.0,1.0,1.0,1.0,4277.0,50.0,5.0,5.0


### Handling NaN values

In [6]:
uber_data.isnull().sum() # Verifying the NaN values per column

Date                                      0
Time                                      0
Booking ID                                0
Booking Status                            0
Customer ID                               0
Vehicle Type                              0
Pickup Location                           0
Drop Location                             0
Avg VTAT                              10500
Avg CTAT                              48000
Cancelled Rides by Customer          139500
Reason for cancelling by Customer    139500
Cancelled Rides by Driver            123000
Driver Cancellation Reason           123000
Incomplete Rides                     141000
Incomplete Rides Reason              141000
Booking Value                         48000
Ride Distance                         48000
Driver Ratings                        57000
Customer Rating                       57000
Payment Method                        48000
dtype: int64
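
Several columns share the same missing count (for example 48000 for `Avg CTAT`, `Booking Value`, `Ride Distance` and `Payment Method`), which suggests the gaps may follow the booking status rather than being random. The sketch below shows one way to check this, on a small hypothetical `toy` frame with illustrative values only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for uber_data; values are illustrative only.
toy = pd.DataFrame({
    "Booking Status": ["Completed", "No Driver Found", "Cancelled by Driver", "Completed"],
    "Avg VTAT": [5.3, np.nan, 7.1, 13.4],
    "Booking Value": [737.0, np.nan, np.nan, 627.0],
})

# Share of missing values per column, in percent.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Number of missing Booking Value entries for each booking status.
missing_by_status = toy["Booking Value"].isnull().groupby(toy["Booking Status"]).sum()
print(missing_by_status)
```

If the missing values cluster under non-completed statuses in the real data, they are structural and can be treated differently from accidental gaps.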

In [7]:
# Label missing categorical values; fillna returns a new Series, so assign it back (no inplace=True)
uber_data['Reason for cancelling by Customer'] = uber_data['Reason for cancelling by Customer'].fillna('Unknown')
uber_data['Driver Cancellation Reason'] = uber_data['Driver Cancellation Reason'].fillna('Unknown')
uber_data['Incomplete Rides Reason'] = uber_data['Incomplete Rides Reason'].fillna('Unknown')
uber_data['Payment Method'] = uber_data['Payment Method'].fillna('Unknown')
# Fill the remaining (numeric) NaN values with 0
uber_data.fillna(0, inplace=True)

In [8]:
uber_data.isnull().sum() # Verifying the NaN values per column

Date                                 0
Time                                 0
Booking ID                           0
Booking Status                       0
Customer ID                          0
Vehicle Type                         0
Pickup Location                      0
Drop Location                        0
Avg VTAT                             0
Avg CTAT                             0
Cancelled Rides by Customer          0
Reason for cancelling by Customer    0
Cancelled Rides by Driver            0
Driver Cancellation Reason           0
Incomplete Rides                     0
Incomplete Rides Reason              0
Booking Value                        0
Ride Distance                        0
Driver Ratings                       0
Customer Rating                      0
Payment Method                       0
dtype: int64
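
Filling numeric gaps with 0 removes every NaN, but it also changes summary statistics, because the placeholder zeros are then counted as real observations. The sketch below illustrates the effect on a hypothetical `toy` frame. Leaving numeric NaN values in place is shown as one possible alternative, not as the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for uber_data; values are illustrative only.
toy = pd.DataFrame({
    "Avg VTAT": [4.0, np.nan, 8.0, np.nan],
    "Payment Method": ["UPI", None, "Cash", None],
})

# Fill only the text column; the dict maps column name -> fill value.
labelled = toy.fillna({"Payment Method": "Unknown"})

# Mean over observed values only (NaN is skipped by default).
mean_observed = labelled["Avg VTAT"].mean()

# Mean after zero-filling: the two placeholder zeros pull it down.
mean_zero_filled = labelled["Avg VTAT"].fillna(0).mean()

print(mean_observed, mean_zero_filled)
```

When averages such as `Avg VTAT` or the rating columns are reported later, it may be worth restricting them to rows where the value was actually recorded.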

### Converting columns to appropriate data types

In [9]:
# Identifier and categorical text columns
uber_data['Booking ID'] = uber_data['Booking ID'].astype('string')
uber_data['Booking Status'] = uber_data['Booking Status'].astype('string')
uber_data['Customer ID'] = uber_data['Customer ID'].astype('string')
uber_data['Vehicle Type'] = uber_data['Vehicle Type'].astype('string')
uber_data['Pickup Location'] = uber_data['Pickup Location'].astype('string')
uber_data['Drop Location'] = uber_data['Drop Location'].astype('string')
# Cancellation and incomplete-ride flags (int) with their reasons (string)
uber_data['Cancelled Rides by Customer'] = uber_data['Cancelled Rides by Customer'].astype('int')
uber_data['Reason for cancelling by Customer'] = uber_data['Reason for cancelling by Customer'].astype('string')
uber_data['Cancelled Rides by Driver'] = uber_data['Cancelled Rides by Driver'].astype('int')
uber_data['Driver Cancellation Reason'] = uber_data['Driver Cancellation Reason'].astype('string')
uber_data['Incomplete Rides'] = uber_data['Incomplete Rides'].astype('int')
uber_data['Incomplete Rides Reason'] = uber_data['Incomplete Rides Reason'].astype('string')
# Fare and payment columns
uber_data['Booking Value'] = uber_data['Booking Value'].astype('int')
uber_data['Payment Method'] = uber_data['Payment Method'].astype('string')
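
The `Date` and `Time` columns are not converted by the cell above and remain `object`. A minimal sketch of one way to parse them into a single timestamp follows. The column names `Booking Datetime`, `Hour` and `Weekday` are hypothetical additions, and the format string assumes the `YYYY-MM-DD` and `HH:MM:SS` layout seen in the preview rows.

```python
import pandas as pd

# Hypothetical stand-in for uber_data; dates and times copied from the preview rows.
toy = pd.DataFrame({
    "Date": ["2024-03-23", "2024-11-29"],
    "Time": ["12:29:38", "18:01:39"],
})

# Combine the two text columns and parse them as one timestamp.
toy["Booking Datetime"] = pd.to_datetime(
    toy["Date"] + " " + toy["Time"], format="%Y-%m-%d %H:%M:%S"
)

# Derive features that are often useful for demand analysis.
toy["Hour"] = toy["Booking Datetime"].dt.hour
toy["Weekday"] = toy["Booking Datetime"].dt.day_name()

print(toy.dtypes)
```

A proper datetime column makes it straightforward to group bookings by hour, weekday or month later on.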

In [10]:
uber_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150000 entries, 0 to 149999
Data columns (total 21 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Date                               150000 non-null  object 
 1   Time                               150000 non-null  object 
 2   Booking ID                         150000 non-null  string 
 3   Booking Status                     150000 non-null  string 
 4   Customer ID                        150000 non-null  string 
 5   Vehicle Type                       150000 non-null  string 
 6   Pickup Location                    150000 non-null  string 
 7   Drop Location                      150000 non-null  string 
 8   Avg VTAT                           150000 non-null  float64
 9   Avg CTAT                           150000 non-null  float64
 10  Cancelled Rides by Customer        150000 non-null  int64  
 11  Reason for cancelling by Customer  1500

Link to the Dataset: [Uber Data Analytics Dashboard](https://www.kaggle.com/datasets/yashdevladdha/uber-ride-analytics-dashboard)