## Step 1: Import Libraries and Load Data
We start by importing pandas and loading the dataset from a CSV file.  
We also define the column (`Booking ID`) that we want to check for duplicates.


In [6]:
import pandas as pd

url = r"https://raw.githubusercontent.com/edwardpy-data/Data-Analysis/490c299a38557a999f01aa0e3046daa0fe608154/ncr_ride_bookings.csv"  # URL of the CSV file
column = "Booking ID"  # column to check for duplicates

df = pd.read_csv(url)  # read the CSV file from the URL
# show the first 5 rows
df.head(5)

| | Date | Time | Booking ID | Booking Status | Customer ID | Vehicle Type | Pickup Location | Drop Location | Avg VTAT | Avg CTAT | ... | Reason for cancelling by Customer | Cancelled Rides by Driver | Driver Cancellation Reason | Incomplete Rides | Incomplete Rides Reason | Booking Value | Ride Distance | Driver Ratings | Customer Rating | Payment Method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-03-23 | 12:29:38 | "CNR5884300" | No Driver Found | "CID1982111" | eBike | Palam Vihar | Jhilmil | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2024-11-29 | 18:01:39 | "CNR1326809" | Incomplete | "CID4604802" | Go Sedan | Shastri Nagar | Gurgaon Sector 56 | 4.9 | 14.0 | ... | NaN | NaN | NaN | 1.0 | Vehicle Breakdown | 237.0 | 5.73 | NaN | NaN | UPI |
| 2 | 2024-08-23 | 08:56:10 | "CNR8494506" | Completed | "CID9202816" | Auto | Khandsa | Malviya Nagar | 13.4 | 25.8 | ... | NaN | NaN | NaN | NaN | NaN | 627.0 | 13.58 | 4.9 | 4.9 | Debit Card |
| 3 | 2024-10-21 | 17:17:25 | "CNR8906825" | Completed | "CID2610914" | Premier Sedan | Central Secretariat | Inderlok | 13.1 | 28.5 | ... | NaN | NaN | NaN | NaN | NaN | 416.0 | 34.02 | 4.6 | 5.0 | UPI |
| 4 | 2024-09-16 | 22:08:00 | "CNR1950162" | Completed | "CID9933542" | Bike | Ghitorni Village | Khan Market | 5.3 | 19.6 | ... | NaN | NaN | NaN | NaN | NaN | 737.0 | 48.21 | 4.1 | 4.3 | UPI |


## Step 2: Validate Column Existence
Before proceeding, we make sure that the column we specified (`Booking ID`) actually exists in the dataset.  
If not, we raise an error.


In [2]:
if column not in df.columns:
    raise KeyError(f"Column '{column}' not found in the dataset")
print("The column is available")

The column is available
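
A failed check is easier to act on when the error says which columns come close to the requested name. The sketch below is one possible variant of the check, not part of the original notebook: `require_column` is a hypothetical helper name, and the small `sample` frame is made-up data standing in for the real dataset. It relies on `difflib.get_close_matches` from the Python standard library to suggest near-miss column names.

```python
import difflib

import pandas as pd


def require_column(frame: pd.DataFrame, name: str) -> None:
    """Raise a KeyError with suggestions if `name` is not a column of `frame`."""
    if name in frame.columns:
        return
    # Suggest up to three column names that look similar to the requested one.
    candidates = [str(c) for c in frame.columns]
    hints = difflib.get_close_matches(name, candidates, n=3)
    raise KeyError(f"Column {name!r} not found. Close matches: {hints}")


# Toy frame standing in for the ride-bookings data (illustrative values only).
sample = pd.DataFrame(
    {"Booking ID": ["A1", "A2"], "Booking Status": ["Completed", "Incomplete"]}
)

require_column(sample, "Booking ID")  # passes silently

try:
    require_column(sample, "Booking Id")  # wrong capitalisation
except KeyError as err:
    message = str(err)
    print(message)
```

Returning `None` on success keeps the helper usable as a one-line guard at the top of a cell.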


## Step 3: Count Total Duplicate Entries
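We count how many rows in the chosen column repeat a value that has already appeared; with the default `keep="first"`, the first occurrence of each value is not counted.  
If none are found, we stop with an error; otherwise, we display the count.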

In [3]:
duplicates_sum = df[column].duplicated().sum()

if duplicates_sum:
    print(f"The column contains the following number of duplicates: {duplicates_sum}")
else:
    raise ValueError(f"No duplicates found in column '{column}'. Exiting.")

The column contains the following number of duplicates: 1233
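
The number printed above counts extra occurrences, which is easy to confuse with the number of rows that share an ID or the number of distinct IDs that repeat. The sketch below shows the three quantities side by side on a made-up Series (the values are illustrative and not taken from the dataset); the variable names are assumptions introduced for this example.

```python
import pandas as pd

# Toy IDs (illustrative only): "A" appears three times, "B" twice, "C" once.
ids = pd.Series(["A", "B", "A", "C", "B", "A"], name="Booking ID")

# Default keep="first": every repeat after the first occurrence is flagged.
extra_occurrences = int(ids.duplicated().sum())

# keep=False: every row whose value appears more than once is flagged.
rows_involved = int(ids.duplicated(keep=False).sum())

# Number of distinct values that are repeated at least once.
repeated_values = int(ids[ids.duplicated()].nunique())

print(extra_occurrences, rows_involved, repeated_values)  # prints: 3 5 2
```

Which of the three to report depends on the question: extra occurrences tell you how many rows a de-duplication would drop, while rows involved tells you how many rows need inspection.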


## Step 4: Get Frequency of Duplicate Values
We count how many times each value in the chosen column is repeated.  
Then, we filter to keep only those values that occur more than once.

In [4]:
duplicates_count = df[column].value_counts()
# keep only values that occur more than once (boolean mask)
duplicates_count = duplicates_count[duplicates_count > 1]
print(duplicates_count.head(5))

Booking ID
"CNR7908610"    3
"CNR9603232"    3
"CNR7199036"    3
"CNR6337479"    3
"CNR5292943"    3
Name: count, dtype: int64
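
The frequencies from `value_counts()` and the total from Step 3 describe the same duplicates, so one can be used to cross-check the other: each repeated value contributes its count minus one to the total. A minimal sketch of that check, again on a made-up Series rather than the real data:

```python
import pandas as pd

# Toy IDs (illustrative only): "A" appears three times, "B" twice, "C" once.
ids = pd.Series(["A", "B", "A", "C", "B", "A"], name="Booking ID")

counts = ids.value_counts()
repeated = counts[counts > 1]  # keep only values seen more than once

# Each repeated value contributes (count - 1) extra occurrences.
extras_from_counts = int((repeated - 1).sum())
extras_from_duplicated = int(ids.duplicated().sum())

assert extras_from_counts == extras_from_duplicated
print(repeated.to_dict())  # prints: {'A': 3, 'B': 2}
```

Applied to the notebook's own variables, the same identity would read `(duplicates_count - 1).sum() == duplicates_sum`.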


## Step 5: Extract Duplicate Rows
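Finally, we extract every row whose value in the chosen column occurs more than once; `keep=False` includes the first occurrence of each value as well as the repeats.  
We also sort the rows by that column so that rows sharing a value sit next to each other.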

In [5]:
duplicates_row = df[df.duplicated(subset=column, keep=False)]
# sort them by column value
duplicates_row = duplicates_row.sort_values(by=column)
# print the first 10 duplicate rows
print(duplicates_row.head(10))

              Date      Time    Booking ID         Booking Status  \
81334   2024-10-15  18:17:23  "CNR1026036"              Completed   
9192    2024-07-21  17:59:41  "CNR1026036"        No Driver Found   
9587    2024-12-17  19:19:02  "CNR1029172"              Completed   
1353    2024-01-19  17:00:57  "CNR1029172"             Incomplete   
82029   2024-11-05  16:50:20  "CNR1051228"              Completed   
110412  2024-01-31  08:42:04  "CNR1051228"              Completed   
87333   2024-04-20  23:16:21  "CNR1056023"              Completed   
120008  2024-07-07  17:18:49  "CNR1056023"             Incomplete   
71570   2024-02-12  19:02:47  "CNR1058956"  Cancelled by Customer   
38548   2024-10-15  13:33:55  "CNR1058956"              Completed   

         Customer ID   Vehicle Type   Pickup Location Drop Location  Avg VTAT  \
81334   "CID6480133"        Go Mini           Khandsa   Ashok Vihar       3.8   
9192    "CID6974869"        Go Mini         Seelampur   Nehru Place       NaN 