<a href="https://colab.research.google.com/github/ferjuaco03/UberDataAnalysis/blob/main/Notebook_Uber.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd

# Import the dataset from GitHub
url = "https://raw.githubusercontent.com/ferjuaco03/UberDataAnalysis/refs/heads/main/ncr_ride_bookings.csv"
df = pd.read_csv(url)

# First validation of the dataset
print(df.head(10))
print(df.columns)
print(df.shape)
df.info()  # info() prints its own report and returns None, so no print() is needed
print(df.describe())
print(df.isnull().sum())

         Date      Time    Booking ID   Booking Status   Customer ID  \
0  2024-03-23  12:29:38  "CNR5884300"  No Driver Found  "CID1982111"   
1  2024-11-29  18:01:39  "CNR1326809"       Incomplete  "CID4604802"   
2  2024-08-23  08:56:10  "CNR8494506"        Completed  "CID9202816"   
3  2024-10-21  17:17:25  "CNR8906825"        Completed  "CID2610914"   
4  2024-09-16  22:08:00  "CNR1950162"        Completed  "CID9933542"   
5  2024-02-06  09:44:56  "CNR4096693"        Completed  "CID4670564"   
6  2024-06-17  15:45:58  "CNR2002539"        Completed  "CID6800553"   
7  2024-03-19  17:37:37  "CNR6568000"        Completed  "CID8610436"   
8  2024-09-14  12:49:09  "CNR4510807"  No Driver Found  "CID7873618"   
9  2024-12-16  19:06:48  "CNR7721892"       Incomplete  "CID5214275"   

    Vehicle Type      Pickup Location      Drop Location  Avg VTAT  Avg CTAT  \
0          eBike          Palam Vihar            Jhilmil       NaN       NaN   
1       Go Sedan        Shastri Nagar  Gurgaon 

## **Data Schema**

| Column Name                        | Description                                                        | Dtype   | Non-Null Count |
|------------------------------------|--------------------------------------------------------------------|---------|---------------:|
| Date                               | Date of the booking                                               | object  | 150000 |
| Time                               | Time of the booking                                               | object  | 150000 |
| Booking ID                         | Unique identifier for each ride booking                            | object  | 150000 |
| Booking Status                     | Status of booking (Completed, Cancelled by Customer, etc.)         | object  | 150000 |
| Customer ID                        | Unique identifier for customers                                    | object  | 150000 |
| Vehicle Type                       | Type of vehicle (Go Mini, Go Sedan, Auto, etc.)                    | object  | 150000 |
| Pickup Location                    | Starting location of the ride                                      | object  | 150000 |
| Drop Location                      | Destination location of the ride                                   | object  | 150000 |
| Avg VTAT                           | Average time for driver to reach pickup (minutes)                  | float64 | 139500 |
| Avg CTAT                           | Average trip duration (minutes)                                    | float64 | 102000 |
| Cancelled Rides by Customer        | Customer-initiated cancellation flag                               | float64 | 10500  |
| Reason for cancelling by Customer  | Reason for customer cancellation                                   | object  | 10500  |
| Cancelled Rides by Driver          | Driver-initiated cancellation flag                                 | float64 | 27000  |
| Driver Cancellation Reason         | Reason for driver cancellation                                     | object  | 27000  |
| Incomplete Rides                   | Incomplete ride flag                                              | float64 | 9000   |
| Incomplete Rides Reason            | Reason for incomplete rides                                        | object  | 9000   |
| Booking Value                      | Total fare amount for the ride                                    | float64 | 102000 |
| Ride Distance                      | Distance covered during the ride (km)                              | float64 | 102000 |
| Driver Ratings                     | Rating given to driver (1–5 scale)                                 | float64 | 93000  |
| Customer Rating                    | Rating given by customer (1–5 scale)                               | float64 | 93000  |
| Payment Method                     | Method used for payment (UPI, Cash, Credit Card, etc.)             | object  | 102000 |
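
The `Dtype` and `Non-Null Count` columns of a table like the one above can be derived directly from a DataFrame. The following is a minimal sketch on a tiny illustrative frame; `sample` and its values are assumptions made up for the example, not rows from the real dataset.

```python
import pandas as pd

# Tiny illustrative frame (NOT the real dataset) reusing three schema column names.
sample = pd.DataFrame({
    "Booking ID": ["CNR1", "CNR2", "CNR3"],
    "Avg VTAT": [float("nan"), 4.9, 13.4],
    "Booking Value": [float("nan"), 237.0, 627.0],
})

# One row per column: its dtype (as text) and how many non-null values it holds.
schema = pd.DataFrame({
    "Dtype": sample.dtypes.astype(str),
    "Non-Null Count": sample.count(),
})
print(schema)
```

Running the same two expressions on `df` instead of `sample` should give the dtype and non-null figures for the full dataset.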



## **Identify Duplicate Rows**

In [3]:
# Identify duplicate rows
duplicate_rows = df.duplicated().sum()

print(f"Number of duplicate rows: {duplicate_rows}")

Number of duplicate rows: 0
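
`df.duplicated()` only flags rows that are identical in every column. Because the schema describes `Booking ID` as a unique identifier, a complementary check is to look for repeated IDs on their own. The sketch below uses a small made-up frame (`sample` is an assumption for illustration); whether the real dataset contains repeated IDs is not established here.

```python
import pandas as pd

# Illustrative data (NOT the real dataset): "CNR2" appears twice with different statuses.
sample = pd.DataFrame({
    "Booking ID": ["CNR1", "CNR2", "CNR2", "CNR3"],
    "Booking Status": ["Completed", "Completed", "Incomplete", "Completed"],
})

# Rows identical across all columns (what df.duplicated() counts).
full_row_duplicates = sample.duplicated().sum()

# Rows whose Booking ID already appeared earlier, regardless of the other columns.
id_duplicates = sample.duplicated(subset=["Booking ID"]).sum()

# keep=False marks every occurrence of a repeated ID, which is handy for inspection.
repeated = sample[sample.duplicated(subset=["Booking ID"], keep=False)]

print(f"Full-row duplicates: {full_row_duplicates}")
print(f"Repeated Booking IDs: {id_duplicates}")
print(repeated)
```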


## **Identify Null Values**

In [7]:
# Create a DataFrame with TotalValues, NullValues, and PercentageNull
null_summary = pd.DataFrame({
    'TotalValues': df.count(),
    'NullValues': df.isnull().sum(),
    'PercentageNull': (df.isnull().sum() / len(df)) * 100
})

print("Summary of Null Values per Column:")
display(null_summary)


Summary of Null Values per Column:


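| Column          | TotalValues | NullValues | PercentageNull |
|-----------------|------------:|-----------:|---------------:|
| Date            | 150000      | 0          | 0.0            |
| Time            | 150000      | 0          | 0.0            |
| Booking ID      | 150000      | 0          | 0.0            |
| Booking Status  | 150000      | 0          | 0.0            |
| Customer ID     | 150000      | 0          | 0.0            |
| Vehicle Type    | 150000      | 0          | 0.0            |
| Pickup Location | 150000      | 0          | 0.0            |
| Drop Location   | 150000      | 0          | 0.0            |
| Avg VTAT        | 139500      | 10500      | 7.0            |
| Avg CTAT        | 102000      | 48000      | 32.0           |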


## **Validate Unique Values for the Booking Status Column**

In [8]:
# Validate unique values for 'Booking Status'
booking_status_counts = df['Booking Status'].value_counts()

print("Count of values by type in 'Booking Status':")
print(booking_status_counts)

Count of values by type in 'Booking Status':
Booking Status
Completed                93000
Cancelled by Driver      27000
No Driver Found          10500
Cancelled by Customer    10500
Incomplete                9000
Name: count, dtype: int64
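
The raw counts are easier to compare as shares of all bookings. The sketch below rebuilds the counts from the output above as a standalone Series, so it runs without downloading the dataset. The name `share_pct` is an illustrative assumption.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
booking_status_counts = pd.Series({
    "Completed": 93000,
    "Cancelled by Driver": 27000,
    "No Driver Found": 10500,
    "Cancelled by Customer": 10500,
    "Incomplete": 9000,
})

# Share of each status as a percentage of all bookings.
share_pct = (booking_status_counts / booking_status_counts.sum() * 100).round(1)
print(share_pct)
```

On the full DataFrame, `df['Booking Status'].value_counts(normalize=True)` gives the same proportions as fractions, without the intermediate Series.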


## **Validate Unique Values for Cancelled Rides by Customer**

In [13]:
# Validate unique values for 'Cancelled Rides by Customer'
Cancelled_Customer_counts = df['Cancelled Rides by Customer'].value_counts()
Cancelled_Customer_counts_Null = df['Cancelled Rides by Customer'].isnull().sum()

print("Count of values by type in 'Cancelled Rides by Customer':")
print(Cancelled_Customer_counts)
print("Count of null values in 'Cancelled Rides by Customer':")
print(Cancelled_Customer_counts_Null)

Count of values by type in 'Cancelled Rides by Customer':
Cancelled Rides by Customer
1.0    10500
Name: count, dtype: int64
Count of null values in 'Cancelled Rides by Customer':
139500


The null values correspond to rides that were not cancelled by the customer. The column holds no value for those rows, so the nulls are filled with zero (0) to leave no missing values.
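
The counts are consistent with this reading: 10500 non-null flags and 10500 bookings with status `Cancelled by Customer`. Matching totals do not prove that the rows line up, though. A row-level check could look like the sketch below, run here on a small made-up frame (`sample` and `nulls_match_status` are illustrative assumptions, not part of the original notebook).

```python
import pandas as pd

# Illustrative data (NOT the real dataset).
sample = pd.DataFrame({
    "Booking Status": [
        "Completed",
        "Cancelled by Customer",
        "No Driver Found",
        "Cancelled by Customer",
    ],
    "Cancelled Rides by Customer": [float("nan"), 1.0, float("nan"), 1.0],
})

flag_is_null = sample["Cancelled Rides by Customer"].isnull()
cancelled_by_customer = sample["Booking Status"] == "Cancelled by Customer"

# Cross-tabulate status against null-ness to see all four combinations at once.
print(pd.crosstab(cancelled_by_customer, flag_is_null))

# True only if the flag is null exactly on the rows NOT cancelled by the customer.
nulls_match_status = (flag_is_null == ~cancelled_by_customer).all()
print(f"Nulls match status: {nulls_match_status}")
```

If the same check returns `True` on `df`, filling the nulls with 0 loses no information.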

In [15]:
# Fill null values in 'Cancelled Rides by Customer' with 0
df['Cancelled Rides by Customer'] = df['Cancelled Rides by Customer'].fillna(0)

# Validate changes
print(df['Cancelled Rides by Customer'].value_counts())
print("Null values")
print(df['Cancelled Rides by Customer'].isnull().sum())
print(df['Cancelled Rides by Customer'].dtype)

Cancelled Rides by Customer
0.0    139500
1.0     10500
Name: count, dtype: int64
Null values
0
float64
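
The column is still `float64` after the fill, although it now holds only 0 and 1. A possible follow-up, sketched below on made-up data, is to cast the flag to an integer type and to apply the same treatment to the other two flag columns in the schema. Treating `Cancelled Rides by Driver` and `Incomplete Rides` the same way is an assumption; the notebook only fills the customer flag.

```python
import pandas as pd

# Illustrative data (NOT the real dataset): each row sets at most one flag.
sample = pd.DataFrame({
    "Cancelled Rides by Customer": [1.0, float("nan"), float("nan")],
    "Cancelled Rides by Driver": [float("nan"), 1.0, float("nan")],
    "Incomplete Rides": [float("nan"), float("nan"), 1.0],
})

flag_columns = [
    "Cancelled Rides by Customer",
    "Cancelled Rides by Driver",
    "Incomplete Rides",
]

# Fill missing flags with 0, then cast to integers since only 0 and 1 remain.
for column in flag_columns:
    sample[column] = sample[column].fillna(0).astype("int64")

print(sample)
print(sample.dtypes)
```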
